Data preparation (part 2)

In the previous post, we went through the pre-preparation phase: collecting metadata, data profiling, and defining data preparation rules. This post is mainly about dealing with missing values, aka nulls.

Before looking for methods to deal with nulls, confirm that you are really missing data. It is possible that blanks in the data set can simply be replaced with a value. Poor-quality data may contain null in place of "no". Say a column contains only "yes" and null values: verify whether the system/application that generates the data simply assigns no value for a negative/false response. In that case, you just replace null with "no" and don't delete the column.
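As a small illustration (a sketch assuming the data is loaded into a pandas DataFrame, with a hypothetical `newsletter_opt_in` column), the fix is a simple fill once the data owner confirms that blank really means "no":

```python
import numpy as np
import pandas as pd

# Hypothetical example: the source system only writes "yes";
# negative answers are left blank and load as NaN.
df = pd.DataFrame({"newsletter_opt_in": ["yes", np.nan, "yes", np.nan]})

# After the data owner confirms that blank means a negative response,
# replace the nulls instead of dropping the column.
df["newsletter_opt_in"] = df["newsletter_opt_in"].fillna("no")
```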

In addition, metadata can help with missing data by flagging out-of-range entries with types such as unknown, unrecorded, or irrelevant. Data can be missing for different reasons, such as

  • Malfunctioning equipment
  • Changes in database design
  • Collation of different datasets
  • Measurement not possible

Missing data types

First, we need to understand the different types of missing data. There are 3 different types:

Missing completely at Random (MCAR)

The name explains itself: the data are missing for purely random reasons (for example, a measurement sensor ran out of battery), unrelated to any other measured variable. It just happened randomly.

Example:

We conducted a survey on a university campus about extracurricular activities; one of the questions was the student's age. The survey was available online, and some volunteers also asked students on campus directly. After collecting the data and starting to prepare it, we found that some students did not mention their age because the question was not mandatory. In this case, the missing age values are missing completely at random.

Missing at Random (MAR)

The missing data are independent of the unobserved values but dependent on the observed values. "What?!" Let's make it simple:

We have a dataset of car characteristics:

| Brand | Model | Nbr of Doors | Nbr of Seats | Airbag |
| --- | --- | --- | --- | --- |
| Audi | A6 | 5 | 5 | |
| Mercedes Benz | E63 | 5 | 5 | Yes |
| Audi | A4 | 5 | 5 | |
| BMW | M3 | 3 | 2 | Yes |
| Renault | Megan | 5 | 5 | No |
| Skoda | Superb | 5 | 5 | Yes |
| Mercedes Benz | S560 | 5 | 5 | Yes |
| Peugeot | 508 | 5 | 5 | No |
| Skoda | Octavia RS | 5 | 5 | Yes |
| Tesla | Model S | 5 | 5 | Yes |
| Audi | A8 | 5 | 5 | |
| Tesla | Model 3 | 5 | 5 | Yes |

We have missing values in the Airbag column. Notice that the missing values depend on the Brand column: if we group the data by brand, we find that all the rows with a missing Airbag value have the brand "Audi".
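One quick way to spot this kind of dependency (a sketch assuming the table above is loaded into a pandas DataFrame called `cars`) is to count the missing values per group:

```python
import pandas as pd

# cars: the table above as a DataFrame with columns "Brand" and "Airbag".
# Count missing Airbag values per Brand to check whether the
# missingness depends on an observed column.
missing_by_brand = (
    cars.assign(airbag_missing=cars["Airbag"].isna())
        .groupby("Brand")["airbag_missing"]
        .sum()
        .sort_values(ascending=False)
)
print(missing_by_brand)  # here, every missing value falls under "Audi"
```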

Missing not at Random (MNAR)

The missing data are dependent not only on the observed data, but also on the unobserved data.

Example:

A survey was conducted about mental disorder treatment worldwide. The results showed that respondents from low/lower-income countries are significantly less likely to report treatment than respondents from high-income countries.

— — — —

Dealing with missing data

How to deal with the missing data? There are 3 different possibilities:

1 – Data Deletion

First, this method should be used only in the MCAR case. There are 2 deletion methods that most data analysts/scientists use:

Drop them (Listwise deletion):

Basically, you remove the entire row if it has at least one missing value. This method is recommended when your data set is large enough that the dropped rows do not affect the analysis. Most labs and companies have a minimum percentage of complete data that is required, and if that threshold cannot be reached, they remove the rows with missing data. Personally, if most (more than 50%) of the values of a variable are null or missing, I "usually" drop the column.
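Here is a minimal pandas sketch of both rules, assuming the data set is already loaded into a DataFrame called `df`:

```python
import pandas as pd

# df: the data set being prepared.

# Drop columns where more than 50% of the values are missing
# (the rule of thumb mentioned above).
df = df.loc[:, df.isna().mean() <= 0.5]

# Listwise deletion: remove every row that still has at least one missing value.
df = df.dropna()
```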

Pairwise deletion:

Data will not be dropped in this case. If the analysis needs all the columns, you select only the rows without any missing values. Meanwhile, if the analysis task needs only some of the variables, and the rows with missing values still have the values required for this task, you include them in the data selected for the task.

Example:

For this example, the CAR data set will be used. *Let's assume it has 50 rows and there is missing data only in rows number 1 and 6.

| # | Brand | Model | Nbr of Doors | Nbr of Seats | Airbag |
| --- | --- | --- | --- | --- | --- |
| 1 | Audi | A6 | 5 | 5 | |
| 2 | Mercedes Benz | E63 | 5 | 5 | Yes |
| 3 | BMW | M3 | 3 | 2 | Yes |
| 4 | Skoda | Superb | 5 | 5 | Yes |
| 5 | Mercedes Benz | S560 | 5 | 5 | Yes |
| 6 | Peugeot | 508 | 5 | 5 | No |
| 7 | Skoda | Octavia RS | 5 | 5 | Yes |
| … | … | … | … | … | … |
| 50 | Tesla | Model 3 | 5 | 5 | Yes |

1st Task: an association rules task to find association hypotheses between the number of seats and the number of doors. The needed attributes are: Brand, Model, Nbr of Seats, and Nbr of Doors. In this case, we can use the whole data set because there are no missing values in the needed attributes.


2nd Task: an association rules task to find association hypotheses between the number of seats and the airbag attribute. The needed attributes are: Brand, Model, Nbr of Seats, and Airbag. To resolve this task, we eliminate rows number 1 and 6 and use the rest.
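In pandas terms, pairwise deletion simply means dropping incomplete rows per task rather than once for the whole data set (a sketch assuming the table is loaded as `cars` with the column names shown above):

```python
import pandas as pd

# Task 1: doors vs. seats - the Airbag column is not needed,
# so the rows with a missing Airbag value can stay in the selection.
task1 = cars[["Brand", "Model", "Nbr of Seats", "Nbr of Doors"]].dropna()

# Task 2: seats vs. airbag - only rows that are complete in the
# columns this task actually uses are kept.
task2 = cars[["Brand", "Model", "Nbr of Seats", "Airbag"]].dropna()
```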

2 – Replace missing values

The second option for dealing with missing values is to replace them. Here it gets a bit complicated because there are different ways to achieve it.

Mean/median substitution

Replace missing values with the mean or median value. We use this method when the variable is numerical and the missing values represent less than 30% of the records.

However, with missing values that are not strictly random, especially in the presence of great inequality in the number of missing values across the different variables, the mean substitution method may lead to inconsistent bias.

Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64(5):402–406. doi:10.4097/kjae.2013.64.5.402
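A minimal sketch with pandas (assuming a hypothetical numerical `price` column):

```python
import pandas as pd

# Share of missing values in the hypothetical "price" column.
missing_share = df["price"].isna().mean()

# Apply the substitution only when less than 30% of the values are missing.
if missing_share < 0.30:
    df["price"] = df["price"].fillna(df["price"].median())  # or df["price"].mean()
```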

Common value imputation

We use the most common value to replace the missing values. For example, the Car dataset used previously has a Color column with 100 records. The column has only 5 distinct values, and the most common one (67 occurrences) is Black, so we replace the missing values with Black. However, this method may also lead to inconsistent bias.
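In pandas this is a one-liner (assuming a `Color` column as in the example):

```python
import pandas as pd

# Most common value of the hypothetical "Color" column, e.g. "Black".
most_common = df["Color"].mode()[0]
df["Color"] = df["Color"].fillna(most_common)
```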

Regression imputation

Regression imputation helps us avoid some of these biases in the analysis. The mean/median method replaces missing values with a single existing value; instead, we predict the missing values using the available data. This way, we gain new values and retain the cases with missing data.
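A rough sketch of the idea with scikit-learn, assuming we predict a hypothetical numerical `price` column from two other fully observed numerical columns (`engine_size` and `year` are made-up names):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

predictors = ["engine_size", "year"]   # hypothetical, fully observed columns
known = df[df["price"].notna()]        # rows where the target is present
unknown = df[df["price"].isna()]       # rows to be imputed

model = LinearRegression()
model.fit(known[predictors], known["price"])

# Fill the gaps with the regression predictions.
df.loc[df["price"].isna(), "price"] = model.predict(unknown[predictors])
```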

Multiple imputation

The multiple imputation “approach begin[s] with a prediction of the missing data using the existing data from other variables [15]. The missing values are then replaced with the predicted values, and a full data set called the imputed data set is created. This process iterates the repeatability and makes multiple imputed data sets (hence the term “multiple imputation”). Each multiple imputed data set produced is then analyzed using the standard statistical analysis procedures for complete data, and gives multiple analysis results. Subsequently, by combining these analysis results, a single overall analysis result is produced.”

Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64(5):402–406. doi:10.4097/kjae.2013.64.5.402

The purpose of multiple imputation is to obtain a statistically valid inference, not to recover the true missing data; there is no way to predict the missing data and get it 100% right. The main advantages of this method are that it reduces bias and is easy to use. Meanwhile, to get a correct imputation model, you need to take into consideration the conditions required by this method and avoid some pitfalls.

In case you want to use the multiple imputation method, I recommend reading the following articles: Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls (BMJ 2009;338:b2393) and When and how should multiple imputation be used for handling missing data in randomised clinical trials – a practical guide with flowcharts (DOI: 10.1186/s12874-017-0442-1).
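If you want to experiment with the idea in Python, scikit-learn's IterativeImputer can produce several imputed data sets when run with `sample_posterior=True` and different seeds. This is only a rough sketch of the mechanics, not a full multiple imputation workflow; pooling the per-data-set analysis results (e.g. with Rubin's rules) is still up to you:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

numeric = df.select_dtypes(include="number")   # impute numerical columns only

# Create several imputed data sets by drawing imputations from the
# posterior with different random seeds.
imputed_sets = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns)
    imputed_sets.append(imputed)

# Each data set in imputed_sets is then analyzed separately and the
# results are combined into one overall result.
```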

3 – Create new field / variable

Missing data has its own usefulness, mainly when it is not MCAR (missing completely at random). Therefore, we create a new variable or field that records the observed behavior or pattern of the missing values. This can also be useful if you own the tool that generates the data: you can create a new engineered feature based on the missing-data pattern.
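A minimal sketch, reusing the car example from above (assuming a DataFrame `df` with an `Airbag` column):

```python
import pandas as pd

# Record the missingness pattern as its own feature before
# imputing or deleting anything.
df["airbag_missing"] = df["Airbag"].isna().astype(int)
```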

— — — —

Further reading

  1. How to Handle Missing Data
  2. The prevention and handling of the missing data

References

  1. ibm.com: Pairwise vs. Listwise deletion: What are they and when should I use them?, Accessed 27/02/2019 (https://www-01.ibm.com/support/docview.wss?uid=swg21475199)
  2. ncbi.nlm.nih.gov: The prevention and handling of the missing data, Accessed 21/04/2019 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100)
  3. measuringu.com: 7 ways to handle missing data, Accessed 15/04/2019 (https://measuringu.com/handle-missing-data)

Data preparation (part 1)

Data preparation is the most time-consuming phase in any data-related cycle, whether you are preparing the data for a machine learning model, data mining, or BI.

I will explain how to prepare the data efficiently by following different steps.

Many people who are starting their career in the data field forget about an important step. They ask for the data and start preparing it straight away.

But before that, you should do some pre-preparation.

Business Understanding (pre-preparation)

First, you need to understand the business logic. Every data analysis task is related to a business task.

Ask for explanations and read the available documentation. In addition, meetings with a business analyst in the organization or with service/product owners may be required. This step saves you a lot of time: if you skip it, you may find out later that some data are missing, that the data structure does not make sense, or run into many other random problems.

Tip: when collecting the data, ask for the data governance department (in case there is one). The people there have useful and priceless information.

*Don’t let them convince you that the data is self-explanatory.

Business understanding does not have a simple method to follow. You just need to figure out how the business works and, most importantly, how the data was generated. After finishing this step, ask for the data needed for the given task.

Now, we can start the data preparation. To do so we need the metadata.

Collect the metadata

Metadata is the data that describes the data. Having the metadata is a must; if it is not accessible, you should create it with the help of the data owner.

Metadata helps with identifying the attributes of the data set, the type of each attribute, and sometimes even the values assigned to a concrete attribute.

Data profiling

Data profiling is important to better understand the data. It “is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data”. Data profiling includes structure discovery, content discovery, and relationship discovery. This step makes it easier to discover and choose the needed data. Also, if similar data is needed in later iterations, you already know how to deal with it, and the whole data preparation process becomes easier.
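A few pandas calls already cover much of the basic profiling (a sketch assuming a hypothetical cars.csv file with a Brand column):

```python
import pandas as pd

df = pd.read_csv("cars.csv")           # hypothetical file

df.info()                              # structure: columns, types, non-null counts
print(df.describe(include="all"))      # content: basic statistics per attribute
print(df["Brand"].value_counts())      # values assigned to a concrete attribute
```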

Define data preparation rules (optional)

This step mainly applies to big data. Data preparation rules are the methods for cleansing and transforming the data.

Why? Cleaning big data is not a simple task and it is time-consuming. Imagine you delete rows using the value of an attribute as a condition, then find out that the condition is missing something, and the size of your data set is 5 TB. It would take you forever to figure out the right condition.

How? We take a random sample from our data set, cleanse it, and transform it. The script used to prepare the random sample is then applied to the whole data set.

The random sample must be valid. I will write a blog post about generating a correct and valid random sample.
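A minimal sketch of the idea, assuming a pandas DataFrame `full_data` and made-up cleansing rules:

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Cleansing and transformation rules worked out on the sample."""
    df = df.drop_duplicates()
    df = df[df["price"] > 0].copy()                      # hypothetical rule
    df["brand"] = df["brand"].str.strip().str.lower()    # hypothetical rule
    return df

# Develop and validate the rules on a random sample first...
sample = full_data.sample(frac=0.01, random_state=42)
cleaned_sample = prepare(sample)

# ...then reuse exactly the same script on the whole data set.
cleaned_full = prepare(full_data)
```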

Iterative preparation

Start with the basic cleansing steps that apply to any dataset. After that, tackle the more challenging steps, such as dealing with missing data. Leave the data transformation to the end.

In part 2, we will see how to deal with missing values and how to get better-quality data.

Boost your productivity: Azure Data Studio

“I am suffering from these tools, they consume a lot of memory and they need a lot of space” or “I am overwhelmed with the features of this tool; somehow I find myself lost and I can’t figure out how to do simple tasks”… Does this sound familiar? Many tools nowadays offer great features, but they require lots of local storage and a powerful device. Well, today is your lucky day if you are dealing with databases. Have you heard about Azure Data Studio? In this post, I will give you some tips to improve your work performance with Azure Data Studio. Here is the structure of this blog post.

Table of contents

  • Introduction
  • Overview
  • UI
  • Export User Settings
  • Change Terminal Shell
  • Subscriptions Filter
  • Connect to multiple Azure accounts
  • Run script from file

Introduction:

Azure Data Studio was first introduced at PASS Summit 2017 (it was then called SQL Operations Studio). It is a cross-platform tool for database design and operations. If you are familiar with VS Code, you will love Azure Data Studio. It is a lightweight tool with the necessary features, and you won’t be overwhelmed with as many features as in SSMS (Microsoft SQL Server Management Studio).

Overview:

Azure Data Studio is a lightweight, cross-platform database management tool for Windows, macOS, and Linux. It is free (no license needed) and it is an open-source project. Azure Data Studio is based on VS Code and its MSSQL extension, and is built on Electron. You can report bugs, request new features, and contribute to the project. Extensions are an amazing feature of VS Code, and Azure Data Studio has them too; you can add extensions, although there are not that many yet (for the moment 😉).

It supports Azure SQL Database, Azure SQL Data Warehouse, and SQL Server, whether running in the cloud or on-premises. T-SQL querying is the main focus, supported by autosuggestions, formatting, and advanced coding features. Meanwhile, it also supports other languages such as JSON, XML, Python, SQL, YAML, Dockerfile… In addition, you can work with workspaces and folders. Source control (Git) is integrated, so managing your files is no problem. This is an amazing feature, especially for those who opt for a CI/CD pipeline using Azure DevOps. Speaking of Azure, Azure Resource Explorer is a panel in Azure Data Studio that lets you connect to your Azure account(s) and work with your different subscriptions. If you work with PowerShell or other shells, you can do so in Azure Data Studio thanks to the integrated terminal, just as in VS Code. And it has a bunch of other features.

The Queen of Vermont and Entity Framework, @Julie Lerman, a Microsoft Regional Director and MVP, wrote two blog posts in MSDN Magazine about Azure Data Studio: Data Points – Visual Studio Code: Create a Database IDE with MSSQL Extension (June 2017) and Data Points – Manage Data Across Multiple Sources with Azure Data Studio (December 2018). I advise you to read them, because I am not going to repeat what she has already written (I don’t have her level of knowledge and skills, so I wouldn’t do it justice the way she does). Meanwhile, I will give you some tips that will help you.

UI :

The user interface of Azure Data Studio is similar to that of VS Code. It is simpler and not overloaded with menus. It has the classic left sidebar, as in VS Code. You can split the window any way you want (literally limitless splitting). Figure 1 shows how to change the theme color.

Figure 1: change theme color

Export User Settings:

Most of us have at least 2 devices: a business laptop and a personal laptop/tablet. Let’s say you tried Azure Data Studio on your personal device and customized it, then decided to install it on the second device with the same customized settings. I’ve got you covered! It is easier than you think; the first two steps are shown in Figure 2. (1) Open User Settings by clicking the Settings icon in the bottom left corner and then Settings; you can also open it from the Command Palette or with the keyboard shortcut Ctrl+comma. (2) Move the cursor over the tab, right-click it, then choose “Reveal in Explorer”. (3) Copy the settings.json file and send it to your work device. (4) Repeat the first steps on your work device and finish by replacing its current settings.json file with the copied one. Done.

Figure 2: export user settings

Change Terminal Shell:

The first time you install Azure Data Studio, you have to choose a default terminal shell. Later on, when you click the add-new-terminal icon, you get the same terminal shell, but what if I want to use PowerShell and cmd at the same time? To do so, you need to change the default terminal shell by opening the Command Palette (Ctrl+Shift+P), typing “Select Default Shell”, and hitting the Enter key. You can then change the shell type by selecting one from the given list. There is another way to do it: open User Settings (the first step in the previous tip), then search for:

"terminal.integrated.shell.windows": “Shell path”

You have to change the path and you are ready to go.

For example, my default terminal shell is CMD so it looks like this

"terminal.integrated.shell.windows": "C:\\WINDOWS\\System32\\cmd.exe"

In order to change it to PowerShell, I only replace the Shell path string

    "terminal.integrated.shell.windows": "C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe"

*You have to repeat these steps every time you want to change the shell.

Subscriptions Filter:

It is a simple tip, but it is worth it if you have many subscriptions. Azure Data Studio allows multiple linked accounts, which means you can connect to different Azure accounts and use all the resources at the same time, which is really COOL. However, you may have many subscriptions and you will not use all of them. To get rid of the unnecessary ones, use the subscription filter by hovering over the account, as demonstrated in Figure 3.

Figure 3: subscriptions filter

Connect to multiple Azure accounts:

While writing the Subscriptions Filter tip, I discovered that it may be tricky to connect to a second or third Azure account. You need to click on the person icon (bottom left corner of the window). Figure 4 shows the way.

Figure 4: connect to multiple Azure accounts

Run a script from file:

Integrated terminals are an amazing feature of Azure Data Studio. They make the life of DB admins easier: you don’t need to open many apps and windows, everything is in one tool. You can open a folder or a workspace in the Explorer panel of Azure Data Studio (not Windows Explorer); say it contains some script files and you want to run one of them. There is a command for this task that runs the active file, or just the selected text, in the active terminal. Figure 5 shows how it works. This command doesn’t have a keyboard shortcut by default (you can add one by editing the keyboard shortcuts 😉). To make it work, open the Command Palette and type:

Terminal: Run Active File in Active Terminal

Or

Terminal: Run Selected Text in Active Terminal
Figure 5: Run script from file

~*~*~*~*~*~

There are other features of Azure Data Studio you will enjoy and that will improve your work, such as Auto Save (you don’t have to save changes with Ctrl+S every time), the Process Explorer, Peek Definition, etc.

The purpose of this blog post was to show some cool features of this great tool. You may notice that I did not mention anything related to working with databases, queries, or extensions; those need another blog post. For now, you can start with the quickstart tutorials and the official documentation. If you don’t have it yet, you can download it here.

Contribute

Please help the community by giving your feedback and contributing on GitHub.

— — — —

References

To write this blog post, I used the official documentation of Azure Data Studio, Julie Lerman’s blog posts in MSDN Magazine, and two blogs from VisualStudioMagazine (here and here).

Let’s talk about data

Introduction

I decided to write a series about data. Nowadays, most of the buzzwords in the IT world are big data, data science, artificial intelligence, etc., and all of them put value on data. Data is starting to be as expensive as black gold; some people even claim it is more expensive (you can google or duckduckgo “data is the new oil” to find some news articles). We can also notice that newspapers and media keep talking about the misuse of the data that we generate and offer for free on the internet. I am not going to discuss that last point, but it is a fact of the world we live in, and it can change our lives for better or worse.

This blog post was meant to be just an introduction to the series, but then I decided you should learn at least one thing from it. So, do you know the difference between data and information? If yes, skip this part and go straight to the next one.

What does data mean?

According to Cambridge, data is


information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer.

dictionary.cambridge.org

Well, this is a typical mistake where data is confused with information.

Oxford dictionaries has a better definition, data is defined as


Facts and statistics collected together for reference or analysis; the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.


Oxford dictionaries

This is similar to how some scholars define data. You can read their definitions in the study Conceptual Approaches for Defining Data, Information, and Knowledge by Chaim Zins.

We can understand that data itself is not significant.

Data is NOT a synonym for information

Information is something that is understandable and can inform us about something. When data is placed and presented in a given context, it becomes useful and is called information; information is inferred from data.

Example:

Let’s say we have: Robot, 112, C, Baby. These data are not useful and do not inform us about anything. However, if I say that the company Robot sold 112 units of product C, and all the customers are people who have at least one baby, then with that context the data become somewhat useful. Robot is the name of the company, C is the product, Baby is a tag that lets us categorize the customers, and 112 is the quantity sold to this category (it may be the total, we don’t know).

— — —

It is important to understand the basic definitions, because later on it gets more complex, with tens of technical terms you will encounter, and that can lead to confusion.

In this series, we will go through different topics where data is the core of the business/subject:

  • Data preparation
  • Data sampling
  • Data in BI/ETL, data mining, data science, big data, AI
  • Data quality, data maintenance, data governance

To be honest with you, I was going to write only about data preparation and data sampling, because it is the most underrated part: everyone is focused on creating a good machine learning model and solving the problems of humanity with big data, while overlooking well-prepared data. The most important stage in any data process is the preparation stage. It takes a lot of time, and if it is not well handled, everything can go wrong.

The series will contain some theory and practice using different tools.

That’s why the list is not complete and not in order. I will post a table of contents with all the links. I may not post in the exact order, because sometimes something has to be explained before moving to the next step, or I simply feel the urge to explain it.

Who is the series intended for?

The blog posts are useful for anyone who deals with data, such as data scientists (1), BI developers (2), software developers (3), etc.

(1) and (2) are the people who use the data as input and give us knowledge and wisdom as output. *They are not the only ones in this category.

(3) They may design databases and work with data a lot during the product life cycle. *They are not the only ones in this category.