Data preparation (part 2)

In the previous post, we went through the pre-preparation phase: collecting metadata, data profiling, and defining data preparation rules. This post is mainly about dealing with missing values, also known as nulls.

Before looking for methods to deal with nulls, confirm that you are really missing data. Some blanks in the data set may simply stand for a value. Low-quality data may use null in place of "no": take a column that only contains "yes" and null values. Verify whether the system/application that generates the data leaves the field empty for a negative/false response. In that case, you simply replace null with "no" instead of deleting the column.
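
A minimal sketch of that fix, assuming the data sits in a pandas DataFrame with an illustrative "response" column:

```python
import pandas as pd

# Illustrative column: the source system leaves the field empty
# instead of writing "no".
df = pd.DataFrame({"response": ["yes", None, "yes", None]})

# Once the data owner confirms that null really means "no",
# fill the blanks instead of dropping the column.
df["response"] = df["response"].fillna("no")
print(df)
```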

In addition, metadata can help with missing data by flagging out-of-range entries with types such as unknown, unrecorded, or irrelevant. These can occur for different reasons, such as:

  • Malfunctioning equipment
  • Changes in database design
  • Collation of different datasets
  • Measurement not possible

Missing data types

First, we need to understand the different types of missing data. There are 3 different types:

Missing completely at Random (MCAR)

The name explains itself: the data are missing for random reasons (for example, a measurement sensor ran out of battery), unrelated to any other variable, measured or not. It just happened randomly.

Example:

We conducted a survey on a university campus about extracurricular activities; one of the questions was the student's age. The survey was available online, and some volunteers also asked students on campus directly. After collecting the data, we started preparing it and found out that some students did not mention their age because the question was not mandatory. In this case, the missing age values are missing completely at random.

Missing at Random (MAR)

The missing data are independent of the unobserved values but dependent on the observed values. "What?!" Let's make it simple:

We have a dataset of car characteristics:

| Brand | Model | Nbr of Doors | Nbr of Seats | Airbag |
|---|---|---|---|---|
| Audi | A6 | 5 | 5 | |
| Mercedes Benz | E63 | 5 | 5 | Yes |
| Audi | A4 | 5 | 5 | |
| BMW | M3 | 3 | 2 | Yes |
| Renault | Megane | 5 | 5 | No |
| Skoda | Superb | 5 | 5 | Yes |
| Mercedes Benz | S560 | 5 | 5 | Yes |
| Peugeot | 508 | 5 | 5 | No |
| Skoda | Octavia RS | 5 | 5 | Yes |
| Tesla | Model S | 5 | 5 | Yes |
| Audi | A8 | 5 | 5 | |
| Tesla | Model 3 | 5 | 5 | Yes |

We have missing values in the Airbag column. Notice that the missing values depend on the Brand column: if we group the data by brand, we find that all the rows with a missing Airbag value have the brand "Audi".
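
A quick way to check for this kind of pattern is to group a missingness flag by the suspected column. A minimal pandas sketch, using a shortened version of the table above:

```python
import pandas as pd

# Illustrative subset of the car dataset above.
cars = pd.DataFrame({
    "Brand":  ["Audi", "Mercedes Benz", "Audi", "BMW", "Tesla", "Audi"],
    "Airbag": [None,   "Yes",           None,   "Yes", "Yes",   None],
})

# Count missing Airbag values per brand; here they all fall under Audi,
# which is exactly the MAR pattern described above.
print(cars["Airbag"].isna().groupby(cars["Brand"]).sum())
```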

Missing not at Random (MNAR)

The missing data depend not only on the observed data but also on the unobserved data, that is, on the missing values themselves.

Example:

A survey was conducted worldwide about mental disorder treatment. The results showed that respondents from low/lower-income countries were significantly less likely to report treatment than respondents from high-income countries. Here, whether treatment is reported depends on the treatment status itself, which is unobserved.

— — — —

Dealing with missing data

How do we deal with missing data? There are 3 different possibilities:

1 – Data Deletion

First, this method should be used only in the MCAR case. There are 2 different deletion methods that most data analysts/scientists use:

Drop them (Listwise deletion):

You remove the entire row if it has at least one missing value. This method is recommended when your data set is large enough that the dropped rows do not affect the analysis. Most labs and companies define a minimum percentage of data that is required; if that threshold cannot be met, they remove the rows with missing data. Personally, if most (more than 50%) of the values of a variable are null or missing, I "usually" drop the column.
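
A minimal pandas sketch of both rules, assuming the data lives in a hypothetical cars.csv:

```python
import pandas as pd

df = pd.read_csv("cars.csv")  # hypothetical file name

# Listwise deletion: drop every row with at least one missing value.
complete_rows = df.dropna()

# Column-level rule of thumb: drop a column when more than half
# of its values are missing.
mostly_missing = df.columns[df.isna().mean() > 0.5]
df = df.drop(columns=mostly_missing)
```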

Pairwise deletion:

No data are dropped in this case. If the analysis needs all the columns, you select only the rows without any missing values. However, if the analysis task needs only some of the variables, and the rows with missing values happen to have all the values required for this task, you include those rows in the data selected for the task.

Example:

For this example, the car data set will be used. Let's assume it has 50 rows and that values are missing only in rows 1 and 6:

| # | Brand | Model | Nbr of Doors | Nbr of Seats | Airbag |
|---|---|---|---|---|---|
| 1 | Audi | A6 | 5 | 5 | |
| 2 | Mercedes Benz | E63 | 5 | 5 | Yes |
| 3 | BMW | M3 | 3 | 2 | Yes |
| 4 | Skoda | Superb | 5 | 5 | Yes |
| 5 | Mercedes Benz | S560 | 5 | 5 | Yes |
| 6 | Peugeot | 508 | 5 | 5 | |
| 7 | Skoda | Octavia RS | 5 | 5 | Yes |
| … | … | … | … | … | … |
| 50 | Tesla | Model 3 | 5 | 5 | Yes |

1st Task: an association rules task to find an association hypothesis between the number of seats and the number of doors. The needed attributes are: Brand, Model, Nbr of Seats and Nbr of Doors. In this case, we can use the whole data set because these attributes have no missing values.

2nd Task: an association rules task to find an association hypothesis between the number of seats and the presence of airbags. The needed attributes are: Brand, Model, Nbr of Seats and Airbag. To resolve this task, we eliminate rows 1 and 6 and use the rest, as in the sketch below.
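
In pandas, pairwise deletion amounts to selecting only the columns each task needs before dropping incomplete rows. A sketch, assuming the same hypothetical cars.csv:

```python
import pandas as pd

df = pd.read_csv("cars.csv")  # hypothetical file name

# 1st task: Airbag is not needed, so all 50 rows can be kept.
task1 = df[["Brand", "Model", "Nbr of Seats", "Nbr of Doors"]]

# 2nd task: drop rows only where this task's columns are incomplete
# (here that removes rows 1 and 6, which lack the Airbag value).
task2 = df[["Brand", "Model", "Nbr of Seats", "Airbag"]].dropna()
```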

2 – Replace missing values

The second option for dealing with missing values is to replace them. Here it gets a bit complicated because there are different ways to achieve it.

Mean/median substitution

Replace missing values with the mean or median of the variable. We use this method when the variable is numerical and less than 30% of its values are missing.

However, with missing values that are not strictly random, especially in the presence of a great inequality in the number of missing values for the different variables, the mean substitution method may lead to inconsistent bias.

Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64(5):402–406. doi:10.4097/kjae.2013.64.5.402
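
A minimal pandas sketch of the method, using an illustrative numeric "price" column and the 30% rule of thumb above:

```python
import pandas as pd

df = pd.read_csv("cars.csv")  # hypothetical file with a numeric "price" column

# Impute only when the share of missing values is small (here < 30%).
if df["price"].isna().mean() < 0.30:
    df["price"] = df["price"].fillna(df["price"].median())
```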

Common value imputation

We use the most common value to replace the missing values. For example, the Car dataset we used previously has a Color column with 100 records. The Color column contains only 5 distinct values, and the most common one (67 occurrences) is Black. So, we replace the missing values with Black. However, this method may also lead to inconsistent bias.
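
The same fill pattern works with the mode; a short sketch with an illustrative Color column:

```python
import pandas as pd

# Illustrative Color column with missing entries.
df = pd.DataFrame({"Color": ["Black", "Black", None, "White", None]})

# Replace missing values with the most common value (the mode).
df["Color"] = df["Color"].fillna(df["Color"].mode()[0])
```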

Regression imputation

Regression imputation helps us reduce bias in the analysis. We know that the mean/median method replaces the missing values with existing ones. Instead of doing that, we predict the missing values from the other available variables. This way, we gain new values and retain the cases with missing values.
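
A minimal sketch with scikit-learn, assuming illustrative column names (a numeric "price" target predicted from door and seat counts, with no missing values in the predictors):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("cars.csv")             # hypothetical file
features = ["Nbr of Doors", "Nbr of Seats"]
target = "price"                          # illustrative numeric column

known = df[df[target].notna()]
unknown = df[df[target].isna()]

# Fit a regression on the complete rows, then predict the missing values.
model = LinearRegression().fit(known[features], known[target])
df.loc[df[target].isna(), target] = model.predict(unknown[features])
```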

Multiple imputation

The multiple imputation approach "begin[s] with a prediction of the missing data using the existing data from other variables [15]. The missing values are then replaced with the predicted values, and a full data set called the imputed data set is created. This process iterates the repeatability and makes multiple imputed data sets (hence the term "multiple imputation"). Each multiple imputed data set produced is then analyzed using the standard statistical analysis procedures for complete data, and gives multiple analysis results. Subsequently, by combining these analysis results, a single overall analysis result is produced."

Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64(5):402–406. doi:10.4097/kjae.2013.64.5.402

The purpose of multiple imputation is to obtain statistically valid inference, not to find the true missing data: there is no way to predict the missing data and get it 100% right. The main advantages of this method are that it reduces bias and is easy to use. Meanwhile, to build a correct imputation model, you need to take into consideration the conditions this method requires and avoid some pitfalls.

If you want to use the multiple imputation method, I recommend reading the following articles: Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls (BMJ 2009;338:b2393) and When and how should multiple imputation be used for handling missing data in randomised clinical trials – a practical guide with flowcharts (DOI: 10.1186/s12874-017-0442-1).
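
One way to experiment with this idea in Python is scikit-learn's IterativeImputer, which is inspired by MICE; sampling from the posterior with several random seeds yields several imputed data sets. A sketch, assuming a hypothetical all-numeric file:

```python
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("cars_numeric.csv")  # hypothetical all-numeric dataset

# Draw several imputed data sets (the "multiple" in multiple imputation);
# each one would then be analyzed separately and the results pooled.
imputed_sets = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(df),
        columns=df.columns,
    )
    for seed in range(5)
]
```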

3 – Create new field / variable

Missing data have their own usefulness, mainly when they are not MCAR (missing completely at random). Therefore, we create a new variable or field that records the observed behavior or pattern of the missing values. This can also be useful if you own the tool that generates the data: you can create a new engineered feature based on the missing data pattern, as in the sketch below.
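
A minimal pandas sketch, recording the missingness of the Airbag column from the earlier example as a new indicator feature:

```python
import pandas as pd

df = pd.read_csv("cars.csv")  # hypothetical file

# Record the missingness pattern itself as a new engineered feature,
# before anything is imputed or deleted.
df["Airbag_missing"] = df["Airbag"].isna().astype(int)
```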

— — — —

Further reading

  1. How to Handle Missing Data
  2. The prevention and handling of the missing data

References

  1. ibm.com: Pairwise vs. Listwise deletion: What are they and when should I use them?, Accessed 27/02/2019 (https://www-01.ibm.com/support/docview.wss?uid=swg21475199)
  2. ncbi.nlm.nih.gov: The prevention and handling of the missing data, Accessed 21/04/2019 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100)
  3. measuringu.com: 7 ways to handle missing data, Accessed 15/04/2019 (https://measuringu.com/handle-missing-data)

Data preparation (part 1)

Data preparation is the most time-consuming phase in any data-related cycle, whether you are preparing the data for a machine learning model, data mining, or BI.

I will explain how to prepare the data efficiently by following different steps.

Many people who are starting their career in the data field forget about an important step: they ask for the data and start preparing it straight away.

But before that, you should do some pre-preparation.

Business Understanding (pre-preparation)

First, you need to understand the business logic. Every data analysis task is related to a business task.

Ask for explanations and read the available documentation. In addition, meetings with a business analyst in the organization or with service/product owners may be required. This step saves a lot of time: if you skip it, you may find out later that some data are missing, that the data structure does not make sense, or hit many other random problems.

Tip: when collecting the data, ask for the data governance department (if there is one). The people there have useful and priceless information.

*Don't let them convince you that the data is self-explanatory.

Business understanding does not have a simple method to follow. You just need to figure out how the business works and, most importantly, how the data was generated. After finishing this step, ask for the data needed for the given task.

Now we can start the data preparation. To do so, we need the metadata.

Collect the metadata

Metadata is the data that describes the data. Having the metadata is a must; if it is not accessible, you should create it with the help of the data owner.

Metadata helps with identifying the attributes of the data set, the type of each attribute, and sometimes even the values assigned to a concrete attribute.

Data profiling

Data profiling is important to better understand the data. It "is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data". Data profiling includes structure discovery, content discovery, and relationship discovery. This step makes it easier to discover and choose the needed data. Also, if similar data are needed in later iterations, you already know how to deal with it, and the whole data preparation process becomes much easier.
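
With pandas, a first profiling pass can be as simple as the following sketch (the file and the "Brand" column are illustrative):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file

df.info()                           # structure: columns, types, non-null counts
print(df.describe(include="all"))   # content: summary statistics per column
print(df["Brand"].value_counts())   # value distribution of one attribute
```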

Define data preparation rules (optional)

This step applies to big data. Data preparation rules are the methods of cleansing and transforming the data.

Why? Cleansing big data is not a simple task, and it is time consuming. Imagine you delete rows using the value of an attribute as a condition, then find out that the condition misses something, and the size of your data set is 5 TB. It will take you forever to figure out the right condition.

How? We take a random sample from our data set, cleanse it, and transform it. The script that was used to prepare the random sample is then applied to the whole data set.

The random sample must be valid. I will write a blog post about generating a correct and valid random sample.
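
A minimal pandas sketch of the sample-then-apply pattern, with hypothetical file and column names (at 5 TB you would run the same logic on a distributed engine rather than in-memory pandas):

```python
import pandas as pd

df = pd.read_csv("big_dataset.csv")  # hypothetical file

# Develop and validate the rules on a random sample first.
sample = df.sample(frac=0.01, random_state=42)

def prepare(data: pd.DataFrame) -> pd.DataFrame:
    # Cleansing/transformation rules validated on the sample;
    # the filter below is purely illustrative.
    return data.dropna(subset=["id"])

prepared_sample = prepare(sample)
# Once the rules are right, the same script runs on the whole data set.
prepared_full = prepare(df)
```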

Iterative preparation

Start with the basic cleansing steps that apply to any dataset. After that, tackle the challenging steps, such as dealing with missing data. Leave the data transformation to the end.

In part 2, we will see how to deal with missing values and how to get better-quality data.