Move Resources between 2 Azure Subscriptions

This blog post shows how to validate the move operation of resources between 2 Azure subscriptions, and how to move them successfully, going through all the needed steps.

Figure 1 : Move resources between 2 subscriptions

Microsoft Azure offers the possibility to move resources from one resource group to another within the same subscription, or from one subscription to another within the same Azure tenant.

The available documentation on the validation is limited, so you have to put the puzzle together yourself by collecting pieces of information from different articles. Therefore, my main goal is to give you guidance from A to Z so that you do not waste your time. I will add links to the different articles that I used, in case you are interested in reading them.

To move the resources, there are 2 major steps:

  • Validate the move operation: It is optional, but highly recommended.
  • Move operation: The main action.

— — — —

Part 1: Validate the move operation

To validate the move operation, we need to call the dedicated REST API endpoint; there is no other option for this action. To execute it successfully, we have to:

  1. Create an Azure Service Principal,
  2. Prepare the request body,
  3. Get an access token,
  4. and finally, make the REST call.

1 – Create Azure Service Principal

You can skip this step if you have one already.

First, let's create an Azure service principal (SP). If you are not familiar with service principals: basically, we register an application in Azure Active Directory (AAD) and assign a role to it. Check the official documentation to learn more about Azure service principals.

Create the application with just 2 clicks in the Azure Portal: go to Azure Active Directory >> App registrations >> fill in the form. You can also do it with PowerShell or Azure CLI, as shown below.

Figure 2 : Register an Azure Application
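
If you prefer PowerShell, the registration boils down to a single cmdlet. A minimal sketch (the display name is a placeholder, and depending on your Az module version the application id is exposed as AppId or ApplicationId):

# Register an application and create a service principal for it
$sp = New-AzADServicePrincipal -DisplayName 'sp-move-resources-demo'

# Keep the application (client) id, we will need it later
$sp.AppId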

Next, we generate a client secret by going to the Certificates & secrets tab and clicking Add a client secret. Note down the client secret; we will need it in the next steps.

Figure 3 : Generate a client secret

Finally, we have to assign the Contributor role to the registered application on the source resource group.

Figure 4 : Assign contributor role
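
The role assignment can also be done from PowerShell. A small sketch, assuming the service principal created above and the source resource group used later in this post:

# Give the service principal Contributor rights on the source resource group
New-AzRoleAssignment -ApplicationId $sp.AppId `
    -RoleDefinitionName 'Contributor' `
    -ResourceGroupName 'rg-sdar-westeurope'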

2 – Prepare the request body and URI Parameters

We need to collect the following items:

  • Tenant ID
  • Subscription IDs of both the source and the target subscriptions
  • Names of both the source and the target resource groups
  • Resources that we want to move

Let's start by connecting to Azure and listing all subscriptions in PowerShell:

Connect-AzAccount

# List subscriptions
Get-AzContext -ListAvailable | Select-Object Name, Subscription, Tenant | Format-List

Then, we need to set the context to the source Subscription and get the resource group name (in case you forgot it  😄)

# Select a subscription as current context
Set-AzContext -SubscriptionId <sourceSubscriptionId>

# Get Names and Locations of resource groups in the selected Subscription:
Get-AzResourceGroup | Select-Object ResourceGroupName, Location

Finally, we get the IDs of the resources in the given resource group. I added some formatting so that you can copy-paste straight into the request body; you only need the first command to get the resources:

# Get the resource IDs
$resourcesList = Get-AzResource -ResourceGroupName 'rg-sdar-westeurope' |
    Select-Object -ExpandProperty ResourceId

# Format the values by adding double quotes and joining them with commas
$resourcesListFormatted = '"{0}"' -f ($resourcesList -join '","')

# Copy to clipboard
Set-Clipboard -Value $resourcesListFormatted

Create a new HTTP request in Postman, go straight to the Body tab, and choose the type raw. Construct the request body as follows:

{
 "resources": [<paste the recently copied resources list>],
 "targetResourceGroup": "/subscriptions/<targetSubscriptionId>/resourceGroups/<targetResourceGroupName>"
}

It should look like this:

Figure 5 : Request body

3 – Get an OAuth2 Token

It is mandatory for the authorization of the POST request.

We get it by making a POST call to https://login.microsoftonline.com/<tenantId>/oauth2/token with the following values in the request body (formatted as x-www-form-urlencoded in Postman):

  • grant_type : client_credentials
  • client_id : the application (client) id of the app registered in the first step
  • client_secret : the client secret noted in the first step
  • resource : https://management.azure.com/

Figure 6 : Get an OAuth2 token
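
If you would rather skip Postman for this step, the same token request can be made with PowerShell. A sketch with placeholder values:

# Request an OAuth2 token for the Azure Resource Manager API
$tokenResponse = Invoke-RestMethod -Method Post `
    -Uri 'https://login.microsoftonline.com/<tenantId>/oauth2/token' `
    -Body @{
        grant_type    = 'client_credentials'
        client_id     = '<clientId>'
        client_secret = '<clientSecret>'
        resource      = 'https://management.azure.com/'
    }

# The bearer token used in the next step
$accessToken = $tokenResponse.access_token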

4 – Validate the move operation

All the previous steps lead to this action. Go back to the POST request created in the second step and set the request URL with the required values: https://management.azure.com/subscriptions/<sourceSubscriptionId>/resourceGroups/<sourceResourceGroupName>/validateMoveResources?api-version=2019-05-10

The authorization type is Bearer Token; use the access token that we obtained in the previous step.

Figure 7 : Add an authorization token

Send it. The response status code should be 202 Accepted with an empty response body.

Figure 8 : Validate move resources
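
If you prefer to stay in PowerShell instead of Postman, the validation call looks roughly like this (a sketch that reuses the resource list and access token from the previous steps, with the same placeholders):

# Build the request body
$body = @{
    resources           = $resourcesList
    targetResourceGroup = '/subscriptions/<targetSubscriptionId>/resourceGroups/<targetResourceGroupName>'
} | ConvertTo-Json

# Call the validateMoveResources endpoint
Invoke-RestMethod -Method Post `
    -Uri 'https://management.azure.com/subscriptions/<sourceSubscriptionId>/resourceGroups/<sourceResourceGroupName>/validateMoveResources?api-version=2019-05-10' `
    -Headers @{ Authorization = "Bearer $accessToken" } `
    -Body $body `
    -ContentType 'application/json'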

In case you get 400 Bad Request with the error code ResourceNotTopLevel, remove that resource from the validation request body: it will be moved automatically with its parent resource. You can get this error with a SQL database or a Web App slot; the solution is to move the SQL Server or the whole Web App that contains the slot, respectively.

Make sure you have the needed permissions, and that you have checked the limitations of your resources and the subscription quotas. For more details, read the "Checklist before moving resources" section in the official documentation.

— — — —

Part 2: Move resources

There are 4 different ways to move the resources to another subscription:

  • Post Request with the REST API (Similar to the validate move operation)
  • Azure PowerShell
  • Azure CLI
  • Or using the Portal

I am a lazy person, so I always choose the easiest way. That means Portal is the choice 😂.

Go to the source resource group, click Move, and choose "to another subscription". Next, check the boxes of the resources that you want to move, then select the target subscription and the target resource group. Finally, click OK.

Figure 9 : Move resources

Congrats, you moved the resources to the new subscription like a champ 😎.
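
For completeness, the Azure PowerShell option is essentially a single cmdlet. A sketch with placeholder values, run from the source subscription context:

# Move every resource of the source resource group to the target subscription and resource group
$resources = Get-AzResource -ResourceGroupName '<sourceResourceGroupName>'

Move-AzResource -DestinationSubscriptionId '<targetSubscriptionId>' `
    -DestinationResourceGroupName '<targetResourceGroupName>' `
    -ResourceId $resources.ResourceId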

~*~*~*~*~*~

References and important links

  1. Move resources to a new resource group or subscription
  2. Supported resources – Move operation
  3. Validate Move Resources
  4. Troubleshoot moving Azure resources to new resource group or subscription

LUIS Migration – False Error Message: Module {..} already exists

Here I am back with another story from my daily work problems. We have been using Language Understanding (LUIS) for one of our solutions for a while. Recently, Microsoft made some upgrades and created a new platform that benefits from resource authoring based on Role-Based Access Control (RBAC), Azure Active Directory (Azure AD) and custom domains. So, we had to migrate our LUIS app. And that did not go well.

For those who are not familiar with LUIS: Azure Cognitive Services offer a set of REST APIs that help you build intelligent applications without the need to develop your own models with machine learning, deep networks, etc. One of these services is Language Understanding (LUIS), a cloud-based API service that applies custom machine-learning intelligence to natural language text to predict overall meaning and pull out relevant, detailed information. For more details, check the official site.

The LUIS portal is being replaced because it cannot keep up with the rest of the Azure services, so Microsoft decided to ship the new upgrades with a new portal. The bad part is that every user of the old portal has to migrate to the new one themselves, so that the old portal can be decommissioned by June 2020. See the unofficial announcement here: https://github.com/azure-deprecation/dashboard/issues/26.

To help with the migration, Microsoft offers some documentation, such as https://docs.microsoft.com/en-us/azure/cognitive-services/luis/luis-migration-authoring.

We decided to export the application from the old portal and import it into the new portal. In other words, we did not follow Microsoft's migration workflow, because it is not the smartest choice for a professional environment. Our approach keeps the production solution running against the LUIS app in the old portal while we create another one in the new portal, hoping that everything goes smoothly.

(For the demo, I used Travel Agent Sample from the samples provided by Microsoft)

So, we started the migration process and it did not go as planned (like always). When importing the application, we had this error message:

BadArgument: The models: { BookFlight } already exist in the specified application version.

Figure 1 : Import Application

The error message does not say much. "The models" literally does not make sense in this context, or in the context of LUIS. BookFlight is the name of both an intent and an entity in the imported application, and that is what caused the error: the new platform does not accept the same name for an intent and an entity, so we had to rename one of them.

A good error message can be something like this:

You cannot assign the same name for an entity and an intent. Please use a unique name for the intent {BookFlight}

(I have no idea why it was named like that; the LUIS application has been there for more than 2 years, since before I joined my team, but that is not an excuse.)

Such a small issue took us several hours and the troubleshooting effort of 3 people.

There are 2 solutions for this problem:

  1. Make sure that each intent has a unique name. The name must not be used in entities or any other part of the application. (This is the best solution)
  2. Rename the entity, but also the pattern if you are using one

To reproduce the error, upload the application provided in this GitHub repository. Another option is to add an entity with the name of an existing intent, or the other way around.

Figure 2: Create new entity

— — — —

To recap, we saw what a misleading error message looks like and how to fix the underlying error (in case you hit the same one). Also check out my blog post Fix a gap in GitHub API v4. If you read to the end, thank you very much for your support. Wishing you a Happy New Year!

Fix a gap in GitHub API v4

In this blog post, I want to show you how I solved the problem of getting the content of a commit's files with the GitHub GraphQL API (v4), while avoiding the crazy amount of API calls required by the GitHub REST API (v3). The solution is available in this GitHub repository.

If you want the solution straight away, click here to skip the explanation of the problem.

GitHub Octocat (source: https://github.com/logos)

GitHub is the most famous code-hosting platform for version control and collaboration; it is essential for all IT people. One of its best features is the APIs it offers to manipulate content and integrate GitHub into a workflow such as a CI pipeline. There are two API versions: GitHub API v3, which is a REST API, and GitHub GraphQL API v4. These are the two current stable API versions. (If you are wondering about the previous versions, you can find some information in the GitHub documentation.)

The GitHub REST API v3 covers "all" (or almost all, I am not sure) areas. However, tons of API calls need to be made in order to satisfy a workflow job. Therefore, GitHub decided to replace the REST API with GraphQL in version 4, and this is why:

GitHub chose GraphQL for our API v4 because it offers significantly more flexibility for our integrators. The ability to define precisely the data you want—and only the data you want—is a powerful advantage over the REST API v3 endpoints. GraphQL lets you replace multiple REST requests with a single call to fetch the data you specify.

GitHub documentation, https://developer.github.com/v4

However, the new API (GraphQL API v4) did not solve the problem 100%. It still has some gaps and does not cover all areas; you may need to go back to version 3 for a given job.

Disclaimer: I was a novice with the GitHub API and GraphQL when I started working on this project.

I was working on a project that integrates GitHub into one of its processes, and I chose the GraphQL API v4. Everything was going smoothly until I had to get the content of the last commit. Google helped a bit, and the github.community forums too (they are not well referenced in Google). I looked for a way to get the content of a given commit with GraphQL; I thought it was an obvious thing, and I managed to build the following query with the available documentation.

If you are new to GraphQL, you can start your learning journey at graphql.org/learn, and I recommend the query editor GraphiQL (GitHub repository: github.com/skevy/graphiql-app, download page: electronjs.org/apps/graphiql).

{
   rateLimit{
    cost
    remaining
  }
  repository(name: "GitHubAPIDemo", owner: "MohamedSahbi") {
    ref(qualifiedName: "master") {
      name
      id
      target {
        ... on Commit {
          id
          history(first: 1) {
            pageInfo {
              hasNextPage
            }
            totalCount
            edges {
              node {
                author {
                  name
                  date
                }
                changedFiles
                commitResourcePath
                oid
                abbreviatedOid
                tree {
                  entries {
                    name
                    type
                    oid
                    object {
                      #This is a fragment
                      ...GetAllFiles
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

fragment GetAllFiles on Tree {
  ... on Tree {
    entries {
      name
      type
      oid
      object {
        ... on Tree {
          entries {
            name
            type
            oid
            object {
              ... on Blob {
                text
              }
            }
          }
        }
      }
    }
  }
}

This query returns, in order:

  • The cost of my request, in the rateLimit section. This is important because every user has a limited credit of 5,000 per hour, but a single GraphQL call can cost 1 credit, 100 credits, or more than 5,000 credits. For more details, see the explanation provided by GitHub.
  • The details of the most recent commit, since we asked for history(first: 1).
  • The repository content that we chose to get in this section:
                tree {
                  entries {
                    name
                    type
                    oid
                    object {
                      #This is a fragment
                      ...GetAllFiles
                    }
                  }
                }

What I found is that the commit history does not include the changed files' URLs that the REST API v3 provides. I kept looking for a way to fix my query; I still believed this was available out of the box, and since I was not experienced with GraphQL, I assumed I had made a mistake in my query.

Then, I lost hope for a while and decided to query the GitHub repository with the REST API v3. To get the content of each file, I have to:

  • First, get the list of commits of the repository (one API call).
  • Second, get the content of the commit I am interested in (another API call); its response contains, among a lot of other data, the files array:

"files": [
    {
      "sha": "9907549076a9271ee4948e909eb0669d3ba4875b",
      "filename": "LICENSE",
      "status": "added",
      "additions": 21,
      "deletions": 0,
      "changes": 21,
      "blob_url": "https://github.com/MohamedSahbi/GitHubAPIDemo/blob/5f4538bce768c67bcfd3e71cb05a14614657f68f/LICENSE",
      "raw_url": "https://github.com/MohamedSahbi/GitHubAPIDemo/raw/5f4538bce768c67bcfd3e71cb05a14614657f68f/LICENSE",
      "contents_url": "https://api.github.com/repos/MohamedSahbi/GitHubAPIDemo/contents/LICENSE?ref=5f4538bce768c67bcfd3e71cb05a14614657f68f",
      "patch": "@@ -0,0 +1,21 @@\n+MIT License\n+\n+Copyright (c) 2019 Mohamed Sahbi\n+\n+Permission is hereby granted, free of charge, to any person obtaining a copy\n+of...."
    },
    {
      "sha": "7b9e8fe3adf9f784749834da35fecda8a5392bd3",
      "filename": "README.md",
      "status": "added",
      "additions": 2,
      "deletions": 0,
      "changes": 2,
      "blob_url": "https://github.com/MohamedSahbi/GitHubAPIDemo/blob/5f4538bce768c67bcfd3e71cb05a14614657f68f/README.md",
      "raw_url": "https://github.com/MohamedSahbi/GitHubAPIDemo/raw/5f4538bce768c67bcfd3e71cb05a14614657f68f/README.md",
      "contents_url": "https://api.github.com/repos/MohamedSahbi/GitHubAPIDemo/contents/README.md?ref=5f4538bce768c67bcfd3e71cb05a14614657f68f",
      "patch": "@@ -0,0 +1,2 @@\n+# GitHubAPIDemo\n+This repository contains a sample demo for my blog post"
    }
  ]
  • Third, we loop over the files array, and for each file we have to:
    • Call the endpoint given in the contents_url attribute,
    • The previous call returns a JSON object that contains the download_url attribute, which is the last API endpoint we have to call to get the file content (finally!!!).

Imagine you want to get the updated files in the last commit without knowing the commit hash, that is:

2 API calls + (2 * number of updated files) API calls >= 4 API calls
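
In C#, the per-file part of that loop could look roughly like this (a sketch with a plain HttpClient; the class and method names are illustrative, and you still need the usual GitHub authentication and User-Agent headers on the client):

// Fetch each changed file's content via contents_url -> download_url (2 calls per file)
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class FileContentFetcher
{
    public static async Task<Dictionary<string, string>> GetFilesContentAsync(
        HttpClient client, IEnumerable<string> contentsUrls)
    {
        var result = new Dictionary<string, string>();

        foreach (var contentsUrl in contentsUrls)
        {
            // First call: contents_url returns metadata that includes download_url
            using var metadata = JsonDocument.Parse(await client.GetStringAsync(contentsUrl));
            var downloadUrl = metadata.RootElement.GetProperty("download_url").GetString();

            // Second call: download_url returns the raw file content
            result[contentsUrl] = await client.GetStringAsync(downloadUrl);
        }

        return result;
    }
}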

The duplicate data, the useless extra information in the JSON responses, and the huge number of endpoint calls needed to get the files' content pushed me to keep looking for a better solution. I kept looking until I found this post about the same problem I was facing. It convinced me that I had to fix it myself, since there is no out-of-the-box solution.

Eventually, I found this great post about GraphQL aliases. I should have paid more attention when learning GraphQL, or spent some extra hours on it. And yes, I do not remember how I ended up finding out about aliases while solving the problem.

I guess you know where I am going here. Aliases are the best way to avoid the 2 * (number of updated files) API calls that I had to make using the REST API v3. The solution is to combine the best features of both GitHub API v3 and v4.

Solution

The final process looks like this:

  1. Get the commits using REST API v3
  2. Get the commit content also using REST API v3
  3. Generate a single GraphQL query with aliases to get the files content.
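
To give you an idea of step 3, a single query with one alias per changed file could look roughly like this (a sketch built from the commit hash and file names of the REST response shown earlier; the alias names are arbitrary):

# One alias per file; object(expression: "<commit-sha>:<path>") resolves to the file's blob
{
  repository(name: "GitHubAPIDemo", owner: "MohamedSahbi") {
    license: object(expression: "5f4538bce768c67bcfd3e71cb05a14614657f68f:LICENSE") {
      ... on Blob {
        text
      }
    }
    readme: object(expression: "5f4538bce768c67bcfd3e71cb05a14614657f68f:README.md") {
      ... on Blob {
        text
      }
    }
  }
}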

I created a sample project on GitHub with a GitHub service that you can reuse easily. There are 2 main methods offered by this service:

  • GetLastCommitFilesContent(string directory, DateTime? startingFrom = null): gets the last commit, looks up its content, then generates a GraphQL query and retrieves the files by calling the GraphQL API. It returns an object of type GitHubServiceOutput.
  • GetCommitFilesContent(string commitHash, string directory): similar to the previous method, but it takes a commit hash as a parameter. The output is the same as the previous method's output.

You will find the details about the code sample and the used libraries in the GitHub repository.

Custom Object Comparison in C#

A while ago, I needed to compare complex objects in C#. I forgot to mention the real reason when I first wrote the article, so thanks to the feedback I received, here it is: not only do I need a Boolean result from the comparison, I also need the properties whose values differ as output, plus the possibility to exclude properties from the comparison. I looked for a function that would provide the same kind of functionality as string.Compare(), but for complex objects of the same type. The Object.Equals() method alone did not satisfy my need, and overriding it for every type was not convenient. My next stop was Stack Overflow, where I found a quite interesting approach in this discussion. It did not satisfy my need to the fullest, so I took it as a starting point and implemented my own comparison class (the implementation is available on GitHub with a running sample).

I created a static class Comparer with a class constraint on the generic type parameter, which satisfies my need. If you do not know what constraints on type parameters mean, go to the Microsoft docs (I recommend reading and understanding what they are and why they are used, because you will need them for sure). Then, reflection was the choice to get things done. The PropertyInfo class from the System.Reflection namespace was enough to do the work: it gives us access to a property's attributes and metadata, which I use to compare the properties of the 2 objects.
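
To illustrate the idea, here is a minimal sketch of such a reflection-based comparison (the class in the repository is richer; the method name and signature below are only illustrative):

using System.Linq;
using System.Reflection;

public static class Comparer
{
    // Compare two objects of the same type property by property
    // and return the names of the properties whose values differ.
    public static string[] GetChangedProperties<T>(T original, T updated, params string[] excludedProperties)
        where T : class
    {
        return typeof(T)
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => !excludedProperties.Contains(p.Name))
            .Where(p => !Equals(p.GetValue(original), p.GetValue(updated)))
            .Select(p => p.Name)
            .ToArray();
    }
}

// Example usage: list the changed properties while ignoring a property called Id
// var changes = Comparer.GetChangedProperties(originalStudent, updatedStudent, "Id");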

The created class offers different methods that you may find helpful :

The GenerateAuditLog() method literally generates a log entry. It returns an object of type ComparisonResult, which can be inserted into a logHistory table in your database. The method is overloaded so that you can exclude some properties from the comparison.

The GenerateAuditLogMessages() method returns a list of messages that contains only the changes. There is no overload for this method.

The HasChanged() method simply returns a Boolean result. You can optionally exclude some properties from the comparison. I found this method useful when updating records in the database.

That is all!! I hope you find it useful. Feel free to use the code or improve it.

Note: the code is not optimized (no DRY approach) because I take into consideration people who want to use just one method, so they can simply copy the code (I personally re-copy my code, improve it, and adapt it to the case at hand).

Raw queries with Entity Framework Core

tl;dr

In this blog post, I showcase how to migrate a raw SQL query from Entity Framework 6 to EF Core 2.1 and EF Core 3.1. You can find the whole sample on GitHub.

— — — —

I have been working lately on migrating a project from ASP.NET MVC 5 with Entity Framework 6 to ASP.NET Core 2.1 with Entity Framework Core 2.1. During the work, I found a raw query implemented in Entity Framework 6 similar to the following (not the real one):

public async Task<double> GetScore(int studentId)
{
    string query = @"select ((e.Grade * c.Credits)/sum(c.Credits)) as Grade
                                        from Enrollment e
                                        inner join Course c
                                        on e.CourseId = c.CourseId
                                        where studentId= @studentId
                                        group by e.Grade, c.Credits";

    var studentIdParam = new SqlParameter("@studentId", studentId);

    var gradeList = await _universityContext.Database
        .SqlQuery<int>(query, studentIdParam).ToListAsync();

    if (gradeList.Count == 0)
    {
        return 0;
    }

    return gradeList.Average();
}

However, it is not possible to do the same in Entity Framework Core. I had to look for solutions, and I found 2 of them:

The first solution is a simple implementation with ADO.NET; you can find it in my GitHub account as the method GetScoreAdoNet(int studentId). However, I try to avoid ADO.NET because of internal rules in our team, and mainly for maintenance reasons.

So, I kept looking for another solution using Entity Framework Core. Thanks to the great community on Stack Overflow, I found this answer to my problem. Here is the second and better solution:

Solution for EF Core 2.1

I will be using Query types proposed by Entity Framework Core. 

First, we have to create a data model that will be used as the return type of the executed SQL query. Although in my sample (here) I just return a number (int), I still have to create a model with one property. The name of the property should match the name of the column selected in the SELECT statement.

public class AverageGrade
{
    public int Grade { get; set; }
}

Then, we need to configure it in the DbContext, in the OnModelCreating method:

modelBuilder.Query<AverageGrade>();

And finally we can call the raw SQL query:

public async Task<double> GetScore(int studentId)
{
    string query = @"select ((e.Grade * c.Credits)/sum(c.Credits)) as Grade
                                        from Enrollment e
                                        inner join Course c
                                        on e.CourseId = c.CourseId
                                        where studentId= @studentId
                                        group by e.Grade, c.Credits";

    var idParam = new SqlParameter("@studentId", studentId);

    var gradeList = await _universityContext.Query<AverageGrade>()
        .FromSql(query, idParam).ToListAsync();

    if (gradeList.Count == 0)
    {
        return 0;
    }

    return gradeList.Average(x => x.Grade);
}

Solution for EF Core 3.1

Starting from EF Core 3.0, the solution proposed for EF Core 2.1 is obsolete. It is one of the many changes in EF Core 3.0 that you can find here.

A data model is needed for the output of the executed SQL query, same as in the EF Core 2.1 solution.

public class AverageGrade
{
    public int Grade { get; set; }
}

The next step is adding the data model to the ModelBuilder in the OnModelCreating method:

modelBuilder.Entity<AverageGrade>().HasNoKey();

The last step is to use DbContext.Set<>() instead of DbContext.Query<>() in the GetScore(int studentId) method. In other words, line number 12 (in the last code block of the EF Core 2.1 solution) is replaced with this line of code:

var gradeList = await _universityContext.Set<AverageGrade>().FromSqlRaw(query, idParam).ToListAsync();
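
Putting the pieces together, the whole EF Core 3.1 version of GetScore would look roughly like this (a sketch assembled from the snippets above):

public async Task<double> GetScore(int studentId)
{
    string query = @"select ((e.Grade * c.Credits)/sum(c.Credits)) as Grade
                     from Enrollment e
                     inner join Course c
                     on e.CourseId = c.CourseId
                     where studentId= @studentId
                     group by e.Grade, c.Credits";

    var idParam = new SqlParameter("@studentId", studentId);

    // AverageGrade is the keyless entity type registered with HasNoKey() above
    var gradeList = await _universityContext.Set<AverageGrade>()
        .FromSqlRaw(query, idParam).ToListAsync();

    if (gradeList.Count == 0)
    {
        return 0;
    }

    return gradeList.Average(x => x.Grade);
}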

— — — —

That’s it, migration was done successfully.

Query types have some limitations; in case you will be using them, please read the official documentation.

Finally, I prepared 3 projects: one with Entity Framework 6, the second with Entity Framework Core 2.1 (the first 2 projects use ASP.NET Core 2.1), and the third one with EF Core 3.1. You can find the source code on GitHub.

Data preparation (part 2)

In the previous post, we went through the pre-preparation phase: collecting metadata, data profiling, and data preparation rules. This post is mainly about dealing with missing values, aka nulls.

Before looking for methods to deal with nulls, confirm that you are really missing data. It is possible to have blanks in the data set that can simply be replaced with a value. Low-quality data may have null in place of "no": say a column only has "yes" and null values. You should verify whether the system/application that generates the data simply doesn't assign any value for a negative/false response. In that case, you only replace null with "no" and don't delete the column.

In addition, metadata can help with missing data by flagging out-of-range entries with types such as unknown, unrecorded, or irrelevant, which can occur for different reasons such as:

  • Malfunctioning equipment
  • Changes in database design
  • Collation of different datasets
  • Measurement not possible

Missing data types

First, we need to understand the different types of missing data. There are 3 different types:

Missing completely at Random (MCAR)

The title explains itself: the data are missing for random reasons (for example, a measurement sensor ran out of battery), unrelated to any other measured variable. It just happened randomly.

Example:

We conducted a survey on a university campus about extracurricular activities; one of the questions was the student's age. The survey was available online, and we had some volunteers who asked students on campus directly. After we collected the data, we started preparing it and found out that some students did not mention their age because it was not mandatory. In this case, the missing age values are missing completely at random.

Missing at Random (MAR)

The missing data are independent of all the unobserved values, but dependent on the observed values. "What?!" Let's make it simple:

We have a dataset of car characteristics:

Brand          | Model      | Nbr of Doors | Nbr of Seats | Airbag
Audi           | A6         | 5            | 5            |
Mercedes Benz  | E63        | 5            | 5            | Yes
Audi           | A4         | 5            | 5            |
BMW            | M3         | 3            | 2            | Yes
Renault        | Megan      | 5            | 5            | No
Skoda          | Superb     | 5            | 5            | Yes
Mercedes Benz  | S560       | 5            | 5            | Yes
Peugeot        | 508        | 5            | 5            | No
Skoda          | Octavia RS | 5            | 5            | Yes
Tesla          | Model S    | 5            | 5            | Yes
Audi           | A8         | 5            | 5            |
Tesla          | Model 3    | 5            | 5            | Yes

We have missing values in the Airbag column. Notice that the missing values depend on the Brand column: if we group the data by brand, we find that all the rows with missing values have the brand "Audi".

Missing not at Random (MNAR)

The missing data are not only dependent on the observed data, but also dependent on the unobserved data.

Example:

A survey was conducted about mental disorder treatment worldwide. The results showed that respondents from low/lower-income countries are significantly less likely to report treatment than high-income country respondents.

— — — —

Dealing with missing data

How to deal with the missing data? There are 3 different possibilities:

1 – Data Deletion

First, this method should be used only in the MCAR case. There are 2 different deletion methods that most data analysts/scientists use:

Drop them (Listwise deletion):

Basically, you remove the entire row if it has at least one missing value. This method is recommended if your data set is large enough that the dropped rows do not affect the analysis. Most labs or companies have a minimum required percentage of data, and if that threshold is not attainable, they remove the rows with missing data. Personally, if most (more than 50%) of a variable's values are null or missing, I "usually" drop the column.

Pairwise deletion:

Data will not be dropped in this case. If the analysis needs all the columns, you select only the rows without any missing values. Meanwhile, if the analysis task needs only some variables (not all of them), and the rows with missing values happen to have the required values for that task, you include them in the data selected for the task.

Example:

For this example, the car data set will be used. Let's assume it has 50 rows and that there are missing data only in rows 1 and 6.

   | Brand         | Model      | Nbr of Doors | Nbr of Seats | Airbag
 1 | Audi          | A6         | 5            | 5            |
 2 | Mercedes Benz | E63        | 5            | 5            | Yes
 3 | BMW           | M3         | 3            | 2            | Yes
 4 | Skoda         | Superb     | 5            | 5            | Yes
 5 | Mercedes Benz | S560       | 5            | 5            | Yes
 6 | Peugeot       | 508        | 5            | 5            | No
 7 | Skoda         | Octavia RS | 5            | 5            | Yes
 … | …             | …          | …            | …            | …
50 | Tesla         | Model 3    | 5            | 5            | Yes

1st Task: An association rules task to find an association hypothesis between the number of seats and the number of doors. The needed attributes are Brand, Model, Nbr of Seats, and Nbr of Doors. In this case, we can use the whole data set because there are no missing values in the needed attributes.


2nd Task: An association rules task to find an association hypothesis between the number of seats and the airbags. The needed attributes are Brand, Model, Nbr of Seats, and Airbag. To resolve this task, we eliminate rows 1 and 6 and use the rest.

2 – Replace missing values

The second option for dealing with missing values is to replace them. Here it gets a bit complicated, because there are different ways to achieve it.

Mean/median substitution

Replace missing values with the mean or median value. We use this method when the variable is numerical and the missing values represent less than 30% of the data.

However, with missing values that are not strictly random, especially in the presence of a great inequality in the number of missing values for the different variables, the mean substitution method may lead to inconsistent bias.

Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64(5):402–406. doi:10.4097/kjae.2013.64.5.402

Common value imputation

We use the most common value to replace the missing values. For example, the car dataset that we used previously has 100 records and a Color column with only 5 distinct values; the most common one (67 occurrences) is Black. So, we replace the missing values with Black. However, this method may also lead to inconsistent bias.

Regression imputation

Regression imputation lets us avoid biases in the analysis. The mean/median method replaces the missing values with existing ones; instead of doing that, we predict the missing values using the available data. This way, we gain new values and retain the cases with missing values.

Multiple imputation

Multiple imputation “approach begin with a prediction of the missing data using the existing data from other variables [15]. The missing values are then replaced with the predicted values, and a full data set called the imputed data set is created. This process iterates the repeatability and makes multiple imputed data sets (hence the term “multiple imputation”). Each multiple imputed data set produced is then analyzed using the standard statistical analysis procedures for complete data, and gives multiple analysis results. Subsequently, by combining these analysis results, a single overall analysis result is produced. “

Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64(5):402–406. doi:10.4097/kjae.2013.64.5.402

The purpose of multiple imputation is to obtain a statistically valid inference, not to find the true missing data; there is no way to predict the missing data and get it 100% right. The main advantage of this method is the elimination of bias, and it is easy to use. Meanwhile, to get a correct imputation model, you need to take into consideration the conditions required by this method and avoid some pitfalls.

In case you want to use the multiple imputation method, I recommend reading the following articles: Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls (BMJ 2009;338:b2393) and When and how should multiple imputation be used for handling missing data in randomised clinical trials – a practical guide with flowcharts (DOI: 10.1186/s12874-017-0442-1).

3 – Create new field / variable

Missing data has its own usefulness, mainly when it is not MCAR (missing completely at random). Therefore, we create a new variable or field that records the observed behavior or pattern of the missing values. This can also be useful if you own the tool that generates the data: you can create a new engineered feature based on the missing data pattern.

— — — —

Further reading

  1. How to Handle Missing Data
  2. The prevention and handling of the missing data

References

  1. ibm.com: Pairwise vs. Listwise deletion: What are they and when should I use them? , Accessed 27/02/2019 (https://www-01.ibm.com/support/docview.wss?uid=swg21475199)
  2. ncbi.nlm.nih.gov: The prevention and handling of the missing data, Accessed 21/04/2019 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100)
  3. measuringu.com: 7 ways to handle missing data, Accessed 15/04/2019 (https://measuringu.com/handle-missing-data)

Data preparation (part 1)

Data preparation is the most time-consuming phase in any data-related cycle, whether you are preparing the data for a machine learning model, data mining, or BI.

I will explain how to prepare the data efficiently by following different steps.

Many people who are starting their career in the data field forget about an important step. They ask for the data and start preparing it straight away.

But before that, you should do some pre-preparation.

Business Understanding (pre-preparation)

First, you need to understand the business logic. Every data analysis task is related to a business task.

Ask for explanations and read the available documentation. In addition, meetings with a business analyst from that organization or with the service/product owners may be required. You gain a lot of time with this step; if you skip it, you will find out afterwards that some data are missing, that the data structure does not make sense, and run into many other random problems.

Tip: when collecting the data, ask for the data governance department (in case there is one). The people there have useful and priceless information.

*Don't let them convince you that the data is self-explanatory.

Business understanding does not have a simple method to follow. You just need to figure out how the business works and, most importantly, how the data was generated. After finishing this step, ask for the data needed for the given task.

Now, we can start the data preparation. To do so we need the metadata.

Collect the metadata

Metadata is the data that describes the data. Having the metadata is a must; if it is not accessible, you should create it with the help of the data owner.

Metadata helps with identifying the attributes of the data-set, the type of each attribute, and sometimes even the values that can be assigned to a concrete attribute.

Data profiling

Data profiling is important to better understand the data. It "is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data". Data profiling includes structure discovery, content discovery, and relationship discovery. This step makes it easier to discover and choose the needed data. Also, if similar data are needed for the next iterations, you already know how to deal with them, and the whole data preparation process becomes much easier.

Define data preparation rules (optional)

This step applies to big data. Data preparation rules are the methods of cleansing and transforming the data.

Why? Cleaning big data is not a simple task and it is time-consuming. Imagine you delete rows using the value of an attribute as a condition, then you find out that the condition is missing something, and the size of your data-set is 5 TB. It will take you forever to figure out the right condition.

How? We take a random sample from our data-set, cleanse it, and transform it. The script used to prepare the random sample is then applied to the whole data-set.

The random sample must be valid. I will write a blog post about generating a correct and valid random sample.

Iterative preparation

Start with the basic cleansing steps that apply to any dataset. After that, tackle the challenging steps, such as dealing with missing data. Leave the data transformation to the end.

In part 2, we will see how to deal with missing values and how to get better-quality data.