Data preparation (part 1)

Data preparation is often the most time-consuming phase of any data project, whether you are preparing the data for a machine learning model, data mining, or BI.

In this post, I explain how to prepare data efficiently, step by step.

Many people who are starting their career in the data field forget about an important step. They ask for the data and start preparing it straight away.

But before that, you should do some pre-preparation.

Business Understanding (pre-preparation)

First, you need to understand the business logic. Every data analysis task is related to a business task.

Ask for explanations and read the available documentation. Meetings with a business analyst in the organization, or with service/product owners, may also be required. You save a lot of time with this step: if you skip it, you may find out later that some data are missing, that the data structure does not make sense, or that many other seemingly random problems appear.

Tip: when collecting the data, ask for the data governance department (if there is one). The people there have invaluable information.

Don’t let them convince you that the data is self-explanatory.

There is no simple method for business understanding. You just need to figure out how the business works and, most importantly, how the data was generated. After finishing this step, ask for the data needed for the given task.

Now, we can start the data preparation. To do so we need the metadata.

Collect the metadata

Metadata is the data that describes the data. Having the metadata is a must. If it is not accessible, you should create it with the help of the data owner.

Metadata helps with identifying the attributes of the dataset, the type of each attribute, and sometimes even the values allowed for a specific attribute.
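To make this concrete, here is a minimal sketch of metadata written down as a small data dictionary and used to check a record. The "orders" dataset, its attribute names, types, and allowed values are all hypothetical and only serve as an illustration.

```python
# Hypothetical metadata (data dictionary) for an imaginary "orders" dataset.
# Attribute names, types, and allowed values are illustrative assumptions.
METADATA = {
    "order_id": {"type": int, "required": True},
    "status": {"type": str, "required": True,
               "allowed": {"new", "paid", "shipped", "cancelled"}},
    "amount": {"type": float, "required": False},
}


def check_record(record, metadata=METADATA):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for name, spec in metadata.items():
        value = record.get(name)
        if value is None:
            if spec["required"]:
                problems.append(f"{name}: missing required value")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{name}: unexpected value {value!r}")
    # Attributes that the metadata does not describe are also worth flagging.
    for name in sorted(record.keys() - metadata.keys()):
        problems.append(f"{name}: not described in metadata")
    return problems
```

Calling check_record({"order_id": 1, "status": "refunded"}) reports the unexpected status value, which is exactly the kind of question to take back to the data owner.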

Data profiling

Data profiling is important for understanding the data better. It “is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data”. Data profiling includes structure discovery, content discovery and relationship discovery. This step makes it easier to discover and choose the needed data. Also, if similar data are needed in later iterations, you already know how to deal with them, and the whole data preparation process becomes easier.
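A rough sketch of what such a profile could look like, using only the Python standard library. The function names and the chosen statistics are assumptions made for illustration, and the "candidate key" check is only a very small part of what relationship discovery usually covers.

```python
from collections import Counter


def profile(rows):
    """Collect simple per-attribute statistics from a list of dict rows."""
    columns = sorted({name for row in rows for name in row})
    report = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in (None, "")]
        report[name] = {
            # Structure discovery: which Python types actually occur.
            "types": sorted({type(v).__name__ for v in present}),
            # Content discovery: missing values, cardinality, frequent values.
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1),
        }
    return report


def candidate_keys(rows, report):
    """Relationship discovery (rough): attributes that are unique and complete."""
    return [name for name, stats in report.items()
            if stats["missing"] == 0 and stats["distinct"] == len(rows)]
```

An attribute that shows more than one type, or an unexpectedly high number of missing values, is usually the first thing worth checking against the metadata.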

Define data preparation rules (optional)

This step applies to big data. Data preparation rules are the methods used to cleanse and transform the data.

Why? Cleaning big data is not a simple task, and it is time-consuming. Imagine you delete rows using the value of an attribute as a condition, then you find out that the condition is missing something, and the size of your dataset is 5 TB. It will take you forever to figure out the right condition.

How? We take a random sample from our dataset, then cleanse and transform it. The script that was used to prepare the random sample is then used for the whole dataset.

The random sample must be valid. I will write a blog post about generating a correct and valid random sample.
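The idea above can be sketched as follows: the rules live in one function that is developed on a small sample and then reused unchanged. The rules, the "country" attribute, and the function names are hypothetical. A simple random sample is used here only as a placeholder; whether it is a valid sample for your data is a separate question.

```python
import random


def prepare(rows):
    """Preparation rules: drop rows without a country, normalise the rest.

    The rules themselves are hypothetical; the point is that they live in
    one function that can be developed on a sample and reused unchanged.
    """
    cleaned = []
    for row in rows:
        country = (row.get("country") or "").strip().upper()
        if not country:
            continue  # rule 1: a row without a country is unusable
        cleaned.append({**row, "country": country})  # rule 2: normalise case
    return cleaned


def draw_sample(rows, size, seed=42):
    """Simple random sample without replacement; the seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))
```

You would iterate on prepare(draw_sample(rows, 1000)) until the rules look right, and only then run prepare on the full dataset. With truly large data, the same rules would be expressed in whatever engine holds the data rather than in an in-memory list.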

Iterative preparation

Start with the basic cleansing steps that apply to any dataset. After that, tackle the challenging steps, such as dealing with missing data. Leave the data transformation for the end.
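One way to keep that order explicit is to write each step as a small function and run them as an ordered list. This is only a sketch: the step names, the "price" attribute, and the tax rate are made up, and filling missing prices with a default is a placeholder rather than a recommended strategy.

```python
def strip_whitespace(rows):
    """Basic cleansing: trim surrounding whitespace from every string value."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
            for row in rows]


def drop_duplicates(rows):
    """Basic cleansing: keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing_price(rows, default=0.0):
    """Harder step: a placeholder strategy for missing values (assumption)."""
    return [{**row, "price": default if row.get("price") is None else row["price"]}
            for row in rows]


def add_price_with_tax(rows, rate=0.2):
    """Transformation, left for the end: derive a new attribute."""
    return [{**row, "price_with_tax": round(row["price"] * (1 + rate), 2)}
            for row in rows]


# Order matters: generic cleansing, then missing data, then transformation.
STEPS = [strip_whitespace, drop_duplicates, fill_missing_price, add_price_with_tax]


def run_pipeline(rows, steps=STEPS):
    """Apply each step in order and return the prepared rows."""
    for step in steps:
        rows = step(rows)
    return rows
```

Because each step is separate, you can rerun the pipeline after changing one step, which is what makes the preparation iterative.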

In part 2, we will look at how to deal with missing values and how to get better-quality data.

Open Sourcing of Windows Calculator is great news for many developers

Microsoft announced on Wednesday, 6 March 2019, that it had made Windows Calculator open source (you can find it on GitHub). Suddenly, the internet went crazy with memes and posts mocking the announcement, dismissing it as a small piece of code that is not worth it, and blah blah blah.

Figure 1: Windows Calculator. source: me

I understand why some people are frustrated and see it as a small project that is not worth it when its size is compared with .NET or VS Code, but these people are narrow-minded. They did not look at it from a different angle. Let me explain the real value of Windows Calculator in the open source world.

A calculator is a simple project that new developers and students build as one of their first projects, and they feel proud of it (at least I was proud of my calculator project). You start with simple operations, but you soon find out that parsing is needed and that some checks are a must so that the app does not crash.
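As an illustration of those first hurdles, here is a minimal sketch of a beginner-style calculator that parses a single "number operator number" expression and guards against the usual crashes. It is written in Python for brevity and is in no way taken from the Windows Calculator source; the function name and the error messages are made up.

```python
import operator

OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def calculate(expression):
    """Evaluate '<number> <operator> <number>' and return (result, error)."""
    parts = expression.split()
    if len(parts) != 3:
        return None, "expected: <number> <operator> <number>"
    left_text, symbol, right_text = parts
    try:
        left, right = float(left_text), float(right_text)
    except ValueError:
        return None, "operands must be numbers"
    if symbol not in OPERATIONS:
        return None, f"unknown operator {symbol!r}"
    if symbol == "/" and right == 0:
        return None, "division by zero"
    return OPERATIONS[symbol](left, right), None
```

For example, calculate("6 / 3") returns (2.0, None), while calculate("6 / 0") returns an error message instead of raising an exception.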

Figure 2: young boys on computers. source: pixabay

Later on, complex operations can be added and the real dev problems appear, such as different results when using different types (Float vs Double vs Decimal), or saving the last operations. And the project keeps getting bigger and bigger. At this point, beginners don’t know how to choose the right project structure, how to improve their code, or how to write clean code.

Imagine you have the source code of the most used calculator in the world, made by one of the biggest software companies worldwide! That’s insane. Wait! Microsoft offered more than source code: it included the project architecture, unit tests, and the build system.
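The "different results with different types" problem mentioned above is easy to reproduce. The sketch below uses Python, where float is a 64-bit double and the decimal module plays the role of a Decimal type; the variable names are just for illustration.

```python
from decimal import Decimal

# Binary floating point (Python's float is a 64-bit double) cannot represent
# 0.1 or 0.2 exactly, so the sum is not exactly 0.3.
float_sum = 0.1 + 0.2
float_matches = (float_sum == 0.3)

# Decimal stores base-10 digits, so the same sum is exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_matches = (decimal_sum == Decimal("0.3"))

print(float_matches, decimal_matches)  # prints: False True
```

A calculator has to decide which of these behaviours its users expect, and that decision affects the whole code base.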

The application architecture is useful even for junior developers and students. Seeing solid use of the MVVM design pattern in a real application helps them plan their first applications.

Moreover, the Calculator application is written in Visual C++ (C++/CX), a set of extensions to the C++ language used for Windows apps and Windows Runtime components. C++ is a solid programming language, and most university students have C++ classes. Microsoft offered them a semester project to get great grades 😀 This is just a joke: if you are a new learner, do not do it. Build your own, and then compare your work. That way, you improve your skills, and it is great to learn from your own mistakes 😉

Windows Calculator is built for the Universal Windows Platform using the XAML UI framework. Developers can learn more about making their own custom controls and VisualStates, which comes in handy for creating and publishing apps in the Microsoft Store.

Finally, they won’t stop at the development phase: they can also learn how Azure Pipelines is used for the build, deployment, and release phases. This is important because it can be hard to apply CI/CD in your first projects.

Figure 3: thumbs up. source: pixabay

To conclude, Windows Calculator is the best example for learning Microsoft’s full development lifecycle.