Let’s talk about data

Introduction

I decided to write a series about data. Nowadays, most of the buzzwords in the IT world are big data, data science, artificial intelligence, etc., and all of them place a high value on data. Data is becoming as valuable as black gold, and some people claim it is even more valuable; you can google or duckduckgo “data is the new oil” to read some news articles. Newspapers and other media frequently report on the misuse of the data that we generate and give away for free on the internet. I am not going to discuss that last point, but it is a reality we live in, and it can change our lives for better or worse.

This blog post was meant only to introduce the series, but then I decided you should learn at least one thing from it. So, do you know the difference between data and information? If yes, skip this part and go straight to the next one.

What does data mean?

According to Cambridge, data is


information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer.

dictionary.cambridge.org

Well, this definition makes a typical mistake: it confuses data with information.

Oxford Dictionaries has a better definition. There, data is defined as


Facts and statistics collected together for reference or analysis. The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.


Oxford dictionaries

This is similar to the definition of data given by some scholars. You can read their definitions in the study “Conceptual Approaches for Defining Data, Information, and Knowledge”, written by Chaim Zins.

From these definitions, we can understand that data by itself carries no meaning.

Data is NOT a synonym of information

Information is something that is understandable and can inform us about something. When data is arranged and presented in a given context, it becomes useful and is called information; information is inferred from data.

Example:

Let’s say: Robot, 112, C, Baby. These data aren’t useful and do not inform us about anything. However, suppose I say that the company Robot sold 112 units of the product C, and that all the customers are people who have at least one baby. Now, with the given context, the data become useful. Robot is the name of the company, C is the product, Baby is a tag that lets us categorize the customers, and 112 is the quantity sold for this category (it may be the total; we don’t know).
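To make the example concrete, here is a minimal Python sketch. It is only an illustration: the `SalesRecord` type and its field names (`company`, `units_sold`, `product`, `customer_tag`) are assumptions I made up for this post, not part of any real system. The raw tuple plays the role of data, and the record with named fields plays the role of information.

```python
from dataclasses import dataclass

# Raw data: four values with no context, exactly as listed in the example.
raw = ("Robot", 112, "C", "Baby")


@dataclass
class SalesRecord:
    """Hypothetical record type; the field names are assumptions for illustration."""
    company: str
    units_sold: int
    product: str
    customer_tag: str

    def describe(self) -> str:
        # The sentence is the "information": the same values, read in context.
        return (
            f"The company {self.company} sold {self.units_sold} units of "
            f"product {self.product} to customers tagged '{self.customer_tag}'."
        )


# Attaching context: each raw value is mapped to a named field.
record = SalesRecord(*raw)
print(record.describe())
```

The values are identical in both cases; only the context (the field names and the sentence built from them) turns them into something that informs us.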

— — —

It is important to understand these basic definitions, because later on things become more complex: you will encounter dozens of technical terms, and that can lead to confusion.

In this series, we will go through different topics where data is the core of the business/subject:

Data Preparation

Data sampling

Data in BI / ETL, data mining, data science, Big data, AI.

Data quality, Data maintenance, Data governance

To be honest with you, I was going to write only about data preparation and data sampling, because they are the most underrated parts: everyone focuses on creating a good Machine Learning model and solving the problems of humanity with Big Data, as if the data were already prepared. Yet preparation is the most important stage of working with data. It takes a lot of time, and if it is not handled well, everything can go wrong.

The series will contain some theory and practice using different tools.

That’s why the list is neither complete nor in order. I will post a table of contents with all the links. I may not post in the listed order, either because something has to be explained before moving to the next step or because I feel the urge to explain it.

Who is the series intended for?

The blog posts are useful for anyone who deals with data, such as data scientists (1), BI developers (2), software developers (3), etc.

(1) and (2) are the people who take data as input and give us knowledge and wisdom as output. *They are not the only ones in this category.

(3) They may design databases and work with data a lot during the product life cycle. *They are not the only ones in this category.
