Data Types in Machine Learning

In our previous article, we looked at the definition of machine learning, how a machine learns, how machine learning differs from traditional rule-based systems, and the main machine learning categories, i.e. Supervised, Unsupervised and Reinforcement Learning.

We also know that historical data is crucial for any machine learning algorithm we use. But how do we read and interpret that data so that we can make effective use of it in our algorithm?

Remember: garbage in, garbage out.

So, let us understand the data with an example. Let’s say the following is data from a bank that wants to create an algorithm to determine whether a customer’s loan should be approved or not.

Now, let’s look at what kind of data we are dealing with here.

In any data set, there are three main factors that we need to understand before we start solving the problem using Machine Learning:

Type of Variables

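So, in this example, we are trying to determine whether the loan application of a customer will be approved or not. The approval decisions recorded in the past are what we want our Machine Learning algorithm to learn from, so that it can predict similar results for new applications. The variable we want to predict (here, the loan approval) is usually called the dependent or target variable, and the remaining variables that we use to predict it are called independent variables.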
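As a rough illustration of the idea above, the sketch below separates the variable we want to predict from the rest of the data. The records, the column names and the helper `split_features_target` are all made up for this example; they are not the bank's actual data.

```python
# Hypothetical loan records: column names and values are invented for illustration.
records = [
    {"Gender": "Male", "Dependents": 0, "ApplicantIncome": 5849.0, "Loan_Status": "Y"},
    {"Gender": "Female", "Dependents": 1, "ApplicantIncome": 4583.0, "Loan_Status": "N"},
]

TARGET = "Loan_Status"  # assumed name of the column we want to predict


def split_features_target(rows, target):
    """Return (features, labels): each row without the target, plus the target values."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [row[target] for row in rows]
    return features, labels


features, labels = split_features_target(records, TARGET)
print(labels)              # past approval decisions the algorithm learns from
print(list(features[0]))   # the remaining variables used to make the prediction
```

The algorithm only ever sees `features` as input, and learns to reproduce `labels`.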

Data Type

At a high level, we can safely assume that some variables in the data provided to us are Character/String values and some are Numeric values. There can be more types or subtypes within them, such as integer, float, etc., but as long as we understand these two broad types, we are good.
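One simple way to check these two broad types is to try reading every value of a variable as a number. The following is only a minimal sketch: the function `broad_type` and the sample columns are assumptions made for illustration, and real data would also need handling for missing values.

```python
def broad_type(values):
    """Classify a column as 'numeric' or 'string' from its raw text values."""
    for value in values:
        try:
            float(value)  # succeeds for text such as "5849" or "4583.5"
        except ValueError:
            return "string"  # at least one value is not a number
    return "numeric"


# Hypothetical raw columns, as they might arrive from a CSV file (all text).
columns = {
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": ["5849", "4583.5", "3000"],
    "Dependents": ["0", "1", "2"],
}

for name, values in columns.items():
    print(name, "->", broad_type(values))
```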

Now, what would happen to the data type if the bank says that it is going to treat all customers with three or more dependents the same? In that case, the exact number of dependents beyond three does not have any impact, and we should change every such value to “3+”. In such a scenario, the “Dependent” variable will become a string type variable.

So, you should pay very close attention to such details coming from the data provider to avoid any data errors in the data processing stage.
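The “3+” rule described above can be sketched as follows. The function name `cap_dependents` and the sample counts are hypothetical; the point is only that the variable stops being numeric once the rule is applied.

```python
def cap_dependents(count, cap=3):
    """Turn a dependent count into a label, merging every value at or above `cap`."""
    return f"{cap}+" if count >= cap else str(count)


raw = [0, 1, 2, 3, 5]                      # made-up numeric dependent counts
capped = [cap_dependents(n) for n in raw]  # every value is now a string

print(capped)  # ['0', '1', '2', '3+', '3+']
```

After this step, arithmetic on the variable (for example, an average number of dependents) no longer makes sense, which is why the change of type matters for later processing.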

Category

The category of the data is a very important aspect. If you look closely, some variables contain a fixed set of options, such as male, female; yes, no; etc. Such variables are called Categorical variables. Other variables are Continuous variables, whose values can be any number within a range.
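A first pass at separating the two categories could look like the sketch below. It rests on a simplifying assumption: text columns are categorical and numeric columns are continuous, unless we are told that a numeric column is really a coded option (for example, a 0/1 flag). The function `variable_category`, the `coded_categories` parameter and the sample columns are all hypothetical.

```python
def variable_category(name, values, coded_categories=()):
    """Return 'categorical' or 'continuous' for one column.

    Text columns are treated as categorical. Numeric columns are treated as
    continuous unless their name is listed in `coded_categories`.
    """
    if name in coded_categories or any(isinstance(v, str) for v in values):
        return "categorical"
    return "continuous"


# Made-up columns for illustration only.
columns = {
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes"],
    "ApplicantIncome": [5849.0, 4583.0, 3000.0],
    "Credit_History": [1, 0, 1],  # numbers, but really a yes/no option
}

for name, values in columns.items():
    print(name, "->", variable_category(name, values, coded_categories={"Credit_History"}))
```

Which numeric columns are really coded options is not something the code can know on its own; that information has to come from the data provider.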

These are small things, but they are very important for reading and processing the data.

Stay tuned for more articles!