Basics of Machine Learning

In this article, we will go through basic machine learning terminology that is useful for data processing and data visualization.

Mean

It is the average of all the values.

Mean = Sum of all observations / Number of observations

For example, for a sample of 11 observations whose values add up to 62800:

Mean = 62800 / 11

≈ $5709.09
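The formula above can be sketched in a few lines of Python. The observations list below is hypothetical: the values were chosen only so that they match the figures quoted in this article (11 observations that sum to 62800), and they are not the article's actual data.

```python
# Hypothetical observations, chosen only to match the totals quoted in the text
# (11 values that add up to 62800). They are not the article's original data.
observations = [4000, 4500, 5000, 5200, 5300, 5800, 6400, 6400, 6400, 6800, 7000]


def mean(values):
    """Return the arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


total = sum(observations)
count = len(observations)
average = mean(observations)

print(f"Mean = {total} / {count} = {average:.2f}")
```

Python's standard library also offers statistics.mean, which does the same job for a list of numbers.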

Median

It is the numerical middle value of the sorted observations, with an equal number of observations on both sides.

For example, if we arrange the same data as above in ascending order along a straight line, we see that there is an equal number of observations on both sides of “5800”. Hence this is the median value of this sample set.

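In case you are wondering about its significance, the median helps us understand toward which side the majority of the observations are tilted. Here the median (5800) is above the mean (5709.09), so more than half of the observations lie above the mean. The median is also less affected by extreme values than the mean. So, during data clean-up, if some observations have missing values, we can replace them with either the mean or the median.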
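As a rough sketch of both ideas, the snippet below finds the median with Python's statistics module and then uses it to fill missing values. Both lists are hypothetical: observations was chosen only so that its median matches the 5800 quoted above, and with_gaps is a made-up column in which None marks a missing value.

```python
from statistics import median

# Hypothetical, unsorted observations whose median matches the 5800 quoted in
# the text. They are not the article's original data.
observations = [6400, 4000, 5300, 7000, 5800, 6400, 4500, 6800, 5000, 6400, 5200]

# median() sorts the values internally and returns the middle one.
middle = median(observations)

# Count how many observations fall on each side of the median.
below = sum(1 for value in observations if value < middle)
above = sum(1 for value in observations if value > middle)
print(f"Median = {middle} ({below} values below, {above} values above)")

# Hypothetical column with gaps: None marks a missing observation.
with_gaps = [5200, None, 6400, None, 4500]

# Replace every missing value with the median computed above.
filled = [middle if value is None else value for value in with_gaps]
print(filled)
```

Whether the mean or the median is the better replacement depends on the data; the median is a common choice when a few extreme values would distort the mean.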

Mode

It is the value that appears most often in a set of data. In our sample data above, the value “6400” occurs 3 times, which is higher than the number of occurrences of any other value. So the mode of this set of observations is “6400”.

This can also be used for replacing missing values in a dataset.
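A minimal sketch of finding the mode by counting occurrences, using collections.Counter from the standard library. The observations list is hypothetical and was chosen only so that 6400 appears 3 times, as in the text.

```python
from collections import Counter

# Hypothetical observations in which 6400 occurs 3 times, as described in the
# text. They are not the article's original data.
observations = [4000, 4500, 5000, 5200, 5300, 5800, 6400, 6400, 6400, 6800, 7000]

# Counter maps each value to the number of times it occurs.
counts = Counter(observations)

# most_common(1) returns a list with the single most frequent (value, count) pair.
mode_value, occurrences = counts.most_common(1)[0]

print(f"Mode = {mode_value} (occurs {occurrences} times)")
```

If several values tie for the highest count, a dataset has more than one mode; statistics.multimode returns all of them.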

Range

It is the difference between the highest and lowest values in a sample of observations. So, for our sample dataset:

Range = 7000 – 4000

= 3000

This helps us understand how widely the values are spread in a given set of observations.
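The same calculation as a short Python sketch. The observations list is hypothetical and was chosen only so that its lowest and highest values match the 4000 and 7000 used above.

```python
# Hypothetical observations whose lowest and highest values match the text
# (4000 and 7000). They are not the article's original data.
observations = [4000, 4500, 5000, 5200, 5300, 5800, 6400, 6400, 6400, 6800, 7000]

lowest = min(observations)
highest = max(observations)

# Range is the gap between the largest and the smallest observation.
value_range = highest - lowest

print(f"Range = {highest} - {lowest} = {value_range}")
```

Because the range depends on only two observations, a single extreme value can change it a lot.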

Probability

This is one of the most important terms in machine learning, and we have all heard it in one way or another.

Probability is a numerical way of describing how likely something is to happen.

Probability is derived from a Sample Space (S). The Sample Space is the set of all possible outcomes that might be observed for an experiment.

If all this sounds Greek to you, let us take the simple example of a die. When we throw a die, what are the possible outcomes? The outcome can only be one of the following: 1, 2, 3, 4, 5 or 6. So the sample space for a die is the following:

Dice Sample Space (S) = {1,2,3,4,5,6}

Now, what is the probability of 3, i.e. if we roll the die, what is the likelihood of getting a 3?

It is 1 out of 6.

P(3) = 1/6 ≈ 0.1667, or 16.67%

Similarly, what is the probability of getting an even number? Three of the six outcomes (2, 4 and 6) are even, so:

P(even) = 3/6 = 0.5, or 50%
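Both answers follow the same rule: count the outcomes we are interested in and divide by the size of the sample space. A small sketch of that rule is shown below. It assumes a fair die, so every outcome is equally likely, and the helper name probability is made up for this illustration.

```python
from fractions import Fraction

# Sample space of a six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}


def probability(event, space):
    """Return favorable outcomes / all outcomes.

    Illustrative helper: it assumes every outcome in the sample space is
    equally likely (a fair die).
    """
    favorable = event & space  # ignore outcomes that are not in the sample space
    return Fraction(len(favorable), len(space))


p_three = probability({3}, sample_space)
p_even = probability({n for n in sample_space if n % 2 == 0}, sample_space)

print(f"P(3) = {p_three} = {float(p_three):.4f}")
print(f"P(even) = {p_even} = {float(p_even):.1f}")
```

Fraction keeps the results exact (1/6 and 1/2); they are converted to decimals only for printing.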

Does it make sense now? This is one of the most important topics, and many machine learning algorithms, such as Naive Bayes and Logistic Regression, are based on the fundamental principles of probability.

Stay tuned for more!