Azure Machine Learning is Microsoft’s machine learning studio. It provides a workbench for analysts to perform data analysis including applying predictive analytics and machine learning algorithms.
One of the key uses of Machine Learning is finding correlations in data and using the relationships between different indicators to provide predictive power. Here is an example scenario I built in Azure ML.
I found a dataset that describes a set of Community Health Status Indicators by county for the United States. It provides a set of health rates such as homicide, cancer, obesity, suicide, etc. In addition, it provides a set of demographic indicators such as the size of the county, the population density, the poverty rate and the population breakdown by race.
Creating a Dashboard
I created a Power BI Dashboard that summarizes some of the key indicators.
While this is an interesting dashboard, it doesn’t tell us what factors influence key metrics like Average Life Expectancy. With dozens of potential indicators, it’s not clear which ones are key drivers and which are less important.
Finding Predictors with Azure ML
What if we could determine the indicators that predict Average Life Expectancy? We could then understand the factors that impact this key metric and put them on our dashboard.
Using Excel, I combined the separate indicator files into a single CSV file containing all the possible indicators. I then loaded this file into Azure ML Studio.
I then used the Project Columns module in Azure ML to pick out a number of potential columns that could impact Average Life Expectancy.
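For readers who would rather script this step than use Excel, here is a minimal sketch of the same idea: join several indicator files on a county key, then keep only the candidate columns. The key name `county_fips`, the column names, and the sample values are all assumptions made up for illustration; they are not taken from the actual dataset.

```python
import csv
import io

def read_indicator_file(handle, key="county_fips"):
    """Read one indicator CSV into {key value: row dict}."""
    return {row[key]: row for row in csv.DictReader(handle)}

def merge_indicator_files(handles, key="county_fips"):
    """Join several indicator files on the county key, one row per county."""
    merged = {}
    for handle in handles:
        for county, row in read_indicator_file(handle, key).items():
            merged.setdefault(county, {}).update(row)
    return [merged[county] for county in sorted(merged)]

def project_columns(rows, columns):
    """Keep only the named columns, similar in spirit to a column-selection step."""
    return [{name: row.get(name, "") for name in columns} for row in rows]

# Tiny made-up stand-ins for two indicator files (hypothetical values).
demographics = io.StringIO("county_fips,poverty_rate\n01001,10.4\n01003,12.1\n")
health = io.StringIO(
    "county_fips,ale,lung_cancer_rate\n01001,75.2,61.0\n01003,77.0,55.3\n"
)

combined = merge_indicator_files([demographics, health])
selected = project_columns(combined, ["county_fips", "ale", "poverty_rate"])
for row in selected:
    print(row)
```

With real files you would pass open file handles instead of the in-memory strings used here.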
Which Features Have Predictive Power?
Azure ML provides a number of methods for analyzing features and determining which of them have a strong predictive relationship with the indicator you are trying to predict.
One of the modules in Azure ML Studio is Filter Based Feature Selection, which reduces the number of columns based on statistical analysis. You set the target for your prediction (in this case Average Life Expectancy), and the module goes through your list of features and finds the ones with the strongest correlations.
In reviewing the output, here are some of the features that have the strongest correlation with Average Life Expectancy.
ALE (Average Life Expectancy) is obviously the top feature since it is the one we’re trying to predict. Features such as the number of people under 18, the poverty rate, lung cancer rate and so on seem to be the best candidates for predicting Average Life Expectancy.
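To make the idea of correlation-based filtering concrete, here is a rough sketch that scores each feature by the absolute value of its Pearson correlation with the target and sorts them strongest first. It assumes Pearson correlation as the scoring metric; the module's actual settings in my experiment are not reproduced here, and the column names and numbers are invented for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(table, target):
    """table: {column name: list of floats}. Returns [(name, |r|)], strongest first.

    The target column itself is skipped, since it would trivially score 1.0.
    """
    scores = [
        (name, abs(pearson(values, table[target])))
        for name, values in table.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical columns and values, for illustration only.
table = {
    "ale": [78.0, 76.5, 75.0, 73.5, 72.0],
    "poverty_rate": [8.0, 11.0, 14.0, 17.0, 20.0],  # moves in lockstep with ale
    "population_density": [120.0, 40.0, 300.0, 90.0, 150.0],
}

ranked = rank_features(table, "ale")
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value matters: a feature that falls as life expectancy rises is just as informative as one that rises with it.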
How Predictive is our Model?
To test the predictive power of our model, we need to apply an algorithm and see whether the columns we selected can accurately predict Average Life Expectancy. Azure ML Studio provides a number of industry-standard algorithms for such analysis. Because we are trying to predict a continuous value (i.e., it could be any number), this lends itself to regression algorithms, which try to determine an equation that produces a predicted Average Life Expectancy value from our set of features. Using machine learning, the algorithm tries different weightings of the features to find the best-fit equation for the actual results in the dataset. We can then test the accuracy of the equation using our dataset as well.
To create a training dataset and a testing dataset, we can split our original list of 3142 rows in half, using 50% for training the model and 50% for testing and evaluation. In Azure ML, you can use the Split Data module to do exactly this. We can use Linear Regression as our algorithm and feed it into the Train Model module to calibrate it on the training data. Once this has been done, we can then score and evaluate the model to test its predictive power.
When you run this model, you get the following results in the Evaluate Model module.
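The split, train, and score steps can be sketched outside Azure ML as well. The example below is a simplified stand-in, not what the Azure ML modules do internally: it shuffles the rows, splits them 50/50, fits ordinary least squares via the normal equations, and scores the held-out half. The data is synthetic (generated from a made-up equation with hypothetical features `poverty` and `lung`) so that the fit can be checked.

```python
import random

def split_rows(rows, fraction=0.5, seed=0):
    """Shuffle rows and split them into (train, test) by the given fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no perfectly collinear features).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_linear(features, targets):
    """Ordinary least squares. Returns [intercept, w1, w2, ...]."""
    xs = [[1.0] + list(f) for f in features]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, feature_row):
    """Apply the fitted equation to one row of feature values."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], feature_row))

# Synthetic data from an invented equation, for illustration only.
rng = random.Random(1)
rows = []
for _ in range(20):
    poverty = rng.uniform(5, 30)
    lung = rng.uniform(30, 90)
    rows.append(((poverty, lung), 80.0 - 0.3 * poverty - 0.05 * lung))

train, test = split_rows(rows, fraction=0.5, seed=0)
weights = fit_linear([f for f, _ in train], [y for _, y in train])
scored = [(y, predict(weights, f)) for f, y in test]
print("fitted weights:", weights)
```

Because the synthetic target has no noise, the fit recovers the generating equation almost exactly; real county data would leave residual error, which is what the evaluation step measures.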
The model is a reasonably good, but not excellent, predictor of Average Life Expectancy. Look at the Coefficient of Determination: the closer this value is to 1, the better the predictive power. In this case, 0.60 is a reasonably good score; 0.90 or greater would be considered excellent. The Error Histogram is also very illustrative, since it shows the error variability. In this experiment, 48% of the results were within 0.0014 of the actual value, which is very good for an Average Life Expectancy of between 70 and 80 years. Another 30% were off by less than a year.
However, there are a few outliers in the data where the algorithm was off by more than 3 years.
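As a rough illustration of the two evaluation outputs discussed above, the sketch below computes a coefficient of determination and buckets the absolute errors into a simple histogram. The bin edges (1 and 3 years) and the sample values are assumptions chosen for the example; they are not the bins or results from the experiment.

```python
import bisect

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def error_histogram(actual, predicted, edges=(1.0, 3.0)):
    """Count absolute errors per bin: [0, e0), [e0, e1), ..., [e_last, inf)."""
    counts = [0] * (len(edges) + 1)
    for a, p in zip(actual, predicted):
        counts[bisect.bisect_right(edges, abs(a - p))] += 1
    return counts

# Made-up life expectancy values in years, for illustration only.
actual = [70.0, 75.0, 80.0, 78.0]
predicted = [70.5, 75.0, 76.0, 79.5]

print("R^2:", round(r_squared(actual, predicted), 3))
print("errors <1, 1 to 3, >=3 years:", error_histogram(actual, predicted))
```

A single value near 1 can hide a handful of large misses, which is why it helps to look at the error distribution alongside the coefficient.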
Revising Our Dashboard
What does this analysis tell us? A few important conclusions are worth noting.
The first is that the key factors that impact Average Life Expectancy seem to be:
- Births with Mothers Under 18
- Lung Cancer
- Low Birth Weight
- Births with Mothers Under 40
- % Black Population
- Very Low Birth Rate
- Infant Mortality
- % White Population
If we’re interested in Average Life Expectancy, then having these on our dashboard would help explain it.
In addition, we could use the predictive model to forecast Average Life Expectancy where the data is missing as long as we have these other factors. Using Azure ML, you can turn your experiment into a web service whereby you would submit the input columns and the service will generate the predicted value based on the model. This turns your experiment into an engine that can be harnessed to process future data as it arrives.
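The "input columns in, predicted value out" idea can be illustrated locally with a short sketch. This does not call Azure ML or its web service; it simply applies a linear equation to rows where the life expectancy value is missing. The intercept, coefficients, column names, and county rows are all made up for the example and are not the fitted values from the experiment.

```python
def predict_ale(weights, intercept, row):
    """Apply a linear equation to one row of input columns."""
    return intercept + sum(weights[name] * row[name] for name in weights)

def fill_missing_ale(rows, weights, intercept):
    """Return copies of rows with missing 'ale' values replaced by predictions."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is left untouched
        missing = row.get("ale") is None
        if missing:
            row["ale"] = predict_ale(weights, intercept, row)
        row["ale_is_predicted"] = missing
        filled.append(row)
    return filled

# Hypothetical coefficients and rows, for illustration only.
intercept = 80.0
weights = {"lung_cancer_rate": -0.05, "infant_mortality": -0.2}
counties = [
    {"county": "A", "lung_cancer_rate": 60.0, "infant_mortality": 5.0, "ale": 76.1},
    {"county": "B", "lung_cancer_rate": 40.0, "infant_mortality": 10.0, "ale": None},
]

for row in fill_missing_ale(counties, weights, intercept):
    print(row["county"], round(row["ale"], 2), row["ale_is_predicted"])
```

Flagging which values were predicted keeps estimated figures from being mistaken for measured ones when they appear on a dashboard.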