AutoML


 

Introduction

Automated Machine Learning

AutoML (Automated Machine Learning) is the process of automating the end-to-end traditional machine learning workflow and applying it to solve real-world problems.

Machine learning is a subset of AI concerned with the algorithms and statistical models that computer systems use to make predictions, based on patterns and inferences created through training. The conventional machine learning process relies heavily on human expertise to process the data used for training models, construct appropriate features, apply the right algorithms, and analyze the results obtained.  

AutoML fully automates these resource-intensive, time-consuming, and iterative tasks of the traditional machine learning process. This enables non-experts to work with machine learning easily, without requiring significant domain knowledge about designing, creating, or comparing models.

Catalyst Zia AutoML

Catalyst AutoML is a Zia AI-driven service that analyzes a set of training data and predicts the outcome of a particular subset of that data, without requiring explicit instructions. You can train models in AutoML by providing a dataset and training columns for analysis, and specifying the target column whose value you need to predict. Zia then iterates through a variety of machine learning algorithms and trains the model to generate predictive analytics on the dataset. You can then implement AutoML in your Catalyst applications that include predictive analytics.

You can access and configure AutoML for your project from the Catalyst web console. Once you have trained a model, you can test it and predict values from the Catalyst console. You can also obtain a detailed evaluation report of the trained model.

You can perform predictions in your AutoML models from your Catalyst applications using APIs. For more information, refer to the API documentation. You can refer to the Java SDK documentation and Node.js SDK documentation for code samples of AutoML.

 

Key Concepts

Before you read more about using AutoML, ensure that you understand the following concepts of Zia AutoML.

Model

A model is a set of computations generated as a result of training on the input dataset using various machine learning algorithms. You can use the AutoML model to make predictions on the dataset for various conditions. A model is therefore a mathematical representation of a real-world process, on which you can perform an in-depth analysis to test various hypotheses.

Once a model is generated in AutoML, you can provide a set of input values and generate a set of predictive output values based on the patterns observed in the dataset.

Dataset

An input training dataset is the collection of structured data that you provide for the model to analyze and train on, so that it can perform predictions. In AutoML, you must provide the dataset as a CSV file that contains columns and rows of data. You can upload the CSV file directly from your computer or import it from the Catalyst File Store. You can learn more about this in the Implementation section.

Target

The target is the column whose value needs to be predicted after the model is trained with the dataset. The value prediction is based on the data type of the target column. 

You can only choose a numerical or a categorical type column as the target in AutoML. Zia cannot predict the values of a string or date type column, as they do not hold calculable data. You will learn about the data types of a column in the next part.

Attributes of a Column

Zia determines six attributes for every column in a dataset that is uploaded. Various algorithms calculate and determine the values of these attributes before you select a target. 

The following attributes are determined for the columns in a dataset:

  1. Type
    This is determined for every column in the data set. AutoML supports the following data types:
    • Numerical: A column with only numerical values in it is classified as Numerical.
    • String: A column with a set of numerical, alphabetical, or any other characters as values is classified as a String. Any column that contains mixed values of various data types is also classified as a String. 
    • Date: A column with only date-time values in it is classified as a Date. AutoML supports the following date formats:
       
      Format                                   Example
      YYYY-MM-DD                               '2019-02-12'
      YYYY/MM/DD                               '2008/07/28'
      YYYY/MM/DD hh:mm:ss                      '2011/03/17 23:58:30'
      DD-MM-YYYY                               '03-09-2016'
      DD/MM/YYYY                               '22/11/2018'
      DD-Month-YYYY                            '13-January-2012'
      YYYY-MM-DDThh:mm:ss.sTZD                 '2019-11-28T05:19:31.665523+00:00'
      YYYY.MM.DD                               '2020.01.24'
      Unix timestamp string in seconds         '1574918464'
      Unix timestamp string in milliseconds    '157491844000'
      Unix timestamp string in microseconds    '157491844000000'

    • Categorical: A column with a limited number of distinct values in it is classified as Categorical. There are two types of Categorical columns: 
      • Binary-class: A binary-class column contains only two distinct values in all the records. For example, columns with values as Yes/No, Win/Lose.
      • Multi-class: A multi-class column contains three or more, but a limited number of, distinct values in all the records. For example, a column that depicts the states in a country, or a column that lists the graduate programs available in a University.
    The following table depicts the columns that can or cannot be used as the target or for training the model, based on their data types: 
    Data Type                                     Target    Training
    Numerical                                     Yes       Yes
    String                                        No        No
    Date                                          No        Yes
    Categorical (both binary- and multi-class)    Yes       Yes
  2. Missing (in %)
    This represents the percentage of missing values in a column in the dataset. For example, in a dataset that contains 20 records, if the values of a column are empty for 10 records, the missing amount of data is 50%. 
  3. Distinct Values
    This represents the number of distinct entries in the values of a column in the dataset. For example, if a column's values contain only 'Yes', 'No', and 'Maybe' for all the records, the number of distinct values is three and the column is classified as the Multi-Class Categorical type. 
  4. Mean
    This represents the mean value of all values in the column. This is determined only for Numerical columns.
  5. SD
    This represents the standard deviation of all values in the column. This is determined only for Numerical type columns.
  6. Correlation with Target
    This represents the correlation of a column with the target ranging from 0 to 1, where 0 indicates no correlation and 1 indicates perfect correlation. The correlation of a column with the target is determined by the patterns observed in the column's values with reference to the values in the target column. 

    For example, suppose a column reporting the number of common flu cases is the target of a model. A column depicting the months of the year will have a high correlation with the target, because the number of flu cases is generally higher during the winter months, so the two columns are highly dependent on each other. The correlation is determined for every column in the dataset, except for the String type columns.

    The following table depicts how the various attributes are determined for columns, based on the data types: 
    Data Type                                     Missing    Distinct    Mean    SD     Correlation with Target
    Numerical                                     Yes        Yes         Yes     Yes    Yes
    String                                        Yes        Yes         No      No     No
    Date                                          Yes        Yes         No      No     Yes
    Categorical (both binary- and multi-class)    Yes        Yes         No      No     Yes
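
As a rough illustration of the Missing, Distinct Values, Mean, and SD attributes described above, the following Python sketch computes them for a single column. It is not Zia's implementation. The function name `column_attributes` is hypothetical, empty strings are assumed to mark missing values, and the population standard deviation is an assumption, because this page does not state which variant AutoML uses. The correlation with the target is left out for the same reason: the method used to calculate it is not documented here.

```python
import statistics

def column_attributes(values, numerical=False):
    """Summarize one column; empty strings are treated as missing (assumption)."""
    present = [v for v in values if v != ""]
    attrs = {
        # Percentage of records in which the column has no value.
        "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
        # Number of distinct non-missing entries.
        "distinct": len(set(present)),
    }
    if numerical:
        numbers = [float(v) for v in present]
        attrs["mean"] = statistics.fmean(numbers)
        # Population standard deviation (assumed variant).
        attrs["sd"] = statistics.pstdev(numbers)
    return attrs

# 20 records, 10 of them empty: 50% missing and three distinct values.
answers = ["Yes", "No", "Maybe", "Yes", "No"] * 2 + [""] * 10
print(column_attributes(answers))

# Mean and SD are only computed for numerical columns.
ages = ["20", "30", "40", ""]
print(column_attributes(ages, numerical=True))
```

The first call reproduces the two worked examples from the list above: 10 empty values out of 20 records, and the distinct values 'Yes', 'No', and 'Maybe'.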

Input Feature Selection

AutoML allows you to select the columns to be used for training the model. This is based on a machine learning concept known as feature selection, which is the process of selecting a subset of relevant features to use to build a model. You can select the features that you think will contribute most to your prediction variable. 

The columns that you select for training have a high impact on the accuracy of a model's prediction. The accuracy is calculated and determined for the binary-class and multi-class classification models. You will learn about these in the next part.

It is a good practice to exclude the columns that are irrelevant or that have low correlations to the target, as they will affect the model's learning by providing unnecessary patterns. You can also exclude columns based on the percentage of missing data in them, since columns with a high number of missing values can alter the accuracy of the model's prediction.

A String type column cannot be used for training a model, as shown in the table earlier. This is because the String type does not contain quantifiable or calculable data.
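
The exclusion guidelines above can be sketched as a simple filter. This is only an illustration of the reasoning, not something AutoML does for you: the function name `select_training_columns`, the column names, and both threshold values are hypothetical, and you would choose the columns yourself in the console.

```python
def select_training_columns(columns, max_missing_pct=30.0, min_correlation=0.1):
    """Return the names of columns worth keeping for training.

    `columns` maps a column name to its attributes. Both thresholds are
    illustrative values, not AutoML defaults.
    """
    selected = []
    for name, attrs in columns.items():
        if attrs["type"] == "String":
            continue  # String columns cannot be used for training
        if attrs["missing_pct"] > max_missing_pct:
            continue  # too much missing data
        if attrs["correlation"] < min_correlation:
            continue  # weak relationship with the target
        selected.append(name)
    return selected

# Hypothetical attribute values for a dataset whose target is the number of flu cases.
columns = {
    "month":      {"type": "Categorical", "missing_pct": 0.0,  "correlation": 0.82},
    "city":       {"type": "Categorical", "missing_pct": 45.0, "correlation": 0.40},
    "patient_id": {"type": "Numerical",   "missing_pct": 0.0,  "correlation": 0.01},
    "notes":      {"type": "String",      "missing_pct": 5.0,  "correlation": None},
}
print(select_training_columns(columns))  # ['month']
```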

Model Types

After you select a target for a model, it is classified into one of the following three types based on the data type of the target column you selected:

  • Regression: If the target column of a model is of the numerical type, then the model is classified as a regression model. This model predicts a numerical value.
  • Binary-Class Classification: If the target column of a model is of the binary-class categorical type, then the model is classified as a binary-class classification model. This model predicts a binary or a Boolean outcome.
  • Multi-Class Classification: If the target column of a model is of the multi-class categorical type, then the model is classified as a multi-class classification model. This model predicts one class from three or more discrete classes.

You can see a model's type in its evaluation report.
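
The mapping from the target column's data type to the model type can be summarized in a few lines of Python. The function name `model_type` and its arguments are hypothetical and only restate the rules listed above.

```python
def model_type(target_type, distinct_values=None):
    """Map the target column's data type to the model type described above."""
    if target_type == "Numerical":
        return "Regression"
    if target_type == "Categorical":
        if distinct_values is None or distinct_values < 2:
            raise ValueError("A categorical target needs at least two distinct values")
        if distinct_values == 2:
            return "Binary-Class Classification"
        return "Multi-Class Classification"
    # String and Date columns cannot be selected as the target.
    raise ValueError(f"{target_type} columns cannot be used as the target")

print(model_type("Numerical"))       # Regression
print(model_type("Categorical", 2))  # Binary-Class Classification
print(model_type("Categorical", 5))  # Multi-Class Classification
```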

Training a Model

AutoML runs machine learning algorithms to identify patterns, draw inferences, and build and train models by using 80% of the dataset that you provide. AutoML then uses the remaining 20% of the dataset to validate the model it has built. This entire process happens while the model training is in progress.
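
The 80/20 division can be pictured with the following minimal sketch. This page does not say how AutoML chooses which records go into each part, so the random shuffle, the fixed seed, and the function name `split_dataset` are assumptions made for illustration.

```python
import random

def split_dataset(rows, train_fraction=0.8, seed=0):
    """Split rows into a training part and a validation part."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # shuffling is an assumption
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))                     # stand-in for 100 dataset records
train, validation = split_dataset(rows)
print(len(train), len(validation))          # 80 20
```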

After a model is trained, AutoML provides various statistics that were calculated during the training process in the model's evaluation report. The information provided differs based on the model type. 

Evaluation Report for Binary-Class and Multi-Class Classification Models

AutoML provides percentage values for the following attributes of a binary-class classification model in the form of a confusion matrix:  

  • True Positive (TP): A true positive is an outcome where the model correctly predicts the positive class.
  • True Negative (TN): A true negative is an outcome where the model correctly predicts the negative class.
  • False Positive (FP): A false positive is an outcome where the model incorrectly predicts the positive class.
  • False Negative (FN): A false negative is an outcome where the model incorrectly predicts the negative class.

The confusion matrix is a 2 x 2 matrix where the columns represent the predicted class and the rows represent the actual class. 

                Predicted False    Predicted True
Actual False    TN                 FP
Actual True     FN                 TP

The positive class and negative class are characteristics of a binary-class classification where each class lies on either side of a boundary. For example, in a case where there are only two possible values for a column, Domestic and International, Domestic is assigned to the positive class when the classifier is looking for "Domestic" positive results. Anything that is not Domestic, which means the values that are International, is assigned to the "Domestic" negative class.

The confusion matrix helps you understand the instances of misclassification, or wrongly assigning a value to a category, that occurred during the model's training. 

Note: AutoML only provides the confusion matrix for binary-class classification models, and not for the multi-class classification models. 
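
To make the four outcomes concrete, the sketch below builds the confusion matrix values as percentages from a list of actual and predicted classes, using the Domestic/International example with Domestic as the positive class. The function name `confusion_matrix` and the sample data are hypothetical; AutoML calculates these values for you during training.

```python
def confusion_matrix(actual, predicted, positive):
    """Return TP, TN, FP, and FN as percentages of all predictions."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct only if the actual class is positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: correct only if the actual class is negative.
            counts["TN" if a != positive else "FN"] += 1
    total = len(actual)
    return {name: 100.0 * count / total for name, count in counts.items()}

actual    = ["Domestic", "Domestic", "International", "International", "Domestic"]
predicted = ["Domestic", "International", "International", "Domestic", "Domestic"]
print(confusion_matrix(actual, predicted, positive="Domestic"))
# {'TP': 40.0, 'TN': 20.0, 'FP': 20.0, 'FN': 20.0}
```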

The following information is provided for both binary-class and multi-class classification models in their evaluation reports:

  1. Accuracy
    The accuracy is the fraction of total predictions made by the model on the test data that were correct, as a percentage value.
     
    Accuracy = Number of correct predictions / Total number of predictions
    For a binary-class classification model, the accuracy can also be calculated as:
     
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    As discussed earlier, you can improve the accuracy of a model's prediction by excluding irrelevant columns or columns with a high amount of missing data during the input feature selection. You can also improve it by ensuring that you provide correct and valid data.
  2. Precision
    The precision is the fraction of total positive predictions made by the model on the test data that were correct.
     
    Precision = TP / (TP+FP)
    The precision indicates how right a model's positive prediction is.
  3. Recall
    The recall is the fraction of the actual positives in the test data (true positives plus false negatives) that the model correctly predicted as positive.
     
    Recall = TP / (TP+FN)
    This is used to select the best model when there is a high cost associated with the false negatives. The recall is also known as the True Positive Rate.
  4. F1 Score
    The F1 score is the harmonic mean of the precision and recall.  
     
    F1 score = 2 x (Precision x Recall) / (Precision + Recall)
    The F1 score is a useful metric if you are looking for a balance between precision and recall.
  5. Log Loss
    The log loss measures the uncertainty of a model's prediction. A small log loss value indicates low uncertainty. Therefore, a high log loss value is not desirable.
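
The formulas above can be checked with a short Python sketch. The function names and the sample counts are hypothetical. The log loss function uses the standard binary cross-entropy definition, which is an assumption here because this page does not give the exact formula AutoML uses.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def log_loss(actual, predicted_prob, eps=1e-15):
    """Standard binary log loss: actual is 0 or 1, predicted_prob is P(positive)."""
    total = 0.0
    for y, p in zip(actual, predicted_prob):
        p = min(max(p, eps), 1 - eps)   # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)

metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)

# Confident correct predictions give a lower log loss than hesitant ones.
print(log_loss([1, 0], [0.9, 0.1]))
print(log_loss([1, 0], [0.6, 0.4]))
```

With these sample counts, the accuracy is 0.85 and the recall is 0.8. The sketch does not guard against a zero denominator, which occurs when the model makes no positive predictions or when the test data contains no positives.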

Evaluation Report for Regression Models

The statistics discussed in the previous part do not apply to regression models. AutoML provides the following statistics in a regression model's evaluation report:

  1. Mean Absolute Error (MAE)
    The Mean Absolute Error is the average absolute difference between the target values and the predicted values. This metric ranges from zero to infinity, where a lower value indicates a higher quality model.
  2. Mean Squared Error (MSE)
    The Mean Squared Error is the average of the squared differences between the target values and the predicted values.
  3. Root Mean Squared Error (RMSE)
    The Root Mean Squared Error is the square root of the mean squared error.
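
The three regression statistics are related, as the following sketch shows. The function name `regression_metrics` and the sample values are hypothetical.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, MSE, and RMSE for paired target and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}

actual    = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(regression_metrics(actual, predicted))
```

For this sample, the errors are -2, 2, -3, and 3, so the MAE is 2.5 and the MSE is 6.5. Because the MSE squares each error, large errors weigh more heavily in the MSE and RMSE than in the MAE.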
 

Benefits

  1. Building Efficient and Reliable Models

    Zia AutoML enables you to easily build production-ready machine learning models with high accuracy and precision, and low error margins. You can customize the characteristics of the dataset you provide and choose the columns for the model's training as required. This ensures that your models are efficient, sustainable, and perform well.
  2. Abstraction of Complexities

    Zia AutoML is particularly useful for non-experts in machine learning. The complex processing behind a model's creation and training, including the algorithm implementation and data pipelines, is completely abstracted. Catalyst handles the entire implementation of AutoML, which allows you to focus on the essentials instead of behind-the-scenes management.
  3. Rapid and End-To-End Processing of Automated Machine Learning

    AutoML covers the entire pipeline from analyzing raw data to building a production-ready machine learning model. It substantially accelerates the time it takes to configure, train, and deploy machine learning models. You can easily build and train a functionally-rich AutoML model within minutes and implement it in your Catalyst application.
  4. Insightful Evaluation Reports

    The evaluation reports provided by AutoML contain insightful and actionable information that differs for each model type based on relevance. You can obtain a clear and perceptive view of your model's strengths and potential from the Catalyst console after the model's training is complete. You can even train a different model with different feature selections for the same dataset easily, based on the evaluation.
  5. Testing Before Implementation

    Catalyst allows you to test the model's performance from the web console or from your local machine's terminal with the click of a button. You can thoroughly test, rebuild, retrain, and customize your model before you implement it in your application.
 

Use Cases

The following are some use cases for Zia AutoML:

  • Recommendation Engines: An e-commerce service uses Zia AutoML to predict and suggest recommendations for products that a user might be interested in. The service constructs an efficient recommendation engine by collecting explicit and implicit data from the user's browsing and purchase history, and using AutoML to analyze and discover patterns in the datasets.
  • Dynamic Pricing: A ride-hailing mobile application uses AutoML to determine the price for a trip dynamically. The AutoML model predicts the right price for a trip, consistent with the incentive given to the driver, customer satisfaction, and profitability, based on various factors such as the time of the day, location, weather, customer demand, cab availability, and more.
  • Sales Forecasting: A pharmaceutical company uses Zia AutoML in a web application designed to be used internally by the company's sales team. The sales analysts use the application to analyze previous sales and revenue data, evaluate sales patterns, and predict trends in their upcoming proposals to formulate sales forecasts and plan strategies. They create and train several models in AutoML, using datasets of various sample sizes in their application.

Some other examples where Zia AutoML can be implemented are:

  • A job portal application for an HR service that predicts the suitability of a candidate for a particular job position based on their educational qualifications and previous work experience
  • An election forecasting application that predicts the election results based on previous election performances, results of opinion polls and surveys, user activities on social media, and more
  • Advertisement personalization on a website based on the users' interests
  • Fraud detection and prevention in banking and finance applications
 

Implementation

This implementation section acts as a step-by-step procedure guide to working with Zia AutoML. As discussed earlier, you can configure and train your model from the Catalyst console. Refer to the SDK and API documentation sections for implementing AutoML in your application.

Access AutoML

To access AutoML in your Catalyst console:

  1. Navigate to Zia Services under Discover, then click Access Now on the AutoML window.
  2. Click Go to console in the AutoML feature page.

    This will open the AutoML feature.
 

Create a Model

The process of creating a model and training it is divided into three stages. We will discuss them in order.

To create and train an AutoML model in Catalyst:

  1. Click Create Model from the AutoML page.
  2. The first step is to upload a dataset. You can import a dataset by either selecting a CSV file from one of your folders in the File Store or uploading a CSV file from your computer.
    You can select a CSV file from the File Store by navigating to the folder it is in and clicking it. 

    You can upload a CSV file from your computer by browsing for it or dragging it to the drop box.

    You must then save the CSV file in one of your folders in the File Store. You can select an existing folder or create a new folder.
    Notes:
    • The File Store selection window only displays CSV files in your folders.
    • There is no size limit for the CSV file that you upload through your computer. The storage limits of the File Store apply to storing the CSV file in the File Store.
  3. Click Save and Next.

The next stage is to fix the target column. 

As discussed earlier, Zia analyzes the dataset you uploaded and determines the data type of each column in the CSV file after it is uploaded. The data types are determined based on the values in the column. For example, if a column has only two distinct values repeated in all the records, Zia determines its data type to be binary-class categorical.
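
The idea of determining a data type from a column's values can be sketched as follows. This is a simplified heuristic for illustration only and not Zia's actual algorithm. The function names, the order of the checks, and the `max_categories` threshold are all assumptions, and only a subset of the supported date formats is checked.

```python
from datetime import datetime

# A subset of the date formats listed in the Key Concepts section.
DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%Y/%m/%d %H:%M:%S",
                "%d-%m-%Y", "%d/%m/%Y", "%d-%B-%Y", "%Y.%m.%d"]

def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def is_date(value):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False

def guess_type(values, max_categories=10):
    """Guess a column's data type from its values (hypothetical heuristic)."""
    present = [v for v in values if v != ""]   # ignore missing values
    if not present:
        return "String"
    distinct = set(present)
    if len(distinct) == 2:
        return "Categorical (binary-class)"    # two distinct values in all records
    if all(is_number(v) for v in present):
        return "Numerical"
    if all(is_date(v) for v in present):
        return "Date"
    if len(distinct) <= max_categories:        # "limited number" is an assumed cutoff
        return "Categorical (multi-class)"
    return "String"

print(guess_type(["Yes", "No", "Yes", "No"]))                       # Categorical (binary-class)
print(guess_type(["1.5", "2", "3.25"]))                             # Numerical
print(guess_type(["2019-02-12", "2008/07/28", "13-January-2012"]))  # Date
```

Because a heuristic of this kind can misjudge a column, the console lets you change the data type that Zia determined, as described in the steps that follow.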

AutoML also calculates and displays the missing percentage and the number of distinct entries for all columns, and the mean and SD for numerical columns. You can hover over the tooltip of a column for a short description of it.

You can also find other information such as the name, total number of columns, and the number of records in the dataset in the page.

  1. Select a target column from the dataset in the dropdown list.

    As mentioned earlier, you can only select a numerical or a categorical column as the target column to be predicted. The dropdown list only displays these columns.
    Note: You can only select a column as a target if its missing percentage is zero. Any column that has incomplete data cannot be selected as the target for prediction.
    Once you have selected a target column, the value of the correlation with the target will be calculated and displayed for every other column, except for the columns of the String type.

    You can change the data type of a column if you think Zia predicted it incorrectly. However, if the data type you choose does not match the values of the column, you will receive an "invalid update" error message. For example, if you select the data type of a column whose values are completely numerical as Date, you will receive the error message.
    Note: You cannot change the data type of a column once you have set it as a target column.
    You can also view the overall stats of the dataset in a pictorial graph by clicking View Overall Stats in the page. 

    You can filter the columns displayed in the page by clicking on Filters.

    You can select the filters for each column characteristic based on the results you require. 

    Select the filters and click Apply Filters to view the results.
  2. After you have configured the target column, click Next.

The final stage is to select the inputs. This page displays the model type and the name of the target column. As mentioned earlier, the model type is based on the data type of the target column that you select.

  1. Enter a name for the model. 
  2. Select the columns to be used for training the model from the dropdown list. All columns in the dataset, other than the String type, are selected for training by default. 
    Note: You can exclude columns from being used for training based on the percentage of missing data in them, their correlation to the target, or other factors. This is because columns with high missing percentages or low correlations to the target can alter the accuracy of a model's prediction.
  3. After you have selected the required columns for training, click Train Model.

The console displays a training progress message while the model training is in progress.

When the training is completed, you will get a notification in your Catalyst console alerting you of the success or failure of the model's training.

You can now view the Evaluation Report and Model Prediction sections of the model. We will discuss these at the end of this section.

The created model is listed in the AutoML page. A unique Model ID is created for the model, which is used to refer to the model when working with the API.

The page also displays details like the name of the dataset that is associated with the model, model type, created time, and status for each model. The status is shown as Completed for the models that have completed the training successfully. You can search for a model by its name using the search bar.


 

Rename a Model

To rename an AutoML model:

  1. Click the ellipsis icon for the model you need to rename from the AutoML page and click Rename.
  2. Enter the new name for the model and press Enter.
 

Delete a Model

To delete an AutoML model:

  1. Click the ellipsis icon for the model you need to delete from the AutoML page and click Delete.
  2. Click Yes, Proceed in the confirmation window.
 

View a Model's Evaluation Report

You will be automatically navigated to a model's Evaluation Report section once the training process is complete. You can also open it by clicking on the model's name from the AutoML page.

As discussed earlier, the information provided in the evaluation report differs for each model type.

The following evaluation report was generated for the multi-class classification model that we created earlier in this section.

You can refer to the Key Concepts section to learn about the accuracy, F1 score, precision, recall, and log loss statistics. The report also displays the columns that were used for the training under Selected Columns.

The Evaluation Report section also provides you a shortcut to train a new model with the same dataset that you used for this model. You can train a new model if you find the accuracy of this model's prediction to be too low for your purposes. You can make changes to the new model, such as including or excluding different columns in the dataset this time or changing the data type of a column while training it.

If you click Train, Catalyst will redirect you to the Fix a Target stage of the model creation. The same dataset will be included again automatically. You can then make the necessary changes and train the new model.

The Evaluation Report section displays a graph named Feature Importance for all three types of model. This displays the importance of each feature or column in the dataset for training this model, in terms of relative percentages. That is, it shows which features AutoML found the most useful while building the model.

The evaluation report of a binary-class classification model includes additional information, such as the confusion matrix with the TP, TN, FP, and FN values, as discussed in the Key Concepts section.

Similarly, the evaluation report of a regression model includes the Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error statistics. 


 

Predict Values

The Model Prediction section provides a test console for you to provide input values and obtain the predictive output. 

The model applies the knowledge it obtained during its training with the dataset to predict the value of the target for a selection of input that you provide. The accuracy level of this prediction is displayed in the Evaluation Report section. The test console will display the columns that were found to be required for building the model as input fields. 

To predict the value of the target of an AutoML model for a set of conditions:

  1. Click the model's name from the AutoML page and click Model Prediction.
  2. Enter input values for the fields in the JSON format.
    Notes:
    • If you enter a value in a format that does not match the data type of the column, such as a numerical value for the date type, the console will display a "cannot parse" error. Ensure that you provide the data in the right format.
    • You must provide the value for at least one valid column while testing the prediction.
    • If you don't enter the value for an input field, a default value will be entered for the column by Zia automatically. However, this will affect the accuracy of the prediction.
  3. Click Predict.

AutoML will display the predicted value of the target column in a pop-up window.

The prediction of a multi-class classification model will contain the possibility of the occurrence of each class in the target, as a percentage value. For example, the model we created earlier predicts the following outcome of the payment type by the customers for the given input data. 

The prediction of a binary-class classification model will contain the possibility of the occurrence of the positive class and the negative class in the target. For example, in a model predicting the possibility of one of the two delivery types, Standard and Express, where the target is an "Express Delivery" positive class, the prediction results estimate that 81% of the time the customers will not use the express delivery, and they will use it only 19% of the time, for a given set of input conditions.

The prediction of a regression model will contain a single numerical result as the value of the target. For example, a model predicts the number of customers of a business for the next year. The Predict Label is the name of the target column, and the Prediction Result column holds the value of the predicted number of customers as 4,017,890 for a given set of input conditions.

AutoML also provides you with the API request template that you can use from your terminal in cURL format. 

You can copy this code using the copy icon and paste it in your terminal to test your model's prediction from your local system. You can also implement this request code in your Catalyst application to enable predictions. 

You must replace the values of the Project ID and Model ID in the request URL, and the value of the Zoho authorization token in the code. You can refer to the API documentation for more information. You must provide the data for all the columns that are requested in the test console as key-value pairs in JSON format in this request API query.
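
If you prefer to issue the request from Python instead of cURL, the following sketch shows how such a request could be assembled with the standard library. Treat every specific here as an assumption: the URL is a placeholder that you must replace with the request URL from the console's template (containing your Project ID and Model ID), the column names are hypothetical, and the `Zoho-oauthtoken` authorization scheme and `Content-Type` header should be checked against the template and the API documentation.

```python
import json
import urllib.request

def build_prediction_request(url, oauth_token, inputs):
    """Build (but do not send) a POST request carrying the input values as JSON."""
    if not inputs:
        raise ValueError("Provide a value for at least one input column")
    return urllib.request.Request(
        url,
        data=json.dumps(inputs).encode("utf-8"),   # key-value pairs in JSON format
        headers={
            "Authorization": f"Zoho-oauthtoken {oauth_token}",  # assumed scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_prediction_request(
    url="https://example.invalid/automl-endpoint",  # placeholder, copy from the console
    oauth_token="YOUR_TOKEN",                       # placeholder
    inputs={"order_value": "250", "delivery_type": "Express"},  # hypothetical columns
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it once the placeholders are replaced:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```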

Note: You cannot create, configure, or train a model using APIs. This can only be done from the Catalyst console.

