You've heard how practically useful classification is for learning about your users, and you are keen to try it for yourself. However, you've been unable to find any realistic user classification datasets: such data is very difficult to find in published sources.
You're familiar with Python and Apache Spark. You may have done some machine learning using Spark.ML. You have done parts of an end-to-end machine learning task, but there are other parts that you have yet to learn.
You are good at some aspects of machine learning but less confident at others. You would like to work through a user classification task from beginning to end, starting from raw transactional user log data.
You will use a realistic user classification dataset that closely emulates the type of data one might encounter in a production setting. You will create a machine learning feature set from raw user log data and use it to build a classification model. You won't stop there: you will improve this model using hyperparameter tuning and data selection.
Spark combines the power of distributed computing with the ease of use of Python and SQL. Level up today.
What will you learn in this course?
What are the learning objectives?
The learner will complete two machine learning projects from beginning to end.
The learner will know how to improve the accuracy of a seemingly marginally useful model, and will perform a pre-deployment analysis. A machine learning task does not end when the model is trained: calculating overall model accuracy is not always sufficient. Assessing accuracy under different constraints on the input data is important to a successful deployment, because a model can be more accurate in some regions of the input space than in others.
Here are some of the specific concepts that are covered:
What technologies, packages, and functions will students use?
All libraries used in this course are contained in the pyspark package, so we will not need to install anything extra. All modules are either part of the standard Python library or imported from pyspark.
What terms and jargon will be defined?
Here is a list of technical terms, jargon, and acronyms that will be used in the course:
Feature engineering, machine learning, model fitting, classification, logistic regression, logistic classification, power law, Extract Transform and Select (ETS), pipeline, cross-validation, hyperparameters, feature sets, vectorizer, vocabulary, area under the curve (AUC), data selection, grid search, automated model tuning, DataFrame, Spark.ML, StructType, StructField, CountVectorizer.
What concepts will be taught?
Active versus casual is a common way to characterize users. Active users can exhibit very different behavioral characteristics from casual users. How can you ascertain this from the data?
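As a minimal sketch of the idea, consider aggregating a raw transactional log into per-user event counts and splitting users on an activity threshold. The log records, the threshold value, and the event names here are all invented for illustration:

```python
from collections import Counter

# Hypothetical raw log: (user_id, event) pairs, as one might extract
# from a transactional usage log before loading it into Spark.
log = [
    ("u1", "play"), ("u1", "play"), ("u1", "like"), ("u1", "play"),
    ("u2", "play"),
    ("u3", "play"), ("u3", "play"),
]

# Count events per user; users at or above an (assumed) threshold
# are labeled "active", the rest "casual".
events_per_user = Counter(user for user, _ in log)
ACTIVE_THRESHOLD = 3  # illustrative cutoff, not a recommendation

labels = {user: ("active" if n >= ACTIVE_THRESHOLD else "casual")
          for user, n in events_per_user.items()}

print(labels)  # {'u1': 'active', 'u2': 'casual', 'u3': 'casual'}
```

In practice the threshold itself would be chosen from the data, for example by inspecting the distribution of per-user event counts.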
Sanity checks and sensitivity analysis are pro tips familiar to every seasoned machine learning engineer. How can you quickly determine whether your model is unbiased? How can you quickly determine how your model will perform on various segments of your user base?
Prediction accuracy vs. coverage. Students are usually taught to calculate prediction accuracy by averaging over the entire sample space. Segmenting the space can surface opportunities to improve prediction accuracy. It also lets you determine a model's effective coverage: the portion of novel data for which the prediction accuracy exceeds a desired threshold. A model might do extremely well on certain data, yet be no better than a random guess on other data.
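The accuracy-versus-coverage idea can be sketched with plain Python. The segment names, results, and threshold below are all toy values for illustration:

```python
# Toy evaluation results for a hypothetical model:
# (segment, was_the_prediction_correct) pairs.
results = [
    ("heavy_users", True), ("heavy_users", True), ("heavy_users", True),
    ("heavy_users", False),
    ("light_users", True), ("light_users", False),
    ("light_users", False), ("light_users", False),
]

def accuracy(rows):
    return sum(ok for _, ok in rows) / len(rows)

# Overall accuracy averages away the difference between segments.
overall = accuracy(results)  # 0.5

# Group results by segment and score each segment separately.
by_segment = {}
for seg, ok in results:
    by_segment.setdefault(seg, []).append((seg, ok))
per_segment = {seg: accuracy(rows) for seg, rows in by_segment.items()}
# per_segment == {'heavy_users': 0.75, 'light_users': 0.25}

# Coverage: the fraction of data falling in segments whose accuracy
# exceeds a desired threshold (0.7 here, an assumed value).
THRESHOLD = 0.7
covered = sum(len(rows) for seg, rows in by_segment.items()
              if per_segment[seg] > THRESHOLD)
coverage = covered / len(results)  # 0.5
```

Here the model looks mediocre overall (50% accuracy) yet is usable on half the data: it covers the heavy-user segment at 75% accuracy while performing no better than chance elsewhere.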
When presented with a large amount of raw data belonging to two different classes, an analyst might initially believe there is no discernible statistical difference between them. However, if you know how to look more carefully, you can identify real differences, and determine whether those effects are strong enough to automate a classification task using machine learning.
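One simple way to "look more carefully" is to compare per-class summary statistics of a candidate feature. This sketch uses invented numbers and a basic standardized effect size (mean gap in pooled-standard-deviation units); it is an illustration of the idea, not the course's specific method:

```python
import statistics

# Invented feature values (e.g., daily event counts) for two classes
# that look similar when the raw records are eyeballed.
class_a = [3, 4, 5, 4, 3, 5, 4]
class_b = [5, 6, 5, 7, 6, 5, 6]

# Per-class means surface a gap the raw data hides.
mu_a = statistics.mean(class_a)
mu_b = statistics.mean(class_b)

# Standardize the gap by the pooled standard deviation to judge
# whether the effect is strong enough to be worth modeling.
sd = statistics.pstdev(class_a + class_b)
effect = abs(mu_a - mu_b) / sd
```

A standardized gap well above 1 (as here) suggests the feature carries enough signal for a classifier; a gap near 0 suggests it does not.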
What pro tips are taught?
What datasets will be used?
Two datasets are used. One is for guessing user demographic class membership from usage log data. The second is a text corpus.
The usage log dataset is special. It has been carefully created to emulate real demographic data of the kind commonly found in production settings. Such data is rarely if ever published: one reason is to protect proprietary secrets, another is to avoid privacy leaks.
This dataset looks and behaves very much like real data, which lets you practice techniques that transfer directly to proprietary data you might encounter in a production setting.
The data will be transformed several times for the various stages of the task. So, although this part of the course is built on a single dataset, the data can seem like a completely different dataset at various stages along the way.
Although the dataset closely emulates the statistical qualities of a real-world demographic dataset, rather than use real-world labels we instead use a comical hypothetical scenario. Make no mistake, though: the data captures very realistic qualities of data encountered in a production setting.
This approach has some benefits. Firstly, it prevents us from being biased by unwarranted assumptions. You may encounter a dataset and make assumptions about it based on your own experience. Instead, you should be data-driven.
Using this hypothetical scenario also emphasizes the generality of the technique. The same approach can be used to guess gender, political affiliation, retiree vs teenager, homeowner vs renter, cancerous vs healthy, or hotdog vs non-hotdog, just to name a few.
Additionally, this allows us to have a bit of levity along the way.
The primary dataset labels its two classes as "rabbit" and "duck". At first they look alike, but once the learner knows what to look for, they are easy to tell apart.
The text corpus dataset is a standard dataset commonly used to teach or demonstrate text processing. In our case, we are going to use it to demonstrate sequence prediction. The text corpus is large enough that it poses realistic constraints on our algorithms. It is small enough that training can still be done quickly. In a production situation, the actual text corpus may be orders of magnitude larger. However, the underlying concepts taught here are applicable.
We use the text corpus as a stand-in for user session data: song identifiers, topic IDs, hashtags, URLs, or any other type of identifier consumed in a sequential manner. The task is to predict the last item in a sequence.
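The data preparation for this task can be sketched in a few lines: each session is split into a context (everything but the last item) and a target (the last item). The session contents below are invented identifiers for illustration:

```python
# Hypothetical user sessions: ordered lists of consumed item IDs
# (song IDs, hashtags, URLs, ...). With the text corpus, each
# "session" is simply a sentence of word tokens.
sessions = [
    ["song_a", "song_b", "song_c"],
    ["song_b", "song_c", "song_d", "song_a"],
]

def to_example(session):
    """Split one session into a (context, target) training pair."""
    return session[:-1], session[-1]

pairs = [to_example(s) for s in sessions]
# pairs == [(['song_a', 'song_b'], 'song_c'),
#           (['song_b', 'song_c', 'song_d'], 'song_a')]
```

The resulting (context, target) pairs are what a sequence-prediction model trains on, whether the items are opaque identifiers or readable words.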
We could have instead created a dataset that closely mimics the statistical characteristics of a production session log dataset. However, using a text corpus has some advantages. Working with text makes this more intuitive. It is easier to understand what is going on when working with sequences of tokens that correspond to sentences of words rather than opaque identifiers.