Artificial Intelligence ZA Meetup – June 2016

AI Meetup

The second AI ZA meetup was held on 21 June 2016 at Entelect’s headquarters in Melrose Arch, Johannesburg. The event was packed with over 50 attendees eager to learn about artificial intelligence and machine learning.

The goal of this meetup was to continue with a more in-depth classification problem, as well as to run through the introductory concepts for those who were new to the meetup.

The Talk

Rishal Hurbans and Hennie Brink gave a short talk on classification algorithms. The uses, advantages, and disadvantages of classification algorithms were covered.

The first concept covered was how to determine whether a classification algorithm can be used. The diagram below illustrates the decision tree for deciding if a classification algorithm is right for the problem being tackled and the dataset at hand.

Deciding on Classification

Two families of algorithms were explored: the Linear Support Vector Classification (Linear SVC) algorithm, and the Naive Bayes algorithm. These were selected to illustrate the differences between classification algorithms, and where each is most effective.

Linear SVC is suited to supervised learning in high-dimensional spaces where the number of features being analyzed is large; however, it can perform poorly when the number of dimensions is far greater than the number of samples.
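
As a rough illustration (not code from the talk), here is a minimal sketch of training a Linear SVC on a made-up high-dimensional toy dataset with a recent version of scikit-learn:

# Toy example: Linear SVC on a small, high-dimensional synthetic dataset.
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate illustrative data: 200 samples, 50 features, 10 of which are informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LinearSVC()                 # linear support vector classifier
clf.fit(X_train, y_train)         # fit a separating hyperplane to the training data
print(clf.score(X_test, y_test))  # accuracy on the held-out test set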

Naive Bayes works best when each feature is meant to be analyzed independently, without any assumptions of relationships between features. This makes it a good classification algorithm for the problem tackled in the hackathon: text analysis.
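
A minimal sketch of Naive Bayes on a text problem, using scikit-learn's MultinomialNB with a bag-of-words representation and tiny made-up review snippets (purely illustrative, not the hackathon solution):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up examples standing in for review text and sentiment labels.
texts = ["I love this coffee", "great taste, will buy again",
         "I hate this product", "terrible, do not buy"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()  # bag-of-words: each word becomes a feature
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()           # treats word counts as (naively) independent features
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["I love it"])))  # expected: [1]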

The Hackathon

The goal of the hackathon was to analyze the Amazon Fine Food Reviews dataset and determine whether a review score could be assigned based on a person’s textual review. Most of the meetup group teamed up and tried to solve this problem together, whilst some got up and running with the introductory hackathon on password strength.

After two hours of hacking away on the text sentiment problem, some teams got close, but realized that sentiment analysis is difficult and often inaccurate. Many were looking forward to expanding on their code after the meetup to improve the accuracy of their implementations.

By simply querying the dataset, some interesting trends came to light. Many 5-star reviews contain the word “love”, and about a third of those also contain the word “hate”! The dataset also contained quite a number of 1-star reviews with the word “love” or “like” in them. This goes to show that simply matching keywords for textual analysis can be extremely inaccurate.
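
A hypothetical pandas sketch of that kind of keyword query, assuming the dataset’s Reviews.csv file with its Score and Text columns:

import pandas as pd

# Reviews.csv from the Amazon Fine Food Reviews dataset; the "Score" and "Text"
# column names are assumed to match the public release of the data.
reviews = pd.read_csv("Reviews.csv")

five_star = reviews[reviews["Score"] == 5]
loves = five_star[five_star["Text"].str.contains("love", case=False, na=False)]
loves_and_hates = loves[loves["Text"].str.contains("hate", case=False, na=False)]

# How many 5-star "love" reviews also mention "hate".
print(len(loves), len(loves_and_hates))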

The meetup ended with some interesting casual chats about artificial intelligence, philosophy, ethics, virtual reality, and our future as humans.

To tinker with this problem, visit the GitHub repository.

The “Machine Learning for Beginners” documentation and exercises are available on GitHub. If you’re interested in AI, join us on Meetup.

Artificial Intelligence ZA Meetup Kickoff

AI Meetup

Thursday 19 May 2016 marked the kickoff of the first AI hackathon meetup in Johannesburg, South Africa. The meetup is planned to happen monthly with a focus on growing a community around learning and practicing concepts in artificial intelligence.

The evening started with drinks and pizza where everyone got to socialize and network with their peers. Some interesting conversations around AI and innovation sprouted.

The Talk

Rishal Hurbans and Gail Shaw gave a short talk introducing AI concepts and algorithms. The talk was light and useful to anyone without a background in AI.

The first concept that the meetup group will be tackling is machine learning. The Machine Learning for Beginners GitHub repository includes a getting started guide for Python and R. The repository demonstrates a simple example of learning the difference between an apple and an orange based on its weight and texture. It also includes an example of learning to classify Iris flower species – the Iris dataset is a popular dataset for testing and learning in data science.

The two examples serve as a quick getting started guide for beginners in machine learning and for people not familiar with the scikit-learn library for Python. scikit-learn provides a number of built-in classifiers and prediction algorithms, which makes getting up and running simple.
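
For illustration, a minimal scikit-learn sketch in the spirit of the apple/orange example – the toy numbers and the choice of a decision tree classifier are assumptions here, not necessarily the repository’s exact code:

from sklearn import tree

# Toy training data: [weight in grams, texture], where texture 1 = smooth, 0 = bumpy.
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]  # 0 = apple, 1 = orange

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)  # learn the apple/orange boundary from the examples
print(clf.predict([[160, 0]]))   # a heavy, bumpy fruit -> predicted orange (1)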

The Hackathon

The hackathon session included an exercise and dataset where the group was challenged to create an algorithm to classify password strength. The dataset consists of 50,000 randomly generated passwords for use as training data and 25,000 randomly generated passwords for use as testing data. Password strength is measured by detecting the use of uppercase characters, numbers, special characters, and length. More information on the exercise can be found here.

The key aspect of the hackathon was to think about the properties of a password that are most likely to be useful for machine learning, and thereafter to prepare the data so that it can be consumed by a machine learning classifier. Often data, as it exists, is not suitable for machine learning. Removing redundant or unnecessary features and choosing the correct types of features are important parts of the process.
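
As one possible (illustrative) feature representation for this exercise, a short Python sketch:

import re

def password_features(password):
    # Turn a raw password string into numeric features a classifier can consume.
    return [
        len(password),                                    # overall length
        int(bool(re.search(r"[A-Z]", password))),         # contains an uppercase letter
        int(bool(re.search(r"[0-9]", password))),         # contains a digit
        int(bool(re.search(r"[^A-Za-z0-9]", password))),  # contains a special character
    ]

print(password_features("P@ssw0rd"))  # -> [8, 1, 1, 1]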

The group split up into small teams of two or three where they discussed solutions and hacked away at some code. The outcome was interesting as different teams had different approaches to the problem, and conducted various experiments to learn more about the performance of the machine learning classifier.

Some teams chose Boolean features for the occurrence of numbers, special characters, and uppercase characters, whilst other teams counted the occurrences of those features. Some teams represented the output as an accuracy percentage, and other teams visualized the output in good-looking charts. One of the most interesting outcomes was the set of experiments that teams conducted to test the threshold for learning, by reducing the training set and measuring how the classifier’s results changed.
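
A minimal sketch of that kind of experiment, using synthetic stand-in data and a k-nearest neighbours classifier purely for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in the exercise this would be the prepared password features and labels.
X, y = make_classification(n_samples=75000, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=25000, random_state=0)

for size in (100, 1000, 10000, 50000):
    clf = KNeighborsClassifier()
    clf.fit(X_train[:size], y_train[:size])  # train on a reduced subset
    print(size, clf.score(X_test, y_test))   # how accuracy changes with training set size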

The Outcomes

Here’s some insight from a few of the participants:

The approach I used was to extract the data into a list of inputs and targets, then map a feature extraction function to each password to get a list of feature sets that maps nicely onto the target array. The feature extraction function extracted four features from the data:

An integer value representing the length of the password minus 8,

Three bit/Boolean values representing whether the password contains an uppercase letter, a digit, and a special character.

Three learning algorithms were trained using the input dataset, and these classifiers were then used to predict the strengths of the passwords in the testing set. The results of the predictions were compared to the target values, with the error calculated as the difference between the prediction and the target. The accuracy (error = 0) of each method used is as follows:

Stochastic Gradient Descent: 62.552% accuracy (15638 / 25000)

K-nearest neighbours: 99.998% accuracy (24997 / 25000)

Random Forests: 100.000% accuracy (25000 / 25000)
-Kevin Gray
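
A minimal sketch of the approach described above – the four-feature extraction plus the three classifiers – using tiny stand-in data and illustrative strength labels rather than the actual training and testing files:

import re
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

def extract_features(password):
    # The four features described above: (length - 8) plus three Boolean flags.
    return [
        len(password) - 8,
        int(bool(re.search(r"[A-Z]", password))),         # uppercase letter present
        int(bool(re.search(r"[0-9]", password))),         # digit present
        int(bool(re.search(r"[^A-Za-z0-9]", password))),  # special character present
    ]

# Tiny stand-in data with made-up strength labels; in the exercise these come
# from the training and testing files.
train = [("password", 0), ("Passw0rd", 1), ("P@ssw0rd!", 2), ("longpassword123", 1)]
test = [("abc", 0), ("Str0ng#Password", 2)]

X_train = [extract_features(p) for p, _ in train]
y_train = [s for _, s in train]
X_test = [extract_features(p) for p, _ in test]
y_test = [s for _, s in test]

# Train each classifier and count exact matches (error = 0) on the test set.
for clf in (SGDClassifier(), KNeighborsClassifier(n_neighbors=1), RandomForestClassifier()):
    clf.fit(X_train, y_train)
    correct = sum(int(p == t) for p, t in zip(clf.predict(X_test), y_test))
    print(type(clf).__name__, correct, "/", len(y_test))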

We decided on the following stack: NumPy, pandas, re, and obviously sklearn.

Since we needed to generate features, we decided that the apply function of a pandas Series would work out really well: for each Series you can pass a function to apply, and it is applied to each element in the Series.

Given this, we wrote two functions. The first function parses the file by opening it and iterating over each line; each line is split using a regex to strip out the password and the score. The second function, count_chars, counts the number of characters in the password: it takes a password and a set of characters from the string module and returns the count of those characters. We can then apply the function using the syntax below:
training.password.apply(lambda x: count_chars(x, string.ascii_uppercase))
So from the latter function and the len function we created the features below.
upper_case_count
lower_case_count
punc_case_count
password_len

We then used cross_val_score to evaluate how well the model generalizes and got an average score of 99%. Checking this against the test set, we got approximately the same accuracy. We ran some performance checks on the code and found the following:
Parsing either file takes about 5.72 µs
Creating all the features takes about 6.91 µs

-Bradley, Stuart, Kirton
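
A minimal sketch of the approach described in this write-up, with a small stand-in DataFrame in place of the parsed file and a random forest chosen arbitrarily as the model (the write-up doesn’t say which classifier was used):

import string
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def count_chars(password, chars):
    # Count how many characters of the password fall in the given character set.
    return sum(1 for c in password if c in chars)

# Stand-in for the parsed training file, with made-up passwords and scores.
training = pd.DataFrame({
    "password": ["password", "letmein", "Passw0rd", "Summer2020", "P@ssw0rd!", "Tr0ub4dor&3"],
    "score": [0, 0, 1, 1, 2, 2],
})

# Build the four features described above by applying functions over the Series.
features = pd.DataFrame({
    "upper_case_count": training.password.apply(lambda x: count_chars(x, string.ascii_uppercase)),
    "lower_case_count": training.password.apply(lambda x: count_chars(x, string.ascii_lowercase)),
    "punc_case_count": training.password.apply(lambda x: count_chars(x, string.punctuation)),
    "password_len": training.password.apply(len),
})

clf = RandomForestClassifier()
print(cross_val_score(clf, features, training["score"], cv=2).mean())  # average accuracy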

The “Machine Learning for Beginners” documentation and exercises are available on GitHub. If you’re interested in AI, join us on Meetup.