What is Spark?

Spark is a platform for running computations or tasks on large amounts of data (we are talking petabytes of data) – these tasks can range from map-reduce-style jobs to streaming and machine learning applications. The real power of Spark comes from the extensive APIs and supported languages (Python, Java, Scala, and R) that developers can use to create and manage data-based workflows, plus the fact that it integrates with pretty much any data store that you’d want to use.

Why is Spark used today? Companies like Databricks have poured tons of money into the technology, keeping the project alive and up to date. It is also extremely fast – at its core, Spark does a great job of distributing computations across multiple nodes, which in turn compute and cache whatever operation is being requested, with a driver communicating requests and results. It is often said to be faster and easier to use than competing tools built on Hadoop’s MapReduce for big data management.
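The distribute-compute-combine pattern described above can be sketched in plain Python, with no cluster involved. The input lines and variable names below are illustrative assumptions; with PySpark installed, the same shape would be something like `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)` on an RDD.

```python
from collections import Counter

# Plain-Python sketch of the map-reduce computation Spark distributes.
# On a real cluster, `lines` would be partitions of a huge file spread
# across worker nodes, not a tiny in-memory list.
lines = ["spark distributes work", "spark caches results"]

# "Map" phase: emit a (word, 1) pair for every word. Spark would run this
# step independently and in parallel on each partition.
pairs = [(word, 1) for line in lines for word in line.split()]

# "Shuffle + reduce" phase: pairs sharing a key are merged. Spark would
# first route same-key pairs to the same node, then combine them there.
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts["spark"])  # → 2
```

The driver mentioned above plays the coordinating role: it ships the map and reduce steps to the workers and collects the final counts back.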

[Image: Spark architecture diagram]

However, there are reportedly many issues with Spark as well. For instance, it can take lots of debugging to get configurations right so that memory errors are not encountered, and not all of the supported languages are updated in unison. Overall, things seem to work well when going through tutorials and typical use cases, but large-scale usage can be finicky.

GAN – Generative Adversarial Networks

TLDR; GANs are cool! If you have a problem where you need to use training data to generate new data, as well as verify if some input is similar to the training data, they may prove useful.

GANs are a type of machine learning technique that uses two neural networks to generate and verify data. These two networks are:

  1. A generative network, which takes random noise and generates a new piece of data.
  2. A discriminative network, which takes that generated data (along with real data from the training set) and determines whether a given sample is likely to be from the training set.

By iterating through a process of generating data and verifying data, over time we get two networks that do useful things – one is trained to mimic the training set data, and the other can be used to classify data as being “real” or “fake”. These models have a tremendous number of uses in the wild. In the case of the classic task of digit recognition, one network is trained to generate digits, and the other is trained to classify whether some image is a digit (and possibly which digit it is).
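The iterative generate-and-verify loop just described can be sketched with one-parameter stand-ins for the two networks. Everything here is an illustrative assumption, not a real GAN: the “generator” is a single mean `theta`, the “discriminator” is a single decision `boundary`, the “training set” is draws near 1.0, and the nudge rules stand in for gradient descent.

```python
import random

random.seed(42)

# Toy stand-ins for the two networks: the generator is a mean parameter
# `theta`; the discriminator is a decision `boundary` (above it = "looks
# real"). Real GANs use neural networks and gradients for both roles.
REAL_MEAN = 1.0
theta = -1.0      # generator starts out producing obviously fake samples
boundary = 0.0    # discriminator starts with an arbitrary decision point

for step in range(2000):
    real = random.gauss(REAL_MEAN, 0.1)   # a sample from the training set
    fake = random.gauss(theta, 0.1)       # the generator's current output
    # Discriminator step: move the boundary between real and fake samples.
    boundary += 0.01 * ((real + fake) / 2 - boundary)
    # Generator step: when the discriminator rejects the fake (it falls
    # below the boundary), nudge theta toward samples that would pass.
    if fake <= boundary:
        theta += 0.02 * (boundary - theta)

print(f"generator mean after training: {theta:.2f}")  # close to 1.0
```

At the end, the boundary sits on top of the real mean, so this toy discriminator can no longer separate real from fake – mirroring the equilibrium a real GAN is trained toward.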


What’s interesting is that this technique can be applied to so many different neural network architectures, since it is essentially a wrapper around two separate networks. For instance, we might use CNNs to generate faces and discriminate between real and fake ones, or LSTMs to do the same for human-written text.
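The “wrapper” point can be made concrete: the GAN recipe only needs a generator and a discriminator as opaque callables, so any architecture plugs in. The `train_gan` function and the two lambda stubs below are illustrative stand-ins, not a real training routine.

```python
import random

random.seed(0)

def train_gan(generate, discriminate, steps):
    """Run `steps` rounds of the alternating game and record each round.

    The callables are treated as black boxes, which is why a CNN for
    images or an LSTM for text could be swapped in without changing
    this loop. (In a real GAN, both networks would update each round.)
    """
    history = []
    for _ in range(steps):
        fake = generate()
        score = discriminate(fake)   # discriminator's "realness" verdict
        history.append((fake, score))
    return history

# Any pair of callables works: here a noise "generator" and a threshold
# "discriminator" stand in for actual networks.
history = train_gan(generate=lambda: random.random(),
                    discriminate=lambda x: float(x > 0.5),
                    steps=10)
print(len(history))  # → 10
```

Because the loop never looks inside the callables, swapping architectures is a matter of passing different functions, which is the flexibility the paragraph above describes.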

What are some regulatory implications of this kind of work? GANs have been used to generate fake faces and human text, for instance, which may be wielded by harmful bots or scammers.

Want to dive deeper? Check out some example code and more details in this post.