Processing Unstructured Data With Deep Learning

Analyzing data means using the information to make sense of the world. It could be something as complex as predicting the weather or as simple as finding the average weight of a certain population. Academia and businesses alike rely on data to make accurate predictions and understand the nature of their respective areas of expertise.

Data comes in many shapes and sizes. It can be qualitative or quantitative, it can follow a pattern or have no clear order, and it can be digitized or manually created. Every type of data poses a new problem for data scientists and analysts, who have to figure out the best way to gather and clean it before it is analyzed.

Structured vs. unstructured data

Imagine that we have two scientists we’ll call Ellen and Charlie. Both are brilliant researchers in their respective fields. Ellen is organized: she keeps track of her experiments and records her data in spreadsheets. Charlie sometimes forgets to write down his results right away, so he jots them on the first thing he finds.

Now, imagine if we were to ask for the database of each scientist. Ellen would send a spreadsheet file over the internet, while Charlie would literally show up one day with a folder filled with pieces of paper with different values and dates written all over the place.

Ellen’s database is an example of structured data: data that has a model, or that follows a clear, discernible pattern. Charlie’s is an example of unstructured data. Assuming that he didn’t make any mistakes, his data is just as valuable as Ellen’s, but it’s more difficult to organize and process.

That example might cast a bad light on Charlie, but the truth is that his behavior is the norm. According to Computer World Magazine, between 70% and 80% of all available data is unstructured. For the most part, that is an annoyance, but not an insurmountable one.

Data can be tidied up, and unstructured data can be organized and turned into structured data given enough time and resources. But what if you don’t have either? Or what if the nature of the data is such that it can’t be easily turned into structured data?
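To make the tidying-up step concrete, here is a minimal sketch in Python of turning notes like Charlie’s into structured rows. The notes, the field names, and the patterns are all made up for illustration, and the approach assumes every note contains an ISO-style date, a sample letter, and a weight in grams. Real notes would be messier, which is exactly why this hand-written approach stops scaling.

```python
import re
from datetime import date

# Hypothetical free-text notes, in the spirit of Charlie's scraps of paper.
notes = [
    "2021-03-04: sample A weighed 12.5 g",
    "Measured sample B on 2021-03-05, got 9.8 g",
    "sample C 2021-03-07 -> 11.25 g (after drying)",
]

# One hand-written pattern per field we want to recover.
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
SAMPLE_RE = re.compile(r"sample\s+([A-Z])", re.IGNORECASE)
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*g\b")


def parse_note(note):
    """Return a structured record, or None if any field is missing."""
    found_date = DATE_RE.search(note)
    found_sample = SAMPLE_RE.search(note)
    found_weight = WEIGHT_RE.search(note)
    if not (found_date and found_sample and found_weight):
        return None
    year, month, day = (int(part) for part in found_date.groups())
    return {
        "sample": found_sample.group(1).upper(),
        "date": date(year, month, day),
        "weight_g": float(found_weight.group(1)),
    }


# Keep only the notes that could be parsed into a full row.
records = [row for row in map(parse_note, notes) if row is not None]
for row in records:
    print(row["sample"], row["date"].isoformat(), row["weight_g"])
```

Every new way of writing a note needs a new rule, and a note that matches none of the rules is silently dropped. That cost is what the rest of this article is about.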

Data analysis with the help of AI

The growth of AI technology over the last couple of decades has opened the floodgates to new and better strategies for dealing with data. Thanks to machine learning and intelligent assistants, we can gather, clean, and process enormous amounts of data in record time.

For those who don’t know what machine learning is, it can be defined as a set of algorithms that parse data, learn from that data, and then use what they have learned to make informed decisions and predictions.

For example, streaming services use machine learning to find correlations in consumption habits so they can build a recommendation pool for people with similar tastes. Another example is how web stores will try to guess what you are likely to buy based on your purchasing and browsing history.

Learning in this context means that the algorithm generally gets better the more data it processes. Think of it as a tool that gets sharper and more efficient the more you use it.

Machine learning models range from simple ones, such as those based on linear regression, to complex ones, like the deep models designed to deal with the difficult problems that arise from unstructured data.
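For a sense of what the simple end of that range looks like, here is a small sketch of a linear regression fitted with ordinary least squares in plain Python. The height and weight figures are invented for illustration and were chosen to lie exactly on a line, so the fit is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative data: heights in cm and weights in kg.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 64, 70, 76]

slope, intercept = fit_line(heights_cm, weights_kg)


def predict(height_cm):
    """Use the fitted line to guess a weight for an unseen height."""
    return slope * height_cm + intercept


print(f"slope={slope:.2f}, intercept={intercept:.2f}, predict(175)={predict(175):.1f}")
```

The whole model is two numbers learned from the data. That works well when the data is already in neat columns, and it is precisely what unstructured data does not give us.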

Enter deep learning

Deep learning is a subfield of machine learning that builds models out of many layers of artificial neurons in an attempt to emulate human decision-making capabilities. Dozens of architectures have been used for everything from social media filtering to image and voice recognition.

Let’s go back to our scientist Charlie to better understand this. Imagine that just before he handed over his database he spilled some soda over the folder. Now some of the results are blurry and it’s very difficult to tell the numbers apart. You, as a human, are capable of making an educated guess of what a number is supposed to be, but a simple algorithm will probably have a lot of problems figuring out the values.

That’s where deep learning comes in. Instead of a single algorithm, we have many simple computing units arranged in multiple layers, forming a neural network. During training, the network produces an answer, checks how far off that answer was, and then automatically adjusts itself so it can make better guesses.
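The loop of reaching an answer, checking it, and adjusting can be sketched with a toy two-layer network in plain Python. Everything here is an assumption made for illustration: the task (learning the logical OR of two inputs), the layer sizes, the starting weights, and the learning rate. Real deep learning models have far more layers and parameters and are built with dedicated libraries, but the cycle is the same.

```python
import math

# Toy training set (made up for illustration): the logical OR of two inputs.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]


def forward(params, x):
    """Layer 1: two tanh units. Layer 2: one sigmoid unit that reads them."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j]) for j in range(2)]
    z = w2[0] * hidden[0] + w2[1] * hidden[1] + b2
    return hidden, 1.0 / (1.0 + math.exp(-z))


def loss(params):
    """Mean squared error between the network's guesses and the right answers."""
    return sum((forward(params, x)[1] - y) ** 2 for x, y in DATA) / len(DATA)


def train_step(params, lr=0.5):
    """One round of guess, check, adjust (full-batch gradient descent)."""
    w1, b1, w2, b2 = params
    g_w1 = [[0.0, 0.0], [0.0, 0.0]]
    g_b1 = [0.0, 0.0]
    g_w2 = [0.0, 0.0]
    g_b2 = 0.0
    for x, y in DATA:
        hidden, out = forward(params, x)  # guess
        # Check: how the error changes with the output unit's input.
        d_out = 2.0 * (out - y) * out * (1.0 - out) / len(DATA)
        g_b2 += d_out
        for j in range(2):
            g_w2[j] += d_out * hidden[j]
            # Pass the blame back to hidden unit j (chain rule through tanh).
            d_hid = d_out * w2[j] * (1.0 - hidden[j] ** 2)
            g_b1[j] += d_hid
            g_w1[j][0] += d_hid * x[0]
            g_w1[j][1] += d_hid * x[1]
    # Adjust: nudge every weight a little against its gradient.
    new_w1 = [[w1[j][i] - lr * g_w1[j][i] for i in range(2)] for j in range(2)]
    new_b1 = [b1[j] - lr * g_b1[j] for j in range(2)]
    new_w2 = [w2[j] - lr * g_w2[j] for j in range(2)]
    return new_w1, new_b1, new_w2, b2 - lr * g_b2


# Arbitrary fixed starting weights, so the run is repeatable.
params = ([[0.5, -0.3], [0.2, 0.4]], [0.0, 0.0], [0.3, -0.2], 0.0)
loss_before = loss(params)
for _ in range(2000):
    params = train_step(params)
loss_after = loss(params)
print(f"error before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Nobody tells the network what OR means. It only sees examples and its own error, and the repeated small adjustments are what the word "learning" refers to.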

The best way we can illustrate how deep learning works is to talk about Go, the ancient strategy board game that originated in China. Google has spent a considerable amount of resources on building an AI capable of playing Go, and it proved a lot more difficult than building AIs that play chess (even though AlphaGo turned out to be a major success).

Why? Because Go is a game with an astronomical number of possible positions and move sequences, so the amount of processing power it would take to compute every possibility is simply out of reach. Instead, any AI that tries to learn how to play Go needs to be able to make educated guesses based on the current state of the board. Much like a human who can’t see all possibilities at once, the machine relies on heuristics to make decisions.
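A quick back-of-the-envelope calculation in Python shows the scale of the problem. It counts every way to fill a standard 19x19 board with empty points, black stones, and white stones. That is only a rough upper bound, since not every arrangement is a legal position and it ignores the order of moves entirely. The evaluation speed of one billion boards per second is a made-up figure for illustration.

```python
BOARD_POINTS = 19 * 19      # 361 intersections on a standard Go board
STATES_PER_POINT = 3        # each point is empty, black, or white

# Upper bound on board configurations (legal or not).
upper_bound = STATES_PER_POINT ** BOARD_POINTS
digits = len(str(upper_bound))

# Hypothetical machine that checks one billion boards per second.
BOARDS_PER_SECOND = 10 ** 9
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
years_needed = upper_bound // (BOARDS_PER_SECOND * SECONDS_PER_YEAR)

print(f"Board configurations: a number with {digits} digits")
print(f"Years to visit them all: a number with {len(str(years_needed))} digits")
```

Faster hardware does not rescue brute force here, because even a machine a trillion times quicker would only shave twelve digits off the number of years.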

Working with unstructured data

So, the biggest hurdle in working with unstructured data is that it may have an opaque pattern or no pattern at all. Natural language is a perfect example. One could write the same idea in hundreds of different ways, and humans would have no problem figuring out the semantics of each phrasing. Machines, on the other hand, require complex architectures and models to do the same.

A well-known example is Tesseract, an open-source OCR engine that, since version 4, uses a deep learning model (an LSTM neural network) to recognize text in images. As simple as that may sound, it can actually be a really complex task.

For example, some images might be in low resolution, or maybe the text is out of focus, or maybe it’s part of a video and only appears for a couple of frames. In other words, every image is a world unto itself, and instead of creating an algorithm for each corner case, we use deep learning to work with all the images at once, saving time and effort.

We live in an unstructured world, and human beings rarely think in terms of structured data. As such, deep learning has become one of the most important AI tools in the tech industry. To be fair, it’s still in its infancy, but it’s just a matter of time before Siri and similar applications become something more than glorified search engines.