Businesses lose millions when data is left unused. How can we change that?

02 October 2020 - 5 mins reading time

Data is the most valuable asset for many companies today. But there is a fundamental problem: value is lost when data goes unused because there are no tools to analyse or work with it.

Between 60% and 73% of all data within an enterprise goes unused for analytics. (source: Forrester)

Consider this example. We have the word “banana”, a picture of a banana and the word “banan” in another language. We are asked to take these three objects and compare them. This is the challenge a business faces when it has a varied set of rich data.

We would start off by considering how to compare them. Two are text-based and one is image-based. We could try comparing the letters in the two words, but that wouldn’t be very helpful: how would we then compare the letters with the pixels in an image? Worse, if the other language uses a different alphabet, we can’t even compare the two words letter by letter. Instead, we could compare something like the total size in bytes of each item, but that tells us nothing about what the items mean.
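To make the byte-size idea concrete, here is a minimal sketch. The three items are hypothetical stand-ins (the "image" is just 2048 placeholder bytes, not a real picture), so it only illustrates that sizes are easy to compare yet say nothing about meaning.

```python
# Hypothetical stand-ins for the three items in the example.
word_en = "banana"
word_other = "banan"
image_bytes = bytes(2048)  # placeholder for a small banana photo, not real image data


def size_in_bytes(item):
    """Return the size of a str (encoded as UTF-8) or of a bytes object."""
    if isinstance(item, str):
        return len(item.encode("utf-8"))
    return len(item)


sizes = {
    "word_en": size_in_bytes(word_en),
    "word_other": size_in_bytes(word_other),
    "image": size_in_bytes(image_bytes),
}
print(sizes)
```

The numbers are comparable, but the image looks hundreds of times "bigger" than the words even though all three describe the same fruit.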

What if instead we could transform all three items of data into a common format? And in this common format, we would expect them to be the same as they all represent the same thing - a banana.

This would be perfect, but now we have two more problems. One, how do we design a universal format? Two, how do we convert our data into this format?

Universal format

Creating a universal format requires us to be able to transform any piece of data into it. A requirement of this format is that it’s standard and uniform. Each element represented in this format should be comparable to another. This suggests we should do it numerically, as numbers can be compared directly.

So let's represent banana as "1". The image, “banana” and “banan” would now all be 1 in their universal format and we’d see they are the same. However, what happens if we now have an image of an apple? In our universal format, what number would we give it? 2? Is an apple 1 unit away from a banana? We’ve now created a linear scale where we can place our data.

This doesn’t seem right though. As we fit more and more fruits onto our linear scale, representing the relationship between each one becomes a challenge. For example, a strawberry should sit close to other berries but far from some other fruit that should itself sit close to berries, and a single line cannot satisfy both. Therefore, our one-dimensional space isn’t adequate to contain all the information. What if we added more dimensions? It would be possible to have items close in certain dimensions but far away in others, so that their relationships are adequately represented. Bingo.
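A small sketch of the difference, using made-up numbers. The one-dimensional positions, the two dimension names (berry-likeness and sweetness) and all coordinate values are assumptions chosen purely for illustration.

```python
import math

# Hypothetical one-dimensional scale: each fruit gets a single number.
scale_1d = {"banana": 1, "apple": 2, "strawberry": 3, "raspberry": 4}

# Hypothetical two-dimensional coordinates: (berry-likeness, sweetness).
# The values are invented for this example only.
space_2d = {
    "banana": (0.1, 0.9),
    "apple": (0.1, 0.5),
    "strawberry": (0.9, 0.8),
    "raspberry": (0.9, 0.5),
}


def distance_1d(a, b):
    """Distance between two fruits on the single linear scale."""
    return abs(scale_1d[a] - scale_1d[b])


def distance_2d(a, b):
    """Euclidean distance between two fruits in the two-dimensional space."""
    return math.dist(space_2d[a], space_2d[b])


# On the line, the strawberry is equally far from the apple and the raspberry.
print(distance_1d("apple", "strawberry"), distance_1d("strawberry", "raspberry"))

# With a second dimension, the strawberry can sit near the raspberry
# while staying far from the apple.
print(distance_2d("apple", "strawberry"), distance_2d("strawberry", "raspberry"))
```

With these assumed coordinates the strawberry is also close to the banana on sweetness while being far from it on berry-likeness, which is exactly the "close in some dimensions, far in others" behaviour a single scale cannot express.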

Our universal format is a multi-dimensional space where each item has a coordinate. “Banana”, “banan” and the image of a banana should share a coordinate. This seems like it could work. Okay, onto the next problem.

Converting data

Representing data in our universal format requires us to understand something about the data, and we get to pick what that something is. Deep learning is the process of training a neural network (a series of layers) to optimise a cost computed from its inputs. Each layer of the network is made up of nodes. Each node has weights that are applied to the numbers it receives, which results in a numerical output.

So what if we took the layer before the output? If it has 5 nodes then we’d expect to end up with 5 numbers for every item of input. That means if we treated each number as a dimension, we’d have a 5-dimensional representation of the input. And there we go: we’ve represented our data in a universal format, and because any data we put through goes through the same steps to reach those 5 numbers, we can consider them as sitting in the same space.
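The idea of stopping one layer before the output can be sketched in a few lines of plain Python. This is an illustration under loud assumptions: the network is tiny, its weights are random and untrained, and the names `hidden_layer`, `predict` and `embed` are hypothetical, not part of any particular library.

```python
import random

random.seed(0)  # fixed seed so the made-up weights are reproducible

INPUT_SIZE, HIDDEN_SIZE = 3, 5

# Untrained, randomly initialised weights: a stand-in for a trained model.
hidden_weights = [
    [random.uniform(-1, 1) for _ in range(INPUT_SIZE)] for _ in range(HIDDEN_SIZE)
]
output_weights = [random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]


def relu(x):
    """A common activation function: negative values become zero."""
    return max(0.0, x)


def hidden_layer(inputs):
    """Apply each node's weights to the inputs, giving one number per node."""
    return [relu(sum(w * x for w, x in zip(node, inputs))) for node in hidden_weights]


def predict(inputs):
    """Full forward pass: the hidden layer followed by a single output node."""
    return sum(w * h for w, h in zip(output_weights, hidden_layer(inputs)))


def embed(inputs):
    """Stop one layer early and return the 5 hidden numbers as a vector."""
    return hidden_layer(inputs)


vector = embed([0.2, 0.7, 0.1])
print(len(vector), vector)
```

Every input passes through the same weights, so every resulting vector has the same 5 dimensions and can be compared with any other. With a real, trained model the same trick applies, only the layers are larger.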

The validity of this space depends on the quality of the model. A model that is trained well to predict the outcome you desire should be working with features of the object that are meaningful to the task. We assume our model does a good job of this.

Extracting value

Now we can take any piece of data, put it through a model and extract a multi-dimensional representation or... a vector.

As these vectors are numerical and exist in the same space we can compare them to understand more about the data. This forms the basis of vector search (nearest neighbour search). This is not the only approach for analysis though. Clustering is another method that highlights outliers, similarities and groups.
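As a rough sketch of nearest neighbour search, the snippet below compares a handful of vectors with cosine similarity and returns the closest ones. The vectors and their labels are invented for illustration, and the search is a simple brute-force scan rather than the indexed or approximate methods a production system would likely use.

```python
import math

# Hypothetical 5-dimensional vectors; in practice these would come from a model.
vectors = {
    "text: banana": [0.9, 0.1, 0.8, 0.0, 0.2],
    "image: banana": [0.8, 0.2, 0.9, 0.1, 0.2],
    "text: apple": [0.1, 0.9, 0.4, 0.0, 0.3],
    "image: car": [0.0, 0.1, 0.0, 0.9, 0.8],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(query_key, k=2):
    """Brute-force search: score every other vector and keep the top k."""
    query = vectors[query_key]
    scored = [
        (cosine_similarity(query, vec), key)
        for key, vec in vectors.items()
        if key != query_key
    ]
    return [key for _, key in sorted(scored, reverse=True)[:k]]


print(nearest_neighbours("text: banana"))
```

With these assumed values, the image of a banana comes back as the closest match to the word “banana”, ahead of the apple and well ahead of the car, even though one item is text and the other is an image.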

As a result, business decisions can be based on additional metrics and analysis of previously intangible data. Adopting new methods of processing data is critical and vectors are on the bleeding edge of what's possible. At Vector AI we make working with them a joy - request early access today.