
Beyond Big Data: AI has to Get Small Data Right

Last Modified Date - May 27, 2020

Interest in capturing massive amounts of data has never been stronger than in recent years.

As more companies experiment with artificial intelligence and machine learning, the allure keeps getting stronger.

The reasoning goes that if you are not certain what you will need in the end, the safest approach is to capture everything.

However, having more data is not always good for a company. Large volumes of data are harder to manage, harder to mine for valuable insights, and harder to turn into workable data sets for accomplishing specific tasks.

In the context of AI, this raises questions about the future of big data.

Do we need big data to the degree that some people think?

In most cases, the answer is no. Rather than going big, we ought to be thinking smaller, for the following reasons:

The Case for Small Data

Just as you cannot build a skyscraper without first laying the appropriate foundation, you cannot get big data right without first mastering the art of using small data.

Small data can be viewed as any business data set that fits on a single machine.

It is highly manageable and avoids the high costs (not to mention the regulatory and compliance risks) associated with big data, which takes a lot of work to maintain, manage, and keep clean.

Even in its unstructured form, small data is relatively easy to label.

While big companies like Facebook and Google may be in a position to label their unstructured big data perfectly, not all companies can accomplish the same.

This means that if you cannot find the right techniques for managing and leveraging your small data, all your efforts to “go big” may not bear fruit.

This is likely one of the main reasons we are stuck in the early stages of achieving true enterprise AI: most businesses, if not all, are still trying to identify what to use artificial intelligence (AI) for, let alone which data they do or do not need to get the desired answers.

As such, it helps to identify the cases in which small data is a better fit than big data.

For instance, Rainforest, an on-demand QA platform, uses small data rather than big data to solve some crucial problems.

One of the best examples is the company’s process for vetting software testers.

Rainforest provides human testers through an API, bringing in their expertise at the right time during testing.

To find out which testers it could trust and which ones to de-emphasize to its clients, Rainforest started collecting samples that indicated whether or not a tester followed best practices.

Then a group of Rainforest experts, consisting of product managers and some engineers, labeled those examples.

The few thousand samples collected proved sufficient for the company to start training its machine learning algorithm in the desired manner.

Eventually, this undertaking forced Rainforest to create and reinforce its best practices for applying machine learning in the production process.

In turn, the initiative paved the way for Rainforest to work with bigger datasets more effectively across its business operations.

Small Data or Big Data?

If you are wondering whether small or big data is the ideal tool for your next artificial intelligence or machine learning project, you need to ask yourself the following quick questions:

Do you have the necessary data already, and has it been labeled?

If you have many terabytes of data but it is not labeled, using it could be tricky.

If your big dataset is already labeled, you may as well use it for your upcoming project, although this is a rare and ideal scenario.

If, however, all you have is an idea, look around before you set out to collect big data.

You may already have a usable dataset that could come in handy for the work at hand.

Even if it is not labeled yet, you can fix that by investing some time in labeling in exchange for a quicker path to a solution.

What is your use case and the minimum amount of data required for addressing it?

A word-vector model trained on a huge Google News dataset may work.

However, straightforward linear algebra might give you similar performance, particularly on many real-world tasks.
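As a rough illustration of what such a linear-algebra baseline could look like, the Python sketch below compares texts using plain word counts and cosine similarity, with no pretrained vectors involved. The sample documents, the query, and the function names are invented for illustration, and the sketch makes no claim about how such a baseline would score against a trained word-vector model on any particular task.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a text into a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, documents):
    """Return the document whose word counts best match the query."""
    query_vector = bag_of_words(query)
    return max(
        documents,
        key=lambda doc: cosine_similarity(query_vector, bag_of_words(doc)),
    )


# Hypothetical example texts, made up for this sketch.
documents = [
    "login button does not respond on mobile",
    "checkout page shows wrong total price",
    "password reset email never arrives",
]
best = most_similar("wrong price on checkout", documents)
print(best)
```

A baseline like this takes minutes to build, so it gives you something to compare against before you commit to a larger model or a larger dataset.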

In the technology space, there is a lot of talk about having a “minimum viable product”, and the same thinking applies to data.

To minimize costs and maximize efficiencies, you may want to utilize the minimum amount of data necessary for getting the job done.
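One way to estimate how much data is enough is a learning curve: train the same simple model on increasingly large subsets and watch where accuracy stops improving. The sketch below assumes synthetic two-cluster data and a nearest-centroid classifier, both chosen only for illustration; with real data, the subset sizes and the model would differ.

```python
import random


def make_samples(n, rng):
    """Draw n labeled 2-D points from two synthetic clusters (labels alternate)."""
    samples = []
    for i in range(n):
        label = i % 2
        center = 3.0 * label  # class 0 sits near (0, 0), class 1 near (3, 3)
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        samples.append((point, label))
    return samples


def fit_centroids(samples):
    """Average the points of each class: a nearest-centroid 'model'."""
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}
    for (x, y), label in samples:
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, point):
    """Assign the label of the closest class centroid."""
    def squared_distance(center):
        return (point[0] - center[0]) ** 2 + (point[1] - center[1]) ** 2

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(centroids, point) == label for point, label in samples)
    return hits / len(samples)


def learning_curve(train, test, sizes):
    """Held-out accuracy after training on the first n samples, for each n."""
    return [(n, accuracy(fit_centroids(train[:n]), test)) for n in sizes]


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_samples(1000, rng)
test = make_samples(500, rng)
curve = learning_curve(train, test, [10, 50, 200, 1000])
for n, acc in curve:
    print(f"{n:>5} labeled samples -> accuracy {acc:.3f}")
```

If accuracy is already flat between two sizes, collecting more data of the same kind is unlikely to help that model much, which is a signal that the smaller set may be enough for the job.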

How advanced is your company as far as artificial intelligence/machine learning is concerned?

It is important to build your company’s capabilities one step at a time rather than moving straight to the most challenging problems, however exciting they may seem.

What’s more, if your company is new to machine learning, solving some basic problems with small data is probably the ideal place to start.

With a few successes under your belt, you can then consider scaling up from there.

The Use of Small Data

Even though big data is not going anywhere anytime soon, it is not the ideal path for solving every machine learning problem.

As with great software, a good artificial intelligence (AI) or machine learning algorithm ought to be all about accomplishing more with the minimum amount of data.

Big data covers data sets that are too large or too complex for conventional data-processing software to handle.

These massive data sets generally have to be analyzed computationally to reveal associations, trends, and patterns, particularly those relating to human interactions and behavior.

Small data, on the other hand, encompasses data that is small enough for humans to understand.