DefinedCrowd: Helping companies mine and structure highly accurate data for AI applications

The Portuguese startup’s quality-controlling smart data platform is driving its exponential growth and major partnerships with the likes of IBM’s Watson Studio and Amazon

2018 has been a bumper year for Portuguese startup DefinedCrowd. In its third year of existence, it raised US$11.8 million in Series A funding from a group of major backers; launched its software as a service (SaaS) platform; opened its fourth office – its first in Asia – and embarked on two major collaborations, one with Amazon’s Alexa Skills and, more recently, one with IBM’s Watson Studio.

The extent to which it has stood out from most other artificial intelligence (AI) startups, and been rewarded for doing so, is evident from its recent product integration with IBM’s Watson Studio.

The partnership with Watson Studio means that training data for two of the most in-demand workflows, namely image tagging and text sentiment analysis, is systematically sourced, structured and enriched in the same place, with users able to establish customizable data workflows via Watson Studio's user interface.

“Product integrations like this one with IBM are massive for expanding our product capabilities – always my number one priority,” the startup’s founder and CEO Daniela Braga wrote in a blog post. “It’s crucial that these sorts of datasets are sourced and delivered with precision and accuracy. Earning a tech giant like IBM’s trust in handling such critical workflows is a real testament to our work so far.”

Customers turn investors

This collaboration came six months after DefinedCrowd became one of four official Alexa Skills partners, helping develop and test Alexa voice software applications, and shortly after the startup secured Series A funding led by VC Evolution Equity Partners together with Kibo Ventures, Mastercard and Portuguese electric utility Energias de Portugal. Earlier seed-round investors, such as the Amazon Alexa Fund and Sony, also upped their investment in the US$11.8 million round.

“DefinedCrowd’s SaaS platform has very quickly positioned the company as an innovative leader to solve AI/ML’s global most pressing problem: the need for continuous access to highly accurate data,” said Evolution Equity Partners Founder and Managing Partner Dennis Smith.

Braga was inspired to start DefinedCrowd in 2015 after facing this problem for 15 years, during which time she became one of the most sought-after data scientists in the field of AI and natural language processing (NLP).

An arts graduate in linguistics, she had come across a job posting by the University of Porto’s engineering department in 2000 for a linguist to help create the world’s first Portuguese text-to-speech system. As it turned out, she was the only person who applied for the role and the rest, as they say, is history. Braga went on to acquire a decade’s worth of experience developing NLP technologies for the likes of Microsoft and Voicebox Technologies.

Her experience in the field gave her a profound understanding of the "bad data" problem that often besets scientists like herself – the dagger in the heart of up to 40% of business initiatives. According to the Harvard Business Review, data cleansing and preparation account for the lion’s share of production costs when undertaking data analytics projects. If data is not mined to suitable standards in the first instance, scientists have to spend a disproportionate amount of time sorting low-quality data from usable material, preventing them from putting their resources and skills to better use – namely doing the jobs they were actually hired to do.

Born of bad data

“If you use bad data, then you get garbage out of the [AI] model. We are solving these pain points, which were my own pain points when I started my career as a data scientist,” said Braga.

With this in mind, she founded DefinedCrowd in August 2015 with her friend, Amy Du (who has since left the company to start a new venture), combining NLP, voice recognition and computational imagery to enhance the deep-learning capabilities of AI systems.

“As natural user interfaces continue to evolve, we believe that voice will become the primary input and response mechanism for all human-computer interactions,” said Du.

Even though the market appears to be crowded with voice-recognition technologies, Braga is adamant that an effective model goes beyond merely recognizing speech patterns. “There’s contextual information, accents, background noise, then you need to detect intent,” she explained. “There’s many different ways of saying, ‘Find me a Starbucks’.”

It seems a herculean task, but the solution might be far simpler than we think. The secret, Braga believes, has less to do with creating a flawlessly automated super-computer and more to do with rerouting a human touch back into AI.

DefinedCrowd routinely solicits voice recordings from native speakers of some 46 languages, allowing its AI language program to constantly expand on the depth and breadth of its deep learning model. The startup has an online platform called Neevo that pays people from all over the world to complete a variety of different language tasks remotely.

Currently, more than 45,000 people contribute via Neevo, and more than half a million pieces of data are processed daily. This is a major unique selling proposition for DefinedCrowd, as sector-specific, customized datasets in multiple languages are difficult to create.

The best data, then, is good data that is continuously refined. It is attuned not only to changes that occur within a single company or industry but also to changes in the world around us – from the addition of a new subway line to the latest celebrity gossip.

“Crowdsourcing brings the necessary human judgment to the machine-learning techniques used by speech technology,” said Braga. “To train an acoustic model, you need at least 1,000 speakers speaking for one hour each. Those speakers need to be balanced in gender, age, and region. You can train a system to recognize different dialects and sociolects… There is still no way to replace humans in this type of variation.”

Tailor-made, multilingual

DefinedCrowd’s other formidable selling point is that, rather than formulating a one-off SaaS product that clients then have to build their businesses around, it builds bespoke data systems at enterprise scale.

“Often [clients] have some kind of AI on the modeling side, but they really need the data side to enhance that experience,” said Braga. The vernacular required of an in-flight entertainment system, for instance, would differ vastly from that of an app used to develop educational resources for classrooms. Thanks to its Neevo contributors, DefinedCrowd can augment existing datasets with new specialized data that helps tune models for specific applications.

The program’s backend is kept streamlined and user-friendly, but clients subscribing to the SaaS can rest assured that they are getting the most out of their investment with specialized tools tailor-made for their line of work. From this year, clients can search for and select appropriate datasets for their applications via a user interface (UI) as well as application programming interfaces (APIs).

The war against bad data might soon be over, but DefinedCrowd’s steady ascent is only just beginning. While the demand for data scientists is steadily increasing, recruiting them is still no easy task 18 years after Braga first entered the sector. It takes an average of two months to fill a job opening, making prospective employees among the most well-rounded and coveted in the world.

“Think of big data as an epic wave gathering now, starting to crest,” foretold the Harvard Business Review back in 2012. “If you want to catch it, you need people who can surf.”