Data Gurus


Data Quality, Part 3 of 4 with Vedant Misra of OpenAI | Ep. 87

Vedant Misra

On today’s episode, Charlie Allieri, CEO of Imperium, and the sponsor of this series on data quality, and Vedant Misra, an engineer who is managing the reasoning algorithm team at OpenAI, join Sima to deliberate on AI, looking at its evolution, breaking down some of the terminology, delving into the conceptual themes around AI and how it might impact or help improve data quality. 

OpenAI is a research laboratory based in San Francisco, California, whose mission is to ensure that artificial general intelligence (AGI)—meaning highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity. OpenAI attempts to directly build safe and beneficial AGI, but will also consider their mission fulfilled if their work aids others to achieve this outcome.

OpenAI is governed by the board of OpenAI Nonprofit, which consists of both OpenAI LP employees and non-employees. Their investors include Microsoft, Reid Hoffman’s charitable foundation, and Khosla Ventures.

Safety in Artificial Intelligence

The dystopian, apocalyptic killer-robot scenario is a bit far-fetched in the realm of AI safety. The only way this situation would occur is if we built a system so amazing that we trusted it implicitly for a long, long time. Once engineered into our society, it could rise up against us because it decides that is the right thing for it to do. So although it’s possible that we could end up in that kind of world, it seems like a concern that is much further off than the immediate safety concerns.

More immediate safety concerns involve the ways that a variety of agents are using artificial intelligence, even today. One example is the application of machine vision and perception systems to surveillance, which is of utmost concern from a civil rights perspective. State actors and governments are already using this technology in a broad and expansive way.

There are also concerns about the disruption of the labor market, which is a safety issue of a different kind: making sure that our society can adapt to this technology and that the way we deploy it does not undermine what we all think should be happening. Massive unemployment, where people could not support their families, would not be desirable.

Another critical consideration in AI safety is the way we train the systems. They ultimately learn from data, which means the information that goes in is numbers, and what comes out is numbers. What they learn is ultimately represented in the space of math, and the math these systems have learned does not always map onto the values and behaviors that we want them to learn. It’s important to think about how we can get the math to align more closely with what we actually want these systems to learn.

AI’s Evolution

Neural networks have been some of the most successful phenomena of the last decade or so.

This field kicked off in the mid-1950s at the Dartmouth conference, which focused on understanding artificial intelligence. It turned into a very long brainstorming session where some of the biggest thought leaders in the field made major and influential contributions. They laid the groundwork for much of the theory and the advances that came later.

Since the 1950s, excitement about artificial intelligence has surged and waned repeatedly, and we are now in a third resurgence. Whenever limitations in the technology were discovered, interest historically waned until further progress was made.

Vedant covers the 1960s through the 1980s, when serious advances were made, but it wasn’t until the 1990s that people started applying neural networks to real problems, such as image recognition, and, in the early 2000s, speech.

The past 10-15 years have seen a major resurgence of this technique, and of deep learning in particular. Neural networks can work very well if you put them in the right setting, have enough data, and give them enough compute.

There are three components to the recipe of AI: algorithms, compute, and data.

Compute has scaled enormously in the last 50-60 years. For most of that time, the compute used in each new result doubled roughly every two years. More recently, that doubling time has shrunk from two years to 3.4 months.
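To get a feel for what that change in doubling time means, the short Python sketch below compares how much compute grows in one year under each rate. It assumes smooth exponential growth with a fixed doubling time, and the function name growth_factor is just an illustrative choice, not something from the episode.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Return how many times larger a quantity becomes after elapsed_months,
    assuming it doubles every doubling_months (idealized exponential growth)."""
    return 2.0 ** (elapsed_months / doubling_months)


# Compare one year of growth under the two doubling times mentioned above.
for label, doubling in [("two-year doubling", 24.0), ("3.4-month doubling", 3.4)]:
    factor = growth_factor(12.0, doubling)
    print(f"{label}: compute grows about {factor:.1f}x in one year")
```

Under these assumptions, a two-year doubling time gives roughly 1.4x growth per year, while a 3.4-month doubling time gives more than 11x growth per year.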

Recognizing Good Quality Data

One of the recent big releases from OpenAI is GPT-2, a model that they trained on a vast amount of web-sourced data with the goal of predicting the next word that it was going to see.
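To make the idea of "predicting the next word" concrete, here is a deliberately tiny Python sketch. It is not how the model described above works: that model is a large neural network, while this toy simply counts which word most often follows another in a made-up corpus. The function names and the sample text are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word in the text, which words follow it and how often."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of word, or None if word was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text; a real system would learn from vastly more data.
corpus = "the data is clean . the data is clean . the model is large ."
model = train_bigram(corpus)
print(predict_next(model, "data"))
```

Even in this toy, the predictions are only as good as the text that goes in, which is the same dependence on input data that the episode goes on to discuss.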

The technique was simple. The implementation was complex, but the key was not an algorithmic change. It was mostly about making sure that the data that went into the system was extremely high quality and reflected what they wanted the model to learn.

The result was a model that completely blew out of the water anything that people had thought was possible, in that it was capable of generating realistic, human-quality long-form text.

This happened much sooner than anyone thought possible.





Sima is passionate about data and loves to share, learn and help others that share that passion. If you love data as much as her, subscribe on iTunes and don’t forget to leave a rating and review!

This 4 part series is sponsored by Imperium. We are grateful for their support to bring you this important series.