The importance of data audits when building AI



Artificial intelligence can do a lot to improve business practices, but AI algorithms can also introduce new sources of risk. For example, consider the recent shutdown of Zillow Offers, the company’s arm dedicated to buying homes to flip, after its prediction models significantly overshot actual home values. When housing-price data changed unpredictably, the group’s machine learning models did not adapt quickly enough to account for the volatility, leading to significant losses. This type of data mismatch, or “concept drift,” happens when data audits don’t get the attention and respect they deserve.
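Concept drift of the kind described above can often be caught early by monitoring prediction error over time. Below is a minimal illustrative sketch, not taken from any production system: the function names, window size, and threshold are all assumptions chosen for clarity. It flags drift when the recent average error runs well above the long-run baseline:

```python
from collections import deque

def make_drift_monitor(window=100, threshold=1.5):
    """Return a function that records absolute prediction errors and flags
    drift when the recent window's mean error exceeds `threshold` times the
    long-run baseline. Both parameters are illustrative, not recommendations."""
    recent = deque(maxlen=window)
    history_sum, history_n = 0.0, 0

    def record(predicted, actual):
        nonlocal history_sum, history_n
        error = abs(predicted - actual)
        recent.append(error)
        history_sum += error
        history_n += 1
        baseline = history_sum / history_n          # long-run mean error
        recent_mean = sum(recent) / len(recent)     # mean error in the window
        # Drift: recent errors are much worse than the historical average,
        # and we have seen at least one full window of data.
        return history_n >= window and recent_mean > threshold * baseline

    return record
```

In a real pricing system the same idea would run against live model outputs and realized sale prices, triggering retraining or human review when the flag fires.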

Zillow’s inability to properly audit its data has not only hurt the company; it could do further damage by scaring other companies away from AI. Negative perceptions of a technology can stunt its progress in the commercial world, especially for a category like AI that has already gone through several winters. Machine learning pioneers like Andrew Ng recognize what is at stake and have launched campaigns to highlight the importance of data audits, for example by holding an annual competition for the best data quality assurance methods (instead of picking winners based on model performance alone, as is traditionally done).

Beyond my own work building AI, as host of The Robot Brains Podcast I have interviewed dozens of AI practitioners and researchers about their approaches to auditing and maintaining high-quality data. Here are some of the best practices I’ve compiled from this work:

  • Beware of outsourcing your data curation and labeling. Data maintenance isn’t the sexiest task, and it takes a lot of time. When time is of the essence, as it is for most entrepreneurs, it’s tempting to outsource the responsibility. But beware of the risks that come with it. A third-party vendor won’t know your product vision as intimately, understand the contextual nuances, or have the personal motivation to keep the tight rein that is necessary. Andrej Karpathy, head of AI at Tesla, says he spends 50% of his time maintaining vehicle data because it’s that important.
  • If your data is incomplete, fill in the gaps. All is not lost if your data sources reveal gaps or potential areas of misprediction. Demographic data is an often problematic source: as we know, historical demographic datasets tend to over-represent white males, which can skew your entire model. Olga Russakovsky, professor at Princeton and co-founder of AI4ALL, created the REVISE tool, which highlights patterns of (possibly spurious) correlations in visual data. You can use those patterns to enforce insensitivity to them, or decide to collect more data that does not exhibit them. (The code to run the tool is available if you want to use it.) Demographic data is the example cited most often in this type of situation (e.g., medical history data traditionally contains a higher percentage of information about Caucasian males), but the approach can be applied in any scenario.
  • Understand the implications of sacrificing intelligence for speed. Your data audit can motivate you to integrate larger datasets with more comprehensive coverage. In theory this sounds like a great strategy, but in practice it may not match the business objective at hand. The larger the dataset, the slower the analysis. Is that extra time justified by the value of the added insight?

    Financial services companies must ask themselves this question quite often, given the huge sums involved and industry technology getting ever faster (think nanoseconds). Mike Schuster, head of AI at financial services firm Two Sigma, shared that it’s important to keep in mind that a more accurate model, driven by more data, can often lead to longer inference times during deployment, perhaps not meeting your need for speed. Conversely, if you’re making longer-term decisions, you’ll be competing with others in the market who incorporate much larger amounts of data, so you’ll need to do the same to stay competitive.
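To make the gap-filling practice above concrete, a data audit can start with something as simple as measuring how each demographic group is represented in a dataset. This is a hypothetical sketch, not the tool mentioned above; the function name and the 10% floor are illustrative assumptions:

```python
from collections import Counter

def find_underrepresented(records, group_key, min_share=0.1):
    """Return {group: share} for groups whose share of the dataset falls
    below `min_share`. `records` is a list of dicts and `group_key` names
    the demographic field. The 10% default is arbitrary, for illustration."""
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items() if n / total < min_share}
```

Groups surfaced this way are candidates either for targeted data collection or for explicitly testing the model's behavior on that slice before deployment.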
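The speed-versus-accuracy tradeoff Schuster describes can also be checked empirically before deployment by timing a candidate model against a latency budget. A minimal sketch, with the helper names and budget as assumptions of my own:

```python
import time

def mean_latency_ms(predict, inputs, repeats=3):
    """Average wall-clock time per prediction, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            predict(x)
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / (repeats * len(inputs))

def meets_budget(predict, inputs, budget_ms):
    """True if the model's average inference latency fits the budget."""
    return mean_latency_ms(predict, inputs) <= budget_ms
```

If the more accurate model blows the budget, that is a signal to revisit the audit's conclusion that more data (and a bigger model) is worth it for this use case.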

The application of AI models to solve business problems is becoming mainstream as the open source community makes them freely available to everyone. The downside is that as AI-generated insights and predictions become the status quo, the less flashy work of data maintenance can be overlooked. It’s like building a house on sand. It may seem fine at first, but over time the structure will crumble.

Professor Pieter Abbeel is director of the Berkeley Robot Learning Lab and co-director of the Berkeley Artificial Intelligence Research Lab (BAIR). He founded three companies: Covariant (AI for intelligent warehouse and factory automation), Gradescope (AI to help teachers grade assignments and exams), and Berkeley Open Arms (low-cost, 7-degrees-of-freedom robot arms). He also hosts The Robot Brains Podcast.


