When you ask ChatGPT about something of African origin, it often gives a scant, less nuanced answer; sometimes the answer is made up entirely. This is partly because AI models are trained on very little data from Africa.
To address this data deficit, African countries must curate their datasets and make them accessible online, according to panelists at Moonshot by TechCabal on Wednesday.
While most Africans rely on large language models (LLMs) built by global organisations, those LLMs are trained on little African data; only 2% of the world's healthcare data, for instance, comes from Africa. This is partly attributable to the lack of documentation for some cultures and languages.
Uploading African data online comes with its own challenges. Some African languages lack written documentation, and uploading datasets can be expensive for Africans grappling with poverty and a cost-of-living crisis. The biggest challenge, perhaps, is Africa's widespread digital literacy gap. Bayo Adekanmbi, founder of Data Science Nigeria, proposes workarounds, including using voice-to-text tools to document data.
Some African startups, like Intron Health, a Nigerian AI company, are already leveraging this approach. Intron Health allows doctors to enter medical records by converting their speech into text.
To collect voice data, startups across Africa are employing agents to gather audio recordings. However, capturing voice data in African contexts presents unique challenges, as many Africans mix pidgin or languages like Yoruba into their speech. To accommodate this, Adekanmbi suggests that startups account for code-switching in their AI models.
To document African languages and culture at scale, Lavina Ramkisson, who sits on the AI board of the GSMA, believes global partnerships in infrastructure and skills are needed. Olumide Okubadejo, Head of Product at Sabi, agrees that public-private partnerships are a great way to improve data collection and drive AI adoption on the continent.