Opinions and some observations on using public data to train AI.
- johnhauxwell
- Jan 13
Updated: Feb 4
A few questions immediately came to mind.
Can we count on AI to give us accurate news and results? The short answer is "probably not". If the question is not essential, does getting the correct answer really matter? If the question is mission or business critical, then AI is not the answer! We all believe someone has to be let into the equation to catch "hallucinations", or miscalculations as I like to describe them. I personally think that human control over the entire AI process, from validating the data to verifying the final output, is vital, especially as Apple's AI produced some pretty wacky news stories this last week (https://www.bbc.co.uk/news/articles/cx27zwp7jpxo). That being said, let's think about the following...

Web-scraped Data, sometimes Source Data
A lot of AI models are trained on public data that has been web-scraped. DataMarts sell this data in bundles, which may have been collected without consent, mixed with poor synthetic data, or scraped without the owners' knowledge or approval! Data of this kind has a dubious provenance, and thus questionable veracity, which should raise red flags. At AIdentity we are rigorous in establishing provenance/data lineage to ensure data quality and veracity. We do this through a robust data governance model that makes sure the data you are working with is of the right quality for your project.
Privacy Regulations and Considerations
Data protection authorities like the UK's Information Commissioner's Office (ICO) are shaping the AI regulatory ecosystem. Privacy is one of our GoPES themes: Governance, Privacy, Ethics and Security.
Transparency
Organisations must keep a clear record of how they train AI models on people's data. This is emphasised in the ICO statement by Stephen Almond, Executive Director Regulatory Risk at the ICO (see Notes), and it is something we investigate in our first data scan and in greater detail elsewhere across the GoPES model and program.
Security Precautions
Organisations need to secure or remove personal and non-consensual information in advance of model training. We help organisations establish and use security measures as part of the GoPES process.
User consent
Users need an easy-to-understand method to deny the processing of their data. This is part of GDPR and is closely tied to trust. See https://www.aidentity.uk/post/trust-consent-and-control-a-data-management-white-paper for more info.
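As a rough illustration of the two points above (removing personal information and honouring a user's objection before training), here is a minimal Python sketch. It is not part of GoPES or any AIdentity tooling. The `Record` class, the `objected` flag and the `prepare_for_training` function are hypothetical names made up for this example, and the email pattern is deliberately simplistic. Real personal data detection needs far more than one regular expression.

```python
import re
from dataclasses import dataclass

# Deliberately simple pattern (an assumption for illustration only);
# it will miss many kinds of personal data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class Record:
    user_id: str
    text: str
    objected: bool  # hypothetical flag: True if the user objected to processing


def prepare_for_training(records):
    """Drop records from users who objected, then mask email addresses."""
    cleaned = []
    for record in records:
        if record.objected:
            continue  # honour the objection: exclude the record entirely
        cleaned.append(EMAIL_RE.sub("[EMAIL]", record.text))
    return cleaned


records = [
    Record("u1", "Contact me at jane@example.com", objected=False),
    Record("u2", "Please do not use my posts", objected=True),
]
print(prepare_for_training(records))
```

The point of the sketch is the ordering: the objection check and the redaction both happen before any text reaches a training set, which is also where a human reviewer could sample the output.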
Notes
Stephen Almond, Executive Director Regulatory Risk at the ICO, said:
“We have been clear that any organisation using its users’ information to train generative AI models needs to be transparent about how people’s data is being used.
Organisations should put effective safeguards in place before they start using personal data for model training, including providing a clear and simple route for users to object to the processing.”
Source: https://ico.org.uk/about-the-ico/media-centre/news-and-blogs/2024/09/ico-statement-in-response-to-metas-announcement-on-user-data-to-train-ai/
GoPES
A simple, use-case-driven solution that allows ALL organisations to apply governance, privacy, ethics and security (in line with enterprise best practice), and indicates some technology choices for a solution that provides the correct level of adherence to legislation. This framework is available on a consultative basis from AIdentity.
So how do we ensure the data used in our LLMs meets the required standards?
The diagram below should give some indication.

Contact details: john@AIdentity.uk, or book a quick discussion around data-related topics at https://calendly.com/john-aidentity



