Sources of Bias in Data

This is a guide to the sources of bias in data.

Now that we have looked at the types of data bias in machine learning, we can talk through the sources. Although there is no way to make an environment completely free of bias, it is important to be able to identify and reduce the bias found in any model or dataset.

 

Bias can come from the humans responsible for creating the model or generating the data, from data that lacks enough data points to represent a situation accurately, or from the way a model builds on what users feed into it.

 

Some of the most common sources of data bias are: subconscious human prejudices and assumptions; a lack of data points, which produces outputs that misrepresent the situation; and feedback loops from the ways users interact with a model, which perpetuate bias.

 

We will look at each of these in more detail, as well as their impact. Data bias can occur at various stages of the AI process. During data collection, bias may come from distorted survey questions, incomplete data collection, or favoring certain data sources, all of which can produce incomplete or distorted datasets that lead AI models to make inaccurate predictions. Historical bias originates from existing prejudices and inequalities captured in historical records or datasets.

 

Sampling bias occurs when samples are selected in a way that does not accurately represent the broader population, which can lead to models that struggle to generalize to real-world scenarios, particularly for underrepresented groups.
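
To make the sampling issue concrete, here is a minimal sketch using synthetic data; the group labels, population proportions, and the 5% reach of the "convenience" channel are all assumptions chosen for illustration, not real figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: 70% group A, 30% group B (assumed proportions).
population = rng.choice(["A", "B"], size=100_000, p=[0.7, 0.3])

# Representative sample: every member has an equal chance of selection.
random_sample = rng.choice(population, size=1_000)

# Convenience sample: drawn through a channel that group B rarely uses
# (assumed 5% relative reach), so group B is under-selected.
weights = np.where(population == "A", 1.0, 0.05)
weights /= weights.sum()
convenience_sample = rng.choice(population, size=1_000, p=weights)

for name, sample in [("random", random_sample), ("convenience", convenience_sample)]:
    share_b = np.mean(sample == "B")
    print(f"{name:12s} sample: group B share = {share_b:.1%}")
# The convenience sample under-represents group B, so a model trained on it
# sees far fewer group-B examples than exist in the real population.
```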

 

Bias can occur during data aggregation when data is combined without accounting for subgroup variations; this can obscure disparities among subgroups and cause models to overlook specific patterns or needs within the data.
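
A small sketch of the aggregation problem: the groups, sizes, and accuracy figures below are invented purely to show how an aggregate metric can hide a subgroup disparity.

```python
import pandas as pd

# Invented evaluation results: group A dominates, so the aggregate mostly reflects it.
results = pd.DataFrame({
    "group":   ["A"] * 900 + ["B"] * 100,
    "correct": [1] * 855 + [0] * 45      # group A: 95% of predictions correct
             + [1] * 60  + [0] * 40,     # group B: 60% of predictions correct
})

print("Aggregate accuracy:", results["correct"].mean())   # 0.915
print(results.groupby("group")["correct"].mean())         # A: 0.95, B: 0.60
# The single aggregate number looks healthy while the smaller subgroup is
# served far worse -- exactly the disparity that aggregation can obscure.
```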

 

Bias can also arise during data labeling, when labels are subjective or culturally influenced; this can result in inaccurate predictions or classifications because the labels reflect subjective judgements. Data preprocessing bias can come from decisions such as how missing values or outliers are handled, and biased choices here can introduce artifacts into the data that affect the performance and fairness of models, as the sketch below illustrates. When datasets that contain bias are applied to AI and machine learning models, the biases can have a large impact on the ability to make ethical decisions based on the data.
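
As an illustration of the preprocessing point, the sketch below invents a dataset in which one field is missing far more often for one group; the field name and missingness rates are assumptions, not a real pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000

# Synthetic data: "income" is missing 2% of the time for group A but 40% of
# the time for group B (assumed rates, chosen to make the effect visible).
group = rng.choice(["A", "B"], size=n, p=[0.8, 0.2])
missing_rate = np.where(group == "A", 0.02, 0.40)
income = np.where(rng.random(n) < missing_rate, np.nan, rng.normal(50_000, 10_000, n))
df = pd.DataFrame({"group": group, "income": income})

print("Before dropna:", df["group"].value_counts(normalize=True).round(3).to_dict())
print("After  dropna:", df.dropna()["group"].value_counts(normalize=True).round(3).to_dict())
# Simply dropping rows with missing values shrinks group B's share of the data,
# so the "cleaned" dataset no longer reflects the original population.
```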

 

One issue is that models trained on data chosen or gathered by humans, or on historical data about human activities, can encode insensitive associations or biases. Another issue is that user-generated data can lead to biased feedback loops. Machine learning algorithms can also draw conclusions that are culturally offensive or insensitive. In short, models trained on data created or gathered by humans can inherit the cultural and social biases of the people behind that data.

 

For example, data drawn from historical sources or past news articles may produce outputs that contain racial or social language biases, and those outputs in turn have negative impacts on the decisions being made. An algorithm used to support hiring, for instance, might learn to favor applicants who use language commonly associated with men. Data generated by users can also produce feedback loops that are rooted in cultural biases: the more often users search certain keywords together, the more the results pair those words, whether or not a given user searched for them together.
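
The feedback-loop idea can be sketched with a toy simulation. This is not how any real search or recommendation engine works; the starting counts and the 90/10 exposure split are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two results start with nearly identical popularity (assumed counts).
clicks = np.array([51.0, 49.0])

for day in range(50):
    top = int(np.argmax(clicks))                         # rank by past clicks
    show_prob = np.where(np.arange(2) == top, 0.9, 0.1)  # top result gets 90% of exposure (assumption)
    shown = rng.choice(2, size=100, p=show_prob)         # what users are shown today
    clicks += np.bincount(shown, minlength=2)            # what is shown is what gets clicked

print("Final click share:", (clicks / clicks.sum()).round(2))
# A tiny initial difference snowballs, because the model's own output
# (what it chooses to show) becomes the data it later learns from.
```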

 

When machine learning algorithms make statistical connections, they might produce outcomes that are unlawful or inappropriate. For example, a model looking at loans across a range of ages could identify that a particular age group is more likely to default, but that information could not be used to make lending decisions without breaching discrimination laws. When using AI and machine learning models, there are ethical concerns to consider.
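
One practical response is to audit a model's decisions before acting on them. The sketch below computes approval rates per age band and a simple worst-to-best ratio; the column names and numbers are invented, and the ratio is only an illustrative screening heuristic, not a legal standard.

```python
import pandas as pd

def approval_rates_by_group(decisions: pd.DataFrame, group_col: str) -> pd.Series:
    """Share of applications approved within each group."""
    return decisions.groupby(group_col)["approved"].mean()

# Hypothetical model decisions (all numbers made up for illustration).
decisions = pd.DataFrame({
    "age_band": ["18-25"] * 200 + ["26-60"] * 600 + ["60+"] * 200,
    "approved": [1] * 90  + [0] * 110    # 18-25: 45% approved
              + [1] * 420 + [0] * 180    # 26-60: 70% approved
              + [1] * 100 + [0] * 100,   # 60+:   50% approved
})

rates = approval_rates_by_group(decisions, "age_band")
print(rates)
print("Worst-to-best approval ratio:", round(rates.min() / rates.max(), 2))
# A low ratio is a signal to investigate whether a protected attribute such as
# age is driving decisions, before the model is ever used in production.
```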

 

There are, unfortunately, many cases of bias in AI, and they have large negative impacts on people. This occurs in numerous industries, with people being discriminated against in varying ways.

 

Our first example of AI bias is Amazon's hiring algorithm. Because of its success with automation elsewhere in the company, Amazon pushed to automate parts of the hiring process. The program's objective was to sort and score resumes and then recommend the highest-scoring candidates to hiring managers and other HR stakeholders.

 

Using AI and machine learning, resumes could be analyzed for the right terms and given preference over resumes that lacked them, or at least ranked higher than resumes containing more basic terms associated with lower-level skills. The models, however, were trained on data dominated by men's resumes and male-associated terms.

 

So, in turn, that is what the model preferred. Building on that data, the model began to devalue resumes that included the word "women" or lacked male references. The algorithm scored highly qualified women lower than their equally qualified male counterparts, which meant hiring managers were being given rankings that discriminated against women.
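
A lightweight way to surface this kind of behaviour is a counterfactual test: score the same resume with and without a gendered phrase and compare. Both the `toy_score` function and the phrase list below are hypothetical stand-ins for whatever model is being audited, not Amazon's actual system.

```python
# Phrases to neutralize for the counterfactual comparison (illustrative only).
GENDERED_SWAPS = {
    "women's chess club captain": "chess club captain",
    "women's college": "college",
}

def audit_gendered_terms(resume_text: str, score_fn) -> None:
    """Print the score change when a gendered phrase is replaced by a neutral one."""
    base = score_fn(resume_text)
    for original, neutral in GENDERED_SWAPS.items():
        if original in resume_text:
            counterfactual = score_fn(resume_text.replace(original, neutral))
            print(f"{original!r}: {base:.2f} -> {counterfactual:.2f} "
                  f"(delta {counterfactual - base:+.2f})")

def toy_score(text: str) -> float:
    """Deliberately biased toy scorer that penalizes the word 'women'."""
    score = 0.7
    if "women" in text.lower():
        score -= 0.2
    return score

audit_gendered_terms(
    "Led engineering projects; women's chess club captain; BSc in CS.", toy_score
)
# A consistent positive delta when only the gendered phrase changes is evidence
# the model penalizes references to women rather than job-relevant skill.
```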

 

The program was ultimately abandoned when it could not be corrected. Our second example comes from mortgage lending. An investigation by The Markup of 2019 mortgage data found that applicants of color were 40-80% more likely to be denied loan approvals than white applicants. Even in cases where applicants were otherwise identical, the white applicants were approved while the Latino, Black, Asian, and other applicants of color were denied.

 

Outside of the creators of the algorithms used to underwrite the loans, few knew how the algorithms worked. This led to public criticism and damaged trust in the models producing these results. In 2020, a viral thread highlighted discrimination in how Twitter selected which part of a photo to show.

 

The feature was auto-cropping pictures to focus on white people over people of color. When the issue was made public, Twitter released an explanation of how it happened and took accountability for the issue. By being transparent, Twitter was able to maintain trust in its product and alter the program to include a wider variety of source data so the issue would not continue.