Fairness-aware Query Answering


Collaboration:
This is a joint project with the DBXLAB@UTA.
Publications:
[1] Suraj Shetiya, Ian Swift, Abolfazl Asudeh, Gautam Das. Fairness-Aware Range Queries for Selecting Unbiased Data. ICDE, 2022.
Abstract:
Since an algorithm is only as good as the data it works with, biases in the data can significantly amplify unfairness issues. Our vision in this project is to integrate fairness conditions into database query processing and data management systems.
 
Description

In the era of big data and advanced computation models, we are all constantly being judged by the analyses, algorithmic outcomes, and AI models generated using data about us. Such analyses are valuable as they help decision makers take wise and just actions. For example, the abundance of data has enabled building extensive big data systems to fight COVID-19, such as systems for controlling the spread of the disease or for identifying effective factors, decisions, and policies. Similar examples can be found in almost all corners of human life, including resource allocation and city policies, policing, the judicial system, college admissions, credit scoring, breast cancer prediction, job interviewing, hiring, and promotion, to name a few. In particular, let us consider the following as a running example:

EXAMPLE 1 (Part 1)
Consider a company that would like to make a policy decision targeted at its ``profitable'' employees. Following our real experiment, suppose the company has around 150K employees. Using salary as an indicator of how profitable an employee is, the business management office of the company considers the query SELECT * FROM EMP WHERE salary>=$65K, which selects around 18% of the employees. By surveying this group, the company wants to develop mechanisms to motivate and retain these employees.

Looking at these analyses through the lens of fairness, algorithmic decisions look promising as they seem to eliminate human biases. However, ``an algorithm is only as good as the data it works with''. In fact, the use of data in all of the aforementioned applications has been highly criticised for being discriminatory, racist, sexist, and unfair. Probably the main reason is that real-life social data is almost always ``biased''. Using biased data for algorithmic decisions creates fairness dilemmas, such as impossibility results and inherent trade-offs between fairness notions. Besides the historical biases and false stereotypes reflected in data, other sources of bias, such as selection bias, can amplify unfairness issues. To highlight a real example, let us continue with Example 1:

EXAMPLE 1 (Part 2)
It turns out that the company has more female employees than male employees. Still, due to the known historical discrimination, the selected group of employees contains noticeably more males. As a result, by targeting this group for the analysis, the company will end up favoring the preferences of the male employees, which is unfair to the female employees and will, in a feedback loop, result in losing more of the ``profitable'' female candidates.
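
To make the setting concrete, the following is a minimal, illustrative sketch (in Python/pandas) of the initial range query and the disparity it exhibits. The EMP table, its ``salary'' and ``gender'' columns, and the synthetic numbers below are stand-ins we made up for illustration, not the actual dataset or code from the paper.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 150_000

    # Synthetic stand-in for the EMP table (hypothetical 'salary' and 'gender' columns);
    # slightly higher male salaries mimic the historical bias described above.
    gender = rng.choice(["F", "M"], size=n, p=[0.55, 0.45])
    salary = rng.lognormal(mean=np.where(gender == "M", 10.80, 10.65), sigma=0.4)
    emp = pd.DataFrame({"salary": salary.round(-2), "gender": gender})

    # The initial, user-chosen range query: SELECT * FROM EMP WHERE salary >= $65K
    selected = emp[emp["salary"] >= 65_000]
    males = int((selected["gender"] == "M").sum())
    females = int((selected["gender"] == "F").sum())

    print(f"selected {len(selected):,} of {n:,} employees ({len(selected) / n:.0%})")
    print(f"males: {males:,}  females: {females:,}  gap: {males - females:,}")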

Despite extensive efforts within the database community, there is still a need to integrate fairness requirements into database systems. To address this need, as our first attempt, we consider range queries and pay attention to two facts: (i) the conditions in a range query may be selected intuitively by the human user; for instance, in Example 1 the user could have chosen $65K as the query bound simply because it seemed like a (roughly) reasonable choice; and (ii) considering the ethical obligations and consequences, the user might be willing to accept a query ``similar enough'' to their initial choice if it returns a ``fair'' outcome.

In Example 1, the company could, for instance, in a post-query processing step, remove some male employees from the selected group, or add some female employees to the selected pool even though they do not belong to the query result. While such fixes are technically easy, they are illegal in many jurisdictions because they amount to disparate-treatment discrimination: ``when the decisions an individual user receives change with changes to her sensitive attribute information''. For instance, one cannot simply increase or decrease a student's grade because of their race or gender; instead, one should design a ``fair rubric'' that is not discriminatory. Therefore, instead of practicing disparate treatment, we propose adjusting the range itself (similar to finding a fair rubric for grading) so that its output is fair. Our system allows the user to specify fairness and similarity constraints (in a declarative manner) along with the selection conditions, and it returns an output range that satisfies these conditions. To further clarify this, let us continue with Example 1.
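
As a rough illustration of the underlying optimization problem (and not the efficient algorithms proposed in the paper), the sketch below scans candidate salary ranges on a coarse grid and keeps the one most similar to the original query, measuring similarity as the Jaccard similarity of the selected tuples, among candidates whose male-female gap stays within a user-given bound. The function name, the grid step, and the restriction to a single numeric attribute are simplifying assumptions of this sketch.

    import numpy as np

    def most_similar_fair_range(emp, lo0, hi0, max_gap, step=1_000):
        """Brute-force sketch: most Jaccard-similar range [lo, hi] with |#M - #F| <= max_gap."""
        sal = np.sort(emp["salary"].to_numpy())
        sal_m = np.sort(emp.loc[emp["gender"] == "M", "salary"].to_numpy())

        def count(arr, lo, hi):  # number of values in arr falling inside [lo, hi]
            return int(np.searchsorted(arr, hi, side="right") - np.searchsorted(arr, lo, side="left"))

        n_orig = count(sal, lo0, hi0)
        grid = np.arange(sal[0], sal[-1] + step, step)
        best = None  # (jaccard, lo, hi)
        for lo in grid:
            for hi in grid[grid >= lo]:
                n_cand = count(sal, lo, hi)
                n_male = count(sal_m, lo, hi)
                if abs(2 * n_male - n_cand) > max_gap:      # |#M - #F| = |2*#M - |result||
                    continue
                inter = max(0, count(sal, max(lo, lo0), min(hi, hi0)))
                union = n_orig + n_cand - inter
                jacc = inter / union if union else 0.0
                if best is None or jacc > best[0]:
                    best = (jacc, lo, hi)
        return best

    # e.g., with the synthetic `emp` table from the previous sketch:
    # print(most_similar_fair_range(emp, 65_000, emp["salary"].max(), max_gap=1_000))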

EXAMPLE 1 (Part 3)
Being aware of the historical discrimination, the ethical obligations, and the potential negative impacts on the company, and knowing that the choice of the salary lower-bound has been fuzzy, the business management office would like to find a query whose output is similar enough to that of the initial query and in which the number of male employees returned is at most 1000 (around 5%) more than the number of female employees. Using our system, they can issue a SQL query to find such a set. Our system returns the most similar fair range as SELECT * FROM EMP WHERE $60K<=salary<=$152K. Its outcome is 75% similar to that of the initial range query and satisfies the fairness requirement. Observing the high Jaccard similarity between these two sets, the company now has the option to use this range for its analysis, to make sure it is not discriminating against its female employees and hence not losing its valuable candidates.
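
Checking such a returned range against the initial query amounts to computing the Jaccard similarity of the two result sets and the male-female gap of the new result. The snippet below does this for the two ranges in the example, reusing the synthetic `emp` table from the first sketch; the printed numbers therefore reflect that synthetic data rather than the real employee table behind the example.

    # Illustrative check of the returned fair range against the initial query
    in_orig = emp["salary"] >= 65_000                                 # initial query
    in_fair = (emp["salary"] >= 60_000) & (emp["salary"] <= 152_000)  # returned range

    jaccard = (in_orig & in_fair).sum() / (in_orig | in_fair).sum()
    males = int((emp.loc[in_fair, "gender"] == "M").sum())
    females = int((emp.loc[in_fair, "gender"] == "F").sum())

    print(f"Jaccard similarity to the initial query: {jaccard:.0%}")
    print(f"male - female gap in the fair range: {males - females:,}")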

Our system provides an alternative to the initial query provided by the user. This is useful since the choice of filtering ranges is often ad hoc; our system helps the user tune their range responsibly. If the discovered range is not satisfactory, the user can change the fairness and similarity requirements and explore different choices until they select the final result in a responsible manner.