Letter from the Lead Auditor
From: Shea Brown
Lead Auditor
BABL AI Inc.
sheabrown@babl.ai
To: Interviewstreet Incorporation (HackerRank)
PO Box 1660
211 Hope Street
Mountain View, California 94041
Re: Audit Opinion on HackerRank’s Plagiarism Detection System
07/22/2024
We have independently audited the bias testing assertions and related documentary evidence of HackerRank (the "Company") as of 07/22/2024, presented to BABL AI in relation to Company’s Plagiarism Detection System (the “system”) in accordance with the criteria and audit methodology set forth in this report. The goals of this audit are to:
- Determine whether the bias testing methodologies, controls, and procedures performed by Company satisfy the audit criteria (see Findings)
- Obtain reasonable assurance as to whether the statements made by the Company, including the summary of bias testing results presented in this report, are free from material misstatement, whether due to fraud or error.
Note that the criteria presented in this report were constructed specifically to address the requirements of a “bias audit” outlined in NYC Local Law No. 144 of 2021. The system was audited as though it were an automated employment decision tool (AEDT) under NYC Local Law No. 144 of 2021, but we do not make any determination whether the system is, in fact, an AEDT under this law.
Company Responsibilities
It is the responsibility of Company representatives to ensure that bias testing and related procedures comply with the criteria outlined in this report. The Company representatives are responsible for ensuring that the documents submitted are fairly presented and free of misrepresentations, providing all resources and personnel needed to ensure an effective and efficient audit process, and providing access to evidential material as requested by the auditors.
BABL AI Responsibilities
It is the responsibility of the lead auditor to express an opinion on the Company's assertions related to the bias testing of the system. In light of the current absence of generally accepted standards for the auditing of algorithms and autonomous systems, our examination was conducted in accordance with the standards and normative references outlined in this report.
Those standards require that we plan and perform audit procedures to obtain reasonable assurance about whether the assertions referred to above 1) satisfy the audit criteria and 2) are free of material misstatement, whether due to error or fraud. Within the scope of our engagement, we performed, among others, the following procedures:
- Inspection of submitted documents and external documentation
- Interviews with Company employees to gain an understanding of the process for determining the disparate impact and risk assessment results
- Observation of selected analytical procedures used in Company’s bias testing
- Inspection of selected samples of the bias testing data
- Inquiry of personnel responsible for governance and oversight of the bias testing and risk assessment
We believe that the procedures performed provide a reasonable basis for our opinion.
Independence
Our role as an independent auditor conforms to ForHumanity and Sarbanes-Oxley definitions of Independence. Fees associated with this contract are for the provision of the service to assess compliance. The payment of fees is unrelated to the decision rendered. Our decision is grounded solely in the criteria presented below.
Opinion
In our opinion, based on the procedures performed and the evidence received to obtain assurance, the bias testing and results presented by Company, as of 07/22/2024, are prepared, in all material respects, in accordance with the criteria outlined below.
Sincerely,
Shea Brown, Ph.D.
Lead Auditor, BABL AI Inc.
Emphasis of Matter
We emphasize several matters related to the dataset sourced by HackerRank to test for disparate impact: 1) self-declared demographic labels were not available for the system, so sex and race/ethnicity labels were inferred, 2) the use of inferred demographic labels means that the data is likely to be considered “Test Data” under § 5-3001, and 3) the amount and quality of the testing dataset (see Findings) were limited due to the lack of historical data. Consequently, the conclusions drawn from the disparate impact quantification are subject to the limitations arising from the dataset and should be interpreted in light of this constraint. Our opinion is not modified with respect to this matter.
1 This is at the discretion of the NYC Department of Consumer and Worker Protection.
System Description
BABL AI was engaged to audit HackerRank’s testing of its Plagiarism Detection System.
From HackerRank: “Plagiarism detection system uses user-generated signals into an advanced machine-learning algorithm to flag suspicious behavior during an assessment. By understanding code iterations made by the candidate, the model can predict if they had major external help.
It’s important to note,... [that HackerRank] made the following efforts to prevent potential bias from PII or coding habits or question difficulty related bias:
- No race, gender or PII information in its training or at the time of inference.
- No typing speed data is used in the model since we consider typing speed is more like a personal habit instead of plagiarism indicators.
- Features are normalized in respect of the same question to minimize the bias from different question difficulties.
- Human-in-the-loop decision making: ML coding plagiarism [system] will never automatically drop candidates from the model prediction. Instead, we provide a comprehensive report of the suspicious attempt[s]... Our customers can review the detailed report and check the code replay by themselves to decide whether to move a candidate forward.”
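For illustration only, the question-level normalization mentioned in the third bullet above could, under the assumption of simple per-question z-scoring (HackerRank does not disclose its actual normalization), look like the following sketch:

```python
import numpy as np

# Assumed illustration (not HackerRank's disclosed method) of normalizing
# features "in respect of the same question": z-score each feature within
# the pool of attempts at the same question, so that a question's overall
# difficulty does not systematically shift the feature values.

def normalize_within_question(features: np.ndarray, question_ids: np.ndarray) -> np.ndarray:
    """features: (n_attempts, n_features); question_ids: (n_attempts,)."""
    normalized = np.empty_like(features, dtype=float)
    for q in np.unique(question_ids):
        mask = question_ids == q
        block = features[mask]
        std = block.std(axis=0)
        std[std == 0] = 1.0                      # avoid division by zero
        normalized[mask] = (block - block.mean(axis=0)) / std
    return normalized

# Example: two questions with different difficulty levels.
X = np.array([[10.0, 2.0], [12.0, 3.0], [50.0, 9.0], [54.0, 11.0]])
q = np.array([1, 1, 2, 2])
print(normalize_within_question(X, q))
```

Standardizing each feature within the group of attempts at the same question removes that question's difficulty level from the feature values before the model sees them.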
Audit Summary
Background
New York City Local Law No. 144 of 2021 requires yearly “bias audits” for automated employment decision tools (AEDTs) used to substantially assist or replace decisions in hiring or promotion. Specifically, the law states that (1) the bias audit must “assess the [AEDTs’] disparate impact” on certain persons, (2) the audit must be conducted by an “independent auditor ... no more than one year prior to the use”, and (3) a “summary of the results of the most recent bias audit ... [must be] made publicly available on the website of the employer or employment agency.” The audit outlined in this document has been conducted to satisfy the law’s requirement for a bias audit only, and does not include other requirements such as candidate notifications. This report does not make any determination as to whether the system under this audit is, in fact, an automated employment decision tool as defined under NYC Local Law 144.
Auditor Responsibilities
It is the responsibility of BABL AI auditors to:
- Obtain reasonable assurance as to whether the statements made by the auditee are free from material misstatement, whether due to fraud or error,
- Determine whether the statements made by the auditee provide sufficient evidence that the audit criteria (see Findings) have been satisfied, and
- Issue an auditor’s report that includes an opinion.
As part of an audit in accordance with good auditing practice, BABL AI exercises professional judgment and maintains professional skepticism throughout the audit. Specifically, BABL AI auditors identify and assess the risks of material misstatement in documents provided by the auditee, perform audit procedures responsive to those risks, and obtain audit evidence that is sufficient and appropriate to provide a basis for our opinion, per Public Company Accounting Oversight Board (PCAOB)’s Auditing Standard 1105 on Audit Evidence,2 where applicable. In addition, this audit report follows International Standard on Assurance Engagements (ISAE) 3000’s guidelines on Assurance Report, where applicable.3
BABL AI is also responsible for maintaining auditors’ independence and objectivity to ensure the integrity of the opinion and certification provided. BABL AI as an organization, and all employee and contract auditors, adhere to strict independence as codified by the Sarbanes–Oxley Act of 2002 and ForHumanity’s Code of Ethics. In addition, BABL AI Lead Auditors are ForHumanity Certified Auditors under the NYC AEDT Bias Audit. For more details about our methodology and process, see Appendix – Audit Methodology.
2 https://pcaobus.org/oversight/standards/auditing-standards/details/AS1105
3 https://www.iaasb.org/publications/international-standard-assurance-engagements-isae-3000-revised-assurance-engagements-other-audits-or-0
Scope & Objective
Audit Section | Audit Objective |
Disparate Impact Quantification | To ensure that the auditee has conducted sufficient testing of their model to “assess the tool’s disparate impact on persons of any component 1 category,” – i.e., race and gender – as the minimal requirement for a bias audit under Local Law 144 of 2021. |
Governance | To ensure that effective internal governance exists to own, manage, and monitor risks related to bias and fairness. |
Risk Assessment | To ensure that risks of the model that potentially contribute to bias have been rigorously identified, acknowledged, and assessed. |
Out of Scope
- The audit did not ensure sufficient testing of the tool’s disparate impact on any other protected class beyond race/ethnicity and gender
- The audit did not certify that the system is “bias-free”
- The audit is not intended for compliance purposes for any legislation other than NYC Local Law No. 144
Conclusions
Our opinions for the bias audit of Plagiarism Detection System by HackerRank are as follows:
Audit Section | Opinion |
Disparate impact quantification | PASS |
Governance | PASS |
Risk assessment | PASS |
Overall | PASS |
Findings
Note: The information disclosed under each criterion is not documentary evidence.
Disparate Impact Quantification
Audit Criteria | Opinion |
Q.A. Components: The system to be tested for disparate impact shall be defined. | PASS |
Components or combinations of components that were tested: N/A
Q.B. Testing dataset: The dataset on which disparate impact was quantified shall be defined and characterized. | PASS |
Testing conducted by: HackerRank
Date of last testing: 05/10/2023
Time span of data: Nov 2022 – Feb 2023
Justification for the use of Test Data: From HackerRank: “Given the data protection contract with our customers, open source libraries are selected for name to race and name to gender estimation since no data transfer is required to a different vendor.”
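The specific open-source libraries are not named in the report. Purely as a hypothetical illustration of what name-based inference of this kind involves (the lookup data and function below are invented placeholders, not HackerRank's pipeline), such a step might look like:

```python
# Hypothetical sketch of name-based demographic label inference.
# The open-source libraries and reference lists actually used by HackerRank
# are not disclosed here; the tiny lookup table below is a placeholder.

FIRST_NAME_TO_GENDER = {   # placeholder data, for illustration only
    "maria": "Female",
    "james": "Male",
}

def infer_gender(first_name: str) -> str:
    """Return an inferred gender label, or 'Unknown' when the name is not covered."""
    return FIRST_NAME_TO_GENDER.get(first_name.strip().lower(), "Unknown")

print(infer_gender("Maria"))   # -> Female
print(infer_gender("Alex"))    # -> Unknown
```

Names that cannot be mapped end up in the “unknown” categories that are later excluded from the impact-ratio calculations (see the note beneath the tables below).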
Q.C. Disparate-impact quantifiable PCVs: PCVs that can be quantified using the testing dataset shall be defined. | PASS |
PCVs for which disparate impact was quantified:
1. Gender
2. Race/ethnicity
PCVs for which disparate impact was not quantified:
- Age
- Immigration or citizenship status
- Disability status
- Marital status and partnership status
- National origin
- Pregnancy and lactation accommodations
- Religion/creed
- Sexual orientation
- Veteran or Active Military Service Member status
Q.D. Positive vs. negative outcome: Where the selection rate method was used, positive and negative outcomes of the model shall be clearly defined. Q.D.1. Evidence shall show justification for why the selected definition of positive outcome was appropriate. Q.D.2. Where thresholding is used, evidence shall show justification for why the level/levels of threshold to determine positive vs. negative outcomes was/were appropriate. Q.D.3. Evidence shall identify and disclose the user-configurable settings that can affect the positive outcome. Q.D.4. Evidence shall disclose the user-configurable settings and combinations of settings on which disparate impact was tested. | PASS |
Positive outcome: not being flagged by the system
User-configurable settings that can affect positive outcome: N/A
Settings on which disparate impact was tested: N/A
Q.E. Selection rate or scoring rate: A metric corresponding to selection rate or scoring rate shall be defined. Q.E.1. Where the selection rate method was used, evidence shall show that the selection rate of a group was defined as the ratio of positive outcome to all outcomes for that group. Q.E.2. Where the scoring rate method was used, evidence shall show that the scoring rate of a group was defined as the rate at which that group receives a score from the AEDT above the median score of the sample. | PASS |
Method of quantifying disparate impact: selection rate defined as the rate at which candidates in a demographic group are not flagged by the system
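To make the arithmetic concrete, the following minimal sketch (not HackerRank's code) reproduces the impact-ratio calculation from the gender selection rates disclosed in the tables below; under the four-fifths rule, ratios below 0.8 would generally be regarded as evidence of adverse impact.

```python
# Selection-rate method as described above:
# selection rate = share of a group NOT flagged by the system,
# impact ratio   = group selection rate / highest group selection rate.
# The rates below are taken from the gender table in this report.

selection_rates = {"Male": 0.798, "Female": 0.784}

best = max(selection_rates.values())
impact_ratios = {group: rate / best for group, rate in selection_rates.items()}

for group, ratio in impact_ratios.items():
    # Ratios below 0.8 would fall under the four-fifths rule's adverse-impact threshold.
    print(f"{group}: impact ratio {ratio:.3f}")
# Male: impact ratio 1.000
# Female: impact ratio 0.982
```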
Q.F. Favored, disfavored groups: Favored and disfavored groups shall be defined, for all PCVs. Q.F.1. Evidence shall show that favored and disfavored groups were defined according to selection rates or scoring rates ordered by PCV. Q.F.2. Evidence shall show that the groups pertaining to race and ethnicity satisfy § 60-3.4 B in the EEO guidelines. Q.F.3. Where the groups pertaining to race and ethnicity do not satisfy EEO guidelines, evidence shall show justification for why EEO grouping was not used, and the appropriateness of any substituted groupings. Q.F.4. Evidence shall show that the groups pertaining to gender contain at least “Male” and “Female”. Q.F.5. Evidence shall show intersectional groups containing all permutations of gender and race/ethnicity group combinations. Q.F.6. Where race/ethnicities and genders are not known for a sample of candidates assessed by the AEDT, evidence shall disclose its sample size. | PASS |
Q.G. Impact ratio: Impact ratios shall be disclosed for all disfavored groups, for all PCVs. Q.G.1. Where an impact ratio for a disfavored group is below 0.8, evidence shall show justification for why the disfavored group is disadvantaged. Q.G.2. Evidence shall show results of uncertainty analysis (e.g., standard error for the mean) or error propagation of impact ratios in the form of errors or error bars. Q.G.3. Where PCV data was inferred, evidence shall show that systematic errors due to PCV inference were properly propagated in impact ratio calculations. Q.G.4. Where a gender, race/ethnicity, or intersectional group was excluded from impact ratio calculation due to its size being below 2% of the total sample size of each analysis, evidence shall show justification for the exclusion. | PASS |
Non-intersectional, Gender, sorted by Impact ratio
Gender | N applicants | Selection rate | Impact ratio |
Male | 3,732 | 0.798 | 1.000 |
Female | 2,112 | 0.784 | 0.982 |
Non-intersectional, Race/ethnicity, sorted by Impact ratio
Race/ethnicity | N applicants | Selection rate | Impact ratio |
Hispanic or Latino | 1,127 | 0.860 | 1.000 |
Asian | 1,184 | 0.835 | 0.971 |
White | 1,797 | 0.774 | 0.900 |
Black or African American | 1,070 | 0.762 | 0.866 |
Native American or Alaskan Native | 3 | 0.667 | N/A |
Intersectionals
Ethnicity | Gender | Race | N applicants | Selection rate | Impact ratio |
Hispanic or Latino | Male | | 790 | 0.857 | 0.995 |
Hispanic or Latino | Female | | 287 | 0.861 | 1.000 |
Non-Hispanic or Latino | Male | White | 711 | 0.839 | 0.974 |
Non-Hispanic or Latino | Male | Asian | 924 | 0.774 | 0.899 |
Non-Hispanic or Latino | Male | Black or African American | 807 | 0.756 | 0.878 |
Non-Hispanic or Latino | Male | Native American or Alaskan Native | 2 | 0.500 | N/A |
Non-Hispanic or Latino | Female | Asian | 697 | 0.772 | 0.897 |
Non-Hispanic or Latino | Female | White | 389 | 0.833 | 0.967 |
Non-Hispanic or Latino | Female | Black or African American | 807 | 0.756 | 0.878 |
Non-Hispanic or Latino | Female | Native American or Alaskan Native | 1 | 1.000 | N/A |
Note: Data on these applicants was not included in the calculations above:
- 123 applicants with an unknown gender category,
- 1,053 applicants with an unknown race/ethnicity category, and
- 1,176 applicants with an unknown gender, an unknown race/ethnicity, or both.
7 N/A refers to a demographic group representing less than 2% of the total number of applicants in the table.
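Criteria Q.G.2 and Q.G.3 call for uncertainty analysis and error propagation on the impact ratios. The report does not reproduce HackerRank's formulas; the sketch below shows one standard approach under an assumed binomial sampling error (a systematic error term for inferred labels would be added on top of this), using the gender figures from the table above purely for illustration.

```python
import math

# One standard (assumed, not HackerRank's disclosed) way to attach an error bar
# to an impact ratio: binomial standard error for each group's selection rate,
# then first-order propagation (relative errors added in quadrature) for the ratio.

def selection_rate_se(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n applicants."""
    return math.sqrt(p * (1.0 - p) / n)

def impact_ratio_with_error(p_group, n_group, p_best, n_best):
    """Impact ratio p_group / p_best and its propagated standard error."""
    ratio = p_group / p_best
    rel_err = math.sqrt(
        (selection_rate_se(p_group, n_group) / p_group) ** 2
        + (selection_rate_se(p_best, n_best) / p_best) ** 2
    )
    return ratio, ratio * rel_err

# Gender figures from the non-intersectional table above (illustration only).
ratio, err = impact_ratio_with_error(p_group=0.784, n_group=2112,
                                      p_best=0.798, n_best=3732)
print(f"Female impact ratio: {ratio:.3f} ± {err:.3f}")
# ≈ 0.982 ± 0.014
```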
Q.H. Statistical significance: Where the selection rate method was used, statistical significance calculation shall satisfy UGESP guidelines. Q.H.1. Evidence shall show that statistical significance was calculated using the Two Independent-Sample Binomial Z-Test for sample sizes of 30 or more, and using the Fisher’s Exact Test for sample sizes of fewer than 30. | PASS |
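As a minimal illustration of the decision rule named in Q.H.1 (a generic sketch using scipy, not HackerRank's implementation, with made-up counts), the choice between the two tests could be coded as follows:

```python
from math import sqrt
from scipy.stats import norm, fisher_exact

# Illustrative sketch of the Q.H.1 decision rule: a two-independent-sample
# binomial Z-test when both groups have 30 or more applicants, and Fisher's
# exact test otherwise. Inputs are counts of applicants who were NOT flagged
# ("selected") in each group.

def selection_rate_significance(sel_a, n_a, sel_b, n_b):
    if min(n_a, n_b) >= 30:
        p_a, p_b = sel_a / n_a, sel_b / n_b
        p_pool = (sel_a + sel_b) / (n_a + n_b)
        z = (p_a - p_b) / sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        p_value = 2 * norm.sf(abs(z))            # two-sided Z-test
        return "Z-test", p_value
    table = [[sel_a, n_a - sel_a], [sel_b, n_b - sel_b]]
    _, p_value = fisher_exact(table)             # two-sided by default
    return "Fisher's exact", p_value

# Hypothetical counts, chosen only to show the interface.
test, p = selection_rate_significance(sel_a=2978, n_a=3732, sel_b=1656, n_b=2112)
print(test, round(p, 4))
```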
Governance
Audit Criteria | Opinion |
G.A. Accountable party for disparate impact risks: The auditee shall have a party who is accountable for risks related to disparate impact. | PASS |
Accountable party: Committee on Automated Decision Making Tools
Contact information: rohan.raman@hackerrank.com
Role in the auditee organization: Governance group/committee
G.B. Defined duties of the accountable party: Duties of the party accountable for disparate impact risks shall be clearly defined. | PASS |
G.C. Documentation pertaining to duties carried out: The auditee shall provide evidence that the defined duties of the party accountable for disparate impact risks are carried out. | PASS |
Risk Assessment
Audit Criteria | Opinion |
R.A. Completion: The auditee shall have completed a risk assessment of the model. | PASS |
Evidence of Risk Assessment completion: Risk Assessment Documents (risk assessment spreadsheet and narrative notes), and verbal testimony from Risk Assessment participants.
R.B. Identification of risks: Risk assessment shall show identification of relevant risks related to bias. | PASS |
R.C. Evaluation of risks: Risk assessment shall demonstrate appropriate evaluation of relevant risks. | PASS |
Appendix
Audit Methodology
The Process Audit
The BABL AI audit framework is the Process Audit, defined as “an impartial verification and evaluation process conducted by an independent auditor to determine whether sufficient evidence exists to warrant the judgment that AI risks have been prevented, detected, mitigated, or otherwise managed.” A process audit is modeled after the financial auditing practice as a cooperative third-party audit, and is distinguished from other commonly used forms of assessment of algorithms, such as first- or second-party assessments, and assurance.8 The audit framework contains three phases:
- Scoping – The auditor conducts a preliminary survey of the auditee’s algorithm to gain a full understanding of it and to contextualize documentary evidence
- Evaluation & Verification – The auditee submits documentation containing evidence demonstrating satisfaction of the audit criteria, which the auditors evaluate and verify.
- Certification – If the auditee is determined to pass the audit criteria, the auditor drafts the auditor’s report and certifies the auditee’s algorithm.
Evaluation & Verification
The procedure for all BABL AI auditors in conducting a process audit follows the guidelines set forth in the Public Company Accounting Oversight Board (PCAOB)’s Auditing Standard 1105 on Audit Evidence, where applicable. Specifically, the auditors:
- Obtain audit claims and statements from the auditee’s submitted documentation which either support or contradict the criteria and sub-criteria,
- Evaluate the claims and statements in regard to satisfying the criteria and sub-criteria, based on the sufficiency and appropriateness of the evidence, and
- Verify that the claims and statements made by the auditee are free from material misstatement, whether due to fraud or error.9
8 Carrier, R., & Brown, S. (2021). Taxonomy: AI Audit, Assurance & Assessment. ForHumanity.
https://forhumanity.center/blog/taxonomy-ai-audit-assurance-assessment/
9 “Reasonable assurance” is a high level of assurance but is not a guarantee that an audit conducted in accordance with good auditing practice always detects a material misstatement when it exists. Misstatements can arise from fraud or error and are considered material if, individually or in aggregate, they could reasonably be expected to influence the decisions of stakeholders taken based on these statements.
In addition, evaluation and verification of claims and statements may involve requesting additional supporting documentary evidence, and/or interviewing those responsible for the governance of the algorithm, other relevant employees of the auditee organization, or other third parties referenced in the submitted documentation.
At the end, the auditors reach an audit opinion based on:
- The sufficiency and appropriateness of the audit evidence, and
- The risk of material misstatement of the audit evidence.
Terminologies & Definitions
Term | Abbrev | Definition |
automated employment decision tool | AEDT | “any computational process, derived from machine learning, statistical modeling, data analytics, or artificial intelligence, that issues simplified output, including a score, classification, or recommendation, that is used to substantially assist or replace discretionary decision making for making employment decisions that impact natural persons.” – see § 20-870 of the Code and § 5-300 of the adopted rule for full definition |
disfavored group | | any gender or race/ethnicity group not having the highest selection rate or average score |
disparate impact or adverse impact | | “a selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5) (or 80%) of the rate for the group with the highest rate will generally be regarded by the Federal enforcement agencies as evidence of adverse impact” – see § 60-3.4.D of UGESP (1978) for full definition |
error propagation | | calculation or computation of a variable’s uncertainty that is dependent on another variable’s uncertainty |
favored group | | the gender or race/ethnicity group having the highest selection rate or average score compared to the other groups |
impact ratio | | “either (1) the selection rate for a category divided by the selection rate of the most selected category or (2) the scoring rate for a category divided by the scoring rate for the highest scoring category.” – see § 5-300 of the adopted rule for full definition |
scoring rate | | “the rate at which individuals in a category receive a score above the sample’s median score, where the score has been calculated by an AEDT” |
justification | | a compelling reason that illuminates the issue and carries normative force, as opposed to solely explanatory power |
positive outcome | | the basis for selection rate; the favorable outcome for a candidate from the use of the model, such as being selected to move forward in the hiring process or being assigned a classification by a model |
protected category variables | PCV | defined per jurisdiction, equivalent to protected class, including but not limited to: race/ethnicity, age, gender, religion, ability or disability, sexual orientation, color, national origin, socioeconomic class |
risk assessment | | an assessment of the risk that the use of the algorithm negatively impacts the rights and interests of stakeholders, with a corresponding identification of situations of the context and/or features of the algorithm which give rise or contribute to these negative impacts10 |
selection rate | | “the rate at which individuals in a category are either selected to move forward in the hiring process or assigned a classification by an AEDT” – see § 5-300 of the adopted rule for full definition |
testing dataset | | the dataset used to test for or quantify disparate impact |
uncertainty analysis | | calculation or computation to quantify the uncertainty of a variable, outputting errors or error bars |
10 Hasan, A., Brown, S., Davidovic, J., Lange, B., & Regan, M. (2022). Algorithmic Bias and Risk Assessments: Lessons from Practice. Digital Society, 1(1). https://doi.org/10.1007/s44206-022-00017-z