AlphaDetector: AI Text Detector

Detection Method

"Extraordinary claims require extraordinary evidence." - Carl Sagan

From the first heartbeat to the last, our identity is present in every breath we take, every move we make, and every word we write. The vocabulary we use in our writing is heavily influenced by our story. Consciously or not, we all have a set of words and terms we prefer to use. Therefore, to submit something we did not write is to lie about who we are. We may get away with it, but with every word, the mask slowly falls apart.

AI writing models, just like us, have their own set of favorite words and terms they prefer to use. With this in mind, I went on a journey to identify the vocabulary that humans tend to prefer and the vocabulary that AI tends to prefer.

I used data science techniques to examine a large volume of AI-generated and human-written text, analyzing the usage of every word in the English language as well as more than 100 million word combinations. My analysis identified 38 thousand words and 2.8 million terms that are critical in distinguishing AI writing from human writing. From these, I created four precomputed tables that map each word and term to a score ranging from -2,000 to 2,000, indicating how likely humans or AI are to use it: positive scores represent AI, negative scores represent humans. This score is what appears as the 'prevalence factor' on the home page.

Computing the 'prevalence factor' consists of iterating through every word and combination of words in a text and looking up the score associated with it in the prevalence tables. These values are summed to obtain the final 'prevalence factor'.
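The summation described above can be sketched in a few lines of Python. The table contents, n-gram sizes, and scores below are illustrative assumptions, not the real precomputed tables.

```python
# Sketch of the 'prevalence factor' computation: look up each word and
# word combination in a score table and sum the results.

def ngrams(tokens, n):
    """Return all consecutive n-word combinations from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prevalence_factor(text, tables):
    """Sum the prevalence scores of every word and word combination.

    `tables` maps n-gram size -> {term: score}, where positive scores
    lean AI and negative scores lean human.
    """
    tokens = text.lower().split()
    total = 0
    for n, table in tables.items():
        for term in ngrams(tokens, n):
            total += table.get(term, 0)  # unknown terms contribute 0
    return total

# Toy tables; in practice these would be the four precomputed tables.
tables = {
    1: {"delve": 1500, "tapestry": 1200, "gonna": -900},
    2: {"in conclusion": 800, "i reckon": -700},
}

print(prevalence_factor("i reckon we should delve into this", tables))  # 800
```

A positive result means the text leans toward AI-preferred vocabulary; a negative result leans human.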

This value, along with other metrics such as the length of the text and the variance of the writing style, is fed into a classification model called k-nearest neighbors (KNN). This model, which has seen hundreds of thousands of human-written and AI-generated texts, finds the 1,000 most similar texts to the one being analyzed and makes a prediction based on how many of those 1,000 cases were AI-generated and how distant they are from the current text.
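This classification step can be sketched with scikit-learn's `KNeighborsClassifier`. The synthetic features and sample sizes below are illustrative assumptions; the real model uses hundreds of thousands of labeled texts and k = 1,000 neighbors.

```python
# Sketch of a distance-weighted KNN classifier over text features.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic features: [prevalence_factor, text_length, style_variance]
human = rng.normal([-500, 400, 1.0], [200, 150, 0.3], size=(500, 3))
ai = rng.normal([500, 450, 0.4], [200, 150, 0.2], size=(500, 3))
X = np.vstack([human, ai])
y = np.array([0] * 500 + [1] * 500)  # 0 = human, 1 = AI

# weights="distance" makes closer neighbors count more, mirroring the
# "how distant they are" weighting described above.
knn = KNeighborsClassifier(n_neighbors=100, weights="distance")
knn.fit(X, y)

sample = np.array([[650.0, 420.0, 0.35]])  # strongly AI-leaning features
print(knn.predict_proba(sample))  # [P(human), P(AI)] over weighted neighbors
```

The predicted probability is the distance-weighted fraction of the nearest neighbors that are AI-generated, which is the quantity the decision is based on.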

I heavily calibrated the model to reduce false positives at all costs. It lets many AI-generated texts through so that it never produces a false positive. The model was tested on 100 thousand human-written texts and did not produce a single false positive. AlphaDetector is the only AI detector with a 0% false positive rate.
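One common way to achieve this kind of calibration is to raise the decision threshold until no human-written text in a validation set is flagged. This is a hedged sketch of that general technique, with synthetic scores; it is not the actual AlphaDetector calibration procedure.

```python
# Sketch: pick the smallest decision threshold that yields zero false
# positives on a held-out set of human-written texts.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model scores in [0, 1]: higher means "more likely AI".
human_scores = rng.beta(2, 8, size=10_000)  # human validation texts
ai_scores = rng.beta(8, 2, size=10_000)     # AI validation texts

# Smallest threshold strictly above every human score.
threshold = human_scores.max() + 1e-9

false_positive_rate = (human_scores >= threshold).mean()
detection_rate = (ai_scores >= threshold).mean()

print(f"threshold={threshold:.3f}  FPR={false_positive_rate:.1%}  "
      f"detection={detection_rate:.1%}")
```

Pushing the threshold this high guarantees zero false positives on the validation data, at the cost of missing the AI-generated texts whose scores fall below it, which is exactly the trade-off described in the next paragraph.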

This doesn't come without a cost, of course: AlphaDetector correctly detects 80% of AI-generated content, while other AI detectors detect 95-99%. This may lead people to believe that it is easy to fool and that students can get away with using AI. However, given the sheer volume of assignments students submit each semester, a student who frequently uses AI to complete assignments will eventually be caught.

AlphaDetector will dissuade students from cheating and encourage them to complete assignments the way their teacher intended. It is the only solution that protects honorable students and lets teachers decide how AI is used in their classrooms.