Nov. 17, 2020, 7:18 a.m. | jashley
Today, we're proud to report that we have ingested over 700 new cases into the Registry. You can find a list of all the new entries here: https://corporate-prosecution-registry.s3.us-east-2.amazonaws.com/media/20201117-new.csv

The large number of cases is the result of a multi-month effort to double-check our collection. To do this, we obtained a list of every federal district court criminal case going as far back as PACER allowed. The resulting list had over 1.2 million entries. Going through each of these one by one to find corporate names would be a nightmare, and our usual search queries already use regular expressions to capture obvious terms like "Corp." and "Inc." Machine learning proved to be the technology that helped us extract what we needed. After creating testing and training sets and making a few tweaks to a few models, we found a useful algorithm and generated a list of cases we had missed. Machine learning, in this case, worked out well and found many entities that lacked the typical "Corp.", "Inc.", etc. labels that make these cases easy to identify.

Since then, we've been collecting dockets, pleas, and the like, parsing all of them, and entering data into the site. We have also been continuously updating the Registry with the freshest prosecution agreements, and it should be up to date as the year and the administration draw to a close.

If you see anything amiss or have a hot tip on a case we missed, drop us a line.

Special thanks to Rebecca Hawes Owen, Jen Goldshtein, and Angelica Bosko for their time and labor on this project.
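
For readers curious about the regular-expression searches mentioned above, here is a minimal sketch of what matching "Corp." and "Inc." in case titles might look like. It is an illustration only, not our actual query: the pattern, the extra suffixes ("LLC", "Ltd"), the function name, and the sample titles are all hypothetical.

```python
import re

# Hypothetical pattern: "Corp" and "Inc" come from the post above;
# "LLC" and "Ltd" are assumed additions for illustration.
CORPORATE_SUFFIX = re.compile(r"\b(?:Corp|Inc|LLC|Ltd)\b\.?", re.IGNORECASE)


def looks_corporate(case_title: str) -> bool:
    """Return True if the title contains an obvious corporate suffix."""
    return CORPORATE_SUFFIX.search(case_title) is not None


# Made-up case titles, not real Registry entries.
titles = [
    "USA v. Acme Widgets Inc.",
    "USA v. John Doe",
    "USA v. Harbor Shipping Partners",
]

flagged = [title for title in titles if looks_corporate(title)]
print(flagged)
```

The third title shows the gap described above: an organization with no telltale suffix slips past a pattern like this, which is the kind of case the machine learning pass was meant to catch.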