Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video Games runs 6× the profanity rate of the cleanest category.
I read the McAuley Lab's full 2023 Amazon Reviews dataset (571,544,386 reviews, 275 GB on the HuggingFace CDN) and scored every single review on four simple signals: how many strong-profanity word hits it has, what fraction of it is in ALL CAPS, its longest run of consecutive exclamation marks, and how long it is. The question I started with was "how do people actually behave in Amazon reviews, and does the category they're reviewing change that?"
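For concreteness, the four signals are simple enough to fit in a few lines. A minimal sketch, not the repo's exact code: STRONG_WORDS here is a two-entry placeholder, the real strong/medium/mild word lists live in the repo.

```python
import re

STRONG_WORDS = {"damn", "hell"}  # placeholder; the real word lists live in the repo

def score(text: str) -> dict:
    """Score one review on the four rule-based signals."""
    words = re.findall(r"[a-z']+", text.lower())
    letters = [c for c in text if c.isalpha()]
    bang_runs = re.findall(r"!+", text)
    return {
        "profanity_hits": sum(w in STRONG_WORDS for w in words),
        "caps_ratio": sum(c.isupper() for c in text) / max(len(letters), 1),
        "max_bang_run": max(map(len, bang_runs), default=0),
        "length": len(text),
    }
```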
Live site, per-category breakdown, and the Wall of the loudest reviews: https://burla-cloud.github.io/amazon-review-distiller/
What surfaced:
- Video Games is the rowdiest category by a huge margin. 6.54% of video game reviews hit the strong-profanity list. Compare that to Gift Cards at 1.19% and Handmade at 1.08%. Movies & TV, CDs & Vinyl, Subscription Boxes, and Kindle Store fill out the top five. Cultural products attract feelings, consumer goods attract utility.
- Subscription Boxes is the angriest category. 15.89% of subscription box reviews are one-star. Almost 1 in 6. Charging people monthly for a curated surprise generates a lot of regret.
- The longest exclamation-mark run is 10,594 in a row. The review itself is two words ("love these") on a baby product. One person held one key down for a long time.
- The longest all-caps review is 1,169 words. Posted on a Mozart CD by a self-described disabled Vietnam veteran and Mozart scholar. He opens by apologizing for the caps (macular degeneration) and then keeps going, in caps, for the rest of those 1,169 words.
- Forty reviewers gave a product five stars and wrote zero or one word. One five-star review of a cherry cough drop was just "Taste." That's the whole text.
- Books, music, and games write essays. Gift card buyers write nothing. Average review length: CDs & Vinyl 428 chars, Books 423, Kindle Store 367, Digital Music 340, Video Games 308. Gift Cards is at the bottom by a wide margin. Culture gets words, utility gets silence.
Methodology, plain version:
- The dataset is 34 separate .jsonl.gz files on HuggingFace, one per Amazon category, totaling 275 GB. The usual workflow is to download all 275 GB to a laptop, then iterate. I didn't want to do that.
- The HuggingFace CDN supports HTTP Range requests: a worker can ask for "give me bytes 1,000,000,000 to 1,500,000,000 of this file" and get just that slice without downloading the whole file. I split the 34 files into 545 chunks of about 500 MB each, on byte-range boundaries.
- Each chunk runs on its own worker. The worker streams its byte range row by row, scores every review on the four signals, and writes the top-scoring reviews to a shared folder (a sketch of this worker follows the list).
- A separate reducer container merges the per-chunk top-K shards into the final ranked lists per finding.
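A minimal sketch of that worker: the HTTP Range streaming plus a per-chunk top-K heap. It pretends each slice is plain JSONL and scores on one signal; the real files are gzipped, so the actual pipeline also has to inflate the stream and re-stitch rows clipped at chunk edges. TOP_K is a stand-in, and score is the four-signal function sketched earlier.

```python
import heapq
import json
import requests

TOP_K = 100  # hypothetical shard size; the repo keeps top lists per signal

def worker(job):
    """Stream one ~500 MB byte range and return its top-K reviews."""
    url, start, end = job
    resp = requests.get(
        url,
        headers={"Range": f"bytes={start}-{end}"},  # fetch only this slice
        stream=True,
        timeout=600,
    )
    resp.raise_for_status()  # 206 Partial Content means the Range was honored

    top, leftover = [], b""
    for block in resp.iter_content(chunk_size=1 << 20):  # 1 MB network reads
        lines = (leftover + block).split(b"\n")
        leftover = lines.pop()  # the tail may be a partial row
        for raw in lines:
            try:
                review = json.loads(raw)
            except ValueError:
                continue  # rows clipped at chunk edges; the real code re-stitches these
            s = score(review.get("text", ""))["profanity_hits"]
            heapq.heappush(top, (s, review.get("text", "")))
            if len(top) > TOP_K:
                heapq.heappop(top)  # drop the lowest-scoring review
    return sorted(top, reverse=True)
```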
Map step: 3.21 minutes. Reduce step: 9.2 seconds. End to end under four minutes for 571 million reviews.
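The reduce step is just a flat merge of small shards, which is why it takes seconds. A sketch, assuming each worker wrote its top-K as a JSON-lines file of [score, review] pairs into the shared folder:

```python
import glob
import heapq
import json

def reduce_shards(shard_glob: str, k: int = 100):
    """Merge the per-chunk top-K shards into one final ranked list."""
    candidates = []
    for path in glob.glob(shard_glob):
        with open(path) as f:
            candidates.extend(json.loads(line) for line in f)
    # 545 shards x K rows each is tiny, so nlargest over all of them is cheap
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])
```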
The pipeline runs on Burla using remote_parallel_map(worker, jobs, func_cpu=1, func_ram=4, max_parallelism=1000, grow=True). In English: "ask for up to 1000 parallel workers, each with 1 CPU and 4 GB of RAM, and let the cluster grow to meet that demand." In practice the cluster peaked around 500 concurrent workers and held there for the run. Workers run on a stock python:3.12 Docker image, and Burla auto-installs my local Python packages onto each one. The shared output folder is a Google Cloud Storage path that every worker writes to like a network drive.
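Put together, the driver is short. A sketch, not the repo's exact code: CATEGORY_URLS stands in for the 34 per-category file URLs, plan_chunks is a hypothetical planner, worker is the function sketched above, and the remote_parallel_map call is the one quoted above (assuming Burla's top-level import):

```python
import requests
from burla import remote_parallel_map

CHUNK = 500 * 1024**2  # ~500 MB per byte-range chunk

def plan_chunks(urls):
    """Split each file into ~500 MB (url, start, end) byte ranges."""
    jobs = []
    for url in urls:
        size = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
        for start in range(0, size, CHUNK):
            jobs.append((url, start, min(start + CHUNK, size) - 1))
    return jobs

jobs = plan_chunks(CATEGORY_URLS)  # 545 chunks across the 34 files
remote_parallel_map(worker, jobs, func_cpu=1, func_ram=4, max_parallelism=1000, grow=True)
```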
(Disclosure: I work on Burla. The script and the live site are open source on GitHub. The dataset is the McAuley Lab's 2023 corpus on HuggingFace.)
Caveats worth being upfront about:
- Scoring is rule-based, not model-based. Word lists for strong, medium, and mild profanity, plus caps ratio, plus longest exclamation run. No sentiment model. That's deliberate: every score is reproducible and you can see exactly why a review got it.
- English-only. Reviews not in English get scored only by length, caps, and punctuation, because the word list is English. A multilingual sentiment model would do better here.
- Quoted titles leak in. A review of "Dick Tracy" can match the strong word list. There's a rescorer that penalizes capitalized-noun matches but it's imperfect.
- 2023 snapshot. The dataset is the McAuley Lab 2023 release, so it doesn't include reviews posted after mid-2023.
Repo with the full pipeline: https://github.com/Burla-Cloud/amazon-review-distiller
If anyone has a cleaner pattern for streaming huge HuggingFace datasets without materializing them locally, I'd love to hear it. I went with requests.get(..., stream=True) plus manual line splitting to keep the worker dependency surface tiny, but the datasets library probably has a cleaner Range-based path.