Ranked all 571M Amazon reviews from 2023 by category profanity rate. Video Games runs 6× the profanity rate of the cleanest category.
I read the McAuley Lab's full 2023 Amazon Reviews dataset (571,544,386 reviews, 275 GB on the HuggingFace CDN) and scored every single review on four simple signals: how many strong-profanity word hits it has, what fraction of it is in ALL CAPS, its longest run of consecutive exclamation marks, and how long it is. The question I started with was "how do people actually behave in Amazon reviews, and does the category they're reviewing change that?"
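For concreteness, the four signals are simple enough to fit in a few lines. A minimal sketch, not the repo's exact code: STRONG_WORDS here is a two-entry placeholder, the real strong/medium/mild word lists live in the repo.

```python
import re

STRONG_WORDS = {"damn", "hell"}  # placeholder; the real word lists live in the repo

def score(text: str) -> dict:
    """Score one review on the four rule-based signals."""
    words = re.findall(r"[a-z']+", text.lower())
    letters = [c for c in text if c.isalpha()]
    bang_runs = re.findall(r"!+", text)
    return {
        "profanity_hits": sum(w in STRONG_WORDS for w in words),
        "caps_ratio": sum(c.isupper() for c in text) / max(len(letters), 1),
        "max_bang_run": max(map(len, bang_runs), default=0),
        "length": len(text),
    }
```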
Live site, per-category breakdown, and the Wall of the loudest reviews: https://burla-cloud.github.io/amazon-review-distiller/
What surfaced:
- Video Games is the rowdiest category by a huge margin. 6.54% of video game reviews hit the strong-profanity list. Compare that to Gift Cards at 1.19% and Handmade at 1.08%. Movies & TV, CDs & Vinyl, Subscription Boxes, and Kindle Store fill out the top five. Cultural products attract feelings, consumer goods attract utility.
- Subscription Boxes is the angriest category. 15.89% of subscription box reviews are one-star. Almost 1 in 6. Charging people monthly for a curated surprise generates a lot of regret.
- The longest exclamation-mark run is 10,594 in a row. The review itself is two words ("love these") on a baby product. One person held one key down for a long time.
- The longest all-caps review is 1,169 words. Posted on a Mozart CD by a self-described disabled Vietnam veteran and Mozart scholar. He opens by apologizing for the caps (macular degeneration) and then keeps going, in caps, for the rest of those 1,169 words.
- Forty reviewers gave a product five stars and wrote zero or one word. One five-star review of a cherry cough drop was just "Taste." That's the whole text.
- Books, music, and games write essays. Gift card buyers write nothing. Average review length: CDs & Vinyl 428 chars, Books 423, Kindle Store 367, Digital Music 340, Video Games 308. Gift Cards is at the bottom by a wide margin. Culture gets words, utility gets silence.
Methodology, plain version:
- The dataset is 34 separate .jsonl.gz files on HuggingFace, one per Amazon category, totaling 275 GB. The usual workflow is to download all 275 GB to a laptop, then iterate. I didn't want to do that.
- The HuggingFace CDN supports HTTP Range requests: a worker can ask for "give me bytes 1,000,000,000 to 1,500,000,000 of this file" and get just that slice without downloading the whole file. I split the 34 files into 545 chunks of about 500 MB each, on byte-range boundaries.
- Each chunk runs on its own worker. The worker streams its byte range row by row, scores every review on the four signals, and writes the top-scoring reviews to a shared folder (a sketch of this worker follows the list).
- A separate reducer container merges the per-chunk top-K shards into the final ranked lists per finding.
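A minimal sketch of that worker: the HTTP Range streaming plus a per-chunk top-K heap. It pretends each slice is plain JSONL and scores on one signal; the real files are gzipped, so the actual pipeline also has to inflate the stream and re-stitch rows clipped at chunk edges. TOP_K is a stand-in, and score is the four-signal function sketched earlier.

```python
import heapq
import json
import requests

TOP_K = 100  # hypothetical shard size; the repo keeps top lists per signal

def worker(job):
    """Stream one ~500 MB byte range and return its top-K reviews."""
    url, start, end = job
    resp = requests.get(
        url,
        headers={"Range": f"bytes={start}-{end}"},  # fetch only this slice
        stream=True,
        timeout=600,
    )
    resp.raise_for_status()  # 206 Partial Content means the Range was honored

    top, leftover = [], b""
    for block in resp.iter_content(chunk_size=1 << 20):  # 1 MB network reads
        lines = (leftover + block).split(b"\n")
        leftover = lines.pop()  # the tail may be a partial row
        for raw in lines:
            try:
                review = json.loads(raw)
            except ValueError:
                continue  # rows clipped at chunk edges; the real code re-stitches these
            s = score(review.get("text", ""))["profanity_hits"]
            heapq.heappush(top, (s, review.get("text", "")))
            if len(top) > TOP_K:
                heapq.heappop(top)  # drop the lowest-scoring review
    return sorted(top, reverse=True)
```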
Map step: 3.21 minutes. Reduce step: 9.2 seconds. End to end under four minutes for 571 million reviews.
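The reduce step is just a flat merge of small shards, which is why it takes seconds. A sketch, assuming each worker wrote its top-K as a JSON-lines file of [score, review] pairs into the shared folder:

```python
import glob
import heapq
import json

def reduce_shards(shard_glob: str, k: int = 100):
    """Merge the per-chunk top-K shards into one final ranked list."""
    candidates = []
    for path in glob.glob(shard_glob):
        with open(path) as f:
            candidates.extend(json.loads(line) for line in f)
    # 545 shards x K rows each is tiny, so nlargest over all of them is cheap
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])
```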
The pipeline runs on Burla using remote_parallel_map(worker, jobs, func_cpu=1, func_ram=4, max_parallelism=1000, grow=True). In English: "ask for up to 1000 parallel workers, each with 1 CPU and 4 GB of RAM, and let the cluster grow to meet that demand." In practice the cluster peaked around 500 concurrent workers and held there for the run. Workers run on a stock python:3.12 Docker image, and Burla auto-installs my local Python packages onto each one. The shared output folder is a Google Cloud Storage path that every worker writes to like a network drive.
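Put together, the driver is short. A sketch, not the repo's exact code: CATEGORY_URLS stands in for the 34 per-category file URLs, plan_chunks is a hypothetical planner, worker is the function sketched above, and the remote_parallel_map call is the one quoted above (assuming Burla's top-level import):

```python
import requests
from burla import remote_parallel_map

CHUNK = 500 * 1024**2  # ~500 MB per byte-range chunk

def plan_chunks(urls):
    """Split each file into ~500 MB (url, start, end) byte ranges."""
    jobs = []
    for url in urls:
        size = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
        for start in range(0, size, CHUNK):
            jobs.append((url, start, min(start + CHUNK, size) - 1))
    return jobs

jobs = plan_chunks(CATEGORY_URLS)  # 545 chunks across the 34 files
remote_parallel_map(worker, jobs, func_cpu=1, func_ram=4, max_parallelism=1000, grow=True)
```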
(Disclosure: I work on Burla. The script and the live site are open source on GitHub. The dataset is the McAuley Lab's 2023 corpus on HuggingFace.)
Caveats worth being upfront about:
- Scoring is rule-based, not model-based. Word lists for strong, medium, and mild profanity, plus caps ratio, plus longest exclamation run. No sentiment model. That's deliberate: every score is reproducible and you can see exactly why a review got it.
- English-only. Reviews not in English get scored only by length, caps, and punctuation, because the word list is English. A multilingual sentiment model would do better here.
- Quoted titles leak in. A review of "Dick Tracy" can match the strong word list. There's a rescorer that penalizes capitalized-noun matches but it's imperfect.
- 2023 snapshot. The dataset is the McAuley Lab 2023 release, so it doesn't include reviews posted after mid-2023.
Repo with the full pipeline: https://github.com/Burla-Cloud/amazon-review-distiller
If anyone has a cleaner pattern for streaming huge HuggingFace datasets without materializing them locally, I'd love to hear it. I went with requests.get(..., stream=True) plus manual line splitting to keep the worker dependency surface tiny, but the datasets library probably has a cleaner Range-based path.