As of late, I haven’t had a lot of time to chase after spammers, so, despite anti-spam captchas and everything, a couple of my wikis have become overgrown with spam. One after the other I’ve been closing them down to anonymous edits, even closing down user registration altogether, but some a little too late.
The last couple of months my hosting expenses shot through the roof, because my Timber Investments Wiki database kept expanding to well over 14 GiB. So I kind of went into panic mode and I even made time for another one of my famous spam crackdowns, the first in many, many months.
The awfully inefficient bulk deletion of spam users
Most of this latest tsunami of spam was in the form of “fake” user pages filled with bullshit and links. The only process that I could think of to get rid of it was quite cumbersome. First, I made a special category for all legitimate users. From that I created a simple text file (“realusers.txt”) with one user page name per line.
cp unfilteredusers.txt todelete.txt
cat realusers.txt | while read u
do
    sed -i -e "/$u/d" todelete.txt
done
(I’d like to know how I could have done this with less code, by the way.)
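One possible shorter route, offered as a sketch rather than a drop-in replacement: grep can read its patterns from a file and invert the match, which does the whole filter in one pass. The function name filter_users is made up for illustration. Note that this version compares whole lines as literal strings (-x and -F), whereas the sed loop above treats each name as a regular expression matched anywhere in the line, so it assumes both files hold exactly one page name per line, written the same way.

```shell
# filter_users ALL_FILE KEEP_FILE
# Print every line of ALL_FILE that does not appear in KEEP_FILE.
#   -v  invert the match (keep the lines that do NOT match)
#   -x  a pattern must match the whole line, so "Bob" does not hit "Bobby"
#   -F  treat patterns as fixed strings, so dots or slashes in names are safe
#   -f  read the patterns from a file, one per line
filter_users() {
    grep -v -x -F -f "$2" "$1"
}
```

With the file names from above, that would be: filter_users unfilteredusers.txt realusers.txt > todelete.txt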
This filtered list, I fed to the deleteBatch.php maintenance script:
php maintenance/deleteBatch.php -u BigSmoke -r spam todelete.txt
By itself, this would only increase the size of MW’s history, so, as a last step, I used deleteArchivedRevisions.php to delete the full revision history of all deleted pages.
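The two steps can be chained in a small wrapper, sketched here under a few assumptions: the function name purge_spam_pages and the PHP override are invented for illustration, the MediaWiki directory is passed in as an argument, and the --delete flag reflects how deleteArchivedRevisions.php behaves in the MediaWiki versions I know of (without it, the script only reports what it would remove). Check the script's --help on your own install before trusting this.

```shell
# purge_spam_pages MW_DIR LIST_FILE
# Delete every page named in LIST_FILE, then purge the archived revisions.
# Set PHP=echo to print the commands instead of running them.
purge_spam_pages() {
    mw_dir=$1
    list=$2
    php_bin=${PHP:-php}

    # Step 1: delete the listed pages; their revisions move to the archive.
    "$php_bin" "$mw_dir/maintenance/deleteBatch.php" -u BigSmoke -r spam "$list" || return 1

    # Step 2: remove the archived revisions for good.
    "$php_bin" "$mw_dir/maintenance/deleteArchivedRevisions.php" --delete
}
```

Running it as PHP=echo purge_spam_pages /path/to/wiki todelete.txt first gives a preview of both commands; /path/to/wiki is a placeholder.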
This workflow sucked so badly that I missed thousands of pages (I had to copy-paste these listings by hand) and had to redo it. This time, the mw_text table size shrank from 11.5 GiB to about 10 GiB. Not enough. Even the complete DB dump was still way over 5 Gig [not to mention the process size, which remained stuck at around 15 GiB, something which I wouldn't be able to solve even with the configuration settings mentioned after this].
Enter $wgCompressRevisions and compressOld.php
The huge size of mw_text was at long last easily resolved by a MW setting that I had never heard about before: $wgCompressRevisions. Setting that, followed by an invocation of the compressOld.php maintenance script took the mw_text table size down all the way from >10 GiB to a measly few MiB:
SELECT table_schema 'DB name',
       sum(data_length + index_length) / 1024 / 1024 "DB size in MiB"
FROM information_schema.TABLES
WHERE table_schema LIKE 'hard%'
GROUP BY table_schema;
+----------+----------------+
| DB name  | DB size in MiB |
+----------+----------------+
| hardhout |    41.88052750 |
| hardwood |   489.10618973 |
+----------+----------------+
But it didn’t really, because of sweet, good, decent, old MySQL. After all this action, the DB process was still huge (still ~15 GiB). This far exceeded the combined reported database sizes. Apparently, MySQL’s InnoDB engine is much like our economy. It only allows growth and if you want it to shrink, you have to stop it first, delete everything and then restart and reload.
Future plans? Controlled access only?
One day I may reopen some wikis to new users with a combination of ConfirmAccount and RevisionDelete and such, but combatting spam versus giving up on the whole wiki principle is a topic for some other day.