Smokes your problems, coughs fresh air.

Tag: spam

The terrible state of my tech

It’s a bit pathetic how poorly I have been taking care of my own tech ever since I started studying. What’s worse: this hasn’t improved since I stopped my studies for other ventures. I think the reason for this is exceedingly simple: I don’t like to do work-like things I don’t write about; luckily, I love to structure my goals in writing, and this blog still hasn’t completely collapsed under my neglect.

If I love writing about stuff I do, have I simply not done anything in the past few years? Well, I’ve done things. I’ve even written about them too, in other places, mostly on paper. It’s just that, despite working for a tech company for the last couple of years, I have done hardly a thing about ridding myself of my private technical debt. Or is it because of working at a tech company, where I’m hard at work fighting technical debts in Python/Django projects and documenting the progress in Redmine?

I won’t mention the reasons why I want to spend time on this weird, hodgy-podgy blog again. Let me just say that the main motivation is not guilt for having created a technical debt. And the actual reasons are better suited for other posts at another time.

| Domain | Problem | Fixes (short-term to long-term) |
| | Spam ✔ | 2017-07-18 Disable comments and ping/trackbacks on new posts. 2017-07-18 Disable comments and ping/trackbacks on old posts. 2017-07-19 Remove spam. |
| | Security | Upgrade WordPress; svn:external; Automatically upgrade WordPress |
| | Non-responsive | Make responsive |
| | Outdated | 2018-10-28 Reduce and update content |
| * | Huge hosting costs | Find sinkhole between my NFSN accounts; 2019-01-26 Move to cheap solid VM host (TransIP? Tilaa) |
| | Security | Move personal posts to and replace with redirects; Replace blog with static rendering of blog |
| opschoot | Wheezy fan | Replace or re-attach fan. |
| | Neglected backups | Backup monitoring: opschoot should register itself when online and I should then be nagged if I don’t back up. |
| butler | Legacy | Move files somewhere else (public?) |

Shrinking/compressing a MediaWiki database

As of late, I haven’t had a lot of time to chase after spammers, so – despite anti-spam captchas and everything – a couple of my wikis have become overgrown with spam. One after the other I’ve been closing them down to anonymous edits, even closing down user registration altogether, but some a little too late.

The last couple of months my hosting expenses shot through the roof, because my Timber Investments Wiki database kept expanding to well over 14 GiB. So I kind of went into panic mode and I even made time for another one of my famous spam crackdowns—the first in many, many months.

The awfully inefficient bulk deletion of spam users

Most of this latest tsunami of spam was in the form of “fake” user pages filled with bullshit and links. The only process that I could think of to get rid of it was quite cumbersome. First, I made a special category for all legitimate users. From that I created a simple text file (“realusers.txt”) with one user page name per line.

Then, I used Special:AllPages to get a list of everything in the User namespace. After struggling through all the paginated horror, I finally found myself with another copy-pasted text file (“unfilteredusers.txt”) that I could filter:

cp unfilteredusers.txt todelete.txt
cat realusers.txt | while read -r u; do
  sed -i -e "/$u/d" todelete.txt
done

(I’d like to know how I could have done this with less code, by the way.)
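
One possibly shorter way to do the same filtering is a single grep call. This is only a minimal sketch, assuming a POSIX-style grep and assuming that treating each real user name as a fixed substring (roughly what the sed loop does) is what’s wanted. The sample user names and the temporary directory are made up for illustration.

```shell
# Hypothetical sample data, only here so the sketch runs on its own.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' 'User:Alice' 'User:Bob' > realusers.txt
printf '%s\n' 'User:Alice' 'User:Alice/Notes' 'User:BuyCheapPills' \
  'User:Bob' 'User:CasinoBonus' > unfilteredusers.txt

# -f: read patterns from realusers.txt, -F: treat them as fixed strings
# (no regex surprises), -v: keep only the lines that match none of them.
grep -vFf realusers.txt unfilteredusers.txt > todelete.txt

cat todelete.txt
```

Adding -x would match whole lines only, which is stricter but would also put subpages of real users (like User:Alice/Notes here) on the delete list. An empty line in realusers.txt matches everything, so it’s worth checking for one before trusting the result.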

This filtered list, I fed to the deleteBatch.php maintenance script:

php maintenance/deleteBatch.php -u BigSmoke -r spam todelete.txt

By itself, this would only increase the size of MW’s history, so, as a last step, I used deleteArchivedRevisions.php to delete the full revision history of all deleted pages.

This work-flow sucked so bad that I missed thousands of pages (I had to copy-paste these listings by hand, as I mentioned above), and had to redo it. This time, the mw_text table size shrank from 11.5 GiB to about 10 GiB. Not enough. Even the complete DB dump was still way over 5 Gig [not to mention the process size which remained stuck at around 15 GiB, something which I wouldn’t be able to solve even with the configuration settings mentioned after this].

Enter $wgCompressRevisions and compressOld.php

The huge size of mw_text was at long last easily resolved by a MW setting that I had never heard about before: $wgCompressRevisions. Setting that, followed by an invocation of the compressOld.php maintenance script took the mw_text table size down all the way from >10 GiB to a measly few MiB:

php maintenance/storage/compressOld.php

SELECT table_schema 'DB name', sum(data_length + index_length) / 1024 / 1024 "DB size in MiB"
FROM information_schema.TABLES
WHERE table_schema LIKE 'hard%'
GROUP BY table_schema;

| DB name  | DB size in MiB |
| hardhout |    41.88052750 | 
| hardwood |   489.10618973 | 

But it didn’t really, because of sweet, good, decent, old MySQL. 🙁 After all this action, the DB process was still huge (still ~15 GiB). This far exceeded the combined reported database sizes. Apparently, MySQL’s InnoDB engine is much like our economy. It only allows growth and if you want it to shrink, you have to stop it first, delete everything and then restart and reload.

Future plans? Controlled access only?

One day I may reopen some wikis to new users with a combination of ConfirmAccount and RevisionDelete and such, but combatting spam versus giving up on the whole wiki principle is a topic for some other day.

MediaWiki ConfirmEdit/QuestyCaptcha extension

Since I moved my LDAP wiki over from DokuWiki to MediaWiki, I’ve been buried by a daily torrent of spam. Just like with my tropical timber investments wiki, the ReCaptcha extension (with pretty intrusive settings) doesn’t seem to do much to stop this shitstream.

How do the spammers do this? Do they primarily trick visitors of other websites into solving these captchas for them or do they employ spam-sweatshops in third-world countries? Fuck them! I’m trying something new.

I’ve upgraded to the ConfirmEdit extension. (ReCaptcha has also moved into this extension.) This allows me to try different Captcha types. The one I was most interested in is QuestyCaptcha, which allows me to define a set of questions which the user needs to answer. I’m now trying it out with the following question:

$wgCaptchaQuestions[] = array( 'question' => "LDAP stands for ...", 'answer' => "Lightweight Directory Access Protocol" );

I don’t think it’s a particularly good question, since it’s incredibly easy to Google. But we’ll see, and in the meantime I’ll try to come up with one or two questions that are context-sensitive, yet easy enough to answer for anyone with some knowledge of LDAP. If you have an idea, please leave a comment.

Disabling Zimbra’s spam learning

Zimbra learns ham and spam by having it sent to certain mailboxes. For our setup, this doesn’t work (easily), because our server is configured to always send mail to another SMTP server and not do any local delivery. I did that because our Zimbra server is not actually on the domain it thinks it is on.

To disable the learning accounts, I did this:

zmprov mcf zimbraSpamIsSpamAccount ''
zmprov mcf zimbraSpamIsNotSpamAccount ''
zmcontrol stop
zmcontrol start

I didn’t delete the accounts, so I can enable it later.

To enable it, I guess I have to configure these two accounts on our hosting provider’s servers, fetch their mail and deliver it to Zimbra, and then it should work. I’ll do that some time…


If you’ve never heard of reCAPTCHA before, reCAPTCHA is a free CAPTCHA service that helps to digitize books, newspapers and old time radio shows. I’m using reCAPTCHA on this and other blogs to protect myself from automated spam comments. I’m also using it on some of my MediaWiki sites to protect myself from wiki spam.

The great thing about reCAPTCHA is that it solves two problems at once. First, the OCR problem: it uses user time that would otherwise have been wasted solving meaningless CAPTCHAs to aid the digitization process of public texts, to fill in the gaps where OCR fails. Second, by doing so, it solves the automated spam problem by challenging website visitors to prove that they’re human. We can be pretty confident that someone is human if they’re able to recognize characters that the best OCR software cannot.

Anyway, enough about the serious stuff. I’m blogging about this now because Wiebe sent me a few links to funny reCAPTCHA combinations. Here’s a nice example (from the I Am Not A Robot weblog):

Johnson Chorus

© 2024 BigSmoke
