
Tag: MediaWiki

Shrinking/compressing a MediaWiki database

Lately, I haven’t had a lot of time to chase after spammers, so – despite anti-spam captchas and everything – a couple of my wikis have been overgrown with spam. One after the other, I’ve been closing them down to anonymous edits, even closing down user registration altogether, but for some of them a little too late.

The last couple of months my hosting expenses shot through the roof, because my Timber Investments Wiki database kept expanding to well over 14 GiB. So I kind of went into panic mode and I even made time for another one of my famous spam crackdowns—the first in many, many months.

The awfully inefficient bulk deletion of spam users

Most of this latest tsunami of spam was in the form of “fake” user pages filled with bullshit and links. The only process that I could think of to get rid of it was quite cumbersome. First, I made a special category for all legitimate users. From that, I created a simple text file (“realusers.txt”) with one user page name per line.

Then, I used Special:AllPages to get a list of everything in the User namespace. After struggling through all the paginated horror, I finally found myself with another copy-pasted text file (“unfilteredusers.txt”) that I could filter:

cp unfilteredusers.txt todelete.txt
cat realusers.txt | while read u; do
  sed -i -e "/$u/d" todelete.txt
done

(I’d like to know how I could have done this with less code, by the way.)
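
In hindsight, grep’s -f option would probably have done the same filtering in a single command (assuming none of the user page names contain regex metacharacters that happen to match other entries):

# treat every line of realusers.txt as a pattern; keep only the non-matching lines
grep -v -f realusers.txt unfilteredusers.txt > todelete.txt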

This filtered list, I fed to the deleteBatch.php maintenance script:

php maintenance/deleteBatch.php -u BigSmoke -r spam todelete.txt

By itself, this would only increase the size of MW’s history, so, as a last step, I used deleteArchivedRevisions.php to delete the full revision history of all deleted pages.
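
That script only does the actual deleting when given an explicit flag (at least in the MediaWiki versions of that era), so the invocation was something like:

php maintenance/deleteArchivedRevisions.php --delete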

This work-flow sucked so bad that I missed thousands of pages (I had to copy-paste the listings by hand, as I mentioned above) and had to redo the whole thing. This time, the mw_text table size shrunk from 11.5 GiB to about 10 GiB. Not enough. Even the complete DB dump was still way over 5 GiB [not to mention the MySQL process size, which remained stuck at around 15 GiB, something which I wouldn’t be able to solve even with the configuration settings mentioned below].

Enter $wgCompressRevisions and compressOld.php

The huge size of mw_text was at long last easily resolved by a MW setting that I had never heard of before: $wgCompressRevisions. Setting that, followed by an invocation of the compressOld.php maintenance script, took the mw_text table size down all the way from >10 GiB to a measly few MiB:

php maintenance/storage/compressOld.php
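
For reference, enabling the compression itself is a one-line change in LocalSettings.php (it requires PHP’s zlib support); compressOld.php then takes care of the revisions that were already stored uncompressed:

# LocalSettings.php: store new revision text gzip-compressed
$wgCompressRevisions = true;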

SELECT table_schema 'DB name', sum(data_length + index_length) / 1024 / 1024 "DB size in MiB"
FROM information_schema.TABLES
WHERE table_schema LIKE 'hard%'
GROUP BY table_schema;

+----------+----------------+
| DB name  | DB size in MiB |
+----------+----------------+
| hardhout |    41.88052750 | 
| hardwood |   489.10618973 | 
+----------+----------------+

But it didn’t really shrink, because of sweet, good, decent, old MySQL. 🙁 After all this action, the DB process was still huge (still ~15 GiB), far exceeding the combined reported database sizes. Apparently, MySQL’s InnoDB engine is much like our economy: it only allows growth, and if you want it to shrink, you have to stop it first, delete everything and then restart and reload.
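
In practice, that means rebuilding the InnoDB tablespace from a dump. Roughly, and assuming a single shared ibdata1 file rather than innodb_file_per_table (paths and init scripts differ per distribution), the procedure looks like this:

# 1. Dump all databases (the mysql schema itself is MyISAM and isn't the problem)
mysqldump --all-databases --routines > all-databases.sql

# 2. Drop every database that contains InnoDB tables, in my case e.g.:
mysql -e 'DROP DATABASE hardhout; DROP DATABASE hardwood;'

# 3. Stop MySQL and remove the shared tablespace plus its log files
/etc/init.d/mysql stop
rm /var/lib/mysql/ibdata1 /var/lib/mysql/ib_logfile*

# 4. Restart (InnoDB recreates a minimal ibdata1) and reload the dump
/etc/init.d/mysql start
mysql < all-databases.sql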

Future plans? Controlled access only?

One day I may reopen some wikis to new users with a combination of ConfirmAccount and RevisionDelete and such, but combating spam versus giving up on the whole wiki principle is a topic for some other day.

MediaWiki ConfirmEdit/QuestyCaptcha extension

Since I moved my LDAP wiki over from DokuWiki to MediaWiki, I’ve been buried by a daily torrent of spam. Just like with my tropical timber investments wiki, the ReCaptcha extension (with pretty intrusive settings) doesn’t seem to do much to stop this shitstream.

How do the spammers do this? Do they primarily trick visitors of other websites into solving these captchas for them, or do they employ spam sweatshops in third-world countries? Fuck them! I’m trying something new.

I’ve upgraded to the ConfirmEdit extension. (ReCaptcha has also moved into this extension.) This allows me to try different captcha types. The one I was most interested in is QuestyCaptcha, which allows me to define a set of questions which the user needs to answer. I’m now trying it out with the following question:

$wgCaptchaQuestions[] = array( 'question' => "LDAP stands for ...", 'answer' => "Lightweight Directory Access Protocol" );
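
For completeness’ sake: QuestyCaptcha also has to be selected as the active captcha class in LocalSettings.php; with the ConfirmEdit version from around that time, that looked roughly like this (file names may differ between versions):

require_once( "$IP/extensions/ConfirmEdit/ConfirmEdit.php" );
require_once( "$IP/extensions/ConfirmEdit/QuestyCaptcha.php" );
$wgCaptchaClass = 'QuestyCaptcha';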

I don’t think it’s a particularly good question, since it’s incredibly easy to Google. But, we’ll see, and in the meantime I’ll try to come up with one or two questions that are context-sensitive, yet easy enough to answer for anyone with some knowledge of LDAP. If you have an idea, please leave a comment.

How to make a wiki work: PALDAP

The first ever wiki I started was www.paldap.org. PALDAP stands for “PALDAP: A Lazy Directory Administrator’s Pal”. Yes, that’s a recursive acronym. Cute, ainnit? I actually registered the domain because it was the name of a crappy, abandoned LDAP administration tool that I had written in PHP, but I decided instead to set the domain up as a wiki to host some of my assorted experiences with LDAP, and OpenLDAP in particular.

I hardly ever bother with LDAP anymore, but the wiki remains, because cool URLs don’t change and it doesn’t cost me that much. AdSense income for the wiki is only marginal (€15 in over three years) because the wiki’s content is only marginally useful and the traffic (300 visitors/month) reflects that fact.

[…]

Six days have passed since I wrote the last paragraph. It’s a funny thing how writing can mess with your head. I was going to use PALDAP as an introduction to my struggle to make money off my wikis in general. Because the most promising of these wikis are my Hardwood Wikis and not my LDAP wiki, I wasn’t going to linger on it too much. But it’s a week later and some unexpected things have happened.

Ever since becoming more familiar with Semantic MediaWiki, I’ve often been toying with the idea of converting the DokuWiki installation that runs www.paldap.org into a Semantic MediaWiki installation. Yet, nothing ever happened. I no longer work with LDAP professionally, and most of the time I just kind of forgot that the site even existed. Until a week ago, when I started writing this post.

So, what happened? How do decisions happen? I have no idea. I’m not a neurologist. (I’m not even a sceintist; Hell, I can’t even spell “scientist”.)

What I have now are rationalizations for my decision, but the decision itself is quite clear: I can’t kill my darling, even though I never really properly cared for it. For the last four years or so, I had simply abandoned it on the grounds that it wasn’t costing me much anyway (it’s hosted at NearlyFreeSpeech.Net). Yes, the costs have gone up, but that’s just a rationalization. I could have just gone on ignoring the site’s existence without it ever making much of an impact on my cash flow. (Some years ago, I actually promised the site’s most active contributor never to take the site off-line.)

So, if I was being rational, I would have just left the site alone. But, I’m not a rational being. Increasingly less so, in fact. A happy fact, if you ask me.

[Image: The PALDAP logo, designed by Jeroen Dekker]

The looks

Anyway, I still haven’t told you what happened last week. I didn’t leave the site alone. I created a development version of the site based on Semantic MediaWiki. It’s fucking kick-ass. It looks awesome thanks to MediaWiki’s new Vector skin. But it looks even more awesome thanks to Jeroen Dekker. As we often do, we were hanging out at his place in a lazy haze, being generally unproductive but with random bursts of intelligent conversation and productivity. That day, I had been absent-mindedly hacking away on my new MediaWiki darling, and I was about to leave and jump on my bicycle when I mentioned that I could use a logo for PALDAP.

It was probably way past one in the morning already, but Jeroen was still in a creative mood from playing around with his new lighting set. All I can say about his creativity is that it was late and I hadn’t slept very long the night before (or the night before that, or the night before that…). Let’s just say that, besides his excellent gear, he didn’t have very good material to work with. I was feeling ugly and tired. Yet…

Rowan, January 7, 2010

Jeroen's creative genius is a compliment to my awesome facial features 🙂

He went into a kind of frenzy on his big-ass touch-screen and, coaxed along by me, he created the perfect offset for the boring technical subject that is LDAP.

The brains

In the meantime, since last weekend, I’ve started to assemble a logical structure of semantic properties (think of LDAP attributes or SQL fields) and templates (sort of like functions in MediaWiki) that’ll allow me to capture all the semantics related to LDAP and the ecosystem around it.

The booty

I still don’t believe that PALDAP has huge revenue potential, but hosting costs have increased, and if I can get the website to awaken from its winter sleep, maybe it’ll at least start paying for itself again. Not that I really care, honestly. Somehow it’s just masturbatorily satisfying to use the expressiveness of RDF to capture the semantics of LDAP. What I like about it is that the wiki concept (and especially the semantic wiki concept) is a very tight fit for technical documentation. Another thing I like about working on a wiki about a technical subject is that the wiki has a technical audience. I mean, there’s a reason that the visitors of my Hardwood Investment Wikis click on all those expensive links, and that reason is not the kind of technical insight that leads users to click the edit button and actually contribute content.

In fact, even with the old DokuWiki version of the site, much of the content was actually created by other users (most of it by the same user, called brontolo). If the community of my Hardwood Wikis worked this well… Let’s just say that I could remain in retirement for a while then.

So, even if this’ll just be an exercise in effective community building/plumbing rather than a way to make advertising income easily, it’ll still be effective as an exercise. I’m going to follow my intuition on this one and see how successful it’ll become and how much time it’ll take.

Fuck, this post sucks, but it sure does help me. Don’t ask me how, but it does. Kinda.

MediaWiki thumb.php and rewrite rules

In May last year, I created an empty draft for this post because, around that time, I had gone through quite some effort before I got thumbnails for foreign file repos working just right. Now I’m taking a dive into my MediaWiki working dirs in preparation for setting up a separate development environment, so it’s a good moment to rehash that past experience (almost as good as if I’d done it right away).

This is how I configured the foreign file repo to be able to use images uploaded to the English wiki from the Dutch wiki:

$wgHashedUploadDirectory = false;
 
$wgForeignFileRepos[] = array(
    'class' => 'ForeignDBRepo',
    'name' => 'en',
    'url' => "http://wiki.hardwood-investments.net/media",
    'hashLevels' => 0,
    //'thumbScriptUrl' => "http://wiki.hardwood-investments.net/thumb.php",
    'transformVia404' => true,//!$wgGenerateThumbnailOnParse,
    'dbType' => $wgDBtype,
    'dbServer' => $wgDBserver,
    'dbUser' => $wgDBuser,
    'dbPassword' => $wgDBpassword,
    'dbName' => 'hardwood',
    'tablePrefix' => 'mw_',
    'hasSharedCache' => false,
    'descBaseUrl' => 'http://wiki.hardwood-investments.net/Image:',
    'fetchDescription' => false
);

To make thumbnails be generated by thumb.php on request, I added the following to my .htaccess at the other end (and vice versa, because the Dutch wiki actually contains most of the images):

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^media/thumb/([^/]+)/([0-9]+)px-.*$ /thumb.php?f=$1&width=$2 [L,QSA]
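
To illustrate with a made-up file name: a request for a 300-pixel-wide thumbnail of Teak.jpg only hits thumb.php when the thumbnail doesn’t already exist as a file or directory:

# requested thumbnail URL
/media/thumb/Teak.jpg/300px-Teak.jpg
# rewritten on the fly to
/thumb.php?f=Teak.jpg&width=300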

Tracking MediaWiki in a Subversion vendor branch

Vendor branches are the proper™ way of merging upstream changes into your web application installations. In Subversion, managing vendor branches isn’t as easy as it is in Git. Still, vendor branches make it much easier to track upstream.

Even before I first deployed the Omega Research Wiki, I was already using svn to track changes to my MediaWiki installation. However, for upgrading from one upstream release to the next, I used diff and patch. This isn’t the most reliable of methods, as is demonstrated by diffing a fresh MediaWiki download against the actual files in my repo (which are supposed to belong to the same version).

What’s basically a shortcoming of svn is that I can’t just say:

svn merge http://svn.wikimedia.org/svnroot/mediawiki/tags/REL1_15_0/phase3/ \
http://svn.wikimedia.org/svnroot/mediawiki/tags/REL1_15_1/phase3/ .

This would have been incredibly helpful, because now I’m keeping a vendor branch not because of local modifications to upstream, but just to be able to merge cleanly.

In Subversion, maintaining a vendor branch by hand is quite some work, because you need to do a checkout first before you can import each version. (My working copy is normally a checkout of /trunk, not of /vendor/mediawiki.) Luckily, Subversion is distributed with a handy Perl script, svn_load_dirs.pl, which can do most of the heavy lifting.

Still, I didn’t feel like having to do too many manual steps, such as typing in the painfully long URLs for merging, so I decided to wrap the whole process into a nice little Bash script:

cat upgrade-mediawiki.sh
#!/bin/bash
 
svn_repo_url=file:///var/svn/wiki.omega-research.org/vendor/mediawiki/
 
merge=1
while [ $# -gt 2 ]; do # Process extra options
    case "$1" in
        --no-merge)
            merge=0
            shift
            ;;
    esac
done

if [ $# -ne 2 ]; then
    echo "Usage: $0 [--no-merge] version version"
    exit 1
fi

last_version=$1
new_version=$2

tmp_dir=`mktemp -d`
cd $tmp_dir
for version in $*; do
    echo "Downloading and extracting MediaWiki version $version..."

    if [ -z "`svn ls $svn_repo_url|grep $version`" ]; then
        branch=`echo $version |sed -e 's/\.[0-9]\+$//'`
        download_file="mediawiki-$version.tar.gz"
        download_url="http://download.wikimedia.org/mediawiki/$branch/$download_file"

        wget $download_url || { echo "Downloading $download_file failed" >&2; exit 1; }
        tar --extract --ungzip --transform 's/^mediawiki-//' --file $download_file
        svn_load_dirs.pl $svn_repo_url -t $version current $version
    fi
done

cd - # back to the svn working copy
if [ $merge == '1' ]; then
    svn merge "$svn_repo_url$last_version" "$svn_repo_url$new_version" .
fi
exit 0

The script only downloads and imports each specified version if that version doesn’t already exist in /vendor/mediawiki/. Also, because it has a --no-merge option, you can download all the old versions of MediaWiki that you’ve ever used, so that you can go back and compare old versions of your installation with the factory version. Of course, this is only useful if you were already tracking your installation in svn at the time, and even then not really. 😉

Anyway, the important thing is that you can use the script to download the version you’re running now and the version you want to upgrade to. I wanted to make a big jump, from 1.11.1 to 1.15.0 (I had put off the upgrade for a long time, because I first wanted to learn more about vendor branches):
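
./upgrade-mediawiki.sh 1.11.1 1.15.0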

Of course, as always, you need to check whether the merge went well. My own results were a vivid demonstration of the unreliability of my previous method:

svn st|grep '^C'
C      languages/messages/MessagesKrj.php
C      languages/messages/MessagesWar.php
C      languages/messages/MessagesSe.php
C      languages/messages/MessagesFrc.php

Luckily, these were all files I was sure I hadn’t modified:

for i in `svn st|grep '^C'|sed -e 's/^C *//'`; do mv $i.merge-right.r32 $i; svn resolved $i; done
Resolved conflicted state of 'languages/messages/MessagesKrj.php'
Resolved conflicted state of 'languages/messages/MessagesWar.php'
Resolved conflicted state of 'languages/messages/MessagesSe.php'
Resolved conflicted state of 'languages/messages/MessagesFrc.php'

Apparently, a previous upgrade hadn’t turned my MediaWiki installation exactly into 1.11.1.

Editing Wikipedia/MediaWiki articles with VIM

Wikipedia has a number of tips for people who like to edit Wikipedia articles with VIM.

Of course, it all starts with opening an article in VIM. How I do this depends on my mood: I can open it by firing up GVIM from Vimperator, or I can mount the wiki with WikipediaFS. I prefer the latter.

Spell-checking

One tool that’s very convenient for editing large sections of prose is VIM’s built-in spell-checker. Example:

set textwidth=0
set spell spelllang=nl

Wrapping, matching and folding

There’s some more useful stuff that I copied to “ftplugin/mediawiki.vim” in “$HOME/.vim/“:

" Wikipedia articles often only have line-breaks at the end of each paragraph,
" a situation Vim by default doesn't handle as other text editors.
setlocal wrap linebreak
setlocal textwidth=0

" No auto-wrap at all
setlocal formatoptions-=tc formatoptions+=l
if v:version >= 602 | setlocal formatoptions-=a | endif

" Make navigation more amenable to the long wrapping lines.
noremap  k gk
noremap  j gj
noremap  <Up> gk
noremap  <Down> gj
noremap  0 g0
noremap  ^ g^
noremap  $ g$
noremap D dg$
noremap C cg$
noremap A g$a
inoremap <Up> <C-O>gk
inoremap <Down> <C-O>gj

" utf-8 should be set if not already done globally
setlocal fileencoding=utf-8
setlocal matchpairs+=<:>

" Treat lists, indented text and tables as comment lines and continue with the
" same formatting in the next line (i.e. insert the comment leader) when hitting
"  or using "o".
setlocal comments=n:#,n:*,n:\:,s:{\|,m:\|,ex:\|}
setlocal formatoptions+=roq

" match HTML tags (taken directly from $VIM/ftplugin/html.vim)
if exists("loaded_matchit")
    let b:match_ignorecase=0
    let b:match_skip = 's:Comment'
    let b:match_words = '<:>,' .
    \ '<\@<=[ou]l\>[^>]*\%(>\|$\):<\@<=li\>:<\@<=/[ou]l>,' .
    \ '<\@<=dl\>[^>]*\%(>\|$\):<\@<=d[td]\>:<\@<=/dl>,' .
    \ '<\@<=\([^/][^ \t>]*\)[^>]*\%(>\|$\):<\@<=/\1>'
endif

" Other useful mappings
" Insert a matching = automatically while starting a new header.
inoremap <buffer> = <C-R>=(getline('.')==''\|\|getline('.')=~'^=\+$')?"==\<Left>":"="<CR>

" Enable folding based on ==sections==
:set foldexpr=getline(v:lnum)=~'^\\(=\\+\\)[^=]\\+\\1\\(\\s*\\)\\=\\s*$'?\">\".(len(matchstr(getline(v:lnum),'^=\\+'))-1):\"=\"
:set fdm=expr

To have this automatically executed when opening files ending in .mw:

"autocmd BufRead,BufNewFile *.mw setfiletype mediawiki" > $HOME/.vim/ftdetect/mediawiki.vim

Syntax highlighting

Additionally, if you want to have syntax highlighting for MediaWiki articles, you can save this syntax file to $HOME/.vim/syntax/mediawiki.vim.

Writing a MediaWiki Gravatars template

This is a tutorial for MediaWiki users who want to learn how to create and use templates.

One day, I was looking for a MediaWiki extension to include a Gravatar (globally recognized avatar) on my user page on one of my MediaWiki installations. I did find an extension, but I figured it would be no better than writing a template myself.

A MediaWiki template

In a MediaWiki installation, if you find yourself repeating the same content on a lot of pages, you can turn that repetitive content into a template. This template can then be included in any page where you’d otherwise duplicate the content. In fact, you can include any page, but most templates will live in a separate namespace: Template.

To include a template, put the template name inside double curly braces, like this: {{My template}}. For templates inside the Template namespace (most templates), you can omit the namespace part. This means that if your template page name is Template:My template, you will call your template as in the previous example ({{My template}}).

Template parameters

Parameterization is a fancy word for what programmers do when they don’t want a function to return the same value on each invocation. They add parameters to the function. MediaWiki templates support parameters.

If you have a template called Template:My template which contains “I say this very often, but not {{{1}}}.” and you invoke it with “{{My template|quite often enough}}”, you’ll get the text “I say this very often, but not quite often enough”.

Template parameters are preceded by a pipe symbol (“|”). If you add more parameters, each in turn has to be preceded by the same symbol. Whitespace (even newlines) may be used before or after the pipe to increase readability.

To indicate where in the template you want the parameter value to appear, triple curly braces are used with the number of the parameter inside.

Named parameters

When your template accepts a lot of parameters, you might find it convenient to name the parameters instead of numbering them. Named and numbered parameters can be mixed (the unnamed values are numbered in the order in which they appear):

{{Fruit|apple|color=red}}

Referencing the named parameter in the template is as easy as referencing a numbered parameter:

When I need vitamins, I can eat an {{{1}}}. This fruit is {{{color}}}.

Default parameter values

Parameters can be made optional, by adding a default value:

When I need vitamins, I can eat an {{{1|orange}}}. This fruit is {{{color|orange}}}.

Now, you can call the Fruit template with no arguments if you just want an orange.
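
So, with the Fruit template defined as above (the version with the defaults), the two calls expand like this:

{{Fruit}}
→ When I need vitamins, I can eat an orange. This fruit is orange.

{{Fruit|apple|color=red}}
→ When I need vitamins, I can eat an apple. This fruit is red.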

Gravatar API

Gravatars are retrieved by using a very simple URL API. The URL is constructed using the MD5 sum of your email address:

echo -n $EMAIL|md5sum
77eca8beb1df91ca21e6c781e62630f8

http://www.gravatar.com/avatar/77eca8beb1df91ca21e6c781e62630f8

Optional parameters can be given in the URL. For now, I’m only interested in s (for size). This defaults to 80 pixels, but I want a bigger image:

http://www.gravatar.com/avatar/77eca8beb1df91ca21e6c781e62630f8?s=200

A working Gravatar template

On my Hardwood Investments Wiki, I have made a template that is meant to be used on a user page (such as my own) to display a Gravatar similarly to how captioned images are normally formatted:

<noinclude>
This template shows the Gravatar for the e-mail address belonging to the given MD5 sum.

== Example ==

<pre>
{{User Gravatar|41165a7e7126d616a0ae0762e00718e2}}
</pre>

</noinclude><includeonly><div style="margin: 0 0 1em 1em; float: right; border: 1px solid #ccc; background-color: #f9f9f9; padding: 3px;">
<div style="border: 1px solid #ccc;">
http://www.gravatar.com/avatar/{{{1}}}?s=200.jpg
</div>{{PAGENAME}}
</div></includeonly>

This template introduces two new concepts: the <noinclude> section is only rendered when viewing the template page itself, while the <includeonly> section is only used in pages that include the template; and {{PAGENAME}} is a magic word, used here to display the username as the image caption.

Now that I’ve demonstrated a practical application of MediaWiki templates, I can finally publish this draft. 🙂

Mounting a MediaWiki installation with WikipediaFS for FUSE

Another one of those I-wish-I-had-blogged-about-this moments: I wanted to bulk-edit some pages in my Dutch Hardwood Wiki and I knew I had previously done some scripting using a FUSE filesystem for MediaWiki. However, I didn’t remember the name of the filesystem I had used before. Searching for “mediawiki fuse” didn’t help in any way.

I found the link again in the Wikipedia meta-space. Now that I know that it’s called WikipediaFS and not something like MediaWikiFS, I remember that this was the exact same problem as the last time when I was looking for this. But, alas, I’ve blogged about it now, so now I can safely forget the name again. 🙂

MediaWiki problems with MySQL VARBINARY padding

I have been wanting to protect a certain page on my Dutch 15Monkeys wiki for weeks. I could, of course, install the ReCaptcha extension (the best Captcha there is), but it’s just that one page that’s being spammed. So far, I’ve been unsuccessful in regaining my Sysop rights for that wiki, and I need those rights if I’m going to protect the page.

In the previous post about this problem, I reported how there appeared to be a padding problem caused by a MySQL upgrade. Fixing the padding didn’t solve the problem, however. If I look at the user list in MediaWiki, I notice that MediaWiki doesn’t seem to like it that MySQL returns all the 0x00 padding characters on SELECT:

[Image: 15Monkeys MediaWiki NL user list with unrecognized characters]

[Image: The same problem in the EN wiki]

BigSmoke ‎(bureaucrat������, sysop�����������)

A repair was suggested by the check operation in phpMyAdmin (CHECK TABLE `mw_nl_user_groups`). I couldn’t find a repair operation (only OPTIMIZE), so I tried that and was told that it went OK. In the database structure view, the repair option was available; but when I tried to repair all the tables belonging to this installation, I was told for this particular table (and one other) that “The storage engine for the table doesn’t support repair”. I found out that the reason this error didn’t appear the first time is that it appears quite randomly. Sometimes it just fails silently, in true MySQL style.

Googling for the error message produced the advice to dump and recreate the table. Then, when recreating the table from my backup didn’t work, I found out that REPAIR really isn’t supported for InnoDB. Only CHECK is. If you want to repair an InnoDB table, you use OPTIMIZE or you have to restart mysqld with the innodb_force_recovery option enabled.
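
For reference, that last option goes in the [mysqld] section of my.cnf; the recovery level can be raised from 1 up to 6, but higher levels are increasingly drastic, so you start low:

[mysqld]
innodb_force_recovery = 1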

Then it turned out that I had misinterpreted the MySQL documentation for the VARBINARY type, although I still can’t figure out how I was supposed to deduce this from said documentation. Anyway, if I UPDATE the rows, all I can achieve is that whitespace is converted to \0 characters. But if I reinsert them without the padding, it does work. 😕 Go figure…
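
For the record, the “reinsert” that did the trick amounted to something like this (the user ID 1 is just an example here; ug_user and ug_group are MediaWiki’s standard user_groups columns, with my mw_nl_ table prefix):

-- throw away the rows with the padded group names...
DELETE FROM mw_nl_user_groups WHERE ug_user = 1;
-- ...and reinsert them without the padding
INSERT INTO mw_nl_user_groups (ug_user, ug_group)
  VALUES (1, 'bureaucrat'), (1, 'sysop');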
