<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>BigSmoke &#187; CSV</title>
	<atom:link href="http://blog.bigsmoke.us/tag/csv/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.bigsmoke.us</link>
	<description>Smokes your problems, coughs fresh air.</description>
	<lastBuildDate>Sat, 04 Feb 2012 18:03:39 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>PHP fgetcsv() behavior on empty lines</title>
		<link>http://blog.bigsmoke.us/2009/09/05/php-fgetcsv</link>
		<comments>http://blog.bigsmoke.us/2009/09/05/php-fgetcsv#comments</comments>
		<pubDate>Sat, 05 Sep 2009 08:55:13 +0000</pubDate>
		<dc:creator>Rowan Rodrik</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[fgetcsv]]></category>
		<category><![CDATA[PHP]]></category>

		<guid isPermaLink="false">http://blog.bigsmoke.us/?p=759</guid>
		<description><![CDATA[]]></description>
			<content:encoded><![CDATA[<p>The PHP documentation for <a href="http://ww.php.net/fgetcsv"><tt>fgetcsv()</tt></a> states that <q cite="http://ww.php.net/fgetcsv">A blank line in a CSV file will be returned as an array comprising a single null field, and will not be treated as an error. </q> Here&#8217;s a quick demonstration of this behavior.</p>

<p><tt>fgetcsv.php</tt>:</p>

<pre class="php"><span style="color: #000000; font-weight: bold;">&lt;?php</span>
&nbsp;
<span style="color: #b1b100;">while</span> <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$fields</span> = <a href="http://www.php.net/fgetcsv"><span style="color: #000066;">fgetcsv</span></a><span style="color: #66cc66;">&#40;</span>STDIN, <span style="color: #cc66cc;">0</span>, <span style="color: #ff0000;">';'</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
  <a href="http://www.php.net/print_r"><span style="color: #000066;">print_r</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$fields</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<a href="http://www.php.net/exit"><span style="color: #000066;">exit</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">0</span><span style="color: #66cc66;">&#41;</span>;</pre>

<p>Execute the script and feed it some CSV with empty lines:</p>

<pre class="bash">php -q fgetcsv.php
<span style="color: #ff0000;">&quot;Veld 1&quot;</span>;<span style="color: #ff0000;">&quot;Veld 2&quot;</span>;<span style="color: #ff0000;">&quot;Veld 3&quot;</span>;;<span style="color: #ff0000;">&quot;Veld 5&quot;</span>
&nbsp;
<span style="color: #ff0000;">&quot;Field 1&quot;</span>;;<span style="color: #ff0000;">&quot;Field 3&quot;</span>;<span style="color: #ff0000;">&quot;Field 4&quot;</span>;
;;;;
;<span style="color: #ff0000;">&quot;Campo 2&quot;</span>;;;<span style="color: #ff0000;">&quot;Campo 5&quot;</span></pre>

<p>After pressing <kbd>Ctrl+D</kbd>, I&#8217;m presented with the following output:</p>

<pre>
Array
(
    [0] => Veld 1
    [1] => Veld 2
    [2] => Veld 3
    [3] => 
    [4] => Veld 5
)
Array
(
    [0] => 
)
Array
(
    [0] => Field 1
    [1] => 
    [2] => Field 3
    [3] => Field 4
    [4] => 
)
Array
(
    [0] => 
    [1] => 
    [2] => 
    [3] => 
    [4] => 
)
Array
(
    [0] => 
    [1] => Campo 2
    [2] => 
    [3] => 
    [4] => Campo 5
)
Array
(
    [0] => 
)
</pre>

<p>This behaviour on empty lines is a little annoying if you want to use <tt>empty()</tt> to test whether a line is empty, because an array that holds a single null element is itself not empty:</p>

<pre class="php"><span style="color: #0000ff;">$a</span> = <a href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #000000; font-weight: bold;">null</span><span style="color: #66cc66;">&#41;</span>;
<a href="http://www.php.net/print_r"><span style="color: #000066;">print_r</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$a</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> <a href="http://www.php.net/empty"><span style="color: #000066;">empty</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$a</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>
  <a href="http://www.php.net/echo"><span style="color: #000066;">echo</span></a> <span style="color: #ff0000;">'$a is empty'</span>;
<span style="color: #b1b100;">else</span>
  <a href="http://www.php.net/echo"><span style="color: #000066;">echo</span></a> <span style="color: #ff0000;">'$a is not empty'</span>;
&nbsp;
<a href="http://www.php.net/echo"><span style="color: #000066;">echo</span></a> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>;</pre>

<p>This code will print:</p>

<pre>
Array
(
    [0] => 
)
$a is not empty
</pre>

<p>Hence, the following function:</p>

<pre class="php"><span style="color: #808080; font-style: italic;">/**
 * This function tests if the given array (as returned by fgetcsv())
 * is the result of an empty line in the CSV file.
 *
 * It does not work for lines that contain only delimiters.
 * From the POV of this function, these are simply records with
 * many empty fields.
 */</span>
<span style="color: #000000; font-weight: bold;">function</span> fgetcsv_empty_line<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$row_array</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
  <span style="color: #b1b100;">return</span> <span style="color: #66cc66;">&#40;</span> !<a href="http://www.php.net/isset"><span style="color: #000066;">isset</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$row_array</span><span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span> and <span style="color: #0000ff;">$row_array</span><span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #66cc66;">&#93;</span> === <span style="color: #000000; font-weight: bold;">null</span> <span style="color: #66cc66;">&#41;</span>;
<span style="color: #66cc66;">&#125;</span></pre>

<p>Now, if I change the call to <tt>empty()</tt> in my test to a call to <tt>fgetcsv_empty_line()</tt>, the test prints:</p>

<pre>
$a is empty
</pre>]]></content:encoded>
			<wfw:commentRss>http://blog.bigsmoke.us/2009/09/05/php-fgetcsv/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Moved from Mnemosyne to FlashcardDB</title>
		<link>http://blog.bigsmoke.us/2008/01/14/moved-from-mnemosyne-to-flashcarddb</link>
		<comments>http://blog.bigsmoke.us/2008/01/14/moved-from-mnemosyne-to-flashcarddb#comments</comments>
		<pubDate>Mon, 14 Jan 2008 22:26:33 +0000</pubDate>
		<dc:creator>Rowan Rodrik</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[flashcard]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://blog.bigsmoke.us/2008/01/14/moved-from-mnemosyne-to-flashcarddb</guid>
		<description><![CDATA[]]></description>
			<content:encoded><![CDATA[<p>When I was studying Spanish last year, I had to <a href="http://blog.bigsmoke.us/2007/03/02/making-flash-cards-on-line">choose a flashcard program</a> to memorize new words. At the time, I couldn&#8217;t find any online program that just did the job and did it well. In a <a href="http://blog.bigsmoke.us/2007/03/02/making-flash-cards-on-line#comment-7086">comment</a> on my <a href="http://blog.bigsmoke.us/2007/03/02/making-flash-cards-on-line">blog post</a> from last year, however, Jeff pointed me to his amazing <a href="http://flashcarddb.com/">FlashcardDB</a>.</p>

<p>The program I ended up with last year was <a href="http://mnemosyne-proj.sourceforge.net/">Mnemosyne</a>. Mnemosyne is not based on the regular <a href="http://en.wikipedia.org/wiki/Flashcard">Leitner system</a>, but rather on a concept where, after each card, you have to indicate for yourself <q>how well</q> you remembered it. I found that, in the end, having to tell the system in which box to put the card, instead of just saying whether my answer was right or wrong, took more effort than actually recalling the information. Also, as someone who rarely stays in one place for very long, I find that a desktop program just isn&#8217;t as practical as an online program.</p>

<p style="width: 400px;"><a title="With Mnemosyne, I had to constantly remind myself of a complicated grading system." href="http://blog.bigsmoke.us/uploads/2008/01/mnemosyne-full.jpg" rel="lightbox"><img src='http://blog.bigsmoke.us/uploads/2008/01/mnemosyne1.jpg' alt='Mnemosyne' /></a><br /><small class="caption">With Mnemosyne, I had to constantly remind myself of a complicated grading system.</small></p>

<p>Now to FlashcardDB. The site is pretty social, which means that you can study (and sometimes even edit) card sets made by other users. When you sign up, you can also create card sets yourself. Card sets can be tagged and you can study these tags instead of individual card sets if you wish. If you already have cards somewhere else, import is easy as well.</p>

<p>The user interface is very slick, especially for such a new program. Thoughtful use of AJAX means that you&#8217;re never distracted by page reloads that would interrupt your flow of thought. Simple key bindings make studying an easier affair than in most desktop programs. The right arrow shows the answer, the up arrow (thumbs up) marks the answer as correct, the down arrow (thumbs down) marks it as incorrect, and the left arrow goes back to the previous card. The interface for adding cards is also very pleasant. It&#8217;s just a matter of filling in the front of the card, pressing Tab, filling in the back of the card, pressing Tab, then Enter, and on to the next card.</p>

<p>Before going on to the conclusion, I want to add that the Leitner system is also very well implemented in FlashcardDB, including pretty diagrams that make it instantly clear how the system works. Now for my conclusion: if you ever need to make flashcards yourself, you really should take a look at <a href="http://flashcarddb.com/">FlashcardDB</a> before looking at anything else.</p>

<p>Finally, the following Ruby code is a quick hack I used to convert Mnemosyne&#8217;s XML export to CSV data which can be imported by FlashcardDB:</p>

<pre class="ruby"><span style="color:#008000; font-style:italic;">#!/usr/bin/ruby</span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">require</span> 'rexml/document'
<span style="color:#CC0066; font-weight:bold;">require</span> 'csv'
&nbsp;
xmldoc = REXML::Document.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span>$stdin<span style="color:#006600; font-weight:bold;">&#41;</span>
&nbsp;
CSV::Writer.<span style="color:#9900CC;">generate</span><span style="color:#006600; font-weight:bold;">&#40;</span>$stdout<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#9966CC; font-weight:bold;">do</span> |csv|
  xmldoc.<span style="color:#9900CC;">each_element</span><span style="color:#006600; font-weight:bold;">&#40;</span>'//item'<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#9966CC; font-weight:bold;">do</span> |el|
    csv &lt;&lt; <span style="color:#006600; font-weight:bold;">&#91;</span>  el.<span style="color:#9900CC;">elements</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">1</span>,'Q'<span style="color:#006600; font-weight:bold;">&#93;</span>.<span style="color:#9900CC;">text</span>, el.<span style="color:#9900CC;">elements</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">1</span>,'A'<span style="color:#006600; font-weight:bold;">&#93;</span>.<span style="color:#9900CC;">text</span>  <span style="color:#006600; font-weight:bold;">&#93;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre>
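
<p>On Ruby 1.9 and later, <tt>CSV::Writer</tt> is no longer part of the standard <tt>csv</tt> library, so the script above only runs on Ruby 1.8. A minimal sketch of the same conversion for newer Rubies might look like this. The helper name <tt>mnemosyne_xml_to_csv</tt> is hypothetical, and I&#8217;m assuming that each <tt>item</tt> element has a <tt>Q</tt> and an <tt>A</tt> child, as the original script does:</p>

```ruby
require 'rexml/document'
require 'csv'

# Hypothetical helper (not part of the original script): converts a
# Mnemosyne-style XML export into CSV text, one "question,answer" row
# per <item> element.
def mnemosyne_xml_to_csv(xml)
  doc = REXML::Document.new(xml)

  # CSV.generate builds the CSV in memory and returns it as a String.
  CSV.generate do |csv|
    doc.each_element('//item') do |item|
      question = item.elements['Q']
      answer   = item.elements['A']

      # Skip incomplete items instead of failing on a missing child.
      next unless question && answer

      csv << [question.text, answer.text]
    end
  end
end
```

<p>To use it as a filter like the original script, something along the lines of <tt>$stdout.write(mnemosyne_xml_to_csv($stdin.read))</tt> should do.</p>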

]]></content:encoded>
			<wfw:commentRss>http://blog.bigsmoke.us/2008/01/14/moved-from-mnemosyne-to-flashcarddb/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Web scraping in Ruby: why I had to use scrAPI instead of WWW::Mechanize and Hpricot</title>
		<link>http://blog.bigsmoke.us/2007/05/02/scrapi-wins-over-mechanize-and-hpricot-for-web-scraping-in-ruby</link>
		<comments>http://blog.bigsmoke.us/2007/05/02/scrapi-wins-over-mechanize-and-hpricot-for-web-scraping-in-ruby#comments</comments>
		<pubDate>Wed, 02 May 2007 12:53:32 +0000</pubDate>
		<dc:creator>Rowan Rodrik</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[CSS]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[Hpricot]]></category>
		<category><![CDATA[Perl]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[scrAPI]]></category>
		<category><![CDATA[XPath]]></category>

		<guid isPermaLink="false">http://blog.bigsmoke.us/2007/05/02/scrapi-wins-over-mechanize-and-hpricot-for-web-scraping-in-ruby</guid>
		<description><![CDATA[Thursday evening: so, I had written myself a nice little script using Aaron Patterson's <a href="http://rubyforge.org/projects/mechanize/">WWW::Mechanize</a> and why's <a href="http://code.whytheluckystiff.net/hpricot/">Hpricot</a> to extract some data from a popular web-based airport directory.]]></description>
			<content:encoded><![CDATA[<p>Thursday evening: so, I had written myself a nice little script using Aaron Patterson&#8217;s <a href="http://rubyforge.org/projects/mechanize/">WWW::Mechanize</a> and why&#8217;s <a href="http://code.whytheluckystiff.net/hpricot/">Hpricot</a> to extract some data from a popular web-based airport directory.</p>

<img style="float: right; margin-left: 1ex; margin-bottom: .5em;" src='http://blog.bigsmoke.us/uploads/2007/04/hpricot-small.png' alt='Hpricot logo' />

<p>I was drawn to Hpricot by the promise of XPath and CSS selector support (and a very cool logo, of course). As a long-time XPath user, I started banging out some crispy XPath expressions until I realized that XPath support was only <a href="http://code.whytheluckystiff.net/hpricot/wiki/SupportedXpathExpressions"><em>very</em> partial</a>. I kept on trying expressions that <em>would</em> work, even bowing down to expressions that, according to the Wiki, would work, but <q>differently</q>. Come on guys, either support a standard or just plainly ignore it, please! <img src='http://blog.bigsmoke.us/wp-factory/wp-includes/images/smilies/icon_mad.gif' alt=':-x' class='wp-smiley' />  Because I couldn&#8217;t figure out how I&#8217;d have to integrate why&#8217;s fork of the XPath spec in my expressions, I decided to stick with why&#8217;s fork of the <a href="http://code.whytheluckystiff.net/hpricot/wiki/SupportedCssSelectors">CSS selectors</a> instead.</p>

<p>Then it was time to execute my code. I had estimated that it would take about two hours to finish downloading and parsing the approximately 10,000 pages that contained the data I was interested in. So, I executed my script, detached my screen session and went to bed, trusting that I would find a nice, handy CSV file in the morning.</p>

<p>Friday morning, I was disappointed to find that my script had been killed, and I was left wondering what could have killed it. I decided to restart the script at the countries starting with the letter <q>b</q> (it had died somewhere halfway through the list of countries starting with a <q>b</q>). Soon the script was happily appending data to the existing CSV file again.</p>

<p class="sidenote"><b>Disclaimer:</b> why is a much more prolific Ruby coder than I&#8217;ll ever be, so please take my comments with a grain of salt. No, actually, rather take them with a few spoonfuls of salt.</p>

<p>Later, I talked about the spontaneous death of the script with <a href="http://www.halfgaar.net/">Wiebe</a>. Curious, he looked at the memory usage of my script and saw that it was happily munching away hundreds of megs of memory on our server. And memory usage was growing! With crucial server processes at risk of running out of memory and with me having to build a fence around the vegetable garden to protect it from a bunch of brawling chickens, Wiebe was friendly enough to drop in and take a look at my spaghetti code to see if he could fix the leak. He couldn&#8217;t, because the leak <a href="http://code.whytheluckystiff.net/hpricot/ticket/48">didn&#8217;t appear</a> to be in my code. I <a href="http://code.whytheluckystiff.net/hpricot/ticket/48">wasn&#8217;t the first</a> to be bugged by a leak in Hpricot.</p>

<p>That news didn&#8217;t make me very happy, because it implied I had to redo the script using different tools. I knew that WWW::Mechanize had been inspired by the <a href="http://search.cpan.org/dist/WWW-Mechanize/">Perl package of the same name</a>, so I started by looking at that. After installing <a href="http://search.cpan.org/dist/WWW-Mechanize/">WWW::Mechanize</a>, I explored CPAN&#8217;s <a href="http://search.cpan.org/modlist/World_Wide_Web/WWW">WWW</a> namespace a bit further and noticed that the Perl crowd also had two other good scrapers at their fingertips: <a href="http://search.cpan.org/dist/WWW-Extractor/">WWW::Extractor</a> and <a href="http://search.cpan.org/dist/Scraper/">WWW::Scraper</a>. Once again I was reminded that Perl, despite its funky syntax, is still the king of all scripting languages when it comes to the availability of quality modules. <img src='http://blog.bigsmoke.us/wp-factory/wp-includes/images/smilies/icon_sad.gif' alt=':-(' class='wp-smiley' />   After a few deep breaths, I set my rusty Perl skills into (slow) motion. Hell, this was supposed to be a quick script. Why was this taking so much time? <small>(Yeah, yeah; cue all the jokes about developer incompetence. <img src='http://blog.bigsmoke.us/wp-factory/wp-includes/images/smilies/icon_confused.gif' alt=':-?' class='wp-smiley' />  )</small></p>

<p>I was almost trampled by a herd of camels, each with a name more syntactically confusing than the last. Just before I was crushed, I came across a reference to a Ruby scraper with decent support for CSS3 selectors: <a href="http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/">scrAPI</a>. Credit for this discovery goes to the documenters of <a href="http://scrubyt.org/">scRUBYt</a>, a feature-rich scraper layered on top of WWW::Mechanize. They were friendly enough to help their users by including a link to the competition.</p>

<p>It took me some time to rewrite the script using scrAPI, partially because it was hard to find any documentation that was more comprehensive than a few <a href="http://blog.labnotes.org/category/scrapi/">blog posts</a> and a <a href="http://cheat.errtheblog.com/s/scrapi/">cheat sheet</a> and less of a hassle than reading the source. But, when Assaf answered <a href="http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/#comment-137860">my need</a> by <a href="http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/#comment-137862">pointing me</a> to the online <a href="http://content.labnotes.org/rdoc/scrapi/">API docs</a>, I was happy.</p>

<p>Another reason why it was hard to migrate from WWW::Mechanize/Hpricot to scrAPI was that Hpricot starts element offsets for XPath predicates and CSS selectors at zero instead of at one, where they should start. And of course, I had to rid myself of the weird crossbreed of CSS and XPath selectors.</p>
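
<p>To illustrate the offsets: in standard XPath, position predicates count from one, while Ruby arrays count from zero. The sketch below uses REXML, which ships with Ruby, rather than Hpricot or scrAPI, and the sample markup is made up for the occasion:</p>

```ruby
require 'rexml/document'

# Made-up sample document, purely for illustration.
doc = REXML::Document.new('<ul><li>first</li><li>second</li><li>third</li></ul>')

# XPath position predicates are one-based: li[1] is the first <li>.
first_by_xpath  = REXML::XPath.first(doc, '/ul/li[1]').text
second_by_xpath = REXML::XPath.first(doc, '/ul/li[2]').text

# Ruby arrays are zero-based: index 0 is the first <li>.
items = REXML::XPath.match(doc, '/ul/li')
first_by_index = items[0].text

puts first_by_xpath   # first
puts second_by_xpath  # second
puts first_by_index   # first
```

<p>So when porting selectors from a zero-based dialect to a standards-following one, every positional offset has to be bumped by one.</p>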

<p>I was surprised that the script using scrAPI ran about twice as fast as the Hpricot-based script. That was including almost an hour of cumulative <tt>sleep()</tt> time between requests, which I had added because the speed during testing made me worry about over-exerting their web server. Given that speed is one of Hpricot&#8217;s popular features, this was very unexpected, although I have to admit that Hpricot did fill my memory very quickly.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.bigsmoke.us/2007/05/02/scrapi-wins-over-mechanize-and-hpricot-for-web-scraping-in-ruby/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

