A simple Scala parser to parse 44GB Wikipedia XML Dump

· Read in about 2 min · (359 Words)

I had to parse a Wikipedia XML dump (a 44GB XML file, uncompressed). The XML dump is available here, and I have also created a smaller sample file for trying out this code: sample wiki.xml file.
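The core idea is streaming: the file is consumed event by event, so memory stays constant no matter how large the dump is. As a rough illustration of the same idea using only the JDK (the parser below uses Scala's `XMLEventReader`; `StaxDemo`, `collectTitles`, and the sample document here are hypothetical names for this sketch), here is a minimal StAX version that collects every `<title>` without ever building a full DOM tree:

```scala
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import java.io.StringReader

// Stream a document event by event, collecting every <title> element.
// No DOM is ever built, so memory use is independent of document size.
def collectTitles(xml: String): List[String] = {
  val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
  var titles = List.empty[String]
  while (reader.hasNext) {
    reader.next() match {
      case XMLStreamConstants.START_ELEMENT if reader.getLocalName == "title" =>
        titles ::= reader.getElementText // consumes the text and the closing tag
      case _ => // skip all other events
    }
  }
  reader.close()
  titles.reverse
}

// A tiny stand-in for the dump; in the real file each <page> holds a full article.
val sample =
  """<mediawiki>
    |  <page><title>A</title><id>1</id></page>
    |  <page><title>B</title><id>2</id></page>
    |</mediawiki>""".stripMargin

println(collectTitles(sample)) // List(A, B)
```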

Below is the event-based XML parser, built on Scala's XMLEventReader:

package xml

import scala.io.Source
import scala.xml.pull._
import scala.collection.mutable.ArrayBuffer
import java.io.File
import java.io.FileOutputStream
import scala.xml.XML

object wikipedia extends App {

  val xmlFile = args(0)
  val outputLocation = new File(args(1))

  val xml = new XMLEventReader(Source.fromFile(xmlFile))

  var insidePage = false          // are we currently between <page> and </page>?
  var buf = ArrayBuffer[String]() // accumulates the markup of one page
  for (event <- xml) {
    event match {
      case EvElemStart(_, "page", _, _) =>
        insidePage = true
        buf += "<page>"
      case EvElemEnd(_, "page") =>
        buf += "</page>"
        insidePage = false
        writePage(buf)
        buf.clear()
      // note: attributes of nested elements are dropped here for simplicity
      case EvElemStart(_, tag, _, _) if insidePage =>
        buf += ("<" + tag + ">")
      case EvElemEnd(_, tag) if insidePage =>
        buf += ("</" + tag + ">")
      case EvText(t) if insidePage =>
        buf += t
      case EvEntityRef(entity) if insidePage =>
        buf += ("&" + entity + ";") // re-emit entities such as &amp; so the page stays well-formed
      case _ => // ignore everything outside <page> elements
    }
  }

  // Re-parse the collected page snippet and name the output file after the page id.
  def writePage(buf: ArrayBuffer[String]) = {
    val s = buf.mkString
    val x = XML.loadString(s)
    val pageId = (x \ "id")(0).child(0).toString // the page's own <id> (a direct child), not a revision id
    val f = new File(outputLocation, pageId + ".xml")
    println("writing to: " + f.getAbsolutePath())
    val out = new FileOutputStream(f)
    out.write(s.getBytes())
    out.close()
  }
}

XML pull parser for Wikipedia XML dumps: find this code snippet on GitHub

Let's see how long it takes to process all the Wikipedia pages in the 44GB XML dump.

It took roughly 7 hours and 30 minutes, which is not bad:

$ time sbt "run-main xml.wikipedia enwiki-20140102-pages-articles-multistream.xml wiki-pages"
[success] Total time: 26918 s, completed Feb 4, 2014 9:56:38 AM
real    448m41.888s
user    82m47.594s
sys 192m46.238s
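That wall-clock time works out to a fairly modest throughput. A quick back-of-the-envelope check, assuming the dump is exactly 44 GiB and using the 26918 seconds reported by sbt:

```scala
// Rough throughput estimate for the run above: 44 GiB in 26918 seconds.
val bytes    = 44L * 1024 * 1024 * 1024 // dump size, as stated
val seconds  = 26918L                   // total time reported by sbt
val mbPerSec = bytes.toDouble / (1024 * 1024) / seconds
println(f"$mbPerSec%.2f MiB/s") // 1.67 MiB/s
```

So the run averaged under 2 MiB/s, suggesting the bottleneck is not raw disk read speed but the per-page parsing and the creation of millions of small output files.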

And it generated 14128976 XML files:

$ ls wiki-pages/ | wc -l
14128976
$ du -sh wiki-pages/
80G wiki-pages/

As you can see, the 44GB uncompressed XML file got split up into 80GB of total storage across all the separate page files. That overhead is something to work on.
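One plausible contributor to the growth from 44GB to 80GB is per-file allocation overhead: each of the ~14 million small files occupies at least one whole filesystem block. A back-of-the-envelope sketch, assuming 4 KiB blocks (the real block size, the spread of page sizes, and directory metadata all depend on the filesystem, so this is only a lower-bound estimate):

```scala
val totalBytes = 44L * 1024 * 1024 * 1024 // uncompressed dump size
val fileCount  = 14128976L                // files produced by the run
val blockSize  = 4096L                    // assumed filesystem block size

val avgFileSize = totalBytes / fileCount  // average bytes per page file
println(avgFileSize) // 3343

// Each file is rounded up to a whole number of blocks on disk; using the
// average size per file gives a lower bound on the allocated space.
val allocated = fileCount * ((avgFileSize + blockSize - 1) / blockSize) * blockSize
println(f"${allocated / (1024.0 * 1024 * 1024)}%.1f GiB") // 53.9 GiB
```

Even this crude lower bound exceeds the 44GB of raw data; the rest of the gap up to the observed 80GB plausibly comes from pages larger than one block rounding up further and from directory metadata for 14 million entries.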
