I had to parse a Wikipedia XML dump (a 44GB XML file when uncompressed). The XML dump is available here, and I have also created a smaller sample file to run this code against: sample wiki.xml file.
Below is an event-based XML parser built on Scala's XMLEventReader:
package xml

import scala.io.Source
import scala.xml.pull._
import scala.collection.mutable.ArrayBuffer
import java.io.File
import java.io.FileOutputStream
import scala.xml.XML

object wikipedia extends App {

  val xmlFile = args(0)                   // path to the Wikipedia XML dump
  val outputLocation = new File(args(1))  // existing directory for the per-page files

  // Wikipedia dumps are UTF-8, so read them as such instead of using the platform default.
  val xml = new XMLEventReader(Source.fromFile(xmlFile, "UTF-8"))

  var insidePage = false
  val buf = ArrayBuffer[String]()

  for (event <- xml) {
    event match {
      case EvElemStart(_, "page", _, _) =>
        insidePage = true
        buf += "<page>"
      case EvElemEnd(_, "page") =>
        buf += "</page>"
        insidePage = false
        writePage(buf)
        buf.clear()
      // Note: attributes are dropped, only the element names are copied.
      case EvElemStart(_, tag, _, _) =>
        if (insidePage) buf += ("<" + tag + ">")
      case EvElemEnd(_, tag) =>
        if (insidePage) buf += ("</" + tag + ">")
      case EvText(t) =>
        if (insidePage) buf += t
      // Entities such as &lt; and &amp; arrive as separate events;
      // re-emit them so the characters are not silently lost.
      case EvEntityRef(entity) =>
        if (insidePage) buf += ("&" + entity + ";")
      case _ => // ignore comments, processing instructions, etc.
    }
  }

  // Writes one buffered <page> element to <outputLocation>/<pageId>.xml
  def writePage(buf: ArrayBuffer[String]): Unit = {
    val s = buf.mkString
    val x = XML.loadString(s)
    // The first <id> directly under <page> is the page id.
    val pageId = (x \ "id").head.text
    val f = new File(outputLocation, pageId + ".xml")
    println("writing to: " + f.getAbsolutePath())
    val out = new FileOutputStream(f)
    try out.write(s.getBytes("UTF-8"))
    finally out.close()
  }
}
XML pull parser for Wikipedia XML dumps: find this code snippet on GitHub
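As far as I know, newer scala-xml releases no longer ship the scala.xml.pull package, so here is a rough sketch of the same page-splitting idea on top of the JDK's StAX API (javax.xml.stream). It is not the code that produced the numbers in this post. The names StaxPageSplitter, splitPages and onPage are made up for illustration, and like the parser above it drops attributes.

```scala
import java.io.{Reader, StringReader}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StaxPageSplitter {

  // Re-escape characters that must not appear raw in XML text content
  // (StAX hands back entity references already resolved).
  private def escape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

  /** Streams over `in` and calls `onPage(pageId, pageXml)` once per <page> element. */
  def splitPages(in: Reader)(onPage: (String, String) => Unit): Unit = {
    val factory = XMLInputFactory.newInstance()
    // Merge adjacent text and CDATA sections into single CHARACTERS events.
    factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    val reader = factory.createXMLStreamReader(in)

    val page = new StringBuilder      // XML of the page currently being read
    val idText = new StringBuilder    // text of the page's own <id> element
    var depth = 0                     // nesting depth inside <page>; 0 means outside
    var pageId: Option[String] = None
    var readingId = false

    try {
      while (reader.hasNext) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            val name = reader.getLocalName
            if (depth > 0 || name == "page") {
              depth += 1
              page.append('<').append(name).append('>')
              // Only the first <id> directly under <page> is the page id;
              // <revision> and <contributor> carry their own ids.
              readingId = depth == 2 && name == "id" && pageId.isEmpty
            }
          case XMLStreamConstants.END_ELEMENT =>
            if (depth > 0) {
              page.append("</").append(reader.getLocalName).append('>')
              if (readingId) {
                pageId = Some(idText.toString.trim)
                readingId = false
              }
              depth -= 1
              if (depth == 0) {
                onPage(pageId.getOrElse("unknown"), page.toString)
                page.clear()
                idText.clear()
                pageId = None
              }
            }
          case XMLStreamConstants.CHARACTERS | XMLStreamConstants.CDATA =>
            if (depth > 0) {
              val text = reader.getText
              page.append(escape(text))
              if (readingId) idText.append(text)
            }
          case _ => // ignore comments, processing instructions, etc.
        }
      }
    } finally reader.close()
  }
}

object StaxPageSplitterDemo extends App {
  val sample =
    "<mediawiki><page><title>A &amp; B</title><id>12</id>" +
      "<revision><id>99</id><text>x &lt; y</text></revision></page></mediawiki>"
  StaxPageSplitter.splitPages(new StringReader(sample)) { (id, xml) =>
    println(id + " -> " + xml)
  }
}
```

Because the page id is picked up while streaming, this version does not need to re-parse every page with XML.loadString just to find its id. For the real dump you would pass a file reader instead of the StringReader and write each page to disk in the callback.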
Let's see how long it takes to process all the Wikipedia pages in the 44GB XML dump.
It took roughly 7 hours 30 minutes. That's not bad:
$ time sbt "run-main xml.wikipedia enwiki-20140102-pages-articles-multistream.xml wiki-pages"
[success] Total time: 26918 s, completed Feb 4, 2014 9:56:38 AM
real 448m41.888s
user 82m47.594s
sys 192m46.238s
And it generated 14128976 XML files:
$ ls wiki-pages/ | wc -l
14128976
$ du -sh wiki-pages/
80G wiki-pages/
As you can see, the 44GB uncompressed XML file got split into 80GB of total storage across all the separate pages. Now that's something to be worked on.
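To put the numbers above in perspective, here is a small back-of-the-envelope calculation. One plausible contributor to the growth from 44GB to 80GB is that du reports allocated disk blocks, so every small file is rounded up to a whole block. The 4 KiB block size below is purely an assumption (I did not record the filesystem settings), and DumpNumbers and allocatedBytes are names invented for this sketch.

```scala
object DumpNumbers {

  /** Bytes a file occupies on disk when space is allocated in whole blocks. */
  def allocatedBytes(fileSize: Long, blockSize: Long): Long =
    ((fileSize + blockSize - 1) / blockSize) * blockSize

  val pages = 14128976L         // file count reported by `ls wiki-pages/ | wc -l`
  val seconds = 26918L          // total time reported by sbt
  val assumedBlockSize = 4096L  // ASSUMPTION: 4 KiB filesystem blocks

  // Average throughput of the run.
  val pagesPerSecond: Double = pages.toDouble / seconds

  // Lower bound on disk usage if every page file occupied exactly one block.
  val minAllocated: Long = pages * allocatedBytes(1L, assumedBlockSize)
  val minAllocatedGiB: Double = minAllocated.toDouble / (1L << 30)
}

object DumpNumbersDemo extends App {
  println(f"pages per second: ${DumpNumbers.pagesPerSecond}%.1f")
  println(f"minimum allocation at 4 KiB blocks: ${DumpNumbers.minAllocatedGiB}%.1f GiB")
}
```

Under that assumption the run averaged about 525 pages per second, and 14 million one-block files would already take roughly 54 GiB no matter how small each page is. Packing many pages into fewer, larger files would avoid most of that rounding.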
References:
- First steps with Scala: XML pull parsing
- Scala finding elements in big (30MB) xml files