Skip to content

JDBM

A colleague recently came up to me with a problem: he had a trace file representing the history of several years’ worth of CRUD operations on many records. Something resembling:

1:create:hello
123:create:xxx
12345:create:hello
123:update:uuu
999:update:42
123:delete
1:update:xxx
1:update:iuu
12345:update:hello world
1:delete

The task at hand was to work out which records were active at the end of the trace and to determine the most up-to-date value each active record.

The expected result here is:

Didn't find 999 to update. Creating new entry.
Results:
	12345 => hello world
	999 => 42

This instantly reminded me of a concordancing problem.

The natural data structure for this is a Map. Thus we have the first potential solution for the problem:

def m = [:]
new File(/..\data.dat/).splitEachLine(':') { line ->
    def (id, op, data) = line
    switch (op) {
        case 'update':
            if (!m.containsKey(id))
                println "Didn't find ${id} to update. Creating new entry."
        case 'create':
            m.put(id, data)
            break

        case 'delete':
            m.remove(id)
            break
    }
}

println "Results:"
m.each { k,v ->
    println "\t${k} => ${v}"
}

This works nicely: it’s clear and concise…all the things I like! Groovy makes it even nicer :-)

But (there’s always a but…sadly) my colleague had lots of data to process. Too much to be realistically held in memory.

The tool I would normally reach for in this circumstance is a plain old database, but hey! this is the age of noSQL.

Enter JDBM3:

JDBM provides HashMap, TreeMap, HashSet, TreeSet and LinkedList backed up by disk storage. Now you can insert more than a billion records into collection without running out of memory. JDBM is probably the fastest pure Java database.

Ten minutes of hacking and version one had transformed into version two:

import net.kotek.jdbm.*

// uses JDBM3-alpha-2 from https://github.com/jankotek/JDBM3

final path = /..\data/

// start with non-existent database
new File(path).eachFile { it.delete() }

new DBMaker(path + /\ids/).build().with {  db ->

    db.createHashMap("ids").with { m ->

        new File(/..\data.dat/).splitEachLine(':') { line ->
            def (id, op, data) = line
            switch (op) {
                case 'update':
                    if (!m.containsKey(id))
                        println "Didn't find ${id} to update. Creating new entry."
                case 'create':
                    m.put(id, data)
                    break

                case 'delete':
                    m.remove(id)
                    break
            }
            db.commit()
        }

        m.each { k,v ->
            println "\t${k} => ${v}"
        }
    }

    db.close()
}

Still nice, clear and concise…but now capable of dealing with a (few) billion records.

Even more to like!

Update:
It is now possible to simplify this further. After mentioning to the developer that:

If there is one small thing I would like tidied up in the example code, it is this:

> // start with non-existent database
> new File(path).eachFile { it.delete() }

Having to delete the database/files by hand, rather than just specifying a create-truncate mode of some sort ‘feels’ wrong.

He responded within hours:

…it is done now, check DBMaker.

This means that one can now do:

new DBMaker(path + /\ids/).build().deleteFilesAfterClose().with {  db ->

Open source FTW!

Tags: ,

C, Java Enterprise Edition, JEE, J2EE, JBoss, Application Server, Glassfish, JavaServer Pages, JSP, Tag Libraries, Servlets, Enterprise Java Beans, EJB, Java Messaging Service JMS, BEA Weblogic, JBoss, Application Servers, Spring Framework, Groovy, Grails, Griffon, GPars, GAnt, Spock, Gradle, Seam, Open Source, Service Oriented Architectures, SOA, Java 2 Standard Edition, J2SE, Eclipse, Intellij, Oracle Service Bus, OSB