Monday, 20 June 2011

Refactoring JQTI: JQTI+

Last December I blogged that I was looking into maybe refactoring JQTI. Things went rather quiet after that point as my initial exploration was enough for me to decide that this would be a lot of work and I found myself with plenty of other stuff to do and not enough time to justify having fun of this nature (for some definition of the word "fun").


Since starting work on the JISC QTI Implementation and Profiling Support project a couple of months ago, I thought now might be a good time to look at this idea again and revisit my initial attempt with fresh eyes and with more experience of JQTI under my belt. My deeper knowledge of JQTI is both a blessing and a curse - I know much more about how it works, but I'm also now aware of a few other aspects of the software that are ripe for being refactored and improved.

So, what do I think can be improved? Here's a quick summary:
  • Splitting out state, logic and data: This is what I blogged about originally, and is still a very good idea. As I mentioned back then, it's not all that easy but I'm now convinced it's not only worth doing, but doable.
  • Stateless shuffling of interactions: For interactions that can be shuffled, JQTI implements the shuffling process by physically reordering the choices within the Object model. This is bad for two reasons. First, it's obviously not stateless! Secondly, it means that writing the assessmentItem Object back out as XML gives you something different from what you started with, which seems wrong.
  • Instantiating from XML: JQTI's hierarchy of Java Objects that mirror the concepts in the QTI spec all have a set of overloaded load() methods that instantiate Objects from XML in various forms. This is quite simple to use, but doesn't really mesh together too well. One problem is that resolving links to related resources, such as response processing templates and items referenced in tests only really works in some cases, and there's lots of duplicated code that doesn't handle all cases we'd need. Another issue is that JQTI keeps a deep copy of every DOM Node in the XML tree, which eats memory up for no great reason. JQTI also doesn't load in schemas (actually it now does in the fork I created for MathAssessEngine) which makes its validation a bit half-hearted, and adding performant support for using the core schemas is a bit of work that library clients have to do that would be better being done by JQTI. It also doesn't really "get" namespaces, but that's being a bit nitpicky! So, I envisage a refactored JQTI to have better support for locating and reading in XML, and controlling this process.
  • Better test navigation: JQTI has a concept of "Item Flow" that is used to help model the movement through a test. This idea works quite well in simple cases, but doesn't work so well for weird combinations of navigation and submission modes. So I think a lot of the test navigation ideas need refactored, which probably means I'll merge in the module previously known as JQTI-Controller back into JQTI as I'm not sure the additional level of flexibility that offered is much use.
So, there's even more to do than I first thought. I am clearly insane!
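
To make the stateless shuffling idea a bit more concrete, here is a minimal sketch of one way it could work: the choices in the Object model are never reordered, and the order shown to a candidate is derived from a seed kept in that candidate's state. The names used here (ShuffleSketch, shuffledOrder) are purely hypothetical and are not part of JQTI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffleSketch {

    /**
     * Returns a shuffled copy of the choice identifiers, leaving the
     * original list (i.e. the item model) untouched. The same seed always
     * gives the same order, so only the seed needs to be stored per candidate.
     */
    public static List<String> shuffledOrder(List<String> choiceIdentifiers, long seed) {
        List<String> copy = new ArrayList<>(choiceIdentifiers);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical choice identifiers, in the order they appear in the XML
        List<String> choices = Arrays.asList("A", "B", "C", "D");
        System.out.println("Model order:     " + choices);
        System.out.println("Candidate order: " + shuffledOrder(choices, 42L));
        System.out.println("Model order is unchanged: " + choices);
    }
}
```

With something like this, writing the assessmentItem back out as XML would give you exactly what you started with, since the shuffle only ever exists as a derived view.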

Saturday, 18 June 2011

MathAssessEngine development update

Following our successful award of JISC funding in the form of our QTI Implementation and Profiling Support project, I have now recommenced development work on MathAssessEngine. I'm about 2 months into development and things have gone rather well so far. I'm currently about to start the second big iteration of work so thought this would be a good time to summarise what work has been done so far, as I'm conscious of the fact that I've not been very good at blogging recently. (That said, when have I ever been any good at this?!)

Anyway, back to the point. There have been essentially 3 main strands of work going so far:

  1. QTI 2.1 reference implementation: There has been a fair flurry of work and new interest in the QTI 2.1 spec as it advances towards completion. I've been using MathAssessEngine as a way of tracking new ideas and proposals as and when they've been mooted in order to let people play around with them and see how they work, so MathAssessEngine is playing the role of a reference implementation here, along with Graham Smith's JAssess. (That said, this is not a proper 100% implementation as it doesn't quite implement the whole spec and there is a growing list of historic bugs being noticed as well, but let's not talk about those today!)
  2. Resolution of known issues: There have been a number of well-known issues with MathAssessEngine (and its parent QTIEngine) that have needed fixed for a while. One of the main issues is the testing functionality, which doesn't handle some of the more esoteric combinations of navigation and submission modes that QTI specifies. So this work is going to fix as many of these known issues as possible, which will probably result in significant changes to the way that tests work. (It's also bringing up new issues, but that's always to be expected!)
  3. General improvements and refactoring: This is a good time to improve and tidy up the underlying architecture, based on lessons learned over the last couple of years and my growing experience of the inherited JQTI code base. I enjoy doing this kind of thing in a kind of masochistic way, but this practice invariably opens virtual cans of worms all over the place, which makes for a terrible mess for a while. So we'll have to see how far we go here.
So what are the key developments so far, then? Well, I'm quite pleased with what has been achieved so far. We're currently at the stage where individual assessment items render really nicely, with many improvements and better browser compatibility over the last version of MathAssessEngine. Here are some highlights:
  • Better MathML support: MathML is now rendered via MathJax, which means it will work really nicely in most modern browsers. The rendered content is now HTML5 (+ MathML) by default, though I still support XHTML1 (+ MathML).
  • Better QTI validation: All items and tests are now validated against the relevant schemas, as well as performing additional QTI-specific validation to ensure that they are going to work correctly.
  • Better input validation: Many of the interactions weren't validating responses correctly to ensure that they make sense, a problem that in some cases permeated all the way down into the underlying JQTI library. This has been resolved for all validations, and our fork of JQTI now validates all of the interactions it supports.
  • New rendering of mathEntryInteraction: Candidates can now input maths using an improved input widget that provides real-time AJAX feedback as to whether their input makes sense.
  • Improvements to certain interactions: The orderInteraction rendering has been completely revamped so that it can support the minChoices and maxChoices attributes. sliderInteraction has been redone to use pure JavaScript. The rendering for interactions has been improved with rewritten JavaScript and better handling of bad inputs.
  • Support for proposed new QTI expressions: We now support the proposed new mathOperator, mathConstant, statsOperator, min, max, lcm, gcd and roundTo expressions. The rounding logic in JQTI has also been rewritten so that it does the right thing when resolving tie breaks.
  • Better JQTI processing logic: The logic that controls the processing of assessment items has been improved to fix a number of known issues and support the spec better.
Phew! You may note from this that I haven't said anything about tests yet. Tests are going to be sorted out in the next development iteration. Because of this, they're not currently working in the latest snapshot of MathAssessEngine, which is why I haven't replaced the "live" version with the new one yet. But if you're feeling brave, you're welcome to have a play with the latest snapshot of development work at http://www2.ph.ed.ac.uk/MathAssessEngine-dev.

More updates later. I'll try not to take so long for the next update...

Saturday, 5 March 2011

ASCIIMathParser.js released

Peter Jipsen's ASCIIMathML script has been a very useful solution for getting maths on the web in some cases for a good few years now. In my opinion, one of its greatest assets is its very compact input syntax for maths, which is simpler and easier to pick up than LaTeX.


In the last few years, I have increasingly been using ASCIIMath purely as a means for parsing its input syntax into MathML, which is then passed to some other software component for further processing. For example, this idea is used in MathAssessEngine as a mechanism for students to input simple mathematical expressions as responses to assessment items. Other features in the script, such as pre-processing an HTML page, providing an online calculator widget, drawing graphics, and triggering the display of MathML in Firefox and Internet Explorer simply have not been used, so it has seemed a bit wasteful keeping them around.

In light of my very specific requirements for using ASCIIMath, I spent a few hours this week having a go at creating a cut-down version of the script that concentrates solely on converting ASCIIMath input syntax to MathML. This seemed quite an interesting thing to do for 2 main reasons:
  • I was keen to see just how small the actual parsing code is. (It is indeed very compact, which is nice, as it makes porting it to other languages seem like an easier option than it would have done looking at the "kitchen sink" approach of the original script.)
  • I wanted to minimise the dependency on browser-based JavaScript Objects to see if it could be ported to other JavaScript/ECMAScript environments. (I ended up just requiring a suitably decent DOM Document Object, which means the script could run on Java with the Rhino JS engine.)
The result of this experiment is a new script called ASCIIMathParser.js. I have posted this in the SnuggleTeX Math Playground and you are free to do whatever you like with it if you think it might be useful.

It is squarely aimed at developers like myself. There is an example of its usage in my new ASCIIMath input widget demo, and I have also done a very simple example showing it being used server-side in a Java servlet.

Implementation notes

I wanted to keep the structure of my script as close to the original script (v2.1) as possible so that I could keep it in sync with any future fixes/changes. To help me with this, I've put in lots of comments telling me where changes/cuts were made.

The first thing I did was find the actual entry point into the parsing code. This is the AMparseExpr() function. I then cut out the bits at the start of the script that handle the display of MathML, and the bits that provide the LaTeX, SVG and calculator functionality.

Next up was creating the new parser entry point, which is called parseASCIIMathInput().

I also had to replace the createMmlNode() with one which wasn't written for the Firefox vs. IE/MathPlayer dichotomy that ASCIIMath supports. There's still a dichotomy in the resulting function, since Microsoft's DOM still doesn't have a createElementNS() method, but the results are now "standard" DOM vs. Microsoft DOM, rather than Firefox vs. IE/MathPlayer.

I also removed the extraneous <mstyle> wrapping that ASCIIMath does, and commented out some bits of code and most of the old global variables that weren't used.

Finally, I wrapped the whole thing up in
function ASCIIMathParser(document) {
    ...

    this.parseASCIIMathInput = function(asciiMathInput) {
        ...
    }
}
This encapsulates all of the code into a simple class, and also demonstrates that the only real dependency is a suitable XML DOM Document Object.

It was quite easy to test that this code works in any modern browser. You really do need to create a new XML DOM to run the code since, if you're running in an HTML page, your document Object will be a bit deficient. But this is not hard, and I've provided a helper script that you can use for this.

I also spent an hour or so getting the script to run from Java using the Rhino JavaScript engine and Java's native org.w3c.dom.Document Object. This worked surprisingly well, with the exception that traversal of an org.w3c.dom.NodeList via node[i] didn't work, so I did a search & replace to change these to node.item(i) in the code. There is probably a more elegant future solution to this.

Nevertheless, I am quite pleased at how easy this turned out. I haven't done the ASCIIMath LaTeX or SVG input, but that may well be possible too. (It really depends on how closely the code ties itself to browser JavaScript Objects.)

Feel free to play around with the code. Feedback, comments, bug reports are welcome.

Friday, 4 March 2011

New SnuggleTeX Math Playground

Today I relaunched the "Math Playground" demo site that I originally created just before Christmas back in 2008. This site was initially created to try out and demonstrate some ideas for the JISC MathAssess project, which ended up seeding the SnuggleTeX Up-conversion/Semantic Enrichment process and the Jacomax project, as well as incubating some ideas that later found their way into MathAssessEngine.


This new version of the site has been rebranded as the "SnuggleTeX Math Playground" as one of its main purposes is to give me somewhere to incubate and refine some new ideas that might find their way into SnuggleTeX 1.3.0 or later.

The site launches with the following:
  • A purified version of ASCIIMathML called ASCIIMathParser.js that contains just the core ASCIIMath parsing code. This might open up new avenues for using and integrating ASCIIMathML in new situations. I plan to blog about this shortly...!
  • A trivial example demonstrating ASCIIMathParser.js running in a Java webapp.
  • A brand new version of the ASCIIMath input demo from the current SnuggleTeX website. This one uses MathJax for rendering the maths, enabling it to work in any modern browser. It also does real-time AJAX calls to a brand new SnuggleTeX up-conversion web service that lets users see whether their input "makes sense" while they type.
  • A version of the new ASCIIMath input demo that uses SnuggleTeX's LaTeX as an input format, just for fun.
Have a play... you won't break anything. Probably...

Saturday, 26 February 2011

SnuggleTeX Development Update

It has now been a good few months since I released SnuggleTeX 1.2.2 way back in May 2010. Since then, I haven't really had as much time to allocate to it as I would prefer, but I have made some progress on putting together the next release, which will either be 1.2.3 or 1.3.0 depending on how things go.

Here's a quick run-down of recent developments:

  • Unicode input support: You can finally use arbitrary Unicode characters in your input as you can with modern LaTeX distributions. This still needs a bit of polish but the general idea is now there.
  • Simpler definition of many MATH commands: Many MATH mode commands are really just aliases for Unicode characters. Because of the newly added Unicode support, these commands are all now defined in simple text files that you can override by chucking alternative versions in the ClassPath, which will allow you to customise things more easily than before.
  • Cleverer parser: LaTeX is a really complex beast to support as it allows you to use a rather messy mixture of markup styles. For example, the older style commands like \bf last until the next closing brace, whereas newer ones like \textbf{...} have a more tree-like approach that is easier to map to the XML way of doing things. SnuggleTeX has always been able to parse both, but was limited in how it would propagate style commands in the resulting output. However, I have now tweaked the core SnuggleTeX code so that things behave correctly. (This was actually quite a big change that required much chewing of fingers!)
  • Support for MathML 3.0: This isn't really as exciting as it sounds, but basically the MathML that SnuggleTeX generates is suitably unexciting that it's fine as MathML 2.0 or MathML 3.0. The unit tests now verify that the MathML generated is valid according to the latest RELAX NG schema for MathML 3.0, just to make sure I'm not giving you mince in your outputs.
Things currently in the pipeline:
  • Modern web page generation: Many of the options currently available for generating MathML-enabled web pages are getting a bit long in the tooth. The recent emergence of MathJax has made it far easier to create MathML-based web pages that will render in any modern browser, so I'm currently playing around with adding support for this. I also want to add support for HTML5 output which, when combined with MathJax, is something that is future-proof and already working in pretty much any modern browser as well. I'm still playing around with this, but something concrete should be committed soon...
  • New web demo applications: I want to update the demo web applications to try out some new stuff. This is currently happening in a little experimental spin-off webapp that I'll blog about very soon. Watch this space!
  • Rejigged website: I can never find anything in the current SnuggleTeX website. You probably can't either. I really need to fix this.
So there you go! The next release will probably be called 1.3.0 as some of the new features required a lot of changes to the internal API that some folks may be using, which warrants bumping up the minor version number. There may be a 1.2.3 bug-fix release that incorporates some minor bug-fixes I've done if an issue is discovered with 1.2.2 that would benefit from an immediate fix going out. We'll see...

Monday, 6 December 2010

Java and XML Gotchas #1: Don't pass a java.io.Reader to an XML parser

This particular gotcha is a good example of a class of problems I call encoding mismatches and is an easy and rather nasty trap to fall into for a number of reasons:
  • The Java API for XML processing lets you get away with it so effortlessly.
  • Novice Java programmers learn to use Reader classes to read in textual data and, "since XML is textual data", this seems an obvious and helpful thing to do when reading in XML.
  • English speakers don't usually notice anything is wrong until it's too late!
I have certainly made this mistake in the past myself, as have many people I've worked with over the years, so I think it is definitely something that every Java programmer who uses XML should at least think about, even if you're already doing things correctly.

Before we even look at XML though, we need to take a few steps back and briefly remind ourselves of some of the things we should know about when communicating textual data.

Unicode and encodings

Every programmer needs to know at least a little bit about encodings, which are algorithms specifying how textual data should be represented as binary data for storage and transmission. Java and XML both support the Unicode standard, which defines well over 100000 characters and symbols in use throughout the world. In order to communicate all of these characters digitally, they need to be packed into bytes and, with a single byte capable of representing only 256 possible characters, this is clearly not a trivial task. One arguably old-fashioned approach is to throw away most of Unicode and select a subset that can be mapped into a single byte. This gives us popular encodings such as the 128-character ASCII encoding and the various Latin encodings, which flesh ASCII out with characters used in various European countries. More flexible encodings that allow you to use all of the Unicode character range necessarily use more than one byte to represent certain characters. Perhaps the most common example of these encodings is UTF-8, which uses between 1 and 4 bytes to encode each Unicode character and is very efficient for representing English text as it encodes ASCII characters in exactly the same way as the ASCII and Latin encodings.
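
As a quick illustration of the variable-width nature of UTF-8, the following small sketch counts the bytes needed for a handful of characters. The class and method names (Utf8Lengths, utf8Length) are just made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {

    /** Returns the number of bytes UTF-8 needs to encode the given text. */
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the example does not depend on the
        // encoding of this source file.
        String asciiLetter = "A";
        String uUmlaut = "\u00fc";                                       // U+00FC
        String euroSign = "\u20ac";                                      // U+20AC
        String doubleStruckA = new String(Character.toChars(0x1D538));   // U+1D538

        System.out.println(utf8Length(asciiLetter));   // prints 1
        System.out.println(utf8Length(uUmlaut));       // prints 2
        System.out.println(utf8Length(euroSign));      // prints 3
        System.out.println(utf8Length(doubleStruckA)); // prints 4

        // In Latin-1 the same u-umlaut fits into a single byte.
        System.out.println(uUmlaut.getBytes(StandardCharsets.ISO_8859_1).length); // prints 1
    }
}
```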

Reading encoded text data

In order to read in textual data represented as bytes, you need to know the encoding that was used so that it can be decoded correctly. This simple fact often surprises novice programmers, who all too easily rely on things like a "default" encoding for reading and writing textual data. Default encodings are very convenient if you are the sole producer and consumer of your data, but they are useless if you're using data you've obtained from someone or somewhere else. For this reason, most "protocols" that communicate textual data have a mechanism of telling you which encoding is being used so that you (or your software) can correctly decode it.

Encoding mismatches

If you read in textual data using the wrong encoding then you will find erroneous characters introduced into the decoded text and your users will start complaining about "funny symbols". You may also get an error report if you're lucky, depending on the API and the way the encoding algorithm works.

Noticing these mismatched encoding bugs can sometimes be harder than you imagine, especially when dealing with English text in the common encodings. As a Brit, the most common encodings I encounter are UTF-8 and Latin-1. Since ASCII characters are encoded into the same bytes when using the ASCII, Latin and UTF-8 encodings, encoding mismatches only become evident when using non-ASCII characters such as accented European characters or mathematical symbols, and it's not uncommon to encounter software that has managed to get into production without having thought of ever using such characters, leading to funny symbol reports from confused users. For example, a German ü character is encoded as 2 bytes in UTF-8. If these bytes are then (incorrectly) decoded using Latin-1, then you'll end up with 2 (wrong) characters instead of the ü. Fun!
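
The ü example above is easy to reproduce. Here is a minimal sketch (the class and method names are invented for illustration) that encodes text as UTF-8 and then deliberately decodes it as Latin-1:

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {

    /** Encodes the text as UTF-8, then (wrongly) decodes those bytes as Latin-1. */
    public static String encodeUtf8DecodeLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // u-umlaut (U+00FC) becomes the two bytes C3 BC in UTF-8, which
        // Latin-1 decodes as U+00C3 followed by U+00BC.
        String garbled = encodeUtf8DecodeLatin1("\u00fc");
        System.out.println(garbled.length()); // prints 2

        // Pure ASCII text survives the mismatch, which is why the bug hides so well.
        System.out.println(encodeUtf8DecodeLatin1("Hello").equals("Hello")); // prints true
    }
}
```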

Reading in text using Java

The simplest and traditional way of reading in textual data in Java is to use a subclass of Reader. You can correctly and consistently read a UTF-8-encoded text file in with the rather verbose:
InputStreamReader reader = new InputStreamReader(new FileInputStream(new File("myfile.txt")), "UTF-8");
Less experienced programmers might opt for the shorter:
FileReader reader = new FileReader(new File("myfile.txt"));
The problem with this second form is that the encoding is not specified anywhere, so Java will use the "platform default" encoding, which may or may not be the correct one and will be specific to the computer the code is running on. (So, in particular, this form should never be used in "server-side" code.)

If you look at the java.io package Java API, you'll see that many Reader constructors let you specify the encoding that should be used, whereas many specify no encoding, using the platform default. This can be OK if you're reading and writing out text files locally, but you should only use these default encodings if you are 100% sure that the default encoding is the correct one, otherwise the character data will be decoded wrongly. Also, the Reader classes don't report decoding errors so it's hard to detect when things go wrong.
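
If you do want to be told about decoding problems, one option is to drop down to java.nio.charset.CharsetDecoder and ask it to report errors rather than quietly substituting characters. The following is only a sketch of that idea, with an invented class and method name:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecoding {

    /** Decodes bytes as UTF-8, throwing an exception instead of substituting characters. */
    public static String decodeUtf8Strictly(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // "fuer" with a u-umlaut, encoded as Latin-1: the byte FC is not valid UTF-8.
        byte[] latin1Bytes = "f\u00fcr".getBytes(StandardCharsets.ISO_8859_1);

        // Lenient decoding silently swaps the bad byte for the replacement character U+FFFD.
        String lenient = new String(latin1Bytes, StandardCharsets.UTF_8);
        System.out.println(lenient.indexOf('\uFFFD') >= 0); // prints true

        // Strict decoding tells us something is wrong.
        try {
            decodeUtf8Strictly(latin1Bytes);
        } catch (CharacterCodingException e) {
            System.out.println("Input is not valid UTF-8: " + e);
        }
    }
}
```

InputStreamReader also has a constructor that accepts a CharsetDecoder, so a decoder configured like this can be plugged into ordinary Reader-based code.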

XML and encodings

The XML specification is clever here and allows you to specify the encoding within the (normally optional) XML declaration at the start of the file, using a default of UTF-8 if no declaration is found. Here's an example:
<?xml version="1.0" encoding="ISO-8859-1"?>
This says that the encoded binary representation of this XML file uses the ISO-8859-1 (a.k.a. Latin-1) encoding.

When you tell an XML parser to parse a binary stream, it looks at the first few bytes to work out which encoding should be used. It then decodes the stream using this encoding and parses the resulting textual data, hopefully correctly. Your XML parser is actually doing a lot of work for you here, which you should be thankful for. You should also let it do this work, as it's much more likely to do it correctly than you are!

If, on the other hand, you decide to decode your XML first (e.g. using a Java Reader class), then you need to know the correct encoding in advance. You'll then be passing character data to your XML parser and it will correctly ignore the encoding specified in the XML declaration since you have already decoded the text. If you decoded using the wrong encoding, then funny symbols will no doubt ensue.
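
Here's a small sketch of the difference, using a DOM parser and an in-memory Latin-1 document so that it is self-contained. The class and method names are invented for this example, and the "wrong" encoding is assumed to be UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class XmlDecodingDemo {

    /** A tiny Latin-1 encoded document that declares its own encoding. */
    public static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><word>f\u00fcr</word>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Good: pass the raw bytes and let the parser honour the XML declaration. */
    public static String parseFromStream(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xmlBytes))
                .getDocumentElement().getTextContent();
    }

    /** Bad: decode the bytes ourselves as UTF-8, so the declaration is ignored. */
    public static String parseFromWronglyDecodedReader(byte[] xmlBytes) throws Exception {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xmlBytes), StandardCharsets.UTF_8);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(reader))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] xmlBytes = latin1Document();
        System.out.println(parseFromStream(xmlBytes).equals("f\u00fcr"));               // prints true
        System.out.println(parseFromWronglyDecodedReader(xmlBytes).equals("f\u00fcr")); // prints false
    }
}
```

Note that the second method doesn't throw anything: the parse succeeds and you just get the wrong text back, which is exactly what makes this gotcha so nasty.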

Reading XML with Java

The Java XML APIs generally come with a number of overloaded methods for parsing, transforming and doing other exciting things to XML sources. Based on what you've read so far, you'll now generally know to avoid using ones that take Reader, favouring InputStream or File instead.

Here's an example of parsing a File with a SAX Parser:
public static void parseXMLGood(File file, DefaultHandler handler) throws Exception {
    SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
    saxParser.parse(file, handler);
}

This is actually nice and simple in this case since the API helpfully provides a parse() method taking a File. In other cases, you might need to obtain a FileInputStream first.

Of course, as with all "rules", there are valid cases for breaking them. For example, if you've built up some XML programmatically as a big String, then using a StringReader is of course the right approach.
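
For completeness, here's roughly what that legitimate StringReader case might look like. There are no bytes involved, so there is nothing to decode and no encoding to get wrong. The class and method names are again just for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseXmlString {

    /** Parses XML that already exists as character data in memory. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // XML built up programmatically as a String, including a non-ASCII character
        String xml = "<word>f\u00fcr</word>";
        Document document = parse(xml);
        System.out.println(document.getDocumentElement().getTextContent().equals("f\u00fcr")); // prints true
    }
}
```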

Conclusion... and moral of the story

Unless you have reason to do otherwise - and know what you're doing - you should always:
  • Pass raw binary streams (e.g. InputStream, File) to your XML parser
  • Let your XML parser do the decoding for you
  • Only use the Reader constructors that specify an explicit encoding

Wednesday, 1 December 2010

Refactoring JQTI

During some fleeting moments of deep thought recently, I've started to think that the JQTI library might benefit significantly from a bit of refactoring to split out some of the competing types of information that it models.


What's JQTI again? Well, JQTI is a lovely little Java library that models the IMS QTI specification, which was created by the folks at the University of Southampton. When I say little, I should really say large, but necessarily so as the QTI specification is itself large and powerful. JQTI provides developers with a family of Java Classes that closely matches the concepts defined in the QTI specification. For example, there is an AssessmentItem class that mirrors the <assessmentItem> element in the QTI schema, which describes a single assessment item. An instance of this Class has methods to manipulate the actual data in the question (i.e. the XML attributes and element content), methods for validating data and inherited utility methods for reading and writing XML. But it also contains methods for doing the question, such as initialising template variables and processing responses, and all of the question state is contained within the Object as well.

This self-contained approach has some nice qualities - it makes it quite easy to use the library and the API is correspondingly nice and simple too. However, there are some obvious issues with mixing things together in this way. First of all, imagine we're using JQTI in a testing system that issues the same question to 1000 students. We would then need to instantiate 1000 AssessmentItem instances: one for each student. Within these are 1000 identical copies of the underlying XML data, which is wasteful. Another issue with this model is with the statefulness of these JQTI Objects, meaning that we need to keep them alive for a long enough time, and there's no well-defined way of serializing these Objects into XML for passing around. This makes creating a RESTful API to JQTI-based systems a bit awkward. (Southampton now have a kind-of RESTful version of QTIEngine, but it is HTTP session-based so arguably not quite there.)

In light of these issues, I've spent a few days looking at the code to see how easy it might be to refactor things to split things up so as to separate (at least) the XML data parts, the candidate state parts, and the QTI business logic. In smaller models, this is not usually too hard, but it's actually going to be quite a bit of work for JQTI and I think it might be wise to make some compromises that would make code purists feel slightly ill. For example, one compromise might be to leave some of the business logic in the original JQTI classes, acting on passed state and context Objects rather than on underlying internal state. If I don't do this, then I fear we would end up having another hierarchy of logic classes as there's a lot of business logic in very specific places at the moment, and I'd then have to worry about weaving everything together. So we might have to be a bit pragmatic here.
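
To give a flavour of the pragmatic compromise described above, here is a very rough sketch in which the business logic stays on the item class but acts only on a state Object that is passed in. Everything here (ItemDefinition, CandidateState, the SCORE outcome and the simple string matching) is hypothetical and hugely simplified; it is not how JQTI currently looks.

```java
import java.util.HashMap;
import java.util.Map;

public class StateSplitSketch {

    /** Immutable question data: loaded once and shared by every candidate. */
    public static final class ItemDefinition {
        private final String identifier;
        private final String correctResponse;

        public ItemDefinition(String identifier, String correctResponse) {
            this.identifier = identifier;
            this.correctResponse = correctResponse;
        }

        public String getIdentifier() {
            return identifier;
        }

        /** Logic stays here, but reads and writes only the state passed in. */
        public void processResponse(CandidateState state, String response) {
            state.setOutcome("SCORE", correctResponse.equals(response) ? 1.0 : 0.0);
        }
    }

    /** Small per-candidate state: the only thing that needs keeping alive or serializing. */
    public static final class CandidateState {
        private final Map<String, Double> outcomes = new HashMap<>();

        public void setOutcome(String name, double value) {
            outcomes.put(name, value);
        }

        public Double getOutcome(String name) {
            return outcomes.get(name);
        }
    }

    public static void main(String[] args) {
        // One shared definition, however many candidates there are
        ItemDefinition item = new ItemDefinition("q1", "ChoiceA");

        CandidateState alice = new CandidateState();
        CandidateState bob = new CandidateState();
        item.processResponse(alice, "ChoiceA");
        item.processResponse(bob, "ChoiceB");

        System.out.println(alice.getOutcome("SCORE")); // prints 1.0
        System.out.println(bob.getOutcome("SCORE"));   // prints 0.0
    }
}
```

In the 1000 students example, this would mean one ItemDefinition and 1000 small CandidateState Objects, rather than 1000 copies of the underlying XML data.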

Indeed, I have already tried bumping code around to split up the AssessmentItem part of the spec (i.e. the easy bit) to see how things might take shape and identify difficulties and risks with the process. Doing the same for AssessmentTest is an order of magnitude harder, though. And then there's the risk that I break more than I fix with this, and there's already limited unit test coverage so that won't provide much of a safety blanket. So I clearly need to think a bit harder.

So watch this space! For those that are interested, I already maintain a fork of JQTI in the FETLAR SVN repository on SourceForge.net. This JQTI-MathAssess fork adds in some hooks to make MathAssessEngine perform better with Maxima, fixes the (slightly bizarre... and arguably wrong) way that JQTI reads and writes XML and tracks the more experimental stuff that the QTI working group have been trying out as we move towards QTI 2.1. The refactored QTI stuff is going to be branched from this, but I won't give details until/if I commit to doing it.