I was working on an Android app, jata, which depends on getting data from a web-based server that uses XML. I tried using a DOM Document object (org.w3c.dom.Document) and it worked well, but I switched to an XmlPullParser for performance reasons. Using the pull parser involved a lot of trial and error, as I was fairly new to both Android programming and parsing XML.

Eventually, I wanted to upgrade the app by creating my own server API, which supplies some supplemental data, also in XML. While debugging the parser for the new data feed, I noticed that a particular xpp.next() call caused the parser to get out of sync. The reason turned out to be that the original data feed has a lot of extra whitespace, which does not exist in the feed I created for the extra data.


XML as seen in a web browser

To see the structure of the incoming XML, I would use a web browser, which displays an XML response as in the image to the left. It looks simple, but the browser has already stripped out the unneeded whitespace.

When I started work on parsing my own data source, the browser showed similar results. But my parser was not happy because, as written, it was expecting the same “extra” whitespace that the original feed had.


XML from my new server in browser.

This (to the right) looks the same to me, but the browser hides any whitespace differences between the two feeds.

Then I remembered my old friend wget. Linux users should already know about this command; Windows users can look here.

The data feed URLs are of the form “www.someserver.com?cmd=somecommand&parm=someparameter”.  So the command to get this as a text file is

wget "http://foo.bar.net/baz/baz-api.php?cmd=command"

When using wget for an address like this, be sure to enclose the URL in quotes; otherwise your command shell will try to interpret characters such as ? and & itself.

The bottom line is, after wgetting the feeds to a local text file, I opened them up with Notepad++ and turned on the display of non-printing characters. And I saw this:


XML in Notepad++
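If an editor with a show-whitespace mode is not at hand, the same check can be roughly sketched in Java. This is only an illustration: the class name ShowWhitespace, the visible() helper, and the sample feed are made up for this example, and the file path is whatever name wget saved the feed under.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class ShowWhitespace {
    // Replace CR, LF and tab with visible escapes.
    // A real line break is kept after each LF so the output stays readable.
    static String visible(String raw) {
        StringBuilder out = new StringBuilder();
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '\r': out.append("\\r"); break;
                case '\n': out.append("\\n\n"); break;
                case '\t': out.append("\\t"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the file saved by wget; without an argument,
        // fall back to a small made-up sample feed.
        String raw = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "<feed>\r\n\t<item>1</item>\r\n</feed>";
        System.out.print(visible(raw));
    }
}
```

Running it against both downloaded feeds makes it easy to compare where one has CR/LF pairs and tabs and the other does not.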

There are a lot of unneeded CR/LF pairs, as well as tabs, in the feed. This will have me going back to my original code to fix the logic that dealt with unexpected whitespace. I suspect changing my XmlPullParser.next() calls to nextToken() might help. Another option is to handle IGNORABLE_WHITESPACE events, which only nextToken() reports.
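One way to make the parsing logic indifferent to indentation is to skip whitespace-only text events instead of counting on them. The sketch below is not my app's code: XmlPullParser is not part of the standard JDK, so it uses the JDK's StAX XMLStreamReader as a stand-in with a similar pull model, and the element names (stop, id, name) are invented. On Android, the analogous check would be XmlPullParser.isWhitespace() on TEXT events, but I have not yet tried that against the real feeds.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class SkipWhitespaceDemo {
    // Collect "element=text" pairs, ignoring text events that are
    // nothing but whitespace (CR/LF, tabs, spaces between tags).
    static List<String> readFields(String xml) {
        List<String> fields = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver adjacent character data as a single event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            String current = null; // name of the element we are inside
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    current = reader.getLocalName();
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    if (reader.isWhiteSpace()) {
                        continue; // indentation only, not data
                    }
                    fields.add(current + "=" + reader.getText().trim());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    current = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical feeds: same data, with and without extra whitespace.
        String padded =
                "<stop>\r\n\t<id>42</id>\r\n\t<name>Main St</name>\r\n</stop>";
        String compact = "<stop><id>42</id><name>Main St</name></stop>";
        System.out.println(readFields(padded));
        System.out.println(readFields(compact));
    }
}
```

The point of the design is that the loop reacts to event types rather than to a fixed sequence of next() calls, so a feed with or without extra whitespace yields the same fields.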

That’s it for now. I need to go refactor some code.