If not, please make sure you have the correct prerequisites before continuing with this article. Readers already familiar with the API of the JsaPar library, or looking for more detailed information on how to use the library in specific cases, should consult the basic features, advanced features and working with events articles for further assistance.
The JsaPar API
An Application Programming Interface (API) is a set of routines, protocols, and tools for building and extending software applications. The API is the contract between you - the developer - and the JsaPar library: it defines how the library interacts with your client code, which operations it can perform, the inputs and outputs, the data types to be used, and what kind of tools you have at your disposal. For the library to work properly you must adhere to this contract. Operations, types, inputs and outputs that are not part of the official API are not guaranteed to remain the same in future versions, so deviating from the contract could break your client code when you upgrade. It is therefore important to always respect the API guidelines of the library.
Like many developers, you probably want to start coding right away without reading a lot of introductory text. For the sake of clarity, however, we recommend that you do NOT dive straight into the code examples. If you first invest a little time in getting the concepts of the JsaPar library clear in your mind, you will struggle less in getting the library to do as you please, and your client code will be better organized and easier to read as well.
In this article we first discuss the concepts of the JsaPar library, then explain the API in detail, and finally show how to integrate the library into your own code with a few basic examples to get you started. After these introductory steps you can continue with the above-mentioned articles to deepen your knowledge of the JsaPar library and learn more about all the features it has to offer.
Concepts of the JsaPar library
Data source types
The JsaPar library supports two types of data sources that can be processed:
- Delimited* separated value data sources, commonly known as Comma Separated Value (CSV) data sources.
- Fixed width data sources, also known as flat data sources.
For both data sources the processing of lines can be based on:
- field position(s), and/or
- field control value(s).
Note: the only limitation the library currently has is that when processing lines based on a field control value, the control value must be in the first field position of the line. In future versions of the library this limitation will be removed, so that the control value can be detected in any field position. Combined control field options are also planned for a future version.
* Tip: we intentionally talk about 'delimited separated value' data sources rather than 'comma separated value' (CSV) data sources, because the delimiter used to separate the fields within a line doesn't have to be a comma: the JsaPar library lets you choose any delimiter you want to distinguish fields within a line. Likewise, we prefer to talk about 'data sources' instead of 'files', because the data source doesn't have to be a physical file on an information carrier. It can also be an in-memory representation, such as objects of a specific data type, or some other form.
We differentiate four situations in which the JsaPar library can be used as a direct or indirect intermediary to process delimited and/or fixed width data sources:
- data source into object representation
- object representation into data source
- data source into another data source without client code intervention
- data source into another data source with client code intervention
Types of Tools
For handling these situations, the JsaPar library accommodates four tools to work with delimited and/or fixed width data sources. You have the following tools at your disposal:
- Parser (org.jsapar.input.Parser)
- Outputter (org.jsapar.output.Outputter)
- Converter (org.jsapar.io.Converter)
- ConverterMain (org.jsapar.io.ConverterMain)
You will need a Parser tool for reading from a data source and an Outputter tool for writing to a data source. When you want to convert one data source into another data source, you can use a Converter tool.
Under the hood the library actually uses only two unique tools: the Parser and the Outputter. The Converter is constructed by combining a Parser and an Outputter. The ConverterMain tool is a special case of the Converter because it can be used in a standalone setting: you do not need to write any Java client code, and it can be operated from the command line on any operating system that runs Java. For now we focus our attention on the first three tools - Parser, Outputter and Converter - as these are the tools used by developers working with the JsaPar library from their own Java code. The ConverterMain tool is described in full in the advanced features article.
Interfaces with the outside world
The tools communicate with the outside world using the standard Java I/O classes:
- Reader (java.io.Reader)
- Writer (java.io.Writer)
This means you can use any I/O class that is derived from these abstract classes in communication with the tools. For example: you could use the Parser or Outputter tool in combination with a StringReader (a character stream whose source is a string) or in combination with a FileReader (a character stream whose source is a file). This makes the library a highly flexible instrument for processing all kinds of delimited separated value and fixed width data sources, regardless of their physical or in-memory representation.
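Because the tools only depend on these abstractions, any Reader implementation can act as a data source. The self-contained sketch below illustrates the idea with plain java.io classes (no JsaPar code involved): the very same method works unchanged for an in-memory string and for a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderSourcesExample {
    // A method that accepts any Reader, just like the JsaPar tools do.
    static String firstLine(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // The same code works for an in-memory source...
        System.out.println(firstLine(new StringReader("Ada;Lovelace\nCharles;Babbage")));
        // ...or, equally, for a file: firstLine(new FileReader("persons.csv"))
    }
}
```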
Document, Line and Cells
To process the data sources (both input and output), the tools use plain old Java objects (POJOs) that represent (parts of) the lines within a processed data source. These are:
- Document (org.jsapar.Document)
- Line (org.jsapar.Line)
- Cell (org.jsapar.Cell)
A Cell object represents a field value within a single line. A Line object represents a single line within the data source. The Document object represents the entire collection of lines within the data source. So this means that a Document object holds a collection of Line objects and that each Line object holds a collection of Cell objects.
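As an illustration of this containment hierarchy, the sketch below builds a Document by hand and reads a value back. The concrete names used here (StringCell as the string-typed Cell implementation, addLine, addCell, getCell, getStringValue) are assumptions based on the JsaPar 1.x API described in this article; verify them against your library version.

```java
import org.jsapar.Document;
import org.jsapar.Line;
import org.jsapar.StringCell;

public class DocumentStructureExample {
    public static void main(String[] args) {
        // A Cell holds one field value; StringCell is assumed to be the
        // string-typed Cell implementation (verify against your version).
        Line line = new Line();
        line.addCell(new StringCell("firstname", "Ada"));
        line.addCell(new StringCell("lastname", "Lovelace"));

        // A Line holds a collection of Cells; a Document holds the Lines.
        Document document = new Document();
        document.addLine(line);

        // Reading a value back out of the structure.
        System.out.println(line.getCell("firstname").getStringValue());
    }
}
```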
As of version 1.7 of the JsaPar library, the Document structure can be marshalled and unmarshalled directly using:
- Document.importFromXml(Reader reader)
- Document.exportToXml(Writer writer)
Before version 1.7 of the JsaPar library you had to use a convenience class called Xml2Document.
It is possible to build an org.jsapar.Document from an XML document that conforms to the XMLDocumentFormat.xsd (http://jsapar.tigris.org/XMLDocumentFormat/1.0). Use the class org.jsapar.input.XmlDocumentParser to convert such an XML document into an org.jsapar.Document.
Converting Java objects into a data source
Use the class org.jsapar.input.JavaBuilder to convert Java objects into an org.jsapar.Document, which can then be used to produce an output file according to a schema. In other words: Java objects are converted into a Document, which is converted into an output file using a schema.
Each Cell must have a data type assigned, chosen from the data types provided by the library. You define the Cell's data type at creation time, or within the XML document.
When processing data sources, you have two options on how these data sources are represented within memory and thus in your client code. These options are:
- Document object based, or
- Line object based using Java events
Document based
When you choose to represent the data source(s) as a Document object, all lines within the data source are loaded into memory, creating a Document object that holds all Line objects, including the Cell objects for each line! For small data sources this is not really an issue, but for larger data sources it consumes too much memory to process the data source(s) effectively; really huge data sources cannot be read entirely into memory at all. Therefore, representing data sources 'Document based' is only recommended for small data sources (e.g. less than a few megabytes). The advantage of this representation option is that it is easy to implement in your client code (e.g. you get the fully processed data source at once after the Parser has finished).
When you choose to work 'Document based', the complete data source is returned to the client code as a Document object after processing has completed. This means your client code waits for the Parser tool to finish until all lines are processed.
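As an illustrative sketch of the 'Document based' approach (the schema construction is abbreviated, and the Parser signature shown - a build method that returns a Document and collects recoverable errors in a list - is assumed from the org.jsapar.input.Parser API referenced in this article; verify it against your library version):

```java
import java.io.FileReader;
import java.io.Reader;
import java.util.LinkedList;
import java.util.List;

import org.jsapar.Document;
import org.jsapar.input.CellParseError;
import org.jsapar.input.Parser;
import org.jsapar.schema.CsvSchema;

public class DocumentBasedExample {
    public static void main(String[] args) throws Exception {
        CsvSchema schema = new CsvSchema();
        // ... add schema lines and cells describing the data source here ...

        try (Reader reader = new FileReader("persons.csv")) {
            Parser parser = new Parser(schema);
            List<CellParseError> parseErrors = new LinkedList<CellParseError>();
            // 'Document based': build() blocks until the entire data source
            // has been read, then returns the complete Document.
            Document document = parser.build(reader, parseErrors);
            // The whole Document/Line/Cell structure is now in memory.
        }
    }
}
```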
Line based using Java events
When you choose to represent the data source(s) as Line objects using Java events, the Parser processes the data source one line at a time, creating a Line object and firing a Java event to notify your client code that a line has been processed. This way memory is not consumed the way it is with the 'Document based' option: each Line object can be discarded as soon as your client code has processed it, freeing memory along the way while the complete data source is processed. Therefore, representing data sources 'Line based using Java events' is recommended for both small and large data sources (e.g. more than a few megabytes). The advantage of this representation option is that memory consumption is very low. A disadvantage is that it is somewhat harder to implement, because it requires knowledge of how to handle Java events within your client code. For really huge data sources you have no alternative but to use this type of processing.
When you choose to work 'Line based using Java events', the data source is delivered as a series of Java events, each containing a single Line object for one read line. This means your client code can process each line as soon as it has been read, instead of waiting for the Parser tool to finish the entire data source.
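A sketch of the event-based approach. The listener interface and callback names below (org.jsapar.input.ParsingEventListener with a line-parsed and a line-error callback) are assumptions based on the JsaPar 1.x API; check the exact signatures in the working with events article and in your library version.

```java
import java.io.FileReader;
import java.io.Reader;

import org.jsapar.Line;
import org.jsapar.input.LineErrorEvent;
import org.jsapar.input.LineParsedEvent;
import org.jsapar.input.Parser;
import org.jsapar.input.ParsingEventListener;
import org.jsapar.schema.CsvSchema;

public class LineBasedExample {
    public static void main(String[] args) throws Exception {
        CsvSchema schema = new CsvSchema();
        // ... add schema lines and cells describing the data source here ...

        try (Reader reader = new FileReader("persons.csv")) {
            Parser parser = new Parser(schema);
            // 'Line based': the Parser fires one event per parsed line, so no
            // complete Document is ever held in memory.
            parser.parse(reader, new ParsingEventListener() {
                public void lineParsedEvent(LineParsedEvent event) {
                    Line line = event.getLine();
                    // Handle one line here; afterwards it can be garbage collected.
                    System.out.println(line);
                }

                public void lineErrorEvent(LineErrorEvent event) {
                    // Recoverable failures arrive as error events.
                    System.err.println("Parse error: " + event);
                }
            });
        }
    }
}
```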
To summarize: depending on how you choose to represent the data source in memory, you either receive a Document object filled with Line and Cell objects after the Parser has parsed the entire data source ('Document based'), or you receive one Java event holding a single Line object and its corresponding Cell objects for each line the Parser processes ('Line based using Java events').
Data source schema
For the tools to understand how the data source(s) need to be read (parsed) and written (outputted), you need to define a schema object. The schema object describes the structure of the data source. You can see the schema object as the metadata of the data source: it describes where each part of the data can be found on each line and how it should be interpreted by the Parser, Outputter and Converter tools.
A schema can be constructed in two ways:
- Programmatically, by creating a schema object in code.
- XML document, by providing an XML document that represents the schema.
You either create a schema object yourself in code, or construct one by loading an XML document, which in turn creates the schema object for you based on the settings stored within the XML document. The tools always need the object representation of the schema to process the data source(s). The Parser and Outputter tools need only one schema object, while the Converter tool needs two (different) schema objects.
Defining the data source structure programmatically is pretty straightforward. Consider, however, that defining the structure in code has two disadvantages. First, whenever the data source structure definition changes, you have to modify the Java code, which is inconvenient when only the structure changes - for example to correct typos or make other small changes that do not affect the actual processing code. Second, for large and complex data source structures the code can become cluttered and hard to maintain. In such cases it is advised to switch to the XML document approach for the schema.
Defining the data source structure using XML document(s) is preferred in almost all situations. It has one major advantage over the programmatic approach: you can change the data source structure definition at any time without having to adjust the Java code (provided the change doesn't impact the actual processing code). This is typically useful when the data source structure changes over time but keeps the same record fields, now positioned in different locations within the data source.
In order to process a delimited or fixed width data source, the tools need one or more of the following schema definitions:
- CsvSchema
- CsvControlCellSchema
- FixedWidthSchema
- FixedWidthControlCellSchema
There are some differences between the above-mentioned schemas that need explanation. All of these schemas describe the different types of lines that can appear in a data source. To be able to detect these different line types, there must be some sort of uniqueness within the lines to distinguish them from each other. The schemas use different techniques to detect this uniqueness, and you define the lines and the corresponding fields of the data source accordingly.
The CsvSchema detects the line by its position within the data source. Each cell is distinguished from another cell by a separator character (or characters).
The CsvControlCellSchema detects the line by the first leading cell of each line within the data source. Each cell is distinguished from another cell by a separator character (or characters).
The FixedWidthSchema detects the line by its position within the data source. Each cell is distinguished from another cell by its begin and end character position.
The FixedWidthControlCellSchema detects the line by the first leading cell of each line within the data source. Each cell is distinguished from another cell by its begin and end character position.
The schemas without the control cell option are used in situations where you only want to use field position(s) for detecting the type of line that needs to be processed. The schemas with the control cell option are used in situations where you are dependent on the value of a control cell for detecting the type of line that needs to be processed.
Note: there is one limitation when using control cell schemas. The control cell has to be the first cell in the line in order to detect the line type. This is a limitation of the current versions of the JsaPar library. In future versions, this limitation will be removed.
Construct data source schema programmatically
This schema is based upon one or more schema lines that define - in great detail - how the lines within the data source should be interpreted.
The tools always take schema objects as input, so you can always modify schema settings at runtime; you are not limited by the content of the XML document.
Construct data source schema using XML document
If you want to describe the data source schema using XML, the best way is either to generate an XML document from the XML Schema (XSD), or to construct a data source schema object in code and export it to an XML file. The advantage of constructing a schema object in code and exporting it is that you get the XML file exactly the way you want it. When generating an XML file from the XSD you get a template XML file that you need to adjust manually to match the file structure of the data source.
You can of course do it the hard way by typing in all the needed XML elements yourself, or copy the contents of another schema file and work from that starting point to define the structure you need.
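For illustration, a minimal delimited schema document might look like the sketch below. The element names, attributes and namespace URI shown here are indicative only; consult the document schemas article and the XSD shipped with your library version for the exact format.

```xml
<schema xmlns="http://jsapar.tigris.org/JSaParSchema/1.0">
  <csvschema>
    <!-- One line type; cells are separated by a semicolon. -->
    <line occurs="*" linetype="person" cellseparator=";">
      <cell name="firstname"/>
      <cell name="lastname"/>
    </line>
  </csvschema>
</schema>
```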
Importing and exporting data source schema using XML
The schema structure can be imported from XML or exported to XML by using two convenience classes. For reading a schema from an XML source, use the following code snippet:
Reader schemaReader = new FileReader("mySchema.xml");
Xml2SchemaBuilder xmlBuilder = new Xml2SchemaBuilder();
Schema schemaObject = xmlBuilder.build(schemaReader);
For writing the schema to an XML source, use the following code snippet:
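A sketch of the corresponding export, assuming the second convenience class is org.jsapar.schema.Schema2XmlExtractor with an extractXml method (both the class name and the signature are assumptions based on the JsaPar 1.x API; verify them against your library version):

```java
import java.io.FileWriter;
import java.io.Writer;

import org.jsapar.schema.Schema;
import org.jsapar.schema.Schema2XmlExtractor;

public class SchemaExportExample {
    // NOTE: Schema2XmlExtractor and extractXml(...) are assumed names; check
    // your JsaPar version for the exact convenience class and signature.
    public static void exportSchema(Schema schemaObject) throws Exception {
        try (Writer schemaWriter = new FileWriter("mySchema.xml")) {
            Schema2XmlExtractor extractor = new Schema2XmlExtractor();
            extractor.extractXml(schemaWriter, schemaObject);
        }
    }
}
```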
As of version 1.7 of the JsaPar library, the schema structure can be imported from XML or exported to XML without the need for the above-mentioned convenience classes. From version 1.7 and up, you simply call the methods of the Schema class directly:
- Schema.importFromXml(Reader reader)
- Schema.exportToXml(Writer writer)
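As a sketch, a simple delimited schema could be constructed programmatically as follows. The schema classes (CsvSchema, CsvSchemaLine, CsvSchemaCell in org.jsapar.schema) and their methods are assumptions based on the JsaPar 1.x API; verify the exact names against your library version.

```java
import org.jsapar.schema.CsvSchema;
import org.jsapar.schema.CsvSchemaCell;
import org.jsapar.schema.CsvSchemaLine;

public class ProgrammaticSchemaExample {
    public static CsvSchema buildSchema() {
        // One line type whose cells are separated by a semicolon.
        // (Class and method names assumed from the JsaPar 1.x API.)
        CsvSchemaLine schemaLine = new CsvSchemaLine();
        schemaLine.setCellSeparator(";");
        schemaLine.addSchemaCell(new CsvSchemaCell("firstname"));
        schemaLine.addSchemaCell(new CsvSchemaCell("lastname"));

        CsvSchema schema = new CsvSchema();
        schema.addSchemaLine(schemaLine);
        return schema;
    }
}
```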
Delimited Line:
|Fig. 1: A delimited line.|
Fixed Width Line:
|Fig. 2: A fixed width line.|
Exceptions and errors
There are situations in which failure comes along and messes up your intentions as a developer. Nobody can predict what will happen outside the boundaries of your own code or the library code, so you are forced to deal with these situations nevertheless.
Because the JsaPar library does the parsing and outputting process for you, the library can report errors and exceptions to the client code. The library does that in two forms:
- throwing exceptions
- firing events
Which form is used depends on how you use the library. When you work 'Document based', the library throws exceptions when a failure occurs. When you work 'Line based using Java events', the library throws exceptions and/or fires Java events to report failures to the client code.
When dealing with non-recoverable failures, the library can throw the following exceptions:
When dealing with recoverable failures, the library can fire the following Java events:
The concept of the Parser in detail
To discuss each tool and its specific workings, we have split up the library into three landscape diagrams. We start with the Parser landscape, because that is usually the starting point for any developer seeking to reduce the complexity of processing delimited and/or fixed width data sources.
The Parser landscape
Figure 3 depicts the landscape of the Parser. In order to use the Parser, we need to tell it what kind of data source we would like to parse and where the actual data source can be found.
The Parser tool reads a specified data source, parses its content and delivers the result to your client code, either as a complete Document object or as a series of Java events each carrying a single Line object.
|Fig. 3: The Parser landscape.|
When you want the Parser to output a Document/Line/Cell structure, you have to use the Parser method:
When you want the Parser to output a single Line/Cell structure which is fired using a Java event, you have to use the Parser method:
After you have defined the schema definitions for the data sources you want to parse or produce, and you have selected the tool(s) to use, you tell the selected tool(s) where to find these data sources by supplying a class derived from Reader and/or Writer. You then start the tool by passing these data source classes as parameters to the method call of that specific tool. Below, the specific methods are listed for each tool:
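As an illustrative sketch of the general call pattern: the Parser is assumed to expose a build(Reader, List) method, and the Outputter is assumed to be constructed with the output schema and to expose an output(Document, Writer) method. Both signatures are assumptions based on the JsaPar 1.x API referenced in this article; verify them against your version.

```java
import java.io.FileReader;
import java.io.FileWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.LinkedList;
import java.util.List;

import org.jsapar.Document;
import org.jsapar.input.CellParseError;
import org.jsapar.input.Parser;
import org.jsapar.output.Outputter;
import org.jsapar.schema.Schema;

public class ToolInvocationExample {
    // Reading: the Parser receives its data source as a Reader.
    public static Document read(Schema inputSchema) throws Exception {
        try (Reader reader = new FileReader("input.csv")) {
            List<CellParseError> parseErrors = new LinkedList<CellParseError>();
            return new Parser(inputSchema).build(reader, parseErrors);
        }
    }

    // Writing: the Outputter receives its data target as a Writer.
    public static void write(Schema outputSchema, Document document) throws Exception {
        try (Writer writer = new FileWriter("output.txt")) {
            new Outputter(outputSchema).output(document, writer);
        }
    }
}
```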
The concept of the Outputter in detail
The Outputter landscape
|Fig. 4: The Outputter landscape.|
The concept of the Converter in detail
|Fig. 5: The Converter landscape.|
In this article both approaches are discussed using a very basic example. For more complex situations, please consult the following articles: basic features, advanced features, working with events and document schemas. These articles provide more in-depth knowledge and show the capabilities of the JsaPar library in more detail.
Types of Converters:
The Converter tool uses a Parser and an Outputter under the hood, working 'Line based using Java events'. The Converter does all the work for you without requiring you to intervene in the conversion process: it reads, converts and writes one line at a time.
You can, however, intervene in the conversion process by using the LineManipulator interface: a line manipulator registered with the Converter is invoked for each parsed Line before that line is written, allowing your client code to inspect or modify the line in transit.
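A sketch of such an intervention, assuming the Converter is constructed with an input and an output schema, exposes an addLineManipulator method and a convert(Reader, Writer) method, and that LineManipulator declares a single manipulate(Line) callback (all assumed from the JsaPar 1.x API; verify against your version):

```java
import java.io.FileReader;
import java.io.FileWriter;
import java.io.Reader;
import java.io.Writer;

import org.jsapar.Line;
import org.jsapar.io.Converter;
import org.jsapar.io.LineManipulator;
import org.jsapar.schema.Schema;

public class ConverterExample {
    public static void convert(Schema inputSchema, Schema outputSchema) throws Exception {
        try (Reader reader = new FileReader("input.csv");
             Writer writer = new FileWriter("output.txt")) {
            Converter converter = new Converter(inputSchema, outputSchema);
            // Called once per parsed line, before the line is written.
            // (Callback name and signature assumed; check your version.)
            converter.addLineManipulator(new LineManipulator() {
                public void manipulate(Line line) {
                    // inspect or modify the line here
                }
            });
            converter.convert(reader, writer);
        }
    }
}
```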
If your client code doesn't need to intervene in a conversion between two data sources, you can make do with a Converter tool. If your client code does need to intervene in a conversion between two data sources - because you need to alter some data at runtime and need an object representation in memory to achieve this - then you will have to use the Parser and the Outputter tools together and glue them together within your client code.
The JsaPar library in action
The best way to learn something new is simply by doing it! That is why we demonstrate the workings of the JsaPar library with real examples. These examples can be downloaded from the code samples page of this website and can be imported into Eclipse so that you can run them yourself while you continue reading this article. The examples use the Maven build manager, so make sure it is installed on your system. All data files needed to run the examples successfully are included within the specific Maven projects.
Delimited Separated Value data source example
The first basic example is reading a simple delimited separated value file based on the field position only and loading all lines into a Document object. The 'by code only' approach is used to realize the solution.
Eclipse project: TODO.zip
This is the file structure that we want to read into a Document object:
[show content of file]
Consider the following CsvExample class:
[class text here]
Don't forget to close the Reader and Writer source classes!
Fixed width data source example
The second basic example is reading a simple fixed width file based on the field position only and loading all lines into a Document object. The 'by code only' approach is used to realize the solution.
Eclipse project: TODO.zip
This is the file structure that we want to read into a Document object:
[show content of file]
Consider the following FixedWidthExample class:
[class text here]
In this article we discuss a few basic examples to demonstrate the 'Document based' option of the JsaPar library, so that you quickly become familiar with the API terminology. Other capabilities of the 'Document based' option can be found within the basic features and advanced features articles on this website. The 'Line based using Java events' option is not discussed in this article.
A separate article is dedicated to this option because it requires a deeper knowledge of the Delegation Event Model in Java, which goes beyond the scope of this getting started article. The working with events article gives a comprehensive explanation to deepen your knowledge of Java events, so that you understand how to use the Java event mechanism of the JsaPar library in cooperation with your own Java code.
By now you should have a clear understanding of the possibilities of this fine and easy-to-use Java library. If you are still eager to learn the ins and outs of the JsaPar library, we recommend that you read the following articles on this website:
- basic features. In this article... TODO.
- advanced features. In this article... TODO.
- working with events. In this article... TODO.
- document schemas. In this article... TODO.
- handling errors. In this article... TODO.
- how tos. In this article... TODO.
Happy learning and coding. Have fun using the JsaPar library!