Getting started with Apache Lucene and JSON indexing
In this post, I show how to index JavaScript Object Notation (JSON) documents using Lucene Core. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. Many companies, such as LinkedIn and Twitter, use Lucene for real-time search and faceted search.
You can find all the code used in this post on GitHub. To process a JSON file and index it, we need to:
- Parse a JSON file
- Create an index
- Add documents to the index
- Commit the changes and close the index
Parse a JSON file
I have created a test file that contains 3 JSON objects with different data types (long, String, double and boolean):
[xml]
[
  {
    "studyId": 1,
    "name": "id1",
    "lon": 204.0,
    "lat": 101.0,
    "stored": true
  },
  {
    "studyId": 2,
    "name": "id2",
    "lon": 204.0,
    "lat": 101.0,
    "stored": true
  },
  {
    "studyId": 3,
    "name": "id3",
    "lon": 204.0,
    "lat": 101.0,
    "stored": false
  }
]
[/xml]
To process the JSON file, I am using the json-simple library. With json-simple, decoding a JSON file is quite simple: you only need to call JSONValue.parse(yourfile).
[java]
/**
 * Parse a JSON file. The file path is passed in through the constructor.
 */
public JSONArray parseJSONFile() {
    // Load the JSON file from the classpath; in this case it is in ~/resources/test.json
    InputStream jsonFile = getClass().getResourceAsStream(jsonFilePath);
    if (jsonFile == null) {
        // getResourceAsStream returns null when the file is not on the classpath
        throw new IllegalArgumentException("JSON file not found on the classpath: " + jsonFilePath);
    }
    Reader readerJson = new InputStreamReader(jsonFile);
    // Parse the JSON file using the json-simple library
    Object fileObjects = JSONValue.parse(readerJson);
    JSONArray arrayObjects = (JSONArray) fileObjects;
    return arrayObjects;
}
[/java]
Create an index
I have created a class called LuceneIndexWriter which takes a directory path and the path of a file containing a number of JSON objects. This class is in charge of creating the index, adding all the JSON objects to it and, finally, closing it.
[java]
public class LuceneIndexWriter {
    String indexPath;
    String jsonFilePath;
    IndexWriter indexWriter = null;

    public LuceneIndexWriter(String indexPath, String jsonFilePath) {
        this.indexPath = indexPath;
        this.jsonFilePath = jsonFilePath;
    }

    public void createIndex() {
        JSONArray jsonObjects = parseJSONFile();
        openIndex();
        addDocuments(jsonObjects);
        finish();
    }
…
[/java]
Then we need to create the index. I have used a StandardAnalyzer. An Analyzer is in charge of processing the text of a field and returning the terms to be indexed. I am using a StandardAnalyzer because it removes stop words (for example a, an, the, …) and punctuation, but you could use a different analyzer such as WhitespaceAnalyzer, SimpleAnalyzer or StopAnalyzer.
In this case, I am opening the index using OpenMode.CREATE, which means that the index will be overwritten each time we run this function. If you want to create the index only the first time and keep adding to it afterwards, you should use OpenMode.CREATE_OR_APPEND instead.
[java]
public boolean openIndex() {
    try {
        Directory dir = FSDirectory.open(new File(indexPath));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
        // Always overwrite the directory
        iwc.setOpenMode(OpenMode.CREATE);
        indexWriter = new IndexWriter(dir, iwc);
        return true;
    } catch (Exception e) {
        System.err.println("Error opening the index. " + e.getMessage());
    }
    return false;
}
[/java]
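To make the analyzer's job more concrete, here is a minimal plain-Java sketch of the kind of processing described above: lowercasing, splitting on punctuation and whitespace, and dropping stop words. It is not Lucene's StandardAnalyzer, which uses a more sophisticated tokenizer; the class name ToyAnalyzer and its short stop-word list are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyAnalyzer {
    // A tiny, hypothetical stop-word list; Lucene ships its own default set.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "is", "of"));

    /** Lowercase the text, split on anything that is not a letter or digit, drop stop words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, apple, banana]
        System.out.println(analyze("The quick brown fox, an apple and a banana."));
    }
}
```

The terms that survive this step are what ends up in the index, which is why the choice of analyzer affects what a later query can match.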
Add documents to the index
I have created a generic function to add documents to the index. For each data type I am using a different field implementation; for instance, all the String fields (in my case only name is a String field) are indexed using the StringField class.
Each field type takes 3 parameters: the name of the field, the value of the field, and whether or not to store the field.
[java]
/**
 * Add documents to the index
 */
public void addDocuments(JSONArray jsonObjects) {
    for (JSONObject object : (List<JSONObject>) jsonObjects) {
        Document doc = new Document();
        for (String field : (Set<String>) object.keySet()) {
            // Pick the Lucene field class from the Java type json-simple produced
            Class type = object.get(field).getClass();
            if (type.equals(String.class)) {
                doc.add(new StringField(field, (String) object.get(field), Field.Store.NO));
            } else if (type.equals(Long.class)) {
                doc.add(new LongField(field, (long) object.get(field), Field.Store.YES));
            } else if (type.equals(Double.class)) {
                doc.add(new DoubleField(field, (double) object.get(field), Field.Store.YES));
            } else if (type.equals(Boolean.class)) {
                // Booleans are indexed as the strings "true" / "false"
                doc.add(new StringField(field, object.get(field).toString(), Field.Store.YES));
            }
        }
        try {
            indexWriter.addDocument(doc);
        } catch (IOException ex) {
            System.err.println("Error adding documents to the index. " + ex.getMessage());
        }
    }
}
[/java]
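The type dispatch in addDocuments can also be tried out without Lucene on the classpath. The sketch below is a hypothetical helper (FieldTypeSketch is not part of the post's code) that only reports which field class the rules above would pick for a value. It assumes json-simple returns Long for integers and Double for decimals, and it additionally skips null values, on which object.get(field).getClass() would throw a NullPointerException.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldTypeSketch {
    /** Name the field class that the post's rules would choose for one JSON value. */
    static String fieldFor(String name, Object value) {
        if (value == null) {
            return "skip " + name;            // JSON null: nothing to index
        } else if (value instanceof String) {
            return "StringField " + name;     // indexed as one exact term
        } else if (value instanceof Long) {
            return "LongField " + name;
        } else if (value instanceof Double) {
            return "DoubleField " + name;
        } else if (value instanceof Boolean) {
            return "StringField " + name;     // indexed as "true" / "false"
        }
        return "skip " + name;                // nested objects and arrays are not handled here
    }

    /** Apply fieldFor to every key of one parsed JSON object. */
    static List<String> plan(Map<String, Object> jsonObject) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Object> entry : jsonObject.entrySet()) {
            result.add(fieldFor(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the first object of test.json, plus a hypothetical null value
        Map<String, Object> object = new LinkedHashMap<>();
        object.put("studyId", 1L);
        object.put("name", "id1");
        object.put("lon", 204.0);
        object.put("stored", true);
        object.put("note", null);
        System.out.println(plan(object));
    }
}
```

A plain Map is used as the stand-in because json-simple's JSONObject is itself a Map; values of any other type fall through to the final "skip" branch instead of being silently ignored.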
Finally, you only need to commit the changes and close the index.
[java]
/**
 * Commit the pending documents to the index and close it
 */
public void finish() {
    try {
        indexWriter.commit();
        indexWriter.close();
    } catch (IOException ex) {
        System.err.println("We had a problem closing the index: " + ex.getMessage());
    }
}
[/java]
Testing
To test the Lucene writer class, I have created a couple of JUnit tests:
– testWriteIndex: creates the index in the folder “indexDir” from the data in a JSON file. Once the index is created, it checks that the number of indexed documents is correct and prints each document.
– testQueryLucene: creates a term query, runs it and checks that the number of documents retrieved is correct.
[java]
public class LuceneIndexWriterTest {
    static final String INDEX_PATH = "indexDir";
    static final String JSON_FILE_PATH = "/test.json";

    @Test
    public void testWriteIndex() {
        try {
            LuceneIndexWriter lw = new LuceneIndexWriter(INDEX_PATH, JSON_FILE_PATH);
            lw.createIndex();
            // Check the index has been created successfully
            Directory indexDirectory = FSDirectory.open(new File(INDEX_PATH));
            IndexReader indexReader = DirectoryReader.open(indexDirectory);
            int numDocs = indexReader.numDocs();
            // JUnit expects the expected value first
            assertEquals(3, numDocs);
            // Print each document; only fields added with Field.Store.YES are shown
            for (int i = 0; i < numDocs; i++) {
                Document document = indexReader.document(i);
                System.out.println("d=" + document);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Test
    public void testQueryLucene() throws IOException, ParseException {
        Directory indexDirectory = FSDirectory.open(new File(INDEX_PATH));
        IndexReader indexReader = DirectoryReader.open(indexDirectory);
        final IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Exact-match query on the "name" field
        Term t = new Term("name", "id2");
        Query query = new TermQuery(t);
        TopDocs topDocs = indexSearcher.search(query, 10);
        assertEquals(1, topDocs.totalHits);
    }
}
[/java]
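A term query looks up one exact term in one field, which is why searching name for id2 returns exactly one document. The following plain-Java sketch mimics that lookup with a toy inverted index. It is only an illustration of the idea, not how Lucene stores its index; ToyInvertedIndex, its methods and the free-text interest field are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ToyInvertedIndex {
    // Maps "field:term" to the IDs of the documents that contain that exact term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Index the whole value as one term, similar in spirit to StringField. */
    void addExact(int docId, String field, String value) {
        postings.computeIfAbsent(field + ":" + value, k -> new TreeSet<>()).add(docId);
    }

    /** Index each whitespace-separated word as its own term, as a tokenized text field would. */
    void addTokenized(int docId, String field, String value) {
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                addExact(docId, field, word);
            }
        }
    }

    /** Return the IDs of the documents containing exactly this term in this field. */
    Set<Integer> termQuery(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptySet());
    }

    /** Three sample documents: the names from test.json plus a made-up interest field. */
    static ToyInvertedIndex sample() {
        ToyInvertedIndex index = new ToyInvertedIndex();
        String[] names = {"id1", "id2", "id3"};
        String[] interests = {"apple banana", "banana", "apple"};
        for (int docId = 0; docId < names.length; docId++) {
            index.addExact(docId, "name", names[docId]);
            index.addTokenized(docId, "interest", interests[docId]);
        }
        return index;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = sample();
        System.out.println(index.termQuery("name", "id2"));        // prints [1]
        System.out.println(index.termQuery("interest", "apple"));  // prints [0, 2]
        System.out.println(index.termQuery("name", "id"));         // prints []
    }
}
```

The last lookup returns nothing because a term query does no partial matching: a value indexed as a single term only matches when the query term is identical to it.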
Great Article.
I’m actually looking around to see if it’s possible with Lucene’s core API to search within a field which holds a deserialized JSON document. Any ideas?
Thanks for your work!
Hello, there may be one problem. In the parseJSONFile function of your LuceneIndexWriter.java, test.json cannot be read in my case.
I changed your code following the solution at http://stackoverflow.com/questions/10926353/how-to-read-json-file-into-java-with-simple-json-library. The change, in brief, is the following:
JSONParser parser = new JSONParser();
JSONArray arrayObjects = (JSONArray) parser.parse(new FileReader(jsonFilePath));
for (Object o : arrayObjects)
{
    JSONObject person = (JSONObject) o;
    String name = (String) person.get("name");
    System.out.println(name);
    Double city = (Double) person.get("lat");
    System.out.println(city);
}
return arrayObjects;
And I have another question.
My goal is to find how many JSON objects contain the query text as part of the value of one of their keys. For example, with a test.json like the following:
[
  {
    "studyId": 1,
    "name": "id1",
    "lon": 204.0,
    "lat": 101.0,
    "interest": "apple banana",
    "stored": true
  },
  {
    "studyId": 2,
    "name": "id2",
    "lon": 204.0,
    "lat": 101.0,
    "interest": "banana",
    "stored": true
  },
  {
    "studyId": 3,
    "name": "id3",
    "lon": 204.0,
    "lat": 101.0,
    "interest": "apple",
    "stored": false
  }
]
Then I want to find how many people have apple among their interests.
Do you have any suggestions?
Hi, how do we deal with nested JSON objects?
Hello, I have parsed a JSON file but I don’t know how to index it using Lucene.