Getting started with Apache Lucene and JSON indexing

by Ignacio Suay · Published July 22, 2014 · Updated December 7, 2016

In this post, I am going to talk about how to index JavaScript Object Notation (JSON) using Lucene Core. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. Many companies, such as LinkedIn and Twitter, use Lucene for real-time search and faceted search.

You can find all the code used in this post on GitHub. To process a JSON file and index it, we need to:

Parse a JSON file
Create an index
Add documents to the index
Commit the changes and close the index

Parse a JSON file

I have created a test file which contains three JSON objects with different data types (long, String, double and boolean):
[xml]
[
    {
        "studyId": 1,
        "name": "id1",
        "lon": 204.0,
        "lat": 101.0,
        "stored": true
    },
    {
        "studyId": 2,
        "name": "id2",
        "lon": 204.0,
        "lat": 101.0,
        "stored": true
    },
    {
        "studyId": 3,
        "name": "id3",
        "lon": 204.0,
        "lat": 101.0,
        "stored": false
    }
]
[/xml]

To process the JSON file, I am using the json-simple library. With json-simple, decoding a JSON file is straightforward: you only need to call JSONValue.parse(reader), passing a Reader over your file.

[java]
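/**
 * Parse a JSON file. The file path is the one passed to the constructor.
 */
public JSONArray parseJSONFile() {

    // Get the JSON file from the classpath; in this case it is in ~/resources/test.json
    InputStream jsonFile = getClass().getResourceAsStream(jsonFilePath);
    if (jsonFile == null) {
        // getResourceAsStream returns null when the resource is not found
        throw new IllegalArgumentException("JSON file not found on the classpath: " + jsonFilePath);
    }
    Reader readerJson = new InputStreamReader(jsonFile, StandardCharsets.UTF_8);

    // Parse the JSON file using the json-simple library
    Object fileObjects = JSONValue.parse(readerJson);
    JSONArray arrayObjects = (JSONArray) fileObjects;

    return arrayObjects;
}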
[/java]
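
If test.json cannot be found, getClass().getResourceAsStream(...) returns null instead of throwing, so it is worth checking the result. The following is a minimal sketch, using only the JDK, of a loader that tries the classpath first and then falls back to a plain file path. The class name JsonSource and the fallback behaviour are assumptions made for illustration; they are not part of the original code.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: opens a JSON source as a UTF-8 Reader.
class JsonSource {

    // Tries the classpath first (e.g. "/test.json"), then a plain file path.
    static Reader open(String location) throws IOException {
        InputStream in = JsonSource.class.getResourceAsStream(location);
        if (in == null) {
            // Not a classpath resource: fall back to the file system (assumption of this sketch)
            Path path = Paths.get(location);
            if (!Files.isRegularFile(path)) {
                throw new FileNotFoundException("No classpath resource or file at: " + location);
            }
            in = Files.newInputStream(path);
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }
}
```

The Reader returned by JsonSource.open(jsonFilePath) could then be passed to JSONValue.parse(...) exactly as before.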

Create an index

I have created a class called LuceneIndexWriter which takes a directory path for the index and the path of a file containing a number of JSON objects. This class is in charge of creating the index, adding all the JSON objects to it and, finally, closing it.

[java]
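import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.JSONValue;

public class LuceneIndexWriter {

    String indexPath;
    String jsonFilePath;
    IndexWriter indexWriter = null;

    public LuceneIndexWriter(String indexPath, String jsonFilePath) {
        this.indexPath = indexPath;
        this.jsonFilePath = jsonFilePath;
    }

    // Parse the JSON, open the index, add every object as a document, then commit and close
    public void createIndex() {
        JSONArray jsonObjects = parseJSONFile();
        openIndex();
        addDocuments(jsonObjects);
        finish();
    }
…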
[/java]

Next we need to create the index. I have used a StandardAnalyzer. An Analyzer is in charge of processing the text of a field and turning it into the terms that get indexed. I am using a StandardAnalyzer because, in the Lucene version used here (4.8), it removes stop words (for example a, an, the, …) and punctuation, but you could use a different analyzer such as WhitespaceAnalyzer, SimpleAnalyzer or StopAnalyzer.
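
To make the idea of an analyzer more concrete, here is a toy sketch in plain Java of what analysis roughly does: split the text into tokens, lowercase them and drop stop words. This is not Lucene's implementation (StandardAnalyzer uses a proper Unicode tokenizer); the class name ToyAnalyzer and its short stop-word list are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only; Lucene's real analyzers are far more thorough.
class ToyAnalyzer {

    // A tiny, made-up stop-word list for the example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "of"));

    // Splits on anything that is not a letter or digit, lowercases, drops stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The apple, and a banana!")); // prints [apple, banana]
    }
}
```

Analysis only applies to tokenized fields; values added with StringField, as in the code further down, are indexed verbatim as a single term.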

In this case, I am opening the index using OpenMode.CREATE, which means that the index will be overwritten each time we run this function. If you want to create the index only the first time and append to it on later runs, you should use OpenMode.CREATE_OR_APPEND instead.

[java]
public boolean openIndex() {
    try {
        Directory dir = FSDirectory.open(new File(indexPath));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);

        // Always overwrite the directory
        iwc.setOpenMode(OpenMode.CREATE);
        indexWriter = new IndexWriter(dir, iwc);

        return true;
    } catch (Exception e) {
        System.err.println("Error opening the index. " + e.getMessage());
    }
    return false;
}
[/java]

Add documents to the index

I have created a generic function to add documents to the index. For each type of data I am using a different field class; for instance, all the String values (in my case only name is a String) are indexed using the StringField class.

Each field class takes three parameters: the name of the field, the value of the field, and a flag (Field.Store.YES or Field.Store.NO) that says whether or not to store the original value in the index.

[java]
/**
 * Add documents to the index
 */
@SuppressWarnings("unchecked")
public void addDocuments(JSONArray jsonObjects) {
    for (JSONObject object : (List<JSONObject>) jsonObjects) {
        Document doc = new Document();
        for (String field : (Set<String>) object.keySet()) {
            Object value = object.get(field);
            if (value == null) {
                continue; // JSON null: nothing to index for this field
            }
            // Pick the Lucene field class that matches the JSON value type
            Class<?> type = value.getClass();
            if (type.equals(String.class)) {
                doc.add(new StringField(field, (String) value, Field.Store.NO));
            } else if (type.equals(Long.class)) {
                doc.add(new LongField(field, (Long) value, Field.Store.YES));
            } else if (type.equals(Double.class)) {
                doc.add(new DoubleField(field, (Double) value, Field.Store.YES));
            } else if (type.equals(Boolean.class)) {
                // Booleans are indexed as the strings "true" / "false"
                doc.add(new StringField(field, value.toString(), Field.Store.YES));
            }
        }
        try {
            indexWriter.addDocument(doc);
        } catch (IOException ex) {
            System.err.println("Error adding documents to the index. " + ex.getMessage());
        }
    }
}
[/java]
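
The type dispatch above can be tried out without Lucene. The sketch below, in plain Java, maps a parsed value to the kind of field it would become; json-simple yields Long for integers, Double for decimals, Boolean for true/false and String for text. The class FieldKinds and its Kind enum are hypothetical names for illustration, and reporting anything else (null, nested objects, arrays) as UNSUPPORTED is an assumption of this sketch, since the code in this post only deals with the four simple types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical names, for illustration only.
class FieldKinds {

    enum Kind { STRING, LONG, DOUBLE, BOOLEAN_AS_STRING, UNSUPPORTED }

    // Decides how a parsed JSON value would be indexed.
    static Kind of(Object value) {
        if (value instanceof String)  return Kind.STRING;            // would become a StringField
        if (value instanceof Long)    return Kind.LONG;              // would become a LongField
        if (value instanceof Double)  return Kind.DOUBLE;            // would become a DoubleField
        if (value instanceof Boolean) return Kind.BOOLEAN_AS_STRING; // StringField holding "true"/"false"
        return Kind.UNSUPPORTED; // null, nested objects, arrays: an assumption of this sketch
    }

    public static void main(String[] args) {
        // Same shape as the first object in test.json
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("studyId", 1L);
        doc.put("name", "id1");
        doc.put("lon", 204.0);
        doc.put("stored", true);
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            System.out.println(e.getKey() + " -> " + of(e.getValue()));
        }
    }
}
```

Making the unsupported case explicit gives one place to decide what to do with values the indexer does not understand, for example logging them instead of dropping them silently.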

Finally, you only need to commit the changes and close the index.

[java]
/**
 * Write the documents to the index and close it
 */
public void finish() {
    try {
        indexWriter.commit();
        indexWriter.close();
    } catch (IOException ex) {
        System.err.println("We had a problem closing the index: " + ex.getMessage());
    }
}
[/java]

Testing

To test the Lucene writer class, I have created a couple of JUnit tests:

– testWriteIndex: This test creates the index in the folder “indexDir”, reading the data from a JSON file. Once the index is created, it checks that the number of indexed documents is correct and prints each document.

– testQueryLucene: This test creates a term query, runs it, and checks that the number of documents retrieved is correct.

[java]
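import static org.junit.Assert.assertEquals;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

public class LuceneIndexWriterTest {

    static final String INDEX_PATH = "indexDir";
    static final String JSON_FILE_PATH = "/test.json";

    @Test
    public void testWriteIndex() throws IOException {
        LuceneIndexWriter lw = new LuceneIndexWriter(INDEX_PATH, JSON_FILE_PATH);
        lw.createIndex();

        // Check the index has been created successfully
        Directory indexDirectory = FSDirectory.open(new File(INDEX_PATH));
        IndexReader indexReader = DirectoryReader.open(indexDirectory);

        int numDocs = indexReader.numDocs();
        assertEquals(3, numDocs);

        for (int i = 0; i < numDocs; i++) {
            Document document = indexReader.document(i);
            System.out.println("d=" + document);
        }
        indexReader.close();
    }

    // Relies on the index written by testWriteIndex already being on disk
    @Test
    public void testQueryLucene() throws IOException {
        Directory indexDirectory = FSDirectory.open(new File(INDEX_PATH));
        IndexReader indexReader = DirectoryReader.open(indexDirectory);
        final IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // Exact-match query on the "name" field
        Term t = new Term("name", "id2");
        Query query = new TermQuery(t);

        TopDocs topDocs = indexSearcher.search(query, 10);
        assertEquals(1, topDocs.totalHits);
        indexReader.close();
    }
}
[/java]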
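
A term query looks up one exact term in one field. The toy inverted index below, in plain Java, sketches that idea with the three names from test.json. It is only an illustration under simplifying assumptions (the class ToyIndex is hypothetical, and this is not how Lucene stores its index on disk).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps "field:term" to the ids of the documents containing it.
class ToyIndex {

    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexes the value as a single, unmodified term (similar in spirit to StringField).
    void add(int docId, String field, String value) {
        postings.computeIfAbsent(field + ":" + value, k -> new TreeSet<>()).add(docId);
    }

    // Exact-match lookup, the toy equivalent of a term query.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(0, "name", "id1");
        index.add(1, "name", "id2");
        index.add(2, "name", "id3");
        System.out.println(index.search("name", "id2")); // prints [1]
        System.out.println(index.search("name", "ID2")); // prints [] (exact match, case-sensitive)
    }
}
```

Because the lookup is an exact match on the whole term, a value indexed as one untokenized string is only found when the query term is identical to it.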

Tags: json lucene search

Rui Santos says:

January 11, 2016 at 10:20 am

Great Article.
I’m actually looking around to see if it’s possible with Lucene’s core API to search within a field which holds a deserialized JSON document. Any ideas?

Carol says:

February 11, 2016 at 11:55 am

Thanks for your work!

Carol says:

February 11, 2016 at 1:35 pm

Hello, there may be one problem. In your LuceneIndexWriter.java, parseJsonFile function, test.json can not be read in my case.
I changed your code following the solution of http://stackoverflow.com/questions/10926353/how-to-read-json-file-into-java-with-simple-json-library. Following is a brief change.

JSONParser parser = new JSONParser();
JSONArray arrayObjects = (JSONArray) parser.parse(new FileReader(jsonFilePath));
for (Object o : arrayObjects)
{
    JSONObject person = (JSONObject) o;

    String name = (String) person.get("name");
    System.out.println(name);

    Double city = (Double) person.get("lat");
    System.out.println(city);

}
return arrayObjects;

Carol says:

February 11, 2016 at 1:58 pm

And I have another question.

My goal is to find how many json objects contain query text as part value of their key, namely, test.json like follows,
[
    {
        "studyId": 1,
        "name": "id1",
        "lon": 204.0,
        "lat": 101.0,
        "interest": "apple banana",
        "stored": true
    },
    {
        "studyId": 2,
        "name": "id2",
        "lon": 204.0,
        "lat": 101.0,
        "interest": "banana",
        "stored": true
    },
    {
        "studyId": 3,
        "name": "id3",
        "lon": 204.0,
        "lat": 101.0,
        "interest": "apple",
        "stored": false
    }
]

Then I wanna search how many people’s interest is apple.

Do you have any suggestion?

Anuj says:

May 12, 2017 at 3:30 pm

Hi, How do we deal with nest json objects?

- Anuj says:
  
  May 12, 2017 at 3:30 pm
  
  nested*
  
annacyli says:

June 1, 2017 at 1:55 pm

Hello, i have parsed a JSON file but i don’t know how to index it

annacyli says:

June 1, 2017 at 1:55 pm

Hello, i have parsed a JSON file but i don’t know how to index it using lucene

Indexing Multilevel JSON Objects in Lucene - JavaTechji

December 30, 2019

[…] I am new to Lucene. I have worked on Lucene search using field value pairs in documents. Now there is a requirement to parse some JSON files and Index them up for Lucene search. I have an idea on working with simple form of JSON file according to this article. […]
