Integrating NER with Solr.
- February 18, 2016
- Named Entity Recognition
What is Solr?
Solr is a search platform built on Lucene and supported by the Apache Software Foundation.
It is used mainly to provide very fast searches by indexing the content that needs to be searched.
Solr is highly modular by design, and its functionality can be extended by adding Solr plugins.
We used one such plugin for our problem, in which we had to identify person names in PDF documents (this is where NER comes in).
What is NER?
Named Entity Recognition (NER) labels sequences of words in a text that are the names of things, such as person and company names, or gene and protein names. In our particular case, we were interested in finding only the person names in a given text.
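To make the idea of "labelling sequences of words" concrete, here is a minimal sketch in plain Java. It assumes a tagger has already produced one label per token, using PERSON for person names and O for everything else (the label names follow the convention of Stanford's models; the class and method names are made up for illustration). It only shows how consecutive PERSON tokens can be joined into full names and is not the code from our plugin.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups consecutive PERSON-labelled tokens into names.
final class PersonNameGrouper {

    // tokens[i] is a word, labels[i] is the (assumed) tagger label for that word.
    static List<String> personNames(String[] tokens, String[] labels) {
        List<String> names = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if ("PERSON".equals(labels[i])) {
                // Still inside a name: append the token to the running name.
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(tokens[i]);
            } else if (current.length() > 0) {
                // A non-PERSON token ends the current name.
                names.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush a name that runs to the end of the text.
        if (current.length() > 0) {
            names.add(current.toString());
        }
        return names;
    }

    public static void main(String[] args) {
        String[] tokens = {"Barack", "Obama", "met", "Angela", "Merkel", "in", "Berlin"};
        String[] labels = {"PERSON", "PERSON", "O", "PERSON", "PERSON", "O", "LOCATION"};
        System.out.println(personNames(tokens, labels)); // prints [Barack Obama, Angela Merkel]
    }
}
```

Note that this simple grouping would merge two different people if their names appeared back to back with nothing in between, so treat it as an illustration rather than a complete solution.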
We looked around for NER implementations that had already been built.
We discovered that the Stanford Natural Language Processing Group has developed a Java-based NLP library that includes a Named Entity Recognizer.
Our Approach:
We now had a clear picture of what we needed to solve our problem:
- The Stanford NER library.
- A way to get the text out of the PDFs.
- A Solr plugin that would process our NER request and return the person names from the text.
The Stanford NER library is freely available, and you can download it from here :
You can play around with it to get a feel for what it can do.
Getting the text out of the PDFs turned out to be straightforward as well. Apache Tika, another project from the Apache Software Foundation, does exactly the job we were looking for: we simply feed our PDFs to Tika and it returns their text content.
Now we needed a Solr plugin to do the name recognition. To build one, we headed over to this article.
It gives a good idea of what you need to do to create a Solr plugin.
The authors had even built their own NER implementation for integration with Solr.
This can be found here :
We followed their approach to build a similar plugin of our own, one that returns the person names found in a text document.
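As a rough illustration of the last step, here is a small sketch of how person names could be pulled out of tagged text. It assumes the recognizer returns the text with entities wrapped in inline tags such as <PERSON>...</PERSON> (Stanford NER can produce output in this style), and it uses a hypothetical class name and only the Java standard library. Our actual plugin code may differ; this just shows the extraction idea.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts person names from text tagged with inline <PERSON> tags.
final class InlineXmlPersonExtractor {

    // Non-greedy match so that each <PERSON>...</PERSON> pair is captured separately.
    private static final Pattern PERSON =
            Pattern.compile("<PERSON>(.+?)</PERSON>", Pattern.DOTALL);

    static List<String> extract(String taggedText) {
        // LinkedHashSet removes duplicates while keeping first-seen order.
        Set<String> names = new LinkedHashSet<>();
        Matcher matcher = PERSON.matcher(taggedText);
        while (matcher.find()) {
            // Collapse line breaks and repeated spaces inside a name.
            names.add(matcher.group(1).trim().replaceAll("\\s+", " "));
        }
        return new ArrayList<>(names);
    }
}
```

For example, calling extract on "<PERSON>John Smith</PERSON> wrote to <PERSON>Mary Jones</PERSON>." would return a list containing John Smith and Mary Jones, with repeated mentions of the same name reported only once.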
The Search box package used Maven, so it handled all the dependencies for us. We built the source, placed the generated JAR file in the bin folder of the required Solr collection, and updated the Solr config file to enable NER with Solr (this part is covered in the article).
So this is how we achieved NER using Solr and solved our problem of finding person names in a document.