Spark NLP by John Snow Labs
What is Spark NLP?
Spark NLP is a text processing library built on top of Apache Spark and its Spark ML library. It provides simple, performant, and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.
Those are the eye-catching phrases that got my attention the first time I read an article on Databricks introducing Spark NLP, about a year ago. I love Apache Spark, and I learned Scala (and am still learning it) just for that purpose. Back then I wrote my own Stanford CoreNLP wrapper for Apache Spark, since I wanted to stay in the Scala ecosystem and avoid Python libraries such as spaCy and NLTK.
However, I faced many issues since I was dealing with large-scale datasets, and I couldn’t seamlessly integrate my NLP code into Spark ML pipelines. I can sum up my problems by quoting some parts of the same blog post:
Any integration between the two frameworks (Spark and another library) means that every object has to be serialized, go through inter-process communication in both ways, and copied at least twice in memory.
We see the same issue when using spaCy with Spark: Spark is highly optimized for loading & transforming data, but running an NLP pipeline requires copying all the data outside the
Tungsten optimized format, serializing it, pushing it to a Python process, running the NLP pipeline (this bit is lightning fast), and then re-serializing the results back to the JVM process.
This naturally kills any performance benefits you would get from Spark’s caching or execution planner, requires at least twice the memory, and doesn’t improve with scaling. Using CoreNLP eliminates the copying to another process, but still requires copying all text from the data frames and copying the results back in.
So I was really excited when I saw there was an NLP library built on top of Apache Spark and it natively extends the Spark ML Pipeline. I could finally build NLP pipelines in Apache Spark!
Spark NLP is open source and has been released under the Apache 2.0 license. It is written in Scala but it supports Java and Python as well. It has no dependencies on other NLP or ML libraries. Spark NLP’s annotators provide rule-based algorithms, machine learning, and deep learning by using TensorFlow. For a more detailed comparison between Spark NLP and other open source NLP libraries, you can read this blog post.
As a native extension of the Spark ML API, the library offers the ability to train, customize, and save models so they can run on a cluster or on other machines, or be saved for later use. It is also easy to extend and customize models and pipelines, as we’ll do here.
The library covers many NLP tasks, such as tokenization, stemming, lemmatization, part-of-speech tagging, named entity recognition, and spell checking.
For the full list of annotators, models, and pipelines you can read their online documentation.
Full disclosure: I am one of the contributors to the Spark NLP project.
Installing Spark NLP
- Spark NLP 2.0.3 release
- Apache Spark 2.4.1
- Apache Zeppelin release 0.8.2
- Local setup with MacBook Pro/macOS
- Cluster setup by Cloudera/CDH 6.2 with 40 servers
- Programming language: Scala (but no worries, the Python APIs in Spark and Spark NLP are very similar to their Scala counterparts)
I will explain how to set up Spark NLP for my environment. Nevertheless, if you wish to try something different, you can always find out more about how to use Spark NLP by visiting the main public repository or by having a look at their showcase repository with lots of examples.
Let’s get started! To use Spark NLP in Apache Zeppelin you have two options. Either use Spark Packages or you can build a Fat JAR yourself and just load it as an external JAR inside Spark session. Why don’t I show you both?
First, with Spark Package:
1. Either add this to your conf/zeppelin-env.sh:

# set options to pass to the spark-submit command
export SPARK_SUBMIT_OPTIONS="--packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.3"
2. Or, add it to the Generic Inline ConfInterpreter (at the beginning of your notebook, before starting your Spark session):

# spark.jars.packages can be used for adding packages into the Spark interpreter
spark.jars.packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.3
Second, loading an external JAR:
To build a Fat JAR all you need to do is:
$ git clone https://github.com/JohnSnowLabs/spark-nlp
$ cd spark-nlp
$ sbt assembly
Then you can follow either of the two ways I mentioned to add this external JAR. You just need to change "--packages" to "--jars" in the first option, or use "spark.jars" instead of "spark.jars.packages" in the second.
Start Spark with Spark NLP
Now we can start using Spark NLP 2.0.3 with Zeppelin 0.8.2 and Spark 2.4.1 by importing Spark NLP annotators:
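The imports typically look something like this (a sketch for the Spark NLP 2.0.x packages; the exact paths may vary between versions):

```scala
// Pre-trained pipelines plus the base and annotator classes
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.base._
```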
Apache Zeppelin is going to start a new Spark session that comes with Spark NLP, regardless of whether you used a Spark package or an external JAR.
Read the Mueller Report PDF file
Remember the issue about the PDF file not being a real PDF? Well, we have 3 options here:
- You can either use any OCR tools/libraries you prefer to generate a PDF or a Text file.
- Or you can use already searchable and selectable PDF files created by the community.
- Or you can just use Spark NLP!
Spark NLP comes with an OCR package that can read both PDF files and scanned images. However, I mixed option 2 with option 3. (I would have needed to install Tesseract 4.x+ for image-based OCR on my entire cluster, so I got a bit lazy.)
You can download these two PDF files from Scribd:
Of course, you can just download the Text version and read it by Spark. However, I would like to show you how to use the OCR that comes with Spark NLP.
Spark NLP OCR:
Let’s create a helper for everything related to OCR:

import com.johnsnowlabs.nlp.util.io.OcrHelper

val ocrHelper = new OcrHelper()
Now we need to read the PDF and create a Dataset from its content. The OCR in Spark NLP creates one row per page:
// If you run this locally you can use file:///, or hdfs:/// if the files are hosted in Hadoop
val muellerFirstVol = ocrHelper.createDataset(spark, "/tmp/Mueller/Mueller-Report-Redacted-Vol-I-Released-04.18.2019-Word-Searchable.-Reduced-Size.pdf")
As you can see, I’m loading Volume I of the report, in PDF format, into a Dataset. I do this locally just to show that you don’t always need a cluster to use Apache Spark and Spark NLP!
TIP 1: If the PDF were actually a scanned image, we could have used these settings (not needed in our use case, since we found a selectable PDF):
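For reference, image-based OCR settings could look roughly like this. The setter names below are my assumption about the 2.0.x OcrHelper API, so verify them against your version; Tesseract must also be installed:

```scala
// Prefer OCR over the PDF's text layer, and fall back to the text layer
// when a page produces too little output (assumed setter names)
ocrHelper.setPreferredMethod("image")
ocrHelper.setFallbackMethod(true)
ocrHelper.setMinSizeBeforeFallback(10)
```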
TIP 2: You can simply convert Spark Dataset into DataFrame if needed by:
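Since createDataset returns a Dataset, TIP 2 boils down to a one-liner using Spark's standard .toDF() conversion (the variable name below is mine):

```scala
// Convert the OCR Dataset into a DataFrame
val muellerFirstVolDF = muellerFirstVol.toDF()
```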
Spark NLP Pipelines and Models
NLP by Machine Learning and Deep Learning
Now it’s time to do some NLP tasks. As I mentioned at the beginning, we would like to use already pre-trained pipelines and models provided by Spark NLP in Part I. These are some of the pipelines and models that are available:
However, I would like to use a pipeline called “explain_document_dl” first. Let’s see how we can download this pipeline, use it to annotate some input, and see what exactly it offers:
val pipeline = PretrainedPipeline("explain_document_dl", "en")
// This DataFrame has one sentence for testing
val testData = Seq(
  "Donald Trump is the 45th President of the United States"
).toDS.toDF("text")

// Let's use our pre-trained pipeline to predict the test dataset
val annotation = pipeline.transform(testData)
annotation.show()
Here is the result of .show():
I know! There is a lot going on in this pipeline. Let’s start with the NLP annotators we have in the “explain_document_dl” pipeline:
- DocumentAssembler
- SentenceDetector
- Tokenizer
- NorvigSweeting spell checker
- Lemmatizer
- Stemmer
- Perceptron POS tagger
- WordEmbeddings (GloVe 6B 100)
- NerDLModel
- NerConverter (chunking)
To my knowledge, some annotators inside this pipeline use deep learning powered by TensorFlow for their supervised learning. For instance, you will notice these lines when loading the pipeline:
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_dl,en,public/models)
loading to tensorflow
For simplicity, I’ll select a bunch of columns separately so we can actually see some results:
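For example, something like the following. The column names are assumptions based on the pipeline's annotator names, and "annotated" is my own variable name:

```scala
// Run the pipeline and keep only a few result columns so the output is readable
val annotated = pipeline.transform(testData)
annotated
  .select("token.result", "lemma.result", "pos.result", "ner.result")
  .show(truncate = false)
```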
So this is a very complete NLP pipeline. It covers many of the NLP tasks other libraries offer, and even more, such as spell checking. But it might be a bit heavy if you are just looking for one or two NLP tasks such as POS tagging or NER.
Let’s try another pre-trained pipeline called “entity_recognizer_dl”:
val pipeline = PretrainedPipeline("entity_recognizer_dl", "en")
val testData = Seq(
"Donald Trump is the 45th President of the United States" ).toDS.toDF("text")
// Let's use our pre-trained pipeline to predict the test dataset
val annotation = pipeline.transform(testData)
As you can see, using a pre-trained pipeline is very easy. You just change its name, and Spark NLP will download and cache it locally. What is inside this pipeline?
- DocumentAssembler
- SentenceDetector
- Tokenizer
- WordEmbeddings (GloVe 6B 100)
- NerDLModel
- NerConverter (NER chunk)
Let’s walk through what is happening with the NER model in both of these pipelines. Named entity recognition (NER) uses word embeddings (GloVe or BERT) for training. I can quote one of the main maintainers of the project on what it is:
NerDLModel is the result of a training process, originated by the NerDLApproach SparkML estimator. This estimator is a TensorFlow DL model. It follows a Bi-LSTM with Convolutional Neural Networks scheme, utilizing word embeddings for token and sub-token analysis.
You can read this full article about the use of TensorFlow graphs and how Spark NLP uses it to train its NER models:
Back to our pipeline: the NER chunk will extract chunks of named entities. For instance, if you have Donald -> I-PER and Trump -> I-PER, it will result in Donald Trump. Take a look at this example:
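Here is a minimal, pure-Scala sketch of the chunking idea: consecutive tokens tagged with the same entity type are merged into one chunk. This is only an illustration of the concept, not Spark NLP's actual NerConverter implementation:

```scala
// Merge IOB-tagged tokens into (chunk, entityType) pairs.
// "O" tokens are outside any entity; "B-" starts a new chunk even if
// the previous token had the same entity type.
def mergeIobChunks(tokens: Seq[(String, String)]): Seq[(String, String)] = {
  tokens.foldLeft(List.empty[(String, String)]) {
    case (acc, (_, "O")) => acc // outside any entity, skip
    case (acc, (word, tag)) =>
      val entity = tag.drop(2) // strip the "I-" / "B-" prefix
      acc match {
        // extend the previous chunk when the entity type continues
        case (chunk, prev) :: rest if prev == entity && !tag.startsWith("B-") =>
          (s"$chunk $word", entity) :: rest
        case _ => (word, entity) :: acc
      }
  }.reverse
}

val tags = Seq(
  ("Donald", "I-PER"), ("Trump", "I-PER"), ("is", "O"),
  ("President", "O"), ("of", "O"), ("the", "O"),
  ("United", "I-LOC"), ("States", "I-LOC")
)
println(mergeIobChunks(tags)) // List((Donald Trump,PER), (United States,LOC))
```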
Personally, I would prefer to build my own NLP pipelines when I am dealing with pre-trained models. This way, I have full control over what types of annotators I want to use, whether I want ML or DL models, use my own trained models in the mix, customize the inputs/outputs of each annotator, integrate Spark ML functions, and so much more!
Is it possible to create your own NLP pipeline but still take advantage of pre-trained models?
The answer is yes! Let’s look at one example:
val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val token = new Tokenizer().setInputCols("sentence").setOutputCol("token")
val normalized = new Normalizer().setInputCols("token").setOutputCol("normalized")
val pos = PerceptronModel.pretrained().setInputCols("sentence", "normalized").setOutputCol("pos")
val chunker = new Chunker().setInputCols("sentence", "pos").setOutputCol("chunk")
val embeddings = WordEmbeddingsModel.pretrained().setInputCols("document", "normalized").setOutputCol("embeddings")
val ner = NerDLModel.pretrained()
  .setInputCols("document", "normalized", "embeddings")
  .setOutputCol("ner")
val nerConverter = new NerConverter()
  .setInputCols("document", "token", "ner")
  .setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
  document, sentence, token, normalized, pos, chunker, embeddings, ner, nerConverter))
That’s it! Pretty easy, and Sparky. The important part is that you can set which inputs you want for each annotator. For instance, for POS tagging I can use tokens, stemmed tokens, lemmatized tokens, or normalized tokens, and this can change the annotators' results. The same goes for the NerDLModel. I chose normalized tokens for both the POS and NER models, since I’m guessing my dataset is a bit messy and requires some cleaning.
Let’s use our customized pipeline. If you know Spark ML pipelines, they have two stages: fitting, which is where you train the models inside your pipeline, and transforming, which is where you predict on new data and produce a new DataFrame.
val nlpPipelineModel = pipeline.fit(muellerFirstVol)
val nlpPipelinePrediction = nlpPipelineModel.transform(muellerFirstVol)
The .fit() is just a formality here, as everything already comes pre-trained; we don’t have to train anything, so .transform() is where the models inside our pipeline create a new DataFrame with all the predictions. If we did have our own models, or Spark ML functions that required training, then .fit() would take some time to train them.
On a local machine, this took about 3 seconds to run. My laptop has a Core i9, 32 GB of memory, and a Vega 20 (if that matters at all), so it is a pretty good machine.
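Before scaling out, we can peek at what the custom pipeline extracted. This is a sketch: "ner_chunk" is my assumed name for the NerConverter's output column, so match it to whatever you set in your own pipeline:

```scala
// Show the first named-entity chunks extracted from the Mueller report
// ("ner_chunk" is the assumed NerConverter output column name)
nlpPipelinePrediction
  .select("ner_chunk.result")
  .show(10, truncate = false)
```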
This example is nowhere near a Big Data scenario where you are dealing with millions of records, sentences, or words. In fact, it’s not even small data. However, we are using Apache Spark for a reason! Let’s run this in a cluster where we can distribute our tasks.