Similarity-Based Clustering of Research Data:
Our efforts on the COVID-19 research corpus
May 4, 2020


The number of publications on COVID-19 has been growing exponentially, and it is very difficult for researchers to keep up with everything that is being published. We wanted to build an AI system that would make it easier for researchers and frontline medical workers to query this ever-growing body of research.

One of the first steps in building such a system is the ability to understand and segment the document corpus into “topics”. A key technique for achieving this is Topic Modeling, which enables us to cluster similar documents together. The rest of this document details our approach to implementing Topic Modeling on this corpus.

Topic Modeling is an unsupervised learning method for discovering patterns in how words occur across a corpus of documents. Latent Dirichlet Allocation (LDA) is one of the most widely used algorithms for it. In this approach, we use the ktrain package to implement an LDA model on the COVID-19 research documents, web scraped from

We will attempt to answer the following questions:

  • Can we get some meaningful clustering of the documents using LDA?
  • Can we search the corpus for content that is relevant to a query?
  • What should the next steps in this research be?

Please go through the following sections if you want to understand the LDA algorithm in detail.

Build LDA Model

The data is web scraped from We have taken 4,000 documents related to COVID-19 to validate the LDA approach.

Step 1: Import the packages and load the data

Step 2: Null-value treatment of the document column

While checking for null values, we found that Row 0 contains only NaN values. In the preprocessing step, we removed Row 0 and then checked the “text_body” column, which is what we use to build the LDA model, for remaining null values. As the following screenshot shows, many values are missing in the other columns. For the current model, we use only the “content” column, which holds the complete articles. We are not using titles at this point.
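The null-value treatment described above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not our exact notebook code: the tiny inline DataFrame stands in for the scraped corpus, the `title` column and its values are made up, and `text_body` is used as the text column name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped corpus (assumption: the real data has many more
# rows and columns). Row 0 is entirely NaN, as described in the text.
df = pd.DataFrame(
    {
        "title": [np.nan, "Paper A", np.nan, "Paper C"],
        "text_body": [
            np.nan,
            "covid 19 transmission in households",
            "incubation period estimates",
            "risk factors in pregnancy",
        ],
    }
)

# Count missing values per column to see where the gaps are.
print(df.isnull().sum())

# Drop rows in which every column is NaN (this removes Row 0).
df = df.dropna(how="all")

# Keep only rows that have article text, then renumber the rows.
df = df.dropna(subset=["text_body"]).reset_index(drop=True)

# The text column should now be complete; other columns may still have gaps.
texts = df["text_body"].tolist()
print(len(texts), "documents retained")
```

Dropping fully empty rows with `dropna(how="all")` is one way to remove Row 0 without hard-coding its index; `df.drop(index=0)` would work equally well here.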

Step 3: Cleansing the data

For the current attempt, we have removed the stop words and symbols. LDA works on the “Bag of Words” model and, hence, stop words and symbols act as a distraction more often than not.
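A minimal sketch of this cleansing step is shown below. The stop word list is deliberately tiny and is an assumption made for illustration; a real run would use a fuller list (for example the ones shipped with NLTK or scikit-learn). The helper name `clean_text` is hypothetical.

```python
import re

# Small illustrative stop word list (assumption: not the list we actually used).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "in", "is", "it", "of", "on", "or", "that", "the", "to", "was", "were",
    "with",
}


def clean_text(text: str) -> str:
    """Lowercase the text, strip symbols, and drop stop words."""
    text = text.lower()
    # Replace every character that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Split on whitespace and keep only the tokens that are not stop words.
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


sample = "The SARS-CoV-2 virus is transmitted by droplets (and aerosols)."
print(clean_text(sample))
```

Because LDA only looks at word counts, the order-destroying nature of this cleanup does not hurt the model, although it does make the retrieved passages harder to read.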

Step 4: Install the ktrain package and create an LDA model

Use pip to install ktrain. The following is the GitHub link for this package.

Once the installation is successful, we can build an LDA model as shown in the following screenshot.

Once the LDA model has been trained, we can check the topics it learnt from the distribution of words across all the documents in the corpus. These can be seen in the screenshot below.

One of the important parameters here is the “threshold”. In this example, we used a threshold value of 0.25. This means that every document whose highest topic probability is less than 0.25 is filtered out.
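To make the threshold rule concrete, here is a small sketch in plain Python. The probabilities are made-up illustrative values, and `filter_by_threshold` is a hypothetical helper rather than part of ktrain; it only mirrors the rule described above.

```python
def filter_by_threshold(doc_topic_probs, threshold=0.25):
    """Return the indices of documents whose best topic probability
    is at least `threshold`."""
    return [
        doc_id
        for doc_id, probs in enumerate(doc_topic_probs)
        if max(probs) >= threshold
    ]


# One row per document, one column per topic (illustrative numbers only).
doc_topic_probs = [
    [0.70, 0.10, 0.10, 0.05, 0.05],  # clearly about topic 0: kept
    [0.20, 0.20, 0.20, 0.20, 0.20],  # no dominant topic: filtered out
    [0.30, 0.25, 0.20, 0.15, 0.10],  # weakly about topic 0: kept
]

kept = filter_by_threshold(doc_topic_probs, threshold=0.25)
print(kept)
```

A higher threshold keeps only documents that are strongly associated with a single topic, at the cost of discarding more of the corpus.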

The build method computes the topic-document distribution, while the topic-word distribution is learnt earlier, when the LDA model is trained in the get_topic_model method.
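The two calls can be sketched as follows. This is a hedged outline rather than the exact code from our notebook: the wrapper `build_lda_model` is a hypothetical helper, and the `n_features` value is an assumption. `get_topic_model`, `print_topics`, and `build` are the ktrain calls referred to in the text.

```python
def build_lda_model(texts, n_features=10000, threshold=0.25):
    """Train an LDA topic model with ktrain and build its document-topic matrix.

    `texts` is a list of cleaned document strings.
    """
    # Imported inside the function so the sketch can be read and loaded
    # without ktrain installed (install it with: pip install ktrain).
    from ktrain.text import get_topic_model

    # Training learns the topic-word distribution. With n_topics=None,
    # ktrain estimates the number of topics from the number of documents.
    tm = get_topic_model(texts, n_topics=None, n_features=n_features)

    # Print the top words of each learnt topic.
    tm.print_topics()

    # Compute the document-topic distribution and drop documents whose
    # highest topic probability is below `threshold`.
    tm.build(texts, threshold=threshold)
    return tm
```

Because `build` may drop documents, ktrain also offers `tm.filter(...)` to prune an external list or DataFrame in the same way, so that it stays aligned with the documents the model retained.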

Step 5: Visualize the LDA model

Let us check how the initial LDA model has created the topics.

The clustering looks good and there is some separation, but we are not interested in all the topics. We want to filter out the interesting ones. We were only interested in:

  • Transmission
  • Incubation
  • Environment
  • Risk for children and pregnant women
  • Virus
  • Protection
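One way to restrict the visualization to these topics is sketched below. The keyword list and both helper functions are assumptions made for illustration, since how the topics of interest are picked is a judgment call; `get_topics`, `get_doctopics`, and `visualize_documents` are ktrain `TopicModel` methods.

```python
def select_topic_ids(topics, keywords):
    """Return the IDs of topics whose top words contain any of the keywords."""
    wanted = {kw.lower() for kw in keywords}
    return [
        topic_id
        for topic_id, words in enumerate(topics)
        if wanted & set(words.lower().split())
    ]


def visualize_selected_topics(tm, keywords):
    """Plot only the documents that belong to topics matching `keywords`."""
    # get_topics() returns one string of top words per topic.
    topic_ids = select_topic_ids(tm.get_topics(), keywords)
    if not topic_ids:
        raise ValueError("no topic matches the given keywords")
    # Restrict the document-topic matrix to the selected topics and plot it.
    doc_topics = tm.get_doctopics(topic_ids)
    tm.visualize_documents(doc_topics=doc_topics)
    return topic_ids


# Hypothetical keywords derived from the list of interests above.
KEYWORDS = ["transmission", "incubation", "environment", "children",
            "pregnant", "virus", "protection"]

# Usage, given a built ktrain topic model `tm`:
# visualize_selected_topics(tm, KEYWORDS)
```

Matching on the top words of each topic is a simple heuristic; reading the printed topics and listing the relevant topic IDs by hand is an equally valid alternative.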

When the above code is run, we get the following visualization. It has fewer clusters, but all the topics are properly clustered, and the separation among clusters is much more pronounced.

In the screenshot, we can clearly see the cluster related to the topics containing the word ‘respiratory’.

Step 6: Can we get relevant answers to the questions we pass through the LDA model?

The next step in our research was to find out whether we can get relevant answers to questions from the documents that have been clustered into the topics we were interested in. We need to run the following code to get the answers from the model.
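A possible way to retrieve passages for a question is sketched below. It assumes that the retrieval uses ktrain's topic-based recommender (`train_recommender` and `recommend`), which returns the documents whose topic distributions are closest to that of the query text; the helper `find_relevant_passages` is hypothetical.

```python
def find_relevant_passages(tm, query, n=3):
    """Return the text of the `n` documents most similar to `query`.

    Assumes tm.train_recommender() has already been called on a built model.
    """
    results = tm.recommend(text=query, n=n)
    passages = []
    for doc in results:
        # Recent ktrain versions return dicts with a "text" key, while
        # older versions returned tuples with the text in first position.
        passages.append(doc["text"] if isinstance(doc, dict) else doc[0])
    return passages


# Usage, given a built ktrain topic model `tm`:
# tm.train_recommender()
# for passage in find_relevant_passages(
#         tm, "What is known about Coronavirus transmission?"):
#     print(passage[:300])
```

Since the match is made in topic space rather than on exact keywords, the query should be cleaned in the same way as the corpus so that its words line up with the model vocabulary.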

Initial tests show that we are able to get good quality results.

For the Question – “What is known about Coronavirus transmission?”, one of the responses was:

“Current worldwide outbreak new type coronavirus covid 19 originated wuhan china spread 140 countries including japan koreaitaly world health organization declared covid 19 become global health concern causing severe respiratory tract infections humans current evidence indicates sarscov 2 spread humans via transmission wild animals illegally sold huanan seafood wholesale market phylogenetic analysis shows sarscov 2 new member coronaviridae family distinct sarscov identity approximately 79 merscov identity approximately 50 1 2 knowing origin pathogen critical developing means block transmission vaccines 3 notably sarscov 2 shares high level genetic similarity 96 3 bat coronavirus”

For the question – “risks for pregnant woman”, one of the responses was:

“cov 2 pandemic continues pregnant women may high risks affection due immunosuppressive state affection status mothers may cause adverse maternal neonatal complications outcomes 15 following urgent questions need addressed whether sarscov 2 could transmitted vertically fetus maternal fetal interface whether affection virus would cause dysfunction placenta even abortion among pregnant women based scrna seq data early human placenta first trimester identified”

For the question – “Pneumonia and corona virus”, one of the responses was:

“coronaviruses enveloped positive sense single stranded rna viruses infect wide range human animal species december 2019 six human pathogenic coronaviruses known severe acute respiratory syndrome coronavirus sarscov middle east respiratory syndrome coronavirus merscov cause severe acute atypical pneumonia extrapulmonary manifestations immunocompetent immunocompromised patients human coronavirus 229e hcov 229e hcov nl63 hcov oc43 hcov hku1 usually cause mild self limiting upper respiratory tract infections immunocompetent patients occasionally lower respiratory tract infections immunocompromised hosts none dec 31 2019 informed cluster unexplained cases pneumonia wuhanhubei province china subsequent investigations identified novel lineage b betacoronavirus later named severe acute respiratory syndrome coronavirus 2 sarscov 2 high degree genomic similarity bat coronaviruses 2 3 4 none typical clinical features sarscov 2 infection coronavirus disease 2019 covid 19 similar severe acute respiratory syndrome sars include fever myalgia dry cough dyspnoea fatigue radiological evidence ground glass lung opacities compatible atypical pneumonia”

A lot more work needs to be done to elicit better quality responses, but this is a promising start.


In summary, we used the LDA implementation in the ktrain package for topic modeling. This blog explains our approach to retrieving relevant texts from the corpus, and the approach is quite intuitive. We can use the same model to get the topics relevant to a document too: that is, we can pass the entire document into the LDA model and get the topics relevant to it. We can use the same model for document classification as well.
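Getting the topics for a new document could look like the sketch below. It assumes the model is a ktrain `TopicModel`, whose `predict` method returns one row of topic probabilities per input text; the helper `top_topics_for_document` is hypothetical.

```python
def top_topics_for_document(tm, document, k=3):
    """Return the `k` most probable (topic_id, probability) pairs for a document."""
    # predict() returns one row of topic probabilities per input text.
    doc_topics = tm.predict([document])
    probs = list(doc_topics[0])
    # Rank topic IDs from the most probable to the least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(topic_id, float(probs[topic_id])) for topic_id in ranked[:k]]


# Usage, given a built ktrain topic model `tm`:
# for topic_id, prob in top_topics_for_document(tm, "cleaned article text"):
#     print(topic_id, round(prob, 3))
```

Taking only the single most probable topic from this ranking gives a simple, unsupervised way to assign a document to a class.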

At this point, we have implemented a latent probabilistic model, LDA. Over the next couple of weeks, we plan to build a Variational Autoencoder and compare its performance on the current dataset.

Thank you for reading. Feedback and comments are always welcome.

If you want to run the code, please feel free to reach out to