Pharmaceutical Data Transformation
We collaborated with a large pharmaceutical company seeking to improve their system for text retrieval and ontology management. The company needed a solution to organise and retrieve vast amounts of scientific literature, research data, and internal documentation. Their existing system struggled with accurate text identification, document classification, and dating, which hindered their ability to access critical research and data efficiently.
The challenge
The company faced several key challenges in text retrieval. They struggled to accurately identify specific texts within their growing document database, especially those with similar or identical names. This led to confusion and inefficiencies, with irrelevant documents frequently appearing in search results. Additionally, accurately dating documents was a persistent problem, impacting the chronological organisation of research data and the integrity of their findings.
The solution
To address the company’s challenges in text retrieval, ontology management, and document identification, we implemented a comprehensive solution that combined advanced technologies with tailored strategies. We first developed custom transformer models, specialized datasets, and knowledge graphs to significantly enhance retrieval accuracy and resolve issues with name cross-over and overlapping content in pharmaceutical documents. We then integrated Retrieval-Augmented Generation (RAG) techniques, large language models (LLMs), TF-IDF algorithms, and vector databases to further improve precision, relevance, and retrieval speed, optimizing the overall document management process.
The features
We developed advanced transformer models, leveraging the Hugging Face library, to accurately process complex pharmaceutical texts. By focusing on sentence-level semantics, these models improved text identification and retrieval, resolving issues with overlapping content and name cross-over.
To enhance the performance of the transformer models, we created custom datasets with precise annotations tailored to the company’s domain. This targeted training improved the accuracy of text retrieval, ensuring that even documents with similar names were correctly identified.
We built new knowledge graphs to structure relationships between texts, entities, and concepts. This enhanced the retrieval of relevant information by linking related documents, reducing confusion from overlapping names and improving accessibility.
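To illustrate how linked entities can separate documents that share a name, here is a minimal sketch of a graph stored as subject-relation-object triples. The class `TinyKnowledgeGraph`, the relation names and the document IDs are all hypothetical and chosen for the example; a production system would more likely sit on a dedicated graph store.

```python
from collections import defaultdict


class TinyKnowledgeGraph:
    """Toy in-memory triple store (hypothetical, for illustration only)."""

    def __init__(self):
        # subject -> set of (relation, object) pairs
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[subject].add((relation, obj))

    def objects(self, subject, relation):
        """All objects linked to `subject` by `relation`."""
        return {o for r, o in self._edges[subject] if r == relation}

    def subjects_with(self, relation, obj):
        """All subjects that have the given (relation, object) pair."""
        return sorted(
            s for s, pairs in self._edges.items() if (relation, obj) in pairs
        )


kg = TinyKnowledgeGraph()
# Two made-up documents with an identical title but different subjects.
kg.add("doc-001", "has_title", "Stability Report")
kg.add("doc-001", "mentions", "compound-a")
kg.add("doc-002", "has_title", "Stability Report")
kg.add("doc-002", "mentions", "compound-b")

# A title lookup alone is ambiguous; the "mentions" edge resolves it.
same_title = kg.subjects_with("has_title", "Stability Report")
about_b = [d for d in same_title if "compound-b" in kg.objects(d, "mentions")]
```

In this sketch the title query returns both documents, and filtering on the linked entity narrows the result to the one that concerns the compound of interest.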
To further improve text retrieval, we integrated RAG techniques, combining transformer models with external knowledge sources to generate more relevant and contextually accurate information. This approach improved the retrieval of specialized queries and older, hard-to-find documents.
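The general shape of a RAG pipeline can be sketched in a few lines: retrieve the passages most related to the query, then hand them to a language model as context. Everything below is an assumption made for illustration: the word-overlap retriever stands in for a real one, `echo_model` is a placeholder for an actual LLM call, and the passages are invented.

```python
def retrieve(query, passages, k=2):
    """Rank passages by how many query words they share (toy retriever)."""
    query_terms = set(query.lower().split())

    def overlap(passage):
        return len(query_terms & set(passage.lower().split()))

    # sorted() is stable, so ties keep their original order.
    return sorted(passages, key=overlap, reverse=True)[:k]


def build_prompt(query, context):
    """Assemble the retrieved passages and the question into one prompt."""
    joined = "\n".join(f"- {p}" for p in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {query}"
    )


def answer(query, passages, generate, k=2):
    """Retrieve context, then pass the augmented prompt to `generate`."""
    context = retrieve(query, passages, k)
    return generate(build_prompt(query, context))


# Invented passages, for illustration only.
passages = [
    "Compound B showed no degradation after 12 months at 25 C.",
    "The cafeteria menu changes every Monday.",
    "Compound B degradation accelerated above 40 C.",
]


def echo_model(prompt):
    # Placeholder for an LLM call: returns the prompt it was given.
    return prompt


prompt = answer("compound b degradation", passages, echo_model)
```

Because the model only sees what the retriever selects, the unrelated passage never reaches the prompt, which is the property that makes generated answers more grounded in the source documents.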
By integrating large language models (LLMs) into the company’s existing workflow via APIs, we enhanced real-time access to external scientific literature, providing broader, more accurate information retrieval and smoother interaction with data systems.
We implemented TF-IDF scoring to prioritise the most relevant texts by analysing term importance: a term counts for more when it appears often within a document but rarely across the collection as a whole. Combined with transformer models, this method significantly improved document retrieval accuracy and reduced irrelevant results.
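As a rough illustration of this kind of term weighting, the sketch below scores a toy corpus with one common TF-IDF variant: length-normalised term frequency multiplied by a smoothed inverse document frequency. The helper names, the whitespace tokeniser and the three sample documents are assumptions made for the example, not details of the project.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


def tfidf_scores(query, docs):
    """Score each document against the query with a simple TF-IDF sum."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Smoothed IDF: terms found in every document still count a little.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


# Invented document titles, for illustration only.
docs = [
    "stability study of compound a tablets",
    "clinical trial protocol for compound b",
    "stability study of compound b injection",
]
scores = tfidf_scores("compound b stability", docs)
# Document indices ordered from most to least relevant.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here the third document ranks first because it is the only one that matches all three query terms, while "compound", which occurs in every document, contributes the least to any score.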
We developed vector databases to store and retrieve documents based on semantic similarity. This enabled quick access to relevant documents, even when query terms didn’t precisely match, improving search efficiency and document ranking.
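The core operation behind this kind of search is a nearest-neighbour lookup over embedding vectors. The following sketch does it by brute force with cosine similarity. It assumes hand-written three-dimensional vectors and a hypothetical `TinyVectorStore` class; in a real deployment the vectors would come from an embedding model and the database would typically use an approximate index rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Toy in-memory store with brute-force search (hypothetical)."""

    def __init__(self):
        self._items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self._items.append((doc_id, vector))

    def search(self, query_vector, k=2):
        """Return the k most similar documents as (doc_id, score) pairs."""
        scored = [(cosine(query_vector, v), doc_id) for doc_id, v in self._items]
        scored.sort(reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:k]]


store = TinyVectorStore()
# Made-up embeddings; each dimension loosely stands for a topic.
store.add("dosage-guideline", [0.9, 0.1, 0.0])
store.add("stability-report", [0.1, 0.9, 0.2])
store.add("trial-protocol", [0.2, 0.2, 0.9])

# The query vector shares no literal terms with any document,
# only a direction in the embedding space.
results = store.search([0.8, 0.2, 0.1], k=2)
```

The query lands closest to the first document because their vectors point in nearly the same direction, which is how semantically related documents can be found even when the wording differs.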
The result
By implementing these advanced technologies, the company saw a significant improvement in their text retrieval and ontology management system. The new solution provided precise, context-aware search capabilities, solving issues related to document name overlap, accurate identification, and text dating. As a result, they experienced better document management, increased research accuracy, and faster access to critical information.