When most people think of the International Atomic Energy Agency (IAEA), they picture inspectors in protective gear examining nuclear facilities. Few realize that tucked away in Monaco, IAEA scientists are quietly building one of the world's most comprehensive databases on marine radioactivity. The Marine Radioactivity Information System (MARIS) contains hundreds of academic papers tracking radioactivity in our oceans - vital information for environmental monitoring, research, and public safety.
But how do you make such specialized knowledge accessible to both experts and the general public? Enter Retrieval-Augmented Generation (RAG) - the AI approach everyone's talking about but few are implementing at scale. This post documents our ongoing journey to build a RAG system for MARIS, including the unexpected challenges, pragmatic solutions, and evaluation methods we've developed along the way.
On Monaco's Mediterranean coast sits an IAEA laboratory that few people know about. Here, scientists manage what might be one of the most comprehensive databases of marine radioactivity in the world: the Marine Radioactivity Information System (MARIS).
MARIS serves as a global repository for measurements of radioactivity in our oceans. It contains data from thousands of samples collected across the world's seas - everything from seawater and marine sediments to fish, seaweed, and other marine organisms. For over 20 years, the IAEA has been collecting, verifying, and standardizing this data, making it available to researchers, policymakers, and environmental scientists.
Why does this matter? Radioactivity in our oceans comes from both natural sources and human activities - from nuclear power plant operations to historical weapons testing. Understanding these levels helps scientists track environmental changes, assess potential risks, and establish baselines for what's "normal" in different marine environments.
But MARIS isn't just a database of numbers. Behind those measurements are more than 500 academic papers - detailed scientific studies explaining how the measurements were taken, what they mean, and how they fit into our broader understanding of marine ecosystems. This knowledge base is constantly growing as new research emerges.
The challenge we faced was clear: How do we make this treasure trove of specialized scientific knowledge accessible to everyone who might need it? A government official with no scientific background might need to understand radioactivity trends in their coastal waters. A marine biologist might need detailed information about specific isotopes in certain species. A student might be researching environmental impacts of nuclear energy.
Traditional search methods fall short with such specialized content. Keyword searches miss semantic connections. Manual browsing is impractical given the volume and complexity of the papers. And the technical nature of the content creates barriers for non-specialists.
This is where our RAG implementation journey began - with the goal of transforming MARIS from a specialized database accessible mainly to experts into a knowledge system that could respond to natural language questions from users of all backgrounds, providing accurate, sourced information tailored to their level of expertise.
When we first outlined our RAG implementation roadmap, PDF parsing seemed like a mere checkbox item - a technical necessity but hardly a major challenge. After all, converting documents to text is a solved problem, right? As it turns out, this assumption couldn't have been further from reality.
Academic papers on marine radioactivity aren't simple text documents. They're complex publications with two-column layouts, tables packed with isotope measurements, specialized scientific figures, equations, footnotes, and references. What looks visually organized to the human eye becomes a parsing nightmare for machines.
Our first attempts at using standard PDF extraction libraries produced text that resembled a jigsaw puzzle assembled by a blindfolded person. Sentences from adjacent columns merged mid-thought. Tables transformed into unintelligible character soup. Figure captions appeared randomly in the middle of paragraphs. And critical data - the very information we needed most - was often mangled beyond recognition.
We quickly realized that poor-quality text extraction would poison our entire RAG pipeline. No matter how sophisticated our embedding model or how well-tuned our retrieval system, if the source text was garbled, our results would be unreliable.
After evaluating several OCR solutions, Mistral OCR emerged as our champion, offering the best quality-to-cost ratio for scientific content and proving remarkably capable on the kinds of complex scientific layouts described above.
This significantly streamlined our parsing pipeline, but our work wasn't done. We still needed to implement:
Content Sanitization: We developed processes to remove irrelevant content like repetitive journal names, page numbers, phone numbers in footers, and other artifacts that would add noise to our retrieval system.
Figure and Table Processing: Rather than keeping raw tables and figures, we replaced them with rich descriptions generated by an LLM. For figures, we used a prompt like:
Please describe figure '{im_fname.name}' in the context of this academic paper.
Extract its caption and provide a description detailed enough to be useful for a RAG system.
When describing the figure, also consider its surroundings to further contextualize it.
Analyse the image and synthesize the main findings as well as main trends if any.
Always use LaTeX to express nuclide names.
If you notice that the figure is a logo from the journal, simply ignore it.
This approach transformed visual data into textual descriptions that could be embedded and retrieved alongside the main text. For example, a graph showing cesium concentrations in different fish species would be described with all relevant trends and data points, ensuring this information remained accessible through text-based retrieval.
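For illustration, a figure-description step along these lines can be wired up with any vision-capable chat model. The sketch below is not our production code: the `describe_figure` helper, the OpenAI-compatible client, and the model name are assumptions, and the prompt is an abridged version of the excerpt above.

```python
import base64
from pathlib import Path
from openai import OpenAI  # any OpenAI-compatible vision model would do

client = OpenAI()

def describe_figure(im_fname: Path, surrounding_text: str) -> str:
    """Ask a vision LLM for a RAG-friendly description of one extracted figure."""
    # Abridged version of the figure-description prompt quoted above.
    prompt = (
        f"Please describe figure '{im_fname.name}' in the context of this academic paper. "
        "Extract its caption and provide a description detailed enough to be useful for a RAG system. "
        "Always use LaTeX to express nuclide names. If the figure is a journal logo, simply ignore it.\n\n"
        f"Surrounding text:\n{surrounding_text}"
    )
    image_b64 = base64.b64encode(im_fname.read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```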
The combination of Mistral OCR's powerful baseline capabilities with our custom post-processing created a robust pipeline that transformed complex scientific PDFs into clean, structured text while preserving the scientific integrity of the original content.
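To give a flavour of the content-sanitization side of that post-processing, here is a minimal sketch. The patterns are hypothetical stand-ins for the recurring journal headers, page numbers, and footer phone numbers we strip in practice.

```python
import re

# Hypothetical cleanup patterns; the real rules are tuned to the artifacts
# that actually recur in the MARIS corpus.
BOILERPLATE_PATTERNS = [
    re.compile(r"^\s*Journal of .+? Radioactivity\s*$", re.MULTILINE),  # repeated journal-name headers
    re.compile(r"^\s*Page \d+(?: of \d+)?\s*$", re.MULTILINE),          # page numbers
    re.compile(r"\+\d[\d\s().-]{7,}\d"),                                # phone numbers in footers (coarse)
]

def sanitize(text: str) -> str:
    """Remove recurring header/footer artifacts and collapse the leftover blank lines."""
    for pattern in BOILERPLATE_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```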
For anyone implementing RAG systems for scientific literature, our advice is clear: invest in quality OCR and don't underestimate the importance of post-processing. The parsing stage isn't just a technical prerequisite - it's the foundation upon which your entire system rests.
With our parsing pipeline delivering high-quality text, we faced the next critical decision: how to divide our documents into chunks for embedding and retrieval. This seemingly simple decision has profound implications for retrieval quality.
For our initial implementation, we opted for a straightforward approach: a sliding window of 800 tokens with 50% overlap. This means each chunk contains about 800 tokens (roughly 600 words), and consecutive chunks share half their content. We chose this conservative strategy to get our RAG pipeline operational quickly and begin collecting performance metrics.
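A sliding-window chunker of this kind fits in a few lines. The sketch below assumes tiktoken's `cl100k_base` encoding as a stand-in for whichever tokenizer matches your embedding model.

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 800, overlap: float = 0.5) -> list[str]:
    """Split text into ~chunk_size-token chunks, with consecutive chunks sharing `overlap` of their tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = int(chunk_size * (1 - overlap))  # 400-token stride for 50% overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```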
While simple, this approach offers several advantages:
However, we're well aware that more sophisticated chunking strategies exist:
The late chunking strategy is especially compelling for our use case. Instead of splitting documents into independent chunks before embedding, this approach first runs entire documents or large sections through a long-context embedding model and only then pools the token-level embeddings into chunk embeddings. Because each chunk's embedding is computed with the full document in view, more context is preserved, which can improve retrieval accuracy, particularly for complex scientific content where concepts often span multiple paragraphs.
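For reference, the core of the idea can be sketched as follows. This is not something we have deployed: the model name is illustrative, and `chunk_token_spans` would hold the token offsets of whatever chunk boundaries you choose.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice: any long-context embedding model that exposes
# token-level hidden states will do.
MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

def late_chunk_embeddings(text: str, chunk_token_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Embed the full text once, then mean-pool token embeddings over each chunk span."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    return torch.stack([token_embeddings[start:end].mean(dim=0) for start, end in chunk_token_spans])
```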
As we move beyond our initial implementation, we plan to experiment with late chunking and evaluate whether it improves performance on our specific corpus of scientific papers. Our approach to chunking, like every aspect of our RAG system, will evolve based on empirical evidence from our evaluation metrics.
For now, our simple sliding window approach gives us a solid baseline to measure future improvements against. And as we'll see in the next section, having a robust evaluation framework is crucial for making informed decisions about these technical choices.
If there's one lesson we've learned from implementing RAG systems, it's this: you can't improve what you don't measure. While many RAG implementations focus exclusively on the final output quality, we recognized early that a systematic evaluation framework would be essential for iterative improvement.
The retrieval component is the linchpin of any RAG system. No matter how sophisticated your generation model, it can only work with the context it's given. If the retrieval fails to surface relevant information, even the most advanced LLM will produce inadequate answers.
This is especially critical for scientific content, where rigor matters and hallucinations can be particularly problematic. Our approach focuses on evaluating the retrieval component independently before assessing the full system.
To evaluate retrieval effectively, we need a ground truth dataset of questions paired with the chunks that should be retrieved to answer them. But how do you create such a dataset without extensive manual annotation?
Our solution: synthetic question generation. For each chunk in our corpus, we prompt an LLM to generate questions that:
We carefully engineered our question generation prompt to ensure diversity, avoiding simplistic questions that merely repeat keywords from the chunk. The goal is to create questions that test the semantic understanding capabilities of our embedding model, not just keyword matching.
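A stripped-down version of this generation step might look like the following. The model name and the instruction wording are placeholders rather than our production prompt.

```python
from openai import OpenAI

client = OpenAI()

def generate_questions(chunk: str, n_questions: int = 3) -> list[str]:
    """Ask an LLM for questions that this chunk alone should be able to answer."""
    instruction = (
        f"Read the following excerpt from a marine radioactivity paper and write "
        f"{n_questions} diverse questions that it answers. Avoid questions that simply "
        "repeat keywords from the text; favour questions that require understanding it.\n\n"
        f"Excerpt:\n{chunk}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[{"role": "user", "content": instruction}],
    )
    # Naive parsing: one question per line, stripping list markers.
    lines = [line.strip().lstrip("-*0123456789. ") for line in response.choices[0].message.content.splitlines()]
    return [line for line in lines if line]
```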
Each synthetic question is paired with its source chunk as the ground truth - the chunk that should ideally be retrieved when this question is asked. Authorized users can then review and refine generated questions as shown in Fig. 1:
With our synthetic question-chunk pairs established, we implemented a robust evaluation framework using multiple complementary metrics:
We calculate Recall@k across a wide range of k values (from 1 to 40). This measures the percentage of questions where the ground truth chunk appears within the top-k retrieved chunks. By examining this metric at different cutoff points, we can understand both the relevancy of our top results and the overall coverage of our retrieval system.
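In code, the metric reduces to a few lines; this sketch assumes each question's retrieval results and ground-truth chunk are identified by chunk IDs.

```python
def recall_at_k(retrieved: list[list[str]], ground_truth: list[str], k: int) -> float:
    """Fraction of questions whose ground-truth chunk ID appears in the top-k retrieved IDs."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(retrieved, ground_truth))
    return hits / len(ground_truth)
```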
While Recall tells us if the relevant chunk appears somewhere in our results, MRR helps us understand where it typically appears. For each question, we identify the rank position of the ground truth chunk and calculate the reciprocal (1/rank). If the ground truth appears first, the score is 1; if second, 0.5; if third, 0.33, and so on. We then average these scores across all questions.
MRR is particularly valuable for our use case because it quantifies the likelihood that relevant information will be prominently positioned in the context provided to the generation model. A higher MRR indicates that correct information is not just present but prioritized.
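MRR is just as compact; in this sketch, questions whose ground-truth chunk is never retrieved contribute zero.

```python
def mean_reciprocal_rank(retrieved: list[list[str]], ground_truth: list[str]) -> float:
    """Average of 1/rank of the ground-truth chunk across all questions."""
    total = 0.0
    for ranked, truth in zip(retrieved, ground_truth):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(ground_truth)
```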
We've approached our RAG implementation as a scientific experiment, systematically testing different configurations to identify optimal performance. Our evaluation framework allows us to compare:
Each combination is evaluated using our metrics suite, with results tracked in a dashboard that allows us to identify clear performance trends. This methodical approach ensures that every architectural decision is backed by empirical evidence rather than assumptions or industry trends.
In future blog posts, we'll dive deeper into specific findings from these experiments, including our detailed analysis of vector vs. hybrid search performance and the impact of rerankers on scientific content retrieval.
We've integrated this evaluation framework directly into our development workflow:
By establishing this rigorous evaluation framework early in our implementation, we've been able to make continuous, measurable improvements to our system rather than relying on subjective assessments or anecdotal feedback.
The final piece of our RAG implementation puzzle was crafting prompts that would effectively bridge the gap between technical scientific content and diverse user needs. Prompt engineering for scientific RAG systems differs significantly from general-purpose applications, particularly when working with specialized content like marine radioactivity research.
Our system needed to serve both domain experts and the general public, requiring prompts that could adapt to different levels of technical understanding. We developed a two-tier approach:
The generation component proved particularly challenging due to several unique requirements of scientific communication.
Working with academic papers raised important copyright considerations. Our prompts needed to ensure the system would:
Here's an excerpt from our generation prompt that addresses these concerns:
RESPONSE REQUIREMENTS
Attribution
- Always cite the source of information in your responses
- Include author names, publication year, title, and journal/publisher where available
- Format: (Author et al., Year, "Title", Journal/Source)
- When synthesizing from multiple sources, cite each source that contributed to your answer
Content Transformation
- Paraphrase and synthesize information rather than reproducing exact text
- Limit direct quotations to short, necessary passages (maximum 2-3 sentences)
- When quoting, use quotation marks and provide precise citation
- Focus on communicating core concepts, findings, and facts rather than reproducing text
Another critical aspect of our prompt engineering was maintaining scientific rigor. Our prompts instruct the model to:
To enhance usability, we implemented a structured response format that helps users quickly find the information they need:
### Summary Answer: Brief, synthesized response to the query
### Key Findings: Core information points relevant to the question
### Scientific Context: How this information fits within broader understanding
### Sources: Properly formatted citations for all information provided
This format ensures that users can immediately see a concise answer while having access to supporting details and proper citations if needed.
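To show how these pieces fit together at generation time, here is a hedged sketch. The system prompt stands in for our full prompt (including the attribution and formatting rules quoted above), and the client, helper names, and model choice are illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Condensed stand-in for the full generation prompt discussed above.
SYSTEM_PROMPT = """You answer questions about marine radioactivity using ONLY the provided context.
Follow the attribution and content-transformation rules, and structure every answer as:
### Summary Answer / ### Key Findings / ### Scientific Context / ### Sources."""

def answer(question: str, retrieved_chunks: list[dict]) -> str:
    """Generate a structured, cited answer from the retrieved chunks."""
    context = "\n\n".join(
        f"[Source: {chunk['citation']}]\n{chunk['text']}" for chunk in retrieved_chunks
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```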
Our prompt engineering process wasn't a one-time effort but an iterative cycle:
This approach allowed us to identify and address subtle issues, such as the system occasionally using outdated terminology or struggling with certain types of scientific notation.
Beyond response generation, we also developed specialized prompts for metadata extraction during the document processing phase. These prompts help identify key information like:
This extracted metadata enriches our chunks and improves retrieval accuracy for specific technical queries.
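As an illustration only, a metadata-extraction step can pair a prompt with structured JSON output along these lines; the field names here are hypothetical, not the exact set we extract.

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_metadata(chunk: str) -> dict:
    """Pull structured metadata (hypothetical fields) out of a chunk for indexing."""
    prompt = (
        "From the excerpt below, extract JSON with these keys: "
        '"radionuclides" (list), "species" (list), "locations" (list), "sample_types" (list). '
        "Use empty lists when nothing is mentioned.\n\n" + chunk
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```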
The careful engineering of these various prompts has been essential to creating a system that not only retrieves relevant information but presents it in a way that is scientifically accurate, properly attributed, and accessible to different user groups.
After several months of development, our MARIS RAG system has reached an important milestone. We've successfully implemented a complete pipeline from PDF parsing to response generation, with a robust evaluation framework to guide our continued improvements.
Even with our relatively straightforward initial approach, we're seeing encouraging results:
These results validate our approach and provide a solid foundation for further refinement. The fact that we're achieving good retrieval performance even without advanced techniques like reranking suggests significant room for improvement as we implement more sophisticated components.
Perhaps the most valuable outcome of this first phase has been establishing our evaluation framework. By prioritizing metrics before optimization, we've created a development environment where every decision can be guided by empirical evidence rather than intuition.
This approach has already helped us avoid several potential pitfalls. For instance, when considering different chunking strategies, our metrics revealed that our simple sliding window approach was outperforming more complex semantic chunking in certain scenarios - a counterintuitive finding we would have missed without systematic measurement.
While this blog post has outlined our overall workflow and approach, we've only scratched the surface of the technical details. In upcoming posts, we'll dive deeper into specific components and share our findings on:
Scaling to the full MARIS knowledge base - We've focused initially on biota samples, but plan to expand to the complete corpus covering seawater, sediment, and suspended matter
Fine-tuning for scientific content - We'll explore fine-tuning embedding models and rerankers specifically for marine radioactivity literature, again leveraging synthetic data generation
User segmentation and adaptation - Different user groups (scientists, policymakers, journalists, students) have distinct needs. We'll share our approach to tailoring responses based on user expertise and intent
Error analysis and continuous improvement - We'll establish processes for monitoring performance, identifying failure patterns, and implementing targeted improvements
Building a RAG system for specialized scientific content presents unique challenges, from PDF parsing complexities to the need for rigorous scientific accuracy. Our journey with MARIS has reinforced that successful implementation requires both technical expertise and domain knowledge.
By sharing our approach, challenges, and early results, we hope to contribute to the growing body of knowledge around practical RAG implementations. As we continue to refine our system, we'll provide updates on our progress and insights that may help others tackling similar challenges.
The potential impact of making specialized scientific knowledge more accessible extends far beyond our specific use case. As more organizations implement similar systems for their domain-specific knowledge bases, we may see a democratization of expertise that accelerates scientific progress and informed decision-making.
Our application is now accessible at https://marise-dev.pla.sh (currently by request only), where users can not only query the system but also provide feedback and annotations that help us continually improve performance.
While we've focused primarily on methodology in this post, our implementation relies on several key technologies that have proven invaluable:
For those interested in building similar systems, several resources have been particularly valuable in our journey: