100 Papers an Hour: 10x'ing Your Strategy Research Speed With AI
Batch process academic papers, glean key insights, and craft your own innovations and takeaways using basic LLM and OCR tooling.
As much as LLMs and AI seem to be writing our code, creating our art, and potentially replacing (or at least supplementing) our own artistic souls, they also still excel at pretty mundane tasks. When applied correctly, they can chew through hundreds of research papers at a time and give you deeper insights, inspiration, and clarity that you can apply to your own research process.
So, if you’re still reading papers off arXiv with your own human eyes (yuck!), you’re probably wasting a ton of time that could be put to more productive use.
Here’s the thing:
AI isn’t going away. Your competitors and peers are adopting it at a faster rate than any technology in human history. Shouldn’t you be using it to make your job easier and to come to more profitable systems faster?
The Paradox of Infinite Potential
Here we sit in 2025 with chart after chart and story after story telling us that we will have AGI in two years. At the same time, every research article ever published is available at our fingertips, and any individual fact or piece of knowledge is accessible within 30 seconds from a computer in our pockets.
And thus, we become paralyzed.

Where do we start? How do we even begin to dig into this pile of research papers?
And once we chew through one and think we have the answer, 12 more questions are raised and 20 more potential tweaks, remixes, and innovations on top of our original idea come to mind. Thus, the cycle repeats and our heads explode.
The horror!
Don’t Be A Bricklayer
But as our access to data speeds up and our use of automated intelligence becomes more ubiquitous, we have to realize that we must think higher.
Just as a bricklayer eventually sees how to build a whole building and must become a foreman who manages a team to accomplish that greater goal, we must shed what we think of as human-based processing and pass it down to artificial intelligence so that we can step up to a higher level of reasoning.
This concept is called cognitive offloading and it’s the same thing that happens when we pull out the phone’s calculator to get an answer to 12 x 25, even though we could probably do it ourselves.
This is directly applicable to trading strategy research and design, especially when it comes to sifting through research papers. We really don’t need to scour every single paper and read every single detail when we can offload that work to AI.
Instead, we can use AI to summarize batches of papers like so:
So let’s start with OCR - what it is, and how it turns PDFs into high quality Markdown files.
From Human-Legible to Machine-Legible
Research papers are usually in PDF format and often have fancy tables and LaTeX math formulas. Many of them also come with charts and plots.
This type of rich media cannot be fed directly into an LLM that expects text. Sure, some LLMs allow you to upload pictures so you can run analyses on them. But their understanding of technical charts and data this way is often limited.
Instead, it’s better to use OCR (optical character recognition) to transform a rich media file like a PDF into text that is readable by an LLM and very cheap to compute.
I downloaded and tried out 9 different open-source OCR libraries and put them to the test against the OLMAR paper. Frankly, none of them did well at all except Marker, which is a self-hosted CNN model that runs on your own hardware or in the cloud. I was able to extract the PDF in about 90 to 120 seconds with my Apple M2 Max neural processor, and it gave amazing results.
Here are some comparisons of the original PDF to the Markdown with LaTeX output on an online viewer:
What you are seeing is simply amazing! And, this is in pure Markdown. So, while the math equations look all pretty in our viewer, they are just text and easily put into our LLMs.
Playing Senior to Your Digital Junior
Now, with our initial proof of concept in hand, it’s time to try this on a batch of papers and feed it into the LLM. You need to treat your LLM assistant as just that… an assistant. Don’t give it too much credit or room to make mistakes. Don’t let it stray from your guidance. It’s best if you have an idea and direction of your own that you leave somewhat hidden from the AI so that it doesn’t go off on a tangent, because it often thinks it knows better than you.
My Process
I have been researching asset allocation strategies and realized that most of them follow very similar rules. So, I want to take a batch of 25 papers, run them through the OCR, and then ask an LLM to give me a report on the similarities across all of these papers.
Then, I should be able to understand at a higher level what asset allocation papers do. And, more importantly, I should be able to get some inspiration for my own asset allocation strategies based on what works.
And so, in went my 25 PDFs, and out came 25 Markdown files.
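A batch like this does not have to be converted by hand. The sketch below is one possible way to script it, assuming the pdf-extract command described in the Download the Tool section is installed and on your PATH. The function name convert_folder and the folder names in the usage note are hypothetical placeholders, not part of the tool.

```python
import subprocess
from pathlib import Path


def convert_folder(pdf_dir, md_dir, run=subprocess.run):
    """Convert every PDF in pdf_dir to a Markdown file in md_dir, one at a time."""
    pdf_dir, md_dir = Path(pdf_dir), Path(md_dir)
    md_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        out = md_dir / (pdf.stem + ".md")
        if out.exists():
            # Skip work already done so an interrupted batch can be resumed.
            converted.append(out)
            continue
        # Command shape assumed from the usage examples at the end of this post:
        #   pdf-extract input.pdf output.md
        result = run(["pdf-extract", str(pdf), str(out)])
        if result.returncode == 0:
            converted.append(out)
        else:
            # Keep going; one bad PDF should not stop the other papers.
            failed.append(pdf)
    return converted, failed
```

Calling convert_folder("papers", "markdown") would return the list of Markdown files written and the list of PDFs that failed. The run parameter exists only so the external command can be swapped out when testing.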
I compiled everything into one large concatenated text file that I could easily copy and paste into some LLM like so:
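The concatenation step can be sketched in a few lines of Python. This is a minimal version under my own assumptions: the separator format and the function name concatenate_markdown are hypothetical, and any delimiter that clearly marks where one paper ends and the next begins should work just as well.

```python
from pathlib import Path


def concatenate_markdown(md_dir, out_path):
    """Join every .md file in md_dir into one text file with labeled separators."""
    parts = []
    for md in sorted(Path(md_dir).glob("*.md")):
        text = md.read_text(encoding="utf-8").strip()
        # Labeled separators let the LLM (and you) tell the papers apart.
        parts.append(
            f"===== BEGIN PAPER: {md.name} =====\n"
            f"{text}\n"
            f"===== END PAPER: {md.name} ====="
        )
    combined = "\n\n".join(parts) + "\n"
    Path(out_path).write_text(combined, encoding="utf-8")
    return len(parts)
```

Writing the result to a .txt file outside the Markdown folder keeps the combined file from being picked up as another paper if the script is run again.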
Now it’s as simple as plopping this into an LLM with a large context window (I use Gemini 2.5 Pro for this), and asking it some pertinent questions.
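Before pasting, it can help to sanity check that the combined file is likely to fit in the model’s context window. The sketch below uses the common rule of thumb of roughly 4 characters per token for English prose. That ratio is an assumption, not an exact tokenizer count, and LaTeX-heavy text may tokenize less efficiently. The context limit is passed in rather than hard-coded, so look it up in your provider’s documentation.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token count, assuming about 4 characters per token."""
    return int(len(text) / chars_per_token)


def fits_in_context(text, context_limit, reserve_for_answer=20_000):
    """True if the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(text) + reserve_for_answer <= context_limit
```

If the check fails, splitting the papers into two or three smaller batches is a simple fallback.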
Gemini is able to crunch through all of the context and give me structured output in a way that makes sense for what I’m trying to do.
I instantly gained the following insights:
There are ‘basic’ asset allocation systems and ‘complex’ ones that use machine learning, reinforcement learning, etc. that I probably am not interested in.
Most of the basic systems share a lot of similarities. They all trade basic baskets of ETFs. They all have some sort of ‘risk-off’ portfolio that is made up of treasuries, bonds, and other low risk, yield bearing assets.
I can begin to abstract these papers into a higher level framework such that I can generate my own asset allocation systems (can you see where this is going?)
Gemini also gave me a breakdown of all the papers together in this format:
Module 1: Universe & Asset Definition
This module defines the "what" of the strategy.
Module 2: Signal Generation Engine
This module is the "brain" of the strategy, deciding when to be aggressive or defensive.
Module 3: Portfolio Construction Logic
This module translates the signals from Module 2 into precise portfolio weights.
Module 4: Execution Protocol
This module defines the practical rules of implementation.
I then had Gemini give me all of the baskets that all of the strategies used, and continue with the signals and portfolio construction logic. I immediately started to see that Gemini was not completely correct, or not completely aligned with what I wanted and what I saw from my own expertise. I noticed that a lot of signals were acting as filters and we should split them apart.
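To make the four modules and the signal/filter split concrete, here is a hypothetical data structure for recording each paper’s strategy. Every field name is my own illustration, and the distinction drawn in the comments (signals score assets, filters gate the risk-on/risk-off switch) is one reading of the split, not something Gemini produced.

```python
from dataclasses import dataclass, field


@dataclass
class StrategySpec:
    """One paper's strategy, broken into the four modules from the report."""
    name: str
    universe: list[str]                               # Module 1: risk-on basket
    risk_off: list[str]                               # Module 1: defensive basket
    signals: list[str] = field(default_factory=list)  # Module 2: score or rank assets
    filters: list[str] = field(default_factory=list)  # Module 2: gate risk-on vs risk-off
    construction: str = ""                            # Module 3: how weights are set
    execution: str = ""                               # Module 4: rebalance rules


def shared_components(specs, attr):
    """Return the values of `attr` that appear in every spec."""
    sets = [set(getattr(s, attr)) for s in specs]
    return set.intersection(*sets) if sets else set()
```

With a handful of specs filled in, shared_components(specs, "signals") shows which signals every paper has in common, which is the kind of similarity report described above.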
This is where the cognitive offloading concept comes in. What I am doing here is having Gemini do the ‘low’ work of summarizing, finding patterns, and reporting to me. I am then retaining the ‘high’ work of analyzing, thinking about how I want to structure my problem, and using my own experience in systems design and engineering to have Gemini rework the concept at hand. It is the junior and I am the senior.
Next Steps
Now that I have a rough basis for what I want as a quant, I can start to think about building the system I want: an asset allocation system generator.
Because I had this in mind beforehand, I knew that I had to force Gemini to structure its responses in a certain way. I also never give the full responsibility to Gemini, because I play the senior and it plays the junior. It is important to always retain this separation because it is very easy for an LLM to run away with the task in a direction you don’t want at all, much like an enthusiastic new hire who doesn’t get the full picture yet.
User: “Can you just say how to solve this problem?”
AI: <Produces whole new code base of 10000 lines.>
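One way to keep the junior on a short leash is to pin the response format down in the prompt itself. The sketch below builds such a prompt around the four module headings from the breakdown above. The instruction wording and the function name build_prompt are hypothetical, not the exact prompt used for this post.

```python
MODULES = [
    "Module 1: Universe & Asset Definition",
    "Module 2: Signal Generation Engine",
    "Module 3: Portfolio Construction Logic",
    "Module 4: Execution Protocol",
]


def build_prompt(papers_text, question, modules=MODULES):
    """Wrap the concatenated papers in instructions that fix the reply format."""
    headings = "\n".join(f"- {m}" for m in modules)
    return (
        "You are a research assistant. Use only the papers provided below.\n"
        "Answer in prose only. Do not write code unless asked.\n"
        "Organize your answer under exactly these headings:\n"
        f"{headings}\n\n"
        f"Question: {question}\n\n"
        "PAPERS:\n"
        f"{papers_text}"
    )
```

The returned string can be pasted into the chat window as-is. Stating the headings up front makes it easier to spot when the model drifts from the structure you asked for.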
From here, I will continue my research and development, abstracting asset allocation strategies into a framework that will allow us to evolve our own programmatically.
Stay tuned for the next articles as I go through, step-by-step, exactly how I did it and how you can too.
Download the Tool
I’m including the easy PDF extractor CLI tool that I wrote alongside this blog post. Download it from Google Drive, open a terminal, navigate into the folder, and run python -m pip install -e . to install it (replace python with python3, etc. if that is your main Python executable).
Then, you can simply run the PDF extractor tool like so:
# Basic PDF extraction
pdf-extract input.pdf output.md
# With custom device
pdf-extract input.pdf output.md --device cuda
# With verbose output
pdf-extract input.pdf output.md --verbose
Or, embed it into your own Python project:
from pdf_extractor import MarkerExtractor
# Create extractor
extractor = MarkerExtractor(device="auto", verbose=True)
# Extract text from PDF
text = extractor.extract_file("my_document.pdf")
print(text)
# Extract and save as markdown in one step
text = extractor.extract_file("my_document.pdf", save_file="output.md")
Don’t forget to subscribe!
If you are already a paid subscriber, thanks so much for your monetary support.
And to all of my readers and subscribers, thank you for reading my post, and happy researching!
















