[Tutorial] Getting Hands-On with Cloudflare Auto RAG

Oct 31, 2025

Preface: AI + LLM = A Second Brain?

A few days ago, I came across a post in a Facebook group discussing how AI + LLM can act as a second brain.

As I recall, the author’s example involved integrating Obsidian (a note-taking app) with an LLM plugin. This allows your past notes to truly become a brain, where you can explore your own content by conversing with the LLM.

It sounds cool! If I understand correctly, the idea is to use all your notes as a database and apply a RAG mechanism on top.

This is much like having a private NotebookLM. I saw it at the time but didn’t dig deeper. Anyway, I’ve already chosen Anytype for my private notes, so there’s no need to research it just so an AI can learn my sensitive information.

Background: A Damn Error Appeared in My Side Project

But coincidences happen. In my previous WSJ side project script, I discovered a serious summarization error.

In the report from 2025-10-25, the reality was that Netflix had dropped by nearly 10% in the past week, but the summary mistakenly reported it as a 10% increase.

This is not an error I can just ignore. It would seriously mislead my newsletter readers (even if they are just friends and family 😂).

Clearly, prompt design and prompt engineering alone can no longer solve the problem I’m facing.

So I started looking for possible optimization directions, and they all pointed to one conclusion.

Slim down the source data

Our current approach is to upload a 10-18 MB PDF file directly to the Gemini API. Gemini does provide a Files API to support oversized files, and the 2.5 Pro model advertises a context window of roughly one million tokens (1,048,576), but uploading the entire PDF comes out to over 4 million tokens. This clearly exceeds the model’s supported context length.
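To make the mismatch concrete, you can count tokens before ever asking for a summary. A minimal sketch, assuming the google-generativeai SDK (the API key, file name, and model name are placeholders; adjust to whichever SDK version you use):

```python
# pip install google-generativeai
# Hedged sketch: count the PDF's tokens before requesting a summary.
# The API key, file name, and model name are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the oversized PDF via the Files API.
pdf = genai.upload_file("wsj-2025-10-25.pdf")

model = genai.GenerativeModel("gemini-2.5-pro")
# Reports total_tokens for the uploaded file; in my case, over 4 million.
print(model.count_tokens([pdf]))
```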

That’s right! It’s the first step of the RAG pipeline: Loading. We need to successfully extract the text content from the PDF.

It’s time to try slimming down our source data.

Details: Complex PDF Layouts

In the process of selecting a tool to extract plain text from the PDF, I encountered many difficulties. Due to the complex layout of newspapers, a simple PDF Loader couldn’t meet our needs.

I also tried some OCR-based PDF Loaders. While the results were good, running them locally was too time-consuming. It was slow on my laptop, let alone on the small, low-spec home NAS I plan to run it on…

Later, I tried PyMuPDF, and the results seemed quite good, and it was fast enough. Its companion package pymupdf4llm provides a to_markdown function (sketched below). The resulting text was usable, but still contained a lot of noise.
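For reference, a minimal sketch of that extraction (the file names are placeholders):

```python
# pip install pymupdf4llm
import pymupdf4llm

# Extract the PDF as Markdown text; the file name is a placeholder.
md_text = pymupdf4llm.to_markdown("wsj-2025-10-25.pdf")

with open("wsj-2025-10-25.md", "w", encoding="utf-8") as f:
    f.write(md_text)
```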

I suddenly thought, “This is so troublesome. If only there were an API I could just call.”

Unfortunately, this kind of service seems to be offered only as a paid API by the major vendors.

In a flash, I remembered the developer’s best friend: Cloudflare. I recalled seeing some articles about RAG, and thought I might find the solution I needed there.

Great! If there’s a RAG solution, there must be a solution for the preceding Loading step! And sure enough, I found markdown-conversion.

Wonderful! Happiness is just around the corner, with our old friend Cloudflare.

I tried it out. It doesn’t just convert to markdown; it also prefixes the original file’s metadata to the markdown article.
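For the curious, here is a hedged sketch of calling the conversion over REST. The account ID, token, file name, and the exact endpoint and form-field names are assumptions drawn from Cloudflare’s Workers AI docs at the time of writing; verify them before relying on this:

```python
# pip install requests
# Hedged sketch: Cloudflare's Markdown Conversion over REST. The account
# ID, token, file name, and the exact endpoint/field names are assumptions;
# check the current Workers AI docs before relying on this.
import requests

ACCOUNT_ID = "your-account-id"
API_TOKEN = "your-api-token"
url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/tomarkdown"

with open("wsj-2025-10-25.pdf", "rb") as f:
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"files": ("wsj-2025-10-25.pdf", f, "application/pdf")},
    )

resp.raise_for_status()
# The result carries the converted markdown plus the file's metadata.
print(resp.json())
```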

I also compared the original PDF and the output markdown in NotebookLM, and the completeness seems very high.

And the best part is, it’s free (or rather, the free quota is more than enough for my use).

By the way, the slimmed-down markdown content still has about 110,000 to 150,000 tokens. That’s a terrifying amount of information…

Auto RAG Implementation: Just a Few Clicks to Finish

Since my original problem was solved, maybe it’s time to try something new: Auto RAG. Its interface is very simple, condensing the important steps of RAG into a web-based operation.

  • Loading: Converting various data sources to markdown relies on the aforementioned markdown-conversion; the results are then stored in a Cloudflare R2 bucket.
  • Splitting: This is not mentioned in its interface. It’s integrated under Embedding, where you define the chunk size and overlap percentage.
  • Embedding: Currently not adjustable; it uses the Workers AI-hosted model @cf/baai/bge-m3.
  • Storage: Stored in their vector database, Vectorize.
  • Retrieval: The parameters that can be set in the interface here are similarity threshold and max results.
  • Generation: You can choose other models for querying here. The current default is @cf/meta/llama-3.3-70b-instruct-fp8-fast.

There are also options like query rewriting and the system prompt that I won’t go into detail about; you can just go and play with them yourself.
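All of the above is point-and-click, but once an instance exists you can also query it programmatically. A hedged sketch of the REST call, where the account ID, token, instance name, and endpoint shape are all assumptions to verify against the current AutoRAG docs:

```python
# pip install requests
# Hedged sketch: querying an existing AutoRAG instance over REST. The
# instance name, credentials, and endpoint shape are assumptions to
# verify against the current AutoRAG docs.
import requests

ACCOUNT_ID = "your-account-id"
API_TOKEN = "your-api-token"
RAG_NAME = "my-wsj-rag"  # hypothetical AutoRAG instance name
url = (
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}"
    f"/autorag/rags/{RAG_NAME}/ai-search"
)

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"query": "What happened with Netflix in the second half of October?"},
)
resp.raise_for_status()
# Returns the generated answer along with the retrieved chunks.
print(resp.json())
```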

Conclusion: A Great Try and Experience

I think the auto RAG service is a very innovative approach. Packaging the entire RAG process for easy use by customers is a smart product direction.

In the past, building a RAG system from scratch, from data cleaning all the way to validating the results with end users, involved huge and lengthy upfront setup costs. That left many business owners stuck at the “let’s think about it” stage, never actually committing to building the whole system.

Cloudflare helps you handle all the preliminary parts, and even provides an API for the data conversion tool (which I think addresses a genuinely difficult pain point in itself).

The so-called “second brain” can be achieved with pretty good results just by clicking around. For their effort and promotion in this area, Cloudflare really deserves a big thumbs up! I’ll buy their stock to show my support!

That said, in terms of practical application, I don’t think it’s ready for production-level use yet.

My testing method was to parse the Wall Street Journal content from the last two months and ask it a specific question: “Tell me what happened with Netflix in the second half of October. How did the stock price change?” The results, however, still didn’t seem that great.

Here is its answer:

According to the Wall Street Journal (October 26), Netflix reported higher quarterly profit and sales, driven by membership growth, price hikes, and increased advertising revenue. However, the company’s quarterly earnings came in below expectations due to an expense related to a dispute with Brazilian tax authorities. As a result, Netflix shares sank 10% on Wednesday (source date: October 26, 2025).

Of course, it could also be that I haven’t tuned the model and parameters well enough, or that there was information loss during the initial markdown conversion that I didn’t check for.

However, to diagnose the problem, it helps to understand the core value of RAG: its magic lies in bypassing the LLM’s context window limitations through “splitting” and “retrieval.” Even if my source documents are composed of several 150,000-token files, the LLM only sees the few document chunks most relevant to the query when generating an answer.

The amount of information it processes is perhaps only a few thousand tokens, ensuring precision and efficiency.
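A back-of-envelope calculation makes this clear (the chunk size and max-results values below are hypothetical, just to illustrate):

```python
# Rough token budget the LLM actually sees with basic RAG.
# Both values are made-up examples of the knobs mentioned above.
chunk_size = 512     # tokens per chunk (set under Embedding)
max_results = 10     # the retrieval "max results" parameter

context_tokens = chunk_size * max_results
print(context_tokens)  # 5120 tokens, versus ~150,000 in one source file
```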

While RAG excels at answering specific, fact-based questions like the one above, my subpar results might indicate issues in the retrieval step (i.e., not finding the right chunks) or the need for more advanced RAG strategies for complex documents.

Broad summarization tasks are generally a poor fit for basic RAG, as they require synthesizing information from many disparate chunks, which is a known challenge.

But at least it was a good tinkering session!

Biggest Takeaway: markdown-conversion

I think the most important takeaway for me from this experiment is markdown-conversion. This discovery gives me more things to play with.

I can directly create a Worker that exposes it as a web API for my other projects. The transparency isn’t as good as writing a Python conversion script myself, but since the source documents aren’t sensitive information for me, there’s nothing to worry about.