Blog 7

April 21, 2025

This week, I finally moved from preparation to execution. After identifying the correct structure of the USPTO Gazette files last week, I wrote a custom parser that extracts the key fields from each patent HTML file: the title, inventor names, classification codes, and, most importantly, the full description text. I ran this parser on Gazette issues going all the way back to 2020, which gives me a rich and diverse dataset.
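
As a rough illustration of the idea (not the actual parser from this project), here is a minimal sketch using only Python's standard-library `html.parser`. The class names `title`, `inventor`, `classification`, and `description` are hypothetical placeholders; the real Gazette markup may label these fields differently, so the lookup table would need to be adapted.

```python
from html.parser import HTMLParser

# Hypothetical class names used to tag each field; the real Gazette
# markup may differ, so treat this set as an assumption to adapt.
FIELD_CLASSES = {"title", "inventor", "classification", "description"}

# Tags that never have a closing tag and must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class PatentHTMLParser(HTMLParser):
    """Collects the text of every element whose class is in FIELD_CLASSES."""

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in FIELD_CLASSES}
        self._current = None  # field currently being captured, if any
        self._depth = 0       # nesting depth inside the captured element
        self._buffer = []     # text fragments seen so far for that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1  # nested tag inside a captured element
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in FIELD_CLASSES:
                self._current, self._depth, self._buffer = name, 1, []
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join fragments and collapse runs of whitespace.
            text = " ".join(" ".join(self._buffer).split())
            if text:
                self.fields[self._current].append(text)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)


def parse_patent(html: str) -> dict:
    """Return one record per patent HTML document."""
    parser = PatentHTMLParser()
    parser.feed(html)
    parser.close()
    f = parser.fields
    return {
        "title": f["title"][0] if f["title"] else "",
        "inventors": f["inventor"],
        "classifications": f["classification"],
        "description": " ".join(f["description"]),
    }
```

Calling `parse_patent(html_text)` on each file yields one dictionary per patent, which is a convenient shape for the embedding step that follows.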

Once the text was extracted, I embedded it using a SentenceTransformer model and stored the resulting vectors in a FAISS index. This step is at the heart of the Retrieval-Augmented Generation (RAG) pipeline: it maps the patent text into a vector space that can be searched with natural language queries. The FAISS index is now fully populated and ready to power context-aware retrieval.
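
The embed-and-index step might look roughly like the sketch below. Several details are assumptions, since the post does not specify them: the model name `all-MiniLM-L6-v2`, the flat inner-product index type, and the word-based `chunk_text` helper. The chunking is included because sentence-embedding models typically truncate long inputs, and full patent descriptions are long. The third-party imports sit inside the functions so the chunking helper can be used on its own.

```python
def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word windows (sizes are assumptions)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def build_index(chunks, model_name="all-MiniLM-L6-v2"):
    """Embed chunks and add them to an in-memory FAISS index."""
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # Unit-length vectors make inner product equal to cosine similarity.
    vectors = model.encode(
        chunks, normalize_embeddings=True, convert_to_numpy=True
    ).astype("float32")
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return model, index


def search(query, model, index, chunks, k=5):
    """Return the k chunks most similar to a natural language query."""
    q = model.encode(
        [query], normalize_embeddings=True, convert_to_numpy=True
    ).astype("float32")
    scores, ids = index.search(q, k)
    # FAISS pads with -1 when fewer than k vectors are stored.
    return [(chunks[i], float(s)) for s, i in zip(scores[0], ids[0]) if i != -1]
```

A query such as `search("battery cooling system for electric vehicles", model, index, chunks)` would then return the closest text chunks with their similarity scores, which is the context a RAG pipeline passes to the language model.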

One area I’m still working on is embedding the patent images. Each HTML file links to a corresponding .gif image, and I’m exploring the best way to turn these images into embeddings, most likely with a model like CLIP. This step is trickier because of image formatting and memory requirements, but I’m confident I can solve it next week.
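
One possible shape for this step, sketched under assumptions: the `clip-ViT-B-32` checkpoint loaded through the `sentence-transformers` library, Pillow for decoding, and a batch size of 32 are all illustrative choices, not decisions made in this project. Converting only the first GIF frame to RGB addresses the formatting issue, and processing the files in small batches keeps memory use bounded.

```python
def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_gif_as_rgb(path):
    """Open a .gif and return its first frame as an RGB image."""
    from PIL import Image

    with Image.open(path) as img:
        img.seek(0)                # first frame only
        return img.convert("RGB")  # CLIP expects three-channel input


def embed_images(paths, batch_size=32, model_name="clip-ViT-B-32"):
    """Embed images batch by batch so only one batch is decoded at a time."""
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    parts = []
    for batch in batched(list(paths), batch_size):
        images = [load_gif_as_rgb(p) for p in batch]
        parts.append(
            model.encode(images, normalize_embeddings=True, convert_to_numpy=True)
        )
    if not parts:
        return np.empty((0, 0), dtype="float32")
    return np.vstack(parts).astype("float32")
```

Because CLIP vectors live in a different space from the text model's vectors, they would likely need their own FAISS index rather than being added to the text index.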

Overall, this was a major leap forward. I now have a working, text-based RAG backend with over four years of real-world patent data.

View more of Arush J.'s posts.
