Disclaimer
Will it scale? That’s the question I’ll be looking to answer over the coming months and years as I continue adding content to this website. As I’ll explain in more detail below, the search data file is generated at build time, so I anticipate that both the build time and the size of the JSON file will grow as I add more content. Your mileage may vary, and being new to Astro, I’m always open to feedback.
Overview
The Astro Nano theme by Mark Horn that I forked didn’t have search by default, but his Astro Sphere theme did. I originally planned to copy the search page, make minor adjustments, and integrate it into this theme. That works, but I discovered during testing that only the title, URL, tags, and summary/description are searchable. Looking to see how others had implemented search, I stumbled upon the Astro Search Tutorial from Coding in Public. I really liked the approach, but it had similar limitations. I wanted the full content of posts to be searchable, along with the ability to include forthcoming project pages in the search results.
Using the search logic from Astro Sphere as a foundation and Coding in Public’s approach, I made a few changes, which I’ll explain in detail below. The key modifications were adding full content to the fuzzy search and implementing a scoring system to ensure more accurate results when searching with multiple keywords.
Search Page
The search.astro file is straightforward for anyone familiar with Astro. It’s a simple scaffolding file for the search page.
- Lines 2 through 4: Normal layout, component, and constant imports.
- Line 5: Importing the main function from SearchLogic.tsx.
- Line 15: Rendering the component only on the client at page load, so I can use the SolidJS framework.
(Read more about the client:only directive in the Astro docs.)
Search Data File (search.json)
The search.json.ts file is where I organize and preprocess the search data, which the search logic will ingest later. It generates a search.json file containing all distinct words from blog posts and project pages, along with a unique ID, content type (blog post or project page), URL slug, title, and description.
I use getCollection to get all my blog posts and project pages. The content comes back in markdown format, so to get just the words I use remark with the strip-markdown plugin to remove the markdown.
The getBlogAndProjectContent function gathers all blog posts and project pages that aren’t drafts (line 8). It sorts the content by date, placing the newest content first (line 16). Once organized, it assigns each item a unique ID (line 18) by incrementing a counter initialized earlier (line 6). Lastly, I concatenate the body, description, and title into a single searchText property (lines 23 and 24).
markDownCleaner does exactly what the name suggests: it removes markdown (with the strip-markdown remark plugin), then removes extra spacing, punctuation, and special characters with regex. After running a string through it, I’m left with only alphanumeric characters and single spaces.
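Here’s a minimal sketch of just the regex clean-up step, not the exact code from this site. It assumes the markdown has already been stripped by remark, and it assumes punctuation is replaced with a space rather than deleted outright; the name cleanText is hypothetical.

```typescript
// Sketch of the regex clean-up only. The markdown stripping itself
// (remark + strip-markdown) is assumed to have already run on the input.
function cleanText(text: string): string {
  return text
    .replace(/[^a-zA-Z0-9\s]/g, " ") // punctuation and special characters become spaces (assumption)
    .replace(/\s+/g, " ") // collapse any run of whitespace into a single space
    .trim(); // drop leading and trailing spaces
}
```

Replacing with a space instead of an empty string keeps hyphenated terms such as “full-text” as two separate words rather than fusing them together.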
wordProcessor consolidates the words into a unique list and only considers words with a length of 3 or more. I guess that “contains duplicate” LeetCode problem is useful for something…
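The idea can be sketched roughly like this. The name uniqueWords and the lowercasing step are my assumptions for illustration; the input is expected to be already-cleaned text with single spaces between words.

```typescript
// Sketch: reduce cleaned text to its distinct words of at least minLength characters.
function uniqueWords(cleaned: string, minLength = 3): string[] {
  const words = cleaned
    .toLowerCase() // assumption: matching is case-insensitive
    .split(" ")
    .filter((word) => word.length >= minLength);
  // A Set drops duplicates while keeping first-seen order.
  return Array.from(new Set(words));
}
```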
mapToJson is a little helper function that converts a map whose values are sets into a JSON object.
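A helper like this is needed because JSON.stringify turns both Map and Set into empty objects. A minimal sketch, assuming the keys are words and the values are sets of numeric IDs:

```typescript
// Sketch: convert Map<word, Set<id>> into a plain object of arrays,
// which JSON.stringify can serialize directly.
function mapToJson(map: Map<string, Set<number>>): Record<string, number[]> {
  const result: Record<string, number[]> = {};
  map.forEach((ids, word) => {
    result[word] = Array.from(ids);
  });
  return result;
}
```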
Before I show you how I put it all together, I want to show you an example of what the search.json file looks like, to help you visualize the result I’m trying to achieve. wordMap has all the unique words as keys, and each value is the list of unique IDs of the content that word maps to. This is what I’ll be watching closely as the build time and file size grow. During development the file isn’t compressed, so its size is ~18.8 KB; when it’s served from Netlify it’s compressed and comes to ~6.1 KB.
getSearchJson is where everything comes together. First, I call the function to retrieve all blog and project content. Then, I create a word-mapping data structure where each word is linked to a unique set of posts using their post IDs. After that, I iterate over each post, clean the searchText, and map each word to its corresponding post. Finally, I return the word mapping along with the blog and project content, excluding the search text.
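To make the shape of the output concrete, here’s a rough sketch of the word-mapping step. The field names (type, slug) and the function name buildSearchData are assumptions for illustration, and searchText is assumed to be already cleaned.

```typescript
// Hypothetical shapes; the field names are assumptions based on the description above.
interface ContentItem {
  id: number;
  type: "blog" | "project";
  slug: string;
  title: string;
  description: string;
}

interface SearchableItem extends ContentItem {
  searchText: string; // assumed to be cleaned text: words separated by spaces
}

function buildSearchData(items: SearchableItem[]) {
  const wordMap = new Map<string, Set<number>>();

  for (const item of items) {
    const words = item.searchText
      .toLowerCase()
      .split(/\s+/)
      .filter((word) => word.length >= 3);
    for (const word of words) {
      // Link the word to this item's ID; the Set ignores repeats.
      const ids = wordMap.get(word) ?? new Set<number>();
      ids.add(item.id);
      wordMap.set(word, ids);
    }
  }

  // Convert to plain JSON-friendly values and drop searchText from the output.
  const plainWordMap: Record<string, number[]> = {};
  wordMap.forEach((ids, word) => {
    plainWordMap[word] = Array.from(ids);
  });
  const content: ContentItem[] = items.map(({ searchText, ...rest }) => rest);

  return { wordMap: plainWordMap, content };
}
```

Dropping searchText from content matters for file size: the full text of every post would otherwise ship to the browser twice, once as prose and once as the word map.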
Because of the way I’ve defined it, Astro treats search.json.ts as a static endpoint. Simply calling my getSearchJson function and returning the data as the response is all it takes, and just like that the search.json file is created.
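For anyone who hasn’t written an endpoint before, here’s a minimal sketch of what that can look like. It assumes a recent Astro version where endpoints export an uppercase GET function, and the getSearchJson below is only a placeholder standing in for the real one.

```typescript
// Sketch of src/pages/search.json.ts (file location assumed).

// Placeholder: the real function builds wordMap and content from the collections.
async function getSearchJson() {
  return { wordMap: {} as Record<string, number[]>, content: [] as unknown[] };
}

// For a static site, Astro runs this at build time and writes the
// response body out as the search.json file.
export async function GET(): Promise<Response> {
  const data = await getSearchJson();
  return new Response(JSON.stringify(data), {
    headers: { "Content-Type": "application/json" },
  });
}
```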
Search Logic
SearchLogic has all the logic for fetching search.json, performing the fuzzy search, using scoring to figure out which results best match the search, reading and writing search parameters to the URL, and rendering the results.
I’m using Fuse for fuzzy search, DOMPurify to sanitize search input and prevent cross-site scripting (XSS) attacks, and SolidJS to handle client-side state changes and the actions tied to them.
First, I define the signals (with createSignal) and the interfaces, and set default values.
searchQuery and searchResults are, as the names suggest, the sanitized user query and where the results are stored.
ContentItem, WordMap, and SearchData are interfaces for processing the search.json file.
FuzzyData is an interface used to map the fuzzy word to which post ID it matches.
FUSE_SEARCH is the Fuse instance.
FUZZY_SEARCH_DATA is an array of FuzzyData.
SEARCH_DATA is where the search.json is stored after fetching.
options are the Fuse options.
getMapValue is a helper function that’s basically a get with default, used with the scoring logic.
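A “get with default” helper can be as small as this. It’s a sketch of the idea, and the generic signature is my assumption:

```typescript
// Sketch: return the stored value, or defaultValue when the key is missing.
function getMapValue<K, V>(map: Map<K, V>, key: K, defaultValue: V): V {
  const value = map.get(key);
  return value === undefined ? defaultValue : value;
}
```

In scoring code it lets you bump a counter without first checking whether the key exists, for example `idCount.set(id, getMapValue(idCount, id, 0) + 1)`.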
searchContent has a lot going on, so I’ll take it line by line.
Lines 50 & 51 - To ensure Fuse is only created once, check whether FUSE_SEARCH is null, its default value. If it is, create the Fuse instance.
Line 53 - Assuming the search query has spaces, split up the words so I can evaluate them individually.
Line 54 - idCount tracks the number of times each post ID matches the search query.
Lines 56 to 70 - For each word in the search query, attempt to directly find matching posts. If there’s a direct match, increase the idCount for those posts by 1. If no direct match is found, use Fuse to perform a fuzzy search, limiting the results to two words (testing showed two words are usually sufficient). For the fuzzy matches, since they are likely relevant, increase their idCount by 1 and factor in the Fuse fuzziness score, which helps ensure the most relevant posts appear higher in the search results.
Line 72 - Sort the post IDs by score.
Line 74 - Match the IDs to their corresponding ContentItem and return the result.
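The scoring walk-through above can be sketched without Fuse by passing the fuzzy lookup in as a function. This is a simplified illustration, not the code from this site: fuzzyLookup is a hypothetical stand-in for the Fuse search, and the way the fuzziness score is factored in (adding 1 minus the score, so closer fuzzy matches count for more) is my assumption. Fuse scores run from 0 for a perfect match toward 1 for a poor one.

```typescript
type WordMap = Record<string, number[]>;

// Stand-in for a Fuse result: the matched word plus a Fuse-style score
// (0 is a perfect match, values near 1 are poor matches).
interface FuzzyMatch {
  word: string;
  score: number;
}

// Hypothetical stand-in for the Fuse search, limited to `limit` results.
type FuzzyLookup = (word: string, limit: number) => FuzzyMatch[];

function idsFor(wordMap: WordMap, word: string): number[] {
  // Own-property check so words like "constructor" don't hit Object.prototype.
  return Object.prototype.hasOwnProperty.call(wordMap, word) ? wordMap[word] : [];
}

function rankPostIds(query: string, wordMap: WordMap, fuzzyLookup: FuzzyLookup): number[] {
  const idCount = new Map<number, number>();
  const addScore = (id: number, amount: number): void => {
    idCount.set(id, (idCount.get(id) ?? 0) + amount);
  };

  const words = query.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  for (const word of words) {
    const directIds = idsFor(wordMap, word);
    if (directIds.length > 0) {
      // Direct match: each post containing the word gains a full point.
      directIds.forEach((id) => addScore(id, 1));
      continue;
    }
    // No direct match: fall back to at most two fuzzy candidates.
    fuzzyLookup(word, 2).forEach((match) => {
      // Assumption: weight fuzzy hits by closeness (1 - score).
      idsFor(wordMap, match.word).forEach((id) => addScore(id, 1 - match.score));
    });
  }

  // Highest score first, then keep only the IDs.
  return Array.from(idCount)
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}
```

With multiple keywords, a post that matches every word accumulates more points than one that matches only a single word, which is what pushes it to the top.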
fetchSearchResults is basically a wrapper for searchContent that ensures search.json has been fetched successfully before moving forward with the actual search. It’s designed to fulfill the promise no matter what happens: if the fetch fails, if the fuzzy search fails, and so on, it returns an empty array.
Starting at Line 81, check if SEARCH_DATA.content is empty. If it’s not, that means SEARCH_DATA has already been fetched and populated. If it is empty, fetch the search.json file, verify the response is okay, then populate SEARCH_DATA with the JSON data and FUZZY_SEARCH_DATA with the wordMap.
After all those checks are done (Line 94), only perform the search if the search query has more than 2 characters.
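Here’s a rough sketch of that “never reject” wrapper pattern. To keep it self-contained, the fetch and search functions are passed in as parameters, which is my simplification; the "/search.json" URL and the lowercase searchData variable are also assumptions.

```typescript
interface SearchData {
  wordMap: Record<string, number[]>;
  content: { id: number; title: string }[];
}

// Minimal shape of the fetch response this sketch relies on.
interface JsonResponse {
  ok: boolean;
  json(): Promise<unknown>;
}

const MIN_QUERY_LENGTH = 3; // "more than 2 characters"
let searchData: SearchData = { wordMap: {}, content: [] };

async function fetchSearchResults(
  query: string,
  fetcher: (url: string) => Promise<JsonResponse>,
  search: (query: string, data: SearchData) => number[],
): Promise<number[]> {
  try {
    // Only fetch when the cached data is still empty.
    if (searchData.content.length === 0) {
      const response = await fetcher("/search.json"); // URL is an assumption
      if (!response.ok) return [];
      searchData = (await response.json()) as SearchData;
    }
    return query.length >= MIN_QUERY_LENGTH ? search(query, searchData) : [];
  } catch {
    // Any failure (network, parsing, search) becomes "no results".
    return [];
  }
}
```

In the browser, the fetcher argument would simply be the global fetch.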
At first glance, the code in createEffect might seem redundant compared to some of the other checks. I ran into an issue where search.json was fetched multiple times as a user typed. To limit these extra calls, I decided to use the milliseconds between the user typing the 2nd and 3rd characters of the search query to populate SEARCH_DATA and FUZZY_SEARCH_DATA before performing the fuzzy search.
(I might reevaluate this in the future.)
Before setting searchQuery, updateSearchResults makes sure the query text has been sanitized by DOMPurify.
onMount checks the URLSearchParams for a search query and, if one is found, passes it to updateSearchResults.
onInput is called from the input field as the user types and passes the input to updateSearchResults.
onResultClick is a user-friendly feature I decided to add so you can easily navigate back to the search and pick up right where you left off, without having to type the query again. After you click any of the search results, whatever is in the search query is added to your navigation history before you’re sent to the selected post.
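Both halves of the URL handling, reading the query on mount and writing it back before navigating away, can be sketched with URLSearchParams. The parameter name "q" and both function names are hypothetical.

```typescript
const QUERY_PARAM = "q"; // hypothetical parameter name

// Read the query from a location.search-style string; empty string if absent.
function readQuery(search: string): string {
  return new URLSearchParams(search).get(QUERY_PARAM) ?? "";
}

// Build the search page URL that carries the current query.
function searchUrlFor(pathname: string, query: string): string {
  const params = new URLSearchParams();
  params.set(QUERY_PARAM, query);
  return `${pathname}?${params.toString()}`;
}
```

In the browser, one way to use the second function is to call `history.pushState({}, "", searchUrlFor(location.pathname, query))` right before following the result link, so the back button returns to the search page with the query still in the URL.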
Finally, the return part of the code renders the search input field, dynamically updates the displayed search results as the user types, and provides helpful messages when no results are found or when the search query is too short.
Final Thoughts
Go ahead and see it in action.
I was initially concerned about integrating search into Astro, but I’m pleased with how seamlessly it worked with minimal effort, allowing for future-proofing as I update the website. I’ve identified some refactoring opportunities that I plan on implementing during the next redesign, and I’ll provide an update when I revisit the search feature. For now, this implementation suffices, and I’m happy it’s functioning well. I’ll continue to monitor the .json file though.