Search Improvement

I know you’ve previously discussed the inability to output custom fields in the Rust-generated elasticlunr.js JSON index. I accept that custom search fields would be mission creep and that Zola needs to stick to what it does best. But... :slight_smile: would it be possible to specify a trim size for the page data?

For example, at its simplest: a config.toml variable such as search_truncate = 200, meaning only the first 200 characters of a page would be used to build the stem words and be held in the index JSON data; the rest would simply be skipped. If the page is shorter than 200 characters, you just roll with what you’ve got.

You change none of the current implementation other than to look up a configuration variable which determines whether the page data is trimmed.

I think one of Zola’s USPs is the built-in search. Implementing search for a static website can be a major pain or, worse still, cost money.

Implementing this basic limitation does a few things:

  • It makes search possible for websites with a small number of pages, even when the pages contain a lot of character data/text.
  • It gives people who have hundreds of pages the flexibility to reduce the size of the search_index.en.js file and keep it usable before having to look for alternative search engine options.
  • It makes it possible to set search_truncate = 0, which would effectively make the search title-only.
  • In theory it doesn’t break any of the current implementation. It all works exactly the same way, except that the data the search index is generated from is trimmed first.
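To make the proposal concrete, here is a minimal sketch of the trimming rule described above. It is written in JavaScript purely for illustration (Zola would do this in Rust while building the index), and both the function name truncateForIndex and the search_truncate option are hypothetical, taken from the suggestion in this post rather than from any existing Zola API.

```javascript
// Hypothetical sketch of the proposed `search_truncate` behaviour.
// `content` is the plain text of a page; `searchTruncate` is the proposed
// config value (a character count), or undefined when the option is not set.
function truncateForIndex(content, searchTruncate) {
  // Option not set: index the whole page, exactly as today.
  if (searchTruncate === undefined || searchTruncate === null) {
    return content;
  }
  // Array.from splits on code points, so surrogate pairs (e.g. emoji)
  // are never cut in half.
  var chars = Array.from(content);
  // Page shorter than the limit: roll with what we've got.
  if (chars.length <= searchTruncate) {
    return content;
  }
  return chars.slice(0, searchTruncate).join("");
}

console.log(truncateForIndex("A long article body about English grammar", 14));
console.log(truncateForIndex("Short page", 200));
```

With a limit of 0 the body comes back empty, which is the title-only case from the list above.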

At this point you’re thinking, “OMG, these guys just don’t get it, it’s impossible” :slight_smile: or, hopefully, “this one is doable” :slight_smile:

I look forward to your views.

I think there are many ways to improve search, from limiting the amount of text to selecting only title/description or some random field in [extra]
I need to create a GH issue for that this weekend


Great, thanks. We’ve currently got a 5 MB JSON file (around 250,000 words) which, when gzipped, is around 500 KB over the network. This is fine on a desktop, but interestingly on mobile the browser’s JavaScript engine takes ages (5+ seconds on older phones) to decompress and scan/load the database, so it’s quite slow. Having options to tweak the size and how much is indexed would be really valuable.

I’ve created https://github.com/getzola/zola/issues/961 for that. We can continue discussing in this discourse thread though, the issue is there so I don’t forget it

I use Zola for a local knowledge base, and the only solution I’ve found is the one we discussed there: building my own version of the zola binary that creates the search index from just title and description. It works, but with some limitations. For example, I have to run a separate Debian virtual server with my self-built zola and nginx to serve the HTML files. It would be super useful to have a setting for the search fields in config.toml.

I’ve noticed the discussion on Zola’s GitHub has started to talk about a completely new search engine which, when I took a look at it, does not even provide a standalone JavaScript library for static websites. I’m not sure if I’m reading this the wrong way, but I’d like to make sure that Zola search evolves, rather than being replaced.

I (and I am sure lots of other people) have spent quite a bit of time and effort getting the current search functionality to work. It works fine; in fact, it’s an elegant solution. The only downside is the lack of any ability to optimise the JSON index file size to suit different use cases (like mobile).

To be clear, I am happy with the way search is implemented, and improvements would be welcome. Please don’t move away from a static-website-compatible implementation. Please consider carefully the impact of breaking changes. I’m sure you will :slight_smile:

No, the goal for any search engine in Zola would be to have something that works without a server, so Tantivy is out. Then it’s just a matter of how good the results are: if we can have the same UX for search as right now but with a better search engine, why not?

Anyone interested in making some changes to Zola regarding that?

I.e. select which fields to index, a max size for the context, etc.?


I’d be happy if it were possible to map fields in the config.

I know this is probably not a huge priority for everyone, but over on https://adeptenglish.com/ we now have an index file that’s 6 MB (600 KB over the wire).

On desktop this is no big deal. On mobile the 600 KB download is not great but OK; it’s the time needed to decompress the JavaScript back to its 6 MB that is killing older mobile phones. The Google Lighthouse benchmark results are hideous.

It’s getting to the point where we are going to switch it off.

In the short term, all it needs is the ability to truncate the amount of text taken from page.content, so we can index the first 300-500 words and ignore the other 2000+ words in the article.
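A word-based cut like the one described here could look roughly as follows. This is only an illustrative JavaScript sketch; firstWords is a hypothetical helper name and Zola itself would need the equivalent in its Rust search component.

```javascript
// Hypothetical helper: keep only the first `maxWords` words of a page body.
function firstWords(content, maxWords) {
  // Split on any run of whitespace; filter(Boolean) drops the empty
  // string that split() yields for empty input.
  var words = content.trim().split(/\s+/).filter(Boolean);
  return words.slice(0, maxWords).join(" ");
}

console.log(firstWords("Listen to this podcast to improve your English", 4));
```

Articles shorter than the limit pass through with only their whitespace normalised.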

I will take a PR for those kind of things, which fields to use + trimming.


Hi Keats, what does

I will take a PR for those kind of things

mean? Are you asking me to raise a PR? I don’t know how to do that, but I’ll have a go :slight_smile: if you show me how.

I’ve been looking at the search_index.en.js JSON output and, interestingly enough, I was able to reduce the file size just by removing the full base URL and replacing it with a /.

For example

Using Notepad++ I just did a global replace of "https://adeptenglish.com/" with "/".

On a local search index, the original was 6,713,339 bytes (before gzip).

For example:

{"tf":1.0},"http://192.168.1.96:2015/language-courses/podcast-bundle-201-250/":{"tf":1.0},

Becomes:

{"tf":1.0},"/language-courses/podcast-bundle-201-250/":{"tf":1.0},

Now it’s 5,315,371 bytes

No functional change appears to have happened, i.e. it seems to work fine. That’s around a 20% saving for free. Your mileage will vary based on domain length. This is a huge saving and should be trivial to implement. What do you think?
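The same global replace can be scripted instead of done by hand. The sketch below is an assumption-laden illustration, not part of Zola: stripBaseUrl and percentSaved are hypothetical helper names, and the input is treated as plain text, just like the Notepad++ replace.

```javascript
// Hypothetical helper: replace absolute permalinks in the index text with
// root-relative paths, mirroring the manual search-and-replace above.
function stripBaseUrl(indexText, baseUrl) {
  // Normalise so the base always ends with a slash.
  var base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  // Only touch URLs that start a JSON string (right after a double quote).
  return indexText.split('"' + base).join('"/');
}

// Hypothetical helper: saving as a percentage of the original size.
function percentSaved(bytesBefore, bytesAfter) {
  return ((bytesBefore - bytesAfter) / bytesBefore) * 100;
}

var sample = '{"tf":1.0},"http://192.168.1.96:2015/language-courses/podcast-bundle-201-250/":{"tf":1.0},';
var rewritten = stripBaseUrl(sample, "http://192.168.1.96:2015/");
console.log(rewritten);
// File sizes quoted in this post, before and after the replace.
console.log(percentSaved(6713339, 5315371).toFixed(1) + "% smaller");
```

Every occurrence of the base URL saves its length minus one character, which is why longer domains benefit more.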

I’m still very keen to see an improvement in the search but I’ll take a quick win

It means if anyone wants to implement it, I’ll merge it :slight_smile: It’s on the TODO list for the next release but it will be faster if someone else implements it since the list is long (https://github.com/getzola/zola/projects/2) and I don’t have a lot of time right now.

For this change, we need to add some options in the config and then just change what we are passing to the search index in https://github.com/getzola/zola/blob/master/components/search/src/lib.rs depending on those config options.
Using the path instead of the permalink is a really good idea too.

Of the three of us, it looks like you’re the only one who will be able to make it :upside_down_face:

I’ve pushed a commit to address that: https://github.com/getzola/zola/pull/1038/commits/fb994c71d7d4ae5aa7875889077046db5bebde80

Can you build the next branch and see if it works for you?

I’ve tried building with zola from the next branch and the config below (sorry for the screenshots), and it couldn’t build the search index at all.

[image]

You still need build_search_index = true in your config.

Doh! I don’t know why I read that line as “you should set build_search_index = false” to enable the new features :exploding_head:

I successfully tried include_title = true and include_description = true. It works! Many thanks, I hope to see it in the master branch.
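For anyone wondering what such switches amount to, here is a rough JavaScript sketch of the idea of choosing which page fields go into each index entry. It is not Zola’s implementation (that lives in the Rust search component); buildIndexDocument is a hypothetical name, and of the three flags only include_title and include_description appear in this thread, so include_content is an assumption made for illustration.

```javascript
// Hypothetical sketch: build one index entry from a page, keeping only the
// fields switched on in `options`.
function buildIndexDocument(page, options) {
  var doc = { id: page.path };
  if (options.include_title) {
    doc.title = page.title || "";
  }
  if (options.include_description) {
    doc.description = page.description || "";
  }
  // Assumed flag, not mentioned in the thread.
  if (options.include_content) {
    doc.body = page.content || "";
  }
  return doc;
}

var entry = buildIndexDocument(
  { path: "/kb/vpn-setup/", title: "VPN setup", description: "How to connect", content: "Very long text..." },
  { include_title: true, include_description: true, include_content: false }
);
console.log(JSON.stringify(entry));
```

Leaving the body out is what keeps a title-and-description index small.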

Looking forward to the new release with the new search features. Thanks.

Meanwhile, and sorry if this is stating the obvious, what worked for me was to defer loading of /search_index.en.js and only initialise the search when/if a user starts typing in the search field.

So, assuming you use https://github.com/getzola/zola/blob/master/docs/static/search.js, change the last seven lines to look something like this:

if (document.readyState === "complete" ||
    (document.readyState !== "loading" && !document.documentElement.doScroll)
) {
  delayedInitSearch();
} else {
  document.addEventListener("DOMContentLoaded", delayedInitSearch);
}

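function delayedInitSearch() {
  // Inject a <script> tag for `url` and run `callback` once it has loaded.
  function loadScript(url, callback) {
    var body = document.body
    var script = document.createElement('script')
    script.type = 'text/javascript'
    script.src = url
    // onload alone is enough in any browser that supports {once: true};
    // also setting onreadystatechange could run the callback twice in old IE.
    script.onload = callback
    body.appendChild(script)
  }
  // Fetch the index only on the first keystroke in the search field.
  document.getElementById("search").addEventListener("input", function() {
    loadScript("/search_index.en.js", initSearch)
  }, {once: true})
}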

The large file will then load when you type the first character. A small hiccup will be noticeable, but I find that’s OK.

just a suggestion.

Can you also defer the script load itself? So it doesn’t block anything but still happens in the background.
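One way to read this suggestion is to start the download in the background once the page itself has finished loading, instead of waiting for the first keystroke. The sketch below assumes the initSearch function from Zola’s docs/static/search.js and the /search_index.en.js path used above; preloadSearchIndex is a hypothetical helper name, and the document object is passed in only so the helper can be exercised outside a browser.

```javascript
// Hypothetical helper: inject the index script without blocking rendering.
function preloadSearchIndex(doc, url, onReady) {
  var script = doc.createElement("script");
  script.src = url;
  // Scripts inserted from JavaScript never block HTML parsing; async also
  // lets this one run as soon as it arrives.
  script.async = true;
  script.onload = onReady;
  doc.body.appendChild(script);
  return script;
}

// In a browser: wait for the page's own resources first, then fetch the index.
if (typeof window !== "undefined") {
  window.addEventListener("load", function () {
    preloadSearchIndex(document, "/search_index.en.js", function () {
      initSearch(); // assumed to be defined by search.js, as in the snippet above
    });
  });
}
```

The trade-off is that every visitor downloads the index, even those who never search, so on slow mobile connections the load-on-first-keystroke approach may still be the kinder option.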