Would you like to know the answer to a table tennis question? Chances are we’ve already answered it or something similar. To help you search through our 8,000+ questions, we’ve built our own search engine. This allows us to highlight video responses and return the results in our own format. Try it out now: search for an answer and let me know how relevant the results are. Also let me know whether the results come back quickly or slowly.
For the Tech Heads
So how do you build a custom search engine? Well, I’m glad you asked. The first step is to create an inverted index. This is like the index in the back of a book: it has a list of words, and alongside each word is the list of questions that word appears in. For example, if we had the following two sample question titles:
Question 1: Fast Footwork
Question 2: Fast Forehand Topspin
We would build the following index:
fast: questions 1, 2
footwork: question 1
forehand: question 2
topspin: question 2
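For anyone who wants to see this in code, here is a minimal Python sketch of building an inverted index from the two sample titles. It is an illustration only, not the site's actual code, and the function name build_inverted_index is made up for this example.

```python
from collections import defaultdict

def build_inverted_index(titles):
    """Map each lowercased word to the sorted list of question ids containing it."""
    index = defaultdict(set)
    for question_id, title in titles.items():
        for word in title.lower().split():
            index[word].add(question_id)
    # Convert the sets to sorted lists so the result is easy to read and compare.
    return {word: sorted(ids) for word, ids in index.items()}

titles = {1: "Fast Footwork", 2: "Fast Forehand Topspin"}
index = build_inverted_index(titles)

for word in sorted(index):
    print(word, index[word])
# fast [1, 2]
# footwork [1]
# forehand [2]
# topspin [2]
```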
Querying the Index and Ranking the Results
Now if someone searches for “forehand”, we can look up forehand in our inverted index and we instantly know that this word is in question 2. Similarly, we can see that fast is in questions 1 and 2. In reality, the word you are searching for is going to be found in a lot more than just 1 or 2 questions, so we need a way to bring back the most relevant results. This is where ranking comes into play. To rank the results, we use some basic information. First we count how many times the word appeared in each document. Doing this on its own would tend to favour longer documents, as more writing gives more chances for the word to occur. For this reason we try to normalise the frequency count across documents so that document length is not a factor.
Then we try to see if there are any words which are more helpful in our search. For example, if we search for “penhold service”, the word penhold is a rarer word, and hence any documents with this term should be given extra consideration. If you’re really interested in finding out more, read up on “term frequency – inverse document frequency” (TF-IDF).
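To make the idea concrete, here is a small Python sketch of one common textbook form of TF-IDF. It assumes the term frequency is normalised by dividing by document length and that the inverse document frequency is the log of (total documents / documents containing the term). The exact formula we use may differ, and the function name tf_idf and the sample documents are invented for illustration.

```python
import math

def tf_idf(term, doc_words, all_docs):
    """Score one term for one document (each document is a list of words)."""
    if not doc_words:
        return 0.0
    # Term frequency, divided by document length so long documents are not favoured.
    tf = doc_words.count(term) / len(doc_words)
    # Inverse document frequency: rarer terms get a larger weight.
    docs_with_term = sum(1 for doc in all_docs if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(all_docs) / docs_with_term)
    return tf * idf

docs = [
    ["penhold", "service", "tips"],
    ["service", "return"],
    ["service", "practice", "drills"],
    ["fast", "service"],
]

penhold_score = tf_idf("penhold", docs[0], docs)  # rare word, positive score
service_score = tf_idf("service", docs[0], docs)  # in every document, scores 0.0
```

With this plain form, a word that appears in every document gets a score of zero, which is why “penhold” ends up deciding the ranking for a “penhold service” search.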
Through experimentation, we found that the title of a question was the most important piece of information, followed by the question and answer, and lastly by any comments. Based on this, we assign slightly different scores depending on where the word occurs.
Also, because we think that answers with a video response are more helpful, we bump relevant questions up the results list a little so they are more prominent.
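Here is a rough Python sketch of how field weighting and a video bump could be combined. The weights, the boost value, and the choice to apply the boost as a multiplier are all placeholder assumptions for illustration; they are not the values or the method used on the site.

```python
# Hypothetical weights: title counts most, then question and answer, then comments.
FIELD_WEIGHTS = {"title": 1.5, "question_and_answer": 1.2, "comments": 1.0}
# Hypothetical multiplier applied when a question has a video response.
VIDEO_BOOST = 1.1

def weighted_score(field_scores, has_video):
    """Combine per-field scores into one score, then bump video responses."""
    total = sum(FIELD_WEIGHTS[field] * score for field, score in field_scores.items())
    return total * VIDEO_BOOST if has_video else total

scores = {"title": 1.0, "question_and_answer": 0.5, "comments": 0.0}
without_video = weighted_score(scores, has_video=False)
with_video = weighted_score(scores, has_video=True)
```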
To make the search a bit quicker, we remove extremely common words. For example, the word “the” would appear in nearly every single question, so it doesn’t help us determine which question is more relevant. We simply ignore these common words both when building our index and when searching it.
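A minimal sketch of this filtering step in Python might look like the following. The stop word list here is a tiny illustrative sample, not our real list, and the tokenize function is a made-up helper.

```python
# A tiny illustrative stop word list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}

def tokenize(text):
    """Lowercase the text, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

print(tokenize("The secret of the Fast Forehand Topspin!"))
# ['secret', 'fast', 'forehand', 'topspin']
```

The same function would be applied to both the questions being indexed and the search query, so that the two always agree on which words are ignored.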
Putting all of this together, we grab all the questions that contain the word you searched for (ignoring extremely common words), and then for each question we give it a score based on the frequency of that word in the question, how rare the word is over all questions, and whether this question has a video response. If you search for more than one word, we do the same thing for each word and add up the scores by question. Then we order the questions by score and return the 20 highest (and hopefully most relevant).
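The whole pipeline can be sketched in a few dozen lines of Python. This is a simplified illustration under several assumptions, not our production code: the data layout, the function names, the stop word list, the VIDEO_BOOST value, and the smoothed log(1 + N/df) rarity weight are all choices made for this example, and field weighting is left out to keep it short.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for"}  # illustrative
VIDEO_BOOST = 1.1  # hypothetical multiplier for questions with a video response

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_index(questions):
    """Return (postings, lengths): term -> {question_id: count}, and word counts."""
    postings = defaultdict(dict)
    lengths = {}
    for qid, question in questions.items():
        words = tokenize(question["text"])
        lengths[qid] = len(words)
        for term, count in Counter(words).items():
            postings[term][qid] = count
    return postings, lengths

def search(query, questions, postings, lengths, limit=20):
    """Score every question containing a query term and return the top ids."""
    scores = defaultdict(float)
    total = len(questions)
    for term in tokenize(query):
        matches = postings.get(term, {})
        if not matches:
            continue
        # Smoothed rarity weight (an assumption): never zero, larger for rare terms.
        idf = math.log(1 + total / len(matches))
        for qid, count in matches.items():
            scores[qid] += (count / lengths[qid]) * idf
    for qid in scores:
        if questions[qid]["has_video"]:
            scores[qid] *= VIDEO_BOOST
    ranked = sorted(scores, key=lambda qid: scores[qid], reverse=True)
    return ranked[:limit]

questions = {
    1: {"text": "Fast footwork drills", "has_video": False},
    2: {"text": "Fast forehand topspin", "has_video": True},
    3: {"text": "Penhold service grip", "has_video": False},
}
postings, lengths = build_index(questions)
print(search("fast forehand", questions, postings, lengths))  # [2, 1]
```

Question 2 comes first because it matches both words and has a video response. Question 1 still appears because it matches “fast”.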
Phrase queries: We’ve structured the inverted index so that the position of each word in a question is recorded. This will give us the ability to do phrase searches in the future. At the moment, if you search for “forehand topspin”, the results don’t need to contain the exact phrase with topspin following immediately after forehand. In fact, the results don’t even have to contain both words, although due to the ranking function they probably will.
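As a sketch of how recorded positions make phrase searches possible, here is a small Python example of a positional index. Phrase search is not implemented on the site yet, so this is purely an illustration of the technique, and the function names and sample documents are made up.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def phrase_search(phrase, index):
    """Return ids of documents containing the words of the phrase consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    results = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Each following word must sit exactly one position further along.
            if all(
                start + offset in index.get(word, {}).get(doc_id, [])
                for offset, word in enumerate(words[1:], start=1)
            ):
                results.append(doc_id)
                break
    return sorted(results)

docs = {
    1: "Fast Footwork",
    2: "Fast Forehand Topspin",
    3: "Topspin against a forehand push",
}
positional_index = build_positional_index(docs)
print(phrase_search("forehand topspin", positional_index))  # [2]
```

Document 3 contains both words but not next to each other in that order, so only document 2 matches the phrase.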
Search result pagination: At the moment, the top 20 results are displayed. Most search engines display 10 rows, and have previous and next links.
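Pagination over a ranked result list is straightforward to sketch in Python. The paginate function and the shape of the dictionary it returns are assumptions for this example, with a page size of 10 to match what most search engines show.

```python
def paginate(results, page, page_size=10):
    """Return one page of results plus flags for previous and next links."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * page_size
    end = start + page_size
    return {
        "items": results[start:end],
        "has_previous": page > 1,
        "has_next": end < len(results),
    }

ranked_ids = list(range(1, 26))  # pretend 25 questions matched
first_page = paginate(ranked_ids, page=1)
last_page = paginate(ranked_ids, page=3)
```

Here first_page holds results 1 to 10 with only a next link, and last_page holds results 21 to 25 with only a previous link.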
Snippets: When a result is displayed, you see the title of the question and a video icon if the question has a video response. It would be nice if the result showed a snippet of the relevant text from the question.
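One simple way to produce such a snippet is to find the first search word in the text and show a few words on either side. The Python sketch below takes that approach. The make_snippet function, its radius parameter, and the sample text are all invented for illustration.

```python
def make_snippet(text, query_terms, radius=5):
    """Return a short excerpt of text centred on the first matching query term."""
    words = text.split()
    terms = {term.lower() for term in query_terms}
    for i, word in enumerate(words):
        if word.strip(".,!?").lower() in terms:
            start = max(0, i - radius)
            end = min(len(words), i + radius + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No term found: fall back to the opening words of the text.
    return " ".join(words[: 2 * radius + 1])

text = "I have trouble keeping my forehand topspin on the table when the ball is low"
print(make_snippet(text, ["topspin"], radius=2))
# ... my forehand topspin on the ...
```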
We’ve currently indexed 1,014,167 words in 8,640 questions. Ball(s) is the most common word (excluding common terms), occurring 20,348 times. Variations on “serve” are also extremely common, occurring 12,830 times. Our longest question in words is “Your Combination”, which has 4,860 indexed words. Coming in second place is “Apologize for net or edge ball?” with 4,255 indexed words.