Relevant Search Guide - OpenSource Connections

Your Guide to Search Relevance Investments


In this guide, you will learn how to make investments in smarter search applications. Running a search-driven business doesn't mean sitting back and letting the technologists take over. Instead, it requires careful business-level course adjustment. Think of this as an investor's guide. Smarter search isn't just a technical problem: anyone should be able to evaluate the business merits of technical solutions against key metrics, whether sales, insights, or other positive outcomes delivered through search.

Central to this document is the idea of relevance. What do we mean by relevance? Relevance is when your application accurately and intuitively answers your users' questions, in alignment with your business strategy. Relevance appears most prominently in search, directly answering the specific queries your users issue (as shown in the image below). At an e-commerce site, for example, the search engine is the salesperson. You want your salesperson to answer questions with expertise, and you want to measure your sales staff's performance against what's important to the business: sales, satisfied customers, and so on. In a similar fashion, an archival search is like a librarian: you want to evaluate how well a librarian-like search answers researchers' questions and connects them to insights.


While search is front and center in our discussion here, we hope you also think beyond search. Your application can refocus to be entirely relevance-driven: built with personalization, recommendations, and domain intelligence from the beginning. At OSC we believe strongly that the future of application development lies in relevance-driven applications: designed to understand users and their needs from the ground up.

O'REILLY MEDIA'S LIBRARY (BUILT BY OPENSOURCE CONNECTIONS) RETURNS RELEVANT SEARCH RESULTS ALONGSIDE RELEVANT, TARGETED ADS HTTP://LIBRARY.OREILLY.COM

With that in mind, let's lay out the questions you need to answer when investing in better search and relevance:
• What value can the business gain from smarter search?
• How do you make investment decisions in relevance?
• When do you stop making investments?
• How do you calculate a return on investment (ROI) for relevance work?


• How do you keep the business engaged in search, rather than leaving it solely to data scientists and developers?

What we'll walk through in this document is what we refer to as the virtuous cycle of search relevance improvements, shown in the figure above. This document also serves as the relevance consulting methodology of OpenSource Connections. OpenSource Connections drives business value through expert search and relevance solutions.

What's Your Bottom Line?

Think of relevance as an investment. You can drive better business outcomes through better search and discovery. Linking users to content they need,


through search or other means, directly delivers financial and business benefits. Sure, it takes time and effort to improve search and discovery, but obsessing over costs misses the point. It's the return on that cost, the ROI, that you need to calculate, measure, and monitor.

The first and most important step in relevance work is to establish a bottom line. Your bottom line is a metric that measures search's business value. For e-commerce, your bottom line might be revenue per search. For a medical diagnosis search app, it might measure whether doctors can diagnose and help patients efficiently on the first search query. Your goal with better search is to drive the bottom line. With it you measure the effectiveness of relevance work. You want metrics that can drive headlines like:
• We invested 100 hours in search relevance work and improved doctors' efficiency in diagnosing patients by 35%. Doctors were more satisfied with the new relevance changes than with the legacy solution.
• Sales improved by $3 million with $80,000 of relevance work, a clear ROI.

The statistics that measure success are your KPIs, key performance indicators: direct measures of the value brought into the business. It's easy to see how more than just relevance drives those KPI numbers. For example, a confusing checkout process might depress sales. But that's OK: A/B testing helps isolate and measure the impact of individual relevance changes.


If tweaking a relevance feature shows a 10% improvement despite a lengthy checkout process, it's worth it.

Points to consider
• How do you come up with a good bottom-line metric?
• How do you scientifically measure how relevance impacts the bottom line?
• What are the signs you've found the right bottom line for your business? What might indicate you haven't?
• The bottom line measures what changed after the work is done, so how do you avoid expensive investments that don't pan out?
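To make the e-commerce example concrete, here is a minimal sketch (in Python) of computing a revenue-per-search bottom line from a hypothetical search-event log. The event format and the attributed_revenue field are assumptions for illustration, not a prescribed schema.

```python
# A minimal sketch of a "revenue per search" bottom-line metric.
# Assumes each logged search event records the query issued and any revenue
# attributed to that search session (field names are hypothetical).

from typing import Iterable, Mapping


def revenue_per_search(search_events: Iterable[Mapping]) -> float:
    """Total attributed revenue divided by the number of searches."""
    total_revenue = 0.0
    total_searches = 0
    for event in search_events:
        total_searches += 1
        total_revenue += event.get("attributed_revenue", 0.0)
    return total_revenue / total_searches if total_searches else 0.0


# Example: three searches, one of which led to a $42 purchase.
events = [
    {"query": "red wine", "attributed_revenue": 42.0},
    {"query": "party hat", "attributed_revenue": 0.0},
    {"query": "streamers"},
]
print(revenue_per_search(events))  # 14.0
```

Tracked over time and segmented by experiment group, a simple metric like this is what lets you compare relevance changes against the bottom line.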

Diagnosing Search: finding promising relevance investments

Ok, so the bottom line is what you’re driving at. It’s your goal. But how do you get there? The bottom line doesn’t tell you exactly what is wrong. To do that, you need to drill deeper. You need to find opportunity areas for investment. You need a way to measure where search works well, and where search isn’t doing its job. Establishing what works and where problems might exist can be tricky. This is where the art meets the science.


Relevance Feedback

There are several forms of raw relevance feedback that you may be able to analyze. This is the raw material that points at problems and successes. It includes:
• Direct user feedback ("my search for 'dress shoes' returns dresses!")
• Search analytics that collect what users search for and what they do after they search ("everyone pages through results on the 'myocardial infarction' query")
• The opinions of internal domain experts ("A doctor would look at 'myocardial infarction' here and ask for their money back")
• Usability studies ("as you can see, after subject 125 clicked this result, they began showing interest in non-search components of the page")
• Opinions on how the search application should be marketed ("we should coach users to search for problems in their application")
• The opinions of developers, managers, marketers, and others

But this is only the raw "stuff" pointing at what's broken. Whereas the bottom line measures something too high level to diagnose, these various opinions and click logs don't tell you why a particular search isn't achieving business goals. Consider a doctor's diagnosis search application. You have a list of 100 or so keyword searches that appear to be causing problems. You might keep these keyword searches in a spreadsheet, as a list of areas to investigate. Maybe it looks something like this:

Broken Searches:


myocardial infarction
 whooping cough
 common cold
 (And so on for 100 searches)

This list of every broken query is too granular. It's not actionable information. You can't manually patch 100 broken queries. Someone needs to analyze these areas and find the common patterns. What's happening? What information can be ascertained from this data? How can you put it together into a plan of action that will drive the bottom line?

Points to consider
Some additional points to ponder for your application when gathering raw relevance feedback:
• When is domain expertise preferable to search analytics?
• When do search analytics provide more value than domain expertise?
• How do you change how users use and approach your search application?
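To show how a candidate list like the one above might be assembled from search analytics rather than by hand, here is a minimal sketch that flags frequent queries with a poor click-through rate. The log format and the thresholds are assumptions for illustration; your analytics will differ.

```python
# A sketch of one way to surface candidate "broken" queries from search
# analytics: aggregate per-query click-through rate (CTR) and flag
# high-volume queries that users rarely click on. The log format is assumed.

from collections import defaultdict


def low_ctr_queries(search_log, min_searches=50, max_ctr=0.10):
    """Return (query, searches, ctr) for frequent queries with poor CTR."""
    searches = defaultdict(int)
    clicks = defaultdict(int)
    for record in search_log:          # e.g. {"query": ..., "clicked": bool}
        searches[record["query"]] += 1
        clicks[record["query"]] += 1 if record.get("clicked") else 0

    flagged = []
    for query, count in searches.items():
        if count < min_searches:
            continue
        ctr = clicks[query] / count
        if ctr <= max_ctr:
            flagged.append((query, count, ctr))
    # Most-searched problem queries first: the biggest opportunities.
    return sorted(flagged, key=lambda row: row[1], reverse=True)
```

A list like this is still raw material; it tells you where to look, not why those searches fail or what to do about them.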

Content Curators and Discovering Information Needs

How do you use the raw relevance feedback? We advocate for a specific role that can translate granular, per-query feedback into actionable information based on common patterns. We call this role the content curator. Content curators analyze user behavioral data. They understand how users are searching and why they're disappointed. They listen to internal domain experts. They understand the strengths and weaknesses of these different forms of raw feedback. They digest all of this into a set of work that they suspect can move the bottom line forward.


In our experience the most effective way to digest raw data into actionable information is through personas, use cases, and information needs. Personas describe a class of user. Use cases describe the tasks they want to accomplish in the application. Information needs specify the information the user needs to accomplish the task.

For example, let's consider a grocery search, and look at a search for "red wines." The content curator examines this query and notes others similar to it. The same users that buy "red wines" also search for other general party items: "potato chips," "funny hats," and "beer." A good content curator analyzes the raw relevance feedback and realizes that there's a certain kind of shopper on the site: shoppers that turn out to be relatively unsophisticated. Often they're hosting a party or event, and they just need to check the "red wine" or "beer" box. This subdivision is known as a persona. Let's write this down:

Persona: Party organizer

We’ve identified some use cases to support here: searches for broad categories “red wine” or “beer”
 searches for items in the “party” aisle

We can go a step further to identify the information users need to support their task (the users' information needs). What does the search application need to bring to users to help support their decision? In this case:


• With broad category searches ("beer" or "red wine"), users need to make a buying decision by comparing price and user reviews across multiple relevant items in the search results listing.
• With specific item searches ("Acme piñatas"), they expect the first search result to be exactly an "Acme piñata."

Through further sleuthing, you've been able to determine that organizing parties is actually a very common use case for your grocery site. You can isolate that use case and build relevance features that deliver products and results that "get" that persona. Further, you've diagnosed specific information needs your search application must support to help users make decisions. This connects with your business goals and bottom line: directing these party organizers to informed, satisfactory purchases drives up overall site sales.

Points to consider
Some additional points to ponder that might make discovering personas, use cases, and information needs different in your application:
• How does a content curator reconcile internal opinions and analytics to develop personas, use cases, and information needs?
• What happens when user behavior changes over time?
• How do you pivot your relevance solution to address changing needs?
• Is this just about search? How do personas, use cases, and information needs apply to optimizing recommendations?
• Should other relevance, recommendation, and personalization features be built into the site to support a specific persona?
• Can we fingerprint personas, personalizing the whole experience for them?
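If it helps to keep these artifacts organized, here is one illustrative (not prescribed) way to write the party-organizer persona, its use cases, and its information needs down as data, so each information need stays tied to its representative queries.

```python
# An illustrative structure for recording personas, use cases, and
# information needs. The shape and field names are suggestions only.

from dataclasses import dataclass, field
from typing import List


@dataclass
class InformationNeed:
    description: str                                  # what the user must decide or find
    representative_queries: List[str] = field(default_factory=list)


@dataclass
class UseCase:
    description: str
    information_needs: List[InformationNeed] = field(default_factory=list)


@dataclass
class Persona:
    name: str
    use_cases: List[UseCase] = field(default_factory=list)


party_organizer = Persona(
    name="Party organizer",
    use_cases=[
        UseCase(
            "Searches for broad categories",
            [InformationNeed(
                "Compare price and reviews across several relevant items",
                ["red wine", "beer"],
            )],
        ),
        UseCase(
            "Searches for items in the party aisle",
            [InformationNeed(
                "The first result should be the exact item requested",
                ["Acme piñatas"],
            )],
        ),
    ],
)
```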


Building Relevance Judgements & Tuning

The content curator further breaks the various information needs down into representative keyword queries. They're not trying to capture every possible keyword search. Instead, they're working on representative queries associated with information needs. For example, we know we need to support searches for items in the "party" category to satisfy a specific persona. We suspect that helping this persona, and the associated use cases, can drive the bottom line, making it a sound relevance investment.

Party aisle keyword searches:
"party hat"
"streamers"
"whoopie cushions"

For each representative keyword search, the developer, working with the content curator, describes validation criteria. This might come in the form of a golden set of search results, also known as relevance judgments. Relevance judgments, or a judgment list, are a per-keyword set of results rated by relevance. The rating can be as simple as a "thumbs up" or "thumbs down."
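As a minimal sketch of what a judgment list might look like, here is one possible representation using binary thumbs-up/thumbs-down grades. The document IDs are made up for illustration, and graded scales (0 to 4, say) work just as well.

```python
# A minimal judgment list sketch: per query, each candidate document gets a
# binary grade (1 = thumbs up, 0 = thumbs down). Document IDs are made up.

judgments = {
    "party hat": {"doc_101": 1, "doc_102": 1, "doc_305": 0},
    "streamers": {"doc_207": 1, "doc_099": 0},
    "whoopie cushions": {"doc_440": 1},
}
```

In practice this data usually lives in a spreadsheet or a small database table that the content curator owns and maintains.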


For some information needs, it's important to evaluate search's ability to return exactly one right answer; we discussed a search earlier for "Acme piñata." Other information needs require the user to evaluate multiple relevant options, so the top five results must be relevant to the search (we saw this when users shop for "beer"). Still other information needs are very archival or research focused: a legal researcher searching for precedents needs to evaluate all 100 search results as relevant.

Where do you get judgment list information? Often judgment lists can be derived directly from analytics data. This requires careful consultation with the business to translate the raw relevance feedback into meaningful relevance judgments. Other times domain experts can provide this information where analytics are sparse. For example, internal medical experts can provide insight by manually marking different results as relevant or not relevant. At each of these stages, the content curator owns the judgment data. They arbitrate between opinions and other raw forms of relevance feedback to come up with validation criteria. Take care to expand your validation criteria to cover several personas, use cases, and information needs. You don't want to capture just what's broken: measuring what works is vital.

Test Driven Relevance Tuning

Using judgment lists, you build real-time validation criteria for relevance work. These validation criteria are linked to, but not exactly the same as, the "bottom line." They are a correctness measure that can be evaluated easily and instantaneously. Armed with these keyword searches, you can begin test-driven relevance. As you modify the relevance solution, you measure the impact of every little tweak against your validation criteria.


Tune a boost too high, and you'll see how it negatively impacts specific use cases right away. For example, perhaps optimizing too far for the party searcher causes another persona, say the organic shoppers, to have poorer relevance. Instead of waiting to see this negative impact after shipping the solution, you see it immediately, preventing disastrous consequences.

Don't spend a lot of time building the perfect search solution. You need to get to a point where you feel you've made a small, measurable improvement that doesn't dramatically impact other personas and use cases. If all you did during this phase was optimize slightly for party organizers with no negative impact on other personas and use cases, then pause your work for the moment. Ideally this is at most a few short weeks of tuning; it could be even less!

Notice we do not prescribe methods in this document. (Our book, Relevant Search, published by Manning, covers the technical aspects of relevance tuning.) There's no technical silver bullet for search relevance. Some problems call for NLP and text analytics. Other problems might be solved with machine learning. Yet others may be solved rather simply with a few tweaks to baseline search engine ranking. Measurement matters more than how the problem is solved. If you're not measuring relevance work, you're lost.
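One way to make "measure every tweak" concrete is to score candidate rankings against the judgment list before committing a change. Below is a minimal sketch using precision@k; the queries, document IDs, rankings, and cutoff are assumptions for illustration, and other measures (NDCG, for example) could stand in.

```python
# A sketch of test-driven relevance tuning: score candidate rankings against
# the judgment list with precision@k before and after a tweak. The queries,
# document IDs, rankings, and cutoff below are made up for illustration.

def precision_at_k(ranked_doc_ids, graded_docs, k):
    """Fraction of the top-k results judged relevant (grade > 0)."""
    top_k = ranked_doc_ids[:k]
    if not top_k:
        return 0.0
    relevant = sum(1 for doc_id in top_k if graded_docs.get(doc_id, 0) > 0)
    return relevant / len(top_k)


judgments = {
    "party hat": {"doc_101": 1, "doc_102": 1, "doc_305": 0},
    "red wine": {"doc_900": 1, "doc_901": 1, "doc_777": 0},
}

baseline_rankings = {   # what the current relevance solution returns
    "party hat": ["doc_305", "doc_101", "doc_102"],
    "red wine": ["doc_777", "doc_900", "doc_901"],
}
tweaked_rankings = {    # what the solution returns after a candidate tweak
    "party hat": ["doc_101", "doc_102", "doc_305"],
    "red wine": ["doc_900", "doc_901", "doc_777"],
}

for label, rankings in [("baseline", baseline_rankings),
                        ("tweaked", tweaked_rankings)]:
    scores = [precision_at_k(rankings[query], graded, k=2)
              for query, graded in judgments.items()]
    print(label, sum(scores) / len(scores))   # baseline 0.5, tweaked 1.0
```

Running a check like this on every change, across queries from several personas, is what keeps a tweak for one use case from silently degrading another.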

Deploy and Measure

Once you're satisfied with a small improvement, deploy as soon as possible. It may seem counterintuitive, but the ability to deploy relevance work confidently and quickly is fundamental.


Relevance work that languishes for months without measuring how it impacts your bottom line breaks your ability to monitor the effectiveness of relevance investments. A six-month machine learning effort may turn out to have a negligible positive impact on the bottom line, yet cost $500K to build!

How do you measure how well changes impact the bottom line? We advocate getting into the habit of A/B testing relevance work. A/B testing horse-races A (the original relevance solution) against B (your updates). You randomly assign users to A or B and measure the bottom line for both groups scientifically. Less preferable is switching to B and throwing out A. This doesn't reflect a scientific measurement of whether B beats A: seasonal change and other external shifts in user behavior might cause a performance difference between then and now. Traffic may spike around Christmas, for example, or around Thanksgiving for a grocery site. You can't declare B better than A just because you rolled out the changes around Thanksgiving! Instead you need to see how A and B compare at supporting the same shoppers.
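To illustrate measuring both groups scientifically, here is a minimal sketch of comparing conversion rates between the A and B groups with a two-proportion z-test, using only the Python standard library. The visitor and conversion counts are made up, and this is just one reasonable choice of statistical test.

```python
# A sketch of comparing A/B conversion rates with a two-sided
# two-proportion z-test. All counts below are hypothetical.

import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# A = current relevance solution, B = candidate change.
z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
# A p-value below your chosen threshold suggests the difference is real
# rather than noise; otherwise, keep collecting data or keep A.
```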

Points to consider
Here are some points to consider when you evaluate your ability to deploy and test:
• What do you do if the bottom line is difficult or expensive to measure?
• "Deploying search changes takes 6 months to get through the change approval process. Help!"
• How do I get started with A/B testing?
• Would alternate approaches such as multi-armed bandit testing be better than A/B testing?


Retrospect

You just went through a cycle of relevance improvements. Time to take stock:
• How much did that cost you?
• What impact did it have on your bottom line?

With these two facts you can calculate your return on investment (ROI). You can review the entire process and reflect. The personas and use cases were hypotheses: you suspected improving search for these groups would drive the bottom line. Did it? You further suspected your judgments were correct. Were they? Take time with the team to reconnect the pieces of the process. Try to ferret out inaccuracies and false assumptions you had. Refine your understanding of how users actually behave.
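As a back-of-the-envelope illustration using the earlier headline figures ($3 million in added sales from $80,000 of relevance work):

```python
# A worked ROI example using the earlier (illustrative) headline figures.
cost = 80_000        # dollars spent on relevance work
gain = 3_000_000     # additional sales attributed to that work
roi = (gain - cost) / cost
print(f"ROI = {roi:.1f}x")  # ROI = 36.5x the investment
```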

Rinse and Repeat?

You're measuring the ROI and retrospecting on the accuracy of personas, use cases, and judgments. You're actually beginning to repeat the earlier steps of this process. Decide whether another round of work is warranted:
• Should you keep making investments?
• Should you digest your refined personas and use cases into judgments, and in turn more relevance development?
• Or is it time to stop tuning?
• Do you suspect the ROI isn't there?
• Is the cost too high, with previous rounds of relevance work not paying off?


• Has anything changed externally that warrants a different understanding of users?

This cycle is how you steer search relevance. By delivering small improvements quickly, you fly a fighter jet: agile, able to change course as business cycles demand. By delivering huge improvements once a year, you steer a jumbo jet. By not measuring, you fly blind. Hopefully you don't crash and burn!

Need help from the experts?

OpenSource Connections wrote the book on search relevance. If you need technical or business-level guidance for your search relevance, please don't hesitate to reach out. We guarantee to improve your bottom line through our consulting services.

http://opensourceconnections/services/relevancy
[email protected]
