In my first stint as a machine learning (ML) product manager, a simple question inspired passionate debates across functions and leaders: How do we know if this product actually works? The product in question served both internal and external customers. The model enabled internal teams to identify the top problems faced by our customers so that they could prioritize the right set of experiences to solve those problems. With such a complex web of interdependencies among internal and external customers, choosing the right metrics to capture the product's impact was critical to steering it toward success.
Not tracking whether your product is working well is like landing a plane without any guidance from air traffic control. There is absolutely no way you can make informed decisions for your customers without knowing what is going well or badly. Moreover, if you do not actively define the metrics, your team will come up with their own fallback metrics. The risk of having multiple flavors of an "accuracy" or "quality" metric is that everyone develops their own version, leading to a scenario where you may not all be working toward the same outcome.
For example, when I reviewed my annual goal and its underlying metric with our engineering team, the immediate feedback was: "But that's a business metric; we already track precision and recall."
First, identify what you want to know about your AI product
Once you have taken on the task of defining metrics for your product, where do you start? In my experience, the complexity of operating an ML product with multiple customers carries over into defining metrics for the model as well. What do you use to measure whether the model is working well? Measuring the outcomes of internal teams prioritizing launches based on our models would not be fast enough; measuring whether the customer adopted the solutions recommended by our model risked drawing conclusions from a very broad adoption metric (what if the customer did not adopt the solution because they just wanted to reach a support agent?).
Fast-forward to the era of large language models (LLMs), where we no longer have just a single output from an ML model; we also have text responses, images and music as outputs. The dimensions of the product that require metrics now multiply rapidly: formats, customers, type… The list goes on.
Across all my products, when I try to come up with metrics, my first step is to distill what I want to know about the product's impact on customers into a few key questions. Identifying the right set of questions makes it easier to identify the right set of metrics. Here are a few examples:
- Did the customer get an output? → Coverage metric
- How long did it take the product to provide an output? → Latency metric
- Did the user like the output? → Metrics for customer feedback, customer adoption and retention
Once you have identified your key questions, the next step is to identify a set of sub-questions for the "input" and "output" signals. Output metrics are lagging indicators: they measure an event that has already occurred. Input metrics are leading indicators that can be used to identify trends or predict outcomes. See below for ways to add the right leading and lagging sub-questions to the questions above. Not every question needs leading/lagging indicators.
- Did the customer get an output? → Coverage
- How long did it take the product to provide an output? → Latency
- Did the user like the output? → Customer feedback, customer adoption and retention
- Did the user indicate that the output is right/wrong? (output)
- Was the output good/correct? (input)
The third and final step is to identify the method for collecting the metrics. Most metrics are collected at scale through new instrumentation via data engineering. However, in some cases (like question 3 above), especially for ML-based products, you have the option of manual or automated evaluations that assess the model's outputs. While it is always preferable to develop automated evaluations, starting with manual evaluations for "Was the output good/correct?" and creating a rubric with definitions of good, correct and incorrect will also help you lay the groundwork for a rigorous, tested automated evaluation process.
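As a minimal, illustrative sketch (in Python, with a made-up rubric and hand-labeled examples, not any specific tooling), a manual evaluation structured this way might look like the following; the explicit label definitions are what later make the process automatable.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical rubric: explicit definitions keep human graders (and later,
# automated judges) consistent about what "good" and "correct" mean.
RUBRIC = {
    "good": "Output fully answers the query and is safe and on-brand.",
    "correct": "Output is factually accurate but may be incomplete.",
    "incorrect": "Output is factually wrong or unrelated to the query.",
}

@dataclass
class EvalRecord:
    query: str
    model_output: str
    label: str  # one of the RUBRIC keys, assigned by a human grader for now

def quality_metrics(records: list[EvalRecord]) -> dict[str, float]:
    """Aggregate manual labels into the 'Was the output good/correct?' metric."""
    counts = Counter(r.label for r in records)
    total = len(records) or 1
    return {label: counts.get(label, 0) / total for label in RUBRIC}

# Example: a tiny hand-labeled sample (illustrative values only).
sample = [
    EvalRecord("reset password", "Go to Settings > Security > Reset.", "good"),
    EvalRecord("refund status", "Refunds take 5-7 business days.", "correct"),
    EvalRecord("cancel order", "Try restarting your router.", "incorrect"),
]

print(quality_metrics(sample))  # roughly one third in each bucket for this sample
```

Once graders agree on the rubric, the same definitions can seed an automated judge, with periodic manual audits to keep it honest.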
Example use cases: AI search, listing descriptions
The framework above can be applied to any ML-based product to identify the list of primary metrics for your product. Let's take search as an example.
Question | Metric | Nature of metric |
---|---|---|
Did the customer get an output? → Coverage | % of search sessions with search results shown to the customer | Output |
How long did it take the product to provide an output? → Latency | Time taken to display search results to the user | Output |
Did the user like the output? → Customer feedback, customer adoption and retention. Did the user indicate that the output is right/wrong? (Output) | % of search sessions with "thumbs up" feedback on the search results from the customer, or % of search sessions with a click from the customer | Output |
Was the output good/correct? (Input) | % of search results marked "good/correct" for each search term, per the quality rubric | Input |
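For illustration, here is a rough sketch of how the lagging (output) metrics in this table might be computed once search sessions are instrumented; the log fields (results_shown, latency_ms, thumbs_up, clicked) are hypothetical, not from any particular logging system.

```python
# Hypothetical search-session log records; field names are illustrative only.
sessions = [
    {"query": "red shoes", "results_shown": True, "latency_ms": 120, "thumbs_up": True, "clicked": True},
    {"query": "blue mug", "results_shown": True, "latency_ms": 340, "thumbs_up": None, "clicked": False},
    {"query": "xyzzy", "results_shown": False, "latency_ms": 95, "thumbs_up": None, "clicked": False},
]

total = len(sessions)

# Coverage: % of search sessions where results were shown to the customer.
coverage = sum(s["results_shown"] for s in sessions) / total

# Latency: time taken to display results (here, a simple average in ms).
avg_latency_ms = sum(s["latency_ms"] for s in sessions) / total

# Feedback / adoption: % of sessions with a thumbs-up, % with a click.
thumbs_up_rate = sum(bool(s["thumbs_up"]) for s in sessions) / total
click_rate = sum(s["clicked"] for s in sessions) / total

print(f"coverage={coverage:.0%}, avg latency={avg_latency_ms:.0f}ms, "
      f"thumbs-up={thumbs_up_rate:.0%}, clicks={click_rate:.0%}")
```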
How about a product that generates descriptions for a listing (whether a menu item on DoorDash or a product listing on Amazon)?
Question | Metric | Nature of metric |
---|---|---|
Did the customer get an output? → Coverage | % of listings with a generated description | Output |
How long did it take the product to provide an output? → Latency | Time taken to generate descriptions for the user | Output |
Did the user like the output? → Customer feedback, customer adoption and retention. Did the user indicate that the output is right/wrong? (Output) | % of listings with generated descriptions that required edits from the technical content team/seller/customer | Output |
Was the output good/correct? (Input) | % of listing descriptions marked "good/correct", per the quality rubric | Input |
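A similar sketch applies to listing descriptions; again, the record fields below are hypothetical and only meant to show how the edit-rate (lagging) and rubric-graded quality (leading) metrics roll up.

```python
# Hypothetical records for generated listing descriptions; field names are illustrative.
listings = [
    {"listing_id": "A1", "description_generated": True, "required_edits": False, "quality_label": "good"},
    {"listing_id": "B2", "description_generated": True, "required_edits": True, "quality_label": "incorrect"},
    {"listing_id": "C3", "description_generated": False, "required_edits": None, "quality_label": None},
]

generated = [rec for rec in listings if rec["description_generated"]]

# Coverage (output): % of listings with a generated description.
coverage = len(generated) / len(listings)

# Edit rate (output): % of generated descriptions that needed changes
# from the content team, seller or customer.
edit_rate = sum(rec["required_edits"] for rec in generated) / len(generated)

# Quality (input): % of generated descriptions graded "good" per the rubric.
good_rate = sum(rec["quality_label"] == "good" for rec in generated) / len(generated)

print(f"coverage={coverage:.0%}, edit rate={edit_rate:.0%}, good={good_rate:.0%}")
```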
The approach described above can be extended to many ML-based products. I hope this framework helps you define the right set of metrics for your ML model.
Sharanya Rao is a group product manager at Intuit.