Matchmaker Benchmarking Metrics
This is a work-in-progress page for determining the metrics to measure when benchmarking the four different matchmaking algorithms:
- By experts (adaptation at runtime possible?)
Quote from DoW, WP204 description: "A metric for comparison will be defined, reflecting the effectiveness, efficiency and user satisfaction of the resulting solution, as well as the development effort for the matching algorithm and its resources (i.e. rules, training data)."
Independent variables (to be controlled as much as possible):
- User (inherently non-constant)
- GW: Measurement/user tests under consideration of the following variables (by pre-questionnaire)
- User is aware of/misses some preference set attribute
- User is aware of/misses settings of some AT
- User is aware of/misses settings of some app (browser, settop box, kiosk, mobile phone)
- Note: This applies to the AT or applications that the users are commonly using, not the Cloud4All platform.
- Application and tasks (constant for all conditions)
- User interface of the application
- User interface for changing the settings
- Environment (keep constant)
- Type of matchmaker
- State of matchmaker (e.g. number of user preference sets available)
- User preference set known or not
We need to differentiate between users we have a preference set for and users we don't. We may not be able to test users with preference sets in the first iteration.
For usage of the metrics, see the table in the page Cloud4all Testing: Essential Registry Terms.
Success criteria for WP204 (from DoW)
- First iteration (M18): All algorithms score at least 50% of the expert-provided solution.
- Second iteration (M30): All algorithms score at least 65%, and one algorithm at least 75% of the expert-provided solution.
- Third iteration (M36): All algorithms score at least 75%, and one algorithm at least 80% of the expert-provided solution.
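The success criteria above can be checked mechanically once each matchmaker's score relative to the expert-provided solution is known. A minimal sketch (the thresholds come from the DoW; the function name and example scores are illustrative assumptions, not project code):

```python
# Per-iteration thresholds from the DoW success criteria:
# (minimum relative score for ALL algorithms, minimum for AT LEAST ONE algorithm)
CRITERIA = {
    "M18": (0.50, 0.50),
    "M30": (0.65, 0.75),
    "M36": (0.75, 0.80),
}

def meets_criteria(iteration, relative_scores):
    """relative_scores: each algorithm's score divided by the expert solution's score."""
    all_min, one_min = CRITERIA[iteration]
    return (all(s >= all_min for s in relative_scores)
            and any(s >= one_min for s in relative_scores))

# Example with placeholder scores for the four matchmaking algorithms:
print(meets_criteria("M30", [0.66, 0.70, 0.76, 0.68]))  # True
```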
These are the things we need to measure in order to compute the more complex metrics below.
Atomic metrics:
- + Completion per task (binary)
- Completion per task (percentage)
- + Time per task
- + Number of UI settings changed by user per task
- + Number of user errors per task
- + Post-task: User-perceived difficulty level per task (Likert 1-7)
- + Post-task questionnaire: Ask the user for good and bad experiences.
- Standard questionnaire per matchmaker (SUS)
- + Standard questionnaire per matchmaker (AttrakDiff)
- User comments (think aloud) in retrospective manner
- + Numbers of UI settings proposed by the matchmaker (accepted vs. rejected by user) - probably not implemented in the first iteration, only for second iteration
Other atomic metrics for measuring development:
- Development effort
- Training effort per user
- + Count user errors (superfluous mouse click, mistaken key press, ...) - "Expectation mismatch" => CL: Can be done by observation. The interesting point to this is the question what are the reasons for errors. Does an error occur because of the matchmaker result?
- User interaction errors reflect the quality of both the basic (unadapted) user interface and the adaptations performed by the matchmaker. However, since all conditions are based on the same basic user interface, differences in this metric reflect differences in the quality of the matchmakers.
- Can be automatically collected or minuted.
- Could we do think-aloud + time measurement? - Retrospective testing: the user comments on their actions in hindsight, while watching the recording.
- + Task completion rate
- Either binary (0, 1)
- Or percentage of completion (0%..100%)
- + Time for task completion (with changing UI settings)
- + Number of UI settings changed
- Robustness: Number of tasks completed / number of UI settings changed
- Ideal situation: Many tasks completed. Very few settings changed.
- Problem: Is this efficiency? Where do we differentiate between conditions that make the user slow vs. fast?
- Problem: infinite result if the user doesn't change UI settings.
- Robustness: Per task: Time * number UI settings changed
- Low number has better efficiency
- Problem: If the user changes nothing, the result is 0.
- Time for task completion (but w/o time for changing UI settings) - but define upper limit
- Problem: Manual time measurement
- Tasks completed / total time (including uncompleted tasks)
- Problem: UI settings changes not taken into account
- + Single question for difficulty of each task (Likert scale 1-7)
- Single question for enjoyability of each task (Likert scale 1-7)?
- System Usability Scale (SUS) - Brooke 1996
- + AttrakDiff (Hassenzahl)
- - Number of UI settings changed
- Problem: This measures the initial user satisfaction, not the satisfaction at the end of the task.
- But: Very noisy. A user may change nothing either because they are frustrated with the UI produced by the MM, or because they are so engaged with it that they like it as it is.
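The two robustness candidates discussed above, including guards for the degenerate cases noted (no settings changed), can be sketched as follows; the function names and data shapes are illustrative assumptions:

```python
def robustness_ratio(tasks_completed, settings_changed):
    """Tasks completed per UI setting changed; undefined if nothing was changed."""
    if settings_changed == 0:
        return None  # avoids the "infinite result" problem noted above
    return tasks_completed / settings_changed

def robustness_product(task_time_s, settings_changed):
    """Time * number of UI settings changed per task; lower means more efficient.
    Degenerates to 0 if the user changes nothing, as noted above."""
    return task_time_s * settings_changed

print(robustness_ratio(8, 4))        # 2.0
print(robustness_ratio(8, 0))        # None (infinite result avoided)
print(robustness_product(120.0, 3))  # 360.0
```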
Metrics involving number of UI settings changes:
- Count how often the user changes a setting (but a high number could mean that the user is either engaged or frustrated with the settings) => CL: Interesting question whether the settings changes by a user affect their satisfaction. If so, we can use this point for the MM algorithms in terms of machine learning.
- Settings changed / total time (including uncompleted tasks)
- Number of settings changed in first half of test / number of settings changed in last half
- Integral of product of settings function * weighting function
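A rough sketch of the last two settings-change metrics above; the event representation (each change as a timestamp in seconds) and the weighting function (later changes count more) are assumptions for illustration:

```python
def half_ratio(change_times, session_length):
    """Settings changed in the first half of the test divided by settings
    changed in the second half; undefined if the second half has no changes."""
    first = sum(1 for t in change_times if t < session_length / 2)
    second = len(change_times) - first
    return None if second == 0 else first / second

def weighted_change_sum(change_times, session_length):
    """Discrete stand-in for 'integral of settings function * weighting function':
    each change is weighted by how late in the session it occurs, so late changes
    (after the matchmaker should have converged) count more."""
    return sum(t / session_length for t in change_times)

times = [10, 30, 40, 200, 260]  # example change timestamps in a 300 s session
print(half_ratio(times, 300))            # 1.5
print(weighted_change_sum(times, 300))   # 1.8
```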
Metrics of ranking matchmaker results
- Think-Aloud: Number of positive comments - number of negative comments ("Valence method")
- Users should rate each page for user-friendliness
- User should assess a new solution suggested by the matchmaker
- User expectation compared to user satisfaction
- "Did the matchmaker meet your expectations?"
- Problem with reliability (different people)
- Not important for benchmarking
- Usable for reflection on dissemination?
- CL: Proposal for benchmarking of MMs: To compare (design) changes in the matchmakers, or to compare the matchmakers against each other, we need to define a baseline measure.
- Pre-Test: Satisfaction, Information about the user, experiences, etc.
- Post-Test Usability (overall impressions of usability => SUS)
- Post-Task Satisfaction (task performance, usability problems, errors and higher task-times => SEQ, ASQ)
- Single Usability Metric for reporting by combining the measurements (SUM = 4 Metrics: task completion rates, task time, satisfaction and error counts)
- Single Usability Metric (SUM)
- Combined metric of: task completion, task time, user satisfaction (single question), number of errors, number of usability problems by observation
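A SUM-style combination could be sketched as below. The normalisation bounds (maximum time, maximum errors) are illustrative assumptions chosen for this sketch; the published SUM method (Sauro & Kindlund) standardises the four components via z-scores rather than fixed bounds:

```python
def sum_score(completion_rate, task_time_s, satisfaction_1_to_7, error_count,
              max_time_s=300.0, max_errors=10):
    """Combine the four SUM components into one 0..1 score (higher is better).
    Each component is normalised to 0..1 and averaged with equal weights."""
    time_component = max(0.0, 1.0 - task_time_s / max_time_s)
    satisfaction_component = (satisfaction_1_to_7 - 1) / 6.0  # Likert 1-7 -> 0..1
    error_component = max(0.0, 1.0 - error_count / max_errors)
    return (completion_rate + time_component
            + satisfaction_component + error_component) / 4.0

# Example: task completed, 120 s, satisfaction 6/7, 2 observed errors:
print(round(sum_score(1.0, 120.0, 6, 2), 3))  # 0.808
```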
- Between-subject or within-subject experiment?
- If within-subject, is counter-balancing sufficient to make up for learning effects?
- If between-subject, do we have enough users to get significant results?
- Do we have preference sets for our subjects? If not, how do we initialize their preference set?
- Will we be able to do statistical analysis for separate groups of users (different kinds of disabilities)?
- But this would require a vast number of subjects.
- We could focus on specific user groups in the first iteration (visual impairment & blindness, elderly with some kind of vision impairment).
- Currently, we don't have all settings handlers. The Windows, GSettings, and DBus settings handlers are probably the most advanced. We can convey our wishes regarding this to the architecture group.
- We need to define the test scenarios as quickly as possible. TUD and HDM should come up with initial proposals after the prototypes are ready. The proposals will be discussed in the matchmaker meeting in Stuttgart.