Hack Your Health: A Self-experimentation Tool for Behavioral Interventions

The challenge

Many behavioral interventions can improve health and wellbeing. However, behavior change is pretty complex and multifaceted, and there’s considerable variation in how different people respond to different interventions, i.e., one intervention doesn’t work for everyone in the same manner.

Most evidence for popular interventions is from group-based studies, and while these insights are important for understanding the effect of the intervention overall, they can rarely be used to predict if the intervention would work for a given person. Put differently, what might be an effective intervention on average (at the group level) may not be effective for specific individuals.

One way in which individuals can find out if a given intervention works for them is via self-experimentation using N-of-1 study designs. N-of-1 studies involve one person as the sole subject.

I love the way Neuringer (1981) put it. Self-experimentation can help us go from ‘should do’ or generic contingencies to more personal contingencies!

While there are a lot of tools out there that help you self-track various aspects of health, from steps to sleep, very few help people figure out whether a given healthy behavior is worth pursuing (answering the question, “does it work for me?”).

In partnership with WNYC (public radio in New York), we began designing a self-experimentation tool (which we now call “Hack Your Health”) that can support individuals in carrying out N-of-1 experiments to test if healthy activities that work in general (e.g., meditation, exercise) improve aspects of their own psychological well-being.

The idea is that people try an activity in a systematic manner, collect data about their well-being throughout the experiment, and once they’re done, we analyze their data and share results with them in simple form.

Here I describe the process we went through to design the current version of the tool.

Some boundary conditions we began with:

  1. The tool needs to be scalable, because a wide range of WNYC listeners should be able to participate
  2. All communication when using the tool should be via SMS/email, without needing a separate app
  3. Health outcomes to be tracked as part of the experiment should be self-reported (without needing a device)
  4. Include interventions and outcomes that are of interest to the WNYC audience and shown to be effective ‘on average’, but also often associated with variability in response.
  5. Included interventions/activities should require minimal training, be short, easy to self-administer and perform.
  6. The experience should have low participant burden.
  7. The tool should balance the desire to generate population-level insights with maintaining the spirit of self-discovery and self-experimentation.
  8. The tool should balance the needs of three stakeholders: WNYC and the editorial team, researchers, and the end-users.

User Research

(Read a more detailed and published version here).

We first conducted user research using an online survey (~550 people) and semi-structured interviews with a subset of the survey respondents (18 people) to get a sense of the kind of interventions and aspects of well-being people were interested in, and reactions to the proposed self-experimentation tool.

There are so many healthy habits and aspects of well-being that could be tested using a tool like this. Our team first selected ~10-12 healthy behaviors in different categories, all usually associated with variation in response across participants, to be included in the survey.

In the survey, participants were asked to pick three outcomes of interest to them, and two interventions that they’d like to try and compare to see which one (if any) improves those outcomes. The comparison idea stemmed from how N-of-1 studies are carried out in clinical settings (comparing two treatments, such as two medications, to see which one seems to be the best for that particular patient).

Overall, through this work we realized that there were definitely discrepancies in users’ conceptualization of the tool’s purpose, and that:

  1. Users were interested in a tool that aids habit formation and provides accountability.
  2. Few users doubted the efficacy of a given intervention (possibly because the individual differences are rarely highlighted by media, and most of these activities are expected to ‘work’?).
  3. Most did not intuitively find any value in comparing two interventions to see which one works better (which in hindsight, makes sense, because you could want to do both meditation and exercise simultaneously, and you’re not always trying to choose one, like when doing this in a clinical setting and trying to choose the better medication out of two).

Design Implications

  1. Designing the tool to inform habit formation
  2. Simplifying the experimental design – testing one intervention against their usual routine
  3. Highlighting existing individual differences to intervention response in website materials

Final outcomes and interventions included in Hack Your Health


We chose 6 outcomes (4 aspects of well-being, and perceived enjoyment of the activity and its fit into their daily routine) to be included in the tool.


Other aspects of the tool’s design

Outcome choice vs. no choice, experience framing, medium for daily surveys, etc…

As mentioned earlier, one of the design constraints we were working under was “no app” and keeping most communication via SMS and email.

However, the daily surveys could be sent either as a link to an online survey using a tool such as Qualtrics, or via SMS (where participants provide answers via SMS). Maintaining all communication via SMS would have the advantage of higher reach, since people wouldn’t need a smartphone. WNYC also wanted to reach parts of the population that may not have smartphones (although the majority of the US population now owns a smartphone). But answering too many questions via SMS comes with the risk of being burdensome. Answering via a survey tool would also let us use features such as slider scales to make responses low burden.

We designed and deployed a short, 4-day 2×2 factorial cross-over study with 9 users to answer a few such questions:

  1. Should participants compare one intervention to their usual routine, or two interventions to figure out which one works better?
  2. Should the focus of the project be on the outcome(s) or on the interventions? (Meaning, is it “see what’ll make you less stressed”? Or is it “see what’ll help you at all”?)
  3. Which medium should we use? SMS vs. link to survey?
  4. How much do the various outcomes matter (i.e. do they feel very strongly about energy v. focus v. stress)? Is it important to the participant to pick the outcome? Would they want to track all three outcomes?
  5. How many daily questions should we ask?

When signing up for the prototype test, users could choose either 1 or 2 interventions to try. They were informed that if they chose 1, they’d be comparing that intervention to their usual routine. If they chose 2 interventions, they would be comparing those interventions to each other.

2 x 2 factorial, cross-over trial to inform design decisions
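
For readers unfamiliar with the design, here is a rough Python sketch of what a 4-day 2×2 factorial cross-over can look like. The two factors used here (survey medium and number of daily questions) are my assumption based on the insights listed below, and the rotation scheme and all names are hypothetical. This is not the exact allocation we used.

```python
from itertools import product

# Assumed factors (inferred from the prototype insights, not a stated spec)
MEDIA = ("sms", "survey_link")
QUESTION_COUNTS = (3, 9)

def conditions():
    """The four cells of the 2x2 factorial: every medium with every question count."""
    return list(product(MEDIA, QUESTION_COUNTS))

def crossover_orders(n_users):
    """Give each user all four conditions, one per day, in a rotated order.

    Rotating the starting condition per user is a simple (hypothetical) way
    to keep any one condition from always landing on the same day.
    """
    cells = conditions()
    orders = {}
    for user in range(n_users):
        shift = user % len(cells)
        orders[f"user_{user + 1}"] = cells[shift:] + cells[:shift]
    return orders

orders = crossover_orders(9)
for user, order in orders.items():
    print(user, order)
```

Because every user experiences every cell, each person acts as their own comparison, which is the point of a cross-over.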

Insights from 4-day prototype test + interviews:

  1. Compliance was higher when surveys were sent as links than when participants responded via SMS
  2. Participants liked the ease of responding using slider scales
  3. Compliance did not differ between 3 questions vs. 9 questions
  4. Most participants chose 1 intervention to try instead of 2 and found no value in comparing 2 interventions.
  5. Users liked the questions about enjoyment and fit of activity into their routine
  6. Users did not seem to feel strongly about one outcome over others, at least during the short 4-day test. We now know that after going through an 18-day experiment, some outcomes are definitely more relevant than others, so it could be that participants don’t feel strongly about it in the beginning and that this changes over time.

Instructions and daily prompts

Instructions: We wanted to develop instructional materials that provide some structure but also allow users some flexibility in the time of day they perform the activity and exactly how they perform it. We developed written instructions as well as instructional videos for each of the 4 interventions.

Instructional video for “Get Active” (10 minutes of vigorous physical activity)

Daily SMS: We also developed a bank of SMSs for use throughout the study. There were three types of daily messages:

  1. Morning reminder at 7 a.m. to perform the activity/stick to their usual routine (depending on that day’s assignment)
  2. Evening SMS at 7 p.m. with link to daily survey
  3. Heads-up about the next day’s assignment IF they were assigned to do the activity (so they get some time to plan/schedule it into their day).

First few messages after the user signs up for the self-experiment
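
To make the three message types concrete, here is a minimal Python sketch that builds a message plan from a phase sequence. The message wording, the `<survey link>` placeholder, the function name, and sending the heads-up alongside the evening SMS are all assumptions for illustration. The actual message bank and delivery system looked different.

```python
from datetime import date, datetime, time, timedelta

def daily_messages(sequence, start, days_per_phase=3):
    """Return a list of (send_time, text) tuples for the whole experiment.

    sequence: string of phases, "A" = usual routine, "B" = activity.
    start: date of day 1. Message texts are hypothetical placeholders.
    """
    days = [phase for phase in sequence for _ in range(days_per_phase)]
    messages = []
    for i, phase in enumerate(days):
        today = start + timedelta(days=i)
        task = "do your activity" if phase == "B" else "stick to your usual routine"
        # 1. Morning reminder at 7 a.m.
        messages.append((datetime.combine(today, time(7, 0)), f"Today: {task}."))
        # 2. Evening SMS at 7 p.m. with the survey link
        messages.append((datetime.combine(today, time(19, 0)),
                         "Your daily survey: <survey link>"))
        # 3. Heads-up only if tomorrow is an activity day (timing is assumed)
        if i + 1 < len(days) and days[i + 1] == "B":
            messages.append((datetime.combine(today, time(19, 0)),
                             "Heads-up: tomorrow is an activity day."))
    return messages

plan = daily_messages("BAABAB", date(2020, 1, 6))
for send_time, text in plan[:5]:
    print(send_time, text)
```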

Experimental Design and length of experiment

We needed to consider experimental designs that would not be detrimental to user engagement, and would enable causal inference (factor in potential carry-over effects). We also needed to choose a length of experiment that would enable causal inference and not hinder engagement.

After much discussion, we landed on a cross-over experimental design with six phases of 3 days each (6 x 3 = 18 days). Each user would go through three 3-day phases of the activity (phase B) and three 3-day phases of their usual routine (phase A). Out of the 64 total combinations possible, we chose 12 that did not include extended periods of inactivity (e.g., AAABBB, which would have the user not starting the activity till day 10).
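
If you want to play with the sequence space yourself, here is a small Python sketch. It enumerates every A/B assignment over six phases and the subset with three phases of each, then applies one illustrative filter (no three identical phases in a row). That filter is my assumption for the example, not the rule we used, and it does not reproduce our final 12 sequences.

```python
from itertools import product

def all_sequences(n_phases=6):
    """Every assignment of A (usual routine) or B (activity) to each phase."""
    return ["".join(p) for p in product("AB", repeat=n_phases)]

def is_balanced(seq):
    """True when the sequence has as many A phases as B phases."""
    return seq.count("A") == seq.count("B")

def first_activity_day(seq, days_per_phase=3):
    """1-indexed day on which the first B phase starts, or None if there is none."""
    idx = seq.find("B")
    return None if idx == -1 else idx * days_per_phase + 1

def longest_run(seq):
    """Length of the longest stretch of identical consecutive phases."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

sequences = all_sequences()
balanced = [s for s in sequences if is_balanced(s)]
# Hypothetical filter, for illustration only: no three identical phases in a row
candidates = [s for s in balanced if longest_run(s) <= 2]

print(len(sequences))                # 64
print(len(balanced))                 # 20
print(len(candidates))               # 14
print(first_activity_day("AAABBB"))  # 10
```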

If a person is trying meditation, this is what their 18 days would look like if assigned to the BAABAB sequence.
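
The same schedule can be spelled out with a few lines of Python. This is just a sketch that expands a sequence into day-by-day assignments. The function name and labels are made up for this example.

```python
def expand_schedule(sequence, days_per_phase=3, activity="meditation"):
    """Expand a phase sequence such as "BAABAB" into (day, assignment) pairs.

    "B" phases are activity days and "A" phases are usual-routine days.
    """
    labels = {"A": "usual routine", "B": activity}
    schedule = []
    day = 1
    for phase in sequence:
        for _ in range(days_per_phase):
            schedule.append((day, labels[phase]))
            day += 1
    return schedule

schedule = expand_schedule("BAABAB")
for day, assignment in schedule:
    print(f"Day {day:2d}: {assignment}")
```

For BAABAB, days 1-3, 10-12 and 16-18 are meditation days, and days 4-9 and 13-15 are usual-routine days.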

Frequency of measurement and measures to track outcomes

All of these outcomes are dynamic in nature, and ideally, you might want to measure them multiple times a day. But this risks increasing user burden. We needed to choose a way of measuring outcomes that would enable causal inference and also not increase user burden.

There aren’t many validated single-item measures for most of these outcomes. However, since every person’s data is only used in their own experiment’s analysis, we cared more about the measures being face valid than about other kinds of validity.

We decided to use simple 1x/day tracking (end of day), with a question for each outcome (example in screenshot below).

Screenshot of daily survey

Analyses of the N=1 experiments and sharing results with participants

Below is a screenshot of what the report we share at the end of the experiment looks like. To analyze the data, I used an R package, developed by one of our team members, that uses Bayesian modeling to analyze data from N-of-1 experiments.

Figuring out data analyses for this project was challenging. We knew that we did not want to perform frequentist analyses that only provide p-values and do not inform us about the magnitude of the effect or the certainty behind the results. Previous work indicates that both aspects are important for the results of such experiments to be useful to users. However, communicating uncertainty can be challenging, particularly when the magnitude of the effect is also highly variable and individualized (for example, a 10-point change in stress for me may be no big deal just because of the usual variation in my stress levels, but might indicate a significant reduction for someone else).
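
To illustrate why the same 10-point change can mean different things for different people, here is a deliberately simplified Python sketch. It is not the R package's model. It uses a normal approximation to the posterior of the mean difference (roughly what a flat prior gives), treats days as independent, and ignores carry-over effects. The stress scores are made-up numbers on an assumed 0-100 scale.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def compare_phases(activity_scores, routine_scores):
    """Approximate posterior for (activity mean - usual routine mean).

    Simplifying assumptions: flat prior, normal approximation, independent days.
    """
    diff = mean(activity_scores) - mean(routine_scores)
    se = sqrt(variance(activity_scores) / len(activity_scores)
              + variance(routine_scores) / len(routine_scores))
    posterior = NormalDist(mu=diff, sigma=se)
    return {
        "mean_difference": diff,
        "interval_95": (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)),
        "prob_reduction": posterior.cdf(0.0),  # P(activity lowers the score)
    }

# Two hypothetical people, both with a 10-point lower average stress on activity days
steady = compare_phases(
    activity_scores=[40, 42, 38, 41, 39, 40, 40, 42, 38],
    routine_scores=[50, 52, 48, 51, 49, 50, 50, 52, 48],
)
variable = compare_phases(
    activity_scores=[40, 70, 10, 55, 25, 40, 60, 20, 40],
    routine_scores=[50, 80, 20, 65, 35, 50, 70, 30, 50],
)
print(steady)
print(variable)
```

For the person with steady day-to-day scores, the 95% interval sits entirely below zero. For the person whose stress swings a lot, the interval spans zero, so the same 10-point average drop is much less certain.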

I’ve also always struggled with getting feedback on results prototypes from people (especially in this case, where most people are not familiar with N=1 style studies) before they’ve gone through the experiment. We used prior work to drive our decisions about the phrasing of results in the participant reports.

Here are a few rough prototypes we sketched:

We also mocked up a few prototypes using dummy data.

However, once the user study (described in the next section) was done, we had REAL data, which was way messier due to:

  • people not always doing the assigned activity
  • missing data
  • large credible intervals and highly uncertain results
  • other things going on in their lives (which we had them report through one open-ended question in the daily survey)

As a result, we modified the language and ended up creating reports that were much more individualized (I could not have used just a “template”, because each user’s case is different).

Here are a few examples of what the final report looked like for one participant:

User Study

We designed a website where people can read about the tool and sign up for their self-experiment. We then finalized the design using think-aloud protocols and cognitive walkthroughs.

As part of my doctoral thesis, I’m conducting a formative evaluation of Hack Your Health using mixed methods. The aim is to assess its usefulness in helping people make decisions about health. As part of it, I also want to explore tensions between participants’ experience with the activity and what the results from the analyses of their data suggest.

If a person signs up for the study, they go through this process:

Currently, 22 users have completed the study. I interviewed 14 of them, and I’m in the process of analyzing the data.
