Redact sensitive information from your cloud storage. This document outlines our research questions, measurement targets, and development tasks. Ideally, the research questions motivate the measurement targets, which in turn motivate the development tasks.

Research questions

  • How prevalent is emailing passwords in the clear?
    • How much of it happens in automated messages versus human-written ones?
      • How can we distinguish human-written messages from automated ones?
  • Is there a way to identify 'private information' without user intervention?
    • Would a set of heuristics be enough?
      • For the "Password: XXX" heuristic, what is the recall?
    • Can we improve the heuristics on a "rolling" basis using previously seen structured text templates?
  • Can we "punt" and ask the user if they have any other passwords they would like redacted?
  • How can we present non-password "private information" to users? How will they want to redact it, if at all?
  • How prevalent is password sharing between different accounts of the same user?
  • Can we estimate password sharing between cleartext-emailed passwords and "more sensitive" accounts like email or banking websites?
    • Possible strategy: include links to password reset information for popular services on a later page, then treat clicks on those links as a signal that the user may have been sharing an exposed password with that site.
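
One way to make the "Password: XXX" heuristic and its recall concrete is sketched below. This is a minimal illustration, not a settled design: the pattern, the function names (find_candidates, recall), and the idea of a hand-labeled sample of (message body, true password) pairs are all assumptions introduced here.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: a password-like label, a ":" or "=" separator,
# then one run of non-whitespace characters as the candidate password.
PASSWORD_LABEL = re.compile(
    r"\b(?:password|passwd|pwd)\b\s*[:=]\s*(\S+)", re.IGNORECASE
)


def find_candidates(body: str) -> List[str]:
    """Return every token that follows a password-like label in the body."""
    return PASSWORD_LABEL.findall(body)


def recall(labeled: List[Tuple[str, str]]) -> float:
    """Fraction of labeled messages whose true password the heuristic finds.

    `labeled` is a hand-labeled sample of (message body, true password) pairs.
    """
    if not labeled:
        return 0.0
    hits = sum(1 for body, true_pw in labeled if true_pw in find_candidates(body))
    return hits / len(labeled)


# Tiny made-up sample: the second message has no "Password:" label,
# so the heuristic misses it.
sample = [
    ("Your account is ready.\nPassword: hunter2\n", "hunter2"),
    ("hey, the wifi pass is tr0ub4dor", "tr0ub4dor"),
]
print(recall(sample))  # 0.5
```

Recall measured this way is only as good as the labeled sample; free-form human-written messages like the second one are exactly the cases a label-based heuristic will miss.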

Measurement targets

  • User password behavior
    • Password complexity
  • Service provider email behavior
  • Heuristic training data
    • What information can we collect in a 'feedback loop' that balances subject privacy with the ability to improve our heuristics?
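
As a rough sketch of how password complexity could be measured while keeping only non-personal statistics, the example below estimates bits of entropy from the character classes a password uses and stores only bucket counts. The character-class estimate is a deliberately naive stand-in, and the bucket thresholds (28 and 60 bits) and function names are assumptions made for illustration.

```python
import math
import string
from collections import Counter
from typing import Iterable

# Character classes considered by this naive estimate (ASCII only).
_CLASSES = (
    string.ascii_lowercase,
    string.ascii_uppercase,
    string.digits,
    string.punctuation,
)


def charset_size(password: str) -> int:
    """Sum the sizes of the character classes that appear in the password."""
    return sum(len(cls) for cls in _CLASSES if any(ch in cls for ch in password))


def estimated_bits(password: str) -> float:
    """Naive estimate: length * log2(alphabet size). Ignores dictionary words."""
    size = charset_size(password)
    return len(password) * math.log2(size) if size else 0.0


def bucket(bits: float) -> str:
    """Map an estimate to a coarse label; the thresholds are hypothetical."""
    if bits < 28:
        return "weak"
    if bits < 60:
        return "medium"
    return "strong"


def summarize(passwords: Iterable[str]) -> Counter:
    """Keep only bucket counts, never the passwords themselves."""
    return Counter(bucket(estimated_bits(pw)) for pw in passwords)
```

Only the output of summarize would need to be stored, which keeps the collected data to aggregate counts rather than anything tied to a single user.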

Development tasks

  • Flesh out the full workflow
    1. presentation of candidate passwords
    2. selection of "these are actually my passwords that I want redacted"
    3. confirmation screen listing each email's "From" field and subject (perhaps with links to open the full message in a Gmail session in the same browser?)
  • Generalize from passwords to 'private data'
  • Design the measurement backend
    • Consider possible strategies for storing 'private' data beyond generalized, non-personal statistics
  • Write a full FAQ detailing what information we see, what information we store, and what options the user has regarding our storage of their data.
  • How can we provide this service on an ongoing basis in addition to a one-off search?
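
The step between selection and confirmation in the workflow above could look roughly like the sketch below, which replaces each user-selected password in a message body with a placeholder. The function name redact and the "[REDACTED]" placeholder are assumptions for illustration; how the redacted body is written back to the mail provider is out of scope here.

```python
from typing import Iterable


def redact(body: str, selected: Iterable[str], placeholder: str = "[REDACTED]") -> str:
    """Replace every occurrence of each user-selected password in the body.

    Longer secrets are replaced first so that a secret which is a prefix of
    another (e.g. "abc" and "abcdef") does not leave a partial leak behind.
    """
    for secret in sorted(set(selected), key=len, reverse=True):
        if secret:  # skip empty strings, which would match everywhere
            body = body.replace(secret, placeholder)
    return body


print(redact("Password: hunter2 (again: hunter2)", ["hunter2"]))
```

Exact string replacement is the simplest choice; it will also redact the same string where it appears as an ordinary word, which may or may not be what users want.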

Blue sky ideas

Do the same thing, but for Dropbox.

