Proper and Basic Property-Based Testing

Zelenya

Posted on April 20, 2023

I assume you’ve heard of property-based testing (PBT). But has it ever felt annoying, impractical, or unclear? Give it another chance. I promise to keep it practical and not demonstrate overused examples like reverse(reverse(list)) shouldBe list.

We’ll talk about random vs. hardcoded data, tips for writing proper tests, common pitfalls, and a few ways to introduce PBT into your codebase.


📹 Hate reading articles? Check out the complementary video, which covers the same content.


💡 Is fuzzing the same as property-based testing?

These are different techniques that share some similarities.

  • Fuzzing focuses on discovering vulnerabilities and unexpected behavior.
  • PBT focuses on verifying that properties hold for all inputs.

Traditional tests and hard-coded data

First, let’s look at some traditional tests with hard-coded data (we’ll review them one by one):

test("Foo should succeed for random user"){
  checkThatFooSucceedsFor("random.mail@gmail.com")
}

test("Foo should give a boost to jeff"){
  checkThatFooWorksWithBoostFor("jeff@mycompany.io")
}

test("Bar should work for random user"){
  checkThatBarWorksFor("random.mail@gmail.com")
}

test("Bar should work for jeff"){
  checkThatBarWorksFor("jeff@mycompany.io")
}

💁 I’m using scala-like quickcheck-like snippets, which resemble code in Scala, Haskell, Rust, and the like, using QuickCheck or one of its children.


The first test verifies that the function foo works for a random user/email:

test("Foo should succeed for random user"){
  checkThatFooSucceedsFor("random.mail@gmail.com")
}

Is there anything special about the given email? Hm, no? It seems random. Okay. What about the following test:

test("Foo should give a boost to jeff"){
  checkThatFooWorksWithBoostFor("jeff@mycompany.io")
}

Is there anything special in this case? Yeah, you know how sometimes people in the company want special treatment? This is the case here – Jeff gets a boost.

Let’s zoom out and review all the tests:

test("Foo should succeed for random user"){
  checkThatFooSucceedsFor("random.mail@gmail.com")
}

test("Foo should give a boost to jeff"){
  checkThatFooWorksWithBoostFor("jeff@mycompany.io")
}

test("Bar should work for random user"){
  checkThatBarWorksFor("random.mail@gmail.com")
}

test("Bar should work for jeff"){
  checkThatBarWorksFor("jeff@mycompany.io")
}

  • The first test checks a property of foo when the input is a random email (we know this because I told you, not because it’s obvious).
  • The second – when the input is a specific email.
  • The last two check that bar works for both emails.

Wait, why are we testing bar with two emails? What about the first test? Is it going to work with any random email? Or is it supposed to be any random email as long as it’s not "jeff@mycompany.io"?

These tests are very sad. When looking at this code, it’s hard to identify what we are trying to test.

  • Is the given input random or purposeful? Is it random.mail@gmail.com or special jeff@mycompany.io?
  • Are we covering and testing all the possible types of input? Does the function work for only one type of email or both, but we neglected to test the other type?

Moreover, they aren’t scalable (what if somebody else asks for special treatment?), and we haven’t even talked about invalid or empty emails.

So, can we do better?

Random data

No, I won’t tell you that PBT solves everything. In fact, the first point I want to make is that you don’t have to use property-based testing for everything and everywhere.

First, let’s talk about random fake data. Most language ecosystems have some sort of "fake", "faker", or "fakedata" libraries. For instance, Rust has fake, which provides fake::faker::internet::en::SafeEmail for generating random emails.

test("Foo should succeed for random user"){
  // use one random fake email
  checkThatFooSucceedsFor(SafeEmail().fake())
}

test("Foo should give a boost to jeff"){
  checkThatFooWorksWithBoostFor("jeff@mycompany.io")
}

test("Bar should work for random user"){
  checkThatBarWorksFor(SafeEmail().fake())
}

test("Bar should work for jeff"){
  checkThatBarWorksFor("jeff@mycompany.io")
}

Using random fake data signals that the given input has no meaning or influence – we no longer have to puzzle over the significance of random.mail@gmail.com.

But we still have other obstacles: does fake data cover special emails?

  • If it does and we don’t expect it, we can get in trouble.
  • If it doesn’t, but we expect it, we can get in trouble.

And remember how I hinted at invalid or empty emails? We aren’t testing these either.

Generative-testing

It all depends on the domain and how good our types are, but, generally speaking, we’d want to cover all the different cases:

  • random emails;
  • special emails: jeff@mycompany.io;
  • invalid emails (including empty emails).

And be able to use subsets of this in different tests.

Of course, we can cover all of these by hand, but isn’t that annoying and too much to keep in one’s head? And then we have to do it for every function.

Okay, now, the answer is property-based testing, which enables us to have tests that work for all the emails as well as tests for specific subsets.

Naive generators

One fundamental part of generative-testing is generators. They allow us to program and test a range of inputs within a single test instead of writing a different one for every value we want to test.

Libraries provide general implementations (such as for strings, numbers, and lists), and we can write custom generators for our domain, constructed and limited in a particular fashion.

To illustrate, we can make an email generator:

val emailGen: Gen[Email] =
  for string <- Gen.alphaNumStr
  yield s"$string@gmail.com"

We generate a string of alphanumerical characters, which we then use to construct an email.


💡 The for-comprehension (or do-notation) can be desugared as plain functions:

val emailGen: Gen[Email] =
  Gen.alphaNumStr.map(string => s"$string@gmail.com")

We can use it to lift our tests into the property-based testing world:

test("Foo should succeed for random user"){
    forall(emailGen) { email =>
      checkThatFooSucceedsFor(email)
    }
}

test("Foo should give a boost to jeff"){
  checkThatFooWorksWithBoostFor("jeff@mycompany.io")
}

test("Bar should work for random user"){
    forall(emailGen) { email =>
      checkThatBarWorksFor(email)
    }
}

test("Bar should work for jeff"){
  checkThatBarWorksFor("jeff@mycompany.io")
}

The forall function uses the supplied generator to produce example arguments (imagine about a hundred samples, depending on your library settings) and passes them to the test body. Truth be told, these tests aren’t better than the ones we had before – they’re worse. We still don’t cover empty/invalid emails and have redundant tests for special emails.

The worst part: now, we generate hundreds of beautiful emails like this for each test:

904TDBHCmtaMe1KU8ESUSKcfOl2q711XK9vzpSkeaiP6OP@gmail.com

Tests are taking longer, but the results are the same. Oops.

This is the first trap to avoid: naive generators bring you zero PBT benefits, only penalties. Try to avoid wasting testing time on hundreds of “user names” like mcxjfgG5 or, especially, like ᒅퟧ荫スཽ駼롲ꃣ鲙꟎갏ᄟԋ䗠慎鏩ㆇ逢᳄䒪謑୾棂텊㕰랮螮ᑻ⎿⏦炩耾쁾峞゙鉼ꪱ摦僋ⴁ뺁.

Generous generators

The effectiveness of property-based testing relies heavily on the quality of the generated input data. We can combine generators to create varied and representative data and ensure sufficient property coverage.

First, let’s start with a couple of valid emails. The following generator returns one of the given emails:

val validEmailGen: Gen[Email] =
  Gen.oneOf(
    "britta.perry@gmail.com",
    "abed@hotmail.com",
    "troyriverside@yahoo.com"
  )

The distribution is uniform – it picks a random value from a list.


💡 Note: We don’t have to write these by hand – we can use library-provided fake data in generators. The annoyance of the integration depends on the libraries and tools in the language of your choice.


We can write similar generators for invalid and special input data:

// Covers empty email, missing at symbol, missing subject, etc.
val invalidEmailGen: Gen[Email] =
  Gen.oneOf("", "@domain.com", "@gmailcom", "troyyahoo.com")

// Covers Jeff (always returns Jeff's email)
val specialEmailGen: Gen[Email] =
  Gen.const("jeff@mycompany.io")

And then, we can tie them all together into one generator that covers all the cases:

val emailGen: Gen[Email] =
  Gen.frequency(
    (2, validEmailGen),
    (2, invalidEmailGen),
    (1, specialEmailGen)
  )

We control the frequency: the email generator will use the valid and invalid email generators more often than the special one.


💡 Usually, libraries provide functions to gather the statistics (such as collect and classify). In our cases, we can collect something like this:

18% "jeff@mycompany.io"
15% "troyriverside@yahoo.com"
14% "britta.perry@gmail.com"
14% "@gmailcom"
12% "@domain.com"
11% "troyyahoo.com"
9% "abed@hotmail.com"
7% ""

It seems like Jeff’s email is generated too frequently. Exercise for the reader? Just kidding.
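
For the curious, here is a minimal sketch of how such statistics could be gathered with ScalaCheck's Prop.collect. It assumes Scala 3, a scala-cli style dependency line, and that Email is just a String alias (none of which the snippets above spell out), so treat it as an illustration rather than the exact setup behind the numbers above.

```scala
//> using dep org.scalacheck::scalacheck:1.17.0
import org.scalacheck.{Gen, Prop}

// Assumption: Email is a plain String alias (the article leaves it undefined).
type Email = String

val validEmailGen: Gen[Email] =
  Gen.oneOf("britta.perry@gmail.com", "abed@hotmail.com", "troyriverside@yahoo.com")

val invalidEmailGen: Gen[Email] =
  Gen.oneOf("", "@domain.com", "@gmailcom", "troyyahoo.com")

val specialEmailGen: Gen[Email] =
  Gen.const("jeff@mycompany.io")

val emailGen: Gen[Email] =
  Gen.frequency(
    (2, validEmailGen),
    (2, invalidEmailGen),
    (1, specialEmailGen)
  )

// The property is trivially true; collect only records every generated value
// so that the runner can report how often each one showed up.
val emailDistribution: Prop =
  Prop.forAll(emailGen) { email =>
    Prop.collect(email)(Prop.passed)
  }

@main def showEmailDistribution(): Unit =
  // Runs the property and prints the collected percentages to the console.
  emailDistribution.check()
```

The exact percentages change from run to run, since the inputs are random.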


Now we can rewrite all the tests. This time, we’re explicitly testing all the emails: random valid, invalid, and special; and we merge two bar tests into one.

test("Foo should succeed for all users") {
  forall(emailGen) { email =>
    checkThatFooSucceedsFor(email)
  }
}

test("Foo should give a boost to special users") {
  forall(specialEmailGen) { specialEmail =>
    checkThatFooWorksWithBoostFor(specialEmail)
  }
}

// One test covers all use-cases
test("Bar should work for all users") {
  forall(emailGen) { email =>
    checkThatBarWorksFor(email)
  }
}

We have fewer tests but cover a wider range of input data.

Conditional properties

Alternatively, we could have written one complex generator based on the parts of the email, something like this:

val newEmailGen: Gen[Email] = for
  // Generate optional subject (including Jeff)
  subject <- Gen
    .option(Gen.oneOf(Gen.alphaLowerStr, Gen.const("jeff")))
    .map(_.getOrElse(""))

  // Generate optional at symbol
  atSymbol <- Gen.option(Gen.const("@")).map(_.getOrElse(""))

  // Generate optional domain part (including my company's)
  domain <- Gen
    .option(Gen.oneOf("gmail.com", "yahoo.com", "mycompany.io"))
    .map(_.getOrElse(""))
// Construct the email
yield s"$subject$atSymbol$domain"

The new email generator is similar to the previous one. Because every part can be empty, it covers empty emails, missing at symbols, and missing subjects. It also generates Jeff’s special email.

But the generator is less flexible – to test special emails, we have to add a condition (note the ==> operator):

...

test("Foo should give a boost to special users") {
  forall(newEmailGen) { email =>
    (isSpecialEmail(email)) ==>
      checkThatFooWorksWithBoostFor(email)
  }
}

...

The generator still generates all the different input types, but the condition discards numerous tests. So, here is another tip: be careful with using conditions. If the condition is hard or impossible to fulfill, the library might fail to find enough passing test cases. You should try to refactor your tests or generators in such scenarios.
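
To make the discarding problem tangible, here is a small sketch in ScalaCheck (Scala 3). The helpers isSpecialEmail and boostFor and the generator broadEmailGen are hypothetical stand-ins invented for this example, not the real functions from the tests above.

```scala
//> using dep org.scalacheck::scalacheck:1.17.0
import org.scalacheck.{Gen, Prop, Test}
import org.scalacheck.Prop.propBoolean

// Hypothetical stand-ins for the article's helpers.
def isSpecialEmail(email: String): Boolean = email == "jeff@mycompany.io"
def boostFor(email: String): Int = if isSpecialEmail(email) then 10 else 0

// Hypothetical broad generator: a random lowercase subject plus a fixed domain.
val broadEmailGen: Gen[String] =
  Gen.alphaLowerStr.map(subject => s"$subject@mycompany.io")

// Conditional property: nearly every generated email fails the condition
// and is discarded before the actual check runs.
val conditional: Prop =
  Prop.forAll(broadEmailGen) { email =>
    isSpecialEmail(email) ==> (boostFor(email) == 10)
  }

// Refactored property: the generator only produces emails that matter here,
// so nothing has to be thrown away.
val targeted: Prop =
  Prop.forAll(Gen.const("jeff@mycompany.io")) { email =>
    boostFor(email) == 10
  }

@main def compareDiscards(): Unit =
  val params = Test.Parameters.default
  println(Test.check(params, conditional).status)
  println(Test.check(params, targeted).status)
```

With default settings, the conditional version is expected to give up before it collects enough passing cases, while the targeted version runs every sample through the check.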

Speaking of conditions (but different kinds of conditions). If the foo test wasn’t supposed to “succeed” the same way for random and special emails, we could merge the first two tests accordingly:

test("Foo should work for all users") {
  forall(newEmailGen) { email =>
    if isSpecialEmail(email)
    then checkThatFooWorksWithBoostFor(email)
    else checkThatFooSucceedsFor(email)
  }
}

...

Compare it to the tests we had before:

test("Foo should succeed for all users") {
  forall(emailGen) { email =>
    checkThatFooSucceedsFor(email)
  }
}

test("Foo should give a boost to special users") {
  forall(specialEmailGen) { specialEmail =>
    checkThatFooWorksWithBoostFor(specialEmail)
  }
}

...

Which one is better depends on one’s preferences and domain. Let’s move on.

Writing proper properties

Testing simple emails is nice but what about real functions and data? And what about actual properties?

Well, emails are real. Simple use cases like that are a good and straightforward way to introduce PBT to your toolbox and codebase.

And we can build on top of it. If we have a user data type that contains an email, we already have this part covered; we just assemble the rest. But if your first property-based test requires generating a very complex data type, yeah, it will be tough.


💡 One thing to stay aware of is the variety of data:

  • If you’re testing something like parsing or external data, you must test invalid emails.
  • But if you’re already dealing with parsed or validated data, you can test only valid emails.

Regarding writing cool properties: I’ve been deliberately avoiding this term and not explaining it. Properties are supposed to be well-defined, testable rules or assertions about a piece of code that hold true for all possible inputs.

It’s not that easy to get into this mindset and pretend like the systems I usually work with have any solid properties; there are a bunch of assumptions, random edge cases, unknown unknowns, and the requirements keep changing every week.

It’s easier to not worry about properties and instead think about postconditions or invariants. Let’s look at a few examples.

Successful testing

One relatively easy thing to test is whether some operation was a success or a failure. Think about a function that can fail, return a conditional flag or status code, and so on.

For example, if we test email parsing, we can expect that parsing a valid email should be successful:

test("Should parse valid emails") {
  forall(validEmailGen) { email =>
    expect(parseEmail(email).isSuccess)
  }
}

Or alternatively, we can test both success and failure cases:

test("Parsing email should succeed when it's valid") {
  forall(emailGen) { email =>
    val result = parseEmail(email)
    if isValidEmail(email)
    then expect(result.isSuccess) // isDefined, isRight, isOk...
    else expect(result.isFailure) // isEmpty, isLeft...
  }
}

Imagine that parseEmail is the function you’re testing and isValidEmail is some other library function, or something like that.

No-op testing

A similar idea is to test that a function “does nothing” for one particular value of an argument, regardless of its other arguments.

Imagine a function that enriches user data based on a config and a status, and that has no effect if the status is “blocked”. We can test that the user doesn’t change for any given config and user:

test("enrichUser does nothing if the status is blocked") {
  forall { (user: User, config: Config) =>
    expect(enrichUser(config, user, Blocked) == user)
  }
}

💡 Note that we don’t have to provide the generators explicitly – we can rely on Arbitrary and arbitrary, which can improve the ergonomics. Check them out if you haven’t heard about them.
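
If you haven’t seen Arbitrary in action, here is a rough sketch using ScalaCheck with Scala 3. The User, Config, and Status types and the enrichUser implementation are made up for the example; only the shape of the property mirrors the test above.

```scala
//> using dep org.scalacheck::scalacheck:1.17.0
import org.scalacheck.{Arbitrary, Gen, Prop}

// Hypothetical domain types and function, invented for this sketch.
final case class User(email: String, credits: Int)
final case class Config(bonus: Int)
enum Status { case Active, Blocked }

def enrichUser(config: Config, user: User, status: Status): User =
  status match
    case Status.Blocked => user
    case Status.Active  => user.copy(credits = user.credits + config.bonus)

val userGen: Gen[User] =
  for
    email   <- Gen.oneOf("britta.perry@gmail.com", "jeff@mycompany.io", "")
    credits <- Gen.choose(0, 1000)
  yield User(email, credits)

// Register default generators, so properties can ask for a User or a Config
// without naming a generator explicitly.
given Arbitrary[User] = Arbitrary(userGen)
given Arbitrary[Config] = Arbitrary(Gen.choose(0, 100).map(Config(_)))

// No generators are passed here: forAll looks up the Arbitrary instances.
val blockedIsNoOp: Prop =
  Prop.forAll { (user: User, config: Config) =>
    enrichUser(config, user, Status.Blocked) == user
  }

@main def checkBlockedIsNoOp(): Unit =
  blockedIsNoOp.check()
```

The trade-off is that an Arbitrary instance is a single default per type. When a test needs a specific subset (only valid emails, only special ones), passing an explicit generator is still the clearer option.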


Round trips

The next low-hanging property is testing round trips: encoding/decoding, serializing/deserializing data in any format, jumping between date-time formats, and others.


💡 This feels like a legit property that can be tested in most systems.


💡 Can be especially handy for catching backward compatibility issues.


Here is a typical example:

test("User JSON encoding/decoding roundtrips") {
  forall(userGen) { user =>
    expect(decode(encode(user)) contains user) 
    // decode(encode(user)) `shouldBe` (Some user)
  }
}

Note that decoding is usually fallible and returns the data wrapped in something.
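
Here is a self-contained sketch of the same round trip, using ScalaCheck (Scala 3) and a toy text format instead of JSON. The User type and the encode/decode pair are hypothetical and exist only to make the example runnable.

```scala
//> using dep org.scalacheck::scalacheck:1.17.0
import org.scalacheck.{Gen, Prop}

// Hypothetical data type and toy "name;age" format, invented for this sketch.
final case class User(name: String, age: Int)

def encode(user: User): String = s"${user.name};${user.age}"

// Decoding is fallible, so the result is wrapped in an Option.
def decode(raw: String): Option[User] =
  raw.split(";", -1).toList match
    case name :: age :: Nil => age.toIntOption.map(User(name, _))
    case _                  => None

val userGen: Gen[User] =
  for
    name <- Gen.alphaStr
    age  <- Gen.choose(0, 120)
  yield User(name, age)

val roundTrip: Prop =
  Prop.forAll(userGen) { user =>
    decode(encode(user)).contains(user)
  }

@main def checkRoundTrip(): Unit =
  roundTrip.check()
```

Note that the generator only produces alphabetic names. A name containing the ; separator would break this toy format, which is exactly the kind of bug a more generous generator would surface.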

Result/State check

When testing something simple like storage, it feels like we can directly confirm the operations, e.g., we should be able to find the key after we’ve inserted it.
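
As a sketch of that idea with ScalaCheck (Scala 3), assume a hypothetical in-memory Storage class standing in for whatever storage you actually have:

```scala
//> using dep org.scalacheck::scalacheck:1.17.0
import org.scalacheck.{Gen, Prop}

// Hypothetical in-memory storage, invented for this sketch.
final class Storage:
  private var entries = Map.empty[String, Int]

  def insert(key: String, value: Int): Unit =
    entries = entries.updated(key, value)

  def find(key: String): Option[Int] =
    entries.get(key)

// Whatever key and value we insert, we should be able to find them afterwards.
val insertedKeyIsFound: Prop =
  Prop.forAll(Gen.alphaNumStr, Gen.choose(0, 1000)) { (key, value) =>
    val storage = new Storage
    storage.insert(key, value)
    storage.find(key).contains(value)
  }

@main def checkInsertedKeyIsFound(): Unit =
  insertedKeyIsFound.check()
```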

Testing stateful systems feels more challenging, but the state can be either a curse or a blessing.

One option is to check the state changes (diffs) made by the function we’re testing. For instance, if the foo function signs up a user for a meetup and, as a result, bumps the number of attendees in some meetup state, we can property-test this bump:

// If foo changes some global state
test("Foo should bump the number of attendees") {
  forall { (email: Email, state: MeetupState) =>
    for
      _        <- globalState.updateMeetup(state) // Update global state
      _        <- foo(email)                      // Run foo with current state
      newState <- globalState.get()               // Get new global state
    yield expect(newState.meetupState.attendees == state.attendees + 1)
  }
}

// Alternatively, if foo returns a new state
test("Foo should bump the number of attendees") {
  forall { (email: Email, state: MeetupState) =>
    for newState <- foo(state, email)
    yield expect(newState.attendees == state.attendees + 1)
  }
}

Functions don’t have to be pure or mathematical to be property-based tested.

State consistency

If multiple states hold onto the same data or functions produce similar results, their consistency or synchronization can be properly tested. This includes:

  • Global state and different local states (for example, if you have different views for the same data on the front end).
  • Storage consistency between caches and golden storages.
  • Optimized and regular versions of an operation.
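
The last item is probably the easiest one to sketch. Below, two hypothetical implementations of the same operation (unique values in sorted order) are checked against each other with ScalaCheck (Scala 3). Both functions are invented for the example.

```scala
//> using dep org.scalacheck::scalacheck:1.17.0
import org.scalacheck.Prop
import scala.collection.immutable.SortedSet

// Hypothetical "regular" version: straightforward and easy to trust.
def uniqueSortedRegular(xs: List[Int]): List[Int] =
  xs.distinct.sorted

// Hypothetical "optimized" version: a single pass through a sorted set.
def uniqueSortedOptimized(xs: List[Int]): List[Int] =
  SortedSet.from(xs).toList

// Both versions must agree on every input.
val versionsAgree: Prop =
  Prop.forAll { (xs: List[Int]) =>
    uniqueSortedOptimized(xs) == uniqueSortedRegular(xs)
  }

@main def checkVersionsAgree(): Unit =
  versionsAgree.check()
```

This only shows that the two versions agree. If both share the same bug, the property still passes.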

Imagine that foo adds a user to the current organization; our frontend displays organization users in the footer (1), somewhere in the hamburger menu (2), and in the optional org. widget in the corner (3) 🤷. We have to make sure that all of these are in sync:

test("Foo should consistently add the user to the organization") {
  forall { (email: Email) =>
    for
      _ <- foo(email)

      // for the footer
      currentUser <- getCurrentUser
      // for the hamburger
      currentOrg <- globalState.map(_.currentOrg)
      // for the widget
      orgWidget <- widgets.get(OrgWidget)
    yield
      // Check that current org and user's org are in sync
      expect(currentUser.organization.members == currentOrg.members) and
        // Check that current org and org widget are in sync
        whenSuccess(orgWidget) { widget =>
          expect(widget.members == currentOrg.members)
        }
  }
}

💡 Note that it’s usually not enough to have one property-based test.

If we forget to implement the user update or do something else silly in all three functions, the previous test still succeeds.


If you aim to test the whole stateful system, consider using techniques such as state machine or model-based testing.

Model-based testing

Sometimes our “complex” systems mimic or wrap simpler data structures. So, instead of dabbling with complexity, we can test something simpler – some straightforward model.

To give you an idea, at one of my previous companies, the product's core was a graph with various bells and whistles, which wasn't trivial to test. And what is the essence of a graph? A list of edges and a list of nodes. Or maps, if we care about their ids.

test("Baz should do something to the graph") {
  forall(graphGen) { graph =>
    val nodes = graph.getNodesMap()
    checkTheBazPropertyOnNodes(nodes)
  }
}

If we delete some parts of the graph, we can check that the relevant nodes and edges are deleted. If we modify the whole graph, for example, reverse/transpose it, we can check that the nodes stay the same, but the edges are reversed.
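
The transpose check could look roughly like this in ScalaCheck (Scala 3). The Graph model, the transpose function, and the generator are hypothetical simplifications; a real graph would have far more bells and whistles.

```scala
//> using dep org.scalacheck::scalacheck:1.17.0
import org.scalacheck.{Gen, Prop}

// Hypothetical model: a graph is just a set of node ids and a set of edges.
final case class Graph(nodes: Set[Int], edges: Set[(Int, Int)])

// Hypothetical function under test.
def transpose(graph: Graph): Graph =
  graph.copy(edges = graph.edges.map { case (from, to) => (to, from) })

// Edges are built only from nodes that exist in the generated graph.
val graphGen: Gen[Graph] =
  for
    nodeList <- Gen.nonEmptyListOf(Gen.choose(0, 20))
    edgeList <- Gen.listOf(Gen.zip(Gen.oneOf(nodeList), Gen.oneOf(nodeList)))
  yield Graph(nodeList.toSet, edgeList.toSet)

// The nodes stay the same.
val transposeKeepsNodes: Prop =
  Prop.forAll(graphGen) { graph =>
    transpose(graph).nodes == graph.nodes
  }

// No edge is lost or added, and every original edge shows up reversed.
val transposeReversesEdges: Prop =
  Prop.forAll(graphGen) { graph =>
    val transposed = transpose(graph)
    transposed.edges.size == graph.edges.size &&
    graph.edges.forall { case (from, to) => transposed.edges.contains((to, from)) }
  }

@main def checkTranspose(): Unit =
  transposeKeepsNodes.check()
  transposeReversesEdges.check()
```

The properties check facts about the result (same nodes, same number of edges, each edge present in reverse) instead of building the expected graph by running a second transpose in the test.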

But if you’re not careful, your test models can become too complex, and you get close to reimplementing your source code in tests, such as reimplementing the reversing of the edges in the graph in test code.

Watch out. Don’t reimplement the data or functions you’re testing!

Final words

With property-based tests, we can generalize concrete scenarios by focusing on essentials. The testing framework or engine handles randomizing inputs to ensure the defined properties are correct.

You don’t have to use property-based testing for everything and everywhere. Start small by adapting just a few existing tests into simple properties to overcome the initial hurdles.

And keep these in mind:

  • Avoid using naive generators.
    • Generating a hundred names like 㕰랮螮ᑻ⎿⏦炩耾쁾峞゙鉼ꪱ摦僋ⴁ뺁 won’t catch any bugs.
  • Be careful writing conditional properties.
    • Generating a bunch of inputs that you throw away just makes your tests worse.
  • Avoid reimplementing your functions.
    • Because you can reimplement the same bugs (or introduce new ones).
