Composable Data Validation with Haskell

shepmaster · on July 27, 2021

It’s not obvious to me: does this extend to supporting multiple concurrent validations? For example, can you perform a single validation of a structure and get back multiple validation results for each of multiple field?

From some thought experiments I’ve done, I feel like my ideal solution would be to define a graph based on my data structure with validations as the edges. Then traverse the graph as far as possible before reporting errors.

This would allow validations to build on each other, as well as validating arbitrary combinations of fields.

I’ve never sat down to actually implement it though…

belevy · on July 27, 2021

Hey, I am one of the authors.

To clarify when you ask about multiple concurrent validations I assume you do not mean parallel. In that case I believe the system presented is the concurrent solution.

The final paragraph in which we use a `ValidationResult` will indeed collect all the failed validation errors from every leaf node (specifying which leaf node the error is for is not defined in the article but was implemented in the production system). Parallelism would require IO and is not used here though it would be fairly simple to add.

What we could do is when we run the validations on both sides an `and_` or an `or_` we can run them in parallel and combine the results. This approach should then run all of the leaves in parallel and then combine them together, I am not sure that it would speed anything up though and might even slow it down(unless the validations are effectful).

I hope that was clear, but if not let me know.

shepmaster · on July 27, 2021

A concrete example would be the input (in JSON to pick a format)

    { "alpha": "3", "beta": "5" }

Here, the fields `alpha` and `beta` are serialized as strings but should be treated as numbers. If the input were malformed:

    { "alpha": "dog", "beta": "cat" }

Then I'd like to get two validation errors for a single validation attempt. For example:

    { "alpha": ["must be a number"], "beta": ["must be a number"] }

> I believe the system presented is the concurrent solution.

It sounds like you are saying that I'd be able to get something isomorphic to the above, so that's good.

In addition to the above, I'd like to be able to say that once the values have been validated (a.k.a. parsed) as numbers, then `beta` must always be greater than `alpha`. This is a type of validation across fields that also depends on earlier validations (parsing, transformation, etc.) succeeding.

belevy · on July 27, 2021

So you want something like `eitherContramap :: (b -> Either ErrMsg a) -> ValidationRule a -> ValidationRule b`

we could define this as follows

    eitherContramap f rule = ValidationRule $ \b ->
       either failure (validate rule) (f b)

lalaithion · on July 27, 2021

Parallelism doesn't require IO, merely https://hackage.haskell.org/package/parallel-3.2.2.0/docs/Co... or https://hackage.haskell.org/package/monad-par-0.3.5/docs/Con...

ivanbakel · on July 27, 2021

>This would allow validations to build on each other, as well as validating arbitrary combinations of fields.

Validations "building on each other" is a common requirement, but in this style of code is often deliberately left out.

The trouble is that the goals of "collecting all the errors possible" and "allowing nested validation" are incompatible, in general - the system would have to be aware of dependencies between pieces of data in order to do it correctly (i.e. capture as many errors as possible). It's typically much better to have the programmer direct the nested validation, so that they can opt in to discarding error messages.

There's an interesting case of this in Haskell code, in the `Validation` type [0]. It's not identical to the OP, since that type is a functor rather than a cofunctor - but the idea is the same. Importantly, `Validation` is not a `Monad`, because that would let you write dependent rules which discarded errors (and therefore broke the typeclass laws).

[0]: https://hackage.haskell.org/package/validation-1.1.1/docs/Da...

shepmaster · on July 27, 2021

> the system would have to be aware of dependencies between pieces of data in order to do it correctly

Yep, which is why I expect that a graph is needed to model the validations that I picture in my mind.

> so that they can opt in to discarding error messages

Do you have any small examples of when you'd want to discard error messages?

ivanbakel · on July 28, 2021

>Do you have any small examples of when you'd want to discard error messages?

My point is more that error messages risk being discarded wherever there are nested validations (when talking about Haskell).

Consider the following crude example:

    validate = do
      x <- foo -- `x` is the result of some validation
      bar x    -- and here, it is fed into a dependent validation
        where
          bar y = do
            z <- baz   -- `baz` is a validation, but independent of the argument
            foobaz y z -- `foobaz` is a validation dependent on the argument
            return ()

No part of `bar` will run if `foo` fails to validate and yield the data required to run `bar`. But running the `bar` validation involves a sub-validation `baz` that is independent of the argument passed to `bar` - so in reality, it should be possible to always run `baz` and capture any error messages from it.

But the way the code is structured means that the `baz` errors are silently dropped some of the time, even when they could be captured. It's up to the programmer to decide if the extra complexity of floating `baz` to the top-level is worthwhile - and the programmer has to keep these kinds of decisions in mind constantly while writing the validation code.

Reordering validations is also a possibility, and it might be available in a graph-based rules engine. But you have to take care to not introduce side-effects into any part of the validation process, otherwise you would get surprising behaviour whenever there was a re-ordering. When writing Haskell code, re-ordering is not really an option for that reason.

catlifeonmars · on July 27, 2021

Agree, graph makes sense. Ideally, with a finite set of rules, you can traverse the entire graph, generating all possible error messages, and then filter error messages by some heuristic (such as shortest path); displaying only the filtered subset to the end user.

anchpop · on July 28, 2021

My bet is that you could represent that with Arrow. Too bad nobody uses them. They’re kind of monads except they represent graphs instead of sequences

agentultra · on July 27, 2021

I've been experimenting with a similar project. We encode the language in a data-structure which allows us to serialize/parse the rules. Paired with an interpreter that takes an environment context along with its input the rule language can be written and maintained by analysts and domain experts.

The usual caching strategies work as well. We can keep rules cached in-memory. Caching the contextual environment comprising all of the relevant data from the system is a bit trickier but we haven't gotten to that part and query for it at run time. It should be feasible. And if the language gets complex enough we may experiment with writing lightweight analysis tools [0]. For now though we're limiting it to internal use by engineers (working with analysts) until we learn more about the approach in production.

The surprising thing is just how little code it takes to get something like this going in Haskell. GHC can derive and write a lot of parsing/serialization code for you automatically and you get a lot of great type-classes out of the box which makes composing big parts out of little pieces nice and easy.

[0]: https://luctielen.com/posts/static_analysis_using_haskell_an...

Vosporos · on July 29, 2021

What does this bring over validation-selective?