String interpolation and overlapping instances 101

May 27, 2019
« Previous post   Next post »

Are you frustrated trying to do any kind of string manipulation in Haskell? The same kind of interpolation or string building that would require zero thought in other languages seems to always turn into a mess of (<>)s, mconcats, and shows in vanilla Haskell. Throw in Text and ByteString and now we have to juggle collisions between different definitions of pack, module imports, lazy versus strict versions of the same functions... not to mention having to manually handle Unicode conversions between Text and ByteString, even when 99% of the time you know it'll be UTF-8. In short, it just becomes a huge mess.

I got fed up with this situation and wrote string-interpolate.

Here's how you use it:

{-# LANGUAGE QuasiQuotes #-}

import Data.String.Interpolate ( i )
import Data.Text

firstName :: String
firstName = "Yuuko"

lastName :: String
lastName = "Shirasawa"

message :: Text
message = [i|Good morning, #{firstName} #{lastName}!|]

Note how we can directly interpolate Strings into a Text without needing to do any explicit conversion; interpolation does what you expect, where textual types and Chars will be interpolated without the surrounding quotes.

You don't have to worry about lazy versus strict:

import qualified Data.Text      as T
import qualified Data.Text.Lazy as LT

description :: T.Text
description = "a very, very old strategy game"

payload :: LT.Text
payload = [i|{ 'description': '#{description}' }|]

Anything which implements Show gets interpolated as you'd expect:

data Direction = Left | Right
  deriving Show

λ> dir = Left
λ> [i|Heading: #{dir}|] :: String
>>> "Heading: Left"

Conversions between Text/String and ByteString automatically convert to/from UTF-8:

import Data.Text
import Data.ByteString

text :: Text
text = "💩"

λ> [i|Bytes: #{text}|] :: ByteString
>>> "Bytes: \240\159\146\169"

In short, it makes building up strings much, much easier. It's also fast!

The rest of this post is about the internal architecture of string-interpolate; while the tricks used involving -XOverlappingInstances and related type hackery should be old hat to advanced Haskellers, I haven't seen a case study of using them in practice.


While using string-interpolate is extremely simple, internally we have to somehow transform the quasiquotes that users give us into code that:

  1. handle taking in values of the various textual types to interpolate, and
  2. outputting every textual type

Since quasiquote handlers don't allow you to look at the type of identifiers, we can't actually emit some code conditionally; we have to spit out the same code no matter what types our users give us. Sounds like a job for typeclasses.

The simplest way to implement string-interpolate would just be to use Show and IsString together, since every textual type already implements both of these. So a quasiquote like:

[i|Username: #{username} | Updoots: #{updoots}|]

would emit code like:1

mconcat
  [ "Username: "
  , fromString (show username)
  , " | Updoots: "
  , fromString (show updoots)
  ]

While this immediately gets us to a working, convenient solution, it's painfully slow; using show and toString means we have to construct a String for every interpolated value and then convert it into our target type. Just as importantly, it doesn't handle Unicode strings correctly! If we took our emoji-containing Text above and tried to interpolate it into a ByteString, we'd end up just getting a string containing the Unicode code point as a sequence of digit characters. Ditto if we tried to interpolate the UTF-8 bytes into a Text, just with the individual bytes. Weirdly, interpolating a ByteString into another ByteString won't even give you the same string if that ByteString isn't ASCII, which certainly seems counterintuitive for any users.

Another idea would be to create our own Interpolatable typeclass. Since we want the behavior to depend both on what we're interpolating in and what we want out, we'd have to take in two parameters:

{-# LANGUAGE MultiParamTypeClasses #-}

-- from `text-conversions'
import Data.Text.Conversions
-- from `utf8-string'
import qualified Data.ByteString.UTF8 as UTF8

class Interpolatable src dst where
  interpolate :: src -> dst

instance Interpolatable Text Text where
  interpolate = id
instance Interpolatable ByteString ByteString where
  interpolate = id
instance Interpolatable Text ByteString where
  interpolate = unUTF8 . convertText
instance Interpolatable ByteString Text where
  -- emit replacement character � for anything
  -- which isn't valid UTF-8
  interpolate = UTF8.foldr
    (\char output -> Data.Text.singleton char <> output)
    ""

So that our quasiquote from earlier now emits something like:

mconcat
  [ "Username: "
  , interpolate username
  , " | Updoots: "
  , interpolate updoots
  ]

Now we're fast and handle Unicode properly, but it means that we're kind of limited in what we can interpolate. Do we really want to provide instances for, say, Int, Integer, Float, Double, Bool, and so on? Not to mention that if someone wants to use their own type with our interpolation, they'll also have to write their own instances for Interpolatable. Wasn't our library supposed to be convenient?

It would be nice if we could simply use Show as a default, and use the specific instances if we've defined one. What happens if we just try to add an instance for anything that's Show?

{-# LANGUAGE FlexibleInstances #-}

instance Show src => Interpolatable src Text where
  interpolate = Data.Text.pack . show
instance Show src => Interpolatable src ByteString where
  interpolate = unUTF8 . convertText . show

This seems to compile fine, but when we try it out, GHC starts complaining at us:

λ> text = "abc" :: Text
λ> (interpolate text) :: ByteString

<interactive>:76:1: error:
Overlapping instances for Interpolatable Text ByteString
        arising from a use of ‘interpolate’
      Matching instances:
        instance [safe] Show src => Interpolatable src ByteString
          -- Defined at <interactive>:70:10
        instance [safe] Interpolatable Text ByteString
          -- Defined at <interactive>:44:10
In the expression: interpolate text :: ByteString
      In an equation for ‘it’: it = interpolate text :: ByteString

Hmm. Well, that's not great.

What GHC is telling us is that it doesn't know which typeclass instance to use for our call to interpolate; it could use the Interpolatable Text ByteString instance, which is probably what we want. But the Interpolatable src ByteString instance is also possible, since src could unify with Text.

We're able to describe in plain English what we want the compiler to do: only fall back to the Show-based instance if we don't have something more specific. Is there a way to explain that to the compiler? Thankfully, yes!

If we mark our instances as overlapping, GHC will attempt to pick the most specific instance when there are multiple that might work. The exact instance matching rules are explained in detail here.

instance {-# OVERLAPS #-} Interpolatable Text ByteString where
  interpolate = ...
instance {-# OVERLAPS #-} Show src => Interpolatable src ByteString where
  interpolate = ...

λ> text = "abc" :: Text
λ> (interpolate text) :: ByteString
>>> "abc"

λ> (interpolate True) :: ByteString
>>> "True"

Success!

In practice, I've found overlapping instances to be useful for exactly this sort of behavior: providing a "default" typeclass behavior that can be given specific overrides.


At this point we've sketched out the implementation of a fast, Unicode-aware interpolation backend; you could totally take this and spin off a new library with it. However, there's one last tweak we want to make to how this works for string-interpolate. Relying on the Monoid-based implementation to handle assembling the final output is okay in most cases, but would blow up if the interpolation gets really large. Both Text and ByteString have their own Builder types that are designed for efficiently constructing instances of their associated textual types. Could we turn our interpolation fragments into that intermediate type first, and then use that to output the result, instead of having to directly convert our fragments into the target type?

As it turns out, yes, but here's where the type hacking starts to get really hairy.

We need to associate with each output type what its builder type is, which sounds like a job for type families. If you're not familiar with type families, all they are are functions that operate on type terms instead of data. So we can have a function that takes in the output type and returns the builder type. One nice thing is that there's a convenient syntax for doing this right in the typeclass instance itself:

{-# LANGUAGE TypeFamilies        #-}
{-# LANGUAGE AllowAmbiguousTypes #-}

import           Data.ByteString
import qualified Data.ByteString.Lazy    as LB
import qualified Data.ByteString.Builder as B
import           Data.Text.Conversions

class InterpSink dst where
  type SinkBuilder dst :: *

  finalize :: SinkBuilder dst -> dst
  combine  :: SinkBuilder dst -> SinkBuilder dst -> SinkBuilder dst
  -- For the verbatim parts of the interpolation quasiquote
  literal  :: String -> SinkBuilder dst

class InterpSink dst => Interpolatable src dst where
  interpolate :: src -> SinkBuilder dst

instance {-# OVERLAPS #-} InterpSink ByteString where
  type SinkBuilder ByteString = B.Builder

  finalize = LB.toStrict . B.toLazyByteString
  combine  = (<>)
  literal  = B.lazyByteString . unUTF8 . convertText

Once again, though, we run into an issue when we try to define our "default" implementation:

{-# LANGUAGE FlexibleInstances    #-}
{-# LANGUAGE UndecidableInstances #-}

import Data.String
import Text.Show ( ShowS, showString )

instance {-# OVERLAPS #-} IsString dst => InterpSink dst where
  type SinkBuilder dst = ShowS

  finalize = fromString . ($ "")
  combine  = (.)
  literal  = showString

-- <interactive>:33:8: error:
--     Conflicting family instance declarations:
--       SinkBuilder ByteString = B.Builder -- Defined at <interactive>:33:8
--       SinkBuilder dst = ShowS -- Defined at <interactive>:62:8

Unfortunately, type families don't play nice with overlaps. The problem is exactly as it was with our basic overlapping instances: there are two possibilities for what the type SinkBuilder ByteString might be, and the compiler doesn't currently have the ability to decide between them. Here there doesn't seem to be any way to get around this restriction on overlapping instances. So instead, we have to figure out a way to make our instances not overlapping.

The approach I chose was to add another parameter to all the involved typeclasses purely for the purpose of making the instance heads not collide. So the typeclasses look like this:

{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE KindSignatures #-}

data SinkDefinition = Default | Specific

type family HasCustomSink dst where
  HasCustomSink ByteString = 'Specific
  HasCustomSink _          = 'Default

class (HasCustomSink dst ~ flag) => InterpSink (flag :: SinkDefinition) dst where
  type SinkBuilder flag dst :: *

  finalize :: SinkBuilder flag dst -> dst
  combine  :: SinkBuilder flag dst -> SinkBuilder flag dst -> SinkBuilder flag dst
  literal  :: String -> SinkBuilder flag dst

class (HasCustomSink dst ~ flag) => Interpolatable (flag :: SinkDefinition) src dst where
  interpolate :: src -> SinkBuilder flag dst

And once that's set up, we can write our instances using the flag to make them distinct:

instance InterpSink 'Specific ByteString where
  type SinkBuilder 'Specific ByteString = B.Builder
  ...

instance (IsString dst, HasCustomSink dst ~ 'Default)
    => InterpSink 'Default dst where
  type SinkBuilder 'Default dst = ShowS
  ...

Since the flag parameter in the instances differ, a type can't match both of these at once, solving our overlap problem. The rest is then implementing all the instances for InterpSink and Interpolatable that we want to offer.

So... success! We have our working interpolation backend, using the most efficient builder types that we can for each textual type. Unfortunately, we gave up the ability for library users to override the Show/IsString default behavior for their own types, since there's no way to define a new instance of HasCustomSink. If that was a priority, we could stick with our two-parameter version using overlapping instances and take the performance hit on more complicated interpolations.

Unfortunately, I haven't been able to figure out how to get string-interpolate to have all three of 1) usage of efficient intermediate types, 2) defaulting to Show/IsString, and 3) allowing library users to customise the behavior for their own types. Any two together seem doable, but getting all three seems not to be.

Have you run into obvious use cases that Haskell makes way harder than other languages? Do you like string-interpolate and find it useful? Talk to me!

« Previous post   Next post »

Before you close that tab...


Footnotes

↥1 We’d still have to do some finagling to get rid of the surrounding quotes for textual types, but this approach is directionally correct.