Swiftpack.co - pointfreeco/swift-parsing as Swift Package

Swiftpack.co is a collection of thousands of indexed Swift packages. Search packages.
See all packages published by pointfreeco.
pointfreeco/swift-parsing 0.4.1
A library for turning nebulous data into well-structured data, with a focus on composition, performance, and generality.
⭐️ 387
🕓 1 week ago
.package(url: "https://github.com/pointfreeco/swift-parsing.git", from: "0.4.1")

swift-parsing

CI

A library for turning nebulous data into well-structured data, with a focus on composition, performance, and generality:

  • Composition: Ability to break large, complex parsing problems down into smaller, simpler ones. And the ability to take small, simple parsers and easily combine them into larger, more complex ones.

  • Performance: Parsers that have been composed of many smaller parts should perform as well as highly-tuned, hand-written parsers.

  • Generality: Ability to parse any kind of input into any kind of output. This allows you to choose which abstraction levels you want to work on based on how much performance you need or how much correctness you want guaranteed. For example, you can write a highly tuned parser on collections of UTF-8 code units, and it will automatically plug into parsers of strings, arrays, unsafe buffer pointers and more.

Motivation
Getting started
Design
Benchmarks
Documentation
Other libraries
License

Learn More

This library was designed over the course of many episodes on Point-Free, a video series exploring functional programming and the Swift language, hosted by Brandon Williams and Stephen Celis. You can watch all of the episodes here.

video poster image

Motivation

Parsing is a surprisingly ubiquitous problem in programming. We can define parsing as trying to take a more nebulous blob of data and transform it into something more well-structured. The Swift standard library comes with a number of parsers that we reach for every day. For example, there are initializers on Int, Double, and even Bool, that attempt to parse numbers and booleans from strings:

Int("42")         // 42
Int("Hello")      // nil

Double("123.45")  // 123.45
Double("Goodbye") // nil

Bool("true")      // true
Bool("0")         // nil

And there are types like JSONDecoder and PropertyListDecoder that attempt to parse Decodable-conforming types from data:

try JSONDecoder().decode(User.self, from: data)
try PropertyListDecoder().decode(Settings.self, from: data)

While parsers are everywhere in Swift, Swift has no holistic story for parsing. Instead, we typically parse data in an ad hoc fashion using a number of unrelated initializers, methods, and other means. And this typically leads to less maintainable, less reusable code.

This library aims to write such a story for parsing in Swift. It introduces a single unit of parsing that can be combined in interesting ways to form large, complex parsers that can tackle the programming problems you need to solve in a maintainable way.

Getting started

Suppose you have a string that holds some user data that you want to parse into an array of Users:

var input = """
1,Blob,true
2,Blob Jr.,false
3,Blob Sr.,true
"""

struct User {
  var id: Int
  var name: String
  var isAdmin: Bool
}

A naive approach to this would be a nested use of .split(separator:), and then a little bit of extra work to convert strings into integers and booleans:

let users = input
  .split(separator: "\n")
  .compactMap { row -> User? in
    let fields = row.split(separator: ",")
    guard
      fields.count == 3,
      let id = Int(fields[0]),
      let isAdmin = Bool(String(fields[2]))
    else { return nil }

    return User(id: id, name: String(fields[1]), isAdmin: isAdmin)
  }

Not only is this code a little messy, but it is also inefficient since we are allocating arrays for the .split and then just immediately throwing away those values.

It would be more straightforward and efficient to instead describe how to consume bits from the beginning of the input and convert that into users. This is what this parser library excels at 😄.

We can start by describing what it means to parse a single row, first by parsing an integer off the front of the string, and then parsing a comma that we discard using the .skip operator:

let user = Int.parser()
  .skip(",")

Already this can consume the beginning of the input:

user.parse(&input) // => 1
input // => "Blob,true\n2,Blob Jr.,false\n3,Blob Sr.,true"

Next we want to take everything up until the next comma for the user's name, and then skip the comma:

let user = Int.parser()
  .skip(",")
  .take(Prefix { $0 != "," })
  .skip(",")

Here the .take operator has combined parsed values together into a tuple, (Int, Substring).

And then we want to take the boolean at the end of the row for the user's admin status:

let user = Int.parser()
  .skip(",")
  .take(Prefix { $0 != "," })
  .skip(",")
  .take(Bool.parser())

Currently this will parse a tuple (Int, Substring, Bool) from the input, and we can .map on that to turn it into a User:

let user = Int.parser()
  .skip(",")
  .take(Prefix { $0 != "," })
  .skip(",")
  .take(Bool.parser())
  .map { User(id: $0, name: String($1), isAdmin: $2) }

That is enough to parse a single user from the input string:

user.parse(&input) // => User(id: 1, name: "Blob", isAdmin: true)
input // => "\n2,Blob Jr.,false\n3,Blob Sr.,true"

To parse multiple users from the input we can use the Many parser:

let users = Many(user, separator: "\n")

users.parse(&input) // => [User(id: 1, name: "Blob", isAdmin: true), ...]
input // => ""

Now this parser can process an entire document of users, and the code is simpler and more straightforward than the version that uses .split and .compactMap.

Even better, it's more performant. We've written benchmarks for these two styles of parsing, and the .split-style of parsing is more than twice as slow:

name                             time        std        iterations
------------------------------------------------------------------
README Example.Parser: Substring 3426.000 ns ±  63.40 %     385395
README Example.Adhoc             7631.000 ns ±  47.01 %     169332
Program ended with exit code: 0

Further, if you are willing write your parsers against UTF8View instead of Substring, you can eke out even more performance, more than doubling the speed:

name                             time        std        iterations
------------------------------------------------------------------
README Example.Parser: Substring 3693.000 ns ±  81.76 %     349763
README Example.Parser: UTF8      1272.000 ns ± 128.16 %     999150
README Example.Adhoc             8504.000 ns ±  59.59 %     151417

We can also compare these times to a tool that Apple's Foundation gives us: Scanner. It's a type that allows you to consume from the beginning of strings in order to produce values, and provides a nicer API than using .split:

var users: [User] = []
while scanner.currentIndex != input.endIndex {
  guard
    let id = scanner.scanInt(),
    let _ = scanner.scanString(","),
    let name = scanner.scanUpToString(","),
    let _ = scanner.scanString(","),
    let isAdmin = scanner.scanBool()
  else { break }

  users.append(User(id: id, name: name, isAdmin: isAdmin))
  _ = scanner.scanString("\n")
}

However, the Scanner style of parsing is more than 5 times as slow as the substring parser written above, and more than 15 times slower than the UTF-8 parser:

name                             time         std        iterations
-------------------------------------------------------------------
README Example.Parser: Substring  3481.000 ns ±  65.04 %     376525
README Example.Parser: UTF8       1207.000 ns ± 110.96 %    1000000
README Example.Adhoc              8029.000 ns ±  44.44 %     163719
README Example.Scanner           19786.000 ns ±  35.26 %      62125

That's the basics of parsing a simple string format, but there's a lot more operators and tricks to learn in order to performantly parse larger inputs. View the benchmarks for examples of real life parsing scenarios.

Design

Protocol

The design of the library is largely inspired by the Swift standard library and Apple’s Combine framework. A parser is represented as a protocol that many types conform to, and then parser transformations (also known as “combinators”) are methods that return concrete types conforming to the parser protocol.

For example, to parse all the characters from the beginning of a substring until you encounter a comma you can use the Prefix parser:

let parser = Prefix { $0 != "," }

var input = "Hello,World"[...]
parser.parse(&input) // => "Hello"
input // => ",World"

The type of this parser is:

Prefix<Substring>

We can .map on this parser in order to transform its output, which in this case is the string "Hello":

let parser = Prefix { $0 != "," }
  .map { $0 + "!!!" }

var input = "Hello,World"[...]
parser.parse(&input) // => "Hello!!!"
input // => ",World"

The type of this parser is now:

Parsers.Map<Prefix<Substring>, Substring>

Notice how the type of the parser encodes the operations that we performed. This adds a bit of complexity when using these types, but comes with some performance benefits because Swift can usually optimize the creation of those nested types.

Low-level versus high-level

The library makes it easy to choose which abstraction level you want to work on. Both low-level and high-level have their pros and cons.

Parsing low-level inputs, such as UTF-8 code units, has better performance, but at the cost of potentially losing correctness. The most canonical example of this is trying to parse the character "é", which can be represented in code units as [233] or [101, 769]. If you don't remember to always parse both representations you may have a bug where you accidentally fail your parser when it encounters a code unit sequence you don't support.

On the other hand, parsing high-level inputs, such as String, can guarantee correctness, but at the cost of performance. For example, String handles the complexities of extended grapheme clusters and UTF-8 normalization for you, but traversing strings is slower since its elements are variable width.

The library gives you the tools that allow you to choose which abstraction level you want to work on, as well as the ability to fluidly move between abstraction levels where it makes sense.

For example, say we want to parse particular city names from the beginning of a string:

enum City {
  case london
  case newYork
  case sanJose
}

Because "San José" has an accented character, the safest way to parse it is to parse on the Substring abstraction level:

let city = StartsWith("London").map { City.london }
  .orElse(StartsWith("New York").map { .newYork })
  .orElse(StartsWith("San José").map { .sanJose })

var input = "San José,123"
city.parse(&input) // => City.sanJose
input // => ",123"

However, we are incurring the cost of parsing Substring for this entire parser, even though only the "San José" case needs that power. We can refactor this parser so that "London" and "New York" are parsed on the UTF8View level, since they consist of only ASCII characters, and then parse "San José" as Substring:

let city = StartsWith("London".utf8).map { City.london }
  .orElse(StartsWith("New York".utf8).map { .newYork })
  .orElse(StartsWith("San José").utf8.map { .sanJose })

It's subtle, but StartsWith("London".utf8) is a parser that parses the code units for "London" from the beginning of a UTF8View, whereas StartsWith("San José").utf8 parses "San José" as a Substring, and then converts that into a UTF8View parser.

This allows you to parse as much as possible on the more performant, low-level UTF8View, while still allowing you to parse on the more correct, high-level Substring when necessary.

Benchmarks

This library comes with a benchmark executable that not only demonstrates the performance of the library, but also provides a wide variety of parsing examples:

These are the times we currently get when running the benchmarks:

MacBook Pro (16-inch, 2021)
Apple M1 Pro (10 cores, 8 performance and 2 efficiency)
32 GB (LPDDR5)

name                                         time            std        iterations
----------------------------------------------------------------------------------
Arithmetic.Parser                                1000.000 ns ±  15.80 %    1000000
BinaryData.Parser                                 250.000 ns ±  34.95 %    1000000
Bool.Bool.init                                      0.000 ns ±    inf %    1000000
Bool.BoolParser                                    42.000 ns ±  60.93 %    1000000
Bool.Scanner.scanBool                             584.000 ns ±  16.36 %    1000000
Color.Parser                                      208.000 ns ±  25.44 %    1000000
CSV.Parser                                    1523875.000 ns ±   0.84 %        922
CSV.Ad hoc mutating methods                    863479.000 ns ±   2.49 %       1656
Date.Parser                                      5875.000 ns ±   6.55 %     241380
Date.DateFormatter                              24166.000 ns ±   3.50 %      56837
Date.ISO8601DateFormatter                       32625.000 ns ±   3.53 %      42868
HTTP.HTTP                                        4875.000 ns ±   5.85 %     286041
JSON.Parser                                      6167.000 ns ±   3.99 %     224290
JSON.JSONSerialization                           1708.000 ns ±  11.68 %     802207
Numerics.Int.init                                  41.000 ns ±  69.46 %    1000000
Numerics.Int.parser                                42.000 ns ±  60.40 %    1000000
Numerics.Scanner.scanInt                          125.000 ns ±  33.94 %    1000000
Numerics.Comma separated: Int.parser          5002166.000 ns ±   0.75 %        279
Numerics.Comma separated: Scanner.scanInt    52291583.000 ns ±   0.43 %         27
Numerics.Comma separated: String.split       13989062.000 ns ±   0.86 %        100
Numerics.Double.init                               42.000 ns ± 103.43 %    1000000
Numerics.Double.parser                             84.000 ns ±  57.35 %    1000000
Numerics.Scanner.scanDouble                       167.000 ns ±  30.26 %    1000000
Numerics.Comma separated: Double.parser      12062791.500 ns ±   1.03 %        116
Numerics.Comma separated: Scanner.scanDouble 54031625.000 ns ±   0.39 %         26
Numerics.Comma separated: String.split       18210792.000 ns ±   0.68 %         77
PrefixUpTo.Parser                               13292.000 ns ±   3.85 %     104936
PrefixUpTo.Scanner.scanUpToString               97500.000 ns ±   1.22 %      14267
Race.Parser                                     27041.000 ns ±   3.84 %      51576
README Example.Parser: Substring                 2375.000 ns ±  11.91 %     583192
README Example.Parser: UTF8                       958.000 ns ±  14.26 %    1000000
README Example.Adhoc                             3334.000 ns ±   7.75 %     412681
README Example.Scanner                          14708.000 ns ±   5.34 %      94978
Routing.Parser                                   3375.000 ns ±   7.23 %     409775
String Abstractions.Substring                  613292.000 ns ±   0.85 %       2273
String Abstractions.UTF8                        58125.000 ns ±   1.78 %      23918
UUID.UUID.init                                    209.000 ns ±  30.39 %    1000000
UUID.UUIDParser                                   375.000 ns ±  20.14 %    1000000
Xcode Logs.Parser                             3340625.000 ns ±   0.97 %        417

Documentation

The latest documentation for swift-parsing is available here.

Other libraries

There are a few other parsing libraries in the Swift community that you might also be interested in:

License

This library is released under the MIT license. See LICENSE for details.

GitHub

link
Stars: 387
Last commit: 16 minutes ago
jonrohan Something's broken? Yell at me @ptrpavlik. Praise and feedback (and money) is also welcome.

Related Packages

Release Notes

0.4.1
2 weeks ago
  • Changed: Many and OptionalParser's Upstream generic has been deprecated and renamed to Element and Wrapped.
  • Infrastructure: benchmarks and documentation cleanup.

Swiftpack is being maintained by Petr Pavlik | @ptrpavlik | @swiftpackco | API | Analytics