DenTelezhkin/Swarm as Swift Package

DenTelezhkin/Swarm 0.3.0

Simple, fast, modular Web-scrapping engine written in Swift

⭐️ 16

🕓 3 years ago

.package(url: "https://github.com/DenTelezhkin/Swarm.git", from: "0.3.0")

Swarm is fast, simple and modular web-scrapping solution written in Swift.

Features

☑ Concurrently working spider instances
☑ Automatic request repeat and slow-down when requested by the server
☑ Customizable network layer (defaults to URLSession)
☑ Depth first / breadth first options
☑ Cross-platform

Quickstart

class Scrapper: SwarmDelegate {
    lazy var swarm = Swarm(startURLs: [startingURL, anotherStartingURL], delegate: self)

    init() {
        swarm.start()
    }

    func scrappedURL(_ url: VisitedURL, nextScrappableURLs: @escaping ([ScrappableURL]) -> Void) {
        if let htmlString = url.htmlString() {
            // Scrap data from htmlString

            nextScrappableURLs([ScrappableURL(url: nextURL)])
        } else {
            nextScrappableURLs([])
        }
    }

    func scrappingCompleted() {
        print("Scrapping took: \(swarm.scrappingDuration ?? 0)")
    }
}

Requirements

Xcode 12 and higher
Swift 5.3 and higher
iOS 10 / macOS 10.12 / Linux / tvOS 10.0 / watchOS 3.0

Although, if you are doing web scrapping on an Apple Watch, your use-case must be pretty wild :)

Installation

Swift Package Manager

To install Swarm using SPM, please add the following to your Package.swift file.

dependencies: [
    ...
    .package(url: "https://github.com/DenTelezhkin/Swarm.git", .upToNextMajor(from: "0.3.0"))
],
...
targets: [
    .target(
        name: ...
        dependencies: [
            ...,
            "Swarm"
        ]
    )
]

Alternatively, use Xcode UI (Project settings -> Swift Packages), if you don't need separate package.

CocoaPods

pod 'Swarm'

Adding more urls

After initializing Swarm with starting URL's, you might add more using following method:

swarm.add(ScrappableURL(url: newURL, depth: desiredDepth))

Also, you can add more URL's to scrap as a result of scrappedURL(_:nextScrappableURLs:) delegate method callback:

nextScrappableURLs([ScrappableURL(url: nextURL)])

Please note, that there is no need to check those URL's for uniqueness, as internally URL's are stored in Set, and while scraping is in progress, visited URL's are saved in a log.

Keep in mind, that calling nextScrappableURLs closure in scrappedURL(_:nextScrappableURLs:) delegate method is required, as Swarm is waiting for all such closures to be called in order to complete web-scraping.

Configuration

SwarmConfiguration is an object passed during Swarm initialization. It has sensible defaults for all parameters, however if you need, you can modify any of them:

Success status codes
Delayed retry status codes
Delayed retry delay
Number of delayed retries before giving up
Max auto-throttling delay (Swarm will try to adhere to "Retry-After" response header, but delay will not be larger than specified in this variable)
Max authothrottled request retries before giving up
Download delay (cooldown for each of the spiders)
Max concurrent connections

Example: downloadDelay = 1, .maxConcurrentConnections = 8 equates to approximately 8 requests in a second, excluding parsing time

Depth or breadth?

Web-page can contain several links to follow, and depending on your agenda, you may want to go deeper or wider (e.g. do I want to get all items on the page first, and then load details on each of them, or vice-versa).

By default, Swarm operates as LIFO queue, ignoring depth entirely. You can, however, require depth first by setting this value in SwarmConfiguration object:

configuration.scrappingBehavior = .depthFirst

In this case, when selecting next URL to scrap, Swarm will choose ScrappableURL instance with biggest depth value. Alternatively, if .breadthFirst behavior is used, least depth url will be prioritized.

Network transport

By default, Swarm uses Foundation.URLSession as network transport for all network requests. You can customize how requests are sent by adopting a delegate method:

func spider(for url: ScrappableURL) -> Spider {
    let spider = URLSessionSpider()

    // Handle cookies instead of ignoring them
    spider.httpShouldHandleCookies = true
    spider.userAgent = .static("com.my_app.custom_user_agent")
    spider.requestModifier = {
        var request = $0
        // Modify URLRequest instance for each request
        request.timeoutInterval = 20
        return request
    }
    // Modify HTTP headers
    spider.httpHeaders["Authorization"] = "Basic dGVzdDp0ZXN0"

    return spider
}

You can also implement your own network transport, if you need, by implementing simple Spider protocol:

public protocol Spider {
    func request(url: ScrappableURL, completion: @escaping (Data?, URLResponse?, Error?) -> Void)
}

Example of such implementation can be found in unit tests for Swarm, where MockSpider is used to stub all network requests in test suite.

Using Vapor HTTPClient as network transport for Swarm

There are several reasons why you might want to use Vapor HTTPClient to send network requests. One - it works much better with proxies, and does not require a lot of workarounds. Second - you are working on specific event loops, and don't break Vapor concurrency model by dispatching to networking queues, which URLSession does.

Example of using HTTPClient for network transport in Vapor 4 can be found here.

Spider lifecycle

A picturesrc="https://raw.github.com/DenTelezhkin/Swarm/main/his case is worth a thousand words.

Managing cooldown in Vapor app using SwiftNIO

By default, Swarm uses DispatchQueue.main.asyncAfter(deadline:execute:) method to execute a delay between network requests. This is fine for a Mac app, but if you are running in server environment such as Vapor, you should not be using main thread to schedule anything. In fact, dispatching work to main thread might not even work.

In order to properly execute a cooldown in Vapor app, set cooldown on Swarm instance in a following way:

swarm.cooldown = { interval, closure in
    eventLoop.scheduleTask(in: TimeAmount.seconds(Int64(interval)), closure)
}

Being a friendly neighborhood web-scraper

With great tools comes great responsibility. You may be tempted to decrease download cooldown, increase parallelism, and dowload everything at speeds of your awesome gigabit Ethernet. Don't do that, please :)

First of all, going slower might actually mean that scrapping succeeds, and you won't get banned on servers you are trying to get data from. Also, going slower means that you will put less strain on servers, and will allow them to not slow down everyones (and yours) requests while server is processing all data requests it received.

Read scrapy Common Practices doc, which generally applies to this framework too (in principle).

To expand on avoiding getting banned section of that doc, here are some tips for Swarm usage:

Use pool of rotating user-agents:

spider.userAgent = .randomized(["com.my_app.custom_user_agent", "com.my_app.very_custom_user_agent"])

Disable cookies:

spider.httpShouldHandleCookies = false

If you are hitting retry responses, slow-down by bumping up cooldown on spiders, and reducing concurrency:

configuration.downloadDelay = 2
configuration.maxConcurrentConnections = 5

To monitor retry responses, implement following delegate method, in which you can observe why Swarm decided to delay following request:

func delayingRequest(to url: ScrappableURL, for timeInterval: TimeInterval, becauseOf response: URLResponse?) {
  // Try inspecting (response as? HTTPURLResponse)?.statusCode
  // Also: (response as? HTTPURLResponse)?.allHeaderFields
}

FAQ

I'm into some serious web-scrapping, is this framework for me?

Well, depends. This project is built with simplicity in mind, as well as working on Apple platforms first, and Linux later. It works for my use case, but if you have more complex use case, you should look at scrapy, which has much more features, as well as enterprise support.

Also, this project is written in Swift, while scrapy is written in Python, which can be both an upside and a downside, depending on where you are coming from.

Why don't you have a built-in mechanism for extracting data?

This again goes back to simplicity. You might like SwiftSoup for HTML parsing or you might like Kanna to use XPath for extracting data. Maybe you even need a headless browser to render your web-pages first. With current approach, you install Swarm, and use any parsing library you need (if any).

Is there a CLI?

No, and it's not planned. Building a Mac app is trivial nowadays with just several lines of code, thus making a need for CLI obsolete at this moment.

Is there support for Carthage package manager?

At this moment, no. There's no Xcode project in repository, and Carthage does not directly support using project generated from SwiftPM.

You can, however, work around this by following this guide.

On the reason why I'm not including Xcode project in this repo - it's personal, I'm just very tired of managing dependencies using Carthage, which I find extremely slow, and extremely likely to break. CocoaPods, while judged by some people, is extremely popular, and has been working in production for me for years with much less issues then Carthage. Also, Swift Package Manager is currently my dependency manager of choice, as it has Xcode integration and actually beats CocoaPods in compilation speeds and simplicity.

What's on the roadmap?

Depending on interest from community and my own usage of the framework, following features might be implemented:

☑ Linux support
☐ Automatic robots.txt parsing
☐ Automatic sitemap parsing
☐ Automatic link detection with domain restrictions
☐ Example usage on a server with Vapor.
☐ External storage for history of visited pages
☐ Other stuff I'm currently not aware of

License

Swarm is released under the MIT license. See LICENSE for details.

Swiftpack.co - DenTelezhkin/Swarm as Swift Package