FullStack Labs

Creating a Rust Web-Scraping Library That Will be Embeddable in a Ruby Module

Written by Mike Piccolo, CTO

This is part five of a series where I try to stumble my way through creating a Rust web-scraping library that will be embeddable in a Ruby module. If you are interested in starting from the beginning you can check out all my posts here: https://medium.com/@mfpiccolo

Follow along with this blog post with the part-5 branch of the scrape repo.

Last time in ARRIV, we learned how to pass structs and arrays back and forth between Rust and Ruby. The ability to work with these more advanced structures will definitely come in handy for what we are about to do. Today we are going to learn how to rust in parallel. One of the most painfully slow aspects of programming is waiting on multiple HTTP requests, so this is a great problem to tackle.

The Slow Way

To start off, we are going to implement a function that makes multiple HTTP requests one after another. Each iteration sends a request, blocks until the response arrives, does something with the response, and only then moves on to the next iteration. As you can imagine, this will be quite slow.

-- CODE language-rust keep-markup --
extern crate hyper;
extern crate time;

use hyper::Client;
use std::io::Read;

// Exported with an unmangled name so Ruby's FFI can find it.
#[no_mangle]
pub extern fn run_threads() {
    let start_time = time::now();
    for i in 0..5 {
        // A new client per request; each iteration blocks until the body is read.
        let client = Client::new();
        println!("Requesting {}", i);
        let mut response = client.get("https://wikipedia.com/").send().unwrap();
        let mut body = String::new();
        response.read_to_string(&mut body).unwrap();
        println!("BodyLength: {}", body.len());
    }
    let end_time = time::now();
    // Subtracting the two timestamps yields a Duration.
    println!("{:?}", (end_time - start_time));
}

-- CODE language-toml keep-markup --
[package]
name = "scrape"
version = "0.1.0"
authors = ["Mike Piccolo "]

[lib]
name = "scrape"
crate-type = ["dylib"]

[dependencies.hyper]
git = "https://github.com/hyperium/hyper.git"

[dependencies]
time = "0.1"

Let's break down the function we have created. First off, we are using an external crate called hyper to handle the HTTP client. To do so, we add it as a dependency in our Cargo.toml file. The function iterates over the range 0..5. Each iteration sets up a client, makes a GET request to wikipedia.com, reads the response body into a string and prints the length of that string to the console. We are also including a simple benchmark so we know how long this takes.
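The "read the body into a string" step does not depend on hyper itself, since the response only needs to implement std::io::Read. Here is a minimal sketch of that step on its own, using only the standard library. The helper name body_length is my own for illustration and is not part of the scrape repo, and it returns an error instead of calling unwrap().

```rust
use std::io::{self, Read};

/// Reads everything from `response` into a String and returns its length in bytes.
/// `body_length` is a hypothetical helper, not something from hyper or the repo.
fn body_length<R: Read>(mut response: R) -> io::Result<usize> {
    let mut body = String::new();
    // read_to_string fails if the stream errors or the bytes are not valid UTF-8.
    response.read_to_string(&mut body)?;
    Ok(body.len())
}
```

Anything that implements Read can be passed in, so an in-memory std::io::Cursor works as a stand-in for a real response when trying this out without a network connection.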

Now let's set up the Ruby interface for this function.

-- CODE language-ruby keep-markup --
require 'ffi'

module Scrape
    extend FFI::Library
    ffi_lib './target/debug/libscrape.dylib'

    attach_function :run_threads, [], :void
end

Scrape.run_threads()

A simple FFI module will allow us to call this from Ruby. Let's give that a try.

-- CODE language-bash keep-markup --
$ cargo build
$ ruby scrape.rb
Requesting 0
BodyLength: 42367
Requesting 1
BodyLength: 42367
Requesting 2
BodyLength: 42367
Requesting 3
BodyLength: 42367
Requesting 4
BodyLength: 42367
Duration { secs: 9, nanos: 235893000 }

Cool. It worked, and it took around 9.2 seconds to complete. I think we can make that significantly faster if we use threads. Let's give that a try.

The Fast Way

Now it is time to do this in parallel. To do this we will need to use Rust threads and Arc. Let's go ahead and set up the function.

-- CODE language-rust keep-markup --
extern crate hyper;
use std::sync::Arc;
use std::thread;
use hyper::Client;
use std::io::Read;
extern crate time;

#[no_mangle]
pub extern fn run_threads() {
    let start_time = time::now();
    let client = Arc::new(Client::new());
    let threads: Vec<_> = (0..5).map(|i| {
        let client = client.clone();
        thread::spawn(move || {
            println!("Requesting {}", i.to_string());
            let mut response = client.get("https://wikipedia.com").send().unwrap();
            let mut body = String::new();
            response.read_to_string(&mut body).unwrap();
            body.len().to_string()
        })
    }).collect();

    let responses: Vec<_> = threads
        .into_iter()
        .map(|thread| thread.join())
        .collect();
    println!("All threads joined. Full responses are:");
    for response in responses.into_iter() {
        println!("The response have the following lengths: {:?}", response.ok());
    }
    let end_time = time::now();
    println!("{:?}", (end_time - start_time));
}

Let's break down what we did here. On line 11 we wrap the client in Arc::new. Arc stands for atomically reference counted, and it lets several threads share ownership of one value. Calling client.clone() on line 13 inside the iteration does not copy the client; it hands out another pointer to the same client and increments the reference count. Each spawned thread needs its own handle because the closure passed to thread::spawn has to own everything it uses. The count is maintained at run time, and the client is cleaned up when the last handle is dropped.
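To watch the reference count move, here is a small standard-library sketch. FakeClient and share_client are hypothetical stand-ins I made up so the example needs no crates; the real code shares a hyper Client in the same way.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for hyper's Client (an assumption for illustration).
struct FakeClient {
    base_url: String,
}

/// Returns (handles alive before spawning, per-thread results, handles alive after joining).
fn share_client(n: usize) -> (usize, Vec<usize>, usize) {
    let client = Arc::new(FakeClient {
        base_url: String::from("https://wikipedia.com"),
    });

    // Each clone copies the pointer and bumps the counter; the client is not copied.
    let clones: Vec<Arc<FakeClient>> = (0..n).map(|_| Arc::clone(&client)).collect();
    let before = Arc::strong_count(&client); // n clones plus the original

    // Move one handle into each thread, as the closure on line 14 does.
    let threads: Vec<_> = clones
        .into_iter()
        .map(|client| thread::spawn(move || client.base_url.len()))
        .collect();

    let results: Vec<usize> = threads.into_iter().map(|t| t.join().unwrap()).collect();
    // Each thread dropped its handle when its closure returned.
    let after = Arc::strong_count(&client);
    (before, results, after)
}
```

Calling share_client(5) should report six live handles before the threads start and one after they have all been joined, which is the run-time bookkeeping that Arc does for us.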

On line 12 we are setting the local variable threads to a Vec that will be filled with thread handles. Each of these threads will make an HTTP request and return the length of the response body as a string.

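Now we can set up a Vec to hold the responses by mapping over the threads and joining them on lines 23 through 26. Calling join blocks until that thread has finished and gives back its return value wrapped in a Result.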

Lines 27 through 32 print out the response lengths and the time it took. Calling response.ok() turns each Result into an Option, which is why the output shows Some(...).
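Here is a rough sketch of what join() and ok() give back, again with only the standard library. The function join_lengths and its string inputs are made up for illustration; a thread that panics stands in for a request that fails on unwrap().

```rust
use std::thread;

/// Spawns one thread per input and collects each result as an Option.
/// `join_lengths` is a hypothetical helper, not part of the scrape repo.
fn join_lengths(bodies: Vec<&'static str>) -> Vec<Option<String>> {
    let threads: Vec<_> = bodies
        .into_iter()
        .map(|body| {
            thread::spawn(move || {
                // Simulate a failed request: an empty body makes this thread panic.
                if body.is_empty() {
                    panic!("empty body");
                }
                body.len().to_string()
            })
        })
        .collect();

    // join() returns Err if the thread panicked; ok() maps that to None.
    threads.into_iter().map(|t| t.join().ok()).collect()
}
```

A panic in one thread does not take the others down. It only shows up as an Err from that thread's join(), so one bad request would print None while the rest still print Some(...).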

We don’t need to change anything about our FFI module so we can go ahead with the compile and run.

-- CODE language-bash keep-markup --
$ cargo build
$ ruby scrape.rb
Requesting 1
Requesting 0
Requesting 2
Requesting 3
Requesting 4
All threads joined. Full responses are:
The response have the following lengths: Some("42367")
The response have the following lengths: Some("42367")
The response have the following lengths: Some("42367")
The response have the following lengths: Some("42367")
The response have the following lengths: Some("42367")
Duration { secs: 3, nanos: 933754000 }

Woohoo! That is around 2.3 times faster: about 9.2 seconds down to about 3.9.
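If you want to play with the sequential versus threaded comparison without hitting the network, here is a sketch that fakes each request with a sleep. Everything in it (fake_request, the latency value, the returned length) is an assumption for illustration, so the timings say nothing about real HTTP performance.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Stand-in for one HTTP request: sleeps to simulate network latency.
fn fake_request(latency: Duration) -> usize {
    thread::sleep(latency);
    42_367 // pretend body length
}

/// Runs `n` fake requests one after another and returns the elapsed time.
fn run_sequential(n: usize, latency: Duration) -> Duration {
    let start = Instant::now();
    for _ in 0..n {
        fake_request(latency);
    }
    start.elapsed()
}

/// Runs `n` fake requests on separate threads and returns the elapsed time.
fn run_parallel(n: usize, latency: Duration) -> Duration {
    let start = Instant::now();
    let threads: Vec<_> = (0..n)
        .map(|_| thread::spawn(move || fake_request(latency)))
        .collect();
    for t in threads {
        t.join().unwrap();
    }
    start.elapsed()
}

/// How many times faster `fast` is than `slow`.
fn speedup(slow: Duration, fast: Duration) -> f64 {
    slow.as_secs_f64() / fast.as_secs_f64()
}
```

Feeding the two durations printed above into speedup gives the "around 2.3 times" figure. With sleeps the threaded version gets close to a 5x speedup for five requests, while the real run gained less.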

Next Time

To sum up where we are so far in this series: we can receive, manipulate and return strings, numbers, structs and arrays between Ruby and Rust. We can now make HTTP requests in parallel and act on the responses. This is getting pretty close to a working HTTP scraping library. Next time we will look into what Mozilla is up to with Servo and use some external libraries for HTML parsing, similar to Nokogiri.

Special Thanks

The Rust community, for the most part, is pretty nice to newbs, so don’t be afraid to ask a Stack Overflow question or get on the Rust IRC channel. Special thanks to Stack Overflow users Adrian, shepmaster, Chris Morgan, Vladimir Matveev and DK. Also Steve Klabnik for doing a great job on the docs.

And of course, don’t hesitate to hit me up on Twitter @mfpiccolo.

Mike Piccolo

As the CTO of FullStack Labs I'm directly responsible for the success of our development teams. I oversee and manage all of our client work at a high level to ensure projects are delivered on time, on budget, and with an exceptionally high level of quality. I've designed and implemented a robust project management process used by our development teams to deliver world class results for our clients. I'm a skilled and experienced developer and have directly contributed to dozens of custom software systems built with JavaScript, Node, Ruby, Rails, Python, React, and React Native among many other technologies. Prior to FullStack Labs I was a software engineer for a leading software consultancy. I hold a Bachelor of Business Administration (B.B.A.) from California State University, San Diego.
