Caterpillar cURL Multi-Get PHP Crawler

Caterpillar is a PHP class intended for website crawling and screen scraping. It handles parallel requests using a modified and wrapped version of Josh Fraser's Rolling Curl library, which uses the curl_multi() functions efficiently. Unlike most other curl_multi() implementations, where you must wait for the entire set of requests to complete before processing the batch, Rolling Curl processes each request as soon as it completes. This eliminates CPU cycles wasted on busy waiting. The library also includes a queue implementation for lining up future crawler requests, which keeps the number of links being crawled at any given time as close to the maximum as possible.

Because requests are handled in parallel, the fastest completed requests trigger the enqueuing of any newly found URLs, keeping the crawler running continuously and efficiently. Rolling Curl enforces a maximum number of simultaneous connections to ensure you do not DoS the requested host.
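The rolling-window idea above can be sketched without any networking. The function below is an illustrative simulation (not Rolling Curl's actual code): it keeps at most $max URLs "in flight," and as each one completes, newly discovered links are enqueued and the window is topped back up. The $fetch callback stands in for an HTTP request and link extraction.

```php
<?php
// Illustrative sketch of the rolling queue, assuming a $fetch callback
// that returns the links found on a page. This is NOT Rolling Curl itself.
function rollingCrawl(array $seed, int $max, callable $fetch): array
{
    $queue  = $seed;              // URLs waiting to be requested
    $active = [];                 // URLs currently "in flight"
    $done   = [];                 // completed URLs, in completion order
    $seen   = array_flip($seed);  // avoid re-enqueuing the same URL

    while ($queue || $active) {
        // top the active window back up to the concurrency cap
        while ($queue && count($active) < $max) {
            $active[] = array_shift($queue);
        }

        // simulate one request completing
        $url    = array_shift($active);
        $done[] = $url;

        // enqueue any newly discovered links immediately
        foreach ($fetch($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }

    return $done;
}

// Toy "site": the root page links to /a and /b, which link to nothing.
$fetch = fn(string $url): array => $url === '/' ? ['/a', '/b'] : [];
$order = rollingCrawl(['/'], 2, $fetch);
```

In the real library the "one request completing" step is driven by curl_multi, but the queue discipline is the same: no batch barrier, just refill as slots free up.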

Caterpillar will crawl the entirety of an internal website when given a starting URL and begin indexing (sitemapping) the pages it hits. When it encounters links on a page, it checks for their existence in the database and either inserts the link or updates its inbound link count. It also creates a contenthash to better determine when pages have been modified. Caterpillar can easily be used to facilitate the generation of a Google Sitemap XML file.
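The contenthash idea is simple to illustrate. The snippet below is a minimal sketch (not Caterpillar's exact implementation): digest the page body on each crawl, and if the stored hash differs from the fresh one, the page has changed since the last visit.

```php
<?php
// Minimal sketch of change detection via a content hash.
// Any stable digest works; md5 is cheap and fine for change detection
// (this is a checksum, not a security use).
function contentHash(string $html): string
{
    return md5($html);
}

$stored   = contentHash('<html><body>v1</body></html>'); // hash from last crawl
$fresh    = contentHash('<html><body>v2</body></html>'); // hash of current fetch
$modified = ($stored !== $fresh); // true => re-index and bump last_update
```

A hash comparison like this is what lets fields such as last_update stay meaningful even when the server does not send reliable Last-Modified headers.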


Usage

Caterpillar requires a small amount of legwork on your part to get up and running due to the necessity for data storage in MySQL. Note that crawling a website can be a memory-intensive activity. For that reason, you are advised to bump up the PHP memory_limit to suit your needs.

  1. Import the caterpillar.sql file into the database of your choice.
  2. Copy the library into your application and include it.
  3. Your database user will need CREATE TABLE, DROP TABLE, and TEMPORARY TABLES privileges.
  4. Using the example below, plug in your own configuration parameters:
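A usage sketch along these lines is shown below. The constructor parameters here are assumptions based on the description above (a starting URL plus MySQL credentials); consult the Caterpillar class itself for the actual signature.

```php
<?php
// Hypothetical usage sketch -- parameter names and order are assumptions;
// check the Caterpillar class for the real constructor signature.
require_once 'caterpillar.php';

// Crawling can be memory intensive; raise the limit to suit your site.
ini_set('memory_limit', '256M');

$crawler = new Caterpillar(
    'http://www.example.com/', // starting URL of the internal site to crawl
    'db_user',                 // MySQL user with CREATE TABLE, DROP TABLE,
    'db_pass',                 // and TEMPORARY TABLES privileges
    'db_name',                 // database where caterpillar.sql was imported
    'localhost'                // MySQL host
);

// Begin crawling and indexing from the starting URL.
$crawler->crawl();
```

After a crawl completes, the populated crawl index table can be queried to emit a Google Sitemap XML file.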

Downloads

Downloads are available via github. The decision is all yours:

  • git clone git@github.com:cballou/caterpillar.git
  • git clone https://github.com/cballou/caterpillar.git
  • wget https://github.com/cballou/caterpillar/archive/master.zip
  • wget https://github.com/cballou/caterpillar/archive/master.tar.gz

Support

If you have any problems with Caterpillar, please file an issue on GitHub and I will attempt to address it at my earliest convenience.

Caterpillar Issues on GitHub

License

Caterpillar is licensed under the MIT License.

The MIT License is simple and easy to understand, and it places almost no restrictions on what you can do with Caterpillar.

You are free to use Caterpillar in commercial projects as long as any copyright headers and license file are left intact.

Changelog

  • Nov 10, 2012
    Added the MIT License for clarity.
  • May 13, 2010
    Committed a large set of optimizations to the Caterpillar library.
    • Implemented a temporary MySQL table for storing the inbound link counts.
    • Created an internal page cache that flushes inbound link counts to the new temp table every 1000 pages (batch insert).
    • Added an internal cache lookup for pages already requested to avoid hitting the database.
    • Added a significant number of fields to the crawl index table:
      • contenthash - The hashed page content for checksumming.
      • filesize - The page filesize.
      • last_update - Timestamp used for deletion of removed pages after 2 weeks missing.
      • last_tested - The last time the page was crawled.
    • Decreased the cURL timeouts to 15 seconds.
  • Jan 13, 2010
    Initial commit of Caterpillar on GitHub.

About the Author

Corey is a full-stack web applications developer in Charlotte, NC, with 9+ years of professional experience. He holds a bachelor's degree in Computer Science and has been working remotely since 2012. He specializes in LAMP/LEMP stack development with Laravel and WordPress. Corey is the owner and principal consultant at Craft Blue, a custom web applications development consultancy. He's also the co-organizer of the Queen City PHP meetup group in Charlotte. He is an entrepreneur, blogger, open source contributor, beer lover, startup advocate, chicken wrangler, hydroponics gardening dabbler, and homebrewer.

Corey works with agencies, startups, and businesses.

Contact Corey to see how Craft Blue can help you.