Caterpillar is a PHP class intended for website crawling and screen scraping. It handles parallel requests using a modified and wrapped version of Josh Fraser's Rolling cURL library, which makes efficient use of PHP's curl_multi functions.
Unlike most other curl_multi implementations, where you must wait for the whole set of requests to complete before processing the batch, Rolling cURL processes each request as soon as it has completed. This eliminates wasted CPU cycles due to busy waiting. The library also has a queue implementation for lining up future crawler requests, which keeps the number of links being crawled at any given time as close to the maximum as possible.
Because requests are handled in parallel, the fastest completed requests trigger the enqueuing of any newly found URLs, so the crawler runs continuously and efficiently. Rolling cURL caps the number of simultaneous connections so that you do not subject the requested host to an accidental denial-of-service (DoS) attack.
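The rolling-window idea described above can be sketched with PHP's curl_multi functions. This is a minimal illustration, not the code shipped in Rolling cURL or Caterpillar: the function name rollingFetch, its parameters, and the convention that the callback returns newly discovered URLs are all assumptions made for the example.

```php
<?php
// Hypothetical sketch of a rolling window over curl_multi; not the library's actual code.
// Each finished transfer is handed to $onComplete right away, and the callback may
// return newly discovered URLs, which are appended to the queue.
function rollingFetch(array $urls, int $maxConcurrent, callable $onComplete): int
{
    $maxConcurrent = max(1, $maxConcurrent);
    $queue  = array_values($urls); // URLs waiting for a free slot
    $multi  = curl_multi_init();
    $active = 0;                   // transfers currently in flight
    $done   = 0;                   // transfers completed so far

    // Move URLs from the queue into the multi handle until the window is full.
    $fill = function () use (&$queue, &$active, $multi, $maxConcurrent) {
        while ($active < $maxConcurrent && $queue) {
            $ch = curl_init(array_shift($queue));
            curl_setopt_array($ch, [
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_FOLLOWLOCATION => true,
                CURLOPT_TIMEOUT        => 30,
            ]);
            curl_multi_add_handle($multi, $ch);
            $active++;
        }
    };

    $fill();
    while ($active > 0) {
        curl_multi_exec($multi, $stillRunning);

        // Handle every transfer that has finished since the last pass.
        while ($info = curl_multi_info_read($multi)) {
            $ch   = $info['handle'];
            $url  = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
            $body = (string) curl_multi_getcontent($ch);
            curl_multi_remove_handle($multi, $ch);
            unset($ch);
            $active--;
            $done++;

            foreach ((array) $onComplete($url, $body) as $newUrl) {
                $queue[] = $newUrl;
            }
            $fill(); // refill the freed slot immediately
        }

        // Block until there is socket activity instead of spinning.
        if ($active > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000);
        }
    }

    return $done;
}
```

A caller would pass a starting URL, a connection cap, and a callback that parses the body and returns the links it found. A real crawler would also need to skip URLs it has already seen, which this sketch leaves out.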
Caterpillar will crawl the entirety of an internal website when given a starting URL and index (sitemap) the pages it hits. When it encounters a link on a page, it checks for the link's existence in the database and either inserts the link or updates its inbound link count. It also creates a contenthash for each page to better determine when the page was last modified. Caterpillar can easily be used to facilitate the generation of a Google Sitemap XML file.
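The bookkeeping described above can be illustrated with a small in-memory sketch. Caterpillar itself stores this data in MySQL; the array below is only a stand-in for that table. The function names, the inbound_count key, and the choice of md5() for the contenthash are assumptions for illustration, not details confirmed by the project.

```php
<?php
// Hypothetical in-memory stand-in for the MySQL table, keyed by URL.

// Insert a newly seen link, or bump the inbound link count of a known one.
function recordLink(array &$index, string $url): void
{
    if (isset($index[$url])) {
        $index[$url]['inbound_count']++;
    } else {
        $index[$url] = ['inbound_count' => 1, 'contenthash' => null, 'filesize' => 0];
    }
}

// Store the page's hash and size; returns true when the content differs
// from what was recorded on the previous crawl.
function recordContent(array &$index, string $url, string $body): bool
{
    if (!isset($index[$url])) {
        $index[$url] = ['inbound_count' => 0, 'contenthash' => null, 'filesize' => 0];
    }
    $hash    = md5($body); // assumed hash function
    $changed = $index[$url]['contenthash'] !== $hash;
    $index[$url]['contenthash'] = $hash;
    $index[$url]['filesize']    = strlen($body);
    return $changed;
}
```

Comparing hashes rather than full page bodies keeps the stored row small while still detecting when a page has been modified.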
Caterpillar requires a small amount of legwork on your part to get up and running because it stores its data in MySQL. Note that crawling a website can be a memory-intensive activity. For that reason, you are advised to bump up the PHP memory_limit to suit your needs.
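One way to raise the limit is at the top of the script that runs the crawl, using PHP's ini_set(). The 256M value here is only an assumed example; choose a value that fits the size of the site being crawled.

```php
<?php
// Assumed value for illustration; adjust to suit the site being crawled.
$previous = ini_set('memory_limit', '256M');

if ($previous === false) {
    // ini_set() returns false when the setting could not be changed.
    error_log('Could not raise memory_limit');
}
```

The same directive can also be set in php.ini if you prefer not to change it per script.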
Import the caterpillar.sql file into the database of your choice.
DROP TABLE, and
Downloads are available via GitHub. The decision is all yours:
git clone git@github.com:cballou/caterpillar.git
git clone https://github.com/cballou/caterpillar.git
If you have any problems with Caterpillar, please file a ticket/issue/bug under Caterpillar Issues on GitHub and I will attempt to address it at my earliest convenience.
Caterpillar is licensed under the MIT License.
The MIT License is simple and easy to understand and it places almost no restrictions on what you can do with Caterpillar.
You are free to use Caterpillar in commercial projects as long as any copyright headers and license file are left intact.
contenthash - The hashed page content, used as a checksum.
filesize - The page file size.
last_update - Timestamp used to delete removed pages once they have been missing for 2 weeks.
last_tested - The last time the page was crawled.
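The two-week cleanup implied by last_update could look roughly like the sketch below. It assumes that last_update is a Unix timestamp refreshed whenever the crawler still finds the page, and it works on an in-memory array rather than the MySQL table. The function name findStalePages and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: returns the URLs whose last_update is older than
// $maxAgeDays (14 by default) relative to $now. Timestamps are assumed
// to be Unix timestamps in seconds.
function findStalePages(array $pages, int $now, int $maxAgeDays = 14): array
{
    $cutoff = $now - $maxAgeDays * 86400; // 86400 seconds per day
    $stale  = [];
    foreach ($pages as $url => $row) {
        if ($row['last_update'] < $cutoff) {
            $stale[] = $url;
        }
    }
    return $stale;
}
```

The returned URLs are the candidates for deletion; pages seen within the window are left alone.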