
crawld: a data crawler and repositories fetcher


crawld is a metadata crawler and source code repository fetcher. Hence, crawld comprises two different parts: the crawlers and the fetcher.

Crawlers

crawld focuses on crawling metadata of repositories hosted on code sharing platforms such as GitHub, along with metadata of the users who contributed, or are directly related, to those repositories.

Only a GitHub crawler is currently implemented. However, crawld's architecture is designed so that new crawlers (for instance, a BitBucket crawler) can be added without hassle.
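
To give an idea of what such pluggability can look like, here is a minimal sketch in Go; the Crawler interface shown is hypothetical, for illustration only, not crawld's actual API:

package crawlers

// Crawler is a hypothetical interface illustrating how platform
// crawlers can be made pluggable: a BitBucket crawler would only
// need to satisfy it to run alongside the GitHub one.
type Crawler interface {
    // Crawl collects repository and user metadata from the
    // platform and stores it into the database.
    Crawl() error
}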

All of the collected metadata is stored in a PostgreSQL database. As crawld is designed to crawl several code sharing platforms, common information is stored in two tables: users and repositories. The remaining, platform-specific information goes into dedicated tables (gh_repositories, gh_users and gh_organizations for now), which are related to the users and repositories tables.
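
To make the relation concrete, here is a minimal sketch of the two repository tables as Go structs; the column selection comes from the overview below, while the struct names and field types are assumptions for illustration:

package model

// Repository mirrors the common repositories table shared by all
// code sharing platforms.
type Repository struct {
    ID              int64
    Name            string
    PrimaryLanguage string
    CloneURL        string
}

// GhRepository mirrors the GitHub-specific gh_repositories table,
// which references its generic counterpart.
type GhRepository struct {
    GitHubID     int64
    RepositoryID int64 // relation to the repositories table
    FullName     string
    Description  string
    // ... remaining GitHub-specific columns from the overview below
}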

The overview below lists what is collected for each entity. Bear in mind that some information might be incomplete (for instance, if a user does not provide any company information).

Repository: Name, Primary language, Clone URL

GitHub Repository: GitHub ID, Full name, Description, Homepage, Fork, Default branch, Master branch, HTML URL, Forks count, Open issues count, Stargazers count, Subscribers count, Watchers count, Size, Creation date, Update date, Last push date

User: Username, Name, Email

GitHub User: GitHub ID, Login, Bio, Blog, Company, Email, Hireable, Location, Avatar URL, HTML URL, Followers count, Following count, Collaborators count, Creation date, Update date

GitHub Organization: GitHub ID, Login, Avatar URL, HTML URL, Name, Company, Blog, Location, Email, Collaborators count, Creation date, Update date

Fetcher

Aside from crawling metadata, crawld can clone and update repositories, using the clone URLs stored in the database.

Cloning and updating are designed to work regardless of the source code management system in use (git, mercurial, svn, …); however, only a git fetcher is currently implemented.
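
For the git case, the clone-or-update logic boils down to something like the following sketch; the exact git2go calls and import path depend on the git2go/libgit2 versions in use, and error handling is reduced to a minimum:

package main

import (
    "log"

    git "github.com/libgit2/git2go"
)

// cloneOrUpdate clones the repository if no local copy exists yet;
// otherwise it fetches updates from the "origin" remote.
func cloneOrUpdate(cloneURL, path string) error {
    repo, err := git.OpenRepository(path)
    if err != nil {
        // No local copy yet: perform a fresh clone.
        _, err = git.Clone(cloneURL, path, &git.CloneOptions{})
        return err
    }
    defer repo.Free()

    remote, err := repo.Remotes.Lookup("origin")
    if err != nil {
        return err
    }
    defer remote.Free()

    // Passing no refspecs fetches those configured for the remote.
    return remote.Fetch(nil, nil, "")
}

func main() {
    if err := cloneOrUpdate("https://github.com/DevMine/crawld.git", "crawld"); err != nil {
        log.Fatal(err)
    }
}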

As source code repositories usually contain many files, crawld offers an option to store repositories as tar archives, which is much easier on the file system should you clone a huge number of repositories.
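
The archiving itself needs nothing beyond the Go standard library; a minimal sketch (simplified error handling, regular files only) could look like this:

package main

import (
    "archive/tar"
    "io"
    "os"
    "path/filepath"
)

// tarDir writes the contents of dir into a single tar archive at dst,
// so the file system deals with one file per repository instead of
// thousands.
func tarDir(dir, dst string) error {
    out, err := os.Create(dst)
    if err != nil {
        return err
    }
    defer out.Close()

    tw := tar.NewWriter(out)
    defer tw.Close()

    return filepath.Walk(dir, func(path string, fi os.FileInfo, err error) error {
        if err != nil || !fi.Mode().IsRegular() {
            return err
        }
        hdr, err := tar.FileInfoHeader(fi, "")
        if err != nil {
            return err
        }
        // Store paths relative to the repository root.
        if hdr.Name, err = filepath.Rel(dir, path); err != nil {
            return err
        }
        if err := tw.WriteHeader(hdr); err != nil {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        _, err = io.Copy(tw, f)
        return err
    })
}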

Installation

crawld uses git2go, a Go binding to libgit2, for its git operations. Hence, libgit2 needs to be installed on your system unless you statically compile it with the git2go package.
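
On Debian or Ubuntu, for instance, libgit2 is typically available as a distribution package:

apt-get install libgit2-dev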

To install crawld, run this command in a terminal, assuming Go is installed:

go get github.com/DevMine/crawld

Or you can download a binary for your platform from the DevMine project’s downloads page.

You also need to set up a PostgreSQL database. Look at the README file in the db sub-folder for details.

Usage and configuration

Copy crawld.conf.sample to crawld.conf and edit it according to your needs. The configuration file is divided into several sections, covering, among other things, the database connection, the crawlers and the fetcher.

Once the configuration file has been adjusted, you are ready to run crawld. Specify the path to the configuration file with the -c option. Example:

crawld -c crawld.conf

Some command line options are also available, mainly to choose where log files are stored and to disable the data crawlers or the repositories fetcher (by default, the crawlers and the fetcher run in parallel). See crawld -h for more information.

Internals

Internally, crawld consists of two main parts: the crawlers (only GitHub for now) and the repositories fetcher (which only supports git for now).

Crawled metadata is stored in the database, whereas repositories are cloned onto physical storage.

Note that the internal architecture of crawld has been designed to make it easy to implement new crawlers or VCS backends.
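
For illustration, a new VCS backend could plug into the fetcher along similar lines; as before, this interface is a hypothetical sketch rather than crawld's actual API:

package repo

// VCS is a hypothetical interface sketching how additional version
// control backends (mercurial, svn, ...) could plug into the fetcher.
type VCS interface {
    // Clone makes a local copy of the repository at cloneURL.
    Clone(cloneURL, path string) error
    // Update brings an existing local copy up to date.
    Update(path string) error
}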

[Figure: crawld internals]