CAP API – PHP REST application
By Peter
CAP crawler is a simple Python crawler with multi-threaded execution, a shared execution queue, and CLI execution options. This project is a wrapper (API) around CAP crawler, implemented in PHP with Zend Framework 2 and Cassandra.
The project is available on GitHub; clone it 🙂
Features
- Private/public key or password based authentication
- HMAC (SHA-256) based request authentication (similar to AWS authentication)
- Chained validation (Auth<>identity<>endpoint<>request etc.)
- JSON/CSV/ZIP response options
- Stateless & scalable
- Custom error handling
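The chained validation listed above can be pictured as a series of checks where each step runs only if the previous one passed. The following is a minimal sketch of that idea in Python, not the project's actual implementation: the step names, the rules inside each step, and the dict-based request are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A step receives the request (a plain dict here) and returns an error
# message, or None when the check passes.
Step = Callable[[Dict[str, str]], Optional[str]]


def check_auth(request: Dict[str, str]) -> Optional[str]:
    # Hypothetical rule: a signature must be present.
    return None if request.get("signature") else "missing signature"


def check_identity(request: Dict[str, str]) -> Optional[str]:
    # Hypothetical rule: the user id must be known.
    known_users = {"user-1"}
    return None if request.get("user_id") in known_users else "unknown user"


def check_endpoint(request: Dict[str, str]) -> Optional[str]:
    # Hypothetical rule: only paths under /api/ are served.
    path = request.get("path", "")
    return None if path.startswith("/api/") else "unknown endpoint"


# The order matters: later steps assume the earlier ones succeeded.
CHAIN: List[Tuple[str, Step]] = [
    ("auth", check_auth),
    ("identity", check_identity),
    ("endpoint", check_endpoint),
]


def validate(request: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Run the steps in order and stop at the first failure."""
    for name, step in CHAIN:
        error = step(request)
        if error is not None:
            return name, error
    return None
```

Stopping at the first failure keeps later checks from running on a request that has already been rejected, and the failing step's name can be used to pick a suitable error response.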
Dependencies
The API is written in PHP with Zend Framework 2. The framework dependencies can be installed with Composer.
Modules can be installed separately; for more details, see the official Zend documentation.
I suggest using OPcache for better performance, or the ClassMapAutoloader feature provided by Zend.
Database
The project uses Cassandra as the default database. Several PHP packages are available, but I used the DataStax PHP driver; its documentation is listed under References & useful links.
Cassandra Dependencies, Ubuntu install
The DataStax driver can be installed through PECL, and requires additional dependencies because it is a wrapper around the DataStax C/C++ driver.
Don't forget to add the extension to php.ini ("extension=cassandra.so").
For Zend Server users on Linux, I suggest installing the driver in both PHP install locations and enabling the extension in each, for better IDE integration (NetBeans/PhpStorm).
Database structures
Cassandra: the setup may differ depending on the use case. I suggest running the following script through cqlsh.
Available Modules in API:
- Common: holds core functions and classes used by other modules, such as adapters and listeners
- User: user-based functions and endpoints
- Crawl: Wrapper around the crawling jobs
Endpoint usage examples
User registration
The endpoint does not require any authorization.
/api/user/register
Request
Response
Note: an email with the credentials is sent in the background to the email address supplied in the request.
Single job report
The endpoint requires authorization, based on the public/private keys and the payload. In this case the payload is the endpoint URI.
/api/crawl/job/report/{USER_ID}/{JOB_ID}
Request
Response
Note: the response holds details about the job; the results for the job are available through different endpoints.
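Since this endpoint is called with GET, the signed payload is the path itself. A client could prepare such a request roughly as follows. This is only a sketch under assumptions: the header names `X-Public-Key` and `X-Signature`, the hex encoding of the digest, and the helper names are hypothetical and not taken from the API.

```python
import hashlib
import hmac
from typing import Dict, Tuple


def sign_path(private_key: str, path: str) -> str:
    """Return the hex HMAC-SHA256 digest of the request path."""
    return hmac.new(
        private_key.encode("utf-8"),
        path.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


def build_report_request(
    public_key: str, private_key: str, user_id: str, job_id: str
) -> Tuple[str, Dict[str, str]]:
    """Build the path and the (hypothetical) auth headers for a job report."""
    path = f"/api/crawl/job/report/{user_id}/{job_id}"
    headers = {
        "X-Public-Key": public_key,                   # hypothetical header name
        "X-Signature": sign_path(private_key, path),  # hypothetical header name
    }
    return path, headers
```

The private key never leaves the client; only the public key and the digest travel with the request.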
Error response example
If something goes wrong, either within the application or through user error (wrong parameters, security issues etc.), a response with the following structure is returned, with an error message related to the issue.
Note: the status code is set based on the error; in the above case it is '403 – Forbidden'.
Notes
- PUT/POST/GET/HEAD HTTP methods are available on some endpoints.
- The HMAC digest is calculated on the payload, which is either the body of the request or, if there is no body (GET|HEAD), the path. Only the path is used, not the host, so a different subdomain or IP does not change the calculated digest. This makes scaling across multiple hosts and load balancing easier.
- The application can't run without the config files set properly in config/autoload/. A .dist version is supplied for each, with the required structure. Remove the .dist suffix and fill in the values accordingly.
- Errors and logs generated by the application are stored in the data/log/ folder as date-LOG-NAME.log.
- Error handling is managed on two levels. The first is a listener attached to the render events, which can handle all errors happening within the framework. The second catches language-level errors where recovery is not possible. With this in place, a controlled response can be generated even if errors occur.
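The payload rule from the HMAC note (body when present, otherwise the path) can be sketched on the verifying side as follows. This is an illustration in Python rather than the project's PHP code; the function names and the hex digest format are assumptions.

```python
import hashlib
import hmac


def select_payload(method: str, path: str, body: bytes) -> bytes:
    """Use the path for GET/HEAD or an empty body, otherwise the body."""
    if method.upper() in ("GET", "HEAD") or not body:
        return path.encode("utf-8")
    return body


def verify(private_key: bytes, method: str, path: str, body: bytes,
           received_digest: str) -> bool:
    """Recompute the HMAC-SHA256 digest and compare it in constant time."""
    payload = select_payload(method, path, body)
    expected = hmac.new(private_key, payload, hashlib.sha256).hexdigest()
    # Comparing bytes avoids a TypeError on non-ASCII input strings.
    return hmac.compare_digest(expected.encode("utf-8"),
                               received_digest.encode("utf-8"))
```

Because the host name is never part of the payload, the same digest verifies on any host that serves the path.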
Known issues
After extraction, the zipped content contains binary data, not HTML.
Currently, link crawling is only available without the "focus content" or "xpath" options, so it collects the entire HTML content. This is a limitation of the crawler.
References & useful links
AWS authentication
Hash-based message authentication code (HMAC)
Zend Framework 2
Zend modules through Composer (documentation)
ClassMapAutoloader feature provided by Zend
DataStax PHP driver
CAP crawler