docker-pdftools

Docker image for PdfTools

Scientific articles are typically locked away in PDF, a format designed primarily for printing and poorly suited to searching or indexing. The pdftools package extracts text and metadata from PDF files in R. From the extracted plain text, one can find articles that discuss a particular drug or species name without relying on publisher-provided metadata or paywalled search engines.

You can see the CLI reference here.

Usage

You can run awscli to manage your AWS services.

aws iam list-users
aws s3 cp /tmp/foo/ s3://bucket/ --recursive --exclude "*" --include "*.jpg"
aws sts assume-role --role-arn arn:aws:iam::123456789012:role/xaccounts3access --role-session-name s3-access-example

Pull latest image

docker pull cardboardci/pdftools

Test interactively

docker run -it cardboardci/pdftools /bin/bash

Run basic AWS command

docker run -it -v "$(pwd)":/workspace cardboardci/pdftools aws s3 cp file.txt s3://bucket/file.txt

Run AWS CLI with custom profile

docker run -it -v "$(pwd)":/workspace -v "$HOME/.aws/":/cardboardci/.aws/ cardboardci/pdftools aws s3 cp file.txt s3://bucket/file.txt

Continuous Integration Services

For each of the following services, you can see an example of this image in that environment:

Tagging Strategy

Every new release of the image is published under three tags: latest, version, and version-date. They are defined as follows:

  • latest: The most-recently released version of an image. (cardboardci/pdftools:latest)
  • <version>: The most-recently released version of an image for that version of the tool. (cardboardci/pdftools:1.0.0)
  • <version-date>: The version of the tool released on a specific date. (cardboardci/pdftools:1.0.0-20190101)
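As a rough illustration of the three tag shapes, the sketch below classifies a tag string. The classify_tag helper is hypothetical (it is not part of the image), and it assumes the date suffix is always eight digits, as in the example above.

```shell
#!/bin/sh
# classify_tag: hypothetical helper, not shipped with the image.
# Assumes a version-date tag ends in "-" followed by eight digits (YYYYMMDD).
classify_tag() {
  case "$1" in
    latest) echo "latest" ;;
    *-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) echo "version-date" ;;
    *) echo "version" ;;
  esac
}

# Print the category for one example of each tag shape.
for tag in latest 1.0.0 1.0.0-20190101; do
  printf '%s -> %s\n' "$tag" "$(classify_tag "$tag")"
done
```

Only the version-date form identifies a single release, which is why it is the tag to pin when a digest is not practical.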

We recommend referencing the Docker image by its digest, or pinning to the version-date tag. If you are unsure how to get the digest, you can retrieve it for any image with the following commands:

docker pull cardboardci/pdftools:latest
docker inspect --format='{{index .RepoDigests 0}}' cardboardci/pdftools:latest
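The inspect command above prints a reference of the form repository@sha256:<digest>, which can be used anywhere a tag would be. As a minimal sketch, a CI script could check that an image reference is pinned by digest before using it. The is_digest_ref helper is hypothetical, and the all-zero digest is a placeholder rather than a real image digest.

```shell
#!/bin/sh
# is_digest_ref: hypothetical helper, not shipped with the image.
# Succeeds when the reference has the form <repository>@sha256:<64 hex chars>.
is_digest_ref() {
  printf '%s\n' "$1" | grep -Eq '^[^@]+@sha256:[0-9a-f]{64}$'
}

# Placeholder digest (64 zeros); substitute the value printed by docker inspect.
placeholder="cardboardci/pdftools@sha256:$(printf '%064d' 0)"

for ref in "$placeholder" "cardboardci/pdftools:latest"; do
  if is_digest_ref "$ref"; then
    echo "pinned: $ref"
  else
    echo "not pinned: $ref"
  fi
done
```

A reference that passes this check can be passed directly to docker pull or docker run in place of a tagged name.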

Fundamentals

All images in the CardboardCI namespace are built from cardboardci/ci-core. This image ensures that the base environment for every image is always up to date. The common base image provides dependencies that are often used when building and deploying software.

Having a common base means that each image can focus on providing the optimal tooling for its development workflow.