Persisting GitLab CI service data across jobs in a pipeline (Or: How to seed data in one job and access it in another)

I'm trying to figure out how to correctly use services in a CI pipeline. I'd like to have a pipeline like this:

variables:
  MYSQL_DATABASE: test
  MYSQL_ROOT_PASSWORD: root_password
  MYSQL_USER: mysqluser
  MYSQL_PASSWORD: user_password

services:
  - mysql

stages:
  - setup
  - test

seed data:
  stage: setup
  image: mysql
  script:
    - >-
      cat database_setup/*.sql | mysql 
      -hmysql
      -u${MYSQL_USER}
      -p${MYSQL_PASSWORD} 
      ${MYSQL_DATABASE}
test:
  script:
    - ./connect_to_sql_or_something

I would expect this to work; it's not too different from the canonical example, it just has a second stage and job. However, when running it, each job seems to use a different service: the test job has no access to the results of the seed job. I looked in the documentation and couldn't find any info on the service lifecycle and how it corresponds to jobs and stages, nor any multi-stage examples using mysql. Is there a way to make this work? Does it have to do with caching or passing artifacts between stages? What am I missing? It seems like something like this should be possible...



Solution 1:[1]

GitLab runs each job in a completely new environment (often on different machines on different networks). As such, to simplify security needs, services are created for each job on that job's runner, live for the lifetime of the job, and are then destroyed. You can see this in the pipeline log:

Starting service mysql:latest ...

This line appears at the start of each job. The services are then only reachable from that job's host (no firewall management is done to expose them any further, for security and complexity reasons).

So if we want state to be shared between those services, we have to pass information along manually, and the only way GitLab allows us to pass data from one job to another is through artifacts.

In short we need to:

  1. Dump the database to a file at the end of a job
  2. Save that file as an artifact
  3. Load that artifact in the next job
  4. Restore that loaded artifact to the database.

As an example, here is a pipeline that creates a single table in the database in one job, then lists the tables in the next job.

services:
  - mysql

variables:
  # Configure mysql service (https://hub.docker.com/_/mysql/)
  MYSQL_DATABASE: hello_world_test
  MYSQL_ROOT_PASSWORD: mysql

stages:
  - create
  - read

create_table:
  image: mysql
  stage: create
  script:
    - echo "CREATE TABLE table1(id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY);" | mysql --user=root --password="$MYSQL_ROOT_PASSWORD" --host=mysql "$MYSQL_DATABASE" 
    - mysqldump --user=root --password="$MYSQL_ROOT_PASSWORD" --host=mysql "$MYSQL_DATABASE" > db_backup.sql # 1) Dump the database to a file
  artifacts:
    paths:
      - db_backup.sql # 2) Save that file as an artifact

show_tables:
  image: mysql
  stage: read # 3) following stages receive artifacts from previous stages
  script:
    - mysql --user=root --password="$MYSQL_ROOT_PASSWORD" --host=mysql "$MYSQL_DATABASE" < db_backup.sql # 4) Restore the backup
    - echo "SHOW TABLES;" | mysql --user=root --password="$MYSQL_ROOT_PASSWORD" --host=mysql "$MYSQL_DATABASE" 

If you add further stages, make sure to take a fresh backup in each job that modifies the database. If you have many stages, consider using before_script/after_script to avoid repeating the restore and dump steps, as sketched below.
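
For example, here is a minimal sketch of that pattern (the alter_table job, its migrate stage, and the column it adds are invented for illustration, and it assumes a GitLab version recent enough to support the default: keyword): every job restores the incoming dump in before_script and re-dumps in after_script, so each job body only declares its own work and the artifact.

default:
  image: mysql
  before_script:
    # Restore the dump handed down from the previous stage, if there is one
    - if [ -f db_backup.sql ]; then mysql --user=root --password="$MYSQL_ROOT_PASSWORD" --host=mysql "$MYSQL_DATABASE" < db_backup.sql; fi
  after_script:
    # Dump the current state so the next stage can pick it up
    - mysqldump --user=root --password="$MYSQL_ROOT_PASSWORD" --host=mysql "$MYSQL_DATABASE" > db_backup.sql

# hypothetical extra job; its "migrate" stage would also need to be added to stages:
alter_table:
  stage: migrate
  script:
    - echo "ALTER TABLE table1 ADD COLUMN name VARCHAR(64);" | mysql --user=root --password="$MYSQL_ROOT_PASSWORD" --host=mysql "$MYSQL_DATABASE"
  artifacts:
    paths:
      - db_backup.sql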

Solution 2:[2]

TL;DR:

Services attached to jobs in gitlab-ci get a /builds directory mounted into their container, and that /builds directory is the root of a fairly sophisticated dynamic tree. The path within that tree to the job's current directory is communicated to the service through a context env-var: CI_PROJECT_DIR.

The path passed in that variable is the location on the service's own filesystem where it can see the directory the job container gets as its working directory.

Edited: also, the service gets the same env-vars as the job, so passing env-vars to the job passes them to its services. If that collides with something, you'll have to be creative...

Edited: we actually found that ALL containers in the job's pod, including all services and the job container itself, live within the same structure. So far we are sure that the job gets the repo as its workdir, but we don't think that services do. They all get /builds and the same env-vars.

Tested on GitLab 14.5
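
As a quick sanity check from the job side (the job name and the alpine image are arbitrary choices here), you can print CI_PROJECT_DIR and list /builds in a throwaway job; at least on the k8s runners described here, this is the same tree the services have mounted:

inspect_builds:
  image: alpine
  script:
    # The job's working directory is the checked-out repo somewhere under /builds
    - echo "CI_PROJECT_DIR=$CI_PROJECT_DIR"
    - pwd
    # The same /builds tree is mounted into every service container of this job
    - ls -la /builds
    - ls -la "$CI_PROJECT_DIR"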

The FULL answer:

Disclaimer: we're using k8s runners; some of the details may differ a little on other runner types, but I trust they should still make some sense.

Anyway:

CI_PROJECT_DIR follows this formula:

/builds/${CI_RUNNER_SHORT_TOKEN}/${CI_CONCURRENT_PROJECT_ID}/${CI_PROJECT_PATH}

where:

  • CI_RUNNER_SHORT_TOKEN is a short, sha-like, opaque unique string that identifies the runner/pod, e.g. FaTcVXZg
  • CI_CONCURRENT_PROJECT_ID is a running number for the currently running job, e.g. 0
  • CI_PROJECT_PATH is the group/group/repo path that also appears in the web-interface URLs, e.g. my-org/my-group/my-repo
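
Plugging the example values above into the formula, CI_PROJECT_DIR resolves to something like:

/builds/FaTcVXZg/0/my-org/my-group/my-repo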

Mind that if your repo is a mono-repo, all the projects from the repo will be there, together with everything else that was checked out and any cache/artifacts downloaded into your job workspace.

I'm still not sure how multiple artifacts combine, or what happens when they define the same file names; I'll have to experiment with that. Anyway, I digress.

Why? How did you get into such a corner?

Several use-cases:

  1. I want to collect the logs of services as build artifacts, to go along with the test results.
  2. I want to save build time and DRY up concurrent parallel jobs that need db-migrate & data-seed (so I wanted to create a fixture db once, and save the state the service creates naturally as an artifact).
  3. I wanted to explore and debug the OS-level state that some OS-bound services work with (e.g. verdaccio), so I need that as an artifact too.

How did you get all that?

It started from one comment in the docs, on this page: https://docs.gitlab.com/ee/ci/services/

This means that artifacts are visible directly inside the service container through the /builds directory, and that if the container leaves files in /builds, they can be collected as artifacts or added to the cache.

Then, I created the following node app:

const { promisify } = require('util');
const cmd = promisify(require('child_process').exec);

require('http').createServer(async (req, res) => {
  console.log('requested for ', req.url);

  const view = {
    env: process.env,
    cwd: process.cwd(),
    ls: {},
  };

  const {
    CI_RUNNER_SHORT_TOKEN: podToken,
    CI_PROJECT_PATH: projPath,
    CI_CONCURRENT_PROJECT_ID: curId,
  } = process.env;
  try {
    //here I added paths on the go as the picture became clearer
    const ls = await Promise.all([
      { path: '/builds', as: '/builds' },
      { path: `/builds/${podToken}`, as: '/builds/${podToken}' },
      { path: `/builds/${podToken}/${curId}`, as: '/builds/${podToken}/${curId}' },
      { path: `/builds/${podToken}/${curId}/${projPath}`, as: '/builds/${podToken}/${curId}/${projPath}' },
    ].map(({ path, as }) =>
      cmd(`ls -la ${path}`)
      .then(({ stdout }) => ({ as, path, stdout})),
    ));

    ls.forEach(({ as, stdout }) => view.ls[as] = stdout.split('\n'));
  } catch(e) {
    view.err = {
      message: e.message,
      stack: e.stack.split('\n'),
      code: e.code,
    };
  }

  res.end(JSON.stringify(view, null, 2));
}).listen(8080, (err) => console.log(err ? err.message : 'bound OK to port 8080'));

packed in this Dockerfile:

FROM node:14.18.1-alpine
COPY index.js index.js
EXPOSE 8080/tcp
CMD node index.js

and ran it from a job like this:

stages:
  - probe

probe:
  stage: probe
  services:
    - name: gitlab:5050/cicd-pocs/probe:latest
      alias: probe
  script:
    - curl http://probe:8080/wtf-is-going-on-here
  tags:
    - ci
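
Curling the probe from the job prints the JSON view the service builds: its environment variables (which include CI_PROJECT_DIR and the variables used in the formula above), its working directory, and the ls output for each /builds path it probed. That output is what the directory-structure description above was pieced together from.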

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2