As a researcher, I need to conduct experiments to validate my hypotheses. In Computer Science, it is well known that practitioners tend to run experiments in different environments (at the hardware level: x86/ARM/…, CPU frequency, available memory; or at the software level: operating system, library versions). The problem with these differing environments is that it becomes difficult to accurately reproduce an experiment as it was presented in a research article.
In this post we present a way of conducting reproducible experiments using Kubernetes-as-a-service, a managed platform for distributed computations, along with other tools (Argo, MinIO) that take advantage of the platform.
The article is organised as follows. We first recall the context and the problem faced by a researcher who needs to conduct experiments. Then we explain how to solve the problem with Kubernetes and why we did not choose other solutions (e.g., HPC software). Finally, we give some tips on improving the setup.
When I started my PhD, I read a number of articles related to my field, AutoML. From this reading, I realised how important it is to conduct experiments well in order to make them credible and verifiable. I asked my colleagues how they carried out their experiments, and a common pattern emerged: develop your solution, look at other solutions to the same problem, run each solution 30 times with equivalent resources if it is stochastic, and compare your results with those of the other solutions using statistical tests: Wilcoxon-Mann-Whitney when comparing two algorithms, or the Friedman test otherwise. As this is not the main topic of the article, I will not discuss statistical tests in detail.
As an experienced DevOps engineer, I had one question about automation: how do I find out how to reproduce an experiment, especially one based on someone else's solution? Guess the answer? Meticulously read the paper, or find a repository with all the information.
Either you are lucky and the source code is available, or only pseudo-code is provided in the publication. In the latter case, you need to re-implement the solution to be able to test and compare it. Even when source code is available, the full environment is often missing (e.g., exact package versions, the Python version itself, the JDK version, etc.). Missing this information affects performance and may bias experiments. For example, newer versions of packages and languages usually include optimisations that your implementation can benefit from. Sometimes it is hard to find out which versions practitioners used.
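One low-cost habit that helps with this is to record the environment next to the results of each run. The following is only a minimal sketch and not part of the setup described in this post; the output file name environment.txt and the choice of tools to probe are assumptions.

```shell
# Record the software environment of a run in a plain-text file.
# The file name and the list of probed tools are illustrative assumptions.
out="environment.txt"

{
  echo "recorded_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "kernel: $(uname -sr)"
  echo "machine: $(uname -m)"

  # Interpreter and runtime versions, only if the tool is installed.
  if command -v python3 >/dev/null 2>&1; then
    echo "python: $(python3 --version 2>&1)"
    # Exact package versions, one per line (pip's own freeze format).
    echo "python_packages:"
    python3 -m pip freeze 2>/dev/null | sed 's/^/  /'
  fi
  if command -v java >/dev/null 2>&1; then
    echo "java: $(java -version 2>&1 | head -n 1)"
  fi
} > "$out"

echo "Environment written to $out"
```

Publishing such a file with the results gives readers at least the versions they need to rebuild a comparable environment.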
The other problem is the complexity of setting up a cluster with HPC software (e.g., Slurm, Torque). Managing such a solution requires technical knowledge: configuring the network, verifying that each node has the dependencies required by the runs, checking that nodes have the same versions of libraries, etc. These technical steps consume researchers' time and take them away from their main job. Moreover, researchers usually extract the results manually: they retrieve the different files (through FTP or NFS), and then perform statistical tests whose results they save by hand. Consequently, the workflow for performing an experiment is relatively costly and fragile.
From my point of view, this raises one big problem: an experiment cannot really be reproduced in the field of Computer Science.
OVH offers Kubernetes-as-a-service, a managed cluster platform where you do not have to worry about how to configure the cluster (adding nodes, configuring the network, and so on), so I started to investigate how I could run experiments in a way similar to the HPC solutions. Argo Workflows emerged as a good fit. This tool lets you define a workflow of steps to run on your Kubernetes cluster, in which each step is confined to a container, loosely called an image. A container lets you run a program in a specific software environment (language version, libraries, third-party tools), in addition to limiting the resources (CPU time, memory) used by the program.
This addresses our big problem: reproducing an experiment becomes equivalent to re-running a workflow composed of steps, each under a specific environment.
Use case: Evaluate an AutoML solution
The use case from our research is measuring the convergence of a Bayesian optimisation method (SMAC) on the AutoML problem.
For this use case, we defined the Argo workflow in a YAML file.
Set up the infrastructure
First we will set up a Kubernetes cluster, then we will install the services on the cluster, and lastly we will run an experiment.
Installing a Kubernetes cluster with OVH is child’s play. Connect to the OVH Control Panel, go to Public Cloud > Managed Kubernetes Service, then Create a Kubernetes cluster, and follow the steps depending on your needs.
Once the cluster is created:
- Consider changing the upgrade policy. If you are a researcher and your experiment takes some time to run, you want to avoid an update that would shut down your infrastructure in the middle of your runs. To avoid this situation, it is better to choose “Minimum unavailability” or “Do not update”.
- Download the kubeconfig file; it will be used later with kubectl to connect to the cluster.
- Add at least one node to your cluster.
Once the cluster is ready, you will need kubectl, a tool that allows you to manage your cluster.
If everything has been properly set up, you should get something like this:
kubectl top nodes
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node01   64m          3%     594Mi           11%
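If kubectl does not find the cluster, it usually needs to be told where the downloaded kubeconfig file lives. A minimal sketch, assuming the file was saved as $HOME/.kube/ovh-kubeconfig.yml (a hypothetical path; use wherever you actually saved it):

```shell
# Point kubectl at the kubeconfig downloaded from the control panel.
# The path below is a hypothetical example, not one mandated by OVH.
export KUBECONFIG="${KUBECONFIG:-$HOME/.kube/ovh-kubeconfig.yml}"

if ! command -v kubectl >/dev/null 2>&1; then
  echo "kubectl is not installed or not on PATH" >&2
elif [ ! -f "$KUBECONFIG" ]; then
  echo "kubeconfig not found at $KUBECONFIG" >&2
else
  kubectl cluster-info   # prints the API server address if the connection works
  kubectl get nodes      # lists the nodes added to the cluster
fi
```

Setting KUBECONFIG in the shell keeps the cluster credentials separate from any other kubeconfig you may already have in ~/.kube/config.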
Installation of Argo
As we mentioned before, Argo allows us to run a workflow composed of steps. To install the client and the service on the cluster, we were inspired by this tutorial.
First we download and install Argo (client):
curl -sSL -o /usr/local/bin/argo https://github.com/argoproj/argo/releases/download/v2.3.0/argo-linux-amd64
chmod +x /usr/local/bin/argo
Then the controller and UI on our cluster:
kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.3.0/manifests/install.yaml
Configure the service account:
kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=default:default
Then, with the client try a simple hello-world workflow to confirm the stack is working (Status: Succeeded):
argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml
Name:                hello-world-2lx9d
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Tue Aug 13 16:51:32 +0200 (24 seconds ago)
Started:             Tue Aug 13 16:51:32 +0200 (24 seconds ago)
Finished:            Tue Aug 13 16:51:56 +0200 (now)
Duration:            24 seconds

STEP                  PODNAME            DURATION  MESSAGE
 ✔ hello-world-2lx9d  hello-world-2lx9d  23s
You can also access the UI dashboard through
kubectl port-forward -n argo service/argo-ui 8001:80
Configure an Artifact repository (MinIO)
Artifact is a term used by Argo: it represents an archive containing the files returned by a step. In our case we will use this feature to return final results, and to share intermediate results between steps.
To get artifacts working, we need an object storage. If you already have one, you can skip the installation part, but you still need to configure it.
As specified in the tutorial, we used MinIO. Here is the manifest to install it (minio-argo-artifact.install.yml):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # This name uniquely identifies the PVC. Will be used in deployment below.
  name: minio-pv-claim
  labels:
    app: minio-storage-claim
spec:
  # Read more about access modes here: https://kubernetes.io/docs/user-guide/persistent-volumes/#access-modes
  accessModes:
    - ReadWriteOnce
  resources:
    # This is the request for storage. Should be available in the cluster.
    # A bare number would be read as bytes, so the unit is given explicitly.
    requests:
      storage: 10Gi
  # Uncomment and add storageClass specific to your requirements below. Read more https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
  #storageClassName:
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  # This name uniquely identifies the Deployment
  name: minio-deployment
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        # Label is used as selector in the service.
        app: minio
    spec:
      # Refer to the PVC created earlier
      volumes:
        - name: storage
          persistentVolumeClaim:
            # Name of the PVC created earlier
            claimName: minio-pv-claim
      containers:
        - name: minio
          # Pulls the default MinIO image from Docker Hub
          image: minio/minio
          args:
            - server
            - /storage
          env:
            # MinIO access key and secret key
            - name: MINIO_ACCESS_KEY
              value: "TemporaryAccessKey"
            - name: MINIO_SECRET_KEY
              value: "TemporarySecretKey"
          ports:
            - containerPort: 9000
          # Mount the volume into the pod
          volumeMounts:
            - name: storage # must match the volume name, above
              mountPath: "/storage"
---
apiVersion: v1
kind: Service
metadata:
  name: minio-service
spec:
  ports:
    - port: 9000
      targetPort: 9000
      protocol: TCP
  selector:
    app: minio
Note: Please edit the following key/value to suit your needs:
spec > resources > requests > storage: the amount of storage MinIO requests from the cluster (10 GB in this example).
kubectl create ns minio
kubectl apply -n minio -f minio-argo-artifact.install.yml
Note: alternatively, you can install MinIO with Helm.
Now we need to configure Argo in order to use our object storage MinIO:
kubectl edit cm -n argo workflow-controller-configmap
...
data:
  config: |
    artifactRepository:
      s3:
        bucket: my-bucket
        endpoint: minio-service.minio:9000
        insecure: true
        # accessKeySecret and secretKeySecret are secret selectors.
        # It references the k8s secret named 'argo-artifacts'
        # which was created during the minio helm install. The keys,
        # 'accesskey' and 'secretkey', inside that secret are where the
        # actual minio credentials are stored.
        accessKeySecret:
          name: argo-artifacts
          key: accesskey
        secretKeySecret:
          name: argo-artifacts
          key: secretkey
kubectl create secret generic argo-artifacts --from-literal=accesskey="TemporaryAccessKey" --from-literal=secretkey="TemporarySecretKey"
Note: Use the correct credentials you specified above
Create the bucket my-bucket with Read and write rights by connecting to the interface:
kubectl port-forward -n minio service/minio-service 9000
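The bucket can also be created from the command line with the MinIO client, mc, rather than through the web interface. The sketch below assumes that mc is installed and that the port-forward above is running; the helper name create_bucket and the alias argo-minio are arbitrary choices made for this example. Recent releases of mc register an endpoint with mc alias set, whereas older releases used mc config host add.

```shell
# Create the artifact bucket with the MinIO client instead of the web UI.
# Assumptions: `mc` is installed, the port-forward to minio-service is
# running on localhost:9000, and the alias name "argo-minio" is arbitrary.
create_bucket() {
  if [ $# -ne 3 ]; then
    echo "usage: create_bucket BUCKET ACCESS_KEY SECRET_KEY" >&2
    return 2
  fi
  if ! command -v mc >/dev/null 2>&1; then
    echo "mc (MinIO client) is not installed" >&2
    return 1
  fi
  # Register the endpoint under a local alias, then create the bucket.
  mc alias set argo-minio http://localhost:9000 "$2" "$3" &&
    mc mb "argo-minio/$1"
}
```

With the temporary credentials from the manifest, the call would be: create_bucket my-bucket TemporaryAccessKey TemporarySecretKey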
Check that Argo is able to use Artifact with the object storage:
argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/artifact-passing.yaml
Name:                artifact-passing-qzgxj
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Wed Aug 14 15:36:03 +0200 (13 seconds ago)
Started:             Wed Aug 14 15:36:03 +0200 (13 seconds ago)
Finished:            Wed Aug 14 15:36:16 +0200 (now)
Duration:            13 seconds

STEP                       PODNAME                            DURATION  MESSAGE
 ✔ artifact-passing-qzgxj
 ├---✔ generate-artifact   artifact-passing-qzgxj-4183565942  5s
 └---✔ consume-artifact    artifact-passing-qzgxj-3706021078  7s
Note: If a step stays stuck in the ContainerCreating state, there is a good chance that Argo is not able to access MinIO, e.g., because of bad credentials.
Install a private registry
Now that we have a way to run a workflow, we want each step to represent a specific software environment (i.e., an image). We defined this environment in a Dockerfile.
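As an illustration only, such a Dockerfile might look like the sketch below. The base image tag, the requirements.txt file, the paths and the file name Dockerfile.example are assumptions made for the example; the project's actual Dockerfile may differ.

```shell
# Write an illustrative Dockerfile that pins the software environment.
# Base image, file names and paths are assumptions for the example; the
# actual Dockerfile of the project may differ.
cat > Dockerfile.example <<'EOF'
# Pin the interpreter version through the base image tag.
FROM python:3.7-slim

WORKDIR /app

# requirements.txt is expected to list exact versions (package==x.y.z),
# so that rebuilding the image later yields the same libraries.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the experiment code last, so dependency layers stay cached.
COPY . .
EOF

# Build it like any other image (requires Docker):
#   docker build -f Dockerfile.example -t asv-environment:latest .
```

Pinning both the base image tag and the package versions is what makes the image a faithful record of the environment, rather than a snapshot of whatever was current on the day it was built.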
Because each step can run on a different node of our cluster, the image needs to be stored somewhere every node can reach. In the case of Docker, this means a private registry.
You can get a private registry in different ways:
- Docker Hub
- OVH – tutorial
- Harbor: allows you to have your own registry on your Kubernetes cluster
In our case we used the OVH private registry.
# First we clone the repository
git clone firstname.lastname@example.org:automl/automl-smac-vanilla.git
cd automl-smac-vanilla
# We build the image locally
docker build -t asv-environment:latest .
# We push the image to our private registry
docker login REGISTRY_SERVER -u REGISTRY_USERNAME
docker tag asv-environment:latest REGISTRY_IMAGE_PATH:latest
docker push REGISTRY_IMAGE_PATH:latest
Allow our cluster to pull images from the registry:
kubectl create secret docker-registry docker-credentials --docker-server=REGISTRY_SERVER --docker-username=REGISTRY_USERNAME --docker-password=REGISTRY_PWD
Try our experiment on the infrastructure
git clone email@example.com:automl/automl-smac-vanilla.git
cd automl-smac-vanilla
argo submit --watch misc/workflow-argo -p image=REGISTRY_IMAGE_PATH:latest -p git_ref=master -p dataset=iris
Name:                automl-benchmark-xlbbg
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Tue Aug 20 12:25:40 +0000 (13 minutes ago)
Started:             Tue Aug 20 12:25:40 +0000 (13 minutes ago)
Finished:            Tue Aug 20 12:39:29 +0000 (now)
Duration:            13 minutes 49 seconds
Parameters:
  image:             m1uuklj3.gra5.container-registry.ovh.net/automl/asv-environment:latest
  dataset:           iris
  git_ref:           master
  cutoff_time:       300
  number_of_evaluations: 100
  train_size_ratio:  0.75
  number_of_candidates_per_group: 10

STEP                       PODNAME                            DURATION  MESSAGE
 ✔ automl-benchmark-xlbbg
 ├---✔ pre-run             automl-benchmark-xlbbg-692822110   2m
 ├-·-✔ run(0:42)           automl-benchmark-xlbbg-1485809288  11m
 | └-✔ run(1:24)           automl-benchmark-xlbbg-2740257143  9m
 ├---✔ merge               automl-benchmark-xlbbg-232293281   9s
 └---✔ plot                automl-benchmark-xlbbg-1373531915  10s
- Here we only have 2 parallel runs; you can have many more by adding them to the withItems list. In our case, the list corresponds to the seeds: run(1:24) corresponds to run 1 with the seed 24.
- We limit the resources per run by using requests and limits, see also Managing Compute Resources.
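To make the two points above concrete, here is a hedged sketch of how a per-seed fan-out with resource limits can be declared in Argo. It is not the project's actual workflow: only the run step is shown, and the file name workflow-sketch.yaml, the run.py entry point, its flags and the resource values are all assumptions. The secret docker-credentials is the one created earlier for the registry.

```shell
# Write a sketch of a workflow that fans out one run per seed.
# The commands, resource values and file name are assumptions; only the
# step name "run" and the seeds 42 and 24 come from the output above.
cat > workflow-sketch.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: automl-benchmark-
spec:
  entrypoint: benchmark
  imagePullSecrets:
    - name: docker-credentials
  arguments:
    parameters:
      - name: image
      - name: dataset
        value: iris
  templates:
    - name: benchmark
      steps:
        - - name: run
            template: run
            arguments:
              parameters:
                - name: seed
                  value: "{{item}}"
            # One parallel run per item; add seeds here to get more runs.
            withItems: [42, 24]
    - name: run
      inputs:
        parameters:
          - name: seed
      container:
        image: "{{workflow.parameters.image}}"
        # Hypothetical entry point of the experiment code.
        command: [python, run.py]
        args: ["--dataset", "{{workflow.parameters.dataset}}", "--seed", "{{inputs.parameters.seed}}"]
        resources:
          # requests = share reserved for the run, limits = hard cap.
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "1"
            memory: 2Gi
EOF

# Submitting it would look like this (requires the argo client and a cluster):
#   argo submit --watch workflow-sketch.yaml -p image=REGISTRY_IMAGE_PATH:latest
```

Setting requests equal to limits gives every run the same fixed share of CPU and memory, which is what the "equivalent resources" requirement of a fair comparison calls for.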
Then we just retrieve the results through the MinIO web user interface at http://localhost:9000 (you can also do that with the client).
The results are located in a directory with the same name as the Argo workflow; in our example it is my-bucket > automl-benchmark-xlbbg.
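For the client route, a small helper along the following lines may be enough. It is a sketch under several assumptions: the MinIO client mc is installed, the port-forward to minio-service is active, and an alias named argo-minio (an arbitrary name) already points at http://localhost:9000. The function name fetch_results and the default destination directory are also made up for the example.

```shell
# Download the results of one workflow with the MinIO client.
# Assumptions: `mc` is installed, the port-forward to minio-service is
# active, and the alias "argo-minio" was registered beforehand, e.g. with:
#   mc alias set argo-minio http://localhost:9000 ACCESS_KEY SECRET_KEY
fetch_results() {
  if [ $# -lt 1 ]; then
    echo "usage: fetch_results WORKFLOW_NAME [DEST_DIR]" >&2
    return 2
  fi
  workflow="$1"
  dest="${2:-results}"
  if ! command -v mc >/dev/null 2>&1; then
    echo "mc (MinIO client) is not installed" >&2
    return 1
  fi
  # Copy the whole directory named after the workflow into DEST_DIR.
  mkdir -p "$dest" &&
    mc cp --recursive "argo-minio/my-bucket/$workflow/" "$dest/"
}
```

For the run shown above, the call would be: fetch_results automl-benchmark-xlbbg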
Limitation of our solution
The solution is not able to run the parallel steps on multiple nodes. This limitation is due to the way we pass results from the parallel steps to the merge step. We are using volumeClaimTemplates, i.e., we are mounting a volume, and this can’t be shared between different nodes. The problem can be solved in two ways:
- Using parallel artifacts and aggregating them; however, this is an ongoing issue with Argo.
- Implementing, directly in the code of your run, a way to store the results on an accessible storage (with the MinIO SDK, for example).
The first way is preferred, since it means you don’t have to change and customise the code for a specific storage system.
Hints to improve the solution
If you are interested in going further with your setup, you should take a look at the following topics:
- Controlling access: in order to confine users to different spaces (for security reasons, or to control resources).
- Exploring Argo selectors and Kubernetes selectors: in case your cluster is composed of nodes with different hardware and an experiment requires specific hardware (e.g., a specific CPU or GPU).
- Configuring a distributed MinIO: it ensures that your data is replicated on multiple nodes and stays available in case of a node failure.
- Monitoring your cluster.
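For the selector hint, the usual Kubernetes mechanism is to label nodes and then restrict a template to nodes carrying that label. A minimal sketch follows; the label hardware=gpu and the file name are made-up examples, and node01 is the node name shown earlier by kubectl top nodes.

```shell
# Label a node, then restrict a workflow template to nodes with that label.
# The label key/value "hardware=gpu" is a made-up example.
#   kubectl label nodes node01 hardware=gpu

# Fragment of an Argo template using the standard Kubernetes nodeSelector.
cat > node-selector-fragment.yaml <<'EOF'
- name: run
  nodeSelector:
    hardware: gpu
  container:
    image: "{{workflow.parameters.image}}"
EOF
```

Pods created from this template are only scheduled on nodes labelled hardware=gpu, so runs that must be compared can be forced onto identical hardware.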
Without needing in-depth technical knowledge, we have shown that we can easily set up a complex cluster to perform research experiments and make sure they can be reproduced.