Is Apache Groovy or YAML better for pipelines?

Apache Groovy versus YAML: a DevOps comparison by Eficode

People who write continuous delivery pipelines are generally divided into two camps: the Apache Groovy camp and the YAML camp. Which is better?

Groovy pipelines dominated the field for a while, but recently YAML solutions have had the wind at their backs.

This comparison is based on my experience working with both Groovy and YAML, as well as discussions with others.

Readability

YAML pipelines tend to look good. For example, look at the sample pipeline from GitLab CI:

image: "ruby:2.5"

before_script:
  - apt-get update -qq && apt-get install -y -qq sqlite3 libsqlite3-dev nodejs
  - ruby -v
  - which ruby
  - gem install bundler --no-document
  - bundle install --jobs $(nproc) "${FLAGS[@]}"

rspec:
  script:
    - bundle exec rspec

rubocop:
  script:
    - bundle exec rubocop

This example certainly is clear and understandable. Given that most examples are, the initial response to defining pipelines as YAML is often very positive. “Of course we will define them in YAML. Look how beautiful the end result is!”

Groovy doesn’t look as good or easy to read because it has a more complicated syntax:

node('docker') {
    checkout scm
    stage('Build') {
        docker.image('node:6.3').inside {
            sh 'npm --version'
        }
    }
}

Groovy can be a bit scary. People who aren’t Java or Scala programmers, in particular, often have strong negative feelings towards Groovy, whereas Java programmers typically love it at first sight. However, as Java and Scala programmers are not a majority in the software industry (at least not on GitHub), it’s easy to understand the shift towards YAML-based pipelines. There are even Jenkins plugins and blog posts about writing Jenkins pipelines in YAML!

Usability

YAML originally stood for “Yet Another Markup Language”, was later renamed “YAML Ain’t Markup Language”, and according to Wikipedia is “a human-readable data serialization language”. Since version 1.2, YAML is also a superset of JSON.

The intended use case for both YAML and JSON is to serialize data. YAML attempts to make the end result friendlier for humans to read by replacing brackets with indentation and newline conventions. The point here is that YAML is not a scripting language. The consequence for pipeliners is that YAML can’t express logic by itself: it has no if-clauses, loops or variables.

Many YAML-based CI engines provide their own framework or conventions for expressing logic. Take these GitLab rules for example:

workflow:
  rules:
    - if: $CI_COMMIT_REF_NAME =~ /-wip$/
      when: never
    - if: $CI_COMMIT_TAG
      when: never
    - when: always
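
Read as ordinary code, these rules amount to a short chain of conditions. The following is only a rough sketch in POSIX shell of how they could be read, not how GitLab evaluates them; the function name should_run_pipeline and its two arguments are made up for this illustration.

```sh
# Rough POSIX shell reading of the workflow rules above.
# should_run_pipeline and its arguments are hypothetical names used only
# for illustration; GitLab evaluates the real rules itself.

# Usage: should_run_pipeline <ref-name> [tag]
# Returns 0 when a pipeline would run, 1 when it would be skipped.
should_run_pipeline() {
  ref_name="$1"
  tag="${2:-}"

  # Rule 1: refs whose name ends in "-wip" never start a pipeline.
  case "$ref_name" in
    *-wip) return 1 ;;
  esac

  # Rule 2: tagged commits never start a pipeline either.
  if [ -n "$tag" ]; then
    return 1
  fi

  # Rule 3: everything else always runs.
  return 0
}
```

Calling should_run_pipeline "login-page-wip" returns a non-zero status, while should_run_pipeline "main" returns zero. The YAML version encodes the same three branches, but the reader has to know GitLab’s conventions to see them.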

When complex logic is added to pipelines, the majority of people stop feeling like these pipelines are simple and easily readable. On the contrary, most people start feeling like they need to contact their favorite “DevOps guy” if anything goes wrong with their pipelines. That’s because of two factors:

The threshold to fully understand the pipeline (that someone else wrote) rises rapidly.
People tend to seek comfort in ignorance.

All of this is completely understandable. If you happen to be that “DevOps guy”, please try to understand the “normal” users’ frustration and offer them encouragement and empathy. And snacks. Especially when they’re doing a great job on their own. Snacks always help.

Another consequence of YAML not being a scripting language is that the majority of YAML pipelines either reference scripts or contain embedded script blocks. Typically these blocks are written in Bash or Python. Please take a quick look at this example:

upload:
  image: alpine
  stage: upload
  script:
    - for f in $(find . -name '*.yml' -o -name '*.yaml'); do
        bname=${f#./};
        if [[ "$bname" == ".gitlab-ci.yml" ]]; then
          continue;
        fi;
        if [[ "$CI_COMMIT_REF_SLUG" == "edge" ]]; then
          sed -i.bak 's/s-latest/latest/g' $bname;
        fi;
        url="$NEXUS_PIPELINES/$CI_COMMIT_REF_SLUG/$bname";
        curl --verbose --show-error --fail --upload-file $bname $url;
      done

Blocks like this quickly remove all remaining positive feelings that “normal” users still have towards “simple YAML pipelines”.

Groovy is an actual programming language. Therefore it offers all the features required to express logic. The upside is that Groovy pipelines hardly ever contain anything other than Groovy (and shell one-liners).

The downside is that users who aren’t familiar with the syntax have to learn a rather complicated DSL before they can feel comfortable with the pipelines. For example, look at this “simple” Groovy function that contains a closure:

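def withImageStagingSelector(Closure selector) {
    // Fall back to an identity closure when no selector is given.
    this.imageselector = selector ?: { img -> img }
    // Return this so that calls can be chained.
    this
}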

Most people probably agree that this is not trivial to explain. It is rather common for Groovy pipelines to contain complex logic directly in the pipeline files. That makes it quite difficult to understand what’s going on.

On the other hand, in Groovy it’s easy to import a library and make a function call that takes parameters. For example:

mycorplib.deploy('myapp', 'production')

In comparison, YAML doesn’t support functions at all. The ability to pass parameters to function calls is a significant advantage for Groovy over YAML.

Conclusion

I would like to return to the foundations for a moment and remind you that pipelines normally consist of build, test and deployment operations. All of those operations fundamentally consist of logic. After all, a deployment script is just a scripted definition of how exactly the deployment should work. In other words, it’s an expression of logic.

Based on this, we can safely conclude that YAML can’t work for pipelines by itself. It must always be complemented by an actual scripting language or a framework that interprets YAML data as logic.

However, it would be reckless to jump to the conclusion that Groovy is always the correct choice. In fact, I think that the best practice in all pipelines is to keep them simple and move most of the logic into reusable functions or templates. Furthermore, it’s easy to tell when a YAML pipeline’s logic exceeds an acceptable level of complexity, because YAML is simply incapable of implementing complex logic.
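
As one way to picture this, the embedded upload loop from the Usability section could move into a small script that the YAML job calls with a single line. The sketch below is an assumption about how that might look, not a recommended implementation: the file name, the function names and the DRY_RUN switch are invented here, while the sed and curl calls are taken from the earlier example. It is written in POSIX shell so that it does not depend on Bash being installed in the job image.

```sh
#!/bin/sh
# upload-pipelines.sh: hypothetical file name. All function names and the
# DRY_RUN switch are assumptions made for this sketch.

# Print the upload URL for one file.
# Usage: target_url <base-url> <ref-slug> <file>
target_url() {
  printf '%s/%s/%s\n' "$1" "$2" "$3"
}

# Succeed when the file is the pipeline definition itself and must be skipped.
is_excluded() {
  [ "$1" = ".gitlab-ci.yml" ]
}

# Upload every YAML file below the current directory.
# Usage: upload_all <base-url> <ref-slug>
upload_all() {
  base_url="$1"
  ref_slug="$2"
  # -type f restricts the search to regular files.
  find . -type f \( -name '*.yml' -o -name '*.yaml' \) | while IFS= read -r file; do
    name="${file#./}"
    if is_excluded "$name"; then
      continue
    fi
    # On the edge branch, point s-latest references at latest before uploading.
    if [ "$ref_slug" = "edge" ]; then
      sed -i.bak 's/s-latest/latest/g' "$name"
    fi
    url="$(target_url "$base_url" "$ref_slug" "$name")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Only report what would happen; nothing is sent over the network.
      echo "would upload $name to $url"
    else
      # Stop at the first failed upload.
      curl --show-error --fail --upload-file "$name" "$url" || exit 1
    fi
  done
}
```

If the script ended with upload_all "$NEXUS_PIPELINES" "$CI_COMMIT_REF_SLUG", the job’s script section could shrink to a single line that runs the file. The small helper functions can also be tried locally with DRY_RUN=1, which is hard to do with a loop embedded in YAML.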

My final opinion is that in most cases YAML is better for writing pipelines because it creates a natural incentive to keep complex logic outside of pipeline files. In addition, YAML pipelines are more readable, provided that complex logic is abstracted away.

However, I’m not trying to dismiss all Groovy usage from the universe. Groovy works better with imports and is an excellent choice when the main programming language for your organization is Java or Scala.

No matter which one you choose, it’s important to understand that the language is really a secondary, context-dependent choice. There is no universal best practice for it. There are some universal best practices for pipelines but none of them are about selecting Groovy or YAML.

Meanwhile, if you need professional advice with pipelines, please book a free consultation on Eficode Praqma’s homepage.

Thanks for reading. I wish you all the best with your pipeline efforts.