Boost CI Speed: Analysis & Ideas For Improvement

Hey guys,

There's been a ton of awesome work lately to boost our Continuous Integration (CI) speeds, and I wanted to give a massive shoutout to @mariusandra, @pauldambra, @Twixes, @tomasfarias, and everyone else who's been diving in and submitting fixes to make things less of a headache. Seriously, you're all rockstars!

What a Great CI Experience Looks Like

So, what are we aiming for with our CI? Here's the dream:

  • Speed Demon: The entire suite runs in under 5 minutes. We want lightning-fast feedback, not hours of waiting.
  • Flake Finder: When a flaky test sneaks in, we catch it quickly and squash it before it causes chaos.
  • Signal, Not Noise: If a CI run throws an error, it should actually help us fix the problem, not just add to the confusion.

The Current State of Affairs

Okay, so where are we right now? Here's the scoop:

  • Mostly Fast, But...: Most CI runs are pretty speedy, but those Storybook and backend test suites? They're clocking in at 10-15 minutes on average. Ouch.
  • Flaky Test Fugitives: Flaky tests tend to hang around, unnoticed, for way too long. We need to be better at spotting and dealing with these troublemakers.
  • Noise Overload: We've got a lot of noise in our CI. If an error pops up, it's often hard to tell if it's from your change or just some random CI gremlin.

Ideas for a Faster & More Reliable CI

I've got a bunch of ideas kicking around for making our CI faster and more reliable, but I haven't had the time to dive into them yet. So, I'm jotting them down here with my (very rough) estimate of their impact. Let's brainstorm!

General CI Improvements

Let's start with some general improvements that could benefit the whole CI process. These cut across every suite, so improving them reduces overall run time and makes builds behave more consistently.

  • Use prebuilt Docker images (low, probably ~1 minute): One straightforward way to speed up CI is to use prebuilt Docker images. Instead of building images from scratch on every run, we can pull pre-existing images that already contain the necessary dependencies and configuration. The initial setup takes some effort, but the long-term speed gains are worth it, and prebuilt images keep the environment consistent across runs, minimizing environment-related issues. There's a rough sketch of this after this list.

  • Send an error for failing tests in master, and notify a Slack channel on new flakes: We need a better system for catching and addressing flaky tests and failures on master. One idea is to automatically send an error notification whenever a test fails in master, so the team is alerted immediately and can act quickly. On top of that, we should post new flaky tests to a dedicated Slack channel, so we can track and prioritize them instead of letting them linger and cause ongoing issues. Implementing this feedback loop can drastically improve the stability of our codebase. There's a rough sketch of the notification step after this list.

  • Break up the codebase into smaller parts using the products folder, so we can be more selective with path filters (high): Our codebase has grown significantly, and it's time to consider breaking it down into smaller, more manageable parts. One approach is to organize the code using a "products" folder structure, which would let us be much more selective with path filters in our CI configurations. For example, if a change only affects a specific product, we can run the tests relevant to that product rather than the entire suite. This targeted approach can significantly reduce CI run times. It will require some upfront effort to reorganize the codebase, but the payoff in faster CI and improved maintainability would be substantial. A sketch of what the path filters could look like is after this list.

  • Increase concurrency of steps: we often do step A and then B, even when they could run at the same time (e.g. bringing up the Docker stack while running checks on the codebase). There's a balance here between keeping CI runs clear and being fast (medium): We're likely missing opportunities to run CI steps concurrently. We often execute steps sequentially even though they don't depend on each other; for example, we might set up the Docker stack and then run code checks, when those could happen simultaneously. Increasing concurrency can cut down overall CI time. However, there's a trade-off: overly complex concurrent setups are hard to debug, so it's about finding the right level of parallelism without sacrificing clarity. There's a sketch of overlapping the Docker boot with other checks after this list.
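
To make the prebuilt images idea concrete, here's a minimal sketch of what it could look like in a GitHub Actions job. The image name and tag are hypothetical placeholders; the point is simply that the job runs inside a container that already has the dependencies baked in, instead of installing them on every run.

```yaml
jobs:
  backend-tests:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/example-org/ci-base:latest  # hypothetical prebuilt image with deps baked in
    steps:
      - uses: actions/checkout@v4
      - name: Run backend tests
        run: pytest  # no dependency-install step needed any more
```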
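
For the master-failure alerts, here's a rough sketch of a step we could append to the test jobs. It just curls a Slack incoming webhook; SLACK_CI_WEBHOOK_URL is a hypothetical repository secret. Proper flake detection would need more smarts (e.g. hooking into retry data), but this gets the basic feedback loop going.

```yaml
# Appended to the end of a test job's `steps:` list
- name: Notify Slack if master is red
  if: failure() && github.ref == 'refs/heads/master'
  run: |
    # SLACK_CI_WEBHOOK_URL is a hypothetical repo secret pointing at a Slack incoming webhook
    curl -sS -X POST -H 'Content-type: application/json' \
      --data "{\"text\": \"❌ ${{ github.workflow }} failed on master: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\"}" \
      "${{ secrets.SLACK_CI_WEBHOOK_URL }}"
```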
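
For the products-folder idea, the trigger section of a per-product workflow could look something like this. The products/billing path and the workflow file name are made up, just to show the shape:

```yaml
# Hypothetical .github/workflows/billing-tests.yml — only runs when the
# (made-up) products/billing folder or shared code changes
on:
  pull_request:
    paths:
      - 'products/billing/**'
      - 'common/**'                         # shared code still triggers this suite
      - '.github/workflows/billing-tests.yml'
```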
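
And for concurrency, a sketch of overlapping the Docker stack boot with cheap checks. The compose file, the lint commands, and bin/wait-for-services.sh are placeholders for whatever we actually run:

```yaml
# Sketch of a job's steps: start the stack detached, do cheap work while it boots,
# then wait for the services only when we actually need them.
- name: Start Docker stack in the background
  run: docker compose -f docker-compose.dev.yml up -d
- name: Lint and typecheck while the stack boots
  run: |
    ruff check .
    mypy .
- name: Wait for services to be ready
  run: ./bin/wait-for-services.sh   # hypothetical helper that polls the services
- name: Run tests that need the stack
  run: pytest
```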

Backend Tests Optimization

Optimizing our backend tests is crucial for faster feedback and more reliable builds. The backend contains a lot of critical logic, and it's one of our slowest suites, so the ideas below (dynamic test selection, database optimizations) are where we can win back the most time.

  • Update test times from runs on master: we currently use a file (.test_durations) that is committed to the repo; it quickly gets out of sync, so we should update it from new test runs on master (medium): The static .test_durations file we use to estimate test execution times quickly becomes outdated as tests evolve and their durations change. We need it to be refreshed automatically from recent master runs, so the numbers reflect reality. Continuously learning from actual execution times lets us split and parallelize tests more accurately, which means better balanced workers and faster overall CI. There's a sketch of this after this list.

  • Detect which tests should be run based on changes using pytest-testmon, run only these tests on the PR and run all of them on master (high, but risky): pytest-testmon is an intriguing tool that could significantly speed up our CI. It detects which tests are affected by a given change and runs only those, which can drastically reduce execution times on pull requests. The idea would be to run just the affected tests on PRs for quick feedback, and run the full suite on master for comprehensive coverage. The risk is that incorrectly identifying affected tests could let real issues slip through, so we'd need to evaluate pytest-testmon's accuracy and reliability carefully before leaning on it. High potential payoff, but we should proceed cautiously. There's a sketch of the PR/master split after this list.

  • Squash ClickHouse migrations (medium, probably 30s-1m reduction): Our ClickHouse migrations might be contributing to CI slowdowns. Over time, migrations can accumulate and make database setup a lengthy process. Squashing these migrations into a smaller set can streamline the process. This involves consolidating multiple migrations into fewer, more efficient operations. The estimated reduction in CI time might be in the range of 30 seconds to 1 minute, which can add up over many runs. Squashing migrations requires careful planning and execution to avoid data loss or inconsistencies. However, it's a worthwhile effort to maintain a healthy and efficient database setup for our tests.
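
Here's a sketch of what the durations refresh could look like, assuming we keep using pytest-split (the tool that reads .test_durations). Master runs would record fresh timings and publish them as an artifact (or auto-commit them) so PR runs pick up current numbers instead of the stale committed file:

```yaml
# Master-only steps: record durations during the normal run and publish the result
- name: Run tests and record durations (master only)
  if: github.ref == 'refs/heads/master'
  run: pytest --store-durations --durations-path .test_durations
- name: Upload fresh durations
  if: github.ref == 'refs/heads/master'
  uses: actions/upload-artifact@v4
  with:
    name: test-durations
    path: .test_durations
```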
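
And a rough sketch of the pytest-testmon idea: PRs restore the .testmondata database from cache and run only affected tests, while master runs everything and refreshes the database. The cache key names are made up:

```yaml
# PRs: restore the dependency database and run only affected tests.
# Master: run everything and refresh the database for future PR runs.
- name: Restore testmon database (PRs)
  if: github.event_name == 'pull_request'
  uses: actions/cache/restore@v4
  with:
    path: .testmondata
    key: testmon-${{ github.sha }}
    restore-keys: testmon-
- name: Run only affected tests (PRs)
  if: github.event_name == 'pull_request'
  run: pytest --testmon
- name: Run the full suite and rebuild testmon data (master)
  if: github.ref == 'refs/heads/master'
  run: pytest --testmon
- name: Save refreshed testmon database (master)
  if: github.ref == 'refs/heads/master'
  uses: actions/cache/save@v4
  with:
    path: .testmondata
    key: testmon-${{ github.sha }}
```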

Storybook Tests Optimization

Storybook tests are vital for catching UI regressions, but they're also one of our slowest suites. Splitting large story files and only running impacted stories are the two main levers for bringing their runtime down and keeping the pipeline fast.

  • Split large storybook files to help balance the tests easier (medium): Large Storybook files can lead to imbalances in test execution times. Some files might contain a disproportionate number of stories or complex components, causing tests to take longer. Splitting these large files into smaller, more manageable units can help balance the load and make tests run more consistently. This approach also makes it easier to identify and address performance bottlenecks within specific files. By distributing the test load more evenly, we can reduce the overall runtime of our Storybook test suite.

  • Detect stories that have been impacted by a change and run those only (high, risky): Similar to pytest-testmon for backend tests, we could detect which Storybook stories are affected by a change and run only those. Imagine only running tests for the components you've actually modified! It's risky, though: incorrectly identifying affected stories could let regressions and UI issues slip through, so any solution would need thorough validation before we rely on it. The potential speed gains are substantial, but accuracy is paramount. There's a very rough sketch of the idea below.
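
Here's that very rough sketch: diff the PR against master, collect the changed story files, and feed just those to the test runner. This assumes the Storybook test runner accepts explicit test file paths the way Jest does (needs verifying), and it deliberately ignores the harder problem of a shared component change impacting many stories:

```yaml
- uses: actions/checkout@v4
  with:
    fetch-depth: 0   # full history so the merge-base diff below works
- name: Find story files changed by this PR
  id: stories
  run: |
    CHANGED=$(git diff --name-only origin/master...HEAD -- '*.stories.tsx' '*.stories.ts' | tr '\n' ' ')
    echo "changed=$CHANGED" >> "$GITHUB_OUTPUT"
- name: Run only the impacted stories
  if: steps.stories.outputs.changed != ''
  run: yarn test-storybook ${{ steps.stories.outputs.changed }}
```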

Let's get this discussion rolling, guys! What are your thoughts on these ideas? Any other suggestions for making our CI faster and more reliable? Let's make it happen! 🔥