GatorCAT: EtherCAT in Zig!

GatorCAT is a library for commanding EtherCAT networks and just released v0.1.0!

Some background on EtherCAT:

  1. EtherCAT is a bit-packed network protocol used in high-speed real-time industrial controls (CNC machines, packaging lines, aerospace, etc.).
  2. Subdevices are arranged in a ring, connected together with Ethernet cables.
  3. Frames travel from the main device (running the Zig application, typically under Linux), through all the subdevices (running their own firmware), and back to the main device.
  4. Frames are typically sent 250 to 10000 times per second.

GatorCAT prioritizes providing a near zero-allocation API because allocation generally increases real-time latency.
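To give a concrete flavor of what that means, here is a minimal sketch (hypothetical names, not GatorCAT’s actual API) of the caller-owned-buffer style such an API implies: the cyclic path only slices into memory the caller already provides, so no allocator ever runs inside the control loop:

const std = @import("std");

// A single caller-owned buffer big enough for one Ethernet frame
// (14-byte header + 1500-byte payload, FCS excluded).
const max_frame_len = 1514;

// Copy this cycle's output process data into the caller-owned buffer and
// return the slice that should go on the wire. No heap allocation happens.
fn fillProcessData(frame_buf: *[max_frame_len]u8, outputs: []const u8) []const u8 {
    std.debug.assert(outputs.len <= max_frame_len);
    @memcpy(frame_buf[0..outputs.len], outputs);
    return frame_buf[0..outputs.len];
}

test "cyclic path uses only caller-owned memory" {
    var frame_buf: [max_frame_len]u8 = undefined;
    const outputs = [_]u8{ 0x01, 0x02, 0x03, 0x04 };
    const frame = fillProcessData(&frame_buf, &outputs);
    try std.testing.expectEqualSlices(u8, &outputs, frame);
}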

You can find examples of:

  1. A tagged-union-based interface
  2. A v-table-based interface
  3. Depending on a pre-compiled C library and C interoperability
  4. Using generics to deserialize packed structs
  5. Labeled-switch-based state machines
  6. Using comptime asserts so that important constants stay readable while their derivations are checked at compile time
  7. A general preference for asserts over tests, though the library currently deserves far more tests than it has
  8. Abusing packed structs as an exact description of a bit-packed, primarily little-endian binary protocol (see the sketch after this list for the general idea behind items 4 and 8)
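As a taste of items 4 and 8, here is a minimal sketch of the general technique (made-up types, not GatorCAT’s actual definitions): a packed struct describes a bit-packed little-endian word exactly, and a tiny generic helper deserializes it from wire bytes:

const std = @import("std");

// Hypothetical 16-bit register layout, least-significant bit first.
const Status = packed struct(u16) {
    operational: bool,
    error_flag: bool,
    station_address: u10,
    reserved: u4,
};

// Deserialize any packed struct with an integer backing type from
// little-endian bytes on the wire.
fn fromWire(comptime T: type, bytes: *const [@divExact(@bitSizeOf(T), 8)]u8) T {
    const Backing = std.meta.Int(.unsigned, @bitSizeOf(T));
    return @bitCast(std.mem.readInt(Backing, bytes, .little));
}

test "decode a status word" {
    const bytes = [2]u8{ 0x05, 0x00 }; // bits 0 and 2 set
    const status = fromWire(Status, &bytes);
    try std.testing.expect(status.operational);
    try std.testing.expect(!status.error_flag);
    try std.testing.expectEqual(@as(u10, 1), status.station_address);
}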
16 Likes

I think Zig is a really good fit for this kind of software: the fact that C bitfields are so unpredictable makes them basically useless for low-level networking.

How does your project compare to other open-source EtherCAT master stacks, most notably IgH EtherCAT and SOEM?

From a cursory look it seems to be a userspace library that isn’t Linux-only, hence quite similar to SOEM; but since you’ve used (or tried to use) SOEM in the past, I’m curious what pushed you to start this project.

1 Like

Amazing to find someone else who knows what EtherCAT is; there are dozens of us!!

Written in Zig

GatorCAT is written in Zig, SOEM in C, and IgH in C.

SOEM can be difficult to use because the C code is barely readable (single-letter variable names, horrible acronyms, no references to documentation, and no real documentation). While my documentation is certainly incomplete, I tried to at least cite the relevant IEC standard for the data structures and behavior implemented (the IEC standard, as far as I can tell, is a duplicate of the ETG standard and does not require ETG membership to access). I also believe that, being written in Zig, it is more readable.

Another reason for Zig is the build system and cross compilation as a first-class use case. When working with IgH and SOEM, you need to learn the C build system, which in my experience was extremely difficult, and I never even got to cross compilation. Learning the incantations required to get CMake to depend on IgH or SOEM meant constantly googling and spamming ChatGPT with prompts.

Cross compilation is also trivial in Zig. In industrial controls, we often use ARM-based processors or other low-cost hardware to execute fairly simple control tasks (low-rate data collection and sequencing). At my work, we often use Raspberry Pi Compute Module based industrial computers (OnLogic Factor 201), so cross compilation is required to integrate well with my x86_64 laptop and CI build pipelines.
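For example (a minimal sketch assuming roughly Zig 0.13’s build API and a hypothetical project layout), targeting the CM4 is just a matter of passing a target triple like aarch64-linux-gnu to an ordinary build.zig:

// build.zig -- minimal sketch; pick the target at the command line, e.g.
//   zig build -Dtarget=aarch64-linux-gnu -Doptimize=ReleaseSafe
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "controller", // hypothetical application name
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });
    b.installArtifact(exe);
}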

Declarative Configuration

GatorCAT attempts to provide a declarative API for the configuration of EtherCAT devices, similar to the user experience of PLC runtimes like CODESYS and TwinCAT. In the vast majority of use cases, the contents of the EtherCAT network are known at compile time, and the user just wants those contents validated at runtime.
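To make the idea concrete, here is a purely hypothetical sketch (not GatorCAT’s real API) of what “declarative” means in this context: the expected network is plain data, known at compile time, and a runtime scan is simply checked against it rather than configured by hand-written procedural code:

// The expected network, written down as data rather than as procedure calls.
const ExpectedSubdevice = struct {
    station_alias: u16,
    vendor_id: u32,
    product_code: u32,
};

const expected_network = [_]ExpectedSubdevice{
    .{ .station_alias = 1, .vendor_id = 0x2, .product_code = 0x0001_0001 }, // placeholder identity values
    .{ .station_alias = 2, .vendor_id = 0x2, .product_code = 0x0002_0001 },
};

// At runtime, scan the bus and compare what is actually there against the
// compile-time description, instead of writing per-device procedural code.
fn validateScan(scanned: []const ExpectedSubdevice) error{WrongNetwork}!void {
    if (scanned.len != expected_network.len) return error.WrongNetwork;
    for (&expected_network, scanned) |want, got| {
        if (want.vendor_id != got.vendor_id or want.product_code != got.product_code)
            return error.WrongNetwork;
    }
}

test "scan matches the declared network" {
    try validateScan(&expected_network);
}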

SOEM provides no help in this regard and expects the user to write procedural code to validate the contents of the network and configure the subdevices. This requires extensive knowledge of EtherCAT, while the PLC runtimes and IDEs (which are everyone’s first intro to EtherCAT) provide point-and-click, drop-down, and automatic scanning interfaces that make it easy.

I didn’t make it very far with IgH (see: difficult installation (kernel modules) and the CMake build system). It does seem to have a reasonably declarative subdevice configuration API.

License

GatorCAT is more permissively licensed (MIT) compared to SOEM (GPL with exceptions) and IgH (LGPL).

User Space

GatorCAT is currently implemented in user space. SOEM is also userspace, while IgH is a kernel module.

I chose to implement userspace first because it is easier, and I believe there is currently limited support for compiling kernel modules in Zig. Raw sockets can get you very far in my experience. I have applications at work using a commercial EtherCAT implementation (Acontis) over raw sockets, holding 2000 us cycle times with 150 us of jitter. This is on a typical Debian system with the PREEMPT_RT patches, a real-time priority process, and CPU frequency scaling disabled. I didn’t even need ISOL_CPUS or IPC_LOCK (yet).
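For reference, the raw-socket mechanism itself is tiny. Here is a minimal sketch (assuming Zig ~0.13 std names; binding the socket to a specific interface is omitted) of opening an AF_PACKET socket on the EtherCAT EtherType (0x88A4), which is all a userspace main device needs to put frames on the wire:

const std = @import("std");
const posix = std.posix;

pub fn main() !void {
    // EtherCAT frames are ordinary Ethernet frames with EtherType 0x88A4.
    const ETH_P_ECAT: u16 = 0x88A4;

    // socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ECAT)); needs CAP_NET_RAW or root.
    const fd = try posix.socket(
        posix.AF.PACKET,
        posix.SOCK.RAW,
        std.mem.nativeToBig(u16, ETH_P_ECAT),
    );
    defer posix.close(fd);

    // A real main device would now bind() the socket to one interface
    // (sockaddr_ll with the interface index) and start the cyclic exchange.
    std.debug.print("raw EtherCAT socket open: fd = {d}\n", .{fd});
}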

I think the future of EtherCAT is user space. Most applications do not require 100 us cycle times. EtherCAT was designed to run on standard operating systems; that is why distributed clocks exist, so that the control cycle can be independent of the main device’s frame jitter.

I have also heard that advances in Linux networking (XDP) have already enabled 100 us cycle times in userspace on standard PREEMPT_RT Linux. See this tweet by the creator of the Rust project ethercrab.

Feature Completeness

IgH is the most feature complete, followed by SOEM and then GatorCAT.

GatorCAT is notably missing support for distributed clocks and Ethernet over EtherCAT (EoE). That said, I’m not sure SOEM’s EoE support is simple to use either.

What happened when you tried SOEM?

You are correct that I attempted to use SOEM, and I attempted to wrap it in Python. What I came to realize was that SOEM lacks understandable organization and control flow. It was difficult to understand how to react to failures, and the code is riddled with hidden retry logic configurable via C defines (see every check of the working counter or timeout in the library).

I also found myself battling the build system more than my intended application.

SOEM also lacks any data hiding or least-privilege principles, making it difficult to understand. The library is essentially a giant book of functions that all take references to a global mega-struct of mutable data. Attempting to assess the thread safety of certain operations is extremely difficult.

SOEM is designed to be copy-pasted into your codebase and modified to fit your use case. It is not designed to be used as a library (hence the defines everywhere). There is very little support for runtime-constructed or dynamically configured EtherCAT networks.

Primary Push

The thing that primarily pushed me to start this project was a desire to replace my dependence on commercial EtherCAT implementations at work (CODESYS, Acontis): CODESYS because of its horrible programming language (structured text) and lack of a plain-text project storage format, and Acontis because of the general annoyance of working with black-box, Hungarian-notation, extremely expensive, pre-compiled licensed binaries. I will say, though, that Acontis’ configuration tools have allowed me to debug and configure very large and complex EtherCAT networks (200+ subdevices), and I would generally recommend it for commercial applications. If you need low risk to your delivery schedules and compatibility with virtually all EtherCAT devices and features, and don’t mind the cost, use Acontis. Just be prepared for the most poorly written Python library you will ever see in your life (if you don’t just use the C library).

2 Likes

I’m quite convinced of the opposite: I think you can’t have determinism in userspace. I would be happy to be proven wrong though.

I’m much more fond of IgH EtherCAT, so here’s some nitpicking:

  • the license is LGPL (userspace library) and GPL (kernel modules)
  • the build system is a mix of Kbuild and autotools, and yes, it is complex as hell
  • I did not run into any specific issues when cross-compiling, but the build system underwent a major rework in recent years

When I have some spare time I will surely look into GatorCAT. If you don’t mind, I’ll add it in some way to my test project to see how its round-trip time compares to other EtherCAT stacks.

2 Likes

Isn’t that just a matter of whether the kernel is written in a way that supports determinism in userspace?

Also, is determinism even the right word here? If a program always computes the same answer, it is deterministic; so is this more about predictable and consistently low latency and high throughput, without sacrificing either in order to have enough margin to avoid sending updates too late?

I don’t really know about EtherCAT; I’m just asking to understand better what you mean. I see how it would be easier to guarantee good timing from the kernel, it just doesn’t seem like that would be the only way.

1 Like

Yes, sorry: I implied the Linux kernel here.

“Determinism” is the right word here: in a real-time context, it means you are able to accept or reject tasks based on time constraints. In other words, a real-time task must be guaranteed to complete within a certain time slot.

1 Like

Ah ok, so in a real-time context the meaning is expanded to include how the whole system operates and communicates. Makes sense, thank you for explaining!

I haven’t really worked with hard time constraints.

2 Likes

Sure, I’d love to hear how it goes!

I wouldn’t expect much of a performance difference from SOEM, since I am just using raw sockets on Linux. Once I get to XDP, I think it could be an improvement.

The best possible performance (<150 us jitter) requires at least:

  1. Turn off CPU frequency scaling in the kernel boot parameters.
  2. A PREEMPT_RT-patched kernel (though I’ve heard this may have recently been merged into the mainline kernel).
  3. Set the scheduler and priority of the process.
  4. Pin the process to a specific CPU (see the sketch after this list).
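Here is a minimal sketch of items 3 and 4 (plus memory locking), written against raw Linux syscalls rather than any particular std.os.linux wrapper so it is not tied to one Zig version; error handling is deliberately skipped:

const std = @import("std");
const linux = std.os.linux;

const MCL_CURRENT: usize = 1; // lock pages currently mapped
const MCL_FUTURE: usize = 2; // ...and pages mapped in the future
const SCHED_FIFO: usize = 1; // fixed-priority real-time scheduling class

const SchedParam = extern struct { priority: c_int };

pub fn main() !void {
    // 1. Keep our pages out of swap (requires root or the right rlimit/capability).
    _ = linux.syscall1(.mlockall, MCL_CURRENT | MCL_FUTURE);

    // 2. Run the whole process as SCHED_FIFO at a high priority.
    var param = SchedParam{ .priority = 80 };
    _ = linux.syscall3(.sched_setscheduler, 0, SCHED_FIFO, @intFromPtr(&param));

    // 3. Pin the process to CPU 1 so the cyclic task is never migrated.
    var mask = [_]u8{0} ** 128;
    mask[0] = 1 << 1;
    _ = linux.syscall3(.sched_setaffinity, 0, mask.len, @intFromPtr(&mask));

    std.debug.print("real-time setup done (errors ignored in this sketch)\n", .{});
}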

I just added a benchmark command to my CLI application:

$ sudo zig-out/bin/gatorcat benchmark --ifname enx00e04c68191a
benchmarking for 10.00s...
Completed 28322 cycles in 10.00s or 2831.97 cycles/s.
Max cycle time: 0.000934s.
Min cycle time: 0.000246s.

This is on my Debian laptop without PREEMPT_RT and on a crappy USB Ethernet dongle.

When I get some time, I will test it on a raspberry pi CM4 with PREEMPT_RT.

I think Zig will be huge for the accessibility of writing real-time software. You might have seen that the real-time scheduler syscalls were added recently (just in time for me to use!!): Add realtime scheduling calls to std.os.linux (issue #19671) by curuvar · Pull Request #19675 · ziglang/zig · GitHub

Regarding kernel space, I think I might eventually attempt to build the Acontis kernel module with Zig (atemsys/atemsys.c at main · acontis/atemsys · GitHub), or just rewrite it in Zig; it is only 5000 lines, after all. I have not fully read through how it works yet. It looks like it could be drastically reduced if I decide to only support modern Linux versions.

1 Like

Another way of explaining the meaning of real-time:

Real-time software is characterized by deterministic latency, or sometimes deterministic response time.

Practically, the key metric is latency: the time between when a task is scheduled and when it actually begins executing. When you call sleep(1 second) on a preemptively multitasking OS (Linux), you are telling the OS that other tasks can take the CPU for 1 second, and then this task should resume. The latency is the time between the sleep() expiring and the OS actually resuming your task.

In EtherCAT, we need to send Ethernet frames at regular intervals (the cycle time, typically < 2000 us), and for advanced subdevices, inside a specific region of the cycle (typically the first 30% of the cycle time). That translates to an allowable latency of less than 600 us (30% of 2000 us) in this example.
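A tiny sketch of such a cyclic task (assuming Zig ~0.13 std names; the actual frame exchange is omitted) makes the definition concrete: the overshoot past each sleep deadline is exactly the latency being discussed:

const std = @import("std");

pub fn main() !void {
    const cycle_ns: u64 = 2_000_000; // 2000 us cycle time
    var timer = try std.time.Timer.start();
    var worst_latency_ns: u64 = 0;

    var deadline: u64 = cycle_ns;
    for (0..1000) |_| {
        // ... send the EtherCAT frame and process the returned data here ...

        // Sleep until the next cycle boundary.
        const before = timer.read();
        if (deadline > before) std.time.sleep(deadline - before);

        // How late did we actually wake up? That overshoot is the latency.
        const after = timer.read();
        if (after > deadline) worst_latency_ns = @max(worst_latency_ns, after - deadline);
        deadline += cycle_ns;
    }
    std.debug.print("worst wakeup latency: {d} us\n", .{worst_latency_ns / 1000});
}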

An example of one of the most demanding applications is motion control. There may be 128 axes of servo drives in an extremely large CNC machine, controlling multiple cutters cutting metal. The servo drives are commanded every 2000 us to follow a specific motion profile. When the heads are close together, there is a risk of collision if they are not commanded in time to follow the motion profile closely.

The PREEMPT_RT patches for the Linux kernel let the developer estimate upper bounds for the latency on the order of 50-100 us. Practically, there is always an upper bound (except for deadlocks), but the upper bound is not useful unless it is on the order of hundreds of microseconds. A commercial real-time operating system can provide latencies of tens of microseconds. Examples of why the kernel must be designed for real time: it may choose to swap your memory to disk, write kernel logs to disk, or service loads of interrupts on networking hardware, all of which can steal CPU time away from your process. A real-time kernel provides APIs to prevent your memory from being swapped (mlock()) and to prioritize your process above others (SCHED_FIFO).

2 Likes