
Electromagnetic transient (EMT) simulation is the most detailed tool in the power engineer's kit. It resolves the network at microsecond time steps, capturing the switching transients, control interactions, and fast dynamics that phasor-domain tools simply average away. That fidelity has a cost: EMT is computationally heavy, and as networks fill with inverter-based resources, HVDC, and large electronic loads, the cases we are asked to run keep getting bigger and more numerous.

The hardware sitting under most engineers' desks has changed to meet that demand. Multi-core processors with hyper-threading are now ordinary, and a typical workstation exposes eight, twelve, or more logical cores. The question is no longer whether parallel hardware exists — it is whether our simulation tools actually use it. A single EMTDC solver running on a single core leaves most of that silicon idle.

The right way to think about acceleration is holistic. The goal is rarely "make one run faster" in isolation; it is "finish the engineering task faster." And the task usually involves running the same model under many conditions — sometimes hundreds or thousands of conditions — or running one very large network that no single core can chew through in a reasonable wall-clock time. Concurrent EMTDC addresses both. This paper explains how, where the limits sit, and how high-performance interconnects move those limits.

THE CORE IDEA

Concurrent EMTDC runs multiple EMTDC solver instances at the same time, coordinated by PSCAD. Two distinct strategies, data parallelism and task parallelism, map cleanly onto two very different engineering problems.

There are two fundamentally different ways to spread an EMT workload across processors, and they answer different questions.

Data Parallelism: Many Runs, Same Model

Data parallelism keeps a single solver process but runs it across multiple sets of data. This is the natural fit whenever you need the same simulation under a sweep of different variables: a parametric study, a Monte-Carlo screen, a control-tuning loop, or a fault-position scan. Because any one run does not depend on any other, the runs can execute simultaneously rather than one after another.

Traditionally a "multiple run" was strictly serial — execute, read the result, change a condition, execute again. Data-parallel execution breaks that chain. Each case is dispatched to its own EMTDC instance, and because the runs are independent, communication between them is minimal: values are exchanged only at the very end and beginning of each run, not on every time step.

Two architectures are useful here. The simplest is a pure parallel sweep, where many identical copies of a case run concurrently. The more elegant is a master–slave arrangement:

A master case holds only the logic that decides what to run next: an optimizer, a sweep generator, or a search routine. It is built as an EMTDC case but does almost no power-system computation; it runs a single step to compute new conditions from incoming results.

Slave cases are copies of the actual EMT model. Each receives a condition from the master, runs the real simulation for a period, and returns its objective value.

Values move between projects using an existing PSCAD transfer component (a radio-link / inter-project transfer) configured to carry data from one project namespace to another. The master seeds initial conditions to all slaves; the slaves return results; the master computes the next set and re-dispatches — repeating until it has enough information, then signalling every slave to terminate.
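The master–slave loop described above can be sketched in a few lines. This is a minimal illustration, not PSCAD's mechanism: `slave_run` is a hypothetical stand-in for a full EMT run, the objective is a toy quadratic, and threads stand in for separate solver processes. The names `VOLLEY`, `THRESHOLD`, `master_next`, and `optimize` are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

VOLLEY = 3         # slave instances dispatched per round (hypothetical)
THRESHOLD = 1e-3   # convergence threshold on the objective (hypothetical)


def slave_run(gain: float) -> float:
    """Stand-in for one EMT run: return an objective for a candidate gain.

    A real slave would simulate the full model; this toy objective has its
    minimum at gain = 2.0.
    """
    return (gain - 2.0) ** 2


def master_next(center: float, width: float) -> list[float]:
    """Master logic: propose VOLLEY candidates around the current best."""
    return [center - width, center, center + width]


def optimize(start: float = 0.3, width: float = 1.0, max_rounds: int = 50):
    """Dispatch volleys until the objective crosses THRESHOLD."""
    best_gain, best_obj, runs = start, float("inf"), 0
    with ThreadPoolExecutor(max_workers=VOLLEY) as pool:
        for _ in range(max_rounds):
            candidates = master_next(best_gain, width)
            # The whole volley executes concurrently; results return together.
            objectives = list(pool.map(slave_run, candidates))
            runs += len(candidates)
            obj, gain = min(zip(objectives, candidates))
            if obj < best_obj:
                best_gain, best_obj = gain, obj
            else:
                width /= 2.0   # no improvement: tighten the search
            if best_obj < THRESHOLD:
                break          # master has enough information: terminate
    return best_gain, best_obj, runs
```

The design point is the separation of concerns: `master_next` never touches the physics, and `slave_run` never decides what to try next, so the slave count can change without touching either.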

KEY BUILDING BLOCKS

Simulation sets: you must tell PSCAD which projects belong together and how they interconnect. A set defines the complete collection of EMTDC instances to generate and launch together. You run the set, not the individual case.

Volley count: the number of identical slave instances to launch in parallel (e.g., a volley of three runs three solvers plus PSCAD, a clean fit for a four-core machine).

Rank: each parallel run is assigned a rank from 1 to N. That index lets each instance look up its own row in a parameter table, so a single model can self-select the right conditions for its slice of the study.
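The rank-to-table lookup is easy to picture in code. The sketch below is illustrative only: the table contents and the function name `conditions_for_rank` are hypothetical, and in PSCAD the lookup is done by components inside the case rather than by a script.

```python
# Hypothetical parameter table: one row per parallel run.
PARAMETER_TABLE = [
    {"fault_position_pct": 10, "fault_resistance_ohm": 0.1},
    {"fault_position_pct": 50, "fault_resistance_ohm": 0.1},
    {"fault_position_pct": 90, "fault_resistance_ohm": 5.0},
]


def conditions_for_rank(rank: int, table=PARAMETER_TABLE) -> dict:
    """Return the table row for a 1-based rank."""
    if not 1 <= rank <= len(table):
        raise ValueError(f"rank {rank} is outside 1..{len(table)}")
    return table[rank - 1]   # rank starts at 1, Python lists start at 0
```

Every instance runs the same model and the same lookup; only the rank differs, which is what lets one unmodified case cover the whole sweep.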

For very large sweeps, PSCAD's Parallel Multi-Run (PMR) executes a pure multiple-run scenario entirely in parallel — on the order of thousands of runs. A run profile such as "500 × 24" means 500 total runs executed at most 24 at a time, so the machine stays saturated without being overwhelmed. A practical rule of thumb on a hyper-threaded desktop is to run roughly twice the number of logical cores concurrently.
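The arithmetic behind a run profile such as "500 × 24" is worth making explicit. The sketch below assumes strict wave-by-wave execution, which is a simplification (a real scheduler may start a new run as soon as any slot frees up); the function names are hypothetical.

```python
def run_waves(total_runs: int, max_concurrent: int) -> list[int]:
    """Sizes of successive waves when at most max_concurrent runs execute at once."""
    if total_runs < 0 or max_concurrent < 1:
        raise ValueError("total_runs must be >= 0 and max_concurrent >= 1")
    full, rest = divmod(total_runs, max_concurrent)
    return [max_concurrent] * full + ([rest] if rest else [])


def suggested_concurrency(logical_cores: int) -> int:
    """Rule of thumb from the text: about twice the logical core count."""
    return 2 * logical_cores


waves = run_waves(500, 24)
print(len(waves), waves[-1])   # 21 waves; the last wave holds 20 runs
```

On a six-core, hyper-threaded desktop, `suggested_concurrency(12)` gives 24, which is the concurrency used in the profile above.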

A second optimization is the snapshot. One simulation set runs the model up to a chosen instant — say 0.45 seconds — and captures a snapshot of the network state. The bulk study set then starts every run from that snapshot instead of re-simulating the identical pre-event window thousands of times. Between sets you can change project properties so one stage serves the next.
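The snapshot idea can be demonstrated with a toy model. The sketch below is not EMTDC: a first-order system integrated with forward Euler stands in for the network, and `DT`, `T_END`, `TAU`, and the case list are hypothetical. Only the 0.45-second snapshot instant comes from the text.

```python
DT = 1e-3           # time step in seconds (hypothetical, coarse for the toy)
T_SNAPSHOT = 0.45   # snapshot instant from the text
T_END = 1.0         # end of each study run (hypothetical)
TAU = 0.1           # time constant of the toy system (hypothetical)

PRE_STEPS = round(T_SNAPSHOT / DT)
POST_STEPS = round((T_END - T_SNAPSHOT) / DT)


def advance(state: float, u: float, n_steps: int) -> float:
    """Forward-Euler steps of dx/dt = (u - x) / TAU; stands in for the solver."""
    for _ in range(n_steps):
        state += DT * (u - state) / TAU
    return state


def full_run(post_event_input: float) -> float:
    """Re-simulate the pre-event window, then the event (the wasteful way)."""
    return advance(advance(0.0, 1.0, PRE_STEPS), post_event_input, POST_STEPS)


# Stage 1: simulate the pre-event window once and keep the state.
snapshot = advance(0.0, 1.0, PRE_STEPS)


def run_from_snapshot(post_event_input: float) -> float:
    """Stage 2: every study run starts from the saved state."""
    return advance(snapshot, post_event_input, POST_STEPS)


cases = [0.0, 0.5, 2.0]
results = [run_from_snapshot(u) for u in cases]
steps_saved = PRE_STEPS * (len(cases) - 1)   # pre-event steps not repeated
```

Because the pre-event window is identical in every case, starting from the snapshot gives the same answer as a full run while skipping the repeated steps; the saving grows linearly with the number of runs.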

BOTTOM LINE ON DATA PARALLELISM

It is essentially completely scalable. The more cores you have, the more runs execute at once and the faster the study completes, because communication between runs is negligible.

Task Parallelism: One Network, Many Cores

Task parallelism answers a different question: how do you accelerate one single large network? Here you take one electrical problem and split it into pieces, where each piece is a subsystem of the same network. Each subsystem becomes a separate project running on its own CPU core, and the pieces are tied back together through a communication interface so they behave as one coherent run.

The split is made across transmission lines. Travelling-wave line models naturally decouple the network in time — a sufficiently long line means each end can be solved a time step apart without loss of accuracy — which is exactly why line ends are the classic place to break a system for parallel or real-time simulation. The server side of the split holds the line model; the client side simply names the project and line it connects to, and PSCAD ties them together at runtime once both are placed in a simulation set.
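What counts as "sufficiently long" can be estimated from the wave travel time. The sketch below assumes the usual criterion that the travel time along the line must be at least one simulation time step, and it uses the speed of light as the default propagation velocity (overhead lines are close to it; cables are slower). The function names and the `velocity_fraction` parameter are hypothetical.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458


def travel_time_us(length_km: float, velocity_fraction: float = 1.0) -> float:
    """Wave travel time along a line, in microseconds.

    velocity_fraction is the propagation speed relative to the speed of light.
    """
    return length_km / (SPEED_OF_LIGHT_KM_PER_S * velocity_fraction) * 1e6


def can_split_at(length_km: float, time_step_us: float,
                 velocity_fraction: float = 1.0) -> bool:
    """True if the line delays the wave by at least one time step."""
    return travel_time_us(length_km, velocity_fraction) >= time_step_us
```

With a 50-microsecond time step, for example, a 100 km line qualifies as a split point while a 10 km line does not, because its travel time is only about 33 microseconds.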

The catch is communication. Unlike a parametric sweep, these subsystems are tightly coupled: they must exchange interface quantities on every time step. So where data parallelism trades data once per run, task parallelism trades data millions of times per run. That makes communication latency — not raw compute — the dominant factor, and it makes load balancing essential. If subsystems are unevenly sized, the fast ones simply wait on the slow ones, and your speed-up evaporates.

Dimension	Data Parallel	Task Parallel
What is split	The data / conditions	The network itself
Coupling	Independent runs	One coupled network
Communication	Once per run (low)	Every time step (high)
Limiting factor	Available cores	Communication latency + load balance
Scalability	Near-ideal, easy	Strong, but harder to tune
Typical use	Parametric / optimization studies	Single very large EMT network
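The load-balancing point can be quantified with a deliberately simple model. The sketch below assumes that the all-in-one cost per time step is the sum of the subsystem costs and that a split run advances at the pace of its slowest subsystem; communication cost is left out here to isolate the balance effect. The function name and the cost figures are hypothetical.

```python
def balance_speedup(step_costs_us: list[float]) -> float:
    """Speed-up of a split run over the all-in-one run, per time step.

    Every subsystem waits for the slowest one before exchanging data, so the
    split run costs max(step_costs_us) per step.
    """
    return sum(step_costs_us) / max(step_costs_us)


balanced = balance_speedup([100, 100, 100, 100])   # 4.0 on four cores
unbalanced = balance_speedup([250, 50, 50, 50])    # 1.6 on four cores
```

Both splits use four cores and the same total work, but the uneven one delivers well under half the speed-up because three cores sit idle for most of each step.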

Because task-parallel performance lives or dies on communication, it is worth measuring carefully, and the case itself can contaminate that measurement. To isolate the cost of communication alone, a clean experimental method is to take standard IEEE benchmark networks, tie copies of them together with transmission lines, and watch what happens as you add more pieces.

A Controlled Scaling Experiment

Using IEEE 14-, 39-, 78-, 118-, and 300-bus cases on an eight-core machine, copies were chained into a ring from one piece up to seven, each piece on its own core. In a perfect world, seven pieces on seven cores would finish in the same wall-clock time as one piece, since they run concurrently. Reality departs from that ideal by exactly the amount communication costs you. Even over the local host, with no network wire involved at all, the pattern is striking:

Subsystem size	What dominates	Result on 7 cores
IEEE 14-bus (tiny)	Communication overhead swamps compute	Negative speed-up — slower than one core
IEEE 39-bus	Overhead still significant	Slows down, but less severely
IEEE 78-bus	Overhead ≈ parallel benefit	Break-even — same speed as all-in-one
IEEE 118-bus	Compute begins to dominate	Speed-up ~2.5× with 7 cores
IEEE 300-bus (large)	Fixed overhead is a small fraction	Strong speed-up; large subsystems scaled almost linearly with cores

The throughline is simple. The communication cost per interface is roughly a fixed quantity. When each subsystem does very little computation, that fixed cost is huge relative to the work, and parallelizing actually loses. As each subsystem grows, the same fixed communication cost becomes a smaller and smaller share of the total, until the parallel benefit wins decisively.
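A fixed-overhead model reproduces the shape of that pattern. The sketch below is an assumption-laden illustration, not a fit to the measurements: it treats each of N equal pieces as costing `compute_us` per step and adds one fixed `overhead_us` per step for communication. All numbers are hypothetical.

```python
def speedup(n_pieces: int, compute_us: float, overhead_us: float) -> float:
    """Speed-up of N pieces on N cores versus all N pieces in one solver."""
    serial = n_pieces * compute_us        # one solver does all the work
    parallel = compute_us + overhead_us   # pieces run side by side, then exchange
    return serial / parallel


def break_even_compute_us(n_pieces: int, overhead_us: float) -> float:
    """Per-piece compute cost at which the split run ties the all-in-one run."""
    return overhead_us / (n_pieces - 1)


for compute in (1, 10, 100, 600):
    print(compute, round(speedup(7, compute, overhead_us=60), 2))
```

With seven pieces and a 60-microsecond overhead, the model loses below 10 microseconds of compute per piece, ties at exactly 10, and approaches the ideal factor of seven only when compute dwarfs the overhead. Lowering the overhead moves the break-even point down in direct proportion.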

The Granularity Insight

THE CENTRAL TRADE-OFF

The key to performance is smaller grains of computation: split the network into more, smaller pieces to use more cores, but only as far as you can stay ahead of the communication limit. Below a certain grain size, faster communication is the only thing that lets you keep splitting.

That reframes the whole problem. If 78 buses is the smallest grain that breaks even on a given interconnect, then the only way to profitably split into smaller pieces, and thus use many more cores, is to make communication faster. And communication gets dramatically worse the moment you leave a single machine. On the local host, a round trip is on the order of 20 microseconds. Cross the wire to another workstation over standard TCP/IP and that jumps to roughly 230 microseconds, about ten times slower. Since a single workstation does not have enough cores to hold a network broken into dozens of pieces, you must eventually cross that wire. So the wire has to get faster.
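To see why a few hundred microseconds matters, it helps to add up the waiting over a whole run. The sketch below assumes one blocking round trip per time step and a 50-microsecond time step, both of which are hypothetical simplifications; only the 20 and 230 microsecond round-trip figures come from the text.

```python
def comm_wait_seconds(sim_duration_s: float, time_step_us: float,
                      round_trip_us: float, exchanges_per_step: int = 1) -> float:
    """Total time spent waiting on interface exchanges over one run."""
    steps = round(sim_duration_s * 1e6 / time_step_us)
    return steps * exchanges_per_step * round_trip_us * 1e-6


local = comm_wait_seconds(1.0, 50, round_trip_us=20)        # about 0.4 s
cross_host = comm_wait_seconds(1.0, 50, round_trip_us=230)  # about 4.6 s
```

Under these assumptions, one simulated second costs about 0.4 seconds of pure waiting on the local host and about 4.6 seconds across hosts over TCP, before any network is actually solved.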

RDMA and InfiniBand

Standard TCP/IP is slow for this job for two reasons: the operating system kernel has to handle every transfer, and the data is copied multiple times between buffers. Those copies and kernel transitions burn time and CPU cycles that should be spent solving the network.

Remote Direct Memory Access (RDMA) over a high-performance fabric takes a different path. The transfer moves data directly from one machine's memory to another's, bypassing the kernel stack and avoiding the intermediate copies. On a fabric such as InfiniBand, a dedicated processor on the network adapter performs the transfer in hardware, which offloads the operating system entirely, leaving more CPU time for EMTDC while the adapter moves data far faster.

Measured on a two-machine test rig (six-core, 3.5 GHz, hyper-threaded workstations — twelve logical cores each), the latency gap is large:

Path	Standard TCP/IP	RDMA (high-perf fabric)
Local host, small packet	~20 microseconds	~0.9 microseconds
Across hosts, small packet	~230 microseconds	~2.9 microseconds
Larger (8 KB) packets	Substantially higher	Still far ahead of TCP
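The small-packet rows of the table reduce to a few ratios. The sketch below only restates the table's figures; the dictionary layout and function names are hypothetical.

```python
# Small-packet latencies from the table, in microseconds.
LATENCY_US = {
    ("local", "tcp"): 20.0,
    ("local", "rdma"): 0.9,
    ("cross", "tcp"): 230.0,
    ("cross", "rdma"): 2.9,
}


def rdma_advantage(path: str) -> float:
    """How many times lower the RDMA latency is than TCP on the same path."""
    return LATENCY_US[(path, "tcp")] / LATENCY_US[(path, "rdma")]


def cross_host_penalty(protocol: str) -> float:
    """How many times slower crossing hosts is than staying local."""
    return LATENCY_US[("cross", protocol)] / LATENCY_US[("local", protocol)]


print(round(rdma_advantage("local"), 1), round(rdma_advantage("cross"), 1))
```

The ratios come out at roughly 22 times on the local host and 79 times across hosts. Just as telling, the cross-host RDMA figure (2.9 microseconds) is lower than the local-host TCP figure (20 microseconds).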

Measured Results

The payoff shows up directly in coupled EMT cases. Across a range of split models (a trivial AC system, a DC-link case, DFIG wind-farm models, and mixed IEEE-bus combinations), the high-performance fabric kept cross-host timing almost indistinguishable from local-host timing, even on the smallest, fastest cases that stress communication hardest. Over standard TCP, the same cases degraded sharply the moment the split crossed from one machine to another.

One case is especially telling. A doubly-fed induction generator (DFIG) wind farm built from 22 machine models produced 23 executables, more than would fit on a single host's cores. It simply could not be run all on one machine; it had to span hosts. Over the high-performance fabric, wall-clock time stayed essentially flat between the local-host and cross-host configurations (on the order of 27 seconds, whether at 10 or 22 generators), with no measurable degradation from crossing machines. That is the door opening: networks can now be split into many more, much smaller pieces, and spread across many machines, without paying the old cross-host penalty.

Shared Memory on the Local Host

Within a single machine, there is an even faster path than any network protocol: shared memory. The design uses first-in-first-out (FIFO) buffers behind an abstraction layer between EMTDC instances. A deliberate choice here is to avoid semaphores or other locking mechanisms — because if the operating system ever sees an EMTDC instance blocking on a lock, it may de-prioritize that process and starve it of CPU cycles. Keeping every solver visibly "busy" ensures the scheduler keeps all of them running at full tilt.
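A lock-free FIFO of this kind is typically a single-producer, single-consumer ring buffer. The sketch below illustrates the idea only and is not the EMTDC implementation: the class `SpscRing` and the `demo` function are hypothetical, Python threads stand in for separate solver processes, and correctness here leans on CPython's global interpreter lock. A native version would need atomic index updates with acquire/release ordering.

```python
import threading


class SpscRing:
    """Single-producer, single-consumer FIFO with no locks.

    Only the producer writes _head; only the consumer writes _tail.
    """

    def __init__(self, capacity: int):
        # One slot stays empty so that "full" and "empty" are distinguishable.
        self._slots = [None] * (capacity + 1)
        self._head = 0   # next write position (producer-owned)
        self._tail = 0   # next read position (consumer-owned)

    def try_push(self, item) -> bool:
        nxt = (self._head + 1) % len(self._slots)
        if nxt == self._tail:
            return False             # full: the caller spins, it never blocks
        self._slots[self._head] = item
        self._head = nxt             # publish only after the slot is written
        return True

    def try_pop(self):
        if self._tail == self._head:
            return False, None       # empty
        item = self._slots[self._tail]
        self._tail = (self._tail + 1) % len(self._slots)
        return True, item


def demo(n_items: int = 200, capacity: int = 16) -> list:
    """Pass n_items through the ring between two busy-polling threads."""
    ring = SpscRing(capacity)
    received = []

    def producer():
        for value in range(n_items):
            while not ring.try_push(value):
                pass                 # spin: stay busy instead of sleeping on a lock

    def consumer():
        while len(received) < n_items:
            ok, value = ring.try_pop()
            if ok:
                received.append(value)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Neither side ever waits on a lock; each one polls its end of the ring, which is the "visibly busy" behaviour the text describes. The cost of that choice is that a waiting solver burns a core while it polls.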

The destination is a hybrid that picks the fastest available path automatically. On the same machine, instances talk through shared memory; across machines, they talk through RDMA over a high-performance fabric via the host channel adapter. A communication fabric layer inside each EMTDC instance hides those details from the engineer entirely.
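The selection rule itself is simple enough to state as code. The sketch below is a hypothetical illustration of the decision, not the fabric layer's actual interface: `Instance`, `pick_transport`, and `plan` are invented names, and the project and host names are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    project: str   # hypothetical project name
    host: str      # machine the solver instance runs on


def pick_transport(a: Instance, b: Instance) -> str:
    """Choose the fastest path described in the text for one interface."""
    return "shared_memory" if a.host == b.host else "rdma"


def plan(interfaces):
    """Map each (project, project) interface to its transport."""
    return {(a.project, b.project): pick_transport(a, b) for a, b in interfaces}


north = Instance("north_zone", "ws1")
south = Instance("south_zone", "ws1")
wind = Instance("wind_farm", "ws2")
transports = plan([(north, south), (south, wind)])
```

Here the two zones on the same workstation would talk through shared memory, while the link to the wind farm on the second workstation would go over RDMA.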

THE ENGINEER'S-EYE VIEW

You should never have to think about sockets, adapters, or copies. You declare that this line connects to that line, and the fabric figures out how to move the data as fast as the hardware allows: shared memory or RDMA, local or remote.

The ambition is to scale to sixty, seventy, even a hundred subsystem splits across multiple machines, and have the whole thing run as if all those cores lived in one box under your desk. With communication latency driven into the low-microsecond range, the granularity limit drops, more cores become usable, and EMT stops being constrained to what one workstation can hold.

At Keentel, we treat grid interconnection as a first-order design input, not a downstream formality, and EMT studies are increasingly where interconnection is won or lost. Point-of-interconnection studies for inverter-based resources now demand EMT-domain models to capture control interactions, sub-synchronous behaviour, and weak-grid stability. Those studies multiply quickly: many fault positions, many dispatch conditions, many what-if configurations across an entire interconnection cluster.

Concurrent EMTDC maps onto that work in two ways. Data-parallel execution turns a multi-day parametric or fault-scan study into an overnight or same-day deliverable, which directly compresses interconnection study timelines. Task-parallel execution, paired with high-performance interconnects, makes it feasible to model a wide-area network in EMT detail as a single coherent case rather than stitching together reduced equivalents. For utility-scale renewables, BESS, and large-load interconnections, that is the difference between a representative answer and an approximate one.

This explainer is part of Keentel Engineering's ongoing technical series on EMT modeling, interconnection engineering, and power system studies. The FAQ and case studies that follow distill the practical questions we hear most and present three anonymized, field-style scenarios that show the methods at work.

The following three scenarios are anonymized and generalized to illustrate the methods in practice. They contain no client, project, or location identifiers and are presented for educational purposes.

Case Study 1: Automated Control Tuning via Master–Slave Data Parallelism

Method: Data parallelism (master–slave optimization)

Scale: Three parallel slaves on a four-core workstation

Challenge

A control system inside an EMT model required tuning across a parameter space. Done the traditional way (run, read the objective, adjust, run again), the optimization was strictly serial and slow, leaving most of the machine's cores idle while a single solver worked through one candidate at a time.

Approach

The optimization logic was lifted out of the simulation and placed in a dedicated master case. The master takes incoming objective values on one side and generates the next set of conditions on the other, running only a single step to do so. The actual EMT model became the slave, replicated as a volley of three so that three solver instances plus PSCAD ran together on a four-core machine. Radio-link transfer components carried objective values back to the master and new initial conditions out to the slaves. Objective values were plotted on a trend graph whose horizontal axis represented run number rather than time, and the master was configured to terminate automatically once the objective crossed a convergence threshold.

Outcome

The optimization converged and self-terminated at roughly 74–75 runs, with no manual intervention.

All available cores stayed busy throughout, replacing a serial loop with concurrent candidate evaluation.

Waveform output on the slaves was suppressed to maximize throughput, since only the objective mattered during the search.

TAKEAWAY

When a study is really a search, separating the decision logic (master) from the physics (slaves) turns an inherently serial optimization into a parallel one — and lets the master stop the moment it has learned enough.

Case Study 2: High-Volume Parametric Study with Snapshot-Initialized PMR

Method: Data parallelism (Parallel Multi-Run + snapshot)

Scale: 500 runs in waves of 24; six-core hyper-threaded desktop

Challenge

A switching study required running a network many hundreds of times, repeatedly taking transmission lines in and out of service and observing the response. The portion of the network under study was small, but the run count was very high, on the order of hundreds to thousands. A serial campaign of that size was impractical, and re-simulating the identical pre-event window on every run wasted enormous compute.

Approach

The study used Parallel Multi-Run with a two-stage simulation-set design. The first set ran the model to a fixed instant (approximately 0.45 seconds) and captured a snapshot of the network state. The second set started every run from that snapshot, using a run profile of 500 by 24 (500 total runs, at most 24 executing concurrently), chosen because roughly twice the logical core count keeps a six-core, hyper-threaded desktop saturated without overload. A rank component assigned each run an index from 1 to 500, and that index drove a table lookup so each run automatically selected its own switching parameters. Data output was suppressed after the snapshot so the runs produced only the required output files.

Outcome

Batches of 500 runs executed as tightly packed parallel waves rather than a long serial queue.

Snapshot initialization eliminated repeated pre-event computation across every run in the batch.

Rank-based table indexing let one unmodified model cover the entire parameter set, with results written straight to output files for post-processing.

TAKEAWAY

For large sweeps, combine three levers: parallel waves sized to your logical cores, a snapshot to skip redundant pre-event simulation, and rank-indexed tables so a single model self-configures each run.

Case Study 3: Region-Scale EMT Decomposition with a High-Performance Interconnect

Method: Task parallelism + RDMA over high-performance fabric

Scale: 2,500+ buses, ~900 lines, ~50 mesh interconnects, 10-way split

Challenge

A large interconnected transmission network (more than 2,500 buses and roughly 900 lines) needed to be simulated in full EMT detail as a single coherent case. Run whole on one solver, the wall-clock time was about 56 minutes, far too slow for iterative study work, and the network was too large to hold comfortably in a single EMTDC instance.

Approach

The network was decomposed by zones into about ten subsystem projects, split across long transmission lines. The interconnections formed a mesh rather than a simple ring (roughly fifty interface points), with each subsystem assigned to its own CPU core. For configurations that exceeded the cores available on one machine, RDMA over a high-performance fabric replaced standard TCP between hosts to keep interface latency in the low-microsecond range. A companion sub-case (a DFIG wind farm of 22 machine models producing 23 executables) was used to confirm cross-host behaviour, since it could not fit on a single host at all.

Outcome

Configuration	Wall-clock	Notes
All-in-one, single solver	~56 minutes	Baseline
10-way task-parallel split	~6 minutes	≈10:1, near-linear on 10 cores
DFIG 22-gen, cross-host (fabric)	~27 seconds, flat	No measurable degradation vs local host

The decomposed run finished in roughly six minutes against a 56-minute baseline, close to a ten-to-one, near-linear improvement on ten cores. Over the high-performance fabric, the wind-farm sub-case showed essentially no penalty for crossing machines, where standard TCP would have degraded sharply. The combined result demonstrates that very large networks can be modeled in EMT as a single entity by splitting them across cores and, with a fast enough interconnect, across machines.
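The headline ratio can be checked with the standard definitions of speed-up and parallel efficiency. The sketch below uses the rounded wall-clock figures quoted above, so its results are approximate; the function name is hypothetical.

```python
def speedup_and_efficiency(baseline_min: float, parallel_min: float, cores: int):
    """Speed-up is baseline / parallel time; efficiency is speed-up per core."""
    s = baseline_min / parallel_min
    return s, s / cores


s, e = speedup_and_efficiency(56, 6, 10)
print(round(s, 2), round(e, 2))   # 9.33 0.93
```

On the rounded figures the split delivers a little over nine times the baseline speed, or roughly 93 percent parallel efficiency on ten cores, which is what "near-linear" means in practice.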

TAKEAWAY

Task-parallel decomposition turns an hour-long EMT case into a minutes-long one, and a high-performance interconnect removes the cross-machine penalty, together unlocking wide-area EMT modeling that no single workstation could hold.

Keentel Engineering is a power systems and grid interconnection consulting firm with offices in Tampa, Florida and Austin, Texas. Our work spans EMT modeling, point-of-interconnection (POI) engineering, substation and transmission-line design, utility-scale renewables and battery energy storage (BESS) engineering, power system studies, and NERC operations and planning compliance. We approach every engagement interconnection-first, treating grid interconnection as a first-order design input rather than a downstream utility formality.

Author: Sandip R. Patel, P.E., Founder & Principal Engineer, IEEE Senior Member.

Contact Keentel Engineering to discuss accelerating your EMT, interconnection, or parametric study workload.

DISCLAIMER & TRADEMARKS

This document is an independent educational and technical commentary produced by Keentel Engineering. PSCAD® and EMTDC™ are trademarks of their respective owner; InfiniBand® is a trademark of its respective owner. Keentel Engineering is not affiliated with, endorsed by, or sponsored by any such trademark holder.

Any case studies are anonymized, generalized, and presented for educational purposes; they do not identify any specific client, project, or location. Performance figures are illustrative of the methods described and will vary with hardware, network model, and configuration. © Keentel Engineering. All rights reserved.

This explainer synthesizes publicly understood concepts in parallel and high-performance computing as applied to electromagnetic transient simulation. Readers seeking primary technical detail may consult:

IEEE standard benchmark network models (14-, 39-, 118-, and 300-bus systems) commonly used to characterize simulation scaling.

Published literature on travelling-wave transmission-line decoupling for parallel and real-time electromagnetic transient simulation.

General references on Remote Direct Memory Access (RDMA) and high-performance interconnect fabrics, including InfiniBand architecture and verbs.

Operating-system scheduling and lock-free inter-process communication concepts (FIFO buffers, shared memory) as applied to compute-bound concurrent processes.

Vendor documentation for the PSCAD / EMTDC simulation environment regarding simulation sets, Parallel Multi-Run, snapshots, and inter-project transfer.

Q. What is the difference between data parallelism and task parallelism in EMTDC?
Data parallelism runs the same model many times under different conditions, with each run independent and on its own solver instance — ideal for parametric, optimization, and Monte-Carlo studies. Task parallelism splits one large network into coupled subsystems that run on different cores simultaneously and exchange data every time step. Data parallelism is limited mainly by how many cores you have; task parallelism is limited by communication latency and load balance.
Q. Do I run the individual case or something else?
For parallel work you run the simulation set, not the individual case. The set tells PSCAD the complete collection of EMTDC instances to generate and how they interconnect. Running a single case in isolation will not launch its partners, and a coupled case will fail to compile because it cannot find the other end of its interface.
Q. How are subsystems coupled when a network is split?
The split is made across transmission lines. The travelling-wave line model decouples the two ends by roughly one time step, so each end can be solved on a separate core without loss of accuracy. The side holding the line model is the server side; the other side is the client and simply references the partner project and line name. Both must sit in the same simulation set so PSCAD can tie them together at runtime.
Q. Why did splitting my small case make it run slower?
Because communication overhead per interface is roughly fixed, and on a small subsystem that fixed cost is large relative to the actual computation. Below a break-even grain size you get negative speed-up — the parallel version is slower than running it whole. The fix is either larger subsystem grains or a faster interconnect that lowers the communication cost.
Q. What size of subsystem actually benefits from task parallelism?
It depends entirely on your interconnect speed. In controlled tests over the local host, very small subsystems lost time, a mid-sized benchmark sat near break-even, and large subsystems scaled almost linearly with cores. The faster your communication, the smaller the grain you can profitably split into — which is precisely why high-performance fabrics matter.
Q. How much slower is cross-machine communication over standard networking?
On standard TCP/IP, a local-host round trip is on the order of 20 microseconds, while crossing the wire to another workstation is roughly 230 microseconds — about ten times slower. That gap is the main obstacle to spreading a coupled simulation across multiple machines, and it is what RDMA over a high-performance fabric is designed to close.
Q. What is RDMA, and why is it so much faster than TCP?
Remote Direct Memory Access moves data directly between the memories of two machines, bypassing the operating-system kernel stack and avoiding the repeated buffer copies that TCP requires. On a fabric such as InfiniBand, a processor on the network adapter performs the transfer in hardware, offloading the CPU. Measured small-packet latencies fall to roughly 0.9 microseconds local and 2.9 microseconds across hosts, about 20 to 80 times lower than the TCP figures of 20 and 230 microseconds.
Q. What are 'volley count' and 'rank'?
Volley count is the number of identical parallel slave instances to launch — a volley of three runs three solvers plus PSCAD, which fits a four-core machine neatly. Rank is the index, from 1 to N, automatically assigned to each parallel run. Rank lets a single model self-select its parameters by looking up its own row in a table, so the same case can cover a large sweep without manual editing.
Q. What is a snapshot, and why use one in a large study?
A snapshot captures the full network state at a chosen instant. In a bulk study, one set runs the model up to that instant once and saves the snapshot; the study set then starts every run from the snapshot instead of re-simulating the identical pre-event window thousands of times. It removes redundant computation and can substantially shorten a large parametric campaign.
Q. How many concurrent runs should I allow on a hyper-threaded desktop?
A practical rule is roughly twice the number of logical cores. On a six-core, hyper-threaded machine (twelve logical cores), running about 24 concurrently keeps the processors saturated while leaving headroom, so the rest queue and execute in waves. A run profile such as '500 × 24' expresses exactly that: 500 runs, 24 at a time.
Q. Why avoid locks and semaphores in the shared-memory path?
Because if the operating system detects an instance blocking on a lock, it may treat that process as idle and reduce its CPU allocation, starving the solver. Using lock-free FIFO buffers keeps every EMTDC instance visibly active, so the scheduler keeps all of them running at full speed.
Q. Can these methods scale across more than two machines?
Yes. Data-parallel studies already scale with whatever cores are available, including across a local-area network. For task-parallel work the goal is a unified communication fabric that uses shared memory within a machine and RDMA between machines, targeting dozens of subsystem splits across multiple workstations while behaving like one large machine.

In 1995, Sandip (Sonny) R. Patel earned his Electrical Engineering degree from the University of Illinois. But degrees don’t build legacies; action does. For three decades, he’s been shaping the future of engineering, not just as a licensed Professional Engineer across multiple states (Florida, California, New York, West Virginia, and Minnesota), but as a doer. A builder. A leader. Not just an engineer. A Licensed Electrical Contractor in Florida with an Unlimited EC license. Not just an executive. The founder and CEO of KEENTEL LLC, where expertise meets execution. Three decades. Multiple states. Endless impact.