I can think of four main tricks for making a CA program run fast.

Lookup table. Generally a cell takes on a NewC value which is computed on
the basis of info in the cell's neighborhood. Try to find a way to pack the
neighborhood info bitwise into a nabecode number. Then use nabecode as an
index into a lookup table. Thus NewC = lookup[nabecode]. You precompute the
lookup values for all possible nabecodes before running the CA. Lookups can be
saved, as Walker and I did in CA LAB.

Pointer swap. To run a CA, you need two buffers, one for the current world
of cells, and one for the updated world of cells. After the update, *don't*
copy the updated world onto the current world. Just swap the pointers to world
and new world.

The flywheel. In stepping through the cells of the CA, you repeatedly
compute a cell's nabecode, then compute the nabecode of the next cell over,
and so on. Because the neighborhoods overlap, a lot of the info in the next
cell's nabecode is the same as in the old cell's nabecode. Try to arrange
nabecode so that you can left shift out the old info and OR in the new info.

Assembly language. A 2D VGA CA is going to have about 300K cells. That
means you are going to assemble nabecodes and lookup the NewC values about
300K times per screen. This means that your inner loop for flywheeling the
nabecode must be as efficient as possible. If you can write this in assembly
language, and keep an eye on the listed ``clocks'' per instruction, you can
shave off a few clocks here and there, which really adds up when done 300K
times.

If the rule set is known to lead to sparse configurations, e.g. Life Game
with a small initial pattern, then one can use sparse matrix tricks. That is, to
just compute in the vicinity of occupied cells. Generally these do not compile
as efficiently as a full matrix method, because there is more indirect address
and branches. However, one could include both a sparse and full matrix method in
the same program, then convert when the cross-over density is reached.

Tips for periodic boundary conditions

There are two basic methods:

1) coding for faster modulo arithmetic; and 2) using a larger array and
copying the boundary cells between iterations. The brute force method of doing
modulo arithmetic on index variable, x, for a range of 0..63 in C is
(x+offset)%64 whereas on some architectures (e.g. some Sun Sparcstations) it is
actually faster to do

register int tmp=x+offset;

tmp>63 ? tmp-64 : tmp

if offset is positive and similarly if it is negative.
Still better, if the
range is a power of 2, might be