In much of my previous work using an FPGA as an accelerator, I came across the same situation:
- I need to configure the cores (usually parallel cores with the same or a similar architecture) by writing configuration/parameter registers in each core.
- The same type of register is used to sample the results or status of the cores.
- The master controlling the cores (by reading/writing the registers) is usually a soft CPU or an interface to the host PC.
Given that all cores are in the same FPGA and the FPGA has a balanced clock tree, no de-skew or resynchronisation is needed between the master and the cores. This makes things much easier. NOT!
In FPGA design, these registers are usually implemented in distributed memory (i.e. DFF primitives). This is sometimes necessary since the contents of these registers are required simultaneously (e.g. the parameters of a FIR filter). It is then straightforward to allow them to be written simultaneously as well.
In an example design, we have 10 cores. Each core has 10 configuration/parameter/result registers. All registers are 32-bit. This is a design with 32*10*10=3200 signals (ignoring the read/write controls) just for the purpose of infrequent data communication.
Here comes the problem: the connections between the master and the cores make it difficult to meet the timing constraints, and sometimes even impossible to pass P&R. We simply spend too many routing resources on something unrelated to the performance of the design.
Now you are thinking of using a bus system to connect these registers to the master, with an address decoder in each core. This will relieve the routing channel congestion, since the number of signals now depends on the number of cores and not on the number of registers.
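To put rough numbers on this, here is a small back-of-the-envelope sketch for the example design. The function names are mine, and the bus model is an assumption: one point-to-point write link per core carrying the data word plus just enough address bits to select a register, with control signals and the read-back path ignored.

```python
def direct_signals(cores, regs_per_core, width):
    """Every register bit is wired individually to the master."""
    return cores * regs_per_core * width


def bus_signals(cores, regs_per_core, width):
    """Assumed model: each core gets one data word plus address bits.

    Control signals and the read-back path are ignored, as in the text.
    """
    # Number of bits needed to address regs_per_core registers.
    addr_bits = (regs_per_core - 1).bit_length()
    return cores * (width + addr_bits)


print(direct_signals(10, 10, 32))  # 3200
print(bus_signals(10, 10, 32))     # 360
```

Under these assumptions the register count only shows up through the handful of address bits, so the wiring grows with the number of cores, not with the number of registers.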
But the timing issue is still not solved. Putting a global decoder in the master and connecting all registers to a single parallel bus will not work. First, some FPGAs have no tristate buffers for internal connections, so you end up using more signals (for sending the read data back to the master). Also, the fan-out of the master outputs will be too high, since they have to drive all the cores. More importantly, we usually fill up the FPGA with as many parallel cores as possible, so these cores are spread all over the chip. Thus the connection to the furthest core dominates the critical path.
You are upset by the fact that you have a highly optimised design for computation, but it is slowed down by a configuration bus. There are two options from here: pipeline the bus between the master and the cores, or use a slower clock for the configuration part.
The second approach will solve the issue once and for all. But clocking resources in an FPGA are limited and more valuable than DFFs and LUTs. It is also not easy to implement the synchroniser in an FPGA, and using async BlockRAM for a handful of registers is another big waste. In the end, you also need to set special constraints for the STA tool to get timing closure right. If you are willing to go through all this trouble, why not just declare the bus as a multi-cycle path in the timing constraints? Either way, you cannot avoid asynchronous design in the FPGA if you follow this path.
The first approach is actually easier, with the help of automatic retiming from the EDA tools. All it costs is some DFFs, which should not be an issue in modern FPGAs. The real issue is the latency, which impacts the bus protocol: there is now a delay between the write signal and the actual update of the registers in the cores. For example, the master must wait until the parameters are actually updated before sending the 'go' signal (usually a single wire) to the cores. It gets more complicated if any kind of acknowledgement is required for the write operation (e.g. the full signal in a FIFO interface).
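The 'go' hazard can be illustrated with a toy cycle-level model. This is only a sketch under my own assumptions: the class and function names are made up, the write path has `depth` pipeline registers, and 'go' is a direct, unpipelined wire that samples the parameter register as it stands at the start of the cycle.

```python
from collections import deque


class PipelinedCore:
    """Core whose register write port sits behind `depth` pipeline stages."""

    def __init__(self, depth, initial=0):
        # Each slot holds None (a bubble) or a value still in flight.
        self.pipe = deque([None] * depth)
        self.param = initial
        self.started_with = None  # parameter value seen when 'go' arrived

    def clock(self, write=None, go=False):
        """Advance the model by one clock cycle."""
        if go:
            # 'go' sees the register before this cycle's update lands.
            self.started_with = self.param
        self.pipe.append(write)
        arrived = self.pipe.popleft()
        if arrived is not None:
            self.param = arrived


def run(depth, wait_cycles, new_value=42):
    """Write a parameter, idle for wait_cycles, then raise 'go'."""
    core = PipelinedCore(depth)
    core.clock(write=new_value)   # master issues the write
    for _ in range(wait_cycles):  # idle cycles inserted by the master
        core.clock()
    core.clock(go=True)           # master raises 'go'
    return core.started_with


print(run(depth=3, wait_cycles=0))  # 0: core started with the stale value
print(run(depth=3, wait_cycles=3))  # 42: write landed before 'go'
```

In this model the master has to insert at least `depth` idle cycles between the write and 'go', which is exactly the kind of protocol detail that pipelining forces you to think about.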
What I am proposing here is a serial bus, which has the following advantages:
- It runs on the same clock as the master and the cores.
- It needs a minimum number of signals (only two) between the master and each core.
- It needs minimal control logic to implement.
It has disadvantages similar to those of the first approach above, but it can still save a lot of routing resources. It is also suitable for ASIC design, where flip-flops are more valuable than in an FPGA and the cost of retiming every bus is too high.
The details of this serial bus will be presented in the next post.