Reuse Timing-closed Logic As A Shell
Background
Often in FPGA development, a desirable timing-closed implementation is only achieved after several iterations or many parallel implementation runs of a design. Elusive timing closure can be caused by one or a few stubborn modules in a design that have tight constraints or a large number of moderately difficult paths that have a lower probability of timing closure on any given run.
One advantageous strategy to improve timing closure success can be to preserve and enable reuse of a known good implementation of the stubborn logic. By preserving the implementation, place and route tools can (hopefully) avoid rediscovering difficult timing closure and simply focus on the other logic.
Some traditional approaches in Vivado to employ this preservation strategy are Incremental Implementation Flows or Dynamic Function eXchange (DFX, previously known as partial reconfiguration or PR). Incremental Implementation Flows can work if the design has mostly converged and the amount of future change to the design is small. However, if significant development still remains, this strategy is unlikely to save compile time.
Using DFX, one can lock down a portion of the design to form a reusable shell along with one or more reconfigurable partitions that contain logic under development. However, using DFX for this reuse methodology comes with some additional restrictions, such as requiring area constraints and partition pin placements between the static and dynamic partitions of the design. It is more difficult to achieve an overlap of the preserved logic and the new logic, and the nature of DFX requires additional DRCs that would not normally be run without using DFX.
This tutorial offers an alternative to the DFX flow with fewer restrictions and the ability to reuse timing-closed logic without the need for area constraints by using the capabilities inherent in RapidWright.
Approach
To enable reuse of a timing-closed design as a shell in RapidWright, the original design will need some minor modifications.
The design should be logically partitioned into two parts: static and dynamic (as shown in the diagram above). The static part of the design is everything that should be preserved and be part of the “shell”. For example, many designs include components for handling network, DDR memory or a PCIe interface. These kinds of modules typically will have more demanding timing constraints and benefit from reusing their timing closure. The dynamic component is the portion of the design that the designer wants to change over time. The main requirement is that the dynamic component must be composed of one or more logical modules. If there is logic that needs to be modified at the top level of the design, it should be migrated into an existing module or a new module should be created and the logic added to it.
The interface of the dynamic modules must be consistent with all future logic modules that will populate it. In theory, this is straightforward. However, during synthesis, design optimization, placement and routing, optimizations can modify the original interface of a module so that it is no longer consistent with the original definition, and subsequent runs can cause divergence. To avoid this, dynamic modules should have the DONT_TOUCH synthesis attribute applied to the module instance. The alternative, KEEP_HIERARCHY, is not sufficient, as DONT_TOUCH persists on the netlist through routing whereas KEEP_HIERARCHY only persists through synthesis.
Note that applying DONT_TOUCH to a module instance means that Vivado cannot add or remove pins of the instance, but can connect or disconnect pins and optimize logic inside the hierarchical module.
Once the design is properly partitioned and synthesis attributes
applied to dynamic modules, the design should be implemented using the
typical implementation flow in Vivado. Once a fully placed and routed
implementation that meets all requirements has been achieved, this
design can be preserved as a design checkpoint (DCP) and used to seed
the shell creation process.
This candidate shell design can then be loaded into RapidWright and all dynamic modules turned into black boxes.
Getting Started
1. Prerequisites
To run this tutorial, you will need:
RapidWright 2023.1.3 or later
Vivado 2023.1 or later
2. Creating a Candidate Implementation
For ease of demonstration in this tutorial, we have chosen a simple RISCV design targeting a KCU105 board (Kintex UltraScale xcku040-ffva1156-2-e). The design was created using the Linux on LiteX-VexRiscv project, but we will recreate the design using a minimal set of steps and dependencies.
Note
This design compilation step can take up to 30 minutes to complete and it is highly recommended to skip past it to save time. To do so, you can download the output files instead by running:
wget http://www.rapidwright.io/docs/_downloads/kcu105_step2.zip
unzip kcu105_step2.zip
cd kcu105
vivado &
and then skip to step 3.
To get started, follow the commands below to download the source files:
wget http://www.rapidwright.io/docs/_downloads/kcu105_example.zip
unzip kcu105_example.zip
cd kcu105
vivado -source kcu105.tcl &
The included script will create a Vivado project, load the generated Verilog, and then synthesize, optimize, place and route the design. The Verilog module for one of the RISCV CPUs has already been annotated for you with DONT_TOUCH and will serve as our dynamic module for this tutorial. The script will take several minutes to complete but will generate a placed and routed DCP and EDIF file ready for RapidWright. Notice we are running Vivado in the background as we will come back to the terminal shortly.
A sample result is shown in the image below with the leaf cells of the CPU core (cores_1_cpu_logic_cpu) highlighted in yellow.
For convenience in this tutorial, we will generate the logic that will populate the dynamic module directly from this project. We simply need to change the top of the design to the VexRiscv_1 core and then resynthesize using the -mode out_of_context option:
set_property top VexRiscv_1 [current_fileset]
reset_run synth_1
synth_design -mode out_of_context
write_checkpoint riscv_1_synth.dcp
At this point we should have two DCPs, one placed and routed candidate DCP to be made into a shell and one synthesized RISCV core that will populate the dynamic region in our shell.
3. Creating a Shell
To create a shell implementation, we need to take our top-level RISCV design that has the static portion meeting all necessary constraints and remove all logic from the dynamic components.
To remove the logic in the dynamic module, we need to use RapidWright in order to carefully separate the static logic from the dynamic logic as no area constraints (i.e. pblocks) were used to separate the two. Vivado can create a black box but can only do so correctly when the module made into a black box was sufficiently constrained such that all of its logic does not share any sites with any static logic. RapidWright has a built-in command that can accept a DCP and one or more cell instance names and produce a shell-based design with the cell instances turned into black boxes. For our example, we can run RapidWright from the command line (outside of Vivado):
rapidwright MakeBlackBox kcu105_route.dcp kcu105_route_shell.dcp VexRiscvLitexSmpCluster_Cc4_Iw64Is8192Iy2_Dw64Ds8192Dy2_ITs4DTs4_Ldw512_Cdma_Ood/cores_1_cpu_logic_cpu
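When the shell has to be regenerated from a script, the command line above can be assembled programmatically. The following is a minimal sketch, not part of RapidWright; the helper name `make_black_box_command` is hypothetical, and passing additional cell instance names as extra trailing arguments is an assumption based on the "one or more cell instance names" wording above.

```python
import shlex


def make_black_box_command(routed_dcp, shell_dcp, cell_names):
    """Build the argument list for the MakeBlackBox command line shown above.

    Assumption: extra cell instance names are appended as trailing arguments.
    """
    if not cell_names:
        raise ValueError("at least one cell instance name is required")
    return ["rapidwright", "MakeBlackBox", routed_dcp, shell_dcp, *cell_names]


# Hierarchical name of the dynamic module used in this tutorial.
CELL = (
    "VexRiscvLitexSmpCluster_Cc4_Iw64Is8192Iy2_Dw64Ds8192Dy2_ITs4DTs4_Ldw512_Cdma_Ood"
    "/cores_1_cpu_logic_cpu"
)

cmd = make_black_box_command("kcu105_route.dcp", "kcu105_route_shell.dcp", [CELL])

# Print the command for inspection; to actually run it, one could pass
# `cmd` to subprocess.run(cmd, check=True) with rapidwright on the PATH.
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting issues if a DCP path ever contains spaces.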
This will create a new “shell” DCP (kcu105_route_shell.dcp) where the dynamic module has been turned into a black box. This DCP can then be used again and again as a base starting point, as it contains an implemented solution for all of the static logic, and we will use Vivado (and RapidWright in the future) to place and route additional dynamic modules on top of it.
4. Populating a Black Box
Returning to our running Vivado instance, we can close our previous project and load the shell DCP using open_checkpoint at the Tcl command prompt:
close_project
open_checkpoint kcu105_route_shell.dcp
Note
Due to the large number of constraints generated in RapidWright, opening the checkpoint might take a few minutes.
If RapidWright was able to correctly create the black box, you should see exactly one critical warning, which may show up in a dialog from Vivado as shown below:
The implemented design will look similar to the original design, except that the cells previously highlighted in yellow above will be missing:
You may also notice that several BEL sites have been marked with a PROHIBIT property that prevents any cells from being placed in those locations. Through experimentation, it has been found that cells placed in the same half SLICE as those in the existing static logic portion of the design can lead to congestion. Therefore, RapidWright adds the PROHIBIT property to the remaining BEL sites in any occupied half SLICEs to avoid this issue. These prohibited locations can be seen in the image below (the red circles with a slash):
We can also verify that the design is consistent by checking the routing status:
report_route_status
This should return a result similar to the following:
Design Route Status
                                               :      # nets :
   ------------------------------------------- : ----------- :
   # of logical nets.......................... :       65546 :
       # of nets not needing routing.......... :       23431 :
           # of internally routed nets........ :       20613 :
           # of nets with no loads............ :        2818 :
       # of routable nets..................... :       42115 :
           # of unrouted nets................. :          38 :
           # of fully routed nets............. :       42077 :
       # of nets with routing errors.......... :           0 :
   ------------------------------------------- : ----------- :
The key element to look for is that there are no nets with routing errors. Since we see that value is 0, we can proceed.
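If the shell is created repeatedly, this check can be automated. The sketch below is an illustration only: it assumes the report text has been captured to a string (for example by redirecting the command output to a file), and the function name `parse_route_status` is hypothetical. The sample text reuses the figures from the report above.

```python
import re

# Sample rows copied from the report_route_status output shown above.
SAMPLE_REPORT = """\
Design Route Status
 : # nets :
 ------------------------------------------- : ----------- :
 # of logical nets.......................... : 65546 :
 # of nets not needing routing.......... : 23431 :
 # of internally routed nets........ : 20613 :
 # of nets with no loads............ : 2818 :
 # of routable nets..................... : 42115 :
 # of unrouted nets................. : 38 :
 # of fully routed nets............. : 42077 :
 # of nets with routing errors.......... : 0 :
 ------------------------------------------- : ----------- :
"""


def parse_route_status(report):
    """Map each '# of ...' row label to its integer net count."""
    counts = {}
    for line in report.splitlines():
        # A row looks like '# of unrouted nets....... : 38 :'
        match = re.match(r"\s*# of (.+?)\.+\s*:\s*(\d+)\s*:", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_route_status(SAMPLE_REPORT)

# The tutorial's criterion: no nets with routing errors.
shell_ok = counts["nets with routing errors"] == 0
print("shell consistent:", shell_ok)
```

The parser also makes it easy to confirm that the rows add up, for instance that the unrouted and fully routed nets together equal the routable nets (38 + 42077 = 42115 in the sample).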
At this point, we want to lock down the implementation so that further place and route runs do not upset the timing closure of the design. We can do this by running the Vivado Tcl command:
lock_design -level routing
This tags the netlist, placement and routing such that place_design and route_design do not modify the netlist of the existing implementation, thus preserving the original timing closure.
To populate the black box with the synthesized, out-of-context version of the RISCV core, we can load it directly in Vivado with read_checkpoint -cell (this is different from open_checkpoint).
read_checkpoint -cell VexRiscvLitexSmpCluster_Cc4_Iw64Is8192Iy2_Dw64Ds8192Dy2_ITs4DTs4_Ldw512_Cdma_Ood/cores_1_cpu_logic_cpu riscv_1_synth.dcp
Once the dynamic module has been loaded with the synthesized RISCV core, we can implement the design and check the results:
# We need to waive a DRC due to the nature of the design
set_msg_config -id {Common 17-55} -new_severity {Warning}
set_property SEVERITY {Warning} [get_drc_checks REQP-1753]
place_design
route_design
report_route_status
report_timing
Results should look similar to:
Design Route Status
                                               :      # nets :
   ------------------------------------------- : ----------- :
   # of logical nets.......................... :       75917 :
       # of nets not needing routing.......... :       28293 :
           # of internally routed nets........ :       24553 :
           # of nets with no loads............ :        3740 :
       # of routable nets..................... :       47624 :
           # of nets with fixed routing....... :       41853 :
           # of fully routed nets............. :       47624 :
       # of nets with routing errors.......... :           0 :
   ------------------------------------------- : ----------- :
and should meet timing:
Timing Report
Slack (MET) : 0.253ns (required time - arrival time)
Source: main_crg_idelayctrl_ic_reset_reg/C
(rising edge-triggered cell FDRE clocked by main_crg_clkout1 {rise@0.000ns fall@2.500ns period=5.000ns})
Destination: IDELAYCTRL_REPLICATED_0_2/RST
(recovery check against rising-edge clock main_crg_clkout1 {rise@0.000ns fall@2.500ns period=5.000ns})
Path Group: **async_default**
Path Type: Recovery (Max at Slow Process Corner)
Requirement: 5.000ns (main_crg_clkout1 rise@5.000ns - main_crg_clkout1 rise@0.000ns)
Data Path Delay: 3.838ns (logic 0.117ns (3.048%) route 3.721ns (96.952%))
Logic Levels: 0
Clock Path Skew: -0.211ns (DCD - SCD + CPR)
Destination Clock Delay (DCD): 5.765ns = ( 10.765 - 5.000 )
Source Clock Delay (SCD): 6.087ns
Clock Pessimism Removal (CPR): 0.112ns
Clock Uncertainty: 0.065ns ((TSJ^2 + DJ^2)^1/2) / 2 + PE
Total System Jitter (TSJ): 0.071ns
Discrete Jitter (DJ): 0.108ns
Phase Error (PE): 0.000ns
Clock Net Delay (Source): 2.666ns (routing 1.174ns, distribution 1.492ns)
Clock Net Delay (Destination): 2.359ns (routing 1.078ns, distribution 1.281ns)
Location Delay type Incr(ns) Path(ns) Netlist Resource(s)
------------------------------------------------------------------- -------------------
(clock main_crg_clkout1 rise edge)
0.000 0.000 r
G10 0.000 0.000 r clk125_p (IN)
net (fo=0) 0.001 0.001 IBUFDS/I
HPIOBDIFFINBUF_X1Y59 DIFFINBUF (Prop_DIFFINBUF_HPIOBDIFFINBUF_DIFF_IN_P_O)
0.521 0.522 r IBUFDS/DIFFINBUF_INST/O
net (fo=1, routed) 0.090 0.612 IBUFDS/OUT
G10 IBUFCTRL (Prop_IBUFCTRL_HPIOB_I_O)
0.000 0.612 r IBUFDS/IBUFCTRL_INST/O
net (fo=1, routed) 0.750 1.362 IBUFDS_n_0_BUFG_inst_n_0
BUFGCE_X1Y52 BUFGCE (Prop_BUFCE_BUFGCE_I_O)
0.083 1.445 r IBUFDS_n_0_BUFG_inst/O
net (fo=9, routed) 1.687 3.132 main_crg_clkin
MMCME3_ADV_X1Y2 MMCME3_ADV (Prop_MMCME3_ADV_CLKIN1_CLKOUT1)
-0.231 2.901 r MMCME2_ADV/CLKOUT1
net (fo=1, routed) 0.437 3.338 main_crg_clkout1
BUFGCE_X1Y69 BUFGCE (Prop_BUFCE_BUFGCE_I_O)
0.083 3.421 r BUFG/O
X0Y1 (CLOCK_ROOT) net (fo=31, routed) 2.666 6.087 idelay_clk
SLICE_X0Y139 FDRE r main_crg_idelayctrl_ic_reset_reg/C
------------------------------------------------------------------- -------------------
SLICE_X0Y139 FDRE (Prop_HFF2_SLICEL_C_Q)
0.117 6.204 f main_crg_idelayctrl_ic_reset_reg/Q
net (fo=25, routed) 3.721 9.925 main_crg_idelayctrl_ic_reset
BITSLICE_CONTROL_X0Y3
IDELAYCTRL f IDELAYCTRL_REPLICATED_0_2/RST
------------------------------------------------------------------- -------------------
(clock main_crg_clkout1 rise edge)
5.000 5.000 r
G10 0.000 5.000 r clk125_p (IN)
net (fo=0) 0.001 5.001 IBUFDS/I
HPIOBDIFFINBUF_X1Y59 DIFFINBUF (Prop_DIFFINBUF_HPIOBDIFFINBUF_DIFF_IN_P_O)
0.324 5.325 r IBUFDS/DIFFINBUF_INST/O
net (fo=1, routed) 0.051 5.376 IBUFDS/OUT
G10 IBUFCTRL (Prop_IBUFCTRL_HPIOB_I_O)
0.000 5.376 r IBUFDS/IBUFCTRL_INST/O
net (fo=1, routed) 0.649 6.025 IBUFDS_n_0_BUFG_inst_n_0
BUFGCE_X1Y52 BUFGCE (Prop_BUFCE_BUFGCE_I_O)
0.075 6.100 r IBUFDS_n_0_BUFG_inst/O
net (fo=9, routed) 1.524 7.624 main_crg_clkin
MMCME3_ADV_X1Y2 MMCME3_ADV (Prop_MMCME3_ADV_CLKIN1_CLKOUT1)
0.335 7.959 r MMCME2_ADV/CLKOUT1
net (fo=1, routed) 0.372 8.331 main_crg_clkout1
BUFGCE_X1Y69 BUFGCE (Prop_BUFCE_BUFGCE_I_O)
0.075 8.406 r BUFG/O
X0Y1 (CLOCK_ROOT) net (fo=31, routed) 2.359 10.765 idelay_clk
BITSLICE_CONTROL_X0Y3
IDELAYCTRL r IDELAYCTRL_REPLICATED_0_2/REFCLK
clock pessimism 0.112 10.876
clock uncertainty -0.065 10.812
BITSLICE_CONTROL_X0Y3
IDELAYCTRL (Recov_CONTROL_BITSLICE_CONTROL_REFCLK_RST)
-0.633 10.179 IDELAYCTRL_REPLICATED_0_2
-------------------------------------------------------------------
required time 10.179
arrival time -9.925
-------------------------------------------------------------------
slack 0.253
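To make the report easier to read, the following sketch recomputes its summary lines from the values printed in it. It is only a worked arithmetic check, not a replacement for Vivado's timing analysis, and the function names are hypothetical. Because the inputs are the three-decimal figures as displayed, the recomputed slack can differ from the reported 0.253 ns in the last digit, presumably because Vivado rounds each displayed value independently.

```python
import math


def clock_uncertainty(tsj, dj, pe):
    """Formula printed in the report: ((TSJ^2 + DJ^2)^1/2) / 2 + PE."""
    return math.sqrt(tsj ** 2 + dj ** 2) / 2 + pe


def recovery_slack(scd, clk_to_q, route, dest_clk_arrival, cpr, uncertainty, recovery):
    """Return (arrival, required, slack) in ns for the recovery check."""
    # Data arrives after the source clock delay, the FDRE clock-to-Q
    # delay, and the routing delay to the IDELAYCTRL RST pin.
    arrival = scd + clk_to_q + route
    # Required time: capture clock arrival at REFCLK, plus pessimism
    # removal, minus uncertainty and the recovery requirement.
    required = dest_clk_arrival + cpr - uncertainty - recovery
    return arrival, required, required - arrival


unc = clock_uncertainty(tsj=0.071, dj=0.108, pe=0.000)
arrival, required, slack = recovery_slack(
    scd=6.087, clk_to_q=0.117, route=3.721,
    dest_clk_arrival=10.765, cpr=0.112, uncertainty=0.065, recovery=0.633,
)

print(f"uncertainty = {unc:.3f} ns")
print(f"arrival     = {arrival:.3f} ns")
print(f"required    = {required:.3f} ns")
print(f"slack       = {slack:.3f} ns")
```

Nearly all of the 3.838 ns data path delay is routing (3.721 ns), which matches the 96.952% route share given in the report header.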
The final implementation with the newly populated dynamic module highlighted in green is shown below.
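Since the shell DCP is meant to be reused again and again, it can be convenient to collect the step 4 commands into a Tcl script and run it in batch mode. The sketch below generates such a script using only the Tcl commands shown in this tutorial, plus a final write_checkpoint to save the result. The function name `populate_script` and the file names `populate.tcl` and `kcu105_populated.dcp` are hypothetical.

```python
def populate_script(shell_dcp, cell, module_dcp, out_dcp):
    """Return a Tcl script that populates the black box and implements it."""
    lines = [
        f"open_checkpoint {shell_dcp}",
        # Lock the static implementation before adding new logic.
        "lock_design -level routing",
        f"read_checkpoint -cell {cell} {module_dcp}",
        # DRC waivers copied from the tutorial commands above.
        "set_msg_config -id {Common 17-55} -new_severity {Warning}",
        "set_property SEVERITY {Warning} [get_drc_checks REQP-1753]",
        "place_design",
        "route_design",
        "report_route_status",
        "report_timing",
        # Assumption: save the populated design under a name of our choosing.
        f"write_checkpoint {out_dcp}",
    ]
    return "\n".join(lines) + "\n"


CELL = (
    "VexRiscvLitexSmpCluster_Cc4_Iw64Is8192Iy2_Dw64Ds8192Dy2_ITs4DTs4_Ldw512_Cdma_Ood"
    "/cores_1_cpu_logic_cpu"
)

script = populate_script(
    "kcu105_route_shell.dcp", CELL, "riscv_1_synth.dcp", "kcu105_populated.dcp"
)

with open("populate.tcl", "w") as handle:
    handle.write(script)
```

The generated file can then be run with `vivado -mode batch -source populate.tcl`. Swapping in a different synthesized module only requires changing the `module_dcp` argument, provided the module keeps the same interface as the black box.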
Complexity can vary widely among different designs, so not all designs may benefit from this approach. If you encounter challenges when applying this approach to your own projects, please reach out to the RapidWright team.