This build took 11 hours on an Intel Skull Canyon NUC w/ 32 GB DRAM, but I believe there are ways to speed this up going forward (incremental and/or hierarchical builds, out-of-context synthesis). For example, the XCVU9P is a 3-die (3 "super logic region", or SLR) device, and by setting up a hierarchical design flow I think I can place and route each SLR separately (at the same time) across more (x86) cores on my build box. The inter-SLR interconnect nets are just some quite regular 300-bit-wide Hoplite NOC links plus clock and reset.
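For the curious, a minimal sketch of the SLR partitioning half of such a flow, in Vivado XDC/Tcl. The pblock and cell names are made up for illustration, and a real hierarchical flow with parallel per-SLR runs would need quite a bit more setup than this:

    # One pblock per super logic region (hierarchy paths are hypothetical)
    create_pblock pblock_slr0
    resize_pblock [get_pblocks pblock_slr0] -add {SLR0}
    add_cells_to_pblock [get_pblocks pblock_slr0] [get_cells phalanx/slr0_clusters]

    create_pblock pblock_slr1
    resize_pblock [get_pblocks pblock_slr1] -add {SLR1}
    add_cells_to_pblock [get_pblocks pblock_slr1] [get_cells phalanx/slr1_clusters]

    create_pblock pblock_slr2
    resize_pblock [get_pblocks pblock_slr2] -add {SLR2}
    add_cells_to_pblock [get_pblocks pblock_slr2] [get_cells phalanx/slr2_clusters]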
Can't the tools do it relatively fast with a geometric method if the individual cores already have area/timing data to use and are homogeneous? And this is an FPGA rather than an ASIC?
Reading various papers on synthesis as a non-hardware guy made me think this job shouldn't be as hard as it is for SoCs whose components vary considerably in their individual attributes.
I wish it were so. While it is straightforward to do regular placement at the block level, or even at the individual LUT/slice level, using RPMs (relationally placed macros) or absolute LOC placement of LUTs in the XDC implementation constraints file, most of the implementation time goes into routing, and there is no easy mainstream way to take a routed one-tile design and step-and-repeat it (say) 210 times across the die. In part this is due to non-homogeneity across the columns, and sometimes rows, of the chip.
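To make the placement half of that concrete, here is roughly what absolute placement looks like in XDC (the cell names are hypothetical; RPMs themselves are built with RLOC attributes in the HDL rather than in the XDC). Note that none of this constrains routing, which is where the time goes:

    # Pin one LUT of a (hypothetical) GRVI core to an exact slice and BEL
    set_property LOC SLICE_X10Y120 [get_cells cluster0/grvi0/alu/result_lut]
    set_property BEL A6LUT         [get_cells cluster0/grvi0/alu/result_lut]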
Better for FPGA development. The FPGA on the Arty is slightly larger (28K -> 33K logic cells), and all peripherals, including memory, are connected directly to the FPGA instead of through the ARM SoC.
Also, the I/O headers are omitted from the $99 Parallella board, so it's difficult to program. (No JTAG connector.)
Aight, thanks. It felt like not having an actual processor was a drag, but I'm not in that field. I guess it means you use a separate computer to program the FPGA and then use it standalone, rather than having the FPGA as a dynamic coprocessor.
Does it mean that it's possible to do the 1680-core thing on the Arty board? If not, what makes the Xilinx board more suitable for implementing the 1680 cores?
The Xilinx board has a ton of logic slices to run the extra cores. It's like how chips with more transistors can do more stuff. The Arty is too small to hold the full design. Might run slower, too, if it's on an older process node than the Xilinx FPGA.
Both of them are Xilinx boards. One of them is just a really, really high-end Xilinx board, so it basically has enough programmable fabric to hold 1680 cores. The other is cheap and can't hold as many cores.
As noted, the Digilent Arty is $99 and hosts up to 32 cores. The XC7Z020 Zynq devices should host 80. That includes the Zedboard, the original Parallella Kickstarter edition, the forthcoming Snickerdoodle Black (?), and the Digilent Pynq, which is $65 in Q1 for students. It is my intention to put out a version of GRVI Phalanx for 7020s, at least a bitstream and SDK, perhaps more, but there is much to do. Note the 7 series devices (including the XC7A35T of the Arty and the XC7Z020) have BRAMs but not UltraRAMs, so those clusters have 4 KB instruction RAMs and 32 KB shared cluster RAMs. The 4-8 KB/128 KB clusters possible on the new UltraScale+ devices afford more breathing room for code and data per cluster.
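For rough scale, assuming the 8-core clusters described in the GRVI Phalanx paper: 1680 cores / 8 = 210 clusters on the XCVU9P (the "step and repeat ... 210 times" mentioned above), 80 / 8 = 10 clusters on an XC7Z020, and 32 / 8 = 4 clusters on the Arty's XC7A35T. Likewise, 32 KB / 8 cores = 4 KB of shared cluster RAM per core on 7 series, versus 128 KB / 8 = 16 KB per core on UltraScale+.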
"GRVI Phalanx aspires to make it easier to develop and
maintain an FPGA accelerator for a parallel software workload.
Some workloads will fit its mold, i.e. highly parallel SPMD or
MIMD code with small kernels, local shared memory, and
global message passing. Here are some parallel models that
should map fairly well to a GRVI Phalanx framework:
• OpenCL kernels: run each work group on a cluster;
• ‘Gatling gun’ parallel packet processing: send each new
packet to an idle cluster, which may exclusively work
on that packet for up to (#clusters) packet-time-periods.
• OpenMP/TBB: run MIMD tasks within a cluster;
• Streaming data through process networks: pass streams
as messages within a cluster, or between clusters;
• Compositions of such models.
Since GRVI Phalanx is implemented in an FPGA, these
and other parallel models may then be further accelerated via
custom GRVI and cluster function units; custom memories and
interconnects; and custom standalone accelerator cores on
cluster RAM or directly connected on the NOC."