This build took 11 hours on an Intel Skull Canyon NUC w/ 32 GB DRAM, but I believe there are ways to speed this up going forward (incremental and/or hierarchical builds, out-of-context synthesis). For example, the XCVU9P is a 3-die (3 "super logic region", or SLR) device, and by setting up a hierarchical design flow I think I can place and route each SLR separately (at the same time) across more (x86) cores on my build box. The inter-SLR interconnect nets are just some quite regular 300-bit-wide Hoplite NOC links plus clock and reset.
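For the curious, a minimal sketch of the SLR partitioning half of such a flow, in Vivado XDC/Tcl. The pblock and cell names are made up for illustration, and a real hierarchical flow with parallel per-SLR runs would need quite a bit more setup than this:

    # One pblock per super logic region (hierarchy paths are hypothetical)
    create_pblock pblock_slr0
    resize_pblock [get_pblocks pblock_slr0] -add {SLR0}
    add_cells_to_pblock [get_pblocks pblock_slr0] [get_cells phalanx/slr0_clusters]

    create_pblock pblock_slr1
    resize_pblock [get_pblocks pblock_slr1] -add {SLR1}
    add_cells_to_pblock [get_pblocks pblock_slr1] [get_cells phalanx/slr1_clusters]

    create_pblock pblock_slr2
    resize_pblock [get_pblocks pblock_slr2] -add {SLR2}
    add_cells_to_pblock [get_pblocks pblock_slr2] [get_cells phalanx/slr2_clusters]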
Can't the tools do it relatively fast with a geometric method if the individual cores already have area/timing data to use and are homogeneous? And this is an FPGA rather than an ASIC?
Reading various papers on synthesis as a non-hardware guy made me think this job shouldn't be as hard as it is for SoCs whose components vary considerably in their individual attributes.
I wish it were so. While it is straightforward to do regular placement at the block level, or even at the individual LUT/slice level, using RPMs (relationally placed macros) or absolute LOC placement of LUTs in the XDC implementation constraints file, most of the implementation time goes into routing, and there is no easy mainstream way to take a routed one-tile design and step-and-repeat it (say) 210 times across the die. In part this is due to non-homogeneity across the columns, and sometimes rows, of the chip.
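To make the placement half of that concrete, here is roughly what absolute placement looks like in XDC (the cell names are hypothetical; RPMs themselves are built with RLOC attributes in the HDL rather than in the XDC). Note that none of this constrains routing, which is where the time goes:

    # Pin one LUT of a (hypothetical) GRVI core to an exact slice and BEL
    set_property LOC SLICE_X10Y120 [get_cells cluster0/grvi0/alu/result_lut]
    set_property BEL A6LUT         [get_cells cluster0/grvi0/alu/result_lut]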
Better for FPGA development. The FPGA on the Arty is slightly larger (28K -> 33K logic cells), and all peripherals, including memory, are connected directly to the FPGA instead of through the ARM SoC.
Also, the I/O headers are omitted from the $99 Parallella board, so it's difficult to program. (No JTAG connector.)
Aight, thanks. It felt like not having an actual processor was a drag, but I'm not in that field. I guess it means you use a separate computer to program the FPGA and then use it standalone, rather than having the FPGA as a dynamic coprocessor.
Does it mean that it's possible to do the 1680-core thing on the Arty board? If not, what makes the Xilinx board more suitable for implementing the 1680 cores?
The Xilinx board has a ton of logic slices to run the extra cores. It's like how chips with more transistors can do more stuff. The Arty is too small to hold the full design. Might run slower, too, if it's on an older process node than the Xilinx FPGA.
Both of them are Xilinx boards. One of them is just a really, really high-end Xilinx board, so it basically has enough programmable fabric to hold 1680 cores. The other is cheap and can't hold as many cores.
As noted, the Digilent Arty is $99 and hosts up to 32 cores. The XC7Z020 Zynq devices should host 80. That includes the Zedboard, the original Parallella Kickstarter edition, the forthcoming Snickerdoodle Black (?), and the Digilent Pynq, which is $65 in Q1 for students. It is my intention to put out a version of GRVI Phalanx for 7020s, at least a bitstream and SDK, perhaps more, but there is much to do. Note the 7 series devices (including the XC7A35T of the Arty and the XC7Z020) have BRAMs but not UltraRAMs, so those clusters have 4 KB instruction RAMs and 32 KB shared cluster RAMs. The 4-8 KB/128 KB clusters possible on the new UltraScale+ devices afford more breathing room for code and data per cluster.
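For rough scale, assuming the 8-core clusters described in the GRVI Phalanx paper: 1680 cores / 8 = 210 clusters on the XCVU9P (the "step and repeat ... 210 times" mentioned above), 80 / 8 = 10 clusters on an XC7Z020, and 32 / 8 = 4 clusters on the Arty's XC7A35T. Likewise, 32 KB / 8 cores = 4 KB of shared cluster RAM per core on 7 series, versus 128 KB / 8 = 16 KB per core on UltraScale+.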
"GRVI Phalanx aspires to make it easier to develop and
maintain an FPGA accelerator for a parallel software workload.
Some workloads will fit its mold, i.e. highly parallel SPMD or
MIMD code with small kernels, local shared memory, and
global message passing. Here are some parallel models that
should map fairly well to a GRVI Phalanx framework:
• OpenCL kernels: run each work group on a cluster;
• ‘Gatling gun’ parallel packet processing: send each new
packet to an idle cluster, which may exclusively work
on that packet for up to (#clusters) packet-time-periods.
• OpenMP/TBB: run MIMD tasks within a cluster;
• Streaming data through process networks: pass streams
as messages within a cluster, or between clusters;
• Compositions of such models.
Since GRVI Phalanx is implemented in an FPGA, these
and other parallel models may then be further accelerated via
custom GRVI and cluster function units; custom memories and
interconnects; and custom standalone accelerator cores on
cluster RAM or directly connected on the NOC."