* structure_tensor_accumulate: high runtime, low throughput
   * ~50% of total runtime
* built 5x5 (not 3x3) sliding window w/ horizontal sum and 4 row buffers (this dominates ASIC area)
   * verified in Verilator with a deterministic RNG
   * also verified with Verilog to an extent
* Used Verilator to compare throughput against last week's CPU benchmark:

| Resolution | Avg Runtime | Throughput |
|---|---|---|
| 256×256 | 3.918 ms | 16.7 MP/s |
| 512×512 | 17.476 ms | 15.0 MP/s |
| 1024×1024 | 62.442 ms | 16.8 MP/s |

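
The throughput column is just pixel count divided by average runtime. A minimal Python sketch that recomputes it from the benchmark numbers above (the function name `throughput_mpps` is illustrative and not taken from the benchmark code):

```python
def throughput_mpps(width: int, height: int, runtime_ms: float) -> float:
    """Megapixels per second for one frame processed in runtime_ms."""
    pixels = width * height
    return pixels / (runtime_ms * 1e-3) / 1e6


# (width, height, average runtime in ms), copied from the CPU benchmark
BENCHMARKS = [(256, 256, 3.918), (512, 512, 17.476), (1024, 1024, 62.442)]

for w, h, ms in BENCHMARKS:
    print(f"{w}x{h}: {throughput_mpps(w, h, ms):.1f} MP/s")
```

The three resolutions land within about 2 MP/s of each other, so the CPU baseline looks roughly resolution-independent at ~15 to 17 MP/s.
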
* Estimated FPGA throughput
   * 100 MHz target clock (10 ns timing constraint)
      * Worst Negative Slack (WNS): +1.840 ns
      * Estimated critical path delay: ~8.16 ns
      * Estimated Fmax: ~122 MHz
   * assuming 1 output per cycle after pipeline fill: ~122 MP/s
      * conservatively, 50 MP/s → 2.5x kernel speedup → 1/(0.5 + 0.5/2.5) ≈ 1.43x overall speedup
      * otherwise, 7.6x kernel speedup → 1/(0.5 + 0.5/7.6) ≈ 1.77x overall speedup
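
The timing and speedup arithmetic above can be sketched in a few lines of Python. This assumes the 100 MHz constraint means a 10 ns period, that the accelerated kernel is the ~50% of runtime noted earlier, and it takes the 2.5x and 7.6x kernel speedups from the notes as given. The function names are illustrative only.

```python
def fmax_mhz(clock_period_ns: float, wns_ns: float) -> float:
    """Estimated Fmax from the constraint period and worst negative slack."""
    critical_path_ns = clock_period_ns - wns_ns  # positive slack shortens the path
    return 1e3 / critical_path_ns


def amdahl(kernel_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when only kernel_fraction of the runtime is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


print(f"Fmax ~ {fmax_mhz(10.0, 1.840):.1f} MHz")
for s in (2.5, 7.6):
    print(f"kernel {s}x -> overall {amdahl(0.5, s):.2f}x")
```

With half the runtime outside the kernel, the overall speedup can never exceed 2x, no matter how fast the accelerator gets.
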


Synthesis

| Resource | Count |
|---|---|
| LUTs | ~515 |
| Registers | ~459 |
| DSP48E1 | 1 |
| BRAM | 1 |
| SRL16E | 32 |
| CARRY4 | 47 |


The register count being close to the LUT count suggests that much of the design is just moving data (and the LUT count is small relative to the board's capacity). The BRAM is probably used for the line buffers. The shift-register LUTs (SRL16E) serve the streaming pipeline, which is good. The CARRY4 usage reflects that most of the math is just addition.


Power
Idle:

| Metric | Value |
|---|---|
| Total | 72 mW |
| Dynamic | 3 mW |
| Static | 68 mW |


Active:

| Metric | Value |
|---|---|
| Total | 114 mW |
| Dynamic | 42 mW |
| Static | 72 mW |


in power report:
* DSPs go from 0.000 W to 0.001 W, so arithmetic is a tiny fraction overall
* Signals rise from <0.001 W to 0.007 W
* Slice logic rises from <0.001 W to 0.005 W
* I/O rises from <0.001 W to 0.024 W (largest single increase)
   * should look into the I/O toggle rate…
* u_accel rises from 0.003 W to 0.017 W; lb_sxx (line buffer) appears in both reports, with a much higher total when active
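
To see how the listed per-category increases compare, here is a rough Python sketch. It treats every idle value reported as "<0.001 W" as 0 mW, which is an assumption, so each increase is an upper bound. It leaves out u_accel and lb_sxx on the assumption that they are hierarchy entries overlapping these categories.

```python
# Per-category power in mW, copied from the power report bullets above.
# Idle values reported as "<0.001 W" are treated as 0 mW (assumption).
IDLE_MW = {"DSP": 0, "Signals": 0, "Slice logic": 0, "I/O": 0}
ACTIVE_MW = {"DSP": 1, "Signals": 7, "Slice logic": 5, "I/O": 24}


def increases(idle: dict, active: dict) -> dict:
    """Idle-to-active increase per category, in mW."""
    return {name: active[name] - idle[name] for name in active}


delta = increases(IDLE_MW, ACTIVE_MW)
total = sum(delta.values())

# Print categories from largest to smallest increase.
for name, mw in sorted(delta.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} +{mw:2d} mW ({100 * mw / total:.0f}% of listed increase)")
```

Under these assumptions I/O accounts for roughly two thirds of the listed increase, which is why the toggle rate is worth checking.
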


Worth noting: the power report shows only a 23% netlist match, at medium confidence, so there is room for improvement here.
Dynamic power increase from idle to active: ~39 mW (3 mW → 42 mW)
* suggests dynamic power increase came mostly from:
   * line-buffer activity
   * register switching
   * streaming state updates


ASIC stuff (yosys)

| Module | Area |
|---|---|
| box_filter_h5 | ~675 |
| box_filter_v5 | ~722 |
| line_buffer_5 | ~25,374 |
| tensor_accel total | ~142,841 |

Mapped against Nangate45.
dominant hardware cost comes from line buffers!


Overall, the cost of data movement and buffering dominated, but the design demonstrates high sustained throughput from a fully streaming, one-pixel-per-cycle architecture. It could be improved further by going beyond 1 px/cycle and by reusing arithmetic units.
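
For reference, the streaming 5x5 window (horizontal sum followed by 4 row buffers) can be modeled in software roughly as below. This is a sketch under assumptions, not the RTL: it computes a plain box sum on a single channel, emits only fully valid windows (the real border handling is not described in these notes), and uses hypothetical names such as `stream_box5`.

```python
import random
from collections import deque


def stream_box5(image):
    """Consume pixels in raster order; yield (row, col, sum) per full 5x5 window.

    (row, col) is the bottom-right corner of the window.
    """
    width = len(image[0])
    # Four row buffers, each holding the horizontal sums of one previous row.
    row_buffers = [[0] * width for _ in range(4)]
    for r, row in enumerate(image):
        taps = deque([0] * 5, maxlen=5)  # horizontal 5-tap shift register
        for c, pixel in enumerate(row):
            taps.append(pixel)
            hsum = sum(taps)
            # Vertical sum: current horizontal sum plus the four buffered rows.
            vsum = hsum + sum(buf[c] for buf in row_buffers)
            # Shift this column through the row buffers (oldest row drops out).
            for i in range(3, 0, -1):
                row_buffers[i][c] = row_buffers[i - 1][c]
            row_buffers[0][c] = hsum
            if r >= 4 and c >= 4:
                yield r, c, vsum


def brute_box5(image, r, c):
    """Direct 25-pixel sum, used only as a reference."""
    return sum(image[y][x] for y in range(r - 4, r + 1) for x in range(c - 4, c + 1))


rng = random.Random(0)  # deterministic RNG, in the spirit of the Verilator tests
img = [[rng.randrange(256) for _ in range(12)] for _ in range(9)]
assert all(s == brute_box5(img, r, c) for r, c, s in stream_box5(img))
```

Each pixel is read once and every output needs only the 5-tap register plus four rows of stored sums, which is consistent with the line buffers being the dominant cost.
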