Project overview
1. Goals & Scope
PyAsc2 is a tile-based programming model for writing Ascend NPU kernels in Python. It sits at a higher abstraction level than the original PyAsc (asc) model, which mapped Python APIs 1:1 to Ascend C intrinsics.
Core goals:
- Let developers express kernels in terms of tensors (ND-arrays in global memory) and tiles (fixed-shape chunks in on-chip memory), without managing buffer addresses, TPipe/TQue lifecycles, or synchronization barriers directly.
- Provide a NumPy-like operation set: arithmetic, reductions, shape manipulation, masking, atomics.
- Automate synchronization insertion, UB memory allocation, and loop unrolling through compiler passes.
- Support the Ascend 910B, Ascend 910_93, and Ascend 910_95 hardware families.
Relationship to PyAsc1 (asc):
PyAsc2 is implemented on top of the existing PyAsc infrastructure. The asc2.jit decorator re-uses JITFunction, the same AST-to-IR codegen, the same MLIR pass manager, and the same Bisheng compilation step. What is new is the asctile MLIR dialect and a dedicated lowering pipeline that converts tile-level IR down to the ascendc dialect before the existing backend takes over.
Out of scope:
- Direct access to Ascend C intrinsics (use asc for that).
- Dynamic tile shapes — tile shapes must be statically known at JIT time; tensor shapes may be dynamic (runtime values).
2. Key Challenges
NOTE: This section is not finished.
2.1 Programming Model
Defining a tile-based Python API that is expressive enough for real kernels while remaining statically compilable. The API must hide TPipe/TQue management and synchronization from the user, yet map cleanly to Ascend C intrinsics after lowering. Key open questions: ergonomics of multi-dimensional tiling, handling of mixed UB/L0 tile locations, and interoperability with asc (PyAsc1) primitives.
2.2 Sync Insertion
Automatic insertion of set_flag/wait_flag barriers between pipeline stages at the ascendc level. The challenge is correctly inferring producer/consumer relationships across loop iterations after tile-level unrolling, without inserting redundant or missing syncs. The InsertBufIdSyncV2 algorithm addresses this for 910_95, but correctness across all hardware variants and unroll patterns is still being validated.
2.3 Ping-Pong Optimization
Enabling double-buffering (ping-pong) at the tile level so that loads for the next iteration overlap with computation on the current tile. This requires the compiler to automatically split the loop body, hoist loads outside the dependency chain, and insert the correct sync barriers — without the user annotating it explicitly.
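The intended overlap can be pictured in plain Python. This is a sketch of the scheduling idea only (not the compiler's actual loop transformation): two buffers alternate, and the load for iteration i+1 is issued before the compute on iteration i, so the two can run on different pipeline units concurrently.

```python
# Minimal ping-pong (double-buffering) schedule sketch. Illustrative only:
# it shows the ordering the optimization aims for, not real compiler output.
def pingpong_schedule(n_iters):
    """Return a list of (event, iteration, buffer) tuples."""
    schedule = [("load", 0, 0)]          # prologue: load iter 0 into buffer 0
    for i in range(n_iters):
        if i + 1 < n_iters:
            # Load for the *next* iteration targets the other buffer,
            # so it can overlap with compute on the current tile.
            schedule.append(("load", i + 1, (i + 1) % 2))
        schedule.append(("compute", i, i % 2))
    return schedule

sched = pingpong_schedule(3)
```

Note that every load for iteration i+1 is emitted before the compute on iteration i, and never targets the buffer that compute is still reading.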
2.4 UB Memory Allocation and Reuse
Mapping tile SSA values (which may be numerous after unrolling) to a finite on-chip UB region. Challenges include: computing liveness across unrolled iterations, reusing freed regions for new tiles of compatible size (ReuseUBAllocation), and ensuring the total live footprint does not exceed hardware UB limits (enforced by ComputeMemoryConsumption).
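The reuse idea can be sketched as a toy free-list allocator. This is purely illustrative (the real ReuseUBAllocation pass reasons about MLIR liveness, not Python objects, and "compatible size" may be more permissive than exact equality, which is an assumption here):

```python
# Toy free-list allocator illustrating UB region reuse: when a tile's
# lifetime ends, its region can be handed to a later tile of equal size
# instead of growing the footprint. Names and policy are illustrative.
class ToyUBAllocator:
    def __init__(self):
        self.next_offset = 0
        self.free_by_size = {}  # size -> list of freed offsets

    def alloc(self, size):
        freed = self.free_by_size.get(size)
        if freed:                      # reuse a dead tile's region
            return freed.pop()
        off = self.next_offset         # otherwise bump-allocate fresh space
        self.next_offset += size
        return off

    def free(self, offset, size):
        self.free_by_size.setdefault(size, []).append(offset)

a = ToyUBAllocator()
t0 = a.alloc(512)   # fresh region at offset 0
t1 = a.alloc(512)   # fresh region at offset 512
a.free(t0, 512)     # t0's lifetime ends
t2 = a.alloc(512)   # reuses offset 0; footprint stays at 1024 bytes
```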
3. Architecture Overview
┌──────────────────────────────────────────────────────────┐
│ User Python Code │
│ @asc2.jit │
└──────────────────────┬───────────────────────────────────┘
│
┌──────────────────────▼───────────────────────────────────┐
│ PyAsc2 Frontend (python/asc2/) │
│ │
│ Tensor (GM) Tile (UB/L0/…) asc2.range asc2.mask │
│ load / store arithmetic reductions shape ops │
│ atomics matmul unary creation indexing │
└──────────────────────┬───────────────────────────────────┘
│ FunctionVisitor (AST → IR)
┌──────────────────────▼───────────────────────────────────┐
│ asctile MLIR Dialect │
│ TensorType / TileType LoadOp / StoreOp │
│ BinaryOps UnaryOps ReductionOps ShapeOps │
│ AtomicRMWOp MatmulOp SoftmaxOp SelectOp │
│ CountMaskOp BitwiseMaskOp │
└──────────────────────┬───────────────────────────────────┘
│ AscTile passes (lib/Dialect/AscTile/)
│ + AscLower passes (lib/Dialect/AscLower/)
┌──────────────────────▼───────────────────────────────────┐
│ ascendc MLIR Dialect (existing backend) │
│ LocalTensor / GlobalTensor TPipe / TQue │
│ set_flag / wait_flag Ascend C API ops │
└──────────────────────┬───────────────────────────────────┘
│ lib/Target/AscendC/ (CodeEmitter)
┌──────────────────────▼───────────────────────────────────┐
│ Ascend C source (.cce) │
└──────────────────────┬───────────────────────────────────┘
│ Bisheng compiler
▼
NPU kernel binary (.o)
End-to-end compilation flow
@asc2.jit kernel invocation
→ argument type specialization (same as asc1)
→ two-level cache lookup (same as asc1)
→ [miss]
AST walk → asctile MLIR (FunctionVisitor + asctile builder APIs)
↓
AscTile passes (unrolling, loop transforms, math specialization)
↓
AscLower passes (lower asctile → ascendc dialect)
↓
ascendc passes (UB allocation, sync insertion, boilerplate gen)
↓
CodeEmitter (ascendc → Ascend C source)
↓
Bisheng (Ascend C → .o binary)
↓
cache store
→ kernel launch via runtime rt library calls (same as asc1)
4. Programming Model
4.1 The two types: Tensor and Tile
| | Tensor | Tile |
|---|---|---|
| Memory | Global Memory (HBM, GM) | On-chip: UB, L0A, L0B, L0C, L1 |
| Shape | ND, may be dynamic (RuntimeInt dims) | ND, must be static (known at JIT time) |
| Creation | `asc2.tensor(ptr, shape)` | Returned by `load` and tile operations |
| Mutability | Passed to/from kernel; never computed in-kernel | Value semantics; produced by ops, consumed once |
| IR type | `TensorType` | `TileType` |
Tensor is a descriptor — it holds a pointer and a shape. Tile carries ValueSemantics in MLIR, meaning instances are value types — self-contained and safe to CSE, copy, and move. Each op produces a new tile SSA value. The compiler is free to map multiple logical tiles to the same physical UB region.
4.2 Memory hierarchy (TileLocation)
| Enum value | Hardware unit | Typical use |
|---|---|---|
| `UB` | Unified Buffer | Vector operations (default) |
| `L0A` / `L0B` | L0 matrix input buffers | Matmul inputs |
| `L0C` | L0 matrix output buffer | Matmul accumulator |
| `L1` | L1 cache | Intermediate staging |
load defaults to TileLocation.UB. Matmul implicitly creates L0A/L0B/L0C tiles; the LowerToL0 pass handles the layout conversion.
4.3 Load and store
```python
# Load a tile from a tensor — two addressing modes:
tile = asc2.load(tensor, shape=[128], offsets=[base])        # explicit byte offsets
tile = asc2.load(tensor, shape=[128], tile_id=[block_idx])   # tile_id * shape = offset

# Load a scalar (no shape → scalar load)
scalar = asc2.load(tensor, offsets=[i])

# Store tile or scalar back
asc2.store(tile, tensor, offsets=[base])
asc2.store(scalar_value, tensor, offsets=[i])
```
The last dimension of every tile shape must be aligned to ub_block_size (32 bytes). This is enforced at JIT time via check_data_alignment.
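The alignment rule can be expressed as a one-line predicate. The helper name below is illustrative (the real check lives in check_data_alignment):

```python
# Sketch of the 32-byte last-dimension alignment rule enforced at JIT time.
# The function name is illustrative, not PyAsc2's actual API.
UB_BLOCK_SIZE = 32  # bytes per UB block

def last_dim_is_aligned(shape, dtype_bytes):
    """True if the tile's last dimension spans a whole number of UB blocks."""
    return (shape[-1] * dtype_bytes) % UB_BLOCK_SIZE == 0

assert last_dim_is_aligned([128], 4)       # 128 x fp32 = 512 B: aligned
assert not last_dim_is_aligned([100], 4)   # 400 B: not a multiple of 32
```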
4.4 Operations
All operations produce new Tile or scalar values; no mutation.
Arithmetic / comparison (tile ⊕ tile, tile ⊕ scalar, scalar ⊕ tile):
add, sub, mul, div, maximum, minimum, left_shift, right_shift,
equal, not_equal, greater, greater_equal, less, less_equal
Operator overloads on Tile (+, -, *, /, >, ==, …) call the same functions. Dtypes are promoted automatically via infer_common_dtype.
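As a mental model, the promotion is broadly NumPy-like. This is an analogy only: the actual rules are whatever infer_common_dtype implements, and they may differ from NumPy in detail.

```python
import numpy as np

# Mental model only: mixed-dtype operands are promoted to a common dtype,
# much like NumPy's result_type. PyAsc2's real rules live in
# infer_common_dtype and are not guaranteed to match NumPy exactly.
assert np.result_type(np.float16, np.float32) == np.float32
assert np.result_type(np.int8, np.int32) == np.int32
```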
Unary:
abs, ceil, floor, negative, relu,
sqrt, rsqrt, exp, exp2, log, log2,
sin, cos, tan, sinh, cosh, tanh, erf, softmax
Reductions (one or more axes, optional keep_dims):
reduce_sum, reduce_max, reduce_min, reduce_prod
— with no axes: returns a scalar PlainValue.
— also bound as methods: tile.sum(), tile.max(), etc.
Shape manipulation:
reshape, broadcast_to, expand_dims, squeeze
Creation:
full(shape, value), zeros(shape), full_like, zeros_like
Matrix multiply:
matmul(a, b) — 2-D float tiles; result is always float32.
Conditional / masking:
- where(mask, src0, src1) — element-wise select.
- mask(count=…) / mask(bits=(hi, lo), other=…) — context manager that wraps the enclosed operations in a conditional region (CountMaskOp / BitwiseMaskOp).
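The select semantics can be pictured with NumPy (a reference analogy only; asc2.where operates on tiles, not NumPy arrays):

```python
import numpy as np

# Reference semantics of where(mask, src0, src1): each output element
# comes from src0 where the mask is true, and from src1 elsewhere.
mask = np.array([True, False, True, False])
src0 = np.array([1, 2, 3, 4])
src1 = np.array([10, 20, 30, 40])
out = np.where(mask, src0, src1)   # [1, 20, 3, 40]
```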
Atomics:
atomic_add, atomic_max, atomic_min — write back to a global Tensor.
4.5 Programming model operations
```python
i = asc2.block_idx()    # current NPU block index (PlainValue)
n = asc2.block_num()    # total number of blocks (PlainValue)
k = asc2.num_tiles(tensor, axis=0, shape=[128])   # how many tiles fit along an axis
```
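A plausible reading of num_tiles, reduced to one dimension. Floor division is an assumption here; how a partial tail tile is counted is not specified in this document.

```python
# Assumed reference for num_tiles along one axis: how many whole tiles of
# length tile_len fit in an axis of length axis_len. Floor division is an
# assumption; ragged-tail handling is not specified by this document.
def num_tiles_1d(axis_len, tile_len):
    return axis_len // tile_len

assert num_tiles_1d(1024, 128) == 8
```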
4.6 asc2.range
asc2.range extends the base range with two attributes that control compiler behaviour:
```python
for i in asc2.range(start, stop, step, unroll_factor=4, parallel=False):
    ...
```
- unroll_factor — how many iterations to unroll. Tagged on the MLIR for op and processed by the TagUnrollGroups + UnrollLoop passes.
- parallel=True — marks the loop body so the ParallelLoadStore pass can emit stores and loads in parallel.
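What unrolling by a factor of 4 does can be shown in plain Python. This illustrates the transformation's effect on a scalar reduction, not UnrollLoop's implementation:

```python
# Unrolling in spirit: the loop body is cloned 4x per iteration of the
# main loop, with a remainder loop for the leftover iterations.
# Illustration of the transformation only, not compiler code.
def sum_unrolled(xs, unroll=4):
    total, n = 0, len(xs)
    main = n - n % unroll
    for i in range(0, main, unroll):   # unrolled main loop: 4 bodies per trip
        total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
    for i in range(main, n):           # remainder iterations, one at a time
        total += xs[i]
    return total

assert sum_unrolled(list(range(10))) == 45
```

Fewer loop-control operations per element is the point; on the NPU the cloned bodies additionally expose independent loads and stores that later passes can schedule in parallel.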
5. IR Design
5.1 asctile dialect
Defined in include/ascir/Dialect/AscTile/IR/ using TableGen.
Types:
| Type | Description | Python proxy |
|---|---|---|
| `TensorType` | ND global descriptor | `Tensor` |
| `TileType` | Fixed-shape local chunk | `Tile` |
Tile carries ValueSemantics, meaning instances are value types — self-contained and safe to CSE, copy, and move. Both Tile and Tensor implement ShapedTypeInterface.
Key operations (from Ops.td):
| Op | Description |
|---|---|
| | Create a tensor descriptor |
| | Read a dynamic dimension from a tensor |
| | Fill a tile with a scalar constant |
| `LoadOp` | Load a tile from a tensor |
| `StoreOp` | Store a tile back to a tensor |
| | Scalar load from a tensor |
| | Scalar store to a tensor |
| | Type conversion between tile dtypes |
| | Reshape a tile |
| | Broadcast a tile to a wider shape |
| `SelectOp` | Element-wise conditional select |
| | Fused ReLU (single op, not composed) |
| `SoftmaxOp` | Full softmax on a tile |
| `MatmulOp` | 2-D matrix multiply |
| `AtomicRMWOp` | Atomic read-modify-write (add / max / min) |
| `CountMaskOp` | Region: conditional on element count |
| `BitwiseMaskOp` | Region: conditional on bitmask |
| | Region terminator |
Standard dialect ops (arith, scf, math) are emitted directly by the AST walker for scalar arithmetic and control flow; the AscLower passes later handle them.
5.2 IR generation
FunctionVisitor (Python AST walker, python/asc/codegen/function_visitor.py) calls pybind11-exposed builder methods from python/src/OpBuilder.cpp to create asctile.* ops. Every asc2.* Python API call produces exactly one (or a small fixed number of) asctile operations. No analysis is needed at this stage.
5.3 Lowering chain
asctile dialect
↓ AscTile passes (loop unrolling, math specialization, masking)
↓ AscLower passes (lower each asctile op → ascendc ops + scf + arith)
ascendc dialect
↓ ascendc passes (UB allocation, sync insertion, boilerplate)
Ascend C source
6. Optimization Passes
The full pass pipeline is scheduled by Compiler._schedule_passes() inside python/asc/runtime/compiler.py. The run_asc2_passes=True flag (set automatically by asc2.jit) activates the AscTile and AscLower phases.
6.1 AscTile passes (lib/Dialect/AscTile/Transforms/)
These operate on the asctile dialect before lowering begins.
| Pass | Purpose |
|---|---|
| `TagUnrollGroups` | Scan loops carrying an `unroll_factor` and tag their unroll groups |
| | Hoist pure (side-effect-free) ops — e.g., shape computations — out of loop bodies |
| | Cluster load/store ops within an unroll group so they appear adjacent, enabling more efficient pipeline scheduling |
| | Specialise generic math ops |
| `UnrollLoop` | Physically unroll loops tagged by `TagUnrollGroups` |
| | Standard MLIR cleanup between stages |
6.2 AscLower passes (lib/Dialect/AscLower/)
These lower asctile.* operations into ascendc.* + standard MLIR ops.
| Pass | Purpose |
|---|---|
| | Replace … |
| | Widen boolean (i1) tiles to … |
| | Lower … |
| | Lower tile-scalar and scalar-tile arithmetic |
| | Handle boolean arithmetic special cases |
| | Lower … |
| | Main lowering: … |
| | Lower remaining … |
| | Lower … |
| | Materialise any pending unrealized casts from the lowering |
| | Lower … |
| | Fill in default optional operands required by … |
6.3 ascendc passes (after lowering, lib/Dialect/Asc/Transforms/)
Once all asctile ops are gone, the existing ascendc pipeline takes over:
| Phase | Key passes |
|---|---|
| Memory allocation | `HoistUBAllocation`, `ReuseUBAllocation`, `ComputeMemoryConsumption` |
| Synchronization | `InsertSync`, `InsertBufIdSync`, `FuseBufIdSync` |
| Optimization | … |
| Postprocessing | … |
ComputeMemoryConsumption is only active in asc2 mode and records per-TPosition UB usage as a module attribute; it raises a compile-time error on overflow.
7. Memory & Resource Model
7.1 Tile-level view
From the user’s perspective, memory management is invisible. The user:
1. Creates `Tensor` descriptors pointing to HBM buffers.
2. Calls `load(tensor, shape, offsets=…)` to bring data into a `Tile`.
3. Applies operations to produce new tiles.
4. Calls `store(tile, tensor, offsets=…)` to write results back.
The compiler is responsible for mapping each Tile SSA value to a physical UB region.
7.2 Compiler-side UB allocation
Two strategies, selected per compilation:
| Strategy | Option | How it works |
|---|---|---|
| TPipe-managed (default) | | |
| Static allocation | | |
UB pressure reduction (both modes):
- HoistUBAllocation — move allocations above loops so one allocation covers all iterations.
- ReuseUBAllocation (reuse_ub=True) — when a tile’s lifetime ends before a new one is needed, reuse its UB region. reuse_ub_in_out=True extends this to input/output tiles (experimental).
- ComputeMemoryConsumption — sums all live tile sizes per TPosition and fails the compilation if UB limits are exceeded.
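The footprint check is simple in spirit. A toy version follows; the UB budget below is an assumed number for the sketch, not a documented hardware constant, and the real pass sums sizes per TPosition inside MLIR:

```python
# Toy version of the ComputeMemoryConsumption idea: sum the sizes of all
# simultaneously-live tiles and fail compilation on overflow. The budget
# value and function name are illustrative assumptions.
UB_BYTES = 192 * 1024  # assumed UB capacity for this sketch

def check_footprint(live_tile_bytes, budget=UB_BYTES):
    total = sum(live_tile_bytes)
    if total > budget:
        raise MemoryError(f"UB overflow: {total} B > {budget} B")
    return total

assert check_footprint([64 * 1024, 64 * 1024]) == 128 * 1024
```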
7.3 Synchronization
Because asc2 users don’t write set_flag/wait_flag, all synchronization is inserted by the compiler:
- insert_sync=True is set automatically by asc2.jit.
- On 910B / 910_93: InsertSync uses the classic queue-position–based algorithm.
- On 910_95: InsertBufIdSync uses a newer algorithm that reduces sync overhead by tracking buffer IDs rather than queue positions; FuseBufIdSync merges adjacent sync ops.
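The producer/consumer principle behind automatic sync insertion can be shown as a toy walk over an event list. The pipe names below (MTE2 for loads, V for vector compute, MTE3 for stores) follow Ascend conventions, but the algorithm itself is a deliberately simplified illustration, not InsertSync or InsertBufIdSync:

```python
# Toy sync insertion: emit a sync whenever a buffer's consumer runs on a
# different pipe than its last producer. Illustrative only; the real
# passes handle loops, unrolling, and flag-ID assignment.
def insert_syncs(events):
    last_pipe = {}   # buffer -> pipe that last touched it
    out = []
    for pipe, buf in events:
        prev = last_pipe.get(buf)
        if prev is not None and prev != pipe:
            out.append(("sync", prev, pipe))   # consumer waits on producer pipe
        out.append((pipe, buf))
        last_pipe[buf] = pipe
    return out

# load t0, compute on t0, compute t1, store t1
prog = [("MTE2", "t0"), ("V", "t0"), ("V", "t1"), ("MTE3", "t1")]
synced = insert_syncs(prog)
```

A sync appears exactly where data crosses a pipe boundary (load to compute, compute to store) and nowhere else, which is the redundancy-avoidance property section 2.2 describes.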
8. Usage Examples
Vector addition
```python
import asc2

@asc2.jit
def vadd(x_ptr, y_ptr, out_ptr, size: int, TILE: asc.ConstExpr[int]):
    x_gm = asc2.tensor(x_ptr, [size])
    y_gm = asc2.tensor(y_ptr, [size])
    out_gm = asc2.tensor(out_ptr, [size])
    tiles_per_block = asc2.num_tiles(x_gm, 0, [TILE])
    base = asc2.block_idx() * tiles_per_block * TILE
    for i in asc2.range(tiles_per_block):
        off = base + i * TILE
        x = asc2.load(x_gm, [TILE], offsets=[off])
        y = asc2.load(y_gm, [TILE], offsets=[off])
        asc2.store(x + y, out_gm, offsets=[off])

vadd[8](x, y, out, n, TILE=256)
```
Softmax (row-wise)
```python
@asc2.jit
def softmax(x_ptr, out_ptr, rows: int, cols: int, TILE: asc.ConstExpr[int]):
    x_gm = asc2.tensor(x_ptr, [rows, cols])
    out_gm = asc2.tensor(out_ptr, [rows, cols])
    for row in asc2.range(asc2.block_idx(), rows, asc2.block_num()):
        x = asc2.load(x_gm, [1, TILE], offsets=[row, 0])
        exp_x = asc2.exp(x - x.max())
        asc2.store(exp_x / exp_x.sum(), out_gm, offsets=[row, 0])
```
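A NumPy reference for what each row of the kernel computes (an analogy only; in the kernel, x.max() and exp_x.sum() reduce over the loaded tile). Subtracting the row maximum before exponentiating is the standard numerical-stability trick the kernel also uses:

```python
import numpy as np

# Row-wise softmax, matching the kernel's per-row math:
# shift by the row max, exponentiate, normalize by the row sum.
def softmax_rows(x):
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0]])
p = softmax_rows(x)   # each row sums to 1
```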
Appendix: Key CompileOptions for PyAsc2
| Option | Default in asc2.jit | Effect |
|---|---|---|
| `run_asc2_passes` | `True` | Enable AscTile + AscLower pipeline |
| `insert_sync` | `True` | Auto-insert sync barriers |
| | | Static vs TPipe-managed UB allocation |
| `reuse_ub` | | Reuse freed UB regions |
| `reuse_ub_in_out` | | Extend reuse to I/O tiles (experimental) |
| | | Densify load/store groups (experimental) |
| | | Bisheng … |
| | | Bypass cache, recompile every call |