The list and helpers track renderer-group samplers and arrays — values that
GPU instancing forces every instance in a draw call to share, *not* values
that go into the per-instance UBO. The original `_instanceBatchFields` /
`_updateRendererInstanceBatchFields` / `_matchesRendererInstanceBatch` read
as "fields packed into the instance batch", which is the opposite of what
they store. Rename to `_batchSharedFields` etc. so the names match the
"must be shared across the batch" semantics, and tighten the field to
`private` (only used by helpers on the same class).
`_setPropertyValue` was carrying ~25 lines of `_instanceBatchFields` maintenance
inline. Lift it into a dedicated `_updateRendererInstanceBatchFields` helper so
the main setter reads as "convert/validate property → maintain batch index →
store value" and the sorted-insert / unsorted-remove details stay localized.
GPU instancing requires every instance in a draw call to share the same sampler
bindings and array uniforms: GLSL forbids samplers inside uniform blocks, and the
per-instance UBO layout does not support array uniforms.
`MeshRenderer._canBatch` previously ignored this and would group renderers with
different `renderer_*` textures or array uniforms into one instanced draw, where
the leader's bindings silently won and the followers rendered with the wrong
data. The matching instanced path in `RenderQueue.render` also skipped uploading
the renderer-block plain uniforms entirely, so even a single-renderer batch had
its samplers/arrays unbound.
Track each renderer-group sampler/array property as it's set and keep the ids in
a sorted list on `ShaderData`. `_canBatch` now compares those lists index by
index — equal lists mean every instance agrees on what to bind, so the batch is
safe; any divergence falls back to separate draw calls. `RenderQueue.render`
uploads the renderer block on the instanced path too, so the leader's plain
uniforms reach the GPU.
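
A minimal sketch of the sorted-id bookkeeping described above. `SharedFieldList` and its method names are hypothetical, not the engine's actual API, and the sketch only compares property ids; whatever comparison of the bound values the real check performs is not shown here.

```typescript
// Hypothetical stand-in for the sorted id list kept on ShaderData.
class SharedFieldList {
  readonly ids: number[] = [];

  /** Insert `id` at its sorted position; a no-op if it is already tracked. */
  add(id: number): void {
    const index = this.lowerBound(id);
    if (this.ids[index] !== id) this.ids.splice(index, 0, id);
  }

  /** Remove `id` if present, keeping the remaining ids sorted. */
  remove(id: number): void {
    const index = this.lowerBound(id);
    if (this.ids[index] === id) this.ids.splice(index, 1);
  }

  /** Index-by-index comparison; valid because both lists are sorted. */
  matches(other: SharedFieldList): boolean {
    const a = this.ids;
    const b = other.ids;
    if (a.length !== b.length) return false;
    for (let i = 0; i < a.length; i++) {
      if (a[i] !== b[i]) return false;
    }
    return true;
  }

  /** Binary search for the first index whose id is >= `id`. */
  private lowerBound(id: number): number {
    let lo = 0;
    let hi = this.ids.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ids[mid] < id) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }
}
```

Keeping the list sorted is what makes the linear index-by-index comparison sufficient: two renderers that set the same properties in a different order still produce identical lists.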
`UIBatchSorter` used `Math.max(refX, refY)` as `canvasLongestEdge` to derive
the grid cell size. `_referenceResolution` is a screen-space concept and
has no relation to world-space canvas-local coordinates, so on world-space
canvases (and on screen-space canvases whose pivot pushes elements into the
negative half) the grid degenerated: elements collapsed onto edge cells or
stretched beyond the grid, both inflating the per-cell occupancy that the
spatial hash was meant to keep flat.
Walk the elements once at the start of `sort`, accumulating the union
AABB while we already do the world-to-canvas-local transform, then use
`unionRange / gridDim` for the grid footprint. The grid now auto-fits
the actual element distribution regardless of canvas mode, pivot, or
world scale, and the accumulation rides on the existing transform pass
so the added cost is just a few min/max comparisons per element.
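
A rough sketch of the union-AABB fit, under stated assumptions: `fitGrid`, `LocalRect`, and `GridFit` are hypothetical names, the rects are taken to be already in canvas-local space, and using the longest union edge for `unionRange` is a guess at what the sorter does.

```typescript
interface LocalRect { minX: number; minY: number; maxX: number; maxY: number; }
interface GridFit { originX: number; originY: number; cellSize: number; }

// Hypothetical helper: derive the grid origin and cell size from the element union.
function fitGrid(rects: LocalRect[], gridDim: number): GridFit {
  if (rects.length === 0) return { originX: 0, originY: 0, cellSize: 1 };

  let minX = Infinity, minY = Infinity, maxX = -Infinity, maxY = -Infinity;
  for (const r of rects) {
    // In the real pass this would ride on the world-to-canvas-local transform loop.
    minX = Math.min(minX, r.minX);
    minY = Math.min(minY, r.minY);
    maxX = Math.max(maxX, r.maxX);
    maxY = Math.max(maxY, r.maxY);
  }

  // Assumption: the longest edge of the union drives a square grid.
  const unionRange = Math.max(maxX - minX, maxY - minY);
  // Guard against a zero-extent union (e.g. a single degenerate rect).
  const cellSize = unionRange > 0 ? unionRange / gridDim : 1;
  return { originX: minX, originY: minY, cellSize };
}
```

Because the origin is the union minimum rather than zero, elements in the negative half of a centered canvas map to valid cells without any special casing.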
A multi-pass material whose passes target different queues (e.g. Opaque +
Transparent) pushed the same `RenderElement` object into both queues.
`RenderElement` carries the batch result (`instancedRenderers`,
`_isBatched`), so the queue that batches first stamped those fields on
the shared object and the next queue's `BatcherManager.batch` then
short-circuited the polluted leader straight into its output — the
opaque instance list got drawn under the transparent pass, and the
remaining members were re-batched into a second leader and drawn again
(producing the "1 olive + 5 greener" signature on N-cube repros).
Clone the element once per additional queue: the first target keeps the
original (zero overhead), subsequent ones get a pool-allocated copy with
its own batch state via `_cloneFrom`. The pool is a `ClearableObjectPool`,
so the copies add no GC pressure. Identity fields are copied from the
original; batch state stays per-queue.
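
The clone-per-additional-queue idea can be sketched as follows. `SketchElement`, `SketchPool`, and `pushToQueues` are simplified, hypothetical stand-ins for `RenderElement`, `ClearableObjectPool`, and the push path; the real field set is larger.

```typescript
// Hypothetical, simplified stand-in for RenderElement.
class SketchElement {
  rendererId = 0;                    // identity: copied to clones
  isBatched = false;                 // batch state: owned per queue
  instancedRenderers: number[] = []; // batch state: owned per queue

  cloneFrom(source: SketchElement): void {
    this.rendererId = source.rendererId;
    // Batch state always starts clean on a clone.
    this.isBatched = false;
    this.instancedRenderers.length = 0;
  }
}

// Hypothetical clearable pool: instances are reused after clear().
class SketchPool {
  private items: SketchElement[] = [];
  private used = 0;

  get(): SketchElement {
    if (this.used === this.items.length) this.items.push(new SketchElement());
    return this.items[this.used++];
  }

  clear(): void {
    this.used = 0;
  }
}

function pushToQueues(element: SketchElement, queues: SketchElement[][], pool: SketchPool): void {
  queues.forEach((queue, index) => {
    if (index === 0) {
      queue.push(element); // first target keeps the original
      return;
    }
    const copy = pool.get(); // later targets get their own batch state
    copy.cloneFrom(element);
    queue.push(copy);
  });
}
```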
The grid addressing in UIBatchSorter assumed canvas-local coords spanned
`[0, canvasLongestEdge]`, but `UITransform.pivot` defaults to `(0.5, 0.5)`,
making canvas-local `[-W/2, +W/2]`. For elements wholly in the negative
half (e.g. a button in the bottom half of a centered canvas), `floor()`
produced a negative `maxCell`, the `for (cellY <= maxCellY)` loop never
entered, and the element was both unregistered from the grid and skipped
during overlap queries. Missed overlaps left elements at `depth = 0`
together with the full-screen background, and `_compareEntries`'s
`textureId` tiebreaker reordered them across hierarchy, so the background
ended up drawn on top of buttons it should have sat behind.
Clamp every cell index on both sides. Negative-half elements now collapse
into edge cells; the precise `_rectOverlap` check still uses the real
bounds, so collapsing is safe and depth bumping behaves correctly.
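
A small sketch of the two-sided clamp, assuming a square grid of `gridDim` cells per axis. `cellRange` and its parameters are hypothetical names for illustration.

```typescript
// Hypothetical helper: map a 1D extent to an inclusive [minCell, maxCell] range.
function cellRange(
  min: number,
  max: number,
  origin: number,
  cellSize: number,
  gridDim: number
): [number, number] {
  // Clamp on both sides so out-of-grid extents collapse into the edge cells.
  const clamp = (index: number): number => Math.min(gridDim - 1, Math.max(0, index));
  const minCell = clamp(Math.floor((min - origin) / cellSize));
  const maxCell = clamp(Math.floor((max - origin) / cellSize));
  return [minCell, maxCell];
}
```

With an origin of 0 and 64-unit cells, an element spanning `[-300, -100]` would floor to cells `-5..-2` and be skipped by a `<= maxCell` loop; clamped, it registers in cell 0 and the exact overlap test decides the rest.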
`renderer_NormalMat` was emitted as `mat4(transpose(inverse(mat3(renderer_ModelMat))))`.
GLSL `inverse()` is undefined for singular matrices, and any zero scale axis
(e.g. an animation hiding an entity via `scale = (1, 0, 1)`) makes the 3x3
model matrix singular — drivers typically return NaN/Inf, which then
contaminates the normal, the lighting, and finally the whole fragment.
The plain uniform path is protected by `Matrix.invert`'s `if (!det) return null`
early-out, but instancing recomputes the matrix on the GPU each draw with no
such guard, so the same scene that renders (with stale-but-finite normals)
without instancing went all-black with instancing.
Emit the cofactor (cross-product) form instead. It equals
`det(M) · transpose(inverse(M))`, so after `normalize()` it matches the classic
formula in direction while avoiding the divide-by-det entirely. This is aligned
with Filament / Unreal.
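
For illustration, the cofactor construction written out on the CPU side. This is a sketch in TypeScript rather than the emitted GLSL, and the helper names are hypothetical. With model-matrix columns `c0`, `c1`, `c2`, the cofactor matrix has columns `c1 × c2`, `c2 × c0`, `c0 × c1`.

```typescript
type Vec3 = [number, number, number];

const cross = (a: Vec3, b: Vec3): Vec3 => [
  a[1] * b[2] - a[2] * b[1],
  a[2] * b[0] - a[0] * b[2],
  a[0] * b[1] - a[1] * b[0]
];

/** Columns of the cofactor matrix; no division by the determinant is involved. */
function cofactorColumns(c0: Vec3, c1: Vec3, c2: Vec3): [Vec3, Vec3, Vec3] {
  return [cross(c1, c2), cross(c2, c0), cross(c0, c1)];
}

/** Multiply a column-major 3x3 matrix by a vector. */
function mulColumns(m: [Vec3, Vec3, Vec3], v: Vec3): Vec3 {
  return [
    m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2],
    m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2],
    m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2]
  ];
}
```

For `scale = (1, 0, 1)` the columns are `(1,0,0)`, `(0,0,0)`, `(0,0,1)`; the cofactor matrix is finite and maps the normal `(0, 1, 0)` to `(0, 1, 0)`, where the inverse-based formula has no defined result.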
`_scanInstanceUniforms` stripped renderer-group uniform declarations
before `_buildLayout` decided whether they could fit the std140 layout.
For types like `mat3` (whose padded std140 layout is not covered by
the current `_std140TypeInfoMap`), the declaration was removed but
no UBO field or `#define` replaced it — every later reference became an
undeclared identifier and the whole pass silently failed to compile.
Check the storage type against the std140 map up front. Unsupported
types stay declared in the source and emit a clear log; supported types
follow the existing strip-and-collect path.
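
A sketch of the check-before-strip ordering. The type set, the regex, and `scanRendererUniforms` are illustrative assumptions; the actual contents of `_std140TypeInfoMap` and the scanner's real pattern are not shown in this log.

```typescript
// Hypothetical subset of supported types, for illustration only.
const supportedStd140Types = new Set(["float", "int", "vec2", "vec3", "vec4", "mat4"]);

interface ScanResult { source: string; collected: string[]; unsupported: string[]; }

function scanRendererUniforms(source: string): ScanResult {
  const collected: string[] = [];
  const unsupported: string[] = [];
  const stripped = source.replace(
    /uniform\s+(\w+)\s+(renderer_\w+)\s*;/g,
    (declaration: string, type: string, name: string) => {
      if (!supportedStd140Types.has(type)) {
        // Unsupported: leave the declaration in place as a plain uniform.
        unsupported.push(name);
        return declaration;
      }
      // Supported: strip it here; it is later re-exposed as a UBO field.
      collected.push(name);
      return "";
    }
  );
  return { source: stripped, collected, unsupported };
}
```

Deciding support inside the same callback that strips the declaration means a declaration can never be removed without being collected.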
`else if (curElement._isBatched) return` in `VertexMergeBatcher.batch`
was defensive: it caught the case where the main pipeline fed an
already-batched leader back into `_batch(null, leader)`. After commit 9606bf3de
moved that check to `BatcherManager.batch`'s entry, the leader never
reaches this branch. Refresh `RenderElement._isBatched`'s doc to match
its current meaning ("batch leader — must not be merged again").
`_scanInstanceUniformsWithMacros` and `injectInstanceUBO`'s `activeMacros`
parameter only served the raw GLSL path, which was removed in e08af33d9
("refactor(shader): remove raw GLSL shader path"). The 4 preprocessor
regexes are only used by this function. All inputs are now ShaderLab
preprocessor-evaluated before reaching the injector.
Cleanup that should have landed alongside e08af33d9. A reviewer also
reported a dormant `#if`/`#elif` branch-stack bug in this function, which is
moot once the function is gone.
UICanvas pre-batches its children into leaders that carry a self-contained
draw range. When two canvases sharing the same atlas push their leaders
into the transparent queue, the main-pipeline batcher previously fed those
leaders back into `_canBatch`/`_batch` as if they were single sub-elements.
That corrupted `subMesh.count` and re-appended already-written indices,
dropping draws or overlapping ranges.
The batch boundary is the canvas — matches Unity uGUI's behavior — so
detect already-batched leaders at the batcher entry and pass them through.
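
The pass-through can be sketched with a simplified merge loop. `SketchBatchElement` and `batchElements` are hypothetical; the `key` field stands in for whatever `_canBatch` actually compares, and `count` for the leader's draw range.

```typescript
interface SketchBatchElement { key: string; count: number; isBatched: boolean; }

function batchElements(elements: SketchBatchElement[]): SketchBatchElement[] {
  const output: SketchBatchElement[] = [];
  let leader: SketchBatchElement | null = null;

  for (const element of elements) {
    if (element.isBatched) {
      // Already a self-contained leader: forward untouched and end the current run.
      output.push(element);
      leader = null;
      continue;
    }
    if (leader !== null && leader.key === element.key) {
      leader.count += element.count; // merge into the running leader
    } else {
      leader = { key: element.key, count: element.count, isBatched: true };
      output.push(leader);
    }
  }
  return output;
}
```

Two canvas leaders sharing an atlas therefore stay as two draws, while loose elements after them still merge with each other.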
Shaders that only declare derived built-ins (e.g. `renderer_MVPMat`) had
their declarations stripped by `_scanInstanceUniforms` but skipped the
`#define` rewrite because `fieldMap` was empty, leaving dangling
`renderer_ModelMat` references that fail to compile.
This also fixes a latent bug where transform-free shaders with
`RENDERER_INSTANCING` enabled would silently lose `N-1` instances:
`_canBatch` returned true and only the leader was drawn.
ShaderData.setInt documents bool support and setVector* documents
bvec support, but the instance UBO layout table didn't list them —
declaring a `bool` or `bvec` renderer uniform would silently drop the
declaration without producing a #define, causing the shader to fail
with an undeclared identifier. Add std140 size/align entries, reuse
existing packScalar/packVec pack functions, and recognize the `b`
prefix for the intView dispatch.
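
For reference, the std140 sizes and alignments involved, plus a sketch of int-view packing. The sizes follow the std140 rules (a `bool` occupies 4 bytes; a 3-component vector aligns to 16); `boolStd140Info`, `alignUp`, and `packBools` are hypothetical names, not the engine's table or pack functions.

```typescript
interface Std140Info { size: number; align: number; }

// std140 treats bool as a 4-byte scalar; bvec3 aligns like a 4-component vector.
const boolStd140Info: Record<string, Std140Info> = {
  bool: { size: 4, align: 4 },
  bvec2: { size: 8, align: 8 },
  bvec3: { size: 12, align: 16 },
  bvec4: { size: 16, align: 16 }
};

/** Round `offset` up to the next multiple of `align`. */
function alignUp(offset: number, align: number): number {
  return Math.ceil(offset / align) * align;
}

/** Hypothetical packer: writes each component as a 0/1 integer via an int view. */
function packBools(view: Int32Array, byteOffset: number, values: boolean[]): void {
  for (let i = 0; i < values.length; i++) {
    view[byteOffset / 4 + i] = values[i] ? 1 : 0;
  }
}
```

A `bool` at offset 0 followed by a `bvec3` places the vector at byte 16, not byte 4, which is why the table needs explicit alignment entries rather than just sizes.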
Replace `any[]` with `RenderElement[]` on `_renderElements` /
`_batchedRenderElements`, and clear both in `_onDisable` so the
pooled elements they hold don't surface stale references when other
renderers reuse the same pool slots while the canvas is disabled.
InstanceBuffer read renderer.entity.transform.worldMatrix, while
Renderer._updateTransformShaderData uses _transformEntity.transform.
They diverge whenever a subclass remaps _transformEntity (e.g.
SkinnedMeshRenderer points it at the root bone); switch InstanceBuffer
to _transformEntity to align with the shader-data path.
UIBatchSorter is only used by UICanvas; keeping it in core forced a
cross-package export plus a ts-ignore at the UICanvas import site for
an @internal symbol. Move it next to its sole consumer in the ui
package and add RenderElement to core's RenderPipeline barrel so ui
can type the sort input. Utils._quickSort stays @internal; the new
call site carries a single ts-ignore acknowledging the reuse.
Calling `Shader.create` at module scope throws because the shader
compiler is only installed on `Shader._shaderCompiler` during engine
init. Move the call into the `then` callback of engine creation. Re-add the
Stats overlay from `@galacean/engine-toolkit-stats` for dev observability.
When a shader enables RENDERER_GPU_INSTANCE but never declares
`renderer_ModelMat` itself (e.g. it only references `renderer_MVPMat`),
the derived defines we inject — which expand to expressions using
`renderer_ModelMat` and `camera_ViewMat` / `camera_VPMat` — would
resolve to undeclared identifiers under WebGL.
Two complementary fixes:
- `_buildLayout` force-injects `renderer_ModelMat` (as mat3x4 affine
pack) into the UBO whenever it's missing from `fieldMap`. Layout
ordering is now an explicit priority skip rather than the
`addField + delete` mutation pattern, which read like "add then
delete" at a glance.
- `_buildMissingCameraDecls` scans the post-evaluate GLSL for an
existing `uniform mat4 camera_*;` declaration and emits one only
when shader-compiler DCE stripped it from Transform.glsl. Sources
that legitimately pulled the camera matrix in (e.g. PBR fragment
using camera_ViewMat for refraction) keep their single declaration.
The 3-arg `Shader.create(name, vertexSource, fragmentSource)` overload no
longer exists after PR #2961 — passing GLSL strings drops into the SubShader
branch and crashes with `Cannot read properties of undefined (reading
'length')` from BasicRenderPipeline. Convert the two custom-instancing
cases to ShaderLab syntax with an explicit `ShaderCompiler` wired through
`WebGLEngine.create`.
Also pick up the upstream LFS baselines (particle/physx jpgs) — the merge
left their pointers at our pre-merge oid even though the working tree had
been resolved to upstream.
`injectInstanceUBO` rewrites `renderer_MVMat` / `renderer_MVPMat` to
`(camera_ViewMat * renderer_ModelMat)` / `(camera_VPMat * renderer_ModelMat)`,
introducing references the shader source itself may not have. shader-compiler
performs DCE on Transform.glsl declarations, so any vertex shader that only
reads `renderer_MVPMat` as a plain uniform ends up without `camera_VPMat`
visible to the rewritten GLSL — WebGL then fails to compile with
`'camera_VPMat' : undeclared identifier`.
Make the injector self-contained: scan the post-evaluate GLSL and emit
`uniform mat4 camera_*` declarations only for matrices not already present.
Sources that explicitly include them (e.g. PBR fragment using camera_ViewMat
for refraction) keep their single declaration intact.
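
A minimal sketch of the "declare only what is missing" scan. The regex and the `needed` parameter are assumptions for illustration; the real `_buildMissingCameraDecls` may detect declarations differently.

```typescript
// Hypothetical version: emit a declaration for each needed matrix not already declared.
function buildMissingCameraDecls(glsl: string, needed: string[]): string {
  let declarations = "";
  for (const name of needed) {
    // Accept an optional precision qualifier between `uniform` and `mat4`.
    const pattern = new RegExp(
      `uniform\\s+(?:(?:highp|mediump|lowp)\\s+)?mat4\\s+${name}\\s*;`
    );
    if (!pattern.test(glsl)) declarations += `uniform mat4 ${name};\n`;
  }
  return declarations;
}
```

Scanning the post-evaluate GLSL rather than the original source matters here: only after dead-code elimination is it known which declarations actually survived.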
Common.glsl already declares camera_ProjectionParams. Re-declaring it
in Transform.glsl makes every shader that includes both fail to
compile under WebGL (the entire PBR family among them). Regenerate
the affected compiled .shaderc artifacts (PBR, ShadowCaster,
DepthOnly).
* fix(text): mark WorldPosition dirty after slot reallocation in _updateLocalData
Both Text (UI) and TextRenderer share a `bounds` getter that runs
`_updateLocalData` and then checks the `WorldPosition` dirty flag.
`_updateLocalData` internally calls `_freeTextChunks` and then
`_buildChunk → allocateSubChunk`, which under PrimitiveChunk's first-fit +
free-list-merge allocator can return a slot previously owned by another
renderer. `_buildChunk` writes UV and color but never pos (pos is
`_updatePosition`'s job), so the new slot retains the previous owner's pos
floats as residue.
Before this fix, when a path sets only `LocalPositionBounds` dirty
(e.g. `Text._onRootCanvasModify(ReferenceResolutionPerUnit)` in UI
Text), the bounds getter would:
1. see LocalPositionBounds → run _updateLocalData (slot may swap)
2. see WorldPosition not dirty → skip _updatePosition
3. _setDirtyFlagFalse(Font) clears all dirty bits at once
The next _render also sees clean dirty bits and uploads the residue
pos to GPU — the renderer ends up rendering at someone else's old
world position. In practice this manifested as text glyphs jumping
to the wrong spot or appearing missing after UI tab switches that
free + reallocate chunk slots in the same frame.
Fix: force WorldPosition dirty at the end of _updateLocalData so the
contract "after this call, pos must be rewritten" is unconditionally
honored regardless of which caller invoked it.
Tests cover three layers:
- dirty-flag invariant: _updateLocalData must leave WorldPosition
dirty on exit
- corrupted-slot: bounds getter with only LocalPositionBounds dirty
rewrites pos even when the slot memory is poisoned
- full slot-reuse repro: destroy a sibling renderer occupying a
lower offset, then trigger bounds getter on the survivor — its
pos must remain correct after the slot moves
Without the fix, all three regression tests fail with the survivor
rendering at the destroyed sibling's old position.
* chore: drop Chinese commentary from text dirty-flag fix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(ui): destroy engine after regression describe block
* refactor(text): move dirty propagation to input side
Previous fix added _setDirtyFlagTrue(WorldPosition) at the end of
_updateLocalData in both TextRenderer and UI Text. That treats the
output side as the place to declare invalidation, which conflates
two concerns: dirty flags should declare staleness from input
semantics, and update methods should be pure compute units that
don't propagate flags themselves.
Root cause is on the input side: _onRootCanvasModify(ReferenceResolutionPerUnit)
declared LocalPositionBounds dirty but not WorldPosition, even though
ReferenceResolutionPerUnit affects both local layout and the world
positions derived from it. Fix the declaration where the input
semantic event lives.
TextRenderer needs no change — it has no entry point that dirties
LocalPositionBounds without also dirtying WorldPosition (all setters
use DirtyFlag.Position which includes both).
Tests rewritten from white-box (poking private _dirtyFlag, hardcoded
enum values) to public-API integration tests that drive the bug
through uiCanvas.referenceResolutionPerUnit and assert observable
vertex position changes. The new tests fail without the fix
(maxDelta = 0, positions don't update) and pass with it.
* fix(text): include WorldVolume in dirty flag for ReferenceResolutionPerUnit change
Use DirtyFlag.Position (= LocalPositionBounds | WorldPosition | WorldVolume)
instead of the manual two-flag combination. ReferenceResolutionPerUnit
also affects world bounding volume; without the WorldVolume bit,
_updateBounds is skipped in the bounds getter and stale BoundingBox
leaks into frustum culling and raycasting.
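
A sketch of why the composite flag matters. The bit values and the `DirtyState` class are hypothetical; only the composition `Position = LocalPositionBounds | WorldPosition | WorldVolume` comes from the message above.

```typescript
// Hypothetical bit assignments; the engine's actual values may differ.
enum SketchDirtyFlag {
  LocalPositionBounds = 0x1,
  WorldPosition = 0x2,
  WorldVolume = 0x4,
  Position = LocalPositionBounds | WorldPosition | WorldVolume
}

class DirtyState {
  private flags = 0;

  set(flag: number): void {
    this.flags |= flag;
  }

  /** True if any bit of `flag` is currently set. */
  isDirty(flag: number): boolean {
    return (this.flags & flag) !== 0;
  }
}
```

Setting only `LocalPositionBounds` leaves the `WorldPosition` and `WorldVolume` checks false, so the corresponding update steps are skipped; setting `Position` marks all three at once.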
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: chenmo.gl <chenmo.gl@antgroup.com>
Stress level tuned to push DrawCall contrast against dev/2.0 (no batch
clustering) without overwhelming CPU. UI is typically static, so no
animation — keeps the test focused on batching algorithm cost.
Programmatic textures are now closer to real game UI:
- gradient panel bg (electric-blue gradient + double-tone border)
- 64×64 icon atlas with 4 icons (sword/heart/bolt/gem); buttons cycle
through atlas regions via Sprite Rect
Demonstrates the realistic batching scenario: many sprites sharing one
atlas texture still batch into a single draw call.
ScreenSpaceOverlay UI renders in a separate overlay pass that bypasses
camera.render(); initScreenshot's off-screen render-target therefore couldn't
capture it (downloaded image was empty grey). Switch the case to
ScreenSpaceCamera so the canvas goes through the main camera path and the
screenshot captures the full 4×3 button grid.
With the new pipeline + canvas-internal batching, the layout is now
deterministic across runs, so diffPercentage is set to 0.
The render pipeline rewrite for GPU instancing on feat/gpu-instancing produces
slightly different particle output than dev/2.0 (semi-transparent particles are
sensitive to draw-order microchanges). Visual is correct; baseline regenerated
to match the new pipeline output. Tolerance stays at 0.334%.
The case verifies the canvas hierarchy-order bug stays fixed: when
SubRenderElement was flattened, canvas sub-elements could be shuffled in
the main transparent queue under equal (priority, distance). Fix is in
place (canvas-internal batching + subDistancePriority); this e2e prevents
silent regression.
Adds UIBatchSorter that runs inside each canvas to cluster sub-elements by
(depth, material, texture, hierarchy) before they enter the main render queue.
Combined with VertexMergeBatcher, multi-material UI buttons collapse from
N draw calls to ~M (M = visual layer count), dramatically improving fps on
dense layouts (6000 buttons: 40fps → 67fps).
- UIBatchSorter: spatial-hash accelerated BatchSorting algorithm; cell size
derives from canvas referenceResolution to stay optimal across designs.
- RenderElement.subDistancePriority: tiebreaker so canvas leaders keep their
relative order when interleaved with 3D under unstable quicksort.
- RenderElement._isBatched: protects already-batched leaders from main-pipeline
_batch(null, leader) re-init that would corrupt subMesh.start/count.
- Image/Text always populate renderElement.subShader at production (was
conditional on overlay mode); needed because canvas-internal batching now
runs before pushRenderElement which used to backfill it.
- Tests: 12 unit cases for sort correctness; e2e + example for batch demo.
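
The `subDistancePriority` tiebreaker can be sketched as a third comparison key. `SketchQueueEntry` and `compareTransparent` are hypothetical, and the far-to-near distance ordering is an assumption about the transparent queue.

```typescript
interface SketchQueueEntry {
  name: string;
  priority: number;
  distance: number;
  subDistancePriority: number;
}

function compareTransparent(a: SketchQueueEntry, b: SketchQueueEntry): number {
  return (
    a.priority - b.priority ||
    b.distance - a.distance || // far to near, assumed for transparent queues
    a.subDistancePriority - b.subDistancePriority // keeps canvas leaders in order
  );
}
```

With the third key the comparator defines a total order among a canvas's leaders, so the result no longer depends on whether the sort implementation is stable.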
_scanInstanceUniforms regex-matches uniform declarations without understanding
#ifdef blocks. For raw GLSL paths the source still contains preprocessor
directives at scan time, so uniforms inside inactive branches (e.g. renderer_JointMatrix
under #ifdef RENDERER_HAS_SKIN) get matched even when they won't compile.
This caused "GPU Instancing does not support array uniform" errors for plain
MeshRenderer batching whenever a SkinnedMeshRenderer had previously registered
renderer_JointMatrix under ShaderDataGroup.Renderer.
Add _scanInstanceUniformsWithMacros that walks the source line-by-line with
a branch stack for #ifdef/#ifndef/#else/#endif, delegating active lines to the
original scanner. compilePlatformSource passes its active macro set; the
ShaderLab path keeps using the plain scanner since ShaderMacroProcessor.evaluate
already expands directives there.
Also change the array-uniform fallback from deletion to keeping the declaration
as a regular uniform, so stray matches never directly fail shader compilation.
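
A simplified sketch of the branch-stack walk, covering only `#ifdef` / `#ifndef` / `#else` / `#endif` as described above. `collectActiveLines` and `BranchFrame` are hypothetical names; any other directive is treated as an ordinary line here.

```typescript
interface BranchFrame { parentActive: boolean; condition: boolean; active: boolean; }

function collectActiveLines(source: string, macros: Set<string>): string[] {
  const stack: BranchFrame[] = [];
  const result: string[] = [];
  const isActive = (): boolean =>
    stack.length === 0 ? true : stack[stack.length - 1].active;

  for (const line of source.split("\n")) {
    const match = /^\s*#\s*(ifdef|ifndef|else|endif)\b\s*(\w*)/.exec(line);
    if (!match) {
      // Ordinary line: keep it only when every enclosing branch is active.
      if (isActive()) result.push(line);
      continue;
    }
    const directive = match[1];
    if (directive === "ifdef" || directive === "ifndef") {
      const defined = macros.has(match[2]);
      const condition = directive === "ifdef" ? defined : !defined;
      const parentActive = isActive();
      stack.push({ parentActive, condition, active: parentActive && condition });
    } else if (directive === "else") {
      const top = stack[stack.length - 1];
      if (top) top.active = top.parentActive && !top.condition;
    } else {
      stack.pop(); // endif
    }
  }
  return result;
}
```

Each frame remembers whether its parent was active, so an `#else` inside an inactive outer block stays inactive.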