
More efficient hvcat of scalars and arrays of numbers #39729

Open · wants to merge 12 commits into master

Conversation

@BioTurboNick (Contributor) commented Feb 18, 2021

First attempt to address #39713

Original:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools
@btime [a c ; b d]   # 31 allocations: 1.25 KiB

New:

@btime [a c ; b d]   # 15 allocations: 656 bytes

Other cases are unchanged, as expected, though if different types of numbers are mixed, it still takes the longer path. I tried expanding the definition, but there's some weird behavior that increases allocations in the other situations I posted in that issue. The fast path works for any single Number element type.

Fixes #39713
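
For illustration, the kind of fast path this describes could look roughly like the sketch below. This is a hypothetical sketch, not the PR's actual code; it assumes dispatch on a single shared Number type and the existing Base.typed_hvcat entry point, which takes the element type explicitly.

# Hypothetical sketch (not the PR's code): when every argument is a T or an
# AbstractArray{T} for one Number type T, the element type is known statically,
# so the call can go straight to the typed path instead of the generic one.
function Base.hvcat(rows::Tuple{Vararg{Int}},
                    xs::Union{T, AbstractArray{T}}...) where {T<:Number}
    return Base.typed_hvcat(T, rows, xs...)
end

This also matches the observation above: mixing different Number types (say, Int and Float64) fails to bind a single T, so such calls fall back to the slower generic path.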

@BioTurboNick (Contributor, Author)

Oops, yep, completely broken. Should have guessed.

@BioTurboNick changed the title from "More efficient hvcat of same-type elements" to "More efficient hvcat of scalars and arrays of numbers" on Feb 18, 2021.
@BioTurboNick (Contributor, Author)

const a, b, c, d, e, f = zeros(Int, 2, 2), [3 4], [2 ; 4], 5, 5.0, "string"
using BenchmarkTools
e19() = [a c ; b d]
e20() = [a c ; b e]
e21() = [a c ; b f]
e22() = [a c ; b [d]]
@btime e19() # 12 alloc: 528 bytes     ; 1.6.0-rc1: 31 alloc: 1.25 KiB
@btime e20() # 13 alloc: 544 bytes     ; 1.6.0-rc1: 39 alloc: 1.38 KiB
@btime e21() # <broken>                 ; 1.6.0-rc1: 31 alloc: 1.25 KiB
@btime e22() # 10 alloc: 512 bytes     ; 1.6.0-rc1: 10 alloc: 512 bytes
const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1

e6() = [x y ; x y]
@btime e6() # 4 alloc: 672 bytes        ; 1.6.0-rc1: 104 alloc: 4.98 KiB

This is what I achieved by consolidating with hvncat (once it's ready). Note that it integrates #39725, which is currently broken for assigning strings into a ranged setindex.
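
For readers who haven't followed #33697: hvncat generalizes hvcat to n dimensions. Its dims form, for example, fills a given shape either row-first or column-first (a minimal example, assuming Julia ≥ 1.7; expected output shown):

# hvncat's dims form: build a 2×3 matrix from scalars, filled row-first.
julia> hvncat((2, 3), true, 1, 2, 3, 4, 5, 6)
2×3 Matrix{Int64}:
 1  2  3
 4  5  6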

@timholy (Member) commented Feb 19, 2021

I haven't had time to look at this, but in case you're interested and if it helps, I recently sped up cat: #39292, #39294, #39314

@BioTurboNick (Contributor, Author)

Cool! @timholy, I think I noticed some of that effect without knowing the cause. Don't worry about reviewing this yet; it depends on my other PR, #33697.

@BioTurboNick (Contributor, Author) commented Jun 6, 2021

Updated this, now that the hvncat mechanism is merged:

julia> @btime e19()
  904.651 ns (8 allocations: 576 bytes)
3×3 Matrix{Int64}:
 0  0  2
 0  0  4
 3  4  5
# compare with 1.6.1: 2.611 μs (31 allocations: 1.25 KiB)
#             master: 2.156 μs (30 allocations: 1.44 KiB)

julia> @btime e20()
  910.811 ns (9 allocations: 592 bytes)
3×3 Matrix{Float64}:
 0.0  0.0  2.0
 0.0  0.0  4.0
 3.0  4.0  5.0
# compare with 1.6.1: 2.811 μs (39 allocations: 1.38 KiB)
#             master: 2.267 μs (38 allocations: 1.56 KiB)

julia> @btime e21()
  847.368 ns (5 allocations: 432 bytes)
3×3 Matrix{Any}:
 0  0  2
 0  0  4
 3  4   "string"
# compare with 1.6.1: 2.678 μs (31 allocations: 1.25 KiB)
#             master: 2.167 μs (26 allocations: 1.25 KiB)

julia> @btime e22()
  873.333 ns (6 allocations: 528 bytes)
3×3 Matrix{Int64}:
 0  0  2
 0  0  4
 3  4  5
# compare with 1.6.1: 1.390 μs (10 allocations: 512 bytes)
#             master: 1.240 μs (10 allocations: 512 bytes)

julia> @btime e6() # 4 alloc: 672 bytes
  515.104 ns (5 allocations: 640 bytes)
4×4×2 Array{Int64, 3}:
[:, :, 1] =
 1  2  2  3
 3  4  4  5
 1  2  2  3
 3  4  4  5

[:, :, 2] =
 5  6  6  7
 7  8  8  9
 5  6  6  7
 7  8  8  9
# compare with 1.6.1: 4.414 μs (56 allocations: 2.41 KiB)
#             master: 3.337 μs (50 allocations: 2.39 KiB)

julia> @btime [1 2; 4 5]
  29.879 ns (1 allocation: 112 bytes)
2×2 Matrix{Int64}:
 1  2
 4  5
# compare with 1.6.1: 91.208 ns (2 allocations: 160 bytes)
#             master:  93.194 ns (2 allocations: 160 bytes)

julia> @btime [1 2; 4 fill(1, 1, 1, 1)]
  877.551 ns (9 allocations: 624 bytes)
2×2 Matrix{Int64}:
 1  2
 4  1
# compare with 1.6.1: 4.188 μs (43 allocations: 1.62 KiB)
#             master: 3.487 μs (42 allocations: 1.81 KiB)

@BioTurboNick (Contributor, Author) commented Jun 6, 2021

With respect to the original motivating examples:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools
# mixed arrays and scalars
@btime [a c ; b d]   # 2.922 μs (31 allocations: 1.25 KiB); now 897.778 ns (8 allocations: 576 bytes)
@btime [a c ; [b d]] # 2.322 μs (21 allocations: 880 bytes); now 1.690 μs (26 allocations: 1.30 KiB)
@btime [[a c] ; [b d]] # 975.000 ns (16 allocations: 816 bytes) -- explicit hcat nested within vcat (should be worst case)
# scalars wrapped in arrays
@btime [a c ; b [d]] # 1.410 μs (10 allocations: 512 bytes); now 865.957 ns (6 allocations: 528 bytes)
@btime [a c ; [b [d]]] # 1.230 μs (9 allocations: 560 bytes); now 816.250 ns (14 allocations: 1008 bytes)
@btime [[a c] ; [b [d]]] # 142.069 ns (4 allocations: 496 bytes) -- explicit hcat nested within vcat (should be worst case)

So, better overall. Still some improvements to make; I'm not entirely sure why the middle case is relatively poor.

(A comment from @BioTurboNick was marked as outdated.)

@BioTurboNick (Contributor, Author)

Current status of this PR vs. master:

(Some timing differences from the previous test are likely due to my new computer.)

This PR:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 259.371 ns (22 allocations: 1.17 KiB) +++
@btime [a c ; [b d]] # 1.612 μs (26 allocations: 1.12 KiB) ---
@btime [[a c] ; [b d]] # 497.052 ns (16 allocations: 720 bytes) ---

# scalars wrapped in arrays
@btime [a c ; b [d]] # 189.981 ns (4 allocations: 352 bytes)
@btime [a c ; [b [d]]] # 1.161 μs (14 allocations: 816 bytes) ---
@btime [[a c] ; [b [d]]] # 84.902 ns (4 allocations: 384 bytes)

Master:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 628.402 ns (17 allocations: 944 bytes)
@btime [a c ; [b d]] # 1.110 μs (21 allocations: 800 bytes)
@btime [[a c] ; [b d]] # 480.513 ns (16 allocations: 720 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 801.111 ns (10 allocations: 448 bytes)
@btime [a c ; [b [d]]] # 665.161 ns (9 allocations: 464 bytes)
@btime [[a c] ; [b [d]]] # 79.321 ns (4 allocations: 384 bytes)

+++ = Improved since last test
--- = Regression since last test

So currently, master is actually faster than or equivalent to this PR for 4 of the 6 cases, which is great. I'll need to think about whether this PR is still necessary, and see what can improve the last case.

@BioTurboNick (Contributor, Author)

Updating based on current master:

This PR:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 2.473 μs (43 allocations: 2.03 KiB)
@btime [a c ; [b d]] # 1.561 μs (33 allocations: 1.17 KiB)
@btime [[a c] ; [b d]] # 467.454 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 176.155 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]] # 1.088 μs (22 allocations: 864 bytes)
@btime [[a c] ; [b [d]]] # 57.834 ns (8 allocations: 432 bytes)

Master:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 564.249 ns (17 allocations: 816 bytes)
@btime [a c ; [b d]] # 1.041 μs (23 allocations: 832 bytes)
@btime [[a c] ; [b d]] # 470.286 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 708.369 ns (12 allocations: 464 bytes)
@btime [a c ; [b [d]]] # 585.099 ns (12 allocations: 496 bytes)
@btime [[a c] ; [b [d]]] # 59.694 ns (8 allocations: 432 bytes)

All cases on master are faster than they were, even in the presence of additional allocations. I'll see about wrapping this PR up one way or the other this weekend, I think.

@BioTurboNick (Contributor, Author) commented Feb 11, 2024

New, simpler approach: just dispatching hvcat to the hvncat shape methods. Back to being similar to or faster than master, but I need to verify I didn't slow down any other cases.

EDIT: updated timings after better addressing dynamic dispatch of cat_similar

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 150.041 ns (25 allocations: 1.19 KiB)
@btime [a c ; [b d]] # 565.120 ns (31 allocations: 1.11 KiB)
@btime [[a c] ; [b d]] # 472.211 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 85.399 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]] # 148.468 ns (20 allocations: 800 bytes)
@btime [[a c] ; [b [d]]] # 58.048 ns (8 allocations: 432 bytes)

@BioTurboNick (Contributor, Author) commented Feb 12, 2024

The remaining slowdown I'm seeing relates to the hcat methods.

In 1.10, hcat for an array and a scalar dispatches to a SparseArrays method (even though I haven't loaded SparseArrays?), which redirects to typed_hcat and then to _cat_t(Val(2), T, X...), which is very fast.

On master, the fallback abstractarray.jl hcat is called instead, which feeds into the slower cat(A...; dims=Val(2)).
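
(For anyone reproducing this, the @which macro from InteractiveUtils shows which method a call dispatches to; per the above, it reports the SparseArrays method on 1.10 and the abstractarray.jl fallback on master:)

using InteractiveUtils
b, d = [3 4], 5
@which hcat(b, d)  # prints the dispatched hcat method and its source location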

Possibly related to #49322? It looks like a regression in cat; I haven't fully investigated:

# 1.10:
julia> @btime cat(b, d, dims = 2)
#  24.975 ns (1 allocation: 80 bytes)

# master:
julia> @btime cat(b, d, dims = 2)
#  554.371 ns (19 allocations: 576 bytes)

But we can probably ignore that for the purposes of this PR, which is about hvcat.

@BioTurboNick marked this pull request as ready for review on February 12, 2024.
@RomeoV commented Mar 10, 2024

I wonder if this resolves the following issue:

using LinearAlgebra
function measure()
  data = rand(3,3)
  mat = [
    data ones(size(data, 1))
    ones(size(data, 2))' 0
  ]
  det(mat)
end
@code_warntype measure()  # -> can't infer type of `mat`

There has been some discussion at https://julialang.zulipchat.com/#narrow/stream/225542-helpdesk/topic/Matrix.20comprehension.20failed.20type.20inference.

@BioTurboNick (Contributor, Author)

Seems like it does!

MethodInstance for measure()
  from measure() @ Main REPL[2]:1
Arguments
  #self#::Core.Const(measure)
Locals
  mat::Matrix{Float64}
  data::Matrix{Float64}
Body::Float64
1 ─ %1  = Main.rand::Core.Const(rand)
│         (data = (%1)(3, 3))
│   %3  = Core.tuple(2, 2)::Core.Const((2, 2))
│   %4  = data::Matrix{Float64}
│   %5  = Main.ones::Core.Const(ones)
│   %6  = Main.size::Core.Const(size)
│   %7  = data::Matrix{Float64}
│   %8  = (%6)(%7, 1)::Int64
│   %9  = (%5)(%8)::Vector{Float64}
│   %10 = Main.:var"'"::Core.Const(adjoint)
│   %11 = Main.ones::Core.Const(ones)
│   %12 = Main.size::Core.Const(size)
│   %13 = data::Matrix{Float64}
│   %14 = (%12)(%13, 2)::Int64
│   %15 = (%11)(%14)::Vector{Float64}
│   %16 = (%10)(%15)::Adjoint{Float64, Vector{Float64}}
│         (mat = Base.hvcat(%3, %4, %9, %16, 0))
│   %18 = Main.det::Core.Const(LinearAlgebra.det)
│   %19 = mat::Matrix{Float64}
│   %20 = (%18)(%19)::Float64
└──       return %20

@BioTurboNick (Contributor, Author) commented Jan 9, 2025

Alright, I think it's time! Can we consider this for merging? hvcat is universally faster, even though a few cases have a higher allocation count, and all tests pass. It also fixes the type instability mentioned above.

Comparing against current master:

const a, b, c, d, e, f = zeros(Int, 2, 2), [3 4], [2 ; 4], 5, 5.0, "string"
using BenchmarkTools
e19() = [a c ; b d]
e20() = [a c ; b e]
e21() = [a c ; b f]
e22() = [a c ; b [d]]

@btime e19() # 1.088 μs (17 allocations: 816 bytes)
@btime e20() # 1.117 μs (19 allocations: 848 bytes)
@btime e21() # 1.154 μs (17 allocations: 816 bytes)
@btime e22() # 2.429 μs (32 allocations: 1.20 KiB)

@btime [a c ; b d] # 1.075 μs (17 allocations: 816 bytes)
@btime [a c ; [b d]] # 1.906 μs (25 allocations: 1008 bytes)

@btime [a c ; b [d]] # 2.406 μs (32 allocations: 1.20 KiB)
@btime [a c ; [b [d]]] # 1.908 μs (27 allocations: 1.05 KiB)

And current PR:

@btime e19() # 312.075 ns (25 allocations: 1.19 KiB)
@btime e20() # 356.990 ns (30 allocations: 1.27 KiB)
@btime e21() # 115.967 ns (6 allocations: 304 bytes)
@btime e22() # 141.525 ns (8 allocations: 368 bytes)

@btime [a c ; b d] # 312.979 ns (25 allocations: 1.19 KiB)
@btime [a c ; [b d]] # 244.602 ns (18 allocations: 736 bytes)

@btime [a c ; b [d]] # 139.707 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]] # 241.090 ns (20 allocations: 800 bytes)

@oscardssmith added the arrays and performance (Must go faster) labels on Jan 9, 2025.
@oscardssmith (Member) commented Jan 9, 2025

The one thing left that would be good to see is performance on a literal 100×100 matrix (or similar). It's fairly common for large arrays to be written out in source code, and it's important that the compiler not choke on them.

Other than that, this PR looks awesome! Deleting code and getting speedup is a great combination.

@JeffBezanson (Member)

Love the 4-year persistence on this!! 😄

@BioTurboNick (Contributor, Author) commented Jan 10, 2025

I did a 100×100 array of 1.0, and on this PR: 8.533 μs (4 allocations: 79.01 KiB); with one string in it: 5.113 ms (120016 allocations: 2.90 MiB).

vs. master: 8.108 μs (4 allocations: 79.01 KiB); with one string in it: 4.813 ms (71808 allocations: 2.29 MiB).

Even though these seem to be a bit faster on master, when I threw the string in I also got a bunch of warnings about inference being slow, which I've never seen before and which don't appear on the PR:

info: inference of var"##sample#280"(Tuple{}, BenchmarkTools.Parameters) from var"##sample#280"(Any, BenchmarkTools.Parameters) exceeding 2501 frames (may be slow).
info: inference of var"##sample#280"(Tuple{}, BenchmarkTools.Parameters) from var"##sample#280"(Any, BenchmarkTools.Parameters) exceeding 5001 frames (may be slow).
info: inference of var"##sample#280"(Tuple{}, BenchmarkTools.Parameters) from var"##sample#280"(Any, BenchmarkTools.Parameters) exceeding 10001 frames (may be slow).

Side note - don't try to paste a literal 100x100 matrix into PowerShell, it goes very, very slowly. (WSL is fine)
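
For reference, one way to build the same benchmark programmatically rather than pasting the literal (a sketch; it relies on the fact that a matrix literal lowers to an hvcat call with a tuple of row lengths):

using BenchmarkTools
# A 100×100 all-1.0 literal lowers to hvcat((100, ..., 100), 1.0, 1.0, ...).
# Build that expression directly and wrap it in a thunk so @btime measures
# the same call the literal would make.
ex = Expr(:call, :hvcat, ntuple(_ -> 100, 100), fill(1.0, 100 * 100)...)
bench = @eval () -> $ex
@btime $bench()
# For the mixed-type case, swap one element for a String before building ex.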

@BioTurboNick (Contributor, Author)

Looking at the profiles, I think the performance decrease for the mixed-type example is largely due to a little more overhead, and the extensive dimensionality checking (because the concatenated elements could be numbers or arrays).

The current code concatenates each row and then concatenates the rows. This PR checks dimensions, creates the destination array, and then populates it one element at a time. I suspect the master performance would get worse if more rows were mixed types, since all the intermediate arrays would have to have an Any eltype instead of Float64.

The differences in timing might not matter that much, I think; they might not be repeatable. And the first example, with all 1s, doesn't actually touch my code.
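
To make the strategy above concrete, here is a stripped-down 2D sketch (illustrative only, not the PR's internal code; the dimension-consistency checks mentioned above are omitted):

# Illustrative sketch of the described strategy: compute the output shape,
# allocate the destination once, then copy each block (or scalar) into its
# slice. Dimension-consistency checks are omitted for brevity.
function hvcat_sketch(rows::Tuple{Vararg{Int}}, xs...)
    height(x) = x isa AbstractArray ? size(x, 1) : 1
    width(x)  = x isa AbstractArray ? size(x, 2) : 1
    T = promote_type((x isa AbstractArray ? eltype(x) : typeof(x) for x in xs)...)
    W = sum(width(xs[j]) for j in 1:rows[1])   # total width, from the first row
    H, k = 0, 1
    for n in rows                              # total height: sum of row heights
        H += height(xs[k]); k += n
    end
    dest = Matrix{T}(undef, H, W)
    k, r0 = 1, 0
    for n in rows
        c0, h = 0, height(xs[k])
        for _ in 1:n
            x = xs[k]
            dest[r0+1:r0+h, c0+1:c0+width(x)] .= x
            c0 += width(x); k += 1
        end
        r0 += h
    end
    return dest
end

hvcat_sketch((2, 2), zeros(Int, 2, 2), [2; 4], [3 4], 5)  # same layout as [a c; b d]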

Labels: arrays, performance (Must go faster)

Successfully merging this pull request may close issue #39713 (Extra allocations with hvcat of mixed arrays and scalars).