2024-11-22
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
Donald Knuth
Make it work, make it right, make it fast
Kent Beck
The {bench} package (part of r-lib)

option_1 <- function(n) {
  rnorm(n)
}
option_2 <- function(n) {
  my_vec <- c()
  for (i in 1:n) {
    my_vec <- c(my_vec, rnorm(1))
  }
  my_vec
}
bench::mark(op1 = option_1(100),
            op2 = option_2(100),
            check = FALSE)
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 op1 3.29µs 3.89µs 240081. 5.73KB 0
2 op2 80.37µs 86.1µs 11053. 64.64KB 10.7
The {bench} package
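The printed summary is only part of what bench::mark() returns; glimpsing a single run shows all thirteen columns. (The call below is assumed; the slide shows only its output.)

bench::mark(option_1(100)) |>
  dplyr::glimpse()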
Rows: 1
Columns: 13
$ expression <bch:expr> <option_1(100)>
$ min <bch:tm> 3.36µs
$ median <bch:tm> 3.88µs
$ `itr/sec` <dbl> 251542.5
$ mem_alloc <bch:byt> 848B
$ `gc/sec` <dbl> 0
$ n_itr <int> 10000
$ n_gc <dbl> 0
$ total_time <bch:tm> 39.8ms
$ result <list> <0.23496364, -0.09810101, 0.53894219, -0.03084733, -1.…
$ memory <list> [<Rprofmem[1 x 3]>]
$ time <list> <7.49µs, 5.26µs, 4.66µs, 4.11µs, 4.59µs, 3.64µs, 3.73µs,…
$ gc <list> [<tbl_df[10000 x 3]>]
The format your data is stored in impacts both read speed and file size:
Generate some data:
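The generation code isn't shown on the slide; a hypothetical sketch, where random_nums is pure noise and structured is highly repetitive (column names and sizes are assumptions):

# Hypothetical data: random_nums compresses badly, structured compresses well.
set.seed(2024)
random_nums <- tibble::tibble(x = rnorm(1e6),
                              y = rnorm(1e6),
                              z = rnorm(1e6))
structured <- tibble::tibble(id    = rep(1:1000, each = 1000),
                             group = rep(c("a", "b", "c", "d"), length.out = 1e6),
                             value = rep(c(0, 1), length.out = 1e6))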
Save it to disk:
readr::write_csv(random_nums, "random_nums.csv")
readr::write_rds(random_nums, "random_nums.rds")
openxlsx::write.xlsx(random_nums, "random_nums.xlsx")
arrow::write_parquet(random_nums, "random_nums.parquet")
readr::write_csv(structured, "structured.csv")
readr::write_rds(structured, "structured.rds")
openxlsx::write.xlsx(structured, "structured.xlsx")
arrow::write_parquet(structured, "structured.parquet")
Reading random_nums:
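The benchmark call itself isn't shown; presumably something like the following, repeated for the structured files:

bench::mark(readr   = readr::read_csv("random_nums.csv", show_col_types = FALSE),
            readxl  = readxl::read_xlsx("random_nums.xlsx"),
            rds     = readr::read_rds("random_nums.rds"),
            parquet = arrow::read_parquet("random_nums.parquet"),
            check = FALSE)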
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 readr 155.62ms 160.2ms 6.24 24.2MB 3.12
2 readxl 1.46s 1.46s 0.685 183.75MB 0
3 rds 15.38ms 16.7ms 58.7 22.89MB 4.89
4 parquet 17.55ms 18.1ms 54.7 7.96KB 0
Reading structured:
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 readr 214.36ms 214.55ms 4.66 24.22MB 0
2 readxl 1.13s 1.13s 0.888 138.15MB 1.78
3 rds 229.61ms 231.33ms 4.15 22.89MB 0
4 parquet 17.03ms 18.06ms 48.5 7.96KB 0
tibble::tibble(names = fs::dir_ls(glob = "random_nums*"),
               size = fs::file_size(names)) |>
  arrange(desc(size)) |>
  mutate(ratio = as.numeric(size / min(size)))
# A tibble: 4 × 3
names size ratio
<fs::path> <fs::bytes> <dbl>
1 random_nums.csv 55.6M 2.46
2 random_nums.xlsx 38.1M 1.68
3 random_nums.rds 22.9M 1.01
4 random_nums.parquet 22.6M 1
tibble::tibble(names = fs::dir_ls(glob = "structured*"),
               size = fs::file_size(names)) |>
  arrange(desc(size)) |>
  mutate(ratio = as.numeric(size / min(size)))
# A tibble: 4 × 3
names size ratio
<fs::path> <fs::bytes> <dbl>
1 structured.rds 25.75M 1080.
2 structured.xlsx 10.32M 433.
3 structured.csv 5.72M 240.
4 structured.parquet 24.42K 1
Add two vectors, low-level style:
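The slide's definition isn't shown; a plausible version of the loop-based approach:

low_level_vec_add <- function(x, y) {
  # pre-allocate, then add element by element
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- x[i] + y[i]
  }
  out
}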
Add two vectors, proper R style:
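In idiomatic R, `+` is already vectorised, so no function is needed (vec1 and vec2 are assumed to be two equal-length numeric vectors):

vec1 <- rnorm(10)
vec2 <- rnorm(10)
vec1 + vec2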
How fast?
bench::mark(
  slow = low_level_vec_add(vec1, vec2),
  fast = vec1 + vec2
) |>
  dplyr::mutate(ratio = as.numeric(median / min(median)))
# A tibble: 2 × 7
expression min median `itr/sec` mem_alloc `gc/sec` ratio
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <dbl>
1 slow 950.1ns 1.07µs 890277. 0B 0 11.1
2 fast 85.1ns 96.97ns 8102111. 0B 0 1
Vectorisation is contagious:
There’s still a loop.
It’s just not in R (it’s in C).
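An illustration (not from the slides): compose vectorised operations and the result is itself vectorised, with no loop written in R.

rescale01 <- function(x) {
  # every operation here is vectorised, so rescale01() is too
  (x - min(x)) / (max(x) - min(x))
}
rescale01(c(2, 4, 6, 10))
# [1] 0.00 0.25 0.50 1.00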
Scenario: simulate claims. For each of n_sims simulations, draw a Poisson-distributed number of claims, each with a normally distributed size.
Option 1: Generate each simulation individually:
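A hypothetical sketch of the one-at-a-time approach (the slide's code isn't shown):

option1 <- function(n_sims, mean_count, mean_size, sd_size) {
  lapply(seq_len(n_sims), function(i) {
    count <- rpois(1, mean_count)     # claims in this simulation
    sizes <- numeric(count)
    for (j in seq_len(count)) {
      sizes[j] <- rnorm(1, mean_size, sd_size)  # one draw at a time
    }
    sizes
  })
}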
Option 2: Generate counts first, then amounts:
option2 <- function(n_sims, mean_count, mean_size, sd_size) {
  counts <- rpois(n_sims, mean_count)
  sizes <- rnorm(sum(counts), mean_size, sd_size)
  split(sizes, rep(seq_along(counts), counts))
}
set.seed(10)
option2(2, 4, 10, 1)
$`1`
[1] 9.815747 8.628669 9.400832 10.294545
$`2`
[1] 10.389794 8.791924 9.636324
{data.table}
Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development.
{data.table}
DT[i, j, by]
## R: i j by
## SQL: where | order by select | update group by
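A toy illustration (not from the slides) of the three slots:

library(data.table)
dt <- as.data.table(mtcars)
dt[cyl > 4,                      # i: which rows
   .(mean_mpg = mean(mpg)),      # j: what to compute
   by = gear]                    # by: grouped how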
{data.table}
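Assumed setup for the benchmark below (not shown on the slide):

library(data.table)
library(dplyr)
iris_dt <- as.data.table(iris)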
bench::mark(datatable = iris_dt[Petal.Length > 2,
                                .(mean_petal_length = mean(Petal.Length)),
                                Species],
            tidyverse = iris |>
              filter(Petal.Length > 2) |>
              group_by(Species) |>
              summarise(mean_petal_length = mean(Petal.Length)),
            check = FALSE) |>
  mutate(ratio = as.numeric(median / min(median)))
# A tibble: 2 × 7
expression min median `itr/sec` mem_alloc `gc/sec` ratio
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <dbl>
1 datatable 386.33µs 431.79µs 2218. 71.1KB 18.4 1
2 tidyverse 2.05ms 2.16ms 456. 13.7KB 6.22 5.00
\[f(n) = \begin{cases} n/2 &\text{if $n$ is even}\\ 3n+1 &\text{if $n$ is odd} \end{cases}\]
The Collatz conjecture is that whatever number you start with, you eventually end up at one.
We write a function that starts with a number, runs the process to get to one, and outputs how many steps were taken.
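Neither function's source is on the slide; a minimal sketch, with the Rust version compiled from R via rextendr::rust_function() (an assumption about the mechanism used):

# Plain R: count the steps until n reaches 1.
collatz_r <- function(n) {
  steps <- 0L
  while (n != 1) {
    n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
    steps <- steps + 1L
  }
  steps
}

# Rust counterpart, compiled and bound to an R function of the same name.
rextendr::rust_function('
  fn collatz_rust(mut n: i32) -> i32 {
      let mut steps = 0;
      while n != 1 {
          n = if n % 2 == 0 { n / 2 } else { 3 * n + 1 };
          steps += 1;
      }
      steps
  }
')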
bench::mark(r = collatz_r(100001),
            rust = collatz_rust(100001)) |>
  mutate(ratio = as.numeric(median / min(median)))
# A tibble: 2 × 7
expression min median `itr/sec` mem_alloc `gc/sec` ratio
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <dbl>
1 r 10.33µs 12.68µs 75828. 0B 15.2 3.91
2 rust 2.86µs 3.24µs 302302. 5.23KB 0 1
Declare your process, {targets} handles the running
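A hypothetical sketch of the _targets.R behind the logs below; run_model() is a stand-in for whatever modelling function the talk used:

# _targets.R (sketch; names beyond those in the log are assumptions)
library(targets)

list(
  tar_target(scenarios, 1:10),                          # e.g. ten scenario ids
  tar_target(modelled_scenarios, run_model(scenarios))  # run_model(): hypothetical
)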
▶ dispatched target scenarios
● completed target scenarios [0 seconds, 99 bytes]
▶ dispatched target modelled_scenarios
● completed target modelled_scenarios [10.021 seconds, 51 bytes]
▶ ended pipeline [10.075 seconds]
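Switching modelled_scenarios to dynamic branching (pattern = map(scenarios)) and, presumably, parallel workers makes each scenario its own branch: the unchanged scenarios target is skipped on the re-run, and the ten one-second models finish in under four seconds.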
✔ skipped target scenarios
▶ dispatched branch modelled_scenarios_a452af4f4be44c66
▶ dispatched branch modelled_scenarios_54a41295e531e748
▶ dispatched branch modelled_scenarios_c88a34236a62708d
▶ dispatched branch modelled_scenarios_208f10494b4fe964
▶ dispatched branch modelled_scenarios_5ae819337c10b34c
▶ dispatched branch modelled_scenarios_620a7e1663f3d387
▶ dispatched branch modelled_scenarios_296c088a3c754be1
▶ dispatched branch modelled_scenarios_30d5113339f1a65f
▶ dispatched branch modelled_scenarios_7a71c0e0c90cacd9
▶ dispatched branch modelled_scenarios_60774606f380f97b
● completed branch modelled_scenarios_a452af4f4be44c66 [1.01 seconds, 44 bytes]
● completed branch modelled_scenarios_54a41295e531e748 [1.009 seconds, 44 bytes]
● completed branch modelled_scenarios_c88a34236a62708d [1.011 seconds, 44 bytes]
● completed branch modelled_scenarios_208f10494b4fe964 [1.01 seconds, 44 bytes]
● completed branch modelled_scenarios_620a7e1663f3d387 [1.006 seconds, 44 bytes]
● completed branch modelled_scenarios_5ae819337c10b34c [1.01 seconds, 44 bytes]
● completed branch modelled_scenarios_296c088a3c754be1 [1.006 seconds, 44 bytes]
● completed branch modelled_scenarios_30d5113339f1a65f [1.006 seconds, 44 bytes]
● completed branch modelled_scenarios_7a71c0e0c90cacd9 [1.006 seconds, 44 bytes]
● completed branch modelled_scenarios_60774606f380f97b [1.006 seconds, 44 bytes]
● completed pattern modelled_scenarios
▶ ended pipeline [3.732 seconds]
Idiomatic R is faster: vectorisation over loops
Think about your data structures
Experiment and test (e.g. with {bench})
You can drop down to a faster language (but that might not solve your problems)