Going FastR

Michael Jones

2024-11-22

Some words of Wisdom

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

Donald Knuth

Make it work, make it right, make it fast

Kent Beck

xkcd 1205 - Is it worth the time?

But…

  • It’s fun
  • If you never try to optimise, you’ll never get used to writing fast code.
  • Faster analysis/code means you can answer more questions

Measuring Performance

{bench} package (part of r-lib)

my_function <- function(x) {
  x + 1
}

{bench} package

my_function <- function(x) {
  x + 1
}

bench::mark(my_function(10))
# A tibble: 1 × 6
  expression           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 my_function(10)    192ns    212ns  2907725.        0B        0

{bench} package

option_1 <- function(n) {
  rnorm(n)
}

option_2 <- function(n) {
  my_vec <- c()
  for (i in 1:n) {
    my_vec <- c(my_vec, rnorm(1))
  }
}

bench::mark(op1 = option_1(100),
            op2 = option_2(100),
            check = FALSE)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 op1          3.29µs   3.89µs   240081.    5.73KB      0  
2 op2         80.37µs   86.1µs    11053.   64.64KB     10.7

{bench} package

bench::mark(option_1(100)) |>
  dplyr::glimpse()
Rows: 1
Columns: 13
$ expression <bch:expr> <option_1(100)>
$ min        <bch:tm> 3.36µs
$ median     <bch:tm> 3.88µs
$ `itr/sec`  <dbl> 251542.5
$ mem_alloc  <bch:byt> 848B
$ `gc/sec`   <dbl> 0
$ n_itr      <int> 10000
$ n_gc       <dbl> 0
$ total_time <bch:tm> 39.8ms
$ result     <list> <0.23496364, -0.09810101, 0.53894219, -0.03084733, -1.…
$ memory     <list> [<Rprofmem[1 x 3]>]
$ time       <list> <7.49µs, 5.26µs, 4.66µs, 4.11µs, 4.59µs, 3.64µs, 3.73µs,…
$ gc         <list> [<tbl_df[10000 x 3]>]

Data vs Code

Data

Format of Data impacts:

  • Storage
  • Read/Write Speed
  • Analytic Speed
  • Interoperability

Data - Options: Files

  • Excel
  • CSV
  • RDS/RDA
  • Parquet

Data - Options: Files

Generate some data:

n_draws <- 1000000

random_nums <- tibble(
  norm = rnorm(n_draws),
  unif = runif(n_draws),
  gamma = rgamma(n_draws, shape = 1, rate = 2))

structured <- tibble(
  no_variation = rep("A", n_draws),
  slow_variation = rep(LETTERS[1:4], each = n_draws / 4),
  rapid_variation = rep(LETTERS[1:10], n_draws / 10))

Data - Options: Files

Save it to disk

readr::write_csv(random_nums, "random_nums.csv")
readr::write_rds(random_nums, "random_nums.rds")
openxlsx::write.xlsx(random_nums, "random_nums.xlsx")
arrow::write_parquet(random_nums, "random_nums.parquet")

readr::write_csv(structured, "structured.csv")
readr::write_rds(structured, "structured.rds")
openxlsx::write.xlsx(structured, "structured.xlsx")
arrow::write_parquet(structured, "structured.parquet")

Data - Options: Files

Reading random_data:

bench::mark(
  readr = readr::read_csv("random_nums.csv"),
  readxl = readxl::read_xlsx("random_nums.xlsx"),
  rds = readr::read_rds("random_nums.rds"),
  parquet = arrow::read_parquet("random_nums.parquet"),
  check = FALSE,
)
# A tibble: 4 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 readr      155.62ms  160.2ms     6.24     24.2MB     3.12
2 readxl        1.46s    1.46s     0.685  183.75MB     0   
3 rds         15.38ms   16.7ms    58.7     22.89MB     4.89
4 parquet     17.55ms   18.1ms    54.7      7.96KB     0   

Data - Options: Files

Reading structured:

bench::mark(
  readr = readr::read_csv("structured.csv"),
  readxl = readxl::read_xlsx("structured.xlsx"),
  rds = readr::read_rds("structured.rds"),
  parquet = arrow::read_parquet("structured.parquet"),
  check = FALSE
)
# A tibble: 4 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 readr      214.36ms 214.55ms     4.66    24.22MB     0   
2 readxl        1.13s    1.13s     0.888  138.15MB     1.78
3 rds        229.61ms 231.33ms     4.15    22.89MB     0   
4 parquet     17.03ms  18.06ms    48.5      7.96KB     0   

Parquet aside: Two ways to open

conventional <- function() {
  arrow::read_parquet("structured.parquet") |>
    dplyr::group_by(slow_variation) |>
    summarise(n = n())
}

parquet_style <- function() {
  arrow::read_parquet("structured.parquet",
       as_data_frame = FALSE) |>
    dplyr::group_by(slow_variation) |>
    dplyr::summarise(n = n())
}

Parquet aside: Two ways to open

bench::mark(
  conventional(),
  parquet_style(),
  check = FALSE
)
# A tibble: 2 × 6
  expression           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 conventional()    38.4ms   38.7ms      25.5    40.1MB    20.4 
2 parquet_style()     28ms   30.4ms      32.4     5.5MB     7.47

Parquet aside: Two ways to open

arrow::read_parquet("structured.parquet",
  as_data_frame = FALSE)
Table
1000000 rows x 3 columns
$no_variation <string>
$slow_variation <string>
$rapid_variation <string>

Relative file size

tibble::tibble(names = fs::dir_ls(glob = "random_nums*"), 
  size = fs::file_size(names)) |>
  arrange(desc(size)) |>
  mutate(ratio = as.numeric(size / min(size)))
# A tibble: 4 × 3
  names                      size ratio
  <fs::path>          <fs::bytes> <dbl>
1 random_nums.csv           55.6M  2.46
2 random_nums.xlsx          38.1M  1.68
3 random_nums.rds           22.9M  1.01
4 random_nums.parquet       22.6M  1   

Relative file size

tibble::tibble(names = fs::dir_ls(glob = "structured*"),
  size = fs::file_size(names)) |>
  arrange(desc(size)) |>
  mutate(ratio = as.numeric(size / min(size)))
# A tibble: 4 × 3
  names                     size ratio
  <fs::path>         <fs::bytes> <dbl>
1 structured.rds          25.75M 1080.
2 structured.xlsx         10.32M  433.
3 structured.csv           5.72M  240.
4 structured.parquet      24.42K    1 

Data: Databases

  • File: SQLite, duckdb
  • Server: PostgreSQL, MS SQL Server
  • R Packages: {DBI}, {dbplyr}, {pool}, {dm}

Code

R is vectorised

Add two vectors, low-level style:

vec1 <- c(1,2,3)
vec2 <- c(6,5,4)

low_level_vec_add <- function(v1, v2) {
  new_vec <- c()
  for (i in seq_along(v1)) {
    new_vec <- c(new_vec, v1[i] + v2[i])
  }
  new_vec
}

low_level_vec_add(vec1, vec2)
[1] 7 7 7

R is vectorised

Add two vectors, proper R style:

vec1 + vec2
[1] 7 7 7

R is vectorised

How fast?

bench::mark(
  slow = low_level_vec_add(vec1, vec2),
  fast = vec1 + vec2
) |> dplyr::mutate(ratio = as.numeric(median / min(median)))
# A tibble: 2 × 7
  expression      min   median `itr/sec` mem_alloc `gc/sec` ratio
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <dbl>
1 slow        950.1ns   1.07µs   890277.        0B        0  11.1
2 fast         85.1ns  96.97ns  8102111.        0B        0   1  

R is vectorised, but is my code?

Vectorisation is contageous:

my_func <- function(x, y, threshold) {
  (x > threshold) * y
}

my_func(x = 4, y = 10, threshold = 3)
[1] 10
my_func(x = c(1, 2, 3), y = c(6,7,8), threshold = 2)
[1] 0 0 8
my_func(x = c(1, 2, 3), y = c(6,7,8), threshold = c(10, 1, 1))
[1] 0 7 8

How does vectorisation work?

  • There’s still a loop.

  • It’s just not in R (it’s in C).

Generating Random Numbers

Scenario:

  • \(N \sim Poi(\lambda)\) events
  • Each resulting in \(X_n \sim N(\mu, \sigma^2)\) value

Generating Random Numbers

Scenario:

  • \(N \sim Poi(\lambda)\) claims
  • Each resulting in \(X_n \sim N(\mu, \sigma^2)\) loss

Generating Random Numbers

Scenario:

  • \(N \sim Poi(\lambda)\) cumstomers
  • Each resulting in \(X_n \sim N(\mu, \sigma^2)\) purchases

Generating Random Numbers

Option 1: Generate Individually:

option1 <- function(n_sims, mean_count, mean_size, sd_size) {
  counts <- rpois(n_sims, mean_count)
  lapply(counts, \(x) rnorm(x, mean_size, sd_size))
}

set.seed(10)
option1(2, 4, 10, 1)
[[1]]
[1]  9.815747  8.628669  9.400832 10.294545

[[2]]
[1] 10.389794  8.791924  9.636324

Generating Random Numbers

Option 2: Generate counts first, then amounts:

option2 <- function(n_sims, mean_count, mean_size, sd_size) {
  counts <- rpois(n_sims, mean_count)
  sizes <- rnorm(sum(counts), mean_size, sd_size)
  split(sizes, rep(seq_along(counts), counts))
}

set.seed(10)
option2(2, 4, 10, 1)
$`1`
[1]  9.815747  8.628669  9.400832 10.294545

$`2`
[1] 10.389794  8.791924  9.636324

Which is faster?

bench::mark(
  opt1 = option1(1000, 30, 100, 10),
  opt2 = option2(1000, 30, 100, 10),
  check = FALSE
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 opt1         2.15ms   2.25ms      436.  310.98KB     13.1
2 opt2         1.58ms   1.63ms      605.    1.28MB     13.1

Which is faster (more numbers)?

{data.table}

Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development.

data.table’s CRAN page

{data.table}

fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development.

data.table’s CRAN page

{data.table}

  • Everything tidyverse can do
  • Faster
  • Syntax closer to base:
DT[i, j, by]

##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by

{data.table}

## DT[i, j, by]
##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by

library(data.table)

iris_dt <- data.table(iris)

iris_dt[Petal.Length > 2,
        .(mean_petal_length = mean(Petal.Length)),
        Species]
      Species mean_petal_length
       <fctr>             <num>
1: versicolor             4.260
2:  virginica             5.552

{data.table}

iris_dt[Petal.Length > 2,
        .(mean_petal_length = mean(Petal.Length)),
        Species]
      Species mean_petal_length
       <fctr>             <num>
1: versicolor             4.260
2:  virginica             5.552
iris |>
  filter(Petal.Length > 2) |>
  group_by(Species) |>
  summarise(mean_petal_length = mean(Petal.Length))
# A tibble: 2 × 2
  Species    mean_petal_length
  <fct>                  <dbl>
1 versicolor              4.26
2 virginica               5.55

{data.table}

bench::mark(datatable = iris_dt[Petal.Length > 2, .(mean_petal_length = mean(Petal.Length)), Species],
            tidyverse = iris |> filter(Petal.Length > 2) |> group_by(Species) |> summarise(mean_petal_length = mean(Petal.Length)),
            check = FALSE) |>
    mutate(ratio = as.numeric(median / min(median)))
# A tibble: 2 × 7
  expression      min   median `itr/sec` mem_alloc `gc/sec` ratio
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <dbl>
1 datatable  386.33µs 431.79µs     2218.    71.1KB    18.4   1   
2 tidyverse    2.05ms   2.16ms      456.    13.7KB     6.22  5.00

Other Code Style Points

  • Don’t abstract too much
  • Developer time is as important as computer run time

Other languages

  • C/C++ with {RCPP}
  • Julia?
  • Rust with {rextendr}

Example: Collatz Conjecture

\[f(n) = \begin{cases} n/2 &\text{if $n$ is even}\\ 3n+1 &\text{if $n$ is odd} \end{cases}\]

The conjecture is that whatever number you start with, you end up at one.

We write a function that starts with a number, runs the process to get to one, and outputs how many steps were taken.

Collatz Conjecture in R

collatz_r <- function(n) {
  counter <- 0
  while (n != 1) {
    if (n %% 2 == 0) {
      n <- n / 2
    } else {
      n <- 3 * n + 1
    }
    counter <- counter + 1
  }
  counter
}

collatz_r(100)
[1] 25

Collatz Conjecture in Rust

library(rextendr)

rust_function(
  "fn collatz_rust(n: i64) -> i64 {
    let mut result = n;
    let mut counter = 0;
    while result != 1 {
        if result % 2 == 0 {
            result = result / 2;
        } else {
            result = 3 * result + 1;
        }
        counter += 1;
    }
    counter
}")

collatz_rust(100)
[1] 25

Collatz Conjecture in Rust

fn collatz_rust(n: i64) -> i64 {
  let mut result = n;
  let mut counter = 0;
  while result != 1 {
      if result % 2 == 0 {
          result = result / 2;
      } else {
          result = 3 * result + 1;
      }
      counter += 1;
  }
  counter
}

Collatz Conjecture: benchmarking:

bench::mark(r = collatz_r(100001),
            rust = collatz_rust(100001)) |>
  mutate(ratio = as.numeric(median / min(median)))
# A tibble: 2 × 7
  expression      min   median `itr/sec` mem_alloc `gc/sec` ratio
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <dbl>
1 r           10.33µs  12.68µs    75828.        0B     15.2  3.91
2 rust         2.86µs   3.24µs   302302.    5.23KB      0    1   

Honorable Mentions: Memoisation

  • {memoise} package
  • Build a lookup table for each function of arguments seen before, if you see them again, shortcut to the answer

Honorable Mentions: Memoisation

library(memoise)

library(memoise)
f <- function(x) {
  Sys.sleep(1)
  mean(x)
}
mf <- memoise(f)


system.time(mf(1:10))
   user  system elapsed 
  0.001   0.000   1.003 
system.time(mf(1:10))
   user  system elapsed 
  0.011   0.000   0.011 

Honorable Mentions: Parallelism

{targets}

Declare your process, {targets} handles the running

library(targets)

list(
  tar_target(scenarios, get_scenarios()),
  tar_target(
    modelled_scenarios,
    purrr::map(scenarios, do_single_scenario)
  )
)

{targets}

▶ dispatched target scenarios
● completed target scenarios [0 seconds, 99 bytes]
▶ dispatched target modelled_scenarios
● completed target modelled_scenarios [10.021 seconds, 51 bytes]
▶ ended pipeline [10.075 seconds]

{targets} - with {crew}

library(targets)

controller <- crew::crew_controller_local(
  name = "my_controller",
  workers = 10,
  seconds_idle = 3
)
tar_option_set(controller = controller)
list(
  tar_target(scenarios, get_scenarios()),
  tar_target(
    modelled_scenarios,
    do_single_scenario(scenarios), pattern = map(scenarios)
  )
)

{targets} - with {crew}

✔ skipped target scenarios
▶ dispatched branch modelled_scenarios_a452af4f4be44c66
▶ dispatched branch modelled_scenarios_54a41295e531e748
▶ dispatched branch modelled_scenarios_c88a34236a62708d
▶ dispatched branch modelled_scenarios_208f10494b4fe964
▶ dispatched branch modelled_scenarios_5ae819337c10b34c
▶ dispatched branch modelled_scenarios_620a7e1663f3d387
▶ dispatched branch modelled_scenarios_296c088a3c754be1
▶ dispatched branch modelled_scenarios_30d5113339f1a65f
▶ dispatched branch modelled_scenarios_7a71c0e0c90cacd9
▶ dispatched branch modelled_scenarios_60774606f380f97b
● completed branch modelled_scenarios_a452af4f4be44c66 [1.01 seconds, 44 bytes]
● completed branch modelled_scenarios_54a41295e531e748 [1.009 seconds, 44 bytes]
● completed branch modelled_scenarios_c88a34236a62708d [1.011 seconds, 44 bytes]
● completed branch modelled_scenarios_208f10494b4fe964 [1.01 seconds, 44 bytes]
● completed branch modelled_scenarios_620a7e1663f3d387 [1.006 seconds, 44 bytes]
● completed branch modelled_scenarios_5ae819337c10b34c [1.01 seconds, 44 bytes]
● completed branch modelled_scenarios_296c088a3c754be1 [1.006 seconds, 44 bytes]
● completed branch modelled_scenarios_30d5113339f1a65f [1.006 seconds, 44 bytes]
● completed branch modelled_scenarios_7a71c0e0c90cacd9 [1.006 seconds, 44 bytes]
● completed branch modelled_scenarios_60774606f380f97b [1.006 seconds, 44 bytes]
● completed pattern modelled_scenarios 
▶ ended pipeline [3.732 seconds]

Summary

Idiomatic R is faster: vectorisation over loops

Think about your data structures

Experiment and test (e.g. with {bench})

You can drop down to a faster language (but that might not solve your problems)