R-Mosaic: Is there a mean.n function?

Is there a mean.n function (as in SPSS) in mosaic in R?
I have 3 columns of data (including "NA") and I want a new column holding the mean of the 3 data points in each row. How do I do that?

rowMeans might be just what you are looking for. It returns the row-wise mean; just make sure to select/subset the right columns.
Here is an example:
# Load packages
library(dplyr)
# Example data
ex_data = data.frame(A = rnorm(10), B = rnorm(10)*2, C = rnorm(10)*5)
ex_data
#> A B C
#> 1 0.2838024 -1.8784902 -2.7519131
#> 2 -0.4090575 1.6457548 6.1643390
#> 3 0.2061454 0.2103105 7.2798434
#> 4 -1.5246471 -0.6071042 -7.2411695
#> 5 -1.0461921 -2.6290405 -1.3840000
#> 6 -1.4802151 1.9323571 5.8539328
#> 7 0.1827485 0.1608848 -0.5157152
#> 8 -0.3006229 2.8650122 -1.4393171
#> 9 2.2981543 -0.2790727 2.6193970
#> 10 1.0495951 -0.9061784 -4.4013859
# Use rowMeans
ex_data$abc_means = rowMeans(x = ex_data[1:3])
ex_data
#> A B C abc_means
#> 1 0.2838024 -1.8784902 -2.7519131 -1.44886698
#> 2 -0.4090575 1.6457548 6.1643390 2.46701208
#> 3 0.2061454 0.2103105 7.2798434 2.56543308
#> 4 -1.5246471 -0.6071042 -7.2411695 -3.12430691
#> 5 -1.0461921 -2.6290405 -1.3840000 -1.68641084
#> 6 -1.4802151 1.9323571 5.8539328 2.10202491
#> 7 0.1827485 0.1608848 -0.5157152 -0.05736064
#> 8 -0.3006229 2.8650122 -1.4393171 0.37502404
#> 9 2.2981543 -0.2790727 2.6193970 1.54615953
#> 10 1.0495951 -0.9061784 -4.4013859 -1.41932305
You mentioned that you have NAs in your data, so make sure to include na.rm = TRUE if appropriate.
Created on 2021-04-02 by the reprex package (v0.3.0)
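To make the na.rm point concrete, here is a minimal sketch in base R. The data frame scores and the column names mean_any and mean_2 are made up for illustration. The second step mimics what I understand SPSS's MEAN.2 to do (return a mean only when at least 2 values are valid); treat that as an assumption about SPSS rather than a reference implementation.

```r
# Hypothetical data with missing values
scores <- data.frame(A = c(1, 2, NA, NA),
                     B = c(3, NA, NA, 4),
                     C = c(5, 6, NA, NA))
cols <- c("A", "B", "C")

# Row-wise mean over the non-missing values only
scores$mean_any <- rowMeans(scores[cols], na.rm = TRUE)

# MEAN.2-style rule: keep the mean only if at least 2 values are present
n_valid <- rowSums(!is.na(scores[cols]))
scores$mean_2 <- ifelse(n_valid >= 2, scores$mean_any, NA)

scores
```

Note that a row consisting only of NAs gives NaN with na.rm = TRUE, which is one reason to add an explicit minimum-count rule.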

Related

How to discretize a numeric column and summarize by it, with boundaries that don't overlap (equivalent to Google Sheets' "Pivot Group Rule")?

I'm trying to find the R procedure that is equivalent to Google Sheets' Pivot Group Rule. That is, I want to summarize that data by discretizing a numerical column with a fixed interval size that I decide on.
I am almost getting the desired output, but I am having trouble with the "(a,b]" interval notation.
Example
df <-
data.frame(
num_col = c(1400,9000,15000,17350,20000,22000,
25000,40000,42000,45000,50000,60000,65000,70000,75000,
1e+05,120000,125000,150000,168000,180000,2e+05,225000,
250000,270000,290000,3e+05,350000,4e+05,427000,450000,5e+05,
550000,560000,6e+05,625000,650000,7e+05,750000,8e+05,
850000,9e+05,913000,930000,950000,990000,1e+06,1066167,
1100000,1200000,1250000,1300000,1400000,1420000,1500000,
1700000,1750000,1800000,1900000,1950000,2e+06,2100000,
2300000,2400000,2450000,2500000,3e+06,3150000,3200000,
3300000,3400000,3440000,3500000,3660000,3800000,3850000,
4e+06,4400000,4500000,4600000,4700000,4800000,4900000,5e+06,
5500000,6e+06,6400000,6500000,6600000,6800000,6900000,
7e+06,7200000,7217600,7400000,7500000,7700000,8e+06,
8200000,8495000,8500000,8700000,8900000,9e+06,9200000,9500000,
9600000,1e+07,10500000,10818775,1.1e+07,11500000,
1.2e+07,12500000,12620000,1.3e+07,13200000,13400000,13500000,
1.4e+07,14500000,14800000,1.5e+07,1.6e+07,1.7e+07,17500000,
1.8e+07,18026148,18500000,1.9e+07,19500000,19800000,
19900000,2e+07,2.1e+07,2.2e+07,22500000,2.3e+07,2.4e+07,
2.5e+07,25500000,2.6e+07,2.7e+07,27220000,2.8e+07,2.9e+07,
3e+07,30300000,3.1e+07,31500000,3.2e+07,32500000,3.3e+07,
3.4e+07,3.5e+07,3.6e+07,3.7e+07,3.8e+07,38600000,3.9e+07,
39200000,4e+07,4.1e+07,4.2e+07,4.3e+07,4.4e+07,44500000,
4.5e+07,4.6e+07,4.7e+07,4.8e+07,4.9e+07,49900000,5e+07,
50100000,50200000,5.2e+07,5.3e+07,5.5e+07,5.6e+07,5.7e+07,
5.8e+07,58800000,6e+07,6.1e+07,6.3e+07,6.5e+07,6.6e+07,
6.8e+07,68005000,6.9e+07,7e+07,7.3e+07,7.4e+07,7.5e+07,
7.6e+07,7.8e+07,7.9e+07,8e+07,81200000,8.2e+07,8.4e+07,
8.5e+07,8.8e+07,9e+07,9.2e+07,9.3e+07,9.4e+07,9.5e+07,
9.9e+07,1e+08,1.02e+08,1.03e+08,1.05e+08,1.08e+08,1.1e+08,
1.12e+08,1.15e+08,1.17e+08,1.2e+08,1.25e+08,1.27e+08,
1.3e+08,1.32e+08,1.35e+08,1.4e+08,1.44e+08,1.45e+08,1.5e+08,
1.55e+08,1.6e+08,1.65e+08,1.7e+08,1.75e+08,1.76e+08,
1.78e+08,1.8e+08,1.85e+08,1.9e+08,1.95e+08,2e+08,2.09e+08,
2.1e+08,2.15e+08,2.2e+08,2.25e+08,2.3e+08,2.45e+08,2.5e+08,
2.6e+08,263700000,6e+08),
val = c(1,1,1,1,2,1,1,1,1,1,4,3,1,2,2,
8,1,4,4,1,1,7,1,11,1,1,6,2,2,1,3,21,1,1,3,
1,3,1,3,1,1,3,1,1,2,1,24,1,6,8,1,3,2,1,13,
1,1,4,1,1,22,3,1,1,1,13,27,1,2,3,2,1,12,1,1,
1,20,2,3,1,2,1,1,44,2,12,1,4,1,1,1,21,1,1,1,
3,1,15,1,1,5,1,1,8,1,2,1,43,1,1,11,1,24,2,
1,15,1,1,2,8,1,1,34,9,16,1,15,1,1,6,1,1,1,55,
3,11,1,4,5,40,1,9,3,1,14,3,38,1,3,1,7,1,2,
3,34,5,6,6,1,1,1,38,1,6,1,3,1,8,1,1,1,1,1,
25,1,1,3,1,11,1,1,5,1,18,4,1,12,2,4,1,2,11,1,
2,9,1,2,2,14,1,1,1,5,1,9,2,1,1,5,1,16,1,1,
3,1,8,1,2,1,8,7,1,8,1,8,4,1,6,14,2,4,6,8,4,
1,2,3,2,5,2,12,1,1,2,1,3,1,2,6,1,1,1)
)
look at the data
tibble::as_tibble(df)
#> # A tibble: 252 x 2
#> num_col val
#> <dbl> <dbl>
#> 1 1400 1
#> 2 9000 1
#> 3 15000 1
#> 4 17350 1
#> 5 20000 2
#> 6 22000 1
#> 7 25000 1
#> 8 40000 1
#> 9 42000 1
#> 10 45000 1
#> # ... with 242 more rows
desired output
desired_output <-
tibble::tribble(
~num_col_interval, ~val_sum,
"0 - 49999999", 962L,
"50000000 - 99999999", 164L,
"100000000 - 149999999", 78L,
"150000000 - 199999999", 53L,
"200000000 - 249999999", 23L,
"250000000 - 299999999", 8L,
"600000000 - 649999999", 1L
)
My attempt
library(dplyr)
library(ggplot2)
df |>
group_by(num_col_interval = ggplot2::cut_interval(num_col, length = 50000000 - 1, dig.lab = 10)) |>
summarise(across(val, sum))
#> # A tibble: 7 x 2
#> num_col_interval val
#> <fct> <dbl>
#> 1 [0,49999999] 962
#> 2 (49999999,99999998] 164
#> 3 (99999998,149999997] 78
#> 4 (149999997,199999996] 53
#> 5 (199999996,249999995] 23
#> 6 (249999995,299999994] 8
#> 7 (599999988,649999987] 1
You can see that the interval boundaries overlap. In the first row, it ranges 0 to 49999999, and in the second row, it ranges 49999999 to 99999998. I do understand the difference between ] and ( in the breaks notation. Nevertheless, I wish the ranges in the num_col_interval column to be as in desired_output.
How can I programmatically format the num_col_interval values to be as in desired_output?
I'm mostly looking for a straightforward dplyr solution.
Here's how I would do it with Google Sheets, getting the desired output:
Several SO posts are relevant, but none of them answered my question:
How does cut with breaks work in R
Cut function in R - exclusive or am I double counting?
Cut by Defined Interval
Try this:
library(tidyverse)
df <-
data.frame(
num_col = c(1400,9000,15000,17350,20000,22000,
25000,40000,42000,45000,50000,60000,65000,70000,75000,
1e+05,120000,125000,150000,168000,180000,2e+05,225000,
250000,270000,290000,3e+05,350000,4e+05,427000,450000,5e+05,
550000,560000,6e+05,625000,650000,7e+05,750000,8e+05,
850000,9e+05,913000,930000,950000,990000,1e+06,1066167,
1100000,1200000,1250000,1300000,1400000,1420000,1500000,
1700000,1750000,1800000,1900000,1950000,2e+06,2100000,
2300000,2400000,2450000,2500000,3e+06,3150000,3200000,
3300000,3400000,3440000,3500000,3660000,3800000,3850000,
4e+06,4400000,4500000,4600000,4700000,4800000,4900000,5e+06,
5500000,6e+06,6400000,6500000,6600000,6800000,6900000,
7e+06,7200000,7217600,7400000,7500000,7700000,8e+06,
8200000,8495000,8500000,8700000,8900000,9e+06,9200000,9500000,
9600000,1e+07,10500000,10818775,1.1e+07,11500000,
1.2e+07,12500000,12620000,1.3e+07,13200000,13400000,13500000,
1.4e+07,14500000,14800000,1.5e+07,1.6e+07,1.7e+07,17500000,
1.8e+07,18026148,18500000,1.9e+07,19500000,19800000,
19900000,2e+07,2.1e+07,2.2e+07,22500000,2.3e+07,2.4e+07,
2.5e+07,25500000,2.6e+07,2.7e+07,27220000,2.8e+07,2.9e+07,
3e+07,30300000,3.1e+07,31500000,3.2e+07,32500000,3.3e+07,
3.4e+07,3.5e+07,3.6e+07,3.7e+07,3.8e+07,38600000,3.9e+07,
39200000,4e+07,4.1e+07,4.2e+07,4.3e+07,4.4e+07,44500000,
4.5e+07,4.6e+07,4.7e+07,4.8e+07,4.9e+07,49900000,5e+07,
50100000,50200000,5.2e+07,5.3e+07,5.5e+07,5.6e+07,5.7e+07,
5.8e+07,58800000,6e+07,6.1e+07,6.3e+07,6.5e+07,6.6e+07,
6.8e+07,68005000,6.9e+07,7e+07,7.3e+07,7.4e+07,7.5e+07,
7.6e+07,7.8e+07,7.9e+07,8e+07,81200000,8.2e+07,8.4e+07,
8.5e+07,8.8e+07,9e+07,9.2e+07,9.3e+07,9.4e+07,9.5e+07,
9.9e+07,1e+08,1.02e+08,1.03e+08,1.05e+08,1.08e+08,1.1e+08,
1.12e+08,1.15e+08,1.17e+08,1.2e+08,1.25e+08,1.27e+08,
1.3e+08,1.32e+08,1.35e+08,1.4e+08,1.44e+08,1.45e+08,1.5e+08,
1.55e+08,1.6e+08,1.65e+08,1.7e+08,1.75e+08,1.76e+08,
1.78e+08,1.8e+08,1.85e+08,1.9e+08,1.95e+08,2e+08,2.09e+08,
2.1e+08,2.15e+08,2.2e+08,2.25e+08,2.3e+08,2.45e+08,2.5e+08,
2.6e+08,263700000,6e+08),
val = c(1,1,1,1,2,1,1,1,1,1,4,3,1,2,2,
8,1,4,4,1,1,7,1,11,1,1,6,2,2,1,3,21,1,1,3,
1,3,1,3,1,1,3,1,1,2,1,24,1,6,8,1,3,2,1,13,
1,1,4,1,1,22,3,1,1,1,13,27,1,2,3,2,1,12,1,1,
1,20,2,3,1,2,1,1,44,2,12,1,4,1,1,1,21,1,1,1,
3,1,15,1,1,5,1,1,8,1,2,1,43,1,1,11,1,24,2,
1,15,1,1,2,8,1,1,34,9,16,1,15,1,1,6,1,1,1,55,
3,11,1,4,5,40,1,9,3,1,14,3,38,1,3,1,7,1,2,
3,34,5,6,6,1,1,1,38,1,6,1,3,1,8,1,1,1,1,1,
25,1,1,3,1,11,1,1,5,1,18,4,1,12,2,4,1,2,11,1,
2,9,1,2,2,14,1,1,1,5,1,9,2,1,1,5,1,16,1,1,
3,1,8,1,2,1,8,7,1,8,1,8,4,1,6,14,2,4,6,8,4,
1,2,3,2,5,2,12,1,1,2,1,3,1,2,6,1,1,1)
)
df |>
group_by(num_col_interval = cut_width(num_col, width = 50000000,
dig.lab = 10, closed = "left", boundary = 0)) |>
summarise(across(val, sum)) |>
separate(num_col_interval, into = c("left", "right"), sep = ",") |>
mutate(across(-val, parse_number),
right = if_else(right < max(right), right - 1L, right),
across(-val, ~ format(., scientific = FALSE)),
val = as.integer(val)) |>
unite(num_col_interval, left:right, sep = " - ")
#> # A tibble: 7 × 2
#> num_col_interval val
#> <chr> <int>
#> 1 " 0 - 49999999" 962
#> 2 " 50000000 - 99999999" 164
#> 3 "100000000 - 149999999" 78
#> 4 "150000000 - 199999999" 53
#> 5 "200000000 - 249999999" 23
#> 6 "250000000 - 299999999" 8
#> 7 "550000000 - 600000000" 1
Created on 2022-12-18 with reprex v2.0.2
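Note that the last row of the output above reads "550000000 - 600000000" rather than the "600000000 - 649999999" in desired_output, presumably because the final bin is closed at the maximum value. An alternative sketch, not taken from the answer above, is to compute each bin's lower bound with integer-style division and build the label from it. The data frame toy and the names width, lower and binned are placeholders for illustration.

```r
library(dplyr)

# Hypothetical small input standing in for the question's `df`
toy <- data.frame(
  num_col = c(1400, 49999999, 50000000, 99999999, 600000000),
  val     = c(1, 2, 3, 4, 5)
)

width <- 50000000

binned <- toy |>
  # Lower bound of the fixed-width bin each value falls into
  mutate(lower = floor(num_col / width) * width) |>
  group_by(lower) |>
  summarise(val_sum = sum(val), .groups = "drop") |>
  # "%.0f" avoids scientific notation such as 1e+08 in the labels
  mutate(num_col_interval = sprintf("%.0f - %.0f", lower, lower + width - 1)) |>
  select(num_col_interval, val_sum)

binned
```

Only bins that actually contain data appear in the result, which matches the shape of desired_output.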

apply gsub over a certain column in a list of data frames

I have a list of data frames -results1- where the data frames look like this (but with more rows)
names coefficients
1 ..a15.pdf 1.27679608
2 ..a17.pdf 1.05090176
I want to remove the dots before the variables in column 'names', i.e. change "..a15.pdf" to "a15.pdf".
I tried, with no success, different variations of
results1<-lapply(results1, function(x) {gsub("^.{0,2}", "", lapply(x, "[", "names"));x})
First two data frames from the list:
dput(results1[c(1,2)])
list(structure(list(names = c("..a15.pdf", "..a17.pdf", "..a18.pdf",
"..a21.pdf", "..a2TTT.pdf", "..a5.pdf", "..B11.pdf", "..B12.pdf",
"..B13.pdf", "..B22.pdf", "..B24.pdf", "..B4.pdf", "..B7.pdf",
"..B8.pdf", "..cw10-1.pdf", "..cw15-1TTT.pdf", "..cw17-1.pdf",
"..cw18.pdf", "..cw3.pdf", "..cw4.pdf", "..cw7_1TTT.pdf", "..cw13-1.pdf"
), coefficients = c(1.27679607834331, 1.05090175857491, 1.51820192474905,
2.30296037386815, 1.48568731934637, 0.493713103224402, 1.02705905465749,
0.999747360884078, 2.40828101927852, 0.695152132033603, 2.1436001615064,
2.25444037842867, 0.909773940025014, 1.14837173756827, -1.36323271003293,
0.341428535787024, -0.786878348480425, 0.793720472787986, -1.57831038567642,
0.277733503122777, -0.0364645818969112, -18.336668416705)), class = "data.frame", row.names = c(NA,
-22L)), structure(list(names = c("..a15.pdf", "..a17.pdf", "..a18.pdf",
"..a21.pdf", "..a2TTT.pdf", "..a5.pdf", "..B11.pdf", "..B12.pdf",
"..B13.pdf", "..B22.pdf", "..B24.pdf", "..B4.pdf", "..B7.pdf",
"..B8.pdf", "..cw10-1.pdf", "..cw15-1TTT.pdf", "..cw17-1.pdf",
"..cw18.pdf", "..cw3.pdf", "..cw4.pdf", "..cw7_1TTT.pdf", "..cw13-1.pdf"
), coefficients = c(2.096687569578, 2.19826038300833, 1.91814204277357,
0.801448541154512, 2.16169560949165, 1.48585130705963, 0.95126061691997,
1.93116618236938, 1.92555316191766, 1.00560861920225, 2.91129684208931,
2.75687804718002, 1.31164431967781, 2.22449059765255, -1.22629519335285,
1.31168579553008, -17.5786422399896, 1.25323523754693, -0.754445550651364,
0.555577381430987, 0.577850999404076, -34.2662973287062)), class = "data.frame", row.names = c(NA,
-22L)))
To match a literal . in gsub, you have to escape it with two backslashes (\\.). The pattern below removes any number of dots preceding the actual name.
results <- list(structure(list(names = c("..a15.pdf", "..a17.pdf", "..a18.pdf",
"..a21.pdf", "..a2TTT.pdf", "..a5.pdf", "..B11.pdf", "..B12.pdf",
"..B13.pdf", "..B22.pdf", "..B24.pdf", "..B4.pdf", "..B7.pdf",
"..B8.pdf", "..cw10-1.pdf", "..cw15-1TTT.pdf", "..cw17-1.pdf",
"..cw18.pdf", "..cw3.pdf", "..cw4.pdf", "..cw7_1TTT.pdf", "..cw13-1.pdf"
), coefficients = c(1.27679607834331, 1.05090175857491, 1.51820192474905,
2.30296037386815, 1.48568731934637, 0.493713103224402, 1.02705905465749,
0.999747360884078, 2.40828101927852, 0.695152132033603, 2.1436001615064,
2.25444037842867, 0.909773940025014, 1.14837173756827, -1.36323271003293,
0.341428535787024, -0.786878348480425, 0.793720472787986, -1.57831038567642,
0.277733503122777, -0.0364645818969112, -18.336668416705)), class = "data.frame", row.names = c(NA,
-22L)), structure(list(names = c("..a15.pdf", "..a17.pdf", "..a18.pdf",
"..a21.pdf", "..a2TTT.pdf", "..a5.pdf", "..B11.pdf", "..B12.pdf",
"..B13.pdf", "..B22.pdf", "..B24.pdf", "..B4.pdf", "..B7.pdf",
"..B8.pdf", "..cw10-1.pdf", "..cw15-1TTT.pdf", "..cw17-1.pdf",
"..cw18.pdf", "..cw3.pdf", "..cw4.pdf", "..cw7_1TTT.pdf", "..cw13-1.pdf"
), coefficients = c(2.096687569578, 2.19826038300833, 1.91814204277357,
0.801448541154512, 2.16169560949165, 1.48585130705963, 0.95126061691997,
1.93116618236938, 1.92555316191766, 1.00560861920225, 2.91129684208931,
2.75687804718002, 1.31164431967781, 2.22449059765255, -1.22629519335285,
1.31168579553008, -17.5786422399896, 1.25323523754693, -0.754445550651364,
0.555577381430987, 0.577850999404076, -34.2662973287062)), class = "data.frame", row.names = c(NA,
-22L)))
library(tidyverse)
results %>% map(~ .x %>% mutate(names = gsub('^\\.*(.*)$', '\\1', names)))
#> [[1]]
#> names coefficients
#> 1 a15.pdf 1.27679608
#> 2 a17.pdf 1.05090176
#> 3 a18.pdf 1.51820192
#> 4 a21.pdf 2.30296037
#> 5 a2TTT.pdf 1.48568732
#> 6 a5.pdf 0.49371310
#> 7 B11.pdf 1.02705905
#> 8 B12.pdf 0.99974736
#> 9 B13.pdf 2.40828102
#> 10 B22.pdf 0.69515213
#> 11 B24.pdf 2.14360016
#> 12 B4.pdf 2.25444038
#> 13 B7.pdf 0.90977394
#> 14 B8.pdf 1.14837174
#> 15 cw10-1.pdf -1.36323271
#> 16 cw15-1TTT.pdf 0.34142854
#> 17 cw17-1.pdf -0.78687835
#> 18 cw18.pdf 0.79372047
#> 19 cw3.pdf -1.57831039
#> 20 cw4.pdf 0.27773350
#> 21 cw7_1TTT.pdf -0.03646458
#> 22 cw13-1.pdf -18.33666842
#>
#> [[2]]
#> names coefficients
#> 1 a15.pdf 2.0966876
#> 2 a17.pdf 2.1982604
#> 3 a18.pdf 1.9181420
#> 4 a21.pdf 0.8014485
#> 5 a2TTT.pdf 2.1616956
#> 6 a5.pdf 1.4858513
#> 7 B11.pdf 0.9512606
#> 8 B12.pdf 1.9311662
#> 9 B13.pdf 1.9255532
#> 10 B22.pdf 1.0056086
#> 11 B24.pdf 2.9112968
#> 12 B4.pdf 2.7568780
#> 13 B7.pdf 1.3116443
#> 14 B8.pdf 2.2244906
#> 15 cw10-1.pdf -1.2262952
#> 16 cw15-1TTT.pdf 1.3116858
#> 17 cw17-1.pdf -17.5786422
#> 18 cw18.pdf 1.2532352
#> 19 cw3.pdf -0.7544456
#> 20 cw4.pdf 0.5555774
#> 21 cw7_1TTT.pdf 0.5778510
#> 22 cw13-1.pdf -34.2662973
Created on 2021-06-04 by the reprex package (v2.0.0)
Solution with tidyverse
library(purrr)
library(dplyr)
library(stringr)
map(results1, ~ .x %>%
mutate(names = str_replace(names, "^\\.\\.", "")))
[[1]]
names coefficients
1 a15.pdf 1.27679608
2 a17.pdf 1.05090176
3 a18.pdf 1.51820192
4 a21.pdf 2.30296037
5 a2TTT.pdf 1.48568732
6 a5.pdf 0.49371310
7 B11.pdf 1.02705905
8 B12.pdf 0.99974736
9 B13.pdf 2.40828102
10 B22.pdf 0.69515213
This worked for me:
lapply(your_list, function(df) dplyr::mutate(df, column = gsub(x = column, pattern = "pattern", replacement = "replacement")))
your_list: the list containing the data frames
column: the variable inside the data frames on which you want to run gsub
Done this way, the code did not change the class of the other variables in my data frames.
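For completeness, a base R sketch of the same idea. The attempt in the question discards the result of gsub and then returns x unchanged, so nothing is modified; the fix is to assign the cleaned vector back to the column before returning the data frame. The list results_demo below is a shortened, made-up stand-in for results1.

```r
# Hypothetical two-element list standing in for `results1`
results_demo <- list(
  data.frame(names = c("..a15.pdf", "..a17.pdf"),
             coefficients = c(1.28, 1.05)),
  data.frame(names = c("..B11.pdf", "..cw10-1.pdf"),
             coefficients = c(0.95, -1.23))
)

cleaned <- lapply(results_demo, function(d) {
  # "^\\.+" matches one or more literal dots at the start of the string only
  d$names <- sub("^\\.+", "", d$names)
  d  # return the modified data frame, not the original
})

cleaned[[1]]$names
```

Anchoring the pattern with ^ leaves the dot in ".pdf" untouched.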

data.frame Using Vector of Names

Can I use a vector of variable names to make a data frame?
have=c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","jjj")
for(i in 1:10){assign(have[i],rnorm(10))}
want=data.frame(aaa,bbb,ccc,ddd,eee,fff,ggg,hhh,iii,jjj)
I wonder if I can alter the last aaa,bbb,ccc,ddd,eee,fff,ggg,hhh,iii,jjj somehow using have.
Assume that all variables in have are stored in the Global environment. Then you can also try this:
want <- as.data.frame(mget(have))
You could do
have=c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","jjj")
for(i in 1:10){assign(have[i],rnorm(10))}
want <- data.frame(sapply(have, get))
want
#> aaa bbb ccc ddd eee fff
#> 1 2.2111971 0.58169621 0.7558816 -1.6408627 0.7975625 0.09160389
#> 2 -0.7847731 1.60423888 0.3819555 -1.2061538 0.7545381 -0.64964125
#> 3 -1.2757056 0.57714761 0.4700359 -1.1041282 -0.3816839 0.40549014
#> 4 -0.0360762 -1.29007252 -0.7820075 -0.5319163 -0.2999686 0.51213744
#> 5 0.1763021 0.82259576 -0.4409983 1.4809103 -0.3658530 -0.16434920
#> 6 1.3196823 -0.18163744 1.5261259 1.3087872 -1.0644242 -1.31891628
#> 7 0.4076277 -0.89769591 -0.7778384 -0.3837985 -1.8659484 -1.53683062
#> 8 1.1872413 -0.06917426 0.3875081 0.4146543 -0.7035016 -0.63534985
#> 9 0.9037385 0.10581530 0.6210197 2.4435195 -1.2323838 0.84316865
#> 10 -0.8933586 1.47698413 0.4561502 1.0824430 2.2895535 0.05699095
#> ggg hhh iii jjj
#> 1 -0.4915989 -0.02034347 -1.6870239 -1.08651315
#> 2 1.7595238 0.47375431 0.5408044 0.65031636
#> 3 -2.0502394 0.85440730 -0.4114844 -0.17392623
#> 4 -1.1268393 0.68303043 1.1722424 -0.90590156
#> 5 -1.3235682 0.59603361 -0.8958801 -0.94192724
#> 6 -0.3669457 -0.27870024 1.8228263 0.01478657
#> 7 0.6525810 -0.00354290 0.3757264 0.34386963
#> 8 -0.3378531 -0.45219282 -0.8959065 -0.43244283
#> 9 0.3931531 0.61264470 0.6359348 0.02984539
#> 10 -0.5256779 0.79624735 -2.2912426 -1.06220090
Created on 2020-10-03 by the reprex package (v0.3.0)
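A small sketch of the mget approach with the environment spelled out, which may be safer than relying on wherever the call happens to be evaluated. The environment env and the shortened have vector are assumptions for illustration only.

```r
set.seed(42)

have <- c("aaa", "bbb", "ccc")

# Store the variables in a dedicated environment instead of the global one
env <- new.env()
for (nm in have) assign(nm, rnorm(5), envir = env)

# mget() returns a named list, which as.data.frame() turns into columns
want <- as.data.frame(mget(have, envir = env))

names(want)
```

If the vectors do not have to exist as separate variables in the first place, building a named list directly and converting it avoids assign and mget altogether.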

Assigning a value if between two variables, across data frames in R

I have two data frames, one with a series of random values of length >n, call it:
df.my_data
I also have a second data frame, call it:
df.regions
df.regions consists of three columns, the first with a variable set of numbers 1 through n, the second with a distinguished lower bound, and the third with a distinguished upper bound. Call these
regions$location
regions$lower
regions$upper
I would like to assign the number in the first column of df.regions, regions$location, to a new column in df.my_data based on whether the number in df.my_data falls between the lower and upper bounds of a row in df.regions.
Let me know if I can clarify in any way.
If I understand correctly (and assuming that the regions' lower and upper bounds cover the full range of values you need to classify and do not overlap), then this should be an analogous example:
library(dplyr)
library(purrr)
set.seed(1)
x = tibble(value=abs(rnorm(10, 0, 5)))
bounds = tibble(lower = c(0:6), upper = c(1:6, Inf), class = letters[1:7])
x$class <- bounds$class[map_int(x$value, function(z) {which(map_lgl(seq_len(nrow(bounds)), ~between(z, bounds$lower[.x], bounds$upper[.x]) ))})]
x
#> # A tibble: 10 x 2
#> value class
#> <dbl> <chr>
#> 1 3.13 d
#> 2 0.918 a
#> 3 4.18 e
#> 4 7.98 g
#> 5 1.65 b
#> 6 4.10 e
#> 7 2.44 c
#> 8 3.69 d
#> 9 2.88 c
#> 10 1.53 b
Created on 2019-11-24 by the reprex package (v0.3.0)
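A vectorised alternative in base R, sketched under the assumption that the regions are sorted by their lower bound and do not overlap. findInterval locates the candidate region for every value at once, and a second check against the upper bound handles values that fall in a gap or beyond the last region. The data frames regions and my_data below are invented for illustration.

```r
# Hypothetical regions, sorted by lower bound and non-overlapping
regions <- data.frame(location = 1:3,
                      lower = c(0, 10, 20),
                      upper = c(10, 20, 30))
my_data <- data.frame(value = c(5, 12.5, 29, 35, -1))

# Index of the last region whose lower bound is <= value (0 if none)
idx <- findInterval(my_data$value, regions$lower)
idx[idx == 0] <- NA

# A value only belongs to the region if it also respects the upper bound
inside <- !is.na(idx) & my_data$value <= regions$upper[idx]
my_data$location <- ifelse(inside, regions$location[idx], NA_integer_)

my_data
```

Values outside every region get NA instead of raising an error.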

R: Speed up a for loop on a very large data frame?

I have a huge set of coordinates with associated Z-values. Some of the pairs of coordinates are repeated several times with different Z values. I want to obtain the mean of all Z-values for each unique pair of coordinates.
I wrote a small line of code that works perfectly fine on a small data frame. The problem is that my actual data frame has more than 2 millions rows and the computation takes >10 hours to complete. I was wondering if there could be a way to make it more efficient and reduce the computation time.
Here is what my df looks like:
> df
x y Z xy
1 -54.60417 4.845833 0.3272980 -54.6041666666667/4.84583333333333
2 -54.59583 4.845833 0.4401644 -54.5958333333333/4.84583333333333
3 -54.58750 4.845833 0.5788663 -54.5875/4.84583333333333
4 -54.57917 4.845833 0.6611844 -54.5791666666667/4.84583333333333
5 -54.57083 4.845833 0.7830828 -54.5708333333333/4.84583333333333
6 -54.56250 4.845833 0.8340629 -54.5625/4.84583333333333
7 -54.55417 4.845833 0.8373666 -54.5541666666667/4.84583333333333
8 -54.54583 4.845833 0.8290986 -54.5458333333333/4.84583333333333
9 -54.57917 4.845833 0.9535526 -54.5791666666667/4.84583333333333
10 -54.59583 4.837500 0.0000000 -54.5958333333333/4.8375
11 -54.58750 4.845833 0.8582580 -54.5875/4.84583333333333
12 -54.58750 4.845833 0.3857006 -54.5875/4.84583333333333
You can see that some xy coordinates are the same (e.g. row 3,11,12 or 4 and 9) and I want the mean Z values of all these identical coordinates. So here is my script:
mean<-vector(mode = "numeric",length = length(df$x))
for (i in 1:length(df$x)){
mean(df$Z[which(df$xy==df$xy[i])])->mean[i]
}
mean->df$mean
df<-df[,-(3:4)]
df<-unique(df)
And I get something like this:
> df
x y mean
1 -54.60417 4.845833 0.3272980
2 -54.59583 4.845833 0.4401644
3 -54.58750 4.845833 0.6076083
4 -54.57917 4.845833 0.8073685
5 -54.57083 4.845833 0.7830828
6 -54.56250 4.845833 0.8340629
7 -54.55417 4.845833 0.8373666
8 -54.54583 4.845833 0.8290986
10 -54.59583 4.837500 0.0000000
That does the work, but surely there is a way to speed up this process (probably without the for loop) for a df with a much larger number of rows?
Welcome! In future, it would be best to offer a quick way for us to copy and paste some code that generates the essential features of the dataset you're working with. Here is an example of what I mean:
DF <- data.frame(x = sample(c(-54.1, -54.2), size = 10, replace = TRUE),
y = sample(c(4.8, 4.4), size = 10, replace = TRUE),
z = runif(10))
This looks like a standard split-apply-combine problem:
set.seed(1)
df <- data.frame(x = sample(c(-54.1, -54.2), size = 10, replace = TRUE),
y = sample(c(4.8, 4.4), size = 10, replace = TRUE),
z = runif(10))
library(data.table)
DT <- as.data.table(df)
DT[, .(mean_z = mean(z)), keyby = c("x", "y")]
#> x y mean_z
#> 1: -54.2 4.4 0.3491507
#> 2: -54.2 4.8 0.4604533
#> 3: -54.1 4.4 0.3037848
#> 4: -54.1 4.8 0.5734239
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df %>%
group_by(x, y) %>%
summarise(mean_z = mean(z))
#> # A tibble: 4 x 3
#> # Groups: x [?]
#> x y mean_z
#> <dbl> <dbl> <dbl>
#> 1 -54.2 4.4 0.349
#> 2 -54.2 4.8 0.460
#> 3 -54.1 4.4 0.304
#> 4 -54.1 4.8 0.573
Created on 2018-09-21 by the reprex package (v0.2.1)
You could try dplyr::summarise.
library(dplyr)
df %>%
group_by(x, y) %>%
summarise(meanZ = mean(Z))
I'd guess this would take less than a minute, depending on your machine.
Someone else might provide a data.table answer, which may be even quicker.
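The loop in the question is slow because every iteration compares one row against all rows, so the work grows with the square of the row count. Grouping once avoids that. If you would rather not add a package, a base R sketch with aggregate does the same grouping; the data frame coords and its values are made up for illustration.

```r
# Hypothetical coordinates with repeated (x, y) pairs
coords <- data.frame(
  x = c(-54.5875, -54.5792, -54.5875, -54.5792, -54.5875),
  y = c(4.8458, 4.8458, 4.8458, 4.8458, 4.8458),
  Z = c(0.6, 0.7, 0.9, 0.9, 0.3)
)

# One row per unique (x, y) pair, with the mean of its Z values
means <- aggregate(Z ~ x + y, data = coords, FUN = mean)

means
```

On millions of rows I would still expect data.table or dplyr to be faster than aggregate, but all three avoid the row-by-row comparison.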
