map function in R in data.table

I must be missing something, but I don't understand why my map function is not working.
I have a small data set, to which I want to append a new column (the sum of two existing columns):
DT <- data.table("columnPred" = c(1,2,3),
                 "column1" = c(7,8,9),
                 "column2" = c(44,55,66),
                 "new_column1" = rep(NA, 3))
I wrote my function to sum up:
test_map <- function(x){
  x$new_column1 = x$column1 + x$column2
}
and ran map:
map(DT, test_map)
I keep getting errors. What is wrong with my map function? How can I use map to repeat the same function row-wise? Is there a better alternative?

map() iterates over the columns of DT (a data.table is a list of columns), so inside test_map each x is an atomic vector and x$column1 fails. In any case, we do not need map for that:
library(data.table)
DT[,new_column1 := (column1 + column2)][]
#> columnPred column1 column2 new_column1
#> 1: 1 7 44 51
#> 2: 2 8 55 63
#> 3: 3 9 66 75
However, if we want a map function to get the sum of the two columns, we can do the following:
library(data.table)
library(purrr)
pmap_dbl(DT, ~ ..2 + ..3)
#> [1] 51 63 75
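If the goal is to append the result as a new column, the pmap output can be assigned back by reference (a sketch combining the two approaches; .SDcols limits .SD to the two inputs):
library(data.table)
library(purrr)
# a sketch: row-wise sum over just the two input columns, assigned
# back into DT by reference
DT[, new_column1 := pmap_dbl(.SD, ~ ..1 + ..2), .SDcols = c("column1", "column2")]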

We could use rowSums (which would also take care of NA elements, if present):
library(data.table)
DT[, new_column1 := rowSums(.SD, na.rm = TRUE), .SDcols = column1:column2]
-output
> DT
columnPred column1 column2 new_column1
<num> <num> <num> <num>
1: 1 7 44 51
2: 2 8 55 63
3: 3 9 66 75

Related

In R, how to summarize a data frame in multiple dimensions

There is a data frame raw_data as below. How can I change it to wished_data in an easy way?
I currently know how to group_by/summarise the data several times (adding the variables), then rbind the results. But this gets tedious, especially when there are more variables than in this example.
I want to know if there is a general method for similar situations. Thanks!
library(tidyverse)
country <- c('UK','US','UK','US')
category <- c("A", "B", "A", "B")
y2021 <- c(17, 42, 21, 12)
y2022 <- c(49, 23, 52, 90)
raw_data <- data.frame(country,category,y2021,y2022)
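For reference, the repetitive approach described in the question would look something like this sketch (assumed, for illustration):
library(tidyverse)
# a sketch of the repetitive approach: one summarise per grouping,
# then bind all the pieces together
totals_by_country <- raw_data %>%
  group_by(country) %>%
  summarise(across(c(y2021, y2022), sum)) %>%
  mutate(category = "TOTAL")
totals_by_category <- raw_data %>%
  group_by(category) %>%
  summarise(across(c(y2021, y2022), sum)) %>%
  mutate(country = "TOTAL")
grand_total <- raw_data %>%
  summarise(across(c(y2021, y2022), sum)) %>%
  mutate(country = "TOTAL", category = "TOTAL")
bind_rows(raw_data, totals_by_country, totals_by_category, grand_total)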
We may use rollup/cube/groupingsets from data.table
library(data.table)
out <- rbind(setDT(raw_data), groupingsets(raw_data, j = lapply(.SD, sum),
                                           by = c("country", "category"),
                                           sets = list("country", "category", character())))
out[is.na(out)] <- 'TOTAL'
-output
> out
country category y2021 y2022
<char> <char> <num> <num>
1: UK A 17 49
2: US B 42 23
3: UK A 21 52
4: US B 12 90
5: UK TOTAL 38 101
6: US TOTAL 54 113
7: TOTAL A 38 101
8: TOTAL B 54 113
9: TOTAL TOTAL 92 214
Or with cube
out <- rbind(raw_data, cube(raw_data,
                            j = .(y2021 = sum(y2021), y2022 = sum(y2022)),
                            by = c("country", "category")))
out[is.na(out)] <- 'TOTAL'
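For completeness, a sketch of rollup (also mentioned above), which aggregates hierarchically rather than over arbitrary sets, so the result is close but not identical:
# a sketch: rollup() aggregates hierarchically over the 'by' columns,
# producing (country, category), (country), and grand-total groups, but
# not the per-category totals that groupingsets/cube add
# (raw_data is already a data.table after the setDT() call above)
rollup(raw_data, j = lapply(.SD, sum), by = c("country", "category"))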
We can use the adorn_totals function from janitor. get_totals (defined below) accepts a data frame and a column, and it outputs the data frame with totals for the numeric columns, one such row for each level of the specified column. It then extracts the total rows and, since adorn_totals can rearrange the column order, uses select to restore the original order so that we can later bind multiple instances together. We then bind together the original data frame and each of the total-row data frames that we want.
library(dplyr)
library(janitor)
get_totals <- function(data, col) {
  data %>%
    group_by({{col}}) %>%
    group_modify(~ adorn_totals(.)) %>%
    ungroup %>%
    filter(rowSums(. == "Total") > 0) %>%
    select(any_of(names(data)))
}
bind_rows(
  raw_data,
  get_totals(raw_data, category),
  get_totals(raw_data, country),
  get_totals(raw_data)
)
giving:
country category y2021 y2022
1 UK A 17 49
2 US B 42 23
3 UK A 21 52
4 US B 12 90
5 Total A 38 101
6 Total B 54 113
7 UK Total 38 101
8 US Total 54 113
9 Total - 92 214

Join overlapping ranges from two data frames in R

Note: This question was closed as a 'duplicate'. The solutions offered here and here did not answer my question. They showed how to merge when a single entry fell within a range; I'm trying to identify overlapping ranges and join them. Perhaps my title could have been better...
I have a main data set main_df with a start and end time (in seconds). I would like to see if the time range in main_df falls within a list of ranges in lookup_df, and if so, grab the value from lookup_df. Additionally, if a main_df range falls within two different lookup ranges, duplicate the row so each value is represented.***
library(tibble)
main_df <- tibble(start = c(30,124,161),
                  end = c(80,152,185))
lookup_df <- tibble(start = c(34,73,126,141,174,221),
                    end = c(69,123,136,157,189,267),
                    value = c('a','b','b','b','b','a'))
# Do something here to get the following:
> final_df
# A tibble: 4 x 4
start end value notes
<dbl> <dbl> <chr> <chr>
1 30 80 a ""
2 30 80 b "Duplicate because it falls within a and b"
3 124 152 b "Falls within two lookups but both are b"
4 161 185 b ""
***Edit: Looking at the way I've structured the problem...
#Not actual code
left_join(main_df, lookup_df, by(some_range_join_function) %>%
add_rows(through_some_means)
Rather than having to add a new row I could flip how I'm joining them...
semi_join(lookup_df, main_df, by(some_range_join_function))
You could do some logical comparisons and then handle the cases of what should happen if all matches are 'b', if both 'a' and 'b' match, and so on. This way you can easily add more cases (e.g. both are 'a'; one is 'a' and several are 'b') that weren't declared in the OP. The approach yields NULL if there are no matches, which gets dropped during rbind.
f <- \(x, y) {
  # rows of y whose interval overlaps the interval in x
  w <- which((x[1] >= y[, 1] & x[1] <= y[, 2]) | (x[2] >= y[, 1] & x[1] <= y[, 2]))
  if (length(w) > 0) {
    d <- data.frame(t(x), value = cbind(y[w, 3]), notes = '')
    if (length(w) >= 2) {
      if (all(d$value == 'b')) {
        # several matches, all 'b': keep one row and note it
        d <- d[!duplicated(d$value), ]
        d$notes[1] <- 'both b'
      }
      else {
        # mixed matches: keep all rows and note the last one
        d$notes[nrow(d)] <- 'a & b'
      }
    }
    d
  }
}
apply(main_df, 1, f, lookup_df, simplify = FALSE) |> do.call(what = rbind)
# start end value notes
# 1 30 80 a
# 2 30 80 b a & b
# 3 124 152 b both b
# 4 161 185 b
Data (note the extra first row in main_df, which matches nothing and exercises the NULL case):
main_df <- structure(list(start = c(2, 30, 124, 161), end = c(1, 80, 152,
185)), row.names = c(NA, -4L), class = "data.frame")
lookup_df <- structure(list(start = c(34, 73, 126, 141, 174, 221), end = c(69,
123, 136, 157, 189, 267), value = c("a", "b", "b", "b", "b",
"a")), row.names = c(NA, -6L), class = "data.frame")
Another option is fuzzyjoin::interval_join:
library(fuzzyjoin)
library(dplyr)
interval_join(main_df, lookup_df, by = c("start", "end"), mode = "inner") %>%
  group_by(value, start.x, end.x) %>%
  slice(1) %>%
  select(start = start.x, end = end.x, value)
# A tibble: 4 × 3
# Groups: value, start, end [4]
start end value
<dbl> <dbl> <chr>
1 30 80 a
2 30 80 b
3 124 152 b
4 161 185 b
You can use foverlaps from data.table for this.
library(data.table)
setDT(main_df) # make it a data.table if needed
setDT(lookup_df) # make it a data.table if needed
setkey(main_df, start, end) # set the keys of 'y'
foverlaps(lookup_df, main_df, nomatch = NULL) # do the lookup
# start end i.start i.end value
# 1: 30 80 34 69 a
# 2: 30 80 73 123 b
# 3: 124 152 126 136 b
# 4: 124 152 141 157 b
# 5: 161 185 174 189 b
Or, to get the cleaned result (OP's final_df):
unique(foverlaps(lookup_df, main_df, nomatch = NULL)[, .(start, end, value)])
start end value
1: 30 80 a
2: 30 80 b
3: 124 152 b
4: 161 185 b
A possible solution, based on powerjoin:
library(tidyverse)
library(powerjoin)
power_left_join(
  main_df, lookup_df,
  by = ~ (.x$start <= .y$start & .x$end >= .y$end) |
    (.x$start >= .y$start & .x$start <= .y$end) |
    (.x$start <= .y$start & .x$end >= .y$start),
  keep = "left") %>%
  distinct()
#> # A tibble: 4 x 3
#> start end value
#> <dbl> <dbl> <chr>
#> 1 30 80 a
#> 2 30 80 b
#> 3 124 152 b
#> 4 161 185 b
Or using tidyr::crossing:
library(tidyverse)
crossing(main_df, lookup_df,
         .name_repair = ~ c("start", "end", "start2", "end2", "value")) %>%
  filter((start <= start2 & end >= end2) |
           (start >= start2 & start <= end2) |
           (start <= start2 & end >= start2)) %>%
  select(-start2, -end2) %>%
  distinct()
#> # A tibble: 4 x 3
#> start end value
#> <dbl> <dbl> <chr>
#> 1 30 80 a
#> 2 30 80 b
#> 3 124 152 b
#> 4 161 185 b
You can use the fuzzyjoin package to join based on intervals with the fuzzyjoin::interval_*_join() functions.
I'll be using an inner join, because if you use a semi join like you propose, you will lose the value col and get just 3 rows.
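For comparison, the proposed semi join would look something like this (a sketch; interval_semi_join() keeps only the matching rows of main_df and none of lookup_df's columns):
library(fuzzyjoin)
# a sketch: rows of main_df that overlap at least one lookup interval;
# lookup_df's columns, including 'value', do not survive a semi join
interval_semi_join(main_df, lookup_df, by = c("start", "end"))
The inner join, by contrast: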
library(tidyverse)
library(fuzzyjoin)
fuzzyjoin::interval_inner_join(lookup_df, main_df, by = c("start", "end"), type = "any")
#> # A tibble: 5 × 5
#> start.x end.x value start.y end.y
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 34 69 a 30 80
#> 2 73 123 b 30 80
#> 3 126 136 b 124 152
#> 4 141 157 b 124 152
#> 5 174 189 b 161 185
As you can see, the interval_inner_join() preserves the by cols from both tables, since they are not the same in a fuzzy join. Also, we still have separate rows for those cases in main_df that match multiple cases in lookup_df. Thus, we do some cleanup of the joined table:
interval_inner_join(lookup_df, main_df,
                    by = c("start", "end"),
                    type = "any") |>
  select(-ends_with(".x")) |> # remove lookup interval cols
  distinct() |> # remove duplicates
  rename_with(str_remove, ends_with(".y"), "\\.y") # remove suffixes from col names
#> # A tibble: 4 × 3
#> value start end
#> <chr> <dbl> <dbl>
#> 1 a 30 80
#> 2 b 30 80
#> 3 b 124 152
#> 4 b 161 185
Finally, a clarification of terminology: In your question you state you want to join based on the interval from main_df falling within the interval from lookup_df. This is possible by using type = "within" in interval_*_join(). But based on the examples you provide, it appears you want to join based on any overlap. This can be done with type = "any", but it is the default, so you don't need to specify it.
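For contrast, a sketch of the type = "within" variant (with these data it returns zero rows, since no main_df interval lies entirely inside a lookup interval, which is exactly why type = "any" is the right choice here):
# a sketch: keep only pairs where the main_df interval falls entirely
# inside a lookup_df interval; with these data, that matches nothing
interval_inner_join(main_df, lookup_df,
                    by = c("start", "end"), type = "within")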

Is there a way in data.table to collapse/summarize variables using objects, a la .SDcols?

I have a dataset with multiple variables that I want to collapse/summarize (depending on whether your background is more Stata or tidyverse) using data.table without having to name each individual variable in the code that does this.
Here's some example data:
library(data.table)
dt <- data.table(v1 = c(1,2,5,8,5,9, NA),
                 v2 = c(5,3,6,1, NA,7,8),
                 year = c(1,1,2,3,3,3,4))
dt
v1 v2 year
1: 1 5 1
2: 2 3 1
3: 5 6 2
4: 8 1 3
5: 5 NA 3
6: 9 7 3
7: NA 8 4
And here's the final dataset I want:
# this is the final data we want:
dt[, .(newv1 = sum(v1, na.rm = TRUE),
       newv2 = sum(v2, na.rm = TRUE)),
   by = 'year']
year newv1 newv2
1: 1 3 8
2: 2 5 6
3: 3 22 8
4: 4 0 8
The actual dataset has many variables I'd like to summarize, so I want to store the variable names in a vector beforehand, outside of the collapsing code, and then name the results in a systematic way, such as:
# but we want to do it with objects e.g.:
vars.to.collapse <- c('v1', 'v2')
new.v.names <- paste0('new', vars.to.collapse)
new.v.names
[1] "newv1" "newv2"
I know you can do this kind of thing (that is, create multiple variables using objects) with .SD when you're adding or modifying variables via the := operator, but I have not been able to find a way to do it while altering the unit of observation. In an ideal world I'd use code something like the following:
# want something like this but it doesn't work:
newdt <- dt[, .( (new.v.names = sum(.SD, na.rm = TRUE))),
            .SDcols = vars.to.collapse,
            by = 'year']
# not what I want:
newdt
year V1
1: 1 11
2: 2 11
3: 3 30
4: 4 8
But this doesn't produce the dataset I showed you above with the new variable names and all. I could work around this by using the := operator to add these statistics to the dataset and then drop duplicates or something, but I'd prefer to summarize directly if possible.
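(For reference, that := workaround might look like the following sketch, though a direct summarise would be cleaner:)
# a sketch of the workaround described above: add group sums by
# reference, then keep one row per year
dt[, (new.v.names) := lapply(.SD, sum, na.rm = TRUE),
   .SDcols = vars.to.collapse, by = year]
unique(dt[, c('year', new.v.names), with = FALSE])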
To follow up on your comment about naming new variables, I suggest two solutions using the data.table library.
Please find the two reprexes below.
Reprex 1 (as a follow-up to the solution proposed by @Roland)
Code
library(data.table)
vars.to.collapse <- c('v1', 'v2') # your code
new.v.names <- paste0('new', vars.to.collapse) # your code
dt <- dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = vars.to.collapse, by = 'year'] # Roland's code
setnames(dt, c("year", new.v.names))
Output
dt
#> year newv1 newv2
#> 1: 1 3 8
#> 2: 2 5 6
#> 3: 3 22 8
#> 4: 4 0 8
Reprex 2
Code
library(data.table)
vars.to.collapse <- c('v1', 'v2') # your code
new.v.names <- paste0('new', vars.to.collapse) # your code
dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = vars.to.collapse, by = 'year'
][, (new.v.names) := .SD, .SDcols = vars.to.collapse
][, .SD, .SDcols = !patterns("^v")][]
Output
#> year newv1 newv2
#> 1: 1 3 8
#> 2: 2 5 6
#> 3: 3 22 8
#> 4: 4 0 8
Created on 2021-11-17 by the reprex package (v2.0.1)
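A further variant (a sketch, not from the reprexes above; assumes the original dt): since j can return a named list, setNames() can apply the new names in a single step:
library(data.table)
# a sketch: name the aggregated columns directly in j via setNames()
dt[, setNames(lapply(.SD, sum, na.rm = TRUE), new.v.names),
   .SDcols = vars.to.collapse, by = 'year']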

How can I aggregate multiple columns in a data.frame with a custom function in R?

I've got a data.frame dt with some duplicate keys and missing data, i.e.
Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA
In this case the key is the name, and I would like to apply to each column a function like
f <- function(x){
  x <- x[!is.na(x)]
  x <- x[1]
  return(x)
}
while aggregating by the key (i.e., the "Name" column), so as to obtain as a result
Name Height Weight Age
Alice 180 70 35
Bob NA 80 27
Charles 170 75 NA
I tried
dt_agg <- aggregate(. ~ Name,
                    data = dt,
                    FUN = f)
and I got some errors. Then I tried the following:
dt_agg_1 <- aggregate(Height ~ Name,
                      data = dt,
                      FUN = f)
dt_agg_2 <- aggregate(Weight ~ Name,
                      data = dt,
                      FUN = f)
and this time it worked.
Since I have 50 columns, this second approach is quite cumbersome for me. Is there a way to fix the first approach?
Thanks for the help!
You were very close with the aggregate function; you just needed to adjust how aggregate handles NA (from na.omit to na.pass). My guess is that aggregate removes all rows containing NA first and then does its aggregating, instead of removing NAs as it iterates over the columns to be aggregated. Since every row of your example dataframe contains an NA, you end up with a 0-row dataframe (which is the error I was getting when running your code). I tested this by removing all but one NA, and your code works as-is. So we set na.action = na.pass to pass the NAs through.
dt_agg <- aggregate(. ~ Name,
                    data = dt,
                    FUN = f, na.action = "na.pass")
original answer
dt_agg <- aggregate(dt[, -1],
                    by = list(dt$Name),
                    FUN = f)
dt_agg
# Group.1 Height Weight Age
# 1 Alice 180 70 35
# 2 Bob NA 80 27
# 3 Charles 170 75 NA
You can do this with dplyr:
library(dplyr)
df %>%
  group_by(Name) %>%
  summarize_all(funs(sort(.)[1]))
Result:
# A tibble: 3 x 4
Name Height Weight Age
<fctr> <int> <int> <int>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA
Data:
df = read.table(text = "Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA", header = TRUE)
Here is an option with data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) head(sort(x), 1)), Name]
# Name Height Weight Age
#1: Alice 180 70 35
#2: Bob NA 80 27
#3: Charles 170 75 NA
Simply add na.action = na.pass in the aggregate() call:
aggdf <- aggregate(.~Name, data=df, FUN=f, na.action=na.pass)
# Name Height Weight Age
# 1 Alice 180 70 35
# 2 Bob NA 80 27
# 3 Charles 170 75 NA
If you add an ifelse() to your function to make sure the function returns a value if all values are NA:
f <- function(x) {
  x <- x[!is.na(x)]
  ifelse(length(x) == 0, NA, x)
}
You can use dplyr to aggregate:
library(dplyr)
dt %>% group_by(Name) %>% summarise_all(funs(f))
This returns:
# A tibble: 3 x 4
Name Height Weight Age
<fctr> <dbl> <dbl> <dbl>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA

R: apply operations to vectors with different suffixes

I have to perform some simple operations on a few vectors and rows.
Assume that I have a dataset such as:
observation outcome_1_a outcome_2_a outcome_1_b outcome_2_b choice_a choice_b
1 41 34 56 19 1 1
2 32 78 43 6 2 1
3 39 19 18 55 1 2
For each observation, outcome_1 and outcome_2 are the two possible outcomes, choice is the outcome chosen, and the suffix _i, with i = a,b, indicates the number of times the situation is repeated.
If I want to create variables storing the highest outcome for each situation (a, b), I can do:
max.a <- pmax(data$outcome_1_a, data$outcome_2_a)
max.b <- pmax(data$outcome_1_b, data$outcome_2_b)
Similarly, if I want to create variables storing the values chosen in each situation, I can do:
choice.a <- ifelse(data$choice_a == 1, data$outcome_1_a, data$outcome_2_a)
choice.b <- ifelse(data$choice_b == 1, data$outcome_1_b, data$outcome_2_b)
Finally, If I'd like to compute the mean by row of the situations a and b, I can do:
library(data.table)
setDT(data)
data[, .(Mean = rowMeans(.SD)), by = observation, .SDcols = c("outcome_1_a","outcome_2_a", "outcome_1_b", "outcome_2_b")]
Now, all of these work just fine. However, I was wondering whether such operations can be done in a more efficient way.
In the example there are only a few situations, but if in the future I have to deal with, say, 15 or more different situations (a, b, c, d, ...), writing out such operations by hand would get tedious.
Is there a way to automate this process based on the different prefixes and/or suffixes of the variables?
Thank you for your help.
You can select columns with some regex. For example, to get your max.a value:
library(data.table)
setDT(data)
data[, do.call(pmax, .SD), .SDcols = names(data) %like% "\\d+_a$"]
[1] 41 78 39
Alternatively, you could select your columns with some regex outside of the data.table. Lots of ways to go about this.
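For instance, the column selection could be built outside the data.table first (a sketch):
# a sketch: compute the matching column names up front, then reuse them
cols_a <- grep("\\d+_a$", names(data), value = TRUE)
data[, do.call(pmax, .SD), .SDcols = cols_a]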
Similar application to your last command.
data[,
     .(Mean = rowMeans(.SD)),
     by = observation,
     .SDcols = names(data) %like% "^outcome"]
observation Mean
1: 1 37.50
2: 2 39.75
3: 3 32.75
For choice.a, how would you choose between b, c, d, e etc?
For instance:
outcome_1_a outcome_2_a outcome_1_b outcome_2_b outcome_1_c outcome_2_c outcome_1_d outcome_2_d outcome_1_e outcome_2_e choice_a choice_b choice_c choice_d choice_e
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 85 32 28 91 42 32 96 27 29 2 1 1 1 1
2 17 22 84 53 11 69 16 66 11 41 1 2 2 1 1
3 92 98 76 83 18 27 21 51 92 41 1 1 1 1 2
4 63 49 61 64 100 28 43 51 22 94 1 2 1 1 1
Define an index variable that will help you go through the loops:
seqmax <- seq(1, 10, by = 2)
seqmax is 1 3 5 7 9. The reason is that there are 5 letters ("a" "b" "c" "d" "e"), so this sequence patterns the loop over the pairs of outcome columns. This can be automated for any number of letters: just find the index of the last column before choice_a, then do seq(1, grep(names(data), pattern = "choice_a") - 1, by = 2). The by = 2 argument can be adjusted for the number of columns per letter.
I use lapply with <<- to assign the new columns to data.
lapply(1:5, function(x){
  # max of the two outcome columns for letter x
  data[, paste0("max.", letters[x])] <<- apply(data[, c(seqmax[x], seqmax[x] + 1)], 1, max)
  # outcome actually chosen for letter x
  data[, paste0("choice.", letters[x])] <<- ifelse(
    data[, grep(names(data), pattern = paste0("choice_", letters[x]), value = TRUE)] == 1,
    data[, seqmax[x]], data[, seqmax[x] + 1])
  # row mean of the two outcome columns for letter x
  data[, paste0("mean.", letters[x])] <<- rowMeans(
    data[, grep(names(data), pattern = paste0("outcome_\\d+_", letters[x]), value = TRUE)])
})
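The max and mean parts can also be written as a plain for loop, which avoids <<- (a sketch):
# a sketch: the same bookkeeping with a for loop instead of lapply/<<-
for (x in 1:5) {
  pair <- c(seqmax[x], seqmax[x] + 1)  # the two outcome columns for this letter
  data[[paste0("max.", letters[x])]]  <- do.call(pmax, data[pair])
  data[[paste0("mean.", letters[x])]] <- rowMeans(data[pair])
}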
