Using index to reference column in summarise() in dplyr - R

I would like to reference a column inside the summarise() in dplyr with its index rather than with its name. For example:
> a
id visit timepoint bedroom den
1 0 0 62 NA
2 1 0 53 6.00
3 2 0 56 2.75
4 0 1 55 NA
5 1 2 61 NA
6 2 0 54 NA
7 0 1 58 2.75
8 1 2 59 NA
9 2 2 60 NA
10 0 1 57 NA
# E.g.
a %>% group_by(visit) %>% summarise(avg.bedroom = mean(bedroom, na.rm = TRUE))
# Returns
visit avg.bedroom
<dbl> <dbl>
1 0 4.375
2 1 2.750
3 2 NaN
How could I use the index of the column "bedroom" rather than its name in the summarise clause? I tried:
a %>% group_by(visit) %>% summarise("4" = mean(.[[4]], na.rm = T))
but this returned incorrect results (the same value for every group):
visit `4`
<dbl> <dbl>
1 0 3.833333
2 1 3.833333
3 2 3.833333
Is my objective achievable and if yes how? Thank you.

Perhaps not exactly what you're looking for, but one option would be to use purrr rather than dplyr. Something like
# Read in data
d <- read.table(textConnection(" id visit timepoint bedroom den
1 12 0 62 NA
2 14 0 53 6.00
3 14 0 56 2.75
4 14 1 55 NA
5 14 2 61 NA
6 15 0 54 NA
7 15 1 58 2.75
8 16 2 59 NA
9 16 2 60 NA
10 17 1 57 NA "),
header = TRUE)
library(purrr)
d %>%
split(.$timepoint) %>%
map_dbl(function(x) mean(x[ ,5], na.rm = TRUE))
# 0 1 2
# 4.375 2.750 NaN
Or, with base
aggregate(d[ ,5] ~ timepoint, data = d, mean)
# timepoint d[, 5]
# 1 0 4.375
# 2 1 2.750
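To stay within dplyr, one possible sketch is to translate the position into a column name first and then refer to it through the `.data` pronoun, which respects the grouping. The names `col` and `avg` are only illustrative, and the data frame repeats the one read in above.

```r
library(dplyr)

# Same data as read in above, built directly for a self-contained example
d <- data.frame(
  id        = 1:10,
  visit     = c(12, 14, 14, 14, 14, 15, 15, 16, 16, 17),
  timepoint = c(0, 0, 0, 1, 2, 0, 1, 2, 2, 1),
  bedroom   = c(62, 53, 56, 55, 61, 54, 58, 59, 60, 57),
  den       = c(NA, 6.00, 2.75, NA, NA, NA, 2.75, NA, NA, NA)
)

# Resolve the column position (5 = den) to its name once, outside the pipeline
col <- names(d)[5]

res <- d %>%
  group_by(timepoint) %>%
  summarise(avg = mean(.data[[col]], na.rm = TRUE))

res$avg
# 4.375 2.750 NaN
```

Unlike `.[[5]]`, which pulls the whole ungrouped column, `.data[[col]]` is evaluated once per group.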

The answer I found is the summarize_at() function of dplyr. Here is how I used summarize_at() to create summary statistics on subsets of my dataframe where the columns were not known in advance (object is my original dataframe which is in a long form and has a column -- room -- that contains the names of the rooms, as well as two other columns, "visit" and "value"):
library(dplyr)
library(tidyr)
# Convert object to a wide form
object$row <- 1 : nrow(object)
y <- spread(object, room, value)
# Remove the row column from y
y <- y %>% select(-row)
# Initialize stat1, the dataframe with the summary
# statistics
stat1 <- data.frame(visit = c(0, 1, 2))
# Find the number of columns that stat1 will eventually
# have
y <- y %>% filter(id == id) %>%
select_if(function(col) mean(is.na(col)) != 1)
n <- ncol(y)
# Append columns with summary statistics to stat1
for (i in 3 : n) {
t <- y %>% group_by(visit) %>%
summarise_at(c(i), mean, na.rm = T)
t[, 2] <- round(t[, 2], 2)
stat1 <- cbind(stat1, t[, 2])
}
# Pass the dataframe stat1 to the list "results"
results$stat1 <- stat1
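In dplyr 1.0.0 and later, summarise_at() is superseded by across(), which can summarise all the position-selected columns in one call and so avoids the loop and cbind(). The following is a minimal sketch under that assumption; the small data frame `y` is a hypothetical stand-in for the wide table built above.

```r
library(dplyr)

# Hypothetical wide data standing in for y: id, visit, then one column per room
y <- data.frame(
  id      = 1:6,
  visit   = c(0, 0, 1, 1, 2, 2),
  bedroom = c(62, 53, 56, 55, 61, 54),
  den     = c(NA, 6.00, 2.75, NA, NA, NA)
)

# Pick the columns by position (3 to the last), then translate to names
cols <- names(y)[3:ncol(y)]

# One grouped summary over all selected columns, rounded to 2 decimals
stat1 <- y %>%
  group_by(visit) %>%
  summarise(across(all_of(cols), ~ round(mean(.x, na.rm = TRUE), 2)))
```

Translating positions to names before the pipeline keeps the selection independent of how grouping columns are handled inside across().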

Related

R iterating by group and mapping values based on column value

I have the following data frame in R:
df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))
And another data frame:
data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
name = c('a','b','c','d','e','f','g','h','i','j'))
So, if we take a look at df we could sort the data by target and combination and we will notice that there are basically "groups". For example for target=1 and comb=0 there are four entries p1_start,p1_end,p2_start,p2_end and it is the same for all other target/comb combinations.
On the other side data contains entries with time being a timestamp.
Goal: I want to map the values from both data frames based on time.
Example: The first entry of data has time=2 meaning it happened between p1_start,p1_end so it should get the values target=1 and comb=0 mapped to the data data frame.
Example 2: The entries of data with time=14 happened between p2_start,p2_end so they should get the values target=1 and comb=1 mapped to the data data frame.
Idea: I thought I would iterate over df by target and comb and, for each combination of them, check whether there are rows in data whose time falls in between. The second step could be done with the following command:
data[which(data$time > p1_start & data$time < p2_end),]
Once I have the rows, it is easy to append the values.
Problem: how could I do the iteration? I tried the following:
df %>%
group_by(target, comb) %>%
print(data[which(data$time > df$p1_start & data$time < df$p2_end),])
But I am getting an error that time has not been initialized.
Your problem is known as a non-equi join: for each value of time in data, we need to find the range in df that contains it. This is well handled by the data.table package.
We first reshape df into one row per interval with start and end columns, and then join data with df on time <= end and time >= start. Here is the code:
library(data.table)
setDT(df)[, c("type", "name") := tstrsplit(name, "_", fixed = TRUE)]
df <- dcast(df, ... ~ name, value.var = "time")
cols <- c("target", "comb", "type")
setDT(data)[df, (cols) := mget(paste0("i.", cols)), on = .(time<=end, time>=start)]
After dcast, df looks like this:
target comb type end start
1: 1 0 p1 3 1
2: 1 0 p2 7 5
3: 1 1 p1 11 9
4: 1 1 p2 15 13
5: 2 0 p1 19 17
6: 2 0 p2 23 21
7: 2 1 p1 27 25
8: 2 1 p2 31 29
And the output is:
> data
time name target comb type
1: 2 a 1 0 p1
2: 5 b 1 0 p2
3: 8 c NA NA <NA>
4: 14 d 1 1 p2
5: 14 e 1 1 p2
6: 20 f NA NA <NA>
7: 21 g 2 0 p2
8: 26 h 2 1 p1
9: 28 i NA NA <NA>
10: 28 j NA NA <NA>
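Here is a tidyverse solution. Note that it joins on exact equality with time or time + 1 rather than on the start/end intervals, so unlike the other answers it also assigns values to rows such as time = 8 and time = 20: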
library(tidyr)
library(dplyr)
df %>%
rename(name_df=name) %>%
mutate(x = time +1) %>%
pivot_longer(
cols = c(time, x),
names_to = "helper",
values_to = "time"
) %>%
right_join(data, by="time") %>%
select(time, name, target, comb)
time name target comb
<dbl> <chr> <dbl> <dbl>
1 2 a 1 0
2 5 b 1 0
3 8 c 1 0
4 14 d 1 1
5 14 e 1 1
6 20 f 2 0
7 21 g 2 0
8 26 h 2 1
9 28 i 2 1
10 28 j 2 1
Another option, using the fuzzyjoin package:
df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))
data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
name = c('a','b','c','d','e','f','g','h','i','j'))
library(fuzzyjoin)
library(tidyverse)
tmp <- df %>%
separate(name,
into = c("p", "period"),
sep = "_",
remove = TRUE) %>%
pivot_wider(
id_cols = c(p, target, comb),
names_from = period,
values_from = time
) %>%
select(-p)
fuzzy_left_join(
x = data,
y = tmp,
by = c("time" = "start",
"time" = "end"),
match_fun = list(`>=`, `<=`))
#> time name target comb start end
#> 1 2 a 1 0 1 3
#> 2 5 b 1 0 5 7
#> 3 8 c NA NA NA NA
#> 4 14 d 1 1 13 15
#> 5 14 e 1 1 13 15
#> 6 20 f NA NA NA NA
#> 7 21 g 2 0 21 23
#> 8 26 h 2 1 25 27
#> 9 28 i NA NA NA NA
#> 10 28 j NA NA NA NA
Created on 2022-01-11 by the reprex package (v2.0.1)
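The interval matching in the answers above can also be written with dplyr alone, assuming dplyr 1.1.0 or later, where join_by() accepts a between() helper. This is a sketch rather than a drop-in replacement: the `ranges` table is typed in from the dcast output shown earlier (without the `type` column), and `mapped` is an illustrative name.

```r
library(dplyr)

# Interval table as shown after dcast above (the type column is omitted)
ranges <- data.frame(
  target = c(1, 1, 1, 1, 2, 2, 2, 2),
  comb   = c(0, 0, 1, 1, 0, 0, 1, 1),
  start  = c(1, 5, 9, 13, 17, 21, 25, 29),
  end    = c(3, 7, 11, 15, 19, 23, 27, 31)
)

data <- data.frame(
  time = c(2, 5, 8, 14, 14, 20, 21, 26, 28, 28),
  name = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
)

# Keep every row of data; attach the range whose [start, end] contains time.
# Rows whose time falls in no interval get NA for target and comb.
mapped <- left_join(data, ranges, by = join_by(between(time, start, end)))
```

Times 8, 20 and 28 fall in no interval, so those rows keep NA in target and comb, matching the data.table and fuzzyjoin outputs.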

sum up rows based on row.names and condition in col.names -- R

df <- data.frame(row.names = c('1s.u1','1s.u2','2s.u1','2s.u2','6s.u1'),
  fjri_deu_klcea = c('0','0','0','15','23'),
  hfue_klcea = c('2','2','0','156','45'),
  dji_dhi_ghcea_jk = c('456','0','0','15','15'),
  jdi_jdi_ghcea = c('1','2','3','4','100'),
  gz7_jfu_dcea_jdi = c('5','6','3','7','56'))
df
fjri_deu_klcea hfue_klcea dji_dhi_ghcea_jk jdi_jdi_ghcea gz7_jfu_dcea_jdi
1s.u1 0 2 456 1 5
1s.u2 0 2 0 2 6
2s.u1 0 0 0 3 3
2s.u2 15 156 15 4 7
6s.u1 23 45 15 100 56
I want to sum up df based on the cea part of the column names, so that all columns sharing the same cea part are added together row by row.
df should look like this:
klcea ghcea dcea
1s.u1 2 457 5
1s.u2 2 2 6
2s.u1 0 3 3
2s.u2 171 19 7
6s.u1 68 115 56
I thought about first creating a new column called cea that holds the cea name, and then summing up based on row.names and the respective cea
with something like with(df, ave(cea, row.names(df), FUN = sum)).
I do not know how to generate the new column based on a pattern in a string. I guess grepl is useful but I could not come up with something, I tried df$cea <- df[grepl(colnames(df),'cea'),] which is wrong...
Using base R, you can extract the "cea" part from each column name and pass it to split.default to split the dataframe into groups of columns. rowSums then sums each of the resulting dataframes.
sapply(split.default(df, sub('.*_(.*cea).*', '\\1', names(df))), rowSums)
# dcea ghcea klcea
#1s.u1 5 457 2
#1s.u2 6 2 2
#2s.u1 3 3 0
#2s.u2 7 19 171
#6s.u1 56 115 68
where the sub part returns:
sub('.*_(.*cea).*', '\\1', names(df))
#[1] "klcea" "klcea" "ghcea" "ghcea" "dcea"
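One caveat: the data frame in the question stores its values as quoted strings, and rowSums() needs numeric input. A minimal sketch of the same base R approach with an explicit conversion step first (going through as.character() so that it also behaves if the columns were read as factors):

```r
df <- data.frame(
  row.names        = c("1s.u1", "1s.u2", "2s.u1", "2s.u2", "6s.u1"),
  fjri_deu_klcea   = c("0", "0", "0", "15", "23"),
  hfue_klcea       = c("2", "2", "0", "156", "45"),
  dji_dhi_ghcea_jk = c("456", "0", "0", "15", "15"),
  jdi_jdi_ghcea    = c("1", "2", "3", "4", "100"),
  gz7_jfu_dcea_jdi = c("5", "6", "3", "7", "56")
)

# Convert every column to numeric while keeping the data frame structure
df[] <- lapply(df, function(x) as.numeric(as.character(x)))

# Extract the "...cea" part of each column name to use as the grouping key
groups <- sub(".*_(.*cea).*", "\\1", names(df))

# Split the columns by key and sum each group row-wise
res <- sapply(split.default(df, groups), rowSums)
```

The result is a numeric matrix whose row names are those of df and whose columns are the keys in alphabetical order (dcea, ghcea, klcea).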
Using dplyr:
library(dplyr)
df %>% rowwise() %>% mutate(klcea = sum(c_across(ends_with('klcea'))),
  ghcea = sum(c_across(contains('ghcea'))),
  dcea = sum(c_across(contains('dcea')))) %>%
  select(klcea, ghcea, dcea)
# A tibble: 5 x 3
# Rowwise:
klcea ghcea dcea
<dbl> <dbl> <dbl>
1 2 457 5
2 2 2 6
3 0 3 3
4 171 19 7
5 68 115 56
If you wish to retain row names:
library(tibble)  # for rownames_to_column() and column_to_rownames()
df %>% rownames_to_column('rn') %>% rowwise() %>% mutate(klcea = sum(c_across(ends_with('klcea'))),
  ghcea = sum(c_across(contains('ghcea'))),
  dcea = sum(c_across(contains('dcea')))) %>%
  select(klcea, ghcea, dcea, rn) %>% column_to_rownames('rn')
klcea ghcea dcea
1s.u1 2 457 5
1s.u2 2 2 6
2s.u1 0 3 3
2s.u2 171 19 7
6s.u1 68 115 56

Insert missing rows in time series data

I have an incomplete time series dataframe and I need to insert rows of NAs for missing time stamps. There should always be 6 time stamps per day, which is indicated by the variable "Signal" (1-6) in the dataframe. I am trying to merge the incomplete dataframe A with a vector B containing all Signals. Simplified example data below:
B <- rep(1:6,2)
A <- data.frame(Signal = c(1,2,3,5,1,2,4,5,6), var1 = c(1,1,1,1,1,1,1,1,1))
Expected <- data.frame(Signal = c(1,2,3,NA, 5, NA, 1,2,NA,4,5,6), var1 = c(1,1,1,NA,1,NA,1,1,NA,1,1,1))
Note that B represents a dataframe with multiple variables and the NAs in Expected are rows of NAs in the dataframe. Also, the actual dataframe has more observations (84 in total).
Would be awesome if you guys could help me out!
If you already know there are 6 timestamps in a day you can do this without B. We can create groups for each day and use complete to add the missing observations with NA.
library(dplyr)
library(tidyr)
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 NA
# 5 5 1
# 6 6 NA
# 7 1 1
# 8 2 1
# 9 3 NA
#10 4 1
#11 5 1
#12 6 1
If you need Signal to be NA in the output for the missing combinations, you can use:
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
mutate(Signal = replace(Signal, is.na(var1), NA)) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 NA NA
# 5 5 1
# 6 NA NA
# 7 1 1
# 8 2 1
# 9 NA NA
#10 4 1
#11 5 1
#12 6 1
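The grouping trick above starts a new day whenever Signal decreases. For comparison, here is a base R sketch of the same idea using expand.grid() and merge(); the column name `day` and the object names `full` and `out` are illustrative choices, not part of the original answer.

```r
A <- data.frame(Signal = c(1, 2, 3, 5, 1, 2, 4, 5, 6), var1 = rep(1, 9))

# A drop in Signal marks the start of a new day: 1 1 1 1 2 2 2 2 2
A$day <- cumsum(c(TRUE, diff(A$Signal) < 0))

# Every Signal 1..6 for every day
full <- expand.grid(Signal = as.numeric(1:6), day = unique(A$day))

# Left join: days/signals missing from A get NA in var1
out <- merge(full, A, by = c("day", "Signal"), all.x = TRUE)
out <- out[order(out$day, out$Signal), ]

out$var1
# 1 1 1 NA 1 NA 1 1 NA 1 1 1
```

This relies on the same assumption as the dplyr answer: within a day the signals appear in increasing order, so a decrease always means a new day.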

Create lags relative to whole change within group

I've tried creating a variable that represents the lagged version of another variable relative to the whole change of the variable within the group.
Let's use this example dataframe:
game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3), score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
game_data
player level score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I've tried the following, but while lagging the variable works, I am not able to create a new variable that shows the lag of the variable relative to the whole change for the player:
result <-
  game_data %>%
  group_by(player) %>%
  mutate(
    lag_score = score - dplyr::lag(score, n=1, default = NA),
    lag_score_relative = lag_score/sum(lag_score))
result
# A tibble: 9 x 5
# Groups: player [3]
player level score lag_score lag_score_relative
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 NA NA
2 1 2 150 150 NA
3 1 3 170 20 NA
4 2 1 80 NA NA
5 2 2 100 20 NA
6 2 3 110 10 NA
7 3 1 75 NA NA
8 3 2 100 25 NA
9 3 3 0 -100 NA
For example, for player 1 it should be:
Level 1: NA/170 = NA
Level 2: 150/170
Level 3: 20/170
Thanks in advance, I hope anyone can help.
When you sum the lagged scores, the NA from each player's first level is included, so sum() returns NA and every division by it yields NA. To avoid this, set na.rm = TRUE in the call to sum() so that the NAs are left out of the total:
game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3),
score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
game_data %>%
group_by(player) %>%
mutate(
lag_score = score - dplyr::lag(score, n=1, default = NA),
lag_score_relative = lag_score/sum(lag_score, na.rm = TRUE))
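The effect of na.rm can be checked in isolation with player 1's differences from the example; the vector below is typed in by hand and the variable names are illustrative.

```r
# Player 1's score differences: the first level has no predecessor, hence NA
lag_score <- c(NA, 150, 20)

sum(lag_score)                         # NA: a single NA poisons the sum
total <- sum(lag_score, na.rm = TRUE)  # 170: NAs are dropped first

# The first element stays NA, the others become shares of the total change
rel <- lag_score / total
rel
# NA 0.8823529 0.1176471
```

This matches the values requested in the question: NA for level 1, 150/170 for level 2 and 20/170 for level 3.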

Numbering of groups in dplyr?

I have a question about numbering the groups in a data.frame.
I found only one similar approach here: dplyr-how-to-number-label-data-table-by-group-number-from-group-by
but it didn't work for me, and I don't know why.
S <- rep(letters[1:12],each=6)
R = sort(replicate(9, sample(5000:6000,4)))
df <- data.frame(R,S)
get_next_integer = function(){
i = 0
function(S){ i <<- i+1 }
}
get_integer = get_next_integer()
result <- df %>% group_by(S) %>% mutate(label = get_integer())
result
Source: local data frame [72 x 3]
Groups: S [12]
R S label
(int) (fctr) (dbl)
1 5058 a 1
2 5121 a 1
3 5129 a 1
4 5143 a 1
5 5202 a 1
6 5213 a 1
7 5239 b 1
8 5245 b 1
9 5269 b 1
10 5324 b 1
.. ... ... ...
I am looking for an elegant solution in dplyr that numbers the letters from 1 to 12.
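Using as.numeric will do the trick, as long as S is a factor. data.frame() created factors from strings by default before R 4.0.0; wrapping S in factor() makes the conversion explicit and also works in newer versions.
S <- rep(letters[1:12],each=6)
R = sort(replicate(9, sample(5000:6000,4)))
df <- data.frame(R,S)
result <- df %>% mutate(label = as.numeric(factor(S))) %>% group_by(S)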
result
Source: local data frame [72 x 3]
Groups: S
R S label
1 5018 a 1
2 5042 a 1
3 5055 a 1
4 5066 a 1
5 5081 a 1
6 5133 a 1
7 5149 b 2
8 5191 b 2
9 5197 b 2
10 5248 b 2
.. ... . ...
No need to use dplyr at all.
S <- rep(letters[1:12],each=6)
R = sort(replicate(9, sample(5000:6000,4)))
df <- data.frame(R,S)
df$label <- as.numeric(factor(df$S))
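For a dplyr-native variant, cur_group_id() (available from dplyr 1.0.0) returns the index of the current group inside mutate(). The sketch below uses a small made-up data frame and contrasts it with match(), which numbers groups by first appearance instead of sorted order.

```r
library(dplyr)

# Small illustrative data: groups appear in the order c, a, b
df <- data.frame(R = 1:12, S = rep(c("c", "a", "b"), each = 4))

# cur_group_id() numbers groups in sorted key order: a = 1, b = 2, c = 3
by_key <- df %>%
  group_by(S) %>%
  mutate(label = cur_group_id()) %>%
  ungroup()

# match() against unique() numbers groups by first appearance: c = 1, a = 2, b = 3
df$label_seen <- match(df$S, unique(df$S))
```

Which numbering is wanted depends on whether the labels should follow the alphabetical order of the keys or the order in which the groups occur in the data.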
