Conditional subsetting of a data frame in R

Let the data frame be:
set.seed(123)
df <- data.frame(name = sample(LETTERS, 260, replace = TRUE),
                 hobby = rep(c("outdoor", "indoor"), 260),
                 chess = rnorm(1:10))
and the condition which I will use to extract from df be:
df_cond <- df %>%
  group_by(name, hobby) %>%
  summarize(count = n()) %>%
  mutate(sum.var = sum(count), sum.name = length(name)) %>%
  filter(sum.name == 2) %>%
  mutate(min.var = min(count)) %>%
  mutate(use = ifelse(min.var == count, "yes", "no")) %>%
  filter(grepl("yes", use))
I want to randomly extract the rows from df that correspond to the (name, hobby, count) combinations in df_cond, along with the rest of df. I am having a bit of trouble combining %in% and sample. Thanks for any clue!
Edit: For example:
head(df_cond)
    name   hobby count sum.var sum.name min.var   use
  <fctr>  <fctr> <int>   <int>    <int>   <int> <chr>
1      A  indoor     2       6        2       2   yes
2      B  indoor     8      16        2       8   yes
3      B outdoor     8      16        2       8   yes
4      C outdoor     6      14        2       6   yes
5      D  indoor    10      24        2      10   yes
6      E outdoor     8      18        2       8   yes
Using the above data frame, I want to randomly extract 2 rows (= count) with the combination A + indoor (row 1) from df,
8 rows with the combination B + indoor (row 2) from df, and so on.
Combining @denrous's and @Jacob's answers gets me what I need, like so:
m2 <- df_cond %>%
  mutate(data = map2(name, hobby, function(x, y) {
    df %>% filter(name == x, hobby == y)
  })) %>%
  ungroup() %>%
  select(data) %>%
  unnest()
test <- m2 %>%
  group_by(name, hobby) %>%
  summarize(num.levels = length(unique(hobby))) %>%
  ungroup() %>%
  group_by(name) %>%
  summarize(total_levels = sum(num.levels)) %>%
  filter(total_levels > 1)
fin <- semi_join(m2, test)

If I understand correctly, you could use purrr to achieve what you want:
df_cond %>%
  mutate(data = map2(name, hobby, function(x, y) {
    filter(df, name == x, hobby == y)
  })) %>%
  mutate(data = map2(data, count, function(x, y) sample_n(x, size = y)))
And if you want the same form as df:
df_cond %>%
  mutate(data = map2(name, hobby, function(x, y) {
    df %>% filter(name == x, hobby == y)
  })) %>%
  mutate(data = map2(data, count, function(x, y) sample_n(x, size = y))) %>%
  ungroup() %>%
  select(data) %>%
  unnest()
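The same idea can also be expressed without nesting, by joining the target counts onto df and sampling within each group. A minimal sketch, assuming dplyr >= 1.0 for slice_sample() and group_modify():
library(dplyr)
# join count onto the matching (name, hobby) rows of df,
# then sample 'count' rows within each group
df %>%
  inner_join(select(ungroup(df_cond), name, hobby, count),
             by = c("name", "hobby")) %>%
  group_by(name, hobby) %>%
  group_modify(~ slice_sample(.x, n = .x$count[1])) %>%
  ungroup() %>%
  select(-count)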

Edited based on OP clarification.
There has to be a better way, but I'd use a loop:
library(dplyr)
master_df <- data.frame()
for (i in seq_len(nrow(df_cond))) {
  # use names that differ from the column names: filter(name == name)
  # would compare the column to itself and keep every row
  nm <- as.character(df_cond$name[i])
  hb <- as.character(df_cond$hobby[i])
  n  <- as.numeric(df_cond$count[i])
  temp_df <- df %>% filter(name == nm, hobby == hb)
  temp_df <- sample_n(temp_df, n)
  master_df <- rbind(master_df, temp_df)
}

Not clear if this is exactly what you want, but you may be looking for left_join:
df %>%
  left_join(df_cond, by = "name")
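If the goal is just to keep the rows of df whose (name, hobby) combination appears in df_cond, without carrying over df_cond's summary columns, a semi_join may be closer to what you want (a sketch):
df %>%
  semi_join(df_cond, by = c("name", "hobby"))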

Related

Calculating the duration and the order of non-continuous events in R

My dataset consists of a series of behaviours observed in videos. For each behaviour, I have recorded when it starts and when it ends.
datain <- data.frame(
  A = c("1/5+11/18", "0/5", "7/10"),
  B = c("6/10+19/25", "11/15", "11/20"),
  C = c("26/30", "6/10", "0/6"))
I would like to get the duration of each behaviour as well as the order of the behaviours for each observation, like in this desired output
dataout <- data.frame(
  A = c("1/5+11/18", "0/5", "7/10"),
  B = c("6/10+19/25", "11/15", "11/20"),
  C = c("26/30", "6/10", "0/6"),
  A.sum = c(11, 5, 3),
  B.sum = c(10, 4, 9),
  C.sum = c(4, 4, 6),
  myorder = c("A/B/A/B/C", "A/C/B", "C/A/B"))
I am experimenting with the following lines to identify which columns contain a + and to extract the rows with interrupted behaviours (I still have to calculate the duration of each behaviour), but I suspect there is a more efficient solution than the one I am currently attempting.
library(stringr)
d.1 <- lapply(datain, function(x) str_which(x, "\\+"))
d.2 <- which(lapply(d.1, length) > 0)
coltosum <- match(names(d.2), colnames(datain))
mylist <- lapply(datain[coltosum], function(x) strsplit(x, "\\+"))
As always, I would greatly appreciate any suggestion.
Please note that I edited this question a few days later to include the order of the behaviours in the desired output.
Update: I have been able to figure out how to get the sequence of the behaviours. I bet there are more elegant and concise ways to get this result; the code is below.
library(stringr)
library(plyr)   # for ldply()

# removing empty columns
empty_columns <- sapply(datain, function(x) all(is.na(x) | x == ""))
datain <- datain[, !empty_columns]

# loop 1: take the occurrence of each behaviour (BH); note the body
# does not depend on i, so each iteration redoes the same work
mylist <- list()
for (i in seq(1, nrow(datain))) {
  mylist <- apply(datain, 1, str_extract_all, pattern = "\\d+")
  myindx <- sapply(mylist, length)
  myres <- c(do.call(cbind, lapply(mylist, `length<-`, max(myindx))))
  names(myres) <- rep(colnames(datain), nrow(datain))
  mydf <- ldply(myres, data.frame)
  colnames(mydf) <- c("BH", "values")
}

# loop 2: count the number of elements in each nested list
mydf.1 <- list()
myres.2 <- list()
for (i in seq(1, nrow(datain))) {
  mydf.1 <- length(unlist(mylist[i]))
  myres.2[i] <- mydf.1
}

# placing the row values
names(myres.2) <- rownames(datain)
myres.3 <- as.numeric(myres.2)
mydf$myrow <- c(rep(rownames(datain), myres.3))

# order by row and by values
mydf <- mydf[order(as.numeric(mydf$myrow), as.numeric(mydf$values)), ]

# pick up the right values: generate as many sequences as there are
# elements for each row
myseq <- sequence(myres.3)
mydf <- cbind(mydf, myseq)
myseq.2 <- seq(1, nrow(mydf), by = 2)

# select the df according to the uneven (start) rows
mydf.1 <- mydf[myseq.2, ]
myorder <- split(mydf.1, mydf.1$myrow)

# loop 3: keep the first (BH) column of each split; here i = 1 is the
# column argument of `[`, not the loop index
myres.3 <- list()
for (i in seq(1, nrow(datain))) {
  myres.3 <- lapply(myorder, "[", i = 1)
}
myorder.def <- data.frame(cbind(lapply(myres.3, paste0, collapse = "/")))
colnames(myorder.def) <- "BH"

# strip the deparsed list syntax left in each cell
myorder.def$BH <- str_replace_all(myorder.def$BH, "c", "")
myorder.def$BH <- str_replace_all(myorder.def$BH, "\\(", "")
myorder.def$BH <- str_replace_all(myorder.def$BH, "\\)", "")
myorder.def$BH <- str_replace_all(myorder.def$BH, "\"", "")
myorder.def$BH <- str_replace_all(myorder.def$BH, ", ", "/")
data.out <- cbind(datain, myorder.def)
data.out
Stef
An option in base R would be to loop over the columns of the dataset with lapply(), replace each digits/digits pair with a difference expression by capturing the two numbers and swapping the backreferences ((\\d+)/(\\d+) becomes (\\2-\\1)), and then eval(parse()) the resulting string:
datain[paste0(names(datain), ".sum")] <- lapply(datain, function(y)
  sapply(gsub("(\\d+)/(\\d+)", "(\\2-\\1)", y),
         function(x) eval(parse(text = x))))
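To see the substitution at work on a single cell (an illustration, not part of the answer's code):
gsub("(\\d+)/(\\d+)", "(\\2-\\1)", "3/4+6/8")
# [1] "(4-3)+(8-6)"
eval(parse(text = "(4-3)+(8-6)"))
# [1] 3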
-checking with OP's output
> datain
              A         B    C A.sum B.sum C.sum
1 3/4+6/8+11/16 0/5+15/20  0/5     8    10     5
2           0/5      5/10 3/10     5     5     7
> dataout
              A         B    C A.sum B.sum C.sum
1 3/4+6/8+11/16 0/5+15/20  0/5     8    10     5
2           0/5      5/10 3/10     5     5     7
Or with tidyverse: group by rows, loop across all the columns, read each string into a data.frame with read.table(), subtract the columns, take the sum, and return the results as new columns by modifying .names:
library(dplyr)
library(stringr)
datain %>%
  rowwise %>%
  mutate(across(everything(), ~ sum(with(read.table(text =
    str_replace_all(.x, fixed("+"), "\n"), sep = "/",
    header = FALSE), V2 - V1)), .names = "{.col}.sum")) %>%
  ungroup
-output
# A tibble: 2 × 6
  A             B         C     A.sum B.sum C.sum
  <chr>         <chr>     <chr> <int> <int> <int>
1 3/4+6/8+11/16 0/5+15/20 0/5       8    10     5
2 0/5           5/10      3/10      5     5     7
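To see what the inner read.table() step produces for a single cell (an illustration):
cell <- "0/5+15/20"
read.table(text = str_replace_all(cell, fixed("+"), "\n"),
           sep = "/", header = FALSE)
#   V1 V2
# 1  0  5
# 2 15 20
V2 - V1 gives c(5, 5), and sum() collapses that to the 10 seen in B.sum.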
Another base R approach might be the following. First split by +, then split again by /, taking the sum of differences in the resulting values.
datain[paste0(names(datain), ".sum")] <-
  lapply(datain, function(x) {
    sapply(strsplit(x, "[+]"), function(y) {
      sum(sapply(strsplit(y, "[/]"), function(z) {
        diff(as.numeric(z))
      }))
    })
  })
datain
Output
              A         B    C A.sum B.sum C.sum
1 3/4+6/8+11/16 0/5+15/20  0/5     8    10     5
2           0/5      5/10 3/10     5     5     7
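Tracing the two nested strsplit() calls on one cell shows where the differences come from (an illustration):
strsplit("0/5+15/20", "[+]")
# [[1]]
# [1] "0/5"   "15/20"
sapply(strsplit(c("0/5", "15/20"), "[/]"),
       function(z) diff(as.numeric(z)))
# [1] 5 5
Summing these gives the 10 reported in B.sum.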
Update:
Slightly improved:
library(dplyr)
library(tidyr)
library(data.table)
datain %>%
  pivot_longer(everything()) %>%
  separate_rows(value, sep = "\\+|\\/", convert = TRUE) %>%
  group_by(group = rleid(name)) %>%
  mutate(value = value - lag(value, default = value[1])) %>%
  slice(which(row_number() %% 2 == 0)) %>%
  mutate(value = sum(value),
         name = paste0(name, ".sum")) %>%
  slice(1) %>%
  ungroup() %>%
  select(-group) %>%
  group_by(name) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-id) %>%
  cbind(datain)
This line
  separate_rows(value, sep = "\\+|\\/", convert = TRUE) %>%
is the same as
  separate_rows(value, sep = "\\+") %>%
  separate_rows(value, sep = "\\/") %>%
  type.convert(as.is = TRUE) %>%
The very, very long way until the finish :-)
library(dplyr)
library(tidyr)
library(data.table)
datain %>%
  pivot_longer(everything()) %>%
  separate_rows(value, sep = "\\+") %>%
  separate_rows(value, sep = "\\/") %>%
  group_by(group = as.integer(gl(n(), 2, n()))) %>%
  type.convert(as.is = TRUE) %>%
  mutate(x = value - lag(value, default = value[1])) %>%
  ungroup() %>%
  group_by(group = rleid(name)) %>%
  mutate(x = sum(x)) %>%
  mutate(labels = paste0(name, ".sum")) %>%
  slice(1) %>%
  ungroup() %>%
  select(-c(name, group, value)) %>%
  pivot_wider(names_from = labels,
              values_from = x,
              values_fn = list) %>%
  unnest(cols = c(A.sum, B.sum, C.sum)) %>%
  cbind(datain)
  A.sum B.sum C.sum             A         B    C
1     8    10     5 3/4+6/8+11/16 0/5+15/20  0/5
2     5     5     7           0/5      5/10 3/10

Ontime percentage calculations

I need to calculate the overall ontime percentage of each airline with this sample dataset.
library(tidyverse)
library(dplyr)
df_chi <- tribble(
  ~airline,   ~ontime, ~qty, ~dest,
  'delta',    TRUE,     527, 'CHI',
  'delta',    FALSE,     92, 'CHI',
  'american', TRUE,    4229, 'CHI',
  'american', FALSE,    825, 'CHI'
)
df_nyc <- tribble(
  ~airline,   ~ontime, ~qty, ~dest,
  'delta',    TRUE,    1817, 'NYC',
  'delta',    FALSE,    567, 'NYC',
  'american', TRUE,    1651, 'NYC',
  'american', FALSE,    625, 'NYC'
)
I have a solution, although it is verbose and I want to avoid the numbered index, i.e. [2,2]. Is there a more elegant way using more of the tidyverse?
df_all <- bind_rows(df_chi, df_nyc)
delta_ot <- df_all %>%
  filter(airline == "delta") %>%
  group_by(ontime) %>%
  summarize(total = sum(qty))
delta_ot <- delta_ot[2, 2] / sum(delta_ot$total)
american_ot <- df_all %>%
  filter(airline == "american") %>%
  group_by(ontime) %>%
  summarize(total = sum(qty))
american_ot <- american_ot[2, 2] / sum(american_ot$total)
As the ontime column is logical, use it to subset instead of [2, 2]. Also, instead of filtering, do this once by adding airline as a grouping column:
library(dplyr)
bind_rows(df_chi, df_nyc) %>%
  group_by(airline, ontime) %>%
  summarise(total = sum(qty), .groups = 'drop_last') %>%
  summarise(total = total[ontime] / sum(total))
-output
# A tibble: 2 × 2
  airline  total
  <chr>    <dbl>
1 american 0.802
2 delta    0.781
Subsetting by a logical vector returns the values at the positions where the elements are TRUE:
> c(1, 3, 5)[c(FALSE, TRUE, FALSE)]
[1] 3
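Since ontime is logical, the same shares can also be computed in a single summarise() as a weighted mean; a sketch of an equivalent alternative:
bind_rows(df_chi, df_nyc) %>%
  group_by(airline) %>%
  summarise(total = weighted.mean(ontime, qty))
weighted.mean() coerces the logical to 0/1, so this is sum(qty[ontime]) / sum(qty) per airline.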

Group_by inside a function

I am trying to use the group_by function inside of a function, but it doesn't seem to work. I found an example in another post, shown below (this works):
dat <- mtcars[c(2:4, 11)]
grp <- function(x) {
  group_by(dat, !!as.name(x)) %>%
    summarise(n = n()) %>%
    mutate(pc = scales::percent(n / sum(n))) %>%
    arrange(desc(n)) %>%
    head()
}
lapply(colnames(dat), grp)
What I don't understand is why I need the data frame name in the group_by function - doesn't group_by normally work this way:
data %>% group_by(lgID) %>% summarise(mean_run = mean(HR))
where the data is piped to the group_by function?
Also, why do I need '!!as.name(x)' - what does this do?
Further, why does the version shown above work while the version shown below doesn't?
grp <- function(x) {
  group_by(x) %>%
    summarise(n = n()) %>%
    mutate(pc = scales::percent(n / sum(n))) %>%
    arrange(desc(n)) %>%
    head()
}
lapply(colnames(dat), grp)
Obviously I am missing something here!
Best regards
Deepak
If we need to pass both indexes and strings as 'x', wrap it inside across() within group_by():
library(dplyr) # version >= 1.0.0
f1 <- function(data, x) {
  data %>%
    group_by(across(all_of(x))) %>%
    summarise(n = n(), .groups = 'drop') %>%
    mutate(pc = scales::percent(n / sum(n))) %>%
    arrange(desc(n)) %>%
    head()
}
If we have an older version, use group_by_at(x)
-apply the function
out1 <- lapply(colnames(dat), function(x) f1(dat, x))
Or use index
out2 <- lapply(seq_along(dat), function(i) f1(dat, i))
identical(out1, out2)
#[1] TRUE
-output
out1[[1]]
# A tibble: 3 x 3
# cyl n pc
# <dbl> <int> <chr>
#1 8 14 43.8%
#2 4 11 34.4%
#3 6 7 21.9%
out2[[1]]
# A tibble: 3 x 3
# cyl n pc
# <dbl> <int> <chr>
#1 8 14 43.8%
#2 4 11 34.4%
#3 6 7 21.9%
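For completeness, with dplyr >= 1.0 the same function can also be written with the .data pronoun when 'x' is always a string (a sketch; unlike across(all_of(x)), this does not accept a column index):
f2 <- function(data, x) {
  data %>%
    group_by(.data[[x]]) %>%
    summarise(n = n(), .groups = 'drop') %>%
    mutate(pc = scales::percent(n / sum(n))) %>%
    arrange(desc(n)) %>%
    head()
}
lapply(colnames(dat), function(x) f2(dat, x))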

Select rows by ID with most matches

I have a data frame like this:
df <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
                 torre = c("a", "a", "b", "d", "a", "q", "t", "q", "g", "a", "b", "c"))
and I would like my code to select, for each id, the torre that repeats the most, or the last torre for the id if there isn't one that repeats more than the others, so I'll get a new data frame like this:
df2 <- data.frame(id = c(1,2,3,4), torre = c("a","a","q","c"))
You can use aggregate:
aggregate(torre ~ id, data=df,
FUN=function(x) names(tail(sort(table(factor(x, levels=unique(x)))),1))
)
The full explanation for this function is a bit involved, but most of the job is done by the FUN= parameter. In this case we are making a function that gets the frequency counts for each torre, sorts them in increasing order, takes the last one with tail(, 1), and extracts its name. The aggregate() function then applies this function separately for each id.
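To see the FUN= helper at work, run it step by step on the torre values for id 3 (an illustration):
x <- c("q", "t", "q", "g")
sort(table(factor(x, levels = unique(x))))
# t g q
# 1 1 2
names(tail(sort(table(factor(x, levels = unique(x)))), 1))
# [1] "q"
Because levels = unique(x) keeps first-appearance order and the sort is stable, an id where every torre occurs once (such as id 4) resolves to the last torre.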
You could do this using the dplyr package: group by id and torre to calculate the number of occurrences of each torre/id combination, then group by id only and select the last occurrence of torre that has the highest in-group frequency.
library(dplyr)
df %>%
  group_by(id, torre) %>%
  mutate(n = n()) %>%
  group_by(id) %>%
  filter(n == max(n)) %>%
  slice(n()) %>%
  select(-n)
     id torre
  <dbl> <chr>
1     1     a
2     2     a
3     3     q
4     4     c
An approach with the data.table package:
library(data.table)
setDT(df)[, .N, by = .(id, torre)][order(N), .(torre = torre[.N]), by = id]
which gives:
id torre
1: 1 a
2: 2 a
3: 3 q
4: 4 c
And two possible dplyr alternatives:
library(dplyr)
# option 1
df %>%
  group_by(id, torre) %>%
  mutate(n = n()) %>%
  group_by(id) %>%
  mutate(f = rank(n, ties.method = "first")) %>%
  filter(f == max(f)) %>%
  select(-n, -f)
# option 2
df %>%
  group_by(id, torre) %>%
  mutate(n = n()) %>%
  distinct() %>%
  arrange(n) %>%
  group_by(id) %>%
  slice(n()) %>%
  select(-n)
Yet another dplyr solution, this time using add_count() instead of mutate():
df %>%
  add_count(id, torre) %>%
  group_by(id) %>%
  filter(n == max(n)) %>%
  slice(n()) %>%
  select(-n)
# A tibble: 4 x 2
# Groups: id [4]
id torre
<dbl> <fct>
1 1. a
2 2. a
3 3. q
4 4. c

Create custom dplyr data transformation function in R

I need to repeat an operation many times for different combinations of two variables (trying to create data for stacked barplots showing percentages). Could anyone turn the code below into a function (of the dataset and the two variables x and y) in order to create the new data sets quickly? Or give me a good reference or link for learning about functions and dplyr. Thanks.
dat <- df %>%
  select(x, y) %>%
  group_by(x, y) %>%
  summarise(n = n()) %>%
  mutate(percentage = round(n / sum(n) * 100, 1)) %>%
  ungroup() %>%
  group_by(x) %>%
  mutate(pos = cumsum(percentage) - (0.5 * percentage)) %>%
  ungroup()
dat
As suggested in the comments above, step-by-step explanations can be found here: dplyr.tidyverse.org/articles/programming.html
This guide explains the quo() function and the !! operator.
For your example you can create a function like so:
df1 <- data.frame(x1 = c(rep(3, 5), rep(7, 2)),
                  y1 = c(rep(2, 4), rep(5, 3)))
my.summary <- function(df, x, y){
  df %>%
    select(!!x, !!y) %>%
    group_by(!!x, !!y) %>%
    summarise(n = n()) %>%
    mutate(percentage = round(n / sum(n) * 100, 1)) %>%
    ungroup() %>%
    group_by(!!x) %>%
    mutate(pos = cumsum(percentage) - (0.5 * percentage)) %>%
    ungroup()
}
my.summary(df1, quo(x1), quo(y1))
# # A tibble: 3 x 5
# x1 y1 n percentage pos
# <dbl> <dbl> <int> <dbl> <dbl>
# 1 3 2 4 80 40
# 2 3 5 1 20 90
# 3 7 5 2 100 50
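With more recent dplyr (rlang >= 0.4), the quo()/!! pair can be replaced by embracing the arguments with {{ }}, so callers pass bare column names; a sketch of the same summary:
my.summary2 <- function(df, x, y) {
  df %>%
    count({{ x }}, {{ y }}, name = "n") %>%
    group_by({{ x }}) %>%
    mutate(percentage = round(n / sum(n) * 100, 1),
           pos = cumsum(percentage) - (0.5 * percentage)) %>%
    ungroup()
}
my.summary2(df1, x1, y1)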
