library(dplyr)
mydf <- data.frame(a_x = c(1,2,3,4,5),
b_x = c(8,9,10,11,12),
a_y = c("k",'b','a','d','z'),
b_y = c('aa','bb','cc','dd','ee'),
prefix=c("a","b","c","a","a"))
mydf
Assuming that the data I have is mydf, I would like to produce the same result as mydf2.
I made a column with the name of the column containing the value to be extracted.
I want to extract the value through this column.
mydf2 <- data.frame(a_x=c(1,2,3,4,5),
b_x=c(8,9,10,11,12),
prefix=c("a","b","c","a","a"),
desired_x_value = c(1,9,NA,4,5),
desired_y_value = c('k','bb',NA,'d','z'))
mydf2
I've used 'get' and 'paste0' but it doesn't work. Can I solve this problem through 'dplyr' chain?
mydf %>% mutate(desired_x_value = get(paste0(prefix,"_x")),
desired_y_value = get(paste0(prefix,"_y")))
So basically you want to create new columns (desired_x_value and desired_y_value) whose values depend on a condition. With dplyr I prefer case_when because it is the most readable way to do this, but you could also use (nested) if/else statements. It reads as: "if X meets condition A do Y, if X meets condition B do Z, if X meets condition ... do ..."
mydf %>%
dplyr::mutate(
desired_x_value = case_when(
prefix == "a" ~ a_x,
prefix == "b" ~ b_x,
TRUE ~ NA_real_),
desired_y_values = case_when(
prefix == "a" ~ a_y,
prefix == "b" ~ b_y,
TRUE ~ NA_character_))
You can remove the columns you no longer need in a second step if you want. The code above results in the table:
a_x b_x a_y b_y prefix desired_x_value desired_y_values
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc c NA <NA>
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
You can write a helper function for this :
get_value <- function(data, prefix, group) {
data[cbind(1:nrow(data), match(paste(prefix, group, sep = '_'), names(data)))]
}
mydf %>%
mutate(desired_x_value = get_value(select(., ends_with('_x')), prefix, 'x'),
desired_y_value = get_value(select(., ends_with('_y')), prefix, 'y'))
# a_x b_x a_y b_y prefix desired_x_value desired_y_value
#1 1 8 k aa a 1 k
#2 2 9 b bb b 9 bb
#3 3 10 a cc c NA <NA>
#4 4 11 d dd a 4 d
#5 5 12 z ee a 5 z
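The helper above relies on base R matrix indexing: a two-column matrix of (row, column) pairs picks exactly one cell per row. The following is a minimal sketch of that mechanism on its own, using a trimmed copy of mydf; the names x_cols, col_idx and picked are illustrative and not part of the original answer.

```r
# Trimmed copy of the question's data (only the _x columns and prefix)
mydf <- data.frame(a_x = c(1, 2, 3, 4, 5),
                   b_x = c(8, 9, 10, 11, 12),
                   prefix = c("a", "b", "c", "a", "a"))

x_cols <- mydf[c("a_x", "b_x")]

# Column position of "<prefix>_x" for each row; NA when no such column exists
col_idx <- match(paste0(mydf$prefix, "_x"), names(x_cols))

# Each (row, column) pair selects one cell; an NA column index yields NA
picked <- as.matrix(x_cols)[cbind(seq_len(nrow(x_cols)), col_idx)]
picked
```

Because match() returns NA for the unknown prefix "c", the third row comes back as NA without any explicit ifelse.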
A simple rowwise() also works (str_detect() comes from the stringr package, which must be loaded).
mydf %>% rowwise() %>%
mutate(desired_x = ifelse(any(str_detect(names(mydf)[-5], prefix)),
get(paste(prefix, 'x', sep = '_')), NA),
desired_y = ifelse(any(str_detect(names(mydf)[-5], prefix)),
get(paste(prefix, 'y', sep = '_')), NA))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x desired_y
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc c NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
If the prefix column contains only valid column prefixes, this works without the ifelse statement.
mydf <- data.frame(a_x = c(1,2,3,4,5),
b_x = c(8,9,10,11,12),
a_y = c("k",'b','a','d','z'),
b_y = c('aa','bb','cc','dd','ee'),
prefix=c("a","b","a","a","a"))
mydf %>% rowwise() %>%
mutate(desired_x = get(paste(prefix, 'x', sep = '_')),
desired_y = get(paste(prefix, 'y', sep = '_')))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x desired_y
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc a 3 a
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
First, I am not presenting this as a good solution; the other proposed solutions are better and simpler. However, since you brought up the get function, I wanted to show how to use it to produce your desired output. Some of the values in your prefix column, such as c, have no match among your column names, so get throws an error and terminates execution; unlike mget, it has no ifnotfound argument. You therefore need to work around that error with an ifelse:
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
library(glue)
mydf1 %>%
mutate(desired_x_value = map(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(glue("{.x}_x")), NA)),
desired_y_value = map(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(glue("{.x}_y")), NA))) %>%
unnest(cols = c(desired_x_value, desired_y_value))
# A tibble: 5 x 7
a_x b_x a_y b_y prefix desired_x_value desired_y_value
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc NA NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
You can also use paste instead of glue, and if we already know the output types of the desired columns, we can drop the final unnest step:
mydf1 %>%
mutate(desired_x_value = map_dbl(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(paste(.x, "x", sep = "_")), NA)),
desired_y_value = map_chr(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(paste(.x, "y", sep = "_")), NA)))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x_value desired_y_value
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc NA NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
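The answer above mentions that mget, unlike get, accepts an ifnotfound argument. As a rough base R sketch of that idea (the names keys, cols and desired_x_value are illustrative, and the data is a trimmed copy of mydf), missing columns can fall back to NA instead of raising an error:

```r
mydf <- data.frame(a_x = c(1, 2, 3, 4, 5),
                   b_x = c(8, 9, 10, 11, 12),
                   prefix = c("a", "b", "c", "a", "a"))

# Column name each row asks for, e.g. "a_x", "b_x", "c_x"
keys <- paste0(mydf$prefix, "_x")

# Fetch each requested column once; names that do not exist become a single NA
cols <- mget(unique(keys), envir = list2env(mydf), ifnotfound = NA_real_)

# Take the i-th value of the requested column (the lone NA is recycled)
desired_x_value <- vapply(
  seq_along(keys),
  function(i) rep_len(cols[[keys[i]]], nrow(mydf))[i],
  numeric(1)
)
desired_x_value
```

This avoids the ifelse guard entirely, at the cost of stepping outside the dplyr chain.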
Edit.
I've rewritten the question, hoping it makes more sense.
Given this data:
> df
Cat1 Cat2 Q
1 A B 1
2 A C 1
3 B D 1
4 B C 1
5 C C 1
6 C D 1
You can easily group by Cat1 and sum Q using dplyr:
> df %>% group_by(Cat1) %>% summarise(Sum1 = sum(Q))
# A tibble: 3 x 2
Cat1 Sum1
<fct> <dbl>
1 A 2
2 B 2
3 C 2
Now, my question is, as a next step, can you use the groups in the group by (i.e. A, B and C) to operate in the original table? For example, how could you sum Q when Cat2 equals each group?
Meaning, for A there is no match in Cat2, so the sum of Q would be 0. For B there is only a match in the first row, so the sum of Q would be 1. For C there is a match in the second, the fourth and the fifth row, so the sum of Q would be 3:
# A tibble: 3 x 3
Cat1 Sum1 Sum2
<fct> <dbl> <dbl>
1 A 2 0
2 B 2 1
3 C 2 3
Note that this is not what I'm asking:
> df %>% group_by(Cat1) %>% summarise(Sum1 = sum(Q), Sum2 = sum(Q[Cat1==Cat2]))
# A tibble: 3 x 3
Cat1 Sum1 Sum2
<fct> <dbl> <dbl>
1 A 2 0
2 B 2 0
3 C 2 1
@antoine-sac proposed in the comments to duplicate df and do a left join on Cat1 (grouped) = Cat2. Of course this would solve the problem, but it's not the question I'm trying to answer.
Code:
Cat1 <- c("A","A","B","B","C","C")
Cat2 <- c("B","C","D","C","C","D")
Cat1 <- factor(Cat1, levels = c("A","B","C","D"))
Cat2 <- factor(Cat2, levels = c("A","B","C","D"))
Q <- c(1,1,1,1,1,1)
df <- data.frame(Cat1, Cat2, Q)
I think a join is the cleanest way to do it. Think about yourself reading your code again in 6 months: you want the meaning of your code to be obvious.
library("dplyr")
df <- read.table(text = " Cat1 Cat2 Q
1 A B 1
2 A C 1
3 B D 1
4 B C 1
5 C C 1
6 C D 1", stringsAsFactors = FALSE)
df1 <- df %>%
group_by(Cat1) %>%
summarise(Sum1 = sum(Q))
df2 <- df %>%
group_by(Cat2) %>%
summarise(Sum2 = sum(Q))
full_join(df1, df2, by = c("Cat1" = "Cat2")) %>%
tidyr::replace_na(list(Sum1 = 0, Sum2 = 0))
# # A tibble: 4 x 3
# Cat1 Sum1 Sum2
# <chr> <dbl> <dbl>
# 1 A 2 0
# 2 B 2 1
# 3 C 2 3
# 4 D 0 2
With a full_join, you keep all values in Cat1 or Cat2 (A, B, C, D), but you can use a left_join (to keep A, B, C), a right_join (to keep B, C, D) or an inner_join (to keep B, C).
These are respectively the values in Cat1, in Cat2 or both in Cat1 and Cat2.
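To make the difference between the join types concrete, here is a small sketch that rebuilds the two summary tables by hand (values copied from the output above) and collects the keys each join keeps. It assumes dplyr is installed; the names keys, left_keys, right_keys and inner_keys are illustrative.

```r
library(dplyr)

# Summary tables equivalent to df1 and df2 in the answer
df1 <- tibble(Cat1 = c("A", "B", "C"), Sum1 = c(2, 2, 2))
df2 <- tibble(Cat2 = c("B", "C", "D"), Sum2 = c(1, 3, 2))

keys <- c("Cat1" = "Cat2")

# Keys that survive each kind of join
left_keys  <- sort(left_join(df1, df2, by = keys)$Cat1)   # values present in Cat1
right_keys <- sort(right_join(df1, df2, by = keys)$Cat1)  # values present in Cat2
inner_keys <- sort(inner_join(df1, df2, by = keys)$Cat1)  # values present in both

list(left = left_keys, right = right_keys, inner = inner_keys)
```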
It may seem painful, especially if you have a lot of categories, but if you have to do it more than once, it is actually easy to automate in a function.
EDIT: actually it is not easy at all if you want to use dplyr due to non-standard evaluation. Here's how you'd do it:
sum_cats <- function(df, cat1, cat2, value) {
cat1 <- enquo(cat1)
cat2 <- enquo(cat2)
value <- enquo(value)
sum1 <- paste0("Sum_", quo_name(cat1))
df1 <- df %>%
rename(cat = !! cat1) %>%
group_by(cat) %>%
summarise(!! sum1 := sum(!! value))
sum2 <- paste0("Sum_", quo_name(cat2))
df2 <- df %>%
rename(cat = !! cat2) %>%
group_by(cat) %>%
summarise(!! sum2 := sum(!! value))
full_join(df1, df2, by = "cat") %>%
tidyr::replace_na(rlang::list2(!! sum1 := 0, !! sum2 := 0))
}
Now you can just call sum_cats to do all the work:
df %>%
sum_cats(Cat1, Cat2, Q)
# cat Sum_Cat1 Sum_Cat2
# <chr> <dbl> <dbl>
# 1 A 2 0
# 2 B 2 1
# 3 C 2 3
# 4 D 0 2
You can try
df %>%
group_by(Cat1) %>%
summarise(sum1 = sum(Q),
sum2 = sum(ifelse(.$Cat2 == Cat1[1], Q, 0)))
# A tibble: 3 x 3
Cat1 sum1 sum2
<fct> <dbl> <dbl>
1 A 2 0
2 B 2 1
3 C 2 3
By using the .$ you will compare and sum up the ungrouped original data.
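The same logic can be spelled out without dplyr, which may make it easier to see what is compared with what. This is only a sketch in base R, using the data from the question with plain character columns; groups and res are illustrative names.

```r
Cat1 <- c("A", "A", "B", "B", "C", "C")
Cat2 <- c("B", "C", "D", "C", "C", "D")
Q <- c(1, 1, 1, 1, 1, 1)
df <- data.frame(Cat1, Cat2, Q, stringsAsFactors = FALSE)

# One output row per distinct value of Cat1
groups <- unique(df$Cat1)

res <- data.frame(
  Cat1 = groups,
  # Sum of Q over rows whose Cat1 equals the group
  Sum1 = vapply(groups, function(g) sum(df$Q[df$Cat1 == g]),
                numeric(1), USE.NAMES = FALSE),
  # Sum of Q over rows of the whole table whose Cat2 equals the group
  Sum2 = vapply(groups, function(g) sum(df$Q[df$Cat2 == g]),
                numeric(1), USE.NAMES = FALSE)
)
res
```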
You probably could construct a new column and summarise from the new column:
df %>% mutate(new_Quantity=ifelse(Start == End, Quantity,0)) %>% group_by(Start) %>% summarise(Sum = sum(new_Quantity))
I have the following data frame (with 1000's of columns):
df<- structure(c(1, 2, 2, 1, 2, 2, 2, 1, 3, 3, 2, 2),
.Dim = 4:3, .Dimnames = list(c("a", "b", "c", "d"),
c("t1", "t2", "t3")))
What would be an efficient way to get average of every two rows?
Result I want:
t1 t2 t3
a 1 2 3
b 2 2 3
a_b 1.5 2 3
c 2 2 2
d 1 1 2
c_d 1.5 1.5 2
Split on every 2 rows, then get the mean per column and rbind it to the pair, and finally rbind all the pieces together.
do.call(rbind,
lapply(seq(1, nrow(df), 2), function(i){
x <- df[ i:(i + 1), , drop = FALSE]
res <- rbind(x, colSums(x)/2)
rownames(res)[ nrow(res) ] <- paste(rownames(x), collapse = "_")
res
}))
# t1 t2 t3
# a 1.0 2.0 3
# b 2.0 2.0 3
# a_b 1.5 2.0 3
# c 2.0 2.0 2
# d 1.0 1.0 2
# c_d 1.5 1.5 2
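For a matrix with thousands of columns, the same pairwise averaging can also be done without a per-pair loop. The sketch below is one possible vectorised variant using base R's rowsum(); it assumes an even number of rows, and grp, pair_means and out are illustrative names.

```r
df <- structure(c(1, 2, 2, 1, 2, 2, 2, 1, 3, 3, 2, 2),
                .Dim = 4:3,
                .Dimnames = list(c("a", "b", "c", "d"), c("t1", "t2", "t3")))

# Pair id for each row: 1, 1, 2, 2, ...
grp <- rep(seq_len(nrow(df) / 2), each = 2)

# Column sums per pair, divided by the pair size
pair_means <- rowsum(df, grp) / 2
rownames(pair_means) <- unname(
  vapply(split(rownames(df), grp), paste, character(1), collapse = "_")
)

# Interleave: sort each mean row directly after the pair it belongs to
out <- rbind(df, pair_means)[order(c(grp, unique(grp) + 0.5)), ]
out
```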
One dplyr possibility could be:
df %>%
data.frame() %>%
rownames_to_column() %>%
mutate_if(is.factor, as.numeric) %>%
group_by(group = gl(n()/2, 2)) %>%
group_map(~ bind_rows(.x, tibble(rowname = paste(.x$rowname, collapse = "_"),
t1 = mean(.x$t1),
t2 = mean(.x$t2),
t3 = mean(.x$t3)))) %>%
ungroup() %>%
select(-group)
rowname t1 t2 t3
<chr> <dbl> <dbl> <dbl>
1 a 1 2 2
2 b 2 2 2
3 a_b 1.5 2 2
4 c 2 2 1
5 d 1 1 1
6 c_d 1.5 1.5 1
The first three steps of the pipeline could be omitted if you create the data beforehand as a data.frame, with the names as a column and the factors as numeric variables. The code first creates a grouping variable using gl(). Second, it calculates the means, builds the name from the two elements in the group, and binds the result to the original data. Finally, it ungroups and removes the redundant variable.
A base R solution that works with any number of columns:
M <- matrix(unlist(c(df)), ncol = 2, byrow = TRUE)
M <- cbind(M, rowMeans(M))
M <- matrix(c(t(M)),ncol = ncol(df), byrow = FALSE)
# add row names and column names
row.names <- matrix(rownames(df), ncol = 2 ,byrow = TRUE)
rownames(M) <- c(t(cbind(row.names, apply(row.names,1, paste, collapse = "_"))))
colnames(M) <- colnames(df)
# t1 t2 t3
# a 1.0 2.0 3
# b 2.0 2.0 3
# a_b 1.5 2.0 3
# c 2.0 2.0 2
# d 1.0 1.0 2
# c_d 1.5 1.5 2
Another dplyr approach.
Update: If you really need the row names (a, b, a_b, etc) see after my original solution for a scalable, but more convoluted, version.
Original
df <- df %>% as_tibble()
n <- nrow(df)/2
orig <- df %>% mutate(grp = rep(1:n, each = 2))
means <- orig %>% group_by(grp) %>% summarise_all(mean)
bind_rows(orig, means) %>% arrange(grp) %>% select(-grp)
Output:
# A tibble: 6 x 3
t1 t2 t3
<dbl> <dbl> <dbl>
1 1 2 3
2 2 2 3
3 1.5 2 3
4 2 2 2
5 1 1 2
6 1.5 1.5 2
Updated with row names
rnames <- row.names(df)
df <- df %>% as_tibble()
n <- (nrow(df)/2)
orig <- df %>%
mutate(grp = rep(1:n, each = 2), rn = rnames)
means <- orig %>%
group_by(grp) %>%
mutate(rn = paste0(rn, collapse="_")) %>%
ungroup() %>%
group_by(rn) %>%
summarise_if(is.numeric, mean)
bind_rows(orig, means) %>% arrange(grp) %>% select(-grp)
Output:
t1 t2 t3 rn
<dbl> <dbl> <dbl> <chr>
1 1 2 3 a
2 2 2 3 b
3 1.5 2 3 a_b
4 2 2 2 c
5 1 1 2 d
6 1.5 1.5 2 c_d
One possibility is to use the dplyr package.
Note that the data I use is slightly different from the data you are using: in your data the numbers are actually character values.
df <- structure(c(1, 2, 2, 1, 2, 2, 2, 1, 3, 3, 2, 2),
.Dim = 4:3, .Dimnames = list(c("a", "b", "c", "d"),
c("t1", "t2", "t3")))
First I create the summary tibble (which contains the means).
library(dplyr)
df_summary <- df %>% as_tibble(rownames = "names") %>%
group_by(ceiling(1:n() / 2)) %>%
summarise(names = paste(names, collapse = "_"),
t1 = mean(t1),
t2 = mean(t2),
t3 = mean(t3)) %>%
select(-1)
# A tibble: 2 x 4
names t1 t2 t3
<chr> <dbl> <dbl> <dbl>
1 a_b 1.5 2 3
2 c_d 1.5 1.5 2
Then I combine the summary data with original data:
df_summary %>% bind_rows(df %>% as_tibble(rownames = "names")) %>%
slice(3, 4, 1, 5, 6, 2)
# A tibble: 6 x 4
names t1 t2 t3
<chr> <dbl> <dbl> <dbl>
1 a 1 2 3
2 b 2 2 3
3 a_b 1.5 2 3
4 c 2 2 2
5 d 1 1 2
6 c_d 1.5 1.5 2
This function averages rows based on a column named "group", which must be present in the dataset.
x is a data frame or a matrix.
rowm = function(x){
x = as.data.frame(x)
u = unique(x$group)
r = rep(NA, ncol(x)*length(u))
tempDF = matrix(r, ncol=ncol(x))
counter=0
for(i in u){
counter = counter+1
tempDF[counter, ] = colMeans(x[x$group==i, ], )
}
colnames(tempDF) = colnames(x)
return(tempDF)}
In an experiment, people had four candidates to choose from; sometimes they're male, other times they're female. In the below dataframe, C1 means Candidate 1, C2 means Candidate 2, and so on. F denotes female while M denotes male. A response of 1 indicates the person chose C1, a response of 2 indicates the person chose C2, and so on.
C1 C2 C3 C4 response
F F M M 2
M M F M 1
I want a new column "ChooseFemale" which equals 1 if the person chose a female candidate, and zero otherwise. So the first row should have ChooseFemale equal to 1, while the second row should have ChooseFemale equal to zero.
This would require me to look up a certain column depending on the value of "response" column.
How can I do this?
Another base R solution:
x <- df[["response"]]
df$ChooseFemale <- as.integer(df[cbind(seq_along(x), x)] == "F")
C1 C2 C3 C4 response ChooseFemale
1 F F M M 2 1
2 M M F M 1 0
Data:
Lines <- "C1 C2 C3 C4 response
F F M M 2
M M F M 1"
df <- read.table(text = Lines, header = TRUE, stringsAsFactors = FALSE)
# create dataframe
my.df <- data.frame(c1=c('f','m'),
c2=c('f','m'),
c3=c('m','f'),
c4=c('m','m'),
resp=c(2, 1))
# add column
my.df$ChooseFemale <- NA
# loop over rows
for (row in 1:nrow(my.df)){
# extract the column to check from response column
col <- paste0('c', my.df$resp[row])
# fill in new column
my.df$ChooseFemale[row] <- ifelse(my.df[row, col]=='f', 1, 0)
}
apply(df,1,function(x) ifelse(x[as.numeric(x['response'])]=='F',1,0))
[1] 1 0
Here is the basic idea: select the column using the value in response, then use apply with MARGIN=1 to apply this function row by row.
df[1,'response']
[1] 2
df[1,df[1,'response']]
[1] F
Levels: F M
data
df <- read.table(text = "
C1 C2 C3 C4 response
F F M M 2
M M F M 1
",header=T)
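As a small illustration of the "look up the column named by response" idea, here is a sketch that does the lookup one row at a time with vapply and then converts the result to 0/1. The data is the two-row example from the question built with data.frame(); cand_cols and chosen_sex are illustrative names.

```r
df <- data.frame(C1 = c("F", "M"),
                 C2 = c("F", "M"),
                 C3 = c("M", "F"),
                 C4 = c("M", "M"),
                 response = c(2L, 1L),
                 stringsAsFactors = FALSE)

cand_cols <- c("C1", "C2", "C3", "C4")

# For row i, read the cell in the column that response[i] points to
chosen_sex <- vapply(seq_len(nrow(df)),
                     function(i) df[i, cand_cols[df$response[i]]],
                     character(1))

# 1 if the chosen candidate is female, 0 otherwise
df$ChooseFemale <- as.integer(chosen_sex == "F")
df
```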
You can create a simple function to check whether the response number matches "F", and then apply it to each row at once.
A tidyverse approach:
library(tidyverse)
mydata <- data.frame(C1=sample(c("F","M"),10,replace = T),
C2=sample(c("F","M"),10,replace = T),
C3=sample(c("F","M"),10,replace = T),
C4=sample(c("F","M"),10,replace = T),
response=sample(c(1:4),10,replace = T),
stringsAsFactors = FALSE)
C1 C2 C3 C4 response
1 M M M M 1
2 F F F M 4
3 M F M M 2
4 F M M F 2
5 M M M F 1
6 M F M F 4
7 M M M F 3
8 M M M M 2
9 M F M M 3
10 F F M F 4
Custom function to check if the response matches "F"
female_choice <- function(C1, C2, C3, C4, response) {
c(C1, C2, C3, C4)[response] == "F"
}
And then just use mutate() to modify your dataframe, and pmap() to use its rows, one by one, as the set of arguments for female_choice()
mydata %>%
mutate(ChooseFemale = pmap_lgl(., female_choice))
C1 C2 C3 C4 response ChooseFemale
1 M M M M 1 FALSE
2 F F F M 4 FALSE
3 M F M M 2 TRUE
4 F M M F 2 FALSE
5 M M M F 1 FALSE
6 M F M F 4 TRUE
7 M M M F 3 FALSE
8 M M M M 2 FALSE
9 M F M M 3 FALSE
10 F F M F 4 TRUE
Here is one way to do it using tidyverse packages. As specified in the question, this takes into account both which candidate was chosen (C1-C4) and sex of the candidate (F/M):
# loading needed libraries
library(tidyverse)
# data
df <- utils::read.table(text = "C1 C2 C3 C4 response
F F M M 2
M M F M 1", header = TRUE) %>%
tibble::as_tibble(x = .) %>%
tibble::rowid_to_column(.)
# manipulation
dplyr::full_join(
# creating dataframe with the new chooseFemale variable
x = df %>%
tidyr::gather(
data = .,
key = "candidate",
value = "choice",
C1:C4
) %>%
dplyr::mutate(choice_new = paste("C", response, sep = "")) %>%
# creating the needed column by checking both the candidate chosen and
# the sex of the candidate
dplyr::mutate(chooseFemale = dplyr::case_when((choice_new == candidate) &
(choice == "F") ~ 1,
(choice_new == candidate) &
(choice == "M") ~ 0
)) %>%
dplyr::select(.data = ., -choice_new) %>%
tidyr::spread(data = ., key = candidate, value = choice) %>%
dplyr::filter(.data = ., !is.na(chooseFemale)) %>%
dplyr::select(.data = ., -c(C1:C4)),
# original dataframe
y = df,
by = c("rowid", "response")
) %>% # removing the redundant row id
dplyr::select(.data = ., -rowid) %>% # rearranging the columns
dplyr::select(.data = ., C1:C4, response, chooseFemale)
#> # A tibble: 2 x 6
#> C1 C2 C3 C4 response chooseFemale
#> <fct> <fct> <fct> <fct> <int> <dbl>
#> 1 F F M M 2 1
#> 2 M M F M 1 0
Created on 2018-08-24 by the reprex package (v0.2.0.9000).
I'll provide an answer in the tidyr format. Your data is in a "wide" format. This makes it very human readable, but not necessarily machine readable. The first step to making it more tidy is to convert the data to long format. In other words, let's transform the data so that we don't have to do calculations across multiple columns in a single row.
Tidy format allows you to use grouping variables, create summaries, etc.
library(dplyr)
library(tidyr)
df <- data.frame(C1 = c("F","M"),
C2 = c("F","M"),
C3 = c("M","F"),
C4 = c("M","M"),
stringsAsFactors = FALSE)
> df
C1 C2 C3 C4
1 F F M M
2 M M F M
Let's add an "id" field so we can keep track of each unique row. This is the same as the row number...but we are going to be converting the wide data to long data with different row numbers. Then use gather to convert from wide data to long data.
df_long <- df %>%
mutate(id = row_number()) %>%
gather(key = "key", value = "value",C1:C4)
> df_long
id key value
1 1 C1 F
2 2 C1 M
3 1 C2 F
4 2 C2 M
5 1 C3 M
6 2 C3 F
7 1 C4 M
8 2 C4 M
Now it is possible to use group_by() to group based on variables, perform summaries, etc.
For what you've asked you group by the id column and then perform calculations on the group. In this case we will take the sum of all values that are "F". Then we ungroup and spread back to the wide / human readable format.
df_long <- df_long %>%
group_by(id) %>%
mutate(response = sum(value=="F",na.rm=TRUE)) %>%
ungroup()
> df_long
# A tibble: 8 x 4
id key value response
<int> <chr> <chr> <int>
1 1 C1 F 2
2 2 C1 M 1
3 1 C2 F 2
4 2 C2 M 1
5 1 C3 M 2
6 2 C3 F 1
7 1 C4 M 2
8 2 C4 M 1
To get the data back in wide format once you are done doing all calculations that you need in long format:
df <- df_long %>%
spread(key,value)
> df
# A tibble: 2 x 6
id response C1 C2 C3 C4
<int> <int> <chr> <chr> <chr> <chr>
1 1 2 F F M M
2 2 1 M M F M
To get the data back in the order you had it:
df <- df %>%
select(-id) %>%
select(C1:C4,everything())
> df
# A tibble: 2 x 5
C1 C2 C3 C4 response
<chr> <chr> <chr> <chr> <int>
1 F F M M 2
2 M M F M 1
You can of course use the pipes to do this all in one step.
df <- df %>%
mutate(id = row_number()) %>%
gather(key = "key", value = "value",C1:C4) %>%
group_by(id) %>%
mutate(response = sum(value=="F",na.rm=TRUE)) %>%
ungroup() %>%
spread(key,value) %>%
select(-id) %>%
select(C1:C4,everything())
I have the following data frame:
dat <- read_csv(
"s1,s2,v1,v2
a,b,10,20
a,b,22,NA
a,b,13,33
c,d,3,NA
c,d,4.5,NA
c,d,10,20"
)
dat
#> # A tibble: 6 x 4
#> s1 s2 v1 v2
#> <chr> <chr> <dbl> <int>
#> 1 a b 10.0 20
#> 2 a b 22.0 NA
#> 3 a b 13.0 33
#> 4 c d 3.0 NA
#> 5 c d 4.5 NA
#> 6 c d 10.0 20
What I want to do is:
1. Filter rows based on v1 values
2. Group by s1 and s2
3. Count total lines in every group
4. Count lines in every group where v2 is not NA
For example with v1_filter >= 0 we get this:
s1 s2 total_line non_na_line
a b 3 2
c d 3 1
And with v1_filter >= 10 we get this:
s1 s2 total_line non_na_line
a b 2 1
c d 1 1
How can I achieve that with data.table or dplyr?
In reality we have around 31M rows in dat, so we need a fast method.
I'm stuck with this
library(data.table)
dat <- data.table(dat)
v1_filter = 0
dat[, v1 >= v1_filter,
by=list(s1,s2)]
Using sum should help. Operating on a logical vector, it treats each TRUE as 1 and FALSE as 0, so you can easily do this:
dat %>%
group_by(s1, s2) %>%
summarise(total_lines = n(),
non_na_line = sum(!is.na(v2)))
# A tibble: 2 x 4
# Groups: s1 [?]
s1 s2 total_lines non_na_line
<chr> <chr> <int> <int>
1 a b 3 2
2 c d 3 1
You'll easily be able to add in a filter between group_by and summarise, to get what you want. Keep in mind that summarise will only retain columns that you group by.
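Putting the pieces together, the filter threshold can be made a parameter. The following is only a sketch: count_lines is a hypothetical helper name, the data is the six-row example from the question typed in by hand, and the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)

dat <- tibble(
  s1 = c("a", "a", "a", "c", "c", "c"),
  s2 = c("b", "b", "b", "d", "d", "d"),
  v1 = c(10, 22, 13, 3, 4.5, 10),
  v2 = c(20L, NA, 33L, NA, NA, 20L)
)

# Hypothetical helper: filter on v1, then count all rows and non-NA v2 per group
count_lines <- function(dat, v1_filter) {
  dat %>%
    filter(v1 >= v1_filter) %>%
    group_by(s1, s2) %>%
    summarise(total_line = n(),
              non_na_line = sum(!is.na(v2)),  # TRUE counts as 1, FALSE as 0
              .groups = "drop")
}

res0 <- count_lines(dat, 0)
res0
```

Filtering before grouping means the grouping step only has to process the rows that survive the filter.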
Benchmark
For what it's worth, I ran a quick benchmark, with some test data of similar size as yours.
s1charMix <- rep(letters[seq(from = 1, to = 10)], length.out = 30000000)
s2charMix <- rep(letters[seq(from = 11, to = 20)], length.out = 30000000)
s1chars <- sample(s1charMix, 30000000)
s2chars <- sample(s2charMix, 30000000)
v1Nums <- runif(30000000, min = 0, max = 20)
nomissing <- sample(1:200000,1)
int.mix <- rbinom(30000000 - nomissing, 30, 0.3)
nalist <- rep(NA, nomissing)
v2NumsNA <- sample(x = c(int.mix, nalist), 30000000)
df <- tibble(s1 = s1chars, s2 = s2chars, v1 = v1Nums, v2 = v2NumsNA)
This should roughly replicate the size and type of the data you suggest:
df
# A tibble: 30,000,000 x 4
s1 s2 v1 v2
<chr> <chr> <dbl> <int>
1 d s 9.2123603 7
2 b q 16.6638639 11
3 g o 18.3682028 11
4 g s 0.8779067 9
5 a s 0.0719127 10
6 b q 16.8809193 12
7 h q 15.4382455 6
8 e k 2.3565489 11
9 h p 16.4508811 9
10 d n 2.7283823 11
# ... with 29,999,990 more rows
df %>%
filter(is.na(v2))
# A tibble: 116,924 x 4
s1 s2 v1 v2
<chr> <chr> <dbl> <int>
1 d r 13.1448988 NA
2 b o 0.2703848 NA
3 b t 18.8319385 NA
4 a s 11.6448437 NA
5 j m 0.5388760 NA
6 i k 8.7098427 NA
7 d s 6.1149735 NA
8 h p 2.5552694 NA
9 g r 0.9057442 NA
10 b s 19.8886830 NA
# ... with 116,914 more rows
Now, let's benchmark dplyr operations vs data.table:
### dplyr
df %>%
filter(v1 > 10) %>%
group_by(s1, s2) %>%
summarise(total_lines = n(),
non_na_line = sum(!is.na(v2)))
# A tibble: 100 x 4
# Groups: s1 [?]
s1 s2 total_lines non_na_line
<chr> <chr> <int> <int>
1 a k 150327 149734
2 a l 149655 149062
3 a m 149794 149200
4 a n 149771 149197
5 a o 149495 148942
...
> system.time(df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
user system elapsed
1.848 0.420 2.290
> system.time(for (i in 1:100) df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
user system elapsed
187.657 55.878 245.528
### Data.table
library(data.table)
dat <- data.table(df)
> dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0]
s1 s2 N i.N
1: b q 149968 149348
2: g o 150411 149831
3: h q 150132 149563
4: h p 150786 150224
5: e o 149951 149353
...
> system.time(dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0])
user system elapsed
2.027 0.228 2.271
> system.time(for (i in 1:100) dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0])
user system elapsed
213.281 43.949 261.664
TL;DR: dplyr and data.table are similarly fast; if anything, dplyr is slightly faster.
> library(readr)
> dat <- read_csv(
+ "s1,s2,v1,v2
+ a,b,10,20
+ a,b,22,NA
+ a,b,13,33
+ c,d,3,NA
+ c,d,4.5,NA
+ c,d,10,20"
+ )
>
> dat
# A tibble: 6 x 4
s1 s2 v1 v2
<chr> <chr> <dbl> <int>
1 a b 10.0 20
2 a b 22.0 NA
3 a b 13.0 33
4 c d 3.0 NA
5 c d 4.5 NA
6 c d 10.0 20
Using data.table since you have big data
> library(data.table)
data.table 1.10.4
> dat=data.table(dat)
Without removing NA and keeping V1 filter as 0.1
> dat1=dat[v1>0.1,.N,.(s1,s2)]
> dat1
s1 s2 N
1: a b 3
2: c d 3
Removing v2 NA and keeping V1 filter as 0.1
> dat2=dat[v1>0.1&is.na(v2)==F,.N,.(s1,s2)]
> dat2
s1 s2 N
1: a b 2
2: c d 1
Merging the two and keeping V1 filter as 0
> dat[v1 > 0, .N, by = .(s1, s2)][ dat[v1 > 0 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0 ]
s1 s2 N i.N
1: a b 3 2
2: c d 3 1
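The two counts can also be computed in a single grouped pass rather than by joining two separate aggregations. This is a sketch of that alternative, not a benchmarked claim; it uses the six-row example typed in by hand, and v1_filter and res are illustrative names.

```r
library(data.table)

dat <- data.table(
  s1 = c("a", "a", "a", "c", "c", "c"),
  s2 = c("b", "b", "b", "d", "d", "d"),
  v1 = c(10, 22, 13, 3, 4.5, 10),
  v2 = c(20L, NA, 33L, NA, NA, 20L)
)

v1_filter <- 0

# i: row filter, j: both counts at once, by: grouping columns
res <- dat[v1 >= v1_filter,
           .(total_line = .N, non_na_line = sum(!is.na(v2))),
           by = .(s1, s2)]
res
```

Unlike the join with nomatch = 0, this form also keeps groups in which every v2 is NA, reporting non_na_line as 0 for them.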