I have the following problem: in a data frame I have many rows and columns, with the first column being the date. For each date I have more than one observation, and I want to summarize them.
My df looks like this (date replaced by ID for ease of use):
df:
ID Cash Price Weight ...
1 0.4 0 0
1 0.2 0 82 ...
1 0 1 0 ...
1 0 3.2 80 ...
2 0.3 1 70 ...
... ... ... ... ...
I want to group them by the first column and then summarize all rows, BUT with different functions:
Cash and Price should be aggregated with sum, so I get the sum of Cash and Price for each ID. Weight should be aggregated with max, so I only get the maximum weight for each ID.
Because I have so many columns, I cannot write all the functions by hand; only 2 columns should be summarized by max, and the rest should be summarized by sum.
So I am looking for a way to group by ID and summarize all columns with sum, except 2 specific columns for which I need the max value.
I tried to use the dplyr package with:
df %>% group_by(ID = tolower(ID)) %>% summarise_each(funs(sum))
But I need to take the max instead of the sum for the 2 specified columns. Any ideas?
To be clear, the output of the example df should be:
ID Cash Price Weight
1 0.6 4.2 82
2 0.3 1 70
As of dplyr 1.0.0 you can use across():
tribble(
  ~ID, ~max1, ~max2, ~sum1, ~sum2, ~sum3,
  1, 1, 1, 1, 2, 3,
  1, 2, 3, 1, 2, 3,
  2, 1, 1, 1, 2, 3,
  2, 3, 4, 2, 3, 4,
  3, 1, 1, 1, 2, 3,
  3, 4, 5, 3, 4, 5,
  3, NA, NA, NA, NA, NA
) %>%
  group_by(ID) %>%
  summarize(
    across(matches("max1|max2"), max, na.rm = TRUE),
    across(!matches("max1|max2"), sum, na.rm = TRUE)
  )
# ID max1 max2 sum1 sum2 sum3
# 1 2 3 2 4 6
# 2 3 4 3 5 7
# 3 4 5 4 6 8
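One caveat: since dplyr 1.1.0, passing extra arguments such as na.rm = TRUE through across() is deprecated, and wrapping the function in a lambda is the preferred form. A minimal sketch, assuming the tribble above is assigned to df:
# assumes the tribble above has been assigned to `df`
df %>%
  group_by(ID) %>%
  summarize(
    across(matches("max1|max2"), ~ max(.x, na.rm = TRUE)),  # lambda instead of extra args
    across(!matches("max1|max2"), ~ sum(.x, na.rm = TRUE))
  )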
We can use
df %>%
  group_by(ID) %>%
  summarise(Cash = sum(Cash), Price = sum(Price), Weight = max(Weight))
If we have many columns, one way would be to do this separately and then join the output together.
df1 <- df %>%
  group_by(ID) %>%
  summarise_each(funs(sum), Cash:Price)
df2 <- df %>%
  group_by(ID) %>%
  summarise_each(funs(max), Weight)
inner_join(df1, df2, by = "ID")
# ID Cash Price Weight
# (int) (dbl) (dbl) (int)
#1 1 0.6 4.2 82
#2 2 0.3 1.0 70
Or do it without the two separate grouped summaries:
library(dplyr)
set.seed(1492)
df <- data.frame(id = rep(c(1, 2), 3),
                 cash = rnorm(6, 0.5, 0.1),
                 price = rnorm(6, 0.5, 0.1) * 6,
                 weight = sample(100, 6))
df
## id cash price weight
## 1 1 0.4410152 2.484082 10
## 2 2 0.4101343 3.032529 93
## 3 1 0.3375889 2.305076 58
## 4 2 0.6047922 3.248851 55
## 5 1 0.4721711 3.209930 34
## 6 2 0.5362493 2.331530 99
custom_summarise <- function(do_df) {
  return(bind_cols(
    summarise_each(select(do_df, -weight), funs(sum)),
    summarise_each(select(do_df, weight), funs(max))
  ))
}
group_by(df, id) %>% do(custom_summarise(.))
## Source: local data frame [2 x 4]
## Groups: id [2]
##
## id cash price weight
## (dbl) (dbl) (dbl) (int)
## 1 3 1.250775 7.999089 58
## 2 6 1.551176 8.612910 99
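A side note: do() is superseded in current dplyr, and group_modify() is the recommended replacement. A sketch reusing custom_summarise (inside group_modify() the grouping column is excluded from .x, so id is no longer summed into the result):
# group_modify() passes each group without the grouping column
df %>% group_by(id) %>% group_modify(~ custom_summarise(.x))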
library(data.table)
setDT(df)
df[,.(Cash = sum(Cash),Price = sum(Price),Weight = max(Weight)),by=ID]
One way of doing this for 90+ columns is:
max_col <- 'Weight'
sum_col <- setdiff(colnames(df), c('ID', max_col))  # exclude the grouping column from the sums
query_1 <- paste0(sum_col, ' = sum(', sum_col, ')')
query_2 <- paste0(max_col, ' = max(', max_col, ')')
query_3 <- paste(query_1, collapse = ',')
query_4 <- paste(query_2, collapse = ',')
query_5 <- paste(query_3, query_4, sep = ',')
final_query <- paste0('df[,.(', query_5, '),by = ID]')
eval(parse(text = final_query))
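If you would rather not build code as a string, the same split-then-join idea works with plain data.table idioms. A minimal sketch on the same df (setDT already applied):
sum_cols <- setdiff(names(df), c('ID', 'Weight'))
sums <- df[, lapply(.SD, sum), by = ID, .SDcols = sum_cols]
maxs <- df[, lapply(.SD, max), by = ID, .SDcols = 'Weight']
sums[maxs, on = 'ID']  # join the two summaries back together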
Here is a solution based on this comment on an issue on the dplyr repo. I think it is general enough to be applied to more complicated cases.
library(tidyverse)
df <- tribble(
  ~ID, ~Cash, ~Price, ~Weight,
  #-----------------------
  'a', 4, 6, 8,
  'a', 7, 3, 0,
  'a', 7, 9, 0,
  'b', 2, 8, 8,
  'b', 5, 1, 8,
  'b', 8, 0, 1,
  'c', 2, 1, 1,
  'c', 3, 8, 0,
  'c', 1, 9, 1
)
out <- list(.vars = lst(vars(-Weight), vars(Weight)),
            .funs = lst(sum, max)) %>%
  pmap(~ df %>% group_by(ID) %>% summarise_at(.x, .y)) %>%
  reduce(inner_join)
out
# A tibble: 3 x 4
# ID Cash Price Weight
# <chr> <dbl> <dbl> <dbl>
# 1 a 18 18 8
# 2 b 15 9 8
# 3 c 6 18 1
You should specify the vars in the first lst (e.g. vars(-Weight), vars(Weight)) and the respective functions to be applied in the second lst (sum, max). The .x in the summarise_at call refers to elements of the variable lst, and .y refers to elements of the function lst.
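For comparison, on dplyr 1.0.0 or later the same result comes more directly from across(), as in the first answer above; a sketch on this df:
df %>%
  group_by(ID) %>%
  summarise(across(c(Cash, Price), sum),
            across(Weight, max))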
A MWE is as follows:
I have 3 groups with 2, 4, and 3 subjects, respectively. So I have:
library(dplyr)
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1, 2, 3, 4, 1, 2, 3)
df <- rbind(Group, Subject_ID)
Since the subjects in different groups are different people, I want the subject ID to be unique for each subject in the dataset. What I did was as follows:
Num_Subjects <- c(length(unique(filter(df, Group == 1)$Subject)),
                  length(unique(filter(df, Group == 2)$Subject)),
                  length(unique(filter(df, Group == 3)$Subject)))
# Then I defined a summation function to calculate how many subjects there are in all previous groups.
sumfun <- function(x, start, end) {
  return(sum(x[start:end]))
}
# Then I defined another function that generates a new subject ID for each subject in each group.
SubjIDFn <- function(x, i) {
  x %>% filter(Session == i) %>% mutate(
    Sujbect = Subject + sumfun(Num_Subjects, 1, i - 1)
  )
}
# Then I loop this from group 2 to group 3,
for (i in 2:3) {
df.Corruption.WithoutS1 <- SubjIDFn(df.Corruption.WithoutS1, i)
}
Then the data set has zero observations. I don't know where it went wrong, and I don't know what the smart solution to this problem is. Thanks for your help!
I think you're a bit overshooting it... If Subject_ID is unique within groups, you may just go with:
library(dplyr)
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1 ,2, 3, 4, 1, 2, 3)
df <- bind_cols(Group=Group, Subject_ID=Subject_ID)
df %>% mutate(unique_id = paste(Group, Subject_ID, sep="."))
# A tibble: 9 x 3
Group Subject_ID unique_id
<dbl> <dbl> <chr>
1 1 1 1.1
2 1 2 1.2
3 2 1 2.1
4 2 2 2.2
5 2 3 2.3
6 2 4 2.4
7 3 1 3.1
8 3 2 3.2
9 3 3 3.3
Note that I used bind_cols instead of rbind to have a dataframe instead of a matrix.
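If you would rather have sequential integer IDs (closer to what the looping approach was aiming for), a sketch with cur_group_id() from dplyr 1.0.0+:
df %>%
  group_by(Group, Subject_ID) %>%
  mutate(unique_id = cur_group_id()) %>%  # one integer per (Group, Subject_ID) pair
  ungroup()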
I have a data frame with a few variables to reverse-code. I have a separate vector that lists all the variables to reverse-code. I'd like to use mutate_at(), or some other tidy way, to reverse-code them all in one line of code. Here's the dataset and the vector of items to reverse:
library(tidyverse)
mock_data <- tibble(id = 1:5,
                    item_1 = c(1, 5, 3, 5, 5),
                    item_2 = c(4, 4, 4, 1, 1),
                    item_3 = c(5, 5, 5, 5, 1))
reverse <- c("item_2", "item_3")
Here's what I want it to look like with only items 2 and 3 reverse coded:
library(tidyverse)
solution <- tibble(id = 1:5,
                   item_1 = c(1, 5, 3, 5, 5),
                   item_2 = c(2, 2, 2, 5, 5),
                   item_3 = c(1, 1, 1, 1, 5))
I've tried the code below. I know that the recode is correct because I've used it for other datasets, but something is off with the %in% operator.
library(tidyverse)
mock_data %>%
mutate_at(vars(. %in% reverse), ~(recode(., "1=5; 2=4; 3=3; 4=2; 5=1")))
Error: `. %in% reverse` must evaluate to column positions or names, not a logical vector
Any help would be appreciated!
You can give reverse directly to mutate_at, no need for vars(. %in% reverse). And I would simplify the reversing as 6 minus the current value.
mock_data %>% mutate_at(reverse, ~6 - .)
# # A tibble: 5 x 4
# id item_1 item_2 item_3
# <int> <dbl> <dbl> <dbl>
# 1 1 1 2 1
# 2 2 5 2 1
# 3 3 3 2 1
# 4 4 5 5 1
# 5 5 5 5 5
If there's a possibility that reverse includes columns that are not in mock_data, and you want to skip those, use mutate_at(vars(one_of(reverse)), ...).
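Note that mutate_at() has since been superseded; on dplyr 1.0.0 or later, the equivalent uses across() with all_of():
mock_data %>% mutate(across(all_of(reverse), ~ 6 - .x))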
My aim is to replace NAs in a Spark data frame using the last observation carried forward (LOCF) method. I wrote the following code and it works. However, it seems to take longer than expected for a larger dataset.
It would be great if someone can recommend a better approach or improve the code.
Example and code with sparklyr
In the following example, NAs are replaced after ordering the rows by time and grouping them by grp.
df_with_nas <- data.frame(time = seq(as.Date('2001/01/01'),
                                     as.Date('2010/01/01'), length.out = 10),
                          grp = c(rep(1, 5), rep(2, 5)),
                          v1 = c(1, rep(NA, 3), 5, rep(NA, 5)),
                          v2 = c(NA, NA, 3, rep(NA, 4), 3, NA, NA))
tbl <- copy_to(sc, df_with_nas, overwrite = TRUE)
tbl %>%
  spark_apply(function(df) {
    library(dplyr)
    na_locf <- function(x) {
      v <- !is.na(x)
      c(NA, x[v])[cumsum(v) + 1]
    }
    df %>% arrange(time) %>% group_by(grp) %>%
      mutate_at(vars(-v1, -grp), funs(na_locf(.)))
  })
# # Source: spark<?> [?? x 4]
# time grp v1 v2
# <dbl> <dbl> <dbl> <dbl>
# 1 11323 1 1 NaN
# 2 11688. 1 NaN NaN
# 3 12053. 1 NaN 3
# 4 12419. 1 NaN 3
# 5 12784. 1 5 3
# 6 13149. 2 NaN NaN
# 7 13514. 2 NaN NaN
# 8 13880. 2 NaN 3
# 9 14245. 2 NaN 3
# 10 14610 2 NaN 3
data.table
The following approach with data.table works quite fast for the data I have. I am expecting the size of the data to increase soon, and then I may have to rely on sparklyr.
library(data.table)
setDT(df_with_nas)
df_with_nas <- df_with_nas[order(time)]
cols <- c("v1", "v2")
df_with_nas[, (cols) := zoo::na.locf(.SD, na.rm = FALSE),
by = grp, .SDcols = cols]
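If all of the columns to fill are numeric, data.table's own nafill() (available since 1.12.4) avoids the zoo dependency. A minimal sketch on the same data:
# assumes every column in `cols` is numeric, which nafill() requires
df_with_nas[order(time), (cols) := lapply(.SD, nafill, type = "locf"),
            by = grp, .SDcols = cols]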
I did this sort of loop; it is quite slow...
df_with_nas <- df_with_nas %>% mutate(row = 1:nrow(df_with_nas))
for (n in 1:50) {
  df_with_nas <- df_with_nas %>%
    arrange(row) %>%
    mutate_all(~ if_else(is.na(.), lag(., 1), .))
}
Run it until there are no NAs left; then collect(df_with_nas) will run the code.
You can leverage the spark_apply() function and run the na.locf function on each of your cluster nodes.
Install an R runtime on each of your cluster nodes.
Install the zoo R package on each node as well.
Then run spark_apply() this way:
data_filled <- spark_apply(data_with_holes, function(df) zoo::na.locf(df))
You can do this quite quickly using SQL, with the added benefit that you can easily apply LOCF on a grouped basis. The pattern to use is LAST_VALUE(column, true) OVER (window): this searches over the window for the most recent value of column which is not NA (passing true as the second argument tells LAST_VALUE to ignore nulls). Since you want to look backwards from the current value, the window should be
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING
Of course, if the first value in the group is NA it will remain NA.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
test_table <- data.frame(
  v1 = c(1, 2, NA, 3, NA, 5, NA, 6, NA),
  v2 = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  time = c(1, 2, 3, 4, 5, 2, 1, 3, 4)
) %>%
  sdf_copy_to(sc, ., "test_table")
spark_session(sc) %>%
  sparklyr::invoke("sql", "SELECT *, LAST_VALUE(v1, true)
                           OVER (PARTITION BY v2
                                 ORDER BY time
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING)
                           AS last_non_na
                           FROM test_table") %>%
  sdf_register() %>%
  mutate(v1 = ifelse(is.na(v1), last_non_na, v1))
#> # Source: spark<?> [?? x 4]
#> v1 v2 time last_non_na
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 NaN
#> 2 2 1 2 1
#> 3 2 1 3 2
#> 4 3 1 4 2
#> 5 3 1 5 3
#> 6 NaN 2 1 NaN
#> 7 5 2 2 NaN
#> 8 6 2 3 5
#> 9 6 2 4 6
Created on 2019-08-27 by the reprex package (v0.3.0)
I have a data frame:
df <- data.frame(sample.id = c(1, 1, 2, 3, 4, 4, 5, 6, 7, 7),
                 sample.type = c("U", "S", "S", "U", "U", "D", "D", "U", "U", "D"),
                 cond = c(1.4, 17, 12, 0.45, 1, 7, 1, 9, 0, 14))
I want a data frame that only contains the rows of sample.ids that have both sample.type "U" and sample.type "D"
The new df should be:
df.new <- data.frame(sample.id = c(4, 4, 7, 7),
                     sample.type = c("U", "D", "U", "D"),
                     cond = c(1, 7, 0, 14))
What's the easiest way to do this? duplicated() doesn't work because it will return sample.ids with U and S as well as U and D. I can't figure out how to filter/subset for sample.ids that have both sample.type U and sample.type D. Thanks for any advice!
We can do a filter by group
library(dplyr)
df %>%
group_by(sample.id) %>%
filter(all(c("U", "D") %in% sample.type))
# A tibble: 4 x 3
# Groups: sample.id [2]
# sample.id sample.type cond
# <dbl> <fct> <dbl>
#1 4 U 1
#2 4 D 7
#3 7 U 0
#4 7 D 14
Using filter with any
df %>% group_by(sample.id) %>% filter(any(sample.type == 'U') & any(sample.type == 'D'))
# A tibble: 4 x 3
# Groups: sample.id [2]
sample.id sample.type cond
<dbl> <fctr> <dbl>
1 4 U 1
2 4 D 7
3 7 U 0
4 7 D 14
With data.table
library(data.table)
setDT(df)
df[, if(all(c('U', 'D') %in% sample.type)) .SD, by = sample.id]
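A base R equivalent, if you want to avoid extra packages (a minimal sketch; intersect() collects the qualifying IDs first):
ids <- intersect(df$sample.id[df$sample.type == "U"],
                 df$sample.id[df$sample.type == "D"])
df[df$sample.id %in% ids, ]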
I have a sample data:
SampleID a b d f ca k l cb
1 0.1 2 1 2 7 1 4 3
2 0.2 3 2 3 4 2 5 5
3 0.5 4 3 6 1 3 9 2
I need to find the row-wise sum of columns which have something in common in their names, e.g. row-wise sum(a, ca) or row-wise sum(b, cb). The problem is that I have a large data frame, and ideally I would like to specify only what is common in the column headers, so that the code picks just those columns to sum.
Thanks in advance for any assistance.
We can select the columns whose names contain 'a' with grep, subset those columns, and do rowSums; the same works for the 'b' columns.
rowSums(df1[grep('a', names(df1)[-1])+1])
rowSums(df1[grep('b', names(df1)[-1])+1])
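To attach these sums as new columns to the original data, a small sketch building on the same grep calls (sum_a and sum_b are hypothetical names):
df1$sum_a <- rowSums(df1[grep('a', names(df1)[-1]) + 1])  # sums a and ca
df1$sum_b <- rowSums(df1[grep('b', names(df1)[-1]) + 1])  # sums b and cb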
If you want the output as a data frame, try using dplyr
# Recreating your sample data
df <- data.frame(SampleID = c(1, 2, 3),
                 a = c(0.1, 0.2, 0.5),
                 b = c(2, 3, 4),
                 d = c(1, 2, 3),
                 f = c(2, 3, 6),
                 ca = c(7, 4, 1),
                 k = c(1, 2, 3),
                 l = c(4, 5, 9),
                 cb = c(3, 5, 2))
Process the data
# load dplyr
library(dplyr)
# Sum across columns 'a' and 'ca' (sum(a, ca))
df2 <- df %>%
  select(contains('a'), -SampleID) %>%  # choose the columns you want
  mutate(row_sum = rowSums(.))          # new column 'row_sum' with the sum of the selected columns; use 'transmute' instead to drop them
df2 # have a look
a ca row_sum
1 0.1 7 7.1
2 0.2 4 4.2
3 0.5 1 1.5
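On dplyr 1.0.0 or later you can also keep all the original columns and add the sums in one mutate() call; a sketch where rowSums(across()) sums the matched columns row-wise (sum_a and sum_b are hypothetical names):
df %>% mutate(sum_a = rowSums(across(c(a, ca))),
              sum_b = rowSums(across(c(b, cb))))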