First post, long time user.
I'm trying to efficiently sum a column based on 2 criteria for every ID in another data frame of a different length. Below is an example:
First data frame:
ID
1 A
2 B
3 C
Second data frame:
ID Color Type Price
A Green 1 5
A Blue 2 6
B Green 3 7
B Blue 2 2
C Green 2 4
C Blue 4 5
For each ID, I'd like to sum the price if the color is blue and the type is 2. The result would hopefully be as below:
ID Price
1 A 6
2 B 2
3 C 0
This seems like an easy task, but I can't figure it out for some reason. Also, I'll need to perform this operation on 2 large data sets (>1,000,000 rows each). For prior problems like this I've written a function and used it in a loop, but that solution doesn't scale to this amount of data. I feel that a function from the apply family would probably be best, but I can't get them to work.
I changed your example data a bit so it takes into account the fact that not all IDs are in the first data frame, and that somewhere there are two values to sum:
df1 <- data.frame(ID = c("A","B","C"))
df2 <- read.table(text = "
ID Color Type Price
A Green 1 5
A Blue 2 6
A Blue 2 4
B Green 3 7
B Blue 2 2
C Green 2 4
C Blue 4 5
D Green 2 2
D Blue 4 8
",header = T)
The two main packages for doing this fast on big data frames are dplyr and data.table. They are roughly equivalent (almost; see data.table vs dplyr: can one do something well the other can't or does poorly?). Here are the two solutions:
library(data.table)
setDT(df2)[ID %in% unique(df1$ID), .(sum = sum(Price[ Type == 2 & Color == "Blue"])),by = ID]
ID sum
1: A 10
2: B 2
3: C 0
You could do
setDT(df2)[ID %in% unique(df1$ID) & Type == 2 & Color == "Blue", .(sum = sum(Price)),by = ID]
but that discards C, because no row for C satisfies the entire filter condition:
ID sum
1: A 10
2: B 2
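If you also want C (or any ID from df1 that has no matching rows) to show up with a 0, one option, sketched here rather than taken from the answer above, is to filter first and then merge the summary back onto df1:
library(data.table)
# sum only the matching rows, then left-join onto the IDs from df1 so non-matching IDs get 0
res <- setDT(df2)[Type == 2 & Color == "Blue", .(sum = sum(Price)), by = ID]
res[, ID := as.character(ID)]
out <- merge(data.table(ID = unique(as.character(df1$ID))), res, by = "ID", all.x = TRUE)
out[is.na(sum), sum := 0L]  # IDs with no Blue/Type-2 rows get 0 instead of NA
out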
and with dplyr:
library(dplyr)
df2 %>%
filter(ID %in% unique(df1$ID)) %>%
group_by(ID) %>%
summarize(sum = sum(Price[Type==2 & Color=="Blue"]))
# A tibble: 3 x 2
ID sum
<fct> <int>
1 A 10
2 B 2
3 C 0
A sapply version. There may be more elegant ways to write it, but if you have big tables as you said, you can easily parallelize it.
Using the data as proposed by @denis:
df1 <- data.frame(ID = c("A","B","C"))
df2 <- read.table(text = "
ID Color Type Price
A Green 1 5
A Blue 2 6
A Blue 2 4
B Green 3 7
B Blue 2 2
C Green 2 4
C Blue 4 5
D Green 2 2
D Blue 4 8
",header = T)
Here is a simple function that does what you want with sapply:
getPrices <- function(tableid = df1, tablevalues = df2, color = "Blue", type = 2){
  filteredtablevalues <- droplevels(tablevalues[tablevalues$Color == color &
                                                tablevalues$Type == type &
                                                tablevalues$ID %in% tableid$ID, ])
  # droplevels could be skipped by using unique(as.character(filteredtablevalues$ID))
  # in the sapply; not sure which would be the quickest
  sapply(levels(filteredtablevalues$ID), function(id, tabval) {
    sum(tabval$Price[tabval$ID == id])
  }, tabval = filteredtablevalues)
}
As you can see, I added two parameters that let you select the color/type pair. And if you absolutely need a data frame with an ID column and a Price column, you can add this:
tmp <- getPrices(df1, df2)
finaltable <- cbind.data.frame(ID = names(tmp), Price = tmp)
I will try some benchmarks when I have time, but written this way you should be able to parallelize it easily with library(parallel) and library(Rmpi), which can save your life if you have very, very big datasets.
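As a rough sketch of that idea (not part of the original answer), the per-ID sums could be run on several cores with parallel::mclapply; getPricesPar is a hypothetical name, and note that mc.cores > 1 is not supported on Windows:
library(parallel)
# same logic as getPrices, but the per-ID sums are computed in parallel
getPricesPar <- function(tableid = df1, tablevalues = df2,
                         color = "Blue", type = 2, cores = 2) {
  filtered <- droplevels(tablevalues[tablevalues$Color == color &
                                     tablevalues$Type == type &
                                     tablevalues$ID %in% tableid$ID, ])
  ids <- levels(filtered$ID)
  sums <- mclapply(ids, function(id) sum(filtered$Price[filtered$ID == id]),
                   mc.cores = cores)
  setNames(unlist(sums), ids)
}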
EDIT :
Benchmark:
I was not able to reproduce the dplyr example proposed by @denis, but I could compare with the data.table version:
#Create a bigger dataset
nt=10000 #nt as big as you want
df2=rbind.data.frame(df2,
list(ID= sample(c("A","B","C"),nt,replace=T),
Color=sample(c("Blue","Green"),nt,replace=T),
Type=sample.int(5,nt,replace=T),
Price=sample.int(5,nt,replace=T)
)
)
You can benchmark it with library(microbenchmark):
library(microbenchmark)
microbenchmark(
  sply  = getPrices(df1, df2),
  dtbl  = setDT(df2)[ID %in% unique(df1$ID),
                     .(sum = sum(Price[Type == 2 & Color == "Blue"])), by = ID],
  dplyr = df2 %>% filter(ID %in% unique(df1$ID)) %>%
    group_by(ID) %>%
    summarize(sum = sum(Price[Type == 2 & Color == "Blue"]))
)
On my computer it gives:
Unit: milliseconds
expr min lq mean median uq max neval
sply 78.37484 83.89856 97.75373 89.17033 118.96890 131.3226 100
dtbl 75.67642 83.44380 93.16893 85.65810 91.98584 137.2851 100
dplyr 90.67084 97.58653 114.24094 102.60008 136.34742 150.6235 100
Edit 2:
In my first benchmark, sapply appeared slightly quicker than the data.table approach, though not significantly, and sapply can still be really helpful if you have a huge ID table, because you can then use library(parallel) and gain even more time.
With the results above, the data.table approach now seems to be the quickest. Still, the advantage of sapply is that you can parallelize it easily; though in that case, and given how I wrote the getPrices function, it will only pay off if your ID table is huge.
Related
I want a way to count values in a data frame based on their presence by row
a = data.frame(c('a','b','c','d','f'),
c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first row and the third row, for a total of two appearances. I've written this code to count a value whenever its presence in a row is TRUE, but I want it to assign the counts automatically for all the values present in the data frame:
# for counting the value 'a' and attributing the count to the b data frame
b = data.frame(unique(unique(unlist(a))))
b$count = 0
for(i in 1:nrow(a)){
  if(TRUE %in% apply(a[i,], 2, function(x) x %in% 'a') == TRUE){
    b$count[1] = b$count[1] + 1
  }
}
b$count[1]
[1] 2
The problem is that I have to do this manually for each value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values separately from each column, unlisting to a vector, and getting the frequency count with table. If needed, convert the table object to a two-column data.frame with stack:
stack(table(unlist(lapply(a, unique))))[2:1]
-output
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d
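For completeness, here is a rough data.table take on the same per-row-unique-then-count idea (a sketch I'm adding, not one of the answers above; it coerces the columns to character to avoid factor issues):
library(data.table)
DT <- as.data.table(lapply(a, as.character))
DT[, row := .I]                                     # keep track of the row each value came from
long <- melt(DT, id.vars = "row", value.name = "val")
unique(long, by = c("row", "val"))[, .N, by = val][order(val)]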
I am using the "dplyr" package in R to summarize and re-organize a dataset. The dataset is composed of individual records, and for each individual there may be a single encounter or multiple encounters. I would like to create a new column that takes information from subsequent encounters into account when the data collected during the first one was incomplete. Here is some example data:
ID<-rep(1:4,2)
Time<-as.character(c("A","A","A","A","B","B","B","B"))
Color<- as.character(c("u","u","red","red","green","u","u","red"))
Data<- data.frame(ID,Time,Color)
Data
For the above example data I would like to create a new column for color. For those individuals (ID) encountered at Time A that have Color "u" (which represents unknown), change Color to the color identified at a subsequent time (Time B). Otherwise, if a color was recorded at Time A, keep it as is. Here is what I've tried:
library(dplyr)
Data2<-mutate(Data, Color.new=if_else(Color=="u" & Time=="A",
Color=="green"|Color=="red" & Time=="B", NA))
Data2
I'm hoping that reads: when Color is "u" and Time is "A", then Color takes its respective value (either green or red) from Time "B"; otherwise it stays as is.
My attempt doesn't work: it results in a column of NAs and FALSE values when, for example, individual 1 should have changed from unknown to green.
Thanks for your help or comments.
The reason your if_else statement doesn't work is that it's designed to work on vectors: a vector of conditions to compare and two vectors of possible responses, all of the same length. That's not really going to work in your case. There are two ways I can see of approaching the solution:
1) If each individual has at most one Time "B" observation, the easiest solution is to use spread and gather from the tidyr package. (These serve the same purpose as reshape; there are a lot of other functions that do the same thing, this is just the one I like.)
result <- Data %>%
spread(Time, Color) %>%
mutate(
A = as.character(A),
B = as.character(B),
Color1 = case_when(
A == 'u' ~ B,
B == 'u' ~ A,
TRUE ~ A
)) %>% # only run this if you want to go back to the long format
gather(Time, Color, A:B)
# ID Color1 Time Color
# 1 1 green A u
# 2 2 u A u
# 3 3 red A red
# 4 4 red A red
# 5 1 green B green
# 6 2 u B u
# 7 3 red B u
# 8 4 red B red
If you can have multiple Bs, then an approach using summaries might work better:
Data %>%
group_by(ID) %>%
summarize(
Color1 = if(sum(Color != "u") > 0) paste(unique(Color[Color != "u"]), collapse = "_") else "u"
) %>%
left_join(Data, .)
# ID Time Color Color1
# 1 1 A u green
# 2 2 A u u
# 3 3 A red red
# 4 4 A red red
# 5 1 B green green
# 6 2 B u u
# 7 3 B u red
# 8 4 B red red
You can use tidyr::fill
library(tidyverse)
Data[Data =="u"] <- NA
Data %>% group_by(ID) %>% fill(Color,.direction = "up") %>% ungroup
# # A tibble: 8 x 3
# ID Time Color
# <int> <fctr> <fctr>
# 1 1 A green
# 2 1 B green
# 3 2 A <NA>
# 4 2 B <NA>
# 5 3 A red
# 6 3 B <NA>
# 7 4 A red
# 8 4 B red
Here we don't have the subsequent info for a few cases so some NAs remain.
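If you would rather get "u" back instead of NA, and also borrow a known color from either encounter, a small follow-up sketch (assuming tidyr >= 1.0.0 for .direction = "downup"; na_if and coalesce come from dplyr) might look like this:
library(dplyr)
library(tidyr)
Data %>%
  mutate(Color = na_if(as.character(Color), "u")) %>%  # treat "u" as missing
  group_by(ID) %>%
  fill(Color, .direction = "downup") %>%               # fill from the other encounter in either direction
  ungroup() %>%
  mutate(Color = coalesce(Color, "u"))                 # restore "u" where no color was ever recorded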
I have a 6449x743 data frame in which a few rows are repeated twice with the same column_X and column_Y values, but with a higher column_Z value for the second occurrence. I want to keep only the row with the higher column_Z.
I tried the following, but it doesn't get rid of the duplicate values and still gives an output of 6449x743:
output <- unique(Data[,c('column_X', 'column_Y', max('column_Z'))])
Ideally, the output should be (6449 - N)x743: the number of rows will be smaller, but the number of columns will stay the same, since column_X and column_Y will become unique after filtering the data based on column_Z.
If anyone has suggestions, please let me know. Thanks.
You can use not-duplicated (!duplicated) with the option fromLast = TRUE on specific columns, like this:
df <- data.frame(a=c(1,1,2,3,4),b=c(2,2,3,4,5),c=1:5)
df <- df[order(df$c),] #make sure the data is sorted.
a b c
1 1 2 1
2 1 2 2
3 2 3 3
4 3 4 4
5 4 5 5
df[!duplicated(df$a,fromLast = TRUE) & !duplicated(df$b,fromLast = TRUE),]
a b c
2 1 2 2
3 2 3 3
4 3 4 4
5 4 5 5
Try
library(dplyr)
Data %>%
  group_by(column_X, column_Y) %>%
  filter(column_Z == max(column_Z))
It works with the sample data
set.seed(13)
df<-data_frame(a=sample(1:4, 50, rep=T),
b=sample(1:3, 50, rep=T),
x=runif(50), y=rnorm(50))
df %>% group_by(a,b) %>% filter(x==max(x))
Probably the easiest way would be to order the whole thing by column_Z and then remove the duplicates:
output <- Data[order(Data$column_Z, decreasing=TRUE),]
output <- output[!duplicated(paste(output$column_X, output$column_Y)),]
assuming I understood you correctly.
Here's an older answer which may be trying to accomplish the same thing that you are:
How to make a unique in R by column A and keep the row with maximum value in column B
Editing with relevant code:
A solution using package data.table:
set.seed(42)
dat <- data.frame(A=c('a','a','a','b','b'),B=c(1,2,3,5,200),C=rnorm(5))
library(data.table)
dat <- as.data.table(dat)
dat[,.SD[which.max(B)],by=A]
A B C
1: a 3 0.3631284
2: b 200 0.4042683
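If you are on a recent dplyr (>= 1.0.0), the same thing can be sketched with slice_max; with_ties = FALSE keeps a single row even when the maximum is tied:
library(dplyr)
dat %>%
  group_by(A) %>%
  slice_max(B, n = 1, with_ties = FALSE) %>%  # keep the row with the largest B in each group
  ungroup()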
I have read data from multiple xlsx sheets with the xlsx package in R. Currently my data frame looks like this:
firstcol SecondCol
A abcd
B bds
A <NA>
A asd
C <NA>
B adfdf
? <NA>
C adfd
From the above data I want to get the following output.
Firstcol FirstcolCount SecondCol
A 3 times 2 # we'll not count NA's
B 2 times 2
C 2 times 1
other 1 times 0
Is there any direct method that can do this? Any suggestions would be appreciated.
A data.table approach:
#load library
require(data.table)
# convert data.frame to data.table
setDT(df)
# make a new data.table with two columns: the first has the counts for each level of firstcol,
# the second has that count minus the number of NA cases
df[, .(FirstcolCount = .N,
       SecondCol = .N - sum(is.na(SecondCol))),
   by = firstcol]
Though it's not quite clear what exactly you mean, something like this?
library(dplyr)
df %>%
  group_by(firstcol) %>%
  summarise(FirstcolCount = n(), SecondCol = n() - sum(SecondCol == "<NA>"))
Source: local data frame [4 x 3]
firstcol FirstcolCount SecondCol
1 ? 1 0
2 A 3 2
3 B 2 2
4 C 2 1
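Note that the comparison with the literal string "<NA>" only works if the cells actually contain that text; if they are genuine NA values (which an xlsx import usually produces), a slight variation of the same summarise call, sketched here, should do it:
library(dplyr)
df %>%
  group_by(firstcol) %>%
  summarise(FirstcolCount = n(),
            SecondCol = sum(!is.na(SecondCol)))  # count only the non-missing second-column entries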
I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each observed combination of the grouping factors is represented by a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x,label) {
cbind(x[sample(1:nrow(x),1),],data.frame(state=label))
}
df <- ddply(df[,c("group1","group2","value")],
            .(group1,group2),
            choice,
            label="test")
Note that in this case, I am also adding an extra column to the data frame (called "state" in the function above), whose value is specified as an extra argument to the ddply call. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function receives, what it should return, or what to do with the result (especially for by).
I am trying to switch from Python to R for exploratory data analysis, but this type of aggregation is crucial for me. In Python, I can perform these operations very rapidly, but it is inconvenient as I have to write a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
library(plyr)
set.seed(1234)
ddply(df, .(group1, group2), summarize,
      value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
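An equivalent way to write that data.table step with the more common .SD / by = .() idiom (just a sketch, same result: one random row per observed combination):
library(data.table)
dt <- data.table(df)
dt[, .SD[sample(.N, 1)], by = .(group1, group2)]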
EDIT 2: Performance Comparison: Data Table is ~ 15 X faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
  dt[, list(value = value[sample(length(value), 1)]), 'group1, group2']
}
f2_plyr = function() {
  ddply(df, .(group1, group2), summarize,
        value = value[sample(length(value), 1)])
}
f3_by = function() {
  do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
                    FUN = function(x) x[sample(nrow(x), 1), ]))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
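For reference, a dplyr version of the same grouped sampling (a sketch assuming dplyr >= 1.0.0, where slice_sample was added; it was not part of this comparison):
library(dplyr)
df %>%
  group_by(group1, group2) %>%
  slice_sample(n = 1) %>%   # one random row per observed (group1, group2) combination
  ungroup()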
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample within groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
tapply(idx, list( group1, group2),
function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]