The following dataset is reproducible:
group <- c(1,1,2,2,3,3)
parameter <- c("A","B","A","B","A","B")
values <- c(10,20,20,5,30,50)
df <- data.frame(group,parameter,values)
group parameter values
1 A 10
1 B 20
2 A 20
2 B 5
3 A 30
3 B 50
Within each group, I want to check whether the value for A is greater than the value for B, and store the result in a fourth column for every row of that group.
If yes -> TRUE, If no -> FALSE
New Df:
group parameter values status
1 A 10 FALSE
1 B 20 FALSE
2 A 20 TRUE
2 B 5 TRUE
3 A 30 FALSE
3 B 50 FALSE
Approach
with(df, ave(values,group, FUN = function(x) ))
I can't work out what the code inside the function should be. Can someone please help me?
Update: status should instead be the rank of the values column (highest to lowest) within each group:
group parameter values status
1 A 10 2
1 B 20 1
2 A 20 1
2 B 5 2
3 A 30 2
3 B 50 1
We can try with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'group', compare the 'values' where 'parameter' is 'A' with those where it is 'B', and assign (:=) the result to create 'status'.
library(data.table)
setDT(df)[, status := values[parameter=="A"]>values[parameter=="B"], by = group]
df
# group parameter values status
#1: 1 A 10 FALSE
#2: 1 B 20 FALSE
#3: 2 A 20 TRUE
#4: 2 B 5 TRUE
#5: 3 A 30 FALSE
#6: 3 B 50 FALSE
For the rank, use frank on the 'values' after grouping by 'group'.
setDT(df)[, status:= frank(-values), group]
df
# group parameter values status
#1: 1 A 10 2
#2: 1 B 20 1
#3: 2 A 20 1
#4: 2 B 5 2
#5: 3 A 30 2
#6: 3 B 50 1
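For readers without data.table, the same per-group ranking can be sketched in base R with ave and rank. This is an alternative rather than the answer's method. Ties get the default "average" treatment of rank, which is an assumption about what is wanted.

```r
# Rebuild the example data so the snippet is self-contained
df <- data.frame(group = c(1, 1, 2, 2, 3, 3),
                 parameter = c("A", "B", "A", "B", "A", "B"),
                 values = c(10, 20, 20, 5, 30, 50))

# Rank the negated values within each group, so the largest value gets rank 1
df$status <- ave(-df$values, df$group, FUN = rank)
df$status
```

Negating the values is what turns the ascending order of rank into a highest-to-lowest ranking.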
Or with ave, we can compare the first value with the second one (assuming that 'parameter' is ordered and that there are only two elements per 'group').
df$status <- with(df, as.logical(ave(values, group, FUN = function(x) x[1] > x[2])))
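If the ordering assumption is a concern, here is a minimal base R sketch that looks up the 'A' and 'B' values by group name instead of by position. The names a, b and key are illustrative, the rows are deliberately shuffled, and it still assumes exactly one 'A' and one 'B' per group.

```r
# Same data as the question, but with the rows in a scrambled order
df <- data.frame(group = c(2, 1, 3, 1, 2, 3),
                 parameter = c("B", "A", "B", "B", "A", "A"),
                 values = c(5, 10, 50, 20, 20, 30))

# Named lookup vectors: one value per group for each parameter
a <- with(df[df$parameter == "A", ], setNames(values, group))
b <- with(df[df$parameter == "B", ], setNames(values, group))

# Index both lookups by each row's group, then compare
key <- as.character(df$group)
df$status <- unname(a[key] > b[key])
df
```

Only group 2 has A > B, so only its two rows end up TRUE, wherever they sit in the data frame.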
Or another option is to order the dataset by the first two columns (in case it is not ordered), then subset the 'values' using a recycled logical index, compare, and replicate each of the logical values twice.
df1 <- df[do.call(order, df[1:2]), ]
rep(df1$values[c(TRUE, FALSE)] > df1$values[c(FALSE, TRUE)], each = 2)
There is also the tidyverse solution using dplyr:
library(dplyr)
df %>%
  group_by(group) %>%
  # the comparison already returns TRUE/FALSE, so no ifelse() is needed
  mutate(status = values[parameter == "A"] > values[parameter == "B"],
         rank = min_rank(-values))
Source: local data frame [6 x 5]
Groups: group [3]
group parameter values status rank
(dbl) (fctr) (dbl) (lgl) (int)
1 1 A 10 FALSE 2
2 1 B 20 FALSE 1
3 2 A 20 TRUE 1
4 2 B 5 TRUE 2
5 3 A 30 FALSE 2
6 3 B 50 FALSE 1
Related
This is my df (a data.frame):
group value
1 10
1 20
1 25
2 5
2 10
2 15
I need to calculate the difference between values in consecutive rows, by group.
So I need this result:
group value diff
1 10 NA # because there is a no previous value
1 20 10 # value[2] - value[1]
1 25 5 # value[3] - value[2]
2 5 NA # because group is changed
2 10 5 # value[5] - value[4]
2 15 5 # value[6] - value[5]
I can handle this problem with ddply, but it takes too much time because my df has a lot of groups (over 1,000,000).
Are there any other effective approaches to handle this problem?
The package data.table can do this fairly quickly, using the shift function.
require(data.table)
df <- data.table(group = rep(c(1, 2), each = 3), value = c(10,20,25,5,10,15))
#setDT(df) #if df is already a data frame
df[ , diff := value - shift(value), by = group]
# group value diff
#1: 1 10 NA
#2: 1 20 10
#3: 1 25 5
#4: 2 5 NA
#5: 2 10 5
#6: 2 15 5
setDF(df) #if you want to convert back to old data.frame syntax
Or using the lag function in dplyr:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(Diff = value - lag(value))
# group value Diff
# <int> <int> <int>
# 1 1 10 NA
# 2 1 20 10
# 3 1 25 5
# 4 2 5 NA
# 5 2 10 5
# 6 2 15 5
You can use the base function ave() for this
df <- data.frame(group=rep(c(1,2),each=3),value=c(10,20,25,5,10,15))
df$diff <- ave(df$value, factor(df$group), FUN=function(x) c(NA,diff(x)))
which returns
group value diff
1 1 10 NA
2 1 20 10
3 1 25 5
4 2 5 NA
5 2 10 5
6 2 15 5
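Try this with tapply (note that the unlisted result comes back in group order, so this assumes the rows of df are already sorted by group):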
df$diff<-as.vector(unlist(tapply(df$value,df$group,FUN=function(x){ return (c(NA,diff(x)))})))
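The difference between the ave and tapply versions only shows up when the groups are interleaved. The sketch below uses a reordered copy of the example data (an assumption made for illustration) to show that ave writes each result back to its original row, while the unlisted tapply result is laid out group by group.

```r
# Same values as the example, but with the two groups interleaved
df <- data.frame(group = c(1, 2, 1, 2, 1, 2),
                 value = c(10, 5, 20, 10, 25, 15))

lagged_diff <- function(x) c(NA, diff(x))

# ave() puts each group's results back in the rows they came from
by_ave <- ave(df$value, df$group, FUN = lagged_diff)

# tapply() + unlist() concatenates group 1's results, then group 2's
by_tapply <- as.vector(unlist(tapply(df$value, df$group, lagged_diff)))

by_ave     # NA NA 10  5  5  5
by_tapply  # NA 10  5 NA  5  5
```

With unsorted data, only by_ave lines up with the rows of df, so sorting by group first (or using ave) is the safer choice.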
Since dplyr 1.1.0, you can shorten the dplyr version with inline temporary grouping with .by:
mutate(df, diff = value - lag(value), .by = group)
This must be a duplicate but I can't find it. So here goes.
I have a data.frame with two columns. One contains a group and the other contains a criterion. A group can contain many different criteria, but only one per row. I want to identify groups that contain three specific criteria (which will appear on different rows). In my case, I want to identify all groups that contain the criteria "I", "E" and "C". Groups may contain any number and combination of these and several other letters.
test <- data.frame(grp=c(1,1,2,2,2,3,3,3,4,4,4,4,4),val=c("C","I","E","I","C","E","I","A","C","I","E","E","A"))
> test
grp val
1 1 C
2 1 I
3 2 E
4 2 I
5 2 C
6 3 E
7 3 I
8 3 A
9 4 C
10 4 I
11 4 E
12 4 E
13 4 A
In the above example, I want to identify groups 2 and 4, because each of these contains the letters E, I, and C.
Thanks!
Here's a dplyr solution. %in% is vectorized so c("E", "I", "C") %in% val returns a logical vector of length three. For the target groups, passing that vector to all() returns TRUE. That's our filter, and we run it within each group using group_by().
library(dplyr)
test %>%
group_by(grp) %>%
filter(all(c("E", "I", "C") %in% val))
# Source: local data frame [8 x 2]
# Groups: grp [2]
#
# grp val
# (dbl) (fctr)
# 1 2 E
# 2 2 I
# 3 2 C
# 4 4 C
# 5 4 I
# 6 4 E
# 7 4 E
# 8 4 A
Or if this output would be handier (thanks @Frank),
test %>%
group_by(grp) %>%
summarise(matching = all(c("E", "I", "C") %in% val))
# Source: local data frame [4 x 2]
#
# grp matching
# (dbl) (lgl)
# 1 1 FALSE
# 2 2 TRUE
# 3 3 FALSE
# 4 4 TRUE
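The same all(... %in% ...) test can be sketched in base R with tapply, for anyone who wants to avoid a package dependency. The names wanted, has_all and matching_groups are illustrative and not part of the answer above.

```r
test <- data.frame(grp = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4),
                   val = c("C", "I", "E", "I", "C", "E", "I", "A",
                           "C", "I", "E", "E", "A"))
wanted <- c("E", "I", "C")

# One TRUE/FALSE per group: are all wanted letters present in that group?
has_all <- tapply(test$val, test$grp, function(v) all(wanted %in% v))

# Group ids come back as names, so convert them to numbers before filtering
matching_groups <- as.numeric(names(has_all)[has_all])
matching_groups

# Keep only the rows that belong to a matching group
test[test$grp %in% matching_groups, ]
```

Group 1 fails because it has no "E" and group 3 fails because it has no "C", which leaves groups 2 and 4.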
library(data.table)
test <- data.frame(grp=c(1,1,2,2,2,3,3,3,4,4,4,4,4),val=c("C","I","E","I","C","E","I","A","C","I","E","E","A"))
setDT(test) # convert the data.frame into a data.table
# count how often each val occurs per group, with one column per val
# (length is the aggregate dcast falls back to; stating it avoids the message)
group.counts <- dcast(test, grp ~ val, fun.aggregate = length, value.var = "val")
group.counts[I>0 & E>0 & C>0,] # now filtering is easy
Results in:
grp A C E I
1: 2 0 1 1 1
2: 4 1 1 2 1
Instead of returning the group numbers only you could also "join" the resulting group numbers with the original data to show the "raw" data rows of each group that matches:
test[group.counts[I>0 & E>0 & C>0,], .SD, on="grp" ]
This shows:
grp val
1: 2 E
2: 2 I
3: 2 C
4: 4 C
5: 4 I
6: 4 E
7: 4 E
8: 4 A
PS: To make the solution easier to follow, here are the counts for all groups:
> group.counts
grp A C E I
1: 1 0 1 0 1
2: 2 0 1 1 1
3: 3 1 0 1 1
4: 4 1 1 2 1
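The count table can also be produced in base R with table(), which may help when data.table is not available. This is a sketch of the same idea with a different tool. The names counts and keep are illustrative.

```r
test <- data.frame(grp = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4),
                   val = c("C", "I", "E", "I", "C", "E", "I", "A",
                           "C", "I", "E", "E", "A"))

# Rows are groups, columns are letters, cells are occurrence counts
counts <- table(test$grp, test$val)

# A group qualifies when all three letters occur at least once
keep <- counts[, "I"] > 0 & counts[, "E"] > 0 & counts[, "C"] > 0
rownames(counts)[keep]
```

The group ids come back as character strings because they are taken from the table's row names.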
Suppose I have data which looks like this:
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other: if an ID has changed its value of variable A over the period given by variable B (which runs from 1 to 4), it goes into data frame K, and if it hasn't, it goes into data frame L.
So in this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of nested loops and if-then-else statements, it could be solved roughly like the following:
for ( i in 1:length(ID)){
m=0
for (j in 1: length(B)){
ifelse( A[j] == A[j+1],m,m=m+1)
}
ifelse(m=0, L=c[,df[i]], K=c[,df[i]])
}
I have read in some posts that nested loops in R can be replaced by apply and outer functions. Can someone help me understand how they could be used in this situation?
You don't need a loop with conditions here. All you need is to check, for each ID (each cycle of B), whether A has any variance, and negate the result with ! to get a logical. This requires converting A to a numeric value (I'm assuming it's a factor in your real data set; if it's not, you can use FUN = function(x) length(unique(x)) within ave instead). Then split accordingly. In base R we can use ave for this, for example:
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather than a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
Similarly with data.table
library(data.table)
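# recent versions of R refuse var() on a factor, so convert to integer codes first
setDT(df)[, indx := !var(as.numeric(factor(A))), by = ID]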
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  # convert A to integer codes so var() accepts it
  mutate(indx = !var(as.numeric(factor(A)))) %>%
  split(., .$indx)
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In plainer English: go through all of df$A grouped by df$ID, and apply the function to df$A within each group (i.e. the x in the embedded function). If the number of unique values is more than 1, it's "K", otherwise it's "L".
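To go from those per-ID labels to the two data frames, one option is to look up each row's label by its ID and split on it. This is a sketch that extends the tapply demonstration. The names label and parts are illustrative, and df is rebuilt here from the question's table.

```r
df <- data.frame(ID = rep(1:3, each = 4),
                 A = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"),
                 B = rep(1:4, times = 3),
                 C = c(10, 10, 15, 12, 15, 13, 13, 13, 16, 18, 19, 10))

# One label per ID: "K" if A takes more than one value, otherwise "L"
label <- tapply(df$A, df$ID, function(x) if (length(unique(x)) > 1) "K" else "L")

# Expand the labels to one per row via the ID names, then split
parts <- split(df, label[as.character(df$ID)])
names(parts)
parts$L
```

parts$K then holds the rows for IDs 1 and 2, and parts$L the rows for ID 3.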
We can do this using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1, and create a column 'ind' based on that. We can then split the dataset on that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1', i.e. 'ID' and 'A'. Double negating (!!) the output converts the '0' counts to FALSE and every other frequency to TRUE. Taking rowSums then gives the number of distinct 'A' values per 'ID', and comparing that with 1 gives 'indx'. We match the ID column in 'df1' with the names of 'indx' to get a TRUE/FALSE value for every row, and split the dataset by that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
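The intermediate objects in the table approach are easier to follow when printed one step at a time. The sketch below rebuilds df1 from the question's table. The names tab, present and n_distinct_A are illustrative.

```r
df1 <- data.frame(ID = rep(1:3, each = 4),
                  A = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"),
                  B = rep(1:4, times = 3),
                  C = c(10, 10, 15, 12, 15, 13, 13, 13, 16, 18, 19, 10))

tab <- table(df1[1:2])               # counts of each A value per ID
present <- !!tab                     # TRUE where the count is non-zero
n_distinct_A <- rowSums(present)     # number of distinct A values per ID
indx <- n_distinct_A > 1             # TRUE for IDs whose A changes

tab
n_distinct_A
indx
```

ID 3 only ever has "Y", so its row sum is 1 and its indx is FALSE.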
If we need the individual datasets in the global environment, change the names of the list elements to the object names we want and use list2env (not recommended, though).
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)
I have the following two data frames:
df1 <- data.frame(month=c("1","1","1","1","2","2","2","3","3","3","3","3"),
temp=c("10","15","16","25","13","17","20","5","16","25","30","37"))
df2 <- data.frame(period=c("1","1","1","1","1","1","1","1","2","2","2","2","2","2","3","3","3","3","3","3","3","3","3","3","3","3"),
max_temp=c("9","13","16","18","30","37","38","39","10","15","16","25","30","32","8","10","12","14","16","18","19","25","28","30","35","40"),
group=c("1","1","1","2","2","2","3","3","3","3","4","4","5","5","5","5","5","6","6","6","7","7","7","7","8","8"))
I would like to:
1. Consecutively for each row, check if the value in the month column in df1 matches that in the period column of df2, i.e. df1$month == df2$period.
2. If step 1 is not TRUE, i.e. df1$month != df2$period, then repeat step 1 and compare the value in df1 with the value in the next row of df2, and so forth until df1$month == df2$period.
3. If df1$month == df2$period, check if the value in the temp column of df1 is less than or equal to that in the max_temp column of df2, i.e. df1$temp <= df2$max_temp.
4. If df1$temp <= df2$max_temp, return the value in that row for the group column in df2 and add this value to df1, in a new column called "new_group".
5. If step 3 is not TRUE, i.e. df1$temp > df2$max_temp, then go back to step 1 and compare the same row in df1 with the next row in df2.
An example of the output data frame I'd like is:
df3 <- data.frame(month=c("1","1","1","1","2","2","2","3","3","3","3","3"),
temp=c("10","15","16","25","13","17","20","5","16","25","30","37"),
new_group=c("1","1","1","2","3","4","4","5","6","7","7","8"))
I've been playing around with the ifelse function and need some help or re-direction. Thanks!
I found the procedure for computing new_group hard to follow as stated. As I understand it, you're trying to create a variable called new_group in df1. For row i of df1, the new_group value is the group value of the first row in df2 that:
Is indexed i or higher
Has a period value matching df1$month[i]
Has a max_temp value no less than df1$temp[i]
I approached this by using sapply called on the row indices of df1:
fxn = function(idx) {
# Potentially matching indices in df2
pm = idx:nrow(df2)
# Matching indices in df2
m = pm[df2$period[pm] == df1$month[idx] &
as.numeric(as.character(df1$temp[idx])) <=
as.numeric(as.character(df2$max_temp[pm]))]
# Return the group associated with the first matching index
return(df2$group[m[1]])
}
df1$new_group = sapply(seq(nrow(df1)), fxn)
df1
# month temp new_group
# 1 1 10 1
# 2 1 15 1
# 3 1 16 1
# 4 1 25 2
# 5 2 13 3
# 6 2 17 4
# 7 2 20 4
# 8 3 5 5
# 9 3 16 6
# 10 3 25 7
# 11 3 30 7
# 12 3 37 8
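A variant of the same idea is sketched below, under two assumptions that are not in the original answer: the temperature columns are stored as numbers (which removes the as.numeric(as.character()) conversions), and a row with no match should get NA. The name first_group is illustrative, and vapply is used so the result type is checked.

```r
df1 <- data.frame(month = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
                  temp = c(10, 15, 16, 25, 13, 17, 20, 5, 16, 25, 30, 37))
df2 <- data.frame(period = rep(c(1, 2, 3), times = c(8, 6, 12)),
                  max_temp = c(9, 13, 16, 18, 30, 37, 38, 39,
                               10, 15, 16, 25, 30, 32,
                               8, 10, 12, 14, 16, 18, 19, 25, 28, 30, 35, 40),
                  group = c("1", "1", "1", "2", "2", "2", "3", "3",
                            "3", "3", "4", "4", "5", "5",
                            "5", "5", "5", "6", "6", "6", "7", "7", "7", "7", "8", "8"))

first_group <- function(i) {
  cand <- i:nrow(df2)                           # rows of df2 indexed i or higher
  hit <- cand[df2$period[cand] == df1$month[i] &
              df1$temp[i] <= df2$max_temp[cand]]
  if (length(hit) == 0) NA_character_ else df2$group[hit[1]]
}

df1$new_group <- vapply(seq_len(nrow(df1)), first_group, character(1))
df1
```

On this data every row of df1 finds a match, so the NA branch is never taken. It is only there as a guard for other inputs.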
library(data.table)
dt1 <- data.table(df1, key="month")
dt2 <- data.table(df2, key="period")
## add a row index
dt1[, rn1 := seq(nrow(dt1))]
dt3 <-
unique(dt1[dt2, allow.cartesian=TRUE][, new_group := group[min(which(temp <= max_temp))], by="rn1"], by="rn1")
## Keep only the columns you want
dt3[, c("month", "temp", "max_temp", "new_group"), with=FALSE]
month temp max_temp new_group
1: 1 1 19 1
2: 1 3 19 1
3: 1 4 19 1
4: 1 7 19 1
5: 2 2 1 3
6: 2 5 1 3
7: 2 6 1 4
8: 3 10 18 5
9: 3 4 18 5
10: 3 7 18 5
11: 3 8 18 5
12: 3 9 18 5
I am trying to identify duplicates based on a match of elements in two vectors. Using duplicated() provides a vector of all matches, however I would like to index which elements match each other. Using the following code as an example:
x <- c(1,6,4,6,4,4)
y <- c(3,2,5,2,5,5)
frame <- data.frame(x,y)
matches <- duplicated(frame) | duplicated(frame, fromLast = TRUE)
matches
[1] FALSE TRUE TRUE TRUE TRUE TRUE
Ultimately, I would like to create a vector that identifies elements 2 and 4 as matching each other, and likewise elements 3, 5 and 6. Any thoughts are greatly appreciated.
Another data.table answer, using the group counter .GRP to assign every distinct element a label:
library(data.table)
d <- data.table(frame)
d[,z := .GRP, by = list(x,y)]
# x y z
# 1: 1 3 1
# 2: 6 2 2
# 3: 4 5 3
# 4: 6 2 2
# 5: 4 5 3
# 6: 4 5 3
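A base R sketch of the same labelling pastes the two columns into a key and matches it against its own unique values. The "_" separator and the names key and z are illustrative. The approach assumes the separator cannot make two different (x, y) pairs collide, which holds for these single-digit values.

```r
frame <- data.frame(x = c(1, 6, 4, 6, 4, 4),
                    y = c(3, 2, 5, 2, 5, 5))

# One string per row that identifies its (x, y) combination
key <- paste(frame$x, frame$y, sep = "_")

# Label each row by the position of its key among the unique keys
z <- match(key, unique(key))
z

# Row indices grouped by label: 1 / 2,4 / 3,5,6
split(seq_along(z), z)
```

As with .GRP, the labels are numbered in order of first appearance.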
How about this with plyr::ddply()?
library(plyr)
ddply(cbind(index=1:nrow(frame),frame),.(x,y),summarise,count=length(index),elems=paste0(index,collapse=","))
x y count elems
1 1 3 1 1
2 4 5 3 3,5,6
3 6 2 2 2,4
NB: the expression cbind(index=1:nrow(frame),frame) just adds a row index column to the data frame.
Using merge against the unique possibilities for each row, you can get a result:
labls <- data.frame(unique(frame),num=1:nrow(unique(frame)))
result <- merge(transform(frame,row = 1:nrow(frame)),labls,by=c("x","y"))
result[order(result$row),]
# x y row num
#1 1 3 1 1
#5 6 2 2 2
#2 4 5 3 3
#6 6 2 4 2
#3 4 5 5 3
#4 4 5 6 3
The result$num vector gives the groups.