Values comparison under columns combinations - r

I have a data frame of the following type:
date ID1 ID2 sum
2017-1-5 1 a 200
2017-1-5 1 b 150
2017-1-5 2 a 300
2017-1-4 1 a 200
2017-1-4 1 b 120
2017-1-4 2 a 300
2017-1-3 1 b 150
I'm trying to compare the sum values of each combination of ID columns across dates to see whether they are equal. In the example above, I'd like the code to identify that the sum for the [ID1=1, ID2=b] combination differs between 2017-1-5 and 2017-1-4 (in my real data I have more than 2 ID categories and more than 2 dates).
I'd like my output to be a data frame which contains all the combinations that include (at least one) unequal results. In my example:
date ID1 ID2 sum
2017-1-5 1 b 150
2017-1-4 1 b 120
2017-1-3 1 b 150
I tried to solve it using loops, following the question "Is there a R function that applies a function to each pair of columns", but without much success.
Your help will be appreciated.

Using dplyr, we can group_by_(.dots=paste0("ID",1:2)) and then check whether each group contains a single unique sum value:
library(dplyr)
res <- df %>% group_by_(.dots=paste0("ID",1:2)) %>%
mutate(flag=(length(unique(sum))==1)) %>%
ungroup() %>% filter(flag==FALSE) %>% select(-flag)
The group_by_ allows you to group multiple ID columns easily. Just change 2 to however many ID columns (i.e., N) you have assuming that they are numbered consecutively from 1 to N. The column flag is created to indicate if all of the values are the same (i.e., number of unique values is 1). Then we filter for results for which flag==FALSE. This gives the desired result:
res
### A tibble: 3 x 4
## date ID1 ID2 sum
## <chr> <int> <chr> <int>
##1 2017-1-5 1 b 150
##2 2017-1-4 1 b 120
##3 2017-1-3 1 b 150
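The underscore verbs such as group_by_ have been deprecated since dplyr 0.7. A minimal sketch of the same idea with current verbs, assuming dplyr 1.0 or later is installed; the id_cols name is only illustrative, and the data frame is rebuilt from the question's example:

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  date = c("2017-1-5", "2017-1-5", "2017-1-5",
           "2017-1-4", "2017-1-4", "2017-1-4", "2017-1-3"),
  ID1  = c(1L, 1L, 2L, 1L, 1L, 2L, 1L),
  ID2  = c("a", "b", "a", "a", "b", "a", "b"),
  sum  = c(200L, 150L, 300L, 200L, 120L, 300L, 150L),
  stringsAsFactors = FALSE
)

# Illustrative name; change 1:2 to 1:N for N consecutively numbered ID columns
id_cols <- paste0("ID", 1:2)

res <- df %>%
  group_by(across(all_of(id_cols))) %>%
  filter(n_distinct(sum) > 1) %>%   # keep groups whose sums are not all equal
  ungroup()
```

Filtering directly on n_distinct(sum) > 1 avoids creating and then dropping the temporary flag column.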

Related

R Group dataframe according to certain conditions and each group has the same number of each condition

My dataframe has 324 different images, each with a unique ImageID. There are 3*3 = 9 conditions, and each image belongs to one of them. For example, Image 1 belongs to condition 1A and Image 5 belongs to condition 2B. What I am trying to achieve is to assign the images randomly to 6 blocks so that each block contains the same number of images from each condition. Then, when the dataframe is grouped by BlokNo, the images will be presented in a random order. I also want to generate multiple presentation orders from the same dataframe.
My data frame looks like this:
ImageID Category1 Category2 BlokNo
1 1 A
4 1 A
6 1 A
5 2 B
8 2 B
3 2 B
14 3 C
12 3 C
17 3 C
I would like my data to look like this:
ImageID Category1 Category2 BlokNo
1 1 A 2
4 1 A 1
6 1 A 3
5 2 B 3
8 2 B 2
3 2 B 1
14 3 C 1
12 3 C 3
17 3 C 2
Below is the code I tried. It does part of what I need, but since I have 3*3=9 conditions in total, I am wondering whether there is a quicker way to do it. Thank you in advance!
Cond1 <- df %>% filter(Category1 == 1 & Category2 == "A") # filter out one condition
Cond1$BlokNo <- sample(rep(1:6, each = ceiling(36/6))[1:36]) # randomly assign a block number from 1:6 to each image in this condition
Instead of filtering by each unique combination, do a group_by on the 'Category' columns and take a sample of row_number() within each group:
library(dplyr)
df <- df %>%
group_by(Category1, Category2) %>%
mutate(BlockNo = sample(row_number())) %>%
ungroup
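Note that sample(row_number()) yields the numbers 1 to n within each condition, so it produces block numbers 1 to 6 only when each condition has exactly 6 images. With 36 images per condition and 6 blocks, as described in the question, one possible sketch is to shuffle a repeated 1:6 vector instead. The toy data and the n_blocks name below are assumptions made for illustration:

```r
library(dplyr)

set.seed(1)     # only to make the illustration reproducible
n_blocks <- 6   # hypothetical name for the number of blocks

# Toy stand-in for the real data: 9 conditions x 36 images = 324 rows
df <- expand.grid(rep = 1:36, Category1 = 1:3, Category2 = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
df$ImageID <- seq_len(nrow(df))

df <- df %>%
  group_by(Category1, Category2) %>%
  # rep_len() repeats 1:6 up to the group size; sample() shuffles it
  mutate(BlockNo = sample(rep_len(seq_len(n_blocks), n()))) %>%
  ungroup()

# Each block should now hold 6 images from every condition
table(df$BlockNo, paste0(df$Category1, df$Category2))
```

Re-running the mutate step (without a fixed seed) draws a new random assignment, which is one way to generate several presentation orders from the same data frame.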

Counting the instances of a variable that exceeds a threshold

I have a dataset with id and speed.
id <- c(1,1,1,1,2,2,2,2,3,3,3)
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- cbind(id, speed)
limit <- 35
Whenever 'speed' goes above 'limit', that counts as 1 violation. Another violation is counted only after the speed has dropped below 'limit' and then crossed it again.
I want the result to look like this:
id | Speed Viol.
----------
1 | 2
---------
2 | 2
---------
3 | 1
---------
The speeds counted for each violation, per id, are:
id1 (1) 40 (2) 50,40
id2 (1) 45,50 (2) 55
id3 (1) 50,50,60
How can I do this without using if()?
Here's a method using tapply, as suggested in the comments, applied to the original vectors.
tapply(speed, id, FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
1 2 3
2 2 1
tapply applies a function to each group, here, by ID. The function checks if the first element of an ID is over 35, and then concatenates this to the output of diff, whose argument is checking if subsequent observations are greater than 35. Thus diff checks if an ID returns to above 35 after dropping below that level. Negative values in the resulting vector are converted to FALSE (0) with > 0 and these results are summed.
tapply returns a named vector, which can be fairly nice to work with. However, if you want a data.frame, then you could use aggregate instead as suggested by d.b:
aggregate(speed, list(id=id), FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
id x
1 1 2
2 2 2
3 3 1
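To make the function passed to tapply and aggregate easier to follow, here is the same computation unpacked step by step for the speeds of id 1. The intermediate variable names are only illustrative:

```r
limit <- 35
x <- c(40, 30, 50, 40)                 # speeds for id 1

above   <- x > limit                   # TRUE FALSE TRUE TRUE
changes <- diff(above)                 # -1 1 0; +1 marks a move from below to above
starts  <- c(above[1], changes) > 0    # TRUE FALSE TRUE FALSE
n_viol  <- sum(starts)                 # 2 violations for id 1
```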
Here's a dplyr solution. I group by id then check if speed is above the limit in each row, but wasn't in the previous entry. (I get the previous row using lag). If this is the case, it produces TRUE. Or, if it's the first row for the id (i.e., row_number()==1) and it's above the limit, this gives a TRUE, too. Then, I sum all the TRUE values for each id using summarise.
id <- c(1,1,1,1,2,2,2,2,3,3,3)
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- data.frame(id, speed)
limit <- 35
library(dplyr)
i %>%
group_by(id) %>%
mutate(viol=(speed>limit&lag(speed)<=limit)|(row_number()==1&speed>limit)) %>%
summarise(sum(viol))
# A tibble: 3 x 2
id `sum(viol)`
<dbl> <int>
1 1 2
2 2 2
3 3 1
Here is another option with data.table,
library(data.table)
setDT(i)[, id1 := rleid(speed > limit), by = id][
speed > limit, .(violations = uniqueN(id1)), by = id][]
which gives,
id violations
1: 1 2
2: 2 2
3: 3 1
aggregate(speed~id, data.frame(i), function(x) sum(rle(x>limit)$values))
# id speed
#1 1 2
#2 2 2
#3 3 1
The main idea is that x > limit checks for instances when the speed limit is violated, and rle(x > limit) groups those instances into runs of consecutive violations or consecutive non-violations. Then all you need to do is count the runs of consecutive violations (those where rle(x>limit)$values is TRUE).
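As a small illustration of what rle returns, here are the speeds of id 2 from the question (the variable names are only illustrative):

```r
limit <- 35
x <- c(45, 50, 30, 55)       # speeds for id 2

runs <- rle(x > limit)
runs$values                  # TRUE FALSE TRUE: above, below, above again
runs$lengths                 # 2 1 1: length of each run
n_viol <- sum(runs$values)   # 2 runs above the limit
```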

How to find first occurrence of a vector of numeric elements within a data frame column?

I have a data frame (min_set_obs) which contains two columns: the first, called Treatment, contains numeric values, and the second is an id column called seq:
min_set_obs
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
Let's say I have a vector of numeric values, called key:
key
[1] 1 1 1 2 2 3
I.e. a vector of three 1s, two 2s, and one 3.
How would I go about identifying which rows from my min_set_obs data frame contain the first occurrence of values from the key vector?
I'd like my output to look like this:
Treatment seq
1 29
1 23
3 60
1 6
2 41
2 44
I.e. the sixth row from min_set_obs was 'extra' (it was the fourth 1 when there should only be three 1s), so it would be removed.
I'm familiar with the %in% operator, but I don't think it can tell me the position of the first occurrence of the key vector in the first column of the min_set_obs data frame.
Thanks
Here is an option with base R, where we split the 'min_set_obs' by 'Treatment' into a list, get the head of elements in the list using the corresponding frequency of 'key' and rbind the list elements to a single data.frame
res <- do.call(rbind, Map(head, split(min_set_obs, min_set_obs$Treatment), n = table(key)))
row.names(res) <- NULL
res
# Treatment seq
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
Using dplyr, you can first count the keys with table and then take the corresponding number of top rows from each group:
library(dplyr)
m <- table(key)
min_set_obs %>% group_by(Treatment) %>% do({
# as.character(.$Treatment[1]) returns the treatment for the current group
# use coalesce to get the default number of rows (0) if the treatment doesn't exist in key
head(., coalesce(m[as.character(.$Treatment[1])], 0L))
})
# A tibble: 6 x 2
# Groups: Treatment [3]
# Treatment seq
# <int> <int>
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
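Both answers above return the rows sorted by Treatment, whereas the desired output in the question keeps the original row order. A minimal base R sketch that preserves the order, assuming the example data from the question; the helper names (n_allowed, occurrence, allowed) are illustrative only:

```r
min_set_obs <- data.frame(
  Treatment = c(1, 1, 3, 1, 2, 1, 2),
  seq       = c(29, 23, 60, 6, 41, 5, 44)
)
key <- c(1, 1, 1, 2, 2, 3)

# How many rows each treatment may keep, as a plain named integer vector
counts    <- table(key)
n_allowed <- setNames(as.integer(counts), names(counts))

trt <- min_set_obs$Treatment
# Running occurrence number of each treatment value: 1 2 1 3 1 4 2
occurrence <- ave(seq_along(trt), trt, FUN = seq_along)

# Look up the allowance per row; treatments missing from key get 0
allowed <- unname(n_allowed[as.character(trt)])
allowed[is.na(allowed)] <- 0L

res <- min_set_obs[occurrence <= allowed, ]
```

Here the sixth row (the fourth 1) is dropped because its occurrence number exceeds the three 1s present in key.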

Delete the lower value in one column based on repeat values in another column in R (large data set)

I have a large data set loaded into R that contains multiple duplicates in one column (colA) and another column that has different unique values (colB). I need to figure out a way to delete the lowest values in colB that correspond to the same value in colA.
For example,
A 1
A 2
A 3
B 8
B 9
B 10
should become
A 3
B 10
If this were something like Python, it would be an easy command to code, but I am new to R and greatly appreciate the help.
Here's a dplyr solution
d <- read.table(textConnection("A 1
A 2
A 3
B 8
B 9
B 10"))
library(dplyr)
d %>%
group_by(V1) %>%
summarize(max = max(V2))
# A tibble: 2 × 2
V1 max
<fctr> <int>
1 A 3
2 B 10
You can do this with aggregate
aggregate(df$B, list(df$A), max)
Group.1 x
1 A 3
2 B 10
library(plyr)
data<-data.frame("x"=c(rep("A",3),rep("B",3)),"y"=c(1:3,8:10))
ddply(data,~x,summarise,max=max(y))
x max
1 A 3
2 B 10
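The answers above summarise the data into one row per group. If the goal is instead to drop rows from the original data frame, so that any other columns are kept, one possible base R sketch uses ave to compare each value with its group maximum. The column names V1 and V2 follow the read.table example above:

```r
d <- data.frame(V1 = c("A", "A", "A", "B", "B", "B"),
                V2 = c(1, 2, 3, 8, 9, 10))

# ave() returns the group maximum repeated for every row of the group
keep <- d$V2 == ave(d$V2, d$V1, FUN = max)
res  <- d[keep, ]
```

If a group has several rows tied for the maximum, this keeps all of them.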

R: Add value in new column of data frame depending on value in another column

I have 2 data frames in R, df1 and df2.
df1 represents in each row one subject in an experiment. It has 3 columns. The first two columns specify a combination of groups the subject is in. The third column contains the experimental result.
df2 contains values for each combination of groups that can be used for normalization. Thus, it has three columns, two for the groups and a third for the normalization constant.
Now I want to create a fourth column in df1 with the experimental results from the third column, divided by the matching normalization constant in df2. How can I do this?
Here's an example:
df1 <- data.frame(c(1,1,1,1),c(1,2,1,2),c(10,11,12,13))
df2 <- data.frame(c(1,1,2,2),c(1,2,1,2),c(30,40,50,60))
names(df1)<-c("Group1","Group2","Result")
names(df2)<-c("Group1","Group2","NormalizationConstant")
As result, I need a new column in df1 with c(10/30,11/40,12/30,13/40).
My first attempt was the following code, which fails on my real data with the error message "In is.na(e1) | is.na(e2) : Length of the longer object is not a multiple of the length of the shorter object". However, when I replace the references ==df1[,1] and ==df1[,2] with fixed values, it works. Is df1[,1] really returning only the value of the column for this particular row?
df1$NormalizedResult<- df1$Result / df2[df2[,1]==df1[,1] & df2[,2]==df1[,2],]$NormalizationConstant
Thanks for your help!
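If the rows of df1 and df2 were aligned one-to-one, it would be as simple as the line below. With the example data they are not aligned (rows 3 and 4 of df1 get divided by 50 and 60 instead of 30 and 40), so the last two values do not match the desired 12/30 and 13/40: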
> df1$expnormed <- df1$Result/df2$NormalizationConstant
> df1
Group1 Group2 Result expnormed
1 1 1 10 0.3333333
2 1 2 11 0.2750000
3 1 1 12 0.2400000
4 1 2 13 0.2166667
If they were not exactly aligned you would use merge:
> dfm <-merge(df1,df2)
> dfm
Group1 Group2 Result NormalizationConstant
1 1 1 10 30
2 1 1 12 30
3 1 2 11 40
4 1 2 13 40
> dfm$expnormed <- with(dfm, Result/NormalizationConstant)
Another possibility:
df1$res <- df1$Result/df2$NormalizationConstant[match(do.call("paste", df1[1:2]), do.call("paste", df2[1:2]))]
Group1 Group2 Result res
1 1 1 10 0.3333333
2 1 2 11 0.2750000
3 1 1 12 0.4000000
4 1 2 13 0.3250000
Hth
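On the question's own attempt: df2[,1]==df1[,1] compares the two whole columns element by element (recycling the shorter one) rather than looking up the current row's group, which is why it warns about mismatched lengths on the real data. A minimal sketch of the merge approach that also restores the original row order of df1, using the example data; the row_id helper column is an assumption added for illustration:

```r
df1 <- data.frame(Group1 = c(1, 1, 1, 1),
                  Group2 = c(1, 2, 1, 2),
                  Result = c(10, 11, 12, 13))
df2 <- data.frame(Group1 = c(1, 1, 2, 2),
                  Group2 = c(1, 2, 1, 2),
                  NormalizationConstant = c(30, 40, 50, 60))

# merge() sorts by the key columns, so remember the original order first
df1$row_id <- seq_len(nrow(df1))

dfm <- merge(df1, df2, by = c("Group1", "Group2"), all.x = TRUE)
dfm <- dfm[order(dfm$row_id), ]
dfm$NormalizedResult <- dfm$Result / dfm$NormalizationConstant
dfm$row_id <- NULL
```

With all.x = TRUE, a row of df1 whose group combination is missing from df2 is kept and gets NA as its normalized result.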
