Counting the instances of a variable that exceeds a threshold in R

I have a dataset with id and speed.
id <- c(1,1,1,1,2,2,2,2,3,3,3)
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- cbind(id, speed)
limit <- 35
Each time speed crosses above limit it counts as one violation, and it counts again only after speed has dropped below the limit and crossed it once more.
I want the output to look like this:
id | Speed Viol.
---|------------
 1 | 2
 2 | 2
 3 | 1
Here are the violating runs per id (count):
id 1: (1) 40 (2) 50, 40
id 2: (1) 45, 50 (2) 55
id 3: (1) 50, 50, 60
How can I do this without using if()?

Here's a method using tapply, as suggested in the comments, working on the original vectors.
tapply(speed, id, FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
1 2 3
2 2 1
tapply applies a function to each group, here by id. The function checks whether the first element for an id is over the limit, then concatenates that to the output of diff applied to the logical vector speed > limit; a 1 in the diff output marks a row that returns above the limit after dropping below it. The > 0 comparison turns zeros and negative values into FALSE and the upward crossings into TRUE, and summing the result counts the violations.
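To see what the inner function does, here is a walk-through for id 1 (my illustration, not part of the original answer):
x <- c(40, 30, 50, 40)                     # speeds for id 1
x > limit                                  # TRUE FALSE TRUE TRUE
diff(x > limit)                            # -1  1  0  (changes between consecutive rows)
c(x[1] > limit, diff(x > limit))           #  1 -1  1  0
sum(c(x[1] > limit, diff(x > limit)) > 0)  # 2 violations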
tapply returns a named vector, which can be fairly nice to work with. However, if you want a data.frame, then you could use aggregate instead as suggested by d.b:
aggregate(speed, list(id=id), FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
id x
1 1 2
2 2 2
3 3 1
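aggregate names the count column x by default; if you want a more descriptive name (my addition, not part of the original answer), you could rename it afterwards:
res <- aggregate(speed, list(id = id), FUN = function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
names(res)[2] <- "violations"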

Here's a dplyr solution. I group by id, then check whether speed is above the limit in each row but was not above it in the previous row (I get the previous row using lag()). If so, that produces a TRUE. It is also a TRUE if it's the first row for the id (i.e., row_number() == 1) and the speed is above the limit. Then I sum all the TRUE values for each id using summarise.
id <- c(1,1,1,1,2,2,2,2,3,3,3)
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- data.frame(id, speed)
limit <- 35
library(dplyr)
i %>%
  group_by(id) %>%
  mutate(viol = (speed > limit & lag(speed) < limit) | (row_number() == 1 & speed > limit)) %>%
  summarise(sum(viol))
# A tibble: 3 x 2
     id `sum(viol)`
  <dbl>       <int>
1     1           2
2     2           2
3     3           1
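As a possible simplification (a sketch, not from the original answer): lag() takes a default argument for the first row of each group, which can replace the row_number() == 1 special case, assuming a reasonably recent dplyr:
i %>%
  group_by(id) %>%
  summarise(violations = sum(speed > limit & lag(speed, default = -Inf) < limit))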

Here is another option with data.table,
library(data.table)
setDT(i)[, id1 := rleid(speed > limit), by = id][
  speed > limit, .(violations = uniqueN(id1)), by = id][]
which gives,
   id violations
1:  1          2
2:  2          2
3:  3          1
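To see why this works (my illustration): rleid() assigns a new run id every time the logical speed > limit changes value, so for id 1 (speeds 40, 30, 50, 40):
rleid(c(40, 30, 50, 40) > limit)
# [1] 1 2 3 3
Keeping only the rows where speed > limit leaves run ids 1, 3, 3, and uniqueN() counts the two distinct runs, i.e. two violations.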

aggregate(speed~id, data.frame(i), function(x) sum(rle(x>limit)$values))
# id speed
#1 1 2
#2 2 2
#3 3 1
The main idea is that x > limit flags the rows where the speed limit is violated, and rle() groups those flags into runs of consecutive violations or consecutive non-violations. Then all you need to do is count the runs of violations, i.e. the runs where rle(x > limit)$values is TRUE (summing a logical vector counts the TRUEs).
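For example (my illustration), for id 1 the run-length encoding looks like this:
rle(c(40, 30, 50, 40) > limit)
# Run Length Encoding
#   lengths: int [1:3] 1 1 2
#   values : logi [1:3] TRUE FALSE TRUE
sum(rle(c(40, 30, 50, 40) > limit)$values) adds TRUE + FALSE + TRUE, giving the 2 violation runs.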

Related

subset dataframe based on hierarchical preference of factor levels within column in R

I have a dataframe that I would like to subset based on a hierarchical preference of factor levels within a column. The following example shows what I mean: per level of ID I want to keep only one method, preferring CACL if it exists for that level, otherwise KCL, and otherwise H2O.
ID<-c(1,1,1,2,2,3)
method<-c("CACL","KCL","H2O","H2O","KCL","H2O")
df1<-data.frame(ID,method)
ID method
1 1 CACL
2 1 KCL
3 1 H2O
4 2 H2O
5 2 KCL
6 3 H2O
The desired result, df2, would be:
ID<-c(1,2,3)
method<-c("CACL","KCL","H2O")
df2<-data.frame(ID,method)
ID method
1 1 CACL
2 2 KCL
3 3 H2O
I have done something similar before, subsetting by selecting a minimum number within a level, but I am not able to adapt it. I am wondering whether I should use ifelse here too?
# if present, choose rows containing "number" 2 instead of 1
# (this column contained only the two numbers 1 and 2)
library(dplyr)
new <- df %>%
  group_by(col1, col2, col3) %>%
  summarize(number = ifelse(any(number > 1), min(number[number > 1]), 1))
dfnew <- merge(new, df, by = c("colxyz", "number"), all.x = TRUE)
You can use order with match and then simply !duplicated:
df1 <- df1[order(match(df1$method, c("CACL","KCL","H2O"))),]
df1[!duplicated(df1$ID),]
# ID method
#1 1 CACL
#5 2 KCL
#6 3 H2O
# Variant that does not reorder df1 itself:
i <- order(match(df1$method, c("CACL","KCL","H2O")))
df1[i[!duplicated(df1$ID[i])],]
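To see what match() contributes here (my illustration): it returns the preference rank of each row's method, so ordering by it puts the preferred methods first, and !duplicated() then keeps the first (most preferred) row per ID:
match(df1$method, c("CACL","KCL","H2O"))
# [1] 1 2 3 3 2 3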
An option using dplyr:
df1 %>%
  mutate(preference = match(method, c("CACL","KCL","H2O"))) %>%
  group_by(ID) %>%
  filter(preference == min(preference)) %>%
  select(-preference)
# A tibble: 3 x 2
# Groups: ID [3]
     ID method
  <dbl> <fct>
1     1 CACL
2     2 KCL
3     3 H2O
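A dplyr translation of the order/match/!duplicated idea above, as a sketch (my addition, not from the original answers):
df1 %>%
  arrange(match(method, c("CACL","KCL","H2O"))) %>%
  distinct(ID, .keep_all = TRUE)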

calculate count of values that satisfy two conditions in R

I'm new to R and have a large data set where I need to check whether one of two values exceeds a threshold; if it does, I count it, and if it doesn't, I ignore that value.
I have to iterate through several columns, but I run into the issue that my if statement only checks the first value. A simple example would have the columns id, val1, val2, val3: if val1 or val2 is greater than a threshold, I count val3; otherwise I ignore it. My data set is called data.
id val1 val2 val3
1 .4 4 10
2 5 5 11
3 2 2 1
4 6 1 10
5 2 100 4
My code is:
if(data$val1 > 5 | data$val2 > 5){sum(data$val3 > 5)}
The issue is that if() only uses the first element of a vector condition, so this only checks the first row. How can I apply the check to every row?
You don't need if() at all; logical subsetting is vectorized:
sum(data$val3[data$val1 > 5 | data$val2 > 5])
We can also do this with rowSums; note that this counts the qualifying rows rather than summing val3:
sum(rowSums(df1[c('val1', 'val2')]>5)>0)
#[1] 2
with(data, sum(val3[pmax(val1, val2) > 5]))
[1] 14
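pmax() returns the element-wise (parallel) maximum of its arguments, so pmax(val1, val2) > 5 is TRUE on every row where at least one of the two exceeds 5; indexing val3 with that and summing gives the same 14 as the first answer. A quick illustration (my addition):
pmax(c(.4, 5, 2, 6, 2), c(4, 5, 2, 1, 100))
# [1]   4   5   2   6 100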

Aggregate in R taking way too long

I'm trying to count the unique values of x across groups y.
This is the function:
aggregate(x ~ y, z[which(z$grp == 0), ], function(x) length(unique(x)))
This is taking way too long (~6 hours and not done yet). I don't want to stop processing as I have to finish this tonight.
by() was taking too long as well
Any ideas what is going wrong and how I can reduce the processing time to around an hour?
My dataset has 3 million rows and 16 columns.
Input dataframe z
x y grp
1 1 0
2 1 0
1 2 1
1 3 0
3 4 1
I want to get the count of unique (x) for each y where grp = 0
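For reference, a reproducible version of this sample data (my construction from the table above):
z <- data.frame(x   = c(1, 2, 1, 1, 3),
                y   = c(1, 1, 2, 3, 4),
                grp = c(0, 0, 1, 0, 1))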
UPDATE: Using @eddi's excellent answer, I have
x y
1: 2 1
2: 1 3
Any idea how I can quickly summarize this as the number of x's for each value y?
So for this it will be:
Number of x   y
          5   1
          1   3
Here you go:
library(data.table)
setDT(z) # to convert to data.table in place
z[grp == 0, uniqueN(x), by = y]
# y V1
#1: 1 2
#2: 3 1
library(dplyr)
z %>%
  filter(grp == 0) %>%
  group_by(y) %>%
  summarize(nx = n_distinct(x))
is the dplyr way, though it may not be as fast as data.table.
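For the follow-up in the UPDATE: if what you want there is the raw number of rows of x per y rather than the unique count (the 5 presumably comes from your full data, since the sample would give 2), a data.table sketch using .N, assuming setDT(z) has already been run as above:
z[grp == 0, .(`Number of x` = .N), by = y]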

countif within R repeated across each row

I'm having trouble replicating some of the COUNTIF functionality I'm familiar with in Excel. I've got a data frame with a large number of rows. I'm trying to take two variables (x & z) and count how many rows in the data frame match that combination. I figured out doing:
sum(`mydataframe`$x == `mydataframe`$x[1] & `mydataframe`$z == `mydataframe`$z[1])
This gives me the correct countif for the x & z combination of the first row [1] across the whole data set. The problem is that I have to hard-code that [1]. I've tried using with(...), but then I can no longer access the whole column.
I'd like to do the count of the x & z combination for each row in the data frame, and have the output as a new vector that I can add as another column, for every row through to the end.
Hopefully this is pretty simple. I figure some combination of with(), apply, or something similar will do it, but I'm just too new.
I am interested in a count total in every instance, not a running sequential count.
It seems that you are asking for a way to create a new column containing, for each row, the number of rows in the entire data frame whose x and z values equal those of that row.
With a bit of sample data:
(dat <- data.frame(x=c(1, 1, 2), z=c(3, 3, 3)))
# x z
# 1 1 3
# 2 1 3
# 3 2 3
One simple approach would be grouping with dplyr's group_by function and then creating a new column with the number of elements in that group:
library(dplyr)
dat %>% group_by(x, z) %>% mutate(n=n())
# x z n
# (dbl) (dbl) (int)
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
A base R solution would probably involve ave:
dat$n <- ave(rep(NA, nrow(dat)), dat$x, dat$z, FUN=length)
dat
# x z n
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
An option using data.table would be to convert the 'data.frame' to a 'data.table' (setDT(dat)), group by 'x' and 'z', and assign 'n' as the number of elements in each group (.N).
library(data.table)
setDT(dat)[, n:= .N, by = .(x,z)]
dat
# x z n
#1: 1 3 2
#2: 1 3 2
#3: 2 3 1

aggregate over several variables in r

I have a rather large dataset in long format where I need to count the number of instances of each ID arising from two different variables, A and B; the same person can appear in multiple rows due to either A or B. Counting the instances of ID is not too hard, but I also need to count the instances of ID due to A and due to B separately, and return these counts as variables in the dataset.
The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occurs (using nrow()), and then put those counts back together nicely.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .var = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, then the fastest and easiest solution is...
df$IDCount <- ave(df$ID, df$GRP, FUN = length)
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
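Since plyr is now superseded, a modern dplyr equivalent of the same two counts, as a sketch (my addition, assuming a reasonably recent dplyr):
library(dplyr)
df %>%
  add_count(ID, name = "ID.FREQ") %>%
  add_count(ID, GRP, name = "GRP.FREQ")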
