R data.table sum number of columns exceeding threshold

I would like to count the number of columns whose values exceed a threshold in each observation. Additionally, I would like to specify those column names and thresholds as vectors (cols, th).
Take the example data set:
x <- data.table(x1=c(1,2,3),x2=c(3,2,1))
The goal is to create a new column exceed.count with number of columns in which x1 and x2 exceed a respective threshold. Assuming the case in which the thresholds for both x1 and x2 are 2:
th <- c(2,2)
The function could be defined as:
fn <- function(z,th) (sum(z[,x1]>th[1],z[,x2]>th[2]))
And the number of columns exceeding the thresholds calculated by:
x[,exceed.count:=fn(.SD,th),by=seq_len(nrow(x))]
The results are:
x1 x2 exceed.count
1: 1 3 1
2: 2 2 0
3: 3 1 1
What I would like to do is be able to specify the column names as vector, e.g.
cols <- c("x1","x2")
I was playing around with a function of the form:
fn.i <- function(z,i) (sum(z[,cols[i],with=FALSE] > th[i]))
which works for a single i, but how do I vectorize this across elements of cols? (cols and th will always be the same length)

I think there is an easier way to solve your problem:
x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
th<-c(2,2)
x[,exceed.count:=sum(.SD>th),by=seq_len(nrow(x))]
Or, taking into account your input (only a subset of columns):
x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
sd.cols = c("x1")
th<-c(2)
x[,exceed.count:=sum(.SD>th),by=seq_len(nrow(x)), .SDcols=sd.cols]
Or
x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
sd.cols = c("x1")
th<-c(2,2)
x[,exceed.count:=sum(.SD>th[1]),by=seq_len(nrow(x)), .SDcols=sd.cols]
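To vectorise across cols and th as the question asks, one possible sketch (not taken from the answers here) pairs each column with its threshold using Map and adds up the resulting logical vectors with Reduce, which also avoids grouping by row. The names cols, th and exceed.count come from the question; init = 0L is only an assumption made to force an integer result even when cols has a single element.

```r
library(data.table)

x <- data.table(x1 = c(1, 2, 3), x2 = c(3, 2, 1))
cols <- c("x1", "x2")
th <- c(2, 2)  # th[i] is the threshold for cols[i]

# Map() compares column i with th[i], giving one logical vector per column;
# Reduce() then adds those vectors element-wise, i.e. counts per row
x[, exceed.count := Reduce(`+`, Map(`>`, .SD, th), init = 0L), .SDcols = cols]
print(x)
```

Because the comparison is done column by column, this does not rely on recycling th across a one-row .SD.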

@JonnyCrunch's approach of specifying a subset of columns with .SDcols=sd.cols works fine (as long as you ensure length(th) equals the number of columns in .SD, otherwise vector recycling will mess things up).
Here's an alternative with shorter syntax (but it will be less performant for very wide tables):
x[,exceed.count:=sum(.SD>th), by=seq_len(nrow(x)) ]
There is no need to explicitly specify .SDcols; let it default to all columns.
Define the threshold vector th for all columns, using the don't-care value +Inf in those columns you don't want counted.
> x <- data.table(x0=4:6, x1=1:3, x2=3:1, x3=7:5)
x0 x1 x2 x3
1: 4 1 3 7
2: 5 2 2 6
3: 6 3 1 5
> th <- c(+Inf, 2, +Inf, 2)
> fn <- function(z,th) (z>th)
> x[,exceed.count:=sum(.SD>th), by=seq_len(nrow(x)) ]
x0 x1 x2 x3 exceed.count
1: 4 1 3 7 1
2: 5 2 2 6 1
3: 6 3 1 5 2
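Typing the +Inf entries by hand gets error-prone once there are many columns. A minimal sketch of building th from a named vector instead, assuming the hypothetical name th_wanted for the thresholds you actually care about:

```r
library(data.table)

x <- data.table(x0 = 4:6, x1 = 1:3, x2 = 3:1, x3 = 7:5)

# Hypothetical input: thresholds only for the columns that should be counted
th_wanted <- c(x1 = 2, x3 = 2)

# Start with +Inf (never exceeded) for every column, then fill in the real thresholds
th <- setNames(rep(Inf, ncol(x)), names(x))
th[names(th_wanted)] <- th_wanted

x[, exceed.count := sum(.SD > th), by = seq_len(nrow(x))]
print(x)
```

This keeps th aligned with the column order of x regardless of the order in which the thresholds are listed.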

Here's one way to get around iteration over rows:
x <- data.table(x1=c(1,2,3), x2=c(3,2,1))
thL <- list(x1 = 2, x2 = 2)
nm = names(thL)
x[, n := 0L]
for (i in seq_along(thL)) x[thL[i], on=sprintf("%s>%s", nm[i], nm[i]), n := n + 1L][]
x1 x2 n
1: 1 3 1
2: 2 2 0
3: 3 1 1

Related

R Difference with previous column across multiple columns

I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to get the difference between consecutive columns, so that the dataframe is transformed to:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the resulting values by taking the difference. This is a small example; the original df has around 150 columns.
Thx!

x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
# id v1 v2 v3
# 1 1 4 1 4
# 2 2 1 0 3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.

A simple for loop might do the trick, but for larger data it will be slower than other approaches.
df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1), v3 = c(9,4))
df2 <- df
# from the third column (v2) onwards, subtract the column to the left
for(i in 3:ncol(df)){
  df2[,i] <- df[,i] - df[,i-1]
}
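With around 150 columns, a loop-free variant may be worth trying: convert the cumulative columns to a matrix and subtract it from a copy of itself shifted by one column. This is only a sketch, assuming all of the value columns are numeric; the name cols is introduced here for illustration.

```r
x <- data.frame(id = 1:2, v1 = c(4, 1), v2 = c(5, 1), v3 = c(9, 4))
cols <- c("v1", "v2", "v3")  # the cumulative columns, in order

m <- as.matrix(x[, cols])

# every column except the first minus every column except the last
x[, cols[-1]] <- m[, -1, drop = FALSE] - m[, -ncol(m), drop = FALSE]
print(x)
```

The first column (v1) is left untouched, since it has nothing to its left to subtract.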

Sum Values of Every Column in Data Frame with Conditional For Loop

So I want to go through a data set and sum the values from each column based on the condition of my first column. The data and my code so far looks like this:
x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20
for(i in colnames(data)){
if(data$x>2){
x1 <-sum(data[[i]])
}
else{
x2 <-sum(data[[i]])
}
}
My assumption was that the for loop would call each column by name from the data and then sum the values in each column based on whether they matched the condition of column x.
I want to sum half the values from each column and assign them to a value x1 and do the same for the remainder, assigning it to x2. I keep getting an error saying the following:
the condition has length > 1 and only the first element will be used
What am I doing wrong and is there a better way to go about this? Ideally I want a table that looks like this:
v1 v2 v3
x1 6 7 35
x2 4 3 15

Here's a dplyr solution. First, I define the data frame.
df <- read.table(text = "x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20", header = TRUE)
# x v1 v2 v3
# 1 1 0 1 5
# 2 2 4 2 10
# 3 3 5 3 15
# 4 4 1 4 20
Then, I create a label (x_check) to indicate which group each row belongs to based on your criterion (x > 2), group by this label, and summarise each column with a v in its name using sum.
# Load library
library(dplyr)
df %>%
  mutate(x_check = ifelse(x>2, "x1", "x2")) %>%
  group_by(x_check) %>%
  # pass sum directly; funs() is deprecated in recent dplyr versions
  summarise_at(vars(contains("v")), sum)
# # A tibble: 2 x 4
# x_check v1 v2 v3
# <chr> <int> <int> <int>
# 1 x1 6 7 35
# 2 x2 4 3 15

Not sure if I understood your intention correctly, but here is how you would reproduce your results with base R:
df <- data.frame(
x = c(1:4),
v1 = c(0, 4, 5, 1),
v2 = 1:4,
v3 = (1:4)*5
)
x1 <- colSums(df[df$x > 2, 2:4, drop = FALSE])
x2 <- colSums(df[df$x <= 2, 2:4, drop = FALSE])
Where
df[df$x > 2, 2:4, drop = FALSE] will create a subset of df where the rows satisfy df$x > 2 and the columns are 2:4 (meaning the second, third and fourth column), drop = FALSE is there mainly to prevent R from simplifying the results in some special cases
colSums does a by-column sum on the subsetted data.frame
If your x column was really a condition (e.g. a logical vector) you could just do
x1 <- colSums(df[df$x, 2:4, drop = FALSE])
x2 <- colSums(df[!df$x, 2:4, drop = FALSE])
Note that no loop is needed to get the results; in R you should use vectorized functions as much as possible.
More generally, you could do such aggregation with aggregate:
aggregate(df[, 2:4], by = list(condition = df$x <= 2), FUN = sum)
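If the goal is exactly the table from the question, with x1 and x2 as row labels, base R's rowsum can produce it in one call. A small sketch, assuming the same df as above; the names grp and res are made up for this example.

```r
df <- data.frame(x = 1:4, v1 = c(0, 4, 5, 1), v2 = 1:4, v3 = (1:4) * 5)

# label each row by which side of the condition it falls on
grp <- ifelse(df$x > 2, "x1", "x2")

# rowsum() adds up the rows of each group, column by column
res <- rowsum(df[, c("v1", "v2", "v3")], group = grp)
print(res)
```

The group labels become the row names of the result.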

Summarise whether a value is contained in multiple other columns

I am investigating a large dataset with 100+ columns. One set of columns contain integers where the integers are not repeated across columns. For example, the number 6 may or may not appear in a row, but it will only appear once across the columns.
An example mock-up (bearing in mind that there are hundreds of other, non-related columns surrounding these):
> x1 <- c(1,6,4,5)
> x2 <- c(6,0,11,3)
> x3 <- c(5,0,9,6)
> df <- data.frame(cbind(x1, x2, x3))
> df
x1 x2 x3
1 1 6 5
2 6 0 0
3 4 11 9
4 5 3 6
Ideally using dplyr (since I am trying to become more "fluent"), how would I most cleanly create a new column to indicate whether or not a 6 was contained in the other columns? I am hesitant to use a function like reshape2's melt given the 100s of other columns in the dataset.
My current, messy, solution:
> library(dplyr)
> df <- mutate(df, Contains6 = (x1 == 6) + (x2 == 6) + (x3 == 6),
+ Contains6 = revalue(as.factor(as.character(Contains6)),
+ c("0"="No","1"="Yes")))
> df
x1 x2 x3 Contains6
1 1 6 5 Yes
2 6 0 0 Yes
3 4 11 9 No
4 5 3 6 Yes
Possible extension to this: would there be a clean, programmatic way of creating similar columns for all values contained in x1:x3, e.g. Contains1, Contains4, etc?

We can use apply with MARGIN=1
df$Contains6 <- c("no", "yes")[(apply(df==6, 1, any))+1L]
df$Contains6
#[1] "yes" "yes" "no" "yes"
If we need to create multiple "Contains" columns, we can loop with lapply
v1 <- c(1,4,6)
df[paste0("Contains", v1)] <- lapply(v1, function(i)
c('no', 'yes')[(apply(df==i, 1, any))+1L])
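Since the real dataset has hundreds of unrelated columns, it is probably safer to compare only the columns of interest rather than the whole of df. A hedged base R sketch of that idea, using rowSums instead of apply; the names cols and vals are assumptions for illustration.

```r
df <- data.frame(x1 = c(1, 6, 4, 5), x2 = c(6, 0, 11, 3), x3 = c(5, 0, 9, 6))
cols <- c("x1", "x2", "x3")  # only these columns are searched
vals <- c(1, 4, 6)           # one "Contains" column is created per value

for (v in vals) {
  # df[cols] == v is a logical matrix; a row sum above 0 means at least one match
  df[[paste0("Contains", v)]] <- ifelse(rowSums(df[cols] == v) > 0, "Yes", "No")
}
print(df)
```

Restricting the comparison to cols also keeps previously added Contains columns out of later comparisons.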

In R, how to sum certain rows of a data frame with certain logic?

Hi experienced R users,
It's kind of a simple thing.
I want to sum x by Group.1 depending on one controllable variable.
I'd like to sum x by grouping the first two rows when I say something like: number <- 2
If I say 3, it should sum x of the first three rows by Group.1
Any idea how I might tackle this problem? Should I write a function?
Thank y'all in advance.
Group.1 Group.2 x
1 1 Eggs 230299
2 2 Eggs 263066
3 3 Eggs 266504
4 4 Eggs 177196

If the sums you want are always cumulative, there's a function for that, cumsum. It works like this.
> cumsum(c(1,2,3))
[1] 1 3 6
In this case you might want something like
> mysum <- cumsum(yourdata$x)
> mysum[2] # the sum of the first two rows
> mysum[3] # the sum of the first three rows
> mysum[number] # the sum of the first "number" rows

Assuming your data is in mydata:
with(mydata, sum(x[Group.1 <= 2]))

You could use the by function.
For instance, given the following data.frame:
d <- data.frame(Group.1=c(1,1,2,1,3,3,1,3),Group.2=c('Eggs'),x=1:8)
> d
Group.1 Group.2 x
1 1 Eggs 1
2 1 Eggs 2
3 2 Eggs 3
4 1 Eggs 4
5 3 Eggs 5
6 3 Eggs 6
7 1 Eggs 7
8 3 Eggs 8
You can do this:
num <- 3 # sum only the first 3 rows
# The aggregation function:
# it is called for each group receiving the
# data.frame subset as input and returns the aggregated row
innerFunc <- function(subDf){
# we create the aggregated row by taking the first row of the subset
row <- head(subDf,1)
# we set the x column in the result row to the sum of the first "num"
# elements of the subset
row$x <- sum(head(subDf$x,num))
return(row)
}
# Here we call the "by" function:
# it returns an object of class "by" that is a list of the resulting
# aggregated rows; we want to convert it to a data.frame, so we call
# rbind repeatedly by using "do.call(rbind, ... )"
d2 <- do.call(rbind,by(data=d,INDICES=d$Group.1,FUN=innerFunc))
> d2
Group.1 Group.2 x
1 1 Eggs 7
2 2 Eggs 3
3 3 Eggs 19

If you want to sum only a subset of your data:
my_data <- data.frame(c("TRUE","FALSE","FALSE","FALSE","TRUE"), c(1,2,3,4,5))
names(my_data)[1] <- "DESCRIPTION" #Change Column Name
names(my_data)[2] <- "NUMBER" #Change Column Name
sum(subset(my_data, my_data$DESCRIPTION=="TRUE")$NUMBER)
You should get 6.

Not sure why Eggs are important here ;)
df1 <- data.frame(Gr=seq(4),
x=c(230299, 263066, 266504, 177196)
)
Now with n=2, i.e. the first two rows:
n <- 2
sum(df1[, "x"][df1[, "Gr"]<=n])
The expression [df1[, "Gr"]<=n] creates a logical vector to subset the elements in df1[, "x"] before summing them.
Also, it appears your Group.1 is the same as the row number. If so, this may be simpler:
sum(df1[, "x"][1:n])
or to get all at once
cumsum(df1[, "x"])
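On the question of whether to write a function: wrapping the row-position version in a small helper is one option. This is just a sketch, and sum_first_n is a made-up name.

```r
df1 <- data.frame(Gr = seq(4),
                  x = c(230299, 263066, 266504, 177196))

# sum the first n elements of a vector; head() copes with n larger than the length
sum_first_n <- function(values, n) sum(head(values, n))

print(sum_first_n(df1$x, 2))  # first two rows
print(sum_first_n(df1$x, 3))  # first three rows
```

Using head rather than values[1:n] avoids NA results when n exceeds the number of rows.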

Multiply various subsets of a data frame by different vectors

I would like to multiply several columns in my data frame by a vector of values. The specific vector of values changes depending on the value in another column.
--EDIT--
What if I make the data set more complicated, i.e., more than 2 conditions and the conditions are randomly shuffled around the data set?
Here is an example of my data set:
df=data.frame(
Treatment=(rep(LETTERS[1:4],each=2)),
Species=rep(1:4,each=2),
Value1=c(0,0,1,3,4,2,0,0),
Value2=c(0,0,3,4,2,1,4,5),
Value3=c(0,2,4,5,2,1,4,5),
Condition=c("A","B","A","C","B","A","B","C")
)
Which looks like:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 2 B
B 2 1 3 4 A
B 2 3 4 5 C
C 3 4 2 2 B
C 3 2 1 1 A
D 4 0 4 4 B
D 4 0 5 5 C
If Condition=="A", I would like to multiply columns 3-5 by the vector c(1,2,3). If Condition=="B", I would like to multiply columns 3-5 by the vector c(4,5,6). If Condition=="C", I would like to multiply columns 3-5 by the vector c(0,1,0). The resulting data frame would therefore look like this:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 12 B
B 2 1 6 12 A
B 2 0 4 0 C
C 3 16 10 12 B
C 3 2 2 3 A
D 4 0 20 24 B
D 4 0 5 0 C
I have tried subsetting the data frame and multiplying by the vector:
t(t(subset(df[,3:5],df[,6]=="A")) * c(1,2,3))
But I can't return the subsetted data frame to the original. Is there any way to perform this operation without subsetting the data frame, so that other columns (e.g., Treatment, Species) are preserved?

Here's a fairly general solution that you should be able to adapt to fit your needs.
Note the first argument in the outer call is a logical vector and the second is numeric, so before multiplication TRUE and FALSE are converted to 1 and 0, respectively. We can add the outer results because the conditions are non-overlapping and the FALSE elements will be zero.
multiples <-
outer(df$Condition=="A",c(1,2,3)) +
outer(df$Condition=="B",c(4,5,6)) +
outer(df$Condition=="C",c(0,1,0))
df[,3:5] <- df[,3:5] * multiples

Here's a non-vectorized, but easy to understand solution:
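replaceFunction <- function(v){
  # apply() passes each row as a character vector, so convert the values back
  m <- as.numeric(v[3:5])
  if (v[6]=="A")
    out <- m * c(1,2,3)
  else if (v[6]=="B")
    out <- m * c(4,5,6)
  else if (v[6]=="C")
    out <- m * c(0,1,0)
  else
    out <- m
  return(out)
}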
g <- apply(df, 1, replaceFunction)
df[3:5] <- t(g)
df

Edited to reflect some notes from the comments
Assuming that Condition is a factor, you could do this:
#Modified to reflect OP's edit - the same solution works just fine
m <- matrix(c(1:6,0,1,0),3,3,byrow = TRUE)
df[,3:5] <- with(df,df[,3:5] * m[Condition,])
which makes use of fairly quick vectorized multiplication. And obviously, wrapping this in with isn't strictly necessary, it's just what popped out of my brain. Also note the subsetting comment below by Backlin.
More globally, remember that every subsetting you can do with subset you can also do with [, and crucially, [ supports assignment via [<-. So if you want to alter a portion of a data frame or matrix, you can always use this type of idiom:
df[rowCondition,colCondition] <- <replacement values>
assuming of course that <replacement values> is the same dimension as your subset of df. It may work otherwise, but you will run afoul of R's recycling rules and R may kick back a warning.
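If Condition is a character column rather than a factor (the default for data.frame since R 4.0.0), indexing an unnamed matrix with it will fail. A variation on the lookup idea, sketched under the assumption that every value of Condition has a row in the table, is to give the multiplier matrix row names and index by name:

```r
df <- data.frame(
  Treatment = rep(LETTERS[1:4], each = 2),
  Species   = rep(1:4, each = 2),
  Value1    = c(0, 0, 1, 3, 4, 2, 0, 0),
  Value2    = c(0, 0, 3, 4, 2, 1, 4, 5),
  Value3    = c(0, 2, 4, 5, 2, 1, 4, 5),
  Condition = c("A", "B", "A", "C", "B", "A", "B", "C")
)

# one row of multipliers per condition, looked up by row name
m <- rbind(A = c(1, 2, 3), B = c(4, 5, 6), C = c(0, 1, 0))

# as.character() makes the lookup work for both factor and character columns
df[, 3:5] <- df[, 3:5] * m[as.character(df$Condition), ]
print(df)
```

Adding another condition then only means adding a row to m.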

# Note: both versions below cover only two cases (B vs. everything else)
df[3:5] <- df[3:5] * t(sapply(df$Condition, function(x) if(x=="B") 4:6 else 1:3))
Or by vector multiplication
df[3:5] <- df[3:5] * (3*(df$Condition == "B") %*% matrix(1, 1, 3)
+ matrix(1:3, nrow(df), 3, byrow=TRUE))