I would like to compute rowSums based on specific values for multiple columns (i.e., multiple conditions). I know how to use rowSums with a single condition (see the example below), but I can't figure out multiple conditions.
# rowSums with single, global condition
set.seed(100)
df <- data.frame(a = sample(0:100, 10),
                 b = sample(0:100, 10),
                 c = sample(0:100, 10),
                 d = sample(0:100, 10))
print(df)
a b c d
1 31 63 54 49
2 25 88 71 92
3 54 27 53 34
4 5 39 73 93
5 45 73 40 67
6 46 64 16 85
7 77 19 97 17
8 34 33 82 59
9 50 93 51 99
10 15 100 25 11
Single Condition Works
df$ROWSUMS <- rowSums(df[,1:4] <= 50)
# And produces
a b c d ROWSUMS
1 31 63 54 49 2
2 25 88 71 92 1
3 54 27 53 34 2
4 5 39 73 93 2
5 45 73 40 67 2
6 46 64 16 85 2
7 77 19 97 17 2
8 34 33 82 59 2
9 50 93 51 99 1
10 15 100 25 11 3
Multiple Conditions Don't Work
df$ROWSUMS_Multi <- rowSums(df[,1] <= 50 | df[,2] <= 25 | df[,3] <= 75)
Error in rowSums(df[, 1] <= 50 | df[, 2] <= 25 | df[, 3] <= 75) :
'x' must be an array of at least two dimensions
Desired Output
a b c d ROWSUMS_Multi
1 31 63 54 49 2
2 25 88 71 92 2
3 54 27 53 34 1
4 5 39 73 93 2
5 45 73 40 67 2
6 46 64 16 85 2
7 77 19 97 17 1
8 34 33 82 59 1
9 50 93 51 99 2
10 15 100 25 11 2
I could just be sub-setting incorrectly, but I haven't been able to find a fix.
One problem with [ when extracting a single row or single column is that it coerces the data.frame to a vector. Based on ?Extract:
x[i, j, ... , drop = TRUE]
Note that drop is TRUE by default,
and later in the documentation
drop - For matrices and arrays. If TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.
To avoid that, either use drop = FALSE or simply drop the comma, which returns a single-column data.frame because, for a data.frame, an index without a comma is by default treated as a column index rather than a row index:
rowSums(df[1] <= 50 | df[2] <= 25 | df[3] <= 75)
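For completeness, a sketch of the drop = FALSE route mentioned above, which keeps each selection two-dimensional so rowSums no longer errors:

```r
set.seed(100)
df <- data.frame(a = sample(0:100, 10), b = sample(0:100, 10),
                 c = sample(0:100, 10), d = sample(0:100, 10))
# drop = FALSE keeps each selection as a one-column data.frame (a 10 x 1
# logical matrix after comparison), so rowSums() gets two dimensions as it requires
rowSums(df[, 1, drop = FALSE] <= 50 | df[, 2, drop = FALSE] <= 25 |
        df[, 3, drop = FALSE] <= 75)
```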
Update
Based on the expected output, the rowSums can be written as
df$ROWSUMS <- rowSums(df[1:3] <= c(50, 25, 75)[col(df[1:3])])
df$ROWSUMS
#[1] 2 2 1 2 2 2 1 1 2 2
NOTE: The earlier part of this answer only addressed why the rowSums call failed; it did not check the expected output. Here, we need to compare 3 columns against different values. When we do
df[1] <= 50
we get a single column of TRUE/FALSE values. When we combine comparisons with |, as in
df[1] <= 50 | df[2] <= 25
the result is still a single column of TRUE/FALSE; the only difference is that a row with TRUE in either comparison becomes TRUE. The same holds however many logical comparisons we chain with |. Instead, combine them with +, which does an elementwise sum (TRUE counts as 1):
((df[1] <= 50)+ (df[2] <= 25) + (df[3] <= 75))[,1] # note it is a matrix
We can do the same with vectors, i.e., using the comma:
((df[, 1] <= 50)+ (df[, 2] <= 25) + (df[, 3] <= 75)) # vector output
The only drawback is having to repeat the +. If we use rowSums instead, make sure the comparison values are replicated (via col) to the dimensions of the data.frame subset. Another option is Map with Reduce:
Reduce(`+`, Map(`<=`, df[1:3], c(50, 25, 75)))
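As a quick self-contained illustration of the Map/Reduce route (the two-row data here is hypothetical, not the question's df): Map pairs each column with its own cutoff, and Reduce(`+`, ...) adds up the resulting logical vectors, with TRUE counting as 1.

```r
# hypothetical mini data.frame; cutoffs as in the question
d <- data.frame(a = c(31, 54), b = c(63, 27), c = c(54, 53))
# Map compares each column against its own cutoff; Reduce sums the logicals
Reduce(`+`, Map(`<=`, d, c(50, 25, 75)))
#[1] 2 1
```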
We can also use cbind to create a matrix from the multiple conditions, using column positions or column names, then use rowSums as usual, e.g.
> rowSums(cbind(df[,'a'] <= 50 ,df[,'b'] <= 25 ,df[,'c'] <= 75), na.rm = TRUE)
[1] 2 2 1 2 2 2 1 1 2 2
> rowSums(cbind(df['a'] <= 50 ,df['b'] <= 25 ,df['c'] <= 75), na.rm = TRUE)
[1] 2 2 1 2 2 2 1 1 2 2
Using dplyr
library(dplyr)
df %>% mutate(ROWSUMS=rowSums(cbind(.['a'] <= 50 ,.['b'] <= 25 ,.['c'] <= 75), na.rm = TRUE))
Related
How can I vectorize the following operation in R that involves modifying column Z recursively using lagged values of Z?
library(dplyr)
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df
X Y
1 20 27
2 69 49
3 92 32
4 28 56
5 10 26
6 70 20
7 53 39
8 81 89
9 96 55
10 11 84
df <- df %>% mutate(Z=if_else(row_number()==1, initial_Z-Y, NA_real_))
df
X Y Z
1 20 27 973
2 69 49 NA
3 92 32 NA
4 28 56 NA
5 10 26 NA
6 70 20 NA
7 53 39 NA
8 81 89 NA
9 96 55 NA
10 11 84 NA
for (i in 2:nrow(df)) {
df$Z[i] <- (df$Z[i-1]*df$X[i-1]/df$X[i])-df$Y[i]
}
df
X Y Z
1 20 27 973.000000
2 69 49 233.028986
3 92 32 142.771739
4 28 56 413.107143
5 10 26 1130.700000
6 70 20 141.528571
7 53 39 147.924528
8 81 89 7.790123
9 96 55 -48.427083
10 11 84 -506.636364
So the first value of Z is set based on initial_Z and the first value of Y. The rest of the values of Z are calculated using the lagged values of X and Z and the current value of Y.
My actual df is large, and I need to repeat this operation thousands of times in a simulation. Using a for loop takes too much time. I prefer implementing this using dplyr, but other approaches are also welcome.
Many thanks in advance for any help.
I don't know that you can avoid the effect of for loops, but in general R should be pretty good at them. Given that, here is a Reduce variant that might suffice for you:
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df$Z <- with(df, Reduce(function(prevZ, i) {
if (i == 1) return(prevZ - Y[i])
prevZ*X[i-1]/X[i] - Y[i]
}, seq_len(nrow(df)), init = initial_Z, accumulate = TRUE))[-1]
df
# X Y Z
# 1 20 27 973.000000
# 2 69 49 233.028986
# 3 92 32 142.771739
# 4 28 56 413.107143
# 5 10 26 1130.700000
# 6 70 20 141.528571
# 7 53 39 147.924528
# 8 81 89 7.790123
# 9 96 55 -48.427083
# 10 11 84 -506.636364
To be clear, Reduce uses a for loop internally to get through the data. I generally don't like using indices as the values for Reduce's x, but since Reduce iterates over a single argument and we need both X and Y, using the row indices is a required step.
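That said, this particular recurrence is linear, so the loop can be removed entirely: multiplying Z[i] = Z[i-1]*X[i-1]/X[i] - Y[i] through by X[i] shows that W = Z*X satisfies W[i] = W[1] - sum(Y[2:i]*X[2:i]), which is just a cumulative sum. A sketch of that algebra (assuming X contains no zeros):

```r
set.seed(5)
initial_Z <- 1000
df <- data.frame(X = round(100 * runif(10), 0), Y = round(100 * runif(10), 0))

# W = Z*X turns the recurrence into a running difference:
# W[i] = W[1] - sum(Y[2:i] * X[2:i])
W1 <- (initial_Z - df$Y[1]) * df$X[1]
df$Z <- (W1 - cumsum(c(0, (df$Y * df$X)[-1]))) / df$X
```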
The same can be accomplished using purrr::accumulate2. Note that these are just for loops under the hood; consider writing the loop in Rcpp if it is causing a real performance problem in R.
library(dplyr)
library(purrr)

df %>%
  mutate(Z = accumulate2(Y, c(1, head(X, -1) / X[-1]), ~ ..1 * ..3 - ..2, .init = 1000)[-1])
X Y Z
1 20 27 973
2 69 49 233.029
3 92 32 142.7717
4 28 56 413.1071
5 10 26 1130.7
6 70 20 141.5286
7 53 39 147.9245
8 81 89 7.790123
9 96 55 -48.42708
10 11 84 -506.6364
You could unlist(Z):
df %>%
  mutate(Z = unlist(accumulate2(Y, c(1, head(X, -1) / X[-1]), ~ ..1 * ..3 - ..2, .init = 1000))[-1])
I have a dataset of this form.
a=data.frame(A=1:5,B=1:5,matrix(seq(50),nrow = 5))
colnames(a)<-c("A","B", paste0(1:10))
A B 1 2 3 4 5 6 7 8 9 10
1 1 1 6 11 16 21 26 31 36 41 46
2 2 2 7 12 17 22 27 32 37 42 47
3 3 3 8 13 18 23 28 33 38 43 48
4 4 4 9 14 19 24 29 34 39 44 49
5 5 5 10 15 20 25 30 35 40 45 50
I intend to use apply to take the product of row elements conditionally on the values of A and B. Take row 2, for instance: we have A=2 and B=2, so the code should look up column "2" through column "2+2" and take the product of all the elements of the selected vector; the result is thus 7*12*17 = 1428.
I can do it for a row
prod(a[1,match(a$A[1],colnames(a)):match(a$A[1]+a$B[1],colnames(a))])
but I can't figure out a way to apply it to the whole data.frame. Any help?
Here is one option with apply: specify MARGIN = 1 to loop over the rows, create the indexes by matching the first two elements ('A', 'B') against the column names, build a sequence (:), subset the values of 'x', and take the prod.
apply(a, 1, function(x) {
i1 <- match(x[1], names(x))
i2 <- match(x[1] + x[2], names(x))
prod(x[i1:i2])
})
#[1] 6 1428 150696 17535024 2362500000
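The same result can be had without apply's row-wise matrix coercion by iterating over A, B, and the row number with mapply (a sketch; it relies, as the question does, on the numeric column names being contiguous):

```r
# rebuild the question's data
a <- data.frame(A = 1:5, B = 1:5, matrix(seq(50), nrow = 5))
colnames(a) <- c("A", "B", paste0(1:10))
# for each row, multiply the columns named A through A+B
res <- mapply(function(A, B, r) prod(unlist(a[r, as.character(A:(A + B))])),
              a$A, a$B, seq_len(nrow(a)))
res
#[1] 6 1428 150696 17535024 2362500000
```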
I've got a dataset that, when scored, needs to be converted from a continuous scale to a categorical one. Each value is put into one of 10 categories, at equal intervals based on the minimum and maximum of its column. So if the minimum is 1 and the maximum is 100, there are 10 categories such that any value from 1-10 = 1, 11-20 = 2, 21-30 = 3, ..., 91-100 = 10. Here's what my data looks like:
df <- as.data.frame(cbind(test1 = sample(13:52, 15),
                          test2 = sample(16:131, 15)))
> df
test1 test2
1 44 131
2 26 83
3 74 41
4 6 73
5 83 20
6 63 110
7 23 29
8 42 64
9 41 40
10 10 96
11 2 39
12 14 24
13 67 30
14 51 59
15 66 37
So far I have a function:
trail.bin <- function(data, col, min, max) {
for(i in 1:10) {
for(e in 0:9) {
x <- as.data.table(data)
mult <- (max - min)/10
x[col >= min+(e*mult) & col < min+(i*mult),
col := i]
}
}
return(x)
}
What I'm trying to do is take the minimum and maximum, find the spacing of the intervals (mult), then use two loops with data.table reference syntax. The outcome I'm hoping for is:
df2
test1 test2
1 5 131
2 3 83
3 8 41
4 1 73
5 9 20
6 7 110
7 3 29
8 5 64
9 5 40
10 2 96
11 1 39
12 2 24
13 7 30
14 6 59
15 7 37
Thanks!
You could create a function using cut
library(data.table)
trail.bin <- function(data, col, n) {
data[, (col) := lapply(.SD, cut, n, labels = FALSE), .SDcols = col]
return(data)
}
setDT(df)
trail.bin(df, 'test1', 10)
You can also pass multiple columns
trail.bin(df, c('test1', 'test2'), 10)
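For comparison, here is a base-R sketch of the same binning using findInterval with 10 equal-width breaks over the column's observed range (edge handling differs slightly from cut, which pads the range by a small fraction):

```r
# hypothetical helper: 10 equal-width bins over the observed min/max
bin10 <- function(v) {
  breaks <- seq(min(v), max(v), length.out = 11)
  findInterval(v, breaks, rightmost.closed = TRUE, all.inside = TRUE)
}
bin10(1:100)[c(1, 10, 11, 100)]
#[1] 1 1 2 10
```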
I have a dataset (x) that contains:
1 10
20 30
34 38
59 83
...
I have a big n x 1 matrix. For each row in x, I want to assign the value 1 to the corresponding range of rows. For example:
mat[1:10,1] = 1
mat[20:30,1] = 1
etc...
In R, x is quite big, so the following loop takes a while:
for ( j in 1:dim(x)[1] ) {
mat[x[j,1]:x[j,2], 1] <- 1
}
Please help me if there is a faster way to do this. Thanks.
You can easily make a vector of the rows you want to assign a value of 1 to in your big matrix, using apply on x with seq.int to get the row numbers, like this:
rows <- unlist(apply(x, 1, function(x) seq.int(x[1], x[2])))
rows
# [1] 1 2 3 4 5 6 7 8 9 10 20 21 22 23 24 25 26 27 28 29 30 34 35 36 37 38 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
Then use subsetting, which will be much faster:
mat[ rows , 1 ] <- 1
If m is your set of start and stop locations:
m <- matrix(scan(), ncol=2)
#------
1: 1 10
3: 20 30
5: 34 38
7: 59 83
9:
Read 8 items
mapply( seq.int, m[,1], m[,2])
rx1[ unlist( mapply( seq.int, m[,1], m[,2]) ), 1] <- 1
(Trivially different from SimonO101's earlier contribution.)
data.table usually excels in cases like this. Here is a data.table-based solution:
library(data.table)
indexes<-data.table(istart=c(1L,20L,34L,59L), istop=c(10L,30L,38L,83L))
mat<-data.table(val=sample(1L:1e5L,1e5))
mat[indexes[,list(i=seq(istart,istop)),by="istart"][,i],val:=1L]
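On R >= 4.0.0, base sequence() can generate all of the row numbers in one shot from the start/stop columns, avoiding apply/mapply entirely (a sketch; x and mat here are stand-ins for your objects):

```r
# x: two-column matrix of start/stop rows, as in the question
x <- cbind(c(1, 20, 34, 59), c(10, 30, 38, 83))
# sequence(lengths, from = starts) concatenates each start:stop run (R >= 4.0.0)
rows <- sequence(x[, 2] - x[, 1] + 1, from = x[, 1])
mat <- matrix(0, nrow = 100, ncol = 1)
mat[rows, 1] <- 1
```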
I have multiple frames; for this example, suppose 2.
Each frame comprises two columns: an index column and a value column.
sz<-5;
frame_1<-data.frame(index=sort(sample(1:10,sz,replace=F)),value=rpois(sz,50));
frame_2<-data.frame(index=sort(sample(1:10,sz,replace=F)),value=rpois(sz,50));
frame_1:
index value
1 49
6 62
7 58
8 30
10 50
frame_2:
index value
4 60
5 64
6 48
7 46
9 57
The goal is to create a third frame, frame_3, whose indices will be the union of those in frame_1 and frame_2,
frame_3<-data.frame(index = sort(union(frame_1$index,frame_2$index)));
and which will comprise two additional columns, value_1 and value_2.
frame_3$value_1 will be filled out from frame_1$value, and frame_3$value_2 will be filled out from frame_2$value.
These should be filled out like so:
frame_3:
index value_1 value_2
1 49 NA
4 49 60 # value_1 is filled through with previous value
5 49 64 # value_1 is filled through with previous value
6 62 48
7 58 46
8 30 46 # value_2 is filled through with previous value
9 30 57 # value_1 is filled through with previous value
10 50 57 # value_2 is filled through with previous value
I'm looking for an efficient solution, as I'm dealing with hundreds of thousands of records.
This problem screams for data.table. You can construct the columns one by one using rolling joins, x[y, roll=TRUE].
require(data.table)
dt1 <- data.table(frame_1)
dt2 <- data.table(frame_2)
setkey(dt1, index)
setkey(dt2, index)
dt3 <- data.table(index = sort(unique(c(dt1$index, dt2$index))))
> dt1[dt2[dt3, roll=TRUE], roll=TRUE]
# index value value.1
# 1: 1 49 NA
# 2: 4 49 60
# 3: 5 49 64
# 4: 6 62 48
# 5: 7 58 46
# 6: 8 30 46
# 7: 9 30 57
# 8: 10 50 57
If your data.frames aren't very large, you can just use merge combined with zoo::na.locf.
R> library(zoo)
R> frame_3 <- merge(frame_1, frame_2, by="index",
+ all=TRUE, suffixes=paste(".",1:2,sep=""))
R> (frame_3 <- na.locf(frame_3))
index value.1 value.2
1 1 49 NA
2 4 49 60
3 5 49 64
4 6 62 48
5 7 58 46
6 8 30 46
7 9 30 57
8 10 50 57
Or, just use zoo objects to begin with, assuming your "value" columns are all one type (like a matrix, you can't mix types in zoo objects).
R> z1 <- zoo(frame_1$value, frame_1$index)
R> z2 <- zoo(frame_2$value, frame_2$index)
R> (z3 <- na.locf(merge(z1, z2)))
z1 z2
1 49 NA
4 49 60
5 49 64
6 62 48
7 58 46
8 30 46
9 30 57
10 50 57
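If zoo isn't available, the carry-forward step is also short in base R. A hypothetical helper (locf is my name, not a base function), usable on the merged frame's value columns:

```r
# last-observation-carried-forward without zoo (hypothetical helper)
locf <- function(v) {
  idx <- cumsum(!is.na(v))       # index of the most recent non-NA value
  c(NA, v[!is.na(v)])[idx + 1]   # idx 0 (leading NAs) maps to the NA in front
}
locf(c(NA, 49, NA, NA, 62, 58))
#[1] NA 49 49 49 62 58
```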