Logical vector across many columns - r

I am trying to run a logical or statement across many columns in data.table but I am having trouble coming up with the code. My columns have a pattern like the one shown in the table below. I could use a regular logical vector if needed, but I was wondering if I could figure out a way to iterate across a1, a2, a3, etc. as my actual dataset has many "a" type columns.
Thanks in advance.
library(data.table)
x <- data.table(a1 = c(1, 4, 5, 6), a2 = c(2, 4, 1, 10), z = c(9, 10, 12, 12))
# this works but does not work for lots of a1, a2, a3 colnames
# because code is too long and unwieldy
x[a1 == 1 | a2 == 1 , b:= 1]
# this is broken and returns the following error
x[colnames(x)[grep("a", names(x))] == 1, b := 1]
Error in `[.data.table`(x, colnames(x)[grep("a", names(x))] == 1, `:=`(b, :
i evaluates to a logical vector length 2 but there are 4 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
Output looks like below:
a1 a2 z b
1: 1 2 9 1
2: 4 4 10 NA
3: 5 1 12 1
4: 6 10 12 NA

Try using a mask:
x$b <- 0
x[rowSums(ifelse(x[, list(a1, a2)] == 1, 1, 0)) > 0, b := 1]
Now imagine you have 100 a columns and they are the first 100 columns in your data table. Then you can select the columns using:
x[rowSums(ifelse(x[, c(1:100)] == 1, 1, 0)) > 0, b := 1]
ifelse(x[, list(a1, a2)] == 1, 1, 0) returns a matrix that has a 1 wherever there is a 1 in the a columns. Then I used rowSums to sum horizontally; if any of these sums is > 0, it means there was a 1 in at least one of the a columns of that row, so I simply selected those rows and set b to 1.
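If you would rather not hard-code column positions, a sketch of the same idea that picks the a columns by name (assuming they can all be matched by a name pattern such as ^a) would be:
library(data.table)
x <- data.table(a1 = c(1, 4, 5, 6), a2 = c(2, 4, 1, 10), z = c(9, 10, 12, 12))
a_cols <- grep("^a", names(x), value = TRUE)  # every column whose name starts with "a"
# flag rows in which any of the a columns equals 1
x[x[, rowSums(.SD == 1, na.rm = TRUE) > 0, .SDcols = a_cols], b := 1]
x
#    a1 a2  z  b
# 1:  1  2  9  1
# 2:  4  4 10 NA
# 3:  5  1 12  1
# 4:  6 10 12 NA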

Related

Create a new column of cumulative value based on multiple columns in data table

This is my first post after days of searching for answer. I'm transitioning from R data frame to R data table with difficulties.
What I want to achieve is to create some sort of cumulative value based on the indicator from multiple columns/variables.
I can do that quite easily with data frame:
DF = data.frame(
  a1 = c(1, 2, 3, 4, 5),
  a2 = c(1, 2, 3, 4, 5),
  a3 = c(1, 2, 3, 4, NA)
)
DF$b1 <- as.numeric(0)
for (i in 1:3) {
  DF$b1 <- as.numeric(DF[i] > 0) + DF$b1
}
However, to me, it is not so straightforward in data table. What I have done is the following:
DT<-setDT(DF)
DT[,b1:= as.numeric(DT[,1]>0)+as.numeric(DT[,2]>0)+as.numeric(DT[,3]>0)]
The code above works. But it doesn't seem to be user friendly if I want to increase the number of columns analyzed to (say) 10. In the case of data frame, I can just change the index from 1:3 to 1:10.
Appreciate any comments on how I can improve the code for data table above. It would also be very helpful if any good resources or documentations can be shared with me on this type of practical problem: referencing column index in a loop for data table. Thanks.
You can try rowSums after turning your table to logical via .SD > 0, i.e.
DT[, b1 := rowSums(.SD > 0)][]
# a1 a2 a3 b1
#1: 1 1 1 3
#2: 2 2 2 3
#3: 3 3 3 3
#4: 4 4 4 3
#5: 5 5 NA NA
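If only some of the columns should be counted, .SDcols restricts which columns end up in .SD; a small sketch (note that .SDcols = patterns(...) needs a reasonably recent data.table version):
# count, per row, how many of the a* columns are > 0
DT[, b1 := rowSums(.SD > 0), .SDcols = patterns("^a")]
# or select the columns by position, e.g. the first 3
DT[, b1 := rowSums(.SD > 0), .SDcols = 1:3]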

how to fill in values in a vector?

I have vectors in R containing a lot of zeros and a few non-zero numbers. Each vector starts with a non-zero number.
For example <1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0>
I would like to set all of the zeros equal to the most recent non-zero number.
I.e. this vector would become <1,1,1,1,1,1,2,2,2,2,2,2,4,4,4,4>
I need to do this for about 100 vectors containing around 6 million entries each. Currently I am using a for loop:
for (k in 1:length(vector)) {
  if (vector[k] == 0) {
    vector[k] <- vector[k - 1]
  }
}
Is there a more efficient way to do this?
Thanks!
One option, would be to replace those 0 with NA, then use zoo::na.locf:
x <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
x[x == 0] <- NA
zoo::na.locf(x) ## you possibly need: `install.packages("zoo")`
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4
Thanks to Richard for showing me how to use replace,
zoo::na.locf(replace(x, x == 0, NA))
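If you prefer not to add a zoo dependency, a similar sketch with data.table::nafill (type = "locf", available in recent data.table versions) should give the same result:
library(data.table)
x <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
nafill(replace(x, x == 0, NA), type = "locf")
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4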
You could try this:
k <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
k[which(k != 0)[cumsum(k != 0)]]
or another case, for which a cummax-based approach would not be appropriate:
k <- c(1,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0)
k[which(k != 0)[cumsum(k != 0)]]
Logic:
I am keeping "track" of the indices of the non-zero elements with which(k != 0); let's denote this new vector as x, x = c(1, 7, 13).
Next I am going to "sample" this new vector. How? From k I create a new vector that increments every time there is a non-zero element, cumsum(k != 0); let's denote it y, y = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3).
I "sample" from vector x: x[y], i.e. I take the first element of x 6 times, then the second element 6 times, then the third element 4 times. Let's denote this new vector as z, z = c(1, 1, 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 13, 13, 13, 13).
Finally I "sample" from vector k: k[z], i.e. I take the 1st element 6 times, then the 7th element 6 times, then the 13th element 4 times.
Adding to #李哲源's answer:
If you also need to replace the leading NAs with the nearest following non-NA value, and the remaining NAs with the last non-NA value, the code can be:
x <- c(0,0,1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
zoo::na.locf(zoo::na.locf(replace(x, x == 0, NA),na.rm=FALSE),fromLast=TRUE)
# you possibly need: `install.packages("zoo")`
# [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4

R - work on data frame rows based on condition

I'm trying to understand how can I work on the rows of a data frame based on a condition.
Having a data frame like this
> d<-data.frame(x=c(0,1,2,3), y=c(1,1,1,0))
> d
x y
1 0 1
2 1 1
3 2 1
4 3 0
how can I add +1 to all rows that contain a value of zero? (note that zeros can be found in any column), so that the result would look like this:
x y
1 1 2
2 1 1
3 2 1
4 4 1
The following code seems to do part of the job, but it just repeats the rows where the action was taken, as many times as the action was taken (2)...
> for(i in 1:nrow(d)){
+ d[d[i,]==0,]<-d[i,]+1
+ }
> d
x y
1 1 2
2 4 1
3 1 2
4 4 1
I'm sure there is a simple solution for this, maybe an apply function?, but I'm not getting there.
Thanks.
Some possibilities:
# 1
idx <- which(d == 0, arr.ind = TRUE)[, 1]
d[idx, ] <- d[idx, ] + 1
# 2
t(apply(d, 1, function(x) x + any(x == 0)))
# 3
d + apply(d == 0, 1, max)
The usage of which for vectors, e.g. which(1:3 > 2), is quite common, whereas it is used less for matrices: by specifying arr.ind = TRUE what we get is array indices, i.e. coordinates of every 0:
which(d == 0, arr.ind = TRUE)
row col
[1,] 1 1
[2,] 4 2
Since we are interested only in rows where zeros occur, I take the first column of which(d == 0, arr.ind = TRUE) and add 1 to all the elements in these rows by d[idx, ] <- d[idx, ] + 1.
Regarding the second approach, apply(d, 1, function(x) x) would be simply going row by row and returning the same row without any modifications. By any(x == 0) we check whether there are any zeros in a particular row and get TRUE or FALSE. However, by writing x + any(x == 0) we transform TRUE or FALSE to 1 or 0, respectively, as required.
Now the third approach. d == 0 is a logical matrix, and we use apply to go over its rows. Then when applying max to a particular row, we again transform TRUE, FALSE to 1, 0 and find a maximal element. This element is 1 if and only if there are any zeros in that row. Hence, apply(d == 0, 1, max) returns a vector of zeros and ones. The final point is that when we write A + b, where A is a matrix and b is a vector, the addition is column-wise. In this way, by writing d + apply(d == 0, 1, max) we add apply(d == 0, 1, max) to every column of d, as needed.
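For reference, running the third variant on the example data reproduces the desired result:
d <- data.frame(x = c(0, 1, 2, 3), y = c(1, 1, 1, 0))
d + apply(d == 0, 1, max)
#   x y
# 1 1 2
# 2 1 1
# 3 2 1
# 4 4 1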

Find the minimum distance between two data frames, for each element in the second data frame

I have two data frames ev1 and ev2, describing timestamps of two types of events collected over many tests. So, each data frame has columns "test_id" and "timestamp". What I need to find, for each event in ev2, is the minimum distance to an ev1 event in the same test.
I have a working code that merges the two datasets, calculates the distances, and then uses dplyr to filter for the minimum distance:
ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(6, 1, 8, 4, 5, 11))
data <- merge(ev2, ev1, by=c("test_id"), suffixes=c(".ev2", ".ev1"))
data$distance <- data$time.ev2 - data$time.ev1
library(dplyr)
min_data <- data %>%
  group_by(test_id, time.ev2) %>%
  filter(abs(distance) == min(abs(distance)))
While this works, the merge part is very slow and feels inefficient -- I'm generating a huge table with all combinations of ev2->ev1 for the same test_id, only to filter it down to one. It seems like there should be a way to "filter on the fly", during the merge. Is there?
Update: The following case with two "group by" columns fails when the data.table approach outlined by akrun is used:
ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4), group_id=c(0, 0, 0, 1, 1, 1))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(5, 6, 7, 1, 2, 8), group_id=c(0, 0, 0, 1, 1, 1))
setkey(setDT(ev1), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=abs(time-i.time)]
Error in eval(expr, envir, enclos) : object 'i.time' not found
Here's how I'd do it using data.table:
require(data.table)
setkey(setDT(ev1), test_id)
ev1[ev2, .(ev2.time = i.time, ev1.time = time[which.min(abs(i.time - time))]), by = .EACHI]
# test_id ev2.time ev1.time
# 1: 0 6 3
# 2: 0 1 1
# 3: 0 8 3
# 4: 1 4 4
# 5: 1 5 4
# 6: 1 11 4
In joins of the form x[i] in data.table, the prefix i. is used to refer to the columns in i when both x and i share the same name for a particular column.
Please see this SO post for an explanation on how this works.
This is syntactically more straightforward, makes it easier to understand what's going on, and is memory efficient (at the expense of a little speed), as it doesn't materialise the entire join result at all. In fact, this does exactly what you say in your post - filter on the fly, while merging.
On speed, it doesn't really matter in most cases. If there are a lot of rows in i, it might be a tad slower, as the j-expression has to be evaluated for each row in i. In contrast, #akrun's answer does a cartesian join followed by one filtering step. So while it's heavier on memory, it doesn't evaluate j for each row in i. But again, this shouldn't matter unless you work with a really large i, which is not often the case.
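As a side sketch, newer data.table versions let you supply the join columns with on= instead of setting keys, which also extends directly to the two-column grouping from your update:
library(data.table)
setDT(ev1); setDT(ev2)
ev1[ev2, on = .(test_id, group_id),
    .(ev2.time = i.time,
      ev1.time = time[which.min(abs(i.time - time))]),
    by = .EACHI]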
HTH
Maybe this helps:
library(data.table)
setkey(setDT(ev1), test_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=time-i.time]
DT[DT[,abs(distance)==min(abs(distance)), by=list(test_id, i.time)]$V1]
# test_id time i.time distance
#1: 0 3 6 3
#2: 0 1 1 0
#3: 0 3 8 5
#4: 1 4 4 0
#5: 1 4 5 1
#6: 1 4 11 7
Or
ev1[ev2, allow.cartesian=TRUE][,distance:= time-i.time][,
.SD[abs(distance)==min(abs(distance))], by=list(test_id, i.time)]
Update
Using the new grouping
setkey(setDT(ev1), test_id, group_id)
setkey(setDT(ev2), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=i.time-time]
DT[DT[,abs(distance)==min(abs(distance)), by=list(test_id,
group_id,i.time)]$V1]$distance
#[1] 2 3 4 -1 0 4
Based on the code you provided
min_data$distance
#[1] 2 3 4 -1 0 4

using R to select rows in data set with matching missing observations

I have determined how to identify all unique patterns of missing observations in a data set. Now I would like to select all rows in that data set with a given pattern of missing observations. I would like to do this iteratively so that if there are n patterns of missing observations in the data set I end up with n data sets each containing only 1 pattern of missing observations.
I know how to do this, but my method is not very efficient and is not general. I am hoping to learn a more efficient and more general approach because my real data sets are much larger and more variable than in the example below.
Here is an example data set and the code I am using. I do not bother to include the code I used to create the matrix zzz from the matrix dd, but can add that code if it helps.
dd <- matrix(c(
1, 0, 1, 1,
NA, 1, 1, 0,
NA, 0, 0, 0,
NA, 1,NA, 1,
NA, 1, 1, 1,
0, 0, 1, 0,
NA, 0, 0, 0,
0,NA,NA,NA,
1,NA,NA,NA,
1, 1, 1, 1,
NA, 1, 1, 0),
nrow=11, byrow=T)
zzz <- matrix(c(
1, 1, 1, 1,
NA, 1, 1, 1,
NA, 1,NA, 1,
1,NA,NA,NA
), nrow=4, byrow=T)
for (jj in 1:dim(zzz)[1]) {
  ddd <-
    dd[
      ((dd[, 1] %in% c(0, 1) & zzz[jj, 1] %in% c(0, 1)) |
        (is.na(dd[, 1]) & is.na(zzz[jj, 1]))) &
      ((dd[, 2] %in% c(0, 1) & zzz[jj, 2] %in% c(0, 1)) |
        (is.na(dd[, 2]) & is.na(zzz[jj, 2]))) &
      ((dd[, 3] %in% c(0, 1) & zzz[jj, 3] %in% c(0, 1)) |
        (is.na(dd[, 3]) & is.na(zzz[jj, 3]))) &
      ((dd[, 4] %in% c(0, 1) & zzz[jj, 4] %in% c(0, 1)) |
        (is.na(dd[, 4]) & is.na(zzz[jj, 4]))), ]
  print(ddd)
}
The 4 resulting data sets in this example are:
a)
1 0 1 1
0 0 1 0
1 1 1 1
b)
NA 1 1 0
NA 0 0 0
NA 1 1 1
NA 0 0 0
NA 1 1 0
c)
NA 1 NA 1
d)
0 NA NA NA
1 NA NA NA
Is there a more general and more efficient method of doing the same thing? In the example above the 4 resulting data sets are not saved, but I do save them with my real data.
Thank you for any advice.
Mark Miller
# Missing value patterns (TRUE = missing, FALSE = present)
patterns <- unique(is.na(dd))
result <- list()
for (i in seq_len(nrow(patterns))) {
  # Rows with this pattern
  rows <- apply(dd, 1, function(u) all(is.na(u) == patterns[i, ]))
  result <- append(result, list(dd[rows, ]))
}
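A more compact sketch of the same grouping, if a named list of data frames is acceptable, is to split the rows on a string key built from each row's NA pattern:
pattern_key <- apply(is.na(dd), 1, paste, collapse = ",")
result <- split(as.data.frame(dd), pattern_key)
Note that the list elements come back ordered by the key, not by the order in which the patterns first occur.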
Not completely sure I understand the question, but here's a stab at it...
The first thing you want to do is figure out which elements are NA, and which aren't. For that, you can use the is.na() function.
is.na(dd)
will generate a matrix of the same size as dd containing TRUE where the value was NA, and FALSE elsewhere.
You then want to find the unique patterns in your matrix. For that, you want the unique() function, which for matrices accepts a MARGIN argument, allowing you to find only unique rows in a matrix.
zzz <- unique(is.na(dd), MARGIN = 1)
creates a matrix similar to your zzz matrix; you could, of course, substitute NAs for the TRUEs and 1's for the FALSEs so it would be identical to your matrix.
You can then go a few directions from here to try to sort these into different datasets. Unfortunately, I think you're going to need one loop here.
results <- list()
for (r in 1:nrow(dd)) {
  # which pattern (row of zzz) does row r of dd match?
  ind <- which(apply(zzz, 1, function(x) all(x == is.na(dd[r, ]))))
  if (ind %in% names(results)) {
    results[[ind]] <- rbind(results[[ind]], dd[r, ])
  } else {
    results[[ind]] <- dd[r, ]
    names(results)[ind] <- ind
  }
}
At that point, you have a list which contains all of the rows of dd, sorted by pattern of NAs. You'll find that the pattern expressed in row 1 of zzz will be matched with row 1 of results, and the same for the rest of the rows.
