How to remove rows where all columns are zero using data.table - r

I have the following data frame:
# the original dataset
dat <- data.frame(a = c(0,0,2,3), b= c(1,0,0,0), c=c(0,0,1,3))
It looks like this:
> dat
a b c
1 0 1 0
2 0 0 0
3 2 0 1
4 3 0 3
What I want to do is remove the rows in which every column is zero, resulting in:
a b c
0 1 0
2 0 1
3 0 3
How can I do that with data.table?
In reality I have a much larger dataset to process, so the solution needs to be very fast.
I tried this, but it is still slow:
dat <- dat[Reduce(`|`, dat), ]

You can try with rowSums -
library(data.table)
setDT(dat)
dat[rowSums(dat != 0) != 0]
# a b c
#1: 0 1 0
#2: 2 0 1
#3: 3 0 3

Again using rowSums(). I think this is more readable.
library(data.table)
dat[(rowSums(dat) !=0),]
a b c
1 0 1 0
3 2 0 1
4 3 0 3
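One caveat when choosing between the two rowSums() variants above: summing the raw values only works if the values in a row cannot cancel each other out. The following is a minimal base R sketch of the difference; the fifth row (-1, 1, 0) is a hypothetical addition to the question's data, made up purely for illustration.

```r
# Question's data plus one hypothetical row whose values cancel out
dat <- data.frame(a = c(0, 0, 2, 3, -1),
                  b = c(1, 0, 0, 0,  1),
                  c = c(0, 0, 1, 3,  0))

# Count the non-zero cells per row: keeps every row with at least one non-zero
keep_by_count <- rowSums(dat != 0) > 0

# Sum the raw values per row: also drops rows whose values sum to zero
keep_by_sum <- rowSums(dat) != 0

dat[keep_by_count, ]  # rows 1, 3, 4, 5
dat[keep_by_sum, ]    # rows 1, 3, 4 (row 5 is lost although it is not all zero)
```

If the data can never contain negative values, both conditions select the same rows.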

Related

ifelse replace value if it is lower than previous

I am working with a dataset which has some errors in the data: numbers are sometimes registered wrongly. Here is a toy data example:
The issue is that the Reversal column should only be counting up (per unique ID). So in a vector of 0,0,0,1,1,1,0,1,2,2,0,0,2,3, the 0's following the 1 and 2 should not be 0's. Instead, they should be equal to whatever value came before. I tried to remedy this by using the lag function from the dplyr package:
Data$Reversal <- ifelse(Data$Reversal < lag(Data$Reversal), lag(Data$Reversal), Data$Reversal)
But this results in several issues:
1. The first value becomes NA. I've tried using the default = Data$Reversal argument of the lag function, but to no avail.
2. The Reversal value should reset to 0 for each unique ID. Now it continues across IDs. I tried some messy code using group_by(ID) but could not get it to work, as it broke my earlier ifelse call.
3. This only works when there is a single error. If there are two errors in a row, it only fixes one value.
Alternatively, I found this thread, in which the answer provided by Andrie also seems promising. It fixes problems 1 and 3, but I can't get the code to work per ID (using the group_by function).
Andrie's answer:
local({
r <- rle(data)
x <- r$values
x0 <- which(x==0) # index positions of zeroes
xt <- x[x0-1]==x[x0+1] # zeroes surrounded by same value
r$values[x0[xt]] <- x[x0[xt]-1] # substitute with surrounding value
inverse.rle(r)
})
Any help would be much appreciated.
I think cummax does exactly what you need.
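To see why, here is a small sketch of what cummax does to the vector from the question, followed by a per-group version. The `vals` and `id` vectors in the second part are made up for illustration and are not the asker's data.

```r
# Vector from the question: the zeros after a 1 or 2 are errors
x <- c(0, 0, 0, 1, 1, 1, 0, 1, 2, 2, 0, 0, 2, 3)

# cummax() carries the running maximum forward, so a value never decreases
cummax(x)
# 0 0 0 1 1 1 1 1 2 2 2 2 2 3

# Hypothetical grouped data: the running maximum must restart for each ID
vals <- c(0, 1, 0, 2, 0, 0, 0, 1, 0)
id   <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)

# ave() applies cummax() within each ID and returns the values in the original order
fixed <- ave(vals, id, FUN = cummax)
fixed
# 0 1 1 2 2 0 0 1 1
```

Because the running maximum handles any number of consecutive wrong values and never produces NA for the first element, it covers all three issues listed in the question.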
Base R
dat$Reversal <- ave(dat$Reversal, dat$ID, FUN = cummax)
dat
# ID Owner Reversal Success
# 1 1 A 0 0
# 2 1 A 0 0
# 3 1 A 0 0
# 4 1 B 1 1
# 5 1 B 1 0
# 6 1 B 1 0
# 7 1 error 1 0
# 8 1 error 1 0
# 9 1 B 1 0
# 10 1 B 1 0
# 11 1 C 1 1
# 12 1 C 2 0
# 13 1 error 2 0
# 14 1 C 2 0
# 15 1 C 3 1
# 16 2 J 0 0
# 17 2 J 0 0
dplyr
dat %>%
group_by(ID) %>%
mutate(Reversal = cummax(Reversal)) %>%
ungroup()
data.table
as.data.table(dat)[, Reversal := cummax(Reversal), by = .(ID)][]
Data, courtesy of https://extracttable.com/
dat <- read.table(header = TRUE, text = "
ID Owner Reversal Success
1 A 0 0
1 A 0 0
1 A 0 0
1 B 1 1
1 B 1 0
1 B 1 0
1 error 0 0
1 error 0 0
1 B 1 0
1 B 1 0
1 C 1 1
1 C 2 0
1 error 0 0
1 C 2 0
1 C 3 1
2 J 0 0
2 J 0 0")

Filter multiple columns based on same criteria in R

I have a data frame with many columns (more than 30) whose names are saved in a list. I would like to apply the same criterion to all of those columns without writing a separate condition for each one. The example below illustrates the problem:
A<-c("A","B","C","D","E","F","G","H","I")
B<-c(0,0,0,1,2,3,0,0,0)
C<-c(0,1,0,0,1,2,0,0,0)
D<-c(0,0,0,0,1,1,0,1,0)
E<-c(0,0,0,0,0,0,0,1,0)
data<-data.frame(A,B,C,D,E)
Let's say I have the above df as an example and I have saved the list of cols as below
list <- c("B","C","D","E")
I would like to use those cols with the same criteria as below
setDT(data)[B>=1 | C>=1 | D>=1 | E>=1]
And get the following result
A B C D E
1: B 0 1 0 0
2: D 1 0 0 0
3: E 2 1 1 0
4: F 3 2 1 0
5: H 0 0 1 1
However, is there a way to get the above answer without writing each individual column criteria (e.g. B>=1 | C>=1 ....) since I have more than 30 cols in the actual data. Thanks a lot
For your specific example of checking if at least one value in a row is at least 1, you could use rowSums
data[rowSums(data[,-1]) > 0, ]
# A B C D E
# 2 B 0 1 0 0
# 4 D 1 0 0 0
# 5 E 2 1 1 0
# 6 F 3 2 1 0
# 8 H 0 0 1 1
If you have other criteria in mind, you might as well consider using any within apply
ind <- apply(data[,-1], 1, function(x) {any(x >= 1)})
data[ind,]
# A B C D E
# 2 B 0 1 0 0
# 4 D 1 0 0 0
# 5 E 2 1 1 0
# 6 F 3 2 1 0
# 8 H 0 0 1 1
dplyr::filter_at will do just that (in dplyr 1.0.4 and later, filter(if_any(-A, ~ . >= 1)) is the recommended equivalent).
library(dplyr)
data %>% filter_at(vars(-A),any_vars(.>=1))
# A B C D E
# 1 B 0 1 0 0
# 2 D 1 0 0 0
# 3 E 2 1 1 0
# 4 F 3 2 1 0
# 5 H 0 0 1 1
You could always use Reduce; this is nice because you can put any type of logic you want into the function.
A simple method might be:
data[Reduce("|", as.data.frame(data[,list] >= 1)),]
# A B C D E
#2 B 0 1 0 0
#4 D 1 0 0 0
#5 E 2 1 1 0
#6 F 3 2 1 0
#8 H 0 0 1 1
A little explanation: Reduce successively applies the same function to each element of x. In this case the "|" operator is applied to each of the logical columns of the data.frame.
If you wanted to do more complicated logic checks you could do that with your own anonymous function.
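As a rough illustration of the anonymous-function variant, the sketch below rebuilds the question's data and folds over the selected columns. The names `cols` and `keep` are introduced here for the example (the column vector is called `cols` rather than `list` to avoid masking the base function).

```r
# Data from the question
data <- data.frame(
  A = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  B = c(0, 0, 0, 1, 2, 3, 0, 0, 0),
  C = c(0, 1, 0, 0, 1, 2, 0, 0, 0),
  D = c(0, 0, 0, 0, 1, 1, 0, 1, 0),
  E = c(0, 0, 0, 0, 0, 0, 0, 1, 0)
)
cols <- c("B", "C", "D", "E")

# Start from FALSE and OR in the test for one column at a time.
# The test inside the function (col >= 1) can be replaced by any other vectorised condition.
keep <- Reduce(function(acc, col) acc | (col >= 1), data[cols], FALSE)

data[keep, ]
# rows 2, 4, 5, 6 and 8 (A = B, D, E, F, H)
```

Swapping `|` for `&` and starting from TRUE would instead keep only the rows where every selected column meets the condition.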
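Please check this using apply in R.
B<-c(0,0,0,1,2,3,0,0,0)
C<-c(0,1,0,0,1,2,0,0,0)
D<-c(0,0,0,0,1,1,0,1,0)
ef=data.frame(B,C,D)
# logical matrix: TRUE where a value meets the criterion (>= 1, as in the question)
con=apply(ef,2,function(x) x>=1 )
# keep the rows where at least one column meets it
ef[apply(con,1,any),]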

how to select subset only by [] in r?

a<-data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
I want to convert the elements of q1 and q2 that are != 1 to 0, and I want to use only []. I believe all subsetting can be done with [].
a[grep("q\\d",colnames(a),perl=TRUE)!=1,grep("q\\d",colnames(a),perl=TRUE)]<-0
but it doesn't work, what's the problem?
We create a numeric index of the column names that start with 'q' followed by numbers ('nm1'), use it to subset the columns in 'a', and assign 0 to the values in that subset that are not equal to 1.
nm1 <- grep("q\\d+", names(a))
a[nm1][a[nm1] != 1] <- 0
Make sure the columns are of character class by using stringsAsFactors = FALSE in the data.frame call.
The above replacement is based on a logical matrix (a[nm1]!=1) which may create memory problems if the dataset is really big. In that case, it is better to loop through the columns and replace with 0
a[nm1] <- lapply(a[nm1], function(x) replace(x, x!=1, 0))
data
a <- data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
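As for why the attempt in the question does not work, the following sketch may help. It assumes the data frame defined just above (character columns); the names `idx`, `row_sel`, `rows_hit` and `mask` are introduced here for illustration.

```r
a <- data.frame(q1 = rep(c(1, "A", "B"), 4), q2 = c(1, "A", "B", "C"),
                w1 = c(1, "A", "B", "C"), stringsAsFactors = FALSE)

# grep() returns the positions of the matching columns, not their contents
idx <- grep("q\\d", colnames(a), perl = TRUE)   # 1 2

# So this compares the positions 1 and 2 with 1, not the cell values
row_sel <- idx != 1                             # FALSE TRUE

# Used as a row index, c(FALSE, TRUE) is recycled and picks every second row
rows_hit <- which(rep_len(row_sel, nrow(a)))    # 2 4 6 8 10 12

# Comparing the cell values gives a logical matrix, which [ accepts directly
mask <- a[idx] != 1
a[idx][mask] <- 0
```

In other words, the original line set every even-numbered row of q1 and q2 to 0, regardless of what those cells contained.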
Just in case, if you know column names, you can use them for indexing.
a<-data.frame(q1=rep(c(1,'A','B'),4), q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
col_n <- c("q1", "q2")
a[, col_n][a[, col_n]!=1]<-0
> a
q1 q2 w1
1 1 1 1
2 0 0 A
3 0 0 B
4 1 0 C
5 0 1 1
6 0 0 A
7 1 0 B
8 0 0 C
9 0 1 1
10 1 0 A
11 0 0 B
12 0 0 C
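data.table approach:
library(data.table)
a<-data.table(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
# names of the columns that start with "q"
qcols <- grep("^q", colnames(a), value = TRUE)
# overwrite those columns in place: 1 stays 1, everything else becomes 0
a[, (qcols) := lapply(.SD, function(x) ifelse(x == 1, 1, 0)), .SDcols = qcols]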
> a
q1 q2 w1
1: 1 1 1
2: 0 0 A
3: 0 0 B
4: 1 0 C
5: 0 1 1
6: 0 0 A
7: 1 0 B
8: 0 0 C
9: 0 1 1
10: 1 0 A
11: 0 0 B
12: 0 0 C

How to write new column conditional on grouped rows in R?

I have a data frame where each Item has three categories (a, b, c) and a numeric Answer (either 0 or 1) is recorded for each category. I would like to create a new column contingent on the rows in the Answer column. This is what my data frame looks like:
Item <- rep(c(1:3), each=3)
Option <- rep(c('a','b','c'), times=3)
Answer <- c(1,1,0,1,0,1,1,1,1)
df <- data.frame(Item, Option, Answer)
Item Option Answer
1 1 a 1
2 1 b 1
3 1 c 0
4 2 a 1
5 2 b 0
6 2 c 1
7 3 a 1
8 3 b 1
9 3 c 1
What is needed: whenever the three categories in the Option column are 1, the New column should receive a 1. In any other case, the column should have a 0. The desired output should look like this:
Item Option Answer New
1 1 a 1 0
2 1 b 1 0
3 1 c 0 0
4 2 a 1 0
5 2 b 0 0
6 2 c 1 0
7 3 a 1 1
8 3 b 1 1
9 3 c 1 1
I tried to achieve this without using a loop, but I got stuck because I don't know how to make a new column contingent on a group of rows, not just a single one. I have tried this solution but it doesn't work if the rows are not grouped in pairs.
Do you have any suggestions? Thanks a bunch!
This should work:
library(dplyr)
df %>%
group_by(Item) %>%
mutate(New = as.numeric(all(as.logical(Answer))))
Using data.table:
library(data.table)
DT <- data.table(Item, Option, Answer)
DT[, Index := as.numeric(all(as.logical(Answer))), by = Item]
DT
Item Option Answer Index
1: 1 a 1 0
2: 1 b 1 0
3: 1 c 0 0
4: 2 a 1 0
5: 2 b 0 0
6: 2 c 1 0
7: 3 a 1 1
8: 3 b 1 1
9: 3 c 1 1
Or using only base R
df$Index <- with(df, +(ave(!!Answer, Item, FUN = all)))
df$Index
#[1] 0 0 0 0 0 0 1 1 1

How to sum and combine two data frames?

I have two data frames:
DATA1:
ID com_alc_cd com_liv_cd com_hyee_cd
A 1 0 0
B 0 0 1
D 0 0 0
C 0 1 0
DATA2:
ID com_alc_dd com_liv_dd com_hyee_dd
B 0 2 0
A 1 0 2
C 0 1 0
D 0 1 0
I want to combine the two data frames, so as to obtain the sum of the two:
SUM(DATA1, DATA2):
ID com_alc com_liv com_hyee
A 2 0 2
B 0 2 1
C 0 2 0
D 0 1 0
Try this, for example (assuming that your data frames have the same dimensions and the same set of IDs):
d1 <- DATA1[order(DATA1$ID),]
d2 <- DATA2[order(DATA2$ID),]
data.frame(ID=d1$ID,as.matrix(subset(d1,select=-ID)) +
as.matrix(subset(d2,select=-ID)))
ID com_alc_cd com_liv_cd com_hyee_cd
1 A 2 0 2
2 B 0 2 1
4 C 0 2 0
3 D 0 1 0
EDIT general solution
library(reshape2)
## put the data in the long format
res <- do.call(rbind,lapply(list(DATA1,DATA2),melt,id.vars='ID'))
## polish names
res$variable <- gsub('(.*_.*)_.*','\\1',res$variable)
## wide format and aggregate using sum
dcast(ID~variable,data=res,fun.aggregate=sum)
ID com_alc com_hyee com_liv
1 A 2 2 0
2 B 0 1 2
3 C 0 0 2
4 D 0 0 1
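You can also use aggregate:
# df1 and df2 stand for DATA1 and DATA2; give both the same column names so they can be stacked
names(df1) <- names(df2)
df3 <- rbind(df1, df2)
# sum every non-ID column within each ID
res <- aggregate(df3[,-1], by=list(ID=df3$ID), sum)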
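For a version that can be run as is, here is a self-contained base R sketch of the stack-and-aggregate idea on the question's data. The helper `strip_suffix` is hypothetical (written for this example), and it assumes the only difference between the two sets of column names is the trailing `_cd` or `_dd`.

```r
# Data from the question
DATA1 <- data.frame(ID = c("A", "B", "D", "C"),
                    com_alc_cd  = c(1, 0, 0, 0),
                    com_liv_cd  = c(0, 0, 0, 1),
                    com_hyee_cd = c(0, 1, 0, 0))
DATA2 <- data.frame(ID = c("B", "A", "C", "D"),
                    com_alc_dd  = c(0, 1, 0, 0),
                    com_liv_dd  = c(2, 0, 1, 1),
                    com_hyee_dd = c(0, 2, 0, 0))

# Hypothetical helper: drop the trailing _cd / _dd so both frames share column names
strip_suffix <- function(df) {
  names(df) <- sub("_(cd|dd)$", "", names(df))
  df
}

# Stack the two frames, then sum each column within each ID
stacked <- rbind(strip_suffix(DATA1), strip_suffix(DATA2))
total <- aggregate(. ~ ID, data = stacked, FUN = sum)
total
```

Because the rows are matched by ID during aggregation, the two data frames do not need to be sorted the same way beforehand.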
