For example,
library(data.table)
set.seed(1984)
d <- data.table(name=letters[1:26],a=rbinom(26,1,0.5),b=rbinom(26,1,0.5),c=rbinom(26,1,0.5))
I can remove the rows in which columns a, b, and c are all 0 with:
d[,if(sum(a,b,c) != 0) .SD,by=.(a,b,c)]
the result is:
a b c name
1: 1 1 1 a
2: 1 1 1 u
3: 1 1 1 x
4: 0 1 0 b
5: 0 1 0 d
6: 0 1 0 h
7: 0 1 1 c
8: 0 1 1 g
9: 0 1 1 o
10: 0 1 1 q
11: 0 1 1 t
12: 1 1 0 e
13: 1 1 0 k
14: 1 1 0 y
15: 1 0 0 f
16: 1 0 0 i
17: 1 0 0 r
18: 1 0 0 s
19: 1 0 0 w
20: 0 0 1 j
21: 0 0 1 v
22: 1 0 1 m
23: 1 0 1 n
a b c name
Now, I have two questions:
How can I keep the "name" column as the first column?
How can I select columns a, b, c with a simple expression (something like a:c, although here a:c does not mean columns a, b, c)? With hundreds of columns, I can't type a, b, c, ... one by one in sum() or as the arguments of by.
Additional question:
If the function is not sum (which has a row-wise version, rowSums) but something else such as max, how can I resolve questions 1 and 2 without the apply family? (The apply family is designed for data frames, and I am afraid it will slow data.table down.)
We could use Reduce with + to create a logical vector based on the columns specified in the .SDcols
d[d[, Reduce(`+`, .SD) != 0, .SDcols = a:c]]
Other options include @nicola's:
d[Reduce("+",d[,a:c])!=0]
Or, as suggested by @Frank, use pmax to create a column ('keep') holding the maximum value of each row, convert it from binary to logical, and subset the rows and columns based on it:
d[, keep := as.logical(do.call(pmax, .SD)), .SDcols=!"name"][(keep), !"keep"]
You could also use rowSums function:
d[rowSums(d[,2:4])!=0,]
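For the follow-up about functions other than sum, here is a minimal base R sketch. It makes some assumptions: a plain data.frame stands in for the data.table, and the four-row data set and the name `cols` are made up for illustration. It builds the row filter once with do.call(pmax, ...) and once with Reduce, never calls apply, and leaves the column order alone so name stays first.

```r
# Hypothetical small data set laid out like the one in the question.
d <- data.frame(name = c("a", "b", "c", "d"),
                a = c(1, 0, 0, 1),
                b = c(0, 0, 1, 1),
                c = c(0, 0, 0, 1),
                stringsAsFactors = FALSE)

# Columns to test, given as names so hundreds of them are no problem.
cols <- c("a", "b", "c")

# Row-wise maximum without apply(): pmax() takes the columns as separate arguments.
row_max <- do.call(pmax, unname(as.list(d[cols])))

# Row-wise sum without rowSums(): fold the columns together with `+`.
row_sum <- Reduce(`+`, d[cols])

# For 0/1 data both give the same filter: keep rows that are not all zero.
keep_max <- row_max != 0
keep_sum <- row_sum != 0

kept <- d[keep_max, ]
print(kept)
```

The same logical vector should also work in the i argument of a data.table, since it is just a row filter.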
I am working with a dataset which has some errors in the data: numbers are sometimes registered incorrectly. Here is a toy data example:
The issue is that the Reversal column should only be counting up (per unique ID). So in a vector of 0,0,0,1,1,1,0,1,2,2,0,0,2,3, the 0's following the 1 and 2 should not be 0's. Instead, they should be equal to whatever value came before. I tried to remedy this by using the lag function from the dplyr package:
Data$Reversal <- ifelse(Data$Reversal < lag(Data$Reversal), lag(Data$Reversal), Data$Reversal)
But this results in numerous issues:
The first value becomes NA. I've tried using the default=Data$Reversal call in the lag function but to no avail.
The Reversal value should reset to 0 for each unique ID. Currently it carries over across IDs. I tried some messy code using group_by(ID) but could not get it to work, as it broke my earlier ifelse call.
This only works when there is a single error. If there are two errors in a row, it only fixes one value.
Alternatively, I found this thread in which the answer provided by Andrie also seems promising. It fixes problems 1 and 3, but I can't get the code to work per ID (using the group_by function).
Andrie's answer:
local({
r <- rle(data)
x <- r$values
x0 <- which(x==0) # index positions of zeroes
xt <- x[x0-1]==x[x0+1] # zeroes surrounded by same value
r$values[x0[xt]] <- x[x0[xt]-1] # substitute with surrounding value
inverse.rle(r)
})
Any help would be much appreciated.
I think cummax does exactly what you need.
Base R
dat$Reversal <- ave(dat$Reversal, dat$ID, FUN = cummax)
dat
# ID Owner Reversal Success
# 1 1 A 0 0
# 2 1 A 0 0
# 3 1 A 0 0
# 4 1 B 1 1
# 5 1 B 1 0
# 6 1 B 1 0
# 7 1 error 1 0
# 8 1 error 1 0
# 9 1 B 1 0
# 10 1 B 1 0
# 11 1 C 1 1
# 12 1 C 2 0
# 13 1 error 2 0
# 14 1 C 2 0
# 15 1 C 3 1
# 16 2 J 0 0
# 17 2 J 0 0
dplyr
dat %>%
group_by(ID) %>%
mutate(Reversal = cummax(Reversal)) %>%
ungroup()
data.table
as.data.table(dat)[, Reversal := cummax(Reversal), by = .(ID)][]
Data, courtesy of https://extracttable.com/
dat <- read.table(header = TRUE, text = "
ID Owner Reversal Success
1 A 0 0
1 A 0 0
1 A 0 0
1 B 1 1
1 B 1 0
1 B 1 0
1 error 0 0
1 error 0 0
1 B 1 0
1 B 1 0
1 C 1 1
1 C 2 0
1 error 0 0
1 C 2 0
1 C 3 1
2 J 0 0
2 J 0 0")
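To see why cummax also covers several errors in a row and the reset per ID, here is a small sketch on the vector quoted in the question. The grouped part uses made-up `id` and `val` vectors, not the real data.

```r
# The vector from the question: zeros after a 1 or 2 are registration errors.
x <- c(0, 0, 0, 1, 1, 1, 0, 1, 2, 2, 0, 0, 2, 3)

# cummax() carries the running maximum forward, so every dip is filled,
# including two errors in a row, and the first value is never NA.
fixed <- cummax(x)
print(fixed)
# 0 0 0 1 1 1 1 1 2 2 2 2 2 3

# Hypothetical grouped data: the running maximum must restart for each id.
id  <- c(1, 1, 1, 1, 2, 2, 2)
val <- c(0, 1, 0, 2, 0, 0, 1)
fixed_by_id <- ave(val, id, FUN = cummax)
print(fixed_by_id)
# 0 1 1 2 0 0 1
```

Note that this relies on the counter never legitimately decreasing within an ID, which is what the question states.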
I have a data frame with multiple columns (more than 30) whose names are saved in a vector. I would like to apply the same criterion to all of those columns without writing the condition out for each one. The example below should make the problem clearer:
A<-c("A","B","C","D","E","F","G","H","I")
B<-c(0,0,0,1,2,3,0,0,0)
C<-c(0,1,0,0,1,2,0,0,0)
D<-c(0,0,0,0,1,1,0,1,0)
E<-c(0,0,0,0,0,0,0,1,0)
data<-data.frame(A,B,C,D,E)
Let's say I have the above df as an example and have saved the column names as below:
list <- c("B","C","D","E")
I would like to use those cols with the same criteria as below
setDT(data)[B>=1 | C>=1 | D>=1 | E>=1]
And get the following result
A B C D E
1: B 0 1 0 0
2: D 1 0 0 0
3: E 2 1 1 0
4: F 3 2 1 0
5: H 0 0 1 1
However, is there a way to get the above result without writing out each individual column criterion (e.g. B>=1 | C>=1 ...)? I have more than 30 columns in the actual data. Thanks a lot.
For your specific example of checking if at least one value in a row is at least 1, you could use rowSums
data[rowSums(data[,-1]) > 0, ]
# A B C D E
# 2 B 0 1 0 0
# 4 D 1 0 0 0
# 5 E 2 1 1 0
# 6 F 3 2 1 0
# 8 H 0 0 1 1
If you have other criteria in mind, you might as well consider using any within apply
ind <- apply(data[,-1], 1, function(x) {any(x >= 1)})
data[ind,]
# A B C D E
# 2 B 0 1 0 0
# 4 D 1 0 0 0
# 5 E 2 1 1 0
# 6 F 3 2 1 0
# 8 H 0 0 1 1
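One caveat, as a hedged sketch: the rowSums shortcut matches "at least one value >= 1" only when the columns hold non-negative whole numbers, as in the example. The data frame `m` below is made up to show where the two filters diverge.

```r
# Hypothetical data with a negative and a fractional value.
m <- data.frame(B = c(0, 2, -1, 0.5),
                C = c(0, -2, 3, 0.5))

# Shortcut: the row sum is positive.
by_sum <- unname(rowSums(m) > 0)

# Literal criterion: at least one value in the row is >= 1.
by_any <- unname(apply(m, 1, function(x) any(x >= 1)))

print(data.frame(m, by_sum, by_any))
# Row 2 (2 and -2) sums to 0 but does contain a value >= 1.
# Row 4 (0.5 and 0.5) sums to 1 but contains no value >= 1.
```

So for general numeric data the any() version is the safer reading of the criterion.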
dplyr::filter_at will do just that.
library(dplyr)
data %>% filter_at(vars(-A),any_vars(.>=1))
# A B C D E
# 1 B 0 1 0 0
# 2 D 1 0 0 0
# 3 E 2 1 1 0
# 4 F 3 2 1 0
# 5 H 0 0 1 1
You could always use Reduce. This is nice because you can put any type of logic you want into the function.
A simple method might be:
data[Reduce("|", as.data.frame(data[,list] >= 1)),]
# A B C D E
#2 B 0 1 0 0
#4 D 1 0 0 0
#5 E 2 1 1 0
#6 F 3 2 1 0
#8 H 0 0 1 1
A little explanation: Reduce successively applies the same function to each element of x. In this case the "|" operator is applied to each of the logical columns of the data.frame.
If you wanted to do more complicated logic checks you could do that with your own anonymous function.
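As a rough illustration of that last point, the sketch below first repeats the "|" reduction and then uses an anonymous function that counts how many columns meet the criterion. The four-row data frame, the name `cols`, and the "at least two columns" rule are assumptions made up for this example.

```r
# Hypothetical data in the same shape as the question's.
data <- data.frame(A = c("A", "B", "C", "D"),
                   B = c(0, 0, 2, 3),
                   C = c(0, 1, 0, 2),
                   D = c(0, 0, 0, 1),
                   stringsAsFactors = FALSE)

# Column names to test (called `cols` here to avoid masking base::list).
cols <- c("B", "C", "D")

# One logical vector per column: does the value meet the criterion?
flags <- lapply(data[cols], function(x) x >= 1)

# Same idea as the "|" version: keep rows where any column qualifies.
keep_any <- Reduce(`|`, flags)

# Custom logic through an anonymous function: count the qualifying
# columns per row, then keep rows where at least two of them qualify.
n_hits <- Reduce(function(acc, f) acc + f, flags)
keep_two <- n_hits >= 2

print(data[keep_any, ])
print(data[keep_two, ])
```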
Please check this approach using apply in R.
B<-c(0,0,0,1,2,3,0,0,0)
C<-c(0,1,0,0,1,2,0,0,0)
D<-c(0,0,0,0,1,1,0,1,0)
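ef <- data.frame(B, C, D)
con <- apply(ef, 2, function(x) x >= 1)  # logical matrix, one column per variable
ef[apply(con, 1, any), ]                 # rows where at least one column is >= 1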
a<-data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
I want to convert the elements of q1 and q2 that are not equal to 1 to 0, and I want to use only []. I believe all subsetting can be done with [].
a[grep("q\\d",colnames(a),perl=TRUE)!=1,grep("q\\d",colnames(a),perl=TRUE)]<-0
But it doesn't work. What's the problem?
We create a numeric index ('nm1') of the column names that start with 'q' followed by digits, use it to subset the columns of 'a', and assign 0 to the values in that subset that are not equal to 1.
nm1 <- grep("q\\d+", names(a))
a[nm1][a[nm1] != 1] <- 0
Make sure the columns are of class character by using stringsAsFactors = FALSE in the data.frame call.
The above replacement is based on a logical matrix (a[nm1] != 1), which may cause memory problems if the dataset is really big. In that case, it is better to loop through the columns and replace with 0:
a[nm1] <- lapply(a[nm1], function(x) replace(x, x!=1, 0))
data
a <- data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
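As a quick check that the logical-matrix replacement and the column-wise lapply replacement agree, here is a minimal sketch on the same data. The names `by_matrix` and `by_column` are introduced only for this comparison.

```r
a <- data.frame(q1 = rep(c(1, 'A', 'B'), 4), q2 = c(1, 'A', 'B', 'C'),
                w1 = c(1, 'A', 'B', 'C'), stringsAsFactors = FALSE)

# Positions of the columns named 'q' followed by digits.
nm1 <- grep("q\\d+", names(a))

# Variant 1: one logical matrix covering all selected columns at once.
by_matrix <- a
by_matrix[nm1][by_matrix[nm1] != 1] <- 0

# Variant 2: column by column, which avoids building the full matrix.
by_column <- a
by_column[nm1] <- lapply(by_column[nm1], function(x) replace(x, x != 1, 0))

# The columns are character, so the replacement value ends up as "0".
print(by_matrix$q1)
print(isTRUE(all.equal(by_matrix, by_column)))
```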
Just in case, if you know column names, you can use them for indexing.
a<-data.frame(q1=rep(c(1,'A','B'),4), q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
col_n <- c("q1", "q2")
a[, col_n][a[, col_n]!=1]<-0
> a
q1 q2 w1
1 1 1 1
2 0 0 A
3 0 0 B
4 1 0 C
5 0 1 1
6 0 0 A
7 1 0 B
8 0 0 C
9 0 1 1
10 1 0 A
11 0 0 B
12 0 0 C
data.table approach:
a<-data.table(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
qcols <- grep("^q", colnames(a), value = TRUE)  # names of the columns starting with 'q'
a[, (qcols) := lapply(.SD, function(x) ifelse(x == 1, 1, 0)), .SDcols = qcols]
> a
q1 q2 w1
1: 1 1 1
2: 0 0 A
3: 0 0 B
4: 1 0 C
5: 0 1 1
6: 0 0 A
7: 1 0 B
8: 0 0 C
9: 0 1 1
10: 1 0 A
11: 0 0 B
12: 0 0 C
I have a data frame where each Item has three categories (a, b, c), and a numeric Answer (either 0 or 1) is recorded for each category. I would like to create a new column contingent on the rows in the Answer column. This is what my data frame looks like:
Item <- rep(c(1:3), each=3)
Option <- rep(c('a','b','c'), times=3)
Answer <- c(1,1,0,1,0,1,1,1,1)
df <- data.frame(Item, Option, Answer)
Item Option Answer
1 1 a 1
2 1 b 1
3 1 c 0
4 2 a 1
5 2 b 0
6 2 c 1
7 3 a 1
8 3 b 1
9 3 c 1
What is needed: whenever all three categories of an Item have an Answer of 1, the New column should receive a 1 for that Item's rows. In any other case, the column should have a 0. The desired output should look like this:
Item Option Answer New
1 1 a 1 0
2 1 b 1 0
3 1 c 0 0
4 2 a 1 0
5 2 b 0 0
6 2 c 1 0
7 3 a 1 1
8 3 b 1 1
9 3 c 1 1
I tried to achieve this without using a loop, but I got stuck because I don't know how to make a new column contingent on a group of rows, not just a single one. I have tried this solution but it doesn't work if the rows are not grouped in pairs.
Do you have any suggestions? Thanks a bunch!
This should work (using dplyr):
library(dplyr)
df %>%
group_by(Item)%>%
mutate(New = as.numeric(all(as.logical(Answer))))
Using data.table:
DT <- data.table(Item, Option, Answer)
DT[, Index := as.numeric(all(as.logical(Answer))), by= Item]
DT
Item Option Answer Index
1: 1 a 1 0
2: 1 b 1 0
3: 1 c 0 0
4: 2 a 1 0
5: 2 b 0 0
6: 2 c 1 0
7: 3 a 1 1
8: 3 b 1 1
9: 3 c 1 1
Or using only base R
df$Index <- with(df, +(ave(!!Answer, Item, FUN = all)))
df$Index
#[1] 0 0 0 0 0 0 1 1 1
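Since the question mentions rows that are not neatly grouped, here is a small base R sketch suggesting that the grouped approach does not depend on row order. The shuffled row order and the name `shuffled` are made up for illustration; the data come from the question's code.

```r
Item <- rep(c(1:3), each = 3)
Option <- rep(c('a', 'b', 'c'), times = 3)
Answer <- c(1, 1, 0, 1, 0, 1, 1, 1, 1)
df <- data.frame(Item, Option, Answer)

# 1 for every row of an Item whose answers are all 1, otherwise 0.
all_ones <- function(x) as.numeric(all(x == 1))
df$New <- ave(df$Answer, df$Item, FUN = all_ones)

# Same rows in an arbitrary (hypothetical) order: ave() groups by value,
# not by position, so the rows of an Item need not be adjacent.
shuffled <- df[c(9, 1, 5, 3, 7, 2, 8, 4, 6), c("Item", "Option", "Answer")]
shuffled$New <- ave(shuffled$Answer, shuffled$Item, FUN = all_ones)

print(df)
print(shuffled)
```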
I would like to create a new column inside my data table, this column being a vector of values; but I am getting the following error:
DT = data.table(x=rep(c("a","b"),c(2,3)),y=1:5)
>
> DT
x y
1: a 1
2: a 2
3: b 3
4: b 4
5: b 5
> DT[, my_vec := rep(0,y)]
Error in rep(0, y) : invalid 'times' argument
My expected result is:
> DT
x y my_vec
1: a 1 0
2: a 2 0 0
3: b 3 0 0 0
4: b 4 0 0 0 0
5: b 5 0 0 0 0 0
Is there a way to do that?
The syntax is a little cumbersome, but you can do this:
DT[, my_vec := list(list(rep(0, y))), by = y]
DT
# x y my_vec
#1: a 1 0
#2: a 2 0,0
#3: b 3 0,0,0
#4: b 4 0,0,0,0
#5: b 5 0,0,0,0,0
It is not clear whether you need my_vec to be a list column or a single character string per row. If it is the latter, we group by row number, replicate 0 'y' times, and paste the elements together within each group.
DT[, my_vec := paste(rep(0, y), collapse=' '), by = seq_len(nrow(DT))]
DT
# x y my_vec
#1: a 1 0
#2: a 2 0 0
#3: b 3 0 0 0
#4: b 4 0 0 0 0
#5: b 5 0 0 0 0 0
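For reference, here is a base R sketch of the two shapes discussed above, with a plain data.frame standing in for the data.table (an assumption made so the example only needs base R). The original call fails because rep(0, y) receives the whole y column as times, and a times vector must have length 1 or the same length as x.

```r
# Plain data.frame stand-in for the question's data.table.
df <- data.frame(x = rep(c("a", "b"), c(2, 3)), y = 1:5,
                 stringsAsFactors = FALSE)

# List column: one numeric vector of length y per row.
df$my_vec <- lapply(df$y, function(n) rep(0, n))

# String column: the zeros of each row pasted into one string.
df$my_str <- vapply(df$y,
                    function(n) paste(rep(0, n), collapse = " "),
                    character(1))

print(lengths(df$my_vec))
print(df$my_str)
```

The list column keeps the zeros as numbers that can be computed on later, while the string column only looks like the expected printout.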