Faster alternative to looping in combination with if in R

I have a data frame with 2,000,000 + rows and 22 columns.
In three of the columns the entries are either 0, 1 or NA.
I want to have a column which has the sum of these three columns for every row, treating NA as 0.
Using a for loop is definitely way too slow.
Have you got any alternatives for me? Another idea was using mutate in a pipe, but I have problems selecting the columns that I want to add up by name.
First attempt:
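for (i in 1:nrow(T12)) {
  if (is.na(T12$blue[i]) & is.na(T12$blue.y[i])) {
    T12$blue[i] <- T12$blue.x[i]
  } else if (is.na(T12$blue[i]) & is.na(T12$blue.x[i])) {
    T12$blue[i] <- T12$blue.y[i]
  } else if (is.na(T12$blue[i]) & is.na(T12$blue.x[i]) & is.na(T12$blue.y[i])) {
    # Note: this branch is never reached, because the first condition already
    # covers it, and assigning NULL is not a valid way to delete a row.
    T12[i, ] <- NULL
  }
}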
Thank you!

I am going to assume that the columns you wish to add are the first three. If you need different columns, just change c(1,2,3) in the code below.
apply(T12[,c(1,2,3)], 1, sum, na.rm=TRUE)
Note: #27ϕ9 comments that a faster solution is
rowSums(T12[,c(1,2,3)], na.rm=TRUE)

You can first replace all the NAs with 0 and then add the columns (setDT is from the data.table package).
library(data.table)
df[is.na(df)] <- 0
setDT(df)[, newcol := a + b + c]

If your object column names are a, b and c, maybe you can try the code below
within(T12, new <- rowSums(cbind(a,b,c),na.rm = TRUE))
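If the three columns are easier to identify by name than by position, rowSums also accepts a data frame subset selected with a character vector of column names. A minimal sketch, assuming the columns are the blue, blue.x and blue.y seen in the question; the toy T12 and the new column name blue_sum are made up for illustration:

```r
# Toy stand-in for T12; the real data has 2,000,000+ rows and 22 columns.
T12 <- data.frame(
  id     = 1:4,
  blue   = c(1, NA, 0, NA),
  blue.x = c(NA, 1, 1, NA),
  blue.y = c(0, NA, 1, NA)
)

# Select the columns to add up by name instead of by position.
cols <- c("blue", "blue.x", "blue.y")

# na.rm = TRUE treats NA as 0; a row that is all NA sums to 0.
T12$blue_sum <- rowSums(T12[, cols], na.rm = TRUE)
```

Because rowSums works on the whole subset at once, no explicit loop over rows is needed.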

Related

R- Remove rows based on condition across some columns

I have a data frame like this:
I want to remove the rows that have a value of 0 in the columns that are "numeric". I tried some functions, but they either returned an error or didn't remove anything because the entire row is not equal to 0. In summary, I need to remove the rows that are equal to 0 in the columns that have a numeric class (i.e. from sales month to expected sales). How could I do this? (The result I expect is attached below.)
PS: If I could do it with a function that lets me give the column number instead of the name, that would be great!
Here is a simple solution with lapply.
set.seed(5)
df <- data.frame(a=1:10,b=letters[1:10],x=sample(0:5,5,replace = T),y=sample(c(0,10,20,30,40,50),5,replace = T))
df <-df[!unlist(lapply(1:nrow(df), function(i) {
any(df[i, ] == 0)
})), ]
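Note that the lapply call above drops a row as soon as any column equals 0. If the goal is instead to drop rows where every one of the chosen numeric columns is 0, and to pick those columns by position as the question asks, a vectorized sketch could look like this. The toy data and the positions 3:4 are assumptions for illustration, and the columns are assumed to contain no NA:

```r
# Hypothetical data: columns 3 and 4 play the role of the numeric columns.
df <- data.frame(
  id   = 1:5,
  name = letters[1:5],
  x    = c(0, 2, 0, 4, 0),
  y    = c(0, 0, 30, 0, 0)
)

# Column positions to check (assumed; adjust to the real layout).
num_cols <- 3:4

# A row is "all zero" when none of the chosen columns is non-zero.
all_zero <- rowSums(df[, num_cols] != 0) == 0

# Keep only the rows that have at least one non-zero value.
df_kept <- df[!all_zero, ]
```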

R: delete columns from data.frame if condition fulfilled

I have got a data.frame with approx. 20,000 columns. From this data.frame I want to remove the columns for which the following vector has a value of 1.
u.snp <- apply(an[25:19505], 2, mean)
I am sure there must be a straightforward way to accomplish this, but I can't see it right now. Any hints would be greatly appreciated. Thanks.
Update: Thanks for your help. Now I tried the following:
cm <- colMeans(an.mdr[25:19505])
tail(sort(cm), n=40)
With the tail function I see that 22 columns out of 19481 columns of an.mdr have mean=1. Next I remove these columns using the code as suggested.
an.mdr.s <- an.mdr
an.mdr.s[colMeans(an.mdr.s[25:19505])==1] <- NULL
As anticipated, an.mdr.s has 22 fewer columns than an.mdr. But when I calculate the column means for all but the first 24 columns, I again have 22 columns with column mean=1 in an.mdr.s.
cmm <- colMeans(an.mdr.s[25:19483])
tail(sort(cmm), n=40)
Honestly, I cannot see what is going on here right now.
That should be quite easily accomplished with the following command:
df[colMeans(df)==1] <- NULL
You can do it in two simple steps (df is your data frame):
# step 1 - calculate mean for all columns and filter with mean = 1
remove_columns <- sapply(df, mean)
remove_columns <- names(remove_columns[remove_columns == 1])
# alternate using filter (just for knowledge)
## remove_columns <- names(Filter(function(x) x == 1,sapply(df, mean)))
# step 2 - remove them
df_new <- df[,setdiff(names(df), remove_columns)]
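A possible explanation for the result described in the update: colMeans(an.mdr.s[25:19505]) == 1 yields one logical value per column of the subset, so when it is used to index the whole data frame, its first element lines up with column 1 rather than column 25, and the wrong columns are dropped. Selecting by name avoids the offset. A small sketch with made-up data (the data frame an, its columns and the start position first are hypothetical):

```r
# Hypothetical data: the first two columns are metadata, the rest are markers.
an <- data.frame(
  id  = 1:3,
  grp = c("a", "b", "c"),
  s1  = c(1, 1, 1),
  s2  = c(0, 1, 0),
  s3  = c(1, 1, 1)
)

# Position of the first marker column (assumed; 25 in the question).
first <- 3

# Column means of the marker columns only; the result keeps the column names.
cm <- colMeans(an[first:ncol(an)])

# Names of the columns whose mean is exactly 1.
drop <- names(cm)[cm == 1]

# Remove them by name, so the offset of the subset no longer matters.
an.s <- an[setdiff(names(an), drop)]
```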

Replace values in a dataframe, until it matches a given value

I have hundreds of dataframes with the following structure:
df <- data.frame(yr=seq(0,20,1), op=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,10))
I would like to automatically replace the initial part of the column op so that it continues the sequence from the end of the vector, but only up to the second 0 value.
It should result like this:
df.result <- data.frame(yr=seq(0,20,1), op=c(11,12,13,14,15,16,17,18,19,20,0,1,2,3,4,5,6,7,8,9,10))
It must be dynamic, because the sequences are all different.
Any tips for doing this quickly with a function, without the need for loops?
Thanks in advance!
Here's an approach:
transform(df, op = replace(op, cumsum(!op) < 2, seq(which(!op)[2] - 1) + tail(op, 1)))
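The one-liner above can be unpacked into separate steps, which may make it easier to adapt. This is a sketch of the same idea, assuming op always contains at least two zeros; the helper names zero_pos and n_head are introduced here for illustration:

```r
# Same data as in the question.
df <- data.frame(yr = seq(0, 20, 1), op = c(0:9, 0:10))

# Positions of all zeros in op (here: 1 and 11).
zero_pos <- which(df$op == 0)

# Number of leading values to replace: everything before the second zero.
n_head <- zero_pos[2] - 1

# Continue the sequence from the last value of op.
df$op[seq_len(n_head)] <- tail(df$op, 1) + seq_len(n_head)
```

Since each data frame is handled with vectorized indexing, the steps could be wrapped in a function and applied to a list of data frames with lapply.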

exclude rows which contain all 0's

I would like to exclude the rows in a data frame which contain all 0's.
I can check whether a row contains 0 by using the %in% operator, but I need to know how to iterate over an entire matrix and then print the new matrix without those rows.
How can I achieve that?
Using Senor O's sample data (DF <- data.frame(A=rep(c(1,0),5), B=0)), try:
DF[!rowSums(DF == 0) == ncol(DF), ]
This should work:
AllZeros = apply(DF, 1, function(X) all(X==0))
DF2 = DF[!AllZeros,]
Try it with the following sample data:
DF <- data.frame(A=rep(c(1,0),5), B=0)
There are a ton of ways to do this, as the people who have answered so far will show you.
I'll provide one more example based on this sample dataset.
DF <- data.frame(A=rep(c(1,0),5), B=0)
The subset command works well.
newDF <- subset(DF, !(A == 0 & B == 0) )
Depending on the size of your matrix and the naming convention of your variables, this may be tedious, in which case I'd go straight for the apply functions.

Subtracting two dataset

I have 2 datasets. One is the parent dataset (A) and other one is a subset (B) of it. I want to create a dataset from A which does not contain rows from B. It should be something like
C=A-B
Both the datasets A and B have same number of columns and column names.
If B is an actual subset of A, you can use setdiff on rownames:
sset <- subset(mtcars,cyl==4)
mtcars[setdiff(rownames(mtcars),rownames(sset)),]
If you do not want to convert the rows to strings for comparison, i.e. you want exact matches, you can try this:
a <- data.frame(t(matrix(1:12,3,4)))
b <- data.frame(t(matrix(7:21,3,5)))
a[!apply(a,1,FUN=function(y){any(apply(b,1,FUN=function(x){all(x==y)}))}),]
Something like the following might do the trick:
C <- A[!(apply(A, 1, toString) %in% apply(B, 1, toString)), ]
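A variation on the string-key idea is to build the keys with do.call(paste, ...), which works column by column and skips the as.matrix conversion that apply performs on a data frame. A minimal sketch with made-up data, assuming A and B have the same columns in the same order and that no value contains the tab character used as separator:

```r
# Hypothetical parent dataset A and a subset B of it.
A <- data.frame(id = 1:5, g = c("a", "b", "c", "d", "e"))
B <- A[c(2, 4), ]

# One key per row, built by pasting the columns together.
key_A <- do.call(paste, c(A, sep = "\t"))
key_B <- do.call(paste, c(B, sep = "\t"))

# Keep the rows of A whose key does not occur in B.
C <- A[!(key_A %in% key_B), ]
```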
