I have a data frame with 2,000,000 + rows and 22 columns.
In three of the columns the entries are either 0, 1 or NA.
I want to have a column which has the sum of these three columns for every row, treating NA as 0.
Using a for loop is definitely way too slow.
Have you got any alternatives for me? Another idea was using mutate in a pipe, but I have problems selecting the columns that I want to add up by name.
First attempt:
for (i in 1:nrow(T12)) {
  if (is.na(T12$blue[i]) & is.na(T12$blue.y[i])) {
    T12$blue[i] <- T12$blue.x[i]
  } else if (is.na(T12$blue[i]) & is.na(T12$blue.x[i])) {
    T12$blue[i] <- T12$blue.y[i]
  } else if (is.na(T12$blue[i]) & is.na(T12$blue.x[i]) & is.na(T12$blue.y[i]))
    T12[i,] <- NULL
}
Thank you!
I am going to assume that the columns you wish to add are the first three. If you need different columns, just change c(1,2,3) in the code below.
apply(T12[,c(1,2,3)], 1, sum, na.rm=TRUE)
Note: a commenter points out that a faster solution is
rowSums(T12[,c(1,2,3)], na.rm=TRUE)
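Since the question mentions trouble selecting the columns by name, here is a minimal sketch of the same idea with a character vector of names instead of positions. The column names blue, blue.x and blue.y are taken from the loop in the question; the toy values, the id column and the new column name blue.sum are made up for illustration.

```r
# Toy stand-in for T12 (values are invented; only the column names come from the question)
T12 <- data.frame(id     = 1:4,
                  blue   = c(1, NA, 0, NA),
                  blue.x = c(NA, 1, 1, NA),
                  blue.y = c(0, 1, NA, NA))

# Select the columns by name and sum across each row, counting NA as 0
cols <- c("blue", "blue.x", "blue.y")
T12$blue.sum <- rowSums(T12[, cols], na.rm = TRUE)  # blue.sum is a hypothetical name

T12
```

A row in which all three entries are NA gets a sum of 0 with this approach, so such rows would need a separate check if they should be treated differently.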
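You can first replace all the NAs with 0 (note that this affects every column of df, not only the three being summed), then add the columns with data.table:
library(data.table)
df[is.na(df)] <- 0
setDT(df)[,newcol := a + b + c]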
If your columns are named a, b and c, you can try the code below:
within(T12, new <- rowSums(cbind(a,b,c),na.rm = TRUE))
Related
I have a data frame like this:
I want to remove rows which have values = 0 in those columns which are "numeric". I tried some functions, but they either returned an error or didn't remove anything because the entire row is not equal to 0. In short, I need to remove the rows that are equal to 0 in the columns that have a numeric class (i.e. from sales month to expected sales). How could I do this? (The result I expect is attached below.)
P.S.: If I could do it with some function that allows me to give the number of the column instead of the name, it would be great!
Here is a simple solution with lapply.
set.seed(5)
df <- data.frame(a = 1:10, b = letters[1:10],
                 x = sample(0:5, 5, replace = TRUE),
                 y = sample(c(0, 10, 20, 30, 40, 50), 5, replace = TRUE))
# keep only the rows that contain no 0 in any column
df <- df[!unlist(lapply(1:nrow(df), function(i) {
  any(df[i, ] == 0)
})), ]
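The lapply version drops a row as soon as any column contains a 0. If the intent is instead to drop only rows where all of the chosen numeric columns are 0, and to pick those columns by number as the P.S. asks, a vectorised sketch could look like the following. The toy data and the positions 3:4 are assumptions for illustration, not the asker's real columns.

```r
# Toy data (invented): two identifier columns followed by two numeric columns
df <- data.frame(id   = 1:5,
                 name = letters[1:5],
                 x    = c(0, 2, 0, 4, 0),
                 y    = c(0, 0, 30, 40, 0))

cols <- 3:4                                   # column numbers to check (assumed)
all_zero <- rowSums(df[cols] != 0) == 0       # TRUE where every checked column is 0
res <- df[!all_zero, ]

res
```

This sketch assumes the checked columns contain no NA; with missing values, rowSums would need na.rm = TRUE and a decision on how NA should count.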
I have got a data.frame with approx. 20,000 columns. From this data.frame I want to remove columns for which the following vector has a value of 1.
u.snp <- apply(an[25:19505], 2, mean)
I am sure there must be a straightforward way to accomplish this but can't see it right now. Any hints would be greatly appreciated. Thanks.
Update: Thanks for your help. Now I tried the following:
cm <- colMeans(an.mdr[25:19505])
tail(sort(cm), n=40)
With the tail function I see that 22 columns out of 19481 columns of an.mdr have mean=1. Next I remove these columns using the code as suggested.
an.mdr.s <- an.mdr
an.mdr.s[colMeans(an.mdr.s[25:19505])==1] <- NULL
As anticipated an.mdr.s has 22 columns less than an.mdr. But when I calculate the column means for all but the first 24 columns I again have 22 columns with column mean=1 in an.mdr.s.
cmm <- colMeans(an.mdr.s[25:19483])
tail(sort(cmm), n=40)
Honestly, I cannot see what is going on here right now.
That should be quite easily accomplished with the following command:
df[colMeans(df)==1] <- NULL
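A caveat when the means are computed on a column range only, as in the update to the question: the logical vector returned by colMeans then lines up with that range, not with the full data frame, so using it to index the whole data frame removes columns shifted by the offset. A minimal sketch of one way around this, mapping the result back to positions in the full data frame, is below. The toy data and all names in it are invented, and 3 stands in for the 25 in the question.

```r
# Toy data (invented): 2 leading columns to skip, then 4 numeric columns
an <- data.frame(id1 = 1:3, id2 = 4:6,
                 s1 = c(1, 1, 1), s2 = c(0, 1, 0),
                 s3 = c(1, 1, 1), s4 = c(2, 0, 0))

first <- 3                                  # first column of the range to test
cols  <- first:ncol(an)                     # positions within the full data frame
drop  <- cols[colMeans(an[cols]) == 1]      # translate the hits back to those positions

# Guard against the empty case, where an[-integer(0)] would select no columns
an.s <- if (length(drop) > 0) an[-drop] else an

names(an.s)
```

Recomputing colMeans on the remaining range afterwards is a quick way to confirm that no column with mean 1 is left.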
You can do it in two simple steps (df is your data frame):
# step 1 - calculate mean for all columns and filter with mean = 1
remove_columns <- sapply(df, mean)
remove_columns <- names(remove_columns[remove_columns == 1])
# alternate using filter (just for knowledge)
## remove_columns <- names(Filter(function(x) x == 1,sapply(df, mean)))
# step 2 - remove them
df_new <- df[,setdiff(names(df), remove_columns)]
I have hundreds of data frames with the following structure:
df <- data.frame(yr=seq(0,20,1), op=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,10))
I would like to automatically replace the initial part of the column op so that the sequence continues from the end of the vector, but only up to the second 0 value.
It should result like this:
df.result <- data.frame(yr=seq(0,20,1), op=c(11,12,13,14,15,16,17,18,19,20,0,1,2,3,4,5,6,7,8,9,10))
It must be dynamic, because the sequences are all different.
Any tips for doing this quickly with a function, without the need for loops?
Thanks in advance!
Here's an approach:
transform(df, op = replace(op, cumsum(!op) < 2, seq(which(!op)[2] - 1) + tail(op, 1)))
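The one-liner packs several steps together, so here is the same idea unpacked as a sketch. The intermediate names zeros and n_head are made up for readability, and the code assumes that op contains at least two zeros.

```r
# Same data as in the question
df <- data.frame(yr = seq(0, 20, 1),
                 op = c(0:9, 0:10))

zeros  <- which(df$op == 0)     # positions of the zeros in op
n_head <- zeros[2] - 1          # number of entries before the second zero

# Continue counting from the last value of op over those leading entries
df$op[seq_len(n_head)] <- tail(df$op, 1) + seq_len(n_head)

df$op
```

Because each data frame is handled with vector operations only, the same three lines can be wrapped in a function and applied to a list of data frames with lapply.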
I would like to exclude the rows in a data frame which contain all 0's.
I can check whether a row contains 0 or not by using the %in% operator, but I need to know how to iterate over an entire matrix and then print the new matrix excluding those rows.
How can I achieve that?
Using Senor O's sample data (DF <- data.frame(A=rep(c(1,0),5), B=0)), try:
DF[rowSums(DF == 0) != ncol(DF), ]
This should work:
AllZeros = apply(DF, 1, function(X) all(X==0))
DF2 = DF[!AllZeros,]
Try it with the following sample data:
DF <- data.frame(A=rep(c(1,0),5), B=0)
There are a ton of ways to do this, as the other answers so far show.
I'll provide one more, based on this example dataset.
DF <- data.frame(A=rep(c(1,0),5), B=0)
The subset command works well.
newDF <- subset(DF, !(A == 0 & B == 0) )
Depending on the size of your matrix and the naming convention of your variables, this may be tedious in which case I'd go straight for the apply functions.
I have 2 datasets. One is the parent dataset (A) and other one is a subset (B) of it. I want to create a dataset from A which does not contain rows from B. It should be something like
C=A-B
Both the datasets A and B have same number of columns and column names.
If B is an actual subset of A, you can use setdiff on rownames:
sset <- subset(mtcars,cyl==4)
mtcars[setdiff(rownames(mtcars),rownames(sset)),]
If you do not want to convert the rows to strings for the comparison, i.e. you want exact matches, you can try this:
a <- data.frame(t(matrix(1:12,3,4)))
b <- data.frame(t(matrix(7:21,3,5)))
a[!apply(a,1,FUN=function(y){any(apply(b,1,FUN=function(x){all(x==y)}))}),]
Something like the following might do the trick:
C <- A[!(apply(A, 1, toString) %in% apply(B, 1, toString)), ]
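As a variation on the string-key idea, here is a sketch that builds one key per row column by column with paste instead of going through apply, which converts the whole data frame to a single matrix first. The helper name row_key, the toy data and the separator choice are assumptions for illustration. Like the toString version, this compares string representations, so it is not an exact numeric match.

```r
# Toy data (invented): B is a subset of the rows of A
A <- data.frame(x = 1:5, y = letters[1:5])
B <- A[c(2, 4), ]

# Hypothetical helper: one string key per row.
# "\r" is used as separator on the assumption that it does not occur in the data.
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))

C <- A[!(row_key(A) %in% row_key(B)), ]

C
```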