I would like to exclude the rows of a data frame that contain only 0s.
I can check whether a row contains 0 by using the %in% operator, but I need to know how to iterate over an entire matrix and then print the new matrix without those rows.
How can I achieve that?
Using Senor O's sample data (DF <- data.frame(A=rep(c(1,0),5), B=0)), try:
DF[rowSums(DF == 0) != ncol(DF), ]
This should work:
AllZeros = apply(DF, 1, function(X) all(X==0))
DF2 = DF[!AllZeros,]
Try it with this sample data:
DF <- data.frame(A=rep(c(1,0),5), B=0)
There are a ton of ways to do this, as the other answers so far show.
I'll provide one more, based on this example dataset:
DF <- data.frame(A=rep(c(1,0),5), B=0)
The subset command works well.
newDF <- subset(DF, !(A == 0 & B == 0) )
Depending on the size of your matrix and the naming convention of your variables, this may be tedious in which case I'd go straight for the apply functions.
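As a minimal sketch of that trade-off, assuming the same sample data as above, the snippet below runs a column-name-free rowSums() filter next to the apply() and subset() versions. The object names keep_rowsums, keep_apply and keep_subset are made up for illustration.

```r
# Sample data from the answers above: A alternates 1, 0 and B is all zeros.
DF <- data.frame(A = rep(c(1, 0), 5), B = 0)

# Keep a row if at least one of its values is non-zero (no column names needed).
keep_rowsums <- DF[rowSums(DF != 0) > 0, ]

# Same idea with apply(): drop rows where every value is zero.
keep_apply <- DF[!apply(DF, 1, function(x) all(x == 0)), ]

# subset() needs every column spelled out, which gets tedious with many columns.
keep_subset <- subset(DF, !(A == 0 & B == 0))

# All three should select the same rows.
nrow(keep_rowsums)
nrow(keep_apply)
nrow(keep_subset)
```

Note that a row containing NA would produce an NA index in the rowSums() version, so this sketch assumes the data has no missing values.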
Related
I have a data set that looks like this:
a <- I want to filter the rows with the same GeneFunction that has the highest Shared_NDC_Coverage.
b <- In addition, I would also like to keep rows that are higher than the 10% of the highest Shared_NDC_Coverage (the information we get at a) for that subset of the group that we filtered for each TreatmentGeneFunction.
I am new to R and trying to teach myself some of the basics. I appreciate any kind of help. Thank you!
So far I have tried the following code, but it is not really working:
sample <- read.csv("Ksg_3Q_vs_NDCsharedSupplemented_T60_Top_Newannot.csv")
sample <- arrange(sample, TreatmentGeneFuncion)
for(a in 1:nrow(sample)){
if(sample$TreatmentGene[a] == sample$NDCGene[a]){
sample[a] <- which.max(sample$Shared_NDC_Coverage[a])
}
}
sample2<- distinct(sample, TreatmentGeneFuncion, .keep_all = TRUE) #This only returns the first row of the repeated values
Guesses:
a <- subset(sample, GeneFunction == GeneFunction[which.max(Shared_NDC_Coverage)])
b <- subset(sample, Shared_NDC_Coverage >= 0.1 * max(a$Shared_NDC_Coverage))
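The guesses above work on the whole data set at once. If the intent is to apply both rules within each GeneFunction group, one possible base R sketch uses ave() to compute a per-group maximum. The data frame genes and its values are invented for illustration (only the two column names come from the question), and reading (b) as "at least 10% of the group's maximum" is an assumption.

```r
# Hypothetical stand-in for the CSV in the question; values are invented.
genes <- data.frame(
  GeneFunction = c("kinase", "kinase", "kinase", "transport", "transport"),
  Shared_NDC_Coverage = c(50, 4, 20, 30, 80)
)

# ave() returns, for every row, the maximum coverage within that row's group.
group_max <- ave(genes$Shared_NDC_Coverage, genes$GeneFunction, FUN = max)

# (a) the row(s) with the highest coverage in each GeneFunction group
a <- genes[genes$Shared_NDC_Coverage == group_max, ]

# (b) rows whose coverage is at least 10% of their own group's maximum
b <- genes[genes$Shared_NDC_Coverage >= 0.1 * group_max, ]
```

Ties for the maximum are all kept in a; if only one row per group is wanted, an extra de-duplication step would be needed.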
I have a data frame with 2,000,000 + rows and 22 columns.
In three of the columns the entries are either 0, 1 or NA.
I want to have a column which has the sum of these three columns for every row, treating NA as 0.
Using a for loop is definitely way too slow.
Have you got any alternatives for me? Another idea was using mutate in a pipe, but I have problems selecting the columns that I want to add up by name.
First attempt:
for(i in 1:nrow(T12)){
if(is.na(T12$blue[i]) & is.na(T12$blue.y[i])) {
T12$blue[i] <- T12$blue.x[i]
}else if(is.na(T12$blue[i]) & is.na(T12$blue.x[i])){
T12$blue[i] <- T12$blue.y[i]
}else if(is.na(T12$blue[i]) & is.na(T12$blue.x[i]) & is.na(T12$blue.y[i]) )
T12[i,] <- NULL
}
Thank you!
I am going to assume that the columns you wish to add are the first three. If you need different columns, just change c(1,2,3) in the code below.
apply(T12[,c(1,2,3)], 1, sum, na.rm=TRUE)
Note: #27ϕ9 comments that a faster solution is
rowSums(T12[,c(1,2,3)], na.rm=TRUE)
You can first replace all the NAs with 0.
library(data.table)
df[is.na(df)] <- 0
setDT(df)[, newcol := a + b + c]
If your columns are named a, b and c, you can try the code below:
within(T12, new <- rowSums(cbind(a,b,c),na.rm = TRUE))
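Since the question mentions trouble selecting the columns by name, here is a small sketch of the rowSums() approach with a character vector of names. The column names blue, blue.x and blue.y are borrowed from the question's first attempt; the id column, the values and the new column name blue_total are assumptions made for illustration.

```r
# Hypothetical miniature of T12; values are invented.
T12 <- data.frame(
  id     = 1:4,
  blue   = c(1, NA, 0, NA),
  blue.x = c(NA, 1, 1, NA),
  blue.y = c(1, 1, NA, NA)
)

# Select the columns to add up by name rather than by position.
cols <- c("blue", "blue.x", "blue.y")

# rowSums() is vectorised over rows; na.rm = TRUE makes NA count as 0.
T12$blue_total <- rowSums(T12[, cols], na.rm = TRUE)

T12
```

Because rowSums() runs in compiled code over the whole selection, it avoids the per-row overhead of a for loop or of apply().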
I am very new to coding.
I'm looking to retrieve frequency values of rows in my data frame
I already know you can do this using:
df$col <- rowSums( data[,0:100] )
But I specifically want the sum of data from rows that are divisible by two, in other words, even rows up to a specific point in my data frame.
Perhaps you would need to incorporate an if else function?
Something vaguely similar to this oversimplified code?
if df$col[0:5]%%2
print rowSum
else:
don't
Anyone have any ideas?
Much appreciated
Indexing with a logical vector, combined with the recycling rule, gives a neat solution. The index c(FALSE, TRUE) is recycled along the rows, so rows 2, 4, 6, ... are selected:
rowSums(cars[c(FALSE, TRUE), ])
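The question also asks for even rows only up to a specific point. A minimal sketch of that, assuming a small made-up matrix m and a hypothetical cut-off last_row (which must be at least 2 here):

```r
# Made-up 6 x 5 matrix, filled column by column with 1..30.
m <- matrix(1:30, nrow = 6)

# Hypothetical cut-off: only consider rows 1..last_row.
last_row <- 4

# Explicit indices of the even rows up to the cut-off.
even_rows <- seq(2, last_row, by = 2)
even_sums <- rowSums(m[even_rows, , drop = FALSE])

# Same selection via a recycled logical index on the first last_row rows.
first_part <- m[seq_len(last_row), , drop = FALSE]
recycled   <- rowSums(first_part[c(FALSE, TRUE), , drop = FALSE])

even_sums
recycled
```

drop = FALSE keeps the result a matrix even when only one row is selected, which rowSums() requires.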
st.mat <- matrix(1:100, ncol = 10, nrow = 10)
# Loop over the rows and print the sum of every even-numbered row
for (i in seq_len(nrow(st.mat))) {
  if (i %% 2 == 0) {
    print(sum(st.mat[i, ]))
  }
}
This would be a very simple way to do it.
I have two matrices df_matrix and df_subset. One is a subset of the other one. Therefore, df_matrix has 10000 rows and columns and df_subset contains only 8222 columns and rows of df_matrix.
I want to select only those columns from df_matrix that are NOT in df_subset. I thought it is best to do it by column names, so I tried executing this code:
newdf <- df_matrix[, which( (colnames(df_matrix)) != (colnames(KroneckerProducts)) )]
However, this is not working at all. Is there any other way to do this?
As a general rule, do not use == or != to compare objects of different lengths: the shorter one is recycled and the comparison is made position by position.
Use %in% with ! instead:
newdf <- df_matrix[, !(colnames(df_matrix) %in% colnames(KroneckerProducts))]
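To see the difference on a small case, here is a sketch with made-up matrices; df_subset stands in for the smaller matrix from the question, and the column names a to f are invented.

```r
# Hypothetical 2 x 6 matrix with named columns.
df_matrix <- matrix(1:12, nrow = 2,
                    dimnames = list(NULL, c("a", "b", "c", "d", "e", "f")))

# The "subset" keeps three of those columns.
df_subset <- df_matrix[, c("b", "c", "f")]

# != recycles the shorter vector and compares position by position,
# so it answers "does name i differ from the name lined up with it?".
elementwise <- colnames(df_matrix) != colnames(df_subset)

# %in% answers "does this name appear anywhere in the other set?".
membership <- !(colnames(df_matrix) %in% colnames(df_subset))

newdf <- df_matrix[, membership, drop = FALSE]
colnames(newdf)
```

Here elementwise would wrongly keep "b" and "c", while membership keeps only the columns that are missing from df_subset.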
I'm still new to R and do all of my subsetting via the pattern:
data[ command that produces logical with same length as data ]
or
subset( data , command that produces logical with same length as data )
for example:
test = c("A", "B","C")
ignore = c("B")
result = test[ !( test %in% ignore ) ]
result = subset( test , !( test %in% ignore ) )
But I vaguely remember from my readings that there's a shorter/(more readable?) way to do this? Perhaps using the "with" function?
Can someone list alternative to the example above to help me understand the options in subsetting?
I don't know of a more succinct way of subsetting for your specific example, using only vectors. What you may be thinking of, regarding with, is subsetting data frames based on conditions using columns from that data frame. For example:
dat <- data.frame(variable1 = runif(10), variable2 = letters[1:10])
If we want to grab a subset of dat based on a condition using variable1 we could do this:
dat[dat$variable1 < 0.5,]
or we can save ourselves having to write dat$* each time by using with:
with(dat,dat[variable1 < 0.5,])
Now, you'll notice that I really didn't save any keystrokes by doing that in this case. But if you have a data frame with a long name and a complicated condition, it can save you a bit. See also the related ?within command if you're altering the data frame in question.
Alternatively, you can use subset, which can do essentially the same thing:
subset(dat, variable1 < 0.5)
subset can also choose which columns to return via the select argument.
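A short sketch of select, reusing the dat layout from above. The seed, the 0.5 threshold and the name small are assumptions chosen so that some rows match.

```r
set.seed(1)  # arbitrary seed so the example is reproducible
dat <- data.frame(variable1 = runif(10), variable2 = letters[1:10])

# Rows where variable1 is below 0.5, returning only the variable2 column.
small <- subset(dat, variable1 < 0.5, select = variable2)

small
```

The result is still a data frame, even though only one column is selected.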
The with function would help if test were a column in a data frame (or object in a list), but with global vectors with does not help.
Some people have created a not in operator that could save a couple of key strokes from what you did. If all the values in test are unique then the setdiff function may be what you are thinking of (but if for example you had multiple "A"s then setdiff would only return 1 of them).
With your ignore being only 1 value you could use test != ignore, but that does not generalize to ignore having 2 or more values.
I have seen timed comparisons of alternate methods and %in% (based on match) was one of the best performing strategies.
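A minimal sketch of both ideas. The operator name %notin% is made up here and is not part of base R, and test has a duplicate "A" added to show the setdiff caveat.

```r
# Hypothetical helper: the negation of %in%.
`%notin%` <- Negate(`%in%`)

test   <- c("A", "B", "C", "A")
ignore <- c("B")

# Keeps every element not found in ignore, including both "A"s.
kept_all <- test[test %notin% ignore]

# setdiff() treats its inputs as sets, so the duplicate "A" is dropped.
kept_unique <- setdiff(test, ignore)

kept_all
kept_unique
```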
Alternates:
test[!test=="B"] #logical indexing
test[which(test != "B")] #numeric indexing
# the which() is not superfluous when there are NA's if you want them ignored
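To illustrate the NA remark, here is a small sketch with a made-up vector x that contains a missing value:

```r
x <- c("A", "B", NA, "C")

# Comparing NA with "B" gives NA, and an NA index returns an NA element.
logical_idx <- x[x != "B"]

# which() only returns the positions that are TRUE, so the NA is dropped.
numeric_idx <- x[which(x != "B")]

# %in% never returns NA, so the NA element is kept as an ordinary value.
in_idx <- x[!(x %in% "B")]

logical_idx
numeric_idx
in_idx
```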
Another alternative to the original example:
test[test != ignore]
Other ways, using joran's example:
set.seed(1)
df <- data.frame(variable1 = runif(10), variable2 = letters[1:10])
Returning one column: df[[1]]. df$name is equivalent to df[["name", exact = FALSE]]
df[df[[1]] < 0.5, ]
df[df[["variable1"]] < 0.5, ]
Returning one data frame of one column: df[1]
df[df[1] < 0.5, ]
Using with
with(df, df[df[[1]] < 0.5, ]) # One column
with(df, df[df[["variable1"]] < 0.5, ]) # One column
with(df, df[df[1] < 0.5, ]) # data frame of one column
Using dplyr:
library(dplyr)
filter(df, variable1 < 0.5)