How to remove rows where all columns satisfy a certain condition in a data frame - r

I have a data frame that looks like this:
df <- data.frame(cbind(1:10, sample(c(1:5), 10, replace=TRUE)))
# in real case the columns could be more than two
# and the column name could be anything.
What I want to do is remove all rows in which the values of every column
are smaller than 5.
What's the way to do it?

df[!apply(df, 1, function(x) all(x < 5)), ]

First of all... please stop using cbind to create data.frames. You will be sorry if you continue; R will punish you.
df[ !(rowSums(df < 5) == length(df)), ]
(For a data frame, length() returns the number of columns, so this drops the rows where every column is below 5.)
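A quick reproducible check that the two answers agree (the seed and the column names a and b are assumptions added for illustration):
set.seed(42)
df <- data.frame(a = sample(1:10, 10), b = sample(1:5, 10, replace = TRUE))
# keep only rows where at least one column is >= 5
kept_apply <- df[!apply(df, 1, function(x) all(x < 5)), ]
kept_rowsums <- df[!(rowSums(df < 5) == length(df)), ]
identical(kept_apply, kept_rowsums)  # TRUE: both drop the same rows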

Count values per rows in a data frame R

I know there are other questions like this one, but none of them answers my specific problem.
In my data frame, I need to count the number of values in each row between columns 3 and 8.
I want a simple NB.VAL like in Excel.
base_graphs$NB <- rowSums(!is.na(base_graphs))  # this counts all non-NA values, but I can't see how to restrict it to specific columns
How to create this new column "NB" on my data frame "base_graphs" ?
You were really close:
base_graphs$NB <- rowSums(!is.na(base_graphs[, 3:8]))
The [, 3:8] subsets and selects columns 3 through 8.
apply can apply a function to each row of a data frame. Try:
base_graphs$NB <- apply(base_graphs[3:8], 1, function(x) sum(!is.na(x)))
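As a quick illustration (the toy base_graphs below is an assumption, not the asker's data):
set.seed(1)
base_graphs <- as.data.frame(matrix(sample(c(1:9, NA), 80, replace = TRUE), nrow = 10))
base_graphs$NB <- rowSums(!is.na(base_graphs[, 3:8]))
base_graphs$NB  # count of non-NA values in columns 3 through 8 of each row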

Randomly selecting dataframe column. Avoid sampling same column again

Is there a way to randomly pick a column in a data frame and then avoid randomly picking it again? This should pick a random column:
random_data_vector = data[, sample(ncol(data), 1)]
but I'm not sure how to avoid picking the same column again. I thought about removing the column completely, but there might be a better approach.
You can first shuffle the column indices with
random_cols <- sample(ncol(data))
and then select the random vectors like this:
random_data_vector1 <- data[, random_cols[1]]
random_data_vector2 <- data[, random_cols[2]]
The default setting of sample is replace = FALSE, so random_cols contains no duplicated numbers and you will never select the same column twice.
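A minimal sketch of the idea, using mtcars as a stand-in for your data frame (an assumption for illustration):
data <- mtcars
random_cols <- sample(ncol(data))  # a random permutation of all column indices
for (i in random_cols) {
  random_data_vector <- data[, i]  # each column is visited exactly once
  # ... do something with random_data_vector ...
}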

R: Assign values to a new column based on values of another column where a condition is satisfied

I want to create a new column in a data.frame, taking its values from a column of another data.frame wherever a particular condition between two columns, one from each data frame, is satisfied.
The R pseudo-code being something like this:
DF1$Activity <- DF2$Activity where DF2$NAME == DF1$NAME
In each data.frame, the values of $NAME are unique within the column.
Use the ifelse function. Here, I put NA when the condition is not met, but you may choose any value or values from any vector. Recycling rules apply.
DF1$Activity <- ifelse(DF2$NAME == DF1$NAME, DF2$Activity, NA)
I'm not sure this one actually needs an example: create a column of NA values, then assign the required rows using the same logical vector on both sides:
DF1$Activity <- NA
DF1$Activity[DF2$NAME == DF1$NAME] <- DF2$Activity[DF2$NAME == DF1$NAME]
Without an example it's quite hard to tell, but from your description it sounds like a base::merge or dplyr::inner_join operation. Those are quite fast in comparison to if statements.
Cheers
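A minimal sketch of the merge route, with toy DF1 and DF2 (the values are assumptions; only the NAME and Activity columns come from the question):
DF1 <- data.frame(NAME = c("a", "b", "c"))
DF2 <- data.frame(NAME = c("b", "c", "d"), Activity = c(10, 20, 30))
# left join: keep every row of DF1, pull Activity from DF2 where NAME matches
DF1 <- merge(DF1, DF2, by = "NAME", all.x = TRUE)
DF1  # unmatched rows get NA, mirroring the ifelse default above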

How to access a column after subsetting data frame?

It has to be really simple, but it looks like my mind is not working properly anymore.
What I would like to do is store one of the columns from mtcars as a vector, but after subsetting it. I need a single line of code for the subsetting and the assignment.
That's what I would like to achieve, but with one line:
data <- mtcars[mtcars[, 11] == 4, ]
vec <- data[, 1]
Thx!
vec <- mtcars[mtcars[, 11] == 4, ][, 1]
Here mtcars[, 11] == 4 is the row index; by selecting the column index 1 as well, we get the first column for just the rows that satisfy the condition:
vec <- mtcars[mtcars[, 11] == 4, 1]
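For readability, the same one-liner can use column names instead of positions; in mtcars, column 11 is carb and column 1 is mpg:
vec <- mtcars$mpg[mtcars$carb == 4]  # mpg values for cars with 4 carburetors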

Is there a more elegant way to find duplicated records?

I've got 81,000 records in my test frame, and duplicated is showing me that 2039 are identical matches. One answer to Find duplicated rows (based on 2 columns) in Data Frame in R suggests a method for creating a smaller frame of just the duplicate records. This works for me, too:
dup <- data.frame(as.numeric(duplicated(df$var)))  # creates df with binary var for duplicated rows
colnames(dup) <- c("dup")  # renames column for simplicity
df2 <- cbind(df, dup)  # bind to original df
df3 <- subset(df2, dup == 1)  # subsets df using binary var for duplicated
But it seems, as the poster noted, inelegant. Is there a cleaner way to get the same result: a view of just those records that are duplicates?
In my case I'm working with scraped data and I need to figure out whether the duplicates exist in the original or were introduced by me scraping.
duplicated(df) will give you a logical vector (TRUE/FALSE values), which you can then use as an index into your data frame's rows.
# indx will contain TRUE values wherever in df$var there is a duplicate
indx <- duplicated(df$var)
df[indx, ]  # note the comma
You can put it all together in one line
df[duplicated(df$var), ]  # again, the comma, to indicate we are selecting rows
doops <- which(duplicated(df$var))  # the == TRUE is redundant inside which()
uniques <- df[-doops, ]
duplicates <- df[doops, ]
This is the logic I generally use when I am trying to remove duplicate entries from a data frame.
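If you want to see every copy of a duplicated record (not just the second and later occurrences), a common idiom combines duplicated() with fromLast = TRUE; df and var are the same names used above:
# flag all members of each duplicate group, including the first occurrence
all_copies <- df[duplicated(df$var) | duplicated(df$var, fromLast = TRUE), ]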
