Removing rows from a data frame - r

I have this data.frame:
set.seed(1)
df <- data.frame(id1=LETTERS[sample(26,100,replace = T)],id2=LETTERS[sample(26,100,replace = T)],stringsAsFactors = F)
and this vector:
vec <- LETTERS[sample(26,10,replace = F)]
I want to remove from df any row in which either df$id1 or df$id2 is not in vec.
Is there any faster way of finding the row indices which meet this condition than this:
rm.idx <- which(!apply(df,1,function(x) all(x %in% vec)))

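With dplyr, filter can select the rows that fail the condition, i.e. the ones to be removed. Inside filter the columns can be referenced by name, without df$:
library(dplyr)
df1 <- df %>% filter(!(id1 %in% vec) | !(id2 %in% vec))
To keep the matching rows instead, use filter(id1 %in% vec, id2 %in% vec).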

Looping over the columns might be faster than looping over the rows. Use lapply to apply %in% to each column, which gives a list of logical vectors, then combine them element-wise with Reduce and | to get TRUE for every row where at least one id is in vec, and use that to subset df:
df[Reduce(`|`, lapply(df, `%in%`, vec)),]
If both ids must be in vec (which is what the question asks for), replace | with &:
df[Reduce(`&`, lapply(df, `%in%`, vec)),]
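As a quick check, the column-wise version can be compared against the row-wise apply from the question. This is a minimal sketch on the question's data; the names keep_rowwise, keep_colwise and df_kept are introduced here for illustration only.

```r
set.seed(1)
df <- data.frame(id1 = LETTERS[sample(26, 100, replace = TRUE)],
                 id2 = LETTERS[sample(26, 100, replace = TRUE)],
                 stringsAsFactors = FALSE)
vec <- LETTERS[sample(26, 10, replace = FALSE)]

# Row-wise: TRUE for rows where both ids are in vec (the question's approach)
keep_rowwise <- apply(df, 1, function(x) all(x %in% vec))

# Column-wise: one %in% call per column, combined element-wise with &
keep_colwise <- Reduce(`&`, lapply(df, `%in%`, vec))

# Both approaches should flag exactly the same rows
same_result <- identical(unname(keep_rowwise), keep_colwise)

# Subset with the logical vector to keep only the matching rows
df_kept <- df[keep_colwise, ]
```

The column-wise version makes two vectorised %in% calls in total, rather than one function call per row, which is why it tends to scale better as the number of rows grows.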

Actually, a plain vectorised comparison is also fast (unique() is not needed, because which() never returns a repeated index):
rm.idx <- which(!(df$id1 %in% vec) | !(df$id2 %in% vec))

Related

Join dataframes, retaining column names

I have the following 3 data frames, each of which has named columns. I want to combine them and retain the column names. When I use the patch I found for combining data frames, it drops the name of any data frame that doesn't have at least 2 columns. How can I retain the names?
x<-data.frame(mean(1:10))
names(x)[names(x) == 'mean.1.10.'] <- 'var.name'
y<-data.frame(1:4)
names(y)[names(y) == 'X1.4'] <- 'var.name2'
z<-data.frame(matrix(1:10,5,2))
names(z)[names(z) == 'X1'] <- 'var.name3'
names(z)[names(z) == 'X2'] <- 'var.name4'
list_datf <- list(x, y, z)
n_r <- seq_len(max(sapply(list_datf, nrow)))
NEW <- do.call(cbind, lapply(list_datf, `[`, n_r, ))
You need to include drop = FALSE in the indexing step so that the things you're binding together retain all of their dimensions. I couldn't figure out a way to do this by passing drop = FALSE as an extra argument to [, so I resorted to using an anonymous function instead.
NEW <- do.call(cbind, lapply(list_datf, function(x) x[n_r, , drop = FALSE]))
Alternatively, you could convert your components to tibbles, which (unlike data frames) never drop "unneeded" dimensions:
NEW <- do.call(cbind, lapply(list_datf, function(x) tibble::as_tibble(x)[n_r, ]))
If you want to go full tidyverse:
library(dplyr)
list_datf %>% purrr::map(~ tibble::as_tibble(.)[n_r, ]) %>% bind_cols()
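The effect of drop = FALSE is easiest to see on a single one-column data frame. The sketch below rebuilds the question's three data frames with the names set directly (a shortcut, not the question's exact code); dropped and kept are illustrative names.

```r
x <- data.frame(var.name = mean(1:10))
y <- data.frame(var.name2 = 1:4)
z <- data.frame(var.name3 = 1:5, var.name4 = 6:10)

# Default indexing of a one-column data frame returns a bare vector,
# so the column name is lost
dropped <- y[1:4, ]
is.data.frame(dropped)   # FALSE

# drop = FALSE keeps the data frame, and with it the column name
kept <- y[1:4, , drop = FALSE]
names(kept)              # "var.name2"

# Pad every data frame to the longest one (missing rows become NA) and bind
list_datf <- list(x, y, z)
n_r <- seq_len(max(sapply(list_datf, nrow)))
NEW <- do.call(cbind, lapply(list_datf, function(d) d[n_r, , drop = FALSE]))
names(NEW)
```

Indexing past the last row of a data frame fills the extra rows with NA, which is what lets the shorter data frames line up with the longest one.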

How do I convert all numeric columns to character type in my dataframe?

I would like to do something more efficient than
dataframe$col <- as.character(dataframe$col)
since I have many numeric columns.
In base R, we can use either of the following. The first loops over all the columns and uses an if/else condition to convert only the numeric ones:
dataframe[] <- lapply(dataframe, function(x) if (is.numeric(x)) as.character(x) else x)
Or create an index for numeric columns and loop only on those columns and assign
i1 <- sapply(dataframe, is.numeric)
dataframe[i1] <- lapply(dataframe[i1], as.character)
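To make the index-based version concrete, here is a small sketch on a made-up data frame (the column names n, i and s are hypothetical):

```r
# Toy data: two numeric columns (double and integer) and one character column
dataframe <- data.frame(n = c(1.5, 2), i = 1:2, s = c("a", "b"),
                        stringsAsFactors = FALSE)

# Logical index of the numeric columns (integer columns count as numeric)
i1 <- sapply(dataframe, is.numeric)

# Convert only those columns; the others are left untouched
dataframe[i1] <- lapply(dataframe[i1], as.character)

# Inspect the resulting column classes
col_classes <- sapply(dataframe, class)
col_classes
```

Assigning into dataframe[i1] rather than rebuilding the object keeps the result a data frame with the original column order.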
It may be more flexible in dplyr
library(dplyr)
dataframe <- dataframe %>%
  mutate(across(where(is.numeric), as.character))
All said by master akrun! Here is a data.table alternative. Note that it converts all columns, not just the numeric ones, to character:
library(data.table)
data.table::setDT(df)
df[, (colnames(df)) := lapply(.SD, as.character), .SDcols = colnames(df)]

Count the number of missing values in R

I'm working with the Pima Indians Diabetes data from Kaggle in RStudio, and instead of NAs it uses 0s for missing values.
How can I count the number of "0" values in each variable with a single loop, instead of typing table(data$variableName==0) for each column? In other words, "a single loop for the whole data frame".
We can use colSums on a logical matrix
colSums(data == 0)
Or with sapply in a loop
sapply(data, function(x) sum(x == 0))
or with apply
apply(data, 2, function(x) sum(x == 0))
Or in a for loop
count <- numeric(ncol(data))
for(i in seq_along(data)) count[i] <- sum(data[[i]] == 0)
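The base R variants above should all agree. A minimal sketch on a made-up data frame (the values are invented for illustration and are not taken from the Kaggle file):

```r
# Toy stand-in for the diabetes data; 0 marks a missing measurement
data <- data.frame(Glucose = c(0, 148, 85, 0),
                   BMI     = c(33.6, 0, 26.6, 23.3),
                   Age     = c(50, 31, 32, 21))

# data == 0 is a logical matrix; colSums counts the TRUEs per column
zeros_colsums <- colSums(data == 0)

# Same count, computed one column at a time
zeros_sapply <- sapply(data, function(x) sum(x == 0))

zeros_colsums
```

If a column also contains real NA values, the counts become NA; passing na.rm = TRUE to colSums or sum avoids that.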
Try this:
library(dplyr)
data %>% summarise(across(everything(), ~ sum(. == 0, na.rm = TRUE), .names = "Zeros_in_{.col}"))

Remove rows containing string in any vector in data frame

I have a data frame containing a number of vectors that contain strings. I would like to remove the rows that contain a certain string.
df <- data.frame(id=seq(1:10),
foo=runif(10),
sapply(letters[1:5],function(x) {sample(letters,10,T)} ),
bar=runif(10))
This can be done on a single vector by specifying the vector name i.e.
df <- df[!grepl("b", df$a),]
which I can then repeat specifying each vector e.g.
df <- df[!grepl("b", df$b),]
df <- df[!grepl("b", df$c),]
df <- df[!grepl("b", df$d),]
df <- df[!grepl("b", df$e),]
but is it possible to do it in one line without having to specify which columns contain the string? Something like:
df <- df[!grepl("b", df),]
You could try
df[-which(df=="b", arr.ind=TRUE)[,1],]
or, as suggested by @docendodiscimus
df[rowSums(df == "b") == 0,]
This second option is preferable because it still works when nothing matches. With the first option, which() then returns an empty index, and df[-integer(0), ] drops every row instead of keeping them all.
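The difference between the two options shows up on a data frame that contains no "b" at all. A small sketch with made-up data (the names idx, via_which and via_rowsums are illustrative):

```r
# Toy data with no "b" anywhere, so no row should be removed
df <- data.frame(a = c("x", "y"), b = c("y", "z"), stringsAsFactors = FALSE)

# No matches: the row index is empty
idx <- which(df == "b", arr.ind = TRUE)[, 1]

# Negating an empty index is still an empty index, so this selects zero rows
via_which <- df[-idx, ]
nrow(via_which)     # 0

# The logical version keeps every row, as intended
via_rowsums <- df[rowSums(df == "b") == 0, ]
nrow(via_rowsums)   # 2
```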
Paste columns then grepl:
df[!grepl("b", paste0(df$a, df$b, df$c, df$d, df$e)), ]
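Identify the factor or character columns, then paste (from R 4.0 on, data.frame() creates character columns by default, so testing only for "factor" would select nothing):
str_cols <- sapply(df, function(x) is.factor(x) || is.character(x))
df[!grepl("b", apply(df[, str_cols, drop = FALSE], 1, paste0, collapse = ",")), ]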
target_cols <- c("a", "b", "c", "d", "e")
df[!Reduce(`|`, lapply(df[,target_cols], function(col) grepl("b", col))),]

split dataframe by row number in R

This is probably really simple, but I can't find a solution:
df <- data.frame(replicate(10,sample(0:1,10,rep=TRUE)))
v <- c(3, 7)
Is there an elegant way to split this data frame into three elements (of a list) at the row numbers specified in v?
Assuming that rows 1 and 2 go in the first split, rows 3, 4, 5, 6 in the second, and rows 7 to nrow(df) in the last:
split(df, cumsum(1:nrow(df) %in% v))
But if rows 1:3 should be in the first split, rows 4:7 in the second, and rows 8 to nrow(df) in the third:
split(df, cumsum(c(TRUE,(1:nrow(df) %in% v)[-nrow(df)])) )
Or as @James mentioned in the comments,
split(df, cumsum(1:nrow(df) %in% (v+1)))
Another way:
split(df, findInterval(1:nrow(df), v))
For the alternative interpretation, you can use:
split(df, cut(1:nrow(df), unique(c(1, v, nrow(df))), include.lowest=TRUE))
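The two interpretations can be compared by looking at the size of each piece. A minimal sketch, using a one-column data frame of row numbers in place of the random data from the question (split_at and split_after are illustrative names):

```r
df <- data.frame(row = 1:10)
v <- c(3, 7)

# Interpretation 1: a new piece starts AT each row in v
split_at <- split(df, cumsum(1:nrow(df) %in% v))
sizes_at <- sapply(split_at, nrow)        # rows 1-2, 3-6, 7-10

# Interpretation 2: a new piece starts AFTER each row in v
split_after <- split(df, cumsum(1:nrow(df) %in% (v + 1)))
sizes_after <- sapply(split_after, nrow)  # rows 1-3, 4-7, 8-10

# findInterval produces the same grouping as interpretation 1
groups_fi <- findInterval(1:nrow(df), v)
```

Here sizes_at is 2, 4, 4 and sizes_after is 3, 4, 3, which matches the row ranges described above.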
