I have a vector containing a combination of NA values and strings:
v <- c(NA, NA, "text", NA)
I also have a separate data frame:
df <- data.frame("Col1" = 1:4, "Col2" = 5:8)
  Col1 Col2
1    1    5
2    2    6
3    3    7
4    4    8
My goal is to remove the rows of df where the corresponding v value is NA. So in this case the output would just be:
  Col1 Col2
3    3    7
Since the third element of v is the only one that's not NA, only the third row of df is kept. I tried to accomplish this using a for loop:
for (i in 1:length(v)) {
  if (is.na(v[i])) {
    df <- df[-i, ]
  }
}
However, for some reason this just outputs a version of df that includes only the 2nd and 4th rows:
  Col1 Col2
2    2    6
4    4    8
I can't figure out why the loop isn't working. Any suggestions appreciated!
This will do it -
df[!is.na(v), ]
You don't need a loop. Any data frame can be subset with a vector of row indices or a logical vector (TRUE/FALSE): !is.na(v) produces a logical vector from v, and df keeps exactly the rows where it is TRUE. Your loop fails because df shrinks while you iterate: after df <- df[-1, ], the original row 2 becomes row 1, so the next df <- df[-2, ] actually removes the original row 3, leaving you with the original rows 2 and 4.
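A quick check of the one-liner on the data from the question:

```r
v <- c(NA, NA, "text", NA)
df <- data.frame(Col1 = 1:4, Col2 = 5:8)

df[!is.na(v), ]  # keep only the rows where v is not NA
#   Col1 Col2
# 3    3    7
```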
So I'm trying to remove rows that have missing data in some columns, but not those that have missing data in all columns.
Using rowSums() alongside !is.na() gave me thousands of rows of NA at the bottom of my dataset. The top answer here provided a good way of solving my issue using complete.cases:
Remove rows with all or some NAs (missing values) in data.frame
i.e.
data_set1 <- data_set1[complete.cases(data_set1[11:103]), ]
However, that only allows me to remove rows with any missing data in the specified columns. I'm struggling to get complete.cases to play along with rowSums and stop it from removing rows with all missing data.
Any advice very much appreciated!
Try using rowSums like this:
cols <- 11:103
vals <- rowSums(is.na(data_set1[cols]))
data_set2 <- data_set1[!(vals > 0 & vals < length(cols)), ]
Or with complete.cases and rowSums
data_set1[complete.cases(data_set1[cols]) |
rowSums(is.na(data_set1[cols])) == length(cols) , ]
With reproducible example,
df <- data.frame(a = c(1, 2, 3, NA, 1), b = c(NA, 2, 3, NA, NA), c = 1:5)
cols <- 1:2
vals <- rowSums(is.na(df[cols]))
df[!(vals > 0 & vals < length(cols)), ]
# a b c
#2 2 2 2
#3 3 3 3
#4 NA NA 4
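For reference, the complete.cases variant from above returns the same rows on this example:

```r
df <- data.frame(a = c(1, 2, 3, NA, 1), b = c(NA, 2, 3, NA, NA), c = 1:5)
cols <- 1:2

# keep rows that are either complete in cols, or all-NA in cols
df[complete.cases(df[cols]) | rowSums(is.na(df[cols])) == length(cols), ]
#    a  b c
# 2  2  2 2
# 3  3  3 3
# 4 NA NA 4
```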
I want to build a matrix or data frame containing, for each row, the names of the columns whose elements are not NA. For example, suppose I have:
zz <- data.frame(a = c(1, NA, 3, 5),
b = c(NA, 5, 4, NA),
c = c(5, 6, NA, 8))
which gives:
a b c
1 1 NA 5
2 NA 5 6
3 3 4 NA
4 5 NA 8
I want to recognize each NA and build a new matrix or df that looks like:
a c
b c
a b
a c
There will be the same number of NAs in each row of the input matrix/df. I can't seem to get the right code to do this. Suggestions appreciated!
library(dplyr)
library(tidyr)
zz %>%
  mutate(k = row_number()) %>%
  gather(column, value, a, b, c) %>%
  filter(!is.na(value)) %>%
  group_by(k) %>%
  summarise(temp_var = paste(column, collapse = " ")) %>%
  separate(temp_var, into = c("var1", "var2"))
# A tibble: 4 × 3
k var1 var2
* <int> <chr> <chr>
1 1 a c
2 2 b c
3 3 a b
4 4 a c
Here's a possible vectorized base R approach
indx <- which(!is.na(zz), arr.ind = TRUE)
matrix(names(zz)[indx[order(indx[, "row"]), "col"]], ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "c"
#[3,] "a" "b"
#[4,] "a" "c"
This finds the non-NA indices, sorts them by row order, and then subsets the names of your zz data set according to the sorted index. You can wrap the result in as.data.frame if you prefer a data frame over a matrix.
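For example, the same two lines wrapped in as.data.frame:

```r
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

indx <- which(!is.na(zz), arr.ind = TRUE)   # (row, col) positions of non-NA cells
res <- matrix(names(zz)[indx[order(indx[, "row"]), "col"]],
              ncol = 2, byrow = TRUE)
as.data.frame(res)
#   V1 V2
# 1  a  c
# 2  b  c
# 3  a  b
# 4  a  c
```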
EDIT: transpose the data frame once before processing, so it doesn't need to be transposed twice inside the loop as in the first version.
cols <- names(zz)
for (column in cols) {
  zz[[column]] <- ifelse(is.na(zz[[column]]), NA, column)
}
t_zz <- t(zz)
cols <- vector("list", length = ncol(t_zz))
for (i in 1:ncol(t_zz)) {
  cols[[i]] <- na.omit(t_zz[, i])
}
new_dt <- as.data.frame(t(do.call("cbind", cols)))
The tricky part here is that your goal actually changes the data frame's structure, so the task of "removing the NAs in each row" has to build a new data frame row by row, since each column of a given output row can come from a different column of the original data frame.
zz[1, ] is a one-row data frame; use t to convert it into a vector so we can apply na.omit, then transpose it back into a row.
I used two for loops, but for loops are not necessarily bad in R. The first one is vectorized over each column. The second one needs to work row by row anyway.
EDIT: growing objects is very bad for performance in R. I know I could use rbindlist from data.table, which can take a list of data frames, but the OP doesn't want new packages. My first attempt used rbind, which cannot take a list as input. Later I found that an alternative is do.call. It's still slower than rbindlist, though.
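Putting the pieces together, a minimal end-to-end run of this approach on the zz data from the question:

```r
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

# replace each non-NA value with its column name
for (column in names(zz)) {
  zz[[column]] <- ifelse(is.na(zz[[column]]), NA, column)
}

t_zz <- t(zz)                          # transpose once, up front
cols <- vector("list", length = ncol(t_zz))
for (i in 1:ncol(t_zz)) {
  cols[[i]] <- na.omit(t_zz[, i])      # drop the NAs from each original row
}
new_dt <- as.data.frame(t(do.call("cbind", cols)))
new_dt
```

The rows of new_dt are a/c, b/c, a/b and a/c, matching the desired output.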
I have one dataframe (df1) with more than 200 columns containing data (several thousands of rows each). Column names are alphanumeric and all distinct from each other.
I have a second dataset (df2) with a couple of columns where the first column (named 'col1') contains rows with "values" carrying colnames of df1.
But not every row in df2 has a corresponding column in df1.
Now I would like to delete (drop) all rows in df2 where there is no "corresponding" column in df1.
I searched for quite a while using keywords like "subset data.frame by values from another data.frame" but did not find a solution, despite checking several related questions.
Thanks for your help.
Data:
df1 <- data.frame(a = 1:3, b = 1:3)
# a b
# 1 1 1
# 2 2 2
# 3 3 3
df2 <- data.frame(col1 = c("a", "c"))
# col1
# 1 a
# 2 c
Keep rows in df2 whose values are names in df1:
subset(df2, col1 %in% names(df1))
# col1
# 1 a
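If you prefer bracket subsetting over subset(), an equivalent form (drop = FALSE keeps the result a data frame even with a single column):

```r
df1 <- data.frame(a = 1:3, b = 1:3)
df2 <- data.frame(col1 = c("a", "c"))

df2[df2$col1 %in% names(df1), , drop = FALSE]
#   col1
# 1    a
```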
I have a data frame that I had to initialize as an empty data frame.
Now I have a single column available and want to add it to that empty data frame. How can I do it? I won't know the length of the column in advance.
Example
df = data.frame(a= NA, b = NA, col1= NA)
....
length(col1) # Only here will I know the length of the column, and I only have this one column available.
df$col1 <- col1
The error is as follows:
Error in `$<-.data.frame`(`*tmp*`, "a", value = c("1", :
replacement has 5 rows, data has 1
Any help will be greatly appreciated!
use cbind
df = data.frame(a= NA, b = NA)
col1 <- c(1,2,3,4,5)
df <- cbind(df, col1)
# a b col1
# 1 NA NA 1
# 2 NA NA 2
# 3 NA NA 3
# 4 NA NA 4
# 5 NA NA 5
After your edits, you can still use cbind, but you'll need to drop the existing column first (or handle the duplicate columns after the cbind)
cbind(df[, 1:2], col1)
## or if you don't know the column indices
## cbind(df[, !names(df) %in% c("col1")], col1)
A little workaround with lists:
l <- list(a=NA, b=NA, col1=NA)
col1 <- c(1,2,3)
l$col1 <- col1
df <- as.data.frame(l)
I like both answers provided by Symbolix and maRtin, but I have done my own hack. My hack is as follows:
df[1:length(a), "a"] <- a
However, I am not sure which of these methods is most efficient in terms of time. What would the big-O time complexity be?
I have data frame like this :
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I want to find the index of the column of df that has values matching to string "a".
i.e. it should give me 1 as result.
I tried using which inside sapply but it's not working.
Does anybody know how to do it without a loop?
Something like this?
which(apply(df, 2, function(x) any(grepl("a", x))))
The steps are:
With apply go over each column
Search if a is in this column with grepl
Since we get a vector back, use any to get TRUE if any element has been matched to a
Finally check which elements (columns) are TRUE (i.e. contain the searched letter a).
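Applied to the df from the question, the steps above give:

```r
df <- data.frame(col1 = c(letters[1:4], "a"), col2 = 1:5, col3 = letters[10:14])

which(apply(df, 2, function(x) any(grepl("a", x))))
# col1
#    1
```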
Since you mention you were trying to use sapply() but were unsuccessful, here's how you can do it:
> sapply(df, function(x) any(x == "a"))
col1 col2 col3
TRUE FALSE FALSE
> which(sapply(df, function(x) any(x == "a")))
col1
1
Of course, you can also use the grep()/grepl() approach if you prefer string matching. You can also wrap your which() function with unname() if you want just the column number.
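For example, combining the two suggestions to get just the column number:

```r
df <- data.frame(col1 = c(letters[1:4], "a"), col2 = 1:5, col3 = letters[10:14])

unname(which(sapply(df, function(x) any(x == "a"))))
# [1] 1
```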