Remove Rows From Data Frame Where a Column Matches a String - r

How do I remove all rows in a data frame where a certain column matches a string?
For example:
A,B,C
4,3,Foo
2,3,Bar
7,5,Zap
How would I return a dataframe that excludes all rows where C = Foo:
A,B,C
2,3,Bar
7,5,Zap

Just use == with the negation operator (!). If dtfm is the name of your data.frame:
dtfm[!dtfm$C == "Foo", ]
Or, to move the negation in the comparison:
dtfm[dtfm$C != "Foo", ]
Or, even shorter using subset():
subset(dtfm, C!="Foo")
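For reference, a minimal runnable version using the data above. One gotcha worth knowing: if C contains NA, subset() silently drops those rows, while bracket indexing returns NA-filled rows for them.
dtfm <- data.frame(A = c(4, 2, 7), B = c(3, 3, 5), C = c("Foo", "Bar", "Zap"))
dtfm[dtfm$C != "Foo", ] # bracket indexing
subset(dtfm, C != "Foo") # same result here
#   A B   C
# 2 2 3 Bar
# 3 7 5 Zap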

You can use the dplyr package to easily remove those particular rows.
library(dplyr)
df <- filter(df, C != "Foo")

I had a column (A) in a data frame with three values in it (yes, no, unknown). I wanted to keep only the rows with the value "yes"; this is the code, hope it helps you as well:
df <- df[!(df$A == "no") & !(df$A == "unknown"), ]

If you wish to use dplyr to remove the rows where C is "Foo":
df %>%
  filter(!C == "Foo")

I know this has been answered but here is another option:
library(dplyr)
df %>% filter(!C == "Foo")
OR
df[!df$C == "Foo", ]

If your exclusion conditions are stored in another data frame you could use rows_delete:
library(dplyr)
removal_df <- data.frame(C = "Foo")
df %>%
rows_delete(removal_df, by = "C")
A B C
1 2 3 Bar
2 7 5 Zap
This is also handy if you have multiple exclusion conditions so you do not have to write out a long filter statement.
Note: rows_delete is only available if you have dplyr >= 1.0.0
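A sketch of the multiple-condition case mentioned above (the removal_df values here are illustrative; both keys must exist in df, or pass unmatched = "ignore" on dplyr >= 1.1.0):
library(dplyr)
removal_df <- data.frame(C = c("Foo", "Zap"))
df %>%
  rows_delete(removal_df, by = "C")
#   A B   C
# 1 2 3 Bar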

Related

Filter dataset with %in% with pattern

I'm using filter on my dataset to select certain values from a column:
df %>%
  filter(col1 %in% c("value1", "value2"))
However, I don't understand how to filter values in the column by a pattern without writing each one out fully. For example, I also want all values which start with "value3" ("value33", "value34", ...) along with "value1" and "value2". Can I add grepl to that vector?
You can use regular expressions to do that (str_detect comes from stringr):
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(col1, "^value[1-3]"))
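To answer the literal question: grepl can't go inside the c(...) vector itself, but it can be a second condition joined with |. A sketch, assuming the column is col1:
library(dplyr)
df %>%
  filter(col1 %in% c("value1", "value2") | grepl("^value3", col1))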
If you want to use another tidyverse package to help, you can use str_starts from stringr to find strings that start with a certain value
dd %>% filter(stringr::str_starts(col1, "value"))
Here are a few options in base R:
Using grepl :
subset(df, grepl('^value', b))
# a b
#1 1 value3
#3 3 value123
#4 4 value12
A similar option is grep, which returns the indices of the matches.
df[grep('^value', df$b),]
However, a faster option would be to use startsWith
subset(df, startsWith(b, "value"))
All of this would select rows where column b starts with "value".
data
df <- data.frame(a = 1:5, b = c('value3', 'abcd', 'value123', 'value12', 'def'),
stringsAsFactors = FALSE)

subsetting from same-named data.frame in R

I have a data.frame called c41 (HERE). Some column names (e.g., type) in this data frame are repeated once or twice. As a result, data.frame adds a ".number" suffix to distinguish between them.
Suppose I want to subset variable type == 3 among all column names that have a "type" root in their names. Currently, I drop the ".number" suffixes and then subset but that incorrectly returns nothing.
Question: In BASE R, how can I subset a variable value (type == 3) without needing to include the ".number" suffixes (e.g., type == 3 instead of type.1 == 3)?
In other words, how can I find any "type" whose value is 3 regardless of its numeric suffix.
c41 <- read.csv("https://raw.githubusercontent.com/izeh/l/master/c4.csv")
c42 <- setNames(c41, sub("\\.\\d+$", "", names(c41))) # Take off the `".number"` suffixes
subset(c42, type == 3) # Now subset! But it returns nothing!
Renaming the columns to make them non-unique is a recipe for a headache and is not advisable. Without renaming the columns, in base R you could do something like this instead:
c41[rowSums(c41[grep("^type", names(c41))] == 3, na.rm = TRUE) > 0,]
I don't think subset() can be used here if column names are duplicated.
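Unrolled with intermediate names, the one-liner does this (same logic, shown step by step):
type_cols <- grep("^type", names(c41)) # positions of every type* column
is_three <- c41[type_cols] == 3 # logical matrix
keep <- rowSums(is_three, na.rm = TRUE) > 0 # TRUE if any type* column equals 3
c41[keep, ]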
EDIT: I see that you edited your question to specify base R. Can't help you there! But perhaps a dplyr solution is of interest.
You can use dplyr::filter_at and the starts_with helper.
library(dplyr)
library(readr)
c4 <- read_csv("https://raw.githubusercontent.com/izeh/l/master/c4.csv")
c4 %>%
filter_at(vars(starts_with("type")), any_vars(. == 3))
Adding a select_at to display just the relevant columns:
c4 %>%
filter_at(vars(starts_with("type")), any_vars(. == 3)) %>%
select_at(vars(starts_with("type")))
Result:
# A tibble: 2 x 2
type type_1
<dbl> <dbl>
1 1 3
2 2 3
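A side note: filter_at()/select_at() have since been superseded; on dplyr >= 1.0.4 (which introduced if_any()) the same result can be written as:
c4 %>%
  filter(if_any(starts_with("type"), ~ .x == 3)) %>%
  select(starts_with("type"))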

Conditional sum in R – multiple columns

I'm trying to figure out how to extract some specific information from very big tables (e.g., 30,000 rows and 50 columns).
Imagine I have this data frame:
S1 <- c(1,2,1,1,3,1)
S2 <- c(2,1,3,2,1,1)
S3 <- c(1,2,2,1,3,1)
S4 <- c(3,3,4,2,3,1)
S5 <- c(3,2,5,3,2,2)
count <- c(10,5,3,1,1,1)
df <- data.frame(count,S1,S2,S3,S4,S5)
What I need is to sum the column "count" when, for instance, S1 and S3 share the same value (it doesn't matter which value), but no other column has that same value.
In this example, it should return the value 11, because I should only take into consideration the values of the column "count" from rows 1 and 4.
In rows 2, 5 and 6, S1 and S3 share a value, but I don't want to consider them because other columns also have that same value. And finally, row 3 is not considered simply because S1 and S3 have different values.
I know how to do it easily in Excel, but I was wondering how I could do it in R. I've tried some commands from dplyr, but I failed.
If any of you could help, I'd be very grateful.
A little more complex, but it works, using only base R. From this question it takes the form of comparing multiple columns in a simple way.
sum(df[df$S1==df$S3 & rowSums(sapply(df[,c(3,5,6)],`==`,e2=df$S1)) == 0,1])
[1] 11
The most complex part is how to check multiple columns. In this case we use sapply to compare the columns c(3,5,6) for equality (==) with S1 (e2 is the second argument of the == function).
As ycw mentions, it can be a little complicated to define all the columns by a vector, so this form allows you to check all the columns except those we don't want.
sum(df[df$S1==df$S3 & rowSums(sapply(df[,!(colnames(df) %in% c("count", "S1", "S3"))],`==`,e2=df$S1)) == 0,1])
Applying the same procedure to the two comparisons and defining only the vector of the same values:
equals <- c("S1", "S3")
not_equals <- !(colnames(df) %in% c("count", equals))
sum(df[rowSums(sapply(df[,equals,drop=FALSE],`==`,e2=df[equals[1]])) == length(equals) &
rowSums(sapply(df[,not_equals,drop=FALSE],`==`,e2=df[equals[1]])) == 0, 1])
Note: use drop=FALSE when selecting a single column of a data frame, to avoid it being simplified to a vector, or omit the comma entirely, like this:
sum(df[rowSums(sapply(df[equals],`==`,e2=df[equals[1]])) == length(equals) &
rowSums(sapply(df[not_equals],`==`,e2=df[equals[1]])) == 0, 1])
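The same logic unrolled with intermediate variables, as a readability sketch (not a different method):
equals <- c("S1", "S3")
not_equals <- setdiff(names(df), c("count", equals))
ref <- df[[equals[1]]] # the reference column, S1
all_equal <- rowSums(df[equals] == ref) == length(equals) # S1 and S3 agree
none_equal <- rowSums(df[not_equals] == ref) == 0 # no other S column matches
sum(df$count[all_equal & none_equal])
# [1] 11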
A solution using dplyr. There are two steps. The first filter function finds rows with S1 == S3. The second filter_at function checks columns other than S1, S3, and count all are not equal to S1, which should be the same as S3 after the first filter function.
library(dplyr)
df2 <- df %>%
filter(S1 == S3) %>%
filter_at(vars(-S1, -S3, -count), all_vars(. != S1))
df2
count S1 S2 S3 S4 S5
1 10 1 2 1 3 3
2 1 1 2 1 2 3
Then the total count is as follows.
sum(df2$count)
[1] 11
Using dplyr, rowwise, filter :
library(dplyr)
df %>%
rowwise() %>%
filter(S1==S3 & !S1 %in% c(S2,S4,S5)) %>%
pull(count) %>%
sum()
# [1] 11
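On recent dplyr versions the rowwise() step can be avoided, since if_any() already checks the listed columns within each row; a sketch:
df %>%
  filter(S1 == S3 & !if_any(c(S2, S4, S5), ~ .x == S1)) %>%
  pull(count) %>%
  sum()
# [1] 11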

How to turn colSums results in R into a data frame

I need to get the column sums for all the columns and have the result in a data frame with the column names and their sums as two columns. But if I do this, the column names seem to become the row-name index rather than a column in their own right.
demo=data.frame(a=runif(10),b=runif(10,max=2),c=runif(10,max=3))
as.data.frame(colSums(demo))
The undesired result:
colSums(demo)
a 4.083571
b 11.698794
c 14.082574
The desired result:
colname colSums(demo)
a 4.083571
b 11.698794
c 14.082574
How can I add a heading to the column on the left while keeping the shape as it is? Thanks.
One possibility is to transpose the result with t()
data.frame(t(colSums(demo)))
a b c
colSums.demo. 5.782475 10.46739 18.46751
To change the name of the row in the output, we can use rownames, for instance like this:
`rownames<-`(data.frame(t(colSums(demo))), "myColsum")
a b c
myColsum 5.782475 10.46739 18.46751
Your desired output can be achieved with stack() and setNames():
setNames(nm=c('colname','colSums(demo)'),stack(colSums(demo))[2:1]);
## colname colSums(demo)
## 1 a 4.083571
## 2 b 11.698794
## 3 c 14.082574
> demo = data.frame(a=runif(10),b=runif(10,max=2),c=runif(10,max=3))
> df <- data.frame(colname = names(demo),colSums_demo=colSums(demo))
> print(df, row.names=F)
colname colSums_demo
a 4.754546
b 12.488904
c 18.152095
Try this:
as.data.frame(lapply(demo, sum))
The result:
a b c
1 6.400121 10.16047 10.6528
There is now an easy way to do this using the tidyverse function enframe().
enframe(colSums(demo))
And if you want to use the column names from your example, you can set them as arguments.
demo %>% colSums() %>% enframe(name = "colname", value = "colSums(demo)")
We can use data.table with melt
library(data.table)
melt(setDT(demo)[, lapply(.SD, sum)])
as.data.frame(colSums(demo)) puts the column names in the row names, so promote them to a real column with rownames_to_column() from tibble:
library(tibble)
library(dplyr) # for %>%
new_table <- as.data.frame(colSums(demo))
new_table %>%
  rownames_to_column(var = "colname")
A new tidyverse way!
demo %>%
  summarise(across(where(is.numeric), sum))
and then a tidyr::pivot_longer if you want long by colname.
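Spelled out, that pipeline might look like this (the names_to/values_to labels are illustrative):
library(dplyr)
library(tidyr)
demo %>%
  summarise(across(where(is.numeric), sum)) %>%
  pivot_longer(everything(), names_to = "colname", values_to = "colsum")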

Determine the number of NA values in a column

I want to count the number of NA values in a data frame column. Say my data frame is called df, and the name of the column I am considering is col. The way I have come up with is following:
sapply(df$col, function(x) sum(length(which(is.na(x)))))
Is this a good/most efficient way to do this?
You're over-thinking the problem:
sum(is.na(df$col))
If you are looking for NA counts for each column in a dataframe then:
na_count <- sapply(df, function(y) sum(length(which(is.na(y)))))
should give you a list with the counts for each column.
na_count <- data.frame(na_count)
Should output the data nicely in a dataframe like:
          na_count
column_1     count
Try the colSums function
df <- data.frame(x = c(1,2,NA), y = rep(NA, 3))
colSums(is.na(df))
#x y
#1 3
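If you also want that as a two-column data frame (as in the colSums question above), a small sketch using the same df:
data.frame(colname = names(df), na_count = colSums(is.na(df)), row.names = NULL)
#   colname na_count
# 1       x        1
# 2       y        3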
A quick and easy tidyverse solution to get an NA count for all columns is summarise_all(), which I think reads much more easily than using purrr or sapply:
library(tidyverse)
# Example data
df <- tibble(col1 = c(1, 2, 3, NA),
col2 = c(NA, NA, "a", "b"))
df %>% summarise_all(~ sum(is.na(.)))
#> # A tibble: 1 x 2
#> col1 col2
#> <int> <int>
#> 1 1 2
Or using the more modern across() function:
df %>% summarise(across(everything(), ~ sum(is.na(.))))
If you are looking to count the number of NAs in the entire dataframe you could also use
sum(is.na(df))
summary() also reports NA counts per column, so you can use it if you want the NA counts of several variables at once.
A tidyverse way to count the number of nulls in every column of a dataframe:
library(tidyverse)
library(purrr)
df %>%
map_df(function(x) sum(is.na(x))) %>%
gather(feature, num_nulls) %>%
print(n = 100)
This form, slightly changed from Kevin Ogoros's one:
na_count <- function(x) sapply(x, function(y) sum(is.na(y)))
which returns the NA counts as a named integer vector. You can also call sapply directly on your data:
sapply(your_data, function(x) sum(is.na(x)))
Try this:
length(df$col[is.na(df$col)])
User rrs' answer is right, but it only tells you the number of NA values in the particular column of the data frame that you pass in. To get the number of NA values for the whole data frame, try this:
apply(df, 2, function(x) sum(is.na(x))) # MARGIN = 2 applies over columns
This does the trick
I read a csv file from a local directory. The following code works for me.
# number of rows where the column contains NA
sum(is.na(df[, columnName]))
# number of rows where the column does not contain NA
sum(!is.na(df[, columnName]))
# here columnName is your desired column name
Similar to hute37's answer but using the purrr package. I think this tidyverse approach is simpler than the answer proposed by AbiK.
library(purrr)
map_dbl(df, ~sum(is.na(.)))
Note: the tilde (~) creates an anonymous function. And the '.' refers to the input for the anonymous function, in this case the data.frame df.
If you want the NA count for each column printed one after the other, you can use this. A simple solution:
lapply(df, function(x) { length(which(is.na(x)))})
Another option using complete.cases like this:
df <- data.frame(col = c(1,2,NA))
df
#> col
#> 1 1
#> 2 2
#> 3 NA
sum(!complete.cases(df$col))
#> [1] 1
Created on 2022-08-27 with reprex v2.0.2
You can use this to count number of NA or blanks in every column
colSums(is.na(data_set_name)|data_set_name == '')
In the interests of completeness you can also use the useNA argument in table. For example, table(df$col, useNA = "always") will count all of the non-NA cases as well as the NA ones.
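For example, with a small column containing one NA:
col <- c("a", "b", "a", NA)
table(col, useNA = "always")
# col
#    a    b <NA>
#    2    1    1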
