Selecting columns based on exact string - R

df1 <- data.frame(x1_modhigh_2020 = 1,
x2_modhigh_2030 = 1,
x1_low_2020 = 1,
x2_low_2030 = 1,
x1_high_2020 = 1,
x2_high_2030 = 1)
In a for-loop I want to select columns based on whether they contain 'low', 'modhigh' or 'high' and do some operations on them. My method of selecting columns is:
library(dplyr)
df1 %>% dplyr::select(contains("low")) # this works
df1 %>% dplyr::select(contains("modhigh")) # this works
df1 %>% dplyr::select(contains("high")) # does not work: this also selects `modhigh`
How can I modify the selection of high so that modhigh does not get selected as well?

Using matches you can use regex syntax (rather than contains, which does not allow the use of regex), here for example the pipe |, which is a regex metacharacter signifying alternation:
df1 %>%
select(matches("_high|low"))
x1_low_2020 x2_low_2030 x1_high_2020 x2_high_2030
1 1 1 1 1
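For the for-loop described in the question, one way to keep `high` from also catching `modhigh` is to require an underscore (or the start or end of the name) on both sides of the keyword. This is only a minimal sketch, assuming the level is always delimited by `_` as in `df1`; the names `levels_wanted` and `selected` are made up for illustration.

```r
library(dplyr)

df1 <- data.frame(x1_modhigh_2020 = 1, x2_modhigh_2030 = 1,
                  x1_low_2020 = 1, x2_low_2030 = 1,
                  x1_high_2020 = 1, x2_high_2030 = 1)

levels_wanted <- c("low", "modhigh", "high")  # hypothetical name
selected <- list()                            # hypothetical name

for (lv in levels_wanted) {
  # keyword must be preceded by "_" or the start, and followed by "_" or the end
  pattern <- paste0("(^|_)", lv, "(_|$)")
  selected[[lv]] <- df1 %>% select(matches(pattern))
}

names(selected[["high"]])
```

Each element of `selected` can then be processed inside or after the loop.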

I would also use the matches selection helper proposed by @Chris, but if you are interested in alternatives:
# dplyr
dplyr::select(df1, grep("_high|low", colnames(df1)))
# base R
df1[, grep("_high|low", colnames(df1))]
Both result in
x1_low_2020 x2_low_2030 x1_high_2020 x2_high_2030
1 1 1 1 1

Related

Filter dataset with %in% with pattern

I’m applying filter to my dataset to select certain values from a column:
%>%
filter(col1 %in% c("value1", "value2"))
However, I don’t understand how to filter on a pattern without writing out every value in full. For example, I also want all values which start with “value3” (“value33”, “value34”, ...) along with “value1” and “value2”. Can I add grepl to that vector?
You can use regular expressions to do that:
df %>%
filter(stringr::str_detect(col1, '^value[1-3]'))
If you want to use another tidyverse package to help, you can use str_starts from stringr to find strings that start with a certain value
dd %>% filter(stringr::str_starts(col1, "value"))
Here are a few options in base R:
Using grepl :
subset(df, grepl('^value', b))
# a b
#1 1 value3
#3 3 value123
#4 4 value12
Similar option with grep which returns index of match.
df[grep('^value', df$b),]
However, a faster option would be to use startsWith
subset(df, startsWith(b, "value"))
All of this would select rows where column b starts with "value".
data
df <- data.frame(a = 1:5, b = c('value3', 'abcd', 'value123', 'value12', 'def'),
stringsAsFactors = FALSE)
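To combine the exact values with the prefix, as the question asks, the two conditions can be joined with `|` inside a single filter. A minimal sketch, assuming a column named `col1` as in the question; the sample data and the name `demo` are invented for illustration.

```r
library(dplyr)

# hypothetical sample data
demo <- data.frame(col1 = c("value1", "value2", "value33", "value34",
                            "value4", "other"),
                   stringsAsFactors = FALSE)

res <- demo %>%
  # exact matches OR anything starting with "value3"
  filter(col1 %in% c("value1", "value2") | startsWith(col1, "value3"))

res$col1
```

`startsWith` could be swapped for `grepl("^value3", col1)` if a full regular expression is needed.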

How to split and paste a string while mutating a dataframe?

I have a dataframe like this one:
x <- data.frame(filename = c("aa-b-c x", "c-dd-e y"), number=c(1,2))
filename number
1 aa-b-c x 1
2 c-dd-e y 2
I want to mutate the filename column so it looks like this:
filename number
1 c/aa/b 1
2 e/c/dd 2
This works on a single row: paste(str_match(x$filename[1], "(\\w+)-(\\w+)-(\\w+)")[c(4,2,3)], collapse = "/") but it fails inside the mutate. I'm sure I'm missing a simple fix.
One option is to rearrange the components after capturing as a group
library(dplyr)
library(stringr)
x %>%
mutate(filename = str_replace(filename,
"^([a-z]+)-([a-z]+)-([a-z]+)\\s.*", "\\3/\\1/\\2"))
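str_match returns a character matrix with one row per input string (the full match in column 1, then one column per capture group), so you can reorder the columns and paste each row. This should work pretty well: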
apply(str_match(x$filename, "(\\w+)-(\\w+)-(\\w+)")[, c(4,2,3), drop = FALSE], 1, paste, collapse = "/")
# [1] "c/aa/b" "e/c/dd"
The drop = FALSE is necessary to keep the output a matrix in case there is only one row.
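The single-row version fails inside mutate because indexing the match matrix with one vector, `[c(4,2,3)]`, picks individual elements in column-major order instead of columns once there is more than one row. A sketch of the matrix-based approach used directly in mutate, pasting the columns element-wise; the helper name `reorder_parts` is made up for illustration.

```r
library(dplyr)
library(stringr)

x <- data.frame(filename = c("aa-b-c x", "c-dd-e y"), number = c(1, 2),
                stringsAsFactors = FALSE)

# hypothetical helper: vectorised over the whole column
reorder_parts <- function(s) {
  m <- str_match(s, "(\\w+)-(\\w+)-(\\w+)")   # one row per string
  paste(m[, 4], m[, 2], m[, 3], sep = "/")    # third/first/second group
}

res <- x %>% mutate(filename = reorder_parts(filename))
res
```

Using `sep` rather than `collapse` keeps one result per row, which is what mutate expects.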

Use an external list to remove data from rows

I have a data frame
df <- data.frame(
A = c(4, 2, 7),
B = c(3, 3, 5),
C = c("Expert,Foo", "Bar,Wild", "Zap")
)
and a second one which I would like to use as index to remove rows which contain the specific values
mylist <- data.frame(rtext = c("Foo","Bar"))
So I tried this:
subset(df, C %in% mylist$rtext)
How can I remove the specific rows?
Because this is a partial match, we can use grepl. Paste the elements of the 'rtext' column of 'mylist' into a single string delimited by |, which means OR in a regex, then use grepl on column 'C' of 'df' to get a logical index. Negating it with ! flips TRUE and FALSE, so subset keeps only the rows that do not contain any value from 'rtext'.
subset(df, !grepl(paste(mylist$rtext, collapse="|"), C))
# A B C
#3 7 5 Zap
Using str_detect from stringr
df[!stringr::str_detect(df$C,paste(mylist$rtext,collapse = '|')),]
A B C
3 7 5 Zap
If you need an exact match (so that, for example, Foooo is not removed), reshape your df first with dplyr and tidyr. str_detect and grepl both do partial matching, so a value such as Expert,Foott would still count as a match for Foo.
library(tidyr)
library(dplyr)
df$id=seq.int(nrow(df))
df1=df %>%
transform(C = strsplit(C, ",")) %>%
unnest(C)
df[!df$id%in%df1$id[df1$C%in%mylist$rtext],]
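The exact-match idea can also be sketched in base R without reshaping, by splitting each value of `C` on commas and testing the resulting tokens. This assumes the values are separated by a plain comma with no surrounding spaces; the helper name `has_exact_token` is invented for illustration.

```r
df <- data.frame(A = c(4, 2, 7), B = c(3, 3, 5),
                 C = c("Expert,Foo", "Bar,Wild", "Zap"),
                 stringsAsFactors = FALSE)
mylist <- data.frame(rtext = c("Foo", "Bar"), stringsAsFactors = FALSE)

# hypothetical helper: TRUE where any comma-separated token equals a listed value
has_exact_token <- function(x, tokens) {
  vapply(strsplit(x, ",", fixed = TRUE),
         function(parts) any(parts %in% tokens),
         logical(1))
}

kept <- df[!has_exact_token(df$C, mylist$rtext), ]
kept
```

With this check a value such as `Expert,Foott` is kept, because `Foott` is not identical to `Foo`.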

Conditional sum in R – multiple columns

I'm trying to figure out how to extract some specific information from very big tables (e.g., 30'000 rows and 50 columns).
Imagine I have this data frame:
S1 <- c(1,2,1,1,3,1)
S2 <- c(2,1,3,2,1,1)
S3 <- c(1,2,2,1,3,1)
S4 <- c(3,3,4,2,3,1)
S5 <- c(3,2,5,3,2,2)
count <- c(10,5,3,1,1,1)
df <- data.frame(count,S1,S2,S3,S4,S5)
What I need is to sum the column "count" when, for instance, S1 and S3 share the same value (it doesn't matter which value), but no other column has that value.
In this example, it should return the value 11, because I should only take into consideration the values of the column "count" from rows 1 and 4.
In rows 2, 5 and 6, S1 and S3 have the same value, but I don't want to consider them because other columns also have that value. And finally, row 3 is not considered simply because S1 and S3 have different values.
I know how to do it easily in Excel, but I was wondering how I could do it in R. I've tried some commands from dplyr, but I failed.
If any of you could help, I'd be very grateful.
A little more complex, but it works, using only base R. The way of comparing multiple columns is taken from this question.
sum(df[df$S1==df$S3 & rowSums(sapply(df[,c(3,5,6)],`==`,e2=df$S1)) == 0,1])
[1] 11
The most complex part is how to check multiple columns. In this case we use sapply to compare the columns c(3,5,6) by equality ('==') with S1, (e2 is the second argument of the == function).
As ycw mentions, it can be a little complicated to define all the columns by a vector, so this form allows you to check all the columns except those we don't want.
sum(df[df$S1==df$S3 & rowSums(sapply(df[,!(colnames(df) %in% c("count", "S1", "S3"))],`==`,e2=df$S1)) == 0,1])
Applying the same procedure to the two comparisons and defining only the vector of the same values:
equals <- c("S1", "S3")
not_equals <- !(colnames(df) %in% c("count", equals))
sum(df[rowSums(sapply(df[,equals,drop=FALSE],`==`,e2=df[equals[1]])) == length(equals) &
rowSums(sapply(df[,not_equals,drop=FALSE],`==`,e2=df[equals[1]])) == 0, 1])
Note: use drop=FALSE when selecting a single column of a data frame with [ , ] to avoid the "promotion to vector" problem, or omit the comma, this way:
sum(df[rowSums(sapply(df[equals],`==`,e2=df[equals[1]])) == length(equals) &
rowSums(sapply(df[not_equals],`==`,e2=df[equals[1]])) == 0, 1])
A solution using dplyr. There are two steps. The first filter call keeps rows with S1 == S3. The second step, filter_at, keeps rows where every column other than S1, S3, and count differs from S1 (which equals S3 after the first filter).
library(dplyr)
df2 <- df %>%
filter(S1 == S3) %>%
filter_at(vars(-S1, -S3, -count), all_vars(. != S1))
df2
count S1 S2 S3 S4 S5
1 10 1 2 1 3 3
2 1 1 2 1 2 3
Then the total count is as follows.
sum(df2$count)
[1] 11
Using dplyr, rowwise, filter :
library(dplyr)
df %>%
rowwise() %>%
filter(S1==S3 & !S1 %in% c(S2,S4,S5)) %>%
pull(count) %>%
sum()
# [1] 11
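The same two-step logic can be written without rowwise or filter_at by using if_all, which is available in dplyr 1.0.4 and later. This is a sketch under that version assumption, not a replacement for the answers above; the name `total` is made up.

```r
library(dplyr)

df <- data.frame(count = c(10, 5, 3, 1, 1, 1),
                 S1 = c(1, 2, 1, 1, 3, 1),
                 S2 = c(2, 1, 3, 2, 1, 1),
                 S3 = c(1, 2, 2, 1, 3, 1),
                 S4 = c(3, 3, 4, 2, 3, 1),
                 S5 = c(3, 2, 5, 3, 2, 2))

total <- df %>%
  filter(S1 == S3,                                  # the pair must agree
         if_all(-c(count, S1, S3), ~ .x != S1)) %>% # every other column must differ
  pull(count) %>%
  sum()

total
```

Because the other columns are selected by exclusion, the call does not need to change when more S columns are added.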

dplyr filter with condition on multiple columns

I'd like to remove rows corresponding to a particular combination of variables from my data frame.
Here's some dummy data:
father<- c(1, 1, 1, 1, 1)
mother<- c(1, 1, 1, NA, NA)
children <- c(NA, NA, 2, 5, 2)
cousins <- c(NA, 5, 1, 1, 4)
dataset <- data.frame(father, mother, children, cousins)
dataset
father mother children cousins
1 1 NA NA
1 1 NA 5
1 1 2 1
1 NA 5 1
1 NA 2 4
I want to filter this row :
father mother children cousins
1 1 NA NA
I can do it with :
test <- dataset %>%
filter(father==1 & mother==1) %>%
filter (is.na(children)) %>%
filter (is.na(cousins))
test
My question :
I have many columns like grand father, uncle1, uncle2, uncle3 and I want to avoid something like that:
filter (is.na(children)) %>%
filter (is.na(cousins)) %>%
filter (is.na(uncle1)) %>%
filter (is.na(uncle2)) %>%
filter (is.na(uncle3))
and so on...
How can I tell dplyr to filter on NA in all the columns (except father and mother, which should equal 1)?
A possible dplyr(0.5.0.9004 <= version < 1.0) solution is:
# > packageVersion('dplyr')
# [1] ‘0.5.0.9004’
dataset %>%
filter(!is.na(father), !is.na(mother)) %>%
filter_at(vars(-father, -mother), all_vars(is.na(.)))
Explanation:
vars(-father, -mother): select all columns except father and mother.
all_vars(is.na(.)): keep rows where is.na is TRUE for all the selected columns.
note: any_vars should be used instead of all_vars if rows where is.na is TRUE for any column are to be kept.
Update (2020-11-28)
As the _at functions and vars have been superseded by the use of across since dplyr 1.0, the following way (or similar) is recommended now:
dataset %>%
filter(across(c(father, mother), ~ !is.na(.x))) %>%
filter(across(c(-father, -mother), is.na))
See more examples of across, and how to rewrite previous code with the new approach, here: Column-wise operations, or type vignette("colwise") in R after installing the latest version of dplyr.
dplyr >= 1.0.4
If you're using dplyr version >= 1.0.4 you really should use if_any or if_all, which specifically combines the results of the predicate function into a single logical vector making it very useful in filter. The syntax is identical to across, but these verbs were added to help fill this need: if_any/if_all.
library(dplyr)
dataset %>%
filter(if_all(-c(father, mother), ~ is.na(.)), if_all(c(father, mother), ~ !is.na(.)))
Here I have written out the variable names, but you can use any tidy selection helper to specify variables (e.g., column ranges by name or location, regular expression matching, substring matching, starts with/ends with, etc.).
Output
father mother children cousins
1 1 1 NA NA
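For contrast, if_any keeps rows where the predicate holds for at least one of the selected columns, which is the dplyr >= 1.0.4 counterpart of the any_vars variant mentioned earlier. A small sketch on the question's data; the name `res_any` is made up.

```r
library(dplyr)

dataset <- data.frame(father   = c(1, 1, 1, 1, 1),
                      mother   = c(1, 1, 1, NA, NA),
                      children = c(NA, NA, 2, 5, 2),
                      cousins  = c(NA, 5, 1, 1, 4))

# rows where children OR cousins (or both) are NA
res_any <- dataset %>%
  filter(if_any(-c(father, mother), is.na))

res_any
```

Here the first two rows are kept, since both have `children` missing, whereas if_all keeps only the first.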
None of the answers seems to be an adaptable solution. I think the intention is not to list all the variables and values to filter the data.
One easy way to achieve this is through merging. If you have all the conditions in df_filter then you can do this:
df_results = df_filter %>% left_join(df_all)
A dplyr solution:
test <- dataset %>%
filter(father==1 & mother==1 & rowSums(is.na(.[,3:4]))==2)
Where '2' is the number of columns that should be NA.
This gives:
> test
father mother children cousins
1 1 1 NA NA
You can apply this logic in base R as well:
dataset[dataset$father==1 & dataset$mother==1 & rowSums(is.na(dataset[,3:4]))==2,]
Here is a base R method using two Reduce functions and [ to subset.
keepers <- Reduce(function(x, y) x == 1 & y == 1, dataset[, 1:2]) &
Reduce(function(x, y) is.na(x) & is.na(y), dataset[, 3:4])
keepers
[1] TRUE FALSE FALSE FALSE FALSE
Each Reduce takes the columns provided and combines them pairwise with a logical check; the two results are connected with an &. The second argument to each Reduce can be changed to select other pairs of columns. With more than two columns the accumulated value is already a logical vector rather than a column, so the checking function would need to be adapted.
Then use the logical vector to subset
dataset[keepers,]
father mother children cousins
1 1 1 NA NA
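To extend the Reduce idea to any number of columns, one option is to compute the per-column check first with lapply and then reduce the resulting logical vectors with `&`. This is a sketch of that variation rather than the answer's original code; the names `one_cols` and `na_cols` are invented for illustration.

```r
dataset <- data.frame(father   = c(1, 1, 1, 1, 1),
                      mother   = c(1, 1, 1, NA, NA),
                      children = c(NA, NA, 2, 5, 2),
                      cousins  = c(NA, 5, 1, 1, 4))

one_cols <- c("father", "mother")     # hypothetical: columns that must equal 1
na_cols  <- c("children", "cousins")  # hypothetical: columns that must be NA

# per-column checks, then AND across columns
all_one <- Reduce(`&`, lapply(dataset[one_cols], function(col) !is.na(col) & col == 1))
all_na  <- Reduce(`&`, lapply(dataset[na_cols], is.na))

keepers <- all_one & all_na
dataset[keepers, ]
```

Adding more columns only requires extending the two name vectors.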
This answer builds on @Feng Jiang's answer using dplyr::left_join(), and is more like a reprex. In addition, it restores the original column order in case the order of variables in df_filter differs from the order in the original dataset. Also, the dataset was expanded with a duplicate combination to show that duplicates are part of the filtered output (test).
library(dplyr)
father<- c(1, 1, 1, 1, 1,1)
mother<- c(1, 1, 1, NA, NA,1)
children <- c(NA, NA, 2, 5, 2,NA)
cousins <- c(NA, 5, 1, 1, 4,NA)
dataset <- data.frame(father, mother, children, cousins)
df_filter <- data.frame( father = 1, mother = 1, children = NA, cousins = NA)
test <- df_filter %>%
left_join(dataset) %>%
relocate(colnames(dataset))

Resources