Use an external list to remove data from rows - r

I have a data frame
df <- data.frame(
A = c(4, 2, 7),
B = c(3, 3, 5),
C = c("Expert,Foo", "Bar,Wild", "Zap")
)
and a second one that I would like to use as an index to remove the rows that contain specific values:
mylist <- data.frame(rtext = c("Foo","Bar"))
So I tried this:
subset(df, C %in% mylist$rtext)
How can I remove the specific rows?

Because this is a partial match, we can use grepl. We paste the elements of the 'rtext' column of 'mylist' into a single string with | as the delimiter (which means OR in a regex), get a logical index with grepl on the 'C' column of 'df', and negate it (!) so that TRUE becomes FALSE and vice versa. The subset then keeps only the rows that do not contain any of the values in 'rtext'.
subset(df, !grepl(paste(mylist$rtext, collapse="|"), C))
# A B C
#3 7 5 Zap
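To make the intermediate steps visible, here is a minimal sketch that builds the pattern and the logical index separately. The names `pattern`, `hit` and `result` are illustrative, not part of the original answer.

```r
df <- data.frame(
  A = c(4, 2, 7),
  B = c(3, 3, 5),
  C = c("Expert,Foo", "Bar,Wild", "Zap"),
  stringsAsFactors = FALSE
)
mylist <- data.frame(rtext = c("Foo", "Bar"), stringsAsFactors = FALSE)

# Collapse the lookup values into one alternation pattern
pattern <- paste(mylist$rtext, collapse = "|")
pattern
# [1] "Foo|Bar"

# TRUE where C contains any lookup value as a substring
hit <- grepl(pattern, df$C)
hit
# [1]  TRUE  TRUE FALSE

# Keep only the rows without a hit
result <- df[!hit, ]
```

Note that the lookup values are treated as regular expressions here, so values containing characters such as `.` or `(` would need escaping first.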

Using str_detect from stringr
df[!stringr::str_detect(df$C,paste(mylist$rtext,collapse = '|')),]
A B C
3 7 5 Zap
If you need an exact match (so that a value such as Foooo is not removed), reshape your df with dplyr and tidyr first. Because str_detect and grepl do partial matching, an entry like Expert,Foott would still count as a match for Foo.
library(tidyr)
library(dplyr)
df$id=seq.int(nrow(df))
df1=df %>%
transform(C = strsplit(C, ",")) %>%
unnest(C)
df[!df$id%in%df1$id[df1$C%in%mylist$rtext],]
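The same exact-match idea can also be sketched in base R, without reshaping the data frame. This is an assumption-laden illustration rather than the answerer's method: the third row is changed to "Zap,Foott" to show that a near-match is kept, and `tokens` and `has_exact` are hypothetical names.

```r
df <- data.frame(
  A = c(4, 2, 7),
  B = c(3, 3, 5),
  C = c("Expert,Foo", "Bar,Wild", "Zap,Foott"),  # "Foott" added for illustration
  stringsAsFactors = FALSE
)
mylist <- data.frame(rtext = c("Foo", "Bar"), stringsAsFactors = FALSE)

# Split each entry of C into its comma-separated tokens
tokens <- strsplit(df$C, ",", fixed = TRUE)

# TRUE when at least one token is exactly equal to a lookup value
has_exact <- vapply(tokens, function(tok) any(tok %in% mylist$rtext), logical(1))

# Rows 1 and 2 are dropped; "Zap,Foott" stays because "Foott" != "Foo"
result <- df[!has_exact, ]
```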

Related

selecting columns based on exact string

df1 <- data.frame(x1_modhigh_2020 = 1,
x2_modhigh_2030 = 1,
x1_low_2020 = 1,
x2_low_2030 = 1,
x1_high_2020 = 1,
x2_high_2030 = 1)
In a for-loop I want to select columns based on whether they contain 'low', 'modhigh' or 'high' and do some operations on them. My method of selecting columns is:
library(dplyr)
df1 %>% dplyr::select(contains("low")) # this works
df1 %>% dplyr::select(contains("modhigh")) # this works
df1 %>% dplyr::select(contains("high")) # does not work: this also selects `modhigh`
How can I modify the selection of high so that modhigh is not selected as well?
Using matches you can use regex syntax (rather than contains, which does not allow the use of regex), here for example the pipe |, which is a regex metacharacter signifying alternation:
df1 %>%
select(matches("_high|low"))
x1_low_2020 x2_low_2030 x1_high_2020 x2_high_2030
1 1 1 1 1
I would also use the matches selection helper proposed by @Chris, but if you are interested in alternatives:
# dplyr
dplyr::select(df1, grep("_high|low", colnames(df1)))
# base R
df1[, grep("_high|low", colnames(df1))]
Both result in
x1_low_2020 x2_low_2030 x1_high_2020 x2_high_2030
1 1 1 1 1
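Since the question mentions a for-loop over 'low', 'modhigh' and 'high', one possible sketch is to wrap each level in underscores so that "_high_" cannot match inside "_modhigh_". This assumes every column name has the form prefix_level_year, as in the example; `levels_wanted` and `selected` are hypothetical names, and base R is used so the block runs without extra packages.

```r
df1 <- data.frame(x1_modhigh_2020 = 1,
                  x2_modhigh_2030 = 1,
                  x1_low_2020 = 1,
                  x2_low_2030 = 1,
                  x1_high_2020 = 1,
                  x2_high_2030 = 1)

levels_wanted <- c("low", "modhigh", "high")

# One data frame per level; fixed = TRUE treats the pattern as a literal string
selected <- lapply(levels_wanted, function(lv) {
  keep <- grepl(paste0("_", lv, "_"), names(df1), fixed = TRUE)
  df1[, keep, drop = FALSE]
})
names(selected) <- levels_wanted

names(selected$high)
# [1] "x1_high_2020" "x2_high_2030"
```

With dplyr, the equivalent selection for a single level would be `select(df1, matches("_high_"))`.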

Create a Column with Unique values from Lists Columns

I have a dataset in RStudio in which some columns contain lists. Here is an example where columns "a" and "c" hold a list in each row.
What am I looking for?
I need to create a new column that collects the unique values from columns a, b and c and skips NA or NULL values.
The expected result is the column "desired_result".
test <- tibble(a = list(c("x1","x2"), c("x1","x3"),"x3"),
b = c("x1", NA,NA),
c = list(c("x1","x4"),"x4","x2"),
desired_result = list(c("x1","x2","x4"),c("x1","x3","x4"),c("x2","x3")))
What have I tried so far?
I tried the following, but it does not produce the expected result shown in column "desired_result":
test$attempt_1_ <-lapply(apply((test[, c("a","b","c"), drop = T]),
MARGIN = 1, FUN= c, use.names= FALSE),unique)
We may use pmap to loop over each of the corresponding elements of 'a' to 'c', remove the NA (na.omit) and get the unique values to store as a list in 'desired_result'
library(dplyr)
library(purrr)
test <- test %>%
mutate(desired_result2 = pmap(across(a:c), ~ sort(unique(na.omit(c(...))))))
Checking against the OP's expected output:
> all.equal(test$desired_result, test$desired_result2)
[1] TRUE
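For readers who prefer to avoid purrr, the row-wise combination can be sketched in base R with Map. This is an alternative under the assumption that the three columns are available as plain vectors or lists; `c_col` and `combined` are illustrative names (`c_col` avoids shadowing the function `c`).

```r
a <- list(c("x1", "x2"), c("x1", "x3"), "x3")
b <- c("x1", NA, NA)
c_col <- list(c("x1", "x4"), "x4", "x2")

# Walk the three inputs in parallel, drop NA, keep sorted unique values
combined <- Map(function(x, y, z) {
  vals <- c(x, y, z)
  sort(unique(vals[!is.na(vals)]))
}, a, b, c_col)

combined[[3]]
# [1] "x2" "x3"
```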

Check if string contains anything other than items in vector [R]

I have a dataframe containing a column of strings. I want to check whether any of the elements in each string match any of the elements in one or more predefined vectors, and then return a new logical column. This is easily accomplished using grepl().
However (and this is the part I need help with), I also want to check whether the strings contain any elements other than those contained in the keyword vectors.
Example data:
matchvector1 <- c("Apple","Banana","Orange")
matchvector2 <- c("Strawberry","Kiwi","Grapefruit")
id <- c(1,2,3)
string_column <- c(paste0(c("Apple","Banana"),collapse=", "), paste0(c("Strawberry","Kiwi"), collapse = ", "), paste0(c("Apple","Pineapple"), collapse = ", "))
df <- data.frame(id, string_column)
df$string_column <- as.character(df$string_column)
matches_vector1 <- grepl(paste(matchvector1, collapse = "|"), df$string_column)
matches_vector2 <- grepl(paste(matchvector2, collapse = "|"), df$string_column)
The output should look something like:
matches_vector1: TRUE FALSE TRUE
matches_vector2: FALSE TRUE FALSE
unmatched_words: FALSE FALSE TRUE
I'm stuck on this last part. Is there an easy way to match on anything except something in a list of keywords using grepl() (or another function)? I suspect it will involve using negative lookaround somehow but the few existing threads on this didn't seem to answer my question.
One option is to split 'string_column' into one row per word with separate_rows, group by 'id', and check whether any word in each group is not %in% the concatenated keyword vectors:
library(dplyr)
library(tidyr)
df %>%
separate_rows(string_column) %>%
group_by(id) %>%
summarise(unmatched = any(!string_column %in% c(matchvector1, matchvector2)) )
# A tibble: 3 x 2
# id unmatched
#* <dbl> <lgl>
#1 1 FALSE
#2 2 FALSE
#3 3 TRUE
or in base R
# lapply always returns a list, so lengths() yields one count per row
lengths(lapply(strsplit(df$string_column, ",\\s*"),
setdiff, c(matchvector1, matchvector2))) > 0
#[1] FALSE FALSE TRUE
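Putting the three requested columns together, a base R sketch might look like the following. It matches whole words after splitting rather than substrings, which is an assumption about what the asker wants; `tokens`, `in_v1`, `in_v2` and `unmatched` are hypothetical names.

```r
matchvector1 <- c("Apple", "Banana", "Orange")
matchvector2 <- c("Strawberry", "Kiwi", "Grapefruit")
string_column <- c("Apple, Banana", "Strawberry, Kiwi", "Apple, Pineapple")

# One character vector of words per row
tokens <- strsplit(string_column, ",\\s*")

# Does the row contain at least one word from each keyword vector?
in_v1 <- vapply(tokens, function(tok) any(tok %in% matchvector1), logical(1))
in_v2 <- vapply(tokens, function(tok) any(tok %in% matchvector2), logical(1))

# Does the row contain any word that is in neither keyword vector?
unmatched <- vapply(tokens, function(tok) {
  length(setdiff(tok, c(matchvector1, matchvector2))) > 0
}, logical(1))

data.frame(in_v1, in_v2, unmatched)
```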

Filter dataset with %in% with pattern

I'm applying filter to my dataset to select certain values from a column:
%>%
filter(col1 %in% c("value1", "value2"))
However, I don't understand how to filter the column by a pattern without writing out every value in full. For example, I also want all values that start with "value3" ("value33", "value34", ...) along with "value1" and "value2". Can I add grepl to that vector?
You can use regular expressions to do that, passing the column as the first argument to str_detect:
df %>%
filter(stringr::str_detect(col1, '^value[1-3]'))
If you want to use another tidyverse package to help, you can use str_starts from stringr to find strings that start with a certain value
dd %>% filter(stringr::str_starts(col1, "value"))
Here are a few options in base R:
Using grepl :
subset(df, grepl('^value', b))
# a b
#1 1 value3
#3 3 value123
#4 4 value12
A similar option uses grep, which returns the indices of the matches.
df[grep('^value', df$b),]
However, a faster option would be to use startsWith
subset(df, startsWith(b, "value"))
All of these select the rows where column b starts with "value".
data
df <- data.frame(a = 1:5, b = c('value3', 'abcd', 'value123', 'value12', 'def'),
stringsAsFactors = FALSE)
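To combine the exact values with a prefix, as the question asks, the two conditions can be joined with |. The following is a minimal base R sketch; the data frame `dat` and its contents are made up for illustration.

```r
# Hypothetical data: exact values, prefixed values, and non-matches
dat <- data.frame(
  col1 = c("value1", "value2", "value33", "value34", "value12", "other"),
  stringsAsFactors = FALSE
)

# Exact match on value1/value2 OR prefix match on value3
keep <- dat$col1 %in% c("value1", "value2") | startsWith(dat$col1, "value3")

result <- dat[keep, , drop = FALSE]
result$col1
# [1] "value1"  "value2"  "value33" "value34"
```

Inside a dplyr pipeline the same condition would be written as `filter(col1 %in% c("value1", "value2") | startsWith(col1, "value3"))`. Unlike the regex '^value[1-3]', this does not keep "value12".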

How do I change all the character values of a column that starts with specific characters?

I have a dataset with millions of observations.
One of the columns of this dataset uses 4 or 5 characters to classify these observations.
My goal is to merge this classification into smaller groups. For example, I want to replace all the values in the column that START with "AA" (e.g., "AABC" or "AAUCC") with just "A". How can I do this?
To illustrate:
Considering that my data is labeled "f2016" and the column that I'm interested in is "SECT16", I've been using the following code to replace values:
f2016$SECT16[f2016$SECT16 == "AABB"] <- "A"
But I cannot do this to all combinations of letters that I have in the dataset. Is there a way that I can do the same replacement holding the first two letters constant?
Here is another base R solution:
# Assign to the column only, so other columns in the matching rows are left untouched
f2016$SECT16[startsWith(f2016$SECT16, "AA")] <- "A"
# SECT16
# 1 A
# 2 A
# 3 ABBBBC
# 4 DDDDE
# 5 BABA
This replaces the strings that start with the specified prefix, in this case AA. An excerpt from help(startsWith):
startsWith() is equivalent to but much faster than
substring(x, 1, nchar(prefix)) == prefix
or also
grepl("^<prefix>", x)
where prefix is not to contain special regular expression characters.
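The caveat about special characters can be illustrated with a prefix that contains a dot. This is a small sketch with made-up values; `x` and `prefix` are illustrative names.

```r
x <- c("A.BC", "AXBC", "A.")
prefix <- "A."

# startsWith() and substring() compare the prefix literally
lit <- startsWith(x, prefix)
sub <- substring(x, 1, nchar(prefix)) == prefix
lit
# [1]  TRUE FALSE  TRUE

# In a regex the dot matches any character, so "AXBC" matches as well
rx <- grepl(paste0("^", prefix), x)
rx
# [1] TRUE TRUE TRUE
```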
Data
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"), stringsAsFactors = FALSE)
We can use grep/grepl
f2016$SECT16[grep("^AA", f2016$SECT16)] <- "A"
#f2016$SECT16[grepl("^AA", f2016$SECT16)] <- "A"
Consider this dataset
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)
df
# A
#1 ABCD
#2 AACD
#3 DASDD
#4 AABB
df$A[grep("^AA", df$A)] <- "A"
df
# A
#1 ABCD
#2 A
#3 DASDD
#4 A
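If several prefixes need to be merged into groups, one possible sketch is a named lookup vector that is applied in a loop. The mapping in `prefix_map` and the example codes are hypothetical; the prefixes are tested against an untouched copy so that an earlier replacement cannot trigger a later one.

```r
sect <- c("AABC", "AAUCC", "BBXY", "CDEF")

# Hypothetical mapping: names are prefixes, values are the merged group labels
prefix_map <- c(AA = "A", BB = "B")

# Test prefixes against the original values, not the partially replaced ones
original <- sect
for (p in names(prefix_map)) {
  sect[startsWith(original, p)] <- prefix_map[[p]]
}

sect
# [1] "A"    "A"    "B"    "CDEF"
```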
You can use stringr and dplyr.
Modify all columns:
df <- df %>% mutate_all(function(x) stringr::str_replace(x, "^AA.+", "A"))
Modify specific columns:
df <- df %>% mutate_at(1, function(x) stringr::str_replace(x, "^AA.+", "A"))
Data
df <- data.frame(SECT16 = c("AABC", "AABB"),
SECT17 = c("AADD", "AAEE"))
