grepl in multiple columns in R

I'm trying to do a string search and replace across multiple columns in R. My code:
# Get columns of interest
selected_columns <- c(368,370,372,374,376,378,380,382,384,386,388,390,392,394)
#Perform grepl across multiple columns
df[,selected_columns][grepl('apples',df[,selected_columns],ignore.case = TRUE)] <- 'category1'
However, I'm getting the error:
Error: undefined columns selected
Thanks in advance.

grep/grepl works on vectors/matrices and not on data.frames/lists. According to `?grep`:
x - a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
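To see what that means in practice, here is a quick illustration on a small made-up data frame (the names toy, a and b are only for demonstration):
toy <- data.frame(a = c("Apples", "pears"),
                  b = c("apple pie", "kiwi"), stringsAsFactors = FALSE)
grepl("apples", toy, ignore.case = TRUE)
# returns one logical per column (toy is first coerced with as.character), not one per cell
grepl("apples", toy$a, ignore.case = TRUE)
#[1]  TRUE FALSE  -- a single column is a character vector, so this works as expected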
We can loop over the columns (lapply) and replace the values based on the match
df[, selected_columns] <- lapply(df[, selected_columns],
function(x) replace(x, grepl('apples', x, ignore.case = TRUE), 'category1'))
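A minimal sketch of the same fix on that toy data frame (with the real data, selected_columns would take the place of cols):
cols <- c("a", "b")
toy[cols] <- lapply(toy[cols],
    function(x) replace(x, grepl("apples", x, ignore.case = TRUE), "category1"))
toy$a
#[1] "category1" "pears"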
Or with dplyr
library(dplyr)
library(stringr)
df %>%
mutate_at(selected_columns, ~ replace(., str_detect(., 'apples'), 'category1'))
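One small difference from the base R version: str_detect is case-sensitive by default, so it will not match "Apples" the way ignore.case = TRUE did; wrapping the pattern in stringr's regex() modifier restores case-insensitive matching:
df %>%
mutate_at(selected_columns, ~ replace(., str_detect(., regex('apples', ignore_case = TRUE)), 'category1'))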

Assuming you want to partially match a cell and replace it, you could use rapply() and replace cell contents that have "apples" with "category1" using gsub():
df[selected_columns] <- rapply(df[selected_columns], function(x) gsub("apples", "category1", x), how = "replace")
Just keep in mind the difference between grepl()/gsub() (with and without boundaries in your regex), and %in%/match() when searching for strings.
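For example, the three behave quite differently on a small made-up vector:
v <- c("apples and pears", "crabapples", "apples")
grepl("apples", v)        #[1]  TRUE  TRUE  TRUE  (substring match)
grepl("\\bapples\\b", v)  #[1]  TRUE FALSE  TRUE  (match with word boundaries)
v %in% "apples"           #[1] FALSE FALSE  TRUE  (exact match of the whole string)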

Related

In R, how do I split each string in a vector to return everything before the Nth instance of a character?

Example:
df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306", "H*_39_004_204_99_04902"))
I would like to be able to select everything before the third underscore for each row in the dataframe. I understand how to split the string apart:
df$New <- sapply(strsplit((df$Name),"_"), `[`)
But this places a list in each row. I've thus far been unable to figure out how to use sapply to unlist() each row of df$New and select the first N elements of the list to paste/collapse them back together. Because the length of each subelement can be distinct, and the number of subelements can also be distinct, I haven't been able to figure out an alternative way of getting this info.
We specify 'n' and, after splitting the character column by '_', extract the first n-1 components:
n <- 4
lapply(strsplit(as.character(df$Name), "_"), `[`, seq_len(n - 1))
If we need to paste it together, we can use an anonymous function call (function(x)) after looping over the list with lapply/sapply, get the first n - 1 elements with head, and paste them together:
sapply(strsplit(as.character(df$Name), "_"), function(x)
paste(head(x, n - 1), collapse="_"))
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or use a regex method:
sub("^([^_]+_[^_]+_[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or if 'n' is really large, the pattern can be built with sprintf (generalising the fixed pattern shown next):
pat <- sprintf("^(([^_]+_){%d}[^_]+)_.*", n - 2)
sub(pat, "\\1", df$Name)
Or
sub("^(([^_]+_){2}[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"

Problem with str_replace in many columns and a for loop

I'm a noob in R.
I'm working with a dataframe of about 150 rows and 21 columns. In columns 5 through 20, I want to change the character "-" to "0.00".
I'm using this code and it works individually:
datos$max52sem<-str_replace(datos$max52sem,"-","0.00")
datos$min52sem<-str_replace(datos$min52sem,"-","0.00")
I'm trying to use a for loop to change all the columns at once, instead of writing out all my variables' names 15 times.
This is what I'm writing:
mis_vars<-c("max52sem","min52sem","cierre_prev","cierre_hoy","max_hoy","min_hoy","ret_hoy","ret_sem","ret_mes","ret_año","ret_ytd","vol","upa","vla","pvla","pu")
for(x in mis_vars)
datos$x<-str_replace(datos$x,"-","0")
"mis_vars" are the names of my columns (variables) I want to change in my dataframe, but I get this answer from R and I don't know what I'm doing wrong.
Error in `$<-.data.frame`(`*tmp*`, "x", value = character(0)) :
  replacement has 0 rows, data has 1220
With dplyr, we can use mutate_at
library(dplyr)
library(stringr)
datos <- datos %>%
mutate_at(vars(mis_vars), ~ str_replace(., "-", "0"))
In the OP's for loop, instead of datos$x it should be datos[[x]] (on both sides of the assignment), because datos$x refers to a column literally named 'x' rather than the column named by the loop variable from 'mis_vars'.
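For reference, a sketch of the corrected loop with that change applied:
library(stringr)
for (x in mis_vars) {
    datos[[x]] <- str_replace(datos[[x]], "-", "0")
}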
Or using only base R
datos[mis_vars] <- lapply(datos[mis_vars], sub, pattern = "-", replacement = "0")
In base R, we can use lapply to change multiple columns
datos[mis_vars] <- lapply(datos[mis_vars], function(x) sub("-", "0.00", x))

add running counter for semi-consecutive strings in vector

I would like to add a number indicating the x-th occurrence of a word in a vector. (So this question is different from "Make a column with duplicated values unique in a dataframe", because I have a simple vector and want to avoid the overhead of casting it to a data.frame.)
E.g. for the vector:
book, ship, umbrella, book, ship, ship
the output would be:
book, ship, umbrella, book2, ship2, ship3
I have solved this myself by converting the vector to a dataframe and then using a grouping function, but that feels like using a sledgehammer to crack a nut:
# add consecutive number for equal string
library(dplyr)
words <- c("book", "ship", "umbrella", "book", "ship", "ship")
# transpose word vector to data.frame for grouping
df <- data.frame(words = words)
df <- df %>% group_by(words) %>% mutate(seqN = row_number())
# combine columns and remove '1' for first occurrence
wordsVec <- paste0(df$words, df$seqN)
gsub("1", "", wordsVec)
# [1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Is there a more clean solution, e.g. using the stringr package?
You can still use row_number() from dplyr, but you don't need to convert to a data frame, i.e.
library(dplyr)
sub('1$', '', ave(words, words, FUN = function(i) paste0(i, row_number(i))))
#[1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Another option is to use make.unique along with gsubfn to increment your values by 1, i.e.
library(gsubfn)
gsubfn("\\d+", function(x) as.numeric(x) + 1, make.unique(words))
#[1] "book" "ship" "umbrella" "book.2" "ship.2" "ship.3"

Trimming data frame in R with grep?

My dataframe, dat, has two columns which look like this:
value condition
2 learning/cat
4 learning/dog
1 naming/cat
6 naming/dog
I would like to 'trim' the data frame to only include rows in which condition contains "naming".
I've tried to do this with grep:
dat = dat[grep("naming", dat$condition, value = T)]
which causes the following error:
Error in `[.data.frame`(dat, grep("naming", dat$condition, value = T)) :
undefined columns selected
Can anyone suggest a fix? Any help would be greatly appreciated!
You can split up condition using separate from tidyr (here input_df stands for the OP's dat):
library(dplyr)
library(tidyr)
df = input_df %>% separate(condition, into = c("condition1", "condition2"), sep = "/")
Then just use filter:
only_naming_df = df %>% filter(condition1 == "naming")
The error is easy to fix once a comma is added after the grep() call, so that rows rather than columns are subset. But I want to have a list of the available options for achieving this task. Below are solutions and comments from others and from me.
Use grep or grepl
grep returns the index (row number), while grepl returns a logical vector (TRUE or FALSE). Notice that when using grep in this case, value = T should not be added, because it would return the matching strings themselves, which are not helpful for subsetting rows.
dat[grep("naming", dat$condition), ]
dat[grepl("naming", dat$condition), ]
Functions from dplyr and stringr
str_detect is equivalent to grepl(pattern, x), while str_which is equivalent to grep(pattern, x).
library(dplyr)
library(stringr)
dat %>% filter(str_detect(condition, "naming"))
dat %>% slice(str_which(condition, "naming"))
Data Preparation
# Create example dataframe
dat <- read.table(text = "value condition
2 learning/cat
4 learning/dog
1 naming/cat
6 naming/dog",
header = TRUE, stringsAsFactors = FALSE)

purrr::discard: How to delete elements in a vector containing one or more specific strings

I would like to remove the elements containing '_1' and '_3' in the vector using the discard function from purrr. Here the example:
library(purrr)
x <- c("ABAC_13", "ZDRF73", "UYDS_12", "FGSH41", "GFSC_35" , "JHSC_29")
With discard we need to provide a logical vector indicating which values to discard.
To create that logical vector we use grepl, which returns TRUE for the elements that contain '_1' or '_3':
library(purrr)
discard(x, grepl("_1|_3", x))
#[1] "ZDRF73" "FGSH41" "JHSC_29"
And as @Lazarus Thurston commented, using str_subset should be a better choice here:
library(stringr)
str_subset(x, '_(1|3)', negate = TRUE)
As this is specific to the tidyverse, we can use syntax specific to it:
library(tidyverse)
str_detect(x, "_[13]") %>%
discard(x, .)
#[1] "ZDRF73" "FGSH41" "JHSC_29"
If we need to remove all the elements that have an underscore followed by digits:
grep("_\\d+", x, invert = TRUE, value = TRUE)
#[1] "ZDRF73" "FGSH41"
Or if it is specific to 1 and 3:
grep("_[13]", x, invert = TRUE, value = TRUE)
#[1] "ZDRF73" "FGSH41" "JHSC_29"
Or if we only need to remove the substring part (the underscore plus digits) rather than drop the elements:
sub("_\\d+", '', x)
This task can be performed using grepl(). Basically we want to find the occurrences that contain _1 or _3. The grepl output is a logical vector of TRUE/FALSE values. Following that, we remove those elements from the x vector by subsetting with the negation operator, i.e. x[!grepl("_1|_3", x)].
x <- c("ABAC_13", "ZDRF73", "UYDS_12", "FGSH41", "GFSC_35" , "JHSC_29")
x[!grepl("_1|_3", x)]
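which keeps only the elements without '_1' or '_3':
#[1] "ZDRF73" "FGSH41" "JHSC_29"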
