I am trying to remove a pattern (to_remove) from another string column (entry) inside mutate().
The problem is that both my string and pattern columns contain some empty strings, so vectorized functions such as stringr::str_remove() emit warnings and slow down considerably.
Without the empty strings and patterns (i.e. if I replace them with some value), about 1e5 rows complete in under 1 second. With the warnings, the same job takes over 10 seconds.
Is there any way to use stringr::str_remove() inside mutate() while skipping those empty rows, so that I still get the speed benefit of vectorization?
Note that I could also use dplyr::rowwise() + gsub(), but rowwise() slows things down a lot as well :(
Example code:
library(tidyverse)
library(stringr)
set.seed(123)
temp <- data.frame(
  entry = c('A12', 'JW13', 'C', ''),
  to_remove = c('A', 'W', '', 'D')
) %>%
  sample_n(1e5, replace = TRUE)
temp <- temp %>%
  mutate(
    removed = str_remove(entry, to_remove)
  )
Try replacing the blank values with NA:
library(dplyr)
library(stringr)
temp %>%
mutate(to_remove = na_if(to_remove, ''),
removed = str_remove(entry,to_remove))
We can also convert the empty patterns to NA inline with replace():
library(dplyr)
library(stringr)
temp %>%
mutate(removed = str_remove(entry, replace(to_remove, to_remove == "", NA)))
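If rows with an empty pattern should simply keep their original entry, one option is to call str_remove() only on the rows that have a usable pattern. The following is a minimal sketch under that assumption; remove_nonempty is a hypothetical helper name, not part of stringr.

```r
library(stringr)

# Hypothetical helper: apply str_remove() only where the pattern is a
# non-empty, non-NA string, and leave every other entry untouched.
remove_nonempty <- function(entry, pattern) {
  out <- entry
  ok <- !is.na(pattern) & pattern != ""
  out[ok] <- str_remove(entry[ok], pattern[ok])
  out
}

entry <- c("A12", "JW13", "C", "")
to_remove <- c("A", "W", "", "D")

removed <- remove_nonempty(entry, to_remove)
removed
# "12" "J13" "C" ""
```

Inside a pipeline this would read mutate(removed = remove_nonempty(entry, to_remove)). The call stays vectorized because str_remove() still receives whole vectors, just shorter ones.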
Related
I have a data frame with a string column "disease". I want to filter the rows that partially match "trauma" or "Trauma". So far I have done the following using dplyr and stringr:
trauma_set <- df %>% filter(str_detect(disease, "trauma|Trauma"))
But the result also includes "Nontraumatic" and "nontraumatic". How can I filter only "trauma, Trauma, traumatic or Traumatic" without including nontrauma or Nontrauma? Also, is there a way I can define the string to detect without having to specify both uppercase and lowercase version of the string (as in both trauma and Trauma)?
To require a word boundary, put \\b at the start of the pattern. To match regardless of case, wrap the pattern in the regex() modifier with ignore_case = TRUE:
library(dplyr)
library(stringr)
out <- df %>%
filter(str_detect(disease, regex("\\btrauma", ignore_case = TRUE)))
sum(str_detect(out$disease, regex("^Non", ignore_case = TRUE)))
#[1] 0
data
set.seed(24)
df <- data.frame(disease = sample(c("Nontraumatic", "Trauma",
"Traumatic", "nontraumatic", "traumatic", "trauma"), 50 ,
replace = TRUE), value = rnorm (50))
You were very close to a correct solution; you just needed to add the "start of string" anchor ^. Note that this only matches values that begin with trauma or Trauma:
trauma_set <- df %>% filter(str_detect(disease, "^trauma|^Trauma"))
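The two answers behave differently when "trauma" appears later in the string. Here is a small sketch of the difference, assuming the column could contain a multi-word value; "Head trauma" is a made-up entry added only for illustration.

```r
library(stringr)

# "Head trauma" is a made-up value added for illustration
disease <- c("Trauma", "nontraumatic", "Head trauma", "Nontraumatic", "traumatic")

# Word boundary: "trauma" must start a word, anywhere in the string
at_boundary <- str_detect(disease, regex("\\btrauma", ignore_case = TRUE))

# Start anchor: the string itself must begin with "trauma"
at_start <- str_detect(disease, regex("^trauma", ignore_case = TRUE))

data.frame(disease, at_boundary, at_start)
```

Both patterns reject "nontraumatic" and "Nontraumatic"; only the word-boundary version keeps "Head trauma".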
I want to extract everything but a pattern and return the result concatenated into a single string.
I tried to combine str_extract_all with sapply and cat:
x = c("a_1","a_20","a_40","a_30","a_28")
data <- tibble(age = x)
# extracting just the first pattern is easy
data %>%
mutate(age_new = str_extract(age,"[^a_]"))
# combining str_extract_all and sapply doesn't work
data %>%
mutate(age_new = sapply(str_extract_all(x,"[^a_]"),function(x) cat(x,sep="")))
class(str_extract_all(x,"[^a_]"))
sapply(str_extract_all(x,"[^a_]"),function(x) cat(x,sep=""))
This returns NULL instead of the concatenated patterns.
Instead of cat, we can use paste: cat only prints its arguments and returns NULL, whereas paste returns a string. With the tidyverse, we can also use map_chr and str_c (stringr's counterpart to paste):
library(tidyverse)
data %>%
mutate(age_new = map_chr(str_extract_all(x, "[^a_]+"), ~ str_c(.x, collapse="")))
Using the OP's code:
data %>%
mutate(age_new = sapply(str_extract_all(x,"[^a_]"),
function(x) paste(x,collapse="")))
If the intention is to get the numbers
library(readr)
data %>%
mutate(age_new = parse_number(x))
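To see why the original attempt produced NULL, it may help to compare the return values of cat and paste directly. This is a minimal base R sketch; the variable names are chosen only for illustration.

```r
pieces <- c("2", "0")

# cat() writes "20" to the console but returns NULL (invisibly)
from_cat <- cat(pieces, sep = "")
is.null(from_cat)
# TRUE

# paste() returns the collapsed string, so it can be stored in a column
from_paste <- paste(pieces, collapse = "")
from_paste
# "20"
```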
Here is a non-tidyverse solution, using just stringr and base R.
apply(str_extract_all(column,regex_command,simplify = TRUE),1,paste,collapse="")
Setting simplify = TRUE makes str_extract_all return a character matrix, and apply then iterates over its rows. I got the idea from https://stackoverflow.com/a/4213674/8427463
Example: extract every 'r' in rownames(mtcars) and concatenate the matches for each row name
library(stringr)
apply(str_extract_all(rownames(mtcars),"r",simplify = TRUE),1,paste,collapse="")
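Applied to the vector from the question, the same idea could look like the sketch below. It also shows the intermediate matrix, whose shorter rows are padded with empty strings.

```r
library(stringr)

x <- c("a_1", "a_20", "a_40", "a_30", "a_28")

# One row per input string; rows with fewer matches are padded with ""
m <- str_extract_all(x, "[^a_]", simplify = TRUE)
dim(m)
# 5 2

# Pasting each row together makes the "" padding disappear
age_new <- apply(m, 1, paste, collapse = "")
age_new
# "1" "20" "40" "30" "28"
```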
I have two datasets that I'm trying to join. The columns I am joining by do not match exactly. In the first file the column looks like this: 00:01:54:2145, with a leading 00: on every single observation. I want to change every observation in this column to this format: 01/54/2145.
I have tried several things with the stringr package, but can't get it to work.
df1 <- df %>%
str_replace_all("00:")
I'm getting this error, but don't think that's the only problem:
argument is not an atomic vector; coercing
Thank you
library(stringr)
library(dplyr)
my_conversion <- Vectorize(function(str) {
str_replace(str, "^00:", "") %>%
str_replace_all(":", "/")
})
df <- data.frame(
a_column = 1:3, key_column = c("00:01:54:2145", "00:01:54:2145", "00:01:54:2145"))
df %>% mutate(key_column = my_conversion(key_column))
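Since str_replace() and str_replace_all() are already vectorized over their input, the same conversion can probably be written without Vectorize(). A minimal sketch, assuming the key is a plain character vector; the second key is a made-up value added for illustration.

```r
library(stringr)

# The second key is a made-up value added for illustration
key_column <- c("00:01:54:2145", "00:12:03:0007")

# Drop the leading "00:" and then turn the remaining colons into slashes
converted <- str_replace_all(str_remove(key_column, "^00:"), ":", "/")
converted
# "01/54/2145" "12/03/0007"
```

Inside a pipeline the same expression can go straight into mutate(key_column = ...). Note that the stringr functions expect a character vector (a single column), not a whole data frame.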
My dataframe, dat, has two columns which look like this:
value condition
2 learning/cat
4 learning/dog
1 naming/cat
6 naming/dog
I would like to 'trim' the data frame to only include rows in which condition contains "naming".
I've tried to do this with grep:
dat = dat[grep("naming", dat$condition, value = T)]
which causes the following error:
Error in `[.data.frame`(dat, grep("naming", dat$condition, value = T)) :
undefined columns selected
Can anyone suggest a fix? Any help would be greatly appreciated!
You can split up condition using separate from tidyr:
df = input_df %>% separate( condition, into = c("condition1", "condition2"), sep = "/")
Then just use filter:
only_naming_df = df %>% filter(condition1 == "naming")
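The error goes away once you add a comma after the closing parenthesis, so that the grep result selects rows rather than columns (and drop value = T, as explained below). I also wanted a list of the available options for this task, so below are solutions and comments from others and from me.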
Use grep or grepl
grep returns the index (row number), while grepl returns a logical vector (TRUE or FALSE). Notice that when using grep in this case, value = T should not be added because it will return the string, which is not helpful for subsetting.
dat[grep("naming", dat$condition), ]
dat[grepl("naming", dat$condition), ]
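The following sketch shows what each variant returns on the example data, which is why only the first two are useful as row indices. The data frame is rebuilt inline so the block runs on its own.

```r
dat <- data.frame(
  value = c(2, 4, 1, 6),
  condition = c("learning/cat", "learning/dog", "naming/cat", "naming/dog"),
  stringsAsFactors = FALSE
)

idx <- grep("naming", dat$condition)                # integer positions
lgl <- grepl("naming", dat$condition)               # logical vector
val <- grep("naming", dat$condition, value = TRUE)  # the matching strings

idx  # 3 4
lgl  # FALSE FALSE TRUE TRUE
val  # "naming/cat" "naming/dog"

# Positions and logical vectors both select the two "naming" rows
dat[idx, ]
dat[lgl, ]
```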
Functions from dplyr and stringr
str_detect is equivalent to grepl(pattern, x), while str_which is equivalent to grep(pattern, x).
library(dplyr)
library(stringr)
dat %>% filter(str_detect(condition, "naming"))
dat %>% slice(str_which(condition, "naming"))
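As a quick check, the dplyr versions select the same rows as the base R ones. One small difference worth knowing is that dplyr renumbers the row names, while base subsetting keeps the original ones. A self-contained sketch:

```r
library(dplyr)
library(stringr)

dat <- data.frame(
  value = c(2, 4, 1, 6),
  condition = c("learning/cat", "learning/dog", "naming/cat", "naming/dog"),
  stringsAsFactors = FALSE
)

by_detect <- dat %>% filter(str_detect(condition, "naming"))
by_which  <- dat %>% slice(str_which(condition, "naming"))
by_grepl  <- dat[grepl("naming", dat$condition), ]

by_detect$value      # 1 6
rownames(by_detect)  # "1" "2"
rownames(by_grepl)   # "3" "4"
```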
Data Preparation
# Create example dataframes
dat <- read.table(text = "value condition
2 learning/cat
4 learning/dog
1 naming/cat
6 naming/dog",
header = TRUE, stringsAsFactors = FALSE)
I know this question has been asked many times (Converting Character to Numeric without NA Coercion in R, Converting Character\Factor to Numeric without NA Coercion in R, etc.) but I cannot seem to figure out what is going on in this one particular case (Warning message: NAs introduced by coercion). Here is some reproducible data I'm working with.
#dependencies
library(rvest)
library(dplyr)
library(pipeR)
library(stringr)
library(translateR)
#scrape data from website
url <- "http://irandataportal.syr.edu/election-data"
ir.pres2014 <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="content"]/div[16]/table') %>%
html_table(fill = TRUE)
ir.pres2014<-ir.pres2014[[1]]
colnames(ir.pres2014)<-c("province","Rouhani","Velayati","Jalili","Ghalibaf","Rezai","Gharazi")
ir.pres2014<-ir.pres2014[-1,]
#Get rid of unnecessary rows
ir.pres2014<-ir.pres2014 %>%
subset(province!="Votes Per Candidate") %>%
subset(province!="Total Votes")
#Get rid of commas
clean_numbers = function (x) str_replace_all(x, '[, ]', '')
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
#remove any possible whitespace in string
no_space = function (x) gsub(" ","", x)
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(no_space), -province)
This is where things start going wrong for me. I tried each of the following lines of code but I got all NA's each time. For example, I begin by trying to convert the second column (Rouhani) to numeric:
#First check class of vector
class(ir.pres2014$Rouhani)
#convert character to numeric
ir.pres2014$Rouhani.num<-as.numeric(ir.pres2014$Rouhani)
Above returns a vector of all NA's. I also tried:
as.numeric.factor <- function(x) {seq_along(levels(x))[x]}
ir.pres2014$Rouhani2<-as.numeric.factor(ir.pres2014$Rouhani)
And:
ir.pres2014$Rouhani2<-as.numeric(levels(ir.pres2014$Rouhani))[ir.pres2014$Rouhani]
And:
ir.pres2014$Rouhani2<-as.numeric(paste(ir.pres2014$Rouhani))
All those return NA's. I also tried the following:
ir.pres2014$Rouhani2<-as.numeric(as.factor(ir.pres2014$Rouhani))
That created a list of single digit numbers so it was clearly not converting the string in the way I have in mind. Any help is much appreciated.
The reason is what looks like a leading space before the numbers:
> ir.pres2014$Rouhani
[1] " 1052345" " 885693" " 384751" " 1017516" " 519412" " 175608" …
Just remove that as well before the conversion. The situation is complicated by the fact that this character isn’t actually a space, it’s something else:
mystery_char = substr(ir.pres2014$Rouhani[1], 1, 1)
charToRaw(mystery_char)
# [1] c2 a0
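These two bytes are the UTF-8 encoding of U+00A0, the no-break space (&nbsp; in HTML). I have no idea where it comes from in this table, but it needs to be replaced: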
str_replace_all(x, rawToChar(as.raw(c(0xc2, 0xa0))), '')
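A self-contained way to reproduce and fix the problem without scraping the page might look like this. The value below is a toy string modelled on the output above, and the character is written with the \u00a0 escape rather than raw bytes.

```r
library(stringr)

nbsp <- "\u00a0"   # U+00A0, no-break space
charToRaw(nbsp)
# [1] c2 a0

# Toy value modelled on the scraped strings: leading no-break space plus commas
raw_value <- paste0(nbsp, "1,052,345")

# Remove commas and no-break spaces in one pass, then convert
cleaned <- str_replace_all(raw_value, "[,\u00a0]", "")
as.numeric(cleaned)
# [1] 1052345
```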
Furthermore, you can simplify your code by applying the same transformation to all your columns at once:
mystery_char = rawToChar(as.raw(c(0xc2, 0xa0)))
to_replace = sprintf('[,%s]', mystery_char)
clean_numbers = function (x) as.numeric(str_replace_all(x, to_replace, ''))
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
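mutate_each() and funs() have since been deprecated in dplyr. Assuming dplyr 1.0.0 or later, the same step can be sketched with across(). The data frame below is toy data with made-up province names and values, used only so the block runs on its own.

```r
library(dplyr)
library(stringr)

# Toy data: made-up provinces and values that mimic the scraped strings
votes <- data.frame(
  province = c("Province A", "Province B"),
  Rouhani  = c("\u00a01,052,345", "\u00a0885,693"),
  Jalili   = c("\u00a0384,751", "\u00a01,017,516"),
  stringsAsFactors = FALSE
)

# Strip commas and no-break spaces, then convert to numeric
clean_numbers <- function(x) as.numeric(str_replace_all(x, "[,\u00a0]", ""))

# across() applies clean_numbers to every column except province
votes_clean <- votes %>% mutate(across(-province, clean_numbers))
str(votes_clean)
```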