I have a vector of strings named words, and I need to remove all empty strings using library(stringr). I tried str_remove_all(words, pattern = ""), but it showed me:
Error: Empty `pattern` not supported.
What should I do? Any help would be appreciated.
How about just subsetting in base R:
words <- words[words != ""]
If perhaps your "empty" words are not really empty, but actually contain one or more whitespace characters, then use grepl to remove them:
words <- words[!grepl("^\\s+$", words)]
If you really want to use stringr, then you probably want str_subset, which keeps or drops whole elements of a vector, instead of removing the match from within each element. Here is a pattern that keeps only strings with at least one character:
str_subset(words, ".+")
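To see how the two base R approaches differ, here is a minimal sketch. The sample vector is made up for illustration and is not from the question. It also uses "^\\s*$" instead of "^\\s+$", so that one pattern catches both truly empty and whitespace-only strings:

```r
# Hypothetical sample data, not from the original question
words <- c("apple", "", "banana", "   ", "cherry")

# Drop only the strings that are exactly empty
no_empty <- words[words != ""]

# Drop strings that are empty OR contain only whitespace
no_blank <- words[!grepl("^\\s*$", words)]

print(no_empty)
print(no_blank)
```

The first result still contains the whitespace-only element, the second does not. With stringr, str_subset(words, "\\S") should behave like the second variant, since it keeps only the elements that contain at least one non-whitespace character.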
Related
I have column names in the following format:
col= c('UserLanguage','Q48','Q21...20','Q22...21',"Q22_4_TEXT...202")
I would like to get the column names without everything that is after ...
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
I am not sure how to code it. I found this post here but I am not sure how to specify the pattern in my case.
You can use gsub.
gsub("\\...*","",col)
#[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
Or you can use stringr
library(stringr)
str_remove(col, "\\...*")
Since . matches any character in a regular expression, we need to escape it with a backslash to match a literal period: \. However, the backslash is itself an escape character in R strings, so it has to be doubled when the pattern is written as a string: \\. Note that only the first period in \\...* is escaped. The pattern matches one literal period, then any single character (the unescaped .), and then .* matches zero or more further characters, that is, the rest of the string. This works for the column names above, but it would also trim a name that contains a single period followed by at least one character. To require three literal periods, escape each one or use a quantifier, for example \\.{3}.* in place of \\...*.
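As a small sketch of what the escaping changes, the example below compares the pattern from this answer with a stricter one that requires three literal periods. The extra name "v1.5" is hypothetical and is only there to show a case where the two patterns disagree:

```r
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")

# "v1.5" is a made-up extra column name, added only for this comparison
col2 <- c(col, "v1.5")

# One literal period, then any character, then the rest of the string
loose <- sub("\\...*", "", col2)

# Exactly three literal periods, then the rest of the string
strict <- sub("\\.{3}.*", "", col2)

print(loose)
print(strict)
```

Both patterns give the same result for the original names. Only the stricter one leaves "v1.5" untouched.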
You could use sub and capture the first word in each column name:
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")
sub("^(\\w+).*$", "\\1", col)
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
The regex pattern used here says to match:
^ from the start of the input
(\w+) match AND capture the first word
.* then consume the rest
$ end of the input
Then, using sub we replace with \1 to retain just the first word.
I'm a total novice to regex, and have a hard time wrapping my head around it. Right now I have a column filled with strings, but the only relevant text to my analysis is between quotation marks. I've tried this:
response$text <- stri_extract_all_regex(response$text, '"\\S+"')
but when I view response$text, the output comes out like this:
"\"caring\""
How do I change my regex expression so that instead the output reads:
caring
You can use
library(stringi)
response$text <- stri_extract_all_regex(response$text, '(?<=")[^\\s"]+(?=")')
Or, with stringr:
library(stringr)
response$text <- str_extract_all(response$text, '(?<=")[^\\s"]+(?=")')
However, when a string contains several quoted words, I'd rather use stringr::str_match_all:
library(stringr)
matches <- str_match_all(response$text, '"([^\\s"]+)"')
response$text <- lapply(matches, function(x) x[,2])
See this regex demo.
With the capturing group approach used in "([^\\s"]+)" it becomes possible to avoid overlapping matches between quoted substrings, and str_match_all becomes handy since the matches it returns contain the captured substrings as well (unlike *extract* functions).
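To make the overlap issue concrete, here is a minimal sketch. The input string is a made-up edge case where an unquoted character sits directly between two quoted words. It assumes the stringr package is installed:

```r
library(stringr)

# Hypothetical input: the "x" is NOT quoted, it only sits between two quoted words
x <- '"caring"x"kind"'

# Lookarounds do not consume the quotes, so the closing quote of one word
# can also serve as the "opening" quote for the text that follows it
lookaround <- str_extract_all(x, '(?<=")[^\\s"]+(?=")')[[1]]

# The capturing-group pattern consumes both quotes, so matches cannot overlap;
# column 2 of the result matrix holds the captured text
captured <- str_match_all(x, '"([^\\s"]+)"')[[1]][, 2]

print(lookaround)
print(captured)
```

The lookaround version also returns the unquoted x, while the capturing-group version returns only the two quoted words.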
library(stringr)
namesfun<-(sapply(mxnames, function (x)(str_extract(x,sapply(jockeys, function (y)y)))))%>%as.data.frame(stringsAsFactors = F)
So I am trying to use str_extract using sapply through two vectors, and the "jockeys" vector that I use as the pattern argument in str_extract, has elements with special characters like "-" or "/" that interfere with regex.
Since I want an exact, "human" match rather than a regex-based match, how can I stop str_extract from treating the pattern as a regex by default?
I hope I got my point across!
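One possible approach, as a sketch: stringr provides fixed(), which wraps a pattern so that it is compared literally instead of being interpreted as a regex. The two vectors below are hypothetical stand-ins for mxnames and jockeys, since the real data is not shown in the question:

```r
library(stringr)

# Hypothetical stand-ins for the asker's vectors
mxnames <- c("race 1: J. Smith-Jones (GB)", "race 2: A. O'Neil/B. Lee")
jockeys <- c("J. Smith-Jones (GB)", "A. O'Neil/B. Lee")

# As a regex, the parentheses form a group, so the literal "(GB)" is not found
as_regex <- str_extract(mxnames[1], jockeys[1])

# fixed() makes stringr match the pattern text exactly as written
hits <- sapply(jockeys, function(j) str_extract(mxnames, fixed(j)))

print(as_regex)
print(hits)
```

Here hits is a matrix with one column per jockey and one row per element of mxnames, with NA wherever the name does not occur.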
I'm trying to convert characters like "9.230" to a numeric type.
First I erased the dots, because the conversion was returning NA, and then I converted to numeric.
The problem is that when I convert to numerical I lose the trailing zero:
Example:
a<-9.230
as.numeric(gsub(".","",a,fixed=TRUE))
Returns: 923
Does anyone know how to avoid this?
You assign the number 9.230 which is the same as 9.23. How is the system supposed to know that there was a trailing zero? If you want to transform a string, work with the string "9.230".
Look at the result of
a<-9.230
gsub(".","",a,fixed=TRUE)
#[1] "923"
The question is why. Because fixed=TRUE is used in the call to gsub, . is matched as a literal period rather than as a regex wildcard, and it is replaced by the 2nd argument of gsub, which is "". Since a is the number 9.23, the string that gsub works on is "9.23", and removing the period leaves "923".
That is basically why as.numeric(gsub(".","",a,fixed=TRUE)) results in 923.
There is another point: how was a <- 9.230 changed to character inside the gsub function? This is explained in the R documentation for gsub:
Arguments: x, text
a character vector where matches are sought, or an object
which can be coerced by as.character to a character vector. Long
vectors are supported.
Final question: How to avoid such behavior?
Don't use gsub. Use sprintf("%.3f",a), which returns the character string "9.230".
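The points above can be put side by side in a short base R sketch. The variable names are made up for illustration:

```r
a_num <- 9.230      # numeric: stored as 9.23, the trailing zero is not kept
a_chr <- "9.230"    # character: the trailing zero is part of the string

# What gsub actually sees when it is given the numeric value
coerced <- as.character(a_num)

# Removing the period from the string keeps the trailing zero
digits <- gsub(".", "", a_chr, fixed = TRUE)
as_number <- as.numeric(digits)

# Formatting the numeric value with three decimals gives a string back
formatted <- sprintf("%.3f", a_num)

print(coerced)
print(digits)
print(as_number)
print(formatted)
```

So the zero only survives if the value starts out as a string, or if the number is formatted back into a string with a fixed number of decimals.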
I have this vector: Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021"). I want to remove anything before SS and anything after T0 (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
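One caveat with "SS[^T]*" is that it stops at the first T of any kind, not specifically at T0. If that matters for your data, a lazy match with a lookahead is one alternative. The sketch below uses base R with perl = TRUE, and the third element is a hypothetical value added only to show the difference:

```r
# The third element is made up: it has a "T" before the "T0"
Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021", "tes_9_SSTG_340T01")

# Stops at the first "T", whatever follows it
up_to_any_T <- regmatches(Target, regexpr("SS[^T]*", Target))

# Stops right before the first "T0" (the lookahead needs perl = TRUE)
up_to_T0 <- regmatches(Target, regexpr("SS.*?(?=T0)", Target, perl = TRUE))

print(up_to_any_T)
print(up_to_T0)
```

For the two original strings both patterns agree. They only differ on the made-up third one.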
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can choose to keep a pattern and remove everything outside of that pattern with a two-step process. Step 1 is to put the pattern we'd like to keep in parentheses. Step 2 is to reference the number of the parentheses-bound pattern we'd like to keep, as sometimes we might have multiple parentheses-bound elements. For example:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
The .* are wildcards, by the way: . matches any character and * means zero or more repetitions. If you'd like to learn more about using regex in R, here's a reference that can get you started.