I am trying to extract the two letters between two spaces -
AAPL US Equity
1836 JP Equity
APPLE SOMETHING NOT
C US Equity
Desired result -
US
JP
US
What I tried was gsub("\\s[A-Z]{2}\\s", "\\1", vec) but that gives me -
AAPLEquity
1836Equity
APPLE SOMETHING NOT
CEquity
which seems the exact opposite of what I want.
We can use sub
out <- rep("", length(vec))
i1 <- grepl("\\b[A-Z]{2}\\b", vec)
out[i1] <- sub(".*\\s+([A-Z]{2})\\s+.*", "\\1", vec[i1])
out
#[1] "US" "JP" "" "US"
Or use str_extract (from stringr) to extract two uppercase letters that are preceded by a whitespace character (checked with a regex lookbehind) and followed by a word boundary (\\b)
str_extract(vec, "(?<=\\s)([A-Z]{2})\\b")
#[1] "US" "JP" NA "US"
data
vec <- c("AAPL US Equity", "1836 JP Equity", "APPLE SOMETHING NOT", "C US Equity")
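gsub replaces every match of the regular expression with the replacement string. \s[A-Z]{2}\s matches a whitespace character, 2 uppercase ASCII letters and another whitespace character. Since the pattern contains no capturing group, \\1 is empty, so the matches are simply removed from the character vector.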
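To see why the match disappears, here is a minimal sketch (base R only; the "<" and ">" markers are purely illustrative) that compares the original call with a version that adds a capturing group, so that \\1 actually refers to something:

```r
vec <- c("AAPL US Equity", "APPLE SOMETHING NOT")

# No capturing group: "\\1" is empty, so each match is simply deleted
no_group <- gsub("\\s[A-Z]{2}\\s", "\\1", vec)

# With a capturing group, "\\1" holds the two letters;
# the whitespace on both sides is still consumed by the match
with_group <- gsub("\\s([A-Z]{2})\\s", "<\\1>", vec)

print(no_group)
print(with_group)
```

Strings without a match are returned unchanged in both cases, which is why a pure replacement approach needs the pattern to cover the whole string.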
You may use
x <- c('AAPL US Equity','1836 JP Equity','APPLE SOMETHING NOT','C US Equity')
sub(".*\\s+([A-Z]{2})\\s.*|.*", "\\1", x)
# => [1] "US" "JP" "" "US"
Here, the .*\\s+([A-Z]{2})\\s.* alternative matches those inputs that have a two-letter "word" between whitespaces and puts the word into Group 1 (\1), while the .* alternative matches all other inputs; Group 1 is then empty, so sub returns an empty string for them.
Or, you may use
library(stringr)
str_extract(x, "(?<=\\s)[A-Z]{2}(?=\\s)")
# => [1] "US" "JP" NA "US"
Here, (?<=\\s)[A-Z]{2}(?=\\s) matches a two-letter uppercase word with a whitespace character on each side, and str_extract returns the first such match in each string (NA if there is none).
If the words can be at the start/end of the string use
str_extract(x, "(?<!\\S)[A-Z]{2}(?!\\S)")
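If stringr is not available, roughly the same extraction can be sketched in base R with regexpr and regmatches. This is only an illustration: the last two inputs are made-up examples added to show matches at the start and end of a string, and perl = TRUE is needed because the pattern uses lookarounds.

```r
x <- c("AAPL US Equity", "1836 JP Equity", "APPLE SOMETHING NOT",
       "C US Equity", "US", "Equity in JP")  # last two are hypothetical inputs

# Position of the first two-letter uppercase word not adjacent to non-whitespace
m <- regexpr("(?<!\\S)[A-Z]{2}(?!\\S)", x, perl = TRUE)

# regmatches() returns only the matched elements, so fill them into an NA vector
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)
print(out)
```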
I have been unable to find the answer to this specific question. I am using R to clean some survey data.
I have some messy survey data with question names as columns, which sometimes include a number and sometimes don't. When they include a number, it is often followed by some subcharacters that also identify the question. For example, I have this vector:
questions <- c(
"1 question 1 what do you think?",
"1.a. question 1a further details on what you think",
"Please explain",
"2 question 2 what is your motivation",
"2.a. further details",
"2.b. even further details",
"Please explain")
I want to extract the substrings that contain numbers, and return no results if there is no such match. Desired result (using R)
"1"
"1.a."
NA
"2"
"2.a."
"2.b."
NA
I know I can capture the first number, using
stri_extract_first_regex(questions, "[0-9]+")
But I am at a loss how to modify it to capture the whole string until the first whitespace if it finds a match using this pattern.
For your example data you might use:
[0-9]+(?:\.[a-z]\.)?
That will match:
[0-9]+ Match 1+ digits
(?: Non capturing group
\.[a-z]\. Match a dot, lowercase character and a dot
)? Close non capturing group and make it optional
For example (stri_extract_first_regex comes from the stringi package, so load it with library(stringi) first):
questions <- c(
"1 question 1 what do you think?",
"1.a. question 1a further details on what you think",
"Please explain",
"2 question 2 what is your motivation",
"2.a. further details",
"2.b. even further details",
"Please explain")
print(stri_extract_first_regex(questions, "[0-9]+(?:\\.[a-z]\\.)?"))
# [1] "1" "1.a." NA "2" "2.a." "2.b." NA
This might work:
hasnumber <- grepl("[0-9]+",questions)
firstspaces <- sapply(gregexpr(" ", questions), function(x) x[[1]])
res <- ifelse(hasnumber, substr(questions,1,firstspaces-1), NA)
> res
[1] "1" "1.a." NA "2" "2.a." "2.b." NA
The most difficult part, I guess, is finding the position of the first space in each question, which could be done with a loop or, as here, with sapply
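As a possible simplification, the first token can also be taken without locating the space explicitly. The sketch below is a variant, not the code above: it assumes the number must appear at the very start of the question, which is slightly stricter than the grepl("[0-9]+") check.

```r
questions <- c(
  "1 question 1 what do you think?",
  "1.a. question 1a further details on what you think",
  "Please explain",
  "2 question 2 what is your motivation",
  "2.a. further details",
  "2.b. even further details",
  "Please explain")

# Assumption: only questions that START with a digit carry a number
starts_with_digit <- grepl("^[0-9]", questions)

# Drop everything from the first whitespace character onwards
first_token <- sub("\\s.*$", "", questions)

res <- ifelse(starts_with_digit, first_token, NA_character_)
print(res)
```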
You may use
questions <- sub("^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.*", "\\1", questions)
questions[questions==""] <- NA
questions
# => [1] "1" "1.a." NA "2" "2.a." "2.b." NA
The ^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.* matches
^ - start of string
(\\d+(?:\\.[a-z0-9]+)*\\.?) - Capturing group 1:
\\d+ - 1+ digits
(?:\\.[a-z0-9]+)* - 0 or more repetitions of
\\. - a dot
[a-z0-9]+ - 1 or more lowercase ASCII letters or digits
\\.? - an optional dot
.* - any 0+ chars to the end of the string
| - or
.* - the whole string.
The match is replaced with the contents of Group 1. If the second alternative matches, Group 1 is empty and the result is an empty string; questions[questions==""] <- NA then replaces these elements with NA.
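The breakdown above suggests the pattern also copes with deeper numbering. Here is a small sketch on made-up inputs (the numbering helper and the extra vector are hypothetical names; perl = TRUE is my assumption so that the alternatives are tried strictly left to right):

```r
# Hypothetical helper wrapping the sub() call and the NA replacement
numbering <- function(x) {
  out <- sub("^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.*", "\\1", x, perl = TRUE)
  out[!nzchar(out)] <- NA_character_  # empty string means "no leading number"
  out
}

extra <- c("3.1.b nested question", "10.2. another one", "Please explain")
res <- numbering(extra)
print(res)
```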
I am trying to look for gene symbols in some text. For that purpose I am trying to write a pattern that matches gene symbols (they are usually three or more consecutive uppercase letters). I tried this but it didn't work.
TW2 <- text_words [grep ("b\[[:upper:]]b\", text_words) ]
You may use
text_words <- "GHJ GJKGKJ HHKKK J777 JJ8JJJJ"
TW2 <- unlist(regmatches(text_words, gregexpr("\\b[[:upper:]]{3,}\\b", text_words)))
TW2
## => [1] "GHJ" "GJKGKJ" "HHKKK"
The pattern matches:
\\b - a word boundary
[[:upper:]]{3,} - 3 or more uppercase letters
\\b - a word boundary.
If you have a vector with the strings you need to test against the pattern in full, use
text_words <- c("GHJ","GJKGKJ","HHKKK","J777","JJ8JJJJ")
TW2 <- grep("^[[:upper:]]{3,}$", text_words, value=TRUE)
TW2
## => [1] "GHJ" "GJKGKJ" "HHKKK"
Here, word boundaries are replaced with anchors, ^ for the start of the string and $ for the end of the string.
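One thing to keep in mind is that the letters-only pattern skips symbols that contain digits. The sketch below uses a made-up sentence; the second pattern (an uppercase letter followed by two or more uppercase letters or digits) is only an assumption about what a gene symbol may look like, not a rule from the question.

```r
# Hypothetical example sentence
text <- "Mutations in EGFR and KRAS, but not in TP53, were observed"

# Letters only, as in the answer above
strict <- unlist(regmatches(text,
  gregexpr("\\b[[:upper:]]{3,}\\b", text, perl = TRUE)))

# Assumed looser rule: uppercase letter, then 2+ uppercase letters or digits
loose <- unlist(regmatches(text,
  gregexpr("\\b[[:upper:]][[:upper:][:digit:]]{2,}\\b", text, perl = TRUE)))

print(strict)
print(loose)
```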
I have a data frame in R with one column containing an address in Korean. I need to extract one of the words (a word ending with 동), if it's there (it's possible that it's missing) and create a new column named "dong" that will contain this word. So my data is shown in column "address" and desired output is shown in column "dong" shown below.
address <- c("대전광역시 서구 탄방동 홈플러스","대전광역시 동구 효동 주민센터","대전광역시 대덕구 오정동 한남마트","대전광역시 동구 자양동 87-3번지 성동경로당","대전광역시 유성구 용계로 128")
dong <- c("탄방동","효동","오정동","자양동",NA)
data <- data.frame(address,dong, stringsAsFactors = FALSE)
I've tried using grep but it's not giving me exactly what I need.
grep(".+동\\s",data$address,value=T)
I think I have 2 issues: 1) I'm not sure how to write a proper regular expression to identify the word I need and 2) I'm not sure why grep returns the whole string rather than the word. I would appreciate any suggestions.
A regex to extract Korean whole words ending with a specific letter is
\b\w*동\b
Details:
\b - leading word boundary
\w* - 0+ word chars
동 - ending letter
\b - trailing word boundary
In R:
address <- c("대전광역시 서구 탄방동 홈플러스","대전광역시 동구 효동 주민센터","대전광역시 대덕구 오정동 한남마트","대전광역시 동구 자양동 87-3번지 성동경로당","대전광역시 유성구 용계로 128")
## matches <- regmatches(address, gregexpr("\\b\\w*동\\b", address, perl=TRUE ))
matches <- regmatches(address, gregexpr("\\b\\w*동\\b", address ))
dong <- unlist(lapply(matches, function(x) if (length(x) == 0) NA else x))
data <- data.frame(address,dong, stringsAsFactors = FALSE)
Output:
address dong
1 대전광역시 서구 탄방동 홈플러스 탄방동
2 대전광역시 동구 효동 주민센터 효동
3 대전광역시 대덕구 오정동 한남마트 오정동
4 대전광역시 동구 자양동 87-3번지 성동경로당 자양동
5 대전광역시 유성구 용계로 128 <NA>
Note that the dong <- unlist(lapply(matches, function(x) if (length(x) == 0) NA else x)) line is necessary to add NA to those rows where no match was found.
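One caveat with unlist is that an address containing two matching words would yield more values than rows. A possible alternative, sketched here under the assumption that words are separated by whitespace, splits each address and keeps only the first word ending in 동; it avoids regex word boundaries altogether:

```r
address <- c("대전광역시 서구 탄방동 홈플러스",
             "대전광역시 동구 효동 주민센터",
             "대전광역시 대덕구 오정동 한남마트",
             "대전광역시 동구 자양동 87-3번지 성동경로당",
             "대전광역시 유성구 용계로 128")

# Split on whitespace, then keep the first word that ends with the target letter
words <- strsplit(address, "\\s+")
dong <- vapply(words, function(w) {
  hit <- w[endsWith(w, "동")]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

data <- data.frame(address, dong, stringsAsFactors = FALSE)
print(data$dong)
```

vapply guarantees exactly one value per address, so the result always lines up with the rows of the data frame.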
grep with value=T returns the whole matching strings, not just the matched part. In your case, the stringr library is useful.
library(stringr)
str_match(paste0(data$address, ' '), '([^\\s]+동)\\s')
[,1] [,2]
[1,] "탄방동 " "탄방동"
[2,] "효동 " "효동"
[3,] "오정동 " "오정동"
[4,] "자양동 " "자양동"
[5,] NA NA
Column 2 is what you want. Note that I added a space at the end of each string so that the regex also matches if "dong" appears at the end of the string.
I've got some problems deleting duplicate elements in a string.
My data look similar to this:
idvisit path
1 1,16,23,59
2 2,14,14,19
3 5,19,23,19
4 10,10
5 23,23,27,29,23
I have a column containing a unique ID and a column containing a path for web page navigation.
The right column contains some cases where pages were just reloaded and the same page was tracked twice or even more often.
The pages are separated with commas and are saved as factors.
My problem is that I don't want the same page several times in a row, so the data should look like this.
idvisit path
1 1,16,23,59
2 2,14,19
3 5,19,23,19
4 10
5 23,27,29,23
The multiple pages next to each other should be removed. I know how to delete a specific repeated number using regular expressions, but I have about 20,000 different pages and can't do this for all of them.
Does anyone have a solution or a hint for my problem?
Thanks
Sebastian
We can use tidyverse. Use separate_rows to split the 'path' variable by the delimiter (,) and convert to long format; then, grouped by 'idvisit', paste together the values of the run-length encoding (rle), which keeps one value per run of consecutive duplicates
library(tidyverse)
separate_rows(df1, path) %>%
group_by(idvisit) %>%
summarise(path = paste(rle(path)$values, collapse=","))
# A tibble: 5 × 2
# idvisit path
# <int> <chr>
#1 1 1,16,23,59
#2 2 2,14,19
#3 3 5,19,23,19
#4 4 10
#5 5 23,27,29,23
Or a base R option is
df1$path <- sapply(strsplit(df1$path, ","), function(x) paste(rle(x)$values, collapse=","))
NOTE: If the 'path' column is factor class, convert to character before passing as argument to strsplit i.e. strsplit(as.character(df1$path), ",")
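For a self-contained run of the base R option, here is a sketch that rebuilds the example data with 'path' stored as a factor (df1 is reconstructed from the question; collapse_runs is a hypothetical helper name):

```r
df1 <- data.frame(
  idvisit = 1:5,
  path = c("1,16,23,59", "2,14,14,19", "5,19,23,19", "10,10", "23,23,27,29,23"),
  stringsAsFactors = TRUE)  # mimic the factor column from the question

# Hypothetical helper: collapse consecutive duplicates within each path
collapse_runs <- function(p) {
  parts <- strsplit(as.character(p), ",", fixed = TRUE)
  # rle() keeps one value per run of identical neighbours
  vapply(parts, function(x) paste(rle(x)$values, collapse = ","), character(1))
}

df1$path <- collapse_runs(df1$path)
print(df1)
```

Because rle only merges neighbouring values, a page that is revisited later (19 in row 3, 23 in row 5) is kept.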
Using the stringr package, with the function str_replace_all, I think you get what you want with the following regular expression: ([0-9]+),\\1 and then replacing the match with \\1 (we need to escape the \ special character):
library(stringr)
> str_replace_all("5,19,23,19", "([0-9]+),\\1", "\\1")
[1] "5,19,23,19"
> str_replace_all("10,10", "([0-9]+),\\1", "\\1")
[1] "10"
> str_replace_all("2,14,14,19", "([0-9]+),\\1", "\\1")
[1] "2,14,19"
You can apply it to a whole vector: x <- c("5,19,23,19", "10,10", "2,14,14,19") then:
str_replace_all(x, "([0-9]+),\\1", "\\1")
[1] "5,19,23,19" "10" "2,14,19"
or using sapply:
result <- sapply(x, function(x) str_replace_all(x, "([0-9]+),\\1", "\\1"))
Then:
> result
5,19,23,19 10,10 2,14,14,19
"5,19,23,19" "10" "2,14,19"
Notes:
The first line of the output shows the names attribute:
> str(result)
Named chr [1:3] "5,19,23,19" "10" "2,14,19"
- attr(*, "names")= chr [1:3] "5,19,23,19" "10,10" "2,14,14,19"
If you don't want to see them (it does not affect the result), just do:
attributes(result) <- NULL
Then,
> result
[1] "5,19,23,19" "10" "2,14,19"
Explanation about the regular expression used: ([0-9]+),\\1
([0-9]+): Starts with a group 1 delimited by () and finds any digit (at least one)
,: Then comes a punctuation sign: , (we can include spaces here, but the original example only uses this character as delimiter)
\\1: Then comes an identical string to the group 1, i.e.: the repeated number. If that doesn't happen, then the pattern doesn't match.
Then if the pattern matches, it is replaced with the contents of group 1 (\\1), i.e. the first occurrence of the number in the matched text.
How to handle more than one duplicated number, for example 2,14,14,14,19?:
Just use this regular expression instead: ([0-9]+)(,\\1)+, which matches when there is at least one repetition of the delimiter followed by the number. You can try other possibilities on regex101.com (in my humble opinion it is more user-friendly than other online regular expression checkers).
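A caveat worth checking on real data: without word boundaries, the backreference can match the first digits of the next number, so "1,16" is treated as a repeated "1". The sketch below uses base R gsub with perl = TRUE instead of stringr, and the \\b anchors are my addition rather than part of the pattern above:

```r
paths <- c("1,16,23,59", "2,14,14,14,19", "5,19,23,19", "10,10")

# Pattern as given above: "1,1" inside "1,16" is wrongly collapsed
naive <- gsub("([0-9]+)(,\\1)+", "\\1", paths, perl = TRUE)

# Word boundaries force whole numbers to be compared
bounded <- gsub("\\b([0-9]+)(,\\1\\b)+", "\\1", paths, perl = TRUE)

print(naive)
print(bounded)
```

Only the first path differs between the two results, but with about 20,000 page ids such prefix collisions are likely to occur.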
I hope this would work for you, it is a flexible solution, you just need to adapt it with the pattern you need.