I have collected tweets and would like to extract the emoji Unicode code points from each tweet. The emoji appear in <U+00001F44D> format, and I used gsub in R to remove all text before and after the emoji:
tweets$text <- gsub(".*(<.*>).*", "\\1", tweets$text)
However, because there may be several emojis per tweet, I decided to split each column after the character ">".
In some columns there are strings that consist only of alphabetic characters and do not start with "<".
My question is: how do I remove a string if it does not start with "<"?
example:
data <- data.frame(text = c("<U+000>", "character", "abc <U+000>"), stringsAsFactors = FALSE)
data$text <- gsub(".*(<.*>).*", "\\1", data$text)
The data will still include the "character" string, but I'm trying to remove all characters except the emoji Unicode.
We can use grep instead of gsub. With value = TRUE it returns only the elements that start with "<":
grep("^<", v1, value = TRUE)
#[1] "<U+000>"
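If the goal is only to drop (or blank out) elements that do not begin with "<", base R's startsWith() avoids regular expressions altogether. A minimal sketch, assuming the same v1 as in the data section of this answer:

```r
v1 <- c("<U+000>", "character", "abc <U+000>")

# TRUE for elements whose first character is "<"
keep <- startsWith(v1, "<")

# Drop the other elements
kept <- v1[keep]
kept

# Or replace them with NA so the length of the vector is preserved
blanked <- ifelse(keep, v1, NA_character_)
blanked
```

The second form is usually the more convenient one inside a data frame column, because the number of rows does not change.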
If we need to extract the emojis and drop the remaining characters, we can use str_extract from stringr. The pattern matches a literal < followed by one or more characters that are not > (inside square brackets, a leading ^ negates the set), followed by the closing >. Neither < nor > is a regex metacharacter, so no escaping is needed; in base R's default regex engine, \\< and \\> are in fact word-boundary assertions.
library(stringr)
str_extract(v1, "<[^>]+>")
#[1] "<U+000>" NA "<U+000>"
If there can be several emojis per element and we need one column per emoji:
lst1 <- str_extract_all(dat$v2, "<[^>]+>")
n <- lengths(lst1)
# pad every element with NA to the same length, then bind the rows into a matrix
dat[paste0("Col", seq_len(max(n)))] <- do.call(rbind,
    lapply(lst1, `length<-`, max(n)))
dat
#                          v2    Col1    Col2
#1                    <U+000> <U+000>    <NA>
#2                  character    <NA>    <NA>
#3                abc <U+000> <U+000>    <NA>
#4 <U+000> characters <U+000> <U+000> <U+000>
Or using base R (note that regmatches with regexpr drops the elements that have no match)
regmatches(v1, regexpr("<[^>]+>", v1))
#[1] "<U+000>" "<U+000>"
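Because regmatches() silently drops non-matching elements, the result can be shorter than the input. The following is a sketch of one way to keep the positions aligned in base R, and to collect every match per element; the variable names res and all_matches are only illustrative:

```r
v1 <- c("<U+000>", "character", "abc <U+000>")
v2 <- c(v1, "<U+000> characters <U+000>")

# regexpr() returns -1 where there is no match
m <- regexpr("<[^>]+>", v1)

# Start from all-NA and fill in only the positions that matched
res <- rep(NA_character_, length(v1))
res[m != -1] <- regmatches(v1, m)
res

# gregexpr() finds every match; the result is a list with one vector per element
all_matches <- regmatches(v2, gregexpr("<[^>]+>", v2))
lengths(all_matches)
```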
data
v1 <- c("<U+000>", "character", "abc <U+000>")
v2 <- c(v1, "<U+000> characters <U+000>")
dat <- data.frame(v2 = v2, stringsAsFactors = FALSE)
I have a tibble with a column of character strings that mix English and Mandarin characters. I want to split that column into two, with one column returning the English and the other returning the Mandarin. However, I had to resort to the following code in order to accomplish this:
library(dplyr)
library(stringr)
tb <- tibble(x = c("I我", "love愛", "you你")) # create tibble
en <- str_split(tb[[1]], "[^A-Za-z]+", simplify = TRUE) # split at every run of characters that are not a-z
ch <- str_split(tb[[1]], "[A-Za-z]+", simplify = TRUE) # split at every run of a-z characters
tb <- tb %>%
  mutate(EN = en[, 1],
         CH = ch[, 2]) %>%
  select(-x) # subset the matrices created above, because they contain a column of blank ("") values, and also remove the x column
tb
I'm guessing there's something wrong with my RegEx that's causing this to occur. Ideally, I would like to write one str_split line that would return both of the columns.
We can use strsplit from base R
do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
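Or we can use str_extract. The ASCII ranges are spelled out because stringr's [[:alpha:]] is Unicode-aware and would also match the Chinese characters:
library(stringr)
tb$en <- str_extract(tb$x, "[A-Za-z]+")
tb$ch <- str_extract(tb$x, "[^A-Za-z]+")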
We can use str_match and get data for English and rest of the characters separately.
stringr::str_match(tb$x, "([A-Za-z]+)(.*)")[, -1]
# [,1] [,2]
#[1,] "I" "我"
#[2,] "love" "愛"
#[3,] "you" "你"
A simple solution using str_extract from package stringr:
library(stringr)
tb$en <- str_extract(tb$x, "[A-Za-z]+")
tb$ch <- str_extract(tb$x, "[^A-Za-z]")
In case there's more than one Chinese character, just add + to [^A-Za-z].
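A shorthand that often appears for "any letter" is [A-z]. It is worth checking what that range covers before relying on it: in ASCII order the span from A to z also contains a handful of punctuation characters. A small sketch (perl = TRUE is used here only to make the engine explicit; the vector x is made up for illustration):

```r
x <- c("A", "z", "_", "^", "[", "1")

# [A-z] spans ASCII 65 to 122, which also covers [ \ ] ^ _ and the backtick
wide <- grepl("[A-z]", x, perl = TRUE)
wide

# [A-Za-z] matches the 52 ASCII letters only
letters_only <- grepl("[A-Za-z]", x, perl = TRUE)
letters_only
```

For the tweets in this question the difference does not matter, but it does as soon as the English part can contain an underscore or a bracket.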
Alternatively, use gsub and a backreference (this assumes exactly one trailing non-English character):
tb$en <- gsub("(\\w+).$", "\\1", tb$x)
tb$ch <- gsub("\\w+(.$)", "\\1", tb$x)
Yet another solution matches printable ASCII characters with [ -~]+ and everything else with [^ -~]+:
tb$en <- str_extract(tb$x, "[ -~]+")
tb$ch <- str_extract(tb$x, "[^ -~]+")
Result:
tb
# A tibble: 3 x 3
x en ch
<chr> <chr> <chr>
1 I我 I 我
2 love愛 love 愛
3 you你 you 你
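For completeness, the same split can be sketched in base R without any packages by deleting the unwanted characters rather than extracting the wanted ones. This assumes, as in the example, that the "English" part consists only of the ASCII letters A-Z and a-z:

```r
x <- c("I我", "love愛", "you你")

en <- gsub("[^A-Za-z]", "", x)  # delete everything that is not an ASCII letter
ch <- gsub("[A-Za-z]", "", x)   # delete the ASCII letters, keep the rest

data.frame(en = en, ch = ch)
```

Unlike the extraction approaches, this also works when the English and Chinese parts are interleaved, although the pieces are then simply concatenated.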
I have a returned string like this from my code: (<C1>, 4.297, %)
I am trying to extract only the value 4.297 from this string using gsub:
Fat<-gsub("\\D", "", stringV)
However, this keeps not only the digits of 4.297 but also the '1' in C1.
Is there a way to extract only 4.297 from this string?
Thanks
How about this?
# Your sample character string
ss <- "(<C1>, 4.297, %)";
gsub(".+,\\s*(\\d+\\.\\d+),.+", "\\1", ss)
#[1] "4.297"
or
gsub(".+,\\s*([0-9\\.]+),.+", "\\1", ss)
Convert to numeric with as.numeric if necessary.
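A slightly different way to express the same idea in base R is to extract the number rather than delete everything around it. The sketch below assumes the value always has a decimal part, which is what keeps the "1" in "C1" from matching:

```r
stringV <- "(<C1>, 4.297, %)"

# One or more digits, a literal dot, then one or more digits
m <- regmatches(stringV, regexpr("[0-9]+\\.[0-9]+", stringV))

value <- as.numeric(m)
value
```

If the value can also be an integer, the pattern would need to be adjusted, for example by anchoring on the surrounding commas as in the gsub calls above.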
Another option is str_extract, matching one or more digits or dots ([0-9.]+) enclosed by word boundaries (\\b). The "1" in "C1" is skipped because there is no word boundary between "C" and "1".
library(stringr)
as.numeric(str_extract(stringV, "\\b[0-9.]+\\b"))
#[1] 4.297
If there are multiple numbers, use str_extract_all
data
stringV <- "(<C1>, 4.297, %)"
An alternative is to treat your string as comma-separated values and use read.csv
df <- read.csv(text = stringV, colClasses = c("character", "numeric", "character"), header = FALSE)
V1 V2 V3
1 (<C1> 4.297 %)
This method relies on the number being in the second position.
You can also split on commas and use as.numeric, which converts the non-numeric pieces to NA (with a coercion warning, suppressed here).
ss <- suppressWarnings(as.numeric(unlist(strsplit(stringV, ','))))
ss[!is.na(ss)]
#[1] 4.297
I am looking for a regex for gsub to remove all the unwanted commas:
Data:
,,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354
Desired result:
12345
12345,1345,1354
123
12345
12354
This is the progress I have made so far:
(,(?!\d+))
You seem to want to remove all leading and trailing commas.
You may do it with
gsub("^,+|,+$", "", x)
The regex contains two alternatives: ^,+ matches one or more commas at the start and ,+$ matches one or more commas at the end; gsub replaces these matches with empty strings.
R demo:
x <- c(",,,,,,,12345","12345,1345,1354","123,,,,,,","12345,",",12354")
gsub("^,+|,+$", "", x)
## [1] "12345"           "12345,1345,1354" "123"             "12345"
## [5] "12354"
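Since the task is really "trim a character from both ends", trimws() can express it without writing the alternation by hand. This is a sketch that assumes R 3.6.0 or later, where trimws() accepts a whitespace argument that is used as the pattern to trim:

```r
x <- c(",,,,,,,12345", "12345,1345,1354", "123,,,,,,", "12345,", ",12354")

# Treat "," as the character to strip from both ends; inner commas are kept
res <- trimws(x, which = "both", whitespace = ",")
res

# Same result as the explicit regex
identical(res, gsub("^,+|,+$", "", x))
```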
You can also use str_extract from stringr. Because .+ is greedy, the match runs from the first digit to the last digit in the string, so the leading and trailing commas are left out. Note that this pattern needs at least three characters to match:
library(dplyr)
library(stringr)
df %>%
mutate(V1 = str_extract(V1, "\\d.+\\d"))
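or if you prefer base R (regexpr returns the first match per element, so the result is a plain character vector; this assumes every row has a match):
df$V1 <- regmatches(df$V1, regexpr("\\d.+\\d", df$V1))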
Result:
V1
1 12345
2 12345,1345,1354
3 123
4 12345
5 12354
Data:
df = read.table(text = ",,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354", stringsAsFactors = FALSE)
I have a file in Excel that has, as an example, text such as this "4.56/505AB" in a cell. The numbers all vary, as does the length of text, so the text can be single or multiple characters, and the numbers can contain characters such as a decimal point or slash mark.
The ideal, separated format for this example would be: column 1 = 4.56/505, column 2 = AB.
What I've tried:
"Split_Text" in Excel, which removed the special characters from the number, and resulted in the following output: column 1 = 456505, column 2 = ./AB
R with the "G_sub" command, which resulted in: [1] " 4 . 56 / 505 AB"
Is there a way to take these methods further, or will this be a manual fix? Thank you!
Assuming the first uppercase letter is the beginning of the second column
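df <- data.frame(c1 = c("4.56/505AB", "1.23/202CD"), stringsAsFactors = FALSE)
library(stringr)
df$c2 <- str_extract(df$c1, "[^A-Z]+") # everything before the first uppercase letter
df$c3 <- str_extract(df$c1, "[A-Z]+")  # the run of uppercase letters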
df
# c1 c2 c3
# 1 4.56/505AB 4.56/505 AB
# 2 1.23/202CD 1.23/202 CD
1) sub/read.table Match the leading characters and the trailing characters within the two capture groups and separate them with a semicolon. Then read that in using read.table. No packages are used.
x <- "4.56/505AB"
pat <- "^([0-9.,/]+)(.*)"
read.table(text = sub(pat, "\\1;\\2", x), sep = ";", as.is = TRUE)
## V1 V2
## 1 4.56/505 AB
The result has character columns, but if you prefer factors then omit as.is = TRUE (in R 4.0.0 and later, pass stringsAsFactors = TRUE as well). We have also assumed that there are no semicolons in the input; if there are, replace the semicolon in both places with some other character that does not appear in the input.
1a) If we can assume that the second column always starts with a letter then we could just replace the first letter encountered by a semicolon followed by that letter and then read it in using read.table. This has the advantage of using a slightly simpler pattern.
read.table(text = sub("([[:alpha:]])", ";\\1", x), sep = ";", as.is = TRUE)
2) read.pattern Using the same input x and pattern pat it is even shorter using read.pattern in the gsubfn package:
library(gsubfn)
read.pattern(text = x, pattern = pat, as.is = TRUE)
## V1 V2
## 1 4.56/505 AB
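Another base R route, sketched here under the same assumption about the pattern, is to pull out the two capture groups directly with regexec() and regmatches() instead of going through an intermediate separator. The column names num and txt are only illustrative:

```r
x <- c("4.56/505AB", "1.23/202CD")

# Group 1: leading digits, dots, commas and slashes; group 2: whatever follows
m <- regmatches(x, regexec("^([0-9.,/]+)(.*)$", x))

# Each list element is c(full match, group 1, group 2); drop the full match
parts <- do.call(rbind, lapply(m, function(p) p[-1]))

out <- data.frame(num = parts[, 1], txt = parts[, 2], stringsAsFactors = FALSE)
out
```

This avoids having to pick a separator character that is guaranteed not to occur in the input.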
How do I remove part of a string? For example in ATGAS_1121 I want to remove everything before _.
Use regular expressions. In this case, you can use gsub:
gsub("^.*?_","_","ATGAS_1121")
[1] "_1121"
This regular expression matches the beginning of the string (^), any character (.) repeated zero or more times (*), and an underscore (_). The ? makes the match "lazy" so that it only matches as far as the first underscore. That match is replaced with just an underscore. See ?regex for more details and references.
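The difference between the lazy and the greedy form only shows up when the string contains more than one underscore. A small sketch with a made-up input (perl = TRUE is set here just to make the engine explicit):

```r
s <- "AT_GAS_1121"  # hypothetical input with two underscores

lazy   <- sub("^.*?_", "_", s, perl = TRUE)  # stops at the first underscore
greedy <- sub("^.*_",  "_", s, perl = TRUE)  # runs on to the last underscore

lazy
greedy
```

With a single underscore, as in ATGAS_1121, both forms give the same result.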
You can use a built-in for this, strsplit:
> s = "TGAS_1121"
> s1 = unlist(strsplit(s, split='_', fixed=TRUE))[2]
> s1
[1] "1121"
strsplit splits the string on the split parameter and returns the pieces as a list. That's probably not what you want, so wrap the call in unlist, then index the resulting vector so that only the second of the two elements is returned.
Finally, the fixed parameter should be set to TRUE to indicate that the split parameter is not a regular expression, but a literal string.
If you're a Tidyverse kind of person, here's the stringr solution:
R> library(stringr)
R> strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
R> strings %>% str_replace(".*_", "_")
[1] "_1121" "_1432" "_1121"
# Or:
R> strings %>% str_replace("^[A-Z]*", "")
[1] "_1121" "_1432" "_1121"
Here's the strsplit solution if s is a vector:
> s <- c("TGAS_1121", "MGAS_1432")
> s1 <- sapply(strsplit(s, split='_', fixed=TRUE), function(x) (x[2]))
> s1
[1] "1121" "1432"
Probably the most intuitive solution is the stringr function str_remove, which is even easier than str_replace because it takes only a pattern and no replacement.
The only tricky part in your example is that you want to keep the underscore, but it's possible: use a lookahead, (?=pattern), so that the match stops just before the specified pattern without consuming it.
See example:
strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
stringr::str_remove(strings, ".+?(?=_)")
[1] "_1121" "_1432" "_1121"
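The same lookahead can be used in base R as long as perl = TRUE is set, since lookarounds are not supported by the default engine. A minimal sketch:

```r
strings <- c("TGAS_1121", "MGAS_1432", "ATGAS_1121")

# Remove everything up to, but not including, the first underscore
res <- sub("^.+?(?=_)", "", strings, perl = TRUE)
res
```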
Here is the strsplit solution for a data frame, using the dplyr package:
library(dplyr)
col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
col2 = c("T", "M", "A")
df = data.frame(col1, col2)
df
col1 col2
1 TGAS_1121 T
2 MGAS_1432 M
3 ATGAS_1121 A
df <- mutate(df, col1 = as.character(col1))
df2 <- mutate(df, col1 = sapply(strsplit(col1, split = "_", fixed = TRUE), function(x) x[2]))
df2
col1 col2
1 1121 T
2 1432 M
3 1121 A