I am trying to extract all hashtags from some tweets, and obtain for each tweet a single string with all hashtags.
I am using str_extract_all from stringr, so I obtain a list of character vectors. My problem is that I cannot unlist it and keep the same number of elements as the list (that is, the number of tweets).
Example:
This is a vector of tweets of length 3:
a <- "rt @ugh_toulouse: #mondial2014 : le top 5 des mannequins brésiliens http://www.ladepeche.fr/article/2014/06/01/1892121-mondial-2014-le-top-5-des-mannequins-bresiliens.html #brésil "
b <- "rt @30millionsdamis: beauté de la nature : 1 #baleine sauve un naufragé ; elles pourtant tellement menacées par l'homme... http://goo.gl/xqrqhd #instinctanimal "
c <- "rt @onlyshe31: elle siège toujours!!!!!!! marseille. nouveau procès pour la députée - 01/06/2014 - ladépêche.fr http://www.ladepeche.fr/article/2014/06/01/1892035-marseille-nouveau-proces-pour-la-deputee.html #toulouse "
all <- c(a, b, c)
Now I use str_extract_all to extract the hashtags:
library(stringr)
ex <- str_extract_all(all, "#(.+?)[ |\n]")
If I now use unlist I get a vector of length 5:
undesired <- unlist(ex)
> undesired
[1] "#mondial2014 " "#brésil "
[3] "#baleine " "#instinctanimal "
[5] "#toulouse "
What I want is something like the following. However this is very inefficient, because it is not vectorized, and it takes forever (really!) on a smallish data frame of tweets:
desired <- c()
for (i in 1:length(ex)){
  desired[i] <- paste(ex[[i]], collapse = " ")
}
> desired
[1] "#mondial2014 #brésil "
[2] "#baleine #instinctanimal "
[3] "#toulouse "
Help!
You could use stringi, which may be faster for big datasets:
library(stringi)
sapply(stri_extract_all_regex(all, '#(.+?)[ |\n]'), paste, collapse=' ')
#[1] "#mondial2014 #brésil " "#baleine #instinctanimal "
#[3] "#toulouse "
A for loop can also be fast if you preallocate the output vector (as a character vector, since it will hold strings):
desired <- character(length(ex))
for (i in seq_along(ex)) {
  desired[i] <- paste(ex[[i]], collapse = " ")
}
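Or you could use vapply, which can be faster than sapply and is a bit safer because it checks the type and length of each result (contributed by @Richie Cotton). Note that toString joins with ", " rather than a single space: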
vapply(ex, toString, character(1))
#[1] "#mondial2014 , #brésil " "#baleine , #instinctanimal "
#[3] "#toulouse "
Or, as suggested by @Ananda Mahto:
vapply(stri_extract_all_regex(all, '#(.+?)[ |\n]'),
       stri_flatten, character(1L), collapse = " ")
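For completeness, here is a minimal base R sketch of the same extract-then-collapse idea, assuming no extra packages are available. The tweets vector and the pattern #[^[:space:]]+ are illustrative assumptions, not taken from the question. The point it shows is that paste(..., collapse = " ") returns "" for a tweet with no hashtags, so the result keeps exactly one element per tweet.

```r
# Hypothetical sample data: the second tweet has no hashtag at all
tweets <- c("rt @a: #one text #two ", "no tags here", "#three ")

# gregexpr() finds every match per tweet; regmatches() returns a list
# with one character vector per tweet (character(0) when nothing matches)
tags <- regmatches(tweets, gregexpr("#[^[:space:]]+", tweets))

# vapply() guarantees one string per tweet; `collapse` is passed on to paste()
joined <- vapply(tags, paste, character(1), collapse = " ")
print(joined)
```

Because vapply is told to expect character(1), it fails loudly if any element does not collapse to a single string, instead of silently returning a list.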
I've been trying to split a string in R and then join it back together, but none of the tricks have worked for what I need.
Important: my question is not a duplicate.
Saving a split result into a variable and then pasting or collapsing it is not the same as pasting a vector directly, like this:
paste(c("bla", "bla"), collapse = " ")
> paste(c("The","birch", "canoe"), collapse = ' ')
[1] "The birch canoe"
> paste(s, collapse=" ")
[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
Here's the code:
I take pre-saved sentences in R
sentences[1]
and split it
s <- str_split(sentences[1], " ")
this is what I get:
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
Now when I try to join this back together I get backslashes
toString(s)
"c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
paste produces the same result:
> paste(s)
[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
I tried using str_split_fixed and wrapping the result in a vector, but it joins the sentence back together with commas, even if I ask it not to.
v <- as.vector(str_split_fixed(sentences[1], " ", 5))
toString(v, sep="")
[1] "The, birch, canoe, slid, on the smooth planks."
I thought maybe str_split_i or str_split_1 could solve it, since according to the documentation they should in theory, but this is what I get when I try to use one of them:
"could not find function "str_split_1" "
Are there any other ways to join a string back together after splitting it without producing commas or backslashes?
See the difference between:
s <- list(c("The" , "birch" , "canoe" , "slid" , "on" , "the" , "smooth" , "planks."))
paste(s[1], collapse = " ")
#[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
and
paste(s[[1]], collapse = " ")
#[1] "The birch canoe slid on the smooth planks."
This is because [[ extracts the vector itself, whereas [ keeps the output as a list.
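The same list-versus-vector distinction can be sketched in base R for several sentences at once. This is only an illustration: it assumes strsplit in place of str_split (both return a list with one element per input string), and the second sentence is made up for the example.

```r
# Hypothetical input: two sentences instead of one
sentences <- c("The birch canoe slid on the smooth planks.",
               "Glue the sheet to the dark blue background.")

# strsplit() returns a list: one character vector of words per sentence
parts <- strsplit(sentences, " ", fixed = TRUE)

# Collapse each list element separately, so paste() sees a plain vector
rejoined <- vapply(parts, paste, character(1), collapse = " ")
print(rejoined)
```

Applying paste to each element of the list avoids the deparsed c(\"The\", ...) text that appears when the list itself is handed to paste or toString.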
I'm trying to use str_pad on a subset of items in a data frame. So that is, starting with something like:
> d <- data.frame(q=c("all","two","a","an","each"),s=c("univ","exis","exis","exis","univ"))
> d
     q    s
1  all univ
2  two exis
3    a exis
4   an exis
5 each univ
I'd like to add white space to just the items where the value in q starts with "a" or "e". I can use str_pad and str_subset to get this:
> str_pad(str_subset(d$q,"\\b([ae])"),3)
[1] "all"  "  a"  " an"  "each"
But I don't know how to change those items in the data frame. I can use subset() to pick out the rows I want to edit, but I'm not sure how to rewrite parts of that subset; assigning to it gives me an error:
> subset(d,str_detect(d$q,"\\b([ae])")==TRUE)
     q    s
1  all univ
3    a exis
4   an exis
5 each univ
> subset(d,str_detect(d$q,"\\b([ae])")==TRUE)$q <- str_pad(str_subset(d$q,"\\b([ae])"),3)
Error in subset(d, str_detect(d$q, "\\b([ae])") == TRUE)$q <- str_pad(str_subset(d$q, :
could not find function "subset<-"
Is there a short-ish way to do this? I can think of a couple roundabout ways but something brief would be good. Thanks!
Here is an efficient way to do it.
library(dplyr)
library(stringr)
d <- data.frame(q = c("all","two","a","an","each"),
                s = c("univ","exis","exis","exis","univ")) %>%
  mutate(q = ifelse(str_detect(q, '^[ae]'), paste(' ', q), q))
d$q
The output:
[1] " all" "two" " a" " an" " each"
Let us know if this is what you're looking for.
Is this what you're looking for?
library(tidyverse)
library(stringr)
d_2 <- d %>%
  dplyr::mutate(result = if_else(stringr::str_detect(q, "^a") |
                                   stringr::str_detect(q, "^e"), paste(" ", q), q))
We can use sub from base R
d$q <- sub('^([ae])', " \\1", d$q)
d$q
#[1] " all" "two" " a" " an" " each"
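If the goal is to pad to a fixed width, as in the question's str_pad call, rather than to prepend a space, one option is to assign through a logical index. The sketch below is an assumption about the intent, and it uses base R's formatC as a stand-in for str_pad so that it runs without stringr. With stringr loaded, str_pad(d$q[idx], 3) could be used on the same line instead.

```r
d <- data.frame(q = c("all", "two", "a", "an", "each"),
                s = c("univ", "exis", "exis", "exis", "univ"),
                stringsAsFactors = FALSE)

# TRUE for the rows whose q starts with "a" or "e"
idx <- grepl("^[ae]", d$q)

# Overwrite only those rows; formatC() left-pads with spaces to width 3
# and leaves strings that are already 3 or more characters long unchanged
d$q[idx] <- formatC(d$q[idx], width = 3)
print(d$q)
```

Indexing on the left-hand side (d$q[idx] <- ...) is what subset() cannot do, since subset() has no replacement form.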
I aim to remove duplicate words only in parentheses from string sets.
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
What I want to get is just like this
a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) pretty (sure|surely) you know (what|when) (you|her) should (do|)'
To get this result, I used code like this:
a = gsub('\\|', " | ", a)
a = gsub('\\(', "( ", a)
a = gsub('\\)', " )", a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))
However, it resulted in undesirable outputs.
a
[1] "I ( have | has ) certain words word worded"
[2] "( You | Youre ) can cans do this works worked"
[3] "I ( am | are ) sure surely you know what when her should do"
Why did my code remove parentheses located in the latter part of strings?
What should I do for the result I want?
We can use gsubfn. The idea is to select the characters inside the brackets: match the opening bracket (\\(, escaped because the bracket is a metacharacter) followed by one or more characters that are not a closing bracket ([^)]+), captured as a group. In the replacement function, we split the captured characters (x) with strsplit, unlist the resulting list, keep the unique elements, and paste them back together after the opening bracket.
library(gsubfn)
gsubfn("\\(([^)]+)", ~paste0("(", paste(unique(unlist(strsplit(x,
"[|]"))), collapse="|")), a)
#[1] "I (have|has) certain (words|word|worded) certain"
#[2] "(You|Youre) (can|cans) do this (works|worked)"
#[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
The answer above is more straightforward, but you can also try:
library(stringi)
library(stringr)
a_new <- gsub("[|]","-",a) # replace the | because it causes issues during the replacement later
a1 <- str_extract_all(a_new,"[(](.*?)[)]") # extract the "units"
# some magic using stringi::stri_extract_all_words()
a2 <- unlist(lapply(a1,function(x) unlist(lapply(stri_extract_all_words(x), function(y) paste(unique(y),collapse = "|")))))
# prepare replacement
names(a2) <- unlist(a1)
# replacement and finalization
str_replace_all(a_new, a2)
[1] "I (have|has) certain (words|word|worded) certain"
[2] "(You|Youre) (can|cans) do this (works|worked)"
[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
The idea is to extract the words within the brackets as a unit, then remove the duplicates and replace the old unit with the updated one.
A longer but more elaborate attempt:
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
trim <- function(x) gsub("^\\s+|\\s+$", "", x)
# blank output
new_a <- c()
for (sentence in seq_along(a)) {
  split <- trim(unlist(strsplit(a[sentence], "[( )]")))
  newsentence <- c()
  for (i in split) {
    j1 <- as.character(unique(trim(unlist(strsplit(gsub('\\|', " ", i), " ")))))
    if (length(j1) == 0) {
      next
    } else if (length(j1) > 1) {
      newsentence <- c(newsentence, paste("(", paste(j1, collapse = "|"), ")", sep = ""))
    } else {
      newsentence <- c(newsentence, j1[1])
    }
  }
  newsentence <- paste(newsentence, collapse = " ")
  print(newsentence)
  new_a <- c(new_a, newsentence)
}
# [1] "I (have|has) certain (words|word|worded) certain"
# [2] "(You|Youre) (can|cans) do this (works|worked)"
# [3] "I (am|are) (sure|surely) you know (what|when) (you|her) should do"
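On the "why" part of the question: the original attempt lost the later parentheses because unique() ran over every token of the whole sentence, so repeated "(", "|" and ")" tokens were dropped along with the repeated words. A minimal base R sketch of the per-group idea used in the answers above follows, assuming regmatches and its replacement form instead of gsubfn. The helper name dedupe_group is hypothetical.

```r
a <- c('I (have|has|have) certain (words|word|worded|word) certain',
       '(You|You|Youre) (can|cans|can) do this (works|works|worked)',
       'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)')

# Hypothetical helper: de-duplicate the alternatives of one "(x|y|x)" group
dedupe_group <- function(group) {
  inner <- substr(group, 2, nchar(group) - 1)            # drop "(" and ")"
  words <- unique(strsplit(inner, "|", fixed = TRUE)[[1]])
  paste0("(", paste(words, collapse = "|"), ")")
}

# Locate every parenthesised group, then write the cleaned groups back in place
m <- gregexpr("\\([^)]*\\)", a)
regmatches(a, m) <- lapply(regmatches(a, m), function(g)
  vapply(g, dedupe_group, character(1), USE.NAMES = FALSE))
print(a)
```

Because unique() only ever sees the words of a single group, words repeated outside the brackets (such as the second "certain") are left alone.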
What is the cleanest way of finding for example the string ": [1-9]*" and only keeping that part?
You can work with regexec to get the starting points, but isn't there a cleaner way just to get immediately the value?
For example:
test <- c("surface area: 458", "bedrooms: 1", "whatever")
regexec(": [1-9]*", test)
How do I get immediately just
c(": 458",": 1", NA )
You can use base R which handles this just fine.
> x <- c('surface area: 458', 'bedrooms: 1', 'whatever')
> r <- regmatches(x, gregexpr(':.*', x))
> unlist({r[sapply(r, length)==0] <- NA; r})
# [1] ": 458" ": 1" NA
Although, I find it much simpler to just do...
> x <- c('surface area: 458', 'bedrooms: 1', 'whatever')
> sapply(strsplit(x, '\\b(?=:)', perl=T), '[', 2)
# [1] ": 458" ": 1" NA
library(stringr)
str_extract(test, ":.*")
#[1] ": 458" ": 1" NA
Or, for a faster approach, use stringi:
library(stringi)
stri_extract_first_regex(test, ":.*")
#[1] ": 458" ": 1" NA
If you need to keep the original value for elements that don't have a match:
gsub(".*(:.*)", "\\1", test)
#[1] ": 458" ": 1" "whatever"
Try any of these. The first two use only base R. The last one assumes that we want to return a numeric vector.
1) sub
s <- sub(".*:", ":", test)
ifelse(test == s, NA, s)
## [1] ": 458" ": 1" NA
If there can be more than one : in a string then replace the pattern with "^[^:]*:" .
2) strsplit
sapply(strsplit(test, ":"), function(x) c(paste0(":", x), NA)[2])
## [1] ": 458" ": 1" NA
Do not use this one if there can be more than one : in a string.
3) strapplyc
library(gsubfn)
s <- strapplyc(test, "(:.*)|$", simplify = TRUE)
ifelse(s == "", NA, s)
## [1] ": 458" ": 1" NA
We can omit the ifelse line if "" is ok instead of NA.
4) strapply If the idea is really that there are some digits on the line and we want to return the numbers or NA then try this:
library(gsubfn)
strapply(test, "\\d+|$", as.numeric, simplify = TRUE)
## [1] 458 1 NA
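Another base R variant, sketched here under the assumption that only the first match per string is wanted, pre-fills the result with NA and overwrites the positions where regexpr found something. It uses [0-9] rather than the question's [1-9], because [1-9]* stops at the first 0 digit (for example, ": 105" would be cut to ": 1").

```r
test <- c("surface area: 458", "bedrooms: 1", "whatever")

# regexpr() returns the start of the first match per element, or -1 if none
m <- regexpr(": [0-9]*", test)

# Start with all NA, then fill in the elements that actually matched;
# regmatches() returns only the matched substrings, in order
out <- rep(NA_character_, length(test))
out[m != -1] <- regmatches(test, m)
print(out)
```

This keeps the output the same length as the input without any post-hoc comparison against the original strings.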
I am trying to replace strings in R in a large number of texts.
Essentially, this reproduces the format of the data from which I am trying to delete the '\n' parts:
document <- as.list(c("This is \\na try-out", "And it \\nfails"))
I can do this with a loop and gsub, but it takes forever. I looked at this post for a solution, so I tried:
temp <- apply(document, 2, function(x) gsub("\\n", " ", fixed=TRUE))
I also tried lapply, but that gives an error message too. I can't figure this out, help!
Use lapply if you want to return a list:
document <- as.list(c("This is \\na try-out", "And it \\nfails"))
temp <- lapply(document, function(x) gsub("\\n", " ", x, fixed=TRUE))
##[[1]]
##[1] "This is  a try-out"
##[[2]]
##[1] "And it  fails"
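The apply attempt in the question fails for two reasons: apply needs an object with dimensions (a matrix or data frame), and x is never passed to gsub. Since gsub is itself vectorized, a sketch that avoids the per-element function call entirely is to flatten the list once. This assumes every list element is a single string, as in the example.

```r
document <- as.list(c("This is \\na try-out", "And it \\nfails"))

# Flatten the list to a character vector once, then let gsub() handle
# every element in a single call. With fixed = TRUE the pattern is the
# literal two-character sequence backslash + n, not a newline.
texts <- unlist(document)
cleaned <- gsub("\\n", " ", texts, fixed = TRUE)
print(cleaned)

# Convert back only if a list is really needed downstream
cleaned_list <- as.list(cleaned)
```

Each replacement leaves two spaces in a row, because the sample strings already have a space before the literal \n.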