Add whitespace to subset of a data frame in R

I'm trying to use str_pad on a subset of items in a data frame. That is, starting with something like:
> d <- data.frame(q=c("all","two","a","an","each"),s=c("univ","exis","exis","exis","univ"))
> d
q s
1 all univ
2 two exis
3 a exis
4 an exis
5 each univ
I'd like to add white space to just the items where the value in q starts with "a" or "e". I can use str_pad and str_subset to get this:
> str_pad(str_subset(d$q,"\\b([ae])"),3)
[1] "all"  "  a"  " an"  "each"
But I don't know how to change those items in the data frame. I can use subset() to pick out the rows I want to edit, but I'm not sure how to rewrite parts of that subset; assigning to it gives me an error:
> subset(d,str_detect(d$q,"\\b([ae])")==TRUE)
q s
1 all univ
3 a exis
4 an exis
5 each univ
> subset(d,str_detect(d$q,"\\b([ae])")==TRUE)$q <- str_pad(str_subset(d$q,"\\b([ae])"),3)
Error in subset(d, str_detect(d$q, "\\b([ae])") == TRUE)$q <- str_pad(str_subset(d$q, :
could not find function "subset<-"
Is there a short-ish way to do this? I can think of a couple roundabout ways but something brief would be good. Thanks!

Here is an efficient way to do it.
library(dplyr)
library(stringr)
d <- data.frame(q = c("all","two","a","an","each"),
s = c("univ","exis","exis","exis","univ")) %>%
mutate(q = ifelse(str_detect(q, '^[ae]'), paste0(' ', q), q))
d$q
The output:
[1] " all" "two" " a" " an" " each"
Let us know if this is what you're looking for.

Is this what you're looking for?
library(tidyverse)
library(stringr)
d_2 <- d %>%
dplyr::mutate(result = if_else(stringr::str_detect(q, "^a")|
stringr::str_detect(q, "^e"), paste(" ", q), q))

We can use sub from base R
d$q <- sub('^([ae])', " \\1", d$q)
d$q
#[1] " all" "two" " a" " an" " each"
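For completeness: the error in the question arises because subset() returns a copy and has no replacement form (there is no `subset<-`). A minimal base R sketch, assuming q is stored as character rather than factor, assigns through a logical index instead. The name idx is only illustrative.

```r
d <- data.frame(q = c("all", "two", "a", "an", "each"),
                s = c("univ", "exis", "exis", "exis", "univ"),
                stringsAsFactors = FALSE)

# Logical index of the rows whose q starts with "a" or "e"
idx <- grepl("^[ae]", d$q)

# Assign only into those rows; the other rows are left untouched.
# stringr::str_pad(d$q[idx], width) could be used here instead of paste0().
d$q[idx] <- paste0(" ", d$q[idx])
d$q
```

The same pattern works with any vectorized transformation on the right-hand side, as long as it returns one value per selected row.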

Related

How to paste together all the objects in an environment in R?

This seems like a simple question, but I can't find a solution. I want to take all of the objects (character vectors) in my environment and use them as arguments in a paste function. But the catch is I want to do so without specifying them all individually.
a <- "foo"
b <- "bar"
c <- "baz"
z <- paste(a, b, c, sep = " ")
z
[1] "foo bar baz"
I imagined that something like ls() would offer this, but obviously it returns the object names rather than their values:
z <- paste(ls(), collapse = " ")
z
[1] "a b c"
not "foo bar baz", which is what I want.
We can use mget to return the values of the objects in a list and then with do.call paste them into a single string
do.call(paste, c(mget(ls()), sep= " "))
As sep defaults to " " in paste, we don't need to specify it:
do.call(paste, mget(ls()))
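One caveat: ls() with no arguments lists everything in the calling environment, so any other object there would be pasted too, and an object named sep or collapse would be taken as a paste argument. A hedged sketch that keeps the strings in their own environment (env is a hypothetical name) and drops the names before calling paste:

```r
# Keep the strings in a dedicated environment so nothing else gets pasted
env <- new.env()
assign("a", "foo", envir = env)
assign("b", "bar", envir = env)
assign("c", "baz", envir = env)

# ls() returns the names sorted alphabetically; mget() looks up their values.
# unname() ensures no object name is mistaken for a paste() argument.
values <- unname(mget(ls(env), envir = env))
z <- do.call(paste, values)
z
```

Note that the order of the pasted strings follows the alphabetical order of the object names, not the order in which they were created.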

Collapsing rows using two vectors as indicators

This is my first time posting; please let me know if I'm making any beginner mistakes. In my specific case I have a vector of strings, and I want to collapse some adjacent rows. I have one vector indicating the starting position and one indicating the last element. How can I do this?
Here is some sample code and my approach that does not work:
text <- c("cat", "dog", "house", "mouse", "street")
x <- c(1,3)
y <- c(2,5)
result <- as.data.frame(paste(text[x:y],sep = " ",collapse = ""))
In case it's not clear, the result I want is a data frame consisting of two strings: "cat dog" and "house mouse street".
Not sure this is the best option, but it does the job,
sapply(mapply(seq, x, y), function(i)paste(text[i], collapse = ' '))
#[1] "cat dog" "house mouse street"
Either use base R with
mapply(function(.x,.y) paste(text[.x:.y],collapse = " "), x, y)
or use the purrr package as
map2_chr(x,y, ~ paste(text[.x:.y],collapse = " "))
Both yield
# [1] "cat dog" "house mouse street"
The output as a data frame depends on the structure you want: rows or columns
I think you want
result <- data.frame(combined = c(paste(text[x[1]:y[1]], collapse = " "),
paste(text[x[2]:y[2]], collapse = " ")))
Which gives you
result
#> combined
#> 1 cat dog
#> 2 house mouse street
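If there are more than two ranges, writing out each x[i]:y[i] by hand does not scale. A small sketch of the same idea with Map, which pairs each start with its end (the argument names from and to are illustrative):

```r
text <- c("cat", "dog", "house", "mouse", "street")
x <- c(1, 3)
y <- c(2, 5)

# Map() always returns a list, one element per (start, end) pair
pieces <- Map(function(from, to) paste(text[from:to], collapse = " "), x, y)

# Flatten to a character vector and store it as a single column
result <- data.frame(combined = unlist(pieces), stringsAsFactors = FALSE)
result
```

Map is used here rather than mapply because it never simplifies its result, so the output shape does not depend on the range lengths.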
Another base R solution, using parse + eval
result <- data.frame(new = sapply(paste0(x,":",y),function(v) paste0(text[eval(parse(text = v))],collapse = " ")),
row.names = NULL)
such that
> result
new
1 cat dog
2 house mouse street

how to remove duplicate words in a certain pattern from a string in R

I aim to remove duplicate words only in parentheses from string sets.
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
What I want to get is just like this
a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) pretty (sure|surely) you know (what|when) (you|her) should (do|)'
In order to get the result, I used a code like this
a = gsub('\\|', " | ", a)
a = gsub('\\(', "( ", a)
a = gsub('\\)', " )", a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))
However, it resulted in undesirable outputs.
a
[1] "I ( have | has ) certain words word worded"
[2] "( You | Youre ) can cans do this works worked"
[3] "I ( am | are ) sure surely you know what when her should do"
Why did my code remove parentheses located in the latter part of strings?
What should I do for the result I want?
We can use gsubfn. The idea is to match each opening bracket (written \\( because ( is a regex metacharacter and must be escaped) followed by one or more characters that are not a closing bracket ([^)]+), capturing those characters as a group. In the replacement function, we split the captured text (x) on | with strsplit, unlist the result, keep the unique elements, and paste them back together after the opening bracket.
library(gsubfn)
gsubfn("\\(([^)]+)", ~paste0("(", paste(unique(unlist(strsplit(x,
"[|]"))), collapse="|")), a)
#[1] "I (have|has) certain (words|word|worded) certain"
#[2] "(You|Youre) (can|cans) do this (works|worked)"
#[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
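As for why the code in the question dropped the later parentheses: unique() was applied to every token in the sentence, so the second "(", "|" and ")" counted as duplicates and were removed along with the repeated words. A base R sketch without extra packages, which deduplicates inside each bracketed group only (dedupe_group is a hypothetical helper name):

```r
a <- c("I (have|has|have) certain (words|word|worded|word) certain",
       "(You|You|Youre) (can|cans|can) do this (works|works|worked)",
       "I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)")

# Remove repeated alternatives from one "(x|y|x)" group
dedupe_group <- function(group) {
  inner <- substr(group, 2, nchar(group) - 1)            # strip the brackets
  words <- unique(strsplit(inner, "|", fixed = TRUE)[[1]])
  paste0("(", paste(words, collapse = "|"), ")")
}

# Locate every bracketed group, then replace each one in place
m <- gregexpr("\\([^)]*\\)", a)
regmatches(a, m) <- lapply(regmatches(a, m), function(groups)
  vapply(groups, dedupe_group, character(1), USE.NAMES = FALSE))
a
```

Text outside the brackets, such as the repeated word "certain", is never touched because only the matched groups are rewritten.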
Go with the answer above; it is more straightforward. But you can also try:
library(stringi)
library(stringr)
a_new <- gsub("[|]","-",a) # replace this | due to some issues during the replacement later
a1 <- str_extract_all(a_new,"[(](.*?)[)]") # extract the "units"
# some magic using stringi::stri_extract_all_words()
a2 <- unlist(lapply(a1,function(x) unlist(lapply(stri_extract_all_words(x), function(y) paste(unique(y),collapse = "|")))))
# prepare replacement
names(a2) <- unlist(a1)
# replacement and finalization
str_replace_all(a_new, a2)
[1] "I (have|has) certain (words|word|worded) certain"
[2] "(You|Youre) (can|cans) do this (works|worked)"
[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
The idea is to extract the words within the brackets as a unit, then remove the duplicates and replace the old unit with the updated one.
A longer but more elaborate attempt:
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# blank output
new_a <- c()
for (sentence in 1:length(a)) {
split <- trim(unlist(strsplit(a[sentence],"[( )]")))
newsentence <- c()
for (i in split) {
j1 <- as.character(unique(trim(unlist(strsplit(gsub('\\|'," ",i)," ")))))
if( length(j1)==0) {
next
} else {
if (length(j1) > 1) {
newsentence <- c(newsentence, paste("(", paste(j1, collapse = "|"), ")", sep = ""))
} else {
newsentence <- c(newsentence, j1[1])
}
}
}
newsentence <- paste(newsentence,collapse=" ")
print(newsentence)
new_a <- c(new_a,newsentence)}
# [1] "I (have|has) certain (words|word|worded) certain"
# [2] "(You|Youre) (can|cans) do this (works|worked)"
# [3] "I (am|are) (sure|surely) you know (what|when) (you|her) should do"

Concatenate a string to the whole dataframe column [duplicate]

How can I concatenate (merge, combine) two values?
For example I have:
tmp = cbind("GAD", "AB")
tmp
# [,1] [,2]
# [1,] "GAD" "AB"
My goal is to concatenate the two values in "tmp" to one string:
tmp_new = "GAD,AB"
Which function can do this for me?
paste() is the way to go. As the previous posters pointed out, paste can do two things:
concatenate values into one "string", e.g.
> paste("Hello", "world", sep=" ")
[1] "Hello world"
where the argument sep specifies the character(s) to be used between the arguments to concatenate,
or collapse character vectors
> x <- c("Hello", "World")
> x
[1] "Hello" "World"
> paste(x, collapse="--")
[1] "Hello--World"
where the argument collapse specifies the character(s) to be used between the elements of the vector to be collapsed.
You can even combine both:
> paste(x, "and some more", sep="|-|", collapse="--")
[1] "Hello|-|and some more--World|-|and some more"
help.search() is a handy function, e.g.
> help.search("concatenate")
will lead you to paste().
For the first non-paste() answer, we can look at stringr::str_c() (and then toString() below). It hasn't been around as long as this question, so I think it's useful to mention that it also exists.
Very simple to use, as you can see.
tmp <- cbind("GAD", "AB")
library(stringr)
str_c(tmp, collapse = ",")
# [1] "GAD,AB"
From its documentation file description, it fits this problem nicely.
To understand how str_c works, you need to imagine that you are building up a matrix of strings. Each input argument forms a column, and is expanded to the length of the longest argument, using the usual recycling rules. The sep string is inserted between each column. If collapse is NULL each row is collapsed into a single string. If non-NULL that string is inserted at the end of each row, and the entire matrix collapsed to a single string.
Added 4/13/2016: It's not exactly the same as your desired output (extra space), but no one has mentioned it either. toString() is basically a version of paste() with collapse = ", " hard-coded, so you can do
toString(tmp)
# [1] "GAD, AB"
As others have pointed out, paste() is the way to go. But it can get annoying to have to type paste(str1, str2, str3, sep='') every time you want the non-default separator.
You can very easily create wrapper functions that make life much simpler. For instance, if you find yourself concatenating strings with no separator really often, you can do:
p <- function(..., sep='') {
paste(..., sep=sep, collapse=sep)
}
or if you often want to join strings from a vector (like implode() from PHP):
implode <- function(..., sep='') {
paste(..., collapse=sep)
}
This allows you to do the following:
p('a', 'b', 'c')
#[1] "abc"
vec <- c('a', 'b', 'c')
implode(vec)
#[1] "abc"
implode(vec, sep=', ')
#[1] "a, b, c"
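Also, there is the built-in paste0, which is equivalent to paste(..., sep = ""): it joins its arguments with no separator, much like my p above. Unlike implode, it does not collapse a vector into a single string unless you also pass collapse. It's slightly more efficient than paste().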
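A short illustration of how paste0 behaves with and without collapse may help here. This is only a sketch using base R; the variable names are illustrative.

```r
vec <- c("a", "b", "c")

# With a single vector and no collapse there is nothing to join:
# the vector comes back element by element
elementwise <- paste0(vec)

# collapse = "" is what turns the vector into one string
collapsed <- paste0(vec, collapse = "")

# Separate arguments are joined pairwise with no separator
joined_args <- paste0("a", "b", "c")

# A custom separator between vector elements still needs collapse
with_sep <- paste(vec, collapse = ", ")

print(list(elementwise, collapsed, joined_args, with_sep))
```

So paste0 replaces the sep = "" part of a call, while collapsing a vector remains a separate choice made through the collapse argument.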
> tmp = paste("GAD", "AB", sep = ",")
> tmp
[1] "GAD,AB"
I found this from Google by searching for R concatenate strings: http://stat.ethz.ch/R-manual/R-patched/library/base/html/paste.html
Alternatively, if your objective is to output directly to a file or stdout, you can use cat:
cat(s1, s2, sep=", ")
Another way:
sprintf("%s you can add other static strings here %s",string1,string2)
It is sometimes more useful than the paste() function. %s marks the place where the corresponding string argument will be inserted.
Note that this will come in handy as you try to build a path:
sprintf("/%s", paste("this", "is", "a", "path", sep="/"))
output
/this/is/a/path
You can create your own operator:
'%&%' <- function(x, y)paste0(x,y)
"new" %&% "operator"
[1] "newoperator"
You can also redefine the 'and' (&) operator:
'&' <- function(x, y)paste0(x,y)
"dirty" & "trick"
[1] "dirtytrick"
Messing with base syntax is ugly, but so is using paste()/paste0() everywhere. If you work only with your own code, you can (almost always) replace the logical & operator with * and multiply logical values instead of using the logical 'and' (&).
Given the matrix, tmp, that you created:
paste(tmp[1,], collapse = ",")
I assume there is some reason why you're creating a matrix using cbind, as opposed to simply:
tmp <- "GAD,AB"
Consider the case where the strings are columns and the result should be a new column:
df <- data.frame(a = letters[1:5], b = LETTERS[1:5], c = 1:5)
df$new_col <- do.call(paste, c(df[c("a", "b")], sep = ", "))
df
# a b c new_col
#1 a A 1 a, A
#2 b B 2 b, B
#3 c C 3 c, C
#4 d D 4 d, D
#5 e E 5 e, E
Optionally, skip the [c("a", "b")] subsetting if all columns need to be pasted.
# you can also try str_c from stringr package as mentioned by other users too!
do.call(str_c, c(df[c("a", "b")], sep = ", "))
glue is a new function, data class, and package that has been developed as part of the tidyverse, with a lot of extended functionality. It combines features from paste, sprintf, and the previous other answers.
tmp <- tibble::tibble(firststring = "GAD", secondstring = "AB")
(tmp_new <- glue::glue_data(tmp, "{firststring},{secondstring}"))
#> GAD,AB
Created on 2019-03-06 by the reprex package (v0.2.1)
Yes, it's overkill for the simple example in this question, but powerful for many situations. (see https://glue.tidyverse.org/)
Below is a quick example comparing glue to paste used with with(). The glue code was a bit easier to type and looks a bit easier to read.
tmp <- tibble::tibble(firststring = c("GAD", "GAD2", "GAD3"), secondstring = c("AB1", "AB2", "AB3"))
(tmp_new <- glue::glue_data(tmp, "{firststring} and {secondstring} went to the park for a walk. {firststring} forgot his keys."))
#> GAD and AB1 went to the park for a walk. GAD forgot his keys.
#> GAD2 and AB2 went to the park for a walk. GAD2 forgot his keys.
#> GAD3 and AB3 went to the park for a walk. GAD3 forgot his keys.
(with(tmp, paste(firststring, "and", secondstring, "went to the park for a walk.", firststring, "forgot his keys.")))
#> [1] "GAD and AB1 went to the park for a walk. GAD forgot his keys."
#> [2] "GAD2 and AB2 went to the park for a walk. GAD2 forgot his keys."
#> [3] "GAD3 and AB3 went to the park for a walk. GAD3 forgot his keys."
Another non-paste answer:
x <- capture.output(cat(data, sep = ","))
x
[1] "GAD,AB"
Where
data <- c("GAD", "AB")

unlist keeping the same number of elements (vectorized)

I am trying to extract all hashtags from some tweets, and obtain for each tweet a single string with all hashtags.
I am using str_extract from stringr, so I obtain a list of character vectors. My problem is that I do not manage to unlist it and keep the same number of elements of the list (that is, the number of tweets).
Example:
This is a vector of tweets of length 3:
a <- "rt @ugh_toulouse: #mondial2014 : le top 5 des mannequins brésiliens http://www.ladepeche.fr/article/2014/06/01/1892121-mondial-2014-le-top-5-des-mannequins-bresiliens.html #brésil "
b <- "rt @30millionsdamis: beauté de la nature : 1 #baleine sauve un naufragé ; elles pourtant tellement menacées par l'homme... http://goo.gl/xqrqhd #instinctanimal "
c <- "rt @onlyshe31: elle siège toujours!!!!!!! marseille. nouveau procès pour la députée - 01/06/2014 - ladépêche.fr http://www.ladepeche.fr/article/2014/06/01/1892035-marseille-nouveau-proces-pour-la-deputee.html #toulouse "
all <- c(a, b, c)
Now I use str_extract_all to extract the hashtags:
ex <- str_extract_all(all, "#(.+?)[ |\n]")
If I now use unlist I get a vector of length 5:
undesired <- unlist(ex)
> undesired
[1] "#mondial2014 " "#brésil "
[3] "#baleine " "#instinctanimal "
[5] "#toulouse "
What I want is something like the following. However this is very inefficient, because it is not vectorized, and it takes forever (really!) on a smallish data frame of tweets:
desired <- c()
for (i in 1:length(ex)){
desired[i] <- paste(ex[[i]], collapse = " ")
}
> desired
[1] "#mondial2014 #brésil "
[2] "#baleine #instinctanimal "
[3] "#toulouse "
Help!
You could use stringi which may be faster for big datasets
library(stringi)
sapply(stri_extract_all_regex(all, '#(.+?)[ |\n]'), paste, collapse=' ')
#[1] "#mondial2014 #brésil " "#baleine #instinctanimal "
#[3] "#toulouse "
A for loop can also be fast if you preallocate the output vector to its final length:
desired <- character(length(ex))
for (i in seq_along(ex)){
desired[i] <- paste(ex[[i]], collapse = " ")
}
Or you could use vapply which would be faster than sapply and a bit safer (contributed by @Richie Cotton)
vapply(ex, toString, character(1))
#[1] "#mondial2014 , #brésil " "#baleine , #instinctanimal "
#[3] "#toulouse "
Or as suggested by @Ananda Mahto
vapply(stri_extract_all_regex(all, '#(.+?)[ |\n]'),
stri_flatten, character(1L), collapse = " ")
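If no extra package is wanted, roughly the same result can be sketched in base R with gregexpr and regmatches. The sample tweets and the pattern "#[^[:space:]]+" are assumptions for illustration; unlike the pattern above, this one does not capture the trailing space, so the joined strings come out without stray whitespace.

```r
# Hypothetical sample data: one tweet with two tags, one with none, one with one
tweets <- c("rt @user1: #one some text #two",
            "no tags here",
            "#three only")

# One character vector of hashtags per tweet (possibly empty)
tags <- regmatches(tweets, gregexpr("#[^[:space:]]+", tweets))

# Collapse each tweet's tags into a single string; vapply guarantees
# exactly one string per tweet, so the length always matches the input
per_tweet <- vapply(tags, paste, character(1), collapse = " ")
per_tweet
```

A tweet without hashtags yields an empty string rather than being dropped, which keeps the result aligned with the rows of the original data.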

Resources