How do I paste string columns in data.frame [duplicate]

This question already has answers here:
Concatenate rows of a data frame
(4 answers)
Closed 6 years ago.
Suppose we have:
mydf <- data.frame(a= LETTERS, b = LETTERS, c =LETTERS)
Now we want to add a new column, containing a concatenation of all columns.
So that rows in the new column read "AAA", "BBB", ...
I expected the following to work:
mydf[,"Concat"] <- apply(mydf, 1, paste0)

In addition to @akrun's answer, here is a short explanation of why your code didn't work.
What you are passing to paste0 in your code are vectors and here is the behavior of paste and paste0 with vectors:
> paste0(c("A","A","A"))
[1] "A" "A" "A"
To concatenate the elements of a single vector into one string, you need the collapse argument:
> paste0(c("A","A","A"), collapse="")
[1] "AAA"
Consequently, your code should have been:
> apply(mydf, 1, paste0, collapse="")
[1] "AAA" "BBB" "CCC" "DDD" "EEE" "FFF" "GGG" "HHH" "III" "JJJ" "KKK" "LLL" "MMM" "NNN" "OOO" "PPP" "QQQ" "RRR" "SSS" "TTT" "UUU" "VVV"
[23] "WWW" "XXX" "YYY" "ZZZ"

We can use do.call with paste0 for faster execution:
mydf[, "Concat"] <- do.call(paste0, mydf)
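
To see why this works: a data frame is a list of columns, so do.call(paste0, mydf) is the same call as paste0(mydf$a, mydf$b, mydf$c), which pastes element-wise. A minimal sketch comparing it with the row-wise apply version (the names by_row and by_col are only illustrative):

```r
mydf <- data.frame(a = LETTERS, b = LETTERS, c = LETTERS)

# Row-wise: apply() hands each row to paste0() as one vector,
# so collapse = "" is needed to join it into a single string.
by_row <- apply(mydf, 1, paste0, collapse = "")

# Column-wise: the data frame is a list of columns, so this is
# equivalent to paste0(mydf$a, mydf$b, mydf$c).
by_col <- do.call(paste0, mydf)

# Both approaches give the same strings.
identical(unname(by_row), by_col)
```

The column-wise call avoids the conversion to a matrix and the per-row function calls that apply performs, which is the likely reason it tends to be faster.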

Related

How to extract unique letters among word of consecutive letters?

For example, take the character string x = "AAATTTGGAA".
What I want to achieve is to split x into runs of consecutive identical letters: "AAA", "TTT", "GG", "AA".
The unique letter of each chunk is then "A", "T", "G", "A", so the expected output is ATGA.
How can I get this?

Here is a useful regex trick approach:
x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out
[1] "AAA" "TTT" "GG" "AA"
The regex pattern used here says to split at any boundary where the preceding and following characters are different.
(?<=(.)) lookbehind and also capture preceding character in \1
(?!\\1) then lookahead and assert that following character is different

You can split each character in the string. Use rle to find consecutive runs and select only the unique ones.
x <- "AAATTTGGAA"
vec <- unlist(strsplit(x, ''))
rle(vec)$values
#[1] "A" "T" "G" "A"
paste0(rle(vec)$values, collapse = '')
#[1] "ATGA"
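
As a small extension of the rle approach, the run lengths can be used to rebuild the chunks themselves. This is only a sketch; the names runs, chunks and collapsed are illustrative:

```r
x <- "AAATTTGGAA"
vec <- strsplit(x, "")[[1]]

# rle() stores one value per run plus the length of that run.
runs <- rle(vec)

# Repeat each letter by its run length to recover the chunks.
chunks <- strrep(runs$values, runs$lengths)
chunks
# [1] "AAA" "TTT" "GG"  "AA"

# Keep one letter per run and glue them together.
collapsed <- paste0(runs$values, collapse = "")
collapsed
# [1] "ATGA"
```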

We can use regmatches with the pattern (.)\\1+ as below
> regmatches(x,gregexpr("(.)\\1+",x))[[1]]
[1] "AAA" "TTT" "GG" "AA"
or if you need the unique letters only
> gsub("(.)\\1+", "\\1", x)
[1] "ATGA"
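
One caveat worth checking: (.)\\1+ only matches runs of two or more identical letters, so a letter that occurs once is skipped by regmatches (gsub is unaffected, because unmatched letters are left in place). A sketch with a hypothetical input y, using perl = TRUE and the pattern (.)\\1* to also pick up single letters:

```r
y <- "ABBC"  # hypothetical input containing single-letter runs

# Runs of length >= 2 only: the lone "A" and "C" are not returned.
only_repeats <- regmatches(y, gregexpr("(.)\\1+", y, perl = TRUE))[[1]]
only_repeats
# [1] "BB"

# Allowing zero repetitions of the captured letter returns every run.
all_runs <- regmatches(y, gregexpr("(.)\\1*", y, perl = TRUE))[[1]]
all_runs
# [1] "A"  "BB" "C"

# gsub() leaves unmatched letters untouched, so it needs no change.
dedup <- gsub("(.)\\1+", "\\1", y, perl = TRUE)
dedup
# [1] "ABC"
```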

split vector of strings with partial match

If I have a list of some elements:
x = c('abc', 'bbc', 'cd', 'hj', 'aa', 'zz', 'd9', 'jk')
I'd like to split it every time there's an 'a' to create a nested list:
[1][[1]] 'abc', 'bbc', 'cd', 'hj'
[2][[1]] 'aa', 'zz', 'd9', 'jk'
I tried
split(x, 'a')
but split doesn't look for partial matches.

We can build a grouping vector: grepl marks each element that contains the substring 'a' (a logical vector), and cumsum turns those marks into a running group number that increases at every match. That vector is then passed to split:
split(x, cumsum(grepl('a', x)))
#$`1`
#[1] "abc" "bbc" "cd" "hj"
#$`2`
#[1] "aa" "zz" "d9" "jk"
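
The intermediate steps can be inspected one at a time. A minimal sketch (starts, grp and res are illustrative names):

```r
x <- c("abc", "bbc", "cd", "hj", "aa", "zz", "d9", "jk")

# TRUE wherever an element contains "a", i.e. where a new group starts.
starts <- grepl("a", x, fixed = TRUE)
starts
# [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

# The running count of TRUEs gives one group id per element.
grp <- cumsum(starts)
grp
# [1] 1 1 1 1 2 2 2 2

res <- split(x, grp)
```

Note that if the first element did not contain 'a', the leading elements would land in a group named "0".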

Another base R solution using split + findInterval (the code is not as short as the answer by @akrun)
split(x,findInterval(seq_along(x),grep("a",x)))
such that
> split(x,findInterval(seq_along(x),grep("a",x)))
$`1`
[1] "abc" "bbc" "cd" "hj"
$`2`
[1] "aa" "zz" "d9" "jk"
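
For readers unfamiliar with findInterval, here is the same call broken into steps. This is a sketch; breaks and idx are illustrative names:

```r
x <- c("abc", "bbc", "cd", "hj", "aa", "zz", "d9", "jk")

# Positions of the elements that contain "a": these start each group.
breaks <- grep("a", x)
breaks
# [1] 1 5

# For every position, count how many group starts are at or before it.
idx <- findInterval(seq_along(x), breaks)
idx
# [1] 1 1 1 1 2 2 2 2

res <- split(x, idx)
```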

Another base R possibility could be:
split(x, cumsum(nchar(sub("a", "", x, fixed = TRUE)) - nchar(x) != 0))
$`1`
[1] "abc" "bbc" "cd" "hj"
$`2`
[1] "aa" "zz" "d9" "jk"

Misunderstanding sub function

I'm working on a Shiny app that loops through an html file replacing an instance of a phrase with a different phrase relative to its position.
That is, the first time "aa" comes, I put "bluh",
the second time "aa" comes, I put "gfgf".
I have a table of all the 2nd phrases in order.
I think I'm misunderstanding the sub function documentation:
The two *sub functions differ only in that sub replaces only the first
occurrence of a pattern whereas gsub replaces all occurrences.
Here is a minimal reproducible example:
tt <- c("aa", "aa","bb","aa")
sub("aa","test",tt)
# [1] "test" "test" "bb" "test"
gsub("aa","test",tt)
# [1] "test" "test" "bb" "test"
tt
# [1] "aa" "aa" "bb" "aa"
I expected
sub("aa","test",tt)
# [1] "test" "aa" "bb" "aa"
so that I could loop through and go:
og.list <- c("aa","cat","aa","cat","aa")
repl.list <- c("the","is","happy")
for(i in 1:3){
og.list <- sub("aa",repl.list[i], og.list)
}
Instead, all "aa" become "the". I thought that was what gsub did, but both sub and gsub behave this way.
Thank you.

I think you might want just this:
og.list[og.list == "aa"] <- repl.list
#[1] "the" "cat" "is" "cat" "happy"
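
The behaviour in the question follows from sub being vectorised: "first occurrence" means the first match within each element, not the first matching element of the vector. A short sketch of the difference, where the single string one is a hypothetical example:

```r
tt <- c("aa", "aa", "bb", "aa")

# Each element is processed separately, and each has its own first match.
per_element <- sub("aa", "test", tt)
per_element
# [1] "test" "test" "bb"   "test"

# Within a single string, sub() and gsub() do differ.
one <- "aa aa bb aa"
first_only <- sub("aa", "test", one)
first_only
# [1] "test aa bb aa"

every_match <- gsub("aa", "test", one)
every_match
# [1] "test test bb test"
```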

Thank you Wiktor^.
I now understand that I would need to separate each item into its own string and then sub.
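og.list <- c("aa","cat","aa","cat","aa")
repl.list <- c("the","is","happy")
# positions of the elements that contain "aa"
og.index <- grep("aa",og.list)
for(i in seq_along(og.index)){
  curr.index <- og.index[i]
  # replace only within the i-th matching element
  og.list[curr.index] <- sub("aa",
                             repl.list[i],
                             og.list[curr.index])
}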
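
If the text is held as one long string rather than a vector of separate items, the replacement form of regmatches can assign a different value to each match in order. This is a sketch under that assumption; html is a hypothetical stand-in for the file contents:

```r
html <- "aa cat aa cat aa"  # hypothetical stand-in for the file contents
repl.list <- c("the", "is", "happy")

# Locate every occurrence of "aa" in the string.
m <- gregexpr("aa", html, fixed = TRUE)

# Assign the replacements to the matches in order of appearance.
# The value is a list with one character vector per input string.
regmatches(html, m) <- list(repl.list)
html
# [1] "the cat is cat happy"
```

The number of replacements should equal the number of matches for the result to be meaningful.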

Split a list whose elements are multiple element lists

Say I have a list a which is defined as:
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")
I want to split this list by semicolon ;, get only unique values, and return another list. So far I have split the list using str_split():
a <- str_split(a, ";")
which gives me
> a
[[1]]
[1] "aaa" "bbb"
[[2]]
[1] "aaa"
[[3]]
[1] "bbb"
[[4]]
[1] "aaa" "ccc"
How can I manipulate this list (using unique()?) to give me something like
[[1]]
[1] "aaa"
[[2]]
[1] "bbb"
[[3]]
[1] "ccc"
or more simply,
[[1]]
[1] "aaa" "bbb" "ccc"

One option is to use list() with unique() and unlist() inside your list.
# So first you use your code
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")
# Load required library
library(stringr) # load str_split
a <- str_split(a, ";")
# Finally use list() with unique() and unlist()
list(unique(unlist(a)))
# And the output
[[1]]
[1] "aaa" "bbb" "ccc"

One alternative in base R is to use rapply, which applies a function to each of the innermost elements of a nested list and, by default, returns the most simplified object possible. Here, it returns a character vector.
unique(rapply(a, strsplit, split=";"))
[1] "aaa" "bbb" "ccc"
To return a list, wrap the output in list
list(unique(rapply(a, strsplit, split=";")))
[[1]]
[1] "aaa" "bbb" "ccc"
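
For comparison, the same result can be sketched with strsplit alone, which also covers the first desired output (one list element per unique value). The names parts, u, as_one and as_many are illustrative:

```r
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")

# Flatten the list to a character vector, then split on the literal ";".
parts <- strsplit(unlist(a), ";", fixed = TRUE)

# Pool all pieces and drop duplicates.
u <- unique(unlist(parts))
u
# [1] "aaa" "bbb" "ccc"

as_one  <- list(u)     # a list holding one vector of all unique values
as_many <- as.list(u)  # a list with one element per unique value
```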

In R, how can a string be split without using a separator

I am trying to use the split method. I have a string of exactly two characters and I want to get its second character.
Example:
string = "AC"
The result should be a split after the first letter ("A"), so that I get:
res= [,1] [,2]
[1,] "A" "C"
I tried it with split, but I have no idea how to split after the first element.

strsplit() will do what you want (if I understand your question). You need to split on "" to split the string into its individual characters. Here is an example showing how to do this on a vector of strings:
strs <- rep("AC", 3) ## your string repeated 3 times
Next, split each of the three strings:
sstrs <- strsplit(strs, "")
which produces
> sstrs
[[1]]
[1] "A" "C"
[[2]]
[1] "A" "C"
[[3]]
[1] "A" "C"
This is a list, so we can process it with lapply() or sapply(). We need to subset each element of sstrs to select the second element. For this we apply the [ function:
sapply(sstrs, `[`, 2)
which produces:
> sapply(sstrs, `[`, 2)
[1] "C" "C" "C"
If all you have is one string, then
strsplit("AC", "")[[1]][2]
which gives:
> strsplit("AC", "")[[1]][2]
[1] "C"

split isn't used for this kind of string manipulation. What you're looking for is strsplit, which in your case would be used something like this:
strsplit(string,"",fixed = TRUE)
You may not need fixed = TRUE, but it's a habit of mine as I tend to avoid regular expressions. You seem to indicate that you want the result to be something like a matrix. strsplit will return a list, so you'll want something like this:
strsplit(string,"",fixed = TRUE)[[1]]
and then pass the result to matrix.
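
A sketch of that last step, producing the one-row matrix shown in the question. The vector strs and the rbind variant for several strings are assumptions added for illustration:

```r
string <- "AC"

# Split into single characters and lay them out as one row.
res <- matrix(strsplit(string, "", fixed = TRUE)[[1]], nrow = 1)
dim(res)
# [1] 1 2

# For several strings of equal length, bind one row per string.
strs <- c("AC", "GT", "TA")  # hypothetical inputs
m <- do.call(rbind, strsplit(strs, "", fixed = TRUE))
m[, 2]  # second character of each string
# [1] "C" "T" "A"
```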

If you are sure that it is always a two-character string (check with all(nchar(x)==2)) and you want only the second character, then you could use sub or substr:
x <- c("ab", "12")
sub(".", "", x)
# [1] "b" "2"
substr(x, 2, 2)
# [1] "b" "2"
