Concatenating strings with - r

I have a data frame with several variables. I want to create a string by concatenating the variable names, with something else in between them...
Here is a simplified example (the number of variables is reduced to only 3, whereas I actually have many)
Making up some data frame
df1 <- data.frame(1,2,3) # A one row data frame
names(df1) <- c('Location1','Location2','Location3')
Actual code...
len1 <- ncol(df1)
string1 <- 'The locations that we are considering are'
for(i in 1:(len1-1)) string1 <- c(string1,paste(names(df1[i]),sep=','))
string1 <- c(string1,'and',paste(names(df1[len1]),'.'))
string1
This gives...
[1] "The locations that we are considering are"
[2] "Location1"
[3] "Location2"
[4] "Location3 ."
But I want
The locations that we are considering are Location1, Location2 and Location3.
I am sure there is a much simpler method which some of you would know...
Thank you for your time...

Are you looking for the collapse argument of paste?
> paste (letters [1:3], collapse = " and ")
[1] "a and b and c"

The fact that these are names of a data.frame does not really matter, so I've pulled that part out and assigned them to a variable strs.
strs <- names(df1)
len1 <- length(strs)
string1 <- paste("The locations that we are considering are ",
paste(strs[-len1], collapse=", ", sep=""),
" and ",
strs[len1],
".\n",
sep="")
This gives
> cat(string1)
The locations that we are considering are Location1, Location2 and Location3.
Note that this will not give sensible English if there is only 1 element in strs.
The idea is to collapse all but the last string with comma-space between them, and then paste that together with the boilerplate text and the last string.
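To cover the single-element case mentioned above, the same idea can be wrapped in a small helper. This is only a sketch: the function name join_with_and is made up for illustration, and returning an empty string for empty input is an assumption.

```r
# Hypothetical helper: join strings with ", " and put " and " before the last one.
join_with_and <- function(strs) {
  n <- length(strs)
  if (n == 0) return("")    # assumption: empty input gives an empty string
  if (n == 1) return(strs)  # a single element needs no separator
  # collapse all but the last element, then attach the last one with " and "
  paste(paste(strs[-n], collapse = ", "), strs[n], sep = " and ")
}

strs <- c("Location1", "Location2", "Location3")
string1 <- paste0("The locations that we are considering are ",
                  join_with_and(strs), ".")
```

With one element the helper returns that element unchanged, so the surrounding sentence may still need a singular form.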

If your main goal is to print the results to the screen (or other output) then use the cat function (whose name derives from concatenate):
> cat(names(iris), sep=' and '); cat('\n')
Sepal.Length and Sepal.Width and Petal.Length and Petal.Width and Species
If you need a variable with the string, then you can use paste with the collapse argument. The sprintf function can also be useful for inserting strings into other strings (or numbers into strings).
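As a rough illustration of combining paste (with collapse) and sprintf, here is a minimal sketch. The vector locs and the sentence wording are made up for the example.

```r
# Hypothetical data for illustration
locs <- c("Location1", "Location2", "Location3")

# paste() with collapse builds one string; sprintf() inserts it (and a count) into a template
msg <- sprintf("We are considering %d locations: %s.",
               length(locs), paste(locs, collapse = ", "))
cat(msg, "\n")
```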

Another option would be:
library(stringr)
nms <- names(df1)
str_c("The locations that we are considering are ", str_c(str_c(nms[-length(nms)], collapse=", "), nms[length(nms)], sep=" and "))

Related

Read txt file into list where each list element is delimited by row ending with colon

I've got the following .txt structure
test <- "A n/a:
4001
Exam date:
2020-01-01 15:38
Pos (deg):
18.19
18.37"
I'd like to read this into a list, where each list element is given the name of the row ending with a colon, and the values are given by the following rows. (see: expected output).
Challenges
The number of rows (the length of each list element) can differ. There can be special characters (e.g., "A n/a") and there is the date time value which contains a pesky colon.
My problem
My current solution (see below) is unsafe, because I cannot be sure that I have a full list of all expected elements - the file might contain unexpected list elements which I would then not capture, or worse, they would mess up the entire data.
What I tried
I tried reading the txt as JSON with jsonlite::fromJSON, because the structure somewhat resembles it, but this gave an error about an unexpected character.
I tried to read into a single string and split, but this leaves me, again, with all values in a single list element:
readr::read_file(test)
strsplit(test, split = ":\n")
My current approach is to read this in with read.csv2 and generate a lookup on the (expected) row names, create a vector for splitting and using the first element of the resulting list for naming.
myfile <- read.csv2(text = test,
header = FALSE)
lu <- paste(c("A n", "date", "Pos"), collapse = "|")
ls_file <- split(myfile$V1, cumsum(grepl(lu, myfile$V1, ignore.case = TRUE)))
names(ls_file) <- unlist(lapply(ls_file, function(x) x[1]))
ls_file <- lapply(ls_file, function(x) x <- x[2:length(x)])
## expected output is a named list
## The spaces and backticks below do not really bother me,
## but I would get rid of them in a next step.
ls_file
#> $`A n/a:`
#> [1] " 4001"
#>
#> $`Exam date:`
#> [1] " 2020-01-01 15:38"
#>
#> $`Pos (deg):`
#> [1] "18.19" "18.37"
Assuming the name of each element ends with :, then we can:
res <- readLines(textConnection(test))
res <- split(res, cumsum(endsWith(res, ':')))
res <- setNames(lapply(res, `[`, -1), sapply(res, `[`, 1))
# > res
# $`A n/a:`
# [1] " 4001"
#
# $`Exam date:`
# [1] " 2020-01-01 15:38"
#
# $`Pos (deg):`
# [1] "18.19" "18.37"
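The question mentions cleaning up the names in a later step. A possible variant of the approach above that trims whitespace and drops the trailing colon could look like this. It is a sketch that makes the same assumption (every name line ends with ":") and uses strsplit on the example string instead of a text connection.

```r
test <- "A n/a:\n4001\nExam date:\n2020-01-01 15:38\nPos (deg):\n18.19\n18.37"

# one element per line, surrounding whitespace removed
lines <- trimws(strsplit(test, "\n", fixed = TRUE)[[1]])

# a new group starts at every line that ends with a colon
groups <- split(lines, cumsum(endsWith(lines, ":")))

# first line of each group is the name (minus the colon), the rest are the values
nms <- sub(":$", "", vapply(groups, `[`, character(1), 1, USE.NAMES = FALSE))
res <- setNames(lapply(groups, `[`, -1), nms)
```

The date-time value is not treated as a name because it contains a colon but does not end with one.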

Replacing given characters to new ones before a defined parameter in gsub function

I am fairly new to R and I am struggling with a problem. I want to replace all the underscores that occur before the "S11" pattern with dashes ("-"). S11 is just an example; the number varies in my table (e.g. S29, S30). Here is the code that I am using, which fails:
foo <- c("H2_2months_S11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_", "H2_2months_with_acetate_S101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_", "Formate_3months_S99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_")
Sample <- gsub(pattern="*(_S)", replacement="-", x=foo)
Getting:
[1] "H2_2months-11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[2] "H2_2months_with_acetate-101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[3] "Formate_3months-99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
I also don't want "_S" to be deleted and replaced. I use "_S[0-9]" as a matching criteria and before "_S", the underscores should be changed to "-".
Also please recommend me a good website that I can learn those "codes or signs" using in this function. Thanks in advance.
Expected output:
[1] "H2-2months-S11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[2] "H2-2months-with-acetate-S101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[3] "Formate-3months-S99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
This should work.
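Basically we divide the job into three steps: first we mark the match by replacing the "_" in front of "S" plus digits ("_(S[0-9]+)") with "-", then we split the resulting string at that "-", and finally we use gsub to fix all the "_" in the part before the match. Note that this relies on the strings not containing any other "-".
foo <- c("H2_2months_S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_")
foo <- gsub(pattern="_(S[0-9]+)", replacement="-\\1", x=foo)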
#foo
#[1] "H2_2months-S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
Then we split:
split <- unlist(strsplit(foo, "-")) # split using the new "-"
#split
#[1] "H2_2months"
#[2] "S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
Now we can use simple gsub on everything except the last element in split.
split_1 <- split[-length(split)] # fix all the "_" before the match (exclude the last)
split_1 <- gsub("_", "-", split_1)
Then we paste the results:
paste0(split_1, "-", split[length(split)]) # paste back together
#[1] "H2-2months-S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
Here in a function and with another example:
foo <- c("H2_2months_abc_456_S123_L001_R1_001")
my_foo <- function(s) {
s <- gsub(pattern="_(S[0-9]+)", replacement="-\\1", x=s)
split <- unlist(strsplit(s, "-"))
split_1 <- split[-length(split)]
split_1 <- gsub("_", "-", split_1)
paste0(split_1, "-", split[length(split)])
}
my_foo(foo)
#[1] "H2-2months-abc-456-S123_L001_R1_001"
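An alternative that avoids the intermediate split, and works on a whole vector at once, is to locate the first "_S" plus digits with regexpr and only touch the part of the string up to that point. This is a sketch, not the approach of the answer above; the function name dash_before_sample is made up, and it assumes only the first "_S<digits>" occurrence matters.

```r
# Hypothetical helper: replace "_" with "-" up to and including the
# underscore in front of the first "S<digits>" token.
dash_before_sample <- function(x) {
  pos <- regexpr("_S[0-9]+", x)   # position of the first "_S<digits>", -1 if absent
  hit <- pos > 0                  # strings without a match are left untouched
  head <- substr(x[hit], 1, pos[hit])                  # up to and including that "_"
  tail <- substr(x[hit], pos[hit] + 1, nchar(x[hit]))  # "S<digits>..." onwards
  x[hit] <- paste0(gsub("_", "-", head, fixed = TRUE), tail)
  x
}

foo <- c("H2_2months_S11_L001_R1_001",
         "H2_2months_with_acetate_S101_L001_R1_001",
         "Formate_3months_S99_L001_R1_001")
dash_before_sample(foo)
```

Because the strings are cut by position rather than split at "-", dashes already present in the input do not interfere.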
This will match the "_S11" and save S11 to the group. Then replace this with a "-" followed by the captured group "S11".
Sample <- gsub("_(S[0-9]+)", "-\\1", foo)
Excellent place to learn more regex: https://www.regular-expressions.info/quickstart.html
Excellent place to test regex with explanations of the matching: https://regexr.com/
Edit: Thanks RLave, didn't realise it could be any digits after the S. Updated answer.

how to remove duplicate words in a certain pattern from a string in R

I aim to remove duplicate words only in parentheses from string sets.
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
What I want to get is just like this
a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) pretty (sure|surely) you know (what|when) (you|her) should (do|)'
In order to get the result, I used a code like this
a = gsub('\\|', " | ", a)
a = gsub('\\(', "( ", a)
a = gsub('\\)', " )", a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))
However, it resulted in undesirable outputs.
a
[1] "I ( have | has ) certain words word worded"
[2] "( You | Youre ) can cans do this works worked"
[3] "I ( am | are ) sure surely you know what when her should do"
Why did my code remove parentheses located in the latter part of strings?
What should I do for the result I want?
We can use gsubfn. The idea is to select the characters inside the brackets: match the opening bracket (\\(; the bracket has to be escaped because it is a metacharacter) followed by one or more characters that are not a closing bracket ([^)]+), and capture those characters as a group. In the replacement function, we split the captured group (x) with strsplit, unlist the list output, keep the unique elements and paste them back together.
library(gsubfn)
gsubfn("\\(([^)]+)", ~paste0("(", paste(unique(unlist(strsplit(x,
"[|]"))), collapse="|")), a)
#[1] "I (have|has) certain (words|word|worded) certain"
#[2] "(You|Youre) (can|cans) do this (works|worked)"
#[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
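The same idea can also be expressed in base R with gregexpr and the replacement form of regmatches. The following is a sketch rather than a drop-in equivalent of the gsubfn call; the function name dedupe_groups is made up for illustration.

```r
# Hypothetical helper: remove duplicate alternatives inside each "( ... )" group.
dedupe_groups <- function(x) {
  m <- gregexpr("\\([^)]+\\)", x)  # every "(...)" group, including the brackets
  regmatches(x, m) <- lapply(regmatches(x, m), function(groups) {
    vapply(groups, function(g) {
      # strip the brackets, split on "|", keep the first occurrence of each word
      words <- strsplit(substr(g, 2, nchar(g) - 1), "|", fixed = TRUE)[[1]]
      paste0("(", paste(unique(words), collapse = "|"), ")")
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

a <- c('I (have|has|have) certain (words|word|worded|word) certain',
       '(You|You|Youre) (can|cans|can) do this (works|works|worked)',
       'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)')
dedupe_groups(a)
```

Text outside the brackets is never touched, so repeated words such as "certain" in the first string are kept.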
Take the answer above; it is more straightforward. But you can also try:
library(stringi)
library(stringr)
a_new <- gsub("[|]","-",a) # replace the | because it causes issues during the replacement later
a1 <- str_extract_all(a_new,"[(](.*?)[)]") # extract the "units"
# some magic using stringi::stri_extract_all_words()
a2 <- unlist(lapply(a1,function(x) unlist(lapply(stri_extract_all_words(x), function(y) paste(unique(y),collapse = "|")))))
# prepare replacement
names(a2) <- unlist(a1)
# replacement and finalization
str_replace_all(a_new, a2)
[1] "I (have|has) certain (words|word|worded) certain"
[2] "(You|Youre) (can|cans) do this (works|worked)"
[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
The idea is to extract the words within the brackets as a unit, then remove the duplicates and replace the old unit with the updated one.
A longer but more elaborate try:
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# blank output
new_a <- c()
for (sentence in 1:length(a)) {
split <- trim(unlist(strsplit(a[sentence],"[( )]")))
newsentence <- c()
for (i in split) {
j1 <- as.character(unique(trim(unlist(strsplit(gsub('\\|'," ",i)," ")))))
if (length(j1) == 0) next
if (length(j1) > 1) {
newsentence <- c(newsentence, paste0("(", paste(j1, collapse="|"), ")"))
} else {
newsentence <- c(newsentence, j1[1])
}
}
newsentence <- paste(newsentence,collapse=" ")
print(newsentence)
new_a <- c(new_a,newsentence)}
# [1] "I (have|has) certain (words|word|worded) certain"
# [2] "(You|Youre) (can|cans) do this (works|worked)"
# [3] "I (am|are) (sure|surely) you know (what|when) (you|her) should do"

data frame from character vector that contains three comma separated values in each row

How can I create the columns of a data frame from a long character vector that contains three comma-separated values in each row? The first element contains the names of the data frame columns.
Not every row has three values; in some places there is just a trailing comma:
> string.split.cols[1] #This row is the .names
[1] "Acronym,Full form,Remarks"
> string.split.cols[2]
[1] "AC,Actual Cost, "
> string.split.cols[3]
[1] "ACWP,Actual Cost of Work Performed,Old term for AC"
> string.split.cols[4]
[1] "ADM,Arrow Diagramming Method,Rarely used now"
> string.split.cols[5]
[1] "ADR,Alternative Dispute Resolution, "
> string.split.cols[6]
[1] "AE,Apportioned Effort, "
The output should be a df with three columns, I'm only interested in the first two columns and will throw out the third.
This is the original string; some columns are not comma escaped, but that isn't a big deal.
string.cols <- "Acronym,Full form,Remarks\nAC,Actual Cost, \nACWP,Actual Cost of Work Performed,Old term for AC\nADM,Arrow Diagramming Method,Rarely used now\nADR,Alternative Dispute Resolution, \nAE,Apportioned Effort, \nAOA,Activity-on-Arrow,Rarely used now\nAON,Activity-on-Node, \nARMA,Autoregressive Moving Average, \nBAC,Budget at Completion, \nBARF,Bought-into, Approved, Realistic, Formal,from Rita Mulcahy's PMP Exam Prep\nBCR,Benefit Cost Ratio, \nBCWP,Budgeted Cost of Work Performed,Old term for EV\nBCWS,Budgeted Cost of Work Scheduled,Old term for PV\nCA,Control Account, \nCBR,Cost Benefit Ratio, \nCBT,Computer-Based Test, \n..."
Have you tried the text input for read.csv?
df <- read.csv( text = string.split.cols, header = TRUE )
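For illustration, here is a small self-contained sketch of that call on the first few rows from the question, keeping only the first two columns. The vector name rows and the arguments stringsAsFactors and check.names are additions for the example, not part of the original suggestion.

```r
# First few rows from the question, one string per row (header first)
rows <- c("Acronym,Full form,Remarks",
          "AC,Actual Cost, ",
          "ACWP,Actual Cost of Work Performed,Old term for AC",
          "ADM,Arrow Diagramming Method,Rarely used now")

# check.names = FALSE keeps "Full form" as a column name instead of "Full.form"
df <- read.csv(text = rows, header = TRUE,
               stringsAsFactors = FALSE, check.names = FALSE)

# keep only the first two columns
df2 <- df[, 1:2]
```

Rows whose fields contain unquoted commas (such as the BARF entry in the original string) have more than three fields and may not parse as intended without quoting.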
I found this routine to be very fast for splitting a string and converting to a data frame.
slist<-strsplit(mylist,",")
x<-sapply(slist, FUN= function(x) {x[1]})
y<-sapply(slist, FUN= function(x) {x[2]})
df<-data.frame(Column1Name=x, Column2Name=y, stringsAsFactors = FALSE)
where mylist is your vector of strings to split.
You can use rbind.data.frame to do this, after splitting the string:
x <- do.call(rbind.data.frame, strsplit(string.split.cols[-1], ','))
names(x) <- strsplit(string.split.cols[1], ',')[[1]]
x
## Acronym Full form Remarks
## 1 AC Actual Cost
## 2 ACWP Actual Cost of Work Performed Old term for AC
## ...
As a one-liner:
setNames(do.call(rbind.data.frame,
strsplit(string.split.cols[-1], ',')
),
strsplit(string.split.cols[1], ',')[[1]]
)

Simple Comparing of two texts in R

I want to compare two texts for similarity, so I need a simple function that clearly lists, in order of appearance, the words and phrases occurring in both texts. These words/sentences should be highlighted or underlined for better visualization.
Based on #joris Meys' ideas, I added an array to divide the text into sentences and subordinate clauses.
This is how it looks:
textparts <- function (text){
textparts <- c("\\,", "\\.")
i <- 1
while(i<=length(textparts)){
text <- unlist(strsplit(text, textparts[i]))
i <- i+1
}
return (text)
}
textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")
commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
textparts1 <- gsub(x, "\\1*", textparts1,ignore.case=TRUE)
textparts2 <- gsub(x, "\\1*", textparts2,ignore.case=TRUE)
}
return(list(textparts1,textparts2))
However, sometimes it works and sometimes it doesn't.
I would like to get results like these:
> return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence" " whereas this is a dependent clause*" " This thing works*"
[[2]]
[1] "This could be a sentence" " whereas this is a dependent clause*" " Plagiarism is not cool" " This thing works*"
whereas I get no results.
There are some problems with the answer of #Chase:
differences in capitalization are not taken into account
punctuation can mess up the results
if more than one word is shared, you get a lot of warnings due to the gsub call.
Based on his idea, here is a solution that makes use of tolower() and some nice features of regular expressions:
compareSentences <- function(sentence1, sentence2) {
# split everything on "not a word" and put all to lowercase
x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
x2 <- tolower(unlist(strsplit(sentence2, "\\W")))
commonWords <- intersect(x1, x2)
#add word beginning and ending and put words between ()
# to allow for match referencing in gsub
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
# replace the match by the match with star added
sentence1 <- gsub(x, "\\1*", sentence1,ignore.case=TRUE)
sentence2 <- gsub(x, "\\1*", sentence2,ignore.case=TRUE)
}
return(list(sentence1,sentence2))
}
This gives the following result:
text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "
compareSentences(text1,text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"
[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
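The function above marks single words, while the question asks for whole sentences and clauses. A likely reason the code in the question finds nothing is that the split parts keep their leading space, so a pattern such as "\\<( whereas ...)\\>" asks for a word start in front of a space. A sketch that compares the trimmed, lower-cased parts directly and skips the regular expression step could look like this. The function names split_parts and mark_common_parts are made up for illustration.

```r
# Split a text on commas and full stops, dropping empty parts
split_parts <- function(text) {
  parts <- unlist(strsplit(text, "[,.]"))
  parts[nzchar(trimws(parts))]
}

# Append a marker to every part that occurs in both texts
# (comparison ignores case and surrounding whitespace)
mark_common_parts <- function(text1, text2, marker = "*") {
  p1 <- split_parts(text1)
  p2 <- split_parts(text2)
  key1 <- tolower(trimws(p1))
  key2 <- tolower(trimws(p2))
  common <- intersect(key1, key2)
  list(ifelse(key1 %in% common, paste0(p1, marker), p1),
       ifelse(key2 %in% common, paste0(p2, marker), p2))
}

res <- mark_common_parts(
  "This is a complete sentence, whereas this is a dependent clause. This thing works.",
  "This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")
```

Only exact matches of a whole part are marked; partially overlapping clauses are not detected by this sketch.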
I am sure that there are far more robust functions on the natural language processing page, but here's one solution using intersect() to find the common words. The approach is to read in the two sentences, identify the common words and gsub() them with a combination of the word and a moniker of our choice. Here I chose to use *, but you could easily change that, or add something else.
sent1 <- "I shot the sheriff."
sent2 <- "Dick Cheney shot a man."
compareSentences <- function(sentence1, sentence2) {
sentence1 <- unlist(strsplit(sentence1, " "))
sentence2 <- unlist(strsplit(sentence2, " "))
commonWords <- intersect(sentence1, sentence2)
return(list(
sentence1 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence1), collapse = " ")
, sentence2 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence2), collapse = " ")
))
}
> compareSentences(sent1, sent2)
$sentence1
[1] "I shot* the sheriff."
$sentence2
[1] "Dick Cheney shot* a man."

Resources