Remove strings that contain a colon in R - r

This is an example excerpt of my data set. It looks as follows:
Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id_234;2018/03/02
I want to delete those words which contain a colon. In this case, these would be wa119:d, ax21:3 and bC230:13, so that my new data set looks as follows:
Description;ID;Date
Here comes the first row;id_112;2018/03/02
Here comes the second row;id_115;2018/03/02
Here comes the third row;id_234;2018/03/02
Unfortunately, I was not able to find a regular expression or a solution with gsub. Can anyone help?

Here's one approach:
## reading in your data
dat <- read.table(text ='
Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id:234;2018/03/02
', sep = ';', header = TRUE, stringsAsFactors = FALSE)
## \\w+ = one or more word characters
gsub('\\w+:\\w+\\s+', '', dat$Description)
## [1] "Here comes the first row"
## [2] "Here comes the second row"
## [3] "Here comes the third row"
More info on \\w, a shorthand character class that is the same as [A-Za-z0-9_]: https://www.regular-expressions.info/shorthand.html
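To see what the pattern does and does not touch, here is a small sketch on a made-up vector (the strings below are hypothetical, not from the question). Because \\w does not match punctuation such as a hyphen, a token like ab-1:2 is only partly removed. A \\S-based pattern is one possible alternative, assuming the goal is to drop every whitespace-delimited token that contains a colon.

```r
# Hypothetical test strings, not taken from the question
x <- c("wa119:d Here comes the first row",
       "no colon in this one",
       "ab-1:2 token with a hyphen")

# Pattern from the answer: word chars, a colon, word chars, trailing whitespace
res_w <- gsub("\\w+:\\w+\\s+", "", x)

# Alternative (an assumption about what is wanted): drop any
# whitespace-delimited token that contains a colon
res_s <- gsub("\\S*:\\S*\\s*", "", x)

print(res_w)
print(res_s)
```

Note that the \\S-based variant would also remove a colon-containing token in the middle or at the end of a string, which may or may not be desired.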

Supposing the column you want to modify is dat:
dat <- c("wa119:d Here comes the first row",
"ax21:3 Here comes the second row",
"bC230:13 Here comes the third row")
Then you can take each element, split it into words, remove the words containing a colon, and then paste what's left back together, yielding what you want:
dat_colon_words_removed <- unlist(lapply(dat, function(string){
words <- strsplit(string, split=" ")[[1]]
words <- words[!grepl(":", words)]
paste(words, collapse=" ")
}))
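As a possible variation on the same idea, the sketch below wraps the steps in a helper and uses vapply instead of unlist(lapply(...)). The helper name drop_colon_words is made up for illustration; fixed = TRUE is used because both the space and the colon are literal characters here, not regular expressions.

```r
dat <- c("wa119:d Here comes the first row",
         "ax21:3 Here comes the second row",
         "bC230:13 Here comes the third row")

# Hypothetical helper: split on spaces, drop words containing ":", re-join
drop_colon_words <- function(string) {
  words <- strsplit(string, split = " ", fixed = TRUE)[[1]]
  paste(words[!grepl(":", words, fixed = TRUE)], collapse = " ")
}

# vapply checks that every result is a single character string
cleaned <- vapply(dat, drop_colon_words, character(1), USE.NAMES = FALSE)
print(cleaned)
```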

Another solution, which exactly matches the expected result from the OP (whole lines, including the other columns), could be:
#data
df <- read.table(text = "Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id:234;2018/03/02", stringsAsFactors = FALSE, sep="\n")
gsub("[a-zA-Z0-9]+:[a-zA-Z0-9]+\\s", "", df$V1)
#[1] "Description;ID;Date"
#[2] "Here comes the first row;id_112;2018/03/02"
#[3] "Here comes the second row;id_115;2018/03/02"
#[4] "Here comes the third row;id:234;2018/03/02"

Related

How to randomly reshuffle letters in words

I am trying to make a word scrambler in R. I have put some words in a collection and tried to use strsplit() to split each word in the collection into its letters.
But I don't understand how to jumble the letters of a word and merge them back into one word in R. Does anyone know how I can solve this?
Once you've split the words, you can use sample() to shuffle the letters, and then paste0() with collapse="" to concatenate them back into a 'word':
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))
You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"
Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlist to get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")
You can use the sample function for this.
Here is an example of doing it for a single word. You can use this within your for-loop:
yourword <- "hello"
# split: Split will return a list with one char vector in it.
# We only want to interact with the vector not the list, so we extract the first
# (and only) element with "[[1]]"
jumble <- strsplit(yourword, "")[[1]]
jumble <- sample(jumble,                 # sample random elements from jumble
                 size = length(jumble),  # as many times as the length of jumble,
                                         # ergo all letters
                 replace = FALSE         # do not sample an element multiple times
)
restored <- paste0(jumble,
                   collapse = ""         # join the letters with no separator
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps the understanding of what R is actually doing here.
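Putting the pieces from the answers above together, a minimal sketch could look like the following. The helper name scramble_word is hypothetical, and set.seed is only there so that repeated runs give the same shuffle; the actual scrambled strings depend on the R version's random number generator, so no output is shown.

```r
# Hypothetical helper: split a word into letters, shuffle them, glue them back
scramble_word <- function(word) {
  letters_vec <- strsplit(word, "")[[1]]        # single characters of the word
  paste0(sample(letters_vec), collapse = "")    # random permutation, re-joined
}

set.seed(42)  # for reproducibility only
strings <- c("life", "love", "mystery")

# One scrambled string per input word
scrambled <- vapply(strings, scramble_word, character(1), USE.NAMES = FALSE)
print(scrambled)
```

Each result contains exactly the letters of its input word, just in a random order (which can occasionally be the original order).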

Why does paste() concatenate list elements in the wrong order?

Given the following string:
my.str <- "I welcome you my precious dude"
One splits it:
my.splt.str <- strsplit(my.str, " ")
And then concatenates:
paste(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6], sep = " ")
The result is:
[1] "I you precious" "welcome my dude"
When not using the colon operator it returns the correct order:
paste(my.splt.str[[1]][1], my.splt.str[[1]][2], my.splt.str[[1]][3], my.splt.str[[1]][4], my.splt.str[[1]][5], my.splt.str[[1]][6], sep = " ")
[1] "I welcome you my precious dude"
Why is this happening?
paste is designed to work with vectors element-by-element. Say you did this:
names <- c('Alice', 'Bob', 'Charlie')
paste('Hello', names)
You'd want the result to be [1] "Hello Alice" "Hello Bob" "Hello Charlie", rather than "Hello Hello Hello Alice Bob Charlie".
To make it work like you want it to, rather than giving the different sections to paste as separate arguments, you could first combine them into a single vector with c:
paste(c(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6]), collapse = " ")
## [1] "I welcome you my precious dude"
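The difference between sep, collapse and combining with c() can be seen side by side in a small sketch. The vectors a and b below are illustrative stand-ins for two of the subsets from the question.

```r
a <- c("I", "welcome")
b <- c("you", "my")

# sep joins the i-th elements of each argument: one result per position
by_position <- paste(a, b, sep = " ")

# collapse then joins those per-position results into a single string
collapsed <- paste(a, b, sep = " ", collapse = " ")

# Combining first with c() keeps the original word order
in_order <- paste(c(a, b), collapse = " ")

# A shorter argument is recycled to the length of the longest one
recycled <- paste("Hello", c("Alice", "Bob"))

print(by_position)  # "I you" "welcome my"
print(collapsed)    # "I you welcome my"
print(in_order)     # "I welcome you my"
print(recycled)     # "Hello Alice" "Hello Bob"
```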
We can use collapse instead of sep:
paste(my.splt.str[[1]], collapse= ' ')
With the OP's first approach, paste joins the corresponding elements of each subset: the first elements together, then the second elements together.
If we want to paste selectively, it helps to first store the split vector in an object, so that the repeated [[ extraction can be avoided:
v1 <- my.splt.str[[1]]
v1[3:4] <- toupper(v1[3:4])
paste(v1, collapse=" ")
#[1] "I welcome YOU MY precious dude"
When we pass multiple arguments to paste, it pastes their corresponding elements together:
paste(v1[1:2], v1[3:4])
#[1] "I you" "welcome my"
If we use collapse, the result is a single string, but the order is still different, because the first element of v1[1:2] is pasted with the first element of v1[3:4], and the second with the second:
paste(v1[1:2], v1[3:4], collapse = ' ')
#[1] "I you welcome my"
It is documented in ?paste
paste converts its arguments (via as.character) to character strings, and concatenates them (separating them by the string given by sep). If the arguments are vectors, they are concatenated term-by-term to give a character vector result. Vector arguments are recycled as needed, with zero-length arguments being recycled to "".
Also, converting to uppercase can be done on a substring without splitting as well
sub("^(\\w+\\s+\\w+)\\s+(\\w+\\s+\\w+)", "\\1 \\U\\2", my.str, perl = TRUE)
#[1] "I welcome YOU MY precious dude"

How to extract first 2 words from a string in R?

I need to extract first 2 words from a string. If the string contains more than 2 words, it should return the first 2 words else if the string contains less than 2 words it should return the string as it is.
I've tried using 'word' function from stringr package but it's not giving the desired output for cases where len(string) < 2.
word(dt$var_containing_strings, 1,2, sep=" ")
Example:
Input String: Auto Loan (Personal)
Output: Auto Loan
Input String: Others
Output: Others
If you want to use stringr::word(), you can do:
ifelse(is.na(word(x, 1, 2)), x, word(x, 1, 2))
[1] "Auto Loan" "Others"
Sample data:
x <- c("Auto Loan (Personal)", "Others")
Something like this?
a <- "this is a character string"
unlist(strsplit(a, " "))[1:2]
[1] "this" "is"
EDIT:
To add the part where the original string is returned if the number of words is less than 2, a simple if-else statement can be used:
a <- "this is a character string"
words <- unlist(strsplit(a, " "))
if (length(words) > 2) {
  # paste the first two words back into a single string
  paste(words[1:2], collapse = " ")
} else {
  a
}
You could use regex in base R using sub
sub("(\\w+\\s+\\w+).*", "\\1", "Auto Loan (Personal)")
#[1] "Auto Loan"
which will also work if you have only one word in the text
sub("(\\w+\\s+\\w+).*", "\\1", "Auto")
#[1] "Auto"
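Explanation:
Here we capture the pattern inside the round brackets, (\\w+\\s+\\w+), which means:
\\w+ (one word) followed by \\s+ (whitespace) followed by \\w+ (another word), so in total we capture two words. The trailing .* consumes the rest of the string, and the backreference \\1 in sub replaces the whole match with just the two captured words. If the string has fewer than two words, the pattern does not match and sub returns the string unchanged.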
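Since sub is vectorised, the same call can be applied to a whole column at once. A short sketch, where the vector x is illustrative (the third entry is made up):

```r
x <- c("Auto Loan (Personal)", "Others", "Home Loan")

# Keep only the first two words; strings without a match are left as they are
first_two <- sub("(\\w+\\s+\\w+).*", "\\1", x)
print(first_two)
```

This assumes that words consist of word characters only; a hyphenated first word, for example, would not be treated as a single word by \\w+.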

Choose a pattern which will select only WHOLE words which start with an r, s, or t regardless of case

I don't know what to put for ptrn
Choose a pattern which will select only WHOLE words which start with an r, s, or t regardless of case.
ptrn <- "" # EDIT THIS LINE
reg <- gregexpr(ptrn, plath) # DO NOT EDIT THIS LINE
(rst_words <- Reduce("c",regmatches(x = plath, m = reg))) # DO NOT EDIT THIS LINE
Try:
pattern = "\\b[rstRST]\\w+"
\\b is a word boundary, [rstRST] matches a single letter from the brackets at the start of the word, and \\w+ matches the remaining word characters.
See the regex working at Regex101
You did not share an example; however, you can try grep after splitting the string into words.
x <- "Random text as an example reading where it ended"
grep("^[RST]",strsplit(x, " ")[[1]], value = TRUE, ignore.case = TRUE)
#[1] "Random" "text" "reading"
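To check the suggested pattern against the exact code from the question, the sketch below plugs it into the gregexpr/regmatches lines. The contents of plath are not shown in the question, so the vector used here is a made-up stand-in.

```r
# 'plath' is not shown in the question; this vector is a hypothetical stand-in
plath <- c("The moon has nothing to be sad about",
           "Tulips are too excitable",
           "Stars stuck all over")

ptrn <- "\\b[rstRST]\\w+"
reg <- gregexpr(ptrn, plath)
rst_words <- Reduce("c", regmatches(x = plath, m = reg))
print(rst_words)
```

Letters in the middle of a word (the "t" in "nothing", the "r" in "are") are skipped because there is no word boundary in front of them. Note that \\w+ requires at least one more character after the first letter, so a one-letter word would not be matched; \\w* would be one way to allow that, if it is wanted.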

Looping through and replacing text in a data frame

I have a dataframe that consists of a variable with multiple words, such as:
variable
"hello my name is this"
"greetings friend"
And another dataframe that consists of two columns, one of which is words, the other of which is replacements for those words, such as:
word          replacement
"hello"       "hi"
"greetings"   "hi"
I'm trying to find an easy way to replace the words in "variable" with the replacement words, looping over both all the observations, and all the words in each observation. The desired result is:
variable
"hi my name is this"
"hi friend"
I've looked into some methods that use cSplit, but it's not feasible for my application (there are too many words in any given observation of "variable", so this creates too many columns). I'm not sure how I would use strsplit for this, but am guessing that is the correct option?
EDIT: From my understanding of this question, my question may be a repeat of a previously unanswered question: Replace strings in text based on dictionary
stringr's str_replace_all would be handy in this case:
df = data.frame(variable = c('hello my name is this','greetings friend'))
replacement <- data.frame(word = c('hello','greetings'), replacment = c('hi','hi'), stringsAsFactors = F)
stringr::str_replace_all(df$variable,replacement$word,replacement$replacment)
Output:
> stringr::str_replace_all(df$variable,replacement$word,replacement$replacment)
[1] "hi my name is this" "hi friend"
This is similar to @amrrs's solution, but I am using a named vector instead of supplying two separate vectors. This also addresses the issue mentioned by OP in the comments:
library(dplyr)
library(stringr)
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .) %>%
str_replace_all(df1$variable, .)
# [1] "hi my name is this" "hi friend" "hi, hellomy is not a word"
# [4] "hi! my friend"
This is the named vector with regex as the names and string to replace with as elements:
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .)
# \\bhello\\b \\bgreetings\\b
# "hi" "hi"
Data:
df1 = data.frame(variable = c('hello my name is this',
'greetings friend',
'hello, hellomy is not a word',
'greetings! my friend'))
df2 = data.frame(word = c('hello','greetings'),
replacement = c('hi','hi'),
stringsAsFactors = F)
Note:
In order to address the issue of root words also being converted, I wrapped the regex with word boundaries (\\b). This makes sure that I am not converting words that live inside another, like "helloguys".
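For comparison, the same dictionary-style replacement can be sketched in base R with a loop over the rows of the lookup table. This is not part of the answers above, just one possible alternative; it assumes the dictionary words contain no regex metacharacters (otherwise they would need escaping), and perl = TRUE is chosen here for the \\b handling.

```r
df1 <- data.frame(variable = c("hello my name is this",
                               "greetings friend",
                               "hello, hellomy is not a word"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(word = c("hello", "greetings"),
                  replacement = c("hi", "hi"),
                  stringsAsFactors = FALSE)

result <- df1$variable
for (i in seq_len(nrow(df2))) {
  # wrap each dictionary word in word boundaries so "hellomy" is left alone
  pattern <- paste0("\\b", df2$word[i], "\\b")
  result <- gsub(pattern, df2$replacement[i], result, perl = TRUE)
}
print(result)
```

Every dictionary entry is applied to every string in turn, so the order of the rows in df2 matters if one replacement can produce text that another entry matches.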
