Remove white space from a data frame column and add path - r

I have a column of a dataframe with data like this:
df$names
"stock 1"
"stock stock1 2"
"stock 2"
I would like to remove the spaces from every row of text, giving a result like this:
df$names
"stock1"
"stockstock12"
"stock2"
Then I would like to prepend a path to each file name, so that the final column looks like this (the path is the same for all rows):
df$names
"C:/Desktop/stock_files/stock1"
"C:/Desktop/stock_files/stockstock12"
"C:/Desktop/stock_files/stock2"

We can use gsub to remove the white space. The pattern \\s+ matches one or more whitespace characters, and we replace each match with ''.
df$names <- gsub('\\s+', '', df$names)
df$names
#[1] "stock1" "stockstock12" "stock2"
Then we use paste to join the path and the cleaned names:
path <- "C:/Desktop/stock_files"
df$names <- paste(path, df$names, sep="/")
df$names
#[1] "C:/Desktop/stock_files/stock1" "C:/Desktop/stock_files/stockstock12"
#[3] "C:/Desktop/stock_files/stock2"
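As a minimal sketch, the two steps can also be combined with file.path, which joins its arguments with "/". The data frame below is a made-up stand-in for the one in the question.

```r
# Hypothetical example data mirroring the question
df <- data.frame(names = c("stock 1", "stock stock1 2", "stock 2"),
                 stringsAsFactors = FALSE)
path <- "C:/Desktop/stock_files"

# Strip all whitespace, then prepend the directory in one step
df$names <- file.path(path, gsub("\\s+", "", df$names))
df$names
```

file.path is vectorised over the names, so the single path is recycled for every row.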

Related

Reverse the name if it is separated by a comma

If a name is stored as last name, first name, like "nandan, vivek", I want to display it as "vivek nandan".
n<-("nandan,vivek")
Expected result:
[1] vivek nandan
where the first name is vivek and the last name is nandan. This is the author name.
We can try using sub here:
input <- "nankin,vivek"
sub("([^,]+),\\s*(.*)", "\\2 \\1", input)
[1] "vivek nankin"
The regex pattern used above matches the last name followed by the first name, in separate capture groups. It then replaces with those capture groups, in reverse order, separated by a single space.
Another option is to use sub to capture a run of lowercase letters ([a-z]+) followed by a comma, and then capture the next word ([a-z]+). In the replacement, reverse the order of the backreferences:
sub("([a-z]+),([a-z]+)", "\\2 \\1", n)
#[1] "vivek nandan"
A non-regex option would be to split the string and then paste the reversed words
paste(rev(strsplit(n, ",")[[1]]), collapse=" ")
#[1] "vivek nandan"
Or extract the word and paste
library(stringr)
paste(word(n, 2, sep=","), word(n, 1, sep=","))
#[1] "vivek nandan"
data
n<- "nandan,vivek"
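The same idea extends to a whole vector of names. The sketch below assumes each entry contains a single comma and may carry stray spaces around it; the authors vector is invented for illustration.

```r
# Hypothetical author names, with and without spaces around the comma
authors <- c("nandan,vivek", "nandan, vivek", "  doe ,  jane ")

# Capture the text on each side of the comma, ignoring surrounding
# whitespace, and emit it as "first last"
reversed <- sub("^\\s*([^,]*?)\\s*,\\s*(.*?)\\s*$", "\\2 \\1",
                authors, perl = TRUE)
reversed
```

The lazy quantifiers (*?) keep the surrounding whitespace out of the capture groups, so no separate trimming step is needed.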

Remove strings that contain a colon in R

This is an example excerpt of my data set. It looks as follows:
Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id_234;2018/03/02
I want to delete those words which contain a colon. In this case, these are wa119:d, ax21:3 and bC230:13, so my new data set should look as follows:
Description;ID;Date
Here comes the first row;id_112;2018/03/02
Here comes the second row;id_115;2018/03/02
Here comes the third row;id_234;2018/03/02
Unfortunately, I was not able to find a regular expression or other solution with gsub. Can anyone help?
Here's one approach:
## reading in your data
dat <- read.table(text ='
Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id:234;2018/03/02
', sep = ';', header = TRUE, stringsAsFactors = FALSE)
## \\w+ = one or more word characters
gsub('\\w+:\\w+\\s+', '', dat$Description)
## [1] "Here comes the first row"
## [2] "Here comes the second row"
## [3] "Here comes the third row"
More info on \\w, a shorthand character class that is the same as [A-Za-z0-9_]: https://www.regular-expressions.info/shorthand.html
Supposing the column you want to modify is dat:
dat <- c("wa119:d Here comes the first row",
"ax21:3 Here comes the second row",
"bC230:13 Here comes the third row")
Then you can take each element, split it into words, remove the words containing a colon, and then paste what's left back together, yielding what you want:
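dat_colon_words_removed <- unlist(lapply(dat, function(string) {
  # split the element into words
  words <- strsplit(string, split = " ")[[1]]
  # keep only the words without a colon
  words <- words[!grepl(":", words)]
  # glue the remaining words back together
  paste(words, collapse = " ")
}))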
Another solution that exactly matches the OP's expected result could be:
#data
df <- read.table(text = "Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id:234;2018/03/02", stringsAsFactors = FALSE, sep="\n")
gsub("[a-zA-Z0-9]+:[a-zA-Z0-9]+\\s", "", df$V1)
#[1] "Description;ID;Date"
#[2] "Here comes the first row;id_112;2018/03/02"
#[3] "Here comes the second row;id_115;2018/03/02"
#[4] "Here comes the third row;id:234;2018/03/02"
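The patterns above assume the colon word sits at the start of the text. A hedged sketch for the more general case, where the word may appear anywhere in a description, is shown below; the desc vector is made up, and the approach is meant for the Description column only, not for whole ;-delimited lines.

```r
# Hypothetical descriptions with the colon word in different positions
desc <- c("wa119:d Here comes the first row",
          "Here comes ax21:3 the second row",
          "Here comes the third row bC230:13")

# \\S*:\\S* matches a whole whitespace-delimited token that contains a colon
cleaned <- gsub("\\S*:\\S*", "", desc)

# Collapse the doubled spaces left behind and trim both ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```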

Match and replace misspelled words in a string in R

I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Everything works perfectly until it finds a similar string and replaces it with another word.
For example, if I have a pattern like "plea", the correct word is "please", but when I run the code it replaces it with "pleased".
What I am looking for is that a string which is already correct is left unmodified, even if it contains a similar pattern.
Perhaps you need to replace progressively, i.e. keep multiple sets of badwords and goodwords. First replace the badwords that have more letters, so that a shorter pattern cannot match inside them, and then go on to the shorter ones.
From the list provided by you, I would create 2 sets as:
goodwords1<-c( "three", "teasing")
badwords1<- c("thre", "teeasing")
goodwords2<-c("tree", "testing")
badwords2<- c("tre", "tesing")
First replace with 1st set and then with 2nd set. You can create many such sets.
str_replace_all takes regex as the pattern, so you can paste0 word boundaries \\b around each badwords so that a replacement will only be made if the whole word is matched:
library(stringr)
string <- c("tre", "tree", "teeasing", "tesing")
goodwords <- c("tree", "three", "teasing", "testing")
badwords <- c("tre", "thre", "teeasing", "tesing")
# Paste word boundaries around badwords
badwords <- paste0("\\b", badwords, "\\b")
vect.corpus <- goodwords
names(vect.corpus) <- badwords
str_replace_all(string, vect.corpus)
[1] "tree" "tree" "teasing" "testing"
The advantage of this is that you don't have to keep track of which strings are the longer strings.
This is what badwords looks like after pasting:
> badwords
[1] "\\btre\\b" "\\bthre\\b" "\\bteeasing\\b" "\\btesing\\b"
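For comparison, here is a rough base R sketch of the same whole-word idea without stringr. The helper name fix_words and the example phrases are hypothetical, and the sketch assumes the misspellings contain no regex metacharacters.

```r
# Hypothetical helper: replace whole-word misspellings only
fix_words <- function(x, bad, good) {
  stopifnot(length(bad) == length(good))
  for (i in seq_along(bad)) {
    # \\b on both sides prevents matches inside longer, already-correct words
    x <- gsub(paste0("\\b", bad[i], "\\b"), good[i], x, perl = TRUE)
  }
  x
}

phrases <- c("plea pass the friees", "please pass the fries")
fixed <- fix_words(phrases,
                   bad  = c("plea", "friees"),
                   good = c("please", "fries"))
fixed
```

Because the replacements run one after another, a correction that happens to equal a later misspelling would be replaced again, so the two lists should not overlap.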

Split column at multiple-character delimiter in data frame

My question is very similar to the question below, with the added problem that I need to split by a double-space.
Split column at delimiter in data frame
I would like to split this vector into columns.
text <- "first  second and second  third and third and third  fourth"
The result should be four columns reading "first", "second and second", "third and third and third", "fourth"
We can use \\s{2,} in strsplit to match runs of two or more whitespace characters:
v1 <- strsplit(text, "\\s{2,}")[[1]]
v1
#[1] "first" "second and second"
#[3] "third and third and third" "fourth"
This can be converted to data.frame using as.data.frame.list
setNames(as.data.frame.list(v1), paste0("col", 1:4))
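When the text lives in a data frame column with several rows, one possible sketch is to split every row and bind the pieces into a matrix. The data frame below is invented, and the approach assumes every row yields the same number of fields.

```r
# Hypothetical data frame; every row is assumed to contain exactly four fields
df <- data.frame(
  text = c("first  second and second  third and third and third  fourth",
           "a  b c  d e f  g"),
  stringsAsFactors = FALSE
)

# Split each row on runs of two or more whitespace characters
parts <- strsplit(df$text, "\\s{2,}")

# Stack the pieces row-wise into a character matrix, then convert
out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(out) <- paste0("col", seq_len(ncol(out)))
out
```

If rows can have different numbers of fields, rbind would recycle values silently, so the field counts are worth checking with lengths(parts) first.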

How to extract substring from a string?

There are some strings which show the following pattern
ABC, DEF.JHI
AB,DE.(JH)
Generally, each string has three sections, separated by , and . The last section can contain either normal characters or something like ). I would like to extract the last part. For example, I would like to generate the following two strings from the ones above:
JHI
(JH)
Is there a way to do that in R?
library(stringr)
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)")
str_extract(str1, '(?<=\\.).*')
#[1] "JHI" "(JH)"
(?<=\\.) is a lookbehind that requires the match to be preceded by a literal ., and .* then matches all the characters that follow.
You can just split on the . using strsplit and extract the second element.
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)")
unlist(lapply(strsplit(str1, "\\."), "[", 2))
# [1] "JHI" "(JH)"
Here's another possibility (note that it also drops the parentheses, returning "JH" rather than "(JH)"):
sapply(strsplit(str1, "\\.\\(|\\.|\\)"), "[[", 2)
Riffing on @josiber's answer, you could remove the part of the string before the .
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)")
gsub(".*\\.", "", str1)
# [1] "JHI" "(JH)"
EDIT
If your third element is not always preceded by a ., you can extract the final part with:
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)", "ABC.DE, (JH)")
gsub(".*[,.]", "" , str1)
# [1] "JHI" "(JH)" " (JH)"
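The last result keeps the space that followed the comma. A small sketch, assuming that leading or trailing whitespace in the extracted part is never wanted, wraps the call in trimws:

```r
str1 <- c("ABC, DEF.JHI", "AB,DE.(JH)", "ABC.DE, (JH)")

# Drop everything up to the last comma or period, then trim stray whitespace
last_part <- trimws(sub(".*[,.]", "", str1))
last_part
```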
