Split column at multiple-character delimiter in data frame - r

My question is very similar to the question below, with the added problem that I need to split by a double-space.
Split column at delimiter in data frame
I would like to split this vector into columns.
text <- "first  second and second  third and third and third  fourth"
The result should be four columns reading "first", "second and second", "third and third and third", "fourth"

We can use \\s{2,} in strsplit to match runs of two or more whitespace characters.
v1 <- strsplit(text, "\\s{2,}")[[1]]
v1
#[1] "first" "second and second"
#[3] "third and third and third" "fourth"
This can be converted to data.frame using as.data.frame.list
setNames(as.data.frame.list(v1), paste0("col", 1:4))
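To apply the same split to a data frame column, as the title asks, a minimal base-R sketch follows. The data frame df, its column text, and the names col1, col2, ... are made up for illustration, and the sketch assumes every row splits into the same number of pieces.

```r
# Hypothetical example data: fields are separated by two or more spaces
df <- data.frame(text = c("a  b c  d", "e f  g  h"),
                 stringsAsFactors = FALSE)

# Split every row on runs of 2+ whitespace characters (returns a list)
pieces <- strsplit(df$text, "\\s{2,}")

# Stack the pieces row-wise; assumes each row has the same number of fields
parts <- do.call(rbind, pieces)
colnames(parts) <- paste0("col", seq_len(ncol(parts)))

result <- as.data.frame(parts, stringsAsFactors = FALSE)
result
```

If rows can split into different numbers of fields, rbind recycles the shorter rows rather than padding them, so it is worth checking lengths(pieces) first.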

Related

Extract All Strings between a sequence of numbers

I'm working on a regular expression for a string in which a four-digit number is followed by a name, and this pattern repeats several times.
The text pattern is a series of four digits, then a string. I would like to extract the string after the four digits. The four digits must appear before the string. In the example below, I do not want to extract "Not This One", but would like the strings after each four-digit number.
string_to_inspect <-"Not This One 4586 This one 8888 Another one 8955 PS109 8566 Last One"
My ideal extraction is a character vector that looks like:
"This one" "Another one" "PS109" "Last One"
I have tried
str_extract_all(pattern = "[0-9]{4}(.*?)", string = string_to_inspect)
And it returns a single string that includes all the numbers:
"4586 This one 8888 Another one 8955 PS109 8566 Last One"
I have tried various combinations but I know I must be missing something critical.
We can split the string by four digits, remove the first one, and then trim the white space.
library(stringr)
str_trim(str_split(string_to_inspect, pattern = "\\s[0-9]{4}\\s")[[1]][-1])
# [1] "This one" "Another one" "PS109" "Last One"
strsplit(string_to_inspect, " [0-9]+ ")
In case you don't want problems with strings mixed with numbers:
string_to_inspect <-"Not This One 4586 This one 8888 Another one 8955 PS109 8566 Last One"
str2insp <- strsplit(string_to_inspect, ' ')[[1]]
str2insp[!gsub('[[:digit:]]', '', str2insp) == '']
outputs:
#[1] "Not" "This" "One" "This" "one" "Another" "one" "PS109" "Last" "One"
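As an alternative to splitting, the wanted strings can be extracted directly. The lazy (.*?) in the original attempt is free to match nothing because nothing follows it in the pattern; anchoring it with a lookahead gives it something to stop at. The following is a base-R sketch that assumes the numbers are always exactly four digits surrounded by single spaces; the names pattern and extracted are made up for illustration.

```r
string_to_inspect <- "Not This One 4586 This one 8888 Another one 8955 PS109 8566 Last One"

# (?<=[0-9]{4} )    : the match must be preceded by four digits and a space
# .*?               : take as few characters as possible ...
# (?= [0-9]{4} |$)  : ... up to the next " dddd " or the end of the string
pattern <- "(?<=[0-9]{4} ).*?(?= [0-9]{4} |$)"

# perl = TRUE is needed for lookbehind and lookahead
extracted <- regmatches(string_to_inspect,
                        gregexpr(pattern, string_to_inspect, perl = TRUE))[[1]]
extracted
```

Because the match only starts right after a four-digit number, the leading "Not This One" is skipped and no trimming step is needed.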

Remove strings that contain a colon in R

This is an example excerpt of my data set. It looks as follows:
Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id_234;2018/03/02
I want to delete the words that contain a colon. In this case, these are wa119:d, ax21:3 and bC230:13, so my new data set should look as follows:
Description;ID;Date
Here comes the first row;id_112;2018/03/02
Here comes the second row;id_115;2018/03/02
Here comes the third row;id_234;2018/03/02
Unfortunately, I was not able to find a regular expression or a solution with gsub. Can anyone help?
Here's one approach:
## reading in your data
dat <- read.table(text ='
Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id:234;2018/03/02
', sep = ';', header = TRUE, stringsAsFactors = FALSE)
## \\w+ = one or more word characters
gsub('\\w+:\\w+\\s+', '', dat$Description)
## [1] "Here comes the first row"
## [2] "Here comes the second row"
## [3] "Here comes the third row"
More info on \\w, a shorthand character class that is the same as [A-Za-z0-9_]: https://www.regular-expressions.info/shorthand.html
Supposing the column you want to modify is dat:
dat <- c("wa119:d Here comes the first row",
"ax21:3 Here comes the second row",
"bC230:13 Here comes the third row")
Then you can take each element, split it into words, remove the words containing a colon, and then paste what's left back together, yielding what you want:
dat_colon_words_removed <- unlist(lapply(dat, function(string) {
  words <- strsplit(string, split = " ")[[1]]
  words <- words[!grepl(":", words)]
  paste(words, collapse = " ")
}))
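The word-by-word approach above can also be written as a single vectorised gsub call. This is a sketch, assuming a "word" means any run of non-space characters; the name cleaned is made up for illustration.

```r
dat <- c("wa119:d Here comes the first row",
         "ax21:3 Here comes the second row",
         "bC230:13 Here comes the third row")

# \\S*:\\S* : a run of non-space characters that contains a colon
# \\s*      : plus any whitespace that follows it
# trimws() removes the trailing space left when the removed word was last
cleaned <- trimws(gsub("\\S*:\\S*\\s*", "", dat))
cleaned
```

Unlike the \\w+:\\w+ pattern, this also removes tokens that contain punctuation around the colon, which means it would strip things such as times (12:30) or URLs as well.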
Another solution that exactly matches the OP's expected result:
#data
df <- read.table(text = "Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id:234;2018/03/02", stringsAsFactors = FALSE, sep="\n")
gsub("[a-zA-Z0-9]+:[a-zA-Z0-9]+\\s", "", df$V1)
#[1] "Description;ID;Date"
#[2] "Here comes the first row;id_112;2018/03/02"
#[3] "Here comes the second row;id_115;2018/03/02"
#[4] "Here comes the third row;id:234;2018/03/02"

Remove white space from a data frame column and add path

I have a column of a dataframe with data like this:
df$names
"stock 1"
"stock stock1 2"
"stock 2"
I would like to remove the spaces from every row of text. The result should look like this:
df$names
"stock1"
"stockstock12"
"stock2"
Then I would like to prepend a path to each file name, so that the final column looks like this (the path is the same for all rows):
df$names
"C:/Desktop/stock_files/stock1"
"C:/Desktop/stock_files/stockstock12"
"C:/Desktop/stock_files/stock2"
We can use gsub to remove the white space. We select one or more spaces (\\s+) and replace it with ''.
df$names <- gsub('\\s+', '', df$names)
df$names
#[1] "stock1" "stockstock12" "stock2"
Then, we use paste to join the strings together
path <- "C:/Desktop/stock_files"
df$names <- paste(path, df$names, sep="/")
df$names
#[1] "C:/Desktop/stock_files/stock1" "C:/Desktop/stock_files/stockstock12"
#[3] "C:/Desktop/stock_files/stock2"
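Since the result is a file path, base R's file.path is a possible alternative to paste(..., sep = "/"). A minimal sketch that rebuilds the example column so it runs on its own:

```r
# Example data taken from the question
df <- data.frame(names = c("stock 1", "stock stock1 2", "stock 2"),
                 stringsAsFactors = FALSE)

path <- "C:/Desktop/stock_files"

# Strip all whitespace, then join directory and file name with "/"
df$names <- file.path(path, gsub("\\s+", "", df$names))
df$names
```

file.path always joins its arguments with "/", so the result is the same as the paste call above.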

Merge two columns in R while replacing when value is not present in other column

I want to merge two columns of my data set. The two columns are either/or, i.e. if a value is present in one column, it won't be present in the other.
I tried this:
temp<-list(a=1:3,b=10:14)
paste(temp$a,temp$b)
output
"1 10" "2 11" "3 12" "1 13" "2 14"
and this
temp<-list(a=1:3,b=10:14,c=20:25)
temp<-within(temp,a <- paste(a, b, sep=''))
output
temp$a
[1] "110" "211" "312" "113" "214"
But what I am looking for is to fill in the values when they are not present. For example, temp$a only has 1:3 and temp$b has 10:14, i.e. two extra values, so I want my answer to be
1_10 2_11 3_12 _13 _14
EDIT: please note that I do not want column c to be concatenated with a and b.
Using stri_list2matrix, we can fill the list elements that have shorter length with '' and use paste.
library(stringi)
do.call(paste, c(as.data.frame(stri_list2matrix(temp, fill='')), sep='_'))
#[1] "1_10" "2_11" "3_12" "_13" "_14"
stri_list2matrix(temp, fill='') converts the list to a matrix after filling the shorter list elements with ''. Convert it to a data.frame (as.data.frame) and use do.call(paste, ...) to paste the elements in each row, separated by _ (sep='_').
Update
Based on the edited 'temp', if you are interested only in the first two elements of 'temp'
do.call(paste, c(as.data.frame(stri_list2matrix(temp[1:2], fill='')),
sep='_'))
#[1] "1_10" "2_11" "3_12" "_13" "_14"
You can also subset by name, i.e. temp[c('a', 'b')].
Expand the length of the shorter vector to match the length of the longer vector, then paste:
paste(c(temp$a,rep("",length(temp$b)-length(temp$a))), temp$b, sep="_")
#[1] "1_10" "2_11" "3_12" "_13" "_14"
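The padding idea generalises to any number of vectors without extra packages. The following is a base-R sketch; the names cols, n, padded and merged are made up for illustration, and it assumes the inputs contain no genuine NA values, since those would be blanked as well.

```r
temp <- list(a = 1:3, b = 10:14, c = 20:25)

# Keep only the elements to be merged (c is deliberately left out)
cols <- temp[c("a", "b")]
n <- max(lengths(cols))

# Pad each vector to length n: `length<-` fills with NA, which becomes ""
padded <- lapply(cols, function(v) {
  v <- as.character(v)
  length(v) <- n
  v[is.na(v)] <- ""
  v
})

# unname() so that list names are not mistaken for paste() arguments
merged <- do.call(paste, c(unname(padded), sep = "_"))
merged
```

Note that the padding is appended at the end of the shorter vector, which matches the desired output in the question.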

Collapse character vector into single observation in R

How do you reduce a multi-valued vector to a single observation? Specifically, dealing with text. The solution should be scalable.
Consider:
col <- c("This is row 1", "AND THIS IS ROW 2", "Wow, and this is row 3!")
Which returns the following:
> col
[1] "This is row 1" "AND THIS IS ROW 2" "Wow, and this is row 3!"
Where the desired solution looks like this:
> col
[1] "This is row 1 AND THIS IS ROW 2 Wow, and this is row 3!"
You are looking for ?paste:
> paste(col, collapse = " ")
#[1] "This is row 1 AND THIS IS ROW 2 Wow, and this is row 3!"
In this case you want to collapse the strings together and add a space in between them. You can also check out paste0.
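The difference between sep and collapse is easy to mix up, so here is a short sketch using the vector from the question. The names pasted_sep and collapsed are made up for illustration.

```r
col <- c("This is row 1", "AND THIS IS ROW 2", "Wow, and this is row 3!")

# sep joins the arguments element by element; with a single vector
# there is nothing to join, so the result still has three elements
pasted_sep <- paste(col, sep = " ")

# collapse joins the elements of the result into one string
collapsed <- paste(col, collapse = " ")

length(pasted_sep)
length(collapsed)
```

paste0(col, collapse = " ") gives the same single string here, because paste0 only differs in using an empty sep.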
