R: How can I Split and Trim Data Successfully

I've successfully split the data and removed the "," with the following code:
s = MSA_data$area_title
str_split(s, pattern = ",")
Result
[1] "Albany" " GA"
I need to trim this data to remove the white space, but when I do, the comma that was initially removed is placed back into the data:
"Albany, GA"
How can I successfully split and trim the data so that the result is:
[1] "Albany" "GA"
Thank you

An alternative is to use the trimws function to trim the whitespace at the beginning and end of each string. If Result holds the list returned by str_split, apply trimws to each element:
Result <- lapply(Result, trimws)

We just need to split on a comma followed by zero or more whitespace characters (\\s*), which does the split and the trim in a single step. Note that the argument of strsplit is named split, not pattern:
strsplit(MSA_data$area_title, split = ",\\s*")
If we are using stringr, we can make use of str_trim:
library(stringr)
str_trim(str_split("Albany, GA", ",")[[1]])
#[1] "Albany" "GA"
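As a minimal sketch comparing the one-step split with split-then-trim in base R. The vector titles is a hypothetical stand-in for MSA_data$area_title, which is not shown in the question:

```r
# Hypothetical stand-in for MSA_data$area_title
titles <- c("Albany, GA", "Akron,  OH", "Ames,IA")

# One step: split on a comma plus any whitespace that follows it
one_step <- strsplit(titles, split = ",\\s*")

# Two steps: split on the comma only, then trim every piece
two_step <- lapply(strsplit(titles, split = ","), trimws)

# Both approaches should give the same list of character vectors
print(identical(one_step, two_step))
print(one_step[[1]])
```

Both calls return a list with one character vector per input string, so use [[1]] (or sapply/vapply) to get at the individual pieces.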

Related

remove part of a word after apostrophe

I have
df<-c("That's","you're", "'am")
and I would like to remove the part of a word after and including the apostrophe which should return
c("That", "you", "")
A tidyverse solution, or one usable within a |> pipe, is preferable.

Replace ' and whatever follows it, using str_replace in stringr.
library(stringr)
str_replace(df, "'.*", "")
#[1] "That" "you" ""

Using R base sub
> sub("'.*", "", df)
[1] "That" "you" ""

Your example data only has one word per string. If you also need it to work for strings containing multiple words then use:
gsub("'\\w*\\b","",df)

Using trimws in base R
trimws(df, whitespace = "'.*")
[1] "That" "you" ""
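Since the question asks for something usable within a pipe, here is a small sketch with base sub and the native pipe. It assumes R 4.1 or later (for |>); the name words is just an illustrative replacement for df:

```r
# Same data as in the question, under an illustrative name
words <- c("That's", "you're", "'am")

# The piped value fills the first unnamed argument of sub(), which is x
# because pattern and replacement are given by name
stripped <- words |> sub(pattern = "'.*", replacement = "")

print(stripped)
```

Naming pattern and replacement is what lets the piped vector land in the x argument of sub.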

R splitting string on predefined location

I have a string that should be split into parts at "random" locations. The split always occurs at the next comma after a colon.
My idea was to find colons with
stringr::str_locate_all(test, ":") %>%
unlist()
then find commas
stringr::str_locate_all(test, ",") %>%
unlist()
and from there figure out the positions where the string should be split, but I could not find a suitable way to do it. It looks like there are always 6 characters between the colon and the comma, but I can't be sure that holds for the whole data.
Here is an example string:
dput(test)
"AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"
Here is what the result should be:
dput(result)
c("AA,KK,QQ,JJ,TT,99,88:0.5083", "66,55:0.8303", "AK,AQ,AJs,AJo:0.9037",
"ATs:0.0024", "ATo:0.5678")

Perhaps we can use regmatches, as below:
> regmatches(test, gregexpr("(\\w+,?)+:[0-9.]+", test))[[1]]
[1] "AA,KK,QQ,JJ,TT,99,88:0.5083" "66,55:0.8303"
[3] "AK,AQ,AJs,AJo:0.9037" "ATs:0.0024"
[5] "ATo:0.5678"

Here is one option with strsplit in base R: use gsub to replace the comma that follows a digit, a ".", and one or more digits ([0-9]+) with a new delimiter, and then split on that delimiter with strsplit
result1 <- strsplit(gsub("([0-9]\\.[0-9]+),", "\\1;", test), ";")[[1]]
Checking:
> identical(result, result1)
[1] TRUE
If the number of characters between the colon and the comma is fixed, use a regex lookbehind:
result1 <- strsplit(test, "(?<=:.{6}),", perl = TRUE)[[1]]
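The three approaches can be checked against each other on the example string. This is only a sketch on the data from the question; the names by_match, by_delim and by_look are illustrative, and the last check (one colon per piece) is an added sanity test, not part of the original answers:

```r
test <- "AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"
result <- c("AA,KK,QQ,JJ,TT,99,88:0.5083", "66,55:0.8303",
            "AK,AQ,AJs,AJo:0.9037", "ATs:0.0024", "ATo:0.5678")

# 1. Extract each "labels:number" chunk directly
by_match <- regmatches(test, gregexpr("(\\w+,?)+:[0-9.]+", test))[[1]]

# 2. Turn the comma after each number into ";" and split on that
by_delim <- strsplit(gsub("([0-9]\\.[0-9]+),", "\\1;", test), ";")[[1]]

# 3. Split on a comma preceded by a colon and exactly 6 characters
by_look <- strsplit(test, "(?<=:.{6}),", perl = TRUE)[[1]]

# Sanity check: every piece should contain exactly one colon
n_colons <- lengths(regmatches(by_match, gregexpr(":", by_match)))

print(identical(by_match, result))
print(identical(by_delim, result))
print(identical(by_look, result))
print(all(n_colons == 1))
```

The second approach assumes ";" never occurs in the data, and the third assumes the number after the colon is always 6 characters wide, so the first is the least dependent on the exact layout.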

Is it possible to use R's base::strsplit() without consuming pattern

I have a string that consists entirely of simple repetitions of the pattern [[:digit:]]+[A-Z], for instance 12A432B4B.
I want to use base::strsplit() to get:
[1] "12A" "432B" "4B"
I thought I could use a lookahead to split at a letter while keeping it in the output, with unlist(strsplit("12A432B4B", "(?<=.)(?=[A-Z])", perl = TRUE)), but as can be seen the split comes out wrong:
[1] "12" "A432" "B4" "B"
I can't get my mind around a pattern that works with this strsplit strategy. Explanations would be really appreciated.
Bonus:
I also failed to use a back reference in gsub (e.g. this pattern is not working: gsub("([[:digit:]]+[A-Z])+", "\\1", "12A432B4B")). Also, can you retrieve more than the \\1 to \\9 groups, say if [[:digit:]]+[A-Z] repeats more than 9 times?

We can use regex lookaround to split between an upper case letter and a digit
strsplit(str1, "(?<=[A-Z])(?=[0-9])", perl = TRUE)[[1]]
#[1] "12A" "432B" "4B"
data
str1 <- "12A432B4B"

The pattern mentioned in the post can be used as is in str_extract_all from stringr:
str_extract_all(string, '[[:digit:]]+[A-Z]')[[1]]
#[1] "12A" "432B" "4B"
Or in base R :
regmatches(string, gregexpr('[[:digit:]]+[A-Z]', string))[[1]]
where string is :
string <- '12A432B4B'
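On the back-reference part of the question, one possible workaround is to capture a single letter, write it back followed by a delimiter, and split on that delimiter. This is a sketch, not the approach from the answers above; the "|" delimiter is an assumption and must not occur in the real data:

```r
x <- "12A432B4B"

# Put a "|" after every capital letter, keeping the letter via \\1
marked <- gsub("([A-Z])", "\\1|", x)

# Split on the literal "|"; strsplit drops the empty piece after the
# trailing delimiter
pieces <- strsplit(marked, "|", fixed = TRUE)[[1]]

print(marked)
print(pieces)
```

Because the group captures one letter per match rather than a repeated group, only \\1 is needed no matter how many times the pattern repeats.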

Extract text with gsub

I am setting up an automated data analysis procedure and, more or less at the end of the procedure, I would like to extract automatically the name of the file that has been analysed. I have a data frame with a column containing names, with the following style:
Baseline/Cell_Line_2_KB_1813_B_Baseline
Dose 0001/Cell_Line_3_KB1720_1_0001
Dose 0010/Cell_Line_1_KB1810 mat_0010
I would like to extract just the "KB_1813_B", "KB1720_1" and "KB1810 mat" parts of these names into a separate column.
I used gsub with the following command:
df$column.with.names <- gsub(".*KB|_.*", "KB", df$column.with.new.names)
I could easily remove the first part of the problem, but I am stuck trying to remove the second part. Is there some command in gsub to remove everything, starting from the end of the name, until you encounter a special character ( "_" in my case)?
Thank you :)

We can use str_extract
library(stringr)
str_extract(df$column.with.new.names, "KB_*\\d+[_ ]*[^_]*")
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
Or the same pattern can be captured as a group with sub
sub(".*(KB_*\\d+[_ ]*[^_]*).*", "\\1", df$column.with.new.names)
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
data
df <- data.frame(column.with.new.names = c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010"), stringsAsFactors = FALSE)

The way to do this is using regex groups:
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010")
gsub("^.+Cell_Line_._(.+)_.+$", "\\1", x)
[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
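For a base R counterpart of the extraction idea, regmatches with regexpr pulls out the first match per string. This is a sketch using the same pattern as the str_extract answer, with perl = TRUE as a cautious choice for the \\d shorthand:

```r
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
       "Dose 0001/Cell_Line_3_KB1720_1_0001",
       "Dose 0010/Cell_Line_1_KB1810 mat_0010")

# regexpr() finds the first match in each string;
# regmatches() returns the matched text
ids <- regmatches(x, regexpr("KB_*\\d+[_ ]*[^_]*", x, perl = TRUE))

print(ids)
```

Unlike str_extract, which returns NA for a string with no match, regmatches simply drops such strings, so the result can be shorter than the input. Check lengths before assigning it to a data frame column.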

Gsub transforming numbers

I ran into this problem.
I scraped some data from the web and, for instance, I obtained this:
"3.444.654" (as character)
If I use gsub("3.444.654", ".", "") in order to get 3444654, R gives me
[1] ""
What could I do to get the integer?

> gsub(".", "", "3.444.654", fixed = TRUE)
[1] "3444654"
Maybe read the documentation for gsub for the argument order: the pattern comes first, then the replacement, then the string to work on. To then turn the string into a number, use as.numeric, as.integer, etc.
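Putting the two steps together, a short sketch (the variable names raw, digits and n are illustrative):

```r
raw <- "3.444.654"

# fixed = TRUE treats "." as a literal dot rather than a regex wildcard
digits <- gsub(".", "", raw, fixed = TRUE)
n <- as.integer(digits)

# Without fixed = TRUE, "." matches every character, so nothing is left
all_gone <- gsub(".", "", raw)

# Escaping the dot is an equivalent regex-based alternative
escaped <- gsub("\\.", "", raw)

print(n)
print(all_gone)
print(escaped)
```

The call in the question returned "" because its arguments were in the wrong order: the empty string "" was passed as the text to search, so there was nothing to replace.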
