Parsing a character string of varying lengths

I'm trying to parse a string of estimated salaries to create a new field called "Salary.Min" which should be a numeric value. It seems straightforward and I can handle this in SQL with a quick case statement but I'm having trouble translating into R.
Do I need to use a for loop here or is there a more efficient/simple way? Generally I'm looking to do something akin to "if 4th character in string = K then return characters 2:3, otherwise return characters 2:4"
This code seemed to be okay at first, but after validating I've realized it's eliminating all records where the 4th character = K (i.e. minimum salaries below $100K, such as $89K).
ifelse(
substr(data_public$Salary.Estimate, 4,4) == "K",
data_public$Salary.Min<- substr(data_public$Salary.Estimate, 2, 3),
data_public$Salary.Min<- substr(data_public$Salary.Estimate, 2, 4))
I have a wide range of Salary.Estimate values, a few for example:
a) $105K - $115K
b) $89K - $95K
c) $78K - $85K

We could make this shorter with trimws and substr. Here, we take characters 2 to 4 and pass 'K' as the whitespace argument of trimws, so that a trailing K is stripped; which = 'right' restricts the stripping to the end of the string. (The whitespace argument requires R 3.6.0 or later.)
data_public$Salary.Min <- trimws(substr( data_public$Salary.Estimate, 2, 4),
which = 'right', whitespace = "K")
Or we could use sub, capturing the digits between the leading character and the first K and dropping everything else
sub("^.(\\d+)K.*", "\\1", data_public$Salary.Estimate)
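In the ifelse code, the assignment should be outside the ifelse. With the assignment inside, each branch overwrites the whole column when it is evaluated, so the last branch evaluated wins for every row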
data_public$Salary.Min<- with(data_public,
ifelse(substr(Salary.Estimate, 4, 4) == "K",
substr(Salary.Estimate, 2, 3), substr(Salary.Estimate,2, 4)))
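
Since the question asks for a numeric field, the captured digits still need converting. A minimal sketch, assuming every estimate starts with "$" followed by the minimum salary in digits; the salary vector and the parse_min_salary name are made up for illustration and are not part of the original data:

```r
# Hypothetical sample values in the same format as the question
salary <- c("$105K - $115K", "$89K - $95K", "$78K - $85K")

# Capture the digits that follow the leading "$", drop everything else,
# then convert the captured text to a number.
parse_min_salary <- function(x) {
  as.numeric(sub("^\\$(\\d+).*", "\\1", x))
}

salary_min <- parse_min_salary(salary)
salary_min
```

Because the pattern captures however many digits are present, the position of the K no longer matters. Values that do not match the pattern are returned unchanged by sub, so as.numeric would turn them into NA with a warning.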

Related

Select columns based on exact string match

I have a large dataframe which contain columns like this:
df <- data.frame(W0 = 1,
Response = 1,
HighResponse = 1,
Response.W0 = 1,
HighResponse.W0 =1)
Now, in a for loop, I want to select a column based on whether its name contains a specified string: Response, W0, or HighResponse. My method of selecting the column is:
x <- dplyr::select(df, contains("HighResponse.W0")) #this works
x <- dplyr::select(df, contains("HighResponse")) #doesn't work. Selects HighResponse and HighResponse.W0
x <- dplyr::select(df, contains("Response")) #doesn't work. Selects Response, HighResponse, Response.W0, HighResponse.W0
x <- dplyr::select(df, contains("W0")) #doesn't work. Selects W0, Response.W0, HighResponse.W0
How can I modify my column selection method so that it only selects an exact string match? For example, select only W0 or Response, not the other columns whose names contain those strings.

Use anchors with matches to specify the beginning (^) and end ($) of the string:
dplyr::select(df, matches("^HighResponse$"))
Or, without contains:
dplyr::select(df, "HighResponse")
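
Inside a loop the column name usually lives in a variable. A minimal base R sketch of exact matching in that situation, assuming the targets vector below holds the names to look up (targets and selected are illustrative names, not from the question):

```r
df <- data.frame(W0 = 1,
                 Response = 1,
                 HighResponse = 1,
                 Response.W0 = 1,
                 HighResponse.W0 = 1)

# Hypothetical list of column names to pull out one at a time
targets <- c("Response", "W0", "HighResponse")

# names(df) == target is TRUE only when the whole name matches,
# so partial matches such as "Response.W0" are never selected.
selected <- lapply(targets, function(target) {
  df[, names(df) == target, drop = FALSE]
})
names(selected) <- targets

sapply(selected, names)
```

With dplyr, passing the variable through all_of(), as in dplyr::select(df, dplyr::all_of(target)), should give the same exact-name behaviour, assuming a dplyr version that exports all_of.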

Generate all possible combinations of a text string with two specific letters substituted for each other in R

Using R, I have generated several strings of letters that range from 6-25 characters. I'd like for each one to generate an output that consists of all the combinations of these strings with every "I" substituted for a "L" and vice versa, the order of the characters should stay the same.
For example:
Input
"IVGLWEA"
OUTPUT
"IVGLWEA"
"LVGLWEA"
"LVGIWEA"
"IVGIWEA"
"LVGLWEA"
many thanks
rob

Edit: Thanks to @Skaqqs for the dynamic solution!
string <- "IVGLWEA"
# find the number of I's and L's in the string
n <- length(unlist(gregexpr("I|L", string)))
# make a grid of all possible combinations with this amount of I's and L's
df <- expand.grid(rep(list(c("I", "L")), n))
# replace I's and L's with %s
string_ <- gsub("I|L", "\\%s", string)
# replace %s with letters in grid
do.call(sprintf, as.list(c(string_, df)))
Result:
[1] "IVGIWEA" "LVGIWEA" "IVGLWEA" "LVGLWEA"
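
The same idea can be wrapped in a function that works on the character positions directly, which avoids building a format string. This is only a sketch; the name il_variants is made up here, and it assumes the input is a single upper-case string:

```r
# Return every variant of `string` in which each I/L position
# holds either "I" or "L"; all other characters stay in place.
il_variants <- function(string) {
  chars <- strsplit(string, "", fixed = TRUE)[[1]]
  pos <- which(chars %in% c("I", "L"))
  if (length(pos) == 0) return(string)  # nothing to substitute

  # One row per combination of I/L choices for the positions found
  grid <- expand.grid(rep(list(c("I", "L")), length(pos)),
                      stringsAsFactors = FALSE)

  apply(grid, 1, function(choice) {
    chars[pos] <- choice
    paste(chars, collapse = "")
  })
}

il_variants("IVGLWEA")
```

A string with n I/L positions produces 2^n variants, so the 25-character inputs mentioned in the question can get large if most characters are I or L.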

Here's an extremely inefficient (but concise!) approach:
Create all potential combinations of your input characters and use regex to extract the desired pattern.
pattern <- "(I|L)VG(I|L)WEA"
b <- c("I", "V", "G", "L", "W", "E", "A")
strings <- apply(expand.grid(rep(list(b), 7)), 1, paste0, collapse = "")
grep(pattern, strings, value = TRUE)
[1] "IVGIWEA" "LVGIWEA" "IVGLWEA" "LVGLWEA"

How to transform long names into shorter (two-part) names

I have a character vector of long names, each consisting of several words joined by dots.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
The names differ in length, but only the first two words of each name are important.
My goal is to get names of up to 7 characters: the first 3 characters of each of the first two words, with a dot as the separator between them.
These examples are very close to my request, but I do not know how to apply these code variations to my case:
R How to remove characters from long column names in a data frame and
how to append names to " column names" of the output data frame in R?
What should I do to get output names that look like this?
x <- c("Dus.fru",
"Bet.nan",
"Sal.gla",
"Sal.jen",
"Vac.min")
Any help would be appreciated.

You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we match up to 3 characters (\\w{1,3}), then ignore anything which is not a dot [^\\.]*, match a dot \\. and then again up to 3 characters (\\w{1,3}). Finally anything, that comes after that .*. We then only use the things in the brackets and separate them with a dot \\1.\\2.

Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i){
paste(substr(i[ 1 ], 1, 3), substr(i[ 2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
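
A variation on the split approach that drops the empty pieces produced by consecutive dots and tolerates names with a single word. This is a sketch only; shorten_name and its n argument are hypothetical names introduced for illustration:

```r
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
       "Betula.nana.L.")

# Keep the first `n` characters of each of the first two non-empty words
shorten_name <- function(name, n = 3) {
  words <- strsplit(name, ".", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]  # drop empty pieces produced by ".."
  paste(substr(head(words, 2), 1, n), collapse = ".")
}

# vapply checks that every result is a single string
short <- vapply(x, shorten_name, character(1), USE.NAMES = FALSE)
short
```

Using head(words, 2) rather than indexing positions 1 and 2 means a one-word name is simply truncated instead of gaining an "NA" part.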

Here is a less elegant solution than kath's, but one that is a bit easier to read if you are not an expert in regex.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
# A function that takes three characters from first two words and merges them
cleaner_fun <- function(ugly_string) {
words <- strsplit(ugly_string, "\\.")[[1]]
short_words <- substr(words, 1, 3)
new_name <- paste(short_words[1:2], collapse = ".")
return(new_name)
}
# Testing function
sapply(x, cleaner_fun, USE.NAMES = FALSE)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"

Move "*" to new column in R

Hello I have a column in a data.frame, it has many rows, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"))
I want to make a new column "Species_new" where the "*" is moved to the end of the character string, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"),
"Species_new" = c("Briza minor*", "Briza minor*", "Wattle"))
Is there a way to do this using gsub? The manual example would take far too long as I have approximately 50,000 rows.
Thanks in advance

One option is to capture the * as a group and in the replacement reverse the backreferences
df$Species_new <- sub("^([*])(.*)$", "\\2\\1", df$Species)
df$Species_new
#[1] "Briza minor*" "Briza minor*" "Wattle"
NOTE: * is a metacharacter meaning 0 or more, so we can either escape (\\*) or place it in brackets ([]) to evaluate the raw character i.e. literal evaluation
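
To illustrate the note, here is a small sketch that compares the escaped form with an approach that avoids regex entirely. The species vector copies the question's example; moved_escaped, has_star and moved_plain are names made up for this sketch:

```r
species <- c("*Briza minor", "*Briza minor", "Wattle")

# Escaped form: \\* matches a literal asterisk at the start of the string
moved_escaped <- sub("^\\*(.*)$", "\\1*", species)

# Regex-free form: startsWith() compares plain text, so "*" needs no escaping
has_star <- startsWith(species, "*")
moved_plain <- ifelse(has_star,
                      paste0(substring(species, 2), "*"),
                      species)

identical(moved_escaped, moved_plain)
```

Both versions are vectorised, so they should handle the 50,000 rows mentioned in the question without an explicit loop.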

Thanks so much for the quick response. I also found a workaround:
df$Species_new = sub("[*]", "", df$Species)
differences = setdiff(df$Species, df$Species_new)
# %in% rather than ==, because differences can hold more than one value
tochange = subset(df, Species %in% differences)
toleave = subset(df, !Species %in% differences)
tochange$Species_new = paste(tochange$Species_new, "*", sep = "")
df = rbind(tochange, toleave)

How to replace the certain character in certain position in the string?

My question is how to replace a character at a certain position in a string. For example:
str <- c("abcdccc","hijklccc","abcuioccc")
#I want to replace character "c" which is in position 3 to "X" how can I do that?
#I know the function gsub and substr, but the only idea I have got so far is
#use if() to make it. How can I do it quickly?
#ideal result
>str
"abXdccc" "hijklccc" "abXuioccc"

It's a bit awkward, but you can replace a single character dependent on that single character's value like:
ifelse(substr(str,3,3)=="c", `substr<-`(str,3,3,"X"), str)
#[1] "abXdccc" "hijklccc" "abXuioccc"
If you are happy to overwrite the value, you could do it a bit cleaner:
substr(str[substr(str,3,3)=="c"],3,3) <- "X"
str
#[1] "abXdccc" "hijklccc" "abXuioccc"
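
The in-place version generalises to any position and character. A minimal sketch, with replace_at and its argument names being hypothetical and to assumed to be a single character:

```r
# Replace the character at `position` with `to`, but only in the
# elements where that character currently equals `from`.
replace_at <- function(x, position, from, to) {
  # !is.na(x) keeps NA strings out of the subscripted assignment
  hit <- !is.na(x) & substr(x, position, position) == from
  substr(x[hit], position, position) <- to
  x
}

strings <- c("abcdccc", "hijklccc", "abcuioccc")
replace_at(strings, 3, "c", "X")
```

Because the function modifies its own copy of x, the original vector is left untouched, unlike the direct assignment above.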

I wonder if you can use a regex lookahead here to get what you are after.
str <- c("abcdccc","hijklccc","abcuioccc")
gsub("(^.{2})(?=c)(.*$)", "\\1X\\2", str, perl = TRUE)
[1] "abXcdccc" "hijklccc" "abXcuioccc"
Note that the lookahead does not consume the c, so this inserts an X in front of it rather than replacing it.
Or using a positive lookbehind as suggested by thelatemail, which does replace the c:
sub("(?<=^.{2})c", "X", str, perl = TRUE)
[1] "abXdccc" "hijklccc" "abXuioccc"
What this is doing is looking to match the letter c which comes after any two characters from the start of the string. The c is replaced with X.
(?<= is the start of the positive lookbehind
^.{2} means any two characters from the start of the string
)c is the last part, which says it has to be a c after the two characters
If you want to read up more about regex being used (link)
Additionally a generalised function:
switch_letter <- function(x, letter, position, replacement) {
stopifnot(position > 1)
pattern <- paste0("(?<=^.{", position - 1, "})", letter)
sub(pattern, replacement, x, perl = TRUE)
}
switch_letter(str, "c", 3, "X")

This should work too:
str <- c("abcdefg","hijklnm","abcuiowre")
a <- strsplit(str[1], "")[[1]]
a[3] <- "X"
a <- paste(a, collapse = '')
str[1] <- a

How about this idea:
# keep characters 1-2, swap "c" for "X" at position 3, then append the rest
c2Xon3 <- function(x){sprintf("%s%s%s",substring(x,1,2),gsub("c","X",substring(x,3,3)),substring(x,4,nchar(x)))}
str <- c("abcdccc","hijklccc","abcuioccc")
strNew <- sapply(str,c2Xon3 )

This should work
str <- c("abcdefg","hijklnm","abcuiowre")
for (i in 1:length(str))
{
if (substr(str[i],3,3)=='c') {
substr(str[i], 3, 3) <- "X"
}
}

You can just use ifelse with paste0 and substring, i.e.
ifelse(substr(str, 3, 3) == 'c', paste0(substring(str, 1, 2),'X', substring(str, 4)), str)
#[1] "abXdccc" "hijklccc" "abXuioccc"
