Remove character from string in R

I have a data frame as given below:
data$Latitude
"+28.666428" "+28.666470" "+28.666376" "+28.666441" "+28.666330" "+28.666391"
str(data$Latitude)
Factor w/ 1368 levels "+10.037451","+10.037457",..
I want to remove the "+" character from each of the Latitude values.
I tried using gsub()
data$Latitude<-gsub("+","",as.character(factor(data$Latitude)))
This isn't working.

You can use a combination of sapply, substring and regexpr to achieve the result.
regexpr(<character>,<vector>)[1] gives you the index of the character.
Using the value as the start index for substring, the rest of the string can be separated.
sapply lets you loop over the values.
Here is the data.
d<-c("+28.666428","+28.666470","+28.666376","+28.666441","+28.666330","+28.666391")
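Here is the logic. Note fixed = TRUE, so that '+' is matched as a literal character rather than read as a regex quantifier.
v <- sapply(d, FUN = function(d){substring(d, regexpr('+', d, fixed = TRUE) + 1, nchar(d))}, USE.NAMES=FALSE)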
Here is the output
> v
[1] "28.666428" "28.666470" "28.666376" "28.666441" "28.666330" "28.666391"
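The gsub() call in the question does not strip the sign because "+" is a regular-expression quantifier, not a literal character, unless it is escaped or matched with fixed = TRUE. A minimal sketch of the more direct fix, assuming a small made-up factor (lat, clean, clean_regex and lat_num are illustrative names, not from the question):

```r
# Illustrative data: a factor of latitudes with a leading "+"
lat <- factor(c("+28.666428", "+28.666470", "+28.666376"))

# Option 1: treat "+" as a literal string with fixed = TRUE
clean <- sub("+", "", as.character(lat), fixed = TRUE)

# Option 2: escape "+" and anchor it to the start of the string
clean_regex <- sub("^\\+", "", as.character(lat))

# If numbers are wanted, convert the cleaned strings
lat_num <- as.numeric(clean)
```

as.numeric() also accepts a leading "+" directly, so if the goal is a numeric column, as.numeric(as.character(lat)) may be enough on its own.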

Related

R: remove substring and change the remaining string by addition of a number

In R, I have some strings with the following pattern of letters and numbers:
A11B3XyC4
A1B14C23XyC16
B14C23XyC16D3
B14C23C16D3
I want to remove the part "Xy" (always the same letters) and when I do this I want to increase the number behind the Letter B by one (everything else should stay the same).
When there is no "Xy" in the string there is no change to the string
The result should look like this:
A11B4C4
A1B15C23C16
B15C23C16D3
B14C23C16D3
Could you point me to a function capable of this? I struggle with doing a calculation (x+1) with a string.
Thank you!
We could remove 'Xy' with str_remove and then use str_replace to increment the number that follows 'B'. Wrapping this in case_when applies the change only to strings that contain 'Xy'; all other strings are returned unchanged.
library(stringr)
library(dplyr)
case_when(
  str_detect(str1, "Xy") ~ str_replace(str_remove(str1, "Xy"), "(?<=B)(\\d+)",
                                       function(x) as.numeric(x) + 1),
  TRUE ~ str1
)
[1] "A11B4C4" "A1B15C23C16" "B15C23C16D3" "B14C23C16D3"
data
str1 <- c("A11B3XyC4", "A1B14C23XyC16", "B14C23XyC16D3", "B14C23C16D3")
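For comparison, the same idea can be sketched in base R with regexpr() and the replacement form of regmatches(). This is only one possible approach; has_xy, bump_b and out are hypothetical names introduced for the illustration:

```r
str1 <- c("A11B3XyC4", "A1B14C23XyC16", "B14C23XyC16D3", "B14C23C16D3")

# Which strings contain the literal "Xy"?
has_xy <- grepl("Xy", str1, fixed = TRUE)

# Add 1 to the first run of digits that directly follows a "B"
bump_b <- function(s) {
  m <- regexpr("(?<=B)[0-9]+", s, perl = TRUE)
  regmatches(s, m) <- as.character(as.numeric(regmatches(s, m)) + 1)
  s
}

# Remove "Xy" and bump the number only where "Xy" was present
out <- str1
out[has_xy] <- bump_b(sub("Xy", "", str1[has_xy], fixed = TRUE))
out
```

Strings without "Xy" are never touched, so their numbers stay as they were.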

How can I give sequential names to items in a list?

I have a character string that I have split into a list of smaller strings using strsplit. For example:
> full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
> full.seq
[1] "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
> sequences <- strsplit(full.seq, "cg")
> sequences
[[1]]
[1] "FZp" "K3VdAQzEFZpcAVdV8QM8ZpsEFZpa" "GKi3VdVSQzEFZp"
[4] "GKAVdVRpEFKGIZpg13"
I would like to give each of these new strings a unique, sequential name that I can still use to identify that they were from the same original string (for a later analysis I will do on these strings). For example, "ID.seq1", "ID.seq2", "ID.seq3" etc. I have tried doing this manually but receive this error:
> names(sequences) <- c("ID.seq1", "ID.seq2", "ID.seq3", "ID.seq4")
Error in names(sequences) <- c("ID.seq1", "ID.seq2", "ID.seq3", "ID.seq4") :
'names' attribute [4] must be the same length as the vector [1]
I would also like an automated way of doing this though, as I will need to label up to 30 new strings from a number of original strings. Any suggestions?
First of all, if you want a character vector, you will have to subset the list, because strsplit returns a list. After doing that, you can easily assign names to that vector of terms.
full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
sequences <- strsplit(full.seq, "cg")[[1]]
names(sequences) <- paste0("ID.seq", c(1:4))
sequences
ID.seq1 ID.seq2
"FZp" "K3VdAQzEFZpcAVdV8QM8ZpsEFZpa"
ID.seq3 ID.seq4
"GKi3VdVSQzEFZp" "GKAVdVRpEFKGIZpg13"
The answer by Tim is perfect. I just want to add how to do it if you want to keep your list and name the elements inside each list item:
full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
full.seq
sequences <- strsplit(full.seq, "cg")
names(sequences[[1]]) <- paste("ID.seq",1:4,sep="")
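Both answers hard-code 1:4. For the automated case the question asks about, one possible sketch uses seq_along() so the number of pieces does not need to be known in advance, and Map() to handle several original strings at once. The vector full.seqs, its IDs "ID1" and "ID2", the second example string, and the helper name_pieces are all made up for illustration:

```r
# Hypothetical input: several original strings, each with its own ID
full.seqs <- c(
  ID1 = "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13",
  ID2 = "AAcgBBcgCC"
)

# Split one string on "cg" and name the pieces <id>.seq1, <id>.seq2, ...
name_pieces <- function(id, s) {
  pieces <- strsplit(s, "cg", fixed = TRUE)[[1]]
  names(pieces) <- paste0(id, ".seq", seq_along(pieces))
  pieces
}

# One named character vector per original string
sequences <- Map(name_pieces, names(full.seqs), full.seqs)
sequences$ID2
```

Each piece keeps the ID of the string it came from, which is what the later analysis needs to group them.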

extract substring in R

Suppose I have a string such as "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of strings that contains only the bracketed numbers, e.g. [+229][+57].
Is there a convenient way in R to do this?
Using base R, try:
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"
We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, character(1)) # vapply() deals with 0-length inputs
character(0)
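Putting the pieces together, here is a hedged sketch that keeps the stricter, fully escaped pattern from the earlier answers and collapses the unique matches per input string. The second input "PEPTIDE" is a made-up string with no bracketed numbers, and tokens and collapsed are illustrative names:

```r
x <- c("S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR", "PEPTIDE")

# All "[+digits]" tokens per input string (a list, one element per string)
tokens <- regmatches(x, gregexpr("\\[\\+[0-9]+\\]", x))

# Unique tokens pasted back together; vapply() guarantees a character vector
collapsed <- vapply(tokens, function(m) paste(unique(m), collapse = ""), character(1))
collapsed
```

A string with no matches yields an empty string rather than being dropped, so the result always lines up with x.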

How to bulk-correct a character vector and convert it to numeric

I have a large character vector containing numbers that I need to correct and then convert to a numeric vector, for example:
data.f <- c('11.23', '34,32 + 12,17', '21.1')
I need a result like this:
num 11.23 34.32 21.1
I tried to use the apply family of functions to solve this problem:
num <- sapply(data.f, function(x) ifelse(nchar(data.f[x])<6, data.f[x] <- as.numeric(data.f[x]), data.f[x] <- as.numeric(substring(gsub("[,]", ".", data.f[x]), 1,5))))
I also experimented with a different option:
num <- sapply(data.f, function(x) ifelse(nchar(data.f[x])<6, as.numeric(data.f[x]), as.numeric(substring(gsub("[,]", ".", data.f[x]), 1,5))))
gsub - to change the comma to a dot
substring - to cut off the extra symbols (it would be better to take the average)
as.numeric - to convert character to numeric
But as a result I get the same values I started with:
str(num)
- attr(*, "names")= chr [1:3] "11.23" "34,32 + 12,17" "21.1"
attributes(num)
$names
[1] "11.23" "34,32 + 12,17" "21.1"
I need help finding a solution. Maybe someone will spot what I messed up?
We can replace the , with . and use parse_number to extract the number
readr::parse_number(gsub(",", ".", data.f))
#[1] 11.23 34.32 21.10
ifelse is vectorized, unlike if/else (which takes one element at a time), so we don't really need a looping function here (sapply loops over each element of the vector).
ifelse(nchar(data.f) < 6, as.numeric(data.f),
as.numeric(substr(gsub(',', '.', data.f), 1, 5)))
#[1] 11.23 34.32 21.10
NOTE: nchar, substr, ifelse, as.numeric and gsub can all take a vector with length > 1.
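If readr is not available, or cutting at a fixed width of 5 characters feels fragile, a base R sketch is to normalize the decimal separator and then pull out the numbers with a regular expression. The names normalized, first_num, num and avg are illustrative, and the pattern assumes plain unsigned decimals like those in the example:

```r
data.f <- c("11.23", "34,32 + 12,17", "21.1")

# Use "." as the decimal separator everywhere
normalized <- gsub(",", ".", data.f, fixed = TRUE)

# First number in each string (note: regmatches() drops strings with no match)
first_num <- regmatches(normalized, regexpr("[0-9]+(\\.[0-9]+)?", normalized))
num <- as.numeric(first_num)

# Optional: average of all numbers found in each string
all_nums <- regmatches(normalized, gregexpr("[0-9]+(\\.[0-9]+)?", normalized))
avg <- vapply(all_nums, function(v) mean(as.numeric(v)), numeric(1))
```

The avg variant addresses the remark in the question that an average would be preferable to simply discarding the second value.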

concatenate a string that contains backquote characters R

I have a string that contains back quotes, which mess up the concatenate function. If you try to concatenate with back ticks, the concatenate function doesn't like this:
a <- c(`table`, `chair`, `desk`)
Error: object 'chair' not found
So I can create the variable:
bad.string <- "`table`, `chair`, `desk`"
a <- gsub("`", "", bad.string)
That gives a string "table, chair, desk".
It then should be like:
good.object <- c("table", "chair", "couch", "lamp", "stool")
I don't know why the backquotes cause the concatenate function to break, but how can I replace the string to not have the illegal characters?
Try:
good.string <- trimws(unlist(strsplit(gsub("`", "", bad.string), ",")))
Here gsub() is used to remove the backticks, strsplit converts a single string into a list of strings, where the comma in the original string denotes the separation, unlist() converts the list of strings into a vector of strings and trimws() deletes trailing or leading whitespaces.
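The one-liner can be unrolled step by step to see what each function contributes. This is the same logic, only with intermediate variables (no_ticks and pieces are names added for illustration):

```r
bad.string <- "`table`, `chair`, `desk`"

# 1. Remove every backtick (fixed = TRUE: match the character literally)
no_ticks <- gsub("`", "", bad.string, fixed = TRUE)

# 2. Split on commas; strsplit() returns a list with one element per input string
pieces <- strsplit(no_ticks, ",", fixed = TRUE)

# 3. Flatten the list to a character vector and trim surrounding whitespace
good.string <- trimws(unlist(pieces))
good.string
```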
From the documentation on quotes, back ticks are reserved for non-standard variable names such as
`the dog` <- 1:5
`the dog`
# [1] 1 2 3 4 5
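So when you are trying to use concatenate, R is doing nothing wrong. It looks up each of the variables named in c() and fails on the first one that does not exist. That is why the error mentions 'chair' rather than 'table': table is an existing base R function, so that name is found.
If this is a vector you wrote, just replace all of the backticks with single or double quotes.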
If this is somehow being generated in R, bring the entire thing out as a string, then use gsub() and eval(parse())
eval(parse(text = gsub('\`',"\'","c(`table`, `chair`, `desk`)")))
[1] "table" "chair" "desk"
EDIT: For the new example of bad.string
You have to go through, replace all of the back ticks with double quotes, then you can read it through read.csv(). This is a little janky though as it gives back a row vector, so we transpose it to get back a column vector
bad.string <- "`table`, `chair`, `desk`"
okay_string <- gsub('\`','\"',bad.string)
okay_string
# [1] "\"table\", \"chair\", \"desk\""
t(read.csv(text = okay_string,header=FALSE, strip.white = TRUE))
# [,1]
# V1 "table"
# V2 "chair"
# V3 "desk"
