I am new to R and would like to know how to remove leading 0s from a particular column in a data frame.
This is the column I have in my df.
questionn
SI001
SI002
SI003
SI010
and I would like to get something like
questionn
1
2
3
10
I have tried something like this, but it doesn't work because of SI010:
library(stringr)
df$questionn <- str_replace_all(df$questionn, 'SQ0', '')
data
df <- data.frame(questionn=c("SI001","SI002","SI003","SI010"),stringsAsFactors = FALSE)
Try:
as.numeric(str_replace_all(df$questionn,"SI0",""))
You can remove all characters that are not digits, then convert to numeric:
as.numeric(gsub("\\D","",df$questionn))
[1] 1 2 3 10
or as.numeric(str_replace_all(df$questionn,"\\D","")) for same output.
substr(gsub("SI", "", df$questionn),
regexpr("[^0]", gsub("SI", "", df$questionn)),
nchar(gsub("SI", "", df$questionn)))
Produces:
"1" "2" "3" "10"
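The first step strips out the "SI" prefix, leaving digit strings with leading zeros. regexpr("[^0]", ...) then finds the position of the first character that is not a zero, and substr keeps everything from that position to the end of the string.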
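The same result can be reached in a single substitution by anchoring the pattern at the start of the string. This is only a sketch, assuming every value is a non-digit prefix followed by digits; the names ids and stripped are illustrative and not from the question:

```r
ids <- c("SI001", "SI002", "SI003", "SI010")

# "^\\D*" consumes the non-digit prefix and "0*" the leading zeros.
# The lookahead "(?=\\d)" forces at least one digit to remain,
# so an all-zero value such as "SI000" would become "0" rather than "".
stripped <- sub("^\\D*0*(?=\\d)", "", ids, perl = TRUE)
stripped

# Convert to integer if numbers rather than strings are wanted
as.integer(stripped)
```

Unlike the as.numeric approach, this keeps the values as character strings until you decide to convert them.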
I am trying to find 3 or more consecutive "a" within the last 10 letters of my data frame string. My data frame looks like this:
V1
aaashkjnlkdjfoin
jbfkjdnsnkjaaaas
djshbdkjaaabdfkj
jbdfkjaaajbfjna
ndjksnsjksdnakns
aaaandfjhsnsjna
I have written some code, but it only returns the number of consecutive "a" within the whole string. I want it to look only at the last 10 characters and then print the strings in which the consecutive "a" are found. The code I have written gives:
out: [1] 3
I am wanting my output to look like this:
jbfkjdnsnkjaaaas
djshbdkjaaabdfkj
jbdfkjaaajbfjna
Can anyone help?
Using regex, you could do:
grep("(?=.{10}$).*?a{3,}", string, perl = TRUE, value = TRUE)
[1] "jbfkjdnsnkjaaaas" "djshbdkjaaabdfkj" "jbdfkjaaajbfjna"
string <- c("aaashkjnlkdjfoin", "jbfkjdnsnkjaaaas", "djshbdkjaaabdfkj",
"jbdfkjaaajbfjna", "ndjksnsjksdnakns", "aaaandfjhsnsjna")
If you have a data frame and need to subset it:
subset(df, grepl("(?=.{10}$).*?a{3}",V1, perl = TRUE))
V1
2 jbfkjdnsnkjaaaas
3 djshbdkjaaabdfkj
4 jbdfkjaaajbfjna
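If the lookahead is hard to read, the same filter can be written in two explicit steps: cut out the last 10 characters, then search only that piece. A minimal sketch, assuming a plain character vector like the one above (tail10 and hits are illustrative names):

```r
string <- c("aaashkjnlkdjfoin", "jbfkjdnsnkjaaaas", "djshbdkjaaabdfkj",
            "jbdfkjaaajbfjna", "ndjksnsjksdnakns", "aaaandfjhsnsjna")

# Keep only the last 10 characters of each string
# (the whole string if it is shorter than 10 characters)
n <- nchar(string)
tail10 <- substr(string, pmax(n - 9, 1), n)

# Flag the strings whose tail contains a run of three or more "a"
hits <- grepl("a{3,}", tail10)
string[hits]
```

One difference from the lookahead version: strings shorter than 10 characters are still searched here, whereas "(?=.{10}$)" can never match them.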
I'm trying to extract values from a vector of strings. Each string in the vector (there are about 2300 of them) follows the pattern of the example below:
"733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
What I'd like is to extract the numbers following the pattern "Sent. " and place them into a separate vector. For the example, I'd like to extract "1311531".
I'm having trouble using gsub to accomplish this.
library(tidyverse)
Data <- c("PASTE YOUR WHOLE STRING")
str_locate(Data, "Sent. ")
Reference <- str_locate_all(Data, "Sent. ") %>% as.data.frame()
Reference %>% names() #Returns [1] "start" "end"
Reference <- Reference %>% mutate(end = end +1)
YourNumbers <- substr(Data,start = Reference$end[1], stop = Reference$end[1])
for (i in 2:dim(Reference)[1]){
Temp <- substr(Data,start = Reference$end[i], stop = Reference$end[i])
YourNumbers <- paste(YourNumbers, Temp, sep = "")
}
YourNumbers #Returns "1234567", i.e. the sentence numbers that directly follow "Sent. ", not the ratings between the underscores
We can use str_match_all from stringr to capture the number that follows each "Sent" entry (ss is the sample string given under "Sample data" below).
str_match_all(ss, "Sent.*?_+(\\d+)")[[1]][, 2]
#[1] "1" "3" "1" "1" "5" "3" "1"
A base R option using strsplit and sub
lapply(strsplit(ss, "\\|"), function(x)
sub("Sent.+: _+(\\d+)_+", "\\1", x[grepl("^Sent", x)]))
#[[1]]
#[1] "1" "3" "1" "1" "5" "3" "1"
Sample data
ss <- "733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
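Since the question asks for a single value such as "1311531" per string, the extracted digits still need to be pasted together, and the whole thing has to work across a vector of about 2300 strings. The following is a sketch only: extract_ratings is a hypothetical helper, ss_short is a shortened made-up string in the same layout, and it assumes every "Sent." field contains exactly one number between underscores.

```r
# Shortened, made-up example in the same layout as the real data
ss_short <- "733|Overall (-2 to 2): ____2____|Sent. 1 (ANALYSIS BY...): ____1____|Sent. 2 (Bail is...): __3____|Sent. 3 (2) A...): ___1___|NA|"

# Hypothetical helper: returns one concatenated rating string per input element
extract_ratings <- function(x) {
  fields <- strsplit(x, "|", fixed = TRUE)
  vapply(fields, function(f) {
    # Keep only the fields that begin with "Sent."
    sent <- f[startsWith(f, "Sent.")]
    # Pull the digits that sit between the underscores after ": "
    paste(sub("^.*: _+(\\d+)_*$", "\\1", sent), collapse = "")
  }, character(1))
}

extract_ratings(ss_short)
```

A field with no rating filled in would be left unchanged by sub, so real data with blank answers would need an extra check before pasting.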
I have a character vector looking like this:
string <- c("1","2","3","","5","6","")
I would like to replace the gaps with the previous value, obtaining a vector like this:
string <- c("1","2","3","3","5","6","6")
I have adjusted this solution (Replace NA with previous and next rows mean in R) and I do get the correct result:
string <- as.data.frame(string)
ind <- which(string == "")
string$string[ind] <- sapply(ind, function(i) with(string, string[i-1]))
This way is however quite cumbersome and there must be an easier way that does not require me to transform the string to a data frame first. Thanks for your help!
We can use na.locf from zoo after changing the blank ("") to NA so that the NA values get replaced by the non-NA adjacent previous values
library(zoo)
na.locf(replace(string, string =="", NA))
#[1] "1" "2" "3" "3" "5" "6" "6"
If there is at most one blank in a row between the elements (and the first element is not blank), create an index as in the OP's post and replace each blank with the element just before it:
i1 <- which(string == "")
string[i1] <- string[i1-1]
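For runs of several blanks in a row, the single-step index shift is not enough, because a blank may be preceded by another blank. A base R sketch that handles this without zoo is shown below; fill_forward is a hypothetical helper name, and it assumes blanks are exactly "" (leading blanks are left as they are because there is nothing to carry forward):

```r
# Hypothetical helper: carry the last non-blank value forward
fill_forward <- function(x) {
  filled <- x != ""
  # For each position, the index of the most recent non-blank element (0 if none yet)
  idx <- cummax(ifelse(filled, seq_along(x), 0L))
  out <- x
  out[idx > 0] <- x[idx[idx > 0]]
  out
}

fill_forward(c("1", "2", "3", "", "5", "6", ""))
fill_forward(c("1", "", "", "4", ""))
```

The cummax trick works because the index of the last non-blank element never decreases as you move along the vector.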
I have a column of identifiers:
c('ABB123a','ABB123b','ABB123c','ABB125','ABB125b','ABB1110','ABB1110aa')
#desired output
c('ABB123','ABB123','ABB123','ABB125','ABB125','ABB1110','ABB1110')
What's the easiest way in R to remove the characters that follow the pattern of 3 letters plus 2 to 4 digits?
This seems to do what you want:
Your data:
x <- c("ABB123a","ABB123b","ABB123c","ABB125a")
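Alter the data using gsub by removing the last ($) character in the string. Note that this drops the final character unconditionally, so it assumes every identifier ends in exactly one extra letter: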
x_new <- gsub("\\w$", "", x)
x_new
[1] "ABB123" "ABB123" "ABB123" "ABB125"
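The identifiers in the question also include values with no suffix ('ABB125') and with a two-letter suffix ('ABB1110aa'), which a "drop the last character" rule would mangle. A possible alternative, assuming the identifiers always start with three letters followed by 2 to 4 digits, is to capture that part and discard whatever follows (ids and cleaned are illustrative names):

```r
ids <- c("ABB123a", "ABB123b", "ABB123c", "ABB125",
         "ABB125b", "ABB1110", "ABB1110aa")

# Capture three letters plus 2-4 digits at the start; drop anything after them
cleaned <- sub("^([A-Za-z]{3}\\d{2,4}).*$", "\\1", ids)
cleaned
# [1] "ABB123"  "ABB123"  "ABB123"  "ABB125"  "ABB125"  "ABB1110" "ABB1110"
```

Values that do not match the pattern at all are returned unchanged by sub, so unexpected identifiers are easy to spot afterwards.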
I would like to remove constant (shared) parts of a string automatically and retain the variable parts.
For example, I have a column with the following:
D20181116_Basel-Take1_digital
D20181116_Basel-Take2_digital
D20181116_Basel-Take3_digital
D20181116_Basel-Take4_digital
D20181116_Basel-Take5_digital
D20181116_Basel-Take5a_digital
How can I automatically get the result below for any similar column (here by removing "D20181116_Basel-Take" and "_digital")? The code should find the constant parts itself and remove them.
1
2
3
4
5
5a
I hope this is clear. Thank you very much.
You can do it with a regex: it will remove everything before 'Take' and after the underscore character:
vec <- c("D20181116_Basel-Take1_digital",
"D20181116_Basel-Take2_digital",
"D20181116_Basel-Take3_digital",
"D20181116_Basel-Take4_digital",
"D20181116_Basel-Take5_digital",
"D20181116_Basel-Take5a_digital")
sub(".*?Take(.*?)_.*", "\\1", vec)
[1] "1" "2" "3" "4" "5" "5a"
With gsub(), assuming you have a data frame df and want to change a column named column (the constant parts are hard-coded here):
df$column <- gsub("^D20181116_Basel-Take","",df$column)
df$column <- gsub("_digital$","",df$column)
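Both answers above hard-code the constant text, while the question asks for it to be detected automatically. One way to sketch that is to compute the longest prefix shared by all strings, remove it, and then repeat the idea on the reversed strings to find the shared suffix. The helper names below (common_prefix_len, reverse_string, strip_constant_parts) are hypothetical, and the sketch assumes the constant parts sit only at the start and end of the strings:

```r
vec <- c("D20181116_Basel-Take1_digital", "D20181116_Basel-Take2_digital",
         "D20181116_Basel-Take3_digital", "D20181116_Basel-Take4_digital",
         "D20181116_Basel-Take5_digital", "D20181116_Basel-Take5a_digital")

# Length of the longest prefix shared by every string in x
common_prefix_len <- function(x) {
  n <- min(nchar(x))
  if (n == 0) return(0L)
  chars <- strsplit(x, "")
  # same[i] is TRUE when all strings have the same character at position i
  same <- vapply(seq_len(n), function(i) {
    length(unique(vapply(chars, `[`, character(1), i))) == 1
  }, logical(1))
  # Number of leading TRUE values
  sum(cumprod(same))
}

# Reverse each string character by character
reverse_string <- function(x) {
  vapply(strsplit(x, ""), function(ch) paste(rev(ch), collapse = ""), character(1))
}

# Remove the shared prefix, then the shared suffix of what remains
strip_constant_parts <- function(x) {
  x <- substring(x, common_prefix_len(x) + 1)
  s <- common_prefix_len(reverse_string(x))
  substring(x, 1, nchar(x) - s)
}

strip_constant_parts(vec)
```

Removing the prefix before measuring the suffix keeps the two parts from overlapping when some strings are very short.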