How to add leading zeroes to multiple parts of a string - R

I have the following data
v1
19956673-1
20043747-23
20056956-1
36628-2
45820-4
478
115
I need to add leading zeroes to both string fields (before and after the dash) so the desired output (v2) has 8 digits before the dash and 2 digits after. Also, data with no dash can be passed as is.
v1 v2
19956673-1 19956673-01
20043747-23 20043747-23
20056956-1 20056956-01
36628-2 00036628-02
45820-4 00045820-04
478 478
115 115

Here is an option: extract the part after the -, then use sprintf to pad it to two digits (the part before the dash is left unchanged)
i1 <- grep('-', df1$v1)
df1$v2 <- df1$v1
df1$v2[i1] <- sprintf('%s-%02d', sub('-.*', '', df1$v1[i1]),
as.numeric(sub('.*-', '', df1$v1[i1])))
-output
df1
# v1 v2
#1 19956673-1 19956673-01
#2 20043747-23 20043747-23
#3 20056956-1 20056956-01
#4 36628-2 36628-02
#5 45820-4 45820-04
#6 478 478
#7 115 115
Another option is a regex with capture groups: match one or more digits (\\d+) from the start (^) of the string and capture them as a group ((...)), followed by a -, then capture a single digit (\\d) at the end ($). In the replacement, use the backreferences of the captured groups and insert a 0 before the second one
df1$v2 <- sub('^(\\d+)-(\\d)$', '\\1-0\\2', df1$v1)
data
df1 <- structure(list(v1 = c("19956673-1", "20043747-23", "20056956-1",
"36628-2", "45820-4", "478", "115")), row.names = c(NA, -7L),
class = "data.frame")
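The outputs shown above pad only the part after the dash, while the question also asks for 8 digits before it. A minimal sketch that pads both parts, assuming every value containing a dash has only digits on both sides (the names has_dash, parts, left and right are just for illustration):

```r
v1 <- c("19956673-1", "20043747-23", "20056956-1",
        "36628-2", "45820-4", "478", "115")

# Only values that contain a dash are reformatted
has_dash <- grepl("-", v1, fixed = TRUE)
parts <- strsplit(v1[has_dash], "-", fixed = TRUE)

# Convert each side to integer so sprintf can zero-pad it
left <- as.integer(vapply(parts, `[`, character(1), 1))
right <- as.integer(vapply(parts, `[`, character(1), 2))

v2 <- v1
v2[has_dash] <- sprintf("%08d-%02d", left, right)
v2
```

Values without a dash are copied through unchanged, as the question requires.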

A solution with sub and positive lookbehind:
v2 <- sub("(?<=-)(\\d)$", "0\\1", v1, perl = TRUE)
Result:
v2
[1] "19956673-01" "20043747-23" "20056956-01" "36628-02" "45820-04"
How this works:
(?<=-): positive lookbehind: "if you see a - on the left ...
(\\d)$: ... then remember (\\1) the single digit ((\\d)) right at the end of the string ($) and add a 0 to the left of it"
Data:
v1 <- c("19956673-1", "20043747-23", "20056956-1", "36628-2", "45820-4")
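As a quick check of which inputs the lookbehind pattern touches, here is a small sketch using values from the question. Values that already have two digits after the dash, or no dash at all, should come back unchanged. The variant without lookbehind is an equivalent alternative, not part of the answer above.

```r
x <- c("36628-2", "20043747-23", "478")

# perl = TRUE is required: R's default regex engine does not support lookbehind
padded <- sub("(?<=-)(\\d)$", "0\\1", x, perl = TRUE)
padded

# Same idea without lookbehind: match the dash and write it back explicitly
padded_plain <- sub("-(\\d)$", "-0\\1", x)
identical(padded, padded_plain)
```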

Related

How to remove a mystery character from column header in R?

I have a mystery character in my dataframe in R:
df <- structure(list(`ID21` = c("23", "44"),
ID22 = c("53", "23"), `Drug-na�ve_D22` = c("53",
"45")), row.names = 1:2, class = "data.frame")
> df
ID21 ID22 Drug-na�ve_D22
1 23 53 53
2 44 23 45
What's the best way to remove this character? Would some sort of gsub with regular expression work?
In this example I've replaced it with the letter i:
> df
ID21 ID22 Drug-naive_D22
1 23 53 53
2 44 23 45
To remove every non-word character (anything other than letters, digits and underscore) from your column names, which also removes the hyphen:
names(df) <- gsub("\\W", "", names(df))
If you want to replace the characters with a different character, put it in the second argument.
To match any character outside the printable ASCII range (space through ~) you can use this pattern:
[^ -~]
So, for example, if you want to replace the char by i, you can use sub thus:
sub("[^ -~]", "i", names(df))
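A small sketch of that pattern on a stand-alone vector of names. It assumes, purely for illustration, that the mystery character is U+FFFD (the Unicode replacement character); the real character in your data may be something else, but anything outside the space-to-tilde range is matched the same way.

```r
# Assumption: the unknown character is U+FFFD; nm stands in for names(df)
nm <- c("ID21", "ID22", "Drug-na\uFFFDve_D22")

# sub replaces the first match in each name
fixed <- sub("[^ -~]", "i", nm)
fixed

# gsub would replace every match; here it simply drops the character
dropped <- gsub("[^ -~]", "", nm)
dropped
```

To apply the result to the data frame, assign it back with names(df) <- fixed.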

Finding matches on a character in more than one position in R

I have a character vector where I want to match the first and last parts of each string so I can group the matching strings.
Here is an example string: "20190625_165055_0f4e"
The first part is a date. The last 4 characters are a unique identifier. I need to group all strings in the vector where these two parts are the same.
I could use a simple regex to match characters according to position, but some strings have more middle characters than others, e.g. "20190813_170215_17_1057"
Here is an example vector:
mylist<-c("20190712_164755_1034","20190712_164756_1034","20190712_164757_1034","20190719_164712_1001","20190719_164713_1001","20190722_153110_1054","20190813_170215_17_1057","20190813_170217_22_1057","20190828_170318_14_1065")
With this being the desired output:
c("20190712_164755_1034","20190712_164756_1034","20190712_164757_1034")
c("20190719_164712_1001","20190719_164713_1001")
c("20190722_153110_1054")
c("20190813_170215_17_1057","20190813_170217_22_1057")
c("20190828_170318_14_1065")
edits: made my character vector more simple and added desired output
We could remove the middle substring with sub and use the result to split the vector into a list of character vectors
lst1 <- split(mylist, sub("^(\\d+)_.*_([^_]+)$", "\\1_\\2", mylist))
lst1
#$`20190712_1034`
#[1] "20190712_164755_1034" "20190712_164756_1034" "20190712_164757_1034"
#$`20190719_1001`
#[1] "20190719_164712_1001" "20190719_164713_1001"
#$`20190722_1054`
#[1] "20190722_153110_1054"
#$`20190813_1057`
#[1] "20190813_170215_17_1057" "20190813_170217_22_1057"
#$`20190828_1065`
#[1] "20190828_170318_14_1065"
In the sub, we capture ((...)) one or more digits (\\d+) from the start (^) of the string, followed by a _, then match other characters (.*) up to the last _, and capture the remaining characters that are not a _ ([^_]+) until the end ($) of the string. In the replacement, we specify the backreferences (\\1, \\2) of the captured groups. Essentially, we remove the varying part in the middle, keep the fixed substrings at the beginning and end, and use that key to split the character vector
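To make the grouping key visible, here is a short sketch that computes it separately before splitting. It uses three strings from the question; the variable names key and groups are illustrative.

```r
mylist <- c("20190712_164755_1034",
            "20190813_170215_17_1057",
            "20190813_170217_22_1057")

# Keep the leading date and the trailing identifier, drop everything between
key <- sub("^(\\d+)_.*_([^_]+)$", "\\1_\\2", mylist)
key

# Strings that share a key end up in the same list element
groups <- split(mylist, key)
groups
```

The two strings from 20190813 share the key 20190813_1057 even though their middle parts differ in content and length.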
Here's an alternative approach with extract from tidyr.
library(tidyr)
result <- as.data.frame(mylist) %>%
extract(1, into = c("date","var1","var2"),
regex = "(^[0-9]{8}_[0-9]{6})_?(.*)?_([^_]+$)",
remove = FALSE)
result
# mylist date var1 var2
#1 20190625_165055_0f4e 20190625_165055 0f4e
#2 20190625_165056_0f4e 20190625_165056 0f4e
#3 20190625_165057_0f4e 20190625_165057 0f4e
#4 20190712_164755_1034 20190712_164755 1034
#...
#27 20190828_170318_14_1065 20190828_170318 14 1065
#28 20190828_170320_26_1065 20190828_170320 26 1065
#...
Now you can easily manipulate the data based on those variables.
split(result,result$var2)
#$`0f22`
# mylist date var1 var2
#29 20190917_165157_0f22 20190917_165157 0f22
#
#$`0f2a`
# mylist date var1 var2
#18 20190813_152856_0f2a 20190813_152856 0f2a
#19 20190813_152857_0f2a 20190813_152857 0f2a
#...
We can use extract to extract the date part and last 4 characters into separate columns. We then use group_split to split data based on those 2 columns.
tibble::tibble(mylist) %>%
tidyr::extract(mylist, c('col1', 'col2'), regex = '(.*?)_.*_(.*)',
remove = FALSE) %>%
dplyr::group_split(col1, col2, .keep = FALSE)
#[[1]]
# A tibble: 3 x 1
# mylist
# <chr>
#1 20190712_164755_1034
#2 20190712_164756_1034
#3 20190712_164757_1034
#[[2]]
# A tibble: 2 x 1
# mylist
# <chr>
#1 20190719_164712_1001
#2 20190719_164713_1001
#[[3]]
# A tibble: 1 x 1
# mylist
# <chr>
#1 20190722_153110_1054
#...

Extract values based on last n characters

I have a vector like below:
vector
jdjss-jdhs--abc-bec-ndj
kdjska-kvjd-jfj-nej-ndjk
eknd-nend-neekd-nemd-nemdkd-nedke
How do I extract the last 3 values, based on a - delimiter, so that my result looks like below:
vector Col1 Col2 Col3
jdjss-jdhs--abc-bec-ndj abc bec ndj
kdjska-kvjd-jfj-nej-ndjk jfj nej ndjk
eknd-nend-neekd-nemd-nemdkd-nedke nemd nemdkd nedke
I've attempted to use sub and the qdap package but no luck.
sub( "(^[^-]+[-][^-]+)(.+$)", "\\2", df$vector)
qdap::char2end(df$vector, "-", 3)
Not sure how to go about doing this.
You may use tidyr::extract:
library(tidyr)
vector <- c("jdjss-jdhs--abc-bec-ndj", "kdjska-kvjd-jfj-nej-ndjk", "eknd-nend-neekd-nemd-nemdkd-nedke")
df <- data.frame(vector)
tidyr::extract(df, vector, into = c("Col1", "Col2", "Col3"), "([^-]*)-([^-]*)-([^-]*)$", remove=FALSE)
vector Col1 Col2 Col3
1 jdjss-jdhs--abc-bec-ndj abc bec ndj
2 kdjska-kvjd-jfj-nej-ndjk jfj nej ndjk
3 eknd-nend-neekd-nemd-nemdkd-nedke nemd nemdkd nedke
The ([^-]*)-([^-]*)-([^-]*)$ pattern matches:
([^-]*) - Group 1 ('Col1'): 0+ chars other than -
- - a hyphen
([^-]*) - Group 2 ('Col2'): 0+ chars other than -
- - a hyphen
([^-]*) - Group 3 ('Col3'): 0+ chars other than -
$ - end of string
Set remove=FALSE in order to keep the original column.
You can use strsplit from base.
x <- "eknd-nend-neekd-nemd-nemdkd-nedke"
lastElements <- function(x, last = 3){
strLength <- length(strsplit(x, "-")[[1]])
start <- strLength - (last - 1)
strsplit(x, "-")[[1]][start:strLength]
}
> lastElements(x)
[1] "nemd" "nemdkd" "nedke"
You can simply split string by - using strsplit and extract last n elements:
df <- data.frame(vector = c(
"jdjss-jdhs--abc-bec-ndj",
"kdjska-kvjd-jfj-nej-ndjk",
"eknd-nend-neekd-nemd-nemdkd-nedke"),
stringsAsFactors = FALSE
)
cbind(df, t(sapply(strsplit(df$vector, "-"), tail, 3)))
vector 1 2 3
1 jdjss-jdhs--abc-bec-ndj abc bec ndj
2 kdjska-kvjd-jfj-nej-ndjk jfj nej ndjk
3 eknd-nend-neekd-nemd-nemdkd-nedke nemd nemdkd nedke
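The sapply/tail approach relies on every string having at least 3 fields; with fewer, sapply returns a ragged list and the transpose no longer gives a clean matrix. A hedged sketch of a helper that pads short rows with NA on the left (last_fields is a hypothetical name, not a function from base R or any package):

```r
# Hypothetical helper: last n fields of each string, NA-padded when too short
last_fields <- function(x, n = 3, sep = "-") {
  pieces <- strsplit(x, sep, fixed = TRUE)
  rows <- lapply(pieces, function(p) {
    p <- tail(p, n)
    c(rep(NA_character_, n - length(p)), p)
  })
  out <- do.call(rbind, rows)
  colnames(out) <- paste0("Col", seq_len(n))
  out
}

res <- last_fields(c("jdjss-jdhs--abc-bec-ndj", "x-y"))
res
```

The result is a character matrix with one row per input string, so it can be passed to cbind together with the original data frame.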
strcapture, as a base R corollary to the tidyr extract answer from Wiktor:
strcapture("([^-]*)-([^-]*)-([^-]*)$", df$vector, proto=list(Col1="",Col2="",Col3=""))
# Col1 Col2 Col3
#1 abc bec ndj
#2 jfj nej ndjk
#3 nemd nemdkd nedke

Remove all the characters before the last comma in R

I have a data table like this:
id number
1 5562,4024,...,1213
2 4244,4214,...,244
3 424,4213
4 1213,441
...
And I want to keep only the last part of each entry in the number column, which should look like:
id number
1 1213
2 244
3 4213
4 441
...
So what should I do to achieve that?
One option is to capture the digits at the end ($) of the string that follow a , as a group and replace the whole match with the backreference (\\1) of the captured group
df$number <- as.numeric(sub(".*,(\\d+)$", "\\1", df$number))
Or match all the characters (.*) up to and including the last , and replace them with blank ("")
df$number <- as.numeric(sub(".*,", "", df$number))
data
df <- structure(list(id = 1:4, number = c("5562,4024,...,1213",
"4244,4214,...,244",
"424,4213", "1213,441")), class = "data.frame", row.names = c(NA,
-4L))
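The second option works because .* is greedy: it consumes as much as possible, so the , it stops at is the last one in the string. A small sketch on made-up values (the vector number and the strsplit cross-check are illustrative, not part of the answer above):

```r
number <- c("5562,4024,1213", "424,4213", "77")

# Greedy .* runs up to the last comma; strings without a comma are unchanged
last_greedy <- sub(".*,", "", number)

# Cross-check: split on commas and take the final piece of each string
last_split <- vapply(strsplit(number, ",", fixed = TRUE), tail, character(1), n = 1)

identical(last_greedy, last_split)
as.numeric(last_greedy)
```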

Trimming and reformatting dates in R

I have a column of data with the following types of dates and number entries:
16-Jun
21-01A
7-04
Aug-99
5-09
I want to convert these all into numbers, by doing two things. First, where the data have a number before a dash (as in the first three examples), I want to trim the data from the dash onwards. So the entries would appear 16, 21 and 7.
Second, where the entry is written in month-date format (e.g. Aug-99), I want to convert the month to its number and then trim it. So in this example, Aug-99 would be converted to 8-99 and then trimmed to just 8.
How can I do this in R? When I use grep, sub and match commands, as in the answer below, I get:
[1] 16 21 7 5 8
When I am after: [1] 16 21 7 8 5
We use grepl to flag the elements that start with a letter. Remove the substring from the - to the end of the string with sub. Then subset 'v2' with 'i1': convert the elements that do not start with a letter to numeric, match the ones that do against month.abb to get the month index, and concatenate the results. Note that this places the month entries after the numeric ones.
i1 <- grepl("^[A-Z]", v1)
v2 <- sub("-.*", "", v1)
c(as.numeric(v2[!i1]), match(v2[i1], month.abb))
#[1] 16 21 7 8
For the new dataset, we can use ifelse, which keeps the original order
i1 <- grepl("^[A-Z]", df1$v1)
v2 <- sub("-.*", "", df1$v1)
as.numeric(ifelse(i1, match(v2, month.abb), v2))
#[1] 16 21 7 8 5
data
v1 <- c('16-Jun','21-01A','7-04','Aug-99')
df1 <- structure(list(v1 = c("16-Jun", "21-01A", "7-04", "Aug-99", "5-09"
)), .Names = "v1", class = "data.frame", row.names = c(NA, -5L))
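An alternative sketch that also preserves the original order, using indexed assignment instead of ifelse. It assumes every prefix is either an English month abbreviation from month.abb or a plain integer; the names prefix, is_month and out are illustrative.

```r
v1 <- c("16-Jun", "21-01A", "7-04", "Aug-99", "5-09")

# Everything before the first dash
prefix <- sub("-.*", "", v1)
is_month <- prefix %in% month.abb

# Fill each position in place so the input order is kept
out <- integer(length(v1))
out[is_month] <- match(prefix[is_month], month.abb)
out[!is_month] <- as.integer(prefix[!is_month])
out
```

Testing against month.abb rather than a leading capital letter avoids treating other letter-initial prefixes as months.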
