Modifying hyphenated data with R

I am reading data from a csv file and one of the columns in the data comes in three different formats:
xxxxx-xxx-xx (5-3-2)
xxxxx-xxxx-x (5-4-1)
xxxx-xxxx-xx (4-4-2)
My goal is to turn these three different styles into one style in the form:
xxxxx-xxxx-xx (5-4-2)
In order to make all the different forms the same I need to insert an additional zero at the specific location on each of the 3 different conditions like so:
xxxxx-0xxx-xx
xxxxx-xxxx-0x
0xxxx-xxxx-xx
Anyone have thoughts on the best way to accomplish this?

I would do this using sprintf and strsplit:
x <- c('11111-111-11', '11111-1111-1', '1111-1111-11')
y <- strsplit(x, '-')
myfun <- function(y) {
  first <- sprintf('%05d', as.integer(y[1]))
  second <- sprintf('%04d', as.integer(y[2]))
  third <- sprintf('%02d', as.integer(y[3]))
  paste(first, second, third, sep='-')
}
sapply(y, myfun)
# [1] "11111-0111-11" "11111-1111-01" "01111-1111-11"
You could also do this with fancy regular expressions or the gsubfn package but that may be overkill!
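For comparison, here is a minimal sketch of the regular-expression route mentioned above, using only base R's sub. The function name pad_groups is hypothetical, and the sketch assumes every value is in exactly one of the three formats from the question (5-3-2, 5-4-1, or 4-4-2) and contains only digits and hyphens.

```r
x <- c('11111-111-11', '11111-1111-1', '1111-1111-11')

# pad_groups is a hypothetical helper name, not part of any package
pad_groups <- function(x) {
  x <- sub('^(\\d{4})-', '0\\1-', x)   # 4-digit first group  -> 5 digits
  x <- sub('-(\\d{3})-', '-0\\1-', x)  # 3-digit middle group -> 4 digits
  x <- sub('-(\\d)$', '-0\\1', x)      # 1-digit last group   -> 2 digits
  x
}

padded <- pad_groups(x)
padded
# [1] "11111-0111-11" "11111-1111-01" "01111-1111-11"
```

Each pattern only matches when its group is one digit short, so values that are already in 5-4-2 form pass through unchanged.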

A slightly shorter and more functional-programming-style version of Justin's solution:
numbers <- c('11111-111-11', '11111-1111-1', '1111-1111-11')
restyle <- function(number, fmt){
  tmp <- as.list(as.integer(strsplit(number, '-')[[1]]))
  do.call(sprintf, modifyList(tmp, list(fmt = fmt)))
}
sapply(numbers, restyle, fmt = '%05d-%04d-%02d', USE.NAMES = FALSE)
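Because sprintf is vectorised, the per-element sapply can probably be avoided by splitting once into a matrix. This is only a sketch, assuming every value has exactly three hyphen-separated numeric groups; the variable names parts and padded are illustrative.

```r
numbers <- c('11111-111-11', '11111-1111-1', '1111-1111-11')

# one row per value, one column per hyphen-separated group
parts <- do.call(rbind, strsplit(numbers, '-', fixed = TRUE))
storage.mode(parts) <- 'integer'   # convert in place, keeping the matrix shape

# sprintf recycles over the three columns in parallel
padded <- sprintf('%05d-%04d-%02d', parts[, 1], parts[, 2], parts[, 3])
padded
# [1] "11111-0111-11" "11111-1111-01" "01111-1111-11"
```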

Are you working in a Unix-like environment? It might be easier to use sed at the command line rather than R's regex functions.
echo "54324-965-23" | sed 's/\(.....\)-\(...\)-\(..\)/\1-0\2-\3/'
will spit back
54324-0965-23
If you want to apply it to the entire file it would look something like
cat file1.txt | sed 's/\(.....\)-\(...\)-\(..\)/\1-0\2-\3/' > file2.txt
And if you have multiple text-changing operations, you can pipe them all together:
cat file1.txt | sed 's/\(.....\)-\(...\)-\(..\)/\1-0\2-\3/' | sed '2ndthing' | sed 'thirdthing' > file2.txt

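One approach is to first remove the hyphens and then add them back at fixed character positions. Note that this only regroups the existing digits (into 4-4-2 here) and does not insert the padding zeros the question asks for: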
> v <- c("01234-567-89","01234-5678-9","0123-4567-89")
> v
[1] "01234-567-89" "01234-5678-9" "0123-4567-89"
> #remove hyphens
> v <- gsub("-","",v)
> v
[1] "0123456789" "0123456789" "0123456789"
> #add hyphens
> paste(substr(v,1,4),substr(v,5,8),substr(v,9,10),sep="-")
[1] "0123-4567-89" "0123-4567-89" "0123-4567-89"

Related

How to remove specific pattern in string?

I have a string like f <- "./DAYA-1178/10TH FEB.xlsx" and would like to extract only DAYA-1178.
What I have tried is
f1 <- gsub(".*./","", f)
But it returns the last part of the path, "10TH FEB.xlsx".
I would appreciate any leads.
It seems you are dealing with files. You need the basename of the directory:
basename(dirname(f))
[1] "DAYA-1178"
or you could do:
sub(".*/","",dirname(f))
[1] "DAYA-1178"
Using strsplit, we can split the input on path separator / and retain the second element:
f <- "./DAYA-1178/10TH FEB.xlsx"
unlist(strsplit(f, "/"))[2]
[1] "DAYA-1178"
If you wish to use sub, here is one way:
sub("^.*/(.*?)/.*$", "\\1", f)
[1] "DAYA-1178"
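The basename(dirname(...)) approach above is vectorised, so it should carry over to a whole vector of paths. A small sketch follows; the second path is made up for illustration and is not from the question.

```r
# only the first path comes from the question; the second is hypothetical
paths <- c("./DAYA-1178/10TH FEB.xlsx", "./DAYA-1201/11TH FEB.xlsx")

# dirname() drops the file name, basename() keeps the last remaining component
folders <- basename(dirname(paths))
folders
# [1] "DAYA-1178" "DAYA-1201"
```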
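You can also try stripping the extension characters and then splitting on /. Note that the character class [.,xlsx] removes every ".", ",", "x", "l", and "s" anywhere in the string, so this only works when the folder name contains none of those characters:
f1 <- gsub("[.,xlsx]","",f)
# f1 is now "/DAYA-1178/10TH FEB"
f3 <- strsplit(f1,"/")[[1]][2]
# f3 is "DAYA-1178", the answer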

How to split a string based on a semicolon following a conditional digit

I'm working in R with strings like the following:
"a1_1;a1_2;a1_5;a1_6;a1_8"
"two1_1;two1_4;two1_5;two1_7"
I need to split these strings into two strings based on the last digit being less than 7 or not. For instance, the desired output for the two strings above would be:
"a1_1;a1_2;a1_5;a1_6" "a1_8"
"two1_1;two1_4;two1_5" "two1_7"
I attempted the following to no avail:
x <- "a1_1;a1_2;a1_5;a1_6;a1_8"
str_split("x", "(\\d<7);")
In an earlier version of the question I was helped by someone that provided the following function, but I don't think it's set up to handle digits both before and after the semicolon in the strings above. I'm trying to modify it but I haven't been able to get it to come out correctly.
f1 <- function(strn) {
strsplit(gsubfn("(;[A-Za-z]+\\d+)", ~ if(readr::parse_number(x) >= 7)
paste0(",", sub(";", "", x)) else x, strn), ",")[[1]]
}
Can anyone help me understand what I'd need to do to make this split as desired?
Splitting and recombining on ;, with a simple regex capture in between.
s <- c("a1_1;a1_2;a1_5;a1_6;a1_8", "two1_1;two1_4;two1_5;two1_7")
sp <- strsplit(s, ";")
lapply(sp,
  function(x) {
    # extract the final digit of each item and compare it numerically
    l <- as.integer(sub(".*(\\d)$", "\\1", x)) < 7
    c(paste(x[l], collapse=";"), paste(x[!l], collapse=";"))
  }
)
# [[1]]
# [1] "a1_1;a1_2;a1_5;a1_6" "a1_8"
#
# [[2]]
# [1] "two1_1;two1_4;two1_5" "two1_7"
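The split-and-recombine idea above can be wrapped in a small reusable function. This is a sketch only: the name split_by_last_digit and the cutoff parameter are hypothetical, and it assumes every semicolon-separated item ends in a single digit.

```r
# hypothetical helper: split one string into items below / at-or-above a cutoff
split_by_last_digit <- function(s, cutoff = 7) {
  items <- strsplit(s, ";", fixed = TRUE)[[1]]
  last  <- as.integer(sub(".*(\\d)$", "\\1", items))  # final digit as a number
  below <- last < cutoff
  c(paste(items[below], collapse = ";"),
    paste(items[!below], collapse = ";"))
}

res1 <- split_by_last_digit("a1_1;a1_2;a1_5;a1_6;a1_8")
res1
# [1] "a1_1;a1_2;a1_5;a1_6" "a1_8"
res2 <- split_by_last_digit("two1_1;two1_4;two1_5;two1_7")
res2
# [1] "two1_1;two1_4;two1_5" "two1_7"
```

Converting the captured digit with as.integer keeps the comparison numeric rather than relying on string ordering.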

Remove text after second colon

I need to remove everything after the second colon. I have several date formats, that need to be cleaned using the same algorithm.
a <- "2016-12-31T18:31:34Z"
b <- "2016-12-31T18:31Z"
I have tried to match on the two colon groups, but I cannot work out how to remove the second match group.
sub("(:.*){2}", "", "2016-12-31T18:31:34Z")
A regex you can use is (:[^:]+):.*
You can check it on regex101 and use it like this:
sub("(:[^:]+):.*", "\\1", "2016-12-31T18:31:34Z")
[1] "2016-12-31T18:31"
sub("(:[^:]+):.*", "\\1", "2016-12-31T18:31Z")
[1] "2016-12-31T18:31Z"
Let's say you have a vector:
date <- c("2016-12-31T18:31:34Z", "2016-12-31T18:31Z", "2017-12-31T18:31Z")
Then you could split it by ":" and take only first two elements dropping the rest:
out = sapply(date, function(x) paste(strsplit(x, ":")[[1]][1:2], collapse = ':'))
You could use this as an opportunity to write a partial timestamp validator, rather than just targeting any trailing seconds:
remove_seconds <- function(x) {
  require(stringi)
  x <- stri_trim_both(x)
  # [[1]] keeps only the match for the first string, so pass one timestamp at a time
  x <- stri_match_all_regex(x, "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}T[[:digit:]]{2}:[[:digit:]]{2})")[[1]]
  if (any(is.na(x))) return(NA)
  sprintf("%sZ", x[,2])
}
That way, you'll catch errant timestamp strings.
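If stringi is not available, a similar validate-then-trim idea can be sketched in base R. The function name strip_seconds is hypothetical, and unlike the function above this version assumes the whole string must be a timestamp ending in Z, with an optional :ss part; anything else becomes NA.

```r
# hypothetical base-R variant; vectorised over x
strip_seconds <- function(x) {
  x <- trimws(x)
  pat <- "^(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2})(:\\d{2})?Z$"
  # keep date, hours and minutes; drop the optional seconds group
  ifelse(grepl(pat, x), sub(pat, "\\1Z", x), NA_character_)
}

cleaned <- strip_seconds(c("2016-12-31T18:31:34Z", "2016-12-31T18:31Z", "not a date"))
cleaned
# [1] "2016-12-31T18:31Z" "2016-12-31T18:31Z" NA
```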

change the sequence of numbers in a filename using R

I am sorry, I could not find an answer to this question anywhere and would really appreciate your help.
I have .csv files for each hour of a year. The filename is written in the following way:
hh_dd_mm.csv (e.g. for February 1st 00:00--> 00_01_02.csv). In order to make it easier to sort the hours of a year I would like to change the filename to mm_dd_hh.csv
How can I write in R to change the filename from the pattern HH_DD_MM to MM_DD_HH?
a <- list.files(path = ".", pattern = "HH_DD_MM")
b<-paste(pattern="MM_DD_HH")
file.rename(a,b)
Or you could do:
a <- c("00_01_02.csv", "00_02_02.csv")
gsub("(\\d{2})\\_(\\d{2})\\_(\\d{2})(.*)", "\\3_\\2_\\1\\4", a)
#[1] "02_01_00.csv" "02_02_00.csv"
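To go from the new names to an actual rename on disk, the regex can be combined with list.files and file.rename. The sketch below works in a throwaway temporary directory with made-up empty files, so the directory and file names are assumptions for illustration only.

```r
# set up a scratch directory with two example files (illustrative only)
demo_dir <- tempfile("rename_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("00_01_02.csv", "00_02_02.csv")))

# pick up only names of the form hh_dd_mm.csv
old <- list.files(demo_dir, pattern = "^\\d{2}_\\d{2}_\\d{2}\\.csv$")

# swap the first and third groups: hh_dd_mm -> mm_dd_hh
new <- sub("^(\\d{2})_(\\d{2})_(\\d{2})(\\.csv)$", "\\3_\\2_\\1\\4", old)

renamed_ok <- file.rename(file.path(demo_dir, old), file.path(demo_dir, new))
result <- sort(list.files(demo_dir))
result
# [1] "02_01_00.csv" "02_02_00.csv"
```

Building full paths with file.path keeps the rename independent of the current working directory.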
Not sure if this is the best solution, but it seems to work:
a <- c("00_01_02.csv", "00_02_02.csv")
b <- unname(sapply(a, function(x) {temp <- strsplit(x, "(_|[.])")[[1]] ; paste0(temp[[3]], "_", temp[[2]], "_", temp[[1]], ".", temp[[4]])}))
b
## [1] "02_01_00.csv" "02_02_00.csv"
You can use chartr to create the new file name. Here's an example:
> write.csv(c(1,1), "12_34_56")
> list.files()
# [1] "12_34_56"
> file.rename("12_34_56", chartr("1256", "5612", "12_34_56"))
# [1] TRUE
> list.files()
# [1] "56_34_12"
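In chartr, you can replace individual characters of a string, so the number of characters never changes. In the above code, mapping "1256" to "5612" has the effect of swapping "12" with "56", which is what it looks like you are trying to do. Because the translation is applied character by character everywhere in the string, this only works when the swapped characters do not also appear in the parts that should stay unchanged.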
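A quick check of how chartr behaves may help here. It translates single characters everywhere in the string, so it is not a general substring swap; the second input below is taken from the file names in the question.

```r
# works: the digits 1, 2, 5, 6 appear only in the parts being swapped
swapped_ok <- chartr("1256", "5612", "12_34_56")
swapped_ok
# [1] "56_34_12"

# does not give "02_01_00": every 0 becomes 2 and every 2 becomes 0
swapped_bad <- chartr("02", "20", "00_01_02")
swapped_bad
# [1] "22_21_20"
```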
Or, you can write a short string swapping function
> strSwap <- function(x) paste(rev(strsplit(x, "[_]")[[1]]), collapse = "_")
> ( files <- c("84_15_45", "59_95_21", "31_51_49",
"51_88_27", "21_39_98", "35_27_14") )
# [1] "84_15_45" "59_95_21" "31_51_49" "51_88_27" "21_39_98" "35_27_14"
> sapply(files, strSwap, USE.NAMES = FALSE)
# [1] "45_15_84" "21_95_59" "49_51_31" "27_88_51" "98_39_21" "14_27_35"
You could also do it with the substr<- assignment function
> s1 <- substr(files,1,2)
> substr(files,1,2) <- substr(files,7,8)
> substr(files,7,8) <- s1
> files
# [1] "45_15_84" "21_95_59" "49_51_31" "27_88_51" "98_39_21" "14_27_35"

Add commas to output

How can I obtain my R output (let us say the elements of a vector) separated by commas?
(i.e. not by spaces)
Currently I can only get output separated by spaces.
Try this example:
#dummy vector
v <- c("a","1","c")
#separated with commas
paste(v,collapse=",")
#output
#[1] "a,1,c"
EDIT 1:
Thanks to @DavidArenburg:
cat(noquote(paste(v,collapse=",")))
a,1,c
EDIT 2:
Another option, by @RichardScriven:
cat(v, sep = ",")
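The options above differ mainly in whether they return a string or print directly. As a short side-by-side sketch (toString is a base R function not mentioned in the answers above; the variable names are illustrative):

```r
v <- c("a", "1", "c")

joined <- paste(v, collapse = ",")   # returns one string: "a,1,c"
spaced <- toString(v)                # returns one string with ", " separators: "a, 1, c"

cat(v, sep = ",")                    # prints a,1,c to the console and returns NULL
```

paste and toString are the ones to use when the result needs to be stored or passed on; cat is only for display.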

Resources