How to remove specific pattern in string? - r

I have a string like f <- "./DAYA-1178/10TH FEB.xlsx". I would like to extract only DAYA-1178.
What I have tried is:
f1 <- gsub(".*./","", f)
But it returns the last part of the path, "10TH FEB.xlsx".
I'd appreciate any lead.

It seems you are dealing with files. You need the basename of the directory:
basename(dirname(f))
[1] "DAYA-1178"
or you could do:
sub(".*/","",dirname(f))
[1] "DAYA-1178"
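Both functions are vectorized, so the same call works on several paths at once. A minimal sketch, assuming a hypothetical vector named paths (only its first element comes from the question):

```r
# Hypothetical example paths; only the first one appears in the question
paths <- c("./DAYA-1178/10TH FEB.xlsx", "./DAYA-2001/11TH FEB.xlsx")

# dirname() drops the file name; basename() keeps the last remaining component
parent_dirs <- basename(dirname(paths))
parent_dirs
```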

Using strsplit, we can split the input on path separator / and retain the second element:
f <- "./DAYA-1178/10TH FEB.xlsx"
unlist(strsplit(f, "/"))[2]
[1] "DAYA-1178"
If you wish to use sub, here is one way:
sub("^.*/(.*?)/.*$", "\\1", f)
[1] "DAYA-1178"
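For context, the attempt in the question fails because ".*" is greedy and runs up to the last "/". The sketch below contrasts it with an anchored pattern; the names attempt and fixed are only illustrative, and the pattern assumes the path starts with "./":

```r
f <- "./DAYA-1178/10TH FEB.xlsx"

# Greedy ".*" extends to the LAST "/", so only the file name is left
attempt <- gsub(".*./", "", f)

# Anchoring at the start and using [^/]+ stops at the next "/"
fixed <- sub("^\\./([^/]+)/.*$", "\\1", f)
```

Here attempt holds the file name and fixed holds the folder name.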

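You can also try it like this:
f1 <- gsub("[.,xlsx]","",f)
# f1 is now "/DAYA-1178/10TH FEB"
f3 <- strsplit(f1,"/")[[1]][2]
# f3 is now "DAYA-1178"
Note that the character class [.,xlsx] deletes every ".", ",", "x", "l" and "s" anywhere in the string, so this only works when the folder name contains none of those characters.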

Related

How to separate a column of directory paths and create a new column with the first level directory in r

I have a df with 3 columns. The second column contains directory paths. How do I separate the first level directory of the path and create a new column for it?
The df looks like this:
file_path                      file_name
/owner1/project1/third         name1.bam
/owner2/project2/hard/fourth   name2.bam
/owner2/project3/easy/         name3.bam
/owner3/project4/A.            name4.bam
The output I seek is:
owner     file_path               file_name
/owner1   /project1/third         name1.bam
/owner2   /project2/hard/fourth   name2.bam
/owner2   /project3/easy          name3.bam
/owner3   /project4/A.            name4.bam
I have tried "mutate", but when I use the "/" as the separator, it splits all the levels. All I want is to separate the first level of the path. Is there another approach or function that can accomplish this?
It's unclear to me what you're looking to do precisely but you may want to try using strsplit(x, split = "/") and then index the output.
Edit:
Hey Lou_A, maybe the following is what you are looking for. There are probably more elegant ways of doing this, but this is what came to mind.
df <- data.frame(lvl = c("levela1/levela2/levela3",
                         "levelb1/levelb2/levelb3",
                         "levelc1/levelc2/levelc3"),
                 stringsAsFactors = FALSE)
# First-level component of each path
df[,"lvl1"] <- sapply(strsplit(df[,1], "/"), `[`, 1)
rep.fun <- function(str){
  lvl1 <- strsplit(str, "/")[[1]][1]
  # fixed = TRUE treats lvl1 literally; sub() removes only the leading match
  sub(lvl1, "", str, fixed = TRUE)
}
df[,"rest.lvl"] <- sapply(df[,1], rep.fun, USE.NAMES = FALSE)
Alternatively, with stringr:
library(stringr)
df <- data.frame(lvl = c("levela1/levela2/levela3",
                         "levelb1/levelb2/levelb3",
                         "levelc1/levelc2/levelc3"))
df <- cbind(str_split_fixed(df$lvl, '/', n=2), df[,-1])
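Applied to the data frame from the question, the same idea can be sketched in base R with two sub() calls. This is only a sketch, assuming every path starts with "/" and that the new column is called owner as in the desired output:

```r
# Data as shown in the question
df <- data.frame(
  file_path = c("/owner1/project1/third", "/owner2/project2/hard/fourth",
                "/owner2/project3/easy/", "/owner3/project4/A."),
  file_name = c("name1.bam", "name2.bam", "name3.bam", "name4.bam"),
  stringsAsFactors = FALSE
)

# First-level directory: the leading "/" plus everything up to the next "/"
df$owner <- sub("^(/[^/]+).*$", "\\1", df$file_path)
# Remainder of the path with the first level removed
df$file_path <- sub("^/[^/]+", "", df$file_path)
```

Only the first level is split off; deeper levels stay together in file_path, and a trailing "/" is left untouched.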

How to split a string based on a semicolon following a conditional digit

I'm working in R with strings like the following:
"a1_1;a1_2;a1_5;a1_6;a1_8"
"two1_1;two1_4;two1_5;two1_7"
I need to split these strings into two strings based on the last digit being less than 7 or not. For instance, the desired output for the two strings above would be:
"a1_1;a1_2;a1_5;a1_6" "a1_8"
"two1_1;two1_4;two1_5" "two1_7"
I attempted the following to no avail:
x <- "a1_1;a1_2;a1_5;a1_6;a1_8"
str_split("x", "(\\d<7);")
In an earlier version of the question I was helped by someone that provided the following function, but I don't think it's set up to handle digits both before and after the semicolon in the strings above. I'm trying to modify it but I haven't been able to get it to come out correctly.
f1 <- function(strn) {
strsplit(gsubfn("(;[A-Za-z]+\\d+)", ~ if(readr::parse_number(x) >= 7)
paste0(",", sub(";", "", x)) else x, strn), ",")[[1]]
}
Can anyone help me understand what I'd need to do to make this split as desired?
Splitting and recombining on ;, with a simple regex capture in between.
s <- c("a1_1;a1_2;a1_5;a1_6;a1_8", "two1_1;two1_4;two1_5;two1_7")
sp <- strsplit(s, ";")
lapply(sp,
  function(x) {
    # TRUE where the trailing digit is below 7 (compared as a number)
    l <- as.integer(sub(".*(\\d)$", "\\1", x)) < 7
    c(paste(x[l], collapse=";"), paste(x[!l], collapse=";"))
  }
)
# [[1]]
# [1] "a1_1;a1_2;a1_5;a1_6" "a1_8"
#
# [[2]]
# [1] "two1_1;two1_4;two1_5" "two1_7"
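The same split-and-recombine logic can be wrapped in a small function with the cutoff as a parameter. This is a sketch; split_by_last_digit and threshold are hypothetical names, and it assumes every element ends in a single digit:

```r
# Hypothetical helper: split one ";"-separated string by its trailing digits
split_by_last_digit <- function(s, threshold = 7) {
  parts <- strsplit(s, ";", fixed = TRUE)[[1]]
  # Last digit of each element, converted so the comparison is numeric
  last_digit <- as.integer(sub(".*(\\d)$", "\\1", parts))
  below <- last_digit < threshold
  c(paste(parts[below], collapse = ";"),
    paste(parts[!below], collapse = ";"))
}

res_a   <- split_by_last_digit("a1_1;a1_2;a1_5;a1_6;a1_8")
res_two <- split_by_last_digit("two1_1;two1_4;two1_5;two1_7")
```

If no element falls on one side of the cutoff, that side comes back as an empty string.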

Remove text after second colon

I need to remove everything after the second colon. I have several date formats, that need to be cleaned using the same algorithm.
a <- "2016-12-31T18:31:34Z"
b <- "2016-12-31T18:31Z"
I have tried to match on the two colon groups, but I cannot seem to find out how to remove the second match group.
sub("(:.*){2}", "", "2016-12-31T18:31:34Z")
A regex you can use: (:[^:]+):.*
which you can check on: regex101 and use like
sub("(:[^:]+):.*", "\\1", "2016-12-31T18:31:34Z")
[1] "2016-12-31T18:31"
sub("(:[^:]+):.*", "\\1", "2016-12-31T18:31Z")
[1] "2016-12-31T18:31Z"
Let's say you have a vector:
date <- c("2016-12-31T18:31:34Z", "2016-12-31T18:31Z", "2017-12-31T18:31Z")
Then you could split it by ":" and take only first two elements dropping the rest:
out = sapply(date, function(x) paste(strsplit(x, ":")[[1]][1:2], collapse = ':'))
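Since sub() is vectorized, the split-and-paste step can also be sketched as a single regex call. The pattern below is an illustration rather than one of the posted answers; it keeps everything before the second colon and leaves strings with fewer than two colons unchanged:

```r
date <- c("2016-12-31T18:31:34Z", "2016-12-31T18:31Z", "2017-12-31T18:31Z")

# Capture "text:text" from the start, then drop the second colon and the rest
out <- sub("^([^:]*:[^:]*):.*$", "\\1", date)
out
```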
Use it as an opportunity to write a partial timestamp validator instead of just targeting any trailing seconds:
remove_seconds <- function(x) {
  # Requires the stringi package; works on a whole character vector
  x <- stringi::stri_trim_both(x)
  m <- stringi::stri_match_first_regex(x, "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}T[[:digit:]]{2}:[[:digit:]]{2})")
  # Strings that do not look like a timestamp become NA
  ifelse(is.na(m[, 2]), NA_character_, sprintf("%sZ", m[, 2]))
}
That way, you'll catch errant timestamp strings, which come back as NA.

change the sequence of numbers in a filename using R

I am sorry, I could not find an answer to this question anywhere and would really appreciate your help.
I have .csv files for each hour of a year. The filename is written in the following way:
hh_dd_mm.csv (e.g. for February 1st 00:00--> 00_01_02.csv). In order to make it easier to sort the hours of a year I would like to change the filename to mm_dd_hh.csv
How can I write in R to change the filename from the pattern HH_DD_MM to MM_DD_HH?
a <- list.files(path = ".", pattern = "HH_DD_MM")
b<-paste(pattern="MM_DD_HH")
file.rename(a,b)
Or you could do:
a <- c("00_01_02.csv", "00_02_02.csv")
gsub("(\\d{2})\\_(\\d{2})\\_(\\d{2})(.*)", "\\3_\\2_\\1\\4", a)
#[1] "02_01_00.csv" "02_02_00.csv"
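To apply the regex to files on disk, the new names can be passed to file.rename(). A minimal sketch, assuming a temporary folder with two made-up files; tmp, old and new are illustrative names:

```r
# Work in a temporary directory with two hypothetical files
tmp <- tempfile("rename_demo")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("00_01_02.csv", "13_15_07.csv"))))

# Pick up only names of the form hh_dd_mm.csv
old <- list.files(tmp, pattern = "^\\d{2}_\\d{2}_\\d{2}\\.csv$")
# Swap the first and third fields: hh_dd_mm becomes mm_dd_hh
new <- sub("^(\\d{2})_(\\d{2})_(\\d{2})(\\.csv)$", "\\3_\\2_\\1\\4", old)

renamed <- file.rename(file.path(tmp, old), file.path(tmp, new))
result <- sort(list.files(tmp))
```

If a new name coincides with a file that has not been renamed yet, file.rename() may overwrite it or fail depending on the platform, so renaming into a separate folder is the safer choice.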
Not sure if this is the best solution, but it seems to work:
a <- c("00_01_02.csv", "00_02_02.csv")
b <- unname(sapply(a, function(x) {
  # Split into hh, dd, mm and the file extension
  temp <- strsplit(x, "(_|[.])")[[1]]
  paste0(temp[[3]], "_", temp[[2]], "_", temp[[1]], ".", temp[[4]])
}))
b
## [1] "02_01_00.csv" "02_02_00.csv"
You can use chartr to create the new file name. Here's an example:
> write.csv(c(1,1), "12_34_56")
> list.files()
# [1] "12_34_56"
> file.rename("12_34_56", chartr("1256", "5612", "12_34_56"))
# [1] TRUE
> list.files()
# [1] "56_34_12"
chartr translates characters one for one, so the result always has the same number of characters as the original string. In the above code, 1 maps to 5, 2 to 6, 5 to 1 and 6 to 2, which for this file name swaps "12" with "56", and that is what it looks like you are trying to do.
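Because the translation is per character, it only behaves like a field swap when the two fields share no digits. A short sketch of both cases; the second file name is borrowed from the question's example:

```r
# Character-by-character translation: 1 to 5, 2 to 6, 5 to 1, 6 to 2
swapped <- chartr("1256", "5612", "12_34_56")

# With repeated digits the mapping is no longer a field swap:
# every 0 becomes 2 and every 2 becomes 0, in all three fields
not_a_swap <- chartr("02", "20", "00_01_02")
```

The first call yields "56_34_12", while the second yields "22_21_20" rather than "02_01_00".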
Or, you can write a short string swapping function
> strSwap <- function(x) paste(rev(strsplit(x, "[_]")[[1]]), collapse = "_")
> ( files <- c("84_15_45", "59_95_21", "31_51_49",
"51_88_27", "21_39_98", "35_27_14") )
# [1] "84_15_45" "59_95_21" "31_51_49" "51_88_27" "21_39_98" "35_27_14"
> sapply(files, strSwap, USE.NAMES = FALSE)
# [1] "45_15_84" "21_95_59" "49_51_31" "27_88_51" "98_39_21" "14_27_35"
You could also do it with the substr<- assignment function
> s1 <- substr(files,1,2)
> substr(files,1,2) <- substr(files,7,8)
> substr(files,7,8) <- s1
> files
# [1] "45_15_84" "21_95_59" "49_51_31" "27_88_51" "98_39_21" "14_27_35"

Modifying hyphenated data with r

I am reading data from a csv file and one of the columns in the data comes in three different formats:
xxxxx-xxx-xx (5-3-2)
xxxxx-xxxx-x (5-4-1)
xxxx-xxxx-xx (4-4-2)
My goal is to turn these three different styles into one style in the form:
xxxxx-xxxx-xx (5-4-2)
In order to make all the different forms the same I need to insert an additional zero at the specific location on each of the 3 different conditions like so:
xxxxx-0xxx-xx
xxxxx-xxxx-0x
0xxxx-xxxx-xx
Anyone have thoughts on the best way to accomplish this?
I would do this using sprintf and strsplit:
x <- c('11111-111-11', '11111-1111-1', '1111-1111-11')
y <- strsplit(x, '-')
myfun <- function(y) {
  # Zero-pad each part to its target width: 5, 4 and 2 digits
  first <- sprintf('%05d', as.integer(y[1]))
  second <- sprintf('%04d', as.integer(y[2]))
  third <- sprintf('%02d', as.integer(y[3]))
  paste(first, second, third, sep='-')
}
sapply(y, myfun)
# [1] "11111-0111-11" "11111-1111-01" "01111-1111-11"
You could also do this with fancy regular expressions or the gsubfn package but that may be overkill!
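For comparison, a regex version might look like the sketch below. It is an illustration, not part of the answer above, and it assumes each value has exactly one short part, as in the three formats from the question:

```r
x <- c('11111-111-11', '11111-1111-1', '1111-1111-11')

# Pad a 4-digit first part, a 3-digit middle part, or a 1-digit last part
padded <- sub("^(\\d{4})-", "0\\1-", x)
padded <- sub("-(\\d{3})-", "-0\\1-", padded)
padded <- sub("-(\\d)$", "-0\\1", padded)
padded
```

Each pattern only matches when its part is one digit short, so parts that already have the full width are left alone.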
Slightly shorter and a more functional programming version of Justin's solution
numbers <- c('11111-111-11', '11111-1111-1', '1111-1111-11')
restyle <- function(number, fmt){
  # Split into integer parts, then pass them to sprintf together with fmt
  tmp <- as.list(as.integer(strsplit(number, '-')[[1]]))
  do.call(sprintf, modifyList(tmp, list(fmt = fmt)))
}
sapply(numbers, restyle, fmt = '%05d-%04d-%02d', USE.NAMES = FALSE)
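Since sprintf() is itself vectorized, the sapply() loop can be avoided by splitting once into a matrix. A sketch under the assumption that every value has exactly three hyphen-separated numeric parts; parts and restyled are illustrative names:

```r
numbers <- c('11111-111-11', '11111-1111-1', '1111-1111-11')

# One row per value, one column per hyphen-separated part
parts <- do.call(rbind, strsplit(numbers, "-", fixed = TRUE))

# sprintf() recycles over the three integer columns
restyled <- sprintf("%05d-%04d-%02d",
                    as.integer(parts[, 1]),
                    as.integer(parts[, 2]),
                    as.integer(parts[, 3]))
restyled
```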
Are you working in a Unix-like environment? It might be easier to use sed at the command line rather than R's regex functions.
echo "54324-965-23" | sed 's/\(.....\)-\(...\)-\(..\)/\1-0\2-\3/'
will spit back
54324-0965-23
If you want to apply it to the entire file it would look something like
cat file1.txt | sed 's/\(.....\)-\(...\)-\(..\)/\1-0\2-\3/' > file2.txt
And if you have multiple text-changing operations you can pipe them all together
cat file1.txt | sed 's/\(.....\)-\(...\)-\(..\)/\1-0\2-\3/' | sed '2ndthing' | sed 'thirdthing' > file2.txt
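Another approach is to first remove the hyphens, then add them back at fixed character positions. Note that the example below normalizes every value to 4-4-2 and inserts no zero, so it does not produce the requested 5-4-2 form; once the hyphens are removed, the position of the missing digit is lost: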
> v <- c("01234-567-89","01234-5678-9","0123-4567-89")
> v
[1] "01234-567-89" "01234-5678-9" "0123-4567-89"
> #remove hyphens
> v <- gsub("-","",v)
> v
[1] "0123456789" "0123456789" "0123456789"
> #add hyphens
> paste(substr(v,1,4),substr(v,5,8),substr(v,9,10),sep="-")
[1] "0123-4567-89" "0123-4567-89" "0123-4567-89"

Resources