remove all words after character / including it - r

I have this list of elements, which I've named row:
row <-
[[798]]
[1] "SINE/tRNA-Deu"
[[799]]
[1] "Simple/repeat"
[[800]]
[1] "SINE/tRNA-Deu"
[[802]]
[1] "SINE/tRNA-gip"
[[803]]
[1] "Simple/repeat"
[[804]]
[1] "SINE/MIR"
[[805]]
[1] "SINE/tRNA-Deu"
[[806]]
[1] "Simple/repeat"
[[807]]
[1] "SINE/tRNA-Deu"
[[808]]
[1] "SINE/tRNA-Deu"
[[809]]
[1] "SINE/tRNA-Deu"
[[813]]
[1] "Low_complexity/alfa"
Is there a way to remove everything after the / in all the elements?
I've tried this:
row1 <- gsub("(/).*", "\\1", row)
but in the output the character "/" is not deleted. I don't want to include it in the element names (e.g. SINE, Simple, etc.):
[1] "SINE/" "Simple/" "SINE/" "SINE/" "Simple/"
[6] "SINE/" "SINE/" "Simple/" "SINE/" "SINE/"
[11] "SINE/" "Low_complexity/"
Where is the error in my code?

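A simple fix: don't use a capture group. With "(/).*" the replacement "\\1" puts the captured slash back, so match the slash without capturing it and replace the match with an empty string: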
row1 <- gsub("/.*", "", row)
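To see the difference on a small sample, here is a minimal sketch. It assumes row is a list of single strings, as the printed output in the question suggests; the three sample values are copied from it, and cleaned and with_slash are illustrative names only.

```r
# Small sample copied from the question's list
row <- list("SINE/tRNA-Deu", "Simple/repeat", "Low_complexity/alfa")

# Flatten the list to a character vector, then drop "/" and everything after it
cleaned <- sub("/.*", "", unlist(row))
print(cleaned)

# The original attempt captures the slash and writes it back via "\\1"
with_slash <- gsub("(/).*", "\\1", unlist(row))
print(with_slash)
```

Since each string needs at most one replacement, sub and gsub behave the same here.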

Related

Change the row names in R

I have two data frames with similar row names:
> rownames(abundance)[1:10]
[1] "X001.V2.fastq_mapped_to_agora.txt.uniq"
[2] "X001.V8.fastq_mapped_to_agora.txt.uniq"
[3] "X003.V17.fastq_mapped_to_agora.txt.uniq"
[4] "X003.V2.fastq_mapped_to_agora.txt.uniq"
[5] "X003.V8.fastq_mapped_to_agora.txt.uniq"
[6] "X004.V2.fastq_mapped_to_agora.txt.uniq"
[7] "X004.V8.fastq_mapped_to_agora.txt.uniq"
[8] "X005.V2.fastq_mapped_to_agora.txt.uniq"
[9] "X005.V8.fastq_mapped_to_agora.txt.uniq"
[10] "X006.V2.fastq_mapped_to_agora.txt.uniq"
> rownames(fluxes)[1:10]
[1] "001.V8" "003.V17" "003.V2" "003.V8" "004.V2" "004.V8" "005.V2"
[8] "005.V8" "006.V2" "006.V8"
But the row names of the data frame abundance are longer. How can I make each row name look like the row names of fluxes? It could be the part from "X" to the second ".".
We could use sub:
rownames(abundance) <- sub("X(.*)\\.fastq_mapped_to_agora\\.txt\\.uniq", "\\1", rownames(abundance))
Output:
[1] "001.V2" "001.V8" "003.V17" "003.V2" "003.V8" "004.V2" "004.V8" "005.V2" "005.V8" "006.V2"
We may use trimws (note that this drops everything from the first "." onward, so only the part before it is kept):
rownames(abundance) <- trimws(rownames(abundance), whitespace = "\\..*")
Or, to keep everything up to the second ".":
rownames(abundance) <- sub("^([^.]+\\.[^.]+)\\..*", "\\1", rownames(abundance))
Testing:
> trimws("X001.V2.fastq_mapped_to_agora.txt.uniq", whitespace = "\\..*")
[1] "X001"
> sub("^([^.]+\\.[^.]+)\\..*", "\\1", "X001.V2.fastq_mapped_to_agora.txt.uniq")
[1] "X001.V2"
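If the leading X should go as well, so that the names line up with rownames(fluxes), the two ideas can be combined. This is only a sketch on two of the sample names from the question; nm and short are illustrative names.

```r
nm <- c("X001.V2.fastq_mapped_to_agora.txt.uniq",
        "X003.V17.fastq_mapped_to_agora.txt.uniq")

# Skip an optional leading "X", keep the first two dot-separated fields,
# and discard the rest of the name
short <- sub("^X?([^.]+\\.[^.]+)\\..*$", "\\1", nm)
print(short)
```

Applied to the real data, the same call would take rownames(abundance) as its input.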

Print a date range in a loop with correctly formatted dates

Typing this into the console gives:
seq(as.Date('2020-04-02'), as.Date('2020-04-30'), by = 'day')
[1] "2020-04-02" "2020-04-03" "2020-04-04" "2020-04-05" "2020-04-06" "2020-04-07" "2020-04-08" "2020-04-09" "2020-04-10" "2020-04-11" "2020-04-12"
[12] "2020-04-13" "2020-04-14" "2020-04-15" "2020-04-16" "2020-04-17" "2020-04-18" "2020-04-19" "2020-04-20" "2020-04-21" "2020-04-22" "2020-04-23"
[23] "2020-04-24" "2020-04-25" "2020-04-26" "2020-04-27" "2020-04-28" "2020-04-29" "2020-04-30"
My loop:
for(i in seq(as.Date('2020-04-02'), as.Date('2020-04-30'), by = 'day')) {print(i)}
Gives:
[1] 18354
[1] 18355
[1] 18356
[1] 18357
[1] 18358
[1] 18359
[1] 18360
[1] 18361
[1] 18362
[1] 18363
[1] 18364
[1] 18365
[1] 18366
[1] 18367
[1] 18368
[1] 18369
[1] 18370
[1] 18371
[1] 18372
[1] 18373
[1] 18374
[1] 18375
[1] 18376
[1] 18377
[1] 18378
[1] 18379
[1] 18380
[1] 18381
[1] 18382
I expected actual dates.
I tried:
print(as.Date(i))
But this gives:
Error in as.Date.numeric(i) : 'origin' must be supplied
How can I print my date range via a loop?
Try:
for (i in as.list(seq(as.Date('2020-04-02'), as.Date('2020-04-30'), by = 'day'))) {
print(i)
}
I don't know why this is necessary, but if you run
for (i in Sys.Date()) {browser();print(i);}
# Called from: top level
# Browse[1]>
# debug at #1: print(i)
# Browse[1]> i
# [1] 18709
you'll see that i is being converted to numeric in the for (.) portion. The as.list helps preserve that class.
Another way is to supply the origin argument to as.Date:
for(i in seq(as.Date('2020-04-02'), as.Date('2020-04-30'), by = 'day')){
print(as.Date(i, origin="1970-01-01"))}
When R converts a date to a numeric, it returns the number of days since 1970-01-01. Other software uses different origins.
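The round trip between a Date and its underlying number can be checked directly. A small sketch, using the first date of the range above; d, n and back are illustrative names.

```r
d <- as.Date("2020-04-02")

# A Date is stored as the number of days since 1970-01-01
n <- as.numeric(d)
print(n)

# Supplying the origin turns the number back into the same Date
back <- as.Date(n, origin = "1970-01-01")
print(back)
```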

How to extract text from a column using R

How would I go about extracting, for each row (there are ~56,000 records in an Excel file) in a specific column, only part of a string? I need to keep all text to the left of the last '/' forward slash. The challenge is that not all cells have the same number of '/'. There is always a filename (*.wav) at the end of the last '/', but the number of characters in the filename is not always the same (sometimes 5 and sometimes 6).
Below are some examples of the strings in the cells:
cloch/51.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav
AB_AeolinaL/025-C#.wav
AB_AeolinaL/026-D.wav
AB_violadamourL/rel99999/091-G.wav
AB_violadamourL/rel99999/092-G#.wav
AB_violadamourR/024-C.wav
AB_violadamourR/025-C#.wav
The extracted text should be:
cloch
grand/Grand_bombarde/02-suchy_Grand_bombarde
grand/Grand_bombarde/02-suchy_Grand_bombarde
AB_AeolinaL
AB_AeolinaL
AB_violadamourL/rel99999
AB_violadamourL/rel99999
AB_violadamourR
AB_violadamourR
Can anyone recommend a strategy using R?
You can use the str_remove(string, pattern) function from the stringr package:
library(stringr)
str = "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav"
str_remove(str,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(str,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
Then you can just iterate over all other strings:
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
You can extract the part of each string before the last "/" with substr and regexpr:
substr(strings,1,regexpr("\\/[^\\/]*$", strings)-1)
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
Input
strings<-c("cloch/51.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav","AB_AeolinaL/025-C#.wav","AB_AeolinaL/026-D.wav","AB_violadamourL/rel99999/091-G.wav","AB_violadamourL/rel99999/092-G#.wav","AB_violadamourR/024-C.wav","AB_violadamourR/025-C#.wav")
Here, regexpr("\\/[^\\/]*$", strings) gives you the position of the last "/" in each string.
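A quick way to check what regexpr returns here is to try it on a single path. This is a sketch on one sample string from the question plus a made-up one without a slash; p, pos, dir_part and no_slash are illustrative names.

```r
p <- "AB_violadamourL/rel99999/091-G.wav"

# Position of the last "/" (the one followed only by non-slash characters)
pos <- as.integer(regexpr("/[^/]*$", p))
print(pos)

# Keep everything before that position
dir_part <- substr(p, 1, pos - 1)
print(dir_part)

# Without any "/", regexpr returns -1, so substr yields an empty string
no_slash <- "51.wav"
empty_part <- substr(no_slash, 1, regexpr("/[^/]*$", no_slash) - 1)
print(empty_part)
```

The forward slash has no special meaning in R regular expressions, so it is left unescaped in this sketch.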
Assuming that the strings you propose are in a column of a dataframe:
df <- data.frame(x = 1:5, y = c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav"))
# I define a function that separates a string at each "/"
# throws the last piece and reattaches the pieces
cut_str <- function(s) {
st <- head((unlist(strsplit(s, "\\/"))), -1)
r <- paste(st, collapse = "/")
return(r)
}
# through the sapply function I get the desired result
new_strings <- as.vector(sapply(df$y, FUN = cut_str))
new_strings
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
You could use
dirname(strings)
If there is no /, this returns ., which you could remove afterwards if you like, e.g.:
res <- dirname(strings)
res[res=="."] <- ""
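As a small illustration of how dirname treats these paths, the sketch below uses two of the sample strings plus a made-up one without a slash; basename is shown alongside it only for comparison. The names paths, dirs and files are illustrative.

```r
paths <- c("cloch/51.wav",
           "AB_violadamourL/rel99999/091-G.wav",
           "51.wav")

dirs  <- dirname(paths)   # everything before the last "/"
files <- basename(paths)  # the part after the last "/"

# A path without "/" gives ".", which is blanked out here
dirs[dirs == "."] <- ""
print(dirs)
print(files)
```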
You could start the match with / followed by one or more characters other than a forward slash or a whitespace character, using the negated character class [^\\s/]+
Then match .wav at the end of the string using $
Replace the match with an empty string, for example using sub.
/[^\\s/]+\\.wav$
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
# perl = TRUE so that \\s inside the brackets is read as the whitespace class
sub("/[^\\s/]+\\.wav$", "", strings, perl = TRUE)
Output
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"

How to display real dates in a loop in r

When I iterate over dates in a loop, R prints out the numeric coding of the dates.
For example:
dates <- as.Date(c("1939-06-10", "1932-02-22", "1980-03-13", "1987-03-17",
"1988-04-14", "1979-08-28", "1992-07-16", "1989-12-11"), tryFormats = c("%Y-%m-%d"))
for(d in dates){
print(d)
}
The output is as follows:
[1] -11163
[1] -13828
[1] 3724
[1] 6284
[1] 6678
[1] 3526
[1] 8232
[1] 7284
How do I get R to print out the actual dates?
So the output reads:
[1] "1939-06-10"
[1] "1932-02-22"
[1] "1980-03-13"
[1] "1987-03-17"
[1] "1988-04-14"
[1] "1979-08-28"
[1] "1992-07-16"
[1] "1989-12-11"
Thank you!
When you use a Date vector as the seq in a for loop in R, the loop variable loses its attributes, including the class.
You can use as.vector to strip attributes and see for yourself (or dput to see under the hood on the full object):
as.vector(dates)
# [1] -11163 -13828 3724 6284 6678 3526 8232 7284
dput(dates)
# structure(c(-11163, -13828, 3724, 6284, 6678, 3526, 8232, 7284), class = "Date")
In R, Date objects are just numeric vectors with class Date (class is an attribute).
Hence you're seeing numbers (FWIW, these numbers count days since 1970-01-01).
To restore the Date attribute, you can use the .Date function:
for (d in dates) print(.Date(d))
# [1] "1939-06-10"
# [1] "1932-02-22"
# [1] "1980-03-13"
# [1] "1987-03-17"
# [1] "1988-04-14"
# [1] "1979-08-28"
# [1] "1992-07-16"
# [1] "1989-12-11"
This is equivalent to as.Date(d, origin = '1970-01-01'), the numeric method for as.Date.
Funnily enough, *apply functions don't strip attributes:
invisible(lapply(dates, print))
# [1] "1939-06-10"
# [1] "1932-02-22"
# [1] "1980-03-13"
# [1] "1987-03-17"
# [1] "1988-04-14"
# [1] "1979-08-28"
# [1] "1992-07-16"
# [1] "1989-12-11"
There are multiple ways you can handle this:
Loop over the indices of dates:
for(d in seq_along(dates)){
print(dates[d])
}
#[1] "1939-06-10"
#[1] "1932-02-22"
#[1] "1980-03-13"
#[1] "1987-03-17"
#[1] "1988-04-14"
#[1] "1979-08-28"
#[1] "1992-07-16"
#[1] "1989-12-11"
Or convert the dates to a list and then print each one directly:
for(d in as.list(dates)) {
print(d)
}
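If the loop exists only to print the dates or to build labels from them, format() may be enough, since it works on the whole vector at once. A minimal sketch on three of the dates from the question; labels is an illustrative name.

```r
dates <- as.Date(c("1939-06-10", "1932-02-22", "1980-03-13"))

# format() is vectorised and returns character strings, so no loop is needed
labels <- format(dates, "%Y-%m-%d")
writeLines(labels)

# Inside a loop, indexing keeps the Date class, so format() still applies
for (i in seq_along(dates)) {
  message("Processing ", format(dates[i]))
}
```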

Replace value table with condition in R

I have this dataset:
> data1
[1] /index.php/search?
[2] /tabel/graphic1_.php?
[3] /mod/Layout/variableView2.php?
[4] /table/tblmon-frameee.php?
and a table:
> tes
[1] http://aladdine/index.php/search?
[2] http://aladdine/mod/params/returnParams.php
[3] http://aladdine/mod/Layout/variableView2.php
[4] http://aladdine/index.php/bos/index?
[5] http://aladdine/index.php/Bos
I want to replace each value in the tes table with the index of the dataset entry whose string matches it.
I have tried this code:
for(i in 1:length(dataset)){
p = data[i]
for(j in 1:length(tes)){
t = tes [j]
if(grepl(p, t)){
tes[j]=i
}
else tes[j] = "-"
}
}
My expected result looks like this:
> tes
[1] 1
[2] -
[3] 3
[4] -
[5] -
But I always get the warning message "invalid factor level, NA generated". Why?
Thanks in advance.
The following code does not do exactly what you need, but effectively it should give you the same information.
data1<-c('/index.php/search?',
'/tabel/graphic1_.php?',
'/mod/Layout/variableView2.php?',
'/table/tblmon-frameee.php?')
tes<-c('http://aladdine/index.php/search?',
'http://aladdine/mod/params/returnParams.php',
'http://aladdine/mod/Layout/variableView2.php',
'http://aladdine/index.php/bos/index?',
'http://aladdine/index.php/Bos')
> lapply(data1,FUN = function(x) which(grepl(x,tes)))
[[1]]
[1] 1
[[2]]
integer(0)
[[3]]
[1] 3
[[4]]
integer(0)
For example, the first output in [[1]] tells you which elements of "tes" match the first element of "data1", and so on.
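To get from there to the vector the question asks for, one option is to look up, for each element of tes, the first pattern it contains. The sketch below rests on two assumptions that are not stated in the question: that the trailing "?" in data1 is not meant to be part of the match, and that tes may be a factor, which is the usual cause of an "invalid factor level" warning when a new value is assigned into it. The names patterns, first_match and result are illustrative.

```r
data1 <- c("/index.php/search?",
           "/tabel/graphic1_.php?",
           "/mod/Layout/variableView2.php?",
           "/table/tblmon-frameee.php?")
tes <- c("http://aladdine/index.php/search?",
         "http://aladdine/mod/params/returnParams.php",
         "http://aladdine/mod/Layout/variableView2.php",
         "http://aladdine/index.php/bos/index?",
         "http://aladdine/index.php/Bos")

# Assumption: tes may have been read in as a factor; work on plain strings
tes <- as.character(tes)

# Assumption: the trailing "?" is not meant to be part of the match
patterns <- sub("\\?$", "", data1)

# For one URL, return the index of the first pattern it contains, or "-"
first_match <- function(url) {
  found <- vapply(patterns, function(p) grepl(p, url, fixed = TRUE), logical(1))
  hits <- unname(which(found))
  if (length(hits) > 0) as.character(hits[1]) else "-"
}

result <- vapply(tes, first_match, character(1), USE.NAMES = FALSE)
print(result)
```

fixed = TRUE is used so that "." and "?" in the paths are compared literally rather than as regex metacharacters.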
Probably not the fastest approach, as I use a for loop in this code, but I hope this provides a solution:
require(data.table)
data1<-c("/index.php/search?","/tabel/graphic1_.php?","/mod/Layout/variableView2.php?","/table/tblmon-frameee.php?")
tes<-c("http://aladdine/index.php/search?","http://aladdine/mod/params/returnParams.php" ,"http://aladdine/mod/Layout/variableView2.php","http://aladdine/index.php/bos/index?","http://aladdine/index.php/Bos")
d<-data.table(d=data1,t=tes)
d$id<-seq(1:nrow(d))
for (i in 1:nrow(d))
{
d$index[i]<-lapply(data1,FUN=function(x) {ifelse(length(grep(x,tes[i]))>0,d$id[i],"-")})[i]
}
