Splitting a sequence within dataframe? - r

I have a csv file like this,
x <- read.csv("C:/Users/XXXX/Documents/XXXX/Day1_15042014/work2.csv")
class(x)
x$Sequence.window
> x$Sequence.window
[1] VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN
[2] PGLTTQGTKFGRKIVKTLAYRVKSTQPSSGN
[3] EATEFYLRYYVGHKGKFGHEFLEFEFREDGK
[4] LVPVVWGERKTPEIEKKGFGASSKAATSLPS
[5] NMNELPEKKNSAGFIKLEDKQKLIVEMEKSV
[6] PTLHFNYRYFETDAPKDVPGAPRQWWFGGGT
[7] PDPTTAPMEAAKQPKKKRSRSKKCKSVNNLD
[8] PAKAAKTAKVTSPAKKAVAATKKVATVATKK
The class of x is a data frame. I would now like to extract positions 10 to 22 of each sequence window (e.g. for [1] VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN, the output should be [1] DTLEFHKFYKNFS, and likewise for all the sequences). How would I do this within a data frame?

You can use the substr function:
#dummy data
x <- read.table(text="Sequence.window
VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN
PGLTTQGTKFGRKIVKTLAYRVKSTQPSSGN
EATEFYLRYYVGHKGKFGHEFLEFEFREDGK",header=TRUE,as.is=TRUE)
#substr from 10 to 22
substr(x$Sequence.window,start=10,stop=22)
#[1] "DTLEFHKFYKNFS" "FGRKIVKTLAYRV" "YVGHKGKFGHEFL"
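To keep the result inside the data frame, the substr output can be assigned to a new column. This is a minimal sketch using the dummy data above; the column name window is an assumption chosen for illustration, and the as.character() call is only a guard in case the column was read in as a factor.

```r
# Same dummy data as in the answer above
x <- read.table(text = "Sequence.window
VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN
PGLTTQGTKFGRKIVKTLAYRVKSTQPSSGN
EATEFYLRYYVGHKGKFGHEFLEFEFREDGK", header = TRUE, as.is = TRUE)

# Store characters 10 to 22 of every sequence in a new column
# ("window" is a hypothetical column name)
x$window <- substr(as.character(x$Sequence.window), start = 10, stop = 22)
x$window
```

Because substr is vectorized, no loop over the rows is needed.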

Related

how to sort list.files() in correct date order?

Using normal list.files() in the working directory return the file list but the numeric order is messed up.
f <- list.files(pattern="*.nc")
f
# [1] "te1971-1.nc" "te1971-10.nc" "te1971-11.nc" "te1971-12.nc"
# [5] "te1971-2.nc" "te1971-3.nc" "te1971-4.nc" "te1971-5.nc"
# [9] "te1971-6.nc" "te1971-7.nc" "te1971-8.nc" "te1971-9.nc"
where the number after "-" describes the month number.
I tried the following to sort it (mixedsort is from the gtools package):
myFiles <- paste("te", i, "-", c(1:12), ".nc", sep = "")
mixedsort(myFiles)
It returns the files in order, but reversed:
[1] "te1971-12.nc" "te1971-11.nc" "te1971-10.nc" "te1971-9.nc"
[5] "te1971-8.nc" "te1971-7.nc" "te1971-6.nc" "te1971-5.nc"
[9] "te1971-4.nc" "te1971-3.nc" "te1971-2.nc" "te1971-1.nc"
How do I fix this?
The issue is that the values get sorted alphabetically.
You could use gsub to capture the year and the month as groups, append "-1" as the first day of the month, coerce the result with as.Date, and order by that.
x[order(as.Date(gsub('.*(\\d{4})-(\\d{,2}).*', '\\1-\\2-1', x)))]
# [1] "te1971-1.nc" "te1971-2.nc" "te1971-3.nc" "te1971-4.nc" "te1971-5.nc"
# [6] "te1971-6.nc" "te1971-7.nc" "te1971-8.nc" "te1971-9.nc" "te1971-10.nc"
# [11] "te1971-11.nc" "te1971-12.nc"
Data:
x <- c("te1971-1.nc", "te1971-10.nc", "te1971-11.nc", "te1971-12.nc",
"te1971-2.nc", "te1971-3.nc", "te1971-4.nc", "te1971-5.nc", "te1971-6.nc",
"te1971-7.nc", "te1971-8.nc", "te1971-9.nc")
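As an alternative sketch that avoids dates altogether, the year and month can be pulled out as integers and passed to order. It assumes every file name follows the te<year>-<month>.nc pattern shown in the question; the variable names year, month and sorted are illustrative only.

```r
x <- c("te1971-1.nc", "te1971-10.nc", "te1971-11.nc", "te1971-12.nc",
       "te1971-2.nc", "te1971-3.nc", "te1971-4.nc", "te1971-5.nc",
       "te1971-6.nc", "te1971-7.nc", "te1971-8.nc", "te1971-9.nc")

# Capture the four-digit year and the month number from each name
pattern <- "^te(\\d{4})-(\\d+)\\.nc$"
year  <- as.integer(sub(pattern, "\\1", x))
month <- as.integer(sub(pattern, "\\2", x))

# Order by year first, then by month within each year
sorted <- x[order(year, month)]
sorted
```

Ordering on integers rather than strings is what puts month 2 before month 10.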

How do I sort a vector with names containing many strings of numbers in R?

I have a list of names which I would like to sort by the R value in ascending order.
[1] "W2345_S-001-R2-20D.datavalue.csv" "W2346_S-001-R4-20D.datavalue.csv"
[3] "W2347_S-001-R1-20D.datavalue.csv" "W2348_S-001-R3-20D.datavalue.csv"
[5] "W2349_S-001-R5-20D.datavalue.csv"
However, mixedsort only gives the above (sorting by W values) but I would like to arrange them by R1, R2, R3, R4, R5, ignoring the other numbers contained in the names.
Hence the output should be
[1] "W2347_S-001-R1-20D.datavalue.csv" "W2345_S-001-R2-20D.datavalue.csv"
[3] "W2348_S-001-R3-20D.datavalue.csv" "W2346_S-001-R4-20D.datavalue.csv"
[5] "W2349_S-001-R5-20D.datavalue.csv"
list_of_names <- c("W2345_S-001-R2-20D_790.datavalue.csv",
"W2346_S-001-R4-20D_792.datavalue.csv",
"W2347_S-001-R1-20D_789.datavalue.csv",
"W2348_S-001-R3-20D_791.datavalue.csv",
"W2349_S-001-R5-20D_793.datavalue.csv")
library(stringr)
names_order <- order(as.numeric(str_match(list_of_names, "-R\\s*(.*?)\\s*-")[,2]))
list_of_names[names_order]
[1] "W2347_S-001-R1-20D_789.datavalue.csv" "W2345_S-001-R2-20D_790.datavalue.csv"
[3] "W2348_S-001-R3-20D_791.datavalue.csv" "W2346_S-001-R4-20D_792.datavalue.csv"
[5] "W2349_S-001-R5-20D_793.datavalue.csv"
# Your data
vals<-c( "W2345_S-001-R2-20D_790.datavalue.csv",
"W2346_S-001-R4-20D_792.datavalue.csv",
"W2347_S-001-R1-20D_789.datavalue.csv",
"W2348_S-001-R3-20D_791.datavalue.csv",
"W2349_S-001-R5-20D_793.datavalue.csv")
library(stringr)
# \\d+ (rather than \\d{1}) also handles multi-digit values such as R10
vals_df <- data.frame(vals,
  position = str_extract(vals, "(?<=R)\\d+") |>
    as.numeric())
vals_df[order(vals_df$position),]$vals
[1] "W2347_S-001-R1-20D_789.datavalue.csv" "W2345_S-001-R2-20D_790.datavalue.csv"
[3] "W2348_S-001-R3-20D_791.datavalue.csv" "W2346_S-001-R4-20D_792.datavalue.csv"
[5] "W2349_S-001-R5-20D_793.datavalue.csv"
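The same idea can be sketched in base R without stringr: extract the digits that follow "-R", convert them to integers, and order by them. This assumes each name contains exactly one "-R<digits>-" segment, as in the sample data; r_value and sorted_vals are illustrative names.

```r
vals <- c("W2345_S-001-R2-20D_790.datavalue.csv",
          "W2346_S-001-R4-20D_792.datavalue.csv",
          "W2347_S-001-R1-20D_789.datavalue.csv",
          "W2348_S-001-R3-20D_791.datavalue.csv",
          "W2349_S-001-R5-20D_793.datavalue.csv")

# Keep only the digits between "-R" and the next "-"
r_value <- as.integer(sub(".*-R(\\d+)-.*", "\\1", vals))

# Reorder the names by that number
sorted_vals <- vals[order(r_value)]
sorted_vals
```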

Change the row names in R

I have two data frames with similar row names:
> rownames(abundance)[1:10]
[1] "X001.V2.fastq_mapped_to_agora.txt.uniq"
[2] "X001.V8.fastq_mapped_to_agora.txt.uniq"
[3] "X003.V17.fastq_mapped_to_agora.txt.uniq"
[4] "X003.V2.fastq_mapped_to_agora.txt.uniq"
[5] "X003.V8.fastq_mapped_to_agora.txt.uniq"
[6] "X004.V2.fastq_mapped_to_agora.txt.uniq"
[7] "X004.V8.fastq_mapped_to_agora.txt.uniq"
[8] "X005.V2.fastq_mapped_to_agora.txt.uniq"
[9] "X005.V8.fastq_mapped_to_agora.txt.uniq"
[10] "X006.V2.fastq_mapped_to_agora.txt.uniq"
> rownames(fluxes)[1:10]
[1] "001.V8" "003.V17" "003.V2" "003.V8" "004.V2" "004.V8" "005.V2"
[8] "005.V8" "006.V2" "006.V8"
But the row names of the data frame abundance are longer. How can I make them look like the row names of fluxes? For example, by keeping the part after the "X" and up to the second ".".
We could use sub:
rownames(abundance) <- sub("X(.*)\\.fastq_mapped_to_agora\\.txt\\.uniq", "\\1", rownames(abundance))
Output:
[1] "001.V2" "001.V8" "003.V17" "003.V2" "003.V8" "004.V2" "004.V8" "005.V2" "005.V8" "006.V2"
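We may use trimws (note that this pattern strips everything from the first "." onward):
rownames(abundance) <- trimws(rownames(abundance), whitespace = "\\..*")
Or, to keep everything up to the second "." (the leading "X" is kept):
rownames(abundance) <- sub("^([^.]+\\.[^.]+)\\..*", "\\1", rownames(abundance))
Testing: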
> trimws("X001.V2.fastq_mapped_to_agora.txt.uniq", whitespace = "\\..*")
[1] "X001"
> sub("^([^.]+\\.[^.]+)\\..*", "\\1", "X001.V2.fastq_mapped_to_agora.txt.uniq")
[1] "X001.V2"
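To match the fluxes row names exactly, the leading "X" also has to go. A minimal sketch that combines both steps in one sub call is shown here on a few of the names from the question; rn and short are illustrative variable names, and it assumes every name starts with "X" and contains at least two dots.

```r
rn <- c("X001.V2.fastq_mapped_to_agora.txt.uniq",
        "X001.V8.fastq_mapped_to_agora.txt.uniq",
        "X003.V17.fastq_mapped_to_agora.txt.uniq")

# Drop the leading "X" and keep everything up to (not including) the second "."
short <- sub("^X([^.]+\\.[^.]+)\\..*$", "\\1", rn)
short
```

If the shortened names are unique, they can then be assigned with rownames(abundance) <- short.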

How to extract text from a column using R

How would I go about extracting only part of a string from each row of a specific column (there are ~56,000 records in an Excel file)? I need to keep all text to the left of the last forward slash ('/'). The challenge is that not all cells have the same number of '/'. There is always a filename (*.wav) after the last '/', but the number of characters in the filename is not always the same (sometimes 5 and sometimes 6).
Below are some examples of the strings in the cells:
cloch/51.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav
AB_AeolinaL/025-C#.wav
AB_AeolinaL/026-D.wav
AB_violadamourL/rel99999/091-G.wav
AB_violadamourL/rel99999/092-G#.wav
AB_violadamourR/024-C.wav
AB_violadamourR/025-C#.wav
The extracted text should be:
cloch
grand/Grand_bombarde/02-suchy_Grand_bombarde
grand/Grand_bombarde/02-suchy_Grand_bombarde
AB_AeolinaL
AB_AeolinaL
AB_violadamourL/rel99999
AB_violadamourL/rel99999
AB_violadamourR
AB_violadamourR
Can anyone recommend a strategy using R?
You can use the str_remove(string, pattern) function from the stringr package:
library(stringr)
s <- "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav"
str_remove(s, "/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(s, "/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
str_remove is vectorized, so you can apply it to all the strings at once:
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
You can extract the substring up to the last "/" with substr and regexpr:
substr(strings,1,regexpr("\\/[^\\/]*$", strings)-1)
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
Input
strings<-c("cloch/51.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav","AB_AeolinaL/025-C#.wav","AB_AeolinaL/026-D.wav","AB_violadamourL/rel99999/091-G.wav","AB_violadamourL/rel99999/092-G#.wav","AB_violadamourR/024-C.wav","AB_violadamourR/025-C#.wav")
Here, regexpr("\\/[^\\/]*$", strings) gives the position of the last "/" in each string.
Assuming that the strings you propose are in a column of a dataframe:
df <- data.frame(x = 1:5, y = c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav"))
# Define a function that splits a string at each "/",
# drops the last piece, and pastes the remaining pieces back together
cut_str <- function(s) {
  st <- head(unlist(strsplit(s, "/")), -1)
  r <- paste(st, collapse = "/")
  return(r)
}
# Apply the function to every element of the column with sapply
new_strings <- as.vector(sapply(df$y, FUN = cut_str))
new_strings
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
You could use
dirname(strings)
If a string contains no "/", dirname returns ".", which you could replace afterwards if you like, e.g.:
res <- dirname(strings)
res[res=="."] <- ""
You could match a / followed by one or more characters other than a forward slash or a whitespace character, using the negated character class [^\\s/]+
Then match .wav at the end of the string using $
Replace the match with an empty string, for example using sub.
/[^\\s/]+\\.wav$
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
# perl = TRUE ensures \\s is interpreted as whitespace inside the character class
sub("/[^\\s/]+\\.wav$", "", strings, perl = TRUE)
Output
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
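If the file extension does not matter, a shorter pattern that removes the last "/" and everything after it may be enough. The sketch below uses three of the sample strings and compares the result with dirname; folders is an illustrative variable name.

```r
strings <- c("cloch/51.wav",
             "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
             "AB_violadamourL/rel99999/092-G#.wav")

# Remove the final "/" together with whatever follows it
folders <- sub("/[^/]*$", "", strings)
folders

# For these inputs the result agrees with dirname()
identical(folders, dirname(strings))
```

Unlike dirname, this leaves a string without any "/" unchanged instead of returning ".".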

How can I use two lists to create a Table (Columns and Rows)

I want to write a script in R that allows me to import MSG files and store the information in a table. The fields may vary by course, so the column names are defined based on the first MSG file being imported.
The import and extraction are already working (special thanks to the user "January")
What does not work is filling in the table, which consists of two steps: adding the column names and filling in the rows.
I've tried using unlist to prepare the contents of the lists so that I can add them as columns and rows to a table.
Anmeldung <- gsub("^\\s+", "", Anmeldung) # remove spaces at the beginning and end
Anmeldung <- gsub("\\s+$", "", Anmeldung)
words <- strsplit(Anmeldung, " *[\n\r]+ *")[[1]]
fields <- as.list(words[seq(1, length(words), 2)])
information <- as.list(words[seq(2, length(words), 2)])
resTab1 = data.frame(t(unlist(fields)))
resTab2 = data.frame(t(unlist(information)))
colnames(resTab2) = c(resTab1)
variable.names(resTab2)
When I try to create the table, this error appears:
colnames(resTab2) = c(resTab1)
Error in names(x) <- value :
'names' attribute [22] must be the same length as the vector [21]
This is what the lists fields and information look like:
Fields
> fields
[[1]]
[1] "Anrede"
[[2]]
[1] "Vorname"
[[3]]
[1] "Name"
[[4]]
[1] "Email (für Kontaktaufnahme)"
[[5]]
[1] "Telefon/Mobile (geschäftlich)"
[[6]]
[1] "Telefon/Mobile (privat)"
[[7]]
[1] "Strasse/Nr."
Information:
> information
[[1]]
[1] "Herr"
[[2]]
[1] "James"
[[3]]
[1] "Bond"
[[4]]
[1] "james.bond#email.com"
[[5]]
[1] "007 000 77 07"
[[6]]
[1] "007 000 77 07"
[[7]]
[1] "Lampenstrasse 8"
The error says you are trying to assign 22 names (from resTab1) to resTab2, which has only 21 columns. The two must have the same length.
For example:
x <- c(1,2)
y <- c("a","b","c")
names(x) <- y
#Error in names(x) <- y :
#'names' attribute [3] must be the same length as the vector [2]
EDIT:
Use unlist to flatten both lists, then use the fields as names for the information:
information <- unlist(information)
fields <- unlist(fields)
names(information) <- fields
information
#OUTPUT
#Anrede 'Herr'
#Vorname 'James'
#Name 'Bond'
#Email (für Kontaktaufnahme) 'james.bond#email.com'
#Telefon/Mobile (geschäftlich) '007 000 77 07'
#Telefon/Mobile (privat) '007 000 77 07'
#Strasse/Nr. 'Lampenstrasse 8'
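To end up with an actual table rather than a named vector, the two vectors can be turned into a one-row data frame once their lengths have been checked. This is a minimal sketch using the first three fields from the question; resTab is an illustrative name, and it assumes fields and information have already been flattened with unlist.

```r
fields <- c("Anrede", "Vorname", "Name")
information <- c("Herr", "James", "Bond")

# Fail early if a field has no matching value (the cause of the original error)
stopifnot(length(fields) == length(information))

# t() turns the vector into a 1-row matrix, which becomes a 1-row data frame
resTab <- as.data.frame(t(information), stringsAsFactors = FALSE)

# Use the field names as column names
names(resTab) <- fields
resTab
```

Rows from further messages with the same fields could then be appended with rbind.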
