How to remove characters from filenames in R?

Due to an unrelated software error, I have many files with doubled filenames, e.g.
c("info10_file1.info10_file1.xy", "info11_file1.info11_file1.xy")
I want to remove this repetition. The files should be:
c("info10_file1.xy", "info11_file1.xy")
I have tried using sapply with file.rename, but that requires a pattern, which means only files matching info10 will be changed.
So running this code:
sapply(files_list, FUN = function(eachPath) {
  file.rename(from = eachPath, to = sub(pattern = 'info10_file1.', replacement = '', eachPath))
})
Will result in:
"info10_file1.xy", "info11_file1.info11_file1.xy"
An improvement is to set pattern='file1.info', which means all files will be processed, but the 10 or 11 in info10 or info11 will still be repeated, producing this:
"info10_10_file1.xy", "info11_11_file1.xy"
Is there a way to simply delete an arbitrary number of characters? In this example it would be 13.
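One way to avoid hard-coding either the pattern or a character count is a backreference: match any dot-free prefix followed by a literal dot and the same prefix again, and replace the whole match with a single copy of the prefix. A minimal sketch (assuming files_list contains bare file names with no directory part, and that the duplicated prefix never itself contains a dot):

files_list <- c("info10_file1.info10_file1.xy", "info11_file1.info11_file1.xy")
sapply(files_list, function(eachPath) {
  # "^([^.]+)\\.\\1" captures everything up to the first dot, then requires
  # the same text to repeat; replacing with "\\1" keeps just one copy
  file.rename(from = eachPath, to = sub("^([^.]+)\\.\\1", "\\1", eachPath))
})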

R - find and replace within a script, iteratively [duplicate]

I have a somewhat complex script that is working well. It imports multiple .csvs, combines them, adjusts them, re-sorts them and writes them out as multiple new .csvs. All good.
The problem is that I need to run this script on each of 2100 files. Each .csv file has a name incorporating a seven- or eight-digit string (really a name, not a number) along with other specific identifiers. There are numerous files with the same string, and the script works on all of them at once. An example of the naming system:
gfdlesm2g_45Fall_17100202.csv
ccsm4_45Fall_10270102.csv
bnuesm_45Fall_5130205.csv
mirocesmchem_45Fall_5010007.csv
The script begins with:
fnames <- dir("~/Desktop/modified_files/", pattern = "*_45Fall_1030001.csv")
And I need to replace the "1030001", in this case, with the next number. Right now I am using Find and Replace in RStudio to swap in the next seven- or eight-digit number each time the script completes. I know there has to be a better way than doing this manually for 2100 files.
All the research I've found is about iterating within a data frame, over columns or rows, and I can't work out how to make that fit my needs.
I am thinking that if I made a vector of all the numbers (really they're names), like "01080204", "01090003", "01100001", "18020116", "18020125", "15080303", "16020301", "03170006", "04010101", "04010201", etc
There must be a way to say, in code, "now pick the next name, and run the script". I looked at the lapply, mapply, sapply family and couldn't seem to figure it out.
If you are looking for the pattern _45Fall_ in the file names, you can use list.files to pick up every such file at once. Note that the pattern argument is a regular expression, not a glob, so the leading * is not needed:
fnames <- list.files("~/Desktop/modified_files/", pattern = "_45Fall_\\d+\\.csv$")
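From there, one way to "pick the next name and run the script" is to pull the identifiers out of the matched file names and loop over them. A sketch, under the assumption that the per-identifier work is wrapped in a function; run_script() here is a hypothetical stand-in for the existing script body:

fnames <- list.files("~/Desktop/modified_files/", pattern = "_45Fall_\\d+\\.csv$")
ids <- unique(sub(".*_45Fall_(\\d+)\\.csv$", "\\1", fnames))
for (id in ids) {
  # all files sharing this identifier, which the script processes together
  batch <- list.files("~/Desktop/modified_files/",
                      pattern = paste0("_45Fall_", id, "\\.csv$"),
                      full.names = TRUE)
  run_script(batch)  # run_script() is a hypothetical wrapper around the existing script
}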

R list.files: some regexes only return a single file

I'm puzzled by the behaviour of regexes in the list.files command. I have a folder with ~500 files, most names start with "new_" and end with ".txt". There are some other files on the same folder, e.g. README, _cabs.txt.
I'd like to get only the new_*.txt files. I've tried different ways to call list.files, with different results. Here they are:
#1 This returns ALL files including README and others
list.files(path="correctpath/")
#2 This returns ALL files including _cabs.txt, which I do not want.
list.files(path="correctpath/",pattern="txt")
#3 This returns ALL files I want, but...
list.files(path="correctpath/",pattern="new_")
#4 This returns just one of the new_*.txt files.
list.files(path="correctpath/",pattern="new*\\.txt")
#5 This returns an empty list.
list.files(path="correctpath/",pattern="new_*\\.txt")
So I have one solution that works, but I would like to understand what's going on with approaches 4 and 5.
list.files(path = "correctpath/", pattern = "new_.*\\.txt")
* means "zero or more of the preceding element". If you want to match any character zero or more times, you need to put a period before it, .*, because in a regex a period matches any single character (except newline). The pattern "new_.*\\.txt" should work. That also explains the puzzling cases: #4, "new*\\.txt", matches ne followed by zero or more w followed immediately by .txt, which happened to fit exactly one file name; #5, "new_*\\.txt", matches new, then zero or more underscores, then .txt immediately, which fits none of them.
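A quick way to see the difference on a toy set of names (the file names below are made up for illustration):

files <- c("README", "_cabs.txt", "new_data1.txt", "new_data2.txt", "line.txt")
grep("new_.*\\.txt", files, value = TRUE)  # "new_data1.txt" "new_data2.txt"
grep("new*\\.txt", files, value = TRUE)    # "line.txt" (ne + zero w + .txt)
grep("new_*\\.txt", files, value = TRUE)   # character(0)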

readcsv fails to read # character in Julia

I've been using asd=readcsv(filename) to read a csv file in Julia.
The first row of the csv file contains strings which describe the column contents; the rest of the data is a mix of integers and floats. readcsv reads the numbers just fine, but only reads the first four and a half string entries.
After that, it returns "". If I ask the REPL to display asd[1,:], it tells me it is a 1x65 Array{Any,2}.
The fifth column in the first row of the csv file (this seems to be the entry it chokes on) is APP #1 bias voltage [V]; but asd[1,5] is just APP . So it looks to me as though readcsv has choked on the "#" character.
I tried using the quotes=false keyword in readcsv, but it didn't help.
I used to use xlsread in Matlab and it worked fine.
Has anybody out there seen this sort of thing before?
The comment character in Julia is #, and this also applies when reading data in from delimited text files.
But luckily, the readcsv() and readdlm() functions have an optional argument to help in these situations.
You should try readcsv(filename; comment_char = '/').
Of course, the example above assumes that you don't have any / characters in your first line. If you do, you'll have to change that '/' to some character that never appears in the data. (Depending on your Julia version, readdlm and readcsv may also accept a comments keyword that disables comment handling entirely.)

R: Extract value and lines after key word (text file mining)

Setting:
I have (simple) .csv and .dat files created by laboratory devices and other programs, storing information on measurements or calculations. I have found solutions for other languages, but not for R.
Problem:
Using R, I am trying to extract values so I can quickly display results without opening the created files. There are two typical settings:
a) I need to read a priori unknown values after known key words
b) I need to read lines after known key words or lines
I can't make functions such as scan() and grep() work.
c) Finally, I would like to loop over dozens of files in a folder and produce a summary (to make the picture complete: I will manage this part myself).
I would appreciate any form of help.
OK, it works for the key value (although it is perhaps not very elegant):
ks <- scan("file.csv", what = character(), sep = "")
returns a character vector of everything in the file.
ks[grep("keyword", ks) + 2] # + 2 because the actual value is stored two places ahead
returns the sought values as character strings.
as.numeric(gsub(",", ".", ks))
For completeness: the data had to be converted to numbers, and the decimal-comma vs. decimal-point problem needed to be solved.
In one line:
data <- as.numeric(gsub(",", ".", ks[grep("Ks_Boden", ks) + 2]))
Perseverance is not too bad of an asset ;-)
The rest isn't finished yet; I will post it once it is.
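For setting (b), reading whole lines after a known keyword, the same idea works line-wise with readLines and grep. A minimal sketch (the file name, keyword, and number of lines to keep are placeholders):

lines <- readLines("file.dat")          # "file.dat" is a placeholder name
hit <- grep("keyword", lines)[1]        # first line containing the keyword
block <- lines[(hit + 1):(hit + 3)]     # the three lines following it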

Read lines by number from a large file

I have a file with 15 million lines (it will not fit in memory). I also have a small vector of line numbers, the lines that I want to extract.
How can I read out those lines in one pass?
I was hoping for a C function that does it in one pass.
The trick is to use a connection and open it before calling read.table; each read then continues from where the previous one stopped:
con <- file('filename')
open(con)
read.table(con, skip = 5, nrow = 1)  # 6th line
read.table(con, skip = 20, nrow = 1) # 27th line
...
close(con)
You may also try scan; it is faster and gives more control.
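Putting that together for a whole vector of target line numbers (a sketch; linenums is assumed to be sorted and 1-based, and each target line must parse as one table row):

linenums <- c(6, 27, 100)   # example target lines, sorted
con <- file("filename", open = "r")
prev <- 0
rows <- lapply(linenums, function(n) {
  # skip the lines between the previous target and this one, then read one row
  row <- read.table(con, skip = n - prev - 1, nrow = 1)
  prev <<- n
  row
})
close(con)
result <- do.call(rbind, rows)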
If it's a binary file
Some discussion is here:
Reading in only part of a Stata .DTA file in R
If it's a CSV or other text file
If they are contiguous and at the top of the file, just use the nrows argument of read.csv or any of the read.table family. If not, you can combine the nrows and skip arguments to call read.csv repeatedly (reading in a new row or group of contiguous rows with each call) and then rbind the results together.
If your file has fixed line lengths then you can use 'seek' to jump to any character position. So just jump to N * line_length for each N you want, and read one line.
However, from the R docs:
Use of seek on Windows is discouraged. We have found so many
errors in the Windows implementation of file positioning that
users are advised to use it only at their own risk, and asked not
to waste the R developers' time with bug reports on Windows'
deficiencies.
You can also use fseek from the C standard library in C, but I don't know whether the above warning also applies!
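A sketch of the fixed-length approach in R (this assumes every line is exactly line_length bytes including the newline, which is what makes the arithmetic valid; the file name and sizes are placeholders):

line_length <- 80                      # bytes per line, newline included (assumed)
wanted <- c(10, 5000, 1400000)         # example line numbers
con <- file("bigfile.txt", open = "rb")
lines <- vapply(wanted, function(n) {
  seek(con, where = (n - 1) * line_length)            # jump straight to the line start
  sub("\n$", "", readChar(con, nchars = line_length)) # read the record, drop the newline
}, character(1))
close(con)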
Before I was able to get an R solution, I did it in Ruby:
#!/usr/bin/env ruby
NUM_SEQS = 14024829
linenumbers = (1..10).collect{(rand * NUM_SEQS).to_i}
File.open("./data/uniprot_2011_02.tab") do |f|
  while line = f.gets
    print line if linenumbers.include? f.lineno
  end
end
It runs fast (as fast as my storage can read the file).
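For completeness, a comparable one-pass filter in R, sketched with readLines in chunks so that at most chunk_size lines are ever held in memory (the file name and sizes are placeholders):

wanted <- sort(c(17, 94, 1234567))   # target line numbers, sorted
con <- file("./data/uniprot_2011_02.tab", open = "r")
chunk_size <- 100000
seen <- 0
hits <- character(0)
repeat {
  chunk <- readLines(con, n = chunk_size)
  if (length(chunk) == 0) break
  # targets that fall inside this chunk, re-indexed relative to it
  idx <- wanted[wanted > seen & wanted <= seen + length(chunk)] - seen
  hits <- c(hits, chunk[idx])
  seen <- seen + length(chunk)
}
close(con)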
I put together a solution based on the discussion here:
scan(filename, what = list(NULL), sep = '\n', blank.lines.skip = FALSE)
This reads the file without storing any of it, so it only tells you the number of lines. If you really want to skip the blank lines, set the last argument to TRUE.
