I am looking for a way to append a string or a number (at least 3 digits) to a save file's name.
For instance, in Python I can use something like '%s%s' % (prefix, number) together with a random generator and use the result as the csv file name.
How do I generate a random number (or string) and work it into a file name format in R?
This is my save file, and I want to add a random string or number at the end of the file name:
file = paste(path, 'group1N[ADD FORMAT HERE].csv',sep = '')
file = paste(path, 'group1N.csv',sep = '') to become -->
file = paste(path, 'group1N212.csv',sep = '') or file = paste(path, 'group1Nkut.csv',sep = '')
after using a random generator of strings or numbers and appending the result to the .csv file name each time it is saved, as a randomly generated suffix.
You could use the built-in tempfile() function:
tempfile(pattern="group1N", tmpdir=".", fileext=".csv")
[1] "./group1N189d494eaaf2ea.csv"
(if you don't specify tmpdir the results go to a session-specific temporary directory).
This won't write over existing files; given that there are 14 hex digits in the random component, I think the "very likely to be unique" in the description is an understatement ... (i.e. at a rough guess the probability of collision might be something like 16^(-14) ...)
The names are very likely to be unique among calls to ‘tempfile’
in an R session and across simultaneous R sessions (unless
‘tmpdir’ is specified). The filenames are guaranteed not to be
currently in use.
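If you prefer to control the suffix yourself, here is a minimal sketch that draws either a random 3-digit number or a random 3-letter string and pastes it into the file name (the path and the group1N prefix are just the placeholders from the question):
path <- "~/data/"  # placeholder
# random 3-digit number, zero-padded, e.g. "group1N042.csv"
num_suffix <- sprintf("%03d", sample(0:999, 1))
file_num <- paste0(path, "group1N", num_suffix, ".csv")
# random 3-letter string, e.g. "group1Nkut.csv"
chr_suffix <- paste(sample(letters, 3, replace = TRUE), collapse = "")
file_chr <- paste0(path, "group1N", chr_suffix, ".csv")
Unlike tempfile(), this gives no guarantee against clashing with an existing file, so check with file.exists() before writing if that matters.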
I have a *.DAT file which can be opened with a text editor. I want to extract some contents from it and convert it to *.csv. The converted csv file must have a header (column names), the specification limits (lower and upper) and the data portion. I need to convert hundreds of these files to *.csv (either as separate csv files or all combined into one big csv file).
A sample snippet of my *.DAT file looks like this:
[FILEINFO]
VERSION=V4.0
FILENAME=TEST.DAT
CREATIONTIME=2015-07-09 22:05:26
[LOTINFO]
LotNo=A6022142
DUT=BCEK450049
PRODUCTNAME=EX061
Order=
ChipCode=
SACH_NO=B39000-
MAT_NO=B39000-P810
[SPEC1]
TXT=SEN1
Unit=
LSL=-411.400000
USL=-318.700000
[SPEC2]
TXT=SEN2
Unit=
LSL=-11.000000
USL=11.000000
[SPEC3]
TXT=SEN3
Unit=
LSL=-45.000000
USL=10.000000
[DATA]
2,29,-411.232,10.193,-11.530,
3,29,-411.257,10.205,-11.328,
I can extract the contents below [DATA] and save them in a csv file, but I am not sure how to extract the contents above it to create the header, etc. I used the code below to extract the contents below [DATA]:
library(stringr)
library(readr)
my_txt <- read_file("EXAMPLE.DAT")
ExtData <- read.csv(text =
  sub(".*\\[DATA\\]\\s+", "", my_txt), header = FALSE)
write.csv(ExtData, "dat_2_csv.csv", row.names = FALSE)
To extract the contents above [DATA] I tried the code below, with no success:
con <- file("EXAMPLE.DAT", "r")
OneLine <- c()
while (TRUE) {
  line <- readLines(con, 1)
  if (length(line) == 0) {
    break
  } else if (grepl("^LSL=", line)) {
    RES <- str_split(line, "=", simplify = TRUE)
    OneLine <- c(OneLine, RES[1, 2])
  }
}
close(con)
The expected output is a csv file with the spec information (TXT, LSL, USL) as a header, followed by the data portion.
According to this link, .DAT files are very generic files with very specific information. Therefore, and especially after looking at your sample snippet, I doubt there is a straightforward way to do the conversion (unless there's a package designed specifically to process similar data).
I can only give you my five cents on a general strategy to tackle this:
For starters, instead of focusing on the .csv format, you should first focus on turning this text file into a table format.
To do so, you should save the parameters in separate vectors/columns (Every column could be TXT, Unit, LSL, etc.)
In doing so, each row (SPEC1, SPEC2, SPEC3) would represent one datapoint with all its characteristics.
Even so, it looks like the file also contains metadata, so you might want to save the different chunks of data into different variables (e.g. file_info <- readLines(x, n = 4)).
Hope it might help a bit.
Edit: As said by @qwe, the format resembles a .ini file. So a good way to start would be to read the file with '=' as the separator:
data <- read.table('example.dat', sep = '=', fill = TRUE, stringsAsFactors = FALSE)
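Building on that idea, here is a minimal sketch for a file laid out like the snippet above: it pulls the TXT/LSL/USL values out of the KEY=VALUE lines above [DATA] into a small spec table and reads the data portion separately. The input name EXAMPLE.DAT and the output names are placeholders, and how you combine the two pieces into your final csv is up to you:
txt <- readLines("EXAMPLE.DAT")
# position of the [DATA] marker
data_start <- grep("^\\[DATA\\]", txt)
header_part <- txt[seq_len(data_start - 1)]
# extract the values of a given KEY from the KEY=VALUE lines
get_vals <- function(key) {
  sub(paste0("^", key, "="), "", grep(paste0("^", key, "="), header_part, value = TRUE))
}
specs <- data.frame(TXT = get_vals("TXT"),
                    LSL = as.numeric(get_vals("LSL")),
                    USL = as.numeric(get_vals("USL")),
                    stringsAsFactors = FALSE)
# data portion below [DATA]; the trailing comma yields one extra empty column
dat <- read.csv(text = txt[(data_start + 1):length(txt)], header = FALSE)
write.csv(specs, "spec_part.csv", row.names = FALSE)
write.csv(dat, "data_part.csv", row.names = FALSE)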
I have a Python 3 script that reads the first eight characters of every filename in a directory in order to determine, from the name alone, whether the file was created more or less than 180 days ago. The file names all begin with YYYYMMDD or look like eerasedd_YYYYMMDD_etc.xls. I can already collect all these filenames.
I need to tell my script to ignore any filename that does not conform to the standard eight leading numerical characters, for example: 20180922 or eerasedd_20171207_1oIkZf.so.
if name.startswith('eerasedd_'):
    fileDate = datetime.strptime(name[9:17], DATEFMT).date()
else:
    fileDate = datetime.strptime(name[0:8], DATEFMT).date()
I need logic to prevent the script from choking on files that don't fit the desired pattern; it should carry on with its work and skip non-conformant filenames. Do I need to add code that catches an exception, or just add an elif block?
I have a function to get only the names of those files I need based on their extensions.
from pathlib import Path

def get_files(extensions):
    all_files = []
    for ext in extensions:
        all_files.extend(Path('/Users/mrh/Python/calls').glob(ext))
    return all_files

for file in get_files(('*.wav', '*.xml')):
    print(file.name)
Now I need to figure out how to check each file.name for the date string in its filename, i.e. I need to run something like
if name.startswith('eerasedd_'):
    fileDate = datetime.strptime(name[9:17], DATEFMT).date()
else:
    fileDate = datetime.strptime(name[0:8], DATEFMT).date()
against 'file.name' to see whether the files are 180 days old or less.
I need to add metadata about the row being processed. I need the filename to be added as a column. I looked at the ambulance demos in the Git repo, but can't figure out how to implement this.
You can use two features of U-SQL called 'file sets' and 'virtual columns'. In my simple example, I have two files in my input directory; I use a file set and refer to the virtual columns in the EXTRACT statement, e.g.:
// Filesets, file set with virtual column
#q =
EXTRACT rowId int,
filename string,
extension string
FROM "/input/filesets example/{filename}.{extension}"
USING Extractors.Tsv();
#output =
SELECT filename,
extension,
COUNT( * ) AS records
FROM #q
GROUP BY filename,
extension;
OUTPUT #output TO "/output/output.csv"
USING Outputters.Csv();
Read more about both features here:
https://msdn.microsoft.com/en-us/library/azure/mt621320.aspx
This question already has answers here:
Is it possible to get the number of rows in a CSV file without opening it?
I want to use data.table to process a very big file that doesn't fit in memory.
I've thought of reading the file in chunks using a loop, increasing the skip parameter appropriately:
fread("myfile.csv", skip=loopindex, nrows=chunksize)
then processing each of these chunks and appending the resulting output with fwrite.
In order to do this properly I need to know the total number of rows, without reading the whole file.
What's the proper/faster way to do it?
I can only think of reading just the first column, but maybe there is a special command or trick, or an automatic way to detect the end of the file.
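For context, a rough sketch of the chunked loop described above, assuming the total number of data rows n_rows has already been obtained with one of the methods in the answers (file names and chunk size are placeholders):
library(data.table)
chunk_size <- 100000
n_rows <- 1000000   # total data rows, found with one of the methods below
skip <- 1           # skip the header line of the csv on the first pass
while (skip <= n_rows) {
  chunk <- fread("myfile.csv", skip = skip, nrows = chunk_size, header = FALSE)
  # ... process the chunk here ...
  fwrite(chunk, "output.csv", append = (skip > 1))
  skip <- skip + chunk_size
}
Note that fread still has to scan past all the skipped lines on every pass, so later chunks take progressively longer on a very large file.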
1) count.fields Not sure if count.fields reads the whole file into R at once. Try it to see if it works.
length(count.fields("myfile.csv", sep = ","))
If the file has a header subtract one from the above.
2) sqldf Another possibility is:
library(sqldf)
read.csv.sql("myfile.csv", sep = ",", sql = "select count(*) from file")
You may need other arguments as well depending on header, etc. Note that this does not read the file into R at all -- only into sqlite.
3) wc Use the system command wc which should be available on all platforms that R runs on.
shell("wc -l myfile.csv", intern = TRUE)
or to directly get the number of lines in the file
read.table(pipe("wc -l myfile.csv"))[[1]]
or
read.table(text = shell("wc -l myfile.csv", intern = TRUE))[[1]]
Again, if there is a header subtract one.
If you are on Windows be sure that Rtools is installed and use this:
read.table(pipe("C:\\Rtools\\bin\\wc -l myfile.csv"))[[1]]
Alternately on Windows without Rtools try this:
read.table(pipe('find /v /c "" myfile.csv'))[[3]]
See How to count no of lines in text file and store the value into a variable using batch script?
The answer by #G. Grothendieck about using wc -l is a good one, if you can rely on it being present.
You might also want to look into iterating through the file in chunks, e.g. by employing something like this answer that only relies on base R functions.
Since you don't need to read single lines, you can read in a batch from a connection. For instance:
count_lines = function(filepath, batch) {
  con = file(filepath, "r")
  n = 0
  while ( TRUE ) {
    # read up to `batch` lines at a time and count what we got
    lines = readLines(con, n = batch)
    present = length(lines)
    n = n + present
    # fewer lines than requested means we hit the end of the file
    if ( present < batch ) {
      break
    }
  }
  close(con)
  return(n)
}
Then you could read the file in, say, 1,000 lines at a time:
count_lines("filename.txt", 1000)
I have a large dataset in a dbf file and would like to export it to a csv file.
Thanks to SO I already managed to do that smoothly.
However, when I try to import it into R (the environment I work in) it combines some characters together, making some rows much longer than they should be and consequently breaking the whole database. In the end, whenever I import the exported csv file I get only about half of the db.
I think the main problem is with quotes inside the character fields, but specifying quote = "" in R didn't help (and it usually does).
I've searched for questions on how to deal with quotes when exporting from Visual FoxPro, but couldn't find an answer. I wanted to test this, but my computer throws an error stating that I don't have enough memory to complete the operation (probably due to the large db).
Any help will be highly appreciated. I've been stuck on this problem of getting the dbf into R for long enough; I've searched everything I could and am desperately looking for a simple way to import a large dbf into my R environment without any bugs.
In R I checked whether the imported file has problems, and indeed most columns have far more characters (nchar) than they should, while the number of rows halved. Reading the db with read.csv("file.csv", quote = "") didn't help. Reading with data.table::fread() returns the error
Expected sep (',') but '0' ends field 88 on line 77980:
But according to verbose = TRUE this function reads the right number of rows (read.csv imports only about 1.5 million rows):
Count of eol after first data row: 2811729 Subtracted 1 for last eol
and any trailing empty lines, leaving 2811728 data rows
When exporting with COPY TO ... TYPE DELIMITED you have some control on the VFP side over how the export formats the output file.
To change the character that wraps each field from double quotes to, say, a pipe character, you can do:
copy to myfile.csv type delimited with "|"
so that will produce something like:
|A001|,|Company 1 Ltd.|,|"Moorfields"|
You can also change the separator from a comma to another character:
copy to myfile.csv type delimited with "|" with character "#"
giving
|A001|#|Company 1 Ltd.|#|"Moorfields"|
That may help in parsing on the R side.
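On the R side, a minimal sketch of reading such an export back in, assuming it was written with pipe as the field delimiter and # as the separator as above (the file name is a placeholder):
dat <- read.table("myfile.csv", sep = "#", quote = "|",
                  header = FALSE, stringsAsFactors = FALSE)
With quote = "|", any double quotes embedded in the fields are treated as ordinary characters, which is exactly what you want here.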
There are three ways to delimit a string in VFP: the normal single and double quote characters, and square brackets. So to strip quotes out of the character fields myfield1 and myfield2 in your DBF file you could do this in the Command Window:
close all
use myfile
copy to mybackupfile
select myfile
replace all myfield1 with chrtran(myfield1,["'],"")
replace all myfield2 with chrtran(myfield2,["'],"")
and repeat for other fields and tables.
You might have to write code to do the export, rather than simply using the COPY TO ... DELIMITED command.
SELECT thedbf
mfld_cnt = AFIELDS(mflds)
fh = FOPEN(m.filename, 1)
SCAN
FOR aa = 1 TO mfld_cnt
mcurfld = 'thedbf.' + mflds[aa, 1]
mvalue = &mcurfld
** Or you can use:
mvalue = EVAL(mcurfld)
** manipulate the contents of mvalue, possibly based on the field type
DO CASE
CASE mflds[aa, 2] = 'D'
mvalue = DTOC(mvalue)
CASE mflds[aa, 2] $ 'CM'
** Replace characters that are giving you problems in R
mvalue = STRTRAN(mvalue, ["], '')
OTHERWISE
** Etc.
ENDCASE
= FWRITE(fh, mvalue)
IF aa # mfld_cnt
= FWRITE(fh, [,])
ENDIF
ENDFOR
= FWRITE(fh, CHR(13) + CHR(10))
ENDSCAN
= FCLOSE(fh)
Note that I'm using [ ] characters to delimit strings that include commas and quotation marks. That helps readability.
* create a comma delimited file with no quotes around the character fields
copy to myfile.csv TYPE DELIMITED WITH ""   && WITH "" is two consecutive double quotes
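Reading that unquoted, comma-separated export back into R should then be straightforward (the file name is again a placeholder); since nothing is quoted any more, quote = "" becomes the natural setting rather than a workaround:
dat <- read.csv("myfile.csv", header = FALSE, quote = "", stringsAsFactors = FALSE)
Just keep in mind this only works cleanly if the character fields themselves contain no commas.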