I have a lot of very large txt-files that have a header I would like to remove.
Is there a way to do this without reading in the entire file?
Possibly using the system() command?
I found some ideas here but I haven't been able to make them work. I am using Windows 7 and R Version 3.2.2.
Here is what I have tried:
> systCommand <- paste0("echo '$(tail -n +2 ", myFilePath, ")' > ", myFilePath)
> system(systCommand, intern=T)
Error in system(systCommand, intern = T) : 'echo' not found
I am pretty sure this is because I am using Windows?
After reading the file in:
count_table <- read.table("your path/filename.txt")
head(count_table)
If the first row is the header, drop it:
c_table <- count_table[-1,]
head(c_table)
The first (header) line has then been removed.
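If the file is too large to read into memory comfortably, a chunked copy that drops the first line avoids loading the whole file at once and needs no external tools, so it also works on Windows. A minimal sketch, with placeholder file names:
# Copy a large text file while skipping its first (header) line,
# reading a limited number of lines at a time so memory use stays low.
in_con  <- file("big_input.txt", open = "r")
out_con <- file("big_output.txt", open = "w")

readLines(in_con, n = 1)                  # read and discard the header line
repeat {
  chunk <- readLines(in_con, n = 100000)  # read up to 100,000 lines per pass
  if (length(chunk) == 0) break
  writeLines(chunk, out_con)
}

close(in_con)
close(out_con)
# file.rename("big_output.txt", "big_input.txt")  # optionally replace the original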
I have a long list of files with names like file-typeX-sectorY.tsv, where X and Y take values from 0 to 100. I process each of those files with an R program, but I read them one by one like this:
data <- read.table(file='my_info.tsv', sep = '\t', header = TRUE, fill = TRUE)
This is impractical, so I want to build a bash script that does something like:
#!/bin/bash
for i in {0..100..1}
do
for j in {1..100..1)
do
Rscript program.R < file-type$i-sector$j.tsv
done
done
My problem is not with the bash script but with the R program. How can I receive the files one by one? I have googled and tried instructions like:
args <- commandArgs(TRUE)
or
data <- commandArgs(trailingOnly = TRUE)
but I can't figure out the right way. Could you please help me?
At the simplest level, your problem may be the (possibly accidental?) redirect you have -- so remove the <.
Then a minimal R 'program' to take a command-line argument and do something with it would be
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
stopifnot("require at least one arg" = length(args) > 0)
cat("We were called with '", args[1], "'\n", sep="")
We use a 'shebang' line and chmod 0755 basicScript.R to make it runnable. Your shell double loop, reduced here (and with one typo corrected), becomes
#!/bin/bash
for i in {0..2..1}; do
for j in {1..2..1}; do
./basicScript.R file-type${i}-sector${j}.tsv
done
done
and this works as we hope with the inner program reflecting the argument:
$ ./basicCaller.sh
We were called with 'file-type0-sector1.tsv'
We were called with 'file-type0-sector2.tsv'
We were called with 'file-type1-sector1.tsv'
We were called with 'file-type1-sector2.tsv'
We were called with 'file-type2-sector1.tsv'
We were called with 'file-type2-sector2.tsv'
$
Of course, this is horribly inefficient as you have N x M external processes. The two outer loops could be written in R, and instead of calling the script you would call your script-turned-function.
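A sketch of that all-in-R variant, assuming the per-file work from program.R has been wrapped in a function (process_file() here is a hypothetical name):
# Hypothetical wrapper around whatever program.R does for one file
process_file <- function(path) {
  data <- read.table(path, sep = "\t", header = TRUE, fill = TRUE)
  # ... process 'data' ...
}

for (i in 0:2) {      # reduced ranges as above; use 0:100 / 1:100 for the full set
  for (j in 1:2) {
    process_file(sprintf("file-type%d-sector%d.tsv", i, j))
  }
}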
I think this is an accurate title but feel free to change it if anyone thinks it can be worded better. I am running the following commands using data.table::fread.
fread("sed 's+0/0+0+g' R.test.txt > R.test.edit.txt")
fread("sed 's+0/1+1+g' R.test.edit.txt > R.test.edit2.txt")
fread("sed 's+1/1+2+g' R.test.edit2.txt > R.test.edit3.txt")
fread("sed 's+./.+0.01+g' R.test.edit3.txt > R.test.edit.final.txt")
After each line I get the following message
Warning messages:
1: In fread("sed 's+0/0+0+g' /R/R.test.small.txt > /R/R.test.edit.small.txt") :
File '/path/to/tmp/RtmpwqJu82/file7e7e250b96bf' has size 0. Returning a NULL data.table.
2: In fread("sed 's+0/1+1+g' /R/R.test.edit.small.txt > /R/R.test.edit2.small.txt") :
File '/path/to/tmp/RtmpwqJu82/file7e7e8456d82' has size 0. Returning a NULL data.table.
3: In fread("sed 's+1/1+2+g' /R/R.test.edit2.small.txt > /R/R.test.edit3.small.txt") :
File '/path/to/tmp/RtmpwqJu82/file7e7e3f96bc35' has size 0. Returning a NULL data.table.
4: In fread("sed 's+./.+0.01+g' /R/R.test.edit3.small.txt > /R/R.test.edit.final.small.txt") :
File '/path/to/tmp/RtmpwqJu82/file7e7e302a3cde' has size 0. Returning a NULL data.table.
So it is weird... fread makes all the files I need when I run it on my laptop but gives that warning for each file. When I go to run the script on our cluster, it crashes and gives the following message.
> fread("sed 's+0/0+0+g' /R/R.test.txt > /R/R.test.edit.txt")
Error in fread("sed 's+0/0+0+g' /R/R.test.txt > /R/R.test.edit.txt") :
File is empty: /dev/shm/file38d161d613c
Execution halted
I think it has to do with the warning I get when I run the script on my laptop. I think it is a user issue but maybe it is a bug; I was wondering if anyone had any ideas. I thought of a workaround using the following:
start_time <- Sys.time()
print(start_time)
peakRAM(system("sed 's+0/0+0+g' /R/R.test.txt > /R/R.test.edit.txt"),
        system("sed 's+0/1+1+g' /R/R.test.edit.txt > /R/R.test.edit2.txt"),
        system("sed 's+1/1+2+g' /R/R.test.edit2.txt > /R/R.test.edit3.txt"),
        system("sed 's+./.+0.01+g' /R/R.test.edit3.txt > /R/R.test.edit.final.txt"))
end_time <- Sys.time()
print(end_time)
And this works fine, so I don't think there's a problem with sed or anything like that. I am just wondering what I am doing wrong when I use fread.
Comments above are correct about what to do; I tried looking in the documentation for fread but didn't find anything helpful for you so I filed an issue to improve... thanks!
When you pass a terminal command to fread, it creates a tmp file for you automatically in the background. You can see the exact line here, stylized:
system(paste0('(', cmd, ') > ', tmpFile <- tempfile(tmpdir = tmpdir)))
Then fread is applied to that file. As mentioned, because your sed command already redirects its own output with >, nothing reaches that temporary file, so it has size 0.
If you actually want to keep those intermediate files (e.g. R.test.edit.txt), you have two options: (1) run the command through system() first, e.g. system("sed 's+0/0+0+g' R.test.txt > R.test.edit.txt"), and then run fread on the output file; or (2) [available in the development version only for now; see the Installation wiki] supply the tmpdir argument to fread and omit the > R.test.edit.txt part; fread will do the outputting itself for you.
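For option (2), the call might look something like this (a sketch; it assumes a data.table version that already has the tmpdir argument):
library(data.table)
# fread writes its temporary file into tmpdir instead of tempdir(),
# so the intermediate result ends up somewhere you can find and keep it.
dt <- fread("sed 's+0/0+0+g' R.test.txt", tmpdir = "/path/you/want/to/keep")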
If you don't actually care about the intermediate files, simply omit the > R.test.edit.txt part and fread should work as you were expecting, e.g.:
fread("sed 's+0/0+0+g' R.test.txt")
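If the goal is simply to get the fully edited table into R, the four substitutions can also be chained into a single sed call (a sketch using the same patterns as above; the dots in the last pattern are escaped so it only matches a literal ./., whereas the unescaped original would match any two characters around a slash):
library(data.table)
# All four substitutions are applied to each line in order, and the result
# is read straight into a data.table without any intermediate files.
dt <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+\\./\\.+0.01+g' R.test.txt")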
In TensorFlow, training from scratch produced the following 6 files:
events.out.tfevents.1503494436.06L7-BRM738
model.ckpt-22480.meta
checkpoint
model.ckpt-22480.data-00000-of-00001
model.ckpt-22480.index
graph.pbtxt
I would like to convert them (or only the needed ones) into one file graph.pb to be able to transfer it to my Android application.
I tried the script freeze_graph.py, but it requires an input .pb file, which I do not have (I only have the 6 files mentioned above). How do I proceed to get this one frozen_graph.pb file? I saw several threads but none of them worked for me.
You can use this simple script to do that. But you must specify the names of the output nodes.
import tensorflow as tf

meta_path = 'model.ckpt-22480.meta'  # Your .meta file
output_node_names = ['output']       # Output node names (graph node names, without the ':0' tensor suffix)

with tf.Session() as sess:
    # Restore the graph
    saver = tf.train.import_meta_graph(meta_path)

    # Load the weights from the latest checkpoint in that directory
    saver.restore(sess, tf.train.latest_checkpoint('path/of/your/checkpoint/dir'))

    # Freeze the graph: convert variables to constants
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess,
        sess.graph_def,
        output_node_names)

    # Save the frozen graph
    with open('output_graph.pb', 'wb') as f:
        f.write(frozen_graph_def.SerializeToString())
If you don't know the name of the output node or nodes, there are two ways:
You can explore the graph and find the name with Netron or with console summarize_graph utility.
You can use all the nodes as output ones as shown below.
output_node_names = [n.name for n in tf.get_default_graph().as_graph_def().node]
(Note that you have to put this line just before the convert_variables_to_constants call.)
But I think this is an unusual situation, because if you don't know the output node, you cannot actually use the graph.
As it may be helpful for others, I am also answering here after answering on GitHub ;-).
I think you can try something like this (with the freeze_graph script in tensorflow/python/tools):
python freeze_graph.py \
--input_graph=/path/to/graph.pbtxt \
--input_checkpoint=/path/to/model.ckpt-22480 \
--input_binary=false \
--output_graph=/path/to/frozen_graph.pb \
--output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3 "
The important flag here is --input_binary=false as the file graph.pbtxt is in text format. I think it corresponds to the required graph.pb which is the equivalent in binary format.
Concerning output_node_names, that part is still confusing for me as I have some problems with it, but you can use the summarize_graph script in TensorFlow, which can take the .pb or the .pbtxt as input.
Regards,
Steph
I tried the freeze_graph.py script, but the output_node_names parameter was totally confusing and the job failed.
So I tried the other one: export_inference_graph.py.
And it worked as expected!
python -u /tfPath/models/object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=/your/config/path/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix=/your/checkpoint/path/model.ckpt-50000 \
--output_directory=/output/path
The tensorflow installation package I used is from here:
https://github.com/tensorflow/models
First, use the following code to generate the graph.pbtxt file.
import tensorflow as tf

with tf.Session() as sess:
    # Restore the graph (args.input comes from the caller's argument parsing)
    _ = tf.train.import_meta_graph(args.input)

    # Save the graph definition in text format
    g = sess.graph
    gdef = g.as_graph_def()
    tf.train.write_graph(gdef, ".", args.output, True)
Then, use summarize_graph to get the output node names.
Finally, use
python freeze_graph.py \
--input_graph=/path/to/graph.pbtxt \
--input_checkpoint=/path/to/model.ckpt-22480 \
--input_binary=false \
--output_graph=/path/to/frozen_graph.pb \
--output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3 "
to generate the frozen graph.
I would like to write to a file and then append to it several times in a loop (on a Windows machine). After each time I append to it, I want to close the connection, because I want the file to sync to a Dropbox account so I can open it on other computers, while the code is running, to check the log file's status (note this condition makes this question different from any question asked on SO about sink, writeLines, write, cat, etc.). I've tried
#set up writing
logFile = file("log_file.txt")
write("This is a log file for ... ", file=logFile, append=FALSE)
for(i in 1:10){
write(i, file=logFile, append=TRUE)
}
I've also tried sink(file=logFile,append=TRUE); print(i); sink(); in the loop and also cat. Neither option works. The file only displays i=10, the last iteration of the loop. I noticed the following sentence in the documentation for write.
"if TRUE the data x are appended to the connection."
Does the above mean that it won't append to an existing file?
The following seems to work with cat because it doesn't need a file connection:
#set up writing
logFile = "log_file.txt"
cat("This is a log file for ... ", file=logFile, append=FALSE, sep = "\n")
for(i in 1:10){
cat(i, file=logFile, append=TRUE, sep = "\n")
}
The output would look like the following, so it does append each value:
This is a log file for ...
1
2
3
4
5
6
7
8
9
10
This is what I think you want. If you are on a Mac or using Linux, you can keep track of progress in the file using:
tail -f log_file.txt
I am not sure how this would work with dropbox however. Can you login to the computer running the code (e.g., on mac or linux)?
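Since the question mentions a Windows machine, where tail -f is not available by default, a rough R-side way to peek at progress from a second R session might be (a sketch):
# Print the last few lines of the log file without loading anything else
log_lines <- readLines("log_file.txt")
cat(tail(log_lines, 5), sep = "\n")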
What about explicitly closing the file after every iteration?
#set up writing
file.text <- "log_file.txt"
logFile <- file(file.text)
write("This is a log file for ... ", file=logFile, append=FALSE)
close(logFile)
for(i in 1:10){
logFile <- file(file.text)
write(i, file=logFile, append=TRUE)
close(logFile)
}
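One caveat: ?cat (which write() calls) says the append argument is only used when file is the name of a file, not a connection, so writing to a freshly created, unopened connection can still overwrite the file each time. Opening the connection explicitly in append mode sidesteps that; a minimal sketch:
logFile_path <- "log_file.txt"
cat("This is a log file for ... \n", file = logFile_path)  # create the file once

for (i in 1:10) {
  con <- file(logFile_path, open = "a")  # open in append mode
  writeLines(as.character(i), con)       # write this iteration's entry
  close(con)                             # close so the file is flushed and can sync
}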
I have a number of R scripts that I would like to chain together using a UNIX-style pipeline. Each script would take as input a data frame and provide a data frame as output. For example, I am imagining something like this that would run in R's batch mode.
cat raw-input.Rds | step1.R | step2.R | step3.R | step4.R > result.Rds
Any thoughts on how this could be done?
Writing executable scripts is not the hard part; what is tricky is making the scripts read from files and/or pipes. I wrote a somewhat general function here: https://stackoverflow.com/a/15785789/1201032
Here is an example where the I/O takes the form of csv files:
Your step?.R files should look like this:
#!/usr/bin/Rscript
OpenRead <- function(arg) {
  if (arg %in% c("-", "/dev/stdin")) {
    file("stdin", open = "r")
  } else if (grepl("^/dev/fd/", arg)) {
    fifo(arg, open = "r")
  } else {
    file(arg, open = "r")
  }
}
args <- commandArgs(TRUE)
file <- args[1]
fh.in <- OpenRead(file)
df.in <- read.csv(fh.in)
close(fh.in)
# do something
df.out <- df.in
# print output
write.csv(df.out, file = stdout(), row.names = FALSE, quote = FALSE)
and your csv input file should look like:
col1,col2
a,1
b,2
Now this should work:
cat in.csv | ./step1.R - | ./step2.R -
The - are annoying but necessary. Also make sure to run something like chmod +x ./step?.R to make your scripts executable. Finally, you could store them (without the extension) inside a directory that you add to your PATH, so you will be able to run them like this:
cat in.csv | step1 - | step2 -
Why on earth you want to cram your workflow into pipes when you have the whole R environment available is beyond me.
Make a main.r containing the following:
source("step1.r")
source("step2.r")
source("step3.r")
source("step4.r")
That's it. You don't have to convert the output of each step into a serialised format; instead you can just leave all your R objects (datasets, fitted models, predicted values, lattice/ggplot graphics, etc) as they are, ready for the next step to process. If memory is a problem, you can rm any unneeded objects at the end of each step; alternatively, each step can work with an environment which it deletes when done, first exporting any required objects to the global environment.
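A sketch of that environment-per-step idea (the object name cleaned_data is hypothetical):
# Run one step inside its own environment, export only what later steps need,
# then drop the environment so its temporaries can be garbage-collected.
step1_env <- new.env()
sys.source("step1.r", envir = step1_env)

cleaned_data <- step1_env$cleaned_data  # hypothetical object kept for the next step
rm(step1_env)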
If modular code is desired, you can recast your workflow as follows. Encapsulate the work done by each file into one or more functions. Then call these functions in your main.r with the appropriate arguments.
source("step1.r") # defines step1_read_input, step1_f2
source("step2.r") # defines step2_f2
source("step3.r") # defines step3_f1, step3_f2, step3_f3
source("step4.r") # defines step4_write_output
step1_read_input(...)
step1_f2(...)
....
step4_write_output(...)
You'll need to add a line at the top of each script to read in from stdin. Via this answer:
in_data <- readLines(file("stdin"),1)
You'll also need to write the output of each script to stdout().
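Putting those two pieces together, a minimal step script might look like this (a sketch that assumes the data travel through the pipe as CSV text rather than .Rds):
#!/usr/bin/env Rscript

# Read the whole data frame from standard input
df <- read.csv(file("stdin"))

# Do something with it (placeholder transformation)
df$new_col <- seq_len(nrow(df))

# Write it back to standard output for the next stage of the pipe
write.csv(df, file = stdout(), row.names = FALSE, quote = FALSE)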