I have a CSV file with millions of rows. I would like to open a connection to the file and filter out unnecessary rows before reading it into R. Specifically, I would like to import every 30th row, starting at the second row.
I am working on a Windows machine. I know the following command achieves the desired result on a Mac; however, it doesn't work on my Windows machine.
awk 'BEGIN{i=0}{i++;if (i%30==2) print $1}' < test.csv
In R, if I ran this code on a Mac, I would get the desired result:
write.csv(1:100000, file = "test.csv")
file.pipe <- pipe("awk 'BEGIN{i=0}{i++;if (i%30==2) print $1}' < test.csv")
res <- read.csv(file.pipe)
Clearly, I know nothing about the Windows CLI, so could someone translate this awk command into something that works on the Windows command line, and possibly explain how the translation achieves the desired result?
Thanks in advance!
UPDATE:
So I have downloaded Git and have successfully been able to complete this task using the Git command line, but I need to implement it in R because I have to do this on thousands of files. Does anyone know how to make R run this command through Git?
write.csv(1:100000, file = "test.csv")
file.pipe <- pipe("awk \"BEGIN{i=0}{i++;if (i%30==2) print $1}\" test.csv")
res <- read.csv(file.pipe)
On Windows, the awk program needs to be surrounded by double quotes. They are escaped here because of the outer double quotes on the same line.
Also, the '<' before the input file is not needed (and I doubt it is needed on a Mac either).
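To apply this to thousands of files, you can build the command per file and loop from R. A rough sketch, assuming awk is on your PATH (for example, the one that ships with Git for Windows); files, cmd, and res_list are just illustrative names:
files <- list.files(pattern = "\\.csv$")
res_list <- lapply(files, function(f) {
  # build the awk command for this file; note the escaped double quotes for Windows
  cmd <- sprintf('awk "BEGIN{i=0}{i++;if (i%%30==2) print $1}" %s', f)
  read.csv(pipe(cmd))
})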
I'm using a Mac with RStudio 0.99.489 and R 3.2.2. I have a 1 GB csv file; it's not exactly big, but it still takes around 5 minutes to import with read.csv, and I have a lot of files of this size, so I tried fread(). From reading previous questions, I learned that the 'embedded nul in string' error might be caused by missing values in the date column (a normal entry looks like '03May1995:15:31:50', but where the error occurs it looks like '05May').
I tried sed 's/\\0//g' mycsv1.csv > mycsv2.csv as mentioned in 'Embedded nul in string' error when importing csv with fread, but the same error message still pops up.
sed -i 's/\\0//g' /src/path/mycsv.csv simply doesn't work for me; the terminal reports an error for this command (I'm not very familiar with these command lines, so I don't understand the logic behind them).
I tried
file <- "file.csv"
tt <- tempfile() # or tempfile(tmpdir="/dev/shm")
system(paste0("tr < ", file, " -d '\\000' >", tt))
fread(tt)
from 'Embedded nul in string' when importing large CSV (8 GB) with fread(). I guess it removed the entries where there is a missing value, because when I run fread(tt), R says:
Error in fread(tt) :
Expecting 5 cols, but line 5060627 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes.
After that, I tried iconv -f utf-16 -t utf-8 myfile1.csv > myfile2.csv, because it seemed the problem might be that fread can't handle UTF-16. There may be something wrong with this command line too, because it simply gives me a spreadsheet full of random symbols.
And I saw this
vim filename.csv
:%s/CTRL+2//g
ESC #TO SWITCH FROM INSERT MODE
:wq # TO SAVE THE FILE
from Error with fread in R--embedded nul in string: '\0', but after I typed vim filename.csv, the terminal just loaded the whole spreadsheet and I couldn't type the second command (:%s/CTRL+2//g). Again, I don't really understand these commands, so maybe I need to adapt them to my situation.
Thanks for the help!
Try editing the file in place:
sed -i 's/\x0//g' my_file
(note that the BSD sed shipped with macOS needs an explicit backup suffix for -i, e.g. sed -i '' 's/\x0//g' my_file), or write a cleaned copy:
cat my_file | tr -d '\000' > new_file
Both simply delete the embedded nul bytes that fread is choking on.
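If you have many files of this size, one option is to drive that same cleanup from R right before each fread() call. A rough sketch, assuming tr is on your PATH; clean_and_read and the file pattern are just illustrative:
library(data.table)
clean_and_read <- function(f) {
  tmp <- tempfile(fileext = ".csv")
  # strip embedded nul bytes into a temporary cleaned copy
  system(paste("tr -d '\\000' <", shQuote(f), ">", shQuote(tmp)))
  fread(tmp)
}
tables <- lapply(list.files(pattern = "\\.csv$"), clean_and_read)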
I have many large text files, and I would like to add a line at the very beginning. I saw someone had asked this already here. However, this involves reading the entire text file and appending it to the single line. Is there a better (faster) way?
I tested this on Windows 7 and it works. Essentially, you use the shell() function and do everything in the Windows cmd shell, which is quite fast.
write_beginning <- function(text, file){
#write your text to a temp file i.e. temp.txt
write(text, file='temp.txt')
#print temp.txt to a new file
shell(paste('type temp.txt >' , 'new.txt'))
#append your file to new.txt
shell(paste('type', file, '>> new.txt'))
#remove temp.txt - I use capture output to get rid of the
#annoying TRUE printed by file.remove
dump <- capture.output(file.remove('temp.txt'))
#uncomment the last line below to rename new.txt with the name of your file
#and essentially modify your old file
#dump <- capture.output(file.rename('new.txt', file))
}
#assuming your file is test.txt and you want to add 'hello' at the beginning just do:
write_beginning('hello', 'test.txt')
On Linux you just need the corresponding command for sending one file's contents into another (I think you would replace type with cat on Linux, but I cannot test that right now).
You'd use the system() function on a Linux distro:
system('cp file.txt temp.txt; echo " " > file.txt; cat temp.txt >> file.txt; rm temp.txt')
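If you want to keep the same calling pattern as write_beginning() above, the Linux one-liner can also be wrapped in a small function. A rough sketch, assuming cat and mv are available; write_beginning_linux is just an illustrative name:
write_beginning_linux <- function(text, file) {
  tmp <- tempfile()
  # write the new first line to a temp file
  writeLines(text, tmp)
  # append the original file to it, then replace the original
  system(paste("cat", shQuote(file), ">>", shQuote(tmp)))
  system(paste("mv", shQuote(tmp), shQuote(file)))
}
write_beginning_linux('hello', 'test.txt')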
So I'm running a program that works, but the issue is that my computer is not powerful enough to handle the task. I have code written in R, but I have access to a supercomputer that runs a Unix system (as one would expect).
The program is designed to read a .csv file, find every row with the unit ft3(monthly total) in the "Units" column, and select the value in the column before it. The files are tables that list values in multiple units.
The working program in R:
getwd()
setwd("/Users/youruserName/Desktop")
myData= read.table("yourFileName.csv", header=T, sep=",")
funData= subset(myData, units == "ft3(monthly total)", select=units:value)
write.csv(funData, file="funData.csv")
To convert it to a shell script, I tried:
pwd
cd /Users/yourusername/Desktop
touch RunThisProgram
nano RunThisProgram
(((In nano, I wrote)))
if
grep -r yourFileName.csv ft3(monthly total)
cat > funData.csv
else
cat > nofun.csv
fi
control+x (((used control x to close nano)))
chmod +x RunThisProgram
./RunThisProgram
(((It runs for a while)))
We get a funData.csv file as output, but that file is empty.
What am I doing wrong?
It isn't actually running, because there are a couple of problems with your script:
grep needs the pattern first, and quoted; -r is for recursing a directory...
if needs a then
cat is called wrong, so it is actually reading from stdin.
You really only need one line:
grep -F "ft3(monthly total)" yourFileName.csv > funData.csv
I have a scenario in which I need to download files with the curl command, and I want my script to pause for some time before downloading the next one. I used the sleep command like
sleep 3m
but it is not working.
Any ideas?
Thanks in advance.
Make sure your text editor is writing only a \n for every new line, not \r\n. This is typical if you are writing the script on Windows.
Use Notepad++ (Windows) and go to Edit | EOL Conversion | UNIX, then save it. If you are stuck with the file on the machine, I have read here [talk.maemo.org/showthread.php?t=67836] that you can use [tr -d "\r" < oldname.sh > newname.sh] to remove the problem. If you want to see whether a file has the problem, use [od -c yourscript.sh]; a \r will appear before any \n.
Other problems I have seen it cause: cd /dir/dir fails with [cd: 1: can't cd to /dir/dir], and copying scriptfile.sh to newfilename produces a file called newfilenameX, where X is an invisible character (make sure you can delete it before trying this); if the file is on a network share, a Windows machine can see the character. Make sure the affected line is not the last line if you want a successful test.
Until I figured it out (I knew I had to ask Google about something that can manifest in various ways), I thought there was an issue with the Linux version I was using (sleep not working in a script???).
Are you sure you are using sleep the right way? Based on your description, you should be invoking it as:
sleep 180
Is this the way you are doing it?
You might also want to consider the wget command, as it has an explicit --wait flag, so you might avoid needing the loop in the first place.
while read -r urlname
do
curl ..........${urlname}....
sleep 180 #180 seconds is 3 minutes
done < file_with_several_url_to_be_fetched
?
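If you happen to be driving the downloads from R rather than a shell script, Sys.sleep() gives the same pause. A rough sketch, assuming the same URL list file as above and that basename(u) is an acceptable local file name:
urls <- readLines("file_with_several_url_to_be_fetched")
for (u in urls) {
  download.file(u, destfile = basename(u))  # fetch one file
  Sys.sleep(180)                            # wait 3 minutes before the next download
}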
I'd like to split a file and grep each piece without writing them to individual files.
I've attempted a couple of variations of split and grep with no luck; any suggestions?
Something along the lines of:
split -b SIZE filename | grep "string"
I've attempted grep/fgrep to find the string but my shell complains that the files are too large. See: use fgrep instead
There is no point in splitting the file if you plan to [linearly] search each of the pieces anyway (assuming that's the only thing you are doing with it). Consider running grep on the entire file.
If, however, you plan to make use of the split pieces later on, then the typical way would be:
Create a temporary directory and step into it
Run split/csplit on the original file
Use a for loop over the written fragments to do your processing.
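And if you are doing this from R in the first place, you can stream the file in fixed-size chunks and search each chunk in memory, without splitting at all. A rough sketch; the file name, pattern, and chunk size are placeholders:
con <- file("filename.csv", open = "r")
repeat {
  chunk <- readLines(con, n = 100000)             # read the next 100,000 lines
  if (length(chunk) == 0) break                   # end of file
  matches <- grep("string", chunk, value = TRUE)  # keep lines containing "string"
  if (length(matches) > 0) writeLines(matches)
}
close(con)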