How to write a file line by line in R

I am trying to read a csv file line by line and only select the 2nd and the 3rd cell from left, and the 3rd cell from the right. For example, if there are 17 cells in this line, I am going to take the 15th cell. Then I want to combine those 3 cells, separated by comma, and then to write this line to a new csv file.
For now, I am just using a for loop to access each line and split it by comma. Then I select the cells I want, combine them into a string, and append that to a big string variable. Once the for loop finishes, I write out the file with writeLines(). However, this takes a long time because there are 2.8 million rows, and it uses a lot of memory. Is there any way to make it more efficient? Or can I write the output file line by line inside the for loop?
library(readr) # provides read_lines()
FileLinebyLine <- read_lines("testfile.csv")
pt<-proc.time()
NewFile <- ""
RowList <- list()
for (i in 1:length(FileLinebyLine))
{
a <- strsplit(FileLinebyLine[i],",")
RowList[i] = paste(a[[1]][2],a[[1]][3],a[[1]][(length(a[[1]]) - 2)], sep = ",")
}
NewFile <- paste(unlist(RowList), sep = "\n")
proc.time()-pt
outputfile <- file("output.txt")
writeLines(NewFile,outputfile)
close(outputfile)
I have also tried to use write_lines() in the for loop, but it always gives me the error: Error in isOpen(path) : invalid connection
Can anyone help me? Appreciate that!!!

Yes, you can read and write line by line, although I don't know how fast it will be. Here's an example that reads a file line by line, takes the 4th item in every line, and writes it to a new file one line at a time:
con = file("temp.csv", "r")
while(length(x <- readLines(con, n = 1)) > 0) {
write(strsplit(x,",")[[1]][4], file="out.csv", append=T)
}
close(con)
temp.csv
a,b,c,d,e,f,g,h
x,y,z,a,b,c,d,e
1,2,3,4,5,6,7,8
q,w,e,r,t,y,u,i
out.csv
d
a
4
r
Hope that helps.
Edit: You can also add library(compiler); enableJIT(3) to speed up your loops a little.
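A middle ground between one line at a time and the whole file in memory is to process the file in chunks and let strsplit() work on a whole chunk at once. The following is only a sketch under some assumptions: the helper name pick_fields, the chunk size, and the temporary demo files are made up for illustration, and the naive comma split assumes no quoted fields containing commas. Note also that strsplit() drops a trailing empty field, so lines ending in a comma would shift the "third from the right" position.

```r
# Hypothetical helper: keep the 2nd, 3rd and third-from-last field of each line
pick_fields <- function(lines) {
  parts <- strsplit(lines, ",", fixed = TRUE)
  vapply(parts, function(p) {
    paste(p[2], p[3], p[length(p) - 2], sep = ",")
  }, character(1))
}

# Small demo: temporary files stand in for the real input and output
infile  <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
writeLines(c("a,b,c,d,e,f,g,h", "1,2,3,4,5,6,7,8"), infile)

in_con  <- file(infile, "r")
out_con <- file(outfile, "w")
# Read and write one chunk at a time so memory use stays bounded
while (length(chunk <- readLines(in_con, n = 100000)) > 0) {
  writeLines(pick_fields(chunk), out_con)
}
close(in_con)
close(out_con)

result <- readLines(outfile)
print(result)
```

Keeping both connections open for the whole loop avoids reopening the output file on every write, which is what write(..., append=T) has to do.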

Related

How to import a CSV with a last empty column into R?

I wrote an R script to make some scientometric analyses of Journal Citation Report data (JCR), which I have been using and updating in the past years.
Today, Clarivate has just introduced some changes in its database and now the exported CSV file contains one last empty column, which spoils my script. Because of this last empty column, read.csv automatically assumes that the first column contains the row names.
As before, there is also one first useless row, which is automatically removed in my script with skip = 1.
One simple solution to this "empty column situation" would be to manually remove this last column in Excel, and then proceed with my script as usual.
However, is there a way to add this removal to my script using base R?
The beginning of my script is:
jcreco = read.csv("data/jcr ecology 2020.csv",
na = "n/a", skip = 1, header = T)
The original CSV file downloaded from JCR is available in my Dropbox.
Could you please help me? Thank you!
The real problem is that the empty column doesn't have a header. If the file had only had the extra comma at the end of the header line as well, this probably wouldn't be as messy. But you can also do a bit of column shuffling with fill=TRUE. For example:
dd <- read.table("~/../Downloads/jcr ecology 2020.csv", sep=",",
skip=2, fill=T, header=T, row.names=NULL)
names(dd)[-ncol(dd)] <- names(dd)[-1]
dd <- dd[,-ncol(dd)]
This reads in the data but puts the row names into the data.frame as a regular column and fills the last column with NA. Then you shift all the column names over to the left and drop the last column.
Here is a way.
Read the data as text lines;
Discard the first line;
Remove the end comma with sub;
Create a text connection;
And read in the data from the connection.
The variable fl holds the file name; on my disk I had to set the directory first.
fl <- "jcr_ecology_2020.csv"
txt <- readLines(fl)
txt <- txt[-1]
txt <- sub(",$", "", txt)
con <- textConnection(txt)
df1 <- read.csv(con)
close(con)
head(df1)
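To see the effect of the sub() step without the real export, here is a small sketch on made-up lines. The column names and values are invented for illustration only; the assumed shape (a junk first row, then data lines that each end in a comma) follows the description in the question. read.csv(text = ...) is used here as a shortcut for creating and closing the text connection by hand.

```r
# Made-up lines imitating the shape of the export (not real JCR data)
txt <- c("junk first row",
         "Rank,Title,Impact",
         "1,Journal A,5.2,",
         "2,Journal B,n/a,")

txt <- txt[-1]             # discard the useless first line
txt <- sub(",$", "", txt)  # remove the trailing comma where there is one

# text= reads from the character vector via a text connection
df1 <- read.csv(text = txt, na.strings = "n/a")
str(df1)
```

Because the regular expression is anchored with $, lines that do not end in a comma (such as the header here) are left untouched.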

Reading a CSV file, looping through the rows, using connections

So I have a large csv excel file that my computer cannot handle opening without rstudio terminating.
To solve this I am trying to iterate through the rows of the file in order do my calculations on each row at a time, before storing the value and then moving on to the next row.
This I can normally achieve (eg on a smaller file) through simply reading and storing the whole csv file within Rstudio and running a simple for loop.
It is, however, the size of this storage of data that I am trying to avoid, hence I am trying to read a row of the csv file one at a time instead.
(I think that makes sense)
This was suggested here.
I have managed to get my calculations to be read and work quickly for the first row of my data file.
It is the looping over this that I am struggling with, as I am trying to use a for loop (potentially should be using a while/if statement) but I have nowhere for the "i" value to be called from within the loop: part of my code is below:
con = file(FileName, "r")
for (row in 1:nrow(con)) {
data <- read.csv(con, nrow=1) #reading of file
"insert calculations here"
}
So the "row" is not called upon so the loop only goes through once. I also have an issue with the "1:nrow(con)" as clearly the nrow(con) simply returns NULL
Any help with this would be great,
thanks.
read.csv() will generate an error if it tries to read past the end of the file. So you could do something like this:
con <- file(FileName, "rt")
repeat {
data <- try(read.csv(con, nrow = 1, header = FALSE), silent = TRUE) #reading of file
if (inherits(data, "try-error")) break
"insert calculations here"
}
close(con)
It will be really slow going one line at a time, but you can do it in larger batches if your calculation code supports that. And I'd recommend specifying the column types using colClasses in the read.csv() call, so that R doesn't guess differently sometimes.
Edited to add:
We've been told that there are 3000 columns of integers in the dataset. The first row only has partial header information. This code can deal with that:
n <- 1 # desired batch size
col.names <- paste0("C", 1:3000) # desired column names
con <- file(FileName, "rt")
readLines(con, 1) # Skip over bad header row
repeat {
data <- try(read.csv(con, nrow = n, header = FALSE,
col.names = col.names,
colClasses = "integer"),
silent = TRUE) #reading of file
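# With col.names supplied, read.csv() may return zero rows at the end of the
# file instead of raising an error, so check for both cases
if (inherits(data, "try-error") || nrow(data) == 0) break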
"insert calculations here"
}
close(con)
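The batch pattern above can be tried on a small temporary file. This is only a sketch: the file contents, the batch size of 2, the three-column layout, and the running sum that stands in for "insert calculations here" are all made up for the demo.

```r
# Demo file: one bad header row followed by five rows of integers
tmp <- tempfile(fileext = ".csv")
writeLines(c("bad,header", "1,2,3", "4,5,6", "7,8,9", "10,11,12", "13,14,15"), tmp)

n <- 2                           # batch size (small for the demo)
col.names <- paste0("C", 1:3)    # column names for the demo file
con <- file(tmp, "rt")
invisible(readLines(con, 1))     # skip over the bad header row

total <- 0L
batches <- 0L
repeat {
  data <- try(read.csv(con, nrows = n, header = FALSE,
                       col.names = col.names,
                       colClasses = "integer"),
              silent = TRUE)
  # Stop on an error or on an empty batch, whichever signals end of file
  if (inherits(data, "try-error") || nrow(data) == 0) break
  total <- total + sum(data$C1)  # placeholder calculation on the batch
  batches <- batches + 1L
}
close(con)

print(c(total = total, batches = batches))
```

Because the connection stays open, each read.csv() call continues where the previous one stopped, so the file is only scanned once.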
You could read in your data in batches of, say, 10,000 rows at a time (you can change n to read as many as you want), do your calculations, and then write the results to a new file, appending each batch to the end of the file.
Something like:
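i = 0
n = 10000
# Read the header once so that every batch gets the same column names
cols = names(readr::read_csv('my_file.csv', n_max=0))
while (TRUE) {
# Skip the header line plus the i data rows that were already processed
df = readr::read_csv('my_file.csv', skip=i+1, n_max=n, col_names=cols)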
# If the number of rows in the file is divisible by n, it may be the case
# that the next pass will result in an empty data.frame being returned
if (nrow(df) > 0) {
# do your calculations
# If you have performed calculations on df and want to save those results,
# save the data.frame to a file, appending it to the file to avoid overwriting prior results.
readr::write_csv(df, 'my_new_file.csv', append=TRUE)
} else {
break
}
# Check to see if we need to keep going, if so add n to i
if (nrow(df) < n) {
break
} else {
i = i + n
}
}

What is an alternative to the scan() function in R which is not just for files

I have to read a list of text files whose names start with hello and that are located in the same folder. I have to remove the period after each letter because I want to use periods only as delimiters.
For example, if a line of text looks like this one : “apple. 10.”
I erase the period on the same line to get this result: “apple 10.”
Here is a glimpse of my code.
files0 <- list.files(path=maindir,pattern="hello",full.names=F,recursive=T,
include.dirs=T)
The next loop is not very efficient, because I have to create temporary text files to use the scan() function.
############### First step
for(a in 1:length(files0)){ #start of the loop going through
# every files0
read <- readLines(paste(maindir,files0[a],sep="/")) #read each line
hello <- gsub("(\\D+)\\.","\\1", read) #remove every period after a letter
write.table(hello,file=paste(maindir,paste("temporary",files0[a],sep="_"),sep="/"),
sep = ";",col.names = T,row.names = F,quote = FALSE)
#create new temporary files without the period after a letter
} #end of the loop
##################Second step
files <- list.files(path=maindir,pattern="temporary",full.names=F,
recursive=T,include.dirs=T)
for(b in 1:length(files)){ #start of the loop going through every files
hola <- scan(files[b],character(), sep=".") #read every files and
# use period as delimiters
} #end of the loop
I would like to find an alternative to the scan() function in R since I would not have to create temporary files. Also, I would want to be able to directly use the original files (files0) without modifying them.
For example, I have tried the strsplit() function, but it didn't properly delimit my text file using a period.
Thank you for your help.
I found an alternative.
for(a in 1:length(files0)){ #start of the loop going through
# every files0
read <- readLines(paste(maindir,files0[a],sep="/")) #read each line
hello <- gsub("(\\D+)\\.","\\1", read) #remove every period after a letter
hello1 <- unlist(strsplit(hello, "[.]"))
} #end of the loop
I simply have to use the functions unlist(strsplit()).
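A likely reason the first strsplit() attempt failed is that its split argument is a regular expression by default, and a bare "." matches any character. The sketch below uses two made-up input lines to show the difference; either "[.]" or fixed = TRUE makes the period literal.

```r
# Made-up lines in the shape described in the question
lines <- c("apple. 10.", "kiwi. 7.")

# Remove the period that follows a run of non-digit characters
cleaned <- gsub("(\\D+)\\.", "\\1", lines)

# "." as a regular expression matches every character, so nothing useful is left
wrong <- unlist(strsplit(cleaned, "."))

# A literal period: fixed = TRUE is equivalent to using "[.]" here
pieces <- unlist(strsplit(cleaned, ".", fixed = TRUE))

print(cleaned)
print(pieces)
```

No temporary files are needed, since strsplit() works directly on the character vector returned by readLines().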

Skip all leading empty lines in read.csv

I want to import csv files into R, with the first non-empty line supplying the names of the data frame columns. I know that you can supply the skip argument to specify how many lines to skip before reading. However, the row number of the first non-empty line can change between files.
How do I work out how many lines are empty, and dynamically skip them for each file?
As pointed out in the comments, I need to clarify what "blank" means. My csv files look like:
,,,
w,x,y,z
a,b,5,c
a,b,5,c
a,b,5,c
a,b,4,c
a,b,4,c
a,b,4,c
which means there are rows of commas at the start.
read.csv automatically skips blank lines (unless you set blank.lines.skip=FALSE). See ?read.csv
After writing the above, the poster explained that blank lines are not actually blank but have commas in them but nothing between the commas. In that case use fread from the data.table package which will handle that. The skip= argument can be set to any character string found in the header:
library(data.table)
DT <- fread("myfile.csv", skip = "w") # assuming w is in the header
DF <- as.data.frame(DT)
The last line can be omitted if a data.table is ok as the returned value.
Depending on your file size, this may not be the best solution, but it will do the job.
The strategy here is to read the file as lines instead of parsing it with a delimiter, count the characters in each line, and store the counts in temp.
The while loop then searches for the first line with a non-zero character length.
Finally, the file is read from that line onward and stored as data_filename.
flist = list.files()
for (onefile in flist) {
temp = nchar(readLines(onefile))
i = 1
while (temp[i] == 0) {
i = i + 1
}
temp = read.table(onefile, sep = ",", skip = (i-1))
assign(paste0("data_", onefile), temp)
}
If file contains headers, you can start i from 2.
If the first couple of empty lines are truly empty, then read.csv should automatically skip to the first line. If they have commas but no values, then you can use:
df = read.csv(file = 'd.csv')
df = read.csv(file = 'd.csv',skip = as.numeric(rownames(df[which(df[,1]!=''),])[1]))
It's not efficient if you have large files (since you have to import twice), but it works.
If you want to import a tab-delimited file with the same problem (variable blank lines) then use:
df = read.table(file = 'd.txt',sep='\t')
df = read.table(file = 'd.txt',sep='\t',skip = as.numeric(rownames(df[which(df[,1]!=''),])[1]))
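Another base R option that parses the data only once is to read the raw lines, find the first line that contains something other than commas and whitespace, and parse from there. This is a sketch using the sample from the question as an inline vector; in practice the lines would come from readLines() on the real file, and the names is_blank and first are just illustrative.

```r
# Sample from the question; with a real file use: lines <- readLines("myfile.csv")
lines <- c(",,,", ",,,", "w,x,y,z", "a,b,5,c", "a,b,4,c")

# A line counts as blank if it holds only commas and whitespace
is_blank <- grepl("^[,[:space:]]*$", lines)
first <- which(!is_blank)[1]   # index of the header line

# Parse everything from the header line onward
df <- read.csv(text = lines[first:length(lines)])
print(df)
```

This holds the whole file in memory as text, so for very large files the fread() approach above may be the better fit.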

Trimming big data

I am working on a similar issue as was stated on this other posting and tried adapting the code to select the columns I am interested in and making it fit my data file.
My issue, however, is that the resulting file has become larger than the original one, and I'm not sure the code is working the way I intended.
When I open it with SPSS, the dataset seems to have taken in the header line and then made endless copies of the second line (I had to force stop the process).
I noticed there's no counter in the while loop specifying the line, might this be the case? My background in programming with R is very limited. The file is a .csv and is 4.8GB with 329 variables and millions of rows. I only need to keep around 30 of the variables.
This is the code I used:
##Open separate connections to hold cursor position
file.in <- file('npidata_20050523-20130707.csv', 'rt')
file.out<- file('Mainoutnpidata.txt', 'wt')
line<-readLines(file.in,n=1)
line.split <-strsplit(line, ',')
##Column picking: write the selected columns of the header line
cat(line.split[[1]][1:11],line.split[[1]][23:25], line.split[[1]][31:33], line.split[[1]][308:311], sep = ",", file = file.out, fill= TRUE)
##Use a loop to read in the rest of the lines
line <-readLines(file.in, n=1)
while (length(line)){
line.split <-strsplit(line, ',')
if (length(line.split[[1]])>1) {
cat(line.split[[1]][1:11],line.split[[1]][23:25], line.split[[1]][31:33], line.split[[1]][308:311],sep = ",", file = file.out, fill= TRUE)
}
}
close(file.in)
close(file.out)
One thing wrong that jumps out is that you are missing a line <- readLines(file.in, n=1) inside your while loop, so line never changes and you are stuck in an infinite loop. Also, reading only one line at a time is going to be terribly slow.
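For reference, a corrected version of the loop could look like the sketch below. The column positions in keep and the tiny temporary input file are made up for the demo; for the real file, keep would be c(1:11, 23:25, 31:33, 308:311). The naive split on commas also assumes that no field contains a quoted comma, which may not hold for the real data.

```r
keep <- c(1, 3)   # hypothetical column positions for the demo

# Temporary files stand in for the real input and output
infile  <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".txt")
writeLines(c("id,name,state", "1,Ann,NY", "2,Bob,CA"), infile)

file.in  <- file(infile, "rt")
file.out <- file(outfile, "wt")

line <- readLines(file.in, n = 1)
while (length(line)) {
  fields <- strsplit(line, ",")[[1]]
  if (length(fields) > 1) {
    # Write the selected fields as one comma-separated line
    writeLines(paste(fields[keep], collapse = ","), file.out)
  }
  line <- readLines(file.in, n = 1)   # advance; without this the loop never ends
}
close(file.in)
close(file.out)

trimmed <- readLines(outfile)
print(trimmed)
```

Using paste() with collapse builds exactly one output line per input line, and the header goes through the same code path as the data rows.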
If in your file (unlike the one in the example you linked to) every row contains the same number of columns, you could use my LaF package. This should result in something along the lines of:
library(LaF)
m <- detect_dm_csv("npidata_20050523-20130707.csv", header=TRUE)
laf <- laf_open(m)
begin(laf)
con <- file("Mainoutnpidata.txt", 'wt')
while(TRUE) {
d <- next_block(laf, columns = c(1:11, 23:25, 31:33, 308:311))
if (nrow(d) == 0) break;
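# write.csv() has no header argument; write.table() with col.names=FALSE omits the header
write.table(d, file=con, sep=",", row.names=FALSE, col.names=FALSE)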
}
close(con)
close(laf)
If your 30 columns fit into memory you could even do:
library(LaF)
m <- detect_dm_csv("npidata_20050523-20130707.csv", header=TRUE)
laf <- laf_open(m)
d <- laf[, c(1:11, 23:25, 31:33, 308:311)]
close(laf)
I couldn't test the code above on your file, so can't guarantee there are no errors (let me know if there are).
