I am processing a large file: I read a chunk of it, process it, and save what I extract. Then, after rm(list=ls()) to clear memory (sometimes I have to use .rs.restartR() as well, but that is not the concern of this post), I run the same script again after adding 1 to two numbers in it.
This seemed like an opportunity to try writing a loop, but between trying to initialize all the objects used in the loop and the fact that I am not very good at writing loops, it got really confusing.
I posted this here to hear some suggestions; I apologize in advance if my question is too vague. Thanks.
####################### A:11
####################### B:12
# A: I change the skip multiplier here each time.
text_tbl <- fread("tlm_s_words", skip = 166836*11, nrows = 166836, header = FALSE, col.names = "text")
bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4, concatenator =" ", verbose = TRUE)
dfm_1 <- dfm(bi_tkn_one)
## First use colSums(), which saves a numeric vector in `final_dfm_1`.
## tib is the desired object, saved under a new name each time.
final_dfm_1 <- colSums(dfm_1)
tib <- tbl_df(final_dfm_1) %>% add_rownames()
# This is what I wanted to extract: the freq of each token.
# B: Here I change the name `tib` is saved under each time.
saveRDS(tib, file = "tiq12.Rda")
rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)
Below I run the same script, but change 11 to 12 in fread() and 12 to 13 in the saveRDS() call.
####################### A:12
####################### B:13
# A: I change the skip multiplier here each time.
text_tbl <- fread("tlm_s_words", skip = 166836*12, nrows = 166836, header = FALSE, col.names = "text")
bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4, concatenator =" ", verbose = TRUE)
dfm_1 <- dfm(bi_tkn_one)
## Using colSums() gives a numeric vector, `final_dfm_1`.
## tib is the desired object, saved under a new name each time.
final_dfm_1 <- colSums(dfm_1)
tib <- tbl_df(final_dfm_1) %>% add_rownames()
# This is what I wanted to extract: the freq of each token.
# B: Here I change the name `tib` is saved under each time.
saveRDS(tib, file = "tiq13.Rda")
rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)
Below is a list of all the objects (thanks to this post) in my working environment; these are cleared from the working environment before running the same chunk with A+1 and B+1.
            Type       Size      Rows    Columns
dfm_1       dfmSparse  174708600 166836  1731410
bi_tkn_one  tokens     152494696 166836  NA
tib         tbl_df     148109248 1731410 2
final_dfm_1 numeric    148108544 1731410 NA
text_tbl    data.table 22485264  166836  1
I spent some time trying to figure out how to write this loop. I found a post on SO about how to initialize a data.table with a character column, but there are still other objects that I think I need to initialize, and I am unsure how feasible such a loop is.
I have copied and pasted the same script back-to-back as shown above and run it all at once. It's silly, since I am just adding one in two places.
Feel free to comment on my approach; I would like to learn something from this. Best.
On a side note: I read about adding .rs.restartR() to the loop, and came across a post that suggested using batch files or scheduling tasks in R; I will have to pass on learning those for now.
This turned out to be very simple: I didn't have to initialize any objects, which is what I had been trying to do. The only things I had to do were load the required packages upon starting R and run the loop.
ls()
character(0)
From an empty environment, just a simple loop.
library(data.table)
library(quanteda)
library(dplyr)
for (i in 4:19){
  # A: I change the skip multiplier here each time.
  text_tbl <- fread("tlm_s_words", skip = 166836*i, nrows = 166836, header = FALSE, col.names = "text")
  bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 3, concatenator = " ", verbose = TRUE)
  dfm_1 <- dfm(bi_tkn_one)
  ## Using colSums() gives a numeric vector, `final_dfm_1`.
  ## tib is the desired object, saved under a new name each time.
  final_dfm_1 <- colSums(dfm_1)
  print(setNames(length(final_dfm_1), "no. N-grams in this batch"))
  tib <- tbl_df(final_dfm_1) %>% add_rownames()
  # This is what I wanted to extract: the freq of each token.
  # B: the file name is now indexed by i+1, replacing the manual renaming.
  iplus <- i + 1
  saveRDS(tib, file = paste0("titr", iplus, ".Rda"))
  rm(list = ls())
  Sys.sleep(10)
  gc()
  Sys.sleep(10)
}
Without initializing any data.table or other objects, the above loop saved 16 files in my working directory.
That makes me think: when do we need to initialize the vectors, matrices, and other objects that are used in a loop?
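My takeaway, sketched below with made-up numbers: initialization only matters when a loop accumulates results in memory, because growing an object re-copies it on every iteration. Since each pass of my loop wrote its result to disk and then cleared the environment, nothing accumulated and nothing needed initializing.
n <- 1e4
# Growing: res is re-copied on every iteration as it lengthens.
res <- c()
for (i in 1:n) res <- c(res, i^2)
# Preallocating: make a vector of the final length once, then fill it.
res <- numeric(n)
for (i in 1:n) res[i] <- i^2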
Related
I have a large database that I've split into multiple files. Each file is saved in the same directory, and there is a numerical sequence in the naming scheme so that the order of the database is maintained. I've done this to reduce the time and memory it takes to load and manipulate the database. I would like to start analyzing the database in sequence, which I intend to accomplish using a rollapply-like function. I run into a problem when the window must span two files at once, which is where I need help. Here is a dummy dataset that will create five CSV files with a naming scheme similar to my database:
library(readr)
val <- c(1,2,3,4,5)
df_1 <- data.frame(val)
write_csv(df_1, "1_database.csv", col_names = TRUE)
write_csv(df_1, "2_database.csv", col_names = TRUE)
write_csv(df_1, "3_database.csv", col_names = TRUE)
write_csv(df_1, "4_database.csv", col_names = TRUE)
write_csv(df_1, "5_database.csv", col_names = TRUE)
Keep in mind that this database is huge and causes memory and time issues on my current machine. The solution MUST have a component that "forgets": recurrently joining the files, or loading them all into the R environment at once, is not an option. When a new file is loaded, the oldest file must be removed from the R environment. I can have at most three files loaded at once; for example, files 1-3 can be loaded, and then file 1 needs to be removed before file 4 is loaded.
The output can be a single list - the combination of files 1-5.
For the sake of simplicity, let's say I want to use a window of 2, and that I want to calculate the mean of this window. I'm imagining something like this (see below), but this may be a failed approach, and I'm open to anything.
appreciated_function <- function(x){
  # your greatly appreciated function goes here
}
rollapply(df, 2, appreciated_function, by.column = FALSE, align = "left")
Suppose the window width is k. Iterate through the files and, for each one, read that file plus the first k-1 rows of the next file (except for the last file, which is read on its own); run rollapply on that, and append the result to what we have so far. Alternately, if the output is too large, we could write out each result instead of appending it.
At the bottom we check that it gives the expected result.
library(readr)
library(zoo)
val <- c(1,2,3,4,5)
df_1 <- data.frame(val)
write_csv(df_1, "1_database.csv", col_names = TRUE)
write_csv(df_1, "2_database.csv", col_names = TRUE)
write_csv(df_1, "3_database.csv", col_names = TRUE)
write_csv(df_1, "4_database.csv", col_names = TRUE)
write_csv(df_1, "5_database.csv", col_names = TRUE)
d <- dir(pattern = "database.csv$")
k <- 2
r <- NULL
for(i in seq_along(d)) {
  Next <- if (i != length(d)) read_csv(d[i+1], n_max = k-1)
  DF <- rbind(read_csv(d[i]), Next)
  r0 <- rollapply(DF, k, sum, align = "left")
  # if the output is too large, replace the next statement with one that writes out r0
  r <- rbind(r, r0)
}
# check
r2 <- rollapply(data.frame(val = sequence(rep(5, 5))), k, sum, align = "left")
identical(r, r2)
## [1] TRUE
I wrote some R code to run analysis on my research project. I coded it in such a way that there was an output text file with the status of the program. Now the header of the output file looks like this:
start time: 2014-10-23 19:15:04
starting analysis on state model: 16
current correlation state: 1
>>>em_prod_combs
em_prod_combs
H3K18Ac_H3K4me1 1.040493e-50
H3K18Ac_H3K4me2 3.208806e-77
H3K18Ac_H3K4me3 0.0001307375
H3K18Ac_H3K9Ac 0.001904384
The `>>>em_prod_combs` marker is on line 4; on line 5 it is repeated again (by the R code). I'd like the data from line 6. This data goes on for 36 more rows, so it ends at line 42. Then there is some other text in the file, all the way until line 742, which looks like this:
(742) >>>em_prod_combs
(743) em_actual_perc
(744) H3K18Ac_H3K4me1 0
H3K18Ac_H3K4me2 0
H3K18Ac_H3K4me3 0.0001976819
H3K18Ac_H3K9Ac 0.001690382
And again I'd like to select the data from line 744 (the actual data, not the headers), go for another 36 rows, and end at line 780. Here is the relevant part of my code:
filepath <- paste(folder_directory, corr_folders[fi], filename, sep="" )
con <- file(filepath)
open(con);
results.list <- list();
current.line <- 0
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  if (line == ">>>em_prod_combs"){
    storethenext <- TRUE
    # (incomplete: the next 36 lines still need to be captured here)
  }
}
close(con)
Here, I was trying to see if the line read had the ">>>" mark. If so, I would set a variable to TRUE, store the next 36 lines (using another counter variable) in a data frame or list, and set the storethenext variable back to FALSE. I was rather hoping that there is a better way of doing this.
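For completeness, a minimal sketch of how that flag-and-counter idea could be finished; everything beyond the variables in my snippet above is hypothetical, and as written it keeps whichever block follows the last marker in the file:
con <- file(filepath)
open(con)
block <- character(0)
storethenext <- FALSE
count <- 0
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  if (storethenext) {
    count <- count + 1
    block[count] <- line
    # 1 repeated header line + 36 data rows, then stop storing
    if (count == 37) storethenext <- FALSE
  }
  if (line == ">>>em_prod_combs") {
    storethenext <- TRUE
    count <- 0
  }
}
close(con)
block <- block[-1]  # drop the repeated header; block now holds the 36 data rows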
So I realized I could use the n argument of readLines to read (and discard) a block of lines, which effectively skips them. Based on that, I got this:
df <- data.frame(name = character(40),
                 params = numeric(40),
                 stringsAsFactors = FALSE)
con <- file(filepath)
open(con);
results.list <- list();
current.line <- 0
firstblock <- readLines(con, n = 5, warn = FALSE)  # throwaway: skip the 5 header lines
firstblock <- readLines(con, n = 36, warn = FALSE) # the 36 data lines
firstblock <- as.list(firstblock)                  # convert to list
for(k in 1:36){
  splitstring <- strsplit(firstblock[[k]], " ", fixed = TRUE)
  ## put the data in the df
}
But it turns out, from Ben's answer, that read.table can do the same thing in one line, so I've reduced it all down to the following one-liner:
firstblock2 <- read.table(filepath, header = FALSE, sep = " ", skip = 5, nrows = 36)
This also implicitly gives me a data frame and does all the dirty work for me.
The documentation for read.table is here:
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
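By the same pattern, the second block (whose data starts at line 744) should just need a different skip value; assuming the same 36-row count holds there:
secondblock <- read.table(filepath, header = FALSE, sep = " ", skip = 743, nrows = 36)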
In tidyverse (readr)
If you don't wish to convert the data into a data frame, you can just read the slice of text with read_lines().
(Note that the n_max argument asks for how many lines you want to read in, not the number of the row you want to stop at. To be honest, I ultimately found this preferable, as I usually need to manage the file length more than I need to isolate sections of the file on read-in.)
firstblock <- read_lines(filepath, skip = 5, n_max = 31)
If you don't want to think in terms of line counts, you could modify your code thus:
start_line = 5
end_line = 36
line_count = end_line - start_line
firstblock <- read_lines(filepath, skip = start_line, n_max = line_count)
In any case, additional things I found helpful for working with these file formats after I got to know them a little better after finding this post before:
If you want the lines read immediately into a list, as you did above, just use:
read_lines_raw(filepath, skip = 5, n_max = 31)
and you will get a 31-element list as your firstblock object, in lieu of the character vector you get with the first version.
Additional super-cool features I stumbled upon there (and was moved to share, because I thought they rock):
It automatically decompresses .gz, .bz2, .xz, or .zip files.
It will make the connection and complete the download automatically if the filepath starts with http://, https://, ftp://, or ftps://.
If the file has both an extension and a prefix as above, then it does both.
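So, for instance, both of these would work with no explicit download or decompression step (the paths here are hypothetical):
library(readr)
# a compressed local file, decompressed on the fly
x <- read_lines("logs/output.txt.gz", skip = 5, n_max = 31)
# a remote file, downloaded automatically
y <- read_lines("https://example.com/logs/output.txt", skip = 5, n_max = 31)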
If things need to go back where they came from, the write_lines function turned out to be much more enjoyable to use than the base version; specifically, there are no file connections to open and close: just specify the object and the file name you wish to write it into.
so for example the below:
fileConn <- file("directory/new.sql")
writeLines(new_sql, fileConn)
close(fileConn)
gets to just be:
write_lines(new_sql, "directory/new.sql")
Enjoy, and hope this helps!
I am engaged in data cleaning. I have a function that identifies bad rows in a large input file (too big to read at one go, given my ram size) and returns the row numbers of the bad rows as a vector badRows. This function seems to work.
I am now trying to read just the bad rows into a data frame, so far unsuccessfully.
My current approach is to use read.table on an open connection to my file, using a vector of the number of rows to skip between each row that is read. This number is zero for consecutive bad rows.
I calculate skipVec as:
skipVec <- badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers)-1)]) - 1
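A quick check of that arithmetic with made-up row numbers:
badRowNumbers <- c(3, 4, 7)
badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers)-1)]) - 1
## [1] 2 0 2   (skip 2 rows, read row 3; skip 0, read row 4; skip 2, read row 7)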
But for the moment I am just handing my function a skipVec vector of all zeros.
If my logic is correct, this should return all the rows. It does not. Instead I get an error:
"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep =
"") : no lines available in input"
My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.
My question somewhat duplicates that one, but I assume his function works, so I must have broken something. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply.
I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3gig of ram. Stone Age, I know.
Here is the code that produces the error message above:
# Make a small small test data frame, write it to a file, and read it back in
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))
testThis.DF
# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con<-file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF <- lapply(skipVec, FUN = function(pass){
  read.table(con, skip = pass, nrow = 1, header = TRUE, sep = "") })
close(con)
The error occurs before the close command. If I yank the read.table call out of the lapply and the function and just run it on its own, I still get the same error.
If instead of running read.table through lapply you just run the first few iterations manually, you will see what is going on:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
nnn fff
1 2 aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
X2 X3 bb
1 3 5 cc
Because header = TRUE, it is not one line that is read at each iteration but two, so you run out of lines faster than you think; here, on the third iteration:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") :
no lines available in input
Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:
write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
  line <- read.table(con, nrow = 1, header = FALSE, sep = "",
                     row.names = 1)
  if (pass) NULL else line
})
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)
Some clues towards higher speeds (a combined sketch follows this list):
Use scan instead of read.table: read the data as character, and only at the end, after you have put your data into a character matrix or data.frame, apply type.convert to each column.
Instead of looping over skipVec, loop over its rle if that is much shorter; that way you can read or skip chunks of lines at a time.
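Roughly, those two ideas could combine like this. It is only a sketch: bigfile.txt, nrows_total, and badRowNumbers are hypothetical stand-ins, and it assumes each data line has exactly length(header) whitespace-separated fields and no row names.
con <- file("bigfile.txt")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
# Mark each data row as bad (keep) or good (skip), then compress to runs,
# so consecutive rows can be read or skipped in a single call.
bad  <- seq_len(nrows_total) %in% badRowNumbers
runs <- rle(bad)
kept <- list()
for (j in seq_along(runs$lengths)) {
  n <- runs$lengths[j]
  if (runs$values[j]) {
    # a run of n bad rows: read them all at once, as character
    kept[[length(kept) + 1]] <- scan(con, what = character(),
                                     nlines = n, quiet = TRUE)
  } else {
    # a run of n good rows: skip them all at once
    readLines(con, n = n)
  }
}
close(con)
# Only now convert types: reshape into a character matrix,
# then apply type.convert to each column.
m  <- matrix(unlist(kept), ncol = length(header), byrow = TRUE)
df <- as.data.frame(m, stringsAsFactors = FALSE)
df[] <- lapply(df, type.convert, as.is = TRUE)
names(df) <- header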
I am running the following code in order to open up a set of CSV files that have temperature vs. time data
temp = list.files(pattern="*.csv")
for (i in 1:length(temp))
{
  assign(temp[i], read.csv(temp[i], header = FALSE, skip = 20))
  colnames(as.data.frame(temp[i])) <- c("Date", "Unit", "Temp")
}
the data in the data frames looks like this:
V1 V2 V3
1 6/30/13 10:00:01 AM C 32.5
2 6/30/13 10:20:01 AM C 32.5
3 6/30/13 10:40:01 AM C 33.5
4 6/30/13 11:00:01 AM C 34.5
5 6/30/13 11:20:01 AM C 37.0
6 6/30/13 11:40:01 AM C 35.5
I am just trying to assign column names but am getting the following error message:
Error in `colnames<-`(`*tmp*`, value = c("Date", "Unit", "Temp")) :
'names' attribute [3] must be the same length as the vector [1]
I think it may have something to do how my loop is reading the csv files. They are all stored in the same directory in R.
Thanks for your help!
I'd take a slightly different approach which might be more understandable:
temp = list.files(pattern="*.csv")
for (i in 1:length(temp))
{
  tmp <- read.csv(temp[i], header = FALSE, skip = 20)
  colnames(tmp) <- c("Date", "Unit", "Temp")
  # Now what do you want to do?
  # For instance, use the file name as the name of a list element containing the data?
}
Update:
temp = list.files(pattern="*.csv")
stations <- vector("list", length(temp))
for (i in 1:length(temp)) {
  tmp <- read.csv(temp[i], header = FALSE, skip = 20)
  colnames(tmp) <- c("Date", "Unit", "Temp")
  stations[[i]] <- tmp
}
names(stations) <- temp # optional; you could also process the file names, e.g. with basename
station1 <- stations[[1]] # etc.; station1 would be a data.frame
This second part could be improved as well, depending on how you plan to use the data and how much of it there is. A good command to know is str(someObject); it will really help you understand R's data structures.
Update #2:
Getting individual data frames into your workspace will be quite hard; someone more clever than I may know some tricks. Since you want to plot these, I'd first make the names more like you want with:
names(stations) <- paste(basename(temp), 1:length(stations), sep = "_")
Then I would iterate over the list created above as follows, creating your plots as you go:
for (i in 1:length(stations)) {
  tmp <- stations[[i]]
  # tmp is a data frame with columns Date, Unit, Temp
  # plot your data using the plot commands you like to use, for example
  p <- qplot(x = Date, y = Temp, data = tmp, geom = "smooth", main = names(stations)[i])
  print(p)
  # this is approximate code; you'll have to play with it, and watch out for the Dates
  # I recommend the package lubridate if you have any trouble parsing the dates
  # qplot is in package ggplot2
}
And if you want to save them in a file, use this:
pdf("filename.pdf")
# then the plotting loop just above
dev.off()
A multipage pdf will be created. Good Luck!
It is usually not recommended practice to use the 'assign' statement in R. (I should really find some resources on why this is so.)
You can do what you are trying using a function like this:
read.a.file <- function (f, cnames, ...) {
  my.df <- read.csv(f, ...)
  colnames(my.df) <- cnames
  ## Here you can add more preprocessing of your files.
  my.df  # return the data frame (otherwise the function returns the colnames value)
}
And loop over the list of files using this:
lapply(X=temp, FUN=read.a.file, cnames=c("Date", "Unit", "Temp"), skip=20, header=FALSE)
"read.csv" returns a data.frame so you don't need "as.data.frame" call;
You can use "col.names" argument to "read.csv" to assign column names;
I don't know what version of R you are using, but "colnames(as.data.frame(...)) <-" is just an incorrect call since it calls for "as.data.frame<-" function that does not exist, at least in version 2.14.
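For example, point 2 collapses the read and the renaming into a single call:
df <- read.csv(temp[i], header = FALSE, skip = 20,
               col.names = c("Date", "Unit", "Temp"))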
A short-term fix to your woes is the following, but you really need to read up more on using R; from what you did above, I expect you'll get into another mess very quickly. Maybe start by never using assign.
lapply(list.files(pattern = "*.csv"), function (f) {
  df = read.csv(f, header = F, skip = 20)
  names(df) = c('Date', 'Unit', 'Temp')
  df
}) -> your_list_of_data.frames
Although more likely you want this (edited to preserve file name info):
df = do.call(rbind,
             lapply(list.files(pattern = "*.csv"), function(f)
               cbind(f, read.csv(f, header = F, skip = 20))))
names(df) = c('Filename', 'Date', 'Unit', 'Temp')
At a glance, it appears that you are missing a set of subset brackets around the elements of your temp list: temp[[i]] rather than temp[i]. Your attribute list has three elements, but because you have temp[i] instead of temp[[i]], the for loop isn't actually accessing the elements of the list, and so treats each one as an element of length one, as the error says.
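A minimal illustration of the [ versus [[ distinction on a list:
l <- list(a = 1, b = 2)
l[1]    # a list of length one, still wrapping the element
l[[1]]  # the element itself: the number 1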
I am still pretty new to R and very new to for loops and functions, but I searched quite a bit on Stack Overflow and couldn't find an answer to this question. So here we go.
I'm trying to create a script that will (1) read in multiple .csv files and (2) apply a function that strips Twitter handles from the URLs in these files and does some other things to them. I have developed the script for these two tasks separately, so I know that most of my code works, but something goes wrong when I try to combine them. I prepare for doing so using the following code:
library(stringr)  # for str_match(), used below

# specify the directory for your files and replace 'file' with the first, unique part of the
# file names you would like to import
mypath <- "~/Users/you/data/"
mypattern <- "file+.*csv"
# Get a list of the files
file_list <- list.files(path = mypath,
pattern = mypattern)
# List of names to be given to data frames
data_names <- str_match(file_list, "(.*?)\\.")[,2]
# Define function for preparing datasets
handlestripper <- function(data){
  data$handle <- str_match(data$URL, "com/(.*?)/status")[,2]
  data$rank <- c(1:500)
  names(data) <- c("dateGMT", "url", "tweet", "twitterid", "rank")
  data <- data[, c(4, 1:3, 5)]
}
That all works fine. The problem comes when I try to execute the function handlestripper() within the for-loop.
# Read in data
for(i in data_names){
  filepath <- file.path(mypath, paste(i, ".csv", sep = ""))
  assign(i, read.delim(filepath, colClasses = "character", sep = ","))
  i <- handlestripper(i)
}
When I execute this code, I get the following error: Error in data$URL : $ operator is invalid for atomic vectors. I know this means that my function is being applied to the string I pulled from the vector data_names, but I don't know how to tell R that, in the last line of my for loop, I want the function applied to the object named i that I just created using the assign command, rather than to i itself.
Inside your loop, you can change this:
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
to
tmp <- read.delim(filepath, colClasses = "character", sep = ",")
assign(i, handlestripper(tmp))
I think you should make as few get and assign calls as you can, but there's nothing wrong with indexing your loop with names as you are doing. I do it all the time, anyway.
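As a sketch of that few-assign-calls idea (reusing mypath, data_names, and handlestripper from the question), the whole loop can collapse into one named list:
files <- file.path(mypath, paste0(data_names, ".csv"))
dfs <- lapply(files, function(f)
  handlestripper(read.delim(f, colClasses = "character", sep = ",")))
names(dfs) <- data_names
# access each cleaned data frame as dfs[[data_names[1]]], etc.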