I am trying to read text files and create a data frame (called dataset) from some specific columns (about 12), each located at fixed character positions, as below:
x <- fread("file1.txt",colClasses = "character", sep = "\n", header = FALSE, verbose = FALSE,strip.white = FALSE)
y <- fread("file2.txt",colClasses = "character", sep = "\n", header = FALSE, verbose = FALSE,strip.white = FALSE)
# combine them
x = rbind(x,y)
# We basically read each whole line as one string and then take the substring
# corresponding to each variable's start and end positions.
Var1= sapply(as.list(x$V1), stri_sub, from = 80, to = 82)
Var1= as.data.frame(Var1)
Var2= sapply(as.list(x$V1), stri_sub, 83, 89)
Var2= as.data.frame(Var2)
dataset = cbind(Var1,Var2)
It takes around 1 minute to run; the two text files have 200K and 300K rows respectively, with 1800 characters per line. Is there a faster way to do this? I will be reading about 200 such files.
I think you can simplify your code in the following manner:
x <- Reduce(rbind, lapply(1:2, function(k) fread(paste0("file",k,".txt"),
colClasses = "character",
sep = "\n",
header = FALSE,
verbose = FALSE,
strip.white = FALSE)))
dataset <- data.frame(Var1= substr(x$V1, 80, 82), Var2 = substr(x$V1,83,89))
where the second line should save more time, since substr is applied over the whole column at once rather than element by element.
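Since about 200 such files will eventually be read, the same pattern can be extended to a whole directory. A minimal sketch, assuming data.table is loaded and that the file-name pattern (a placeholder here) matches the real files:
library(data.table)

# read every fixed-width file as a single character column and stack them
files <- list.files(pattern = "^file.*\\.txt$")   # placeholder pattern
x <- rbindlist(lapply(files, fread, colClasses = "character",
                      sep = "\n", header = FALSE, strip.white = FALSE))

# vectorised substring extraction over the whole column at once
dataset <- data.frame(Var1 = substr(x$V1, 80, 82),
                      Var2 = substr(x$V1, 83, 89))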
I am working with multiple csv files in long format. Each file has a different number of columns but the same number of rows. I was trying to read all the files and merge them into one df, but I could not do it.
So far I use this code to read each file individually:
try <- read.table('input/SMPS/new_format/COALA_SMPS_20200218.txt', #set the file to read
sep = ',', #separator
header = F, # do not read the header
skip = 17, # skip the first 17 lines of information
fill = T) %>% #fill all empty spaces in the df
t()%>% #transpose the data
data.frame()%>% #make it a df
select(1:196) #select the useful data
My plan was to use something similar to the code below, but I don't know where to include the transpose step to make it work.
smps_files_new <- list.files(pattern = '*.txt',path = 'input/SMPS/new_format/')#Change the path where the files are located
myfiles <-do.call("rbind", ##Apply the bind to the files
lapply(smps_files_new, ##call the list
function(x) ##apply the next function
read.csv(paste("input/SMPS/new_format/", x, sep=''),sep = ',', #separator
header = F, # do not read the header
skip = 17, # skip 17 first lines of information
stringsAsFactors = F,
fill = T))) ##
Use the same code inside lapply that you used for the individual files:
do.call(rbind, ##Apply the bind to the files
lapply(smps_files_new, ##call the list
function(x) ##apply the next function
read.csv(paste("input/SMPS/new_format/", x, sep=''),sep = ',',
header = F, # do not read the header
skip = 17, # skip 17 first lines of information
stringsAsFactors = FALSE,
fill = TRUE) %>%
t()%>%
data.frame()%>%
select(1:196)))
Another way would be to use purrr::map_df or map_dfr instead of lapply + do.call(rbind, ...):
purrr::map_df(smps_files_new,
function(x)
read.csv(paste("input/SMPS/new_format/", x, sep=''),sep = ',',
header = F,
skip = 17,
stringsAsFactors = FALSE,
fill = TRUE) %>%
t()%>%
data.frame()%>%
select(1:196))
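Both versions rely on dplyr for select() and the pipe, so the packages need to be attached first. As an optional extra, not in the original answer, naming the input vector lets map_dfr() record which file each block of rows came from:
library(dplyr)
library(purrr)

myfiles <- smps_files_new %>%
  set_names() %>%                  # name each element by its own file name
  map_dfr(function(x)
            read.csv(paste0("input/SMPS/new_format/", x),
                     header = FALSE, skip = 17,
                     stringsAsFactors = FALSE, fill = TRUE) %>%
              t() %>%
              data.frame() %>%
              select(1:196),
          .id = "file")            # adds a column with the source file name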
I am trying to apply the same function to all csv files (identical structure) in a folder: adding two new columns based on 'old' columns (adding 0.05 to each value), and then saving each file under the same name in the same folder as csv. This should be easy, and there are several examples here for doing that, mostly using lapply; however, I keep running into an error:
Error in `$<-.data.frame`(`*tmp*`, "LAT", value = numeric(0)) : replacement has 0 rows, data has 3
This is my code:
my_files <- list.files(path="C:/PATH", pattern=".csv", full.names=T, recursive=FALSE)
add_col <- function(my_files) {
mpa <- read.csv(my_files, header=T)
mpa$LAT <- mpa$lat_bin + 0.05
mpa$LON <- mpa$lon_bin + 0.05
return(mpa)
write.csv(mpa,
append = FALSE,
quote = FALSE,
sep = ",",
row.names = FALSE,
col.names = TRUE)
}
I am unsure how best to do that for a large number of files.
Here is some sample code to create example files:
Df1 <- data.frame(lat_bin = c(50,40,70,6,8,4),lon_bin = (c(1,5,2,4,9,11)))
Df2 <- data.frame(lat_bin = c(66, 77, 82, 65, 88, 43),lon_bin = (c(2,3,4,5,11,51)))
Df3 <- data.frame(lat_bin = c(43,46,55,67,1,11),lon_bin = (c(7,6,5,9,11,15)))
write.csv(Df1, "data_1.csv", row.names=F)
write.csv(Df2, "data_2.csv", row.names=F)
write.csv(Df3, "data_3.csv", row.names=F)
Simply change the parameters so that the function receives one file, and pass the entire list of files inside lapply. For background, lapply is perhaps the most popular of the apply family of functions: it takes a list/vector input and returns an equal-length list, where each input element is passed into a function.
Specifically, here res returns a list of data frames equal in length to the number of files in my_files, each with the column changes applied. Also, write.csv was missing a file name; the version below saves new csv files with a _new suffix (the double backslashes escape the period, a special character in regex).
my_files <- list.files(path="C:/PATH", pattern=".csv", full.names=T,
recursive=FALSE)
add_col <- function(one_file) {
  mpa <- read.csv(one_file, header = TRUE)
  mpa$LAT <- mpa$lat_bin + 0.05
  mpa$LON <- mpa$lon_bin + 0.05
  # write.csv fixes sep, col.names and qmethod itself (and warns if you try to
  # change them), so only the file name, quoting and row names are supplied
  write.csv(mpa,
            file = sub("\\.csv", "_new\\.csv", one_file),
            quote = FALSE,
            row.names = FALSE)
  return(mpa)
}
res <- lapply(my_files, function(i) add_col(i)) # LONGER VERSION
res <- lapply(my_files, add_col) # SHORTER VERSION
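If only the side effect of writing the _new.csv files matters and the returned list is not needed, the call can be wrapped so nothing is kept or printed; a small sketch (purrr::walk is an optional equivalent):
invisible(lapply(my_files, add_col))   # run purely for the side effect
# purrr::walk(my_files, add_col)       # same idea, returns the input invisibly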
I want to read a csv file with 4000 columns and 3000 rows, where the rows are of different lengths. Currently I'm using the code below to read it, but the maximum number of columns that can be read is 2067.
read_data <- function(filename) {
  setwd(dir)
  # the widest row determines how many columns to allocate
  no_col <- max(count.fields(filename, sep = ","))
  temp_data <- read.csv(filename, header = FALSE, sep = ",", row.names = NULL,
                        na.strings = 0, fill = TRUE, col.names = 1:no_col)
  temp_data
}
How do I solve this problem?
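One possible workaround, a sketch not taken from this thread: read the raw lines yourself, split on the separator, and pad every row out to the widest row before building the data frame (filename is the same argument as in the function above; all columns come back as character and would need converting afterwards):
lines  <- readLines(filename)
fields <- strsplit(lines, ",", fixed = TRUE)
no_col <- max(lengths(fields))                 # the widest row sets the column count
padded <- lapply(fields, function(f) c(f, rep(NA, no_col - length(f))))
temp_data <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)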
I'm a total R newbie and I have a simple question which might be hilarious, but I could not find an answer even though I searched for 4 hours. I might be missing the concept.
I am writing a Monte Carlo script with a lot of variables stored in different environments. At the end of every iteration I want to write all variables (the ones listed when typing ls()) to a table.
This would be a working example (without the item I am asking for) of what I want to do. (Thank you for your help so far, it helped me build this example!)
#input data (data will be manipulated for mc later on)
ha<-5
w_eff<-1.9
v_T1<-8
n<-1000 #number of iterations
#function
T1_func <- function(ha_mc, w_eff_mc, v_T1_mc){
T1_result <- ((ha_mc*10)/(w_eff_mc*v_T1_mc));
return(T1_result)
}
for(i in 1:n){ #number of iterations
#MC manipulation (illustrative)
ha_mc<-rnorm(1, ha, sd=1)
w_eff_mc<-rnorm(1, w_eff, sd=1)
v_T1_mc<-rnorm(1, v_T1, sd=1)
#calculation
T1_mc<-T1_func(ha_mc, w_eff_mc, v_T1_mc)
#now I want to write all variables to a table
df<-data.frame(ha, w_eff, v_T1, ha_mc, w_eff_mc, v_T1_mc, T1_mc)
write.table(df, file = "result.txt", append = TRUE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = FALSE,
col.names = !file.exists("result.txt"), qmethod = c("escape", "double"))
}
My question is: how do I build this data frame:
df<-data.frame(ha, w_eff, v_T1, ha_mc, w_eff_mc, v_T1_mc, T1_mc)
without writing out all the variables (ha, w_eff, v_T1, ha_mc, w_eff_mc, v_T1_mc, T1_mc), but with something like ls()? And how do I do that for the variables in the different environments, so that I end up with a column named "my.env$w_eff"?
Thank you very much!
I would suggest not using ls() and instead building a data.frame that contains only the variables you want to store. Here I first create the file "result.txt" with the correct column headers (I'm storing values of a, b, and c), and then in each iteration I append the corresponding values to the file. Hope this helps:
n <- 10L
write.table(data.frame("a", "b", "c"), file = "result.txt",
col.names = FALSE, row.names = FALSE)
for (i in seq_len(n)) {
#do MC
a <- rnorm(1L)
b <- exp(a)
c <- a + b
write.table(data.frame(a, b, c), file = "result.txt",
append = TRUE, row.names = i, col.names = FALSE)
}
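As a usage note, not part of the original answer: each appended line carries a row label while the header line does not, so read.table() applies its usual convention that a header with one fewer field than the data rows makes the first column the row names, and the file reads straight back in:
results <- read.table("result.txt", header = TRUE)
head(results)   # columns a, b, c; the appended labels 1..n become row names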
That's the solution I found with your help, thanks!
write_table_func <- function(env_name, file_part_name, dir_name){
  #write input to table
  df_input <- data.frame(as.list(get(env_name), all.names = TRUE))
  sort.df_input <- df_input[, order(names(df_input))]
  filename <- paste(sep = "", dir_name, "/", "tabl_", process_n, "_",
                    process_step_n, file_part_name, ".txt")
  suppressWarnings(write.table(sort.df_input, file = filename, append = TRUE,
                               quote = TRUE, sep = " ", eol = "\n", na = "NA",
                               dec = ".", row.names = FALSE,
                               col.names = !file.exists(filename),
                               qmethod = c("escape", "double")))
  rm(df_input)
  rm(sort.df_input)
}
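As an aside on the original ls() question: base R can also build such a one-row data frame without naming every variable. A minimal sketch, assuming every collected object is a length-one atomic value (my.env stands for the separate environment mentioned in the question):
# one column per object visible in the current environment
df <- as.data.frame(mget(ls()))

# one column per object in another environment, with prefixed column names
df_env <- as.data.frame(as.list(my.env, all.names = TRUE))
names(df_env) <- paste0("my.env$", names(df_env))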
I'm attempting to import and export, in pieces, a single 10GB CSV file with roughly 10 million observations. I want about 10 manageable RData files in the end (data_1.RData, data_2.RData, etc.), but I'm having trouble making the skip and nrows arguments dynamic. My nrows will never change, as I need almost 1 million rows per dataset, but I think I need some expression for skip= so that on every loop iteration it increases to catch the next 1 million rows. Also, header=T might mess up anything beyond ii=1, since only the first row includes variable names. The following is the bulk of the code I'm working with:
for (ii in 1:10){
data <- read.csv("myfolder/file.csv",
row.names=NULL, header=T, sep=",", stringsAsFactors=F,
skip=0, nrows=1000000)
outName <- paste("data",ii,sep="_")
save(data,file=file.path(outPath,paste(outName,".RData",sep="")))
}
(Untested but...) You can try something like this:
nrows <- 1000000

# skip the header line, then jump ahead nrows lines for each successive chunk
ind <- seq(0, by = nrows, length.out = 10) + 1

# grab the column names once from the first line
header <- names(read.csv("myfolder/file.csv", header = TRUE, nrows = 1))

for (i in seq_along(ind)) {
  data <- read.csv("myfolder/file.csv",
                   row.names = NULL, header = FALSE,
                   sep = ",", stringsAsFactors = FALSE,
                   skip = ind[i], nrows = nrows)
  names(data) <- header
  outName <- paste("data", i, sep = "_")
  save(data, file = file.path(outPath, paste(outName, ".RData", sep = "")))
}
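For a file this size, the same chunking pattern can also be written with data.table::fread(), which takes the same skip/nrows arguments and is usually much faster than read.csv(). A sketch under the same assumptions (10 chunks of 1 million rows, outPath already defined):
library(data.table)

nrows  <- 1000000
header <- names(fread("myfolder/file.csv", nrows = 0))   # column names only

for (i in 1:10) {
  data <- fread("myfolder/file.csv", header = FALSE,
                skip = (i - 1) * nrows + 1, nrows = nrows)
  setnames(data, header)
  save(data, file = file.path(outPath, paste0("data_", i, ".RData")))
}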