I'm stuck. I have written the following lines of code to bind together the different CSV files that I read from a folder.
setAs("character", "myDate", function(from) as.Date(from, format = "%d/%m/%Y"))
LF <- list.files("O:/00 CREDIT MANAGEMENT/", pattern = "\\.csv$", full.names = TRUE, recursive = FALSE)
PayMatrix <- do.call("rbind", lapply(LF, function(x) {
  read.csv(x, skip = 2, header = TRUE, sep = ";", dec = ",", na.strings = "",
           colClasses = c("Expiration.Date" = "myDate", "Payment.date" = "myDate"))
}))
My problem is that this is a very large set of data, and I would like to know how to parse these CSVs conditionally, depending on the value of the "Payment.date" column (i.e. keeping only rows where Payment.date > 0). Equally, I'm going to use only a few of the columns in those CSVs, so I would like to cut the files down before or during the loop.
I've tried the "awk" approach, but it is not working:
{read.csv(pipe("awk '{if (Payment.date > 0) print [,c(1:2,6:9,29)]}'x"), header=3...
My input files look something like this (CSV; the real header is on line 3):
CURRENT INVOICES 27/03/2017 (W 13)
16276178,26
Client Code.   Invoice       Invoice Date   Expiration Date   Amount    Payment date
1004775        21605000689   29/05/2016     29/07/2016        226,3
1005140        21702000548   28/02/2017     28/04/2017        22939,2
1004775        21703005560   25/03/2017     25/05/2017        21456,2
1004858        F9M01815.     30/01/2017     30/03/2017        5042,52   27/03/2017
Would a selection within the lapply() function work for you? (untested due to the lack of a reproducible example)
PayMatrix <- do.call("rbind", lapply(LF, function(x) {
  tmp <- read.csv(x, skip = 2, header = TRUE, sep = ";", dec = ",", na.strings = "",
                  colClasses = c("Expiration.Date" = "myDate", "Payment.date" = "myDate"))
  # keep rows that have a payment date (empty cells were read as NA)
  tmp[!is.na(tmp$Payment.date), ]
}))
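Since you also mention needing only a few of the columns, you can subset them in the same step. A sketch (untested; the positions c(1:2, 6:9, 29) are an assumption taken from your awk attempt, so adjust them to your real files):
PayMatrix <- do.call("rbind", lapply(LF, function(x) {
  tmp <- read.csv(x, skip = 2, header = TRUE, sep = ";", dec = ",", na.strings = "",
                  colClasses = c("Expiration.Date" = "myDate", "Payment.date" = "myDate"))
  # rows with a payment date, and only the needed columns
  tmp[!is.na(tmp$Payment.date), c(1:2, 6:9, 29)]
}))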
BTW: For handling large data frames efficiently, I recommend considering the data.table package. With it, your code could become (untested)
library(data.table)
PayMatrix <- rbindlist(lapply(archivos1, function(x) {
fread(x, <...>)[Payment.Date > 0, ]
}))
where <...> denotes the parameters that have to be passed to fread().
BTW: The fread() function in the data.table package is not just for speed on large files. It also has very useful convenience features for small data. For details, please see fread's wiki page.
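For instance, a sketch of a complete call (untested; the separator, decimal mark, and column positions are assumptions taken from the question, and fread() keeps the original column names, hence the backticks around Payment date):
library(data.table)
PayMatrix <- rbindlist(lapply(LF, function(x) {
  dt <- fread(x, sep = ";", dec = ",", skip = 2, na.strings = "",
              select = c(1:2, 6:9, 29))   # read only the needed columns
  dt[!is.na(`Payment date`)]              # keep rows that have a payment date
}))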
I'm writing some R code to handle pairs of files, an Excel file and a csv (Imotions.txt). I need to extract a column from the Excel file and merge it into the csv, in pairs. Below is my abbreviated script. The nested loops make it quadratic: the body of the inner loop runs 4 times instead of just once per pair.
Basically, is there a general way to think about running some code over a paired set of files that I can translate to this and other languages?
excel_files <- list.files(pattern = ".xlsx", full.names = TRUE)
imotion_files <- list.files(pattern = 'Imotions.txt', full.names = TRUE)
for (imotion_file in imotion_files) {
  for (excel_file in excel_files) {
    filename <- paste(sub("_Imotions.txt", "", imotion_file))
    raw_data <- extract_raw_data(imotion_file)
    event_data <- extract_event_data(imotion_file)
    # convert times to milliseconds
    latency_ms <- as.data.frame(
      sapply(
        df_col_only_ones$latency,
        convert_to_ms,
        raw_data_first_timestamp = raw_data_first_timestamp
      )
    )
    # read in paradigm data
    paradigm_data <- read_excel(path = excel_file, range = "H30:H328")
    merged <- bind_cols(latency_ms, paradigm_data)
    print(paste("writing = ", filename))
    write.table(
      merged,
      file = paste(filename, "_EVENT", ".txt", sep = ""),
      sep = '\t',
      col.names = TRUE,
      row.names = FALSE,
      quote = FALSE
    )
  }
}
Some of the operations are not entirely clear, but here is an option in tidyverse:
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(readxl)  # for read_excel()
out <- crossing(excel_files, imotion_files) %>%
  mutate(filename = str_remove(imotion_files, "_Imotions.txt"),
         raw_data = map(imotion_files, extract_raw_data),
         event_data = map(imotion_files, extract_event_data),
         paradigm_data = map(excel_files, ~
           read_excel(.x, range = "H30:H328") %>%
             bind_cols(latency_ms, .)))
Based on the OP's code, latency_ms can be created once outside the loop and then used while binding the columns.
Based on the naming of raw_data_first_timestamp, I'm assuming it's created by the extract_raw_data function - otherwise you can move the latency_ms calculation outside the loop entirely, as akrun mentioned.
If you don't want to use tidyverse, see the modified version of your code at bottom. Notice that the loops have been broken out to cut down on duplicated actions.
Some general tips to improve efficiency when working with loops:
Before attempting to improve nested-loop efficiency, consider whether the loops can be broken apart so that data from earlier loops is stored for use in later loops. This can also be done with nested loops and variables that track whether data has already been set, but it's usually simpler to break the loops apart and avoid the need for the tracking variables.
Create variables and call functions before the loop where possible. Depending on the language and/or compiler (if one is used), variable creation outside loops may not help with efficiency, but it's still good practice.
Variables that must be created and functions that must be called inside loops should be placed in the highest scope - or the outermost loop - possible.
Disclaimer - I have never used R, so there may be syntax errors.
excel_files <- list.files(pattern = ".xlsx", full.names = TRUE)
imotion_files <- list.files(pattern = 'Imotions.txt', full.names = TRUE)
paradigm_data_list <- vector("list", length(excel_files))
for (i in 1:length(excel_files)) {
  # read in paradigm data
  paradigm_data_list[[i]] <- read_excel(path = excel_files[[i]], range = "H30:H328")
}
for (imotion_file in imotion_files) {
  filename <- paste(sub("_Imotions.txt", "", imotion_file))
  raw_data <- extract_raw_data(imotion_file)
  event_data <- extract_event_data(imotion_file)
  # convert times to milliseconds
  latency_ms <- as.data.frame(
    sapply(
      df_col_only_ones$latency,
      convert_to_ms,
      raw_data_first_timestamp = raw_data_first_timestamp
    )
  )
  for (paradigm_data in paradigm_data_list) {
    merged <- bind_cols(latency_ms, paradigm_data)
    print(paste("writing = ", filename))
    write.table(
      merged,
      file = paste(filename, "_EVENT", ".txt", sep = ""),
      sep = '\t',
      col.names = TRUE,
      row.names = FALSE,
      quote = FALSE
    )
  }
}
I'm currently trying to create a function that will read many PDF files into a data frame. My ultimate goal is to have it read specific information from the PDF files and convert it into a data.frame with one insurance plan name per row and columns containing the information I need, such as individual plan price, family plan price, etc. I have been following an answer given to a similar question in the past. However, I keep getting an error. Here is a link to two different files I am practicing on (1 and 2).
My code and the error are below:
PDFtoDF = function(file) {
  dat = readPDF(control = list(text = "-layout"))(elem = list(uri = file),
                                                  language = "en", id = "id1")
  dat = c(as.character(dat))
  dat = gsub("^ ?([0-9]{1,3}) ?", "\\1|", dat)
  dat = gsub("(, HVOL )", "\\1 ", dat)
  dat = gsub(" {2,100}", "|", dat)
  excludeRows = lapply(gregexpr("\\|", dat), function(x) length(x)) != 6
  write(dat[excludeRows], "rowsToCheck.txt", append = TRUE)
  dat = dat[!excludeRows]
  dat = read.table(text = dat, sep = "", quote = "", stringsAsFactors = FALSE)
  names(dat) = c("Plan", "Individual", "Family")
  return(dat)
}
files <- list.files(pattern = "pdf$")
df = do.call("rbind", lapply(files, PDFtoDF))
Error in read.table(text = dat, sep = "", quote = "", stringsAsFactors = FALSE) : no lines available in input
Before this approach, I was using the pdftools package and regular expressions. That approach worked, except it was difficult to specify a pattern for some parts of the document, such as the plan name at the top. I was hoping the method I'm trying now would help, since it extracts the text into separate strings for me.
Here's the best answer:
require(readtext)
df <- readtext("*.pdf")
Yes, it's that simple, with the readtext package!
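For context, readtext() returns a data.frame with one row per file: doc_id holds the file name and text holds the extracted text, so pulling out specific fields (e.g. prices) is still up to you. A minimal sketch (untested; the price pattern is only an illustration):
library(readtext)
df <- readtext("*.pdf")   # one row per PDF, with columns doc_id and text
# illustrative only: pull number-like strings out of each document's text
prices <- regmatches(df$text, gregexpr("[0-9]+\\.[0-9]{2}", df$text))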
I realize this is a total newbie question (as always in my case), but I'm trying to learn R, and I need to import hundreds of CSV files that have the same structure, but in some the column names are uppercase and in some they are lowercase.
So I have (for now):
flow0300csv <- Sys.glob("./csvfiles/*0300*.csv")
for (fileName in flow0300csv) {
  flow0300 <- read.csv(fileName, header = TRUE, sep = ";",
                       colClasses = "character")[, c('CODE', 'CLASS', 'NAME')]
}
but I get an error because of the lowercase names. I have tried to apply tolower but I can't make it work. Any tips?
The problem here isn't in reading the CSV files, it's in trying to index using column names that don't actually exist in your "lowercase" data frames.
You can instead use grep() with ignore.case = TRUE to index into the columns you want.
tmp <- read.csv(fileName, header = TRUE, sep = ";",
                colClasses = "character")
ind <- grep(pattern = "code|class|name", x = colnames(tmp),
            ignore.case = TRUE)
tmp[, ind]
You may want to look into readr::read_csv2() or even data.table::fread() for better performance.
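For example, a minimal sketch with readr (untested; read_csv2() assumes ";" separators and "," decimals, which matches your files):
library(readr)
# read everything as character, then index case-insensitively as above
tmp <- read_csv2(fileName, col_types = cols(.default = col_character()))
tmp[, grep("code|class|name", names(tmp), ignore.case = TRUE)]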
After reading the .csv file, you may want to convert the column names to all uppercase with:
flow0300 <- read.csv(fileName, header = TRUE, sep = ";", colClasses = "character")
colnames(flow0300) <- toupper(colnames(flow0300))
flow0300 <- flow0300[, c("CODE", "CLASS", "NAME")]
EDIT: Extended the solution with input from @xraynaud.
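Since you have hundreds of files, you will probably also want to collect them all into one data frame instead of overwriting flow0300 on each pass of the loop. A sketch (untested) based on the code above:
flow0300_list <- lapply(flow0300csv, function(fileName) {
  tmp <- read.csv(fileName, header = TRUE, sep = ";", colClasses = "character")
  colnames(tmp) <- toupper(colnames(tmp))  # normalise the case first
  tmp[, c("CODE", "CLASS", "NAME")]
})
flow0300 <- do.call(rbind, flow0300_list)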
I have several hundred .pet files with information organized by a date code (19960101 is format YYYYMMDD). I'm trying to add a column, NDate, with the date code:
for (pet.atual in files.pet) {
  data.pet.atual <-
    read.table(file = pet.atual,
               header = FALSE,
               sep = ",",
               quote = "\"",
               comment.char = ";")
  data.pet.atual <- cbind(data.pet.atual, NDate = pet.atual)
}
What I'm trying to achieve, for example, is NDate = 19960101 for 01-01-1996, NDate = 19960102 for 02-01-1996, and so on. Still, the for loop just overwrites data.pet.atual on every iteration, so I end up with only the latest pet.atual. Ideas? Thanks
A small modification should do the trick:
data.pet.atual <- NULL
for (pet.atual in files.pet) {
  tmp.data <-
    read.table(file = pet.atual,
               header = FALSE,
               sep = ",",
               quote = "\"",
               comment.char = ";")
  tmp.data <- cbind(tmp.data, NDate = pet.atual)
  data.pet.atual <- rbind(data.pet.atual, tmp.data)
}
You can also replace the tmp.data <- cbind(...) line with tmp.data$NDate <- pet.atual.
You may also try fread() and rbindlist() from the data.table package (untested due to the lack of a reproducible example):
library(data.table)
result <- rbindlist(lapply(files.pet, fread), idcol = "NDate")
result[, NDate := anytime::anydate(files.pet[NDate])]
lapply() "loops" over all entries in files.pet executing fread() for each entry and returns a list with the data.tables fread has created from reading each file. rbindlist() is used to combine all pieces into one large data.table. The parameter idcol = NDate generates an index column named NDate to identify the origin of each row in the final output. The ids are integer numbers 1 to the length of the list (if the list is not named).
Finally, the id number is used to lookup the file name in files.pet which is directly converted to class Date using the anytime package.
EDIT: Perhaps it would be more efficient to convert the file names to Date first, before looking them up:
result[, NDate := anytime::anydate(files.pet)[NDate]]
Although fread() is pretty smart in analysing and guessing the right parameters for reading the files, it might be necessary (and perhaps faster as well) to supply additional parameters, e.g.:
result <- rbindlist(lapply(files.pet, fread, header = FALSE, sep = ","), idcol = "NDate")
Yes, lapply will help, as Frank suggests, and you want to use rbind to keep the dates distinct for each file. Something along the lines of the following (I'm assuming files.pet is a list of all the files you want to include):
my.fun <- function(file) {
  data <- read.table(file = file,
                     header = FALSE,
                     sep = ",",
                     quote = "\"",
                     comment.char = ";")
  data$NDate <- file
  return(data)
}
data.pet.atual <- do.call(rbind.data.frame, lapply(files.pet, FUN = my.fun))
I can't test this without a reproducible example, so you may need to play with it a bit, but the general approach should work!
I'm looking to create multiple data frames using a for loop and then stitch them together with merge().
I'm able to create my data frames using assign(paste(), blah). But then, in the same for loop, I need to delete the first column of each of these data frames.
Here's the relevant bits of my code:
for (j in 1:3) {
  # This is to create each data frame
  # This works
  assign(paste(platform, j, "df", sep = "_"),
         read.csv(file = paste(masterfilename, extension, sep = "."),
                  header = FALSE, skip = 1, nrows = 100))
  # This is to delete the first column
  # This does not work
  assign(paste(platform, j, "df$V1", sep = "_"), NULL)
}
In the first situation I'm assigning my variables to a data frame, so they inherit that type. But in the second situation, I'm assigning NULL.
Does anyone have any suggestions on how I can work this out? Also, is there a more elegant solution than assign(), which seems to bog down my code? Thanks,
n.i.
assign can be used to build variable names, but "name$V1" isn't a variable name. The $ is an operator in R, so you're trying to build a function call, and you can't do that with assign. In fact, in this case it's best to avoid assign completely. You don't need to create a bunch of different variables. If your data.frames are related, just keep them in a list.
mydfs <- lapply(1:3, function(j) {
  df <- read.csv(file = paste(masterfilename, extension, sep = "."),
                 header = FALSE, skip = 1, nrows = 100)
  df$V1 <- NULL
  df
})
Now you can access them with mydfs[[1]], mydfs[[2]], etc. And you can run functions over all your data sets with any of the *apply family of functions.
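For example (a quick sketch):
lapply(mydfs, dim)              # dimensions of every data frame in the list
merged <- Reduce(merge, mydfs)  # or stitch them all together with merge(), as intended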
As @joran pointed out in his comment, the proper way of doing this would be using a list. But if you want to stick with assign, you can replace your second statement with:
assign(paste(platform, j, "df", sep = "_"),
       get(paste(platform, j, "df", sep = "_"))[
         2:length(get(paste(platform, j, "df", sep = "_")))])
If you wanted to use a list instead, your code to read the data frames would look like:
dfs <- replicate(3,
                 read.csv(file = paste(masterfilename, extension, sep = "."),
                          header = FALSE, skip = 1, nrows = 100),
                 simplify = FALSE)
Note that you can use replicate because your call to read.csv does not depend on j in the loop. Then you can remove the first column of each:
dfs <- lapply(dfs, function(d) d[-1])
Or, combining everything into one command:
dfs <- replicate(3,
                 read.csv(file = paste(masterfilename, extension, sep = "."),
                          header = FALSE, skip = 1, nrows = 100)[-1],
                 simplify = FALSE)