I have 70 CSV files with the same columns in a folder; each of them is about 0.5 GB. I want to import them all into a single data frame in R.
Normally I import each of them correctly as below:
df <- read_delim("file.csv", "|", escape_double = FALSE,
                 col_types = cols(pc_no = col_character(),
                                  id_key = col_character()),
                 trim_ws = TRUE)
To import all of them, I wrote the code below, but it fails with this error:
argument "delim" is missing, with no default
tbl <-
  list.files(pattern = "*.csv") %>%
  map_df(~read_delim("|", escape_double = FALSE,
                     col_types = cols(pc_no = col_character(),
                                      id_key = col_character()),
                     trim_ws = TRUE))
With read_csv it imports, but everything lands in a single column that contains all the column names and values:
tbl <-
  list.files(pattern = "*.csv") %>%
  map_df(~read_csv(., col_types = cols(.default = "c")))
In your second block of code you're missing the ., so read_delim interprets your arguments as read_delim(file = "|", delim = <nothing provided>, ...). Try:
tbl <- list.files(pattern = "*.csv") %>%
map_df(~ read_delim(., delim = "|", escape_double = FALSE,
col_types = cols(pc_no = col_character(), id_key = col_character()),
trim_ws = TRUE))
I explicitly named delim = here, though it's not strictly necessary. Had you done that in your first attempt, however, you would have seen
readr::read_delim(delim = "|", escape_double = FALSE,
col_types = cols(pc_no = col_character(), id_key = col_character()),
trim_ws = TRUE)
# Error in read_delimited(file, tokenizer, col_names = col_names, col_types = col_types, :
# argument "file" is missing, with no default
which is more indicative of the actual problem.
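As a side note, if you have readr >= 2.0, read_delim() can take a vector of file paths directly, so purrr isn't needed at all; the optional id argument records which file each row came from. A minimal, self-contained sketch (two small temp files stand in for the real 70 CSVs):

```r
library(readr)

# Two small demo files standing in for the real CSVs:
dir <- tempdir()
writeLines("pc_no|id_key|amount\n001|A|10", file.path(dir, "a.csv"))
writeLines("pc_no|id_key|amount\n002|B|20", file.path(dir, "b.csv"))

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# One call reads and row-binds every file; `id` adds the source path.
tbl <- read_delim(files, delim = "|", escape_double = FALSE,
                  col_types = cols(pc_no = col_character(),
                                   id_key = col_character()),
                  trim_ws = TRUE, id = "source_file")
nrow(tbl)  # one row per demo file, i.e. 2
```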
I am working with the SALES dataset and a trimmed copy of it (TEST). The problem I have is that the SALES file doubles in size when it is saved.
This happens only with the SALES file; when the same procedure is performed on the TEST file, the result is the same size as the original file.
I tried converting the file to a base R data frame, but the result is still the same.
Likewise, if I open the SALES_2 file and save it, its size doubles again.
This is the current code:
library(jsonlite)
library(lubridate)
library(tidyverse)
library(readr)
library(stringi)
library(stringr)
library(readxl)
options(scipen = 999)
SALES <- read_delim("C:/Users/edjca/OneDrive/FORMA/PRUEBA/SALES.csv",
                    delim = "|", escape_double = FALSE,
                    locale = locale(decimal_mark = ",", grouping_mark = "."),
                    trim_ws = TRUE)
TEST <- read_delim("C:/Users/edjca/OneDrive/FORMA/PRUEBA/Test.csv",
delim = "|", escape_double = FALSE,
locale = locale(decimal_mark = ",", grouping_mark = "."), trim_ws = TRUE)
data.table::fwrite(SALES, "C:/Users/edjca/OneDrive/FORMA/PRUEBA/SALES_2.csv", sep = "|", dec = ",")
data.table::fwrite(TEST, "C:/Users/edjca/OneDrive/FORMA/PRUEBA/Test_2.csv", sep = "|", dec = ",")
I attach a picture of the resulting files in my folder and of the object sizes in R.
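A quick diagnostic sketch (using the paths from the question) that can narrow down where the extra bytes come from; differences in quoting, decimal formatting, or line endings between read_delim() input and fwrite() output usually show up in the first few raw lines:

```r
# Hypothetical check on the question's own files:
orig <- "C:/Users/edjca/OneDrive/FORMA/PRUEBA/SALES.csv"
copy <- "C:/Users/edjca/OneDrive/FORMA/PRUEBA/SALES_2.csv"

# Compare the byte sizes directly:
file.size(orig)
file.size(copy)

# Compare the first few raw lines of each file as written on disk:
readLines(orig, n = 3)
readLines(copy, n = 3)
```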
I have several files in the format "XXX.qassoc" saved in a folder. I am trying to write a for loop that converts all of these files to .txt at once, like this:
A <- read.table(file = "XXX.qassoc", quote = "\"", comment.char = "",header=TRUE)
write.table(A, file = "XXX.txt", sep = "\t", row.names = FALSE)
Does anyone know what I can do? Thank you!
Set the working directory with setwd('path') and then:
library(rebus)
library(tidyverse)
library(stringr)
list.files() %>%
  str_subset("\\.qassoc$") %>%
  walk(~ {
    A <- read.table(file = .x, quote = "\"", comment.char = "", header = TRUE)
    write.table(A, file = str_replace(.x, "\\.qassoc$", ".txt"),
                sep = "\t", row.names = FALSE)
  })
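A variant of the same idea that avoids setwd() by asking list.files() for full paths ("path/to/folder" is a placeholder for your actual folder):

```r
library(purrr)

# Full paths mean the loop works regardless of the working directory:
files <- list.files(path = "path/to/folder", pattern = "\\.qassoc$",
                    full.names = TRUE)

walk(files, function(f) {
  A <- read.table(f, quote = "\"", comment.char = "", header = TRUE)
  write.table(A, sub("\\.qassoc$", ".txt", f), sep = "\t", row.names = FALSE)
})
```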
b <- data.frame(var1 = c(9.2, 3.5, 5.5), var2 = 1:3,
                row.names = c("a", "b", "c"))
write_tsv(b, path = result_path, na = "NA", append = T, col_names = T,
          quote_escape = "double")
b is exported as a TSV, but the row names are missing, and row.names = TRUE is not an argument of write_tsv.
What can I do to keep the row names?
Row names are never kept by any of the readr write_delim() functions. You can either add the row names to the data as a column or use write.table().
Add row names:
library(tibble)
write_tsv(b %>% rownames_to_column(), path = result_path, na = "NA",
          append = TRUE, col_names = TRUE, quote_escape = "double")
Or:
write.table(b, result_path, na = "NA", append = TRUE, col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
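For reference, this is what rownames_to_column() does to the example data frame; by default the row names become a first column named "rowname":

```r
library(tibble)

b <- data.frame(var1 = c(9.2, 3.5, 5.5), var2 = 1:3,
                row.names = c("a", "b", "c"))

# Row names move into an ordinary column, so write_tsv() keeps them:
rownames_to_column(b)
#   rowname var1 var2
# 1       a  9.2    1
# 2       b  3.5    2
# 3       c  5.5    3
```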
I am currently trying to download public Treasury data, but my scrape only pulls the date column, the 20-year column, and the extrapolation factor. The 10-year column, situated in the middle of the table, is not included in the scrape and paste into Excel. My code is below (directory setup not included).
url <- "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=longtermrateYear&year=2020"
ten_year_comp <- read_html(url, encoding = "table")
ten_year_comp %>%
  html_nodes("table") %>%
  .[[4]] %>%
  html_table(fill = TRUE) %>%
  write.xlsx(ten_year_comp, file = "TREASURY10YR.xlsx", sheetName = "ten_year_comp",
             col.names = TRUE, row.names = TRUE, asTable = TRUE, append = FALSE)
Assign the parsed table to a variable first and then write it. In your version you both pipe the table into write.xlsx() and pass ten_year_comp again as its first argument, so write.xlsx() receives the wrong objects in the wrong positions:

library(rvest)
library(openxlsx)  # provides write.xlsx() with the asTable argument

url <- "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=longtermrateYear&year=2020"
ten_year_comp <- read_html(url) %>%
  html_nodes("table") %>%
  .[[4]] %>%
  html_table(fill = TRUE)
write.xlsx(
ten_year_comp,
file = "TREASURY10YR.xlsx",
sheetName = "ten_year_comp",
col.names = TRUE,
row.names = TRUE,
asTable = TRUE,
append = FALSE
)
I have a script that I am using to read multiple PDF files. Here is my code:
corpus_raw <- data.frame("company" = c(),"text" = c(), check.names = FALSE)
for (i in 1:length(pdf_list)) {
  print(i)
  document_text <- pdf_text(paste("V:/CodingProject2_FundOverview/", pdf_list[i], sep = "")) %>%
    strsplit("\r\n")
  document <- data.frame("company" = gsub(x = pdf_list[i], pattern = ".pdf", replacement = ""),
                         "text" = document_text,
                         stringsAsFactors = FALSE, check.names = FALSE)
  colnames(document) <- c("company", "text")
  corpus_raw <- rbind(corpus_raw, document)
}
I get the following error message:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 79, 56
I even tried keeping check.names = FALSE, but it seems like I am doing something wrong. Any help will be appreciated. Thanks.
I knew I was doing something silly. Anyway, I was able to figure out the answer on my own: wrapping the text in I() keeps it as a single list-column instead of letting data.frame() spread the list into columns of unequal length.
for (i in 1:length(pdf_list)) {
  print(i)
  document_text <- pdf_text(paste("V:/CodingProject2_FundOverview/", pdf_list[i], sep = "")) %>%
    strsplit("\r\n")
  document <- data.frame("company" = gsub(x = pdf_list[i], pattern = ".pdf", replacement = ""),
                         "text" = I(document_text),
                         stringsAsFactors = FALSE, check.names = FALSE)
  colnames(document) <- c("company", "text")
  corpus_raw <- rbind(corpus_raw, document)
}
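A minimal illustration of why I() fixes the error: data.frame() normally turns a list into one column per element, and pages with different numbers of lines (79 vs 56 in the original error) cannot be recycled into one frame. I() keeps the whole list as a single list-column, one row per page:

```r
# Two "pages" of unequal length, as strsplit() on pdf_text() would return:
pages <- list(c("line 1", "line 2", "line 3"),  # page with 3 lines
              c("line 1", "line 2"))            # page with 2 lines

# Without I(), data.frame(text = pages) errors with
# "arguments imply differing number of rows: 3, 2".
df <- data.frame(company = "fund_a", text = I(pages))
nrow(df)  # 2: one row per page, with text stored as a list-column
```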