I have thousands of huge CSV files that I need to upload into Postgres. I read that COPY FROM is the fastest way to load CSV files. However, I need to do a bit of pre-processing of the data: as a bare minimum, I need to add the filename (or some sort of file ID) so that I can tie each row back to its source.
Right now, I read each CSV file into an R data frame, add a column containing the filename, and then write the data frame to Postgres using
dbWriteTable(con, name = 'my_table', value = my_dataframe, row.names = FALSE, append = TRUE, overwrite = FALSE)
I want to know if there is a better/faster way of importing these CSV files.
Thanks.
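One commonly suggested alternative, sketched below, is to let the database driver stream the rows for you. This is a minimal sketch assuming the RPostgres driver (which loads data with COPY under the hood) and data.table; the folder path and connection details are placeholders:

library(DBI)
library(data.table)

# placeholder connection details
con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")

# placeholder folder containing the CSV files
files <- list.files("csv_dir", pattern = "\\.csv$", full.names = TRUE)

for (f in files) {
  dt <- fread(f)                       # fast CSV reader
  dt[, source_file := basename(f)]     # tag each row with its source file
  # RPostgres streams the rows with COPY, which is much faster than row-by-row INSERTs
  dbWriteTable(con, "my_table", dt, append = TRUE)
}

dbDisconnect(con)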
I am trying to output multiple small data frames to an Excel file. The data frames contain residuals and predicted values from mgcv models run in a loop. Each is a separate small data set that I am trying to write to a separate worksheet in the same Excel spreadsheet file.
From what I can tell, the line of code causing the error is
write.xlsx(resid_pred, parfilename, sheetName = parsheetname, append = TRUE)
where resid_pred is the residuals/predicted data frame, parfilename is the file name and path, and parsheetname is the sheet name.
The error message is
Error in saveWorkbook(wb, file = file, overwrite = overwrite) : File already exists!
Which makes no sense since the file would HAVE to exist if I am appending to it. Does anyone have a clue?
Amazingly, the following code works:
write.xlsx2(resid_pred, file = parfilename, sheetName = parsheetname, col.names = TRUE, row.names = FALSE, append = TRUE, overwrite = FALSE)
The only difference is that it uses write.xlsx2 instead of write.xlsx.
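If appending keeps causing trouble, another option is to avoid append entirely by collecting the data frames in a named list and writing them all in one pass, each to its own sheet. This is a sketch assuming the openxlsx package (not xlsx); results_list is a hypothetical list you would build inside the model loop:

library(openxlsx)

# hypothetical named list built in the loop, e.g. results_list[[parsheetname]] <- resid_pred
results_list <- list(model_1 = resid_pred_1, model_2 = resid_pred_2)

# each list element becomes its own worksheet, so no append step is needed
write.xlsx(results_list, file = parfilename, rowNames = FALSE)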
I'm working with a huge file to do some data analysis in R. This file is a .csv. I can import it just fine. However, after transposing all the rows and columns using data.frame(t(data)), I export it and cannot re-import this data.
This is the code I am using:
write.csv(transposed_data, file = "transposed_data.csv", row.names = FALSE, quote = FALSE)
When I transpose the rows and columns, does something happen to the data that is causing these issues? When using read.csv, the transposed data simply will not open.
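For what it's worth, one possible culprit (an assumption, since the data itself isn't shown): t() coerces the data frame to a character matrix, and with quote = FALSE any comma inside a value corrupts the CSV. A minimal sketch of a safer round trip:

# t() returns a matrix, so wrap it back into a data frame explicitly
transposed_data <- as.data.frame(t(data), stringsAsFactors = FALSE)

# keep the default quoting so embedded commas survive the round trip
write.csv(transposed_data, file = "transposed_data.csv", row.names = FALSE)
reimported <- read.csv("transposed_data.csv", check.names = FALSE)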
I have merged two tables together and I want to write them to a .txt file. I have managed to do this, however when I open the .txt file in Excel, stray trailing characters (they look like non-breaking spaces) have been added to some values. How do I stop this from happening? I have used the following code:
ICU_PPS <- merge(ICU, PPS, by=c("Study","Lane","Isolate_ID","Sample_Number","MALDI_ID", "WGS","Source"),all=TRUE)
write.table(ICU_PPS,"ICUPPS2.txt", sep="\t", row.names = FALSE)
An example of some values in a column that I get:
100_1#175
100_1#176
100_1#177
100_1#179
100_1#18 
100_1#19 
100_1#20 
What I want to achieve:
100_1#175
100_1#176
100_1#177
100_1#179
100_1#18
100_1#19
100_1#20
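For reference, a minimal sketch of one way to strip the stray trailing characters before writing; the affected column name, Isolate_ID, is an assumption:

# remove trailing ordinary and non-breaking spaces (U+00A0) from the affected column
ICU_PPS$Isolate_ID <- gsub("[\u00A0 ]+$", "", ICU_PPS$Isolate_ID)

# writing with an explicit encoding also keeps Excel from misreading the bytes
write.table(ICU_PPS, "ICUPPS2.txt", sep = "\t", row.names = FALSE,
            fileEncoding = "UTF-8")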
Introduction
I have written the following R code by referring to Link-1. Here, the sparklyr package is used to read a large JSON file into R. However, when writing the CSV file, it throws the error shown below.
R code
sc <- spark_connect(master = "local", config = conf, version = '2.2.0')

# read the JSON file lazily (memory = FALSE) into a Spark table
sample_tbl <- spark_read_json(sc, name = "example", path = "example.json",
                              memory = FALSE, overwrite = TRUE)

sdf_schema_viewer(sample_tbl)                        # view the (nested) schema

sample_tbl %>% spark_write_csv(path = "data.csv")    # write CSV file -- this line fails
The last line produces the following error. The dataset contains different data types and nested columns; I can show the full schema if required.
Error
Error: java.lang.UnsupportedOperationException: CSV data source does not support struct,media:array,display_url:string,expanded_url:string,id:bigint,id_str:string,indices:array,media......
Question
How can I resolve this error? Is it due to the mixed data types or to the columns nested two to three levels deep? Any help would be appreciated.
It seems that your data frame has array/struct columns, which are NOT supported by the CSV format; a CSV file cannot represent arrays or other nested structures in this scenario.
Therefore, if you want the data as human-readable text, write it out as an Excel file instead.
Note that Excel's CSV dialect (a very special case) can hold such values by putting "\n" inside quoted fields, but you then have to use "\r\n" (the Windows EOL) as the row terminator.
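A minimal sketch of the two usual workarounds; the selected column names are assumptions based on the error message. Parquet keeps the nested structure, while CSV needs the nested columns dropped or flattened first:

# Option 1: write a format that supports nested data
sample_tbl %>% spark_write_parquet(path = "data_parquet")

# Option 2: keep only atomic (non-nested) columns before writing CSV
sample_tbl %>%
  dplyr::select(id, id_str, display_url) %>%
  spark_write_csv(path = "data_csv")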
I am currently trying to combine multiple data files using map_df. I have downloaded my dataset [https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data] and placed it within my working directory. It is a folder containing many separate smaller files. I am hoping to import the dataset quickly using map_df instead of having to name every single file in code. However, when I try to pull the data from that folder:
namedata.df <- read.csv.folder(Namedata, x = TRUE, y = TRUE, header = TRUE, dec = ".", sep = ";", pattern = "csv", addSpec = NULL, back = TRUE)
I get a return of: Error in substr(folder, start = nchar(folder), stop = nchar(folder)) :
object 'Namedata' not found
Why might it be missing the folder? Is there a better way to pull in a folder of data?
Try ProjectTemplate. When you run the load.project() command, it loads all CSV and XLS files in the project's data directory as data frames. The data frame names are the same as the file names.
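Alternatively, since the question already mentions map_df: the immediate error just means no object called Namedata exists in your session (the folder path was not passed as a quoted string). A minimal sketch of the purrr route; the folder name and column names are assumptions about the SSA national-level download (one yobYYYY.txt per year, comma-separated, no header):

library(purrr)
library(readr)

# hypothetical folder holding the unzipped yobYYYY.txt files
files <- list.files("names", pattern = "\\.txt$", full.names = TRUE)
names(files) <- basename(files)

# read every file and row-bind them, keeping the file name in a "source" column
namedata.df <- map_df(files, read_csv,
                      col_names = c("name", "sex", "count"),
                      .id = "source")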