How to open a PostgreSQL database dump in R

I have many files (with no extension) that I would like to open with R and extract data.frames from.
The header says it is a
--
-- PostgreSQL database dump
--
It has many tables inside it, and the only noticeable pattern I detected is that at the end of each table there is a "\." marker (see screenshot).
Is there a smart way to import this file and extract/break it into meaningful data frames?
Thank you in advance!
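Since a plain-text pg_dump file is just SQL statements plus COPY data blocks, one approach (a sketch, not a definitive answer: it assumes psql and createdb are on the PATH, a local PostgreSQL server is running, and "scratchdb"/"dumpfile" are hypothetical names) is to restore the dump into a scratch database and read every table back with DBI/RPostgres:

```r
library(DBI)
library(RPostgres)

# Restore the dump into a scratch database first (assumes a local
# PostgreSQL server; "scratchdb" and "dumpfile" are placeholder names).
system("createdb scratchdb")
system("psql -d scratchdb -f dumpfile")

# Connect and pull every restored table into a named list of data frames.
con <- dbConnect(RPostgres::Postgres(), dbname = "scratchdb")
tables <- dbListTables(con)
dfs <- lapply(tables, function(t) dbReadTable(con, t))
names(dfs) <- tables
dbDisconnect(con)
```

This sidesteps parsing the dump format by hand: PostgreSQL itself does the parsing, and R only sees ordinary tables.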

Related

In R and Sparklyr, writing a table to .CSV (spark_write_csv) yields many files, not one single file. Why? And can I change that?

Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and inside it ~10 .csv files all beginning with "part". Here's a screenshot:
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems the asker there is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitions are done for computational efficiency. If you have partitions, multiple workers/executors can write the table on each partition. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only for writing tables but also for parallel computations.
For more details on partitioning, you can check this link.
Suppose I want to save a table as a single file at the path path/to/table.csv. I would do it as follows:
table %>%
  sdf_repartition(partitions = 1) %>%
  spark_write_csv("path/to/table.csv", ...)
You can check the full details of sdf_repartition in the official documentation.
Data will be divided into multiple partitions, and when you save the dataframe to CSV you get one file per partition. Before calling spark_write_csv you need to bring all the data into a single partition to get a single file.
You can use a method called sdf_coalesce to achieve this:
df %>% sdf_coalesce(1) %>% spark_write_csv("path/to/table.csv")
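If the "part-*" files are already on a local disk, you can also stitch them back together in plain R after the fact (a sketch; the folder name C:/d1 comes from the question, and it assumes spark_write_csv was left at its default header = TRUE, so each part file carries its own header row):

```r
# Combine Spark's per-partition CSV output into one data frame, then
# write it out as the single file the question asked for.
parts <- list.files("C:/d1", pattern = "^part-.*\\.csv$", full.names = TRUE)
d1 <- do.call(rbind, lapply(parts, read.csv))
write.csv(d1, "C:/d1.csv", row.names = FALSE)
```

The .crc files can simply be ignored; they are checksums Spark writes alongside each part.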

Export R data frame to MS Access

I am trying to export a data frame from R to MS Access but it seems to me that there is no package available to do this task. Is there a way to export a data frame directly to Access? Any help will be greatly appreciated.
The following works for medium-sized datasets, but may fail if MyRdataFrame is too large for Access's 2 GB limit or runs into type-conversion errors.
library(RODBC)
db <- "C:/Documents/PreviouslySavedBlank.accdb"
Mycon <- odbcConnectAccess2007(db)
sqlSave(Mycon, MyRdataFrame)
odbcClose(Mycon)
There is the ImportExport package.
The database has to already exist (at least in my case), so you have to create it first.
It has to be an Access 2000-format database with the extension .mdb.
Here is an example:
ImportExport::access_export("existing_database.mdb", as.data.frame(your_R_data),
                            tablename = "bob")
with "bob" the name of the table you want to create in the database. Choose your own name, of course; it must not be an already existing table.
It will also add a first column called rownames, which is just an index column.
Note that creating a .accdb file and then changing the extension to .mdb won't work; you really have to open it and save it as .mdb. I added as.data.frame(), but if your data is already a data frame there is no need.
There might be a way for .accdb files using sqlSave directly (which is used internally by ImportExport) and specifying the driver from the RODBC package. This is in the link in the comment from @BenJacobson. But the solution above worked for me and it was only one line.
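For .accdb files, that RODBC-driver route might look like this (a sketch, assuming the Microsoft Access ODBC driver is installed on Windows; the file path and table name are hypothetical):

```r
library(RODBC)

# Connect to an existing .accdb file via the Access ODBC driver string.
con <- odbcDriverConnect(
  "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:/Documents/MyDatabase.accdb")

# Write the data frame as a new table; rownames = FALSE skips the index column.
sqlSave(con, MyRdataFrame, tablename = "bob", rownames = FALSE)
odbcClose(con)
```

This only works when the matching (32- or 64-bit) Access ODBC driver is installed for your R build.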

Bulkload option exporting data from R to Teradata using RODBC

I have done much research on how to upload huge .txt data through R to a Teradata DB. I tried RODBC's sqlSave(), but it did not work. I also followed some similar questions posted here, such as:
Write from R to Teradata in 3.0 OR Export data frame to SQL server using RODBC package OR How to quickly export data from R to SQL Server.
However, since Teradata is structured differently from MS SQL Server, most of the suggested options do not apply to my situation.
I know there is a teradataR package, but it has not been updated in 2-3 years.
So here are the two main problems I am facing:
1. How to bulk load (all records at once) .txt data into Teradata using R, if there is any way. (So far I have only done this with SAS, but I need to explore it in R.)
2. The data is large (500+ MB), so I cannot load it all through R; I am sure there is a way around this by pulling the data directly from the server.
Here is what I tried, following one of those posts, but it was for MS SQL Server:
toSQL = data.frame(...) # this doesn't work for me because it's too big
write.table(toSQL, "C:\\export\\filename.txt", quote = FALSE, sep = ",",
            row.names = FALSE, col.names = FALSE, append = FALSE)
sqlQuery(channel, "BULK
    INSERT Yada.dbo.yada
    FROM '\\\\<server-that-SQL-server-can-see>\\export\\filename.txt'
    WITH
    (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\\n'
    )")
*Note: there is an insert/import option in Teradata, but that is the same as writing millions of INSERT statements.
Sorry that I do not have sample code at this point, since the package I found wasn't the right one to use.
Anyone has similar issues/problems like this?
Thank you so much for your help in advance!
I am not sure if you have figured this out, but I second Andrew's solution. If you have the Teradata utilities installed on your computer, you can easily run the FastLoad utility from the shell.
So I would:
1. export my data frame to a .txt file (comma separated),
2. create my FastLoad script and reference the exported .txt file from within it (you can learn more about it here),
3. run the shell command referencing my FastLoad script.
setwd("pathforyourfile")
write.table(mtcars, "mtcars.txt", sep = ",", row.names = FALSE,
            quote = FALSE, na = "NA", col.names = FALSE)
shell("fastload < mtcars_fastload.txt")
I hope this sorts your issue. Let me know if you need help especially on the fastloading script. More than happy to help.
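For reference, a minimal FastLoad script for the mtcars example above might look roughly like this (a sketch from memory, not tested; the server id, credentials, database name, and error-table names are all placeholders, and the DEFINE/INSERT lists are truncated to the first two columns):

```
LOGON tdpid/username,password;
DATABASE mydb;
SET RECORD VARTEXT ",";
DEFINE mpg (VARCHAR(20)),
       cyl (VARCHAR(20))
FILE = mtcars.txt;
BEGIN LOADING mydb.mtcars
    ERRORFILES mydb.mtcars_err1, mydb.mtcars_err2;
INSERT INTO mydb.mtcars VALUES (:mpg, :cyl);
END LOADING;
LOGOFF;
```

FastLoad requires the target table to be empty, so it suits an initial bulk load rather than incremental appends.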

Read a PostgreSQL local file into R in chunks and export to .csv

I have a local file in PostgreSQL dump format that I would like to read into R in chunks and export as .csv.
I know this might be a simple question but I'm not at all familiar with PostgreSQL or SQL. I've tried different things using R libraries like RPostgreSQL, RSQLite and sqldf but I couldn't get my head around this.
If your final goal is to create a csv file, you can do it directly using PostgreSQL.
You can run something similar to this:
COPY my_table TO 'C:\my_table.csv' DELIMITER ',' CSV HEADER;
Sorry if I misunderstood your requirement.
The requirement is to programmatically create a very large .csv file from scratch and populate it from data in a database? I would use this approach.
Step 1 - isolate the database data into a single table with an auto incrementing primary key field. Whether you always use the same table or create and drop one each time depends on the possibility of concurrent use of the program.
Step 2 - create the .csv file with your programming code. It can either be empty, or have column headers, depending on whether or not you need column headers.
Step 3 - get the minimum and maximum primary key values from your table.
Step 4 - set up a loop in your programming code using the values from Step 3. Inside the loop:
query the table to get x rows
append those rows to your file
increment the variables that control your loop
Step 5 - Do whatever you have to do with the file. Don't try to read it with your programming code.
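The loop in Steps 2-4 might be sketched in R like this (an illustration only, assuming the DBI and RPostgres packages, a hypothetical table my_table with an integer auto-incrementing key id, and a chunk size of 10,000 rows):

```r
library(DBI)
library(RPostgres)

con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")

# Step 3: get the minimum and maximum key values.
rng <- dbGetQuery(con, "SELECT min(id) AS lo, max(id) AS hi FROM my_table")

# Step 2: create the .csv file with column headers only (zero-row query).
hdr <- dbGetQuery(con, "SELECT * FROM my_table WHERE FALSE")
write.table(hdr, "my_table.csv", sep = ",", row.names = FALSE)

# Step 4: query x rows at a time and append them to the file.
chunk <- 10000L
for (lo in seq(rng$lo, rng$hi, by = chunk)) {
  rows <- dbGetQuery(con, sprintf(
    "SELECT * FROM my_table WHERE id >= %d AND id < %d", lo, lo + chunk))
  write.table(rows, "my_table.csv", sep = ",", row.names = FALSE,
              col.names = FALSE, append = TRUE)
}
dbDisconnect(con)
```

Keying the loop on id rather than OFFSET keeps each query cheap even late in the table.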

Speed up read.dbf in R (problems with importing large dbf file)

I have a dataset given in .dbf format and need to import it into R.
I haven't worked with this extension before, so I have no idea how to export a dbf file with multiple tables into a different format.
A simple read.dbf has been running for hours with still no result.
I looked into speeding up R performance, but I'm not sure that's the issue; I think the problem is reading the large dbf file itself (~1.5 GB), i.e. the command is simply not efficient for this. However, I don't know any other way to deal with this dataset format.
Is there any other option to import the dbf file?
P.S. (NOT AN R ISSUE) The source of the dbf file uses Visual FoxPro, but can't export it to another format. I've installed FoxPro, but given that I've never used it before, I don't know how to export correctly. I tried a simple "Export to type=XLS" command, but that hits an encoding problem: most variables are in Russian Cyrillic and can't be decoded by Excel. In addition, the dbf file contains multiple tables that should be merged into one big table, but I don't know how to export those tables separately to xls, nor how to export multiple tables as a whole into xls or csv, nor how to merge them together, as I'm completely new to dbf files (though I've already looked through the basic descriptions).
Any help will be highly appreciated. I'm not sure whether I can provide a sample dataset, as there are many columns when I view the dbf in FoxPro, plus those columns must be merged with other tables from the same dbf file, and I have no idea how to do that. (Sorry for the mess.)
You can export from Visual FoxPro in many formats using the COPY TO command via the Command Window, as described in the VFP help file.
For example:
use mydbf in 0
select mydbf
copy to myfile.xls type xl5
copy to myfile.csv type delimited
If you're having language-related issues, you can add an 'as codepage' clause to the end of those. For example:
copy to myfile.csv type delimited as codepage 1251
If you are not familiar with VFP I would try to get the raw data out like that, and into a platform that you are familiar with, before attempting merges etc.
To export them in a loop you could use the following in a .PRG file (amending the two path variables at the top to reflect your own setup).
Close All
Clear All
Clear
lcDBFDir = "c:\temp\"         && -- Where the DBF files are.
lcOutDir = "c:\temp\export\"  && -- Where you want your exported files to go.
lcDBFDir = Addbs(lcDBFDir)    && -- In case you forgot the backslash.
lcOutDir = Addbs(lcOutDir)

* -- Get the filenames into an array.
lnFiles = ADir(laFiles, lcDBFDir + "*.DBF")

* -- Process them.
For x = 1 to lnFiles
    lcThisDBF = lcDBFDir + laFiles[x, 1]
    Use (lcThisDBF) In 0 Alias currentfile
    Select currentfile
    Copy To (lcOutDir + Juststem(lcThisDBF) + ".csv") Type Csv
    Use In Select("currentfile") && -- Close it.
EndFor
Close All
... and run it from the Command Window - Do myprg.prg or whatever.
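Back on the R side, once the CSVs are exported, reading them is much faster than read.dbf on a 1.5 GB file. A sketch (the folder path matches the VFP script above; "windows-1251" is an assumption, chosen to match the 'as codepage 1251' clause for Russian Cyrillic):

```r
# Read every exported CSV, decoding the Cyrillic text, then stack them.
files <- list.files("c:/temp/export", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv, fileEncoding = "windows-1251")

# rbind only works if all tables share the same columns; use merge() on a
# common key instead if the tables have different structures.
big <- do.call(rbind, tables)
```

If the tables need joining on a key rather than stacking, Reduce(function(x, y) merge(x, y, by = "key"), tables) is the usual base-R pattern.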
