My collaborator ran R code on a Linux operating system, but I only have Windows now. I am simply trying to run this same R code, which is known to work on Linux.
I need to read a large csv file with more than 400,000 rows. My computer cannot handle a file this large, so I only read the first 10,000 rows. Then a simple left join did not work with this truncated file, even though the syntax of the join appears correct.
data_tag <- left_join(data, tags, by = "app_id")
Error: by can't contain join column app_id which is missing from LHS.
I have checked many times that app_id is in both files.
Is it possible that the file changed slightly when moving from Linux to Windows? Or that, somehow, the truncated file cannot be read correctly into R?
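One thing worth checking: a CSV written on Linux often carries a UTF-8 byte-order mark that base R on Windows folds into the first column name, so if app_id happens to be the first column it can silently become ï..app_id. A minimal diagnostic sketch, with placeholder file names:
# Print the column names R actually parsed; a BOM shows up as a mangled
# first name such as "ï..app_id"
names(data)
names(tags)

# Hypothetical re-read that strips the BOM and keeps the 10,000-row truncation
data <- read.csv("data.csv", nrows = 10000, fileEncoding = "UTF-8-BOM")
tags <- read.csv("tags.csv", fileEncoding = "UTF-8-BOM")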
Any help is highly appreciated.
Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and inside it ~10 .csv files, all with names beginning with "part".
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitioning is done for computational efficiency: if the table is partitioned, multiple workers/executors can write it in parallel, one partition each. With only one partition, the csv can be written by just a single worker/executor, making the task much slower. The same principle applies not only to writing tables but to parallel computation in general.
For more details on partitioning, you can check this link.
Suppose I want to save the table as a single file at the path path/to/table.csv. I would do this as follows:
table %>%
  sdf_repartition(partitions = 1) %>%
  spark_write_csv("path/to/table.csv", ...)
You can check full details of sdf_repartition in the official documentation.
Data will be divided into multiple partitions, and when you save the dataframe to CSV you get one file per partition. Before calling spark_write_csv you need to bring all the data into a single partition to get a single file.
You can use Spark's coalesce operation to achieve this; in sparklyr it is exposed as sdf_coalesce().
sdf_coalesce(df, 1)
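Putting it together with the write step, a minimal sketch (the output path is a placeholder); note that, as with the repartition approach above, Spark still creates a folder, it just contains a single part file:
library(sparklyr)
library(dplyr)

# Collapse the Spark dataframe to one partition, then write it out;
# the result is a directory holding exactly one part-*.csv file
df %>%
  sdf_coalesce(partitions = 1) %>%
  spark_write_csv("C:/d1")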
Just wondering: is there a difference in the read/write parquet functions from the arrow package in R when running on Windows vs Linux?
Example code (insert anything into the data frame):
library(arrow)

mydata <- data.frame(...)
write_parquet(mydata, 'mydata.parquet')
read_parquet('mydata.parquet')
I'm noticing that when this code is run on Windows, the parquet files can be read without problems on either Windows or Linux, and a data frame is returned in R. But when the write step is run on Linux and I then try to read those parquet files in R on Windows, I don't get a data frame back but rather a grouped list (each vector in the grouped list contains the data for the respective column). Initially I tried a workaround with do.call(rbind, ...) to convert the grouped list back into a data frame, but the result does not contain any of the column names.
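For reference, a sketch of a name-preserving alternative to the do.call(rbind, ...) workaround, assuming the list comes back with its names intact:
# If read_parquet() comes back as a plain named list of column vectors,
# as.data.frame() rebuilds the data frame and keeps the column names
# that do.call(rbind, ...) throws away
lst <- read_parquet('mydata.parquet')
mydata <- as.data.frame(lst)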
Please let me know if there are any ways to resolve this. Ideally I'd like to be able to write parquet files and read them back into R as data frames from either OS. For reference, I'm on R 4.0 on both OSes.
Thanks in advance.
I am trying to merge two files in R in an attempt to compute the correlation. One file is located at http://richardtwatson.com/data/SolarRadiationAthens.csv and the other at http://richardtwatson.com/data/electricityprices.csv
Currently my code looks as follows:
library(dplyr)
data1 <- read.csv("C:/Users/nldru/Downloads/SolarRadiationAthens.csv")
data2 <- read.csv("C:/Users/nldru/Downloads/electricityprices.csv")
n <- merge(data1, data2)
I have the files stored locally on my computer just for ease of access. The files are being read in properly, but for some reason when I merge them, n receives no data, just the column headers from the two csv files. I have experimented with inner_join to no avail, as well as pulling the files directly from the http addresses linked above with read_delim(), but can't seem to get it to work. Any help or tips are much appreciated.
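Worth noting: merge() with no by argument joins on every column name the two data frames share, so a single shared column whose values never line up is enough to produce zero rows. A minimal diagnostic sketch (the key name "timestamp" is an assumption, not taken from the files):
# Columns merge() will join on by default
intersect(names(data1), names(data2))

# Join on the single intended key instead (hypothetical column name)
n <- merge(data1, data2, by = "timestamp")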
I have done a lot of research on how to upload huge .txt data through R to a Teradata DB. I tried RODBC's sqlSave(), but it did not work. I also followed some other similar questions posted, such as:
Write from R to Teradata in 3.0 OR Export data frame to SQL server using RODBC package OR How to quickly export data from R to SQL Server.
However, since Teradata is structured somewhat differently from MS SQL Server, most of the suggested options are not applicable to my situation.
I know that there is a TeradataR package available, but it has not been updated in 2-3 years.
So here are the 2 main problems I am facing:
1. How to bulk load (all records at once) data in .txt format into Teradata using R, if there is any way. (So far I have only done this with SAS, but I need to explore it in R.)
2. The data is big (500+ MB), so I cannot load it all into R at once; I am sure there is a way around this, short of pulling the data directly from the server (a chunked sketch follows below).
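For point 2, one hedged possibility is to stream the file through R in manageable chunks instead of reading it whole; a minimal sketch, where the file names and the 100,000-line chunk size are placeholders:
# Stream a large delimited file through R chunk by chunk rather than
# loading it all into memory at once
con_in  <- file("bigdata.txt", open = "r")
con_out <- file("export.txt", open = "w")
while (length(chunk <- readLines(con_in, n = 100000)) > 0) {
  # ...any per-chunk cleaning would go here...
  writeLines(chunk, con_out)
}
close(con_in)
close(con_out)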
Here is what I tried, following one of those posts, but it was for MS SQL Server:
toSQL <- data.frame(...) # this doesn't work for me because the data is too big

write.table(toSQL, "C:\\export\\filename.txt",
            quote = FALSE, sep = ",",
            row.names = FALSE, col.names = FALSE, append = FALSE)

sqlQuery(channel, "BULK INSERT Yada.dbo.yada
                   FROM '\\\\<server-that-SQL-server-can-see>\\export\\filename.txt'
                   WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')")
*Note: there is an option in Teradata to insert/import data, but that is the same as writing millions of rows of INSERT statements.
Sorry that I do not have sample code at this point, since the package I found wasn't the right one to use.
Has anyone had similar issues/problems like this?
Thank you so much for your help in advance!
I am not sure if you figured out how to do this, but I second Andrew's solution. If you have Teradata installed on your computer, you can easily run the FastLoad utility from the shell.
So I would:
export my data frame to a txt file (comma separated),
create my FastLoad script and reference the exported txt file from within the FastLoad script (you can learn more about it here), and
run the shell command referencing my FastLoad script.
setwd("pathforyourfile")
write.table(mtcars, "mtcars.txt", sep = ",", row.names = FALSE,
            quote = FALSE, na = "NA", col.names = FALSE)
shell("fastload < mtcars_fastload.txt")
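Since sample code was requested above: a rough sketch of what mtcars_fastload.txt might contain, written out from R. The logon string, database, error tables, and VARCHAR widths are all placeholders, not a tested script; check the FastLoad documentation for the exact syntax your version expects.
# Generate a hypothetical FastLoad script for the mtcars export above;
# tdpid/username/password, the database, and the target table are placeholders
writeLines(c(
  "LOGON tdpid/username,password;",
  "DATABASE mydb;",
  "BEGIN LOADING mydb.mtcars ERRORFILES mydb.mtcars_err1, mydb.mtcars_err2;",
  "SET RECORD VARTEXT \",\";",
  "DEFINE mpg (VARCHAR(20)), cyl (VARCHAR(20)), disp (VARCHAR(20)),",
  "       hp (VARCHAR(20)), drat (VARCHAR(20)), wt (VARCHAR(20)),",
  "       qsec (VARCHAR(20)), vs (VARCHAR(20)), am (VARCHAR(20)),",
  "       gear (VARCHAR(20)), carb (VARCHAR(20))",
  "       FILE = mtcars.txt;",
  "INSERT INTO mydb.mtcars VALUES (:mpg, :cyl, :disp, :hp, :drat, :wt,",
  "                                :qsec, :vs, :am, :gear, :carb);",
  "END LOADING;",
  "LOGOFF;"
), "mtcars_fastload.txt")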
I hope this sorts your issue. Let me know if you need help especially on the fastloading script. More than happy to help.
I have a Unix folder containing about 700 SAS datasets.
When I assign a library to this folder, I am able to view only 30 of the 700 datasets. I checked the Unix permissions on the datasets and see no difference between the ones that are visible and the ones that are not.
1) This is a pre-assigned library in SAS Management Console.
2) The datasets are still accessible if I name them in DATA and PROC steps; they are just not visible in the library.
The issue with the partial listing of SAS datasets in a SAS library was finally resolved with the help of SAS Support.
This is the response I got from them:
“The problem is related to Swedish characters in SAS filenames, we have renamed the 3 files with Swedish characters and then all tables is showing
We still don’t know why this is happening and will register a support track regarding this.
As a workaround all personnel should not save filenames with Swedish Characters while we investigate why this is happening”.
For example, out of 700 datasets, 3 had filenames containing the special character ä. This was somehow affecting the listing in SAS EG and PROC CONTENTS. When the character was replaced, all 700 files automatically started appearing.
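If you want to locate such files yourself, a small sketch in R (the library path is a placeholder); it flags any dataset filename containing a character outside printable ASCII:
# Flag dataset filenames containing characters outside printable ASCII
# (the folder path is a placeholder)
files <- list.files("/path/to/sas/library", pattern = "\\.sas7bdat$")
files[grepl("[^ -~]", files)]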
Please check that the library references for the local and Unix libraries match exactly; only then will the datasets be visible in the Windows environment.