Issue with writing Parquet Files via Arrow Package in R

Just wondering: is there a difference in the read/write Parquet functions from the arrow package in R when running on Windows vs. Linux?
Example code (insert anything into the data frame):
library(arrow)
mydata <- data.frame(...)
write_parquet(mydata, 'mydata.parquet')
read_parquet('mydata.parquet')
I'm noticing that when this code is run on Windows, the resulting Parquet files can be read without problems on either Windows or Linux and return a data frame in R. But when the write step is run on Linux and I then try to read those Parquet files in R on Windows, I don't get a data frame but rather a list (each vector in the list contains the data for the respective column). Initially I tried a workaround with do.call(rbind, ...) to convert the list back into a data frame, but the result does not contain any of the column names.
Please let me know if there are any ways to resolve this. Ideally I'd like to be able to write Parquet files and read them back into R as data frames from either OS. For reference, I'm on R 4.0 on both OSes.
Thanks in advance.
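A hedged workaround, not confirmed by the thread: since the returned list apparently keeps each column as a named vector, coercing it with as.data.frame() should rebuild the data frame with its column names intact, unlike do.call(rbind, ...):
library(arrow)
# Sketch: coerce whatever read_parquet() returns back to a plain
# data.frame; a named list of equal-length column vectors keeps its
# names through as.data.frame(), unlike do.call(rbind, ...)
result <- read_parquet('mydata.parquet')
mydata <- as.data.frame(result)
str(mydata)  # confirm columns and names survived
Mismatched arrow package versions between the two machines are also worth ruling out; aligning them with install.packages("arrow") on both OSes may make the workaround unnecessary.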

Related

Merging two CSV files with different column lengths in R

I am trying to merge two files in R in an attempt to compute the correlation. One file is located at http://richardtwatson.com/data/SolarRadiationAthens.csv and the other at http://richardtwatson.com/data/electricityprices.csv
Currently my code looks as follows:
library(dplyr)
data1 <- read.csv("C:/Users/nldru/Downloads/SolarRadiationAthens.csv")
data2 <- read.csv("C:/Users/nldru/Downloads/electricityprices.csv")
n <- merge(data1,data2)
I have the files stored locally on my computer just for ease of access. The files are being read in properly, but for some reason when I merge them, the variable n receives no data, just the column headers from the CSV files. I have experimented with inner_join to no avail, as well as pulling the files directly from the http addresses linked above with read_delim() commands, but can't seem to get it to work. Any help or tips are much appreciated.
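A hedged diagnostic sketch (the column name timestamp below is an assumption; check names(data1) and names(data2) for the real shared key): a merge that returns zero rows usually means the shared key column matches by name but not by value, often because timestamps are stored in different formats, so normalizing the key on both sides before joining can help:
library(dplyr)
# Read straight from the source URLs
data1 <- read.csv("http://richardtwatson.com/data/SolarRadiationAthens.csv")
data2 <- read.csv("http://richardtwatson.com/data/electricityprices.csv")
# Inspect the shared key; "timestamp" below is a hypothetical name
names(data1)
names(data2)
# Normalize the key to the same type/format on both sides
data1$timestamp <- as.POSIXct(data1$timestamp, tz = "UTC")
data2$timestamp <- as.POSIXct(data2$timestamp, tz = "UTC")
n <- inner_join(data1, data2, by = "timestamp")
nrow(n)  # should now be non-zero if the keys align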

What are the commands for viewing a ".RData" file's data in RStudio?

I am trying to find out how I can see the data within a dataset with a .RData extension.
I tried view(); it showed me one object present in the dataset, but I know this is a large dataset (over 300 MB) consisting of a very long list of names. I need to view all of its contents and have been unsuccessful so far.
Should I convert it into a CSV instead in order to view all of the contents? If yes, how can I do that using RStudio?
The cross-platform function is View. (Capitalization matters; R is case-sensitive.) If you did:
obj <- load("filename.Rdata") # assuming the file exists in your working directory
Then type:
obj
You should see a printed listing of the names (a character vector) of the objects created (or possibly overwritten) in your global environment. The RStudio aspect of this question does not affect the result.
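To address the CSV part of the question, a minimal sketch, assuming the file contains at least one data frame (the file and output names are placeholders):
obj_names <- load("filename.Rdata")  # names of the loaded objects
obj_names                            # print them
# Inspect one object in RStudio's data viewer
View(get(obj_names[1]))
# If an object is a data frame, it can be exported for viewing elsewhere
write.csv(get(obj_names[1]), "contents.csv", row.names = FALSE)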

Run R code from Linux on Windows

My collaborator ran R code on a Linux operating system, but I only have Windows now. I am simply trying to run this R code, which is known to work on a Linux system.
I need to read a large CSV file with more than 400,000 rows. My computer cannot handle a file this large, so I only read the first 10,000 rows. Then a simple left join did not work with this truncated file, even though the syntax of the left join appears correct.
data_tag <- left_join(data, tags, by = "app_id")
Error: `by` can't contain join column `app_id` which is missing from LHS
I have checked many times that app_id is in both files.
Is it possible that the file changed slightly when moving from Linux to Windows? Or that the truncated file somehow cannot be read correctly into R?
Any help is highly appreciated.
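A hedged diagnostic sketch (the file names below are hypothetical): when a CSV written on one OS is read on another, a UTF-8 byte-order mark can silently prepend characters to the first column name, so a column that looks like app_id may actually be named something like ï..app_id. Inspecting the names and re-reading with an explicit encoding rules this out:
library(dplyr)
# Check what the column names actually are on both sides
names(data)
names(tags)
# Re-reading with fileEncoding = "UTF-8-BOM" strips a byte-order mark;
# "data.csv" and "tags.csv" are placeholder file names
data <- read.csv("data.csv", fileEncoding = "UTF-8-BOM")
tags <- read.csv("tags.csv", fileEncoding = "UTF-8-BOM")
data_tag <- left_join(data, tags, by = "app_id")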

exporting R data.table in .txt or .csv format

Recently I have run into a problem exporting data from R to a more “common” format such as .csv or .txt.
My dataset is in data.table format and has 149,000 rows × 124 columns. I used the following lines of code to try to export it:
write.table(data_reduced,"directory/data_reduced.txt",sep="\t",row.names=FALSE)
write.csv2(data_reduced,"directory/data_reduced.csv")
The result, in both cases, is that the .txt or .csv files have fewer rows than they should, and the count changes between trials (it ranges from 900 to 1800, more or less). Usually what I get is the first rows and then the very last one.
I have tried converting the data.table to a matrix or data.frame, but the result is more or less the same. I have also tried the write.xlsx function, but I have problems with Java (which seems common, judging by the SO forum and other web sources).
I have also read about a function called fwrite for exporting very large datasets, but it looks like my RStudio cannot find it, even though I installed the data.table package.
Can anyone give me an explanation/solution for this problem? I've been reading different sources to sort it out, but with no success so far.
I use RStudio Version 0.99.473.
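A hedged note: fwrite() was only added around data.table 1.9.8, so an older installation would explain R not finding it. A minimal sketch, assuming updating the package is an option:
# Update data.table to a version that includes fwrite (>= 1.9.8)
install.packages("data.table")
library(data.table)
# fwrite handles large tables quickly and writes every row
fwrite(data_reduced, "directory/data_reduced.csv")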

Using R to write a .mat file not giving the right output?

I had a .csv file that I wanted to read into Octave (I originally tried to use csvread). It was taking too long, so I tried to use R as a workaround: How to read large matrix from a csv efficiently in Octave
This is what I did in R:
forest_test <- read.csv('forest_test.csv')
library(R.matlab)
writeMat("forest_test.mat", forest_test_data=forest_test)
and then I went back to Octave and did this:
forest_test = load('forest_test.mat')
This is not giving me a matrix, but a struct. What am I doing wrong?
To answer your exact question, you are using the load function wrong. You must not assign its output to a variable if you just want the variables in the file to be inserted into the workspace. From Octave's load help text:
If invoked with a single output argument, Octave returns data instead of inserting variables in the symbol table. If the data file contains only numbers (TAB- or space-delimited columns), a matrix of values is returned. Otherwise, 'load' returns a structure with members corresponding to the names of the variables in the file.
With examples, following our case:
## inserts all variables in the file in the workspace
load ("forest_test.mat");
## each variable in the file becomes a field in the forest_test struct
forest_test = load ("forest_test.mat");
But still, the link you posted about Octave being slow with CSV files makes reference to Octave 3.2.4, which is quite an old version. Have you confirmed this is still the case in a recent version (the latest release is 3.8.2)?
There is a function designed to convert data frames to matrices:
?data.matrix
forest_test <- data.matrix(read.csv('forest_test.csv'))
library(R.matlab)
writeMat("forest_test.mat", forest_test_data=forest_test)
