R does not import values as numbers

Basic data was generated with a SQL query and the intention is to process it in R. However, when importing from a .csv or an .xlsx file, R reads the numeric columns as characters despite the data type being changed in the built-in import tool. Further, when performing basic arithmetic operations, the following error was encountered:
In Ops.factor((data$A), (data$B)) : ‘/’ not meaningful for factors
Is there a simple way to solve this?
The data set was inspected with the str() function, which revealed that R had imported the columns in question as factors.
Used the varhandle package and its unfactor() function to convert the factor columns
Used as.numeric() for columns that were read as characters rather than factors
Tried changing the data types in Excel before importing
data$A <- unfactor(data$A)
data$B <- unfactor(data$B)
data$PERCENTAGE <- (data$B)/(data$A)*100
How can I make R import the data with the specified data types?
Thank you for the help in advance!

For csv files I would recommend read_csv() from the readr package, part of Hadley Wickham's excellent tidyverse. It has intelligent defaults that cope with most things I throw at it.
For .xlsx there is read_excel() from the readxl package, also part of the tidyverse family (other packages are available too).
Or, alternatively, just export a .csv from within Excel and use read_csv().
[Note that these tidyverse readers import files as a "tibble", which is essentially a data frame on steroids without some of the headaches, and is easily converted to a data.frame if you prefer.]
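If the defaults still guess wrong, the column types can be pinned explicitly. A minimal sketch, assuming the export is called "data.csv" and the numeric columns are A and B (file and column names are placeholders):

library(readr)
# force A and B to be read as doubles; unspecified columns are still guessed
data <- read_csv("data.csv",
                 col_types = cols(A = col_double(), B = col_double()))
data$PERCENTAGE <- data$B / data$A * 100

# base R equivalent, plus the safe way to rescue columns already read as factors
data <- read.csv("data.csv", stringsAsFactors = FALSE)
data$A <- as.numeric(as.character(data$A))
data$B <- as.numeric(as.character(data$B))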

Related

R Import Excel file with values instead of formulas (multiple sheets)

I'm trying to import an Excel sheet with a lot of formulas.
My goal is to read the values of the formulas.
I spent a lot of time here searching for answers and have already tried a lot, such as using read.xlsx and read_excel and experimenting with the arguments, but I can't get it to work. I always receive either N/A or FALSE values instead of the formula's value.
In my opinion a possible reason could be that the variables of the formulas are spread across various sheets. For example: I want to import the values of the formulas on sheet A. Some formulas on Sheet A need the results of other formulas on Sheet A, while these formulas need values from sheets B, C and D.
Could this be a possible reason for the non-working import and if so, how can I fix it?
I'm aware that I could solve this problem by opening the .xlsx in Excel and saving the formulas as values, but since the idea of my script is to automate a lot of processing steps, this solution would not be satisfying.
Thanks!
This may not be due to the formula arguments being scattered across other sheets. In any case, the values produced by formulas in Excel will be imported as values only; the formulas themselves are not imported into R. You can, however, recreate the formulas as R functions.
As regards your import problem, that may be due to incorrect use of arguments. Try the readr package for csv files, or readxl's read_excel() for reading the .xlsx directly.
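For what it's worth, a minimal sketch with readxl's read_excel(), which returns the values Excel last cached for formula cells rather than the formula text; the file and sheet names below are placeholders:

library(readxl)
# read_excel() imports the cached result of each formula cell, not the formula itself;
# if Excel has not saved the workbook since the formulas last changed, cells can come back NA
sheet_a <- read_excel("workbook.xlsx", sheet = "A")

If the cached values are missing, re-saving the workbook in Excel (or recalculating it with another tool) is usually needed before any R reader can see them.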

How to convert Spark R dataframe into R list

This is my first time trying SparkR, on Databricks Cloud Community Edition, to do the same work I did in RStudio, but I have met some weird problems.
It seems that SparkR does support packages like ggplot2 and plyr, but the data has to be in R list format. I could generate this type of list in RStudio when using train <- read.csv("R_basics_train.csv"); the variable train here is a list when you use typeof(train).
However, in SparkR, when I read the same csv data as "train", it is converted into a DataFrame, and this is not the Spark Python DataFrame we have used before, since I cannot use the collect() function to convert it into a list. When you use typeof(train), it shows the type is "S4", but in fact the type is a DataFrame.
So, is there any way in SparkR that I can convert a DataFrame into an R list so that I can use the methods in ggplot2 and plyr?
You can find the original .csv training data here:
train
Later I found that r_df <- collect(spark_df) will convert a Spark DataFrame into an R data frame. Although you cannot use R's summary() on the Spark DataFrame itself, once you have an R data frame you can do many normal R operations.
It looks like they changed SparkR, so you now need to use
r_df <- as.data.frame(spark_df)
Not sure if you would call this a drawback of SparkR, but in order to leverage many of the good functionalities R has to offer, such as data exploration and the ggplot libraries, you need to convert your Spark data frame into a normal R data frame by calling collect:
df <- collect(df)
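Putting the pieces together, a minimal sketch of that workflow, assuming Spark 2.x (where csv is a built-in source) and a csv with a header row; the path and the column name some_column are placeholders:

library(SparkR)
library(ggplot2)

# read the csv as a Spark DataFrame, then pull it to the driver as a plain R data.frame
train_sdf <- read.df("R_basics_train.csv", source = "csv",
                     header = "true", inferSchema = "true")
train_df  <- collect(train_sdf)

# from here on, ordinary R functions and packages work
ggplot(train_df, aes(x = some_column)) + geom_histogram()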

R importing using haven, use Stata 12 or sas7bdat source file?

The University of Cape Town makes data available through its DataFirst Portal.
All their data is made available in the following formats:
SAS (sas7bdat)
SPSS
Stata (12)
I would like to import a dataset into R using the haven package, which supports all of the above formats (it utilises the ReadStat library).
Which would be the preferred format for doing this?
More specifically:
Are there differences in terms of data available in the original formats?
Are some formats closer to R's format than others, and does this affect the output?
Are there differences in terms of speed? (less important)
The best way to transfer data between different systems is .csv, as it can be read by all systems without much hassle.
As you only have access to the other formats, there shouldn't be too much difference (given that haven works with all of them).
As to your questions:
I am not aware of any differences in data availability or format compatibility. However, if you want to speed things up, you should probably look into data.table and its fread() (a replacement for read.table, so it has no support for the formats mentioned above).
You can read the data like this:
library(haven)
dat <- read_sas("link_to_sas_file")
dat <- read_spss("link_to_spss_file")
dat <- read_stata("link_to_stata_file")
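Whichever source format you pick, haven returns value-labelled columns as "labelled" vectors rather than factors; a small sketch of converting those, assuming a Stata file (the file name is a placeholder):

library(haven)
dat <- read_stata("survey_wave1.dta")
# haven keeps value labels as labelled vectors; as_factor() turns them into R factors
dat <- as_factor(dat)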

Save big data file in R to be loaded afterwards in Matlab

I have created a 300000 x 7 numeric matrix in R and I want to work with it in both R and Matlab. However, I'm not able to create a file that Matlab can read properly.
When using the command save() with file = "xx.csv", only 5 columns are recognized instead of 7; with a .txt extension, all the data is opened in a single column instead.
I have also tried the ff package and its ffdf objects to manage this big data set (I guess the problem of R identifying rows and columns when saving is somehow related to this), but I don't know how to save it in a format readable by Matlab afterwards.
An example of this dataset would be:
output <- matrix(runif(2100000, 1, 1000), ncol=7, nrow=300000)
If you want to work both with R and Matlab, and you have a matrix as big as yours, I'd suggest using the R.matlab package. The package provides methods readMat and writeMat. Both methods read/write the binary format that is understood by Matlab (and through R.matlab also by R).
Install the package by typing
install.packages("R.matlab")
Subsequently, don't forget to load the package, e.g. by
library(R.matlab)
The documentation of readMat and writeMat, accessible through ?readMat and ?writeMat, contains easy usage examples.
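For completeness, a minimal sketch with the example matrix from the question; the file name output.mat is a placeholder:

library(R.matlab)

output <- matrix(runif(2100000, 1, 1000), ncol = 7, nrow = 300000)
# writes the matrix into output.mat under the variable name "output"
writeMat("output.mat", output = output)

# read it back in R; in Matlab the same file is loaded with load('output.mat')
back <- readMat("output.mat")
str(back$output)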

How can I save a data set created using the memisc package in r?

I'm using memisc to read in a massive SPSS file in order to reshape it. This seems to work very well.
However, I'd also like to be able to output the results to an SPSS-readable file that retains labels and descriptions; after all, this is the advantage of using the data-set construct memisc has set up.
I've searched the memisc documentation for any mention of an export, save, or output function. I've also tried passing a memisc data set to write.foreign (from the foreign package).
Why am I doing this in the first place? Reshaping massive data files in SPSS is a pain. R makes it easy. But the folks I'm doing this for want to maintain the labels and descriptions.
Thanks!
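No answer is recorded here, but one route worth testing (an assumption on my part, not something documented above) is to convert the memisc data.set to a data.frame and write it out with haven's write_sav(), then check in SPSS whether the labels survived; ds and the file name are placeholders:

library(memisc)
library(haven)

# ds is the reshaped memisc data.set; as.data.frame() may drop memisc-specific metadata,
# so verify variable and value labels after the round trip
df <- as.data.frame(ds)
write_sav(df, "reshaped.sav")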
