I am reading an Excel spreadsheet with the following commands:
models <- XLConnect::loadWorkbook("bla.xlsx")
models <- XLConnect::readWorksheet(models, sheet = 1, check.names = FALSE)
However, I noticed that R modifies my number. For example, in Excel:
1.011379571
and in R
1.011379570999999977
Has anyone had this issue? I know it's a very small difference, but I really need 100% precision.
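For reference, a minimal check shows the two displays are the same stored value printed at different precisions:

x <- 1.011379571
sprintf("%.9f", x)   # "1.011379571" -- the rounded form Excel displays
sprintf("%.18f", x)  # the nearest IEEE-754 double, e.g. "1.011379570999999977"

Excel rounds its display to at most 15 significant digits, while R can be asked for more; both programs store the same 64-bit binary approximation of the decimal.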
Related
I have a question about data size limitations in R Shiny. I am working on updating a previous project. The original data is around 5 MB, and the program resamples it to obtain an estimate of future values. I am now updating the program to make it more general and am trying to import 300 MB of data; however, the Shiny app crashes. I have used R to handle larger data before, but I am not sure whether Shiny imposes any limit on data size. Does anyone have any idea about it? Thanks.
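If the 300 MB file comes in through a fileInput upload, note that Shiny caps uploads at 5 MB by default; the cap is configurable via a standard option. A minimal sketch (the 300 MB figure is from the question):

# raise Shiny's default 5 MB upload cap to roughly 300 MB
options(shiny.maxRequestSize = 300 * 1024^2)

Beyond uploads, Shiny inherits R's usual in-memory limits, so the data still has to fit in RAM.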
I am stuck with a huge dataset that needs to be imported into R and then processed (with randomForest). Basically, I have a CSV file of about 100K rows and 6K columns. Importing it directly takes a long time, with many warnings regarding space allocation (limit reached for 8061mb). At the end of all the warnings I do get the data frame in R, but I'm not sure whether to rely on it. Even if I can use that data frame, I am pretty sure running randomForest on it will be a huge problem. Hence, mine is a two-part question:
How can I efficiently import such a large CSV file without any warnings/errors?
Once it is imported into R, how should I proceed to run randomForest on it?
Should we use a package that enhances computing efficiency? Any help is welcome, thanks.
Your memory limit in R seems to be 8 GB; try increasing it if your machine has more memory.
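On Windows this cap is adjustable with memory.limit() (removed in R 4.2, so this sketch assumes an older R):

memory.limit()              # current limit in MB; 8061 matches the warnings
memory.limit(size = 16000)  # raise it to ~16 GB if the machine has the RAM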
If that does not work, one option is to submit MapReduce jobs from R (see http://www.r-bloggers.com/mapreduce-with-r-on-hadoop-and-amazon-emr/ and https://spark.apache.org/docs/1.5.1/sparkr.html). However, random forest is not supported in either of these yet.
I have survey data in SPSS and Stata which is ~730 MB in size. Each of these programs also occupies approximately the amount of memory you would expect (~800 MB) when I'm working with that data.
I've been trying to pick up R, and so attempted to load this data into R. No matter what method I try (read.dta on the Stata file, fread on a CSV file, read.spss on the SPSS file), the resulting R object (measured using object.size()) is between 2.6 and 3.1 GB in size. If I save the object to an R data file, that file is less than 100 MB, but on loading the object is the same size as before.
Any attempt to analyse the data using the survey package, particularly if I try to subset the data, takes significantly longer than the equivalent command in Stata.
E.g. I have a household-size variable 'hhpers' in my data 'hh', weighted by variable 'hhwt' and subset by 'htype'.
R code :
require(survey)
sv.design <- svydesign(ids = ~0, data = hh, weights = hh$hhwt)
rm(hh)
system.time(
  svymean(~hhpers, sv.design[which(sv.design$variables$htype == "rural"), ])
)
pushes the memory used by R up to 6 GB and takes a very long time:
   user  system elapsed
   3.70    1.75  144.11
The equivalent operation in Stata:
svy: mean hhpers if htype == 1
completes almost instantaneously, giving me the same result.
Why is there such a massive difference in both memory usage (by the object as well as by the function) and in time taken between R and Stata?
Is there anything I can do to optimise the data and how R is working with it?
ETA: My machine is running 64-bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata.
After some digging, I suspect the reason for this is R's limited number of data types. All my data is stored as int, which takes 4 bytes per element. In survey data, each response is categorically coded and typically requires only one byte to store; Stata stores these using its 'byte' data type, while R stores them as 4-byte 'int', leading to significant inefficiency in large surveys.
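A quick illustration of the per-element cost (not from the original post; object.size is base R):

object.size(integer(1e6))  # ~4 MB: R integers are 4 bytes each
object.size(raw(1e6))      # ~1 MB: what a true 1-byte type would cost
object.size(numeric(1e6))  # ~8 MB: doubles are larger still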
Regarding the difference in memory usage: you're on the right track, and it's (mostly) because of object types. Indeed, storing everything as integers will take up a lot of your memory, so proper setting of variable types improves R's memory usage. as.factor() can help after the data is read (see ?as.factor for details). To fix this while reading from the file, refer to the colClasses parameter of read.table() (and the similar parameters of the functions specific to Stata and SPSS formats). This helps R store the data more efficiently (its on-the-fly guessing of types is not top-notch); a sketch follows below.
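A minimal sketch, assuming a hypothetical CSV export hh.csv with the variables named in the question; colClasses can be a named vector matched against column names:

hh <- read.table("hh.csv", header = TRUE, sep = ",",
                 colClasses = c(htype  = "factor",
                                hhpers = "integer",
                                hhwt   = "numeric"))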
Regarding the second part, calculation speed: parsing large datasets is not perfect in base R, which is where the data.table package comes in handy. It's fast and quite similar to the original data.frame behavior, and summary calculations are really quick. You would use it via hh <- as.data.table(read.table(...)), and you can calculate something similar to your example with:
library(data.table)
hh <- as.data.table(hh)
hh[htype == "rural", mean(hhpers * hhwt)]
## or
hh[, mean(hhpers * hhwt), by = htype]  # note the 'empty' first argument
Sorry, I'm not familiar with survey data studies, so I can't be more specific.
Another detail on memory usage by the function: most likely R made a copy of your entire dataset to calculate the summaries you were looking for. Again, data.table would help here, preventing R from making excessive copies and improving memory usage.
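You can watch the copies happen with base R's tracemem(), e.g.:

x <- runif(1e6)
tracemem(x)  # from now on, R reports whenever x is duplicated
y <- x       # no copy yet: both names share the same vector
y[1] <- 0    # modifying the shared vector triggers a full copy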
Also of interest may be the memisc package, which, for me, resulted in much smaller eventual files than read.spss (I was, however, working at a smaller scale than you).
From the memisc vignette
... Thus this package provides facilities to load such subsets of variables, without the need to load a complete data set. Further, the loading of data from SPSS files is organized in such a way that all informations about variable labels, value labels, and user-defined missing values are retained. This is made possible by the definition of importer objects, for which a subset method exists. importer objects contain only the information about the variables in the external data set but not the data. The data itself is loaded into memory when the functions subset or as.data.set are used.
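A minimal sketch of that workflow, assuming a hypothetical SPSS system file hh.sav and the variable names from the earlier question:

library(memisc)
imp <- spss.system.file("hh.sav")                    # importer object: metadata only, no data yet
ds  <- subset(imp, select = c(hhpers, hhwt, htype))  # only these columns are loaded into memory
hh  <- as.data.frame(ds)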
I have created a 300000 x 7 numeric matrix in R and I want to work with it in both R and Matlab. However, I'm not able to create a file that Matlab can read properly.
When using the command save() with file="xx.csv", Matlab recognizes 5 columns instead; with a .txt extension, all the data ends up in a single column instead.
I have also tried the ff and ffdf packages to manage this big data (I guess R's trouble identifying rows and columns when saving is somehow related to this), but I don't know how to save the result in a format readable by Matlab afterwards.
An example of this dataset would be:
output <- matrix(runif(2100000, 1, 1000), ncol=7, nrow=300000)
If you want to work with both R and Matlab and you have a matrix as big as yours, I'd suggest using the R.matlab package. It provides the methods readMat and writeMat; both read/write the binary format understood by Matlab (and, through R.matlab, by R as well).
Install the package by typing
install.packages("R.matlab")
Subsequently, don't forget to load the package, e.g. by
library(R.matlab)
The documentation of readMat and writeMat, accessible through ?readMat and ?writeMat, contains easy usage examples.
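A minimal round-trip sketch using the example matrix from the question (the file name output.mat is an assumption):

library(R.matlab)
output <- matrix(runif(2100000, 1, 1000), ncol = 7, nrow = 300000)
writeMat("output.mat", output = output)  # in Matlab: load('output.mat')
back <- readMat("output.mat")            # and back into R
str(back$output)                         # 300000 x 7 numeric matrix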
Completely new to R here. I ran R inside SPSS to solve some complex polynomials from SPSS datasets. I managed to get the results from R back into SPSS, but it was a very inelegant process:
begin program R.
# Pull the polynomial coefficients for case 1 from SPSS and solve.
z <- polyroot(unlist(spssdata.GetDataFromSPSS(variables=c("qE","qD","qC","qB","qA"), cases=1), use.names=FALSE))

# Pull the remaining constants needed for the selection criterion.
otherVals <- spssdata.GetDataFromSPSS(variables=c("b0","b1","Lc","tInv","sR","c0","c1","N2","xBar","DVxSq"), cases=1)
b0    <- unlist(otherVals["b0"],    use.names=FALSE)
b1    <- unlist(otherVals["b1"],    use.names=FALSE)
Lc    <- unlist(otherVals["Lc"],    use.names=FALSE)
tInv  <- unlist(otherVals["tInv"],  use.names=FALSE)
sR    <- unlist(otherVals["sR"],    use.names=FALSE)
c0    <- unlist(otherVals["c0"],    use.names=FALSE)
c1    <- unlist(otherVals["c1"],    use.names=FALSE)
N2    <- unlist(otherVals["N2"],    use.names=FALSE)
xBar  <- unlist(otherVals["xBar"],  use.names=FALSE)
DVxSq <- unlist(otherVals["DVxSq"], use.names=FALSE)

# Keep the real root whose criterion value comes closest to Lc.
crit <- abs(abs(b0 + b1*Re(z) - tInv*sR*sqrt(1/(c0 + c1*Re(z))^2 + 1/N2 + (Re(z) - xBar)^2/DVxSq)) - Lc)
z2 <- Re(z[crit == min(crit)])

# Push the result back to SPSS as a new dataset named 'results'.
varSpec1 <- c("Xd","Xd",0,"F8","scale")
dict <- spssdictionary.CreateSPSSDictionary(varSpec1)
spssdictionary.SetDictionaryToSPSS("results", dict)
new <- data.frame(z2)
spssdata.SetDataToSPSS("results", new)
spssdictionary.EndDataStep()
end program.
Honestly, it was mostly pieced together from somewhat-related examples and seems more complicated than it should be. I had to take the new dataset created by R and run MATCH FILES with my original dataset. All I want to do is a) pull numbers from SPSS into R, b) manipulate them (in this case, finding a polyroot that fits certain criteria), and c) put the results right back into the SPSS dataset without messing up any of the previous data.
Am I missing something that would make this simpler? Keep in mind that I have zero R experience outside of this attempt, but I have decent experience programming SPSS and Matlab.
Thanks in advance for any help you give!
R in SPSS can create new SPSS datasets, but it can't modify an existing one. There are a lot of situations where the data from R would be dimensionally inconsistent with the active SPSS dataset, so you need to create a dictionary and data frame using the APIs above and then do whatever is appropriate on the SPSS side if you need to match back. You might want to submit an enhancement request for SPSS at suggest#us.ibm.com.