Related
This question is specifically about the use of multiple cores to run a given function where the function requires a package and additional arguments to run.
I have a large dataset of the following form:
Event_ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
Type=c("A","B","C","D","E","A","B","C","D","E","A","B","C","D")
Revenue1=c(24,9,51,7,22,15,86,66,0,57,44,93,34,37)
Revenue2=c(16,93,96,44,67,73,12,65,81,22,39,94,41,30)
z = data.frame(Event_ID,Type,Revenue1,Revenue2)
I have a fairly complex function and am trying run it on multiple core. Below I present a simple function where the function essentially takes the sum of two columns and subtracts the value of the multiplication of two matrices (My apologies if the function is over-simplified but I am trying to understand how parallel processing works). Heres the function below:
set.seed(100)
library(truncnorm)
alpha_old=matrix(c(1,5),nrow=1)
library(truncnorm)
Total_Revenue=function(data,alpha_old){
for (i in 1:nrow(z)){
beta_old=matrix(rtruncnorm(2,a=1,b=10,mean =5,sd=1),ncol=1) #generates beta for each row
adjustment_factor = alpha_old%*%beta_old #computes adjustment factor for each row
z[i,'Total_Rev'] = z[i,'Revenue1']+z[i,'Revenue2']-adjustment_factor
}
return(z)
}
Total = Total_Revenue(data=z,alpha=alpha_old)
print(Total)
Running the function regularly and printing the results out provides the expected output (output shown at the end).
Now I want to implement the following using multiple cores using the parSapply. I tried the following:
library(parallel)
library(doParallel)
no_cores <- detectCores() - 1
registerDoParallel(cores=no_cores)
cl2 <- makeCluster(no_cores)
invisible(clusterEvalQ(cl2, library(truncnorm)))
clusterExport(cl=cl2, varlist=c("alpha_old","z"), envir=environment())
result1 = parSapply(cl2, X= 1:nrow(z),FUN=Total_Revenue,data=z,alpha_old=alpha_old)
stopCluster(cl2)
I get the following message:
Error in checkForRemoteErrors(val) : 14 nodes produced errors; first error: unused argument (X[[i]])
This is the first time I am trying to use multicore processing and am not very familiar with the parallel and doParallel packages. The actual dataset I am working with has around 5 million observation and the function involves additional steps (comparing the values between the other values of the dataset) which I deleted from the example function. Any help on dealing with this will be greatly appreciated. Thanks in advance.
P.S. The output that I get by running the function on one core:
P.P.S. The example data is taken from another question that I had posted here:
Gpu processing R (How to use Gpu processing to run a function on subsets of a dataset)
I am trying to do an anova anaysis in R on a data set with one within factor and one between factor. The data is from an experiment to test the similarity of two testing methods. Each subject was tested in Method 1 and Method 2 (the within factor) as well as being in one of 4 different groups (the between factor). I have tried using the aov, the Anova(in car package), and the ezAnova functions. I am getting wrong values for every method I try. I am not sure where my mistake is, if its a lack of understanding of R or the Anova itself. I included the code I used that I feel should be working. I have tried a ton of variations of this hoping to stumble on the answer. This set of data is balanced but I have a lot of similar data sets and many are unblanced. Thanks for any help you can provide.
library(car)
library(ez)
#set up data
sample_data <- data.frame(Subject=rep(1:20,2),Method=rep(c('Method1','Method2'),each=20),Level=rep(rep(c('Level1','Level2','Level3','Level4'),each=5),2))
sample_data$Result <- c(4.76,5.03,4.97,4.70,5.03,6.43,6.44,6.43,6.39,6.40,5.31,4.54,5.07,4.99,4.79,4.93,5.36,4.81,4.71,5.06,4.72,5.10,4.99,4.61,5.10,6.45,6.62,6.37,6.42,6.43,5.22,4.72,5.03,4.98,4.59,5.06,5.29,4.87,4.81,5.07)
sample_data[, 'Subject'] <- as.factor(sample_data[, 'Subject'])
#Set the contrats if needed to run type 3 sums of square for unblanaced data
#options(contrats=c("contr.sum","contr.poly"))
#With aov method as I understand it 'should' work
anova_aov <- aov(Result ~ Method*Level + Error(Subject/Method),data=test_data)
print(summary(anova_aov))
#ezAnova method,
anova_ez = ezANOVA(data=sample_data, wid=Subject, dv = Result, within = Method, between=Level, detailed = TRUE, type=3)
print(anova_ez)
Also, the values I should be getting as output by SAS
SAS Anova
Actually, your R code is correct in both cases. Running these data through SPSS yielded the same result. SAS, like SPSS, seems to require that the levels of the within factor appear in separate columns. You will end up with 20 rows instead of 40. An arrangmement like the one below might give you the desired result in SAS:
Subject Level Method1 Method2
I have a dataset with about 11,500 rows and 15 factors. I only need to impute values for 3 of the factors, with only 2 of the factors having any significant number of missing values. I have been trying to use mice to create imputed datasets, and I am using the following code:
dataset<-read.csv("filename.csv",header=TRUE)
model<-success~1+course+medium+ethnicity+gender+age+enrollment+HSGPA+GPA+Pell+ethnicity*medium
library(mice)
vempty<-c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
v12<-c(0,0,0,0,0,0,0,1,1,1,1,0,1,1,1)
v13<-c(0,0,0,0,0,0,0,1,1,1,1,1,0,1,1)
v14<-c(0,0,0,0,0,0,0,1,1,1,1,1,1,0,1)
list<-list(vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,v12,v13,v14,vempty)
predmatrix<-do.call(rbind,list)
MIdataset<-mice(dataset,m=2,predictorMatrix=predmatrix)
MIoutput<- pool(glm(model, data=MIdataset, family=binomial))
After this code, I get the error message:
Error in as.data.frame.default(data) :
cannot coerce class '"mids"' into a data.frame
I'm totally at a loss as to what this means. I had no trouble doing this same analysis just deleting the missing data and using regular glm. I'd also like to do a multilvel logistic model on imputed datasets using lmer (that's the next step after I get this to work with glm), so if there is anything I am doing wrong that will also impact that next step, that would be good to know, too. I've tried to search this error on the internet, and I'm not getting anywhere. I'm just really learning R, so I'm also not that familiar with the environment yet.
Thanks for your time!
You need to apply the with.mids function. I think the last line in your code should look like this:
pool(with(MIdataset, glm(formula(model), family = binomial)))
You could also try this:
expr <- 'glm(success ~ course, family = binomial)'
pool(with(MIdataset, parse(text = expr)))
I'm trying to analyze data from the University of Minnesota IPUMS dataset for the 1990 US census in R. I'm using the survey package because the data is weighted. Just taking the household data (and ignoring the person variables to keep things simple), I am attempting to calculate the mean of hhincome (household income). To do this I created a survey design object using the svydesign() function with the following code:
> require(foreign)
> ipums.household <- read.dta("/path/to/stata_export.dta")
> ipums.household[ipums.household$hhincome==9999999, "hhincome"] <- NA # Fix missing
> ipums.hh.design <- svydesign(id=~1, weights=~hhwt, data=ipums.household)
> svymean(ipums.household$hhincome, ipums.hh.design, na.rm=TRUE)
mean SE
[1,] 37029 17.365
So far so good. However, I get a different standard error if I attempt the same calculation in Stata (using code meant for a different portion of the same dataset):
use "C:\I\Hate\Backslashes\stata_export.dta"
replace hhincome = . if hhincome == 9999999
(933734 real changes made, 933734 to missing)
mean hhincome [fweight = hhwt] # The code from the link above.
Mean estimation Number of obs = 91746420
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
hhincome | 37028.99 3.542749 37022.05 37035.94
--------------------------------------------------------------
And, looking at another way to skin this cat, the author of survey, has this suggestion for frequency weighting:
expanded.data<-as.data.frame(lapply(compressed.data,
function(x) rep(x,compressed.data$weights)))
However, I can't seem to get this code to work:
> hh.dataframe <- data.frame(ipums.household$hhincome, ipums.household$hhwt)
> expanded.hh.dataframe <- as.data.frame(lapply(hh.dataframe, function(x) rep(x, hh.dataframe$hhwt)))
Error in rep(x, hh.dataframe$hhwt) : invalid 'times' argument
Which I can't seem to fix. This may be related to this issue.
So in sum:
Why don't I get the same answers in Stata and R?
Which one is right (or am I doing something wrong in both cases)?
Assuming I got the rep() solution working, would that replicate Stata's results?
What's the right way to do it? Kudos if the answer allows me to use the plyr package for doing arbitrary calculations, rather than being limited to the functions implemented in survey (svymean(), svyglm() etc.)
Update
So after the excellent help I've received here and from IPUMS via email, I'm using the following code to properly handle survey weighting. I describe here in case someone else has this problem in future.
Initial Stata Preparation
Since IPUMS don't currently publish scripts for importing their data into R, you'll need to start from Stata, SAS, or SPSS. I'll stick with Stata for now. Begin by running the import script from IPUMS. Then before continuing add the following variable:
generate strata = statefip*100000 + puma
This creates a unique integer for each PUMA of the form 240001, with first two digits as the state fip code (24 in the case of Maryland) and the last four a PUMA id which is unique on a per state basis. If you're going to use R you might also find it helpful to run this as well
generate statefip_num = statefip * 1
This will create an additional variable without labels, since importing .dta files into R apply the labels and lose the underlying integers.
Stata and svyset
As Keith explained, survey sampling is handled by Stata by invoking svyset.
For an individual level analysis I now use:
svyset serial [pweight=perwt], strata(strata)
This sets the weighting to perwt, the stratification to the variable we created above, and uses the household serial number to account for clustering. If we were using multiple years, we might want to try
generate double yearserial = year*100000000 + serial
to account for longitudinal clustering as well.
For household level analysis (without years):
svyset serial [pweight=hhwt], strata(strata)
Should be self-explanatory (though I think in this case serial is actually superfluous). Replacing serial with yearserial will take into account a time series.
Doing it in R
Assuming you're importing a .dta file with the additional strata variable explained above and analysing at the individual letter:
require(foreign)
ipums <- read.dta('/path/to/data.dta')
require(survey)
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)
Or at the household level:
ipums.hh.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=hhwt)
Hope someone finds this helpful, and thanks so much to Dwin, Keith and Brandon from IPUMS.
1&2) The comment you cited from Lumley was written in 2001 and predates any of his published work with the survey package which has only been out a few years. You are probably using "weights" in two different senses. (Lumley describes three possible senses early in his book.) The survey function svydesign is using probability weights rather than frequency weights. Seems likely that these are not really frequency weights but rather probability weights, given the massive size of that dataset, and that would mean that the survey package result is correct and the Stata result incorrect. If you are not convinced, then the survey package offers the function as.svrepdesign() with which Lumley's book describes how to create a replicate weight vector from a svydesign-object.
3) I think so, but as RMN said ..."It would be wrong."
4) Since it's wrong (IMO) it's not necessary.
You shouldn't be using frequency weights in Stata. That is pretty clear. If IPUMS doesn't have a "complex" survey design, you can just use:
mean hhincome [pw = hhwt]
Or, for convenience:
svyset [pw = hhwt]
svy: mean hhincome
svy: regress hhincome `x'
What's nice about the second option is that you can use it for more complex survey designs (via options on svyset. Then you can run lots of commands without having to typ [pw...] all the time.
Slight addition for people who don't have access to Stata or SAS; (I would put this in comments but...)
The library SAScii can use the SAS code file to read in the IPUMS downloaded data. The code to read in the data is from the doc
library(SAScii)
IPUMS.file.location <- "..\\usa_00007dat\\usa_00007.dat"
IPUMS.SAS.read.in.instructions <- "..\\usa_00007dat\\usa_00007.sas"
#store the IPUMS extract as an R data frame!
IPUMS.df <-
read.SAScii (
IPUMS.file.location ,
IPUMS.SAS.read.in.instructions ,
zipped = F )
How can I plot a very large data set in R?
I'd like to use a boxplot, or violin plot, or similar. All the data cannot be fit in memory. Can I incrementally read in and calculate the summaries needed to make these plots? If so how?
In supplement to my comment to Dmitri answer, a function to calculate quantiles using ff big-data handling package:
ffquantile<-function(ffv,qs=c(0,0.25,0.5,0.75,1),...){
stopifnot(all(qs<=1 & qs>=0))
ffsort(ffv,...)->ffvs
j<-(qs*(length(ffv)-1))+1
jf<-floor(j);ceiling(j)->jc
rowSums(matrix(ffvs[c(jf,jc)],length(qs),2))/2
}
This is an exact algorithm, so it uses sorting -- and thus may take a lot of time.
Problem is you can't load all data into the memory. So you could do sampling of the data, as indicated earlier by #Marek. On such a huge datasets, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Progressive calculation of quantiles is impossible, but this should give a very decent approximation. It is essentially the same as the "randomized method" described in the link #aix gave.
If you can't subset the date outside of R, it can be done using connections in combination with sample(). Following function is what I use to sample data from a dataframe in text format when it's getting too big. If you play a bit with the connection, you could easily convert this to a socketConnection or other to read it from a server, a database, whatever. Just make sure you open the connection in the correct mode.
Good, take a simple .csv file, then following function samples a fraction p of the data:
sample.df <- function(f,n=10000,split=",",p=0.1){
con <- file(f,open="rt",)
on.exit(close(con,type="rt"))
y <- data.frame()
#read header
x <- character(0)
while(length(x)==0){
x <- strsplit(readLines(con,n=1),split)[[1]]
}
Names <- x
#read and process data
repeat{
x <- tryCatch(read.table(con,nrows=n,sep=split),error = function(e) NULL )
if(is.null(x)) {break}
names(x) <- Names
nn <- nrow(x)
id <- sample(1:nn,round(nn*p))
y <- rbind(y,x[id,])
}
rownames(y) <- NULL
return(y)
}
An example of the usage :
#Make a file
Df <- data.frame(
X1=1:10000,
X2=1:10000,
X3=rep(letters[1:10],1000)
)
write.csv(Df,file="test.txt",row.names=F,quote=F)
# n is number of lines to be read at once, p is the fraction to sample
DF2 <- sample.df("test.txt",n=1000,p=0.2)
str(DF2)
#clean up
unlink("test.txt")
All you need for a boxplot are the quantiles, the "whisker" extremes, and the outliers (if shown), which is all easily precomputed. Take a look at the boxplot.stats function.
You should also look at the RSQLite, SQLiteDF, RODBC, and biglm packages. For large datasets is can be useful to store the data in a database and pull only pieces into R. The databases can also do sorting for you and then computing quantiles on sorted data is much simpler (then just use the quantiles to do the plots).
There is also the hexbin package (bioconductor) for doing scatterplot equivalents with very large datasets (probably still want to use a sample of the data, but works with a large sample).
You could put the data into a database and calculate the quantiles using SQL. See : http://forge.mysql.com/tools/tool.php?id=149
This is an interesting problem.
Boxplots require quantiles. Computing quantiles on very large datasets is tricky.
The simplest solution that may or may not work in your case is to downsample the data first, and produce plots of the sample. In other words, read a bunch of records at a time, and retain a subset of them in memory (choosing either deterministically or randomly.) At the end, produce plots based on the data that's been retained in memory. Again, whether or not this is viable very much depends on the properties of your data.
Alternatively, there exist algorithms that can economically and approximately compute quantiles in an "online" fashion, meaning that they are presented with one observation at a time, and each observation is shown exactly once. While I have some limited experience with such algorithms, I have not seen any readily-available R implementations.
The following paper presents a brief overview of some relevant algorithms: Quantiles on Streams.
You could make plots from manageable sample of your data. E.g. if you use only 10% randomly chosen rows then boxplot on this sample shouldn't differ from all-data boxplot.
If your data are on some database there you be able to create some random flag (as I know almost every database engine has some kind of random number generator).
Second thing is how large is your dataset? For boxplot you need two columns: value variable and group variable. This example:
N <- 1e6
x <- rnorm(N)
b <- sapply(1:100, function(i) paste(sample(letters,40,TRUE),collapse=""))
g <- factor(sample(b,N,TRUE))
boxplot(x~g)
needs 100MB of RAM. If N=1e7 then it uses <1GB of RAM (which is still manageable to modern machine).
Perhaps you can think about using disk.frame to summarise the data down first before running the plotting?
The problem with R (and other languages like Python and Julia) is that you have to load all your data into memory to plot it. As of 2022, the best solution is to use DuckDB (there is an R connector), it allows you to query very large datasets (CSV, parquet, among others), and it comes with many functions to compute summary statistics. The idea is to use DuckDB to compute those statistics, load such statistics into R/Python/Julia, and plot.
Computing a boxplot with SQL + R
You need a bunch of statistics to plot a boxplot. If you want a complete reference, you can look at matplotlib's code. The code is in Python, but the code is pretty straightforward, so you'll get it even if you don't know Python.
The most critical piece are percentiles; you can compute those in DuckDB like this (just change the placeholders):
SELECT
percentile_disc(0.25) WITHIN GROUP (ORDER BY "{{column}}") AS q1,
percentile_disc(0.50) WITHIN GROUP (ORDER BY "{{column}}") AS med,
percentile_disc(0.75) WITHIN GROUP (ORDER BY "{{column}}") AS q3,
AVG("{{column}}") AS mean,
COUNT(*) AS N
FROM "{{path/to/data.parquet}}"
You need some other statistics to create the boxplot with all its details. For full implementation, check this (note: it's written in Python). I had to implement this for a package I wrote called JupySQL, which allows plotting very large datasets in Jupyter by leveraging SQL engines such as DuckDB.
Once you compute the statistics, you can use R to generate the boxplot.