I need an R package or function that will allow me to match controls to cases for a large dataset, 5 million subjects.
I have tried a few packages, my problems are summarized below. I only tried to match on a single covariate and I most likely will need to match on several.
Package MatchIt: The nearest neighbor, optimal, and genetic methods all just run for hours and hours. The "cem" method runs really quickly but I need to know which cases were matched/unmatched so I can do further analysis with the matched subset. Running the match.data() on the cem results only supplies the weights to be used in a regression and not the matched subset. The paired function in cem would work if I wanted one to one matching but I want to retain as many controls as possible.
matchControls() in the e1071 package: runs for a long time and them returns "not able to allocate vector of size 1352 GB"
Match() function from Matching package: Just runs and runs...
quickmatch() from the quickmatch package: It ran quickly but I am not sure I'm using the function correctly or how to extract the matched data from the "qm_matching" object returned. Below is my attempt using quickmatch on fake data.
library(MatchIt)
library(cem)
library(Matching)
library(rgenoud)
library(quickmatch)
set.seed(100)
control_df=data.frame(Group=factor("Control"),value=rnorm(1400000,95,2))
set.seed(101)
treatment_df=data.frame(Group=factor("Treatment"),value=c(rnorm(500000,92,2),rnorm(100000,50,5)))
dat=rbind(control_df,treatment_df)
covariate_balance(dat$Group, dat$value, matching = NULL,
normalize = TRUE, all_differences = TRUE)
my_distances <- distances(dat, dist_variables = c("value"))
matchedDat=quickmatch(my_distances,dat$Group )
matchedDat.df=data.frame(matchedDat)
Not sure what to do with the returned object. I think quickmatch may be the most viable option. The covariate_balance result shows a decent amount of imbalance between the Control and Treatment groups so some amount of matching can be done.
Specifically how do I obtain matched results,i.e. flag the subjects that were successfully matched between the Control and Treatment? The cluster_label from matchedDat.df implies that the function is creating a large number of clusters how/can I restrict this?
Any help with respect to speeding up some of the functions above or new suggestions would be appreciated.
After a more careful reading of the cem documentation I think I have the solution to my problem using the Matchit package or the cem package.
library(cem)
library(tidyverse)
set.seed(100)
control_df=data.frame(Group=factor("Control"),value=rnorm(1400000,95,2))
set.seed(101)
treatment_df=data.frame(Group=factor("Treatment"),value=c(rnorm(500000,92,2),rnorm(100000,50,5)))
dat=rbind(control_df,treatment_df)%>% rownames_to_column()
cem.match=cem(treatment="Group", baseline.group="Control",data=dat,keep.all=TRUE, drop ="rowname")
matchedData=data.frame(Group.check=cem.match$groups, matched=cem.match$matched,weights=cem.match$w)%>%
rownames_to_column()%>%
inner_join(dat,by="rowname") %>%
filter(matched==TRUE)
Related
I am using the "makeA" function in R's "pedigree" package to try and create a matrix of coefficients of relatedness between two sheep, based on a pedigree file.
My pedigree file has about 56,000 sheep, of which only about 26,000 are the ones I need coefficients of relatedness for (because they are the only remaining alive sheep).
How would I use the "which" argument to specify that I only want coefficients of relatedness for sheep with "Status" column value of "0" (indicating the animal is alive/at hand)?
I presume if I could do this it would make the function run much faster (at the moment it takes in excess of 30 minutes). Does anyone have any other pointers that may make it quicker? Open to try anything (within my limited R knowledge).
Here is my code:
library(pedigree)
library(data.table)
library(dplyr)
ped <- read.csv("ped.csv", header = TRUE)
start <- Sys.time()
makeA(ped,which = c(rep(TRUE,NROW(ped))))
end <- Sys.time()
end - start
I have more processing steps later on but I haven't even got past the step of creating the raw matrix for the actual (large) dataset.
Note that for smaller datasets the function works almost instantly.
I have a question regarding the aggregation of imputed data as created by the R-package 'mice'.
As far as I understand it, the 'complete'-command of 'mice' is applied to extract the imputed values of, e.g., the first imputation. However, when running a total of ten imputations, I am not sure, which imputed values to extract. Does anyone know how to extract the (aggregate) imputed data across all imputations?
Since I would like to enter the data into MS Excel and perform further calculations in another software tool, such a command would be very helpful.
Thank you for your comments. A simple example (from 'mice' itself) can be found below:
R> library("mice")
R> nhanes
R> imp <- mice(nhanes, seed = 23109) #create imputation
R> complete(imp) #extraction of the five imputed datasets (row-stacked matrix)
How can I aggregate the five imputed data sets and extract the imputed values to Excel?
I had similar issue.
I used the code below which is good enough to numeric vars.
For others I thought about randomly choose one of the imputed results (because averaging can disrupt it).
My offered code is (for numeric):
tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)
completedData <- complete(tempData, 'long')
a<-aggregate(completedData[,3:6] , by = list(completedData$.id),FUN= mean)
you should join the results back.
I think the 'Hmisc' is a better package.
if you already found nicer/ more elegant/ built in solution - please share with us.
You should use complete(imp,action="long") to get values for each imputation. If you see ?complete, you will find
complete(x, action = 1, include = FALSE)
Arguments
x
An object of class mids as created by the function mice().
action
If action is a scalar between 1 and x$m, the function returns the data with imputation number action filled in. Thus, action=1 returns the first completed data set, action=2 returns the second completed data set, and so on. The value of action can also be one of the following strings: 'long', 'broad', 'repeated'. See 'Details' for the interpretation.
include
Flag to indicate whether the orginal data with the missing values should be included. This requires that action is specified as 'long', 'broad' or 'repeated'.
So, the default is to return the first imputed values. In addition, the argument action can also be a string: long, broad, and repeated. If you enter long, it will give you the data in long format. You can also set include = TRUE if you want the original missing data.
ok, but still you have to choose one imputed dataset for further analyses... I think the best option is to analyze using your complete(imp,action="long") and pool the results afterwards.fit <- with(data=imp,exp=lm(bmi~hyp+chl))
pool(fit)
but I also assume its not forbidden to use just one of the imputed datasets ;)
I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a dataframe that is more condusive to eventual functions I want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I change it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be
df3 <- t(df2), but this is most likely correct in your actual code
and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the
loop. j:j+1 is just a single number, since the : has a higher
precedence than + (see ?Syntax for the order in which
mathematical operations are conducted in R). To get the desired two
columns, use j:(j+1) instead.
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().
I'm trying to analyze data from the University of Minnesota IPUMS dataset for the 1990 US census in R. I'm using the survey package because the data is weighted. Just taking the household data (and ignoring the person variables to keep things simple), I am attempting to calculate the mean of hhincome (household income). To do this I created a survey design object using the svydesign() function with the following code:
> require(foreign)
> ipums.household <- read.dta("/path/to/stata_export.dta")
> ipums.household[ipums.household$hhincome==9999999, "hhincome"] <- NA # Fix missing
> ipums.hh.design <- svydesign(id=~1, weights=~hhwt, data=ipums.household)
> svymean(ipums.household$hhincome, ipums.hh.design, na.rm=TRUE)
mean SE
[1,] 37029 17.365
So far so good. However, I get a different standard error if I attempt the same calculation in Stata (using code meant for a different portion of the same dataset):
use "C:\I\Hate\Backslashes\stata_export.dta"
replace hhincome = . if hhincome == 9999999
(933734 real changes made, 933734 to missing)
mean hhincome [fweight = hhwt] # The code from the link above.
Mean estimation Number of obs = 91746420
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
hhincome | 37028.99 3.542749 37022.05 37035.94
--------------------------------------------------------------
And, looking at another way to skin this cat, the author of survey, has this suggestion for frequency weighting:
expanded.data<-as.data.frame(lapply(compressed.data,
function(x) rep(x,compressed.data$weights)))
However, I can't seem to get this code to work:
> hh.dataframe <- data.frame(ipums.household$hhincome, ipums.household$hhwt)
> expanded.hh.dataframe <- as.data.frame(lapply(hh.dataframe, function(x) rep(x, hh.dataframe$hhwt)))
Error in rep(x, hh.dataframe$hhwt) : invalid 'times' argument
Which I can't seem to fix. This may be related to this issue.
So in sum:
Why don't I get the same answers in Stata and R?
Which one is right (or am I doing something wrong in both cases)?
Assuming I got the rep() solution working, would that replicate Stata's results?
What's the right way to do it? Kudos if the answer allows me to use the plyr package for doing arbitrary calculations, rather than being limited to the functions implemented in survey (svymean(), svyglm() etc.)
Update
So after the excellent help I've received here and from IPUMS via email, I'm using the following code to properly handle survey weighting. I describe here in case someone else has this problem in future.
Initial Stata Preparation
Since IPUMS don't currently publish scripts for importing their data into R, you'll need to start from Stata, SAS, or SPSS. I'll stick with Stata for now. Begin by running the import script from IPUMS. Then before continuing add the following variable:
generate strata = statefip*100000 + puma
This creates a unique integer for each PUMA of the form 240001, with first two digits as the state fip code (24 in the case of Maryland) and the last four a PUMA id which is unique on a per state basis. If you're going to use R you might also find it helpful to run this as well
generate statefip_num = statefip * 1
This will create an additional variable without labels, since importing .dta files into R apply the labels and lose the underlying integers.
Stata and svyset
As Keith explained, survey sampling is handled by Stata by invoking svyset.
For an individual level analysis I now use:
svyset serial [pweight=perwt], strata(strata)
This sets the weighting to perwt, the stratification to the variable we created above, and uses the household serial number to account for clustering. If we were using multiple years, we might want to try
generate double yearserial = year*100000000 + serial
to account for longitudinal clustering as well.
For household level analysis (without years):
svyset serial [pweight=hhwt], strata(strata)
Should be self-explanatory (though I think in this case serial is actually superfluous). Replacing serial with yearserial will take into account a time series.
Doing it in R
Assuming you're importing a .dta file with the additional strata variable explained above and analysing at the individual letter:
require(foreign)
ipums <- read.dta('/path/to/data.dta')
require(survey)
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)
Or at the household level:
ipums.hh.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=hhwt)
Hope someone finds this helpful, and thanks so much to Dwin, Keith and Brandon from IPUMS.
1&2) The comment you cited from Lumley was written in 2001 and predates any of his published work with the survey package which has only been out a few years. You are probably using "weights" in two different senses. (Lumley describes three possible senses early in his book.) The survey function svydesign is using probability weights rather than frequency weights. Seems likely that these are not really frequency weights but rather probability weights, given the massive size of that dataset, and that would mean that the survey package result is correct and the Stata result incorrect. If you are not convinced, then the survey package offers the function as.svrepdesign() with which Lumley's book describes how to create a replicate weight vector from a svydesign-object.
3) I think so, but as RMN said ..."It would be wrong."
4) Since it's wrong (IMO) it's not necessary.
You shouldn't be using frequency weights in Stata. That is pretty clear. If IPUMS doesn't have a "complex" survey design, you can just use:
mean hhincome [pw = hhwt]
Or, for convenience:
svyset [pw = hhwt]
svy: mean hhincome
svy: regress hhincome `x'
What's nice about the second option is that you can use it for more complex survey designs (via options on svyset. Then you can run lots of commands without having to typ [pw...] all the time.
Slight addition for people who don't have access to Stata or SAS; (I would put this in comments but...)
The library SAScii can use the SAS code file to read in the IPUMS downloaded data. The code to read in the data is from the doc
library(SAScii)
IPUMS.file.location <- "..\\usa_00007dat\\usa_00007.dat"
IPUMS.SAS.read.in.instructions <- "..\\usa_00007dat\\usa_00007.sas"
#store the IPUMS extract as an R data frame!
IPUMS.df <-
read.SAScii (
IPUMS.file.location ,
IPUMS.SAS.read.in.instructions ,
zipped = F )
How can I plot a very large data set in R?
I'd like to use a boxplot, or violin plot, or similar. All the data cannot be fit in memory. Can I incrementally read in and calculate the summaries needed to make these plots? If so how?
In supplement to my comment to Dmitri answer, a function to calculate quantiles using ff big-data handling package:
ffquantile<-function(ffv,qs=c(0,0.25,0.5,0.75,1),...){
stopifnot(all(qs<=1 & qs>=0))
ffsort(ffv,...)->ffvs
j<-(qs*(length(ffv)-1))+1
jf<-floor(j);ceiling(j)->jc
rowSums(matrix(ffvs[c(jf,jc)],length(qs),2))/2
}
This is an exact algorithm, so it uses sorting -- and thus may take a lot of time.
Problem is you can't load all data into the memory. So you could do sampling of the data, as indicated earlier by #Marek. On such a huge datasets, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Progressive calculation of quantiles is impossible, but this should give a very decent approximation. It is essentially the same as the "randomized method" described in the link #aix gave.
If you can't subset the date outside of R, it can be done using connections in combination with sample(). Following function is what I use to sample data from a dataframe in text format when it's getting too big. If you play a bit with the connection, you could easily convert this to a socketConnection or other to read it from a server, a database, whatever. Just make sure you open the connection in the correct mode.
Good, take a simple .csv file, then following function samples a fraction p of the data:
sample.df <- function(f,n=10000,split=",",p=0.1){
con <- file(f,open="rt",)
on.exit(close(con,type="rt"))
y <- data.frame()
#read header
x <- character(0)
while(length(x)==0){
x <- strsplit(readLines(con,n=1),split)[[1]]
}
Names <- x
#read and process data
repeat{
x <- tryCatch(read.table(con,nrows=n,sep=split),error = function(e) NULL )
if(is.null(x)) {break}
names(x) <- Names
nn <- nrow(x)
id <- sample(1:nn,round(nn*p))
y <- rbind(y,x[id,])
}
rownames(y) <- NULL
return(y)
}
An example of the usage :
#Make a file
Df <- data.frame(
X1=1:10000,
X2=1:10000,
X3=rep(letters[1:10],1000)
)
write.csv(Df,file="test.txt",row.names=F,quote=F)
# n is number of lines to be read at once, p is the fraction to sample
DF2 <- sample.df("test.txt",n=1000,p=0.2)
str(DF2)
#clean up
unlink("test.txt")
All you need for a boxplot are the quantiles, the "whisker" extremes, and the outliers (if shown), which is all easily precomputed. Take a look at the boxplot.stats function.
You should also look at the RSQLite, SQLiteDF, RODBC, and biglm packages. For large datasets is can be useful to store the data in a database and pull only pieces into R. The databases can also do sorting for you and then computing quantiles on sorted data is much simpler (then just use the quantiles to do the plots).
There is also the hexbin package (bioconductor) for doing scatterplot equivalents with very large datasets (probably still want to use a sample of the data, but works with a large sample).
You could put the data into a database and calculate the quantiles using SQL. See : http://forge.mysql.com/tools/tool.php?id=149
This is an interesting problem.
Boxplots require quantiles. Computing quantiles on very large datasets is tricky.
The simplest solution that may or may not work in your case is to downsample the data first, and produce plots of the sample. In other words, read a bunch of records at a time, and retain a subset of them in memory (choosing either deterministically or randomly.) At the end, produce plots based on the data that's been retained in memory. Again, whether or not this is viable very much depends on the properties of your data.
Alternatively, there exist algorithms that can economically and approximately compute quantiles in an "online" fashion, meaning that they are presented with one observation at a time, and each observation is shown exactly once. While I have some limited experience with such algorithms, I have not seen any readily-available R implementations.
The following paper presents a brief overview of some relevant algorithms: Quantiles on Streams.
You could make plots from manageable sample of your data. E.g. if you use only 10% randomly chosen rows then boxplot on this sample shouldn't differ from all-data boxplot.
If your data are on some database there you be able to create some random flag (as I know almost every database engine has some kind of random number generator).
Second thing is how large is your dataset? For boxplot you need two columns: value variable and group variable. This example:
N <- 1e6
x <- rnorm(N)
b <- sapply(1:100, function(i) paste(sample(letters,40,TRUE),collapse=""))
g <- factor(sample(b,N,TRUE))
boxplot(x~g)
needs 100MB of RAM. If N=1e7 then it uses <1GB of RAM (which is still manageable to modern machine).
Perhaps you can think about using disk.frame to summarise the data down first before running the plotting?
The problem with R (and other languages like Python and Julia) is that you have to load all your data into memory to plot it. As of 2022, the best solution is to use DuckDB (there is an R connector), it allows you to query very large datasets (CSV, parquet, among others), and it comes with many functions to compute summary statistics. The idea is to use DuckDB to compute those statistics, load such statistics into R/Python/Julia, and plot.
Computing a boxplot with SQL + R
You need a bunch of statistics to plot a boxplot. If you want a complete reference, you can look at matplotlib's code. The code is in Python, but the code is pretty straightforward, so you'll get it even if you don't know Python.
The most critical piece are percentiles; you can compute those in DuckDB like this (just change the placeholders):
SELECT
percentile_disc(0.25) WITHIN GROUP (ORDER BY "{{column}}") AS q1,
percentile_disc(0.50) WITHIN GROUP (ORDER BY "{{column}}") AS med,
percentile_disc(0.75) WITHIN GROUP (ORDER BY "{{column}}") AS q3,
AVG("{{column}}") AS mean,
COUNT(*) AS N
FROM "{{path/to/data.parquet}}"
You need some other statistics to create the boxplot with all its details. For full implementation, check this (note: it's written in Python). I had to implement this for a package I wrote called JupySQL, which allows plotting very large datasets in Jupyter by leveraging SQL engines such as DuckDB.
Once you compute the statistics, you can use R to generate the boxplot.