I have a very large set of claims data (called data2) with 1 row per enrollee and columns enrolid (enrollment id), jan16allwd,...,dec16allwd, as well as some other fields that aren't relevant to this. For each enrollee I'm looking to extract the coefficient of the regression for (allowed claims~month). I've tried this:
allowed <- c(data2$jan16allwd, data2$feb16allwd, data2$mar16allwd, data2$apr16allwd,
data2$may16allwd, data2$jun16allwd, data2$jul16allwd, data2$aug16allwd,
data2$sept16allwd, data2$oct16allwd, data2$nov16allwd, data2$dec16allwd)
months <- (1:12)
betas.allwd <- unlist(lapply(split(data2, data2$enrolid), function(chunk) {
  return(coef(lm(allowed ~ months, data = chunk))[[2]])
}))
but I keep getting an error about the variable lengths being different. I know it's because allowed isn't being split up by enrolid along with the rest of the data. How can I fix this and return the vector I need?
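One possible direction (a sketch only, with the column names taken from the post and otherwise untested): since each enrollee's twelve monthly values sit in one row, the regression can be run row by row with apply() instead of stacking whole columns into one long vector.
# sketch: fit each enrollee's regression on that enrollee's own row of monthly values
allwd.cols <- c("jan16allwd", "feb16allwd", "mar16allwd", "apr16allwd",
                "may16allwd", "jun16allwd", "jul16allwd", "aug16allwd",
                "sept16allwd", "oct16allwd", "nov16allwd", "dec16allwd")
months <- 1:12
betas.allwd <- apply(data2[, allwd.cols], 1, function(allowed) {
  coef(lm(allowed ~ months))[[2]]   # slope for one enrollee
})
names(betas.allwd) <- data2$enrolid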
Related question:
I'm testing for random intercepts as a preparation for growth curve modeling.
Therefore, I've first created a wide subset and then converted it to a long data set.
When fitting ModelM1 <- gls(ent_act~1, data=school_l) on the long data set, I get an error message because I have missing values; in my long subset these values appear as NaN.
After applying temp<-na.omit(school_l$ent_act), I can fit ModelM1. But when fitting ModelM2 <- lme(temp~1, random=~1|ID, data=school_l), I get an error message about my variables being of unequal lengths.
How can I deal with those missing values?
Any ideas or recommendations?
What might work is to make a temp data frame where you remove entire rows indexed by the negation of the missing condition, !is.na(school_l$ent_act):
temp <- school_l[!is.na(school_l$ent_act), ]
Then re-run the lme call on that filtered data frame. There should now be no mismatch of variable lengths.
ModelM2 <- lme(ent_act ~ 1, random = ~1|ID, data = temp)
Note that using school_l is going to be potentially confusing because it looks so much like school_1 when viewed in Times font.
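An alternative sketch that avoids the temp data frame entirely: both gls() and lme() in nlme accept an na.action argument, so the incomplete rows can be dropped by the fitting function itself (object names as in the question).
library(nlme)
# let nlme drop the rows with missing ent_act instead of pre-filtering
ModelM1 <- gls(ent_act ~ 1, data = school_l, na.action = na.omit)
ModelM2 <- lme(ent_act ~ 1, random = ~1|ID, data = school_l, na.action = na.omit)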
Question:
I have a data.frame (hlth) that consists of 49 columns - a mix of factors (columns 1:24) and numerics (columns 25:49). I am trying to randomly select 50 rows, then calculate column means for the numeric columns only (dropping the other values), and then place the resulting row of means into a new data.frame (beta). I would then like to iterate this process 1000 times.
I have attempted this, but the values that get returned are identical on every iteration, and the new means will not go into the new data.frame.
Here are a few rows and columns of the data.frame (hlth):
DateIn adgadj Sex VetMedCharges pwtfc
1/01/2006 3.033310 STEER 0.00 675.1151
1/10/1992 3.388245 STEER 2540.33 640.2261
1/10/1995 3.550847 STEER 572.78 607.6200
1/10/1996 2.893707 HEIFER 549.42 425.5217
1/10/1996 3.647233 STEER 669.18 403.8238
The code I have used thus far:
set.seed[25]
beta<-data.frame()
net.row<-function(n=50){
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE)
rNames <- row.names(hlth)
subset(hlth,rNames%in%netrow,select=c(25:49))
colMeans(s1,na.rm=TRUE,dims=1)
}
beta$net.row=replicate(1000,net.row()); net.row
The two issues that I have detected are:
1) It returns the same value(s) on each iteration
2) "Error during wrap-up: object of type 'closure' is not subsettable" when assigning to beta$netrow
Any suggestions would be appreciated!!!
Just adding to my comment (and first pasting it here):
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE) should presumably be netrow = ..., and the error is a scoping problem: R is trying to subset the function beta because, presumably again, it can't find netRowMeans in the data.frame you've defined, so it moves on to the global environment and throws an error there.
There are also a couple of other things. You don't assign subset(hlth,rNames%in%netrow,select=c(25:49)) to a variable - I think you mean to assign it to s1 - so colMeans is probably running on something you've set in the global environment.
If you want to pass a variable directly into the data frame beta in that manner, you'll have to initialise beta with the right number of columns and rows - the column means you pass out will be a vector of length 25, so they won't fit in a single column. You would probably be better off initialising a matrix called mat or something (to avoid confusion with scoping errors masking the actual error messages) with 25 columns and 1000 rows.
EDIT: Question has been edited slightly since I posted this, but most points still stand.
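Putting those points together, a sketch of one possible corrected version (column indices 25:49 taken from the question, otherwise untested):
set.seed(25)
net.row <- function(n = 50) {
  netrow <- sample(1:nrow(hlth), size = n, replace = TRUE)   # sample row indices
  s1 <- hlth[netrow, 25:49]                                  # keep only the numeric columns
  colMeans(s1, na.rm = TRUE)                                 # one mean per numeric column
}
# replicate() returns a 25 x 1000 matrix, so transpose it: one row of means per iteration
beta <- as.data.frame(t(replicate(1000, net.row())))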
I have a question regarding the aggregation of imputed data as created by the R-package 'mice'.
As far as I understand it, the 'complete' command of 'mice' is applied to extract the imputed values of, e.g., the first imputation. However, when running a total of ten imputations, I am not sure which imputed values to extract. Does anyone know how to extract the (aggregated) imputed data across all imputations?
Since I would like to enter the data into MS Excel and perform further calculations in another software tool, such a command would be very helpful.
Thank you for your comments. A simple example (from 'mice' itself) can be found below:
R> library("mice")
R> nhanes
R> imp <- mice(nhanes, seed = 23109) #create imputation
R> complete(imp) #extraction of the five imputed datasets (row-stacked matrix)
How can I aggregate the five imputed data sets and extract the imputed values to Excel?
I had a similar issue.
I used the code below, which is good enough for numeric variables.
For the others I thought about randomly choosing one of the imputed results (because averaging can distort them).
My suggested code is (for numerics):
tempData <- mice(data, m = 5, maxit = 50, meth = 'pmm', seed = 500)
completedData <- complete(tempData, 'long')   # all five imputations, row-stacked
a <- aggregate(completedData[, 3:6], by = list(completedData$.id), FUN = mean)   # mean of each variable per original row
You should then join the results back to your data.
I think 'Hmisc' is a better package for this.
If you have already found a nicer / more elegant / built-in solution, please share it with us.
You should use complete(imp, action = "long") to get the values for each imputation. If you look at ?complete, you will find:
complete(x, action = 1, include = FALSE)
Arguments
x
An object of class mids as created by the function mice().
action
If action is a scalar between 1 and x$m, the function returns the data with imputation number action filled in. Thus, action=1 returns the first completed data set, action=2 returns the second completed data set, and so on. The value of action can also be one of the following strings: 'long', 'broad', 'repeated'. See 'Details' for the interpretation.
include
Flag to indicate whether the original data with the missing values should be included. This requires that action is specified as 'long', 'broad' or 'repeated'.
So, the default is to return the first set of imputed values. In addition, the action argument can also be a string: 'long', 'broad', or 'repeated'. If you pass 'long', it will give you the data in long format, with all the imputations stacked. You can also set include = TRUE if you want the original data with the missing values included.
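As a side note on the Excel part of the question, a short sketch (the file name is just an example): the long-format result, including the original rows with missing values, can be written to a CSV file that Excel opens directly.
library(mice)
imp <- mice(nhanes, seed = 23109)
long <- complete(imp, action = "long", include = TRUE)   # .imp == 0 marks the original data
write.csv(long, "nhanes_imputed_long.csv", row.names = FALSE)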
OK, but you still have to choose one imputed dataset for further analyses... I think the best option is to analyse using your complete(imp,action="long") output and pool the results afterwards:
fit <- with(data = imp, exp = lm(bmi ~ hyp + chl))
pool(fit)
But I also assume it's not forbidden to use just one of the imputed datasets ;)
For a marketing class I have to write a function that calculates the retention rate of customers (the probability that a customer is still a customer). So far I have isolated the ids of the individual customers and stored them in the matrix first.transactions.data. I then split them into cohorts (groups of customers by time) with split() and stored them in the list cohort.
Now comes my problem: I calculated another sub-matrix from the full data set, called final.period.data, in which I will calculate the retention rate. To do that, I have to isolate the ids in final.period.data for each cohort. My instructor told me that I should create an additional column in final.period.data that shows TRUE or FALSE depending on whether the cohort's id and final.period.data's id are the same. For this I tried to use exists, but I always receive error messages. I tried the following:
final.period.data <- if(exists(cohort$'1'$id, where = final.period.data$id) final.period.data$same = TRUE)
but I always receive error messages such as "unexpected symbol" or "invalid first argument". I also tried to convert the list cohort into a matrix, but this didn't help either. How do I have to change the exists call, or is there a simpler way to locate the cohort's ids in final.period.data?
Thank you for your help.
You can just create a function that does what you want:
funct <- function(final.period.data){
  if (final.period.data$cohort == '1' & final.period.data$id == <condition2>) {
    # change this to the value for the TRUE condition
  } else {
    # if it doesn't fit the two conditions,
    # change this to the value for the FALSE condition
  }
}
vector <- rep(NA, nrow(final.period.data))              # placeholder column, one entry per row
final.period.data <- cbind(final.period.data, vector)
And use it with the apply function. See ?apply for more information about apply.
But I usually do it with a for loop, first creating the new column and then adding it to the data frame.
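That said, if the goal is simply a TRUE/FALSE column marking whether an id in final.period.data belongs to a given cohort, %in% may be all that is needed (a sketch using the object names from the question):
# flag the rows whose id appears among the ids of cohort '1'
final.period.data$same <- final.period.data$id %in% cohort$'1'$id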
I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a data frame that is more conducive to the eventual function I want to run on each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (i.e. rows 1:100 for every pair of columns up to 200), and to be able to combine the results in a data frame. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now get the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple of suggestions I can make that are independent of knowledge of the specific package, though.
1) The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
2) The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has higher precedence than + (see ?Syntax for the order in which operations are evaluated in R). To get the desired two columns, use j:(j+1) instead.
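A quick illustration of the precedence point:
j <- 3
j:j+1    # evaluates as (j:j) + 1, i.e. the single number 4
j:(j+1)  # the intended pair of indices: 3 4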
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j + 1)]), kern = "bivnorm")
  kernAr <- kernel.area(kud, unin = c("m"), unout = c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints the kernel areas; note that kernelUD() is called first and its output is then passed to kernel.area().
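If you also want to combine the results into a data frame rather than just print them (as the question asks), a sketch along these lines might work, collecting each resampling event's output in a list first (untested; the final stacking step depends on what kernel.area() returns in your version):
areas <- list()
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j + 1)]), kern = "bivnorm")
  areas[[length(areas) + 1]] <- kernel.area(kud, unin = c("m"), unout = c("km2"))
}
areasDf <- do.call(rbind, areas)   # one block of areas per resampling event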