I only just discovered plyr, and it has saved me a ton of lines when combining multiple data frames, which is great. But I have another renaming problem I cannot fathom.
I have a list, which contains a number of data frames (this is a subset as there are actually 108 in the real list).
> str(mydata)
List of 4
$ C11:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.96 0.91 0.74 0.5
..$ n.ENSEMBLE.RECALL : num [1:8] 0.88 0.88 0.88 0.88 0.9 0.91 0.94 0.95
$ C12:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.96 0.89 0.86 0.72
..$ n.ENSEMBLE.RECALL : num [1:8] 0.91 0.91 0.91 0.91 0.93 0.96 0.97 0.98
$ C13:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.94 0.79 0.65 0.46
..$ n.ENSEMBLE.RECALL : num [1:8] 0.85 0.85 0.85 0.85 0.88 0.9 0.92 0.91
$ C14:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.98 0.95 0.88 0.74
..$ n.ENSEMBLE.RECALL : num [1:8] 0.91 0.91 0.91 0.91 0.92 0.94 0.95 0.98
What I really want is for each data frame's column names to be prefixed with the name of that data frame in the list. So in the example the columns would be:
C11.X, C11.n.ENSEMBLE.COVERAGE & C11.n.ENSEMBLE.RECALL
C12.X, C12.n.ENSEMBLE.COVERAGE & C12.n.ENSEMBLE.RECALL
C13.X, C13.n.ENSEMBLE.COVERAGE & C13.n.ENSEMBLE.RECALL
C14.X, C14.n.ENSEMBLE.COVERAGE & C14.n.ENSEMBLE.RECALL
Can anyone suggest an elegant approach to renaming columns like this?
Here's a reproducible example using the iris data set:
# produce a named list of data.frames as sample data:
dflist <- split(iris, iris$Species)
# store the list element names:
n <- names(dflist)
# prefix each data frame's column names with its list element name:
Map(function(df, vec) setNames(df, paste(vec, names(df), sep = ".")), dflist, n)
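The same idea can be sketched on a list shaped like mydata from the question. The two small data frames below are a made-up stand-in (only the list names and column names follow the question), so treat this as an illustration rather than the asker's data:

```r
# Toy stand-in for the question's list; the values are illustrative only
mydata <- list(
  C11 = data.frame(X = c("n >= 1", "n >= 2"), n.ENSEMBLE.COVERAGE = c(1, 0.96)),
  C12 = data.frame(X = c("n >= 1", "n >= 2"), n.ENSEMBLE.COVERAGE = c(1, 0.89))
)

# Walk over the data frames and their names in parallel,
# pasting the list name in front of every column name
renamed <- Map(
  function(df, nm) setNames(df, paste(nm, names(df), sep = ".")),
  mydata, names(mydata)
)

names(renamed$C11)
names(renamed$C12)
```

Map keeps the names of the input list, so the result can still be indexed as renamed$C11, renamed$C12 and so on.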
Related
I ran an lm() in R and these are the results from the summary:
Multiple R-squared: 0.8918, Adjusted R-squared: 0.8917
F-statistic: 9416 on 9 and 10283 DF, p-value: < 2.2e-16
and it seems that it is a good model, but if I calculate the R^2 manually I obtain this:
model=lm(S~0+C+HA+L1+L2,data=train)
pred=predict(model,train)
rss <- sum((model$fitted.values - train$S) ^ 2)
tss <- sum((train$S - mean(train$S)) ^ 2)
1 - rss/tss
##[1] 0.247238
rSquared(train$S,(train$S-model$fitted.values))
## [,1]
## [1,] 0.247238
What's wrong?
str(train[,c('S','C','HA','L1','L2')])
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10292 obs. of 5 variables:
$ S : num 19 18 9 12 12 8 21 24 9 8 ...
$ C : Factor w/ 6 levels "D","E","F","I",..: 4 4 4 4 4 4 4 4 4 4 ...
$ HA : Factor w/ 2 levels "A","H": 1 2 1 1 2 1 2 2 1 2 ...
$ L1 : num 0.99 1.41 1.46 1.43 1.12 1.08 1.4 1.45 0.85 1.44 ...
$ L2 : num 1.31 0.63 1.16 1.15 1.29 1.31 0.7 0.65 1.35 0.59 ...
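You are fitting a model without an intercept (the ~0 on the right-hand side of your formula). For such models, summary() computes R^2 against the uncentered total sum of squares, sum(y^2), rather than sum((y - mean(y))^2) as in your manual calculation, so the two numbers are not comparable and the reported value can look misleadingly high. This post explains it very well: https://stats.stackexchange.com/a/26205/99681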
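The discrepancy can be reproduced on a built-in data set. This is a minimal sketch using mtcars with an arbitrary no-intercept formula (not the asker's data or model); the variable names r2_summary, r2_uncentered and r2_centered are just labels chosen here:

```r
# Built-in data; the formula is only an illustration, not the asker's model
fit <- lm(mpg ~ 0 + factor(cyl) + wt, data = mtcars)
y   <- mtcars$mpg
rss <- sum(residuals(fit)^2)

r2_summary    <- summary(fit)$r.squared          # what summary() prints
r2_uncentered <- 1 - rss / sum(y^2)              # total SS taken around zero
r2_centered   <- 1 - rss / sum((y - mean(y))^2)  # total SS taken around the mean

all.equal(r2_summary, r2_uncentered)  # these two should agree
c(summary = r2_summary, centered = r2_centered)
```

Because sum(y^2) is never smaller than the centered sum of squares, the value reported by summary() for a no-intercept fit is at least as large as the centered one.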
While trying to determine the optimal number of clusters for k-means, I tried the mclust package with the following code:
d_clust <- Mclust(df,
                  G = 1:10,
                  modelNames = mclust.options("emModelNames"))
d_clust$BIC
df is a data frame of 132656 obs. of 19 variables. The data is scaled and there are no missing values (I checked for NA/NaN/Inf with is.na and is.finite). All my variables were also converted to numeric with as.numeric.
However, when I run the code, the screen displays "fitting" with a progress bar, which goes up to 11%, and then after a moment I get the error message:
NAs in foreign function call (arg 13)
Does anyone know why I get this error?
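For reference, the finiteness check described above can be written compactly. This is only a sketch on made-up data; all_finite is a hypothetical helper name, and passing this check does not by itself explain or rule out the error:

```r
# Hypothetical helper: TRUE only if every cell is a finite number
# (is.finite() is FALSE for NA, NaN, Inf and -Inf)
all_finite <- function(d) all(is.finite(as.matrix(d)))

# Made-up data frames for illustration
good <- data.frame(X1 = c(0.5, 1, 1), X2 = c(0.714, 0.286, 1))
bad  <- transform(good, X2 = c(0.714, NaN, Inf))

all_finite(good)
all_finite(bad)

# Count of non-finite cells per column, to locate any offenders
colSums(!is.finite(as.matrix(bad)))
```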
EDIT
Output of str(df) (I changed the variable names for confidentiality reasons):
'data.frame': 132656 obs. of 19 variables:
$ X1: num 0.5 1 1 1 0.5 1 1 1 1 1 ...
$ X2: num 0.714 0.286 1 0.857 0.286 ...
$ X3: num 0.667 1 0.667 0.667 0.667 ...
$ X4: num 0.714 0.429 1 0.714 0.429 ...
$ X5: num 0.667 0.333 1 0.667 0.333 ...
$ X6: num 0.5 0.25 1 0.5 0.25 0.25 0 0.5 0.5 0.25 ...
$ X7: num 0.667 0.667 0.667 0.667 0.667 ...
$ X8: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
$ X9: num 0.667 0 0.667 0.333 0 ...
$ X10: num 1 0.833 1 1 1 ...
$ X11: num 1 0.75 1 1 1 1 1 1 1 1 ...
$ X12: num 1 1 1 0.8 1 1 1 1 1 1 ...
$ X13: num 0.5 0.75 0.75 0.5 0.75 0.25 0.75 0.5 0.5 0.5 ...
$ X14: num 0.75 0.75 0.75 1 0.75 0.75 0.75 1 0.75 0.75 ...
$ X15: num 1 0 0.5 1 1 1 0.75 1 0.5 1 ...
$ X16: num 1 0.333 0.667 0.833 0.833 ...
$ X17: num 1 1 1 1 1 1 1 1 1 1 ...
$ X18: num 0.00157 0.000438 0.001059 0.000879 0.004919 ...
$ X19: num 0.5 0.125 1 0.625 0.125 0.125 0.125 1 0.5 0.25 ...
I have a dataframe with the following structure:
'data.frame': 13095 obs. of 1433 variables:
$ my : Factor w/ 624 levels "19631","19632",..: 1 1 1 1 1 1 1 1 1 1 ...
$ s1 : num NA NA NA NA NA NA NA NA NA NA ...
Here my is a factor holding the month number, s1, ..., s1426 are vectors containing my dependent variables, and f1, ..., f6 are vectors containing my independent variables.
s1 is a vector with many NA observations.
I want to regress s1 on f1, ..., f6. To run the regression, I used the following code:
try1 <- lmList(s1 ~ f1+f2+f3+f4+f5+f6 |my , data=d1)
try1
That's the output that I received:
Call: lmList(formula = s1 ~ f1 + f2 + f3 + f4 + f5 + f6 | my, data = d1)
Coefficients:
Error in !unlist(lapply(coefs, is.null)) : invalid argument type
I tried changing na.action to na.omit, but I get the same output.
I then tried creating a new data frame with:
d2<-data.frame(d1$my,d1$s1,d1$f1,d1$f2,d1$f3,d1$f4,d1$f5,d1$f6)
colnames(d2)<-c("my", "s1", "f1", "f2", "f3", "f4", "f5", "f6")
d2 has the following structure:
'data.frame': 13095 obs. of 8 variables:
$ my: Factor w/ 624 levels "19631","19632",..: 1 1 1 1 1 1 1 1 1 1 ...
$ s1: num NA NA NA NA NA NA NA NA NA NA ...
$ f1: num -0.54 1.66 0.68 0.06 0.9 -0.16 0.19 0.25 0.57 -0.1 ...
$ f2: num 0.94 0.98 0.63 0.32 -0.03 0.11 0.2 -0.03 0.07 0.01 ...
$ f3: num 0.31 -0.25 0.02 0.29 0.22 0.07 -0.09 -0.17 0.21 0.28 ...
$ f4: num 1.5 1.7 1.14 -0.02 0.36 0.49 -0.13 0.18 0.14 0.47 ...
$ f5: num -0.5 -1.96 -0.66 -0.17 -0.43 0.24 -0.12 -0.01 -0.58 0.52 ...
$ f6: num 0.38 0.3 0.35 0.3 0.08 0.13 -0.18 -0.05 -0.08 0.03 ...
When I run the regression:
try2<-lmList(s1~f1+f2+f3+f4+f5+f6|my,data=d2)
try2
It works without any problem.
How can I solve this? I need to run a regression for every s column, so I can’t create a new data frame by hand each time. I read the lmList documentation and searched this site, but found nothing about the size of the data frame, so I don’t think the problem is caused by the size of d1; yet in every other respect the two data frames are equivalent.
I also tried to build a reproducible example, but when I create a new data frame, even one with as many NAs as my file, the problem does not occur (i.e. lmList works fine).
I use the lme4 package.
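Since the narrow data frame d2 works, one possible workaround is to build that narrow subset programmatically for each response instead of by hand. The sketch below is an assumption-laden illustration: it uses a small made-up stand-in for d1, only two responses and two predictors, and base R split() plus lm() in place of lmList so that it runs without extra packages. It does not diagnose the original error.

```r
set.seed(42)
# Hypothetical stand-in for d1: grouping factor, two responses, two predictors
d1 <- data.frame(
  my = factor(rep(c("19631", "19632"), each = 10)),
  s1 = rnorm(20), s2 = rnorm(20),
  f1 = rnorm(20), f2 = rnorm(20)
)
d1$s1[1:3] <- NA  # some missing responses, as in the question

predictors <- c("f1", "f2")
responses  <- c("s1", "s2")

fits <- lapply(setNames(responses, responses), function(s) {
  d <- d1[, c("my", s, predictors)]           # keep only the columns this model needs
  f <- reformulate(predictors, response = s)  # e.g. s1 ~ f1 + f2
  # one regression per level of my; lm() drops rows with NA by default
  lapply(split(d, d$my), function(g) coef(lm(f, data = g)))
})

fits$s1[["19631"]]
```

The same column subset d could instead be passed to lmList with the formula s1 ~ f1 + f2 | my, mirroring what worked for d2 in the question.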
I want to add a column to each of the data frames in my list table after running this code:
#list of my dataframes
df <- list(df1,df2,df3,df4)
#compute stats
stats <- function(d) do.call(rbind, lapply(split(d, d[,2]), function(x) data.frame(Nb= length(x$Year), Mean=mean(x$A), SD=sd(x$A) )))
#Apply to list of dataframes
table <- lapply(df, stats)
This column, which I'll call Source, should hold the name of the originating data frame alongside the Nb, Mean and SD variables. So Source should contain df1, df1, df1, ... for table[[1]], and so on.
Is there any way I can add it in my code above?
Here's a different way of doing things:
First, let's start with some reproducible data:
set.seed(1)
n = 10
dat <- list(data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)),
data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)),
data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)),
data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)))
Then, you want a function that adds columns to a data.frame. The obvious candidate is within. The particular things you want to calculate are constant values for each observation within a particular category. To do that, use ave for each of the columns you want to add. Here's your new function:
stat <- function(d){
within(d, {
Nb = ave(a, b, FUN=length)
Mean = ave(a, b, FUN=mean)
SD = ave(a, b, FUN=sd)
})
}
Then just lapply it to your list of data.frames:
lapply(dat, stat)
As you can see, columns are added as appropriate:
> str(lapply(dat, stat))
List of 4
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] -0.626 0.184 -0.836 1.595 0.33 ...
..$ b : int [1:10] 3 1 2 1 1 2 1 2 3 2
..$ SD : num [1:10] 0.85 0.643 0.738 0.643 0.643 ...
..$ Mean: num [1:10] -0.0253 0.649 -0.3058 0.649 0.649 ...
..$ Nb : num [1:10] 2 4 4 4 4 4 4 4 2 4
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] -0.0449 -0.0162 0.9438 0.8212 0.5939 ...
..$ b : int [1:10] 2 3 2 1 1 1 1 2 2 2
..$ SD : num [1:10] 1.141 NA 1.141 0.136 0.136 ...
..$ Mean: num [1:10] -0.0792 -0.0162 -0.0792 0.7791 0.7791 ...
..$ Nb : num [1:10] 5 1 5 4 4 4 4 5 5 5
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] 1.3587 -0.1028 0.3877 -0.0538 -1.3771 ...
..$ b : int [1:10] 2 3 2 1 3 1 3 1 1 1
..$ SD : num [1:10] 0.687 0.668 0.687 0.635 0.668 ...
..$ Mean: num [1:10] 0.873 -0.625 0.873 0.267 -0.625 ...
..$ Nb : num [1:10] 2 3 2 5 3 5 3 5 5 5
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] -0.707 0.365 0.769 -0.112 0.881 ...
..$ b : int [1:10] 3 3 2 2 1 1 3 1 2 2
..$ SD : num [1:10] 0.593 0.593 1.111 1.111 0.297 ...
..$ Mean: num [1:10] -0.318 -0.318 0.24 0.24 0.54 ...
..$ Nb : num [1:10] 3 3 4 4 3 3 3 3 4 4
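For the Source column asked about in the question, one option is to give the list names and add the name as a constant column with Map. This is a minimal sketch, assuming the list elements are named df1, df2, ... (the names and the object with_source are chosen here for illustration):

```r
set.seed(1)
n <- 10
# A named list, so each data frame carries its own label
dat <- list(
  df1 = data.frame(a = rnorm(n), b = sample(1:3, n, TRUE)),
  df2 = data.frame(a = rnorm(n), b = sample(1:3, n, TRUE))
)

# Pair every data frame with its name and prepend it as a Source column
with_source <- Map(function(d, nm) cbind(Source = nm, d), dat, names(dat))

str(with_source$df1)
```

The same Map call can be applied to the list of summary tables instead of the raw data, as long as that list carries the data frame names.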