compute k-means after PCA

compute k-means after PCA - r

I'm new to R and I want to do a k-means clustering based on the results of pca. I did like this (taken Iris dataset as an example):
library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- iris %>%
select(- Species)
# compute PCA
res.pca <- PCA(df,
scale.unit = TRUE,
graph = FALSE)
summary(res.pca)
# k-means clustering
kc <- kmeans(res.pca, 3)
Then I got an error:Error in storage.mode(x) <- "double" : a list cannot converted automatically into a 'double'.
The output of the PCA are:
> res.pca
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 150 individuals, described by 4 variables
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. for the variables"
4 "$var$cor" "correlations variables - dimensions"
5 "$var$cos2" "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$call" "summary statistics"
12 "$call$centre" "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w" "weights for the individuals"
15 "$call$col.w" "weights for the variables"
>
> summary(res.pca)
Call:
PCA(X = df, scale.unit = TRUE, graph = FALSE)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4
Variance 2.918 0.914 0.147 0.021
% of var. 72.962 22.851 3.669 0.518
Cumulative % of var. 72.962 95.813 99.482 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
1 | 2.319 | -2.265 1.172 0.954 | 0.480 0.168 0.043 | -0.128 0.074 0.003 |
2 | 2.202 | -2.081 0.989 0.893 | -0.674 0.331 0.094 | -0.235 0.250 0.011 |
3 | 2.389 | -2.364 1.277 0.979 | -0.342 0.085 0.020 | 0.044 0.009 0.000 |
4 | 2.378 | -2.299 1.208 0.935 | -0.597 0.260 0.063 | 0.091 0.038 0.001 |
5 | 2.476 | -2.390 1.305 0.932 | 0.647 0.305 0.068 | 0.016 0.001 0.000 |
6 | 2.555 | -2.076 0.984 0.660 | 1.489 1.617 0.340 | 0.027 0.003 0.000 |
7 | 2.468 | -2.444 1.364 0.981 | 0.048 0.002 0.000 | 0.335 0.511 0.018 |
8 | 2.246 | -2.233 1.139 0.988 | 0.223 0.036 0.010 | -0.089 0.036 0.002 |
9 | 2.592 | -2.335 1.245 0.812 | -1.115 0.907 0.185 | 0.145 0.096 0.003 |
10 | 2.249 | -2.184 1.090 0.943 | -0.469 0.160 0.043 | -0.254 0.293 0.013 |
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
Sepal.Length | 0.890 27.151 0.792 | 0.361 14.244 0.130 | -0.276 51.778 0.076 |
Sepal.Width | -0.460 7.255 0.212 | 0.883 85.247 0.779 | 0.094 5.972 0.009 |
Petal.Length | 0.992 33.688 0.983 | 0.023 0.060 0.001 | 0.054 2.020 0.003 |
Petal.Width | 0.965 31.906 0.931 | 0.064 0.448 0.004 | 0.243 40.230 0.059 |
Could somebody help me with this problem ? What should I put instead of res.pca in the kmeans()? I don't know which part of the PCA results should I extract to use in the fonction kmeans()
Thank you in advance.

The principal component scores are stored under res.pca$ind$coord What you want to do kmeans on these:
So we can do:
kc <- kmeans(res.pca$ind$coord, 3)
plot(res.pca$ind$coord[,1:2],col=factor(kc$cluster))

It seems kmeans() expects a numeric matrix as input, however you are giving to it res.pca which is a list. Thus you get the error "cannot convert object of type list to double". "Double" is R's class to matrix or vectors of pure numbers.
I'm not sure about the what the PCA function outputs, so you must find a way to extract the PCA values from it, make it a matrix, and then run kmeans.
Hope it helps.
But for future reference, you can do a few things to make your questions easier to help:
Provide a reproducible example (a df with a few lines)
Translate error messages to english
Add the packages the function is from

Related

Computing marginal effects: Why does ggeffect and ggemmeans give difference answers?

Example
library(glmmTMB)
library(ggeffects)
## Zero-inflated negative binomial model
(m <- glmmTMB(count ~ spp + mined + (1|site),
ziformula=~spp + mined,
family=nbinom2,
data=Salamanders,
na.action = "na.fail"))
summary(m)
ggemmeans(m, terms="spp")
spp | Predicted | 95% CI
--------------------------------
GP | 1.11 | [0.66, 1.86]
PR | 0.42 | [0.11, 1.59]
DM | 1.32 | [0.81, 2.13]
EC-A | 0.75 | [0.37, 1.53]
EC-L | 1.81 | [1.09, 3.00]
DES-L | 2.00 | [1.25, 3.21]
DF | 0.99 | [0.61, 1.62]
ggeffects::ggeffect(m, terms="spp")
spp | Predicted | 95% CI
--------------------------------
GP | 1.14 | [0.69, 1.90]
PR | 0.44 | [0.12, 1.63]
DM | 1.36 | [0.85, 2.18]
EC-A | 0.78 | [0.39, 1.57]
EC-L | 1.87 | [1.13, 3.07]
DES-L | 2.06 | [1.30, 3.28]
DF | 1.02 | [0.63, 1.65]
Questions
Why are ggeffect and ggemmeans giving different results for the marginal effects? Is it simply something internal with how the packages emmeans and effects are computing them? Also, does anyone know of some resources on how to compute marginal effects from scratch for a model like that in the example?

You fit a complex model: zero-inflated negative binomial model with random effects.
What you observe has little to do with the model specification. Let's show this by fitting a simpler model: Poisson with fixed effects only.
library("glmmTMB")
library("ggeffects")
m <- glmmTMB(
count ~ spp + mined,
family = poisson,
data = Salamanders
)
ggemmeans(m, terms = "spp")
#> # Predicted counts of count
#>
#> spp | Predicted | 95% CI
#> --------------------------------
#> GP | 0.73 | [0.59, 0.89]
#> PR | 0.18 | [0.12, 0.27]
#> DM | 0.91 | [0.76, 1.10]
#> EC-A | 0.34 | [0.25, 0.45]
#> EC-L | 1.35 | [1.15, 1.59]
#> DES-L | 1.43 | [1.22, 1.68]
#> DF | 0.79 | [0.64, 0.96]
ggeffect(m, terms = "spp")
#> # Predicted counts of count
#>
#> spp | Predicted | 95% CI
#> --------------------------------
#> GP | 0.76 | [0.62, 0.93]
#> PR | 0.19 | [0.13, 0.28]
#> DM | 0.96 | [0.79, 1.15]
#> EC-A | 0.35 | [0.26, 0.47]
#> EC-L | 1.41 | [1.20, 1.66]
#> DES-L | 1.50 | [1.28, 1.75]
#> DF | 0.82 | [0.67, 1.00]
The documentation explains that internally ggemmeans() calls emmeans::emmeans() while ggeffect() calls effects::Effect().
Both emmeans and effects compute marginal effects but they make a different (default) choice how to marginalize out (ie. average over) mined in order to get the effect of spp.
mined is a categorical variable with two levels: "yes" and "no". The crucial bit is that the two levels are not balanced: there are slightly more "no"s than "yes"s.
xtabs(~ mined + spp, data = Salamanders)
#> spp
#> mined GP PR DM EC-A EC-L DES-L DF
#> yes 44 44 44 44 44 44 44
#> no 48 48 48 48 48 48 48
Intuitively, this means that the weighted average over mined [think of (44 × yes + 48 × no) / 92] is not the same as the simple average over mined [think of (yes + no) / 2].
Let's check the intuition by specifying how to marginalize out mined when we call emmeans::emmeans() directly.
# mean (default)
emmeans::emmeans(m, "spp", type = "response", weights = "equal")
#> spp rate SE df lower.CL upper.CL
#> GP 0.726 0.0767 636 0.590 0.893
#> PR 0.181 0.0358 636 0.123 0.267
#> DM 0.914 0.0879 636 0.757 1.104
#> EC-A 0.336 0.0497 636 0.251 0.449
#> EC-L 1.351 0.1120 636 1.148 1.590
#> DES-L 1.432 0.1163 636 1.221 1.679
#> DF 0.786 0.0804 636 0.643 0.961
#>
#> Results are averaged over the levels of: mined
#> Confidence level used: 0.95
#> Intervals are back-transformed from the log scale
# weighted mean
emmeans::emmeans(m, "spp", type = "response", weights = "proportional")
#> spp rate SE df lower.CL upper.CL
#> GP 0.759 0.0794 636 0.618 0.932
#> PR 0.190 0.0373 636 0.129 0.279
#> DM 0.955 0.0909 636 0.793 1.152
#> EC-A 0.351 0.0517 636 0.263 0.469
#> EC-L 1.412 0.1153 636 1.203 1.658
#> DES-L 1.496 0.1196 636 1.279 1.751
#> DF 0.822 0.0832 636 0.674 1.003
#>
#> Results are averaged over the levels of: mined
#> Confidence level used: 0.95
#> Intervals are back-transformed from the log scale
The second option returns the marginal effects computed with ggeffects::ggeffect.
Update
#Daniel points out that ggeffects accepts the weights argument and will pass it to emmeans. This way you can keep using ggeffects and still control how predictions are averaged to compute marginal effects.
Try it out for yourself with:
ggemmeans(m, terms="spp", weights = "proportional")
ggemmeans(m, terms="spp", weights = "equal")

Changepoints detection in time series in R

I need some guidance regarding how changepoints work in time series. I am trying to detect some changepoints using R, and the package called "changepoint" (https://cran.r-project.org/web/packages/changepoint/changepoint.pdf).
There are options for how to detect when the variance (cpt.var) and the mean (cpt.mean) change, but what I'm trying to look for is when the time series changes trend.
Maybe I'm confused with what changepoints really are, but is there any way to get this information?
I am showing the result of using cpt.var() function, and I have added some arrows, showing what I would like to achieve.
Is there any way to achieve this? I guess should be somehow like inflection points...
I would appreciate any light on this.
Thanks beforehand,
Jon
EDIT
I have tried with the approach of using diff(), but is not detecting the change correctly:
The data I am using is the following:
[1] 10.695 10.715 10.700 10.665 10.830 10.830 10.800 11.070 11.145 11.270 11.015 11.060 10.945 10.965 10.780 10.735 10.705 10.680 10.600 10.335 10.220 10.125
[23] 10.370 10.595 10.680 11.000 10.980 11.065 11.060 11.355 11.445 11.415 11.350 11.310 11.330 11.360 11.445 11.335 11.275 11.300 11.295 11.470 11.445 11.325
[45] 11.300 11.260 11.200 11.210 11.230 11.240 11.300 11.250 11.285 11.215 11.260 11.395 11.410 11.235 11.320 11.475 11.470 11.685 11.740 11.740 11.700 11.905
[67] 11.720 12.230 12.285 12.505 12.410 11.995 12.110 12.005 11.915 11.890 11.820 11.730 11.700 11.660 11.685 11.615 11.360 11.425 11.185 11.275 11.265 11.375
[89] 11.310 11.250 11.050 10.880 10.775 10.775 10.805 10.755 10.595 10.700 10.585 10.510 10.290 10.255 10.395 10.290 10.425 10.405 10.365 10.010 10.305 10.185
[111] 10.400 10.700 10.725 10.875 10.750 10.760 10.905 10.680 10.670 10.895 10.790 10.990 10.925 10.980 10.975 11.035 10.895 10.985 11.035 11.295 11.245 11.535
[133] 11.510 11.430 11.450 11.390 11.520 11.585
And when I do diff() I get this data:
[1] 0.020 -0.015 -0.035 0.165 0.000 -0.030 0.270 0.075 0.125 -0.255 0.045 -0.115 0.020 -0.185 -0.045 -0.030 -0.025 -0.080 -0.265 -0.115 -0.095 0.245
[23] 0.225 0.085 0.320 -0.020 0.085 -0.005 0.295 0.090 -0.030 -0.065 -0.040 0.020 0.030 0.085 -0.110 -0.060 0.025 -0.005 0.175 -0.025 -0.120 -0.025
[45] -0.040 -0.060 0.010 0.020 0.010 0.060 -0.050 0.035 -0.070 0.045 0.135 0.015 -0.175 0.085 0.155 -0.005 0.215 0.055 0.000 -0.040 0.205 -0.185
[67] 0.510 0.055 0.220 -0.095 -0.415 0.115 -0.105 -0.090 -0.025 -0.070 -0.090 -0.030 -0.040 0.025 -0.070 -0.255 0.065 -0.240 0.090 -0.010 0.110 -0.065
[89] -0.060 -0.200 -0.170 -0.105 0.000 0.030 -0.050 -0.160 0.105 -0.115 -0.075 -0.220 -0.035 0.140 -0.105 0.135 -0.020 -0.040 -0.355 0.295 -0.120 0.215
[111] 0.300 0.025 0.150 -0.125 0.010 0.145 -0.225 -0.010 0.225 -0.105 0.200 -0.065 0.055 -0.005 0.060 -0.140 0.090 0.050 0.260 -0.050 0.290 -0.025
[133] -0.080 0.020 -0.060 0.130 0.065
What I get is the next results:
> cpt =cpt.mean(diff(vector), method="PELT")
> (cpt.pts <- attributes(cpt)$cpts)
[1] 137
Appearly does not make sense... Any clue?

In R, there are many packages available for time series changepoint detection. changepoint is definitely a very useful one. A partial list of the packages is summarized in CRAN Task View:
Change point detection is provided in strucchange (using linear regression models), and in trend (using nonparametric tests). The changepoint package provides many popular changepoint methods, and ecp does nonparametric changepoint detection for univariate and multivariate series. changepoint.np implements the nonparametric PELT algorithm, while changepoint.mv detects changepoints in multivariate time series. InspectChangepoint uses sparse projection to estimate changepoints in high-dimensional time series. robcp provides robust change-point detection using Huberized cusum tests, and Rbeast provides Bayesian change-point detection and time series decomposition.
Here is also a great blog comparing several alternative packages: https://www.marinedatascience.co/blog/2019/09/28/comparison-of-change-point-detection-methods/. Another impressive comparison is from Dr. Jonas Kristoffer Lindeløv who developed the mcp package: https://lindeloev.github.io/mcp/articles/packages.html.
Below I used your sample time series to generate some quick results using the Rbeast package developed by myself (chosen here apparently for ego of self-promoting as well as perceived relvance). Rbeast is a Baysian changepoint detection algorithm and it can estimate the probability of changepoint occurrence. It can also be used for decomposing time series into seasonality and trend, but apparently, your time series is trend-only, so in the beast function below, season='none' is specified.
y = c(10.695,10.715,10.700,10.665,10.830,10.830,10.800,11.070,11.145,11.270,11.015,11.060,10.945,10.965,10.780,10.735,10.705,
10.680,10.600,10.335,10.220,10.125,10.370,10.595,10.680,11.000,10.980,11.065,11.060,11.355,11.445,11.415,11.350,11.310,11.330,
11.360,11.445,11.335,11.275,11.300,11.295,11.470,11.445,11.325,11.300,11.260,11.200,11.210,11.230,11.240,11.300,11.250,11.285,
11.215,11.260,11.395,11.410,11.235,11.320,11.475,11.470,11.685,11.740,11.740,11.700,11.905,11.720,12.230,12.285,12.505,12.410,
11.995,12.110,12.005,11.915,11.890,11.820,11.730,11.700,11.660,11.685,11.615,11.360,11.425,11.185,11.275,11.265,11.375,11.310,
11.250,11.050,10.880,10.775,10.775,10.805,10.755,10.595,10.700,10.585,10.510,10.290,10.255,10.395,10.290,10.425,10.405,10.365,
10.010,10.305,10.185,10.400,10.700,10.725,10.875,10.750,10.760,10.905,10.680,10.670,10.895,10.790,10.990,10.925,10.980,10.975,
11.035,10.895,10.985,11.035,11.295,11.245,11.535 ,11.510,11.430,11.450,11.390,11.520,11.585)
library(Rbeast)
out=beast(y, season='none')
plot(out)
print(out)
In the figure above, dashed vertical lines mark the most likely locations of changepoints; the green curve of Pr(tcp) shows the point-wise probability of changepoint occurrence over time. The order_t curve gives the estimated mean order of the piecewise polynomials needed to adequately fit the trend (the 0-th order is constant and the 1st order is linear): An average order toward 0 means that the trend is more likely to be flat and an order close to 1 means that the trend is linear. The output can be also printed as some ascii outputs, as shown below. Again, it says that the time series is most likely to have 8 changepoints; their most probable locations are given in out$trend$cp.
Result for time series #1 (total number of time series in 'out': 1)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ SEASONAL CHANGEPOINTS +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
No seasonal/periodic component present (i.e., season='none')
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ TREND CHANGEPOINTS +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
An ascii plot of the probability dist for number of chgpts(ncp)
---------------------------------------------------------------
Pr(ncp=0 )=0.000|* |
Pr(ncp=1 )=0.000|* |
Pr(ncp=2 )=0.000|* |
Pr(ncp=3 )=0.000|* |
Pr(ncp=4 )=0.000|* |
Pr(ncp=5 )=0.000|* |
Pr(ncp=6 )=0.055|***** |
Pr(ncp=7 )=0.074|****** |
Pr(ncp=8 )=0.575|******************************************** |
Pr(ncp=9 )=0.240|******************* |
Pr(ncp=10)=0.056|***** |
---------------------------------------------------------------
Max ncp : 10 | A parameter you set (e.g., maxTrendKnotNum) |
Mode ncp: 8 | Pr(ncp= 8)=0.57; there is a 57.5% probability|
| that the trend componet has 8 chngept(s). |
Avg ncp : 8.17 | Sum[ncp*Pr(ncp)] |
---------------------------------------------------------------
List of most probable trend changepoints (avg number of changpts: 8.17)
--------------------------------.
tcp# |time (cp) |prob(cpPr)|
-----|---------------|----------|
1 |8.0000 | 0.92767|
2 |112.0000 | 0.91433|
3 |68.0000 | 0.84213|
4 |21.0000 | 0.80188|
5 |32.0000 | 0.78171|
6 |130.0000 | 0.76938|
7 |101.0000 | 0.66404|
8 |62.0000 | 0.61171|
--------------------------------'

If the signal isn't too noisy, you could use diff to detect changepoints in slope instead of mean:
library(changepoint)
set.seed(1)
slope <- rep(sample(10,10)-5,sample(100,10))
sig <- cumsum(slope)+runif(n=length(slope),min = -1, max = 1)
cpt =cpt.mean(diff(sig),method="PELT")
# Show change point
(cpt.pts <- attributes(cpt)$cpts)
#> [1] 58 109 206 312 367 440 447 520 599
plot(sig,type="l")
lines(x=cpt.pts,y=sig[cpt.pts],type="p",col="red",cex=2)
Another option which seems to work better with the data you provided is to use piecewise linear segmentation:
library(ifultools)
changepoints <- linearSegmentation(x=1:length(data),y=data,angle.tolerance = 90,n.fit=10,plot=T)
changepoints
#[1] 13 24 36 58 72 106

Reproducing a result from R in Stata - Telling R or Stata to remove the same variables causing perfect collinearity/singularities

I am trying to reproduce a result from R in Stata (Please note that the data below is fictitious and serves just as an example). For some reason however, Stata appears to deal with certain issues differently than R. It chooses different dummy variables to kick out in case of multicollinearity.
I have posted a related question dealing with the statistical implications of these country-year dummies being removed here.
In the example below, R kicks out 2, while Stata kicks out 3, leading to a different result. Check for example the coefficients and p-values for vote and vote_won.
In essence, all I want to know is how to communicate to either R or Stata, which variables to kick out, so that they both do the same.
Data
The data looks as follows:
library(data.table)
library(dplyr)
library(foreign)
library(censReg)
library(wooldridge)
data('mroz')
year= c(2005, 2010)
country = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
n <- 2
DT <- data.table( country = rep(sample(country, length(mroz), replace = T), each = n),
year = c(replicate(length(mroz), sample(year, n))))
x <- DT
DT <- rbind(DT, DT); DT <- rbind(DT, DT); DT <- rbind(DT, DT) ; DT <- rbind(DT, DT); DT <- rbind(DT, x)
mroz <- mroz[-c(749:753),]
DT <- cbind(mroz, DT)
DT <- DT %>%
group_by(country) %>%
mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
group_by(country, year) %>%
mutate(taxrate = base_rate + as.integer(runif(1,-2.5,+2.5)))
DT <- DT %>%
group_by(country, year) %>%
mutate(vote = sample(c(0,1),1),
votewon = ifelse(vote==1, sample(c(0,1),1),0))
rm(mroz,x, country, year)
The lm regression in R
summary(lm(educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year), data=DT))
Call:
lm(formula = educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year),
data = DT)
Residuals:
Min 1Q Median 3Q Max
-7.450 -0.805 -0.268 0.954 5.332
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.170064 0.418578 26.69 < 0.0000000000000002 ***
exper 0.103880 0.029912 3.47 0.00055 ***
I(exper^2) -0.002965 0.000966 -3.07 0.00222 **
vote 0.576865 0.504540 1.14 0.25327
votewon 0.622522 0.636241 0.98 0.32818
countryA:as.factor(year)2005 -0.196348 0.503245 -0.39 0.69653
countryB:as.factor(year)2005 -0.530681 0.616653 -0.86 0.38975
countryC:as.factor(year)2005 0.650166 0.552019 1.18 0.23926
countryD:as.factor(year)2005 -0.515195 0.638060 -0.81 0.41968
countryE:as.factor(year)2005 0.731681 0.502807 1.46 0.14605
countryG:as.factor(year)2005 0.213345 0.674642 0.32 0.75192
countryH:as.factor(year)2005 -0.811374 0.637254 -1.27 0.20334
countryI:as.factor(year)2005 0.584787 0.503606 1.16 0.24594
countryJ:as.factor(year)2005 0.554397 0.674789 0.82 0.41158
countryA:as.factor(year)2010 0.388603 0.503358 0.77 0.44035
countryB:as.factor(year)2010 -0.727834 0.617210 -1.18 0.23869
countryC:as.factor(year)2010 -0.308601 0.504041 -0.61 0.54056
countryD:as.factor(year)2010 0.785603 0.503165 1.56 0.11888
countryE:as.factor(year)2010 0.280305 0.452293 0.62 0.53562
countryG:as.factor(year)2010 0.672074 0.674721 1.00 0.31954
countryH:as.factor(year)2010 NA NA NA NA
countryI:as.factor(year)2010 NA NA NA NA
countryJ:as.factor(year)2010 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.3 on 728 degrees of freedom
Multiple R-squared: 0.037, Adjusted R-squared: 0.0119
F-statistic: 1.47 on 19 and 728 DF, p-value: 0.0882
Same regression in Stata
write.dta(DT, "C:/Users/.../mroz_adapted.dta")
encode country, gen(n_country)
reg educ c.exper c.exper#c.exper vote votewon n_country#i.year
note: 9.n_country#2010.year omitted because of collinearity
note: 10.n_country#2010.year omitted because of collinearity
Source | SS df MS Number of obs = 748
-------------+---------------------------------- F(21, 726) = 1.80
Model | 192.989406 21 9.18997171 Prob > F = 0.0154
Residual | 3705.47583 726 5.1039612 R-squared = 0.0495
-------------+---------------------------------- Adj R-squared = 0.0220
Total | 3898.46524 747 5.21882897 Root MSE = 2.2592
---------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
exper | .1109858 .0297829 3.73 0.000 .052515 .1694567
|
c.exper#c.exper | -.0031891 .000963 -3.31 0.001 -.0050796 -.0012986
|
vote | .0697273 .4477115 0.16 0.876 -.8092365 .9486911
votewon | -.0147825 .6329659 -0.02 0.981 -1.257445 1.227879
|
n_country#year |
A#2010 | .0858634 .4475956 0.19 0.848 -.7928728 .9645997
B#2005 | -.4950677 .5003744 -0.99 0.323 -1.477421 .4872858
B#2010 | .0951657 .5010335 0.19 0.849 -.8884818 1.078813
C#2005 | -.5162827 .447755 -1.15 0.249 -1.395332 .3627664
C#2010 | -.0151834 .4478624 -0.03 0.973 -.8944434 .8640767
D#2005 | .3664596 .5008503 0.73 0.465 -.6168283 1.349747
D#2010 | .5119858 .500727 1.02 0.307 -.4710599 1.495031
E#2005 | .5837942 .6717616 0.87 0.385 -.7350329 1.902621
E#2010 | .185601 .5010855 0.37 0.711 -.7981486 1.169351
F#2005 | .5987978 .6333009 0.95 0.345 -.6445219 1.842117
F#2010 | .4853639 .7763936 0.63 0.532 -1.038881 2.009608
G#2005 | -.3341302 .6328998 -0.53 0.598 -1.576663 .9084021
G#2010 | .2873193 .6334566 0.45 0.650 -.956306 1.530945
H#2005 | -.4365233 .4195984 -1.04 0.299 -1.260294 .3872479
H#2010 | -.1683725 .6134262 -0.27 0.784 -1.372673 1.035928
I#2005 | -.39264 .7755549 -0.51 0.613 -1.915238 1.129958
I#2010 | 0 (omitted)
J#2005 | 1.036108 .4476018 2.31 0.021 .1573591 1.914856
J#2010 | 0 (omitted)
|
_cons | 11.58369 .350721 33.03 0.000 10.89514 12.27224
---------------------------------------------------------------------------------

Just for your question about which 'variables to kick out": I guess you meant which combination of interaction terms to be used as the reference group for calculating regression coefficients.
By default, Stata uses the combination of the lowest values of two variables as the reference while R uses the highest values of two variables as the reference. I use Stata auto data to demonstrate this:
# In R
webuse::webuse("auto")
auto$foreign = as.factor(auto$foreign)
auto$rep78 = as.factor(auto$rep78)
# Model
r_model <- lm(mpg ~ rep78:foreign, data=auto)
broom::tidy(r_model)
# A tibble: 11 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 26.3 1.65 15.9 2.09e-23
2 rep781:foreign0 -5.33 3.88 -1.38 1.74e- 1
3 rep782:foreign0 -7.21 2.41 -2.99 4.01e- 3
4 rep783:foreign0 -7.33 1.91 -3.84 2.94e- 4
5 rep784:foreign0 -7.89 2.34 -3.37 1.29e- 3
6 rep785:foreign0 5.67 3.88 1.46 1.49e- 1
7 rep781:foreign1 NA NA NA NA
8 rep782:foreign1 NA NA NA NA
9 rep783:foreign1 -3.00 3.31 -0.907 3.68e- 1
10 rep784:foreign1 -1.44 2.34 -0.618 5.39e- 1
11 rep785:foreign1 NA NA NA NA
In Stata:
. reg mpg i.foreign#i.rep78
note: 1.foreign#1b.rep78 identifies no observations in the sample
note: 1.foreign#2.rep78 identifies no observations in the sample
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(7, 61) = 4.88
Model | 839.550121 7 119.935732 Prob > F = 0.0002
Residual | 1500.65278 61 24.6008652 R-squared = 0.3588
-------------+---------------------------------- Adj R-squared = 0.2852
Total | 2340.2029 68 34.4147485 Root MSE = 4.9599
-------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign#rep78 |
Domestic#2 | -1.875 3.921166 -0.48 0.634 -9.715855 5.965855
Domestic#3 | -2 3.634773 -0.55 0.584 -9.268178 5.268178
Domestic#4 | -2.555556 3.877352 -0.66 0.512 -10.3088 5.19769
Domestic#5 | 11 4.959926 2.22 0.030 1.082015 20.91798
Foreign#1 | 0 (empty)
Foreign#2 | 0 (empty)
Foreign#3 | 2.333333 4.527772 0.52 0.608 -6.720507 11.38717
Foreign#4 | 3.888889 3.877352 1.00 0.320 -3.864357 11.64213
Foreign#5 | 5.333333 3.877352 1.38 0.174 -2.419912 13.08658
|
_cons | 21 3.507197 5.99 0.000 13.98693 28.01307
-------------------------------------------------------------------------------
To reproduce the previous R in Stata, we could recode those two variables foreign and rep78:
. reg mpg i.foreign2#i.rep2
note: 0b.foreign2#1.rep2 identifies no observations in the sample
note: 0b.foreign2#2.rep2 identifies no observations in the sample
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(7, 61) = 4.88
Model | 839.550121 7 119.935732 Prob > F = 0.0002
Residual | 1500.65278 61 24.6008652 R-squared = 0.3588
-------------+---------------------------------- Adj R-squared = 0.2852
Total | 2340.2029 68 34.4147485 Root MSE = 4.9599
-------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign2#rep2 |
0 1 | 0 (empty)
0 2 | 0 (empty)
0 3 | -3 3.306617 -0.91 0.368 -9.61199 3.61199
0 4 | -1.444444 2.338132 -0.62 0.539 -6.119827 3.230938
1 0 | 5.666667 3.877352 1.46 0.149 -2.086579 13.41991
1 1 | -5.333333 3.877352 -1.38 0.174 -13.08658 2.419912
1 2 | -7.208333 2.410091 -2.99 0.004 -12.02761 -2.389059
1 3 | -7.333333 1.909076 -3.84 0.000 -11.15077 -3.515899
1 4 | -7.888889 2.338132 -3.37 0.001 -12.56427 -3.213506
|
_cons | 26.33333 1.653309 15.93 0.000 23.02734 29.63933
-------------------------------------------------------------------------------
The same approach applies to reproduce Stata results in R, just redefine levels of those two factor variables.

How to conduct meta-analysis for thousands of genes at one time using R package "metafor"

I'm trying to do a meta-analysis on many genes using R package "metafor", I know how to do it one gene at a time but it would be ridiculous to do so for thousands of genes. Could somebody help me out of this! Appreciate any suggestions!
I have all the results of se and HR for all the genes named 'se_summary' and 'HR_summary' respectively.
I need to use both se and HR of these genes from five studies "ICGC, TCGA, G71, G62, G8" as input to conduct the meta analysis.
The code I used to do the meta analysis for one single gene (using gene AAK1 as an example) is:
library(metafor)
se.AAK1 <- as.numeric(se_summary[rownames(se_summary) == 'AAK1',][,-1])
HR.AAK1 <- as.numeric(HR_summary[rownames(HR_summary) == 'AAK1',][,-1])
beta.AAK1 <- log(HR.AAK1)
####First I need to use the random model to see if the test for Heterogeneity is significant or not.
pool.AAK1 <- rma(beta.AAK1, sei=se.AAK1)
summary(pool.AAK1)
#### and this gives the following output:
#>Random-Effects Model (k = 5; tau^2 estimator: REML)
#> logLik deviance AIC BIC AICc
#> -2.5686 5.1372 9.1372 7.9098 21.1372
#>tau^2 (estimated amount of total heterogeneity): 0.0870 (SE = 0.1176)
#>tau (square root of estimated tau^2 value): 0.2950
#>I^2 (total heterogeneity / total variability): 53.67%
#>H^2 (total variability / sampling variability): 2.16
#>Test for Heterogeneity:
#>Q(df = 4) = 8.5490, p-val = 0.0734
#>Model Results:
#>estimate se zval pval ci.lb ci.ub
#> -0.3206 0.1832 -1.7500 0.0801 -0.6797 0.0385 .
#>---
#>Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
####If the I^2 > 50%, we still use the Random-effect Model but if the I^2 <= 50%, we then use the Fixed-effect Model
pool.AAK1 <- rma(beta.AAK1, sei=se.AAK1, method="FE")
summary(pool.AAK1)
####this gives the following output:
#>Fixed-Effects Model (k = 5)
#> logLik deviance AIC BIC AICc
#> -2.5793 8.5490 7.1587 6.7681 8.4920
#>Test for Heterogeneity:
#>Q(df = 4) = 8.5490, p-val = 0.0734
#>Model Results:
#>estimate se zval pval ci.lb ci.ub
#> -0.2564 0.1191 -2.1524 0.0314 -0.4898 -0.0229 *
#>---
#>Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This works just fine if I got only one gene, but I need to do it all at one time for all these genes and then export the output including "Heterogeneity p-val", and all the model results "estimate, se, zval, pval, ci.lb, ci.ub " to one .txt file, each row for a gene, the output should be like this:
Gene_symbol Heterogeneity_p-val estimate se zval pval ci.lb ci.ub
AAK1 0.0734 -0.2564 0.1191 -2.1524 0.0314 -0.4898 -0.0229
A2M 0.9664 0.1688 0.1173 1.4388 0.1502 -0.0611 0.3987
In case of need, here is a piece of sample data "se_summary"
Gene_symbol ICGC_se TCGA_se G71_se G62_se G8_se
A1CF 0.312 0.21 0.219 0.292 0.381
A2M 0.305 0.21 0.219 0.292 0.387
A2ML1 0.314 0.211 0.222 0.289 0.389
A4GALT 0.305 0.21 0.225 0.288 0.388
A4GNT 0.306 0.211 0.222 0.288 0.385
AAAS 0.308 0.213 0.223 0.298 0.38
AACS 0.307 0.209 0.221 0.287 0.38
AADAC 0.302 0.212 0.221 0.293 0.404
AADAT 0.308 0.214 0.22 0.288 0.391
AAK1 0.304 0.209 0.22 0.303 0.438
AAMP 0.303 0.211 0.222 0.288 0.394
And a piece of sample data "HR_summary"
Gene_symbol ICGC_HR TCGA_HR G71_HR G62_HR G8_HR
A1CF 1.689 1.427 0.864 1.884 1.133
A2M 1.234 1.102 1.11 1.369 1.338
A2ML1 0.563 0.747 0.535 1.002 0.752
A4GALT 0.969 0.891 0.613 0.985 0.882
A4GNT 1.486 0.764 1.051 1.317 1.465
AAAS 1.51 1.178 1.076 0.467 0.681
AACS 1.4 1.022 1.255 1.006 1.416
AADAC 0.979 0.642 1.236 1.581 1.234
AADAT 1.366 1.405 1.18 1.057 1.408
AAK1 1.04 0.923 0.881 0.469 0.329
AAMP 1.122 0.639 1.473 0.964 1.284

point 1: if your data is collected from different populations, you should not use fixed effect model. because HR could be difference among your populations.
point 2: if you convert HR to log(HR), therefore SE should be calculated for log(HR).
your data:
se_summary=data.frame(
Gene_symbol=c("A1CF","A2M","A2ML1","A4GALT","A4GNT","AAAS","AACS","AADAC","AADAT","AAK1","AAMP"),
ICGC_se=c(0.312,0.305,0.314,0.305,0.306,0.308,0.307,0.302,0.308,0.304,0.303),
TCGA_se=c(0.21,0.21,0.211,0.21,0.211,0.213,0.209,0.212,0.214,0.209,0.211),
G71_se=c(0.219,0.219,0.222,0.225,0.222,0.223,0.221,0.221,0.22,0.22,0.222),
G62_se=c(0.292,0.292,0.289,0.288,0.288,0.298,0.287,0.293,0.288,0.303,0.288),
G8_se=c(0.381,0.387,0.389,0.388,0.385,0.38,0.38,0.404,0.391,0.438,0.394))
and
HR_summary=data.frame(
Gene_symbol=c("A1CF","A2M","A2ML1","A4GALT","A4GNT","AAAS","AACS","AADAC","AADAT","AAK1","AAMP"),
ICGC_HR=c(1.689,1.234,0.563,0.969,1.486,1.51,1.4,0.979,1.366,1.04,1.122),
TCGA_HR=c(1.427,1.102,0.747,0.891,0.764,1.178,1.022,0.642,1.405,0.923,0.639),
G71_HR=c(0.864,1.11,0.535,0.613,1.051,1.076,1.255,1.236,1.18,0.881,1.473),
G62_HR=c(1.884,1.369,1.002,0.985,1.317,0.467,1.006,1.581,1.057,0.469,0.964),
G8_HR=c(1.133,1.338,0.752,0.882,1.465,0.681,1.416,1.234,1.408,0.329,1.284))
1)merge data
data=cbind(se_summary,log(HR_summary[,-1]))
2) a function to calculate meta-log HR
met=function(x) {
y=rma(as.numeric(x[7:11]), sei=as.numeric(x[2:6]))
y=c(y$b,y$beta,y$se,y$zval,y$pval,y$ci.lb,y$ci.ub,y$tau2,y$I2)
y
}
3)perform function for all rows
results=data.frame(t(apply(data,1,met)))
rownames(results)=rownames(data)
colnames(results)=c("b","beta","se","zval","pval","ci.lb","ci.ub","tau2","I2")
4)results
> results
b beta se zval pval
A1CF 0.27683114 0.27683114 0.1538070 1.7998601 0.071882735
A2M 0.16877042 0.16877042 0.1172977 1.4388214 0.150201136
A2ML1 -0.37676308 -0.37676308 0.1182825 -3.1852811 0.001446134
A4GALT -0.18975044 -0.18975044 0.1179515 -1.6087159 0.107678477
A4GNT 0.09500277 0.09500277 0.1392486 0.6822528 0.495079085
AAAS -0.07012629 -0.07012629 0.2000932 -0.3504680 0.725987468
AACS 0.15333550 0.15333550 0.1170061 1.3104915 0.190029610
AADAC 0.04902471 0.04902471 0.1738017 0.2820727 0.777887764
AADAT 0.23785528 0.23785528 0.1181503 2.0131593 0.044097875
AAK1 -0.32062727 -0.32062727 0.1832183 -1.7499744 0.080122725
AAMP 0.02722082 0.02722082 0.1724461 0.1578512 0.874574077
ci.lb ci.ub tau2 I2
A1CF -0.024625107 0.57828740 0.04413257 37.89339
A2M -0.061128821 0.39866965 0.00000000 0.00000
A2ML1 -0.608592552 -0.14493360 0.00000000 0.00000
A4GALT -0.420931120 0.04143024 0.00000000 0.00000
A4GNT -0.177919527 0.36792508 0.02455208 25.35146
AAAS -0.462301836 0.32204926 0.12145183 62.23915
AACS -0.075992239 0.38266324 0.00000000 0.00000
AADAC -0.291620349 0.38966978 0.07385974 50.18761
AADAT 0.006285038 0.46942552 0.00000000 0.00000
AAK1 -0.679728455 0.03847392 0.08700387 53.66905
AAMP -0.310767314 0.36520895 0.07266674 50.07330

Put the data in long format, with both the effect sizes and the se data side by side, then use a split and apply rma to each of these. You can make your own version of broom's tidy function just for rma objects.
library(metafor)
library(reshape)
se_summary<-read.table(text="
Gene_symbol ICGC_se TCGA_se G71_se G62_se G8_se
AADAT 0.308 0.214 0.22 0.288 0.391
AAK1 0.304 0.209 0.22 0.303 0.438
AAMP 0.303 0.211 0.222 0.288 0.394
",header=T)
HR_summary<-read.table(text="
Gene_symbol ICGC_HR TCGA_HR G71_HR G62_HR G8_HR
AADAT 0.308 0.214 0.22 0.288 0.391
AAK1 0.304 0.209 0.22 0.303 0.438
AAMP 0.303 0.211 0.222 0.288 0.394
",header=T)
HR_summary<-melt(HR_summary,id.vars = "Gene_symbol")%>%
mutate(.,variable=sapply(strsplit(as.character(variable), split='_', fixed=TRUE), function(x) (x[1])))%>%
rename(gene=variable)
se_summary<-melt(se_summary,id.vars = "Gene_symbol")%>%
mutate(.,variable=sapply(strsplit(as.character(variable), split='_', fixed=TRUE), function(x) (x[1])))%>%
rename(gene=variable)
HR_summary<-merge(HR_summary,se_summary,by=c("Gene_symbol","gene"),suffixes=c(".HR",".se"))
tidy.rma<-function(x) {
return(data.frame(estimate=x$b,se=x$se,zval=x$zval,ci.lb=x$ci.lb,ci.ub=x$ci.ub,k=x$k,Heterog_pv=x$QEp#the main stuff: overall ES, etc
#variance components( random effects stuff): nlvls is n sites
)) #test for heterogeneity q value and p-value
}
rbindlist(lapply(split(HR_summary, droplevels(HR_summary$Gene_symbol)),
function(x)with(x, tidy.rma(rma(yi=value.HR, sei=value.se,method="FE")))),idcol = "Gene_symbol2")

Translating Stata regression into R

I am currently trying to translate Stata regression into R and here is the original code :
char ethnicity[omit]8
char cid[omit]3
xi: reg nationalism i.cid ib(8).ethnicity male age religious education income rural_now rural_prev killed [pw=stdwt] if warcountry ==1, cl(cid)
and here is what I have so far in terms of translating it into R
lm(nationalism ~ cid + ethnicity +male+ age+ religious+ education+ income+ rural_now+ rural_prev+ killed, data=tab5data)
My question is how do I do the first portion of the Stata code ( char ethnicity[omit]8) because I know it is the reference group but I am unsure how to do that in R. Do I need to remove all those groups in the original dataset or do I need to run those groups in a seperate regression all together? Also what exactly does the ib(8) mean?

You can use relevel() in R. The code below uses a user-written command rsource to run R from within Stata to show the equivalence:
. sysuse auto, clear
(1978 Automobile Data)
. saveold auto, version(12) replace
(saving in Stata 12 format, which can be read by Stata 11 or 12)
file auto.dta saved
.
. rsource, terminator(XXX)
Assumed R program path: "/usr/local/bin/R"
Beginning of R output
> library("foreign")
> mydata<-read.dta("~/Desktop/auto.dta")
> mydata$rep78 <- relevel(as.factor(mydata$rep78), ref = 4)
> m1<-lm(price ~ rep78,data = mydata)
> summary(m1)
Call:
lm(formula = price ~ rep78, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-3138.2 -1925.2 -1181.5 369.5 9476.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6071.5 702.4 8.643 2.38e-12 ***
rep781 -1507.0 2221.3 -0.678 0.500
rep782 -103.9 1266.4 -0.082 0.935
rep783 357.7 888.5 0.403 0.689
rep785 -158.5 1140.6 -0.139 0.890
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2980 on 64 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.01449, Adjusted R-squared: -0.0471
F-statistic: 0.2353 on 4 and 64 DF, p-value: 0.9174
>
End of R output
.
. /* Old Way */
. char rep78[omit]4
. xi: reg price i.rep78
i.rep78 _Irep78_1-5 (naturally coded; _Irep78_4 omitted)
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(4, 64) = 0.24
Model | 8360542.63 4 2090135.66 Prob > F = 0.9174
Residual | 568436416 64 8881819 R-squared = 0.0145
-------------+---------------------------------- Adj R-squared = -0.0471
Total | 576796959 68 8482308.22 Root MSE = 2980.2
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Irep78_1 | -1507 2221.338 -0.68 0.500 -5944.633 2930.633
_Irep78_2 | -103.875 1266.358 -0.08 0.935 -2633.715 2425.965
_Irep78_3 | 357.7333 888.5353 0.40 0.689 -1417.32 2132.787
_Irep78_5 | -158.5 1140.558 -0.14 0.890 -2437.026 2120.026
_cons | 6071.5 702.4489 8.64 0.000 4668.197 7474.803
------------------------------------------------------------------------------
.
. /* Post-Stata 11 Way */
. reg price ib4.rep78
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(4, 64) = 0.24
Model | 8360542.63 4 2090135.66 Prob > F = 0.9174
Residual | 568436416 64 8881819 R-squared = 0.0145
-------------+---------------------------------- Adj R-squared = -0.0471
Total | 576796959 68 8482308.22 Root MSE = 2980.2
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rep78 |
1 | -1507 2221.338 -0.68 0.500 -5944.633 2930.633
2 | -103.875 1266.358 -0.08 0.935 -2633.715 2425.965
3 | 357.7333 888.5353 0.40 0.689 -1417.32 2132.787
5 | -158.5 1140.558 -0.14 0.890 -2437.026 2120.026
|
_cons | 6071.5 702.4489 8.64 0.000 4668.197 7474.803
------------------------------------------------------------------------------
. fvset base 4 rep78
. reg price i.rep78
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(4, 64) = 0.24
Model | 8360542.63 4 2090135.66 Prob > F = 0.9174
Residual | 568436416 64 8881819 R-squared = 0.0145
-------------+---------------------------------- Adj R-squared = -0.0471
Total | 576796959 68 8482308.22 Root MSE = 2980.2
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rep78 |
1 | -1507 2221.338 -0.68 0.500 -5944.633 2930.633
2 | -103.875 1266.358 -0.08 0.935 -2633.715 2425.965
3 | 357.7333 888.5353 0.40 0.689 -1417.32 2132.787
5 | -158.5 1140.558 -0.14 0.890 -2437.026 2120.026
|
_cons | 6071.5 702.4489 8.64 0.000 4668.197 7474.803
------------------------------------------------------------------------------

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

compute k-means after PCA - r

The principal component scores are stored under res.pca$ind$coord What you want to do kmeans on these: So we can do: kc <- kmeans(res.pca$ind$coord, 3) plot(res.pca$ind$coord[,1:2],col=factor(kc$cluster))

Related

Computing marginal effects: Why does ggeffect and ggemmeans give difference answers?

Changepoints detection in time series in R

Reproducing a result from R in Stata - Telling R or Stata to remove the same variables causing perfect collinearity/singularities

How to conduct meta-analysis for thousands of genes at one time using R package "metafor"

Translating Stata regression into R

Categories

Resources