Factor analysis in R - different results?

I've run a factor analysis on a dataset in R using the psych package. Until about a month ago it produced the same output every run, but recently the output has changed. When I run it with older versions of the psych package, the output is also different. I'm at a loss diagnosing this issue and trying to recover the original output. I don't see how it could be a coding issue, since the results were generated in the past; I'm just struggling to get the same output now.
Below is a condensed version of the code.
To check that the psych package is installed:
# check whether the "psych" package is installed; if not, remind the user to install it
if (!("psych" %in% rownames(installed.packages()))) {
  stop("Please install package 'psych' by running 'install.packages(\"psych\")'")
}
library(psych) # this package is needed for factor analysis
To run the FA:
n_factor <- 3
# the variables defined below record the iterative process
LIST_min_in_max_loading_vector <- NULL
LIST_drop_variable <- NULL
min_in_max_loading_vector <- 0
flag <- 1
while (TRUE) {
  cat("Step ", flag, " is done.\n")
  fa_result <- fa(dat, nfactors = n_factor, rotate = "varimax", cor = "poly")
  max_loading_in_each_row <- sapply(1:dim(fa_result$loadings)[1],
                                    function(j) max(abs(fa_result$loadings[j, ])))
  variable_names <- row.names(fa_result$loadings)
  min_in_max_loading_vector <- min(max_loading_in_each_row)
  # Note the cut-off value of 0.5: the smallest of the per-variable maximum
  # absolute loadings must exceed 0.5. This is the stopping condition of the
  # iterative algorithm.
  if (min_in_max_loading_vector > 0.5) {
    break
  }
  min_variable <- variable_names[which(max_loading_in_each_row == min_in_max_loading_vector)]
  cat("The minimum of the maximum absolute loadings is:", min_in_max_loading_vector, "\n")
  drop_index <- which(row.names(fa_result$loadings) == min_variable)
  cat(min_variable, " is dropped in this round.\n\n")
  dat <- dat[, -drop_index]
  # record the dropping process
  LIST_min_in_max_loading_vector[flag] <- min_in_max_loading_vector
  LIST_drop_variable[flag] <- min_variable
  print(fa_result$loadings)
  flag <- flag + 1
}
Can anyone help me troubleshoot this?
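For reference, this is how I am recording the environment and switching psych versions while diagnosing. This is only a sketch; the version number below is a placeholder for whichever psych release produced the original output:
# record the environment that produces the current output
packageVersion("psych")
sessionInfo()

# install a specific historical psych release from the CRAN archive
# (placeholder version -- substitute the release that generated the
# original results)
# install.packages("remotes")
remotes::install_version("psych", version = "1.9.12")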


ggforest error - undefined columns selected

I am trying to make a forest plot for my model with ggforest().
Here is the code to create mock data that reproduces the problem.
The data are formatted according to Therneau for time-dependent covariates; I guess this might be the reason why ggforest() does not operate properly.
library(survival)
library(survminer)

set.seed(1)
repetitions <- floor(sample(rnorm(1:10, 10)))
id <- rep(1:10, times = repetitions)
age <- rep(floor(sample(18:80, 10)), times = repetitions)
diabetes <- rep(sample(0:1, 10, replace = TRUE), times = repetitions)
bil <- sample(4:60, length(id), replace = TRUE)

status <- rep(1, length(id))
indices <- vector(length = 10)
for (i in 1:10) {
  indices[i] <- sum(repetitions[1:i])
}
status[indices] <- 2

daystart <- vector()
a <- vector()
for (i in 1:10) {
  if (i == 1) {
    daystart <- 1:indices[i]
  } else {
    a <- 1:(indices[i] - indices[i - 1])
  }
  daystart <- c(daystart, a)
}
dayend <- daystart + 1

mock_data <- cbind.data.frame(id, age, diabetes, bil, daystart, dayend, status)
mock_data$agegroup <- cut(mock_data$age, 2)

fit2 <- coxph(Surv(daystart, dayend, status) ~ bil + diabetes + strata(agegroup),
              data = mock_data)
ggforest(fit2, data = mock_data)
I get
Error in `[.data.frame`(data, , var) : undefined columns selected
I tried installing a previous version of the broom package (version 0.5.6), as suggested in previous threads, but it didn't resolve the issue. R versions 3.6.1 and 4.1.1 were used. Any ideas?
EDIT: So ggforest() gets confused by +strata(). Removing +strata() produces a plot.
The problem is in this row of the ggforest() function:
terms <- attr(model$terms, "dataClasses")[-1]
As a quick fix, I copy-pasted the body of the function into a new function I created, adding index "-4" so that the strata attributes are not added to terms.
I guess the original function could be changed to accommodate this and exclude strata from terms. But I should stress that I am not great at math or statistics, so I am not 100% sure whether stratifying by a continuous variable such as age is valid for a time-varying Cox proportional hazards analysis: each stratum would end up containing data for only a few individuals of the same age, each with repeated measurements.
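For reference, here is a runnable sketch of that fix as a name-based filter, using the mock data above. These two lines would replace the terms line inside a copied ggforest() body; nothing here patches survminer itself:
# inspect the term classes ggforest() reads, then drop strata() entries
terms <- attr(fit2$terms, "dataClasses")[-1]
terms <- terms[!grepl("^strata\\(", names(terms))]  # filter by name, not position
names(terms)  # "bil" "diabetes" -- strata(agegroup) is gone
Filtering by name rather than hard-coding "-4" keeps the copy working when the model formula changes.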

xgboost in R providing unexpected prediction

Below is code that produces a simple xgboost model to show the issue I've been seeing. Once the model has been built, we predict with it and take the second row of our data. The log of the ratio between the 10-tree and 9-tree predictions should give us the contribution of the 10th tree: 0.00873184 in this case.
Now if we take the input to the tree (matrix "a", which has the value 0.1234561702 in row 2) and walk it through the plotted model, we expect a prediction of 0.0121501638. However, it looks like after the second split (< 0.123456173) it takes the wrong direction and ends up at the node with 0.00873187464, very close to the 0.00873184 the model actually produced!
Does anyone have an idea what is going on?
[Image: the 10th tree as rendered by xgb.plot.tree()]
Versions:
R: 4.1.0
xgboost: 1.4.1.1
dplyr: 1.0.7
data.table: 1.14.0
library(xgboost)
library(dplyr)
library(data.table)

set.seed(2)
a <- matrix(runif(1000, 0.1234561, 0.1234562), ncol = 1, nrow = 1000)
colnames(a) <- c("b")
d <- abs(rnorm(1000, 3 * a[, 1]))
d2 <- xgb.DMatrix(data = a, label = d)
e <- xgboost::xgboost(data = d2, nrounds = 10, method = "hist", objective = "reg:gamma")
xgb.plot.tree(e$feature_names, e, trees = 9)

x <- 2
log(predict(e, a, ntreelimit = 10) / predict(e, a, ntreelimit = 9))[x]
format(a[x, ], nsmall = 10)
For anyone interested in the answer, the xgboost team provided it here:
https://github.com/dmlc/xgboost/issues/7294
In short, xgboost converts the input data to float32 before training, whereas R uses double precision by default. Hence, 0.1234561702 should be converted to float32 before being run through the model. Doing that gives the value 0.123456173, which now takes the right path.
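If you want to see that rounding without leaving R, here is a minimal sketch that round-trips a double through IEEE single precision using base R's writeBin()/readBin() with size = 4 (any float32 conversion would do):
# round-trip a double through a 4-byte (float32) representation
to_float32 <- function(x) {
  readBin(writeBin(x, raw(), size = 4L), "numeric", size = 4L, n = length(x))
}
format(to_float32(0.1234561702), nsmall = 10)
# the rounded value lands on the other side of the 0.123456173 split,
# which is why walking the plotted tree by hand took the other branch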

Using R Package NNMAPSlite to get City Environmental vs Mortality Dataset

I have several questions for those who have worked with RStudio. Currently I need to work with the NMMAPSlite package. However, I found that there is an issue in the package itself when initialising the database connection to the remote DB that stores the NMMAPS city dataset.
In short, I need help to either
resolve the problem with the old NMMAPSlite R package, or
find the NMMAPS dataset in CSV format.
BACKGROUND
As background, I'm using the NMMAPSlite package with the intent of reproducing a paper by Antonio Gasparrini. Attached at the bottom is the code base I would like to run. It requires:
require(dlnm);
require(NMMAPSlite)
The NMMAPSlite package seems to have been deprecated, so I installed it and its dependencies from the archive. The links required to get the dependencies for NMMAPSlite and dlnm are listed below as well.
PROBLEM
The problem occurs when calling initDB(), which reports that it failed to create a remoteDB instance due to invalid object creation. I suspect, rather, that the error comes from the URL not being accepted. Here are the NMMAPS docs that describe the initDB() function. The DB initialisation is necessary to read the city dataset.
The following is the error from the R console when running initDB():
creating directory 'NMMAPS' for local storage
Error in validObject(.Object) :
invalid class “remoteDB” object: object needs a 'url' of type 'http://'
In addition: Warning message:
In grep("^http://", URL, fixed = TRUE, perl = TRUE) :
argument 'perl = TRUE' will be ignored
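That warning hints at the likely root cause: with fixed = TRUE, grep() treats the pattern "^http://" as a literal string rather than a regular expression, so the caret never matches and no real URL passes the remoteDB validity check. A minimal illustration:
url <- "http://www.ihapss.jhsph.edu/NMMAPS/v0.1/outcome"
grepl("^http://", url, fixed = TRUE)  # FALSE: searches for a literal "^http://"
grepl("^http://", url)                # TRUE: "^" anchors as a regex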
QUESTIONS
I know these NMMAPS packages are deprecated and perhaps too old, but I really want to reproduce/replicate Antonio Gasparrini's paper, Distributed lag non-linear models, for my undergraduate thesis project.
Hence,
I wonder if there is any way to get the NMMAPS dataset of city environmental data vs mortality. I visited the official NMMAPS database, but the download link is either broken or the server is down.
Or you could help me find an equivalent of the NMMAPSlite package in R. I just need to download the city dataset that contains the humidity trend, temperature trend, dew point, CO (carbon monoxide) trend, ozone (O3) trend, and deaths/mortality with respect to time for any particular city over two years or more. The most important variables I need are the mortality rate and the ozone (O3) trend.
Or, as a last resort, perhaps you could suggest a dataset similar to the one used in his paper: something from which I can analyze the time relationship to estimate mortality given environmental and air pollution information?
APPENDIX
Definition of initDB
baseurl = "http://www.ihapss.jhsph.edu/NMMAPS/v0.1"

function (basedir = "NMMAPS")
{
    if (!file.exists(basedir))
        message(gettextf("creating directory '%s' for local storage",
            basedir))
    outcome <- new("remoteDB", url = paste(baseurl, "outcome", sep = "/"),
        dir = file.path(basedir, "outcome"), name = "outcome")
    exposure <- new("remoteDB", url = paste(baseurl, "exposure", sep = "/"),
        dir = file.path(basedir, "exposure"), name = "exposure")
    Meta <- new("remoteDB", url = paste(baseurl, "Meta", sep = "/"),
        dir = file.path(basedir, "Meta"), name = "Meta")
    assign("exposure", exposure, .dbEnv)
    assign("outcome", outcome, .dbEnv)
    assign("Meta", Meta, .dbEnv)
}
Code to run:
The error comes from line 3 (the initDB() call):
require(dlnm); require(NMMAPSlite)

##############################
# LOAD AND PREPARE THE DATASET
##############################
initDB()
data <- readCity("ny", collapseAge = TRUE)
data <- data[, c("city", "date", "dow", "death", "tmpd", "dptp", "rhum",
                 "o3tmean", "o3mtrend", "cotmean", "comtrend")]

# TEMPERATURE: CONVERSION TO CELSIUS
data$temp <- (data$tmpd - 32) * 5 / 9

# POLLUTION: O3 AND CO AT LAG-01
data$o3 <- data$o3tmean + data$o3mtrend
data$co <- data$cotmean + data$comtrend
data$o301 <- filter(data$o3, c(1, 1) / 2, sides = 1)
data$co01 <- filter(data$co, c(1, 1) / 2, sides = 1)

# DEW POINT TEMPERATURE AT LAG 0-1
data$dp01 <- filter(data$dptp, c(1, 1) / 2, sides = 1)

##############################
# CROSSBASIS SPECIFICATION
##############################
# FIXING THE KNOTS AT EQUALLY SPACED VALUES
range <- range(data$temp, na.rm = TRUE)
ktemp <- range[1] + (range[2] - range[1]) / 5 * 1:4

# CROSSBASIS MATRIX
ns.basis <- crossbasis(data$temp, varknots = ktemp, cenvalue = 21,
                       lagdf = 5, maxlag = 30)

##############################
# MODEL FIT AND PREDICTION
##############################
ns <- glm(death ~ ns.basis + ns(dp01, df = 3) + dow + o301 + co01 +
            ns(date, df = 14 * 7),
          family = quasipoisson(), data)
ns.pred <- crosspred(ns.basis, ns, at = -16:33)

##############################
# RESULTS AND PLOTS
##############################
# 3-D PLOT (FIGURE 1)
crossplot(ns.pred, label = "Temperature")

# SLICES (FIGURE 2, TOP)
percentiles <- round(quantile(data$temp, c(0.001, 0.05, 0.95, 0.999)), 1)
ns.pred <- crosspred(ns.basis, ns, at = c(percentiles, -16:33))
crossplot(ns.pred, "slices", var = percentiles, lag = c(0, 5, 15, 28),
          label = "Temperature")

# OVERALL EFFECT (FIGURE 2, BELOW)
crossplot(ns.pred, "overall", label = "Temperature",
          title = "Overall effect of temperature on mortality\nNew York 1987–2000")

# RR AT CHOSEN PERCENTILES VERSUS 21C (AND 95%CI)
ns.pred$allRRfit[as.character(percentiles)]
cbind(ns.pred$allRRlow, ns.pred$allRRhigh)[as.character(percentiles), ]

##############################
# THE MOVING AVERAGE MODELS UP TO LAG x (DESCRIBED IN SECTION 5.2)
# CAN BE CREATED BY THE CROSSBASIS FUNCTION INCLUDING THE
# ARGUMENTS lagtype="strata", lagdf=1, maxlag=x
Resources for your context:
Distributed lag non-linear models: link
RStudio's NMMAPSlite package docs: pdf download
RStudio's dlnm package docs: pdf
Duplicate question on another forum: forum
How to install a package from a tar/archive: link
Meanwhile, I will contact the author of this package and see if I can get the dataset, preferably in CSV format.
It seems that your code is based on R < 3.0.0. You might find it difficult to reproduce the paper, as current R is typically > 4.0.0. You could try to install the Windows version of the NMMAPS database from the link given by 'Lil', but you will need to install an older version of R (2.9.2).
Or you could hang on to the latest version of R and do a simple search on GitHub. In case you haven't found the NMMAPS database, you will find how to deal with it here.
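Note also that the crossbasis() call in the code above uses the pre-2.0.0 dlnm interface (varknots/cenvalue/lagdf/maxlag). A rough sketch of the same specification under the current argvar/arglag interface, in case you stay on a recent R (check ?crossbasis for your installed version; centering moved to crosspred()):
# old call: crossbasis(data$temp, varknots = ktemp, cenvalue = 21,
#                      lagdf = 5, maxlag = 30)
ns.basis <- crossbasis(data$temp, lag = 30,
                       argvar = list(fun = "ns", knots = ktemp),
                       arglag = list(fun = "ns", df = 5))
ns.pred <- crosspred(ns.basis, ns, at = -16:33, cen = 21)  # centre at 21C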
You could try this link http://www.biostat.jhsph.edu/IHAPSS/data/NMMAPS/R/ to download the package. There you have the compressed city data, from which you can pick New York manually if initDB() does not work.

Is there a way to make an R function return its internal variable?

I am new to R. I am currently trying to implement a regression based on an instrumental variable, using the sysid R package. I chose this package since it can predict my instrument.
I found a suitable method ("iv" is the function here) to solve my problem, but the R function does not return the predicted instrument as one of its return arguments. I am very much interested in that predicted variable. Is there any way to get it returned?
I already tried to create a clone of this function, but it depends on many functions from the sysid package, so that failed. I also tried using the source() command to link the modified function into my R code, but then the rest of the libraries were delinked from my current script. Please suggest a way to get the predicted instrument. The source code is available here:
https://rdrr.io/cran/sysid/src/R/iv.R
iv4 <- function(z, order = c(0, 1, 0)) {
  na <- order[1]; nb <- order[2]
  # Steps 1-2
  mod_iv <- iv(z, order)
  # Step 3
  w <- resid(mod_iv)
  mod_ar <- ar(w, aic = F, order.max = na + nb)
  Lhat <- signal::Arma(b = c(1, -mod_ar$ar), a = 1)
  # Step 4
  x2 <- matrix(sim(mod_iv$sys, inputData(z)))
  ivcompute(z, x2, order, Lhat)
}
I want the predicted instrument, Lhat, to be returned. I also welcome suggestions for any other package or regression method that can do the same (predict the instrument).
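One workaround, sketched below: copy the body of iv4() into your own function and return Lhat alongside the fit. The helper ivcompute() is reached with ':::' in case it is not exported; this assumes the internals keep the names shown in the linked source, and iv4_keep_Lhat is my own name:
library(sysid)
library(signal)

iv4_keep_Lhat <- function(z, order = c(0, 1, 0)) {
  na <- order[1]; nb <- order[2]
  mod_iv <- iv(z, order)                               # steps 1-2
  w <- resid(mod_iv)                                   # step 3: AR fit on residuals
  mod_ar <- ar(w, aic = FALSE, order.max = na + nb)
  Lhat <- signal::Arma(b = c(1, -mod_ar$ar), a = 1)    # the prefilter of interest
  x2 <- matrix(sim(mod_iv$sys, inputData(z)))          # step 4
  fit <- sysid:::ivcompute(z, x2, order, Lhat)
  list(fit = fit, Lhat = Lhat)                         # expose Lhat to the caller
}
Then res <- iv4_keep_Lhat(z) gives the fitted model as res$fit and the predicted instrument filter as res$Lhat.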

R: autoKrige.cv function in automap package generates NaNs

I'm fairly new to R and I am trying to interpolate temperature measurements that were gathered at different stations across the Netherlands. I have data for about 35 stations that take measurements every 10 minutes, covering a timespan of about two weeks. Accordingly, I figured it would be best to write a loop that takes care of this. To see how well the interpolation technique works, I want to do a cross validation for every timestamp.
To do this I used the autoKrige.cv function from the automap package, and then the compare.cv function from the same package to get an overview of the most important statistics for all timestamps. Besides that, I made sure the cross validation is only done if at least 25 stations registered measurements.
The problem, however, is that my code as described below works most of the time but gives the following warnings in 4 cases:
1. In sqrt(ret[[var.name]]) : NaNs produced
2. In sqrt(ret[[var.name]]) : NaNs produced
3. In sqrt(ret[[var.name]]) : NaNs produced
4. In sqrt(ret[[var.name]]) : NaNs produced
When I then run the compare.cv command on the full list of cross validations, it gives me the following error:
Error in quantile.default(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, :
  missing values and NaN's not allowed if 'na.rm' is FALSE
I'm wondering what causes the autoKrige.cv function to generate NaNs in the cross validation and, more importantly, how I can remove them from results.cv so that I can use the compare.cv function.
rm(list = ls())

# load packages
require(sp)
require(gstat)
require(ggmap)
require(automap)
require(ggplot2)

# load data, available from:
# https://www.dropbox.com/s/qmi3loub29e55io/meassurements_aug.RDS?dl=0
load("download path")

# make data spatial and assign a spatial coordinate system
coordinates(meassurements) <- ~x+y
proj4string(meassurements) <- CRS("+init=epsg:4326")
meassurements_df <- as.data.frame(meassurements)

# loop for cross validation
timestamp <- meassurements$import_log_id
results.cv <- list()
for (i in unique(timestamp)) {
  x <- meassurements_df[which(meassurements$import_log_id == i), ]
  if (sum(!is.na(x$temperature)) > 25) {
    results.cv[[paste0(i)]] <- autoKrige.cv(
      temperature ~ 1,
      meassurements[which(meassurements$import_log_id == i &
                            !is.na(meassurements$temperature)), ])
  }
}

# calculate key statistics (RMSE, MAE, etc.)
compare.cv(results.cv)
Thanks!
I came across the same problem and solved it with the help of remove.duplicates() from the sp package, applied to the SpatialPointsDataFrame used for kriging. Prior to that, I calculated the mean of the relevant variables in the data frame.
library(dplyr)
SPDF@data <- SPDF@data %>%
  group_by(varx, vary, varz) %>%
  mutate_at(vars(one_of(relevant_var)), mean, na.rm = TRUE) %>%
  ungroup()
SPDF <- SPDF %>% remove.duplicates()
At the time I encountered this problem, the Dropbox link above was no longer working, so I could not check this specific example.
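If deduplication does not apply to your data, a more direct route is to drop the affected folds before calling compare.cv(). A sketch, assuming autoKrige.cv()'s usual output structure (an object whose krige.cv_output element holds the residuals and z-scores):
# keep only cross validations whose z-scores contain no NaNs
ok <- sapply(results.cv, function(cv) !any(is.nan(cv$krige.cv_output$zscore)))
compare.cv(results.cv[ok])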
