Path Analysis with plspm - r

I am attempting to build a Partial Least Squares Path Model using 'plspm'. After reading through the tutorial and formatting my data, I am getting hung up on an error:
"Error in if (w.dif < tol || itermax == iter) break : missing value where TRUE/FALSE needed".
I assume that this error is the result of missing values in some of the latent variables (e.g. Soil_Displaced has a lot of NAs because this variable was only measured in a subset of the replicates in the experiment). Is there a way to get around this error and work with variables that have a lot of missing values? I am attaching my code and dataset here; the dataset can also be found in this Dropbox folder: https://www.dropbox.com/sh/51x08p4yf5qlbp5/-al2pwdCol
This is my code for now:
# inner model matrix
warming = c(0,0,0,0,0,0)
Treatment=c(0,0,0,0,0,0)
Soil_Displaced = c(1,1,0,0,0,0)
Mass_Lost_10mm = c(1,1,0,0,0,0)
Mass_Lost_01mm = c(1,1,0,0,0,0)
Daily_CO2 = c(1,1,0,1,0,0)
Path_inner = rbind(warming, Treatment, Soil_Displaced, Mass_Lost_10mm, Mass_Lost_01mm,Daily_CO2 )
innerplot(Path_inner)
# develop the outer model: column indices of the indicators for each latent variable
Path_outter = list(3, 4:5, 6, 7, 8, 9)
# modes: "A" designates each block as reflective
Path_modes = rep("A", 6)
# run it: plspm(Data, inner matrix, outer list, modes)
Path_pls = plspm(data.2011, Path_inner, Path_outter, Path_modes)
Any input on this issue would be helpful. Thanks!

plspm has limited support for missing values; to use it, you have to set the scaling to numeric.
For your example, the code looks as follows:
example_scaling = list(c("NUM"),
c("NUM", "NUM"),
c("NUM"),
c("NUM"),
c("NUM"),
c("NUM"))
Path_pls = plspm(data.2011, Path_inner, Path_outter, Path_modes, scaling = example_scaling)
But here is the limitation: if your dataset contains even one observation where all indicators of a latent variable are missing, this won't work.
First case: the latent variable "Treatment" has 2 indicators; if only one of them is NA in a given observation, it works fine.
Second case: if there is just one observation where both indicators are NA, it won't work.
Since you are measuring the other 5 latent variables with just one indicator each, and you say your data contains lots of missing values, the second case will likely apply.
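To check in advance whether the second case applies, one can count, per block, the rows in which every indicator is NA. The following is a minimal sketch on made-up data using only base R; `toy` and `blocks` are hypothetical stand-ins for data.2011 and Path_outter.

```r
# Hypothetical toy data: columns 1-2 form one block, column 3 another
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(2, 5, NA, NA),
  c = c(NA, 1, 2, 3)
)
blocks <- list(1:2, 3)  # same shape as an outer-model list of column indices

# For each block, flag the rows where all of its indicators are missing
all_na <- sapply(blocks, function(cols) {
  rowSums(!is.na(toy[, cols, drop = FALSE])) == 0
})

# Number of fully missing rows per block
n_all_na <- colSums(all_na)
print(n_all_na)

# Rows that have at least one observed indicator in every block
usable <- toy[rowSums(all_na) == 0, , drop = FALSE]
print(usable)
```

Any block with a non-zero count would need those rows removed or imputed before calling plspm with numeric scaling.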

plspm would not run with the missing values in my data, so I had to interpolate some of the missing values from known observations. Once this was done, the code above worked great!

Related

Normalising the data by scale() function changes the correlation in R

I have some data in data frame format and I want to see the correlation between two groups of these data (observation and estimation), both in the original form and in the normalized form. The data is available here: data. Here is the code:
joint_df_estimation_observation <- data.frame(df_estimation = as.vector(as.matrix(rowSums(estimation_data))), df_observation = rowSums(observation_data))
joint_df_estimation_observation
cor.test(joint_df_estimation_observation$df_estimation, joint_df_estimation_observation$df_observation)
and I got the results:
normalised_estimation_data =
as.data.frame(scale(as.matrix(t(as.matrix(estimation_data))), scale = TRUE))
normalised_estimation_data = as.data.frame(t(as.matrix(normalised_estimation_data)))
normalised_observation_data = as.data.frame(scale(as.matrix(t(as.matrix(observation_data))), scale = TRUE))
normalised_observation_data = as.data.frame(t(as.matrix(normalised_observation_data)))
joint_df_estimation_observation_normalisation <- data.frame(df_estimation = as.vector(as.matrix(rowSums(normalised_estimation_data))), df_observation = rowSums(normalised_observation_data))
joint_df_estimation_observation_normalisation
cor.test(joint_df_estimation_observation_normalisation$df_estimation, joint_df_estimation_observation_normalisation$df_observation)
and I got the result
I do not understand why the correlation is so different. I thought the correlation would remain the same even if the data are normalized. Is there anything wrong with what I have done? I considered that NAs may be generated for rows where the original data have a constant value across all columns, and that this might cause the issue. However, I removed those rows and it does not change the results.
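One thing worth checking here, offered as a sketch on made-up data and not as a diagnosis of the real dataset: scale() centres each column, so scaling the transposed data centres each original row. Every row of the normalised data then sums to (numerically) zero, and a correlation between such row sums reflects floating-point noise, not the data. The matrix `m` below is hypothetical.

```r
set.seed(1)
# Hypothetical stand-in for estimation_data: 5 rows, 4 columns
m <- matrix(rnorm(20, mean = 10, sd = 3), nrow = 5)

# Same pattern as in the question: transpose, scale, transpose back
m_norm <- t(scale(t(m), scale = TRUE))

# scale() centred each column of t(m), i.e. each row of m,
# so the row sums of the normalised data are all approximately zero
row_sums <- rowSums(m_norm)
print(row_sums)

# The row sums of the original data still vary
print(rowSums(m))
```

A correlation is unchanged when each vector passed to cor.test() is rescaled as a whole, which is a different operation from normalising each row before summing.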

Error in array, regression loop using "plyr"

Good morning,
I'm currently trying to run a truncated regression in a loop over my dataset. Below is a reproducible example of my data frame.
library(plyr)
library(truncreg)
df <- data.frame("grid_id" = rep(c(1,2), 6),
"htcm" = rep(c(160,170,175), 4),
stringsAsFactors = FALSE)
View(df)
Now I tried to run a truncated regression on the variable "htcm", grouped by grid_id, to obtain only the coefficients (the intercept as well as sigma), which I then stored in a data frame. This code is based on ideas from @hadley:
reg <- dlply(df, "grid_id", function(.)
truncreg(htcm ~ 1, data = ., point = 160, direction = "left")
)
regcoef <- ldply(reg, coef)
While this code works for one of my three datasets, I receive an error message for the other two. The datasets do not differ in their columns, only in their number of rows
(nrow(df1) = 4,000; nrow(df2) = 100,000; nrow(df3) = 13,000).
The error message is
"Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of type vector, was 'NULL'"
I do not know how to build a reproducible example that triggers this error, because the code works fine with one of my three datasets.
I have already accounted for missing values in both columns.
Does anyone have a guess as to what I can fix in this code?
Thanks!!
EDIT:
I think I found the origin of the error in my code. A truncated regression model estimates a standard deviation (sigma), which requires more than one observation per group. Because some groups contain only n = 1 observation, the standard deviation is zero for them, which apparently leaves my code with NULL instead of a vector. How can I drop the groups with fewer than two observations within the regression code?
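One way to drop small groups before fitting, sketched in base R on made-up data: count the rows per grid_id with ave() and keep only groups with at least two observations. The data frame `toy` and the threshold `min_n` are assumptions for illustration.

```r
# Hypothetical data: grid_id 2 has only a single observation
toy <- data.frame(
  grid_id = c(1, 1, 2, 3, 3, 3),
  htcm    = c(160, 170, 175, 165, 180, 172)
)
min_n <- 2  # assumed minimum group size

# ave() returns, for every row, the size of the group that row belongs to
group_size <- ave(seq_len(nrow(toy)), toy$grid_id, FUN = length)

# Keep only rows from groups that are large enough
toy_kept <- toy[group_size >= min_n, , drop = FALSE]
print(table(toy_kept$grid_id))
```

The filtered data frame could then be passed to dlply() in place of the original one.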

argument "x" is missing, with no default in ezANOVA

I am getting a weird problem with ezANOVA. When I try to execute code below it says that some data is missing, but when I look at the data, nothing is missing.
model_acc <- ezANOVA(data = nback_acc_summary[complete.cases(nback_acc_summary),],
dv = Stimulus1.ACC,
wid = Subject,
within = c(ExperimentName, Target),
between = Group,
type = 3,
detailed = TRUE)
When I run these lines I get an error message that says:
Error in ezANOVA_main(data = data, dv = dv, wid = wid, within = within, :
One or more cells is missing data. Try using ezDesign() to check your data.
Then I run
ezDesign(nback_acc_summary)
And get the message:
Error in as.list(c(x, y, row, col)) :
argument "x" is missing, with no default
I am not sure what to change in the code, because I can't really figure out what the problem is. I've researched the issue online, and it seems like quite a lot of users have encountered it before, but there is a very limited amount of solutions posted. I would be grateful for any kind of help.
Thanks!
For an ANOVA model, you must have observations in all conditions created by the design of your model.
For example, if ExperimentName, Target, and Group each have two levels, you have 2 x 2 x 2 = 8 conditions, each of which requires multiple observations. On top of that, your model is repeated measures, which means that each Subject within a level of your between factor Group must have observations for all of the within conditions (i.e., ExperimentName x Target = 2 x 2 = 4).
The first error suggests that at least one of the conditions in your model has no data. The second error occurs because ezDesign() was called with only the data; it also needs the x and y arguments.
The following should produce a plot to help identify which conditions are missing data:
ezDesign(
data = nback_acc_summary[complete.cases(nback_acc_summary), ],
x = Target,
y = Subject,
row = ExperimentName,
col = Group
)
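The same check can be sketched without a plot by cross-tabulating subjects against the within-subject cells and looking for zero counts. The data frame `toy` below is made up; only the column names Subject, ExperimentName and Target are taken from the question.

```r
# Hypothetical fully crossed design: 3 subjects x 2 x 2 within-subject cells
toy <- expand.grid(
  Subject = c("s1", "s2", "s3"),
  ExperimentName = c("A", "B"),
  Target = c("yes", "no"),
  stringsAsFactors = TRUE
)
# Remove one cell for subject s2 to simulate missing data
toy <- toy[!(toy$Subject == "s2" &
             toy$ExperimentName == "B" &
             toy$Target == "no"), ]

# Count observations per subject and within-subject cell
cell <- interaction(toy$ExperimentName, toy$Target)
counts <- table(toy$Subject, cell)
print(counts)

# Subjects with at least one empty cell
incomplete <- rownames(counts)[rowSums(counts == 0) > 0]
print(incomplete)
```

Subjects listed in `incomplete` would have to be dropped, or their missing cells filled, before a repeated-measures model can be fitted.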

cpquery function of bnlearn always returns 0 for simple discrete artificial data

I want to calculate conditional probabilities from a Bayesian network fitted to some binary data I created. However, bnlearn::cpquery always returns 0, while bnlearn::bn.fit fits a correct model.
# Create data for binary chain network X1->Y->X2 with zeros and ones.
# All base rates are 0.5, increase in probability due to parent = .5 (from .25 to .75)
X1<-c(rep(1,9),rep(1,3),1,rep(1,3),rep(0,3),0,rep(0,3),rep(0,9))
Y<-c(rep(1,9),rep(1,3),0,rep(0,3),rep(1,3),1,rep(0,3),rep(0,9))
X2<-c(rep(1,9),rep(0,3),1,rep(0,3),rep(1,3),0,rep(1,3),rep(0,9))
dag1<-data.frame(X1,Y,X2)
# Fit bayes net to data.
res <- hc(dag1)
fittedbn <- bn.fit(res, data = dag1)
# Fitting works as expected, as seen by graph structure and coefficients in fittedbn:
plot(res)
print(fittedbn)
# Conditional probability query
cpquery(fittedbn, event = (Y==1), evidence = (X1==1), method = "ls")
# Using LW method
cpquery(fittedbn, event = (Y==1), evidence = list(X1 = 1), method = "lw")
'cpquery' just returns 0. I have also tried using the predict function; however, this returns an error:
predict(object = fittedbn, node = "Y", data = list(X1=="1"), method = "bayes-lw", prob = True)
# Returns:
# Error in check.data(data, allow.levels = TRUE) :
# the data must be in a data frame.
In the cpquery calls above the expected result is .75, but 0 is returned. This is not specific to this event or evidence: regardless of what I put in (e.g. event = (X2==1), evidence = (X1==0), or event = (X2==1), evidence = (X1==0 & Y==1)), the function returns 0.
Because I thought the small number of observations might be an issue, I also tried increasing it (i.e. vertically concatenating the above data frame with itself many times); however, this did not change the output.
I've seen many threads noting that cpquery can be fragile, but none describe this issue. Note that the example in the 'cpquery' documentation works as expected, so the problem does not seem to be due to my environment.
Any help would be greatly appreciated!
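A possible explanation, offered as an assumption and not a confirmed diagnosis: the columns are numeric, and bnlearn treats numeric columns as continuous, so bn.fit() estimates a Gaussian network (hence the regression coefficients in the printout), in which an exact event such as Y == 1 has probability zero. The sketch below checks the expected value of .75 directly from the data and converts the columns to factors. The bnlearn calls run only if the package is installed, and the learned structure, and therefore the query result, may differ from the intended chain.

```r
# Data exactly as in the question
X1 <- c(rep(1,9),rep(1,3),1,rep(1,3),rep(0,3),0,rep(0,3),rep(0,9))
Y  <- c(rep(1,9),rep(1,3),0,rep(0,3),rep(1,3),1,rep(0,3),rep(0,9))
X2 <- c(rep(1,9),rep(0,3),1,rep(0,3),rep(1,3),0,rep(1,3),rep(0,9))
dag1 <- data.frame(X1, Y, X2)

# Empirical P(Y = 1 | X1 = 1), computed straight from the data
p_empirical <- mean(dag1$Y[dag1$X1 == 1] == 1)
print(p_empirical)

# Convert every column to a factor so the network is treated as discrete
dag1f <- dag1
dag1f[] <- lapply(dag1f, factor)

# Optional: repeat the query on the discrete network (needs bnlearn)
if (requireNamespace("bnlearn", quietly = TRUE)) {
  fitted_f <- bnlearn::bn.fit(bnlearn::hc(dag1f), data = dag1f)
  # Factor levels are character strings, so compare against "1"
  p_query <- bnlearn::cpquery(fitted_f, event = (Y == "1"),
                              evidence = (X1 == "1"))
  print(p_query)
}
```

cpquery() is based on random sampling, so its result varies slightly from run to run.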

"Input datasets must be dataframes" error in kamila package in R

I have a mixed type data set, one continuous variable, and eight categorical variables, so I wanted to try kamila clustering. It gives me an error when I use one continuous variable, but when I use two continuous variables it is working.
library(kamila)
data <- read.csv("mixed.csv",header=FALSE,sep=";")
conInd <- 9
conVars <- data[,conInd]
conVars <- data.frame(scale(conVars))
catVarsFac <- data[,c(1,2,3,4,5,6,7,8)]
catVarsFac[] <- lapply(catVarsFac, factor)
kamRes <- kamila(conVars, catVarsFac, numClust=5, numInit=10,calcNumClust = "ps",numPredStrCvRun = 10, predStrThresh = 0.5)
Error in kamila(conVar = conVar[testInd, ], catFactor = catFactor[testInd, : Input datasets must be dataframes
I think the problem is that the function assumes that you have at least two of both data types (i.e. >= 2 continuous variables, and >= 2 categorical variables). It looks like you supplied a single column index (conInd = 9, just column 9), so you have only one continuous variable in your data. Try adding another continuous variable to your continuous data.
I had the same problem (with categoricals) and this approach fixed it for me.
I think the ultimate source of the error in the program is at around line 170 of the source code. Here's the relevant snippet...
numObs <- nrow(conVar)
numInTest <- floor(numObs/2)
for (cvRun in 1:numPredStrCvRun) {
for (ithNcInd in 1:length(numClust)) {
testInd <- sample(numObs, size = numInTest, replace = FALSE)
testClust <- kamila(conVar = conVar[testInd,],
catFactor = catFactor[testInd, ],
numClust = numClust[ithNcInd],
numInit = numInit, conWeights = conWeights,
catWeights = catWeights, maxIter = maxIter,
conInitMethod = conInitMethod, catBw = catBw,
verbose = FALSE)
When the code selects the test subset of your data, it is indexing rows of a one-column data.frame, and in that case R returns a vector by default. So you end up with "not a data.frame" even though you did supply a data.frame. That's where the error comes from.
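The dropping behaviour is easy to reproduce in base R. In this minimal sketch, `one_col` and `idx` are made-up stand-ins for conVar and testInd; `drop = FALSE` is a base R indexing argument, not something taken from the kamila source.

```r
# Hypothetical one-column data frame and row index
one_col <- data.frame(x = c(1.5, 2.5, 3.5, 4.5))
idx <- c(1, 3)

# Default row indexing of a one-column data frame drops to a plain vector
dropped <- one_col[idx, ]

# drop = FALSE keeps the data.frame class
kept <- one_col[idx, , drop = FALSE]

print(is.data.frame(dropped))  # FALSE
print(is.data.frame(kept))     # TRUE
```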
If you can't dig up another variable to add to your data, you could edit the code so that the calls to kamila in the cvRun for loop wrap data.frame() around any subsetted conVar or catFactor, e.g.
testClust <- kamila(conVar = data.frame(conVar[testInd,]),
catFactor = data.frame(catFactor[testInd,]), ... )
and just save that as your own version of the function, called, say, my_kamila, which you could use instead.
Hope this helps.

Resources