I have been trying to use the mdatools package to run a pls-da using the plsda() function. I have data with 9000 variables and around 30 observations. Each of the 30 rows are patients; the first column contains the clinical status for each patient (disease or control) and the remaining 8999 columns contain numerical data on the patients. I used the following code to run the plsda:
plsda(data[,2:9000], data[,1], ncomp = 8999, coeffs.ci = 'jk')
When the code finally finishes running, it returns an error saying "Error in selectCompNum.pls(model, ncomp): wrong number of selected components!"
I chose ncomp = 8999 because that is the total number of columns from 2 to 9000... and the strange thing is, this approach worked well with a low number of components. For example, when I tried
plsda(data[,2:10], data[,1], ncomp = 9, coeffs.ci = 'jk')
no error message is returned.
Perhaps I am misunderstanding how to select the right number of components? I would greatly appreciate any help. Thank you very much in advance!
I am the developer of the mdatools package and just came across your question by accident. The number of components in PLS/PLS-DA is the number of latent variables. Every latent variable is a linear combination of the original variables. Normally you need far fewer components than original variables; depending on the type of data, the number of components can range from 1-2 up to 10-20. I recommend looking at the PLS part of the tutorial and asking me directly (e.g. by email) if you still have any questions or issues with the package.
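For example, something along these lines (a minimal sketch only, assuming the same data object as in your question; cv = 1 requests full leave-one-out cross-validation, which is what the jack-knife confidence intervals are typically computed from):

library(mdatools)

# fit with a small number of latent variables instead of 8999
m <- plsda(data[, 2:9000], data[, 1],
           ncomp = 10,           # upper bound on latent variables to try
           cv = 1,               # full (leave-one-out) cross-validation
           coeffs.ci = 'jk')

summary(m)   # compare performance for 1..10 components and pick the number
             # that looks optimal before refitting or interpreting the model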
I am trying to make a PLS-SEM model and I am using the plsm() function in R from the semPLS package. However, at first I got an error saying:
The latent variables are not allowed to coincide with names of observed variables.
I understood it, but after going through my input and even adding single-factor constructs (directly measured variables) to my measurement model matrix, I now get the following:
mod <- plsm(data = survey, strucmod = smin, measuremod = mmin)
Error in plsm(data = survey, strucmod = smin, measuremod = mmin) :
The manifest variables must be contained in the data.
I am at a loss as to how I should proceed. It seems that whenever I "fix" one problem, it directly causes another. Does anyone have any examples aside from the standard mobi example from the package where I could see how it's done when I have both latent and directly measured variables?
Found the code for the function, but now I'm even more confused.
https://github.com/cran/semPLS/blob/master/R/plsm.R
Could anyone explain in a simple manner how I am supposed to name my df columns, and the measurement model to avoid this problem?
I don't know if you ever solved this, but I just had a similar issue and seemed to be the only other person with it. I ended up getting it solved through some trial and error.
I created three tables:
structural model: SM - column names: Source | Target
measurement model: MM - column names: Source | Target
data: column names are the measurement (manifest) headers
I then converted the SM and MM tables to matrices:
datamatrix_SM = as.matrix(SM)
datamatrix_MM = as.matrix(MM)
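If it helps, here is a minimal end-to-end sketch with made-up variable names (not your data), showing the naming rules that tripped me up, as far as I understand plsm(): latent names must not appear among the data's column names, and every manifest name in the measurement model must exactly match a column of the data frame.

library(semPLS)

# made-up survey data: columns are the manifest (observed) variables
survey <- data.frame(
  q1 = rnorm(50), q2 = rnorm(50),   # indicators for latent "Trust"
  q3 = rnorm(50), q4 = rnorm(50)    # indicators for latent "Loyalty"
)

# structural model: latent -> latent
smin <- rbind(c("Trust", "Loyalty"))
colnames(smin) <- c("source", "target")

# measurement model: latent -> manifest; targets must match names(survey)
mmin <- rbind(
  c("Trust",   "q1"), c("Trust",   "q2"),
  c("Loyalty", "q3"), c("Loyalty", "q4")
)
colnames(mmin) <- c("source", "target")

mod <- plsm(data = survey, strucmod = smin, measuremod = mmin)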
I'm trying to create a list of conjoint cards using R.
I have followed the professor's introduction with my own dataset, but I'm stuck on this issue and have no idea how to resolve it.
library(conjoint)
experiment <- expand.grid(
  ServiceRange = c("RA", "Active", "Passive", "Basic"),
  IdentProce = c("high", "mid", "low"),
  Fee = c(1000, 500, 100),
  Firm = c("KorFin", "KorComp", "KorStrt", "ForComp")
)
print(experiment)
design=caFactorialDesign(data=experiment, type="orthogonal")
print(design)
at the "design" line, I'm keep getting the following error message:
Error in optFederov(~., data, nTrials = i, approximate = FALSE, nRepeats = 50) :
nTrials must not be greater than the number of rows in data
How do I address this issue?
You're getting this error because you have 144 rows in experiment, but the nTrials mentioned in the error gets bigger than 144. This causes an error for optFederov(), which is called inside caFactorialDesign(). The problem stems from the fact that your Fee column has relatively large values.
I'm not familiar with how the conjoint package is set up, but I can show you how to troubleshoot this error. You can read the conjoint documentation for more on how to select appropriate experimental data.
(Note that the example data in the documentation always has very low numeric values, usually values between 1-10. Compare that with your Fee vector, which has values up to 1000.)
You can see the source code for a function loaded into your RStudio namespace by highlighting the function name (e.g. caFactorialDesign) and hitting Command-Return (on a Mac - probably something similar on PC). You can also just look at the source code on GitHub.
caFactorialDesign() is implemented here. That link highlights the line (26) that is throwing the error for you:
temp.design<-optFederov(~., data, nTrials=i, approximate=FALSE, nRepeats=50)
Recall the error message:
nTrials must not be greater than the number of rows in data
You've passed in experiment as the data parameter, so nrow(experiment) will tell us what the upper limit on nTrials is:
nrow(experiment) # 144
We can actually just think of the error for this dataset as:
nTrials must not be greater than 144
Ok, so how is the value for nTrials determined? We can see nTrials is actually an argument to optFederov(), and its value is set as i - often a sign that there's a for-loop wrapping an operation. And in fact, that's what we see:
for (i in ca.number: profiles.number)
{
temp.design<-optFederov(~., data, nTrials=i, approximate=FALSE, nRepeats=50)
...
}
This tells us that optFederov() is going to get called for each value of i in the loop, which will start at ca.number and will go up to profiles.number (inclusive).
How are these two variables assigned? If we look a little higher up in the caFactorialDesign() definition, ca.number is defined on lines 5-9:
num <- data.frame(data.matrix(data))
vars.number<-length(num)
levels.number<-0
for (i in 1:length(num)) levels.number<-levels.number+max(num[i])
ca.number<-levels.number-vars.number+1
You can run these calculations outside of the function - just remember that data == experiment. So just change that first line to num <- data.frame(data.matrix(experiment)), and then run that chunk of code. You can see that ca.number == 1008!!
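Concretely, that chunk with experiment substituted in looks like this (the comments trace where 1008 comes from):

num <- data.frame(data.matrix(experiment))
vars.number <- length(num)                 # 4 variables
levels.number <- 0
for (i in 1:length(num)) levels.number <- levels.number + max(num[i])
levels.number                              # 4 + 3 + 1000 + 4 = 1011 (Fee dominates)
ca.number <- levels.number - vars.number + 1
ca.number                                  # 1011 - 4 + 1 = 1008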
In other words, the very first value of i in the for-loop which calls optFederov() is already way bigger than the max limit: 1008 >> 144.
It's possible you can include these numeric values as factors or strings in your definition of experiment - I'm not sure if that is an appropriate way to do this analysis. But I hope it's clear that you won't be able to use such large values in caFactorialDesign(), unless you have a much larger number of total observations in your data.
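For example, if treating Fee as a categorical attribute is acceptable for your analysis (a sketch only; I can't judge whether that is statistically appropriate for your conjoint study), declaring it as a factor means data.matrix() codes it as 1-3 rather than 100-1000, so ca.number drops to (4 + 3 + 3 + 4) - 4 + 1 = 11, comfortably below the 144 rows:

library(conjoint)

experiment2 <- expand.grid(
  ServiceRange = c("RA", "Active", "Passive", "Basic"),
  IdentProce = c("high", "mid", "low"),
  Fee = factor(c(1000, 500, 100)),   # coded 1-3, not 100-1000
  Firm = c("KorFin", "KorComp", "KorStrt", "ForComp")
)

design2 <- caFactorialDesign(data = experiment2, type = "orthogonal")

This should at least get past the nTrials check that was failing before.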
I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graph to the numbers in the report, I noticed that they are different. Upon investigating further, I noticed that my cleaning code (which removed participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than previously. I now have 730 participants, whereas previously there were 702. I don't know if this is due to updates of some packages, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variables is a set number. Ideally it would be something like this, but of course I know that it's not going to work in this form:
iris %>%
filter_if(mean(.$Petal.Length) == 1.3)
I know this attempt is incorrect, but I don't know how else I would try this, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
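To make the arithmetic concrete, here is a purely hypothetical worked example (the means are invented, not taken from your report):

old_mean <- 3.2              # hypothetical mean from the original report
new_mean <- 3.3              # hypothetical mean from the current data

old_sum   <- 702 * old_mean  # 2246.4
new_sum   <- 730 * new_mean  # 2409.0
extra_sum <- new_sum - old_sum   # 162.6, contributed by the 28 extra cases

contributions <- c(extra_sum / new_sum, old_sum / new_sum)
contributions                # ~0.0675 and ~0.9325, summing to 1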
I am trying to (eventually) plot data by groups, using the prodlim function.
I'm adjusting and adapting code that someone else (not available for questions) has written, and I'm not very familiar with the prodlim library/function. There are definitely other ways to do what I'd like to, but I'm trying to keep it consistent with what the previous person did.
I have code that works, when dividing the data into 2 groups, but when I try to adjust for a 4 group situation, I get an error.
Of note, the data is coming over from SAS using StatTransfer, which has been working fine.
I am new to coding, but I have compared the dataframes I'm trying to work with. The second is just a subset of the first (where the code does work), with all the same variables, and both of the variables I'm trying to group by are integer values.
Hist(medpop$dz_time, medpop$dz_status) works just fine, so the problem must be with the prodlim function, and I haven't understood much of what I've looked up about it, sadly :/ But the documentation seems to indicate it supports continuous or categorical variables, and doesn't seem limited to binary ones either. None of the options seem applicable as I understand them.
this works:
M <- prodlim(Hist(dz_time, dz_status)~med, data=pop)
where med is a binary variable (= 1 when a member of this population is taking the medication), and dz is a disease that some portion of the population develops.
this does not:
(either of these gets the error below)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=medpop)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=pop, subset=pop$med==1)
medpop = the subset of the original population taking the med,
strength = categorical variable ("1","2","3","4")
For the line that does work, the next step is just plot(M), giving a plot with two lines, med==0 and med==1 (showing cumulative incidence of dz_status by dz_time).
For the other line, I get an error saying
Error in KernSmooth::dpik(cumtabx/N, kernel = "box") :
scale estimate is zero for input data
I don't know what that means or how to fix it.. :/
I am doing research in a lab with a mentor who has developed a model that analyzes genetic data, which utilizes an ANOVA. I have simulated a dataset that I want to use in evaluating our model's ability to handle varying levels of missing data.
Our dataset consists of 15 species, with 4 individuals each, which we represent by naming the columns 'A'(x4) 'B'(x4)...etc. Each row represents a gene.
I'm trying to come up with code that removes 1% of the data at random, but such that each species has at least 2 individuals with valid data, because otherwise our model will just quit out (since it's ANOVA-based).
I realize this makes the 'randomly' missing data not so random, but we're trying different methods. It's important that the missing data is otherwise randomized. I'm hoping someone could help me with setting this up?
Here is a toy example that may help:
is_valid_df <- function(df, col, val) {
  # TRUE if every group in column `col` still has more than `val` rows
  all(table(df[col]) > val)
}

filter_function <- function(df, perc, col, val) {
  n <- dim(df)[1]
  # sample row indices to drop (perc = proportion of rows to remove);
  # round() in case n * perc is not a whole number
  filter <- sample(1:n, round(n * perc))
  if (is_valid_df(df[-filter, ], col, val)) {
    return(df[-filter, ])
  } else {
    # constraint violated: report it and resample
    cat("resampling\n")
    filter_function(df, perc, col, val)
  }
}
set.seed(20)
a<-(filter_function(iris,0.1,"Species",44))