I'm running through a large dataset chunk by chunk, updating a list of linear models as I go using the biglm function. The issue occurs when a particular chunk does not contain all the factors that I have in my linear model, and I get this error:
Error in update.biglm(model, new) : model matrices incompatible
The description of update.biglm mentions that factor levels must be the same across all chunks. I could probably come up with a workaround to avoid this, but there must be a better way. This pdf, on the 'biglm' page, mentions that "Factors must have their full set of levels
specified (not necessarily present in the data chunk)". So I think there is some way to specify all the possible levels so that I can update a model with not all the factors present, but I can't figure out how to do it.
Here's an example piece of code to illustrate my problem:
df = data.frame(a = rnorm(12),b = as.factor(rep(1:4,each = 3)),c = rep(0:1,6))
model = biglm(a~b+c,data = df
df.new = data.frame(a = rnorm(6),b = as.factor(rep(1:2,each = 3)),c =rep(0:1, 3))
model.new = update(model,df.new)
Thanks for any advice you have.
I came across this problem also. Are the variables in your large data frame specified as factors before breaking them into chunks? Also, is the data set formatted as a data frame?
large_df <- as.data.frame(large_data_set) # just to make sure it's a df.
large_df$factor.vars <- as.factor(large_df$factor.vars)
If this is the case, then all of the factor levels should be preserved in the factor variables even after breaking the data frame into chunks. This will ensure that biglm creates the proper design matrix from the first call, and that all subsequent updates will be compatible.
If you have different data frames from the start, (as you illustrate in your example), perhaps you should merge them into one before breaking down into chunks. Continuing from your example:
df.large <- rbind(df,df.new)
chunk1 <- df.large[1:12,]
chunk2 <- df.large[13:18,]
model <- biglm(a~b+c,data = chunk1)
model.new <- update(model,chunk2) # this is now compatible
Related
How do I extract the 'VarCompContrib" column in the data frame produced using the gageRR function in R?
This is for a GageRR analysis of a measurement system. I'm trying to make a very user friendly program where other people can just enter the information required, like number of operators, parts, and measurements, as well as the measurements themselves, and output the correct analysis. I'm gonna use an if-statement later on to do the "analysis" portion, but I am having trouble actually managing the data frame produced with gageRR.
library(MASS)
library(Rsolnp)
library(qualityTools)
design = gageRRDesign(Operators=3, Parts=10, Measurements=2, randomize=FALSE)
response(design) = c(23,22,22,22,22,25,23,22,23,22,20,22,22,22,24,25,27,28,
23,24,23,24,24,22,22,22,24,23,22,24,20,20,25,24,22,24,21,20,21,22,21,22,21,
21,24,27,25,27,23,22,25,23,23,22,22,23,25,21,24,23)
gdo=gageRR(design)
plot(gdo)
I am looking to get a 7 number column vector under VarCompContrib
For starters, you can look at the structure of gdo with str(gdo). From there, we see that Varcomp is a slot, so we can access it with gdo#Varcomp and just convert it to a data.frame:
library(qualityTools)
design <- gageRRDesign(Operators = 3, Parts = 10, Measurements = 2, randomize = FALSE)
response(design) <- c(
23,22,22,22,22,25,23,22,23,22,20,22,22,22,24,25,27,28,23,24,23,24,24,22,22,22,24,23,22,24,
20,20,25,24,22,24,21,20,21,22,21,22,21,21,24,27,25,27,23,22,25,23,23,22,22,23,25,21,24,23
)
gdo <- gageRR(design)
data.frame(gdo#Varcomp)
# totalRR repeatability reproducibility a a_b bTob totalVar
# 1 1.66441 1.209028 0.4553819 0.4553819 0 1.781211 3.445621
I have a data frame with 30 row and 850 column(features).
when I want to use svm or other classifier with caret and e1071 packages, I faced this error!
Error in terms.formula(formula, data = data) :
duplicated name 'X10Percentile' in data frame using '.'
Even when I want to use feature selection method such as Boruta, I face the same error.
I double check my feature and found nothing. I thought I must have the same column name in data frame so I create a sample data and check as follow:
test<-data.frame("w1"=c(1:6),"w1.1"=c(2:7),"w1"=c(3:8), "ta"=c("T","F","T","F","F","T"))
set.seed(100)
train <- createDataPartition(y=test$ta,p=0.6,list = FALSE)
TrainSet <- test[train,]
TestSet <- test[-train,]
trcontrol_rcv<- trainControl(method="cv", number=10)
svm_test<-svm(ta ~., data=TrainSet,trControl=trcontrol_rcv)
It works good and no Error occurs.
As I see no error happen when test data even has exactly the same colname.
I want to know why this error"Error in terms.formula(formula, data = data) :
duplicated name 'X10Percentile' in data frame using '.'" happen for my data, and how can I eliminate it?
Thank you in advance.
Thank you, everyone. Fortunately, I found the cause of this error.
Because R considers variables as factors. Therefore it makes a data. frame (which in fact is a list).To solve this problem, I converted it into a data numeric in the following way;
test1<-sapply(test,function(x) as.numeric(as.character(x)))
For me that was not the solution, I had a LargeMatrix as an object of only numeric type vectors.
The problem was that some dimnames(MyLargeMatrix) were duplicated. I change them and the error went away.
I'm learning R by using it on one project where I need to extract unique paths from logs.
Now, My workaround (lower) part of the code work, but I had to split the log into two files and perform grouping on them separately, while I tried the same on variables, I was getting all the data in all three path counts.
Can someone point me to what is wrong in the first approach, as I doubt that writing physically files to a disk is intended way?
a = read.csv('download-report-06-10-2017.csv')
yesterdays_data <- a[grepl("2017-10-05", a$Download.Time), ]
todays_data <- a[grepl("2017-10-06", a$Download.Time), ]
write.csv(yesterdays_data, "yesterdays.csv")
write.csv(todays_data, "todays.csv")
path_count <- as.data.frame(table(a$Path))
path_count_today <- as.data.frame(table(todays_data$Path))
path_count_yday <- as.data.frame(table(yesterdays_data$Path))
#### path_count, path_count_today & path_count_yday contain the same values and I expect them to be different ???
yd = read.csv('yesterdays.csv')
td = read.csv('todays.csv')
path_count_td <- as.data.frame(table(td$Path))
path_count_yd <- as.data.frame(table(yd$Path))
#### path_count_td and path_count_yd are different, as I'd expect in upper three variables
I am somewhat new to R, so forgive my basic questions.
I perform a CCA on a full dataset (358 sites, 40 abiotic parameters, 100 species observation).
library(vegan)
env <- read.table("env.txt", header = TRUE, sep = "\t", dec = ",")
otu <- read.table(otu.txt", header = TRUE, sep = "\t", dec = ",")
cca <- cca(otu~., data=env)
cca.plot <- plot(cca, choices=c(1,2))
vif.cca(cca)
ccared <- cca(formula = otu ~EnvPar1,2,n, data = env)
ccared.plot <- plot(ccared, choices=c(1,2))
orditorp(ccared.plot, display="sites")
This works without using sample names in the first columns (initially, the first column containing numeric samples names got interpreted as a variable, so i used tables without that information. When i add site names to the plot via orditorp, it gives "row.name=n" in the plot.)
I want to use my sample names, however. I tried row.names=1 on both tables with sample name information:
envnames <- read.table("envwithnames.txt", header = TRUE, row.names=1, sep = "\t", dec = ",")
otunames <- read.table("otuwithnames.txt", header = TRUE, row.names=1, sep = "\t", dec = ",")
, and any combination of env/otu/envnames/otunames. cca worked out well in any case, but any plot command yielded
plot.ccarownames <- plot(cca(ccarownames, choices=c(1,2)))
Error in rowSums(X) : 'x' must be numeric
My second problem is connected to that: The 358 sites are grouped into 6 groups (4x60,2x59). The complete matrix has this information inferred as an extra column.
Since i couldnt work out the row name problem, i am even more stuck with nominal data, anyhow.
The original matrix contains a first column (sample names, numeric, but can be easily transformed to nominal) and second one (group identity, nominal), followed by biological observations.
What i would like to have:
A CCA containing all six groups that is coloring sites per group.
A CCA containing only data for one group (without manual
construction of individual input tables)
CCA plots that are using my original sample names.
Any help is appreciated! Really, i am stuck with it since yesterday morning :/
I'm using cca() from vegan myself and I have some of your own problems, however I've been able to at least solve your original "row names" problem. I'm doing a CCA analysis on data from 41 soils, with 334 species and 39 environmental factors.
In my case I used
rownames(MyDataSet) <- MyDataSet$ObservationNamesColumn
(I used default names such as MyDataSet for the sake of example here)
However I still had environmental factors which weren't numerical (such as soil texture). You could try checking for non numerical factors in case you have a mistake in your original dataset or an abiotic factor which is not interpreted as numerical for any other reason. To do this you can either use the command str(MyDataSet) which tells you the nature of each of your variable, or lapply(MyDataSet, class) which also tells you the same but in a different output.
In case you have abiotic factors which are not numerical (again, such as texture) and you want to remove them, you can do so by creating a whole new dataset using only the numerical variables (you will still keep your observation names as they were defined as row names), this is rather easy to do and can be done using something similar to this:
MyDataSet.num <- MyDataSet[,sapply(MyDataSet, is.numeric)]
This creates a new data set which has the same rows as the original but only columns (variables) with numeric values. You should be able then to continue your work using this new data set.
I am very new to both R programming and statistics (I'm a microbiologist) but I hope this helps!
I want to use ChemoSpec with a mass spectra of about 60'000 datapoint.
I have them already in one txt file as a matrix (X + 90 samples = 91 columns; 60'000 rows).
How may I adapt this file as spectra data without exporting again each single file in csv format (which is quite long in R given the size of my data)?
The typical (and only?) way to import data into ChemoSpec is by way of the getManyCsv() function, which as the question indicates requires one CSV file for each sample.
Creating 90 CSV files from the 91 columns - 60,000 rows file described, may be somewhat slow and tedious in R, but could be done with a standalone application, whether existing utility or some ad-hoc script.
An R-only solution would be to create a new method, say getOneBigCsv(), adapted from getManyCsv(). After all, the logic of getManyCsv() is relatively straight forward.
Don't expect such a solution to be sizzling fast, but it should, in any case, compare with the time it takes to run getManyCsv() and avoid having to create and manage the many files, hence overall be faster and certainly less messy.
Sorry I missed your question 2 days ago. I'm the author of ChemoSpec - always feel free to write directly to me in addition to posting somewhere.
The solution is straightforward. You already have your data in a matrix (after you read it in with >read.csv("file.txt"). So you can use it to manually create a Spectra object. In the R console type ?Spectra to see the structure of a Spectra object, which is a list with specific entries. You will need to put your X column (which I assume is mass) into the freq slot. Then the rest of the data matrix will go into the data slot. Then manually create the other needed entries (making sure the data types are correct). Finally, assign the Spectra class to your completed list by doing something like >class(my.spectra) <- "Spectra" and you should be good to go. I can give you more details on or off list if you describe your data a bit more fully. Perhaps you have already solved the problem?
By the way, ChemoSpec is totally untested with MS data, but I'd love to find out how it works for you. There may be some changes that would be helpful so I hope you'll send me feedback.
Good Luck, and let me know how else I can help.
many years passed and I am not sure if anybody is still interested in this topic. But I had the same problem and did a little workaround to convert my data to class 'Spectra' by extracting the information from the data itself:
#Assumption:
# Data is stored as a numeric data.frame with column names presenting samples
# and row names including domain axis
dataframe2Spectra <- function(Spectrum_df,
freq = as.numeric(rownames(Spectrum_df)),
data = as.matrix(t(Spectrum_df)),
names = paste("YourFileDescription", 1:dim(Spectrum_df)[2]),
groups = rep(factor("Factor"), dim(Spectrum_df)[2]),
colors = rainbow(dim(Spectrum_df)[2]),
sym = 1:dim(Spectrum_df)[2],
alt.sym = letters[1:dim(Spectrum_df)[2]],
unit = c("a.u.", "Domain"),
desc = "Some signal. Describe it with 'desc'"){
features <- c("freq", "data", "names", "groups", "colors", "sym", "alt.sym", "unit", "desc")
Spectrum_chem <- vector("list", length(features))
names(Spectrum_chem) <- features
Spectrum_chem$freq <- freq
Spectrum_chem$data <- data
Spectrum_chem$names <- names
Spectrum_chem$groups <- groups
Spectrum_chem$colors <- colors
Spectrum_chem$sym <- sym
Spectrum_chem$alt.sym <- alt.sym
Spectrum_chem$unit <- unit
Spectrum_chem$desc <- desc
# important step
class(Spectrum_chem) <- "Spectra"
# some warnings
if (length(freq)!=dim(data)[2]) print("Dimension of data is NOT #samples X length of freq")
if (length(names)>dim(data)[1]) print("Too many names")
if (length(names)<dim(data)[1]) print("Too less names")
if (length(groups)>dim(data)[1]) print("Too many groups")
if (length(groups)<dim(data)[1]) print("Too less groups")
if (length(colors)>dim(data)[1]) print("Too many colors")
if (length(colors)<dim(data)[1]) print("Too less colors")
if (is.matrix(data)==F) print("'data' is not a matrix or it's not numeric")
return(Spectrum_chem)
}
Spectrum_chem <- dataframe2Spectra(Spectrum)
chkSpectra(Spectrum_chem)