Convert an ff object to a data.frame - r

I am working with a big matrix and the ff package.
I am loading an ff object and I want to use it to calculate a crps (a score).
For example, I have an ff_matrix (called Mat, with 25 rows and 7303 columns) which is a precipitation forecast: the 7303 columns are days (about 20 years) and the 25 rows are the precipitation simulations for each day. I also have an ff_array with the observations for these 20 years (called Obs, with 7303 values).
With the ensembleBMA package I want to calculate the CRPS. I need to put my ff_matrix and my ff_array into an "ensembleBMA" object (in fact this is a data.frame).
For this code:
ensembleBMA(Mat,Obs)
I have this error:
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class 'c("ff_matrix", "ff_array", "ff")' into a data.frame
I tried different options such as:
as.data.frame(Mat)
as.matrix(Mat)
transform.ffdf(as.ffdf(Mat))
I always have these errors:
Error in as.data.frame.default(Mat_Ptot_212_1) : cannot automatically convert class 'c("ff_matrix", "ff_array", "ff")' into a data frame (data.frame)
or
opening ff /tmp/RtmpWrlY4n/clone9d3376b435.ff Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : write error
Does anyone have an idea?

One way is to first convert your ff_array to an array and convert that to a data.frame:
Mat <- ff(1, vmode="double", dim=c(25, 7303))
as.data.frame(Mat[,])
or first convert your ff_array to an ffdf and convert that to a data.frame:
as.ffdf(Mat)[,]
or
as.data.frame(as.ffdf(Mat))
The last two solutions seem to be much slower than the first. This probably has to do with the large number of columns, which slows down as.ffdf because it has to create 7303 files.
There does not seem to be an as.data.frame.ff_array method.
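As a minimal sketch of the first approach, the snippet below uses an ordinary in-memory matrix as a stand-in for what Mat[,] returns (the ff package is not loaded here and the values are made up). It also shows a transposed variant, on the assumption that the downstream function wants one row per day and one column per simulation.

```r
# Stand-in for Mat[,]: an in-memory 25 x 7303 matrix (illustrative values only)
set.seed(1)
mat_ram <- matrix(runif(25 * 7303), nrow = 25, ncol = 7303)

# Direct conversion keeps the layout: 25 rows, 7303 columns
df_wide <- as.data.frame(mat_ram)

# Assumption: one row per day and one column per simulation is the layout wanted
df_days <- as.data.frame(t(mat_ram))

dim(df_wide)
dim(df_days)
```

Whether the transposed layout is the right one depends on what the scoring function expects, so check its documentation before choosing.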

Related

How to extract components from an object of class "spec"?

I am trying to construct a table of power spectra and run into this problem:
Define the table:
V <- tibble(month=double(),day=double(),hour=double(),minutes=double(),
frequency=double(),power=double(),period=double())
compute the spectrum:
S <- spec.pgram(Spec2d$Inst,spans=windowSize,log="yes")
which creates an object of class "spec"
I need to extract the data from S and put it into V. When I try:
V$frequency <- S$freq
I get this error message:
Error: Assigned data `S$freq` must be compatible with existing data.
x Existing data has 0 rows.
x Assigned data has 48 rows.
ℹ Only vectors of size 1 are recycled.
which doesn't make sense to me. I have tried to coerce S$freq into different types of objects but nothing works.
S$freq is a vector of length 48, as in the error message.
What is going on? Is there a workaround?
Don't initialise the data frame/tibble first. The empty tibble has 0 rows, so a 48-element column cannot be assigned to it. Try:
S <- spec.pgram(Spec2d$Inst,spans=windowSize,log="yes")
V <- data.frame(frequency = S$freq)
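The same idea extends to building the whole table in one call. This is a sketch with a simulated series standing in for Spec2d$Inst; taking power from S$spec and defining period as 1/frequency are assumptions about what the table should hold.

```r
set.seed(1)
x <- rnorm(96)                      # simulated stand-in for Spec2d$Inst
S <- spec.pgram(x, plot = FALSE)    # returns an object of class "spec"

# Build the table from the components of S instead of filling an empty one
V <- data.frame(
  frequency = S$freq,
  power     = S$spec,
  period    = 1 / S$freq            # assumed definition of period
)
head(V)
```

Columns such as month, day and hour can be added the same way once their values are available, as long as each has one value per row or a single value to recycle.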

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)

xgb.DMatrix Error: The length of labels must equal to the number of rows in the input data

I am using xgboost in R.
I created the xgb matrix fine using a matrix as input, but when I reduce the number of columns in the matrix data, I receive an error.
This works:
> dim(ctt1)
[1] 6401 5901
> xgbmat1 <- xgb.DMatrix(
Matrix(data.matrix(ctt1)),
label = as.matrix(as.numeric(data$V2)) - 1
)
This does not:
> dim(ctt1[,nr])
[1] 6401 1048
xgbmat1 <- xgb.DMatrix(
Matrix(data.matrix(ctt1[,nr])),
label = as.matrix(as.numeric(data$V2)) - 1)
Error in xgb.setinfo(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
In my case I fixed this error by changing the assignment of the labels:
labels <- df_train$target_feature
It turns out that after removing some columns, some rows contain only 0s and cannot contribute to the model.
For sparse matrices, the xgboost R interface uses the CSC format creation method. The problem currently is that this method determines the number of rows automatically from the stored (non-zero) values, so any completely sparse rows at the end are not counted. A similar loss of completely sparse columns at the end can happen with the CSR sparse format. For more details see xgboost issue #1223 and also Wikipedia on the sparse matrix formats.
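One way to check whether this is what happened is to look for all-zero rows after subsetting the columns. A minimal base-R sketch with a small made-up matrix follows; m, keep and labels are hypothetical names, and no xgboost call is made.

```r
# Small made-up example: the last row is non-zero only in column 3
m <- rbind(c(1, 0, 2),
           c(0, 3, 0),
           c(0, 0, 4))
labels <- c(0, 1, 1)

keep <- c(1, 2)                        # columns retained after the reduction
sub  <- m[, keep, drop = FALSE]

# Rows with no non-zero entry left after the reduction
zero_rows <- which(rowSums(sub != 0) == 0)
zero_rows

# A row count inferred from the last stored value falls short of length(labels)
last_nonzero <- max(which(rowSums(sub != 0) > 0))
c(rows_seen = last_nonzero, labels = length(labels))
```

If zero_rows includes the final rows of the data, the mismatch described above is a likely cause of the error.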
The proper way to create the DMatrix is like this:
xgtrain <- xgb.DMatrix(data = as.matrix(X_train[,-5]), label = X_train$item_cnt_month)
Drop the label column from the data argument and take the label from the same data set. In my case item_cnt_month is column five, so I drop it at run time with X_train[,-5] and refer to the same data set for the label column.
Before splitting your data, you need to turn it into a data frame.
For example:
data <- read.csv(...)
data = as.data.frame(data)
Now you can set your train data and test data to use in your "sparse.model.matrix" and "xgb.DMatrix".

Removing/parsing rows from a matrix in R

I'm trying to parse out specific rows from a data matrix. The actual data is numeric and comprises a single column. I've used this method before for other data, and I cannot figure out why this isn't working.
csize = data.matrix(wc$Csize)
length(csize)
[1] 134
csize[-111,][-110,][-107,][-105,][-104,][-94,][-88,][-68,][-58,][-57,][-56,][-30,][-22,][,1]
Error in csize[-111, ][-110, ] : incorrect number of dimensions
Here is the code that does work for me with other data:
w.pc.res <- prcomp(sizeshapew)
w.pcdata <- w.pc.res$x
length(w.pcdata)
[1] 11792
w.pcdata[-111,][-110,][-107,][-105,][-104,][-94,][-88,][-68,][-58,][-57,][-56,][-30,][-22,][,1]
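I don't think it likes the multiple subscripting. Because csize has a single column, csize[-111,] already drops to a plain vector, so the next [-110,] fails with "incorrect number of dimensions". Just provide the subscripts in a vector, e.g. csize[c(-111, -110, ...),]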
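A minimal sketch of the vector form, using a made-up 134 x 1 matrix as a stand-in for data.matrix(wc$Csize). drop = FALSE is added here so the result stays a one-column matrix; that argument is an addition, not part of the original suggestion.

```r
csize <- matrix(seq_len(134), ncol = 1)   # stand-in for data.matrix(wc$Csize)

# Rows to remove, taken from the chained call in the question
drop_rows <- c(111, 110, 107, 105, 104, 94, 88, 68, 58, 57, 56, 30, 22)

# One subscript, negative vector; drop = FALSE keeps the matrix shape
kept <- csize[-drop_rows, , drop = FALSE]
dim(kept)

# Without drop = FALSE a single-column matrix collapses to a vector
is.null(dim(csize[-111, ]))
```

kept[, 1] then gives the remaining values as a plain vector, which matches the trailing [,1] in the original call.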

Unable to Convert Chi-Squared Values into a Numeric Column in R

I've been working on a project for a little bit for a homework assignment and I've been stuck on a logistical problem for a while now.
What I have at the moment is a list that returns 10000 values in the format:
[[10000]]
X-squared
0.1867083
(This is the 10000th value of the list)
What I really would like is to just have the chi-squared value alone so I can do things like create a histogram of the values.
Is there any way I can do this? I'm fine with repeating the test from the start if necessary.
My current code is:
nsims = 10000
for (i in 1:nsims) {cancer.cells <- c(rep("M",24),rep("B",13))
malig[i] <- sum(sample(cancer.cells,21)=="M")}
benign = 21 - malig
rbenign = 13 - benign
rmalig = 24 - malig
for (i in 1:nsims) {test = cbind(c(rbenign[i],benign[i]),c(rmalig[i],malig[i]))
cancerchi[i] = chisq.test(test,correct=FALSE) }
It gives me all I need, I just cannot perform follow-up analysis on it such as creating a histogram.
Thanks for taking the time to read this!
I'll provide an answer at the suggestion of @Dr. Mike.
hist requires a numeric vector as input. hist(cancerchi) does not work because cancerchi is a list, not a vector.
There are several ways to convert cancerchi from a list into a format that hist can work with. The most direct is to unlist it in the call; three ways of storing a converted copy are shown further down:
hist(unlist(cancerchi))
Note that if you do not reassign cancerchi it will still be a list and cannot be passed directly to hist.
# i.e
class(cancerchi)
hist(cancerchi) # will still give you an error
If you reassign, it can be another type of object:
(class(cancerchi2 <- unlist(cancerchi)))
(class(cancerchi3 <- as.data.frame(unlist(cancerchi))))
# using the ldply function in the plyr package
library(plyr)
(class(cancerchi4 <- ldply(cancerchi)))
These new objects can be passed to hist (the data frames need a column selected):
hist(cancerchi2)
hist(cancerchi3[,1]) # specify column because cancerchi3 is a data frame, not a vector
hist(cancerchi4[,1]) # specify column because cancerchi4 is a data frame, not a vector
A little extra information: other useful commands for looking at your objects include str and attributes.
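If repeating the simulation is acceptable, another option is to store only the statistic in a numeric vector from the start, so no conversion is needed afterwards. This is a sketch of that alternative, not the code from the question: chi_stats is a hypothetical name and nsims is reduced to keep it quick.

```r
set.seed(42)
nsims <- 1000   # smaller than the question's 10000 to keep the sketch quick
cancer.cells <- c(rep("M", 24), rep("B", 13))

# vapply returns a plain numeric vector, one chi-squared statistic per simulation
chi_stats <- vapply(seq_len(nsims), function(i) {
  malig  <- sum(sample(cancer.cells, 21) == "M")
  benign <- 21 - malig
  tab <- cbind(c(13 - benign, benign), c(24 - malig, malig))
  unname(chisq.test(tab, correct = FALSE)$statistic)
}, numeric(1))

h <- hist(chi_stats, plot = FALSE)   # drop plot = FALSE to draw the histogram
str(chi_stats)
```

Extracting $statistic inside the loop avoids assigning a whole test result to a single list element, which is what produced the list in the first place.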
