Converting from dgCMatrix/dgRMatrix to scipy sparse matrix - r

I am working on the Netflix data set and attempting to use the nmslibR package to do some KNN-type work on the resulting sparse matrix. This package only accepts SciPy sparse matrices as input, so I need to convert my R sparse matrix to that format. When I attempt to do so, I get the following error. dfm2 is a 1.1 GB dgCMatrix; I have also tried a dgRMatrix, with exactly the same error.
dfm3<-TO_scipy_sparse(dfm2)
Error in TO_scipy_sparse(dfm2) : attempt to apply non-function
I don't know how to provide a good sample dataset for my problem, since the sparse matrix I'm working with is 1.1 GB, so if someone has a suggestion on how I can make it easier to help me, please let me know. I would also be open to hearing about other packages that do KNN or KNN-type operations on sparse matrices in R.
Edit:
I use the following code to generate a sample sparse matrix in the dgCMatrix format and attempt to convert it to a SciPy sparse matrix, and I get the following error.
library(Matrix)
library(nmslibR)
sparse<-Matrix(sample(c(1,2,3,4,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),10000,
replace=T),
ncol=50,
byrow=T)
dfm3 <- TO_scipy_sparse(sparse)
Error in TO_scipy_sparse(sparse) : attempt to apply non-function
To answer a question about whether sparse is a dgCMatrix:
str(sparse)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:2414] 0 6 9 10 13 20 22 23 25 49 ...
..@ p : int [1:51] 0 45 92 146 185 227 277 330 383 435 ...
..@ Dim : int [1:2] 200 50
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : NULL
..@ x : num [1:2414] 4 1 1 2 5 3 2 5 3 5 ...
..@ factors : list()

The 'attempt to apply non-function' error is a known issue when something is wrong with the python configuration in the operating system. There are similar issues for other Python packages that I ported from Python to R. You can have a look here.
You should also know that the nmslibR package uses the reticulate package for the interface between Python and R, so there must be similar issues too. If the error persists then you can open an issue in the nmslibR repository providing some sample data.

Related

dendrogram from pre-made linkage matrix

the problem:
in R, I need to plot a dendrogram + cut the associated tree from a linkage matrix created in a different language. based on the nature of the dataset, the prior processing is only available with this other language. so I need to be able to work in R from an already determined linkage matrix.
I have a linkage matrix and a correlation matrix created from a different language. I saved both as csv files and can read either as a data frame into R.
my approach
I wanted to convert the linkage matrix to an hclust object in R, so that I could pass to as.dendrogram and then subsequently use cutree.
When I run as.hclust(df), I get the error:
Error in as.hclust.default(df) : argument 'x' cannot be coerced to class “hclust” Consider providing an as.hclust.data.frame() method
as.hclust only takes a dist, Diana, or Agnes object
I have been unable to convert the data frame to any of these objects to proceed with my downstream analysis.
an alternative would be to work with the correlation matrix, but I'm not seeing a way to backtrack the physical distances from which to build a meaningful dendrogram.
I could use scipy.cluster.hierarchy.cut_tree in Python but there are documented issues with the function that remain unresolved, so I wanted to use R.
many thanks
I'm not sure what you would call the "linkage matrix" or whether there's a "standard" format for them across packages, but in these cases it helps to use str:
x <- matrix(rnorm(30), ncol = 3)
hc <- hclust(dist(x), method = "complete")
str(hc)
List of 7
$ merge : int [1:9, 1:2] -5 -6 -8 -4 -2 -3 -1 6 5 -7 ...
$ height : num [1:9] 0.714 0.976 1.381 1.468 2.065 ...
$ order : int [1:10] 2 6 10 3 8 5 7 1 4 9
$ labels : NULL
$ method : chr "complete"
$ call : language hclust(d = dist(x), method = "complete")
$ dist.method: chr "euclidean"
- attr(*, "class")= chr "hclust"
So, from this, one can deduce that it's a simple S3 structure, and it should be possible to create an imitation with your already-determined step-by-step data like this:
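# your_merge, your_height and your_order are placeholders for your own data
my_hc <- list(
  merge = your_merge,    # (n-1) x 2 integer matrix: negative = single observation, positive = earlier merge step
  height = your_height,  # n-1 merge heights, in increasing order
  order = your_order,    # permutation of 1..n giving the left-to-right order of the leaves
  labels = NULL,
  method = "complete",
  call = "some_optional_string",
  dist.method = "your_custom_distance"
)
class(my_hc) <- "hclust"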
Otherwise, you could let R re-do the clustering from a distance matrix if that's available or computationally feasible.
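As a concrete sketch of the idea above, here is one way the conversion might look if the linkage matrix follows SciPy's layout. That layout is an assumption, since the question does not name the format: one row per merge, the first two columns holding 0-based cluster indices (values below n are original observations, n + i is the cluster formed at row i), and the third column holding the merge height. The helper name linkage_to_hclust and the toy matrix Z are made up for illustration.

```r
# Hypothetical helper: convert a SciPy-style linkage matrix into an hclust object.
linkage_to_hclust <- function(Z) {
  Z <- as.matrix(Z)
  n <- nrow(Z) + 1L                       # number of observations
  idx <- Z[, 1:2, drop = FALSE]
  # hclust convention: -j for observation j, +k for the cluster made at step k
  merge <- ifelse(idx < n, -(idx + 1), idx - n + 1)
  dimnames(merge) <- NULL
  storage.mode(merge) <- "integer"
  # Leaf order: walk the tree from the last merge down to the leaves
  walk <- function(k) {
    if (k < 0) return(-k)
    c(walk(merge[k, 1]), walk(merge[k, 2]))
  }
  hc <- list(
    merge = merge,
    height = as.numeric(Z[, 3]),
    order = walk(nrow(merge)),
    labels = NULL,
    method = "imported",                  # free-text descriptions, not used by cutree
    call = match.call(),
    dist.method = "unknown"
  )
  class(hc) <- "hclust"
  hc
}

# Toy linkage matrix for 4 observations (columns: idx1, idx2, height, size)
Z <- rbind(c(0, 1, 0.5, 2),
           c(2, 3, 1.0, 2),
           c(4, 5, 2.0, 4))
hc <- linkage_to_hclust(Z)
groups <- cutree(hc, k = 2)               # cluster membership per observation
dend <- as.dendrogram(hc)                 # can be passed to plot()
```

Note that cutree with h rather than k expects the heights to be sorted in increasing order, so a linkage with inversions may need extra care.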

Using .mat data to do multiple linear regression in R

I have a dataset in .mat file. Because most of my project is going to be R, I want to analyze the dataset in R rather than Matlab. I have used "R.matlab" library to convert into R but I am struggling to convert the data to dataframe to do further processing with it.
library(R.matlab)
> data <- readMat(paste(dataDirectory, 'Data.mat', sep=""))
> str(data)
List of 1
$ Data: num [1:32, 1:5, 1:895] 0.999 0.999 1 1 1 ...
- attr(*, "header")=List of 3
..$ description: chr "MATLAB 5.0 MAT-file, Platform: PCWIN, Created on: Fri Oct 18 11:36:04 2013 "
..$ version : chr "5"
..$ endian : chr "little"
I have tried the following codes from what I found from other questions but they do not do exactly what I wanted to do.
data = lapply(data, unlist, use.names=FALSE)
df <- as.data.frame(data)
> str(df)
'data.frame': 32 obs. of 4475 variables:
I want to convert this into a data frame with 5 variables (Y, X1, X2, X3, X4), but right now it has 32 observations of 4475 variables.
I do not know how to go further from here as I never worked with such a large dataset and couldn't find a relevant post. I am also new to R and coding so please excuse me if I will have some trouble with some of the answers. Any help would be greatly appreciated.
Thanks
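One possible way to get there, sketched under assumptions: the str output shows a 32 x 5 x 895 array, and the sketch below assumes the second dimension (length 5) indexes the variables Y, X1, X2, X3, X4, with the other two dimensions both indexing observations. A simulated array stands in for data$Data, so the numbers themselves are meaningless.

```r
set.seed(1)
# Stand-in for data$Data from readMat(); same shape as in the question
arr <- array(rnorm(32 * 5 * 895), dim = c(32, 5, 895))

# Move the variable dimension last (32 x 895 x 5), then flatten the first
# two dimensions into rows so each variable becomes one column
m <- matrix(aperm(arr, c(1, 3, 2)), ncol = 5)
df <- as.data.frame(m)
names(df) <- c("Y", "X1", "X2", "X3", "X4")

dim(df)   # 32 * 895 rows, 5 columns

# Multiple linear regression on the reshaped data
fit <- lm(Y ~ X1 + X2 + X3 + X4, data = df)
```

If the variables actually live along a different dimension, only the permutation passed to aperm needs to change.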

Error when trying to evaluate Markov Random Fields using mgcv::gam "mismatch between nb/polys supplied area names and data area names"

I tried to implement this great blog post by Gavin Simpson using data downloaded from the cancensus package, but I get the following error when trying to evaluate the gam:
Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) :
mismatch between nb/polys supplied area names and data area names
In addition: Warning message:
In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch
between nb/polys supplied area names and data area names") :
the condition has length > 1 and only the first element will be used
I have posted my minimal working example here. Any tips would be greatly appreciated.
Best,
Zoltan
I know you already found your answer, however I had the same error and a different problem, so I'll post my solution here for posterity.
(Note: I used the sf package instead of rgdal and spdep)
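library(sf)
library(magrittr)  # provides %>% and set_names()
sh_terr <- st_read("your_shp.shp", stringsAsFactors = TRUE)
# For each area, store its touching neighbours as a factor sharing the full set of FSA levels
neighb <- st_touches(sh_terr, sparse = TRUE) %>%
  lapply(function(xx) sh_terr$FSA[xx] %>% factor(levels = levels(sh_terr$FSA))) %>%
  set_names(sh_terr$FSA)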
Your neighboring object structure should look like:
str(neighb[1:5])
List of 5
$ G0A: Factor w/ 419 levels "G0A","G0C","G0E",..: 14 15 16 17 21 22 39 49 50 51 ...
$ G0C: Factor w/ 419 levels "G0A","G0C","G0E",..: 3 6 67
$ G0E: Factor w/ 419 levels "G0A","G0C","G0E",..: 2 6 65 67
$ G0G: Factor w/ 419 levels "G0A","G0C","G0E",..: 5 16 62 70 271
$ G0H: Factor w/ 419 levels "G0A","G0C","G0E",..: 4 14 16 68 70 71
And your spline formula:
Effect ~ s(FSA, bs = "mrf", xt = list(nb = neighb), k = 41, fx = TRUE)
It's all in the factors. FSA in your main data object of your gam must be factor, and your neighboring object structure should be a list of factors with as many levels as the TOTAL number of levels in your main data.
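To check those conditions before fitting, a small sanity check along these lines may help. It is only a sketch: check_nb is a made-up helper, the toy data stand in for the real shapefile, and the comparison mirrors the one visible in the warning text above (sorted area names against sorted factor levels).

```r
# Hypothetical helper: TRUE when the neighbour list and the area factor agree
check_nb <- function(nb, area) {
  is.factor(area) &&
    isTRUE(all.equal(sort(names(nb)), sort(levels(area)))) &&
    all(vapply(nb, function(f) is.factor(f) && identical(levels(f), levels(area)),
               logical(1)))
}

# Toy data: three areas in a row, A - B - C
dat <- data.frame(FSA = factor(c("A", "B", "C", "A")))
lv <- levels(dat$FSA)
neighb <- list(
  A = factor("B", levels = lv),
  B = factor(c("A", "C"), levels = lv),
  C = factor("B", levels = lv)
)

ok <- check_nb(neighb, dat$FSA)                       # names and levels line up
bad <- check_nb(neighb[c("A", "B")], dat$FSA)         # area C missing from the list
chr <- check_nb(neighb, as.character(dat$FSA))        # area variable is not a factor
```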
Found it -- You must make sure that you don't have any polygons with missing Y:
shp <- shp[!is.na(shp@data$Y), ]

Why do I get this error below while using the Cubist package in R?

I have some personal dataset. So I split it into variable to predict and predictors.
Following is the syntax:
library(Cubist)
str(A)
'data.frame': 6038 obs. of 3 variables:
$ ads_return_count : num 7 10 10 4 10 10 10 10 10 9 ...
$ actual_cpc : num 0.0678 0.3888 0.2947 0.0179 0.095 ...
$ is_user_agent_bot: Factor w/ 1 level "False": 1 1 1 1 1 1 1 1 1 1 ...
cubist(A[,c("ads_return_count","is_user_agent_bot")],A[,"actual_cpc"])
And I am getting the following error
cubist code called exit with value 1
Error in strsplit(tmp, "\"")[[1]] : subscript out of bounds
Is there something I am missing ?
Simulate some data to make a reproducible example:
A=data.frame(ads_return_count=sample(100,10,TRUE), actual_cpc=runif(100), is_user_agent_bot=factor(rep("False",100)))
cubist(A[,c("ads_return_count","is_user_agent_bot")],A[,"actual_cpc"])
cubist code called exit with value 1
Error in strsplit(tmp, "\"")[[1]] : subscript out of bounds
Great, now we're on the same page.
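What bothers me is that is_user_agent_bot is all "False". Note that it is one of the predictors, not the outcome (the second argument, actual_cpc, is the outcome), so the model is being given a factor predictor with a single level, which carries no information. Let's try something with two levels: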
> A2=data.frame(ads_return_count=sample(100,10,TRUE), actual_cpc=runif(100), is_user_agent_bot=sample(c("True","False"),100,TRUE))
> cubist(A2[,c("ads_return_count","is_user_agent_bot")],A2[,"actual_cpc"])
Call:
cubist.default(x = A2[, c("ads_return_count", "is_user_agent_bot")], y =
A2[, "actual_cpc"])
Number of samples: 100
Number of predictors: 2
Number of committees: 1
Number of rules: 1
I would say this was an uninformative error message from cubist caused by a factor predictor with only one level.
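A possible workaround, sketched here without calling Cubist itself, is to drop factor columns that have only one level in use before passing the predictors on, such as is_user_agent_bot in the question's data. The helper name drop_single_level_factors is made up for illustration.

```r
# Hypothetical helper: remove factor columns that have fewer than two levels in use
drop_single_level_factors <- function(df) {
  keep <- vapply(
    df,
    function(col) !is.factor(col) || nlevels(droplevels(col)) > 1,
    logical(1)
  )
  df[, keep, drop = FALSE]
}

set.seed(1)
A <- data.frame(
  ads_return_count = sample(100, 100, TRUE),
  actual_cpc = runif(100),
  is_user_agent_bot = factor(rep("False", 100))
)

predictors <- drop_single_level_factors(A[, c("ads_return_count", "is_user_agent_bot")])
names(predictors)   # only the informative predictor remains
# The cleaned predictors could then be passed on, e.g. cubist(predictors, A$actual_cpc)
```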
I had the same issue with mine except it turned out to be a level name was a missing value "". Replacing those levels with text did the trick.
Seems there is a similar issue with c5.0 decision tree
C5.0 decision tree - c50 code called exit with value 1

How do I handle multiple kinds of missingness in R?

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you're looking for, and it is not implemented in base R. I don't know of a package that implements it, but it's not too difficult to code it yourself.
A workable way is to add a dataframe to the attributes, containing the codes. To prevent doubling the whole dataframe and save space, I'd add the indices in that dataframe instead of reconstructing a complete dataframe.
eg :
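NACode <- function(x, code){
  # Replace every coded value with NA, column by column
  Df <- sapply(x, function(i){
    i[i %in% code] <- NA
    i
  })
  # Linear (column-major) positions of the NAs, converted to row and column
  # indices; the "- 1L" keeps the last row from being mapped to row 0
  id <- which(is.na(Df))
  rowid <- (id - 1L) %% nrow(x) + 1L
  colid <- (id - 1L) %/% nrow(x) + 1
  NAdf <- data.frame(
    id, rowid, colid,
    value = as.matrix(x)[id]
  )
  Df <- as.data.frame(Df)
  attr(Df, "NAcode") <- NAdf
  Df
}
This allows you to do: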
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute that gives you the label for the different values, see also this question. You could backtransform by :
ChangeNAToCode <- function(x,code){
NAval <- attr(x,"NAcode")
for(i in which(NAval$value %in% code))
x[NAval$rowid[i],NAval$colid[i]] <- NAval$value[i]
x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
This lets you change back only the codes you want, if that is ever necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.
But in one line : using attributes and indices might be a nice way of doing it.
The most obvious way seems to use two vectors:
Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
Vector 2: a factor indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.
Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
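A minimal sketch of this two-vector layout, kept together in one data frame. The column names value and status and the level labels are made up for illustration; the codes 1, -1 and -7 follow the example above and the codebook in the question.

```r
survey <- data.frame(
  value = c(2, 50, NA, NA, 17),
  status = factor(c(1, 1, -1, -7, 1),
                  levels = c(1, -1, -7),
                  labels = c("answered", "not_asked", "refused"))
)

# Standard NA handling still works on the data vector
avg <- mean(survey$value, na.rm = TRUE)

# The factor tells the kinds of missingness apart
missing_by_type <- table(survey$status[is.na(survey$value)])

# Subset on the second vector, e.g. only the refusals
refused <- survey[survey$status == "refused", ]
```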
Update following questions from @gsk3:
"Data storage will dramatically increase." The data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
"Programs don't automatically deal with it." That's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, which implies that you will have to do something bespoke. If you just want to analyse data where the NAs are "Question not asked", then use a data frame subset.
"Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable." I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
"There's no standard implementation, so my solution might differ from someone else's." True. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi. You can assign extremely negative values to the several types of NA (which puts the NAs into the margin of the plots) and do some diagnostics "manually". You should bear in mind that there are three types of missingness:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) cannot be quantified in any way.
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up to a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what is going into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
-Ralph Winters
I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data, but on one or two occasions where I mainly wanted it for documentation I have used an attribute on the value, e.g.
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I'd like to add to the "statistical background component" here. Statistical analysis with missing data is a very good read on this.
