Using .mat data to do multiple linear regression in R

I have a dataset in a .mat file. Because most of my project is going to be in R, I want to analyze the dataset in R rather than Matlab. I have used the "R.matlab" library to read the data into R, but I am struggling to convert it to a data frame for further processing.
library(R.matlab)
> data <- readMat(paste(dataDirectory, 'Data.mat', sep=""))
> str(data)
List of 1
$ Data: num [1:32, 1:5, 1:895] 0.999 0.999 1 1 1 ...
- attr(*, "header")=List of 3
..$ description: chr "MATLAB 5.0 MAT-file, Platform: PCWIN, Created on: Fri Oct 18 11:36:04 2013 "
..$ version : chr "5"
..$ endian : chr "little"
I have tried the following code, based on what I found in other questions, but it does not do exactly what I want.
data = lapply(data, unlist, use.names=FALSE)
df <- as.data.frame(data)
> str(df)
'data.frame': 32 obs. of 4475 variables:
I want to convert the data into a data frame with 5 variables (Y, X1, X2, X3, X4), but right now there are 32 observations of 4475 variables.
I do not know how to go further from here, as I have never worked with such a large dataset and couldn't find a relevant post. I am also new to R and coding, so please excuse me if I have some trouble with some of the answers. Any help would be greatly appreciated.
Thanks
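One possible approach, sketched below, is to treat the second dimension of the 32 x 5 x 895 array as the five variables and flatten the other two dimensions into rows. Whether that mapping is correct depends on what the 32 and 895 dimensions actually represent, so the column names and the regression formula here are assumptions, and dataDirectory is assumed to be defined as in the question.
library(R.matlab)

data <- readMat(paste(dataDirectory, "Data.mat", sep = ""))
arr  <- data$Data                 # 32 x 5 x 895 numeric array

# Assumption: dimension 2 holds the five variables (Y, X1, ..., X4),
# and the 32 x 895 combinations are the observations.
mat <- apply(arr, 2, c)           # flatten dims 1 and 3 -> 28640 x 5 matrix
df  <- as.data.frame(mat)
names(df) <- c("Y", "X1", "X2", "X3", "X4")

fit <- lm(Y ~ X1 + X2 + X3 + X4, data = df)
summary(fit)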

Related

Class of Data and how to do some data manipulation in R

I have a subset of a genetic dataset in which I want to run some correlations between the CpG markers.
I have inspected the class of this subset with class(data) and it shows that it's a
[1] "matrix" "array"
The structure str(data) also shows an output of the form
num [1:64881, 1:704] 0.0149 NA 0.0558 NA NA ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:64881] "cg11223003" NA "cg22629907" NA ...
..$ : chr [1:704] "200357150075_R01C01" "200357150075_R02C01" "200357150075_R03C01" "200357150075_R04C01" ...
It actually looks as though it is a data frame, but the class of the variable says otherwise, which is kind of confusing.
I need help on how to manipulate the dataset into a matrix or data frame format of the markers so that I can run the correlations.
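Since the object is already a numeric matrix with markers as rows and samples as columns, one sketch is to transpose it and call cor(). The marker IDs below are just the ones visible in the str() output, and use = "pairwise.complete.obs" is assumed because of the NAs in the data.
m <- data[!is.na(rownames(data)), ]            # drop rows with missing marker names
sub <- t(m[c("cg11223003", "cg22629907"), ])   # samples as rows, markers as columns
cor(sub, use = "pairwise.complete.obs")        # pairwise correlations between markers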

collapse data frame with embedded matrices [duplicate]

This question already has answers here:
aggregate() puts multiple output columns in a matrix instead
(1 answer)
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 4 years ago.
Under certain conditions, R generates data frames that contain matrices as elements. This requires some determination to do by hand, but happens e.g. with the results of an aggregate() call where the aggregation function returns multiple values:
set.seed(101)
d0 <- data.frame(g=factor(rep(1:2,each=20)), x=rnorm(20))  # note: rnorm(20) is recycled to length 40, so both groups get identical x values
d1 <- aggregate(x~g, data=d0, FUN=function(x) c(m=mean(x), s=sd(x)))
str(d1)
## 'data.frame': 2 obs. of 2 variables:
## $ g: Factor w/ 2 levels "1","2": 1 2
## $ x: num [1:2, 1:2] -0.0973 -0.0973 0.8668 0.8668
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "m" "s"
This makes a certain amount of sense, but can make trouble for downstream processing code (for example, ggplot2 doesn't like it). The printed representation can also be confusing if you don't know what you're looking at:
d1
## g x.m x.s
## 1 1 -0.09731741 0.86678436
## 2 2 -0.09731741 0.86678436
I'm looking for a relatively simple way to collapse this object to a regular three-column data frame (either with names g, m, s, or with names g, x.m, x.s ...).
I know this problem won't arise with tidyverse (group_by + summarise), but am looking for a base-R solution.
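One base-R sketch (not taken from the thread above) is to rebuild the data frame with do.call(), which splits the embedded matrix into ordinary columns:
d2 <- do.call(data.frame, d1)      # g, x.m, x.s as three plain columns
names(d2) <- c("g", "m", "s")      # optional: simpler names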

Converting from dgCMatrix/dgRMatrix to scipy sparse matrix

I am working on the Netflix data set and attempting to use the nmslibR package to do some KNN-type work on the sparse matrix that results from it. This package only accepts SciPy sparse matrices as inputs, so I need to convert my R sparse matrix to that format. When I attempt to do so, I get the following error. dfm2 is a 1.1 GB dgCMatrix; I have also tried a dgRMatrix and get the exact same error.
dfm3<-TO_scipy_sparse(dfm2)
Error in TO_scipy_sparse(dfm2) : attempt to apply non-function
I don't know how to provide a good sample dataset for my problem, since the sparse matrix I'm working with is 1.1 GB, so if someone has a suggestion on how I can make it easier to help me, please let me know. I would also be open to hearing about other packages that do KNN-type functions in R for sparse matrices.
Edit:
I use the following code to generate a sample sparse matrix in the dgCMatrix format, attempt to convert it to a SciPy sparse matrix, and get the following error.
library(Matrix)
library(nmslibR)
sparse <- Matrix(sample(c(1,2,3,4,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), 10000,
                        replace = TRUE),
                 ncol = 50,
                 byrow = TRUE)
dfm3 <- TO_scipy_sparse(sparse)
Error in TO_scipy_sparse(sparse) : attempt to apply non-function
To answer a question about whether sparse is a dgCMatrix:
str(sparse)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..# i : int [1:2414] 0 6 9 10 13 20 22 23 25 49 ...
..# p : int [1:51] 0 45 92 146 185 227 277 330 383 435 ...
..# Dim : int [1:2] 200 50
..# Dimnames:List of 2
.. ..$ : NULL
.. ..$ : NULL
..# x : num [1:2414] 4 1 1 2 5 3 2 5 3 5 ...
..# factors : list()
The 'attempt to apply non-function' error is a known issue when something is wrong with the Python configuration on the operating system. There are similar issues for other Python packages that I ported from Python to R; you can have a look here.
You should also know that the nmslibR package uses the reticulate package for the interface between Python and R, so there may be similar issues there too. If the error persists, you can open an issue in the nmslibR repository providing some sample data.
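A quick way to check whether the Python side is the problem is to inspect what reticulate (which nmslibR uses, as noted above) has picked up. The specific module names checked below are assumptions about what the conversion needs:
library(reticulate)
py_config()                      # which Python installation reticulate found
py_module_available("scipy")     # TRUE if scipy can be imported from that Python
py_module_available("nmslib")    # the nmslib bindings themselves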

Organization of data with metadata

I have a dataframe that contains two columns X-data and Y-data.
This represents some experimental data.
Now I have a lot of additional information that I want to associate with this data, such as the temperature, flow rate, and other conditions the sample was recorded at. I have this metadata in a second dataframe.
The data and metadata should always stay together, but I also want to be able to do calculations with the data.
As I have many of those data-metadata pairs (>100), I was wondering what people think is an efficient way to organize the data?
For now, I have the two dataframes in a list, but I find accessing the individual values or data-columns tedious (= a lot of code and brackets to write).
You can use an attribute:
dfr <- data.frame(x=1:3,y=rnorm(3))
meta <- list(temp="30C",date=as.Date("2013-02-27"))
attr(dfr,"meta") <- meta
dfr
x y
1 1 -1.3580532
2 2 -0.9873850
3 3 0.3809447
attr(dfr,"meta")
$temp
[1] "30C"
$date
[1] "2013-02-27"
str(dfr)
'data.frame': 3 obs. of 2 variables:
$ x: int 1 2 3
$ y: num -1.358 -0.987 0.381
- attr(*, "meta")=List of 2
..$ temp: chr "30C"
..$ date: Date, format: "2013-02-27"
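To scale this to many data/metadata pairs, one sketch (the with_meta() helper is hypothetical, not part of the answer above) is to keep a named list of data frames, each carrying its metadata as an attribute:
with_meta <- function(df, meta) { attr(df, "meta") <- meta; df }   # hypothetical helper

runs <- list(
  run1 = with_meta(data.frame(x = 1:3, y = rnorm(3)), list(temp = "30C", flow = 1.2)),
  run2 = with_meta(data.frame(x = 1:3, y = rnorm(3)), list(temp = "45C", flow = 0.8))
)

sapply(runs, function(d) mean(d$y))              # calculations on the data
sapply(runs, function(d) attr(d, "meta")$temp)   # metadata stays attached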

How to convert 4d array to 3d array subsetting on specific elements of one of the dimensions

Here is probably an easy question, but I am really struggling, so help is very much appreciated.
I have 4d data that I wish to transform into 3d data. The data has the following attributes:
lon <- 1:96
lat <- 1:73
lev <- 1:60
tme <- 1:12
data <- array(runif(96*73*60*12),
dim=c(96,73,60,12) ) # fill with random test values
What I would like to do is calculate the mean of the first few levels (say 1:6). The new data would be of the form:
new.data <- array(runif(96*73*12), dim=c(96,73,12) ) # again just test data
But would contain the mean of the first 5 levels of data. At the moment the only way I have been able to make it work is to write a rather inefficient loop which extracts each of the first 5 levels and divides the sum of those by 5 to get the mean.
I have tried:
new.data <- apply(data, c(1,2,4), mean)
Which nicely gives me the mean of ALL the vertical levels but can't understand how to subset the 3rd dimension to get an average of only a few! e.g.
new.data <- apply(data, c(1,2,3[1:5],4), mean) # which returns
Error in ds[-MARGIN] : only 0's may be mixed with negative subscripts
I am desperate for some help!
apply with indexing (the proper use of "[") should be enough for the mean of the first six levels of the third dimension if I understand your terminology:
> str(apply(data[,,1:6,] , c(1,2,4), FUN=mean) )
num [1:96, 1:73, 1:12] 0.327 0.717 0.611 0.388 0.47 ...
This returns a 96 x 73 x 12 array.
In addition to the answer of @DWin, I would recommend the plyr package. The package provides apply-like functions. The analogue of apply is the plyr function aaply. The first two letters of a plyr function specify the input and output type: aa, in this case, means array in and array out.
> system.time(str(apply(data[,,1:6,], c(1,2,4), mean)))
num [1:96, 1:73, 1:12] 0.389 0.157 0.437 0.703 0.61 ...
user system elapsed
2.180 0.004 2.184
> library(plyr)
> system.time(str(aaply(data[,,1:6,], c(1,2,4), mean)))
num [1:96, 1:73, 1:12] 0.389 0.157 0.437 0.703 0.61 ...
- attr(*, "dimnames")=List of 3
..$ X1: chr [1:96] "1" "2" "3" "4" ...
..$ X2: chr [1:73] "1" "2" "3" "4" ...
..$ X3: chr [1:12] "1" "2" "3" "4" ...
user system elapsed
40.243 0.016 40.262
In this example it is slower than apply, but there are a few advantages. The package supports parallel processing, can output the results as a data.frame or list (nice for plotting with ggplot2), and can show a progress bar (nice for long-running processes). In this case, though, I'd still go for apply because of performance.
More information regarding the plyr package can be found in this paper. Maybe someone can comment on the poor performance of aaply in this example?
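A sketch of the progress-bar and parallel options mentioned above; doParallel is just one choice of backend, and any registered foreach backend should work the same way:
library(plyr)
# progress bar in the console while aaply runs
res <- aaply(data[,,1:6,], c(1,2,4), mean, .progress = "text")

# parallel execution: register a backend first, then set .parallel = TRUE
library(doParallel)
registerDoParallel(cores = 2)
res_par <- aaply(data[,,1:6,], c(1,2,4), mean, .parallel = TRUE)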
