I have a subset of a genetic dataset in which I want to run some correlations between the CpG markers.
I inspected the class of this subset with class(data), which returns
[1] "matrix" "array"
The structure str(data) also shows an output of the form
num [1:64881, 1:704] 0.0149 NA 0.0558 NA NA ...
-- attr(*, "dimnames")=List of 2
..$ : chr [1:64881] "cg11223003" NA "cg22629907" NA ...
..$ : chr [1:704] "200357150075_R01C01" "200357150075_R02C01" "200357150075_R03C01" "200357150075_R04C01" ...
It actually looks as though it were a data frame but the class of the variable tells otherwise. It's kind of confusing.
I need help manipulating the dataset to obtain the markers in a matrix or data frame format so that I can run the correlations.
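A note on the structure shown above: the object is already a numeric matrix with markers in rows and samples in columns, so cor() can work on it directly. The following is only a minimal sketch on made-up data. The marker ID "cg00000029", the sample names, and the choice to drop rows whose name is NA are assumptions for illustration, not something stated in the question.

```r
# Toy stand-in for the real 64881 x 704 matrix: markers in rows, samples in columns.
set.seed(1)
data <- matrix(
  runif(5 * 6), nrow = 5,
  dimnames = list(c("cg11223003", NA, "cg22629907", NA, "cg00000029"),  # hypothetical IDs
                  paste0("sample", 1:6))                                 # hypothetical samples
)
data[c(2, 4), ] <- NA  # mimic the all-NA rows that have NA row names

# Keep only rows that carry a marker name (assumption: unnamed rows are empty).
keep <- !is.na(rownames(data))
markers <- data[keep, , drop = FALSE]

# cor() correlates columns, so transpose to put one marker per column.
cor_mat <- cor(t(markers), use = "pairwise.complete.obs")

# A data frame version, if one is preferred: one row per sample, one column per marker.
markers_df <- as.data.frame(t(markers))
```

With tens of thousands of markers the full marker-by-marker correlation matrix becomes very large, so it may be more practical to correlate a selected subset of rows.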
I have a dataset in .mat file. Because most of my project is going to be R, I want to analyze the dataset in R rather than Matlab. I have used "R.matlab" library to convert into R but I am struggling to convert the data to dataframe to do further processing with it.
library(R.matlab)
> data <- readMat(paste(dataDirectory, 'Data.mat', sep=""))
> str(data)
List of 1
$ Data: num [1:32, 1:5, 1:895] 0.999 0.999 1 1 1 ...
- attr(*, "header")=List of 3
..$ description: chr "MATLAB 5.0 MAT-file, Platform: PCWIN, Created on: Fri Oct 18 11:36:04 2013 "
..$ version : chr "5"
..$ endian : chr "little"
I have tried the following code, based on what I found in other questions, but it does not do exactly what I want.
data = lapply(data, unlist, use.names=FALSE)
df <- as.data.frame(data)
> str(df)
'data.frame': 32 obs. of 4475 variables:
I want to convert it into a data frame with 5 observations (Y, X1, X2, X3, X4), but right now there are 32 observations.
I do not know how to go further from here, as I have never worked with such a large dataset and couldn't find a relevant post. I am also new to R and coding, so please excuse me if I have trouble with some of the answers. Any help would be greatly appreciated.
Thanks
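One possible way to reshape such a 3-D array is sketched below. It assumes that the second dimension (length 5) holds the variables Y, X1, X2, X3, X4 and that the other two dimensions can be stacked into rows; neither is confirmed by the question. A random array stands in for data$Data, and the column names row and slice are made up for illustration.

```r
# Stand-in for data$Data as returned by readMat(): a 32 x 5 x 895 numeric array.
set.seed(42)
arr <- array(rnorm(32 * 5 * 895), dim = c(32, 5, 895))

# Move the variable dimension (assumed to be dimension 2) to the end,
# then flatten the first two dimensions into rows.
stacked <- matrix(aperm(arr, c(1, 3, 2)), ncol = 5)

df <- as.data.frame(stacked)
names(df) <- c("Y", "X1", "X2", "X3", "X4")

# Keep track of where each row came from (hypothetical index columns).
df$row   <- rep(seq_len(32), times = 895)
df$slice <- rep(seq_len(895), each = 32)

str(df)
```

The result has 32 * 895 rows and one column per variable, which is the layout most modelling functions expect.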
I would like to access some elements of an Anova summary in R. I've been trying things like in this question Access or parse elements in summary() in R.
When I convert the summary to a string it shows something like this:
str(summ)
List of 1
$ :Classes 'anova' and 'data.frame': 2 obs. of 5 variables:
..$ Df : num [1:2] 3 60
..$ Sum Sq : num [1:2] 0.457 2.647
..$ Mean Sq: num [1:2] 0.1523 0.0441
..$ F value: num [1:2] 3.45 NA
..$ Pr(>F) : num [1:2] 0.022 NA
- attr(*, "class")= chr [1:2] "summary.aov" "listof"
How can I access the F value?
I've been trying things like summ[c('F value')] and I still can't get it to work.
Any help would be greatly appreciated!
You have the anova object inside a list (first line of str output is List of 1). So you need to get the "F value" of this single element, like:
summ[[1]][["F value"]]
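For a self-contained illustration of the same indexing, here is a small sketch on the built-in mtcars data. The model formula is an arbitrary choice for demonstration and is not the model from the question.

```r
# Fit a one-way ANOVA on built-in data (arbitrary example model).
fit <- aov(mpg ~ factor(cyl), data = mtcars)
summ <- summary(fit)

# summ is a list of length 1; its first element is the ANOVA table.
f_values <- summ[[1]][["F value"]]

# The first entry belongs to the model term; the Residuals row has no F value (NA).
f_values[1]

# The p-value column is reached the same way.
p_values <- summ[[1]][["Pr(>F)"]]
p_values[1]
```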
In addition to the answer above, I'd recommend using the broom package when you want to access or use various elements of a model object.
First, by using the str command you don't convert the summary into a string, but you just see the structure of your summary, which is a list. So, str means "structure".
The broom package enables you to save the info of your model object as a data frame, which is easier to manipulate. Check my simple example:
library(broom)
fit <- aov(mpg ~ vs, data = mtcars)
# check the summary of the ANOVA (printed table; elements are less convenient to access)
fit2 = summary(fit)
fit2
# Df Sum Sq Mean Sq F value Pr(>F)
# vs 1 496.5 496.5 23.66 3.42e-05 ***
# Residuals 30 629.5 21.0
# create a data frame of the ANOVA
fit3 = tidy(fit)
fit3
# term df sumsq meansq statistic p.value
# 1 vs 1 496.5279 496.52790 23.66224 3.415937e-05
# 2 Residuals 30 629.5193 20.98398 NA NA
# get F value (or any other values)
fit3$statistic[1]
#[1] 23.66224
For the specific example you provided you don't really need the broom method, but it becomes really useful once you have to deal with more complicated model objects.
I want to create a data frame in R.
To make an easy 2x2 example of my problem:
Assume the first column is a simple vector:
first <- c(1:2)
The second column is for every row a character vector (but of different length), for example:
c('A') for the first row and c('B','C') for the second.
How can I build this data frame?
If you want to store vectors of different lengths in each row of a certain column, you will need to use a list. The problem is that (from ?data.frame):
If a list or data frame or matrix is passed to data.frame it is as if
each component or column had been passed as a separate argument
Thus you will need to wrap the list in I() in order to protect your desired structure, e.g.
df <- data.frame(first = 1:2, Second = I(list("A", c("B", "C"))))
str(df)
# 'data.frame': 2 obs. of 2 variables:
# $ first : int 1 2
# $ Second:List of 2
# ..$ : chr "A"
# ..$ : chr "B" "C"
# ..- attr(*, "class")= chr "AsIs"
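As a possible alternative, a list column can also be assigned after the data frame has been created, which avoids the I() wrapper. This is only a sketch of that variant together with a few ways to read the column back; the name df2 is made up for illustration.

```r
# Create the data frame first, then attach the list column by assignment.
df2 <- data.frame(first = 1:2)
df2$Second <- list("A", c("B", "C"))

# Each cell of the list column is a character vector of its own length.
df2$Second[[1]]      # "A"
df2$Second[[2]]      # "B" "C"
lengths(df2$Second)  # number of elements per row
```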
I have a large matrix
> str(distMatrix)
num [1:551, 1:551] 0 6 5 Inf 5 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:551] "+" "ABRAHAM" "ACTS" "ADVANCE" ...
..$ : chr [1:551] "+" "ABRAHAM" "ACTS" "ADVANCE" ...
which contains numeric values. I need to gather all numeric values into ONE long list (for acquiring distribution). Currently what I have:
for (i in 1:dim(distMatrix)[[1]]) {
  for (j in 1:dim(distMatrix)[[1]]) {
    distances[length(distances)+1] <- distMatrix[i,j]
  }
}
However, that takes forever. Can anyone suggest a faster way?
To turn a matrix into a list, the length of which is the same as the number of elements in the matrix, you can simply do
as.list(distMatrix)
This goes down the columns, but you can use the transpose
as.list(t(distMatrix))
to make it go across the rows. Since your matrix is 551x551 it should be sufficiently efficient.
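If the goal is only to look at the distribution of the values, a plain numeric vector may be more convenient than a list. The sketch below uses a tiny made-up matrix in place of the 551x551 one; dropping Inf values and keeping only the upper triangle are assumptions about what the distribution should contain.

```r
# Tiny stand-in for distMatrix (symmetric, zero diagonal, some Inf entries).
distMatrix <- matrix(
  c(0, 6, Inf,
    6, 0, 5,
    Inf, 5, 0),
  nrow = 3,
  dimnames = list(c("A", "B", "C"), c("A", "B", "C"))
)

# All values as one numeric vector (column by column), with no explicit loop.
distances <- as.vector(distMatrix)

# Optionally drop infinite distances before summarising.
finite_distances <- distances[is.finite(distances)]

# For a symmetric distance matrix, the upper triangle holds each pair once.
pairwise <- distMatrix[upper.tri(distMatrix)]

summary(finite_distances)
```

Growing a vector one element at a time inside a loop forces repeated copying, which is the main reason the original double loop is slow.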
I have a dataframe that contains two columns X-data and Y-data.
This represents some experimental data.
Now I have a lot of additional information that I want to associate with this data, such as the temperatures, flow rates, and so on at which the sample was recorded. I have this metadata in a second dataframe.
The data and metadata should always stay together, but I also want to be able to do calculations with the data.
As I have many of those data-metadata pairs (>100), I was wondering what people think is an efficient way to organize the data?
For now, I have the two dataframes in a list, but I find accessing the individual values or data-columns tedious (= a lot of code and brackets to write).
You can use an attribute:
dfr <- data.frame(x=1:3,y=rnorm(3))
meta <- list(temp="30C",date=as.Date("2013-02-27"))
attr(dfr,"meta") <- meta
dfr
x y
1 1 -1.3580532
2 2 -0.9873850
3 3 0.3809447
attr(dfr,"meta")
$temp
[1] "30C"
$date
[1] "2013-02-27"
str(dfr)
'data.frame': 3 obs. of 2 variables:
$ x: int 1 2 3
$ y: num -1.358 -0.987 0.381
- attr(*, "meta")=List of 2
..$ temp: chr "30C"
..$ date: Date, format: "2013-02-27"
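With more than 100 data-metadata pairs, the attribute approach might be combined with a named list and two small helpers, roughly as sketched here. The helper names make_experiment and get_meta, the run names, and the metadata fields are all hypothetical. Some operations on a data frame may not preserve custom attributes, so it is worth checking that "meta" is still present after transforming the data.

```r
# Hypothetical constructor: bundle the measurements with their metadata.
make_experiment <- function(x, y, temp, flow_rate) {
  d <- data.frame(x = x, y = y)
  attr(d, "meta") <- list(temp = temp, flow_rate = flow_rate)
  d
}

# Hypothetical accessor: read one metadata field without repeating the brackets.
get_meta <- function(d, key) attr(d, "meta")[[key]]

# Keep all experiments in one named list.
experiments <- list(
  run1 = make_experiment(1:3, c(0.5, 1.0, 1.5), temp = 30, flow_rate = 1.2),
  run2 = make_experiment(1:3, c(0.7, 1.4, 2.1), temp = 45, flow_rate = 0.8)
)

# Metadata and calculations across all experiments.
temps  <- sapply(experiments, get_meta, key = "temp")
mean_y <- sapply(experiments, function(d) mean(d$y))
```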