One of the best ways to make a question reproducible is to use one of the built-in data sets. Using data(), however, is frustrating because it provides no information about the structure of each data set.
How can I quickly view the structure of available data sets?
The following function may help:
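dataStr <- function(fun = function(x) TRUE) {
  str(
    Filter(
      fun,                    # keep only the data sets for which fun() is TRUE
      Filter(
        Negate(is.null),      # drop items that could not be found by name
        mget(data()$results[, "Item"], inherits = TRUE, ifnotfound = list(NULL))
      )
    )
  )
}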
It accepts a filtering function, applies it to all the data sets, and prints out the structure of the matching data sets. For example, if we're looking for matrices:
> dataStr(is.matrix)
List of 8
$ WorldPhones : num [1:7, 1:7] 45939 60423 64721 68484 71799 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:7] "1951" "1956" "1957" "1958" ...
.. ..$ : chr [1:7] "N.Amer" "Europe" "Asia" "S.Amer" ...
$ occupationalStatus : 'table' int [1:8, 1:8] 50 16 12 11 2 12 0 0 19 40 ...
..- attr(*, "dimnames")=List of 2
.. ..$ origin : chr [1:8] "1" "2" "3" "4" ...
.. ..$ destination: chr [1:8] "1" "2" "3" "4" ...
$ volcano : num [1:87, 1:61] 100 101 102 103 104 105 105 106 107 108 ...
--- 5 entries omitted ---
Or for data frames (also omitting entries):
> dataStr(is.data.frame)
List of 42
$ BOD :'data.frame': 6 obs. of 2 variables:
..$ Time : num [1:6] 1 2 3 4 5 7
..$ demand: num [1:6] 8.3 10.3 19 16 15.6 19.8
..- attr(*, "reference")= chr "A1.4, p. 270"
$ CO2 :Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame': 84 obs. of 5 variables:
..$ Plant : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
..$ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
..$ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
..$ conc : num [1:84] 95 175 250 350 500 675 1000 95 175 250 ...
..$ uptake : num [1:84] 16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
--- 40 entries omitted ---
Or even for simple vectors:
> dataStr(function(x) is.atomic(x) && is.vector(x) && !is.ts(x))
List of 4
$ euro : Named num [1:11] 13.76 40.34 1.96 166.39 5.95 ...
..- attr(*, "names")= chr [1:11] "ATS" "BEF" "DEM" "ESP" ...
$ islands: Named num [1:48] 11506 5500 16988 2968 16 ...
..- attr(*, "names")= chr [1:48] "Africa" "Antarctica" "Asia" "Australia" ...
$ precip : Named num [1:70] 67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...
..- attr(*, "names")= chr [1:70] "Mobile" "Juneau" "Phoenix" "Little Rock" ...
$ rivers : num [1:141] 735 320 325 392 524 ...
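If the full str() output is too long, the same lookup can feed a compact overview instead. This is only a sketch and not part of the answer above: the name overview is made up for illustration, and it assumes the datasets package is attached, as it is in a default R session.

```r
# List the items that data() reports for the datasets package
items <- data(package = "datasets")$results[, "Item"]

# Look each item up by name; items that cannot be found come back as NULL and are dropped
objs <- Filter(
  Negate(is.null),
  mget(items, envir = globalenv(), inherits = TRUE, ifnotfound = list(NULL))
)

# One row per data set: its name and its first class
overview <- data.frame(
  name  = names(objs),
  class = vapply(objs, function(x) class(x)[1], character(1)),
  stringsAsFactors = FALSE,
  row.names = NULL
)
head(overview)
```

Filtering overview by class then gives a quick shortlist before calling str() on a single data set.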
Related
I am using the ConsensusClusterPlus package in R to cluster my omics data, and I want to use the clusters in a regression. Is there a way to create composite scores if, say, I reduce 1000 genes to 7 clusters and use those 7 clusters as predictors?
I looked at the structure of the clustering result in R:
results = ConsensusClusterPlus(d1,maxK=maxK,reps=1000,pItem=0.8,pFeature=1, title=title,clusterAlg="hc",distance="pearson",seed=1262118388.71279,plot="png")
icl = calcICL(results,title=title,plot="png")
str(results[[7]])
List of 5
$ consensusMatrix: num [1:40, 1:40] 1 0.689 0.976 1 1 ...
$ consensusTree :List of 7
..$ merge : int [1:39, 1:2] -1 -5 -7 -8 -9 -10 -11 -12 -13 -14 ...
..$ height : num [1:39] 0 0 0 0 0 0 0 0 0 0 ...
..$ order : int [1:40] 40 34 35 28 6 32 22 18 21 19 ...
..$ labels : NULL
..$ method : chr "average"
..$ call : language hclust(d = as.dist(1 - fm), method = finalLinkage)
..$ dist.method: NULL
..- attr(*, "class")= chr "hclust"
$ consensusClass : Named int [1:40] 1 1 1 1 1 2 1 1 1 1 ...
..- attr(*, "names")= chr [1:40] "CAR 12:0" "CAR 12:1" "CAR 13:0" "CAR 14:0" ...
$ ml : num [1:40, 1:40] 1 0.689 0.976 1 1 ...
$ clrs :List of 3
..$ : chr [1:40] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" ...
..$ : num 8
..$ : chr [1:7] "#A6CEE3" "#FB9A99" "#FF7F00" "#FDBF6F" ...
How can I compute composite scores?
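One possible way to get a composite score, sketched here under assumptions rather than taken from ConsensusClusterPlus itself, is to average the standardized features within each cluster. The sketch assumes the clustered items are the columns of the data matrix (as the names of consensusClass suggest) and uses made-up stand-in data: d plays the role of d1, and cls plays the role of results[[7]]$consensusClass.

```r
# Hypothetical stand-in data: 20 samples (rows) x 6 features (columns)
set.seed(1)
d <- matrix(rnorm(120), nrow = 20, dimnames = list(NULL, paste0("f", 1:6)))

# Stand-in for results[[k]]$consensusClass: a named integer vector of cluster labels
cls <- c(f1 = 1L, f2 = 1L, f3 = 2L, f4 = 2L, f5 = 2L, f6 = 3L)

# z-score each feature so that no single feature dominates a cluster average
z <- scale(d[, names(cls), drop = FALSE])

# One column per cluster: the mean z-score of the features assigned to that cluster
scores <- sapply(split(names(cls), cls),
                 function(f) rowMeans(z[, f, drop = FALSE]))
colnames(scores) <- paste0("cluster", colnames(scores))
```

The columns of scores could then be used as predictors, for example lm(y ~ ., data = data.frame(y = y, scores)) for some outcome y. A mean of z-scores is only one choice; the first principal component of each cluster is another.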
This is my first time performing KDE in R for anomaly detection on data with 5 or more variables.
As far as I know, KDE can be applied to multidimensional data, but I couldn't find any examples that use data with 5 or more dimensions.
My data has 5 variables ('age', 'trestbps', 'chol', 'thalach', and 'oldpeak'), as shown below.
'data.frame': 176 obs. of 5 variables:
$ age : int 30 50 50 50 50 60 50 40 50 40 ...
$ trestbps: int 130 130 130 130 130 130 130 130 130 130 ...
$ chol : int 198 245 221 288 205 309 240 243 289 250 ...
$ thalach : int 130 166 164 159 184 131 154 152 124 179 ...
$ oldpeak : num 1.6 2.4 0 0.2 0 1.8 0.6 0 1 0 ...
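I ran KDE on this data as shown below, but I'm not sure whether the approach is correct or the result is sensible.
library(ks)  # kde() here appears to come from the ks package
# Evaluation grid: every combination of five quantiles per variable (5^5 = 3125 points)
evpts <- do.call(expand.grid, lapply(df3, quantile, probs = c(0.1, 0.25, 0.5, 0.75, 0.9)))
hat2 <- kde(df3, eval.points = evpts)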
> str(hat2)
List of 9
$ x : num [1:176, 1:5] 30 50 50 50 50 60 50 40 50 40 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:5] "age" "trestbps" "chol" "thalach" ...
$ eval.points:'data.frame': 3125 obs. of 5 variables:
..$ age : Named num [1:3125] 40 40 50 60 60 40 40 50 60 60 ...
.. ..- attr(*, "names")= chr [1:3125] "10%" "25%" "50%" "75%" ...
..$ trestbps: Named num [1:3125] 108 108 108 108 108 112 112 112 112 112 ...
.. ..- attr(*, "names")= chr [1:3125] "10%" "10%" "10%" "10%" ...
..$ chol : Named num [1:3125] 194 194 194 194 194 194 194 194 194 194 ...
.. ..- attr(*, "names")= chr [1:3125] "10%" "10%" "10%" "10%" ...
..$ thalach : Named num [1:3125] 114 114 114 114 114 ...
.. ..- attr(*, "names")= chr [1:3125] "10%" "10%" "10%" "10%" ...
..$ oldpeak : Named num [1:3125] 0 0 0 0 0 0 0 0 0 0 ...
.. ..- attr(*, "names")= chr [1:3125] "10%" "10%" "10%" "10%" ...
..- attr(*, "out.attrs")=List of 2
.. ..$ dim : Named int [1:5] 5 5 5 5 5
.. .. ..- attr(*, "names")= chr [1:5] "age" "trestbps" "chol" "thalach" ...
.. ..$ dimnames:List of 5
.. .. ..$ age : chr [1:5] "age=40" "age=40" "age=50" "age=60" ...
.. .. ..$ trestbps: chr [1:5] "trestbps=108" "trestbps=112" "trestbps=120" "trestbps=128" ...
.. .. ..$ chol : chr [1:5] "chol=194.00" "chol=211.00" "chol=244.00" "chol=283.75" ...
.. .. ..$ thalach : chr [1:5] "thalach=113.50" "thalach=128.25" "thalach=150.00" "thalach=164.00" ...
.. .. ..$ oldpeak : chr [1:5] "oldpeak=0.0" "oldpeak=0.0" "oldpeak=0.8" "oldpeak=1.8" ...
$ estimate : Named num [1:3125] 5.64e-12 5.64e-12 2.85e-09 7.76e-10 7.76e-10 ...
..- attr(*, "names")= chr [1:3125] "1" "2" "3" "4" ...
$ H : num [1:5, 1:5] 6.972 0.866 5.065 -6.541 0.189 ...
$ gridded : logi FALSE
$ binned : logi FALSE
$ names : chr [1:5] "age" "trestbps" "chol" "thalach" ...
$ w : num [1:176] 1 1 1 1 1 1 1 1 1 1 ...
$ type : chr "kde"
- attr(*, "class")= chr "kde"
If this is not the right approach, could you please help me find the correct one?
Thank you for your support.
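For anomaly detection it may be more useful to evaluate the density at the observations themselves rather than on a grid of quantiles, and then flag the rows with the lowest density. The sketch below illustrates that idea in base R with a simple product Gaussian kernel and per-column bandwidths from bw.nrd0. That is a simplification and not what kde() computes, and the matrix x is made-up stand-in data, not df3.

```r
# Hypothetical stand-in for df3: 100 ordinary rows plus one far-away row (row 101)
set.seed(42)
x <- rbind(matrix(rnorm(500), ncol = 5), rep(8, 5))

# Per-column bandwidths from Silverman's rule of thumb
h <- apply(x, 2, bw.nrd0)

# Estimated density at each observed row under a product Gaussian kernel
dens <- sapply(seq_len(nrow(x)), function(i) {
  u <- sweep(x, 2, x[i, ], "-")        # differences between every row and row i
  k <- dnorm(sweep(u, 2, h, "/"))      # kernel value for each coordinate
  mean(apply(k, 1, prod)) / prod(h)    # average of the per-row kernel products
})

# Flag the rows whose estimated density is in the lowest 1%
flagged <- which(dens <= quantile(dens, 0.01))
```

The 1% cutoff is arbitrary and would need to be chosen for the application.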
Is there any way I can export this data to a CSV file instead of typing things in manually?
Below is the output from Hmisc describe function:
library(Hmisc) # Hmisc describe
> Hmisc::describe(data)
data
3 Variables 6 Observations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
ID
n missing distinct Info Mean Gmd
6 0 3 0.857 112.2 1.267
Value 110 112 113
Frequency 1 2 3
Proportion 0.167 0.333 0.500
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Date
n missing distinct
6 0 3
Value 23/04/2018 24/04/2018 25/04/2018
Frequency 3 2 1
Proportion 0.500 0.333 0.167
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Revenue
n missing distinct Info Mean Gmd
6 0 6 1 74 17.2
lowest : 51 65 70 85 86, highest: 65 70 85 86 87
Value 51 65 70 85 86 87
Frequency 1 1 1 1 1 1
Proportion 0.167 0.167 0.167 0.167 0.167 0.167
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset:
> data
ID Date Revenue
1 113 23/04/2018 51
2 113 23/04/2018 87
3 113 23/04/2018 70
4 112 24/04/2018 85
5 112 24/04/2018 65
6 110 25/04/2018 86
I doubt writing it to CSV would be helpful. Try writing it to a text file instead:
cat(capture.output(Hmisc::describe(data)), file = 'result.txt', sep = '\n')
Probably not going to be easy. You could use capture.output but then you would need to parse the sections differently depending on their class and counts. You could also assign the results to a data object and try to work with that, but again, there will be a diversity of formats:
obj <- describe(iris)
str(obj)
# this is the canonical example of a dataframe but it doesn't even capture all the cases.
List of 5
$ Sepal.Length:List of 6
..$ descript: chr "Sepal.Length"
..$ units : NULL
..$ format : NULL
..$ counts : Named chr [1:13] "150" "0" "35" "0.998" ...
.. ..- attr(*, "names")= chr [1:13] "n" "missing" "distinct" "Info" ...
..$ values :List of 2
.. ..$ value : num [1:35] 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 ...
.. ..$ frequency: num [1:35(1d)] 1 3 1 4 2 5 6 10 9 4 ...
..$ extremes: Named num [1:10] 4.3 4.4 4.5 4.6 4.7 7.3 7.4 7.6 7.7 7.9
.. ..- attr(*, "names")= chr [1:10] "L1" "L2" "L3" "L4" ...
..- attr(*, "class")= chr "describe"
$ Sepal.Width :List of 6
..$ descript: chr "Sepal.Width"
..$ units : NULL
..$ format : NULL
..$ counts : Named chr [1:13] "150" "0" "23" "0.992" ...
.. ..- attr(*, "names")= chr [1:13] "n" "missing" "distinct" "Info" ...
..$ values :List of 2
.. ..$ value : num [1:23] 2 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 ...
.. ..$ frequency: num [1:23(1d)] 1 3 4 3 8 5 9 14 10 26 ...
..$ extremes: Named num [1:10] 2 2.2 2.3 2.4 2.5 3.9 4 4.1 4.2 4.4
.. ..- attr(*, "names")= chr [1:10] "L1" "L2" "L3" "L4" ...
..- attr(*, "class")= chr "describe"
$ Petal.Length:List of 6
..$ descript: chr "Petal.Length"
..$ units : NULL
..$ format : NULL
..$ counts : Named chr [1:13] "150" "0" "43" "0.998" ...
.. ..- attr(*, "names")= chr [1:13] "n" "missing" "distinct" "Info" ...
..$ values :List of 2
.. ..$ value : num [1:43] 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.9 3 ...
.. ..$ frequency: num [1:43(1d)] 1 1 2 7 13 13 7 4 2 1 ...
..$ extremes: Named num [1:10] 1 1.1 1.2 1.3 1.4 6.3 6.4 6.6 6.7 6.9
.. ..- attr(*, "names")= chr [1:10] "L1" "L2" "L3" "L4" ...
..- attr(*, "class")= chr "describe"
$ Petal.Width :List of 6
..$ descript: chr "Petal.Width"
..$ units : NULL
..$ format : NULL
..$ counts : Named chr [1:13] "150" "0" "22" "0.99" ...
.. ..- attr(*, "names")= chr [1:13] "n" "missing" "distinct" "Info" ...
..$ values :List of 2
.. ..$ value : num [1:22] 0.1 0.2 0.3 0.4 0.5 0.6 1 1.1 1.2 1.3 ...
.. ..$ frequency: num [1:22(1d)] 5 29 7 7 1 1 7 3 5 13 ...
..$ extremes: Named num [1:10] 0.1 0.2 0.3 0.4 0.5 2.1 2.2 2.3 2.4 2.5
.. ..- attr(*, "names")= chr [1:10] "L1" "L2" "L3" "L4" ...
..- attr(*, "class")= chr "describe"
$ Species :List of 5
..$ descript: chr "Species"
..$ units : NULL
..$ format : NULL
..$ counts : Named num [1:3] 150 0 3
.. ..- attr(*, "names")= chr [1:3] "n" "missing" "distinct"
..$ values :List of 2
.. ..$ value : chr [1:3] "setosa" "versicolor" "virginica"
.. ..$ frequency: num [1:3(1d)] 50 50 50
..- attr(*, "class")= chr "describe"
- attr(*, "descript")= chr "iris"
- attr(*, "dimensions")= int [1:2] 150 5
- attr(*, "class")= chr "describe"
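If only the counts are needed, one option is to stack them in long format, so that variables with different sets of counts can share one table. The sketch below runs on a small hand-built list that mimics the structure printed above. It is not real Hmisc output, and the file name is made up.

```r
# Stand-in that mimics the printed structure: each element has a named $counts vector
obj <- list(
  Sepal.Length = list(descript = "Sepal.Length",
                      counts = c(n = "150", missing = "0", distinct = "35", Info = "0.998")),
  Species      = list(descript = "Species",
                      counts = c(n = 150, missing = 0, distinct = 3))
)

# One row per (variable, statistic); values are kept as text because $counts is
# character for some variables and numeric for others
rows <- lapply(names(obj), function(v) {
  cnt <- obj[[v]]$counts
  data.frame(variable = v, statistic = names(cnt), value = as.character(cnt),
             stringsAsFactors = FALSE)
})
long <- do.call(rbind, rows)

write.csv(long, file.path(tempdir(), "describe_counts.csv"), row.names = FALSE)
```

The values and extremes components would need their own handling, since their lengths differ from variable to variable.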
I am trying to run a PCA on a matrix of labeled numeric data. I want to include only certain columns (6-78) in the PCA, but I get an error (a syntax problem?).
Here's the code:
cytokines.pca <- prcomp(PICHCytokines[,c(6:78)], center = TRUE, scale. = TRUE)
summary(cytokines.pca)
The error is:
Error in `[.data.frame`(data, , c(6:78)) : undefined columns selected
Here's the structure of my data frame:
str(PICHCytokines)
'data.frame': 106 obs. of 69 variables:
$ Record.ID : Factor w/ 106 levels "FA001","FA007",..: 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "label")= chr "Record ID"
$ Event.Name : Factor w/ 2 levels "Enrollment and Admission",..: 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Event Name"
$ Time.since.trauma: 'labelled' num 0.717 7.717 1.383 0.817 2.85 ...
..- attr(*, "label")= chr "Time since trauma"
$ Batch.Number : 'labelled' int 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Batch Number"
$ Plate.Number : 'labelled' int 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Plate Number"
$ FASL.MFI : 'labelled' num 748 295 256 333 275 ...
..- attr(*, "label")= chr "FASL MFI"
$ TGFA.MFI : 'labelled' num 122 64.2 96 126 94.8 ...
..- attr(*, "label")= chr "TGFA MFI"
$ MIP1A.MFI : 'labelled' num 1611 142 158 339 168 ...
..- attr(*, "label")= chr "MIP1A MFI"
$ IL27.MFI : 'labelled' num 139.2 40 63 52.5 63.2 ...
..- attr(*, "label")= chr "IL27 MFI"
$ IL1B.MFI : 'labelled' num 68 38.2 77.5 46 70.8 ...
..- attr(*, "label")= chr "IL1B MFI"
$ IL2.MFI : 'labelled' num 159 61.5 120.8 79.5 117.2 ...
..- attr(*, "label")= chr "IL2 MFI"
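The str() output above reports 69 variables, so the index 6:78 asks for columns that do not exist, which is consistent with the "undefined columns selected" error. A minimal sketch of limiting the selection to the columns that are present, using a made-up stand-in data frame rather than PICHCytokines:

```r
# Hypothetical stand-in: 5 descriptive columns followed by 4 numeric measurement columns
set.seed(1)
df <- data.frame(id = 1:10, a = 1, b = 2, c = 3, e = 4, matrix(rnorm(40), ncol = 4))

cols <- 6:78
cols <- cols[cols <= ncol(df)]   # drop indices past the last column

pca <- prcomp(df[, cols], center = TRUE, scale. = TRUE)
summary(pca)
```

Selecting by name may be more robust than by position. For instance, grep("\\.MFI$", names(df)) would pick columns whose names end in .MFI, assuming all the measurement columns follow that naming.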
I was trying to run CreateTableOne from the tableone package on my dataset, m.dataaaaaa, using the following code:
CreateTableOne(vars = Vars, strata = "ejecfraclesstha40_gps", factorVars = Catvars, data = m.dataaaaaa, test = TRUE)
But I got the following error:
Error in `[<-.data.frame`(x, i, value = value) : duplicate subscripts for columns
In addition: Warning message:
In ModuleReturnVarsExist(vars, data) :
  The data frame does not have: ejecfraclesstha40 Dropped
The structure of the data is shown below; only part of it is included because it is a big database:
str(m.dataaaaaa)
Classes ‘data.table’ and 'data.frame': 194 obs. of 203 variables:
$ ejecfraclesstha40_gps : num 1 0 1 0 0 0 1 1 1 0 ...
$ Serial.ID : num 2 3 4 7 10 14 17 20 23 24 ...
..- attr(*, "format.spss")= chr "F4.0"
$ Serial.ID_matched.EF.cohort.Ivan1.to.2 : num 2 NA 4 NA NA NA 17 20 23 NA ...
..- attr(*, "format.spss")= chr "F8.0"
$ ps..matched.EF.cohort.Ivan1.to.2 : num 0.138 NA 0.19 NA NA NA 0.176 0.286 0.152 NA ...
..- attr(*, "format.spss")= chr "F8.3"
$ psweight1.to.2 : num 1 NA 1 NA NA NA 1 1 1 NA ...
..- attr(*, "format.spss")= chr "F8.2"
$ matched_ID1.to.2 : num 483 NA 763 NA NA NA 180 176 239 NA ...
..- attr(*, "format.spss")= chr "F8.2"
$ matched_cases_in_control1.to.2 : num 2 NA 2 NA NA NA 2 2 2 NA ...
..- attr(*, "format.spss")= chr "F8.2"
$ ejecfrac_4gps : num 1 3 1 3 3 3 1 1 1 3 ...
..- attr(*, "format.spss")= chr "F8.2"
..- attr(*, "labels")= Named num 1 2 3 4
.. ..- attr(*, "names")= chr "EF<35%" "EF=35 - <40%" "EF=40 - <=50" "EF>50%"
$ ejecfrac_4gps30 : num 1 4 1 3 3 4 1 1 1 4 ...
..- attr(*, "format.spss")= chr "F8.2"
..- attr(*, "labels")= Named num 1 2 3 4
.. ..- attr(*, "names")= chr "EF<=30%" "EF>30 - 39%" "EF=40 - 49%" "EF>=50%"
$ renisch : num 29 31 23 18 48 19 10 29 17 13 ...
..- attr(*, "label")= chr "renal + visceral ischemic time"
..- attr(*, "format.spss")= chr "F3.0"
..- attr(*, "display_width")= int 12
$ totxct : num 46 31 55 46 48 19 54 29 17 37 ...
..- attr(*, "label")= chr "total cross-clamp time"
..- attr(*, "format.spss")= chr "F4.0"
..- attr(*, "display_width")= int 12
The original database was read from SPSS into R.
My main problem is with this error:
Error in `[<-.data.frame`(x, i, value = value) : duplicate subscripts for columns
Any advice will be greatly appreciated.
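Both messages point at the variable lists, so a first step could be to check them against the column names. The sketch below shows such checks in base R on made-up stand-ins. dat, Vars and the derived names are hypothetical, and whether duplicated entries are the actual cause here is an assumption.

```r
# Hypothetical stand-ins for the data and the variable list
dat <- data.frame(ejecfraclesstha40_gps = c(1, 0, 1),
                  renisch = c(29, 31, 23),
                  totxct = c(46, 31, 55))
Vars <- c("renisch", "totxct", "renisch", "ejecfraclesstha40")

missing_vars <- setdiff(Vars, names(dat))        # requested but absent from the data
dup_vars <- unique(Vars[duplicated(Vars)])       # listed more than once
dup_cols <- names(dat)[duplicated(names(dat))]   # repeated column names in the data

# Variable list with duplicates and missing names removed
clean_vars <- intersect(unique(Vars), names(dat))
```

If any of these come back non-empty for the real Vars, Catvars or names(m.dataaaaaa), cleaning them up before calling CreateTableOne would be a reasonable thing to try.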