I would like to build the Cumulative Distribution Function (CDF) from an input file that contains the data used to generate a histogram. The input file has one column with the bins and one column with the number of occurrences in each bin, so it looks like this:
bin column6
0 1189
5 11957
10 24203
15 21518
20 14515
25 10323
30 7799
35 6015
40 4869
45 3858
50 3215
55 2615
60 2350
65 1890
70 1673
75 1433
80 1218
85 942
90 869
95 736
100 605
105 528
110 449
115 429
120 327
125 252
130 208
135 170
140 154
145 138
150 124
155 86
160 113
165 108
170 71
175 72
180 51
185 58
190 37
195 29
200 35
205 24
210 11
215 24
220 16
225 20
230 15
235 5
240 11
245 4
250 4
255 6
260 6
265 6
270 4
275 3
280 4
285 2
290 3
295 1
300 5
305 3
310 2
315 1
320 1
325 2
330 0
335 1
340 2
345 0
350 0
355 2
360 4
365 2
370 0
375 1
380 1
385 2
390 0
395 1
400 1
405 1
I use R to visualize the histogram using the following code:
library(ggplot2)
input <- read.table('/home/agalvez/data/domains/histo_leu.txt', sep="\t", header=TRUE)
# Refer to columns by bare name inside aes(); input$... is not needed there
histo <- ggplot(data=input, aes(x=bin, y=column6)) +
  geom_bar(stat="identity")
histo
Could someone give me some advice on how to build the CDF for this histogram? Thanks in advance!
The question is a bit unclear, but I assume you are looking for the empirical CDF (eCDF), since a parametric CDF generally has an analytical formula.
In R, you can use ecdf to generate an eCDF.
library(purrr)
library(tidyr)
library(dplyr)
library(ggplot2)
# Expand the counts so that each bin appears once per occurrence
input <- input %>%
  filter(column6 != 0) %>%
  mutate(
    column6 = map(column6, ~1:.x)
  ) %>%
  unnest(column6)
# Make the ecdf (ecdf() returns a function)
ecdf(input$bin)
# To plot use stat_ecdf
input %>%
ggplot(aes(bin))+
stat_ecdf(geom = "step")
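If you only need the cumulative proportions, expanding the counts may not be necessary. Here is a minimal base R sketch that accumulates the counts directly. It assumes each bin value labels its bin and uses only the first five rows of the table from the question, so the resulting values are illustrative rather than the real CDF of the full data.

```r
# First five rows of the table from the question (bin, count);
# the full file would be read with read.table() as in the question.
input <- data.frame(
  bin     = c(0, 5, 10, 15, 20),
  column6 = c(1189, 11957, 24203, 21518, 14515)
)

# Running total of the counts divided by the grand total gives,
# for each bin, the fraction of observations in that bin or below.
input$cdf <- cumsum(input$column6) / sum(input$column6)

input
```

Something like `plot(input$bin, input$cdf, type = "s")` should then draw it as a step function.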
I need to run a DESeq2 analysis on my dataset for a homework assignment, but I'm really new to this package (I have never used it before).
When I try to create the DESeqDataSet object:
counts <- read.table("ProstateCancerCountData.txt",sep="", header=TRUE, row.names=1)
metadat<- read.table("mart_export.txt",sep=",", header=TRUE, row.names=1)
counts <- as.matrix(counts)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = metadat, design = ~ GC.content+ Gene.type)
I get this error:
Error in DESeqDataSetFromMatrix(countData = counts, colData = metadat, :
ncol(countData) == nrow(colData) is not TRUE
I don't know how to fix it.
These are the two datasets I have to use for the analysis:
head(counts)
N_10 T_10 N_11 T_12 N_13 T_13 N_14 T_14 N_1 T_1 N_2 T_2 N_3
ENSG00000000003 401 442 1155 1095 788 754 852 938 774 520 808 648 891
ENSG00000000005 0 7 23 9 5 2 45 5 11 10 56 8 7
ENSG00000000419 112 96 424 468 385 452 751 491 247 222 509 363 706
ENSG00000000457 13 121 327 165 40 204 290 199 70 121 104 151 352
ENSG00000000460 24 66 162 137 71 159 174 156 86 94 120 91 166
ENSG00000000938 96 128 218 372 126 129 538 320 117 129 157 238 177
T_3 N_4 N_5 T_6 N_7 T_7 N_8 T_8 N_9 T_9
ENSG00000000003 1071 2059 737 1006 1146 653 1299 1306 1522 490
ENSG00000000005 0 18 0 7 1 4 1 2 0 3
ENSG00000000419 622 988 307 402 294 323 535 518 573 322
ENSG00000000457 333 328 58 153 138 115 179 200 86 85
ENSG00000000460 152 162 100 100 101 148 128 78 83 109
ENSG00000000938 86 113 410 230 64 76 93 61 121 68
head(metadat)
Chromosome.scaffold.name Gene.start..bp. Gene.end..bp.
ENSG00000271782 1 50902700 50902978
ENSG00000232753 1 103817769 103828355
ENSG00000225767 1 50927141 50936822
ENSG00000202140 1 50965430 50965529
ENSG00000207194 1 51048076 51048183
ENSG00000252825 1 51215968 51216025
GC.content Gene.type
ENSG00000271782 35.48 lincRNA
ENSG00000232753 33.99 lincRNA
ENSG00000225767 38.99 antisense
ENSG00000202140 43.00 misc_RNA
ENSG00000207194 37.96 snRNA
ENSG00000252825 36.21 snRNA
Thank you for your help and for any insight.
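For what it's worth, the check in the error message compares the number of columns of the count matrix (samples) with the number of rows of colData, while metadat above has one row per gene. Here is a base R sketch of a sample table with one row per sample. It assumes the N_/T_ prefix encodes the condition and the number after the underscore identifies the patient; both are guesses from the column names, not something stated in the question.

```r
# A few column names copied from head(counts) in the question
sample_names <- c("N_10", "T_10", "N_11", "T_12", "N_13", "T_13")

# Hypothetical sample table: one row per column of the count matrix.
# 'condition' and 'patient' are assumed meanings of the name parts.
coldata <- data.frame(
  condition = factor(sub("_.*", "", sample_names)),  # text before "_"
  patient   = factor(sub(".*_", "", sample_names)),  # text after "_"
  row.names = sample_names
)

# This is the property the failing check requires:
# the number of samples must equal the number of rows in colData.
nrow(coldata) == length(sample_names)
```

A table shaped like this, rather than the gene annotation, is the kind of object colData is checked against.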
EDIT:
Thank you for your previous answer.
I switched to another dataset for this homework, but now I have a different error.
This is my new dataset:
head(mycounts)
R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver R1L6Liver
ENSG00000177757 2 1 0 0 1
ENSG00000187634 49 27 43 34 23
ENSG00000188976 73 34 77 56 45
ENSG00000187961 15 8 15 13 11
ENSG00000187583 1 0 1 1 0
ENSG00000187642 4 0 5 0 2
R1L7Kidney R1L8Liver R2L2Kidney R2L3Liver R2L6Kidney
ENSG00000177757 2 0 1 1 3
ENSG00000187634 41 35 42 25 47
ENSG00000188976 68 55 70 42 82
ENSG00000187961 13 12 12 20 15
ENSG00000187583 3 0 0 2 3
ENSG00000187642 12 1 9 4 9
head(myfactors)
Tissue TissueRun
R1L1Kidney Kidney Kidney_1
R1L2Liver Liver Liver_1
R1L3Kidney Kidney Kidney_1
R1L4Liver Liver Liver_1
R1L6Liver Liver Liver_1
R1L7Kidney Kidney Kidney_1
When I create my DESeq object, I would like to include both Tissue and TissueRun in the design to account for the batch effect. But I get an error:
dds2 <- DESeqDataSetFromMatrix(countData = mycounts, colData = myfactors, design = ~ Tissue + TissueRun)
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
Please read the vignette section 'Model matrix not full rank':
vignette('DESeq2')
Thank you for your help
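The rank problem can be reproduced in base R without DESeq2. Here is a minimal sketch on a hypothetical sample table that mimics myfactors. It assumes TissueRun is simply the tissue name pasted to a run number, and the separate Run column is an illustrative addition that is not in the original data.

```r
# Hypothetical sample table shaped like myfactors (8 samples)
tissue <- rep(c("Kidney", "Liver"), times = 4)
run    <- rep(c("1", "2"), each = 4)          # assumed run/batch label
myfactors <- data.frame(
  Tissue    = tissue,
  Run       = run,
  TissueRun = paste(tissue, run, sep = "_")   # e.g. "Kidney_1"
)

# Design from the question: TissueRun already determines Tissue,
# so the Tissue column is a sum of TissueRun columns.
mm_nested <- model.matrix(~ Tissue + TissueRun, data = myfactors)
c(columns = ncol(mm_nested), rank = qr(mm_nested)$rank)

# Keeping the run as its own variable avoids the redundancy.
mm_additive <- model.matrix(~ Run + Tissue, data = myfactors)
c(columns = ncol(mm_additive), rank = qr(mm_additive)$rank)
```

When the rank is smaller than the number of columns, at least one coefficient cannot be estimated, which is what the error message describes.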
Given e.g. the Orange data set, I would like to arrange the observations in a matrix in which the measurements (circumference) taken on each tree are arranged in rows (for a total of 5 rows).
One unsatisfactory way of obtaining this result is as follows:
mat<-matrix(Orange[,3],nrow=5, ncol = 7,byrow=T, dimnames = list(c(unique(Orange$Tree)),c(1:7)))
An alternative way would be using the dcast( ) function within the data.table package.
This allows you to convert data from long to wide. In this case, I've created an ID to count the number of records per Tree.
In the re-shaped data, Tree becomes our primary column and circumference is recorded in 7 unique columns (one for each age).
library(data.table)
Orange <- data.table(Orange)[,ID := seq_len(.N), by=Tree]
Orange2 <- dcast(
data = Orange,
formula = Tree ~ ID,
value.var = "circumference")
Orange2
Tree 1 2 3 4 5 6 7
1: 3 30 51 75 108 115 139 140
2: 1 30 58 87 115 120 142 145
3: 5 30 49 81 125 142 174 177
4: 2 33 69 111 156 172 203 203
5: 4 32 62 112 167 179 209 214
EDIT (in response to additional comments/questions):
Technically the data is already ordered by Tree (as defined within the data). This is because the variable Tree is a factor with preset levels. To order numerically, there are two options: (1) order by as.character( ) and (2) re-level the variable.
Orange2[order(as.character(Tree)),]
1: 1 30 58 87 115 120 142 145
2: 2 33 69 111 156 172 203 203
3: 3 30 51 75 108 115 139 140
4: 4 32 62 112 167 179 209 214
5: 5 30 49 81 125 142 174 177
class(Orange$Tree)
[1] "ordered" "factor"
levels(Orange$Tree)
[1] "3" "1" "5" "2" "4"
Orange2[,Tree := factor(Tree, c("1","2","3","4","5"), ordered = FALSE)]
Orange2[order(Tree),]
Tree 1 2 3 4 5 6 7
1: 1 30 58 87 115 120 142 145
2: 2 33 69 111 156 172 203 203
3: 3 30 51 75 108 115 139 140
4: 4 32 62 112 167 179 209 214
5: 5 30 49 81 125 142 174 177
In base, you could simply do:
aggregate(circumference ~ Tree, Orange, I)
If you don't want to order it afterwards: aggregate(circumference ~ as.character(Tree), Orange, I) (that will strip the factor ordering).
Or similar to @RyanF:
Orange$id <- sequence(rle(as.character(Orange$Tree))$lengths)
reshape(Orange[,-2],
idvar = "Tree",
timevar = "id",
direction = "wide")
Output:
Tree circumference.1 circumference.2 circumference.3 circumference.4 circumference.5 circumference.6 circumference.7
1 1 30 58 87 115 120 142 145
8 2 33 69 111 156 172 203 203
15 3 30 51 75 108 115 139 140
22 4 32 62 112 167 179 209 214
29 5 30 49 81 125 142 174 177
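If a plain numeric matrix is wanted (as in the question) rather than a data frame, another base R possibility is to split the measurements by tree and bind the pieces as rows. This is only a sketch; it relies on every tree having the same number of measurements, which holds for Orange.

```r
# Work on a fresh copy of the built-in data set
orange_df <- datasets::Orange

# Split circumference by tree; as.character() drops the factor ordering
# so the groups come out sorted as "1", "2", ..., "5".
by_tree <- split(orange_df$circumference, as.character(orange_df$Tree))

# Stack the five vectors as rows of a 5 x 7 matrix
wide <- do.call(rbind, by_tree)
colnames(wide) <- seq_len(ncol(wide))

wide
```

The row names are the tree labels, so `wide["3", ]` returns the seven measurements of tree 3.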
I'd like to do several manipulations with the datasets that are built into R through the packages I have installed. First, I made a vector with the datasets' names, but when I tried to filter the datasets that have only one column, I got an error saying that the argument is of length 0. Here is the code:
for (i in datasets){
  if (ncol(i)==1){
    dataset <- i
    datasets <- c(dataset, datasets)
  }
}
It treats the names of the datasets as a character vector.
Here is the head of the aforementioned vector: [1] ability.cov airmiles AirPassengers airquality anscombe attenu. It's silly, but how could I treat the entries as dataframes?
I don't fully understand your logic, but based on your code, you want to identify which datasets have exactly one column by using ncol(x) == 1. If that's right, then you need to deal with a few issues:
The various structures of the datasets. ncol returns the number of columns for a data.frame or matrix but not for a univariate time series. For example, ncol(anscombe) returns 8 but ncol(AirPassengers) returns NULL. If you decide to use ncol, then you need to coerce each dataset to a data.frame by using as.data.frame.
Indexing the character vector of dataset names. You need to pass the dataset itself, not its name as a string, to as.data.frame. One way of doing this is by using eval(parse(text=the_name)).
The way to store the result. You can use c() to combine the results, but the datasets will be converted to vectors and lose their initial structures. I recommend using a list to preserve the data frame structure of each dataset.
Here is one possible solution based on those considerations:
datasets <- c("ability.cov", "airmiles", "AirPassengers", "airquality", "anscombe", "attenu")
single_col_datasets <- vector('list', length(datasets))
for (i in seq_along(datasets)){
  # Coerce once so that ncol() also works for time series and lists
  current <- as.data.frame(eval(parse(text = datasets[i])))
  if (ncol(current) == 1){
    names(current) <- datasets[i]
    single_col_datasets[[i]] <- current
  }
}
# Drop the empty slots and combine with the original names
not.null.element <- single_col_datasets[lengths(single_col_datasets) != 0]
new.datasets <- list(not.null.element, datasets)
Here is the result:
new.datasets
[[1]]
[[1]][[1]]
airmiles
1 412
2 480
3 683
4 1052
5 1385
6 1418
7 1634
8 2178
9 3362
10 5948
11 6109
12 5981
13 6753
14 8003
15 10566
16 12528
17 14760
18 16769
19 19819
20 22362
21 25340
22 25343
23 29269
24 30514
[[1]][[2]]
AirPassengers
1 112
2 118
3 132
4 129
5 121
6 135
7 148
8 148
9 136
10 119
11 104
12 118
13 115
14 126
15 141
16 135
17 125
18 149
19 170
20 170
21 158
22 133
23 114
24 140
25 145
26 150
27 178
28 163
29 172
30 178
31 199
32 199
33 184
34 162
35 146
36 166
37 171
38 180
39 193
40 181
41 183
42 218
43 230
44 242
45 209
46 191
47 172
48 194
49 196
50 196
51 236
52 235
53 229
54 243
55 264
56 272
57 237
58 211
59 180
60 201
61 204
62 188
63 235
64 227
65 234
66 264
67 302
68 293
69 259
70 229
71 203
72 229
73 242
74 233
75 267
76 269
77 270
78 315
79 364
80 347
81 312
82 274
83 237
84 278
85 284
86 277
87 317
88 313
89 318
90 374
91 413
92 405
93 355
94 306
95 271
96 306
97 315
98 301
99 356
100 348
101 355
102 422
103 465
104 467
105 404
106 347
107 305
108 336
109 340
110 318
111 362
112 348
113 363
114 435
115 491
116 505
117 404
118 359
119 310
120 337
121 360
122 342
123 406
124 396
125 420
126 472
127 548
128 559
129 463
130 407
131 362
132 405
133 417
134 391
135 419
136 461
137 472
138 535
139 622
140 606
141 508
142 461
143 390
144 432
[[2]]
[1] "ability.cov" "airmiles" "AirPassengers" "airquality" "anscombe" "attenu"
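You can use the get function to look up each dataset by its name. Coerce the result to a data frame first, because ncol() returns NULL for time series such as AirPassengers and the comparison would then fail with the same "argument is of length zero" error:
for (i in datasets){
  if (ncol(as.data.frame(get(i)))==1){
    dataset <- i
    datasets <- c(dataset, datasets)
  }
}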
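The same idea can be written without a loop. Here is a sketch using mget() and vapply(); the names dataset_names, n_cols and single_col are only illustrative, and it assumes the datasets package is attached, as it is in a default R session.

```r
dataset_names <- c("ability.cov", "airmiles", "AirPassengers",
                   "airquality", "anscombe", "attenu")

# Fetch all objects by name in one call; inherits = TRUE searches
# the attached packages, where the built-in datasets live.
objs <- mget(dataset_names, inherits = TRUE)

# Count columns after coercing to data.frame, so that time series
# and lists are handled as well.
n_cols <- vapply(objs, function(x) ncol(as.data.frame(x)), integer(1))

# Names of the datasets with exactly one column
single_col <- names(n_cols)[n_cols == 1]
single_col
```

Keeping the result as a vector of names, rather than prepending to the vector being iterated over, makes it easier to reuse afterwards, for example with `objs[single_col]`.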
My data looks like below.
ID Group timing glucose_level
<chr> <dbl> <int> <dbl>
1 black 7 0 0 136
2 black 1 0 0 116
3 blue 20 0 0 144
4 green 18 0 0 114
5 red 4 0 0 126
6 red 5 0 0 80
7 green 17 0 0 111
8 green 3 0 0 109
9 red 20 0 0 96
10 black 39 0 0 140
There are some missing values in glucose level.
Below is part of the glucose level data:
[697] 128 157 132 142 141 128 97 120 123 131 132 126 140 103 147 181 217 257 218 234 240 281 273 224 210 227 NA NA 245
[726] 230 252 270 238 134 173 193 151 128 180 218 218 190 225 214 186 140 237 239 279 246 244 146 196 157 178 140 127 187
[755] 206 177 220 179 167 127 219 223 241 162 235 140 187 154 172 116 139 194 173 150 187 131 176 114 154 180 223 150 219
[784] 130 169 104 136 132 121 175 169 128 110 101 100 92 122 196 203 96 143 129 NA 72 141 143 129 149 132 107 94 76
[813] 80 95 63 198 181 86 122
I want to use a loop to replace the missing values.
Here is my code:
for(i in 1:length(data)){
if(is.na(data[i,'glucose_level'])){
if(data[i,'Group']==0){
data[i,'glucose_level']=162.7059
}else if(data[i,'Group']==1){
data[i,'glucose_level']= 163.1415
}else{
data[i,'glucose_level']= 165.9106
}
}
}
When I print data$glucose_level, there are still missing values in it. Why does my data not change?
You can use nested ifelse or case_when and check for conditions and assign values accordingly.
library(dplyr)
data <- data %>%
mutate(glucose_level = case_when(!is.na(glucose_level) ~ glucose_level,
Group == 0 ~ 162.7059,
Group == 1 ~ 163.1415,
TRUE ~ 165.9106))
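We can use fcase from data.table. The fallback value has to be passed through the default argument, because fcase expects its unnamed arguments as condition/value pairs:
library(data.table)
setDT(data)[, glucose_level := fcase(!is.na(glucose_level), glucose_level,
                                     Group == 0, 162.7059,
                                     Group == 1, 163.1415,
                                     default = 165.9106)]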
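As for why the original loop left the NA values in place: for a data frame, length(data) is the number of columns, not the number of rows, so 1:length(data) only visits the first few rows. Here is a small sketch on made-up data (the toy values are hypothetical; only the column names and replacement values come from the question) that keeps the loop but iterates over rows.

```r
# Hypothetical toy data using the column names from the question
data <- data.frame(
  Group         = c(0, 1, 2, 0, 1, 2),
  glucose_level = c(136, NA, 144, 114, 126, NA)
)

length(data)  # number of columns (2 here)
nrow(data)    # number of rows (6 here)

# Iterate over the rows instead of the columns
for (i in seq_len(nrow(data))) {
  if (is.na(data[i, "glucose_level"])) {
    if (data[i, "Group"] == 0) {
      data[i, "glucose_level"] <- 162.7059
    } else if (data[i, "Group"] == 1) {
      data[i, "glucose_level"] <- 163.1415
    } else {
      data[i, "glucose_level"] <- 165.9106
    }
  }
}

data
```

With 1:length(data) the loop above would stop after row 2, so the NA in row 6 would never be reached.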
> head(m)
X id1 q_following topic_followed topic_answered nfollowers nfollowing
1 1 1 80 80 100 180 180
2 2 1 76 76 95 171 171
3 3 1 72 72 90 162 162
4 4 1 68 68 85 153 153
5 5 1 64 64 80 144 144
6 6 1 60 60 75 135 135
> head(d)
X id1 q_following topic_followed topic_answered nfollowers nfollowing
1 1 1 63 735 665 949 146
2 2 1 89 737 666 587 185
3 3 1 121 742 670 428 264
4 4 1 277 750 706 622 265
5 5 1 339 765 734 108 294
6 6 1 363 767 766 291 427
matcher <- function(x,y){ return(na.omit(m[which(d[,y]==x),y])) }
max_matcher <- function(x) { return(sum(matcher(x,3:13))) }
result <- foreach(1:1000, function(x) {
if(max(max_matcher(1:1000)) == max_matcher(x)) return(x)
})
I want to compute result across each group, grouped by id1 of dataframe m.
m %>% group_by(id1) %>% summarise(result) #doesn't work
by(m, m[,"id1"], result) #doesn't work
How should I proceed?