I need to run a DESeq2 analysis on my dataset for a homework assignment, but I am new to this package (I have never used it before).
When I run the following code:
counts <- read.table("ProstateCancerCountData.txt",sep="", header=TRUE, row.names=1)
metadat<- read.table("mart_export.txt",sep=",", header=TRUE, row.names=1)
counts <- as.matrix(counts)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = metadat, design = ~ GC.content+ Gene.type)
I get this error:
Error in DESeqDataSetFromMatrix(countData = counts, colData = metadat, :
ncol(countData) == nrow(colData) is not TRUE
I don't know how to fix it.
These are the two datasets I have to use for the analysis:
head(counts)
N_10 T_10 N_11 T_12 N_13 T_13 N_14 T_14 N_1 T_1 N_2 T_2 N_3
ENSG00000000003 401 442 1155 1095 788 754 852 938 774 520 808 648 891
ENSG00000000005 0 7 23 9 5 2 45 5 11 10 56 8 7
ENSG00000000419 112 96 424 468 385 452 751 491 247 222 509 363 706
ENSG00000000457 13 121 327 165 40 204 290 199 70 121 104 151 352
ENSG00000000460 24 66 162 137 71 159 174 156 86 94 120 91 166
ENSG00000000938 96 128 218 372 126 129 538 320 117 129 157 238 177
T_3 N_4 N_5 T_6 N_7 T_7 N_8 T_8 N_9 T_9
ENSG00000000003 1071 2059 737 1006 1146 653 1299 1306 1522 490
ENSG00000000005 0 18 0 7 1 4 1 2 0 3
ENSG00000000419 622 988 307 402 294 323 535 518 573 322
ENSG00000000457 333 328 58 153 138 115 179 200 86 85
ENSG00000000460 152 162 100 100 101 148 128 78 83 109
ENSG00000000938 86 113 410 230 64 76 93 61 121 68
head(metadat)
Chromosome.scaffold.name Gene.start..bp. Gene.end..bp.
ENSG00000271782 1 50902700 50902978
ENSG00000232753 1 103817769 103828355
ENSG00000225767 1 50927141 50936822
ENSG00000202140 1 50965430 50965529
ENSG00000207194 1 51048076 51048183
ENSG00000252825 1 51215968 51216025
GC.content Gene.type
ENSG00000271782 35.48 lincRNA
ENSG00000232753 33.99 lincRNA
ENSG00000225767 38.99 antisense
ENSG00000202140 43.00 misc_RNA
ENSG00000207194 37.96 snRNA
ENSG00000252825 36.21 snRNA
Thank you for your help and for any insight.
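For what it's worth, the check that fails compares the number of columns of the count matrix with the number of rows of colData, so colData needs one row per sample (N_10, T_10, ...), whereas mart_export.txt has one row per gene. Below is a minimal sketch of a sample table built from the column names. The toy matrix reuses the first values shown above, and reading the N_/T_ prefix as a condition and the number as a patient is an assumption about the naming scheme, not something stated in the post.

```r
# Toy count matrix: two genes, four samples (values copied from head(counts))
counts <- matrix(
  c(401L, 0L, 442L, 7L, 1155L, 23L, 1095L, 9L),
  nrow = 2,
  dimnames = list(c("ENSG00000000003", "ENSG00000000005"),
                  c("N_10", "T_10", "N_11", "T_12"))
)

# One row per sample; the meaning of the prefix and suffix is an assumption
sample_info <- data.frame(
  condition = factor(sub("_.*", "", colnames(counts))),  # "N" or "T"
  patient   = factor(sub(".*_", "", colnames(counts))),  # "10", "11", "12"
  row.names = colnames(counts)
)

# The two conditions DESeqDataSetFromMatrix relies on
ncol(counts) == nrow(sample_info)
identical(colnames(counts), rownames(sample_info))

# With DESeq2 installed, the object could then be built along these lines:
# dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
#                                       colData = sample_info,
#                                       design = ~ condition)
```

Gene-level annotation such as GC.content describes rows rather than samples, so it does not belong in the design formula.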
EDIT:
Thank you for the previous answer.
I switched to another dataset for this homework, but now I get a different error.
This is my new dataset:
head(mycounts)
R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver R1L6Liver
ENSG00000177757 2 1 0 0 1
ENSG00000187634 49 27 43 34 23
ENSG00000188976 73 34 77 56 45
ENSG00000187961 15 8 15 13 11
ENSG00000187583 1 0 1 1 0
ENSG00000187642 4 0 5 0 2
R1L7Kidney R1L8Liver R2L2Kidney R2L3Liver R2L6Kidney
ENSG00000177757 2 0 1 1 3
ENSG00000187634 41 35 42 25 47
ENSG00000188976 68 55 70 42 82
ENSG00000187961 13 12 12 20 15
ENSG00000187583 3 0 0 2 3
ENSG00000187642 12 1 9 4 9
head(myfactors)
Tissue TissueRun
R1L1Kidney Kidney Kidney_1
R1L2Liver Liver Liver_1
R1L3Kidney Kidney Kidney_1
R1L4Liver Liver Liver_1
R1L6Liver Liver Liver_1
R1L7Kidney Kidney Kidney_1
When I build my DESeq object, I want to include both Tissue and TissueRun in the design to account for the batch effect, but I get an error:
dds2 <- DESeqDataSetFromMatrix(countData = mycounts, colData = myfactors, design = ~ Tissue + TissueRun)
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
Please read the vignette section 'Model matrix not full rank':
vignette('DESeq2')
Thank you for your help
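One likely reading of this second error, assuming TissueRun also takes values such as Kidney_2 and Liver_2 for the R2 samples (only the first six rows are shown, so this is a guess): every TissueRun level already determines the Tissue, so the Tissue column of the model matrix is a sum of TissueRun columns. The sketch below uses a made-up eight-sample table to show the rank deficiency with base R. The Run column is hypothetical, extracted from TissueRun as one possible way to keep the batch information.

```r
# Made-up sample table: the Kidney_2 / Liver_2 levels are an assumption
myfactors <- data.frame(
  Tissue    = factor(rep(c("Kidney", "Liver"), times = 4)),
  TissueRun = factor(c("Kidney_1", "Liver_1", "Kidney_1", "Liver_1",
                       "Kidney_2", "Liver_2", "Kidney_2", "Liver_2"))
)

# Original design: Tissue is fully determined by TissueRun
mm_nested <- model.matrix(~ Tissue + TissueRun, data = myfactors)
ncol(mm_nested)      # number of coefficients requested
qr(mm_nested)$rank   # number that can actually be estimated (smaller here)

# Hypothetical alternative: keep only the run number as the batch variable
myfactors$Run <- factor(sub(".*_", "", as.character(myfactors$TissueRun)))
mm_crossed <- model.matrix(~ Run + Tissue, data = myfactors)
qr(mm_crossed)$rank == ncol(mm_crossed)   # TRUE when the design is full rank
```

If the rank equals the number of columns, a design like ~ Run + Tissue should pass the full-rank check. Whether that design suits the experiment is a separate question.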
df <- data.frame(items=sample(LETTERS,replace= T),quantity=sample(1:100,26,replace=FALSE),price=sample(100:1000,26,replace=FALSE))
I want to group the rows so that the summed quantity of each group is roughly 500 (a ballpark figure).
Once the running total gets close to 500, a new group should start.
Any help would be appreciated.
Update
Because the requirement changed, I lowered the threshold to 250.
I then summarised the data to find the maximum running total of each group.
How can I merge group 6, whose total is below 200, into group 5?
I tried using ifelse but could not get it to work.
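library(dplyr)
set.seed(123)
df <- data.frame(items = sample(LETTERS, replace = TRUE), quantity = sample(1:100, 26, replace = FALSE), price = sample(100:1000, 26, replace = FALSE))
# start a new group each time the overall running total crosses a multiple of 250
df$group <- cumsum(c(1, ifelse(diff(cumsum(df$quantity) %% 250) < 0, 1, 0)))
# running total within each group
df$total <- ave(df$quantity, df$group, FUN = cumsum)
df %>% group_by(group) %>% summarise(max = max(total, na.rm = TRUE))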
# A tibble: 6 × 2
  group   max
  <dbl> <int>
1     1   238
2     2   254
3     3   256
4     4   246
5     5   237
6     6   101
I would like to get:
> df
items quantity price group total
1 O 36 393 1 36
2 S 78 376 1 114
3 N 81 562 1 195
4 C 43 140 1 238
5 J 76 530 2 76
6 R 15 189 2 91
7 V 32 415 2 123
8 K 7 322 2 130
9 E 9 627 2 139
10 T 41 215 2 180
11 N 74 705 2 254
12 V 23 873 3 23
13 Y 27 846 3 50
14 Z 60 555 3 110
15 E 53 697 3 163
16 S 93 953 3 256
17 Y 86 138 4 86
18 Y 88 258 4 174
19 I 38 851 4 212
20 C 34 308 4 246
21 H 69 473 5 69
22 Z 72 917 5 141
23 G 96 133 5 237
24 J 63 615 5 300
25 I 13 112 5 313
26 S 25 168 5 338
Thank you for any help.
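One possible way to fold a short trailing group into the previous one, as a rough sketch: check the total of the last group and, if it is under the minimum, relabel its rows. The helper name merge_small_last and the toy vectors are made up for illustration, and the sketch assumes the group labels are consecutive integers, as produced by the cumsum approach.

```r
# Hypothetical helper: merge the last group into the previous one when its
# total quantity is below min_total. Assumes groups are labelled 1, 2, 3, ...
merge_small_last <- function(group, quantity, min_total) {
  last <- max(group)
  last_total <- sum(quantity[group == last])
  if (last > min(group) && last_total < min_total) {
    group[group == last] <- last - 1
  }
  group
}

# Toy data: the third group only sums to 100
quantity <- c(100, 150, 120, 140, 60, 40)
group    <- c(1, 1, 2, 2, 3, 3)

new_group <- merge_small_last(group, quantity, min_total = 200)
new_total <- ave(quantity, new_group, FUN = cumsum)  # running total per group
data.frame(quantity, group = new_group, total = new_total)
```

On the data frame from the question this would be called as df$group <- merge_small_last(df$group, df$quantity, 200), followed by recomputing df$total with ave.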
Base R
set.seed(123)
df <- data.frame(items=sample(LETTERS,replace= T),quantity=sample(1:100,26,replace=FALSE),price=sample(100:1000,26,replace=FALSE))
df$group=cumsum(c(1,ifelse(diff(cumsum(df$quantity)%%500)<0,1,0)))
df$total=ave(df$quantity,df$group,FUN=cumsum)
items quantity price group total
1 O 36 393 1 36
2 S 78 376 1 114
3 N 81 562 1 195
4 C 43 140 1 238
5 J 76 530 1 314
6 R 15 189 1 329
7 V 32 415 1 361
8 K 7 322 1 368
9 E 9 627 1 377
10 T 41 215 1 418
11 N 74 705 1 492
12 V 23 873 2 23
13 Y 27 846 2 50
14 Z 60 555 2 110
15 E 53 697 2 163
16 S 93 953 2 256
17 Y 86 138 2 342
18 Y 88 258 2 430
19 I 38 851 2 468
20 C 34 308 2 502
21 H 69 473 3 69
22 Z 72 917 3 141
23 G 96 133 3 237
24 J 63 615 3 300
25 I 13 112 3 313
26 S 25 168 3 338
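A note on how the modulo trick above behaves, shown on a small made-up vector with a threshold of 10 instead of 500: the overall running total is taken modulo the threshold, and a new group starts wherever that remainder drops. The running sum is never reset, so groups are cut at multiples of the threshold of the overall total rather than after each group itself reaches the threshold.

```r
x     <- c(3, 4, 5, 2, 6, 1)   # toy quantities (made up)
thres <- 10                    # toy threshold

running   <- cumsum(x)         # overall running total
remainder <- running %% thres  # position within the current block of 10

# A drop in the remainder means a multiple of thres was crossed
new_group_flag <- ifelse(diff(remainder) < 0, 1, 0)
group <- cumsum(c(1, new_group_flag))

data.frame(x, running, remainder, group)
```

Here the rows end up in the groups 1, 1, 2, 2, 3, 3: the third value pushes the overall total past 10, so it opens the second group.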
You could use Reduce(..., accumulate = TRUE) to find where the cumulative quantity first reaches 500 and reset the running sum at that point.
set.seed(123)
df <- data.frame(items=sample(LETTERS,replace= T),quantity=sample(1:100,26,replace=FALSE),price=sample(100:1000,26,replace=FALSE))
library(dplyr)
df %>%
group_by(group = lag(cumsum(Reduce(\(x, y) {
z <- x + y
if(z < 500) z else 0
}, quantity, accumulate = TRUE) == 0) + 1, default = 1)) %>%
mutate(total = sum(quantity)) %>%
ungroup()
# A tibble: 26 × 5
items quantity price group total
<chr> <int> <int> <dbl> <int>
1 O 36 393 1 515
2 S 78 376 1 515
3 N 81 562 1 515
4 C 43 140 1 515
5 J 76 530 1 515
6 R 15 189 1 515
7 V 32 415 1 515
8 K 7 322 1 515
9 E 9 627 1 515
10 T 41 215 1 515
11 N 74 705 1 515
12 V 23 873 1 515
13 Y 27 846 2 548
14 Z 60 555 2 548
15 E 53 697 2 548
16 S 93 953 2 548
17 Y 86 138 2 548
18 Y 88 258 2 548
19 I 38 851 2 548
20 C 34 308 2 548
21 H 69 473 2 548
22 Z 72 917 3 269
23 G 96 133 3 269
24 J 63 615 3 269
25 I 13 112 3 269
26 S 25 168 3 269
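To make the resetting accumulator easier to follow, here is the same idea on a small made-up vector with a threshold of 10, written with base R only (the shift that dplyr::lag performs is done with c() and head()). The name running_reset is hypothetical.

```r
# Running sum that drops back to 0 as soon as it reaches the threshold
running_reset <- function(x, thres) {
  Reduce(function(acc, y) {
    z <- acc + y
    if (z < thres) z else 0
  }, x, accumulate = TRUE)
}

x   <- c(3, 4, 5, 2, 6, 1)      # toy quantities (made up)
acc <- running_reset(x, 10)     # a 0 marks the row that closed a group

# Count the resets, then shift by one row so the closing row stays in its group
closed <- cumsum(acc == 0) + 1
group  <- c(1, head(closed, -1))

data.frame(x, acc, group)
```

The third value closes the first group (3 + 4 + 5 reaches 10), so the groups are 1, 1, 1, 2, 2, 2. A quantity of 0 at the start of a group would look like a reset, which is a limitation of using 0 as the marker.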
Here is a base R solution. The groups break after the cumulative sum passes a threshold. The output of aggregate shows that all cumulative sums are above thres except for the last one.
set.seed(2022)
df <- data.frame(items=sample(LETTERS,replace= T),
quantity=sample(1:100,26,replace=FALSE),
price=sample(100:1000,26,replace=FALSE))
f <- function(x, thres) {
grp <- integer(length(x))
run <- 0
current_grp <- 0L
for(i in seq_along(x)) {
run <- run + x[i]
grp[i] <- current_grp
if(run > thres) {
current_grp <- current_grp + 1L
run <- 0
}
}
grp
}
thres <- 500
group <- f(df$quantity, thres)
aggregate(quantity ~ group, df, sum)
#> group quantity
#> 1 0 552
#> 2 1 513
#> 3 2 214
ave(df$quantity, group, FUN = cumsum)
#> [1] 70 133 155 224 235 327 347 409 481 484 552 29 95 129 224 263 294 377 433
#> [20] 434 453 513 50 91 182 214
Created on 2022-09-06 by the reprex package (v2.0.1)
Edit
The groups and the running total quantities can be assigned to the data as follows.
df$group <- f(df$quantity, thres)
df$total_quantity <- ave(df$quantity, df$group, FUN = cumsum)
head(df)
#> items quantity price group total_quantity
#> 1 D 70 731 0 70
#> 2 S 63 516 0 133
#> 3 N 22 710 0 155
#> 4 W 69 829 0 224
#> 5 K 11 887 0 235
#> 6 D 92 317 0 327
Edit 2
To assign only the total quantity per group use sum instead of cumsum.
df$total_quantity <- ave(df$quantity, df$group, FUN = sum)
My sample data looks like this:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurment4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID   Time (days)   Value
1     1435          1356
1     1405          1483
1     1374          1563
2     1848          943
2     1818          1173
2     1785          1300
3     185           1590
...   ...           ...
How would I get there? I have looked into converting from wide to long format, but the approaches I found do not seem to do the trick.
Kind regards, and thank you in advance.
Here is a base R option
u <- cbind(
data[1],
do.call(
rbind,
lapply(
split.default(data[-1], ceiling(seq_along(data[-1]) / 2)),
setNames,
c("Value", "Time")
)
)
)
out <- `row.names<-`(
subset(
x <- u[order(u$pid), ],
complete.cases(x)
), NULL
)
which gives
> out
pid Value Time
1 1 1356 1435
2 1 1483 1405
3 1 1563 1374
4 2 943 1848
5 2 1173 1818
6 2 1300 1785
7 3 1590 185
8 3 1585 294
9 4 130 72
10 4 140 82
11 4 220 126
12 4 166 159
13 4 380 189
14 4 353 231
15 4 180 268
16 4 571 334
17 4 443 70
18 4 266 124
19 4 213 156
20 4 583 173
21 4 510 222
22 4 596 303
23 4 476 145
24 4 656 217
25 4 816 289
26 4 136 79
27 4 756 89
28 4 703 128
29 4 776 166
30 4 586 203
31 4 526 240
32 4 580 278
33 4 483 371
An option with pivot_longer
library(dplyr)
library(tidyr)
names(data)[8] <- "measurement4"
data %>%
pivot_longer(cols = -pid, names_to = c('.value', 'grp'),
names_sep = "(?<=[a-z])(?=[0-9])", values_drop_na = TRUE) %>% select(-grp)
# A tibble: 33 x 3
# pid measurement Tdays
# <int> <int> <int>
# 1 1 1356 1435
# 2 1 1483 1405
# 3 1 1563 1374
# 4 2 943 1848
# 5 2 1173 1818
# 6 2 1300 1785
# 7 3 1590 185
# 8 3 1585 294
# 9 4 130 72
#10 4 443 70
# … with 23 more rows
I'm working with the following dataset:
animal protein herd sire dam
6 416 189.29 2 15 236
7 417 183.27 2 6 295
9 419 193.24 3 11 268
10 420 198.84 2 12 295
11 421 205.25 3 3 251
12 422 204.15 2 2 281
13 423 200.20 2 3 248
14 424 197.22 2 11 222
15 425 201.14 1 10 262
17 427 196.20 1 11 290
18 428 208.13 3 9 294
19 429 213.01 3 14 254
21 431 203.38 2 4 273
22 432 190.56 2 8 248
25 435 196.59 3 9 226
26 436 193.31 3 10 249
27 437 207.89 3 7 272
29 439 202.98 2 10 260
30 440 177.28 2 4 291
31 441 182.04 1 6 282
32 442 217.50 2 3 265
33 443 190.43 2 11 248
35 445 197.24 2 4 256
37 447 197.16 3 5 240
42 452 183.07 3 5 293
43 453 197.99 2 6 293
44 454 208.27 2 6 254
45 455 187.61 3 12 271
46 456 173.18 2 6 280
47 457 187.89 2 6 235
48 458 191.96 1 7 286
49 459 196.39 1 4 275
50 460 178.51 2 13 262
52 462 204.17 1 6 253
53 463 203.77 2 11 273
54 464 206.25 1 13 249
55 465 211.63 2 13 222
56 466 211.34 1 6 228
57 467 194.34 2 1 217
58 468 201.53 2 12 247
59 469 198.01 2 3 251
60 470 188.94 2 7 290
61 471 190.49 3 2 220
62 472 197.34 2 3 224
63 473 194.04 1 15 229
64 474 202.74 2 1 287
67 477 189.98 1 6 300
69 479 206.37 3 2 293
70 480 183.81 2 10 274
72 482 190.70 2 12 265
74 484 194.25 3 2 262
75 485 191.15 3 10 297
76 486 193.23 3 15 255
77 487 193.29 2 4 266
78 488 182.20 1 15 260
81 491 195.89 2 12 294
82 492 200.77 1 8 278
83 493 179.12 2 7 281
85 495 172.14 3 13 252
86 496 183.82 1 4 264
88 498 195.32 1 6 249
89 499 197.19 1 13 274
90 500 178.07 1 8 293
92 502 209.65 2 7 241
95 505 199.66 3 5 220
96 506 190.96 2 11 259
98 508 206.58 3 3 230
100 510 196.60 2 5 231
103 513 193.25 2 15 280
104 514 181.34 2 3 227
I'm interested in the animal indexes and their corresponding dam indexes. Using the table function I found that some dams are matched to several animals. This is the output I got:
217 220 222 224 226 227 228 229 230 231 235 236 240 241 247 248 249 251 252 253 254 255 256 259 260 262
1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 1 1 2 1 1 1 2 3
264 265 266 268 271 272 273 274 275 278 280 281 282 286 287 290 291 293 294 295 297 300
1 2 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 4 2 2 1 1
Using the length function I checked that there are only 48 distinct dams in this dataset.
I would like to reindex them with the integers 1, ..., 48 instead of the values given in my dataset. Is there a way to do this?
You can use match and unique.
df$index <- match(df$dam, unique(df$dam))
Or convert to factor and then integer
df$index <- as.integer(factor(df$dam))
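Another option is group_indices from dplyr. Passing the grouping column directly is deprecated since dplyr 1.0.0, so group the data first.
df$index <- dplyr::group_indices(dplyr::group_by(df, dam))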
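One thing to keep in mind when choosing between the first two options: they number the dams differently. match with unique numbers them in order of first appearance, while the factor route numbers them in sorted order. A small sketch on the first six dam values from the question:

```r
dam <- c(236, 295, 268, 295, 251, 281)   # first six dams from the dataset

# Index by order of first appearance
by_appearance <- match(dam, unique(dam))

# Index by sorted value (factor levels are sorted)
by_sorted <- as.integer(factor(dam))

data.frame(dam, by_appearance, by_sorted)
```

Both give each distinct dam one integer from 1 to the number of distinct dams, so either works if only the relabelling matters.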
We can use .GRP in data.table
library(data.table)
setDT(df)[, index := .GRP, dam]
> head(m)
X id1 q_following topic_followed topic_answered nfollowers nfollowing
1 1 1 80 80 100 180 180
2 2 1 76 76 95 171 171
3 3 1 72 72 90 162 162
4 4 1 68 68 85 153 153
5 5 1 64 64 80 144 144
6 6 1 60 60 75 135 135
> head(d)
X id1 q_following topic_followed topic_answered nfollowers nfollowing
1 1 1 63 735 665 949 146
2 2 1 89 737 666 587 185
3 3 1 121 742 670 428 264
4 4 1 277 750 706 622 265
5 5 1 339 765 734 108 294
6 6 1 363 767 766 291 427
matcher <- function(x,y){ return(na.omit(m[which(d[,y]==x),y])) }
max_matcher <- function(x) { return(sum(matcher(x,3:13))) }
result <- foreach(1:1000, function(x) {
if(max(max_matcher(1:1000)) == max_matcher(x)) return(x)
})
I want to compute result separately for each group of the data frame m, grouped by id1.
m %>% group_by(id1) %>% summarise(result) #doesn't work
by(m, m[,"id1"], result) #doesn't work
How should I proceed?
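A general observation that may help: by() and summarise() evaluate a function or expression once per group, whereas result here is an object that has already been computed on the whole data. One way forward is to wrap the computation in a function that takes a data frame and apply it to each piece. The sketch below uses a made-up table and a hypothetical per_group function as a stand-in for the real matching logic, which is not reproduced here.

```r
# Toy data with two groups (values made up, column names from the question)
m <- data.frame(
  id1         = c(1, 1, 2, 2, 2),
  q_following = c(80, 76, 72, 68, 64),
  nfollowers  = c(180, 171, 162, 153, 144)
)

# Hypothetical stand-in for the real computation: takes the rows of one group
per_group <- function(sub) {
  max(sub$q_following + sub$nfollowers)
}

# Split by id1 and apply the function to each sub-data-frame
res <- sapply(split(m, m$id1), per_group)
res

# by() does the same thing when it is given the function rather than a value
res_by <- by(m, m$id1, per_group)
```

The same function could be used inside dplyr with group_by(id1) followed by a summarise call that operates on the grouped columns.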