I have run a Fisher test on every row of my data, which produces a lot of p-values. How can I correctly attach the p-values to the original rows? I tried the following code, but the rows of the original data (d) do not line up with the p-values (e) in the merged data frame (f).
d <- read.table('test.txt', header = FALSE)
e <- apply(d, 1, function(x) fisher.test(matrix(x, nrow = 2), alternative = 'greater')$p.value)
f <- merge(d, as.data.frame(e), by.x = 0, by.y = 0)
> d
V1 V2 V3 V4
1 1 839 63 222247
2 1 839 47 222263
3 1 839 299 222011
4 6 834 1821 220489
5 1 839 198 222112
6 1 839 324 221986
7 2 838 808 221502
8 3 837 935 221375
9 4 836 1723 220587
10 1 839 117 22219
> e
[1] 2.144749e-01 1.656028e-01 6.776690e-01 6.848409e-01 5.280300e-01 7.067099e-01 8.091576e-01 6.859446e-01
[9] 8.895988e-01 3.592658e-01
> f
Row.names V1 V2 V3 V4 e
1 1 1 839 63 222247 2.144749e-01
2 10 1 839 117 222193 3.592658e-01
3 11 6 834 850 221460 1.071752e-01
4 12 29 811 11625 210685 9.941101e-01
5 13 2 838 1231 221079 9.463472e-01
6 14 1 839 1236 221074 9.907043e-01
7 15 3 837 905 221405 6.647785e-01
8 16 3 837 793 221517 5.768163e-01
9 17 6 834 687 221623 4.906665e-02
10 18 1 839 226 222084 5.753710e-01
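merge() with by.x = 0 and by.y = 0 joins on row names and sorts the result by them as character strings, which is why the rows come back in the order 1, 10, 11, ... Each row is still paired with its own p-value; only the order changes. Because apply() returns the p-values in the same order as the rows of d, cbind() is enough:
f <- cbind(d, e)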
# V1 V2 V3 V4 e
#1 1 839 63 222247 0.2144749
#2 1 839 47 222263 0.1656028
#3 1 839 299 222011 0.6776690
#4 6 834 1821 220489 0.6848409
#5 1 839 198 222112 0.5280300
#6 1 839 324 221986 0.7067099
#7 2 838 808 221502 0.8091576
#8 3 837 935 221375 0.6859446
#9 4 836 1723 220587 0.8895988
#10 1 839 117 22219 0.9873172
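The ordering seen in the merged data frame can be reproduced on a toy example. This is a minimal sketch with made-up values (`toy`, `p`, `m`, and `m_sorted` are hypothetical names, not from the question): merging on row names gives a `Row.names` column that is sorted as text, and reordering by its numeric value restores the original row order.

```r
# Toy data: 12 rows, so row names "10", "11", "12" sort before "2" as text.
toy <- data.frame(x = 1:12)
p <- data.frame(p = (1:12) / 100)

# Merge on row names (by = 0); the result is sorted by Row.names as character.
m <- merge(toy, p, by = 0)
head(as.character(m$Row.names), 4)

# Each row still carries its own p-value; reorder numerically to undo the sort.
m_sorted <- m[order(as.integer(as.character(m$Row.names))), ]
all(m_sorted$p == m_sorted$x / 100)
```

If the row order matters and no real key column is involved, cbind() avoids the issue altogether.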
I'm trying to use rBLAST for protein sequences, but it doesn't work. It works fine for nucleotide sequences, but for proteins it doesn't return any match (I used a sequence from the searched dataset, so there must be a match). The package description says "This includes interfaces to blastn, blastp, blastx...", but the help file in RStudio says "Description Execute blastn from blast+". Has anybody run rBLAST for proteins?
Here's what I ran:
listF<-list.files("Trich_prot_fasta/")
fa<-paste0("Trich_prot_fasta/",listF[i])
makeblastdb(fa, dbtype = "prot", args="")
bl <- blast("Trich_prot_fasta/Tri5640_1_GeneModels_FilteredModels1_aa.fasta", type="blastp")
seq <- readAAStringSet("NDRkinase/testSeq.txt")
cl <- predict(bl, seq)
Result:
> cl <- predict(bl, seq)
Warning message: In predict.BLAST(bl, seq) : BLAST did not return a
match!
Tried to reproduce the error but everything worked as expected on my system (macOS BigSur 11.6 / R v4.1.1 / Rstudio v1.4.1717).
Given your blastn was successful, perhaps you are combining multiple fasta files for your protein fasta reference database? If that's the case, try concatenating them together and use the path to the file instead of an R object ("fa") when making your blastdb. Or perhaps:
makeblastdb(file = "Trich_prot_fasta/Tri5640_1_GeneModels_FilteredModels1_aa.fasta", dbtype = "prot")
Instead of:
makeblastdb(fa, dbtype = "prot", args="")
Also, please edit your question to include the output from sessionInfo() (might help narrow things down).
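If several protein FASTA files do need to be combined into one reference, the concatenation suggested above could be sketched in base R as follows. The directory, file names, and sequences here are made up for illustration; only the final commented-out makeblastdb() call refers to rBLAST, and it is not run.

```r
# Hypothetical input: two small protein FASTA files written to a temp directory.
fasta_dir <- file.path(tempdir(), "prot_fasta_demo")
dir.create(fasta_dir, showWarnings = FALSE)
writeLines(c(">seqA", "MKT"), file.path(fasta_dir, "a.fasta"))
writeLines(c(">seqB", "MLV"), file.path(fasta_dir, "b.fasta"))

# List the inputs first, then append each one to a single combined file.
inputs <- list.files(fasta_dir, pattern = "\\.fasta$", full.names = TRUE)
combined <- file.path(tempdir(), "combined_demo.fasta")
if (file.exists(combined)) file.remove(combined)
file.create(combined)
for (f in inputs) file.append(combined, f)

readLines(combined)
# The combined path is what would then be passed on, e.g.
# makeblastdb(file = combined, dbtype = "prot")
```

Writing the combined file outside the input directory keeps it from being picked up by list.files() on a second run.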
library(tidyverse)
#BiocManager::install("Biostrings")
#devtools::install_github("mhahsler/rBLAST")
library(rBLAST)
# Download an example fasta file:
# https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000001542/UP000001542_5722.fasta.gz
# Grab the first fasta sequence as "example_sequence.fasta"
listF <- list.files("~/Downloads/Trich_example", full.names = TRUE)
listF
#> [1] "~/Downloads/Trich_example/UP000001542_5722.fasta"
#> [2] "~/Downloads/Trich_example/example_sequence.fasta"
makeblastdb(file = "~/Downloads/Trich_example/UP000001542_5722.fasta", dbtype = "prot")
bl <- blast("~/Downloads/Trich_example/UP000001542_5722.fasta", type = "blastp")
seq <- readAAStringSet("~/Downloads/Trich_example/example_sequence.fasta")
cl <- predict(bl, seq)
cl
#> QueryID SubjectID Perc.Ident Alignment.Length
#> 1 Example_sequence_1 tr|A2D8A1|A2D8A1_TRIVA 100.000 694
#> 2 Example_sequence_1 tr|A2E4L0|A2E4L0_TRIVA 64.553 694
#> 3 Example_sequence_1 tr|A2E4L0|A2E4L0_TRIVA 32.436 669
#> 4 Example_sequence_1 tr|A2D899|A2D899_TRIVA 64.344 488
#> 5 Example_sequence_1 tr|A2D899|A2D899_TRIVA 31.004 458
#> 6 Example_sequence_1 tr|A2D899|A2D899_TRIVA 27.070 314
#> 7 Example_sequence_1 tr|A2D898|A2D898_TRIVA 54.915 468
#> 8 Example_sequence_1 tr|A2D898|A2D898_TRIVA 33.691 653
#> 9 Example_sequence_1 tr|A2D898|A2D898_TRIVA 32.936 671
#> 10 Example_sequence_1 tr|A2D898|A2D898_TRIVA 29.969 654
#> 11 Example_sequence_1 tr|A2D898|A2D898_TRIVA 26.694 487
#> 12 Example_sequence_1 tr|A2D898|A2D898_TRIVA 25.000 464
#> 13 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 39.106 716
#> 14 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 30.724 677
#> 15 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 29.257 417
#> 16 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 23.438 640
#> 17 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 22.981 718
#> 18 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 24.107 112
#> 19 Example_sequence_1 tr|A2FI39|A2FI39_TRIVA 33.378 740
#> 20 Example_sequence_1 tr|A2FI39|A2FI39_TRIVA 31.440 722
#> Mismatches Gap.Openings Q.start Q.end S.start S.end E Bits
#> 1 0 0 1 694 1 694 0.00e+00 1402.0
#> 2 243 2 1 692 163 855 0.00e+00 920.0
#> 3 410 15 22 671 1 646 3.02e-94 312.0
#> 4 173 1 205 692 1 487 0.00e+00 644.0
#> 5 308 7 22 476 1 453 3.55e-55 198.0
#> 6 196 5 13 294 173 485 4.12e-25 110.0
#> 7 211 0 1 468 683 1150 8.48e-169 514.0
#> 8 420 11 2 647 501 1147 1.61e-91 309.0
#> 9 396 10 2 666 363 985 5.78e-89 301.0
#> 10 406 11 16 664 195 801 1.01e-66 238.0
#> 11 297 10 208 662 21 479 1.60e-36 147.0
#> 12 316 7 11 469 29 465 3.04e-36 147.0
#> 13 386 4 2 667 248 963 1.72e-149 461.0
#> 14 411 10 2 625 66 737 8.34e-83 283.0
#> 15 286 5 129 542 14 424 2.66e-52 196.0
#> 16 421 15 5 607 365 972 3.07e-38 152.0
#> 17 407 21 77 662 27 730 1.25e-33 138.0
#> 18 81 3 552 661 3 112 2.10e-01 35.4
#> 19 421 9 3 675 394 1128 1.12e-115 375.0
#> 20 409 15 2 647 163 874 1.21e-82 285.0
...
Created on 2021-09-30 by the reprex package (v2.0.1)
I am sure this is a super easy answer but I am struggling with how to add a column with two different variables to my dataframe. Currently, this is what it looks like
vcv.index model.index par.index grid index estimate se lcl ucl fixed
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157
5 10 10 20 A 20 0.7575811 0.05033490 0.6461758 0.8424612
6 21 21 61 B 61 0.8713467 0.07638687 0.6404598 0.9626184
7 22 22 62 B 62 0.6074379 0.06881230 0.4677827 0.7314827
8 23 23 63 B 63 0.6041054 0.06107520 0.4805279 0.7156792
9 24 24 64 B 64 0.5806565 0.06927308 0.4422237 0.7074601
10 25 25 65 B 65 0.7370944 0.05892108 0.6070620 0.8357394
11 41 41 121 C 121 0.8048479 0.09684385 0.5519097 0.9324759
12 42 42 122 C 122 0.5259547 0.07165218 0.3871380 0.6608721
13 43 43 123 C 123 0.5427100 0.07127273 0.4033255 0.6757137
14 44 44 124 C 124 0.5168820 0.06156392 0.3975561 0.6343132
15 45 45 125 C 125 0.6550049 0.07378403 0.5002851 0.7826343
16 196 196 586 A 586 0.8536314 0.08709394 0.5979992 0.9580976
17 197 197 587 A 587 0.5672194 0.07079508 0.4268452 0.6975725
18 198 198 588 A 588 0.5675415 0.06380445 0.4408540 0.6859714
19 199 199 589 A 589 0.5666874 0.06499899 0.4377071 0.6872233
20 200 200 590 A 590 0.7058542 0.05985868 0.5769484 0.8085177
21 211 211 631 B 631 0.8360614 0.09413427 0.5703031 0.9514472
22 212 212 632 B 632 0.5432872 0.07906200 0.3891364 0.6895701
23 213 213 633 B 633 0.5400994 0.06497607 0.4129055 0.6622759
24 214 214 634 B 634 0.5161692 0.06292706 0.3943257 0.6361202
25 215 215 635 B 635 0.6821667 0.07280044 0.5263841 0.8056298
26 226 226 676 C 676 0.7621875 0.10484478 0.5077465 0.9087471
27 227 227 677 C 677 0.4607440 0.07326970 0.3240229 0.6036386
28 228 228 678 C 678 0.4775168 0.08336433 0.3219349 0.6375872
29 229 229 679 C 679 0.4517655 0.06393339 0.3319262 0.5774725
30 230 230 680 C 680 0.5944330 0.07210672 0.4491995 0.7248303
Then I add a column with periods 1-5, repeated until the end of the data frame, using this code:
SurJagPred$estimates %<>% mutate(Primary = rep(1:5, 6))
I also need to add a sex column (F, M): rows 1-15 are female and rows 16-30 are male. Overall it should look like this:
vcv.index model.index par.index grid index estimate se lcl ucl fixed Primary Sex
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751 1 F
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014 2 F
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169 3 F
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157 4 F
We can use rep() with the each argument to repeat every element of a vector a given number of times, here 15 times for "F" followed by 15 times for "M":
SurJagPred$estimates %<>%
mutate(Sex = rep(c("F", "M"), each = 15))
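The difference between the two forms of rep() used in this thread can be seen on a small stand-in data frame. This is a minimal sketch: `est` is a hypothetical placeholder for SurJagPred$estimates with made-up contents, and base assignment is used instead of %<>% so that no extra packages are needed.

```r
# `times` repeats the whole vector; `each` repeats every element in turn.
primary <- rep(1:5, times = 6)        # 1 2 3 4 5 1 2 3 4 5 ...
sex <- rep(c("F", "M"), each = 15)    # 15 "F" followed by 15 "M"

# Toy stand-in for SurJagPred$estimates (30 rows, values are made up).
est <- data.frame(index = 1:30)
est$Primary <- primary
est$Sex <- sex

table(est$Sex)
est[c(1, 15, 16, 30), ]
```

Both vectors must have exactly as many elements as the data frame has rows, which is why 5 x 6 and 2 x 15 are used for 30 rows.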
I have a dataset that looks like this:
C_ID I_ID Loan R1 Prot_id Collateral R2 maxRank
1 A c 341 1 p1 506 1 3
2 A c 341 1 p2 366 2 3
3 A c 341 1 p3 263 3 3
4 A a 689 2 p1 506 1 3
5 A a 689 2 p2 366 2 3
6 A a 689 2 p3 263 3 3
7 A d 720 3 p1 506 1 3
8 A d 720 3 p2 366 2 3
9 A d 720 3 p3 263 3 3
10 A b 334 4 p1 506 1 3
11 A b 334 4 p2 366 2 3
12 A b 334 4 p3 263 3 3
13 A e 752 5 p1 506 1 3
14 A e 752 5 p2 366 2 3
15 A e 752 5 p3 263 3 3
16 B h 193 1 p5 529 1 2
17 B h 193 1 p4 414 2 2
18 B g 494 2 p5 529 1 2
19 B g 494 2 p4 414 2 2
20 B f 227 3 p5 529 1 2
21 B f 227 3 p4 414 2 2
22 B j 785 4 p5 529 1 2
23 B j 785 4 p4 414 2 2
24 B i 371 5 p5 529 1 2
25 B i 371 5 p4 414 2 2
26 B k 395 6 p5 529 1 2
27 B k 395 6 p4 414 2 2
Here R1 is the ranking of each loan within its contract ID (C_ID) group, and R2 is the ranking of each collateral (the Collateral column, called Prot_value below) under the same contract ID. What is needed is:
C_ID I_ID Loan R1 Prot_id Prot_value R2 maxRank PreAllocation Allocation PostAllocation Residual
1 A c 341 1 p1 506 1 3 341 341 0 165
2 A c 341 1 p2 366 2 3 0 0 0 366
3 A c 341 1 p3 263 3 3 0 0 0 263
4 A a 689 2 p1 506 1 3 689 165 524 0
5 A a 689 2 p2 366 2 3 524 366 158 0
6 A a 689 2 p3 263 3 3 158 158 0 105
7 A d 720 3 p1 506 1 3 720 0 720 0
8 A d 720 3 p2 366 2 3 720 0 720 0
9 A d 720 3 p3 263 3 3 720 105 615 0
10 A b 334 4 p1 506 1 3 334 0 334 0
11 A b 334 4 p2 366 2 3 334 0 334 0
12 A b 334 4 p3 263 3 3 334 0 334 0
13 A e 752 5 p1 506 1 3 752 0 752 0
14 A e 752 5 p2 366 2 3 752 0 752 0
15 A e 752 5 p3 263 3 3 752 0 752 0
16 B h 193 1 p5 529 1 2 193 193 0 336
17 B h 193 1 p4 414 2 2 0 0 0 414
18 B g 494 2 p5 529 1 2 494 336 158 0
19 B g 494 2 p4 414 2 2 158 158 0 256
20 B f 227 3 p5 529 1 2 227 0 227 0
21 B f 227 3 p4 414 2 2 227 227 0 29
22 B j 785 4 p5 529 1 2 785 0 785 0
23 B j 785 4 p4 414 2 2 785 29 756 0
24 B i 371 5 p5 529 1 2 371 0 371 0
25 B i 371 5 p4 414 2 2 371 0 371 0
26 B k 395 6 p5 529 1 2 395 0 395 0
27 B k 395 6 p4 414 2 2 395 0 395 0
Only the Allocation column is important; the other columns are just intermediate steps to arrive at it. I was able to compute it with a loop, as below:
df3 <- as.data.frame(df3)
df3$PreAllocation <- 0
df3$Allocation <- 0
df3$PostAllocation <- 0
df3$Residual <- 0
for (i in 1:nrow(df3)) {
  df3$PreAllocation[i] <- ifelse(df3$R2[i] == 1, df3$Loan[i], df3$PostAllocation[i - 1])
  df3$Allocation[i] <- ifelse(df3$R1[i] > 1,
                              min(df3$Residual[i - df3$maxRank[i]], df3$PreAllocation[i]),
                              min(df3$PreAllocation[i], df3$Prot_value[i]))
  df3$PostAllocation[i] <- df3$PreAllocation[i] - df3$Allocation[i]
  df3$Residual[i] <- ifelse(df3$R1[i] == 1,
                            df3$Prot_value[i] - df3$Allocation[i],
                            df3$Residual[i - df3$maxRank[i]] - df3$Allocation[i])
}
However, when the dataset is big there are performance issues. I have been trying to get the same result using the apply functions, rowwise + transform, etc., but could not, because:
1. The columns are interdependent.
2. Later rows need a dynamic lag (based on maxRank) of columns that are themselves being generated in the calculation.
Any suggestions? Thanks.
It looks like a loop was used so you could look at the value in the previous row. Below is a solution using dplyr, which has a function lag() (and lead()) which let you look at previous (or successive) rows. It is also using pmin() (there's also a pmax()), which takes the min/max between corresponding elements in a set of vectors.
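# `mutate()` takes data and one or more LHS = RHS statements about the data.
# Each column on the LHS is created (or overwritten) with the logic on the
# RHS. Conveniently, we don't have to prefix each column with `df3$`.
# Uses dplyr's `if_else()` instead of base R's `ifelse()`.
# `lag(x, n = 1)` looks at the previous row's value of x.
# `pmin()` is the element-wise minimum; run `pmin(1:5, 5:1)` for a simple
# example of how it works.
# Caveat: each expression is evaluated over the whole column at once, so
# `lag(PostAllocation)` and `lag(Residual)` read the values those columns held
# before this call, not values computed row by row, and `lag()` documents `n`
# as a single integer. Compare the result with the loop on a small sample.
library(dplyr)
df3 <- mutate(df3,
  PreAllocation = if_else(R2 == 1, Loan, lag(PostAllocation, n = 1)),
  Allocation = if_else(R1 > 1, pmin(lag(Residual, n = maxRank), PreAllocation),
                       pmin(PreAllocation, Prot_value)),
  PostAllocation = PreAllocation - Allocation,
  Residual = if_else(R1 == 1, Prot_value - Allocation, lag(Residual, n = maxRank) - Allocation)
)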
I encourage you to look at the dplyr CRAN page and the "Introduction to dplyr" vignette for further information.
If you'd like a syntax which is closer to base R's for subsetting & assignment, you might also consider the data.table package.
These are very popular frameworks for data manipulation and aggregation.
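Because each row depends on values computed for earlier rows, one alternative is to keep the loop but run it over plain vectors instead of assigning into the data frame element by element, which is often considerably cheaper in R. The following is a minimal sketch under that assumption, not a benchmarked solution. `allocate()` and `demo` are hypothetical names, and the rows are assumed to be sorted by C_ID, R1, R2 as in the example.

```r
# Same recursion as the question's loop, but on vectors extracted once.
allocate <- function(df) {
  n <- nrow(df)
  loan <- df$Loan; prot <- df$Prot_value
  r1 <- df$R1; r2 <- df$R2; k <- df$maxRank
  pre <- alloc <- post <- resid <- numeric(n)
  for (i in seq_len(n)) {
    # First collateral of a loan starts from the full loan amount.
    pre[i] <- if (r2[i] == 1) loan[i] else post[i - 1]
    if (r1[i] == 1) {
      alloc[i] <- min(pre[i], prot[i])
      resid[i] <- prot[i] - alloc[i]
    } else {
      # Residual left on the same collateral by the previous loan.
      prev <- resid[i - k[i]]
      alloc[i] <- min(prev, pre[i])
      resid[i] <- prev - alloc[i]
    }
    post[i] <- pre[i] - alloc[i]
  }
  df$PreAllocation <- pre
  df$Allocation <- alloc
  df$PostAllocation <- post
  df$Residual <- resid
  df
}

# First nine rows of group A from the question.
demo <- data.frame(
  Loan = rep(c(341, 689, 720), each = 3), R1 = rep(1:3, each = 3),
  Prot_value = rep(c(506, 366, 263), 3), R2 = rep(1:3, 3), maxRank = 3
)
res <- allocate(demo)
res$Allocation
```

The four result columns are written back to the data frame once at the end, so the loop body only touches numeric vectors.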
I did an RFM analysis using the package "rfm". The results are in a tibble and I can't figure out how to export them to .csv. I tried the call below, but it exported a blank file.
> dim(bmdata4RFM)
[1] 1182580 3
> str(bmdata4RFM)
'data.frame': 1182580 obs. of 3 variables:
$ customer_ID: num 0 0 0 0 0 0 0 0 0 0 ...
$ sales_date : Factor w/ 366 levels "1/1/2018 0:00:00",..: 267 275 286 297 300 301 302 303 304 305 ...
$ sales : num 101541 110543 60932 75472 43588 ...
> head(bmdata4RFM,5)
customer_ID sales_date sales
1 0 6/30/2017 0:00:00 101540.70
2 0 7/1/2017 0:00:00 110543.35
3 0 7/2/2017 0:00:00 60932.20
4 0 7/3/2017 0:00:00 75471.93
5 0 7/4/2017 0:00:00 43587.70
> library(rfm)
> # convert date from factor to date format
> bmdata4RFM[,2] <- as.Date(as.character(bmdata4RFM[,2]), format = "%m/%d/%Y")
> rfm_result_v2
# A tibble: 535,868 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<dbl> <date> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
1 0 2018-06-30 12 366 42462470. 5 5 5 555
2 1 2018-06-30 12 20 2264. 5 5 5 555
3 2 2018-01-12 181 24 1689 3 5 5 355
4 3 2018-05-04 69 27 1984. 4 5 5 455
5 6 2017-12-07 217 12 922. 2 5 5 255
6 7 2018-01-15 178 19 1680. 3 5 5 355
7 9 2018-01-05 188 19 2106 2 5 5 255
8 20 2018-04-11 92 4 414. 4 5 5 455
9 26 2018-02-10 152 1 72 3 1 2 312
10 48 2017-12-20 204 1 90 2 1 3 213
11 68 2017-09-30 285 1 37 1 1 1 111
12 70 2017-12-17 207 1 18 2 1 1 211
13 104 2017-08-11 335 1 90 1 1 3 113
14 120 2017-07-27 350 1 19 1 1 1 111
15 134 2018-01-13 180 1 275 3 1 4 314
16 153 2018-06-24 18 10 1677 5 5 5 555
17 155 2018-05-28 45 1 315 5 1 4 514
18 171 2018-06-11 31 6 3485. 5 5 5 555
19 172 2018-05-24 49 1 93 5 1 3 513
20 174 2018-06-06 36 3 347. 5 4 5 545
# ... with 535,858 more rows
> write.csv(rfm_result_v2,"bmdataRFMFunction_output071218v2.csv")
The problem seems to be that the result of rfm_table_order is not just a tibble. Using the data from a related, already answered question, you can see this:
> class(rfm_result)
[1] "rfm_table_order" "tibble" "data.frame"
So if, for example, you select the rfm element:
> rfm_result$rfm
# A tibble: 325 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<int> <date> <dbl> <dbl> <int> <int> <int> <int> <dbl>
1 1 2017-08-06 353 1 145 4 1 2 412
2 2 2016-10-15 648 1 268 2 1 3 213
3 5 2016-12-14 588 1 119 3 1 1 311
4 7 2017-04-27 454 1 290 3 1 3 313
5 8 2016-12-07 595 3 835 2 5 5 255
6 10 2017-07-31 359 1 192 4 1 2 412
7 11 2017-08-16 343 1 278 4 1 3 413
8 12 2017-10-14 284 2 294 5 4 3 543
9 15 2016-07-12 743 1 206 2 1 2 212
10 17 2017-05-22 429 2 405 4 4 4 444
# ... with 315 more rows
You can export it with this command:
write.table(rfm_result$rfm , file = "your_path\\df.csv")
OP asks for CSV output.
Being very picky, write.table(rfm_result$rfm , file = "your_path\\df.csv") does not create a CSV: the default separator of write.table() is a single space.
If you want a CSV, add the sep="," argument; you will likely also want to omit the row names, so use row.names=FALSE as well.
write.table(rfm_result$rfm , file = "your_path\\df.csv", sep=",", row.names=FALSE)
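write.csv() is a wrapper around write.table() that already uses a comma as the separator, so the same export can be written with it. A minimal sketch on a toy data frame follows; `scores` and the temporary file name are made up and merely stand in for rfm_result$rfm and a real output path.

```r
# Toy stand-in for rfm_result$rfm (values are made up).
scores <- data.frame(customer_id = c(1, 2, 5), rfm_score = c(412, 213, 311))

out <- file.path(tempdir(), "rfm_demo.csv")
write.csv(scores, out, row.names = FALSE)

# Inspect the raw lines, then read the file back to confirm it round-trips.
readLines(out)
back <- read.csv(out)
back
```

Reading the file back with read.csv() is a quick way to check that the export is not empty and that the columns survived.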
I have a data frame below and I want to find the average row value for all columns with header *R and all columns with *G.
The output should then be four columns: Rfam, Classes, avg.rowR, avg.rowG
I was playing around with the rowMeans() function, but I am not sure how to specify the columns.
Rfam Classes 26G 26R 35G 35R 46G 46R 48G 48R 55G 55R
5_8S_rRNA rRNA 63 39 8 27 26 17 28 43 41 17
5S_rRNA rRNA 171 149 119 109 681 47 95 161 417 153
7SK 7SK 53 282 748 371 248 42 425 384 316 198
ACA64 Other 7 8 19 2 10 1 36 10 10 4
let-7 miRNA 121825 73207 25259 75080 54301 63510 30444 53800 78961 47533
lin-4 miRNA 10149 16263 5629 19680 11297 37866 3816 9677 11713 10068
Metazoa_SRP SRP 317 1629 1008 418 1205 407 1116 1225 1413 1075
mir-1 miRNA 3 4 1 2 0 26 1 1 0 4
mir-10 miRNA 912163 1411287 523793 1487160 517017 1466085 107597 551381 727720 788201
mir-101 miRNA 461 320 199 553 174 460 278 297 256 254
mir-103 miRNA 937 419 202 497 318 217 328 343 891 439
mir-1180 miRNA 110 32 4 17 53 47 6 29 35 22
mir-1226 miRNA 11 3 0 3 6 0 1 2 5 4
mir-1237 miRNA 3 2 1 1 0 1 0 2 1 1
mir-1249 miRNA 5 14 2 9 4 5 9 5 7 7
newcols <- sapply(c("R$", "G$"), function(x) rowMeans(df[grep(x, names(df))]))
setNames(cbind(df[1:2], newcols), c(names(df)[1:2], "avg.rowR", "avg.rowG"))
# Rfam Classes avg.rowR avg.rowG
# 1 5_8S_rRNA rRNA 28.6 33.2
# 2 5S_rRNA rRNA 123.8 296.6
# 3 7SK 7SK 255.4 358.0
# 4 ACA64 Other 5.0 16.4
# 5 let-7 miRNA 62626.0 62158.0
# 6 lin-4 miRNA 18710.8 8520.8
# 7 Metazoa_SRP SRP 950.8 1011.8
# 8 mir-1 miRNA 7.4 1.0
# 9 mir-10 miRNA 1140822.8 557658.0
# 10 mir-101 miRNA 376.8 273.6
# 11 mir-103 miRNA 383.0 535.2
# 12 mir-1180 miRNA 29.4 41.6
# 13 mir-1226 miRNA 2.4 4.6
# 14 mir-1237 miRNA 1.4 1.0
# 15 mir-1249 miRNA 8.0 5.4
One way to look for patterns in column names is to use the grep family of functions. The call grep("R$", names(df)) returns the indices of all column names that end with R. Using it inside sapply lets us handle the R and G columns in one expression.
The core of the second line is cbind(df[1:2], newcols), which binds the first two columns of df to the two new columns of mean values. Wrapping it in setNames(..., c(names(df)[1:2], ...)) sets the column names to match your desired output.
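The column selection can be checked step by step on a smaller table. This is a minimal sketch with made-up numbers: `df_demo`, `g_cols`, `r_cols`, and `avg` are hypothetical names, and check.names = FALSE is only there so that the toy column names can start with a digit.

```r
# Toy table with two G columns and two R columns (values are made up).
df_demo <- data.frame(
  Rfam = c("a", "b"),
  `26G` = c(2, 4), `26R` = c(10, 20),
  `35G` = c(4, 8), `35R` = c(30, 40),
  check.names = FALSE
)

# "$" anchors the pattern at the end of the name.
g_cols <- grep("G$", names(df_demo))   # positions of columns ending in G
r_cols <- grep("R$", names(df_demo))   # positions of columns ending in R

avg <- data.frame(
  Rfam = df_demo$Rfam,
  avg.rowR = rowMeans(df_demo[r_cols]),
  avg.rowG = rowMeans(df_demo[g_cols])
)
avg
```

Rfam is not matched by "R$" because the pattern is anchored to the end of the name, not the start.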