I'm trying to use rBLAST for protein sequences, but it doesn't work. It works fine for nucleotide sequences, but for proteins it doesn't return any match (I used a sequence from the searched dataset, so there must be at least one match). The package description says "This includes interfaces to blastn, blastp, blastx...", but the help file in RStudio says "Description Execute blastn from blast+". Has anybody run rBLAST for proteins?
Here's what I ran:
listF<-list.files("Trich_prot_fasta/")
fa<-paste0("Trich_prot_fasta/",listF[i])
makeblastdb(fa, dbtype = "prot", args="")
bl <- blast("Trich_prot_fasta/Tri5640_1_GeneModels_FilteredModels1_aa.fasta", type="blastp")
seq <- readAAStringSet("NDRkinase/testSeq.txt")
cl <- predict(bl, seq)
Result:
> cl <- predict(bl, seq)
Warning message: In predict.BLAST(bl, seq) : BLAST did not return a
match!
I tried to reproduce the error, but everything worked as expected on my system (macOS Big Sur 11.6 / R v4.1.1 / RStudio v1.4.1717).
Given your blastn was successful, perhaps you are combining multiple fasta files for your protein fasta reference database? If that's the case, try concatenating them together and use the path to the file instead of an R object ("fa") when making your blastdb. Or perhaps:
makeblastdb(file = "Trich_prot_fasta/Tri5640_1_GeneModels_FilteredModels1_aa.fasta", dbtype = "prot")
Instead of:
makeblastdb(fa, dbtype = "prot", args="")
Also, please edit your question to include the output from sessionInfo() (might help narrow things down).
library(tidyverse)
#BiocManager::install("Biostrings")
#devtools::install_github("mhahsler/rBLAST")
library(rBLAST)
# Download an example fasta file:
# https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000001542/UP000001542_5722.fasta.gz
# Grab the first fasta sequence as "example_sequence.fasta"
listF <- list.files("~/Downloads/Trich_example", full.names = TRUE)
listF
#> [1] "~/Downloads/Trich_example/UP000001542_5722.fasta"
#> [2] "~/Downloads/Trich_example/example_sequence.fasta"
makeblastdb(file = "~/Downloads/Trich_example/UP000001542_5722.fasta", dbtype = "prot")
bl <- blast("~/Downloads/Trich_example/UP000001542_5722.fasta", type = "blastp")
seq <- readAAStringSet("~/Downloads/Trich_example/example_sequence.fasta")
cl <- predict(bl, seq)
cl
#> QueryID SubjectID Perc.Ident Alignment.Length
#> 1 Example_sequence_1 tr|A2D8A1|A2D8A1_TRIVA 100.000 694
#> 2 Example_sequence_1 tr|A2E4L0|A2E4L0_TRIVA 64.553 694
#> 3 Example_sequence_1 tr|A2E4L0|A2E4L0_TRIVA 32.436 669
#> 4 Example_sequence_1 tr|A2D899|A2D899_TRIVA 64.344 488
#> 5 Example_sequence_1 tr|A2D899|A2D899_TRIVA 31.004 458
#> 6 Example_sequence_1 tr|A2D899|A2D899_TRIVA 27.070 314
#> 7 Example_sequence_1 tr|A2D898|A2D898_TRIVA 54.915 468
#> 8 Example_sequence_1 tr|A2D898|A2D898_TRIVA 33.691 653
#> 9 Example_sequence_1 tr|A2D898|A2D898_TRIVA 32.936 671
#> 10 Example_sequence_1 tr|A2D898|A2D898_TRIVA 29.969 654
#> 11 Example_sequence_1 tr|A2D898|A2D898_TRIVA 26.694 487
#> 12 Example_sequence_1 tr|A2D898|A2D898_TRIVA 25.000 464
#> 13 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 39.106 716
#> 14 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 30.724 677
#> 15 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 29.257 417
#> 16 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 23.438 640
#> 17 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 22.981 718
#> 18 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 24.107 112
#> 19 Example_sequence_1 tr|A2FI39|A2FI39_TRIVA 33.378 740
#> 20 Example_sequence_1 tr|A2FI39|A2FI39_TRIVA 31.440 722
#> Mismatches Gap.Openings Q.start Q.end S.start S.end E Bits
#> 1 0 0 1 694 1 694 0.00e+00 1402.0
#> 2 243 2 1 692 163 855 0.00e+00 920.0
#> 3 410 15 22 671 1 646 3.02e-94 312.0
#> 4 173 1 205 692 1 487 0.00e+00 644.0
#> 5 308 7 22 476 1 453 3.55e-55 198.0
#> 6 196 5 13 294 173 485 4.12e-25 110.0
#> 7 211 0 1 468 683 1150 8.48e-169 514.0
#> 8 420 11 2 647 501 1147 1.61e-91 309.0
#> 9 396 10 2 666 363 985 5.78e-89 301.0
#> 10 406 11 16 664 195 801 1.01e-66 238.0
#> 11 297 10 208 662 21 479 1.60e-36 147.0
#> 12 316 7 11 469 29 465 3.04e-36 147.0
#> 13 386 4 2 667 248 963 1.72e-149 461.0
#> 14 411 10 2 625 66 737 8.34e-83 283.0
#> 15 286 5 129 542 14 424 2.66e-52 196.0
#> 16 421 15 5 607 365 972 3.07e-38 152.0
#> 17 407 21 77 662 27 730 1.25e-33 138.0
#> 18 81 3 552 661 3 112 2.10e-01 35.4
#> 19 421 9 3 675 394 1128 1.12e-115 375.0
#> 20 409 15 2 647 163 874 1.21e-82 285.0
...
Created on 2021-09-30 by the reprex package (v2.0.1)
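One thing that may be worth checking before calling blast() is whether the protein database files actually sit next to the fasta file you pass in. The sketch below assumes makeblastdb was run with its default output name (the fasta path itself) and that the database was not split into volumes; protein_db_files is a hypothetical helper, not part of rBLAST, and the temporary files only stand in for a real database.

```r
# Hypothetical helper (not part of rBLAST): report which of the core
# protein database files (.phr, .pin, .psq) exist next to a fasta file.
protein_db_files <- function(fasta_path) {
  exts <- c("phr", "pin", "psq")
  paths <- paste(fasta_path, exts, sep = ".")
  setNames(file.exists(paths), exts)
}

# Demonstration in a temporary directory standing in for a real database folder
tmp <- tempfile("blastdb_")
dir.create(tmp)
fa <- file.path(tmp, "proteins.fasta")
invisible(file.create(fa))

# No database files yet, as if makeblastdb had not been run on this file
before <- protein_db_files(fa)

# Empty placeholder files mimic what makeblastdb(fa, dbtype = "prot") leaves behind
invisible(file.create(paste(fa, c("phr", "pin", "psq"), sep = ".")))
after <- protein_db_files(fa)

print(before)
print(after)
```

If any entry is FALSE for the file given to blast(), the database was probably built for a different file or with dbtype = "nucl".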
I have a dataframe containing location data of different animals. Each animal has a unique id and each observation has a time stamp and some further metrics of the location observation. See a subset of the data below. The subset contains the first two observations of each id.
> sub
id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
2 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
3 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
4 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
5 333 B -80.8211 24.8441 11625 6980 37 2018-12-17 20:45:05
6 333 3 -80.8137 24.8263 155 100 69 2018-12-17 21:00:43
7 444 3 -80.4535 25.0848 501 33 104 2019-10-20 19:44:16
8 444 1 -80.8086 24.8364 6356 126 87 2020-01-18 20:32:28
9 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17
10 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35
11 666 2 -77.7221 24.4902 1129 75 66 2020-07-12 21:09:02
12 666 2 -77.7097 24.4905 314 248 164 2020-07-12 21:11:37
13 777 3 -77.7133 24.4820 406 58 110 2020-06-20 11:18:18
14 777 3 -77.7218 24.4844 170 93 107 2020-06-20 11:51:06
15 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
16 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
However, I need to do some data housekeeping, i.e. I need to include the day/time and location each animal was released. And after that I need to filter out observations for each animal that occurred pre-release of the corresponding animal.
I have an additional dataframe that contains the necessary release metadata:
> stack
id release lat lon
1 888 2017-11-27 14:53 25.69201 -79.31534
2 333 2019-01-31 16:09 25.68896 -79.31326
3 222 2019-02-02 15:55 25.70051 -79.31393
4 111 2019-04-02 10:43 25.68534 -79.31341
5 444 2020-03-13 15:04 24.42892 -77.69518
6 666 2020-10-27 09:40 24.58290 -77.69561
7 555 2020-01-21 14:38 24.43333 -77.69637
8 777 2020-06-25 08:54 24.42712 -77.76427
So my question is: how can I add the release information (time and lat/lon) to the dataframe for each id (the columns a, b, and c can be NA)? And how can I then filter out the observations that occurred before each animal's release time? I have been looking into possibilities using dplyr but have not yet been able to resolve my issue.
You haven't provided an easy way of obtaining your data (dput() is by far the best), and your date-time values are inconsistent (release uses Y-M-D H:M whereas date uses Y-M-D H:M:S), so for clarity I've included the code that builds the data frames I use at the end of this post.
First, the solution:
library(tidyverse)
library(lubridate)
sub %>%
left_join(stack, by="id") %>%
mutate(
release=ymd_hms(paste0(release, ":00")),
date=ymd_hms(date)
) %>%
filter(date >= release)
id lc lon.x lat.x a b c date release lat.y lon.y
1 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17 2020-01-21 14:38:00 24.43333 -77.69637
2 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35 2020-01-21 14:38:00 24.43333 -77.69637
As I indicated in comments.
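The question also asks for the release events themselves to appear as rows, with a, b and c set to NA. A minimal base R sketch of that variant follows; it is not the approach above but an alternative under stated assumptions: the toy obs and rel data frames are abridged copies of sub and stack, all times are treated as UTC, and the names obs, rel, release_rows and post_release are made up for illustration.

```r
# Abridged toy versions of `sub` (obs) and `stack` (rel)
obs <- data.frame(
  id   = c(111, 111, 555, 555),
  lc   = c("3", "3", "3", "3"),
  lon  = c(-79.2975, -79.2975, -77.7211, -77.7163),
  lat  = c(25.6996, 25.6996, 24.4887, 24.4897),
  a    = c(414, 414, 665, 285),
  b    = c(51, 51, 45, 129),
  c    = c(77, 77, 68, 130),
  date = c("2019-04-01 22:08:50", "2019-04-01 22:08:50",
           "2020-07-12 21:09:17", "2020-07-12 21:10:35")
)
rel <- data.frame(
  id      = c(111, 555),
  release = c("2019-04-02 10:43", "2020-01-21 14:38"),
  lat     = c(25.68534, 24.43333),
  lon     = c(-79.31341, -77.69637)
)

# Parse both time columns (the time zone is an assumption)
obs$date    <- as.POSIXct(obs$date, tz = "UTC")
rel$release <- as.POSIXct(rel$release, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Turn each release event into an observation row with NA metrics
release_rows <- data.frame(
  id = rel$id, lc = NA_character_, lon = rel$lon, lat = rel$lat,
  a = NA_real_, b = NA_real_, c = NA_real_, date = rel$release
)
combined <- rbind(obs, release_rows)

# Look up each row's release time, then keep rows at or after release
combined$release <- rel$release[match(combined$id, rel$id)]
post_release <- combined[combined$date >= combined$release, ]
post_release <- post_release[order(post_release$id, post_release$date), ]
print(post_release)
```

With this toy data, animal 111 keeps only its release row, while animal 555 keeps its release row plus both later observations.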
To obtain the data
sub <- read.table(textConnection("id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
2 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
3 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
4 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
5 333 B -80.8211 24.8441 11625 6980 37 '2018-12-17 20:45:05'
6 333 3 -80.8137 24.8263 155 100 69 '2018-12-17 21:00:43'
7 444 3 -80.4535 25.0848 501 33 104 '2019-10-20 19:44:16'
8 444 1 -80.8086 24.8364 6356 126 87 '2020-01-18 20:32:28'
9 555 3 -77.7211 24.4887 665 45 68 '2020-07-12 21:09:17'
10 555 3 -77.7163 24.4897 285 129 130 '2020-07-12 21:10:35'
11 666 2 -77.7221 24.4902 1129 75 66 '2020-07-12 21:09:02'
12 666 2 -77.7097 24.4905 314 248 164 '2020-07-12 21:11:37'
13 777 3 -77.7133 24.4820 406 58 110 '2020-06-20 11:18:18'
14 777 3 -77.7218 24.4844 170 93 107 '2020-06-20 11:51:06'
15 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'
16 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'"), header=TRUE)
stack <- read.table(textConnection("id release lat lon
1 888 '2017-11-27 14:53' 25.69201 -79.31534
2 333 '2019-01-31 16:09' 25.68896 -79.31326
3 222 '2019-02-02 15:55' 25.70051 -79.31393
4 111 '2019-04-02 10:43' 25.68534 -79.31341
5 444 '2020-03-13 15:04' 24.42892 -77.69518
6 666 '2020-10-27 09:40' 24.58290 -77.69561
7 555 '2020-01-21 14:38' 24.43333 -77.69637
8 777 '2020-06-25 08:54' 24.42712 -77.76427"), header=TRUE)
I have:
a number m of categorical features (x1, x2, ... xm)
1 categorical feature (y)
all in a dataframe (df).
I would like to have a function that gives a single table with all the crossings between xi and y. For example:
table1 = table (df $ x1, df $ y) ... tablem = table (df $ xm, df $ y)
aggregate tables with rbind
I'm almost there but it doesn't work.
How about this:
data(diamonds, package="ggplot2")
tabs <- lapply(diamonds[,c("color", "clarity")], \(x){
table(x, diamonds$cut)
})
do.call(rbind,tabs)
#> Fair Good Very Good Premium Ideal
#> D 163 662 1513 1603 2834
#> E 224 933 2400 2337 3903
#> F 312 909 2164 2331 3826
#> G 314 871 2299 2924 4884
#> H 303 702 1824 2360 3115
#> I 175 522 1204 1428 2093
#> J 119 307 678 808 896
#> I1 210 96 84 205 146
#> SI2 466 1081 2100 2949 2598
#> SI1 408 1560 3240 3575 4282
#> VS2 261 978 2591 3357 5071
#> VS1 170 648 1775 1989 3589
#> VVS2 69 286 1235 870 2606
#> VVS1 17 186 789 616 2047
#> IF 9 71 268 230 1212
Created on 2022-05-30 by the reprex package (v2.0.1)
An example with mtcars, c("vs","am","gear") (your x's) vs "carb" (your y):
do.call(
rbind,
sapply(
c("vs","am","gear"),
function(x){
as.data.frame(table(mtcars[,x],mtcars$carb))
},
simplify=FALSE
)
)
Var1 Var2 Freq
vs.1 0 1 0
vs.2 1 1 7
vs.3 0 2 5
vs.4 1 2 5
vs.5 0 3 3
vs.6 1 3 0
vs.7 0 4 8
vs.8 1 4 2
...
Var1 is the value of the variable named in the row names, and Var2 is the value of y.
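Since the question asks for a function, here is one way the wide-table approach could be wrapped up. It is only a sketch: cross_tabs and its argument names are made up for illustration, and it prefixes each row name with the variable it came from so that levels shared by several x's (such as 0 and 1 for both vs and am) stay distinguishable.

```r
# Hypothetical wrapper: cross every column in `x_names` with `y_name`
# and stack the resulting contingency tables into one matrix.
cross_tabs <- function(df, x_names, y_name) {
  tabs <- lapply(x_names, function(x) {
    tab <- table(df[[x]], df[[y_name]])
    # Label rows as "variable=level" so the origin of each row is visible
    rownames(tab) <- paste(x, rownames(tab), sep = "=")
    tab
  })
  do.call(rbind, tabs)
}

res <- cross_tabs(mtcars, c("vs", "am", "gear"), "carb")
print(res)
```

Each block of rows sums to nrow(mtcars), which is a quick sanity check that no observation was dropped.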
Let's assume we somehow ended up with a data frame object (T2 in the example below) and we want to subset our original data with it. Is there a way to do this without using | in subset?
Here is a dataset I was playing with, without success:
education = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) = c("X", "State", "Region", "Urban.Population", "Per.Capita.Income", "Minor.Population", "Education.Expenditures")
head(education)
T1 = c(1,4,13,15,17,23,33,38)
T2 = education[T1,]$State
subset(education, State=="ME"| State=="MA" | State=="MI" | State=="MN" | State=="MO" | State=="MD" | State=="MS" | State=="MT")
subset(education, State==T2[3])
subset(education, State==T2)
PS: I created T2 as the states starting with M, but I don't want to use string matching or anything like that. Just assume we somehow ended up with T2, which holds some states.
I'm not quite sure what would be an acceptable answer but subset(education, State %in% T2) uses T2 as is and does not use |. Does this solve your problem? It's almost the same approach as Jon Spring points out in the comments, but instead of specifying a vector we can just use T2 with %in%. You say T2 is a data.frame object, but in the data you provided it turns out to be a character vector.
education = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) = c("X", "State", "Region", "Urban.Population", "Per.Capita.Income", "Minor.Population", "Education.Expenditures")
T1 = c(1,4,13,15,17,23,33,38)
T2 = education[T1,]$State
T2 # T2 is not a data.frame object (R 4.0)
#> [1] "ME" "MA" "MI" "MN" "MO" "MD" "MS" "MT"
subset(education, State %in% T2)
#> X State Region Urban.Population Per.Capita.Income Minor.Population
#> 1 1 ME 1 508 3944 325
#> 4 4 MA 1 846 5233 305
#> 13 13 MI 2 738 5439 337
#> 15 15 MN 2 664 4921 330
#> 17 17 MO 2 701 4672 309
#> 23 23 MD 3 766 5331 323
#> 33 33 MS 3 445 3448 358
#> 38 38 MT 4 534 4418 335
#> Education.Expenditures
#> 1 235
#> 4 261
#> 13 379
#> 15 378
#> 17 231
#> 23 330
#> 33 215
#> 38 302
But let's say T2 were an actual data.frame:
T2 = education[T1,]["State"]
T2 #check
#> State
#> 1 ME
#> 4 MA
#> 13 MI
#> 15 MN
#> 17 MO
#> 23 MD
#> 33 MS
#> 38 MT
Then we could coerce it into a vector by subsetting it with drop = TRUE.
subset(education, State %in% T2[, , drop = TRUE])
#> X State Region Urban.Population Per.Capita.Income Minor.Population
#> 1 1 ME 1 508 3944 325
#> 4 4 MA 1 846 5233 305
#> 13 13 MI 2 738 5439 337
#> 15 15 MN 2 664 4921 330
#> 17 17 MO 2 701 4672 309
#> 23 23 MD 3 766 5331 323
#> 33 33 MS 3 445 3448 358
#> 38 38 MT 4 534 4418 335
#> Education.Expenditures
#> 1 235
#> 4 261
#> 13 379
#> 15 378
#> 17 231
#> 23 330
#> 33 215
#> 38 302
Created on 2021-06-12 by the reprex package (v0.3.0)
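As a side note on why subset(education, State==T2) from the question does not give the intended rows: == compares element by element and recycles the shorter vector, whereas %in% tests membership. A tiny sketch with made-up vectors (states and wanted are illustrative names, not from the dataset):

```r
states <- c("ME", "NH", "MA", "VT")
wanted <- c("MA", "ME")

# `==` recycles `wanted` to MA, ME, MA, ME and compares position by position
eq_result <- states == wanted
print(eq_result)   # FALSE FALSE  TRUE FALSE

# `%in%` asks, for each state, whether it occurs anywhere in `wanted`
in_result <- states %in% wanted
print(in_result)   #  TRUE FALSE  TRUE FALSE
```

With == the match for "ME" is missed only because of where it happens to sit in the vector, which is why the result depends on row order.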
I am sure this has a super easy answer, but I am struggling with how to add a column with two different values to my dataframe. Currently, this is what it looks like:
vcv.index model.index par.index grid index estimate se lcl ucl fixed
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157
5 10 10 20 A 20 0.7575811 0.05033490 0.6461758 0.8424612
6 21 21 61 B 61 0.8713467 0.07638687 0.6404598 0.9626184
7 22 22 62 B 62 0.6074379 0.06881230 0.4677827 0.7314827
8 23 23 63 B 63 0.6041054 0.06107520 0.4805279 0.7156792
9 24 24 64 B 64 0.5806565 0.06927308 0.4422237 0.7074601
10 25 25 65 B 65 0.7370944 0.05892108 0.6070620 0.8357394
11 41 41 121 C 121 0.8048479 0.09684385 0.5519097 0.9324759
12 42 42 122 C 122 0.5259547 0.07165218 0.3871380 0.6608721
13 43 43 123 C 123 0.5427100 0.07127273 0.4033255 0.6757137
14 44 44 124 C 124 0.5168820 0.06156392 0.3975561 0.6343132
15 45 45 125 C 125 0.6550049 0.07378403 0.5002851 0.7826343
16 196 196 586 A 586 0.8536314 0.08709394 0.5979992 0.9580976
17 197 197 587 A 587 0.5672194 0.07079508 0.4268452 0.6975725
18 198 198 588 A 588 0.5675415 0.06380445 0.4408540 0.6859714
19 199 199 589 A 589 0.5666874 0.06499899 0.4377071 0.6872233
20 200 200 590 A 590 0.7058542 0.05985868 0.5769484 0.8085177
21 211 211 631 B 631 0.8360614 0.09413427 0.5703031 0.9514472
22 212 212 632 B 632 0.5432872 0.07906200 0.3891364 0.6895701
23 213 213 633 B 633 0.5400994 0.06497607 0.4129055 0.6622759
24 214 214 634 B 634 0.5161692 0.06292706 0.3943257 0.6361202
25 215 215 635 B 635 0.6821667 0.07280044 0.5263841 0.8056298
26 226 226 676 C 676 0.7621875 0.10484478 0.5077465 0.9087471
27 227 227 677 C 677 0.4607440 0.07326970 0.3240229 0.6036386
28 228 228 678 C 678 0.4775168 0.08336433 0.3219349 0.6375872
29 229 229 679 C 679 0.4517655 0.06393339 0.3319262 0.5774725
30 230 230 680 C 680 0.5944330 0.07210672 0.4491995 0.7248303
Then I add a column with periods 1-5, repeated until it reaches the end,
with this code:
SurJagPred$estimates %<>% mutate(Primary = rep(1:5, 6))
I also need to add sex (F, M). Rows 1-15 are female and rows 16-30 are male, so overall it should look like this:
vcv.index model.index par.index grid index estimate se lcl ucl fixed Primary Sex
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751 1 F
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014 2 F
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169 3 F
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157 4 F
We can use rep with the each argument, which repeats every element of the vector that many times before moving on to the next element:
SurJagPred$estimates %<>%
mutate(Sex = rep(c("F", "M"), each = 15))
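To make the difference between the two rep calls concrete, here is a small base R sketch on a stand-in data frame. The name est and its single index column are assumptions for illustration; the real SurJagPred$estimates has more columns.

```r
# Stand-in for SurJagPred$estimates: 30 rows, as in the question
est <- data.frame(index = 1:30)

# times: repeat the whole vector 1:5 six times (1,2,3,4,5,1,2,...)
est$Primary <- rep(1:5, times = 6)

# each: repeat every element 15 times in a row (F x 15, then M x 15)
est$Sex <- rep(c("F", "M"), each = 15)

print(est[c(1, 5, 6, 15, 16, 30), ])
```

Both calls produce exactly 30 values here; if the number of rows changes, the counts passed to times and each need to change with it.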
I have run a Fisher test on all my rows, which outputs a lot of p-values. How can I correctly combine the p-values with the original columns? I tried the following code, but the rows in the original data (d) do not match the p-values (e) in the merged dataframe (f).
d <- read.table('test.txt', header = FALSE)
e <-apply(d,1, function(x) fisher.test(matrix(x,nr=2), alternative='greater')$p.value)
f <-merge(d,as.data.frame(e),by.x=0,by.y=0)
> d
V1 V2 V3 V4
1 1 839 63 222247
2 1 839 47 222263
3 1 839 299 222011
4 6 834 1821 220489
5 1 839 198 222112
6 1 839 324 221986
7 2 838 808 221502
8 3 837 935 221375
9 4 836 1723 220587
10 1 839 117 22219
> e
[1] 2.144749e-01 1.656028e-01 6.776690e-01 6.848409e-01 5.280300e-01 7.067099e-01 8.091576e-01 6.859446e-01
[9] 8.895988e-01 3.592658e-01
> f
Row.names V1 V2 V3 V4 e
1 1 1 839 63 222247 2.144749e-01
2 10 1 839 117 222193 3.592658e-01
3 11 6 834 850 221460 1.071752e-01
4 12 29 811 11625 210685 9.941101e-01
5 13 2 838 1231 221079 9.463472e-01
6 14 1 839 1236 221074 9.907043e-01
7 15 3 837 905 221405 6.647785e-01
8 16 3 837 793 221517 5.768163e-01
9 17 6 834 687 221623 4.906665e-02
10 18 1 839 226 222084 5.753710e-01
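Since apply returns the p-values in the same row order as d, you can bind them on directly with cbind:
f <-cbind(d,e)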
# V1 V2 V3 V4 e
#1 1 839 63 222247 0.2144749
#2 1 839 47 222263 0.1656028
#3 1 839 299 222011 0.6776690
#4 6 834 1821 220489 0.6848409
#5 1 839 198 222112 0.5280300
#6 1 839 324 221986 0.7067099
#7 2 838 808 221502 0.8091576
#8 3 837 935 221375 0.6859446
#9 4 836 1723 220587 0.8895988
#10 1 839 117 22219 0.9873172
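As for why the merged result looked scrambled: merging on row names treats them as character strings and sorts the result, so row "10" lands right after row "1". Each p-value is still paired with its own row; only the order changes. A small sketch with made-up data (the column p and the values are illustrative, chosen so the pairing is easy to verify):

```r
# Made-up data: 12 rows, with p constructed so that p == V1 * 100
d <- data.frame(V1 = 1:12)
p <- d$V1 * 100

# Merging on row names sorts them as character strings
merged <- merge(d, data.frame(p = p), by.x = 0, by.y = 0)
print(as.character(merged$Row.names))   # starts "1" "10" "11" "12" "2" ...

# The pairing is intact, only the row order differs
print(all(merged$p == merged$V1 * 100))

# Either restore numeric order afterwards, or skip merge and bind directly
restored <- merged[order(as.integer(as.character(merged$Row.names))), ]
bound <- cbind(d, p)
```

cbind is the simpler route whenever both objects are already in the same order, as they are after apply over the rows of d.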