Below is the sample data. I know that I have to do a left join. The question is how to have it return only the values that match (indcodelist = indcodelist2), but with the highest codetype value.
indcodelist <- c(110000,111000,112000,113000,114000,115000,121000,210000,211000,315000)
estemp <- c(11,21,31,41,51,61,55,21,22,874)
projemp <- c(15,25,36,45,52,61,31,29,31,899)
nchg <- c(4,4,5,4,1,0,-24,8,9,25)
firsttable <- data.frame(indcodelist,estemp,projemp,nchg)
indcodelist2 <- c(110000,111000,112000,113000,114000,115000,121000,210000,211000,315000,110000,111000,112000,113000)
codetype <- c(18,18,18,18,18,18,18,18,18,18,10,10,10,10)
codetitle <- c("Accountant","Doctor","Lawyer","Teacher","Economist","Financial Analyst","Meteorologist","Dentist", "Editor","Veterinarian","Accounting Technician","Doctor","Lawyer","Teacher")
secondtable <- data.frame(indcodelist2,codetype,codetitle)
tried <- left_join(firsttable, secondtable, by = c("indcodelist" = "indcodelist2"))
Desired Result
indcodelist estemp projemp nchg codetitle
110000 11 15 4 Accountant
111000 21 25 4 Doctor
If you only want values that match in both tables, inner_join() might be what you're looking for. You can see this answer to understand the different types of joins.
To get the highest codetype, you can use dplyr::slice_max(). Be aware that the default behavior is to return ties: if more than one codetitle shares the same highest codetype, they will all be returned (see the note after the output below).
library(tidyverse)
firsttable %>%
  inner_join(secondtable, by = c("indcodelist" = "indcodelist2")) %>%
  group_by(indcodelist) %>%
  slice_max(codetype)
#> # A tibble: 10 × 6
#> # Groups: indcodelist [10]
#> indcodelist estemp projemp nchg codetype codetitle
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 110000 11 15 4 18 Accountant
#> 2 111000 21 25 4 18 Doctor
#> 3 112000 31 36 5 18 Lawyer
#> 4 113000 41 45 4 18 Teacher
#> 5 114000 51 52 1 18 Economist
#> 6 115000 61 61 0 18 Financial Analyst
#> 7 121000 55 31 -24 18 Meteorologist
#> 8 210000 21 29 8 18 Dentist
#> 9 211000 22 31 9 18 Editor
#> 10 315000 874 899 25 18 Veterinarian
Created on 2022-09-15 by the reprex package (v2.0.1)
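If ties at the highest codetype should collapse to a single row instead, a small variation on the pipeline above (assuming dplyr >= 1.0.0, where slice_max() has the n and with_ties arguments) is:
firsttable %>%
  inner_join(secondtable, by = c("indcodelist" = "indcodelist2")) %>%
  group_by(indcodelist) %>%
  # keep exactly one row per code, dropping ties on codetype
  slice_max(codetype, n = 1, with_ties = FALSE) %>%
  ungroup()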
You might use {powerjoin}:
library(powerjoin)
power_inner_join(
  firsttable,
  secondtable |> summarize_by_keys(dplyr::across()[which.max(codetype), ]),
  by = c("indcodelist" = "indcodelist2")
)
#> indcodelist estemp projemp nchg codetype codetitle
#> 1 110000 11 15 4 18 Accountant
#> 2 111000 21 25 4 18 Doctor
#> 3 112000 31 36 5 18 Lawyer
#> 4 113000 41 45 4 18 Teacher
#> 5 114000 51 52 1 18 Economist
#> 6 115000 61 61 0 18 Financial Analyst
#> 7 121000 55 31 -24 18 Meteorologist
#> 8 210000 21 29 8 18 Dentist
#> 9 211000 22 31 9 18 Editor
#> 10 315000 874 899 25 18 Veterinarian
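The same result can also be had without {powerjoin} by reducing secondtable to one row per code first and only then joining. A sketch of that filter-then-join pattern (second_best is just an illustrative name):
library(dplyr)

# keep only the highest-codetype row for each industry code
second_best <- secondtable %>%
  group_by(indcodelist2) %>%
  slice_max(codetype, n = 1, with_ties = FALSE) %>%
  ungroup()

inner_join(firsttable, second_best, by = c("indcodelist" = "indcodelist2"))
Filtering before the join keeps the join itself one-to-one, which is essentially what the summarize_by_keys() step above achieves.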
I'm trying to go through a column and create a second column called status. Status is based on a condition on times: if times is > 250, then status should be "good"; if not, the current times value should be summed (similar to cumsum) with the rows below until the running sum is > 250. At that point the status of the current row should be set to "good" and everything starts afresh.
I've tried the for loop below but I can't get it to work (for instance, the 3rd row's status should be "good" in the example). Can someone provide an example of the above and explain how it works, please? Thank you.
set.seed(1234)
test = data.frame(times = round(abs(rnorm(20,100,100)),0))
test
#> times
#> 1 21
#> 2 128
#> 3 208
#> 4 135
#> 5 143
#> 6 151
#> 7 43
#> 8 45
#> 9 44
#> 10 11
#> 11 52
#> 12 0
#> 13 22
#> 14 106
#> 15 196
#> 16 89
#> 17 49
#> 18 9
#> 19 16
#> 20 342
test$status <- 'bad'
running_sum <- 0
for (i in 1:length(test$times)) {
  if (test$times[i] >= 250 | running_sum > 250) {
    test$status[i] <- "good"
    running_sum <- 0
  } else {
    running_sum <- running_sum + test$times[i]
  }
  print(running_sum)
}
#> [1] 21
#> [1] 149
#> [1] 357
#> [1] 0
#> [1] 143
#> [1] 294
#> [1] 0
#> [1] 45
#> [1] 89
#> [1] 100
#> [1] 152
#> [1] 152
#> [1] 174
#> [1] 280
#> [1] 0
#> [1] 89
#> [1] 138
#> [1] 147
#> [1] 163
#> [1] 0
test
#> times status
#> 1 21 bad
#> 2 128 bad
#> 3 208 bad
#> 4 135 good
#> 5 143 bad
#> 6 151 bad
#> 7 43 good
#> 8 45 bad
#> 9 44 bad
#> 10 11 bad
#> 11 52 bad
#> 12 0 bad
#> 13 22 bad
#> 14 106 bad
#> 15 196 good
#> 16 89 bad
#> 17 49 bad
#> 18 9 bad
#> 19 16 bad
#> 20 342 good
Using this nice answer from @MrFlick:
library(tidyverse)

set.seed(1234)
test <- data.frame(times = round(abs(rnorm(20, 100, 100)), 0))

sum_reset_at <- function(thresh) {
  function(x) {
    accumulate(x, ~ if_else(.x >= thresh, .y, .x + .y))
  }
}

test %>% mutate(temp = ifelse(sum_reset_at(250)(times) < 250, "bad", "good"))
# times temp
# 1 21 bad
# 2 128 bad
# 3 208 good
# 4 135 bad
# 5 143 good
# 6 151 bad
# 7 43 bad
# 8 45 bad
# 9 44 good
# 10 11 bad
# 11 52 bad
# 12 0 bad
# 13 22 bad
# 14 106 bad
# 15 196 good
# 16 89 bad
# 17 49 bad
# 18 9 bad
# 19 16 bad
# 20 342 good
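To see what sum_reset_at() is doing: purrr::accumulate() feeds the running value (.x) and the next element (.y) to the lambda, which either keeps adding or restarts at .y once the threshold has been reached. A small standalone trace on the first few values:
library(purrr)
library(dplyr)

x <- c(21, 128, 208, 135, 143)
# running sum that restarts after crossing 250
accumulate(x, ~ if_else(.x >= 250, .y, .x + .y))
#> [1]  21 149 357 135 278
Values of 250 or more mark the rows where the threshold is crossed, which the mutate() call above then labels "good".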
You just need to change the order of your loop operations: increment first, then test.
set.seed(1234)
test = data.frame(times = round(abs(rnorm(20,100,100)),0))
test$status <- 'bad'
running_sum <- 0
for (i in 1:length(test$times)) {
  running_sum <- running_sum + test$times[i]
  print(running_sum)
  if (test$times[i] >= 250 | running_sum > 250) {
    test$status[i] <- "good"
    running_sum <- 0
  }
}
Result:
times status
1 21 bad
2 128 bad
3 208 good
4 135 bad
5 143 good
6 151 bad
7 43 bad
8 45 bad
9 44 good
10 11 bad
11 52 bad
12 0 bad
13 22 bad
14 106 bad
15 196 good
16 89 bad
17 49 bad
18 9 bad
19 16 bad
20 342 good
I'm trying to use rBLAST for protein sequences but somehow it doesn't work. It works fine for nucleotide sequences, but for proteins it just doesn't return any match (I used a sequence from the searched dataset, so there has to be a match). The package description says "This includes interfaces to blastn, blastp, blastx...", but the help file in RStudio says "Description: Execute blastn from blast+". Has anybody run rBLAST for proteins?
Here's what I ran:
listF<-list.files("Trich_prot_fasta/")
fa<-paste0("Trich_prot_fasta/",listF[i])
makeblastdb(fa, dbtype = "prot", args="")
bl <- blast("Trich_prot_fasta/Tri5640_1_GeneModels_FilteredModels1_aa.fasta", type="blastp")
seq <- readAAStringSet("NDRkinase/testSeq.txt")
cl <- predict(bl, seq)
Result:
> cl <- predict(bl, seq)
Warning message:
In predict.BLAST(bl, seq) : BLAST did not return a match!
I tried to reproduce the error, but everything worked as expected on my system (macOS Big Sur 11.6 / R v4.1.1 / RStudio v1.4.1717).
Given that your blastn run was successful, perhaps you are combining multiple fasta files for your protein reference database? If that's the case, try concatenating them together and use the path to the file instead of an R object (fa) when making your blastdb. Or perhaps:
makeblastdb(file = "Trich_prot_fasta/Tri5640_1_GeneModels_FilteredModels1_aa.fasta", dbtype = "prot")
Instead of:
makeblastdb(fa, dbtype = "prot", args="")
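If the protein reference really is split across several fasta files, a minimal sketch of the concatenation step with Biostrings might look like this (file and folder names are illustrative, loosely taken from the question):
library(Biostrings)
library(rBLAST)

# readAAStringSet() accepts a vector of paths, so all protein fasta files
# in the folder can be read and combined in one step
listF <- list.files("Trich_prot_fasta/", pattern = "\\.fasta$", full.names = TRUE)
all_aa <- readAAStringSet(listF)

# write the combined sequences out as a single reference fasta
writeXStringSet(all_aa, filepath = "Trich_prot_fasta/combined_aa.fasta")

# build the protein database from that single file
makeblastdb(file = "Trich_prot_fasta/combined_aa.fasta", dbtype = "prot")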
Also, please edit your question to include the output from sessionInfo() (might help narrow things down).
library(tidyverse)
#BiocManager::install("Biostrings")
#devtools::install_github("mhahsler/rBLAST")
library(rBLAST)
# Download an example fasta file:
# https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000001542/UP000001542_5722.fasta.gz
# Grab the first fasta sequence as "example_sequence.fasta"
listF <- list.files("~/Downloads/Trich_example", full.names = TRUE)
listF
#> [1] "~/Downloads/Trich_example/UP000001542_5722.fasta"
#> [2] "~/Downloads/Trich_example/example_sequence.fasta"
makeblastdb(file = "~/Downloads/Trich_example/UP000001542_5722.fasta", dbtype = "prot")
bl <- blast("~/Downloads/Trich_example/UP000001542_5722.fasta", type = "blastp")
seq <- readAAStringSet("~/Downloads/Trich_example/example_sequence.fasta")
cl <- predict(bl, seq)
cl
#> QueryID SubjectID Perc.Ident Alignment.Length
#> 1 Example_sequence_1 tr|A2D8A1|A2D8A1_TRIVA 100.000 694
#> 2 Example_sequence_1 tr|A2E4L0|A2E4L0_TRIVA 64.553 694
#> 3 Example_sequence_1 tr|A2E4L0|A2E4L0_TRIVA 32.436 669
#> 4 Example_sequence_1 tr|A2D899|A2D899_TRIVA 64.344 488
#> 5 Example_sequence_1 tr|A2D899|A2D899_TRIVA 31.004 458
#> 6 Example_sequence_1 tr|A2D899|A2D899_TRIVA 27.070 314
#> 7 Example_sequence_1 tr|A2D898|A2D898_TRIVA 54.915 468
#> 8 Example_sequence_1 tr|A2D898|A2D898_TRIVA 33.691 653
#> 9 Example_sequence_1 tr|A2D898|A2D898_TRIVA 32.936 671
#> 10 Example_sequence_1 tr|A2D898|A2D898_TRIVA 29.969 654
#> 11 Example_sequence_1 tr|A2D898|A2D898_TRIVA 26.694 487
#> 12 Example_sequence_1 tr|A2D898|A2D898_TRIVA 25.000 464
#> 13 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 39.106 716
#> 14 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 30.724 677
#> 15 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 29.257 417
#> 16 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 23.438 640
#> 17 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 22.981 718
#> 18 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 24.107 112
#> 19 Example_sequence_1 tr|A2FI39|A2FI39_TRIVA 33.378 740
#> 20 Example_sequence_1 tr|A2FI39|A2FI39_TRIVA 31.440 722
#> Mismatches Gap.Openings Q.start Q.end S.start S.end E Bits
#> 1 0 0 1 694 1 694 0.00e+00 1402.0
#> 2 243 2 1 692 163 855 0.00e+00 920.0
#> 3 410 15 22 671 1 646 3.02e-94 312.0
#> 4 173 1 205 692 1 487 0.00e+00 644.0
#> 5 308 7 22 476 1 453 3.55e-55 198.0
#> 6 196 5 13 294 173 485 4.12e-25 110.0
#> 7 211 0 1 468 683 1150 8.48e-169 514.0
#> 8 420 11 2 647 501 1147 1.61e-91 309.0
#> 9 396 10 2 666 363 985 5.78e-89 301.0
#> 10 406 11 16 664 195 801 1.01e-66 238.0
#> 11 297 10 208 662 21 479 1.60e-36 147.0
#> 12 316 7 11 469 29 465 3.04e-36 147.0
#> 13 386 4 2 667 248 963 1.72e-149 461.0
#> 14 411 10 2 625 66 737 8.34e-83 283.0
#> 15 286 5 129 542 14 424 2.66e-52 196.0
#> 16 421 15 5 607 365 972 3.07e-38 152.0
#> 17 407 21 77 662 27 730 1.25e-33 138.0
#> 18 81 3 552 661 3 112 2.10e-01 35.4
#> 19 421 9 3 675 394 1128 1.12e-115 375.0
#> 20 409 15 2 647 163 874 1.21e-82 285.0
...
Created on 2021-09-30 by the reprex package (v2.0.1)
I'm looking for code to extract a time interval (500 ms) from a column (called time) for each trial onset, so that I can calculate a baseline over the first 500 ms of each trial.
The actual time in ms between two consecutive rows of the column varies, because the dataset is downsampled and only changes are reported, so I cannot just count a fixed number of rows to define the time interval.
I tried this:
baseline <- labchart %>%
  dplyr::filter(time[1:(length(labchart$time)+500)]) %>%
  dplyr::group_by(Participant, trialonset)
but only got error messages like:
Error: Argument 2 filter condition does not evaluate to a logical vector
And I am not sure if time[1:(length(labchart$Time)+500)] would really give me the first 500 ms of each trial.
It's difficult to know exactly what you're asking here. I think what you're asking is how to group observations into 500ms periods given only time intervals between observations.
Suppose the data looks like this:
labchart <- data.frame(time = sample(50:300, 20, TRUE), data = rnorm(20))
labchart
#> time data
#> 1 277 -1.33120732
#> 2 224 -0.85356280
#> 3 80 -0.32012499
#> 4 255 0.32433366
#> 5 227 -0.49600772
#> 6 248 2.23246918
#> 7 138 -1.40170795
#> 8 115 -0.76525043
#> 9 159 0.14239351
#> 10 207 -1.53064873
#> 11 139 -0.82303066
#> 12 185 1.12473125
#> 13 239 -0.22491238
#> 14 117 -0.55809297
#> 15 147 0.83225435
#> 16 200 0.75178516
#> 17 170 -0.78484405
#> 18 208 1.21000589
#> 19 196 -0.74576650
#> 20 184 0.02459359
Then we can create a column for the total elapsed time and for which 500 ms period each observation belongs to, like this:
library(dplyr)
labchart %>%
  mutate(elapsed = lag(cumsum(time), 1, 0),
         period = 500 * (elapsed %/% 500))
#> time data elapsed period
#> 1 277 -1.33120732 0 0
#> 2 224 -0.85356280 277 0
#> 3 80 -0.32012499 501 500
#> 4 255 0.32433366 581 500
#> 5 227 -0.49600772 836 500
#> 6 248 2.23246918 1063 1000
#> 7 138 -1.40170795 1311 1000
#> 8 115 -0.76525043 1449 1000
#> 9 159 0.14239351 1564 1500
#> 10 207 -1.53064873 1723 1500
#> 11 139 -0.82303066 1930 1500
#> 12 185 1.12473125 2069 2000
#> 13 239 -0.22491238 2254 2000
#> 14 117 -0.55809297 2493 2000
#> 15 147 0.83225435 2610 2500
#> 16 200 0.75178516 2757 2500
#> 17 170 -0.78484405 2957 2500
#> 18 208 1.21000589 3127 3000
#> 19 196 -0.74576650 3335 3000
#> 20 184 0.02459359 3531 3500
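If the end goal from the question is a baseline over the first 500 ms of each trial, one possible continuation is sketched below. Participant and trialonset are the column names mentioned in the question, data stands in for the measured signal from the example above, and none of this is tested against the real labchart export:
library(dplyr)

baseline <- labchart %>%
  group_by(Participant, trialonset) %>%
  mutate(elapsed = lag(cumsum(time), 1, 0)) %>%  # ms elapsed since trial onset
  filter(elapsed < 500) %>%                      # keep only the first 500 ms of each trial
  summarise(baseline = mean(data), .groups = "drop")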
I wish to use a subquery-like structure in dplyr efficiently. For example, in SQL we write subqueries like
select * from table where tbl_id in (select tbl_id from table1 where name='a');
Here, I assume we are not saving any results from the inner query. I want to use a similar structure in dplyr chains.
I have tried to use the result of one dplyr chain in another by putting it in brackets, but it doesn't work that way. I already know that we can save it as a temporary df and use it, but I don't want to save it.
Below are the two table data/dataframes :
# Libraries
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Q 16 Write a query in SQL to find the name of those
# movies where one or more actors acted in two or more movies
movie <- read.csv("q14_movie.csv")
movie_cast <- read.csv("q16_movie_cast.csv")
print(movie)
#> mov_id mov_title mov_year mov_time mov_lang
#> 1 901 Vertigo 1958 128 English
#> 2 902 The Innocents 1961 100 English
#> 3 903 Lawrence of Arabia 1962 216 English
#> 4 904 The Deer Hunter 1978 183 English
#> 5 905 Amadeus 1984 160 English
#> 6 906 Blade Runner 1982 117 English
#> 7 907 Eyes Wide Shut 1999 159 English
#> 8 908 The Usual Suspects 1995 106 English
#> 9 909 Chinatown 1974 130 English
#> 10 910 Boogie Nights 1997 155 English
#> 11 911 Annie Hall 1977 93 English
#> 12 912 Princess Mononoke 1997 134 Japanese
#> 13 913 The Shawshank Redemption 1994 142 English
#> 14 914 American Beauty 1999 122 English
#> 15 915 Titanic 1997 194 English
#> 16 916 Good Will Hunting 1997 126 English
#> 17 917 Deliverance 1972 109 English
#> 18 918 Trainspotting 1996 94 English
#> 19 919 The Prestige 2006 130 English
#> 20 920 Donnie Darko 2001 113 English
#> 21 921 Slumdog Millionaire 2008 120 English
#> 22 922 Aliens 1986 137 English
#> 23 923 Beyond the Sea 2004 118 English
#> 24 924 Avatar 2009 162 English
#> 25 926 Seven Samurai 1954 207 Japanese
#> 26 927 Spirited Away 2001 125 Japanese
#> 27 928 Back to the Future 1985 116 English
#> 28 925 Braveheart 1995 178 English
#> mov_dt_rel mov_rel_country
#> 1 1958-08-24 UK
#> 2 1962-02-19 SW
#> 3 1962-12-11 UK
#> 4 1979-03-08 UK
#> 5 1985-01-07 UK
#> 6 1982-09-09 UK
#> 7 UK
#> 8 1995-08-25 UK
#> 9 1974-08-09 UK
#> 10 1998-02-16 UK
#> 11 1977-04-20 USA
#> 12 2001-10-19 UK
#> 13 1995-02-17 UK
#> 14 UK
#> 15 1998-01-23 UK
#> 16 1998-06-03 UK
#> 17 1982-10-05 UK
#> 18 1996-02-23 UK
#> 19 2006-11-10 UK
#> 20 UK
#> 21 2009-01-09 UK
#> 22 1986-08-29 UK
#> 23 2004-11-26 UK
#> 24 2009-12-17 UK
#> 25 1954-04-26 JP
#> 26 2003-09-12 UK
#> 27 1985-12-04 UK
#> 28 1995-09-08 UK
print(movie_cast)
#> act_id mov_id role
#> 1 101 901 John Scottie Ferguson
#> 2 102 902 Miss Giddens
#> 3 103 903 T.E. Lawrence
#> 4 104 904 Michael
#> 5 105 905 Antonio Salieri
#> 6 106 906 Rick Deckard
#> 7 107 907 Alice Harford
#> 8 108 908 McManus
#> 9 110 910 Eddie Adams
#> 10 111 911 Alvy Singer
#> 11 112 912 San
#> 12 113 913 Andy Dufresne
#> 13 114 914 Lester Burnham
#> 14 115 915 Rose DeWitt Bukater
#> 15 116 916 Sean Maguire
#> 16 117 917 Ed
#> 17 118 918 Renton
#> 18 120 920 Elizabeth Darko
#> 19 121 921 Older Jamal
#> 20 122 922 Ripley
#> 21 114 923 Bobby Darin
#> 22 109 909 J.J. Gittes
#> 23 119 919 Alfred Borden
sqldf('select * from movie m join movie_cast mc on
m.mov_id=mc.mov_id where mc.act_id in
(select act_id from movie_cast group by act_id having count(mov_id)>1)')
#> mov_id mov_title mov_year mov_time mov_lang mov_dt_rel
#> 1 914 American Beauty 1999 122 English
#> 2 923 Beyond the Sea 2004 118 English 2004-11-26
#> mov_rel_country act_id mov_id role
#> 1 UK 114 914 Lester Burnham
#> 2 UK 114 923 Bobby Darin
tmp <- movie_cast %>%
  group_by(act_id) %>%
  summarise(num_movies = n_distinct(mov_id)) %>%
  filter(num_movies > 1)

inner_join(movie, movie_cast, by = 'mov_id') %>%
  filter(act_id %in% tmp$act_id)
#> mov_id mov_title mov_year mov_time mov_lang mov_dt_rel
#> 1 914 American Beauty 1999 122 English
#> 2 923 Beyond the Sea 2004 118 English 2004-11-26
#> mov_rel_country act_id role
#> 1 UK 114 Lester Burnham
#> 2 UK 114 Bobby Darin
inner_join(movie, movie_cast, by = 'mov_id') %>%
  filter(act_id %in% (movie_cast %>%
                        group_by(act_id) %>%
                        summarise(num_movies = n_distinct(mov_id)) %>%
                        filter(num_movies > 1) %>%
                        select(act_id)))
#> [1] mov_id mov_title mov_year mov_time
#> [5] mov_lang mov_dt_rel mov_rel_country act_id
#> [9] role
#> <0 rows> (or 0-length row.names)
I wish to get the same results without saving a tmp object, as explained in the code. Thanks!
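For what it's worth, a sketch of why the last attempt returns zero rows and how it could be inlined: filter(act_id %in% ...) needs a plain vector, but select(act_id) returns a one-column data frame, so %in% finds no matches. Swapping select() for pull(), or using semi_join() (dplyr's closest analogue of an IN (...) subquery), keeps everything in one chain with no tmp object:
# inline the "subquery" by pulling a plain vector of act_id values
inner_join(movie, movie_cast, by = "mov_id") %>%
  filter(act_id %in% (movie_cast %>%
                        group_by(act_id) %>%
                        summarise(num_movies = n_distinct(mov_id)) %>%
                        filter(num_movies > 1) %>%
                        pull(act_id)))

# or, closer to SQL's IN (...) subquery, with semi_join()
inner_join(movie, movie_cast, by = "mov_id") %>%
  semi_join(movie_cast %>%
              group_by(act_id) %>%
              summarise(num_movies = n_distinct(mov_id)) %>%
              filter(num_movies > 1),
            by = "act_id")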