I'm looking for code to extract a 500 ms time interval from a column (called time) at each trial onset, so that I can calculate a baseline over the first 500 ms of each trial.
The actual time in ms between two consecutive rows of the column varies, because the dataset is downsampled and only changes are reported, so I cannot simply count a fixed number of rows to define the time interval.
I tried this:
baseline <- labchart %>%
dplyr::filter(time[1:(length(labchart$time)+500)]) %>%
dplyr::group_by(Participant, trialonset)
but only got error messages like:
Error: Argument 2 filter condition does not evaluate to a logical vector
And I am not sure whether (time[1:(length(labchart$Time)+500)]) would really give me the first 500 ms of each trial.
It's difficult to know exactly what you're asking here. I think what you're asking is how to group observations into 500ms periods given only time intervals between observations.
Suppose the data looks like this:
``` r
labchart <- data.frame(time = sample(50:300, 20, TRUE), data = rnorm(20))
labchart
#> time data
#> 1 277 -1.33120732
#> 2 224 -0.85356280
#> 3 80 -0.32012499
#> 4 255 0.32433366
#> 5 227 -0.49600772
#> 6 248 2.23246918
#> 7 138 -1.40170795
#> 8 115 -0.76525043
#> 9 159 0.14239351
#> 10 207 -1.53064873
#> 11 139 -0.82303066
#> 12 185 1.12473125
#> 13 239 -0.22491238
#> 14 117 -0.55809297
#> 15 147 0.83225435
#> 16 200 0.75178516
#> 17 170 -0.78484405
#> 18 208 1.21000589
#> 19 196 -0.74576650
#> 20 184 0.02459359
```

Then we can create a column for total elapsed time and one for which 500 ms period each observation belongs to, like this:

``` r
library(dplyr)
labchart %>%
  mutate(elapsed = lag(cumsum(time), 1, 0),
         period = 500 * (elapsed %/% 500))
#> time data elapsed period
#> 1 277 -1.33120732 0 0
#> 2 224 -0.85356280 277 0
#> 3 80 -0.32012499 501 500
#> 4 255 0.32433366 581 500
#> 5 227 -0.49600772 836 500
#> 6 248 2.23246918 1063 1000
#> 7 138 -1.40170795 1311 1000
#> 8 115 -0.76525043 1449 1000
#> 9 159 0.14239351 1564 1500
#> 10 207 -1.53064873 1723 1500
#> 11 139 -0.82303066 1930 1500
#> 12 185 1.12473125 2069 2000
#> 13 239 -0.22491238 2254 2000
#> 14 117 -0.55809297 2493 2000
#> 15 147 0.83225435 2610 2500
#> 16 200 0.75178516 2757 2500
#> 17 170 -0.78484405 2957 2500
#> 18 208 1.21000589 3127 3000
#> 19 196 -0.74576650 3335 3000
#> 20 184 0.02459359 3531 3500
```
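If the goal is specifically a per-trial baseline rather than fixed 500 ms bins, the same idea can be applied within each trial. The following is only a sketch under the column names from the question (Participant, trialonset, a time column holding the ms between consecutive rows, and a data column for the measured signal); the elapsed time is reset per trial and only the first 500 ms are kept:

``` r
library(dplyr)

baseline <- labchart %>%
  group_by(Participant, trialonset) %>%
  mutate(elapsed = lag(cumsum(time), 1, 0)) %>%  # ms elapsed since trial onset
  filter(elapsed < 500) %>%                      # keep only the first 500 ms of each trial
  summarise(baseline = mean(data), .groups = "drop")
```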
Related
I have 4 data frames that all look like this:
| Product 2018 | Number | Minimum | Maximum |
|---|---|---|---|
| 1 | 56 | 1 | 5 |
| 2 | 42 | 12 | 16 |
| 3 | 6523 | 23 | 56 |
| 4 | 123 | 23 | 102 |
| 5 | 56 | 23 | 64 |
| 6 | 245623 | 56 | 87 |
| 7 | 546 | 25 | 540 |
| 8 | 54566 | 253 | 560 |

| Product 2019 | Number | Minimum | Maximum |
|---|---|---|---|
| 1 | 56 | 32 | 53 |
| 2 | 642 | 423 | 620 |
| 3 | 56423 | 432 | 560 |
| 4 | 3 | 431 | 802 |
| 5 | 2 | 2 | 6 |
| 6 | 4523 | 43 | 68 |
| 7 | 555 | 23 | 54 |
| 8 | 55646 | 3 | 6 |

| Product 2020 | Number | Minimum | Maximum |
|---|---|---|---|
| 1 | 23 | 2 | 5 |
| 2 | 342 | 4 | 16 |
| 3 | 223 | 3 | 5 |
| 4 | 13 | 4 | 12 |
| 5 | 2 | 4 | 7 |
| 6 | 223 | 7 | 8 |
| 7 | 5 | 34 | 50 |
| 8 | 46 | 3 | 6 |

| Product 2021 | Number | Minimum | Maximum |
|---|---|---|---|
| 1 | 234 | 3 | 5 |
| 2 | 3242 | 4 | 16 |
| 3 | 2423 | 43 | 56 |
| 4 | 123 | 43 | 102 |
| 5 | 24 | 4 | 6 |
| 6 | 2423 | 4 | 18 |
| 7 | 565 | 234 | 540 |
| 8 | 5646 | 23 | 56 |
I want to join all the tables so I get a table that looks like this:
| Products | Number 2021 | Min-Max 2021 | Number 2020 | Min-Max 2020 | Number 2019 | Min-Max 2019 | Number 2018 | Min-Max 2018 |
|---|---|---|---|---|---|---|---|---|
| 1 | 234 | 3 to 5 | 23 | 2 to 5 | ... | ... | ... | ... |
| 2 | 3242 | 4 to 16 | 342 | 4 to 16 | ... | ... | ... | ... |
| 3 | 2423 | 43 to 56 | 223 | 3 to 5 | ... | ... | ... | ... |
| 4 | 123 | 43 to 102 | 13 | 4 to 12 | ... | ... | ... | ... |
| 5 | 24 | 4 to 6 | 2 | 4 to 7 | ... | ... | ... | ... |
| 6 | 2423 | 4 to 18 | 223 | 7 to 8 | ... | ... | ... | ... |
| 7 | 565 | 234 to 540 | 5 | 34 to 50 | ... | ... | ... | ... |
| 8 | 5646 | 23 to 56 | 46 | 3 to 6 | ... | ... | ... | ... |
The Products are the same for all years, so I would like a data frame that has the Number for each year as a column and combines the Minimum and Maximum columns into a single Min-Max column per year.
Any help is welcome!
How about something like this. You are trying to join several data frames by a single column, which is relatively straightforward using full_join. The difficulty is that you are also trying to extract information from the column names and combine several columns at the same time. I would map out everything you want to do and then reduce the list of data frames at the end. Here is an example with two data frames, but you could add as many as you want to the list at the beginning.
library(tidyverse)

# test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
              Number = sample(1:100, 8),
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
              Number = sample(1:100, 8),
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))

list(df1, df2) |>
  map(\(x){
    # pull the year out of the first column name, e.g. "Product 2018" -> "2018"
    year <- str_extract(colnames(x)[1], "\\d+?$")
    # build "Min-Max <year>" and "Number <year>", then drop the year from "Product <year>"
    mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum)) |>
      rename(!!quo_name(paste0("Number ", year)) := Number) |>
      rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
      select(-c(Minimum, Maximum))
  }) |>
  reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but it uses bind_rows to combine the data frames and then pivot_wider to end up in a wide format.
The first step strips the year from the Product XXXX column name, since that name carries the year information for each data frame. Once that column is renamed to Product, the frames are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing pipeline, so much the better.
library(tidyverse)

list(df1, df2, df3, df4) %>%
  map(~ .x %>%
        # record the year from the first column name, then rename that column to "Product"
        mutate(Year = gsub("Product", "", names(.x)[1])) %>%
        rename(Product = !!names(.[1]))) %>%
  bind_rows() %>%
  mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
  pivot_wider(id_cols = Product, names_from = Year,
              values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56
I'm trying to go through a column and create a second column called status. status is based on a condition on times: if times is > 250, then status should be "good"; if not, the current times value should be summed with the rows below (similar to cumsum) until the running sum is > 250. At that point the status of the current row should be set to "good" and everything starts afresh.
I've tried the for loop below but I can't get it to work (for instance, the 3rd row's status should be "good" in the example). Can someone provide an example of the above and explain how it works, please? Thank you.
set.seed(1234)
test = data.frame(times = round(abs(rnorm(20,100,100)),0))
test
#> times
#> 1 21
#> 2 128
#> 3 208
#> 4 135
#> 5 143
#> 6 151
#> 7 43
#> 8 45
#> 9 44
#> 10 11
#> 11 52
#> 12 0
#> 13 22
#> 14 106
#> 15 196
#> 16 89
#> 17 49
#> 18 9
#> 19 16
#> 20 342
test$status <- 'bad'
running_sum <- 0
for (i in 1:length(test$times)) {
  if (test$times[i] >= 250 | running_sum > 250) {
    test$status[i] <- "good"
    running_sum <- 0
  } else {
    running_sum <- running_sum + test$times[i]
  }
  print(running_sum)
}
#> [1] 21
#> [1] 149
#> [1] 357
#> [1] 0
#> [1] 143
#> [1] 294
#> [1] 0
#> [1] 45
#> [1] 89
#> [1] 100
#> [1] 152
#> [1] 152
#> [1] 174
#> [1] 280
#> [1] 0
#> [1] 89
#> [1] 138
#> [1] 147
#> [1] 163
#> [1] 0
test
#> times status
#> 1 21 bad
#> 2 128 bad
#> 3 208 bad
#> 4 135 good
#> 5 143 bad
#> 6 151 bad
#> 7 43 good
#> 8 45 bad
#> 9 44 bad
#> 10 11 bad
#> 11 52 bad
#> 12 0 bad
#> 13 22 bad
#> 14 106 bad
#> 15 196 good
#> 16 89 bad
#> 17 49 bad
#> 18 9 bad
#> 19 16 bad
#> 20 342 good
Using this nice answer from @MrFlick:
library(tidyverse)

set.seed(1234)
test = data.frame(times = round(abs(rnorm(20, 100, 100)), 0))

# running sum that resets to the current value whenever the previous total reached the threshold
sum_reset_at <- function(thresh) {
  function(x) {
    accumulate(x, ~if_else(.x >= thresh, .y, .x + .y))
  }
}

test %>% mutate(temp = ifelse(sum_reset_at(250)(times) < 250, "bad", "good"))
# times temp
# 1 21 bad
# 2 128 bad
# 3 208 good
# 4 135 bad
# 5 143 good
# 6 151 bad
# 7 43 bad
# 8 45 bad
# 9 44 good
# 10 11 bad
# 11 52 bad
# 12 0 bad
# 13 22 bad
# 14 106 bad
# 15 196 good
# 16 89 bad
# 17 49 bad
# 18 9 bad
# 19 16 bad
# 20 342 good
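To make the resets easier to follow, here is what the helper produces for the first six times values before the < 250 comparison is applied (this reuses the sum_reset_at() definition and the tidyverse load from the code above):

``` r
sum_reset_at(250)(c(21, 128, 208, 135, 143, 151))
# running values: 21, 149, 357, 135, 278, 151
# 357 and 278 have reached the threshold, so rows 3 and 5 become "good"
# and the running sum starts over on the following row
```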
You just need to change the order of your loop operations: increment first, then test.
set.seed(1234)
test = data.frame(times = round(abs(rnorm(20, 100, 100)), 0))

test$status <- 'bad'
running_sum <- 0
for (i in 1:length(test$times)) {
  running_sum <- running_sum + test$times[i]
  print(running_sum)
  if (test$times[i] >= 250 | running_sum > 250) {
    test$status[i] <- "good"
    running_sum <- 0
  }
}
Result:
times status
1 21 bad
2 128 bad
3 208 good
4 135 bad
5 143 good
6 151 bad
7 43 bad
8 45 bad
9 44 good
10 11 bad
11 52 bad
12 0 bad
13 22 bad
14 106 bad
15 196 good
16 89 bad
17 49 bad
18 9 bad
19 16 bad
20 342 good
I'm trying to use rBLAST for protein sequences but somehow it doesn't work. It works fine for nucleotide sequences, but for proteins it just doesn't return any match (I used a sequence taken from the searched dataset itself, so there has to be a match). The package description says "This includes interfaces to blastn, blastp, blastx...", but the help file in RStudio says "Description: Execute blastn from blast+". Has anybody run rBLAST for proteins?
Here's what I ran:
listF<-list.files("Trich_prot_fasta/")
fa<-paste0("Trich_prot_fasta/",listF[i])
makeblastdb(fa, dbtype = "prot", args="")
bl <- blast("Trich_prot_fasta/Tri5640_1_GeneModels_FilteredModels1_aa.fasta", type="blastp")
seq <- readAAStringSet("NDRkinase/testSeq.txt")
cl <- predict(bl, seq)
Result:
> cl <- predict(bl, seq)
Warning message: In predict.BLAST(bl, seq) : BLAST did not return a
match!
Tried to reproduce the error, but everything worked as expected on my system (macOS Big Sur 11.6 / R v4.1.1 / RStudio v1.4.1717).
Given that your blastn run was successful, perhaps you are combining multiple fasta files for your protein reference database? If that's the case, try concatenating them into one file and passing the path to that file, rather than the R object ("fa"), when making your blastdb. Or perhaps:
makeblastdb(file = "Trich_prot_fasta/Tri5640_1_GeneModels_FilteredModels1_aa.fasta", dbtype = "prot")
Instead of:
makeblastdb(fa, dbtype = "prot", args="")
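If the protein reference really is spread over several fasta files, one way to follow the "concatenate first" suggestion is to write them into a single file and build the database from that path. This is only a sketch; the combined file name is a placeholder:

``` r
library(rBLAST)

listF <- list.files("Trich_prot_fasta", full.names = TRUE)

# concatenate all protein fasta files into one reference file
writeLines(unlist(lapply(listF, readLines)), "Trich_prot_combined.fasta")

# build the protein database from the path to that single file
makeblastdb("Trich_prot_combined.fasta", dbtype = "prot")
```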
Also, please edit your question to include the output from sessionInfo() (might help narrow things down).
library(tidyverse)
#BiocManager::install("Biostrings")
#devtools::install_github("mhahsler/rBLAST")
library(rBLAST)
# Download an example fasta file:
# https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000001542/UP000001542_5722.fasta.gz
# Grab the first fasta sequence as "example_sequence.fasta"
listF <- list.files("~/Downloads/Trich_example", full.names = TRUE)
listF
#> [1] "~/Downloads/Trich_example/UP000001542_5722.fasta"
#> [2] "~/Downloads/Trich_example/example_sequence.fasta"
makeblastdb(file = "~/Downloads/Trich_example/UP000001542_5722.fasta", dbtype = "prot")
bl <- blast("~/Downloads/Trich_example/UP000001542_5722.fasta", type = "blastp")
seq <- readAAStringSet("~/Downloads/Trich_example/example_sequence.fasta")
cl <- predict(bl, seq)
cl
#> QueryID SubjectID Perc.Ident Alignment.Length
#> 1 Example_sequence_1 tr|A2D8A1|A2D8A1_TRIVA 100.000 694
#> 2 Example_sequence_1 tr|A2E4L0|A2E4L0_TRIVA 64.553 694
#> 3 Example_sequence_1 tr|A2E4L0|A2E4L0_TRIVA 32.436 669
#> 4 Example_sequence_1 tr|A2D899|A2D899_TRIVA 64.344 488
#> 5 Example_sequence_1 tr|A2D899|A2D899_TRIVA 31.004 458
#> 6 Example_sequence_1 tr|A2D899|A2D899_TRIVA 27.070 314
#> 7 Example_sequence_1 tr|A2D898|A2D898_TRIVA 54.915 468
#> 8 Example_sequence_1 tr|A2D898|A2D898_TRIVA 33.691 653
#> 9 Example_sequence_1 tr|A2D898|A2D898_TRIVA 32.936 671
#> 10 Example_sequence_1 tr|A2D898|A2D898_TRIVA 29.969 654
#> 11 Example_sequence_1 tr|A2D898|A2D898_TRIVA 26.694 487
#> 12 Example_sequence_1 tr|A2D898|A2D898_TRIVA 25.000 464
#> 13 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 39.106 716
#> 14 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 30.724 677
#> 15 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 29.257 417
#> 16 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 23.438 640
#> 17 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 22.981 718
#> 18 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 24.107 112
#> 19 Example_sequence_1 tr|A2FI39|A2FI39_TRIVA 33.378 740
#> 20 Example_sequence_1 tr|A2FI39|A2FI39_TRIVA 31.440 722
#> Mismatches Gap.Openings Q.start Q.end S.start S.end E Bits
#> 1 0 0 1 694 1 694 0.00e+00 1402.0
#> 2 243 2 1 692 163 855 0.00e+00 920.0
#> 3 410 15 22 671 1 646 3.02e-94 312.0
#> 4 173 1 205 692 1 487 0.00e+00 644.0
#> 5 308 7 22 476 1 453 3.55e-55 198.0
#> 6 196 5 13 294 173 485 4.12e-25 110.0
#> 7 211 0 1 468 683 1150 8.48e-169 514.0
#> 8 420 11 2 647 501 1147 1.61e-91 309.0
#> 9 396 10 2 666 363 985 5.78e-89 301.0
#> 10 406 11 16 664 195 801 1.01e-66 238.0
#> 11 297 10 208 662 21 479 1.60e-36 147.0
#> 12 316 7 11 469 29 465 3.04e-36 147.0
#> 13 386 4 2 667 248 963 1.72e-149 461.0
#> 14 411 10 2 625 66 737 8.34e-83 283.0
#> 15 286 5 129 542 14 424 2.66e-52 196.0
#> 16 421 15 5 607 365 972 3.07e-38 152.0
#> 17 407 21 77 662 27 730 1.25e-33 138.0
#> 18 81 3 552 661 3 112 2.10e-01 35.4
#> 19 421 9 3 675 394 1128 1.12e-115 375.0
#> 20 409 15 2 647 163 874 1.21e-82 285.0
...
Created on 2021-09-30 by the reprex package (v2.0.1)
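If blastp still returns no match after that, it can also be worth confirming that the blast+ binaries rBLAST relies on are visible to R (a quick check; blast+ has to be installed separately from the R package):

``` r
Sys.which(c("blastp", "makeblastdb"))
# empty strings here mean the blast+ executables are not on the PATH R sees
```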
Let's assume we somehow ended up with a data frame object (T2 in the example below) and we want to subset our original data with that data frame. Is there a way to do this without using | in the subset call?
Here is a dataset I was playing with, but I failed:
education = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) = c("X", "State", "Region", "Urban.Population", "Per.Capita.Income", "Minor.Population", "Education.Expenditures")
head(education)
T1 = c(1,4,13,15,17,23,33,38)
T2 = education[T1,]$State
subset(education, State=="ME"| State=="MA" | State=="MI" | State=="MN" | State=="MO" | State=="MD" | State=="MS" | State=="MT")
subset(education, State==T2[3])
subset(education, State==T2)
PS: I created T2 as the states starting with M, but I don't want to use strings or anything like that. Just assume we somehow ended up with T2, whose entries happen to be some states.
I'm not quite sure what would be an acceptable answer but subset(education, State %in% T2) uses T2 as is and does not use |. Does this solve your problem? It's almost the same approach as Jon Spring points out in the comments, but instead of specifying a vector we can just use T2 with %in%. You say T2 is a data.frame object, but in the data you provided it turns out to be a character vector.
education = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) = c("X", "State", "Region", "Urban.Population", "Per.Capita.Income", "Minor.Population", "Education.Expenditures")
T1 = c(1,4,13,15,17,23,33,38)
T2 = education[T1,]$State
T2 # T2 is not a data.frame object (R 4.0)
#> [1] "ME" "MA" "MI" "MN" "MO" "MD" "MS" "MT"
subset(education, State %in% T2)
#> X State Region Urban.Population Per.Capita.Income Minor.Population
#> 1 1 ME 1 508 3944 325
#> 4 4 MA 1 846 5233 305
#> 13 13 MI 2 738 5439 337
#> 15 15 MN 2 664 4921 330
#> 17 17 MO 2 701 4672 309
#> 23 23 MD 3 766 5331 323
#> 33 33 MS 3 445 3448 358
#> 38 38 MT 4 534 4418 335
#> Education.Expenditures
#> 1 235
#> 4 261
#> 13 379
#> 15 378
#> 17 231
#> 23 330
#> 33 215
#> 38 302
But let's say T2 were an actual data.frame:
T2 = education[T1,]["State"]
T2 #check
#> State
#> 1 ME
#> 4 MA
#> 13 MI
#> 15 MN
#> 17 MO
#> 23 MD
#> 33 MS
#> 38 MT
Then we could coerce it into a vector by subsetting it with drop = TRUE.
subset(education, State %in% T2[, , drop = TRUE])
#> X State Region Urban.Population Per.Capita.Income Minor.Population
#> 1 1 ME 1 508 3944 325
#> 4 4 MA 1 846 5233 305
#> 13 13 MI 2 738 5439 337
#> 15 15 MN 2 664 4921 330
#> 17 17 MO 2 701 4672 309
#> 23 23 MD 3 766 5331 323
#> 33 33 MS 3 445 3448 358
#> 38 38 MT 4 534 4418 335
#> Education.Expenditures
#> 1 235
#> 4 261
#> 13 379
#> 15 378
#> 17 231
#> 23 330
#> 33 215
#> 38 302
Created on 2021-06-12 by the reprex package (v0.3.0)
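For completeness, the same filter can be written without subset() at all, either with base indexing or with dplyr::filter(). A small sketch using the same education data and the character-vector version of T2 from above:

``` r
# base R indexing
education[education$State %in% T2, ]

# dplyr equivalent
library(dplyr)
education %>% filter(State %in% T2)
```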
I need to calculate summary statistics for observations of bird breeding activity for each of 150 species. The data frame has the species (scodef), the type of observation (codef)(e.g. nest building), and the ordinal date (days since 1 January, since the data were collected over multiple years). Using dplyr I get exactly the result I want.
library(dplyr)
library(tidyr)
phenology %>%
  group_by(sCodef, codef) %>%
  summarize(N = n(), Min = min(jdate), Max = max(jdate), Median = median(jdate))
# A tibble: 552 x 6
# Groups: sCodef [?]
sCodef codef N Min Max Median
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 ABDU AY 3 172 184 181
2 ABDU FL 12 135 225 188
3 ACFL AY 18 165 222 195
4 ACFL CN 4 142 156 152.
5 ACFL FL 10 166 197 192.
6 ACFL NB 6 139 184 150.
7 ACFL NY 6 166 207 182
8 AMCO FL 1 220 220 220
9 AMCR AY 53 89 198 161
10 AMCR FL 78 133 225 166.
# ... with 542 more rows
How do I get these summary statistics into some sort of data object so that I can export them for eventual use in a Word document? I have tried the call below and got an error. All of the many explanations of summarize I have reviewed just show the summary data on screen. Thanks.
out3 <- summarize(N=n(), Min=min(jdate), Max=max(jdate), median=median(jdate))
Error: This function should not be called directly
Assign the result to a variable, then write it to a CSV like so:
summarydf <- phenology %>% group_by......(as above)
write.csv(summarydf, file = "yourfilenamehere.csv")
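Spelled out with the pipeline from the question (a minimal sketch; phenology and its columns are as described above, and the output file name is just a placeholder):

``` r
library(dplyr)

summarydf <- phenology %>%
  group_by(sCodef, codef) %>%
  summarize(N = n(), Min = min(jdate), Max = max(jdate), Median = median(jdate))

# write the summary table to a CSV file that Word/Excel can open
write.csv(summarydf, file = "phenology_summary.csv", row.names = FALSE)
```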