I hope someone here can help me. I am trying to learn R for my work.
I am following the growth of plants (called Ecotype) over time; they are treated with a mock solution or the bacterium Xcc. I have 2 different experiments (done at different times), and after image processing I get the Area.
I would like to compute Normalized_Area = Area(t)/Area(t0) for each ecotype, for each treatment, and for each experiment (Manip), i.e. the Area at a given time divided by the Area of that ecotype at the start of the experiment (t0). Each plant has a different Area at time 0, and the different experiments have different starting times (see the expected results in the Normalized_Area column).
Please find below a piece of my data frame:
# A tibble: 24 x 6
Manip Traitment Ecotype Date Area Normalized_Area
<dbl> <chr> <chr> <dttm> <dbl> <dbl>
1 1 mock a1-2 2017-12-12 00:00:00 17699 1
2 1 mock a1-2 2017-12-13 00:00:00 24538 1.39
3 1 mock a1-2 2017-12-14 00:00:00 27958 1.58
4 1 xcc a1-2 2017-12-12 00:00:00 19857 1
5 1 xcc a1-2 2017-12-13 00:00:00 27973 1.41
6 1 xcc a1-2 2017-12-14 00:00:00 35875 1.81
7 2 mock a1-2 2018-03-20 00:00:00 18177 1
8 2 mock a1-2 2018-03-21 00:00:00 20251 1.11
9 2 mock a1-2 2018-03-23 00:00:00 36679 2.02
10 2 xcc a1-2 2018-03-20 00:00:00 17261 1
11 2 xcc a1-2 2018-03-21 00:00:00 18697 1.08
12 2 xcc a1-2 2018-03-23 00:00:00 35345 2.05
13 1 mock a1-10 2017-12-12 00:00:00 22853 1
14 1 mock a1-10 2017-12-13 00:00:00 34641 1.52
15 1 mock a1-10 2017-12-14 00:00:00 40311 1.76
16 1 xcc a1-10 2017-12-12 00:00:00 23754 1
17 1 xcc a1-10 2017-12-13 00:00:00 33247 1.40
18 1 xcc a1-10 2017-12-14 00:00:00 40603 1.71
19 2 mock a1-10 2018-03-20 00:00:00 28201 1
20 2 mock a1-10 2018-03-21 00:00:00 30306 1.07
21 2 mock a1-10 2018-03-23 00:00:00 49086 1.74
22 2 xcc a1-10 2018-03-20 00:00:00 27217 1
23 2 xcc a1-10 2018-03-21 00:00:00 29844 1.10
24 2 xcc a1-10 2018-03-23 00:00:00 46540 1.71
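For example, for Manip 1 / mock / a1-2, the value on the second day is 24538 / 17699 ≈ 1.39, which matches the expected Normalized_Area column.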
I wrote a piece of code using for loops, but it raises some errors, and I would like to turn it into more readable dplyr code.
date_debut = c("2017-12-12", "2018-03-20") # starting date of each experiment
data$Normalized_Area = NA
for (manips in levels(as.factor(data$Manip))) {             # for each manip
  for (ecoty in levels(as.factor(data$Ecotype))) {          # for each ecotype
    for (traity in levels(as.factor(data$Traitment))) {     # for each treatment
      for (dd in levels(as.factor(date_debut))) {           # for each starting date
        # create a temporary subset for this manip/ecotype/treatment
        tmp = subset(data, subset = Traitment == traity & Ecotype == ecoty & Manip == manips)
        if (nrow(tmp) != 0) {
          # compute the Area at t0 for this experiment
          if (dd %in% as.character(tmp$Date)) {
            A0 = tmp$Area[as.character(tmp$Date) == dd]     # A0 = Area at the starting date dd
            Norm_Area = tmp$Area / A0
            data$Normalized_Area[data$Traitment == traity & data$Ecotype == ecoty & data$Manip == manips] = Norm_Area
          }
        }
      }
    }
  }
}
Here is the beginning of my new code, but I am stuck:
gpeData %>%
  group_by(Traitment, Ecotype, Manip) %>%
  mutate(Normalized_Area = Area / Area[which(Date %in% date_debut)])
Does anyone have an idea how to do this? I apologize for the ugly code; I am self-taught.
You were very close to solving the problem yourself. Here is my solution: I used which.min() to find the index of the earliest date in each group, then used that index value in the calculation.
gpeData<-structure(list(Manip = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
Traitment = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("mock", "xcc"), class = "factor"), Ecotype = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("a1-10", "a1-2"
), class = "factor"), Date = structure(c(1513036800, 1513123200,
1513209600, 1513036800, 1513123200, 1513209600, 1521504000,
1521590400, 1521763200, 1521504000, 1521590400, 1521763200,
1513036800, 1513123200, 1513209600, 1513036800, 1513123200,
1513209600, 1521504000, 1521590400, 1521763200, 1521504000,
1521590400, 1521763200), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
Area = c(17699L, 24538L, 27958L, 19857L, 27973L, 35875L,
18177L, 20251L, 36679L, 17261L, 18697L, 35345L, 22853L, 34641L,
40311L, 23754L, 33247L, 40603L, 28201L, 30306L, 49086L, 27217L,
29844L, 46540L), Normalized_Area = c(1, 1.39, 1.58, 1, 1.41,
1.81, 1, 1.11, 2.02, 1, 1.08, 2.05, 1, 1.52, 1.76, 1, 1.4,
1.71, 1, 1.07, 1.74, 1, 1.1, 1.71)), row.names = c(NA, -24L
), class = "data.frame")
library(dplyr)
ans <- gpeData %>%
  group_by(Traitment, Ecotype, Manip) %>%
  mutate(NormArea = Area[which.min(Date)], Normalized = Area / NormArea)
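For completeness, here is a one-step variant (a sketch of mine, not part of the original answer) that skips the helper column and checks the result against the Normalized_Area values already in the data:
ans2 <- gpeData %>%
  group_by(Traitment, Ecotype, Manip) %>%
  # divide by the Area at the earliest Date within each group
  mutate(Normalized = Area / Area[which.min(Date)]) %>%
  ungroup()
# should agree with the expected column up to rounding
all.equal(round(ans2$Normalized, 2), gpeData$Normalized_Area)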
I have some sequence event data for which I want to plot the trend of missingness in value across time. Example below:
id time value
1 aa122 1 1
2 aa2142 1 1
3 aa4341 1 1
4 bb132 1 2
5 bb2181 2 1
6 bb3242 2 3
7 bb3321 2 NA
8 cc122 2 1
9 cc2151 2 2
10 cc3241 3 1
11 dd161 3 3
12 dd2152 3 NA
13 dd3282 3 NA
14 ee162 3 1
15 ee2201 4 2
16 ee3331 4 NA
17 ff1102 4 NA
18 ff2141 4 NA
19 ff3232 5 1
20 gg142 5 3
21 gg2192 5 NA
22 gg3311 5 NA
23 gg4362 5 NA
24 ii111 5 NA
The NAs are supposed to increase over time (the behaviors are fading). How do I plot the NAs across time?
I think this is what you're looking for: you want to see how many NAs appear over time. Assuming that's correct, if each time is a group, then you can count the number of NAs that appear in each group.
data:
df <- structure(list(id = structure(1:24, .Label = c("aa122", "aa2142",
"aa4341", "bb132", "bb2181", "bb3242", "bb3321", "cc122", "cc2151",
"cc3241", "dd161", "dd2152", "dd3282", "ee162", "ee2201", "ee3331",
"ff1102", "ff2141", "ff3232", "gg142", "gg2192", "gg3311", "gg4362",
"ii111"), class = "factor"), time = c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L), value = c(1L, 1L, 1L, 2L, 1L, 3L, NA, 1L, 2L, 1L, 3L,
NA, NA, 1L, 2L, NA, NA, NA, 1L, 3L, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-24L))
library(tidyverse)
library(ggplot2)
df %>%
group_by(time) %>%
summarise(sumNA = sum(is.na(value)))
# A tibble: 5 × 2
time sumNA
<int> <int>
1 1 0
2 2 1
3 3 2
4 4 3
5 5 4
You can then plot this using ggplot2
df %>%
group_by(time) %>%
summarise(sumNA = sum(is.na(value))) %>%
ggplot(aes(x=time)) +
geom_line(aes(y=sumNA))
As you can see, as time increases, the number of NAs also increases.
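Since the number of observations per time point also varies, the proportion of NAs may give a fairer picture of the fading trend. Here is a sketch of that variant (my addition, not part of the original answer):
df %>%
  group_by(time) %>%
  summarise(propNA = mean(is.na(value))) %>%  # share of missing values per time point
  ggplot(aes(x = time, y = propNA)) +
  geom_line()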
I am a novice trying to analyze trap catch data in R, and I am looking for an efficient way to loop through it by trapline. The first column is the trap ID. The second column is the trapline that each trap is associated with. The remaining columns are values related to target catch and bycatch for each visit to the traps. I want to write code that will evaluate the data during each visit for each trapline. Here is an example of the data I am working with:
Sample Data:
Data <- structure(list(Trap_ID = c(1L, 2L, 1L, 1L, 2L, 3L), Trapline = c("Cemetery",
"Cemetery", "Golf", "Church", "Church", "Church"), Target_Visit_1 = c(0L,
1L, 5L, 0L, 1L, 1L), Bycatch_Visit_1 = c(3L, 2L, 0L, 2L, 1L,
4L), Target_Visit_2 = c(1L, 1L, 2L, 0L, 1L, 0L), Bycatch_Visit_2 = c(4L,
2L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
The number of traps per trapline varies. I have code written out for each trapline (there are 14 different traplines), but I was hoping there would be a way to consolidate it into one block of code that calculates values while the trapline stays constant and starts a new calculation when it changes to the next trapline. Here is an example of how I was finding the sum of bycatch found at the Cemetery trapline for visit 1:
CemetaryBycatch1 <- Data %>% filter(Trapline == "Cemetery") %>% select(Bycatch_Visit_1)
sum(CemetaryBycatch1)
As of right now I have code like this written out for each trapline and each visit, but with 14 traplines and 8 total visits I would like to avoid writing out so many lines of code, and I was hoping there was a way to loop through it with one block of code that calculates values (sum, mean, etc.) for each trapline.
Thanks
Does something like this help you?
You can add a filter for Trapline in between group_by and summarise_all.
Code:
library(dplyr)
Data <- structure(list(Trap_ID = c(1L, 2L, 1L, 1L, 2L, 3L), Trapline = c("Cemetery",
"Cemetery", "Golf", "Church", "Church", "Church"), Target_Visit_1 = c(0L,
1L, 5L, 0L, 1L, 1L), Bycatch_Visit_1 = c(3L, 2L, 0L, 2L, 1L,
4L), Target_Visit_2 = c(1L, 1L, 2L, 0L, 1L, 0L), Bycatch_Visit_2 = c(4L,
2L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Data %>%
group_by(Trap_ID, Trapline) %>%
summarise_all(list(sum))
Output:
#> # A tibble: 6 x 6
#> # Groups: Trap_ID [3]
#> Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
#> <int> <chr> <int> <int> <int> <int>
#> 1 1 Cemetery 0 3 1 4
#> 2 1 Church 0 2 0 0
#> 3 1 Golf 5 0 2 1
#> 4 2 Cemetery 1 2 1 2
#> 5 2 Church 1 1 1 1
#> 6 3 Church 1 4 0 0
Created on 2020-10-16 by the reprex package (v0.3.0)
Adding another row to Data:
Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
1 Cemetery 100 200 1 4
Will give you:
#> # A tibble: 6 x 6
#> # Groups: Trap_ID [3]
#> Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
#> <int> <chr> <int> <int> <int> <int>
#> 1 1 Cemetery 100 203 2 8
#> 2 1 Church 0 2 0 0
#> 3 1 Golf 5 0 2 1
#> 4 2 Cemetery 1 2 1 2
#> 5 2 Church 1 1 1 1
#> 6 3 Church 1 4 0 0
Created on 2020-10-16 by the reprex package (v0.3.0)
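As mentioned above, a filter for Trapline can slot in between group_by() and summarise_all(), and if you want one row per trapline (summing over all of its traps) something like the following sketch should work, assuming a recent dplyr with across():
# keep only the Cemetery trapline before summarising
Data %>%
  group_by(Trap_ID, Trapline) %>%
  filter(Trapline == "Cemetery") %>%
  summarise_all(list(sum))
# one row per trapline, summing the visit columns over all traps on that line
Data %>%
  group_by(Trapline) %>%
  summarise(across(starts_with(c("Target", "Bycatch")), sum))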
If I have this data
Group,start_time
1,9:05:00
1,9:07:00
1,19:09:00
1,9:00:00
1,9:00:00
1,9:02:00
2,9:05:00
2,9:07:00
2,19:09:00
2,9:00:00
2,9:00:00
2,9:02:00
and I would like to get a check column on my data like the one below. How can I do that? Thanks
Group,start_time,check
1,9:05:00,True
1,9:07:00,True
1,19:09:00,True
1,9:00:00,False
1,9:00:00,False
1,9:02:00,False
2,9:05:00,True
2,9:07:00,True
2,19:09:00,True
2,9:00:00,False
2,9:00:00,False
2,9:02:00,False
Here's a possible solution:
df = structure(list(Group = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), start_time = structure(c(4L, 5L, 1L, 2L, 2L, 3L,
4L, 5L, 1L, 2L, 2L, 3L), .Label = c("19:09:00", "9:00:00", "9:02:00",
"9:05:00", "9:07:00"), class = "factor")), class = "data.frame", row.names = c(NA, -12L))
library(dplyr)
df %>%
group_by(Group) %>%
mutate(check = as.numeric(gsub(":","",start_time)) >= cummax(as.numeric(gsub(":","",start_time)))) %>%
ungroup()
# # A tibble: 12 x 3
# Group start_time check
# <int> <fct> <lgl>
# 1 1 9:05:00 TRUE
# 2 1 9:07:00 TRUE
# 3 1 19:09:00 TRUE
# 4 1 9:00:00 FALSE
# 5 1 9:00:00 FALSE
# 6 1 9:02:00 FALSE
# 7 2 9:05:00 TRUE
# 8 2 9:07:00 TRUE
# 9 2 19:09:00 TRUE
#10 2 9:00:00 FALSE
#11 2 9:00:00 FALSE
#12 2 9:02:00 FALSE
I'm assuming that the FALSE cases are the ones where we seem to go back in time.
To compare the times, I remove the : characters and build a number from the remaining digits.
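If you prefer to compare actual times instead of the digit trick, a lubridate-based sketch (my addition, assuming the same df) would be:
library(dplyr)
library(lubridate)
df %>%
  group_by(Group) %>%
  # parse the "H:MM:SS" strings into periods, convert to seconds, then compare to the running maximum
  mutate(secs = period_to_seconds(hms(as.character(start_time))),
         check = secs >= cummax(secs)) %>%
  select(-secs) %>%
  ungroup()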
Using the timestamp of each produced unit, I want to check in which shift it was produced. Production is carried out in two shifts per day; the shift timings are 06:00 to 18:00 and 18:00 to 06:00. The shifts data frame below shows the shift planning for December.
Let me make it clearer:
2015-12-01 A shift(2015-12-01 06:00:00 to 2015-12-01 17:59:59)
2015-12-01 D shift(2015-12-01 18:00:00 to 2015-12-02 05:59:59)
2015-12-02 A shift(2015-12-02 06:00:00 to 2015-12-02 17:59:59)
2015-12-02 D shift(2015-12-02 18:00:00 to 2015-12-03 05:59:59)
and so on..
head(shifts)
date day_shift night_shift
1 2015-12-01 A D
2 2015-12-02 A D
3 2015-12-03 B A
4 2015-12-04 B A
5 2015-12-05 C B
6 2015-12-06 C B
shifts <- structure(list(date = structure(1:31, .Label = c("2015-12-01",
"2015-12-02", "2015-12-03", "2015-12-04", "2015-12-05", "2015-12-06",
"2015-12-07", "2015-12-08", "2015-12-09", "2015-12-10", "2015-12-11",
"2015-12-12", "2015-12-13", "2015-12-14", "2015-12-15", "2015-12-16",
"2015-12-17", "2015-12-18", "2015-12-19", "2015-12-20", "2015-12-21",
"2015-12-22", "2015-12-23", "2015-12-24", "2015-12-25", "2015-12-26",
"2015-12-27", "2015-12-28", "2015-12-29", "2015-12-30", "2015-12-31"
), class = "factor"), day_shift = structure(c(1L, 1L, 2L, 2L,
3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L,
3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L, 4L), .Label = c("A",
"B", "C", "D"), class = "factor"), night_shift = structure(c(4L,
4L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L, 4L,
4L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L), .Label = c("A",
"B", "C", "D"), class = "factor")), .Names = c("date", "day_shift",
"night_shift"), class = "data.frame", row.names = c(NA, -31L))
In the check data frame, I have the timestamp of each unit produced. Using these timestamps, I want to determine in which shift each unit was produced.
head(check)
eventtime
1 2015-12-01 06:10:08
2 2015-12-01 10:10:24
3 2015-12-01 19:01:15
4 2015-12-02 01:54:54
5 2015-12-02 06:24:14
6 2015-12-02 08:15:47
check <- structure(list(eventtime = structure(c(1448946608, 1448961024,
1448992875, 1449017694, 1449033854, 1449040547, 1449076903, 1449085710,
1449100168, 1449119720), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = "eventtime", row.names = c(NA,
-10L), class = "data.frame")
Desired Result:
eventtime shift
1 2015-12-01 06:10:08 A
2 2015-12-01 10:10:24 A
3 2015-12-01 19:01:15 D
4 2015-12-02 01:54:54 D
5 2015-12-02 06:24:14 A
6 2015-12-02 08:15:47 A
7 2015-12-02 18:21:43 D
8 2015-12-02 20:48:30 D
9 2015-12-03 00:49:28 D
10 2015-12-03 06:15:20 B
To keep it simple, I showed only the shift plan for December. In reality I need to check the complete year.
Here's an answer using lubridate and its %within% operator to check whether a date falls within an interval. Depending on whether your raw data is actually stored as factors or not, you can simplify the code by removing some of the conversions.
library(lubridate)
day_shift_start <- as.POSIXct(shifts$date) + hms("06:00:00")
day_shift_end <- as.POSIXct(shifts$date) + hms("17:59:59")
night_shift_start <- as.POSIXct(shifts$date) + hms("18:00:00")
night_shift_end <- as.POSIXct(shifts$date) + days(1) + hms("05:59:59")
shift_intervals <- data.frame(intervals = c(interval(day_shift_start, day_shift_end),
interval(night_shift_start, night_shift_end)),
shift = c(as.character(shifts$day_shift),
as.character(shifts$night_shift)))
check$shift <- unlist(lapply(check$eventtime, function(x) {
shift_intervals$shift[x %within% shift_intervals$intervals]
}))
check
# eventtime shift
# 1 2015-12-01 06:10:08 A
# 2 2015-12-01 10:10:24 A
# 3 2015-12-01 19:01:15 D
# 4 2015-12-02 01:54:54 D
# 5 2015-12-02 06:24:14 A
# 6 2015-12-02 08:15:47 A
# 7 2015-12-02 18:21:43 D
# 8 2015-12-02 20:48:30 D
# 9 2015-12-03 00:49:28 D
# 10 2015-12-03 06:15:20 B
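As a design note: since every shift is exactly twelve hours, an alternative is to shift each timestamp back six hours to get its "shift date" and then look up the day or night column directly. A vectorized sketch of mine, assuming the same shifts and check objects (shift2 is just an illustrative new column):
library(lubridate)
# timestamps from 00:00 to 05:59 belong to the previous day's night shift
shift_date <- date(check$eventtime - hours(6))
is_day     <- hour(check$eventtime) >= 6 & hour(check$eventtime) < 18
idx        <- match(as.character(shift_date), as.character(shifts$date))
check$shift2 <- ifelse(is_day,
                       as.character(shifts$day_shift)[idx],
                       as.character(shifts$night_shift)[idx])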
I have a dataframe in long form for which I need to aggregate several observations taken on a particular day.
Example data:
long <- structure(list(Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Genotype = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), View = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor"), variable = c(1496L, 1704L,
1738L, 1553L, 1834L, 1421L, 1208L, 1845L, 1325L, 1264L, 1920L,
1735L)), .Names = c("Day", "Genotype", "View", "variable"), row.names = c(NA, -12L),
class = "data.frame")
> long
Day Genotype View variable
1 1 A 1 1496
2 1 A 2 1704
3 1 A 3 1738
4 1 B 1 1553
5 1 B 2 1834
6 1 B 3 1421
7 2 A 1 1208
8 2 A 2 1845
9 2 A 3 1325
10 2 B 1 1264
11 2 B 2 1920
12 2 B 3 1735
I need to aggregate each genotype for each day by taking the cube root of the product of the views. So for genotype A on day 1, that is (1496 * 1704 * 1738)^(1/3). The final dataframe would look like:
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
I have been going round and round with reshape2 for the last couple of days, but I'm not getting anywhere. Help appreciated!
I'd probably use plyr and ddply for this task:
library(plyr)
ddply(long, .(Day, Genotype), summarize,
summary = prod(variable) ^ (1/3))
#-----
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Or this with dcast() from reshape2:
library(reshape2)
dcast(data = long, Day + Genotype ~ .,
      value.var = "variable", fun.aggregate = function(x) prod(x) ^ (1/3))
#-----
Day Genotype NA
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Another solution, without additional packages:
aggregate(list(Summary = long$variable),
          by = list(Day = long$Day, Genotype = long$Genotype),
          function(x) prod(x)^(1 / length(x)))
Day Genotype Summary
1 1 A 1642.418
2 2 A 1434.695
3 1 B 1593.633
4 2 B 1614.790
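For completeness, the same geometric-mean aggregation as a dplyr sketch (my addition, assuming dplyr >= 1.0 for the .groups argument):
library(dplyr)
long %>%
  group_by(Day, Genotype) %>%
  # cube root of the product = geometric mean of the three views
  summarise(summary = prod(variable)^(1 / n()), .groups = "drop")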