I have two datasets that I need to merge by the ID value. The problems are:
The ID value can be repeated within the same dataset (no other unique value is available).
The two datasets do not have the same number of rows or columns.
Example:
df1
ID  Gender
99  Male
85  Female
7   Male
df2
ID  Body_Temperature  Body_Temperature_date_time
99  36                1/1/2020 12:00 am
99  38                2/1/2020 10:30 am
99  37                1/1/2020 06:41 am
52  38                1/2/2020 11:00 am
11  39                4/5/2020 09:09 pm
7   35                9/8/2020 02:30 am
How can I turn these two datasets into one single dataset in a way that allows me to apply some machine learning models on it later on?
Depending on your expected results, if you want to return all rows from each dataframe, you can use full_join from dplyr:
library(dplyr)
full_join(df2, df1, by = "ID")
Or with base R:
merge(x=df2,y=df1,by="ID",all=TRUE)
Output
ID Body_Temperature Body_Temperature_date_time Gender
1 99 36 1/1/2020 12:00 am Male
2 99 38 2/1/2020 10:30 am Male
3 99 37 1/1/2020 06:41 am Male
4 52 38 1/2/2020 11:00 am <NA>
5 11 39 4/5/2020 09:09 pm <NA>
6 7 35 9/8/2020 02:30 am Male
7 85 NA <NA> Female
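If you would rather keep only the rows that have a temperature measurement, or only the IDs present in both tables, the same merge call can be restricted. Here is a minimal base R sketch on the sample data (the date-time column is left out for brevity, and the names left_joined and inner_joined are only illustrative):

```r
df1 <- data.frame(ID = c(99L, 85L, 7L),
                  Gender = c("Male", "Female", "Male"))
df2 <- data.frame(ID = c(99L, 99L, 99L, 52L, 11L, 7L),
                  Body_Temperature = c(36L, 38L, 37L, 38L, 39L, 35L))

# Keep every measurement in df2; Gender is NA where df1 has no matching ID
left_joined <- merge(df2, df1, by = "ID", all.x = TRUE)

# Keep only the IDs that appear in both data frames
inner_joined <- merge(df2, df1, by = "ID")

nrow(left_joined)   # 6
nrow(inner_joined)  # 4
```

Because ID is unique in df1 here, each df2 row matches at most one df1 row, so no measurement is duplicated. If an ID were repeated in both tables, merge would return every matching pair of rows.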
If you have more than 2 dataframes to combine, which only overlap with the ID column, then you can use reduce on a dataframe list (so put all the dataframes that you want to combine into a list):
library(tidyverse)
df_list <- list(df1, df2)
multi_full <- reduce(df_list, function(x, y, ...)
full_join(x, y, by = "ID", ...))
Or Reduce with base R:
df_list <- list(df1, df2)
multi_full <- Reduce(function(x, y, ...)
merge(x, y, by = "ID", all = TRUE, ...), df_list)
Data
df1 <- structure(list(ID = c(99L, 85L, 7L), Gender = c("Male", "Female",
"Male")), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(ID = c(99L, 99L, 99L, 52L, 11L, 7L), Body_Temperature = c(36L,
38L, 37L, 38L, 39L, 35L), Body_Temperature_date_time = c("1/1/2020 12:00 am",
"2/1/2020 10:30 am", "1/1/2020 06:41 am", "1/2/2020 11:00 am",
"4/5/2020 09:09 pm", "9/8/2020 02:30 am")), class = "data.frame", row.names = c(NA,
-6L))
I am not 100% sure how to formulate my question because I don't know the formal names for what I am trying to do with my dataset. Based on previous questions, there appears to be some way to do what I am trying to do, but I am unable to make the logical jump from their problem to my own.
I have attached a sample of my data here.
The first thing I did with my data was add a column indicating which species (sps) are predators (coded as 1) and which species are prey (coded as 0).
#specify which are predators and prey
d1 = d1 %>%
group_by(sps) %>% #grouped by species
mutate(pp=ifelse(sps %in% c("MUXX", "MUVI","MEME"), 1,0)) #mutate to specify predators as 1 and prey as 0
My data is structured as such:
head(d1) #visualize the first few lines of the data
# A tibble: 6 x 8
# Groups: sps [4]
ID date km culv.id type sps time pp
<int> <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <dbl>
1 2012-06-19 80 A DCC MICRO 2:19 0
2 2012-06-21 80 A DCC MUXX 23:23 1
3 2012-07-15 80 A DCC MAMO 11:38 0
4 2012-07-20 80 A DCC MICRO 22:19 0
5 2012-07-29 80 A DCC MICRO 23:03 0
6 2012-08-07 80 A DCC PRLO 2:04 0
Here is also the output for dput(head(d1)):
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20", "2012-07-29", "2012-08-07" ), km = c(80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A", "A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX", "MAMO", "MICRO", "MICRO", "PRLO" ), time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04" ), pp = c(0, 1, 0, 0, 0, 0)), .Names = c("ID", "date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA, 6L ), class = "data.frame")
I also converted the time and date using the following code:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
The (most) relevant columns are date, time, and pp (where 1 = predator species and 0 = prey species).
I am now trying to figure out how to extract the following information (average +/- std):
average time between prey-prey observations
average time between prey-predator observations
average time between predator-predator observations
average time between predator-prey observations
To put one of these examples (#2) into words:
What is the average time between when a prey species (pp = 0) is first seen followed by a predator species (pp = 1)?
I am trying to figure out how to do this for my dataset overall first. I think that once I figure out how to do that, it should be fairly straightforward to restrict the data.
Here is a data.table (and lubridate) version that might address your problem:
Using a selection of your posted data (posted at bottom), with a slight modification to your datetime creation so that the format works with data.table:
d1$datetime <- as.POSIXct(strptime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M'))
Convert to a data table:
d1 <- as.data.table(d1)
Calculate, for each observation, the time to the first later observation whose pp value is equal (same class, prey or predator), greater (prey to pred), or smaller (pred to prey):
d1$class.class <- d1[d1, difftime(x.datetime, i.datetime, units = "days"),
on = .(datetime > datetime, pp == pp), mult = "first"]
d1$prey.pred <-d1[d1, x.datetime - i.datetime,
on = .(datetime > datetime, pp > pp ), mult = "first"]
d1$pred.prey <- d1[d1, x.datetime - i.datetime,
on = .(datetime > datetime, pp < pp), mult = "first"]
Gives you a column for each:
> head(d1[, 7:ncol(d1)])
time pp datetime class.class prey.pred pred.prey
1: 2:19 0 2012-06-19 02:19:00 26.388194 days 2.877778 days NA days
2: 23:23 1 2012-06-21 23:23:00 74.177083 days NA days 23.51042 days
3: 11:38 0 2012-07-15 11:38:00 5.445139 days 50.666667 days NA days
4: 22:19 0 2012-07-20 22:19:00 9.030556 days 45.221528 days NA days
5: 23:03 0 2012-07-29 23:03:00 8.125694 days 36.190972 days NA days
6: 2:04 0 2012-08-07 02:04:00 1.911111 days 28.065278 days NA days
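To make it clearer what those three joins compute, here is a rough base R sketch of the same idea on the first four observations. The helper next_gap is hypothetical (it is not part of data.table), and I fix the time zone to UTC only so the arithmetic is reproducible:

```r
obs <- data.frame(
  datetime = as.POSIXct(c("2012-06-19 02:19", "2012-06-21 23:23",
                          "2012-07-15 11:38", "2012-07-20 22:19"),
                        format = "%Y-%m-%d %H:%M", tz = "UTC"),
  pp = c(0, 1, 0, 0)
)

# For each row, days until the earliest later row whose pp compares to this
# row's pp as requested by cmp (`==`, `>` or `<`); NA if there is no such row
next_gap <- function(datetime, pp, cmp) {
  vapply(seq_along(datetime), function(i) {
    later <- which(datetime > datetime[i] & cmp(pp, pp[i]))
    if (length(later) == 0) return(NA_real_)
    j <- later[which.min(datetime[later])]
    as.numeric(difftime(datetime[j], datetime[i], units = "days"))
  }, numeric(1))
}

obs$class.class <- next_gap(obs$datetime, obs$pp, `==`)  # same class
obs$prey.pred   <- next_gap(obs$datetime, obs$pp, `>`)   # prey, then predator
obs$pred.prey   <- next_gap(obs$datetime, obs$pp, `<`)   # predator, then prey
obs
```

This loops over the rows, so it will be much slower than the join on a large table, but it should give the same numbers for the rows shown above (for example 2.877778 days for prey.pred in the first row).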
And you can get summary statistics as you like:
d1[, .(mean.same.class = mean(class.class, na.rm = TRUE),
sd.same.class = sd(class.class, na.rm = TRUE),
mean.prey.pred = mean(prey.pred, na.rm = TRUE),
sd.prey.pred = sd(prey.pred, na.rm = TRUE),
mean.pred.prey = mean(pred.prey, na.rm = TRUE),
sd.pred.prey = sd(pred.prey, na.rm = TRUE)), by = sps]
sps mean.same.class sd.same.class mean.prey.pred sd.prey.pred mean.pred.prey sd.pred.prey
1: MICRO 7.886237 days 8.0547631 18.80733 days 15.504646 NaN days NA
2: MUXX 42.073611 days 45.4011658 NaN days NA 13.01366 days 9.315697
3: MAMO 5.445139 days NA 50.66667 days NA NaN days NA
4: PRLO 2.475694 days 0.7984414 26.62708 days 2.033914 NaN days NA
5: LEAM 2.897222 days NA 10.11597 days NA NaN days NA
Libraries: data.table, lubridate
Data:
> dput(d1)
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15200L, 15201L, 15199L, 15177L, 15178L, 15204L, 15205L
), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20",
"2012-07-29", "2012-08-07", "2012-08-08", "2012-08-09", "2012-08-13",
"2012-08-13", "2012-08-25", "2012-08-27", "2012-09-04", "2012-09-09",
"2012-09-11", "2012-09-14", "2012-09-23", "2012-09-26", "2012-09-27"
), km = c(80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L,
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC",
"DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC",
"DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX",
"MAMO", "MICRO", "MICRO", "PRLO", "MICRO", "PRLO", "MICRO", "MICRO",
"LEAM", "MICRO", "MUXX", "MICRO", "MICRO", "MUXX", "MICRO", "MICRO",
"MICRO"), time = c("2:19", "23:23", "11:38", "22:19", "23:03",
"2:04", "23:56", "23:06", "0:04", "0:46", "0:51", "22:23", "3:38",
"21:08", "0:40", "2:55", "22:09", "20:46", "3:20"), pp = c(0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0)), class = "data.frame", .Names = c("ID",
"date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA,
-19L))
Edit:
I'm not really sure about mixing tidyverse and data.table ideologies, but you could potentially do what you described in comments using do. For example, make a modified version of df:
d1 <- as.data.table(d1)
d1$datetime <- as.POSIXct(strptime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M'))
d1Mod <- d1
d1Mod$km[10:nrow(d1Mod)] <- 90
Then, define the data.table bit as a function:
foo <- function(df_) {
df_$class.class <- df_[df_, difftime(x.datetime, i.datetime, units = "days"),
on = .(datetime > datetime, pp == pp), mult = "first"]
df_$prey.pred <-df_[df_, x.datetime - i.datetime,
on = .(datetime > datetime, pp > pp ), mult = "first"]
df_$pred.prey <- df_[df_, x.datetime - i.datetime,
on = .(datetime > datetime, pp < pp), mult = "first"]
return(df_)
}
Running d1 %>% group_by(km) %>% do(foo(as.data.table(.))) gets you the same output as in the original answer above (since all km values are 80). If you run it on the modified d1Mod you get an output that looks like it has been grouped by km:
> d1Mod %>%
+ group_by(km) %>%
+ do(foo(as.data.table(.)))
# A tibble: 19 x 12
# Groups: km [2]
ID date km culv.id type sps time pp datetime class.class prey.pred pred.prey
<int> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dttm> <time> <time> <time>
1 1 2012-06-19 80 A DCC MICRO 2:19 0 2012-06-19 02:19:00 26.3881944 days 2.877778 days NA days
2 2 2012-06-21 80 A DCC MUXX 23:23 1 2012-06-21 23:23:00 NA days NA days 23.510417 days
3 3 2012-07-15 80 A DCC MAMO 11:38 0 2012-07-15 11:38:00 5.4451389 days NA days NA days
4 4 2012-07-20 80 A DCC MICRO 22:19 0 2012-07-20 22:19:00 9.0305556 days NA days NA days
5 5 2012-07-29 80 A DCC MICRO 23:03 0 2012-07-29 23:03:00 8.1256944 days NA days NA days
6 8 2012-08-07 80 A DCC PRLO 2:04 0 2012-08-07 02:04:00 1.9111111 days NA days NA days
7 9 2012-08-08 80 A DCC MICRO 23:56 0 2012-08-08 23:56:00 0.9652778 days NA days NA days
8 10 2012-08-09 80 A DCC PRLO 23:06 0 2012-08-09 23:06:00 3.0402778 days NA days NA days
9 11 2012-08-13 80 A DCC MICRO 0:04 0 2012-08-13 00:04:00 NA days NA days NA days
10 12 2012-08-13 90 A DCC MICRO 0:46 0 2012-08-13 00:46:00 12.0034722 days 22.119444 days NA days
11 13 2012-08-25 90 A DCC LEAM 0:51 0 2012-08-25 00:51:00 2.8972222 days 10.115972 days NA days
12 14 2012-08-27 90 A DCC MICRO 22:23 0 2012-08-27 22:23:00 12.9479167 days 7.218750 days NA days
13 15200 2012-09-04 90 A DCC MUXX 3:38 1 2012-09-04 03:38:00 9.9701389 days NA days 5.729167 days
14 15201 2012-09-09 90 A DCC MICRO 21:08 0 2012-09-09 21:08:00 1.1472222 days 4.240972 days NA days
15 15199 2012-09-11 90 A DCC MICRO 0:40 0 2012-09-11 00:40:00 12.8951389 days 3.093750 days NA days
16 15177 2012-09-14 90 A DCC MUXX 2:55 1 2012-09-14 02:55:00 NA days NA days 9.801389 days
17 15178 2012-09-23 90 A DCC MICRO 22:09 0 2012-09-23 22:09:00 2.9423611 days NA days NA days
18 15204 2012-09-26 90 A DCC MICRO 20:46 0 2012-09-26 20:46:00 0.2736111 days NA days NA days
19 15205 2012-09-27 90 A DCC MICRO 3:20 0 2012-09-27 03:20:00 NA days NA days NA days
However, you'll have to do some checking to make sure that the calculations are actually doing what you need. I don't have example output or actual km/year info to check these results against (read: I don't know what I'm looking at!).
Note also that I think arrange is irrelevant for the operations here, considering that the datetime gets sorted in the function.
I'll use the sample from the comments as an example:
d1 = structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20", "2012-07-29", "2012-08-07" ), km = c(80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A", "A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX", "MAMO", "MICRO", "MICRO", "PRLO" ), time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04" ), pp = c(0, 1, 0, 0, 0, 0)), .Names = c("ID", "date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA, 6L ), class = "data.frame")
We add the datetime column just as you specified:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE)
First, add a column indicating which prey/predator sequence occurred and the time between observations (we remove the first row because there is no information about the previous observation). Note that timedif is a numeric value giving the number of days.
d1 = d1 %>% mutate(prev = lag(pp))
d1 = d1 %>% mutate(timedif = as.numeric(difftime(as.POSIXct(datetime), lag(as.POSIXct(datetime)), units = "days"))) #units fixed to days so the values stay comparable
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,prev)))
At this point, your table looks like
> d1
ID date km culv.id type sps time pp datetime prev timedif seque
1 2 2012-06-21 80 A DCC MUXX 23:23 1 2012-06-21 23:23 0 2.877778 10
2 3 2012-07-15 80 A DCC MAMO 11:38 0 2012-07-15 11:38 1 23.510417 01
3 4 2012-07-20 80 A DCC MICRO 22:19 0 2012-07-20 22:19 0 5.445139 00
4 5 2012-07-29 80 A DCC MICRO 23:03 0 2012-07-29 23:03 0 9.030556 00
5 8 2012-08-07 80 A DCC PRLO 2:04 0 2012-08-07 02:04 0 8.125694 00
After that, just take the wanted statistics for each group by using
avg = d1 %>% group_by(seque) %>% summarise(mean(timedif))
sdevs = d1 %>% group_by(seque) %>% summarise(sd(timedif))
We obtain
> avg
# A tibble: 3 x 2
seque `mean(timedif)`
<fctr> <dbl>
1 00 7.533796
2 01 23.510417
3 10 2.877778
> sdevs
# A tibble: 3 x 2
seque `sd(timedif)`
<fctr> <dbl>
1 00 1.864554
2 01 NA
3 10 NA
Note that the standard deviation is not computed because we only have one observation in the sample dataset for these categories.
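If the 00/01/10 codes are hard to read, the same consecutive-pair idea can be sketched in base R with readable labels. This is only an illustration on the six sample rows; the names gap_days and transition are my own, and the time zone is set to UTC just to keep the arithmetic reproducible:

```r
datetime <- as.POSIXct(c("2012-06-19 02:19", "2012-06-21 23:23",
                         "2012-07-15 11:38", "2012-07-20 22:19",
                         "2012-07-29 23:03", "2012-08-07 02:04"),
                       format = "%Y-%m-%d %H:%M", tz = "UTC")
pp <- c(0, 1, 0, 0, 0, 0)
n <- length(pp)

# Days between each observation and the one immediately before it
gap_days <- as.numeric(difftime(datetime[-1], datetime[-n], units = "days"))

# Label each pair as "<previous>-<current>", e.g. "prey-predator"
class_name <- c("prey", "predator")
transition <- paste(class_name[pp[-n] + 1], class_name[pp[-1] + 1], sep = "-")

mean_gap <- sapply(split(gap_days, transition), mean)
sd_gap   <- sapply(split(gap_days, transition), sd)
mean_gap
sd_gap
```

Here "prey-predator" corresponds to seque 10 above (current pp = 1, previous pp = 0), and the standard deviation is again NA wherever a transition occurs only once.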
I would like to be able to aggregate survey data collected over a range of days into a unique period. For example, I'd like to combine the first three dates (2015-03-17, 2015-03-23, 2015-03-26) into the period "March 2015". I will then use these combined dates to produce boxplots which show "Average.Count" for that period.
All up I would like to make 4 unique periods:
March 2015 (first 3 dates as per table below)
September 2015 (dates 4,5 as per table below)
March 2016 (dates 6-15 as per table below)
September 2016 (dates 16-23 as per table below)
Here are the dataset headings.
head(Survival.Pre.Harvest)
Bay.Unique Date Average.Count Total.Predators Time Previous.Average.Count
2 1 2015-03-17 346.9 2 0 NA
3 1 2015-09-14 326.6 8 181 346.9
4 1 2016-02-29 322.6 3 349 326.6
7 2 2015-03-17 326.4 2 0 NA
8 2 2015-09-14 288.8 4 181 326.4
9 2 2016-02-29 271.4 6 349 288.8
These are the unique dates within the dataset.
table(Survival.Pre.Harvest$Date)
2015-03-17 2015-03-23 2015-03-26 2015-09-14 2015-09-15 2016-02-24 2016-02-25 2016-02-26 2016-02-29
9 3 1 9 3 4 6 6 5
2016-03-01 2016-03-02 2016-03-03 2016-03-04 2016-03-22 2016-03-23 2016-09-12 2016-09-13 2016-09-14
3 6 3 6 6 2 6 6 4
2016-09-20 2016-09-22 2016-10-18 2016-10-19 2016-10-20
7 10 4 3 14
Thanks in advance!
dput(head(Survival.Pre.Harvest))
structure(list(Bay.Unique = c(1, 1, 1, 2, 2, 2), Date = structure(c(16511,
16692, 16860, 16511, 16692, 16860), class = "Date"), Average.Count = c(346.9,
326.6, 322.6, 326.4, 288.8, 271.4), Total.Predators = c(2L, 8L,
3L, 2L, 4L, 6L), Time = c(0, 181, 349, 0, 181, 349), Previous.Average.Count = c(NA,
346.9, 326.6, NA, 326.4, 288.8)), .Names = c("Bay.Unique", "Date",
"Average.Count", "Total.Predators", "Time", "Previous.Average.Count"
), row.names = c(2L, 3L, 4L, 7L, 8L, 9L), class = "data.frame")
This should work:
library(lubridate)
library(ggplot2)
Survival.Pre.Harvest$Date <- ymd(Survival.Pre.Harvest$Date)
bks = ymd("2015-01-01", "2015-08-31", "2016-01-01", "2016-08-31", "2017-01-01")
lbs <- c("Mar2015", "Sep2015", "Mar2016", "Sep2016")
Survival.Pre.Harvest$yearmonth <- cut.Date(Survival.Pre.Harvest$Date, breaks = bks, labels = lbs)
ggplot(Survival.Pre.Harvest, aes(x=yearmonth, y=Average.Count)) + geom_boxplot()
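To see how the breaks map dates to periods without needing the full dataset, here is a small base R sketch using four of the survey dates listed in the question (the vector names are just for illustration):

```r
dates <- as.Date(c("2015-03-17", "2015-09-14", "2016-02-29", "2016-10-20"))

# Each interval runs from one break up to (but not including) the next
bks <- as.Date(c("2015-01-01", "2015-08-31", "2016-01-01",
                 "2016-08-31", "2017-01-01"))
lbs <- c("Mar2015", "Sep2015", "Mar2016", "Sep2016")

period <- cut(dates, breaks = bks, labels = lbs)
as.character(period)
```

Each date falls into exactly one interval, so the February 2016 date ends up labelled Mar2016 and the October 2016 date Sep2016, which is the grouping described in the question. Any date outside the outer breaks would come back as NA.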
I have a data frame that looks like the following:
system Id initial final
665 9 16001 6070 6071
683 10 16001 6100 6101
696 11 16001 6101 6113
712 10 16971 6150 6151
715 11 16971 6151 6163
4966 7 4118 10238 10242
5031 9 4118 10260 10278
5088 10 4118 10279 10304
5115 11 4118 10305 10317
structure(list(system = c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L,
11L), Id = c(16001L, 16001L, 16001L, 16971L, 16971L, 4118L, 4118L,
4118L, 4118L), initial = c(6070, 6100, 6101, 6150, 6151, 10238,
10260, 10279, 10305), final = c(6071, 6101, 6113, 6151, 6163,
10242, 10278, 10304, 10317)), .Names = c("system", "Id", "initial",
"final"), row.names = c(665L, 683L, 696L, 712L, 715L, 4966L,
5031L, 5088L, 5115L), class = "data.frame")
I would like to get a new data frame with the following structure:
Id system length initial final
1 16001 9,10,11 3 6070 6113
2 16971 10,11 2 6150 6163
3 4118 7 1 10238 10242
4 4118 9,10,11 3 10260 10317
structure(list(Id = c(16001L, 16971L, 4118L, 4118L), system = structure(c(3L,
1L, 2L, 3L), .Label = c("10,11", "7", "9,10,11"), class = "factor"),
length = c(3L, 2L, 1L, 3L), initial = c(6070L, 6150L, 10238L,
10260L), final = c(6113, 6163, 10242, 10317)), .Names = c("Id",
"system", "length", "initial", "final"), class = "data.frame", row.names = c(NA,
-4L))
The grouping is by Id, with consecutive rows belonging to the same group when the difference in the "system" field between them equals one. I would also like to list the "system" values involved in each group and how many there are. Finally, I would like columns with the first "initial" and the last "final" of each group.
Is it possible to do that in R?
Thanks.
You could use data.table. Convert the "data.frame" to a "data.table" (setDT), create a grouping variable "indx" by taking the difference of adjacent elements of "system" (diff(system)), checking where it is not 1, and taking the cumsum of that logical vector. Then use "Id" and "indx" as grouping variables to get the statistics.
library(data.table)
setDT(df)[,list(system=toString(system), length=.N, initial=initial[1L],
final=final[.N]), by=list(Id,indx=cumsum(c(TRUE, diff(system)!=1)))][,
indx:=NULL][]
# Id system length initial final
#1: 16001 9, 10, 11 3 6070 6113
#2: 16971 10, 11 2 6150 6163
#3: 4118 7 1 10238 10242
#4: 4118 9, 10, 11 3 10260 10317
Or based on @jazzurro's comment about using first/last functions from dplyr,
library(dplyr)
df %>%
group_by(indx=cumsum(c(TRUE, diff(system)!=1)), Id) %>%
summarise(system=toString(system), length=n(),
initial=first(initial), final=last(final))
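In case the cumsum(diff(...)) trick is not obvious, here is a base R sketch that prints the run index and then summarises each run. One difference from the code above, and this is my own variation: I also start a new run whenever Id changes, so the index alone identifies a group.

```r
df <- data.frame(
  system  = c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L, 11L),
  Id      = c(16001L, 16001L, 16001L, 16971L, 16971L, 4118L, 4118L, 4118L, 4118L),
  initial = c(6070, 6100, 6101, 6150, 6151, 10238, 10260, 10279, 10305),
  final   = c(6071, 6101, 6113, 6151, 6163, 10242, 10278, 10304, 10317)
)

# TRUE marks the first row of a new run: system did not increase by exactly 1,
# or Id changed; cumsum turns those marks into a run number per row
indx <- cumsum(c(TRUE, diff(df$system) != 1 | diff(df$Id) != 0))
indx  # 1 1 1 2 2 3 4 4 4

# One summary row per run
res <- do.call(rbind, lapply(split(df, indx), function(g) {
  data.frame(Id = g$Id[1],
             system = toString(g$system),
             length = nrow(g),
             initial = g$initial[1],
             final = g$final[nrow(g)])
}))
res
```

The split/lapply step is only there to avoid extra packages; on a large data frame the data.table or dplyr versions will be faster.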
A solution using plyr instead of data.table:
library(plyr)
func = function(subdf)
{
bool = c(diff(subdf$system),1)==1
ldply(split(subdf, bool), function(u){
data.frame(system = paste(u$system, collapse=','),
Id = unique(u$Id),
length = nrow(u),
initial= head(u,1)$initial,
final = tail(u,1)$final)
})
}
ldply(split(df, df$Id), func)
# .id system length Id initial final
#1 FALSE 7 1 4118 10238 10242
#2 TRUE 9,10,11 3 4118 10260 10317
#3 TRUE 9,10,11 3 16001 6070 6113
#4 TRUE 10,11 2 16971 6150 6163