I am trying to add two variables to my dataset from another dataset of a different length. I have a coral reef survey dataset that is missing the start and end times of each dive per site and zone of survey.
Additionally, I have a table containing the start and end times of each dive per site and zone:
This table repeats the wpt (site) because 2 zones are measured per site, so each row of this table is unique. In my own dataset I have many more repetitions of wpt because there are several observations in the same site and zone. I need to match the unique rows of mergingdata against my fishdata so that the merge returns the start and end times from mergingdata. So I want to match and merge by "wpt" and by "zone".
This is what I have tried:
merge <- merge(fishdata, mergingdata, by = "wpt", all = TRUE, sort = FALSE)
but this only merges by wpt, and my output gets an extra column called zone.y. Is there a way in which I can merge by the unique combination of the 2 variables "wpt" and "zone"?
Thank you!
The documentation of merge (see help(merge)) says:
By default the data frames are merged on the columns with names they
both have, but separate specifications of the columns can be given by
by.x and by.y.
As you have both id columns in both data.frames, merge() will combine the data using those common columns. So, omitting the by argument in your code should work:
merge <- merge(fishdata, mergingdata, all = TRUE, sort = FALSE)
However, you can also specify the identifier columns explicitly using the by, by.x and by.y parameters, as follows:
merge <- merge(fishdata, mergingdata, by = c("wpt", "zone"), all = TRUE, sort = FALSE)
EDIT
Looking at the edits to your post, I see that your data has the following structure:
fishdata <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "23.11.2014", class = "factor"),
entry = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "shore", class = "factor"),
wpt = c(2L, 2L, 2L, 2L, 2L, 2L), zone = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = "DO", class = "factor"), transect = c(1L,
1L, 1L, 1L, 1L, 1L), gps = c(NA, NA, NA, NA, NA, NA), surveyor = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "ev", class = "factor"), depth_code = c(NA,
NA, NA, NA, NA, NA), phase = structure(c(2L, 2L, 1L, 1L,
1L, 1L), .Label = c("S_PRIN", "S_STOP"), class = "factor"),
species = structure(c(2L, 1L, 2L, 2L, 1L, 1L), .Label = c("IP",
"TP"), class = "factor"), family = c(NA, NA, NA, NA, NA,
NA)), .Names = c("date", "entry", "wpt", "zone", "transect",
"gps", "surveyor", "depth_code", "phase", "species", "family"
), class = "data.frame", row.names = c(NA, -6L))
mergingdata <- structure(list(start.time = c(10.34, 10.57, 10, 10.24, 9.15,
9.39), end.time = c(10.5, 11.1, 10.2, 10.4, 9.3, 9.5), wpt = c(2L,
2L, 3L, 3L, 4L, 4L), zone = structure(c(1L, 2L, 1L, 2L, 1L, 2L
), .Label = c("DO", "LT"), class = "factor")), .Names = c("start.time",
"end.time", "wpt", "zone"), class = "data.frame", row.names = c(NA,
-6L))
Assuming that the dataset structures are correct...
> fishdata
date entry wpt zone transect gps surveyor depth_code phase species family
1 23.11.2014 shore 2 DO 1 NA ev NA S_STOP TP NA
2 23.11.2014 shore 2 DO 1 NA ev NA S_STOP IP NA
3 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN TP NA
4 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN TP NA
5 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN IP NA
6 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN IP NA
> mergingdata
start.time end.time wpt zone
1 10.34 10.5 2 DO
2 10.57 11.1 2 LT
3 10.00 10.2 3 DO
4 10.24 10.4 3 LT
5 9.15 9.3 4 DO
6 9.39 9.5 4 LT
I do the merge as follows:
> merge(x = fishdata, y = mergingdata, all.x = TRUE)
wpt zone date entry transect gps surveyor depth_code phase species family start.time end.time
1 2 DO 23.11.2014 shore 1 NA ev NA S_STOP TP NA 10.34 10.5
2 2 DO 23.11.2014 shore 1 NA ev NA S_STOP IP NA 10.34 10.5
3 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN TP NA 10.34 10.5
4 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN TP NA 10.34 10.5
5 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN IP NA 10.34 10.5
6 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN IP NA 10.34 10.5
Note that I use all.x = TRUE, because what we want is all the rows from the x object (fishdata) merged with the extra columns of the y object (mergingdata), using the common columns of both objects as the key.
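If you prefer the tidyverse, a roughly equivalent left join is sketched below (an alternative to the merge() call above, assuming the same fishdata and mergingdata objects):
library(dplyr)

# Keep every row of fishdata and attach start.time / end.time from mergingdata,
# matching on the combination of wpt and zone.
merged <- left_join(fishdata, mergingdata, by = c("wpt", "zone"))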
I want to interpolate missing values using dplyr, piping, and approx().
Data:
test <- structure(list(site = structure(c(3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L), .Label = c("lake", "stream", "wetland"), class = "factor"),
depth = c(0L, -3L, -4L, -8L, 0L, -1L, -3L, -5L, 0L, -2L,
-4L, -6L), var1 = c(1L, NA, 3L, 4L, 1L, 2L, NA, 4L, 1L, NA,
NA, 4L), var2 = c(1L, NA, 3L, 4L, NA, NA, NA, NA, NA, 2L,
NA, NA)), .Names = c("site", "depth", "var1", "var2"), class = "data.frame", row.names = c(NA,
-12L))
This code works:
library(tidyverse)

# interpolate missing var1 values for each site using approx()
test_int <- test %>%
  group_by(site) %>%
  mutate_at(vars(c(var1)),
            funs("i" = approx(depth, ., depth, rule = 1, method = "linear")[["y"]]))
But the code no longer works if it encounters a grouping (site & var) that doesn't have at least 2 non-NA values, e.g.,
# here I'm trying to interpolate missing values for var1 & var2
test_int2 <- test %>%
  group_by(site) %>%
  mutate_at(vars(c(var1, var2)),
            funs("i" = approx(depth, ., depth, rule = 1, method = "linear")[["y"]]))
R appropriately throws this error:
Error in mutate_impl(.data, dots) :
Evaluation error: need at least two non-NA values to interpolate.
How do I include a conditional statement or filter so that it only tries to interpolate cases where the site has at least 2 non-NA values and skips the rest or returns NA for those?
This will do what you are looking for...
test_int2 <- test %>%
  group_by(site) %>%
  mutate_at(vars(c(var1, var2)),
            funs("i" = if (sum(!is.na(.)) > 1)
                         approx(depth, ., depth, rule = 1, method = "linear")[["y"]]
                       else
                         NA))
test_int2
# A tibble: 12 x 6
# Groups: site [3]
site depth var1 var2 var1_i var2_i
<fctr> <int> <int> <int> <dbl> <dbl>
1 wetland 0 1 1 1.0 1.0
2 wetland -3 NA NA 2.5 2.5
3 wetland -4 3 3 3.0 3.0
4 wetland -8 4 4 4.0 4.0
5 lake 0 1 NA 1.0 NA
6 lake -1 2 NA 2.0 NA
7 lake -3 NA NA 3.0 NA
8 lake -5 4 NA 4.0 NA
9 stream 0 1 NA 1.0 NA
10 stream -2 NA 2 2.0 NA
11 stream -4 NA NA 3.0 NA
12 stream -6 4 NA 4.0 NA
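Side note: mutate_at() and funs() are superseded in current dplyr; under that assumption, a sketch of the same idea using across() with a lambda (not part of the original answer):
library(dplyr)

# Interpolate each variable per site only when it has at least 2 non-NA values,
# otherwise return NA; new columns are named var1_i / var2_i via .names.
test_int2 <- test %>%
  group_by(site) %>%
  mutate(across(c(var1, var2),
                ~ if (sum(!is.na(.x)) > 1)
                    approx(depth, .x, depth, rule = 1, method = "linear")[["y"]]
                  else
                    NA_real_,
                .names = "{.col}_i")) %>%
  ungroup()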
I have a dataset with which I want to conduct a multilevel analysis. Therefore I have two rows for every patient, and a column couple with 1's and 2's (1 = patient, 2 = partner of patient).
Now I have variables with the date of birth and age of both patient and partner, in different columns but on the same row.
What I want to do is write code that does:
if mydata$couple == 2, then replace mydata$dateofbirthpatient with mydata$dateofbirthpartner
and that for every row. Since I have multiple variables that I want to replace, it would be lovely if I could get this in a loop and just add the variables that I want to replace.
What I tried so far:
mydf_longer <- if (mydf_long$couple == 2) {
  mydf_long$pgebdat <- mydf_long$prgebdat
}
Of course this wasn't working, but simply stated this is what I want.
And I started with this code, following the example in "By row, replace values equal to value in specified column", but I don't know how to finish it:
mydf_longer[6:7][mydf_longer[,1:4]==mydf_longer[2,2]] <-
Any ideas? Let me know if you need more information.
Example of data:
# id couple groep_MNC zkhs fbeh pgebdat p_age pgesl prgebdat pr_age
# 1 3 1 1 1 1 1955-12-01 42.50000 1 <NA> NA
# 1.1 3 2 1 1 1 1955-12-01 42.50000 1 <NA> NA
# 2 5 1 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.5
# 2.1 5 2 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.5
# 3 7 1 1 1 1 1958-04-10 40.25000 1 <NA> NA
# 3.1 7 2 1 1 1 1958-04-10 40.25000 1 <NA> NA
mydf_long <- structure(
list(id = c(3L, 3L, 5L, 5L, 7L, 7L),
couple = c(1L, 2L, 1L, 2L, 1L, 2L),
groep_MNC = c(1L, 1L, 1L, 1L, 1L, 1L),
zkhs = c(1L, 1L, 1L, 1L, 1L, 1L),
fbeh = c(1L, 1L, 1L, 1L, 1L, 1L),
pgebdat = structure(c(-5145, -5145, -9764, -9764, -4284, -4284), class = "Date"),
p_age = c(42.5, 42.5, 55.16667, 55.16667, 40.25, 40.25),
pgesl = c(1L, 1L, 1L, 1L, 1L, 1L),
prgebdat = structure(c(NA, NA, -2815, -2815, NA, NA), class = "Date"),
pr_age = c(NA, NA, 36.5, 36.5, NA, NA)),
.Names = c("id", "couple", "groep_MNC", "zkhs", "fbeh", "pgebdat",
"p_age", "pgesl", "prgebdat", "pr_age"),
row.names = c("1", "1.1", "2", "2.1", "3", "3.1"),
class = "data.frame"
)
The following for loop should work if you only want to change the values based on a condition:
for (i in 1:nrow(mydata)) {
  if (mydata$couple[i] == 2) {
    mydata$pgebdat[i] <- mydata$prgebdat[i]
  }
}
Or, as suggested by @lmo, the following vectorized version will work faster:
mydata$pgebdat[mydata$couple == 2] <- mydata$prgebdat[mydata$couple == 2]
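Since the question asks about extending this to several variables, here is a minimal sketch assuming each patient column has a matching partner column; the pgebdat/prgebdat and p_age/pr_age pairs are taken from the example data, so adjust the mapping to your actual columns:
# Named vector mapping target (patient) columns to source (partner) columns
col_pairs <- c(pgebdat = "prgebdat", p_age = "pr_age")

partner_rows <- mydf_long$couple == 2
for (target in names(col_pairs)) {
  source_col <- col_pairs[[target]]
  mydf_long[[target]][partner_rows] <- mydf_long[[source_col]][partner_rows]
}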
"f","index","values","lo.80","lo.95","hi.80","hi.95"
"auto.arima",2017-07-31 16:40:00,2.81613884762163,NA,NA,NA,NA
"auto.arima",2017-07-31 16:40:10,2.83441637197378,NA,NA,NA,NA
"auto.arima",2017-07-31 20:39:10,3.18497899649267,2.73259824384436,2.49312233904087,3.63735974914098,3.87683565394447
"auto.arima",2017-07-31 20:39:20,3.16981166809297,2.69309866988864,2.44074205235297,3.64652466629731,3.89888128383297
"ets",2017-07-31 16:40:00,2.93983529828936,NA,NA,NA,NA
"ets",2017-07-31 16:40:10,3.09739640066054,NA,NA,NA,NA
"ets",2017-07-31 20:39:10,3.1951571771414,2.80966705285567,2.60560090776504,3.58064730142714,3.78471344651776
"ets",2017-07-31 20:39:20,3.33876776870274,2.93593322313957,2.72268549604222,3.7416023142659,3.95485004136325
"bats",2017-07-31 16:40:00,2.82795253090081,NA,NA,NA,NA
"bats",2017-07-31 16:40:10,2.96389759682623,NA,NA,NA,NA
"bats",2017-07-31 20:39:10,3.1383560278272,2.76890864400062,2.573335012715,3.50780341165378,3.7033770429394
"bats",2017-07-31 20:39:20,3.3561357998535,2.98646195085452,2.79076843614824,3.72580964885248,3.92150316355876
I have a dataframe like the one above, with the column names "f", "index", "values", "lo.80", "lo.95", "hi.80", "hi.95".
What I want to do is calculate a weighted average of the forecast results from the different models for a particular timestamp. What I mean by this is:
for every row in auto.arima there is a corresponding row in ets and bats with the same timestamp value, so the weighted average should be calculated something like this:
values_arima*1/3 + values_ets*1/3 + values_bats*1/3; similarly, the values for lo.80 and the other columns should be calculated.
This result should be stored in a new dataframe with all the weighted average values.
The new dataframe could look something like:
index (timestamp from the above dataframe), avg, avg_lo_80, avg_lo_95, avg_hi_80, avg_hi_95
I think I need to use the spread() and mutate() functions to achieve this, but being new to R I'm unable to proceed after forming this dataframe.
Please help.
The example you provide is not a weighted average but a simple average.
What you want is a simple aggregate.
The first part is your dataset as produced by dput() (better for sharing here):
d <- structure(list(f = structure(c(1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L), .Label = c("auto.arima", "bats", "ets"), class = "factor"),
index = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L), .Label = c("2017-07-31 16:40:00", "2017-07-31 16:40:10",
"2017-07-31 20:39:10", "2017-07-31 20:39:20"), class = "factor"),
values = c(2.81613884762163, 2.83441637197378, 3.18497899649267,
3.16981166809297, 2.93983529828936, 3.09739640066054, 3.1951571771414,
3.33876776870274, 2.82795253090081, 2.96389759682623, 3.1383560278272,
3.3561357998535), lo.80 = c(NA, NA, 2.73259824384436, 2.69309866988864,
NA, NA, 2.80966705285567, 2.93593322313957, NA, NA, 2.76890864400062,
2.98646195085452), lo.95 = c(NA, NA, 2.49312233904087, 2.44074205235297,
NA, NA, 2.60560090776504, 2.72268549604222, NA, NA, 2.573335012715,
2.79076843614824), hi.80 = c(NA, NA, 3.63735974914098, 3.64652466629731,
NA, NA, 3.58064730142714, 3.7416023142659, NA, NA, 3.50780341165378,
3.72580964885248), hi.95 = c(NA, NA, 3.87683565394447, 3.89888128383297,
NA, NA, 3.78471344651776, 3.95485004136325, NA, NA, 3.7033770429394,
3.92150316355876)), .Names = c("f", "index", "values", "lo.80",
"lo.95", "hi.80", "hi.95"), class = "data.frame", row.names = c(NA,
-12L))
> aggregate(d[,3:7], by = d["index"], FUN = mean)
index values lo.80 lo.95 hi.80 hi.95
1 2017-07-31 16:40:00 2.861309 NA NA NA NA
2 2017-07-31 16:40:10 2.965237 NA NA NA NA
3 2017-07-31 20:39:10 3.172831 2.770391 2.557353 3.575270 3.788309
4 2017-07-31 20:39:20 3.288238 2.871831 2.651399 3.704646 3.925078
You can save this output in an object and change the column names as you want.
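Since the question mentions the tidyverse, the same simple average can be sketched with dplyr (column names as in the data above; this is an alternative, not required):
library(dplyr)

# Unweighted mean of each forecast column per timestamp
d %>%
  group_by(index) %>%
  summarise(across(c(values, lo.80, lo.95, hi.80, hi.95), mean))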
If you really want a weighted average, this is a way to obtain it (here bats has a weight of 0.8 and the other two 0.1):
> d$weight <- (d$f)
> levels(d$weight) # check the levels
[1] "auto.arima" "bats" "ets"
> levels(d$weight) <- c(0.1, 0.8, 0.1)
> # transform the factor into numbers
> # warning as.numeric(d$weight) is not correct !!
> d$weight <- as.numeric(as.character((d$weight)))
>
> # Here the result is saved in a data.frame called "result"
> result <- aggregate(d[,3:7] * d$weight, by = d["index"], FUN = sum)
> result
index values lo.80 lo.95 hi.80 hi.95
1 2017-07-31 16:40:00 2.837959 NA NA NA NA
2 2017-07-31 16:40:10 2.964299 NA NA NA NA
3 2017-07-31 20:39:10 3.148698 2.769353 2.568540 3.528043 3.728857
4 2017-07-31 20:39:20 3.335767 2.952073 2.748958 3.719460 3.922576
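A dplyr sketch of the same weighted average, using weighted.mean() instead of multiplying and summing by hand (weights as above: bats = 0.8, the other two 0.1; again an alternative, not the answer's code):
library(dplyr)

w <- c(auto.arima = 0.1, bats = 0.8, ets = 0.1)

result <- d %>%
  mutate(weight = w[as.character(f)]) %>%
  group_by(index) %>%
  summarise(across(c(values, lo.80, lo.95, hi.80, hi.95),
                   ~ weighted.mean(.x, weight)))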
Using the following dataframe I would like to group the data by replicate and group and then calculate a ratio of treatment values to control values.
structure(list(group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("case", "controls"), class = "factor"), treatment = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "EPA", class = "factor"),
replicate = structure(c(2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L), .Label = c("four",
"one", "three", "two"), class = "factor"), fatty_acid_family = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "saturated", class = "factor"),
fatty_acid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "14:0", class = "factor"),
quant = c(6.16, 6.415, 4.02, 4.05, 4.62, 4.435, 3.755, 3.755
)), .Names = c("group", "treatment", "replicate", "fatty_acid_family",
"fatty_acid", "quant"), class = "data.frame", row.names = c(NA,
-8L))
I have tried using dplyr as follows:
group_by(dataIn, replicate, group) %>%
  transmute(ratio = quant[group == "case"] / quant[group == "controls"])
but this results in Error: incompatible size (%d), expecting %d (the group size) or 1
Initially I thought this might be because I was trying to create 4 ratios from a df 8 rows deep, so I thought summarise might be the answer (collapsing each group to one ratio), but that doesn't work either (my understanding here falls short).
group_by(dataIn, replicate, group) %>%
  summarise(ratio = quant[group == "case"] / quant[group == "controls"])
replicate group ratio
1 four case NA
2 four controls NA
3 one case NA
4 one controls NA
5 three case NA
6 three controls NA
7 two case NA
8 two controls NA
I would appreciate some advice on where I'm going wrong or even if this can be done with dplyr.
Thanks.
You can try:
group_by(dataIn, replicate) %>%
  summarise(ratio = quant[group == "case"] / quant[group == "controls"])
#Source: local data frame [4 x 2]
#
# replicate ratio
#1 four 1.078562
#2 one 1.333333
#3 three 1.070573
#4 two 1.446449
Because you grouped by replicate and group, you could not access data from different groups at the same time.
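Another way to see it is to spread the two groups into separate columns first and then divide; a sketch with tidyr's pivot_wider (an alternative to the answer above, not part of it):
library(dplyr)
library(tidyr)

dataIn %>%
  select(replicate, group, quant) %>%
  pivot_wider(names_from = group, values_from = quant) %>%
  mutate(ratio = case / controls)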
@talat's answer solved it for me. I created a minimal reproducible example to help my own understanding:
df <- structure(list(a = c("a", "a", "b", "b", "c", "c", "d", "d"),
b = c(1, 2, 1, 2, 1, 2, 1, 2), c = c(22, 15, 5, 0.2, 107,
6, 0.2, 4)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
# a b c
# 1 a 1 22.0
# 2 a 2 15.0
# 3 b 1 5.0
# 4 b 2 0.2
# 5 c 1 107.0
# 6 c 2 6.0
# 7 d 1 0.2
# 8 d 2 4.0
library(dplyr)
df %>%
  group_by(a) %>%
  summarise(prop = c[b == 1] / c[b == 2])
# a prop
# 1 a 1.466667
# 2 b 25.000000
# 3 c 17.833333
# 4 d 0.050000
I have a dataframe in long form for which I need to aggregate several observations taken on a particular day.
Example data:
long <- structure(list(Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Genotype = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), View = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor"), variable = c(1496L, 1704L,
1738L, 1553L, 1834L, 1421L, 1208L, 1845L, 1325L, 1264L, 1920L,
1735L)), .Names = c("Day", "Genotype", "View", "variable"), row.names = c(NA, -12L),
class = "data.frame")
> long
Day Genotype View variable
1 1 A 1 1496
2 1 A 2 1704
3 1 A 3 1738
4 1 B 1 1553
5 1 B 2 1834
6 1 B 3 1421
7 2 A 1 1208
8 2 A 2 1845
9 2 A 3 1325
10 2 B 1 1264
11 2 B 2 1920
12 2 B 3 1735
I need to aggregate each genotype for each day by taking the cube root of the product of the three views. So for genotype A on day 1, that is (1496 * 1704 * 1738)^(1/3). The final dataframe would look like:
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
I have been going round and round with reshape2 for the last couple of days, but am not getting anywhere. Help appreciated!
I'd probably use plyr and ddply for this task:
library(plyr)
ddply(long, .(Day, Genotype), summarize,
      summary = prod(variable) ^ (1/3))
#-----
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Or this with dcast:
dcast(data = long, Day + Genotype ~ .,
value.var = "variable", function(x) prod(x) ^ (1/3))
#-----
Day Genotype NA
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Another solution, without additional packages:
aggregate(list(Summary = long$variable),
          by = list(Day = long$Day, Genotype = long$Genotype),
          function(x) prod(x)^(1/length(x)))
Day Genotype Summary
1 1 A 1642.418
2 2 A 1434.695
3 1 B 1593.633
4 2 B 1614.790
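As a side note, the product of many or large values can overflow, so the geometric mean is often computed on the log scale instead; a dplyr sketch of the same result (not part of the answers above):
library(dplyr)

long %>%
  group_by(Day, Genotype) %>%
  summarise(summary = exp(mean(log(variable))), .groups = "drop")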