I am trying to get the maximum value in the column event until an agreement (dummy) is reached. Events are nested in agreements, agreements are nested in dyads, which run over years. Note that the years are not always continuous, i.e. there are breaks between them (1986, 1987, 2001, 2002).
I am able to get the maximum values within a dyad with ddply and max(event), but I struggle with how to 'assign' the different events to the right agreement (until/after). I am basically lacking an 'identifier' that assigns each observation to an agreement.
The results which I am looking for are already in the column "result".
dyad year event agreement agreement.name result
   1 1985     9
   1 1986     4         1     agreement1      9
   1 1987
   1 2001     3
   1 2002               1     agreement2      3
   2 1999     1
   2 2000     5
   2 2001               1     agreement3      5
   2 2002     2
   2 2003
   2 2004               1    agreement 4      2
Here is the data in a format which is hopefully easier to use:
df<-structure(list(dyad = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L), year = c(1985L, 1986L, 1987L, 2001L, 2002L, 1999L, 2000L,
2001L, 2002L, 2003L, 2004L), event = c(9L, 4L, NA, 3L, NA, 1L,
5L, NA, 2L, NA, NA), agreement = c(NA, 1L, NA, NA, 1L, NA, NA,
1L, NA, NA, 1L), agreement.name = c("", "agreement1", "", "",
"agreement2", "", "", "agreement3", "", "", "agreement 4"), result = c(NA,
9L, NA, NA, 3L, NA, NA, 5L, NA, NA, 2L)), .Names = c("dyad",
"year", "event", "agreement", "agreement.name", "result"), class = "data.frame", row.names = c(NA,
-11L))
Here is an option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and create another grouping variable ('ind') based on the non-empty elements in 'agreement.name'. Grouped by both 'dyad' and 'ind', create a new column 'result' with ifelse, filling the rows where 'agreement.name' is non-empty with the max of 'event'.
library(data.table)
setDT(df)[, ind:=cumsum(c(TRUE,diff(agreement.name=='')>0)),dyad][,
result:=ifelse(agreement.name!='', max(event, na.rm=TRUE), NA) ,
list(dyad, ind)][, ind:=NULL][]
# dyad year event agreement agreement.name result
# 1: 1 1985 9 NA NA
# 2: 1 1986 4 1 agreement1 9
# 3: 1 1987 NA NA NA
# 4: 1 2001 3 NA NA
# 5: 1 2002 NA 1 agreement2 3
# 6: 2 1999 1 NA NA
# 7: 2 2000 5 NA NA
# 8: 2 2001 NA 1 agreement3 5
# 9: 2 2002 2 NA NA
#10: 2 2003 NA NA NA
#11: 2 2004 NA 1 agreement 4 2
Or, instead of ifelse, we can use a numeric index:
setDT(df)[, result:=c(NA, max(event, na.rm=TRUE))[(agreement.name!='')+1L] ,
list(ind= cumsum(c(TRUE,diff(agreement.name=='')>0)),dyad)][]
data
df <- structure(list(dyad = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L), year = c(1985L, 1986L, 1987L, 2001L, 2002L, 1999L, 2000L,
2001L, 2002L, 2003L, 2004L), event = c(9L, 4L, NA, 3L, NA, 1L,
5L, NA, 2L, NA, NA), agreement = c(NA, 1L, NA, NA, 1L, NA, NA,
1L, NA, NA, 1L), agreement.name = c("", "agreement1", "", "",
"agreement2", "", "", "agreement3", "", "", "agreement 4")),
.Names = c("dyad",
"year", "event", "agreement", "agreement.name"), row.names = c(NA,
-11L), class = "data.frame")
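For completeness, here is a dplyr sketch of the same grouping idea. This is an alternative I am adding for reference, not part of the original answer; it assumes event is stored as an integer (as in the dput above), otherwise swap NA_integer_ for NA_real_.
library(dplyr)

df %>%
  group_by(dyad) %>%
  # start a new block on the row after each non-empty agreement.name
  mutate(ind = cumsum(lag(agreement.name != "", default = FALSE))) %>%
  group_by(dyad, ind) %>%
  # same caveat as the ifelse version: max() is evaluated per block
  mutate(result = if_else(agreement.name != "",
                          max(event, na.rm = TRUE),
                          NA_integer_)) %>%
  ungroup() %>%
  select(-ind)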
I have a large data set (907 x 1855). I need to count how many follow-ups each patient has had. A follow-up column contains either 1, 2 or NA, and a follow-up may be defined as the specific column being !is.na().
There are at most 20 follow-ups. As you can see, each follow-up has _vX added as a suffix, where X corresponds to the follow-up number.
Thus, follow-up number 20 has the very inconvenient RedCap-autogenerated column name p$fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
> head(p)
fu_location fu_location_v2 fu_location_v2_v3 fu_location_v2_v3_v4 ...
1 1 1 1 1 ...
2 2 2 1 2 ...
3 1 1 1 2 ...
4 2 2 2 2 ...
I need to count the number of non-NA values across the columns whose names contain "fu_location". I tried mutate(n_fu = sum(!is.na(contains("fu_location")))) but that did not work.
Preferably, the solution would be in dplyr. Perhaps a function?
Expected output:
> head(p)
fu_location fu_location_v2 fu_location_v2_v3 fu_location_v2_v3_v4 n_fu
1 1 1 1 1 8
2 2 2 1 2 20
3 1 1 1 2 4
4 2 2 2 2 4
Data
p <- structure(list(fu_location = c(1L, 2L, 1L, 2L), fu_location_v2 = c(1L,
2L, 1L, 2L), fu_location_v2_v3 = c(1L, 1L, 1L, 2L), fu_location_v2_v3_v4 = c(1L,
2L, 2L, 2L), fu_location_v2_v3_v4_v5 = c(2L, 2L, NA, NA), fu_location_v2_v3_v4_v5_v6 = c(1L,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7 = c(2L, 1L, NA, NA
), fu_location_v2_v3_v4_v5_v6_v7_v8 = c(1L, 2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10 = c(NA,
1L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12 = c(NA,
1L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15 = c(NA,
1L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17 = c(NA,
1L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19 = c(NA,
1L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(NA,
2L, NA, NA)), row.names = c(NA, -4L), class = "data.frame")
Use rowSums:
library(dplyr)
p %>% mutate(n_fu = rowSums(!is.na(select(., contains('fu_location')))))
Or in base R:
p$n_fu <- rowSums(!is.na(p[grep('fu_location', names(p))]))
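If you prefer to stay entirely inside a dplyr pipeline, a row-wise variant is also possible. This is a hedged sketch assuming dplyr >= 1.0 (for c_across()); it is typically slower than rowSums() on wide data.
library(dplyr)

# Row-wise variant: count non-NA follow-up columns per patient
p %>%
  rowwise() %>%
  mutate(n_fu = sum(!is.na(c_across(contains("fu_location"))))) %>%
  ungroup()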
I have a long-format balanced data frame (df1) that has 7 columns:
df1 <- structure(list(Product_ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3), Product_Category = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("A", "B"), class = "factor"),
Manufacture_Date = c(1950, 1950, 1950, 1950, 1950, 1960,
1960, 1960, 1960, 1960, 1940, 1940, 1940, 1940, 1940), Control_Date = c(1961L,
1962L, 1963L, 1964L, 1965L, 1961L, 1962L, 1963L, 1964L, 1965L,
1961L, 1962L, 1963L, 1964L, 1965L), Country_Code = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ABC",
"DEF", "GHI"), class = "factor"), Var1 = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Var2 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
15L), class = "data.frame")
Each Product_ID in this data set is linked to a unique Product_Category, Country_Code and Manufacture_Date, and is followed over time (Control_Date). Product_Category has two possible values (A or B); Country_Code and Manufacture_Date have 190 and 90 unique values, respectively. There are 400,000 unique Product_IDs, which are followed over a period of 50 years (Control_Date from 1961 to 2010). This means that df1 has 20,000,000 rows. The last two columns of this data frame start out as NA and have to be filled using the data available in another data frame (df2):
df2 <- structure(list(Product_ID = 1:6, Product_Category = structure(c(1L,
2L, 1L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"),
Manufacture_Date = c(1950, 1960, 1940, 1950, 1940, 2000),
Country_Code = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("ABC",
"DEF", "GHI"), class = "factor"), Year_1961 = c(5, NA, 10,
NA, 6, NA), Year_1962 = c(NA, NA, 4, 5, 3, NA), Year_1963 = c(8,
6, NA, 5, 6, NA), Year_1964 = c(NA, NA, 9, NA, 10, NA), Year_1965 = c(6,
NA, 7, 4, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
This second data frame contains another type of information on the exact same 400,000 products, in wide-format. Each row represents a unique product (Product_ID) accompanied by its Product_Category, Manufacture_Date and Country_Code. There are 50 other columns (for each year from 1961 to 2010) that contain a measured value (or NA) for each product in each of those years.
Now, what I would like to do is to fill in the Var1 and Var2 columns in the first data frame by doing some calculations on the data available in the second data frame. More precisely, for each row in the first data frame (i.e. a product at Control_Date "t"), the last two columns are defined as follows:
Var1: total number of products in df2 with the same Product_Category, Manufacture_Date and Country_Code that have a non-NA value in Year_t;
Var2: total number of products in df2 with a different Product_Category but the same Manufacture_Date and Country_Code that have a non-NA value in Year_t.
My initial solution with nested for-loops is as follows:
for (i in unique(df1$Product_ID)){
  Category <- unique(df1[which(df1$Product_ID==i),"Product_Category"])
  Opposite_Category <- ifelse(Category=="A","B","A")
  Manufacture <- unique(df1[which(df1$Product_ID==i),"Manufacture_Date"])
  Country <- unique(df1[which(df1$Product_ID==i),"Country_Code"])
  ID_Similar_Product <- df2[which(df2$Product_Category==Category & df2$Manufacture_Date==Manufacture & df2$Country_Code==Country),"Product_ID"]
  ID_Quasi_Similar_Product <- df2[which(df2$Product_Category==Opposite_Category & df2$Manufacture_Date==Manufacture & df2$Country_Code==Country),"Product_ID"]
  for (j in unique(df1$Control_Date)){
    df1[which(df1$Product_ID==i & df1$Control_Date==j),"Var1"] <- length(which(!is.na(df2[which(df2$Product_ID %in% ID_Similar_Product),paste0("Year_",j)])))
    df1[which(df1$Product_ID==i & df1$Control_Date==j),"Var2"] <- length(which(!is.na(df2[which(df2$Product_ID %in% ID_Quasi_Similar_Product),paste0("Year_",j)])))
  }
}
The problem with this approach is that it takes a lot of time to be run. So I would like to know if anybody could suggest a vectorized version that would do the job in less time.
See if this does what you want. I'm using the data.table package since you have a rather large (20M) dataset.
library(data.table)
setDT(df1)
setDT(df2)
# Set keys on the "triplet" to speed up everything
setkey(df1, Product_Category, Manufacture_Date, Country_Code)
setkey(df2, Product_Category, Manufacture_Date, Country_Code)
# Omit the Var1 and Var2 from df1
df1[, c("Var1", "Var2") := NULL]
# Reshape df2 to long form
df2.long <- melt(df2, measure=patterns("^Year_"))
# Split "variable" at the "_" to extract 4-digit year into "Control_Date" and delete leftovers.
df2.long[, c("variable","Control_Date") := tstrsplit(variable, "_", fixed=TRUE)][
, variable := NULL]
# Group by triplet, Var1=count non-NA in value, join with...
# (Group by doublet, N=count non-NA), update Var2=N-Var1.
df2_N <- df2.long[, .(Var1 = sum(!is.na(value))),
by=.(Product_Category, Manufacture_Date, Country_Code)][
df2.long[, .(N = sum(!is.na(value))),
by=.(Manufacture_Date, Country_Code)],
Var2 := N - Var1, on=c("Manufacture_Date", "Country_Code")]
# Update join: df1 with df2_N
df1[df2_N, c("Var1","Var2") := .(i.Var1, i.Var2),
on = .(Product_Category, Manufacture_Date, Country_Code)]
df1
Product_ID Product_Category Manufacture_Date Control_Date Country_Code Var1 Var2
1: 3 A 1940 1961 GHI 4 0
2: 3 A 1940 1962 GHI 4 0
3: 3 A 1940 1963 GHI 4 0
4: 3 A 1940 1964 GHI 4 0
5: 3 A 1940 1965 GHI 4 0
6: 1 A 1950 1961 ABC 6 0
7: 1 A 1950 1962 ABC 6 0
8: 1 A 1950 1963 ABC 6 0
9: 1 A 1950 1964 ABC 6 0
10: 1 A 1950 1965 ABC 6 0
11: 2 B 1960 1961 DEF NA NA
12: 2 B 1960 1962 DEF NA NA
13: 2 B 1960 1963 DEF NA NA
14: 2 B 1960 1964 DEF NA NA
15: 2 B 1960 1965 DEF NA NA
df2
Product_ID Product_Category Manufacture_Date Country_Code Year_1961 Year_1962 Year_1963 Year_1964 Year_1965
1: 5 A 1940 DEF 6 3 6 10 NA
2: 3 A 1940 GHI 10 4 NA 9 7
3: 1 A 1950 ABC 5 NA 8 NA 6
4: 4 A 1950 ABC NA 5 5 NA 4
5: 2 B 1940 DEF NA NA 6 NA NA
6: 6 B 2000 GHI NA NA NA NA NA
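For reference, here is a dplyr/tidyr sketch of the per-year counting described in the question: reshape df2 to long form, count non-NA values per (Product_Category, Manufacture_Date, Country_Code, year), derive the other-category count, and join onto the df1 from the question. This is a hedged alternative (assumes dplyr >= 1.0 and tidyr >= 1.1), not part of the answer above.
library(dplyr)
library(tidyr)

# Per-year counts of non-NA measurements in df2
counts <- df2 %>%
  pivot_longer(starts_with("Year_"),
               names_to = "Control_Date", names_prefix = "Year_",
               names_transform = list(Control_Date = as.integer),
               values_to = "value") %>%
  group_by(Product_Category, Manufacture_Date, Country_Code, Control_Date) %>%
  summarise(Var1 = sum(!is.na(value)), .groups = "drop") %>%
  # within the same Manufacture_Date/Country_Code/year, the other-category
  # count is the overall count minus the same-category count
  group_by(Manufacture_Date, Country_Code, Control_Date) %>%
  mutate(Var2 = sum(Var1) - Var1) %>%
  ungroup()

# Drop the all-NA Var1/Var2 columns from the question's df1 and join the counts
df1 %>%
  select(-any_of(c("Var1", "Var2"))) %>%
  left_join(counts,
            by = c("Product_Category", "Manufacture_Date",
                   "Country_Code", "Control_Date"))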
I have a dataset with which I want to conduct a multilevel analysis. Therefore I have two rows for every patient, and a 'couple' column with 1's and 2's (1 = patient, 2 = partner of patient).
Now, I have variables with date of birth and age, for both patient and partner, in different columns that are currently on the same row.
What I want to do is to write code that does:
if mydata$couple == 2, then replace mydata$dateofbirthpatient with mydata$dateofbirthpartner
And that for every row. Since I have multiple variables that I want to replace, it would be lovely if I could put this in a loop and just 'add' the variables that I want to replace.
What I tried so far:
mydf_longer <- if (mydf_long$couple == 2) {
  mydf_long$pgebdat <- mydf_long$prgebdat
}
Of course this wasn't working, but simply stated this is what I want.
And I started with this code, following the example in By row, replace values equal to value in specified column, but I don't know how to finish:
mydf_longer[6:7][mydf_longer[,1:4]==mydf_longer[2,2]] <-
Any ideas? Let me know if you need more information.
Example of data:
# id couple groep_MNC zkhs fbeh pgebdat p_age pgesl prgebdat pr_age
# 1 3 1 1 1 1 1955-12-01 42.50000 1 <NA> NA
# 1.1 3 2 1 1 1 1955-12-01 42.50000 1 <NA> NA
# 2 5 1 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.5
# 2.1 5 2 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.5
# 3 7 1 1 1 1 1958-04-10 40.25000 1 <NA> NA
# 3.1 7 2 1 1 1 1958-04-10 40.25000 1 <NA> NA
mydf_long <- structure(
list(id = c(3L, 3L, 5L, 5L, 7L, 7L),
couple = c(1L, 2L, 1L, 2L, 1L, 2L),
groep_MNC = c(1L, 1L, 1L, 1L, 1L, 1L),
zkhs = c(1L, 1L, 1L, 1L, 1L, 1L),
fbeh = c(1L, 1L, 1L, 1L, 1L, 1L),
pgebdat = structure(c(-5145, -5145, -9764, -9764, -4284, -4284), class = "Date"),
p_age = c(42.5, 42.5, 55.16667, 55.16667, 40.25, 40.25),
pgesl = c(1L, 1L, 1L, 1L, 1L, 1L),
prgebdat = structure(c(NA, NA, -2815, -2815, NA, NA), class = "Date"),
pr_age = c(NA, NA, 36.5, 36.5, NA, NA)),
.Names = c("id", "couple", "groep_MNC", "zkhs", "fbeh", "pgebdat",
"p_age", "pgesl", "prgebdat", "pr_age"),
row.names = c("1", "1.1", "2", "2.1", "3", "3.1"),
class = "data.frame"
)
The following for loop should work if you only want to change the values based on a condition:
for(i in 1:nrow(mydata)){
  if(mydata$couple[i] == 2){
    mydata$pgebdat[i] <- mydata$prgebdat[i]
  }
}
OR
As suggested by #lmo, the following will work faster:
mydata$pgebdat[mydata$couple == 2] <- mydata$prgebdat[mydata$couple == 2]
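To apply the same replacement to several patient/partner column pairs, as asked for in the question, a small loop over the pairs is one option. This is a hedged sketch: the pair names below are taken from the example data (mydf_long), and the list should be extended with whatever other variables need replacing.
# Columns to overwrite on partner rows, as c(target, source) pairs
pairs <- list(c("pgebdat", "prgebdat"),
              c("p_age",   "pr_age"))   # assumed pair from the example data

partner_rows <- mydf_long$couple == 2
for (cols in pairs) {
  mydf_long[partner_rows, cols[1]] <- mydf_long[partner_rows, cols[2]]
}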
My dataframe looks like this and I want two separate cumulative columns, one for fund A and the other for fund B
Name Event SalesAmount Fund Cum-A(desired) Cum-B(desired)
John Webinar NA NA NA NA
John Sale 1000 A 1000 NA
John Sale 2000 B 1000 2000
John Sale 3000 A 4000 2000
John Email NA NA 4000 2000
Tom Webinar NA NA NA NA
Tom Sale 1000 A 1000 NA
Tom Sale 2000 B 1000 2000
Tom Sale 3000 A 4000 2000
Tom Email NA NA 4000 2000
I have tried:
df <- df %>%
  group_by(Name) %>%
  mutate(`Cum-A` = as.numeric(ifelse(Fund == "A", cumsum(SalesAmount), 0))) %>%
  mutate(`Cum-B` = as.numeric(ifelse(Fund == "B", cumsum(SalesAmount), 0)))
but it is totally not what I want, as it shows me the running total of both funds, albeit only on the rows where the funds match.
Kindly help.
How about:
library(dplyr)
d %>%
group_by(Name) %>%
mutate(cA=cumsum(ifelse(!is.na(Fund) & Fund=="A",SalesAmount,0))) %>%
mutate(cB=cumsum(ifelse(!is.na(Fund) & Fund=="B",SalesAmount,0)))
The output:
Source: local data frame [10 x 8]
Groups: Name
Name Event SalesAmount Fund Cum.A.desired. Cum.B.desired. cA cB
1 John Webinar NA NA NA NA 0 0
2 John Sale 1000 A 1000 NA 1000 0
3 John Sale 2000 B 1000 2000 1000 2000
4 John Sale 3000 A 4000 2000 4000 2000
5 John Email NA NA 4000 2000 4000 2000
6 Tom Webinar NA NA NA NA 0 0
7 Tom Sale 1000 A 1000 NA 1000 0
8 Tom Sale 2000 B 1000 2000 1000 2000
9 Tom Sale 3000 A 4000 2000 4000 2000
10 Tom Email NA NA 4000 2000 4000 2000
Zeroes in the resulting columns can be replaced by NA afterwards if needed:
result$cA[result$cA==0] <- NA
result$cB[result$cB==0] <- NA
Your input data set:
d <- structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("John", "Tom"), class = "factor"), Event = structure(c(3L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 2L, 1L), .Label = c("Email", "Sale", "Webinar"), class = "factor"), SalesAmount = c(NA, 1000L, 2000L, 3000L, NA, NA, 1000L, 2000L, 3000L, NA), Fund = structure(c(NA, 1L, 2L, 1L, NA, NA, 1L, 2L, 1L, NA), .Label = c("A", "B"), class = "factor"), Cum.A.desired. = c(NA, 1000L, 1000L, 4000L, 4000L, NA, 1000L, 1000L, 4000L, 4000L), Cum.B.desired. = c(NA, NA, 2000L, 2000L, 2000L, NA, NA, 2000L, 2000L, 2000L)), .Names = c("Name", "Event", "SalesAmount", "Fund", "Cum.A.desired.", "Cum.B.desired." ), class = "data.frame", row.names = c(NA, -10L))
Here's an approach generalizing to more funds, using zoo and data.table:
# prep
require(data.table)
require(zoo)
setDT(d)
d[,Fund:=as.character(Fund)] # because factors are the worst
uf <- unique(d[Event=="Sale"]$Fund) # collect set of funds
First, assign cumulative sales on the relevant subset of observations:
for (f in uf) d[(Event=="Sale"&Fund==f),paste0('c',f):=cumsum(SalesAmount),by=Name]
Then, carry the last observation forward:
d[,paste0('c',uf):=lapply(.SD,na.locf,na.rm=FALSE),.SDcols=paste0('c',uf),by=Name]
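A quick way to eyeball the generalized columns against the desired ones (this just prints; the column names are taken from the dput of d above):
# Side-by-side check of the generalized cumulative columns
d[, .(Name, Event, SalesAmount, Fund, Cum.A.desired., Cum.B.desired., cA, cB)]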
You can shorten #Marat's answer slightly by rolling it all into a single mutate:
df %>%
group_by(Name) %>%
mutate(
cA = cumsum(ifelse(!is.na(Fund) & Fund == "A", SalesAmount, 0)),
cB = cumsum(ifelse(!is.na(Fund) & Fund == "B", SalesAmount, 0)),
cA = ifelse(cA == 0, NA, cA),
cB = ifelse(cB == 0, NA, cB)
)
I am trying to add two variables to my dataset from another dataset of a different length. I have a coral reef survey dataset for which I am missing the start and end times of each dive per site and zone of survey.
Additionally, I have a table containing the start and end times of each dive per site and zone:
This table repeats the wpt (site) because 2 zones are measured per site, meaning that in this table each row should be unique. In my own dataset I have many more repetitions of wpt because I have several observations in the same site and zone. I need to match the unique rows of mergingdata and merge them into my fishdata, returning the start and end times from mergingdata. So I want to match and merge by "wpt" and by "zone".
this is what I have tried:
merge<- merge(fishdata, mergingdata, by="wpt", all=TRUE, sort=FALSE)
but this only merges by "wpt", and my output gets an extra column called zone.y - is there a way in which I can merge by the unique combination of the 2 variables "wpt" and "zone"?
Thank you!
The documentation of merge (see help(merge)) says:
By default the data frames are merged on the columns with names they
both have, but separate specifications of the columns can be given by
by.x and by.y.
As both data.frames contain the identifier columns, the merge function will combine the data using those common columns. So, omitting the by parameter in your code should work.
merge<- merge(fishdata, mergingdata, all=TRUE, sort=FALSE)
However, you can also specify the identifier columns explicitly using the by, by.x and by.y parameters, as follows:
merge<- merge(fishdata, mergingdata, by=c("wpt","zone"), all=TRUE, sort=FALSE)
EDIT
Looking at your post modifications, I figured out that your data has the following structure:
fishdata <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "23.11.2014", class = "factor"),
entry = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "shore", class = "factor"),
wpt = c(2L, 2L, 2L, 2L, 2L, 2L), zone = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = "DO", class = "factor"), transect = c(1L,
1L, 1L, 1L, 1L, 1L), gps = c(NA, NA, NA, NA, NA, NA), surveyor = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "ev", class = "factor"), depth_code = c(NA,
NA, NA, NA, NA, NA), phase = structure(c(2L, 2L, 1L, 1L,
1L, 1L), .Label = c("S_PRIN", "S_STOP"), class = "factor"),
species = structure(c(2L, 1L, 2L, 2L, 1L, 1L), .Label = c("IP",
"TP"), class = "factor"), family = c(NA, NA, NA, NA, NA,
NA)), .Names = c("date", "entry", "wpt", "zone", "transect",
"gps", "surveyor", "depth_code", "phase", "species", "family"
), class = "data.frame", row.names = c(NA, -6L))
mergingdata <- structure(list(start.time = c(10.34, 10.57, 10, 10.24, 9.15,
9.39), end.time = c(10.5, 11.1, 10.2, 10.4, 9.3, 9.5), wpt = c(2L,
2L, 3L, 3L, 4L, 4L), zone = structure(c(1L, 2L, 1L, 2L, 1L, 2L
), .Label = c("DO", "LT"), class = "factor")), .Names = c("start.time",
"end.time", "wpt", "zone"), class = "data.frame", row.names = c(NA,
-6L))
Assuming that the dataset structures are correct...
> fishdata
date entry wpt zone transect gps surveyor depth_code phase species family
1 23.11.2014 shore 2 DO 1 NA ev NA S_STOP TP NA
2 23.11.2014 shore 2 DO 1 NA ev NA S_STOP IP NA
3 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN TP NA
4 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN TP NA
5 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN IP NA
6 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN IP NA
> mergingdata
start.time end.time wpt zone
1 10.34 10.5 2 DO
2 10.57 11.1 2 LT
3 10.00 10.2 3 DO
4 10.24 10.4 3 LT
5 9.15 9.3 4 DO
6 9.39 9.5 4 LT
I do the merge as follows:
> merge(x = fishdata, y = mergingdata, all.x = TRUE)
wpt zone date entry transect gps surveyor depth_code phase species family start.time end.time
1 2 DO 23.11.2014 shore 1 NA ev NA S_STOP TP NA 10.34 10.5
2 2 DO 23.11.2014 shore 1 NA ev NA S_STOP IP NA 10.34 10.5
3 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN TP NA 10.34 10.5
4 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN TP NA 10.34 10.5
5 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN IP NA 10.34 10.5
6 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN IP NA 10.34 10.5
Note that I use all.x = TRUE, because what we want is to have all the rows from the x object (fishdata) merged with the extra columns of the y object (mergingdata). All that, by using the common columns of both objects as an index.
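If you are already working in the tidyverse, a hedged dplyr equivalent of the merge above would be a left_join by the two key columns (it keeps every row of fishdata, like all.x = TRUE; dplyr may warn that the zone factors have different levels and coerce them to character):
library(dplyr)

# Join the dive times onto every survey observation by the wpt + zone pair
left_join(fishdata, mergingdata, by = c("wpt", "zone"))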