Counting number of Rainfall as Raindays - r

I want to count the number of days on which rain fell in each month, for different years and different locations.
This is my data:
Location Year Month Day Precipitation
A 2008 1 1 0
A 2008 1 2 8.32
A 2008 1 3 4.89
A 2008 1 4 0
I have up to 18 locations, the years run from 2008 to 2018 with 12 months each, and a precipitation of 0 means no rain fell on that day.

You can use aggregate:
aggregate(cbind(days=x$Precipitation > 0), as.list(x[c("Location", "Year", "Month")]), sum)
# Location Year Month days
#1 A 2008 1 2
Data:
x <- structure(list(Location = structure(c(1L, 1L, 1L, 1L), .Label = "A", class = "factor"),
Year = c(2008L, 2008L, 2008L, 2008L), Month = c(1L, 1L, 1L,
1L), Day = 1:4, Precipitation = c(0, 8.32, 4.89, 0)), class = "data.frame", row.names = c(NA, -4L))

Based on the available information, with dplyr:
library(dplyr)
df <- df %>%
filter(Precipitation != 0) %>% # keep only days with rain
group_by(Location, Year, Month) %>%
summarize(DaysOfRain = n()) # count the remaining days per group
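One caveat with filtering on Precipitation != 0 first: a month with no rain at all disappears from the result instead of showing a count of 0. A minimal sketch that keeps such months, assuming dplyr 1.0 or later (for the .groups argument) and a small made-up sample (the February rows are hypothetical, not from the question):

```r
library(dplyr)

# Hypothetical sample: rain on two days in January, none in February
rain <- data.frame(
  Location = "A",
  Year = 2008L,
  Month = c(1L, 1L, 1L, 1L, 2L, 2L),
  Day = c(1L, 2L, 3L, 4L, 1L, 2L),
  Precipitation = c(0, 8.32, 4.89, 0, 0, 0)
)

# Summing the logical vector counts the rainy days and keeps dry months as 0
rain_days <- rain %>%
  group_by(Location, Year, Month) %>%
  summarise(DaysOfRain = sum(Precipitation > 0), .groups = "drop")

rain_days$DaysOfRain
# 2 0
```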

Related

Creating a calculated matrix in R [duplicate]

I have a table similar to this
Year Month Purchase_ind Value
2018 1 1 100
2018 1 1 100
2018 1 0 100
2018 2 1 2
2018 2 0 198
2018 3 1 568
2019 1 0 230
.
.
.
And I want to build a matrix with:
Year for Y axis
Month for X axis
In each cell, I need (Value with Purchase_ind = 1) / (total Value).
This is the result I expect:
2018 2019 2020
1 0.66 0 x
2 0.01 x x
3 1 x x
Thanks a lot for your help!
You can calculate the proportion for Year and Month and cast the data to wide format :
library(dplyr)
df %>%
group_by(Year, Month) %>%
summarise(Value = sum(Value[Purchase_ind == 1])/sum(Value)) %>%
tidyr::pivot_wider(names_from = Year, values_from = Value)
#Add values_fill = 0 if you want 0's instead of `NA`'s
#tidyr::pivot_wider(names_from = Year, values_from = Value, values_fill = 0)
# Month `2018` `2019`
# <int> <dbl> <dbl>
#1 1 0.667 0
#2 2 0.01 NA
#3 3 1 NA
data
df <- structure(list(Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2019L), Month = c(1L, 1L, 1L, 2L, 2L, 3L, 1L), Purchase_ind = c(1L,
1L, 0L, 1L, 0L, 1L, 0L), Value = c(100L, 100L, 100L, 2L, 198L,
568L, 230L)), class = "data.frame", row.names = c(NA, -7L))
Using data.table:
library(data.table)
DT <- data.table(year = c(2018,2018,2018,2018,2018,2018,2019),
month = c(1,1,1,2,2,3,1),
purchase_ind = c(1,1,0,1,0,1,0),
value = c(100,100,100,2,198,568,230))
DT[, value_ind := fifelse(purchase_ind == 1, value, 0)]
DT <- copy(DT[, .(calculate_session = sum(value_ind) / sum(value)), by = .(year, month)])
dcast(DT, month ~ year, value.var = 'calculate_session')
Output:
month 2018 2019
1: 1 0.6666667 0
2: 2 0.0100000 NA
3: 3 1.0000000 NA
in base R you could do:
(a <- prop.table(xtabs(Value ~ Month + Year + Purchase_ind, df), c(1, 2)))
, , Purchase_ind = 0
Year
Month 2018 2019
1 0.3333333 1.0000000
2 0.9900000
3 0.0000000
, , Purchase_ind = 1
Year
Month 2018 2019
1 0.6666667 0.0000000
2 0.0100000
3 1.0000000
Of course, if you only need Purchase_ind = 1, you can just subset it:
a[, , "1"] #or even a[, , 2]
Year
Month 2018 2019
1 0.6666667 0.0000000
2 0.0100000
3 1.0000000
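The same ratio can also be sketched with two tapply() calls, one for the numerator and one for the denominator. This is only an illustrative alternative, not the method used above; it assumes R 3.4 or later, where tapply() fills empty Month/Year cells with NA by default.

```r
# Same sample data as in the answers above
df <- data.frame(
  Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2019L),
  Month = c(1L, 1L, 1L, 2L, 2L, 3L, 1L),
  Purchase_ind = c(1L, 1L, 0L, 1L, 0L, 1L, 0L),
  Value = c(100L, 100L, 100L, 2L, 198L, 568L, 230L)
)

groups <- list(Month = df$Month, Year = df$Year)

# Numerator: Value counted only where Purchase_ind == 1
purchased <- tapply(df$Value * (df$Purchase_ind == 1), groups, sum)
# Denominator: total Value per Month/Year cell
total <- tapply(df$Value, groups, sum)

# Months as rows, years as columns; cells with no data stay NA
ratio <- purchased / total
round(ratio, 2)
```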

Separate a csv file based on common and uncommon dates from two separate csv files

I have two csv files:
DAT1:
Year,Month,Day,Rainfall
1979,01,01,0.1
1979,01,02,0.3
1979,01,03,0.5
1979,01,04,1
1979,01,05,2
DAT2:
SN,CY,Year,Month,Day,Hour,MSLP
1,1979,1979,01,03,06,1000
3,1979,1979,01,05,12,999
I want to
(1) extract the data with dates that are not common between DAT1 and DAT2.
(2) extract the data with common dates between DAT1 and DAT2 and add the "Rainfall" column.
So the expected output for (1) is:
Year,Month,Day,Rainfall
1979,01,01,0.1
1979,01,02,0.3
1979,01,04,1
The expected output for (2) is:
SN,CY,Year,Month,Day,Hour,MSLP,Rainfall
1,1979,1979,01,03,06,1000,0.5
3,1979,1979,01,05,12,999,2
DAT1 has continuous daily dates starting from 1979-01-01, while DAT2 has irregular dates.
Right now I am separating them manually, but I will need to apply this to data from 1979 to 2017.
Is there a more efficient way to do this in R?
I'd appreciate any help on this.
I would suggest this base R approach using the data you shared as a1 and a2 (I also included it in the code):
#Data
a1 <- structure(list(Year = c(1979L, 1979L, 1979L, 1979L, 1979L), Month = c(1L,
1L, 1L, 1L, 1L), Day = 1:5, Rainfall = c(0.1, 0.3, 0.5, 1, 2)), class = "data.frame", row.names = c(NA,
-5L))
a2 <- structure(list(SN = c(1L, 3L), CY = c(1979L, 1979L), Year = c(1979L,
1979L), Month = c(1L, 1L), Day = c(3L, 5L), Hour = c(6L, 12L),
MSLP = 1000:999, Rainfall = c(100L, 50L)), class = "data.frame", row.names = c(NA,
-2L))
Code:
#Code
a1[!paste(a1$Year,a1$Month,a1$Day) %in% paste(a2$Year,a2$Month,a2$Day),]
Output:
Year Month Day Rainfall
1 1979 1 1 0.1
2 1979 1 2 0.3
4 1979 1 4 1.0
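For the second question you can use merge(). Note that a2 as defined above carries its own Rainfall column, so the suffixes are needed to tell the two apart:
merge(a2, a1, by = c('Year','Month','Day'), all.x = TRUE, sort = FALSE, suffixes = c('.1','.2'))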
Output:
Year Month Day SN CY Hour MSLP Rainfall.1 Rainfall.2
1 1979 1 3 1 1979 6 1000 100 0.5
2 1979 1 5 3 1979 12 999 50 2.0
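If DAT2 has no Rainfall column of its own, as in the question, both outputs can be sketched as follows. The names dat1, dat2, common and uncommon are illustrative, and the months are held as integers, so the leading zeros from the csv files are not preserved.

```r
# Sample data mirroring DAT1 and DAT2 from the question
dat1 <- data.frame(Year = 1979L, Month = 1L, Day = 1:5,
                   Rainfall = c(0.1, 0.3, 0.5, 1, 2))
dat2 <- data.frame(SN = c(1L, 3L), CY = 1979L, Year = 1979L, Month = 1L,
                   Day = c(3L, 5L), Hour = c(6L, 12L), MSLP = c(1000L, 999L))

keys <- c("Year", "Month", "Day")

# (2) Common dates: inner join, then restore DAT2's column order
common <- merge(dat2, dat1, by = keys)
common <- common[, c(names(dat2), "Rainfall")]

# (1) Dates in DAT1 that never appear in DAT2
key1 <- paste(dat1$Year, dat1$Month, dat1$Day)
key2 <- paste(dat2$Year, dat2$Month, dat2$Day)
uncommon <- dat1[!key1 %in% key2, ]

common
uncommon
```

Each result can then be written out with write.csv(..., row.names = FALSE).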

Merging two data frames in long format based on date

I have two data frames: one (df1) records the daily occurrence of different activities, and the other (df2) records properties of the activity that occurred during the day.
From df1 it is possible to identify the repeated occurrence of an activity as well the duration. When the day starts is specified by the Date variable.
For example:
For id 12 the occurrence starts at day 1 and ends at day 7. In this case the occurrence is 7 and the duration is 11.
For id 123 the week starts at day 5 and ends at day 7; the activity does not occur on consecutive days because there is a gap at day 6, and the duration is 6. id 10 (starts at day 6, ends at day 7) occurred 2 times consecutively, with duration 6.
In df1 the variable Date defines the day when the record started. For example id 12 record started at day1 and so on.
I would like to check whether, during a consecutive occurrence, there are records of the activity properties in df2.
For example, id 12 occurred 7 times with duration 11; there is a record for Wednesday (day3 in df1), which corresponds to the 3rd day of the consecutive occurrence. For id 123 there is no data (i.e. no consecutive occurrence), but for id 10, with a 6-day occurrence and duration 18, there is a record on the 6th day.
Df1:
id day1 day2 day3 day4 day5 day6 day7 Date
12 2 1 2 1 1 3 1 Mon
123 0 3 0 3 3 0 3 Fri
10 0 3 3 3 3 3 3 Sat
Df2:
id c1 c2 Date
12 3 3 Wednesday
123 3 2 Fri
10 3 1 Sat
Outcome:
id c1 c2 Occurrence Position
12 3 3 7 3
123 0 0 0 0
10 3 1 2 1
Sample data: df1
structure(list(id = c(12L, 123L, 10L), day1 = c(2L, 0L, 3L),
day2 = c(1L, 3L, 3L), day3 = c(2L, 0L, 3L), day4 = c(1L,
3L, 3L), day5 = c(1L, 3L, 3L), day6 = c(3L, 0L, 3L), day7 = c(1L,
3L, 3L), Date = c("Monday", "Friday", "Saturday")), row.names = c(NA,
-3L), class = c("data.table", "data.frame"))
df2:
structure(list(id = c(12, 123, 10), c1 = c(3, 3, 3), c2 = c(3,
2, 1), Date = structure(c(3L, 1L, 2L), .Label = c("Friday", "Saturday",
"Wednesday"), class = "factor")), row.names = c(NA, -3L), class = "data.frame")
A solution with dplyr (maybe not the shortest one):
# library
library(tidyverse)
# get data
df1 <- structure(list(id = c(12L, 123L, 10L),
day1 = c(2L, 0L, 3L),
day2 = c(1L, 3L, 3L),
day3 = c(2L, 0L, 3L),
day4 = c(1L,3L, 3L),
day5 = c(1L, 3L, 3L),
day6 = c(3L, 0L, 3L),
day7 = c(1L,3L, 3L),
Date = c("Monday", "Friday", "Saturday")),
row.names = c(NA,-3L), class = c("data.table", "data.frame"))
df2 <- structure(list(id = c(12, 123, 10),
c1 = c(3, 3, 3),
c2 = c(3, 2, 1),
Date = structure(c(3L, 1L, 2L), .Label = c("Friday", "Saturday","Wednesday"),
class = "factor")), row.names = c(NA, -3L), class = "data.frame")
# change days to numeric (will help you later)
df1 %>% mutate(
Date_nr_df1=case_when(
Date=="Monday" ~ 1,
Date=="Tuesday" ~2,
Date=="Wednesday" ~3,
Date=="Thursday" ~4,
Date=="Friday" ~5,
Date=="Saturday" ~6,
Date=="Sunday" ~7)) -> df1
df2 %>% mutate(
Date_nr_df2=case_when(
Date=="Monday" ~ 1,
Date=="Tuesday" ~2,
Date=="Wednesday" ~3,
Date=="Thursday" ~4,
Date=="Friday" ~5,
Date=="Saturday" ~6,
Date=="Sunday" ~7)) -> df2
# combine data by the id column
left_join(df1,df2, by=c("id")) -> df
# adjust data
df %>%
group_by(id) %>% # to make changes per row
mutate(days=paste0(day1,day2,day3,day4,day5,day6,day7)) %>% #pastes the values together
mutate(days_correct=substring(days,Date_nr_df1)) %>% # applies the start day
mutate(Occurrence_seq=str_split(days_correct, fixed("0"))[[1]][1]) %>% # extracts all days before 0
mutate(Occurrence=nchar(Occurrence_seq)) %>% ## counts these days
mutate(Occurrence=case_when(Occurrence==1 ~ 0, TRUE ~ as.numeric(Occurrence))) %>% # sets Occurrence to 0 if there is no consecutive occurrence
mutate(Position=Date_nr_df2-Date_nr_df1+1) %>% ## calculates the position you wanted
mutate(c1=case_when(Occurrence==0 ~0, TRUE ~ c1),
c2=case_when(Occurrence==0 ~0, TRUE ~ c2),
Position=case_when(Occurrence==0 ~ 0, TRUE ~ as.numeric(Position))) %>%
ungroup() %>% # ungroups the df
select(id,c1,c2,Occurrence,Position) # selects the wanted variables
#> # A tibble: 3 x 5
#> id c1 c2 Occurrence Position
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 12 3 3 7 3
#> 2 123 0 0 0 0
#> 3 10 3 1 2 1
Created on 2020-04-10 by the reprex package (v0.2.1)
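Two steps of the answer above can be sketched more compactly in base R: mapping weekday names to numbers with match() instead of two case_when() blocks, and counting the run of non-zero days from the start day directly on the numeric values instead of on a pasted string. The helpers day_number() and leading_run() are hypothetical names introduced here for illustration only.

```r
weekdays_ordered <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday")

# Position of each weekday name in the week (works for character and factor)
day_number <- function(x) match(as.character(x), weekdays_ordered)

# Number of consecutive non-zero values, starting at position `start`
leading_run <- function(values, start) {
  tail_values <- values[start:length(values)]
  first_zero <- match(0, tail_values)
  if (is.na(first_zero)) length(tail_values) else first_zero - 1L
}

day_number(c("Monday", "Friday", "Saturday"))
# 1 5 6

# day1..day7 for id 12, 123 and 10, taken from the dput in the question
leading_run(c(2, 1, 2, 1, 1, 3, 1), start = 1)  # 7
leading_run(c(0, 3, 0, 3, 3, 0, 3), start = 5)  # 1 (a single day, which the answer treats as 0)
leading_run(c(3, 3, 3, 3, 3, 3, 3), start = 6)  # 2
```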

Get the limited rows on date range from dataframe in R

I have this data frame.
token DD1 Type DD2 Price
AB-1 2018-01-01 10:12:15 Low 2018-01-25 10000
AB-5 2018-01-10 10:12:15 Low 2018-01-25 15000
AB-2 2018-01-05 12:25:04 High 2018-01-20 25000
AB-3 2018-01-03 17:04:25 Low 2018-01-27 50000
....
AB-8 2017-12-10 21:08:12 Low 2017-12-30 60000
AB-8 2017-12-10 21:08:12 High 2017-12-30 30000
dput:
structure(list(token = structure(c(2L, 5L, 3L, 4L, 1L, 6L, 6L
), .Label = c("....", "AB-1", "AB-2", "AB-3", "AB-5", "AB-8"), class = "factor"),
DD1 = structure(c(2L, 5L, 4L, 3L, 1L, 6L, 6L), .Label = c("",
"01/01/2018 10:12:15", "03/01/2018 17:04:25", "05/01/2018 12:25:04",
"10/01/2018 10:12:15", "10/12/2017 21:08:12"), class = "factor"),
Type = structure(c(3L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("",
"High", "Low"), class = "factor"), DD2 = structure(c(3L,
3L, 2L, 4L, 1L, 5L, 5L), .Label = c("", "20/01/2018", "25/01/2018",
"27/01/2018", "30/12/2017"), class = "factor"), Price = c(10000L,
15000L, 25000L, 50000L, NA, 60000L, 30000L)), .Names = c("token",
"DD1", "Type", "DD2", "Price"), class = "data.frame", row.names = c(NA,
-7L))
From the data frame above I want two kinds of summary data frames: one by date (the last three dates from DD2 in descending order; if no row exists for a particular date, show that date with all fields as '0') and one by month (the last three months in descending order; if no row exists for a particular month, show that month with all fields as '0').
Formula for Avg Low (same for Avg High): DD2 - DD1, then take the median over the available rows.
% formula for month: (Recent Value - Old Value) / (Old Value)
The code should pick the last three days of data as well as the last three months of data from the data frame whenever I run it.
DF1:
Date nrow for Low Med Low sum of value low nrow for High Med High sum of value High
27-01-2018 1 24 50000 0 0 0
26-01-2018 0 0 0 0 0 0
25-01-2018 2 19.5 25000 0 0 0
DF2
Month nrow low % sum low % nrow high % sum high %
Jan-18 3 200% 75000 25% 1 0% 25000 -17%
Dec-17 1 100% 60000 100% 1 100% 0 100%
Nov-17 0 - - - 0 - - -
Although this Q already has an accepted answer, I felt challenged to provide an answer which uses dcast() and melt(). Any missing dates and months are completed using CJ() and joins as requested by the OP.
The code tries to reproduce OP's expected results as closely as possible. This particular customisation is why the code looks so convoluted.
If requested, I am willing to explain the code in more detail.
library(data.table)
setDT(DF)
# daily
DF1 <-
DF[, .(n = .N, days = median(difftime(as.Date(DD2, "%d/%m/%Y"),
as.Date(DD1, "%d/%m/%Y"), units = "day")),
sum = sum(Price)), by = .(DD2, Type)][
, Date := as.Date(DD2, "%d/%m/%Y")][
, dcast(.SD, Date ~ Type, value.var = c("n", "days", "sum"), fill = 0)][
.(Date = seq(max(Date), length.out = 3L, by = "-1 days")), on = "Date"][
, setcolorder(.SD, c(1, 3, 5, 7, 2, 4, 6))][
is.na(n_Low), (2:7) := lapply(.SD, function(x) 0), .SDcols = 2:7][]
DF1
Date n_Low days_Low sum_Low n_High days_High sum_High
1: 2018-01-27 1 24.0 days 50000 0 0 days 0
2: 2018-01-26 0 0.0 days 0 0 0 days 0
3: 2018-01-25 2 19.5 days 25000 0 0 days 0
# monthly
DF2 <-
DF[, Month := lubridate::floor_date(as.Date(DD2, "%d/%m/%Y"), unit = "month")][
, .(n = .N, sum = sum(Price)), by = .(Month, Type)][
CJ(Month = seq(max(Month), length.out = 3L, by = "-1 months"), Type = unique(Type)),
on = .(Month, Type)][
, melt(.SD, id.vars = c("Month", "Type"))][
is.na(value), value := 0][
, Pct := {
old <- shift(value); round(100 * ifelse(old == 0, 1, (value - old) / old))
},
by = .(variable, Type)][
, dcast(.SD, Type + Month ~ variable, value.var = c("value", "Pct"))][
, setnames(.SD, c("value_n", "value_sum"), c("n", "sum"))][
, dcast(.SD, Month ~ Type, value.var = c("n", "Pct_n", "sum", "Pct_sum"))][
order(-Month), setcolorder(.SD, c(1, 3, 5, 7, 9, 2, 4, 6, 8))]
DF2
Month n_Low Pct_n_Low sum_Low Pct_sum_Low n_High Pct_n_High sum_High Pct_sum_High
1: 2018-01-01 3 200 75000 25 1 0 25000 -17
2: 2017-12-01 1 100 60000 100 1 100 30000 100
3: 2017-11-01 0 NA 0 NA 0 NA 0 NA
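The core idea of completing missing dates (build the last three days, join, fill the gaps with 0) can be sketched in base R as well. This is a simplified illustration, not a rewrite of the data.table code above: the input daily is a hypothetical, already-aggregated table holding only the Low columns.

```r
# Hypothetical per-day summary; 2018-01-26 has no rows
daily <- data.frame(
  Date = as.Date(c("2018-01-25", "2018-01-27")),
  n_Low = c(2L, 1L),
  sum_Low = c(25000, 50000)
)

# The most recent date and the two days before it
last3 <- data.frame(Date = seq(max(daily$Date), by = "-1 day", length.out = 3))

# A left join keeps all three dates; dates without data get NA
out <- merge(last3, daily, by = "Date", all.x = TRUE)

# Replace NA with 0 in the value columns only
value_cols <- setdiff(names(out), "Date")
out[value_cols] <- lapply(out[value_cols], function(x) replace(x, is.na(x), 0))

# Most recent date first
out <- out[order(out$Date, decreasing = TRUE), ]
out
```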
Does the following approach help?
library(tidyverse)
library(lubridate) # dmy_hms(), dmy(), %--% and floor_date() are used below
Edit
This is a very convoluted approach, and it can almost certainly be solved more elegantly.
dat <- structure(list(token = structure(c(2L, 5L, 3L, 4L, 1L, 6L, 6L), .Label = c("....", "AB-1", "AB-2", "AB-3", "AB-5", "AB-8"), class = "character"), DD1 = structure(c(2L, 5L, 4L, 3L, 1L, 6L, 6L), .Label = c("", "01/01/2018 10:12:15", "03/01/2018 17:04:25", "05/01/2018 12:25:04", "10/01/2018 10:12:15", "10/12/2017 21:08:12"), class = "factor"),
Type = structure(c(3L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("", "High", "Low"), class = "character"), DD2 = structure(c(3L, 3L, 2L, 4L, 1L, 5L, 5L), .Label = c("", "20/01/2018", "25/01/2018", "27/01/2018", "30/12/2017"), class = "factor"), Price = c(10000L, 15000L, 25000L, 50000L, NA, 60000L, 30000L)), .Names = c("token", "DD1", "Type", "DD2", "Price"), class = "data.frame", row.names = c(NA, -7L))
#I have included this into the code because structure(your output) had messed up a lot with factors
dat <- dat[c(1:4,6:7),]
dat <- dat %>% mutate(DD1 = dmy_hms(DD1), DD2 = dmy(DD2), Type = as.character(Type))
dat_summary <- dat %>%
mutate(diff_days = round(as.duration(DD1%--%DD2)/ddays(1),0),
#uses lubridate to calculate the number of days between each DD2 and DD1
n = n()) %>%
group_by(DD2,Type) %>% #because your operations are performed by each Type by DD2
summarise(med = median(diff_days),# calculates the median
sum = sum(Price)) # and the sum
# A tibble: 5 x 4
# Groups: DD2 [?]
DD2 Type med sum
<date> <chr> <dbl> <int>
1 2017-12-30 2 19.0 30000
2 2017-12-30 3 19.0 60000
3 2018-01-20 2 14.0 25000
4 2018-01-25 3 19.5 25000
5 2018-01-27 3 23.0 50000
Now find the last day with data in each month, for each type
datematch <- dat %>% group_by(Type,month = floor_date(DD2, "month")) %>%
arrange(Type, desc(DD2)) %>%
summarise(maxDate = max(DD2)) %>%
select(Type, maxDate)
Now create helper data frames for merging. dummy_date will contain the last day with a value and the previous two days, for both types (low and high); all_dates will contain... well, all dates
list1 <- split(datematch$maxDate, datematch$Type)
list_type2 <- do.call('c',lapply(list1[['2']], function(x) seq(as.Date(x)-2, as.Date(x), by="days")))
list_type3 <- do.call('c',lapply(list1[['3']], function(x) seq(as.Date(x)-2, as.Date(x), by="days")))
dd_2 <- data.frame (DD2 = list_type2, Type = as.character(rep('2', length(list_type2))), stringsAsFactors = F)
dd_3 <- data.frame (DD2 = list_type3, Type = as.character(rep('3', length(list_type3))), stringsAsFactors = F)
dummy_date = rbind(dd_2, dd_3)
seq_date <- seq(as.Date('2017-12-01'),as.Date('2018-01-31'), by = 'days')
all_dates <- data.frame (DD2 = rep(seq_date,2), Type = as.character(rep(c('2','3'),each = length(seq_date))),stringsAsFactors = F)
Now we can join your data frame with all days, so that every single day in the month gets a row
all_dates <- left_join(all_dates, dat_summary, by = c('DD2', 'Type'))
And we can filter this result with dummy_date, which (as we remember) contains only the last day with data and the two days before it
df1<- left_join(dummy_date, all_dates, by = c('DD2', 'Type')) %>% arrange(Type, desc(DD2))
df1
DD2 Type med sum
1 2018-01-20 2 14.0 25000
2 2018-01-19 2 NA NA
3 2018-01-18 2 NA NA
4 2017-12-30 2 19.0 30000
5 2017-12-29 2 NA NA
6 2017-12-28 2 NA NA
7 2018-01-27 3 23.0 50000
8 2018-01-26 3 NA NA
9 2018-01-25 3 19.5 25000
10 2017-12-30 3 19.0 60000
11 2017-12-29 3 NA NA
12 2017-12-28 3 NA NA
Sorry that 'type' is not correctly put as low and high, had problems to read your data. I hope that this helps somewhat
edit
added suggestion for a way to get to DF2
df1 %>% group_by(Type, month = floor_date(DD2, 'month')) %>%
summarise(sum = sum(sum, na.rm = T),
n = max (n1, na.rm = T)) %>%
unite(sum.n, c('sum','n')) %>%
spread(Type, sum.n) %>%
rename(low = '3', high = '2') %>%
separate(high, c('high','n_high')) %>%
separate(low, c('low','n_low')) %>%
mutate(dummy_low = as.integer(c(NA, low[1:length(low)-1])),
dummy_high = as.integer(c(NA, high[1:length(high)-1])),
low = as.integer(low),
high = as.integer(high))%>%
mutate(perc_low = 100*(low-dummy_low)/dummy_low)
# A tibble: 2 x 8
month high n_high low n_low dummy_low dummy_high perc_low
<date> <int> <chr> <int> <chr> <int> <int> <dbl>
1 2017-12-01 30000 1 60000 1 NA NA NA
2 2018-01-01 25000 1 75000 3 60000 30000 25.0
It's up to you to add the remaining columns for 'high' and the count. I am sure that the solution is not the most elegant one but it should work. DF2 has now only two months, but this is because you have provided only 2 months in your example. It should work with any number of months, and you can then filter the last three months.
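The month-over-month percentage from the question, (Recent Value - Old Value) / (Old Value), can be sketched on its own in base R. The table monthly below is a hypothetical stand-in that reuses the two 'low' sums shown above.

```r
# Monthly sums for 'low', taken from the output above
monthly <- data.frame(
  month = as.Date(c("2017-12-01", "2018-01-01")),
  low = c(60000, 75000)
)

# Sort oldest first so each row can be compared with the one before it
monthly <- monthly[order(monthly$month), ]

# Value of the previous month; the first month has nothing to compare against
previous <- c(NA, head(monthly$low, -1))

# Percentage change: (recent - old) / old
monthly$perc_low <- 100 * (monthly$low - previous) / previous

# Most recent months first; keep at most the last three
head(monthly[order(monthly$month, decreasing = TRUE), ], 3)
```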

Use unique rows from data.frame to subset another data.frame

I have a data.frame v that I would like to use the unique rows from
#v
DAY MONTH YEAR
1 1 1 2000
2 1 1 2000
3 2 2 2000
4 2 2 2000
5 2 3 2001
to subset a data.frame w.
# w
DAY MONTH YEAR V1 V2 V3
1 1 1 2000 1 2 3
2 1 1 2000 3 2 1
3 2 2 2000 2 3 1
4 2 2 2001 1 2 3
5 3 4 2001 3 2 1
The result is data.frame vw, where only the rows in w that match the unique rows (i.e. unique (DAY, MONTH, YEAR) combinations) in v remain.
# vw
DAY MONTH YEAR V1 V2 V3
1 1 1 2000 1 2 3
2 2 2 2000 2 3 1
Right now I am using the code below, where I merge the data.frames and then use ddply to pick only the unique (first) instance of a row. This works, but it will become cumbersome if I have to include V1=x$V1[1], etc. for all of my variables in the ddply part of the code. Is there a way to use the first instance of (DAY, MONTH, YEAR) and the rest of the columns on that row?
Or is there another way to approach the problem of using unique rows from one data.frame to subset another data.frame?
v <- structure(list(DAY = c(1L, 1L, 2L, 2L, 2L), MONTH = c(1L, 1L,
2L, 2L, 3L), YEAR = c(2000L, 2000L, 2000L, 2000L, 2001L)), .Names = c("DAY",
"MONTH", "YEAR"), class = "data.frame", row.names = c(NA, -5L
))
w <- structure(list(DAY = c(1L, 1L, 2L, 2L, 3L), MONTH = c(1L, 1L,
2L, 2L, 4L), YEAR = c(2000L, 2000L, 2000L, 2001L, 2001L), V1 = c(1L,
3L, 2L, 1L, 3L), V2 = c(2L, 2L, 3L, 2L, 2L), V3 = c(3L, 1L, 1L,
3L, 1L)), .Names = c("DAY", "MONTH", "YEAR", "V1", "V2", "V3"
), class = "data.frame", row.names = c(NA, -5L))
vw_example <- structure(list(DAY = 1:2, MONTH = 1:2, YEAR = c(2000L, 2000L),
V1 = 1:2, V2 = 2:3, V3 = c(3L, 1L)), .Names = c("DAY", "MONTH",
"YEAR", "V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-2L))
library(plyr)
wv_inter <- merge(v, w, by=c("DAY","MONTH","YEAR"))
vw <- ddply(wv_inter,.(DAY, MONTH, YEAR),function(x) data.frame(DAY=x$DAY[1],MONTH=x$MONTH[1],YEAR=x$YEAR[1], V1=x$V1[1], V2=x$V2[1], V3=x$V3[1]))
In base R, I would take unique of v first before merging. The merge command will by default merge on common column names, so by is unnecessary here.
vw <- merge(unique(v), w)
With your approach (take the first row from each combination), I think you could do (untested):
vw <- ddply(wv_inter,.(DAY, MONTH, YEAR),function(x) x[1,])
library(data.table)
v <- data.table(v)
w <- data.table(w)
setkey(v)
setkeyv(w, names(v))
# if you want to capture ALL unique values of `v`, use:
w[unique(v, by=NULL)]
# if you want only values that mutually exist in `v` and `w` use:
w[unique(v, by=NULL), nomatch=0L]
EDITED:
Rather than merge a unique v with w, to get a unique vw first merge v and w and then select values unique on the DAY MONTH YEAR columns.
vw <- merge(v, w, by=c("DAY","MONTH","YEAR"))
vw <- vw[which( ! duplicated(vw[,c("DAY","MONTH","YEAR")]) ), ]
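For comparison, the same "filter by the keys in v, then keep the first row per key" idea can be sketched with dplyr. This is an alternative not proposed in the answers above; it assumes dplyr is installed and uses semi_join() plus distinct() with .keep_all = TRUE.

```r
library(dplyr)

# v and w as given in the question
v <- data.frame(DAY = c(1L, 1L, 2L, 2L, 2L),
                MONTH = c(1L, 1L, 2L, 2L, 3L),
                YEAR = c(2000L, 2000L, 2000L, 2000L, 2001L))
w <- data.frame(DAY = c(1L, 1L, 2L, 2L, 3L),
                MONTH = c(1L, 1L, 2L, 2L, 4L),
                YEAR = c(2000L, 2000L, 2000L, 2001L, 2001L),
                V1 = c(1L, 3L, 2L, 1L, 3L),
                V2 = c(2L, 2L, 3L, 2L, 2L),
                V3 = c(3L, 1L, 1L, 3L, 1L))

keys <- c("DAY", "MONTH", "YEAR")

vw <- w %>%
  semi_join(v, by = keys) %>%                     # rows of w whose key occurs in v
  distinct(DAY, MONTH, YEAR, .keep_all = TRUE)    # first row per key, all columns kept

vw
```

Because semi_join() only filters w, duplicate rows in v do not multiply the result, so unique(v) is not needed first.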

Resources