Calculating the median of a time series, by 8 every 8 hours - r

I am new to R and I do have to calculate the mean of time series, containing 5 years, with hourly taken data of ozon etc..
My df looks like:
structure(list(date = structure(c(1L, 1L, 1L, 1L), .Label = "01.01.2010", class = "factor"),
day.of = c(1L, 1L, 1L, 1L), time = structure(1:4, .Label = c("00:00",
"01:00", "02:00", "03:00"), class = "factor"), SVF_Ray = c(1L,
1L, 1L, 1L), Gmax = c(0, 0, 0, 0), Ta = c(-1.3, -1.2, -1.2,
-1.2), Tmrt = c(-19.3, -12.1, -12, -12.1), PET = c(-10.4,
-8.7, -8.7, -8.7), PT = c(-11.3, -9.3, -9.3, -9.3), Ozon = c(61.35,
62.65, 63.4, 63.85), rDatum = structure(c(14610, 14610, 14610,
14610), class = "Date"), year = c(2010, 2010, 2010, 2010),
month = c(1, 1, 1, 1), day = c(1, 1, 1, 1), hour = c(0, 1,
2, 3)), .Names = c("date", "day.of", "time", "SVF_Ray", "Gmax",
"Ta", "Tmrt", "PET", "PT", "Ozon", "rDatum", "year", "month",
"day", "hour"), row.names = c(NA, 4L), class = "data.frame")
I would like to calculate the mean of Ozon every 8 hours, so a series of 4 calculated means for every day. I have arranged my datum like:
Datum_Ozon$rDatum <- as.Date(data$date, format="%d.%m.%Y")
Datum_Ozon$hour<-as.numeric(unlist(strsplit(as.character(df$time), ":"))[seq(1, 2 * length(df$time), 2)])
Format is numeric
But I don't know any further in achieving my goal. Thanks in advance!

If its the case that your data is regular and complete (ie, every hour has a record), the following base R code should do the trick:
# Get the number of 8 hour intervals
intervalCnt <- nrow(df) / 8L
# add a grouping vector to your data
df$group <- rep(1:intervalCnt, each=8)
# get the median for each interval, keep year var around for later
intervalMedian <- aggregate(var~group + day + month + year, data=df, FUN=median)
Note that this solution relies on the assumption that the data has a regular structure, i.e., every hour has a record. If the measure of interest is missing, i.e. NA, then simply adding na.rm to the aggregate function will return the statistics of interest:
# get the median for each interval
intervalMedian <- aggregate(var~group + day + month + year, data=df, FUN=median, na.rm=T)
If you have a variable for hour of the day, here is a simple way to check for data regularity:
table(df$hourOfDay)
The result of this function is a frequency count of each hour. The counts should be equal. Another thing to check is that the first observation starts in the hour following the final observation, i.e. if the hour of observation 1 == "00:00", then the hour of the final observation should be 23:00.
To provide a plot of the mean of the 8 hour periods by year, you can again use aggregate:
intervalMeans.year <- aggregate(var~group, data=intervalMedian,
FUN=mean, na.rm=T)
The inclusion of the group, day, month, and year variables in the intervalMedian data.frame allow for a lot of different aggregations. For example, with a minor adjustment, it is possible to get the average value of a variable over the 5 year period for each time period-day-month:
intervalMedian$periodDay <- rep(1:3, length.out=intervalMedian)
intervalMeans.dayMonthPeriod <- aggregate(var~periodDay+day+month,
data=intervalMedian, FUN=mean, na.rm=T)

Here is a basic example using a dplyr pipe rather than a plyr approach as well as ifelse(). Everything is self contained here:
library(dplyr)
## OP data
df <-
structure(list(date = structure(c(1L, 1L, 1L, 1L), .Label = "01.01.2010", class = "factor"),
day.of = c(1L, 1L, 1L, 1L), time = structure(1:4, .Label = c("00:00",
"01:00", "02:00", "03:00"), class = "factor"), SVF_Ray = c(1L,
1L, 1L, 1L), Gmax = c(0, 0, 0, 0), Ta = c(-1.3, -1.2, -1.2,
-1.2), Tmrt = c(-19.3, -12.1, -12, -12.1), PET = c(-10.4,
-8.7, -8.7, -8.7), PT = c(-11.3, -9.3, -9.3, -9.3), Ozon = c(61.35,
62.65, 63.4, 63.85), rDatum = structure(c(14610, 14610, 14610,
14610), class = "Date"), year = c(2010, 2010, 2010, 2010),
month = c(1, 1, 1, 1), day = c(1, 1, 1, 1), hour = c(0, 1,
2, 3)), .Names = c("date", "day.of", "time", "SVF_Ray", "Gmax",
"Ta", "Tmrt", "PET", "PT", "Ozon", "rDatum", "year", "month",
"day", "hour"), row.names = c(NA, 4L), class = "data.frame")
df %>%
mutate(DayChunk=ifelse(hour %in% c(0:7),"FirstThird",
ifelse(hour %in% c(8:15), "SecondThird"
,"ThirdThird")
)) %>%
group_by(Date, DayChunk) %>%
summarise(MedOzon=median(Ozon))

Look up the function seq.POSIXt. There are options to specify the start and stop intervals. This function is designed to create sequences of time. For your problem:
myseq<-seq(ISOdate(2010,01,01, 00, 00, 00, tz="GMT"), to=ISOdate(2016,01,05), by = "8 hour")
Use the ISOdate functions to set the start and stop times. If you are going to be working much with times, I suggest researching the function strptime and the POSIXlt/ct time classes.
Now with the breaks defined and assuming you have a column in your dataframe (Datum_Ozon) named "datetime", then use "cut" to group/subset your data.
Datum_Ozon$datetime<-as.POSIXct(paste(as.character(Datum_Ozon$date),
as.character(Datum_Ozon$time)), "%d.%m.%Y %H:%M", tz="GMT" )
library(dplyr)
summarize(group_by(Datum_Ozon, cut(Datum_Ozon$datetime, myseq)), mean(Ozon))

Related

Count how often two factors have the same output value

I want to calculate the number of times two individuals share the same group number. I'm working with quite a large dataset (169 individuals and over a 1000 observations (rows) of them) and I'm looking for an efficient way to count the occurrence of them being in the same group. My (simplified) data looks like this:
ID
Group number
Date
Time
Aa
1
15-06-22
15:05:22
Bd
1
15-06-22
15:05:27
Cr
2
15-06-22
15:07:12
Bd
1
15-06-22
17:33:15
Aa
2
15-06-22
17:36:54
Cr
2
15-06-22
17:37:01
...
I would like my output data to look like this:
Aa-Bd
Aa-Cr
Bd-Cr
...
1
1
0
Or:
Occurrence
Dyad
1
Aa-Bd; Aa-Cr
0
Bd-Cr
Or even a matrix might work. I've been trying to replicate the solution posed for this problem: Count occurrences of a variable having two given values corresponding to one value of another variable
but for some reason my matrix remains empty, even though I know that certain individuals have been in groups with others.
Any help and suggestions would be extremely appreciated! I feel like the solution shouldn't be too complicated but for some reason I can't seem to figure it out.
Thanks in advance!
Edit: some example data from dput():
dput(c[1:5,])
structure(list(Date = structure(c(19129, 19129, 19129, 19129,
19129), class = "Date"), Time = c("11:05:58", "11:06:06", "11:06:16",
"11:06:33", "11:06:59"), Data = structure(c(1L, 1L, 1L, 1L, 1L
), .Label = "Crossing", class = "factor"), Group = structure(c(5L,
5L, 5L, 5L, 5L), .Label = c("Ankhase", "Baie Dankie", "Kubu",
"Lemon Tree", "Noha"), class = "factor"), IDIndividual1 = structure(c(158L,
158L, 34L, 153L, 14L), .Label = c("Aaa", "Aal", "Aan", "Aapi",
"Aar", "Aara", "Aare", "Aat", "Amst", "App", "Asis", "Awa", "Beir",
"Bela", "Bet", "Buk", "Daa", "Dais", "Dazz", "Deli", "Dewe",
"Dian", "Digb", "Dix", "Dok", "Dore", "Eina", "Eis", "Enge",
"Fle", "Flu", "Fur", "Gale", "Gaya", "Gese", "Gha", "Ghid", "Gib",
"Gil", "Ginq", "Gobe", "Godu", "Goe", "Gom", "Gran", "Gree",
"Gri", "Gris", "Griv", "Guat", "Gub", "Guba", "Gubh", "Guz",
"Haai", "Hee", "Heer", "Heli", "Hond", "Kom", "Lail", "Lewe",
"Lif", "Lill", "Lizz", "Mara", "Mas", "Miel", "Misk", "Moes",
"Mom", "Mui", "Naal", "Nak", "Ncok", "Nda", "Ndaw", "Ndl", "Ndon",
"Ndum", "Nge", "Nko", "Nkos", "Non", "Nooi", "Numb", "Nurk",
"Nuu", "Obse", "Oerw", "Oke", "Ome", "Oort", "Ouli", "Oup", "Palm",
"Pann", "Papp", "Pie", "Piep", "Pix", "Pom", "Popp", "Prai",
"Prat", "Pret", "Prim", "Puol", "Raba", "Rafa", "Ram", "Rat",
"Rede", "Ree", "Reen", "Regi", "Ren", "Reno", "Rid", "Rim", "Rioj",
"Riss", "Riva", "Rivi", "Roc", "Sari", "Sey", "Sho", "Sig", "Sirk",
"Sitr", "Skem", "Sla", "Spe", "Summary", "Syl", "Tam", "Ted",
"Tev", "Udup", "Uls", "Umb", "Unk", "UnkAM", "UnkBB", "UnkJ",
"UnkJF", "UnkJM", "Upps", "Utic", "Utr", "Vla", "Vul", "Xala",
"Xar", "Xeni", "Xia", "Xian", "Xih", "Xin", "Xinp", "Xop", "Yam",
"Yamu", "Yara", "Yaz", "Yelo", "Yodo", "Yuko"), class = "factor"),
Behaviour = structure(c(2L, 3L, 1L, 1L, 1L), .Label = c("Crossing",
"First Approacher", "First Crosser", "Last Crosser", "Summary"
), class = "factor"), CrossingType = c("Road - Ground Level",
"Road - Ground Level", "Road - Ground Level", "Road - Ground Level",
"Road - Ground Level"), GPSS = c(-27.9999, -27.9999, -27.9999,
-27.9999, -27.9999), GPSE = c(31.20376, 31.20376, 31.20376,
31.20376, 31.20376), Context = structure(c(1L, 1L, 1L, 1L,
1L), .Label = c("Crossing", "Feeding", "Moving", "Unknown"
), class = "factor"), Observers = structure(c(12L, 12L, 12L,
12L, 12L), .Label = c("Christelle", "Christelle; Giulia",
"Christelle; Maria", "Elif; Giulia", "Josefien; Zach; Flavia; Maria",
"Mathieu", "Mathieu; Giulia", "Mike; Mila", "Mila", "Mila; Christelle; Giulia",
"Mila; Elif", "Mila; Giulia", "Nokubonga; Mila", "Nokubonga; Tam; Flavia",
"Nokubonga; Tam; Flavia; Maria", "Nokubonga; Zach; Flavia; Maria",
"Tam; Flavia", "Tam; Zach; Flavia; Maria", "Zach", "Zach; Elif; Giulia",
"Zach; Flavia; Maria", "Zach; Giulia"), class = "factor"),
DeviceId = structure(c(10L, 10L, 10L, 10L, 10L), .Label = c("{129F4050-2294-0D43-890F-3B2DEF58FC1A}",
"{1A678F44-DB8C-1245-8DD7-9C2F92F086CA}", "{1B249FD2-AA95-5745-9A32-56CDD0587018}",
"{2C7026A6-6EDC-BA4F-84EC-3DDADFFD4FD7}", "{2E489E9F-00BE-E342-8CAE-941618B2F0E6}",
"{359CEB57-351F-F54F-B2BD-77A05FB6C349}", "{3727647C-B73A-184B-B187-D6BF75646B84}",
"{7A4E6639-7387-7648-88EC-7FD27A0F258A}", "{854B02F2-5979-174A-AAE8-398C21664824}",
"{89B5C791-1F71-0149-A2F7-F05E0197F501}", "{D92DF19A-9021-A740-AD99-DCCE1D88E064}"
), class = "factor"), Obs.nr = c(1, 1, 1, 1, 1), Gp.nr = c(1,
3, 3, 4, 5)), row.names = c(NA, -5L), groups = structure(list(
Obs.nr = 1, .rows = structure(list(1:5), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
In here Gp.nr is my group number, IDIndividual1 is my ID.
This is not efficient at all, but as a starting point you can use (GN denotes the group number)
my_ID <- unique(df$ID)
matrix <- matrix(nrow = length(my_ID),ncol = length(my_ID))
for (i in 1:length(my_ID)){
for (j in 1:length(my_ID)){
matrix[i,j] <- length(intersect(df$GN[df$ID == my_ID[i]],df$GN[df$ID == my_ID[j]]))}}
Check this out:
## Creating the Dataframe
df = data.frame(ID = c("Aa","Bd","Cc","Dd","Cr"),
GroupNumber=c(1,2,1,3,3))
## Loading the libraries
library(dplyr)
library(tidyverse)
library(stringr)
## Grouping to find out which observations share same group
df1 = df %>%
group_by(GroupNumber) %>%
summarise(ID_=paste(ID, collapse="-"),
CountbyID = n_distinct(ID_)) %>%
filter(str_detect(ID_, "-"))
## Creating all possible pair combinations and then joining and concatenating all rows
df2 = data.frame(t(combn(df$ID,2))) %>%
mutate(Comb = paste(X1,"-",X2, sep = "")) %>%
left_join(df1, by=c("Comb"="ID_")) %>%
select(Comb, CountbyID) %>%
replace(is.na(.), 0) %>%
group_by(CountbyID) %>%
summarise(ID=paste(Comb, collapse=";"))
Hope this helps!
UPDATE
The way the dataframe is setup, its causing issues to the "IDIndividual1" column. Based on the way it is setup, it has more factor levels than the unique data points. Therefore, I simply converted it to a character. Try the code below:
df = df[,c("IDIndividual1","Gp.nr")]
colnames(df) = c("ID","GroupNumber")
df$ID = as.character(df$ID) ## Converting factors to characters
## Loading the libraries
library(dplyr)
library(tidyverse)
library(stringr)
## Grouping to find out which observations share same group
df1 = df %>%
group_by(GroupNumber) %>%
summarise(ID_=paste(ID, collapse="-"),
CountbyID = n_distinct(ID_)) %>%
filter(str_detect(ID_, "-"))
## Creating all possible pair combinations and then joining and concatenating all rows
df2 = data.frame(t(combn(df$ID,2))) %>%
distinct() %>%
filter(X1 != X2) %>%
mutate(Comb = paste(X1,"-",X2, sep = "")) %>%
left_join(df1, by=c("Comb"="ID_")) %>%
select(Comb, CountbyID) %>%
replace(is.na(.), 0) %>%
group_by(CountbyID) %>%
summarise(ID=paste(Comb, collapse=";"))

R: new column based whether categorical levels of another column are the same or different from each other

I am having a problem creating a new column in a data where the column content is defined by levels in a factor in a different column are the same or different, which is dependent on another 2 columns.
Basically, I have a bunch of cows with different ID's that can have different parities. The quarter is the udder quarter affected by the disease and I would like to create a new column with a result that is based on whether quarters are the same or different or occurring once. Any help would be appreciated. Code for abbreviated data frame below/ The new column is the one I would like to achieve.
AnimalID <- c(10,10,10,10,12,12,12,12,14)
Parity <- c(8,8,9,9,4,4,4,4,2)
Udder_quarter <- c("LH","LH","RH","RH","LH","RH","LF","RF","RF")
new_column <- c("same quarter","same quarter","different quarter","different quarter","different quarter","different quarter","different quarter","different quarter","one quarter")
quarters<- data.frame(AnimalID,Parity,Udder_quarter,new_column)
structure(list(HerdAnimalID = c(100165, 100165, 100327, 100327,
100450, 100450), Parity = c(6, 6, 5, 5, 3, 3), no_parities = c(1,
1, 1, 1, 1, 1), case = c("1pathogen_lact", "1pathogen_lact",
"1pathogen_lact", "1pathogen_lact", "1pathogen_lact", "1pathogen_lact"
), FARM = c(1, 1, 1, 1, 1, 1), `CASE NO` = c("101", "101", "638",
"638", "593", "593"), MASTDATE = structure(c(1085529600, 1087689600,
1097884800, 1101254400, 1106092800, 1106784000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), QRT = c("LF", "LF", "RH", "LF", "LH",
"LH"), MastitisDiagnosis = c("Corynebacterium spp", "Corynebacterium spp",
"S. uberis", "S. uberis", "Bacillus spp", "Bacillus spp"), PrevCalvDate =
structure(c(1075334400,
1075334400, 1096156800, 1096156800, 1091145600, 1091145600), class =
c("POSIXct",
"POSIXt"), tzone = "UTC")), .Names = c("HerdAnimalID", "Parity",
"no_parities", "case", "FARM", "CASE NO", "MASTDATE", "QRT",
"MastitisDiagnosis", "PrevCalvDate"), row.names = c(NA, -6L), class =
c("tbl_df",
"tbl", "data.frame"))
library(dplyr)
quarters %>%
group_by(AnimalID) %>%
mutate(new_column = ifelse(n()==1, 'one quarter', NA)) %>%
group_by(Parity, add=T) %>%
mutate(new_column=ifelse(length(unique(Udder_quarter))==1 & is.na(new_column),
"same quarter",
ifelse(length(unique(Udder_quarter))>1,
"different quarter",
new_column))) %>%
data.frame()
Output is:
AnimalID Parity Udder_quarter new_column
1 10 8 LH same quarter
2 10 8 LH same quarter
3 10 9 RH same quarter
4 10 9 RH same quarter
5 12 4 LH different quarter
6 12 4 RH different quarter
7 12 4 LF different quarter
8 12 4 RF different quarter
9 14 2 RF one quarter
Sample data:
quarters <- structure(list(AnimalID = c(10, 10, 10, 10, 12, 12, 12, 12, 14
), Parity = c(8, 8, 9, 9, 4, 4, 4, 4, 2), Udder_quarter = structure(c(2L,
2L, 4L, 4L, 2L, 4L, 1L, 3L, 3L), .Label = c("LF", "LH", "RF",
"RH"), class = "factor")), .Names = c("AnimalID", "Parity", "Udder_quarter"
), row.names = c(NA, -9L), class = "data.frame")
I would use ave to do that:
f <- function(x) {
if (length(x)==1) return("one")
else if (all(x == x[1])) return("same")
else return("different")
}
ave(Udder_quarter, interaction(AnimalID, Parity), FUN=f)
# [1] "same" "same" "same" "same" "different"
# [6] "different" "different" "different" "one"

R calculate daily mean data from irregular hourly data

I have a large data set that I am trying to reformat. Unfortunately, I cannot aggregate daily data.
dataset1_updated<- structure(list(X = 1:5, Time = structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("7/29/11 10:29", "7/29/11 10:30", "7/29/11 10:32"
), class = "factor"), O3 = c(32.032608222367, 32.032608222367,
32.032608222367, 32.032608222367, 32.032608222367), SO2 = c(2.611,
2.605, 2.744, 2.767, 2.778), NO = c(0.081, 0.081, 0.081, 0.081,
0.081), NO2 = c(1.938, 1.912, 1.912, 1.896, 1.863), NOx = c(2.019,
1.993, 1.993, 1.977, 1.944)), .Names = c("X", "Time", "O3",
"SO2", "NO", "NO2", "NOx"), row.names = c(NA, 5L), class = "data.frame")
I convert the data set to xts object, and apply daily mean function, the results are "NA". Could you please tell me what is missing?
x <- as.xts(as.POSIXct(dataset1_updated$Time, format="%m/%d/%Y %H:%M"))
x_up<- apply.daily(x, colMeans)
write.csv(as.data.frame(as.matrix(x_up)), file="test")
thank you,
We need to change the xts statement as the as.xts is applying only on the datetime class and not on the entire dataset
xt1 <- xts(dataset1_updated[-(1:2)], order.by = as.POSIXct(dataset1_updated$Time,
format = "%m/%d/%y %H:%M"))
x_up <- apply.daily(xt1, colMeans)
x_up
# O3 SO2 NO NO2 NOx
#2011-07-29 10:32:00 32.03261 2.701 0.081 1.9042 1.9852

Improvement in for loop using other method

Problem
There is 1 main station (df) and 3 local stations (s) stacked in a single data.frame with values for three days. The idea is to take each day from the main station, find the relative anomaly of the three local stations, and smooth it using inverse distance weighting (IDW) from the phylin package. This is then applied to the value in the main station by multiplication.
Any suggestions on improving this code (e.g. data.table, dplyr, apply)? I still don't know how to approach this problem without the cumbersome for loop.
dput
s <- structure(list(id = c("USC00031152", "USC00034638", "USC00036352",
"USC00031152", "USC00034638", "USC00036352", "USC00031152", "USC00034638",
"USC00036352"), lat = c(33.59, 34.7392, 35.2833, 33.59, 34.7392,
35.2833, 33.59, 34.7392, 35.2833), long = c(-92.8236, -90.7664,
-93.1, -92.8236, -90.7664, -93.1, -92.8236, -90.7664, -93.1),
year = c(1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900,
1900), month = c(1, 1, 1, 1, 1, 1, 1, 1, 1), day = c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), value = c(63.3157576809045,
86.0490598902219, 76.506386949066, 71.3760752788486, 89.9119576975542,
76.3535163951321, 53.7259645981243, 61.7989638892985, 85.8911224149051
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-9L), .Names = c("id", "lat", "long", "year", "month", "day",
"value"))
df <- structure(list(id = c(12345, 12345, 12345), lat = c(100, 100,
100), long = c(50, 50, 50), year = c(1900, 1900, 1900), month = c(1,
1, 1), day = 1:3, value = c(54.8780020601509, 106.966029162171,
98.3198828955801)), row.names = c(NA, -3L), class = "data.frame", .Names = c("id",
"lat", "long", "year", "month", "day", "value"))
Code
library(phylin)
nearest <- function(i, loc){
# Stack 3 local stations
stack <- s[loc:(loc+2),]
# Get 1 main station
station <- df[i,]
# Check for NA and build relative anomaly (r)
stack <- stack[!is.na(stack$value),]
stack$r <- stack$value/station$value
# Use IDW and return v
v <- as.numeric(ifelse(dim(stack)[1] == 1,
stack$r,
idw(stack$r, stack[,c(2,3,8)], station[,2:3])))
return(v)
}
ncdc <- 1
for (i in 1:nrow(df)){
# Get relative anomaly from function
r <- nearest(i, ncdc)
# Get value from main station and apply anomaly
p <- df[i,7]
df[i,7] <- p*r
# Iterate to next 3 local stations
ncdc <- ncdc + 3
}
Suppose you let your nearest function unchanged.
Then you could get your new value column in df by
newvalue <- sapply(1:NROW(df), function (i) df[i,7] * nearest(i, 3*(i-1)+1))
df$value <- newvalue

Passing current value of ddply split on to function

Here is some sample data for which I want to encode the gender of the names over time:
names_to_encode <- structure(list(names = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("names", "year"), row.names = c(NA, -6L), class = "data.frame")
Here is a minimal set of the Social Security data, limited to just those names from 1890 and 1990:
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
I've defined a function which subsets the Social Security data given a year or range of years. In other words, it calculates whether a name was male or female over a given time period by figuring out the proportion of male and female births with that name. Here is the function along with a helper function:
require(plyr)
require(dplyr)
select_ssa <- function(years) {
# If we get only one year (1890) convert it to a range of years (1890-1890)
if (length(years) == 1) years <- c(years, years)
# Calculate the male and female proportions for the given range of years
ssa_select <- ssa_demo %.%
filter(year >= years[1], year <= years[2]) %.%
group_by(name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = sapply(proportion_female, male_or_female))
return(ssa_select)
}
# Helper function to determine whether a name is male or female in a given year
male_or_female <- function(proportion_female) {
if (proportion_female > 0.5) {
return("female")
} else if(proportion_female == 0.5000) {
return("either")
} else {
return("male")
}
}
Now what I want to do is use plyr, specifically ddply, to subset the data to be encoded by year, and merge each of those pieces with the value returned by the select_ssa function. This is the code I have.
ddply(names_to_encode, .(year), merge, y = select_ssa(year), by.x = "names", by.y = "name", all.x = TRUE)
When calling select_ssa(year), this command works just fine if I hard code a value like 1890 as the argument to the function. But when I try to pass it the current value for year that ddply is working with, I get an error message:
Error in filter_impl(.data, dots(...), environment()) :
(list) object cannot be coerced to type 'integer'
How can I pass the current value of year on to ddply?
I think you're making things too complicated by trying to do a join inside ddply. If I were to use dplyr I would probably do something more like this:
names_to_encode <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("name", "year"), row.names = c(NA, -6L), class = "data.frame")
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
names_to_encode$name <- as.character(names_to_encode$name)
names_to_encode$year <- as.integer(names_to_encode$year)
tmp <- left_join(ssa_demo,names_to_encode) %.%
group_by(year,name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = ifelse(proportion_female == 0.5,"either",
ifelse(proportion_female > 0.5,"female","male")))
Note that 0.1.1 is still a little finicky about the types of join columns, so I had to convert them. I think I saw some activity on github that suggested that was either fixed in the dev version, or at least something they're working on.

Resources