Is it possible to alter the output of ddply? I wondered if it was possible to present the unique results for a subset on ONE row instead of giving each result a new row. E.g.
ID Season Year
5074 Summer 2008
5074 Summer 2009
5074 Winter 2008
5074 Winter 2009
5074 Winter 2010
Into...
ID Season Year
5074 Summer 2008,2009
5074 Winter 2008,2009,2010
I often use ddply to manually diagnose the results of for-loops etc., and presenting the results like this would reduce the length of the output and make the check go much faster.
Cheers!
First load in the data
dd = read.table(textConnection("ID Season Year
5074 Summer 2008
5074 Summer 2009
5074 Winter 2008
5074 Winter 2009
5074 Winter 2010"), header=TRUE)
Then just use ddply (from the plyr package) as normal, splitting by ID and Season
library(plyr)
ddply(dd, .(ID, Season), summarise, Year=paste(Year, collapse=","))
We use the collapse argument in paste to return a single character string per group. Since you want to use this as a check, it might be worth using sort on Year, i.e.
paste(sort(Year), collapse=",")
dat <- read.table(text="ID Season Year
5074 Summer 2008
5074 Summer 2009
5074 Winter 2008
5074 Winter 2009
5074 Winter 2010", header = TRUE)
The output can be transformed using aggregate:
aggregate(Year ~ ID + Season, data = dat, paste)
# ID Season Year
#1 5074 Summer 2008, 2009
#2 5074 Winter 2008, 2009, 2010
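If one comma-separated string per group is preferred over several values per group, paste can be given its collapse argument through an anonymous function. A minimal base R sketch, rebuilding the dat data frame so the block runs on its own; sort and unique are precautions added here for unsorted or repeated years, and collapsed is a name introduced for illustration:

```r
dat <- read.table(text = "ID Season Year
5074 Summer 2008
5074 Summer 2009
5074 Winter 2008
5074 Winter 2009
5074 Winter 2010", header = TRUE)

# Collapse each (ID, Season) group's years into a single string
collapsed <- aggregate(Year ~ ID + Season, data = dat,
                       FUN = function(y) paste(sort(unique(y)), collapse = ","))
collapsed
```

The Year column of collapsed is then an ordinary character vector, which is easy to compare or print.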
This is a perfect fit for the new nice printing of lists in data.table version 1.8.2
library(data.table)
DT <- as.data.table(dd)
DT[,list(Year = list(Year)), by = list(ID, Season)]
## ID Season Year
## 1: 5074 Summer 2008,2009
## 2: 5074 Winter 2008,2009,2010
The good thing about results in this format is that only the printing is affected; you can still access the values without any string splitting:
DT[(ID==5074)&(Season == 'Summer'), Year]
## [1] 2008 2009
DT[(ID==5074)&(Season == 'Winter'), Year]
## [1] 2008 2009 2010
I am trying to round consecutive years to the nearest year in which a census took place. Unfortunately, in NZ the spacing between censuses is not always consistent. E.g. I want to round the years 2000 to 2020 to the nearest of 2001, 2006, 2013, 2018. Is there a way to do this without resorting to a series of if_else or case_when statements?
You could use sapply to find the minimum absolute difference between the two vectors.
Suppose your vectors were like this:
census_years <- c(2001, 2006, 2013, 2018)
all_years <- 2000:2020
Then you can do:
sapply(all_years, function(x) census_years[which.min(abs(census_years - x))])
#> [1] 2001 2001 2001 2001 2006 2006 2006 2006 2006 2006 2013 2013 2013 2013 2013
#> [16] 2013 2018 2018 2018 2018 2018
Created on 2020-12-09 by the reprex package (v0.3.0)
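We can use findInterval with the midpoints between consecutive census years as the cut points, so that each year maps to its nearest census year
census_year[findInterval(year_in_question, head(census_year, -1) + diff(census_year) / 2) + 1]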
#[1] 2013
data
census_year <- c(2001, 2006, 2013, 2018)
year_in_question <- 2012
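Because findInterval is vectorized, cutting at the midpoints between consecutive census years maps a whole vector of years to the nearest census at once. A minimal sketch; mids, all_years and nearest are names introduced here for illustration:

```r
census_year <- c(2001, 2006, 2013, 2018)
all_years <- 2000:2020

# Midpoints between consecutive census years: 2003.5, 2009.5, 2015.5
mids <- head(census_year, -1) + diff(census_year) / 2

# findInterval() counts how many midpoints each year has reached (0 to 3);
# adding 1 turns that count into an index into census_year
nearest <- census_year[findInterval(all_years, mids) + 1]
nearest
```

A year that falls exactly on a midpoint would be assigned to the later census; with these census years the midpoints are never whole numbers, so no tie arises.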
This does the trick, by finding the smallest difference between the year and the census years. Vectorizing is left as an exercise...
library(magrittr)
census_year <- c(2001, 2006, 2013, 2018)
year_in_question <- 2012
abs(census_year - year_in_question) %>% # abs diff in years
which.min() %>% # index number of the smallest abs difference
census_year[.] # use that index number
# [1] 2013
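One possible way to do the vectorizing is to wrap the same minimum-absolute-difference lookup in a function and apply it to each year. This is a base R sketch without the pipe; nearest_census is a hypothetical helper name, not part of any package:

```r
census_year <- c(2001, 2006, 2013, 2018)

# Hypothetical helper: nearest census year for each element of `years`
nearest_census <- function(years, census = census_year) {
  vapply(years,
         function(y) census[which.min(abs(census - y))],  # closest census to y
         numeric(1))
}

nearest_census(2012)
nearest_census(2000:2020)
```

vapply is used instead of sapply so that the result is guaranteed to be a numeric vector with one value per input year.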
I am trying to extract the unemployment rate data from this site. The form has a select tag with some options. I can extract the table for the default years 2007 to 2017, but I am having a hard time setting values for from_year and to_year. Here is the code I have so far:
session = html_session("https://data.bls.gov/timeseries/LNS14000000")
form = read_html("https://data.bls.gov/timeseries/LNS14000000") %>% html_node("table form") %>% html_form()
set_values(form, from_year = 2000, to_year = as.numeric(format(Sys.Date(), "%Y"))) # nothing happened if I set the value for years
submit_form(session, form)
It doesn't work as expected.
Thanks so much, @Andrew!
I can use the API to extract the data.
library(rjson)
library(blsAPI)
uer1 <- list(
'seriesid'=c('LNS14000000'),
'startyear'=2000,
'endyear'=2009)
response <- blsAPI(uer1, 2, TRUE)
The response looks like:
year period periodName value seriesID
1 2009 M12 December 9.9 LNS14000000
2 2009 M11 November 9.9 LNS14000000
3 2009 M10 October 10.0 LNS14000000
4 2009 M09 September 9.8 LNS14000000
5 2009 M08 August 9.6 LNS14000000
6 2009 M07 July 9.5 LNS14000000
...
Note that the API has some query limits.
I wish to split my data sets into year quarters according to the definition of the hydrological year. According to Wikipedia, "Due to meteorological and geographical factors, the definition of the water years varies". In the USA, the hydrological year runs from October 1st of one year to September 30th of the next.
I use the definition of the hydrological year for Poland (it starts on November 1st and ends on October 31st).
The sample data set looks as follows:
sampleData <- structure(list(date = structure(c(15946, 15947, 15875, 15910, 15869, 15888, 15823, 16059, 16068, 16067), class = "Date"),`example value` = c(-0.325806595888448, 0.116001346459147, 1.68884381116696, -0.480527505762716, -0.50307381813168,-1.12032214801472, -0.659699514672226, -0.547101497279717, 0.729148872679021,-0.769760735764215)), .Names = c("date", "example value"), row.names = c(NA, -10L), class = "data.frame")
For some reason, the cut function in my code complains that "breaks" and "labels" differ in length (but they don't). If I omit the "labels" option in cut (as below), the function works perfectly.
What is wrong with labels?
ToHydroQuarters <- function(df)
{
  result <- df
  yearStart <- as.numeric(format(min(df$date), '%Y')) - 1
  # Hydrological year in Poland starts on November 1st
  DateStart <- as.Date(paste(yearStart, "-11-01", sep = ""))
  breaks <- seq(from = DateStart, to = max(df$date) + 90, by = "quarter")
  breakYear <- format(breaks, '%Y')
  # Please, do not create labels in such a way.
  # Please note that for November and December we have the next hydrological year, since it started on November 1st. So we need to check the month to decide which year we have (?) or use the cut function again as mentioned here: http://stackoverflow.com/questions/22073881/hydrological-year-time-series
  labels <- c(paste("Winter", breakYear[1]),
              paste("Spring", breakYear[2]),
              paste("Summer", breakYear[3]),
              paste("Autumn", breakYear[4]),
              paste("Autumn", breakYear[5]))
  ###### Here is the problem: once I add the labels parameter, the function complains about different lengths
  result$hydroYear <- cut(df$date, breaks)
  result
}
Firstly, I think it is unwise to hardcode the labels inside the function, since they cannot be checked without some kind of reproducible example. However, I can see what you're trying to achieve.
You claim that breaks and labels have matching lengths, but the function itself doesn't always work: even without the labels, cut does not assign a quarter to the last portion of the dates.
For example:
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="day")))
a <- ToHydroQuarters(df)
tail(a)
returns:
date hydroYear
971 2011-08-29 <NA>
972 2011-08-30 <NA>
973 2011-08-31 <NA>
974 2011-09-01 <NA>
975 2011-09-02 <NA>
976 2011-09-03 <NA>
Extending the sequence, e.g. breaks <- seq(from=DateStart, to=max(df$date)+90, by="quarter"), does resolve that issue, as it forces a break to exist beyond the last date. This might also solve the labelling issue in your function, but it does not make the function "generic".
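One way to keep labels in step with breaks is to derive them from the breaks themselves, so that there is always exactly one label per interval (length(breaks) - 1). The following is only a sketch under stated assumptions: hydro_labels is a hypothetical helper, the season names follow the Winter/Spring/Summer/Autumn order used in the question, the hydrological year is named after the calendar year in which it ends, and the sequence is extended by 92 days rather than 90 because the longest of these quarters (for example November to January) spans 92 days.

```r
# Hypothetical helper: one label per interval, i.e. length(breaks) - 1 labels
hydro_labels <- function(breaks) {
  starts <- breaks[-length(breaks)]                 # left edge of each interval
  m <- as.integer(format(starts, "%m"))
  y <- as.integer(format(starts, "%Y"))
  season <- c("Winter", "Spring", "Summer", "Autumn")[match(m, c(11, 2, 5, 8))]
  paste(season, ifelse(m == 11, y + 1L, y))         # Nov-Jan counts towards the next year
}

dates <- as.Date(c("2012-12-15", "2013-03-10", "2013-06-01",
                   "2013-09-30", "2013-11-01"))
start <- as.Date(paste0(as.integer(format(min(dates), "%Y")) - 1L, "-11-01"))
# Assumption: + 92 days guarantees a break after the last date
breaks <- seq(from = start, to = max(dates) + 92, by = "quarter")

hydro <- cut(dates, breaks, labels = hydro_labels(breaks))
as.character(hydro)
```

Because the labels are computed from breaks, their number adapts to however many quarters the data spans, which avoids the length mismatch.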
Personally, on the coding side, I think it would be better to convert the month and year parts separately, because that would be easier to understand. For example, you could use library(lubridate) to easily extract the month and specify the breaks and the labels as you normally would. I was thinking the function could look something like this:
thq <- function(date) {
  mnth <- cut(month(date), breaks = c(1, 4, 7, 10, 12),
              right = FALSE, include.lowest = TRUE,
              labels = c("Spring", "Summer", "Autumn", "Winter"))
  return(paste(mnth, ifelse(mnth == "Winter", year(date) + 1, year(date))))
}
So then using some dummy data ...
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="month")))
thq <- function(date) {
  mnth <- cut(month(date), breaks = c(1, 4, 7, 10, 12),
              right = FALSE, include.lowest = TRUE,
              labels = c("Spring", "Summer", "Autumn", "Winter"))
  return(paste(mnth, ifelse(mnth == "Winter", year(date) + 1, year(date))))
}
df$newdate <- thq(df$date)
Which has the following output:
date newdate
1 2009-01-01 Spring 2009
2 2009-02-01 Spring 2009
3 2009-03-01 Spring 2009
4 2009-04-01 Summer 2009
5 2009-05-01 Summer 2009
6 2009-06-01 Summer 2009
7 2009-07-01 Autumn 2009
8 2009-08-01 Autumn 2009
9 2009-09-01 Autumn 2009
10 2009-10-01 Winter 2010
11 2009-11-01 Winter 2010
12 2009-12-01 Winter 2010
13 2010-01-01 Spring 2010
14 2010-02-01 Spring 2010
15 2010-03-01 Spring 2010
16 2010-04-01 Summer 2010
17 2010-05-01 Summer 2010
18 2010-06-01 Summer 2010
19 2010-07-01 Autumn 2010
20 2010-08-01 Autumn 2010
21 2010-09-01 Autumn 2010
22 2010-10-01 Winter 2011
23 2010-11-01 Winter 2011
24 2010-12-01 Winter 2011
25 2011-01-01 Spring 2011
26 2011-02-01 Spring 2011
27 2011-03-01 Spring 2011
28 2011-04-01 Summer 2011
29 2011-05-01 Summer 2011
30 2011-06-01 Summer 2011
31 2011-07-01 Autumn 2011
32 2011-08-01 Autumn 2011
33 2011-09-01 Autumn 2011
You can shift the months using the modulo operator if the quarters do not line up with the calendar year...
thq <- function(date) {
  # (month + 1) %% 12 maps Nov -> 0, Dec -> 1, Jan -> 2, ..., Oct -> 11
  mnth <- cut(((month(date) + 1) %% 12), breaks = c(0, 3, 6, 9, 12),
              right = FALSE, include.lowest = TRUE,
              labels = c("Nov_Jan", "Feb_Apr", "May_Jul", "Aug_Oct"))
  # you will need to alter the return statement yourself, because
  # I feel there is enough information for you to do it, rather than
  # me changing it every time you change the question.
  return(paste(mnth, ifelse(mnth == "Winter", year(date) + 1, year(date))))
}
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="day")))
df$new <- thq(df$date)
head(df)
output:
> head(df)
date new
1 2009-01-01 Nov_Jan 2009
2 2009-01-02 Nov_Jan 2009
3 2009-01-03 Nov_Jan 2009
4 2009-01-04 Nov_Jan 2009
5 2009-01-05 Nov_Jan 2009
6 2009-01-06 Nov_Jan 2009
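One possible way to finish the return statement for the shifted quarters is to move the year forward for November and December. This is a base R sketch, not the answerer's code: hydro_quarter is a hypothetical name, format() stands in for lubridate's month() and year(), and it assumes the hydrological year is named after the calendar year in which it ends.

```r
# Hypothetical helper: quarter label plus hydrological year for each date
hydro_quarter <- function(date) {
  m <- as.integer(format(date, "%m"))
  y <- as.integer(format(date, "%Y"))
  shifted <- (m + 1L) %% 12L          # Nov -> 0, Dec -> 1, Jan -> 2, ..., Oct -> 11
  quarter <- cut(shifted, breaks = c(0, 3, 6, 9, 12), right = FALSE,
                 labels = c("Nov_Jan", "Feb_Apr", "May_Jul", "Aug_Oct"))
  # November and December already belong to the next hydrological year
  paste(quarter, ifelse(m >= 11L, y + 1L, y))
}

hydro_quarter(as.Date(c("2009-01-15", "2009-10-31", "2009-11-01",
                        "2009-12-31", "2010-02-01")))
```

Testing on the month rather than on the quarter label keeps January in the current year even though it shares the Nov_Jan quarter with November and December.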
I want to return the final row for each subsection of a dataframe. I'm aware of the ddply and aggregate functions, but they are not giving the expected output in this case, as the column by which I split the data has recurring names.
For example, in df:
year <- rep(c(2011, 2012, 2013), each=12)
season <- rep(c("Spring", "Summer", "Autumn", "Winter"), each=3)
allseason <- rep(season, 3)
temp <- rnorm(36, mean = 61, sd = 10)
df <- data.frame(year, allseason, temp)
I want to return the final temp reading at the end of every season. When I run either
final1 <- aggregate(df, list(df$allseason), tail, 1)
or
final2 <- ddply(df, .(allseason), tail, 1)
I get only the final 4 seasons (i.e. those of 2013). The function seems to stop there and does not go back to previous years/seasons. My intended output is a data frame with 12 rows * 3 columns.
All help appreciated!
*I notice that in the df created here, the allseason column is a factor with 4 levels, whereas this is not the case in my original dataframe.
In your ddply code, you only forgot to also group by year:
With plyr:
library(plyr)
ddply(df, .(year, allseason), tail, 1)
Or with dplyr
library(dplyr)
df %>%
  group_by(year, allseason) %>%
  do(tail(., 1))
Or if you want a base R alternative you can use ave:
df[with(df, ave(year, list(year, allseason), FUN = seq_along)) == 3,]
Result:
# year allseason temp
#1 2011 Autumn 63.40626
#2 2011 Spring 59.69441
#3 2011 Summer 42.33252
#4 2011 Winter 79.10926
#5 2012 Autumn 63.14974
#6 2012 Spring 60.32811
#7 2012 Summer 67.57364
#8 2012 Winter 61.39100
#9 2013 Autumn 50.30501
#10 2013 Spring 61.43044
#11 2013 Summer 55.16605
#12 2013 Winter 69.37070
Note that the output will contain the same rows in each case, only the ordering may differ.
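The ave() call above relies on every group having exactly three rows. A base R sketch that drops that assumption keeps the last occurrence of each year/season pair with duplicated(); set.seed is added here only to make the example reproducible, and last_rows is a name introduced for illustration:

```r
set.seed(42)  # only for reproducibility of this sketch
season <- rep(c("Spring", "Summer", "Autumn", "Winter"), each = 3)
df <- data.frame(year = rep(c(2011, 2012, 2013), each = 12),
                 allseason = rep(season, 3),
                 temp = rnorm(36, mean = 61, sd = 10))

# A row is kept when no later row has the same (year, allseason) pair
last_rows <- df[!duplicated(df[c("year", "allseason")], fromLast = TRUE), ]
last_rows
```

The rows come back in their original order, so the seasons stay chronological within each year instead of being sorted alphabetically.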
And just to add to @beginneR's answer, your aggregate solution should look like:
aggregate(temp ~ allseason + year, data = df, tail, 1)
# or:
with(df, aggregate(temp, list(allseason, year), tail, 1))
Result:
allseason year temp
1 Autumn 2011 64.51539
2 Spring 2011 45.14341
3 Summer 2011 62.29240
4 Winter 2011 47.97461
5 Autumn 2012 43.16781
6 Spring 2012 80.02419
7 Summer 2012 72.31149
8 Winter 2012 45.58344
9 Autumn 2013 55.92607
10 Spring 2013 52.06778
11 Summer 2013 51.01308
12 Winter 2013 53.22452
I have R data that looks like this.
Year Total
2005 238.79
2005 165.46
2005 196.07
2005 135.28
2005 180.30
2005 237.95
2005 714.74
2005 828.19
2005 516.19
2005 279.76
2005 281.88
2005 338.68
The leftmost column, Year, goes from 2005 to 2009. I want to do a qqPlot of Total using only the rows that have 2005 in the Year column. How can I do this?
Another option is to use subset(), which might seem more natural:
tmp <- subset(dat, subset = Year == 2005, select = Total)
# subset() returns a one-column data frame here, so pass the numeric column itself
qqnorm(tmp$Total)
qqline(tmp$Total)
Do note that subset() is not recommended for use in programming, as the non-standard evaluation that makes it convenient can break when it runs inside other functions or environments. Using it interactively like this is what subset() was designed for.
First, some example data:
dat <- read.table(text="Year Total
2005 238.79
2005 165.46
2005 196.07
2005 135.28
2005 180.30
2005 237.95
2008 714.74
2008 828.19
2008 516.19
2009 279.76
2009 281.88
2009 338.68", header = TRUE)
If you want a normal QQ plot:
qqnorm(dat[dat$Year == 2005, "Total"])
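For repeated use across years, the same bracket indexing can be wrapped in a small function, which avoids relying on subset() inside a function. A minimal sketch: qq_for_year is a hypothetical helper name, and the data frame is rebuilt so the block runs on its own.

```r
dat <- read.table(text = "Year Total
2005 238.79
2005 165.46
2005 196.07
2005 135.28
2005 180.30
2005 237.95
2008 714.74
2008 828.19
2008 516.19
2009 279.76
2009 281.88
2009 338.68", header = TRUE)

# Hypothetical helper: normal Q-Q plot of Total for a single year
qq_for_year <- function(data, yr) {
  total <- data[data$Year == yr, "Total"]   # numeric vector, not a data frame
  qq <- qqnorm(total, main = paste("Normal Q-Q plot, Year", yr))
  qqline(total)                             # reference line through the quartiles
  invisible(qq)                             # plotted coordinates, for inspection
}

qq <- qq_for_year(dat, 2005)
```

The returned list holds the theoretical quantiles in x and the sample values in y, in case the points need to be checked or replotted.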