I have a data frame like so:
df <- structure(list(year = c(1990, 1990, 1990, 1990, 1990, 1990, 1990,
1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1991, 1991, 1991,
1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991,
1991), group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
value = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L, 19L)), .Names = c("year", "group", "value"
), row.names = c(NA, -30L), class = "data.frame")
> df
year group value
1 1990 A 1
2 1990 A 2
3 1990 A 3
4 1990 A 4
5 1990 A 5
6 1990 A 6
7 1990 B 7
8 1990 B 8
9 1990 B 9
10 1990 B 10
11 1990 B 11
12 1990 B 12
13 1990 B 13
14 1990 B 14
15 1990 B 15
16 1991 A 5
17 1991 A 6
18 1991 A 7
19 1991 A 8
20 1991 A 9
21 1991 A 10
22 1991 A 11
23 1991 A 12
24 1991 A 13
25 1991 A 14
26 1991 B 15
27 1991 B 16
28 1991 B 17
29 1991 B 18
30 1991 B 19
I need to apply a function for each year (I intend to do that with plyr and summarise), but only on the factor level with the most rows (A or B). Is there a way to automatically select this level for each year?
df2 <- ddply(df, .(year), summarise, result = "some operation on longest level")
desired output:
> df2
year group value result
1 1990 B 7 5
2 1990 B 8 4
3 1990 B 9 5
4 1990 B 10 3
5 1990 B 11 3
6 1990 B 12 8
7 1990 B 13 11
8 1990 B 14 7
9 1990 B 15 2
10 1991 A 5 10
11 1991 A 6 13
12 1991 A 7 9
13 1991 A 8 7
14 1991 A 9 6
15 1991 A 10 1
16 1991 A 11 15
17 1991 A 12 5
18 1991 A 13 5
19 1991 A 14 2
This might be another approach with dplyr:
library(dplyr)
df <- df %>% group_by(year, group) %>% mutate(count = n()) %>% ungroup()
df <- df %>% group_by(year) %>% filter(count == max(count)) %>% mutate(result = sqrt(value))
df$count <- NULL
Since I am not sure what function you want to apply for result, I used sqrt(value), as in @rbatt's answer.
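If you prefer a single pipeline, here is a more compact sketch of the same idea with current dplyr syntax, starting again from the original df (and again using sqrt() only as a placeholder for your real operation):
library(dplyr)
df2 <- df %>%
  add_count(year, group) %>%   # adds column n = rows per year/group combination
  group_by(year) %>%
  filter(n == max(n)) %>%      # keep only the largest group within each year
  mutate(result = sqrt(value)) %>%
  ungroup() %>%
  select(-n)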
Sorry, I don't use plyr myself, but here's how I might do it with base functions. Perhaps that will inspire a plyr solution for you.
#find largest groups for each year
maxgroups <- tapply(df$group, df$year, function(x) which.max(table(x)))
#create group names
maxpairs <- paste(names(maxgroups),levels(df$group)[maxgroups], sep=".")
#helper function
ifnotin <- function(val, set, ifnotin) { out <- val; out[!val %in% set] <- ifnotin; droplevels(out) }
#new factor indicating best group
tgroups <- ifnotin(interaction(df$year, df$group), maxpairs, NA)
#now transform the best groups by adding year to result (or whatever transformation you need to do)
transform(df, value=ifelse(!is.na(tgroups), value+year, value))
I wasn't sure whether your transformation needs to know which group/year it applies to. If you only need to know whether a row is in a group that should be transformed, you can skip tgroups and just use
needstransform <- interaction(df$year, df$group) %in% maxpairs
but tgroups has NA values, which are handy for summaries such as tapply(df$value, droplevels(tgroups), mean).
I don't think this is a very good answer because it's super obfuscated (and it doesn't use your desired plyr approach), but maybe it will stimulate someone else's thinking:
Basically, you just need to know which value of group you want to look at for each year. Say you figure that out and store those values (in the same order as the splits of the original data by year) in a variable called m. Then you can mapply a function that subsets each split (of the data by year) by group and then does whatever other calculations you want.
do.call(rbind, mapply(function(x,y) {
tmp <- x[x$group==y,]
#fun(tmp) # apply your function to the relevant subset
}, split(df,df$year), m, SIMPLIFY=FALSE))
I thought of three different ways you could generate m. Here they are:
m <- with(df, levels(group)[apply(table(group, year), 2, which.max)])
m <- levels(df$group)[sapply(split(df, df$year), function(x) which.max(sapply(split(x, x$group), nrow)))]
m <- with(df, levels(group)[apply(tapply(year, list(group, year), length),2,which.max)])
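Putting one of those definitions of m together with the mapply call above, and using sqrt() as a stand-in for your real function (a sketch, not a definitive implementation):
m <- with(df, levels(group)[apply(table(group, year), 2, which.max)])
df2 <- do.call(rbind, mapply(function(x, y) {
  tmp <- x[x$group == y, ]
  transform(tmp, result = sqrt(value))  # replace sqrt() with your real calculation
}, split(df, df$year), m, SIMPLIFY = FALSE))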
This is what I came up with:
df2 <- ddply(
df,
.(year),
summarise,
result=sqrt(
value[group == names(which.max(table(group)))]
)
)
I want to do a weekly time series analysis for each sales_point_id separately, with results showing both the actual (fact) values and the predicted values.
dput() output:
timeseries=structure(list(sales_point_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), calendar_id_operday = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L,
20L, 21L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L), line_fact_amt = c(55767L,
59913L, 36363L, 48558L, 505L, 76344L, 22533L, 11965L, 78944L,
36754L, 30621L, 55716L, 32470L, 62165L, 57986L, 2652L, 16487L,
72849L, 73715L, 65656L, 64411L, 47460L, 61866L, 10877L, 72392L,
53011L, 23544L, 76692L, 10388L, 24255L, 56684L, 59329L, 6655L,
65612L, 17495L, 10389L, 63702L, 47407L, 78782L, 22898L, 21151L,
32587L)), class = "data.frame", row.names = c(NA, -42L))
I need a weekly forecast; week = 1 means 2021-01-01 to 2021-01-08 (ymd), but there is no date column here, only the week number; that is just how these data are.
Here are my futile attempts:
library("lubridate")
# first the grouping variable
timeseries$group <- paste0(timeseries$sales_point_id)
groups <- unique(timeseries$group)
# find starting date per group and save them as a list of elements
timeseries$date <- as.Date(as.character(timeseries$calendar_id_operday), )
timeseries <- timeseries[order(timeseries$calendar_id_operday),]
start_dates <- format(timeseries$date[match(groups, timeseries$group)], )
start_dates <- strsplit(start_dates, ' ')
listed <- split(timeseries,timeseries$group)
str(listed)
listed_ts
listed_arima <- lapply(listed_ts,function(x) auto.arima(x,allowmean = F ))
#Now the forecast for each arima:
listed_forecast <- lapply(listed_arima,function(x) forecast(x,12) )# forecast 12 weeks ahead
listed_forecast
# need to flat it down to a data.frame, do.call and rbind help:
do.call(rbind,listed_forecast)
#Get a prediction of initial values
lapply(listed_arima, fitted)
Error in lapply(listed_ts, function(x) auto.arima(x, allowmean = F))
What am I doing wrong, and how can I fix it so the time series workflow runs correctly?
My ideal and desired result, just as a sample of the output format:
   sales_point_id calendar_id_operday line_fact_amt.fact line_fact_amt.predict forecast.ahead
1               1                   1                436                   435            NaN
2               1                   2                 56                    57            NaN
3               1                   3                 66                    70            NaN
4               1                   4                NaN                   NaN            524
5               1                   5                NaN                   NaN            945
6               1                   6                NaN                   NaN            235
7               1                   7                NaN                   NaN            200
8               1                   8                NaN                   NaN            326
9               1                   9                NaN                   NaN            437
10              1                  10                NaN                   NaN              7
11              1                  11                NaN                   NaN            191
12              1                  12                NaN                   NaN            321
13              1                  13                NaN                   NaN            919
14              1                  14                NaN                   NaN            407
15              1                  15                NaN                   NaN             82
As always, I appreciate any help.
You would find life much easier if you used the fable package instead of the forecast package. It handles weekly data better, and it allows forecasts of multiple series at once.
Here is an example using your data. First, we turn the data into a tsibble object, which is the data structure that fable works with. It is like a data frame except that it has a time index and an optional key to define separate time series.
library(fable)
library(tsibble)
library(lubridate)
library(dplyr)
# Turn data frame into tsibble
timeseries <- timeseries %>%
mutate(week = yearweek(paste0("2021-W",calendar_id_operday))) %>%
select(week, sales_point_id, line_fact_amt) %>%
as_tsibble(index=week, key=sales_point_id)
timeseries
#> # A tsibble: 42 x 3 [1W]
#> # Key: sales_point_id [2]
#> week sales_point_id line_fact_amt
#> <week> <int> <int>
#> 1 2021 W01 1 55767
#> 2 2021 W02 1 59913
#> 3 2021 W03 1 36363
#> 4 2021 W04 1 48558
#> 5 2021 W05 1 505
#> 6 2021 W06 1 76344
#> 7 2021 W07 1 22533
#> 8 2021 W08 1 11965
#> 9 2021 W09 1 78944
#> 10 2021 W10 1 36754
#> # … with 32 more rows
Then we fit an ARIMA model to each series, produce forecasts for 12 weeks ahead, and combine the fitted values with the forecasts as requested. The .mean column contains the point forecasts.
# Fit an ARIMA model for each group
arima <- timeseries %>%
model(arima = ARIMA(line_fact_amt))
# Forecast h steps ahead
fc <- forecast(arima, h = 12)
# Create tsibble with fitted values and forecasts
bind_rows(augment(arima), fc)
#> # A tsibble: 66 x 8 [1W]
#> # Key: sales_point_id, .model [2]
#> sales_point_id .model week line_fact_amt .fitted .resid .innov .mean
#> <int> <chr> <week> <dist> <dbl> <dbl> <dbl> <dbl>
#> 1 1 arima 2021 W01 55767 45827. 9940. 9940. NA
#> 2 1 arima 2021 W02 59913 45827. 14086. 14086. NA
#> 3 1 arima 2021 W03 36363 45827. -9464. -9464. NA
#> 4 1 arima 2021 W04 48558 45827. 2731. 2731. NA
#> 5 1 arima 2021 W05 505 45827. -45322. -45322. NA
#> 6 1 arima 2021 W06 76344 45827. 30517. 30517. NA
#> 7 1 arima 2021 W07 22533 45827. -23294. -23294. NA
#> 8 1 arima 2021 W08 11965 45827. -33862. -33862. NA
#> 9 1 arima 2021 W09 78944 45827. 33117. 33117. NA
#> 10 1 arima 2021 W10 36754 45827. -9073. -9073. NA
#> # … with 56 more rows
Created on 2022-03-01 by the reprex package (v2.0.1)
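As a small follow-up (not part of the answer above): if you only want the point forecasts as a plain tibble, you can keep just the key, the week, and the .mean column, for example:
fc %>%
  as_tibble() %>%
  select(sales_point_id, week, forecast = .mean)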
See my textbook at OTexts.com/fpp3 for more information.
I want to make groups of data where measurements are done in multiple Year on the same species at the same Lat and Long. Then, I want to run linear regression on all those groups (using N as dependent variable and Year as independent variable).
Practice dataset:
Species Year Lat Long N
1 1 1999 1 1 5
2 1 2001 2 1 5
3 2 2010 3 3 4
4 2 2010 3 3 2
5 2 2011 3 3 5
6 2 2012 3 3 8
7 3 2007 8 7 -10
8 3 2019 8 7 100
9 2 2000 1 1 5
First, I averaged the data where multiple measurements were done in the same Year on the same Species at the same Lat and Long. Then I split the data based on Lat, Long and Species. However, this still groups rows together where Lat, Long and Species are not equal ($`4`). Furthermore, I want to remove $`1`, since I only want to use data where measurements are available over multiple Years. How do I do this?
Data <- read.table("Dataset.txt", header = TRUE)
Agr_Data <- aggregate(N ~ Lat + Long + Year + Species, data = Data, mean)
Split_Data <- split(Agr_Data, Agr_Data$Lat + Agr_Data$Long + Agr_Data$Species)
Regression_Data <- lapply(Split_Data, function(Split_Data) lm(N~Year, data = Split_Data) )
Split_Data
$`3`
Lat Long Year Species N
1 1 1 1999 1 5
$`4`
Lat Long Year Species N
2 2 1 2001 1 5
3 1 1 2000 2 5
$`8`
Lat Long Year Species N
4 3 3 2010 2 3
5 3 3 2011 2 5
6 3 3 2012 2 8
$`18`
Lat Long Year Species N
7 8 7 2007 3 -10
8 8 7 2019 3 100
Desired output:
Lat Long Species Coefficients
3 3 2 2.5
8 7 3 9.167
Base R solution:
# 1. Import data:
df <- structure(list(Species = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 2L ),
Year = c(1999L, 2001L, 2010L, 2010L, 2011L, 2012L, 2007L, 2019L, 2000L),
Lat = c(1L, 2L, 3L, 3L, 3L, 3L, 8L, 8L, 1L),
Long = c(1L, 1L, 3L, 3L, 3L, 3L, 7L, 7L, 1L),
N = c(5L, 5L, 4L, 2L, 5L, 8L, -10L, 100L, 5L)),
class = "data.frame", row.names = c(NA, -9L ))
# 2. Aggregate data:
df <- aggregate(N ~ Lat + Long + Year + Species, data = df, mean)
# 3. Concatenate vecs to create grouping vec:
df$grouping_var <- paste(df$Species, df$Lat, df$Long, sep = ", ")
# 4. Split-apply-combine the lm (regressing N on Year) within each group,
#    returning NA for groups observed in only one year:
coeff_n <- sapply(split(df, df$grouping_var),
                  function(x){
                    if (nrow(x) > 1) coef(lm(N ~ Year, data = x))[["Year"]] else NA
                  })
# 5. Create a dataframe of coeffs (names(coeff_n) carries the group labels):
coeff_df <- data.frame(grouping_var = names(coeff_n), coeff_n = coeff_n,
                       row.names = NULL)
# 6. Merge the dataframes together:
df <- merge(df, coeff_df, by = "grouping_var", all.x = TRUE)
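For comparison, here is a rough tidyverse sketch of the same idea (not part of the base R answer above); dat stands for the data frame exactly as imported in step 1, before the later modifications:
library(dplyr)
dat %>%
  group_by(Species, Lat, Long, Year) %>%
  summarise(N = mean(N), .groups = "drop") %>%   # average repeated measurements
  group_by(Species, Lat, Long) %>%
  filter(n() > 1) %>%                            # drop groups with a single year
  summarise(Coefficients = coef(lm(N ~ Year))[["Year"]], .groups = "drop")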
I'm trying to iterate through some calculations on subsets of my df using a for-loop at the end of a dplyr pipe, but variables I pass to the for-loop from the df aren't recognized.
I've tried to follow steps from this post:
use for loop with pipes in R.
Basically, I'm wrapping the for-loop in a user defined function and passing the df to the function via pipes.
I'm using a product sales dataset and am trying to calculate average sales of each pair of periods within each quarter for each product (a sort of sales baseline for promotions). For example, my first pass through the subset would calculate the average of periods 2 and 3, omitting 1. My second pass would exclude period 2 and calculate the average sales for 1 and 3, etc.
#Create dataframe
Article <- rep(1:3, each = 6)
Quarter <- rep(1:2, each = 3, 3)
Period <- rep(1:3, 6)
Sales <- sample(10:20, 18, replace = T)
df <-data.frame(Article, Quarter, Period, Sales)
foo <- function(x){
for (i in unique(Period)) {
filter(Period != i) %>%
summarize(average_sales = mean(Sales))
}
return(x)
}
df <- df %>%
group_by(Article, Quarter) %>%
foo()
#Desired resultant df:
average_sales <- c(14.5, 16.5, 12, 12, 16, 15, 16.5, 12.5, 16, 15, 14, 18, 11.5, 11, 11.5, 16, 16, 12)
df$average_sales <- average_sales
print(df, row.names = F)
Article Quarter Period Sales average_sales
1 1 1 14 14.5
1 1 2 10 16.5
1 1 3 19 12.0
1 2 1 19 12.0
1 2 2 11 16.0
1 2 3 13 15.0
2 1 1 12 16.5
2 1 2 20 12.5
2 1 3 13 16.0
2 2 1 17 15.0
2 2 2 19 14.0
2 2 3 11 18.0
3 1 1 11 11.5
3 1 2 12 11.0
3 1 3 11 11.5
3 2 1 12 16.0
3 2 2 12 16.0
3 2 3 20 12.0
I know this code still doesn't give me my end result, which would ideally be a fifth variable in the df containing, for each period, the mean sales of the other two periods, but this is where I'm stuck. I'm not even sure a for-loop is the best or most efficient way to solve this problem (I'm a limited R coder and not familiar with the entire suite of tidyverse tools), but any suggestions on how to complete the data frame would also be greatly appreciated. Thanks!
Turning my comments into an answer, with some simplified examples to try to help you understand how to fix your function:
foo1 <- function(x) {
1 + 2
return(x)
}
foo1(0)
# [1] 0
foo1 is my simplified version of your function. It takes an argument x, does something that doesn't use x, and then returns x. It's a pointless function - it doesn't matter that we do 1 + 2, because nothing is done with the result. In its last line, foo1 returns the same value that was passed to it, untouched.
foo2 <- function(x) {
x + 1
return(x)
}
foo2(0)
# [1] 0
foo2 is a little bit better, but ultimately equally pointless. The calculation in the middle uses x, which is logically a step forward, but the result, x + 1, isn't saved, and the function still returns the original x that was passed in.
foo3 <- function(x) {
y <- x + 1
return(y)
}
foo3(0)
# [1] 1
Finally, a function that does something! foo3 adds 1 to its input, stores that result in a new variable y (it could just as well overwrite x with x <- x + 1), and then returns the modified variable.
With a for loop, you can't just do y <- for(...), we need to do the assignment inside the loop:
foo4 <- function(x) {
for(i in 1:3) {
y <- x + i
}
return(y)
}
foo4(0)
# [1] 3
foo4 shows a common beginner mistake - y is modified each time through the loop, but it is overwritten each time. y will be x + 1, the first time through, then y will be x + 2, then when i is 3 y will be x + 3, with no memory of the previous iterations. We need to give y some length, so it can store each iteration separately.
foo5 <- function(x) {
y <- numeric(3)
for(i in 1:3) {
y[i] <- x + i
}
return(y)
}
foo5(0)
# [1] 1 2 3
foo5 is good! We initialize y to have the right length, and each iteration of the loop saves its result to a different part of y, and then the whole y is returned at the end.
foo <- function(x) {
y <- list() # with a `list`, we don't absolutely need to specify the length upfront
for(i in unique(x$Period)) {
# use [[ for list assignment
y[[i]] <- x %>%
filter(Period != i) %>%
summarize(
period_excluded = i, # we'll use this to keep track
average_sales = mean(Sales)
)
}
# do ourselves a favor and turn the list of data frames into a single data frame
# with bind_rows before returning
return(bind_rows(y))
}
foo(df)
# period_excluded average_sales
# 1 1 14.58333
# 2 2 14.16667
# 3 3 15.58333
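If you want the same loop-based foo applied within each Article/Quarter group (closer to what the question asked for), a hedged sketch is to hand each group to foo with group_modify:
library(dplyr)
df %>%
  group_by(Article, Quarter) %>%
  group_modify(~ foo(.x)) %>%   # foo() as defined above, applied per group
  ungroup()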
If we are looking for the mean of the 'Sales' values from the periods other than the current one, we can subtract each 'Sales' value from the sum of 'Sales' within each 'Article' and 'Quarter', and divide by the group length minus 1. For example, for Article 1, Quarter 1 the sum is 14 + 10 + 19 = 43, so the value for Period 1 is (43 - 14)/2 = 14.5.
library(dplyr)
df %>%
group_by(Article, Quarter) %>%
mutate(average_sales = (sum(Sales)- Sales)/(n()-1))
# A tibble: 18 x 5
# Groups: Article, Quarter [6]
# Article Quarter Period Sales average_sales
# <int> <int> <int> <int> <dbl>
# 1 1 1 1 14 14.5
# 2 1 1 2 10 16.5
# 3 1 1 3 19 12
# 4 1 2 1 19 12
# 5 1 2 2 11 16
# 6 1 2 3 13 15
# 7 2 1 1 12 16.5
# 8 2 1 2 20 12.5
# 9 2 1 3 13 16
#10 2 2 1 17 15
#11 2 2 2 19 14
#12 2 2 3 11 18
#13 3 1 1 11 11.5
#14 3 1 2 12 11
#15 3 1 3 11 11.5
#16 3 2 1 12 16
#17 3 2 2 12 16
#18 3 2 3 20 12
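The same leave-one-out mean can be written in base R with ave (a sketch, for readers not using dplyr, applied to the df in the data block below):
df$average_sales <- ave(df$Sales, df$Article, df$Quarter,
                        FUN = function(s) (sum(s) - s) / (length(s) - 1))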
data
df <- structure(list(Article = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Quarter = c(1L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L),
Period = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L), Sales = c(14L, 10L, 19L, 19L, 11L,
13L, 12L, 20L, 13L, 17L, 19L, 11L, 11L, 12L, 11L, 12L, 12L,
20L)), row.names = c(NA, -18L), class = "data.frame")
I am new to R and I have a data frame that looks something like this.
Date A B
1990 Q1 2 3
Q2 4 2
Q3 7 6
Q4 5 3
1991 Q1 7 6
Q2 1 8
Q3 7 6
Q4 9 2
1992 Q1 1 7
Q2 4 6
Q3 1 3
Q4 5 8
...
The column stretches all the way to the end of the data, and neither the start date nor the end date is fixed, as the data is constantly updated. I would like to format the Date column into a date class and achieve something like this:
Date A B
1990 Q1 2 3
1990 Q2 4 2
1990 Q3 7 6
1990 Q4 5 3
1991 Q1 7 6
1991 Q2 1 8
1991 Q3 7 6
1991 Q4 9 2
1992 Q1 1 7
1992 Q2 4 6
1992 Q3 1 3
1992 Q4 5 8
...
I thought of recreating a new column of dates, using the first date provided by the data (i.e. '1990 Q1') as the starting date and the number of rows as the length. I was looking at the seq and as.yearqtr commands but can't seem to work out proper code for it. Does anyone know of a better way to do this?
To use the yearqtr class from the zoo package to create a year-quarter time series, first split the df$Date values into year and quarter strings, use na.locf (also from zoo) to fill in missing year values with the value from the previous row, and then convert df into a zoo time series indexed by year-quarter dates. The code would look like:
library(zoo)
# split Date into year and quarter strings
tmp <- t(sapply(strsplit(as.character(df$Date), " "), function(x) if (length(x) == 1) c(NA, x) else x))
# use na.locf to replace NA with previous year
tmp <- paste(na.locf(tmp[,1]), tmp[,2])
# transform df into a zoo time series object with yearqtr dates
df_zoo <- zoo(df[,-1], order.by = as.yearqtr(tmp))
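To check the result, or to get back to a data frame for later steps, something like this should work (fortify.zoo is zoo's helper for turning a zoo object into a data frame):
head(df_zoo)               # zoo series indexed by yearqtr
head(fortify.zoo(df_zoo))  # back to a data frame with an Index column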
We could do this in base R. Create a grouping variable using grepl and cumsum, extract the year part (or an empty string) from 'Date', replace the empty strings with the group's year value using ave, and then paste it with the quarter part extracted using sub.
df$Date <- paste(ave(sub("\\s*Q.", "", df$Date),
cumsum(grepl("^\\d+", df$Date)), FUN = function(x) x[nzchar(x)]),
sub("^\\d+\\s+", "", df$Date))
df$Date
#[1] "1990 Q1" "1990 Q2" "1990 Q3" "1990 Q4" "1991 Q1" "1991 Q2"
#[7] "1991 Q3" "1991 Q4" "1992 Q1" "1992 Q2" "1992 Q3" "1992 Q4"
No additional packages are needed.
If we need a package solution, data.table can be used
library(data.table)
library(stringr)
setDT(df)[, Date:=sub("^(Q.*)", paste0(word(Date[1],1), " \\1") , Date),
cumsum(grepl("^\\d+" , Date))]
df
# Date A B
# 1: 1990 Q1 2 3
# 2: 1990 Q2 4 2
# 3: 1990 Q3 7 6
# 4: 1990 Q4 5 3
# 5: 1991 Q1 7 6
# 6: 1991 Q2 1 8
# 7: 1991 Q3 7 6
# 8: 1991 Q4 9 2
# 9: 1992 Q1 1 7
#10: 1992 Q2 4 6
#11: 1992 Q3 1 3
#12: 1992 Q4 5 8
data
df <- structure(list(Date = c("1990 Q1", "Q2", "Q3", "Q4", "1991 Q1",
"Q2", "Q3", "Q4", "1992 Q1", "Q2", "Q3", "Q4"), A = c(2L, 4L,
7L, 5L, 7L, 1L, 7L, 9L, 1L, 4L, 1L, 5L), B = c(3L, 2L, 6L, 3L,
6L, 8L, 6L, 2L, 7L, 6L, 3L, 8L)), .Names = c("Date", "A", "B"
), row.names = c(NA, -12L), class = "data.frame")
Here is a straightforward way to create the sequence you are looking for:
numrows<-10 #number of elements desired
#create the sequence of Date objects
qtrseq<-seq(as.Date("1990-01-01"), by="quarter", length.out = numrows)
#created vector for the formatted display
qtrformatted<-paste(as.POSIXlt(qtrseq)$year+1900, quarters(qtrseq))
The downside of this method and the other listed solutions is the loss of a true Date object. There is no good way in base R to format dates as Q1, Q2, ... and still keep the object as a Date. Depending on your application, it might be best to store the date sequence in the data frame and use qtrformatted only for display, as sketched below.
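A minimal sketch of that suggestion, assuming the data really does start at 1990 Q1 and runs in consecutive quarters (one row per quarter):
df$qtr_date  <- seq(as.Date("1990-01-01"), by = "quarter", length.out = nrow(df))
df$qtr_label <- paste(format(df$qtr_date, "%Y"), quarters(df$qtr_date))
# compute on qtr_date (a real Date), print qtr_label (e.g. "1990 Q1") for display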
Best of luck.
Assuming Date is a single character column, here's an option using tidyr:
library(tidyr)
# separate date into year and quarter, inserting NAs in year as necessary
df %>% separate(Date, into = c('year', 'quarter'), fill = 'left') %>%
# fill NAs with previous value
fill(year) %>%
# join year and quarter back into a single column
unite(Date, year, quarter, sep = ' ')
# Date A B
# 1 1990 Q1 2 3
# 2 1990 Q2 4 2
# 3 1990 Q3 7 6
# 4 1990 Q4 5 3
# 5 1991 Q1 7 6
# 6 1991 Q2 1 8
# 7 1991 Q3 7 6
# 8 1991 Q4 9 2
# 9 1992 Q1 1 7
# 10 1992 Q2 4 6
# 11 1992 Q3 1 3
# 12 1992 Q4 5 8
Data
df <- structure(list(Date = structure(c(1L, 4L, 5L, 6L, 2L, 4L, 5L,
6L, 3L, 4L, 5L, 6L), .Label = c("1990 Q1", "1991 Q1", "1992 Q1",
"Q2", "Q3", "Q4"), class = "factor"), A = c(2L, 4L, 7L, 5L, 7L,
1L, 7L, 9L, 1L, 4L, 1L, 5L), B = c(3L, 2L, 6L, 3L, 6L, 8L, 6L,
2L, 7L, 6L, 3L, 8L)), .Names = c("Date", "A", "B"), class = "data.frame", row.names = c(NA,
-12L))
Here is something you can try
library(dplyr); library(stringr); library(zoo)
df %>% mutate(Date = paste(na.locf(str_extract(Date, "^[0-9]{4}")),
str_extract(Date, "Q[1-4]$"), sep = " "))
Date A B
1 1990 Q1 2 3
2 1990 Q2 4 2
3 1990 Q3 7 6
4 1990 Q4 5 3
5 1991 Q1 7 6
6 1991 Q2 1 8
7 1991 Q3 7 6
8 1991 Q4 9 2
9 1992 Q1 1 7
10 1992 Q2 4 6
11 1992 Q3 1 3
12 1992 Q4 5 8
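If you also want Date to be an actual year-quarter class rather than a character string, one possible extension of the same pipeline (zoo is already loaded above) is:
df %>% mutate(Date = as.yearqtr(paste(na.locf(str_extract(Date, "^[0-9]{4}")),
                                      str_extract(Date, "Q[1-4]$")),
                                format = "%Y Q%q"))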
I would like to create a subset of data that consists of Units that have a higher score in QTR 4 than QTR 1 (upward trend). Doesn't matter if QTR 2 or 3 are present.
Unit QTR Score
5 4 34
1 1 22
5 3 67
2 4 78
3 2 39
5 2 34
1 2 34
5 1 67
1 3 70
1 4 89
3 4 19
Subset would be:
Unit QTR Score
1 1 22
1 2 34
1 3 70
1 4 89
I've tried variants of something like this:
upward_subset <- subset(mydata,Unit if QTR=4~Score > QTR=1~Score)
Thank you for your time
If the dataframe is named "d", then this succeeds on your test set:
d[ d$Unit %in% names(which(sapply( split(d, d["Unit"]),
     function(dd) isTRUE(dd[dd$QTR == 4, "Score"] > dd[dd$QTR == 1, "Score"])))) ,
   ]
#-------------
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
An alternative in two steps:
result <- unlist(
by(
test,
test$Unit,
function(x) x$Score[x$QTR==4] > x$Score[x$QTR==1])
)
test[test$Unit %in% names(result[result==TRUE]),]
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
A solution using data.table (Probably there are better versions than what I have at the moment).
Note: Assuming a QTR value for a given Unit is unique
Data:
df <- structure(list(Unit = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L,
3L), QTR = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L), Score = c(34L,
22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)), .Names = c("Unit",
"QTR", "Score"), class = "data.frame", row.names = c(NA, -11L
))
Solution:
dt <- data.table(df, key=c("Unit", "QTR"))
dt[, Score[Score[QTR == 4] > Score[QTR == 1]], by=Unit]
Unit V1
1: 1 22
2: 1 34
3: 1 70
4: 1 89
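A variant of the same data.table idea (a sketch, not part of the original answer) that keeps every column of the qualifying units instead of just the scores:
dt[, .SD[Score[QTR == 4] > Score[QTR == 1]], by = Unit]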