Delete rows of Dataframe based on dates in R - r

I have a data frame with over 4000 columns and 3000 rows. The columns are companies and the rows hold daily stock closing prices, one observation per date. I want to remove every row except the last date of each month, i.e. keep only the data for the last day of each month that is actually available in my data frame. The last date of each month should be determined from the Date column of my data frame, not from the calendar.
The main challenge, and the difference between my question and others, is that the last date of each month must come from the dates present in my data frame. This is financial data, so the non-trading days and the number of trading days differ from other sectors.
Here is part of my data frame:
Date A B
30/12/1999 1 3
04/01/2000 1 3
05/01/2000 1 3
06/01/2000 1 3
07/01/2000 1 3
10/01/2000 1 3
11/01/2000 1 3
12/01/2000 1 3
13/01/2000 1 3
14/01/2000 1 3
17/01/2000 1 3
18/01/2000 1 3
19/01/2000 1 3
20/01/2000 1 3
21/01/2000 1 3
24/01/2000 1 3
25/01/2000 1 3
26/01/2000 1 3
27/01/2000 1 3
28/01/2000 1 3
31/01/2000 1 3
01/02/2000 1 3
02/02/2000 1 3
03/02/2000 1 3
04/02/2000 1 3
07/02/2000 1 3
08/02/2000 1 3
09/02/2000 1 3
10/02/2000 1 3
11/02/2000 1 3
14/02/2000 1 3
15/02/2000 1 3
16/02/2000 1 3
17/02/2000 1 3
18/02/2000 1 3
21/02/2000 1 3
22/02/2000 1 3
23/02/2000 1 3
24/02/2000 1 3
25/02/2000 1 3
28/02/2000 1 3
29/02/2000 1 3
Desired output
Date A B
30/12/1999 1 3
31/01/2000 1 3
29/02/2000 1 3
I would really appreciate your help in this regard.

Using lubridate and dplyr, first parse Date
library(lubridate)
library(dplyr)
df$Date <- dmy(df$Date)
Now we can build a dplyr chain to filter:
df %>% group_by(month = month(Date), year = year(Date)) %>% filter(Date == max(Date))
where we group_by month and year columns we add, and then filter down to only the dates that are the max for each group. It returns
Source: local data frame [3 x 5]
Groups: month, year [3]
Date A B month year
(time) (int) (int) (dbl) (dbl)
1 1999-12-30 1 3 12 1999
2 2000-01-31 1 3 1 2000
3 2000-02-29 1 3 2 2000
You could, of course, do this all in base R if you prefer.
Edit: H/T @Jaap for recommending using group_by to add columns instead of a separate mutate. You could also use slice(which.max(Date)) instead of the filter term; it would likely be slightly faster, if that's a concern.
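For the base R route mentioned above, here is a minimal sketch. It assumes a small made-up sample (the df below is hypothetical, not the asker's full data) and keeps the rows whose date equals the latest date seen in their year-month group:

```r
# Hypothetical sample in the same dd/mm/yyyy layout as the question
df <- data.frame(
  Date = c("30/12/1999", "04/01/2000", "28/01/2000",
           "31/01/2000", "01/02/2000", "29/02/2000"),
  A = 1L,
  B = 3L
)

# Parse the text dates into Date objects
df$Date <- as.Date(df$Date, format = "%d/%m/%Y")

# Year-month label used as the grouping key
ym <- format(df$Date, "%Y-%m")

# Latest date (as a number) within each year-month, repeated per row
last_in_month <- ave(as.numeric(df$Date), ym, FUN = max)

# Keep only the rows that fall on the last available date of their month
last_rows <- df[as.numeric(df$Date) == last_in_month, ]
last_rows
```

Because the maximum is taken over the dates that are present, a month whose last trading day is the 30th keeps the 30th.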

We can also use data.table
library(data.table)
library(lubridate)
setDT(df1)[, c('month', 'year', 'Date') :={tmp <- dmy(Date)
list(month= month(tmp), year= year(tmp), Date= tmp)}
][, .SD[ which.max(Date)] ,.(month, year)]
# month year Date A B
#1: 12 1999 1999-12-30 1 3
#2: 1 2000 2000-01-31 1 3
#3: 2 2000 2000-02-29 1 3

Here's another possibility:
month_year <- as.numeric(as.factor(sub("^[0-9]*/","",df1$Date)))
df1[!!c(diff(month_year),1),]
# Date A B
#1 30/12/1999 1 3
#21 31/01/2000 1 3
#42 29/02/2000 1 3
This solution does not change the format of the date in the original dataframe. However, it is assumed that the data is chronologically ordered like the data displayed in the OP.
data
df1 <- structure(list(Date = structure(c(41L, 4L, 6L, 7L, 8L, 12L, 14L,
16L, 17L, 18L, 22L, 24L, 26L, 27L, 28L, 32L, 34L, 36L, 37L, 38L,
42L, 1L, 2L, 3L, 5L, 9L, 10L, 11L, 13L, 15L, 19L, 20L, 21L, 23L,
25L, 29L, 30L, 31L, 33L, 35L, 39L, 40L), .Label = c("01/02/2000",
"02/02/2000", "03/02/2000", "04/01/2000", "04/02/2000", "05/01/2000",
"06/01/2000", "07/01/2000", "07/02/2000", "08/02/2000", "09/02/2000",
"10/01/2000", "10/02/2000", "11/01/2000", "11/02/2000", "12/01/2000",
"13/01/2000", "14/01/2000", "14/02/2000", "15/02/2000", "16/02/2000",
"17/01/2000", "17/02/2000", "18/01/2000", "18/02/2000", "19/01/2000",
"20/01/2000", "21/01/2000", "21/02/2000", "22/02/2000", "23/02/2000",
"24/01/2000", "24/02/2000", "25/01/2000", "25/02/2000", "26/01/2000",
"27/01/2000", "28/01/2000", "28/02/2000", "29/02/2000", "30/12/1999",
"31/01/2000"), class = "factor"), A = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), B = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
)), .Names = c("Date", "A", "B"), class = "data.frame", row.names = c(NA,
-42L))

I'd create a vector containing the end of month dates for your data like so:
library(dplyr)
df.dates = seq(as.Date("1999-01-01"),as.Date(Sys.Date()),by="months")-1
df.dates = as.data.frame(df.dates)
names(df.dates) = "Date"
df.joined = inner_join(df.dates, df)
This assumes that you have your data in a data frame with the Date column named "Date"
*Re-reading the question, this won't work if the last trading day isn't the last day of the month. @alistaire has a better solution using max(Date).

Related

Efficient way to calculate percentage of a specific value in a specific time

I have a CSV file like the one below; it is loaded in R as df_plane.
Situation flight_uses People-ID
1 1 1
2 1 1
3 0 1
1 1 2
2 1 2
3 1 2
1 1 3
2 0 3
3 1 3
1 1 4
2 1 4
3 0 4
1 1 5
2 0 5
3 0 5
1 1 6
2 1 6
3 NA 6
1 NA 7
2 1 7
3 1 7
1 1 8
2 0 8
3 0 8
1 NA 9
2 NA 9
3 1 9
1 1 10
2 1 10
3 0 10
1 0 11
2 0 11
3 0 11
I would like to find out what percentage of people use an airplane in situation 2. Is there a more efficient way than the code below? With it, I still have to calculate the percentage manually.
table(select(df_plane,situation,flight_uses))
You can use functions from the janitor package.
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
df_plane <- tibble::tribble(
~Situation, ~flight_uses, ~`People-ID`,
1L, 1L, 1L,
2L, 1L, 1L,
3L, 0L, 1L,
1L, 1L, 2L,
2L, 1L, 2L,
3L, 1L, 2L,
1L, 1L, 3L,
2L, 0L, 3L,
3L, 1L, 3L,
1L, 1L, 4L,
2L, 1L, 4L,
3L, 0L, 4L,
1L, 1L, 5L,
2L, 0L, 5L,
3L, 0L, 5L,
1L, 1L, 6L,
2L, 1L, 6L,
3L, NA, 6L,
1L, NA, 7L,
2L, 1L, 7L,
3L, 1L, 7L,
1L, 1L, 8L,
2L, 0L, 8L,
3L, 0L, 8L,
1L, NA, 9L,
2L, NA, 9L,
3L, 1L, 9L,
1L, 1L, 10L,
2L, 1L, 10L,
3L, 0L, 10L,
1L, 0L, 11L,
2L, 0L, 11L,
3L, 0L, 11L
) |>
clean_names()
df_plane |>
tabyl(situation, flight_uses) |>
adorn_percentages() |>
adorn_pct_formatting()
#> situation 0 1 NA_
#> 1 9.1% 72.7% 18.2%
#> 2 36.4% 54.5% 9.1%
#> 3 54.5% 36.4% 9.1%
Created on 2022-10-26 with reprex v2.0.2
In Situation 2, 54.5% of people use an airplane.
You can use mean to calculate the proportion
> with(df_plane,mean(replace(flight_uses, is.na(flight_uses), 0)[Situation==2]))
[1] 0.5454545
Are you asking, of those rows where Situation==2, what is the percent where flight_uses==1?
dplyr approach
dplyr is useful for these types of manipulations:
library(dplyr)
df_plane |>
filter(Situation == 2) |>
summarise(
percent_using_plane = sum(flight_uses == 1, na.rm = TRUE) / n() * 100
)
# percent_using_plane
# 1 54.54545
base R
If you want to stick with the base R table syntax (which seems fine in this case but can become unwieldy once calculations get more complicated), you were nearly there:
table(df_plane[df_plane$Situation==2,]$flight_uses) / nrow(df_plane[df_plane$Situation==2,])*100
# 0 1
# 36.36364 54.54545
Use with instead of dplyr::select and wrap it in proportions.
proportions(with(df_plane, table(flight_uses, Situation, useNA='ifany')), 2)
# Situation
# flight_uses 1 2 3
# 0 0.09090909 0.36363636 0.54545455
# 1 0.72727273 0.54545455 0.36363636
# <NA> 0.18181818 0.09090909 0.09090909
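One thing the answers above handle implicitly is the NA in Situation 2. As a small sketch (the vector below is just the Situation 2 column of the sample data, typed in by hand), the percentage depends on whether the NA stays in the denominator:

```r
# flight_uses for Situation 2, People-ID 1 to 11, copied from the sample data
flight_uses_s2 <- c(1, 1, 0, 1, 0, 1, 1, 0, NA, 1, 0)

# NA treated as "does not fly": 6 users out of 11 people
share_all <- mean(flight_uses_s2 %in% 1)

# NA dropped from the calculation: 6 users out of 10 known answers
share_known <- mean(flight_uses_s2 == 1, na.rm = TRUE)

round(100 * c(all_people = share_all, known_only = share_known), 1)
```

The 54.5% reported above corresponds to the first variant; dropping the NA row instead gives 60%.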

Number of remaining days of a month after maximum value appear

I have a panel data frame like this
date firms return
5/1/1988 A 5
6/1/1988 A 6
7/1/1988 A 4
8/1/1988 A 5
9/1/1988 A 6
11/1/1988 A 6
12/1/1988 A 13
13/01/1988 A 3
14/01/1988 A 2
15/01/1988 A 5
16/01/1988 A 2
18/01/1988 A 7
19/01/1988 A 3
20/01/1988 A 5
21/01/1988 A 7
22/01/1988 A 5
23/01/1988 A 9
25/01/1988 A 1
26/01/1988 A 5
27/01/1988 A 2
28/01/1988 A 7
29/01/1988 A 2
5/1/1988 B 5
6/1/1988 B 7
7/1/1988 B 5
8/1/1988 B 9
9/1/1988 B 1
11/1/1988 B 5
12/1/1988 B 2
13/01/1988 B 7
14/01/1988 B 2
15/01/1988 B 5
16/01/1988 B 6
18/01/1988 B 8
19/01/1988 B 5
20/01/1988 B 4
21/01/1988 B 3
22/01/1988 B 18
23/01/1988 B 5
25/01/1988 B 2
26/01/1988 B 7
27/01/1988 B 3
28/01/1988 B 9
29/01/1988 B 2
Now, from the above panel data, I want to compute a variable called DMAX. DMAX is the number of days (counted as rows in the data) between the day of the maximum return and the last trading day of the same month. For example, in January 1988 the maximum return for firm A occurs on 12 Jan 1988, so DMAX is the number of days from 12 Jan 1988 to the end of that month, which is 15.
For firm B, the maximum return occurs on 22 Jan 1988, so the remaining number of days in that month is 6. The expected outcome is therefore
date Firms DMAX(days)
Jan-88 A 15
Jan-88 B 6
I would be grateful if you can help me in this regard.
One way using the dplyr package is the following. I called your data mydf. First, convert date to a month-year label. Then group the data by date and firms. Finally, find the row with the largest return in each group and subtract its position from the number of rows in the group.
library(dplyr)
mutate(mydf, date = format(as.Date(date, format = "%d/%m/%Y"), "%m-%Y")) %>%
group_by(date, firms) %>%
summarize(DMAX = n() - which.max(return))
# A tibble: 2 x 3
# Groups: date [?]
# date firms DMAX
# <chr> <fct> <int>
#1 01-1988 A 15
#2 01-1988 B 6
DATA
mydf <-structure(list(date = structure(c(18L, 19L, 20L, 21L, 22L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L), .Label = c("11/1/1988",
"12/1/1988", "13/01/1988", "14/01/1988", "15/01/1988", "16/01/1988",
"18/01/1988", "19/01/1988", "20/01/1988", "21/01/1988", "22/01/1988",
"23/01/1988", "25/01/1988", "26/01/1988", "27/01/1988", "28/01/1988",
"29/01/1988", "5/1/1988", "6/1/1988", "7/1/1988", "8/1/1988",
"9/1/1988"), class = "factor"), firms = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
return = c(5L, 6L, 4L, 5L, 6L, 6L, 13L, 3L, 2L, 5L, 2L, 7L,
3L, 5L, 7L, 5L, 9L, 1L, 5L, 2L, 7L, 2L, 5L, 7L, 5L, 9L, 1L,
5L, 2L, 7L, 2L, 5L, 6L, 8L, 5L, 4L, 3L, 18L, 5L, 2L, 7L,
3L, 9L, 2L)), class = "data.frame", row.names = c(NA, -44L
))
1) Base R For each year/month and firm aggregate the difference between the number of rows and the position of the maximum return row. No packages are used.
with(transform(DF, date = as.Date(date, "%d/%m/%Y")),
aggregate(list(DMAX = return),
data.frame(date = format(date, "%Y-%m"), firms),
function(x) length(x) - which.max(x)))
giving:
date firms DMAX
1 1988-01 A 15
2 1988-01 B 6
2) zoo Read DF into a zoo object zd with one column per firm and then aggregate that by year/month. Finally melt it to a long form data frame using fortify.zoo. The fortify.zoo line can be omitted if a zoo time series object is ok as the result.
library(zoo)
zd <- read.zoo(DF, index = "date", format = "%d/%m/%Y", split = "firms")
ag <- aggregate(zd, as.yearmon, function(x) length(na.omit(x)) - which.max(na.omit(x)))
fortify.zoo(ag, melt = TRUE)
giving:
Index Series Value
1 Jan 1988 A 15
2 Jan 1988 B 6
Note that ag is a monthly zoo series of the form:
> ag
A B
Jan 1988 15 6
3) data.table
library(data.table)
DT <- as.data.table(DF)
DT[, list(DMAX = .N - which.max(return)),
by = list(date = format(as.Date(date, "%d/%m/%Y"), "%Y-%m"), firms)]
giving:
date firms DMAX
1: 1988-01 A 15
2: 1988-01 B 6
Note
Lines <- "
date firms return
5/1/1988 A 5
6/1/1988 A 6
7/1/1988 A 4
8/1/1988 A 5
9/1/1988 A 6
11/1/1988 A 6
12/1/1988 A 13
13/01/1988 A 3
14/01/1988 A 2
15/01/1988 A 5
16/01/1988 A 2
18/01/1988 A 7
19/01/1988 A 3
20/01/1988 A 5
21/01/1988 A 7
22/01/1988 A 5
23/01/1988 A 9
25/01/1988 A 1
26/01/1988 A 5
27/01/1988 A 2
28/01/1988 A 7
29/01/1988 A 2
5/1/1988 B 5
6/1/1988 B 7
7/1/1988 B 5
8/1/1988 B 9
9/1/1988 B 1
11/1/1988 B 5
12/1/1988 B 2
13/01/1988 B 7
14/01/1988 B 2
15/01/1988 B 5
16/01/1988 B 6
18/01/1988 B 8
19/01/1988 B 5
20/01/1988 B 4
21/01/1988 B 3
22/01/1988 B 18
23/01/1988 B 5
25/01/1988 B 2
26/01/1988 B 7
27/01/1988 B 3
28/01/1988 B 9
29/01/1988 B 2
"
DF <- read.table(text = Lines, header = TRUE)
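Here is a tidyverse solution, where df1 is the data frame from the question. Note that it subtracts dates, so it returns calendar days (17 and 7) rather than the number of remaining rows (15 and 6).
library(tidyverse)
library(lubridate)
library(zoo)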
df1 %>%
mutate(date = dmy(date),
month = as.yearmon(date)) %>%
group_by(firms, month) %>%
summarise(i = which(return == max(return)),
DMAX = last(date) - date[last(i)]) %>%
select(month, firms, DMAX)
## A tibble: 2 x 3
## Groups: firms [2]
# month firms DMAX
# <S3: yearmon> <chr> <time>
#1 Jan 1988 A 17 days
#2 Jan 1988 B " 7 days"
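The answers above disagree (15 versus 17 for firm A) because they count different things. A short sketch with firm A's January 1988 rows, retyped here as plain vectors, shows both counts side by side:

```r
# Firm A, January 1988: days of the month with an observation, and the returns
days <- c(5:9, 11:16, 18:23, 25:29)
dates <- as.Date(sprintf("%02d/01/1988", days), format = "%d/%m/%Y")
ret <- c(5, 6, 4, 5, 6, 6, 13, 3, 2, 5, 2, 7, 3, 5, 7, 5, 9, 1, 5, 2, 7, 2)

# Position of the maximum return within the month (12 Jan is the 7th row)
pos_max <- which.max(ret)

# Rows remaining after the maximum: what the question calls DMAX
trading_days <- length(ret) - pos_max

# Calendar days between the maximum and the last observed date
calendar_days <- as.numeric(max(dates) - dates[pos_max])

c(trading_days = trading_days, calendar_days = calendar_days)
```

The row-based count matches the expected output in the question; the date difference also includes the days without an observation.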

Subsetting a data frame and replacing a column based on condition

I am working on a data frame with three columns labelled as id, time1 and time2. A sample is:
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L),
time1 = c(12L, 5L, 3L, 5L, 6L, 30L, 3L, 30L, 7L, 2L, 17L, 5L, 8L, 3L, 22L, 5L, 15L, 4L, 7L, 23L),
time2=c(23L,23L,23L,23L,23L,22L,22L,22L,22L,22L,25L,25L,25L,25L,25L,24L,24L,24L,24L,24L)
),
.Names = c("id", "time1","time2"),
class = "data.frame",
row.names = c(NA,-20L)
)
I am using R and I am trying to subset this data and replace column time2 with a new column based on the following criteria:
Sum the values of time1 for each id until it is greater than or equal to the corresponding value of time2 for that id.
Replace the cells in time1 where the summations terminate with the corresponding time2 value for each id.
Column time2 is to be replaced with a new column labelled as status which consists of 0's and 1's. That is, status takes on 1 for the non-replaced values of time1 and 0 for all the replaced values of time1.
In summary, I am expecting to see something like this:
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
time1 = c(12L, 5L, 3L, 23, 22L, 17L, 5L, 25L, 5L, 15L, 24L),
status=c(1L,1L,1L,0L,0L,1L,1L,0L,1L,1L,0L)
),
.Names = c("id", "time1","status"),
class = "data.frame",
row.names = c(NA,-11L)
)
I greatly appreciate any help on this.
We can do the following:
library(tidyverse);
df %>%
group_by(id) %>%
mutate(
status = as.numeric(cumsum(time1) < time2),
time1 = ifelse(status == 1, time1, time2)) %>%
group_by(id, status) %>%
mutate(n = 1:n()) %>%
ungroup() %>%
filter(status == 1 | (status == 0 & n == 1)) %>%
select(-n, -time2)
## A tibble: 11 x 3
# id time1 status
# <int> <int> <dbl>
# 1 1 12 1.
# 2 1 5 1.
# 3 1 3 1.
# 4 1 23 0.
# 5 2 22 0.
# 6 3 17 1.
# 7 3 5 1.
# 8 3 25 0.
# 9 4 5 1.
#10 4 15 1.
#11 4 24 0.
Explanation: We group rows by id, then calculate the cumulative sum of time1 entries, and flag those rows where cumsum(time1) < time2 with 1, else with 0; we replace time1 entries with time2 entries if status == 0. Lastly we need to remove excess status = 0 rows; to do so, we regroup by id and status, number rows consecutively, and keep only one row for status = 0 per id.
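To make the steps concrete, here is a base R sketch for a single id (id 1 from the sample data, typed in as plain vectors). The variable names are only for illustration, and it assumes the cumulative sum does reach time2 at some row; if it never does, all rows are kept.

```r
# id 1 from the sample data
time1 <- c(12L, 5L, 3L, 5L, 6L)
time2 <- 23L

# Running total of time1: 12 17 20 25 31
running <- cumsum(time1)

# 1 while the running total is still below time2, 0 from the row where it is reached
status <- as.integer(running < time2)

# First row with status 0; falls back to the last row if there is none
first_zero <- match(0L, status, nomatch = length(status))

# Keep rows up to and including that one, and put time2 in the status-0 row
keep <- seq_len(first_zero)
result <- data.frame(
  time1 = ifelse(status[keep] == 1L, time1[keep], time2),
  status = status[keep]
)
result
```

For id 1 this leaves four rows, with time1 equal to 12, 5, 3, 23 and status equal to 1, 1, 1, 0, as in the expected output.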

window function is not working as expected

I have a monthly time series - monthlyTs:
monthlyTs <- ts(all.xts , frequency = 12, start=decimal_date(ymd("2012-01-29")))
head(index(monthlyTs))
[1] "2012-01-29 00:00:00 UTC" "2012-02-26 01:22:47 UTC" "2012-03-25 02:45:35 UTC" "2012-04-29 04:29:04 UTC"
[5] "2012-05-27 05:51:52 UTC" "2012-06-24 07:14:39 UTC"
I want to apply a time window that starts from 2013:
head(window(monthlyTs, start = 2013))
2012-01-29 00:00:00 2
2012-02-26 01:22:47 8
2012-03-25 02:45:35 6
2012-04-29 04:29:04 5
2012-05-27 05:51:52 4
2012-06-24 07:14:39 4
So it looks like the window function is not filtering as expected. What is wrong?
Fully reproducible example as requested:
christmas.csv - tiny CSV file (google trends for 'Christmas' request)
library(lubridate)
library(xts)
# Read the data from the CSV. Format: [week start date], [views per week]; keep only the Views column
data <- read.csv('christmas.csv', sep = ",", header = FALSE, skip = 3, col.names = c("Week", "Views"))[[2]]
# Create a weekly time series from the Views vector
myTs <- ts(data, freq = 365.25/7, start = decimal_date(ymd("2012-01-29")))
#converting from weekly to month time series
all.xts <- xts(myTs, date_decimal(index(myTs)))
monthlyTs <- ts(all.xts , frequency = 12, start=decimal_date(ymd("2012-01-29")))
head(window(monthlyTs, start = 2013))
2012-01-29 00:00:00 2
2012-02-26 01:22:47 8
2012-03-25 02:45:35 6
2012-04-29 04:29:04 5
2012-05-27 05:51:52 4
2012-06-24 07:14:39 4
There are two problems:
The object all.xts is a weekly, not a monthly, time series.
The value you pass for the argument frequency is not correct.
For the second point, change the value you pass for the argument start in your call to ts to
c(lubridate::year("2012-01-29"), lubridate::month("2012-01-29"))
and set the frequency to 12, i.e. use the line:
ts(all.xts , frequency = 12, start = c(lubridate::year("2012-01-29"), lubridate::month("2012-01-29")) )
Using the output from dput, your code can be rewritten as follows:
data <- c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 5L, 5L, 6L, 8L, 11L, 16L, 22L, 33L, 42L,
45L, 55L, 64L, 8L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 5L, 6L, 8L,
12L, 16L, 21L, 27L, 43L, 47L, 56L, 79L, 10L, 5L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 5L, 5L, 6L, 8L, 12L, 17L, 21L, 27L, 43L, 47L, 53L,
87L, 12L, 5L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 5L, 6L, 6L, 8L, 13L,
17L, 20L, 27L, 44L, 50L, 54L, 100L, 15L, 6L, 3L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 5L, 5L, 6L, 8L, 11L, 16L, 21L, 29L, 43L, 48L, 53L, 80L,
46L, 8L, 3L, 2L)
library(lubridate) # decimal_date(), date_decimal(), ymd()
library(zoo) # index()
myTs <- ts(data, freq=365.25/7, start=decimal_date(ymd("2012-01-29")))
all.xts <- xts::xts(myTs, date_decimal(index(myTs)))
monthlyTs <- ts(all.xts , frequency = 12, start = c(lubridate::year("2012-01-29"), lubridate::month("2012-01-29")) )
window(monthlyTs, start= c(2013))
The last line will print :
> window(monthlyTs, start= c(2013))
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 1 1 1 1 1 1 1 1 1 1 1 1
2014 1 1 1 1 2 2 2 2 3 3 3 4
2015 5 5 6 8 11 16 22 33 42 45 55 64
2016 8 4 2 2 2 2 2 2 1 1 1 1
2017 1 1 1 1 1 1 1 1 1 1 1 1
2018 1 1 1 1 1 1 1 2 2 2 2 2
2019 3 3 3 4 4 5 6 8 12 16 21 27
2020 43 47 56 79 10 5 2 2 2 1 1 1
2021 1 1 1 1 1 1 1 1 1 1 1 1
2022 1 1 1 1 1 1 1 1 1 1 2 2
2023 2 2 2 2 3 3 3 4 5 5 6 8
2024 12 17 21 27 43 47 53 87 12 5 2 2
2025 2 1 1 1 1 1 1 1 1 1 1 1
2026 1 1 1 1 1 1 1 1 1 1 1 1
2027 1 2 2 2 2 2 2 2 3 3 3 4
2028 5 6 6 8 13 17 20 27 44 50 54 100
2029 15 6 3 2 2 1 1 1 1 1 1 1
2030 1 1 1 1 1 1 1 1 1 1 1 1
2031 1 1 1 1 1 1 2 2 2 2 2 2
2032 3 3 3 4 5 5 6 8 11 16 21 29
2033 43 48 53 80 46 8 3 2
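On the first problem (the series is still weekly), one possible approach is to aggregate the weekly values to calendar months before building the monthly ts. The sketch below is base R only and uses made-up placeholder values rather than the Google Trends data; summing within a month is an assumption, and a mean may suit the data better.

```r
# Hypothetical weekly series: 120 weeks starting 2012-01-29, placeholder values
week_start <- seq(as.Date("2012-01-29"), by = "week", length.out = 120)
views <- seq_along(week_start)

# Calendar-month label for every week
month_id <- format(week_start, "%Y-%m")

# One value per month (tapply returns the months in sorted, i.e. chronological, order)
monthly_views <- tapply(views, month_id, sum)

# A genuine monthly series: 12 observations per year, starting January 2012
monthly_ts <- ts(as.numeric(monthly_views), frequency = 12, start = c(2012, 1))

# window() now cuts at the calendar year as expected
from_2013 <- window(monthly_ts, start = c(2013, 1))
from_2013
```

With one observation per calendar month, the series ends where the data ends instead of running years into the future.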

ddply summarise proportional count

I am having some trouble using the ddply function from the plyr package. I am trying to summarise the following data with counts and proportions within each group. Here's my data:
structure(list(X5employf = structure(c(1L, 3L, 1L, 1L, 1L, 3L,
1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 1L,
3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L,
3L, 3L, 1L), .Label = c("increase", "decrease", "same"), class = "factor"),
X5employff = structure(c(2L, 6L, NA, 2L, 4L, 6L, 5L, 2L,
2L, 8L, 2L, 2L, 2L, 7L, 7L, 8L, 11L, 7L, 2L, 8L, 8L, 11L,
7L, 6L, 2L, 5L, 2L, 8L, 7L, 7L, 7L, 8L, 6L, 7L, 5L, 5L, 7L,
2L, 6L, 7L, 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 2L, 5L, 2L, 2L,
2L, 5L, 12L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 2L, 5L, 2L,
13L, 9L, 9L, 9L, 7L, 8L, 5L), .Label = c("", "1", "1 and 8",
"2", "3", "4", "5", "6", "6 and 7", "6 and 7 ", "7", "8",
"1 and 8"), class = "factor")), .Names = c("X5employf", "X5employff"
), row.names = c(NA, 73L), class = "data.frame")
And here's my call using ddply:
ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), prop=(n/sum(n))*100)
This gives me the counts of each instance of X5employff correctly, but it seems as though the proportion is being calculated across each row and not within each level of the factor X5employf, as follows:
X5employf X5employff n prop
1 increase 1 26 100
2 increase 2 1 100
3 increase 3 15 100
4 increase 1 and 8 1 100
5 increase <NA> 1 100
6 decrease 4 1 100
7 decrease 5 5 100
8 decrease 6 2 100
9 decrease 7 1 100
10 decrease 8 1 100
11 same 4 4 100
12 same 5 6 100
13 same 6 5 100
14 same 6 and 7 3 100
15 same 7 1 100
When manually calculating the proportions within each group I get this:
X5employf X5employff n prop
1 increase 1 26 59.09
2 increase 2 1 2.27
3 increase 3 15 34.09
4 increase 1 and 8 1 2.27
5 increase <NA> 1 2.27
6 decrease 4 1 10.00
7 decrease 5 5 50.00
8 decrease 6 2 20.00
9 decrease 7 1 10.00
10 decrease 8 1 10.00
11 same 4 4 21.05
12 same 5 6 31.57
13 same 6 5 26.31
14 same 6 and 7 3 15.78
15 same 7 1 5.26
As you can see the sum of proportions in each level of factor X5employf equals 100.
I know this is probably ridiculously simple, but I can't seem to get my head around it despite reading all sorts of similar posts. Can anyone help with this and my understanding of how the summarise function works?!
Many, many thanks
Marty
You cannot do it in one ddply call because what gets passed to each summarize call is a subset of your data for a specific combination of your group variables. At this lowest level, you do not have access to that intermediate level sum(n). Instead, do it in two steps:
kano_final <- ddply(kano_final, .(X5employf), transform,
sum.n = length(X5employf))
ddply(kano_final, .(X5employf, X5employff), summarise,
n = length(X5employff), prop = n / sum.n[1] * 100)
Edit: using a single ddply call and using table as you hinted towards:
ddply(kano_final, .(X5employf), summarise,
n = Filter(function(x) x > 0, table(X5employff, useNA = "ifany")),
prop = 100* prop.table(n),
X5employff = names(n))
I'd add an example with dplyr, which does this in one step with short, easy-to-read syntax.
d is your data.frame
library(dplyr)
d %>%
  group_by(X5employf, X5employff) %>%
  summarise(n = length(X5employff)) %>%
  mutate(ngr = sum(n)) %>%
  mutate(prop = n / ngr * 100)
will result in
Source: local data frame [15 x 5]
Groups: X5employf
X5employf X5employff n ngr prop
1 increase 1 26 44 59.090909
2 increase 2 1 44 2.272727
3 increase 3 15 44 34.090909
4 increase 1 and 8 1 44 2.272727
5 increase NA 1 44 2.272727
6 decrease 4 1 10 10.000000
7 decrease 5 5 10 50.000000
8 decrease 6 2 10 20.000000
9 decrease 7 1 10 10.000000
10 decrease 8 1 10 10.000000
11 same 4 4 19 21.052632
12 same 5 6 19 31.578947
13 same 6 5 19 26.315789
14 same 6 and 7 3 19 15.789474
15 same 7 1 19 5.263158
What you apparently want to do is to find out the proportions of X5employff for every value of X5employf. However, you don't tell ddply that X5employf and X5employff are different; to ddply, these two variables are just two variables to split up the data. Also, since there is one observation per line, i.e. count = 1 for every line of the data, the length of each (X5employf, X5employff) combination equals the sum of each (X5employf, X5employff) combination.
The simplest "plyr way" to solve your problem that I can think of is the following:
result <- ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), drop=FALSE)
n <- result$n
n2 <- ddply(kano_final, .(X5employf), summarise, n=length(X5employff))$n
result <- data.frame(result, prop=n/rep(n2, each=13)*100)
You can also use good old xtabs:
a <- xtabs(~X5employf + X5employff, kano_final)
b <- xtabs(~X5employf, kano_final)
a/matrix(b, nrow=3, ncol=ncol(a))
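As a base R sketch of the same count-then-divide-within-group idea, here is a version on a tiny made-up data frame (d, g and v are hypothetical names standing in for kano_final, X5employf and X5employff):

```r
# Hypothetical data: g is the outer group, v the category counted within it
d <- data.frame(
  g = c("a", "a", "a", "b", "b"),
  v = c("x", "x", "y", "x", "y")
)

# Counts for every (g, v) combination in long form
counts <- as.data.frame(table(g = d$g, v = d$v), responseName = "n")

# Divide each count by the total of its g group, so each group sums to 100
counts$prop <- ave(as.numeric(counts$n), counts$g,
                   FUN = function(n) n / sum(n) * 100)
counts
```

The key point is that the denominator is computed per level of the outer grouping variable only, not per (g, v) combination, which is why a single summarise over both variables returns 100 everywhere.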
