I have a panel data frame like this
date firms return
5/1/1988 A 5
6/1/1988 A 6
7/1/1988 A 4
8/1/1988 A 5
9/1/1988 A 6
11/1/1988 A 6
12/1/1988 A 13
13/01/1988 A 3
14/01/1988 A 2
15/01/1988 A 5
16/01/1988 A 2
18/01/1988 A 7
19/01/1988 A 3
20/01/1988 A 5
21/01/1988 A 7
22/01/1988 A 5
23/01/1988 A 9
25/01/1988 A 1
26/01/1988 A 5
27/01/1988 A 2
28/01/1988 A 7
29/01/1988 A 2
5/1/1988 B 5
6/1/1988 B 7
7/1/1988 B 5
8/1/1988 B 9
9/1/1988 B 1
11/1/1988 B 5
12/1/1988 B 2
13/01/1988 B 7
14/01/1988 B 2
15/01/1988 B 5
16/01/1988 B 6
18/01/1988 B 8
19/01/1988 B 5
20/01/1988 B 4
21/01/1988 B 3
22/01/1988 B 18
23/01/1988 B 5
25/01/1988 B 2
26/01/1988 B 7
27/01/1988 B 3
28/01/1988 B 9
29/01/1988 B 2
From this panel I want to compute a variable called DMAX: the number of days between the day with the maximum return and the last trading day of the same month. For example, in January 1988 the maximum return for firm A occurs on 12 Jan 1988, so DMAX is the number of days from 12 Jan 1988 to the end of that month, which is 15 days.
For firm B, the maximum return occurs on 22 Jan 1988, so 6 days remain in that month. The expected outcome is therefore:
date Firms DMAX(days)
Jan-88 A 15
Jan-88 B 6
I would be grateful if you can help me in this regard.
One way is to use the dplyr package. I called your data mydf. First, convert date to a month-year label. Then group the data by date and firms. Finally, find the position of the row with the largest return in each group and subtract it from the number of rows in that group.
library(dplyr)
mutate(mydf, date = format(as.Date(date, format = "%d/%m/%Y"), "%m-%Y")) %>%
group_by(date, firms) %>%
summarize(DMAX = n() - which.max(return))
# A tibble: 2 x 3
# Groups: date [?]
# date firms DMAX
# <chr> <fct> <int>
#1 01-1988 A 15
#2 01-1988 B 6
DATA
mydf <-structure(list(date = structure(c(18L, 19L, 20L, 21L, 22L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L), .Label = c("11/1/1988",
"12/1/1988", "13/01/1988", "14/01/1988", "15/01/1988", "16/01/1988",
"18/01/1988", "19/01/1988", "20/01/1988", "21/01/1988", "22/01/1988",
"23/01/1988", "25/01/1988", "26/01/1988", "27/01/1988", "28/01/1988",
"29/01/1988", "5/1/1988", "6/1/1988", "7/1/1988", "8/1/1988",
"9/1/1988"), class = "factor"), firms = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
return = c(5L, 6L, 4L, 5L, 6L, 6L, 13L, 3L, 2L, 5L, 2L, 7L,
3L, 5L, 7L, 5L, 9L, 1L, 5L, 2L, 7L, 2L, 5L, 7L, 5L, 9L, 1L,
5L, 2L, 7L, 2L, 5L, 6L, 8L, 5L, 4L, 3L, 18L, 5L, 2L, 7L,
3L, 9L, 2L)), class = "data.frame", row.names = c(NA, -44L
))
1) Base R For each year/month and firm aggregate the difference between the number of rows and the position of the maximum return row. No packages are used.
with(transform(DF, date = as.Date(date, "%d/%m/%Y")),
aggregate(list(DMAX = return),
data.frame(date = format(date, "%Y-%m"), firms),
function(x) length(x) - which.max(x)))
giving:
date firms DMAX
1 1988-01 A 15
2 1988-01 B 6
2) zoo Read DF into a zoo object zd with one column per firm and then aggregate that by year/month. Finally melt it to a long form data frame using fortify.zoo. The fortify.zoo line can be omitted if a zoo time series object is ok as the result.
library(zoo)
zd <- read.zoo(DF, index = "date", format = "%d/%m/%Y", split = "firms")
ag <- aggregate(zd, as.yearmon, function(x) length(na.omit(x)) - which.max(na.omit(x)))
fortify.zoo(ag, melt = TRUE)
giving:
Index Series Value
1 Jan 1988 A 15
2 Jan 1988 B 6
Note that ag is a monthly zoo series of the form:
> ag
A B
Jan 1988 15 6
3) data.table
library(data.table)
DT <- as.data.table(DF)
DT[, list(DMAX = .N - which.max(return)),
by = list(date = format(as.Date(date, "%d/%m/%Y"), "%Y-%m"), firms)]
giving:
date firms DMAX
1: 1988-01 A 15
2: 1988-01 B 6
Note
Lines <- "
date firms return
5/1/1988 A 5
6/1/1988 A 6
7/1/1988 A 4
8/1/1988 A 5
9/1/1988 A 6
11/1/1988 A 6
12/1/1988 A 13
13/01/1988 A 3
14/01/1988 A 2
15/01/1988 A 5
16/01/1988 A 2
18/01/1988 A 7
19/01/1988 A 3
20/01/1988 A 5
21/01/1988 A 7
22/01/1988 A 5
23/01/1988 A 9
25/01/1988 A 1
26/01/1988 A 5
27/01/1988 A 2
28/01/1988 A 7
29/01/1988 A 2
5/1/1988 B 5
6/1/1988 B 7
7/1/1988 B 5
8/1/1988 B 9
9/1/1988 B 1
11/1/1988 B 5
12/1/1988 B 2
13/01/1988 B 7
14/01/1988 B 2
15/01/1988 B 5
16/01/1988 B 6
18/01/1988 B 8
19/01/1988 B 5
20/01/1988 B 4
21/01/1988 B 3
22/01/1988 B 18
23/01/1988 B 5
25/01/1988 B 2
26/01/1988 B 7
27/01/1988 B 3
28/01/1988 B 9
29/01/1988 B 2
"
DF <- read.table(text = Lines, header = TRUE)
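Here is a tidyverse solution. Note that it measures calendar days between the maximum-return date and the last trading date of the month, so it gives 17 and 7 rather than the 15 and 6 trading days in the expected output.
library(tidyverse)
library(lubridate)  # for dmy(); attached by library(tidyverse) only in tidyverse >= 2.0.0
library(zoo)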
df1 %>%
mutate(date = dmy(date),
month = as.yearmon(date)) %>%
group_by(firms, month) %>%
summarise(i = which(return == max(return)),
DMAX = last(date) - date[last(i)]) %>%
select(month, firms, DMAX)
## A tibble: 2 x 3
## Groups: firms [2]
# month firms DMAX
# <S3: yearmon> <chr> <time>
#1 Jan 1988 A 17 days
#2 Jan 1988 B 7 days
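The answers above measure two different things, which can be checked against the question's own numbers. Below is a minimal base R sketch using firm A's rows from the question; the names day, ret, trading_days and calendar_days are illustrative and do not come from the original posts.

```r
# Firm A, January 1988: day of month and return, copied from the question
day <- c(5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16,
         18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29)
ret <- c(5, 6, 4, 5, 6, 6, 13, 3, 2, 5, 2,
         7, 3, 5, 7, 5, 9, 1, 5, 2, 7, 2)
date <- as.Date(sprintf("1988-01-%02d", day))

i <- which.max(ret)  # position of the (first) maximum return

# Rows remaining after the maximum: this is what n() - which.max(return) counts
trading_days <- length(ret) - i

# Calendar gap between the maximum-return date and the last trading date
calendar_days <- as.integer(max(date) - date[i])

trading_days   # 15
calendar_days  # 17
```

Which of the two is wanted depends on how DMAX is defined; the expected output in the question (15 and 6) matches the row count.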
Related
df is a dataframe where each row is a pair of items (from item1 & item2).
I want to keep the first row of the data frame, and after that keep only the first row whose item1 equals the item2 of the previously kept row.
So I expect my data to look like output.
I would prefer a tidy (or purrr) way of doing so, but I am open to any suggestions.
df <- structure(list(item1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 7L),
item2 = c(4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L,
6L, 7L, 8L, 5L, 6L, 7L, 8L, 7L, 8L, 7L, 8L, 8L)), row.names = c(NA,
-24L), class = c("tbl_df", "tbl", "data.frame"))
df
#> item1 item2
#> 1 1 4
#> 2 1 5
#> 3 1 6
#> 4 1 7
#> 5 1 8
#> 6 2 4
#> 7 2 5
#> 8 2 6
#> 9 2 7
#> 10 2 8
#> 11 3 4
#> 12 3 5
#> 13 3 6
#> 14 3 7
#> 15 3 8
#> 16 4 5
#> 17 4 6
#> 18 4 7
#> 19 4 8
#> 20 5 7
#> 21 5 8
#> 22 6 7
#> 23 6 8
#> 24 7 8
output <- data.frame(item1 = c(1,4,5,7),
item2 = c(4,5,7,8))
output
#> item1 item2
#> 1 1 4
#> 2 4 5
#> 3 5 7
#> 4 7 8
Created on 2022-09-22 by the reprex package (v2.0.1)
Here's a solution using the tidyverse.
Using a lag(..., default = 1) ensures we also output the first row.
library(tidyverse)
df <- tibble(
item1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 7L),
item2 = c(4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 5L, 6L, 7L, 8L, 7L, 8L, 7L, 8L, 8L)
)
df %>%
group_by(item1) %>%
summarize(item2 = first(item2)) %>%
filter(item1 == lag(item2, default = 1))
#> # A tibble: 4 × 2
#> item1 item2
#> <int> <int>
#> 1 1 4
#> 2 4 5
#> 3 5 7
#> 4 7 8
This is probably not what you were looking for (not a very tidy solution), but it yields the desired output.
library(tidyverse)
df <- data.frame(
item1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 7L),
item2 = c(4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L,
6L, 7L, 8L, 5L, 6L, 7L, 8L, 7L, 8L, 7L, 8L, 8L)
)
my_filter <- function(df_to_find, df_orig){
value_to_find <- tail(df_to_find, 1)$item2
df_found <- df_orig %>%
filter(item1 == value_to_find) %>%
head(1)
if(nrow(df_found) > 0){
# if something found, recall this function
# with the newly found data appended to the old results
return(Recall(bind_rows(df_to_find, df_found), df_orig))
} else{
# once you reach a state when nothing else is found return the results so far
# this is called recursion in programming
return(bind_rows(df_to_find))
}
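}
# assumed usage (the call is not shown in the post): seed the recursion with the first row
my_filter(df[1, ], df)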
Here is another untidy and recursive solution:
last2current = function (x) {
first = x[1, ]
first_match = with(x, match(item2[1], item1))
if (is.na(first_match)) return(first)
other = x[first_match:nrow(x), ]
rbind(first, last2current(other))
}
last2current(df)
item1 item2
1 1 4
16 4 5
20 5 7
24 7 8
Explanation:
This is a recursive function, meaning that it calls itself. It stores the first row, then looks for the first match of item2[1] in item1 and stores the row number in first_match. If there is no first_match we are done, so it returns the first row. If there is a match, it repeats the same procedure on the rows from the first match to the end of the data frame. Finally it rbinds all the rows.
Note that this will fail if there is a row where item1 == item2 since item1[1] is included in match.
This isn't directly vectorizable, so I would do it with a simple for loop. This will almost certainly be faster than a recursive solution for any sizable data.
keep = logical(length = nrow(df))
keep[1] = TRUE
target = df$item2[1]
for(i in 2:nrow(df)) {
if(df$item1[i] == target) {
keep[i] = TRUE
target = df$item2[i]
}
}
result = df[keep, ]
result
# # A tibble: 4 × 2
# item1 item2
# <int> <int>
# 1 1 4
# 2 4 5
# 3 5 7
# 4 7 8
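The loop above can be wrapped in a small function so that it is reusable and does not misbehave on a data frame with fewer than two rows (2:nrow(df) counts backwards in that case). This is only a sketch; the function name keep_chain is made up for illustration.

```r
# Hypothetical helper: keep row 1, then each later row whose item1
# equals the item2 of the most recently kept row
keep_chain <- function(d) {
  if (nrow(d) == 0) return(d)
  keep <- logical(nrow(d))
  keep[1] <- TRUE
  target <- d$item2[1]
  for (i in seq_len(nrow(d))[-1]) {  # empty when nrow(d) == 1
    if (d$item1[i] == target) {
      keep[i] <- TRUE
      target <- d$item2[i]
    }
  }
  d[keep, , drop = FALSE]
}

# Same values as the df in the question
df <- data.frame(
  item1 = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6, 7),
  item2 = c(4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 5, 6, 7, 8, 7, 8, 7, 8, 8)
)
res <- keep_chain(df)
res
```

Like the loop it is based on, this only scans forward, so it assumes the rows are already in the order in which the chain should be followed.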
A base R recursion:
relation <- function(df, row){
if(is.na(row)) head(row, -1)
else c(row, relation(df, match(df[row, 2], df[,1])))
}
# Starting at row 1
df[relation(df, 1), ]
item1 item2
1 1 4
16 4 5
20 5 7
24 7 8
# Starting at row 2
df[relation(df, 2), ]
item1 item2
2 1 5
20 5 7
24 7 8
# Starting at row 4
df[relation(df, 4), ]
item1 item2
4 1 7
24 7 8
I have data like the example below:
a = rep(1:5, each=3)
b = rep(c("a","b","c","a","c"), each = 3)
df = data.frame(a,b)
I want to select all the rows that have "a" in column b.
I tried to do it with
df[df$a %in% a,]
Can someone give me an idea how to get them out?
df2<- structure(list(V1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), V2 = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), .Label = c("B02", "B03",
"B04", "B05", "B06", "B07", "C02", "C03", "C04", "C05", "C06",
"C07"), class = "factor")), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-24L))
I want to select specific rows whose V2 value starts with B, but not all of them, just 02, 03, 04 and 05:
1 B02
1 B03
1 B04
1 B05
2 B02
2 B03
2 B04
2 B05
I also want the original data without those rows.
We need to check the 'b' column
df[df$b %in% 'a',]
For the updated question with 'df2', we can use paste to create the strings 'B02' to 'B05' and use %in% to subset
df2[df2$V2 %in% paste0("B0", 2:5),]
Or another option is grep
df2[grep("^B0[2-5]$", df2$V2),]
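The question also asks for the original data without the selected rows. One way to get both pieces, sketched below, is to store the logical condition once and negate it. The df2 here is rebuilt with rep() and paste0() as an assumed equivalent of the dput in the question (V2 is a character column rather than a factor), and the names sel, picked and rest are illustrative.

```r
# Assumed stand-in for the question's df2: V1 in 1:2, V2 in B02..B07, C02..C07
df2 <- data.frame(
  V1 = rep(1:2, each = 12),
  V2 = rep(c(paste0("B0", 2:7), paste0("C0", 2:7)), times = 2),
  stringsAsFactors = FALSE
)

sel <- df2$V2 %in% paste0("B0", 2:5)  # TRUE for B02, B03, B04, B05

picked <- df2[sel, ]   # the selected rows
rest   <- df2[!sel, ]  # everything else

nrow(picked)
nrow(rest)
```

Negating a logical vector is safer than negating grep() indices with a minus sign, because df2[-integer(0), ] returns no rows when nothing matches.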
> df
a b
1 1 a
2 1 a
3 1 a
4 2 b
5 2 b
6 2 b
7 3 c
8 3 c
9 3 c
10 4 a
11 4 a
12 4 a
13 5 c
14 5 c
15 5 c
This basically says:
Keep all columns of df, but only the rows where column b equals 'a'
> rows_with_a<-df[df$b=='a', ]
> rows_with_a
a b
1 1 a
2 1 a
3 1 a
10 4 a
11 4 a
12 4 a
I have a data frame with over 4000 columns and 3000 rows. The columns are companies and the rows hold daily stock closing prices, one row per date. I want to remove every row except the last date of each month, i.e. keep only the last day of each month as it appears in the Date column of my data frame.
The main challenge, and the difference between my question and others, is that the last date of each month must be taken from the dates present in my data frame. It is financial data, and the non-trading days and the number of trading days differ from those in other sectors.
I illustrate some part of my dataframe.
Date A B
30/12/1999 1 3
04/01/2000 1 3
05/01/2000 1 3
06/01/2000 1 3
07/01/2000 1 3
10/01/2000 1 3
11/01/2000 1 3
12/01/2000 1 3
13/01/2000 1 3
14/01/2000 1 3
17/01/2000 1 3
18/01/2000 1 3
19/01/2000 1 3
20/01/2000 1 3
21/01/2000 1 3
24/01/2000 1 3
25/01/2000 1 3
26/01/2000 1 3
27/01/2000 1 3
28/01/2000 1 3
31/01/2000 1 3
01/02/2000 1 3
02/02/2000 1 3
03/02/2000 1 3
04/02/2000 1 3
07/02/2000 1 3
08/02/2000 1 3
09/02/2000 1 3
10/02/2000 1 3
11/02/2000 1 3
14/02/2000 1 3
15/02/2000 1 3
16/02/2000 1 3
17/02/2000 1 3
18/02/2000 1 3
21/02/2000 1 3
22/02/2000 1 3
23/02/2000 1 3
24/02/2000 1 3
25/02/2000 1 3
28/02/2000 1 3
29/02/2000 1 3
Desired output
Date A B
30/12/1999 1 3
31/01/2000 1 3
29/02/2000 1 3
I would really appreciate your help in this regard.
Using lubridate and dplyr, first parse Date
library(lubridate)
library(dplyr)
df$Date <- dmy(df$Date)
Now we can build a dplyr chain to filter:
df %>% group_by(month = month(Date), year = year(Date)) %>% filter(Date == max(Date))
where we group_by month and year columns we add, and then filter down to only the dates that are the max for each group. It returns
Source: local data frame [3 x 5]
Groups: month, year [3]
Date A B month year
(time) (int) (int) (dbl) (dbl)
1 1999-12-30 1 3 12 1999
2 2000-01-31 1 3 1 2000
3 2000-02-29 1 3 2 2000
You could, of course, do this all in base R if you prefer.
Edit: H/T @Jaap for recommending using group_by to add columns instead of a separate mutate. You could also use slice(which.max(Date)) instead of the filter term; it would likely be slightly faster, if that's a concern.
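For the base R route mentioned above, a minimal sketch could look like the following. It uses a shortened, illustrative subset of the question's dates, and the names d, ym, is_last and month_end are not from the original posts.

```r
# A few rows in the same layout as the question's data
df <- data.frame(
  Date = c("30/12/1999", "04/01/2000", "28/01/2000", "31/01/2000",
           "01/02/2000", "28/02/2000", "29/02/2000"),
  A = 1L, B = 3L,
  stringsAsFactors = FALSE
)

d  <- as.numeric(as.Date(df$Date, format = "%d/%m/%Y"))  # days since epoch
ym <- format(as.Date(df$Date, format = "%d/%m/%Y"), "%Y-%m")

# ave() returns, for every row, the latest date present in that row's month
is_last <- d == ave(d, ym, FUN = max)

month_end <- df[is_last, ]
month_end
```

Because the maximum is taken over the dates that actually occur in the data, 30/12/1999 is kept even though it is not the calendar month end.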
We can also use data.table
library(data.table)
library(lubridate)
setDT(df1)[, c('month', 'year', 'Date') :={tmp <- dmy(Date)
list(month= month(tmp), year= year(tmp), Date= tmp)}
][, .SD[ which.max(Date)] ,.(month, year)]
# month year Date A B
#1: 12 1999 1999-12-30 1 3
#2: 1 2000 2000-01-31 1 3
#3: 2 2000 2000-02-29 1 3
Here's another possibility:
month_year <- as.numeric(as.factor(sub("^[0-9]*/","",df1$Date)))
df1[!!c(diff(month_year),1),]
# Date A B
#1 30/12/1999 1 3
#21 31/01/2000 1 3
#42 29/02/2000 1 3
This solution does not change the format of the date in the original dataframe. However, it is assumed that the data is chronologically ordered like the data displayed in the OP.
data
df1 <- structure(list(Date = structure(c(41L, 4L, 6L, 7L, 8L, 12L, 14L,
16L, 17L, 18L, 22L, 24L, 26L, 27L, 28L, 32L, 34L, 36L, 37L, 38L,
42L, 1L, 2L, 3L, 5L, 9L, 10L, 11L, 13L, 15L, 19L, 20L, 21L, 23L,
25L, 29L, 30L, 31L, 33L, 35L, 39L, 40L), .Label = c("01/02/2000",
"02/02/2000", "03/02/2000", "04/01/2000", "04/02/2000", "05/01/2000",
"06/01/2000", "07/01/2000", "07/02/2000", "08/02/2000", "09/02/2000",
"10/01/2000", "10/02/2000", "11/01/2000", "11/02/2000", "12/01/2000",
"13/01/2000", "14/01/2000", "14/02/2000", "15/02/2000", "16/02/2000",
"17/01/2000", "17/02/2000", "18/01/2000", "18/02/2000", "19/01/2000",
"20/01/2000", "21/01/2000", "21/02/2000", "22/02/2000", "23/02/2000",
"24/01/2000", "24/02/2000", "25/01/2000", "25/02/2000", "26/01/2000",
"27/01/2000", "28/01/2000", "28/02/2000", "29/02/2000", "30/12/1999",
"31/01/2000"), class = "factor"), A = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), B = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
)), .Names = c("Date", "A", "B"), class = "data.frame", row.names = c(NA,
-42L))
I'd create a vector containing the end of month dates for your data like so:
library(dplyr)
df.dates = seq(as.Date("1999-01-01"),as.Date(Sys.Date()),by="months")-1
df.dates = as.data.frame(df.dates)
names(df.dates) = "Date"
df.joined = inner_join(df.dates, df)
This assumes that you have your data in a data frame with the Date column named "Date"
*Re-reading the question, this won't work if the last trading day isn't the last day of the month. @alistaire has a better solution using max(Date)
Suppose I have data which looks like this
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 8 1 X K John
1 A 2 6 9 2 X K John
1 A 2 5 8 3 X K John
2 B 2 4 6 1 X L Sam
2 B 2 3 4 2 X L Sam
2 B 2 5 7 3 X L Sam
3 C 2 5 11 1 X M John
3 C 2 5 11 2 X L John
3 C 2 5 11 3 X K John
4 D 2 8 10 1 Y M John
4 D 2 8 10 2 Y K John
4 D 2 5 7 3 Y K John
5 E 2 5 9 1 Y M Sam
5 E 2 5 9 2 Y L Sam
5 E 2 5 9 3 Y M Sam
6 F 2 4 7 1 Z M Kyle
6 F 2 5 8 2 Z L Kyle
6 F 2 5 8 3 Z M Kyle
If I apply the table function to Category and Mode, it simply counts all the rows, and the result is
K L M
X 4 4 1
Y 2 1 3
Z 0 1 2
Now what if I do not want the count over all rows, but want each Id counted only once per Category and Mode combination,
so that it looks like
K L M
X 2 2 1
Y 1 1 2
Z 0 1 1
Thanks
If df is your data.frame:
# Subset original data.frame to keep columns of interest
df1 <- df[,c("Id", "Category", "Mode")]
# Remove duplicated rows
df1 <- df1[!duplicated(df1),]
# Create table
with(df1, table(Category, Mode))
# Mode
# Category K L M
# X 2 2 1
# Y 1 1 2
# Z 0 1 1
Or in one line using unique
table(unique(df[c("Id", "Category", "Mode")])[-1])
df <- structure(list(Id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), Name = structure(c(1L, 1L, 1L,
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("A",
"B", "C", "D", "E", "F"), class = "factor"), Price = c(2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), sales = c(5L, 6L, 5L, 4L, 3L, 5L, 5L, 5L, 5L, 8L, 8L, 5L,
5L, 5L, 5L, 4L, 5L, 5L), Profit = c(8L, 9L, 8L, 6L, 4L, 7L, 11L,
11L, 11L, 10L, 10L, 7L, 9L, 9L, 9L, 7L, 8L, 8L), Month = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("X", "Y", "Z"
), class = "factor"), Mode = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 3L, 2L, 1L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 2L, 3L), .Label = c("K",
"L", "M"), class = "factor"), Supplier = structure(c(1L, 1L,
1L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L
), .Label = c("John", "Kyle", "Sam"), class = "factor")), .Names = c("Id",
"Name", "Price", "sales", "Profit", "Month", "Category", "Mode",
"Supplier"), class = "data.frame", row.names = c(NA, -18L))
We can try
library(data.table)
dcast(unique(setDT(df1[c('Category', 'Mode', 'Id')])),
Category~Mode, value.var='Id', length)
# Category K L M
#1: X 2 2 1
#2: Y 1 1 2
#3: Z 0 1 1
Or with dplyr (spread() comes from tidyr)
library(dplyr)
library(tidyr)
df1 %>%
distinct(Id, Category, Mode) %>%
group_by(Category, Mode) %>%
tally() %>%
spread(Mode, n, fill=0)
# Category K L M
# (chr) (dbl) (dbl) (dbl)
#1 X 2 2 1
#2 Y 1 1 2
#3 Z 0 1 1
Or as @David Arenburg suggested, a variant of the above is
df1 %>%
distinct(Id, Category, Mode) %>%
select(Category, Mode) %>%
table()
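To see why deduplicating on Id, Category and Mode changes the counts, the two tables can be built side by side. This is a base R sketch on a reduced copy of the question's data (only the three relevant columns, as character vectors); the names all_rows, uniq and per_id are illustrative.

```r
# Id, Category and Mode columns copied from the question
df <- data.frame(
  Id       = rep(1:6, each = 3),
  Category = rep(c("X", "Y", "Z"), times = c(9, 6, 3)),
  Mode     = c("K", "K", "K",  "L", "L", "L",  "M", "L", "K",
               "M", "K", "K",  "M", "L", "M",  "M", "L", "M"),
  stringsAsFactors = FALSE
)

# Counting every row: an Id that repeats a Mode is counted repeatedly
all_rows <- table(df$Category, df$Mode)

# Counting each (Id, Category, Mode) combination once
uniq   <- unique(df[c("Id", "Category", "Mode")])
per_id <- table(uniq$Category, uniq$Mode)

all_rows
per_id
```

For example, Id 1 contributes three X/K rows to all_rows but only one to per_id.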
I am having some trouble using the ddply function from the plyr package. I am trying to summarise the following data with counts and proportions within each group. Here's my data:
structure(list(X5employf = structure(c(1L, 3L, 1L, 1L, 1L, 3L,
1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 1L,
3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L,
3L, 3L, 1L), .Label = c("increase", "decrease", "same"), class = "factor"),
X5employff = structure(c(2L, 6L, NA, 2L, 4L, 6L, 5L, 2L,
2L, 8L, 2L, 2L, 2L, 7L, 7L, 8L, 11L, 7L, 2L, 8L, 8L, 11L,
7L, 6L, 2L, 5L, 2L, 8L, 7L, 7L, 7L, 8L, 6L, 7L, 5L, 5L, 7L,
2L, 6L, 7L, 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 2L, 5L, 2L, 2L,
2L, 5L, 12L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 2L, 5L, 2L,
13L, 9L, 9L, 9L, 7L, 8L, 5L), .Label = c("", "1", "1 and 8",
"2", "3", "4", "5", "6", "6 and 7", "6 and 7 ", "7", "8",
"1 and 8"), class = "factor")), .Names = c("X5employf", "X5employff"
), row.names = c(NA, 73L), class = "data.frame")
And here's my call using ddply:
ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), prop=(n/sum(n))*100)
This gives me the counts of each instance of X5employff correctly, but it seems as though the proportion is being calculated within each row rather than within each level of the factor X5employf, as follows:
X5employf X5employff n prop
1 increase 1 26 100
2 increase 2 1 100
3 increase 3 15 100
4 increase 1 and 8 1 100
5 increase <NA> 1 100
6 decrease 4 1 100
7 decrease 5 5 100
8 decrease 6 2 100
9 decrease 7 1 100
10 decrease 8 1 100
11 same 4 4 100
12 same 5 6 100
13 same 6 5 100
14 same 6 and 7 3 100
15 same 7 1 100
When manually calculating the proportions within each group I get this:
X5employf X5employff n prop
1 increase 1 26 59.09
2 increase 2 1 2.27
3 increase 3 15 34.09
4 increase 1 and 8 1 2.27
5 increase <NA> 1 2.27
6 decrease 4 1 10.00
7 decrease 5 5 50.00
8 decrease 6 2 20.00
9 decrease 7 1 10.00
10 decrease 8 1 10.00
11 same 4 4 21.05
12 same 5 6 31.57
13 same 6 5 26.31
14 same 6 and 7 3 15.78
15 same 7 1 5.26
As you can see the sum of proportions in each level of factor X5employf equals 100.
I know this is probably ridiculously simple, but I can't seem to get my head around it despite reading all sorts of similar posts. Can anyone help with this and my understanding of how the summarise function works?!
Many, many thanks
Marty
You cannot do it in one ddply call because what gets passed to each summarize call is a subset of your data for a specific combination of your group variables. At this lowest level, you do not have access to that intermediate level sum(n). Instead, do it in two steps:
kano_final <- ddply(kano_final, .(X5employf), transform,
sum.n = length(X5employf))
ddply(kano_final, .(X5employf, X5employff), summarise,
n = length(X5employff), prop = n / sum.n[1] * 100)
Edit: using a single ddply call and using table as you hinted towards:
ddply(kano_final, .(X5employf), summarise,
n = Filter(function(x) x > 0, table(X5employff, useNA = "ifany")),
prop = 100* prop.table(n),
X5employff = names(n))
I'd add an example with dplyr, which does this in one step with short, easy-to-read syntax. (The %.% operator from early dplyr versions has since been removed; %>% is its replacement.)
d is your data.frame
library(dplyr)
d %>%
group_by(X5employf, X5employff) %>%
summarise(n = length(X5employff)) %>%
mutate(ngr = sum(n)) %>%
mutate(prop = n/ngr*100)
will result in
Source: local data frame [15 x 5]
Groups: X5employf
X5employf X5employff n ngr prop
1 increase 1 26 44 59.090909
2 increase 2 1 44 2.272727
3 increase 3 15 44 34.090909
4 increase 1 and 8 1 44 2.272727
5 increase NA 1 44 2.272727
6 decrease 4 1 10 10.000000
7 decrease 5 5 10 50.000000
8 decrease 6 2 10 20.000000
9 decrease 7 1 10 10.000000
10 decrease 8 1 10 10.000000
11 same 4 4 19 21.052632
12 same 5 6 19 31.578947
13 same 6 5 19 26.315789
14 same 6 and 7 3 19 15.789474
15 same 7 1 19 5.263158
What you apparently want to do is to find out the proportions of X5employff for every value of X5employf. However, you don't tell ddply that X5employf and X5employff are different; to ddply, these two variables are just two variables to split up the data. Also, since there is one observation per line, i.e. count = 1 for every line of the data, the length of each (X5employf, X5employff) combination equals the sum of each (X5employf, X5employff) combination.
The simplest "plyr way" to solve your problem that I can think of is the following:
result <- ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), .drop=FALSE)
n <- result$n
n2 <- ddply(kano_final, .(X5employf), summarise, n=length(X5employff))$n
result <- data.frame(result, prop=n/rep(n2, each=13)*100)
You can also use good old xtabs:
a <- xtabs(~X5employf + X5employff, kano_final)
b <- xtabs(~X5employf, kano_final)
a/matrix(b, nrow=3, ncol=ncol(a))
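The underlying point in all of these answers is that the denominator has to be summed over the outer group (X5employf) only. A small base R sketch of that normalisation, applied to the counts printed in the question, is shown below; the vector names group, n and prop are illustrative.

```r
# Counts per (X5employf, X5employff) combination, copied from the question
group <- rep(c("increase", "decrease", "same"), each = 5)
n     <- c(26, 1, 15, 1, 1,   # increase
           1, 5, 2, 1, 1,     # decrease
           4, 6, 5, 3, 1)     # same

# ave() repeats each group's total on every row of that group,
# so the division happens within X5employf rather than within each row
prop <- n / ave(n, group, FUN = sum) * 100

data.frame(group, n, prop = round(prop, 2))

# Each group's proportions add up to 100
tapply(prop, group, sum)
```

This mirrors what the mutate(ngr = sum(n)) step does in the dplyr answer and what sum.n does in the two-step ddply answer.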