I have a data set, ProductTable, and I want to return, for each ProductsFamily, the date it was ordered for the first time and the date it was ordered for the very last time. Example:
ProductTable
OrderPostingYear OrderPostingMonth OrderPostingDate ProductsFamily Sales QTY
2008 1 20 R1 5234 1
2008 1 12 R2 223 2
2009 1 30 R3 34 1
2008 2 1 R1 1634 3
2010 4 23 R3 224 1
2009 3 20 R1 5234 1
2010 7 12 R2 223 2
The desired result is as follows:
OrderTime
ProductsFamily OrderStart OrderEnd SumSales
R1 2008/1/20 2009/3/20 12102
R2 2008/1/12 2010/7/12 446
R3 2009/1/30 2010/4/23 258
I have no idea how to do it. Any suggestions?
ProductTable <- structure(list(OrderPostingYear = c(2008L, 2008L, 2009L, 2008L,
2010L, 2009L, 2010L), OrderPostingMonth = c(1L, 1L, 1L, 2L, 4L,
3L, 7L), OrderPostingDate = c(20L, 12L, 30L, 1L, 23L, 20L, 12L
), ProductsFamily = structure(c(1L, 2L, 3L, 1L, 3L, 1L, 2L), .Label = c("R1",
"R2", "R3"), class = "factor"), Sales = c(5234L, 223L, 34L, 1634L,
224L, 5234L, 223L), QTY = c(1L, 2L, 1L, 3L, 1L, 1L, 2L)), .Names = c("OrderPostingYear",
"OrderPostingMonth", "OrderPostingDate", "ProductsFamily", "Sales",
"QTY"), class = "data.frame", row.names = c(NA, -7L))
We can also use dplyr/tidyr to do this. We arrange the rows by family and date, concatenate the year, month and day columns into a single 'Date' column with unite, group by 'ProductsFamily', and take the first and last 'Date' and the sum of 'Sales' within summarise.
library(dplyr)
library(tidyr)
ProductTable %>%
arrange(ProductsFamily, OrderPostingYear, OrderPostingMonth, OrderPostingDate) %>%
unite(Date,OrderPostingYear:OrderPostingDate, sep='/') %>%
group_by(ProductsFamily) %>%
summarise(OrderStart=first(Date), OrderEnd=last(Date), SumSales=sum(Sales))
# Source: local data frame [3 x 4]
# ProductsFamily OrderStart OrderEnd SumSales
# (fctr) (chr) (chr) (int)
# 1 R1 2008/1/20 2009/3/20 12102
# 2 R2 2008/1/12 2010/7/12 446
# 3 R3 2009/1/30 2010/4/23 258
You can first build the date in a new column, and then aggregate your data using the data.table package (taking the first and last date per ProductsFamily, as well as the sum of sales):
library(data.table)
# First build up the date
ProductTable$date = with(ProductTable,
as.Date(paste(OrderPostingYear,
OrderPostingMonth,
OrderPostingDate, sep = "." ),
format = "%Y.%m.%d"))
# In a second step, aggregate your data
setDT(ProductTable)[,list(OrderStart = sort(date)[1],
OrderEnd = sort(date)[.N],
SumSales = sum(Sales))
,ProductsFamily]
# ProductsFamily OrderStart OrderEnd SumSales
#1: R1 2008-01-20 2009-03-20 12102
#2: R2 2008-01-12 2010-07-12 446
#3: R3 2009-01-30 2010-04-23 258
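For readers who prefer to avoid extra packages, the same idea can be sketched in base R. This is only an illustration under the assumption that the sample data from the question is used; the helper column name Date is my own choice, and ProductsFamily is kept as character rather than factor for simplicity.

```r
# Same sample data as in the question (ProductsFamily kept as character here).
ProductTable <- data.frame(
  OrderPostingYear  = c(2008L, 2008L, 2009L, 2008L, 2010L, 2009L, 2010L),
  OrderPostingMonth = c(1L, 1L, 1L, 2L, 4L, 3L, 7L),
  OrderPostingDate  = c(20L, 12L, 30L, 1L, 23L, 20L, 12L),
  ProductsFamily    = c("R1", "R2", "R3", "R1", "R3", "R1", "R2"),
  Sales             = c(5234L, 223L, 34L, 1634L, 224L, 5234L, 223L)
)

# Build a real Date column so that min()/max() compare dates, not strings.
ProductTable$Date <- as.Date(
  with(ProductTable, paste(OrderPostingYear, OrderPostingMonth,
                           OrderPostingDate, sep = "-")),
  format = "%Y-%m-%d"
)

# One summary row per family, then stack the rows into one data frame.
OrderTime <- do.call(rbind, lapply(
  split(ProductTable, ProductTable$ProductsFamily),
  function(g) data.frame(
    ProductsFamily = g$ProductsFamily[1],
    OrderStart     = min(g$Date),
    OrderEnd       = max(g$Date),
    SumSales       = sum(g$Sales)
  )
))
OrderTime
```

Because min() and max() work on Date values, this version does not depend on the rows being sorted beforehand.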
I have a dataset simplified as below: there are multiple customers and each CUSTOMER may have several loans. A CUSTOMER with at least 1 LOAN_DEFAULT is marked as CUSTOMER_DEFAULT, and the DEFAULT_DATE is the first time of default.
CUSTOMER LOAN DATE AMOUNT LOAN_DEFAULT CUSTOMER_DEFAULT DEFAULT_DATE CLASSIFICATION
1 101 201601 100 Y Y 201501 S
1 102 201603 100 N Y 201501 S
1 103 201501 100 Y Y 201501 S
1 104 201501 200 N Y 201501 S
2 201 201601 100 N N - M
2 202 201603 100 N N - M
How can I calculate the loan amount for a customer at the first default date? For example, the first default date for customer 1 is 201501, and the total AMOUNT in that month is 300, so that is the number I should get.
My idea was to compare DATE and DEFAULT_DATE and, where they are the same, use the sum function. But my code didn't work.
I also want to summarise the number of default customers by CLASSIFICATION, but the summarise function does not seem to work properly for this.
We can sum AMOUNT value where DATE is equal to first DEFAULT_DATE for each CUSTOMER.
library(dplyr)
df %>%
group_by(CUSTOMER) %>%
summarise(total_sum = sum(AMOUNT[DATE == first(DEFAULT_DATE)]))
# CUSTOMER total_sum
# <int> <int>
#1 1 300
#2 2 0
To get the number of default customers for each CLASSIFICATION, we can do:
df %>%
group_by(CLASSIFICATION) %>%
summarise(no_default_cust = n_distinct(CUSTOMER[CUSTOMER_DEFAULT == "Y"]))
data
df <- structure(list(CUSTOMER = c(1L, 1L, 1L, 1L, 2L, 2L), LOAN = c(101L,
102L, 103L, 104L, 201L, 202L), DATE = c(201601L, 201603L, 201501L,
201501L, 201601L, 201603L), AMOUNT = c(100L, 100L, 100L, 200L,
100L, 100L), LOAN_DEFAULT = structure(c(2L, 1L, 2L, 1L, 1L, 1L
), .Label = c("N", "Y"), class = "factor"), CUSTOMER_DEFAULT = structure(c(2L,
2L, 2L, 2L, 1L, 1L), .Label = c("N", "Y"), class = "factor"),
DEFAULT_DATE = structure(c(2L, 2L, 2L, 2L, 1L, 1L), .Label = c("-",
"201501"), class = "factor"), CLASSIFICATION = structure(c(2L,
2L, 2L, 2L, 1L, 1L), .Label = c("M", "S"), class = "factor")),
class = "data.frame", row.names = c(NA, -6L))
In base R you can use aggregate to get the sum of AMOUNT per CUSTOMER. With x[x$DATE == x$DEFAULT_DATE,] you can subset to those rows where DATE equals DEFAULT_DATE. Customers with no matching row (here customer 2) do not appear in the result.
aggregate(AMOUNT ~ CUSTOMER, x[x$DATE == x$DEFAULT_DATE,], sum)
# CUSTOMER AMOUNT
#1 1 300
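If customers without a default should show up with a total of 0 rather than being dropped, one possible base R sketch is to multiply AMOUNT by the match indicator and sum per customer with tapply. The names at_default and total are illustrative only, and the data frame below is a cut-down copy of the question's sample.

```r
# Cut-down copy of the sample data (only the columns needed here).
x <- data.frame(
  CUSTOMER     = c(1L, 1L, 1L, 1L, 2L, 2L),
  DATE         = c(201601L, 201603L, 201501L, 201501L, 201601L, 201603L),
  AMOUNT       = c(100L, 100L, 100L, 200L, 100L, 100L),
  DEFAULT_DATE = c("201501", "201501", "201501", "201501", "-", "-")
)

# TRUE where the loan date is the customer's default date.
# DATE is coerced to text for the comparison, so "-" never matches.
at_default <- x$DATE == x$DEFAULT_DATE

# Non-matching rows contribute 0, so every customer keeps an entry.
total <- tapply(x$AMOUNT * at_default, x$CUSTOMER, sum)
total
```

The result is a named vector indexed by CUSTOMER, which can be turned into a data frame with as.data.frame(as.table(total)) if needed.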
To get the number of default customers by Classification you can use table in combination with unique:
table(unique(x[x$CUSTOMER_DEFAULT=="Y",c("CUSTOMER", "CLASSIFICATION")])[,2])
#M S
#0 1
Data:
x <- read.table(header=TRUE, text="CUSTOMER LOAN DATE AMOUNT LOAN_DEFAULT CUSTOMER_DEFAULT DEFAULT_DATE CLASSIFICATION
1 101 201601 100 Y Y 201501 S
1 102 201603 100 N Y 201501 S
1 103 201501 100 Y Y 201501 S
1 104 201501 200 N Y 201501 S
2 201 201601 100 N N - M
2 202 201603 100 N N - M")
Another option with data.table
library(data.table)
setDT(df)[, .(total_sum = sum(AMOUNT[DATE == first(DEFAULT_DATE)])), CUSTOMER]
I am currently working with a data set in R that contains four variables for a large set of individuals: pid, month, window, and agedays. I'm trying to create a loop that will output the min and max agedays of each group of combinations between month and window into a new data table that I can export as a csv.
Here's an example of the data:
pid agedays month window
1 22 2 1
2 35 3 2
3 33 3 2
4 55 3 2
1 66 2 1
2 55 4 2
3 80 4 2
4 90 4 2
I'd like for the new data table to contain the min and max agedays of each group within each combination of window and month as well as the count of each group within each combination. The range for month is 2-24 and the range for window is 0-2.
The data table should look something like this:
month window min max N
2 1 22 66 1
3 2 33 55 3
etc....
where N is the number of unique individuals (pids) within each group
After grouping by 'month', 'window', get the min, max of 'agedays' and the number of distinct (n_distinct) elements of 'pid'
library(dplyr)
df1 %>%
group_by(month, window) %>%
summarise(min = min(agedays), max = max(agedays), N = n_distinct(pid))
# A tibble: 3 x 5
# Groups: month [3]
# month window min max N
# <int> <int> <int> <int> <int>
#1 2 1 22 66 1
#2 3 2 33 55 3
#3 4 2 55 90 3
We can also do this with data.table
library(data.table)
setDT(df1)[, .(min = min(agedays), max = max(agedays),
N = uniqueN(pid)), by = .(month, window)]
Or using split from base R
do.call(rbind, lapply(split(df1, df1[c('month', 'window')], drop = TRUE),
function(x) cbind(month = x$month[1], window = x$window[1], min = min(x$agedays), max = max(x$agedays),
N = length(unique(x$pid)))))
data
df1 <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
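Since the question also asks for a table that can be exported as a CSV, here is a small base R sketch of that last step. It recomputes the summary from the sample data and writes it with write.csv; the output path is a temporary file chosen purely for illustration, so substitute your own file name.

```r
df1 <- data.frame(
  pid     = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  agedays = c(22L, 35L, 33L, 55L, 66L, 55L, 80L, 90L),
  month   = c(2L, 3L, 3L, 3L, 2L, 4L, 4L, 4L),
  window  = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)
)

# One summary row per observed month/window combination.
res <- do.call(rbind, lapply(
  split(df1, list(df1$month, df1$window), drop = TRUE),
  function(g) data.frame(
    month  = g$month[1],
    window = g$window[1],
    min    = min(g$agedays),
    max    = max(g$agedays),
    N      = length(unique(g$pid))   # number of distinct individuals
  )
))

# Hypothetical output location; replace with a real path such as "summary.csv".
out_file <- tempfile(fileext = ".csv")
write.csv(res, out_file, row.names = FALSE)
```

Setting row.names = FALSE keeps the group labels created by split out of the exported file.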
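Using data.table, we can calculate min, max of agedays along with number of rows for each combination of month and window. Note that .N counts rows, not individuals, so the first group below gets N = 2; use uniqueN(pid) instead if N should be the number of unique pids.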
library(data.table)
setDT(df) #Convert to data.table if it is not already
df[, .(min_age = min(agedays, na.rm = TRUE),
max_age = max(agedays, na.rm = TRUE), N = .N), .(month, window)]
# month window min_age max_age N
#1: 2 1 22 66 2
#2: 3 2 33 55 3
#3: 4 2 55 90 3
data
df <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -8L))
I have a question about adding up rows from different tables that have the same column names. I have two time series tables with values for 8760 rows (a whole year).
Table1
Name Year Month Day Hour Value
Plant_1 2020 1 1 1 10
Plant_2 2020 1 1 1 20
Plant_3 2020 1 1 1 30
Plant_1 2020 1 1 2 40
Plant_2 2020 1 1 2 50
Plant_3 2020 1 1 2 60
Table2
Name Year Month Day Hour Value
Plant_x 2020 1 1 1 1
Plant_y 2020 1 1 1 2
Plant_z 2020 1 1 1 3
Plant_x 2020 1 1 2 4
Plant_y 2020 1 1 2 5
Plant_z 2020 1 1 2 6
What I want is the sum of the values of all plants for the same time period, like this:
Year Month Day Hour Value
2020 1 1 1 66
2020 1 1 2 165
I don't care about the name of the plant, but I need the sum of the total value at each hour of the year. I was trying something like the code below, but it doesn't work for more than two tables, and I have 9 to 10 such tables. Could anyone help me improve this code, or suggest another function I can use?
SumOfValue <- Table1%>%
full_join(Table2) %>%
group_by (Year,Month,Day,Hour) %>%
summarise(Value=sum(Value))
Any help would be appreciated. Thank you.
It looks like your two dataframes have the same exact format, so you can just rbind them and then get the summary per Year, Month, Day and Hour.
library(dplyr)
df <- rbind(a, b) %>%
  group_by(Year, Month, Day, Hour) %>%
  summarise(Value = sum(Value))
# Alternative as suggested by Sotos
bind_rows(a, b) %>%
  group_by(Year, Month, Day, Hour) %>%
  summarise(Value = sum(Value))
# A tibble: 2 x 5
# Groups: Year, Month, Day [?]
#    Year Month   Day  Hour Value
#   <int> <int> <int> <int> <int>
# 1  2020     1     1     1    66
# 2  2020     1     1     2   165
Data
a = structure(list(Name = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Plant_1",
"Plant_2", "Plant_3"), class = "factor"), Year = c(2020L, 2020L,
2020L, 2020L, 2020L, 2020L), Month = c(1L, 1L, 1L, 1L, 1L, 1L
), Day = c(1L, 1L, 1L, 1L, 1L, 1L), Hour = c(1L, 1L, 1L, 2L,
2L, 2L), Value = c(10L, 20L, 30L, 40L, 50L, 60L)), class = "data.frame", row.names = c(NA,
-6L))
b = structure(list(Name = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Plant_x",
"Plant_y", "Plant_z"), class = "factor"), Year = c(2020L, 2020L,
2020L, 2020L, 2020L, 2020L), Month = c(1L, 1L, 1L, 1L, 1L, 1L
), Day = c(1L, 1L, 1L, 1L, 1L, 1L), Hour = c(1L, 1L, 1L, 2L,
2L, 2L), Value = 1:6), class = "data.frame", row.names = c(NA,
-6L))
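For the case of 9 to 10 tables mentioned in the question, the same stack-then-sum idea extends naturally if the tables are first collected in a list. The following is a base R sketch, assuming all tables share the same columns; the list name tables is illustrative, and only the two sample tables are included here.

```r
a <- data.frame(
  Name  = c("Plant_1", "Plant_2", "Plant_3", "Plant_1", "Plant_2", "Plant_3"),
  Year  = 2020L, Month = 1L, Day = 1L,
  Hour  = c(1L, 1L, 1L, 2L, 2L, 2L),
  Value = c(10L, 20L, 30L, 40L, 50L, 60L)
)
b <- data.frame(
  Name  = c("Plant_x", "Plant_y", "Plant_z", "Plant_x", "Plant_y", "Plant_z"),
  Year  = 2020L, Month = 1L, Day = 1L,
  Hour  = c(1L, 1L, 1L, 2L, 2L, 2L),
  Value = 1:6
)

# Put every table in one list; append the remaining tables here.
tables <- list(a, b)

# Stack all tables row-wise, however many there are.
combined <- do.call(rbind, tables)

# Sum Value for each Year/Month/Day/Hour, ignoring the plant name.
SumOfValue <- aggregate(Value ~ Year + Month + Day + Hour,
                        data = combined, FUN = sum)
SumOfValue
```

With dplyr, bind_rows(tables) accepts a list in the same way, so the grouped summarise from the answer above can be reused unchanged.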
Consider the sample data
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
),
.Names = c("id", "A", "B"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id (stored in column 1) has a varying number of entries for columns A and B. In the example data, there are four observations with id = 1. I am looking for a way to subset this data in R so that there are at most 3 entries for each id, and then to create another column (labelled C) which holds the order of each entry within its id. The expected output would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 1L, 0L, 0L),
C = c(1L, 2L, 3L, 1L, 2L, 1L)
),
.Names = c("id", "A", "B","C"),
class = "data.frame",
row.names = c(NA,-6L)
)
Your help is much appreciated.
Like this?
library(data.table)
dt <- as.data.table(df)
dt[, C := seq(.N), by = id]
dt <- dt[C <= 3,]
dt
# id A B C
# 1: 1 20 1 1
# 2: 1 12 1 2
# 3: 1 13 0 3
# 4: 2 11 1 1
# 5: 2 21 0 2
# 6: 3 17 0 1
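The same "number the rows within each id, then keep the first three" logic can also be sketched in base R with ave, in case data.table is not available. The result name res is my own; the data is the sample from the question.

```r
df <- data.frame(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
  A  = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
  B  = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
)

# Running counter within each id: 1, 2, 3, ... in the original row order.
df$C <- ave(seq_along(df$id), df$id, FUN = seq_along)

# Keep at most the first three entries per id.
res <- df[df$C <= 3, ]
res
```

This keeps the first three rows per id in their original order, which matches the expected output in the question.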
Here is one option with dplyr that keeps the top 3 values based on A (based on the comments of @Ronak Shah).
library(dplyr)
df %>%
group_by(id) %>%
top_n(n = 3, wt = A) %>% # top 3 values based on A
mutate(C = rank(id, ties.method = "first")) # C consists of the order of each id
# A tibble: 6 x 4
# Groups: id [3]
#      id     A     B     C
#   <int> <int> <int> <int>
# 1     1    20     1     1
# 2     1    12     1     2
# 3     1    13     0     3
# 4     2    11     1     1
# 5     2    21     0     2
# 6     3    17     0     1
I want to delete the rows whose values in the columns YEAR, POL, CTY, ID and AMOUNT are equal across rows. Please see the output table below.
Table:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 36 3500 RANGE1 L0015N20W23
2017 30408 11 36 3500 RANGE1 L00210N20W24
2017 30408 11 36 3500 RANGE1 L00310N20W25
2017 30409 11 36 3500 RANGE1 L0015N20W23
2017 30409 11 35 3500 RANGE2 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
Output:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 35 3500 RANGE1 L0015N20W23
You can try this:
no_duplicate_cols <- c("YEAR", "POL", "CTY", "ID", "AMOUNT")
new_df <- df[!duplicated(df[, no_duplicate_cols]), ]
The data frame new_df will hold one row (the first occurrence) for each distinct combination of these columns; later rows that repeat a combination are dropped.
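If the goal is instead to drop every row whose combination occurs more than once (so that only truly unique combinations remain), one possible sketch is to combine duplicated with its fromLast = TRUE variant. The names key and repeated are illustrative, and the data frame is rebuilt from the question's sample with character columns for brevity.

```r
df <- data.frame(
  YEAR   = 2017L,
  POL    = c(30408L, 30408L, 30408L, 30409L, 30409L, 30409L, 30409L),
  CTY    = 11L,
  ID     = c(36L, 36L, 36L, 36L, 35L, 35L, 35L),
  AMOUNT = 3500L,
  RAN    = c("RANGE1", "RANGE1", "RANGE1", "RANGE1",
             "RANGE2", "RANGE3", "RANGE3"),
  LEGAL  = c("L0015N20W23", "L00210N20W24", "L00310N20W25", "L0015N20W23",
             "NANANA", "NANANA", "NANANA")
)

key <- df[, c("YEAR", "POL", "CTY", "ID", "AMOUNT")]

# duplicated() flags only later occurrences; scanning from the end as well
# flags the first occurrence too, so every member of a repeated group is TRUE.
repeated <- duplicated(key) | duplicated(key, fromLast = TRUE)

new_df <- df[!repeated, ]
new_df
```

On the sample data this leaves the single row with POL 30409 and ID 36, the only key combination that appears exactly once.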
If I understood the question correctly then I think you can try this
library(dplyr)
df %>%
group_by(YEAR, POL, CTY, ID, AMOUNT) %>%
filter(n() == 1)
Output (note that the expected output in the original question seems to contain a small typo):
# A tibble: 1 x 7
# Groups: YEAR, POL, CTY, ID, AMOUNT [1]
YEAR POL CTY ID AMOUNT RAN LEGAL
1 2017 30409 11 36 3500 RANGE1 L0015N20W23
#sample data
> dput(df)
structure(list(YEAR = c(2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L), POL = c(30408L, 30408L, 30408L, 30409L, 30409L, 30409L,
30409L), CTY = c(11L, 11L, 11L, 11L, 11L, 11L, 11L), ID = c(36L,
36L, 36L, 36L, 35L, 35L, 35L), AMOUNT = c(3500L, 3500L, 3500L,
3500L, 3500L, 3500L, 3500L), RAN = structure(c(1L, 1L, 1L, 1L,
2L, 3L, 3L), .Label = c("RANGE1", "RANGE2", "RANGE3"), class = "factor"),
LEGAL = structure(c(1L, 2L, 3L, 1L, 4L, 4L, 4L), .Label = c("L0015N20W23",
"L00210N20W24", "L00310N20W25", "NANANA"), class = "factor")), .Names = c("YEAR",
"POL", "CTY", "ID", "AMOUNT", "RAN", "LEGAL"), class = "data.frame", row.names = c(NA,
-7L))