Find the newest entry per client in R - r

I am trying to filter a dataframe in which I have three columns:
date (format: "day/month/year")
client name
client spending on an specific product
I want to filter this df so I could get only the newest data purchase by client
Is there any way I could do this?

Let me first create a dummy data frame
library(dplyr)
names <- c("A", "B", "C", "D")
client <- sample(names, size=20, replace=T)
dates <- sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 20)
amount <- sample(c(0:1000), size=20)
df <- data.frame(dates, client, amount)
So the data frame looks like this
dates client amount
1 1999-08-21 A 632
2 1999-08-06 B 449
3 1999-03-20 B 402
4 1999-05-15 B 557
5 1999-04-29 D 960
6 1999-03-07 A 977
7 1999-12-02 D 106
8 1999-12-08 D 891
9 1999-12-06 B 375
10 1999-03-28 C 509
11 1999-07-27 C 722
12 1999-02-01 D 923
13 1999-02-20 B 517
14 1999-12-17 B 487
15 1999-11-27 C 486
16 1999-05-26 B 873
17 1999-01-11 A 493
18 1999-08-16 A 620
19 1999-03-17 B 899
20 1999-03-01 C 297
You can then get filter the data
result <- df %>%
group_by(client) %>%
filter(dates == max(dates))
result
which will give you the following result.
dates client amount
<date> <fct> <int>
1 1999-08-21 A 632
2 1999-12-08 D 891
3 1999-12-17 B 487
4 1999-11-27 C 486

Related

Accessing previous rows to make a calculation in a current row in R

I have data that I need to make calculations within sub-groups of rows. Specifically, I am calculating the nitrogen (N) recovery efficiency of crops between different field treatments. This requires finding the difference in N uptake of crops (totN) between a plot with a given amount of N applied (Nrate) and the control plot with 0 N applied, then dividing by the Nrate.
This is how the data looks for just one year:
year drain Nrate totN
2016 C 0 190
2016 C 100 220
2016 C 200 230
2016 N 0 130
2016 N 100 200
2016 N 200 220
I have gotten this far, thanks to Performing calculations between rows in R, but I am not sure how to reference the control row (Nrate = 0) within the sub-group each row is in.
This is where I am:
library(tidyverse)
df <- data.frame(year=c(2016,2016,2016,2016,2016,2016),
drain=c("C","C","C","N","N","N"),
Nrate=c(0,100,200,0,100,200),
totN=c(190,220,230,130,200,220))
df %>%
group_by(year,drain) %>%
mutate(id = row_number()) %>%
mutate(RE = ifelse(id != 1,
(totN - <the totN where Nrate=0 for same year and drain>) / Nrate,
NA))
This is what I expect to get:
year drain Nrate totN RE
2016 C 0 190 NA
2016 C 100 220 0.3 #(220-190)/100
2016 C 200 230 0.2 #(230-190)/200
2016 N 0 130 NA
2016 N 100 200 0.7 #(200-130)/100
2016 N 200 220 0.45 #(220-130)/200
We may subset the 'totN' by indexing or if it is already ordered, use first(totN)
library(dplyr)
df %>%
group_by(year, drain) %>%
mutate(RE = na_if((totN - totN[Nrate == 0])/Nrate, "NaN")) %>%
ungroup
-output
# A tibble: 6 x 5
# year drain Nrate totN RE
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 2016 C 0 190 NA
#2 2016 C 100 220 0.3
#3 2016 C 200 230 0.2
#4 2016 N 0 130 NA
#5 2016 N 100 200 0.7
#6 2016 N 200 220 0.45

R Panel data: Create new variable based on ifelse() statement and previous row

My question refers to the following (simplified) panel data, for which I would like to create some sort of xrd_stock.
#Setup data
library(tidyverse)
firm_id <- c(rep(1, 5), rep(2, 3), rep(3, 4))
firm_name <- c(rep("Cosco", 5), rep("Apple", 3), rep("BP", 4))
fyear <- c(seq(2000, 2004, 1), seq(2003, 2005, 1), seq(2005, 2008, 1))
xrd <- c(49,93,121,84,37,197,36,154,104,116,6,21)
df <- data.frame(firm_id, firm_name, fyear, xrd)
#Define variables
growth = 0.08
depr = 0.15
For a new variable called xrd_stock I'd like to apply the following mechanics:
each firm_id should be handled separately: group_by(firm_id)
where fyear is at minimum, calculate xrd_stock as: xrd/(growth + depr)
otherwise, calculate xrd_stock as: xrd + (1-depr) * [xrd_stock from previous row]
With the following code, I already succeeded with step 1. and 2. and parts of step 3.
df2 <- df %>%
ungroup() %>%
group_by(firm_id) %>%
arrange(firm_id, fyear, decreasing = TRUE) %>% #Ensure that data is arranged w/ in asc(fyear) order; not required in this specific example as df is already in correct order
mutate(xrd_stock = ifelse(fyear == min(fyear), xrd/(growth + depr), xrd + (1-depr)*lag(xrd_stock))))
Difficulties occur in the else part of the function, such that R returns:
Error: Problem with `mutate()` input `xrd_stock`.
x object 'xrd_stock' not found
i Input `xrd_stock` is `ifelse(...)`.
i The error occured in group 1: firm_id = 1.
Run `rlang::last_error()` to see where the error occurred.
From this error message, I understand that R cannot refer to the just created xrd_stock in the previous row (logical when considering/assuming that R is not strictly working from top to bottom); however, when simply putting a 9 in the else part, my above code runs without any errors.
Can anyone help me with this problem so that results look eventually as shown below. I am more than happy to answer additional questions if required. Thank you very much to everyone in advance, who looks at my question :-)
Target results (Excel-calculated):
id name fyear xrd xrd_stock Calculation for xrd_stock
1 Cosco 2000 49 213 =49/(0.08+0.15)
1 Cosco 2001 93 274 =93+(1-0.15)*213
1 Cosco 2002 121 354 …
1 Cosco 2003 84 385 …
1 Cosco 2004 37 364 …
2 Apple 2003 197 857 =197/(0.08+0.15)
2 Apple 2004 36 764 =36+(1-0.15)*857
2 Apple 2005 154 803 …
3 BP 2005 104 452 …
3 BP 2006 116 500 …
3 BP 2007 6 431 …
3 BP 2008 21 388 …
arrange the data by fyear so minimum year is always the 1st row, you can then use accumulate to calculate.
library(dplyr)
df %>%
arrange(firm_id, fyear) %>%
group_by(firm_id) %>%
mutate(xrd_stock = purrr::accumulate(xrd[-1], ~.y + (1-depr) * .x,
.init = first(xrd)/(growth + depr)))
# firm_id firm_name fyear xrd xrd_stock
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 Cosco 2000 49 213.
# 2 1 Cosco 2001 93 274.
# 3 1 Cosco 2002 121 354.
# 4 1 Cosco 2003 84 385.
# 5 1 Cosco 2004 37 364.
# 6 2 Apple 2003 197 857.
# 7 2 Apple 2004 36 764.
# 8 2 Apple 2005 154 803.
# 9 3 BP 2005 104 452.
#10 3 BP 2006 116 500.
#11 3 BP 2007 6 431.
#12 3 BP 2008 21 388.

How to lookup and sum multiple columns in R

Suppose I have 2 dataframes structured as such:
GROUPS:
P1 P2 P3 P4
123 213 312 231
345 123 213 567
INDIVIDUAL_RESULTS:
ID SCORE
123 23
213 12
312 11
213 19
345 10
567 22
I want to add a column to the GROUPS which is a sum of each of their individual results:
P1 P2 P3 P4 SCORE
123 213 312 231 65
I've tried using various merge techniques, but have really just created a mess. I feel like there's a simple solution I just don't know about, would really appreciate some guidance!
d1=read.table(text="
P1 P2 P3 P4
123 213 312 231
345 123 213 567",h=T)
d2=read.table(text="
ID SCORE
123 23
213 12
312 11
231 19
345 10
567 22",h=T)
I will be using the apply and match functions. Apply will apply the match function to each row of d1, match will find the matching values from the row of d1 and d2$ID (their indices) and then take the values in d2$SCORE at those indices. In the end we sum them up.
d1$SCORE=apply(d1,1,function(x){
sum(d2$SCORE[match(x,d2$ID)])
})
and the result
P1 P2 P3 P4 SCORE
1 123 213 312 231 65
2 345 123 213 567 67
I would try a slow but could be an intuitive way for new users. I think the difficulty was created by the format of your data d1. If you do a little bit of tidy up:
library(tidyverse)
d1<-data.frame(t(d1))
colnames(d1) <-c("group1", "group2")
d1$P = row.names(d1)
d1<-d1 %>%
pivot_longer(
cols = group1:group2,
names_to = "Group",
values_to = "ID"
)
df <-left_join(d1, d2, by ="ID")
df
# A tibble: 8 x 4
P Group ID SCORE
<chr> <chr> <int> <int>
1 P1 group1 123 23
2 P1 group2 345 10
3 P2 group1 213 12
4 P2 group2 123 23
5 P3 group1 312 11
6 P3 group2 213 12
7 P4 group1 231 19
8 P4 group2 567 22
Once you get the data to this more "conventional" format, we can easily work out a tidyverse solution.
df %>%
group_by(Group) %>%
summarize(SCORE = sum(SCORE))
# A tibble: 2 x 2
Group SCORE
<chr> <int>
1 group1 65
2 group2 67
Another possibility is to reformat the first data.frame to contain the group and subgroup Information:
groups <- tidyr::gather(d1,name,number,P1:P4)
These information could be added to the second data.frame and could be further used for different analyses. Such as aggregrations.
d2_groups <- merge(groups, d2, by.x = "number",by.y = "ID")
aggregate(d2_groups$SCORE, by=list(groups = d2_groups$name), FUN=sum)

Find matching intervals in data frame by range of two column values

I have a data frame of time related events.
Here is an example:
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 19 75 56 ID2
ADAM 2 A 384 407 23 ID2
ADAM 3 B 0 79 79 ID2
ADAM 4 B 505 586 81 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
There are essentially two different groups, ID 1 & 2. For each of those groups, there are 18 different name's. Each of those people appear in 3 different sequences, A-C. They then have active time periods during those sequences, and I mark the start/end events and calculate the duration.
I'd like to isolate each person and find when they have matching time intervals with people in both the opposite and same group ID.
Using the example data above, I want to find when John and Adam appear during the same sequence, at the same time. I then want to compare John to the rest of the 17 names in ID1/ID2.
I do not need to match the exact amount of shared 'active' time, I just am hoping to isolate the rows that are common.
My comforts are in using dplyr, but I can't crack this yet. I looked around and saw some similar examples with adjacency matrices, but those are with precise and exact data points. I can't figure out the strategy with a range/interval.
Thank you!
UPDATE:
Here is the example of the desired result
Name Event Order Sequence start_event end_event duration Group
JOHN 3 A 392 429 37 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 2 A 384 407 23 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
I'm thinking you'd isolate each event row for John, mark the start/end time frame and then iterate through every name and event for the remainder of the data frame to find time points that fit first within the same sequence, and then secondly against the bench-marked start/end time frame of John.
As I understand it, you want to return any row where an event for John with a particular sequence number overlaps an event for anybody else with the same sequence value. To achieve this, you could use split-apply-combine to split by sequence, identify the overlapping rows, and then re-combine:
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
jpos <- which(x$Name == "JOHN")
njpos <- which(x$Name != "JOHN")
over <- outer(jpos, njpos, function(a, b) {
overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
})
x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
}))
# Name EventOrder Sequence start_event end_event duration Group
# A.2 JOHN 2 A 60 112 52 ID1
# A.3 JOHN 3 A 392 429 37 ID1
# A.7 ADAM 1 A 19 75 56 ID2
# A.8 ADAM 2 A 384 407 23 ID2
# C.5 JOHN 5 C 147 226 79 ID1
# C.6 JOHN 6 C 566 611 45 ID1
# C.11 ADAM 5 C 140 205 65 ID2
# C.12 ADAM 6 C 522 599 77 ID2
Note that my output includes two additional rows that are not shown in the question -- sequence A for John from time range [60, 112], which overlaps sequence A for Adam from time range [19, 75].
This could be pretty easily mapped into dplyr language:
library(dplyr)
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
sliceRows <- function(name, start, end) {
jpos <- which(name == "JOHN")
njpos <- which(name != "JOHN")
over <- outer(jpos, njpos, function(a, b) overlap(start[a], end[a], start[b], end[b]))
c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0])
}
dat %>%
group_by(Sequence) %>%
slice(sliceRows(Name, start_event, end_event))
# Source: local data frame [8 x 7]
# Groups: Sequence [3]
#
# Name EventOrder Sequence start_event end_event duration Group
# (fctr) (int) (fctr) (int) (int) (int) (fctr)
# 1 JOHN 2 A 60 112 52 ID1
# 2 JOHN 3 A 392 429 37 ID1
# 3 ADAM 1 A 19 75 56 ID2
# 4 ADAM 2 A 384 407 23 ID2
# 5 JOHN 5 C 147 226 79 ID1
# 6 JOHN 6 C 566 611 45 ID1
# 7 ADAM 5 C 140 205 65 ID2
# 8 ADAM 6 C 522 599 77 ID2
If you wanted to be able to compute the overlaps for a specified pair of users, this could be done by wrapping the operation into a function that specifies the pair of users to be processed:
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
pair.overlap <- function(dat, user1, user2) {
dat <- dat[dat$Name %in% c(user1, user2),]
do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
jpos <- which(x$Name == user1)
njpos <- which(x$Name == user2)
over <- outer(jpos, njpos, function(a, b) {
overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
})
x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
}))
}
You could use pair.overlap(dat, "JOHN", "ADAM") to get the previous output. Generating the overlaps for every pair of users can now be done with combn and apply:
apply(combn(unique(as.character(dat$Name)), 2), 2, function(x) pair.overlap(dat, x[1], x[2]))

Reshaping a data frame --- changing rows to columns

Suppose that we have a data frame that looks like
set.seed(7302012)
county <- rep(letters[1:4], each=2)
state <- rep(LETTERS[1], times=8)
industry <- rep(c("construction", "manufacturing"), 4)
employment <- round(rnorm(8, 100, 50), 0)
establishments <- round(rnorm(8, 20, 5), 0)
data <- data.frame(state, county, industry, employment, establishments)
state county industry employment establishments
1 A a construction 146 19
2 A a manufacturing 110 20
3 A b construction 121 10
4 A b manufacturing 90 27
5 A c construction 197 18
6 A c manufacturing 73 29
7 A d construction 98 30
8 A d manufacturing 102 19
We'd like to reshape this so that each row represents a (state and) county, rather than a county-industry, with columns construction.employment, construction.establishments, and analogous versions for manufacturing. What is an efficient way to do this?
One way is to subset
construction <- data[data$industry == "construction", ]
names(construction)[4:5] <- c("construction.employment", "construction.establishments")
And similarly for manufacturing, then do a merge. This isn't so bad if there are only two industries, but imagine that there are 14; this process would become tedious (though made less so by using a for loop over the levels of industry).
Any other ideas?
This can be done in base R reshape, if I understand your question correctly:
reshape(data, direction="wide", idvar=c("state", "county"), timevar="industry")
# state county employment.construction establishments.construction
# 1 A a 146 19
# 3 A b 121 10
# 5 A c 197 18
# 7 A d 98 30
# employment.manufacturing establishments.manufacturing
# 1 110 20
# 3 90 27
# 5 73 29
# 7 102 19
Also using the reshape package:
library(reshape)
m <- reshape::melt(data)
cast(m, state + county~...)
Yielding:
> cast(m, state + county~...)
state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1 A a 146 19 110 20
2 A b 121 10 90 27
3 A c 197 18 73 29
4 A d 98 30 102 19
I personally use the base reshape so I probably should have shown this using reshape2 (Wickham) but forgot there was a reshape2 package. Slightly different:
library(reshape2)
m <- reshape2::melt(data)
dcast(m, state + county~...)

Resources