R - cumulative count over multiple columns, by factor - r

I have a dataset which, simplified, looks something like this:
YEAR = c(2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010,2010,2010,2010)
FROM = c("A","C","B","D","B","A","C","A","C","B","A","D","B","A")
TO = c("B","D","C","A","D","C","B","B","A","D","D","C","A","D")
DATA = data.frame(YEAR,FROM,TO)
YEAR FROM TO
2009 A B
2009 C D
2009 B C
2009 D A
2009 B D
2009 A C
2009 C B
2010 A B
2010 C A
2010 B D
2010 A D
2010 D C
2010 B A
2010 A D
What I want is two additional columns, let's say OCC_FROM and OCC_TO, that hold the cumulative count of occurrences of each value in both the FROM and TO columns in prior rows, by YEAR. Like this:
YEAR FROM TO OCC_FROM OCC_TO
2009 A B 0 0
2009 C D 0 0
2009 B C 1 1
2009 C A 2 1
2009 B D 2 1
2009 A C 2 3
2009 C B 4 3
2010 A B 0 0
2010 C A 0 1
2010 B D 1 0
2010 A B 2 2
2010 D C 1 1
2010 B A 3 3
2010 A D 4 2
What I've managed to produce, with the help of Cumulative count in R, is this, which is obviously not quite what I want since it doesn't take YEAR into consideration:
DATA$OCC_FROM = sapply(1:length(DATA$FROM),function(i)sum(DATA$FROM[i]==DATA$FROM[1:i]))+sapply(1:length(DATA$FROM),function(i)sum(DATA$FROM[i]==DATA$TO[1:i]))-1
DATA$OCC_TO = sapply(1:length(DATA$TO),function(i)sum(DATA$TO[i]==DATA$FROM[1:i]))+sapply(1:length(DATA$TO),function(i)sum(DATA$TO[i]==DATA$TO[1:i]))-1
YEAR FROM TO OCC_FROM OCC_TO
2009 A B 0 0
2009 C D 0 0
2009 B C 1 1
2009 C A 2 1
2009 B D 2 1
2009 A C 2 3
2009 C B 4 3
2010 A B 3 4
2010 C A 5 4
2010 B D 5 2
2010 A B 5 6
2010 D C 3 6
2010 B A 7 6
2010 A D 7 4
Edit: I also want to be able to sum two columns cumulatively based on FROM and TO, by YEAR as before. For simplicity I'll use OCC_FROM and OCC_TO. Like this:
YEAR FROM TO OCC_FROM OCC_TO TOTAL_FROM TOTAL_TO
2009 A B 0 0 0 0
2009 C D 0 0 0 0
2009 B C 1 1 0 0
2009 C A 2 1 1 0
2009 B D 2 1 1 0
2009 A C 2 3 1 3
2009 C B 4 3 6 3
2010 A B 0 0 0 0
2010 C A 0 1 0 0
2010 B D 1 0 0 0
2010 A B 2 2 1 1
2010 D C 1 1 0 0
2010 B A 3 3 3 3
2010 A D 4 2 6 1

You could try
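prevCount <- function(x) {
  eq <- outer(x, x, "==")
  eq <- eq & upper.tri(eq)
  eqInt <- ifelse(eq, 1, 0)
  return(apply(eqInt, 2, sum))
}
# ave() over row indices so the result stays numeric even though FROM is not
DATA$OCC_FROM <- ave(seq_along(DATA$FROM), DATA$YEAR, FUN = function(i) prevCount(DATA$FROM[i]))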
prevCount is a function that, within a year, returns the number of elements prior to each element that are identical. The ave call then applies this per year.
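To see what prevCount returns, here is a small sketch on made-up vectors (the `from` and `year` values below are hypothetical, not taken from the question). It calls ave over row indices rather than over the values themselves, which is an assumption made here to keep the result numeric.

```r
# Same definition as above: for each element, count the earlier equal elements.
prevCount <- function(x) {
  eq <- outer(x, x, "==")
  eq <- eq & upper.tri(eq)      # keep only pairs (i, j) with i < j
  eqInt <- ifelse(eq, 1, 0)
  apply(eqInt, 2, sum)
}

# Hypothetical toy input.
from <- c("A", "B", "A", "A", "A")
year <- c(2009, 2009, 2009, 2010, 2010)

whole  <- prevCount(from)   # ignores year: 0 0 1 2 3
by_year <- ave(seq_along(from), year,
               FUN = function(i) prevCount(from[i]))  # restarts in 2010: 0 0 1 0 1
```

The count restarts at the first row of each year because ave hands prevCount one year's rows at a time.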
Rolling up the corrections in the comment, we get
n <- nrow(DATA)
ord <- order(c(1:n, 1:n))  # interleave: FROM[1], TO[1], FROM[2], TO[2], ...
targets <- c(as.character(DATA$FROM), as.character(DATA$TO))[ord]
yr <- c(DATA$YEAR, DATA$YEAR)[ord]
res <- ave(seq_along(targets), yr, FUN = function(i) prevCount(targets[i]))
DATA$OCC_FROM <- res[seq(1, length(res), 2)]
DATA$OCC_TO <- res[seq(2, length(res), 2)]
Also, the prevCount function can be simplified as follows (the - 1 keeps the element itself out of the count):
prevCount <- function(x) {ave(x==x, x, FUN=cumsum) - 1}
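For reference, here is a self-contained sketch of the whole interleaving approach. The data are re-typed from the expected-output table in the question (which differs slightly from the vectors at the top of the question), and `prev_count` is a name made up here for a cumsum-based counter that subtracts 1 so the current element is not counted.

```r
# Data as shown in the question's expected-output table.
DATA <- data.frame(
  YEAR = rep(c(2009, 2010), each = 7),
  FROM = c("A","C","B","C","B","A","C","A","C","B","A","D","B","A"),
  TO   = c("B","D","C","A","D","C","B","B","A","D","B","C","A","D"),
  stringsAsFactors = FALSE
)

# Number of earlier elements equal to each element (0 for a first occurrence).
prev_count <- function(x) ave(x == x, x, FUN = cumsum) - 1

n <- nrow(DATA)
ord <- order(c(seq_len(n), seq_len(n)))  # FROM[1], TO[1], FROM[2], TO[2], ...
targets <- c(DATA$FROM, DATA$TO)[ord]
yr <- c(DATA$YEAR, DATA$YEAR)[ord]

# Count within each year, indexing by position so the result is numeric.
res <- ave(seq_along(targets), yr, FUN = function(i) prev_count(targets[i]))
DATA$OCC_FROM <- res[seq(1, length(res), by = 2)]
DATA$OCC_TO   <- res[seq(2, length(res), by = 2)]
DATA
```

Because FROM is placed before TO within each row, a row whose FROM and TO are equal would count the FROM value as a prior occurrence for TO.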

# Your data - this is not the data at the top of your question but from your
# solution [the 'from' and 'to' don't correspond exactly between your question
# and solution]
YEAR <- c(2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010,2010,2010,2010)
FROM <- c("A","C","B","C","B","A","C","A","C","B","A","D","B","A")
TO <- c( "B","D","C","A","D","C","B","B","A","D","B","C","A","D")
mydf <- data.frame(YEAR,FROM,TO)
names(mydf) <- tolower(names(mydf))
#---------------------------------------------------
# Function to get cumulative sum across columns by group
f <- function(from, to) {
  # combine the columns 'from' and 'to' alternately
  l <- c(rbind(from, to))
  # Get and sum duplicate values
  dup <- duplicated(l)
  sums <- ave(dup, l, FUN = cumsum)
  # Reshape data & output
  out <- t(matrix(sums, 2))
  colnames(out) <- c("occ_from", "occ_to")
  out
}
# Not considering year
f(mydf$from , mydf$to)
# data.frame(mydf, f(mydf$from, mydf$to))
# Calculate by year
s <- split(mydf , mydf$year)
d <- do.call(rbind,lapply(s,function(i) f(i[,"from"],i[,"to"])))
(mydf <- data.frame(mydf , d , row.names=NULL))

Related

How to use table by with two arguments? R

My dataset is like this
df
ID Year County APV sample
1 2014 A 1 1
1 2015 A 1 1
1 2016 A 0 0
1 2017 A NA 0
1 2018 A NA 0
1 2019 A NA 0
2 2014 B 1 1
2 2015 B 1 1
2 2016 B 1 1
2 2017 B 1 1
2 2018 B 0 0
2 2019 B NA 0
3 2014 A 1 1
3 2015 A 1 1
3 2016 A 0 0
3 2017 A NA 0
3 2018 A NA 0
3 2019 A NA 0
And so on
So I want to tabulate this data.
If I only want to tabulate by year
datos<-as.data.frame(table(df$APV==0 & df$sample==0, by=df$Year))
the data set that I obtain looks like this:
df1
Var1 by Freq
FALSE 2014 3
TRUE 2014 0
FALSE 2015 3
TRUE 2015 0
FALSE 2016 1
TRUE 2016 2
. . .
. . .
. . .
So FALSE means the firms that are still open.
How can I tabulate by year and County?
APV marks the first closure of the enterprise (the 0), so I want to know how many enterprises closed by year and county.
There are two approaches.
I added !is.na(APV) for two reasons: (1) it wasn't clear to me what you expected to happen with the NA rows; and (2) table was actually more robust to NA than xtabs, so I wanted the two results to be the same. The premise of the two approaches is the same, but they do appear to handle NAs differently.
table
You might just need to know that table takes an arbitrary number of arguments, so
df$Var1 <- with(df, !is.na(APV) & APV == 0 & sample == 0)
head(as.data.frame(table(df$Var1, df$Year, df$County)))
# Var1 Var2 Var3 Freq
# 1 FALSE 2014 A 2
# 2 TRUE 2014 A 0
# 3 FALSE 2015 A 2
# 4 TRUE 2015 A 0
# 5 FALSE 2016 A 0
# 6 TRUE 2016 A 2
While the names are lost, it still works.
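If the lost names matter, one option is to pass named arguments to table, which then become the column names. The sketch below rebuilds the sample rows from the question; the names `closed` and `closures` are made up for illustration.

```r
# Sample rows from the question (three firms, 2014-2019).
df <- data.frame(
  ID     = rep(1:3, each = 6),
  Year   = rep(2014:2019, times = 3),
  County = rep(c("A", "B", "A"), each = 6),
  APV    = c(1, 1, 0, NA, NA, NA,  1, 1, 1, 1, 0, NA,  1, 1, 0, NA, NA, NA),
  sample = c(1, 1, 0, 0, 0, 0,     1, 1, 1, 1, 0, 0,   1, 1, 0, 0, 0, 0)
)

# TRUE only in the year of first closure; NA rows count as FALSE.
closed <- with(df, !is.na(APV) & APV == 0 & sample == 0)

# Named arguments keep the dimension names in the resulting data frame.
out <- as.data.frame(table(closed = closed, Year = df$Year, County = df$County))

# Keep only the closure counts per year and county.
closures <- out[out$closed == "TRUE", c("Year", "County", "Freq")]
closures
```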
xtabs
out <- as.data.frame(
xtabs(~ Var1 + Year + County,
data = transform(df, Var1 = (!is.na(APV) & APV == 0 & sample == 0)))
)
head(out)
# Var1 Year County Freq
# 1 FALSE 2014 A 2
# 2 TRUE 2014 A 0
# 3 FALSE 2015 A 2
# 4 TRUE 2015 A 0
# 5 FALSE 2016 A 0
# 6 TRUE 2016 A 2
(I used transform for simplicity.)
do.call for dynamic columns
out2 <- as.data.frame(
do.call(table, subset(transform(df, Var1 = (!is.na(APV) & APV == 0 & sample == 0)),
select = c(Var1, Year, County)))
)
(same results)

Turn dataframe with a row for each id and law (with begin and end years) into a file with a row for each id and year

I have a df called laws with a row for each law (one for each id):
laws <- data.frame(id=c(1,2,3),beginyear=c(2001,2002,2005),endyear=c(2003,2005,2006), law1=c(0,0,1), law2=c(1,0,1))
from which I want to create a second data frame called idyear with a row for each id and year:
idyear <- data.frame(id=c(rep(1,6),rep(2,6),rep(3,6)), year=(rep(c(2001:2006),3)), law1=c(rep(0,16),1,1), law2=c(1,1,1,rep(0,13),1,1))
How would I efficiently go about writing some code to get the idyear df output from the laws df? The two law variables are indicator variables == 1 if the idyear$year is >= laws$beginyear AND idyear$year is <= laws$endyear.
I am a beginner with R, but I'm willing to try anything (apply, loops, etc.) to get this to work.
1) base expand.grid will create an 18 x 2 data frame of all id and year combinations and then merge will merge it back together with laws. Zero out any law1 and law2 entry for which year is not between beginyear and endyear. Finally drop the beginyear and endyear columns. No packages are used.
g <- with(laws, expand.grid(year = min(beginyear):max(endyear), id = id))
m <- merge(g, laws)
m[m$year < m$beginyear | m$year > m$endyear, c("law1", "law2")] <- 0
m <- subset(m, select = - c(beginyear, endyear))
# check
identical(m, idyear)
## [1] TRUE
2) magrittr This is the same solution as (1) except we have used magrittr pipelines to express it. Note the mixture of pipe operators.
library(magrittr)
laws %$%
expand.grid(year = min(beginyear):max(endyear), id = id) %>%
merge(laws) %$%
{ .[year < beginyear | year > endyear, c("law1", "law2")] <- 0; .} %>%
subset(select = - c(beginyear, endyear))
Update: Fixed. Added (2).
A solution using tidyverse. The last as.data.frame() is optional; it just converts the tbl to a data frame.
library(tidyverse)
idyear <- laws %>%
mutate(year = map2(beginyear, endyear, `:`)) %>%
unnest(year) %>%
complete(id, year = full_seq(year, period = 1L), fill = list(law1 = 0L, law2 = 0L)) %>%
select(-beginyear, -endyear) %>%
as.data.frame()
idyear
# id year law1 law2
# 1 1 2001 0 1
# 2 1 2002 0 1
# 3 1 2003 0 1
# 4 1 2004 0 0
# 5 1 2005 0 0
# 6 1 2006 0 0
# 7 2 2001 0 0
# 8 2 2002 0 0
# 9 2 2003 0 0
# 10 2 2004 0 0
# 11 2 2005 0 0
# 12 2 2006 0 0
# 13 3 2001 0 0
# 14 3 2002 0 0
# 15 3 2003 0 0
# 16 3 2004 0 0
# 17 3 2005 1 1
# 18 3 2006 1 1
Use of mapply function can help.
# Function to expand year between begin and end
gen_data <- function(x_id, x_beginyear, x_endyear, x_law1, x_law2){
  df <- data.frame(x_id, x_beginyear:x_endyear, x_law1, x_law2)
  df
}
idyearlst <- data.frame()
idyearlst <- rbind(idyearlst, mapply(gen_data, laws$id, laws$beginyear,
laws$endyear, laws$law1, laws$law2))
# Finally convert list to data.frame
idyear <- setNames(do.call(rbind.data.frame, idyearlst), c("id", "year", "law1", "law2"))
The result contains only the years within each id's beginyear-endyear span:
> idyear
id year law1 law2
V1.1 1 2001 0 1
V1.2 1 2002 0 1
V1.3 1 2003 0 1
V2.4 2 2002 0 0
V2.5 2 2003 0 0
V2.6 2 2004 0 0
V2.7 2 2005 0 0
V3.8 3 2005 1 1
V3.9 3 2006 1 1
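A variant of the same idea, sketched with Map and do.call(rbind, ...) so that each law becomes its own small data frame, then padded out to the full id-by-year grid with merge. The helper name `expand_one` and the intermediate names `in_range`, `grid` and `full` are made up here for illustration.

```r
laws <- data.frame(id = c(1, 2, 3),
                   beginyear = c(2001, 2002, 2005),
                   endyear = c(2003, 2005, 2006),
                   law1 = c(0, 0, 1), law2 = c(1, 0, 1))

# One row per year between beginyear and endyear for a single law.
expand_one <- function(id, beginyear, endyear, law1, law2) {
  data.frame(id = id, year = beginyear:endyear, law1 = law1, law2 = law2)
}

# Map returns a plain list of data frames, which rbind can stack directly.
pieces <- Map(expand_one, laws$id, laws$beginyear, laws$endyear,
              laws$law1, laws$law2)
in_range <- do.call(rbind, pieces)

# Pad to every id/year combination; years outside a law's span get 0.
grid <- expand.grid(year = min(laws$beginyear):max(laws$endyear), id = laws$id)
full <- merge(grid, in_range, by = c("id", "year"), all.x = TRUE)
full$law1[is.na(full$law1)] <- 0
full$law2[is.na(full$law2)] <- 0
full <- full[order(full$id, full$year), ]
```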
Kind of an ugly approach, but I think it gets what you're after, using G. Grothendieck's g expand.grid data frame as a base, and your laws dataframe.
new.df <- data.frame(t(apply(g, 1, function(x){
  yearspan = laws[laws$id == x['id'], 'beginyear']:laws[laws$id == x['id'], 'endyear']
  law1 = laws$law1[laws$id == x['id'] & x['year'] %in% yearspan]
  law2 = laws$law2[laws$id == x['id'] & x['year'] %in% yearspan]
  # use the law value when the year falls inside the span, otherwise 0
  x['law1'] = ifelse(length(law1) > 0, law1, 0)
  x['law2'] = ifelse(length(law2) > 0, law2, 0)
  return(x)
})))
> new.df
id year law1 law2
1 1 2001 0 1
2 1 2002 0 1
3 1 2003 0 1
4 1 2004 0 0
5 1 2005 0 0
6 1 2006 0 0
7 2 2001 0 0
8 2 2002 0 0
9 2 2003 0 0
10 2 2004 0 0
11 2 2005 0 0
12 2 2006 0 0
13 3 2001 0 0
14 3 2002 0 0
15 3 2003 0 0
16 3 2004 0 0
17 3 2005 1 1
18 3 2006 1 1
Libraries:
dplyr (for arrange, not really necessary)
Data:
laws <- data.frame(id=c(1,2,3),
beginyear=c(2001,2002,2005),
endyear=c(2003,2005,2006),
law1=c(0,0,1), law2=c(1,0,1))
g <- with(laws, expand.grid(id = id, year = min(beginyear):max(endyear)))
g <- arrange(g, id)

Find co-occurrence of values in large data set

I have a large data set with month, customer ID and store ID. There is one record per customer, per location, per month summarizing their activity at that location.
Month Customer ID Store
Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C
I'm interested in creating a matrix that shows the number of customers that each location shares with another. Like this:
A B C
A 4 2 2
B 2 4 2
C 2 2 4
For example, if a customer visited Store A and then Store B in the next month, they would be added to the tally for that pair of stores. I'm interested in the number of shared customers, not the number of visits.
I tried the sparse matrix approach in this thread (Creating co-occurrence matrix), but the numbers returned don't match up for some reason I cannot understand.
Any ideas would be greatly appreciated!
Update:
The original solution that I posted worked for your data. But your data has
the unusual property that no customer ever visited the same store in two different
months. Presuming that would happen, a modification is needed.
What we need is a matrix of stores by customers that has 1 if the customer ever
visited the store and zero otherwise. The original solution used
M = as.matrix(table(Dat$ID_Store, Dat$Customer))
which gives how many different months the store was visited by each customer. With
different data, these numbers might be more than one. We can fix that by using
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
If you look at this matrix, it will say TRUE and FALSE, but since TRUE=1 and FALSE=0
that will work just fine. So the full corrected solution is:
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
M %*% t(M)
A B C
A 4 2 2
B 2 4 2
C 2 2 4
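To illustrate why the > 0 matters, here is a sketch that rebuilds the question's data and adds one hypothetical repeat visit (customer 1 returning to store A in April, which is not in the original data). The column names Customer and ID_Store are assumed from the code above.

```r
Dat <- data.frame(
  Month    = rep(c("Jan", "Feb", "Mar"), each = 4),
  Customer = c(1, 4, 2, 3, 7, 2, 1, 12, 1, 11, 3, 12),
  ID_Store = rep(c("A", "B", "C"), each = 4)
)
# Hypothetical extra record: customer 1 visits store A again in April.
Dat <- rbind(Dat, data.frame(Month = "Apr", Customer = 1, ID_Store = "A"))

# Visit counts: the repeat visit inflates the products.
counts <- as.matrix(table(Dat$ID_Store, Dat$Customer))
naive <- counts %*% t(counts)

# Presence/absence: each customer contributes at most once per store pair.
M <- as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
shared <- M %*% t(M)
shared
```

With the repeat visit, `naive` overstates both the A-A and A-B entries, while `shared` still gives the matrix shown above.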
We can try this too:
library(reshape2)
df <- dcast(df,CustomerID~Store, length, value.var='Store')
# CustomerID A B C
#1 1 1 1 1
#2 2 1 1 0 # Customer 2 went to stores A,B but not to C
#3 3 1 0 1
#4 4 1 0 0
#5 7 0 1 0
#6 11 0 0 1
#7 12 0 1 1
crossprod(as.matrix(df[-1]))
# A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4
with library arules:
library(arules)
write(' Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C', 'basket_single')
tr <- read.transactions("basket_single", format = "single", cols = c(2,3))
inspect(tr)
# items transactionID
#[1] {A,B,C} 1
#[2] {C} 11
#[3] {B,C} 12
#[4] {A,B} 2
#[5] {A,C} 3
#[6] {A} 4
#[7] {B} 7
image(tr)
crossTable(tr, sort=TRUE)
# A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4

Create a panel data frame

I would like to create a panel from a dataset that has one observation for every given time period such that every unit has a new observation for every time period. Using the following example:
id <- seq(1:4)
year <- c(2005, 2008, 2008, 2007)
y <- c(1,0,0,1)
frame <- data.frame(id, year, y)
frame
id year y
1 1 2005 1
2 2 2008 0
3 3 2008 0
4 4 2007 1
For each unique ID, I would like there to be a unique observation for the year 2005, 2006, 2007, and 2008 (the lower and upper time periods on this frame), and set the outcome y to 0 for all the times in which there isn't an existing observation, such that the new frame looks like:
id year y
1 1 2005 1
2 1 2006 0
3 1 2007 0
4 1 2008 0
....
13 4 2005 0
14 4 2006 0
15 4 2007 1
16 4 2008 0
I haven't had much success with loops; Any and all thoughts would be greatly appreciated.
1) reshape2 Create a grid g of all years and id values crossed and rbind it with frame.
Then using the reshape2 package cast frame from long to wide form and then melt it back to long form. Finally rearrange the rows and columns as desired.
The lines ending in one # are only to ensure that every year is present so if we knew that were the case those lines could be omitted. The line ending in ## is only to rearrange the rows and columns so if that did not matter that line could be omitted too.
library(reshape2)
g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #
wide <- dcast(frame, year ~ id, fill = 0, fun = sum, value.var = "y")
long <- melt(wide, id = "year", variable.name = "id", value.name = "y")
long <- long[order(long$id, long$year), c("id", "year", "y")] ##
giving:
> long
id year y
1 1 2005 1
2 1 2006 0
3 1 2007 0
4 1 2008 0
5 2 2005 0
6 2 2006 0
7 2 2007 0
8 2 2008 0
9 3 2005 0
10 3 2006 0
11 3 2007 0
12 3 2008 0
13 4 2005 0
14 4 2006 0
15 4 2007 1
16 4 2008 0
2) aggregate A shorter solution would be to run just the two lines that end with # above and then follow those with an aggregate as shown. This solution uses no addon packages.
g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #
aggregate(y ~ year + id, frame, sum)[c("id", "year", "y")]
This gives the same answer as solution (1) except that, as noted by a commenter, solution (1) above makes id a factor whereas this solution does not.
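Another base R possibility, sketched here as an alternative rather than as part of the solutions above, is to left-join the full grid onto the data with merge and fill the gaps with 0. The names `grid` and `panel` are made up for illustration.

```r
frame <- data.frame(id = 1:4,
                    year = c(2005, 2008, 2008, 2007),
                    y = c(1, 0, 0, 1))

# Every id crossed with every year between the first and last observed year.
grid <- expand.grid(year = seq(min(frame$year), max(frame$year)),
                    id = unique(frame$id))

# Keep all grid rows; id/year pairs with no observation get NA for y.
panel <- merge(grid, frame, by = c("id", "year"), all.x = TRUE)
panel$y[is.na(panel$y)] <- 0

# Sort and order the columns as in the desired output.
panel <- panel[order(panel$id, panel$year), c("id", "year", "y")]
rownames(panel) <- NULL
panel
```

Unlike the rbind-and-sum approach, this keeps an existing y value as it is instead of adding it to a 0 row, which only matters if y could be non-numeric.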
Using data.table:
require(data.table)
DT <- data.table(frame, key=c("id", "year"))
comb <- CJ(1:4, 2005:2008) # like 'expand.grid', but faster + sets key
ans <- DT[comb][is.na(y), y:=0L] # perform a join (DT[comb]), then set NAs to 0
# id year y
# 1: 1 2005 1
# 2: 1 2006 0
# 3: 1 2007 0
# 4: 1 2008 0
# 5: 2 2005 0
# 6: 2 2006 0
# 7: 2 2007 0
# 8: 2 2008 0
# 9: 3 2005 0
# 10: 3 2006 0
# 11: 3 2007 0
# 12: 3 2008 0
# 13: 4 2005 0
# 14: 4 2006 0
# 15: 4 2007 1
# 16: 4 2008 0
maybe not an elegant solution, but anyway:
df <- expand.grid(id=id, year=min(year):max(year))
frame <- frame[frame$y != 0,]
df$y <- 0
df2 <- rbind(frame, df)
df2 <- df2[!duplicated(df2[,c("id", "year")]),]
df2 <- df2[order(df2$id, df2$year),]
rownames(df2) <- NULL
df2
# id year y
# 1 1 2005 1
# 2 1 2006 0
# 3 1 2007 0
# 4 1 2008 0
# 5 2 2005 0
# 6 2 2006 0
# 7 2 2007 0
# 8 2 2008 0
# 9 3 2005 0
# 10 3 2006 0
# 11 3 2007 0
# 12 3 2008 0
# 13 4 2005 0
# 14 4 2006 0
# 15 4 2007 1
# 16 4 2008 0

Time-series data: How to code t-1 and t+1 based on value of specific variable in t0?

I am interested in learning how a specific factor such as foreign investments behaves 5 years before and after change, e.g. outbreak of civil war.
This is the structure of my data (the factor is not included here):
year country change time
2001 A 0 ? (-1)
2002 A 1 0
2003 A 0 ? (+1)
2004 A 0 ? (+2)
2002 B 0 ? (-2)
2003 B 0 ? (-1)
2004 B 1 0
...
I am seeking to replace the question marks by the respective values in brackets, e.g., "-1" for the year prior to change (t-1) and "+1" for the year following change (t+1). The presence of change is coded with 1.
How would you do this? I am grateful for any suggestions.
> dat <- read.table(text="year country change time
+ 2001 A 0 ?(-1)
+ 2002 A 1 0
+ 2003 A 0 ?(+1)
+ 2004 A 0 ?(+2)
+ 2002 B 0 ?(-2)
+ 2003 B 0 ?(-1)
+ 2004 B 1 0
+ ", header=TRUE)
> with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) )
$A
[1] -1 0 1 2
$B
[1] -2 -1 0
> dat$time <-unlist( with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
>
Slightly less complex would be to use ave instead of unlist(tapply(...))
> dat$time <- with(dat, ave(change, country, FUN=function(x) seq(length(x))-which(x==1) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
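The versions above count row positions, so they rely on each country having one row per consecutive year. A sketch that works from the year values instead is shown below; it assumes exactly one change per country, and the names `change_year` and `in_window` are made up here.

```r
dat <- data.frame(
  year    = c(2001, 2002, 2003, 2004, 2002, 2003, 2004),
  country = c("A", "A", "A", "A", "B", "B", "B"),
  change  = c(0, 1, 0, 0, 0, 0, 1)
)

# Year in which the change happened, repeated on every row of that country.
# Assumes exactly one row with change == 1 per country.
change_year <- ave(ifelse(dat$change == 1, dat$year, NA), dat$country,
                   FUN = function(v) v[!is.na(v)][1])

# Signed distance in years from the change (negative = before, positive = after).
dat$time <- dat$year - change_year

# Flag rows within 5 years either side of the change, as in the question.
dat$in_window <- abs(dat$time) <= 5
dat
```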
