r - ggplot multiple line graphs for each unique instance over time - r

The Problem
I am plotting a bunch of line plots on top of one another, but I only want to color 10 of them specifically after they are all plotted together (to visualize how my 'targets' traveled over time while still being able to see the mass of the others behind them). An example would be 100 line graphs over time, where I want to color 5 or 10 of them specifically to discuss them with respect to the trend of the other 90 grayscale ones.
The following post has a pretty good image that I want to replicate, but with slightly more meat on the bones. Except I want MANY lines behind those 3, all grayscale, with those 3 being my highlighted cities in the foreground, per se.
My original data was in the following form:
# The unique identifier is a City-State combo,
# there can be the same cities in 1 state or many.
# Each state's year ranges from 1:35, but may not have
# all of the values available to us, but some are complete.
r1 <- c("city1" , "state1" , "year" , "population" , rnorm(11) , "2")
r2 <- c("city1" , "state2" , "year" , "population" , rnorm(11) , "3")
r3 <- c("city2" , "state1" , "year" , "population" , rnorm(11) , "2")
r4 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "1")
r5 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "7")
df <- data.frame(matrix(nrow = 5, ncol = 16))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- c("City", "State", "Year", "Population", 1:11, "Cluster")
head(df)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# City | State | Year | Population | ... 11 Variables ... | Cluster #
# ----------------------------------------------------------------------#
# Each row is a city instance with these features ... #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
But I thought it might be better to view the data differently, so I also have it in the following format. I am not sure which is better for this problem.
cols <- 1:35 # 35 year columns, matching ncol = 35 below
rows <- c("unique_city1", "unique_city2","unique_city3","unique_city4","unique_city5")
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)
df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- cols
row.names(df) <- rows
head(df)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Year1 Year2 .......... Year 35 #
# UniqueCityState1 VAL NA .......... VAL #
# UniqueCityState2 VAL VAL .......... NA #
# . #
# . #
# . #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
Prior Attempts
I have tried using melt to get the data into a format that ggplot can accept so I can plot each of these cities over time, but nothing has seemed to work. I have also tried writing my own functions to loop through each unique city-state combination and stack ggplots, an approach with a fair amount of research available, but still nothing. I am not sure how to find each of these unique city-state pairs and plot them over time using their cluster value, or any numeric value for that matter. Or maybe what I am seeking is not possible; I am not sure.
Thoughts?
EDIT: More information about data structure
> head(df)
city state year population stat1 stat2 stat3 stat4 stat5
1 BESSEMER 1 1 31509 0.3808436 0 0.63473928 2.8563268 9.5528262
2 BIRMINGHAM 1 1 282081 0.3119671 0 0.97489728 6.0266377 9.1321287
3 MOUNTAIN BROOK 1 1 18221 0.0000000 0 0.05488173 0.2744086 0.4390538
4 FAIRFIELD 1 1 12978 0.1541069 0 0.46232085 3.0050855 9.8628448
5 GARDENDALE 1 1 7828 0.2554931 0 0.00000000 0.7664793 1.2774655
6 LEEDS 1 1 7865 0.2542912 0 0.12714558 1.5257470 13.3502861
stat6 stat6 stat7 stat8 stat9 cluster
1 26.976419 53.54026 5.712654 0 0.2856327 9
2 35.670605 65.49183 11.982374 0 0.4963113 9
3 6.311399 21.40387 1.426925 0 0.1097635 3
4 21.266759 68.11527 11.480968 0 1.0787487 9
5 6.770567 23.24987 3.960143 0 0.0000000 3
6 24.157661 39.79657 4.450095 0 1.5257470 15
agg
1 99.93970
2 130.08675
3 30.02031
4 115.42611
5 36.28002
6 85.18754
And ultimately I need it in the form of unique cities as row.names, 1:35 as col.names and the value inside each cell to be agg if that year was present or NA if it wasn't. Again I am sure this is possible, I just can't attain a good solution to it and my current way is unstable.

If I understand your question correctly, you want to plot all the lines in one color and then plot a few lines in several different colors. You can do this with ggplot2 by calling geom_line twice on two data frames: the first call plots all city data without mapping color, and the second plots only the subset of target cities, with color mapped to city. You will need to reorganize your original data frame and subset it for the target cities. In the following code I used tidyr and dplyr to process the data frame.
### Set.seed to improve reproducibility
set.seed(123)
### Load package
library(tidyr)
library(dplyr)
library(ggplot2)
### Prepare example data frame
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)
df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- 1:35
df <- df %>% mutate(City = 1:5)
### Reorganize the data for plotting
df2 <- df %>%
gather(Year, Value, -City) %>%
mutate(Year = as.numeric(Year))
The gather function takes df as its first argument. It creates a key column called Year, which stores the year numbers; these are the names of every column in df except City. It also creates a column called Value, which stores the numeric values from those same columns. The City column is not reshaped, so -City tells gather "do not transform the data from the City column".
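For readers on tidyr 1.0.0 or later, the same wide-to-long step can also be written with pivot_longer, the successor to gather. The following is only a minimal sketch on a made-up two-city, two-year data frame (the object names wide and long are illustrative, not from the answer):

```r
library(tidyr)

# Toy wide data: one row per city, one column per year (hypothetical values)
wide <- data.frame(`1` = c(0.5, 1.2), `2` = c(0.7, 1.0), City = 1:2,
                   check.names = FALSE)

# Every column except City becomes a (Year, Value) pair
long <- pivot_longer(wide, cols = -City,
                     names_to = "Year", values_to = "Value")

# Column names arrive as character, so convert Year for a numeric x axis
long$Year <- as.numeric(long$Year)
long
```

The result has one row per city-year combination, which is the shape ggplot expects for geom_line.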
### Subset df2, select the city of interest
df3 <- df2 %>%
# In this example, assuming that City 2 and City 3 are of interest
filter(City %in% c(2, 3))
### Plot the data
ggplot(data = df2, aes(x = Year, y = Value, group = factor(City))) +
# Plot all city data here in gray lines
geom_line(size = 1, color = "gray") +
# Plot target city data with colors
geom_line(data = df3,
aes(x = Year, y = Value, group = City, color = factor(City)),
size = 2)
The resulting plot can be seen here: https://dl.dropboxusercontent.com/u/23652366/example_plot.png
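The question's edit also asks for a layout with unique cities as row names, years as column names, and agg in each cell (NA where a year is missing). One possible base R sketch, assuming long data with city, state, year and agg columns, is tapply. The values below are invented and only three years are used to keep it short; with the real data the factor levels would be 1:35.

```r
# Hypothetical long data: one row per city-state-year with an agg value
long <- data.frame(
  city  = c("BESSEMER", "BESSEMER", "LEEDS", "LEEDS", "LEEDS"),
  state = c(1, 1, 1, 1, 1),
  year  = c(1, 3, 1, 2, 3),
  agg   = c(99.9, 101.2, 85.2, 80.1, 90.4)
)

# City-state combination as the unique identifier
long$id <- paste(long$city, long$state, sep = "_")

# Declaring all year levels keeps a column even for years no city reports
long$year <- factor(long$year, levels = 1:3)

# Rows = unique city-state, columns = years; empty cells are filled with NA
wide <- tapply(long$agg, list(long$id, long$year), FUN = sum)
wide
```

Here BESSEMER_1 has no record for year 2, so that cell comes out as NA.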

Related

Creating new variables for multiple data frames in a for loop

I have 8 data frames and I want to create a variable in each of these data frames. I use a for loop, and the code I have used is given below:
year <- 2001
dflist <- list(bhps01, bhps02, bhps03, bhps04, bhps05, bhps06, bhps07, bhps08)
for (df in dflist){
df[["year"]] <- as.character(year)
assign()
year <- year + 1
}
bhps01,...,bhps08 are the data frame objects and year is a character variable. Each data frame corresponds to a year: bhps01 is the data frame for year 2001, bhps02 for 2002, and so on. So I want to create a year variable in each of them: "2001" for bhps01, "2002" for bhps02, and so on.
The code runs fine, but it does not create the variable year in any of the data frames, only in the local variable df.
Can someone please explain the error in the above code? Or is there an alternative of doing the same thing?
The syntax in the for loop is wrong. I am not entirely sure what you are trying to accomplish, but let us try this:
year = 2001
A = data.frame(a = c(1, 1), b = c(2, 2))
B = data.frame(a = c(1, 1), b = c(2, 2))
L = list(A, B)
for (i in seq_along(L)) {
L[[i]][, dim(L[[i]])[2] + 1] = as.character(rep(year,dim(L[[i]])[1]))
year = year + 1
}
with output
> L
[[1]]
a b V3
1 1 2 2001
2 1 2 2001
[[2]]
a b V3
1 1 2 2002
2 1 2 2002
That is what you intend as output, correct?
In order to change the column name to "year" you can do
L = lapply(L, function(x) {colnames(x)[3] = "year"; x})
You take a copy of the dataframe from the list, and add the variable "year" to it, but then do not assign it anywhere, which is why it is discarded (i.e. not stored in a variable). Here's a fix:
year <- 2001
dflist <- list(bhps01, bhps02, bhps03, bhps04, bhps05, bhps06, bhps07, bhps08)
counter <- 0
for (df in dflist){
counter <- counter + 1
df[["year"]] <- as.character(year)
dflist[[counter]] <- df
year <- year + 1
}
If you want the original data frames to be edited, you could assign the result back to the original objects rather than only into the list. This is a bit of an indirect route; notice that dflist is now created with names. We modify df and then assign it to its original name. For example:
year <- 2001
dflist <- list(bhps01 = bhps01, bhps02 = bhps02, bhps03 = bhps03, bhps04 = bhps04, bhps05 = bhps05, bhps06 = bhps06, bhps07 = bhps07, bhps08 = bhps08)
counter <- 0
for (df in dflist){
counter <- counter + 1
df[["year"]] <- as.character(year)
dflist[[counter]] <- df
assign(names(dflist)[counter], df)
year <- year + 1
}
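As an alternative to the counter-based loops above, the list and the years can be walked in parallel with Map, which returns the modified copies directly. This is only a sketch: the two small data frames below are hypothetical stand-ins for bhps01 and bhps02.

```r
# Hypothetical stand-ins for the real bhps01, bhps02, ... data frames
bhps01 <- data.frame(id = 1:2)
bhps02 <- data.frame(id = 3:4)

dflist <- list(bhps01 = bhps01, bhps02 = bhps02)
years  <- 2001:2002

# Pair each data frame with its year and return the modified copy
dflist <- Map(function(df, yr) {
  df[["year"]] <- as.character(yr)
  df
}, dflist, years)

# Optional: overwrite the original objects using the list names
list2env(dflist, envir = .GlobalEnv)
```

Because Map keeps the names of dflist, list2env can write each element back under its original name, which replaces the explicit assign call.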

Looping row numbers from one dataframe to create new data using logical operations in R

I would like to extract a data frame that shows how many years it takes for the NInd variable (dataset p1) to recover after a cull, as recorded in data frame e1.
I have the following datasets (mine are much bigger, but just to give you something to play with):
# Dataset 1
Batch <- c(2,2,2,2,2,2,2,2,2,2)
Rep <- c(0,0,0,0,0,0,0,0,0,0)
Year <- c(0,0,1,1,2,2,3,3,4,4)
RepSeason <- c(0,0,0,0,0,0,0,0,0,0)
PatchID <- c(17,25,19,16,21,24,23,20,18,33)
Species <- c(0,0,0,0,0,0,0,0,0,0)
Selected <- c(1,1,1,1,1,1,1,1,1,1)
Nculled <- c(811,4068,1755,449,1195,1711,619,4332,457,5883)
e1 <- data.frame(Batch,Rep,Year,RepSeason,PatchID,Species,Selected,Nculled)
# Dataset 2
Batch <- c(2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
Rep <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
Year <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
RepSeason <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
PatchID <- c(17,25,19,16,21,24,23,20,18,33,17,25,19,16,21,24,23,20,18,33,17,25,19,16,21,24,23,20,18,33)
Ncells <- c(6,5,6,4,4,5,6,5,5,5,6,5,6,4,4,5,6,7,3,5,4,4,3,3,4,4,5,5,6,4)
Species <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
NInd <- c(656,656,262,350,175,218,919,218,984,875,700,190,93,127,52,54,292,12,43,68,308,1000,98,29,656,656,262,350,175,300)
p1 <- data.frame(Batch, Rep, Year, RepSeason, PatchID, Ncells, Species, NInd)
The dataset called e1 shows only those years in which some culling happened to the population in a specific PatchID.
I have created the following script, which uses one row of e1 to compute a Recovery number. Maybe there is an easier way to get there, but this is the one I managed to get working.
When you run this, you are working on ONE row of e1, so we focus on the first PatchID encountered, do some calculations to match it up with p1, and finally get a number named Recovery.
The thing is, my data frame has 50,000 rows, so doing this over and over would be quite tedious. That's where I thought a loop might be useful, but I have tried and had no luck getting one to work.
#here is where I would like the loop
e2 <- e1[1,] # Trial for one row only # but the idea is having here a loop that keep doing of comes next for each row
e3 <- e2 %>%
select(1,2,4,5)
p2 <- p1[,c(1,2,4,5,3,6,7,8)] # Re-order
row2 <- which(apply(p2, 1, function(x) return(all(x == e3))))
p3 <- p1 %>%
slice(row2) # all years with that particular patch in that particular Batch
#How many times was this patch cull during this replicate?
e4 <- e2[,c(1,2,4,5,3,6,7,8)]
e4 <- e4 %>%
select(1,2,3,4)
c_batch <- e1[,c(1,2,4,5,3,6,7,8)]
row <- which(apply(c_batch, 1, function(x) return(all(x == e4))))
c4 <- c_batch %>%
slice(row)
# Number of year to recover to 95% that had before culled
c5 <- c4[1,] # extract the first time was culled
c5 <- c5 %>%
select(1:5)
row3 <- which(apply(p2, 1, function(x) return(all(x == c5))))
Before <- p2 %>%
slice(row3)
NInd <- Before[,8] # Before culling number of individuals
Year2 <- Before[,5] # Year number where first culling happened (that actually the number corresponds to individuals before culling, as the Pop file is developed during reproduction, while Cull file is developed after!)
Percent <- (95*NInd)/100 # 95% recovery we want to achieve would correspond to having 95% of NInd BEFORE culled happened (Year2)
After <- p3 %>%
filter(NInd >= Percent & Year > Year2) # Look rows that match number of ind and Year
After2 <- After[1,] # we just want the first year where the recovery was successfully achieved
Recovery <- After2$Year - Before$Year
# no. of years to reach 95% of the population immediately before the cull
I reckon the end would have to change somehow to tell R that we are creating a data frame with the Recovery values, something like:
Batch <- c(1,1,2,2)
Rep <- c(0,0,0,0)
PatchID <- c(17,25,30,12)
Recovery <- c(1,2,1,5)
Final <- data.frame(Batch, Rep, PatchID, Recovery)
Would that be possible? Or is this just too messed up, and should I try a different way?
Does the following solve the problem correctly?
I first added a unique ID to your data.frames to allow matching of the cull and population files (this saves most of your complicated look-up code):
# Add a unique ID for the patch/replicate etc. (as done in the example code)
e1$RepID = paste(e1$Batch, e1$Rep, e1$RepSeason, e1$PatchID, sep = ":")
p1$RepID = paste(p1$Batch, p1$Rep, p1$RepSeason, p1$PatchID, sep = ":")
If you want a quick overview of the number of times each patch was culled, the new RepID makes this easy:
# How many times was each patch culled?
table(e1$RepID) # e1 holds the cull events; p1 holds the population counts
Then you want a loop to check the recovery time after each cull.
My solution uses an sapply loop (which also retains the RepIDs so you can match to other metadata later):
sapply(unique(e1$RepID), function(rep_id){
all_cull_events = e1[e1$RepID == rep_id, , drop = F]
first_year = order(all_cull_events$Year)[1] # The first cull year (assuming data might not be in temporal order)
first_cull_event = all_cull_events[first_year, ] # The row corresponding to the first cull event
population_counts = p1[p1$RepID == first_cull_event$RepID, ] # The population counts for this plot/replicate
population_counts = population_counts[order(population_counts$Year), ] # Order by year (assuming data might not be in temporal order)
pop_at_first_cull_event = population_counts[population_counts$Year == first_cull_event$Year, "NInd"]
population_counts_after_cull = population_counts[population_counts$Year > first_cull_event$Year, , drop = F]
years_to_recovery = which(population_counts_after_cull$NInd >= (pop_at_first_cull_event * .95))[1] # First year to pass 95% threshold
return(years_to_recovery)
})
2:0:0:17 2:0:0:25 2:0:0:19 2:0:0:16 2:0:0:21 2:0:0:24 2:0:0:23 2:0:0:20 2:0:0:18 2:0:0:33
1 2 1 NA NA NA NA NA NA NA
(The output contains some NAs because the first cull year was outside the range of population counts in the data you gave us)
Please check this against your expected output though. There were some aspects of the question and example code that were not clear (see comments).
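One detail worth checking: which(...)[1] in the loop above returns the position of the first recovered row among the post-cull rows, which equals the number of years only when every year after the cull has exactly one row. If years can be missing, the difference between Year values may be safer. A minimal sketch on invented counts for a single patch (pop and cull_year are hypothetical names):

```r
# Hypothetical population counts for one patch; note that year 2 is missing
pop <- data.frame(Year = c(0, 1, 3, 4),
                  NInd = c(656, 190, 400, 700))
cull_year <- 0

# Population in the year of the first cull (counted before culling)
before <- pop$NInd[pop$Year == cull_year]

# Later years in which the population is back to at least 95% of that value
recovered <- pop[pop$Year > cull_year & pop$NInd >= 0.95 * before, ]

# Years elapsed until the first such year, or NA if it never recovers
recovery <- if (nrow(recovered) > 0) min(recovered$Year) - cull_year else NA
recovery
```

With these numbers the position-based approach would report 3 (the third post-cull row), while the year difference is 4.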

Plot one column in every matrix in a list of matrices in R

I'm quite new to programming as well as data analysis, please bear with me here. My data currently consists of a list of 14 matrices (lom), each corresponding to data from a country (with two-letter country codes).
Here is a full sample for Austria:
> lom["AT"]
$`AT`
Year AllKey AllSub SelKey SelSub
1 2000 1.622279 0.5334964 1.892894 0.8057591
2 2001 1.903745 0.5827514 2.291335 0.8295899
3 2002 1.646538 0.4873866 2.006873 0.7360566
4 2003 1.405250 0.8692641 2.105648 1.2711968
5 2004 1.511154 1.5091751 1.970236 1.9407666
6 2005 1.459177 0.6781008 1.808982 1.1362805
7 2006 1.604652 0.5038658 1.942126 0.7992008
8 2007 2.107326 0.9260200 2.683072 1.3302627
9 2008 1.969735 0.6178362 2.994758 1.2051339
10 2009 1.955768 0.7365529 2.896198 1.2272024
11 2010 2.476157 0.7952590 3.715950 1.5686643
12 2011 2.092459 0.4970011 2.766169 0.6476707
13 2012 1.913122 0.5338756 2.450942 0.6022315
14 2013 2.086200 0.6739412 2.786736 0.9211941
15 2014 2.579428 0.8424793 3.152541 1.0225888
16 2015 10.662568 5.8472436 9.769320 3.8840780
17 2016 11.088286 4.6504581 10.567789 3.2383420
18 2017 7.225053 1.7528594 6.747515 1.2781224
I'd like to get all 14 countries plotted against x = Year and y = each of the other variables, i.e. four plots with 14 lines each. Hence the requirement in the question title.
I keep coming up with impossibilities involving some combination of a for loop and some apply function, for example:
for (i in colnames(lom$anyCountry)) {
ggplot(lapply(lom, function(x) x[,1:14], aes(x=Year, y=i)
}
which apart from many other problems I can now see throws:
Error: data must be a data frame, or other object coercible by fortify(), not a list
which led me to combine the list of matrices into a big matrix, inspired by this:
bigDF <- do.call(rbind, lom)
I suppose I could restructure my data some other way, perhaps I'm missing some functionality that would help... probably both. I would appreciate any pointers as to how to achieve this as efficiently as possible.
Consider appending all matrix data into a master, single data frame with a country indicator that you can use for the color argument of line plots:
# CREATE LARGE DATAFRAME FROM MATRIX LIST
lom_df <- do.call(rbind, lapply(lom, data.frame))
# CREATE COLUMN NAMES FROM ROWNAMES
lom_df$country <- gsub("\\..*$", "", row.names(lom_df))
row.names(lom_df) <- NULL
# EXTRACT ALL FOUR Y COLUMN NAMES (MINUS Year AND country)
y_columns <- colnames(lom_df[2:(ncol(lom_df)-1)])
# PRODUCE LIST OF FOUR PLOTS EACH WITH COUNTRY LINES
plot_list <- lapply(y_columns, function(col)
ggplot(lom_df, aes_string(x="Year", y=col, color="country")) +
geom_line()
)
# OUTPUT EACH LIST
plot_list
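In recent ggplot2 releases aes_string is deprecated; the .data pronoun inside aes is the documented replacement for column names held in a string. The sketch below assumes a small made-up lom_df with two countries and two measure columns, only to show the pattern:

```r
library(ggplot2)

# Hypothetical combined data frame in the same shape as lom_df above
lom_df <- data.frame(
  Year    = rep(2000:2002, times = 2),
  AllKey  = c(1.6, 1.9, 1.6, 2.1, 2.0, 2.2),
  AllSub  = c(0.5, 0.6, 0.5, 0.9, 0.6, 0.7),
  country = rep(c("AT", "DE"), each = 3)
)

# All measure columns, i.e. everything except Year and country
y_columns <- setdiff(names(lom_df), c("Year", "country"))

# One plot per measure; .data[[col]] looks up the column named by the string
plot_list <- lapply(y_columns, function(col) {
  ggplot(lom_df, aes(x = Year, y = .data[[col]], colour = country)) +
    geom_line() +
    ylab(col)
})
names(plot_list) <- y_columns
```

Printing plot_list (or a single element such as plot_list[["AllKey"]]) draws the plots.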
This solution uses package ggplot2.
It has two steps, data preparation and plotting.
First of all the list must be transformed into one large data frame, with a column as an id column. I have searched SO for a function that does this but couldn't find one so here it goes.
rbindWithID <- function(x, id.name = "ID", sep = "."){
if(is.null(names(x))) names(x) <- paste(id.name, seq_along(x), sep = sep)
res <- lapply(names(x), function(nm){
DF <- x[[nm]]
DF[[id.name]] <- nm
x[[nm]] <- cbind(DF[ncol(DF)], DF[-ncol(DF)])
x[[nm]]
})
do.call(rbind, res)
}
lom_df <- rbindWithID(lom, "Country")
Now reshape the data frame from wide to long.
molten <- reshape2::melt(lom_df, id.vars = c("Country", "Year"))
Finally, plot it.
library(ggplot2)
ggplot(molten, aes(Year, value, colour = Country)) +
geom_line() +
facet_wrap(~ variable)
DATA.
set.seed(1234) # Make the results reproducible
lom <- lapply(1:4, function(i){
data.frame(
Year = 2000:2008,
AllKey = runif(9, 1, 2),
AllSub = runif(9, 0, 2),
SelKey = runif(9, 1, 2),
SelSub = runif(9, 0, 2)
)
})
names(lom) <- c("AT", "DE", "FR", "PT")

aggregating a data frame over a column

I have a data frame, one of the columns representing years. Let's say
region <- c("Spain", "Italy", "Norway")
year <- c("2010","2011","2012","2010","2011","2012","2010","2011","2012")
m1 <- c("10","11","12","13","14","15","16","17","18")
m2 <- c("20","30","40","50","60","70","80","90","100")
data <- data.frame(region,year,m1,m2)
I want to aggregate the data set m1 in a way taking 3-year averages for each country. I am confused in how to do that with a data frame. Any comment is highly appreciated.
Thanks in advance!
First, your m1 variable needs to be numeric. Convert it using as.numeric():
data$m1 <- as.numeric(as.character(data$m1))
Then, you can use aggregate like this:
aggregate(m1 ~ region, FUN = mean, data = data)
# region m1
# 1 Italy 14
# 2 Norway 15
# 3 Spain 13
To avoid the awkward type conversion (as.numeric(as.character())), you should eliminate the quotes from the setup for m1 and m2:
m1 <- c(10,11,12,13,14,15,16,17,18)
m2 <- c(20,30,40,50,60,70,80,90,100)
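Once both columns are numeric, aggregate can average m1 and m2 in a single call by putting cbind on the left-hand side of the formula. A short sketch reusing the question's data (res is just an illustrative name):

```r
region <- c("Spain", "Italy", "Norway")
year <- c("2010","2011","2012","2010","2011","2012","2010","2011","2012")
m1 <- c(10,11,12,13,14,15,16,17,18)
m2 <- c(20,30,40,50,60,70,80,90,100)
data <- data.frame(region, year, m1, m2)

# cbind() on the left-hand side aggregates several columns at once
res <- aggregate(cbind(m1, m2) ~ region, data = data, FUN = mean)
res
#   region m1 m2
# 1  Italy 14 60
# 2 Norway 15 70
# 3  Spain 13 50
```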
Alternative approach using dplyr:
library(dplyr)
region <- c("Spain", "Italy", "Norway")
year <- c("2010","2011","2012","2010","2011","2012","2010","2011","2012")
m1 <- c(10,11,12,13,14,15,16,17,18)
m2 <- c(20,30,40,50,60,70,80,90,100)
data <- data.frame(region,year,m1,m2)
data %>%
group_by(region) %>%
summarise(mean_m1 = mean(m1),
mean_m2 = mean(m2))
# region mean_m1 mean_m2
# 1 Italy 14 60
# 2 Norway 15 70
# 3 Spain 13 50

Reshape data frame for consecutive years

I have data about thousands of customers who visited stores in the 3 past years.
For each customer, I have :
ID
Combination of a year and the first store visited in this year.
Customer_Id | Year_*_Store
1 2010_A
1 2011_B
1 2012_C
2 2010_A
2 2011_B
2 2012_D
What I'd like to have is the following data structure, in order to visualize the evolution of the customers' behaviour with a riverplot (aka Sankey plot).
For instance the 2 customers, who firstly visited the store A in 2010, firstly visited the store B in 2011:
SOURCE | TARGET | NB_CUSTOMERS
2010_A 2011_B 2
2011_B 2012_C 1
2011_B 2012_D 1
I don't want links between two years which are not consecutive like 2010_A and 2012_D
How can I do that in R ?
I would do this with dplyr (faster)
df<-read.table(header=T,text="Customer_Id Year_Store
1 2010_A
1 2011_B
1 2012_C
2 2010_A
2 2011_B
2 2012_D")
require(dplyr) # for aggregation
require(riverplot) # for Sankey
targets <- df %>%
  group_by(Customer_Id) %>% # group by customer
  mutate(source = as.character(Year_Store),
         target = lead(as.character(Year_Store))) %>% # next visit of the same customer
  filter(!is.na(target)) %>% # filter out empty edges
  group_by(source, target) %>% # regroup by source & target
  summarise(len = n()) %>% # count customers for each relationship
  ungroup() %>%
  mutate(step = as.integer(substr(target, 1, 4)) - as.integer(substr(source, 1, 4))) %>% # years between the two visits
  filter(step == 1) # keep only relationships between consecutive years
topnodes <- c(as.character(unique(df$Year_Store))) # unique nodes
nodes <- data.frame( ID=topnodes, # IDs
x=as.numeric(substr(topnodes,1,4)), # x value for plot
col= rainbow(length(topnodes)), # color each different
labels= topnodes, # labels
stringsAsFactors= FALSE )
edges<- # create list of list
lapply(unique(targets$source),function(x){
l<-as.list(filter(targets,source==x)$len) # targets per source
names(l)<-filter(targets,source==x)$target # name of target
l
})
names(edges)<-unique(targets$source) # name top level nodes
r <- makeRiver( nodes, edges) # make the River
plot( r ) # plot it!
Note that you can't have a * in column names (see ?make.names). Here is a basic approach:
1. Split Year_Store into two separate columns, Year and Store, in your data frame; at the moment it contains two completely different kinds of data, and you need to process them separately.
2. Make a NextYear column, defined as Year + 1.
3. Make a NextStore column holding the store code for the row with the same Customer_Id whose Year equals this row's NextYear. Assign NA if there is no record of the customer visiting a store the next year, and throw an error if the data do not meet the required specification (i.e. are ambiguous about which store was visited first the next year).
4. Strip out any rows in which NextStore is NA, and combine the NextYear and NextStore columns into a NextYear_NextStore column.
5. Summarize your data frame by the Year_Store and NextYear_NextStore columns, e.g. using ddply in the plyr package.
Some sample data:
# same example data as question
customer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2012_D"),
stringsAsFactors = FALSE)
# alternative data should throw error, customer 2 is inconsistent in 2011
badCustomer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2011_D"),
stringsAsFactors = FALSE)
And an implementation:
require(plyr)
splitYearStore <- function(df) {
df$Year <- as.numeric(substring(df$Year_Store, 1, 4))
df$Store <- as.character(substring(df$Year_Store, 6))
return(df)
}
findNextStore <- function(df, matchCust, matchYear) {
matchingStore <- with(df,
df[Customer_Id == matchCust & Year == matchYear, "Store"])
if (length(matchingStore) == 0) {
return(NA)
} else if (length(matchingStore) > 1) {
errorString <- paste("Inconsistent store results for customer",
matchCust, "in year", matchYear)
stop(errorString)
} else {
return(matchingStore)
}
}
tabulateTransitions <- function(df) {
df <- splitYearStore(df)
df$NextYear <- df$Year + 1
df$NextStore <- mapply(findNextStore, matchCust = df$Customer_Id,
matchYear = df$NextYear, MoreArgs = list(df = df))
df$NextYear_NextStore <- with(df, paste(NextYear, NextStore, sep = "_"))
df <- df[!is.na(df$NextStore),]
df <- ddply(df, .(Source = Year_Store, Target = NextYear_NextStore),
summarise, No_Customers = length(Customer_Id))
return(df)
}
Results:
> tabulateTransitions(customer.df)
Source Target No_Customers
1 2010_A 2011_B 2
2 2011_B 2012_C 1
3 2011_B 2012_D 1
> tabulateTransitions(badCustomer.df)
Error in function (df, matchCust, matchYear) :
Inconsistent store results for customer 2 in year 2011
No attempt has been made to optimise; if your data set is massive then perhaps you should investigate a data.table solution.
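For comparison, the same transition table can be sketched in base R with a self-join: merge each visit with the same customer's visit in the following year, then count the pairs. This is an illustration under the assumption of at most one row per customer and year; unlike findNextStore above, it does not raise an error on ambiguous data. The names visits, nxt, pairs and transitions are made up for the example.

```r
customer.df <- data.frame(
  Customer_Id = c(1, 1, 1, 2, 2, 2),
  Year_Store  = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2012_D"),
  stringsAsFactors = FALSE
)

# One row per visit, with the year extracted as an integer
visits <- data.frame(
  Customer_Id = customer.df$Customer_Id,
  Year        = as.integer(substr(customer.df$Year_Store, 1, 4)),
  Year_Store  = customer.df$Year_Store
)

# The same visits, relabelled with the year before them,
# so that merging on Year pairs each visit with the following year's visit
nxt <- data.frame(
  Customer_Id = visits$Customer_Id,
  Year        = visits$Year - 1L,
  Target      = visits$Year_Store
)

# Inner join keeps only visits that have a visit in the next year
pairs <- merge(visits, nxt, by = c("Customer_Id", "Year"))

# Count customers per source/target combination
transitions <- aggregate(Customer_Id ~ Year_Store + Target,
                         data = pairs, FUN = length)
names(transitions) <- c("SOURCE", "TARGET", "NB_CUSTOMERS")
transitions
```

Non-consecutive links such as 2010_A to 2012_D never appear, because the join only matches a year with the year immediately after it.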
