I have 8 data frames and I want to create a variable in each of these data frames. I use a for loop, and the code I have used is given below:
year <- 2001
dflist <- list(bhps01, bhps02, bhps03, bhps04, bhps05, bhps06, bhps07, bhps08)
for (df in dflist){
df[["year"]] <- as.character(year)
assign()
year <- year + 1
}
bhps01,...,bhps08 are the data frame objects and year is a character variable. bhps01 is the data frame for year 2001, bhps02 is the data frame for year 2002 and so on.
Each data frame corresponds to a year, so bhps01 corresponds to year 2001, bhps02 corresponds to 2002 and so on. I want to create a year variable in each of these data frames, so the year variable would be "2001" for bhps01, "2002" for bhps02 and so on.
The code runs fine, but it does not create the variable year in any of the data frames, only in the local variable df.
Can someone please explain the error in the above code? Or is there an alternative of doing the same thing?
The syntax in the for loop is wrong. I am not entirely sure what you are trying to accomplish, but let us try this:
year = 2001
A = data.frame(a = c(1, 1), b = c(2, 2))
B = data.frame(a = c(1, 1), b = c(2, 2))
L = list(A, B)
for (i in seq_along(L)) {
L[[i]][, dim(L[[i]])[2] + 1] = as.character(rep(year,dim(L[[i]])[1]))
year = year + 1
}
with output
> L
[[1]]
a b V3
1 1 2 2001
2 1 2 2001
[[2]]
a b V3
1 1 2 2002
2 1 2 2002
That is what you intend as output, correct?
In order to change the column name to "year" you can do
L = lapply(L, function(x) {colnames(x)[3] = "year"; x})
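As a possible shortcut, the column can be created under its final name inside the loop, which avoids the separate renaming step. This is only a sketch reusing the toy data frames A and B from above; it relies on R recycling a single value to every row of the data frame.

```r
# Toy data, same shape as the example above
A <- data.frame(a = c(1, 1), b = c(2, 2))
B <- data.frame(a = c(1, 1), b = c(2, 2))
L <- list(A, B)

start_year <- 2001
for (i in seq_along(L)) {
  # Assigning a single value fills the whole column; the column is named "year" directly
  L[[i]]$year <- as.character(start_year + i - 1)
}
```

Here the year is derived from the position in the list, so no separate counter has to be incremented.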
You take a copy of the dataframe from the list, and add the variable "year" to it, but then do not assign it anywhere, which is why it is discarded (i.e. not stored in a variable). Here's a fix:
year <- 2001
dflist <- list(bhps01, bhps02, bhps03, bhps04, bhps05, bhps06, bhps07, bhps08)
counter <- 0
for (df in dflist){
counter <- counter + 1
df[["year"]] <- as.character(year)
dflist[[counter]] <- df
year <- year + 1
}
If you want the original dataframes to be edited, you could assign the result back to the original object rather than only into the list. This is a bit of an indirect route, and notice the change in creating the dflist with names. We modify the df, and then assign it to the original name. For example:
year <- 2001
dflist <- list(bhps01 = bhps01, bhps02 = bhps02, bhps03 = bhps03, bhps04 = bhps04, bhps05 = bhps05, bhps06 = bhps06, bhps07 = bhps07, bhps08 = bhps08)
counter <- 0
for (df in dflist){
counter <- counter + 1
df[["year"]] <- as.character(year)
dflist[[counter]] <- df
assign(names(dflist)[counter], df)
year <- year + 1
}
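A loop-free variant of the same idea might look like the sketch below. It assumes a named list as above; the two small data frames are hypothetical stand-ins for the real bhps01 and bhps02. Map() walks over the data frames and the years in parallel, and list2env() copies the modified data frames back over the original objects.

```r
# Hypothetical stand-ins for the real bhps data frames
bhps01 <- data.frame(id = 1:2)
bhps02 <- data.frame(id = 3:5)
dflist <- list(bhps01 = bhps01, bhps02 = bhps02)

# One year per data frame, as character: "2001", "2002", ...
years <- as.character(2001 + seq_along(dflist) - 1)

# Add the year column to each data frame; Map() keeps the list names
dflist <- Map(function(df, yr) {
  df$year <- yr
  df
}, dflist, years)

# Overwrite the original objects with the modified copies
list2env(dflist, envir = environment())
```

Whether overwriting the originals is a good idea depends on the workflow; keeping everything in the list is often easier to manage.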
I would like to extract a dataframe that shows how many years it takes for the NInd variable (dataset p1) to recover after a culling event, which is shown in dataframe e1.
I have the following datasets (mine are much bigger, but just to give you something to play with):
# Dataset 1
Batch <- c(2,2,2,2,2,2,2,2,2,2)
Rep <- c(0,0,0,0,0,0,0,0,0,0)
Year <- c(0,0,1,1,2,2,3,3,4,4)
RepSeason <- c(0,0,0,0,0,0,0,0,0,0)
PatchID <- c(17,25,19,16,21,24,23,20,18,33)
Species <- c(0,0,0,0,0,0,0,0,0,0)
Selected <- c(1,1,1,1,1,1,1,1,1,1)
Nculled <- c(811,4068,1755,449,1195,1711,619,4332,457,5883)
e1 <- data.frame(Batch,Rep,Year,RepSeason,PatchID,Species,Selected,Nculled)
# Dataset 2
Batch <- c(2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
Rep <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
Year <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
RepSeason <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
PatchID <- c(17,25,19,16,21,24,23,20,18,33,17,25,19,16,21,24,23,20,18,33,17,25,19,16,21,24,23,20,18,33)
Ncells <- c(6,5,6,4,4,5,6,5,5,5,6,5,6,4,4,5,6,7,3,5,4,4,3,3,4,4,5,5,6,4)
Species <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
NInd <- c(656,656,262,350,175,218,919,218,984,875,700,190,93,127,52,54,292,12,43,68,308,1000,98,29,656,656,262,350,175,300)
p1 <- data.frame(Batch, Rep, Year, RepSeason, PatchID, Ncells, Species, NInd)
The dataset called e1 shows only those years in which some culling happened to the population on a specific PatchID.
I have created the following script that basically uses each row from e1 to create a Recovery number. Maybe there is an easier way to get there, but this is the one I managed to write.
When you run this, you are working on ONE row of e1: we focus on the first PatchID encountered, do some calculations to match it up with p1, and finally get a number named Recovery.
Now, the thing is that my dataframe has 50,000 rows, so doing this over and over would be quite tedious. That is where I thought a loop might be useful, but I have tried and had no luck making it work at all...
#here is where I would like the loop
e2 <- e1[1,] # Trial for one row only; the idea is to have a loop here that repeats what comes next for each row
e3 <- e2 %>%
select(1,2,4,5)
p2 <- p1[,c(1,2,4,5,3,6,7,8)] # Re-order
row2 <- which(apply(p2, 1, function(x) return(all(x == e3))))
p3 <- p1 %>%
slice(row2) # all years with that particular patch in that particular Batch
#How many times was this patch cull during this replicate?
e4 <- e2[,c(1,2,4,5,3,6,7,8)]
e4 <- e4 %>%
select(1,2,3,4)
c_batch <- e1[,c(1,2,4,5,3,6,7,8)]
row <- which(apply(c_batch, 1, function(x) return(all(x == e4))))
c4 <- c_batch %>%
slice(row)
# Number of year to recover to 95% that had before culled
c5 <- c4[1,] # extract the first time was culled
c5 <- c5 %>%
select(1:5)
row3 <- which(apply(p2, 1, function(x) return(all(x == c5))))
Before <- p2 %>%
slice(row3)
NInd <- Before[,8] # Before culling number of individuals
Year2 <- Before[,5] # Year number where first culling happened (that actually the number corresponds to individuals before culling, as the Pop file is developed during reproduction, while Cull file is developed after!)
Percent <- (95*NInd)/100 # 95% recovery we want to achieve would correspond to having 95% of NInd BEFORE culled happened (Year2)
After <- p3 %>%
filter(NInd >= Percent & Year > Year2) # Look rows that match number of ind and Year
After2 <- After[1,] # we just want the first year where the recovery was successfully achieved
Recovery <- After2$Year - Before$Year
# no. of years to reach 95% of the population immediately before the cull
I reckon that the end would have to change somehow to tell R that we are creating a dataframe with the Recovery, something like:
Batch <- c(1,1,2,2)
Rep <- c(0,0,0,0)
PatchID <- c(17,25,30,12)
Recovery <- c(1,2,1,5)
Final <- data.frame(Batch, Rep, PatchID, Recovery)
Would that be possible? Or is this just too messed up, and should I try a different way?
Does the following solve the problem correctly?
I have first added a unique ID to your data.frames to allow matching of the cull and population files (this saves most of your complicated look-up code):
# Add a unique ID for the patch/replicate etc. (as done in the example code)
e1$RepID = paste(e1$Batch, e1$Rep, e1$RepSeason, e1$PatchID, sep = ":")
p1$RepID = paste(p1$Batch, p1$Rep, p1$RepSeason, p1$PatchID, sep = ":")
If you want a quick overview of the number of times each patch was culled, the new RepID makes this easy:
# How many times was each patch culled?
table(e1$RepID)
Then you want a loop to check the recovery time after each cull.
My solution uses an sapply loop (which also retains the RepIDs so you can match to other metadata later):
sapply(unique(e1$RepID), function(rep_id){
all_cull_events = e1[e1$RepID == rep_id, , drop = F]
first_year = order(all_cull_events$Year)[1] # The first cull year (assuming data might not be in temporal order)
first_cull_event = all_cull_events[first_year, ] # The row corresponding to the first cull event
population_counts = p1[p1$RepID == first_cull_event$RepID, ] # The population counts for this plot/replicate
population_counts = population_counts[order(population_counts$Year), ] # Order by year (assuming data might not be in temporal order)
pop_at_first_cull_event = population_counts[population_counts$Year == first_cull_event$Year, "NInd"]
population_counts_after_cull = population_counts[population_counts$Year > first_cull_event$Year, , drop = F]
years_to_recovery = which(population_counts_after_cull$NInd >= (pop_at_first_cull_event * .95))[1] # First year to pass 95% threshold
return(years_to_recovery)
})
2:0:0:17 2:0:0:25 2:0:0:19 2:0:0:16 2:0:0:21 2:0:0:24 2:0:0:23 2:0:0:20 2:0:0:18 2:0:0:33
1 2 1 NA NA NA NA NA NA NA
(The output contains some NAs, either because the first cull year is at or beyond the last year of population counts in the data you gave us, or because the population had not yet reached the 95% threshold within that range)
Please check this against your expected output though. There were some aspects of the question and example code that were not clear (see comments).
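To get from the named vector above to a data frame like the Final one sketched in the question, the RepID names can be split back into their parts. This is a minimal sketch: the recovery vector below just copies the first four values from the output shown above, and it assumes the RepID was built as Batch:Rep:RepSeason:PatchID.

```r
# First four results copied from the output above (names are the RepIDs)
recovery <- c("2:0:0:17" = 1, "2:0:0:25" = 2, "2:0:0:19" = 1, "2:0:0:16" = NA)

# Split "Batch:Rep:RepSeason:PatchID" into a character matrix, one row per RepID
parts <- do.call(rbind, strsplit(names(recovery), ":", fixed = TRUE))

Final <- data.frame(
  Batch    = as.numeric(parts[, 1]),
  Rep      = as.numeric(parts[, 2]),
  PatchID  = as.numeric(parts[, 4]),  # column 3 is RepSeason
  Recovery = unname(recovery)
)
```

With the real data, recovery would simply be the result of the sapply call.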
I'm quite new to programming as well as data analysis, please bear with me here. My data currently consists of a list of 14 matrices (lom), each corresponding to data from a country (with two-letter country codes).
Here is a full sample for Austria:
> lom["AT"]
$`AT`
Year AllKey AllSub SelKey SelSub
1 2000 1.622279 0.5334964 1.892894 0.8057591
2 2001 1.903745 0.5827514 2.291335 0.8295899
3 2002 1.646538 0.4873866 2.006873 0.7360566
4 2003 1.405250 0.8692641 2.105648 1.2711968
5 2004 1.511154 1.5091751 1.970236 1.9407666
6 2005 1.459177 0.6781008 1.808982 1.1362805
7 2006 1.604652 0.5038658 1.942126 0.7992008
8 2007 2.107326 0.9260200 2.683072 1.3302627
9 2008 1.969735 0.6178362 2.994758 1.2051339
10 2009 1.955768 0.7365529 2.896198 1.2272024
11 2010 2.476157 0.7952590 3.715950 1.5686643
12 2011 2.092459 0.4970011 2.766169 0.6476707
13 2012 1.913122 0.5338756 2.450942 0.6022315
14 2013 2.086200 0.6739412 2.786736 0.9211941
15 2014 2.579428 0.8424793 3.152541 1.0225888
16 2015 10.662568 5.8472436 9.769320 3.8840780
17 2016 11.088286 4.6504581 10.567789 3.2383420
18 2017 7.225053 1.7528594 6.747515 1.2781224
I'd like to get all 14 countries plotted against x = Year and y = each of the other variables, i.e. four plots with 14 lines each. Hence the requirement in the question title.
I keep coming up with impossibilities involving some combination of a for loop and some apply function, for example:
for (i in colnames(lom$anyCountry)) {
ggplot(lapply(lom, function(x) x[,1:14], aes(x=Year, y=i)
}
which apart from many other problems I can now see throws:
Error: data must be a data frame, or other object coercible by fortify(), not a list
which led me to combine the list of matrices into one big matrix, like this:
bigDF <- do.call(rbind, lom)
I suppose I could restructure my data some other way, perhaps I'm missing some functionality that would help... probably both. I would appreciate any pointers as to how to achieve this as efficiently as possible.
Consider appending all matrix data into a master, single data frame with a country indicator that you can use for the color argument of line plots:
library(ggplot2)
# CREATE LARGE DATAFRAME FROM MATRIX LIST
lom_df <- do.call(rbind, lapply(lom, data.frame))
# CREATE country COLUMN FROM ROWNAMES
lom_df$country <- gsub("\\..*$", "", row.names(lom_df))
row.names(lom_df) <- NULL
# EXTRACT ALL FOUR Y COLUMN NAMES (MINUS Year AND country)
y_columns <- colnames(lom_df[2:(ncol(lom_df)-1)])
# PRODUCE LIST OF FOUR PLOTS EACH WITH COUNTRY LINES
plot_list <- lapply(y_columns, function(col)
ggplot(lom_df, aes_string(x="Year", y=col, color="country")) +
geom_line()
)
# OUTPUT EACH LIST
plot_list
This solution uses package ggplot2.
It has two steps, data preparation and plotting.
First of all the list must be transformed into one large data frame, with a column as an id column. I have searched SO for a function that does this but couldn't find one so here it goes.
rbindWithID <- function(x, id.name = "ID", sep = "."){
if(is.null(names(x))) names(x) <- paste(id.name, seq_along(x), sep = sep)
res <- lapply(names(x), function(nm){
DF <- x[[nm]]
DF[[id.name]] <- nm
x[[nm]] <- cbind(DF[ncol(DF)], DF[-ncol(DF)])
x[[nm]]
})
do.call(rbind, res)
}
lom_df <- rbindWithID(lom, "Country")
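For what it is worth, a base R one-liner can produce much the same result as the helper above. This is a sketch on a tiny hypothetical list (the AT values are rounded from the question, the DE values are made up); it assumes the list is named and that each element is a data frame.

```r
# Tiny hypothetical list of data frames, named by country code
lom <- list(
  AT = data.frame(Year = 2000:2001, AllKey = c(1.6, 1.9)),
  DE = data.frame(Year = 2000:2001, AllKey = c(1.2, 1.4))
)

# cbind() each country code onto its data frame, then stack the pieces
lom_df <- do.call(rbind, Map(cbind, Country = names(lom), lom))
rownames(lom_df) <- NULL
```

If dplyr is available, bind_rows(lom, .id = "Country") should give a similar data frame.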
Now reshape the data frame from wide to long.
molten <- reshape2::melt(lom_df, id.vars = c("Country", "Year"))
Finally, plot it.
library(ggplot2)
ggplot(molten, aes(Year, value, colour = Country)) +
geom_line() +
facet_wrap(~ variable)
DATA.
set.seed(1234) # Make the results reproducible
lom <- lapply(1:4, function(i){
data.frame(
Year = 2000:2008,
AllKey = runif(9, 1, 2),
AllSub = runif(9, 0, 2),
SelKey = runif(9, 1, 2),
SelSub = runif(9, 0, 2)
)
})
names(lom) <- c("AT", "DE", "FR", "PT")
I have a data frame, one of the columns representing years. Let's say
region <- c("Spain", "Italy", "Norway")
year <- c("2010","2011","2012","2010","2011","2012","2010","2011","2012")
m1 <- c("10","11","12","13","14","15","16","17","18")
m2 <- c("20","30","40","50","60","70","80","90","100")
data <- data.frame(region,year,m1,m2)
I want to aggregate the column m1 by taking the 3-year average for each country. I am confused about how to do that with a data frame. Any comment is highly appreciated.
Thanks in advance!
First, your m1 variable needs to be numeric. Convert it using as.numeric():
data$m1 <- as.numeric(as.character(data$m1))
Then, you can use aggregate like this:
aggregate(m1 ~ region, FUN = mean, data = data)
# region m1
# 1 Italy 14
# 2 Norway 15
# 3 Spain 13
To avoid the awkward type conversion (as.numeric(as.character())), you should eliminate the quotes from the setup for m1 and m2:
m1 <- c(10,11,12,13,14,15,16,17,18)
m2 <- c(20,30,40,50,60,70,80,90,100)
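If you need the averages of both m1 and m2 at once, aggregate also accepts several columns on the left-hand side of the formula via cbind(). A small sketch using the numeric version of the data from the question:

```r
region <- c("Spain", "Italy", "Norway")
year <- c("2010", "2011", "2012", "2010", "2011", "2012", "2010", "2011", "2012")
m1 <- c(10, 11, 12, 13, 14, 15, 16, 17, 18)
m2 <- c(20, 30, 40, 50, 60, 70, 80, 90, 100)
data <- data.frame(region, year, m1, m2)  # region is recycled to 9 rows

# Mean of m1 and m2 per region in one call
means <- aggregate(cbind(m1, m2) ~ region, data = data, FUN = mean)
means
```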
Alternative approach using dplyr:
library(dplyr)
region <- c("Spain", "Italy", "Norway")
year <- c("2010","2011","2012","2010","2011","2012","2010","2011","2012")
m1 <- c(10,11,12,13,14,15,16,17,18)
m2 <- c(20,30,40,50,60,70,80,90,100)
data <- data.frame(region,year,m1,m2)
data %>%
group_by(region) %>%
summarise(mean_m1 = mean(m1),
mean_m2 = mean(m2))
# region mean_m1 mean_m2
# 1 Italy 14 60
# 2 Norway 15 70
# 3 Spain 13 50
I have data about thousands of customers who visited stores in the 3 past years.
For each customer, I have :
ID
Combination of a year and the first store visited in this year.
Customer_Id | Year_*_Store
1 2010_A
1 2011_B
1 2012_C
2 2010_A
2 2011_B
2 2012_D
What I’d like to have is the following data structure, in order to visualize the evolution of the customers’ behaviour with a riverplot (aka Sankey plot).
For instance the 2 customers, who firstly visited the store A in 2010, firstly visited the store B in 2011:
SOURCE | TARGET | NB_CUSTOMERS
2010_A 2011_B 2
2011_B 2012_C 1
2011_B 2012_D 1
I don't want links between two years which are not consecutive like 2010_A and 2012_D
How can I do that in R ?
I would do this with dplyr (faster)
df<-read.table(header=T,text="Customer_Id Year_Store
1 2010_A
1 2011_B
1 2012_C
2 2010_A
2 2011_B
2 2012_D")
require(dplyr) # for aggregation
require(riverplot) # for Sankey
targets<-
group_by(df,Customer_Id) %>% # group by Customer
mutate(source=as.character(Year_Store),target=c(as.character(Year_Store)[-1],NA)) %>% # add a lag to show the shift
filter(!is.na(target)) %>% # filter out empty edges
group_by(source,target) %>% # regroup by source & target
summarise(len=length(Customer_Id)) %>% # count customers for relationship
mutate(step=as.integer(substr(target,1,4))-as.integer(substr(source,1,4))) %>% # add a step to show how many years
filter(step==1) # filter out relationships for non consec years
topnodes <- c(as.character(unique(df$Year_Store))) # unique nodes
nodes <- data.frame( ID=topnodes, # IDs
x=as.numeric(substr(topnodes,1,4)), # x value for plot
col= rainbow(length(topnodes)), # color each different
labels= topnodes, # labels
stringsAsFactors= FALSE )
edges<- # create list of list
lapply(unique(targets$source),function(x){
l<-as.list(filter(targets,source==x)$len) # targets per source
names(l)<-filter(targets,source==x)$target # name of target
l
})
names(edges)<-unique(targets$source) # name top level nodes
r <- makeRiver( nodes, edges) # make the River
plot( r ) # plot it!
Note that you can't have a * in column names (see ?make.names). Here is a basic approach:
Split Year_Store into two separate columns Year and Store in your data frame; at the moment it contains two completely different kinds of data and you actually need to process them separately.
Make a NextYear column, defined as Year + 1
Make a NextStore column, in which you assign the store code matching Customer_Id and for which Year is the same as this row's NextYear, assigning NA if there is no record of the customer visiting a store the next year, and throwing an error if the data do not meet the required specification (are ambiguous about which store was visited first the next year).
Strip out any of the rows in which NextStore is NA, and combine the NextYear and NextStore columns into a NextYear_NextStore column.
Summarize your data frame by the Year_Store and NextYear_NextStore columns e.g. using ddply in the plyr package.
Some sample data:
# same example data as question
customer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2012_D"),
stringsAsFactors = FALSE)
# alternative data should throw error, customer 2 is inconsistent in 2011
badCustomer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2011_D"),
stringsAsFactors = FALSE)
And an implementation:
require(plyr)
splitYearStore <- function(df) {
df$Year <- as.numeric(substring(df$Year_Store, 1, 4))
df$Store <- as.character(substring(df$Year_Store, 6))
return(df)
}
findNextStore <- function(df, matchCust, matchYear) {
matchingStore <- with(df,
df[Customer_Id == matchCust & Year == matchYear, "Store"])
if (length(matchingStore) == 0) {
return(NA)
} else if (length(matchingStore) > 1) {
errorString <- paste("Inconsistent store results for customer",
matchCust, "in year", matchYear)
stop(errorString)
} else {
return(matchingStore)
}
}
tabulateTransitions <- function(df) {
df <- splitYearStore(df)
df$NextYear <- df$Year + 1
df$NextStore <- mapply(findNextStore, matchCust = df$Customer_Id,
matchYear = df$NextYear, MoreArgs = list(df = df))
df$NextYear_NextStore <- with(df, paste(NextYear, NextStore, sep = "_"))
df <- df[!is.na(df$NextStore),]
df <- ddply(df, .(Source = Year_Store, Target = NextYear_NextStore),
summarise, No_Customers = length(Customer_Id))
return(df)
}
Results:
> tabulateTransitions(customer.df)
Source Target No_Customers
1 2010_A 2011_B 2
2 2011_B 2012_C 1
3 2011_B 2012_D 1
> tabulateTransitions(badCustomer.df)
Error in function (df, matchCust, matchYear) :
Inconsistent store results for customer 2 in year 2011
No attempt has been made to optimise; if your data set is massive then perhaps you should investigate a data.table solution.
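A shorter route to the same table, using only base R, is a self-join with merge(). The sketch below uses the example data from the question; note that, unlike the implementation above, it does not check for inconsistent records (a customer with two stores in the same year would simply contribute two rows).

```r
customer.df <- data.frame(
  Customer_Id = c(1, 1, 1, 2, 2, 2),
  Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2012_D"),
  stringsAsFactors = FALSE
)
customer.df$Year <- as.integer(substr(customer.df$Year_Store, 1, 4))

# Copy of the table shifted back one year, so each row can be matched
# with the same customer's record for the following year
nxt <- data.frame(
  Customer_Id = customer.df$Customer_Id,
  Year = customer.df$Year - 1L,
  Target = customer.df$Year_Store,
  stringsAsFactors = FALSE
)

# Keep only customer/year pairs that have a record in the next year
pairs <- merge(customer.df, nxt, by = c("Customer_Id", "Year"))

# Count customers per (source, target) transition
transitions <- aggregate(Customer_Id ~ Year_Store + Target, data = pairs, FUN = length)
names(transitions) <- c("Source", "Target", "No_Customers")
```

Because the join only matches Year with Year + 1, links between non-consecutive years never appear.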