A question about looping and creating/joining tables - r

My code is meant to order a table called Football (imported csv2) and then, using a for loop, go through the data and return the row number of the start year and end year.
Football[order(Football$Year),]
start_year <- min(Football$Year)
end_year <- max(Football$Year)
for (i in 1:nrow(Football)
{
if (Football$Year[i] = start_year)
{
row_of_start <- i
}
if (Football$Year[i] = end_year)
{
row_of_end <- i
}
}
This produces the following error:
> if (Football$Year[1] = start_year) row_of_start <- 1
Error: unexpected '=' in "if (Football$Year[1] ="
I appreciate there are probably ways of doing this without a for loop (which I would be glad to learn about), but I would also like to know how to make the for loop work, to further my understanding.

You can skip the loop entirely using which(). This will usually be faster and more legible:
# Create example data
set.seed(123)
Football <- data.frame(Year = sample(1990:2000, size = 10),
foo = sample(letters, size = 10))
# Sort the data as you have done
Football_sort <- Football[order(Football$Year), ]
# Get the row numbers of the min and max (start and end years)
which(with(Football_sort, Year == min(Year)))
#> [1] 1
which(with(Football_sort, Year == max(Year)))
#> [1] 10
Depending upon what you actually want to do, you can skip the ordering step as well. Both examples below require the dplyr package.
If you just want the start and end year rows rather than their row numbers:
library(dplyr)
Football %>%
filter(Year %in% c(min(Year), max(Year)))
#> Year foo
#> 1 2000 e
#> 2 1990 d
If you want the "year number" of the start and end year:
Football %>%
summarise(start_year = 1,
end_year = max(Year) - min(Year))
#> start_year end_year
#> 1 1 10
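For completeness, here is a minimal sketch of how the original for loop could be made to work. The reported error comes from using = (assignment) where == (comparison) is needed; the loop header is also missing a closing parenthesis, and the sorted table is never assigned back. The example data below is hypothetical.

```r
# Hypothetical example data (not from the question)
Football <- data.frame(Year = c(1995, 1990, 2000, 1993))

# Sorting returns a new data frame, so assign the result back
Football <- Football[order(Football$Year), , drop = FALSE]

start_year <- min(Football$Year)
end_year <- max(Football$Year)

for (i in seq_len(nrow(Football))) {      # closing parenthesis added
  if (Football$Year[i] == start_year) {   # == compares, = assigns
    row_of_start <- i
  }
  if (Football$Year[i] == end_year) {
    row_of_end <- i
  }
}

row_of_start
row_of_end
```

If several rows share the start or end year, this loop keeps the last matching row, whereas which() returns all of them.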

Related

Finding the longest continuous series of events

I have time-series data for many different countries, covering more than 50 years overall. Now I want to find the longest continuous recording period for each country. I have tried something but it does not work yet (when I checked the output, it did not return the correct number of continuous years for most cases, and I don't know why):
cntry <- as.list(unique(df$country))
df$longest.ts <- NULL
for (i in cntry) {
x <- max(diff(which(diff(df$year[df$country==i]) != 1)))
df$longest.ts[df$country==i] <- x
}
I appreciate your help,
Best
Edit: my data.frame is very big and has a lot of different variables but essentially I have got something like this:
df <- data.frame(
country = c("Bolivia","Bolivia","Bolivia","Bolivia","Bolivia","Bolivia",
"China","China","China","China","China","China","China","China"),
year = c(1923,1924,1925,1940,1945,1946,1960,1961,1962,1963,1964,1965,1981,1982)
)
And I would like to get an output with the countries and the longest time series within each country (e.g. here Bolivia: 3 yrs, China: 6 yrs).
There is a nice trick using diff to take the differences and rle to encode the differences in pairs of (value, times_repeated) (run length encoding, see ?rle).
For example, consider
vec <- c(1960L, 1961L, 1962L, 1963L, 1964L, 1965L, 1981L, 1982L)
diff(vec)
#> [1] 1 1 1 1 1 16 1
rle(diff(vec))
#> Run Length Encoding
#> lengths: int [1:3] 5 1 1
#> values : int [1:3] 1 16 1
You want to find how many times in a row there is a difference of 1: this is the maximum length (+1) corresponding to a value of 1.
Putting that in a function (note the check on variable type, because in your example year is a factor instead of a numeric):
longest_ts <- function(vec) {
  if (!is.numeric(vec)) stop("Vector must be numeric!")
  RLE <- rle(diff(vec))
  max(RLE$lengths[RLE$values == 1]) + 1
}
Now just apply to your column (don't forget to group_by):
df <- data.frame(
country = c("Bolivia","Bolivia","Bolivia","Bolivia","Bolivia","Bolivia",
"China","China","China","China","China","China","China","China"),
year = c("1923","1924","1925","1940","1945","1946","1960","1961","1962","1963","1964","1965","1981","1982")
)
library("dplyr")
df %>%
mutate(year = as.numeric(as.character(year))) %>% # fix your year variable
group_by(country) %>%
summarise(longest_ts = longest_ts(year))
Result:
  country longest_ts
  <fct>        <dbl>
1 Bolivia          3
2 China            6
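The same diff/rle idea can also be sketched in base R without dplyr. The helper name longest_run is an assumption made for illustration; unlike longest_ts above, it sorts the years first and returns 1 when a country has no two consecutive years, instead of -Inf with a warning.

```r
# Hypothetical helper: longest run of consecutive years in a vector
longest_run <- function(years) {
  # Accept factors or character years, drop duplicates, and sort
  years <- sort(unique(as.numeric(as.character(years))))
  if (length(years) == 0) return(0L)
  runs <- rle(diff(years))
  consecutive <- runs$lengths[runs$values == 1]
  if (length(consecutive) == 0) return(1L)  # only isolated years
  max(consecutive) + 1L                     # n differences of 1 span n + 1 years
}

df <- data.frame(
  country = c(rep("Bolivia", 6), rep("China", 8)),
  year = c(1923, 1924, 1925, 1940, 1945, 1946,
           1960, 1961, 1962, 1963, 1964, 1965, 1981, 1982)
)

# One value per country
res <- tapply(df$year, df$country, longest_run)
res
```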

Pull out only two variables from a column

I have a dataframe in R for which one column has multiple variables. The variables either start with ABC, DEF, GHI. Those variables are followed by a series of 6 numbers (ie ABC052689, ABC062895, DEF045158).
For each row, I would like to pull out one instance of ABC (the one with the largest number).
If the row has ABC052689, ABC062895, DEF045158, I would like it to pull out ABC062895 because it is greater than ABC052689.
I would then want to do the same for the variable that starts with DEF######.
I have managed to filter the data to have rows where ABC is there and either DEF or GHI is there:
library(tidyverse)
data_with_ABC <- test %>%
filter(str_detect(car,"ABC"))
data_with_ABC_and_DEF_or_GHI <- data_with_ABC %>%
filter(str_detect(car, "DEF") | str_detect(car, "GHI"))
I don't know how to pull out, say, the ABC value with the greatest number:
ABC052689, ABC062895, DEF045158 -> ABC062895
For a base R solution, we can try using lapply along with strsplit to identify the greatest ABC plate in each CSV string, in each row.
df <- data.frame(car=c("ABC052689,ABC062895,DEF045158"), id=c(1),
stringsAsFactors=FALSE)
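df$largest <- vapply(df$car, function(x) {
  cars <- strsplit(x, ",", fixed=TRUE)[[1]]
  cars <- cars[substr(cars, 1, 3) == "ABC"]      # keep only the ABC plates
  if (length(cars) == 0) return(NA_character_)   # no ABC plate in this row
  # avoid naming the result `max`, which would shadow the base function
  cars[which.max(substr(cars, 4, 9))]
}, character(1), USE.NAMES=FALSE)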
df
car id largest
1 ABC052689,ABC062895,DEF045158 1 ABC062895
Note that we don't need to convert the numeric part of the plate explicitly: which.max coerces the digit strings to numbers internally, and because the numbers are fixed-width text they would sort correctly even as text.
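To get the largest plate for more than one prefix per row, the same approach can be wrapped in a small function. This is only a sketch: largest_plate is a hypothetical helper name, and it assumes the plates are separated by commas with optional spaces and have a fixed-width numeric part, so that plain text comparison gives the right answer.

```r
# Hypothetical helper: largest plate with a given prefix in each CSV string
largest_plate <- function(x, prefix) {
  vapply(strsplit(x, ",\\s*"), function(plates) {
    plates <- plates[startsWith(plates, prefix)]
    if (length(plates) == 0) return(NA_character_)  # prefix absent in this row
    max(plates)  # fixed-width digits, so text order matches numeric order
  }, character(1))
}

df <- data.frame(car = c("ABC052689, ABC062895, DEF045158", "DEF000001"),
                 stringsAsFactors = FALSE)
df$largest_ABC <- largest_plate(df$car, "ABC")
df$largest_DEF <- largest_plate(df$car, "DEF")
df
```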
In addition to Tim's answer, if you want to handle all prefixes (ABC, DEF, ...) at once, the following code using library(tidyverse) may help:
> df <- data.frame(car=c("ABC052689", "ABC062895", "DEF045158", "DEF192345"), stringsAsFactors=FALSE)
>
> df2 = df %>%
+ mutate(state = str_sub(car, 1, 3), plate = str_sub(car, 4, 9))
>
> df2
car state plate
1 ABC052689 ABC 052689
2 ABC062895 ABC 062895
3 DEF045158 DEF 045158
4 DEF192345 DEF 192345
>
> df2 %>%
+ group_by(state) %>%
+ summarise(maxplate = max(plate)) %>%
+ mutate(full = str_c(state, maxplate))
# A tibble: 2 x 3
state maxplate full
<chr> <chr> <chr>
1 ABC 062895 ABC062895
2 DEF 192345 DEF192345

How can I create new column in data frame by aggregating rows?

I have a large (~200k rows) dataframe that is structured like this:
df <-
data.frame(c(1,1,1,1,1), c('blue','blue','blue','blue','blue'), c('m','m','m','m','m'), c(2016,2016,2016,2016,2016),c(3,4,5,6,7), c(10,20,30,40,50))
colnames(df) <- c('id', 'color', 'size', 'year', 'week','revenue')
Let's say it is currently week 7, and I want to compare the trailing 4 week average of revenue to the current week's revenue. What I would like to do is create a new column for that average when all of the identifiers match.
df_new <-
data.frame(1, 'blue', 'm', 2016,7,50, 25 )
colnames(df_new) <- c('id', 'color', 'size', 'year', 'week','revenue', 't4ave')
How can I accomplish this efficiently? Thank you for the help
Good question. for loops are fairly inefficient in R, but since you have to check the conditions of prior entries, this is the only solution I can think of (mind you, I'm only an intermediate R user):
# identifier columns that must match between a row and the 4 rows before it
# (week is deliberately left out, since it differs from row to row)
keys <- c("id", "color", "size", "year")
df$diff <- NA_real_
for (i in seq_len(nrow(df))) {
  if (i > 4) {
    prev <- (i - 4):(i - 1)
    # condition for all identifiers to match up
    same <- all(sapply(keys, function(k) all(df[[k]][prev] == df[[k]][i])))
    if (same) {
      # avg of last 4 entries' revenues
      avg <- mean(df$revenue[prev])
      # difference between this entry and the last 4's average
      df$diff[i] <- df$revenue[i] - avg
    }
  }
}
This will be slow on ~200k rows, but it should work, provided the rows are sorted by week within each group. If this only needs to run once, that should be okay. Otherwise, hopefully others will be able to advise.
A solution using dplyr and zoo. The idea is to group by the variables that identify a series, such as id, color, size, and year. After that, use rollmean to calculate the rolling mean of revenue over a window of 4. Use fill = NA (the older na.pad = TRUE is deprecated) and align = "right" so that each window ends at the current week. Finally, use lag to shift the result down one row, so that each week is paired with the average of the 4 weeks before it.
library(dplyr)
library(zoo)
df2 <- df %>%
  group_by(id, color, size, year) %>%
  mutate(t4ave = rollmean(revenue, 4, fill = NA, align = "right")) %>%
  mutate(t4ave = lag(t4ave))
df2
# A tibble: 5 x 7
# Groups:   id, color, size, year [1]
     id  color   size  year  week revenue t4ave
  <dbl> <fctr> <fctr> <dbl> <dbl>   <dbl> <dbl>
1     1   blue      m  2016     3      10    NA
2     1   blue      m  2016     4      20    NA
3     1   blue      m  2016     5      30    NA
4     1   blue      m  2016     6      40    NA
5     1   blue      m  2016     7      50    25
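If you would rather avoid the extra packages, the trailing average can also be sketched in base R with cumsum and ave. The helper name trailing_mean is an assumption for illustration, and the sketch assumes rows are already sorted by week within each group and that revenue has no missing values.

```r
# Hypothetical helper: mean of the k values before each position (NA if fewer than k)
trailing_mean <- function(x, k = 4) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n > k) {
    cs <- cumsum(c(0, x))                    # cs[j] is the sum of x[1:(j - 1)]
    idx <- (k + 1):n
    out[idx] <- (cs[idx] - cs[idx - k]) / k  # sum of x[(i - k):(i - 1)] divided by k
  }
  out
}

df <- data.frame(id = 1, color = "blue", size = "m", year = 2016,
                 week = 3:7, revenue = c(10, 20, 30, 40, 50))

# Apply the helper within each id/color/size/year group
df$t4ave <- ave(df$revenue, df$id, df$color, df$size, df$year,
                FUN = trailing_mean)
df
```

Because it relies on a single cumulative sum per group, this should scale reasonably to a few hundred thousand rows.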

Building dummy variable with many conditions (R)

My dataset looks something like this
ID  YOB   ATT94  GRADE94  ATT96  GRADE96  ATT 96 .....
1   1975  1      12       0      NA
2   1985  1      3        1      5
3   1977  0      NA       0      NA
4   ......
(with ATTXX a dummy var. denoting attendance at school in year XX, GRADEXX denoting the school grade)
I'm trying to create a dummy variable that = 1 if an individual is attending school when they are 19/20 years old, e.g. if YOB = 1988 and ATT98 = 1 then the new variable = 1, etc. I've been attempting this using mutate in dplyr, but I'm new to R (and coding in general!) so I struggle to get anything other than an error from any code I write.
Any help would be appreciated, thanks.
Edit:
So, I've just noticed that something has gone wrong, I changed your code a bit just to add another column to the long format data table. Here is what I did in the end:
df %>%
melt(id = c("ID", "DOB") %>%
tbl_df() %>%
mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))
so it looks something like e.g.
ID YOB VARIABLE VALUE dummy
1 1979 ATT94 1994 1
1 1979 ATT96 1996 1
1 1979 ATT98 0 0
2 1976 ATT94 0 0
2 1976 ATT96 1996 1
2 1976 ATT98 1998 1
i.e. whenever the ATT variables take a value other than 0 the dummy = 1, even if they're not 19/20 years old. Any ideas what could be going wrong?
On my phone so I can't check this right now but try:
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Edit: The above approach will create the column but when the condition does not hold it will be equal to NA
As @Greg Snow mentions, this approach assumes that the column was already created and is equal to zero initially. So you can do the following to get your dummy variable:
df$dummy <- rep(0, nrow(df))
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, it's useful when you ask questions to provide code that other people can run in order to be able to reproduce your data. You can learn more about that here.
Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately the reshape2 package converts wide data to long. The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.
# Load libraries
library(dplyr)
library(reshape2)
# Create a sample dataset
ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
DOB <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()
# Recode ATTx variables with the actual year
df$ATT94[df$ATT94==1] <- 1994
df$ATT96[df$ATT96==1] <- 1996
df$ATT98[df$ATT98==1] <- 1998
# Melt the data into a long format and perform requested analysis
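df %>%
  melt(id = "DOB") %>%
  as_tibble() %>%   # tbl_df() is deprecated in current dplyr
  # parentheses are required: %in% binds tighter than binary minus
  mutate(dummy = ifelse((value - DOB) %in% c(19, 20), 1, 0))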
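Operator precedence matters in this condition. In R, %in% binds more tightly than binary minus, so value - DOB %in% c(19, 20) is parsed as value - (DOB %in% c(19, 20)). Since DOB is never 19 or 20, that reduces to value itself, and ifelse treats any non-zero value as TRUE. This would explain a dummy that is 1 whenever the ATT variable is non-zero. A small sketch with made-up numbers shows the difference:

```r
# Made-up values: the second person is 24 in 1994, so should get dummy = 0
DOB   <- c(1975, 1970, 1976)
value <- c(1994, 1994, 0)

# Parsed as value - (DOB %in% c(19, 20)): non-zero whenever value is non-zero
wrong <- ifelse(value - DOB %in% c(19, 20), 1, 0)

# Parenthesised: tests whether the age (value - DOB) is 19 or 20
right <- ifelse((value - DOB) %in% c(19, 20), 1, 0)

wrong
right
```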
@Warner shows a way to create the variable (or at least the 1's; the assumption is that the column has already been set to 0). Another approach is to not explicitly create a dummy variable, but have it created for you in the model syntax (what you asked for is essentially an interaction). If running a regression, this would be something like:
fit <- lm( resp ~ I(DOB==1988):I(ATT98==1), data=df )
or
fit <- lm( resp ~ I( (DOB==1988) & (ATT98==1) ), data=df)

Reshape data frame for consecutive years

I have data about thousands of customers who visited stores in the past 3 years.
For each customer, I have:
ID
Combination of a year and the first store visited in this year.
Customer_Id | Year_*_Store
1           | 2010_A
1           | 2011_B
1           | 2012_C
2           | 2010_A
2           | 2011_B
2           | 2012_D
What I'd like to have is the following data structure, so that I can visualize the evolution of the customers' behaviour with a riverplot (aka Sankey plot).
For instance, the 2 customers who first visited store A in 2010 both first visited store B in 2011:
SOURCE | TARGET | NB_CUSTOMERS
2010_A | 2011_B | 2
2011_B | 2012_C | 1
2011_B | 2012_D | 1
I don't want links between two years which are not consecutive like 2010_A and 2012_D
How can I do that in R ?
I would do this with dplyr (faster)
df<-read.table(header=T,text="Customer_Id Year_Store
1 2010_A
1 2011_B
1 2012_C
2 2010_A
2 2011_B
2 2012_D")
library(dplyr) # for aggregation
library(riverplot) # for Sankey
# %.% and regroup() have been removed from dplyr; %>% and group_by() replace them
targets <- df %>%
  group_by(Customer_Id) %>% # group by Customer
  mutate(source = as.character(Year_Store),
         target = c(source[-1], NA)) %>% # add a lag to show the shift
  filter(!is.na(target)) %>% # filter out empty edges
  group_by(source, target) %>% # regroup by source & target
  summarise(len = n()) %>% # count customers for relationship
  ungroup() %>%
  mutate(step = as.integer(substr(target, 1, 4)) - as.integer(substr(source, 1, 4))) %>% # add a step to show how many years
  filter(step == 1) # filter out relationships for non consec years
topnodes <- c(as.character(unique(df$Year_Store))) # unique nodes
nodes <- data.frame( ID=topnodes, # IDs
x=as.numeric(substr(topnodes,1,4)), # x value for plot
col= rainbow(length(topnodes)), # color each different
labels= topnodes, # labels
stringsAsFactors= FALSE )
edges<- # create list of list
lapply(unique(targets$source),function(x){
l<-as.list(filter(targets,source==x)$len) # targets per source
names(l)<-filter(targets,source==x)$target # name of target
l
})
names(edges)<-unique(targets$source) # name top level nodes
r <- makeRiver( nodes, edges) # make the River
plot( r ) # plot it!
Note that you can't have a * in column names (see ?make.names). Here is a basic approach:
1. Split Year_Store into two separate columns Year and Store in your data frame; at the moment it contains two completely different kinds of data and you actually need to process them separately.
2. Make a NextYear column, defined as Year + 1.
3. Make a NextStore column, in which you assign the store code matching Customer_Id and for which Year is the same as this row's NextYear, assigning NA if there is no record of the customer visiting a store the next year, and throwing an error if the data do not meet the required specification (are ambiguous about which store was visited first the next year).
4. Strip out any of the rows in which NextStore is NA, and combine the NextYear and NextStore columns into a NextYear_NextStore column.
5. Summarize your data frame by the Year_Store and NextYear_NextStore columns, e.g. using ddply in the plyr package.
Some sample data:
# same example data as question
customer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2012_D"),
stringsAsFactors = FALSE)
# alternative data should throw error, customer 2 is inconsistent in 2011
badCustomer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2011_D"),
stringsAsFactors = FALSE)
And an implementation:
require(plyr)
splitYearStore <- function(df) {
df$Year <- as.numeric(substring(df$Year_Store, 1, 4))
df$Store <- as.character(substring(df$Year_Store, 6))
return(df)
}
findNextStore <- function(df, matchCust, matchYear) {
matchingStore <- with(df,
df[Customer_Id == matchCust & Year == matchYear, "Store"])
if (length(matchingStore) == 0) {
return(NA)
} else if (length(matchingStore) > 1) {
errorString <- paste("Inconsistent store results for customer",
matchCust, "in year", matchYear)
stop(errorString)
} else {
return(matchingStore)
}
}
tabulateTransitions <- function(df) {
df <- splitYearStore(df)
df$NextYear <- df$Year + 1
df$NextStore <- mapply(findNextStore, matchCust = df$Customer_Id,
matchYear = df$NextYear, MoreArgs = list(df = df))
df$NextYear_NextStore <- with(df, paste(NextYear, NextStore, sep = "_"))
df <- df[!is.na(df$NextStore),]
df <- ddply(df, .(Source = Year_Store, Target = NextYear_NextStore),
summarise, No_Customers = length(Customer_Id))
return(df)
}
Results:
> tabulateTransitions(customer.df)
Source Target No_Customers
1 2010_A 2011_B 2
2 2011_B 2012_C 1
3 2011_B 2012_D 1
> tabulateTransitions(badCustomer.df)
Error in function (df, matchCust, matchYear) :
Inconsistent store results for customer 2 in year 2011
No attempt has been made to optimise; if your data set is massive then perhaps you should investigate a data.table solution.
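As one possible direction for larger data, the row-by-row lookup can be replaced by a self-join in base R. This is only a sketch under stated assumptions: the function name tabulate_transitions_merge is hypothetical, and unlike findNextStore above it does not raise an error for inconsistent data; a customer with two stores in the same year would simply contribute to both transitions.

```r
# Hypothetical vectorised variant: join each row to the same customer's next-year row
tabulate_transitions_merge <- function(df) {
  df$Year <- as.integer(substring(df$Year_Store, 1, 4))
  # Copy of the data with Year shifted back by one, so that a row for
  # year Y + 1 lines up with the same customer's row for year Y
  nxt <- data.frame(Customer_Id = df$Customer_Id,
                    Year = df$Year - 1L,
                    Target = df$Year_Store,
                    stringsAsFactors = FALSE)
  joined <- merge(df, nxt, by = c("Customer_Id", "Year"))
  # Count customers per (source, target) pair
  out <- aggregate(Customer_Id ~ Year_Store + Target, data = joined, FUN = length)
  names(out) <- c("Source", "Target", "No_Customers")
  out <- out[order(out$Source, out$Target), ]
  rownames(out) <- NULL
  out
}

customer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
                          Year_Store = c("2010_A", "2011_B", "2012_C",
                                         "2010_A", "2011_B", "2012_D"),
                          stringsAsFactors = FALSE)
res <- tabulate_transitions_merge(customer.df)
res
```

Only transitions between consecutive years appear, because the join requires the target row's year to be exactly one more than the source row's year.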
