How to lag by an integer variable using R?

Say I have the following historical league results:
Season <- c(1,1,2,2,3,3,4,4,5,5)
Team <- c("Diverpool","Deverton","Diverpool","Deverton","Diverpool","Deverton","Diverpool","Deverton","Diverpool","Deverton")
End.Rank <- c(8,17,4,15,3,6,4,16,3,17)
PLRank <- cbind(Season,Team,End.Rank)
I want to (efficiently) create a one year lagged variable for each team based on two criteria:
lag End.Rank by Season (i.e. t-1 with Season as the time variable)
separate by team (Deverton's lagged End.Rank vs. Diverpool's lagged End.Rank)
Essentially, I'd like the output to be as follows:
l.End.Rank <- c(NA,NA,8,17,4,15,3,6,4,16)
I tried lag(), and at the moment I'm lost trying to do it in a for() loop.

You can try one of the following...
Note that I've used a data.frame instead of the matrix you get with cbind:
PLRank <- data.frame(Season, Team, End.Rank)
With "data.table":
library(data.table)
setDT(PLRank)[, l.End.Rank := shift(End.Rank), by = .(Team)][]
# Season Team End.Rank l.End.Rank
# 1: 1 Diverpool 8 NA
# 2: 1 Deverton 17 NA
# 3: 2 Diverpool 4 8
# 4: 2 Deverton 15 17
# 5: 3 Diverpool 3 4
# 6: 3 Deverton 6 15
# 7: 4 Diverpool 4 3
# 8: 4 Deverton 16 6
# 9: 5 Diverpool 3 4
# 10: 5 Deverton 17 16
Or, with "dplyr":
library(dplyr)
PLRank %>%
group_by(Team) %>%
mutate(l.End.Rank = lag(End.Rank))
# Source: local data frame [10 x 4]
# Groups: Team [2]
#
# Season Team End.Rank l.End.Rank
# (dbl) (fctr) (dbl) (dbl)
# 1 1 Diverpool 8 NA
# 2 1 Deverton 17 NA
# 3 2 Diverpool 4 8
# 4 2 Deverton 15 17
# 5 3 Diverpool 3 4
# 6 3 Deverton 6 15
# 7 4 Diverpool 4 3
# 8 4 Deverton 16 6
# 9 5 Diverpool 3 4
# 10 5 Deverton 17 16
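For comparison, the same per-team lag can be sketched in base R with ave(). This is only an illustration: lag1 is a helper name made up here, and, like shift() and lag() above, it lags by row position, so it assumes the rows are already ordered by Season within each team.

```r
# Rebuild the example data as a data.frame
Season   <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
Team     <- rep(c("Diverpool", "Deverton"), 5)
End.Rank <- c(8, 17, 4, 15, 3, 6, 4, 16, 3, 17)
PLRank   <- data.frame(Season, Team, End.Rank)

# lag1 (hypothetical helper): shift a vector down by one position, padding with NA
lag1 <- function(x) c(NA, head(x, -1))

# ave() applies lag1 within each Team and puts the results back in the original row order
PLRank$l.End.Rank <- ave(PLRank$End.Rank, PLRank$Team, FUN = lag1)
PLRank
```

The resulting l.End.Rank column is NA, NA, 8, 17, 4, 15, 3, 6, 4, 16, which is the output asked for in the question.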
Update
I had honestly entirely misread that you wanted this grouped by Season.
If you are lagging by season, perhaps you should consider widening the data, so that each season has just one row. Then a lag by season would be easy.
Examples:
Here, we use dcast from "data.table" to spread the values of "End.Rank" out by "Team". Then, we lag just the newly created columns.
library(data.table)
teams <- as.character(unique(PLRank$Team))
dcast(as.data.table(PLRank), Season ~ Team, value.var = "End.Rank")[
, (teams) := lapply(.SD, shift), .SDcols = teams][]
# Season Deverton Diverpool
# 1: 1 NA NA
# 2: 2 17 8
# 3: 3 15 4
# 4: 4 6 3
# 5: 5 16 4
Or, if you wanted both the team names and the values to be in a wide form, you could try something like:
dcast(as.data.table(PLRank)[, ind := sequence(.N), by = Season],
Season ~ ind, value.var = c("Team", "End.Rank"))[
, c("End.Rank_1", "End.Rank_2") := lapply(.SD, shift),
.SDcols = c("End.Rank_1", "End.Rank_2")][]
# Season Team_1 Team_2 End.Rank_1 End.Rank_2
# 1: 1 Diverpool Deverton NA NA
# 2: 2 Diverpool Deverton 8 17
# 3: 3 Diverpool Deverton 4 15
# 4: 4 Diverpool Deverton 3 6
# 5: 5 Diverpool Deverton 4 16
The approach in "dplyr" is similar. Since you're going to a wide form, you also need "tidyr" to be loaded.
library(dplyr)
library(tidyr)
PLRank %>%
spread(Team, End.Rank) %>%
mutate_each(funs(lag), -Season)
# Season Deverton Diverpool
# 1 1 NA NA
# 2 2 17 8
# 3 3 15 4
# 4 4 6 3
# 5 5 16 4
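All of the lags above are positional, so they give the previous row rather than the previous season if a team is missing a year. If you really need "Season minus one", one possible sketch in base R is to look the value up by a (Team, Season) key. The data below is a made-up variation of the example in which Deverton has no row for season 3, and key_now and key_prev are just illustrative names.

```r
# Illustrative data: Deverton has no row for season 3
PLRank <- data.frame(
  Season   = c(1, 1, 2, 2, 3, 4, 4),
  Team     = c("Diverpool", "Deverton", "Diverpool", "Deverton",
               "Diverpool", "Diverpool", "Deverton"),
  End.Rank = c(8, 17, 4, 15, 3, 4, 16),
  stringsAsFactors = FALSE
)

# Key for each row, and the key of the same team one season earlier
key_now  <- paste(PLRank$Team, PLRank$Season)
key_prev <- paste(PLRank$Team, PLRank$Season - 1)

# match() returns NA when the previous season does not exist for that team
PLRank$l.End.Rank <- PLRank$End.Rank[match(key_prev, key_now)]
PLRank
```

Here Deverton's season 4 row gets NA rather than the season 2 value, because there is no season 3 row to lag from.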

Related

R (data.table): call different columns in a loop

I am trying to call different columns of a data.table inside a loop, to get unique values of each column.
Consider the simple data.table below.
> df <- data.table(var_a = rep(1:10, 2),
+ var_b = 1:20)
> df
var_a var_b
1: 1 1
2: 2 2
3: 3 3
4: 4 4
5: 5 5
6: 6 6
7: 7 7
8: 8 8
9: 9 9
10: 10 10
11: 1 11
12: 2 12
13: 3 13
14: 4 14
15: 5 15
16: 6 16
17: 7 17
18: 8 18
19: 9 19
20: 10 20
My code works when I call for a specific column outside a loop,
> unique(df$var_a)
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, var_a])
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, "var_a"])
var_a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
but not when I do so within a loop that goes through different columns of the data.table.
> for(v in c("var_a","var_b")){
+ print(v)
+ df$v
+ unique(df[, .v])
+ unique(df[, "v"])
+ }
[1] "var_a"
Error in `[.data.table`(df, , .v) :
j (the 2nd argument inside [...]) is a single symbol but column name '.v' is not found. Perhaps you intended DT[, ...v]. This difference to data.frame is deliberate and explained in FAQ 1.1.
>
> unique(df[, ..var_a])
Error in `[.data.table`(df, , ..var_a) :
Variable 'var_a' is not found in calling scope. Looking in calling scope because you used the .. prefix.
For the first problem, when you're referencing a column name indirectly, you can either use double-dot ..v syntax, or add with=FALSE in the data.table::[ construct:
for (v in c("var_a", "var_b")) {
print(v)
print(df$v)
### either one of these will work:
print(unique(df[, ..v]))
# print(unique(df[, v, with = FALSE]))
}
# [1] "var_a"
# NULL
# var_a
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# [1] "var_b"
# NULL
# var_b
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# 11: 11
# 12: 12
# 13: 13
# 14: 14
# 15: 15
# 16: 16
# 17: 17
# 18: 18
# 19: 19
# 20: 20
# var_b
But this just prints it without changing anything. If all you want to do is look at unique values within each column (and not change the underlying frame), then I'd likely go with
lapply(df[,.(var_a, var_b)], unique)
# $var_a
# [1] 1 2 3 4 5 6 7 8 9 10
# $var_b
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
which shows the name and unique values. The use of lapply (whether on df as a whole or a subset of columns) is also preferable to another recommendation to use apply(df, 2, unique), though in this case it returns the same results.
Use .subset2 to refer to a column by its name:
for(v in c("var_a","var_b")) {
print(unique(.subset2(df, v)))
}
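The ordinary [[ operator does the same job as .subset2 here (.subset2 is essentially [[ without method dispatch), and unlike df$v it evaluates v instead of looking for a column literally named "v". A minimal sketch that also stores the results instead of only printing them; the uniques list is just a name chosen for this example:

```r
library(data.table)

df <- data.table(var_a = rep(1:10, 2),
                 var_b = 1:20)

# Collect the unique values of each column in a named list
uniques <- list()
for (v in c("var_a", "var_b")) {
  # df[[v]] extracts the column whose name is stored in v, as a plain vector
  uniques[[v]] <- unique(df[[v]])
}

str(uniques)
```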
Following the hint in the first error message, this would be the correct way to select the column in a loop:
for(v in c("var_a","var_b")){
print(unique(df[, ..v]))
}
# won't print all the lines
As for the second error, you have not declared a variable called "var_a"; it looks like you want to select by name.
# works as you have shown
unique(df[, "var_a"])
# works once the variable is declared
var_a <- "var_a"
unique(df[, ..var_a])
You may also be interested in the env parameter of data.table's [ (in the development version at the time of writing and, as far as I know, released in data.table 1.15.0). Here is an illustration; you could use this in a loop too.
v="var_a"
df[, v, env=list(v=v)]
Output:
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

copy() when using data.table inside functions

On my projects I usually do the data prepping with a few functions, so my code usually looks like this:
readAndClean("directory") %>%
processing() %>%
readyForModelling()
Where I'm passing a data.table object from one function to another.
I've gotten into the habit of always starting these functions with:
processing <- function(data_init){
data <- copy(data_init)
}
to avoid modifying the data.table in the global environment, as the following example does:
test <- data.table(cars[1:10,])
processing <- function(data_init){
data_init[, id := 1:.N]
return("done")
}
test
# speed dist
# 1: 4 2
# 2: 4 10
# 3: 7 4
# 4: 7 22
# 5: 8 16
# 6: 9 10
# 7: 10 18
# 8: 10 26
# 9: 10 34
# 10: 11 17
processing(test)
# [1] "done"
test
# speed dist id
# 1: 4 2 1
# 2: 4 10 2
# 3: 7 4 3
# 4: 7 22 4
# 5: 8 16 5
# 6: 9 10 6
# 7: 10 18 7
# 8: 10 26 8
# 9: 10 34 9
# 10: 11 17 10
But this always seems a little ugly to me.
Is it the correct way of handling data.tables inside functions?
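For reference, here is a minimal sketch of the copy-first pattern described above, filled in so that the function actually returns the modified table. It is the same example as before with copy() added; result is just a name used for illustration. Whether the extra copy is worth its memory cost depends on the size of the data.

```r
library(data.table)

test <- data.table(cars[1:10, ])

processing <- function(data_init) {
  # Work on a deep copy so that := does not modify the caller's table by reference
  data <- copy(data_init)
  data[, id := seq_len(.N)]
  data[]   # return the modified copy
}

result <- processing(test)

names(test)     # the original still has only speed and dist
names(result)   # the returned copy has the extra id column
```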

Create multiple sums

Ciao,
Here is a reproducible example.
df <- data.frame("STUDENT"=c(1,2,3,4,5),
"TEST1A"=c(NA,5,5,6,7),
"TEST2A"=c(NA,8,4,6,9),
"TEST3A"=c(NA,10,5,4,6),
"TEST1B"=c(5,6,7,4,1),
"TEST2B"=c(10,10,9,3,1),
"TEST3B"=c(0,5,6,9,NA),
"TEST1TOTAL"=c(NA,23,14,16,22),
"TEST2TOTAL"=c(10,16,15,12,NA))
I have columns STUDENT through TEST3B and want to create TEST1TOTAL and TEST2TOTAL. TEST1TOTAL=TEST1A+TEST2A+TEST3A, and so on for TEST2TOTAL. If any score in TEST1A, TEST2A or TEST3A is missing, then TEST1TOTAL is NA.
Here is my attempt, but is there a solution with fewer lines of code? As it stands I would need to write this line out many times, because the tests run from A through O.
TEST1TOTAL=rowSums(df[,c('TEST1A', 'TEST2A', 'TEST3A')], na.rm=TRUE)
Using just base R functions (df1 here is assumed to be the input without the two TOTAL columns, e.g. df1 <- df[, 1:7]):
output <- data.frame(df1, do.call(cbind, lapply(c("A$", "B$"), function(x) rowSums(df1[, grep(x, names(df1))]))))
Customizing colnames:
> colnames(output)[(ncol(output)-1):ncol(output)] <- c("TEST1TOTAL", "TEST2TOTAL")
> output
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL
1 1 NA NA NA 5 10 0 NA 15
2 2 5 8 10 6 10 5 23 21
3 3 5 4 5 7 9 6 14 22
4 4 6 6 4 4 3 9 16 16
5 5 7 9 6 1 1 NA 22 NA
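Since the question mentions tests running from A through O, the same idea can be written as a loop over the suffix letters so that no line has to be repeated. This is only a sketch under the assumption that every score column is named TEST<number><letter>; the suffixes vector and the TEST<i>TOTAL naming are taken from the example, and you would extend suffixes to LETTERS[1:15] for A through O.

```r
df <- data.frame(STUDENT = c(1, 2, 3, 4, 5),
                 TEST1A = c(NA, 5, 5, 6, 7),
                 TEST2A = c(NA, 8, 4, 6, 9),
                 TEST3A = c(NA, 10, 5, 4, 6),
                 TEST1B = c(5, 6, 7, 4, 1),
                 TEST2B = c(10, 10, 9, 3, 1),
                 TEST3B = c(0, 5, 6, 9, NA))

suffixes <- c("A", "B")   # assumption: extend to LETTERS[1:15] for A through O

for (i in seq_along(suffixes)) {
  # Columns such as TEST1A, TEST2A, TEST3A for suffix "A"
  cols <- grep(paste0("^TEST[0-9]+", suffixes[i], "$"), names(df), value = TRUE)
  # rowSums() without na.rm returns NA as soon as one score is missing
  df[[paste0("TEST", i, "TOTAL")]] <- unname(rowSums(df[, cols, drop = FALSE]))
}

df
```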
Try:
library(dplyr)
df %>%
mutate(TEST1TOTAL = TEST1A+TEST2A+TEST3A,
TEST2TOTAL = TEST1B+TEST2B+TEST3B)
or
df %>%
mutate(TEST1TOTAL = rowSums(select(df, ends_with("A"))),
TEST2TOTAL = rowSums(select(df, ends_with("B"))))
I think for what you want, Jilber Urbina's solution is the way to go. For completeness' sake (and because I learned something figuring it out), here's a tidyverse way to get the score totals by test number for any number of tests.
The advantage is you don't need to specify the identifiers for the tests (beyond that they're numbered or have a trailing letter) and the same code will work for any number of tests.
library(tidyverse)
df_totals <- df %>%
gather(test, score, -STUDENT) %>% # Convert from wide to long format
mutate(test_num = paste0('TEST', gsub('[^0-9]', '', test),
'TOTAL'), # Extract test_number from variable
test_let = gsub('TEST[0-9]*', '', test)) %>% # Extract test_letter (optional)
group_by(STUDENT, test_num) %>% # group by student + test
summarize(score_tot = sum(score)) %>% # Sum score by student/test
spread(test_num, score_tot) # Spread back to wide format
df_totals
# A tibble: 5 x 4
# Groups: STUDENT [5]
STUDENT TEST1TOTAL TEST2TOTAL TEST3TOTAL
<dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA
2 2 11 18 15
3 3 12 13 11
4 4 10 9 13
5 5 8 10 NA
If you want the individual scores too, just join the totals together with the original:
left_join(df, df_totals, by = 'STUDENT')
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL TEST3TOTAL
1 1 NA NA NA 5 10 0 NA NA NA
2 2 5 8 10 6 10 5 11 18 15
3 3 5 4 5 7 9 6 12 13 11
4 4 6 6 4 4 3 9 10 9 13
5 5 7 9 6 1 1 NA 8 10 NA

How to replace NAs of a variable with values from another dataframe

I hope this one isn't stupid.
I have two dataframes with Variables ID and gender/sex. In df1, there are NAs. In df2, the variable is complete. I want to complete the column in df1 with the values from df2.
(In df1 the variable is called "gender". In df2 it is called "sex".)
Here is what I tried so far:
#example-data
ID<-seq(1,30,by=1)
df1<-as.data.frame(ID)
df2<-df1
df1$gender<-c(NA,"2","1",NA,"2","2","2","2","2","2",NA,"2","1","1",NA,"2","2","2","2","2","1","2","2",NA,"2","2","2","2","2",NA)
df2$sex<-c("2","2","1","2","2","2","2","2","2","2","2","2","1","1","2","2","2","2","2","2","1","2","2","2","2","2","2","2","2","2")
#Approach 1:
NAs.a <- is.na(df1$gender)
df1$gender[NAs.a] <- df2[match(df1$ID[NAs.a], df2$ID),]$sex
#Approach 2 (I like dplyr a lot, perhaps there's a way to use it):
library("dplyr")
temp<-df2 %>% select(ID,gender)
#EDIT:
#df<-left_join(df1$gender,df2$gender, by="ID")
df<-left_join(df1,df2, by="ID")
Thank you very much.
Here's a quick solution using data.table's binary join. It updates only gender from sex and leaves all the other columns untouched:
library(data.table)
setkey(setDT(df1), ID)
df1[df2, gender := i.sex][]
# ID gender
# 1: 1 2
# 2: 2 2
# 3: 3 1
# 4: 4 2
# 5: 5 2
# 6: 6 2
# 7: 7 2
# 8: 8 2
# 9: 9 2
# 10: 10 2
# 11: 11 2
# 12: 12 2
# 13: 13 1
# 14: 14 1
# 15: 15 2
# 16: 16 2
# 17: 17 2
# 18: 18 2
# 19: 19 2
# 20: 20 2
# 21: 21 1
# 22: 22 2
# 23: 23 2
# 24: 24 2
# 25: 25 2
# 26: 26 2
# 27: 27 2
# 28: 28 2
# 29: 29 2
# 30: 30 2
This would probably be the simplest with base R.
idx <- is.na(df1$gender)
df1$gender[idx] = df2$sex[idx]
You could do
df1 %>% select(ID) %>% left_join(df2, by = "ID")
# ID sex
#1 1 2
#2 2 2
#3 3 1
#4 4 2
#5 5 2
#6 6 2
#.. ..
This assumes - as in the example - that all ID's from df1 are also present in df2 and have a sex/gender information there.
If you have other columns in your data you could also try this instead:
df1 %>% select(-gender) %>% left_join(df2[c("ID", "sex")], by = "ID")
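If you would rather keep the gender values df1 already has and fill only the NAs, one option is to join and then coalesce. A minimal sketch on a made-up five-row version of the example data (filled is just an illustrative name); coalesce() takes the first non-missing value of gender and sex for each row:

```r
library(dplyr)

# Small illustrative version of the example data
df1 <- data.frame(ID = 1:5, gender = c(NA, "2", "1", NA, "2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = 1:5, sex = c("2", "2", "1", "2", "2"),
                  stringsAsFactors = FALSE)

filled <- df1 %>%
  left_join(df2, by = "ID") %>%            # bring sex in next to gender, matched on ID
  mutate(gender = coalesce(gender, sex)) %>% # use sex only where gender is NA
  select(-sex)                              # drop the helper column again

filled
```

Because the rows are matched on ID, this does not depend on df1 and df2 being in the same order.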

Counting combinations without destroying type

I wonder whether someone has an idea for how to count combinations like the following in a better way than I've thought of.
> library(lubridate)
> df <- data.frame(x=sample(now()+hours(1:3), 100, T), y=sample(1:4, 100, T))
> with(df, as.data.frame(table(x, y)))
x y Freq
1 2012-06-15 00:10:18 1 5
2 2012-06-15 01:10:18 1 9
3 2012-06-15 02:10:18 1 8
4 2012-06-15 00:10:18 2 9
5 2012-06-15 01:10:18 2 10
6 2012-06-15 02:10:18 2 12
7 2012-06-15 00:10:18 3 7
8 2012-06-15 01:10:18 3 9
9 2012-06-15 02:10:18 3 6
10 2012-06-15 00:10:18 4 5
11 2012-06-15 01:10:18 4 14
12 2012-06-15 02:10:18 4 6
I like that format, but unfortunately when we ran x and y through table(), they got converted to factors. In the final output they can exist quite nicely as their original type, but getting there seems problematic. Currently I just manually fix all the types afterward, which is really messy because I have to re-set the timezone, and look up the percent-codes for the default date format, etc. etc.
It seems like an efficient solution would involve hashing the objects, or otherwise mapping integers to the unique values of x and y so we can use tabulate(), then mapping back.
Ideas?
Here's a data.table version that preserves the column classes:
library(data.table)
dt <- data.table(df, key=c("x", "y"))
dt[, .N, by=key(dt)]
# x y N
# 1: 2012-06-14 18:10:22 1 8
# 2: 2012-06-14 18:10:22 2 10
# 3: 2012-06-14 18:10:22 3 8
# 4: 2012-06-14 18:10:22 4 8
# 5: 2012-06-14 19:10:22 1 6
# 6: 2012-06-14 19:10:22 2 8
# 7: 2012-06-14 19:10:22 3 6
# 8: 2012-06-14 19:10:22 4 6
# 9: 2012-06-14 20:10:22 1 15
# 10: 2012-06-14 20:10:22 2 5
# 11: 2012-06-14 20:10:22 3 12
# 12: 2012-06-14 20:10:22 4 8
str(dt[, .N, by=key(dt)])
# Classes ‘data.table’ and 'data.frame': 12 obs. of 3 variables:
# $ x: POSIXct, format: "2012-06-14 18:10:22" "2012-06-14 18:10:22" ...
# $ y: int 1 2 3 4 1 2 3 4 1 2 ...
# $ N: int 8 10 8 8 6 8 6 6 15 5 ...
Edit in response to follow-up question
To count the number of appearances of all possible combinations of the observed factor levels (including those which don't appear in the data), you can do something like the following:
dt<-dt[1:30,] # Make subset of dt in which some factor combinations don't appear
ii <- do.call("CJ", lapply(dt, unique)) # CJ() is similar to expand.grid()
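dt[ii, .N, on = c("x", "y"), by = .EACHI] # by = .EACHI (data.table >= 1.9.4) counts per row of ii; without it .N is a single total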
# x y N
# 1: 2012-06-14 22:53:05 1 8
# 2: 2012-06-14 22:53:05 2 7
# 3: 2012-06-14 22:53:05 3 9
# 4: 2012-06-14 22:53:05 4 5
# 5: 2012-06-14 23:53:05 1 1
# 6: 2012-06-14 23:53:05 2 0
# 7: 2012-06-14 23:53:05 3 0
# 8: 2012-06-14 23:53:05 4 0
You can use ddply
library(plyr)
ddply(df, .(x, y), summarize, Freq = length(y))
If you want it arranged by y then x
ddply(df, .(y, x), summarize, Freq = length(y))
or if column ordering is important as well as row ordering
arrange(ddply(df, .(x, y), summarize, Freq = length(y)), y)
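Another option that keeps the column types is count() from dplyr. A minimal sketch, using a small fixed data set instead of the random sample in the question so that the result is reproducible; t0 and counts are names made up for this example:

```r
library(dplyr)

# Illustrative data: three timestamps one hour apart, with fixed values
t0 <- as.POSIXct("2012-06-15 00:00:00", tz = "UTC")
df <- data.frame(x = t0 + 3600 * c(1, 1, 2, 2, 2, 3),
                 y = c(1L, 1L, 1L, 2L, 2L, 4L))

# count() groups by x and y and stores the group sizes in a column named Freq
counts <- count(df, x, y, name = "Freq")

counts
str(counts)   # x is still POSIXct and y is still integer
```

Like the data.table and ddply versions, this lists only the combinations that actually occur in the data.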
