How to use lists in R?

I am trying to use R lists as dictionaries for computing winning percentages for basketball teams. Basically, for each win, I'd like to increment the winning team's entry in one list, and for each game played, I'd like to increment both teams' entries in another. Somehow, the answers I'm getting seem reasonable but are incorrect, and I can't figure out why the program logically doesn't give the expected outputs. Any suggestions or tips would be appreciated. The code I'm using is below:
games <- read.csv(game_pathname, header = FALSE)
names(games) <- c("GameDate", "DateCount", "HomeID", "AwayID", "HomePts",
                  "AwayPts", "HomeAbbr", "AwayAbbr", "HomeName", "AwayName")
wins <- list()
total <- list()
for (team in unique(games$HomeName)) {
  wins[team] <- 0
  total[team] <- 0
}
for (i in 1:nrow(games)) {
  if (games$HomePts[i] > games$AwayPts[i]) {
    wins[games$HomeName[i]] <- wins[[games$HomeName[i]]] + 1
  } else {
    wins[games$AwayName[i]] <- wins[[games$AwayName[i]]] + 1
  }
  total[games$HomeName[i]] <- total[[games$HomeName[i]]] + 1
  total[games$AwayName[i]] <- total[[games$AwayName[i]]] + 1
}
for (team in unique(games$HomeName)) {
  print(paste(team, wins[[team]] / total[[team]]))
}

I looked at the code and built a toy example, and there is no problem in the algorithm. In the simulation below I used three teams: one is a complete loser, another breaks even, and the third is a champion.
games <- data.frame(HomeName = c("a", "b", "c"),
                    HomePts  = c(1, 2, 3),
                    AwayPts  = c(3, 1, 2),
                    AwayName = c("c", "a", "b"))
wins <- list()
total <- list()
for (team in unique(games$HomeName)) {
  wins[team] <- 0
  total[team] <- 0
}
for (i in 1:nrow(games)) {
  if (games$HomePts[i] > games$AwayPts[i]) {
    wins[games$HomeName[i]] <- wins[[games$HomeName[i]]] + 1
  } else {
    wins[games$AwayName[i]] <- wins[[games$AwayName[i]]] + 1
  }
  total[games$HomeName[i]] <- total[[games$HomeName[i]]] + 1
  total[games$AwayName[i]] <- total[[games$AwayName[i]]] + 1
}
for (team in unique(games$HomeName)) {
  print(paste(team, wins[[team]] / total[[team]]))
}
games
wins
total
The output of your algorithm is below:
[1] "a 0"
[1] "b 0.5"
[1] "c 1"
> games
  HomeName HomePts AwayPts AwayName
1        a       1       3        c
2        b       2       1        a
3        c       3       2        b
> wins
$`a`
[1] 0
$b
[1] 1
$c
[1] 2
> total
$`a`
[1] 2
$b
[1] 2
$c
[1] 2
However, it is not very much "R-style": using for loops and direct manipulation of list indices is considered not quite comme il faut :)
You can get the same results with e.g. the dplyr package, which is part of the tidyverse. The code below compares the scores of each game, splits the result into home and away data frames, binds them row-wise, and finally groups by team name and calculates the mean win rate. Please see below:
library(dplyr)
df <- games %>% mutate(hwins = (HomePts > AwayPts), awins = !hwins)
df_home <- df %>% select(HomeName, hwins) %>% rename(name = HomeName, wins = hwins)
df_away <- df %>% select(AwayName, awins) %>% rename(name = AwayName, wins = awins)
df <- bind_rows(df_home, df_away) %>% group_by(name) %>% summarise(mean_wins = mean(wins))
df
Output:
# A tibble: 3 x 2
  name  mean_wins
  <fct>     <dbl>
1 a           0
2 b           0.5
3 c           1
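One thing worth checking in the real data (an assumption on my part, since only the toy example is visible): before R 4.0, read.csv() turns the name columns into factors, and indexing a list with a factor uses the factor's underlying integer codes, not its labels. The toy example works only because its levels happen to match the insertion order; with real team names the alphabetical level order and the order of first appearance usually differ, which yields exactly this kind of plausible-but-wrong result:
l <- list(Bulls = 10, Hawks = 99)
f <- factor("Hawks", levels = c("Hawks", "Bulls"))
l[[f]]               # 10 -- the factor's code (1) selects the Bulls entry
l[[as.character(f)]] # 99 -- converting to character selects by name, as intended
Indexing with as.character(...), or reading the file with stringsAsFactors = FALSE, avoids the problem.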

Related

R - Lag not working within function [objective: match similar adjacent rows]

I recently tried to match adjacent identical rows in a data frame based on two variables (Condition1 and Outcome1 below). I have seen people do this across all rows, but not with adjacent rows only, which is why I developed the following three-step work-around (which I hope did not overthink things):
- I lagged the variables on which I wanted the matching to be done.
- I compared the variables with their lagged versions.
- I deleted all rows in which both were identical (and removed the remaining helper columns).
Case <- c("Case 1", "Case 2", "Case 3", "Case 4", "Case 5")
Condition1 <- c(0, 1, 0, 0, 1)
Outcome1 <- c(0, 0, 0, 0, 1)
mwa.df <- data.frame(Case, Condition1, Outcome1)
new.df <- mwa.df
Condition_lag <- c(new.df$Condition1[-1],0)
Outcome_lag <- c(new.df$Outcome1[-1],0)
new.df <- cbind(new.df, Condition_lag, Outcome_lag)
new.df$Comp <- 0
new.df$Comp[new.df$Outcome1 == new.df$Outcome_lag & new.df$Condition1 == new.df$Condition_lag] <- 1
new.df <- subset(new.df, Comp == 0)
new.df <- subset(new.df, select = -c(Condition_lag, Outcome_lag, Comp))
This worked just fine. But when I tried to turn this into a function, because I have to do this operation on a large number of data frames, I ran into the problem that the lag did not work (i.e. the condition_lag <- c(new.df$condition[-1],0) and outcome_lag <- c(new.df$outcome[-1],0) operations were not carried out). The function code was:
FLC.Dframe <- function(old.df, condition, outcome){
  new.df <- old.df
  condition_lag <- c(new.df$condition[-1], 0)
  outcome_lag <- c(new.df$outcome[-1], 0)
  new.df <- cbind(new.df, condition_lag, outcome_lag)
  new.df$comp <- 0
  new.df$comp[new.df$outcome == new.df$outcome_lag & new.df$condition == new.df$condition_lag] <- 1
  new.df <- subset(new.df, comp == 0)
  new.df <- subset(new.df, select = -c(condition_lag, outcome_lag, comp))
  return(new.df)
}
As for using the function, I wrote new.df <- FLC.Dframe(mwa.df, Condition1, Outcome1).
Could someone help me with this? Many thanks in advance.
Just generate run-length ids and remove the duplicates.
with(mwa.df, mwa.df[!duplicated(data.table::rleid(Condition1, Outcome1)), ])
Output
    Case Condition1 Outcome1
1 Case 1          0        0
2 Case 2          1        0
3 Case 3          0        0
5 Case 5          1        1
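To see what the run-length ids look like for the sample data (just an illustration):
with(mwa.df, data.table::rleid(Condition1, Outcome1))
# [1] 1 2 3 3 4
Rows 3 and 4 share an id, so duplicated() flags row 4 and it is dropped.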
If you want a function, then
FLC.Dframe <- function(df, cols) df[!duplicated(data.table::rleidv(df[, cols])), ]
Call this function like this
> FLC.Dframe(mwa.df, c("Condition1", "Outcome1"))
    Case Condition1 Outcome1
1 Case 1          0        0
2 Case 2          1        0
3 Case 3          0        0
5 Case 5          1        1
The main problem with your function is the incorrect use of $. This operator treats its right-hand side literally: in new.df$condition, $ looks for a column in new.df named "condition", not for the column named by the value of condition ("Condition1"). If you rewrite your function as follows, it should work.
FLC.Dframe <- function(old.df, condition, outcome){
  new.df <- old.df
  condition_lag <- c(new.df[[condition]][-1], 0)
  outcome_lag <- c(new.df[[outcome]][-1], 0)
  new.df <- cbind(new.df, condition_lag, outcome_lag)
  new.df$comp <- 0
  new.df$comp[new.df[[outcome]] == new.df[["outcome_lag"]] & new.df[[condition]] == new.df[["condition_lag"]]] <- 1
  new.df <- subset(new.df, comp == 0)
  new.df <- subset(new.df, select = -c(condition_lag, outcome_lag, comp))
  return(new.df)
}
You also need to call it like this (note that the column names are now passed as character strings):
> FLC.Dframe(mwa.df, "Condition1", "Outcome1")
Case Condition1 Outcome1
1 Case 1 0 0
2 Case 2 1 0
4 Case 4 0 0
5 Case 5 1 1
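The difference is easy to see in isolation; a minimal sketch:
df <- data.frame(Condition1 = 1:3)
nm <- "Condition1"
df$nm    # NULL  -- `$` looks for a column literally named "nm"
df[[nm]] # 1 2 3 -- `[[` evaluates nm and uses its value as the column name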

Why is calculating for every row faster than checking whether the calculation has already been done?

I was working on a data frame with 200,000+ rows and many columns. Let's take a dummy sample version, df:
set.seed(1)
timeslot <- as.integer(abs(runif(10000, min = 1, max = 1000)))
ID <- LETTERS[abs(as.integer(rnorm(10000, 2)**3)) %% 9 + 1]
variable1 <- as.integer(rnorm(10000, 2)**3)
df <- data.frame(timeslot, ID, variable1)
df <- df[order(df$timeslot, df$ID), ]
I also calculate a column to check if the ID of that row is also present somewhere in the previous timeslot, called min1:
df$min1 <- sapply(seq(nrow(df)), function(x) {
  if (df[x, "timeslot"] == 1) {0} else {
    max(df[x, "ID"] %in% df[df$timeslot == df[x, "timeslot"] - 1, "ID"])
  }
})
This all goes quite well and delivers the following head(df)/tail(df):
     timeslot ID variable1 min1
4919        1  A        15    0
2329        1  C        48    0
7359        1  C         1    0
1978        1  E         6    0
2883        1  F         7    0
7448        1  F        21    0
-------------------------------
8462      998  F         1    1
1724      998  H         2    0
989       999  A         7    1
2589      999  D        12    1
3473      999  D         0    1
780       999  I         5    0
I want to perform some calculations on variable1, grouped by unique timeslot+ID. One of these calculations is funfac:
total <- 0
funfac <- function(x, y) {
  for (i in x) {
    i <- i ** y
    total <- total + i
  }
  return((abs(total / length(x)))**(1 / y))
}
However, now comes the difficult part: for each ID in a specific timeslot I want to do the calculation over all rows with the same ID in that timeslot and the previous timeslot. So if timeslot '2' has 3x D, and timeslot '1' has 2x D, the calculation should be done over all 5 Ds. My column min1 helps identify whether that ID is present in the previous timeslot. If not, the calculation should return NA.
First I did this with the following code:
lp5 <- c()
for (j in 1:nrow(df)) {
  if (df[j, "min1"] == 0) {
    lp5 <- c(lp5, NA)
  } else {
    total <- 0
    x <- df[which((df[, "timeslot"] == df[j, "timeslot"] | df[, "timeslot"] == (df[j, "timeslot"] - 1)) & df[, "ID"] == df[j, "ID"]), "variable1"]
    for (i in x) {
      i <- i ** 5
      total <- total + i
    }
    lp5 <- c(lp5, (abs(total / length(x)))**(1 / 5))
  }
}
tempdf <- data.frame(df[, "timeslot"], df[, "ID"], lp5)
lp5 <- tempdf[!duplicated(tempdf[, 1:2]), ][, 3]
Figuring that I was performing many calculations twice, I thought: why not check whether the calculation has already been done? I did this by appending each unique timeslot+ID, together with the calculated value, to a data frame, and checking each time whether that combination was already in it.
lp5DF <- data.frame("timeslot" = numeric(0), "ID" = character(0), "lp5" = numeric(0))
for (j in 1:nrow(df)) {
  if (duplicated(rbind(lp5DF[, 1:2], data.frame(timeslot = df[j, "timeslot"], ID = df[j, "ID"])))[nrow(lp5DF) + 1]) {
    next
  } else if (df[j, "min1"] == 0) {
    lp5DF <- rbind(lp5DF, data.frame("timeslot" = df[j, "timeslot"], "ID" = df[j, "ID"], "lp5" = NA))
  } else {
    total <- 0
    x <- df[which((df[, "timeslot"] == df[j, "timeslot"] | df[, "timeslot"] == (df[j, "timeslot"] - 1)) & df[, "ID"] == df[j, "ID"]), "variable1"]
    for (i in x) {
      i <- i ** 5
      total <- total + i
    }
    lp5DF <- rbind(lp5DF, data.frame("timeslot" = df[j, "timeslot"], "ID" = df[j, "ID"], "lp5" = (abs(total / length(x)))**(1 / 5)))
  }
}
The output (head/tail) of lp5DF will be:
     timeslot ID      lp5
1           1  A       NA
2           1  B       NA
3           1  C       NA
4           1  D       NA
5           1  E       NA
6           1  F       NA
-------------------------
7738      999  B 14.83423
7739      999  C 14.80149
7740      999  E       NA
7741      999  F 49.48538
7742      999  G 23.05222
7743      999  H       NA
and lp5DF[,3] == lp5 confirms the two results agree.
However, this checking approach turned out to be a lot slower (6.5x in my case). Since I have to run this kind of calculation multiple times over a lot of rows (the data frame may be expanded later in the project), both of my approaches are too slow. Why is the second one so slow, and is there a way to speed this up? Maybe something with lapply or the dplyr package?
There is just a lot to optimize. The caching version is slower because growing lp5DF with rbind() copies the whole accumulated data frame on every iteration, and the duplicated(rbind(...)) check rebuilds and rescans the lookup table once per row, so the bookkeeping costs more than the recomputation it saves. Try learning data manipulation packages like dplyr and data.table.
min1 can be calculated using the technique from here
library(dplyr)
dfs <- split(df$ID, df$timeslot)
df$min1 <- unlist(mapply(`%in%`, dfs, lag(dfs)))
lp5 is a little tricky, but manageable:
df1 <- df %>%
  group_by(timeslot, ID) %>%
  summarise(min1 = all(min1), s = sum(variable1^5), n = n()) %>%
  group_by(ID) %>%
  mutate(s1 = s + lag(s), n1 = n + lag(n), lp5 = ifelse(min1, abs((s1/n1)^(1/5)), NA))
lp5 <- df1$lp5
The data.table equivalent is:
library(data.table)
setDT(df)
dt1 <- df[, .(min1 = all(min1), s = sum(variable1^5), n = .N), by=.(timeslot, ID)]
dt1[, `:=`(s1 = s + shift(s), n1 = n + shift(n)), by=ID]
dt1[min1==TRUE, lp5 := abs((s1/n1)^(1/5)), by=ID]
lp5 <- dt1$lp5
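As an aside on the original question of why the caching loop is slower, here is a minimal illustration (mine, not part of the answer above): growing an object inside a loop copies everything accumulated so far on every iteration, so the cost is quadratic, while preallocating is linear.
n <- 10000
system.time({x <- numeric(0); for (i in 1:n) x <- c(x, i)}) # grown element by element: slow
system.time({x <- numeric(n); for (i in 1:n) x[i] <- i})    # preallocated: fast
The same effect applies to rbind() on data frames, only worse, because each rbind() also rebuilds the column attributes.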

How to replicate SPSS recode function in R?

The aim is to simulate the SPSS RECODE procedure in R. The copy keyword is hard to translate.
In SPSS I have code such as
RECODE A (1,2 = 1) (3,4 = copy) (8 thru hi = 3) (else = 1) into B.
Applied to A, which looks like
A <- c(1,2,3,4,5,NA,7,8,9)
I get the following (SPSS) result:
A = 1,1,3,4,1,1,1,3,3
In R, similar code would look like this:
B <- Recode(A, recodes = ("c(1,2) = 1; c(3,4) = c(3,4); c(8,9) = 3; else = 1"), as.numeric.result = TRUE)
A = 1,1,3,4,1,1,1,3,3
The general problem is how to express the values from the SPSS copy statement. Here I wrote c(3,4) = c(3,4) - of course, it doesn't work.
SPSS also offers else = copy, which returns the same output as R does.
Does anyone have an R function that works the same way as SPSS?
Use the levels function. Here's an example with a built-in data set:
InsectSprays
levels(InsectSprays$spray) <- list(new1 = c("A", "C"), YEPS = c("B", "D", "E"), LASTLY = "F")
InsectSprays
Use this to reset the data set:
InsectSprays <- datasets::InsectSprays
You can combine ifelse and car::recode to achieve the result you want.
library(car)
A <- c(1,2,3,4,5,NA,7,8,9)
B <- ifelse(A %in% c(3,4), A, recode(A, "c(1,2) = 1; 8:hi = 3; else = 1"))
cbind(A, B)
You might want to check out the car package. Unfortunately, there is no "copy" functionality available.
library(car)
?recode
A <- c(1,2,3,4,5,NA,7,8,9)
B <- recode(A, "c(1,2) = 1; 3 = 3; 4 = 4; 8:hi = 3; else = 1")
B
## SPSS result: A = 1,1,3,4,1,1,1,3,3
## > B
## [1] 1 1 3 4 1 1 1 3 3
## >
library(expss)
a = c(1,2,3,4,5,NA,7,8,9)
# '%into%' supports multi-value assignment, eg: ... %into% (a1 %to% a3)
recode(a, 1:2 ~ 1, 3:4 ~ copy, 8 %thru% hi ~ 3, other ~ 1) %into% b
b
# 1 1 3 4 1 1 1 3 3
or, with standard R assignment:
b = recode(a, 1:2 ~ 1, 3:4 ~ copy, 8 %thru% hi ~ 3, other ~ 1)
b
# 1 1 3 4 1 1 1 3 3
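If a tidyverse-style solution is acceptable, dplyr::case_when can also express copy directly, by returning the input vector itself (a sketch, not part of the original answers):
library(dplyr)
A <- c(1,2,3,4,5,NA,7,8,9)
B <- case_when(A %in% c(1, 2) ~ 1,
               A %in% c(3, 4) ~ A,  # the SPSS "copy": keep the original value
               A >= 8         ~ 3,  # NA fails this test and falls through
               TRUE           ~ 1)
B
# [1] 1 1 3 4 1 1 1 3 3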

count new elements in vector list

I want to count new elements that weren't present in any previous year.
Sample data:
var1 <- list('2003' = 1:3, '2004' = c(4:3), '2005' = c(6,4,1), '2006' = 1:4 )
I would like to get the output
newcount <- list('2003' = 0, '2004' = 1, '2005' = 1, '2006' = 0)
Unsuccessful code:
newcount <- mapply(setdiff, var1, seq_along(var1), function(i)
{if (i > 1) {Reduce(union, var1[i-1], accumulate=T)}}, length)
Almost there, but it's better to use vector indexing to work with the offset and add the always-known initial element afterwards:
lapply(c(list(`2003` = integer(0)),
         mapply(setdiff, var1[-1],
                Reduce(union, var1, accumulate = TRUE)[-length(var1)])),
       length)
$`2003`
[1] 0
$`2004`
[1] 1
$`2005`
[1] 1
$`2006`
[1] 0
Assuming that var1 is sorted according to year, and that for 2003 you'd like 3 instead of 1, you could try
newcount <- lapply(seq_along(var1), function(x) {
  prev <- unlist(var1[seq_len(x - 1)])
  # Improvement suggested by plannapus
  sum(!var1[[x]] %in% prev)  # length(which(!var1[[x]] %in% prev))
})
names(newcount) <- names(var1)
newcount
# $`2003`
# [1] 3
# $`2004`
# [1] 1
# $`2005`
# [1] 1
# $`2006`
# [1] 0
OK, if you're absolutely sure that 2003 should be 0 (which I see as an exception to your logic), then you could do the following:
newcount <- c(0, lapply(seq_along(var1)[-1], function(x) {
  prev <- unlist(var1[seq_len(x - 1)])
  sum(!var1[[x]] %in% prev)
}))
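A vectorised variant of the same idea (a sketch; like the first version above, it counts all of 2003 as new): flag the first occurrence of every value in the flattened list and tabulate by year.
yrs <- rep(names(var1), lengths(var1))
table(factor(yrs, levels = names(var1))[!duplicated(unlist(var1))])
# 2003 2004 2005 2006
#    3    1    1    0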

Delete whole groups or group members

I am working on a big dataset and have got a problem with data cleaning. My data set looks like this:
data <- cbind(group  = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
              member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
              score  = c(0,1,0,0,0,1,0,1,0,1,1,1,0))
I just want to keep the groups in which the sum of score equals 1 and remove every group in which the sum of score equals 0. For a group in which the sum of score is greater than 1, e.g. sum of score = 3, I want to randomly select all but one of the group members with score equal to 1 and remove them from the group. The data might then look like this:
newdata <- cbind(group  = c(1,1,1,3,3,4,4,4),
                 member = c(1,2,3,2,3,1,3,5),
                 score  = c(0,1,0,0,1,0,1,0))
Can anybody help me get this done?
I would write a function that combines the various manipulations for you. Here is one such function, heavily commented:
process <- function(x) {
  ## this adds a vector with the group sum score
  x <- within(x, sumScore <- ave(score, group, FUN = sum))
  ## drop the groups with sumScore == 0
  x <- x[-which(x$sumScore == 0L), , drop = FALSE]
  ## choose groups with sumScore > 1
  ## sample sumScore - 1 of the rows where score == 1L
  foo <- function(x) {
    scr <- unique(x$sumScore) ## sanity & take only 1 of the sumScore
    ## which of the group's observations have score == 1L
    want <- which(x$score == 1L)
    ## we want to sample all bar one of these
    want <- sample(want, scr - 1)
    ## remove the selected rows & return
    x[-want, , drop = FALSE]
  }
  ## which rows are samples with group sumScore > 1
  want <- which(x$sumScore > 1L)
  ## select only those samples, split up those samples by group, lapplying foo
  ## to each group, then rbind the resulting data frames together
  newX <- do.call(rbind,
                  lapply(split(x[want, , drop = FALSE], x[want, "group"]),
                         FUN = foo))
  ## bind the sampled sumScore > 1L rows on to x (without sumScore > 1L)
  newX <- rbind(x[-want, , drop = FALSE], newX)
  ## remove row labels
  rownames(newX) <- NULL
  ## return the data without the sumScore column
  newX[, 1:3]
}
Running that with your data:
dat <- data.frame(group  = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
                  member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
                  score  = c(0,1,0,0,0,1,0,1,0,1,1,1,0))
gives:
> set.seed(42)
> process(dat)
  group member score
1     1      1     0
2     1      2     1
3     1      3     0
4     3      1     1
5     3      2     0
6     4      1     0
7     4      3     1
8     4      5     0
This is, I think, what was wanted.
Update: In process() above, the internal function foo() could be rewritten to sample just one row to keep rather than sampling the rows to remove. That is, replace foo() with the one below:
foo <- function(x) {
  scr <- unique(x$sumScore) ## sanity & take only 1 of the sumScore
  ## which of the group's observations have score == 1L
  want <- which(x$score == 1L)
  ## we want to sample just one of these
  want <- sample(want, 1)
  ## return the selected row
  x[want, , drop = FALSE]
}
They are essentially the same operations, but the foo() that selects just one row makes the intended behaviour explicit: we want to select one row at random from those with score == 1L, rather than sample scr - 1 values to discard.
I would define a function that does what you want, then use ddply to apply it by group.
myfun <- function(x) {
  if (sum(x$score) == 1) {
    return(x)
  } else if (sum(x$score) == 0) {
    return(data.frame())
  } else {
    row.names(x) <- NULL
    score.1 <- sample(as.integer(row.names(x[x$score == 1, ])), nrow(x[x$score == 1, ]) - 1)
    return(x[-score.1, ])
  }
}
library(plyr)
ddply(as.data.frame(dat), .(group), myfun)
  group member score
1     1      1     0
2     1      2     1
3     1      3     0
4     3      1     1
5     4      1     0
6     4      2     1
7     4      3     1
ugroups <- unique(data[, 1])
scores <- sapply(ugroups, function(x) sum(data[, 1] == x & data[, 3] == 1))
data[data[, 1] %in% ugroups[scores > 0], ]
....... etc
will give you the total score for each group, keep only the groups that have at least one score of 1, etc.
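For completeness, the whole task also fits in a short dplyr pipeline (a sketch, using the data.frame version dat from the first answer): drop the all-zero groups, keep every 0-score member, and keep one randomly chosen 1-score member per group.
library(dplyr)
set.seed(42)
dat %>%
  group_by(group) %>%
  filter(sum(score) >= 1) %>%        # drop groups whose scores sum to 0
  filter(score == 0 |                # keep all 0-score members ...
           cumsum(score) == sample.int(sum(score), 1)) %>% # ... and one random 1
  ungroup()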
