Using daply with multi-variable function - r

I am still new to R and want to use a *ply function to extract information from a dataframe. A sample input dataframe looks like this:
# Construct the dataframe
season <- c("11/12","11/12","11/12","11/12","11/12")
hometeam <- c("Team A","MyTeam","MyTeam","Team D","Team E")
awayteam <- c("MyTeam","Team B","Team C","MyTeam","MyTeam")
score <- c("1 - 1","7 - 1","0 - 0","0 - 2","0 - 1")
stats <- data.frame(season,hometeam,awayteam,score)
print(stats)
  season hometeam awayteam score
1  11/12   Team A   MyTeam 1 - 1
2  11/12   MyTeam   Team B 7 - 1
3  11/12   MyTeam   Team C 0 - 0
4  11/12   Team D   MyTeam 0 - 2
5  11/12   Team E   MyTeam 0 - 1
What I want to do is extract both the opponent of 'MyTeam' as well as the winner. The score is always given as the home team's score vs. the away team's score. I have a way of extracting who the opponent is like this:
# Get the opponent to MyTeam; can add this to dataframe later
opponent <- ifelse(stats$hometeam == "MyTeam", stats$awayteam, stats$hometeam)
But I am stuck trying to get the winner of every match. I tried doing this with daply() and a named function like so:
# Separate out scores for home and away team to determine winner
# (separate() comes from the tidyr package)
library(tidyr)
stats <- separate(stats, score, c('homescore','awayscore'), sep=' - ', remove=TRUE)
# Function for use in ply to get the winner of a match
determineWinner <- function(homescore, awayscore, hometeam) {
  homewon <- FALSE
  if (homescore < awayscore) {
    homewon <- FALSE
  } else if (homescore > awayscore) {
    homewon <- TRUE
  } else {
    return("tie")
  }
  if (hometeam == "MyTeam") {
    ifelse(homewon, return("won"), return("lost"))
  } else {
    ifelse(homewon, return("lost"), return("won"))
  }
} # end of function
winner <- daply(stats, .(homescore,awayscore,hometeam), determineWinner(stats$homescore, stats$awayscore, stats$hometeam) )
But this clearly does not work. Am I applying daply() incorrectly? I think I am still unsure how the *ply functions really behave. It seems like a *ply function is the way to go here, but if there are other solutions out there, I am all ears. Any help is greatly appreciated!

Your logic can be implemented using nested ifelse:
winner <- ifelse(stats$homescore > stats$awayscore,
                 ifelse(stats$hometeam == "MyTeam", "won", "lost"),
                 ifelse(stats$homescore < stats$awayscore,
                        ifelse(stats$hometeam == "MyTeam", "lost", "won"),
                        "tie"))
## [1] "tie" "won" "tie" "won" "won"
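If you do want to keep a named function like determineWinner(), a row-wise call is closer to what the question was reaching for than daply(). Here is a minimal sketch using base R's mapply(), assuming the separate() step above has already been run (the scores are converted to numeric so they are not compared as strings):
# Feed the three columns to determineWinner() element by element.
winner <- mapply(determineWinner,
                 as.numeric(stats$homescore),
                 as.numeric(stats$awayscore),
                 as.character(stats$hometeam))
winner
## [1] "tie" "won" "tie" "won" "won"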

Related

Create a list from a complex comparison of two lists

I am working on market transaction data where each observation contains the buyer's id and the seller's id. For each observation (i.e. each transaction), I would like to create a variable equal to the number of other transactions the associated seller has done with a different buyer than the one involved in this transaction. So, for the following data
data <- data.frame(Buyer_id = c("001","001","002","001"), Seller_id = c("021","022","022","021"))
I would like to obtain:
Result <- list(0,1,1,0)
I searched for existing answers to problems similar to mine, usually involving the function mapply(), and tried to adapt them, but without success.
Thank you very much for helping me.
Are you looking for something like this? If yes, then you might want to change your reproducible example to use c() instead of list() for the expected result.
data <- data.frame(Buyer_id = c("001","001","002","001"),
                   Seller_id = c("021","022","022","021"))

data$n <- NA
for (i in seq_len(nrow(data))) {
  seller <- as.character(data[i, "Seller_id"])
  buyer <- as.character(data[i, "Buyer_id"])
  # All buyers this seller has transacted with, minus the current buyer
  with.buyers <- as.character(data[data$Seller_id == seller, "Buyer_id"])
  with.buyers <- unique(with.buyers)
  diff.buyers <- with.buyers[!(with.buyers %in% buyer)]
  data[i, "n"] <- length(diff.buyers)
}
  Buyer_id Seller_id n
1      001       021 0
2      001       022 1
3      002       022 1
4      001       021 0
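The same counting can also be done without an explicit loop. Here is a sketch using dplyr (not part of the original answer): for each row, count the seller's other transactions whose buyer differs from the current one.
library(dplyr)

data %>%
  group_by(Seller_id) %>%
  mutate(n = sapply(as.character(Buyer_id),
                    function(b) sum(as.character(Buyer_id) != b),
                    USE.NAMES = FALSE)) %>%
  ungroup()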
Apart from Roman Lustrik's solution, there is also an approach that uses graphs.
library(igraph)

data <- data.frame(Seller_id = c("021","022","022","021"),
                   Buyer_id = c("001","001","002","001"),
                   stringsAsFactors = FALSE)

my.graph <- graph_from_data_frame(data)
plot(my.graph)
degree(my.graph, mode = c("out"))

# Transform the graph into a simple graph. A simple graph does not allow
# duplicate edges.
my.graph <- simplify(my.graph)
plot(my.graph)
degree(my.graph, mode = c("out"))

V(my.graph)$out.degree <- degree(my.graph, mode = c("out"))

data$n <- apply(data,
                MARGIN = 1,
                FUN = function(transaction) {
                  node.out.degree <- V(my.graph)$out.degree[V(my.graph)$name == transaction["Seller_id"]]
                  if (node.out.degree <= 1) {
                    # Since the vertex has at most 1 out-degree, the current
                    # transaction is the only appearance of the current seller.
                    return(0)
                  } else {
                    # The seller participates in at least one more transaction,
                    # so subtract one (the current transaction) from the out-degree.
                    return(node.out.degree - 1)
                  }
                })
data

Is there a way to mark "troughs" in a graph in R according to specific criteria?

I have a data set which, when plotted, produces a graph that looks like this:
[Plot of score against position: the score sits around 10 for most of the range, with one dipping region in the middle.]
The head of this data is:
> head(data_frame)
      score position
73860    10    43000
73859    10    43001
73858    10    43002
73857    10    43003
73856    10    43004
73855    10    43005
I've uploaded the whole file as a tab delimited text file here.
As you can see, the plot has regions which have a score of around 10, but there's one region in the middle that "dips". I would like to identify these dips.
Defining a dip as:
Starting when the score is below 7
Ending when the score rises to 7 or above and stays at 7 or above for at least 500 positions
I would like to identify all the regions which meet the above definition, and output their start and end positions. In this case that would only be the one region.
However, I'm at a bit of a loss as to how to do this. Looks like the rle() function could be useful, but I'm not too sure how to implement it.
Expected output for the data frame would be something like:
[1] 44561 46568
(I haven't actually checked that everything in between these would qualify under the definition, but from the plot this looks about right)
I would be very grateful for any suggestions!
Andrei
So I've come up with one solution that uses a series of loops. I do realise this is inefficient, though, so if you have a better answer, please let me know.
results <- data.frame(matrix(ncol = 2, nrow = 1))
colnames(results) <- c("start", "end")

state <- "out"
count <- 1
for (i in 1:dim(data_frame)[1]) {
  print(i / dim(data_frame)[1])
  if (data_frame[i, 3] < 7 & state == "out") {
    results[count, 1] <- data_frame[i, 2]
    state <- "in"
    next
  }
  if (data_frame[i, 3] >= 7 & state == "in") {
    if ((i + 500) > dim(data_frame)[1]) {
      results[count, 2] <- data_frame[dim(data_frame)[1], 2]
      state <- "out"
      break
    }
    if (any(data_frame[(i + 1):(i + 500), 3] < 7)) {
      next
    } else {
      results[count, 2] <- data_frame[i - 1, 2]
      count <- count + 1
      state <- "out"
      next
    }
  }
  if ((i + 500) > dim(data_frame)[1] & state == "out") {
    break
  }
}
Something like this is a tidyverse solution; it uses rle() as the OP suggested.
library(tidyverse)

below7 <- data_frame$score < 7
x <- rle(below7)

runs <- tibble(
  RunLength = x$lengths,
  Below7 = x$values,
  RunStart = data_frame$position[1]
) %>%
  mutate(
    RunStart = ifelse(
      row_number() == 1,
      data_frame$position[1],
      data_frame$position[1] + cumsum(RunLength) - RunLength + 1
    ),
    RunEnd = RunStart + RunLength - 1,
    Dip = Below7,
    Dip = Dip | Below7 | (!Below7 & RunLength < 500)
  )
as.data.frame(runs)
Giving
   RunLength Below7 RunStart RunEnd   Dip
1       1393  FALSE    43000  44392 FALSE
2         84   TRUE    44394  44477  TRUE
3         84  FALSE    44478  44561  TRUE
...
19        60  FALSE    46338  46397  TRUE
20       171   TRUE    46398  46568  TRUE
21      2433  FALSE    46569  49001 FALSE
So, to get the OP's final answer:
runs %>%
  filter(Dip) %>%
  summarise(
    DipStart = min(RunStart),
    DipEnd = max(RunEnd)
  )

# A tibble: 1 x 2
  DipStart DipEnd
     <dbl>  <dbl>
1    44394  46568
If the original data.frame might contain more than one dip, you'd have to do a little more work when creating the runs tibble: having identified each individual run, you'd need to create an additional column, DipIndex say, which indexes each individual dip.
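For reference, the run-length idea also works in plain base R without the tidyverse. A minimal sketch (assuming, as above, a data_frame with score and position columns, and treating any run at or above 7 that is shorter than 500 positions as part of the surrounding dip):
below7 <- data_frame$score < 7
r <- rle(below7)

# A run belongs to a dip if it is below 7, or if it is a short (< 500) run at or
# above 7 that is neither the first nor the last run in the data.
n_runs <- length(r$lengths)
in_dip <- r$values | (r$lengths < 500 & seq_len(n_runs) != 1 & seq_len(n_runs) != n_runs)

# Row indices where each run starts and ends.
run_end   <- cumsum(r$lengths)
run_start <- run_end - r$lengths + 1

# Collapse consecutive in-dip runs into whole dips and report their positions.
dips <- rle(in_dip)
dip_idx <- which(dips$values)
dip_end_run   <- cumsum(dips$lengths)[dip_idx]
dip_start_run <- dip_end_run - dips$lengths[dip_idx] + 1

cbind(start = data_frame$position[run_start[dip_start_run]],
      end   = data_frame$position[run_end[dip_end_run]])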

Loop Taking Too Long to Finish in R

I have some code which includes a for loop, and nested if statements. The issue is that it is taking too long to run and I want to make it much faster.
I have data on cohorts in a data frame called f2_cebu_davao. There is also a column in this data frame called person_id. There are 3 categories of the cohorts: 'Baseline', 'Other Effects', 'Campaign'.
I want to loop through each person_id in the f2_cebu_davao data frame, and check to see which cohort it is in. If it is in the cohort 'Baseline' or 'Other Effects', then I will check the before_baseline_othereffects table to see if the ID can be found in that table. If it can, I make a new column in the f2_cebu_davao table and the value will be 'returning'. Otherwise, 'new'.
If the cohort name is 'Campaign', I will check the before_campaign table and follow the same procedure as above.
My data is quite big (all my objects are big) so this is taking a really long time to run (it's been running for more than 30 minutes and still not finished!).
How can I speed this up (possibly by using vectorization, or just by modifying the code a little)?
I tried looping through, but it's taking too long.
before_baseline_othereffects <- subset(loans_final_full, submitted_at_date < '2018-05-21')
before_campaign <- subset(loans_final_full, submitted_at_date < '2019-01-21')

for (i in 1:nrow(f2_cebu_davao)) {
  if (as.vector(f2_cebu_davao[, 'cohort'][i]) == 'Baseline') {
    if (as.vector(f2_cebu_davao[, 'person_id'][i]) %in% as.vector(unique(before_baseline_othereffects$person_id)) == TRUE) {
      f2_cebu_davao$new_or_returning[i] <- 'Returning'
    } else {
      f2_cebu_davao$new_or_returning[i] <- 'New'
    }
  } else if (as.vector(f2_cebu_davao[, 'cohort'][i]) == 'Other Effects') {
    if (as.vector(f2_cebu_davao[, 'person_id'][i]) %in% as.vector(unique(before_baseline_othereffects$person_id)) == TRUE) {
      f2_cebu_davao$new_or_returning[i] <- 'Returning'
    } else {
      f2_cebu_davao$new_or_returning[i] <- 'New'
    }
  } else {
    if (as.vector(f2_cebu_davao[, 'person_id'][i]) %in% as.vector(unique(before_campaign$person_id)) == TRUE) {
      f2_cebu_davao$new_or_returning[i] <- 'Returning'
    } else {
      f2_cebu_davao$new_or_returning[i] <- 'New'
    }
  }
}
Happy to update and test this if you can provide some example data and desired output. I expect something like this should work.
Here I make up some fake data:
f2_cebu_davao <- data.frame(stringsAsFactors = FALSE,
                            cohort = rep(c("Baseline", "Other Effects", "Something else",
                                           "Another Something"), times = 3),
                            person_id = 1:12)

before_baseline_othereffects <- c(1:4)
before_campaign <- c(5:8)
Here I apply it using dplyr's case_when, spelling out four cases. This code is vectorized and I expect it would run much faster than the current loop code.
1. The cohort is either "Baseline" or "Other Effects", and the person_id appears in before_baseline_othereffects. This yields "Returning" in rows 1 & 2.
2. The first condition wasn't met, but the cohort is still either "Baseline" or "Other Effects": return "New", as in rows 5 & 6.
3. The first two conditions weren't met, but the person appears in before_campaign: mark "Returning", as in rows 7 & 8.
4. Otherwise, mark "New", as in rows 3 & 4 and 9-12.
library(dplyr)

output <- f2_cebu_davao %>%
  mutate(new_or_returning = case_when(
    cohort %in% c("Baseline", "Other Effects") &
      person_id %in% before_baseline_othereffects ~ "Returning",
    cohort %in% c("Baseline", "Other Effects") ~ "New",
    person_id %in% before_campaign ~ "Returning",
    TRUE ~ "New"
  ))
Here's the output:
> output
              cohort person_id new_or_returning
1           Baseline         1        Returning
2      Other Effects         2        Returning
3     Something else         3              New
4  Another Something         4              New
5           Baseline         5              New
6      Other Effects         6              New
7     Something else         7        Returning
8  Another Something         8        Returning
9           Baseline         9              New
10     Other Effects        10              New
11    Something else        11              New
12 Another Something        12              New
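If dplyr is not an option, the same four cases can be written with vectorized %in% and nested ifelse() in base R. A sketch against the fake data above (for the real tables, the membership checks would be against before_baseline_othereffects$person_id and before_campaign$person_id):
# Precompute the three vectorized conditions once, outside any loop.
base_group         <- f2_cebu_davao$cohort %in% c("Baseline", "Other Effects")
in_before_base     <- f2_cebu_davao$person_id %in% before_baseline_othereffects
in_before_campaign <- f2_cebu_davao$person_id %in% before_campaign

f2_cebu_davao$new_or_returning <- ifelse(
  base_group,
  ifelse(in_before_base, "Returning", "New"),
  ifelse(in_before_campaign, "Returning", "New")
)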

How to use R to check data consistency (make sure no contradiction between case and value)?

Let's say I have:
Person  Movie    Rating
Sally   Titanic  4
Bill    Titanic  4
Rob     Titanic  4
Sue     Cars     8
Alex    Cars     9   <- data entry error
Bob     Cars     8
As you can see, there is a contradiction for Alex: all rows for the same movie should have the same rating, but there was a data entry error for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in Excel or something? Is there a command in R that will return all the cases where there is a contradiction between two columns?
Perhaps I could have R do a boolean check of whether every row for a movie matches the first rating seen for that movie, and then manually inspect any that don't? How would I write this function?
Thanks
Here's a data.table solution
Define the function
Myfunc <- function(x) {
  temp <- table(x)
  names(temp)[which.max(temp)]
}
library(data.table)
Create a column with the correct rating (by reference)
setDT(df)[, CorrectRating := Myfunc(Rating), Movie][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Alex Cars 9 8
# 6: Bob Cars 8 8
Or, if you want to remove the "bad" ratings:
df[Rating == CorrectRating][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Bob Cars 8 8
It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.
You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group"), along with the Mode function defined in this answer that finds the most common item in a vector:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))
This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:
newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))
If you want to fix them, do
newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))
You can test the above with a reproducible version of your data:
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
                  Movie = rep(c("Titanic", "Cars"), each = 3),
                  Rating = c(4, 4, 4, 8, 9, 8))
If the goal is to see if all the values within a group are the same (or if there are some differences) then this can be a simple application of tapply (or aggregate, etc.) used with a function like var (or compute the range). If all the values are the same then the variance and range will be 0. If it is any other value (outside of rounding error) then there must be a value that is different. The which function can help identify the group/individual.
tapply(dat$Rating, dat$Movie, FUN=var)
which(.Last.value > 0.00001)
tapply(dat$Rating, dat$Movie, FUN=function(x)diff(range(x)))
which(.Last.value != 0)
which( abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0)
which.max( abs(dat$Rating - ave(dat$Rating, dat$Movie)) )
dat[.Last.value,]
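In the same spirit, here is a quick dplyr check (a sketch, not part of the original answers) that simply flags every movie whose ratings are not all identical, without computing a mode:
library(dplyr)

# Returns all rows belonging to movies with more than one distinct rating,
# i.e. the groups that contain a contradiction.
dat %>%
  group_by(Movie) %>%
  filter(n_distinct(Rating) > 1)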
I would add a variable for the mode so I can see if there is anything weird going on with the data, like missing values, text, or many different answers instead of the rare anomaly, etc. I used x as the name of your dataset.
# one of many functions to find the mode, could use any other
modefunc <- function(x) {
  names(table(x))[table(x) == max(table(x))]
}

# add a variable for the mode, split by Movie
x$mode <- ave(x = x$Rating, x$Movie, FUN = modefunc)

# do whatever you want with the records that are different
x[x$Rating != x$mode, ]
If you want a different definition of the mode, just swap in another mode function.

Aggregate data by order of magnitude in R

Suppose this data set:
df = data.frame(city=c(A,A,A,B,B,C,C,C,C), party=c(D,R,I,D,R,D,R,I,U), votes=c(10,2,9,7,2,1,7,3,8))
I want to calculate the difference in votes between the most voted and the second most voted party for each city, but a city can have more than two parties. The result I want would be:
city vote.diff
A 1
B 5
C 1
Any help?
The data frame used for testing, modeled after the pseudo-code above:
df = data.frame(city = c('A','A','A','B','B','C','C','C','C'),
                party = c('D','R','I','D','R','D','R','I','U'),
                votes = c(10,2,9,7,2,1,7,3,8))
The code:
resV <- lapply(split(df, df$city), function(df.x) {
  res <- xtabs(votes ~ party, data = df.x)
  -diff(res[order(res, decreasing = TRUE)][1:2])
})

data.frame(city = names(resV),
           vote.diff = unlist(resV),
           winner = names(unlist(resV)))
    city vote.diff winner
A.I    A         1    A.I
B.R    B         5    B.R
C.R    C         1    C.R
If you want to leave out the winner column (which, strictly speaking, carries the name of the second most voted party prefixed by the city), it should be a trivial fix.
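An equivalent grouped summary with dplyr (a sketch, not part of the original answer): sort the votes within each city and take the gap between the top two.
library(dplyr)

df %>%
  group_by(city) %>%
  summarise(vote.diff = {
    v <- sort(votes, decreasing = TRUE)
    v[1] - v[2]
  })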
