I have an R data.frame of college football data, with two entries for each game (one for each team, with stats and whatnot). I would like to compare points from these to create a binary Win/Loss variable, but I have no idea how (I'm not very experienced with R).
Is there a way I can iterate through the rows, match each one against the other row with the same game ID (I have a game ID variable, so I'd match on that), and create the aforementioned binary Win/Loss variable by comparing the points values?
Excerpt of dataframe (many variables left out):
Team Code  Name     Game Code        Date        Site     Points
5          Akron    5050320051201    12/1/2005   NEUTRAL  32
5          Akron    404000520051226  12/26/2005  NEUTRAL  23
8          Alabama  419000820050903  9/3/2005    TEAM     37
8          Alabama  664000820050910  9/10/2005   TEAM     43
What I want is to append a new column, a binary variable that's assigned 1 or 0 based on if the team won or lost. To figure this out, I need to take the game code, say 5050320051201, find the other row with that same game code (there's only one other row with that same game code, for the other team in that game), and compare the points value for the two, and use that to assign the 1 or 0 for the Win/Loss variable.
Assuming that your data has exactly two teams for each unique Game Code and there are no tie games as given by the following example:
df <- structure(list(`Team Code` = c(5L, 6L, 5L, 5L, 8L, 9L, 9L, 8L
), Name = c("Akron", "St. Joseph", "Akron", "Miami(Ohio)", "Alabama",
"Florida", "Tennessee", "Alabama"), `Game Code` = structure(c(1L,
1L, 2L, 2L, 3L, 3L, 4L, 4L), .Label = c("5050320051201", "404000520051226",
"419000820050903", "664000820050910"), class = "factor"), Date = structure(c(13118,
13118, 13143, 13143, 13029, 13029, 13036, 13036), class = "Date"),
Site = c("NEUTRAL", "NEUTRAL", "NEUTRAL", "NEUTRAL", "TEAM",
"AWAY", "AWAY", "TEAM"), Points = c(32L, 25L, 23L, 42L, 37L,
45L, 42L, 43L)), .Names = c("Team Code", "Name", "Game Code",
"Date", "Site", "Points"), row.names = c(NA, -8L), class = "data.frame")
print(df)
## Team Code Name Game Code Date Site Points
##1 5 Akron 5050320051201 2005-12-01 NEUTRAL 32
##2 6 St. Joseph 5050320051201 2005-12-01 NEUTRAL 25
##3 5 Akron 404000520051226 2005-12-26 NEUTRAL 23
##4 5 Miami(Ohio) 404000520051226 2005-12-26 NEUTRAL 42
##5 8 Alabama 419000820050903 2005-09-03 TEAM 37
##6 9 Florida 419000820050903 2005-09-03 AWAY 45
##7 9 Tennessee 664000820050910 2005-09-10 AWAY 42
##8 8 Alabama 664000820050910 2005-09-10 TEAM 43
You can use dplyr to generate what you want:
library(dplyr)
result <- df %>%
  group_by(`Game Code`) %>%
  mutate(`Win/Loss` = if (first(Points) > last(Points)) as.integer(c(1, 0)) else as.integer(c(0, 1)))
print(result)
##Source: local data frame [8 x 7]
##Groups: Game Code [4]
##
## Team Code Name Game Code Date Site Points Win/Loss
## <int> <chr> <fctr> <date> <chr> <int> <int>
##1 5 Akron 5050320051201 2005-12-01 NEUTRAL 32 1
##2 6 St. Joseph 5050320051201 2005-12-01 NEUTRAL 25 0
##3 5 Akron 404000520051226 2005-12-26 NEUTRAL 23 0
##4 5 Miami(Ohio) 404000520051226 2005-12-26 NEUTRAL 42 1
##5 8 Alabama 419000820050903 2005-09-03 TEAM 37 0
##6 9 Florida 419000820050903 2005-09-03 AWAY 45 1
##7 9 Tennessee 664000820050910 2005-09-10 AWAY 42 0
##8 8 Alabama 664000820050910 2005-09-10 TEAM 43 1
Here, we first group_by the Game Code and then use mutate to create the Win/Loss column for each group. The logic is simply that if the first Points value is greater than the last (there are only two by assumption), then we set the column to c(1,0). Otherwise, we set it to c(0,1). Note that this logic does not handle ties, but can easily be extended to do so. Note also that we surround the column names with backticks because they contain special characters such as a space and /.
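One possible way to extend this to ties, sketched in base R so it runs without extra packages. The data frame games and its column names (game_code, team, points, opp_points, win) are made up for illustration, and coding a tie as NA is only one choice among several. It still assumes exactly two rows per game.

```r
# Hypothetical example data: two rows per game, the third game is a tie
games <- data.frame(
  game_code = c("g1", "g1", "g2", "g2", "g3", "g3"),
  team      = c("A", "B", "C", "D", "E", "F"),
  points    = c(32L, 25L, 23L, 42L, 17L, 17L)
)

# Opponent's points: within each two-row game, reverse the points vector
games$opp_points <- ave(games$points, games$game_code, FUN = rev)

# 1 = win, 0 = loss, NA = tie (an assumption; pick whatever coding suits you)
games$win <- ifelse(games$points == games$opp_points,
                    NA_integer_,
                    as.integer(games$points > games$opp_points))
games
```

Because the comparison is against the opponent's points rather than row position, the same expression works regardless of which team is listed first within a game.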
footballdata$SomeVariable[footballdata$Wins == 1] <- stuff
Code your wins as either 1 or 0, i.e. as a binary variable.
R's data frames are nice in that you can subset them by a condition, for example keeping only the rows where Wins is 1, and then assign values to a column for just those rows, as above. If you fill the column from another data frame or vector, make sure it has the same number of elements as the rows you selected.
To combine conditions, join them with & inside a single pair of brackets rather than chaining two sets of brackets:
footballdata$SomeVariable[footballdata$Wins == 1 & footballdata$Team == "Browns"] <- Hopeful
I need some help understanding the concept of joining.
I understand how to mentally model how a join works if you have 2 data files that have a common variable. Like:
Animal  Weight  Age
Dog     12      5
Cat     4       19
Fish    2       4
Mouse   1       2

Animal  Award
Dog     1st
Cat     1st
Fish    3rd
Mouse   5th
These can be joined because the animal column is exactly the same and it just adds on another variable to the same observations of animals.
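For reference, that simple case can be sketched in base R as follows. The object names animals, awards and joined are made up for illustration; the values are the ones from the two tables above.

```r
# The two example tables, sharing the key column Animal
animals <- data.frame(
  Animal = c("Dog", "Cat", "Fish", "Mouse"),
  Weight = c(12, 4, 2, 1),
  Age    = c(5, 19, 4, 2)
)
awards <- data.frame(
  Animal = c("Dog", "Cat", "Fish", "Mouse"),
  Award  = c("1st", "1st", "3rd", "5th")
)

# Join on the shared key; every animal appears in both tables,
# so each row gains exactly one extra column and no NA is produced
joined <- merge(animals, awards, by = "Animal")
joined
```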
But I don't understand it when it's something like this:
Mortality Rate (Heart Attack)
Year  Place   Death Rate (Heart Attack)
2011  Paris   200
2011  Paris   94
2011  Rome    23
2009  London  15

Mortality Rate (Car Crash)
Year  Place      Death Rate (Car Crash)
2011  London     987
2012  London     34
2012  Paris      09
2007  Melbourne  12
The variable TYPES are the same (years, cities and death rates). But the year values aren't the same, they aren't in the same order, there aren't the same number of 2011s for example, the locations are different, and there are obviously two different death rates that need to be two different columns. So how does this join work? Which variable would you join by? How would it be configured once joined? Would it just result in lots of NA values if this was across a larger data set?
I understand there are different types of joins that do different things, but I'm just struggling to understand how the years and cities would sit if you were wanting to be able to compare the two different death rates in cities and years.
Thank you!
If you do
merge(heart, car, all=TRUE)
# Year Place Death_Rate_heart Death_Rate_Car
# 1 2007 Melbourne NA 12
# 2 2009 London 15 NA
# 3 2011 London NA 987
# 4 2011 Paris 200 NA
# 5 2011 Paris 94 NA
# 6 2011 Rome 23 NA
# 7 2012 London NA 34
# 8 2012 Paris NA 9
merge automatically looks for matching names and merges on them. It's looking for pairs in those columns, so they won't be mixed. More verbosely you could do
merge(heart, car, all=TRUE, by.x=c("Year", "Place"), by.y=c("Year", "Place"))
which is actually what happens in this case.
Data:
heart <- structure(list(Year = c(2011L, 2011L, 2011L, 2009L), Place = c("Paris",
"Paris", "Rome", "London"), Death_Rate_heart = c(200L, 94L, 23L,
15L)), class = "data.frame", row.names = c(NA, -4L))
car <- structure(list(Year = c(2011L, 2012L, 2012L, 2007L), Place = c("London",
"London", "Paris", "Melbourne"), Death_Rate_Car = c(987L, 34L,
9L, 12L)), class = "data.frame", row.names = c(NA, -4L))
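Since the question also asks about the different types of joins, here is a small sketch of how the all, all.x and all.y arguments of merge change the result on the same data. The data frames are rebuilt with data.frame() so the block runs on its own; the names inner, left and full are just illustrative labels.

```r
heart <- data.frame(
  Year  = c(2011L, 2011L, 2011L, 2009L),
  Place = c("Paris", "Paris", "Rome", "London"),
  Death_Rate_heart = c(200L, 94L, 23L, 15L)
)
car <- data.frame(
  Year  = c(2011L, 2012L, 2012L, 2007L),
  Place = c("London", "London", "Paris", "Melbourne"),
  Death_Rate_Car = c(987L, 34L, 9L, 12L)
)

# Inner join: only Year/Place pairs present in BOTH tables (none here)
inner <- merge(heart, car, by = c("Year", "Place"))

# Left join: every row of heart, with NA where car has no match
left <- merge(heart, car, by = c("Year", "Place"), all.x = TRUE)

# Full join: every row of both tables, with NA on whichever side is missing
full <- merge(heart, car, by = c("Year", "Place"), all = TRUE)

nrow(inner)  # 0
nrow(left)   # 4
nrow(full)   # 8
```

With this particular sample no Year/Place pair occurs in both tables, which is why the full join is all NA on one side or the other. On a larger data set with overlapping cities and years, the matched rows would carry both death rates side by side.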
I'm relatively new to R, so I realise this type of question is asked often, but I've read a lot of Stack Overflow posts and still can't quite get something to work on my data.
I have data in SPSS, in two datasets imported into R. Both of my datasets include an id (IDC), which I have been using to merge them. Before merging, I need to filter one of the datasets to select specifically the last observation of a date variable.
My dataset, d1, has a longitudinal measure in long format. There are multiple rows per IDC, representing different places of residence (neighborhood). Each row has its own "start_date", which is a variable that is NOT necessarily unique.
As it looks on spss :
IDC  neighborhood  start_date
1    22            08.07.2001
1    44            04.02.2005
1    13            21.06.2010
2    44            24.12.2014
2    3             06.03.2002
3    22            04.01.2006
4    13            08.07.2001
4    2             15.06.2011
In R, the start dates do not look the same, instead they are one long number like "13529462400". I do not understand this format but I assume it still would retain the date order...
Here are all my attempts so far to select the last date. All attempts ran without error; the output just didn't give me what I want. As far as I can tell, none of these changed the number of repetitions of IDC, so none of them actually selected only the last date.
##### attempt 1 --- not working
d1$start_date_filt <- d1$start_date
d1[order(d1$IDC,d1$start_date_filt),] # Sort by ID and week
d1[!duplicated(d1$IDC, fromLast=T),] # Keep last observation per ID)
###### attempt 2--- not working
myid.uni <- unique(d1$IDC)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(d1, IDC==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
last
##### attempt 3 -- doesn't work
do.call("rbind",
by(d1,INDICES = d1$IDC,
FUN=function(DF)
DF[which.max(DF$start_date),]))
#### attempt 4 -- doesn't work
library(plyr)
ddply(d1,.(IDC), function(X)
X[which.max(X$start_date),])
### merger code -- in case something has to change with that after only the last start_date is selected
merge(d1,d2, IDC)
My goal dataset d1 would look like this:
IDC  neighborhood  start_date
1    13            21.06.2010
2    44            24.12.2014
3    22            04.01.2006
4    2             15.06.2011
I'm grateful for any help, many thanks <3
There is a problem with most approaches to this data: your dates are strings in day.month.year format, which does not sort chronologically. Any string-based comparison only works here by coincidence, because within each IDC the largest day-of-month also happens to fall in the latest year.
It would generally be better to work with that column as a Date object in R, so that comparisons are chronological.
dat$start_date <- as.Date(dat$start_date, format = "%d.%m.%Y")
dat
# IDC neighborhood start_date
# 1 1 22 2001-07-08
# 2 1 44 2005-02-04
# 3 1 13 2010-06-21
# 4 2 44 2014-12-24
# 5 2 3 2002-03-06
# 6 3 22 2006-01-04
# 7 4 13 2001-07-08
# 8 4 2 2011-06-15
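To illustrate why the conversion matters, here is a small sketch with two made-up dates (not taken from the question's data) where the string comparison and the Date comparison disagree:

```r
# Two hypothetical dates in the same dd.mm.yyyy format
early <- "31.01.2001"
late  <- "01.12.2014"

# As strings, comparison goes character by character, so "3" beats "0"
string_says_early_is_later <- early > late

# As Dates, the comparison is chronological
early_d <- as.Date(early, format = "%d.%m.%Y")
late_d  <- as.Date(late,  format = "%d.%m.%Y")
date_says_early_is_later <- early_d > late_d

string_says_early_is_later  # TRUE  (wrong)
date_says_early_is_later    # FALSE (right)
```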
From here, things are a bit simpler:
Base R
do.call(rbind, by(dat, dat[,c("IDC"),drop=FALSE], function(z) z[which.max(z$start_date),]))
# IDC neighborhood start_date
# 1 1 13 2010-06-21
# 2 2 44 2014-12-24
# 3 3 22 2006-01-04
# 4 4 2 2011-06-15
dplyr
library(dplyr)
dat %>%
  group_by(IDC) %>%
  slice(which.max(start_date)) %>%
  ungroup()
# # A tibble: 4 x 3
# IDC neighborhood start_date
# <int> <int> <date>
# 1 1 13 2010-06-21
# 2 2 44 2014-12-24
# 3 3 22 2006-01-04
# 4 4 2 2011-06-15
Data
dat <- structure(list(IDC = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L), neighborhood = c(22L, 44L, 13L, 44L, 3L, 22L, 13L, 2L), start_date = c("08.07.2001", "04.02.2005", "21.06.2010", "24.12.2014", "06.03.2002", "04.01.2006", "08.07.2001", "15.06.2011")), class = "data.frame", row.names = c(NA, -8L))
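The question mentions that after import the dates show up as one long number such as 13529462400 rather than as strings. A hedged guess is that these are raw SPSS date values, which count seconds since 1582-10-14; if that is the case for your import (an assumption, check it against a date you know), they can be turned into Date objects like this. The variable names are made up for illustration.

```r
# Example value quoted in the question
spss_seconds <- 13529462400

# Assumption: the number is seconds since the SPSS epoch, 1582-10-14.
# Dividing by 86400 gives days, which as.Date() can offset from that origin.
start_date <- as.Date(spss_seconds / 86400, origin = "1582-10-14")
start_date
# [1] "2011-07-08"
```

If the column was imported with a package that already converts SPSS dates, this step is unnecessary and the column will print as a date.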
My dataset contains several groups and each group can have a different number of unique observations. I carry out some calculations by group (simplified in the code below), resulting in a summary value for each group. Next, for the purpose of a bootstrap, I want to:
Randomly sample the groups with replacement (number of sampled groups = equal to number of different groups in the original dataset)
Within these sampled groups, randomly sample observations with replacement (number of sampled observations per group = equal to number of unique observations in that group in the original dataset)
A simplified version of my data set up (data1):
data1:
id group y
1001 1 10
1002 1 15
1003 1 3
3002 2 24
3003 2 15
3005 2 37
3006 2 32
3007 2 11
4001 3 12
4002 3 15
5006 4 7
5007 4 9
5009 4 22
5010 4 19
E.g. based on the dataset example above: there are 4 groups in the original dataset, so I want to sample 4 groups with replacement (e.g. groups sampled = groups 4,3,3,1), and then sample observations/rows from those 4 groups (4 ids from group 4 (e.g. 5007, 5007, 5006, 5009); 2 ids from group 3 (twice, as group 3 was sampled twice), and 3 ids from group 1, all with replacement), and return the sampled rows together in a dataframe (4+2+2+3 = 11 rows).
For the above, I have some code working for these steps separately, but I cannot seem to combine them:
# Calculate group value
y.group <- tapply(data1$y,data1$group,mean)
# Step 1. Sample groups, with replacement:
sampled.group <- sample(1:length(unique(data1$group)),replace=T)
# Step 2. Sample within groups, with replacement
data2 <- data.frame(data1 %>%
group_by(group) %>% # for each group
sample_frac(1, replace = TRUE) %>%
ungroup)
Obviously, the code above in full does not do what I want, as in step 2 the sampled groups from step 1 are ignored since it just uses the original group var (I am aware of this). I have tried to solve this using step 1 and trying to generate a new dataframe containing only the sampled groups' observations (with duplicates if a group was sampled more than once, which is likely to happen), and then apply step 2 to that new dataframe, but I cannot get this to work.
I think I am just on the wrong path or overthinking things. Hopefully you can give me some advice on how to proceed.
Edit: While awaiting any potential solutions, I continued on the question myself and ended up with:
total.result <- c()
for (j in 1:length(unique(data1$group))){
sampled.group <- sample(1:length(unique(data1$group)),size=1,replace=T)
group.result <- sample_n(data1[data1$group==sampled.group,],
size=length(unique(data1$id[data1$group==sampled.group])),replace=T)
total.result <- rbind(total.result,group.result)
}
(So basically using a loop to sample the groups one at a time, creating datasets for each, and then sampling individual rows from those, and finally combining the results with rbind)
However, I think Allan Cameron's solution (see below) is more straightforward, so I have accepted that one as the answer to my question.
I think this is what you're looking for. Let's start with your data in a reproducible format:
data1 <- structure(list(id = structure(1:14, .Label = c("1001", "1002",
"1003", "3002", "3003", "3005", "3006", "3007", "4001", "4002",
"5006", "5007", "5009", "5010"), class = "factor"), group = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), y = structure(c(1L, 4L, 8L,
7L, 4L, 10L, 9L, 2L, 3L, 4L, 11L, 12L, 6L, 5L), .Label = c("10",
"11", "12", "15", "19", "22", "24", "3", "32", "37", "7", "9"
), class = "factor")), class = "data.frame", row.names = c(NA,
-14L))
And just to make sure:
data1
#> id group y
#> 1 1001 1 10
#> 2 1002 1 15
#> 3 1003 1 3
#> 4 3002 2 24
#> 5 3003 2 15
#> 6 3005 2 37
#> 7 3006 2 32
#> 8 3007 2 11
#> 9 4001 3 12
#> 10 4002 3 15
#> 11 5006 4 7
#> 12 5007 4 9
#> 13 5009 4 22
#> 14 5010 4 19
We start by splitting the data frame by group into smaller data frames, using the split function. This gives us a list with four data frames, each one containing all the members of its respective group. (The set.seed is there purely to make this example reproducible).
set.seed(69)
split_dfs <- split(data1, data1$group)
Now we can sample this list, giving us a new list of four data frames drawn with replacement from split_dfs. Each one will again contain all the members of its respective group, though of course some whole groups might be sampled more than once, and other whole groups not sampled at all.
sampled_group_dfs <- split_dfs[sample(length(split_dfs), replace = TRUE)]
Now we can sample within each group by sampling with replacement from the rows of each data frame in our new list. We do this for all our data frames in our list by using lapply
all_sampled <- lapply(sampled_group_dfs, function(x) x[sample(nrow(x), replace = TRUE), ])
All that remains is to stick all the resultant dataframes in this list back together to get our result:
result <- do.call(rbind, all_sampled)
As you can see from the final result, it just so happens that each of the four groups was sampled once (this is just by chance - alter set.seed to get different results). However, within the groups there have clearly been some duplicates drawn. In fact, since R mandates unique row names in a data frame, these are easy to pick out by the .1 that has been appended to the duplicate row names. If you don't like this, you can reset the row names with rownames(result) <- seq(nrow(result))
result
#> id group y
#> 4.14 5010 4 19
#> 4.14.1 5010 4 19
#> 4.11 5006 4 7
#> 4.13 5009 4 22
#> 1.3 1003 1 3
#> 1.3.1 1003 1 3
#> 1.2 1002 1 15
#> 3.9 4001 3 12
#> 3.9.1 4001 3 12
#> 2.5 3003 2 15
#> 2.5.1 3003 2 15
#> 2.6 3005 2 37
#> 2.7 3006 2 32
#> 2.5.2 3003 2 15
Created on 2020-02-15 by the reprex package (v0.3.0)
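If you need to repeat this many times for the bootstrap, the steps above could be wrapped in a function along these lines. This is only a sketch: two_stage_sample and boot_stats are made-up names, the summary statistic (mean of the group means) is a stand-in for whatever you actually calculate, and y is assumed to be numeric here (the structure() call above stores it as a factor, which you would need to convert first).

```r
# Example data with numeric y (same values as in the question)
data1 <- data.frame(
  id    = c(1001, 1002, 1003, 3002, 3003, 3005, 3006, 3007,
            4001, 4002, 5006, 5007, 5009, 5010),
  group = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4),
  y     = c(10, 15, 3, 24, 15, 37, 32, 11, 12, 15, 7, 9, 22, 19)
)

# Hypothetical helper: sample groups with replacement, then rows within each
two_stage_sample <- function(data, group_col) {
  split_dfs <- split(data, data[[group_col]], drop = TRUE)
  picked <- sample(length(split_dfs), replace = TRUE)
  resampled <- lapply(split_dfs[picked], function(g) {
    g[sample(nrow(g), replace = TRUE), , drop = FALSE]
  })
  out <- do.call(rbind, resampled)
  rownames(out) <- NULL  # drop the .1 style row names
  out
}

set.seed(1)
one_draw <- two_stage_sample(data1, "group")

# 200 bootstrap replicates of an illustrative statistic
boot_stats <- replicate(200, {
  s <- two_stage_sample(data1, "group")
  mean(tapply(s$y, s$group, mean))
})
```

Note that when a group is drawn twice, its rows are pooled under the same group label here; if each draw should count as a separate cluster, you would need to add a draw index column inside the helper.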
I need to embed a condition in a remove duplicates function. I am working with large student database from South Africa, a highly multilingual country. Last week you guys gave me the code to remove duplicates caused by retakes, but I now realise my language exam data shows some students offering more than 2 different languages.
The source data, simplified looks like this
STUDID MATSUBJ SCORE
101 AFRIKAANSB 1
101 AFRIKAANSB 4
102 ENGLISHB 2
102 ISIZULUB 7
102 ENGLISHB 5
The result file I need is
STUDID  MATSUBJ     SCORE  flagextra
101     AFRIKAANSB  4
102     ENGLISHB    5
102     ISIZULUB    7      1
I need to flag the extra language so that I can see what languages they are and make new category for this
Two stage procedure works better for me as a newbie to R:
1- remove the duplicates caused by subject retakes:
LANGSEC <- LANGSEC %>%
  group_by(STUDID, MATRICSUBJ) %>%
  top_n(1, SUBJSCORE) %>%
  ungroup()
2- Then flag one of the two subjects causing the remaining duplicates:
LANGSEC$flagextra <- as.integer(duplicated(LANGSEC$STUDID))
Then filter for this third language and make new file:
LANG3<-LANGSEC%>% filter(flagextra==1)
Then remove these from the other file:
LANG2<-LANGSEC %>% filter(!flagextra==1)
Maybe this helps
library(tidyverse)
df1 %>%
group_by(STUDID, MATSUBJ) %>%
summarise(SCORE = max(SCORE),
flagextra = as.integer(!sum(duplicated(MATSUBJ))))
# A tibble: 3 x 4
# Groups: STUDID [?]
# STUDID MATSUBJ SCORE flagextra
# <int> <chr> <dbl> <int>
#1 101 AFRIKAANSB 4 0
#2 102 ENGLISHB 5 0
#3 102 ISIZULUB 7 1
Or with base R
i1 <- !(duplicated(df1[1:2])|duplicated(df1[1:2], fromLast = TRUE))
transform(aggregate(SCORE ~ ., df1, max),
flagextra = as.integer(MATSUBJ %in% df1$MATSUBJ[i1]))
data
df1 <- structure(list(STUDID = c(101L, 101L, 102L, 102L, 102L), MATSUBJ
= c("AFRIKAANSB",
"AFRIKAANSB", "ENGLISHB", "ISIZULUB", "ENGLISHB"), SCORE = c(1L,
4L, 2L, 7L, 5L)), class = "data.frame", row.names = c(NA, -5L
))
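For comparison, the two-stage idea from the question (drop retakes first, then flag the extra language) can be sketched in base R on the same example data. The names scores and best are made up, and which of a student's languages gets flagged as "extra" is an assumption here: after sorting, it is simply every subject after the student's first one alphabetically.

```r
scores <- data.frame(
  STUDID  = c(101L, 101L, 102L, 102L, 102L),
  MATSUBJ = c("AFRIKAANSB", "AFRIKAANSB", "ENGLISHB", "ISIZULUB", "ENGLISHB"),
  SCORE   = c(1L, 4L, 2L, 7L, 5L),
  stringsAsFactors = FALSE
)

# Stage 1: keep the best score per student and subject (removes retakes)
best <- aggregate(SCORE ~ STUDID + MATSUBJ, data = scores, FUN = max)
best <- best[order(best$STUDID, best$MATSUBJ), ]
rownames(best) <- NULL

# Stage 2: any further row for the same student is an additional language
best$flagextra <- as.integer(duplicated(best$STUDID))
best
```

If the flag should follow a different rule (for example, flag the lower-scoring language), change the sort in stage 1 accordingly before calling duplicated.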
I am trying to write a function or use cut to assign a grouping variable to some date data when those dates are close (user definition of close). For example, I would like to create a common grouping variable for some samples that were collected on consecutive dates. I was thinking cut would work here, but then I realized cut doesn't group values when they are close; rather, it creates a series of groups based on a fixed sequence of breaks.
So take this dataframe for example:
df <- structure(list(Num = c(0.888401849195361, 0.185766335576773,
0.493163562379777, 0.13070688676089, 0.484760325402021, 0.603240836178884,
0.893201333936304, 0.641203448642045, 0.16957180458121, 0.0101411847863346
), Date = structure(c(10592, 10597, 10598, 10605, 10606, 10608,
10609, 10616, 10617, 10618), class = "Date"), day = c(1L, 6L,
7L, 14L, 15L, 17L, 18L, 25L, 26L, 27L)), .Names = c("Num", "Date",
"day"), row.names = c(NA, -10L), class = "data.frame")
If I were to apply the cut function as I understand its usage, like this:
df$cutVar <- cut(df$day, breaks= seq(0, 31, by = 1), right=TRUE)
I would be left with a range that went right through values that I'd prefer to be grouped together. For example, the 6th and 7th should be grouped together based on their proximity to each other. Similar to 14th and 15th and so on.
> df
Num Date day cutVar
1 0.88840185 1999-01-01 1 (0,1]
2 0.18576634 1999-01-06 6 (5,6]
3 0.49316356 1999-01-07 7 (6,7]
4 0.13070689 1999-01-14 14 (13,14]
5 0.48476033 1999-01-15 15 (14,15]
6 0.60324084 1999-01-17 17 (16,17]
7 0.89320133 1999-01-18 18 (17,18]
8 0.64120345 1999-01-25 25 (24,25]
9 0.16957180 1999-01-26 26 (25,26]
10 0.01014118 1999-01-27 27 (26,27]
So the basic question here is how to group a continuous variable (a date in this instance) such that close (defined by the user) numbers are grouped together in a factor range?
Is this something you'd like? Here 3 is a threshold I chose for convenience: a new group starts whenever the gap to the previous date is at least 3 days. It can be any number you prefer:
df$group <- cumsum(c(1, diff.Date(df$Date)) >= 3)
df
Num Date day group
1 0.88840185 1999-01-01 1 0
2 0.18576634 1999-01-06 6 1
3 0.49316356 1999-01-07 7 1
4 0.13070689 1999-01-14 14 2
5 0.48476033 1999-01-15 15 2
6 0.60324084 1999-01-17 17 2
7 0.89320133 1999-01-18 18 2
8 0.64120345 1999-01-25 25 3
9 0.16957180 1999-01-26 26 3
10 0.01014118 1999-01-27 27 3
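The same idea can be wrapped in a small function so the threshold becomes a parameter. This is a sketch, not a drop-in replacement: group_by_gap and max_gap are made-up names, the dates are assumed to be sorted already, and the groups are numbered from 1 rather than 0. A max_gap of 2 corresponds to the >= 3 threshold above for whole-day gaps.

```r
# Hypothetical helper: start a new group whenever the gap to the
# previous date exceeds max_gap days (dates assumed sorted ascending)
group_by_gap <- function(dates, max_gap) {
  gaps <- diff(as.numeric(dates))
  cumsum(c(TRUE, gaps > max_gap))
}

# Same dates as in the question's example
dates <- as.Date("1999-01-01") + c(0, 5, 6, 13, 14, 16, 17, 24, 25, 26)
groups <- group_by_gap(dates, max_gap = 2)
groups
# [1] 1 2 2 3 3 3 3 4 4 4
```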