Summary output to independent dataset - r

I'm working with a Twitter dataset I got with rtweet. I created a state variable based on the coordinates (when available).
My output so far is this:
> summary(rt1$state)
             alabama              arizona             arkansas           california             colorado          connecticut
                   3                    6                    2                  104                    5                    1
            delaware district of columbia              florida              georgia                idaho             illinois
                   1                    0                   17                    7                    0                   12
             indiana                 iowa               kansas             kentucky            louisiana                maine
                   4                    1                    2                    3                    2                    1
            maryland        massachusetts             michigan            minnesota          mississippi             missouri
                   1                    2                    9                    6                    0                    2
             montana             nebraska               nevada        new hampshire           new jersey           new mexico
                   0                    3                    5                    1                    4                    7
            new york       north carolina         north dakota                 ohio             oklahoma               oregon
                  25                    8                    1                    3                    2                    4
        pennsylvania         rhode island       south carolina         south dakota            tennessee                texas
                  22                    0                    2                    1                    3                   35
                utah              vermont             virginia           washington        west virginia            wisconsin
                   2                    1                    3                    5                    0                    2
             wyoming                 NA's
                   1                17669
Can you please advise on how I can create an independent dataset from the output above, so that I have two columns (state and n)?
Thanks

We can wrap with stack to create a two column data.frame from the OP's code
out <- stack(summary(rt1$state))[2:1]
names(out) <- c("state", "n")
Or another option in base R is
as.data.frame(table(rt1$state))
A reproducible example
data(iris)
out <- stack(summary(iris$Species))[2:1]
Or with table
as.data.frame(table(iris$Species))
Or enframe from tibble
library(tibble)
library(tidyr)
enframe(summary(rt1$state)) %>%
unnest(c(value))

Or maybe you can work directly on your rt1 dataframe:
dplyr::count(rt1, state)
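One detail to keep in mind when choosing between these options: summary() on a factor reports missing values under NA's, while table() drops them unless told otherwise. A minimal sketch on a made-up factor (the vector state below is hypothetical, not the OP's data):

```r
# Hypothetical stand-in for rt1$state
state <- factor(c("texas", "ohio", "texas", NA, "ohio", "texas", NA))

# table() omits NA by default; useNA = "ifany" keeps it as its own row
counts <- as.data.frame(table(state, useNA = "ifany"), stringsAsFactors = FALSE)
names(counts) <- c("state", "n")
counts
```

Leaving out useNA gives only the rows for the observed states, which may be what you want if the NA count is not of interest.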

Related

How to manually enter a cell in a dataframe? [duplicate]

This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of lines long, and the row where the NA appears changes from day to day. I would like to fill it in based on the county. Is there a way to say df[df$county == "New York C",] <- df$FIPS = "55555" or something like that? I don't want to insert based on the column or row number because they change.
This will put 55555 into the NA cells within column FIPS where county is New York C:
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
You could use & (and) to replace the df$FIPS entries that meet the two desired conditions.
df$FIPS[is.na(df$FIPS) & df$state == "New York"] <- 55555
If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))
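For reference, a self-contained sketch of the conditional replacement on a small made-up data frame (the rows are abbreviated from the question, and the helper name hit is only for illustration). It shows that a "New York C" row that already has a FIPS value is left alone:

```r
# Abbreviated, hypothetical version of the question's data
df <- data.frame(
  county = c("Abbeville", "New York C", "Ada", "New York C"),
  state  = c("South Carolina", "New York", "Idaho", "New York"),
  FIPS   = c(45001, NA, 16001, 18000)
)

# Only rows that are both NA in FIPS and match the county are touched
hit <- is.na(df$FIPS) & df$county == "New York C"
df$FIPS[hit] <- 55555
df
```

Because the rows are selected by a logical condition rather than by position, the same two lines keep working when the NA moves to a different row.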

Building a prediction model with the dpois function in R

Hello! I am in the beginning stages of building (and learning!) how to build prediction models for sports, specifically using NHL statistics.
I have all the game outcomes of the NHL since 1990, and I want to use the number of goals to predict outcomes in future games (just based on goals, for now).
Below is an excerpt of my data set, but the full data set can be found in this Git link:
https://github.com/papelr/nhldatar/blob/master/nhldatar/data/NHL_outcomes.rda
Date Visitor GVisitor Home GHome Att.
1 1990-10-04 Philadelphia Flyers 1 Boston Bruins 4 <NA>
2 1990-10-04 Montreal Canadiens 3 Buffalo Sabres 3 <NA>
3 1990-10-04 Vancouver Canucks 2 Calgary Flames 3 <NA>
4 1990-10-04 New York Rangers 3 Chicago Blackhawks 4 <NA>
5 1990-10-04 Quebec Nordiques 3 Hartford Whalers 3 <NA>
6 1990-10-04 New York Islanders 1 Los Angeles Kings 4 <NA>
7 1990-10-04 St. Louis Blues 3 Minnesota North Stars 2 <NA>
8 1990-10-04 Detroit Red Wings 3 New Jersey Devils 3 <NA>
9 1990-10-04 Toronto Maple Leafs 1 Winnipeg Jets 7 <NA>
10 1990-10-05 Pittsburgh Penguins 7 Washington Capitals 4 <NA>
11 1990-10-06 Quebec Nordiques 1 Boston Bruins 7 <NA>
12 1990-10-06 Toronto Maple Leafs 1 Calgary Flames 4 <NA>
13 1990-10-06 Winnipeg Jets 3 Edmonton Oilers 3 <NA>
14 1990-10-06 New York Rangers 4 Hartford Whalers 5 <NA>
15 1990-10-06 Vancouver Canucks 6 Los Angeles Kings 3 <NA>
16 1990-10-06 New York Islanders 2 Minnesota North Stars 4 <NA>
17 1990-10-06 Buffalo Sabres 5 Montreal Canadiens 6 <NA>
18 1990-10-06 Philadelphia Flyers 1 New Jersey Devils 3 <NA>
19 1990-10-06 Chicago Blackhawks 5 St. Louis Blues 2 <NA>
20 1990-10-06 Detroit Red Wings 4 Washington Capitals 6 <NA>
21 1990-10-07 New York Islanders 4 Chicago Blackhawks 2 <NA>
22 1990-10-07 Toronto Maple Leafs 2 Edmonton Oilers 3 <NA>
23 1990-10-07 Detroit Red Wings 2 Philadelphia Flyers 7 <NA>
24 1990-10-07 New Jersey Devils 4 Pittsburgh Penguins 7 <NA>
25 1990-10-07 Boston Bruins 5 Quebec Nordiques 2 <NA>
26 1990-10-08 Hartford Whalers 3 Montreal Canadiens 5 <NA>
27 1990-10-08 Minnesota North Stars 3 New York Rangers 6 <NA>
28 1990-10-08 Calgary Flames 4 Winnipeg Jets 3 <NA>
29 1990-10-09 Minnesota North Stars 2 New Jersey Devils 5 <NA>
30 1990-10-09 Pittsburgh Penguins 3 St. Louis Blues 4 <NA>
31 1990-10-09 Los Angeles Kings 6 Vancouver Canucks 2 <NA>
32 1990-10-10 Calgary Flames 5 Detroit Red Wings 6 <NA>
33 1990-10-10 Buffalo Sabres 3 Hartford Whalers 4 <NA>
34 1990-10-10 Washington Capitals 2 New York Rangers 4 <NA>
35 1990-10-10 Quebec Nordiques 8 Toronto Maple Leafs 5 <NA>
36 1990-10-10 Boston Bruins 4 Winnipeg Jets 2 <NA>
37 1990-10-11 Pittsburgh Penguins 1 Chicago Blackhawks 4 <NA>
38 1990-10-11 Edmonton Oilers 5 Los Angeles Kings 5 <NA>
39 1990-10-11 Boston Bruins 3 Minnesota North Stars 3 <NA>
40 1990-10-11 New Jersey Devils 4 Philadelphia Flyers 7 <NA>
This is the prediction model that I have come up with so far, and I have failed to get the matrix that should come with my simulate match line below. Any help would be great.
# Using number of goals for prediction model
model_one <-
rbind(
data.frame(goals = outcomes$GHome,
team = outcomes$Home,
opponent = outcomes$Visitor,
home = 1),
data.frame(goals = outcomes$GVisitor,
team = outcomes$Visitor,
opponent = outcomes$Home,
home = 0)) %>%
glm(goals ~ home + team + opponent,
family = poisson (link = log), data = .)
summary(model_one)
# Probability function / matrix
simulate_game <- function(stat_model, homeTeam, awayTeam, max_goals =
10) {
home_goals <- predict(model_one,
data.frame(home = 1,
team = homeTeam,
opponent = awayTeam),
type ="response")
away_goals <- predict(model_one,
data.frame(home = 0,
team = awayTeam,
opponent = homeTeam),
type ="response")
dpois(0: max_goals, home_goals) %>%
dpois(0: max_goals, away_goals)
}
simulate_game(model_one, "Nashville Predators", "Chicago Blackhawks",
max_goals = 10)
I totally understand that a Poisson model isn't the best for sports predictions, but I am rebuilding a model I found for the EPL for learning/practice reasons, and adapting it to the NHL (from David Sheehan's model, https://dashee87.github.io/data%20science/football/r/predicting-football-results-with-statistical-modelling/).
Any tips would be great, because currently, this model returns a bunch of warnings:
There were 11 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In dpois(., 0:max_goals, away_goals_avg) : non-integer x = 0.062689
2: In dpois(., 0:max_goals, away_goals_avg) : non-integer x = 0.173621
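The warnings hint at the cause: %>% passes the result of the first dpois() call as the x argument of the second dpois() call, which is why x is reported as non-integer. To get a matrix of score probabilities, the two probability vectors can be combined with the outer product operator %o% instead, which appears to be what the referenced EPL model does. A minimal sketch under that assumption, using made-up teams and simulated scores (all names and data below are hypothetical) rather than the NHL data set:

```r
# Hypothetical toy data standing in for the NHL outcomes table
set.seed(1)
teams <- c("A", "B", "C", "D")
games <- expand.grid(Home = teams, Visitor = teams, stringsAsFactors = FALSE)
games <- games[games$Home != games$Visitor, ]
games <- games[rep(seq_len(nrow(games)), 10), ]
games$GHome    <- rpois(nrow(games), 3.2)
games$GVisitor <- rpois(nrow(games), 2.8)

# One row per team per game, as in the question
long <- rbind(
  data.frame(goals = games$GHome, team = games$Home,
             opponent = games$Visitor, home = 1),
  data.frame(goals = games$GVisitor, team = games$Visitor,
             opponent = games$Home, home = 0)
)
fit <- glm(goals ~ home + team + opponent,
           family = poisson(link = "log"), data = long)

simulate_game <- function(stat_model, homeTeam, awayTeam, max_goals = 10) {
  # Expected goals for each side, using the model passed in as stat_model
  home_goals <- predict(stat_model,
                        data.frame(home = 1, team = homeTeam, opponent = awayTeam),
                        type = "response")
  away_goals <- predict(stat_model,
                        data.frame(home = 0, team = awayTeam, opponent = homeTeam),
                        type = "response")
  # %o% is the outer product: cell [i, j] is P(home = i - 1) * P(away = j - 1)
  dpois(0:max_goals, home_goals) %o% dpois(0:max_goals, away_goals)
}

probs <- simulate_game(fit, "A", "B", max_goals = 10)
dim(probs)
```

Rows index home goals and columns index away goals (0 to max_goals), so for example sum(probs[lower.tri(probs)]) approximates the probability of a home win.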

How to flatten data.frame for use with googlevis treemap?

In order to use the treemap function on googleVis, data needs to be flattened into two columns. Using their example:
> library(googleVis)
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
However, in the real world this information more frequently looks like this:
> a <- data.frame(
+ scal=c("Global", "Global", "Global", "Global", "Global", "Global", "Global"),
+ cont=c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
+ country=c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
+ val=c(71, 89, 58, 2, 38, 5, 48),
+ fac=c(2,3,10,9,11,1,11))
> a
scal cont country val fac
1 Global Europe France 71 2
2 Global Europe Sweden 89 3
3 Global Europe Germany 58 10
4 Global America Mexico 2 9
5 Global America USA 38 11
6 Global Asia China 5 1
7 Global Asia Japan 48 11
But how can this data be transformed most efficiently?
If we use dplyr, this script will transform the data correctly:
library(dplyr)
cbind(NA,a %>% group_by(scal) %>% summarize(val=sum(val),fac=sum(fac))) -> topLev
names(topLev) <- c("Parent","Region","val","fac")
a %>% group_by(scal,cont) %>% summarize(val=sum(val),fac=sum(fac)) %>%
select(Region=cont,Parent=scal,val,fac) -> midLev
a[,2:5] %>% select(Region=country,Parent=cont,val,fac) -> bottomLev
bind_rows(topLev,midLev,bottomLev) %>% select(2,1,3,4) -> answer
We can verify this by comparing dataframes:
> answer
Source: local data frame [11 x 4]
Region Parent val fac
1 Global NA 311 47
2 America Global 40 20
3 Asia Global 53 12
4 Europe Global 218 15
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
Interestingly, in the Regions example the values for the continents and the globe aren't the sum of their components (or their min/max/ave/mean/normalized...).

Error/exception handling with bind_rows() and lapply() functions

I have a function that scrapes a table from a list of urls:
getscore <- function(www0) {
require(rvest)
require(dplyr)
www <- read_html(www0) # html() is deprecated in newer rvest; read_html() replaces it
boxscore <- www %>% html_table(fill = TRUE) %>% .[[1]]
names(boxscore)[3] <- "VG"
names(boxscore)[5] <- "HG"
names(boxscore)[6] <- "Type"
return(boxscore)
}
Working example data:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/20/",
"http://www.hockey-reference.com/boxscores/2014/12/21/",
"http://www.hockey-reference.com/boxscores/2014/12/22/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
However, urls without games played will break my function:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/23/",
"http://www.hockey-reference.com/boxscores/2014/12/24/",
"http://www.hockey-reference.com/boxscores/2014/12/25/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
How might I build error/exception handling into my function to skip the urls that break?
Code should be reproducible...
The table you obtain when there are no games has an entirely different structure. You could check whether colnames(boxscore) are as expected. As an example, I include an adaptation of your function that checks whether the column Visitor is available.
getscore <- function(www0) {
require(rvest)
require(dplyr)
www <- read_html(www0) # html() is deprecated in newer rvest; read_html() replaces it
boxscore <- www %>% html_table(fill = TRUE) %>% .[[1]]
if ("Visitor" %in% colnames(boxscore)){
names(boxscore)[3] <- "VG"
names(boxscore)[5] <- "HG"
names(boxscore)[6] <- "Type"
return(boxscore)
}
}
With this function, your example does not break:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/23/",
"http://www.hockey-reference.com/boxscores/2014/12/24/",
"http://www.hockey-reference.com/boxscores/2014/12/25/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
A nice approach here is to use rbindlist from the data.table package, which accepts fill=TRUE, so you can bind all the tables, including the ones for which bind_rows does not work. You can then keep the rows with a non-NA Date (the NA rows come from the web pages for which bind_rows does not work) and restrict the result to the first 6 columns, which I guess are the ones you want as valid data.
library(data.table) # development version 1.9.5
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/20/",
"http://www.hockey-reference.com/boxscores/2014/12/21/",
"http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/24/") # not working
resdt<-rbindlist(
lapply(
www_list, function(www0){
message ("web is ", www0) # comment out this if you don't want message to appear
getscore(www0)}),fill=TRUE)
resdt[!is.na(Date),1:6,with=FALSE] # the first 6 columns hold the valid data
Date Visitor VG Home HG Type
1: 2014-12-20 Colorado Avalanche 5 Buffalo Sabres 1
2: 2014-12-20 New York Rangers 3 Carolina Hurricanes 2 SO
3: 2014-12-20 Chicago Blackhawks 2 Columbus Blue Jackets 3 SO
4: 2014-12-20 Arizona Coyotes 2 Los Angeles Kings 4
5: 2014-12-20 Nashville Predators 6 Minnesota Wild 5 OT
6: 2014-12-20 Ottawa Senators 1 Montreal Canadiens 4
7: 2014-12-20 Washington Capitals 4 New Jersey Devils 0
8: 2014-12-20 Tampa Bay Lightning 1 New York Islanders 3
9: 2014-12-20 Florida Panthers 1 Pittsburgh Penguins 3
10: 2014-12-20 St. Louis Blues 2 San Jose Sharks 3 OT
11: 2014-12-20 Philadelphia Flyers 7 Toronto Maple Leafs 4
12: 2014-12-20 Calgary Flames 2 Vancouver Canucks 3 OT
13: 2014-12-21 Buffalo Sabres 3 Boston Bruins 4 OT
14: 2014-12-21 Toronto Maple Leafs 0 Chicago Blackhawks 4
15: 2014-12-21 Colorado Avalanche 2 Detroit Red Wings 1 SO
16: 2014-12-21 Dallas Stars 6 Edmonton Oilers 5 SO
17: 2014-12-21 Carolina Hurricanes 0 New York Rangers 1
18: 2014-12-21 Philadelphia Flyers 4 Winnipeg Jets 3 OT
19: 2014-12-22 San Jose Sharks 2 Anaheim Ducks 3 OT
20: 2014-12-22 Nashville Predators 5 Columbus Blue Jackets 1
21: 2014-12-22 Pittsburgh Penguins 3 Florida Panthers 4 SO
22: 2014-12-22 Calgary Flames 4 Los Angeles Kings 3 OT
23: 2014-12-22 Arizona Coyotes 1 Vancouver Canucks 7
24: 2014-12-22 Ottawa Senators 1 Washington Capitals 2
Date Visitor VG Home HG Type
If you are not familiar with data.table, you can use it just for rbindlist, then convert the data.table back to a data.frame and perform the usual data.frame operations. But you should really learn data.table, because it is very fast and efficient on big data.
resdf<-as.data.frame(resdt)
with(resdf,resdf[!is.na(Date),1:6])
Date Visitor VG Home HG Type
1 2014-12-20 Colorado Avalanche 5 Buffalo Sabres 1
2 2014-12-20 New York Rangers 3 Carolina Hurricanes 2 SO
3 2014-12-20 Chicago Blackhawks 2 Columbus Blue Jackets 3 SO
4 2014-12-20 Arizona Coyotes 2 Los Angeles Kings 4
5 2014-12-20 Nashville Predators 6 Minnesota Wild 5 OT
6 2014-12-20 Ottawa Senators 1 Montreal Canadiens 4
7 2014-12-20 Washington Capitals 4 New Jersey Devils 0
8 2014-12-20 Tampa Bay Lightning 1 New York Islanders 3
9 2014-12-20 Florida Panthers 1 Pittsburgh Penguins 3
10 2014-12-20 St. Louis Blues 2 San Jose Sharks 3 OT
11 2014-12-20 Philadelphia Flyers 7 Toronto Maple Leafs 4
12 2014-12-20 Calgary Flames 2 Vancouver Canucks 3 OT
13 2014-12-21 Buffalo Sabres 3 Boston Bruins 4 OT
14 2014-12-21 Toronto Maple Leafs 0 Chicago Blackhawks 4
15 2014-12-21 Colorado Avalanche 2 Detroit Red Wings 1 SO
16 2014-12-21 Dallas Stars 6 Edmonton Oilers 5 SO
17 2014-12-21 Carolina Hurricanes 0 New York Rangers 1
18 2014-12-21 Philadelphia Flyers 4 Winnipeg Jets 3 OT
19 2014-12-22 San Jose Sharks 2 Anaheim Ducks 3 OT
20 2014-12-22 Nashville Predators 5 Columbus Blue Jackets 1
21 2014-12-22 Pittsburgh Penguins 3 Florida Panthers 4 SO
22 2014-12-22 Calgary Flames 4 Los Angeles Kings 3 OT
23 2014-12-22 Arizona Coyotes 1 Vancouver Canucks 7
24 2014-12-22 Ottawa Senators 1 Washington Capitals 2
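Since the question asks about error/exception handling in general, here is a minimal sketch of the tryCatch() pattern, assuming it is acceptable to simply drop any URL whose scrape fails. parse_page and safe_parse are hypothetical names; parse_page stands in for getscore() so the example runs without network access:

```r
# Hypothetical stand-in for getscore(): fails for some inputs
parse_page <- function(x) {
  if (x %% 2 == 0) stop("no games on this page")
  data.frame(id = x, score = x * 10)
}

# Wrap the call so a failure yields NULL instead of aborting lapply()
safe_parse <- function(x) {
  tryCatch(parse_page(x), error = function(e) {
    message("skipping ", x, ": ", conditionMessage(e))
    NULL
  })
}

pages <- lapply(1:5, safe_parse)
# Drop the NULL entries, then bind the remaining data frames
result <- do.call(rbind, Filter(Negate(is.null), pages))
result
```

Unlike the column check above, this only helps when the scrape actually throws an error; a page that parses into a differently shaped table would still need a structural check.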

Add new column to long dataframe from another dataframe?

Say that I have two dataframes. One lists the names of soccer players, the teams that they have played for, and the number of goals that they have scored on each team. The other contains the soccer players' ages and their names. How do I add a "names_age" column to the goal dataframe that holds the age of the players in the first column "names", not of those in "teammates_names"? How do I add an additional column that holds the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
measure.vars = c("names", "teammates_names"),
value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so you can use dcast to get back to the wide format and retain the row ordering if it's important.
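To get the two age columns the question asks for, the match approach can be applied once per name column. A minimal sketch on an abbreviated, made-up subset of GOALS_DF (the column name teammates_age is an assumption; names_age is taken from the question):

```r
AGE_DF <- data.frame(
  names = c("Sam", "Jon", "Adam", "Jason", "Jones", "Jermaine"),
  age   = 20:25
)
# Abbreviated, hypothetical subset of GOALS_DF
GOALS_DF <- data.frame(
  names           = c("Sam", "Sam", "Jon", "Adam"),
  goals           = c(1, 2, 1, 1),
  teammates_names = c("Jason", "Jason", "Jones", "Jermaine")
)

# match() returns, for each name, its row position in AGE_DF
GOALS_DF$names_age     <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]
GOALS_DF
```

This keeps the row order of GOALS_DF unchanged, and a name that is missing from AGE_DF simply gets NA as its age.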

Resources