Create a Column that Counts (one at a time) Repeated Occurrences Within Groups? (in R)

mydata <- read.table(header=TRUE, text="
Away Home Game.ID Points.A Points.H Series.ID Series.Wins
Denver Utah aaa123 121 123 aaabbb Utah
Denver Utah aaa124 132 116 aaabbb Denver
Utah Denver aaa125 117 121 aaabbb Denver
Utah Denver aaa126 112 120 aaabbb Denver
Denver Utah aaa127 115 122 aaabbb Utah
Atlanta Boston aab123 112 114 aaaccc Boston
")
I am trying to create an additional column that counts, one at a time, the Series.Wins within each Series.ID group. So, from the data above, that column would look like:
new.column <- c(1, 1, 2, 3, 2, 1)
The ultimate goal is to come up with a series record column with "home wins - away wins":
Record <- c("1-0", "1-1", "2-1", "3-1", "2-3", "1-0")

This seemed to work (note that in dplyr 1.0.0 the add argument of group_by() was renamed to .add):
mydata <- mydata %>%
  group_by(Series.Wins) %>%
  group_by(Series.ID, .add = TRUE) %>%
  mutate(id = seq_len(n()))
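Building on that id column, the full Record can be derived by counting, for each game, how many games in the series so far were won by that game's home and away teams. A minimal sketch (home.wins and away.wins are helper names of my own, not from the post); on the sample data it reproduces c("1-0", "1-1", "2-1", "3-1", "2-3", "1-0"):

library(dplyr)
mydata %>%
  mutate(across(c(Away, Home, Series.Wins), as.character)) %>%  # guard against factor columns in older R
  group_by(Series.ID) %>%
  mutate(
    # for game i, count series wins so far by the team that is home/away in game i
    home.wins = sapply(seq_len(n()), function(i) sum(Series.Wins[1:i] == Home[i])),
    away.wins = sapply(seq_len(n()), function(i) sum(Series.Wins[1:i] == Away[i])),
    Record    = paste(home.wins, away.wins, sep = "-")
  ) %>%
  ungroup()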

Related

How to filter by a given phrase in a single column in R

Below is the sample data. This seems pretty basic, but my internet searches have not yielded a clear answer. In this case, how would I create a new data frame where areaname ends with the phrase MSA, and not MicroSA or other possibilities?
areaname <- c("Albany NY MSA", "Albany GA MSA", "Aberdeen SD MicroSA", "Reno NV MSA", "Fernley NV MicroSA", "Syracuse NY MSA")
Employment <- c(100,104,108,112,116,88)
testitem <- data.frame(areaname, Employment)
library(dplyr)
testitem %>%
  filter(stringr::str_ends(areaname, "MSA"))
areaname Employment
1 Albany NY MSA 100
2 Albany GA MSA 104
3 Reno NV MSA 112
4 Syracuse NY MSA 88
Another option is to use stringi:
library(dplyr)
testitem %>%
  filter(stringi::stri_endswith_fixed(areaname, "MSA"))
Output
areaname Employment
1 Albany NY MSA 100
2 Albany GA MSA 104
3 Reno NV MSA 112
4 Syracuse NY MSA 88
Or you can use grepl:
library(dplyr)
testitem %>%
  filter(grepl("MSA$", areaname))
Or use endsWith:
testitem %>%
  filter(endsWith(areaname, "MSA"))
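All four approaches treat "MSA" as a literal suffix, which already excludes "MicroSA" (it ends in "oSA", not "MSA"). If you also want to guard against values where "MSA" is merely the tail of a longer token (say, a hypothetical "XMSA"), anchoring on the preceding space is one option; a small sketch:

testitem %>%
  filter(grepl(" MSA$", areaname))  # require "MSA" to be its own final word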

How to filter a dataframe so that it finds the maximum value for 10 unique occurrences of another variable

I have this dataframe, which I filter down to only include counties in the state of Washington and only the columns relevant to my question. What I want to do is filter the dataframe down to 10 rows only: the highest black prison population for each of the top 10 unique county names in Washington State, regardless of year. The part I am struggling with is that counties can't repeat, so each row should be the highest black prison population for one of the top 10 unique counties. Some of the counties also have null values for the black prison population. You should be able to reproduce this to get the updated dataframe.
library(dplyr)
incarceration <- read.csv("https://raw.githubusercontent.com/vera-institute/incarceration-trends/master/incarceration_trends.csv")
blackPrisPop <- incarceration %>%
  select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
  filter(state == "WA")
Sample of what the updated dataframe looks like (should include 1911 rows):
fips county_name state year black_pop_15to64 black_prison_pop
130 53005 Benton County WA 2001 1008 25
131 53005 Benton County WA 2002 1143 20
132 53005 Benton County WA 2003 1208 21
133 53005 Benton County WA 2004 1236 27
134 53005 Benton County WA 2005 1310 32
135 53005 Benton County WA 2006 1333 35
You can group_by the county_name and then use slice_max to take the row with the maximum value of black_prison_pop. With n = 1 you get one row per county, and with with_ties = FALSE you get a single row even in case of ties.
You can then arrange the black_prison_pop values in descending order to get the overall top 10 across all counties.
library(dplyr)
incarceration %>%
  select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
  filter(state == "WA") %>%
  group_by(county_name) %>%
  slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
  arrange(desc(black_prison_pop)) %>%
  head(10)
Output
black_prison_pop black_pop_15to64 year fips county_name state
<dbl> <dbl> <int> <int> <chr> <chr>
1 1845 73480 2002 53033 King County WA
2 975 47309 2013 53053 Pierce County WA
3 224 5890 2005 53063 Spokane County WA
4 172 19630 2015 53061 Snohomish County WA
5 137 8129 2016 53011 Clark County WA
6 129 5146 2003 53035 Kitsap County WA
7 102 5663 2009 53067 Thurston County WA
8 58 706 1991 53021 Franklin County WA
9 50 1091 1991 53077 Yakima County WA
10 46 1748 2008 53073 Whatcom County WA
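Since the asker mentions counties with null population data, it may be worth dropping those rows explicitly before slicing; a minimal sketch of that variant:

library(dplyr)
incarceration %>%
  select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
  filter(state == "WA", !is.na(black_prison_pop)) %>%  # drop NA populations up front
  group_by(county_name) %>%
  slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
  arrange(desc(black_prison_pop)) %>%
  head(10)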

R stops scraping when there is missing data

I am using this code to loop through multiple URLs to scrape data. The code works fine until it comes to a date that has missing data. This is the error message that pops up:
Error in data.frame(away, home, away1H, home1H, awayPinnacle, homePinnacle) :
arguments imply differing number of rows: 7, 8
I am very new to coding and could not figure out how to make it keep scraping despite the missing data.
library(rvest)
library(dplyr)
get_data <- function(date) {
  # Specifying URL
  url <- paste0('https://classic.sportsbookreview.com/betting-odds/nba-basketball/money-line/1st-half/?date=', date)
  # Reading the HTML code from website
  oddspage <- read_html(url)
  # Using CSS selectors to scrape away teams
  awayHtml <- html_nodes(oddspage, '.eventLine-value:nth-child(1) a')
  # Using CSS selectors to scrape 1Q scores
  away1QHtml <- html_nodes(oddspage, '.current-score+ .first')
  away1Q <- html_text(away1QHtml)
  away1Q <- as.numeric(away1Q)
  home1QHtml <- html_nodes(oddspage, '.score-periods+ .score-periods .current-score+ .period')
  home1Q <- html_text(home1QHtml)
  home1Q <- as.numeric(home1Q)
  # Using CSS selectors to scrape 2Q scores
  away2QHtml <- html_nodes(oddspage, '.first:nth-child(3)')
  away2Q <- html_text(away2QHtml)
  away2Q <- as.numeric(away2Q)
  home2QHtml <- html_nodes(oddspage, '.score-periods+ .score-periods .period:nth-child(3)')
  home2Q <- html_text(home2QHtml)
  home2Q <- as.numeric(home2Q)
  # Creating first-half scores
  away1H <- away1Q + away2Q
  home1H <- home1Q + home2Q
  # Using CSS selectors to scrape final scores
  awayScoreHtml <- html_nodes(oddspage, '.first.total')
  awayScore <- html_text(awayScoreHtml)
  awayScore <- as.numeric(awayScore)
  homeScoreHtml <- html_nodes(oddspage, '.score-periods+ .score-periods .total')
  homeScore <- html_text(homeScoreHtml)
  homeScore <- as.numeric(homeScore)
  # Converting away data to text
  away <- html_text(awayHtml)
  # Using CSS selectors to scrape home teams
  homeHtml <- html_nodes(oddspage, '.eventLine-value+ .eventLine-value a')
  # Converting home data to text
  home <- html_text(homeHtml)
  # Using CSS selectors to scrape Pinnacle away odds
  awayPinnacleHtml <- html_nodes(oddspage, '.eventLine-consensus+ .eventLine-book .eventLine-book-value:nth-child(1) b')
  # Converting away odds to text
  awayPinnacle <- html_text(awayPinnacleHtml)
  # Converting away odds to numeric
  awayPinnacle <- as.numeric(awayPinnacle)
  # Using CSS selectors to scrape Pinnacle home odds
  homePinnacleHtml <- html_nodes(oddspage, '.eventLine-consensus+ .eventLine-book .eventLine-book-value+ .eventLine-book-value b')
  # Converting home odds to text
  homePinnacle <- html_text(homePinnacleHtml)
  # Converting home odds to numeric
  homePinnacle <- as.numeric(homePinnacle)
  # Create data frame (this errors when the vectors have differing lengths)
  df <- data.frame(away, home, away1H, home1H, awayPinnacle, homePinnacle)
}
date_vec <- sprintf('201902%02d', 02:06)
all_data <- do.call(rbind, lapply(date_vec, get_data))
View(all_data)
I'd recommend purrr::map() instead of lapply; map_dfr() additionally row-binds the results into one data frame. Then you can wrap your call to get_data() with possibly(), which is a nice way to catch errors and keep going.
library(purrr)
map_dfr(date_vec, possibly(get_data, otherwise = data.frame()))
Output:
away home away1H home1H awayPinnacle homePinnacle
1 L.A. Clippers Detroit 47 65 116 -131
2 Milwaukee Washington 73 50 -181 159
3 Chicago Charlotte 60 51 192 -220
4 Brooklyn Orlando 48 44 121 -137
5 Indiana Miami 53 54 117 -133
6 Dallas Cleveland 58 55 -159 140
7 L.A. Lakers Golden State 58 63 513 -651
8 New Orleans San Antonio 50 63 298 -352
9 Denver Minnesota 61 64 107 -121
10 Houston Utah 63 50 186 -213
11 Atlanta Phoenix 58 57 110 -125
12 Philadelphia Sacramento 52 62 -139 123
13 Memphis New York 42 41 -129 114
14 Oklahoma City Boston 58 66 137 -156
15 L.A. Clippers Toronto 51 65 228 -263
16 Atlanta Washington 61 57 172 -196
17 Denver Detroit 55 68 -112 -101
18 Milwaukee Brooklyn 51 42 -211 184
19 Indiana New Orleans 53 50 -143 127
20 Houston Phoenix 63 57 -256 222
21 San Antonio Sacramento 59 63 -124 110
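One caveat: possibly() drops an entire date whenever any scraped vector comes back short. If you would rather keep the partial data, a rough sketch (untested against the live site; note that padding short vectors with NA can misalign rows, so inspect the result) is to normalize lengths before building the data frame:

pad_to <- function(x, len) { length(x) <- len; x }  # extends an atomic vector with NA

build_df <- function(away, home, away1H, home1H, awayPinnacle, homePinnacle) {
  cols <- list(away = away, home = home, away1H = away1H,
               home1H = home1H, awayPinnacle = awayPinnacle,
               homePinnacle = homePinnacle)
  len <- max(lengths(cols))
  data.frame(lapply(cols, pad_to, len = len))
}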

R - Converting one table to multiple tables

I have a csv file called data.csv that contains 3 tables stacked in a single file. I would like to separate them into 3 different data.frames when I import the file into R.
So far this is what I have got after running this code:
df <- read.csv("data.csv")
View(df)
Student
Name Score
Maria 18
Bob 25
Paul 27
Region
Country Score
Italy 65
India 99
United 88
Sub region
City Score
Paris 77
New 55
Rio 78
How can I split them in such a way that I get this result:
First:
View(StudentDataFrame)
Name Score
Maria 18
Bob 25
Paul 27
Second:
View(regionDataFrame)
Country Score
Italy 65
India 99
United 88
Third:
View(SubRegionDataFrame)
City Score
Paris 77
New 55
Rio 78
One option would be to read the dataset with readLines, create a grouping variable ('grp') based on where 'Student', 'Region', and 'Sub region' occur in the 'lines', split on it, and read each chunk with read.table:
i1 <- trimws(lines) %in% c("Student", "Region", "Sub region")
grp <- cumsum(i1)
lst <- lapply(split(lines[!i1], grp[!i1]), function(x)
  read.table(text = x, stringsAsFactors = FALSE, header = TRUE))
lst
#$`1`
# Name Score
#1 Maria 18
#2 Bob 25
#3 Paul 27
#$`2`
# Country Score
#1 Italy 65
#2 India 99
#3 United 88
#$`3`
# City Score
#1 Paris 77
#2 New 55
#3 Rio 78
data
lines <- readLines("data.csv")
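To end up with the three separately named data frames from the question, you can name the list elements and push them into the workspace:

names(lst) <- c("StudentDataFrame", "regionDataFrame", "SubRegionDataFrame")
list2env(lst, envir = .GlobalEnv)  # creates the three objects in the workspace
View(StudentDataFrame)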

Showing multiple columns in aggregate function including strings/characters in R

R noob question here.
Let's say I have this data frame:
City State Pop
Fresno CA 494
San Franciso CA 805
San Jose CA 945
San Diego CA 1307
Los Angeles CA 3792
Reno NV 225
Henderson NV 257
Las Vegas NV 583
Gresham OR 105
Salem OR 154
Eugene OR 156
Portland OR 583
Fort Worth TX 741
Austin TX 790
Dallas TX 1197
San Antonio TX 1327
Houston TX 2100
I want to get, let's say, the 3rd lowest population per State, which would give:
City State Pop
San Jose CA 945
Las Vegas NV 583
Eugene OR 156
Dallas TX 1197
I tried this one:
ord_pop_state <- aggregate(Pop ~ State , data = ord_pop, function(x) { x[3] } )
And I get this one:
State Pop
CA 945
NV 583
OR 156
TX 1197
What am I missing here, in order to get the desired output that includes the City?
I would suggest trying the data.table package for such a task, as the syntax is easier and the code is more efficient. I would also suggest adding the order function to make sure the data is sorted:
library(data.table)
setDT(ord_pop)[order(Pop), .SD[3L], keyby = State]
# State City Pop
# 1: CA San Jose 945
# 2: NV Las Vegas 583
# 3: OR Eugene 156
# 4: TX Dallas 1197
So basically, the data is first ordered by Pop, and then we subset .SD (data.table's notation for the Subset of Data belonging to each group) by State.
This is easily solvable with base R too (assuming the data is sorted): we can create an index per group and then do a simple subset by that index
ord_pop$indx <- with(ord_pop, ave(Pop, State, FUN = seq_along))  # seq_along is safer than seq for single-row groups
ord_pop[ord_pop$indx == 3L, ]
# City State Pop indx
# 3 San Jose CA 945 3
# 8 Las Vegas NV 583 3
# 11 Eugene OR 156 3
# 15 Dallas TX 1197 3
Here is a dplyr version:
df2 <- df %>%
  group_by(state) %>%  # Group observations by state
  arrange(-pop) %>%    # Within those groups, sort in descending order by pop
  slice(3)             # Extract the third row in each arranged group
Here's the toy data I used to test it:
set.seed(1)
df <- data.frame(state = rep(LETTERS[1:3], each = 5), city = rep(letters[1:5], 3), pop = round(rnorm(15, 1000, 100), digits=0))
And here's the output from that; it's a coincidence that 'b' was third-largest in each case, not a glitch in the code:
> df2
Source: local data frame [3 x 3]
Groups: state
state city pop
1 A b 1018
2 B b 1049
3 C b 1039
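Note that arrange(-pop) sorts descending, so slice(3) picks the 3rd highest per group; for the 3rd lowest, as in the original question, flip the sort. A small variant of the same pipeline:

df2 <- df %>%
  group_by(state) %>%
  arrange(pop, .by_group = TRUE) %>%  # ascending, so row 3 is the 3rd lowest per state
  slice(3)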
In R, the same end results can be achieved using different packages. The choice of package is a trade-off between efficiency and simplicity of code.
Since you come from a strong SQL background, this might be easier to use:
library(sqldf)
#Example: return the overall 3rd lowest population (not yet per State)
result <- sqldf('Select City, State, Pop from data order by Pop limit 1 offset 2;')
#Note: the SQL query is a sample and needs to be modified to get the desired per-State result.
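For the per-State version, SQLite's window functions (available through sqldf with a reasonably recent RSQLite) can rank within each State; a sketch, assuming the data frame is named ord_pop:

library(sqldf)
result <- sqldf("
  SELECT City, State, Pop FROM (
    SELECT City, State, Pop,
           ROW_NUMBER() OVER (PARTITION BY State ORDER BY Pop) AS rn
    FROM ord_pop
  ) WHERE rn = 3
")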
