How to import data from an external website into R?

For my project, I need to read data (an Excel sheet) from a website directly into the R working environment. How can this be done?
For the time being, this can be used as an example URL: "https://www.contextures.com/tablesamples/sampledatahockey.zip"

You can try:
library(readxl)
download.file("https://www.contextures.com/tablesamples/sampledatahockey.zip",
              destfile = "sampledatahockey.zip")
unzip("sampledatahockey.zip")
read_excel("sampledatahockey.xlsx", sheet = "PlayerData", skip = 2)
Output is:
# A tibble: 96 × 15
ID Team Country NameF NameL Weight Height DOB Hometown Prov Pos Age HeightFt HtIn BMI
<dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Women Canada Meghan Agosta 148 5'7 1987-02-12 00:00:00 Ruthven Ont. Forward 34 5.58 67 23
2 2 Women Canada Rebecca Johnston 148 5'9 1989-09-24 00:00:00 Sudbury Ont. Forward 32 5.75 69 22
3 3 Women Canada Laura Stacey 156 5'10 1994-05-05 00:00:00 Kleinburg Ont. Forward 27 5.83 70 22
4 4 Women Canada Jennifer Wakefield 172 5'10 1989-06-15 00:00:00 Pickering Ont. Forward 32 5.83 70 25
5 5 Women Canada Jillian Saulnier 144 5'5 1992-03-07 00:00:00 Halifax N.S. Forward 29 5.42 65 24
6 6 Women Canada Mélodie Daoust 159 5'6 1992-01-07 00:00:00 Valleyfield Que. Forward 29 5.5 66 26
7 7 Women Canada Bailey Bram 150 5'8 1990-09-05 00:00:00 St. Anne Man. Forward 31 5.67 68 23
8 8 Women Canada Brianne Jenner 156 5'9 1991-05-04 00:00:00 Oakville Ont. Forward 30 5.75 69 23
9 9 Women Canada Sarah Nurse 140 5'8 1995-01-04 00:00:00 Hamilton Ont. Forward 26 5.67 68 21
10 10 Women Canada Haley Irwin 170 5'7 1988-06-06 00:00:00 Thunder Bay Ont. Forward 33 5.58 67 27
# … with 86 more rows
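If you prefer not to leave files in your working directory, a variant of the same steps uses temporary files (a sketch of the same approach; `mode = "wb"` matters on Windows when downloading binary files such as zips):

```r
library(readxl)

# download to a temporary file; mode = "wb" avoids corrupting the zip on Windows
tmp <- tempfile(fileext = ".zip")
download.file("https://www.contextures.com/tablesamples/sampledatahockey.zip",
              destfile = tmp, mode = "wb")

# extract into a temporary directory and read the sheet from there
exdir <- tempfile("hockey")
unzip(tmp, exdir = exdir)
hockey <- read_excel(file.path(exdir, "sampledatahockey.xlsx"),
                     sheet = "PlayerData", skip = 2)
```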

Related

How to sum all variables that aren't characters/factors using group_by? [duplicate]

I am new to R. I have some data from local elections in Mexico, and I want to determine how many votes each party received in each municipality.
Here is an example of the data (the political parties are all variables from PRI onwards; NOM_MUN is the name of the municipality):
head(Campeche)
# A tibble: 6 x 14
CABECERA_DISTRITAL CIRCUNSCRIPCION NOMBRE_ESTADO NOM_MUN PRI PAN MORENA PRD PVEM PT MC
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 153 137 43 5 6 9 7
2 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 109 113 52 15 9 4 5
3 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 169 154 33 14 12 5 6
4 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 1414 1474 415 154 73 62 53
5 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 199 238 88 25 17 11 12
6 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 176 197 60 15 7 13 11
# … with 3 more variables: NVA_ALIANZA <dbl>, PH <dbl>, ES <dbl>
tail(Campeche)
CABECERA_DISTRITAL CIRCUNSCRIPCION NOMBRE_ESTADO NOM_MUN PRI PAN MORENA PRD PVEM PT MC
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SABANCUY 3 CAMPECHE CARMEN 83 74 21 7 0 3 1
2 SABANCUY 3 CAMPECHE CARMEN 68 47 28 5 3 4 1
3 SABANCUY 3 CAMPECHE CARMEN 56 72 16 1 0 1 1
4 SEYBAPLAYA 3 CAMPECHE CHAMPOTON 90 147 3 2 4 1 3
5 SEYBAPLAYA 3 CAMPECHE CHAMPOTON 141 161 39 30 4 9 15
6 SEYBAPLAYA 3 CAMPECHE CHAMPOTON 84 77 1 6 0 0 3
# … with 3 more variables: NVA_ALIANZA <dbl>, PH <dbl>, ES <dbl>
The data is disaggregated by electoral section, and there is more than one electoral section per municipality; what I am looking for is the total votes for each political party by municipality.
This is what I have been doing, but I believe there is a faster way to do the same thing, one that can be replicated for different municipalities with different parties.
results_Campeche <- Campeche %>%
  group_by(NOM_MUN) %>%
  summarize(PRI = sum(PRI), PAN = sum(PAN), PRD = sum(PRD), MORENA = sum(MORENA),
            PVEM = sum(PVEM), PT = sum(PT), MC = sum(MC), NVA_ALIANZA = sum(NVA_ALIANZA),
            PH = sum(PH), ES = sum(ES), .groups = "drop")
head(results_Campeche)
NOM_MUN PRI PAN PRD MORENA PVEM PT MC NVA_ALIANZA PH ES
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CALAKMUL 4861 5427 290 198 70 109 84 236 9 53
2 CALKINI 9035 1326 319 11714 684 194 282 4537 41 262
3 CAMPECHE 39386 32574 4394 11639 2211 2033 1451 4656 1995 4681
4 CANDELARIA 6060 11982 98 209 38 73 135 73 21 21
5 CARMEN 25252 38239 2505 9314 1164 708 712 1124 742 838
6 CHAMPOTON 16415 8500 3212 5387 457 636 1122 1034 203 340
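A shorter, reusable version of the same summary (a sketch assuming dplyr >= 1.0, where `across()` is available) sums every numeric column at once, so it works unchanged for other states with different party columns:

```r
library(dplyr)

# sum all numeric columns per municipality in one step
results_Campeche <- Campeche %>%
  group_by(NOM_MUN) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")
```

Note that CIRCUNSCRIPCION is also numeric and will be summed too; if you only want the party columns, restrict the selection, e.g. `across(PRI:ES, sum)` (assuming the party columns are contiguous in that order).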

Group by DF and then Filter using dplyr

This might be relatively easy in dplyr. This sample question uses the Lahman package data.
Which player managed both NYA and NYN under teamID?
# get master player table
players <- Lahman::People
# get manager table
managers <- Lahman::Managers
# merge managers to players (the shared key is playerID)
manager_tbl <- managers %>%
  left_join(players, by = "playerID")
I want to get the players, under playerID, that have a row for both NYA and NYN under teamID.
How would I go about doing this? I'm guessing that I would need to group by playerID. berrayo01 is one of the answers.
After grouping by 'playerID', filter the groups whose 'teamID' contains both 'NYA' and 'NYN':
library(dplyr)
manager_tbl %>%
group_by(playerID) %>%
filter(all(c("NYA", "NYN") %in% teamID))
# A tibble: 69 x 35
# Groups: playerID [4]
# playerID yearID teamID lgID inseason G W L rank plyrMgr birthYear birthMonth birthDay birthCountry birthState birthCity deathYear deathMonth deathDay deathCountry deathState
# <chr> <int> <fct> <fct> <int> <int> <int> <int> <int> <fct> <int> <int> <int> <chr> <chr> <chr> <int> <int> <int> <chr> <chr>
# 1 stengca… 1934 BRO NL 1 153 71 81 6 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 2 stengca… 1935 BRO NL 1 154 70 83 5 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 3 stengca… 1936 BRO NL 1 156 67 87 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 4 stengca… 1938 BSN NL 1 153 77 75 5 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 5 stengca… 1939 BSN NL 1 152 63 88 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 6 stengca… 1940 BSN NL 1 152 65 87 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 7 stengca… 1941 BSN NL 1 156 62 92 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 8 stengca… 1942 BSN NL 1 150 59 89 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 9 stengca… 1943 BSN NL 2 107 47 60 6 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
#10 stengca… 1949 NYA AL 1 155 97 57 1 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# … with 59 more rows, and 14 more variables: deathCity <chr>, nameFirst <chr>, nameLast <chr>, nameGiven <chr>, weight <int>, height <int>, bats <fct>, throws <fct>, debut <chr>,
# finalGame <chr>, retroID <chr>, bbrefID <chr>, deathDate <date>, birthDate <date>
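If you only need the IDs rather than every managerial season, the same grouped filter can be collapsed (a sketch; per the `playerID [4]` in the output above it should return the four qualifying IDs, berrayo01 among them):

```r
library(dplyr)

# keep only the IDs of managers who led both NY teams
both_ny <- manager_tbl %>%
  group_by(playerID) %>%
  filter(all(c("NYA", "NYN") %in% teamID)) %>%
  distinct(playerID) %>%
  pull(playerID)
```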

Output function to create multiple dataframes by subsetted row

I am trying to create multiple data frames from a function, with each data frame aggregating the rows up to a varying week. For reference, I am using fantasy football data: right now I have each player's stats for every week, and I want a data frame for each week containing their cumulative stats up to that week.
Here is the function I am currently using, which only produces a single aggregate of the week-17 values:
sumuptopoint <- function(dfx, i) {
  listofdfs <- list()
  dfy <- dfx[, !sapply(dfx, is.character)]
  {
    for (i in 1:17)
      dft <- dfy[dfy$Week < i, ]
    y <<- as.data.frame(aggregate(dft, list("PlayerID" = dft$PlayerID), sum))
    listofdfs[[i]] <- y
  }
  return(listofdfs)
}
I expect 17 data frames of aggregated data, but I only get one, in which all weeks prior to week 17 are aggregated.
Here is the df:
Team ByeWeek Rank.all PlayerID Name Position Week Opponent PassingCompletio~ PassingAttempts.~ PassingCompletio~ PassingYards.all PassingTouchdow~ PassingIntercep~ PassingRating.a~
<chr> <int> <int> <int> <chr> <chr> <dbl> <chr> <int> <int> <dbl> <int> <int> <int> <dbl>
1 ARI 12 201 19763 Josh ~ QB 8.00 SF 23 40 57.5 252 2 1 82.5
2 ARI 12 319 19763 Josh ~ QB 11.0 OAK 9 20 45.0 136 3 2 67.9
3 ARI 12 372 19763 Josh ~ QB 4.00 SEA 15 27 55.6 180 1 0 88.5
4 ARI 12 392 11527 Sam B~ QB 3.00 CHI 13 19 68.4 157 2 2 89.0
5 ARI 12 407 19763 Josh ~ QB 5.00 SF 10 25 40.0 170 1 0 77.1
6 ARI 12 411 19763 Josh ~ QB 10.0 KC 22 39 56.4 208 1 2 58.5
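The root problem is that the `for` loop body has no braces, so only `dft <- ...` is repeated and the aggregation runs a single time with `i` left at 17. A tidyverse rewrite (a sketch assuming dplyr >= 1.0 and purrr are available, and that the data frame shown above is what gets passed in) builds all 17 cumulative summaries and avoids the `<<-` global assignment:

```r
library(dplyr)
library(purrr)

# one cumulative summary per week: rows strictly before week i,
# with every numeric column summed per player
sumuptopoint <- function(dfx) {
  map(1:17, function(i) {
    dfx %>%
      filter(Week < i) %>%
      group_by(PlayerID) %>%
      summarise(across(where(is.numeric), sum), .groups = "drop")
  })
}
```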

dplyr group_by summarise inconsistent number of rows

I have been following the tutorial on DataCamp. The following line of code produces a different value for "drows" every time I run it:
hflights %>%
group_by(UniqueCarrier, Dest) %>%
summarise(rows= n(), drows = n_distinct(rows))
First time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]
UniqueCarrier Dest rows drows
<chr> <chr> <int> <int>
1 AirTran ATL 211 86
2 AirTran BKG 14 6
3 Alaska SEA 32 18
4 American DFW 186 74
5 American MIA 129 57
6 American_Eagle DFW 234 101
7 American_Eagle LAX 74 34
8 American_Eagle ORD 133 56
9 Atlantic_Southeast ATL 64 28
10 Atlantic_Southeast CVG 1 1
# ... with 224 more rows
Second time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]
UniqueCarrier Dest rows drows
<chr> <chr> <int> <int>
1 AirTran ATL 211 125
2 AirTran BKG 14 13
3 Alaska SEA 32 29
4 American DFW 186 118
5 American MIA 129 76
6 American_Eagle DFW 234 143
7 American_Eagle LAX 74 47
8 American_Eagle ORD 133 85
9 Atlantic_Southeast ATL 64 44
10 Atlantic_Southeast CVG 1 1
# ... with 224 more rows
Third time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]
UniqueCarrier Dest rows drows
<chr> <chr> <int> <int>
1 AirTran ATL 211 88
2 AirTran BKG 14 7
3 Alaska SEA 32 16
4 American DFW 186 79
5 American MIA 129 61
6 American_Eagle DFW 234 95
7 American_Eagle LAX 74 31
8 American_Eagle ORD 133 67
9 Atlantic_Southeast ATL 64 31
10 Atlantic_Southeast CVG 1 1
# ... with 224 more rows
My question is why does this value constantly change? What is it doing?
Apparently this is normal behaviour; see this dplyr issue: https://github.com/tidyverse/dplyr/issues/2222. Quoting from it:
This is because values in list columns are compared by reference, so n_distinct() treats them as different unless they really point to the same object.
So the internal storage of the data frame changes the result. Hadley's comment in that issue suggests it is either a bug (in the sense of unwanted behaviour) or expected behaviour that needs to be documented better.
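For what it's worth, with a current dplyr (1.x) the expression is deterministic: later expressions inside `summarise()` see the per-group value of earlier ones, so `rows` is a single number within each group and `n_distinct(rows)` is always 1. A minimal sketch with toy data (hypothetical column names):

```r
library(dplyr)

df <- data.frame(carrier = c("AA", "AA", "UA"), dest = c("DFW", "DFW", "ORD"))
# drows refers to the just-computed per-group rows, a length-1 value,
# so n_distinct(rows) is 1 for every group
df %>%
  group_by(carrier, dest) %>%
  summarise(rows = n(), drows = n_distinct(rows), .groups = "drop")
```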

How to cross-reference tibbles in R?

library(nycflights13)
library(tidyverse)
My task is
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error).
I have generated a tibble with the average flight times between every two airports:
# A tibble: 224 x 3
# Groups: origin [?]
origin dest mean_time
<chr> <chr> <dbl>
1 EWR ALB 31.78708
2 EWR ANC 413.12500
3 EWR ATL 111.99385
4 EWR AUS 211.24765
5 EWR AVL 89.79681
6 EWR BDL 25.46602
7 EWR BNA 114.50915
8 EWR BOS 40.31275
9 EWR BQN 196.17288
10 EWR BTV 46.25734
# ... with 214 more rows
Now I need to sweep through flights and extract all rows, whose air_time is outside say (mean_time/2, mean_time*2). How do I do that?
Assuming you have stored the tibble with the average flight times (called average_flight_times below), join it to the flights table:
flights_suspicious <- left_join(flights, average_flight_times, by = c("origin", "dest")) %>%
  filter(air_time < mean_time / 2 | air_time > mean_time * 2)
You would first join that average flight time data frame onto your original flights data and then apply the filter. Something like this should work.
library(nycflights13)
library(tidyverse)
data("flights")
# get mean air time per route
mean_time <- flights %>%
  group_by(origin, dest) %>%
  summarise(mean_time = mean(air_time, na.rm = TRUE))
# join mean time to original data
df <- left_join(flights, mean_time)
flag_flights <- df %>%
  filter(air_time <= (mean_time / 2) | air_time >= (mean_time * 2))
> flag_flights
# A tibble: 29 x 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 16 635 608 27 916 725 111 UA 541 N837UA EWR BOS 81 200 6 8
2 2013 1 21 1851 1900 -9 2034 2012 22 US 2140 N956UW LGA BOS 76 184 19 0
3 2013 1 28 1917 1825 52 2118 1935 103 US 1860 N755US LGA PHL 75 96 18 25
4 2013 10 7 1059 1105 -6 1306 1215 51 MQ 3230 N524MQ JFK DCA 96 213 11 5
5 2013 10 10 950 959 -9 1155 1115 40 EV 5711 N829AS JFK IAD 97 228 9 59
6 2013 2 17 841 840 1 1044 1003 41 9E 3422 N913XJ JFK BOS 86 187 8 40
7 2013 3 8 1136 1001 95 1409 1116 173 UA 1240 N17730 EWR BOS 82 200 10 1
8 2013 3 8 1246 1245 1 1552 1350 122 AA 1850 N3FEAA JFK BOS 80 187 12 45
9 2013 3 12 1607 1500 67 1803 1608 115 US 2132 N946UW LGA BOS 77 184 15 0
10 2013 3 12 1612 1557 15 1808 1720 48 UA 1116 N37252 EWR BOS 81 200 15 57
# ... with 19 more rows, and 2 more variables: time_hour <dttm>, mean_time <dbl>
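As an alternative sketch that skips the intermediate table entirely, the group mean can be computed inside a grouped `filter()`, so no join is needed:

```r
library(nycflights13)
library(dplyr)

# flag flights whose air_time is outside (mean/2, mean*2) for their route
suspicious <- flights %>%
  group_by(origin, dest) %>%
  filter(air_time < mean(air_time, na.rm = TRUE) / 2 |
           air_time > mean(air_time, na.rm = TRUE) * 2) %>%
  ungroup()
```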
