I have a two-column dataframe (HOME & AWAY) called 'gamelist' with sports games. The HOME column also includes some dates with the corresponding games listed below.
HOME AWAY
15 Oct 2019 Pre-season
Phoenix Suns Denver Nuggets
Utah Jazz Sacramento Kings
Dallas Mavericks Oklahoma City Thunder
Memphis Grizzlies Charlotte Hornets
14 Oct 2019 Pre-season
Miami Heat Atlanta Hawks
13 Oct 2019 Pre-season
Orlando Magic Philadelphia 76ers
Toronto Raptors Chicago Bulls
Washington Wizards Milwaukee Bucks
I want to create a new column with the date for each game. Coming from an Excel VBA background, I've used a for loop, which gives the intended result, but I was wondering whether there is a more efficient approach in R, and I'm sure there is.
This is the code I've used:
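library(tibble)   # add_column()
library(stringr)  # str_count()
gamelist <- add_column(gamelist, SDATE = "", .before = 1)
for (i in seq_len(nrow(gamelist))) {
  # a date row such as "15 Oct 2019" contains exactly six digits
  if (str_count(gamelist$HOME[i], "\\d") == 6) {
    gamelist$SDATE[i] <- gamelist$HOME[i]
  } else {
    # otherwise carry the date down from the row above
    gamelist$SDATE[i] <- gamelist$SDATE[i - 1]
  }
}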
Which gives me this as intended
SDATE HOME AWAY
15 Oct 2019 15 Oct 2019 Pre-season
15 Oct 2019 Phoenix Suns Denver Nuggets
15 Oct 2019 Utah Jazz Sacramento Kings
15 Oct 2019 Dallas Mavericks Oklahoma City Thunder
15 Oct 2019 Memphis Grizzlies Charlotte Hornets
14 Oct 2019 14 Oct 2019 Pre-season
14 Oct 2019 Miami Heat Atlanta Hawks
13 Oct 2019 13 Oct 2019 Pre-season
13 Oct 2019 Orlando Magic Philadelphia 76ers
13 Oct 2019 Toronto Raptors Chicago Bulls
13 Oct 2019 Washington Wizards Milwaukee Bucks
My apologies for the dataframe formatting, couldn't figure out how to reproduce one properly here.
Thanks for your help
We could use str_extract to pull out only the dates, which returns NA where there is no match, and then use fill to replace each NA with the previous non-NA value.
library(dplyr)
library(tidyr)
library(stringr)
gamelist %>%
mutate(SDATE = str_extract(HOME, "^\\d+ [A-Za-z]+ \\d{4}")) %>%
fill(SDATE)
# HOME AWAY SDATE
#1 15 Oct 2019 Pre-season 15 Oct 2019
#2 Phoenix Suns Denver Nuggets 15 Oct 2019
#3 Utah Jazz Sacramento Kings 15 Oct 2019
#4 Dallas Mavericks Oklahoma City Thunder 15 Oct 2019
#5 Memphis Grizzlies Charlotte Hornets 15 Oct 2019
#6 14 Oct 2019 Pre-season 14 Oct 2019
#7 Miami Heat Atlanta Hawks 14 Oct 2019
#8 13 Oct 2019 Pre-season 13 Oct 2019
#9 Orlando Magic Philadelphia 76ers 13 Oct 2019
#10 Toronto Raptors Chicago Bulls 13 Oct 2019
#11 Washington Wizards Milwaukee Bucks 13 Oct 2019
If we need the SDATE column first, we can use select
gamelist %>%
mutate(SDATE = str_extract(HOME, "^\\d+ [A-Za-z]+ \\d{4}")) %>%
fill(SDATE) %>%
select(SDATE, everything())
Or use add_column from tibble with either .after or .before
library(tibble)
gamelist %>%
add_column(SDATE = str_extract(.$HOME, "^\\d+ [A-Za-z]+ \\d{4}"),
.before = 1 ) %>%
fill(SDATE)
data
gamelist <- structure(list(HOME = c("15 Oct 2019", "Phoenix Suns", "Utah Jazz",
"Dallas Mavericks", "Memphis Grizzlies", "14 Oct 2019", "Miami Heat",
"13 Oct 2019", "Orlando Magic", "Toronto Raptors", "Washington Wizards"
), AWAY = c("Pre-season", "Denver Nuggets", "Sacramento Kings",
"Oklahoma City Thunder", "Charlotte Hornets", "Pre-season", "Atlanta Hawks",
"Pre-season", "Philadelphia 76ers", "Chicago Bulls", "Milwaukee Bucks"
)), class = "data.frame", row.names = c(NA, -11L))
If the date is always in the HOME column when the AWAY column is "Pre-season" (or some other predictable condition), then you could do something like:
# data
gamelist <- data.frame(
stringsAsFactors = FALSE,
HOME = c("15-Oct-19","Phoenix Suns",
"Utah Jazz","Dallas Mavericks","Memphis Grizzlies",
"14-Oct-19","Miami Heat","13-Oct-19","Orlando Magic",
"Toronto Raptors","Washington Wizards"),
AWAY = c("Pre-season","Denver Nuggets",
"Sacramento Kings","Oklahoma City Thunder",
"Charlotte Hornets","Pre-season","Atlanta Hawks","Pre-season",
"Philadelphia 76ers","Chicago Bulls","Milwaukee Bucks")
)
# create blank column to fill in
gamelist$date <- NA
# fill cases where there's a date
gamelist$date[gamelist$AWAY=="Pre-season"] <- gamelist$HOME[gamelist$AWAY=="Pre-season"]
# use zoo::na.locf() to fill in missing values
# (na.rm = FALSE keeps the vector length even if the first rows are NA)
gamelist$date <- zoo::na.locf(gamelist$date, na.rm = FALSE)
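Both answers above lean on add-on packages for the "carry the last date forward" step. As a rough base-R sketch of the same idea (the toy data frame is a shortened copy of the question's data, and the sketch assumes the first row is always a date row), the fill can be done with grepl and cumsum:

```r
# Toy subset of the question's data (values copied from the post).
gamelist <- data.frame(
  HOME = c("15 Oct 2019", "Phoenix Suns", "Utah Jazz", "14 Oct 2019", "Miami Heat"),
  AWAY = c("Pre-season", "Denver Nuggets", "Sacramento Kings", "Pre-season", "Atlanta Hawks"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose HOME value looks like "15 Oct 2019".
is_date <- grepl("^\\d+ [A-Za-z]+ \\d{4}$", gamelist$HOME)

# cumsum(is_date) numbers the date blocks 1, 1, 1, 2, 2, so indexing the
# vector of dates with it repeats each date down its block.
# Assumption: the first row is a date row (otherwise the index starts at 0).
gamelist$SDATE <- gamelist$HOME[is_date][cumsum(is_date)]

# Optionally drop the date header rows once the date has been carried down.
games <- gamelist[!is_date, c("SDATE", "HOME", "AWAY")]
```

Here games holds only the actual fixtures, each with its date in SDATE; whether the header rows should be dropped depends on what the table is used for afterwards.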
My data frame has a column called "State" and contains the state name, HB/HF number, and the date the law went into effect. I want the state column to only contain the state name and the second column to contain just the year. How would I do this?
library(dplyr) # provides the %>% pipe used below
Mintz = read.csv('https://github.com/bandcar/mintz/raw/main/State%20Legislation%20on%20Biosimilars2.csv')
mintz = Mintz
# delete rows if col 2 has a blank value.
mintz = mintz[mintz$Substitution.Requirements != "", ]
# removes entire row if column 1 has the word State
mintz=mintz[mintz$State != "State", ]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
# delete PR
mintz = mintz[-34,]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
I'm almost certain I'll need to use strsplit(gsub()), but I'm not sure how to do this since there's no specific pattern.
EDIT
I still need help keeping only the state name in column 1.
As for moving the year to a new column, I found the code below. It works, but I don't know why. From my understanding, \d means that \d is the actual character it's searching for, the "." means to search for one character, and I have no idea what the \1 means. Another strange thing is that Minnesota (row 20) did not have a year, so the result contained other characters instead. Isn't \d only supposed to match digits? Would someone care to explain?
mintz2 = mintz
mintz2$Year = sub('.*(\\d{4}).*', '\\1', mintz2$State)
One way could be:
For demonstration purposes, select only the State column.
Then use str_extract with the pattern \\d{4}$ to extract the four digits at the end of the string; this gives the Year column.
Finally, build a pattern from the built-in state.name vector (all state names joined with |), use it with str_extract to keep only the state name, and remove the rows that contain NA.
library(dplyr)
library(stringr)
mintz %>%
select(State) %>%
mutate(Year = str_extract(State, '\\d{4}$'), .after=State,
State = str_extract(State, paste(state.name, collapse='|'))
) %>%
na.omit()
State Year
2 Arizona 2016
3 California 2016
7 Connecticut 2018
12 Florida 2013
13 Georgia 2015
16 Hawaii 2016
21 Illinois 2016
24 Indiana 2014
28 Iowa 2017
32 Kansas 2017
33 Kentucky 2016
34 Louisiana 2015
39 Maryland 2017
42 Michigan 2018
46 Missouri 2016
47 Montana 2017
50 Nebraska 2018
51 Nevada 2018
54 New Hampshire 2018
55 New Jersey 2016
59 New York 2017
62 North Carolina 2015
63 North Dakota 2013
66 Ohio 2017
67 Oregon 2016
70 Pennsylvania 2016
74 Rhode Island 2016
75 South Carolina 2017
78 South Dakota 2019
79 Tennessee 2015
82 Texas 2015
85 Utah 2015
88 Vermont 2018
89 Virginia 2013
92 Washington 2015
93 West Virginia 2018
96 Wisconsin 2019
97 Wyoming 2018
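On the sub() call from the question's edit, a small base-R sketch may help show what the pattern does and why a row without a year comes back unchanged. The two input strings are made up for illustration and are not taken from the CSV; extract_year is a hypothetical helper name.

```r
# Hypothetical strings in the spirit of the State column (not from the CSV).
x <- c("Arizona HB 2310 2016", "Minnesota no year given")

# ".*" matches any run of characters, "(\\d{4})" captures four digits as
# group 1, and "\\1" in the replacement stands for that captured group.
# Because the first ".*" is greedy, the LAST run of four digits is kept.
year_sub <- sub(".*(\\d{4}).*", "\\1", x)

# When the pattern does not match at all, sub() returns the input unchanged,
# which is why a row without a year keeps its original text.

# A stricter variant that yields NA unless four digits end the string.
extract_year <- function(s) {
  m <- regexpr("[0-9]{4}$", s)            # -1 where there is no match
  out <- rep(NA_character_, length(s))
  out[m != -1] <- regmatches(s, m)        # regmatches() returns matches only
  out
}
year_strict <- extract_year(x)
```

With these inputs, year_sub keeps the full Minnesota string while year_strict has NA in that position, which is closer to what str_extract does in the answer above.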
Hi guys I am trying to plot a streamgraph using data at the following link: https://www.kaggle.com/START-UMD/gtd.
My aim is to streamgraph the frequency of terrorist attacks for each terrorist group in the variable gname, but my problem is that I don't know how to filter the data frame so that it has all the parameters necessary to plot a streamgraph, which are data, key, value, date.
I tried to get to that subset of the original dataframe by using the following code
str <- terror %>%
filter(gname != "Unknown") %>%
group_by(gname) %>%
summarise(total=n()) %>%
arrange(desc(total)) %>%
head(20)
But all I managed to get is the frequency of attacks for each terrorist group, without getting the number of attacks for each year.
Could you suggest any way to do it? That would be amazing!
Thanks for reading guys and for the help.
Dario and Kent are correct. You need to add the iyear variable in the group_by function:
terror %>%
filter(gname != "Unknown") %>%
group_by(gname, iyear) %>%
summarise(total=n()) %>%
arrange(desc(total)) %>%
head(20) -> str
str
# A tibble: 20 x 3
# Groups: gname [7]
gname iyear total
<chr> <int> <int>
1 Islamic State of Iraq and the Levant (ISIL) 2016 1454
2 Islamic State of Iraq and the Levant (ISIL) 2017 1315
3 Islamic State of Iraq and the Levant (ISIL) 2014 1249
4 Taliban 2015 1249
5 Islamic State of Iraq and the Levant (ISIL) 2015 1221
6 Taliban 2016 1065
7 Taliban 2014 1035
8 Taliban 2017 894
9 Al-Shabaab 2014 871
10 Taliban 2012 800
11 Taliban 2013 775
12 Al-Shabaab 2017 570
13 Al-Shabaab 2016 564
14 Boko Haram 2015 540
15 Shining Path (SL) 1989 509
16 Communist Party of India - Maoist (CPI-Maoist) 2010 505
17 Shining Path (SL) 1984 502
18 Boko Haram 2014 495
19 Shining Path (SL) 1983 493
20 Farabundo Marti National Liberation Front (FML~ 1991 492
Then send that to the streamgraph:
str %>% streamgraph("gname", "total", "iyear")
I've always had difficulty annotating these graphs; as far as I know, it has to be done manually:
str %>% streamgraph("gname", "total", "iyear") %>%
sg_annotate(label="ISIL", x=as.Date("2016-01-01"), y=1454, size=14)
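For anyone without dplyr at hand, the per-group, per-year count that the streamgraph needs (one row per key and date, with the count as the value) can also be sketched in base R. The tiny terror_toy data frame below is invented purely for illustration; only the column names gname and iyear come from the question.

```r
# Invented toy data; only the column names match the real data set.
terror_toy <- data.frame(
  gname = c("A", "A", "A", "B", "B", "Unknown"),
  iyear = c(2014L, 2014L, 2015L, 2014L, 2015L, 2015L),
  stringsAsFactors = FALSE
)

# Drop attacks with no known group, as in the question.
known <- terror_toy[terror_toy$gname != "Unknown", ]

# Add a column of ones and sum it within each (gname, iyear) pair,
# which is the base-R counterpart of group_by() + summarise(total = n()).
counts <- aggregate(n ~ gname + iyear, data = transform(known, n = 1L), FUN = sum)

# Largest counts first, mirroring arrange(desc(total)).
counts <- counts[order(-counts$n), ]
```

The resulting counts has the three columns a streamgraph call expects: a key (gname), a value (n) and a date (iyear).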
I need to raise the question again as it was closed as duplicated, but the issue hasn't been resolved.
So, I'm working on international trade data and have the following table at the moment with 5 different values for commodity_code (commod_codes = c('85','84','87','73','29')):
year trade_flow reporter partner commodity_code commodity trade_value_usd
1 2012 Import Belarus China 29 Organic chemicals 150863100
2 2013 Import Belarus China 29 Organic chemicals 151614000
3 2014 Import Belarus China 29 Organic chemicals 73110200
4 2015 Import Belarus China 29 Organic chemicals 140396300
5 2016 Import Belarus China 29 Organic chemicals 135311600
6 2012 Import Belarus China 73 Articles of iron or steel 100484600
I need to create a new table that looks simple (commodity codes in top row, years in first column and corresponding trade values in cells):
year commodity_code
29 73 84 85 87
1998 value1 ... value 5
1999
…
2016
* I used reshape() but didn't succeed.
Would appreciate your support.
If the same year and commodity code can occur more than once, I would suggest summing the values first with this code (not base R; it uses the dplyr and tidyr packages):
library(dplyr)
library(tidyr)
trade_data %>%
  group_by(year, commodity_code) %>%
  summarise(trade_value_usd = sum(trade_value_usd)) %>%
  spread(commodity_code, trade_value_usd) %>%
  as.data.frame()
Provided I understood you correctly, here is a one-liner in base R.
xtabs(trade_value_usd ~ year + commodity_code, data = df);
#year 29 73
# 2012 150863100 100484600
# 2013 151614000 0
# 2014 73110200 0
# 2015 140396300 0
# 2016 135311600 0
Explanation: Use xtabs to cross-tabulate trade_value_usd as a function of year (rows) and commodity_code (columns).
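One thing worth knowing about the xtabs result is that year and code combinations with no data show up as 0, which can be mistaken for a real trade value of zero. As a hedged sketch, tapply builds the same year-by-code layout but leaves such cells as NA. The data frame below repeats the six rows shown in the question, keeping only the three columns that matter.

```r
# The six rows from the question, reduced to the relevant columns.
df <- data.frame(
  year = c(2012, 2013, 2014, 2015, 2016, 2012),
  commodity_code = c(29, 29, 29, 29, 29, 73),
  trade_value_usd = c(150863100, 151614000, 73110200,
                      140396300, 135311600, 100484600)
)

# Cross-tabulation: missing combinations are filled with 0.
tab <- xtabs(trade_value_usd ~ year + commodity_code, data = df)

# Same layout via tapply: missing combinations stay NA.
wide <- tapply(
  df$trade_value_usd,
  list(year = df$year, commodity_code = df$commodity_code),
  sum
)
```

Both objects are indexed by character row and column names, for example wide["2012", "73"]; which of the two behaviours is preferable depends on whether a missing record really means no trade.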
Sample data
df <- read.table(text =
"year trade_flow reporter partner commodity_code commodity trade_value_usd
1 2012 Import Belarus China 29 'Organic chemicals' 150863100
2 2013 Import Belarus China 29 'Organic chemicals' 151614000
3 2014 Import Belarus China 29 'Organic chemicals' 73110200
4 2015 Import Belarus China 29 'Organic chemicals' 140396300
5 2016 Import Belarus China 29 'Organic chemicals' 135311600
6 2012 Import Belarus China 73 'Articles of iron or steel' 100484600
", header = T, row.names = 1)
I am trying to use zoo to formulate a Date using two columns in a data.table.
data$Date <- as.yearmon(paste(data$Month,data$Year), "%Y %m")
But all I get is NAs.
Here is what the data looks like:
Year Month State County Rate
2015 October California Santa Clara County 4.0
2015 March California Santa Clara County 4.4
2015 August California Santa Clara County 4.1
2015 May California Santa Clara County 4.1
2015 January California Santa Clara County 4.7
You have two issues. First, you're pasting Month then Year, but the format string says Year then Month. Second, %m is for the month as a decimal number (01-12); for the full month name you want %B. So you need to switch the order in paste and change the format.
data$Date <- as.yearmon(paste(data$Year,data$Month), "%Y %B")
Year Month State County Rate Date
1: 2015 October California Santa Clara County 4.0 Oct 2015
2: 2015 March California Santa Clara County 4.4 Mar 2015
3: 2015 August California Santa Clara County 4.1 Aug 2015
4: 2015 May California Santa Clara County 4.1 May 2015
5: 2015 January California Santa Clara County 4.7 Jan 2015
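A caveat on %B: it matches month names in the session's locale, so English names may fail to parse on a machine set to another language. A locale-independent sketch, assuming the Month column always holds full English month names, looks the month number up in R's built-in month.name vector and builds a plain Date for the first of each month (the three rows below are copied from the question):

```r
# Three rows copied from the question's data.
data <- data.frame(
  Year = c(2015L, 2015L, 2015L),
  Month = c("October", "March", "August"),
  stringsAsFactors = FALSE
)

# month.name is c("January", ..., "December"), so match() gives 1-12.
month_num <- match(data$Month, month.name)

# Build an ISO date string for the first day of each month and parse it.
data$Date <- as.Date(sprintf("%d-%02d-01", data$Year, month_num))
```

If a yearmon is still wanted, zoo::as.yearmon can be applied to this Date column afterwards; a misspelled month name would show up as NA in month_num rather than being parsed incorrectly.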
Following up with more delving into scraping data off sites. I'm trying to pull a few columns of data off this site, SeatGeek. I'm having trouble accessing the pricing and link data specifically. The following code runs, but I can't get accurate data for pricing or for links: $65 keeps repeating itself even though the numbers are different per button. Any ideas? Appreciate the help!
# ticket scraper
library(rvest)
tix_link = paste("https://seatgeek.com/new-york-knicks-tickets#events")
tix_info = tix_link %>% read_html() %>%
html_nodes(".event-listing-title span")
link_date = read_html(tix_link)
link_date = html_nodes(link_date, ".event-listing-date")
link_time = read_html(tix_link)
link_time = html_nodes(link_time, ".event-listing-time")
link_price = read_html(tix_link)
link_price = html_node(link_price, ".event-listing-button")
link_info = read_html(tix_link)
link_info = html_node(link_info, "span")
#convert to data frame
ticket_deals = data.frame(deals = html_text(tix_info),
date = html_text(link_date),
time = html_text(link_time),
price = html_text(link_price),
correpsonding_link = html_attr(link_info,"href"))
head(ticket_deals)
deals date
1 Dallas Mavericks at New York Knicks \n Nov 14
2 Detroit Pistons at New York Knicks \n Nov 16
3 Atlanta Hawks at New York Knicks \n Nov 20
4 Portland Trail Blazers at New York Knicks \n Nov 22
5 Charlotte Hornets at New York Knicks \n Nov 25
6 Oklahoma City Thunder at New York Knicks \n Nov 28
time price
1 \n Mon 7:30 PM \n From $65
2 \n Wed 7:30 PM \n From $65
3 \n Sun 12:00 PM \n From $65
4 \n Tue 7:30 PM \n From $65
5 \n Fri 7:30 PM \n From $65
6 \n Mon 7:30 PM \n From $65
correpsonding_link
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 <NA>