Not sure how to separate a column of data that I scraped - r

I have scraped the schedule of the Albany women's basketball team from an ESPN page, and the win/loss column is formatted like "W 77-70", meaning Albany won 77-70. I want to separate this so that one column shows how many points Albany scored and another shows how many points the opponent scored.
Here is my code, not sure what to do next:
library(rvest)
library(stringr)
library(tidyr)
# Read the schedule page and pull the first HTML table
w.url <- "http://www.espn.com/womens-college-basketball/team/schedule/_/id/399"
webpage <- read_html(w.url)
w_table <- html_nodes(webpage, 'table')
w <- html_table(w_table)[[1]]
head(w)
# Drop the two header rows and give the columns usable names
w <- w[-(1:2), ]
names(w) <- c("Date", "Opponent", "Score", "Record")
head(w)

You can first drop the rows that don't contain real results with grepl(), and then use a regular expression to pull out the specific pieces:
w <- w[grepl("-", w$Score), ]
gsub("^([A-Z])([0-9]+)-([0-9]+).*", "\\1,\\2,\\3", w$Score) %>%
  strsplit(., split = ",") %>%
  lapply(function(x) {
    data.frame(
      result = x[1],
      opponent = ifelse(x[1] == "L", x[2], x[3]),
      albany = ifelse(x[1] == "W", x[2], x[3])
    )
  }) %>%
  do.call('rbind', .) %>%
  cbind(w, .) -> w2
head(w2)
# Date Opponent Score Record result opponent albany
#3 Fri, Nov 9 ##22 South Florida L74-37 0-1 (0-0) L 74 37
#4 Mon, Nov 12 #Cornell L48-34 0-2 (0-0) L 48 34
#5 Wed, Nov 14 vsManhattan W60-54 1-2 (0-0) W 54 60
#6 Sun, Nov 18 #Rutgers L65-39 1-3 (0-0) L 65 39
#7 Wed, Nov 21 #Monmouth L64-56 1-4 (0-0) L 64 56
#8 Sun, Nov 25 vsHoly Cross L56-50 1-5 (0-0) L 56 50
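If you need the scores as numbers downstream, note that the two new columns come out as text (or as factors on older R versions), so a conversion afterwards may be useful, for example:
w2$opponent <- as.numeric(as.character(w2$opponent))
w2$albany <- as.numeric(as.character(w2$albany))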

This is how I did it. Use sub() to extract either the winning or losing score, depending on whether Albany won or lost. Whichever way the game went, the winner's score is listed first, so the ifelse() is needed to pick the right side. The "\\1" backreference captures the digits matched inside the parentheses.
w <- w[1:24, ]
w$Albany <- ifelse(substr(w$Score, 1, 1) == 'W',
                   sub('W(\\d+)-\\d+', '\\1', w$Score),
                   sub('L\\d+-(\\d+)', '\\1', w$Score))
w$Opponent_Team <- ifelse(substr(w$Score, 1, 1) == 'W',
                          sub('W\\d+-(\\d+)', '\\1', w$Score),
                          sub('L(\\d+)-\\d+', '\\1', w$Score))
head(w)
Date Opponent Score Record Albany Opponent_Team
3 Fri, Nov 9 ##22 South Florida L74-37 0-1 (0-0) 37 74
4 Mon, Nov 12 #Cornell L48-34 0-2 (0-0) 34 48
5 Wed, Nov 14 vsManhattan W60-54 1-2 (0-0) 60 54
6 Sun, Nov 18 #Rutgers L65-39 1-3 (0-0) 39 65
7 Wed, Nov 21 #Monmouth L64-56 1-4 (0-0) 56 64
8 Sun, Nov 25 vsHoly Cross L56-50 1-5 (0-0) 50 56
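As an aside, since tidyr and stringr are already loaded in the question, the same split can be sketched with tidyr::extract plus dplyr, starting from the cleaned w from the question. The column names albany_pts and opp_pts are just illustrative, and the regex allows an optional space after the W/L:
library(dplyr)  # extra package, only needed here for filter() and mutate()
w2 <- w %>%
  filter(grepl("-", Score)) %>%
  tidyr::extract(Score, into = c("result", "first", "second"),
                 regex = "^([WL])\\s?(\\d+)-(\\d+)", remove = FALSE, convert = TRUE) %>%
  mutate(albany_pts = ifelse(result == "W", first, second),
         opp_pts    = ifelse(result == "W", second, first))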

Related

How to perform a calculation between chr and dbl columns

Let's say I run this code:
df_customer %>%
  separate(DOB, sep = "-", into = c("D", "M", "Y")) %>%
  mutate(Age = 2021)
then this data frame comes out:
ID D M Y G C Age
<int> <chr> <dbl>
268408 02 01 1970 M 4 2021
269696 07 01 1970 F 8 2021
268159 08 01 1970 F 8 2021
270181 10 01 1970 F 2 2021
268073 11 01 1970 M 1 2021
273216 15 01 1970 F 5 2021
266929 15 01 1970 M 8 2021
275152 16 01 1970 M 4 2021
275034 18 01 1970 F 4 2021
273966 21 01 1970 M 8 2021
Then I want to change the mutate() part: how can I calculate something like 2021 minus the "Y" column, when 2021 is dbl and Y is chr?
Adding convert = TRUE in separate should give you numeric values. You can also use as.numeric to convert character to numbers.
library(dplyr)
library(tidyr)
df_customer %>%
  separate(DOB, sep = "-", into = c("D", "M", "Y"), convert = TRUE) %>%
  mutate(Age = 2021 - as.numeric(Y))
We could also do this in base R:
transform(cbind(df_customer, read.table(text = df_customer$DOB, sep = "-",
                                        col.names = c("D", "M", "Y"))), Age = 2021 - Y)
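A minimal check of the separate() approach, using a made-up df_customer with just ID and DOB (the other columns from the question are omitted):
library(dplyr)
library(tidyr)
# hypothetical sample data, only for illustration
df_customer <- data.frame(ID = c(268408, 269696), DOB = c("02-01-1970", "07-01-1970"))
df_customer %>%
  separate(DOB, sep = "-", into = c("D", "M", "Y"), convert = TRUE) %>%
  mutate(Age = 2021 - Y)
#       ID D M    Y Age
# 1 268408 2 1 1970  51
# 2 269696 7 1 1970  51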

How to convert multiple selected column names from integer to date in R

I have a data set with column names that look like this.
INPUT
Country X1.22.20 X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20
India 40 20 30 21 25 28
USA 21 22 23 45 32 19
CHINA 30 45 32 46 78 48
X1.22.20 represents 1/22/2020
Required Output
Country 01/22/20 01/23/20 01/24/20 01/25/20 01/26/20 01/27/20
India 40 20 30 21 25 28
USA 21 22 23 45 32 19
CHINA 30 45 32 46 78 48
We can avoid this conversion altogether if we read the file with check.names = FALSE:
df1 <- read.csv('file.csv', check.names = FALSE, stringsAsFactors = FALSE)
If the file was already read without the check.names = FALSE option, convert the names to Date class and then format them:
names(df1)[-1] <- format(as.Date(names(df1)[-1], format = "X%m.%d.%y"), "%m/%d/%y")
Another option is sub (this keeps single-digit months and days unpadded, e.g. "1/22/20"):
names(df1)[-1] <- sub("^X(\\d+)\\.(\\d+)\\.(\\d+)", "\\1/\\2/\\3", names(df1)[-1])
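A quick check of both conversions on a couple of the column names:
nms <- c("X1.22.20", "X1.23.20")
# Date-class route: zero-padded, matching the required output
format(as.Date(nms, format = "X%m.%d.%y"), "%m/%d/%y")
# [1] "01/22/20" "01/23/20"
# sub route: same separators, but without zero padding
sub("^X(\\d+)\\.(\\d+)\\.(\\d+)", "\\1/\\2/\\3", nms)
# [1] "1/22/20" "1/23/20"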

R moving average between data frame variables

I am trying to find a solution but haven't found one yet.
I have a dataframe structured as follows:
country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54
I want to compute the moving average over every 3 years (i.e. 2014-16, 2015-17, etc.) and place the results in ad hoc columns.
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
France Paris 23 34 54 12 23 21 37 33.3 29.7 18.7
US NYC 1 2 2 12 95 54 etc etc etc etc
Any hint?
1) Using the data shown reproducibly in the Note at the end, we apply rollmean to each column of the transposed data and then transpose back. We use rollapply with paste to build the new column names.
library(zoo)
DF2 <- DF[-(1:2)]
cbind(DF, setNames(as.data.frame(t(rollmean(t(DF2), 3))),
                   rollapply(names(DF2), 3, function(x) paste(range(x), collapse = "-"))))
giving:
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
1 France Paris 23 34 54 12 23 21 37.000000 33.333333 29.66667 18.66667
2 US NYC 1 2 2 12 95 54 1.666667 5.333333 36.33333 53.66667
2) This could also be expressed using dplyr/tidyr/zoo like this:
library(dplyr)
library(tidyr)
library(zoo)
DF %>%
  pivot_longer(-c(country, City)) %>%
  group_by(country, City) %>%
  mutate(value = rollmean(value, 3, fill = NA),
         name = rollapply(name, 3, function(x) paste(range(x), collapse = "-"), fill = NA)) %>%
  ungroup %>%
  drop_na %>%
  pivot_wider %>%
  left_join(DF, ., by = c("country", "City"))
Note
Lines <- "country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54 "
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, check.names = FALSE)
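For comparison, a base R sketch of the same 3-year rolling mean over the year columns (assuming DF as defined in the Note above):
yrs <- names(DF)[-(1:2)]  # "2014" ... "2019"
for (i in seq_len(length(yrs) - 2)) {
  new_col <- paste(yrs[i], yrs[i + 2], sep = "-")
  DF[[new_col]] <- rowMeans(DF[yrs[i:(i + 2)]])
}
DF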

For loop results are only correct for the first iteration in R

My project:
I am looping through shapefiles in a folder, and running some calculations to add new columns with new values in the output shapefile
My problem:
The calculations are correct for the first iteration. However, these values are then added as columns to every subsequent shapefile, rather than new calculations being done per iteration. Below is the code. The final columns produced by it are: final_year, final_month, final_day, final_date.
My code:
library(rgdal)
library(tidyverse)
library(magrittr)
library(dplyr)
input_path <- "/Users/JohnDoe/Desktop/Zone_Fixup/Z4/Z4_Split/"
output_path <- "/Users/JohnDoe/Desktop/Zone_Fixup/Z4/Z4_Split_Out/"
files <- list.files(input_path, pattern = "[.]shp$")
for (f in files) {
  ifile <- list.files(input_path, f)
  shp_paste <- paste(input_path, ifile, sep = "")
  tryCatch({shp0 <- readOGR(shp_paste, verbose = FALSE)}, error = function(e) {print("Error1.")})
  #Order shapefile by filename
  shp1 <- as.data.frame(shp0)
  shp2 <- shp1[order(shp1$filename), ]
  #Sort final dates by relative length values.
  #If it's increasing, it's day1; if it's decreasing it's day3, etc.
  shp2$final_day1 <- ifelse(lag(shp2$Length1) < shp2$Length1, paste0(shp2$day1), paste0(shp2$day3))
  shp2$final_month1 <- ifelse(lag(shp2$Length1) < shp2$Length1, paste0(shp2$month1), paste0(shp2$month3))
  shp2$final_year1 <- ifelse(lag(shp2$Length1) < shp2$Length1, paste0(shp2$year1), paste0(shp2$year3))
  #Remove first NA value of each column
  if (is.na(shp2$final_day1[1])) {
    ex1 <- shp2$day1[1]
    ex2 <- as.character(ex1)
    ex3 <- as.numeric(ex2)
    shp2$final_day1[1] <- ex2
  }
  if (is.na(shp2$final_month1[1])) {
    ex4 <- shp2$month1[1]
    ex5 <- as.character(ex4)
    ex6 <- as.numeric(ex5)
    shp2$final_month1[1] <- ex5
  }
  if (is.na(shp2$final_year1[1])) {
    ex7 <- shp2$year1[1]
    ex8 <- as.character(ex7)
    ex9 <- as.numeric(ex8)
    shp2$final_year1[1] <- ex9
  }
  #Add final dates to shapefile as new columns
  shp0$final_year <- shp2$final_year1
  shp0$final_month <- shp2$final_month1
  shp0$final_day <- shp2$final_day1
  final_paste <- paste(shp0$final_year, "_", shp0$final_month, "_", shp0$final_day, sep = "")
  shp0$final_date <- final_paste
  #Create new shapefile for write out
  shp44 <- shp0
  #Write out shapefile
  ifile1 <- substring(ifile, 1, nchar(ifile) - 4)
  #tryCatch({writeOGR(shp44, output_path, layer = ifile1, driver = "ESRI Shapefile", overwrite_layer = TRUE)}, error = function(e){print("Error2.")})
  test1 <- head(shp44)
  print(test1)
}
My results:
Here are two head() tables. The first table is correct. The second table is not correct. Notice that the final_year, final_month, final_day, and final_date columns are identical in the two tables. NOTE: these columns are the last four in the table.
Table 1:
coordinates Length1 Bathy Vector filename zone year1 year2 year3 month1 month2 month3 day1 day2 day3 final_year final_month final_day final_date
1 (-477786.3, 1110917) 29577.64 -6.455580 0 Zone4_2000_02_05_2000_02_15_2000_02_24 Zone4 2000 2000 2000 02 02 02 05 15 24 1997 02 15 1997_02_15
2 (-477786.3, 1110917) 29577.64 -6.455580 0 Zone4_2000_02_24_2000_03_10_2000_03_17 Zone4 2000 2000 2000 02 03 03 24 10 17 1997 03 26 1997_03_26
3 (-477848.2, 1113468) 27025.88 -2.100153 0 Zone4_2000_03_24_2000_04_03_2000_04_10 Zone4 2000 2000 2000 03 04 04 24 03 10 1997 04 19 1997_04_19
4 (-477871, 1114406) 26087.98 -4.700025 0 Zone4_2006_03_10_2006_03_27_2006_04_03 Zone4 2006 2006 2006 03 03 04 10 27 03 1998 02 08 1998_02_08
5 (-477876.1, 1114616) 25877.25 -7.598877 0 Zone4_2008_03_06_2008_03_16_2008_03_25 Zone4 2008 2008 2008 03 03 03 06 16 25 1998 03 28 1998_03_28
6 (-477878.8, 1114730) 25764.14 -7.598877 0 Zone4_2008_03_30_2008_04_09_2008_04_23 Zone4 2008 2008 2008 03 04 04 30 09 23 1998 04 21 1998_04_21
Table 2:
coordinates Length1 Bathy Vector filename zone year1 year2 year3 month1 month2 month3 day1 day2 day3 final_year final_month final_day final_date
1 (-477813.5, 1110939) 29612.26 -6.455580 1 Zone4_2000_02_05_2000_02_15_2000_02_24 Zone4 2000 2000 2000 02 02 02 05 15 24 1997 02 15 1997_02_15
2 (-477813.5, 1110939) 29612.26 -6.455580 1 Zone4_2000_02_24_2000_03_10_2000_03_17 Zone4 2000 2000 2000 02 03 03 24 10 17 1997 03 26 1997_03_26
3 (-477883.4, 1113392) 27158.05 -2.100153 1 Zone4_2000_03_24_2000_04_03_2000_04_10 Zone4 2000 2000 2000 03 04 04 24 03 10 1997 04 19 1997_04_19
4 (-477909.9, 1114319) 26230.17 -4.700025 1 Zone4_2006_03_10_2006_03_27_2006_04_03 Zone4 2006 2006 2006 03 03 04 10 27 03 1998 02 08 1998_02_08
5 (-477916.7, 1114558) 25991.57 -7.598877 1 Zone4_2008_03_06_2008_03_16_2008_03_25 Zone4 2008 2008 2008 03 03 03 06 16 25 1998 03 28 1998_03_28
6 (-477920.1, 1114678) 25871.39 -7.598877 1 Zone4_2008_03_30_2008_04_09_2008_04_23 Zone4 2008 2008 2008 03 04 04 30 09 23 1998 04 21 1998_04_21
It looks like my code is taking the column values from the first iteration and adding them to shapefiles in subsequent iterations. How can my code be modified to run new calculations with each iteration, and add those unique values to their respective shapefiles?
Thank you
I think your problem may be with the start of your for loop.
files <- list.files(input_path, pattern = "[.]shp$")  # keep this line to get your files
for (f in 1:length(files)) {                          # loop over indices so you handle the files one by one
  ifile <- list.files(input_path, f)                  # delete this line from your code
  shp_paste <- paste(input_path, files[f], sep = "")  # use this line to pick out each shp file
Keep the rest of your code as it is and see if this helps.
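Putting that suggestion together, the corrected opening of the loop might look like this (a sketch; the rest of the loop body stays as in the question):
files <- list.files(input_path, pattern = "[.]shp$")
for (f in seq_along(files)) {
  shp_paste <- paste0(input_path, files[f])
  tryCatch({shp0 <- readOGR(shp_paste, verbose = FALSE)},
           error = function(e) print("Error1."))
  # ... rest of the original loop body unchanged ...
}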
Thank you for your help, everyone; I found the problem. A tad embarrassing: I wasn't sorting by filename in ascending order before adding the new columns, so the values in the new columns looked wrong because they weren't matched to the correct rows. A clumsy error on my part; thanks to all who offered advice.

Using grep in tapply()-function for string matching in data frame in R

I have a data frame consisting of 36 observations of 8 variables (the first two variables are factors and the last 6 are numeric). The structure looks like this:
Technology Sector 2011 2012 2013 2014 2015 mean
Photovoltaic Energy 10 20 30 10 30 20
Wind-based Energy 20 60 60 20 40 40
Cultivation Nature 10 10 20 30 30 20
I want to get the mean for every technology that has the string "based" in its name. I've done it this way:
df[[8]][c(grep("based",Technology))]
And it works exactly as it is supposed to. The task from my online course is to do it also with one of the following: the apply(), lapply(), sapply(), or tapply() function. How can this be done?
I don't think tapply is an appropriate option, because the OP just wants a subset. But if one of apply/lapply/sapply must be used, an option could be:
df$mean[mapply(function(x)grepl("based", x), df$Technology)]
#[1] 40
df$mean[sapply(df$Technology,function(x)grepl("based", x))]
#[1] 40
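Since grepl is already vectorized, the same subset can also be written directly, without any apply-family function; this is essentially what the mapply/sapply calls above reduce to:
df$mean[grepl("based", df$Technology)]
#[1] 40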
Data:
df <- read.table(text=
"Technology Sector 2011 2012 2013 2014 2015 mean
Photovoltaic Energy 10 20 30 10 30 20
Wind-based Energy 20 60 60 20 40 40
Cultivation Nature 10 10 20 30 30 20",
header = TRUE, stringsAsFactors = FALSE)
