Column variables are changed when making a data partition in R

I have the following dataset:
head(filter_selection)
MATCHID COMPETITION TEAM1 TEAM2 GOALS1 GOALS2 RESULT EXPG1 EXPG2 DATUM TIJD VERSCHIL
1 1696873 Pro League Standard Liège Sporting Charleroi 3 0 TEAM1 1.57 0.61 25-7-2014 18:30:00 0.96
2 1696883 Pro League Waasland-Beveren Club Brugge 0 2 TEAM2 1.29 1.18 26-7-2014 16:00:00 0.11
3 1696879 Pro League Lierse KV Oostende 2 0 TEAM1 1.03 1.04 26-7-2014 18:00:00 -0.01
4 1696881 Pro League Westerlo Lokeren 1 0 TEAM1 1.76 1.24 26-7-2014 18:00:00 0.52
5 1696877 Pro League Mechelen Genk 3 1 TEAM1 1.60 1.23 27-7-2014 12:30:00 0.37
6 1696871 Pro League Anderlecht Mouscron-Péruwelz 3 1 TEAM1 1.27 0.62 27-7-2014 16:00:00 0.65
I want to use the VERSCHIL value to predict the RESULT. Therefore I do the following to create a test/training set:
library(rcaret)
inTrain <- createDataPartition(y=filter_selection$RESULT, p=0.75, list=FALSE)
The thing is, however, that when I do this my RESULT column changes:
training <- df_final_test[inTrain, ]
testing <- df_final_test[-inTrain, ]
head(training, 20)
MATCHID COMPETITION TEAM1 TEAM2 GOALS1 GOALS2 RESULT EXPG1 EXPG2 DATUM TIJD VERSCHIL CLAS type TYPE TYPE2
1 1696873 Pro League Standard Liège Sporting Charleroi 3 0 3 1.57 0.61 25-7-2014 18:30:00 0.96 0.96 TBD (-0.0767,1.54] HIGH
2 1696883 Pro League Waasland-Beveren Club Brugge 0 2 4 1.29 1.18 26-7-2014 16:00:00 0.11 0.11 TBD (-0.0767,1.54] MEDIUM
It's now 3 and 4 instead of TEAM1 and TEAM2. Could anybody tell me why the TEAM1 value changed into 3?
It's strange, because when I do the same with the spam dataset it works fine:
data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[inTrain, ]
head(training)
And that is even though the classes are the same:
class(spam$type)
[1] "factor"
class(filter_selection$RESULT)
[1] "factor"

First of all, there is no package rcaret; the package is called caret.
Secondly, you create a data partition on "filter_selection", but then you create the training and test sets based on a different data frame, "df_final_test".
Do check the structure of df_final_test$RESULT and see how many levels the factor has; maybe something went wrong there. If there are any levels in there that you do not want, use droplevels(df_final_test$RESULT).
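For example, a quick diagnostic could look like this (a minimal sketch, assuming df_final_test is the data frame you actually partitioned):
str(df_final_test$RESULT)                                 # check the class and the factor levels
levels(df_final_test$RESULT)                              # RESULT may carry levels you did not expect
df_final_test$RESULT <- droplevels(df_final_test$RESULT)  # drop unused levels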
If I try the code on the filter_selection and create a training set out of this one, I get a correct training and test set.
library(caret)
inTrain <- createDataPartition(y=filter_selection$RESULT, p=0.75, list=FALSE)
training <- filter_selection[inTrain, ]
testing <- filter_selection[-inTrain, ]
head(training)
MATCHID COMPETITION TEAM1 TEAM2 GOALS1 GOALS2 RESULT EXPG1 EXPG2 DATUM TIJD VERSCHIL
1 1696873 Pro League Standard Liège Sporting Charleroi 3 0 TEAM1 1.57 0.61 25-7-2014 18:30:00 0.96
2 1696883 Pro League Waasland-Beveren Club Brugge 0 2 TEAM2 1.29 1.18 26-7-2014 16:00:00 0.11
4 1696881 Pro League Westerlo Lokeren 1 0 TEAM1 1.76 1.24 26-7-2014 18:00:00 0.52
5 1696877 Pro League Mechelen Genk 3 1 TEAM1 1.60 1.23 27-7-2014 12:30:00 0.37
6 1696871 Pro League Anderlecht Mouscron-Péruwelz 3 1 TEAM1 1.27 0.62 27-7-2014 16:00:00 0.65

Scraping an interactive table with rvest

I'm attempting to scrape the second table shown at the URL below, and I'm running into issues which may be related to the interactive nature of the table.
div_stats_standard appears to refer to the table of interest.
The code runs with no errors but returns an empty list.
library(rvest)
url <- 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[(@id = "div_stats_standard")]') %>%
  html_table()
Can anyone tell me where I'm going wrong?
Look for the table.
library(rvest)
url <- "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
page <- read_html(url)
nodes <- html_nodes(page, "table") # you can use Selectorgadget to identify the node
table <- html_table(nodes[[1]]) # each element of the nodes list is one table that can be extracted
head(table)
Result:
head(table)
Playing Time Playing Time Playing Time Performance Performance
1 Squad # Pl MP Starts Min Gls Ast
2 Arsenal 26 27 297 2,430 39 26
3 Aston Villa 28 27 297 2,430 33 27
4 Bournemouth 25 28 308 2,520 27 17
5 Brighton 23 28 308 2,520 28 19
6 Burnley 21 28 308 2,520 32 23
Performance Performance Performance Performance Per 90 Minutes Per 90 Minutes
1 PK PKatt CrdY CrdR Gls Ast
2 2 2 64 3 1.44 0.96
3 1 3 54 1 1.22 1.00
4 1 1 60 3 0.96 0.61
5 1 1 44 2 1.00 0.68
6 2 2 53 0 1.14 0.82
Per 90 Minutes Per 90 Minutes Per 90 Minutes Expected Expected Expected Per 90 Minutes
1 G+A G-PK G+A-PK xG npxG xA xG
2 2.41 1.37 2.33 35.0 33.5 21.3 1.30
3 2.22 1.19 2.19 30.6 28.2 22.0 1.13
4 1.57 0.93 1.54 31.2 30.5 20.8 1.12
5 1.68 0.96 1.64 33.8 33.1 22.4 1.21
6 1.96 1.07 1.89 30.9 29.4 18.9 1.10
Per 90 Minutes Per 90 Minutes Per 90 Minutes Per 90 Minutes
1 xA xG+xA npxG npxG+xA
2 0.79 2.09 1.24 2.03
3 0.81 1.95 1.04 1.86
4 0.74 1.86 1.09 1.83
5 0.80 2.01 1.18 1.98
6 0.68 1.78 1.05 1.73

Get a cell value using a filter on another cell value

I have the following data.table:
Genre PS4 X360
1: Action 0.71 0.75
2: Adventure 0.25 0.32
3: Fighting 0.47 0.58
4: Misc 0.49 0.73
5: Platform 0.64 0.47
6: Puzzle 0.02 0.12
7: Racing 0.68 0.63
8: Role-Playing 0.55 0.95
9: Shooter 2.22 1.37
10: Simulation 0.15 0.36
11: Sports 1.16 0.63
12: Strategy 0.08 0.36
13: (all) 0.83 0.77
I want to get the Genre value where PS4 has its maximum value, so the expected value is Shooter.
I could get the maximum value of PS4 using dt[, max(PS4)]. How can I use this result to get the corresponding value of the column "Genre"?
You could use which():
> dt$Genre[which(dt$PS4 == max(dt$PS4))]
[1] Shooter
13 Levels: (all) Action Adventure Fighting Misc Platform Puzzle ... Strategy
Alternatively (and even simpler), just use logical subsetting:
> dt$Genre[dt$PS4 == max(dt$PS4)]
[1] Shooter
13 Levels: (all) Action Adventure Fighting Misc Platform Puzzle ... Strategy
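Since the object is a data.table, a data.table-style alternative (a small sketch, assuming the column names shown above) is which.max(), which locates the row of the maximum directly:
> dt[which.max(PS4), Genre]
[1] Shooter
13 Levels: (all) Action Adventure Fighting Misc Platform Puzzle ... Strategy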

Why is the same category giving different frequencies in R

Process_Table = Process_Table[order(-Process_Table$Process, -Process_Table$Freq),]
#output
Process Freq Percent
17 Other Airport Services 45 15.46
5 Check-in 35 12.03
23 Ticket sales and support channels 35 12.03
11 Flight and inflight 33 11.34
19 Pegasus Plus 23 7.90
24 Time Delays 16 5.50
7 Other 13 4.47
14 Other 13 4.47
22 Other 13 4.47
25 Other 13 4.47
16 Other 11 3.78
20 Other 6 2.06
26 Other 6 2.06
3 Other 5 1.72
13 Other 5 1.72
18 Other 5 1.72
21 Other 4 1.37
1 Other 2 0.69
2 Other 1 0.34
4 Other 1 0.34
6 Other 1 0.34
8 Other 1 0.34
9 Other 1 0.34
10 Other 1 0.34
12 Other 1 0.34
15 Other 1 0.34
As you can see, it is giving different frequencies for the same level, whereas if I print the levels of that feature it gives the following output:
levels(Process_Table$Process)
[1] "Check-in" "Flight and inflight"
[3] "Other" "Other Airport Services"
[5] "Pegasus Plus" "Ticket sales and support channels"
[7] "Time Delays"
What I want is the combined frequency of the "Other" category. Can anyone help me out with this?
Edit: this is the code that was used to derive the first set of output:
library(dplyr)
Process_Table$Percent = round(Process_Table$Freq/sum(Process_Table$Freq) * 100, 2)
Process_Table$Process = as.character(Process_Table$Process)
low_list = Process_Table %>%
  filter(Percent < 5.50) %>%
  select(Process)
Process_Table$Process = ifelse(Process_Table$Process %in% low_list$Process, 'Other', Process_Table$Process)
as.data.frame(Process_Table)
Process_Table$Process = as.factor(Process_Table$Process)
Your Process_Table should undergo another aggregation step. Add the following to your final step of data aggregation:
Process_Table <- Process_Table %>% group_by(Process) %>% summarize(Freq = sum(Freq), Percent = sum(Percent))
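To see why the extra aggregation step is needed: after the ifelse() recode the table still has one row per original level, so summing within each Process collapses the duplicated "Other" rows. A minimal sketch with illustrative numbers taken from the output above:
library(dplyr)
toy <- data.frame(Process = c("Check-in", "Other", "Other"),
                  Freq = c(35, 13, 13),
                  Percent = c(12.03, 4.47, 4.47))
toy %>% group_by(Process) %>% summarize(Freq = sum(Freq), Percent = sum(Percent))
# Process   Freq Percent
# Check-in    35   12.03
# Other       26    8.94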

Convert Daily Data into Weekly in R When the Week Starts on Saturday

I am having trouble converting daily data into weekly using averages over the week.
My Data looks like this:
> str(daily_FWIH)
'data.frame': 4371 obs. of 6 variables:
$ Date : Date, format: "2013-03-01" "2013-03-02" "2013-03-04" "2013-03-05" ...
$ CST.OUC : Factor w/ 6 levels "BVG11","BVG12",..: 1 1 1 1 1 1 1 1 1 1 ...
$ CST.NAME : Factor w/ 6 levels "Central Scotland",..: 2 2 2 2 2 2 2 2 2 2 ...
$ SOM_patch: Factor w/ 6 levels "BVG11_Highlands & Islands",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Row_Desc : Factor w/ 1 level "FSFluidWIH": 1 1 1 1 1 1 1 1 1 1 ...
$ Value : num 1.16 1.99 1.47 1.15 1.16 1.28 1.27 2.07 1.26 1.19 ...
> head(daily_FWIH)
Date CST.OUC CST.NAME SOM_patch Row_Desc Value
1 2013-03-01 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.16
2 2013-03-02 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.99
3 2013-03-04 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.47
4 2013-03-05 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.15
5 2013-03-06 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.16
6 2013-03-07 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 1.28
I tried converting the data to an xts object, as shown here:
daily_FWIH$Date = as.Date(as.character(daily_FWIH$Date), "%d/%m/%Y")
library(xts)
temp.x = xts(daily_FWIH[-1], order.by=daily_FWIH$Date)
apply.weekly(temp.x, colMeans(temp.x$Value))
I have two problems: my week starts and ends on a "Saturday", and I get the following error:
> apply.weekly(temp.x, colMeans(temp.x$Value))
Error in colMeans(temp.x$Value) : 'x' must be numeric
UPDATE Based on Sam's comments:
This is what I did:
library(lubridate)
daily_FWIH$Date <- ymd(daily_FWIH$Date) # parse to Date with lubridate
daily_FWIH$fakeDate <- daily_FWIH$Date + days(2)
daily_FWIH$week <- week(daily_FWIH$fakeDate) # extract week value
daily_FWIH$year <- year(daily_FWIH$fakeDate)
> daily_FWIH %>%
+ group_by(year,week) %>%
+ mutate(weeklyAvg = mean(Value), weekStartsOn = min(Date)) %>% # create the average variable
+ slice(which(Date == weekStartsOn)) %>% # select just the first record of the week - other vars will come from this
+ select(-Value,-fakeDate,-week,-year,-Date, -CST.OUC,-CST.NAME) # drop unneeded variables
Source: local data frame [631 x 6]
Groups: year, week
year week SOM_patch Row_Desc weeklyAvg weekStartsOn
1 2013 9 BVG11_Highlands & Islands FSFluidWIH 1.048333 2013-03-01
2 2013 9 BVG12_North East Scotland FSFluidWIH 1.048333 2013-03-01
3 2013 9 BVG13_Central Scotland FSFluidWIH 1.048333 2013-03-01
4 2013 9 BVG14_South East Scotland FSFluidWIH 1.048333 2013-03-01
5 2013 9 BVG15_West Central Scotland FSFluidWIH 1.048333 2013-03-01
6 2013 9 BVG16_South West Scotland FSFluidWIH 1.048333 2013-03-01
7 2013 10 BVG11_Highlands & Islands FSFluidWIH 1.520500 2013-03-02
8 2013 10 BVG12_North East Scotland FSFluidWIH 1.520500 2013-03-02
9 2013 10 BVG13_Central Scotland FSFluidWIH 1.520500 2013-03-02
10 2013 10 BVG14_South East Scotland FSFluidWIH 1.520500 2013-03-02
.. ... ... ... ... ... ...
Which is incorrect...
The desired output is:
> head(desired)
Date BVG11.Highlands_I_.A_pct BVG12.North.East.ScotlandA_pct BVG13.Central.ScotlandA_pct
1 01/03/2013 1.16 1.13 1.08
2 08/03/2013 1.41 2.37 1.80
3 15/03/2013 1.33 3.31 1.34
4 22/03/2013 1.39 2.49 1.62
5 29/03/2013 5.06 3.42 1.42
6 NA NA NA
BVG14.South.East.ScotlandA_pct BVG15.West.Central.ScotlandA_pct BVG16.South.West.ScotlandA_pct
1 1.05 0.98 0.89
2 1.51 1.21 1.07
3 1.13 2.13 2.01
4 2.14 1.24 1.37
5 1.62 1.46 1.95
6 NA NA NA
> str(desired)
'data.frame': 11 obs. of 7 variables:
$ Date : Factor w/ 6 levels "01/03/2013",..: 2 3 4 5 6 1 1 1 1 1 ...
$ BVG11.Highlands_I_.A_pct : num 1.16 1.41 1.33 1.39 5.06 ...
$ BVG12.North.East.ScotlandA_pct : num 1.13 2.37 3.31 2.49 3.42 ...
$ BVG13.Central.ScotlandA_pct : num 1.08 1.8 1.34 1.62 1.42 ...
$ BVG14.South.East.ScotlandA_pct : num 1.05 1.51 1.13 2.14 1.62 ...
$ BVG15.West.Central.ScotlandA_pct: num 0.98 1.21 2.13 1.24 1.46 ...
$ BVG16.South.West.ScotlandA_pct : num 0.89 1.07 2.01 1.37 1.95 ...
Find the first Saturday in your data, then assign a week ID to all dates in your data set based on that:
library(lubridate) # for the wday() and ymd() functions
daily_FWIH$Date <- ymd(daily_FWIH$Date)
saturdays <- daily_FWIH[wday(daily_FWIH$Date) == 7, ] # filter for Saturdays
startDate <- min(saturdays$Date) # select first Saturday
daily_FWIH$week <- floor(as.numeric(difftime(daily_FWIH$Date, startDate, units = "weeks")))
Once you have a weekID-starting-on-Saturday variable, this is a standard R problem. You can calculate the weekly averages using your method of choice for calculating means within a subgroup. I like dplyr:
library(dplyr)
daily_FWIH %>%
group_by(week, SOM_patch) %>% # use your grouping variables in addition to week
summarise(weeklyAvg = mean(Value), weekBeginDate = min(Date)) %>%
mutate(firstDayOfWeek = wday(weekBeginDate, label=TRUE)) # confirm correct week cuts
Source: local data frame [2 x 5]
Groups: week
week SOM_patch weeklyAvg weekBeginDate firstDayOfWeek
1 -1 BVG11_Highlands & Islands 1.16 2013-03-01 Fri
2 0 BVG11_Highlands & Islands 1.41 2013-03-02 Sat
Update based on comments below:
If you want to see the other values in your dataset, you'll need to decide how to select or calculate weekly values when daily values within a week conflict. In your sample data, they are the same in all rows, so I'm just drawing them from the row containing the first day of the week.
library(dplyr)
daily_FWIH %>%
group_by(week, SOM_patch) %>% # use your grouping variables
mutate(weeklyAvg = mean(Value), weekBeginDate = min(Date)) %>%
slice(which(Date == weekBeginDate)) %>% # select just the first record of the week - other vars will come from this
select(-Value, -Date) # drop unneeded variables
Source: local data frame [2 x 7]
Groups: week, SOM_patch
CST.OUC CST.NAME SOM_patch Row_Desc week weeklyAvg weekBeginDate
1 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH -1 1.16 2013-03-01
2 BVG11 Highlands & Islands BVG11_Highlands & Islands FSFluidWIH 0 1.41 2013-03-02
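If you also need the wide layout of the desired output (one column per SOM_patch), one option is to spread the weekly averages after summarising. A minimal sketch, assuming the tidyr package is available and using the week variable built above:
library(dplyr)
library(tidyr)
daily_FWIH %>%
  group_by(week, SOM_patch) %>%
  summarise(weeklyAvg = mean(Value)) %>%
  spread(SOM_patch, weeklyAvg) # one row per week, one column per SOM_patch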

Split data frame into multiple data frames based on information in a xts object

I really need your help on the following issue:
I have two data frames - one containing a portfolio of securities with ISIN and Cluster information.
> dfInput
TICKER CLUSTER SECURITY.NAME
1 LU0937588209 High Yield Prime Capital Access SA SICAV-
2 LU0694362343 High Yield ECM CREDIT FUND SICAV - ECM Hi
3 IE0030390896 High Yield Putnam World Trust - Global Hi
4 LU0575374342 EM Debt Ashmore SICAV - Emerging Marke
5 LU0493865678 EM Debt Ashmore SICAV - Emerging Marke
6 LU0972237696 EM Debt Galloway Global Fixed Income F
7 IE00B6TLWG59 ILS/CAT GAM Star Fund PLC - Cat Bond F
8 LU0816333396 ILS/CAT LGT Lux I - Cat Bond Fund
9 LU0879473352 L/S Credit Merrill Lynch Investment Solut
10 HINCFEF ID Equity L/S Credit Hedge Invest International Fun
11 FR0011034800 L/S Credit Schelcher Prince Opportunite E
12 PIMCSEI ID Equity L/S Credit PIMCO Funds Global Investors S
13 VTR US Equity REITs Ventas Inc
14 HCP US Equity REITs HCP Inc
15 VGSIX US Equity REITs Vanguard REIT Index Fund
16 NLY US Equity M REITs Annaly Capital Management Inc
17 CLNY US Equity M REITs Colony Financial Inc
18 AGNC US Equity M REITs American Capital Agency Corp
19 REM US Equity M REITs iShares Mortgage Real Estate C
20 ES0130960018 Infrastructure Equities Enagas SA
21 SDRL US Equity Infrastructure Equities Seadrill Ltd
22 IGF US Equity Infrastructure Equities iShares Global Infrastructure
23 KMP US Equity MLP Kinder Morgan Energy Partners
24 EPD US Equity MLP Enterprise Products Partners L
25 MLPI US Equity MLP ETRACS Alerian MLP Infrastruct
26 HTGC US Equity BDC Hercules Technology Growth Cap
27 TCPC US Equity BDC TCP Capital Corp
28 MAIN US Equity BDC Main Street Capital Corp
29 BDCS US Equity BDC ETRACS Linked to the Wells Far
The other contains multiple time series of the returns of these securities, with the security name as the column name (the data comes from an Excel file):
> PortfolioR.xts
Ventas.Inc HCP.Inc ....
2011-01-03 0.0000000000 0.0000000000
2011-01-04 -0.0117725362 -0.0056323067
2011-01-05 -0.0081155489 0.0018809625
2011-01-06 -0.0009479572 -0.0154202974
2011-01-07 -0.0058974774 -0.0054674822
2011-01-10 -0.0074691528 -0.0077050464
2011-01-11 -0.0036591278 0.0052348928
2011-01-12 0.0132249172 -0.0091097938
2011-01-13 0.0015220703 0.0085600412
2011-01-14 0.0058762372 -0.0038567541
2011-01-17 0.0000000000 0.0000000000
2011-01-18 0.0157513101 -0.0002760525
2011-01-19 -0.0059712810 -0.0074823683
2011-01-20 0.0013092679 0.0049944610
2011-01-21 0.0013075560 -0.0055509440
...
How can I now split the xts object based on the cluster information of the portfolio?
The result should be a separate data.frame or xts object for each CLUSTER, containing the return history of the securities belonging to that cluster.
Is this possible?
Thank you in advance...
Here's one way to do it:
setNames(lapply(unique(dfInput$CLUSTER), function(x) {
  PortfolioR.xts[, which(dfInput$CLUSTER[match(colnames(PortfolioR.xts),
                                               dfInput$SECURITY.NAME)] == x)]
}), unique(dfInput$CLUSTER))
For example:
# Set up some fake data
d1 <- data.frame(grp=sample(LETTERS[1:4], 10, replace=TRUE),
name=letters[1:10])
d1
# grp name
# 1 A a
# 2 B b
# 3 B c
# 4 D d
# 5 C e
# 6 B f
# 7 B g
# 8 A h
# 9 D i
# 10 A j
d2 <- matrix(round(runif(50), 2), ncol=10)
colnames(d2) <- letters[1:10]
library(xts)
d2 <- xts(d2, seq.Date(as.Date('01-01-2011', '%d-%m-%Y'),
as.Date('5-01-2011', '%d-%m-%Y'), 1))
d2
# a b c d e f g h i j
# 2011-01-01 0.51 0.41 0.69 0.87 0.37 0.86 0.47 0.68 0.64 0.73
# 2011-01-02 0.72 0.92 0.53 0.55 0.62 0.54 0.75 0.64 0.04 0.72
# 2011-01-03 0.34 0.50 0.92 0.23 0.59 0.09 0.78 0.53 0.26 0.27
# 2011-01-04 0.52 0.47 0.49 0.25 0.18 0.07 0.65 0.13 0.46 0.74
# 2011-01-05 0.10 0.87 0.10 0.48 0.58 0.72 0.96 0.71 0.78 0.80
out <- setNames(lapply(unique(d1$grp), function(x) {
  d2[, which(d1$grp[match(colnames(d2), d1$name)] == x)]
}), unique(d1$grp))
out
# $A
# a h j
# 2011-01-01 0.51 0.68 0.73
# 2011-01-02 0.72 0.64 0.72
# 2011-01-03 0.34 0.53 0.27
# 2011-01-04 0.52 0.13 0.74
# 2011-01-05 0.10 0.71 0.80
#
# $B
# b c f g
# 2011-01-01 0.41 0.69 0.86 0.47
# 2011-01-02 0.92 0.53 0.54 0.75
# 2011-01-03 0.50 0.92 0.09 0.78
# 2011-01-04 0.47 0.49 0.07 0.65
# 2011-01-05 0.87 0.10 0.72 0.96
#
# $C
# d i
# 2011-01-01 0.87 0.64
# 2011-01-02 0.55 0.04
# 2011-01-03 0.23 0.26
# 2011-01-04 0.25 0.46
# 2011-01-05 0.48 0.78
#
# $D
# e
# 2011-01-01 0.37
# 2011-01-02 0.62
# 2011-01-03 0.59
# 2011-01-04 0.18
# 2011-01-05 0.58
If you want the list elements (which are xts objects) to be standalone xts objects in the global environment, you can use list2env:
list2env(out, globalenv())
This will overwrite any objects in the global environment that have the same names as the list elements (i.e. A, B, C and D for the example above).
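Applied to the example, the result is a named list, so the individual cluster histories can be pulled out by name or processed further; a small usage sketch with the objects defined above:
out[["A"]]        # the xts object holding the returns of the securities in cluster A
lapply(out, ncol) # number of securities per cluster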
