So I'm starting with a dataframe called max.mins that has 153 rows.
day Tx Hx Tn
1 1 10.0 7.83 2.1
2 2 7.7 6.19 2.5
3 3 7.1 4.86 0.0
4 4 9.8 7.37 2.7
5 5 13.4 12.68 0.4
6 6 17.5 17.47 3.5
7 7 16.5 15.58 6.5
8 8 21.5 20.30 6.2
9 9 21.7 21.41 9.7
10 10 24.4 28.18 8.0
I'm applying these statements to the data frame to look for specific criteria:
temp_warnings <- subset(max.mins, Tx >= 32 & Tn >=20)
humidex_warnings <- subset(max.mins, Hx >= 40)
Now when I open up humidex_warnings, for example, I have this data frame:
row.names day Tx Hx Tn
1 41 10 31.1 40.51 20.7
2 56 25 33.4 42.53 19.6
3 72 11 34.1 40.78 18.1
4 73 12 33.8 40.18 18.8
5 74 13 34.1 41.10 22.4
6 79 18 30.3 41.57 22.5
7 94 2 31.4 40.81 20.3
8 96 4 30.7 40.39 20.2
The next step is to search for 2 or 3 consecutive numbers in the row.names column and give me a total of how many times this occurs (I asked this in a previous question and have a function that should work once this problem is sorted out). The issue is that row.names is of class NULL, which is preventing me from applying further functions to this data frame.
Help? :)
Thanks in advance,
Nick
If you need the row names as integer data:
humidex_warnings$seq <- as.integer(row.names(humidex_warnings))
If you don't need the row names:
row.names(humidex_warnings) <- NULL
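Once the row names are stored as integers, the follow-up goal you mention (counting runs of 2 or 3 consecutive row numbers) can be sketched with diff() and rle(). The index vector below is just the example from your printout, standing in for as.integer(row.names(humidex_warnings)):

```r
# Example index vector, as produced by as.integer(row.names(humidex_warnings))
idx <- c(41, 56, 72, 73, 74, 79, 94, 96)

# Start a new group whenever the gap to the previous index is not exactly 1,
# then run-length-encode the group labels
runs <- rle(cumsum(c(1, diff(idx) != 1)))

sum(runs$lengths >= 2)  # number of runs of 2+ consecutive indices -> 1 (72, 73, 74)
```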
I have a weather dataset, and my data is date-dependent.
I want to predict the temperature from 07 May 2008 until 18 May 2008 (maybe a total of 10-15 observations); my data size is around 200.
I will be using decision trees/RF and SVM & NN to make my prediction.
I've never handled data like this, so I'm not sure how to sample non-random data.
I want to split the data into 80% train data and 20% test data, but I want to sample the data in the original order, not randomly. Is that possible?
install.packages("rattle")
install.packages("RGtk2")  # GUI dependency of rattle; not needed just to read the data
library(rattle)
fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")
dataset <- dataset[1:200, ]
dataset <- dataset[order(dataset$Date), ]
set.seed(321)
sample_data <- sample(nrow(dataset), nrow(dataset) * 0.8)
test <- dataset[sample_data, ]   # 80% of the rows, drawn at random
train <- dataset[-sample_data, ] # the remaining 20%, still in date order
Output:
> head(dataset)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3 NW 30
2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7 ENE 39
3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3 NW 85
4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1 NW 54
5 2007-11-05 Canberra 7.6 16.1 2.8 5.6 10.6 SSE 50
6 2007-11-06 Canberra 6.2 16.9 0.0 5.8 8.2 SE 44
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
1 SW NW 6 20 68 29 1019.7
2 E W 4 17 80 36 1012.4
3 N NNE 6 6 82 69 1009.5
4 WNW W 30 24 62 56 1005.5
5 SSE ESE 20 28 68 49 1018.3
6 SE E 20 24 70 57 1023.8
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
1 1015.0 7 7 14.4 23.6 No 3.6 Yes
2 1008.4 5 3 17.5 25.7 Yes 3.6 Yes
3 1007.2 8 7 15.4 20.2 Yes 39.8 Yes
4 1007.0 2 7 13.5 14.1 Yes 2.8 Yes
5 1018.5 7 7 11.1 15.4 Yes 0.0 No
6 1021.7 7 5 10.9 14.8 No 0.2 No
> head(test)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
182 2008-04-30 Canberra -1.8 14.8 0.0 1.4 7.0 N 28
77 2008-01-16 Canberra 17.9 33.2 0.0 10.4 8.4 N 59
88 2008-01-27 Canberra 13.2 31.3 0.0 6.6 11.6 WSW 46
58 2007-12-28 Canberra 15.1 28.3 14.4 8.8 13.2 NNW 28
96 2008-02-04 Canberra 18.2 22.6 1.8 8.0 0.0 ENE 33
126 2008-03-05 Canberra 12.0 27.6 0.0 6.0 11.0 E 46
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
182 E N 2 19 80 40 1024.2
77 N NNE 15 20 58 62 1008.5
88 N WNW 4 26 71 28 1013.1
58 NNW NW 6 13 73 44 1016.8
96 SSE ENE 7 13 92 76 1014.4
126 SSE WSW 7 6 69 35 1025.5
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
182 1020.5 1 7 5.3 13.9 No 0.0 No
77 1006.1 6 7 24.5 23.5 No 4.8 Yes
88 1009.5 1 4 19.7 30.7 No 0.0 No
58 1013.4 1 5 18.3 27.4 Yes 0.0 No
96 1011.5 8 8 18.5 22.1 Yes 9.0 Yes
126 1022.2 1 1 15.7 26.2 No 0.0 No
> head(train)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
7 2007-11-07 Canberra 6.1 18.2 0.2 4.2 8.4 SE 43
9 2007-11-09 Canberra 8.8 19.5 0.0 4.0 4.1 S 48
11 2007-11-11 Canberra 9.1 25.2 0.0 4.2 11.9 N 30
16 2007-11-16 Canberra 12.4 32.1 0.0 8.4 11.1 E 46
22 2007-11-22 Canberra 16.4 19.4 0.4 9.2 0.0 E 26
25 2007-11-25 Canberra 15.4 28.4 0.0 4.4 8.1 ENE 33
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
7 SE ESE 19 26 63 47 1024.6
9 E ENE 19 17 70 48 1026.1
11 SE NW 6 9 74 34 1024.4
16 SE WSW 7 9 70 22 1017.9
22 ENE E 6 11 88 72 1010.7
25 SSE NE 9 15 85 31 1022.4
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
7 1022.2 4 6 12.4 17.3 No 0.0 No
9 1022.7 7 7 14.1 18.9 No 16.2 Yes
11 1021.1 1 2 14.6 24.0 No 0.2 No
16 1012.8 0 3 19.1 30.7 No 0.0 No
22 1008.9 8 8 16.5 18.3 No 25.8 Yes
25 1018.6 8 2 16.8 27.3 No 0.0 No
I'll use mtcars as an example. One option to non-randomly split your data into train and test is to first compute the training-set size from the number of rows in your data. After that you can use split() to divide the data exactly at 80% of the rows, using the following code:
smp_size <- floor(0.80 * nrow(mtcars))
split <- split(mtcars, rep(1:2, times = c(smp_size, nrow(mtcars) - smp_size)))
With the following code you can turn the split in train and test:
train <- split$`1`
test <- split$`2`
Let's check the number of rows:
> nrow(train)
[1] 25
> nrow(test)
[1] 7
Now the data is split in train and test without losing their order.
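For comparison, the same ordered 80/20 split can be done with plain row indexing, no split() call needed:

```r
# Ordered 80/20 split of mtcars using plain row indexing
smp_size <- floor(0.80 * nrow(mtcars))  # 25 of 32 rows

train <- mtcars[1:smp_size, ]                 # first 80% of rows
test  <- mtcars[(smp_size + 1):nrow(mtcars), ] # remaining 20%

nrow(train)  # 25
nrow(test)   # 7
```

Both approaches preserve the original row order, which is what matters for date-dependent data.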
I have a function that takes in a data frame, a percentile threshold, and the name of a given column, and computes a new column flagging all values above this threshold in the given column (0 for <, and 1 for >=). However, it won't let me use df$column_name inside the quantile function, because column_name is not actually a column name but a variable storing the actual column name, so df$column_name returns NULL. Is there any way to work around this and keep the code format somewhat similar to what it is currently? Or do I have to specify the numerical column index instead of the name? While I can do this, it is definitely not as convenient/comprehensible as just passing in the column name.
func1 <- function(df, threshold, column_name) {
threshold_value <- quantile(df$column_name, c(threshold))
new_df <- df %>%
mutate(ifelse(column_name > threshold_value, 1, 0))
return(new_df)
}
Thank you so much for your help!
I modified your function as follows. Now the function takes a data frame, a threshold, and a column name, and it only needs base R.
# Modified function
func1 <- function(df, threshold, column_name) {
threshold_value <- quantile(df[[column_name]], threshold)
new_df <- df
new_df[["new_col"]] <- ifelse(df[[column_name]] > threshold_value, 1, 0)
return(new_df)
}
# Take the trees data frame as an example
head(trees)
# Girth Height Volume
# 1 8.3 70 10.3
# 2 8.6 65 10.3
# 3 8.8 63 10.2
# 4 10.5 72 16.4
# 5 10.7 81 18.8
# 6 10.8 83 19.7
# Apply the function
func1(trees, 0.5, "Volume")
# Girth Height Volume new_col
# 1 8.3 70 10.3 0
# 2 8.6 65 10.3 0
# 3 8.8 63 10.2 0
# 4 10.5 72 16.4 0
# 5 10.7 81 18.8 0
# 6 10.8 83 19.7 0
# 7 11.0 66 15.6 0
# 8 11.0 75 18.2 0
# 9 11.1 80 22.6 0
# 10 11.2 75 19.9 0
# 11 11.3 79 24.2 0
# 12 11.4 76 21.0 0
# 13 11.4 76 21.4 0
# 14 11.7 69 21.3 0
# 15 12.0 75 19.1 0
# 16 12.9 74 22.2 0
# 17 12.9 85 33.8 1
# 18 13.3 86 27.4 1
# 19 13.7 71 25.7 1
# 20 13.8 64 24.9 1
# 21 14.0 78 34.5 1
# 22 14.2 80 31.7 1
# 23 14.5 74 36.3 1
# 24 16.0 72 38.3 1
# 25 16.3 77 42.6 1
# 26 17.3 81 55.4 1
# 27 17.5 82 55.7 1
# 28 17.9 80 58.3 1
# 29 18.0 80 51.5 1
# 30 18.0 80 51.0 1
# 31 20.6 87 77.0 1
If you still want to use dplyr, it is essential to learn how to deal with non-standard evaluation; see the programming vignette to learn more (https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html). The following code also works.
library(dplyr)
func2 <- function(df, threshold, column_name) {
col_en <- enquo(column_name)
threshold_value <- quantile(df %>% pull(!!col_en), threshold)
new_df <- df %>%
mutate(new_col := ifelse(!!col_en >= threshold_value, 1, 0))
return(new_df)
}
func2(trees, 0.5, Volume)
# Girth Height Volume new_col
# 1 8.3 70 10.3 0
# 2 8.6 65 10.3 0
# 3 8.8 63 10.2 0
# 4 10.5 72 16.4 0
# 5 10.7 81 18.8 0
# 6 10.8 83 19.7 0
# 7 11.0 66 15.6 0
# 8 11.0 75 18.2 0
# 9 11.1 80 22.6 0
# 10 11.2 75 19.9 0
# 11 11.3 79 24.2 1
# 12 11.4 76 21.0 0
# 13 11.4 76 21.4 0
# 14 11.7 69 21.3 0
# 15 12.0 75 19.1 0
# 16 12.9 74 22.2 0
# 17 12.9 85 33.8 1
# 18 13.3 86 27.4 1
# 19 13.7 71 25.7 1
# 20 13.8 64 24.9 1
# 21 14.0 78 34.5 1
# 22 14.2 80 31.7 1
# 23 14.5 74 36.3 1
# 24 16.0 72 38.3 1
# 25 16.3 77 42.6 1
# 26 17.3 81 55.4 1
# 27 17.5 82 55.7 1
# 28 17.9 80 58.3 1
# 29 18.0 80 51.5 1
# 30 18.0 80 51.0 1
# 31 20.6 87 77.0 1
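On a recent dplyr/rlang (rlang >= 0.4.0), the enquo()/!! pair can also be written with the {{ }} ("curly-curly") shorthand. A sketch of the same function, assuming that version is available:

```r
library(dplyr)

# Same behaviour as func2, using the {{ }} shorthand for enquo()/!!
func3 <- function(df, threshold, column_name) {
  threshold_value <- quantile(pull(df, {{ column_name }}), threshold)
  df %>%
    mutate(new_col = ifelse({{ column_name }} >= threshold_value, 1, 0))
}

head(func3(trees, 0.5, Volume))
```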
I am trying to scrape from http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used SelectorGadget and found the tag for the table I want to be #advanced. However, I noticed rvest wasn't picking it up. Looking at the page source, I noticed that the tables are inside HTML comment tags (<!-- ... -->).
What is the best way to get the tables from inside the comment tags? Thanks!
Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none
You can use the XPath comment() function to select comment nodes, then reparse their contents as HTML:
library(rvest)
# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')
df <- h %>% html_nodes(xpath = '//comment()') %>% # select comment nodes
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to a single string
read_html() %>% # reparse to HTML
html_node('table#advanced') %>% # select the desired table
html_table() %>% # parse table
.[colSums(is.na(.)) < nrow(.)] # get rid of spacer columns
df[, 1:15]
## Rk Player Age G MP PER TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK%
## 1 1 Pau Gasol 34 78 2681 22.7 0.550 0.023 0.317 9.2 27.6 18.6 14.4 0.5 4.0
## 2 2 Jimmy Butler 25 65 2513 21.3 0.583 0.212 0.508 5.1 11.2 8.2 14.4 2.3 1.0
## 3 3 Joakim Noah 29 67 2049 15.3 0.482 0.005 0.407 11.9 22.1 17.1 23.0 1.2 2.6
## 4 4 Aaron Brooks 30 82 1885 14.4 0.534 0.383 0.213 1.9 7.5 4.8 24.2 1.5 0.6
## 5 5 Mike Dunleavy 34 63 1838 11.6 0.573 0.547 0.181 1.7 12.7 7.3 9.7 1.1 0.8
## 6 6 Taj Gibson 29 62 1692 16.1 0.545 0.000 0.364 10.7 14.6 12.7 6.9 1.1 3.2
## 7 7 Nikola Mirotic 23 82 1654 17.9 0.556 0.502 0.455 4.3 21.8 13.3 9.7 1.7 2.4
## 8 8 Kirk Hinrich 34 66 1610 6.8 0.468 0.441 0.131 1.4 6.6 4.1 13.8 1.5 0.6
## 9 9 Derrick Rose 26 51 1530 15.9 0.493 0.325 0.224 2.6 8.7 5.7 30.7 1.2 0.8
## 10 10 Tony Snell 23 72 1412 10.2 0.550 0.531 0.148 2.5 10.9 6.8 6.8 1.2 0.6
## 11 11 E'Twaun Moore 25 56 504 10.3 0.504 0.273 0.144 2.7 7.1 5.0 10.4 2.1 0.9
## 12 12 Doug McDermott 23 36 321 6.1 0.480 0.383 0.140 2.1 12.2 7.3 3.0 0.6 0.2
## 13 13 Nazr Mohammed 37 23 128 8.7 0.431 0.000 0.100 9.6 22.3 16.1 3.6 1.6 2.8
## 14 14 Cameron Bairstow 24 18 64 2.1 0.309 0.000 0.357 10.5 3.3 6.8 2.2 1.6 1.1
Ok..got it.
library(stringi)
library(knitr)
library(rvest)
any_version_html <- function(x){
XML::htmlParse(x)
}
a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
b <- readLines(a)
c <- paste0(b, collapse = "")
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = T, simplify = T)))
e <- html_table(any_version_html(d))
> kable(summary(e),'rst')
====== ========== ====
Length Class Mode
====== ========== ====
9 data.frame list
2 data.frame list
24 data.frame list
21 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
3 data.frame list
====== ========== ====
kable(e[[1]],'rst')
=== ================ === ==== === ================== === === =================================
No. Player Pos Ht Wt Birth Date  Exp College
=== ================ === ==== === ================== === === =================================
41 Cameron Bairstow PF 6-9 250 December 7, 1990 au R University of New Mexico
0 Aaron Brooks PG 6-0 161 January 14, 1985 us 6 University of Oregon
21 Jimmy Butler SG 6-7 220 September 14, 1989 us 3 Marquette University
34 Mike Dunleavy SF 6-9 230 September 15, 1980 us 12 Duke University
16 Pau Gasol PF 7-0 250 July 6, 1980 es 13
22 Taj Gibson PF 6-9 225 June 24, 1985 us 5 University of Southern California
12 Kirk Hinrich SG 6-4 190 January 2, 1981 us 11 University of Kansas
3 Doug McDermott SF 6-8 225 January 3, 1992 us R Creighton University
## Realized we should index with some names...but this is somewhat cheating as we know the start and end indexes for table titles..I prefer to parse-in-the-dark.
# Names are in h2-tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = T)))
e_names <- gsub("<(.*?)>","",e_names[grep('Roster',e_names):grep('Salaries',e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')
=== ============== ===========
Rk Player Salary
=== ============== ===========
1 Derrick Rose $18,862,875
2 Carlos Boozer $13,550,000
3 Joakim Noah $12,200,000
4 Taj Gibson $8,000,000
5 Pau Gasol $7,128,000
6 Nikola Mirotic $5,305,000
=== ============== ===========
rate len ADT trks sigs1 slim shld lane acpt itg lwid hwy
1 4.58 4.99 69 8 0.20040080 55 10 8 4.6 1.20 12 FAI
2 2.86 16.11 73 8 0.06207325 60 10 4 4.4 1.43 12 FAI
3 3.02 9.75 49 10 0.10256410 60 10 4 4.7 1.54 12 FAI
4 2.29 10.65 61 13 0.09389671 65 10 6 3.8 0.94 12 FAI
5 1.61 20.01 28 12 0.04997501 70 10 4 2.2 0.65 12 FAI
6 6.87 5.97 30 6 2.00750419 55 10 4 24.8 0.34 12 PA
7 3.85 8.57 46 8 0.81668611 55 8 4 11.0 0.47 12 PA
8 6.12 5.24 25 9 0.57083969 55 10 4 18.5 0.38 12 PA
9 3.29 15.79 43 12 1.45333122 50 4 4 7.5 0.95 12 PA
I have a question about adding a new column. My data frame is called highway1, and I want to add a column named S/N, computed as slim divided by acpt. What can I do?
Thanks
> mydf$SN <- mydf$slim/mydf$acpt
> mydf
rate len ADT trks sigs1 slim shld lane acpt itg lwid hwy SN
1 4.58 4.99 69 8 0.20040080 55 10 8 4.6 1.20 12 FAI 11.956522
2 2.86 16.11 73 8 0.06207325 60 10 4 4.4 1.43 12 FAI 13.636364
3 3.02 9.75 49 10 0.10256410 60 10 4 4.7 1.54 12 FAI 12.765957
4 2.29 10.65 61 13 0.09389671 65 10 6 3.8 0.94 12 FAI 17.105263
5 1.61 20.01 28 12 0.04997501 70 10 4 2.2 0.65 12 FAI 31.818182
6 6.87 5.97 30 6 2.00750419 55 10 4 24.8 0.34 12 PA 2.217742
7 3.85 8.57 46 8 0.81668611 55 8 4 11.0 0.47 12 PA 5.000000
8 6.12 5.24 25 9 0.57083969 55 10 4 18.5 0.38 12 PA 2.972973
9 3.29 15.79 43 12 1.45333122 50 4 4 7.5 0.95 12 PA 6.666667
I hope an explanation is not necessary for the above.
While $ is the preferred route, you can also consider cbind.
First, create the numeric vector and assign it to SN:
SN <- Data[, "slim"] / Data[, "acpt"]  # i.e., columns 6 and 9
Now you use cbind to append the numeric vector as a column to the existing data frame:
Data <- cbind(Data, SN)
Again, using the dollar operator $ is preferred, but it doesn't hurt seeing what an alternative looks like.
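For completeness, base R's transform() gives a third one-liner; here is a sketch using a two-row stand-in for highway1 (values taken from the first two rows of the data above):

```r
# transform() evaluates column expressions inside the data frame,
# so slim and acpt can be referenced without the data-frame prefix
highway1 <- data.frame(slim = c(55, 60), acpt = c(4.6, 4.4))
highway1 <- transform(highway1, SN = slim / acpt)

highway1$SN  # ~11.96 13.64, matching rows 1 and 2 of the output above
```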
So I have data regarding Id number and time
Id number Time(hr)
1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4
I want this output
Time Id number
10 5
20 10
30 16
40 22
So I want the time to be in 10-hour intervals and get the ID that corresponds to that particular hour. I decided to use this code: data <- data2[seq(0, nrow(data2), by=5), ], but instead of the Time being in 10-hour intervals, the ID number is at intervals of 10, which is not the output I want. So far I'm getting this output:
Id.number Time..s.
10 19.3
20 36.9
You can use the %% (mod) operator to keep only rows whose Time is an exact multiple of 10:
data[data$Time %% 10 == 0, ]
Note, though, that none of the Time values in your data are exact multiples of 10, so for this particular data you would need to match the nearest Time instead.
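If, as in your expected output, you want the row whose Time is nearest each 10-hour mark, one sketch (rebuilding your data frame here, and assuming columns named Id and Time):

```r
# Rebuild the example data from the question
data <- data.frame(Id = 1:25,
                   Time = c(5, 6.1, 7.2, 8.3, 9.6, 10.9, 13, 15.1, 17.2, 19.3,
                            21.4, 23.5, 25.6, 27.1, 28.6, 30.1, 31.8, 33.5, 35.2,
                            36.9, 38.6, 40.3, 42, 43.7, 45.4))

# For each 10-hour mark, find the row whose Time is closest to it
marks <- c(10, 20, 30, 40)
idx <- sapply(marks, function(m) which.min(abs(data$Time - m)))

data.frame(Time = marks, Id = data$Id[idx])
#   Time Id
# 1   10  5
# 2   20 10
# 3   30 16
# 4   40 22
```

This reproduces the expected output exactly, because 30.1 (Id 16) is nearer to 30 than 28.6 is, and 40.3 (Id 22) is nearer to 40 than 38.6 is.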
I use cut() and cumsum(table()) but I don't quite get the answer you are expecting. How exactly are you calculating this?
# first load the data
v.txt <- '1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4'
# load in the data... awkwardly...
v <- as.data.frame(matrix(as.numeric(unlist(strsplit(strsplit(v.txt, '\n')[[1]], ' +'))), byrow = TRUE, ncol = 2))
names(v) <- c('Id', 'Time')
tens <- seq(from = 0, by = 10, to = 100)
v$cut <- cut(v$Time, tens, labels = tens[-1])
v2 <- as.data.frame(cumsum(table(droplevels(v$cut))))
names(v2) <- 'Id'        # the cumulative count is the last Id in each interval
v2$Time <- rownames(v2)  # the upper bound of each 10-hour interval
rownames(v2) <- 1:nrow(v2)
v2 <- v2[, c(2, 1)]
rm(v, v.txt, tens) # not needed anymore
v2 # the answer... but doesn't quite match your expected answer...
  Time Id
1   10  5
2   20 10
3   30 15
4   40 21
5   50 25