I'm able to scrape the first table of this page with the rvest package, using the following code:
library(rvest)
library(magrittr)
urlbbref <- read_html("http://www.baseball-reference.com/bio/Venezuela_born.shtml")
Bat <- urlbbref %>%
  html_node(xpath = '//*[(@id = "bio_batting")]') %>%
  html_table()
But I'm not able to scrape the second table of the page. I used selectorgadget to find the xpath of both tables and put that info in the code, but it doesn't seem to work for the second one.
Pit <- urlbbref %>%
  html_node(xpath = '//*[(@id = "div_bio_pitching")]') %>%
  html_table()
I come up with 3 tables in total.
library(magrittr)
library(rvest)
library(xml2)
library(stringi)

urlbbref <- read_html("http://www.baseball-reference.com/bio/Venezuela_born.shtml")

# First table is in the markup
table_one <- xml_find_all(urlbbref, "//table") %>% html_table

# Additional tables are within the comment tags, ie <!-- tables -->
# which is why your xpath is missing them.
# First get the commented nodes
alt_tables <- xml2::xml_find_all(urlbbref, "//comment()") %>% {
  # Find only commented nodes that contain the regex for html table markup
  raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
  # Remove the comment begin and end tags
  strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--", "-->"), c("", ""),
                                                vectorize_all = FALSE)
  # Loop through the pieces that have tables within markup and
  # apply the same functions
  lapply(grep("<table", strip_html, value = TRUE), function(i) {
    rvest::html_table(xml_find_all(read_html(i), "//table")) %>%
      .[[1]]
  })
}

# Put all the data frames into a list
all_tables <- c(table_one, alt_tables)
Results:
> Map(str, all_tables)
'data.frame': 361 obs. of 27 variables:
$ Rk : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Bobby Abreu" "Ehire Adrianza" "Jesus Aguilar" "Edgardo Alfonzo" ...
$ Yrs : int 18 4 4 12 6 7 1 5 5 2 ...
$ From : int 1996 2013 2014 1995 2006 2011 2000 2011 2013 2002 ...
$ To : int 2014 2016 2017 2006 2011 2017 2000 2015 2017 2004 ...
$ ASG : int 2 0 0 1 0 4 0 1 0 0 ...
$ G : int 2425 154 47 1506 193 842 2 92 150 38 ...
$ PA : int 10081 331 89 6108 624 3708 5 109 3 75 ...
$ AB : int 8480 291 81 5385 591 3411 5 94 2 64 ...
$ R : int 1453 27 4 777 44 456 1 5 0 11 ...
$ H : int 2470 64 18 1532 142 1062 1 22 0 16 ...
$ 2B : int 574 16 3 282 24 208 0 4 0 4 ...
$ 3B : int 59 1 0 18 3 19 0 0 0 0 ...
$ HR : int 288 3 0 146 17 60 0 1 0 2 ...
$ RBI : int 1363 26 8 744 67 326 0 9 0 10 ...
$ SB : int 400 4 0 53 1 204 0 0 0 1 ...
$ CS : int 128 4 0 17 2 59 0 0 0 0 ...
$ BB : int 1476 23 6 596 17 214 0 1 1 7 ...
$ SO : int 1840 60 28 617 158 389 1 34 0 12 ...
$ BA : num 0.291 0.22 0.222 0.284 0.24 0.311 0.2 0.234 0 0.25 ...
$ OBP : num 0.395 0.292 0.281 0.357 0.271 0.354 0.2 0.237 0.333 0.324 ...
$ SLG : num 0.475 0.313 0.259 0.425 0.377 0.436 0.2 0.309 0 0.406 ...
$ OPS : num 0.87 0.605 0.54 0.782 0.648 0.791 0.4 0.546 0.333 0.731 ...
$ Birthdate : chr "Mar 11, 1974" "Aug 21, 1989" "Jun 30, 1990" "Nov 8, 1973" ...
$ Debut : chr "Sep 1, 1996" "Sep 8, 2013" "May 15, 2014" "Apr 26, 1995" ...
$ Birthplace: chr "Maracay, Aragua" "Guarenas, Miranda" "Maracay, Aragua" "Santa Teresa del Tuy, Miranda" ...
$ Pos : chr "POS" "POS" "POS" "POS" ...
'data.frame': 157 obs. of 31 variables:
$ Rk : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Henderson Alvarez" "Jose Alvarez" "Wilson Alvarez" "Alexi Amarista" ...
$ Yrs : int 5 5 14 7 5 2 10 4 6 4 ...
$ From : int 2011 2013 1989 2011 1980 2015 1999 2007 2012 2005 ...
$ To : int 2015 2017 2005 2017 1984 2016 2008 2011 2017 2009 ...
$ ASG : int 1 0 1 0 0 0 0 0 0 0 ...
$ W : int 27 6 102 0 9 4 53 1 15 3 ...
$ L : int 34 12 92 0 6 2 65 3 6 4 ...
$ W-L% : num 0.443 0.333 0.526 NA 0.6 0.667 0.449 0.25 0.714 0.429 ...
$ ERA : num 3.8 3.97 3.96 0 3.27 4.35 4.65 5.28 2.91 6.86 ...
$ G : int 92 150 355 2 110 72 185 43 275 25 ...
$ GS : int 92 6 263 0 0 0 167 0 0 8 ...
$ GF : int 0 32 18 2 66 14 7 16 36 12 ...
$ CG : int 5 0 12 0 0 0 0 0 0 0 ...
$ SHO : int 5 0 5 0 0 0 0 0 0 0 ...
$ SV : int 0 0 4 0 7 0 0 0 0 0 ...
$ IP : num 563 167.2 1747.2 0.2 220 ...
$ H : int 596 174 1624 0 222 64 891 57 177 68 ...
$ R : int 261 85 857 0 86 39 519 29 75 51 ...
$ ER : int 238 74 769 0 80 30 478 27 72 46 ...
$ HR : int 54 17 190 0 17 5 122 7 10 4 ...
$ BB : int 129 55 805 0 68 36 431 21 80 34 ...
$ IBB : int 7 10 29 0 7 3 41 5 17 1 ...
$ SO : int 296 148 1330 0 113 63 680 41 180 37 ...
$ HBP : int 22 8 50 0 3 2 51 4 11 4 ...
$ BK : int 3 1 4 0 3 1 6 0 3 1 ...
$ WP : int 16 3 28 0 5 2 43 1 14 2 ...
$ BF : int 2358 729 7518 2 928 285 4055 221 913 282 ...
$ Birthdate : chr "Apr 18, 1990" "May 6, 1989" "Mar 24, 1970" "Apr 6, 1989" ...
$ Debut : chr "Aug 10, 2011" "Jun 9, 2013" "Jul 24, 1989" "Apr 26, 2011" ...
$ Birthplace: chr "Valencia, Carabobo" "Barcelona, Anzoategui" "Maracaibo, Zulia" "Barcelona, Anzoategui" ...
'data.frame': 3 obs. of 17 variables:
$ Rk : int 1 2 NA
$ Mgr : chr "Ozzie Guillen" "Al Pedrique" "Totals"
$ Yrs : int 9 1 10
$ From : int 2004 2004 2004
$ To : int 2012 2004 2012
$ W : int 747 22 769
$ L : int 710 61 771
$ W-L% : num 0.513 0.265 0.499
$ Ties : int 0 0 0
$ G>.500 : int 37 -39 -2
$ G : int 1457 83 1540
$ BestFin : int 1 5 1
$ WrstFin : int 5 5 5
$ AvRk : num 2.7 5 2.8
$ Birthdate : chr "Jan 20, 1964" "Aug 11, 1960" ""
$ Debut : chr "Apr 9, 1985" "Apr 14, 1987" ""
$ Birthplace: chr "Ocumare del Tuy, Miranda" "Valencia, Carabobo" ""
I have 2 data frames: one called datos_octubre with 10131000 rows and another called datos_conductores.
I want to add a new column called Operador; this column should be filled by the following loop:
for (j in 1:100) {
  for (i in 1:100) {
    if (datos_octubre$FECHA_GPS[i] == datos_conductores$fecha[j]) {
      if (datos_octubre$EQU_CODIGO[i] == datos_conductores$EQU_CODIGO[j]) {
        if (datos_octubre$HORA_GPS[i] <= datos_conductores$hora_fin[j]) {
          # the operator name comes from datos_conductores
          datos_octubre$Operador[i] <- datos_conductores$NOMBRE[j]
        }
      }
    }
  }
}
This is the structure of the data frame datos_octubre and its head:
> str(datos_octubre)
'data.frame': 10131530 obs. of 14 variables:
$ REP_GPS_CODIGO : Factor w/ 9329105 levels "MI051","MI051_1832614921789237",..: 2 3 4 5 6 7 8 9 10 11 ...
$ EQU_CODIGO : chr "MI051" "MI051" "MI051" "MI051" ...
$ TRAM_GPS_CODIGO: Factor w/ 4 levels "01","03","05",..: 4 4 4 4 4 4 4 4 4 4 ...
$ EVE_GPS_CODIGO : Factor w/ 83 levels "01","02","03",..: 3 3 3 3 3 9 3 3 3 3 ...
$ FECHA_GPS : POSIXct, format: "2019-10-01" "2019-10-01" "2019-10-01" "2019-10-01" ...
$ HORA_GPS : Factor w/ 86389 levels "-75.6654","-75.6655",..: 16528 16536 16546 16556 16564 16568 16574 16583 16592 16601 ...
$ LON_GPS : num -75.7 -75.7 -75.7 -75.7 -75.7 ...
$ LAT_GPS : num 4.8 4.8 4.8 4.8 4.8 ...
$ VEL_GPS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DIR_GPS : int 0 0 0 101 101 101 101 101 101 101 ...
$ ACL_GPS : int 0 0 0 0 0 NA 0 0 0 0 ...
$ ODO_GPS : int 28229762 28229762 28229762 28229768 28229770 NA 28229770 28229770 28229770 28229770 ...
$ ALT_GPS : Factor w/ 120 levels "","\"MI051_1902402005409507",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Operador : chr "" "" "" "" ...
> head(datos_octubre)
REP_GPS_CODIGO EQU_CODIGO TRAM_GPS_CODIGO EVE_GPS_CODIGO FECHA_GPS HORA_GPS LON_GPS LAT_GPS VEL_GPS DIR_GPS ACL_GPS ODO_GPS ALT_GPS Operador
1 MI051_1832614921789237 MI051 EV 03 2019-10-01 04:35:38 -75.7444 4.79857 0 0 0 28229762
2 MI051_1832614964979379 MI051 EV 03 2019-10-01 04:35:46 -75.7444 4.79857 0 0 0 28229762
3 MI051_1832616366109032 MI051 EV 03 2019-10-01 04:35:56 -75.7444 4.79857 0 0 0 28229762
4 MI051_1832617794914447 MI051 EV 03 2019-10-01 04:36:06 -75.7442 4.79907 0 101 0 28229768
5 MI051_1832619516509591 MI051 EV 03 2019-10-01 04:36:14 -75.7442 4.79908 0 101 0 28229770
6 MI051_1832619543973570 MI051 EV 10 2019-10-01 04:36:18 -75.7442 4.79908 0 101 NA NA
And this is the structure of datos_conductores:
> str(datos_conductores)
'data.frame': 16522 obs. of 11 variables:
$ fecha : POSIXct, format: "2019-10-01" "2019-10-01" "2019-10-01" "2019-10-01" ...
$ equ_id : int 99 99 99 99 99 99 99 99 99 99 ...
$ conductor : int 34 34 34 34 34 34 34 65 65 65 ...
$ servicio_id: int 533329 533328 533327 533326 533325 533324 533323 533333 533332 533331 ...
$ PERA_ID : int 362 362 362 362 362 362 362 107 107 107 ...
$ hora_ini : POSIXct, format: "2019-11-28 09:16:16" "2019-11-28 08:38:16" "2019-11-28 08:00:16" "2019-11-28 07:22:16" ...
$ hora_fin : POSIXct, format: "2019-11-28 09:21:00" "2019-11-28 09:16:16" "2019-11-28 08:38:16" "2019-11-28 08:00:16" ...
$ ruta_id : int 24 24 24 24 24 24 24 24 24 24 ...
$ NOMBRE : Factor w/ 85 levels "ALBERT HERNAN ZAPATA RESTREPO",..: 71 71 71 71 71 71 71 53 53 53 ...
$ PERA_CEDULA: int 1088253762 1088253762 1088253762 1088253762 1088253762 1088253762 1088253762 10087424 10087424 10087424 ...
$ EQU_CODIGO : Factor w/ 36 levels "MI051","MI052",..: 9 9 9 9 9 9 9 9 9 9 ...
I also tried with the pipe operator, but I'm not getting the result I want.
I already found the solution:
I had to convert all the date/time columns with the function as.POSIXct and, inside the loop, make the comparison against the hora_ini and hora_fin of every record.
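For reference, a minimal sketch of that kind of fix, under a few assumptions: HORA_GPS holds "HH:MM:SS" strings (so they compare correctly as text), the join key is the date plus EQU_CODIGO, and only the time-of-day part of hora_ini/hora_fin matters:
# Convert the date columns so they compare as dates rather than factors/strings
datos_octubre$FECHA_GPS <- as.POSIXct(datos_octubre$FECHA_GPS)
datos_conductores$fecha <- as.POSIXct(datos_conductores$fecha)

# Extract comparable "HH:MM:SS" strings (this format is an assumption)
hora_gps <- as.character(datos_octubre$HORA_GPS)
ini <- format(datos_conductores$hora_ini, "%H:%M:%S")
fin <- format(datos_conductores$hora_fin, "%H:%M:%S")

# One pass over the (much smaller) conductor table instead of a nested loop
datos_octubre$Operador <- NA_character_
for (j in seq_len(nrow(datos_conductores))) {
  idx <- which(datos_octubre$FECHA_GPS == datos_conductores$fecha[j] &
               datos_octubre$EQU_CODIGO == as.character(datos_conductores$EQU_CODIGO[j]) &
               hora_gps >= ini[j] & hora_gps <= fin[j])
  datos_octubre$Operador[idx] <- as.character(datos_conductores$NOMBRE[j])
}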
Given this:
kc$sqft_living_group <- cut(kc$sqft_living, breaks = c(0, 1000, 2000, 3000, 5000, 7000, 10000, 15000), dig.lab=5)
How do I set the x-axis limits of my ggplot2 graph?
Nothing I can find shows the syntax for setting limits when the axis categories are intervals.
kc %>%
  filter(zipcode %in% top_10_zipcodes) %>%
  group_by(sqft_living_group) %>%
  summarize(Mean_Price = mean(price)) %>%
  ggplot(aes(y = Mean_Price, x = sqft_living_group)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = comma) +
  scale_x_discrete(limits = "(0, 1000], (1000, 12000]")   # <---------- HERE
structure of data:
'data.frame': 21613 obs. of 22 variables:
$ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
$ date : POSIXct, format: "2014-10-13" "2014-12-09" "2015-02-25" "2014-12-09" ...
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : Factor w/ 6 levels "1","1.5","2",..: 1 3 1 1 1 1 3 1 1 3 ...
$ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement : int 0 400 0 910 0 1530 0 0 730 0 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ yr_renovated : Factor w/ 70 levels "0","1934","1940",..: 1 46 1 1 1 1 1 1 1 1 ...
$ zipcode : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15 : int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
$ sqft_living_group: Factor w/ 7 levels "(0,1000]","(1000,2000]",..: 2 3 1 2 2 5 2 2 2 2 ...
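For what it's worth, a minimal sketch of the syntax, assuming the goal is to display only some of the intervals: scale_x_discrete(limits = ...) takes a character vector of level names, and the names must match the labels cut() produced exactly (no spaces inside them, e.g. "(0,1000]", as the str output above shows):
library(dplyr)
library(ggplot2)
library(scales)

kc %>%
  filter(zipcode %in% top_10_zipcodes) %>%
  group_by(sqft_living_group) %>%
  summarize(Mean_Price = mean(price)) %>%
  ggplot(aes(x = sqft_living_group, y = Mean_Price)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = comma) +
  # keep only the first two intervals; compare with levels(kc$sqft_living_group)
  scale_x_discrete(limits = c("(0,1000]", "(1000,2000]"))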
I am trying to update a value in a data frame but am getting what seems to me a weird error about an operation that I don't think I am using.
Here's a summary of the data:
> str(us.cty2015@data)
'data.frame': 3108 obs. of 15 variables:
$ STATEFP : Factor w/ 52 levels "01","02","04",..: 17 25 33 46 4 14 16 24 36 42 ...
$ COUNTYFP : Factor w/ 325 levels "001","003","005",..: 112 91 67 9 43 81 7 103 72 49 ...
$ COUNTYNS : Factor w/ 3220 levels "00023901","00025441",..: 867 1253 1600 2465 38 577 690 1179 1821 2104 ...
$ AFFGEOID : Factor w/ 3220 levels "0500000US01001",..: 976 1472 1879 2813 144 657 795 1395 2098 2398 ...
$ GEOID : Factor w/ 3220 levels "01001","01003",..: 976 1472 1879 2813 144 657 795 1395 2098 2398 ...
$ NAME : Factor w/ 1910 levels "Abbeville","Acadia",..: 1558 1703 1621 688 856 1075 148 1807 1132 868 ...
$ LSAD : Factor w/ 9 levels "00","03","04",..: 5 5 5 5 5 5 5 5 5 5 ...
$ ALAND : num 1.66e+09 1.10e+09 3.60e+09 2.12e+08 1.50e+09 ...
$ AWATER : num 2.78e+06 5.24e+07 3.50e+07 2.92e+08 8.91e+06 ...
$ t_pop : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_wht : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_free_blk: num 0 0 0 0 0 0 0 0 0 0 ...
$ n_slv : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_blk : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_free : num 0 0 0 0 0 0 0 0 0 0 ...
> str(us.cty1860@data)
'data.frame': 2126 obs. of 29 variables:
$ DECADE : Factor w/ 1 level "1860": 1 1 1 1 1 1 1 1 1 1 ...
$ NHGISNAM : Factor w/ 1236 levels "Abbeville","Accomack",..: 1142 1218 1130 441 812 548 1144 56 50 887 ...
$ NHGISST : Factor w/ 41 levels "010","050","060",..: 32 13 9 36 16 36 16 30 23 39 ...
$ NHGISCTY : Factor w/ 320 levels "0000","0010",..: 142 206 251 187 85 231 131 12 6 161 ...
$ ICPSRST : Factor w/ 37 levels "1","11","12",..: 5 13 21 26 22 26 22 10 15 17 ...
$ ICPSRCTY : Factor w/ 273 levels "10","1010","1015",..: 25 93 146 72 247 122 12 10 228 45 ...
$ ICPSRNAM : Factor w/ 1200 levels "ABBEVILLE","ACCOMACK",..: 1108 1184 1097 432 791 535 1110 55 49 860 ...
$ STATENAM : Factor w/ 41 levels "Alabama","Arkansas",..: 32 13 9 36 16 36 16 30 23 39 ...
$ ICPSRSTI : int 14 31 44 49 45 49 45 24 34 40 ...
$ ICPSRCTYI : int 1210 1970 2910 1810 710 2450 1130 110 50 1450 ...
$ ICPSRFIP : num 0 0 0 0 0 0 0 0 0 0 ...
$ STATE : Factor w/ 41 levels "010","050","060",..: 32 13 9 36 16 36 16 30 23 39 ...
$ COUNTY : Factor w/ 320 levels "0000","0010",..: 142 206 251 187 85 231 131 12 6 161 ...
$ PID : num 1538 735 306 1698 335 ...
$ X_CENTROID : num 1348469 184343 1086494 -62424 585888 ...
$ Y_CENTROID : num 556680 588278 -229809 -433290 -816852 ...
$ GISJOIN : Factor w/ 2126 levels "G0100010","G0100030",..: 1585 627 319 1769 805 1788 823 1425 1079 2006 ...
$ GISJOIN2 : Factor w/ 2126 levels "0100010","0100030",..: 1585 627 319 1769 805 1788 823 1425 1079 2006 ...
$ SHAPE_AREA : num 2.35e+09 1.51e+09 8.52e+08 2.54e+09 6.26e+08 ...
$ SHAPE_LEN : num 235777 155261 166065 242608 260615 ...
$ t_pop : int 25043 653 4413 8184 174491 1995 4324 17187 4649 8392 ...
$ n_wht : int 24974 653 4295 6892 149063 1684 3001 17123 4578 2580 ...
$ n_free_blk : int 69 0 2 0 10939 2 7 64 12 409 ...
$ n_slv : int 0 0 116 1292 14484 309 1316 0 59 5403 ...
$ n_blk : int 69 0 118 1292 25423 311 1323 64 71 5812 ...
$ n_free : num 25043 653 4297 6892 160007 ...
$ frac_free : num 1 1 0.974 0.842 0.917 ...
$ frac_free_blk: num 1 NA 0.0169 0 0.4303 ...
$ frac_slv : num 0 0 0.0263 0.1579 0.083 ...
> str(overlap)
'data.frame': 15266 obs. of 7 variables:
$ cty2015 : Factor w/ 3108 levels "0","1","10","100",..: 1 1 2 2 2 2 2 1082 1082 1082 ...
$ cty1860 : Factor w/ 2126 levels "0","1","10","100",..: 1047 1012 1296 1963 2033 2058 2065 736 1413 1569 ...
$ area_inter : num 1.66e+09 2.32e+05 9.81e+04 1.07e+09 7.67e+07 ...
$ area1860 : num 1.64e+11 1.81e+11 1.54e+09 2.91e+09 2.32e+09 ...
$ frac_1860 : num 1.01e-02 1.28e-06 6.35e-05 3.67e-01 3.30e-02 ...
$ sum_frac_1860 : num 1 1 1 1 1 ...
$ scaled_frac_1860: num 1.01e-02 1.28e-06 6.35e-05 3.67e-01 3.30e-02 ...
I am trying to multiply a vector of variables vars <- c("t_pop", "n_wht", "n_free_blk", "n_slv", "n_blk", "n_free") in the us.cty1860@data data frame by a scalar overlap$scaled_frac_1860[i], then add it to the same vector of variables in the us.cty2015@data data frame, and finally overwrite the variables in the us.cty2015@data data frame.
When I make the following call, I get an error that seems to be saying that I am trying to perform invalid operations on factors (which is not the case, as you can confirm from the str output).
> us.cty2015@data[overlap$cty2015[1], vars] <- us.cty2015@data[overlap$cty2015[1], vars] + (overlap$scaled_frac_1860[1] * us.cty1860@data[overlap$cty1860[1], vars])
Error in Summary.factor(1L, na.rm = FALSE) :
‘max’ not meaningful for factors
In addition: Warning message:
In Ops.factor(i, 0L) : ‘>=’ not meaningful for factors
However, when I don't attempt to overwrite the old value, the operation works fine.
> us.cty2015@data[overlap$cty2015[1], vars] + (overlap$scaled_frac_1860[1] * us.cty1860@data[overlap$cty1860[1], vars])
t_pop n_wht n_free_blk n_slv n_blk n_free
0 118.3889 113.6468 0.1317233 4.610316 4.742039 113.7785
I'm sure there are better ways of accomplishing what I am trying to do but does anyone have any idea what is going on?
Edit:
I am using the following libraries: rgdal, rgeos, and maptools
All of the data/objects come from NHGIS shapefiles of 1860 and 2015 United States counties.
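One plausible explanation, offered as an assumption based on the error messages rather than anything confirmed in the post: the row index overlap$cty2015[1] is a factor, and while plain extraction with [ tolerates that, the assignment method [<-.data.frame runs checks such as max(i) and i >= 0L on the row index, which are exactly the operations the errors complain about. Converting the index before assigning should sidestep it:
# Sketch: turn the factor indices into plain row identifiers first.
# Whether as.character() (row names) or as.numeric(as.character()) (row numbers)
# is the right conversion depends on how the overlap table was built.
i2015 <- as.character(overlap$cty2015[1])
i1860 <- as.character(overlap$cty1860[1])

us.cty2015@data[i2015, vars] <- us.cty2015@data[i2015, vars] +
  overlap$scaled_frac_1860[1] * us.cty1860@data[i1860, vars]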
I'm trying to predict the values of a test data set based on a train data set. It does predict values (no errors), but the predictions deviate A LOT from the original values, even going as low as about -356 although none of the original values exceeds 200 (and there are no negative values). The warning is bugging me because I suspect the values deviate so much because of it.
Warning message:
In predict.lm(fit2, data_test) :
prediction from a rank-deficient fit may be misleading
Is there any way I can get rid of this warning? The code is simple:
fit2 <- lm(runs~., data=train_data)
prediction <- predict(fit2, data_test)
prediction
I searched a lot but, to be honest, I couldn't understand much about this warning.
Here is the str of the train and test data sets in case someone needs them:
> str(train_data)
'data.frame': 36 obs. of 28 variables:
$ matchid : int 57 58 55 56 53 54 51 52 45 46 ...
$ TeamName : chr "South Africa" "West Indies" "South Africa" "West Indies" ...
$ Opp_TeamName : chr "West Indies" "South Africa" "West Indies" "South Africa" ...
$ TeamRank : int 4 3 4 3 4 3 10 7 5 1 ...
$ Opp_TeamRank : int 3 4 3 4 3 4 7 10 1 5 ...
$ Team_Top10RankingBatsman : int 0 1 0 1 0 1 0 0 2 2 ...
$ Team_Top50RankingBatsman : int 4 6 4 6 4 6 3 5 4 3 ...
$ Team_Top100RankingBatsman: int 6 8 6 8 6 8 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 0 1 0 1 0 0 0 2 2 ...
$ Opp_Top50RankingBatsman : int 6 4 6 4 6 4 5 3 3 4 ...
$ Opp_Top100RankingBatsman : int 8 6 8 6 8 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "1st innings" "2nd innings" ...
$ Runs_OverAll : num 361 705 348 630 347 ...
$ AVG_Overall : num 27.2 20 23.3 19.1 24 ...
$ SR_Overall : num 128 121 120 118 118 ...
$ Runs_Last10Matches : num 118.5 71 102.1 71 78.6 ...
$ AVG_Last10Matches : num 23.7 20.4 20.9 20.4 23.2 ...
$ SR_Last10Matches : num 120 106 114 106 116 ...
$ Runs_BatingFirst : num 236 459 230 394 203 ...
$ AVG_BatingFirst : num 30.6 23.2 24 21.2 27.1 ...
$ SR_BatingFirst : num 127 136 123 125 118 ...
$ Runs_BatingSecond : num 124 262 119 232 144 ...
$ AVG_BatingSecond : num 25.5 18.3 22.8 17.8 22.8 ...
$ SR_BatingSecond : num 125 118 112 117 114 ...
$ Runs_AgainstTeam2 : num 88.3 118.3 76.3 103.9 49.3 ...
$ AVG_AgainstTeam2 : num 28.2 23 24.7 22.1 16.4 ...
$ SR_AgainstTeam2 : num 139 127 131 128 111 ...
$ runs : int 165 168 231 236 195 126 143 141 191 135 ...
> str(data_test)
'data.frame': 34 obs. of 28 variables:
$ matchid : int 59 60 61 62 63 64 65 66 69 70 ...
$ TeamName : chr "India" "West Indies" "England" "New Zealand" ...
$ Opp_TeamName : chr "West Indies" "India" "New Zealand" "England" ...
$ TeamRank : int 2 3 5 1 4 8 6 2 10 1 ...
$ Opp_TeamRank : int 3 2 1 5 8 4 2 6 1 10 ...
$ Team_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 0 2 ...
$ Team_Top50RankingBatsman : int 5 6 4 3 4 2 5 5 3 3 ...
$ Team_Top100RankingBatsman: int 7 8 7 6 6 5 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 2 0 ...
$ Opp_Top50RankingBatsman : int 6 5 3 4 2 4 5 5 3 3 ...
$ Opp_Top100RankingBatsman : int 8 7 6 7 5 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "2nd innings" "1st innings" ...
$ Runs_OverAll : num 582 618 470 602 509 ...
$ AVG_Overall : num 25 21.8 20.3 20.7 19.6 ...
$ SR_Overall : num 113 120 123 120 112 ...
$ Runs_Last10Matches : num 182 107 117 167 140 ...
$ AVG_Last10Matches : num 37.1 43.8 21 24.9 27.3 ...
$ SR_Last10Matches : num 111 153 122 141 120 ...
$ Runs_BatingFirst : num 319 314 271 345 294 ...
$ AVG_BatingFirst : num 23.6 17.8 20.6 20.3 19.5 ...
$ SR_BatingFirst : num 116.9 98.5 118 124.3 115.8 ...
$ Runs_BatingSecond : num 264 282 304 256 186 ...
$ AVG_BatingSecond : num 28 23.7 31.9 21.6 16.5 ...
$ SR_BatingSecond : num 96.5 133.9 129.4 112 99.5 ...
$ Runs_AgainstTeam2 : num 98.2 95.2 106.9 75.4 88.5 ...
$ AVG_AgainstTeam2 : num 45.3 42.7 38.1 17.7 27.1 ...
$ SR_AgainstTeam2 : num 125 138 152 110 122 ...
$ runs : int 192 196 159 153 122 120 160 161 70 145 ...
In simple words, how can I get rid of this warning so that it doesn't affect my predictions? For reference, these are the fitted coefficients (note the NA entries):
(Intercept) matchid TeamNameBangladesh
1699.98232628 -0.06793787 59.29445330
TeamNameEngland TeamNameIndia TeamNameNew Zealand
347.33030177 -499.40074338 -179.19192936
TeamNamePakistan TeamNameSouth Africa TeamNameSri Lanka
-272.71610614 -3.54867488 -45.27920191
TeamNameWest Indies Opp_TeamNameBangladesh Opp_TeamNameEngland
-345.54349798 135.05901017 108.04227770
Opp_TeamNameIndia Opp_TeamNameNew Zealand Opp_TeamNamePakistan
-162.24418387 -60.55364436 -114.74599364
Opp_TeamNameSouth Africa Opp_TeamNameSri Lanka Opp_TeamNameWest Indies
196.90856999 150.70170068 -6.88997714
TeamRank Opp_TeamRank Team_Top10RankingBatsman
NA NA NA
Team_Top50RankingBatsman Team_Top100RankingBatsman Opp_Top10RankingBatsman
NA NA NA
Opp_Top50RankingBatsman Opp_Top100RankingBatsman InningType2nd innings
NA NA 24.24029455
Runs_OverAll AVG_Overall SR_Overall
-0.59935875 20.12721378 -13.60151334
Runs_Last10Matches AVG_Last10Matches SR_Last10Matches
-1.92526750 9.24182916 1.23914363
Runs_BatingFirst AVG_BatingFirst SR_BatingFirst
1.41001672 -9.88582744 -6.69780509
Runs_BatingSecond AVG_BatingSecond SR_BatingSecond
-0.90038727 -7.11580086 3.20915976
Runs_AgainstTeam2 AVG_AgainstTeam2 SR_AgainstTeam2
3.35936312 -5.90267210 2.36899131
You can have a look at this detailed discussion:
predict.lm() in a loop. warning: prediction from a rank-deficient fit may be misleading
In general, multi-collinearity can lead to a rank-deficient model matrix, in linear just as in logistic regression.
You can try applying PCA to tackle the multi-collinearity issue and then fit the regression afterwards.
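As a starting point, a minimal sketch of how to see which terms cause the rank deficiency and refit without them; the column names come from the NA entries in the coefficient listing above, and whether dropping them is the right modelling choice is a separate question:
# Coefficients that came out NA are the linearly dependent (aliased) terms
coefs <- coef(fit2)
names(coefs)[is.na(coefs)]

# alias() reports the exact linear dependencies among the predictors
alias(fit2)

# Refit without the aliased ranking columns; a full-rank fit will not
# trigger the rank-deficient warning in predict()
fit3 <- lm(runs ~ . - TeamRank - Opp_TeamRank
                    - Team_Top10RankingBatsman - Team_Top50RankingBatsman
                    - Team_Top100RankingBatsman - Opp_Top10RankingBatsman
                    - Opp_Top50RankingBatsman - Opp_Top100RankingBatsman,
           data = train_data)
prediction <- predict(fit3, data_test)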
I'm trying to subset my dataset eggdat into daytime and nighttime hours. This is its structure:
'data.frame': 54847 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 20 20 20 20 20 20 20 20 20 20 ...
$ minute: int 5 5 5 5 5 5 5 5 5 5 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num -159 179 -164 -155 -137 ...
$ Pitch : num -31.36 -41.05 -23.85 -6.62 -9.13 ...
$ Yaw : num -71.8 -113.3 -67.2 -140.2 -78.2 ...
$ temp1 : num 25 33.5 34 34 34 34 34 34 34 34 ...
Subsetting for daytime works fine:
daytime <- eggdat[eggdat$hour >= 7 & eggdat$hour <= 20, ]
'data.frame': 18847 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 20 20 20 20 20 20 20 20 20 20 ...
$ minute: int 5 5 5 5 5 5 5 5 5 5 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num -159 179 -164 -155 -137 ...
$ Pitch : num -31.36 -41.05 -23.85 -6.62 -9.13 ...
$ Yaw : num -71.8 -113.3 -67.2 -140.2 -78.2 ...
$ temp1 : num 25 33.5 34 34 34 34 34 34 34 34 ...
Doing exactly the same thing for nighttime, however, returns a subset with 0 observations:
nighttime <- eggdat[eggdat$hour <= 7 & eggdat$hour >= 21, ]
'data.frame': 0 obs. of 10 variables:
$ year : int
$ month : int
$ day : int
$ hour : int
$ minute: int
$ second: int
$ Roll : num
$ Pitch : num
$ Yaw : num
$ temp1 : num
I really don't know what to do. I tried using subset, but without success. I also tried eggdat$hour <- as.factor(eggdat$hour), but couldn't get that to work either.
Even more confusingly, adding quotation marks in the subsetting calls (daytime <- eggdat[eggdat$hour >= '7' & eggdat$hour <= '20', ] and nighttime <- eggdat[eggdat$hour <= '7' & eggdat$hour >= '21', ]) results in the daytime subset containing 0 obs. but the nighttime subset working fine, so it's just the other way around!
Daytime: 'data.frame': 0 obs. of 10 variables:
Nighttime:
'data.frame': 28800 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 21 21 21 21 21 21 21 21 21 21 ...
$ minute: int 0 0 0 0 0 0 0 0 0 0 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num 65.8 65.8 66.1 65.6 65.6 ...
$ Pitch : num 6.35 6.34 6.24 6.4 6.27 ...
$ Yaw : num 171 172 174 176 176 ...
$ temp1 : num 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 ...
I really don't know what to do; I'm very confused by all of this.
You want eggdat[eggdat$hour <= 7 | eggdat$hour >= 21, ]
x <= 7 & x >= 21 translates to x at most 7 AND at least 21 at the same time, which no value can satisfy, hence the 0 observations.
x <= 7 | x >= 21 translates to x at most 7 OR at least 21, which is what nighttime actually means.
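A tiny demonstration of the difference, using toy values rather than eggdat:
hour <- c(5, 12, 22)     # early morning, midday, late evening
hour <= 7 & hour >= 21   # FALSE FALSE FALSE -- nothing is both <= 7 and >= 21
hour <= 7 | hour >= 21   # TRUE  FALSE TRUE  -- these are the night-time hours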