I have two data frames: datos_octubre, with about 10.1 million rows, and datos_conductores.
I want to add a new column called Operador to datos_octubre, filled according to the following loop:
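for (j in 1:100) {
  for (i in 1:100) {
    # Same date, same vehicle, and GPS time not after the end of the shift
    if (datos_octubre$FECHA_GPS[i] == datos_conductores$fecha[j]) {
      if (datos_octubre$EQU_CODIGO[i] == datos_conductores$EQU_CODIGO[j]) {
        if (datos_octubre$HORA_GPS[i] <= datos_conductores$hora_fin[j]) {
          # NOMBRE is a factor in datos_conductores, so convert it to text
          datos_octubre$Operador[i] <- as.character(datos_conductores$NOMBRE[j])
        }
      }
    }
  }
}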
This is the structure of the data frame datos_octubre and its head:
> str(datos_octubre)
'data.frame': 10131530 obs. of 14 variables:
$ REP_GPS_CODIGO : Factor w/ 9329105 levels "MI051","MI051_1832614921789237",..: 2 3 4 5 6 7 8 9 10 11 ...
$ EQU_CODIGO : chr "MI051" "MI051" "MI051" "MI051" ...
$ TRAM_GPS_CODIGO: Factor w/ 4 levels "01","03","05",..: 4 4 4 4 4 4 4 4 4 4 ...
$ EVE_GPS_CODIGO : Factor w/ 83 levels "01","02","03",..: 3 3 3 3 3 9 3 3 3 3 ...
$ FECHA_GPS : POSIXct, format: "2019-10-01" "2019-10-01" "2019-10-01" "2019-10-01" ...
$ HORA_GPS : Factor w/ 86389 levels "-75.6654","-75.6655",..: 16528 16536 16546 16556 16564 16568 16574 16583 16592 16601 ...
$ LON_GPS : num -75.7 -75.7 -75.7 -75.7 -75.7 ...
$ LAT_GPS : num 4.8 4.8 4.8 4.8 4.8 ...
$ VEL_GPS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DIR_GPS : int 0 0 0 101 101 101 101 101 101 101 ...
$ ACL_GPS : int 0 0 0 0 0 NA 0 0 0 0 ...
$ ODO_GPS : int 28229762 28229762 28229762 28229768 28229770 NA 28229770 28229770 28229770 28229770 ...
$ ALT_GPS : Factor w/ 120 levels "","\"MI051_1902402005409507",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Operador : chr "" "" "" "" ...
> head(datos_octubre)
REP_GPS_CODIGO EQU_CODIGO TRAM_GPS_CODIGO EVE_GPS_CODIGO FECHA_GPS HORA_GPS LON_GPS LAT_GPS VEL_GPS DIR_GPS ACL_GPS ODO_GPS ALT_GPS Operador
1 MI051_1832614921789237 MI051 EV 03 2019-10-01 04:35:38 -75.7444 4.79857 0 0 0 28229762
2 MI051_1832614964979379 MI051 EV 03 2019-10-01 04:35:46 -75.7444 4.79857 0 0 0 28229762
3 MI051_1832616366109032 MI051 EV 03 2019-10-01 04:35:56 -75.7444 4.79857 0 0 0 28229762
4 MI051_1832617794914447 MI051 EV 03 2019-10-01 04:36:06 -75.7442 4.79907 0 101 0 28229768
5 MI051_1832619516509591 MI051 EV 03 2019-10-01 04:36:14 -75.7442 4.79908 0 101 0 28229770
6 MI051_1832619543973570 MI051 EV 10 2019-10-01 04:36:18 -75.7442 4.79908 0 101 NA NA
And this is the structure of datos_conductores:
> str(datos_conductores)
'data.frame': 16522 obs. of 11 variables:
$ fecha : POSIXct, format: "2019-10-01" "2019-10-01" "2019-10-01" "2019-10-01" ...
$ equ_id : int 99 99 99 99 99 99 99 99 99 99 ...
$ conductor : int 34 34 34 34 34 34 34 65 65 65 ...
$ servicio_id: int 533329 533328 533327 533326 533325 533324 533323 533333 533332 533331 ...
$ PERA_ID : int 362 362 362 362 362 362 362 107 107 107 ...
$ hora_ini : POSIXct, format: "2019-11-28 09:16:16" "2019-11-28 08:38:16" "2019-11-28 08:00:16" "2019-11-28 07:22:16" ...
$ hora_fin : POSIXct, format: "2019-11-28 09:21:00" "2019-11-28 09:16:16" "2019-11-28 08:38:16" "2019-11-28 08:00:16" ...
$ ruta_id : int 24 24 24 24 24 24 24 24 24 24 ...
$ NOMBRE : Factor w/ 85 levels "ALBERT HERNAN ZAPATA RESTREPO",..: 71 71 71 71 71 71 71 53 53 53 ...
$ PERA_CEDULA: int 1088253762 1088253762 1088253762 1088253762 1088253762 1088253762 1088253762 10087424 10087424 10087424 ...
$ EQU_CODIGO : Factor w/ 36 levels "MI051","MI052",..: 9 9 9 9 9 9 9 9 9 9 ...
I also tried the pipe operator, but I'm not getting the result I want.
I already found the solution.
I had to convert all the date and time columns with as.POSIXct and, inside the for loop, correct the comparison so that it uses both the start time (hora_ini) and the end time (hora_fin) of every record.
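With around ten million rows, the nested loop can probably be avoided altogether. The following is only a minimal sketch, not the fix described above. It assumes the GPS date and time have already been combined into one POSIXct column (called ts here, a hypothetical name) and that the shifts of one vehicle do not overlap. The data frames gps and shifts are small made-up stand-ins for datos_octubre and datos_conductores, and assign_operator is a hypothetical helper.

```r
# Made-up stand-ins for datos_octubre (gps) and datos_conductores (shifts)
gps <- data.frame(
  EQU_CODIGO = c("MI051", "MI051", "MI052", "MI052", "MI053"),
  ts = as.POSIXct(c("2019-10-01 04:35:38", "2019-10-01 09:00:00",
                    "2019-10-01 05:00:00", "2019-10-01 07:00:00",
                    "2019-10-01 05:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
shifts <- data.frame(
  EQU_CODIGO = c("MI051", "MI051", "MI052"),
  hora_ini = as.POSIXct(c("2019-10-01 08:00:00", "2019-10-01 04:00:00",
                          "2019-10-01 04:30:00"), tz = "UTC"),
  hora_fin = as.POSIXct(c("2019-10-01 12:00:00", "2019-10-01 08:00:00",
                          "2019-10-01 06:00:00"), tz = "UTC"),
  NOMBRE = c("B", "A", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one vectorised lookup per vehicle instead of a row-by-row loop
assign_operator <- function(gps, shifts) {
  out <- rep(NA_character_, nrow(gps))
  for (code in unique(gps$EQU_CODIGO)) {
    rows <- which(gps$EQU_CODIGO == code)
    s <- shifts[shifts$EQU_CODIGO == code, ]
    if (nrow(s) == 0) next
    s <- s[order(s$hora_ini), ]
    t_rows <- gps$ts[rows]
    # Index of the last shift that started at or before each GPS timestamp
    k <- findInterval(as.numeric(t_rows), as.numeric(s$hora_ini))
    ok <- k > 0
    # Keep the match only if the timestamp is not after the end of that shift
    ok[ok] <- t_rows[ok] <= s$hora_fin[k[ok]]
    out[rows[ok]] <- as.character(s$NOMBRE[k[ok]])
  }
  out
}

gps$Operador <- assign_operator(gps, shifts)
```

Rows with no matching shift keep NA in Operador, which makes gaps in the schedule easy to spot afterwards.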
Related
Good afternoon,
Assume we have the following function:
data_preprocessing <- function(link) {
  link <- as.character(link)
  dataset <- read.csv(link)
  # Intended to turn every "?" cell into NA
  dataset <- replace(dataset, dataset == "?", NA)
  return(dataset)
}
Example (HTTPS protocol problem):
Echocardiogram=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data")
Error in file(file, "rt") : cannot open the connection
After downloading the dataset:
Echocardiogram=data_preprocessing("http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data")
head(Echocardiogram)
X11 X0 X71 X0.1 X0.260 X9 X4.600 X14 X1 X1.1 name X1.2 X0.2
1 19 0 72 0 0.380 6 4.100 14 1.700 0.588 name 1 0
2 16 0 55 0 0.260 4 3.420 14 1 1 name 1 0
3 57 0 60 0 0.253 12.062 4.603 16 1.450 0.788 name 1 0
4 19 1 57 0 0.160 22 5.750 18 2.250 0.571 name 1 0
5 26 0 68 0 0.260 5 4.310 12 1 0.857 name 1 0
6 13 0 62 0 0.230 31 5.430 22.5 1.875 0.857 name 1 0
Also:
str(Echocardiogram)
'data.frame': 130 obs. of 12 variables:
$ X11 : Factor w/ 57 levels "",".03",".25",..: 18 16 54 18 27 14 50 18 26 12 ...
$ X0 : Factor w/ 4 levels "","?","0","1": 3 3 3 4 3 3 3 3 3 4 ...
$ X71 : Factor w/ 40 levels "","?","35","46",..: 30 12 17 14 26 19 17 4 11 34 ...
$ X0.1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ X0.260: Factor w/ 74 levels "","?","0.010",..: 65 50 47 26 50 42 59 60 21 19 ...
$ X9 : Factor w/ 93 levels "","?","0","10",..: 69 57 13 46 62 56 79 3 19 29 ...
$ X4.600: Factor w/ 106 levels "","?","2.32",..: 25 6 54 92 38 85 76 70 47 33 ...
$ X14 : Factor w/ 48 levels "","?","10","10.5",..: 16 16 21 27 8 36 16 21 19 27 ...
$ X1 : Factor w/ 67 levels "","?","1","1.04",..: 48 3 37 60 3 52 3 11 16 50 ...
$ X1.1 : Factor w/ 32 levels "","?","0.140",..: 14 30 25 13 27 27 30 31 29 21 ...
$ X1.2 : Factor w/ 5 levels "","?","1","2",..: 3 3 3 3 3 3 3 3 3 3 ...
$ X0.2 : Factor w/ 5 levels "","?","0","1",..: 3 3 3 3 3 3 3 3 3 4 ...
Here, I want to replace all "?" in the dataset with NA. It would also be good to remove duplicated and empty rows (like row 50).
Thank you for your help!
Something like this?
library(data.table)
DT <- data.table::fread("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data",
fill = TRUE,
na.strings = "?")
When using read.csv from base R, you can set na.strings = "?" and header = FALSE.
Echocardiogram <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data",
                           na.strings = "?", header = FALSE)
str(Echocardiogram)
#'data.frame': 133 obs. of 13 variables:
# $ V1 : num 11 19 16 57 19 26 13 50 19 25 ...
# $ V2 : int 0 0 0 0 1 0 0 0 0 0 ...
# $ V3 : num 71 72 55 60 57 68 62 60 46 54 ...
# $ V4 : int 0 0 0 0 0 0 0 0 0 0 ...
# $ V5 : num 0.26 0.38 0.26 0.253 0.16 0.26 0.23 0.33 0.34 0.14 ...
# $ V6 : num 9 6 4 12.1 22 ...
# $ V7 : num 4.6 4.1 3.42 4.6 5.75 ...
# $ V8 : num 14 14 14 16 18 12 22.5 14 16 15.5 ...
# $ V9 : num 1 1.7 1 1.45 2.25 ...
# $ V10: num 1 0.588 1 0.788 0.571 ...
# $ V11: chr "name" "name" "name" "name" ...
# $ V12: chr "1" "1" "1" "1" ...
# $ V13: int 0 0 0 0 0 0 0 0 0 0 ...
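Neither answer covers the second part of the question, removing duplicated and empty rows. A minimal sketch of one way to do that in base R follows. The inline text is made-up data standing in for the downloaded file, and treating "" as NA is an assumption that may or may not suit the real dataset.

```r
# Made-up sample: row 3 is empty, row 4 duplicates row 2
raw <- "11,0,71,?\n19,0,72,0.38\n,,,\n19,0,72,0.38\n"

df <- read.csv(text = raw, header = FALSE, na.strings = c("?", ""))

# Drop rows in which every column is NA
not_empty <- df[rowSums(!is.na(df)) > 0, ]

# Drop exact duplicates, keeping the first occurrence
clean <- not_empty[!duplicated(not_empty), ]
```

Here clean keeps two of the four rows: the empty row and the repeated row are gone, and the "?" in the first row has become NA.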
I am getting the following error:
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :
‘sum’ not meaningful for factors
Here is my code:
library(SportsAnalytics)
nba1819 = fetch_NBAPlayerStatistics("18-19")
nbadf = data.frame(nba1819)
nbaagg = nbadf[c(5:25)]
nbaagg = lapply(nbaagg, function(x) type.convert(as.numeric(x)))
nbaagg$Team = as.character(nbadf$Team)
nbaagg = aggregate(nbaagg,
by = list(nbaagg$Team),
FUN = sum)
I already tried to convert everything to vectors, so I don't understand why it still claims I have factors. Here is the output of str(nbaagg):
List of 22
$ GamesPlayed : int [1:530] 31 10 34 80 82 18 7 81 10 37 ...
$ TotalMinutesPlayed : int [1:530] 588 121 425 2667 1908 196 23 2689 120 414 ...
$ FieldGoalsMade : int [1:530] 56 4 38 481 280 11 3 684 13 67 ...
$ FieldGoalsAttempted: int [1:530] 157 18 110 809 487 36 10 1319 39 178 ...
$ ThreesMade : int [1:530] 41 2 25 0 3 6 0 10 3 32 ...
$ ThreesAttempted : int [1:530] 127 15 74 2 15 23 4 42 12 99 ...
$ FreeThrowsMade : int [1:530] 12 7 7 146 166 4 1 349 8 45 ...
$ FreeThrowsAttempted: int [1:530] 13 10 9 292 226 4 2 412 12 60 ...
$ OffensiveRebounds : int [1:530] 5 3 11 392 166 3 1 252 11 3 ...
$ TotalRebounds : int [1:530] 48 25 61 760 598 19 4 744 26 24 ...
$ Assists : int [1:530] 20 8 66 124 184 5 6 194 13 25 ...
$ Steals : int [1:530] 17 1 13 119 72 1 2 43 1 5 ...
$ Turnovers : int [1:530] 14 4 28 138 121 6 2 144 8 33 ...
$ Blocks : int [1:530] 6 4 5 77 65 4 0 107 0 6 ...
$ PersonalFouls : int [1:530] 53 24 45 204 203 13 4 179 7 47 ...
$ Disqualifications : int [1:530] 0 0 0 3 0 0 0 0 0 0 ...
$ TotalPoints : int [1:530] 165 17 108 1108 729 32 7 1727 37 211 ...
$ Technicals : int [1:530] 1 1 0 2 3 0 0 1 0 0 ...
$ Ejections : int [1:530] 0 0 0 0 0 0 0 0 0 0 ...
$ FlagrantFouls : int [1:530] 0 0 0 0 0 0 0 0 0 0 ...
$ GamesStarted : int [1:530] 2 0 1 80 28 3 0 81 1 2 ...
$ Team : chr [1:530] "OKL" "PHO" "ATL" "OKL" ...
Based on str(nbaagg), nbaagg is a list of vectors and not a data.frame. It can be converted to a data.frame with as.data.frame (here the list elements are of equal length):
nbaagg <- as.data.frame( nbaagg)
then, we can use
aggregate(.~ Team, nbaagg, FUN = sum, na.rm = TRUE, na.action = NULL)
It became a list in this step:
nbaagg <- lapply(nbaagg, function(x) type.convert(as.numeric(x)))
lapply always returns a list. To keep the attributes of the original dataset, including its data.frame class, assign into nbaagg[]:
nbaagg[] <- lapply(nbaagg, function(x) type.convert(as.numeric(x)))
Alternatively, type.convert can be applied directly to the whole dataset, assuming all columns are of class character, instead of looping with lapply:
nbaagg <- type.convert(nbaagg, as.is = TRUE)
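The difference between the two assignments can be seen on a small made-up data frame. This is only an illustrative sketch: the column names and values below are hypothetical, not the real NBA data.

```r
# Made-up data: numeric columns stored as text
df <- data.frame(
  Team   = c("OKL", "PHO", "OKL"),
  Points = c("165", "17", "108"),
  Games  = c("31", "10", "80"),
  stringsAsFactors = FALSE
)

# Plain lapply: the result is a list, not a data.frame
as_list <- lapply(df[c("Points", "Games")], as.numeric)

# Assigning into num[] keeps the data.frame structure
num <- df[c("Points", "Games")]
num[] <- lapply(num, as.numeric)

# With a real data.frame the formula interface of aggregate works
num$Team <- df$Team
totals <- aggregate(. ~ Team, data = num, FUN = sum)
```

In this sketch totals has one row per team, with Points and Games summed within each team.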
Given this:
kc$sqft_living_group <- cut(kc$sqft_living, breaks = c(0, 1000, 2000, 3000, 5000, 7000, 10000, 15000), dig.lab=5)
How do I set the limit of my ggplot2 graph?
Nothing I can find shows the syntax to set the limit for intervals.
kc %>%
  filter(zipcode %in% top_10_zipcodes) %>%
  group_by(sqft_living_group) %>%
  summarize(Mean_Price = mean(price)) %>%
  ggplot(aes(y = Mean_Price, x = sqft_living_group)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = comma) +
  scale_x_discrete(limits = "(0, 1000], (1000, 12000]")  # <---------- HERE
Structure of the data:
'data.frame': 21613 obs. of 22 variables:
$ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
$ date : POSIXct, format: "2014-10-13" "2014-12-09" "2015-02-25" "2014-12-09" ...
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : Factor w/ 6 levels "1","1.5","2",..: 1 3 1 1 1 1 3 1 1 3 ...
$ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement : int 0 400 0 910 0 1530 0 0 730 0 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ yr_renovated : Factor w/ 70 levels "0","1934","1940",..: 1 46 1 1 1 1 1 1 1 1 ...
$ zipcode : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15 : int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
$ sqft_living_group: Factor w/ 7 levels "(0,1000]","(1000,2000]",..: 2 3 1 2 2 5 2 2 2 2 ...
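One possible reading of the question is that only some of the intervals should appear on the x axis. A hedged sketch of that: scale_x_discrete(limits = ...) takes a character vector with one element per level, and each element has to match a factor level exactly (no spaces after the comma, as the str output shows). The prices below are made up, and the ggplot2 part is skipped if the package is not installed.

```r
# Made-up rows; only the breaks and dig.lab come from the question
kc <- data.frame(
  sqft_living = c(770, 1180, 2570, 5420),
  price       = c(180000, 221900, 538000, 1225000)
)
breaks <- c(0, 1000, 2000, 3000, 5000, 7000, 10000, 15000)
kc$sqft_living_group <- cut(kc$sqft_living, breaks = breaks, dig.lab = 5)

# Take the labels from the factor itself rather than typing them by hand
wanted <- levels(kc$sqft_living_group)[1:2]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(kc, aes(x = sqft_living_group, y = price)) +
    geom_col() +
    scale_x_discrete(limits = wanted)  # rows in other groups are dropped
}
```

Pulling the labels from levels() avoids typos such as an extra space or a wrong upper bound in the interval string.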
I'm able to scrape the first table of this page using the rvest package and using the following code:
library(rvest)
library(magrittr)
urlbbref <- read_html("http://www.baseball-reference.com/bio/Venezuela_born.shtml")
Bat <- urlbbref %>%
  html_node(xpath = '//*[(@id = "bio_batting")]') %>%
  html_table()
But I'm not able to scrape the second table of this page. I used SelectorGadget to find the XPath of both tables and used that in the code, but it doesn't seem to work for the second one.
Pit <- urlbbref %>%
  html_node(xpath = '//*[(@id = "div_bio_pitching")]') %>%
  html_table()
I came up with 3 tables in total.
library(magrittr)
library(rvest)
library(xml2)
library(stringi)
urlbbref <- read_html("http://www.baseball-reference.com/bio/Venezuela_born.shtml")
# First table is in the markup
table_one <- xml_find_all(urlbbref, "//table") %>% html_table
# Additional tables are within the comment tags, ie <!-- tables -->
# Which is why your xpath is missing them.
# First get the commented nodes
alt_tables <- xml2::xml_find_all(urlbbref, "//comment()") %>% {
  # Find only commented nodes that contain the regex for html table markup
  raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
  # Remove the comment begin and end tags
  strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--", "-->"), c("", ""),
                                                vectorize_all = FALSE)
  # Loop through the pieces that have tables within markup and
  # apply the same functions
  lapply(grep("<table", strip_html, value = TRUE), function(i) {
    rvest::html_table(xml_find_all(read_html(i), "//table")) %>%
      .[[1]]
  })
}
# Put all the data frames into a list.
all_tables <- c(
table_one, alt_tables
)
Results:
> Map(str, all_tables)
'data.frame': 361 obs. of 27 variables:
$ Rk : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Bobby Abreu" "Ehire Adrianza" "Jesus Aguilar" "Edgardo Alfonzo" ...
$ Yrs : int 18 4 4 12 6 7 1 5 5 2 ...
$ From : int 1996 2013 2014 1995 2006 2011 2000 2011 2013 2002 ...
$ To : int 2014 2016 2017 2006 2011 2017 2000 2015 2017 2004 ...
$ ASG : int 2 0 0 1 0 4 0 1 0 0 ...
$ G : int 2425 154 47 1506 193 842 2 92 150 38 ...
$ PA : int 10081 331 89 6108 624 3708 5 109 3 75 ...
$ AB : int 8480 291 81 5385 591 3411 5 94 2 64 ...
$ R : int 1453 27 4 777 44 456 1 5 0 11 ...
$ H : int 2470 64 18 1532 142 1062 1 22 0 16 ...
$ 2B : int 574 16 3 282 24 208 0 4 0 4 ...
$ 3B : int 59 1 0 18 3 19 0 0 0 0 ...
$ HR : int 288 3 0 146 17 60 0 1 0 2 ...
$ RBI : int 1363 26 8 744 67 326 0 9 0 10 ...
$ SB : int 400 4 0 53 1 204 0 0 0 1 ...
$ CS : int 128 4 0 17 2 59 0 0 0 0 ...
$ BB : int 1476 23 6 596 17 214 0 1 1 7 ...
$ SO : int 1840 60 28 617 158 389 1 34 0 12 ...
$ BA : num 0.291 0.22 0.222 0.284 0.24 0.311 0.2 0.234 0 0.25 ...
$ OBP : num 0.395 0.292 0.281 0.357 0.271 0.354 0.2 0.237 0.333 0.324 ...
$ SLG : num 0.475 0.313 0.259 0.425 0.377 0.436 0.2 0.309 0 0.406 ...
$ OPS : num 0.87 0.605 0.54 0.782 0.648 0.791 0.4 0.546 0.333 0.731 ...
$ Birthdate : chr "Mar 11, 1974" "Aug 21, 1989" "Jun 30, 1990" "Nov 8, 1973" ...
$ Debut : chr "Sep 1, 1996" "Sep 8, 2013" "May 15, 2014" "Apr 26, 1995" ...
$ Birthplace: chr "Maracay, Aragua" "Guarenas, Miranda" "Maracay, Aragua" "Santa Teresa del Tuy, Miranda" ...
$ Pos : chr "POS" "POS" "POS" "POS" ...
'data.frame': 157 obs. of 31 variables:
$ Rk : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Henderson Alvarez" "Jose Alvarez" "Wilson Alvarez" "Alexi Amarista" ...
$ Yrs : int 5 5 14 7 5 2 10 4 6 4 ...
$ From : int 2011 2013 1989 2011 1980 2015 1999 2007 2012 2005 ...
$ To : int 2015 2017 2005 2017 1984 2016 2008 2011 2017 2009 ...
$ ASG : int 1 0 1 0 0 0 0 0 0 0 ...
$ W : int 27 6 102 0 9 4 53 1 15 3 ...
$ L : int 34 12 92 0 6 2 65 3 6 4 ...
$ W-L% : num 0.443 0.333 0.526 NA 0.6 0.667 0.449 0.25 0.714 0.429 ...
$ ERA : num 3.8 3.97 3.96 0 3.27 4.35 4.65 5.28 2.91 6.86 ...
$ G : int 92 150 355 2 110 72 185 43 275 25 ...
$ GS : int 92 6 263 0 0 0 167 0 0 8 ...
$ GF : int 0 32 18 2 66 14 7 16 36 12 ...
$ CG : int 5 0 12 0 0 0 0 0 0 0 ...
$ SHO : int 5 0 5 0 0 0 0 0 0 0 ...
$ SV : int 0 0 4 0 7 0 0 0 0 0 ...
$ IP : num 563 167.2 1747.2 0.2 220 ...
$ H : int 596 174 1624 0 222 64 891 57 177 68 ...
$ R : int 261 85 857 0 86 39 519 29 75 51 ...
$ ER : int 238 74 769 0 80 30 478 27 72 46 ...
$ HR : int 54 17 190 0 17 5 122 7 10 4 ...
$ BB : int 129 55 805 0 68 36 431 21 80 34 ...
$ IBB : int 7 10 29 0 7 3 41 5 17 1 ...
$ SO : int 296 148 1330 0 113 63 680 41 180 37 ...
$ HBP : int 22 8 50 0 3 2 51 4 11 4 ...
$ BK : int 3 1 4 0 3 1 6 0 3 1 ...
$ WP : int 16 3 28 0 5 2 43 1 14 2 ...
$ BF : int 2358 729 7518 2 928 285 4055 221 913 282 ...
$ Birthdate : chr "Apr 18, 1990" "May 6, 1989" "Mar 24, 1970" "Apr 6, 1989" ...
$ Debut : chr "Aug 10, 2011" "Jun 9, 2013" "Jul 24, 1989" "Apr 26, 2011" ...
$ Birthplace: chr "Valencia, Carabobo" "Barcelona, Anzoategui" "Maracaibo, Zulia" "Barcelona, Anzoategui" ...
'data.frame': 3 obs. of 17 variables:
$ Rk : int 1 2 NA
$ Mgr : chr "Ozzie Guillen" "Al Pedrique" "Totals"
$ Yrs : int 9 1 10
$ From : int 2004 2004 2004
$ To : int 2012 2004 2012
$ W : int 747 22 769
$ L : int 710 61 771
$ W-L% : num 0.513 0.265 0.499
$ Ties : int 0 0 0
$ G>.500 : int 37 -39 -2
$ G : int 1457 83 1540
$ BestFin : int 1 5 1
$ WrstFin : int 5 5 5
$ AvRk : num 2.7 5 2.8
$ Birthdate : chr "Jan 20, 1964" "Aug 11, 1960" ""
$ Debut : chr "Apr 9, 1985" "Apr 14, 1987" ""
$ Birthplace: chr "Ocumare del Tuy, Miranda" "Valencia, Carabobo" ""
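The comment trick above can be checked without hitting the site. This is a small sketch on an inline HTML string, assuming the xml2 package is installed. The markup and the table ids are made up for illustration, and it uses xml_text on the comment nodes as a shortcut in place of the regex stripping.

```r
# Made-up page: one table in the markup, one hidden inside an HTML comment
html <- paste0(
  '<html><body>',
  '<table id="bio_batting"><tr><th>Name</th></tr><tr><td>A</td></tr></table>',
  '<!-- <div><table id="bio_pitching"><tr><th>Name</th></tr>',
  '<tr><td>B</td></tr></table></div> -->',
  '</body></html>'
)

visible_ids <- hidden_ids <- character(0)
if (requireNamespace("xml2", quietly = TRUE)) {
  doc <- xml2::read_html(html)
  # Tables that are part of the parsed document
  visible_ids <- xml2::xml_attr(xml2::xml_find_all(doc, "//table"), "id")
  # Comment nodes hold their content as plain text, so parse that text again
  comment_text <- xml2::xml_text(xml2::xml_find_all(doc, "//comment()"))
  hidden_doc <- xml2::read_html(paste(comment_text, collapse = ""))
  hidden_ids <- xml2::xml_attr(xml2::xml_find_all(hidden_doc, "//table"), "id")
}
```

In this sketch the XPath //table only sees the first table in the original document, and the second one shows up once the comment text is parsed on its own.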
I am trying to use SVM for a multi-class classification task.
I have a dataset called df, which I divided into a training and a test set with the following code:
sample <- df[sample(nrow(df), 10000),] # take a random sample of 10,000 from dataset df
sample <- sample %>% arrange(Date) # arrange chronologically
train <- sample[1:8000,] # 80% of the df dataset
test <- sample[8001:10000,] # 20% of the df dataset
This is what the training set looks like:
> str(train)
'data.frame': 8000 obs. of 45 variables:
$ Date : Date, format: "2008-01-01" "2008-01-01" "2008-01-02" ...
$ Weekday : chr "Tuesday" "Tuesday" "Wednesday" "Wednesday" ...
$ Season : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Weekend : num 0 0 0 0 0 0 0 0 0 0 ...
$ Icao.type : Factor w/ 306 levels "A124","A225",..: 7 29 112 115 107 10 115 115 115 112 ...
$ Act.description : Factor w/ 389 levels "A300-600F","A330-200F",..: 9 29 161 162 150 13 162 162 162 161 ...
$ Arr.dep : Factor w/ 2 levels "A","D": 2 2 1 1 1 1 1 1 1 1 ...
$ MTOW : num 77 69 46 21 22 238 21 21 21 46 ...
$ Icao.wtc : chr "Medium" "Medium" "Medium" "Medium" ...
$ Wind.direc : int 104 104 82 82 93 93 93 132 132 132 ...
$ Wind.speed.vec : int 35 35 57 57 64 64 64 62 62 62 ...
$ Wind.speed.daily: int 35 35 58 58 65 65 65 63 63 63 ...
$ Wind.speed.max : int 60 60 70 70 80 80 80 90 90 90 ...
$ Wind.speed.min : int 20 20 40 40 50 50 50 50 50 50 ...
$ Wind.gust.max : int 100 100 120 120 130 130 130 140 140 140 ...
$ Temp.daily : int 24 24 -5 -5 4 4 4 34 34 34 ...
$ Temp.min : int -7 -7 -25 -25 -13 -13 -13 11 11 11 ...
$ Temp.max : int 50 50 16 16 13 13 13 55 55 55 ...
$ Temp.10.min : int -11 -11 -32 -32 -18 -18 -18 9 9 9 ...
$ Sun.dur : int 7 7 65 65 19 19 19 0 0 0 ...
$ Sun.dur.prct : int 9 9 83 83 24 24 24 0 0 0 ...
$ Radiation : int 173 173 390 390 213 213 213 108 108 108 ...
$ Precip.dur : int 0 0 0 0 0 0 0 5 5 5 ...
$ Precip.daily : int 0 0 0 0 -1 -1 -1 2 2 2 ...
$ Precip.max : int 0 0 0 0 -1 -1 -1 2 2 2 ...
$ Sea.press.daily : int 10259 10259 10206 10206 10080 10080 10080 10063 10063 10063 ...
$ Sea.press.max : int 10276 10276 10248 10248 10132 10132 10132 10086 10086 10086 ...
$ Sea.press.min : int 10250 10250 10141 10141 10058 10058 10058 10001 10001 10001 ...
$ Visibility.min : int 1 1 40 40 43 43 43 58 58 58 ...
$ Visibility.max : int 59 59 75 75 66 66 66 65 65 65 ...
$ Cloud.daily : int 7 7 3 3 8 8 8 8 8 8 ...
$ Humidity.daily : int 96 96 86 86 77 77 77 82 82 82 ...
$ Humidity.max : int 99 99 92 92 92 92 92 90 90 90 ...
$ Humidity.min : int 91 91 74 74 71 71 71 76 76 76 ...
$ Evapo : int 2 2 4 4 2 2 2 1 1 1 ...
$ Wind.discrete : chr "South East" "South East" "North East" "North East" ...
$ Vmc.imc : chr "Unknown" "Unknown" "Unknown" "Unknown" ...
$ Beaufort : num 3 3 4 4 4 4 4 4 4 4 ...
$ Main.A : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.B : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.K : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.O : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.P : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.Z : num 0 0 0 0 0 0 0 0 0 0 ...
$ Runway : Factor w/ 13 levels "04","06","09",..: 3 8 2 2 2 6 2 6 6 6 ...
Then, I try to tune the SVM parameters with the following code:
library(e1071)
tuned <- tune.svm(Runway ~ ., data = train, gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
While this code has worked in the past, it now gives me the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
The only thing I can think of that has changed is the rows in the dataset train, as running the first code block means taking a random sample of 10,000 (out of dataset df, that contains 3.5 million rows).
Does anyone know why I am getting this?
I recognise that this question was rather hard to solve without a good reproducible example.
However, I have found the solution to my problem and wanted to post it here for anyone who might be looking for this in the future.
Running the same code, but with selected columns from the train set:
tuned <- tune.svm(Runway ~ ., data = train[,c(1:2, 45)], gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
gave me absolutely no problem. I continued adding more features until the error was reproduced. I found that the features Vmc.imc and Icao.wtc were causing the error and that they were both chr features. Using the following code:
train$Vmc.imc <- as.factor(train$Vmc.imc)
train$Icao.wtc <- as.factor(train$Icao.wtc)
to change them into factors and then rerunning
tuned <- tune.svm(Runway ~ ., data = train, gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
solved my problem.
I do not know why the other chr features such as Weekday and Wind.discrete are not causing the same issue. If anyone knows the answer to this, I would be glad to find out.
This is similar to this thread here. I would add that if you neglect to convert all your character features to factors, you will also receive this error when attempting to run predict.
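Rather than converting the columns one by one, all character columns can be converted in one go. This is a minimal sketch on made-up data: the column names are borrowed from the question, the values are hypothetical. Reusing the training levels for the test set is an assumption about what predict needs, not something confirmed in this thread.

```r
# Made-up training and test rows
train <- data.frame(
  Weekday  = c("Tuesday", "Wednesday", "Tuesday"),
  MTOW     = c(77, 46, 21),
  Icao.wtc = c("Medium", "Heavy", "Medium"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  Weekday  = c("Wednesday"),
  MTOW     = c(69),
  Icao.wtc = c("Medium"),
  stringsAsFactors = FALSE
)

# Convert every character column of the training set to a factor
chr_cols <- names(train)[vapply(train, is.character, logical(1))]
train[chr_cols] <- lapply(train[chr_cols], factor)

# Give the test set the same levels as the training set
for (nm in chr_cols) {
  test[[nm]] <- factor(test[[nm]], levels = levels(train[[nm]]))
}
```

Numeric columns such as MTOW are left untouched, so the same lines can be run on the full 45-column data frame.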