I am having the following error:
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :
‘sum’ not meaningful for factors
Here is my code:
library(SportsAnalytics)
nba1819 = fetch_NBAPlayerStatistics("18-19")
nbadf = data.frame(nba1819)
nbaagg = nbadf[c(5:25)]
nbaagg = lapply(nbaagg, function(x) type.convert(as.numeric(x)))
nbaagg$Team = as.character(nbadf$Team)
nbaagg = aggregate(nbaagg,
by = list(nbaagg$Team),
FUN = sum)
I already tried to convert everything to vectors, so I don't understand why it is still claiming I have factors. Here is the output of str(nbaagg):
List of 22
$ GamesPlayed : int [1:530] 31 10 34 80 82 18 7 81 10 37 ...
$ TotalMinutesPlayed : int [1:530] 588 121 425 2667 1908 196 23 2689 120 414 ...
$ FieldGoalsMade : int [1:530] 56 4 38 481 280 11 3 684 13 67 ...
$ FieldGoalsAttempted: int [1:530] 157 18 110 809 487 36 10 1319 39 178 ...
$ ThreesMade : int [1:530] 41 2 25 0 3 6 0 10 3 32 ...
$ ThreesAttempted : int [1:530] 127 15 74 2 15 23 4 42 12 99 ...
$ FreeThrowsMade : int [1:530] 12 7 7 146 166 4 1 349 8 45 ...
$ FreeThrowsAttempted: int [1:530] 13 10 9 292 226 4 2 412 12 60 ...
$ OffensiveRebounds : int [1:530] 5 3 11 392 166 3 1 252 11 3 ...
$ TotalRebounds : int [1:530] 48 25 61 760 598 19 4 744 26 24 ...
$ Assists : int [1:530] 20 8 66 124 184 5 6 194 13 25 ...
$ Steals : int [1:530] 17 1 13 119 72 1 2 43 1 5 ...
$ Turnovers : int [1:530] 14 4 28 138 121 6 2 144 8 33 ...
$ Blocks : int [1:530] 6 4 5 77 65 4 0 107 0 6 ...
$ PersonalFouls : int [1:530] 53 24 45 204 203 13 4 179 7 47 ...
$ Disqualifications : int [1:530] 0 0 0 3 0 0 0 0 0 0 ...
$ TotalPoints : int [1:530] 165 17 108 1108 729 32 7 1727 37 211 ...
$ Technicals : int [1:530] 1 1 0 2 3 0 0 1 0 0 ...
$ Ejections : int [1:530] 0 0 0 0 0 0 0 0 0 0 ...
$ FlagrantFouls : int [1:530] 0 0 0 0 0 0 0 0 0 0 ...
$ GamesStarted : int [1:530] 2 0 1 80 28 3 0 81 1 2 ...
$ Team : chr [1:530] "OKL" "PHO" "ATL" "OKL" ...
Based on str(nbaagg), nbaagg is a list of vectors and not a data.frame. It can be converted to a data.frame with as.data.frame (here the list elements are all of equal length):
nbaagg <- as.data.frame(nbaagg)
Then, we can use:
aggregate(.~ Team, nbaagg, FUN = sum, na.rm = TRUE, na.action = NULL)
It was created as a list in this step:
nbaagg <- lapply(nbaagg, function(x) type.convert(as.numeric(x)))
The lapply output is always a list. If we want to keep the same attributes as the original dataset (i.e. remain a data.frame), assign into it with []:
nbaagg[] <- lapply(nbaagg, function(x) type.convert(as.numeric(x)))
Here, type.convert can be applied directly to the whole dataset (assuming the columns are all of character class) instead of looping with lapply:
nbaagg <- type.convert(nbaagg, as.is = TRUE)
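Putting those suggestions together, a minimal sketch of the corrected workflow could look like this (it assumes, as in the question, that columns 5 to 25 hold the counting stats; as.numeric(as.character(.)) is used here as a generic factor-to-number conversion in place of type.convert):
library(SportsAnalytics)
nba1819 <- fetch_NBAPlayerStatistics("18-19")
nbadf <- data.frame(nba1819)
nbaagg <- nbadf[5:25]                                                # counting stats only
nbaagg[] <- lapply(nbaagg, function(x) as.numeric(as.character(x)))  # stays a data.frame
nbaagg$Team <- as.character(nbadf$Team)
team_totals <- aggregate(. ~ Team, data = nbaagg, FUN = sum,
                         na.rm = TRUE, na.action = NULL)
head(team_totals)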
Good afternoon,
Assume we have the following function:
data_preprocessing <- function(link){
  link <- as.character(link)
  dataset <- read.csv(link)
  dataset <- replace(dataset, dataset == "?", NA)
  return(dataset)
}
Example (https protocol problem):
Echocardiogram=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data")
Error in file(file, "rt") : cannot open the connection
After downloading the dataset:
Echocardiogram=data_preprocessing("http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data")
head(Echocardiogram)
X11 X0 X71 X0.1 X0.260 X9 X4.600 X14 X1 X1.1 name X1.2 X0.2
1 19 0 72 0 0.380 6 4.100 14 1.700 0.588 name 1 0
2 16 0 55 0 0.260 4 3.420 14 1 1 name 1 0
3 57 0 60 0 0.253 12.062 4.603 16 1.450 0.788 name 1 0
4 19 1 57 0 0.160 22 5.750 18 2.250 0.571 name 1 0
5 26 0 68 0 0.260 5 4.310 12 1 0.857 name 1 0
6 13 0 62 0 0.230 31 5.430 22.5 1.875 0.857 name 1 0
Also:
str(Echocardiogram)
'data.frame': 130 obs. of 12 variables:
$ X11 : Factor w/ 57 levels "",".03",".25",..: 18 16 54 18 27 14 50 18 26 12 ...
$ X0 : Factor w/ 4 levels "","?","0","1": 3 3 3 4 3 3 3 3 3 4 ...
$ X71 : Factor w/ 40 levels "","?","35","46",..: 30 12 17 14 26 19 17 4 11 34 ...
$ X0.1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ X0.260: Factor w/ 74 levels "","?","0.010",..: 65 50 47 26 50 42 59 60 21 19 ...
$ X9 : Factor w/ 93 levels "","?","0","10",..: 69 57 13 46 62 56 79 3 19 29 ...
$ X4.600: Factor w/ 106 levels "","?","2.32",..: 25 6 54 92 38 85 76 70 47 33 ...
$ X14 : Factor w/ 48 levels "","?","10","10.5",..: 16 16 21 27 8 36 16 21 19 27 ...
$ X1 : Factor w/ 67 levels "","?","1","1.04",..: 48 3 37 60 3 52 3 11 16 50 ...
$ X1.1 : Factor w/ 32 levels "","?","0.140",..: 14 30 25 13 27 27 30 31 29 21 ...
$ X1.2 : Factor w/ 5 levels "","?","1","2",..: 3 3 3 3 3 3 3 3 3 3 ...
$ X0.2 : Factor w/ 5 levels "","?","0","1",..: 3 3 3 3 3 3 3 3 3 4 ...
Here, I want to replace all "?" in the dataset with NA. It would also be good to remove duplicated and empty rows (like row 50).
Thank you for your help!
Something like this?
library(data.table)
DT <- data.table::fread("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data",
                        fill = TRUE,
                        na.strings = "?")
When using read.csv from base R, you can set na.strings = "?" and header = FALSE.
Echocardiogram <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data",
                           na.strings = "?", header = FALSE)
str(Echocardiogram)
#'data.frame': 133 obs. of 13 variables:
# $ V1 : num 11 19 16 57 19 26 13 50 19 25 ...
# $ V2 : int 0 0 0 0 1 0 0 0 0 0 ...
# $ V3 : num 71 72 55 60 57 68 62 60 46 54 ...
# $ V4 : int 0 0 0 0 0 0 0 0 0 0 ...
# $ V5 : num 0.26 0.38 0.26 0.253 0.16 0.26 0.23 0.33 0.34 0.14 ...
# $ V6 : num 9 6 4 12.1 22 ...
# $ V7 : num 4.6 4.1 3.42 4.6 5.75 ...
# $ V8 : num 14 14 14 16 18 12 22.5 14 16 15.5 ...
# $ V9 : num 1 1.7 1 1.45 2.25 ...
# $ V10: num 1 0.588 1 0.788 0.571 ...
# $ V11: chr "name" "name" "name" "name" ...
# $ V12: chr "1" "1" "1" "1" ...
# $ V13: int 0 0 0 0 0 0 0 0 0 0 ...
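The answers above only handle the "?" values; to also drop exact duplicate rows and rows that are entirely empty, as the question asks, a base-R sketch along these lines could be applied to the Echocardiogram data frame read in above:
Echocardiogram <- Echocardiogram[!duplicated(Echocardiogram), ]          # remove exact duplicate rows
Echocardiogram <- Echocardiogram[rowSums(!is.na(Echocardiogram)) > 0, ]  # remove rows that are all NA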
I have two data frames: one called datos_octubre with 10,131,000 rows and another called datos_conductores.
I want to add a new column called operador, which will be filled by the following instructions:
for (j in 1:100){
  for (i in 1:100){
    if (datos_octubre$FECHA_GPS[i] == datos_conductores$fecha[i]){
      if (datos_octubre$EQU_CODIGO[i] == datos_conductores$EQU_CODIGO[j]){
        if (datos_octubre$HORA_GPS[i] <= datos_conductores$hora_fin[j]){
          datos_octubre$Operador[i] <- datos_octubre$NOMBRE[j]
        }
      }
    }
  }
}
This is the structure of the data frame datos_octubre and its head:
> str(datos_octubre)
'data.frame': 10131530 obs. of 14 variables:
$ REP_GPS_CODIGO : Factor w/ 9329105 levels "MI051","MI051_1832614921789237",..: 2 3 4 5 6 7 8 9 10 11 ...
$ EQU_CODIGO : chr "MI051" "MI051" "MI051" "MI051" ...
$ TRAM_GPS_CODIGO: Factor w/ 4 levels "01","03","05",..: 4 4 4 4 4 4 4 4 4 4 ...
$ EVE_GPS_CODIGO : Factor w/ 83 levels "01","02","03",..: 3 3 3 3 3 9 3 3 3 3 ...
$ FECHA_GPS : POSIXct, format: "2019-10-01" "2019-10-01" "2019-10-01" "2019-10-01" ...
$ HORA_GPS : Factor w/ 86389 levels "-75.6654","-75.6655",..: 16528 16536 16546 16556 16564 16568 16574 16583 16592 16601 ...
$ LON_GPS : num -75.7 -75.7 -75.7 -75.7 -75.7 ...
$ LAT_GPS : num 4.8 4.8 4.8 4.8 4.8 ...
$ VEL_GPS : num 0 0 0 0 0 0 0 0 0 0 ...
$ DIR_GPS : int 0 0 0 101 101 101 101 101 101 101 ...
$ ACL_GPS : int 0 0 0 0 0 NA 0 0 0 0 ...
$ ODO_GPS : int 28229762 28229762 28229762 28229768 28229770 NA 28229770 28229770 28229770 28229770 ...
$ ALT_GPS : Factor w/ 120 levels "","\"MI051_1902402005409507",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Operador : chr "" "" "" "" ...
> head(datos_octubre)
REP_GPS_CODIGO EQU_CODIGO TRAM_GPS_CODIGO EVE_GPS_CODIGO FECHA_GPS HORA_GPS LON_GPS LAT_GPS VEL_GPS DIR_GPS ACL_GPS ODO_GPS ALT_GPS Operador
1 MI051_1832614921789237 MI051 EV 03 2019-10-01 04:35:38 -75.7444 4.79857 0 0 0 28229762
2 MI051_1832614964979379 MI051 EV 03 2019-10-01 04:35:46 -75.7444 4.79857 0 0 0 28229762
3 MI051_1832616366109032 MI051 EV 03 2019-10-01 04:35:56 -75.7444 4.79857 0 0 0 28229762
4 MI051_1832617794914447 MI051 EV 03 2019-10-01 04:36:06 -75.7442 4.79907 0 101 0 28229768
5 MI051_1832619516509591 MI051 EV 03 2019-10-01 04:36:14 -75.7442 4.79908 0 101 0 28229770
6 MI051_1832619543973570 MI051 EV 10 2019-10-01 04:36:18 -75.7442 4.79908 0 101 NA NA
And this is the structure of datos_conductores:
> str(datos_conductores)
'data.frame': 16522 obs. of 11 variables:
$ fecha : POSIXct, format: "2019-10-01" "2019-10-01" "2019-10-01" "2019-10-01" ...
$ equ_id : int 99 99 99 99 99 99 99 99 99 99 ...
$ conductor : int 34 34 34 34 34 34 34 65 65 65 ...
$ servicio_id: int 533329 533328 533327 533326 533325 533324 533323 533333 533332 533331 ...
$ PERA_ID : int 362 362 362 362 362 362 362 107 107 107 ...
$ hora_ini : POSIXct, format: "2019-11-28 09:16:16" "2019-11-28 08:38:16" "2019-11-28 08:00:16" "2019-11-28 07:22:16" ...
$ hora_fin : POSIXct, format: "2019-11-28 09:21:00" "2019-11-28 09:16:16" "2019-11-28 08:38:16" "2019-11-28 08:00:16" ...
$ ruta_id : int 24 24 24 24 24 24 24 24 24 24 ...
$ NOMBRE : Factor w/ 85 levels "ALBERT HERNAN ZAPATA RESTREPO",..: 71 71 71 71 71 71 71 53 53 53 ...
$ PERA_CEDULA: int 1088253762 1088253762 1088253762 1088253762 1088253762 1088253762 1088253762 10087424 10087424 10087424 ...
$ EQU_CODIGO : Factor w/ 36 levels "MI051","MI052",..: 9 9 9 9 9 9 9 9 9 9 ...
I also tried with the pipe operator, but I'm not getting the result I want.
I already found the solution.
I had to convert all the date/time data with as.POSIXct and, inside the for loop, make a correction using the start and end time (hora_ini and hora_fin) of each record.
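The self-answer is terse, so here is a hedged sketch of what that conversion might look like, assuming HORA_GPS holds clock times that have to be combined with FECHA_GPS before they can be compared against the POSIXct columns hora_ini/hora_fin:
datos_octubre$FECHA_GPS <- as.POSIXct(datos_octubre$FECHA_GPS)
datos_octubre$HORA_GPS <- as.POSIXct(paste(as.Date(datos_octubre$FECHA_GPS),
                                           as.character(datos_octubre$HORA_GPS)))  # date + clock time
datos_conductores$fecha <- as.POSIXct(datos_conductores$fecha)
datos_conductores$hora_ini <- as.POSIXct(datos_conductores$hora_ini)
datos_conductores$hora_fin <- as.POSIXct(datos_conductores$hora_fin)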
Why is a list being added to my dataframe here?
Here's my dataframe
library(dplyr)  # needed for arrange() here and group_by()/mutate() below

df <- data.frame(ch = rep(1:10, each = 12), # care home id
year_id = rep(2018),
month_id = rep(1:12), # month using the system over the course of a year (1 = first month, 2 = second month...etc.)
totaladministrations = rbinom(n=120, size = 1000, prob = 0.6), # administrations that were scheduled to have been given in the month
missed = rbinom(n=120, size = 20, prob = 0.8), # administrations that weren't given in the month (these are bad!)
beds = rep(rbinom(n = 10, size = 60, prob = 0.6), each = 12), # number of beds in the care home
rating = rep(rbinom(n= 10, size = 4, prob = 0.5), each = 12)) # latest inspection rating (1. Inadequate, 2. Requires Improving, 3. Good, 4 Outstanding)
df <- arrange(df, df$ch, df$year_id, df$month_id)
str(df)
> str(df)
'data.frame': 120 obs. of 7 variables:
$ ch : int 1 1 1 1 1 1 1 1 1 1 ...
$ year_id : num 2018 2018 2018 2018 2018 ...
$ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ totaladministrations: int 576 598 608 576 608 637 611 613 593 626 ...
$ missed : int 18 18 19 16 16 13 17 16 15 17 ...
$ beds : int 38 38 38 38 38 38 38 38 38 38 ...
$ rating : int 2 2 2 2 2 2 2 2 2 2 ...
All good so far.
I just want to add another column that sequences the month number within the ch group (this equates to the actual month_id in this example, but ignore that; my real-life data is different), so I'm using:
df <- df %>% group_by(ch) %>%
mutate(sequential_month_counter = 1:n())
This appears to add a bunch of stuff I don't really understand, want, or need, such as a list ...
str(df)
> str(df)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 8 variables:
$ ch : int 1 1 1 1 1 1 1 1 1 1 ...
$ year_id : num 2018 2018 2018 2018 2018 ...
$ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ totaladministrations : int 601 590 593 599 615 611 628 587 604 600 ...
$ missed : int 16 14 17 16 18 16 15 18 15 20 ...
$ beds : int 35 35 35 35 35 35 35 35 35 35 ...
$ rating : int 3 3 3 3 3 3 3 3 3 3 ...
$ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "groups")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 2 variables:
..$ ch : int 1 2 3 4 5 6 7 8 9 10
..$ .rows:List of 10
.. ..$ : int 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int 13 14 15 16 17 18 19 20 21 22 ...
.. ..$ : int 25 26 27 28 29 30 31 32 33 34 ...
.. ..$ : int 37 38 39 40 41 42 43 44 45 46 ...
.. ..$ : int 49 50 51 52 53 54 55 56 57 58 ...
.. ..$ : int 61 62 63 64 65 66 67 68 69 70 ...
.. ..$ : int 73 74 75 76 77 78 79 80 81 82 ...
.. ..$ : int 85 86 87 88 89 90 91 92 93 94 ...
.. ..$ : int 97 98 99 100 101 102 103 104 105 106 ...
.. ..$ : int 109 110 111 112 113 114 115 116 117 118 ...
..- attr(*, ".drop")= logi TRUE
What's going on here? I just want a data frame. Why is there all that additional output after $ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ..., and more importantly, can I ignore it and just keep treating it as a normal data frame (I'll be running some generalised linear mixed models on the df)?
The attribute "groups" is where dplyr stores the grouping information added when you did group_by(ch). It doesn't hurt anything, and it will disappear if you ungroup():
df %>%
  group_by(ch) %>%
  mutate(sequential_month_counter = 1:n()) %>%
  ungroup() %>%
  str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 8 variables:
# $ ch : int 1 1 1 1 1 1 1 1 1 1 ...
# $ year_id : num 2018 2018 2018 2018 2018 ...
# $ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
# $ totaladministrations : int 575 597 579 605 582 599 577 604 630 632 ...
# $ missed : int 18 16 16 18 18 11 10 13 17 16 ...
# $ beds : int 33 33 33 33 33 33 33 33 33 33 ...
# $ rating : int 3 3 3 3 3 3 3 3 3 3 ...
# $ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ...
As a side-note, you should use bare column names inside dplyr verbs, not data$column. With arrange, it doesn't much matter, but in grouped operations it will cause bugs. You should get in the habit of using arrange(df, ch, year_id, month_id) instead of arrange(df, df$ch, df$year_id, df$month_id).
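To make that last point concrete, here is a tiny illustration (not from the original answer) of why df$column inside a grouped verb is risky: the $ form pulls the full, ungrouped column, so group-wise calculations silently use the wrong data.
library(dplyr)
d <- data.frame(g = c(1, 1, 2, 2), x = 1:4)
d %>%
  group_by(g) %>%
  mutate(mean_bare = mean(x),       # per-group means: 1.5, 1.5, 3.5, 3.5
         mean_dollar = mean(d$x))   # whole-column mean: 2.5 for every row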
I am trying to update a value in a data frame but am getting what seems to me a weird error about an operation that I don't think I am using.
Here's a summary of the data:
> str(us.cty2015@data)
'data.frame': 3108 obs. of 15 variables:
$ STATEFP : Factor w/ 52 levels "01","02","04",..: 17 25 33 46 4 14 16 24 36 42 ...
$ COUNTYFP : Factor w/ 325 levels "001","003","005",..: 112 91 67 9 43 81 7 103 72 49 ...
$ COUNTYNS : Factor w/ 3220 levels "00023901","00025441",..: 867 1253 1600 2465 38 577 690 1179 1821 2104 ...
$ AFFGEOID : Factor w/ 3220 levels "0500000US01001",..: 976 1472 1879 2813 144 657 795 1395 2098 2398 ...
$ GEOID : Factor w/ 3220 levels "01001","01003",..: 976 1472 1879 2813 144 657 795 1395 2098 2398 ...
$ NAME : Factor w/ 1910 levels "Abbeville","Acadia",..: 1558 1703 1621 688 856 1075 148 1807 1132 868 ...
$ LSAD : Factor w/ 9 levels "00","03","04",..: 5 5 5 5 5 5 5 5 5 5 ...
$ ALAND : num 1.66e+09 1.10e+09 3.60e+09 2.12e+08 1.50e+09 ...
$ AWATER : num 2.78e+06 5.24e+07 3.50e+07 2.92e+08 8.91e+06 ...
$ t_pop : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_wht : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_free_blk: num 0 0 0 0 0 0 0 0 0 0 ...
$ n_slv : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_blk : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_free : num 0 0 0 0 0 0 0 0 0 0 ...
> str(us.cty1860@data)
'data.frame': 2126 obs. of 29 variables:
$ DECADE : Factor w/ 1 level "1860": 1 1 1 1 1 1 1 1 1 1 ...
$ NHGISNAM : Factor w/ 1236 levels "Abbeville","Accomack",..: 1142 1218 1130 441 812 548 1144 56 50 887 ...
$ NHGISST : Factor w/ 41 levels "010","050","060",..: 32 13 9 36 16 36 16 30 23 39 ...
$ NHGISCTY : Factor w/ 320 levels "0000","0010",..: 142 206 251 187 85 231 131 12 6 161 ...
$ ICPSRST : Factor w/ 37 levels "1","11","12",..: 5 13 21 26 22 26 22 10 15 17 ...
$ ICPSRCTY : Factor w/ 273 levels "10","1010","1015",..: 25 93 146 72 247 122 12 10 228 45 ...
$ ICPSRNAM : Factor w/ 1200 levels "ABBEVILLE","ACCOMACK",..: 1108 1184 1097 432 791 535 1110 55 49 860 ...
$ STATENAM : Factor w/ 41 levels "Alabama","Arkansas",..: 32 13 9 36 16 36 16 30 23 39 ...
$ ICPSRSTI : int 14 31 44 49 45 49 45 24 34 40 ...
$ ICPSRCTYI : int 1210 1970 2910 1810 710 2450 1130 110 50 1450 ...
$ ICPSRFIP : num 0 0 0 0 0 0 0 0 0 0 ...
$ STATE : Factor w/ 41 levels "010","050","060",..: 32 13 9 36 16 36 16 30 23 39 ...
$ COUNTY : Factor w/ 320 levels "0000","0010",..: 142 206 251 187 85 231 131 12 6 161 ...
$ PID : num 1538 735 306 1698 335 ...
$ X_CENTROID : num 1348469 184343 1086494 -62424 585888 ...
$ Y_CENTROID : num 556680 588278 -229809 -433290 -816852 ...
$ GISJOIN : Factor w/ 2126 levels "G0100010","G0100030",..: 1585 627 319 1769 805 1788 823 1425 1079 2006 ...
$ GISJOIN2 : Factor w/ 2126 levels "0100010","0100030",..: 1585 627 319 1769 805 1788 823 1425 1079 2006 ...
$ SHAPE_AREA : num 2.35e+09 1.51e+09 8.52e+08 2.54e+09 6.26e+08 ...
$ SHAPE_LEN : num 235777 155261 166065 242608 260615 ...
$ t_pop : int 25043 653 4413 8184 174491 1995 4324 17187 4649 8392 ...
$ n_wht : int 24974 653 4295 6892 149063 1684 3001 17123 4578 2580 ...
$ n_free_blk : int 69 0 2 0 10939 2 7 64 12 409 ...
$ n_slv : int 0 0 116 1292 14484 309 1316 0 59 5403 ...
$ n_blk : int 69 0 118 1292 25423 311 1323 64 71 5812 ...
$ n_free : num 25043 653 4297 6892 160007 ...
$ frac_free : num 1 1 0.974 0.842 0.917 ...
$ frac_free_blk: num 1 NA 0.0169 0 0.4303 ...
$ frac_slv : num 0 0 0.0263 0.1579 0.083 ...
> str(overlap)
'data.frame': 15266 obs. of 7 variables:
$ cty2015 : Factor w/ 3108 levels "0","1","10","100",..: 1 1 2 2 2 2 2 1082 1082 1082 ...
$ cty1860 : Factor w/ 2126 levels "0","1","10","100",..: 1047 1012 1296 1963 2033 2058 2065 736 1413 1569 ...
$ area_inter : num 1.66e+09 2.32e+05 9.81e+04 1.07e+09 7.67e+07 ...
$ area1860 : num 1.64e+11 1.81e+11 1.54e+09 2.91e+09 2.32e+09 ...
$ frac_1860 : num 1.01e-02 1.28e-06 6.35e-05 3.67e-01 3.30e-02 ...
$ sum_frac_1860 : num 1 1 1 1 1 ...
$ scaled_frac_1860: num 1.01e-02 1.28e-06 6.35e-05 3.67e-01 3.30e-02 ...
I am trying to multiply a vector of variables vars <- c("t_pop", "n_wht", "n_free_blk", "n_slv", "n_blk", "n_free") in the us.cty1860@data data frame by the scalar overlap$scaled_frac_1860[i], add the result to the same vector of variables in the us.cty2015@data data frame, and finally overwrite those variables in us.cty2015@data.
When I make the following call, I get an error that seems to be saying that I am trying to perform invalid operations on factors (which is not the case, as you can confirm from the str output).
> us.cty2015@data[overlap$cty2015[1], vars] <- us.cty2015@data[overlap$cty2015[1], vars] + (overlap$scaled_frac_1860[1] * us.cty1860@data[overlap$cty1860[1], vars])
Error in Summary.factor(1L, na.rm = FALSE) :
‘max’ not meaningful for factors
In addition: Warning message:
In Ops.factor(i, 0L) : ‘>=’ not meaningful for factors
However, when I don't attempt to overwrite the old value, the operation works fine.
> us.cty2015@data[overlap$cty2015[1], vars] + (overlap$scaled_frac_1860[1] * us.cty1860@data[overlap$cty1860[1], vars])
t_pop n_wht n_free_blk n_slv n_blk n_free
0 118.3889 113.6468 0.1317233 4.610316 4.742039 113.7785
I'm sure there are better ways of accomplishing what I am trying to do, but does anyone have any idea what is going on?
Edit:
I am using the following libraries: rgdal, rgeos, and maptools.
All the data/objects come from the NHGIS shapefiles for 1860 and 2015 United States counties.
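No answer is recorded for this one, so the following is only a hedged guess based on the messages themselves: 'max'/Summary.factor and '>='/Ops.factor are exactly the checks that data.frame assignment ([<-.data.frame) runs on its row index, and overlap$cty2015 and overlap$cty1860 are factors in the str() output above, so the row index being a factor is the likely trigger. Converting the index before using it should avoid the error; how to convert (integer codes, the values the levels encode, or row names) depends on how the rows were keyed:
i15 <- as.integer(as.character(overlap$cty2015[1]))   # assumes the levels store row numbers
i60 <- as.integer(as.character(overlap$cty1860[1]))
us.cty2015@data[i15, vars] <- us.cty2015@data[i15, vars] +
  overlap$scaled_frac_1860[1] * us.cty1860@data[i60, vars]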
I am trying to use SVM for a multi-class classification task.
I have a dataset called df, which I divided into a training and a test set with the following code:
sample <- df[sample(nrow(df), 10000),] # take a random sample of 10,000 from dataset df
sample <- sample %>% arrange(Date) # arrange chronologically
train <- sample[1:8000,] # 80% of the df dataset
test <- sample[8001:10000,] # 20% of the df dataset
This is what the training set looks like:
> str(train)
'data.frame': 8000 obs. of 45 variables:
$ Date : Date, format: "2008-01-01" "2008-01-01" "2008-01-02" ...
$ Weekday : chr "Tuesday" "Tuesday" "Wednesday" "Wednesday" ...
$ Season : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Weekend : num 0 0 0 0 0 0 0 0 0 0 ...
$ Icao.type : Factor w/ 306 levels "A124","A225",..: 7 29 112 115 107 10 115 115 115 112 ...
$ Act.description : Factor w/ 389 levels "A300-600F","A330-200F",..: 9 29 161 162 150 13 162 162 162 161 ...
$ Arr.dep : Factor w/ 2 levels "A","D": 2 2 1 1 1 1 1 1 1 1 ...
$ MTOW : num 77 69 46 21 22 238 21 21 21 46 ...
$ Icao.wtc : chr "Medium" "Medium" "Medium" "Medium" ...
$ Wind.direc : int 104 104 82 82 93 93 93 132 132 132 ...
$ Wind.speed.vec : int 35 35 57 57 64 64 64 62 62 62 ...
$ Wind.speed.daily: int 35 35 58 58 65 65 65 63 63 63 ...
$ Wind.speed.max : int 60 60 70 70 80 80 80 90 90 90 ...
$ Wind.speed.min : int 20 20 40 40 50 50 50 50 50 50 ...
$ Wind.gust.max : int 100 100 120 120 130 130 130 140 140 140 ...
$ Temp.daily : int 24 24 -5 -5 4 4 4 34 34 34 ...
$ Temp.min : int -7 -7 -25 -25 -13 -13 -13 11 11 11 ...
$ Temp.max : int 50 50 16 16 13 13 13 55 55 55 ...
$ Temp.10.min : int -11 -11 -32 -32 -18 -18 -18 9 9 9 ...
$ Sun.dur : int 7 7 65 65 19 19 19 0 0 0 ...
$ Sun.dur.prct : int 9 9 83 83 24 24 24 0 0 0 ...
$ Radiation : int 173 173 390 390 213 213 213 108 108 108 ...
$ Precip.dur : int 0 0 0 0 0 0 0 5 5 5 ...
$ Precip.daily : int 0 0 0 0 -1 -1 -1 2 2 2 ...
$ Precip.max : int 0 0 0 0 -1 -1 -1 2 2 2 ...
$ Sea.press.daily : int 10259 10259 10206 10206 10080 10080 10080 10063 10063 10063 ...
$ Sea.press.max : int 10276 10276 10248 10248 10132 10132 10132 10086 10086 10086 ...
$ Sea.press.min : int 10250 10250 10141 10141 10058 10058 10058 10001 10001 10001 ...
$ Visibility.min : int 1 1 40 40 43 43 43 58 58 58 ...
$ Visibility.max : int 59 59 75 75 66 66 66 65 65 65 ...
$ Cloud.daily : int 7 7 3 3 8 8 8 8 8 8 ...
$ Humidity.daily : int 96 96 86 86 77 77 77 82 82 82 ...
$ Humidity.max : int 99 99 92 92 92 92 92 90 90 90 ...
$ Humidity.min : int 91 91 74 74 71 71 71 76 76 76 ...
$ Evapo : int 2 2 4 4 2 2 2 1 1 1 ...
$ Wind.discrete : chr "South East" "South East" "North East" "North East" ...
$ Vmc.imc : chr "Unknown" "Unknown" "Unknown" "Unknown" ...
$ Beaufort : num 3 3 4 4 4 4 4 4 4 4 ...
$ Main.A : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.B : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.K : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.O : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.P : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.Z : num 0 0 0 0 0 0 0 0 0 0 ...
$ Runway : Factor w/ 13 levels "04","06","09",..: 3 8 2 2 2 6 2 6 6 6 ...
Then, I try to tune the SVM parameters with the following code:
library(e1071)
tuned <- tune.svm(Runway ~ ., data = train, gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
While this code has worked in the past, it now gives me the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
The only thing I can think of that has changed is the rows in the dataset train, since running the first code block takes a fresh random sample of 10,000 rows (out of dataset df, which contains 3.5 million rows).
Does anyone know why I am getting this?
I recognise that this question was rather hard to solve without a good reproducible example.
However, I have found the solution to my problem and wanted to post it here for anyone who might be looking for this in the future.
Running the same code, but with selected columns from the train set:
tuned <- tune.svm(Runway ~ ., data = train[,c(1:2, 45)], gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
gave me absolutely no problem. I continued adding more features until the error was reproduced. I found that the features Vmc.imc and Icao.wtc were causing the error and that they were both chr features. Using the following code:
train$Vmc.imc <- as.factor(train$Vmc.imc)
train$Icao.wtc <- as.factor(train$Icao.wtc)
to change them into factors and then rerunning
tuned <- tune.svm(Runway ~ ., data = train, gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
solved my problem.
I do not know why the other chr features such as Weekday and Wind.discrete are not causing the same issue. If anyone knows the answer to this, I would be glad to find out.
This is similar to this thread. I would add that if you neglect to make all your character features factors, you will also receive this error when attempting to run predict.
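As a small convenience, and assuming the same train/test split as above, every character column can be converted in one go; keeping the factor levels consistent between train and test also matters later when calling predict:
chr_cols <- sapply(train, is.character)               # which columns are character
train[chr_cols] <- lapply(train[chr_cols], as.factor)
test[chr_cols] <- lapply(test[chr_cols], as.factor)
tuned <- tune.svm(Runway ~ ., data = train,
                  gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))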