Averaging rows by columns with the same name - r

I have a dataframe that is 130 rows by 1321 columns. Most of the column names are combinations of Month_Year (e.g. 1_89, 3_00, etc.). There are between 2 and 5 columns with each name. I want to average the values in the rows across the columns with the same names. Here is my df structure:
'data.frame': 130 obs. of 1321 variables:
$ StationID: int 15 90 91 27 77 72 43 53 67 127 ...
$ X : num -125 -124 -124 -124 -124 ...
$ Y : num 42.8 40.7 40.7 40.6 40.9 ...
$ 1_89 : num 101 100 100 100 100 ...
$ 1_89 : num 95.8 97.2 97.2 100 99 ...
$ 1_89 : num 137 159 159 175 168 ...
$ 1_89 : num 141 171 171 180 178 ...
$ 1_89 : num 106 112 112 113 111 ...
$ 2_89 : num 140 165 165 171 172 ...
$ 2_89 : num 109 133 133 147 137 ...
$ 2_89 : num 140 179 179 174 173 ...
$ 2_89 : num 126 130 130 118 130 ...
$ 3_89 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 3_89 : num 100 104 104 100 100 ...
$ 3_89 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 3_89 : num 112 173 173 173 168 ...
$ 4_89 : num 125 175 175 176 170 ...
$ 4_89 : num 104 166 166 161 161 ...
$ 4_89 : num 0 0 0 0 0 0 0 0 0 0 ...
I am aware that this is quite an unusual structure for a dataframe, but I would like to convert it to a dataframe that looks like this:
$ StationID: int 15 90 91 27 77 72 43 53 67 127 ...
$ X : num -125 -124 -124 -124 -124 ...
$ Y : num 42.8 40.7 40.7 40.6 40.9 ...
$ 1_89 : num 101 100 100 100 100 ...
$ 2_89 : num 109 133 133 147 137 ...
$ 3_89 : num 100 104 104 100 100 ...
$ 4_89 : num 104 166 166 161 161 ...
but with the average for each Month_Year. Thanks in advance for any help!

You can find the unique column names and then loop through each of them, calculating the average across the matching columns.
Create some data
set.seed(1)
dat <- setNames(data.frame(replicate(10, rnorm(5))),
                paste0("var", rep(1:3, c(3, 2, 5))))
head(dat, 3)
#         var1       var1       var1        var2       var2        var3       var3       var3       var3       var3
# 1 -0.6264538 -0.8204684  1.5117812 -0.04493361 0.91897737 -0.05612874  1.3586796 -0.4149946 -0.1645236 -0.7074952
# 2  0.1836433  0.4874291  0.3898432 -0.01619026 0.78213630 -0.15579551 -0.1027877 -0.3942900 -0.2533617  0.3645820
# 3 -0.8356286  0.7383247 -0.6212406  0.94383621 0.07456498 -1.47075238  0.3876716 -0.0593134  0.6969634  0.7685329
Extract unique names
nms <- unique(names(dat))
Average columns with the same name
sapply(nms, function(x) rowMeans(dat[names(dat) %in% x]))
# var1 var2 var3
#[1,] 0.02161966 0.4370219 0.0031074991
#[2,] 0.35363854 0.3829730 -0.1083305812
#[3,] -0.23951483 0.5092006 0.0646204262
#[4,] -0.01454591 -0.5840653 0.2024774526
#[5,] 0.38301677 0.6068635 -0.0007180433
This may be a bit faster for larger data
t(rowsum(t(dat), names(dat))/c(table(names(dat))))
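To keep the StationID/X/Y columns intact and average only the Month_Year columns, one possible sketch (on toy data mirroring the question's layout, not the real 130 x 1321 frame) is:

```r
# Sketch on a miniature version of the question's layout: three id columns
# (StationID, X, Y) followed by repeated Month_Year columns.
df <- data.frame(1:2, c(-125, -124), c(42.8, 40.7), c(1, 2), c(3, 4), c(5, 6))
names(df) <- c("StationID", "X", "Y", "1_89", "1_89", "2_89")

id_cols    <- df[, 1:3]            # keep the identifier columns as-is
month_cols <- df[, -(1:3)]         # only the Month_Year columns get averaged
nms <- unique(names(month_cols))
averaged <- sapply(nms, function(x) rowMeans(month_cols[names(month_cols) %in% x]))
cbind(id_cols, averaged)           # StationID, X, Y plus one column per unique name
```

`cbind.data.frame` keeps the non-syntactic names like `1_89` intact, so the result can be indexed with `result[["1_89"]]`.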

Related

Isolate Data Frame from Data Frame for Multiple Time Series Plot

I want to extract temperature (temp_c) at specific pressure levels (press_hpa). As I filter my data (dat) using dplyr, I create another data frame with the same number of columns (15) but a different number of observations. There are many solutions for plotting multiple time series from columns, but I can't match them to my case. How can I plot multiple time series showing temperature at different levels (x = date, y = temp_c, legend = Press_1000, Press_925, Press_850, Press_700)? Kindly help. Thank you.
library(ggplot2)
library(dplyr)
library(reshape2)
setwd("C:/Users/Hp/Documents/yr/climatology/")
dat <- read.csv("soundingWMKD.csv", head = TRUE, stringsAsFactors = F)
str(dat)
'data.frame': 6583 obs. of 15 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ pres_hpa : num 1006 1000 993 981 1005 ...
$ hght_m : int 16 70 132 238 16 62 141 213 302 329 ...
$ temp_c : num 24 23.6 23.2 24.6 24.2 24.2 24 23.8 23.3 23.2 ...
$ dwpt_c : num 23.4 22.4 21.5 21.6 23.6 23.1 22.9 22.7 22 21.8 ...
$ relh_pct : int 96 93 90 83 96 94 94 94 92 92 ...
$ mixr_g_kg: num 18.4 17.4 16.6 16.9 18.6 ...
$ drct_deg : int 0 0 NA NA 190 210 212 213 215 215 ...
$ sknt_knot: int 0 0 NA NA 1 3 6 8 11 11 ...
$ thta_k : num 297 297 297 299 297 ...
$ thte_k : num 350 347 345 349 351 ...
$ thtv_k : num 300 300 300 302 300 ...
$ date : chr "2017-11-02" "2017-11-02" "2017-11-02" "2017-11-02" ...
$ from_hr : int 0 0 0 0 0 0 0 0 0 0 ...
$ to_hr : int 0 0 0 0 0 0 0 0 0 0 ...
Press_1000 <- filter(dat,dat$pres_hpa == 1000)
Press_925 <- filter(dat,dat$pres_hpa == 925)
Press_850 <- filter(dat,dat$pres_hpa == 850)
Press_700 <- filter(dat,dat$pres_hpa == 700)
date <- as.Date(dat$date, "%m-%d-%y")
str(Press_1000)
'data.frame': 80 obs. of 15 variables:
$ X : int 2 6 90 179 267 357 444 531 585 675 ...
$ pres_hpa : num 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
$ hght_m : int 70 62 63 63 62 73 84 71 74 78 ...
$ temp_c : num 23.6 24.2 24.4 24.2 25.4 24 23.8 24 23.8 24 ...
$ dwpt_c : num 22.4 23.1 23.2 22.3 23.9 23.1 23.4 23 23 23.1 ...
$ relh_pct : int 93 94 93 89 91 95 98 94 95 95 ...
$ mixr_g_kg: num 17.4 18.2 18.3 17.3 19.1 ...
$ drct_deg : int 0 210 240 210 210 340 205 290 315 0 ...
$ sknt_knot: int 0 3 2 3 3 2 4 1 1 0 ...
$ thta_k : num 297 297 298 297 299 ...
$ thte_k : num 347 350 351 348 354 ...
$ thtv_k : num 300 301 301 300 302 ...
$ date : chr "2017-11-02" "2017-11-03" "2017-11-04" "2017-11-05" ...
$ from_hr : int 0 0 0 0 0 0 0 0 0 0 ...
$ to_hr : int 0 0 0 0 0 0 0 0 0 0 ...
str(Press_925)
'data.frame': 79 obs. of 15 variables:
$ X : int 13 96 187 272 365 450 537 593 681 769 ...
$ pres_hpa : num 925 925 925 925 925 925 925 925 925 925 ...
$ hght_m : int 745 747 746 748 757 764 757 758 763 781 ...
$ temp_c : num 21.8 22 22.4 23.2 22.2 20.6 22.4 22 22.4 22.2 ...
$ ... 'truncated'
all_series = rbind(date,Press_1000,Press_925,Press_850,Press_700)
meltdf <- melt(all_series,id.vars ="date")
ggplot(meltdf,aes(x=date,y=value,colour=variable,group=variable)) +
geom_line()
There are two ways of approaching this. Which one you go for may depend on the underlying question (which we don't know).
1) For each data.frame, you have all the necessary columns and you can plot each source (data.frame) using e.g.
ggplot()... +
geom_line(data = Press_1000, aes(...)) +
geom_line(data = Press_925, aes(...)) ...
Note that you will have to specify the colour for each source manually, and getting a legend that way is a pain.
2) Combine all data.frames into one big object and create an additional column indicating the origin of the data (which data.frame the observation comes from). This becomes your mapping variable (e.g. colour, fill, group) in your current aes call. Instant legend.
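A sketch of approach 2, using toy stand-ins for two of the question's `Press_*` frames (with the real data, skip the stand-ins and stack `Press_1000` through `Press_700` the same way):

```r
library(ggplot2)

# Toy stand-ins for the filtered data.frames from the question:
Press_1000 <- data.frame(date = c("2017-11-02", "2017-11-03"), temp_c = c(23.6, 24.2))
Press_925  <- data.frame(date = c("2017-11-02", "2017-11-03"), temp_c = c(21.8, 22.0))

Press_1000$level <- "1000 hPa"     # column recording which frame a row came from
Press_925$level  <- "925 hPa"
all_series <- rbind(Press_1000, Press_925)
all_series$date <- as.Date(all_series$date)   # dates are "YYYY-MM-DD"

p <- ggplot(all_series, aes(x = date, y = temp_c, colour = level)) +
  geom_line()
```

Note the dates parse with `as.Date`'s default format; the question's `as.Date(dat$date, "%m-%d-%y")` would return NA for "2017-11-02"-style strings.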

"not meaningful for factors" error when storing new value in data frame: R

I am trying to update a value in a data frame but am getting what seems to me a weird error about an operation that I don't think I am using.
Here's a summary of the data:
> str(us.cty2015#data)
'data.frame': 3108 obs. of 15 variables:
$ STATEFP : Factor w/ 52 levels "01","02","04",..: 17 25 33 46 4 14 16 24 36 42 ...
$ COUNTYFP : Factor w/ 325 levels "001","003","005",..: 112 91 67 9 43 81 7 103 72 49 ...
$ COUNTYNS : Factor w/ 3220 levels "00023901","00025441",..: 867 1253 1600 2465 38 577 690 1179 1821 2104 ...
$ AFFGEOID : Factor w/ 3220 levels "0500000US01001",..: 976 1472 1879 2813 144 657 795 1395 2098 2398 ...
$ GEOID : Factor w/ 3220 levels "01001","01003",..: 976 1472 1879 2813 144 657 795 1395 2098 2398 ...
$ NAME : Factor w/ 1910 levels "Abbeville","Acadia",..: 1558 1703 1621 688 856 1075 148 1807 1132 868 ...
$ LSAD : Factor w/ 9 levels "00","03","04",..: 5 5 5 5 5 5 5 5 5 5 ...
$ ALAND : num 1.66e+09 1.10e+09 3.60e+09 2.12e+08 1.50e+09 ...
$ AWATER : num 2.78e+06 5.24e+07 3.50e+07 2.92e+08 8.91e+06 ...
$ t_pop : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_wht : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_free_blk: num 0 0 0 0 0 0 0 0 0 0 ...
$ n_slv : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_blk : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_free : num 0 0 0 0 0 0 0 0 0 0 ...
> str(us.cty1860#data)
'data.frame': 2126 obs. of 29 variables:
$ DECADE : Factor w/ 1 level "1860": 1 1 1 1 1 1 1 1 1 1 ...
$ NHGISNAM : Factor w/ 1236 levels "Abbeville","Accomack",..: 1142 1218 1130 441 812 548 1144 56 50 887 ...
$ NHGISST : Factor w/ 41 levels "010","050","060",..: 32 13 9 36 16 36 16 30 23 39 ...
$ NHGISCTY : Factor w/ 320 levels "0000","0010",..: 142 206 251 187 85 231 131 12 6 161 ...
$ ICPSRST : Factor w/ 37 levels "1","11","12",..: 5 13 21 26 22 26 22 10 15 17 ...
$ ICPSRCTY : Factor w/ 273 levels "10","1010","1015",..: 25 93 146 72 247 122 12 10 228 45 ...
$ ICPSRNAM : Factor w/ 1200 levels "ABBEVILLE","ACCOMACK",..: 1108 1184 1097 432 791 535 1110 55 49 860 ...
$ STATENAM : Factor w/ 41 levels "Alabama","Arkansas",..: 32 13 9 36 16 36 16 30 23 39 ...
$ ICPSRSTI : int 14 31 44 49 45 49 45 24 34 40 ...
$ ICPSRCTYI : int 1210 1970 2910 1810 710 2450 1130 110 50 1450 ...
$ ICPSRFIP : num 0 0 0 0 0 0 0 0 0 0 ...
$ STATE : Factor w/ 41 levels "010","050","060",..: 32 13 9 36 16 36 16 30 23 39 ...
$ COUNTY : Factor w/ 320 levels "0000","0010",..: 142 206 251 187 85 231 131 12 6 161 ...
$ PID : num 1538 735 306 1698 335 ...
$ X_CENTROID : num 1348469 184343 1086494 -62424 585888 ...
$ Y_CENTROID : num 556680 588278 -229809 -433290 -816852 ...
$ GISJOIN : Factor w/ 2126 levels "G0100010","G0100030",..: 1585 627 319 1769 805 1788 823 1425 1079 2006 ...
$ GISJOIN2 : Factor w/ 2126 levels "0100010","0100030",..: 1585 627 319 1769 805 1788 823 1425 1079 2006 ...
$ SHAPE_AREA : num 2.35e+09 1.51e+09 8.52e+08 2.54e+09 6.26e+08 ...
$ SHAPE_LEN : num 235777 155261 166065 242608 260615 ...
$ t_pop : int 25043 653 4413 8184 174491 1995 4324 17187 4649 8392 ...
$ n_wht : int 24974 653 4295 6892 149063 1684 3001 17123 4578 2580 ...
$ n_free_blk : int 69 0 2 0 10939 2 7 64 12 409 ...
$ n_slv : int 0 0 116 1292 14484 309 1316 0 59 5403 ...
$ n_blk : int 69 0 118 1292 25423 311 1323 64 71 5812 ...
$ n_free : num 25043 653 4297 6892 160007 ...
$ frac_free : num 1 1 0.974 0.842 0.917 ...
$ frac_free_blk: num 1 NA 0.0169 0 0.4303 ...
$ frac_slv : num 0 0 0.0263 0.1579 0.083 ...
> str(overlap)
'data.frame': 15266 obs. of 7 variables:
$ cty2015 : Factor w/ 3108 levels "0","1","10","100",..: 1 1 2 2 2 2 2 1082 1082 1082 ...
$ cty1860 : Factor w/ 2126 levels "0","1","10","100",..: 1047 1012 1296 1963 2033 2058 2065 736 1413 1569 ...
$ area_inter : num 1.66e+09 2.32e+05 9.81e+04 1.07e+09 7.67e+07 ...
$ area1860 : num 1.64e+11 1.81e+11 1.54e+09 2.91e+09 2.32e+09 ...
$ frac_1860 : num 1.01e-02 1.28e-06 6.35e-05 3.67e-01 3.30e-02 ...
$ sum_frac_1860 : num 1 1 1 1 1 ...
$ scaled_frac_1860: num 1.01e-02 1.28e-06 6.35e-05 3.67e-01 3.30e-02 ...
I am trying to multiply a vector of variables vars <- c("t_pop", "n_wht", "n_free_blk", "n_slv", "n_blk", "n_free") in the us.cty1860#data data frame by a scalar overlap$scaled_frac_1860[i], then add it to the same vector of variables in the us.cty2015#data data frame, and finally overwrite the variables in the us.cty2015#data data frame.
When I make the following call, I get an error that seems to say I am trying to perform invalid operations on factors (which is not the case, as you can confirm from the str output).
> us.cty2015#data[overlap$cty2015[1], vars] <- us.cty2015#data[overlap$cty2015[1], vars] + (overlap$scaled_frac_1860[1] * us.cty1860#data[overlap$cty1860[1], vars])
Error in Summary.factor(1L, na.rm = FALSE) :
‘max’ not meaningful for factors
In addition: Warning message:
In Ops.factor(i, 0L) : ‘>=’ not meaningful for factors
However, when I don't attempt to overwrite the old value, the operation works fine.
> us.cty2015#data[overlap$cty2015[1], vars] + (overlap$scaled_frac_1860[1] * us.cty1860#data[overlap$cty1860[1], vars])
t_pop n_wht n_free_blk n_slv n_blk n_free
0 118.3889 113.6468 0.1317233 4.610316 4.742039 113.7785
I'm sure there are better ways of accomplishing what I am trying to do but does anyone have any idea what is going on?
Edit:
I am using the following libraries: rgdal, rgeos, and maptools.
All the data/objects come from the NHGIS shapefiles for 1860 and 2015 United States counties.
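One likely cause, offered as an assumption based on the str output: `overlap$cty2015` and `overlap$cty1860` are factors, and using a factor as a row index inside an assignment makes `[<-.data.frame` run checks like `max(i)` and `i >= 0L` on the factor itself, which produces exactly these "not meaningful for factors" messages, while plain extraction with `[` tolerates it. Converting the index explicitly sidesteps the problem; a minimal sketch with a plain data.frame standing in for `us.cty2015@data`:

```r
# Hypothetical stand-ins for us.cty2015@data and overlap$cty2015:
df  <- data.frame(t_pop = c(10, 20, 30), row.names = c("0", "1", "10"))
idx <- factor(c("10", "0"))

i <- as.character(idx[1])          # row-name lookup: "10"
df[i, "t_pop"] <- df[i, "t_pop"] + 5
df["10", "t_pop"]                  # 35
# Use as.integer(as.character(idx[1])) instead if the factor encodes positions.
```

Whether `as.character` (row-name lookup) or `as.integer(as.character(...))` (positional) is right depends on how `overlap` was built.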

How to deal with " rank-deficient fit may be misleading" in R?

I'm trying to predict the values of a test data set based on a train data set. It predicts values without errors, but the predictions deviate a lot from the original values, even going as low as -356 although none of the original values exceeds 200 (and there are no negative values). The warning is bugging me, as I think the predictions deviate so much because of it.
Warning message:
In predict.lm(fit2, data_test) :
prediction from a rank-deficient fit may be misleading
Any way I can get rid of this warning? The code is simple:
fit2 <- lm(runs~., data=train_data)
prediction<-predict(fit2, data_test)
prediction
I searched a lot, but to be honest I couldn't understand much about this warning.
str of test and train data set in case someone needs them
> str(train_data)
'data.frame': 36 obs. of 28 variables:
$ matchid : int 57 58 55 56 53 54 51 52 45 46 ...
$ TeamName : chr "South Africa" "West Indies" "South Africa" "West Indies" ...
$ Opp_TeamName : chr "West Indies" "South Africa" "West Indies" "South Africa" ...
$ TeamRank : int 4 3 4 3 4 3 10 7 5 1 ...
$ Opp_TeamRank : int 3 4 3 4 3 4 7 10 1 5 ...
$ Team_Top10RankingBatsman : int 0 1 0 1 0 1 0 0 2 2 ...
$ Team_Top50RankingBatsman : int 4 6 4 6 4 6 3 5 4 3 ...
$ Team_Top100RankingBatsman: int 6 8 6 8 6 8 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 0 1 0 1 0 0 0 2 2 ...
$ Opp_Top50RankingBatsman : int 6 4 6 4 6 4 5 3 3 4 ...
$ Opp_Top100RankingBatsman : int 8 6 8 6 8 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "1st innings" "2nd innings" ...
$ Runs_OverAll : num 361 705 348 630 347 ...
$ AVG_Overall : num 27.2 20 23.3 19.1 24 ...
$ SR_Overall : num 128 121 120 118 118 ...
$ Runs_Last10Matches : num 118.5 71 102.1 71 78.6 ...
$ AVG_Last10Matches : num 23.7 20.4 20.9 20.4 23.2 ...
$ SR_Last10Matches : num 120 106 114 106 116 ...
$ Runs_BatingFirst : num 236 459 230 394 203 ...
$ AVG_BatingFirst : num 30.6 23.2 24 21.2 27.1 ...
$ SR_BatingFirst : num 127 136 123 125 118 ...
$ Runs_BatingSecond : num 124 262 119 232 144 ...
$ AVG_BatingSecond : num 25.5 18.3 22.8 17.8 22.8 ...
$ SR_BatingSecond : num 125 118 112 117 114 ...
$ Runs_AgainstTeam2 : num 88.3 118.3 76.3 103.9 49.3 ...
$ AVG_AgainstTeam2 : num 28.2 23 24.7 22.1 16.4 ...
$ SR_AgainstTeam2 : num 139 127 131 128 111 ...
$ runs : int 165 168 231 236 195 126 143 141 191 135 ...
> str(data_test)
'data.frame': 34 obs. of 28 variables:
$ matchid : int 59 60 61 62 63 64 65 66 69 70 ...
$ TeamName : chr "India" "West Indies" "England" "New Zealand" ...
$ Opp_TeamName : chr "West Indies" "India" "New Zealand" "England" ...
$ TeamRank : int 2 3 5 1 4 8 6 2 10 1 ...
$ Opp_TeamRank : int 3 2 1 5 8 4 2 6 1 10 ...
$ Team_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 0 2 ...
$ Team_Top50RankingBatsman : int 5 6 4 3 4 2 5 5 3 3 ...
$ Team_Top100RankingBatsman: int 7 8 7 6 6 5 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 2 0 ...
$ Opp_Top50RankingBatsman : int 6 5 3 4 2 4 5 5 3 3 ...
$ Opp_Top100RankingBatsman : int 8 7 6 7 5 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "2nd innings" "1st innings" ...
$ Runs_OverAll : num 582 618 470 602 509 ...
$ AVG_Overall : num 25 21.8 20.3 20.7 19.6 ...
$ SR_Overall : num 113 120 123 120 112 ...
$ Runs_Last10Matches : num 182 107 117 167 140 ...
$ AVG_Last10Matches : num 37.1 43.8 21 24.9 27.3 ...
$ SR_Last10Matches : num 111 153 122 141 120 ...
$ Runs_BatingFirst : num 319 314 271 345 294 ...
$ AVG_BatingFirst : num 23.6 17.8 20.6 20.3 19.5 ...
$ SR_BatingFirst : num 116.9 98.5 118 124.3 115.8 ...
$ Runs_BatingSecond : num 264 282 304 256 186 ...
$ AVG_BatingSecond : num 28 23.7 31.9 21.6 16.5 ...
$ SR_BatingSecond : num 96.5 133.9 129.4 112 99.5 ...
$ Runs_AgainstTeam2 : num 98.2 95.2 106.9 75.4 88.5 ...
$ AVG_AgainstTeam2 : num 45.3 42.7 38.1 17.7 27.1 ...
$ SR_AgainstTeam2 : num 125 138 152 110 122 ...
$ runs : int 192 196 159 153 122 120 160 161 70 145 ...
In simple words, how can I get rid of this warning so that it doesn't affect my predictions? For reference, here are the coefficients from the fit (note the NAs):
(Intercept) matchid TeamNameBangladesh
1699.98232628 -0.06793787 59.29445330
TeamNameEngland TeamNameIndia TeamNameNew Zealand
347.33030177 -499.40074338 -179.19192936
TeamNamePakistan TeamNameSouth Africa TeamNameSri Lanka
-272.71610614 -3.54867488 -45.27920191
TeamNameWest Indies Opp_TeamNameBangladesh Opp_TeamNameEngland
-345.54349798 135.05901017 108.04227770
Opp_TeamNameIndia Opp_TeamNameNew Zealand Opp_TeamNamePakistan
-162.24418387 -60.55364436 -114.74599364
Opp_TeamNameSouth Africa Opp_TeamNameSri Lanka Opp_TeamNameWest Indies
196.90856999 150.70170068 -6.88997714
TeamRank Opp_TeamRank Team_Top10RankingBatsman
NA NA NA
Team_Top50RankingBatsman Team_Top100RankingBatsman Opp_Top10RankingBatsman
NA NA NA
Opp_Top50RankingBatsman Opp_Top100RankingBatsman InningType2nd innings
NA NA 24.24029455
Runs_OverAll AVG_Overall SR_Overall
-0.59935875 20.12721378 -13.60151334
Runs_Last10Matches AVG_Last10Matches SR_Last10Matches
-1.92526750 9.24182916 1.23914363
Runs_BatingFirst AVG_BatingFirst SR_BatingFirst
1.41001672 -9.88582744 -6.69780509
Runs_BatingSecond AVG_BatingSecond SR_BatingSecond
-0.90038727 -7.11580086 3.20915976
Runs_AgainstTeam2 AVG_AgainstTeam2 SR_AgainstTeam2
3.35936312 -5.90267210 2.36899131
You can have a look at this detailed discussion:
predict.lm() in a loop. warning: prediction from a rank-deficient fit may be misleading
In general, multicollinearity leads to a rank-deficient design matrix: some predictors are linear combinations of others, so lm cannot estimate all coefficients (those terms come back NA, as with TeamRank and the ranking-batsman columns above).
You can try applying PCA to tackle the multicollinearity and then fit the regression on the principal components.
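A quick way to see which terms cause the rank deficiency is to look for NA coefficients and refit without them; a self-contained sketch on synthetic data:

```r
# Reproduce rank deficiency: x3 is an exact linear combination of x1 and x2.
set.seed(1)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
d$x3 <- d$x1 + d$x2                  # perfectly collinear column
d$y  <- d$x1 + rnorm(20)

fit <- lm(y ~ ., data = d)
names(coef(fit))[is.na(coef(fit))]   # "x3" -- the aliased (dropped) term

fit_ok <- lm(y ~ x1 + x2, data = d)  # full-rank refit: no NA coefficients,
                                     # and predict() no longer warns
```

With the question's data, the same pattern (drop every term whose coefficient is NA, or remove one of each pair of redundant rank columns) should make the warning go away.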

Error in evalSummaryFunction Caret R

I have this data set
'data.frame': 212300 obs. of 19 variables:
$ FL_DATE_MDD_MMDD : int 101 101 101 101 101 101 101 101 101 101 ...
$ FL_DATE : int 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 ...
$ UNIQUE_CARRIER : Factor w/ 13 levels "9E","AA","AS",..: 11 10 2 5 8 9 11 10 10 10 ...
$ DEST : Factor w/ 150 levels "ABE","ABQ","ALB",..: 111 70 82 8 8 31 110 44 53 80 ...
$ DEST_CITY_NAME : Factor w/ 148 levels "Akron, OH","Albany, NY",..: 107 61 96 9 9 29 106 36 97 78 ...
$ ROUNDED_TIME : int 451 451 551 551 551 551 551 551 551 551 ...
$ CRS_DEP_TIME : int 500 520 600 600 600 600 600 600 602 607 ...
$ DEP_DEL15 : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 2 1 1 ...
$ CRS_ARR_TIME : int 746 813 905 903 855 815 901 744 901 841 ...
$ Conditions : Factor w/ 28 levels "Blowing Snow",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Dew.PointC : num -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 ...
$ Events : Factor w/ 10 levels "","Fog","Fog-Rain",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Humidity : int 68 68 71 71 71 71 71 71 71 71 ...
$ Sea.Level.PressurehPa: num 1021 1021 1022 1022 1022 ...
$ TemperatureC : num -9.4 -9.4 -10 -10 -10 -10 -10 -10 -10 -10 ...
$ VisibilityKm : num 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 ...
$ Wind.Direction : Factor w/ 18 levels "Calm","East",..: 9 9 7 7 7 7 7 7 7 7 ...
$ WindDirDegrees : int 320 320 330 330 330 330 330 330 330 330 ...
$ Wind.SpeedKm.h : num 20.4 20.4 13 13 13 13 13 13 13 13 ...
- attr(*, "na.action")=Class 'omit' Named int [1:22539] 3 32 45 87 94 325 472 548 949 1333 ...
.. ..- attr(*, "names")= chr [1:22539] "3" "32" "45" "87" ...
and when I execute the following in Caret
plsFit3x10cv <- train(DEP_DEL15 ~ ., data = training3, method = "pls", trControl = ctrl, metric = "ROC", preProc = c("center", "scale"))
I get the error:
Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, :
train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
The answer to your question is in the error message. It says train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl(). So, you need to use classProbs = TRUE in trainControl(), and of course, set summaryFunction = twoClassSummary (if you have not already done so).
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

plsFit3x10cv <- train(DEP_DEL15 ~ .,
                      data = training3,
                      method = "pls",
                      preProc = c("center", "scale"),
                      metric = "ROC",
                      trControl = ctrl)

Caret error: "all the Accuracy metric values are missing"

I'm getting the following error and I don't know what may have gone wrong.
I'm using R Studio with the 3.1.3 version of R for Windows 8.1 and using the Caret package for datamining.
I have the following training data:
str(training)
'data.frame': 212300 obs. of 21 variables:
$ FL_DATE_MDD_MMDD : int 101 101 101 101 101 101 101 101 101 101 ...
$ FL_DATE : int 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 ...
$ UNIQUE_CARRIER : Factor w/ 13 levels "9E","AA","AS",..: 11 10 2 5 8 9 11 10 10 10 ...
$ DEST : Factor w/ 150 levels "ABE","ABQ","ALB",..: 111 70 82 8 8 31 110 44 53 80 ...
$ DEST_CITY_NAME : Factor w/ 148 levels "Akron, OH","Albany, NY",..: 107 61 96 9 9 29 106 36 97 78 ...
$ ROUNDED_TIME : int 451 451 551 551 551 551 551 551 551 551 ...
$ CRS_DEP_TIME : int 500 520 600 600 600 600 600 600 602 607 ...
$ DEP_DEL15 : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 2 1 1 ...
$ CRS_ARR_TIME : int 746 813 905 903 855 815 901 744 901 841 ...
$ Conditions : Factor w/ 28 levels "Blowing Snow",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Dew.PointC : num -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 -14.4 ...
$ Events : Factor w/ 10 levels "","Fog","Fog-Rain",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Gust.SpeedKm.h : num NA NA NA NA NA NA NA NA NA NA ...
$ Humidity : int 68 68 71 71 71 71 71 71 71 71 ...
$ Precipitationmm : num NA NA NA NA NA NA NA NA NA NA ...
$ Sea.Level.PressurehPa: num 1021 1021 1022 1022 1022 ...
$ TemperatureC : num -9.4 -9.4 -10 -10 -10 -10 -10 -10 -10 -10 ...
$ VisibilityKm : num 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 ...
$ Wind.Direction : Factor w/ 18 levels "Calm","East",..: 9 9 7 7 7 7 7 7 7 7 ...
$ WindDirDegrees : int 320 320 330 330 330 330 330 330 330 330 ...
$ Wind.SpeedKm.h : num 20.4 20.4 13 13 13 13 13 13 13 13 ...
- attr(*, "na.action")=Class 'omit' Named int [1:22539] 3 32 45 87 94 325 472 548 949 1333 ...
.. ..- attr(*, "names")= chr [1:22539] "3" "32" "45" "87" ...
and when I execute the following command:
ldaModel <- train(DEP_DEL15 ~ ., data = training, method = "lda", preProc = c("center", "scale"), na.remove = TRUE)
I get:
Something is wrong; all the Accuracy metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :1 NA's :1
Error in train.default(x, y, weights = w, ...) : Stopping
It is probably due to having an outcome factor with levels "0" and "1".
There is a specific warning issued when this happens: "At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1".
It seems that people uniformly ignore warnings so I'm going to make this throw an error in the next version.
If the variables Gust.SpeedKm.h and Precipitationmm contain only NAs, try omitting them from your data before running the model. If they contain partial NAs and you think they could have predictive value as features, then use imputation. Follow this documentation for pre-processing in caret, including imputation.
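For the level-names issue, a minimal sketch of the usual fix (shown on a hypothetical toy factor; apply the same to training$DEP_DEL15 before calling train):

```r
# Rename outcome levels so they are valid R variable names:
outcome <- factor(c("0", "1", "0", "1"))
levels(outcome) <- make.names(levels(outcome))   # "0","1" -> "X0","X1"
levels(outcome)
```

After this, classProbs = TRUE in trainControl() works without the level-name warning.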
