Meaning of "#" operator in R language? - r

I came across the following code and haven't figured out the purpose of the "@" operator. What does it mean here? I couldn't make heads or tails of the R manual's description.
library(lattice)
library(sp)
data(meuse)
coordinates(meuse) <- ~x+y
proj4string(meuse) <- CRS("+init=epsg:28992")
p <- xyplot(copper ~ cadmium, data = meuse@data, col = "grey", pch = 20, cex = 2)
The R manual says:
Usage
object@name
object@name <- value
Extract or replace the contents of a slot in an object with a formal (S4) class structure.
These operators support the formal classes of package methods, and are enabled only when package methods is loaded (as per default). See slot for further details, in particular for the differences between slot() and the @ operator.
It is checked that object is an S4 object (see isS4), and it is an error to attempt to use @ on any other object. (There is an exception for name .Data for internal use only.) The replacement operator checks that the slot already exists on the object (which it should if the object is really from the class it claims to be).
I checked the structure of "meuse" and found no references to a slot named "data".

meuse is an S4 object
isS4(meuse)
[1] TRUE
If you look at the structure of meuse with str(meuse), you'll see some fields are denoted with the @ operator, including one called data. These slots can be accessed with @, much like components of other objects are accessed with the $ operator. So meuse@data gives you the data portion of the meuse object.
str(meuse)
Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
..@ data :'data.frame': 155 obs. of 12 variables:
.. ..$ cadmium: num [1:155] 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
.. ..$ copper : num [1:155] 85 81 68 81 48 61 31 29 37 24 ...
.. ..$ lead : num [1:155] 299 277 199 116 117 137 132 150 133 80 ...
.. ..$ zinc : num [1:155] 1022 1141 640 257 269 ...
.. ..$ elev : num [1:155] 7.91 6.98 7.8 7.66 7.48 ...
.. ..$ dist : num [1:155] 0.00136 0.01222 0.10303 0.19009 0.27709 ...
.. ..$ om : num [1:155] 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
.. ..$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
.. ..$ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
.. ..$ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
.. ..$ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
.. ..$ dist.m : num [1:155] 50 30 150 270 380 470 240 120 240 420 ...
..@ coords.nrs : int [1:2] 1 2
..@ coords : num [1:155, 1:2] 181072 181025 181165 181298 18130
See how that slot access works?
str(meuse@data)
'data.frame': 155 obs. of 12 variables:
$ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
$ copper : num 85 81 68 81 48 61 31 29 37 24 ...
$ lead : num 299 277 199 116 117 137 132 150 133 80 ...
$ zinc : num 1022 1141 640 257 269 ...
$ elev : num 7.91 6.98 7.8 7.66 7.48 ...
$ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
$ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
$ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
$ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
$ dist.m : num 50 30 150 270 380 470 240 120 240 420 ...
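For comparison, the same slot can also be reached through the helpers mentioned in the manual page; a quick sketch using the meuse object created above:
# Equivalent ways to reach the same slot (methods package, loaded by default)
slotNames(meuse)                              # "data" "coords.nrs" "coords" ...
identical(meuse@data, slot(meuse, "data"))    # TRUE: slot() and @ are interchangeable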

Related

Classify factor output with factors with >60 levels and numeric inputs

I'm a newbie, working on a classification to study the causes of coral diseases. The dataset contains 45 variables.
The output variable is a factor with 21 levels (21 diseases), and the inputs are numeric and factor variables. Some of those factors have as many as 94 levels, e.g. "species of coral", and I can't collapse or split them because I want to stay as precise as possible: maybe one species is less resistant than another. The numeric variables are things like population in the area, fishing trips, etc.
First problem: I tried genetic algorithms to select the most important variables, random forests, etc., but the runs get aborted, so the variables I eliminated were chosen only from correlograms. I want something stronger to decide which variables to select.
Second problem: I've tried everything I know and made tons of Google searches to find something that runs and produces a classification, but nothing works. I tried SVM, random forests, CART, GBM, bagging and boosting, and none of them can cope with this dataset.
This is the structure of the dataset
'data.frame': 136510 obs. of 45 variables:
$ SITE : Factor w/ 144 levels "TUT-1511","TUT-1513",..: 56 15 55 21 12 12 17 53 48 82 ...
$ Zone_Fine : Factor w/ 17 levels "Aunuu_E","Aunuu_W",..: 11 9 10 9 9 9 9 8 10 10 ...
$ TRANSECT : num 1 1 1 1 1 1 1 1 1 1 ...
$ SEGMENT : num 5 1 1 1 7 5 7 5 3 7 ...
$ Seg_WIDTH : num 1 1 1 1 1 1 1 1 1 1 ...
$ Seg_LENGTH : num 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
$ SPECIES : Factor w/ 156 levels "AAAA","AABR",..: 94 126 94 102 9 126 135 94 93 94 ...
$ COLONYLENGTH : num 11 45 10 5 12 10 8 30 20 14 ...
$ OLDDEAD : num 5 2 5 0 0 5 10 0 5 10 ...
$ RECENTDEAD : num 0 10 0 0 0 0 0 0 0 0 ...
$ DZCLASS : Factor w/ 21 levels "Acute Tissue Loss - White Syndrome",..: 14 14 14 14 14 14 14 14 14 14 ...
$ EXTENT : num 52.9 52.9 52.9 52.9 52.9 ...
$ SEVERITY : num 3.11 3.11 3.11 3.11 3.11 ...
$ TAXONNAME.x : Factor w/ 155 levels "Acanthastrea hemprichii",..: 95 132 95 107 7 132 133 95 89 95 ...
$ PHYLUM : Factor w/ 2 levels "Cnidaria","Rhodophyta": 1 1 1 1 1 1 1 1 1 1 ...
$ CLASS : Factor w/ 3 levels "Anthozoa","Florideophyceae",..: 1 1 1 1 1 1 1 1 1 1 ...
$ FAMILY : Factor w/ 20 levels "Acroporidae",..: 1 18 1 2 1 18 18 1 8 1 ...
$ GENUS : Factor w/ 55 levels "Acanthastrea",..: 35 44 35 39 2 44 44 35 34 35 ...
$ RANK : Factor w/ 2 levels "Genus","Species": 1 1 1 1 2 1 2 1 1 1 ...
$ DATE_ : Date, format: "0015-03-27" ...
$ OBS_YEAR : num 2015 2015 2015 2015 2015 ...
$ REEF_ZONE : Factor w/ 2 levels "Backreef","Forereef": 2 2 2 2 2 2 2 2 2 2 ...
$ DEPTH_BIN : Factor w/ 4 levels "Bank","Deep",..: 2 2 4 3 2 2 3 4 3 3 ...
$ LBSP : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ Zone_Fine_ReefZone_Depth: Factor w/ 41 levels "Aunuu_E_Deep",..: 30 24 29 25 24 24 25 23 28 28 ...
$ Area_km2.x : num 50.9 49.1 101.8 49.1 49.1 ...
$ Fishing.trips.per.km2 : num 719 1148 1431 1148 1148 ...
$ Area_km2.y : num 50.9 49.1 50.9 49.1 49.1 ...
$ Pop.km2 : num 167.5 49.1 561.9 49.1 49.1 ...
$ SHED_NAME : Factor w/ 35 levels "Aasu","Afao - Asili",..: 2 9 15 17 17 1 1 35 28 26 ...
$ Shed_Cond : Factor w/ 4 levels "Extensive","Intermediate",..: 3 4 2 4 4 3 3 3 1 2 ...
$ Shed_Area_Calc : num 30202 29422 458542 126361 32595 ...
$ Perc_Area : num 0.00128 0.00107 0.00993 0.00458 0.00118 ...
$ Cond_Scale : num 3 4 2 4 4 3 3 3 1 2 ...
$ Shoreline_m : num 23146 33046 45821 33046 33046 ...
$ Rank : num 5 9 3 9 9 9 9 6 3 3 ...
$ Comp.8 : num 0.826 0.814 0.838 0.814 0.814 ...
$ Ble : num 0.958 0.969 0.959 0.969 0.969 ...
$ DZ : num 0.647 0.837 0.732 0.837 0.837 ...
$ Herb : num 0.682 0.564 0.704 0.564 0.564 ...
$ Rec : num 0.375 0.477 0.467 0.477 0.477 ...
$ MA : num 0.965 0.975 0.907 0.975 0.975 ...
$ Dam : num 0.998 1 0.992 1 1 ...
$ TAXONNAME.y : Factor w/ 94 levels "Abudefduf sordidus",..: 94 94 94 94 94 94 94 94 94 94 ...
$ Dummy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I expected a classification of "DZCLASS".
Thanks, every recommendation is welcome!
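One thing worth checking: the randomForest package refuses categorical predictors with more than 53 levels, which could explain the aborted runs. Below is a hedged sketch using the ranger package instead; coral_data is a placeholder name and the parameters are only illustrative, not tested on the actual dataset.
# Hypothetical sketch: ranger copes with high-cardinality factors,
# unlike randomForest's 53-level limit on categorical predictors.
library(ranger)
fit <- ranger(
  DZCLASS ~ ., data = coral_data,        # 'coral_data' stands in for the real data.frame
  num.trees = 500,
  importance = "impurity",               # per-variable importance scores
  respect.unordered.factors = "order"    # cheap ordering instead of exhaustive partitioning
)
head(sort(fit$variable.importance, decreasing = TRUE), 10)  # strongest predictors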

R Dataframe issue preventing normality test

I've read my .CSV and then converted the file to a data frame using several methods, including:
df <- read.csv('cdSH2015Fall.csv', dec = ".", na.strings = c("na"), header = TRUE,
               row.names = NULL, stringsAsFactors = F)
df <- as.data.frame(lapply(df, unlist))  # converted the .csv to a data.frame
str(df) # provides the structure of df.
'data.frame': 72 obs. of 16 variables:
$ trtGroup : Factor w/ 68 levels "AANN","AAPN",..: 5 7 14 18 20 23 27 33 37 48 ...
$ cd : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ PreviousExp : Factor w/ 2 levels "Empty","Enriched": 2 1 2 2 2 2 1 1 1 1 ...
$ treatment : Factor w/ 2 levels "NN","PN": 1 1 1 1 1 1 1 1 1 1 ...
$ total.Area.DarkBlue.: num 827 1037 663 389 983 ...
$ numberOfGroups : int 1 1 1 1 1 1 1 1 1 1 ...
$ totalGroupArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ averageGrpArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ proximityToPlants : num 5.65 16.05 2.58 9.65 4.74 ...
$ latFeed : num 2 0.5 0 1 0 0 1 0.5 2 1 ...
$ latBalloon : num 6 2 2 NA 0 0.1 3 0.5 1 0.7 ...
$ countChases : int 5 8 16 4 16 21 18 11 14 28 ...
$ chases : int 95 87 67 923 636 96 1210 571 775 816 ...
$ grpDiameter : num 16.8 23.3 19.5 11.2 29.9 ...
$ grpActiv : num 4908 5164 4197 5263 5377 ...
$ NND : num 0 11.88 8.98 3.6 9.8 ...
I then run my model two ways:
First option.
fit = t.test(df$proximityToPlants[which(df$cd == 1 & df$treatment == 'PN')],
             df$proximityToPlants[which(df$cd == 0 & df$treatment == 'PN')])
Second option, trying to ensure I have a proper data frame: subset the data and then create a matrix.
cdProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'PN')]
H2ProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'PN')]
cdProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'NN')]
H2ProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'NN')]
Creating a matrix:
df <- cbind(cdProximityToPlantsPN, H2ProximityToPlantsPN,
            cdProximityToPlantsNN, H2ProximityToPlantsNN)
mat <- sapply(df,unlist)
fit=t.test(mat[,1],mat[,2], paired = F, var.equal = T)
Yet, I still get errors when assessing outliers using the following:
outlierTest(fit) # Bonferroni p-value for most extreme obs
Error in UseMethod("outlierTest") :
no applicable method for 'outlierTest' applied to an object of class
"htest"
qqPlot(fit, main="QQ Plot") #qq plot for studentized resid 
Error in order(x[good]) : unimplemented type 'list' in 'orderVector1'
leveragePlots(fit) # leverage plots
Error in formula.default(model) : invalid formula
I know the issue must be with my data structure. Any ideas on how to fix it?
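The error messages hint at the cause: outlierTest(), qqPlot() and leveragePlots() come from the car package and expect a fitted regression model (an lm object), not the "htest" object that t.test() returns. A minimal sketch of how those diagnostics are normally driven, assuming the car package and the column names shown above:
library(car)

# Fit the comparison as a linear model; the car diagnostics below operate on
# the fitted lm object rather than on a t.test() result.
pn  <- subset(df, treatment == "PN")
fit <- lm(proximityToPlants ~ factor(cd), data = pn)

outlierTest(fit)                # Bonferroni p-value for the most extreme observation
qqPlot(fit, main = "QQ Plot")   # QQ plot of studentized residuals
leveragePlots(fit)              # leverage plots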

How to deal with " rank-deficient fit may be misleading" in R?

I'm trying to predict the values of a test data set based on a train data set. It predicts values (no errors), but the predictions deviate a lot from the original values, even producing values around -356 although none of the original values exceeds 200 (and there are no negative values). The warning is bugging me, as I suspect the values deviate this much because of it.
Warning message:
In predict.lm(fit2, data_test) :
prediction from a rank-deficient fit may be misleading
Is there any way I can get rid of this warning? The code is simple:
fit2 <- lm(runs~., data=train_data)
prediction<-predict(fit2, data_test)
prediction
I searched a lot, but to be honest I couldn't understand much about this warning.
Here is the str of the test and train data sets in case someone needs them:
> str(train_data)
'data.frame': 36 obs. of 28 variables:
$ matchid : int 57 58 55 56 53 54 51 52 45 46 ...
$ TeamName : chr "South Africa" "West Indies" "South Africa" "West Indies" ...
$ Opp_TeamName : chr "West Indies" "South Africa" "West Indies" "South Africa" ...
$ TeamRank : int 4 3 4 3 4 3 10 7 5 1 ...
$ Opp_TeamRank : int 3 4 3 4 3 4 7 10 1 5 ...
$ Team_Top10RankingBatsman : int 0 1 0 1 0 1 0 0 2 2 ...
$ Team_Top50RankingBatsman : int 4 6 4 6 4 6 3 5 4 3 ...
$ Team_Top100RankingBatsman: int 6 8 6 8 6 8 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 0 1 0 1 0 0 0 2 2 ...
$ Opp_Top50RankingBatsman : int 6 4 6 4 6 4 5 3 3 4 ...
$ Opp_Top100RankingBatsman : int 8 6 8 6 8 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "1st innings" "2nd innings" ...
$ Runs_OverAll : num 361 705 348 630 347 ...
$ AVG_Overall : num 27.2 20 23.3 19.1 24 ...
$ SR_Overall : num 128 121 120 118 118 ...
$ Runs_Last10Matches : num 118.5 71 102.1 71 78.6 ...
$ AVG_Last10Matches : num 23.7 20.4 20.9 20.4 23.2 ...
$ SR_Last10Matches : num 120 106 114 106 116 ...
$ Runs_BatingFirst : num 236 459 230 394 203 ...
$ AVG_BatingFirst : num 30.6 23.2 24 21.2 27.1 ...
$ SR_BatingFirst : num 127 136 123 125 118 ...
$ Runs_BatingSecond : num 124 262 119 232 144 ...
$ AVG_BatingSecond : num 25.5 18.3 22.8 17.8 22.8 ...
$ SR_BatingSecond : num 125 118 112 117 114 ...
$ Runs_AgainstTeam2 : num 88.3 118.3 76.3 103.9 49.3 ...
$ AVG_AgainstTeam2 : num 28.2 23 24.7 22.1 16.4 ...
$ SR_AgainstTeam2 : num 139 127 131 128 111 ...
$ runs : int 165 168 231 236 195 126 143 141 191 135 ...
> str(data_test)
'data.frame': 34 obs. of 28 variables:
$ matchid : int 59 60 61 62 63 64 65 66 69 70 ...
$ TeamName : chr "India" "West Indies" "England" "New Zealand" ...
$ Opp_TeamName : chr "West Indies" "India" "New Zealand" "England" ...
$ TeamRank : int 2 3 5 1 4 8 6 2 10 1 ...
$ Opp_TeamRank : int 3 2 1 5 8 4 2 6 1 10 ...
$ Team_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 0 2 ...
$ Team_Top50RankingBatsman : int 5 6 4 3 4 2 5 5 3 3 ...
$ Team_Top100RankingBatsman: int 7 8 7 6 6 5 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 2 0 ...
$ Opp_Top50RankingBatsman : int 6 5 3 4 2 4 5 5 3 3 ...
$ Opp_Top100RankingBatsman : int 8 7 6 7 5 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "2nd innings" "1st innings" ...
$ Runs_OverAll : num 582 618 470 602 509 ...
$ AVG_Overall : num 25 21.8 20.3 20.7 19.6 ...
$ SR_Overall : num 113 120 123 120 112 ...
$ Runs_Last10Matches : num 182 107 117 167 140 ...
$ AVG_Last10Matches : num 37.1 43.8 21 24.9 27.3 ...
$ SR_Last10Matches : num 111 153 122 141 120 ...
$ Runs_BatingFirst : num 319 314 271 345 294 ...
$ AVG_BatingFirst : num 23.6 17.8 20.6 20.3 19.5 ...
$ SR_BatingFirst : num 116.9 98.5 118 124.3 115.8 ...
$ Runs_BatingSecond : num 264 282 304 256 186 ...
$ AVG_BatingSecond : num 28 23.7 31.9 21.6 16.5 ...
$ SR_BatingSecond : num 96.5 133.9 129.4 112 99.5 ...
$ Runs_AgainstTeam2 : num 98.2 95.2 106.9 75.4 88.5 ...
$ AVG_AgainstTeam2 : num 45.3 42.7 38.1 17.7 27.1 ...
$ SR_AgainstTeam2 : num 125 138 152 110 122 ...
$ runs : int 192 196 159 153 122 120 160 161 70 145 ...
In simple words, how can I get rid of this warning so that it doesn't affect my predictions? The fitted coefficients look like this:
(Intercept) matchid TeamNameBangladesh
1699.98232628 -0.06793787 59.29445330
TeamNameEngland TeamNameIndia TeamNameNew Zealand
347.33030177 -499.40074338 -179.19192936
TeamNamePakistan TeamNameSouth Africa TeamNameSri Lanka
-272.71610614 -3.54867488 -45.27920191
TeamNameWest Indies Opp_TeamNameBangladesh Opp_TeamNameEngland
-345.54349798 135.05901017 108.04227770
Opp_TeamNameIndia Opp_TeamNameNew Zealand Opp_TeamNamePakistan
-162.24418387 -60.55364436 -114.74599364
Opp_TeamNameSouth Africa Opp_TeamNameSri Lanka Opp_TeamNameWest Indies
196.90856999 150.70170068 -6.88997714
TeamRank Opp_TeamRank Team_Top10RankingBatsman
NA NA NA
Team_Top50RankingBatsman Team_Top100RankingBatsman Opp_Top10RankingBatsman
NA NA NA
Opp_Top50RankingBatsman Opp_Top100RankingBatsman InningType2nd innings
NA NA 24.24029455
Runs_OverAll AVG_Overall SR_Overall
-0.59935875 20.12721378 -13.60151334
Runs_Last10Matches AVG_Last10Matches SR_Last10Matches
-1.92526750 9.24182916 1.23914363
Runs_BatingFirst AVG_BatingFirst SR_BatingFirst
1.41001672 -9.88582744 -6.69780509
Runs_BatingSecond AVG_BatingSecond SR_BatingSecond
-0.90038727 -7.11580086 3.20915976
Runs_AgainstTeam2 AVG_AgainstTeam2 SR_AgainstTeam2
3.35936312 -5.90267210 2.36899131
You can have a look at this detailed discussion:
predict.lm() in a loop. warning: prediction from a rank-deficient fit may be misleading
In general, multicollinearity (or simply having more predictor columns than observations) leads to a rank-deficient design matrix; that is what the NA coefficients above are telling you.
You can try applying PCA to tackle the multicollinearity issue and then fit the regression afterwards.
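To see exactly which predictors are causing the rank deficiency, base R already helps; a small sketch against the fit from the question (only the commented-out refit is hypothetical):
fit2 <- lm(runs ~ ., data = train_data)

dropped <- names(coef(fit2))[is.na(coef(fit2))]  # coefficients lm could not estimate
dropped
alias(fit2)                                      # shows the exact linear dependencies

# One option is to refit without the redundant columns, e.g.:
# fit3 <- lm(runs ~ . - TeamRank - Opp_TeamRank, data = train_data)  # illustrative only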

How to replace some rows of a dataset from another dataset in R

I have two datasets: the original, called geoIncendios, and a second one called outliers. As you can imagine, the latter is a subset consisting of the outliers of the former. After analyzing them, I found the errors and corrected them. So now I would like to replace the corresponding rows of the first dataset with those of the second.
Here is the structure of both datasets to give you an idea:
> str(geoIncendios)
'data.frame': 100 obs. of 9 variables:
$ id : num 1 2 3 4 5 6 7 8 9 10 ...
$ municipio : chr "LLANES" "CANIZA" "CANGAS DEL NARCEA" "PILONA" ...
$ num_incendios: num 1725 1521 1349 1341 1290 ...
$ ha_quemadas : num 79 70 34 81 96 56 4 87 18 69 ...
$ ranking : num 1 2 3 4 5 6 7 8 9 10 ...
$ comunidad : chr "ASTURIAS" "GALICIA" "ASTURIAS" "ASTURIAS" ...
$ provincia : chr "ASTURIAS" "PONTEVEDRA" "ASTURIAS" "ASTURIAS" ...
$ lon : num -4.76 -8.27 -6.55 -5.35 -7.11 ...
$ lat : num 43.4 42.2 43.2 43.3 42.2 ...
> str(outliers)
'data.frame': 11 obs. of 9 variables:
$ id : num 9 13 22 24 37 40 43 45 68 93 ...
$ municipio : chr "NEVES" "LENA" "TOMINO" "GRADO" ...
$ num_incendios: num 1081 929 818 744 641 ...
$ ha_quemadas : num 18 74 73 49 61 48 38 21 46 8 ...
$ ranking : num 9 13 22 24 37 40 43 45 68 93 ...
$ comunidad : chr "GALICIA" "ASTURIAS" "GALICIA" "ASTURIAS" ...
$ provincia : chr "PONTEVEDRA" "ASTURIAS" "PONTEVEDRA" "ASTURIAS" ...
$ lon : num -8.41 -5.84 -8.73 -6.07 -8.31 ...
$ lat : num 42.1 43.1 42 43.4 42.1 ...
So again, I would like to overwrite 11 rows of the geoIncendios dataset with the ones from the outliers dataset. I believe I have to use some kind of loop, but in case there is an easier solution (which I doubt), these are the IDs of the rows: 9, 13, 22, 24, 37, 40, 43, 45, 68, 93 and 99.
In the data you've shown, geoIncendios$id is just the row number of the data.frame. Presuming that's true for the whole dataset, you could use (as suggested in the comments by @RHertel):
geoIncendios[outliers$id, ] <- outliers
However, if there are discontinuities in your id column, or if the order isn't strictly the same as the row numbers, a more generalisable solution is:
geoIncendios[match(outliers$id, geoIncendios$id), ] <- outliers
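A tiny self-contained illustration (made-up data, not the real geoIncendios) of why match() is the safer variant when ids and row numbers drift apart:
geo  <- data.frame(id = c(3, 7, 9), value = c("a", "b", "c"))
outl <- data.frame(id = 9, value = "C")

geo[match(outl$id, geo$id), ] <- outl   # id 9 lives in row 3; match() finds it anyway
geo
#   id value
# 1  3     a
# 2  7     b
# 3  9     C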

Split and unsplit a dataframe in four parts

I'd like to split a dataframe into 4 equal parts, because I'd like to use the 4 cores of my computer.
I did this:
df2 <- split(df, 1:4)
unsplit(df2, f=1:4)
and this:
df2 <- split(df, 1:4)
unsplit(df2, f=c('1','2','3','4'))
But the unsplit function did not work; I get these warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Do you have an idea of the reason?
How many rows are in df? You will get that warning if the number of rows in your table is not divisible by 4. I think you are using the split factor f incorrectly, unless what you want to do is put each subsequent row into a different split data.frame.
If you really want to split your data into 4 data frames, one row after the other, then make your splitting factor the same length as the number of rows in your dataframe using rep_len, like this:
## Split like this:
split(df , f = rep_len(1:4, nrow(df) ) )
## Unsplit like this:
unsplit( split(df , f = rep_len(1:4, nrow(df) ) ) , f = rep_len(1:4,nrow(df) ) )
Hopefully this example illustrates why the error occurs and how to avoid it (i.e. use a proper splitting factor!).
## Want to split our data.frame into two halves, but rows not divisible by 2
df <- data.frame( x = runif(5) )
df
## Splitting still works but...
## we get a warning because the number of rows is not a multiple of the length of 'f'
split( df , f = 1:2 )
#$`1`
# x
#1 0.6970968
#3 0.5614762
#5 0.5910995
#$`2`
# x
#2 0.6206521
#4 0.1798006
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
## Instead let's use the same split levels (1:2)...
## but make the factor as long as the number of rows in the table:
splt <- rep_len( 1:2 , nrow(df) )
splt
#[1] 1 2 1 2 1
## Split works, and f is not recycled because there are
## the same number of values in 'f' as rows in the table
split( df , f = splt )
#$`1`
# x
#1 0.6970968
#3 0.5614762
#5 0.5910995
#$`2`
# x
#2 0.6206521
#4 0.1798006
## And unsplitting then works as expected and reconstructs our original data.frame
unsplit( split( df , f = splt ) , f = splt )
# x
#1 0.6970968
#2 0.6206521
#3 0.5614762
#4 0.1798006
#5 0.5910995
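As an aside, if the end goal is really to use the 4 cores, the same splitting factor feeds straight into parallel::mclapply (on Unix-alikes); a hedged sketch, with the real per-chunk work left as a placeholder:
library(parallel)

idx    <- rep_len(1:4, nrow(df))                         # same splitting factor as above
chunks <- split(df, idx)
res    <- mclapply(chunks, function(d) d, mc.cores = 4)  # replace identity with real work
df2    <- unsplit(res, idx)                              # reassemble in the original row order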
In the R 'split' help-page example . . .
aq <- airquality
g <- aq$Month
l <- split(aq,g)
After the 'scale' function is executed
l <- lapply(l, transform, Ozone = scale(Ozone))
I am guessing that at one point in R's history the function 'scale' did not add extra attributes to the column it modifies.
..$ Ozone : num ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
As seen here . . .
> str(l)
List of 5
$ 5:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] 0.782 0.557 -0.523 -0.253 NA ...
.. ..- attr(*, "scaled:center")= num 23.6
.. ..- attr(*, "scaled:scale")= num 22.2
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 6:'data.frame': 30 obs. of 6 variables:
..$ Ozone : num [1:30, 1] NA NA NA NA NA ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ 7:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] 2.399 -0.32 -0.857 NA 0.154 ...
.. ..- attr(*, "scaled:center")= num 59.1
.. ..- attr(*, "scaled:scale")= num 31.6
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 8:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] -0.528 -1.284 -1.108 0.455 -0.629 ...
.. ..- attr(*, "scaled:center")= num 60
.. ..- attr(*, "scaled:scale")= num 39.7
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 9:'data.frame': 30 obs. of 6 variables:
..$ Ozone : num [1:30, 1] 2.674 1.928 1.721 2.467 0.644 ...
.. ..- attr(*, "scaled:center")= num 31.4
.. ..- attr(*, "scaled:scale")= num 24.1
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
But now it does add those attributes
..$ Ozone : num ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
and the very simple 'unsplit' function is not programmed to handle those attributes.
> unsplit(l,g)
Error in xj[i, , drop = FALSE] : (subscript) logical subscript too long
The (direct and simple) solution is to get rid of those attributes.
attributes(l[[1]]$Ozone) <- NULL
attributes(l[[2]]$Ozone) <- NULL
attributes(l[[3]]$Ozone) <- NULL
attributes(l[[4]]$Ozone) <- NULL
attributes(l[[5]]$Ozone) <- NULL
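As an aside, the same clean-up can be written in one pass; a small sketch equivalent to the five assignments above:
# as.vector() drops the dim and "scaled:*" attributes added by scale(),
# leaving Ozone as a plain numeric vector in every list element
l <- lapply(l, function(d) { d$Ozone <- as.vector(d$Ozone); d })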
Then try to unsplit again.
> str( unsplit(l,g) )
'data.frame': 153 obs. of 6 variables:
$ Ozone : num 0.782 0.557 -0.523 -0.253 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
So, now it works.
Andre Mikulec
