Split and unsplit a dataframe in four parts

I'd like to split a dataframe into 4 equal parts, because I'd like to use the 4 cores of my computer.
I tried this:
df2 <- split(df, 1:4)
unsplit(df2, f=1:4)
and this:
df2 <- split(df, 1:4)
unsplit(df2, f=c('1','2','3','4'))
But the unsplit function did not work; I get these warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Do you have any idea why?

How many rows are in df? You will get that warning if the number of rows in your table is not divisible by 4. I think you are using the split factor f incorrectly, unless what you want is to put each subsequent row into a different split data.frame.
If you really want to split your data into 4 dataframes, one row after the other, then make your splitting factor the same length as the number of rows in your dataframe using rep_len, like this:
## Split like this:
split(df , f = rep_len(1:4, nrow(df) ) )
## Unsplit like this:
unsplit( split(df , f = rep_len(1:4, nrow(df) ) ) , f = rep_len(1:4,nrow(df) ) )
Hopefully this example illustrates why the warning occurs and how to avoid it (i.e. use a proper splitting factor!).
## Want to split our data.frame into two halves, but rows not divisible by 2
df <- data.frame( x = runif(5) )
df
## Splitting still works but...
## We get a warning because the number of rows is not a multiple of the length of 'f'
split( df , f = 1:2 )
#$`1`
# x
#1 0.6970968
#3 0.5614762
#5 0.5910995
#$`2`
# x
#2 0.6206521
#4 0.1798006
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
## Instead let's use the same split levels (1:2)...
## but make it equal to the length of the rows in the table:
splt <- rep_len( 1:2 , nrow(df) )
splt
#[1] 1 2 1 2 1
## Split works, and f is not recycled because there are
## the same number of values in 'f' as rows in the table
split( df , f = splt )
#$`1`
# x
#1 0.6970968
#3 0.5614762
#5 0.5910995
#$`2`
# x
#2 0.6206521
#4 0.1798006
## And unsplitting then works as expected and reconstructs our original data.frame
unsplit( split( df , f = splt ) , f = splt )
# x
#1 0.6970968
#2 0.6206521
#3 0.5614762
#4 0.1798006
#5 0.5910995
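Since the goal in the question was four roughly equal parts for four cores, here is a minimal sketch of a variant that produces contiguous blocks instead of interleaved rows. The toy data frame, the column names and the per-chunk function are made up for illustration; sorting the rep_len() factor is just one way to get block sizes that differ by at most one row.

```r
# Toy data: 10 rows, which is not divisible by 4
df <- data.frame(id = 1:10, x = (1:10)^2)

# One factor value per row; sorting turns the interleaved 1,2,3,4,1,2,...
# pattern into contiguous blocks
f <- sort(rep_len(1:4, nrow(df)))

parts <- split(df, f)

# Hypothetical per-chunk work; lapply() stands in for a parallel apply
# function such as those in the 'parallel' package
parts <- lapply(parts, function(p) {
  p$y <- p$x * 2
  p
})

# Reassemble with the same factor that was used for splitting
res <- unsplit(parts, f)
res
```

The important point is the same as above: the factor has one entry per row, and the identical factor is passed to both split() and unsplit().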

In the R language 'split' example . . .
aq <- airquality
g <- aq$Month
l <- split(aq,g)
After the 'scale' function is executed
l <- lapply(l, transform, Ozone = scale(Ozone))
I am guessing that at one time in R history the function 'scale' did not add extra attributes to the column it is modifying, such as
..$ Ozone : num ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
Here is what str(l) shows now:
> str(l)
List of 5
$ 5:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] 0.782 0.557 -0.523 -0.253 NA ...
.. ..- attr(*, "scaled:center")= num 23.6
.. ..- attr(*, "scaled:scale")= num 22.2
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 6:'data.frame': 30 obs. of 6 variables:
..$ Ozone : num [1:30, 1] NA NA NA NA NA ...
.. ..- attr(*, "scaled:center")= num 29.4
.. ..- attr(*, "scaled:scale")= num 18.2
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ 7:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] 2.399 -0.32 -0.857 NA 0.154 ...
.. ..- attr(*, "scaled:center")= num 59.1
.. ..- attr(*, "scaled:scale")= num 31.6
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 8:'data.frame': 31 obs. of 6 variables:
..$ Ozone : num [1:31, 1] -0.528 -1.284 -1.108 0.455 -0.629 ...
.. ..- attr(*, "scaled:center")= num 60
.. ..- attr(*, "scaled:scale")= num 39.7
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 9:'data.frame': 30 obs. of 6 variables:
..$ Ozone : num [1:30, 1] 2.674 1.928 1.721 2.467 0.644 ...
.. ..- attr(*, "scaled:center")= num 31.4
.. ..- attr(*, "scaled:scale")= num 24.1
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
But now it does add those attributes (and each Ozone column comes back as a one-column matrix, hence the [1:31, 1] in the output above), and the very simple 'unsplit' function is not programmed to handle them.
> unsplit(l,g)
Error in xj[i, , drop = FALSE] : (subscript) logical subscript too long
The (direct and simple) solution is to get rid of those attributes.
attributes(l[[1]]$Ozone) <- NULL
attributes(l[[2]]$Ozone) <- NULL
attributes(l[[3]]$Ozone) <- NULL
attributes(l[[4]]$Ozone) <- NULL
attributes(l[[5]]$Ozone) <- NULL
Then try to unsplit again.
> str( unsplit(l,g) )
'data.frame': 153 obs. of 6 variables:
$ Ozone : num 0.782 0.557 -0.523 -0.253 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
So, now it works.
Andre Mikulec
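A more compact variant of the same idea, offered only as a sketch: instead of stripping the attributes afterwards, wrap scale() in as.vector() so that each group's Ozone column is a plain numeric vector from the start. This is a swapped-in shortcut, not part of the answer above.

```r
aq <- airquality
g  <- aq$Month

# as.vector() drops the matrix dimensions and the "scaled:*" attributes
l <- lapply(split(aq, g), transform, Ozone = as.vector(scale(Ozone)))

out <- unsplit(l, g)
str(out)
```

If the centering and scaling values are needed later, they would have to be saved separately before as.vector() discards them.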

Related

Agricolae, tapply, error: arguments must have same length

I am new to R and I am having trouble moving forward with my data analysis. My Excel data has a lot of NAs, and I tried troubleshooting this error. Here's my code, if anyone can help, and a link to a sample of my data:
file:///C:/Users/steph/Documents/DLI%20ANOVA%20Sample.htm
Some of my variables have 4 reps instead of all 8 reps, so I have a lot of NAs in the Excel file. I keep getting this error after I try tapply:
Error in tapply(X = data1$gi..m3., INDEX = data1$cultivar, FUN = mean, :
arguments must have same length
library(agricolae)
data1=read.csv("DLI ANOVA Sample.csv", header=T, as.is=T)
#setting factors
block = as.factor(data1$block)
treatmentt = as.factor(data1$trt)
cultivar<-factor(data1$cv,c("CR", "LB","RF","RR","S","SNS","SNY","SSJ","YC"))
str(data1)
#Summary statistics
tapply(X = data1$growth.index, INDEX = data1$cultivar, FUN = mean, na.rm=T)
tapply(X = data1$growth.index, INDEX = data1$treatment, FUN = mean, na.rm=T)
'data.frame': 288 obs. of 24 variables:
$ block : int 1 1 2 2 3 3 4 4 1 1 ...
$ trt : chr "HL-L" "HL-L" "HL-L" "HL-L" ..
$ cv : chr "CR" "CR" "CR" "CR" ...
$ rep : int 1 2 3 4 5 6 7 8 1 2 ...
$ height : int 23 20 25 19 23 19 22 19 19 24
$ growth.index : num 0.0221 0.0258 0.0276 0.0227 0.0209
$ number.of.mature.fruit : int 34 30 35 34 28 25 40 24 12 16 ...
$ mature.fruit.fw : num 163 163 186 152 169 ...
$ number.of.immature.fruit : int 38 28 40 27 35 37 44 48 20 30 ...
$ immature.fruit.fw : num 77.4 66.6 87.6 43.4 81.3 ...
$ Total.number.of.fruit : num 72 58 75 61 63 62 84 72 32 46 ...
$ Total.fruit.fw : num 241 230 273 195 250 ...
$ Fruit.Water.Content..g. : num NA 209 NA 176 NA ...
$ Brix.. : num 4.9 NA 5.6 NA 4.7 NA 5.1 NA 5.6 NA ...
$ pH : num 4.17 NA 4.3 NA 4.1 ...
$ EC.uS.mL : num 4.46 NA 9.19 NA 8.24 ...
$ X..citric.Acid : num 0.704 NA 0.397 NA 0.653 ...
$ Sugar.Acid.Ratio : num 6.96 NA 14.11 NA 7.2 ...
$ oedema.injury.level..1.6. : int 3 3 1 2 1 1 1 2 2 1 ...
$ Stomatal.conductance : num NA 365 NA 422 NA ...
$ spad : num NA NA NA 64.3 NA 65.5 NA 68.7 NA 55.6 ...
$ Irrigation.Events : int NA 14 NA 12 NA 13 NA 16 NA 13 ...
$ WUE : num NA 0.00584 NA 0.00693 NA ...
$ transpiration..g.H2O.lost..g.dry.biomass.: num NA 117 NA 111 NA ...
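One likely cause, judging only from the str() output above: the data frame has columns named cv and trt, but no columns named cultivar or treatment. The factors created in the script live in the workspace, not inside data1, so data1$cultivar is NULL and tapply() sees an index of length zero. A minimal sketch with a made-up four-row data frame (the values are invented; only the column names mirror the question):

```r
# Invented stand-in for the real data
data1 <- data.frame(
  cv           = c("CR", "CR", "LB", "LB"),
  trt          = c("HL-L", "HL-L", "HL-L", "HL-L"),
  growth.index = c(0.02, NA, 0.03, 0.025)
)

# A factor in the workspace is not a column of data1
cultivar <- factor(data1$cv)
is.null(data1$cultivar)

# Reproduce the error message without stopping the script
msg <- tryCatch(
  tapply(X = data1$growth.index, INDEX = data1$cultivar, FUN = mean, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
msg

# Index by a column that exists (or by the standalone factor 'cultivar')
ok <- tapply(X = data1$growth.index, INDEX = data1$cv, FUN = mean, na.rm = TRUE)
ok
```

If that is indeed the cause, the NAs in the measurements are not the problem, since na.rm = TRUE already takes care of them inside mean().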

Meaning of "@" operator in R language?

I came across the following and I haven't figured out the purpose of the "@" operator. What does it mean here? I couldn't make heads or tails of the R language manual.
library(lattice)
library(sp)
data(meuse)
coordinates(meuse) <- ~x+y
proj4string(meuse) <- CRS("+init=epsg:28992")
p <- xyplot(copper ~ cadmium, data = meuse@data, col = "grey", pch = 20, cex = 2)
R manuals says
Usage
object@name
object@name <- value
Extract or replace the contents of a slot in a object with a formal (S4) class structure.
These operators support the formal classes of package methods, and are enabled only when package methods is loaded (as per default). See slot for further details, in particular for the differences between slot() and the @ operator.
It is checked that object is an S4 object (see isS4), and it is an error to attempt to use @ on any other object. (There is an exception for name .Data for internal use only.) The replacement operator checks that the slot already exists on the object (which it should if the object is really from the class it claims to be).
I checked the structure of "meuse" and found no references to a slot named "data".

meuse is an S4 object
isS4(meuse)
[1] TRUE
If you look at the structure of meuse with str(meuse), you'll see that some fields are marked with the @ operator, including one called data. These slots can be accessed with @, much as elements of other objects are accessed with the $ operator. So meuse@data gives you the data portion of the meuse object.
str(meuse)
Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
..@ data :'data.frame': 155 obs. of 12 variables:
.. ..$ cadmium: num [1:155] 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
.. ..$ copper : num [1:155] 85 81 68 81 48 61 31 29 37 24 ...
.. ..$ lead : num [1:155] 299 277 199 116 117 137 132 150 133 80 ...
.. ..$ zinc : num [1:155] 1022 1141 640 257 269 ...
.. ..$ elev : num [1:155] 7.91 6.98 7.8 7.66 7.48 ...
.. ..$ dist : num [1:155] 0.00136 0.01222 0.10303 0.19009 0.27709 ...
.. ..$ om : num [1:155] 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
.. ..$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
.. ..$ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
.. ..$ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
.. ..$ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
.. ..$ dist.m : num [1:155] 50 30 150 270 380 470 240 120 240 420 ...
..@ coords.nrs : int [1:2] 1 2
..@ coords : num [1:155, 1:2] 181072 181025 181165 181298 18130
See how that subsetting is working?
str(meuse@data)
'data.frame': 155 obs. of 12 variables:
$ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
$ copper : num 85 81 68 81 48 61 31 29 37 24 ...
$ lead : num 299 277 199 116 117 137 132 150 133 80 ...
$ zinc : num 1022 1141 640 257 269 ...
$ elev : num 7.91 6.98 7.8 7.66 7.48 ...
$ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
$ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
$ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
$ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
$ dist.m : num 50 30 150 270 380 470 240 120 240 420 ...
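To see slot access without installing sp, here is a small self-contained sketch using a made-up S4 class. The class name PointsWithData and its contents are hypothetical; only setClass(), new(), isS4(), slotNames(), slot() and @ are real functions and operators from the methods package.

```r
library(methods)

# Hypothetical class with two slots, loosely imitating the layout above
setClass("PointsWithData",
         representation(data = "data.frame", coords = "matrix"))

obj <- new("PointsWithData",
           data   = data.frame(copper = c(85, 81), cadmium = c(11.7, 8.6)),
           coords = matrix(c(181072, 181025, 333611, 333558), ncol = 2))

isS4(obj)          # is this a formal (S4) object?
slotNames(obj)     # which slots does it have?

obj@data$copper    # @ picks the slot, $ then picks a column inside it
identical(obj@data, slot(obj, "data"))   # slot() is the functional form of @
```

The last line shows the relationship mentioned in the manual excerpt: slot(obj, "data") and obj@data return the same thing.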

Why is distance matrix (dist()) giving empty values for data sets having more than ~50 observations?

I have a data set for which I'm calculating its distance matrix. Below is the data, which has 251 observations.
> str(mydata)
'data.frame': 251 obs. of 7 variables:
$ BodyFat: num 12.3 6.1 25.3 10.4 28.7 20.9 19.2 12.4 4.1 11.7 ...
$ Weight : num 154 173 154 185 184 ...
$ Chest : num 93.1 93.6 95.8 101.8 97.3 ...
$ Abdomen: num 85.2 83 87.9 86.4 100 94.4 90.7 88.5 82.5 88.6 ...
$ Hip : num 94.5 98.7 99.2 101.2 101.9 ...
$ Thigh : num 59 58.7 59.6 60.1 63.2 66 58.4 60 62.9 63.1 ...
$ Biceps : num 32 30.5 28.8 32.4 32.2 35.7 31.9 30.5 35.9 35.6 ...
I normalize the data.
means = apply(mydata,2,mean)
sds = apply(mydata,2,sd)
nor = scale(mydata,center=means,scale=sds)
When I calculate the distance matrix, I see a lot of empty values; moreover, distances seem to be computed for only 4 observations.
distance =dist(nor)
> str(distance)
'dist' num [1:31375] 1.33 2.09 1.9 3.08 3.99 ...
- attr(*, "Size")= int 251
- attr(*, "Labels")= chr [1:251] "1" "2" "3" "4" ...
- attr(*, "Diag")= logi FALSE
- attr(*, "Upper")= logi FALSE
- attr(*, "method")= chr "euclidean"
- attr(*, "call")= language dist(x = nor)
> distance # output truncated in this post, as it has 251 observations.
1 2 3 4 5 6 7
2 1.3346445
3 2.0854437 2.5474796
4 1.8993458 1.4908813 2.5840752
5 3.0790252 3.4485667 2.2165366 2.7021809
8 9 10 11 12 13 14
2
3
4
5
15 16 17 18 19 20 21
The printout goes on empty like this for the remaining 247 comparisons.
Now, I reduce the data set to 20 observations.
Here I get a proper distance matrix.
distancetiny=dist(nor)
> str(distancetiny)
'dist' num [1:1176] 1.14 1.8 1.61 2.62 3.39 ...
- attr(*, "Size")= int 49
- attr(*, "Labels")= chr [1:49] "1" "2" "3" "4" ...
- attr(*, "Diag")= logi FALSE
- attr(*, "Upper")= logi FALSE
- attr(*, "method")= chr "euclidean"
- attr(*, "call")= language dist(x = nor)
> distancetiny
1 2 3 4 5 6 7
2 1.1380433
3 1.7990293 2.2088928
4 1.6064118 1.2871522 2.2483586
5 2.6235853 2.9669283 1.9132224 2.3256624
6 3.3898119 3.3730508 3.3718447 2.2615557 2.0094434
7 1.8947704 2.0065514 1.7685604 1.1065940 1.7387938 2.2321156
8 1.1732465 1.0663217 1.6733689 0.8873140 2.1959298 2.7939555 1.1448269
9 2.2721969 2.0545882 3.4263262 1.4058375 3.1811955 2.4011074 2.3078714
10 2.3753110 2.2424464 3.0289947 1.2808398 2.3230202 1.4242653 1.8571654
11 1.5620472 1.1878554 2.5750350 0.5718248 2.7714795 2.6314286 1.5132365
12 3.5088571 3.2484020 4.1164488 2.2723772 3.1377318 1.4795230 2.8274818
13 2.1448841 2.2679705 1.8726670 1.3494988 1.2176727 1.5544030 1.0725518
14 3.6679035 3.7459402 3.6869023 2.6677308 2.1318420 0.7347359 2.5729973
15 2.9908457 3.3312661 3.1289870 2.4340473 1.8027070 1.3626019 2.3795360
16 1.6117570 2.0283356 1.2011116 1.5961064 1.3196981 2.4456436 1.2569683
17 3.2991393 3.5991747 3.0438049 2.6066933 1.4742664 1.0945621 2.2214101
18 3.9409008 4.0726826 4.0113908 2.9250144 2.5228901 0.9087254 2.8158563
19 2.7468511 2.9495031 3.2439229 1.8312508 2.4122436 1.3932604 1.9640170
20 3.7515064 3.7021743 3.9404231 2.5813440 2.5390519 0.8352961 2.6530503
21 2.3102053 2.3878491 2.0836800 1.4328028 1.2991221 1.5287862 1.1769205
There are no empty values in the output when the number of observations is 21.
Why is this so? Does dist() stop working when the observation count goes beyond some threshold?
I'm unable to figure it out. Please help.

This seems to be a display issue related to size. When the dataset contains more than about 60-80 observations, the distance matrix is not displayed properly (even for the initial rows). The values themselves are all present; we just cannot see them in the printout.
Further operations on the distance matrix (such as hierarchical agglomerative clustering) confirmed that there is nothing to worry about beyond its odd display.
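A quick way to check this for yourself, sketched with random data of the same shape (251 rows, 7 columns) because the original data set is not available here: count the stored distances, test for missing values, and convert to a full matrix to look at any block you like.

```r
set.seed(1)
x <- matrix(rnorm(251 * 7), ncol = 7)   # stand-in for the real data
d <- dist(scale(x))

length(d)   # number of stored pairwise distances, 251 * 250 / 2
anyNA(d)    # are any of them missing?

# A full square matrix is easier to inspect than the wrapped lower triangle
m <- as.matrix(d)
dim(m)
m[1:5, 8:14]   # the block that looked empty in the printout

# Downstream use works directly on the dist object
hc <- hclust(d)
```

The blank cells in the printed dist object belong to the upper triangle, which is not printed; as.matrix() fills in both halves.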

How to deal with "rank-deficient fit may be misleading" in R?

I'm trying to predict the values of a test data set based on a training data set. The predictions run without errors, but they deviate A LOT from the original values: some are around -356, although none of the original values exceeds 200 (and there are no negative values). The warning is bugging me, as I think the values deviate so much because of it.
Warning message:
In predict.lm(fit2, data_test) :
prediction from a rank-deficient fit may be misleading
Is there any way I can get rid of this warning? The code is simple:
fit2 <- lm(runs~., data=train_data)
prediction<-predict(fit2, data_test)
prediction
I searched a lot but, to be honest, I couldn't understand much about this warning.
Here is the str of the test and train data sets, in case someone needs them:
> str(train_data)
'data.frame': 36 obs. of 28 variables:
$ matchid : int 57 58 55 56 53 54 51 52 45 46 ...
$ TeamName : chr "South Africa" "West Indies" "South Africa" "West Indies" ...
$ Opp_TeamName : chr "West Indies" "South Africa" "West Indies" "South Africa" ...
$ TeamRank : int 4 3 4 3 4 3 10 7 5 1 ...
$ Opp_TeamRank : int 3 4 3 4 3 4 7 10 1 5 ...
$ Team_Top10RankingBatsman : int 0 1 0 1 0 1 0 0 2 2 ...
$ Team_Top50RankingBatsman : int 4 6 4 6 4 6 3 5 4 3 ...
$ Team_Top100RankingBatsman: int 6 8 6 8 6 8 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 0 1 0 1 0 0 0 2 2 ...
$ Opp_Top50RankingBatsman : int 6 4 6 4 6 4 5 3 3 4 ...
$ Opp_Top100RankingBatsman : int 8 6 8 6 8 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "1st innings" "2nd innings" ...
$ Runs_OverAll : num 361 705 348 630 347 ...
$ AVG_Overall : num 27.2 20 23.3 19.1 24 ...
$ SR_Overall : num 128 121 120 118 118 ...
$ Runs_Last10Matches : num 118.5 71 102.1 71 78.6 ...
$ AVG_Last10Matches : num 23.7 20.4 20.9 20.4 23.2 ...
$ SR_Last10Matches : num 120 106 114 106 116 ...
$ Runs_BatingFirst : num 236 459 230 394 203 ...
$ AVG_BatingFirst : num 30.6 23.2 24 21.2 27.1 ...
$ SR_BatingFirst : num 127 136 123 125 118 ...
$ Runs_BatingSecond : num 124 262 119 232 144 ...
$ AVG_BatingSecond : num 25.5 18.3 22.8 17.8 22.8 ...
$ SR_BatingSecond : num 125 118 112 117 114 ...
$ Runs_AgainstTeam2 : num 88.3 118.3 76.3 103.9 49.3 ...
$ AVG_AgainstTeam2 : num 28.2 23 24.7 22.1 16.4 ...
$ SR_AgainstTeam2 : num 139 127 131 128 111 ...
$ runs : int 165 168 231 236 195 126 143 141 191 135 ...
> str(data_test)
'data.frame': 34 obs. of 28 variables:
$ matchid : int 59 60 61 62 63 64 65 66 69 70 ...
$ TeamName : chr "India" "West Indies" "England" "New Zealand" ...
$ Opp_TeamName : chr "West Indies" "India" "New Zealand" "England" ...
$ TeamRank : int 2 3 5 1 4 8 6 2 10 1 ...
$ Opp_TeamRank : int 3 2 1 5 8 4 2 6 1 10 ...
$ Team_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 0 2 ...
$ Team_Top50RankingBatsman : int 5 6 4 3 4 2 5 5 3 3 ...
$ Team_Top100RankingBatsman: int 7 8 7 6 6 5 7 7 7 6 ...
$ Opp_Top10RankingBatsman : int 1 1 2 2 0 0 1 1 2 0 ...
$ Opp_Top50RankingBatsman : int 6 5 3 4 2 4 5 5 3 3 ...
$ Opp_Top100RankingBatsman : int 8 7 6 7 5 6 7 7 6 7 ...
$ InningType : chr "1st innings" "2nd innings" "2nd innings" "1st innings" ...
$ Runs_OverAll : num 582 618 470 602 509 ...
$ AVG_Overall : num 25 21.8 20.3 20.7 19.6 ...
$ SR_Overall : num 113 120 123 120 112 ...
$ Runs_Last10Matches : num 182 107 117 167 140 ...
$ AVG_Last10Matches : num 37.1 43.8 21 24.9 27.3 ...
$ SR_Last10Matches : num 111 153 122 141 120 ...
$ Runs_BatingFirst : num 319 314 271 345 294 ...
$ AVG_BatingFirst : num 23.6 17.8 20.6 20.3 19.5 ...
$ SR_BatingFirst : num 116.9 98.5 118 124.3 115.8 ...
$ Runs_BatingSecond : num 264 282 304 256 186 ...
$ AVG_BatingSecond : num 28 23.7 31.9 21.6 16.5 ...
$ SR_BatingSecond : num 96.5 133.9 129.4 112 99.5 ...
$ Runs_AgainstTeam2 : num 98.2 95.2 106.9 75.4 88.5 ...
$ AVG_AgainstTeam2 : num 45.3 42.7 38.1 17.7 27.1 ...
$ SR_AgainstTeam2 : num 125 138 152 110 122 ...
$ runs : int 192 196 159 153 122 120 160 161 70 145 ...
In simple words, how can I get rid of this warning so that it doesn't affect my predictions? Here are the fitted coefficients:
(Intercept) matchid TeamNameBangladesh
1699.98232628 -0.06793787 59.29445330
TeamNameEngland TeamNameIndia TeamNameNew Zealand
347.33030177 -499.40074338 -179.19192936
TeamNamePakistan TeamNameSouth Africa TeamNameSri Lanka
-272.71610614 -3.54867488 -45.27920191
TeamNameWest Indies Opp_TeamNameBangladesh Opp_TeamNameEngland
-345.54349798 135.05901017 108.04227770
Opp_TeamNameIndia Opp_TeamNameNew Zealand Opp_TeamNamePakistan
-162.24418387 -60.55364436 -114.74599364
Opp_TeamNameSouth Africa Opp_TeamNameSri Lanka Opp_TeamNameWest Indies
196.90856999 150.70170068 -6.88997714
TeamRank Opp_TeamRank Team_Top10RankingBatsman
NA NA NA
Team_Top50RankingBatsman Team_Top100RankingBatsman Opp_Top10RankingBatsman
NA NA NA
Opp_Top50RankingBatsman Opp_Top100RankingBatsman InningType2nd innings
NA NA 24.24029455
Runs_OverAll AVG_Overall SR_Overall
-0.59935875 20.12721378 -13.60151334
Runs_Last10Matches AVG_Last10Matches SR_Last10Matches
-1.92526750 9.24182916 1.23914363
Runs_BatingFirst AVG_BatingFirst SR_BatingFirst
1.41001672 -9.88582744 -6.69780509
Runs_BatingSecond AVG_BatingSecond SR_BatingSecond
-0.90038727 -7.11580086 3.20915976
Runs_AgainstTeam2 AVG_AgainstTeam2 SR_AgainstTeam2
3.35936312 -5.90267210 2.36899131

You can have a look at this detailed discussion:
predict.lm() in a loop. warning: prediction from a rank-deficient fit may be misleading
In general, multi-collinearity can lead to a rank-deficient model matrix in regression (here, a linear model fitted with lm).
You can try applying PCA to tackle the multi-collinearity issue and then fit the regression on the resulting components.
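To see what rank deficiency looks like in a fitted lm, here is a minimal sketch on simulated data in which one predictor is an exact multiple of another. The variable names are made up. The NA coefficients are the same symptom as the NA entries in the coefficient listing in the question.

```r
set.seed(42)
d <- data.frame(x1 = rnorm(20))
d$x2 <- 2 * d$x1                      # exact linear copy of x1
d$y  <- 1 + 3 * d$x1 + rnorm(20)

fit <- lm(y ~ ., data = d)
coef(fit)                             # the redundant term gets an NA coefficient

# Fewer estimable coefficients than terms means the fit is rank-deficient
fit$rank < length(coef(fit))

# Names of the terms lm() could not estimate
dropped <- names(coef(fit))[is.na(coef(fit))]
dropped

# Refit without the redundant predictor
fit_ok <- lm(y ~ x1, data = d)
pred   <- predict(fit_ok, newdata = d)
```

With only 36 training rows and many factor levels plus numeric predictors, the model in the question may simply have more terms than the data can support, so dropping or combining predictors is worth trying before reaching for PCA.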

How to replace some rows of a dataset from another dataset in R

I have two datasets: the original, called geoIncendios, and a second one called outliers. As you can imagine, the latter is a subset consisting of the outliers of the former. After analyzing them, I found the errors and corrected them. So now I would like to replace the corresponding rows of the first dataset with the rows of the second.
Here is the structure of both datasets to give you an idea:
> str(geoIncendios)
'data.frame': 100 obs. of 9 variables:
$ id : num 1 2 3 4 5 6 7 8 9 10 ...
$ municipio : chr "LLANES" "CANIZA" "CANGAS DEL NARCEA" "PILONA" ...
$ num_incendios: num 1725 1521 1349 1341 1290 ...
$ ha_quemadas : num 79 70 34 81 96 56 4 87 18 69 ...
$ ranking : num 1 2 3 4 5 6 7 8 9 10 ...
$ comunidad : chr "ASTURIAS" "GALICIA" "ASTURIAS" "ASTURIAS" ...
$ provincia : chr "ASTURIAS" "PONTEVEDRA" "ASTURIAS" "ASTURIAS" ...
$ lon : num -4.76 -8.27 -6.55 -5.35 -7.11 ...
$ lat : num 43.4 42.2 43.2 43.3 42.2 ...
> str(outliers)
'data.frame': 11 obs. of 9 variables:
$ id : num 9 13 22 24 37 40 43 45 68 93 ...
$ municipio : chr "NEVES" "LENA" "TOMINO" "GRADO" ...
$ num_incendios: num 1081 929 818 744 641 ...
$ ha_quemadas : num 18 74 73 49 61 48 38 21 46 8 ...
$ ranking : num 9 13 22 24 37 40 43 45 68 93 ...
$ comunidad : chr "GALICIA" "ASTURIAS" "GALICIA" "ASTURIAS" ...
$ provincia : chr "PONTEVEDRA" "ASTURIAS" "PONTEVEDRA" "ASTURIAS" ...
$ lon : num -8.41 -5.84 -8.73 -6.07 -8.31 ...
$ lat : num 42.1 43.1 42 43.4 42.1 ...
So again, I would like to overwrite 11 rows of the geoIncendios dataset with the ones from the outliers dataset. I believe I have to use some kind of loop, but in case there is an easier solution (which I doubt), these are the IDs of the rows: 9, 13, 22, 24, 37, 40, 43, 45, 68, 93 and 99.

In the data you've shown, geoIncendios$id is just the row number of the data.frame. Presuming that's true for the whole dataset, you could use (as suggested in the comments by @RHertel)
geoIncendios[outliers$id, ] <- outliers
However, if there are discontinuities in your id column, or if the order isn't strictly the same as the row numbers, a more generalisable solution is:
geoIncendios[match(outliers$id, geoIncendios$id), ] <- outliers
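A tiny worked example of the match() version, using made-up data frames whose ids are deliberately not equal to the row numbers. The names geo and fixed and their values are invented for illustration.

```r
geo   <- data.frame(id = c(10, 20, 30, 40), ha = c(1, 2, 3, 4))
fixed <- data.frame(id = c(40, 20),         ha = c(44, 22))

# For each corrected row, find the position of its id in the original
idx <- match(fixed$id, geo$id)
idx

# Guard against ids that are missing from the original
stopifnot(!anyNA(idx))

geo[idx, ] <- fixed
geo
```

match() returns NA for any id it cannot find, which is why the check comes before the assignment.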
