Carc data from rda file to numeric matrix - r

I am trying to run KDA (kernel discriminant analysis) on the carc data, but when I call X<-data.frame(scale(X)); R shows this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
I tried as.numeric(as.matrix(carc)) and carc<-na.omit(carc), but neither helps. Here is my code:
install.packages("klaR")
install.packages("FSelector")
library(ks);library(MASS);library(klaR);library(FSelector)
attach("carc.rda")
data<-load("carc.rda")
data
carc<-na.omit(carc)
head(carc)
class(carc) # check for its class
class(as.matrix(carc)) # change class, and
as.numeric(as.matrix(carc))
XX<-carc
X<-XX[,1:12];X.class<-XX[,13];
X<-data.frame(scale(X));
fit.pc<-princomp(X,scores=TRUE);
plot(fit.pc,type="line")
X.new<-fit.pc$scores[,1:5]; X.new<-data.frame(X.new);
cfs(X.class~.,cbind(X.new,X.class))
X.new<-fit.pc$scores[,c(1,4)]; X.new<-data.frame(X.new);
fit.kda1<-Hkda(x=X.new,x.group=X.class,pilot="samse",
bw="plugin",pre="sphere")
kda.fit1 <- kda(x=X.new, x.group=X.class, Hs=fit.kda1)
Can you help me resolve this problem and run the analysis?
Added: the car data set (Chambers, Cleveland, Kleiner & Tukey 1983)
> head(carc)
P M R78 R77 H R Tr W L T D G C
AMC_Concord 4099 22 3 2 2.5 27.5 11 2930 186 40 121 3.58 US
AMC_Pacer 4749 17 3 1 3.0 25.5 11 3350 173 40 258 2.53 US
AMC_Spirit 3799 22 . . 3.0 18.5 12 2640 168 35 121 3.08 US
Audi_5000 9690 17 5 2 3.0 27.0 15 2830 189 37 131 3.20 Europe
Audi_Fox 6295 23 3 3 2.5 28.0 11 2070 174 36 97 3.70 Europe

Here is a small dataset with characteristics similar to what you describe,
which reproduces this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
carc <- data.frame(type1=rep(c('1','2'), each=5),
type2=rep(c('5','6'), each=5),
x = rnorm(10,1,2)/10, y = rnorm(10),
stringsAsFactors = TRUE) # needed in R >= 4.0 to get factor columns
This should be similar to your data.frame
str(carc)
# 'data.frame': 10 obs. of 4 variables:
# $ type1: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
# $ type2: Factor w/ 2 levels "5","6": 1 1 1 1 1 2 2 2 2 2
# $ x : num -0.1177 0.3443 0.1351 0.0443 0.4702 ...
# $ y : num -0.355 0.149 -0.208 -1.202 -1.495 ...
scale(carc)
# Similar error
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Convert the factor columns to numeric using set() from data.table:
library(data.table)
DT <- data.table(carc)
cols_fix <- c("type1", "type2")
for (col in cols_fix) set(DT, j=col, value = as.numeric(as.character(DT[[col]])))
str(DT)
# Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ type1: num 1 1 1 1 1 2 2 2 2 2
# $ type2: num 5 5 5 5 5 6 6 6 6 6
# $ x : num 0.0465 0.1712 0.1582 0.1684 0.1183 ...
# $ y : num 0.155 -0.977 -0.291 -0.766 -1.02 ...
# - attr(*, ".internal.selfref")=<externalptr>
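The same conversion can be sketched in base R. This is a minimal sketch on toy data, assuming the "." entries shown in head(carc) for R78 and R77 mark missing values; the helper name to_num is made up for illustration. Note that calling as.numeric() directly on a factor returns the internal level codes, which is why the values go through as.character() first.

```r
# Toy data mimicking a few carc columns; "." marks a missing value (assumption).
carc <- data.frame(
  P   = c(4099, 4749, 3799, 9690),
  R78 = factor(c("3", "3", ".", "5")),
  R77 = factor(c("2", "1", ".", "2")),
  C   = factor(c("US", "US", "US", "Europe"))
)

# Hypothetical helper: factor -> character -> numeric, with "." turned into NA.
to_num <- function(f) {
  v <- as.character(f)
  v[v == "."] <- NA
  as.numeric(v)
}

fix_cols <- c("R78", "R77")
carc[fix_cols] <- lapply(carc[fix_cols], to_num)

# Drop incomplete rows, then scale only the numeric predictors.
carc <- na.omit(carc)
X <- scale(carc[, c("P", "R78", "R77")])
str(carc)
```

The class column C stays a factor and is left out of scale(), which only accepts numeric input.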

The first column(s) of your data set may be factors. Taking the data from corrgram:
library(corrgram)
carc <- auto
str(carc)
# 'data.frame': 74 obs. of 14 variables:
# $ Model : Factor w/ 74 levels "AMC Concord ",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ Origin: Factor w/ 3 levels "A","E","J": 1 1 1 2 2 2 1 1 1 1 ...
# $ Price : int 4099 4749 3799 9690 6295 9735 4816 7827 5788 4453 ...
# $ MPG : int 22 17 22 17 23 25 20 15 18 26 ...
# $ Rep78 : num 3 3 NA 5 3 4 3 4 3 NA ...
# $ Rep77 : num 2 1 NA 2 3 4 3 4 4 NA ...
# $ Hroom : num 2.5 3 3 3 2.5 2.5 4.5 4 4 3 ...
# $ Rseat : num 27.5 25.5 18.5 27 28 26 29 31.5 30.5 24 ...
# $ Trunk : int 11 11 12 15 11 12 16 20 21 10 ...
# $ Weight: int 2930 3350 2640 2830 2070 2650 3250 4080 3670 2230 ...
# $ Length: int 186 173 168 189 174 177 196 222 218 170 ...
# $ Turn : int 40 40 35 37 36 34 40 43 43 34 ...
# $ Displa: int 121 258 121 131 97 121 196 350 231 304 ...
# $ Gratio: num 3.58 2.53 3.08 3.2 3.7 3.64 2.93 2.41 2.73 2.87 ...
So exclude them by trying this:
X<-XX[,3:14]
or this
X<-XX[,-(1:2)]
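Rather than hard-coding column positions, the numeric columns can also be picked by type. The following is a small sketch on a made-up four-row data frame with the same kind of layout as auto (two factor columns followed by numeric ones); the values are illustrative only.

```r
# Made-up stand-in for the auto data: two factors, then numeric columns.
XX <- data.frame(
  Model  = factor(c("a", "b", "c", "d")),
  Origin = factor(c("A", "E", "J", "A")),
  Price  = c(4099L, 4749L, 3799L, 9690L),
  MPG    = c(22L, 17L, 22L, 17L)
)

# Logical vector: TRUE for integer and double columns, FALSE for factors.
num_cols <- sapply(XX, is.numeric)

# Scale only the numeric columns; keep the class labels separately.
X <- data.frame(scale(XX[, num_cols]))
X.class <- XX$Origin
names(X)
```

This keeps working if the column order changes, which positional indexing such as XX[,3:14] does not.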

Related

Can we use as.factor to convert categorical variables having multiple levels for decision tree or we need to use model.matrix?

I am trying to build a decision tree model in R with both categorical and numerical variables. Some categorical variables have 3 levels, so can I just use as.factor and then use them in my model? I tried model.matrix, but my concern is that model.matrix converts the variable into numeric 0/1 columns and the splitting then happens on those numeric values. For example, if Color has 3 levels (blue, red, green), the splitting rule will look like color_green < 0.5, even though the column only ever takes the values 0 and 1.
If you are asking whether you can use factors to build an rpart decision tree, then yes. See the example below, taken from the documentation. Note that there are many other packages for decision trees.
library(rpart)
rpart(Reliability ~ ., data=car90)
#> n=76 (35 observations deleted due to missingness)
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 76 53 average (0.2 0.12 0.3 0.11 0.28)
#> 2) Country=Germany,Korea,Mexico,Sweden,USA 49 29 average (0.31 0.18 0.41 0.1 0)
#> 4) Tires=145,155/80,165/80,185/80,195/60,195/65,195/70,205/60,215/65,225/75,275/40 17 9 Much worse (0.47 0.29 0 0.24 0) *
#> 5) Tires=175/70,185/65,185/70,185/75,195/75,205/70,205/75,215/70 32 12 average (0.22 0.12 0.62 0.031 0)
#> 10) HP.revs< 4650 13 7 Much worse (0.46 0.23 0.31 0 0) *
#> 11) HP.revs>=4650 19 3 average (0.053 0.053 0.84 0.053 0) *
#> 3) Country=Japan,Japan/USA 27 6 Much better (0 0 0.11 0.11 0.78) *
str(car90)
#> 'data.frame': 111 obs. of 34 variables:
#> $ Country : Factor w/ 10 levels "Brazil","England",..: 5 5 4 4 4 4 10 10 10 NA ...
#> $ Disp : num 112 163 141 121 152 209 151 231 231 189 ...
#> $ Disp2 : num 1.8 2.7 2.3 2 2.5 3.5 2.5 3.8 3.8 3.1 ...
#> $ Eng.Rev : num 2935 2505 2775 2835 2625 ...
#> $ Front.Hd : num 3.5 2 2.5 4 2 3 4 6 5 5.5 ...
#> $ Frt.Leg.Room: num 41.5 41.5 41.5 42 42 42 42 42 41 41 ...
#> $ Frt.Shld : num 53 55.5 56.5 52.5 52 54.5 56.5 58.5 59 58 ...
#> $ Gear.Ratio : num 3.26 2.95 3.27 3.25 3.02 2.8 NA NA NA NA ...
#> $ Gear2 : num 3.21 3.02 3.25 3.25 2.99 2.85 2.84 1.99 1.99 2.33 ...
#> $ HP : num 130 160 130 108 168 208 110 165 165 101 ...
#> $ HP.revs : num 6000 5900 5500 5300 5800 5700 5200 4800 4800 4400 ...
#> $ Height : num 47.5 50 51.5 50.5 49.5 51 49.5 50.5 51 50.5 ...
#> $ Length : num 177 191 193 176 175 186 189 197 197 192 ...
#> $ Luggage : num 16 14 17 10 12 12 16 16 16 15 ...
#> $ Mileage : num NA 20 NA 27 NA NA 21 NA 23 NA ...
#> $ Model2 : Factor w/ 21 levels ""," Turbo 4 (3)",..: 1 1 1 1 1 1 1 14 13 1 ...
#> $ Price : num 11950 24760 26900 18900 24650 ...
#> $ Rear.Hd : num 1.5 2 3 1 1 2.5 2.5 4.5 3.5 3.5 ...
#> $ Rear.Seating: num 26.5 28.5 31 28 25.5 27 28 30.5 28.5 27.5 ...
#> $ RearShld : num 52 55.5 55 52 51.5 55.5 56 58.5 58.5 56.5 ...
#> $ Reliability : Ord.factor w/ 5 levels "Much worse"<"worse"<..: 5 5 NA NA 4 NA 3 3 3 NA ...
#> $ Rim : Factor w/ 6 levels "R12","R13","R14",..: 3 4 4 3 3 4 3 3 3 3 ...
#> $ Sratio.m : num NA NA NA NA NA NA NA NA NA NA ...
#> $ Sratio.p : num 0.86 0.96 0.97 0.71 0.88 0.78 0.76 0.83 0.87 0.88 ...
#> $ Steering : Factor w/ 3 levels "manual","power",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ Tank : num 13.2 18 21.1 15.9 16.4 21.1 15.7 18 18 16.5 ...
#> $ Tires : Factor w/ 30 levels "145","145/80",..: 16 20 20 8 17 28 13 23 23 22 ...
#> $ Trans1 : Factor w/ 4 levels "","man.4","man.5",..: 3 3 3 3 3 3 1 1 1 1 ...
#> $ Trans2 : Factor w/ 4 levels "","auto.3","auto.4",..: 3 3 2 2 3 3 2 3 3 3 ...
#> $ Turning : num 37 42 39 35 35 39 41 43 42 41 ...
#> $ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 3 1 1 3 3 2 2 NA ...
#> $ Weight : num 2700 3265 2935 2670 2895 ...
#> $ Wheel.base : num 102 109 106 100 101 109 105 111 111 108 ...
#> $ Width : num 67 69 71 67 65 69 69 72 72 71 ...
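To see the factor handling in isolation, here is a minimal sketch with a single three-level factor predictor. It uses the built-in iris data purely as a stand-in (an assumption, not the asker's data): Species plays the role of a variable like Color.

```r
library(rpart)

# Species is a factor with 3 levels; it is passed to rpart as-is,
# with no model.matrix() dummy coding.
fit <- rpart(Sepal.Length ~ Species, data = iris)
print(fit)

# Variables used for splitting ("<leaf>" marks terminal nodes).
split_vars <- unique(as.character(fit$frame$var))
split_vars
```

In the printed tree the split rules list groups of factor levels (in the same style as the Country= and Tires= rules above) rather than thresholds such as color_green < 0.5.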

why multiple columns are shown as one vector in R

I am trying to read in data from a URL; however, when I run the following code:
x <- read.csv(url(myUrl), sep = '\t', head = FALSE)
print(x)
I get this
V1 V2
1 18.0 8 30.7 130.0 hello
2 32.0 6 23.5 121.5 bye
and I want this
V1 V2 V3 V4 V5
1 18.0 8.0 30.7 130.0 hello
2 32.0 6.0 23.5 121.5 bye
for some reason it is reading it as 2 columns instead of 5
Edit 1
Here is a snippet of the data file from the url:
Edit 2
Here is the url: https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
Instead of '\t', try ' ' or, better, don't specify the delimiter at all: by default read.table() splits on any run of whitespace (spaces or tabs).
x <- read.table(url(myUrl), header = FALSE)
Based on the URL added to the question:
x <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", header = FALSE)
str(x)
#'data.frame': 398 obs. of 9 variables:
# $ V1: num 18 15 18 16 17 15 14 14 14 15 ...
# $ V2: int 8 8 8 8 8 8 8 8 8 8 ...
# $ V3: num 307 350 318 304 302 429 454 440 455 390 ...
# $ V4: chr "130.0" "165.0" "150.0" "150.0" ...
# $ V5: num 3504 3693 3436 3433 3449 ...
# $ V6: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
# $ V7: int 70 70 70 70 70 70 70 70 70 70 ...
# $ V8: int 1 1 1 1 1 1 1 1 1 1 ...
# $ V9: chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
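In the str() output above, V4 comes back as character. In this file that is most likely because missing values in that column are written as "?". A hedged sketch of how na.strings handles this, using two inline rows typed in the same layout as the file (illustrative text, not downloaded from the URL):

```r
# Two illustrative rows: whitespace-separated fields, then a tab and a quoted name.
txt <- '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"
25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"'

# Default sep = "" splits on spaces and tabs; "?" is read as NA.
x <- read.table(text = txt, header = FALSE, na.strings = "?")
str(x)
```

With the real file, the same na.strings = "?" argument can be added to the read.table() call on the URL.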

Classify factor output with factors with >60 levels and numeric inputs

I'm a newbie working on a classification problem to identify the causes of coral diseases. The dataset contains 45 variables.
The output variable is a factor with 21 levels (21 diseases), and the inputs are numeric and factor variables. Some of those factors have as many as 94 levels (for example, the coral species), and I can't collapse or split them because I want to be as precise as possible: one species may be less resistant than another. The numeric variables are things like the population in the area, fishing trips, etc.
First problem: I tried genetic algorithms, random forests, etc. to select the most important variables, but the runs get aborted, so the variables I eliminated were chosen only from correlograms. I want something more robust for deciding which variables to select.
Second problem: I've tried everything I know and searched Google extensively for something that runs and produces a classification, but nothing works. I tried SVM, random forests, CART, GBM, bagging, and boosting, but none of them can handle this dataset.
This is the structure of the dataset
'data.frame': 136510 obs. of 45 variables:
$ SITE : Factor w/ 144 levels "TUT-1511","TUT-1513",..: 56 15 55 21 12 12 17 53 48 82 ...
$ Zone_Fine : Factor w/ 17 levels "Aunuu_E","Aunuu_W",..: 11 9 10 9 9 9 9 8 10 10 ...
$ TRANSECT : num 1 1 1 1 1 1 1 1 1 1 ...
$ SEGMENT : num 5 1 1 1 7 5 7 5 3 7 ...
$ Seg_WIDTH : num 1 1 1 1 1 1 1 1 1 1 ...
$ Seg_LENGTH : num 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
$ SPECIES : Factor w/ 156 levels "AAAA","AABR",..: 94 126 94 102 9 126 135 94 93 94 ...
$ COLONYLENGTH : num 11 45 10 5 12 10 8 30 20 14 ...
$ OLDDEAD : num 5 2 5 0 0 5 10 0 5 10 ...
$ RECENTDEAD : num 0 10 0 0 0 0 0 0 0 0 ...
$ DZCLASS : Factor w/ 21 levels "Acute Tissue Loss - White Syndrome",..: 14 14 14 14 14 14 14 14 14 14 ...
$ EXTENT : num 52.9 52.9 52.9 52.9 52.9 ...
$ SEVERITY : num 3.11 3.11 3.11 3.11 3.11 ...
$ TAXONNAME.x : Factor w/ 155 levels "Acanthastrea hemprichii",..: 95 132 95 107 7 132 133 95 89 95 ...
$ PHYLUM : Factor w/ 2 levels "Cnidaria","Rhodophyta": 1 1 1 1 1 1 1 1 1 1 ...
$ CLASS : Factor w/ 3 levels "Anthozoa","Florideophyceae",..: 1 1 1 1 1 1 1 1 1 1 ...
$ FAMILY : Factor w/ 20 levels "Acroporidae",..: 1 18 1 2 1 18 18 1 8 1 ...
$ GENUS : Factor w/ 55 levels "Acanthastrea",..: 35 44 35 39 2 44 44 35 34 35 ...
$ RANK : Factor w/ 2 levels "Genus","Species": 1 1 1 1 2 1 2 1 1 1 ...
$ DATE_ : Date, format: "0015-03-27" ...
$ OBS_YEAR : num 2015 2015 2015 2015 2015 ...
$ REEF_ZONE : Factor w/ 2 levels "Backreef","Forereef": 2 2 2 2 2 2 2 2 2 2 ...
$ DEPTH_BIN : Factor w/ 4 levels "Bank","Deep",..: 2 2 4 3 2 2 3 4 3 3 ...
$ LBSP : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ Zone_Fine_ReefZone_Depth: Factor w/ 41 levels "Aunuu_E_Deep",..: 30 24 29 25 24 24 25 23 28 28 ...
$ Area_km2.x : num 50.9 49.1 101.8 49.1 49.1 ...
$ Fishing.trips.per.km2 : num 719 1148 1431 1148 1148 ...
$ Area_km2.y : num 50.9 49.1 50.9 49.1 49.1 ...
$ Pop.km2 : num 167.5 49.1 561.9 49.1 49.1 ...
$ SHED_NAME : Factor w/ 35 levels "Aasu","Afao - Asili",..: 2 9 15 17 17 1 1 35 28 26 ...
$ Shed_Cond : Factor w/ 4 levels "Extensive","Intermediate",..: 3 4 2 4 4 3 3 3 1 2 ...
$ Shed_Area_Calc : num 30202 29422 458542 126361 32595 ...
$ Perc_Area : num 0.00128 0.00107 0.00993 0.00458 0.00118 ...
$ Cond_Scale : num 3 4 2 4 4 3 3 3 1 2 ...
$ Shoreline_m : num 23146 33046 45821 33046 33046 ...
$ Rank : num 5 9 3 9 9 9 9 6 3 3 ...
$ Comp.8 : num 0.826 0.814 0.838 0.814 0.814 ...
$ Ble : num 0.958 0.969 0.959 0.969 0.969 ...
$ DZ : num 0.647 0.837 0.732 0.837 0.837 ...
$ Herb : num 0.682 0.564 0.704 0.564 0.564 ...
$ Rec : num 0.375 0.477 0.467 0.477 0.477 ...
$ MA : num 0.965 0.975 0.907 0.975 0.975 ...
$ Dam : num 0.998 1 0.992 1 1 ...
$ TAXONNAME.y : Factor w/ 94 levels "Abudefduf sordidus",..: 94 94 94 94 94 94 94 94 94 94 ...
$ Dummy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I expected a classification of "DZCLASS".
Thanks, every recommendation is welcomed!

chart.Correlation with continuous and categorical variables

I want to see if there is correlation between my variables. This is the structure of the dataset
'data.frame': 189 obs. of 20 variables:
$ age : num 24 31 32 35 36 26 31 24 35 36 ...
$ diplM2 : Factor w/ 3 levels "0","1","2": 3 2 1 3 2 2 3 2 2 1 ...
$ TimeDelcat : Factor w/ 4 levels "0","1","2","3": 1 1 3 3 3 4 2 1 4 4 ...
$ SeasonDel : Factor w/ 4 levels "1","2","3","4": 1 2 4 3 4 3 4 3 2 3 ...
$ BMIM2 : num 23.4 25.7 17 26.6 24.6 21.6 21 22.3 20.8 20.7 ...
$ WgtB2 : int 3740 3615 3705 3485 3420 2775 3365 3770 3075 3000 ...
$ sex : Factor w/ 2 levels "1","2": 2 2 1 2 2 2 1 1 1 1 ...
$ smoke : Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 1 1 3 ...
$ nRBC : num 0.1621 0.0604 0.1935 0.0527 0.1118 ...
$ CD4T : num 0.1427 0.2143 0.1432 0.0686 0.0979 ...
$ CD8T : num 0.1574 0.1549 0.1243 0.0804 0.0782 ...
$ NK : num 0.02817 0 0.04368 0.00641 0.02398 ...
$ Bcell : num 0.1033 0.1124 0.1468 0.0551 0.0696 ...
$ Mono : num 0.0633 0.0641 0.0773 0.0531 0.0656 ...
$ Gran : num 0.428 0.442 0.329 0.716 0.6 ...
$ chip : Factor w/ 92 levels "200251580021",..: 12 24 23 2 27 22 6 22 17 22 ...
$ pos : Factor w/ 12 levels "R01C01","R01C02",..: 11 12 1 6 9 2 12 1 7 11 ...
$ trim1PM25ifdmv4: num 9.45 13.81 15.59 7.13 15.43 ...
$ trim2PM25ifdmv4: num 13.27 15.53 10.69 13.56 9.27 ...
$ trim3PM25ifdmv4: num 16.72 16.21 12.17 6.47 10.66 ...
As you can see, there are both continuous and categorical variables.
When I run chart.Correlation(variables, histogram=T, method = c("pearson"))
I get this error:
Error in pairs.default(x, gap = 0, lower.panel = panel.smooth, upper.panel = panel.cor, :
non-numeric argument to 'pairs'
How can I fix this?
Thank you.
I believe you want correlations only between the numerical variables. The code below does this and outputs each unique pair of inputs only once.
library(reshape2)
data <- data.frame(x1=rnorm(10),
x2=rnorm(10),
x3=rnorm(10),
x4=c("a","b","c","d","e","f","g","h","i","j"),
x5=c("ab","sp","sp","dd","hg","hj","qw","dh","ko","jk"))
data
x1 x2 x3 x4 x5
1 -1.2169793 0.5397598 0.4981513 a ab
2 -0.7032631 -2.1262837 -1.0377371 b sp
3 0.8766831 -0.2326975 -0.1219613 c sp
4 0.3405332 2.4766225 -1.1960618 d dd
5 0.1889945 0.3444534 1.9659062 e hg
6 0.8086956 0.4654644 -1.2526696 f hj
7 -0.6850181 -1.7657241 0.5156620 g qw
8 0.8518034 0.9484547 1.4784063 h dh
9 0.5191793 1.2246566 1.3867829 i ko
10 0.4568953 -0.6881464 0.3548839 j jk
# find correlations for all numeric columns (is.numeric also catches integer columns)
corr=cor(data[sapply(data,is.numeric)])
#convert the correlation table to long format
res=melt(corr)
##keeping only one side of the correlations
res$type=apply(res,1,function(x)
paste(sort(c(as.character(x[1]),as.character(x[2]))),collapse="*"))
res=unique(res[,c("type","value")])
res
type value
x1*x1 1.00000000
x1*x2 0.44024939
x1*x3 0.04936654
x2*x2 1.00000000
x2*x3 0.08859169
x3*x3 1.00000000
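The same idea can be sketched without reshape2, using upper.tri() to keep each pair once and drop the diagonal. The data frame and the names df, num and pairs are made up for illustration; the commented-out last line assumes the PerformanceAnalytics package is installed.

```r
set.seed(1)
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = 1:10,
                 g = letters[1:10])

# Keep numeric and integer columns only; factor/character columns are dropped.
num <- df[sapply(df, is.numeric)]
corr <- cor(num)

# Row/column positions of the strict upper triangle: one entry per pair.
idx <- which(upper.tri(corr), arr.ind = TRUE)
pairs <- data.frame(
  var1  = rownames(corr)[idx[, "row"]],
  var2  = colnames(corr)[idx[, "col"]],
  value = corr[idx]
)
pairs

# The numeric subset is also what chart.Correlation() needs:
# PerformanceAnalytics::chart.Correlation(num, histogram = TRUE, method = "pearson")
```

For categorical variables a Pearson correlation is not meaningful, so they are left out here rather than converted to numbers.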

Find mean from subset of one column based on ranking in the top 50 of another column

I have a data frame that has the following columns:
> str(wbr)
'data.frame': 214 obs. of 12 variables:
$ countrycode : Factor w/ 214 levels "ABW","ADO","AFG",..: 1 2 3 4 5 6 7 8 9 10 ...
$ countryname : Factor w/ 214 levels "Afghanistan",..: 10 5 1 6 2 202 8 9 4 7 ...
$ gdp_per_capita : num 19913 35628 415 2738 4091 ...
$ literacy_female : num 96.7 NA 17.6 59.1 95.7 ...
$ literacy_male : num 96.9 NA 45.4 82.5 98 ...
$ literacy_all : num 96.8 NA 31.7 70.6 96.8 ...
$ infant_mortality : num NA 2.2 70.2 101.6 13.3 ...
$ illiteracy_female: num 3.28 NA 82.39 40.85 4.31 ...
$ illiteracy_mele : num 3.06 NA 54.58 17.53 1.99 ...
$ illiteracy_male : num 3.06 NA 54.58 17.53 1.99 ...
$ illiteracy_all : num 3.18 NA 68.26 29.42 3.15 ...
I would like to find the mean of illiteracy_all from the top 50 countries with the highest GDP.
Before you answer, note that the data frame has NA values, so to find the mean I have to write:
mean(wbr$illiteracy_all, na.rm=TRUE)
For a reproducible example, let's take:
data.df <- data.frame(x=101:120, y=rep(c(1,2,3,NA), times=5))
So how could I average the y values for e.g. the top 5 values of x?
> data.df
x y
1 101 1
2 102 2
3 103 3
4 104 NA
5 105 1
6 106 2
7 107 3
8 108 NA
9 109 1
10 110 2
11 111 3
12 112 NA
13 113 1
14 114 2
15 115 3
16 116 NA
17 117 1
18 118 2
19 119 3
20 120 NA
Any of the following would work:
mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
To unpack why this works, note first that rank gives ranks in a different order to what you might expect, 1 being the rank of the smallest number not the largest:
> rank(data.df$x)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
We can get round that by negating the input:
> rank(-data.df$x)
[1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
So now ranks 1 to 5 are the "top 5". If we want a vector of TRUE and FALSE to indicate the position of the top 5 we can use:
> rank(-data.df$x)<=5
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
(In reality you might find you have some ties in your data set. This is only going to cause issues if the 50th position is tied. You might want to have a look at the ties.method argument for rank to see how you want to handle this.)
So let's grab the values of y in those positions:
> data.df[rank(-data.df$x)<=5,"y"]
[1] NA 1 2 3 NA
Or you could use:
> data.df$y[rank(-data.df$x)<=5]
[1] NA 1 2 3 NA
So now we know what to input into mean:
> mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
[1] 2
Or:
> mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
[1] 2
Or if you don't like repeating the name of the data frame, use with:
> with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
[1] 2
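An alternative sketch uses order() and head() instead of rank(). It always returns exactly n rows, so a tie at the cutoff cannot pull in extra rows (which of the tied rows is kept then depends on their order in the data frame). The commented-out line shows how this might look for the wbr data from the question, which is not reproduced here.

```r
data.df <- data.frame(x = 101:120, y = rep(c(1, 2, 3, NA), times = 5))

# Sort by x, largest first, and keep the first 5 rows.
top5 <- head(data.df[order(data.df$x, decreasing = TRUE), ], 5)
top5

# Average y over those rows, ignoring NA.
result <- mean(top5$y, na.rm = TRUE)
result

# Same pattern for the question's data (assumes wbr is loaded):
# mean(head(wbr[order(wbr$gdp_per_capita, decreasing = TRUE), ], 50)$illiteracy_all, na.rm = TRUE)
```

One thing to keep in mind: order() places NA values of the sort column last, so countries with missing GDP are not counted among the top rows unless fewer than n non-missing values exist.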
