R, extracting value from one attribute based on another attribute

I've run some analysis that outputs data in the following format:
> sft
Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
1 1 0.35400 8.4300 0.7710 146.00 145.00 166.0
2 2 0.21900 2.2500 0.8960 83.30 82.80 107.0
3 3 0.17300 1.1600 0.9310 49.90 49.80 72.0
4 4 0.04100 0.3070 0.7360 31.60 31.20 50.3
5 5 0.00165 -0.0298 0.4610 21.30 21.00 37.3
6 6 0.05310 -0.1780 -0.1240 15.30 14.60 28.9
7 7 0.21300 -0.2610 -0.0113 11.60 10.90 24.0
8 8 0.63800 -0.5280 0.5560 9.27 8.18 22.3
9 9 0.82500 -0.6310 0.8110 7.69 6.14 21.2
10 10 0.85000 -0.7400 0.8100 6.59 4.97 20.3
11 11 0.82200 -0.8310 0.7710 5.77 3.95 19.6
12 12 0.81900 -0.8480 0.7680 5.16 3.27 19.0
13 13 0.73300 -0.8670 0.6660 4.67 2.80 18.4
14 14 0.65300 -0.9170 0.5840 4.28 2.39 17.9
15 15 0.70200 -0.9130 0.6440 3.97 2.22 17.4
What I want is to extract the Power that gave the highest (maximum) SFT.R.sq value.
Here is the table's attributes:
>str(sft)
List of 2
$ powerEstimate: int NA
$ fitIndices :'data.frame': 15 obs. of 7 variables:
..$ Power : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
..$ SFT.R.sq : num [1:15] 0.35392 0.21883 0.17291 0.04098 0.00165 ...
..$ slope : num [1:15] 8.4267 2.2461 1.158 0.307 -0.0298 ...
..$ truncated.R.sq: num [1:15] 0.771 0.896 0.931 0.736 0.461 ...
..$ mean.k. : num [1:15] 145.8 83.3 49.9 31.6 21.3 ...
..$ median.k. : num [1:15] 145.1 82.8 49.8 31.2 21 ...
..$ max.k. : num [1:15] 165.6 107.1 72 50.3 37.3 ...
I can grab the two columns I need easily with:
sft$fitIndices$Power
sft$fitIndices$SFT.R.sq
But I can't get the actual power associated with the highest SFT.R.sq value:
>sft$fitIndices$Power[max(sft$fitIndices$SFT.R.sq)]
integer(0)
Examples of what I'm trying to do usually involve dataframes where you extract a value based on the value from another column - but it doesn't seem to work with attributes.

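max() returns the largest value itself (0.85 here), not its position, so it cannot serve as an index. We need which.max, which returns the position of the maximum, to subset 'Power':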
sft$fitIndices$Power[which.max(sft$fitIndices$SFT.R.sq)]
Also, if we need the whole row, extract the data.frame element and slice it (order_by takes the bare column name, not a string):
library(dplyr)
library(purrr)
pluck(sft, "fitIndices") %>%
  slice_max(order_by = SFT.R.sq, n = 1)
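To see the difference between the two indexing attempts, here is a minimal sketch on a toy stand-in for sft$fitIndices. The object names fit and sft_demo are made up for illustration; the four rows are copied from the table in the question.

```r
# Toy stand-in for the question's object (rows 8-11 of the printed table)
fit <- data.frame(Power = 8:11, SFT.R.sq = c(0.638, 0.825, 0.850, 0.822))
sft_demo <- list(powerEstimate = NA_integer_, fitIndices = fit)

# max() gives the value 0.85; used as an index it is truncated to 0,
# and indexing with 0 selects nothing
failed <- sft_demo$fitIndices$Power[max(sft_demo$fitIndices$SFT.R.sq)]
length(failed)   # 0, i.e. integer(0)

# which.max() gives the position of the maximum, which is a valid index
best <- which.max(sft_demo$fitIndices$SFT.R.sq)
sft_demo$fitIndices$Power[best]   # 10

# The same position also selects the full row
sft_demo$fitIndices[best, ]
```

Note that which.max returns only the first position if the maximum is tied.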

Related

Can we use as.factor to convert categorical variables having multiple levels for decision tree or we need to use model.matrix?

I am trying to build a decision tree model in R with both categorical and numerical variables. Some categorical variables have 3 levels, so can I just use as.factor and then use them in my model? I tried model.matrix, but my concern is that model.matrix converts the variable into numeric 0/1 columns and the splitting then happens on those numeric values. For example, if Color has 3 levels (blue, red, green), the splitting rule will look like color_green < 0.5, even though the column only ever takes the values 0 and 1.
If you are asking whether you can use factors to build an rpart decision tree, then yes. See the example below from the documentation, where the splits on Country and Tires are expressed as sets of factor levels. Note that there are many other packages for decision trees.
library(rpart)
rpart(Reliability ~ ., data=car90)
#> n=76 (35 observations deleted due to missingness)
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 76 53 average (0.2 0.12 0.3 0.11 0.28)
#> 2) Country=Germany,Korea,Mexico,Sweden,USA 49 29 average (0.31 0.18 0.41 0.1 0)
#> 4) Tires=145,155/80,165/80,185/80,195/60,195/65,195/70,205/60,215/65,225/75,275/40 17 9 Much worse (0.47 0.29 0 0.24 0) *
#> 5) Tires=175/70,185/65,185/70,185/75,195/75,205/70,205/75,215/70 32 12 average (0.22 0.12 0.62 0.031 0)
#> 10) HP.revs< 4650 13 7 Much worse (0.46 0.23 0.31 0 0) *
#> 11) HP.revs>=4650 19 3 average (0.053 0.053 0.84 0.053 0) *
#> 3) Country=Japan,Japan/USA 27 6 Much better (0 0 0.11 0.11 0.78) *
str(car90)
#> 'data.frame': 111 obs. of 34 variables:
#> $ Country : Factor w/ 10 levels "Brazil","England",..: 5 5 4 4 4 4 10 10 10 NA ...
#> $ Disp : num 112 163 141 121 152 209 151 231 231 189 ...
#> $ Disp2 : num 1.8 2.7 2.3 2 2.5 3.5 2.5 3.8 3.8 3.1 ...
#> $ Eng.Rev : num 2935 2505 2775 2835 2625 ...
#> $ Front.Hd : num 3.5 2 2.5 4 2 3 4 6 5 5.5 ...
#> $ Frt.Leg.Room: num 41.5 41.5 41.5 42 42 42 42 42 41 41 ...
#> $ Frt.Shld : num 53 55.5 56.5 52.5 52 54.5 56.5 58.5 59 58 ...
#> $ Gear.Ratio : num 3.26 2.95 3.27 3.25 3.02 2.8 NA NA NA NA ...
#> $ Gear2 : num 3.21 3.02 3.25 3.25 2.99 2.85 2.84 1.99 1.99 2.33 ...
#> $ HP : num 130 160 130 108 168 208 110 165 165 101 ...
#> $ HP.revs : num 6000 5900 5500 5300 5800 5700 5200 4800 4800 4400 ...
#> $ Height : num 47.5 50 51.5 50.5 49.5 51 49.5 50.5 51 50.5 ...
#> $ Length : num 177 191 193 176 175 186 189 197 197 192 ...
#> $ Luggage : num 16 14 17 10 12 12 16 16 16 15 ...
#> $ Mileage : num NA 20 NA 27 NA NA 21 NA 23 NA ...
#> $ Model2 : Factor w/ 21 levels ""," Turbo 4 (3)",..: 1 1 1 1 1 1 1 14 13 1 ...
#> $ Price : num 11950 24760 26900 18900 24650 ...
#> $ Rear.Hd : num 1.5 2 3 1 1 2.5 2.5 4.5 3.5 3.5 ...
#> $ Rear.Seating: num 26.5 28.5 31 28 25.5 27 28 30.5 28.5 27.5 ...
#> $ RearShld : num 52 55.5 55 52 51.5 55.5 56 58.5 58.5 56.5 ...
#> $ Reliability : Ord.factor w/ 5 levels "Much worse"<"worse"<..: 5 5 NA NA 4 NA 3 3 3 NA ...
#> $ Rim : Factor w/ 6 levels "R12","R13","R14",..: 3 4 4 3 3 4 3 3 3 3 ...
#> $ Sratio.m : num NA NA NA NA NA NA NA NA NA NA ...
#> $ Sratio.p : num 0.86 0.96 0.97 0.71 0.88 0.78 0.76 0.83 0.87 0.88 ...
#> $ Steering : Factor w/ 3 levels "manual","power",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ Tank : num 13.2 18 21.1 15.9 16.4 21.1 15.7 18 18 16.5 ...
#> $ Tires : Factor w/ 30 levels "145","145/80",..: 16 20 20 8 17 28 13 23 23 22 ...
#> $ Trans1 : Factor w/ 4 levels "","man.4","man.5",..: 3 3 3 3 3 3 1 1 1 1 ...
#> $ Trans2 : Factor w/ 4 levels "","auto.3","auto.4",..: 3 3 2 2 3 3 2 3 3 3 ...
#> $ Turning : num 37 42 39 35 35 39 41 43 42 41 ...
#> $ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 3 1 1 3 3 2 2 NA ...
#> $ Weight : num 2700 3265 2935 2670 2895 ...
#> $ Wheel.base : num 102 109 106 100 101 109 105 111 111 108 ...
#> $ Width : num 67 69 71 67 65 69 69 72 72 71 ...
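As a smaller illustration of the same point, the sketch below converts a three-level variable with as.factor and passes it straight to rpart. Using mtcars and the column choice here are assumptions made for the example, not part of the original question.

```r
library(rpart)

# mtcars ships with R; cyl has three distinct values (4, 6, 8)
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# The factor goes into the formula directly, no dummy coding needed
fit <- rpart(mpg ~ cyl + wt + hp, data = cars, method = "anova")
print(fit)

# For comparison, model.matrix() expands the factor into 0/1 columns
# (an intercept plus one indicator per non-reference level)
dummies <- model.matrix(~ cyl, data = cars)
head(dummies)
```

When a factor is used, any split on it is reported as a set of levels on each side rather than as a threshold such as color_green < 0.5.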

Why is distance matrix (dist()) giving empty values for data sets having more than ~50 observations?

I have a data set for which I'm calculating its distance matrix. Below is the data, which has 251 observations.
> str(mydata)
'data.frame': 251 obs. of 7 variables:
$ BodyFat: num 12.3 6.1 25.3 10.4 28.7 20.9 19.2 12.4 4.1 11.7 ...
$ Weight : num 154 173 154 185 184 ...
$ Chest : num 93.1 93.6 95.8 101.8 97.3 ...
$ Abdomen: num 85.2 83 87.9 86.4 100 94.4 90.7 88.5 82.5 88.6 ...
$ Hip : num 94.5 98.7 99.2 101.2 101.9 ...
$ Thigh : num 59 58.7 59.6 60.1 63.2 66 58.4 60 62.9 63.1 ...
$ Biceps : num 32 30.5 28.8 32.4 32.2 35.7 31.9 30.5 35.9 35.6 ...
I normalize the data.
means = apply(mydata,2,mean)
sds = apply(mydata,2,sd)
nor = scale(mydata,center=means,scale=sds)
When I calculate the distance matrix, I see a lot of empty values, and distances seem to be computed for only 4 observations.
distance =dist(nor)
> str(distance)
'dist' num [1:31375] 1.33 2.09 1.9 3.08 3.99 ...
- attr(*, "Size")= int 251
- attr(*, "Labels")= chr [1:251] "1" "2" "3" "4" ...
- attr(*, "Diag")= logi FALSE
- attr(*, "Upper")= logi FALSE
- attr(*, "method")= chr "euclidean"
- attr(*, "call")= language dist(x = nor)
> distance # o/p omitted from this post as it has 257 observations.
1 2 3 4 5 6 7
2 1.3346445
3 2.0854437 2.5474796
4 1.8993458 1.4908813 2.5840752
5 3.0790252 3.4485667 2.2165366 2.7021809
8 9 10 11 12 13 14
2
3
4
5
15 16 17 18 19 20 21
This list goes on empty for the remaining 247 comparisons.
Now, I reduce the data set to 20 observations
Here I get a proper distance matrix.
distancetiny=dist(nor)
> str(distancetiny)
'dist' num [1:1176] 1.14 1.8 1.61 2.62 3.39 ...
- attr(*, "Size")= int 49
- attr(*, "Labels")= chr [1:49] "1" "2" "3" "4" ...
- attr(*, "Diag")= logi FALSE
- attr(*, "Upper")= logi FALSE
- attr(*, "method")= chr "euclidean"
- attr(*, "call")= language dist(x = nor)
> distancetiny
1 2 3 4 5 6 7
2 1.1380433
3 1.7990293 2.2088928
4 1.6064118 1.2871522 2.2483586
5 2.6235853 2.9669283 1.9132224 2.3256624
6 3.3898119 3.3730508 3.3718447 2.2615557 2.0094434
7 1.8947704 2.0065514 1.7685604 1.1065940 1.7387938 2.2321156
8 1.1732465 1.0663217 1.6733689 0.8873140 2.1959298 2.7939555 1.1448269
9 2.2721969 2.0545882 3.4263262 1.4058375 3.1811955 2.4011074 2.3078714
10 2.3753110 2.2424464 3.0289947 1.2808398 2.3230202 1.4242653 1.8571654
11 1.5620472 1.1878554 2.5750350 0.5718248 2.7714795 2.6314286 1.5132365
12 3.5088571 3.2484020 4.1164488 2.2723772 3.1377318 1.4795230 2.8274818
13 2.1448841 2.2679705 1.8726670 1.3494988 1.2176727 1.5544030 1.0725518
14 3.6679035 3.7459402 3.6869023 2.6677308 2.1318420 0.7347359 2.5729973
15 2.9908457 3.3312661 3.1289870 2.4340473 1.8027070 1.3626019 2.3795360
16 1.6117570 2.0283356 1.2011116 1.5961064 1.3196981 2.4456436 1.2569683
17 3.2991393 3.5991747 3.0438049 2.6066933 1.4742664 1.0945621 2.2214101
18 3.9409008 4.0726826 4.0113908 2.9250144 2.5228901 0.9087254 2.8158563
19 2.7468511 2.9495031 3.2439229 1.8312508 2.4122436 1.3932604 1.9640170
20 3.7515064 3.7021743 3.9404231 2.5813440 2.5390519 0.8352961 2.6530503
21 2.3102053 2.3878491 2.0836800 1.4328028 1.2991221 1.5287862 1.1769205
There are no empty values in the output when there are 21 observations.
Why is this so? Does dist() stop working when the number of observations goes beyond a threshold?
I'm unable to figure it out. Please help.
This seems to be a display issue related to size. When the dataset contains more than roughly 60-80 observations, the distance matrix is not displayed properly (even for the initial rows). The values themselves appear to be present and correct; they are just not visible in the printed output.
Further operations on the distance matrix (such as hierarchical agglomerative clustering) confirmed that the odd display is nothing to worry about.
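One way to check that the values are really there is to inspect the object without relying on its printed form. The sketch below uses simulated data of the same shape (251 rows, 7 columns) as an assumption, since the original data set is not available. A likely reading of the printed output is that a dist object prints only its lower triangle, so once the columns wrap into blocks the first rows are blank under columns 8 and above, and long output may be cut off at the limit set by getOption("max.print").

```r
set.seed(1)

# Simulated stand-in for mydata: 251 observations of 7 numeric variables
x <- matrix(rnorm(251 * 7), nrow = 251)
nor <- scale(x)          # centers by column mean, scales by column sd
d <- dist(nor)

# Number of pairwise distances: 251 * 250 / 2
length(d)                # 31375

# No missing values are stored in the object
anyNA(d)                 # FALSE

# A full square matrix is easier to read than the printed triangle
m <- as.matrix(d)
m[1:5, 1:5]

# Downstream use, e.g. hierarchical clustering, works on the dist object
hc <- hclust(d)
```

The count of 31375 matches the length reported by str(distance) in the question, which suggests all pairs were computed.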

How to apply dist_google (from stplanr package) to a list of data frames?

I am a beginner with the stplanr package. I have split a large data frame of long/lat points into chunks of 25 rows, because the dist_google function can be applied to at most 25 origin-destination pairs. Here is the original large data frame:
GPSLatitude GPSLongitude
1 40.66126 22.89565
2 40.66127 22.89565
3 40.66128 22.89565
4 40.66130 22.89566
5 40.66131 22.89567
6 40.66132 22.89569
7 40.66134 22.89573
8 40.66136 22.89577
9 40.66137 22.89582
10 40.66141 22.89594
11 40.66142 22.89601
12 40.66145 22.89609
13 40.66147 22.89618
14 40.66150 22.89627
15 40.66152 22.89635
16 40.66155 22.89644
17 40.66160 22.89650
18 40.66165 22.89654
19 40.66172 22.89656
20 40.66178 22.89658
21 40.66186 22.89659
22 40.66193 22.89660
23 40.66200 22.89662
24 40.66207 22.89663
25 40.66213 22.89664
26 40.66218 22.89665
27 40.66223 22.89665
28 40.66227 22.89664
29 40.66230 22.89663
30 40.66234 22.89662
31 40.66238 22.89661
32 40.66242 22.89662
33 40.66244 22.89664
34 40.66245 22.89666
35 40.66247 22.89669
36 40.66248 22.89671
37 40.66249 22.89673
38 40.66250 22.89674
39 40.66251 22.89676
40 40.66253 22.89679
41 40.66255 22.89683
42 40.66257 22.89686
43 40.66261 22.89694
44 40.66263 22.89698
45 40.66265 22.89700
46 40.66267 22.89702
47 40.66268 22.89705
48 40.66270 22.89707
49 40.66272 22.89709
50 40.66273 22.89710
51 40.66274 22.89711
52 40.66275 22.89712
53 40.66275 22.89714
54 40.66276 22.89716
55 40.66276 22.89718
56 40.66276 22.89721
57 40.66275 22.89725
58 40.66273 22.89728
Then, I split this data frame into chunks of 25 rows with the following command:
pointssplit <- split(pointsdf, (seq_len(nrow(pointsdf)) - 1) %/% 25)
Finally, I have the following list of the smaller data frames:
List of 3
$ 0:'data.frame': 25 obs. of 2 variables:
..$ GPSLatitude : num [1:25] 40.7 40.7 40.7 40.7 40.7 ...
..$ GPSLongitude: num [1:25] 22.9 22.9 22.9 22.9 22.9 ...
$ 1:'data.frame': 25 obs. of 2 variables:
..$ GPSLatitude : num [1:25] 40.7 40.7 40.7 40.7 40.7 ...
..$ GPSLongitude: num [1:25] 22.9 22.9 22.9 22.9 22.9 ...
$ 2:'data.frame': 8 obs. of 2 variables:
..$ GPSLatitude : num [1:8] 40.7 40.7 40.7 40.7 40.7 ...
..$ GPSLongitude: num [1:8] 22.9 22.9 22.9 22.9 22.9 ...
I've tried to use lapply to apply the dist_google() function:
lapply(length(pointssplit), dist_google(from = point2, to = pointssplit, mode = "driving", google_api = "my api key")) #point2 is my reference point
The problem is that I don't know how to handle "to" inside dist_google() so that it receives the long/lat from each data frame separately, so I get an Error in match.fun(FUN).
Any ideas? Thank you in advance.
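The match.fun(FUN) error most likely arises because lapply expects the list as its first argument and a function as its second, whereas the call above passes a number and the result of a function call. A minimal sketch of the intended pattern follows. Here fake_dist is a hypothetical stand-in for dist_google (so the example runs without an API key), and the coordinates and point2 are made-up values.

```r
# Hypothetical stand-in for dist_google(): it only reports what it received
fake_dist <- function(from, to, mode = "driving") {
  data.frame(n_destinations = nrow(to), mode = mode)
}

# Made-up points with the same shape as the question's data (58 rows)
pointsdf <- data.frame(
  GPSLatitude  = seq(40.66126, 40.66273, length.out = 58),
  GPSLongitude = seq(22.89565, 22.89728, length.out = 58)
)
pointssplit <- split(pointsdf, (seq_len(nrow(pointsdf)) - 1) %/% 25)
point2 <- c(22.89565, 40.66126)   # hypothetical reference point

# Pass the list itself, plus a function that receives one chunk at a time
results <- lapply(pointssplit, function(chunk) {
  fake_dist(from = point2, to = chunk, mode = "driving")
})

# Stack the per-chunk results into one data frame
combined <- do.call(rbind, results)
combined$n_destinations   # 25 25 8
```

With the real function, the body of the anonymous function would call dist_google with to = chunk and the API key, assuming the installed stplanr version still provides it.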

Summary of a Subset in R does not work - Why?

I am doing the Analytics Edge course on EdX and ran into this problem. We have a dataset which we are subsetting. Running str() on the subset works as intended, but calling Summary() on the same subset throws an error. Can someone explain why?
> str(WHO_Europe)
'data.frame': 53 obs. of 13 variables:
$ Country : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
$ Region : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Population : int 3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
$ Under15 : num 21.3 15.2 20.3 14.5 22.2 ...
$ Over60 : num 14.93 22.86 14.06 23.52 8.24 ...
$ FertilityRate : num 1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
$ LifeExpectancy : int 74 82 71 81 71 71 80 76 74 77 ...
$ ChildMortality : num 16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
$ CellularSubscribers : num 96.4 75.5 103.6 154.8 108.8 ...
$ LiteracyRate : num NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
$ GNI : num 8820 NA 6100 42050 8960 ...
$ PrimarySchoolEnrollmentMale : num NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
$ PrimarySchoolEnrollmentFemale: num NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
> Summary(WHO_Europe)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘Summary’ for signature ‘"data.frame"’
> write.csv(WHO_Europe,"WHO_Europe.CSV")
> Summary(WHO_Europe)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘Summary’ for signature ‘"data.frame"’
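A likely cause is capitalization: R is case-sensitive, and Summary with a capital S is a different object (a group generic for functions such as max and range) from the lowercase summary function used for data frames. A minimal sketch, using a small made-up data frame in place of WHO_Europe:

```r
# Small made-up stand-in for the WHO_Europe subset
df <- data.frame(
  Population = c(3162L, 78L, 2969L),
  Under15    = c(21.3, 15.2, 20.3)
)

str(df)

# Lowercase summary() has a method for data frames
s <- summary(df)
print(s)
```

Writing the subset to CSV, as in the transcript above, does not change this; only the function name matters.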

Subsetting by rows to do a correlation

I created a data frame from another dataset with 332 IDs. I split the data frame by ID and would like to count the rows for each ID and then run a correlation function. Can someone tell me how to count the rows of each ID so that I can compute a correlation for each of these individual groups?
jlhoward, your suggestion to add the "table(dat1$ID)" command worked. My other problem is that the function will not stop running:
corr<-function(directory,threshold=)
####### file location path#####
for(i in 1:332){dat<-rbind(dat,read.csv(specdata1[i]))
dat1<-dat[complete.cases(dat),]
dat2<-(split(dat1,dat1$ID))
list(dat2)
dat3<-table(dat1$ID)
for (i in dat1>=threshold){
x<-dat1$sulfate
y<-dat1$nitrate
correlations<-cor(x,y,use="pairwise.complete.obs",method="pearson")
corrs_output<-c(corrs_output,correlations)
}
I'm trying to correlate "sulfate" and "nitrate" for each monitor ID that meets a threshold. I created a list that has all the complete cases per monitor ID. I need the function to compute the correlation of "sulfate" and "nitrate" for every per-ID set whose number of complete cases is >= the threshold argument of the function. Below are the head and tail of the structure of the data.frame/list of each data set within the main data set "specdata1".
head of entire data.frame/list of specdata1 complete cases for correlation
head(str(dat2,1))
List of 323
$ 1 :'data.frame': 117 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 279 285 291 297 303 315 321 327 333 339 ...
..$ sulfate: num [1:117] 7.21 5.99 4.68 3.47 2.42 1.43 2.76 3.41 1.3 3.15 ...
..$ nitrate: num [1:117] 0.651 0.428 1.04 0.363 0.507 0.474 0.425 0.964 0.491 0.669 ...
..$ ID : int [1:117] 1 1 1 1 1 1 1 1 1 1 ...
tail of entire data.frame/list for all complete cases of specdata1
tail(str(dat2,1))
$ 99 :'data.frame': 479 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 1774 1780 1786 1804 1810 1816 1822 1840 1852 1858 ...
..$ sulfate: num [1:479] 1.51 8.2 1.48 4.75 3.47 1.19 1.77 2.27 2.06 2.11 ...
..$ nitrate: num [1:479] 0.725 1.64 1.01 6.81 0.751 1.69 2.08 0.996 0.817 0.488 ...
..$ ID : int [1:479] 99 99 99 99 99 99 99 99 99 99 ...
[list output truncated]
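In the function above, the loop for (i in dat1 >= threshold) iterates over every cell of a logical matrix and recomputes the correlation on the whole of dat1 each time, which probably explains why it appears never to finish. A minimal sketch of one alternative is shown below: count complete cases per ID with table, keep the IDs that meet the threshold, and correlate within each group. The function name corr_by_id and the simulated data are assumptions for illustration; reading the 332 files is left out because the file path is not given.

```r
set.seed(42)

# Simulated stand-in for the combined monitor data (3 IDs instead of 332)
dat <- data.frame(
  ID      = rep(1:3, times = c(5, 30, 60)),
  sulfate = rnorm(95),
  nitrate = rnorm(95)
)
dat$nitrate[c(2, 10)] <- NA   # a couple of incomplete rows

corr_by_id <- function(dat, threshold = 0) {
  complete <- dat[complete.cases(dat), ]          # drop rows with any NA
  counts <- table(complete$ID)                    # complete cases per ID
  keep <- names(counts)[counts >= threshold]      # IDs meeting the threshold
  groups <- split(complete, complete$ID)[keep]    # one data frame per kept ID
  # One correlation per group, returned as a named numeric vector
  vapply(groups, function(g) cor(g$sulfate, g$nitrate), numeric(1))
}

res <- corr_by_id(dat, threshold = 20)
res
```

With these simulated data, ID 1 has only 4 complete cases and is dropped at threshold = 20, so the result has one correlation each for IDs 2 and 3. A threshold higher than every count returns an empty numeric vector.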
