Help on subsetting a dataframe

I am using %in% for subsetting and I came across a strange result.
> my.data[my.data$V3 %in% seq(200,210,.01),]
V1 V2 V3 V4 V5 V6 V7
56 470 48.7 209.73 yes 26.3 54 470
That was correct. But when I widen the range... row 56 just disappears
> my.data[my.data$V3 %in% seq(150,210,.01),]
V1 V2 V3 V4 V5 V6 V7
51 458 48.7 156.19 yes 28.2 58 458
67 511 30.5 150.54 yes 26.1 86 511
73 535 40.6 178.76 yes 29.5 73 535
Can you tell me what's wrong?
Is there a better way to subset the dataframe?
Here is its structure
> str(my.data)
'data.frame': 91 obs. of 7 variables:
$ V1: Factor w/ 91 levels "100","10004",..: 1 2 3 4 5 6 7 8 9 10 ...
$ V2: num 44.6 22.3 30.4 38.6 15.2 18.3 16.3 12.2 36.7 12.2 ...
$ V3: num 110.83 25.03 17.17 57.23 2.18 ...
$ V4: Factor w/ 2 levels "no","yes": 1 2 2 2 1 1 1 1 1 1 ...
$ V5: num 22.3 30.5 24.4 25.5 4.1 28.4 7.9 5.1 24 12.2 ...
$ V6: int 50 137 80 66 27 155 48 42 65 100 ...
$ V7: chr "" "10004" "10005" "10012" ...

Ooops. You are trying to do exact matching on a computer that can't represent all numbers exactly.
> any(209.73 == seq(200,210,.01))
[1] TRUE
> any(209.73 == seq(150,210,.01))
[1] FALSE
> any(209.73 == zapsmall(seq(150,210,.01)))
[1] TRUE
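The discrepancy arises because, in the second sequence, the element that prints as 209.73 is not stored as exactly 209.73. seq() computes each element with floating-point arithmetic from the start point, so a different start point can give a slightly different binary result, and %in% (like ==) only matches on exact equality. This is something you have to keep in mind whenever you compute with floating-point numbers.
This is covered in many places on the web; for R specifically, see point 7.31 in the R FAQ.
That said, you are going about the problem the wrong way. For a range test, use the numeric comparison operators: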
my.data[my.data$V3 >= 150 & my.data$V3 <= 210, ]
## or
subset(my.data, V3 >= 150 & V3 <= 210)
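To see both points side by side, here is a minimal sketch on a made-up data frame (the name `df` and its values are hypothetical, not the asker's data). It shows exact equality failing on a computed decimal, a tolerance-based comparison, and the range filter:

```r
# Exact equality on computed decimals is unreliable
0.1 + 0.2 == 0.3                    # FALSE on IEEE 754 doubles
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: compares within a small tolerance

# Hypothetical data, for illustration only
df <- data.frame(id = 1:4, V3 = c(156.19, 209.73, 110.83, 178.76))

# Range filter: no exact matching involved
in_range <- df[df$V3 >= 150 & df$V3 <= 210, ]
in_range$id                         # 1 2 4

# If one specific value really is needed, compare within a tolerance
near_value <- df[abs(df$V3 - 209.73) < 1e-8, ]
```

The tolerance of 1e-8 is an arbitrary choice for this sketch; pick one that suits the precision of your data.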

Related

What is the best way to sample data based on date range?

I have a weather dataset running from 01 Nov 2007 to 18 May 2008, and the data is date-dependent.
I want to predict the temperature from 07 May 2008 to 18 May 2008 (maybe 10-15 observations in total). The dataset has about 200 rows.
I will be using a decision tree/RF, SVM, and NN to make my predictions.
I've never handled data like this, so I'm not sure how to sample it. If we ignore the bias factor, can I sample training data from 01 Nov 2007 to 18 May 2008 and test data from 07 May 2008 to 18 May 2008, or is there a better way to handle this? Or would it be better to first sort my data by date, split the ordered data 80:20 into training and test sets, and then just output the required dates?
install.packages("rattle")
install.packages("RGtk2")
library("rattle")
seed <- 42
set.seed(seed)
fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")
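library("janitor")  # convert_to_date() is assumed to come from the janitor package
dataset$Date <- convert_to_date(dataset$Date)
dataset <- dataset[order(dataset$Date), ]  # Date is already a Date; "%Y/%M/%D" is not a valid date format (%M is minutes)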
dataset <- dataset[1:200,]
str(dataset)
> str(dataset)
'data.frame': 200 obs. of 24 variables:
$ Date : Date, format: "2007-11-01" "2007-11-02" "2007-11-03" ...
$ Location : chr "Canberra" "Canberra" "Canberra" "Canberra" ...
$ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 8.8 8.4 ...
$ MaxTemp : num 24.3 26.9 23.4 15.5 16.1 16.9 18.2 17 19.5 22.8 ...
$ Rainfall : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 ...
$ Evaporation : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6 4 5.4 ...
$ Sunshine : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4.6 4.1 7.7 ...
$ WindGustDir : chr "NW" "ENE" "NW" "NW" ...
$ WindGustSpeed: int 30 39 85 54 50 44 43 41 48 31 ...
$ WindDir9am : chr "SW" "E" "N" "WNW" ...
$ WindDir3pm : chr "NW" "W" "NNE" "W" ...
$ WindSpeed9am : int 6 4 6 30 20 20 19 11 19 7 ...
$ WindSpeed3pm : int 20 17 6 24 28 24 26 24 17 6 ...
$ Humidity9am : int 68 80 82 62 68 70 63 65 70 82 ...
$ Humidity3pm : int 29 36 69 56 49 57 47 57 48 32 ...
$ Pressure9am : num 1020 1012 1010 1006 1018 ...
$ Pressure3pm : num 1015 1008 1007 1007 1018 ...
$ Cloud9am : int 7 5 8 2 7 7 4 6 7 7 ...
$ Cloud3pm : int 7 3 7 7 7 5 6 7 7 1 ...
$ Temp9am : num 14.4 17.5 15.4 13.5 11.1 10.9 12.4 12.1 14.1 13.3 ...
$ Temp3pm : num 23.6 25.7 20.2 14.1 15.4 14.8 17.3 15.5 18.9 21.7 ...
$ RainToday : chr "No" "Yes" "Yes" "Yes" ...
$ RISK_MM : num 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 0 ...
$ RainTomorrow : chr "Yes" "Yes" "Yes" "Yes" ...
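One common way to handle date-dependent data is to split by time, so that the test period is not also part of the training set. A minimal sketch, assuming a hypothetical data frame `weather` with one row per day over the period in the question (the MaxTemp values are random placeholders, not the rattle data):

```r
# Hypothetical stand-in for the weather data: one row per day
dates <- seq(as.Date("2007-11-01"), as.Date("2008-05-18"), by = "day")
set.seed(42)
weather <- data.frame(
  Date = dates,
  MaxTemp = rnorm(length(dates), mean = 20, sd = 5)  # placeholder values
)

# Everything before the cutoff is training data; the target window is test data
cutoff <- as.Date("2008-05-07")
train <- weather[weather$Date < cutoff, ]
test  <- weather[weather$Date >= cutoff & weather$Date <= as.Date("2008-05-18"), ]

nrow(train)
nrow(test)
```

Keeping the two date ranges disjoint means the model is never evaluated on days it was trained on. Whether a single cutoff or a rolling scheme is more appropriate depends on the models and the amount of data.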

Can we use as.factor to convert categorical variables having multiple levels for decision tree or we need to use model.matrix?

I am trying to build a decision tree model in R with both categorical and numerical variables. Some categorical variables have 3 levels, so can I just use as.factor and then use them in my model? I tried model.matrix, but my concern is that it converts the variable into numeric 0/1 columns and the splitting then happens on those numeric values. For example, if Color has 3 levels (blue, red, green), the splitting rule will look like color_green < 0.5, whereas the variable should only ever take the values 0 and 1.

If you are asking whether you can use factors to build an rpart decision tree, then yes. See the example below, from the documentation. Note that there are many other packages for decision trees.
library(rpart)
rpart(Reliability ~ ., data=car90)
#> n=76 (35 observations deleted due to missingness)
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 76 53 average (0.2 0.12 0.3 0.11 0.28)
#> 2) Country=Germany,Korea,Mexico,Sweden,USA 49 29 average (0.31 0.18 0.41 0.1 0)
#> 4) Tires=145,155/80,165/80,185/80,195/60,195/65,195/70,205/60,215/65,225/75,275/40 17 9 Much worse (0.47 0.29 0 0.24 0) *
#> 5) Tires=175/70,185/65,185/70,185/75,195/75,205/70,205/75,215/70 32 12 average (0.22 0.12 0.62 0.031 0)
#> 10) HP.revs< 4650 13 7 Much worse (0.46 0.23 0.31 0 0) *
#> 11) HP.revs>=4650 19 3 average (0.053 0.053 0.84 0.053 0) *
#> 3) Country=Japan,Japan/USA 27 6 Much better (0 0 0.11 0.11 0.78) *
str(car90)
#> 'data.frame': 111 obs. of 34 variables:
#> $ Country : Factor w/ 10 levels "Brazil","England",..: 5 5 4 4 4 4 10 10 10 NA ...
#> $ Disp : num 112 163 141 121 152 209 151 231 231 189 ...
#> $ Disp2 : num 1.8 2.7 2.3 2 2.5 3.5 2.5 3.8 3.8 3.1 ...
#> $ Eng.Rev : num 2935 2505 2775 2835 2625 ...
#> $ Front.Hd : num 3.5 2 2.5 4 2 3 4 6 5 5.5 ...
#> $ Frt.Leg.Room: num 41.5 41.5 41.5 42 42 42 42 42 41 41 ...
#> $ Frt.Shld : num 53 55.5 56.5 52.5 52 54.5 56.5 58.5 59 58 ...
#> $ Gear.Ratio : num 3.26 2.95 3.27 3.25 3.02 2.8 NA NA NA NA ...
#> $ Gear2 : num 3.21 3.02 3.25 3.25 2.99 2.85 2.84 1.99 1.99 2.33 ...
#> $ HP : num 130 160 130 108 168 208 110 165 165 101 ...
#> $ HP.revs : num 6000 5900 5500 5300 5800 5700 5200 4800 4800 4400 ...
#> $ Height : num 47.5 50 51.5 50.5 49.5 51 49.5 50.5 51 50.5 ...
#> $ Length : num 177 191 193 176 175 186 189 197 197 192 ...
#> $ Luggage : num 16 14 17 10 12 12 16 16 16 15 ...
#> $ Mileage : num NA 20 NA 27 NA NA 21 NA 23 NA ...
#> $ Model2 : Factor w/ 21 levels ""," Turbo 4 (3)",..: 1 1 1 1 1 1 1 14 13 1 ...
#> $ Price : num 11950 24760 26900 18900 24650 ...
#> $ Rear.Hd : num 1.5 2 3 1 1 2.5 2.5 4.5 3.5 3.5 ...
#> $ Rear.Seating: num 26.5 28.5 31 28 25.5 27 28 30.5 28.5 27.5 ...
#> $ RearShld : num 52 55.5 55 52 51.5 55.5 56 58.5 58.5 56.5 ...
#> $ Reliability : Ord.factor w/ 5 levels "Much worse"<"worse"<..: 5 5 NA NA 4 NA 3 3 3 NA ...
#> $ Rim : Factor w/ 6 levels "R12","R13","R14",..: 3 4 4 3 3 4 3 3 3 3 ...
#> $ Sratio.m : num NA NA NA NA NA NA NA NA NA NA ...
#> $ Sratio.p : num 0.86 0.96 0.97 0.71 0.88 0.78 0.76 0.83 0.87 0.88 ...
#> $ Steering : Factor w/ 3 levels "manual","power",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ Tank : num 13.2 18 21.1 15.9 16.4 21.1 15.7 18 18 16.5 ...
#> $ Tires : Factor w/ 30 levels "145","145/80",..: 16 20 20 8 17 28 13 23 23 22 ...
#> $ Trans1 : Factor w/ 4 levels "","man.4","man.5",..: 3 3 3 3 3 3 1 1 1 1 ...
#> $ Trans2 : Factor w/ 4 levels "","auto.3","auto.4",..: 3 3 2 2 3 3 2 3 3 3 ...
#> $ Turning : num 37 42 39 35 35 39 41 43 42 41 ...
#> $ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 3 1 1 3 3 2 2 NA ...
#> $ Weight : num 2700 3265 2935 2670 2895 ...
#> $ Wheel.base : num 102 109 106 100 101 109 105 111 111 108 ...
#> $ Width : num 67 69 71 67 65 69 69 72 72 71 ...
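As a small sketch of the same point for the three-level case in the question, assuming a made-up data frame with a Color factor (all names and values here are hypothetical): rpart is given the factor directly, with no model.matrix step, and the printed split should be expressed in terms of the factor's levels rather than as a numeric threshold on a dummy column.

```r
library(rpart)

# Hypothetical data: a three-level factor and one numeric predictor
set.seed(1)
d <- data.frame(
  Color = factor(sample(c("blue", "red", "green"), 200, replace = TRUE)),
  Size  = runif(200)
)
# Outcome depends only on Color, so the tree has an obvious split to find
d$y <- factor(ifelse(d$Color == "green", "yes", "no"))

# The factor is passed as-is; rpart groups its levels at each split
fit <- rpart(y ~ Color + Size, data = d, method = "class")
print(fit)
```

Character columns can be converted with as.factor() beforehand, as the question suggests.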

why multiple columns are shown as one vector in R

I am trying to read in data from a URL however when I do run the following code:
x <- read.csv(url(myUrl), sep = '\t', head = FALSE)
print(x)
I get this
V1 V2
1 18.0 8 30.7 130.0 hello
2 32.0 6 23.5 121.5 bye
and I want this
V1 V2 V3 V4 V5
1 18.0 8.0 30.7 130.0 hello
2 32.0 6.0 23.5 121.5 bye
for some reason it is reading it as 2 columns instead of 5
Edit 1
Here is a snippet of the data file from the url:
Edit 2
Here is the url: https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

Instead of '\t', perhaps use ' ' as the separator, or don't specify a delimiter at all, in which case read.table splits on any whitespace:
x <- read.table(url(myUrl), header = FALSE)
based on the url updated in the OP's post
x <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", header = FALSE)
str(x)
#'data.frame': 398 obs. of 9 variables:
# $ V1: num 18 15 18 16 17 15 14 14 14 15 ...
# $ V2: int 8 8 8 8 8 8 8 8 8 8 ...
# $ V3: num 307 350 318 304 302 429 454 440 455 390 ...
# $ V4: chr "130.0" "165.0" "150.0" "150.0" ...
# $ V5: num 3504 3693 3436 3433 3449 ...
# $ V6: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
# $ V7: int 70 70 70 70 70 70 70 70 70 70 ...
# $ V8: int 1 1 1 1 1 1 1 1 1 1 ...
# $ V9: chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
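The effect of the separator can be reproduced without a network connection. The sketch below uses two made-up lines laid out like the file appears to be (runs of spaces between the numbers, a tab before the quoted name); the exact spacing is an assumption for illustration.

```r
# Two illustrative lines: spaces between numbers, a tab before the quoted name
lines <- c(
  '18.0   8   307.0   130.0   3504.   12.0   70  1\t"chevrolet chevelle malibu"',
  '15.0   8   350.0   165.0   3693.   11.5   70  1\t"buick skylark 320"'
)

# Splitting on tabs only: everything before the tab becomes one column
by_tab <- read.table(text = lines, sep = "\t", header = FALSE)
ncol(by_tab)   # 2

# Default sep = "": any run of spaces or tabs separates fields
by_ws <- read.table(text = lines, header = FALSE)
ncol(by_ws)    # 9
```

V4 is read as chr in the output above, which suggests that column contains some non-numeric entries; if those are a missing-value marker, read.table's na.strings argument can map them to NA.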

Trying to draw a SPC Chart in R

I am trying to create a control chart using the code below but I am getting the error below. The data has the first Column as date then 12 other columns with different variables of data.
library("qcc")
attach(data)
Data_Frame_Data <- as.data.frame.matrix(data)
q <- qcc(Cancer_Activity
, type="xbar"
, nsigmas=3)
Error in sd.xbar(c(1396310400, 1398902400, 1401580800, 1404172800,
1406851200, : group sizes must be larger than one
This is the output when I run str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 48 obs. of 13 variables:
$ Date : POSIXct, format: "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01" ...
$ CW_Activity : num 37 29.5 34 46 39.5 41.5 42 40 46 39.5 ...
$ CW_Breach : num 3.5 6 8.5 10 5.5 8 4.5 3 3.5 4 ...
$ ICHT_Activity: num 73.5 89 60 83.5 85 88.5 65.5 80 75.5 74 ...
$ ICHT_Breach : num 8 11.5 11.5 12 11 15 9.5 14 8.5 16.5 ...
$ LNWH_Activity: num 67 76.5 56 79.5 67 83 77.5 67 66 60.5 ...
$ LNWH_Breach : num 10 12.5 13 14 10.5 16 16.5 12 5 13.5 ...
$ THH_Activity : num 30 26 24.5 36 31 25 33 21.5 42 25.5 ...
$ THH_Breach : num 2 3 2 1 5 1.5 3.5 0.5 3.5 3 ...
$ RBH_Activity : num 2.5 5 6.5 7 6.5 7.5 3.5 9 8 6.5 ...
$ RBH_Breach : num 0.5 1 2 2 4 4 1 2 2.5 2 ...
$ NWL_Activity : num 210 226 181 252 229 ...
$ NWL_Breach : num 24 34 37 39 36 44.5 35 31.5 23 39 ...
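The error message suggests that qcc was handed one observation per group: with type="xbar" it expects subgroups of more than one measurement, whereas this data has a single value per month. The numbers in the error also appear to be the POSIXct Date values rather than an activity column. An individuals chart is the usual alternative for one-value-per-period data, and qcc offers it as type="xbar.one". The sketch below computes that kind of limit in base R from the average moving range, using the first ten CW_Activity values from the str() output; the variable names are illustrative.

```r
# First ten CW_Activity values, copied from the str() output above
x <- c(37, 29.5, 34, 46, 39.5, 41.5, 42, 40, 46, 39.5)

# Individuals chart: estimate sigma from the average moving range
mr     <- abs(diff(x))        # moving ranges of consecutive points
sigma  <- mean(mr) / 1.128    # 1.128 is the d2 constant for ranges of size 2
center <- mean(x)
limits <- c(LCL = center - 3 * sigma, UCL = center + 3 * sigma)

# Indices of points outside the control limits (if any)
out_of_control <- which(x < limits["LCL"] | x > limits["UCL"])
```

The center line and limits can then be drawn with plot() and abline(), or a numeric column such as data$CW_Activity can be passed to qcc with type="xbar.one".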

Carc data from rda file to numeric matrix

I am trying to run KDA (kernel discriminant analysis) on the carc data, but when I call X<-data.frame(scale(X)); R shows this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
I tried to use as.numeric(as.matrix(carc)) and carc<-na.omit(carc), but it does not help either
install.packages("klaR")
install.packages("FSelector")
library(ks);library(MASS);library(klaR);library(FSelector)
attach("carc.rda")
data<-load("carc.rda")
data
carc<-na.omit(carc)
head(carc)
class(carc) # check for its class
class(as.matrix(carc)) # change class, and
as.numeric(as.matrix(carc))
XX<-carc
X<-XX[,1:12];X.class<-XX[,13];
X<-data.frame(scale(X));
fit.pc<-princomp(X,scores=TRUE);
plot(fit.pc,type="line")
X.new<-fit.pc$scores[,1:5]; X.new<-data.frame(X.new);
cfs(X.class~.,cbind(X.new,X.class))
X.new<-fit.pc$scores[,c(1,4)]; X.new<-data.frame(X.new);
fit.kda1<-Hkda(x=X.new,x.group=X.class,pilot="samse",
bw="plugin",pre="sphere")
kda.fit1 <- kda(x=X.new, x.group=X.class, Hs=fit.kda1)
Can you help me resolve this problem and run this analysis?
Added: The car data set (Chambers, Cleveland, Kleiner & Tukey 1983)
> head(carc)
P M R78 R77 H R Tr W L T D G C
AMC_Concord 4099 22 3 2 2.5 27.5 11 2930 186 40 121 3.58 US
AMC_Pacer 4749 17 3 1 3.0 25.5 11 3350 173 40 258 2.53 US
AMC_Spirit 3799 22 . . 3.0 18.5 12 2640 168 35 121 3.08 US
Audi_5000 9690 17 5 2 3.0 27.0 15 2830 189 37 131 3.20 Europe
Audi_Fox 6295 23 3 3 2.5 28.0 11 2070 174 36 97 3.70 Europe

Here is a small dataset with characteristics similar to what you describe, to reproduce this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
carc <- data.frame(type1=rep(c('1','2'), each=5),
type2=rep(c('5','6'), each=5),
x = rnorm(10,1,2)/10, y = rnorm(10),
stringsAsFactors = TRUE) # needed in R >= 4.0 to get factors, as in the output below
This should be similar to your data.frame
str(carc)
# 'data.frame': 10 obs. of 4 variables:
# $ type1: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
# $ type2: Factor w/ 2 levels "5","6": 1 1 1 1 1 2 2 2 2 2
# $ x : num -0.1177 0.3443 0.1351 0.0443 0.4702 ...
# $ y : num -0.355 0.149 -0.208 -1.202 -1.495 ...
scale(carc)
# Similar error
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Using set()
require(data.table)
DT <- data.table(carc)
cols_fix <- c("type1", "type2")
for (col in cols_fix) set(DT, j=col, value = as.numeric(as.character(DT[[col]])))
str(DT)
# Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ type1: num 1 1 1 1 1 2 2 2 2 2
# $ type2: num 5 5 5 5 5 6 6 6 6 6
# $ x : num 0.0465 0.1712 0.1582 0.1684 0.1183 ...
# $ y : num 0.155 -0.977 -0.291 -0.766 -1.02 ...
# - attr(*, ".internal.selfref")=<externalptr>
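If the non-numeric columns are labels rather than measurements, an alternative to converting them is to scale only the numeric columns. A base-R sketch, assuming a made-up data frame similar to the one above (the names are hypothetical):

```r
# Hypothetical data: one factor column and two numeric columns
set.seed(1)
carc_demo <- data.frame(
  type1 = factor(rep(c("1", "2"), each = 5)),
  x = rnorm(10),
  y = rnorm(10)
)

# Find the numeric columns and scale only those
is_num <- vapply(carc_demo, is.numeric, logical(1))
scaled <- scale(carc_demo[, is_num, drop = FALSE])
colnames(scaled)   # "x" "y"
```

The factor column can be kept separately as the class label, as the question already does with X.class.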

The first column(s) of your data set may be factors. Taking the data from corrgram:
library(corrgram)
carc <- auto
str(carc)
# 'data.frame': 74 obs. of 14 variables:
# $ Model : Factor w/ 74 levels "AMC Concord ",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ Origin: Factor w/ 3 levels "A","E","J": 1 1 1 2 2 2 1 1 1 1 ...
# $ Price : int 4099 4749 3799 9690 6295 9735 4816 7827 5788 4453 ...
# $ MPG : int 22 17 22 17 23 25 20 15 18 26 ...
# $ Rep78 : num 3 3 NA 5 3 4 3 4 3 NA ...
# $ Rep77 : num 2 1 NA 2 3 4 3 4 4 NA ...
# $ Hroom : num 2.5 3 3 3 2.5 2.5 4.5 4 4 3 ...
# $ Rseat : num 27.5 25.5 18.5 27 28 26 29 31.5 30.5 24 ...
# $ Trunk : int 11 11 12 15 11 12 16 20 21 10 ...
# $ Weight: int 2930 3350 2640 2830 2070 2650 3250 4080 3670 2230 ...
# $ Length: int 186 173 168 189 174 177 196 222 218 170 ...
# $ Turn : int 40 40 35 37 36 34 40 43 43 34 ...
# $ Displa: int 121 258 121 131 97 121 196 350 231 304 ...
# $ Gratio: num 3.58 2.53 3.08 3.2 3.7 3.64 2.93 2.41 2.73 2.87 ...
So exclude them by trying this:
X<-XX[,3:14]
or this
X<-XX[,-(1:2)]
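The head(carc) output in the question also shows "." in the R78 and R77 columns. If that is a missing-value marker (an assumption; the post does not say), those columns will have been read as factor or character, which would trigger the same error even with the label columns excluded. A sketch on a hypothetical three-row miniature of that data:

```r
# Hypothetical miniature of the rows shown in the question
carc_small <- data.frame(
  P   = c(4099, 4749, 3799),
  R78 = c("3", "3", "."),
  R77 = c("2", "1", "."),
  stringsAsFactors = TRUE
)

# Convert a column to numeric, treating "." as missing (assumed meaning)
to_num <- function(col) {
  col <- as.character(col)   # go via character so factor codes are not used
  col[col == "."] <- NA
  as.numeric(col)
}

fixed  <- as.data.frame(lapply(carc_small, to_num))
scaled <- scale(fixed)       # works now; NA entries stay NA
```

Going through as.character() matters for factors: calling as.numeric() directly on a factor returns its internal level codes, not the printed values.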
