Why is a list being added to my dataframe here?
Here's my dataframe
df <- data.frame(ch = rep(1:10, each = 12), # care home id
year_id = rep(2018),
month_id = rep(1:12), # month using the system over the course of a year (1 = first month, 2 = second month...etc.)
totaladministrations = rbinom(n=120, size = 1000, prob = 0.6), # administrations that were scheduled to have been given in the month
missed = rbinom(n=120, size = 20, prob = 0.8), # administrations that weren't given in the month (these are bad!)
beds = rep(rbinom(n = 10, size = 60, prob = 0.6), each = 12), # number of beds in the care home
rating = rep(rbinom(n= 10, size = 4, prob = 0.5), each = 12)) # latest inspection rating (1. Inadequate, 2. Requires Improving, 3. Good, 4 Outstanding)
df <- arrange(df, df$ch, df$year_id, df$month_id)
str(df)
> str(df)
'data.frame': 120 obs. of 7 variables:
$ ch : int 1 1 1 1 1 1 1 1 1 1 ...
$ year_id : num 2018 2018 2018 2018 2018 ...
$ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ totaladministrations: int 576 598 608 576 608 637 611 613 593 626 ...
$ missed : int 18 18 19 16 16 13 17 16 15 17 ...
$ beds : int 38 38 38 38 38 38 38 38 38 38 ...
$ rating : int 2 2 2 2 2 2 2 2 2 2 ...
All good so far.
I just want to add another column that sequences the month number within the ch group (this equates to the actual month_id in this example but ignore that, my real life data is different), so I'm using:
df <- df %>% group_by(ch) %>%
mutate(sequential_month_counter = 1:n())
This appears to add a bunch of stuff I don't really understand, want or need, such as a list ...
str(df)
> str(df)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 8 variables:
$ ch : int 1 1 1 1 1 1 1 1 1 1 ...
$ year_id : num 2018 2018 2018 2018 2018 ...
$ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ totaladministrations : int 601 590 593 599 615 611 628 587 604 600 ...
$ missed : int 16 14 17 16 18 16 15 18 15 20 ...
$ beds : int 35 35 35 35 35 35 35 35 35 35 ...
$ rating : int 3 3 3 3 3 3 3 3 3 3 ...
$ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "groups")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 2 variables:
..$ ch : int 1 2 3 4 5 6 7 8 9 10
..$ .rows:List of 10
.. ..$ : int 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int 13 14 15 16 17 18 19 20 21 22 ...
.. ..$ : int 25 26 27 28 29 30 31 32 33 34 ...
.. ..$ : int 37 38 39 40 41 42 43 44 45 46 ...
.. ..$ : int 49 50 51 52 53 54 55 56 57 58 ...
.. ..$ : int 61 62 63 64 65 66 67 68 69 70 ...
.. ..$ : int 73 74 75 76 77 78 79 80 81 82 ...
.. ..$ : int 85 86 87 88 89 90 91 92 93 94 ...
.. ..$ : int 97 98 99 100 101 102 103 104 105 106 ...
.. ..$ : int 109 110 111 112 113 114 115 116 117 118 ...
..- attr(*, ".drop")= logi TRUE
What's going on here? I just want a dataframe. Why is there all that additional output after $ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ... and, more importantly, can I ignore it and just keep treating it as a normal dataframe (I'll be running some generalised linear mixed models on the df)?
The attribute "groups" is where dplyr stores the grouping information added when you did group_by(ch). It doesn't hurt anything, and it will disappear if you ungroup():
df %>% group_by(ch) %>%
mutate(sequential_month_counter = 1:n()) %>%
ungroup %>%
str
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 8 variables:
# $ ch : int 1 1 1 1 1 1 1 1 1 1 ...
# $ year_id : num 2018 2018 2018 2018 2018 ...
# $ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
# $ totaladministrations : int 575 597 579 605 582 599 577 604 630 632 ...
# $ missed : int 18 16 16 18 18 11 10 13 17 16 ...
# $ beds : int 33 33 33 33 33 33 33 33 33 33 ...
# $ rating : int 3 3 3 3 3 3 3 3 3 3 ...
# $ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ...
As a side note, you should use bare column names inside dplyr verbs, not data$column. With arrange on ungrouped data it doesn't much matter, but in grouped operations it will cause bugs, because df$column always refers to the whole column rather than the current group's slice. You should get in the habit of using arrange(df, ch, year_id, month_id) instead of arrange(df, df$ch, df$year_id, df$month_id).
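If the extra classes are unwanted altogether, one option is to build the counter in base R so the object never stops being a plain data.frame. This is a minimal sketch using ave(), which is a different technique from the dplyr pipeline above. The toy df is a cut-down, hypothetical stand-in for the one in the question.

```r
# Cut-down stand-in for the question's data: 10 care homes, 12 months each
df <- data.frame(ch = rep(1:10, each = 12),
                 month_id = rep(1:12, times = 10))

# ave() applies seq_along() within each ch group and puts the
# results back in the original row order
df$sequential_month_counter <- ave(seq_along(df$ch), df$ch, FUN = seq_along)

class(df)  # still just "data.frame"
```

If you stay with dplyr, calling as.data.frame() on the result also gives back a plain data.frame.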
Related
'data.frame': 33510 obs. of 10 variables:
$ model : Factor w/ 92 levels " 1 Series"," 2 Series",..: 3 54 25 72 19 16 37 41 29 29 ...
$ year : int 2009 2019 2014 2016 2016 2017 2019 2017 2019 2015 ...
$ price : int 4675 40950 11472 17998 14399 9980 37990 14000 12299 8484 ...
$ transmission: Factor w/ 3 levels "Automatic","Manual",..: 2 3 3 2 2 2 1 2 2 2 ...
$ mileage : int 70000 19322 83417 30010 45693 70860 1499 20122 4132 25000 ...
$ fuelType : Factor w/ 4 levels "Diesel","Electric",..: 4 4 1 1 1 4 1 1 4 4 ...
$ tax : int 165 150 145 235 20 30 145 30 145 0 ...
$ mpg : num 47.9 34 54.3 44.1 65.7 55.4 40.9 64.2 48.7 65.7 ...
$ engineSize : num 2 3 2.1 2 2.1 1 2 1.5 1.1 1 ...
$ automaker : Factor w/ 4 levels "BMW","Ford","Mercedes",..: 1 1 3 2 3 2 3 2 2 2 ...
mycars_formula = price ~ year + transmission + mileage + fuelType + tax + mpg + engineSize + automaker
dt_mycars <- tree(mycars_formula, data = training_mycars)
cv_mycars <- cv.tree(dt_mycars, FUN=prune.misclass)
pruned_tree_size <- rev(cv_mycars$size)[which.min(rev(cv_mycars$dev))]
p_dt_mycars <- prune.misclass(dt_mycars, best = pruned_tree_size)
Error in prune.tree(tree = dt, best = pruned_tree_size, method = "misclass") :
misclass only for classification trees
Can someone explain to me why I cannot use the misclass method?
I know that my factor model has too many levels, so I excluded it from my formula. If you also have a suggestion about how I can include it, that would be very helpful.
I need to expand some data and then restrict which data remains through tail.
Example of data:
list_1 <- list(1:15)
list_2 <- list(16:30)
list_3 <- list(31:45)
short_lists[[1]] <- list_1
short_lists[[2]] <- list_2
short_lists[[3]] <- list_3
str(short_lists)
List of 3
$ :List of 1
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ :List of 1
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
$ :List of 1
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
And this is how long I want the tail of each list (list_1, list_2, list_3) to be:
how_long <-
c(4,2,5,3,6,4,7,5,8,6,9,7,10,8,2,4,6,8,10,12,14,10,9,7,11)
And I expand through nested for loops and try to get the tail of the expanded lists, but just get the expanded lists.
for (i in 1:length(how_long)) {
for (j in 1:length(short_lists)) {
tail_temp[[j]][i] <- tail(short_lists2[[j]], n = how_long[i])
}
}
And this yields:
str(tail_temp)
List of 3
$ :List of 25
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
[snip]
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ :List of 25
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
[snip]
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
$ :List of 25
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
[snip]
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
And I'm happy the j's were expanded, but I never get to the tail call and what I'm seeking:
str(tail_temp)
List of 3
$ :List of 25
..$ : int [1:4] 12 13 14 15
..$ : int [1:2] 14 15
..$ : int [1:5] 11 12 13 14 15
[snip]
So what simple thing am I missing? Any help appreciated. Thanks.
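Very close indeed.
I prefer vectors over lists in R.
If you're familiar with Python, vectors in R behave roughly like Python lists, whereas lists in R behave more like dictionaries.
Each short_lists[[j]] is a list holding a single vector, so tail() on it returns that one element (the whole vector) whatever n is.
Therefore, you just need to unlist first (into a vector) and then assign the result to a single item of a list with [[ ]],
hence it should be assigned to:
tail_temp[[i]][[j]] instead of tail_temp[[j]][i] (note that the loop indices are swapped in the code below)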
list_1 <- list(1:15)
list_2 <- list(16:30)
list_3 <- list(31:45)
short_lists = list()
short_lists[[1]] <- list_1
short_lists[[2]] <- list_2
short_lists[[3]] <- list_3
how_long <- c(4,2,5,3,6,4,7,5,8,6,9,7,10,
8,2,4,6,8,10,12,14,10,9,7,11)
tail_temp = list()
for (i in 1:length(short_lists)){
tail_temp[[i]] = list()
for (j in 1:length(how_long)){
tail_temp[[i]][[j]] <- tail(unlist(short_lists[[i]]), n = how_long[j])
}
}
Output
[[1]]
[[1]][[1]]
[1] 12 13 14 15
[[1]][[2]]
[1] 14 15
[[1]][[3]]
[1] 11 12 13 14 15
…
[[3]][[23]]
[1] 37 38 39 40 41 42 43 44 45
[[3]][[24]]
[1] 39 40 41 42 43 44 45
[[3]][[25]]
[1] 35 36 37 38 39 40 41 42 43 44 45
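The same nested structure can also be built without explicit loops. The following is only a sketch of that idea using lapply(), assuming the inputs are stored as plain vectors rather than one-element lists. The names short_vecs and the shortened how_long are illustrative, not from the question.

```r
# Plain vectors instead of list(1:15) etc., so no unlist() is needed
short_vecs <- list(1:15, 16:30, 31:45)
how_long <- c(4, 2, 5)  # shortened for the example

# Outer lapply walks the vectors, inner lapply walks the tail lengths
tail_temp <- lapply(short_vecs, function(v) {
  lapply(how_long, function(n) tail(v, n = n))
})

tail_temp[[1]][[1]]  # 12 13 14 15
tail_temp[[1]][[2]]  # 14 15
```

The result has one inner list per input vector and one element per entry of how_long, which matches the shape produced by the loops above.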
I looked at other questions regarding my error but none had a similar issue as I do. I have no empty values, and none of the variable names in the dataset are used by the C50 package.
This is the structure of the used dataset (no empty values):
> str(dataset)
'data.frame': 776973 obs. of 13 variables:
$ CrimeID : int 9446748 9446846 9446876 9447044 9447227 9447263 9447282 9447312 9447340 9447387 ...
$ CaseNumber : Factor w/ 776907 levels "161884","F218264",..: 67 111 157 283 372 404 421 435 457 487 ...
$ CrimeDate : Factor w/ 326056 levels "1/1/2014 0:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ CrimeBlock : Factor w/ 31381 levels "0000X E 100TH PL",..: 3101 4085 26441 10811 6414 3183 7076 11201 12166 5271 ...
$ IUCR : Factor w/ 357 levels "031A","031B",..: 345 51 52 333 52 347 347 345 52 334 ...
$ LocationDescription: Factor w/ 135 levels "ABANDONED BUILDING",..: 24 18 122 24 122 122 122 18 122 122 ...
$ Arrest : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Beat : int 1832 1133 1631 1932 1932 1533 1012 1413 1033 1211 ...
$ District : int 18 11 16 19 19 15 10 14 10 12 ...
$ Ward : int 42 24 36 43 32 24 24 35 12 26 ...
$ CommunityArea : int 8 27 17 7 7 25 29 22 30 24 ...
$ FBICode : Factor w/ 26 levels "01A","01B","04A",..: 24 11 11 24 11 25 25 24 11 24 ...
The variable Arrest will be used as the target variable in the decision tree. I therefore convert it to a factor, rename the dataset to crimechicago, set the seed to create random training and test datasets, load the C50 library, and run the C5.0 code. This code runs for over an hour and then returns the error: c50 code called exit with value 1
dataset$Arrest<- factor(dataset$Arrest)
crimechicago <- dataset
set.seed(222)
totalvalues <-nrow(crimechicago)
train_sample <- sample(totalvalues, 400000)
crimechicago_train <- crimechicago[train_sample, ]
crimechicago_test <- crimechicago[-train_sample, ]
library(C50)
crimechicago_model <- C5.0(crimechicago_train[-7], crimechicago_train$Arrest)
EDIT:
-removed CrimeID and CaseNumber from dataset as not useful predictors of target variable Arrest
-summary screenshot of the dataset: (the entire dataset, not a subset)
structure of the train dataset (400,000 rows, created by randomly selecting 400,000 rows of the 700,000+ row original dataset)
str(crimechicago_train)
'data.frame': 400000 obs. of 10 variables:
$ CrimeDate : Factor w/ 326056 levels "1/1/2014 0:00",..: 300760 132223 211541 3 287239 54284 93432 133588 284191 232747 ...
$ CrimeBlock : Factor w/ 31381 levels "0000X E 100TH PL",..: 124 14942 2696 24466 143 9024 10613 22404 17613 10766 ...
$ IUCR : Factor w/ 357 levels "031A","031B",..: 209 274 25 51 334 345 329 274 347 329 ...
$ LocationDescription: Factor w/ 135 levels "ABANDONED BUILDING",..: 118 18 80 106 80 110 18 118 122 18 ...
$ Arrest : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 1 1 1 1 1 1 ...
$ Domestic : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 1 1 2 1 1 ...
$ Beat : int 113 1133 1834 825 1834 1434 1921 715 2522 1431 ...
$ District : int 1 11 18 8 18 14 19 7 25 14 ...
$ Ward : int 42 24 42 15 42 32 47 15 30 1 ...
$ CommunityArea : int 32 27 8 66 8 24 5 67 20 22 ...
I'm trying to subset my dataset 'eggdat' for daytime and nighttime hours. This is its structure:
'data.frame': 54847 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 20 20 20 20 20 20 20 20 20 20 ...
$ minute: int 5 5 5 5 5 5 5 5 5 5 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num -159 179 -164 -155 -137 ...
$ Pitch : num -31.36 -41.05 -23.85 -6.62 -9.13 ...
$ Yaw : num -71.8 -113.3 -67.2 -140.2 -78.2 ...
$ temp1 : num 25 33.5 34 34 34 34 34 34 34 34 ...
Subsetting for daytime works fine:
daytime <- eggdat[eggdat$hour >= 7 & eggdat$hour <= 20, ]
'data.frame': 18847 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 20 20 20 20 20 20 20 20 20 20 ...
$ minute: int 5 5 5 5 5 5 5 5 5 5 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num -159 179 -164 -155 -137 ...
$ Pitch : num -31.36 -41.05 -23.85 -6.62 -9.13 ...
$ Yaw : num -71.8 -113.3 -67.2 -140.2 -78.2 ...
$ temp1 : num 25 33.5 34 34 34 34 34 34 34 34 ...
Doing exactly the same thing for nighttime, however, returns a subset with 0 observations:
nighttime <- eggdat[eggdat$hour <= 7 & eggdat$hour >= 21, ]
'data.frame': 0 obs. of 10 variables:
$ year : int
$ month : int
$ day : int
$ hour : int
$ minute: int
$ second: int
$ Roll : num
$ Pitch : num
$ Yaw : num
$ temp1 : num
I really don't know what to do. I tried using subset, but without success. I also tried eggdat$hour <- as.factor(eggdat$hour), but couldn't get it to work either.
Even more confusingly, adding the quotation marks in the subset function (daytime <- eggdat[eggdat$hour >= '7' & eggdat$hour <= '20', ] and nighttime <- eggdat[eggdat$hour <= '7' & eggdat$hour >= '21', ]) resulted in the daytime subset containing '0 obs.', but the nighttime subset working fine, so it's just the other way around!
Daytime: 'data.frame': 0 obs. of 10 variables:
Nighttime:
'data.frame': 28800 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 21 21 21 21 21 21 21 21 21 21 ...
$ minute: int 0 0 0 0 0 0 0 0 0 0 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num 65.8 65.8 66.1 65.6 65.6 ...
$ Pitch : num 6.35 6.34 6.24 6.4 6.27 ...
$ Yaw : num 171 172 174 176 176 ...
$ temp1 : num 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 ...
I really don't know what to do; I'm very confused by all of this.
You want eggdat[eggdat$hour <= 7 | eggdat$hour >= 21, ]
x <= 7 & x >= 21 translates to x at most 7 AND at least 21, which no single value can satisfy
x <= 7 | x >= 21 translates to x at most 7 OR at least 21
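A small sketch may make the difference concrete. The hour vector below is a hypothetical stand-in for eggdat$hour. The last line also hints at why the quoted version behaved "the other way around": with a quoted operand, R compares the values as strings, not as numbers.

```r
# Toy stand-in for eggdat$hour: one value per hour of the day
hour <- 0:23

# AND: no hour is both <= 7 and >= 21, so nothing is selected
night_and <- hour[hour <= 7 & hour >= 21]

# OR: hours up to 7, or from 21 onwards
night_or <- hour[hour <= 7 | hour >= 21]

length(night_and)  # 0
night_or           # 0 1 2 3 4 5 6 7 21 22 23

# Quoting a number makes this a string comparison: "20" sorts before "7"
20 >= "7"          # FALSE
```

Note that with the conditions as written, hour 7 lands in both the daytime and the nighttime subset. Use < 7 for the night if that is not intended.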
I am trying to run KDA (kernel discriminant analysis) on the carc data, but when I call X<-data.frame(scale(X)); R shows the error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
I tried as.numeric(as.matrix(carc)) and carc<-na.omit(carc), but neither helps.
library(ks);library(MASS);library(klaR);library(FSelector)
install.packages("klaR")
install.packages("FSelector")
library(ks);library(MASS);library(klaR);library(FSelector)
attach("carc.rda")
data<-load("carc.rda")
data
carc<-na.omit(carc)
head(carc)
class(carc) # check for its class
class(as.matrix(carc)) # change class, and
as.numeric(as.matrix(carc))
XX<-carc
X<-XX[,1:12];X.class<-XX[,13];
X<-data.frame(scale(X));
fit.pc<-princomp(X,scores=TRUE);
plot(fit.pc,type="line")
X.new<-fit.pc$scores[,1:5]; X.new<-data.frame(X.new);
cfs(X.class~.,cbind(X.new,X.class))
X.new<-fit.pc$scores[,c(1,4)]; X.new<-data.frame(X.new);
fit.kda1<-Hkda(x=X.new,x.group=X.class,pilot="samse",
bw="plugin",pre="sphere")
kda.fit1 <- kda(x=X.new, x.group=X.class, Hs=fit.kda1)
Can you help me resolve this problem and run this analysis?
Added: the car data set (Chambers, Cleveland, Kleiner & Tukey 1983)
> head(carc)
P M R78 R77 H R Tr W L T D G C
AMC_Concord 4099 22 3 2 2.5 27.5 11 2930 186 40 121 3.58 US
AMC_Pacer 4749 17 3 1 3.0 25.5 11 3350 173 40 258 2.53 US
AMC_Spirit 3799 22 . . 3.0 18.5 12 2640 168 35 121 3.08 US
Audi_5000 9690 17 5 2 3.0 27.0 15 2830 189 37 131 3.20 Europe
Audi_Fox 6295 23 3 3 2.5 28.0 11 2070 174 36 97 3.70 Europe
Here is a small dataset with similar characteristics to what you describe
in order to answer this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
carc <- data.frame(type1=rep(c('1','2'), each=5),
type2=rep(c('5','6'), each=5),
x = rnorm(10,1,2)/10, y = rnorm(10),
stringsAsFactors = TRUE) # needed in R >= 4.0 to get factor columns
This should be similar to your data.frame
str(carc)
# 'data.frame': 10 obs. of 4 variables:
# $ type1: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
# $ type2: Factor w/ 2 levels "5","6": 1 1 1 1 1 2 2 2 2 2
# $ x : num -0.1177 0.3443 0.1351 0.0443 0.4702 ...
# $ y : num -0.355 0.149 -0.208 -1.202 -1.495 ...
scale(carc)
# Similar error
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
One fix is to convert the factor columns to numeric, here using set() from data.table:
require(data.table)
DT <- data.table(carc)
cols_fix <- c("type1", "type2")
for (col in cols_fix) set(DT, j=col, value = as.numeric(as.character(DT[[col]])))
str(DT)
# Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ type1: num 1 1 1 1 1 2 2 2 2 2
# $ type2: num 5 5 5 5 5 6 6 6 6 6
# $ x : num 0.0465 0.1712 0.1582 0.1684 0.1183 ...
# $ y : num 0.155 -0.977 -0.291 -0.766 -1.02 ...
# - attr(*, ".internal.selfref")=<externalptr>
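The as.numeric(as.character(...)) step matters because as.numeric() on a factor returns the internal level codes, not the labels. The sketch below shows that, followed by a base R version of the same column-wise fix for anyone not using data.table. The small data.frame is a hypothetical stand-in built to mirror the example above.

```r
# Factor whose labels are digits (mirrors type2 above)
f <- factor(c("5", "5", "6"))

as.numeric(f)                # 1 1 2: the internal level codes
as.numeric(as.character(f))  # 5 5 6: the labels, which is what you want

# Base R version of the column-wise conversion
carc <- data.frame(type1 = factor(rep(c("1", "2"), each = 5)),
                   type2 = factor(rep(c("5", "6"), each = 5)),
                   x = seq(0.1, 1, by = 0.1))
cols_fix <- c("type1", "type2")
carc[cols_fix] <- lapply(carc[cols_fix],
                         function(col) as.numeric(as.character(col)))

scaled <- scale(carc)  # runs now that every column is numeric
```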
The first column(s) of your data set may be factors. Taking the data from corrgram:
library(corrgram)
carc <- auto
str(carc)
# 'data.frame': 74 obs. of 14 variables:
# $ Model : Factor w/ 74 levels "AMC Concord ",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ Origin: Factor w/ 3 levels "A","E","J": 1 1 1 2 2 2 1 1 1 1 ...
# $ Price : int 4099 4749 3799 9690 6295 9735 4816 7827 5788 4453 ...
# $ MPG : int 22 17 22 17 23 25 20 15 18 26 ...
# $ Rep78 : num 3 3 NA 5 3 4 3 4 3 NA ...
# $ Rep77 : num 2 1 NA 2 3 4 3 4 4 NA ...
# $ Hroom : num 2.5 3 3 3 2.5 2.5 4.5 4 4 3 ...
# $ Rseat : num 27.5 25.5 18.5 27 28 26 29 31.5 30.5 24 ...
# $ Trunk : int 11 11 12 15 11 12 16 20 21 10 ...
# $ Weight: int 2930 3350 2640 2830 2070 2650 3250 4080 3670 2230 ...
# $ Length: int 186 173 168 189 174 177 196 222 218 170 ...
# $ Turn : int 40 40 35 37 36 34 40 43 43 34 ...
# $ Displa: int 121 258 121 131 97 121 196 350 231 304 ...
# $ Gratio: num 3.58 2.53 3.08 3.2 3.7 3.64 2.93 2.41 2.73 2.87 ...
So exclude them by trying this:
X<-XX[,3:14]
or this
X<-XX[,-(1:2)]
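Selecting by position only helps if the remaining columns really are numeric. The head(carc) output in the question shows "." in R78 and R77 for AMC_Spirit, which suggests those columns may have been read as text or factors. The following is a rough sketch, under that assumption, of turning "." into NA and then keeping only the numeric columns. The three-row XX is a hypothetical miniature using values from the question's table.

```r
# Miniature of the question's data: "." marks a missing value
XX <- data.frame(P   = c(4099, 3799, 9690),
                 R78 = c("3", ".", "5"),
                 C   = c("US", "US", "Europe"),
                 stringsAsFactors = TRUE)

# Replace "." with NA, then convert the labels to numbers
r78 <- as.character(XX$R78)
r78[r78 == "."] <- NA
XX$R78 <- as.numeric(r78)

# Keep only the numeric columns, whatever their position, then scale
X <- XX[, sapply(XX, is.numeric), drop = FALSE]
X <- data.frame(scale(X))

names(X)  # "P" "R78"
```

The class column (C here) is dropped from X by this selection, so it still needs to be kept separately, as X.class is in the question's code.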