I have data that looks like this:
IntensityRisk Depth Mag smaj smin
<fctr> <int> <int> <int> <int>
1 2 2 3 2 2
2 3 1 3 2 2
3 3 1 3 2 2
4 3 1 1 2 2
5 3 1 1 2 2
6 2 2 3 2 2
7 3 1 3 2 2
8 3 1 3 2 2
9 3 1 3 2 2
10 2 2 3 2 2
I ran the following steps:
gempaDF <- gempa[order(runif(nrow(gempa))),]
str(gempaDF$IntensityRisk)
tail(gempaDF,5)
gempaTrain <- gempaDF[1:4000,]
gempaTest <- gempaDF[4001:4471,]
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,1])
and got this error:
Error in C5.0.default(gempaTrain[, -1], gempaTrain[, 1]) :
C5.0 models require a factor outcome
I changed it to this:
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,as.factor(gempaDF$IntensityRisk)])
and got another error:
Error: Unsupported index type: factor
Then I tried changing it to this:
gempaDF <- gempa[order(runif(nrow(gempa))),]
gempaDF$IntensityRisk <- as.factor(gempaDF$IntensityRisk)
str(gempaDF$IntensityRisk)
tail(gempaDF,5)
gempaTrain <- gempaDF[1:4000,]
gempaTest <- gempaDF[4001:4471,]
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,1])
But I still get this error:
Error in C5.0.default(gempaTrain[, -1], gempaTrain[, 1]) :
C5.0 models require a factor outcome
I also tried this:
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,gempaDF$IntensityRisk])
But I still get an error:
Error: Unsupported index type: factor
Does anyone know what I did wrong? I would appreciate any help.
I'll use the following sample data (as I don't have access to your data)
set.seed(1)
dat = tibble::as_tibble(list(IntensityRisk = sample(1:5, 30, replace = T), Depth = sample(1:100, 30, replace = T), Mag = sample(1:100, 30, replace = T)))
table(dat$IntensityRisk)
1 2 3 4 5
4 11 2 7 6
# convert the response to factor,
dat$IntensityRisk = as.factor(dat$IntensityRisk)
str(dat)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 30 obs. of 3 variables:
$ IntensityRisk: Factor w/ 5 levels "1","2","3","4",..: 2 2 3 5 2 5 5 4 4 1 ...
$ Depth : int 49 60 50 19 83 67 80 11 73 42 ...
$ Mag : int 92 30 46 34 66 26 48 77 9 88 ...
If I use the tibble, I get a similar error,
fit = C50::C5.0(dat[, -1], dat[, 1])
Error in C5.0.default(dat[, -1], dat[, 1]) :
C5.0 models require a factor outcome
If I convert to a dataframe,
dat1 = as.data.frame(dat)
str(dat1)
'data.frame': 30 obs. of 3 variables:
$ IntensityRisk: Factor w/ 5 levels "1","2","3","4",..: 2 2 3 5 2 5 5 4 4 1 ...
$ Depth : int 49 60 50 19 83 67 80 11 73 42 ...
$ Mag : int 92 30 46 34 66 26 48 77 9 88 ...
the function runs error free,
fit = C50::C5.0(dat1[, -1], dat1[, 1])
> fit
Call:
C5.0.default(x = dat1[, -1], y = dat1[, 1])
Classification Tree
Number of samples: 30
Number of predictors: 2
Tree size: 8
Non-standard options: attempt to group attributes
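A likely explanation for the difference, sketched in base R: on a tibble, `dat[, 1]` returns a one-column tibble rather than a bare vector, so C5.0 never receives a factor. The snippet below mimics that behaviour with `drop = FALSE` on an ordinary data frame. The toy data and the names `toy`, `one_col` and `outcome` are made up for illustration.

```r
# Hypothetical toy data, not the asker's data
toy <- data.frame(
  IntensityRisk = factor(c(2, 3, 3, 2)),
  Depth         = c(2L, 1L, 1L, 2L)
)

# drop = FALSE mimics what a tibble returns for dat[, 1]:
# a one-column data frame, not a factor
one_col <- toy[, 1, drop = FALSE]
is.factor(one_col)   # FALSE

# Extracting with [[ (or $) returns the column itself
outcome <- toy[[1]]
is.factor(outcome)   # TRUE
```

If this is the cause, passing `dat[[1]]` or `dat$IntensityRisk` as the outcome should avoid the error without first converting the tibble with `as.data.frame()`.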
I have a data frame with 1666 rows. I would like to add a column with a repeating sequence of 1:5 to use with cut() to do cross validation. It would look like this:
Y x1 x2 Id1
1 .15 3.6 1
0 1.1 2.2 2
0 .05 3.3 3
0 .45 2.8 4
1 .85 3.1 5
1 1.01 2.9 1
... ... ... ...
I've tried the following 2 ways but get an error message as it seems to only add numbers in increments of the full seq() argument:
> tr2$Id1 <- rep(seq(1,5,1), (nrow(tr2)/5))
Error in `$<-.data.frame`(`*tmp*`, "Id", value = c(1, 2, 3, 4, 5, 1, 2, :
replacement has 1665 rows, data has 1666
> tr2$Id1 <- rep(seq(1,5,1), (nrow(tr2)/5) + (nrow(tr2)%%5))
Error in `$<-.data.frame`(`*tmp*`, "Id", value = c(1, 2, 3, 4, 5, 1, 2, :
replacement has 1670 rows, data has 1666
Any suggestions?
Use the length.out argument of rep() or rep_len (a "faster simplified version" [of rep]):
length.out: non-negative integer. The desired length of the output vector
Here is an example using the built-in dataset cars.
str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
Add grouping column:
cars$group <- rep(1:3, length.out = 50L)
Inspect the result:
head(cars)
speed dist group
1 4 2 1
2 4 10 2
3 7 4 3
4 7 22 1
5 8 16 2
6 9 10 3
tail(cars)
speed dist group
45 23 54 3
46 24 70 1
47 24 92 2
48 24 93 3
49 24 120 1
50 25 85 2
Something like this?
df <- data.frame(rnorm(1666))
df$cutter <- rep(1:5, length.out=1666)
tail(df)
rnorm.1666. cutter
1661 0.11693169 1
1662 -1.12508091 2
1663 0.25441847 3
1664 -0.06045037 4
1665 -0.17242921 5
1666 -0.85366242 1
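For the cross-validation use case in the question, the same idea can be sketched with `rep_len()`. The original attempts fail because `rep()` truncates a fractional `times` value: 1666/5 = 333.2 becomes 333 (1665 values) and 333.2 + 1 becomes 334 (1670 values). The variable names `n`, `fold_id` and `fold_random` below are illustrative only.

```r
n <- 1666                      # number of rows, as in the question

# Recycle 1:5 until exactly n values have been produced
fold_id <- rep_len(1:5, n)
table(fold_id)                 # fold 1 gets the one extra row

# Optional: shuffle so that fold membership does not follow row order
set.seed(1)
fold_random <- sample(fold_id)
```

Shuffling keeps the fold sizes the same and only changes which rows land in which fold.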
I am having an issue with creating a matrix of explanatory variables for running ridge and lasso regression using cv.glmnet.
My original data frame is of dimension 1460*81 and consists of several numeric and factor variables. In order to run glmnet, I am attempting to create a matrix of predictors using model.matrix.
However, when creating model.matrix on my original dataset, some of the rows are being dropped and my response variable and predictors are not of the same length.
Here's the code:
str(train1)
'data.frame': 1460 obs. of 80 variables:
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
$ LotFrontage : num 65 80 68 60 84 85 75 69 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420
$ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Alley : Factor w/ 3 levels "Grvl","None",..: 2 2 2 2 2 2 2 2 2 2 ...
$ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4
$ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4
$ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
And now I am passing the data frame to model.matrix to create a matrix.
x = model.matrix(SalePrice ~., data = train1)
dim(x)
[1] 1370 260
Notice how the 1460 * 80 data frame is transformed into a 1370 * 260 matrix: 90 rows have been dropped. This causes a mismatch between the lengths of my predictor variables and response variable when I try to run ridge regression.
cv.ridge <- glmnet(x, y, alpha = 0)
Error in glmnet(x, y, alpha = 0) :
number of observations in y (1460) not equal to the number of rows of x (1370)
Any ideas on where to look to ensure that the number of rows of the matrix (x) equals the length of the response (y)?
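One thing worth checking, assuming the dropped rows are the ones containing missing values: `model.matrix()` builds a model frame first, and the default `na.action` (`na.omit`) silently removes every row with an `NA`. A minimal sketch on made-up data (`toy`, `mf`, `x_toy` and `y_toy` are hypothetical names) takes both the predictors and the response from the same model frame so that they stay aligned:

```r
# Hypothetical toy data: rows 2 and 3 each contain an NA
toy <- data.frame(
  y = c(10, 20, 30, 40, 50),
  a = c(1.5, NA, 3.2, 4.1, 5.0),
  b = factor(c("u", "v", NA, "u", "v"))
)

# Build the model frame once; incomplete rows are dropped here
mf <- model.frame(y ~ ., data = toy)

# Derive x and y from the same frame so their lengths always match
x_toy <- model.matrix(attr(mf, "terms"), data = mf)
y_toy <- model.response(mf)

nrow(x_toy)      # 3
length(y_toy)    # 3
```

Whether dropping the incomplete rows is acceptable, or the missing values should be imputed or recoded first, depends on the data.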
I have a data set similar to this one:
x <- sample(c("A", "B", "C", "D"), 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05))
y <- sample(1:40, 1000, replace=TRUE)
d <- data.frame(x,y)
str(d)
'data.frame': 1000 obs. of 2 variables:
$ x: Factor w/4 levels "A","B","C","D": 1 3 3 2 3 3 3 3 4 3 ...
$ y: int 28 35 14 4 34 36 30 35 26 9 ...
table(d$x)
A B C D
115 204 637 44
In my real data set I have several thousand of these categories (A, B, C, D).
The str() of my real dataset:
str(realdata)
'data.frame': 346340 obs. of 91 variables:
$ author : Factor w/ 42590 levels "-jon-","--LZR--",..: 1962 3434 1241 7666 6235 2391 1196 2779 1881 339 ...
$ created_utc : Factor w/ 343708 levels "2015-05-01 02:00:41",..: 14815 23163 2281 3569 5922 7211 15783 5512 13485 8591 ...
$ group : Factor w/ 5 levels "xyz","abc","bnm",..: 2 2 2 2 2 2 2 2 2 2 ...
....
Now I want to subset the data so that my new data frame contains only the rows of those $author values (or $x in the d data frame) that have more than 100 entries in total.
I tried the following:
dnew <- subset(realdata, table(realdata$author) > 100)
It gives me a result, but not all entries of the authors seem to be included. I get just 1.3% of the rows of the complete dataset. I checked manually (with Excel) and it should be far more than that (approx. 30%). The manual analysis showed that 1.2% of the $author values account for 30% of the entries. So it seems I get roughly one row per $author with more than 100 entries, rather than all of that author's entries.
Do you know of a way to fix this?
We can do this easily with data.table. Convert the 'data.frame' to a 'data.table' with setDT(d), group by 'x', and, if the number of observations in a group (.N) is greater than 100, return the Subset of the Data.table (.SD).
library(data.table)
ddt <- setDT(d)[, if(.N > 100) .SD, x]
Or if we are using dplyr, the same approach can be used.
library(dplyr)
dpl <- d %>%
group_by(x) %>%
filter(n() > 100) %>%
droplevels()
str(dpl)
#Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 866 obs. of 2 variables:
#$ x: Factor w/ 2 levels "B","C": 1 1 2 1 1 2 2 2 2 2 ...
# $ y: int 25 25 13 11 2 32 12 15 12 3 ...
In base R, table can also be helpful:
v1 <- table(d$x)
d1 <- subset(d, x %in% names(v1)[v1 > 100])
As the column 'x' is a factor, the unused levels persist after subsetting. To remove them, use droplevels:
d2 <- droplevels(d1)
As the OP didn't set the seed, the output will be different for each person.
str(d2)
#'data.frame': 866 obs. of 2 variables:
#$ x: Factor w/ 2 levels "B","C": 1 1 2 1 1 2 2 2 2 2 ...
#$ y: int 25 25 13 11 2 32 12 15 12 3 ...
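To see why the original `subset(realdata, table(realdata$author) > 100)` returns too few rows: `table(...) > 100` yields one logical value per level, not one per row, so it does not line up with the rows of the data frame. A base R sketch on simulated data (the seed and the names `keep_levels`, `n_per_row` and `d_big` are assumptions for illustration) builds a per-row count with `ave()` instead:

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable
d <- data.frame(
  x = factor(sample(c("A", "B", "C", "D"), 1000, replace = TRUE,
                    prob = c(0.1, 0.2, 0.65, 0.05))),
  y = sample(1:40, 1000, replace = TRUE)
)

# One logical per level (length 4), not per row (length 1000)
keep_levels <- table(d$x) > 100
length(keep_levels)

# ave() repeats each group's size on every row belonging to that group
n_per_row <- ave(seq_along(d$x), d$x, FUN = length)

# Keep the rows whose group has more than 100 entries, drop unused levels
d_big <- droplevels(d[n_per_row > 100, ])
table(d_big$x)
```

This is equivalent in effect to the `%in% names(v1)[v1 > 100]` approach above; it just makes the per-row condition explicit.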
I. Data frame d with four levels
table(d$x)
# A B C D
# 92 232 630 46
II. Checking which level has greater than 100 records
which(table(d$x)>100)
# B C
# 2 3
III. Subsetting data frame d to keep only the records belonging to levels with more than 100 records, i.e. level B and level C
result <- d[ d$x %in% names(table(d$x))[table(d$x) > 100] , ]
dim(result)
# [1] 862 2
str(result)
# 'data.frame': 862 obs. of 2 variables:
# $ x: Factor w/ 4 levels "A","B","C","D": 3 2 3 3 2 2 2 3 3 3 ...
# $ y: int 29 32 27 40 30 38 8 16 2 23 ...
Levels A and D still persist with 0 records
table(result$x)
# A B C D
# 0 232 630 0
IV. Removing the levels with 0 records using factor()
result$x <- factor(result$x)
str(result)
# 'data.frame': 862 obs. of 2 variables:
# $ x: Factor w/ 2 levels "B","C": 2 1 2 2 1 1 1 2 2 2 ...
# $ y: int 29 32 27 40 30 38 8 16 2 23 ...
table(result$x)
# B C
# 232 630
I'm trying to find class probabilities of new input vectors with support vector machines in R.
Training the model shows no errors.
fit <-svm(device~.,data=dataframetrain,
kernel="polynomial",probability=TRUE)
But predicting on new input vectors produces an error.
predict(fit,dataframetest,probability=prob)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
dataframetrain looks like:
> str(dataframetrain)
'data.frame': 24577 obs. of 5 variables:
$ device : Factor w/ 3 levels "mob","pc","tab": 1 1 1 1 1 1 1 1 1 1 ...
$ geslacht : Factor w/ 2 levels "M","V": 1 1 1 1 1 1 1 1 1 1 ...
$ leeftijd : num 77 67 67 66 64 64 63 61 61 58 ...
$ invultijd: num 12 12 12 12 12 12 12 12 12 12 ...
$ type : Factor w/ 8 levels "A","B","C","D",..: 5 5 5 5 5 5 5 5 5 5 ...
and dataframetest looks like:
> str(dataframetest)
'data.frame': 8 obs. of 4 variables:
$ geslacht : Factor w/ 1 level "M": 1 1 1 1 1 1 1 1
$ leeftijd : num 20 60 30 25 36 52 145 25
$ invultijd: num 6 12 2 5 6 8 69 7
$ type : Factor w/ 8 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8
I trained the model with 2 levels for 'geslacht', but sometimes I have to predict on data that contains only 1 level of 'geslacht'.
Is it possible to predict the class probabilities for a test set with only 1 level of 'geslacht'?
I hope someone can help me!!
Add another level (but not data) to geslacht.
x <- factor(c("A", "A"), levels = c("A", "B"))
x
[1] A A
Levels: A B
or
x <- factor(c("A", "A"))
levels(x) <- c("A", "B")
x
[1] A A
Levels: A B
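Applied to the question, one way to do this is to rebuild the test factor with the levels of the training factor. The sketch below uses small made-up data frames (`train` and `test` stand in for `dataframetrain` and `dataframetest`) and base R's `model.matrix()` to show that the contrasts error no longer occurs once both levels are present.

```r
# Hypothetical stand-ins for dataframetrain and dataframetest
train <- data.frame(geslacht = factor(c("M", "V", "M", "V")),
                    leeftijd = c(77, 67, 66, 64))
test  <- data.frame(geslacht = factor(c("M", "M")),
                    leeftijd = c(20, 60))

# Re-declare the test factor with the training levels (no new data added)
test$geslacht <- factor(test$geslacht, levels = levels(train$geslacht))
levels(test$geslacht)

# With two levels, the design matrix can be built again;
# the column for the unobserved level "V" is simply all zeros
mm <- model.matrix(~ geslacht + leeftijd, data = test)
mm
```

The same re-leveling would apply to any other factor predictor whose test data lacks some of the training levels.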
I'm quite new to R and am battling a bit with what would appear to be an extremely simple query.
I've imported a csv file into R using read.csv and am trying to remove the dollar signs ($) prior to tidying the data and further analysis (the dollar signs are playing havoc with charting).
I've been trying without luck to strip the $ using dplyr and gsub from the data frame and I'd really appreciate some advice about how to go about it.
My data frame looks like this:
> str(data)
'data.frame': 50 obs. of 17 variables:
$ Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Prog.Cost : Factor w/ 2 levels "-$3,333","$0": 1 2 2 2 2 2 2 2 2 2 ...
$ Total.Benefits : Factor w/ 44 levels "$2,155","$2,418",..: 25 5 7 11 12 10 9 14 13 8 ...
$ Net.Cash.Flow : Factor w/ 45 levels "-$2,825","$2,155",..: 1 6 8 12 13 11 10 15 14 9 ...
$ Participant : Factor w/ 46 levels "$0","$109","$123",..: 1 1 1 45 46 2 3 4 5 6 ...
$ Taxpayer : Factor w/ 48 levels "$113","$114",..: 19 32 35 37 38 40 41 45 48 47 ...
$ Others : Factor w/ 47 levels "-$9","$1,026",..: 12 25 26 24 23 11 9 10 8 7 ...
$ Indirect : Factor w/ 42 levels "-$1,626","-$2",..: 1 6 10 18 22 24 28 33 36 35 ...
$ Crime : Factor w/ 35 levels "$0","$1","$10",..: 6 11 13 19 21 23 28 31 33 32 ...
$ Child.Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Education : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Health.Care : Factor w/ 38 levels "-$10","-$11",..: 7 7 7 7 2 8 12 36 30 9 ...
$ Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Earnings : Factor w/ 41 levels "$0","$101","$104",..: 1 1 1 22 23 24 25 26 27 28 ...
$ State.Benefits : Factor w/ 37 levels "$102","$117",..: 37 1 3 4 6 10 12 18 24 27 ...
$ Local.Benefits : Factor w/ 24 levels "$115","$136",..: 24 1 2 12 14 16 19 22 23 21 ...
$ Federal.Benefits: Factor w/ 39 levels "$0","$100","$102",..: 1 1 1 12 12 17 20 19 19 21 ...
If you only need to remove the $ and do not want to change the class of the columns:
indx <- sapply(data, is.factor)
data[indx] <- lapply(data[indx], function(x)
as.factor(gsub("\\$", "", x)))
If you need numeric columns, you can strip out the , as well (contributed by @David Arenburg) and convert to numeric with as.numeric
data[indx] <- lapply(data[indx], function(x) as.numeric(gsub("[,$]", "", x)))
You can wrap this in a function
f1 <- function(dat, pat="[$]", Class="factor"){
indx <- sapply(dat, is.factor)
if(Class=="factor"){
dat[indx] <- lapply(dat[indx], function(x) as.factor(gsub(pat, "", x)))
}
else {
dat[indx] <- lapply(dat[indx], function(x) as.numeric(gsub(pat, "", x)))
}
dat
}
f1(data)
f1(data, pat="[,$]", "numeric")
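One caveat: since R 4.0.0, `read.csv` defaults to `stringsAsFactors = FALSE`, so columns like these arrive as character rather than factor and `sapply(data, is.factor)` selects nothing. A variant that handles both cases might look like this; the toy data frame `dat` and the helper name `is_money` are made up for illustration, and the selection assumes every text column holds currency values.

```r
# Hypothetical toy data with character (not factor) currency columns
dat <- data.frame(
  Year           = 1:3,
  Prog.Cost      = c("-$3,333", "$0", "$0"),
  Total.Benefits = c("$2,155", "$2,418", "$2,312"),
  stringsAsFactors = FALSE
)

# Select columns stored as text, whether character or factor
is_money <- vapply(dat,
                   function(col) is.character(col) || is.factor(col),
                   logical(1))

# Strip "$" and "," and convert to numeric
dat[is_money] <- lapply(dat[is_money], function(col) {
  as.numeric(gsub("[,$]", "", as.character(col)))
})

str(dat)
```

If the data frame also contains text columns that are not currency, select the columns by name instead.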
data
set.seed(24)
data <- data.frame(Year=1:6, Prog.Cost= sample(c("-$3,3333", "$0"),
6, replace=TRUE), Total.Benefits= sample(c("$2,155","$2,418",
"$2,312"), 6, replace=TRUE))
If you have to read a lot of csv files with data like this, perhaps you should consider creating your own as method to use with the colClasses argument, like this:
setClass("dollar")
setAs("character", "dollar",
function(from)
as.numeric(gsub("[,$]", "", from, fixed = FALSE)))
Before demonstrating how to use it, let's write @akrun's sample data to a csv file named "A". This would not be necessary in your actual use case where you would be reading the file directly...
## write @akrun's sample data to a csv file named "A"
set.seed(24)
data <- data.frame(
Year=1:6,
Prog.Cost= sample(c("-$3,3333", "$0"), 6, replace = TRUE),
Total.Benefits = sample(c("$2,155","$2,418","$2,312"), 6, replace=TRUE))
A <- tempfile()
write.csv(data, A, row.names = FALSE)
Now, you have a new option for colClasses that can be used with read.csv :-)
read.csv(A, colClasses = c("numeric", "dollar", "dollar"))
# Year Prog.Cost Total.Benefits
# 1 1 -33333 2155
# 2 2 -33333 2312
# 3 3 0 2312
# 4 4 0 2155
# 5 5 0 2418
# 6 6 0 2418
It would probably be more beneficial to just read it again, this time with readLines. I wrote akrun's data to the file "data.txt" and fixed the strings before reading the table. I was not sure whether the comma was a decimal point or a thousands separator, so I chose decimal point.
r <- gsub("[$]", "", readLines("data.txt"))
read.table(text = r, dec = ",")
# Year Prog.Cost Total.Benefits
# 1 1 -3.3333 2.155
# 2 2 -3.3333 2.312
# 3 3 0.0000 2.312
# 4 4 0.0000 2.155
# 5 5 0.0000 2.418
# 6 6 0.0000 2.418