Subsetting nominal variables in R

I have a data set similar to this one:
x <- sample(c("A", "B", "C", "D"), 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05))
y <- sample(1:40, 1000, replace=TRUE)
d <- data.frame(x,y)
str(d)
'data.frame': 1000 obs. of 2 variables:
$ x: Factor w/4 levels "A","B","C","D": 1 3 3 2 3 3 3 3 4 3 ...
$ y: int 28 35 14 4 34 36 30 35 26 9 ...
table(d$x)
A B C D
115 204 637 44
In my real data set I have several thousand of these categories (A, B, C, D).
The str() of my real dataset
str(realdata)
'data.frame': 346340 obs. of 91 variables:
$ author : Factor w/ 42590 levels "-jon-","--LZR--",..: 1962 3434 1241 7666 6235 2391 1196 2779 1881 339 ...
$ created_utc : Factor w/ 343708 levels "2015-05-01 02:00:41",..: 14815 23163 2281 3569 5922 7211 15783 5512 13485 8591 ...
$ group : Factor w/ 5 levels "xyz","abc","bnm",..: 2 2 2 2 2 2 2 2 2 2 ...
....
Now I want to subset the data so that my new data frame contains only the rows of those authors ($author, or $x in the d data frame) that have more than 100 entries in total.
I tried the following:
dnew <- subset(realdata, table(realdata$author) > 100)
It gives me a result, but it seems that not all entries of the authors were included: I get just 1.3% of the rows of the complete dataset. I checked manually (with Excel) and it should be far more than that (approx. 30%). The manual analysis showed that 1.2% of the authors account for 30% of the entries. So it seems it gave me just one row for each author who has more than 100 entries, rather than all of that author's entries.
Do you know of a way to fix this?

We can do this easily with data.table. Convert the 'data.frame' to a 'data.table' with setDT(d); then, grouped by 'x', if the number of observations (.N) is greater than 100, we return the Subset of Data.table (.SD).
library(data.table)
ddt <- setDT(d)[, if(.N > 100) .SD, x]
Or if we are using dplyr, the same approach can be used.
library(dplyr)
dpl <- d %>%
group_by(x) %>%
filter(n() > 100) %>%
droplevels()
str(dpl)
#Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 866 obs. of 2 variables:
#$ x: Factor w/ 2 levels "B","C": 1 1 2 1 1 2 2 2 2 2 ...
# $ y: int 25 25 13 11 2 32 12 15 12 3 ...
Also, in base R, table can be helpful:
v1 <- table(d$x)
d1 <- subset(d, x %in% names(v1)[v1 > 100])
As the column 'x' is a factor, the unused levels persist when we subset the dataset. To remove them, use droplevels:
d2 <- droplevels(d1)
As the OP didn't set a seed, the output will differ from run to run.
str(d2)
#'data.frame': 866 obs. of 2 variables:
#$ x: Factor w/ 2 levels "B","C": 1 1 2 1 1 2 2 2 2 2 ...
#$ y: int 25 25 13 11 2 32 12 15 12 3 ...
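As for why the original attempt returns too few rows: table(realdata$author) > 100 yields one TRUE/FALSE per factor level, not one per row, so it cannot serve directly as a row filter (as far as I can tell, the short logical vector simply gets recycled along the rows). Here is a minimal sketch with made-up, deterministic data (the group sizes are assumptions chosen for illustration, not the OP's data) that first maps each row to the size of its group:

```r
# Toy data with fixed group sizes (illustrative only)
d <- data.frame(
  x = factor(rep(c("A", "B", "C", "D"), times = c(100, 200, 650, 50))),
  y = seq_len(1000)
)

counts <- table(d$x)

# One logical value per LEVEL (length 4), not per row (length 1000)
per_level <- counts > 100
length(per_level)

# Look up each row's group size by level name: one value per ROW
n_per_row <- as.vector(counts[as.character(d$x)])
per_row <- n_per_row > 100
length(per_row)

# Keep every row whose level occurs more than 100 times,
# then drop the levels that no longer occur
dnew <- droplevels(d[per_row, ])
table(dnew$x)
```

The row-level mask is equivalent to the x %in% names(v1)[v1 > 100] filter shown above; it just makes the per-level versus per-row distinction explicit.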

I. Data frame d with four levels
table(d$x)
# A B C D
# 92 232 630 46
II. Checking which levels have more than 100 records
which(table(d$x)>100)
# B C
# 2 3
III. Subsetting the d data frame to keep only records belonging to levels which have more than 100 records, i.e. level B and level C
result <- d[ d$x %in% names(table(d$x))[table(d$x) > 100] , ]
dim(result)
# [1] 862 2
str(result)
# 'data.frame': 862 obs. of 2 variables:
# $ x: Factor w/ 4 levels "A","B","C","D": 3 2 3 3 2 2 2 3 3 3 ...
# $ y: int 29 32 27 40 30 38 8 16 2 23 ...
Levels A and D still persist with 0 records
table(result$x)
# A B C D
# 0 232 630 0
IV. Removing the levels with 0 records using factor()
result$x <- factor(result$x)
str(result)
# 'data.frame': 862 obs. of 2 variables:
# $ x: Factor w/ 2 levels "B","C": 2 2 1 2 1 2 2 2 1 2 ...
# $ y: int 29 32 27 40 30 38 8 16 2 23 ...
table(result$x)
# B C
# 232 630

Related

R - Random Forest and more than 53 categories

I know that randomForest cannot handle categorical predictors with more than 53 categories. Sadly, I have to analyze data in which one column has 165 levels, and I want to use randomForest for classification.
My problem is that I cannot remove this column, since this predictor is really important and known to be valuable.
This predictor has 165 levels and is a factor.
Are there any tips on how I can handle this? Since we are talking about film genre, I have no idea.
Are there alternative packages for big data? A special workaround? Something like that.
Switching to Python is no option. We have too many R scripts here.
Thanks a lot and all the best
The str(data) looks like this:
'data.frame': 481696 obs. of 18 variables:
$ SENDERNR : int 432 1612 735 721 436 436 1321 721 721 434 ...
$ SENDER : Factor w/ 14 levels "ARD Das Erste",..: 6 3 4 9 12 12 10 9 9 7 ...
$ GEPLANTE_SENDUNG_N: Factor w/ 12563 levels "-- nicht bekannt --",..: 7070 808 5579 9584 4922 4922 12492 1933 9584 4533 ...
$ U_N_PROGRAMMCODE : Factor w/ 14 levels "Bühne/Aufführung",..: 9 4 8 4 8 8 12 8 4 2 ...
$ U_N_PROGRAMMSPARTE: Factor w/ 6 levels "Anderes","Fiction",..: 5 3 2 3 2 2 5 2 3 3 ...
$ U_N_SENDUNGSFORMAT: Factor w/ 29 levels "Bühne / Aufführung",..: 20 9 19 4 19 19 24 19 4 16 ...
$ U_N_GENRE : Factor w/ 163 levels "Action / Abenteuer",..: 119 147 115 4 158 158 163 61 4 84 ...
$ U_N_PRODUKTIONSART: Factor w/ 5 levels "Eigen-, Co-, Auftragsproduktion, Cofinanzierung",..: 1 1 3 1 3 3 1 3 1 1 ...
$ U_N_HERKUNFTSLAND : Factor w/ 25 levels "afrikanische Länder",..: 16 16 25 16 15 15 16 25 16 16 ...
$ GEPLANTE_SENDUNG_V: Factor w/ 12191 levels "-- nicht bekannt --",..: 6932 800 5470 9382 1518 9318 12119 1829 9382 4432 ...
$ U_V_PROGRAMMCODE : Factor w/ 13 levels "Bühne/Aufführung",..: 9 4 8 4 8 8 12 8 4 2 ...
$ U_V_PROGRAMMSPARTE: Factor w/ 6 levels "Anderes","Fiction",..: 5 3 2 3 2 2 5 2 3 3 ...
$ U_V_SENDUNGSFORMAT: Factor w/ 28 levels "Bühne / Aufführung",..: 20 9 19 4 19 19 24 19 4 16 ...
$ U_V_GENRE : Factor w/ 165 levels "Action / Abenteuer",..: 119 148 115 4 160 19 165 61 4 84 ...
$ U_V_PRODUKTIONSART: Factor w/ 5 levels "Eigen-, Co-, Auftragsproduktion, Cofinanzierung",..: 1 1 3 1 3 3 1 3 1 1 ...
$ U_V_HERKUNFTSLAND : Factor w/ 25 levels "afrikanische Länder",..: 16 16 25 16 15 9 16 25 16 16 ...
$ ABGELEHNT : int 0 0 0 0 0 0 0 0 0 0 ...
$ AKZEPTIERT : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1 2 2 2 ...
Having faced the same issue, here are some tips I can offer.
Switch to another algorithm, for instance gradient boosting from the gbm package, which can handle up to 1024 categorical levels. If your predictor is quite discriminative, you should also consider probabilistic approaches such as naiveBayes.
Transform your predictor into dummy variables, which can be done using model.matrix. You can then run a random forest over this matrix.
Reduce the number of levels in your factor. OK, that may sound like silly advice, but is it really relevant to look at factors with such "thinness"? Is it possible for you to aggregate some modalities at a broader level?
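For the last tip, one possible way to aggregate is to lump all rare levels into a single catch-all level. This is only a sketch: the helper name lump_rare_levels, the "Other" label and the threshold are my own assumptions, and whether lumping makes sense depends on the genres involved.

```r
# Hypothetical helper: merge every level with fewer than `min_count`
# observations into one catch-all level
lump_rare_levels <- function(f, min_count, other = "Other") {
  f <- as.factor(f)
  counts <- table(f)
  rare <- names(counts)[counts < min_count]
  # Assigning the same name to several levels merges them
  levels(f)[levels(f) %in% rare] <- other
  f
}

# Made-up example data
genre <- factor(c(rep("Drama", 5), rep("Comedy", 4),
                  "Western", "Opera", "Ballet"))
lumped <- lump_rare_levels(genre, min_count = 2)
table(lumped)
```

Applied to a column such as U_V_GENRE with a suitable threshold, this could bring the number of levels below the limit while keeping the frequent genres intact.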
EDIT TO ADD MODEL.MATRIX EXAMPLE
As mentioned, here is an example on how to use model.matrix to transform your column into dummy variables.
mydf <- data.frame(var1 = factor(c("A", "A", "A", "B", "B", "C")),
                   var2 = factor(c("X", "Y", "X", "Y", "X", "Z")),
                   target = c(1, 1, 1, 2, 2, 2))
# set contrasts.arg to keep all levels
dummyMat <- model.matrix(target ~ var1 + var2, mydf,
                         contrasts.arg = list(var1 = contrasts(mydf$var1, contrasts = FALSE),
                                              var2 = contrasts(mydf$var2, contrasts = FALSE)))
mydf2 <- cbind(mydf, dummyMat[, 2:ncol(dummyMat)]) # just removing intercept column
Use the caret package :
random_forest <- train("***dependent variable name***" ~ .,
data = "***your training data set***",
method = "ranger")
This can handle more than 53 categories.

Error : C5.0 models require a factor outcome

I have data which looks like this:
IntensityRisk Depth Mag smaj smin
<fctr> <int> <int> <int> <int>
1 2 2 3 2 2
2 3 1 3 2 2
3 3 1 3 2 2
4 3 1 1 2 2
5 3 1 1 2 2
6 2 2 3 2 2
7 3 1 3 2 2
8 3 1 3 2 2
9 3 1 3 2 2
10 2 2 3 2 2
I took the following steps:
gempaDF <- gempa[order(runif(nrow(gempa))),]
str(gempaDF$IntensityRisk)
tail(gempaDF,5)
gempaTrain <- gempaDF[1:4000,]
gempaTest <- gempaDF[4001:4471,]
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,1])
and got this error:
Error in C5.0.default(gempaTrain[, -1], gempaTrain[, 1]) :
C5.0 models require a factor outcome
I changed it to this:
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,as.factor(gempaDF$IntensityRisk)])
and got an error again:
Error: Unsupported index type: factor
Then I tried changing it to this:
gempaDF <- gempa[order(runif(nrow(gempa))),]
gempaDF$IntensityRisk <- as.factor(gempaDF$IntensityRisk)
str(gempaDF$IntensityRisk)
tail(gempaDF,5)
gempaTrain <- gempaDF[1:4000,]
gempaTest <- gempaDF[4001:4471,]
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,1])
But I still get this error:
Error in C5.0.default(gempaTrain[, -1], gempaTrain[, 1]) :
C5.0 models require a factor outcome
I tried this too:
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,gempaDF$IntensityRisk])
But I still get an error:
Error: Unsupported index type: factor
Does anyone know where I went wrong? I would appreciate any help.
I'll use the following sample data (as I don't have access to your data)
set.seed(1)
dat = tibble::as_tibble(list(IntensityRisk = sample(1:5, 30, replace = T), Depth = sample(1:100, 30, replace = T), Mag = sample(1:100, 30, replace = T)))
table(dat$IntensityRisk)
1 2 3 4 5
4 11 2 7 6
# convert the response to factor,
dat$IntensityRisk = as.factor(dat$IntensityRisk)
str(dat)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 30 obs. of 3 variables:
$ IntensityRisk: Factor w/ 5 levels "1","2","3","4",..: 2 2 3 5 2 5 5 4 4 1 ...
$ Depth : int 49 60 50 19 83 67 80 11 73 42 ...
$ Mag : int 92 30 46 34 66 26 48 77 9 88 ...
If I use the tibble dataframe I get a similar error,
fit = C50::C5.0(dat[, -1], dat[, 1])
Error in C5.0.default(dat[, -1], dat[, 1]) :
C5.0 models require a factor outcome
If I convert to a dataframe,
dat1 = as.data.frame(dat)
str(dat1)
'data.frame': 30 obs. of 3 variables:
$ IntensityRisk: Factor w/ 5 levels "1","2","3","4",..: 2 2 3 5 2 5 5 4 4 1 ...
$ Depth : int 49 60 50 19 83 67 80 11 73 42 ...
$ Mag : int 92 30 46 34 66 26 48 77 9 88 ...
the function runs error free,
fit = C50::C5.0(dat1[, -1], dat1[, 1])
> fit
Call:
C5.0.default(x = dat1[, -1], y = dat1[, 1])
Classification Tree
Number of samples: 30
Number of predictors: 2
Tree size: 8
Non-standard options: attempt to group attributes
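The likely reason, as far as I can tell: indexing a tibble with [, 1] always returns a one-column tibble rather than a bare vector, so the outcome handed to C5.0 is not a factor. The sketch below mimics that behaviour in base R with drop = FALSE (so it runs without tibble or C50 installed; the toy values are made up). Extracting the column with [[1]] or $IntensityRisk would be an alternative to converting with as.data.frame.

```r
dat <- data.frame(
  IntensityRisk = factor(c(2, 3, 3, 2, 1)),
  Depth = c(49L, 60L, 50L, 19L, 83L)
)

# What a tibble gives for dat[, 1]: still a one-column data frame
y_as_frame <- dat[, 1, drop = FALSE]
is.factor(y_as_frame)   # FALSE

# Extracting the column itself gives the factor
y_as_vector <- dat[[1]]
is.factor(y_as_vector)  # TRUE

# With a tibble, a call of this shape should therefore work
# (assumes the C50 package is installed):
# fit <- C50::C5.0(dat[, -1], dat[[1]])
```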

Must a dataset contain all factor levels for SVM in R?

I'm trying to find class probabilities of new input vectors with support vector machines in R.
Training the model shows no errors.
fit <-svm(device~.,data=dataframetrain,
kernel="polynomial",probability=TRUE)
But predicting some input vector shows some errors.
predict(fit,dataframetest,probability=prob)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
dataframetrain looks like:
> str(dataframetrain)
'data.frame': 24577 obs. of 5 variables:
$ device : Factor w/ 3 levels "mob","pc","tab": 1 1 1 1 1 1 1 1 1 1 ...
$ geslacht : Factor w/ 2 levels "M","V": 1 1 1 1 1 1 1 1 1 1 ...
$ leeftijd : num 77 67 67 66 64 64 63 61 61 58 ...
$ invultijd: num 12 12 12 12 12 12 12 12 12 12 ...
$ type : Factor w/ 8 levels "A","B","C","D",..: 5 5 5 5 5 5 5 5 5 5 ...
and dataframetest looks like:
> str(dataframetest)
'data.frame': 8 obs. of 4 variables:
$ geslacht : Factor w/ 1 level "M": 1 1 1 1 1 1 1 1
$ leeftijd : num 20 60 30 25 36 52 145 25
$ invultijd: num 6 12 2 5 6 8 69 7
$ type : Factor w/ 8 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8
I trained the model with 2 levels for 'geslacht', but sometimes I have to predict data containing only 1 level of 'geslacht'.
Is it possible to predict the class probabilities for a test set with only 1 level of 'geslacht'?
I hope someone can help me!
Add another level (but not data) to geslacht.
x <- factor(c("A", "A"), levels = c("A", "B"))
x
[1] A A
Levels: A B
or
x <- factor(c("A", "A"))
levels(x) <- c("A", "B")
x
[1] A A
Levels: A B
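Applied to the question, one way to do this is to rebuild the test factor with the levels of the training factor. The two small data frames below are made-up stand-ins for dataframetrain and dataframetest, so treat this as a sketch:

```r
# Stand-ins for the OP's data frames (values are invented)
train <- data.frame(geslacht = factor(c("M", "V", "M", "V")))
test  <- data.frame(geslacht = factor(c("M", "M", "M")))

nlevels(test$geslacht)   # only one level before the fix

# Re-create the factor using the training levels; the data stay the same
test$geslacht <- factor(test$geslacht, levels = levels(train$geslacht))

levels(test$geslacht)
table(test$geslacht)
```

Taking the levels from the training data, rather than typing them by hand, keeps their order identical to what the model saw during fitting.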

How can I strip dollar signs ($) from a data frame in R?

I'm quite new to R and am battling a bit with what would appear to be an extremely simple query.
I've imported a csv file into R using read.csv and am trying to remove the dollar signs ($) prior to tidying the data and further analysis (the dollar signs are playing havoc with charting).
I've been trying without luck to strip the $ using dplyr and gsub from the data frame and I'd really appreciate some advice about how to go about it.
My data frame looks like this:
> str(data)
'data.frame': 50 obs. of 17 variables:
$ Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Prog.Cost : Factor w/ 2 levels "-$3,333","$0": 1 2 2 2 2 2 2 2 2 2 ...
$ Total.Benefits : Factor w/ 44 levels "$2,155","$2,418",..: 25 5 7 11 12 10 9 14 13 8 ...
$ Net.Cash.Flow : Factor w/ 45 levels "-$2,825","$2,155",..: 1 6 8 12 13 11 10 15 14 9 ...
$ Participant : Factor w/ 46 levels "$0","$109","$123",..: 1 1 1 45 46 2 3 4 5 6 ...
$ Taxpayer : Factor w/ 48 levels "$113","$114",..: 19 32 35 37 38 40 41 45 48 47 ...
$ Others : Factor w/ 47 levels "-$9","$1,026",..: 12 25 26 24 23 11 9 10 8 7 ...
$ Indirect : Factor w/ 42 levels "-$1,626","-$2",..: 1 6 10 18 22 24 28 33 36 35 ...
$ Crime : Factor w/ 35 levels "$0","$1","$10",..: 6 11 13 19 21 23 28 31 33 32 ...
$ Child.Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Education : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Health.Care : Factor w/ 38 levels "-$10","-$11",..: 7 7 7 7 2 8 12 36 30 9 ...
$ Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Earnings : Factor w/ 41 levels "$0","$101","$104",..: 1 1 1 22 23 24 25 26 27 28 ...
$ State.Benefits : Factor w/ 37 levels "$102","$117",..: 37 1 3 4 6 10 12 18 24 27 ...
$ Local.Benefits : Factor w/ 24 levels "$115","$136",..: 24 1 2 12 14 16 19 22 23 21 ...
$ Federal.Benefits: Factor w/ 39 levels "$0","$100","$102",..: 1 1 1 12 12 17 20 19 19 21 ...
If you only need to remove the $ and do not want to change the class of the columns:
indx <- sapply(data, is.factor)
data[indx] <- lapply(data[indx], function(x)
as.factor(gsub("\\$", "", x)))
If you need numeric columns, you can strip out the , as well (contributed by @David Arenburg) and convert to numeric with as.numeric:
data[indx] <- lapply(data[indx], function(x) as.numeric(gsub("[,$]", "", x)))
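A small worked example on a single made-up column may help show why the gsub step matters. Calling as.numeric directly on a factor returns its internal integer codes, whereas gsub first coerces the factor to its character labels:

```r
# Invented values in the same style as the question's columns
cost <- factor(c("-$3,333", "$0", "$2,155"))

# Directly converting the factor yields level codes, not amounts
codes <- as.numeric(cost)

# Stripping "$" and "," works on the labels and keeps the minus sign
cleaned <- as.numeric(gsub("[,$]", "", cost))
cleaned
```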
You can wrap this in a function
f1 <- function(dat, pat="[$]", Class="factor"){
indx <- sapply(dat, is.factor)
if(Class=="factor"){
dat[indx] <- lapply(dat[indx], function(x) as.factor(gsub(pat, "", x)))
}
else {
dat[indx] <- lapply(dat[indx], function(x) as.numeric(gsub(pat, "", x)))
}
dat
}
f1(data)
f1(data, pat="[,$]", "numeric")
data
set.seed(24)
data <- data.frame(Year=1:6, Prog.Cost= sample(c("-$3,3333", "$0"),
6, replace=TRUE), Total.Benefits= sample(c("$2,155","$2,418",
"$2,312"), 6, replace=TRUE))
If you have to read a lot of csv files with data like this, perhaps you should consider creating your own as method to use with the colClasses argument, like this:
setClass("dollar")
setAs("character", "dollar",
function(from)
as.numeric(gsub("[,$]", "", from, fixed = FALSE)))
Before demonstrating how to use it, let's write @akrun's sample data to a csv file named "A". This would not be necessary in your actual use case where you would be reading the file directly...
## write @akrun's sample data to a csv file named "A"
set.seed(24)
data <- data.frame(
Year=1:6,
Prog.Cost= sample(c("-$3,3333", "$0"), 6, replace = TRUE),
Total.Benefits = sample(c("$2,155","$2,418","$2,312"), 6, replace=TRUE))
A <- tempfile()
write.csv(data, A, row.names = FALSE)
Now, you have a new option for colClasses that can be used with read.csv :-)
read.csv(A, colClasses = c("numeric", "dollar", "dollar"))
# Year Prog.Cost Total.Benefits
# 1 1 -33333 2155
# 2 2 -33333 2312
# 3 3 0 2312
# 4 4 0 2155
# 5 5 0 2418
# 6 6 0 2418
It would probably be more beneficial to just read it again, this time with readLines. I wrote akrun's data to the file "data.txt" and fixed the strings before reading the table. Not sure if the comma was a decimal point or an annoying comma, so I chose decimal point.
r <- gsub("[$]", "", readLines("data.txt"))
read.table(text = r, dec = ",")
# Year Prog.Cost Total.Benefits
# 1 1 -3.3333 2.155
# 2 2 -3.3333 2.312
# 3 3 0.0000 2.312
# 4 4 0.0000 2.155
# 5 5 0.0000 2.418
# 6 6 0.0000 2.418

Treatment of 'empty' values

I am importing a csv file into R using the sqldf package. I have several missing values for both numeric and string variables. I notice that missing values are left empty in the data frame (as opposed to being filled with NA or something else). I want to replace the missing values with a user-defined value. Obviously, a function like is.na() will not work in this case.
Toy dataframe with three columns:
A B C
3 4
2 4 6
34 23 43
2 5
I want:
A B C
3 4 NA
2 4 6
34 23 43
2 5 NA
Thank you in advance.
Assuming you are using read.csv.sql in sqldf with the default sqlite database it is producing a factor column for C so
(1) just convert the values to numeric using as.numeric(as.character(...)) like this:
> Lines <- "A,B,C
+ 3,4,
+ 2,4,6
+ 34,23,43
+ 2,5,
+ "
> cat(Lines, file = "stest.csv")
> library(sqldf)
> DF <- read.csv.sql("stest.csv")
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: Factor w/ 3 levels "","43","6": 1 3 2 1
> DF$C <- as.numeric(as.character(DF$C))
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: num NA 6 43 NA
(2) or if we use sqldf(..., method = "raw") then we can just use as.numeric:
> DF <- read.csv.sql("stest.csv", method = "raw")
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: chr "" "6" "43" ""
> DF$C <- as.numeric(DF$C)
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: num NA 6 43 NA
(3) If it's feasible for you to use read.csv then we do get NA filling right off:
> str(read.csv("stest.csv"))
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: int NA 6 43 NA
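Since the question also asks for a user-defined replacement value, here is a rough sketch of that last step. It builds the data frame by hand instead of reading stest.csv, and fill_value is a placeholder of my own choosing:

```r
# Same toy data as above, with C as it arrives when empty fields are kept
DF <- data.frame(A = c(3L, 2L, 34L, 2L),
                 B = c(4L, 4L, 23L, 5L),
                 C = c("", "6", "43", ""),
                 stringsAsFactors = FALSE)

# Turn empty strings into proper NAs, then convert to numeric
DF$C[DF$C == ""] <- NA
DF$C <- as.numeric(DF$C)

# Replace the NAs with a user-defined value (placeholder)
fill_value <- -1
DF$C[is.na(DF$C)] <- fill_value
DF
```

Once the empty fields are real NAs, is.na() works as usual, so the same pattern applies to string columns with a character fill value.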

Resources