I'm quite new to R and am battling a bit with what would appear to be an extremely simple query.
I've imported a csv file into R using read.csv and am trying to remove the dollar signs ($) prior to tidying the data and further analysis (the dollar signs are playing havoc with charting).
I've been trying without luck to strip the $ using dplyr and gsub from the data frame and I'd really appreciate some advice about how to go about it.
My data frame looks like this:
> str(data)
'data.frame': 50 obs. of 17 variables:
$ Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Prog.Cost : Factor w/ 2 levels "-$3,333","$0": 1 2 2 2 2 2 2 2 2 2 ...
$ Total.Benefits : Factor w/ 44 levels "$2,155","$2,418",..: 25 5 7 11 12 10 9 14 13 8 ...
$ Net.Cash.Flow : Factor w/ 45 levels "-$2,825","$2,155",..: 1 6 8 12 13 11 10 15 14 9 ...
$ Participant : Factor w/ 46 levels "$0","$109","$123",..: 1 1 1 45 46 2 3 4 5 6 ...
$ Taxpayer : Factor w/ 48 levels "$113","$114",..: 19 32 35 37 38 40 41 45 48 47 ...
$ Others : Factor w/ 47 levels "-$9","$1,026",..: 12 25 26 24 23 11 9 10 8 7 ...
$ Indirect : Factor w/ 42 levels "-$1,626","-$2",..: 1 6 10 18 22 24 28 33 36 35 ...
$ Crime : Factor w/ 35 levels "$0","$1","$10",..: 6 11 13 19 21 23 28 31 33 32 ...
$ Child.Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Education : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Health.Care : Factor w/ 38 levels "-$10","-$11",..: 7 7 7 7 2 8 12 36 30 9 ...
$ Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Earnings : Factor w/ 41 levels "$0","$101","$104",..: 1 1 1 22 23 24 25 26 27 28 ...
$ State.Benefits : Factor w/ 37 levels "$102","$117",..: 37 1 3 4 6 10 12 18 24 27 ...
$ Local.Benefits : Factor w/ 24 levels "$115","$136",..: 24 1 2 12 14 16 19 22 23 21 ...
$ Federal.Benefits: Factor w/ 39 levels "$0","$100","$102",..: 1 1 1 12 12 17 20 19 19 21 ...
If you only need to remove the $ and do not want to change the class of the columns:
indx <- sapply(data, is.factor)
data[indx] <- lapply(data[indx], function(x)
as.factor(gsub("\\$", "", x)))
If you need numeric columns, you can strip out the , as well (contributed by @David Arenburg) and convert to numeric with as.numeric:
data[indx] <- lapply(data[indx], function(x) as.numeric(gsub("[,$]", "", x)))
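As a quick check of why the gsub step matters, here is a minimal sketch with made-up values (the vector x is hypothetical, not the asker's data). Calling as.numeric directly on a factor returns its internal level codes, whereas gsub first coerces the factor to character, so the cleaned strings convert to the intended amounts.

```r
# Hypothetical currency strings stored as a factor, as read.csv may produce
x <- factor(c("-$3,333", "$0", "$2,155"))

# as.integer(x) would return the factor's level codes, not the amounts.
# gsub() coerces the factor to character, so stripping "$" and "," first
# gives strings that as.numeric() can parse, including the minus sign.
amounts <- as.numeric(gsub("[,$]", "", x))
print(amounts)  # the values -3333, 0 and 2155
```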
You can wrap this in a function
f1 <- function(dat, pat="[$]", Class="factor"){
indx <- sapply(dat, is.factor)
if(Class=="factor"){
dat[indx] <- lapply(dat[indx], function(x) as.factor(gsub(pat, "", x)))
}
else {
dat[indx] <- lapply(dat[indx], function(x) as.numeric(gsub(pat, "", x)))
}
dat
}
f1(data)
f1(data, pat="[,$]", "numeric")
data
set.seed(24)
data <- data.frame(Year=1:6, Prog.Cost= sample(c("-$3,3333", "$0"),
6, replace=TRUE), Total.Benefits= sample(c("$2,155","$2,418",
"$2,312"), 6, replace=TRUE))
If you have to read a lot of csv files with data like this, perhaps you should consider creating your own as method to use with the colClasses argument, like this:
setClass("dollar")
setAs("character", "dollar",
function(from)
as.numeric(gsub("[,$]", "", from, fixed = FALSE)))
Before demonstrating how to use it, let's write @akrun's sample data to a csv file named "A". This would not be necessary in your actual use case where you would be reading the file directly...
## write @akrun's sample data to a csv file named "A"
set.seed(24)
data <- data.frame(
Year=1:6,
Prog.Cost= sample(c("-$3,3333", "$0"), 6, replace = TRUE),
Total.Benefits = sample(c("$2,155","$2,418","$2,312"), 6, replace=TRUE))
A <- tempfile()
write.csv(data, A, row.names = FALSE)
Now, you have a new option for colClasses that can be used with read.csv :-)
read.csv(A, colClasses = c("numeric", "dollar", "dollar"))
# Year Prog.Cost Total.Benefits
# 1 1 -33333 2155
# 2 2 -33333 2312
# 3 3 0 2312
# 4 4 0 2155
# 5 5 0 2418
# 6 6 0 2418
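To try the converter without writing a file first, read.csv also accepts inline text through its text argument. This is only a sketch: the two-row CSV below is invented for illustration, and the setClass/setAs calls are repeated so that the block runs on its own.

```r
library(methods)

# Same converter as above, repeated so this block is self-contained
setClass("dollar")
setAs("character", "dollar",
      function(from) as.numeric(gsub("[,$]", "", from)))

# Invented sample; the amounts are quoted because they contain commas
csv_text <- 'Year,Cost
1,"$1,234"
2,"-$5"'

res <- read.csv(text = csv_text, colClasses = c("numeric", "dollar"))
str(res)  # Cost should now be a numeric column
```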
It would probably be more beneficial to just read it again, this time with readLines. I wrote akrun's data to the file "data.txt" and fixed the strings before reading the table. Not sure if the comma was a decimal point or an annoying comma, so I chose decimal point.
r <- gsub("[$]", "", readLines("data.txt"))
read.table(text = r, dec = ",")
# Year Prog.Cost Total.Benefits
# 1 1 -3.3333 2.155
# 2 2 -3.3333 2.312
# 3 3 0.0000 2.312
# 4 4 0.0000 2.155
# 5 5 0.0000 2.418
# 6 6 0.0000 2.418
I know that randomForest cannot handle factors with more than 53 categories. Sadly, I have to analyze data in which one column has 165 levels, and I want to use randomForest for classification.
My problem is that I cannot remove this column, since this predictor is really important and known to be valuable.
This predictor has 165 levels and is a factor.
Are there any tips how I can handle this? Since we are talking about film genre I have no idea.
Are there alternative packages for big data? A special workaround? Something like this..
Switching to Python is no option. We have too many R scripts here.
Thanks a lot and all the best
The str(data) looks like this:
'data.frame': 481696 obs. of 18 variables:
$ SENDERNR : int 432 1612 735 721 436 436 1321 721 721 434 ...
$ SENDER : Factor w/ 14 levels "ARD Das Erste",..: 6 3 4 9 12 12 10 9 9 7 ...
$ GEPLANTE_SENDUNG_N: Factor w/ 12563 levels "-- nicht bekannt --",..: 7070 808 5579 9584 4922 4922 12492 1933 9584 4533 ...
$ U_N_PROGRAMMCODE : Factor w/ 14 levels "Bühne/Aufführung",..: 9 4 8 4 8 8 12 8 4 2 ...
$ U_N_PROGRAMMSPARTE: Factor w/ 6 levels "Anderes","Fiction",..: 5 3 2 3 2 2 5 2 3 3 ...
$ U_N_SENDUNGSFORMAT: Factor w/ 29 levels "Bühne / Aufführung",..: 20 9 19 4 19 19 24 19 4 16 ...
$ U_N_GENRE : Factor w/ 163 levels "Action / Abenteuer",..: 119 147 115 4 158 158 163 61 4 84 ...
$ U_N_PRODUKTIONSART: Factor w/ 5 levels "Eigen-, Co-, Auftragsproduktion, Cofinanzierung",..: 1 1 3 1 3 3 1 3 1 1 ...
$ U_N_HERKUNFTSLAND : Factor w/ 25 levels "afrikanische Länder",..: 16 16 25 16 15 15 16 25 16 16 ...
$ GEPLANTE_SENDUNG_V: Factor w/ 12191 levels "-- nicht bekannt --",..: 6932 800 5470 9382 1518 9318 12119 1829 9382 4432 ...
$ U_V_PROGRAMMCODE : Factor w/ 13 levels "Bühne/Aufführung",..: 9 4 8 4 8 8 12 8 4 2 ...
$ U_V_PROGRAMMSPARTE: Factor w/ 6 levels "Anderes","Fiction",..: 5 3 2 3 2 2 5 2 3 3 ...
$ U_V_SENDUNGSFORMAT: Factor w/ 28 levels "Bühne / Aufführung",..: 20 9 19 4 19 19 24 19 4 16 ...
$ U_V_GENRE : Factor w/ 165 levels "Action / Abenteuer",..: 119 148 115 4 160 19 165 61 4 84 ...
$ U_V_PRODUKTIONSART: Factor w/ 5 levels "Eigen-, Co-, Auftragsproduktion, Cofinanzierung",..: 1 1 3 1 3 3 1 3 1 1 ...
$ U_V_HERKUNFTSLAND : Factor w/ 25 levels "afrikanische Länder",..: 16 16 25 16 15 9 16 25 16 16 ...
$ ABGELEHNT : int 0 0 0 0 0 0 0 0 0 0 ...
$ AKZEPTIERT : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1 2 2 2 ...
Having faced the same issue, here are some tips I can list.
Switch to another algorithm, for instance gradient boosting from the gbm package, which can handle up to 1024 categorical levels. If your predictor is quite discriminative, you should also consider probabilistic approaches such as naiveBayes.
Transform your predictor into dummy variables, which can be done with model.matrix. You can then fit a random forest on this matrix.
Reduce the number of levels in your factor. That may sound like silly advice, but is it really relevant to look at a factor at such a fine granularity? Could you aggregate some levels into broader groups?
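For the last tip, one possible way to reduce levels is to keep the most frequent ones and pool everything else. The helper lump_rare and the label "Other" below are hypothetical names chosen for this sketch, and the genre vector is made up; whether pooling by frequency is sensible depends on your data.

```r
# Hypothetical helper: keep the most frequent levels, pool the rest
lump_rare <- function(f, max_levels = 53, other = "Other") {
  f <- droplevels(as.factor(f))
  if (nlevels(f) <= max_levels) return(f)       # nothing to pool
  counts <- sort(table(f), decreasing = TRUE)
  keep <- names(counts)[seq_len(max_levels - 1)] # leave room for `other`
  out <- as.character(f)
  out[!is.na(out) & !(out %in% keep)] <- other   # NAs are left untouched
  factor(out)
}

# Made-up example with 5 levels, reduced to 3
genre <- factor(c(rep("Drama", 5), rep("News", 3), "Opera", "Quiz", "Western"))
lumped <- lump_rare(genre, max_levels = 3)
table(lumped)
```

Pooling by domain knowledge (for example merging sub-genres into a parent genre) is usually preferable when such a hierarchy exists; frequency is only a fallback.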
EDIT TO ADD MODEL.MATRIX EXAMPLE
As mentioned, here is an example on how to use model.matrix to transform your column into dummy variables.
mydf <- data.frame(var1 = factor(c("A", "A", "A", "B", "B", "C")),
var2 = factor(c("X", "Y", "X", "Y", "X", "Z")),
target = c(1,1,1,2,2,2))
dummyMat <- model.matrix(target ~ var1 + var2, mydf, # set contrasts.arg to keep all levels
contrasts.arg = list(var1 = contrasts(mydf$var1, contrasts = F),
var2 = contrasts(mydf$var2, contrasts = F)))
mydf2 <- cbind(mydf, dummyMat[, 2:ncol(dummyMat)]) # just removing intercept column
Use the caret package :
random_forest <- train("***dependent variable name***" ~ .,
data = "***your training data set***",
method = "ranger")
This can handle more than 53 categories.
I have data that looks like this:
IntensityRisk Depth Mag smaj smin
<fctr> <int> <int> <int> <int>
1 2 2 3 2 2
2 3 1 3 2 2
3 3 1 3 2 2
4 3 1 1 2 2
5 3 1 1 2 2
6 2 2 3 2 2
7 3 1 3 2 2
8 3 1 3 2 2
9 3 1 3 2 2
10 2 2 3 2 2
I took the following steps:
gempaDF <- gempa[order(runif(nrow(gempa))),]
str(gempaDF$IntensityRisk)
tail(gempaDF,5)
gempaTrain <- gempaDF[1:4000,]
gempaTest <- gempaDF[4001:4471,]
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,1])
and got an error like this:
Error in C5.0.default(gempaTrain[, -1], gempaTrain[, 1]) :
C5.0 models require a factor outcome
I changed it to this:
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,as.factor(gempaDF$IntensityRisk)])
and got an error again:
Error: Unsupported index type: factor
Then I tried changing it to this:
gempaDF <- gempa[order(runif(nrow(gempa))),]
gempaDF$IntensityRisk <- as.factor(gempaDF$IntensityRisk)
str(gempaDF$IntensityRisk)
tail(gempaDF,5)
gempaTrain <- gempaDF[1:4000,]
gempaTest <- gempaDF[4001:4471,]
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,1])
But I still get an error like this:
Error in C5.0.default(gempaTrain[, -1], gempaTrain[, 1]) :
C5.0 models require a factor outcome
I tried this too:
C50_model <- C5.0(gempaTrain[,-1], gempaTrain[,gempaDF$IntensityRisk])
But I still get an error:
Error: Unsupported index type: factor
Does anyone know where I went wrong? I would appreciate it very much.
I'll use the following sample data (as I don't have access to your data)
set.seed(1)
dat = tibble::as_tibble(list(IntensityRisk = sample(1:5, 30, replace = T), Depth = sample(1:100, 30, replace = T), Mag = sample(1:100, 30, replace = T)))
table(dat$IntensityRisk)
1 2 3 4 5
4 11 2 7 6
# convert the response to factor,
dat$IntensityRisk = as.factor(dat$IntensityRisk)
str(dat)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 30 obs. of 3 variables:
$ IntensityRisk: Factor w/ 5 levels "1","2","3","4",..: 2 2 3 5 2 5 5 4 4 1 ...
$ Depth : int 49 60 50 19 83 67 80 11 73 42 ...
$ Mag : int 92 30 46 34 66 26 48 77 9 88 ...
If I use the tibble dataframe I get a similar error,
fit = C50::C5.0(dat[, -1], dat[, 1])
Error in C5.0.default(dat[, -1], dat[, 1]) :
C5.0 models require a factor outcome
If I convert to a dataframe,
dat1 = as.data.frame(dat)
str(dat1)
'data.frame': 30 obs. of 3 variables:
$ IntensityRisk: Factor w/ 5 levels "1","2","3","4",..: 2 2 3 5 2 5 5 4 4 1 ...
$ Depth : int 49 60 50 19 83 67 80 11 73 42 ...
$ Mag : int 92 30 46 34 66 26 48 77 9 88 ...
the function runs error free,
fit = C50::C5.0(dat1[, -1], dat1[, 1])
> fit
Call:
C5.0.default(x = dat1[, -1], y = dat1[, 1])
Classification Tree
Number of samples: 30
Number of predictors: 2
Tree size: 8
Non-standard options: attempt to group attributes
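A likely explanation, offered here as an assumption rather than something confirmed above: single-column indexing with [, 1] on a tibble returns a one-column tibble instead of a vector, so C5.0 never receives a factor. The base R sketch below mimics that behaviour with drop = FALSE on a made-up data frame and shows that [[ ]] extracts the column itself.

```r
# Small made-up data frame; IntensityRisk is the factor outcome
df <- data.frame(IntensityRisk = factor(c(2, 3, 3, 2)),
                 Depth = c(2L, 1L, 1L, 2L))

# drop = FALSE mimics what a tibble does for df[, 1]: a data frame comes back
one_col <- df[, 1, drop = FALSE]
is.factor(one_col)   # FALSE: it is still a data frame

# [[ ]] (or $) returns the column vector itself
outcome <- df[[1]]
is.factor(outcome)   # TRUE
```

With the asker's objects, that would mean passing gempaTrain[[1]] or gempaTrain$IntensityRisk as the outcome, as an alternative to converting the whole tibble with as.data.frame.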
I have a list of data frames:
str(df.list)
List of 34
$ :'data.frame': 506 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:506] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALAT": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:506] 23 23 23 24 25 24 20 34 28 17 ...
..$ Index : Factor w/ 502 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ...
$ :'data.frame': 505 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:505] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALB": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:505] 45 46 47 47 49 47 46 46 44 43 ...
..$ Index : Factor w/ 501 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ..
The list contains 34 data frames with equal variable names. The variables Time and ResultAssay are of the wrong type: I would like to have Time as factor and ResultAssay as numerical.
I am trying to write a function to use together with lapply to convert the variable types in this list of 34 data frames in one go, but so far I have been unsuccessful.
I have tried things along the lines of:
ChangeType <- function(DF){
DF[,2] <- as.factor(DF[,2])
DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")])
}
lapply(df.list, ChangeType)
What you have tried is nearly correct, but you also need to return the new data.frame from the function and store the result back in your existing variable, like so:
ChangeType <- function(DF){
DF[,2] <- as.factor(DF[,2])
DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")])
DF #return the data.frame
}
# store the returned value to df.list,
# thus updating your existing data.frame
df.list <- lapply(df.list, ChangeType)
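To see the effect end to end, here is a self-contained sketch on a tiny made-up list (two small data frames standing in for the 34 real ones, with only the two columns of interest). It refers to the columns by name rather than by position, which is an optional variation on the function above.

```r
# Made-up stand-in for df.list with the two columns that need converting
df.list <- list(
  data.frame(Time = c(0, 2, 0.5), ResultAssay = c(23L, 23L, 24L)),
  data.frame(Time = c(0, 24),     ResultAssay = c(45L, 46L))
)

ChangeType <- function(DF) {
  DF$Time <- as.factor(DF$Time)                 # numeric -> factor
  DF$ResultAssay <- as.numeric(DF$ResultAssay)  # integer -> double
  DF                                            # return the modified data frame
}

# lapply returns a new list, so assign it back to df.list
df.list <- lapply(df.list, ChangeType)

# Check the column classes in every data frame
sapply(df.list, function(d) class(d$Time))
sapply(df.list, function(d) class(d$ResultAssay))
```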
I have a big csv file that has 51993 rows and 18 columns. Here is part of the table:
head(ddd)
country.of.birth age sex X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
Afghanistan 0 men 0 0 1 2 2 0 1 1
Afghanistan 0 women 1 1 0 0 1 0 0 0
Afghanistan 1 men 0 2 5 2 3 4 1 1
Afghanistan 1 women 4 1 4 2 3 2 3 2
Afghanistan 2 men 5 0 8 7 7 3 5 3
Afghanistan 2 women 4 8 3 9 4 4 4 3
In the main csv file, the columns are: Country of Birth, age, sex, and then years from 2000 to 2014. My first question is: why does R put X before each year number?
When I used the str() function, I got:
> str(ddd)
'data.frame': 15264 obs. of 18 variables:
$ country.of.birth: Factor w/ 261 levels "0","1","10","103",..: 51 51 51 51 51 51 51 51 51 51 ...
$ age : Factor w/ 38 levels "","0 ","1 ","10 ",..: 2 2 3 3 14 14 17 17 20 20 ...
$ sex : Factor w/ 39 levels "","0 ","1 ","10 ",..: 38 39 38 39 38 39 38 39 38 39 ...
$ X2000 : Factor w/ 786 levels "","0","1","10",..: 2 3 2 478 555 478 92 4 205 716 ...
$ X2001 : int 0 1 2 1 0 8 11 8 26 19 ...
$ X2002 : int 1 0 5 4 8 3 13 18 22 15 ...
$ X2003 : int 2 0 2 2 7 9 15 13 23 33 ...
$ X2004 : int 2 1 3 3 7 4 11 15 21 22 ...
$ X2005 : int 0 0 4 2 3 4 10 6 13 16 ...
$ X2006 : int 1 0 1 3 5 4 8 13 20 10 ...
$ X2007 : int 1 0 1 2 3 3 6 7 9 17 ...
$ X2008 : int 0 0 2 0 4 5 4 6 8 9 ...
$ X2009 : int 0 1 1 4 7 3 9 10 11 12 ...
$ X2010 : int 1 1 6 4 8 10 17 10 21 16 ...
$ X2011 : int 0 5 9 6 21 18 16 27 34 24 ...
$ X2012 : int 3 5 5 16 30 22 44 48 46 49 ...
$ X2013 : int 3 0 12 19 24 34 54 46 76 71 ...
$ X2014 : int 2 3 15 3 21 29 37 48 64 62 ...
As you can see, sex is a factor with 39 levels even though it has only two values (men and women). Also, year 2000 (X2000 in the table) is a factor with 786 levels, whereas it should have been read as an "int". Why did R read the variable "sex" with this large number of levels, and why did it read year 2000 as a factor while it read the other years as int (as expected)?
Edit:
The age column has values of the form 20-24, 25-30, ... up to 85-90, plus another category, 90+.
X is put in front of the column names because R doesn't allow the first character of a column name to be a number (try data.frame(a = 1:10, "3" = runif(10))).
Age is a factor because you have bins and what you observe is expected behavior. R doesn't handle intervals as numeric, but as factor.
The sex variable is weird. Given the currently available data, I would say the variable represents something other than sex in at least part of the dataset. Has the dataset been stitched together? Perhaps there was a mistake in copy/pasting. See levels(ddd$sex) to disentangle all possible levels.
The default behaviour of read.table and its related functions is to make all column names syntactically valid. This means that they can be used without quoting after the $ operator. However, this behaviour can be changed using the check.names = FALSE parameter. This will mean you end up with columns called 2000 etc. To then use those columns with $ they will need to be backquoted, e.g.
ddd$`2000`
The same will be true if you want to use these columns with non-standard evaluation, e.g.
ggplot(ddd, aes(x = sex, y = `2000`)) + geom_boxplot()
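The effect of check.names can be seen on a tiny inline CSV. The three-line sample below is invented for illustration; only read.csv, its text argument and check.names are used.

```r
# Invented sample whose header contains bare year numbers
csv_text <- "sex,2000,2001
men,0,1
women,1,0"

# Default: names are passed through make.names(), so years get an X prefix
default_names <- names(read.csv(text = csv_text))
print(default_names)  # "sex" "X2000" "X2001"

# check.names = FALSE keeps the header exactly as written in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
print(raw_names)      # "sex" "2000" "2001"
```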
For the sex column, there must be entries further down the column that contain numbers. Check your original data.
For age, you have trailing spaces in your age column. Either remove these outside R, or you could do something like this:
ddd$age <- as.numeric(sub(" +$", "", as.character(ddd$age)))
For the 2000 column, it's not clear from your str output why it's been read as a factor. By default, empty strings should be regarded as NA and so shouldn't affect the class. You could try (assuming you're now using check.names = FALSE):
as.character(ddd$`2000`)[is.na(as.numeric(as.character(ddd$`2000`))) & ddd$`2000` != ""]
This should print out any elements of the column which are non-blank and non-numeric. It may again be a trailing space issue.
I'm trying to find class probabilities of new input vectors with support vector machines in R.
Training the model shows no errors.
fit <-svm(device~.,data=dataframetrain,
kernel="polynomial",probability=TRUE)
But predicting some input vector shows some errors.
predict(fit,dataframetest,probability=prob)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
dataframetrain looks like:
> str(dataframetrain)
'data.frame': 24577 obs. of 5 variables:
$ device : Factor w/ 3 levels "mob","pc","tab": 1 1 1 1 1 1 1 1 1 1 ...
$ geslacht : Factor w/ 2 levels "M","V": 1 1 1 1 1 1 1 1 1 1 ...
$ leeftijd : num 77 67 67 66 64 64 63 61 61 58 ...
$ invultijd: num 12 12 12 12 12 12 12 12 12 12 ...
$ type : Factor w/ 8 levels "A","B","C","D",..: 5 5 5 5 5 5 5 5 5 5 ...
and dataframetest looks like:
> str(dataframetest)
'data.frame': 8 obs. of 4 variables:
$ geslacht : Factor w/ 1 level "M": 1 1 1 1 1 1 1 1
$ leeftijd : num 20 60 30 25 36 52 145 25
$ invultijd: num 6 12 2 5 6 8 69 7
$ type : Factor w/ 8 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8
I trained the model with 2 levels for 'geslacht', but sometimes I have to predict on data with only 1 level of 'geslacht'.
Is it possible to predict the class probabilities for a test set with only 1 level of 'geslacht'?
I hope someone can help me!!
Add another level (but not data) to geslacht.
x <- factor(c("A", "A"), levels = c("A", "B"))
x
[1] A A
Levels: A B
or
x <- factor(c("A", "A"))
levels(x) <- c("A", "B")
x
[1] A A
Levels: A B
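Applied to the question, one way to do this is to rebuild the test factor with the levels taken from the training data. The two small data frames below are made up and only borrow the column name geslacht from the question; this is a sketch of the idea, not a tested fix for the svm call itself.

```r
# Made-up training and test data; the test set only contains "M"
train <- data.frame(geslacht = factor(c("M", "V", "M")))
test  <- data.frame(geslacht = factor(c("M", "M")))

levels(test$geslacht)  # only "M"

# Rebuild the test factor using the training levels; the values do not change
test$geslacht <- factor(test$geslacht, levels = levels(train$geslacht))
levels(test$geslacht)  # "M" "V"
```

Taking the levels from the training factor, rather than typing them by hand, keeps their order identical to what the model saw during fitting.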