How to convert one of my columns from 226 FACTORS? - r

I'm new to programming in R. I have a dataset in which one column should be numeric, since it holds percentage values. I need to plot that data using ggplot2, but I can't work out how since I'm pretty new to this.
Summary:
DataSet = 245 Rows, 6 columns.
I have spent 5 hours searching for the right code, but the posts I found seem to be too advanced for my understanding.
'data.frame': 245 obs. of 6 variables:
$ location : Factor w/ 8 levels "site01","site02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ coralType: Factor w/ 5 levels "blue corals",..: 1 1 1 1 1 1 1 1 2 2 ...
$ longitude: num 144 144 144 144 144 ...
$ latitude : num -11.8 -11.8 -11.8 -11.8 -11.8 ...
$ year : int 2010 2011 2012 2013 2014 2015 2016 2017 2011 2012 ...
$ value : Factor w/ 223 levels "10.01%","10.23%",..: 113 123 166 168 184 193 196 200 43 44 ...
See that df$value? That is my issue: I need it to be numeric so I can plot it, and right now I can't! Simply put, $value needs to be numeric. I would really appreciate it if any of you R veterans could help me out!

You need to remove the percentage symbol and save it as a numeric value.
df <- data.frame(value = paste(1:100, "%", sep = ""))
df$value <- as.numeric(sub("%", "", df$value))
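Since the column in the question is a factor rather than a character vector, it may help to see the conversion spelled out on a factor. The following is a minimal sketch on a small made-up data frame (the values are hypothetical, not taken from the question's data). It also shows the usual pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the numbers printed in the labels.

```r
# Hypothetical stand-in for the question's column: percentages stored as a factor
df <- data.frame(value = factor(c("10.01%", "10.23%", "7.5%", "10.01%")))

# Pitfall: as.numeric() on a factor returns the internal level codes
codes <- as.numeric(df$value)
print(codes)

# Convert to character first, strip the "%", then convert to numeric
df$value_num <- as.numeric(sub("%", "", as.character(df$value), fixed = TRUE))
str(df)
```

The result stays in percent units (10.01 means 10.01%); divide by 100 if a proportion is needed for plotting.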

Related

Using R - Need advice predicting next year based on irregular time-series

I need advice with regards to the following inquiry: "Based on your observations, what could you say about the load for the same months in year 2019?"
The str()/head() of the df looks like this:
'data.frame': 683 obs. of 10 variables:
$ Route : chr "A" "B" "A" "A" ...
$ FlightNumber: int 770 279 128 235 434 543 556 663 770 279 ...
$ Capacity : int 375 345 375 375 375 375 375 375 375 345 ...
$ Booked : int 379 314 374 379 373 377 379 378 379 294 ...
$ DDate : Date, format: "2018-05-01" "2018-05-01" "2018-05-02" "2018-05-03" ...
$ Year : num 2018 2018 2018 2018 2018 ...
$ Month : num 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 1 2 3 4 5 6 7 8 8 ...
$ Hour : int 12 20 12 12 12 12 12 12 12 20 ...
$ load : num 1.011 0.91 0.997 1.011 0.995 ...
  Route FlightNumber Capacity Booked      DDate Year Month Day Hour load(=Booked/Capacity)
1     A          770      375    379 2018-05-01 2018     5   1   12              1.0106667
2     B          279      345    314 2018-05-01 2018     5   1   20              0.9101449
3     A          128      375    374 2018-05-02 2018     5   2   12              0.9973333
4     A          235      375    379 2018-05-03 2018     5   3   12              1.0106667
5     A          434      375    373 2018-05-04 2018     5   4   12              0.9946667
6     A          543      375    377 2018-05-05 2018     5   5   12              1.0053333
If I plot the data with geom_point, it looks like this: [image not shown]
UPDATE: I ended up doing the following:
dat_A <- test %>% select(Route, DDate, load) %>% filter(Route == "A")
ts_A <- ts(dat_A$load, start = c(2017,5), end = c(2018,11), frequency = 1*12)
forecast(ts_A, h=12) %>% plot()
[Image: predicted outcome]
#Double checking
fit <- auto.arima(ts_A)
summary(fit)
predict <- forecast(fit,n=1)
plot(predict)
plot.ts(predict$residuals)
qqnorm(predict$residuals)
acf(predict$residuals)
Does the prediction seem sound? Looks rather flat even though I also tried train(1:480)/validat(481:611) via arima then forecast with a RMSE of 0.036...
Here is one direction you can try. First load your data into a data frame, say df, that has the column Booked. You can then build a time series from it that is easy to fit:
ts_data = ts(df$Booked, start = c(2017,1), end = c(2018,12), frequency = 12)
Now you can apply a time-series forecasting method to ts_data to predict the values for 2019. I am leaving the rest of the code to you. Thank you!
To convert a vector or data.frame into a time series, you could use:
dat <- as.ts(as.matrix(dat))
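One thing worth checking before calling ts() with frequency = 12 is that the input really has one value per month. The data in the question has one row per flight, so a possible preparation step is to average load within each calendar month first. The sketch below uses synthetic daily data (the values and the date range are assumptions for illustration) and base R only. The last lines show that, as far as I can tell from how ts() handles start, end and frequency, handing it a longer vector together with a monthly start and end keeps only as many values as there are months in that window.

```r
# Synthetic stand-in for the question's data: one load value per day
set.seed(1)
days <- seq(as.Date("2018-05-01"), as.Date("2018-11-30"), by = "day")
df <- data.frame(DDate = days, load = runif(length(days), 0.9, 1.02))

# Average the daily loads within each calendar month
df$ym <- format(df$DDate, "%Y-%m")
monthly <- aggregate(load ~ ym, data = df, FUN = mean)

# One observation per month, so frequency = 12 now matches the data
ts_monthly <- ts(monthly$load, start = c(2018, 5), frequency = 12)
print(ts_monthly)

# For comparison: daily rows combined with a monthly start/end window
# (May 2017 to Nov 2018 spans 19 months)
ts_daily_mislabelled <- ts(df$load, start = c(2017, 5), end = c(2018, 11),
                           frequency = 12)
print(length(ts_daily_mislabelled))
```

With only a few months of history, any forecast built on ts_monthly will be rough, which may be part of why the prediction in the question looks flat.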

C50 failed in r with "c50 code called exit with value 1"

I am having an issue training C50 on my dataset. Before posting, I researched the similar issues and solutions other people have had. However, my dataset has none of the issues they had, yet C50 still fails in R. My dataset looks like this:
'data.frame': 113967 obs. of 15 variables:
$ region : Factor w/ 51 levels "US:AK","US:AL",..: 2 3 3 4 4 4 4 5 5 5 ...
$ city : Factor w/ 6396 levels "179708","179720",..: 24 156 156 194 214 226 244 276 316 407 ...
$ dma : Factor w/ 211 levels "1","500","501",..: 24 148 148 173 173 173 189 195 204 208 ...
$ user_day : Factor w/ 7 levels "0","1","2","3",..: 6 6 6 6 6 6 6 6 6 6 ...
$ user_hour : Factor w/ 24 levels "0","1","10","11",..: 5 16 16 4 22 7 10 11 15 21 ...
$ os_extended : Factor w/ 71 levels "0","100","113",..: 55 68 68 7 29 14 14 14 29 34 ...
$ browser : Factor w/ 19 levels "0","10","11",..: 19 18 18 8 18 9 18 17 18 18 ...
$ domain : Factor w/ 2685 levels "0calc.com","100daysofrealfood.com",..: 1709 777 777 1406 727 2658 1406 1604 964 2658 ...
$ position : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 2 1 1 1 2 ...
$ placement : Factor w/ 5406 levels "10004098","10008956",..: 3331 1696 1714 3600 438 479 3598 3423 5406 479 ...
$ publisher : Factor w/ 1641 levels "1000773","1000776",..: 581 687 687 663 1369 1525 663 624 1641 1525 ...
$ seller_member_id : Factor w/ 304 levels "1001","1019",..: 19 101 101 40 19 35 40 40 75 35 ...
$ user_group : Factor w/ 1000 levels "0","1","10","100",..: 252 243 243 363 343 342 162 380 122 212 ...
$ size : Factor w/ 7 levels "160x600","300x250",..: 5 2 2 4 5 2 2 1 2 2 ...
$ predict.bid.vector.bin: Factor w/ 2 levels "(0.112,0.831]",..: 1 1 1 1 1 1 1 2 1 2 ...
As you can see, the last variable is my target variable (as a factor) and all features have more than one level. Moreover, there are no NAs in the dataset. Yet when I run C5.0, I get this error:
> library(C50)
> myC50_Tree <- C5.0(x = test_set[,-15], y = test_set$predict.bid.vector.bin)
c50 code called exit with value 1
> summary(myC50_Tree)
Call:
C5.0.default(x = test_set[, -15], y = test_set$predict.bid.vector.bin)
C5.0 [Release 2.07 GPL Edition] Fri Apr 13 14:29:54 2018
-------------------------------
*** line 6 of `undefined.names': attribute `region' has only one value `US'
Error limit exceeded
What would be the issue here?
You can generate a simulated version of my dataset with the following R code:
# --- Set unique feature values
region <- c("US:AL","US:AR","US:AZ","US:CA","US:CO","US:CT","US:DC","US:FL")
city <- c("179944","180802","181120","181212","181251","181315","181400","181512","181762","181842","181934","181953","182259","182295")
dma <- c('522','693','754','875','345','234')
user_day <- c('1','2','3','4','5','6')
user_hour <- c('12','11','10','9','8','7','6','5')
os_extended <- c('187','92','125','87','90')
browser <- c('8','9','18','5')
domain <- c('yahoo.com','youtube.com','mmctw.com','msn.com','frive.com','wework.com')
position <- c('0','1','2','3')
placement <- c('`234123412','34563451','235234624','46785467','234556834','85991927394')
publisher <- c('5345','57867','78034','123452','84567','245645','956752')
seller_memeber_id <- c('234','745','546','687','235')
user_group <- c('112','556','009','345','238')
size <- c('100X20','340X10','300X500','300X600')
predict.bid.vector.bin <- c('(0.831,1.55]', '(0.112,0.831]')
features <- list(region,city,dma,user_day,user_hour,os_extended,browser,domain,position,placement,publisher,seller_memeber_id,user_group,size,predict.bid.vector.bin)
# --- Sample simulated dataset
test_set <- vector()
for (feature in 1:length(features)) {
test_set <- cbind(test_set, sample(features[[feature]],1000,replace=TRUE))
}
test_set <- data.frame(test_set)
colnames(test_set) <- c('region','city','dma','user_day','user_hour',
'os_extended','browser','domain','position',
'placement','publisher','seller_memeber_id',
'user_group','size','predict.bid.vector.bin')
# --- check data
str(test_set)
The problem is the variable region: I think C5.0 doesn't like the colons in its values (note that the error message reports the single value `US', which is everything before the colon). I recreated your dataset with:
region <- c("AL","AR","AZ","CA","CO","CT","DC","FL")
And then it worked with no errors:
treeModel <- C5.0(x=test_set[,-15],y=test_set[,15])
treeModel
...
Evaluation on training data (1000 cases):
Decision Tree
----------------
Size Errors
103 220(22.0%) <<
(a) (b) <-classified as
---- ----
358 122 (a): class 1
98 422 (b): class 2
Attribute usage:
100.00% user_hour
28.30% region
27.30% dma
24.30% city
17.60% user_day
15.40% size
12.70% placement
9.10% user_group
7.90% browser
6.50% os_extended
4.70% publisher
4.40% position
3.70% domain
3.00% seller_memeber_id
I also recoded the dependent variable as 1 and 2 just in case the string with the ranges was giving it a problem, but that didn't seem to matter at all (however in the output above you'll see that it predicted to Class 1 and Class 2, and that's why).
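If the real dataset cannot simply be regenerated without the "US:" prefix, the colons can be replaced in the factor levels instead. The sketch below works on a hypothetical miniature of the question's data and uses base R only. The choice of "_" as the replacement character is my assumption and has not been checked against C5.0.

```r
# Hypothetical miniature of the question's data
test_set <- data.frame(
  region = factor(c("US:AL", "US:AR", "US:AZ", "US:AL")),
  size   = factor(c("160x600", "300x250", "300x250", "160x600"))
)

# Replace ":" in the levels of every factor column; editing the levels
# leaves the underlying codes, and therefore the rows, untouched
for (col in names(test_set)) {
  if (is.factor(test_set[[col]])) {
    levels(test_set[[col]]) <- gsub(":", "_", levels(test_set[[col]]),
                                    fixed = TRUE)
  }
}
print(levels(test_set$region))
```

After this step the C5.0() call from the question can be retried on the cleaned data frame.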

How to work with %in% symbol in R?

I found out that %in% stands for matching operator, binary (in model formulae: nesting). There are two tables in my workspace. The first table contains
> str(GP.drugs)
'data.frame': 4158393 obs. of 9 variables:
$ SHA : Factor w/ 10 levels "Q30","Q31","Q32",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PCT : Factor w/ 151 levels "5A3","5A4","5A5",..: 16 16 16 16 16 16 16 16 16 16 ...
$ PRACTICE: Factor w/ 10191 levels "A81001","A81002",..: 344 345 345 345 345 345 345 345 345 345 ...
$ BNF.CODE: Factor w/ 1731 levels "0101010C0","0101010E0",..: 878 4 9 11 17 22 25 26 27 28 ...
$ BNF.NAME: Factor w/ 1524 levels "Abacavir ",..: 317 289 294 1284 37 379 655 825 1115 824 ...
$ ITEMS : int 1 27 1 2 97 4 40 98 27 2 ...
$ NIC : num 1.89 74.94 3.2 7.35 439.83 ...
$ ACT.COST: num 1.77 69.92 2.98 6.84 408.43 ...
$ PERIOD : num 201109 201109 201109 201109 201109 ...
The second table contains
> str(problem.drugs)
'data.frame': 13 obs. of 2 variables:
$ Drug : Factor w/ 13 levels "Alogliptin","Glipizide",..: 1 2 3 9 10 11 12 13 4 7 ...
$ Category: Factor w/ 1 level "metformin": 1 1 1 1 1 1 1 1 1 1 ...
The code I am using and the result I get are:
> t<-subset(GP.drugs,n %in% p)
> t
[1] SHA PCT PRACTICE BNF.CODE BNF.NAME ITEMS NIC ACT.COST PERIOD
<0 rows> (or 0-length row.names)
More errors
Does it depend on the tables' column names, or on the number of columns each table has?
Your BNF.NAME column in the GP.drugs data frame appears to have extra trailing spaces in it: notice it says something like "Abacavir " as the first element. If this is true of all the drugs in GP.drugs, but not the ones in problem.drugs, it will prevent any from matching.
To fix this, you can use the str_trim function from stringr, which trims leading and trailing whitespace:
library(stringr)
n <- str_trim(GP.drugs$BNF.NAME)
# same thing you did before
p <- problem.drugs$Drug
t <- subset(GP.drugs, n %in% p)
Other solutions can be found here.
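The effect of the trailing spaces is easy to reproduce on a toy example. This sketch uses made-up vectors (not the real GP.drugs data) and base R's trimws() in place of stringr's str_trim, so it runs without extra packages:

```r
# Made-up stand-ins: names with trailing spaces vs. clean names
bnf_name <- factor(c("Abacavir ", "Glipizide ", "Aspirin "))
drug     <- factor(c("Alogliptin", "Glipizide"))

# Without trimming, nothing matches because "Glipizide " != "Glipizide"
raw_match <- bnf_name %in% drug
print(raw_match)

# Trim whitespace on the character values, then match again
trimmed_match <- trimws(as.character(bnf_name)) %in% drug
print(trimmed_match)
```

The logical vector from the second comparison can be used directly for row subsetting, for example inside subset() as in the answer above.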
Try,
GP.drugs[GP.drugs$BNF.NAME %in% problem.drugs$Drug, ]

Creating decision tree

I have a csv file (298 rows and 24 columns) and I want to create a decision tree to predict the column "salary". I have installed the tree package and loaded it with library().
But when I try to create the decision tree:
model<-tree(salary~.,data)
I get the following error:
Error in tree(salary ~ ., data) :
  factor predictors must have at most 32 levels
What is wrong with that? Data is as follows:
Name bat hit homeruns runs
1 Alan Ashby 315 81 7 24
2 Alvin Davis 479 130 18 66
3 Andre Dawson 496 141 20 65
...
team position putout assists errors
1 Hou. C 632 43 10
2 Sea. 1B 880 82 14
3 Mon. RF 200 11 3
salary league87 team87
1 475 N Hou.
2 480 A Sea.
3 500 N Chi.
And this is the output of str(data):
'data.frame': 263 obs. of 24 variables:
$ Name : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
$ bat : int 315 479 496 321 594 185 298 323 401 574 ...
$ hit : int 81 130 141 87 169 37 73 81 92 159 ...
$ homeruns : int 7 18 20 10 4 1 0 6 17 21 ...
$ runs : int 24 66 65 39 74 23 24 26 49 107 ...
$ runs.batted : int 38 72 78 42 51 8 24 32 66 75 ...
$ walks : int 39 76 37 30 35 21 7 8 65 59 ...
$ years.in.major.leagues : int 14 3 11 2 11 2 3 2 13 10 ...
$ bats.during.career : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
$ hits.during.career : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
$ homeruns.during.career : int 69 63 225 12 19 1 0 6 253 90 ...
$ runs.during.career : int 321 224 828 48 501 30 41 32 784 702 ...
$ runs.batted.during.career: int 414 266 838 46 336 9 37 34 890 504 ...
$ walks.during.career : int 375 263 354 33 194 24 12 8 866 488 ...
$ league : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
$ division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
$ team : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7 8 ...
$ position : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
$ putout : int 632 880 200 805 282 76 121 143 0 238 ...
$ assists : int 43 82 11 40 421 127 283 290 0 445 ...
$ errors : int 10 14 3 4 25 7 9 19 0 22 ...
$ salary : num 475 480 500 91.5 750 ...
$ league87 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
$ team87 : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...
The issue is almost certainly that you're including the Name variable in your model, as it has too many factor levels. I would also remove it from a methodological standpoint, but this probably isn't the place for that discussion. Try:
train <- data
train$Name <- NULL
model<-tree(salary~.,train)
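To find such columns without reading the str() output by eye, the number of levels of each factor can be counted programmatically. This is a minimal sketch on hypothetical toy data (the data frame dat and its values are made up). The threshold of 32 is taken from the error message in the question.

```r
# Hypothetical toy data: 'Name' has 40 levels, 'league' has at most 2
set.seed(42)
dat <- data.frame(
  Name   = factor(paste("Player", 1:40)),
  league = factor(sample(c("A", "N"), 40, replace = TRUE)),
  hit    = sample(50:200, 40),
  salary = runif(40, 100, 1000)
)

# Count the levels of every factor column
is_fac <- vapply(dat, is.factor, logical(1))
n_lev  <- vapply(dat[is_fac], nlevels, integer(1))

# Names of the factor columns that exceed the limit from the error message
too_many <- names(n_lev)[n_lev > 32]
print(too_many)

# Drop those columns before fitting
train <- dat[, !(names(dat) %in% too_many), drop = FALSE]
str(train)
```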
It seems that your salary is a factor vector, while you are trying to perform a regression, so it should be a numeric vector. Simply convert your salary to numeric, and it should work just fine. For more details, read the library's help:
http://cran.r-project.org/web/packages/tree/tree.pdf
Usage
tree(formula, data, weights, subset, na.action = na.pass,
     control = tree.control(nobs, ...), method = "recursive.partition",
     split = c("deviance", "gini"), model = FALSE, x = FALSE, y = TRUE,
     wts = TRUE, ...)
Arguments
formula: A formula expression. The left-hand-side (response) should be either a numerical vector when a regression tree will be fitted or a factor, when a classification tree is produced. The right-hand-side should be a series of numeric or factor variables separated by +; there should be no interaction terms. Both . and - are allowed: regression trees can have offset terms.
(...)
Depending on what exactly is stored in your salary variable, the conversion can be more or less tricky, but this should generally work:
salary = as.numeric(levels(salary))[salary]
EDIT
As pointed out in the comments, the actual error comes from the predictors in the data variable rather than from salary. If the offending column holds numerical data, converting it to numeric also solves the issue; if it has to stay a factor, you will need another model or fewer levels. You can also encode such factors numerically by hand (for example, by defining as many binary features as there are levels), but this can lead to exponential growth of your input space.
EDIT2
It seems that you first have to decide what you are trying to model. You are trying to predict salary, but based on what? Your data appears to consist of players' records, so their names are certainly the wrong type of data to use for this prediction (in particular, they are probably causing the 32 levels error). You should remove every column from the data variable that should not be used for building the prediction. I do not know the exact aim here (the question gives no information about it), so I can only guess that you are trying to predict a person's salary from his or her stats. In that case you should remove from the input data: players' names, players' teams and obviously salaries (as predicting X using X is not a good idea ;)).

R subsetting a data frame based on a factor variable formatted like a range (xx-xx)

I have been stuck on this problem for many hours now, and I know I must be missing something obvious.
Here is my problem:
I have a data-frame in .xlsx file that can be downloaded here.
I loaded this data frame into R using RStudio on a Mac and called it demoData.
There are 5 variables (AgeRange, Women, Men, Total, and Year).
I am not able to subset this data frame with a condition on AgeRange. The format of this variable is as follows: xx-xx (00-04 meaning people between 00 and 04 years old). When I try, the result I get says that no row meets the condition.
The class of the variable "AgeRange" is factor.
Here is my code:
demoData[demoData$AgeRange=="00-04",]
Thank you for your help.
Edit: from Arun. Here's input from head(demoData):
Age Feminin Masculin. Ensemble Annee
1 00-04 720 745 1465 2004
2 05-09 745 767 1512 2004
3 10-14 813 830 1643 2004
4 15-19 824 820 1644 2004
5 20-24 839 823 1662 2004
6 25-29 752 699 1450 2004
# str(demoData)
'data.frame': 272 obs. of 5 variables:
$ Age : Factor w/ 16 levels "00-04 ","05-09 ",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Feminin : Factor w/ 216 levels "138 ","139 ",..: 112 124 164 165 174 130 106 86 78 66 ...
$ Masculin.: Factor w/ 201 levels "120 ","122 ",..: 132 141 174 169 170 124 111 89 90 75 ...
$ Ensemble : Factor w/ 242 levels "1041 ","1044 ",..: 53 66 115 116 119 50 38 14 9 238 ...
$ Annee : Factor w/ 17 levels "2004 ","2005",..: 1 1 1 1 1 1 1 1 1 1 ...
I read in your xlsx file with the xlsx package:
df<-read.xlsx("C:/Users/swatson1/Downloads/Evolution_Population_2004_2020.xlsx",1)
and it looked like this:
> df
Age Feminin MasculinÂ. Ensemble Annee
1 00-04Â 720Â 745Â 1465Â 2004Â
2 05-09Â 745Â 767Â 1512Â 2004Â
You could replace each column, getting rid of the extra character with something like:
df$Age<-substr(df$Age,1,5)
Alternatively, use gsub as this will work on any column regardless of the length of the entry:
df$Age<-gsub("Â\\s","",df$Age)
Then your code would work:
df[df$Age=="00-04",]
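# copied from the Excel file; the trailing character is a no-break space,
# written here as \u00a0 so that it survives copy and paste
str1 <- "00-04\u00a0"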
utf8ToInt(str1)
#[1] 48 48 45 48 52 160
There seems to be a no-break space at the end of the string. Sanitize your file.
You should be able to remove the no-break spaces using
df$Age <- gsub(intToUtf8(160),"",df$Age)
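The whole failure and fix can be reproduced without the Excel file. The sketch below builds a small hypothetical vector in which some entries end in a no-break space (code point 160, written as \u00a0), shows that the equality test fails on those entries, and then strips the character before comparing again.

```r
# Hypothetical age labels; the first two end in a no-break space (U+00A0)
age <- c("00-04\u00a0", "05-09\u00a0", "10-14")

# The comparison fails on the first entry although it prints like "00-04"
before <- age == "00-04"
print(before)

# Inspect the code points to see the hidden trailing character
print(utf8ToInt(age[1]))

# Remove the no-break space, then compare again
clean <- gsub("\u00a0", "", age, fixed = TRUE)
after <- clean == "00-04"
print(after)
```

Checking utf8ToInt() on one suspicious value, as in the answer above, is a quick way to confirm which invisible character is involved before choosing what to strip.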
