Why can't I order a factor in a data frame in R?

I have a data frame (named 'mdf') which includes two columns. The basic information is below:
> head(mdf); tail(mdf)
Country Rank
1 ABW 161
2 AFG 105
3 AGO 60
4 ALB 125
5 ARE 32
6 ARG 26
Country Rank
184 WSM 181
185 YEM 90
186 ZAF 28
187 ZAR 112
188 ZMB 104
189 ZWE 134
> str(mdf)
'data.frame': 189 obs. of 2 variables:
$ Country: Factor w/ 229 levels "","ABW","ADO",..: 2 4 5 6 7 8 9 11 12 13 ...
$ Rank : Factor w/ 195 levels "",".. Not available. ",..: 72 10 149 32 118 111 41 84 26 112 ...
My goal is to sort it by the 'Rank' variable, but the result is:
> mdf[order(mdf$Rank),]
Country Rank
178 USA 1
78 IND 10
153 SLV 100
170 TTO 101
43 CYP 102
54 EST 103
188 ZMB 104
2 AFG 105
175 UGA 106
130 NPL 107
73 HND 108
60 GAB 109
31 CAN 11
67 GNQ 110
As you see, it is not what I need (i.e. increasing order). How can I do it? Thanks!

To get the answer you are looking for, use:
mdf[order(as.numeric(as.character(mdf$Rank))),]
The reason your original code doesn't work is that your Rank variable is a factor, so it is sorted by the order of the factor's levels rather than by the numbers the labels spell out. For example, if you had a data frame such that:
DF
# x
# 1 2
# 2 22
# 3 11
# 4 1
and order the data
DF[order(DF$x),]
and you look at the levels:
levels(DF$x)
# [1] "1" "2" "11" "22"
We can reorder the levels such that
DF$x <- factor(DF$x, levels = c("2", "22", "11", "1"))  # reorders the levels without relabelling the data
Now,
levels(DF$x)
# [1] "2" "22" "11" "1"
So ordering with the new factor levels we get different results:
DF[order(DF$x),]
To answer your question of why as.numeric doesn't work: each factor level has an associated integer code, and that code is what as.numeric returns. If you want the number that is the factor label, you must first convert to character and then convert to numeric, hence as.numeric(as.character(x)).
For example, calling as.numeric(DF$x) gives the integer values for each level, but not the actual label for each level:
# [1] 2 4 3 1
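To see the difference between codes and labels side by side, here is a minimal sketch that rebuilds the small DF from above (I'm assuming x was created from numeric input, which is what gives the level order shown earlier):

```r
# Rebuild the toy data frame; factor() on numeric input sorts the levels numerically
DF <- data.frame(x = factor(c(2, 22, 11, 1)))

codes      <- as.numeric(DF$x)                # integer codes: position of each value in levels(DF$x)
labels_num <- as.numeric(as.character(DF$x))  # the numbers the labels actually spell out
alt        <- as.numeric(levels(DF$x))[DF$x]  # same result, converting each level only once

print(codes)
print(labels_num)
```

The last form does the same job as as.numeric(as.character(x)) but only converts the distinct levels, which can matter on long vectors.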
One way to avoid this in the future, if you are loading your data frame from a .csv file, is to use read.csv(..., stringsAsFactors=FALSE). (Since R 4.0.0, stringsAsFactors = FALSE is already the default.) I also like the fread function in data.table, which uses safer default types.
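As a rough illustration of the read.csv route, the sketch below reads a made-up four-row CSV from an inline string (csv_text and its contents are invented for the example, not your data). Note that a column containing non-numeric entries such as ".. Not available. " would still come in as character and need as.numeric afterwards.

```r
# Hypothetical inline CSV standing in for the real file
csv_text <- "Country,Rank\nUSA,1\nIND,10\nSLV,100\nCAN,11"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(toy)  # Rank is read as integer, Country as character

# Sorting now follows numeric order, not string order
toy_sorted <- toy[order(toy$Rank), ]
print(toy_sorted)
```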

Related

R impute with Kalman on large data

I have a large dataset, 4666972 obs. of 5 variables.
I want to impute one column, MPR, with the Kalman method, separately within each group.
> str(dt)
Classes ‘data.table’ and 'data.frame': 4666972 obs. of 5 variables:
$ Year : int 1999 2000 2001 1999 2000 2001 1999 2000 2001 1999 ...
$ State: int 1 1 1 1 1 1 1 1 1 1 ...
$ CC : int 1 1 1 1 1 1 1 1 1 1 ...
$ ID : chr "1" "1" "1" "2" ...
$ MPR : num 54 54 55 52 52 53 60 60 65 70 ...
I tried the code below but it crashed after a while.
> library(imputeTS)
> data.table::setDT(dt)[, MPR_kalman := with(dt, ave(MPR, State, CC, ID, FUN=na_kalman))]
I don't know how to improve the time efficiency and impute successfully without crashing.
Is it better to split the dataset by ID into a list and impute each element in a for loop?
> length(unique(hpms_S3$Section_ID))
[1] 668184
> split(dt, dt$ID)
However, I don't think this will save much memory or avoid the crash, since after splitting the dataset into 668184 list elements I would need to impute each one and then combine them into one dataset at the end.
Is there a better way to do this, or how can I optimize my code?
I provide the simple sample here:
# dt
Year State CC ID MPR
2002 15 3 3 NA
2003 15 3 3 NA
2004 15 3 3 193
2005 15 3 3 193
2006 15 3 3 348
2007 15 3 3 388
2008 15 3 3 388
1999 53 33 1 NA
2000 53 33 1 NA
2002 53 33 1 NA
2003 53 33 1 NA
2004 53 33 1 NA
2005 53 33 1 170
2006 53 33 1 170
2007 53 33 1 330
2008 53 33 1 330
EDIT:
As @r2evans mentioned in a comment, I modified the code:
> setDT(dt)[, MPR_kalman := ave(MPR, State, CC, ID, FUN=na_kalman), by = .(State, CC, ID)]
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, :
L-BFGS-B needs finite values of 'fn'
I got the error above. I found a post here discussing this error. However, even when I use na_kalman(MPR, type = 'level'), I still get the error. I think there might be repeated values within groups that produce it.
Perhaps the splitting should be done using data.table's by= argument, which is likely more efficient.
Since I don't have imputeTS installed (there are several nested dependencies I don't have), I'll fake imputation using zoo::na.locf, both forward/backwards. I'm not suggesting this be your imputation mechanism, I'm using it to demonstrate a more-common pattern with data.table.
myimpute <- function(z) zoo::na.locf(zoo::na.locf(z, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
Now some equivalent calls, one with your with(dt, ...) and my alternatives (which are really walk-throughs until my ultimate suggestion of 5):
dt[, MPR_kalman1 := with(dt, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman2 := with(.SD, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman3 := with(.SD, ave(MPR, FUN = myimpute)), by = .(State, CC, ID)]
dt[, MPR_kalman4 := ave(MPR, FUN = myimpute), by = .(State, CC, ID)]
dt[, MPR_kalman5 := myimpute(MPR), by = .(State, CC, ID)]
# Year State CC ID MPR MPR_kalman1 MPR_kalman2 MPR_kalman3 MPR_kalman4 MPR_kalman5
# 1: 2002 15 3 3 NA 193 193 193 193 193
# 2: 2003 15 3 3 NA 193 193 193 193 193
# 3: 2004 15 3 3 193 193 193 193 193 193
# 4: 2005 15 3 3 193 193 193 193 193 193
# 5: 2006 15 3 3 348 348 348 348 348 348
# 6: 2007 15 3 3 388 388 388 388 388 388
# 7: 2008 15 3 3 388 388 388 388 388 388
# 8: 1999 53 33 1 NA 170 170 170 170 170
# 9: 2000 53 33 1 NA 170 170 170 170 170
# 10: 2002 53 33 1 NA 170 170 170 170 170
# 11: 2003 53 33 1 NA 170 170 170 170 170
# 12: 2004 53 33 1 NA 170 170 170 170 170
# 13: 2005 53 33 1 170 170 170 170 170 170
# 14: 2006 53 33 1 170 170 170 170 170 170
# 15: 2007 53 33 1 330 330 330 330 330 330
# 16: 2008 53 33 1 330 330 330 330 330 330
All five calls produce the same results, but the last preserves many of the memory-efficiencies that can make data.table preferred.
The use of with(dt, ...) is an anti-pattern in one case, and a strong risk in another. For the "risk" part, realize that data.table can do a lot of things behind the scenes so that the calculations/function calls within the j= component (second argument) only see data that is relevant. A clear example is grouping, but another (unrelated to this) data.table example is conditional replacement, as in dt[is.na(x), x := -1]. If j refers to the entire table dt, as with(dt, ...) does, then it fails as soon as there is something in the first argument (conditional replacement) or a by= argument.
MPR_kalman2 mitigates this by using .SD, which is data.table's way of replacing the data-to-be-used with the "Subset of the Data" (ref). But it's still not taking advantage of data.table's significant efficiencies in dealing in-memory with groups.
MPR_kalman3 improves on this by grouping outside, still using with (as in 2) but in a more friendly way.
MPR_kalman4 removes the use of with, since really the MPR visible to ave is only within each group anyway. And then when you think about it, since ave is given no grouping variables, it really just passes all of the MPR data straight-through to myimpute. From this, we have MPR_kalman5, a direct method that is along the normal patterns of data.table.
While I don't know that it will mitigate your crashing, it is intended very much to be memory-efficient (in data.table's ways).
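On the optim/L-BFGS-B error from the edit: I can't reproduce it without imputeTS, but one defensive pattern is to wrap the imputer so that a group which has too few observed values, or which makes the imputer throw, is returned unchanged instead of aborting the whole run. The sketch below is only that pattern. toy_impute is a stand-in (linear interpolation via approx), safe_impute and min_obs are names I made up, and the right threshold for na_kalman is an assumption you would have to check yourself. It uses base ave so it runs without extra packages.

```r
# Stand-in imputer (NOT na_kalman): linear interpolation, end values carried outward
toy_impute <- function(z) {
  idx <- seq_along(z)
  ok  <- !is.na(z)
  approx(idx[ok], z[ok], xout = idx, rule = 2)$y
}

# Hypothetical wrapper: skip groups with too little data, and fall back to the
# original values if the imputer raises an error
safe_impute <- function(z, impute_fun = toy_impute, min_obs = 2) {
  if (sum(!is.na(z)) < min_obs) return(z)
  tryCatch(impute_fun(z), error = function(e) z)
}

toy <- data.frame(
  ID  = c("a", "a", "a", "a", "b", "b", "c", "c", "c"),
  MPR = c(NA, 10, NA, 30, NA, 5, NA, NA, NA)
)
toy$MPR_imp <- ave(toy$MPR, toy$ID, FUN = safe_impute)
print(toy)
```

With data.table the same wrapper would slot into the pattern from 5, i.e. dt[, MPR_kalman := safe_impute(MPR), by = .(State, CC, ID)]. Groups that come back unchanged still contain NA, so you can inspect them afterwards to see what they have in common.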

How to convert one of my columns from 226 FACTORS?

New to programming in R. I have a dataset in which one column is or should be numeric since it has %values! I need to plot that data using ggplot2 but I can't since I'm pretty new with this.
Summary:
DataSet = 245 Rows, 6 columns.
I have spent 5 hours searching for the right code, but the posts seem to be too advanced for my understanding.
'data.frame': 245 obs. of 6 variables:
$ location : Factor w/ 8 levels "site01","site02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ coralType: Factor w/ 5 levels "blue corals",..: 1 1 1 1 1 1 1 1 2 2 ...
$ longitude: num 144 144 144 144 144 ...
$ latitude : num -11.8 -11.8 -11.8 -11.8 -11.8 ...
$ year : int 2010 2011 2012 2013 2014 2015 2016 2017 2011 2012 ...
$ value : Factor w/ 223 levels "10.01%","10.23%",..: 113 123 166 168 184 193 196 200 43 44 ...
See that df$value? That is my issue I need it to be numeric so I can plot it, right now I can't! Simply put $value needs to be numeric. Would really appreciate if any of you R veterans can help me out?!
You need to remove the percentage symbol and save it as a numeric value.
df <- data.frame(value = paste(1:100, "%", sep = ""))
df$value <- as.numeric(sub("%", "", df$value))
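Since your column is a factor rather than character, here is the same idea as a small sketch on a factor, with an explicit as.character step and a check for entries that fail to convert. The three values are made up and value_num is just a name I picked for the new column.

```r
# Toy factor column holding percentages as text
df <- data.frame(value = factor(c("10.01%", "10.23%", "7.5%")))

# Convert to character, drop the literal "%" sign, then convert to numeric
df$value_num <- as.numeric(sub("%", "", as.character(df$value), fixed = TRUE))

# Any label that did not parse as a number shows up as NA here
n_failed <- sum(is.na(df$value_num))
str(df)
```

If you want proportions rather than percentages for the plot, divide value_num by 100.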

Subset large Data frame by first 100 factors

I have a large data frame (almost 100m rows).
I want to subset the data frame by factors, i.e.
the complete data for the first 100 factors goes into one data frame, the next 100 into another,
OR (I'm not even sure about this one)
factors (categories) starting with letters A:J go into one batch, L:R into another data frame, and so on.
(I'm actually facing memory issues when dealing with large data frames, and simply splitting by rows doesn't solve the problem.)
Any suggestions appreciated. Thanks
Sample data set
ID FACTORS VALUE
1 ABCD 100
2 ABCD 101
3 ABCD 102
4 ABCD 103
5 ABCD 104
6 DEFG 105
7 DEFG 106
8 DEFG 107
9 DEFG 108
10 DEFG 109
11 DEFG 110
12 HIJK 111
13 HIJK 112
14 HIJK 113
15 HIJK 114
16 HIJK 115
17 HIJK 116
18 MNOP 117
19 MNOP 118
20 MNOP 119
21 MNOP 120
22 MNOP 121
23 99-1 122
24 99-1 123
25 99-1 124
26 99-2 125
27 99-2 126
This is related loosely to Split a vector into chunks in R
First, let's get the unique factors and split them up into bins of size n:
fctrs <- unique(dat$FACTORS)
fctrs
# [1] "ABCD" "DEFG" "HIJK" "MNOP" "99-1" "99-2"
n <- 3 # set to 100 for your data
fctrgroups <- split(fctrs, ceiling(seq_along(fctrs)/n))
str(fctrgroups)
# List of 2
# $ 1: chr [1:3] "ABCD" "DEFG" "HIJK"
# $ 2: chr [1:3] "MNOP" "99-1" "99-2"
(The last group may be less than n.)
There are two ways you can work through this. If you're going to keep it all in-memory but just work on a subset at a time, then I suggest you keep the separated frames in a list and subsequently do your work within another lapply:
lst_o_frames <- lapply(fctrgroups, function(f) subset(dat, FACTORS %in% f))
str(lst_o_frames)
# List of 2
# $ 1:'data.frame': 17 obs. of 3 variables:
# ..$ ID : int [1:17] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ FACTORS: chr [1:17] "ABCD" "ABCD" "ABCD" "ABCD" ...
# ..$ VALUE : int [1:17] 100 101 102 103 104 105 106 107 108 109 ...
# $ 2:'data.frame': 10 obs. of 3 variables:
# ..$ ID : int [1:10] 18 19 20 21 22 23 24 25 26 27
# ..$ FACTORS: chr [1:10] "MNOP" "MNOP" "MNOP" "MNOP" ...
# ..$ VALUE : int [1:10] 117 118 119 120 121 122 123 124 125 126
If you take your work and put it into a user function named myfunc, then you can do
processed_lst_o_frames <- lapply(lst_o_frames, myfunc)
If, however, you just want to save the data to CSVs (or similar) so you can work with them elsewhere, then something like this will work:
for (f in fctrgroups) {
write.csv(subset(dat, FACTORS %in% f), paste0(f[[1]][1], ".csv"))
}
Note that this for-loop method is often used to do the actual work on the subset frames, too. Doing it this way is certainly feasible, but it misses a strength of R and a simplifying programming step of "do some function on each element of a list".
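For the other variant in the question (batching by the first letter of the factor), a minimal sketch could look like the following. The cut points A-J and K-R and the catch-all "other" batch are my assumptions, so adjust them to whatever ranges you actually need; dat here is a six-row stand-in for your data.

```r
# Small stand-in for the real data, one row per factor value
dat <- data.frame(
  ID      = 1:6,
  FACTORS = c("ABCD", "DEFG", "HIJK", "MNOP", "99-1", "99-2"),
  VALUE   = 100:105,
  stringsAsFactors = FALSE
)

# Classify each row by the first character of FACTORS
first <- toupper(substr(dat$FACTORS, 1, 1))
batch <- ifelse(first %in% LETTERS[1:10], "A-J",
         ifelse(first %in% LETTERS[11:18], "K-R", "other"))

# One data frame per batch, kept together in a named list
by_letter <- split(dat, batch)
str(by_letter, max.level = 1)
```

The resulting list can be processed with lapply or written out with write.csv in the same way as lst_o_frames above.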

Creating decision tree

I have a csv file (298 rows and 24 columns) and I want to create a decision tree to predict the column "salary". I have downloaded the tree package and loaded it with the library function.
But when I try to create the decision tree:
model<-tree(salary~.,data)
I get the error like below:
*Error in tree(salary ~ ., data) :
factor predictors must have at most 32 levels*
What is wrong with that? Data is as follows:
Name bat hit homeruns runs
1 Alan Ashby 315 81 7 24
2 Alvin Davis 479 130 18 66
3 Andre Dawson 496 141 20 65
...
team position putout assists errors
1 Hou. C 632 43 10
2 Sea. 1B 880 82 14
3 Mon. RF 200 11 3
salary league87 team87
1 475 N Hou.
2 480 A Sea.
3 500 N Chi.
And this is the output of str(data):
'data.frame': 263 obs. of 24 variables:
$ Name : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
$ bat : int 315 479 496 321 594 185 298 323 401 574 ...
$ hit : int 81 130 141 87 169 37 73 81 92 159 ...
$ homeruns : int 7 18 20 10 4 1 0 6 17 21 ...
$ runs : int 24 66 65 39 74 23 24 26 49 107 ...
$ runs.batted : int 38 72 78 42 51 8 24 32 66 75 ...
$ walks : int 39 76 37 30 35 21 7 8 65 59 ...
$ years.in.major.leagues : int 14 3 11 2 11 2 3 2 13 10 ...
$ bats.during.career : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
$ hits.during.career : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
$ homeruns.during.career : int 69 63 225 12 19 1 0 6 253 90 ...
$ runs.during.career : int 321 224 828 48 501 30 41 32 784 702 ...
$ runs.batted.during.career: int 414 266 838 46 336 9 37 34 890 504 ...
$ walks.during.career : int 375 263 354 33 194 24 12 8 866 488 ...
$ league : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
$ division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
$ team : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7 8 ...
$ position : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
$ putout : int 632 880 200 805 282 76 121 143 0 238 ...
$ assists : int 43 82 11 40 421 127 283 290 0 445 ...
$ errors : int 10 14 3 4 25 7 9 19 0 22 ...
$ salary : num 475 480 500 91.5 750 ...
$ league87 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
$ team87 : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...
The issue is almost certainly that you're including the Name variable in your model, as it has too many factor levels. I would also remove it from a methodological standpoint, but this probably isn't the place for that discussion. Try:
train <- data
train$Name <- NULL
model<-tree(salary~.,train)
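If you would rather find the offending columns than guess, a quick sketch is to count the levels of every column and drop those above the limit quoted in the error message (32). The toy data frame below is invented for the example; with your data you would run the same lines on data.

```r
# Invented stand-in: Name has 40 levels, team has 2, salary is numeric
toy <- data.frame(
  Name   = factor(paste("Player", 1:40)),
  team   = factor(rep(c("Atl.", "Bal."), 20)),
  salary = seq(100, 490, by = 10)
)

# nlevels() returns 0 for columns that are not factors
n_levels <- sapply(toy, nlevels)
too_many <- names(n_levels)[n_levels > 32]
print(too_many)

# Keep only the columns that stay within the limit
train <- toy[, setdiff(names(toy), too_many), drop = FALSE]
```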
It seems that your salary is a factor vector, while you are trying to perform a regression, so it should be a numeric vector. Simply convert your salary to numeric, and it should work just fine. For more details read the library's help:
http://cran.r-project.org/web/packages/tree/tree.pdf
Usage
tree(formula, data, weights, subset, na.action = na.pass,
     control = tree.control(nobs, ...), method = "recursive.partition",
     split = c("deviance", "gini"), model = FALSE, x = FALSE, y = TRUE,
     wts = TRUE, ...)
Arguments
formula A formula expression. The left-hand-side (response) should be either a numerical vector when a regression tree will be fitted or a factor, when a classification tree is produced. The right-hand-side should be a series of numeric or factor variables separated by +; there should be no interaction terms. Both . and - are allowed: regression trees can have offset terms.
(...)
Depending on what exactly is stored in your salary variable, the conversion can be more or less tricky, but this should generally work:
salary = as.numeric(levels(salary))[salary]
EDIT
As pointed out in the comment, the actual error concerns the predictors in the data variable. If the offending column holds numerical data, it can be converted to numeric to solve the issue; if it has to be a factor, you will need another model or fewer levels. You can also convert these factors to a numerical format by hand (for example by defining as many binary features as you have levels), but this can lead to the exponential growth of your input space.
EDIT2
It seems that you first have to decide what you are trying to model. You are trying to predict salary, but based on what? Your data appears to consist of players' records, so their names are certainly the wrong type of data to use for this prediction (in particular, they are probably causing the 32 levels error). You should remove from the data variable all the columns that should not be used for building a prediction. I do not know the exact aim here (there is no information about it in the question), so I can only guess that you are trying to predict a person's salary based on his/her stats; in that case you should remove from the input data: players' names, players' teams and obviously salaries (as predicting X using X is not a good idea ;)).

R subsetting a data frame based on a factor variable formatted like a range (xx-xx)

I am facing this problem for many hours now, but I know I am missing something obvious.
Here is my problem:
I have a data-frame in .xlsx file that can be downloaded here.
I loaded this data frame into R using RStudio on Mac and called it demoData.
There are 5 variables (AgeRange, Women, Men, Total, and Year).
I am not able to subset this data frame with a condition on the AgeRange. The format of this variable is as follows: xx-xx (00-04 meaning people between 00 and 04 years old). When I try, the result says that no row satisfies this condition.
The class of the variable "AgeRange" is factor.
Here is my code:
demoData[demoData$AgeRange=="00-04",]
Thank you for your help.
Edit: from Arun. Here's input from head(demoData):
Age Feminin Masculin. Ensemble Annee
1 00-04 720 745 1465 2004
2 05-09 745 767 1512 2004
3 10-14 813 830 1643 2004
4 15-19 824 820 1644 2004
5 20-24 839 823 1662 2004
6 25-29 752 699 1450 2004
# str(demoData)
'data.frame': 272 obs. of 5 variables:
$ Age : Factor w/ 16 levels "00-04 ","05-09 ",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Feminin : Factor w/ 216 levels "138 ","139 ",..: 112 124 164 165 174 130 106 86 78 66 ...
$ Masculin.: Factor w/ 201 levels "120 ","122 ",..: 132 141 174 169 170 124 111 89 90 75 ...
$ Ensemble : Factor w/ 242 levels "1041 ","1044 ",..: 53 66 115 116 119 50 38 14 9 238 ...
$ Annee : Factor w/ 17 levels "2004 ","2005",..: 1 1 1 1 1 1 1 1 1 1 ...
I read in your xlsx file with the xlsx package:
df<-read.xlsx("C:/Users/swatson1/Downloads/Evolution_Population_2004_2020.xlsx",1)
and it looked like this:
> df
Age Feminin MasculinÂ. Ensemble Annee
1 00-04Â 720Â 745Â 1465Â 2004Â
2 05-09Â 745Â 767Â 1512Â 2004Â
You could replace each column, getting rid of the extra character with something like:
df$Age<-substr(df$Age,1,5)
Alternatively, use gsub as this will work on any column regardless of the length of the entry:
df$Age<-gsub("Â\\s","",df$Age)
Then your code would work:
df[df$Age=="00-04",]
# copied from the Excel file
str1 <- "00-04 "
utf8ToInt(str1)
#[1] 48 48 45 48 52 160
There seems to be a no-break space at the end of the string. Sanitize your file.
You should be able to remove the no-break spaces using
df$Age <- gsub(intToUtf8(160),"",df$Age)
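To convince yourself that the trailing character is what breaks the comparison, here is a small self-contained sketch. The age vector is built by hand with a no-break space appended, standing in for the column read from the Excel file.

```r
# Build the no-break space from its code point so the source stays plain ASCII
nbsp <- intToUtf8(160)

# Toy values with a trailing no-break space, like the ones in the file
age <- paste0(c("00-04", "05-09"), nbsp)

code_points <- utf8ToInt(age[1])  # the last code point is 160
match_raw   <- age == "00-04"     # no match while the extra character is there

# Remove the character literally (fixed = TRUE avoids regex interpretation)
clean <- gsub(nbsp, "", age, fixed = TRUE)
match_clean <- clean == "00-04"

print(match_raw)
print(match_clean)
```

The same gsub call can be applied to the other columns (Feminin, Masculin., Ensemble, Annee) before converting them to numeric.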
