Subset large Data frame by first 100 factors - r

I have a large data frame (almost 100 million rows).
I want to split it by factor, i.e. put all rows for the first 100 factor values into one data frame, the next 100 into another, and so on.
Or (I'm not sure about this one) group by first letter: factors (categories) starting with letters A:J in one batch, L:R in another data frame, and so on.
(I'm actually running into memory issues when dealing with large data frames, and a simple split by row count doesn't solve the problem.)
Any suggestions appreciated. Thanks.
Sample data set
ID FACTORS VALUE
1 ABCD 100
2 ABCD 101
3 ABCD 102
4 ABCD 103
5 ABCD 104
6 DEFG 105
7 DEFG 106
8 DEFG 107
9 DEFG 108
10 DEFG 109
11 DEFG 110
12 HIJK 111
13 HIJK 112
14 HIJK 113
15 HIJK 114
16 HIJK 115
17 HIJK 116
18 MNOP 117
19 MNOP 118
20 MNOP 119
21 MNOP 120
22 MNOP 121
23 99-1 122
24 99-1 123
25 99-1 124
26 99-2 125
27 99-2 126

This is related loosely to Split a vector into chunks in R
First, let's get the unique factors and split them up into bins of size n:
fctrs <- unique(dat$FACTORS)
fctrs
# [1] "ABCD" "DEFG" "HIJK" "MNOP" "99-1" "99-2"
n <- 3 # set to 100 for your data
fctrgroups <- split(fctrs, ceiling(seq_along(fctrs)/n))
str(fctrgroups)
# List of 2
# $ 1: chr [1:3] "ABCD" "DEFG" "HIJK"
# $ 2: chr [1:3] "MNOP" "99-1" "99-2"
(The last group may have fewer than n elements.)
There are two ways you can work through this. If you're going to keep it all in-memory but just work on a subset at a time, then I suggest you keep the separated frames in a list and subsequently do your work within another lapply:
lst_o_frames <- lapply(fctrgroups, function(f) subset(dat, FACTORS %in% f))
str(lst_o_frames)
# List of 2
# $ 1:'data.frame': 17 obs. of 3 variables:
# ..$ ID : int [1:17] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ FACTORS: chr [1:17] "ABCD" "ABCD" "ABCD" "ABCD" ...
# ..$ VALUE : int [1:17] 100 101 102 103 104 105 106 107 108 109 ...
# $ 2:'data.frame': 10 obs. of 3 variables:
# ..$ ID : int [1:10] 18 19 20 21 22 23 24 25 26 27
# ..$ FACTORS: chr [1:10] "MNOP" "MNOP" "MNOP" "MNOP" ...
# ..$ VALUE : int [1:10] 117 118 119 120 121 122 123 124 125 126
If you take your work and put it into a user function named myfunc, then you can do
processed_lst_o_frames <- lapply(lst_o_frames, myfunc)
If, however, you just want to save the data to CSVs (or similar) so you can work with them elsewhere, then something like this will work:
for (f in fctrgroups) {
  # name each file after the first factor value in its group
  write.csv(subset(dat, FACTORS %in% f), paste0(f[1], ".csv"))
}
Note that a for loop like this is often used to do the actual work on the subset frames, too. That is certainly feasible, but it misses a strength of R and a simplifying programming step: "apply some function to each element of a list".
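As a self-contained sketch of the in-memory route, the sample data from the question can be rebuilt and split in a single pass with split(). The body of myfunc below is a made-up placeholder, and the letter ranges in the second part are only an assumption about what the question's A:J / L:R idea means.

```r
# Rebuild the sample data from the question
dat <- data.frame(
  ID = 1:27,
  FACTORS = rep(c("ABCD", "DEFG", "HIJK", "MNOP", "99-1", "99-2"),
                times = c(5, 6, 6, 5, 3, 2)),
  VALUE = 100:126,
  stringsAsFactors = FALSE
)

n <- 3  # set to 100 for the real data
fctrs <- unique(dat$FACTORS)

# Map each row to the bin of its factor value, then split once
bin <- ceiling(match(dat$FACTORS, fctrs) / n)
lst_o_frames <- split(dat, bin)

# Hypothetical stand-in for the real per-chunk work
myfunc <- function(d) sum(d$VALUE)
processed <- lapply(lst_o_frames, myfunc)

# Alternative grouping by first letter (assumed ranges);
# values that start with anything else go to "other"
first <- toupper(substr(dat$FACTORS, 1, 1))
letter_bin <- ifelse(first %in% LETTERS[1:10], "A-J",
              ifelse(first %in% LETTERS[11:18], "K-R", "other"))
by_letter <- split(dat, letter_bin)
```

Keep in mind that split() still holds every chunk in memory at the same time, so for very large data the write-to-CSV route may be the safer one.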

Related

Viewing dataset in RStudio shows different number of observations compared to R commands

I am currently studying data science with R. To practice, I am using the Auto data of the ISLR package. However, I am encountering a confusing situation when viewing the data. When I view the dataset Auto.df in RStudio, I get the following:
However, when I use dim(Auto.df), I get the following:
> dim(Auto.df)
[1] 392 9
And when I use nrow(Auto.df), I get the following:
> nrow(Auto.df)
[1] 392
And when I use str(Auto.df), I get the following:
> str(Auto.df)
'data.frame': 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
$ weight : num 3504 3693 3436 3433 3449 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : num 70 70 70 70 70 70 70 70 70 70 ...
$ origin : num 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
And I have the following in my RStudio "Global Environment" tab:
So why does viewing the dataset in RStudio show 397 rows (observations), whilst everything else says that there are 392 observations?
There are 392 observations in the data. What you are seeing in the viewer are the row names of the data, not row numbers. Row names can be set to anything, and they do not represent a row's position in the data.
If you check the row names of the Auto dataset you'll see they are not sequential: some of them jump by 2. For example, 32 is followed by 34 rather than 33, and 126 is followed by 128. I don't know why the data is like that, but it is why the last row name is 397 even though there are only 392 rows.
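The same effect can be reproduced on a toy data frame (made up here, not the Auto data): dropping rows leaves gaps in the row names, so the last row name is larger than the row count.

```r
# Toy data frame with default row names "1".."7"
df <- data.frame(x = 11:17)

# Drop rows 3 and 5; the surviving rows keep their original row names
kept <- df[-c(3, 5), , drop = FALSE]

n_rows <- nrow(kept)             # number of rows actually present
labels_before <- rownames(kept)  # original labels, now with gaps

# Reset the row names if sequential labels are wanted
rownames(kept) <- NULL
labels_after <- rownames(kept)
```

Comparing labels_before with labels_after shows the difference between a row's label and its position.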

How to work with %in% symbol in R?

I found out that %in% stands for matching operator, binary (in model formulae: nesting). There are two tables in my workspace. The first table contains
> str(GP.drugs)
'data.frame': 4158393 obs. of 9 variables:
$ SHA : Factor w/ 10 levels "Q30","Q31","Q32",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PCT : Factor w/ 151 levels "5A3","5A4","5A5",..: 16 16 16 16 16 16 16 16 16 16 ...
$ PRACTICE: Factor w/ 10191 levels "A81001","A81002",..: 344 345 345 345 345 345 345 345 345 345 ...
$ BNF.CODE: Factor w/ 1731 levels "0101010C0","0101010E0",..: 878 4 9 11 17 22 25 26 27 28 ...
$ BNF.NAME: Factor w/ 1524 levels "Abacavir ",..: 317 289 294 1284 37 379 655 825 1115 824 ...
$ ITEMS : int 1 27 1 2 97 4 40 98 27 2 ...
$ NIC : num 1.89 74.94 3.2 7.35 439.83 ...
$ ACT.COST: num 1.77 69.92 2.98 6.84 408.43 ...
$ PERIOD : num 201109 201109 201109 201109 201109 ...
The second table contains
> str(problem.drugs)
'data.frame': 13 obs. of 2 variables:
$ Drug : Factor w/ 13 levels "Alogliptin","Glipizide",..: 1 2 3 9 10 11 12 13 4 7 ...
$ Category: Factor w/ 1 level "metformin": 1 1 1 1 1 1 1 1 1 1 ...
The code I am using, and the result I get, is:
> t<-subset(GP.drugs,n %in% p)
> t
[1] SHA PCT PRACTICE BNF.CODE BNF.NAME ITEMS NIC ACT.COST PERIOD
<0 rows> (or 0-length row.names)
More errors
Does it make difference on the tables' column names or does it make it difference on the number of columns both have?
Your BNF.NAME column in the GP.drugs data frame appears to have extra trailing spaces in it: notice it says something like "Abacavir " as the first element. If this is true of all the drugs in GP.drugs, but not the ones in problem.drugs, it will prevent any from matching.
To fix this, you can use the str_trim function from stringr, which trims leading and trailing whitespace:
library(stringr)
n <- str_trim(GP.drugs$BNF.NAME)
# same thing you did before
p <- problem.drugs$Drug
t <- subset(GP.drugs, n %in% p)
Other solutions can be found here.
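If you'd rather not load stringr, base R's trimws (available since R 3.2.0) does the same job. A minimal sketch on made-up data, where the gp and problem data frames only imitate the shape of GP.drugs and problem.drugs:

```r
# Made-up stand-ins for the two tables; note the trailing spaces in BNF.NAME
gp <- data.frame(
  BNF.NAME = factor(c("Abacavir ", "Glipizide ", "Aspirin ")),
  ITEMS = c(1L, 27L, 4L)
)
problem <- data.frame(Drug = factor(c("Glipizide", "Alogliptin")))

# Without trimming nothing matches, because "Glipizide " != "Glipizide"
matches_raw <- sum(gp$BNF.NAME %in% problem$Drug)

# Trim whitespace first, then match on the cleaned names
clean <- trimws(as.character(gp$BNF.NAME))
hits <- gp[clean %in% problem$Drug, ]
```

Here matches_raw counts the matches before trimming, and hits holds the rows that match once the trailing spaces are gone.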
Try,
GP.drugs[GP.drugs$BNF.NAME %in% problem.drugs$Drug, ]

Adding new variable to each element of a list of data.frames

I have a list of data frames (C). I want to add a new variable "d" to each data frame, which would be the sum of "a" and "b". Here's a short example.
A<-data.frame(1:10, 101:110, 201:210)
colnames(A)=c("a","b", "c")
B<-data.frame(11:20, 111:120, 211:220)
colnames(B)=c("a","b", "c")
C<-list(A, B)
I've tried this, but I think there may be something simpler, especially considering the last part of the line ( value= (("["(C[[x]][1]))) + ("["(C[[x]][2])) )
lapply(seq(C), function(x) "[[<-" (C[[x]], 3, value= (("["(C[[x]][1]))) + ("["(C[[x]][2])))
Any ideas?
And by the way, how can I choose the name of the variable created?
Thanks in advance.
Or using cbind
C <- lapply(C, function(x) cbind(x, d = x[["a"]] + x[["b"]]))
Or using rowSums
C <- lapply(C, function(x) cbind(x, d = rowSums(x[1:2])))
This should be easy using within:
out <- lapply(C, function(x) within(x, d <- a + b))
Here's the result:
> str(out)
List of 2
$ :'data.frame': 10 obs. of 4 variables:
..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ b: int [1:10] 101 102 103 104 105 106 107 108 109 110
..$ c: int [1:10] 201 202 203 204 205 206 207 208 209 210
..$ d: int [1:10] 102 104 106 108 110 112 114 116 118 120
$ :'data.frame': 10 obs. of 4 variables:
..$ a: int [1:10] 11 12 13 14 15 16 17 18 19 20
..$ b: int [1:10] 111 112 113 114 115 116 117 118 119 120
..$ c: int [1:10] 211 212 213 214 215 216 217 218 219 220
..$ d: int [1:10] 122 124 126 128 130 132 134 136 138 140
You could, then, store out as C (i.e., C <- out) to replace the original list. I just use out here to show you what is going on.
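On the follow-up about choosing the name of the new variable: one possible sketch is to keep the name in a string and assign with [[<-. The name "total" is just an example.

```r
# Same example data as in the question
A <- data.frame(a = 1:10, b = 101:110, c = 201:210)
B <- data.frame(a = 11:20, b = 111:120, c = 211:220)
C <- list(A, B)

new_name <- "total"  # example name, pick whatever you like

# Add the sum of "a" and "b" under the chosen name in every data frame
out <- lapply(C, function(x) {
  x[[new_name]] <- x[["a"]] + x[["b"]]
  x
})
```

Because the name is an ordinary character value, it can also come from a function argument or a vector of names.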

Why cannot order a factor in data frame in R?

I have a data frame (named 'mdf') which includes two columns. The basic information is below:
> head(mdf); tail(mdf)
Country Rank
1 ABW 161
2 AFG 105
3 AGO 60
4 ALB 125
5 ARE 32
6 ARG 26
Country Rank
184 WSM 181
185 YEM 90
186 ZAF 28
187 ZAR 112
188 ZMB 104
189 ZWE 134
> str(mdf)
'data.frame': 189 obs. of 2 variables:
$ Country: Factor w/ 229 levels "","ABW","ADO",..: 2 4 5 6 7 8 9 11 12 13 ...
$ Rank : Factor w/ 195 levels "",".. Not available. ",..: 72 10 149 32 118 111 41 84 26 112 ...
My purpose is to rearrange it by ordering 'Rank' variable, but the result is:
> mdf[order(mdf$Rank),]
Country Rank
178 USA 1
78 IND 10
153 SLV 100
170 TTO 101
43 CYP 102
54 EST 103
188 ZMB 104
2 AFG 105
175 UGA 106
130 NPL 107
73 HND 108
60 GAB 109
31 CAN 11
67 GNQ 110
As you see, it is not what I need (i.e. increasing order). How can I do it? Thanks!
To get the answer you are looking for, use:
mdf[order(as.numeric(as.character(mdf$Rank))),]
The reason your original code doesn't work is that your Rank variable is a factor, so order() sorts it by the position of each value in the factor's levels rather than by the numeric value of its labels. For example, take a data frame such as:
DF <- data.frame(x = factor(c(2, 22, 11, 1)))
DF
# x
# 1 2
# 2 22
# 3 11
# 4 1
If you order the data
DF[order(DF$x),]
the values come back in the order of the levels:
levels(DF$x)
# [1] "1" "2" "11" "22"
We can reorder the levels, for example with
DF$x <- factor(DF$x, levels = c("2", "22", "11", "1"))
Now,
levels(DF$x)
# [1] "2" "22" "11" "1"
So ordering with the new factor levels gives a different result:
DF[order(DF$x),]
To answer your question of why as.numeric doesn't work: each factor level has an associated integer code, and that code is what as.numeric returns. If you want the number that is the factor label, you must first convert to character and then to numeric, hence as.numeric(as.character(x)).
For example, with the original levels ("1" "2" "11" "22"), calling as.numeric(DF$x) gives the integer code of each value, not its label:
# [1] 2 4 3 1
One way to avoid this in the future, if you are loading your data frame from a .csv file, is to use read.csv(..., stringsAsFactors=FALSE) (this is the default from R 4.0.0 onward). I also like the fread function in data.table, which uses safer default types.
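To see both behaviours side by side, here is a small sketch on a made-up five-row version of mdf, where Rank is a factor built from strings, as it would be after reading a CSV with stringsAsFactors=TRUE:

```r
# Made-up miniature of mdf; Rank is a factor with string labels
mdf <- data.frame(
  Country = c("USA", "IND", "SLV", "CAN", "ARG"),
  Rank = factor(c("1", "10", "100", "11", "26")),
  stringsAsFactors = FALSE
)

# Ordering on the factor follows the levels, which are sorted as strings
by_levels <- mdf[order(mdf$Rank), ]

# Converting label -> character -> numeric orders by the actual numbers
by_value <- mdf[order(as.numeric(as.character(mdf$Rank))), ]
```

If the column also contains non-numeric labels (the question's str output shows a ".. Not available. " level), as.numeric turns those into NA with a warning, and order() puts them last by default.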

Creating decision tree

I have a csv file (298 rows and 24 columns) and I want to create a decision tree to predict the column "salary". I have installed the tree package and loaded it with library().
But when I try to create the decision tree:
model<-tree(salary~.,data)
I get the error like below:
Error in tree(salary ~ ., data) :
factor predictors must have at most 32 levels
What is wrong with that? Data is as follows:
Name bat hit homeruns runs
1 Alan Ashby 315 81 7 24
2 Alvin Davis 479 130 18 66
3 Andre Dawson 496 141 20 65
...
team position putout assists errors
1 Hou. C 632 43 10
2 Sea. 1B 880 82 14
3 Mon. RF 200 11 3
salary league87 team87
1 475 N Hou.
2 480 A Sea.
3 500 N Chi.
And this is the output of str(data):
'data.frame': 263 obs. of 24 variables:
$ Name : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
$ bat : int 315 479 496 321 594 185 298 323 401 574 ...
$ hit : int 81 130 141 87 169 37 73 81 92 159 ...
$ homeruns : int 7 18 20 10 4 1 0 6 17 21 ...
$ runs : int 24 66 65 39 74 23 24 26 49 107 ...
$ runs.batted : int 38 72 78 42 51 8 24 32 66 75 ...
$ walks : int 39 76 37 30 35 21 7 8 65 59 ...
$ years.in.major.leagues : int 14 3 11 2 11 2 3 2 13 10 ...
$ bats.during.career : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
$ hits.during.career : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
$ homeruns.during.career : int 69 63 225 12 19 1 0 6 253 90 ...
$ runs.during.career : int 321 224 828 48 501 30 41 32 784 702 ...
$ runs.batted.during.career: int 414 266 838 46 336 9 37 34 890 504 ...
$ walks.during.career : int 375 263 354 33 194 24 12 8 866 488 ...
$ league : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
$ division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
$ team : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7 8 ...
$ position : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
$ putout : int 632 880 200 805 282 76 121 143 0 238 ...
$ assists : int 43 82 11 40 421 127 283 290 0 445 ...
$ errors : int 10 14 3 4 25 7 9 19 0 22 ...
$ salary : num 475 480 500 91.5 750 ...
$ league87 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
$ team87 : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...
The issue is almost certainly that you're including the Name variable in your model, as it has too many factor levels (263, while tree allows at most 32). I would also remove it from a methodological standpoint, but this probably isn't the place for that discussion. Try:
train <- data
train$Name <- NULL
model<-tree(salary~.,train)
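If it is not obvious which columns are the culprits, something along these lines can list every factor with more than 32 levels and drop them in one go. This is a sketch on a made-up players data frame, not the real data, and it does not call tree itself:

```r
# Made-up data: Name has 40 levels, team has 2
players <- data.frame(
  Name = factor(paste("Player", 1:40)),
  team = factor(rep(c("Hou.", "Sea."), 20)),
  salary = seq(100, by = 10, length.out = 40)
)

max_levels <- 32  # limit quoted in the error message

# TRUE for every factor column that exceeds the limit
too_many <- vapply(players,
                   function(col) is.factor(col) && nlevels(col) > max_levels,
                   logical(1))
offenders <- names(players)[too_many]

# Keep only the columns the model can handle
train <- players[, !too_many, drop = FALSE]
```

After this, offenders names the columns to look at, and train can be passed to the model as before.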
It seems that your salary is a factor vector, while you are trying to perform a regression, so it should be a numeric vector. Simply convert your salary to numeric, and it should work just fine. For more details, read the library's help:
http://cran.r-project.org/web/packages/tree/tree.pdf
Usage
tree(formula, data, weights, subset, na.action = na.pass,
control = tree.control(nobs, ...), method = "recursive.partition",
split = c("deviance", "gini"), model = FALSE, x = FALSE, y = TRUE, wts = TRUE, ...)
Arguments
formula A formula expression. The left-hand-side (response) should be either a numerical vector when a
regression tree will be fitted or a factor, when a classification tree
is produced. The right-hand-side should be a series of numeric or
factor variables separated by +; there should be no interaction terms.
Both . and - are allowed: regression trees can have offset terms.
(...)
Depending on what exactly is stored in your salary variable, the conversion can be more or less tricky, but this should generally work:
salary = as.numeric(levels(salary))[salary]
EDIT
As pointed out in the comments, the actual error refers to the predictors in data rather than to salary. If the offending column holds numerical data, it can be converted to numeric to solve the issue; if it has to remain a factor, you will need another model or fewer levels. You can also convert such factors to a numerical format by hand (for example by defining as many binary features as there are levels), but this can lead to exponential growth of your input space.
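For the "binary features" route, base R's model.matrix can build one 0/1 column per level. A minimal sketch with made-up team values, where the data frame d is purely illustrative:

```r
# Made-up factor with three levels
d <- data.frame(team = factor(c("Atl.", "Bal.", "Atl.", "Chi.")))

# "- 1" drops the intercept so that every level gets its own indicator column
dummies <- model.matrix(~ team - 1, data = d)

n_features <- ncol(dummies)  # one column per level
```

Each row of dummies has exactly one 1, in the column of that row's level.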
EDIT2
It seems that you first have to decide what you are trying to model. You are trying to predict salary, but based on what? Your data appears to consist of players' records, so their names are certainly the wrong type of data to use for this prediction (in particular, they are probably what causes the 32 levels error). You should remove from data all columns that should not be used for building the prediction. I do not know the exact aim here (the question gives no information about it), so I can only guess that you are trying to predict a player's salary based on his/her stats. In that case you should remove from the input data: players' names, players' teams and obviously salaries (as predicting X using X is not a good idea ;)).
