Automating split up of data frame - r

I have the following data frame in R:
> head(df)
date x y z n t
1 2012-01-01 1 1 1 0 52
2 2012-01-01 1 1 2 0 52
3 2012-01-01 1 1 3 0 52
4 2012-01-01 1 1 4 0 52
5 2012-01-01 1 1 5 0 52
6 2012-01-01 1 1 6 0 52
> str(df)
'data.frame': 4617600 obs. of 6 variables:
$ date: Date, format: "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" ...
$ x : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ y : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ z : Factor w/ 111 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ n : int 0 0 0 0 0 0 0 0 29 0 ...
$ t : num 52 52 52 52 52 52 52 52 52 52 ...
What I want to do is split this large df into smaller data frames as follows:
1) I want to have 45 data frames, one for each factor level of 'x'. 2) I want to further split each of those 45 data frames by each factor level of 'z'. So I want a total of 45*111 = 4995 data frames.
I've seen plenty online about splitting data frames, which turns them into lists. However, I'm not seeing how to further split lists. Another concern I have is with computer memory. If I split the data frame into lists, will it not still take up as much computer memory? If I then want to run some prediction models on the split data, it seems impossible to do. Ideally I would split the data into many data frames, run prediction models on the first split data frame, get the results I need, and then delete it before moving on to the next one.

Here's what I would do: your data already fits in memory, so leave it in one piece and compute the grouped results directly with data.table:
require(data.table)
setDT(df)
df[, {
  sum(t * n)  # or whatever you're doing for "prediction models"
}, by = .(x, z)]
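If you do want separate pieces, base R's split() can group by both factors at once, and you can process each piece inside lapply() so only the small per-group results accumulate. A minimal sketch with hypothetical toy data (sum(t * n) stands in for the real prediction model):

```r
# Toy stand-in for the real df: 2 levels of x, 2 levels of z
df <- data.frame(x = factor(rep(1:2, each = 4)),
                 z = factor(rep(1:2, times = 4)),
                 n = 1:8, t = rep(52, 8))

# One data frame per x/z combination; drop = TRUE skips empty combinations
pieces <- split(df, list(df$x, df$z), drop = TRUE)

# Process each piece and keep only its (small) result
results <- lapply(pieces, function(piece) sum(piece$t * piece$n))
```

split() does copy the data, so once the results are collected you can rm(pieces) and only the summaries stay in memory.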


Is it possible to order an R contingency table by rowSums

I have created a contingency table with several variables/cols by 510 categories/factors. I want to have factors ordered descending based on the sum of all variables/cols.
Tried converting table back to DF and rowSums but no luck.
Not sure if possible to sort while using table function?
DF structure
'data.frame': 2210 obs. of 7 variables:
$ Paddock_ID: num 1 1 1 1 1 1 1 1 1 1 ...
$ Year : num 2010 2011 2011 2012 2012 ...
$ LandUse : chr "Wheat" "Wheat" "Wheat" "Wheat" ...
$ LUT : chr "Cer" "Cer" "Cer" "Cer" ...
$ LUG : chr "Wheat" "Wheat" "Wheat" "Wheat" ...
$ Tmix : Factor w/ 6 levels "6","5","4","3",..: 6 5 6 4 6 5 4 5 6 6 ...
$ combo : Factor w/ 510 levels "","GLYPHOSATE",..: 416 6 59 119 30 22 510 2 2 509
my table
a <- table(DF$combo, DF$LUG)
I get table ok but would like to have it ordered based on sum of all variables/columns i.e. Glyphosate = 124, then clethodim = 69, then paraquat = 53 ... descending for all 510 categories (rows).
Barley Canola Lupin Other Pasture Wheat
GLYPHOSATE 4 46 6 5 23 40
TRALKOXYDIM 0 0 0 0 0 8
MCPA; GLYPHOSATE; METSULFURON 0 0 0 0 0 1
METSULFURON 1 0 0 0 0 1
BUTROXYDIM; METSULFURON 1 0 0 0 0 0
GLYPHOSATE; METSULFURON; PYRAFLUFEN 0 0 0 0 0 1
PARAQUAT 2 7 7 2 28 7
CLETHODIM 0 41 15 3 0 0
Using an example dataset:
grades <- c(1,1,1,2,2,1,1,2,1,1,1,2,3)
credits <- c(4,4,4,8,4,4,8,4,4,4,8,4,4)
df <- cbind(grades, credits)
You can find the row sums using rowSums().
One possible solution is to create another column holding the row sums and then sort on it with decreasing = TRUE:
df <- as.data.frame(df)
df$sum <- rowSums(df)
df <- df[order(df$sum, decreasing = TRUE), ]
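For the contingency-table case specifically, you can also order the table object itself by its row sums, without adding a column; a minimal sketch with a few hypothetical rows:

```r
# Small stand-in for table(DF$combo, DF$LUG)
a <- table(c("GLYPHOSATE", "GLYPHOSATE", "PARAQUAT", "CLETHODIM", "GLYPHOSATE"),
           c("Wheat", "Canola", "Pasture", "Canola", "Wheat"))

# Reorder rows so the category with the largest total comes first
a_sorted <- a[order(rowSums(a), decreasing = TRUE), ]
```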

multivariate random forest on a community matrix

I want to use random forest modeling to understand variable importance on community assembly - my response data is a community matrix.
library(randomForestSRC)
# simulated species matrix
species
# site species 1 species2 species 3
# 1 1 1 0
# 2 1 0 1
# 3 1 1 1
# 4 1 0 1
# 5 1 0 0
# 6 1 1 0
# 7 1 1 0
# 8 1 0 0
# 9 1 0 0
# 10 1 1 0
# environmental data
data
# site elevation_m PRECIPITATION_mm
# 1 500 28
# 2 140 37
# 3 445 15
# 4 340 45
# 5 448 20
# 6 55 70
# 7 320 18
# 8 200 42
# 9 420 22
# 10 180 8
# adding my species matrix into the environmental data frame
data[["species"]] <-(species)
# running the model
rf_model <- rfsrc(Multivar(species) ~ ., data = data, importance = TRUE)
but I'm getting an error message:
Error in parseFormula(formula, data, ytry) :
the y-outcome must be either real or a factor.
I'm guessing that the issue is the presence/absence data, but I'm not sure how to move past that. Is this a limitation of the function?
I think it MIGHT have to do with how you built your "data" data frame. When you used data[["species"]] <- (species), you created a data frame inside a data frame. If you run str(data) after that step, the output is this:
> str(data)
'data.frame': 10 obs. of 4 variables:
$ site : int 1 2 3 4 5 6 7 8 9 10
$ elevation: num 500 140 445 340 448 55 320 200 420 180
$ precip : num 28 37 15 45 20 70 18 42 22 8
$ species :'data.frame': 10 obs. of 4 variables: #2nd data frame
..$ site : int 1 2 3 4 5 6 7 8 9 10
..$ species.1: num 1 1 1 1 1 1 1 1 1 1
..$ species2 : num 1 0 1 0 0 1 1 0 0 1
..$ species.3: num 0 1 1 1 0 0 0 0 0 0
If you instead build your data frame as data2 <- as.data.frame(cbind(data, species)), then
rfsrc(Multivar(species.1, species2, species.3) ~ ., data = data2, importance = TRUE)
seems to work: I don't get an error message, and instead get some reasonable-looking output:
Sample size: 10
Number of trees: 1000
Forest terminal node size: 5
Average no. of terminal nodes: 2
No. of variables tried at each split: 2
Total no. of variables: 4
Total no. of responses: 3
User has requested response: species.1
Resampling used to grow trees: swr
Resample size used to grow trees: 10
Analysis: mRF-R
Family: regr+
Splitting rule: mv.mse *random*
Number of random split points: 10
% variance explained: NaN
Error rate: 0
I don't think your method of building the data frame is the customary way, but I could be wrong. rfsrc() apparently did not know how to read a nested data frame, and I doubt most modeling functions can without extra custom code.
Here's an example, using example data from the vegan package, of automatically constructing a formula that includes all of the species names in the response:
library(vegan)
library(randomForestSRC)
data("dune.env")
data("dune")
all <- as.data.frame(cbind(dune, dune.env))
form <- formula(sprintf("Multivar(%s) ~ .",
                        paste(colnames(dune), collapse = ",")))
rfsrc(form, data = all)
Suppose we want to do this with 2000 species. Here's a simulated example:
nsp <- 2000
nsamp <- 100
nenv <- 10
set.seed(101)
spmat <- matrix(rpois(nsp * nsamp, lambda = 5), ncol = nsp,
                dimnames = list(NULL, paste0("sp", seq(nsp))))
envmat <- matrix(rnorm(nenv * nsamp), ncol = nenv,
                 dimnames = list(NULL, paste0("env", seq(nenv))))
all2 <- as.data.frame(cbind(spmat, envmat))
form2 <- formula(sprintf("Multivar(%s) ~ .",
                         paste(colnames(spmat), collapse = ",")))
rfsrc(form2, data = all2)
In this particular example we seem to explain -3% (!!) of the variance, but it doesn't crash, so that's a good thing ...
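The formula-building step can be sanity-checked on its own, without fitting anything (three hypothetical species names):

```r
# Build the multivariate-response formula string, then convert it
sp_names <- c("sp1", "sp2", "sp3")
form_str <- sprintf("Multivar(%s) ~ .", paste(sp_names, collapse = ","))
form3 <- formula(form_str)  # a valid two-sided formula
```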

randomForest using factor variables as continuous?

I am using the package randomForest to produce habitat suitability models for species. I thought everything was working as it should until I started looking at individual trees with getTree(). The documentation (see page 4 of the randomForest vignette) states that for categorical variables, the split point will be an integer, which makes sense. However, in the trees I have looked at for my results, this is not the case.
The data frame I used to build the model was formatted with categorical variables as factors:
> str(df.full)
'data.frame': 27087 obs. of 23 variables:
$ sciname : Factor w/ 2 levels "Laterallus jamaicensis",..: 1 1 1 1 1 1 1 1 1 1 ...
$ estid : Factor w/ 2 levels "7694","psabs": 1 1 1 1 1 1 1 1 1 1 ...
$ pres : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
$ stratum : Factor w/ 89 levels "poly_0","poly_1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ra : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 3 3 3 3 3 ...
$ eoid : Factor w/ 2 levels "0","psabs": 1 1 1 1 1 1 1 1 1 1 ...
$ avd3200 : num 0.1167 0.0953 0.349 0.1024 0.3765 ...
$ biocl05 : num 330 330 330 330 330 ...
$ biocl06 : num 66 65.8 66 65.8 66 ...
$ biocl08 : num 277 277 277 277 277 ...
$ biocl09 : num 170 170 170 170 170 ...
$ biocl13 : num 186 186 185 186 185 ...
$ cti : num 19.7 19 10.4 16.4 14.7 ...
$ dtnhdwat : num 168 240 39 206 309 ...
$ dtwtlnd : num 0 0 0 0 0 0 0 0 0 0 ...
$ e2em1n99 : num 0 0 0 0 0 0 0 0 0 0 ...
$ ems30_53 : Factor w/ 53 levels "0","602","2206",..: 19 4 17 4 19 19 4 4 19 19 ...
$ ems5607_46: num 0 0 1 0 0.4 ...
$ ksat : num 0.21 0.21 0.21 0.21 0.21 ...
$ lfevh_53 : Factor w/ 53 levels "0","11","16",..: 38 38 38 38 38 38 38 38 38 38 ...
$ ned : num 1.46 1.48 1.54 1.48 1.47 ...
$ soilec : num 14.8 14.8 19.7 14.8 14.8 ...
$ wtlnd_53 : Factor w/ 50 levels "0","3","7","11",..: 4 31 7 31 7 31 7 7 31 31 ...
This was the function call:
# rfStratum and sampSizeVec were previously defined
> rf.full$call
randomForest(x = df.full[, c(7:23)], y = df.full[, 3],
ntree = 2000, mtry = 7, replace = TRUE, strata = rfStratum,
sampsize = sampSizeVec, importance = TRUE, norm.votes = TRUE)
Here are the first 15 lines of an example tree (note that the variables in lines 1, 5, and 15 should be categorical, i.e., they should have integer split values):
> tree100
left daughter right daughter split var split point status prediction
1 2 3 ems30_53 9.007198e+15 1 <NA>
2 4 5 biocl08 2.753206e+02 1 <NA>
3 6 7 biocl06 6.110518e+01 1 <NA>
4 8 9 biocl06 1.002722e+02 1 <NA>
5 10 11 lfevh_53 9.006718e+15 1 <NA>
6 0 0 <NA> 0.000000e+00 -1 0
7 12 13 biocl05 3.310025e+02 1 <NA>
8 14 15 ned 2.814818e+00 1 <NA>
9 0 0 <NA> 0.000000e+00 -1 1
10 16 17 avd3200 4.199712e-01 1 <NA>
11 18 19 e2em1n99 1.724138e-02 1 <NA>
12 20 21 biocl09 1.738916e+02 1 <NA>
13 22 23 ned 8.837864e-01 1 <NA>
14 24 25 biocl05 3.442437e+02 1 <NA>
15 26 27 lfevh_53 9.007199e+15 1 <NA>
Additional information: I encountered this while investigating an error I was getting when predicting the results back onto the study area, stating that the types of predictors in the new data did not match those of the training data. I have done 6 other iterations of this model using the same data frame and scripts (just with different subsets of predictors) and never gotten this message before. The only thing I could find that differs between the randomForest object in this run and the other runs is that the rf.full$forest$ncat components are stored as double instead of integer:
> for(i in 1:length(rf.full$forest$ncat)){
+ cat(names(rf.full$forest$ncat)[[i]], ": ", class(rf.full$forest$ncat[[i]]), "\n")
+ }
avd12800 : numeric
cti : numeric
dtnhdwat : numeric
dtwtlnd : numeric
ems2207_99 : numeric
ems30_53 : numeric
ems5807_99 : numeric
hydgrp : numeric
ksat : numeric
lfevh_53 : numeric
ned : numeric
soilec : numeric
wtlnd_53 : numeric
>
> rf.full$forest$ncat
avd12800 cti dtnhdwat dtwtlnd ems2207_99 ems30_53 ems5807_99 hydgrp ksat lfevh_53
1 1 1 1 1 53 1 1 1 53
ned soilec wtlnd_53
1 1 50
However, xlevels (which appears to be a list of the predictor variables used and their types) are all showing the correct datatype for each predictor.
> for(i in 1:length(rf.full$forest$xlevels)){
+ cat(names(rf.full$forest$xlevels)[[i]], ": ", class(rf.full$forest$xlevels[[i]]),"\n")
+ }
avd12800 : numeric
cti : numeric
dtnhdwat : numeric
dtwtlnd : numeric
ems2207_99 : numeric
ems30_53 : character
ems5807_99 : numeric
hydgrp : character
ksat : numeric
lfevh_53 : character
ned : numeric
soilec : numeric
wtlnd_53 : character
# example continuous predictor
> rf.full$forest$xlevels$avd12800
[1] 0
# example categorical predictor
> rf.full$forest$xlevels$ems30_53
[1] "0" "602" "2206" "2207" "4504" "4507" "4702" "4704" "4705" "4706" "4707" "4717" "5207" "5307" "5600"
[16] "5605" "5607" "5616" "5617" "5707" "5717" "5807" "5907" "6306" "6307" "6507" "6600" "7002" "7004" "9107"
[31] "9116" "9214" "9307" "9410" "9411" "9600" "4607" "4703" "6402" "6405" "6407" "6610" "7005" "7102" "7104"
[46] "7107" "9000" "9104" "9106" "9124" "9187" "9301" "9505"
The ncat component is simply a vector of the number of categories per variable with 1 for continuous variables (as noted here), so it doesn't seem like it should matter if that is stored as an integer or a double, but it seems like this might all be related.
Questions
1) Shouldn't the split point for categorical predictors in any given tree of a randomForest forest be an integer, and if yes, any thoughts as to why factors in the data frame used as input to the randomForest call here are not being used as such?
2) Does the number type (double vs integer) of the ncat component of a randomForest object matter in any way related to model building, and any thoughts as to what could cause this to switch from integer in the first 6 runs to double in this last run (with each run containing different subsets of the same data)?
The randomForest::randomForest algorithm encodes low-cardinality (up to 32 categories) and high-cardinality (32 to 64? categories) categorical splits differently. Note that all your "problematic" features belong to the latter class and are encoded using 64-bit floating-point values.
While the console output doesn't make sense to a human observer, the randomForest model object/algorithm itself is correct (i.e. it treats those variables as categorical) and is making correct predictions.
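To see how a categorical split can be packed into a single number, here is a hedged sketch of the low-cardinality (up to 32 level) case, where the split point is an integer bitmask over the factor levels; as I understand it, the huge ~9e15 doubles in the tree above are the analogous packing for factors with more than 32 levels:

```r
# Hypothetical split point for a 4-level factor: 11 is binary 1011,
# so levels 1, 2 and 4 go to the left daughter and level 3 goes right
split_point <- 11L
left_levels <- which(intToBits(split_point)[1:4] == 1)
```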
If you want to investigate the structure of decision tree, and decision tree ensemble models, then you might consider exporting them to the PMML data format. For example, you can use the R2PMML package for this:
library("r2pmml")
r2pmml(rf.full, "MyRandomForest.pmml")
Then open MyRandomForest.pmml in a text editor, and you will get a nice overview of the internals of your model (branches, split conditions, leaf values, etc.).

How can I strip dollar signs ($) from a data frame in R?

I'm quite new to R and am battling a bit with what would appear to be an extremely simple query.
I've imported a csv file into R using read.csv and am trying to remove the dollar signs ($) prior to tidying the data and further analysis (the dollar signs are playing havoc with charting).
I've been trying without luck to strip the $ using dplyr and gsub from the data frame and I'd really appreciate some advice about how to go about it.
My data frame looks like this:
> str(data)
'data.frame': 50 obs. of 17 variables:
$ Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Prog.Cost : Factor w/ 2 levels "-$3,333","$0": 1 2 2 2 2 2 2 2 2 2 ...
$ Total.Benefits : Factor w/ 44 levels "$2,155","$2,418",..: 25 5 7 11 12 10 9 14 13 8 ...
$ Net.Cash.Flow : Factor w/ 45 levels "-$2,825","$2,155",..: 1 6 8 12 13 11 10 15 14 9 ...
$ Participant : Factor w/ 46 levels "$0","$109","$123",..: 1 1 1 45 46 2 3 4 5 6 ...
$ Taxpayer : Factor w/ 48 levels "$113","$114",..: 19 32 35 37 38 40 41 45 48 47 ...
$ Others : Factor w/ 47 levels "-$9","$1,026",..: 12 25 26 24 23 11 9 10 8 7 ...
$ Indirect : Factor w/ 42 levels "-$1,626","-$2",..: 1 6 10 18 22 24 28 33 36 35 ...
$ Crime : Factor w/ 35 levels "$0","$1","$10",..: 6 11 13 19 21 23 28 31 33 32 ...
$ Child.Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Education : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Health.Care : Factor w/ 38 levels "-$10","-$11",..: 7 7 7 7 2 8 12 36 30 9 ...
$ Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Earnings : Factor w/ 41 levels "$0","$101","$104",..: 1 1 1 22 23 24 25 26 27 28 ...
$ State.Benefits : Factor w/ 37 levels "$102","$117",..: 37 1 3 4 6 10 12 18 24 27 ...
$ Local.Benefits : Factor w/ 24 levels "$115","$136",..: 24 1 2 12 14 16 19 22 23 21 ...
$ Federal.Benefits: Factor w/ 39 levels "$0","$100","$102",..: 1 1 1 12 12 17 20 19 19 21 ...
If you only need to remove the $ and do not want to change the class of the columns:
indx <- sapply(data, is.factor)
data[indx] <- lapply(data[indx], function(x)
  as.factor(gsub("\\$", "", x)))
If you need numeric columns, you can strip out the , as well (contributed by @David Arenburg) and convert to numeric with as.numeric:
data[indx] <- lapply(data[indx], function(x) as.numeric(gsub("[,$]", "", x)))
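A minimal self-contained check of that pattern, with a few hypothetical values:

```r
# gsub coerces the factor to character before substituting
x <- factor(c("-$3,333", "$0", "$2,155"))
as.numeric(gsub("[,$]", "", x))
```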
You can wrap this in a function:
f1 <- function(dat, pat = "[$]", Class = "factor"){
  indx <- sapply(dat, is.factor)
  if (Class == "factor"){
    dat[indx] <- lapply(dat[indx], function(x) as.factor(gsub(pat, "", x)))
  } else {
    dat[indx] <- lapply(dat[indx], function(x) as.numeric(gsub(pat, "", x)))
  }
  dat
}
f1(data)
f1(data, pat = "[,$]", "numeric")
data
set.seed(24)
data <- data.frame(Year = 1:6,
                   Prog.Cost = sample(c("-$3,3333", "$0"), 6, replace = TRUE),
                   Total.Benefits = sample(c("$2,155", "$2,418", "$2,312"), 6, replace = TRUE))
If you have to read a lot of csv files with data like this, perhaps you should consider creating your own as method to use with the colClasses argument, like this:
setClass("dollar")
setAs("character", "dollar",
      function(from) as.numeric(gsub("[,$]", "", from, fixed = FALSE)))
Before demonstrating how to use it, let's write @akrun's sample data to a csv file named "A". This would not be necessary in your actual use case, where you would be reading the file directly...
## write @akrun's sample data to a csv file named "A"
set.seed(24)
data <- data.frame(
  Year = 1:6,
  Prog.Cost = sample(c("-$3,3333", "$0"), 6, replace = TRUE),
  Total.Benefits = sample(c("$2,155", "$2,418", "$2,312"), 6, replace = TRUE))
A <- tempfile()
write.csv(data, A, row.names = FALSE)
Now, you have a new option for colClasses that can be used with read.csv :-)
read.csv(A, colClasses = c("numeric", "dollar", "dollar"))
# Year Prog.Cost Total.Benefits
# 1 1 -33333 2155
# 2 2 -33333 2312
# 3 3 0 2312
# 4 4 0 2155
# 5 5 0 2418
# 6 6 0 2418
It would probably be more beneficial to just read the file again, this time with readLines. I wrote @akrun's data to the file "data.txt" and fixed the strings before reading the table. Not sure whether the comma was a decimal point or just an annoying thousands separator, so I chose decimal point.
r <- gsub("[$]", "", readLines("data.txt"))
read.table(text = r, dec = ",")
# Year Prog.Cost Total.Benefits
# 1 1 -3.3333 2.155
# 2 2 -3.3333 2.312
# 3 3 0.0000 2.312
# 4 4 0.0000 2.155
# 5 5 0.0000 2.418
# 6 6 0.0000 2.418

R factor - time series conversion not working

I have the common problem of converting a factor of the format "2007/01" to a time series object. The data can be found here: http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=prc_hicp_midx&lang=en
I did replace the "M" in YYYY"M"MM with a "/".
> str(infl)
'data.frame': 3560 obs. of 5 variables:
$ TIME : Factor w/ 89 levels "2007/01","2007/02",..: 1 2 3 4 5 6 7 8 9 10 ...
$ GEO : Factor w/ 40 levels "Austria","Belgium",..: 15 15 15 15 15 15 15 15 15 15 ...
$ INFOTYPE: Factor w/ 1 level "Index, 2005=100": 1 1 1 1 1 1 1 1 1 1 ...
$ COICOP : Factor w/ 1 level "All-items HICP": 1 1 1 1 1 1 1 1 1 1 ...
$ Value : Factor w/ 1952 levels ":","100.49","100.5",..: 35 51 85 112 127 131 120 126 147 169 ...
I followed all the different approaches:
as.POSIXct(as.character(infl$TIME), format = "%Y/%m")
as.POSIXlt(as.character(infl$TIME), format = "%Y/%m")
as.Date(as.character(infl$TIME), format = "%Y/%m")
However, all of them return NA for the entire length of the series. Has anyone any idea why I cannot convert this series to a ts object?
Your help is greatly appreciated.
It looks like you can make it work using the yearmon class from the zoo package (base R's Date and POSIXct classes require a day of the month, which is why "%Y/%m" alone returns NA for every element):
library(zoo)
as.yearmon("2007/01", "%Y/%m")
# [1] "Jan 2007"
See Sorting an data frame based on month-year time format for more ideas.
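If you'd rather stay in base R, another common workaround is to paste a day of the month onto each string so that as.Date() has a complete date to parse:

```r
# "2007/01" has no day component, so supply one before parsing
d <- as.Date(paste0("2007/01", "/01"), format = "%Y/%m/%d")
```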
