R - Aggregate function creating sub-lists

I'm using the aggregate function to summarise some loan data containing ContractNum and LoanAmount. I want to aggregate the data by StartDate, count the number of loans, and average the loan amount.
Here is a sample of the data and the function that I use:
ContractNum <- c("RHL-1","RHL-2","RHL-3","RHL-3")
StartDate <- c("2016-11-01","2016-11-01","2016-12-01","2016-12-01")
LoanPurpose <- c("Personal","Personal","HomeLoan","Investment")
LoanAmount <- c(200,500,600,150)
dat <- data.frame(ContractNum,StartDate,LoanPurpose,LoanAmount)
aggr.data <- aggregate(
cbind(LoanAmount,ContractNum) ~ StartDate + LoanPurpose
,data = dat
,FUN = function(x)c(count = mean(x),length(x))
)
When I look at the results of the aggregate function, they look OK:
> aggr.data
StartDate LoanPurpose LoanAmount.count LoanAmount.V2 ContractNum.count ContractNum.V2
1 2016-12-01 HomeLoan 600 1 3.0 1.0
2 2016-12-01 Investment 150 1 3.0 1.0
3 2016-11-01 Personal 350 2 1.5 2.0
But when I look at the structure, it seems to have created a sub-list:
> str(aggr.data)
'data.frame': 3 obs. of 4 variables:
$ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
$ LoanPurpose: Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
$ LoanAmount : num [1:3, 1:2] 600 150 350 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
$ ContractNum: num [1:3, 1:2] 3 3 1.5 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
How do I get rid of this sub-list so that I can access each column the way I would normally access a DF? I understand that in the code I've asked to give me a mean on a ContractNum which is not meaningful, but I can just get rid of that column.
Thank you

Just call do.call(data.frame, aggr.data) to unnest the matrix columns.
aggr.data <- do.call(data.frame, aggr.data);
str(aggr.data);
#'data.frame': 3 obs. of 6 variables:
# $ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
# $ LoanPurpose : Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
# $ LoanAmount.count : num 600 150 350
# $ LoanAmount.V2 : num 1 1 2
# $ ContractNum.count: num 3 3 1.5
# $ ContractNum.V2 : num 1 1 2
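As a minimal sketch of why this works, the toy data below (hypothetical, simplified from the question) shows that aggregate() stores a multi-value FUN result as a single matrix column, and that do.call(data.frame, ...) splits that matrix into ordinary columns. The names `agg` and `flat` are illustrative only.

```r
# Toy data (hypothetical), simplified from the question
dat <- data.frame(
  StartDate  = c("2016-11-01", "2016-11-01", "2016-12-01", "2016-12-01"),
  LoanAmount = c(200, 500, 600, 150)
)

# FUN returns two named values, so aggregate() keeps them in ONE matrix column
agg <- aggregate(LoanAmount ~ StartDate, data = dat,
                 FUN = function(x) c(mean = mean(x), n = length(x)))
is.matrix(agg$LoanAmount)   # the "sub-list" is really a matrix column

# The matrix can be indexed directly by its column names ...
agg$LoanAmount[, "mean"]

# ... or flattened: do.call() passes each column to data.frame() as a
# separate argument, and data.frame() expands a matrix into plain columns
flat <- do.call(data.frame, agg)
names(flat)
```

Naming both elements returned by FUN (here `mean` and `n`) avoids the automatically generated `V2` suffix seen in the output above.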

Related

Creating a list of dataframes based on filter criteria

I have a data set with an ID column. I filter the data frame into winter and summer, and I would like to split each subset further by ID. In my actual data set there are over 100 IDs, so I don't want to create 100 data frames by hand; instead I would like a list of data frames. I used the group_split function to do this, but the number of list elements comes out uneven between winter and summer, even though I know for certain that the same IDs should appear in both. Is there a better way of doing this?
library(dplyr)
library(lubridate)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2011"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df$month <- month(df$date)
summer <- df%>% arrange(ID, date) %>%
filter(month %in% 07:09) %>%
group_by(ID, .add = TRUE) %>%
group_split(ID)
winter <- df %>%
arrange(ID, date) %>%
filter(month %in% c(01,02,03)) %>%
group_by(ID, .add = TRUE) %>%
group_split(ID)
Thank you!
I think split will do what you want: produce a list of frames.
summer <- filter(df, month(date) %in% 7:9)
head(summer)
# date x y ID
# 1 2011-07-01 74958.44 842429.7 3
# 2 2011-07-02 64223.78 897607.8 4
# 3 2011-07-03 78843.54 829362.2 5
# 4 2011-07-04 60703.31 822962.0 1
# 5 2011-07-05 71328.44 872268.8 2
# 6 2011-07-06 68827.96 880618.3 3
str(split(summer, summer$ID))
# List of 5
# $ 1:'data.frame': 18 obs. of 4 variables:
# ..$ date: Date[1:18], format: "2011-07-04" "2011-07-09" ...
# ..$ x : num [1:18] 60703 64986 79477 67815 70387 ...
# ..$ y : num [1:18] 822962 858762 897413 817728 838251 ...
# ..$ ID : int [1:18] 1 1 1 1 1 1 1 1 1 1 ...
# $ 2:'data.frame': 18 obs. of 4 variables:
# ..$ date: Date[1:18], format: "2011-07-05" "2011-07-10" ...
# ..$ x : num [1:18] 71328 65414 64275 74286 76995 ...
# ..$ y : num [1:18] 872269 862579 818690 825991 847360 ...
# ..$ ID : int [1:18] 2 2 2 2 2 2 2 2 2 2 ...
# $ 3:'data.frame': 19 obs. of 4 variables:
# ..$ date: Date[1:19], format: "2011-07-01" "2011-07-06" ...
# ..$ x : num [1:19] 74958 68828 69431 76959 68538 ...
# ..$ y : num [1:19] 842430 880618 852488 874800 839197 ...
# ..$ ID : int [1:19] 3 3 3 3 3 3 3 3 3 3 ...
# $ 4:'data.frame': 19 obs. of 4 variables:
# ..$ date: Date[1:19], format: "2011-07-02" "2011-07-07" ...
# ..$ x : num [1:19] 64224 66977 75101 64189 73444 ...
# ..$ y : num [1:19] 897608 845062 809777 850364 822869 ...
# ..$ ID : int [1:19] 4 4 4 4 4 4 4 4 4 4 ...
# $ 5:'data.frame': 18 obs. of 4 variables:
# ..$ date: Date[1:18], format: "2011-07-03" "2011-07-08" ...
# ..$ x : num [1:18] 78844 77418 79762 78613 77485 ...
# ..$ y : num [1:18] 829362 867594 860007 819956 815058 ...
# ..$ ID : int [1:18] 5 5 5 5 5 5 5 5 5 5 ...
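One possible reason for lists of different lengths (an assumption, since the real data are not shown) is that an ID with no rows in a given season simply produces no group. The sketch below uses hypothetical toy data in base R: when ID is a factor, split() keeps an empty data frame for every level, so both lists have the same length and the same names.

```r
# Hypothetical toy data: ID 3 has no summer observations
df <- data.frame(
  ID    = rep(1:3, each = 4),
  month = c(1, 2, 7, 8,   1, 2, 3, 7,   1, 2, 3, 3),
  x     = seq_len(12)
)

in_summer <- df$month %in% 7:9
in_winter <- df$month %in% 1:3

# Splitting on a plain integer ID drops IDs that have no rows
length(split(df[in_summer, ], df$ID[in_summer]))

# Splitting on a factor keeps one (possibly empty) element per level
df$ID  <- factor(df$ID)
summer <- split(df[in_summer, ], df$ID[in_summer])
winter <- split(df[in_winter, ], df$ID[in_winter])

length(summer) == length(winter)
nrow(summer[["3"]])   # zero rows, but the element still exists
```

Because both lists are named by ID, elements can be matched by name (for example `summer[["2"]]` and `winter[["2"]]`) rather than by position.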

caret::predict giving Error: $ operator is invalid for atomic vectors

This has been driving me crazy; I've been looking through similar posts all day but can't seem to solve my problem. I have a naive Bayes model trained and stored as model. I'm attempting to predict with a newdata data frame, but I keep getting the error Error: $ operator is invalid for atomic vectors. Here is what I am running:
stats::predict(model, newdata = newdata)
where newdata is the first row of another data frame:
newdata <- pbp[1, c("balls", "strikes", "outs_when_up", "stand", "pitcher", "p_throws", "inning")]
class(newdata) gives [1] "tbl_df" "tbl" "data.frame".
The issue is with the data used: it should match the factor levels used in training. For example, if we use one of the rows from trainingData to predict, it works:
predict(model, head(model$trainingData, 1))
#[1] Curveball
#Levels: Changeup Curveball Fastball Sinker Slider
Checking the str of both datasets shows that some columns that are factors in the training data are character or integer in newdata:
str(model$trainingData)
'data.frame': 1277525 obs. of 7 variables:
$ pitcher : Factor w/ 1390 levels "112526","115629",..: 277 277 277 277 277 277 277 277 277 277 ...
$ stand : Factor w/ 2 levels "L","R": 1 1 2 2 2 2 2 1 1 1 ...
$ p_throws : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
$ balls : num 0 1 0 1 2 2 2 0 0 0 ...
$ strikes : num 0 0 0 0 0 1 2 0 1 2 ...
$ outs_when_up: num 1 1 1 1 1 1 1 2 2 2 ...
$ .outcome : Factor w/ 5 levels "Changeup","Curveball",..: 3 4 1 4 1 5 5 1 1 5 ...
str(newdata)
tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
$ balls : int 3
$ strikes : int 2
$ outs_when_up: int 1
$ stand : chr "R"
$ pitcher : int 605200
$ p_throws : chr "R"
One option is to convert those columns in newdata to factors with the same levels as in the training data:
nm1 <- intersect(names(model$trainingData), names(newdata))
nm2 <- names(which(sapply(model$trainingData[nm1], is.factor)))
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)), newdata[nm2], model$trainingData[nm2])
Now do the prediction
predict(model, newdata)
#[1] Sinker
#Levels: Changeup Curveball Fastball Sinker Slider
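The level-alignment step can be tried without a fitted model. The following is a minimal sketch with hypothetical stand-in data frames (`train` plays the role of model$trainingData); it only illustrates how the Map() call converts character and integer columns into factors carrying the training levels.

```r
# Hypothetical stand-in for model$trainingData
train <- data.frame(
  pitcher  = factor(c("112526", "605200", "115629")),
  stand    = factor(c("L", "R", "R")),
  p_throws = factor(c("R", "R", "L")),
  balls    = c(0, 1, 2)
)

# New data arrives with character and integer columns instead of factors
newdata <- data.frame(
  balls    = 3L,
  stand    = "R",
  pitcher  = 605200L,
  p_throws = "R",
  stringsAsFactors = FALSE
)

# Columns present in both data sets that are factors in the training data
nm1 <- intersect(names(train), names(newdata))
nm2 <- names(which(sapply(train[nm1], is.factor)))

# Rebuild each such column as a factor with the training levels;
# factor() matches on as.character(x), so the integer pitcher ID works too
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)),
                    newdata[nm2], train[nm2])
str(newdata)
```

A value that does not occur among the training levels becomes NA after this conversion, which is worth checking before calling predict().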

Add level to R Data frame

Let's say we have a data frame/table organized like this
x$user1, x$user2, etc..
x$usern is a data table with attributes like $age, $department, $sale, $price, etc.
I would like to "push" and regroup the data frame in x$usern to one lower level, so that I can add other data tables below x$usern
Perhaps an illustration makes it clearer. The current structure is
x
$user1 $user2
$price,$age, etc. $price, $age, etc.
Target structure is
x
$user1 $user2
$data $stat $data $stat
$price,$age, etc. $min, $max, etc. $price,$age, etc. $min, $max, etc.
What would be the best way to achieve this? I am thinking of lapply and/or looping through all users, but perhaps there is a more elegant way to do this?
Thank you.
This seems like a good place for lapply (or one of its kin). Some mock data:
x <- list(
user1 = data.frame(price = 11, age = 12),
user2 = data.frame(price = 21, age = 22)
)
str(x)
# List of 2
# $ user1:'data.frame': 1 obs. of 2 variables:
# ..$ price: num 11
# ..$ age : num 12
# $ user2:'data.frame': 1 obs. of 2 variables:
# ..$ price: num 21
# ..$ age : num 22
The transformation:
newx <- lapply(x, function(l) {
st <- data.frame(min = 0.9*min(l$price), max = 1.1*max(l$age))
list(data = l, stat = st)
})
str(newx)
# List of 2
# $ user1:List of 2
# ..$ data:'data.frame': 1 obs. of 2 variables:
# .. ..$ price: num 11
# .. ..$ age : num 12
# ..$ stat:'data.frame': 1 obs. of 2 variables:
# .. ..$ min: num 9.9
# .. ..$ max: num 13.2
# $ user2:List of 2
# ..$ data:'data.frame': 1 obs. of 2 variables:
# .. ..$ price: num 21
# .. ..$ age : num 22
# ..$ stat:'data.frame': 1 obs. of 2 variables:
# .. ..$ min: num 18.9
# .. ..$ max: num 24.2
(Obviously, my definition of st would have to be tailored to your needs. Additionally, it does not strictly need to be defined within the lapply, but it makes sense to do it there if you already know its definition based on x$user1$....)
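If the statistics are computed separately rather than inside the lapply, Map() can pair each user's data with its statistics. This is a sketch under that assumption; the `stats` list and the choice of min/max of price are hypothetical.

```r
# Mock data, as above but with two rows per user
x <- list(
  user1 = data.frame(price = c(11, 15), age = c(12, 13)),
  user2 = data.frame(price = c(21, 25), age = c(22, 23))
)

# Statistics computed in a separate step (hypothetical definition)
stats <- lapply(x, function(d) data.frame(min = min(d$price), max = max(d$price)))

# Map() walks both lists in parallel and keeps the names of the first one
newx <- Map(function(d, s) list(data = d, stat = s), x, stats)

# Access follows the new, deeper structure
newx$user1$stat$max
newx$user2$data$age
```

Further elements can be added to each user later, for example `newx$user1$notes <- "..."`, without disturbing `data` or `stat`.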

Why does mutate change the variable type?

activity <- mutate(
activity, steps = ifelse(is.na(steps), lookup_mean(interval), steps))
The "steps" variable changes from an int to a list. I want it to stay an "int" so I can aggregate it (aggregate is failing because it is a list type).
Before:
> str(activity)
'data.frame': 17568 obs. of 3 variables:
$ steps : int NA NA NA NA NA NA NA NA NA NA ...
$ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
After:
> str(activity)
'data.frame': 17568 obs. of 3 variables:
$ steps :List of 17568
..$ : num 1.72
..$ : num 1.72
Lookup mean is defined here:
lookup_mean <- function(i) {
return(filter(daily_activity_pattern, interval == 0) %>% select(steps))
}
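The problem is that lookup_mean returns a one-column data frame (which is a list) rather than a numeric vector, so ifelse builds a list and activity$steps becomes a list column. lookup_mean should return the column itself: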
lookup_mean <- function(i) {
interval <- filter(daily_activity_pattern, interval == 0) %>% select(steps)
return(interval$steps)
}
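The difference between a one-column data frame and the vector inside it can be seen in base R. The sketch below uses hypothetical toy data; looking up the mean per interval with match() is an assumption about the intent, since the original function always uses interval == 0.

```r
# Hypothetical toy versions of the two data frames
daily_activity_pattern <- data.frame(interval = c(0, 5, 10),
                                     steps    = c(1.72, 0.34, 0.13))
activity <- data.frame(steps    = c(NA, 4L, NA),
                       interval = c(0, 5, 10))

# Single-bracket selection keeps a data frame, which is a list
is.list(daily_activity_pattern["steps"])
# $ (or [[ ]]) extracts the plain numeric vector
is.numeric(daily_activity_pattern$steps)

# Vectorised lookup (assumed intent): mean steps for each interval in i
lookup_mean <- function(i) {
  daily_activity_pattern$steps[match(i, daily_activity_pattern$interval)]
}

# Because lookup_mean() returns a vector, steps stays an atomic column
activity$steps <- ifelse(is.na(activity$steps),
                         lookup_mean(activity$interval),
                         activity$steps)
str(activity)
```

Note that the column becomes numeric rather than integer, because the imputed means are not whole numbers.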

Between/within standard deviations

When working on a hierarchical/multilevel/panel dataset, it may be very useful to adopt a package which returns the within- and between-group standard deviations of the available variables.
In Stata, with data like the example below, this can easily be done with the command
xtsum, i(momid)
I searched, but I cannot find any R package that does this.
edit:
Just to fix ideas, an example of hierarchical dataset could be this:
son_id mom_id hispanic mom_smoke son_birthweigth
1 1 1 1 3950
2 1 1 0 3890
3 1 1 0 3990
1 2 0 1 4200
2 2 0 1 4120
1 3 0 0 2975
2 3 0 1 2980
The "multilevel" structure is given by the fact that each mother (higher level) has two or more sons (lower level). Hence, each mother defines a group of observations.
Accordingly, each variable in the dataset can vary either both between and within mothers, or only between mothers. son_birthweigth varies between mothers but also within the same mother, whereas hispanic is fixed for a given mother.
For example, the within-mother variance of son_birthweigth is:
# per-mother means of birthweigth
bwt_mean1 <- (3950+3890+3990)/3
bwt_mean2 <- (4200+4120)/2
bwt_mean3 <- (2975+2980)/2
# Within-mother variance for birthweigth
((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 +
(4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 +
(2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)
While the between-mother variance is:
# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
mean <- (3950+3890+3990+4200+4120+2975+2980)/7
# between variance:
((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)
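The two hand calculations above can be reproduced from the example table with base R. This is a sketch that follows the question's own definitions (dividing by 7-1 and 3-1 respectively); the variable names `bw`, `mom_means`, `within_var` and `between_var` are illustrative.

```r
# The example data from the question (only the columns needed here)
dat <- data.frame(
  son_id          = c(1, 2, 3, 1, 2, 1, 2),
  mom_id          = c(1, 1, 1, 2, 2, 3, 3),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)
bw <- dat$son_birthweigth

# One mean per mother
mom_means <- tapply(bw, dat$mom_id, mean)

# ave() repeats each mother's mean on her sons' rows, so the difference
# is every son's deviation from his own mother's mean
within_var <- sum((bw - ave(bw, dat$mom_id))^2) / (length(bw) - 1)

# Deviations of the mother means from the overall mean of all observations
between_var <- sum((mom_means - mean(bw))^2) / (length(mom_means) - 1)

sqrt(c(within = within_var, between = between_var))  # as standard deviations
```

Stata's xtsum may use slightly different conventions, so these values should be compared against its output before relying on them.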
I don't know exactly what your Stata command should reproduce, but to answer the second part of the question, about the hierarchical structure: this is easy to represent with a list.
For example, you define a structure like this:
tree = list(
"var1" = list(
"panel" = list(type ='p',mean = 1,sd=0)
,"cluster" = list(type = 'c',value = c(5,8,10)))
,"var2" = list(
"panel" = list(type ='p',mean = 2,sd=0.5)
,"cluster" = list(type="c",value =c(1,2)))
)
To create it programmatically, lapply is convenient for working with lists:
tree <- lapply(list('var1','var2'),function(x){
ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## element names are unquoted symbols here, not strings
cluster= list(type = 'c',value = rnorm(3))) ## R accepts both forms
})
names(tree) <-c('var1','var2')
You can view the structure with str:
str(tree)
List of 2
$ var1:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num 0.284
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
$ var2:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num -0.144
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] -0.595 -1.795 -0.439
Edit after OP clarification
I think the reshape2 package is what you want; I will demonstrate it here.
The idea is that, in order to do the multilevel analysis, we first need to reshape the data.
First, divide the variables into two groups: identifier variables and measured variables.
library(reshape2)
dat.m <- melt(dat,id.vars=c('son_id','mom_id')) ## other columns are measured
str(dat.m)
'data.frame': 21 obs. of 4 variables:
$ son_id : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
$ mom_id : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
$ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
$ value : num 1 1 1 0 0 0 0 1 0 0 ..
Once you have the data in "molten" form, you can "cast" it to rearrange it into the shape that you want:
# per-mother means for all variables
acast(dat.m,variable~mom_id,mean)
1 2 3
hispanic 1.0000000 0 0.0
mom_smoke 0.3333333 1 0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother sum of squared deviations for all variables
acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
1 2 3
hispanic 0.0000000 0 0.0
mom_smoke 0.6666667 0 0.5
son_birthweigth 5066.6666667 3200 12.5
## overall mean of each variable
acast(dat.m,variable~.,mean)
[,1]
hispanic 0.4285714
mom_smoke 0.5714286
son_birthweigth 3729.2857143
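For comparison, the per-mother means can also be obtained without reshaping. The sketch below rebuilds the example table from the question and uses base R's aggregate(); it is an alternative illustration, not part of the reshape2 approach, and the name `means` is illustrative.

```r
# The example table from the question
dat <- data.frame(
  son_id          = c(1, 2, 3, 1, 2, 1, 2),
  mom_id          = c(1, 1, 1, 2, 2, 3, 3),
  hispanic        = c(1, 1, 1, 0, 0, 0, 0),
  mom_smoke       = c(1, 0, 0, 1, 1, 0, 1),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)

# One row per mother, one column of means per measured variable
means <- aggregate(cbind(hispanic, mom_smoke, son_birthweigth) ~ mom_id,
                   data = dat, FUN = mean)
means
```

The result is the transpose of the acast() table above: mothers are rows and variables are columns.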
I know this question is four years old, but recently I wanted to do the same in R and came up with the following function. It depends on dplyr and tibble. Where: df is the dataframe, columns is a numerical vector to subset the dataframe and individuals is the column with the individuals.
xtsumR <- function(df, columns, individuals) {
  # One three-row block (overall / between / within) per requested column
  panel <- lapply(columns, function(i) {
    x <- df[[i]]
    # Between: one mean per individual
    group_means <- tapply(x, df[[individuals]], mean)
    # Within: deviation from the individual's own mean, re-centred on the
    # overall mean; ave() also handles unbalanced panels, unlike the
    # original rep(v2, each = 12), which assumed 12 rows per individual
    x_within <- x - ave(x, df[[individuals]]) + mean(x)
    tibble::tibble(
      variable = c(names(df)[i], NA, NA),
      variation = c("overall", "between", "within"),
      mean = c(mean(x), NA, NA),
      standard.deviation = c(sd(x), sd(group_means), sd(x_within)),
      min = c(min(x), min(group_means), min(x_within)),
      max = c(max(x), max(group_means), max(x_within))
    )
  })
  panel <- dplyr::bind_rows(panel)
  # Round the numeric columns to two decimals, as in the original version
  panel[3:6] <- lapply(panel[3:6], round, 2)
  return(panel)
}
