Combining data in R based on a characteristic of that data [duplicate] - r

This question already has answers here:
Aggregate data in R
(3 answers)
Closed 8 years ago.
Suppose I have data for a number of transactions that occur within different states:
State Cost
AK, 70
AK, 75
AK, 10
IL, 20
IL, 1050
IL, 235
etc...
How can I compress my data so that I'm only looking at the total cost per state? I can only come up with solutions that involve writing Python scripts to compress this data, but it seems like R should be able to support this operation.
State Cost
AK, 155
IL, 1305
etc...
Any ideas are greatly appreciated.

library("dplyr")
options(digits=4)
StatsByState <- group_by(Your.df, State)
summarise(StatsByState, Sum = sum(Cost), Mean = mean(Cost), StDev = sd(Cost))
options(digits=7)
State Sum Mean StDev
1 AK 155 51.67 36.17
2 IL 1040 346.67 565.80
3 NE 720 240.00 242.49
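For reference, the same aggregation can also be done in base R with aggregate(); this is a sketch that assumes your data frame is called Your.df with columns State and Cost, as in the dplyr example above.
# Sum Cost within each State (base R, no extra packages needed)
aggregate(Cost ~ State, data = Your.df, FUN = sum)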

Related

Running a loop on the complete variable after iterating on all the categories of that field

I am working in R and I have a dataframe which consists of columns with categorical data. For each combination of these categories, I have to aggregate a metric.
Input table:
ID Region Access Touchpoints
A Central High 8
B Central Low 7
C West High 7
D West Low 3
E Central High 2
F Central Low 5
G West High 9
H West Low 8
Output which I want:
Region Access Touchpoints
All All 49
All High 26
All Low 23
Central High 10
West High 16
Central Low 12
West Low 11
Central All 22
West All 27
The problem is that I have to create an "All" category when iterating over these variables in nested loops. Is there any other way?
New answer
The question is somewhat hard to make out, but what the questioner is looking for is aggregates and totals across several grouping variables. The cube function from data.table is designed specifically for this scenario.
library(data.table)
df <- fread('ID Region Access Touchpoints
A Central High 8
B Central Low 7
C West High 7
D West Low 3
E Central High 2
F Central Low 5
G West High 9
H West Low 8')
result <- cube(df, j = sum(Touchpoints), by = c('Region', 'Access'))
Note that cube only accepts a data.table and returns one as well. For more information on the data.table package, I refer to their excellent cheat-sheet-like wiki. In the result, NA marks the totals for groups and subgroups. We can change this and get back to a data.frame by running
result[is.na(Region), Region := 'All'][is.na(Access), Access := 'All']
setDF(result) # Change back to a data.frame (if wanted)
Old answer
This will be a somewhat limited answer due to the lack of a reproducible example.
Depending on the size of your data and your available memory, the simplest method for these situations is to simply create a grid of all combinations to iterate over. Multiple methods exist. In base R:
combinations <- expand.grid(var1, var2, var3, ...)
for(i in seq_len(nrow(combinations))){
  current_comb <- combinations[i, ]
  #Do stuff
  #...
}
#Alternative
#apply(combinations, 1, FUN)
With data.table we could similarly use CJ(var1, var2, ...), and in the tidyverse we'd use tidyr::expand_grid().
This is often much faster than nested loops, but as the number of categories grows it becomes less and less feasible. In your situation it should do fine, however.
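For concreteness, here is a sketch of this grid approach applied to the example data above, reusing the df read in with fread() in the new answer. The "All" level and the sum(Touchpoints) aggregation come from the question; the rest is illustrative rather than a definitive implementation.
# Build the grid, including an explicit "All" level for each variable
regions  <- c(unique(df$Region), "All")
accesses <- c(unique(df$Access), "All")
combinations <- expand.grid(Region = regions, Access = accesses,
                            stringsAsFactors = FALSE)
# Aggregate Touchpoints for every combination; "All" means no filtering on that variable
combinations$Touchpoints <- vapply(seq_len(nrow(combinations)), function(i) {
  r <- combinations$Region[i]
  a <- combinations$Access[i]
  keep <- (r == "All" | df$Region == r) & (a == "All" | df$Access == a)
  sum(df$Touchpoints[keep])
}, numeric(1))
combinations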

How to create a hierarchical cluster using categorical and numerical data in R?

I want to create a hierarchical cluster to show types of careers and the balance that those who are in those careers have in their bank accounts.
I have a dataset with two variables, job and balance:
job balance
1 unemployed 1787
2 services 4789
3 management 1350
4 management 1476
5 blue-collar 0
6 management 747
7 self-employed 307
8 technician 147
9 entrepreneur 221
10 services -88
I want the result to look like this:
Where A, B, C, etc. are the job categories.
Can anyone help me start this or give me some help?
I have no idea how to begin.
Thanks!
You can start by using the dist and hclust functions.
df <- read.table(text = " job balance
1 unemployed 1787
2 services 4789
3 management 1350
4 management 1476
5 blue-collar 0
6 management 747
7 self-employed 307
8 technician 147
9 entrepreneur 221
10 services -88")
dist computes the distance between each pair of elements (by default, the Euclidean distance):
distances <- dist(df$balance)
You can then cluster your values using the distance matrix generated above:
clusters <- hclust(distances)
By default, hclust applies complete-linkage clustering to your data.
Finally, you can plot your results as a tree:
plot(clusters, labels = df$job)
Here, we clustered all the entries in your data frame, which is why some jobs appear more than once. If you want a single value per job, you can, for example, take the mean balance for each job using tapply:
means <- tapply(df$balance, df$job, mean)
And then cluster the jobs:
distances <- dist(means)
clusters <- hclust(distances)
plot(clusters)
You can then try to use other distance measures or other clustering algorithms (see help(dist) and help(hclust) for other methods).
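If you also want discrete cluster assignments rather than just the dendrogram, cutree can cut the tree at a chosen number of groups; the k = 3 below is only an illustrative choice, not something prescribed by the question.
# Assign each job to one of, say, three clusters
groups <- cutree(clusters, k = 3)
groups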

Missing columns when using max in data.table [duplicate]

This question already has an answer here:
Subset rows corresponding to max value by group using data.table
(1 answer)
Closed 7 years ago.
I am trying to get the top-frequency words in a data.table.
data.table : dtable4G
key freq value
================================
thanks for the 612 support
thanks for the 380 drink
thanks for the 215 payment
thanks for the 27 encouragement
have a great 154 day
have a great 132 weekend
have a great 54 week
have a great 42 time
have a great 19 night
at the same 346 time
at the same 57 damn
at the same 30 pace
at the same 11 speed
at the same 7 level
at the same 1 rate
I tried the code
dtable4G[ , max(freq), by = key]
and
dtable4G[ , .I[which.max(freq)] , by = key]
With both of the above commands, I get the same result:
key V1
====================
thanks for the 612
have a great 154
at the same 346
I want the result to be:
key freq value
================================
thanks for the 612 support
have a great 154 day
at the same 346 time
Any ideas what I am doing wrong?
EDITED
dtable4G[dtable4G[, .I[which.max(freq)], by = key]$V1]
worked for me, though it took some time to run through my 5.4 million rows.
But this was way faster than using
dtable4G[,.SD[which.max(freq)],by=key]
Reference: With data.table, is SD[which.max(Var1)] the fastest way to find the max of a group?
We can subset the data table for only the max freq of each key column value with the following:
dtable4G[,.SD[which.max(freq)],by=key]
For better performance, you can use the approach below as well. It doesn't construct .SD for each group and is thus faster:
dtable4G[dtable4G[, .I[which.max(freq)], by = key]$V1]
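For completeness, here is a small self-contained illustration of the .I approach; the toy table below is made up for demonstration and is not the original 5.4-million-row data.
library(data.table)
dt <- fread('key,freq,value
thanks for the,612,support
thanks for the,380,drink
have a great,154,day
have a great,132,weekend')
# .I returns the row numbers of the max-freq row within each key group;
# subsetting by those row numbers keeps all columns of the winning rows.
dt[dt[, .I[which.max(freq)], by = key]$V1]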

R: Percentile calculations on subsets of data

I have a data set which contains the following identifiers, an rscore, gvkey, sic2, year, and cdom. What I am looking to do is calculate percentile ranks based on summed rscores for all temporal spans (~1500) for a given gvkey, and then calculate percentile ranks in a given temporal time span and sic2 based on gvkey.
Calculating the percentiles for all temporal time spans is a fairly quick process; however, once I add in calculating the sic2 percentile ranks it becomes fairly slow, and we are likely looking at about 65,000 subsets in total. I'm wondering if there is a way to speed up this process.
The data for one temporal time span looks like the following
gvkey sic2 cdom rscoreSum pct
1187 10 USA 8.00E-02 0.942268617
1265 10 USA -1.98E-01 0.142334654
1266 10 USA 4.97E-02 0.88565478
1464 10 USA -1.56E-02 0.445748247
1484 10 USA 1.40E-01 0.979807985
1856 10 USA -2.23E-02 0.398252565
1867 10 USA 4.69E-02 0.8791019
2047 10 USA -5.00E-02 0.286701209
2099 10 USA -1.78E-02 0.430915371
2127 10 USA -4.24E-02 0.309255308
2187 10 USA 5.07E-02 0.893020421
The code to calculate the industry ranks is below, and fairly straightforward.
#generate 2 digit industry SICs percentile ranks
dout <- ddply(dfSum, .(sic2), function(x){
  indPct <- rank(x$rscoreSum)/nrow(x)
  gvkey <- x$gvkey
  x <- data.frame(gvkey, indPct)
})
#merge 2 digit industry SIC percentile ranks with market percentile ranks
dfSum <- merge(dfSum, dout, by = "gvkey")
names(dfSum)[2] <- 'sic2'
Any suggestions to speed the process would be appreciated!
You might try the data.table package for fast operations across relatively large datasets like yours. For example, my machine has no problem working through this:
library(data.table)
# Create a dataset like yours, but bigger
n.rows <- 2e6
n.sic2 <- 1e4
dfSum <- data.frame(gvkey=seq_len(n.rows),
                    sic2=sample.int(n.sic2, n.rows, replace=TRUE),
                    cdom="USA",
                    rscoreSum=rnorm(n.rows))
# Now make your dataset into a data.table
dfSum <- data.table(dfSum)
# Calculate the percentiles
# Note that there is no need to re-assign the result
dfSum[, indPct:=rank(rscoreSum)/length(rscoreSum), by="sic2"]
whereas the plyr equivalent takes a while.
If you like the plyr syntax (I do), you may also be interested in the dplyr package, which is billed as "the next generation of plyr", with support for faster data stores in the backend.
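For readers who prefer that syntax, a roughly equivalent grouped calculation in dplyr might look like the sketch below, assuming dfSum is a plain data.frame with the sic2 and rscoreSum columns from the question.
library(dplyr)
dfSum <- dfSum %>%
  group_by(sic2) %>%                          # one group per 2-digit SIC
  mutate(indPct = rank(rscoreSum) / n()) %>%  # percentile rank within the group
  ungroup()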

How to optimize for loops in extremely large dataframe

I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber/integer, compdate/integer, and judge/character, representing individual cases completed in an administrative court. The data was imported from a Stata dataset and the date field came in as an integer, which is fine for my purposes. I want to create a caseload variable by calculating the number of cases completed by the same judge within the 30-day window ending on the completion date of the case at issue.
Here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. For each record, I take a subset of the data for that particular judge, then subset the cases decided in the 30-day window, and then assign the length of a vector in the subsetted dataframe to the caseload variable for the case at issue, as follows:
for(i in 1:length(x$idnumber)){
  e <- x$compdate[i]
  f <- e - 29
  a <- x[x$judge == x$judge[i] & !is.na(x$compdate), ]
  b <- a[a$compdate <= e & a$compdate >= f, ]
  x$caseload[i] <- length(b$idnumber)
}
It works, but it takes extremely long to complete. How can I optimize this or do it more easily? Sorry, I'm very new to R and to programming; I'm a law professor trying to analyze court data. Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
set.seed(1)
x<-data.frame(idnumber=1:n,judge=sample(judges,n,replace=TRUE),compdate=Sys.Date()+round(runif(n,1,120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort
x<-x[order(x$judge,x$compdate),]
# Create a little rolling window function.
rolling.window<-function(y,window=30) seq_along(y) - findInterval(y-window,y)
# Run the little function on each judge.
x$workload <- unlist(by(x$compdate, x$judge, rolling.window))
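To see what the rolling window function counts, here is a tiny made-up example: y holds the sorted completion days for one judge, and the result gives, for each case, the number of that judge's cases completed in the 30 days up to and including it.
y <- c(1, 5, 20, 40, 45)
rolling.window(y)
# [1] 1 2 3 2 3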
I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using @nograpes' simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[,compdate:=as.integer(compdate)]
setkey(DT,judge,compdate)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
ldt[,nrun:=cumsum(N),by=judge]
# see how far to look back
ldt[,lookbk:=sapply(1:.N,function(i){
  z <- compdate[i]-compdate[i:1]
  older <- which(z>30)
  if (length(older)) min(older)-1L else as(NA,'integer')
}),by=judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[,wload:=list(sapply(1:.N,function(i)
  nrun[i]-ifelse(is.na(lookbk[i]),0,nrun[i-lookbk[i]])
))]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)
