How to aggregate this data in R

I have a data frame in R with the following structure.
> testData
            date exch.code comm.code     oi
1     1997-12-30       CBT         1 468710
2     1997-12-23       CBT         1 457165
3     1997-12-19       CBT         1 461520
4     1997-12-16       CBT         1 444190
5     1997-12-09       CBT         1 446190
6     1997-12-02       CBT         1 443085
....
77827 2004-10-26      NYME       967  10038
77828 2004-10-19      NYME       967   9910
77829 2004-10-12      NYME       967  10195
77830 2004-09-28      NYME       967   9970
77831 2004-08-31      NYME       967   9155
77832 2004-08-24      NYME       967   8655
What I want to do is produce a table that shows, for a given date and commodity, the total oi across every exchange code. So, the rows would be made up of
unique(testData$date)
and the columns would be
unique(testData$comm.code)
and each cell would be the total oi over all exch.codes on a given day.
Thanks,

The plyr package is good at this; you should be able to get it done with a single ddply() call. Something like (untested)
ddply(testData, .(date, comm.code), function(x) sum(x$oi))
should work. Note that this returns the sums in long format, one row per date/commodity pair.

# get it all aggregated
dfl <- aggregate(oi ~ date + comm.code, testData, sum)
# rearrange it into the wide layout you requested
uc <- unique(dfl$comm.code)
dfw <- with(dfl, data.frame(date = unique(date), matrix(oi, ncol = length(uc))))
names(dfw) <- c('date', uc)
This will be much, much faster than the equivalent plyr command, and the rearranging part is very fast as well. (Note that the matrix() trick assumes every comm.code is observed on every date; with unbalanced data the columns will be misaligned.)

A data.table solution
library(data.table)
DT <- data.table(testData)
DT[, sum(oi), by = list(date, comm.code)]
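All three approaches above return the totals in long format, one row per date/commodity pair. To get the wide table described in the question (dates as rows, comm.codes as columns), one option is base R's xtabs, which aggregates and cross-tabulates in a single step. A minimal sketch, untested against the real data:
# rows = dates, columns = comm.codes, cells = total oi summed over exch.codes
wide <- xtabs(oi ~ date + comm.code, data = testData)
# xtabs returns a contingency table; convert if a plain data frame is needed
wide.df <- as.data.frame.matrix(wide)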

Related

Subset Conditions closest higher/lower observation

AreaCode Name Rank
 1001108   HA 2326
 1001247   HA 2327
 1003063   GC 2328
 1000957   DG 2329
 1001290   EA 2330
 1003305   GC 2331
 1003417   GC 2332
 1006442   WL 2333
 1005076   PK 2334
 1004581   NL 2335
I am new to R and am having some issues. I have a data set where I want to subset the AreaCodes ranked immediately above and below each run of GC observations, in order to do a case-control study.
So I want AreaCodes 1001247, 1000957, 1001290, and 1006442 in a separate data frame. How do I do this? I'm assuming it requires a loop, but I have no experience with those. The data has ~6000 observations, so doing it by hand becomes exhausting. Is there any way to do this?
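For reference, the example data above can be reconstructed like this (a sketch; it assumes the dashes in the listing are just separators and that Name is a plain character column, which both answers below rely on):
# rebuild the sample data frame used by the answers
df <- data.frame(
  AreaCode = c(1001108, 1001247, 1003063, 1000957, 1001290,
               1003305, 1003417, 1006442, 1005076, 1004581),
  Name = c("HA", "HA", "GC", "DG", "EA", "GC", "GC", "WL", "PK", "NL"),
  Rank = 2326:2335,
  stringsAsFactors = FALSE  # keep Name as character
)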
An alternative would be something like this (assuming Name is a character variable):
library(dplyr)
df2 <- df %>%
  mutate(newcol = ifelse(Name != "GC" & (lag(Name) == "GC" | lead(Name) == "GC"), 1, 0)) %>%
  filter(newcol == 1)
cumsum and rle are useful here:
brks <- cumsum(rle(df$Name)$lengths)  # last row index of each run
# [1]  2  3  4  5  7  8  9 10
equalsGC <- which(rle(df$Name)$values == "GC")  # which runs are GC
# [1] 2 5
ans <- df$AreaCode[sort(brks[c(equalsGC + 1, equalsGC - 1)])]
# [1] 1001247 1000957 1001290 1006442
(brks[equalsGC - 1] is the last row of the run just before each GC run; brks[equalsGC + 1] is the last row of the run just after it, which here is also its first row because the neighbouring runs all have length 1.)
As a single block
brks <- cumsum(rle(df$Name)$lengths)
equalsGC <- which(rle(df$Name)$values=="GC")
ans <- df$AreaCode[sort(brks[c(equalsGC+1, equalsGC-1)])]

Avoiding loop by grouping variable in R

I am new to R and have been stuck on a problem for quite a while now.
I have a big dataset (gridded data originally) with more than 1,000,000 observations, and I have to make a group variable for my elements.
My dataset looks as follows:
ID      Var1
1       0.5
2       0.6
3       0.2
4       0.15
...     ...
1029600 0.43
What I want now is to make groups according to the following scheme:
1       2       3       4       5       6       ... 4320
4321    4322    4323    4324    4325    4326    ... 8640
8641    8642    8643    8644    8645    8646    ... 12960
12961   12962   12963   12964   12965   12966   ... 17280
17281   17282   17283   17284   17285   17286   ... 21600
21601   21602   21603   21604   21605   21606   ... 25920
...     ...     ...     ...     ...     ...     ... ...
1025281 1025282 1025283 1025284 1025285 1025286 ... 1029600
That is, the IDs form a grid with 4320 columns, and each group is a 6x6 block of that grid. The 36 numbers {1,...,6, 4321,...,4326, 8641,...,8646, 12961,...,12966, 17281,...,17286, 21601,...,21606} are the first group.
The second group would be {7,...,12, 4327,...,4332, ..., 21607,...,21612}. The third group would start with {13,14,15,...}, and so on for all observations. I hope this makes my goal clear; I wanted to visualize it with a picture, but as a new member this is not possible.
So far I have managed to do it with a really ugly nested loop, which looks as follows:
group <- integer(1029600)  # preallocate (the original snippet assumes this exists)
for (k in 0:40) {
  nk <- 25920 * k
  mk <- 720 * k
  for (j in 0:719) {
    cj <- j * 6
    for (i in 0:5) {
      ai <- i * 4320 + 1 + cj + nk
      bi <- i * 4320 + 6 + cj + nk
      group[ai:bi] <- 1 + j + mk
    }
  }
}
I am aware that this is pretty inefficient and it takes a very long time to compute this with loops. I am pretty sure that there is an easier way to solve my problem, but as I am new to R, I cannot find it myself.
Any help would be really appreciated. Thank you in advance!
You can get the group directly from the ID with a little arithmetic:
group <- ((ID - 1) %% 4320) %/% 6 + 720 * ((ID - 1) %/% 25920) + 1
Note that %% is the modulo operation and %/% is integer division. ((ID - 1) %% 4320) %/% 6 is the position of the 6-wide column block within a row, and (ID - 1) %/% 25920 counts completed blocks of six rows (each 6 x 4320 = 25920 IDs), matching the mk <- 720 * k term in your loop. The formula gives groups numbered from 1, and there is no need for a loop: it is a vectorized operation.
There are plenty of other ways to do it (like reshaping 1:1029600 into a matrix with 4320 columns, taking 6-column slices, and doing a match or something), but this is why you should always stop and think about what you really want to do, and realize it often comes down to a little arithmetic :)
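As a quick sanity check, the vectorized formula can be compared against the original loop on the first two row blocks. A sketch, assuming the loop from the question restricted to k = 0:1:
ID <- 1:51840  # first two 6-row blocks of the 4320-wide grid
group_vec <- ((ID - 1) %% 4320) %/% 6 + 720 * ((ID - 1) %/% 25920) + 1
group_loop <- integer(length(ID))
for (k in 0:1) {
  nk <- 25920 * k
  mk <- 720 * k
  for (j in 0:719) {
    cj <- j * 6
    for (i in 0:5) {
      group_loop[(i * 4320 + 1 + cj + nk):(i * 4320 + 6 + cj + nk)] <- 1 + j + mk
    }
  }
}
all(group_vec == group_loop)  # should be TRUE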
Create sample data:
dtf <- data.frame(ID = 1:1e4, Var1 = rnorm(1e4))
Grouping as explained by @antoine-sac:
group <- ((dtf$ID - 1) %% 4320) %/% 6 + 720 * ((dtf$ID - 1) %/% 25920) + 1
Split the data
dtfsplit <- split(dtf, group)
First group
> dtfsplit[1]
$`1`
ID Var1
1 1 0.56655
2 2 0.87645
3 3 -1.41986
4 4 -1.84881
5 5 0.03233
6 6 3.06512
4321 4321 -1.57179
4322 4322 -1.09958
4323 4323 0.55980
4324 4324 0.32390
4325 4325 0.85438
4326 4326 -0.10311
8641 8641 2.08886
8642 8642 1.19836
8643 8643 0.52592
8644 8644 0.20571
8645 8645 1.08429
8646 8646 0.69648
Second group
dtfsplit[2]

How to column bind and row bind a large number of data frames in R?

I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the Vehicle ID column. In total there are 2169 vehicles. I filtered the Vehicle velocity column for every vehicle (using a for loop), which resulted in a new column with the first and last 30 values removed (per vehicle). In order to bind it with the original data frame, I removed the first and last 30 rows of the data frame too and then combined them using cbind(). This works only for the last vehicle. I want this smoothing and column binding for all vehicles, and finally I want to combine all the per-vehicle data frames into one single table, that is, row-binding in sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names(traj1) <- c('Vehicle ID', 'Frame ID', 'Total Frames', 'Global Time', 'Local X', 'Local Y', 'Global X', 'Global Y', 'Vehicle Length', 'Vehicle width', 'Vehicle class', 'Vehicle velocity', 'Vehicle acceleration', 'Lane', 'Preceding Vehicle ID', 'Following Vehicle ID', 'Spacing', 'Headway')
# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time
# SMOOTHING VELOCITY
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D / delta))
  r <- convolve(x, z, type = 'filter') / convolve(rep(1, length(x)), z, type = 'filter')
  r
}
for (i in unique(traj1$'Vehicle ID')) {
  veh <- subset(traj1, traj1$'Vehicle ID' == i)
  svel <- smooth(veh$'Vehicle velocity', 30, 10)
  svel <- data.frame(svel)
  veh <- head(tail(veh, -30), -30)
  fta <- cbind(veh, svel)
}
'fta' now only shows the data frame for the last vehicle. But I want the data frames for all vehicles i combined by row. Maybe a for loop is not the right way to do it, but I don't know how to use tapply (or any other apply function) to do so many things at the same time.
EDIT
I can't reproduce my dataset here, but the Orange data set in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the age column is smoothed and the Tree column is equivalent to my Vehicle ID column):
for (i in unique(Orange$Tree)) {
  tre <- subset(Orange, Orange$'Tree' == i)
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb <- cbind(tre, age2)
}
Umair, I am not sure I understood what you want.
If I understood right, you want to combine all the results by row. To do that you could save all the results in a list and then do.call an rbind:
comb <- list()  # create a list to save the results
length(comb) <- length(unique(Orange$Tree))
## Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))) {
  tre <- subset(Orange, Tree == unique(Orange$Tree)[i])
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb[[i]] <- cbind(tre, age2)  # save the result in the list
}
final.data <- do.call("rbind", comb)  # combine all results by row
This will give you:
   Tree  age circumference    age2
3     1  664            87  687.88
4     1 1004           115  982.66
5     1 1231           120 1211.49
10    2  664           111  687.88
11    2 1004           156  982.66
12    2 1231           172 1211.49
17    3  664            75  687.88
18    3 1004           108  982.66
19    3 1231           115 1211.49
24    4  664           112  687.88
25    4 1004           167  982.66
26    4 1231           179 1211.49
31    5  664            81  687.88
32    5 1004           125  982.66
33    5 1231           142 1211.49
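The same pattern can be written more compactly with split() and lapply(), which avoids the manual index bookkeeping. A sketch using the smooth() function defined in the question:
comb <- lapply(split(Orange, Orange$Tree), function(tre) {
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  cbind(head(tail(tre, -2), -2), age2 = age2)  # trim 2 rows each end, attach smoothed column
})
final.data <- do.call(rbind, comb)  # row-bind the per-tree results
One caveat: split() uses the factor level order of Orange$Tree (an ordered factor with levels 3 < 1 < 5 < 2 < 4), so the blocks come out in a different row order than the loop above.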
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data <- ddply(Orange, .(Tree), tail, n = -2)
data <- ddply(data, .(Tree), head, n = -2)
data <- cbind(data,
              age2 = matrix(sapply(split(Orange$age, Orange$Tree), smooth, D = 2, delta = 0.67),
                            ncol = 1, byrow = FALSE))
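For comparison, a more modern dplyr equivalent might look like the sketch below (untested; it assumes the smooth() function from the question is defined and uses group_modify() from recent dplyr):
library(dplyr)
final.data <- Orange %>%
  group_by(Tree) %>%
  group_modify(~ {
    age2 <- round(smooth(.x$age, 2, 0.67), digits = 2)
    cbind(head(tail(.x, -2), -2), age2 = age2)  # same trim-and-bind as the loop
  }) %>%
  ungroup()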

Split and combine by factor into new columns

I've got SQL output in a data.frame which looks like this:
             dateTime resultMean                       SensorDescription
1 2009-01-09 21:35:00   7.134589             Aanderaa Optode - Type 3835
2 2009-01-09 21:35:00   7.813000         Seabird SBE45 Thermosalinograph
3 2009-01-09 21:35:00   8.080399 Turner SCUFA II Chlorophyll Fluorometer
4 2009-01-09 21:35:00   7.818604                          ADAM PT100 PRT
5 2009-01-09 21:36:00   7.818604                          ADAM PT100 PRT
I want to turn it into a frame like so:
             dateTime Aanderaa Optode - Type 3835 Seabird SBE45 Thermosalinograph Turner SCUFA II Chlorophyll Fluorometer ADAM PT100 PRT
1 2009-01-09 21:35:00                    7.134589                        7.813000                                8.080399       7.818604
Currently I've got a function which splits the frame by SensorDescription and then loops over the resulting list with merge.
Is there a better way of doing this using built-in functions? I've looked at plyr, ddply, etc., and nothing seems to do quite what I want.
The current merging loop function looks like this:
listmerge <- function(datalist) {
  mdat <- datalist[[1]][1:2]
  for (i in 2:length(datalist)) {
    mdat <- join(mdat, datalist[[i]][1:2], by = "dateTime", match = "all")
  }
  mdat
}
You can use dcast from the reshape2 package:
library(reshape2)
d <- data.frame(x = 1, y = letters[1:10], z = runif(10))
dcast(x ~ y, data = d)
Using z as value column: use value.var to override.
x a b c d e f g h i j
1 1 0.7582016 0.4000201 0.5712599 0.9851774 0.9971331 0.2955978 0.9895403 0.6114973 0.323996 0.785073
reshape from the base stats package can also accomplish this, but the syntax is a little more difficult.
reshape(d, idvar='x', timevar='y', direction='wide')
x z.a z.b z.c z.d z.e z.f z.g z.h z.i z.j
1 1 0.7582016 0.4000201 0.5712599 0.9851774 0.9971331 0.2955978 0.9895403 0.6114973 0.323996 0.785073
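Applied to the sensor data itself, the tidyr package offers pivot_wider as a more recent alternative. A minimal sketch, untested; 'sensors' stands in for the data frame shown in the question:
library(tidyr)
# one column per SensorDescription, filled with the matching resultMean
wide <- pivot_wider(sensors,
                    names_from = SensorDescription,
                    values_from = resultMean)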

Counting unique factors in R

I would like to know the number of unique dams which gave birth on each of the birth dates recorded. My data frame is similar to this one:
dam <- c("2A11","2A11","2A12","2A12","2A12","4D23","4D23","1X23")
bdate <- c("2009-10-01","2009-10-01","2009-10-01","2009-10-01",
"2009-10-01","2009-10-03","2009-10-03","2009-10-03")
mydf <- data.frame(dam,bdate)
mydf
# dam bdate
# 1 2A11 2009-10-01
# 2 2A11 2009-10-01
# 3 2A12 2009-10-01
# 4 2A12 2009-10-01
# 5 2A12 2009-10-01
# 6 4D23 2009-10-03
# 7 4D23 2009-10-03
# 8 1X23 2009-10-03
I used aggregate(dam ~ bdate, data=mydf, FUN=length), but it counts every birth record on a particular date rather than the number of distinct dams:
bdate dam
1 2009-10-01 5
2 2009-10-03 3
Instead, I need to have something like this:
mydf2
bdate dam
1 2009-10-01 2
2 2009-10-03 2
Your help is very much appreciated!
What about:
aggregate(dam ~ bdate, data=mydf, FUN=function(x) length(unique(x)))
You could also run unique on the data first:
aggregate(dam ~ bdate, data=unique(mydf[c("dam","bdate")]), FUN=length)
Then you could also use table instead of aggregate, though the output is a little different.
> table(unique(mydf[c("dam","bdate")])$bdate)
2009-10-01 2009-10-03 
         2          2 
This is just an example of how to think about the problem and one of the approaches to solving it.
split.mydf <- with(mydf, split(x = mydf, f = bdate))  # each list element has only one date
# it's just a matter of extracting the unique dams
unique.mydf <- lapply(X = split.mydf, FUN = function(x) unique(x$dam))
# and then counting the unique elements
unilen.mydf <- lapply(unique.mydf, length)
# you can do these two last steps in one go like so
lapply(split.mydf, FUN = function(x) length(unique(x$dam)))
as.data.frame(unlist(unilen.mydf))  # a data.frame is just a special list, so this is water to your mill
unlist(unilen.mydf)
# 2009-10-01 2009-10-03 
#          2          2
In dplyr you can use n_distinct:
library(tidyverse)
mydf %>%
  group_by(bdate) %>%
  summarize(dam = n_distinct(dam))
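If you are already using data.table, uniqueN does the same thing. A sketch on the same mydf:
library(data.table)
setDT(mydf)  # convert to a data.table by reference
mydf[, .(dam = uniqueN(dam)), by = bdate]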
