I have a dataframe which I would like to clean by removing some offsetting lines (boxed positions) and doing some netting.
Here is the source table:
Type Name Strike Maturity Nominal
Call Amazon 10 10/12/2018 1000
Put Amazon 10 10/12/2018 1000
Call Ebay 8 2/8/2018 800
Put Ebay 8 2/8/2018 500
Call Facebook 5 5/5/2018 900
Call Google 2 23/4/2018 250
Put Google 2 23/4/2018 350
Call Microsoft 2 19/3/2018 250
Put Microsoft 2.5 19/3/2018 350
Put Ebay 8 2/8/2018 100
And here is the desired result:
Type Name Strike Maturity Nominal
Call Ebay 8 2/8/2018 200
Call Facebook 5 5/5/2018 900
Put Google 2 23/4/2018 100
Call Microsoft 2 19/3/2018 250
Put Microsoft 2.5 19/3/2018 350
I'm trying to write code in R that would perform these 3 tasks:
1// Remove all the pairs that offset each other.
A pair that offsets each other is a pair that meets these 2 criteria:
2 lines that have the same Name, Strike, Maturity and Nominal.
1 line is a "Call" while the other one is a "Put".
Example: the 2 "Amazon" lines that were removed from the table.
2// Do a netting on the nominal for the lines that don't perfectly offset each other.
A pair that doesn't perfectly offset each other is a pair that meets these 2 criteria:
2 lines that have the same Name, Strike and Maturity but different Nominal.
1 line is a "Call" while the other one is a "Put".
Example: the 2 "Ebay" lines that were netted on the Call, or the 2 "Google" lines that were netted on the Put.
3// Don't do anything on all the other lines.
Example: the 2 "Microsoft" lines. They have different strikes, so no netting at all should be done.
Please see below my first attempt.
My idea was first to create a new column with a unique key, then to sort alphabetically, and then to test each line one by one.
I find it very laborious, so I was wondering if someone could help me find a more straightforward and efficient solution?
Many thanks!
library(data.table)
dt <- data.table(Type=c("Call", "Put", "Call", "Put", "Call", "Call", "Put", "Call", "Put", "Put"),
                 Name=c("Amazon", "Amazon", "Ebay", "Ebay", "Facebook", "Google", "Google", "Microsoft", "Microsoft", "Ebay"),
                 Strike=c(10, 10, 8, 8, 5, 2, 2, 2, 2.5, 8),
                 Maturity=c("10/12/2018", "10/12/2018", "2/8/2018", "2/8/2018", "5/5/2018", "23/4/2018", "23/4/2018", "19/3/2018", "19/3/2018", "2/8/2018"),
                 Nominal=c(1000, 1000, 800, 500, 900, 250, 350, 250, 350, 100))
##idea
dt$key <- paste(dt$Name,dt$Strike,dt$Maturity)
dt[order(dt$key,decreasing = FALSE),]
dt$Type2 <- ifelse(dt$Type == "Call", 1, 0)
#for each line k, test value in the column "Key" and the column "Type2":
#if key(k) = key(k+1) and Type2(k)+Type2(k+1)=1 then
#if Nominal(k) > Nominal(k+1), delete the line k+1 and do the netting on nominal of the line k
#else if Nominal(k) < Nominal(k+1), delete the line k and do the netting on nominal of the line k+1
#next k
dt <- dt[dt$Nominal!=0,]
dt$key <- NULL
Following the ideas that were recommended, I tried the dcast solution, but it does not do the proper netting, as shown below:
> dt <- data.table(Type=c("Call", "Put", "Call", "Put", "Call", "Call", "Put", "Call", "Put","Put"),
+ Name=c("Amazon", "Amazon", "Ebay", "Ebay", "Facebook", "Google", "Google", "Microsoft", "Microsoft","Ebay"),
+ Strike=c(10,10,8,8,5,2,2,2,2.5,8),
+ Maturity=c("10/12/2018", "10/12/2018", "2/8/2018", "2/8/2018", "5/5/2018", "23/4/2018", "23/4/2018", "19/3/2018", "19/3/2018","2/8/2018"),
+ Nominal=c(1000,1000,800,500,900,250,350,250,350,100))
> dcast(dt, Name + Maturity + Strike ~ Type, value.var="Nominal", fill = 0)[, Net := Call - Put][Net != 0]
Aggregate function missing, defaulting to 'length'
Name Maturity Strike Call Put Net
1: Ebay 2/8/2018 8.0 1 2 -1
2: Facebook 5/5/2018 5.0 1 0 1
3: Microsoft 19/3/2018 2.0 1 0 1
4: Microsoft 19/3/2018 2.5 0 1 -1
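For the record, the warning is the clue here: the two Ebay puts land in the same cell, so dcast needs to be told how to aggregate them. A sketch of the same call with fun.aggregate = sum, which should net correctly (the sign of Net then encodes Call vs Put):
dcast(dt, Name + Maturity + Strike ~ Type, value.var = "Nominal",
      fun.aggregate = sum, fill = 0)[, Net := Call - Put][Net != 0]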
Here is a tidyverse solution. Basically, since you want to group all rows that have the same Name, Strike and Maturity, I think it's simplest to convert Call and Put into actual numbers and use summarise. Your special offset case is really just removing the cases where the net total ends up being 0.
Approach is:
Convert Put into negative values of Nominal using ifelse and mutate,
Use group_by and summarise to reduce the groups into a single value per group,
Remove perfect offsets with filter,
Replace the Type column and make the negative values positive.
Code:
library(tidyverse)
tbl <- read_table2(
"Type Name Strike Maturity Nominal
Call Amazon 10 10/12/2018 1000
Put Amazon 10 10/12/2018 1000
Call Ebay 8 2/8/2018 800
Put Ebay 8 2/8/2018 500
Call Facebook 5 5/5/2018 900
Call Google 2 23/4/2018 250
Put Google 2 23/4/2018 350
Call Microsoft 2 19/3/2018 250
Put Microsoft 2.5 19/3/2018 350
Put Ebay 8 2/8/2018 100"
)
tbl %>%
mutate(actual = ifelse(Type == "Call", Nominal, -Nominal)) %>%
group_by(Name, Strike, Maturity) %>%
summarise(Net = sum(actual)) %>%
filter(Net != 0) %>%
mutate(
Type = ifelse(Net > 0, "Call", "Put"),
Net = abs(Net)
)
# A tibble: 5 x 5
# Groups: Name, Strike [5]
Name Strike Maturity Net Type
<chr> <dbl> <chr> <int> <chr>
1 Ebay 8.00 2/8/2018 200 Call
2 Facebook 5.00 5/5/2018 900 Call
3 Google 2.00 23/4/2018 100 Put
4 Microsoft 2.00 19/3/2018 250 Call
5 Microsoft 2.50 19/3/2018 350 Put
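For comparison, the same logic as a data.table chain (assuming the corrected dt from above; this is just a sketch of the equivalent grouped sum):
dt[, .(Net = sum(ifelse(Type == "Call", Nominal, -Nominal))),
   by = .(Name, Strike, Maturity)][Net != 0][
   , `:=`(Type = ifelse(Net > 0, "Call", "Put"), Net = abs(Net))][]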
Related
I have a data frame called nurse. It contains several columns, but only one (nurse$word) is relevant here. I want to create a new column named nurse$w.frequency which looks at the words in the nurse$word column and, if it finds the one specified, changes the corresponding nurse$w.frequency value to a specified integer.
nurse <- read.csv(...)
file word w.frequency
1 determining
2 journey
3 journey
4 serving
5 work
6 journey
... ...
The word frequency for determining and journey, for instance, is 1590 and 4650 respectively. So it should look like the following:
file word w.frequency
1 determining 1590
2 journey 4650
3 journey 4650
4 serving
5 work
6 journey 4650
... ...
I have tried it with an ifelse statement (below), which seems to work; however, every time I try to change the actual word and frequency, it overwrites the results from before.
nurse$w.frequency <- ifelse(nurse$word == "determining", nurse$w.frequency[nurse$word["determining"]] <- 1590, "")
You could first initialise an empty column
nurse$w.frequency <- NA
then populate it with the data you want
nurse$w.frequency[nurse$word == "determining"] <- 1590
nurse$w.frequency[nurse$word == "journey"] <- 4650
Using dplyr:
nurse %>%
mutate(w.frequency =
case_when(
word == "determining" ~ "1590",
word == "journey" ~ "4650",
TRUE ~ ""
))
Gives us:
word w.frequency
1 determining 1590
2 journey 4650
3 journey 4650
4 serving
5 work
6 journey 4650
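If you'd rather keep w.frequency numeric, a variant of the same case_when (my assumption: NA instead of "" for unmatched words, so the column stays numeric):
nurse %>%
  mutate(w.frequency = case_when(
    word == "determining" ~ 1590,
    word == "journey" ~ 4650,
    TRUE ~ NA_real_
  ))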
Data:
nurse <- data.frame(word = c("determining", "journey", "journey", "serving", "work", "journey"))
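And if the word list grows, a named-vector lookup avoids writing one assignment or branch per word; a base R sketch using only the two frequencies from the question:
# lookup table: names are words, values are frequencies
freq <- c(determining = 1590, journey = 4650)
# unmatched words get NA
nurse$w.frequency <- freq[as.character(nurse$word)]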
I'm struggling to create a new variable off a text string. Here is a sample of my data frame:
Brand Pack_Content
1 Dove 4X25 G
2 Snickers 250 G
3 Twix 2X20.7 G
4 Korkunov BULK
I would like to create a numeric variable called Grams. I've tried solutions using gsub or separate, but the need for different handling by row (e.g., packs with multiple units, such as 4X25 G, need a multiplication) has me stumped. A solution with dplyr is preferred.
Brand Pack_Content Grams
1 Dove 4X25 G 100
2 Snickers 250 G 250
3 Twix 2X20.7 G 41.4
4 Korkunov BULK 1000
A solution using dplyr and tidyr. The key is, before using separate on the Pack_Content_new column, to replace all the strings, such as "G" or "BULK", with "" or meaningful numbers. If you have more than one meaningful string like "BULK", you may want to use case_when in addition to recode. After the separate function, we can replace NA with 1 in the Number column. Finally, we can calculate the Grams based on the numbers in Number and Unit_Weight.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
mutate(Pack_Content_new = sub("G$", "", Pack_Content)) %>% # Remove the last G
mutate(Pack_Content_new = recode(Pack_Content_new, # Replace BULK with 1000
`BULK` = "1000")) %>%
separate(Pack_Content_new, into = c("Number", "Unit_Weight"), # Separate the Pack_Content_new column
sep = "X", convert = TRUE,
fill = "left") %>%
replace_na(list(Number = 1)) %>% # Replace NA in Number with 1
mutate(Grams = Number * Unit_Weight) # Calculate the Grams
dat2
# Brand Pack_Content Number Unit_Weight Grams
# 1 Dove 4X25 G 4 25.0 100.0
# 2 Snickers 250 G 1 250.0 250.0
# 3 Twix 2X20.7 G 2 20.7 41.4
# 4 Korkunov BULK 1 1000.0 1000.0
DATA
dat <- read.table(text = " Brand Pack_Content
1 Dove '4X25 G'
2 Snickers '250 G'
3 Twix '2X20.7 G'
4 Korkunov 'BULK'",
header = TRUE, stringsAsFactors = FALSE)
Update: added in some unit extraction and conversions just for the heck of it
Update 2: Threw in some validation steps (for my own reference if no-one else's) that should probably have been part of the original answer. In general, if you're using regular expressions to extract values (and you don't have time to review every single row of output in detail), it's easy to get burned when some corner-case input format that wasn't considered comes along.
Using data.table, stringi, and the sweet, sweet magic of regular expressions:
A note on tool selection here:
Since regular expressions are difficult enough to follow on their own, I think it's a safer bet to focus on making the transformation steps readable and clearly defined instead of trying to cram it all into a series of pipes and as few lines of code as possible.
Since dplyr doesn't allow for step-by-step manipulation without re-assigning the tibble after each expression, I feel data.table is a far more elegant and efficient tool for this kind of data munging work.
Create Data
library(data.table)
library(stringi)
DT <- data.table(Brand = c("Dove","Snickers","Twix","Korkunov","Reeses","M&M's"),
Pack = c("4X25 G","0.250 KG","2X20.7 G","BULK","2.5.5X4G","2 X 3 X 3G"))
Pre Cleaning
First off we'll strip out spaces and make everything uppercase
## Strip out Spaces
DT[,Pack := gsub("[[:space:]]+","",Pack)]
## Make everything Uppercase
DT[,Pack := toupper(Pack)]
Assumption Validation
Before we use regular expressions to extract values and do some math on them, it's probably prudent to do some validation steps to make sure we don't get burned down the road by an unexpected corner case.
## Start off by trusting nothing
DT[,Valid := FALSE]
## Mark Packs that fit formats like "BULK" as valid
DT[Pack %in% c("BULK"),Valid := TRUE]
## Mark Packs that fit formats like "4X20G" or "3.0X3KG" as valid
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+X([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"),
Valid := TRUE]
## Mark Packs that fit formats like "250G" as valid
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"),
Valid := TRUE]
print(DT)
At this point:
Brand Pack Valid
1: Dove 4X25G TRUE
2: Snickers 0.250KG TRUE
3: Twix 2X20.7G TRUE
4: Korkunov BULK TRUE
5: Reeses 2.5.5X4G FALSE
6: M&M's 2X3X3G FALSE
Extracting Values
Note that we are only populating values for rows that met pre-defined expectations for what a valid format is.
## Extract the first number at the beginning of the "Pack" column followed by an X
DT[Valid == TRUE, Quantity := as.numeric(stri_extract_first_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(?=X)"))]
## Extract last number out of the "Pack" column
DT[Valid == TRUE, Unit_Weight := as.numeric(stri_extract_last_regex(Pack,"([[:digit:]]+\\.){0,1}[[:digit:]]+"))]
## Extract the Units
DT[Valid == TRUE, Units := stri_extract_last_regex(Pack,"[[:alpha:]]+$")]
print(DT)
Now we've got the following:
Brand Pack Valid Quantity Unit_Weight Units
1: Dove 4X25G TRUE 4 25.00 G
2: Snickers 0.250KG TRUE NA 0.25 KG
3: Twix 2X20.7G TRUE 2 20.70 G
4: Korkunov BULK TRUE NA NA BULK
5: Reeses 2.5.5X4G FALSE NA NA NA
6: M&M's 2X3X3G FALSE NA NA NA
Convert units, fill in NA's, calculate weights
Now we just have to go back and fill in rows where there wasn't a weight or a quantity, optionally convert units, etc. so we can calculate weight.
## Start with a standard conversion factor of 1
DT[Valid == TRUE, Unit_Factor := 1]
## Make some Unit Conversions
DT[Units == "KG", Unit_Factor := 1000]
## Fill in Rows without a quantity with a value of 1
DT[Valid == TRUE & is.na(Quantity), Quantity := 1]
## Fill in a weight for Bulk units
DT[Pack == "BULK", `:=` (Unit_Weight = 1000, Units = "G")]
## And finally, calculate Weight in grams
DT[Valid == TRUE, Grams := Unit_Weight*Quantity*Unit_Factor]
print(DT)
Which yields a final result:
Brand Pack Valid Quantity Unit_Weight Units Unit_Factor Grams
1: Dove 4X25G TRUE 4 25.00 G 1 100.0
2: Snickers 0.250KG TRUE 1 0.25 KG 1000 250.0
3: Twix 2X20.7G TRUE 2 20.70 G 1 41.4
4: Korkunov BULK TRUE 1 1000.00 G 1 1000.0
5: Reeses 2.5.5X4G FALSE NA NA NA NA NA
6: M&M's 2X3X3G FALSE NA NA NA NA NA
(All the steps, in condensed form)
library(data.table)
library(stringi)
DT <- data.table(Brand = c("Dove","Snickers","Twix","Korkunov","Reeses","M&M's"),
Pack = c("4X25 G","0.250 KG","2X20.7 G","BULK","2.5.5X4G","2 X 3 X 3G"))
DT[,Pack := gsub("[[:space:]]+","",Pack)]
DT[,Pack := toupper(Pack)]
DT[,Valid := FALSE]
DT[Pack %in% c("BULK"),Valid := TRUE]
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+X([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"), Valid := TRUE]
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"), Valid := TRUE]
DT[Valid == TRUE, Quantity := as.numeric(stri_extract_first_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(?=X)"))]
DT[Valid == TRUE, Unit_Weight := as.numeric(stri_extract_last_regex(Pack,"([[:digit:]]+\\.){0,1}[[:digit:]]+"))]
DT[Valid == TRUE, Units := stri_extract_last_regex(Pack,"[[:alpha:]]+$")]
DT[Valid == TRUE, Unit_Factor := 1]
DT[Units == "KG", Unit_Factor := 1000]
DT[Valid == TRUE & is.na(Quantity), Quantity := 1]
DT[Pack == "BULK", `:=` (Unit_Weight = 1000, Units = "G")]
DT[Valid == TRUE, Grams := Unit_Weight*Quantity*Unit_Factor]
A final note:
I'm assuming you didn't include all the messy, dirty details of how all over the place your raw data is, so you might need to add some more steps to capture cases where you have pounds instead of grams (and all those other corner cases).
Still, with 5-7 regular expressions I think you'd probably be able to cover at least a decent amount of your potential cases.
I keep this Regex cheatsheet on RStudio's website within arm's reach most of the time.
I know you prefer a dplyr solution. Have you tried all the methods of base R? Well, here is just a small one. Hope this helps even though it's not a dplyr method.
First you need to keep only the numbers and substitute X with *. This is done with the sub function. We also replace the entry that does not contain a number with 1000. Then we just evaluate the expressions obtained:
# keep the leading token (drop the " G"), then turn "X" into "*" so it can be evaluated
A <- sub("X", "*", sub("\\s.*", "", dat$Pack_Content))
# replace entries without digits (e.g. "BULK") with 1000, then evaluate each expression
transform(dat, Grams = sapply(parse(text = replace(A, -grep("\\d", A), 1000)), eval))
Brand Pack_Content Grams
1 Dove 4X25 G 100.0
2 Snickers 250 G 250.0
3 Twix 2X20.7 G 41.4
4 Korkunov BULK 1000.0
Data Used:
dat=structure(list(Brand = c("Dove", "Snickers", "Twix", "Korkunov"
), Pack_Content = c("4X25 G", "250 G", "2X20.7 G", "BULK")), .Names = c("Brand",
"Pack_Content"), class = "data.frame", row.names = c("1", "2",
"3", "4"))
I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea about the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g., Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing week for each Car name in price1, capture the row corresponding to that Car name in price2, and insert the row in price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my search in SO is leading me to answers regarding handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, so 540 rows in total.
try this, good luck
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price = ifelse(is.na(Price), AveragePrice, Price),
          Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
So if I understand your problem correctly, you basically have 2 data frames and you want to make sure the data frame "price1" has the correct row values (names of the cars) in the 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
# create a loop with length = number of rows in your frame
for (i in 1:nrow(price1)) {
  # check if the Price value is NA
  if (is.na(price1$Price[i])) {
    # if it is NA, replace it with the corresponding value in price2
    # (matching on the car name; this is my reading of the original intent)
    price1$Price[i] <- price2$AveragePrice[match(price1$Name[i], price2$Name)]
  }
}
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <- price2 %>%
  anti_join(price1)
To expand your first table to cover all 54 weeks use complete() or you can even fudge it and right_join a table that you will purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
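A sketch of that complete() route, assuming price1 and price2 are named as in the question (AveragePrice and AverageRebate supply the fill-in values):
library(tidyverse)
price1 %>%
  complete(Name, Week = 1:54) %>%               # one row per car per week
  left_join(price2, by = "Name") %>%            # bring in the per-car averages
  mutate(Price = coalesce(Price, AveragePrice), # fill missing weeks from price2
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)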
I'm working on a data frame which looks like this:
shape id day hour week id footfall category area name
22496 22/3/14 3 12 634 Work cluster CBD area 1
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1
23287 22/3/14 3 12 723 Airport Changi Airport 2
16430 22/3/14 4 12 947 Work cluster CBD area 2
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2
and so on, until 635 rows are returned.
The other dataset that I want to compare with looks like this:
category Foreigners Locals
Work cluster 1600000 3623900
Shopping cluster 1800000 3646666.667
Airport 15095152 8902705
Residential area 527700 280000
and also this last dataset that I want to compare with, their previousHour.
The first and second datasets share the same attribute, i.e. category, and the first and third datasets share the same attribute, hour.
As for previousHour, it is based on category. E.g., for the work cluster, previousHour should look like this:
hour
0
3
4
4
4
5
until 144 rows return... for each category.
previousHour for the shopping category should look like this:
hour
0
3
3
4
4
5
until 144 rows return...
The airport and residential categories follow the same pattern; all return 144 rows for each category.
SumHour dataset:
category sumHour
1 Airport 2208
2 Residential area 1656
3 Shopping cluster 1656
4 Work cluster 1656
Here's what I ideally want to find in R:
#for n in 1: number of rows{
# calculate sumHour (in the SumHour dataset) - previousHour = newHourSum and store it as newHourSum
# calculate hour/(newHourSum - previousHour) * Foreigners and store it as footfallHour
# add to the empty dataframe }
I'm not sure how to do that, and here's what I tried:
mergetbl <- function(tbl1, tbl2)
{
newtbl = data.frame(hour=numeric(),forgHour=numeric())
ntbl1rows<-nrow(tbl1) # get the number of rows
for(n in 1:ntbl1rows)
{
#for n in 1: number of rows{
# check the previous hour from IDA dataset !!!!
# calculate sumDate - previousHour = newHourSum and store it as newHourSum
# calculate hour/(newHourSum-previousHour) * Foreigners and store it as footfallHour
# add to the empty dataframe }
newHourSum <- 3588 - tbl1
footfallHour <- (tbl1$hour/(newHourSum-previousHour)) * tbl2$Foreigners
newtbl <- rbind(newtbl, footfallHour)
}
}
But nothing happened to newtbl... (the loop builds newtbl inside the function, but its final value is never returned or assigned, so the result is lost).
Here's what ideally looks like for newtbl:
hour forgHour
0 1337.79 (the function should calculate this)
3 ...
3 ...
3 ...
4 ...
3 ...
and so on...
Thinking in terms of vectors gives this; try the following:
### this is to get your Foreigners/Locals to be at the same size as tbl1
Foreigners=ifelse(tbl1$category=="Work cluster",tbl2$Foreigners[1], ifelse (tbl1$category=="Shopping cluster", tbl2$Foreigners[2], ifelse(tbl1$category=="Airport", tbl2$Foreigners[3], tbl2$Foreigners[4])))
Locals=ifelse(tbl1$category=="Work cluster",tbl2$Locals[1], ifelse (tbl1$category=="Shopping cluster", tbl2$Locals[2], ifelse(tbl1$category=="Airport", tbl2$Locals[3], tbl2$Locals[4])))
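A shorter equivalent of those nested ifelse calls, assuming tbl2$category holds the four category labels spelled exactly as in tbl1:
idx <- match(tbl1$category, tbl2$category)  # row of tbl2 matching each tbl1 row
Foreigners <- tbl2$Foreigners[idx]
Locals <- tbl2$Locals[idx]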
And now, the function
resultHour = function(tbl1, tbl2, ForeOrLoca)
{
previousHour = rep (0, nrow(tbl1))
for (i in 2:nrow(tbl1))
{
previousHour[i] = tbl1$hour[i-1]
}
### The conditional sum matching the category from tbl1
NewHourSum = ifelse(tbl1$category=="Work cluster",sum(with(tbl1, hour*I(category == "Work cluster"))), ifelse (tbl1$category=="Shopping cluster", sum(with(tbl1, hour*I(category == "Shopping cluster"))), ifelse(tbl1$category=="Airport", sum(with(tbl1, hour*I(category == "Airport"))), sum(with(tbl1, hour*I(category == "Residential area"))))))
##and finally, this
hour = as.vector(tbl1$hour)
footfallHour <- (hour/(NewHourSum - previousHour)) * ForeOrLoca
newtbl <- cbind(hour, footfallHour)
return (newtbl)
}
This is the output I get:
> head(newtbl)
hour footfallHour
[1,] 3 1337.7926
[2,] 3 1506.2762
[3,] 3 12631.9264
[4,] 4 1785.2162
[5,] 3 441.7132
[6,] 3 1506.2762
Using the function:
TheResultIWant = resultHour(tbl1, tbl2, Foreigners)
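As an aside, the previousHour loop inside resultHour can be replaced by a one-line vectorized shift:
previousHour <- c(0, head(tbl1$hour, -1))  # each row gets the previous row's hour, 0 first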
For your new question: provided you cut your data frame into several, each containing only one category, you can use this function:
new_function_hour_result = function (tbl1_categ, vec_categ, prevHour_Categ, sumHour_Categ)
{
  hour = as.vector(tbl1_categ$hour)
  footfallHour <- (hour/(sumHour_Categ - prevHour_Categ)) * vec_categ
  newtbl <- cbind(hour, footfallHour)
  return (newtbl)
}
With tbl1_categ your data frame for a given category, vec_categ your foreigner or local data for a given category, prevHour_Categ the previousHour for a given category, and finally sumHour_Categ the sumHour for a given category.
To get your vectors to be the same size as the df they will be compared to :
for instance, for the vec_categ in the case locals/airport category:
locals_airport = rep(category[3,3], nrow(tbl1_airport))
For foreigners and the airport category: foreig_airport = rep(category[3,2], nrow(tbl1_airport))
This will repeat the value contained in category[3,2], nrow(tbl1_airport) times.
For locals and the work cluster: locals_workcluster = rep(category[1,3], nrow(tbl1_workcluster))
and so on for each vector (i.e. prevHour_Categ, sumHour_Categ, vec_categ) for each category!
GIVEN DATA
I have 6 columns of vehicle trajectory data (observations of vehicles' change in position, velocity, etc. over time), a part of which is shown below:
Vehicle ID Frame ID Global X Vehicle class Vehicle velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5
Vehicle ID is the identification of individual vehicles, e.g. vehicle 1, vehicle 2, etc. It is repeated in the column for each frame in which the vehicle was observed. Please note that each frame is 0.1 seconds long, so 10 frames make 1 second. The frame IDs are in the Frame ID column. Vehicle class is the type of vehicle (1 = motorcycle, 2 = car, 3 = truck). The Vehicle velocity column represents the instantaneous speed of the vehicle in that instant of time, i.e. in a frame. Lane represents the number or ID of the lane in which the vehicle is present in a particular frame.
WHAT I NEED TO FIND
The data I have is for a 15-minute period. The minimum frame ID is 5 and the maximum frame ID is 9952. I need to find the total number of vehicles in every 30-second time period. This means that starting from the first 30 seconds (frame ID 5 to frame ID 305), I need to know the unique vehicle IDs observed. Also, for this 30-second period, I need to find the average velocity of each vehicle class. This means that, e.g., for cars I need to find the average of all velocities of those vehicles whose vehicle class is 2.
I need to find this for all 30-second time periods, i.e. 5-305, 305-605, 605-905, ..., 9605-9905. The output should be tables for cars, trucks and motorcycles like this:
Time Slots Total Cars Average Velocity
5-305 xx xx
305-605 xx xx
. . .
. . .
9605-9905 xx xx
WHAT I HAVE TRIED SO FAR
# Finding the minimum and maximum Frame ID for creating 30-seconds time slots
minfid <- min(data$'Frame ID') # this was 5
maxfid <- max(data$'Frame ID') # this was 9952
for (i in 'Frame ID'==5:'Frame ID'==305) {
table ('Vehicle ID')
mean('Vehicle Velocity', 'Vehicle class'==2)
} #For cars in first 30 seconds
I can't generate the required output, and I don't know how I can do this for all 30-second periods. Please help.
It's a bit tough to make sure code is completely correct with your data since there is only one vehicle in the sample you show. That said, this is a typical split-apply-combine type analysis you can execute easily with the data.table package:
library(data.table)
dt <- data.table(df) # I just did a `read.table` on the text you posted
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
Here, I just converted your data into a data.table (df was a direct import of your data posted above), and then created 300-frame buckets using cut. Then, you just let data.table do the work. In the first expression we calculate total unique vehicles per frame.group:
dt[, list(tot.vehic=length(unique(Vehicle_ID))), by=frame.group]
# frame.group tot.vehic
# 1: [5,305] 1
Now we group by frame.group and Vehicle_class to get average speed and count for those combinations:
dt[, list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 2 1 24.965
Again, a bit silly when we only have one vehicle, but this should work for your data set.
EDIT: to show that it works:
library(data.table)
set.seed(101)
dt <- data.table(
Frame_ID=sample(5:9905, 50000, rep=T),
Vehicle_ID=sample(1:400, 50000, rep=T),
Vehicle_velocity=runif(50000, 25, 100)
)
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
dt[, Vehicle_class:=Vehicle_ID %% 3]
head(
dt[order(frame.group, Vehicle_class), list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
)
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 0 130 63.34589
# 2: [5,305] 1 131 61.84366
# 3: [5,305] 2 129 64.13968
# 4: (305,605] 0 132 61.85548
# 5: (305,605] 1 132 64.76820
# 6: (305,605] 2 133 61.57129
Maybe it's your data?
Here is a plyr version:
data$timeSlot <- cut(data$FrameID,
breaks = seq(5, 9905, by=300),
dig.lab=5,
include.lowest=TRUE)
# split & combine
library(plyr)
data.sum1 <- ddply(.data = data,
.variables = c("timeSlot"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
# include VehicleClass
data.sum2 <- ddply(.data = data,
.variables = c("timeSlot", "VehicleClass"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
The column names like FrameID would have to be edited to match the ones you use:
data <- read.table(sep = "", header = TRUE, text = "
VehicleID FrameID GlobalX VehicleClass velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5")
data.sum1
# timeSlot totalCars AverageVelocity
# 1 [5,305] 1 24.965
data.sum2
# timeSlot VehicleClass totalCars AverageVelocity
# 1 [5,305] 2 1 24.965
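For completeness, the same grouped summary in dplyr, using the column names from the read.table above:
library(dplyr)
data %>%
  group_by(timeSlot, VehicleClass) %>%
  summarise(totalCars = n_distinct(VehicleID),
            AverageVelocity = mean(velocity))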