I have a dataframe which I would like to clean by removing some offsetting lines (boxed positions) and doing some netting.
Here is the source table:
Type Name Strike Maturity Nominal
Call Amazon 10 10/12/2018 1000
Put Amazon 10 10/12/2018 1000
Call Ebay 8 2/8/2018 800
Put Ebay 8 2/8/2018 500
Call Facebook 5 5/5/2018 900
Call Google 2 23/4/2018 250
Put Google 2 23/4/2018 350
Call Microsoft 2 19/3/2018 250
Put Microsoft 2.5 19/3/2018 350
Put Ebay 8 2/8/2018 100
And here is the result the code should produce:
Type Name Strike Maturity Nominal
Call Ebay 8 2/8/2018 200
Call Facebook 5 5/5/2018 900
Put Google 2 23/4/2018 100
Call Microsoft 2 19/3/2018 250
Put Microsoft 2.5 19/3/2018 350
I'm trying to write code in R that would perform these 3 tasks:
1// Remove all the pairs that offset each other.
A pair that offsets each other is a pair that meets these 2 criteria:
2 lines that have the same Name, Strike, Maturity and Nominal.
1 line is a "Call" while the other one is a "Put".
Example: the 2 "Amazon" lines that were removed from the table.
2// Do a netting on the nominal for the lines that don't perfectly offset each other.
A pair that doesn't perfectly offset each other is a pair that meets these 2 criteria:
2 lines that have the same Name, Strike and Maturity but different Nominal.
1 line is a "Call" while the other one is a "Put".
Example: the 2 "Ebay" lines that were netted on the Call, or the 2 "Google" lines that were netted on the Put.
3// Don't do anything to all the other lines.
Example: the 2 "Microsoft" lines. They have different strikes, so no netting at all should be done.
Please see below my first attempt.
My idea was first to create a new column with a unique key, then to sort it alphabetically and then to test each line one by one.
I find it very laborious, so I was wondering if someone could help me find a more straightforward and efficient solution?
Many thanks!
library(data.table)
dt <- data.table(Type=c("Call", "Put", "Call", "Put", "Call", "Call", "Put", "Call", "Put", "Put"),
                 Name=c("Amazon", "Amazon", "Ebay", "Ebay", "Facebook", "Google", "Google", "Microsoft", "Microsoft", "Ebay"),
                 Strike=c(10, 10, 8, 8, 5, 2, 2, 2, 2.5, 8),
                 Maturity=c("10/12/2018", "10/12/2018", "2/8/2018", "2/8/2018", "5/5/2018", "23/4/2018", "23/4/2018", "19/3/2018", "19/3/2018", "2/8/2018"),
                 Nominal=c(1000, 1000, 800, 500, 900, 250, 350, 250, 350, 100))
## idea
dt$key <- paste(dt$Name, dt$Strike, dt$Maturity)
dt <- dt[order(dt$key, decreasing = FALSE),]
dt$Type2 <- ifelse(dt$Type == "Call", 1, 0)
# for each line k, test the value in the column "key" and the column "Type2":
# if key(k) = key(k+1) and Type2(k)+Type2(k+1) = 1 then
#   if Nominal(k) > Nominal(k+1), delete line k+1 and net the nominal on line k
#   else, delete line k and net the nominal on line k+1
# next k
dt <- dt[dt$Nominal != 0,]
dt$key <- NULL
Following the ideas that were recommended, I tried the dcast solution, but it looks like it does not do the proper netting, as shown below:
> dt <- data.table(Type=c("Call", "Put", "Call", "Put", "Call", "Call", "Put", "Call", "Put","Put"),
+ Name=c("Amazon", "Amazon", "Ebay", "Ebay", "Facebook", "Google", "Google", "Microsoft", "Microsoft","Ebay"),
+ Strike=c(10,10,8,8,5,2,2,2,2.5,8),
+ Maturity=c("10/12/2018", "10/12/2018", "2/8/2018", "2/8/2018", "5/5/2018", "23/4/2018", "23/4/2018", "19/3/2018", "19/3/2018","2/8/2018"),
+ Nominal=c(1000,1000,800,500,900,250,350,250,350,100))
> dcast(dt, Name + Maturity + Strike ~ Type, value.var="Nominal", fill = 0)[, Net := Call - Put][Net != 0]
Aggregate function missing, defaulting to 'length'
Name Maturity Strike Call Put Net
1: Ebay 2/8/2018 8.0 1 2 -1
2: Facebook 5/5/2018 5.0 1 0 1
3: Microsoft 19/3/2018 2.0 1 0 1
4: Microsoft 19/3/2018 2.5 0 1 -1
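For the record, the warning in that output is the clue: the two Ebay puts give dcast duplicate Name/Maturity/Strike/Type combinations, and with no aggregate function it falls back to length (a count) instead of summing the nominals. Supplying fun.aggregate = sum makes the same approach net correctly; a minimal sketch, using the corrected dt from above:
net <- dcast(dt, Name + Maturity + Strike ~ Type,
             value.var = "Nominal", fill = 0, fun.aggregate = sum)
net[, Net := Call - Put]
# keep the positions that don't fully offset, and restore the original layout
net[Net != 0, .(Type = ifelse(Net > 0, "Call", "Put"),
                Name, Strike, Maturity, Nominal = abs(Net))]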
Here is a tidyverse solution. Basically, since you want to group all rows that have the same Name, Strike and Maturity, I think it's simplest to convert Call and Put into actual signed numbers and use summarise. Your special offset case is really just removing the cases where the net total ends up being 0.
Approach is:
Convert Put into negative values of Nominal using ifelse and mutate,
Use group_by and summarise to reduce the groups into a single value per group,
Remove perfect offsets with filter,
Replace the Type column and make the negative values positive.
Code:
library(tidyverse)
tbl <- read_table2(
"Type Name Strike Maturity Nominal
Call Amazon 10 10/12/2018 1000
Put Amazon 10 10/12/2018 1000
Call Ebay 8 2/8/2018 800
Put Ebay 8 2/8/2018 500
Call Facebook 5 5/5/2018 900
Call Google 2 23/4/2018 250
Put Google 2 23/4/2018 350
Call Microsoft 2 19/3/2018 250
Put Microsoft 2.5 19/3/2018 350
Put Ebay 8 2/8/2018 100"
)
tbl %>%
mutate(actual = ifelse(Type == "Call", Nominal, -Nominal)) %>%
group_by(Name, Strike, Maturity) %>%
summarise(Net = sum(actual)) %>%
filter(Net != 0) %>%
mutate(
Type = ifelse(Net > 0, "Call", "Put"),
Net = abs(Net)
)
# A tibble: 5 x 5
# Groups: Name, Strike [5]
Name Strike Maturity Net Type
<chr> <dbl> <chr> <int> <chr>
1 Ebay 8.00 2/8/2018 200 Call
2 Facebook 5.00 5/5/2018 900 Call
3 Google 2.00 23/4/2018 100 Put
4 Microsoft 2.00 19/3/2018 250 Call
5 Microsoft 2.50 19/3/2018 350 Put
I am trying to use tapply in R so that it gives me a summary of averages for data that correspond to two categories (category_name and brand_name).
It works fine when I include one category, i.e.:
brandavg <- tapply(train$price, train$brand_name, mean)
head(brandavg, 10)
Looks something like:
"" "brandavg" "brandstd"
"% Pure" 13.85 5.53719480565651
"10.Deep" 21 NA
"21men" 10 NA
But when I try to run it with two categories, using this code:
brandcatavg <- tapply(train$price, list(train$brand_name, train$category_name), mean)
head(brandcatavg, 10)
I get a long list of category names (far more than 10!), but no averages.
I'm sure this has a really easy solution that I'm missing, but I'm hitting my head on the wall trying to figure it out.
Current code attempts have been:
brandcatavg <- tapply(train$price, list(train$brand_name, train$category_name), mean)
brandcatavg <- with(train, tapply(price, list(brand_name, category_name), mean))
Both yield something that looks like:
Vintage & Collectibles/Paper Ephemera/Postcard Vintage & Collectibles/Paper Ephemera/Stamps
Vintage & Collectibles/Serving/Bowl Vintage & Collectibles/Serving/Casserole
Vintage & Collectibles/Serving/Cream and Sugar Set Vintage & Collectibles/Serving/Dinnerware Set
Vintage & Collectibles/Serving/Flatware Vintage & Collectibles/Serving/Glassware
Vintage & Collectibles/Serving/Mug Vintage & Collectibles/Serving/Other
Vintage & Collectibles/Serving/Pitcher Vintage & Collectibles/Serving/Plate
Vintage & Collectibles/Serving/Salt and Pepper Shakers Vintage & Collectibles/Serving/Teacup
Vintage & Collectibles/Serving/Teapot Vintage & Collectibles/Serving/Tray
Vintage & Collectibles/Serving/Tumbler Vintage & Collectibles/Supplies/Bead
(Note that I cut the output for length...)
When I pull the first 10 rows, the data look like:
train_id name item_condition_id category_name
1 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts
2 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & Parts
3 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse
4 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents
5 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces
6 5 Bundled items requested for Ruie 3 Women/Other/Other
7 6 Acacia pacific tides santorini top 3 Women/Swimwear/Two-Piece
8 7 Girls cheer and tumbling bundle of 7 3 Sports & Outdoors/Apparel/Girls
9 8 Girls Nike Pro shorts 3 Sports & Outdoors/Apparel/Girls
10 9 Porcelain clown doll checker pants VTG 3 Vintage & Collectibles/Collectibles/Doll
brand_name price shipping
1 10 1
2 Razer 52 0
3 Target 10 1
4 35 1
5 44 0
6 59 0
7 Acacia Swimwear 64 0
8 Soffe 6 1
9 Nike 19 0
10 8 0
item_description
1 No description yet
2 This keyboard is in great condition and works like it came out of the box. All of the ports are tested and work perfectly. The lights are customizable via the Razer Synapse app on your PC.
3 Adorable top with a hint of lace and a key hole in the back! The pale pink is a 1X, and I also have a 3X available in white!
4 New with tags. Leather horses. Retail for [rm] each. Stand about a foot high. They are being sold as a pair. Any questions please ask. Free shipping. Just got out of storage
5 Complete with certificate of authenticity
6 Banana republic bottoms, Candies skirt with matching blazer,Amy Byers suit, Loft bottoms and cami top.
7 Size small but straps slightly shortened to fit xs, besides that, perfect condition
8 You get three pairs of Sophie cheer shorts size small and medium girls and two sports bra/boy shorts spandex matching sets in small and medium girls. All items total retail for [rm] in store and you can take him today for less than the price of one item at the store!)
9 Girls Size small Plus green. Three shorts total.
10 I realized his pants are on backwards after the picture. They were very dirty so I hand washed them. He has a stuffed body and painted porcelain head, hands and feet. Back before clowns were too scary. 9" tall. No chips or cracks but minor paint loss in a few places. Clown Circus Doll Collectible
(dput is giving me funky results.)
Eventually I got this to work with the code:
library(dplyr)
brandcatavg <- train %>%
group_by(category_name, brand_name) %>%
summarise(averageprice = mean(price, na.rm=TRUE))
You are close. Try:
with(train, tapply(price, list(brand_name, category_name), mean))
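A note on why the two-factor call seemed to return no averages: with a list of two factors, tapply returns a brand-by-category matrix, and since most brand/category pairs never occur in the data, almost every cell is NA; head() on such a wide matrix mostly prints column names. A small sketch (assuming the train data frame from the question) that flattens that matrix into the long format you probably want:
avg_mat <- with(train, tapply(price, list(brand_name, category_name), mean))
# flatten into (brand, category, average) rows and drop the
# brand/category combinations that never occur
avg_long <- na.omit(as.data.frame(as.table(avg_mat), responseName = "averageprice"))
names(avg_long)[1:2] <- c("brand_name", "category_name")
head(avg_long, 10)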
I'm working on a data frame which looks like this:
shape id day hour week id footfall category area name
22496 22/3/14 3 12 634 Work cluster CBD area 1
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1
23287 22/3/14 3 12 723 Airport Changi Airport 2
16430 22/3/14 4 12 947 Work cluster CBD area 2
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2
and so on, until 635 rows are returned.
The other dataset that I want to compare it with looks like this:
category Foreigners Locals
Work cluster 1600000 3623900
Shopping cluster 1800000 3646666.667
Airport 15095152 8902705
Residential area 527700 280000
and also this last dataset that I want to compare with their previousHour.
The first and second datasets share the same attribute, category, and the first and third datasets share the same attribute, hour.
previousHour is derived per category. E.g., for the work cluster, previousHour should look like this:
hour
0
3
4
4
4
5
until 144 rows are returned, for each category.
previousHour for the shopping category should look like this:
hour
0
3
3
4
4
5
until 144 rows are returned.
The airport and residential categories follow the same pattern, each with 144 rows.
SumHour dataset:
category sumHour
1 Airport 2208
2 Residential area 1656
3 Shopping cluster 1656
4 Work cluster 1656
Here's what I ideally want to do in R:
#for n in 1:number of rows {
#  calculate sumHour (in the SumHour dataset) - previousHour and store it as newHourSum
#  calculate hour/(newHourSum - previousHour) * Foreigners and store it as footfallHour
#  add the result to the empty dataframe }
I'm not sure how to do that, and here's what I tried:
mergetbl <- function(tbl1, tbl2)
{
newtbl = data.frame(hour=numeric(),forgHour=numeric())
ntbl1rows<-nrow(tbl1) # get the number of rows
for(n in 1:ntbl1rows)
{
#for n in 1: number of rows{
# check the previous hour from IDA dataset !!!!
# calculate sumDate - previousHour = newHourSum and store it as newHourSum
# calculate hour/(newHourSum-previousHour) * Foreigners and store it as footfallHour
# add to the empty dataframe }
newHourSum <- 3588 - tbl1
footfallHour <- (tbl1$hour/(newHourSum-previousHour)) * tbl2$Foreigners
newtbl <- rbind(newtbl, footfallHour)
}
}
But nothing happened to newtbl... The function builds newtbl locally but never returns it, and previousHour is never defined, so as written it cannot work.
Here's what ideally looks like for newtbl:
hour forgHour
0 1337.79 (the function should calculate this)
3 ...
3 ...
3 ...
4 ...
3 ...
and so on...
Thinking in terms of vectors gives this. Try the following:
### this is to get your Foreigners/Locals to be the same length as tbl1
Foreigners <- ifelse(tbl1$category == "Work cluster",     tbl2$Foreigners[1],
              ifelse(tbl1$category == "Shopping cluster", tbl2$Foreigners[2],
              ifelse(tbl1$category == "Airport",          tbl2$Foreigners[3],
                                                          tbl2$Foreigners[4])))
Locals <- ifelse(tbl1$category == "Work cluster",     tbl2$Locals[1],
          ifelse(tbl1$category == "Shopping cluster", tbl2$Locals[2],
          ifelse(tbl1$category == "Airport",          tbl2$Locals[3],
                                                      tbl2$Locals[4])))
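As an aside, since tbl2 has one row per category, match() does the same expansion more compactly (assuming the labels in tbl2$category exactly match those in tbl1$category):
Foreigners <- tbl2$Foreigners[match(tbl1$category, tbl2$category)]
Locals     <- tbl2$Locals[match(tbl1$category, tbl2$category)]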
And now, the function
resultHour <- function(tbl1, tbl2, ForeOrLoca)
{
  ## previousHour: 0 for the first row, then each preceding row's hour
  previousHour <- rep(0, nrow(tbl1))
  for (i in 2:nrow(tbl1))
  {
    previousHour[i] <- tbl1$hour[i-1]
  }
  ### The conditional sum matching the category from tbl1
  newHourSum <- ifelse(tbl1$category == "Work cluster",
                       sum(with(tbl1, hour * I(category == "Work cluster"))),
                ifelse(tbl1$category == "Shopping cluster",
                       sum(with(tbl1, hour * I(category == "Shopping cluster"))),
                ifelse(tbl1$category == "Airport",
                       sum(with(tbl1, hour * I(category == "Airport"))),
                       sum(with(tbl1, hour * I(category == "Residential area"))))))
  ## and finally, this
  hour <- as.vector(tbl1$hour)
  footfallHour <- (hour / (newHourSum - previousHour)) * ForeOrLoca
  newtbl <- cbind(hour, footfallHour)
  return(newtbl)
}
This is the output I get:
> head(newtbl)
hour footfallHour
[1,] 3 1337.7926
[2,] 3 1506.2762
[3,] 3 12631.9264
[4,] 4 1785.2162
[5,] 3 441.7132
[6,] 3 1506.2762
Using the function (passing the expanded Foreigners vector as the third argument):
TheResultIWant <- resultHour(tbl1, tbl2, Foreigners)
For your new question: provided you cut your data frame into several pieces, each containing only one category, you can use this function:
new_function_hour_result <- function(tbl1_categ, vec_categ, prevHour_Categ, sumHour_Categ)
{
  hour <- as.vector(tbl1_categ$hour)
  footfallHour <- (hour / (sumHour_Categ - prevHour_Categ)) * vec_categ
  newtbl <- cbind(hour, footfallHour)
  return(newtbl)
}
With tbl1_categ your data frame for a given category, vec_categ your foreigner or local data for a given category, prevHour_Categ the previousHour for a given category, and finally sumHour_Categ the sumHour for a given category.
To get your vectors to be the same size as the data frame they will be compared to:
For instance, for vec_categ in the locals/airport case:
locals_airport <- rep(category[3,3], nrow(tbl1_airport))
For foreigners and the airport category: foreig_airport <- rep(category[3,2], nrow(tbl1_airport))
This will repeat the value contained in category[3,2], nrow(tbl1_airport) times.
For locals and the work cluster: locals_workcluster <- rep(category[1,3], nrow(tbl1_workcluster))
And so on for each vector (i.e. prevHour_Categ, sumHour_Categ, vec_categ) for each category.
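Putting the pieces together for one category (airport), as a minimal sketch; it assumes tbl1 is the footfall data frame, category is the Foreigners/Locals table from the question, and 2208 is the airport value from the SumHour dataset:
tbl1_airport <- subset(tbl1, category == "Airport")
# previousHour: 0 for the first row, then each preceding row's hour
prevHour_airport <- c(0, head(tbl1_airport$hour, -1))
foreig_airport  <- rep(category[3,2], nrow(tbl1_airport))
sumHour_airport <- rep(2208, nrow(tbl1_airport))
res_airport <- new_function_hour_result(tbl1_airport, foreig_airport,
                                        prevHour_airport, sumHour_airport)
head(res_airport)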
I have several columns in the same data frame that list location names, but the column headings differ slightly:
Location 1 (Adelaide, Sydney, Perth)
Location 2 (Perth, Darwin, Adelaide)
Location 3 (Brisbane, Adelaide, Melbourne)
I want to get one location column that combines all three of these columns but only keeps unique location names.
For example, my final column list would be - Location (Adelaide, Sydney, Perth, Darwin, Brisbane, Melbourne)
If those are in a dataframe, then this gives you a reduced factor result:
> unique(unlist(dat))
[1] Adelaide Sydney Perth Darwin Brisbane Melbourne
Levels: Adelaide Perth Sydney Darwin Brisbane Melbourne
Which if you wanted as a dataframe is simple enough:
newdat <- data.frame(Location1 = unique(unlist(dat)) )
> newdat
Location1
1 Adelaide
2 Sydney
3 Perth
4 Darwin
5 Brisbane
6 Melbourne
And as.character could turn it into a character vector.
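For instance (note that on R >= 4.0, data.frame() columns default to character rather than factor, in which case unique(unlist(dat)) is already a character vector):
locs <- as.character(unique(unlist(dat)))
locs
# [1] "Adelaide"  "Sydney"    "Perth"     "Darwin"    "Brisbane"  "Melbourne"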
Test object:
dat <- data.frame(
Location1 = c('Adelaide', 'Sydney', 'Perth'),
Location2= c('Perth', 'Darwin', 'Adelaide'),
Location3 =c('Brisbane', 'Adelaide', 'Melbourne'))
Try this:
newcol<-with(yourdf,unique(c(Location1,Location2,Location3)))
I could get at my goals "the long way" but am hoping to stay completely within R. I am looking to append Census demographic data by zip code to records in my database. I know that R has a few Census-based packages, but, unless I am missing something, these data do not seem to exist at the zip code level, nor is it intuitive to merge onto an existing data frame.
In short, is it possible to do this within R, or is my best approach to grab the data elsewhere and read it into R?
Any help will be greatly appreciated!
In short, no. Census-to-ZIP translations are generally created from proprietary sources.
It's unlikely that you'll find anything at the zipcode level from a census perspective (privacy). However, that doesn't mean you're left in the cold. You can use the zipcodes that you have and append census data from the MSA, muSA or CSA level. Now all you need is a listing of postal codes within your MSA, muSA or CSA so that you can merge. There's a bunch online that are pretty cheap if you don't already have such a list.
For example, in Canada, we can get income data from CRA at the FSA level (the first three digits of a postal code in the form A1A 1A1). I'm not sure what or if the IRS provides similar information, I'm also not too familiar with US Census data, but I imagine they provide information at the CSA level at the very least.
If you're bewildered by all these acronyms:
MSA: http://en.wikipedia.org/wiki/Metropolitan_Statistical_Area
CSA: http://en.wikipedia.org/wiki/Combined_statistical_area
muSA: http://en.wikipedia.org/wiki/Micropolitan_Statistical_Area
As others in this thread have mentioned, the Census Bureau American FactFinder is a free source of comprehensive and detailed data. Unfortunately, it’s not particularly easy to use in its raw format.
We’ve pulled, cleaned, consolidated, and reformatted the Census Bureau data. The details of this process and how to use the data files can be found on our team blog.
None of these tables actually have a field called "ZIP code." Rather, they have a field called "ZCTA5". A ZCTA5 (or ZCTA) can be thought of as interchangeable with a ZIP code, given the following caveats:
There are no ZCTAs for PO Box ZIP codes - this means that for 42,000 US ZIP Codes there are 32,000 ZCTAs.
ZCTAs, which stand for Zip Code Tabulation Areas, are based on zip codes but don’t necessarily follow exact zip code boundaries. If you would like to read more about ZCTAs, please refer to this link. The Census Bureau also provides an animation that shows how ZCTAs are formed.
I just wrote an R package called totalcensus (https://github.com/GL-Li/totalcensus), with which you can easily extract any data in the decennial census and ACS surveys.
For this old question, if you still care: you can get total population (by default) and the population of other races from the national data of the 2010 decennial census or the 2015 ACS 5-year survey.
From the 2015 ACS 5-year survey: download the national data with download_census("acs5year", 2015, "US") and then:
zip_acs5 <- read_acs5year(
year = 2015,
states = "US",
geo_headers = "ZCTA5",
table_contents = c(
"white = B02001_002",
"black = B02001_003",
"asian = B02001_005"
),
summary_level = "860"
)
# GEOID lon lat ZCTA5 state population white black asian GEOCOMP SUMLEV NAME
# 1: 86000US01001 -72.62827 42.06233 01001 NA 17438 16014 230 639 all 860 ZCTA5 01001
# 2: 86000US01002 -72.45851 42.36398 01002 NA 29780 23333 1399 3853 all 860 ZCTA5 01002
# 3: 86000US01003 -72.52411 42.38994 01003 NA 11241 8967 699 1266 all 860 ZCTA5 01003
# 4: 86000US01005 -72.10660 42.41885 01005 NA 5201 5062 40 81 all 860 ZCTA5 01005
# 5: 86000US01007 -72.40047 42.27901 01007 NA 14838 14086 104 330 all 860 ZCTA5 01007
# ---
# 32985: 86000US99923 -130.04103 56.00232 99923 NA 13 13 0 0 all 860 ZCTA5 99923
# 32986: 86000US99925 -132.94593 55.55020 99925 NA 826 368 7 0 all 860 ZCTA5 99925
# 32987: 86000US99926 -131.47074 55.13807 99926 NA 1711 141 0 2 all 860 ZCTA5 99926
# 32988: 86000US99927 -133.45792 56.23906 99927 NA 123 114 0 0 all 860 ZCTA5 99927
# 32989: 86000US99929 -131.60683 56.41383 99929 NA 2365 1643 5 60 all 860 ZCTA5 99929
From the 2010 decennial census: download the national data with download_census("decennial", 2010, "US") and then:
zip_2010 <- read_decennial(
year = 2010,
states = "US",
table_contents = c(
"white = P0030002",
"black = P0030003",
"asian = P0030005"
),
geo_headers = "ZCTA5",
summary_level = "860"
)
# lon lat ZCTA5 state population white black asian GEOCOMP SUMLEV
# 1: -66.74996 18.18056 00601 NA 18570 17285 572 5 all 860
# 2: -67.17613 18.36227 00602 NA 41520 35980 2210 22 all 860
# 3: -67.11989 18.45518 00603 NA 54689 45348 4141 85 all 860
# 4: -66.93291 18.15835 00606 NA 6615 5883 314 3 all 860
# 5: -67.12587 18.29096 00610 NA 29016 23796 2083 37 all 860
# ---
# 33116: -130.04103 56.00232 99923 NA 87 79 0 0 all 860
# 33117: -132.94593 55.55020 99925 NA 819 350 2 4 all 860
# 33118: -131.47074 55.13807 99926 NA 1460 145 6 2 all 860
# 33119: -133.45792 56.23906 99927 NA 94 74 0 0 all 860
# 33120: -131.60683 56.41383 99929 NA 2338 1691 3 33 all 860
Your best bet is probably with the U.S. Census Bureau TIGER/Line shapefiles. They have ZIP code tabulation area shapefiles (ZCTA5) for 2010 at the state level which may be sufficient for your purposes.
Census data itself can be found at American FactFinder. For example, you can get population estimates at the sub-county level (i.e. city/town), but not straight-forward population estimates at the zip-code level. I don't know the details of your data set, but one solution might require the use of relationship tables that are also available as part of the TIGER/Line data, or alternatively spatially joining the place names containing the census data (subcounty shapefiles) with the ZCTA5 codes.
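As a rough sketch of that spatial-join idea with current tooling, assuming the tigris and sf packages (the state, year, and join choices below are illustrative, not part of the original suggestion):
library(tigris)
library(sf)
options(tigris_use_cache = TRUE)
zcta  <- st_as_sf(zctas(cb = TRUE, year = 2010, state = "RI"))     # ZCTA5 polygons
towns <- st_as_sf(county_subdivisions(state = "RI", cb = TRUE))    # sub-county areas
# attach each ZCTA to the town that contains its centroid
joined <- st_join(st_centroid(zcta), towns["NAME"])
head(joined)   # each ZCTA row now carries the containing town's NAME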
Note from the metadata: "These products are free to use in a product or publication, however acknowledgement must be given to the U.S. Census Bureau as the source."
HTH
A simple for loop to get ZIP-level population. You need to get an app token (API key) though. It works for the US for now.
library(data.table)   # for data.table(), as.data.table(), setnames() and :=

# ziplist is assumed to be a character vector of 5-digit ZIP codes
masterdata <- data.table()
for (z in 1:length(ziplist)) {
  print(z)
  # build the Open Data Network API request for this ZIP code
  textt <- paste0("http://api.opendatanetwork.com/data/v1/values?variable=demographics.population.count&entity_id=8600000US",
                  ziplist[z],
                  "&forecast=3&describe=false&format=&app_token=YOURKEYHERE")
  # skip ZIP codes the API does not know about
  errorornot <- try(jsonlite::fromJSON(textt), silent = TRUE)
  if (is(errorornot, "try-error")) next
  data <- jsonlite::fromJSON(textt)
  data <- as.data.table(data$data)
  # the first row holds the ZIP code label; the rest are year/population rows
  zipcode <- data[1, 2]
  data <- data[2:nrow(data)]
  setnames(data, c("Year", "Population", "Forecasted"))
  data[, ZipCodeQuery := zipcode]
  data[, ZipCodeData := ziplist[z]]
  masterdata <- rbind(masterdata, data)
}
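A minimal usage sketch, with illustrative ZIP codes (replace YOURKEYHERE in the URL with a real app token first):
ziplist <- c("10001", "60601", "94105")
# run the loop above, then:
head(masterdata)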