Plot diagram in SAS - plot

I need to show in a diagram how the variable avgflow has evolved over time (1992-2006) for three groups of observations: i) intra-Euroland trade flows (EMU-EMU country pairs), ii) extra-Euroland trade flows (non-EMU-non-EMU country pairs), and iii) trade flows between EMU and non-EMU countries (EMU-non-EMU country pairs). The three groups must be kept constant over time, so that, e.g., the Germany-France country pair is classified as EMU-EMU for all years 1992-2006. The year 1999 should be used as index 100.
I have created two dummy variables for the three groups of observations. The dummy variable emu is 1 for EMU-EMU country pairs and 0 for non-EMU-non-EMU country pairs. The dummy variable emu1 is 1 for trade flows between EMU and non-EMU country pairs.
I know I should use PROC GPLOT, but I am not sure exactly how to use it in this case. Can someone help me?
Thanks in advance.

Related

Propensity Score Matching with panel data

I am trying to use MatchIt to perform Propensity Score Matching (PSM) for my panel data. The data is panel data that contains multi-year observations from the same group of companies.
The data describes a list of bonds and the financial data of their issuers, along with bond terms such as issue date, coupon rate, maturity, and bond type. For instance:
Firmnames       Year  ROA  Bond_type
AAPL US Equity  2015  0.3  0
AAPL US Equity  2015  0.3  1
AAPL US Equity  2016  0.3  0
AAPL US Equity  2017  0.3  0
C US Equity     2015  0.3  0
C US Equity     2016  0.3  0
C US Equity     2017  0.3  0
......
I already know how to match the observations on the criteria I want, and I use exact = "Year" to make sure I only match observations from the same year. The problem I am facing now is that observations from the same company get matched together, which is not what I want. The code I used:
matchit(Bond_type ~ Year + Amount_Issued + Cpn + Total_Assets_bf + AssetsEquityRatio_bf + Asset_Turnover_bf, data = rdata, method = "nearest", distance = "glm", exact = "Year")
However, as you can see in the second row of my sample, there can be two observations in one year from the same company due to the nature of my study (a company can issue bonds more than once a year). The only difference between them is Bond_type. Therefore, the MatchIt function will, of course, treat them as the best treatment-control pair and match these two observations together, since they have the same ROA and other matching factors in that year.
I see two ways to solve this:
1. Remove the duplicate observations from the same year and company. However, removing observations might bias the results and ruin the study.
2. Prevent the MatchIt function from matching observations from the same company (i.e., with the same Firmnames).
The second approach would be better since it does not introduce bias, but I don't know whether it can be done with the MatchIt function. I hope someone can give me some advice on this, or if there is a better solution to this problem, please be so kind as to share it with me. Thanks in advance!
Note: If there is any further information I should provide, please just let me know. This is my first time asking a question here!
This is not possible with MatchIt at the moment (though it's an interesting idea and not hard to implement, so I may add it as a feature).
In the optmatch package, which performs optimal pair and full matching, there is a constraint that can be added called "anti-exact matching", which sounds exactly like what you want. Units with the same value of the anti-exact matching variable will not be matched with each other. This can be implemented using optmatch::antiExactMatch().
In the Matching package, which performs nearest neighbor and genetic matching, the restrict argument can be supplied to the matching function to restrict certain matches. You could manually create the restriction matrix by restricting all pairs of observations in the same company and then supply the matrix to Match().
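As a rough illustration of the restriction-matrix idea, here is a minimal base R sketch. The toy data frame is made up for the example, and it relies on the documented format of the restrict argument: three columns holding the two observation numbers and a restriction value, where a negative value means the pair can never be matched.

```r
# Toy data; the rows are invented purely for illustration.
rdata <- data.frame(
  Firmnames = c("AAPL US Equity", "AAPL US Equity", "C US Equity", "C US Equity"),
  Bond_type = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# All pairs of row numbers (i < j); one pair per row after transposing.
pairs <- t(combn(nrow(rdata), 2))

# Keep only the pairs whose two rows belong to the same firm.
same_firm <- rdata$Firmnames[pairs[, 1]] == rdata$Firmnames[pairs[, 2]]

# Columns: observation i, observation j, and a negative restriction value,
# meaning "these two observations must not be matched".
restrict_mat <- cbind(pairs[same_firm, , drop = FALSE], -1)
print(restrict_mat)
```

The resulting matrix would then be passed as Match(..., restrict = restrict_mat), with the row numbers referring to the observations in the order they are given to Match(). Enumerating all pairs grows quadratically with the number of rows, so for a large panel it may be better to build the pairs firm by firm.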

What is the highest combination of certain values in a table given a certain restriction

I am currently working on the so-called "Moneyball" problem. I am basically trying to select the best combination of three baseball players (based on certain baseball-relevant statistics) for the least amount of money.
I have the following dataset (OBP, SLG, and AB are statistics that describe the performance of a player):
# the table has about 100 observations;
# the data frame is called "batting.2001"
playerID OBP SLG AB salary
giambja01 0.3569001 0.6096154 20 410333
heltoto01 0.4316547 0.4948382 57 4950000
berkmla01 0.2102326 0.6204506 277 305000
gonzalu01 0.4285714 0.3880131 409 9200000
martied01 0.4234079 0.5425532 100 5500000
My goal is to pick three players who in combination have the highest possible sum of OBP, SLG, and AB, but at the same time do not exceed a total salary of 15,000,000 dollars.
My approach so far has been rather simple: I arranged the rows in descending order of OBP, SLG, and AB and then picked the three players from the top who in combination do not exceed the salary restriction of 15 million dollars:
library(dplyr)
batting.2001 %>%
  arrange(desc(OBP), desc(SLG), desc(AB))
Can any of you think of a better solution? Also, what if I would like to get the best combination of three players for the least amount of money? What approach would you use in that scenario?
Thanks in advance, and looking forward to reading your solutions.
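One possible approach, sketched here in base R rather than taken from an answer, is to enumerate every group of three players and keep the best one that fits the budget. The sketch uses only the five sample rows shown above and takes the objective literally as the sum of OBP, SLG, and AB; the variable names other than the column names are assumptions for illustration.

```r
# The five sample rows from the question.
batting.2001 <- data.frame(
  playerID = c("giambja01", "heltoto01", "berkmla01", "gonzalu01", "martied01"),
  OBP      = c(0.3569001, 0.4316547, 0.2102326, 0.4285714, 0.4234079),
  SLG      = c(0.6096154, 0.4948382, 0.6204506, 0.3880131, 0.5425532),
  AB       = c(20, 57, 277, 409, 100),
  salary   = c(410333, 4950000, 305000, 9200000, 5500000),
  stringsAsFactors = FALSE
)
budget <- 15e6

# Each column of 'triples' holds the row numbers of one group of three players.
triples <- combn(nrow(batting.2001), 3)

# Total salary and total score (OBP + SLG + AB) of every group.
total_salary <- apply(triples, 2, function(i) sum(batting.2001$salary[i]))
total_score  <- apply(triples, 2, function(i)
  sum(batting.2001$OBP[i] + batting.2001$SLG[i] + batting.2001$AB[i]))

# Among the groups within budget, take the one with the highest score.
feasible <- which(total_salary <= budget)
best     <- feasible[which.max(total_score[feasible])]
best_players <- batting.2001$playerID[triples[, best]]
print(best_players)
```

With about 100 players there are 161,700 groups of three, so brute force is still cheap. Because AB is on a much larger scale than OBP and SLG, the raw sum is dominated by AB; rescaling the three statistics first may be worth considering.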

Filter factor variable based on counts

I have a dataframe containing house price data, with price and lots of variables. One of these variables is a "sub-area" for the property, and I am trying to incorporate this into various regressions. However, it is a factor variable with almost 3000 levels.
For example:
table(df$sub_area)
La Jolla    2
Carlsbad    5
Escondido   1
..etc
I want to filter out those places that have only 1 count, since they don't offer much predictive power but add lots of computation time. However, I want to replace the sub_area entry for that property with blank or NA, since I still want to use the rest of the information for that property, such as bedrooms, bathrooms, etc.
For reference, an individual property entry might look like:
ID Beds Baths City Sub_area sqm... etc
1 4 2 San Diego La Jolla 100....
Then I can do
lm(price ~ beds + baths + city + sub_area)
under the new, smaller sub_area variable with fewer levels.
I want to do this because most of the predictive price power is contained in sub_area for the locations I'm working on.
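One way (this keeps only sub-areas with more than 10 observations; use > 1 instead to drop only the singletons):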
areas <- names(which(table(df$Sub_area) > 10))
df$Sub_area[! df$Sub_area %in% areas] <- NA
Create a new dataframe with the number of occurrences for each subarea and keep the subareas that occur at least twice.
Then add NAs to the original dataframe if the subarea does not appear in the filtered sub_area_count.
library(dplyr)
sub_area_count <- df %>%
  count(sub_area) %>%
  filter(n > 1)
boo <- !df$sub_area %in% sub_area_count$sub_area
df$sub_area[boo] <- NA
You didn't give a reproducible example, but I think this will work for identifying the places whose count is 1:
count_1 <- as.data.frame(table(df$sub_area))
count_1 <- count_1$Var1[which(count_1$Freq==1)]
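Putting the pieces together, a self-contained sketch on made-up data might look like the following. The toy values are assumptions, and the final droplevels() call is an addition to the answers above: setting entries to NA does not by itself remove the now-empty levels from the factor.

```r
# Toy data; the prices are invented for illustration.
df <- data.frame(
  sub_area = factor(c("La Jolla", "La Jolla", "Carlsbad", "Carlsbad", "Escondido")),
  price    = c(900, 950, 700, 720, 500)
)

# Sub-areas that occur exactly once.
counts <- table(df$sub_area)
rare   <- names(counts)[counts == 1]

# Blank out the sub-area for those rows but keep the rows themselves.
df$sub_area[df$sub_area %in% rare] <- NA

# Remove the levels that no longer occur, so lm() sees fewer levels.
df$sub_area <- droplevels(df$sub_area)
print(table(df$sub_area, useNA = "ifany"))
```

Note that lm() drops rows with NA in any predictor by default, so if the remaining information for those properties should still be used, recoding the rare sub-areas to a common level such as "Other" instead of NA may be preferable.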

barplot: selecting data in R

I have a problem building a barplot.
I am working on air traffic in different countries. I would like to get a barplot for each country, with the different airport names on the X axis. The Y axis should show the number of airlines using each airport.
My plan is to write the script for one country and replicate it manually for the others.
In my data, I have the following columns:
Country / airport / destination.
So each row is actually one airline that is using the airport.
Do you have an idea about how to do this?
For now I have this idea:
UK<-traffic[traffic$Country=="UK",]
UK$airport <- as.factor(UK$airport)
countUK<-table(UK$airport)
barplot(countUK)
This is not working: the X axis shows a bunch of airports that are not in the UK...
Thanks for your help
Answer found:
You could try to drop unused factor levels, i.e.
UK <- droplevels(UK) after the line UK$airport <- as.factor(UK$airport).
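To see why this helps, here is a small sketch on made-up data, assuming airport is already stored as a factor in traffic (in which case as.factor() leaves its levels unchanged). Subsetting a data frame keeps all factor levels, so table() reports zero counts for airports outside the UK until the unused levels are dropped.

```r
# Toy data; the airport codes and rows are invented for illustration.
traffic <- data.frame(
  Country = c("UK", "UK", "UK", "FR", "FR"),
  airport = factor(c("LHR", "LHR", "MAN", "CDG", "ORY"))
)

UK <- traffic[traffic$Country == "UK", ]

# The subset still carries the levels CDG and ORY, with zero counts.
before <- table(UK$airport)
print(before)

# Drop the unused levels, then count and plot.
UK <- droplevels(UK)
countUK <- table(UK$airport)
print(countUK)
barplot(countUK)
```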

getting the max() of a data frame under certain conditions

I have a rather large dataframe with 13 variables. Here is the first line just to give an idea:
prov_code nuts1 nuts1name nuts2 nuts2name prov_geoorder prov_name NUTS_ID EDAD year ORDER graphs value prov_geo
1. 15 1 NW 11 Galicia 1 La Corunna ES111 11 1975 1 1 0.000000000 La Corunna
I would like to obtain the maximum for a certain set of variables according to a combination of the variables year, ORDER and prov_code (i.e., with f_all being my data.frame: f_all[(f_all$year==1975)&(f_all$ORDER==1)&(f_all$prov_code=="1"),] ). The goal is to repeat the operation in order to obtain a new data frame containing the maximum values for each combination of year, ORDER and prov_code.
Is there a simple and quick way to do this?
Thanks for any suggestion on the matter,
There are several ways of doing this, for example the one @James mentions. I want to suggest using plyr:
library(plyr)
ddply(f_all, .(year, ORDER, prov_code), summarise, mx_value = max(value))
Alternatively, if you have a lot of data, data.table provides similar functionality, but is much much faster in that case.
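For comparison, the same grouped maximum can also be sketched in base R with aggregate(), which needs no extra package. This is not necessarily the approach the other answer refers to, and the toy data below is made up for illustration.

```r
# Toy data with the relevant columns only; the values are invented.
f_all <- data.frame(
  year      = c(1975, 1975, 1975, 1976),
  ORDER     = c(1, 1, 2, 1),
  prov_code = c(15, 15, 15, 15),
  value     = c(0.1, 0.5, 0.3, 0.2)
)

# One row per (year, ORDER, prov_code) combination, holding the maximum value.
mx <- aggregate(value ~ year + ORDER + prov_code, data = f_all, FUN = max)
print(mx)
```

The result has one row per combination that actually occurs in the data, with the maximum stored in the value column.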
