Stata graph enquiry - plotting a year-specific variable over time

I have a very basic question about Stata. I have a repeated cross-section of individuals from year 1 to year 20. For each year I have a year-specific variable, such as GDP per capita in the country, which takes the same value for every individual observed in that year. There are therefore only 20 unique values of this variable. I want to plot it as a function of time (say, in a twoway plot). A plain twoway command does not work because I have far more than 20 points for these 20 values: each value is repeated for the n individuals in that year's cross-section. How can I create a separate variable that extracts only the distinct values from the variable in its current form?

With a simple example of your data you could have saved yourself and others time. As it stands, your question is difficult to understand: as already pointed out, it lacks both code and example data. Please rewrite it so others can easily find and use whatever is posted here.
My interpretation is that you have panel data. The variable gdp is year-specific (the information is duplicated in every panel), but you'd like to graph it against time. Just tag one instance per year and draw the graph conditional on that. An example:
clear
set more off
// not 20 years, but 3
input ///
id year gdp
1 1990 78
1 1991 90
1 1992 98
2 1990 78
2 1991 90
2 1992 98
end
egen tograph = tag(year)
twoway line gdp year if tograph
or
twoway line gdp year if id == 1

This is a textbook case of panel data.
First declare the panel structure. In your case the command is:
xtset id year
You can then plot with the xtline command:
xtline gdp, t(year) i(id)
The above command plots a separate graph for each id over year. To get one graph for all ids, for comparison, use the overlay option:
xtline gdp, overlay t(year) i(id)


Creating a subset of a dataset by taking the mean values for each date in R

Using RStudio with the tidyverse packages, plotting with ggplot2:
Say we have a dataset called SoccerTeam. This data set consists of the variables Location, Goals, YearPlayed, etc., and each data entry corresponds to a game: the game was played at Location X, the team scored Y goals, and it was played in year 19XX.
In the YearPlayed we have all the years the team has been active for, say years 1950 to 2020 and there is a whole season of data for each year.
Let's say that 2002 has 30 games, so there would be 30 data entries with YearPlayed = 2002.
Our goal is to plot how many goals the team has scored over time. If we plotted every single game from each of the 70 years of play, the graph would be very messy and hard to interpret. To tackle this, I would like to take the average goals for each year and plot that over time. How would I do this?
If you need a general introduction to data wrangling in R, I recommend R for Data Science. That said, you need to group by the column YearPlayed and then compute the mean for each year, then pipe the result into the plot commands. The %>% operator sends the left side's output into the right side, so you can chain the steps together like this:
SoccerTeam %>%
group_by(YearPlayed) %>%
summarize(Goals = mean(Goals)) %>%
ggplot(aes(x = YearPlayed, y = Goals)) +
geom_line()

Mark a portion of a bar chart ggplot

I have data for number of cars sold each year for different brands like this:
But I also have data for how many of the cars sold were cars with a diesel engine for each one of the brands and years.
I want to stack the counts in a bar chart and also add a second dimension to each bar, showing how many of the cars sold for a specific brand (e.g. BMW) have a diesel engine. I want to do it either by colour, or by lines like below:
Is it possible to do that with ggplot in R?
Edit:
My data:
The data looks like this in Excel (first table: total cars sold; second table: proportion with a diesel engine):
BMW Volvo Audi
2010 50 400 50
2011 75 450 35
2012 45 350 55
BMW Volvo Audi
2010 0.2 0.2 0.5
2011 0.293333333 0.5 0.571428571
2012 0.488888889 0.5 0.272727273
You will need to do a bit of data preparation to make it easier to plot, but once you do this type of thing a few times, it becomes quite straightforward. I highly recommend reading about Tidy Data Principles, which I'll apply here.
Data
In the future, please post your data frames via the output of dput(), but your tables are small, so importing them isn't difficult:
df1 <- data.frame(year=c(2010:2012), BMW=c(50,75,45), Volvo=c(400,450,350), Audi=c(50,35,55))
df2 <- data.frame(year=c(2010:2012), BMW=c(0.2, 0.29333333, 0.4888888), Volvo=c(0.2,0.5,0.5), Audi=c(0.5,0.571428571,0.2727272727272))
Your data should be converted into Tidy Data, whose key principle is that each row is an observation, each column is one variable, and each cell holds that observation's value for that variable. Consider your first table: only 3 pieces of information (variables) are changing: year, model, and number of cars sold. We therefore need to collapse the BMW, Volvo, and Audi columns into two: one for Model and one for number sold. You can do that with gather() from tidyr (or in a few other ways). Similarly, we need to combine the columns in the second dataset.
Then you can merge the two datasets together. Finally, I use total sold * proportion diesel to split each total into diesel and non-diesel counts. In this way, we create the final dataframe used for plotting:
df1.1 <- df1 %>% gather(key='Model', value='Total_Sold',-year)
df2.1 <- df2 %>% gather(key='Model', value='prop_diesel',-year)
df <- merge(df1.1, df2.1)
df$diesel <- df$Total_Sold * df$prop_diesel
df$non_diesel <- df$Total_Sold - df$diesel
df <- df %>% gather(key='type', value='sold', -(1:4))
Plot
To create the plot, it seems the best way to show this is a column plot, stacking "non-diesel" and "diesel" on top of one another so you can compare the total sold across makes per year while also seeing the diesel/non-diesel split. Ideally we would combine dodging (separating columns by make at the same x-axis value) with stacking (stacking diesel vs. non-diesel), but a column plot can't do both at once, so I use faceting to get the same effect: assign Model to the x axis, use stacking for the amount sold, and facet to create one panel per year. Here's the code and result:
ggplot(df, aes(x=Model, y=sold)) +
geom_col(aes(fill=type), position='stack') +
facet_wrap(~year)

How can I can aggregate by group over an aggregate in Tableau?

I'm trying to visualize the median profit as a proportion of sales for each day of the week. My data looks like this:
Date Category Profit Sales State
1/1 Book 3 6 NY
1/1 Toys 12 30 CA
1/2 Games 9 20 NY
1/2 Books 5 10 WA
I've created a calculated field "Profit_Prop" as SUM([Profit])/SUM([Sales]). I want to display the median daily value of profit_prop for Mondays, Tuesdays, etc.
I can kind of do this as a boxplot by adding WEEKDAY(Date) to Columns and Profit_Prop to Rows, then adding Date to Detail and changing granularity to Exact Date. But I just want to display the median without displaying a data point for each day.
I tried making another calculated field with MEDIAN([Profit_Prop]), but I get "argument to MEDIAN is already an aggregation and cannot be further aggregated."
Remove Date from the level of detail.
Create a calculated field like the one below and use it instead of Profit_Prop:
median(
{ INCLUDE [Date]:
[Profit_Prop]
}
)
Let me know how it goes.
When you do a calculation on top of a calculated field, the normal median function doesn't work; instead you need to use table calculations.
Taking the data from your example, create a calculated field and paste in the code below:
WINDOW_MEDIAN([Calculation1],FIRST(),LAST())
Set the computation to Table Down

Generating random latitude/longitude data in R

I simulated a dataset for an online retail market. Customers can purchase their products in different stores in Germany (e.g. Munich, Berlin, Hamburg) or in online stores. To get the latitude/longitude data for the cities I use geocode from the ggmap package. But customers who purchase online can be located anywhere in the country. Now I want to generate random latitude/longitude data within Germany for the online purchases, so I can map them later with shiny and leaflet. Is there any way to do this?
My df looks like this:
View(df)
ClientId Store ... lat lon
1 Berlin 52 13
2 Munich 48 11
3 Online x x
4 Online x x
But my aim is a data frame for example like this:
ClientId Store ... lat lon
1 Berlin 52 13
2 Munich 48 11
3 Online 50 12
4 Online 46 10
Is there any way to get these random latitude/longitude data and integrate it to my data frame?
Your problem is twofold. First of all, as a newbie to R, you are not yet used to the semantics required to do what you need. Fundamentally, what you are asking to do is:
First, identify which orders are sourced from Online.
Second, generate a random lat and lon for these orders.
First, to identify elements of your data frame which fit a criterion, you use the which function. Thus, to find the rows in your data frame which have the Store column equal to "Online", you do:
df[which(df$Store=="Online"), ]
To update the lat or lon for a particular row, we need to be able to access the column. To get values of a particular column, we use $. For example, to get the lat values for the online orders you use:
df$lat[which(df$Store=="Online")]
Great! The problem now diverges and increases in complexity. For the new values, do you want to generate simple values to accomplish your demo, or do you want to come up with new logic to generate spatial results in a given region? You indicate you would like to generate data points within Germany itself, but accomplishing that is beyond the scope of this question. For now, we will consider the easy example of generating values in a bounding box and updating your data.frame accordingly.
To generate integer values in a given range, we can use the sample function. Assuming you want lat values in the range 45 to 55 and lon values in the range 9 to 14, we can do the following:
df$lat[which(df$Store=="Online")]<-sample(45:55,length(which(df$Store=="Online")))
df$lon[which(df$Store=="Online")]<-sample(9:14,length(which(df$Store=="Online")))
Reading this code: we have updated the lat values in df for the "Online" orders with a vector of random numbers from 45:55 of the proper length (the number of "Online" orders).
If you wanted more decimal precision, you can use similar logic with the runif function which samples from the uniform distribution and round to get the appropriate amount of precision. Good luck!
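For example, a minimal sketch of that runif approach (the bounding box below is an illustrative assumption that only roughly approximates Germany, and the df here is the hypothetical one from the question):

```r
set.seed(42)  # for reproducible random draws
# rows holding online orders
online <- which(df$Store == "Online")
# uniform draws within a rough bounding box, rounded to 4 decimal places
df$lat[online] <- round(runif(length(online), min = 47.3, max = 55.0), 4)
df$lon[online] <- round(runif(length(online), min = 5.9, max = 15.0), 4)
```

Unlike sample() without replacement, runif() also never runs out of values when there are many online orders.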

Writing a function that outputs several regression results

I have a mega data frame containing monthly stock returns from January 1970 to December 2009 (rows) for 7 different countries including the US (columns). My task is to regress the stock returns of each country (dependent variable) on the US stock returns (independent variable) over 4 different time periods, namely the 70s, the 80s, the 90s, and the 00s.
The data set (.csv) can be downloaded at:
https://docs.google.com/file/d/0BxaWFk-EO7tjbG43Yl9iQVlvazQ/edit
This means that I have 24 regressions to run separately and report the results, which I have already done using the lm() function. However, I am currently attempting to use R smarter and create custom functions that will achieve my purpose and produce the 24 sets of results.
I have created sub data frames containing the observations clustered according to the time periods knowing that there are 120 months in a decade.
seventies = mydata[1:120, ] # 1970s (from Jan. 1970 to Dec. 1979)
eighties = mydata[121:240, ] # 1980s (from Jan. 1980 to Dec. 1989)
nineties = mydata[241:360, ] # 1990s (from Jan. 1990 to Dec. 1999)
twenties = mydata[361:480, ] # 2000s (from Jan. 2000 to Dec. 2009)
NB: Each of the newly created objects is a 120 x 7 data frame: 120 observations across 7 countries.
Running the 24 regressions in a language like Java would require nested for loops.
Could anyone provide the steps I must take to write a function that will arrive at the desired result? Some snippets of R code would also be appreciated. I am also thinking the mapply function could be used.
Thank you and let me know if my post needs some editing.
try this:
install.packages('plyr')
library('plyr')
myfactors<-c(rep("seventies",120),rep("eighties",120),rep("nineties",120),rep("twenties",120))
tapply(y, myfactors, function(y, X) { fit <- lm(y ~ <<regressors go here>>, data = X); return(fit) }, X = mydata)
The lm function will accept a matrix as the response variable and compute separate regressions for each of the columns, so you can just combine (cbind) the different countries together for that part.
If you are willing to assume that the different decades have the same variance then you could fit the different decades using a dummy variable for decade (look at the gl function for a quick way to calculate a decade factor) and do everything in one call to lm. A simple example:
fit <- lm( cbind( Sepal.Width, Sepal.Length, Petal.Width ) ~ 0 + Species + Petal.Length:Species,
data=iris )
This will give the same coefficient estimates as the separate regressions; only the standard deviations and degrees of freedom (and therefore the tests and anything else that depends on those) will differ from running the regressions individually.
If you need the standard deviations computed individually for each decade then you can use tapply or sapply (passing decade info into the subset argument of lm) or other apply functions.
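A minimal sketch of that per-decade subsetting approach with lapply (the country column names below are placeholders, as an assumption; substitute the ones in your file):

```r
# row ranges per decade, as defined in the question
decades <- list(seventies = 1:120,   eighties = 121:240,
                nineties  = 241:360, twenties = 361:480)
# one multi-response regression per decade: each country on the US returns
fits <- lapply(decades, function(rows)
  lm(cbind(Canada, France, Germany, Italy, Japan, UK) ~ US,
     data = mydata[rows, ]))
# per-decade summaries, with decade-specific standard errors
lapply(fits, summary)
```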
For displaying the results from several different regression models the new stargazer package may be of interest.
Try using the 'stargazer' package for publication-quality text or LaTeX regression results tables.
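For instance (the fitted model objects here are placeholders for your own lm fits):

```r
library(stargazer)
# side-by-side regression table printed as plain text;
# use type = "latex" for a LaTeX table instead
stargazer(fit_seventies, fit_eighties, type = "text")
```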
