R: Creating a table with the highest values by year

I hope I don't ask a question that has been asked already, but I couldn't quite find what I was looking for. I am fairly new to R and have no experience with programming.
I want to make a table with the top 10 values of three sections for each year. My data looks something like this:
Year Country Test1 Test2 Test3
2000 ALB 500 497 501
2001 ALB NA NA NA
...
2000 ARG 502 487 354
2001 ARG NA NA NA
...
(My years go from 2000 to 2015, I only have observations for every three years, and even in those years still a lot of NA's for some countries or tests)
I would like to get a table in which I can see the top 10 values for each test for each year. So for the years 2000, 2003, 2006, ..., 2015, the top ten values and the countries that reached those values for tests 1, 2 and 3.
AND then (I am not sure if this should be a separate question) I would like to get the table into LaTeX.

Easier to see top values this way.
You could use melt from the data.table package:
library(data.table)
# convert to a data.table by reference
setDT(df)
# convert to long format and keep Year, Country and value
df1 <- melt(df, id.vars = 1:2)
df1 <- df1[, c(1, 2, 4)]
# collect each country's values per year, sorted from highest to lowest (sort drops NA)
df1 <- df1[, top_value := .(list(sort(value, decreasing = TRUE))), .(Year, Country)][, .(Year, Country, top_value)]
print(df1)
Year Country top_value
1: 2000 ALB 501,500,497
2: 2001 ALB
3: 2000 ARG 502,487,354
4: 2001 ARG
5: 2000 ALB 501,500,497
6: 2001 ALB
7: 2000 ARG 502,487,354
8: 2001 ARG
9: 2000 ALB 501,500,497
10: 2001 ALB
11: 2000 ARG 502,487,354
12: 2001 ARG
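If the goal is the top values per test and per year (rather than per country), one possible sketch is below. It assumes the same Year/Country/Test1..Test3 layout as the question; the toy data, the n_top setting, and the Test and Score column names are made up for illustration.

```r
library(data.table)

# Hypothetical toy data in the question's layout (values are invented)
df <- data.table(
  Year    = rep(c(2000, 2003), each = 3),
  Country = rep(c("ALB", "ARG", "AUS"), times = 2),
  Test1   = c(500, 502, NA, 510, 495, 520),
  Test2   = c(497, 487, 530, NA, 490, 525),
  Test3   = c(501, 354, 528, 505, 400, NA)
)

n_top <- 2  # assumption for the toy data; use 10 for the real data

# Long format: one row per Year/Country/Test, dropping missing scores
long <- melt(df, id.vars = c("Year", "Country"),
             variable.name = "Test", value.name = "Score", na.rm = TRUE)

# Sort by score (descending) within Year and Test, then keep the first n_top rows
setorder(long, Year, Test, -Score)
top <- long[, head(.SD, n_top), by = .(Year, Test)]
print(top)
```

For the LaTeX step, packages such as xtable or knitr (via kable with format = "latex") can turn a table like top into LaTeX code.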

Related

Assign mean values and/or conditional assignment for unordered duplicate dyads

I've come across something a bit above my skill set. I'm working with IMF trade data that consists of data between country dyads. The IMF dataset consists of 'unordered duplicate' records in that each country individually reports trade data. However, due to a variety of timing, recording systems, regime type, etc., there are discrepancies between corresponding values. I'm trying to manipulate this data in two ways:
Assign the mean values to the duplicated dyads.
Assign the dyad values conditionally based on a separate economic indicator or development index (who do I trust more?).
There are several discussions of identifying unordered duplicates here, here, here, and here but after a couple days of searching I have yet to see what I'm trying to do.
Here is an example of the raw data. In reality there are many more variables and several hundred thousand dyads:
reporter<-c('USA','GER','AFG','FRA','CHN')
partner<-c('AFG','CHN','USA','CAN','GER')
year<-c(2010,2010,2010,2009,2010)
import<-c(-1000,-2000,-2400,-1200,-2000)
export<-c(2500,2200,1200,2900,2100)
rep_econ1<-c(28,32,12,25,19)
imf<-data.table(reporter,partner,year,import,export,rep_econ1)
imf
reporter partner year import export rep_econ1
1: USA AFG 2010 -1000 2500 28
2: GER CHN 2010 -2000 2200 32
3: AFG USA 2010 -2400 1200 12
4: FRA CAN 2009 -1200 2900 25
5: CHN GER 2010 -2000 2100 19
The additional wrinkle is that import and export are inverses of each other between the dyads, so they need to be matched and averaged using absolute values.
For objective 1, the resulting data.table is:
Mean
reporter partner year import export rep_econ1
USA AFG 2010 -1100 2450 28
GER CHN 2010 -2050 2100 32
AFG USA 2010 -2450 1100 12
FRA CAN 2009 -1200 2900 25
CHN GER 2010 -2100 2050 19
For objective 2:
Conditionally Assign on Higher Economic Indicator (rep_econ1)
reporter partner year import export rep_econ1
USA AFG 2010 -1000 2500 28
GER CHN 2010 -2000 2200 32
AFG USA 2010 -2500 1000 12
FRA CAN 2009 -1200 2900 25
CHN GER 2010 -2200 2000 19
It's possible not all dyads are represented twice so I included a solo record. I prefer data.table but I'll go with anything that leads me down the right path.
Thank you for your time.
Pre-processing:
library(data.table)
# get G = reporter/partner group and N = number of rows for each group
# Thanks @eddi for simplifying
imf[, G := .GRP, by = .(year, pmin(reporter, partner), pmax(reporter, partner))]
imf[, N := .N, G]
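To see why grouping on pmin/pmax treats USA-AFG and AFG-USA as the same dyad, here is a small sketch on made-up rows. The pairs table and the dyad column are illustrative names, not part of the original data.

```r
library(data.table)

# Hypothetical toy dyads; the second row is the first one reported the other way round
pairs <- data.table(
  reporter = c("USA", "AFG", "FRA"),
  partner  = c("AFG", "USA", "CAN")
)

# pmin()/pmax() work element-wise on character vectors, so each row gets the
# same alphabetically ordered label no matter which country reported
pairs[, dyad := paste(pmin(reporter, partner), pmax(reporter, partner), sep = "-")]
print(pairs)
```

Rows 1 and 2 end up with the same label, which is what lets .GRP assign them the same group number.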
Option 1 (means)
# for groups with 2 rows, average imports and exports
imf[N == 2
, `:=`(import = (import - rev(export))/2
, export = (export - rev(import))/2)
, by = G]
imf
# reporter partner year import export rep_econ1 G N
# 1: USA AFG 2010 -1100 2450 28 1 2
# 2: GER CHN 2010 -2050 2100 32 2 2
# 3: AFG USA 2010 -2450 1100 12 1 2
# 4: FRA CAN 2009 -1200 2900 25 3 1
# 5: CHN GER 2010 -2100 2050 19 2 2
Option 2 (highest economic indicator)
# for groups with 2 rows, choose imports and exports based on highest rep_econ1
imf[N == 2
, c('import', 'export') := {
o <- order(-rep_econ1)
import <- cbind(import, -export)[o[1], o]
.(import, export = -rev(import))}
, by = G]
imf
# reporter partner year import export rep_econ1 G N
# 1: USA AFG 2010 -1000 2500 28 1 2
# 2: GER CHN 2010 -2000 2200 32 2 2
# 3: AFG USA 2010 -2500 1000 12 1 2
# 4: FRA CAN 2009 -1200 2900 25 3 1
# 5: CHN GER 2010 -2200 2000 19 2 2
Option 2 explanation: You need to select the row with the highest economic indicator (i.e. row order(-rep_econ1)[1]) and use that for imports, but if the second row is the "trusted" one, it needs to be reversed. Otherwise you'd have the countries switched, since the second reporter's imports (now the first element of cbind(import, -export)[o[1],]) would be assigned as the first reporter's imports (because it's the first element).
Edit:
If imports and exports are both positive in the input data and need to be positive in the output data, the two calculations above can be modified as
imf[N == 2
, `:=`(import = (import + rev(export))/2
, export = (export + rev(import))/2)
, by = G]
And
imf[N == 2
, c('import', 'export') := {
o <- order(-rep_econ1)
import <- cbind(import, export)[o[1], o]
.(import, export = rev(import))}
, by = G]

Using a conditional in a for loop to create a unique panel id

I have a dataset which looks as follows:
# A tibble: 5,458 x 539
# Groups: country, id1 [2,729]
idstd id2 xxx id1 country year
<dbl+> <dbl> <dbl+lbl> <dbl+lbl> <chr> <dbl>
1 445801 NA NA 7 Albania 2009
2 542384 4616555 1163 7 Albania 2013
3 445802 NA NA 8 Albania 2009
4 542386 4616355 1162 8 Albania 2013
5 445803 NA NA 25 Albania 2009
6 542371 4616545 1161 25 Albania 2013
7 445804 NA NA 30 Albania 2009
8 542152 4616556 475 30 Albania 2013
9 445805 NA NA 31 Albania 2009
10 542392 4616542 1160 31 Albania 2013
The data is panel data, but there is no unique panel id. The first two observations are, for example, respondent number 7 from Albania, but number 7 is used again for other countries. id2, however, is unique. My plan is therefore to copy id2 into the NA entry of the corresponding respondent.
I wrote the following code:
for (i in 1:nrow(df)) {
if (df$id1[i]== df$id1[i+1] & df$country[i] == df$country[i+1]) {
df$id2[i] <- df$id2[i+1]
}}
Which gives the following error:
Error in if (df$id1[i] == df1$id1[i + 1] & : missing value where TRUE/FALSE needed
It does however seem to work. As my dataset is quite large and I am not very skilled, I am reluctant to accept the solution I came up with, especially when it gives an error.
Could anyone help explain the error to me?
In addition, is there a more efficient (for example data.table) and maybe error free way to deal with this?
Can you not do something along these lines:
library(tidyverse)
df %>%
group_by(country, id1) %>%
mutate(uniqueId = id2 %>% discard(is.na) %>% unique) %>%
ungroup()
Also, from looking at your loop I judge that each NA sits one row above the matching unique ID, so you could also take the value from the next row:
df %>%
mutate(id2Lead = lead(id2),
uniqueId = ifelse(is.na(id2), id2Lead, id2)) %>%
select(-id2Lead)
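On the error itself: a likely cause is that the loop runs up to nrow(df), so on the last iteration i + 1 points one past the end of the data, the comparison returns NA, and if() cannot handle NA. By then all earlier rows have already been filled, which would explain why it "seems to work". A minimal sketch on a made-up miniature of the data (the values below are illustrative, and it assumes id1 and country themselves contain no NA):

```r
# Hypothetical miniature of the panel: two respondents, id2 missing in the first wave
df <- data.frame(
  id1     = c(7, 7, 8, 8),
  country = c("Albania", "Albania", "Albania", "Albania"),
  id2     = c(NA, 4616555, NA, 4616355)
)

# Indexing one past the last row yields NA, so the comparison is NA as well
last_cmp <- df$id1[nrow(df)] == df$id1[nrow(df) + 1]
print(last_cmp)

# Stopping at nrow(df) - 1 keeps every comparison inside the data
for (i in seq_len(nrow(df) - 1)) {
  if (df$id1[i] == df$id1[i + 1] && df$country[i] == df$country[i + 1]) {
    df$id2[i] <- df$id2[i + 1]
  }
}
print(df)
```

The grouped or vectorised versions above avoid the problem entirely, since they never index past the last row.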

Construct a variable that conditionally takes a certain value until another condition is met

I have a panel dataset with data on conflicts for which I want to identify the post-conflict years.
So I constructed a variable myself, which codes a transition from conflict to peace with "3". Whenever the values for a new country begin, I coded that same variable with NA.
What I want to do now is to create a new binary variable which identifies post-conflict years with a 1 and conflict years and never conflict with 0. For that I would have to assign every year, following a 3 in the transition variable with a 1 until there is an NA in the same column. As follows:
Country Year transition post-conflict
Afghanistan 1994 0 0
Afghanistan 1995 0 0
Afghanistan 1996 3 1
Afghanistan 1997 2 1
Afghanistan 1998 2 1
Albania 1994 NA 0
Albania 1994 2 0
How could I go about this?
You probably shouldn't use NA like that. It prevents functions like which, sum, and cumsum from working as you may want them to. You likely don't need to mark the first row of a new country anyway, since most R functions you would use for your analysis can group by Country without needing a special marker showing where each group starts.
Below I change NA to something different, and make transition a factor. Then you can use cumsum to create your new column.
library(data.table)
setDT(df) # assuming your data is called df
# fix transition column
df[is.na(transition), transition := 90]
df[, transition := as.factor(transition)]
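# create post_conflict column: 1 from the first transition (3) onward within a country
# (cumsum alone would count 2, 3, ... after repeated transitions, so cap it at 1)
df[, post_conflict := as.integer(cumsum(transition == 3) > 0), by = Country]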
# Country Year transition post_conflict
# 1: Afghanistan 1994 0 0
# 2: Afghanistan 1995 0 0
# 3: Afghanistan 1996 3 1
# 4: Afghanistan 1997 2 1
# 5: Afghanistan 1998 2 1
# 6: Albania 1994 90 0
# 7: Albania 1994 2 0

How to reshape this complicated data frame?

Here are the first 4 rows of my data:
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two columns, one for each indicator code, with each row corresponding to a country, something like this:
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this with Excel, but I want to learn the R way, so that I don't need to rely on Excel every time I have a problem. Here is dput of data if you need it.
Edit: I actually want 3 columns, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
"IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
direction = "wide",
idvar = "CountryName",
timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a matrix, so if you want a data.frame, wrap the output with as.data.frame.matrix.
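As a rough illustration of that last step, here is a sketch on a made-up four-row version of mydata (only the columns the formula needs are included; the rows mirror the first four rows of the question):

```r
# Hypothetical miniature of the cleaned-up "mydata"
mydata <- data.frame(
  CountryName   = c("Turkey", "Turkey", "Afghanistan", "Afghanistan"),
  IndicatorCode = c("NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS",
                    "NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS"),
  X2010         = c(5.675740, 11.9, 9.437322, NA)
)

# Cross-tabulate the values, then convert the two-way table to a data.frame
tab  <- xtabs(X2010 ~ CountryName + IndicatorCode, mydata)
wide <- as.data.frame.matrix(tab)

# Country names end up as row names; copy them into a column if needed
wide$CountryName <- rownames(wide)
print(wide)
```

Note that plain as.data.frame on the table would give the long format back (one row per country/indicator pair), which is why as.data.frame.matrix is used here.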

Long Format Function [duplicate]

Possible Duplicate:
faster way to create variable that aggregates a column by id
I am having trouble with a project. I created a dataframe (called dat) in long format (I copied the first 3 rows below) and I want to calculate, for example, the mean of the Pretax Income of all banks in the United States for the years 2000 to 2011. How would I do that? I have hardly any experience in R. I am sorry if the answer is too obvious, but I couldn't find anything and I already spent a lot of time on the project. Thank you in advance!
KeyItem Bank Country Year Value
1 Pretax Income WELLS_FARGO_&_COMPANY UNITED STATES 2011 2.365600e+10
2 Total Assets WELLS_FARGO_&_COMPANY UNITED STATES 2011 1.313867e+12
3 Total Liabilities WELLS_FARGO_&_COMPANY UNITED STATES 2011 1.172180e+12
The following should get you started. You basically need to do two things: subset, and aggregate. I'll demonstrate a base R solution and a data.table solution.
First, some sample data.
set.seed(1) # So you can reproduce my results
dat <- data.frame(KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"),
times = 30),
Bank = rep(c("WellsFargo", "BankOfAmerica", "ICICI"),
each = 30),
Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
Year = rep(c(2000:2009), each = 3, times = 3),
Value = runif(90, min=300, max=600))
Let's aggregate the mean of the "Pretax" values by "Country" and "Year", but only for the years 2001 to 2005.
aggregate(Value ~ Country + Year,
dat[dat$KeyItem == "Pretax" & dat$Year >= 2001 & dat$Year <=2005, ],
mean)
# Country Year Value
# 1 India 2001 399.7184
# 2 UnitedStates 2001 464.1638
# 3 India 2002 443.5636
# 4 UnitedStates 2002 560.8373
# 5 India 2003 562.5964
# 6 UnitedStates 2003 370.9591
# 7 India 2004 404.0050
# 8 UnitedStates 2004 520.4933
# 9 India 2005 567.6595
# 10 UnitedStates 2005 493.0583
Here's the same thing in data.table:
library(data.table)
DT <- data.table(dat, key = "Country,Bank,Year")
subset(DT, KeyItem == "Pretax")[Year %between% c(2001, 2005),
mean(Value), by = list(Country, Year)]
# Country Year V1
# 1: India 2001 399.7184
# 2: India 2002 443.5636
# 3: India 2003 562.5964
# 4: India 2004 404.0050
# 5: India 2005 567.6595
# 6: UnitedStates 2001 464.1638
# 7: UnitedStates 2002 560.8373
# 8: UnitedStates 2003 370.9591
# 9: UnitedStates 2004 520.4933
# 10: UnitedStates 2005 493.0583
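If a single overall mean is wanted instead (as in the question: Pretax Income of all United States banks over 2000 to 2011), the same subset-then-summarise idea applies without any grouping. A minimal sketch, assuming the question's column names and labels; the bank names and values below are made up:

```r
# Hypothetical long-format data in the question's layout (values are invented)
dat <- data.frame(
  KeyItem = c("Pretax Income", "Total Assets", "Pretax Income", "Pretax Income"),
  Bank    = c("BankA", "BankA", "BankB", "BankC"),
  Country = c("UNITED STATES", "UNITED STATES", "UNITED STATES", "INDIA"),
  Year    = c(2011, 2011, 2005, 2011),
  Value   = c(100, 900, 300, 50)
)

# Keep only the rows of interest, then average the remaining values
us_pretax <- subset(dat, KeyItem == "Pretax Income" &
                         Country == "UNITED STATES" &
                         Year >= 2000 & Year <= 2011)
overall_mean <- mean(us_pretax$Value, na.rm = TRUE)
print(overall_mean)
```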
