From monadic to dyadic data in R

For the sake of simplicity, let's say I have a dataset at the country-year level that lists the organizations that received aid from a government, how much money each received, and the type of project. The data frame has "space" for 10 organizations each year, but not every government subsidizes that many organizations each year, so there are a lot of blank spaces. Moreover, the organizations do not follow any order: one organization can be in the first spot one year and be coded in the second spot the next. The data looks like this:
> State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2 Org3 Aid3 Proj3 Org4 Aid4 Proj4 ...
Italy 2000 A 1000 Arts B 500 Arts C 300 Social
Italy 2001 B 700 Social A 1000 Envir
Italy 2002 A 1000 Arts C 300 Envir
UK 2000
UK 2001 Z 2000 Social
UK 2002 Z 2000 Social
...
I'm trying to transform this into dyadic data, which would look like this:
> State Org Year Aid Proj
Italy A 2000 1000 Arts
Italy A 2001 1000 Envir
Italy A 2002 1000 Arts
Italy B 2000 500 Arts
Italy B 2001 700 Social
Italy C 2000 300 Social
Italy C 2002 300 Envir
UK Z 2001 2000 Social
...
I'm using R, and the best way I could find was building a pre-defined set of possible dyads (using something like expand.grid(unique(State), unique(Org))) and then looping through the data, finding the corresponding column and filling the data frame. But I don't think this is the most efficient method, so I was wondering whether there is a better way. I thought about dplyr or reshape but can't find a solution.
I know this is a recurring question, but couldn't really find an answer. The most similar question is this one, but it's not exactly the same.
Thanks a lot in advance.

Since you did not use dput, I will try and make some data that resemble yours:
dat = data.frame(State = rep(c("Italy", "UK"), 3),
Year = rep(c(2014, 2015, 2016), 2),
Org1 = letters[1:6],
Aid1 = sample(800:1000, 6),
Proj1 = rep(c("A", "B"), 3),
Org2 = letters[7:12],
Aid2 = sample(600:700, 6),
Proj2 = rep(c("C", "D"), 3),
stringsAsFactors = FALSE)
dat
# State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2
# 1 Italy 2014 a 910 A g 658 C
# 2 UK 2015 b 926 B h 681 D
# 3 Italy 2016 c 834 A i 625 C
# 4 UK 2014 d 858 B j 620 D
# 5 Italy 2015 e 831 A k 650 C
# 6 UK 2016 f 821 B l 687 D
Next I gather the data into long form, use extract to split each column name into its stem (key) and its slot number (num), and then spread it all out again:
library(tidyr)
library(dplyr)
dat %>%
gather(key, value, -c(State, Year)) %>%
extract(key, into = c("key", "num"), "([A-Za-z]+)([0-9]+)") %>%
spread(key, value) %>%
select(-num)
# State Year Aid Org Proj
# 1 Italy 2014 910 a A
# 2 Italy 2014 658 g C
# 3 Italy 2015 831 e A
# 4 Italy 2015 650 k C
# 5 Italy 2016 834 c A
# 6 Italy 2016 625 i C
# 7 UK 2014 858 d B
# 8 UK 2014 620 j D
# 9 UK 2015 926 b B
# 10 UK 2015 681 h D
# 11 UK 2016 821 f B
# 12 UK 2016 687 l D
Is this the desired output?
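With tidyr 1.0.0 or later, the same reshape can be sketched in one step with pivot_longer and the special ".value" name. The data below are re-typed from the question, assuming the blank cells are read in as NA; the column name slot is only an illustrative label.

```r
library(tidyr)
library(dplyr)

# Data re-typed from the question; blank cells are assumed to be NA
dat <- data.frame(
  State = c("Italy", "Italy", "Italy", "UK", "UK", "UK"),
  Year  = c(2000, 2001, 2002, 2000, 2001, 2002),
  Org1  = c("A", "B", "A", NA, "Z", "Z"),
  Aid1  = c(1000, 700, 1000, NA, 2000, 2000),
  Proj1 = c("Arts", "Social", "Arts", NA, "Social", "Social"),
  Org2  = c("B", "A", "C", NA, NA, NA),
  Aid2  = c(500, 1000, 300, NA, NA, NA),
  Proj2 = c("Arts", "Envir", "Envir", NA, NA, NA),
  Org3  = c("C", NA, NA, NA, NA, NA),
  Aid3  = c(300, NA, NA, NA, NA, NA),
  Proj3 = c("Social", NA, NA, NA, NA, NA)
)

dyads <- dat %>%
  pivot_longer(
    cols = -c(State, Year),
    # ".value" keeps the stem (Org, Aid, Proj) as a column name;
    # "slot" (hypothetical name) receives the trailing number
    names_to = c(".value", "slot"),
    names_pattern = "([A-Za-z]+)([0-9]+)",
    values_drop_na = TRUE   # drop the empty slots
  ) %>%
  select(State, Org, Year, Aid, Proj) %>%
  arrange(State, Org, Year)

dyads
```

Because each stem becomes its own column, Aid stays numeric instead of being coerced to character along with Org and Proj.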

Related

How can I change row and column indexes of a dataframe in R?

I have a dataframe in R which has three columns Product_Name(name of books), Year and Units (number of units sold in that year) which looks like this:
Product_Name Year Units
A Modest Proposal 2011 10000
A Modest Proposal 2012 11000
A Modest Proposal 2013 12000
A Modest Proposal 2014 13000
Animal Farm 2011 8000
Animal Farm 2012 9000
Animal Farm 2013 11000
Animal Farm 2014 15000
Catch 22 2011 1000
Catch 22 2012 2000
Catch 22 2013 3000
Catch 22 2014 4000
....
I intend to make a R Shiny dashboard with that where I want to keep the year as a drop-down menu option, for which I wanted to have the dataframe in the following format
A Modest Proposal Animal Farm Catch 22
2011 10000 8000 1000
2012 11000 9000 2000
2013 12000 11000 3000
2014 13000 15000 4000
or the other way round where the Product Names are row indexes and Years are column indexes, either way goes.
How can I do this in R?
Your general issue is transforming long data to wide data. For this, you can use data.table's dcast function (amongst many others):
library(data.table)
dt = data.table(
Name = c(rep('A', 4), rep('B', 4), rep('C', 4)),
Year = c(rep(2011:2014, 3)),
Units = rnorm(12)
)
> dt
Name Year Units
1: A 2011 -0.26861318
2: A 2012 0.27194732
3: A 2013 -0.39331361
4: A 2014 0.58200101
5: B 2011 0.09885381
6: B 2012 -0.13786098
7: B 2013 0.03778400
8: B 2014 0.02576433
9: C 2011 -0.86682584
10: C 2012 -1.34319590
11: C 2013 0.10012673
12: C 2014 -0.42956207
> dcast(dt, Year ~ Name, value.var = 'Units')
Year A B C
1: 2011 -0.2686132 0.09885381 -0.8668258
2: 2012 0.2719473 -0.13786098 -1.3431959
3: 2013 -0.3933136 0.03778400 0.1001267
4: 2014 0.5820010 0.02576433 -0.4295621
For the next time, it is easier if you provide a reproducible example, so that the people assisting you do not have to manually recreate your data structure :)
You need to use pivot_wider from the tidyr package. I assume your data is saved in df; I also load the dplyr package for %>% (piping).
library(tidyr)
library(dplyr)
df %>%
pivot_wider(names_from = Product_Name, values_from = Units)
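As a self-contained sketch of that call, here is the question's table re-typed into a data frame named df (the name the answer assumes) and then widened:

```r
library(tidyr)
library(dplyr)

# Data re-typed from the question
df <- data.frame(
  Product_Name = rep(c("A Modest Proposal", "Animal Farm", "Catch 22"), each = 4),
  Year  = rep(2011:2014, times = 3),
  Units = c(10000, 11000, 12000, 13000,
            8000, 9000, 11000, 15000,
            1000, 2000, 3000, 4000)
)

# One row per Year, one column per product
wide <- df %>%
  pivot_wider(names_from = Product_Name, values_from = Units)

wide
```

For the other orientation (products as rows, years as columns), use names_from = Year instead.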
Assuming that your dataframe is ordered by Product_Name and by year, I will generate artificial data similar to your dataframe. Try this:
Col_1 <- sort(rep(LETTERS[1:3], 4))
Col_2 <- rep(2011:2014, 3)
# artificial data
resp <- ceiling(rnorm(12, 5000, 500))
uu <- data.frame(Col_1, Col_2, resp)
uu
# output is
Col_1 Col_2 resp
1 A 2011 5297
2 A 2012 4963
3 A 2013 4369
4 A 2014 4278
5 B 2011 4721
6 B 2012 5021
7 B 2013 4118
8 B 2014 5262
9 C 2011 4601
10 C 2012 5013
11 C 2013 5707
12 C 2014 5637
>
> # Here starts
> output <- aggregate(uu$resp, list(uu$Col_1), function(x) {x})
> output
Group.1 x.1 x.2 x.3 x.4
1 A 5297 4963 4369 4278
2 B 4721 5021 4118 5262
3 C 4601 5013 5707 5637
>
output2 <- output [, -1]
colnames(output2) <- levels(as.factor(uu$Col_2))
rownames(output2) <- levels(as.factor(uu$Col_1))
# transpose the matrix
> t(output2)
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637
> # or convert to data.frame
> as.data.frame(t(output2))
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637

Assign mean values and/or conditional assignment for unordered duplicate dyads

I've come across something a bit above my skill set. I'm working with IMF trade data that consists of data between country dyads. The IMF dataset consists of 'unordered duplicate' records in that each country individually reports trade data. However, due to differences in timing, recording systems, regime type, etc., there are discrepancies between corresponding values. I'm trying to manipulate this data in two ways:
Assign the mean values to the duplicated dyads.
Assign the dyad values conditionally based on a separate economic indicator or development index (who do I trust more?).
There are several discussions of identifying unordered duplicates here, here, here, and here but after a couple days of searching I have yet to see what I'm trying to do.
Here is an example of the raw data. In reality there are many more variables and several hundred thousand dyads:
reporter<-c('USA','GER','AFG','FRA','CHN')
partner<-c('AFG','CHN','USA','CAN','GER')
year<-c(2010,2010,2010,2009,2010)
import<-c(-1000,-2000,-2400,-1200,-2000)
export<-c(2500,2200,1200,2900,2100)
rep_econ1<-c(28,32,12,25,19)
imf<-data.table(reporter,partner,year,import,export,rep_econ1)
imf
reporter partner year import export rep_econ1
1: USA AFG 2010 -1000 2500 28
2: GER CHN 2010 -2000 2200 32
3: AFG USA 2010 -2400 1200 12
4: FRA CAN 2009 -1200 2900 25
5: CHN GER 2010 -2000 2100 19
The additional wrinkle is that import and export are inverses of each other between the dyads, so one country's imports need to be matched with the other's exports and averaged using absolute values.
For objective 1, the resulting data.table is:
Mean
reporter partner year import export rep_econ1
USA AFG 2010 -1100 2450 28
GER CHN 2010 -2050 2100 32
AFG USA 2010 -2450 1100 12
FRA CAN 2009 -1200 2900 25
CHN GER 2010 -2100 2050 19
For objective 2:
Conditionally Assign on Higher Economic Indicator (rep_econ1)
reporter partner year import export rep_econ1
USA AFG 2010 -1000 2500 28
GER CHN 2010 -2000 2200 32
AFG USA 2010 -2500 1000 12
FRA CAN 2009 -1200 2900 25
CHN GER 2010 -2200 2000 19
It's possible not all dyads are represented twice so I included a solo record. I prefer data.table but I'll go with anything that leads me down the right path.
Thank you for your time.
Pre-processing:
library(data.table)
# get G = reporter/partner group and N = number of rows for each group
# Thanks @eddi for simplifying
imf[, G := .GRP, by = .(year, pmin(reporter, partner), pmax(reporter, partner))]
imf[, N := .N, G]
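To see why pmin/pmax identifies unordered pairs, here is a minimal sketch on the question's reporter/partner columns. The column name dyad is hypothetical and only added for illustration; the grouping above does not need it.

```r
library(data.table)

# Reporter/partner/year columns re-typed from the question
imf_demo <- data.table(
  reporter = c("USA", "GER", "AFG", "FRA", "CHN"),
  partner  = c("AFG", "CHN", "USA", "CAN", "GER"),
  year     = c(2010, 2010, 2010, 2009, 2010)
)

# pmin/pmax put the two country codes in alphabetical order row by row,
# so USA-AFG and AFG-USA map to the same key
imf_demo[, dyad := paste(pmin(reporter, partner), pmax(reporter, partner), sep = "-")]

imf_demo$dyad
```

Rows 1 and 3 share one key and rows 2 and 5 share another, which is what lets by = G treat each mirrored pair as a single group.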
Option 1 (means)
# for groups with 2 rows, average imports and exports
imf[N == 2
, `:=`(import = (import - rev(export))/2
, export = (export - rev(import))/2)
, by = G]
imf
# reporter partner year import export rep_econ1 G N
# 1: USA AFG 2010 -1100 2450 28 1 2
# 2: GER CHN 2010 -2050 2100 32 2 2
# 3: AFG USA 2010 -2450 1100 12 1 2
# 4: FRA CAN 2009 -1200 2900 25 3 1
# 5: CHN GER 2010 -2100 2050 19 2 2
Option 2 (highest economic indicator)
# for groups with 2 rows, choose imports and exports based on highest rep_econ1
imf[N == 2
, c('import', 'export') := {
o <- order(-rep_econ1)
import <- cbind(import, -export)[o[1], o]
.(import, export = -rev(import))}
, by = G]
imf
# reporter partner year import export rep_econ1 G N
# 1: USA AFG 2010 -1000 2500 28 1 2
# 2: GER CHN 2010 -2000 2200 32 2 2
# 3: AFG USA 2010 -2500 1000 12 1 2
# 4: FRA CAN 2009 -1200 2900 25 3 1
# 5: CHN GER 2010 -2200 2000 19 2 2
Option 2 explanation: You need to select the row with the highest economic indicator (i.e. row order(-rep_econ1)[1]) and use that for imports, but if the second row is the "trusted" one, it needs to be reversed. Otherwise you'd have the countries switched, since the second reporter's imports (now the first element of cbind(import, -export)[o[1],]) would be assigned as the first reporter's imports (because it's the first element).
Edit:
If imports and exports are both positive in the input data and need to be positive in the output data, the two calculations above can be modified as
imf[N == 2
, `:=`(import = (import + rev(export))/2
, export = (export + rev(import))/2)
, by = G]
And
imf[N == 2
, c('import', 'export') := {
o <- order(-rep_econ1)
import <- cbind(import, export)[o[1], o]
.(import, export = rev(import))}
, by = G]

R: Find top, mid and bottom values to create a category column in dplyr

I would like to create a 'Category' column in the below dataset based on the sales and year.
set.seed(30)
df <- data.frame(
Year = rep(2010:2015, each = 6),
Country = rep(c('India', 'China', 'Japan', 'USA', 'Germany', 'Russia'), 6),
Sales = round(runif(18, 100, 900))
)
head(df)
Year Country Sales
1 2010 India 661
2 2010 China 888
3 2010 Japan 285
4 2010 USA 272
5 2010 Germany 332
6 2010 Russia 660
Categories are:
Top 2 countries with highest sales in each year: Category - 1
Bottom 2 countries with lowest sales in each year: Category - 3
Remaining countries by year: Category - 2
Expected dataset might look like:
Year Country Sales Category
1 2010 India 661 1
2 2010 China 888 1
3 2010 Japan 285 3
4 2010 USA 272 3
5 2010 Germany 332 2
6 2010 Russia 660 2
You don't need much here; just group_by Year, arrange from greatest to least sales, and then add a new column with mutate that is 1 for the first two rows of each year, 3 for the last two, and 2 for everything in between:
library(dplyr)
df %>% group_by(Year) %>%
arrange(desc(Sales)) %>%
mutate(Category = c(1, 1, rep(2, n()-4), 3, 3))
# Source: local data frame [36 x 4]
# Groups: Year [6]
#
# Year Country Sales Category
# (int) (fctr) (dbl) (dbl)
# 1 2010 China 491 1
# 2 2010 USA 436 1
# 3 2010 Japan 391 2
# 4 2010 Germany 341 2
# 5 2010 Russia 218 3
# 6 2010 India 179 3
# 7 2011 Japan 873 1
# 8 2011 India 819 1
# 9 2011 Russia 418 2
# 10 2011 China 279 2
# .. ... ... ... ...
It will fail with fewer than four countries, but that doesn't sound like an issue from the question.
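A rank-based variant avoids that limitation and does not rely on the row order. This is only a sketch on simulated data shaped like the question's; row_number is used so that ties are broken by position and exactly two rows land in each of the top and bottom categories.

```r
library(dplyr)

# Simulated data in the shape of the question's (values are arbitrary)
set.seed(30)
sales_df <- data.frame(
  Year    = rep(2010:2015, each = 6),
  Country = rep(c("India", "China", "Japan", "USA", "Germany", "Russia"), 6),
  Sales   = round(runif(36, 100, 900))
)

out <- sales_df %>%
  group_by(Year) %>%
  mutate(Category = case_when(
    row_number(desc(Sales)) <= 2 ~ 1,  # two highest sales in the year
    row_number(Sales) <= 2       ~ 3,  # two lowest sales in the year
    TRUE                         ~ 2   # everything in between
  )) %>%
  ungroup()

head(out)
```

With fewer than four countries in a year the first condition wins, so the top two are still labelled 1 and the remainder 3.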
We can use cut to create a 'Category' column after grouping by "Year".
library(dplyr)
df %>%
group_by(Year) %>%
mutate(Category = as.numeric(cut(-Sales, breaks=c(-Inf,
quantile(-Sales, prob = c(0, .5, 1))))))
Or using data.table
library(data.table)
setDT(df)[order(-Sales), Category := if(.N >= 4) rep(1:3,
c(2, .N - 4, 2)) else rep(seq(.N), each = ceiling(.N/3)) ,by = Year]
This should also work when a "Year" has fewer than 4 elements, e.g. if we remove the first five observations in 2010.
df1 <- df[-(1:5),]
setDT(df1)[order(-Sales), Category := if(.N >= 4) rep(1:3,
c(2, .N - 4, 2)) else rep(seq(.N), each = ceiling(.N/3)) ,by = Year]
head(df1)
# Year Country Sales Category
#1: 2010 Russia 218 1
#2: 2011 India 819 1
#3: 2011 China 279 2
#4: 2011 Japan 873 1
#5: 2011 USA 213 3
#6: 2011 Germany 152 3

How to do Group By Rollup in R? (Like SQL)

I have a dataset and I want to perform something like Group By Rollup like we have in SQL for aggregate values.
Below is a reproducible example. I know aggregate works really well as explained here but not a satisfactory fit for my case.
year<- c('2016','2016','2016','2016','2017','2017','2017','2017')
month<- c('1','1','1','1','2','2','2','2')
region<- c('east','west','east','west','east','west','east','west')
sales<- c(100,200,300,400,200,400,600,800)
df<- data.frame(year,month,region,sales)
df
year month region sales
1 2016 1 east 100
2 2016 1 west 200
3 2016 1 east 300
4 2016 1 west 400
5 2017 2 east 200
6 2017 2 west 400
7 2017 2 east 600
8 2017 2 west 800
Now what I want to do is aggregate (sum by year-month-region) and add the new aggregate rows to the existing dataframe,
e.g. there should be two additional rows like below, with a new region name 'USA' for the aggregated rows
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 USA 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 USA 2000
I have figured out a way (below) but I am very sure that there exists an optimum solution for this OR a better workaround than mine
df1<- setNames(aggregate(df$sales, by=list(df$year,df$month, df$region), FUN=sum),
c('year','month','region', 'sales'))
df2<- setNames(aggregate(df$sales, by=list(df$year,df$month), FUN=sum),
c('year','month', 'sales'))
df2$region<- 'USA' ## added a new column- region- for total USA
df2<- df2[, c('year','month','region', 'sales')] ## reordering the columns of df2
df3<- rbind(df1,df2)
df3<- df3[order(df3$year,df3$month,df3$region),] ## order by
rownames(df3)<- NULL ## renumbered the rows after order by
df3
Thanks for the support!
melt/dcast in the reshape2 package can do subtotalling. After running dcast we replace "(all)" in the month column with the month using na.locf from the zoo package:
library(reshape2)
library(zoo)
m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")
dout$month <- na.locf(replace(dout$month, dout$month == "(all)", NA))
giving:
> dout
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 (all) 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 (all) 2000
In the recent devel version of data.table (1.10.5) you can use a new feature called "grouping sets" to produce subtotals:
library(data.table)
setDT(df)
res = groupingsets(df, .(sales=sum(sales)), sets=list(c("year","month"), c("year","month","region")), by=c("year","month","region"))
setorder(res, na.last=TRUE)
res
# year month region sales
#1: 2016 1 east 400
#2: 2016 1 west 600
#3: 2016 1 NA 1000
#4: 2017 2 east 800
#5: 2017 2 west 1200
#6: 2017 2 NA 2000
You can substitute NA to USA using res[is.na(region), region := "USA"].
plyr::ddply(df, c("year", "month", "region"), plyr::summarise, sales = sum(sales))
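The per-region sums and the year-month totals can also be stacked with dplyr. This is a sketch of the same idea as the asker's aggregate/rbind approach, assuming dplyr 1.0.0 or later for the .groups argument; the names sales_df, by_region, totals and res are illustrative.

```r
library(dplyr)

# Data re-typed from the question
sales_df <- data.frame(
  year   = c("2016", "2016", "2016", "2016", "2017", "2017", "2017", "2017"),
  month  = c("1", "1", "1", "1", "2", "2", "2", "2"),
  region = c("east", "west", "east", "west", "east", "west", "east", "west"),
  sales  = c(100, 200, 300, 400, 200, 400, 600, 800)
)

# Lowest level: one row per year-month-region
by_region <- sales_df %>%
  group_by(year, month, region) %>%
  summarise(sales = sum(sales), .groups = "drop")

# Rolled-up level: one row per year-month, labelled "USA"
totals <- sales_df %>%
  group_by(year, month) %>%
  summarise(sales = sum(sales), .groups = "drop") %>%
  mutate(region = "USA")

# Stack both levels; the logical key sorts the total after its regions
res <- bind_rows(by_region, totals) %>%
  arrange(year, month, region == "USA", region)

res
```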

create random subsets in R without duplicates

My task is to divide a dataset of 32 rows into 8 groups without having duplicated entries.
I am trying to do this with a loop, creating a new dataset after each cycle.
the data:
year pos country elo fifa cont hcountry hcont
1 2010 FRA 1851 1044 Europe RSA Africa
2 2010 MEX 1872 895 South America RSA Africa
3 2010 URU 1819 899 South America RSA Africa
4 2010 RSA 1569 392 Africa RSA Africa
5 2010 GRE 1726 964 Europe RSA Africa
6 2010 KOR 1766 632 Asia RSA Africa
8 2010 ARG 1899 1076 South America RSA Africa
9 2010 USA 1749 957 North America RSA Africa
10 2010 SVN 1648 860 Europe RSA Africa
11 2010 ALG 1531 821 Africa RSA Africa
...
my solution so far:
for (i in 1:8){
assign(paste("group", i, sep = ""), droplevels(subset(wc2010[sample(nrow(wc2010), 4),])))
wc2010 <- subset(wc2010, !(country %in% group[i]$country))
}
problem is of course: i don't know how to use the loop-variable.... :-(
help would be deeply appreciated!
thanks
Bob
Here is one way to create a random partition:
random.groups <- function(n.items = 32L, n.groups = 8L)
1L + (sample.int(n.items) %% n.groups)
So then you just have to do:
wc2010$group <- random.groups(nrow(wc2010), n.groups = 8L)
Then you might also be interested in doing
groups <- split(wc2010, wc2010$group)
Edit: this was not asked by the OP, but I realize that soccer draws for big tournaments usually involve hats: before the draw, teams are grouped by regions and/or rankings. Then groups are formed by randomly picking one team from each hat, so that two teams from the same hat cannot end up in the same group.
Here is a modification to my function so it can also take hats as an input:
random.groups <- function(n.items = 32L, n.groups = 8L,
hats = rep(1L, n.items)) {
splitted.items <- split(seq.int(n.items), hats)
shuffled <- lapply(splitted.items, sample)
1L + (order(unlist(shuffled)) %% n.groups)
}
Here is an example, where say, the first 8 teams are in hat #1, the next 8 teams are in hat #2, etc.:
# set.seed(123)
random.groups(32, 8, c(rep(1, 8), rep(2, 8), rep(3, 8), rep(4, 8)))
# [1] 7 8 2 6 5 3 1 4 8 7 5 3 2 4 1 6 3 2 7 6 5 8 1 4 7 6 5 4 3 2 1 8
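One way to check the hat constraint is to cross-tabulate hats against the resulting groups. This sketch repeats the function from above so that it runs on its own; the seed value and the names hats and g are arbitrary.

```r
# Same function as above, repeated so the check is self-contained
random.groups <- function(n.items = 32L, n.groups = 8L,
                          hats = rep(1L, n.items)) {
  splitted.items <- split(seq.int(n.items), hats)
  shuffled <- lapply(splitted.items, sample)
  1L + (order(unlist(shuffled)) %% n.groups)
}

set.seed(1)                  # arbitrary seed, only for reproducibility
hats <- rep(1:4, each = 8)   # four hats of eight teams each
g <- random.groups(32, 8, hats)

# Every hat should contribute exactly one team to every group,
# so every cell of this table should be 1
table(hat = hats, group = g)
```

This works out evenly here because each hat holds exactly n.groups teams.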
