Identifying matching observations in dyadic data in R

Hello everyone,
I am struggling with the following issue. Currently, I have a dataset looking like this:
living_in from Year stock
Austria Australia 2014 2513
Austria Australia 2013 2000
Germany Austria 2010 6000
Australia Austria 2014 3000
Austria Australia 1993 NA
Now I would like to identify all observations that fulfill the following criteria:
Should be from same year
Should contain the same country pairs in that year
Should not contain NA
For instance, I want to find all observations for combinations of two countries, such as Austria-Australia and Australia-Austria, that fall in the same year and contain values. The reason is that some pairs have only one stock value in a given year instead of two, and I want to remove those.
What is the best way to proceed here? Many thanks in advance!
P.S. I have about 14 country pairs in my dataset that need this kind of identification.
A helpful output might be something like this.
living_in from Year stock dummy
Austria Australia 2014 2513 1
Austria Australia 2013 2000 0
Germany Austria 2010 6000 0
Australia Austria 2014 3000 1
Austria Australia 1993 NA 0

For each country pair, irrespective of order (A-B is the same as B-A), assign 1 to the dummy column if the pair has more than one row in the same Year and all of its stock values are non-NA; otherwise assign 0.
library(dplyr)
df %>%
group_by(col1 = pmin(living_in, from), col2 = pmax(living_in, from), Year) %>%
mutate(dummy = as.integer(n() > 1 && all(!is.na(stock)))) %>%
ungroup() %>%
select(-col1, -col2)
# living_in from Year stock dummy
# <chr> <chr> <int> <int> <int>
#1 Austria Australia 2014 2513 1
#2 Austria Australia 2013 2000 0
#3 Germany Austria 2010 6000 0
#4 Australia Austria 2014 3000 1
#5 Austria Australia 1993 NA 0
data
df <- structure(list(living_in = c("Austria", "Austria", "Germany",
"Australia", "Austria"), from = c("Australia", "Australia", "Austria",
"Austria", "Australia"), Year = c(2014L, 2013L, 2010L, 2014L,
1993L), stock = c(2513L, 2000L, 6000L, 3000L, NA)),
class = "data.frame", row.names = c(NA, -5L))
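The same grouping idea can also be sketched in base R, which may help if dplyr is not available. This is only a sketch under the assumption that the columns are character and integer vectors as in the data above; the names pair, complete_pair and matched are made up for illustration. The last line keeps only the matched rows, since the question asks to remove the others.

```r
df <- data.frame(
  living_in = c("Austria", "Austria", "Germany", "Australia", "Austria"),
  from      = c("Australia", "Australia", "Austria", "Austria", "Australia"),
  Year      = c(2014L, 2013L, 2010L, 2014L, 1993L),
  stock     = c(2513L, 2000L, 6000L, 3000L, NA),
  stringsAsFactors = FALSE
)

# Order-independent key: A-B and B-A map to the same string.
pair <- paste(pmin(df$living_in, df$from), pmax(df$living_in, df$from), sep = "_")

# Within each pair-year group: more than one row and no missing stock value.
complete_pair <- ave(
  !is.na(df$stock), pair, df$Year,
  FUN = function(ok) length(ok) > 1 && all(ok)
)
df$dummy <- as.integer(complete_pair)

# Keep only the rows that have a complete counterpart in the same year.
matched <- df[df$dummy == 1L, ]
```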

Related

How to modify data frame in R based on one unique column

I have a data frame that looks like this.
Data
Denmark MG301
Denmark MG302
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
I need to make a new data frame based on the unique values of the 2nd column, removing the Denmark rows whose values are repeated elsewhere. The result should look like this:
Data
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
Regards
Update after clarification:
This code keeps the first row for each distinct value in the second column (code):
distinct(df, code, .keep_all = TRUE)
Output:
1 Denmark MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
First answer:
I am not quite sure what you are after, but this gives the desired output:
df %>%
filter(country != "Denmark")
Output:
country code
<chr> <chr>
1 Australia MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
data:
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG301",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
In base R, the following code removes all rows with "Denmark" in the first column, as well as any rows whose 2nd-column value is duplicated within the same 1st-column group.
i <- df1$V1 != "Denmark"
j <- as.logical(ave(df1$V2, df1$V1, FUN = duplicated))
df1[i & !j, ]
# V1 V2
#3 Australia MG301
#4 Australia MG302
#5 Sweden MG100
#6 Sweden MG120
Do you just want the distinct rows? Then this may help:
df <- data.frame(A = c("denmark", "denmark", "Australia", "Australia", "Sweden", "Sweden"), B = c("MG301","MG302","MG301","MG302","MG100","MG100"))
df %>% distinct()
A B
1 denmark MG301
2 denmark MG302
3 Australia MG301
4 Australia MG302
5 Sweden MG100
Or do you want this?
df %>%
group_by(B) %>%
dplyr::summarise(A = first(A))
B A
* <chr> <chr>
1 MG100 Sweden
2 MG301 denmark
3 MG302 denmark
Use duplicated together with the ! (negation) operator to remove rows whose value in that column is duplicated.
To show a more complicated case, I am adding one Denmark row whose code is not duplicated and which should therefore not be filtered out.
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG302",
'Denmark', "MG303",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
# A tibble: 7 x 2
country code
<chr> <chr>
1 Denmark MG301
2 Denmark MG302
3 Denmark MG303
4 Australia MG301
5 Australia MG302
6 Sweden MG100
7 Sweden MG120
df %>%
mutate(d = duplicated(code)) %>%
group_by(code) %>%
mutate(d = sum(d)) %>% ungroup() %>%
filter(!(d > 0 & country == 'Denmark'))
# A tibble: 5 x 3
country code d
<chr> <chr> <int>
1 Denmark MG303 0
2 Australia MG301 1
3 Australia MG302 1
4 Sweden MG100 0
5 Sweden MG120 0
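A base R sketch of the same rule (drop a Denmark row only when its code also appears in another row) is shown here for comparison. It assumes the seven-row example above; repeated_code is an illustrative name, and calling duplicated from both ends is one way to flag every occurrence of a repeated code without a helper column.

```r
df <- data.frame(
  country = c("Denmark", "Denmark", "Denmark", "Australia", "Australia",
              "Sweden", "Sweden"),
  code    = c("MG301", "MG302", "MG303", "MG301", "MG302", "MG100", "MG120"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose code occurs more than once anywhere in the column.
repeated_code <- duplicated(df$code) | duplicated(df$code, fromLast = TRUE)

# Drop only the Denmark rows that share a code with another row.
result <- df[!(df$country == "Denmark" & repeated_code), ]
```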

Adding conditional variables to dataframe

Say we have a data frame that looks like this:
UNIT NUMBER Year City STATE
124 1996 Prague CZECH
121 2001 Sofie BULG
122 2003 Ostrava CZECH
147 1986 Kyjev UKRAINE
133 2005 Lvov UKRAINE
...
...
...
188 2001 Rome ITALY
And say I need to add another variable to the data frame called Capital city, which would be equal to 1 if the City is the capital city of STATE and 0 otherwise.
How would I add this variable?
The capital cities in the above data frame are: Prague, Sofie, Kyjev
PS: I know I can do it 'by hand' in the above data frame, but I need a universal solution for much bigger data frames...
If you have many cities and some of them share a name across states, join on both City and State:
library(dplyr)
df <- data.frame(
unit = c(124, 121, 122, 147, 133),
Year = c(1996,2001,2003,1986,2005),
City = c("Prague", "Sofie", "Ostrava", "Kyjev", "Lvov"),
State = c("CZECH", "BULG", "CZECH", "UKRAINE", "UKRAINE"))
capital <- data.frame(
City = c("Prague", "Sofie", "Kyjev"),
State = c("CZECH", "BULG", "UKRAINE"),
Capital = "YES"
)
left_join(df, capital, by = c("State" = "State", "City" = "City"))
This gives:
> left_join(df, capital, by = c("State" = "State", "City" = "City"))
unit Year City State Capital
1 124 1996 Prague CZECH YES
2 121 2001 Sofie BULG YES
3 122 2003 Ostrava CZECH <NA>
4 147 1986 Kyjev UKRAINE YES
5 133 2005 Lvov UKRAINE <NA>
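The question asks for a 1/0 variable rather than YES/NA. One possible sketch in base R, assuming the same df and capital tables as above, compares combined City and State keys; the column name Capital_city and the "|" separator are arbitrary choices for illustration.

```r
df <- data.frame(
  unit  = c(124, 121, 122, 147, 133),
  Year  = c(1996, 2001, 2003, 1986, 2005),
  City  = c("Prague", "Sofie", "Ostrava", "Kyjev", "Lvov"),
  State = c("CZECH", "BULG", "CZECH", "UKRAINE", "UKRAINE"),
  stringsAsFactors = FALSE
)
capital <- data.frame(
  City  = c("Prague", "Sofie", "Kyjev"),
  State = c("CZECH", "BULG", "UKRAINE"),
  stringsAsFactors = FALSE
)

# Build a City|State key on both sides so that same-named cities in
# different states are not confused with each other.
key_df  <- paste(df$City, df$State, sep = "|")
key_cap <- paste(capital$City, capital$State, sep = "|")

# 1 if the city-state combination is listed as a capital, 0 otherwise.
df$Capital_city <- as.integer(key_df %in% key_cap)
```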
If all city names are unique, a simple membership test is enough:
cap_list = c("Prague", "Sofie", "Kyjev")
df %>%
mutate (
yes = as.numeric(City %in% cap_list)
)
unit Year City State yes
1 124 1996 Prague CZECH 1
2 121 2001 Sofie BULG 1
3 122 2003 Ostrava CZECH 0
4 147 1986 Kyjev UKRAINE 1
5 133 2005 Lvov UKRAINE 0

create a variable in a dataframe based on another matrix on R

I am having some problems with the following task.
I have a data frame of this type, with 99 different countries and thousands of IDs:
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable giving the macroregion of each Nationality,
so I created a matrix of this kind with the rule to follow:
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this:
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying with this command
datasubset <- merge(dataset , continent.matrix )
but it doesn't work, it reports the following error
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me. Applying the same code to a subset doesn't work either. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function from the dplyr package:
library(dplyr)
df <- tibble(ID=c(1,2,3),
Nationality=c("Italy", "Usa", "France"),
var1=c("a", "b", "c"),
var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
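A lookup with match is another option that avoids building a merged copy of the data. The sketch below is written under two assumptions that the question does not confirm: that the lookup table has one row per Nationality, and that spellings may differ in case (the question shows "USA" in the data but "Usa" in the lookup), which is why both sides are upper-cased before matching. The object names dataset and continent_lookup are illustrative.

```r
dataset <- data.frame(
  ID = 1:7,
  Nationality = c("Italy", "Eritrea", "Italy", "USA", "France", "France", "Eritrea"),
  stringsAsFactors = FALSE
)
continent_lookup <- data.frame(
  Nationality = c("Italy", "Eritrea", "Usa", "France", "Germany"),
  Continent   = c("Europe", "Africa", "America", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# Assumption: one row per country in the lookup; duplicates would make
# the mapping ambiguous (and can blow up a merge).
stopifnot(anyDuplicated(continent_lookup$Nationality) == 0)

# Position of each Nationality in the lookup, ignoring case differences.
idx <- match(toupper(dataset$Nationality), toupper(continent_lookup$Nationality))

# Countries missing from the lookup get NA.
dataset$Continent <- continent_lookup$Continent[idx]
```

Rows left with NA in Continent point to nationalities that are missing or spelled differently in the lookup.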

Dummy indicator for cross country ID repetitions

Here is my problem: I have a table with ISINs and countries, like this:
ISIN COUNTRY
XX0001 ITALY
XX0002 FRANCE
XX0003 ITALY
XX0001 FRANCE
XX0002 ITALY
XX0004 FRANCE
I would like to create a new column with an indicator that takes the value 1 if the same ISIN appears in both countries, and 0 otherwise.
ISIN COUNTRY INDICATOR
XX0001 ITALY 1
XX0002 FRANCE 1
XX0003 ITALY 0
XX0001 FRANCE 1
XX0002 ITALY 1
XX0004 FRANCE 0
I am working in TIBCO Spotfire, which also supports native R.
Data
df1 <- structure(list(ISIN = c("XX0001", "XX0002", "XX0003", "XX0001", "XX0002", "XX0004"),
COUNTRY = c("ITALY", "FRANCE", "ITALY", "FRANCE", "ITALY", "FRANCE")),
.Names = c("ISIN", "COUNTRY"), class = "data.frame",
row.names = c(NA, -6L))
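We can try with duplicated, scanning from both ends so that every occurrence of a repeated ISIN is flagged regardless of row order (this assumes each ISIN appears at most once per country)
df1$INDICATOR <- as.integer(duplicated(df1$ISIN) | duplicated(df1$ISIN, fromLast = TRUE))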
df1$INDICATOR
#[1] 1 1 0 1 1 0
Or using data.table
library(data.table)
setDT(df1)[, INDICATOR := +(uniqueN(COUNTRY)>1) , ISIN]
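For readers without data.table, the idea of counting distinct countries per ISIN can be sketched in base R with tapply. This assumes df1 as defined above; n_countries is an illustrative name.

```r
df1 <- data.frame(
  ISIN    = c("XX0001", "XX0002", "XX0003", "XX0001", "XX0002", "XX0004"),
  COUNTRY = c("ITALY", "FRANCE", "ITALY", "FRANCE", "ITALY", "FRANCE"),
  stringsAsFactors = FALSE
)

# Number of distinct countries in which each ISIN appears (named by ISIN).
n_countries <- tapply(df1$COUNTRY, df1$ISIN, function(x) length(unique(x)))

# Look the count up for every row; 1 if the ISIN spans more than one country.
df1$INDICATOR <- as.integer(n_countries[df1$ISIN] > 1)
```

Unlike a check based only on repeated ISINs, this does not flag an ISIN that is listed twice in the same country.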

reshape data from wide to long with multiple rows

I have a dataset dfs that I would like to reshape
dfs
# country.name indicator.name x1990 x1991 x1992
# 507 andorra GDP at market prices (current US$) 1.028989e+09 1.106891e+09 1.209993e+09
# 510 andorra GDP growth (annual %) 3.781393e+00 2.546001e+00 9.292154e-01
# 1347 albania GDP at market prices (current US$) 2.101625e+09 1.139167e+09 7.094526e+08
# 1350 albania GDP growth (annual %) -9.575640e+00 -2.958900e+01 -7.200000e+00
# 3587 austria GDP at market prices (current US$) 1.660624e+11 1.733755e+11 1.946082e+11
And I would like the indicator names to become columns, with the years stacked into a single time column.
# country time gdp_market gdp_growth
# 1 andorra 1990 1028989394 3.7813935
# 2 andorra 1990 1106891025 2.5460006
# 3 andorra 1990 1209992650 0.9292154
# 4 albania 1991 2101624963 3.7813935
# 5 albania 1991 1139166646 2.5460006
# 6 albania 1991 709452584 0.9292154
# 7 austria 1992 166062376740 NA
# 8 austria 1992 173375508073 NA
# 9 austria 1992 194608183696 NA
I can melt the data into long format, but I can't separate the indicators into two columns
library(reshape2)
melt.dfs <- melt(dfs, id=1:2)
I could do a split and cbind, but I'd prefer to do it with reshape. Thanks
dfs = structure(list(country.name = c("andorra", "andorra", "albania",
"albania", "austria"), indicator.name = c("GDP at market prices (current US$)",
"GDP growth (annual %)", "GDP at market prices (current US$)",
"GDP growth (annual %)", "GDP at market prices (current US$)"
), x1990 = c(1028989393.70295, 3.78139347786568, 2101624962.5,
-9.57564018741695, 166062376739.683), x1991 = c(1106891024.78653,
2.54600064090229, 1139166645.83333, -29.5889976817695, 173375508073.07
), x1992 = c(1209992649.56688, 0.929215382801402, 709452583.880319,
-7.19999998650893, 194608183696.469)), .Names = c("country.name",
"indicator.name", "x1990", "x1991", "x1992"), row.names = c(507L,
510L, 1347L, 1350L, 3587L), class = "data.frame")
We can use
library(dplyr)
library(tidyr)
gather(dfs, time, Val, x1990:x1992) %>%
spread(indicator.name, Val)
EDIT: Based on comments from @docendo discimus
Or using recast
library(reshape2)
recast(dfs, measure = 3:5, ...~indicator.name, value.var='value')
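In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. A sketch of the same two-step reshape with those functions follows; it assumes a recent tidyr is installed, and the values in dfs are rounded from the printed table in the question rather than copied from the full structure.

```r
library(tidyr)

dfs <- data.frame(
  country.name = c("andorra", "andorra", "albania", "albania", "austria"),
  indicator.name = c("GDP at market prices (current US$)", "GDP growth (annual %)",
                     "GDP at market prices (current US$)", "GDP growth (annual %)",
                     "GDP at market prices (current US$)"),
  x1990 = c(1.028989e9, 3.781393, 2.101625e9, -9.575640, 1.660624e11),
  x1991 = c(1.106891e9, 2.546001, 1.139167e9, -29.58900, 1.733755e11),
  x1992 = c(1.209993e9, 0.9292154, 7.094526e8, -7.2, 1.946082e11),
  stringsAsFactors = FALSE
)

# Step 1: stack the year columns into time/value pairs, dropping the "x" prefix.
long <- pivot_longer(dfs, cols = x1990:x1992,
                     names_to = "time", names_prefix = "x", values_to = "value")

# Step 2: spread the indicators into one column each.
wide <- pivot_wider(long, names_from = indicator.name, values_from = value)

# The years arrive as character strings; convert them to integers.
wide$time <- as.integer(wide$time)
```

The result has one row per country and year; austria has NA for GDP growth because that indicator is absent from the sample.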
