How to create a two-way table with character variables in R

I have the following data frame in R:
df <- data.frame(Year = c(2011, 2012, 2013, 2011, 2012, 2013, 2011, 2012, 2013),
Country = c("England", "England", "England", "French", "French", "French", "Germany", "Germany", "Germany"),
Pop = c(53.107, 53.493, 53.865, 63.070, 63.375, 63.697, 80.328, 80.524, 80.767))
# df
# Year Country Pop
# 1 2011 England 53.107
# 2 2012 England 53.493
# 3 2013 England 53.865
# 4 2011 French 63.070
# 5 2012 French 63.375
# 6 2013 French 63.697
# 7 2011 Germany 80.328
# 8 2012 Germany 80.524
# 9 2013 Germany 80.767
I would like to get the following table:
Year
2011 2012 2013
Country Pop Country Pop Country Pop
England 53,107 England 53,493 England 53,865
French 63,07 French 63,375 French 63,697
Germany 80,328 Germany 80,524 Germany 80,767

Will this do?
> xtabs(Pop ~ Country + as.factor(Year), df)
as.factor(Year)
Country 2011 2012 2013
England 53.107 53.493 53.865
French 63.070 63.375 63.697
Germany 80.328 80.524 80.767
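If row and column totals are also wanted, the same cross-tabulation can be passed to base R's addmargins(). The following is only a sketch that rebuilds the question's data frame; the object names tab and tab_with_totals are made up for illustration.

```r
# Rebuild the data frame from the question
df <- data.frame(
  Year = rep(c(2011, 2012, 2013), times = 3),
  Country = rep(c("England", "French", "Germany"), each = 3),
  Pop = c(53.107, 53.493, 53.865, 63.070, 63.375, 63.697, 80.328, 80.524, 80.767)
)

# Cross-tabulate Pop by Country (rows) and Year (columns);
# xtabs() treats both grouping variables as factors
tab <- xtabs(Pop ~ Country + Year, data = df)

# Append a "Sum" row and a "Sum" column holding the totals
tab_with_totals <- addmargins(tab)
print(tab_with_totals)
```

The margins are labelled "Sum" by default, so a total can be read with tab_with_totals["England", "Sum"].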

Solution with dplyr + tidyr:
library(dplyr)
library(tidyr)
df_reshaped = df %>%
mutate(Year = paste0("Pop_", Year)) %>%
spread(Year, Pop)
compute_margins = df_reshaped %>%
summarize_if(is.numeric, sum, na.rm = TRUE) %>%
as.list(.) %>%
c(Country = "Total") %>%
bind_rows(df_reshaped, .) %>%
mutate(Total = rowSums(.[2:4]))
Result:
> df_reshaped
Country Pop_2011 Pop_2012 Pop_2013
1 England 53.107 53.493 53.865
2 French 63.070 63.375 63.697
3 Germany 80.328 80.524 80.767
> compute_margins
Country Pop_2011 Pop_2012 Pop_2013 Total
1 England 53.107 53.493 53.865 160.465
2 French 63.070 63.375 63.697 190.142
3 Germany 80.328 80.524 80.767 241.619
4 Total 196.505 197.392 198.329 592.226
To get the format you want, you can do the following:
Map(function(x, y){
temp = cbind(compute_margins[1], x)
names(temp)[2] = y
return(temp)
}, compute_margins[2:4], names(compute_margins)[2:4]) %>%
unname() %>%
do.call(cbind, .) %>%
cbind(compute_margins[5])
Result:
Country Pop_2011 Country Pop_2012 Country Pop_2013 Total
1 England 53.107 England 53.493 England 53.865 160.465
2 French 63.070 French 63.375 French 63.697 190.142
3 Germany 80.328 Germany 80.524 Germany 80.767 241.619
4 Total 196.505 Total 197.392 Total 198.329 592.226
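The reshaping step above relies on spread(), which tidyr now documents as superseded by pivot_wider(). As a sketch only, assuming tidyr 1.0.0 or later is installed, the same wide table could be produced like this (the name wide is illustrative):

```r
library(tidyr)

# Rebuild the data frame from the question
df <- data.frame(
  Year = rep(c(2011, 2012, 2013), times = 3),
  Country = rep(c("England", "French", "Germany"), each = 3),
  Pop = c(53.107, 53.493, 53.865, 63.070, 63.375, 63.697, 80.328, 80.524, 80.767)
)

# One row per Country, one "Pop_<year>" column per Year
wide <- pivot_wider(df,
                    names_from = Year,
                    values_from = Pop,
                    names_prefix = "Pop_")
print(wide)
```

The names_prefix argument replaces the separate mutate(Year = paste0("Pop_", Year)) step.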

Related

How to modify data frame in R based on one unique column

I have a data frame that looks like this.
Data
Denmark MG301
Denmark MG302
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
I need to make a new data frame based on the unique values of the 2nd column, while removing the repeated values for Denmark. The result should look like this:
Data
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
Regards
Update after clarification:
This code keeps all distinct values in column2:
distinct(df, code, .keep_all = TRUE)
Output:
1 Denmark MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
First answer:
I am not quite sure. But it gives the desired output:
df %>%
filter(country != "Denmark")
Output:
country code
<chr> <chr>
1 Australia MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
data:
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG301",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
In base R, the following code removes all rows with "Denmark" in the first column, as well as all values of the 2nd column that are duplicated within a group of the 1st column.
i <- df1$V1 != "Denmark"
j <- as.logical(ave(df1$V2, df1$V1, FUN = duplicated))
df1[i & !j, ]
# V1 V2
#3 Australia MG301
#4 Australia MG302
#5 Sweden MG100
#6 Sweden MG120
Do you just want the distinct rows? Then this may help:
df <- data.frame(A = c("denmark", "denmark", "Australia", "Australia", "Sweden", "Sweden"), B = c("MG301","MG302","MG301","MG302","MG100","MG100"))
df %>% distinct()
A B
1 denmark MG301
2 denmark MG302
3 Australia MG301
4 Australia MG302
5 Sweden MG100
Or do you want this?
df %>%
group_by(B) %>%
dplyr::summarise(A = first(A))
B A
* <chr> <chr>
1 MG100 Sweden
2 MG301 denmark
3 MG302 denmark
Use duplicated() together with the ! (negation) operator to remove rows whose value in that column is duplicated.
To show a more complicated case, I am adding one row for Denmark whose code is not duplicated and which therefore should not be filtered out.
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG302",
'Denmark', "MG303",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
# A tibble: 7 x 2
country code
<chr> <chr>
1 Denmark MG301
2 Denmark MG302
3 Denmark MG303
4 Australia MG301
5 Australia MG302
6 Sweden MG100
7 Sweden MG120
df %>%
mutate(d = duplicated(code)) %>%
group_by(code) %>%
mutate(d = sum(d)) %>% ungroup() %>%
filter(!(d > 0 & country == 'Denmark'))
# A tibble: 5 x 3
country code d
<chr> <chr> <int>
1 Denmark MG303 0
2 Australia MG301 1
3 Australia MG302 1
4 Sweden MG100 0
5 Sweden MG120 0
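The same filter can be sketched in base R without the helper column d. This is only an illustration under the same assumption as above (drop a Denmark row only when its code also occurs elsewhere in the column); the names shared and result are made up here.

```r
# Same example data as above, as a plain data frame
df <- data.frame(
  country = c("Denmark", "Denmark", "Denmark",
              "Australia", "Australia", "Sweden", "Sweden"),
  code = c("MG301", "MG302", "MG303", "MG301", "MG302", "MG100", "MG120")
)

# TRUE for every row whose code occurs more than once in the column
shared <- df$code %in% df$code[duplicated(df$code)]

# Drop only the Denmark rows that carry a shared code
result <- df[!(shared & df$country == "Denmark"), ]
print(result)
```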

R: How do I avoid getting an error when merging two data frames (group by/summarise)?

I have a big data frame of 80,000 rows. It was created by combining individual data frames from different years. The origin variable indicates the year of the entry's original data frame.
Here is an example of the first few of the big data frame rows that show how data frames from 2003 and 2011 were combined.
df_1:
ID City State origin
1 NY NY 2003
2 NY NY 2003
3 SF CA 2003
1 NY NY 2011
3 SF CA 2011
2 NY NY 2011
4 LA CA 2011
5 SD CA 2011
Now I want to create a new variable called first_appearance that takes the min of the origin variable for each ID:
final_df:
ID City State origin first_appearance
1 NY NY 2003 2003
2 NY NY 2003 2003
3 SF CA 2003 2003
1 NY NY 2011 2003
3 SF CA 2011 2003
2 NY NY 2011 2003
4 LA CA 2011 2011
5 SD CA 2011 2011
So far, I've tried using:
prestep_final <- df_1 %>% group_by(ID) %>% summarise(first_appearance = min(origin))
final_df <- merge(prestep_final, df_1, by = "ID")
Prestep_final works and produces a data frame with the ID and the first_appearance.
Unfortunately, the merge step doesn't work and yields a data frame with NA entries only.
How can I improve my code so that I can produce a table like final_df above? I'd appreciate any suggestions and don't have package preferences.
If you change summarise to mutate you get your desired result without merging:
library(tidyverse)
df <- tibble::tribble(
~ID, ~City, ~State, ~origin,
1, 'NY', 'NY', 2003,
2, 'NY', 'NY', 2003,
3, 'SF', 'CA', 2003,
1, 'NY', 'NY', 2011,
3, 'SF', 'CA', 2011,
2, 'NY', 'NY', 2011,
4, 'LA', 'CA', 2011,
5, 'SD', 'CA', 2011
)
df %>% group_by(ID) %>%
mutate(first_appearance = min(origin))
#> # A tibble: 8 x 5
#> # Groups: ID [5]
#> ID City State origin first_appearance
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 NY NY 2003 2003
#> 2 2 NY NY 2003 2003
#> 3 3 SF CA 2003 2003
#> 4 1 NY NY 2011 2003
#> 5 3 SF CA 2011 2003
#> 6 2 NY NY 2011 2003
#> 7 4 LA CA 2011 2011
#> 8 5 SD CA 2011 2011
Created on 2020-06-10 by the reprex package (v0.3.0)
An option with data.table
library(data.table)
setDT(df)[, first_appearance := min(origin), ID]
Or in base R
df$first_appearance <- with(df, ave(origin, ID, FUN = min))
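For completeness, the two-step summarise-then-merge idea from the question can also be sketched in base R. This assumes ID is an ordinary atomic column in both data frames; the name first is illustrative. Note that merge() sorts the result by the key, so the row order differs from the input.

```r
# Example data from the question
df_1 <- data.frame(
  ID = c(1, 2, 3, 1, 3, 2, 4, 5),
  City = c("NY", "NY", "SF", "NY", "SF", "NY", "LA", "SD"),
  State = c("NY", "NY", "CA", "NY", "CA", "NY", "CA", "CA"),
  origin = c(2003, 2003, 2003, 2011, 2011, 2011, 2011, 2011)
)

# Step 1: earliest origin per ID
first <- aggregate(origin ~ ID, data = df_1, FUN = min)
names(first)[2] <- "first_appearance"

# Step 2: attach it back to every row with the same ID
final_df <- merge(df_1, first, by = "ID")
print(final_df)
```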

Fill NA values in one data table with observed values from a second data table in R

I can't believe I'm having this much trouble finding a solution to this problem: I have two data tables with identical rows and columns that look like this:
Country <- c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE")
Year <- c(2010, 2020, 2010, 2020, 2010, 2020)
acctm <- c(20, 30, 10, NA, 20, NA)
acctf <- c(20, NA, 15, NA, 40, NA)
dt1 <- data.table(Country, Year, acctm, acctf)
Country Year acctm acctf
1 FRA 2010 20 20
2 FRA 2020 30 NA
3 DEU 2010 10 15
4 DEU 2020 NA NA
5 CHE 2010 20 40
6 CHE 2020 NA NA
Country <- c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE")
Year <- c(2010, 2020, 2010, 2020, 2010, 2020)
acctm <- c(1, 1, 1, 60, 1, 70)
acctf <- c(1, 60, 1, 80, 1, 100)
dt2 <- data.table(Country, Year, acctm, acctf)
Country Year acctm acctf
1 FRA 2010 1 1
2 FRA 2020 1 60
3 DEU 2010 1 1
4 DEU 2020 60 80
5 CHE 2010 1 1
6 CHE 2020 70 100
I need to create a new data table that replaces NA values in dt1 with values for the corresponding country/year/variable match from dt2, yielding a table that looks like this:
Country Year acctm acctf
1 FRA 2010 20 20
2 FRA 2020 30 60
3 DEU 2010 10 15
4 DEU 2020 60 80
5 CHE 2010 20 40
6 CHE 2020 70 100
We can do this with a join on the 'Country', 'Year' columns
library(data.table)
nm1 <- names(dt1)[3:4]
nm2 <- paste0("i.", nm1)
dt3 <- copy(dt1)
dt3[dt2, (nm1) := Map(function(x, y)
fifelse(is.na(x), y, x), mget(nm1), mget(nm2)), on = .(Country, Year)]
dt3
# Country Year acctm acctf
#1: FRA 2010 20 20
#2: FRA 2020 30 60
#3: DEU 2010 10 15
#4: DEU 2020 60 80
#5: CHE 2010 20 40
#6: CHE 2020 70 100
Or, to make this more compact, use fcoalesce from data.table (from a comment by @IceCreamToucan)
dt3[dt2, (nm1) := Map(fcoalesce, mget(nm1), mget(nm2)), on = .(Country, Year)]
If the datasets are of same dimensions and have the same values for 'Country', 'Year', then another option is
library(purrr)
library(dplyr)
list(dt1[, .(acctm, acctf)], dt2[, .(acctm, acctf)]) %>%
reduce(coalesce) %>%
bind_cols(dt1[, .(Country, Year)], .)
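The underlying idea (look up each dt1 row in dt2 by Country and Year, then fill only the NA cells) can also be sketched in base R with match(). This is an illustration on plain data frames rather than data.tables; the names idx and filled are made up here.

```r
Country <- c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE")
Year <- c(2010, 2020, 2010, 2020, 2010, 2020)
dt1 <- data.frame(Country, Year,
                  acctm = c(20, 30, 10, NA, 20, NA),
                  acctf = c(20, NA, 15, NA, 40, NA))
dt2 <- data.frame(Country, Year,
                  acctm = c(1, 1, 1, 60, 1, 70),
                  acctf = c(1, 60, 1, 80, 1, 100))

# For each row of dt1, the position of the matching Country/Year row in dt2
idx <- match(paste(dt1$Country, dt1$Year), paste(dt2$Country, dt2$Year))

filled <- dt1
for (col in c("acctm", "acctf")) {
  missing <- is.na(filled[[col]])
  # Replace only the NA cells with the value from the matched dt2 row
  filled[[col]][missing] <- dt2[[col]][idx[missing]]
}
print(filled)
```

Because the lookup goes through match(), the two tables do not need to be in the same row order.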

A problem with multiple conditions in R that does not work

I am having a problem with multiple conditions in R.
My data is like this:
Region in UK Year Third column (year.city)
Liverpool 2008
Manchester 2010
Liverpool 2016
Chester 2015
Birmingham 2016
Blackpool 2012
Birmingham 2005
Chester 2009
Liverpool 2005
Hull 2011
Leeds 2013
Liverpool 2014
Bradford 2008
London 2010
Coventry 2009
Cardiff 2016
Liverpool 2007
What I want to create is a third column that has four groups in it: Liverpool before 2010, Liverpool after 2010, other cities before 2010, and other cities after 2010. I tried a couple of approaches, such as mutate, but could not solve it. Could you please help me do it?
Thanks
I would do this as @dvibisan suggested and use dplyr.
# Create a dataframe
df <- structure(list(`Region in UK` = c("Liverpool", "Manchester", "Liverpool",
"Chester", "Birmingham", "Blackpool", "Birmingham", "Chester",
"Liverpool", "Hull", "Leeds", "Liverpool", "Bradford", "London",
"Coventry", "Cardiff", "Liverpool"),
Year = c(2008L, 2010L, 2016L, 2015L, 2016L, 2012L, 2005L, 2009L, 2005L, 2011L, 2013L, 2014L, 2008L, 2010L, 2009L, 2016L, 2007L)),
row.names = c(NA, -17L), class = c("data.table", "data.frame"))
# Load the dplyr library to use mutate and if_else (if there were more than 2 conditions of interest for each variable could use case_when)
library(dplyr)
# Create a new column using mutate, pasting together two conditions
df <-
df %>%
mutate(`Third column (year.city)` = paste0(if_else(grepl("Liverpool", `Region in UK`, fixed = TRUE), `Region in UK`, "Other cities"),
if_else(Year < 2010, " before 2010", " 2010 or after")))
The easiest way I think is using vectorisation with base R:
# create index of categories
vec <- c("Other cities after 2010", "Liverpool after 2010", "Other cities before 2010", "Liverpool before 2010")
# create index vector
ix <- 1 + (df$Region.in.UK == "Liverpool") + 2*(df$Year < 2010)
# index the categories-vector with the index-vector
df$year.city <- vec[ix]
The result:
> df
Region.in.UK Year year.city
1 Liverpool 2008 Liverpool before 2010
2 Manchester 2010 Other cities after 2010
3 Liverpool 2016 Liverpool after 2010
4 Chester 2015 Other cities after 2010
5 Birmingham 2016 Other cities after 2010
6 Blackpool 2012 Other cities after 2010
7 Birmingham 2005 Other cities before 2010
8 Chester 2009 Other cities before 2010
9 Liverpool 2005 Liverpool before 2010
10 Hull 2011 Other cities after 2010
11 Leeds 2013 Other cities after 2010
12 Liverpool 2014 Liverpool after 2010
13 Bradford 2008 Other cities before 2010
14 London 2010 Other cities after 2010
15 Coventry 2009 Other cities before 2010
16 Cardiff 2016 Other cities after 2010
17 Liverpool 2007 Liverpool before 2010
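To see why the index arithmetic works, it can help to enumerate the four possible combinations of the two conditions. The sketch below reuses vec from the answer; the data frame combos and its column names are made up for illustration.

```r
vec <- c("Other cities after 2010", "Liverpool after 2010",
         "Other cities before 2010", "Liverpool before 2010")

# All four combinations of the two logical conditions
combos <- expand.grid(is_liverpool = c(FALSE, TRUE),
                      before_2010 = c(FALSE, TRUE))

# A logical counts as 0 or 1, so the sum yields a distinct index 1..4:
# +1 if the city is Liverpool, +2 if the year is before 2010
combos$ix <- 1 + combos$is_liverpool + 2 * combos$before_2010
combos$label <- vec[combos$ix]
print(combos)
```

Note that with the condition Year < 2010, the year 2010 itself falls into the "after 2010" groups.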
Try this
Region_in_UK = c( "Liverpool", "Manchester", "Liverpool", "Chester",
"Birmingham", "Blackpool", "Birmingham", "Chester", "Liverpool", "Hull",
"Leeds", "Liverpool", "Bradford", "London", "Coventry", "Cardiff", "Liverpool")
Year = c(2008, 2010, 2016, 2015, 2016, 2012, 2005, 2009, 2005, 2011, 2013,
2014, 2008, 2010, 2009, 2016, 2007)
df = data.frame(Region_in_UK, Year)
# erase the code above and replace your own dataframe if its bigger
# than the data you displayed at this point and name it "df" (e.g.:
# df = your_dataframe)
library(dplyr)  # provides mutate()
df$year_city = rep(NA, dim(df)[1])
df = mutate(df, year_city =
ifelse (grepl("Liverpool", df$Region_in_UK) & df$Year < 2010,
"Liverpool before 2010", df$year_city))
df = mutate(df, year_city =
ifelse (grepl("Liverpool", df$Region_in_UK) & df$Year >= 2010,
"Liverpool 2010 and after", df$year_city))
df = mutate(df, year_city =
ifelse (!grepl("Liverpool", df$Region_in_UK) & df$Year < 2010,
"Other before 2010", df$year_city))
df = mutate(df, year_city =
ifelse (!grepl("Liverpool", df$Region_in_UK) & df$Year >= 2010,
"Other 2010 and after", df$year_city))
Using base R you could do:
transform(df, year.city = factor(paste(sub('^((?!Liver).)*$', 'other', Region_in_UK,perl = TRUE), Year>2010), label=1:4))
Region_in_UK Year year.city
1 Liverpool 2008 1
2 Manchester 2010 3
3 Liverpool 2016 2
4 Chester 2015 4
5 Birmingham 2016 4
6 Blackpool 2012 4
7 Birmingham 2005 3
8 Chester 2009 3
9 Liverpool 2005 1
10 Hull 2011 4
11 Leeds 2013 4
12 Liverpool 2014 2
13 Bradford 2008 3
14 London 2010 3
15 Coventry 2009 3
16 Cardiff 2016 4
17 Liverpool 2007 1
You can also do:
transform(df,m=factor(paste(!grepl("Liverpool",Region_in_UK),Year>2010),label=1:4))
or
transform(df,m = factor(paste(sub('(Liverpool)|.*','\\1',Region_in_UK),Year<=2010),label=4:1))
Region_in_UK Year m
1 Liverpool 2008 1
2 Manchester 2010 3
3 Liverpool 2016 2
4 Chester 2015 4
5 Birmingham 2016 4
6 Blackpool 2012 4
7 Birmingham 2005 3
8 Chester 2009 3
9 Liverpool 2005 1
10 Hull 2011 4
11 Leeds 2013 4
12 Liverpool 2014 2
13 Bradford 2008 3
14 London 2010 3
15 Coventry 2009 3
16 Cardiff 2016 4
17 Liverpool 2007 1

Convert Panel Data to Long in R

My current data covers missiles between 1920 and 2018. The goal is to measure a nation’s ability to deploy missiles of different kinds for each year from 1920 to 2018. The problem is that the data has multiple observations per nation, and often per year. For instance, if a nation adopted an imported Air to Air missile in 1970, and then in 1980 developed a domestically produced one that is both Air to Air and Air to Ground, that change needs to be reflected. The goal is to have a unique row/observation for each year for every nation. Note also the assumption that if a nation can produce, for instance, Air to Air missiles in 1970, it can do so until 2018.
Current:
YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 Saudi Arabia 0 1
2017 670 Saudi Arabia 1 1
2016 2 United States 1 1
Desired:
YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 Saudi Arabia 0 1
2015 670 Saudi Arabia 0 1
2016 670 Saudi Arabia 0 1
2017 670 Saudi Arabia 1 1
2018 670 Saudi Arabia 1 1
2016 2 United States 1 1
2017 2 United States 1 1
2018 2 United States 1 1
Note: There are many entries, so I would like it to generate rows from 1920 to 2018 for every country, even if they are all zeroes. That is not necessary, but it would be a great bonus!
You can do this via several steps:
1. Create the combination of all years and countries (a CROSS JOIN in SQL).
2. LEFT JOIN these combinations with the available data.
3. Use a function like zoo::na.locf() to replace NA values by the last known ones per country.
The first step is common:
df <- read.table(text = 'YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 "Saudi Arabia" 0 1
2017 670 "Saudi Arabia" 1 1
2016 2 "United States" 1 1', header = TRUE, stringsAsFactors = FALSE)
combinations <- merge(data.frame(YearAcquired = seq(1920, 2018, 1)),
unique(df[,2:3]), by = NULL)
For steps 2 and 3, here is a solution using dplyr:
library(dplyr)
library(zoo)
df <- left_join(combinations, df) %>%
group_by(CountryCode) %>%
mutate(Domestic = na.locf(Domestic, na.rm = FALSE),
AirtoAir = na.locf(AirtoAir, na.rm = FALSE))
And one solution using data.table:
library(data.table)
library(zoo)
setDT(df)
setDT(combinations)
df <- df[combinations, on = c("YearAcquired", "CountryCode", "CountryName")]
df <- df[, na.locf(.SD, na.rm = FALSE), by = "CountryCode"]
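The three steps can also be sketched in base R alone, without zoo. This is an illustration only: locf() below is a small hypothetical helper written for this example (not a library function), and the final step assumes that years before a country's first record should become 0.

```r
df <- data.frame(
  YearAcquired = c(2014, 2017, 2016),
  CountryCode = c(670, 670, 2),
  CountryName = c("Saudi Arabia", "Saudi Arabia", "United States"),
  Domestic = c(0, 1, 1),
  AirtoAir = c(1, 1, 1)
)

# Hypothetical helper: carry the last non-NA value forward
locf <- function(x) {
  idx <- cumsum(!is.na(x))           # index of the last non-NA value seen so far
  c(NA, x[!is.na(x)])[idx + 1]       # position 1 (NA) is used before any value is seen
}

# Step 1: cross join of all years with all countries
grid <- merge(data.frame(YearAcquired = as.numeric(1920:2018)),
              unique(df[, c("CountryCode", "CountryName")]), by = NULL)

# Step 2: left join with the available data (on the shared columns)
panel <- merge(grid, df, all.x = TRUE)
panel <- panel[order(panel$CountryCode, panel$YearAcquired), ]

# Step 3: fill forward within each country, then set leading NAs to 0
for (col in c("Domestic", "AirtoAir")) {
  panel[[col]] <- ave(panel[[col]], panel$CountryCode, FUN = locf)
  panel[[col]][is.na(panel[[col]])] <- 0
}
```

Sorting by country and year before filling matters, because the fill is positional within each group.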
You could create a new dataframe from the country names and codes available and perform a left join with your existing data. This gives you 1920 to 2018 for each country and code, leaving NAs where you don't have data available; you can then replace them according to how you want your data structured.
# df is your initial dataframe
library(dplyr)
# one row per distinct country name/code pair
countries <- unique(df[, c("CountryName", "CountryCode")])
# cross join: every year from 1920 to 2018 for every country
new_df <- merge(data.frame(YearAcquired = seq(1920, 2018, 1)),
countries, by = NULL)
new_df <- left_join(new_df, df, by = c("YearAcquired", "CountryName", "CountryCode"))
Using tidyverse (dplyr and tidyr)...
If you only need to fill in internal years per country...
df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")
library(dplyr)
library(tidyr)
df %>%
group_by(countrycode) %>%
complete(YearAcquired = full_seq(YearAcquired, 1), countrycode, CountryName) %>%
arrange(countrycode, YearAcquired) %>%
fill(Domestic, AirtoAir)
#> # A tibble: 5 x 5
#> # Groups: countrycode [2]
#> YearAcquired countrycode CountryName Domestic AirtoAir
#> <dbl> <int> <chr> <int> <int>
#> 1 2016 2 United States 1 1
#> 2 2014 670 Saudi Arabia 0 1
#> 3 2015 670 Saudi Arabia 0 1
#> 4 2016 670 Saudi Arabia 0 1
#> 5 2017 670 Saudi Arabia 1 1
If you want to expand each country to all years found in the dataset...
df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")
library(dplyr)
library(tidyr)
df %>%
complete(YearAcquired = full_seq(YearAcquired, 1),
nesting(countrycode, CountryName)) %>%
group_by(countrycode) %>%
arrange(countrycode, YearAcquired) %>%
fill(Domestic, AirtoAir) %>%
mutate_at(vars(Domestic, AirtoAir), ~ if_else(is.na(.), 0L, .)) # formula shorthand; funs() is deprecated in current dplyr
#> # A tibble: 8 x 5
#> # Groups: countrycode [2]
#> YearAcquired countrycode CountryName Domestic AirtoAir
#> <dbl> <int> <chr> <int> <int>
#> 1 2014 2 United States 0 0
#> 2 2015 2 United States 0 0
#> 3 2016 2 United States 1 1
#> 4 2017 2 United States 1 1
#> 5 2014 670 Saudi Arabia 0 1
#> 6 2015 670 Saudi Arabia 0 1
#> 7 2016 670 Saudi Arabia 0 1
#> 8 2017 670 Saudi Arabia 1 1
