Calculate duration/difference between first and n rows that match on column value - r

I'm trying to calculate the difference (duration) between the first row and each later row of a data frame that share a value in one column. I want to place that value in a new column, "duration". Sample data is below.
y <- data.frame(c("USA", "USA", "USA", "France", "France", "Mexico", "Mexico", "Mexico"), c(1992, 1993, 1994, 1989, 1990, 1999, 2000, 2001))
colnames(y) <- c("Country", "Year")
y$Year <- as.integer(y$Year) # this is to match the class of my actual data
My desired result is:
1992 USA 0
1993 USA 1
1994 USA 2
1989 France 0
1990 France 1
1999 Mexico 0
2000 Mexico 1
2001 Mexico 2
I've tried using dplyr's group_by and mutate
y <- y %>% group_by(Country) %>% mutate(duration = Year - lag(Year))
but I can only get the lagged year itself (e.g. 1999), or the difference between consecutive rows, which gives NA for the first row of each country and 1 for every other row of that country. Most Q&As focus on the difference between consecutive rows, not between the first and nth rows.
Thoughts?

This can be done by subtracting the first 'Year' in each group from the 'Year' column after grouping by 'Country'.
library(dplyr)
y %>%
  group_by(Country) %>%
  mutate(duration = Year - first(Year))
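As a quick check of the idea, here is a base R sketch of the same calculation on the sample data. It assumes the rows are already sorted by Year within each Country, and it uses ave() only as a dependency-free stand-in for group_by() plus mutate(); it is not the code from the answer above.

```r
# Sample data from the question
y <- data.frame(
  Country = c("USA", "USA", "USA", "France", "France", "Mexico", "Mexico", "Mexico"),
  Year = as.integer(c(1992, 1993, 1994, 1989, 1990, 1999, 2000, 2001)),
  stringsAsFactors = FALSE
)

# For each Country, subtract the first Year of that group from every Year
y$duration <- ave(y$Year, y$Country, FUN = function(x) x - x[1])
print(y)
```

If the rows might not be sorted, replacing x[1] with min(x) measures the duration from the earliest year instead of the first row.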

Related

Creating a new variable from the ranges of another column in which the ranges change - R

I am a beginner in R so sorry if it is a very simple question. I looked but I could not find the same problem.
I want to create a new variable from the ranges of another column in R but the ranges are not the same for each row.
To be more specific, my data covers the years 1960 - 2000 and I have value ranges for employment. For 1960 to 1980 a teacher is 1 and a lawyer is 2, etc. For 1980 - 1990 a teacher is in the value range 1-29 and a lawyer is 50-89, etc. Finally, for 1990-2000, the value range for the teacher is 40-65 and for the lawyer it is 1-39.
I don't even know how to begin (teacher and lawyer are not the only occupations; there are 10 different occupations with overlapping value ranges for different years, which makes it very confusing for me).
I would appreciate your help. Thank you very much.
Here are a couple of approaches to get you started.
First, say you have a data frame with year and occupation_code:
df1 <- data.frame(
year = c(1965, 1985, 1995),
occupation_code = c(1, 2, 3)
)
  year occupation_code
1 1965               1
2 1985               2
3 1995               3
Then, create a second data frame which will clearly indicate the year ranges and occupation code ranges with each occupation. You can include all of your occupations here.
df2 <- data.frame(
year_start = c(1960, 1960, 1980, 1980, 1990, 1990),
year_end = c(1980, 1980, 1990, 1990, 2000, 2000),
occupation_code_start = c(1, 2, 1, 50, 40, 1),
occupation_code_end = c(1, 2, 29, 89, 65, 39),
occupation = c("teacher", "lawyer", "teacher", "lawyer", "teacher", "lawyer")
)
  year_start year_end occupation_code_start occupation_code_end occupation
1       1960     1980                     1                   1    teacher
2       1960     1980                     2                   2     lawyer
3       1980     1990                     1                  29    teacher
4       1980     1990                    50                  89     lawyer
5       1990     2000                    40                  65    teacher
6       1990     2000                     1                  39     lawyer
Then, you can merge the two together.
One approach is with data.table package.
library(data.table)
setDT(df1)
setDT(df2)
# non-equi join: keep the df2 row whose year and code ranges contain each df1 row
df2[df1,
    on = .(year_start <= year,
           year_end >= year,
           occupation_code_start <= occupation_code,
           occupation_code_end >= occupation_code),
    .(year, occupation = occupation)]
This will give you:
   year occupation
1: 1965    teacher
2: 1985    teacher
3: 1995     lawyer
Another approach is with fuzzyjoin and tidyverse:
library(tidyverse)
library(fuzzyjoin)
fuzzy_left_join(df1, df2,
                by = c("year" = "year_start",
                       "year" = "year_end",
                       "occupation_code" = "occupation_code_start",
                       "occupation_code" = "occupation_code_end"),
                match_fun = list(`>=`, `<=`, `>=`, `<=`)) %>%
  select(year, occupation)
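For readers without data.table or fuzzyjoin, the same range lookup can be sketched in base R as a cross join followed by a filter. This is only an illustration under the assumptions of the example data above (three rows in df1, six ranges in df2); a cross join grows as rows(df1) times rows(df2), so it may not suit large data.

```r
df1 <- data.frame(
  year = c(1965, 1985, 1995),
  occupation_code = c(1, 2, 3)
)
df2 <- data.frame(
  year_start = c(1960, 1960, 1980, 1980, 1990, 1990),
  year_end = c(1980, 1980, 1990, 1990, 2000, 2000),
  occupation_code_start = c(1, 2, 1, 50, 40, 1),
  occupation_code_end = c(1, 2, 29, 89, 65, 39),
  occupation = c("teacher", "lawyer", "teacher", "lawyer", "teacher", "lawyer"),
  stringsAsFactors = FALSE
)

# by = NULL makes merge() return every combination of df1 and df2 rows
all_pairs <- merge(df1, df2, by = NULL)

# Keep only the pairs where both the year and the code fall inside the range
inside <- with(all_pairs,
               year >= year_start & year <= year_end &
                 occupation_code >= occupation_code_start &
                 occupation_code <= occupation_code_end)
res <- all_pairs[inside, c("year", "occupation")]
res <- res[order(res$year), ]
print(res)
```

Note that the year ranges share their boundaries (1980 ends one range and starts the next), so a row dated exactly 1980 or 1990 could match two ranges with any of these approaches; making one end of each range exclusive would avoid that.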

Aggregating and counting elements in the variables of a dataset

I may not have searched with the right terms, so apologies if this has already been asked.
I have a dataset with multiple columns:
helena <-
Year US$ Euros Country Regions
2001 12 13 US America
2000 13 15 UK Europe
2003 14 19 China Asia
I want to group the dataset so that, for each region and year, I get the total earnings plus a column showing how many countries reported data in that region that year:
helena <-
Year US$ Euros Regions Number of countries per region per Year
2000 150 135 America 2
2001 135 151 Europe 15
2002 142 1900 Asia 18
I have tried
count(helena, c("Regions", "Year"))
but it does not work properly, since the result includes only the columns indicated.
Here is the data.table way. I have added a row for Canada for the year 2000 to test the code:
library(data.table)
df <- data.frame(Year = c(2000, 2001, 2003,2000),
US = c(13, 12, 14,13),
Euros = c(15, 13, 19,15),
Country = c('US', 'UK', 'China','Canada'),
Regions = c('America', 'Europe', 'Asia','America'))
df <- data.table(df)
df[,
.(sum_US = sum(US),
sum_Euros = sum(Euros),
number_of_countries = uniqueN(Country)),
.(Regions, Year)]
   Regions Year sum_US sum_Euros number_of_countries
1: America 2000     26        30                   2
2:  Europe 2001     12        13                   1
3:    Asia 2003     14        19                   1
With dplyr:
library(dplyr)
your_data %>%
group_by(Regions, Year) %>%
summarize(
US = sum(US),
Euros = sum(Euros),
N_countries = n_distinct(Country)
)
Using dplyr (group_by and summarise come from dplyr, not tidyr):
library(dplyr)
df %>%
  group_by(Regions, Year) %>%
  summarise(Earnings_US = sum(`US$`),
            Earnings_Euros = sum(Euros),
            N_Countries = length(Country))
This aggregates the data set by region and year, summing the earnings columns and taking the length of the Country column (which assumes each country appears only once per region and year).
Using tidyverse and building the example
library(tidyverse)
df <- tibble(Year = c(2000, 2001, 2003,2000),
US = c(13, 12, 14,13),
Euros = c(15, 13, 19,15),
Country = c('US', 'UK', 'China','Canada'),
Regions = c('America', 'Europe', 'Asia','America'))
df %>%
group_by(Regions, Year) %>%
summarise(US = sum(US),
Euros = sum(Euros),
Countries = n_distinct(Country))
updated to reflect the data in the original question
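For comparison, the same summary can be sketched in base R with aggregate() and merge(). This is an illustrative alternative rather than any of the answers above; the column name N_countries is chosen here for the example, and the data is the four-row test set with the added Canada row.

```r
df <- data.frame(
  Year = c(2000, 2001, 2003, 2000),
  US = c(13, 12, 14, 13),
  Euros = c(15, 13, 19, 15),
  Country = c("US", "UK", "China", "Canada"),
  Regions = c("America", "Europe", "Asia", "America"),
  stringsAsFactors = FALSE
)

# Sum both earnings columns per region and year
sums <- aggregate(cbind(US, Euros) ~ Regions + Year, data = df, FUN = sum)

# Count distinct countries per region and year
counts <- aggregate(Country ~ Regions + Year, data = df,
                    FUN = function(x) length(unique(x)))
names(counts)[names(counts) == "Country"] <- "N_countries"

# Combine the two summaries on the grouping columns
result <- merge(sums, counts, by = c("Regions", "Year"))
print(result)
```

Counting with length(unique(x)) rather than length(x) keeps the count correct even if a country appears more than once in a region and year.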

Replace missing values with mean for subsets of dataframe

I have a data frame titled final_project_data with the following structure. It includes 17 columns with data that corresponds to the county/ State and years. For example, Baldwin county in Alabama in 2006 had a population of 69162, an unemployment rate of 4.2% etc.
ID County State Population Year Ump.Rate Fertility
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1003 Baldwin County Alabama 69162 2006 4.2 88
1015 Calhoun County Alabama 112903 2006 2.4 na
1043 Baldwin County Alabama na 2007 1.9 71
1049 Calhoun County Alabama 68014 2007 na 90
1050 CountyY Alaska 2757 2006 3.9 na
1070 CountyZ Alaska 11000 2006 7.8 95
1081 CountyY Alaska na 2007 6.5 70
1082 CountyZ Alaska 67514 2007 4.5 60
There are a number of columns with missing values in them, which I am trying to replace with the mean for the given State and Year. I am running into issues trying to loop over each column with missing values and then each subset of years and rows to fill in the missing values with the mean. The code I have thus far is below:
#get list of unique states
states <- unique(final_project_data$State)
#get list of columns with na in them - we will use this to impute missing values
list_na <- colnames(final_project_data)[ apply(final_project_data, 2, anyNA) ]
list_na
#create a place to hold the missing values
average_missing <- c()
#Loop through each state to impute the missing values with the mean
for(i in 1:length(states)){
average_missing <- apply(final_project_data[which(final_project_data$State == states[i]),colnames(final_project_data) %in% list_na], 2, mean, na.rm = TRUE)
}
average_missing
However, when I run the above bit of code, I only get one set of values for each of the columns with missing values, not for a different value for every state. I am also not sure how to extend this to include years. Any help or advice would be appreciated!
In a for loop:
dt <- data.frame(
ID = c(1003, 1015, 1043, 1049, 1050, 1070, 1081, 1082, NA, NA),
State = c(rep("Alabama", 4), rep("Alaska", 4), "Alabama", "Alaska"),
Population = c(sample(10000:100000, 8, replace = T), NA, NA),
Year = c(2006, 2006, 2007, 2007, 2006, 2006, 2007, 2007, 2007, 2006),
Unemployment = c(sample(1:5, 8, replace = T), NA, NA)
)
# index through each row in data frame
for (i in 1:nrow(dt)){
# if Population variable is NA
if(is.na(dt$Population[i]) == T){
# calculate mean from all Population variables with the same State and Year as index
dt$Population[i] <- mean(dt$Population[which(dt$State == dt$State[i] & dt$Year == dt$Year[i])], na.rm = T)
}
# repeat for Unemployment variable
if(is.na(dt$Unemployment[i]) == T){
dt$Unemployment[i] <- mean(dt$Unemployment[which(dt$State == dt$State[i] & dt$Year == dt$Year[i])], na.rm = T)
}
}
Here's a dplyr version without a loop. Just add all the columns you want to transform inside vars():
your_data %>%
group_by(State, Year) %>%
mutate_at(vars(Population, Ump.Rate, Fertility),
~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
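To make the group-wise imputation concrete, here is a small base R sketch on made-up numbers. The data frame d and the helper impute_mean() are hypothetical names introduced for this example; ave() applies the helper separately to each State and Year combination, which mirrors what group_by() plus mutate_at() does above.

```r
# Toy data: two states, two years, some missing populations (values are invented)
d <- data.frame(
  State = c("Alabama", "Alabama", "Alabama", "Alabama",
            "Alaska", "Alaska", "Alaska", "Alaska"),
  Year = c(2006, 2006, 2007, 2007, 2006, 2006, 2007, 2007),
  Population = c(100, NA, 300, 500, NA, 40, 60, 80),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NAs in a vector with the mean of its non-NA values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper within each State/Year group
d$Population <- ave(d$Population, d$State, d$Year, FUN = impute_mean)
print(d)
```

One caveat applies to every version in this thread: if all values in a State and Year group are missing, the group mean is NaN, so those rows stay effectively missing.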

Changing observation's name using dplyr

Suppose I have this dataset:
Variable <- c("GDP")
Country <- c("Brazil", "Chile")
df <- data.frame(Variable, Country)
I want to change "GDP" to "<Country> GDP", i.e., "Brazil GDP" and "Chile GDP".
I have a much larger dataset and I've been trying to do this by using
df %>% mutate(Variable = replace(Variable, Variable == "GDP", paste(Country, "GDP")))
However, it will print the first observation of variable "Country" for every observation in "Variable" that meets the conditional. Is there any way to make paste() use the value of Country on the row it is applying to?
I've tried to use rowwise() and it did not work. I've tried the following code as well and encountered the same problem
df %>% mutate(Country = ifelse(Country == "Chile", replace(Variable, Variable == "GDP",
paste(Country, "GDP")), Variable))
Thanks to everyone!
EDIT
I can't simply use unite because I still need the variable Country. So a workaround that I found was (I had several other observations that I needed to change their names)
df %>%
  mutate(Variable2 = ifelse(Variable == "GDP", paste0(Country, " ", Variable), Variable)) %>%
  mutate(Variable2 = replace(Variable2, Variable2 == "CR", "Country Risk")) %>%
  mutate(Variable2 = replace(Variable2, Variable2 == "EXR", "Exchange Rate")) %>%
  mutate(Variable2 = replace(Variable2, Variable2 == "INTR", "Interest Rate")) %>%
  select(-Variable) %>%
  select(Horizon, Variable = Variable2, Response, Low, Up, Shock, Country, Status)
EDIT 2
My desired output was
Horizon  Variable    Response  Shock  Country
1        Brazil GDP  0.0037    PCOM   Brazil
2        Brazil GDP  0.0060    PCOM   Brazil
3        Brazil GDP  0.0053    PCOM   Brazil
4        Brazil GDP  0.0033    PCOM   Brazil
5        Brazil GDP  0.0021    PCOM   Brazil
6        Brazil GDP  0.0020    PCOM   Brazil
This example should help:
library(tidyr)
library(dplyr)
Variable <- c("GDP")
Country <- c("Brazil", "Chile")
value = c(5,10)
df <- data.frame(Variable, Country, value)
# original data
df
# Variable Country value
# 1 GDP Brazil 5
# 2 GDP Chile 10
# update
df %>% unite(NewGDP, Variable, Country)
# NewGDP value
# 1 GDP_Brazil 5
# 2 GDP_Chile 10
If you want to use paste you can do:
df %>% mutate(NewGDP = paste0(Country,"_",Variable))
# Variable Country value NewGDP
# 1 GDP Brazil 5 Brazil_GDP
# 2 GDP Chile 10 Chile_GDP
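On why the replace() attempt in the question misaligns: replace(x, cond, values) fills the selected positions with values in order, so a full-length paste(Country, "GDP") vector does not line up row by row when only some rows match. The base R sketch below keeps the Country column and renames per row instead. The extra rows and the labels lookup vector are assumptions made for this example, built from the codes mentioned in the question's edit.

```r
df <- data.frame(
  Variable = c("GDP", "GDP", "CR", "EXR"),
  Country = c("Brazil", "Chile", "Chile", "Brazil"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup for the other codes mentioned in the question
labels <- c(CR = "Country Risk", EXR = "Exchange Rate", INTR = "Interest Rate")

is_gdp <- df$Variable == "GDP"
has_label <- df$Variable %in% names(labels)

df$Variable2 <- df$Variable
# Subset both sides with the same index so each row uses its own Country
df$Variable2[is_gdp] <- paste(df$Country[is_gdp], "GDP")
# Named-vector lookup replaces the chain of replace() calls
df$Variable2[has_label] <- unname(labels[df$Variable[has_label]])
print(df)
```

The ifelse() call in the question's own workaround achieves the same row-wise behaviour, because ifelse() evaluates its arguments element by element.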

Divide case by population

In the table2 dataset from the tidyr package, we have:
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
How do I code this so that I can divide the count for type cases by the count for type population and then multiply by 10000? (Yes, this is a question from R for Data Science by Hadley Wickham.)
I've thought of:
sum_1 <- vector()
for (i in seq(1, nrow(table2), by = 2)) {
  # odd rows hold cases; the row right after each one holds population
  sum_1 <- c(sum_1, (table2$count[i] / table2$count[i + 1]) * 10000)
}
Assuming there are only two values of 'type' for each 'country' and 'year', group by 'country' and 'year', arrange by 'type' (in case the order differs), and divide the first value of 'count' by the last value of 'count' to create 'newcol':
library(dplyr)
table2 %>%
group_by(country, year) %>%
arrange(country, year, type) %>%
mutate(newcol = 10000*first(count)/last(count))
If we need only a summarised output, replace mutate with summarise.
If there are other values in 'type' besides 'cases' and 'population', subset 'count' with a logical index:
table2 %>%
group_by(country, year) %>%
mutate(newcol = 10000*count[type=="cases"]/count[type=="population"])
Here, too, the assumption is that there is exactly one 'cases' row and one 'population' row for each 'country' and 'year'.
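Another way to look at the problem is to put cases and population side by side first and then divide. The sketch below does this in base R on the four Afghanistan rows shown in the question; the data frame t2 and the column names count_cases and count_population are chosen here for illustration and are not part of table2.

```r
# First four rows of table2 as shown in the question
t2 <- data.frame(
  country = rep("Afghanistan", 4),
  year = c(1999L, 1999L, 2000L, 2000L),
  type = c("cases", "population", "cases", "population"),
  count = c(745, 19987071, 2666, 20595360),
  stringsAsFactors = FALSE
)

# Split by type, then join the two halves on country and year
cases <- t2[t2$type == "cases", c("country", "year", "count")]
pop <- t2[t2$type == "population", c("country", "year", "count")]
wide <- merge(cases, pop, by = c("country", "year"),
              suffixes = c("_cases", "_population"))

# Rate per 10,000 people
wide$rate <- 10000 * wide$count_cases / wide$count_population
print(wide)
```

Because the join matches on country and year, this version does not depend on row order, and a group with a missing 'cases' or 'population' row simply drops out of the result instead of producing a wrong ratio.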
