Using mutate and group_by to roll an operation over rows - r

I have the following data:
country year sales
--------------------------
Afghanistan 1950 30
Afghanistan 1951 35
Albania 1950 0
Albania 1951 5
total 1950 30
total 1951 40
I want to generate a new column, ratio, which is the ratio of sales for any given country-year combination to the total for that year. So the output should be:
country year sales ratio
---------------------------------
Afghanistan 1950 30 1
Afghanistan 1951 35 0.875
Albania 1950 0 0
Albania 1951 5 0.125
total 1950 30 1
total 1951 40 1
I'd like to use tidyverse (which I am somewhat new to) to accomplish this, but I'm still somewhat confused about how to use mutate and group_by to accomplish this (or even if that is the best way to go about this task in general).
I tried unsuccessfully to use the advice given in this thread. What I have tried is:
library(tidyverse)
df <- df %>%
group_by(year) %>%
mutate(ratio = sales[country]/sales[country == "total"])
But this generates a column called ratio full of NAs. Do I need to use a loop or something else? I'm somewhat new to R and I will admit I have avoided loops up until now. Looking over documentation on loops, I couldn't quite think of how I would use one to run over each country-year combination and generate a new column.

You can group by year and then divide sales by the maximum of sales within each year, which I suppose is the total row.
library(dplyr)
df %>%
group_by(year) %>%
mutate(ratio = sales / max(sales))
# A tibble: 6 x 4
# Groups: year [2]
# country year sales ratio
# <chr> <int> <int> <dbl>
#1 Afghanistan 1950 30 1
#2 Afghanistan 1951 35 0.875
#3 Albania 1950 0 0
#4 Albania 1951 5 0.125
#5 total 1950 30 1
#6 total 1951 40 1
In base R
transform(df, ratio = ave(sales, year, FUN = function(x) x / max(x)))
Or with data.table
library(data.table)
setDT(df)[, ratio := sales / max(sales), by = year][]
data
df <- structure(list(country = c("Afghanistan", "Afghanistan", "Albania",
"Albania", "total", "total"), year = c(1950L, 1951L, 1950L, 1951L,
1950L, 1951L), sales = c(30L, 35L, 0L, 5L, 30L, 40L)), .Names = c("country",
"year", "sales"), class = "data.frame", row.names = c(NA, -6L
))
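The attempt in the question returns NAs because sales[country] indexes an unnamed vector by country names, which match nothing. A minimal sketch of a variant that stays close to that attempt and divides by the "total" row explicitly, rather than relying on the total being the maximum; it assumes every year has exactly one row with country == "total":

```r
library(dplyr)

# Same data as above, rebuilt here so the example is self-contained
df <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Albania", "Albania", "total", "total"),
  year = c(1950L, 1951L, 1950L, 1951L, 1950L, 1951L),
  sales = c(30L, 35L, 0L, 5L, 30L, 40L),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(year) %>%
  # within each year, pick the sales value of the "total" row as the denominator
  mutate(ratio = sales / sales[country == "total"]) %>%
  ungroup()

res
```

If a year had no "total" row, the denominator would be empty and mutate() would raise an error, which is arguably safer than silently dividing by the wrong value.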

Related

Is it possible to interpolate a list of dataframes in r?

According to the answer of lhs,
https://stackoverflow.com/a/72467827/11124121
#From lhs
library(tidyverse)
data("population")
# create some data to interpolate
population_5 <- population %>%
filter(year %% 5 == 0) %>%
mutate(female_pop = population / 2,
male_pop = population / 2)
interpolate_func <- function(variable, data) {
data %>%
group_by(country) %>%
# can't interpolate if only one year
filter(n() >= 2) %>%
group_modify(~as_tibble(approx(.x$year, .x[[variable]],
xout = min(.x$year):max(.x$year)))) %>%
set_names(c("country", "year", paste0(variable, "_interpolated"))) %>%
ungroup()
}
The years that already exist in the data, i.e. 2000 and 2005, are also interpolated. I want to keep the original data and only interpolate the missing parts, that is,
2001-2004 ; 2006-2009
Therefore, I would like to construct a list:
population_5_list = list(population_5 %>% filter(year %in% c(2000,2005)),population_5 %>% filter(year %in% c(2005,2010)))
And impute the dataframes in the list one by one.
However, an error appeared:
Error in UseMethod("group_by") :
no applicable method for 'group_by' applied to an object of class "list"
I am wondering how I can change interpolate_func into purrr format so that it can be applied to a list.
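We need to loop over the list with map
library(purrr)
library(dplyr)
# vars_to_interpolate is assumed to hold the names of the columns to interpolate,
# e.g. c("population", "female_pop", "male_pop")
map(population_5_list,
~ map(vars_to_interpolate, interpolate_func, data = .x) %>%
reduce(full_join, by = c("country", "year")))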
-output
[[1]]
# A tibble: 1,266 × 5
country year population_interpolated female_pop_interpolated male_pop_interpolated
<chr> <int> <dbl> <dbl> <dbl>
1 Afghanistan 2000 20595360 10297680 10297680
2 Afghanistan 2001 21448459 10724230. 10724230.
3 Afghanistan 2002 22301558 11150779 11150779
4 Afghanistan 2003 23154657 11577328. 11577328.
5 Afghanistan 2004 24007756 12003878 12003878
6 Afghanistan 2005 24860855 12430428. 12430428.
7 Albania 2000 3304948 1652474 1652474
8 Albania 2001 3283184. 1641592. 1641592.
9 Albania 2002 3261421. 1630710. 1630710.
10 Albania 2003 3239657. 1619829. 1619829.
# … with 1,256 more rows
# ℹ Use `print(n = ...)` to see more rows
[[2]]
# A tibble: 1,278 × 5
country year population_interpolated female_pop_interpolated male_pop_interpolated
<chr> <int> <dbl> <dbl> <dbl>
1 Afghanistan 2005 24860855 12430428. 12430428.
2 Afghanistan 2006 25568246. 12784123. 12784123.
3 Afghanistan 2007 26275638. 13137819. 13137819.
4 Afghanistan 2008 26983029. 13491515. 13491515.
5 Afghanistan 2009 27690421. 13845210. 13845210.
6 Afghanistan 2010 28397812 14198906 14198906
7 Albania 2005 3196130 1598065 1598065
8 Albania 2006 3186933. 1593466. 1593466.
9 Albania 2007 3177735. 1588868. 1588868.
10 Albania 2008 3168538. 1584269. 1584269.
# … with 1,268 more rows
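As a side note on the concern about overwriting existing years: approx() performs linear interpolation by default, so the values it returns at the supplied x points are the supplied y values themselves. A small base R sketch with made-up numbers (known_years and known_pop are hypothetical, not taken from the population data) illustrates this:

```r
# Hypothetical population figures for one country, known every five years
known_years <- c(2000, 2005, 2010)
known_pop <- c(100, 150, 180)

# Interpolate over every year in the range, including the known ones
filled <- approx(known_years, known_pop, xout = 2000:2010)
result <- data.frame(year = filled$x, population_interpolated = filled$y)

# Rows for the years that were already present in the input
original_rows <- result[result$year %in% known_years, ]
original_rows
```

The rows for 2000, 2005 and 2010 carry the input values unchanged, so under this assumption interpolating the full data frame once and interpolating it piecewise over a list should agree at the known years.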

r - Select a numerical value in a Matrix which is based of another string dataset (Distance between two countries)

I have a dataset(1) of investments where the target and the host country are listed.
In addition, I have a matrix(2) which shows the distance between all countries.
Right now I would like to add a column in the first data set which contains the distance between the target and home country in each row.
The first data set looks like this (with values below):
targetC year Comp_id homeC sales assets profit Distance_Target_Home (this column would be the goal)
ABW 2008 AL8234 ALB 74839 75342 976857 8543
and the second (with the distance in between the countries):
      ABW   AFG   AGO   ALB   ANT
ABW         3455  2456  8543  1342
AFG
AGO
ALB
ANT
Thanks a lot
Assuming that df_distances is your second dataframe (with distances), you can reshape it to long format like this:
## for testing you can use this minimal distance dataframe:
df_distances <- structure(list(ABW = c(9L, 3L), AFG = c(10L, 4L)), class = "data.frame", row.names = c("ABW", "AFG"))
## > df_distances
##
## ABW AFG
## ABW 9 10
## AFG 3 4
... now reshape (stack) df_distances to long format so that homeC and targetC receive their own column each:
library(tibble) # provides rownames_to_column()
library(tidyr)
library(dplyr)
df_distances <- df_distances %>%
rownames_to_column('targetC') %>%
pivot_longer(cols = -1,
names_to = 'homeC',
values_to = 'Distance_Target_Home'
)
## > df_distances
## # A tibble: 4 x 3
## targetC homeC Distance_Target_Home
## <chr> <chr> <int>
## 1 ABW ABW 9
## 2 ABW AFG 10
## 3 AFG ABW 3
## 4 AFG AFG 4
... join with df1:
df1 %>% left_join(df_distances, by = c("targetC", "homeC"))
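If the distances are kept as a matrix with country codes as row and column names, the lookup can also be done without reshaping, by indexing the matrix with a two-column character matrix of (row name, column name) pairs. The sketch below uses a small made-up matrix; dist_mat and all its values except the ABW/ALB and ABW/AFG entries from the question are hypothetical:

```r
countries <- c("ABW", "AFG", "ALB")

# Hypothetical symmetric distance matrix with named rows and columns
dist_mat <- matrix(
  c(   0, 3455, 8543,
    3455,    0, 4000,
    8543, 4000,    0),
  nrow = 3, byrow = TRUE,
  dimnames = list(countries, countries)
)

df1 <- data.frame(
  targetC = c("ABW", "AFG"),
  homeC = c("ALB", "ABW"),
  stringsAsFactors = FALSE
)

# Each row of the index matrix is one (targetC, homeC) pair to look up
df1$Distance_Target_Home <- dist_mat[cbind(df1$targetC, df1$homeC)]
df1
```

This assumes every code in df1 appears in the dimnames of the matrix; a code that is missing there would make the indexing fail with a subscript error.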

Revaluing many observations with a for loop in R

I have a data set where I am looking at longitudinal data for countries.
master.set <- data.frame(
Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
Country.ID = c(rep("Afghanistan", 3), rep("Albania", 3)),
Year = c(2015, 2016, 2017, 2015, 2016, 2017),
Happiness.Score = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
GDP.PPP = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
GINI = NA,
Status = 2,
stringsAsFactors = F
)
> head(master.set)
Country Country.ID Year Happiness.Score GDP.PPP GINI Status
1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
4 Albania Albania 2015 4.959 10971.044 NA 2
5 Albania Albania 2016 4.655 11356.717 NA 2
6 Albania Albania 2017 4.644 11803.282 NA 2
I created that Country.ID variable with the intent of turning them into numerical values 1:159.
I am hoping to avoid doing something like this to replace the value at each individual observation:
master.set$Country.ID <- master.set$Country.ID[master.set$Country.ID == "Afghanistan"] <- 1
As I implied, there are 159 countries listed in the data set. Because it's longitudinal, there are 460 observations.
Is there any way to use a for loop to save me a lot of time? Here is what I attempted. I made a couple of lists and attempted to use an ifelse command to tell R to label each country the next number.
Here is what I have:
#List of country names
N.Countries <- length(unique(master.set$Country))
Country <- unique(master.set$Country)
Country.ID <- unique(master.set$Country.ID)
CountryList <- unique(master.set$Country)
#For Loop to make Country ID numerically match Country
for (i in 1:460){
for (j in N.Countries){
master.set[[Country.ID[i]]] <- ifelse(master.set[[Country[i]]] == CountryList[j], j, master.set$Country)
}
}
I received this error:
Error in `[[<-.data.frame`(`*tmp*`, Country.ID[i], value = logical(0)) :
replacement has 0 rows, data has 460
Does anyone know how I can accomplish this task? Or will I be stuck using the ifelse command 159 times?
Thanks!
Maybe something like
master.set$Country.ID <- as.numeric(as.factor(master.set$Country.ID))
Or alternatively, using dplyr
library(tidyverse)
master.set <- master.set %>% mutate(Country.ID = as.numeric(as.factor(Country.ID)))
Or this, which creates a new variable Country.ID2 based on a key-value pair between Country and 1:length(unique(Country)).
library(tidyverse)
master.set <- left_join(master.set,
data.frame( Country = unique(master.set$Country),
Country.ID2 = 1:length(unique(master.set$Country))))
master.set
#> Country Country.ID Year Happiness.Score GDP.PPP GINI Status
#> 1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
#> 2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
#> 3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
#> 4 Albania Albania 2015 4.959 10971.044 NA 2
#> 5 Albania Albania 2016 4.655 11356.717 NA 2
#> 6 Albania Albania 2017 4.644 11803.282 NA 2
#> Country.ID2
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 2
#> 6 2
Another option is to number the groups with dplyr:
library(dplyr)
df<-data.frame("Country"=c("Afghanistan","Afghanistan","Afghanistan","Albania","Albania","Albania"),
"Year"=c(2015,2016,2017,2015,2016,2017),
"Happiness.Score"=c(3.575,3.360,3.794,4.959,4.655,4.644),
"GDP.PPP"=c(1766.593,1757.023,1758.466,10971.044,11356.717,11803.282),
"GINI"=NA,
"Status"=rep(2,6))
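# group_indices_() is deprecated in current dplyr; cur_group_id() numbers the groups instead
df1<-df %>% arrange(Country) %>% group_by(Country) %>% mutate(Country_id = cur_group_id()) %>% ungroup()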
View(df1)
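A base R alternative that needs no loop is match() against the unique country names. Unlike the factor approach, which numbers countries alphabetically, this numbers them in order of first appearance. The sketch uses a small hypothetical data frame (toy.set) rather than the full master.set:

```r
# Hypothetical data where first appearance differs from alphabetical order
toy.set <- data.frame(
  Country = c("Norway", "Norway", "Albania", "Albania", "Norway"),
  stringsAsFactors = FALSE
)

# IDs by order of first appearance: Norway = 1, Albania = 2
toy.set$Country.ID <- match(toy.set$Country, unique(toy.set$Country))

# IDs by alphabetical order of the factor levels: Albania = 1, Norway = 2
toy.set$Country.ID.alpha <- as.integer(factor(toy.set$Country))

toy.set
```

Which numbering is preferable depends on whether the IDs need to follow the order of the data or the alphabet; both give one stable integer per country.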

revising the values of a variable in a data frame [duplicate]

I want to revise the values of a variable. The values are years, running from 1960 to 2017, and each year appears multiple times, once per country (the countries are stored in another column). However, each year is prefixed with an X, e.g. 1960 appears as X1960, and so on up to X2017. I want to remove the X from all years.
The data looks like this:
Country Year GDP
Afghanistan X1960
England X1960
Sudan X1960
.
.
.
Afghanistan X2017
England X2017
Sudan X2017
.
.
You can apply the gsub function to the year column of your data frame
ABC <- data.frame(country = c("Afghanistan", "England"), year = c("X1960","X1960"))
print(ABC)
country year
1 Afghanistan X1960
2 England X1960
ABC$year <- gsub("X","",ABC$year)
print(ABC)
country year
1 Afghanistan 1960
2 England 1960
Here's a tidyverse solution.
# Load libraries
library(dplyr)
library(readr)
# Dummy data frame
df <- data.frame(country = c("Afghanistan", "England", "Sudan"),
year = rep("X1960", 3),
stringsAsFactors = FALSE)
# Quick peek
print(df)
#> country year
#> 1 Afghanistan X1960
#> 2 England X1960
#> 3 Sudan X1960
# Strip all non-numerics from strings
df %>% mutate(year = parse_number(year))
#> country year
#> 1 Afghanistan 1960
#> 2 England 1960
#> 3 Sudan 1960
Created on 2019-05-23 by the reprex package (v0.2.1)
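Note that gsub("X", "", ...) leaves the years as character strings and would also remove an X anywhere else in the value. A slightly stricter base R sketch, assuming every value has the form "X" followed by a four-digit year, anchors the pattern at the start and converts the result to integer:

```r
ABC <- data.frame(
  country = c("Afghanistan", "England", "Sudan"),
  year = c("X1960", "X1960", "X2017"),
  stringsAsFactors = FALSE
)

# "^X" matches only a leading X; as.integer() turns "1960" into the number 1960
ABC$year <- as.integer(sub("^X", "", ABC$year))
ABC
```

Values that do not fit the assumed pattern would become NA with a warning, which makes unexpected entries easy to spot.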

reshape data from wide to long with multiple rows

I have a dataset dfs that I would like to reshape
dfs
# country.name indicator.name x1990 x1991 x1992
# 507 andorra GDP at market prices (current US$) 1.028989e+09 1.106891e+09 1.209993e+09
# 510 andorra GDP growth (annual %) 3.781393e+00 2.546001e+00 9.292154e-01
# 1347 albania GDP at market prices (current US$) 2.101625e+09 1.139167e+09 7.094526e+08
# 1350 albania GDP growth (annual %) -9.575640e+00 -2.958900e+01 -7.200000e+00
# 3587 austria GDP at market prices (current US$) 1.660624e+11 1.733755e+11 1.946082e+11
I would like the indicator names to become columns and the years to be collected in a single time column.
# country time gdp_market gdp_growth
# 1 andorra 1990 1028989394 3.7813935
# 2 andorra 1990 1106891025 2.5460006
# 3 andorra 1990 1209992650 0.9292154
# 4 albania 1991 2101624963 3.7813935
# 5 albania 1991 1139166646 2.5460006
# 6 albania 1991 709452584 0.9292154
# 7 austria 1992 166062376740 NA
# 8 austria 1992 173375508073 NA
# 9 austria 1992 194608183696 NA
I can melt the data into long format, but I can't separate the indicators into two columns
library(reshape2)
melt.dfs <- melt(dfs, id=1:2)
I could do a split and cbind, but I'd prefer to do it with reshape. Thanks
dfs = structure(list(country.name = c("andorra", "andorra", "albania",
"albania", "austria"), indicator.name = c("GDP at market prices (current US$)",
"GDP growth (annual %)", "GDP at market prices (current US$)",
"GDP growth (annual %)", "GDP at market prices (current US$)"
), x1990 = c(1028989393.70295, 3.78139347786568, 2101624962.5,
-9.57564018741695, 166062376739.683), x1991 = c(1106891024.78653,
2.54600064090229, 1139166645.83333, -29.5889976817695, 173375508073.07
), x1992 = c(1209992649.56688, 0.929215382801402, 709452583.880319,
-7.19999998650893, 194608183696.469)), .Names = c("country.name",
"indicator.name", "x1990", "x1991", "x1992"), row.names = c(507L,
510L, 1347L, 1350L, 3587L), class = "data.frame")
We can use
library(dplyr)
library(tidyr)
gather(dfs, time, Val, x1990:x1992) %>%
spread(indicator.name, Val)
EDIT: Based on comments from @docendo discimus
Or using recast
library(reshape2)
recast(dfs, measure = 3:5, ...~indicator.name, value.var='value')
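In more recent versions of tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following sketch shows the same two-step reshape with those functions; the data frame is rebuilt with values rounded from the printout in the question, and the intermediate column name value is just an illustrative choice:

```r
library(dplyr)
library(tidyr)

# Values rounded from the printed data in the question
dfs <- data.frame(
  country.name = c("andorra", "andorra", "albania", "albania", "austria"),
  indicator.name = c("GDP at market prices (current US$)", "GDP growth (annual %)",
                     "GDP at market prices (current US$)", "GDP growth (annual %)",
                     "GDP at market prices (current US$)"),
  x1990 = c(1.028989e9, 3.781393, 2.101625e9, -9.575640, 1.660624e11),
  x1991 = c(1.106891e9, 2.546001, 1.139167e9, -29.58900, 1.733755e11),
  x1992 = c(1.209993e9, 0.9292154, 7.094526e8, -7.200000, 1.946082e11),
  stringsAsFactors = FALSE
)

wide <- dfs %>%
  # stack the year columns; names_prefix strips the leading "x"
  pivot_longer(cols = x1990:x1992, names_to = "time",
               names_prefix = "x", values_to = "value") %>%
  # one column per indicator
  pivot_wider(names_from = indicator.name, values_from = value) %>%
  mutate(time = as.integer(time))

wide
```

The result has one row per country and year; austria has no growth indicator in the input, so that column is NA for its three rows.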
