R | Adding index numbers - r

I have two datasets that look like this:
Sales
Region ReviewYear Sales Index
South Asia 2006 1.5 NA
South Asia 2009 4.5 NA
South Asia 2011 11 0
South Asia 2014 16.7 NA
Africa 2008 0.4 NA
Africa 2013 3.5 0
Africa 2017 9.7 NA
Strategy
Region StrategyYear
South Asia 2011
Africa 2013
Japan 2007
SE Asia 2009
There are multiple regions and many review years, which are not periodic and are not even the same across regions. I have added a column 'Index' to the dataframe 'Sales' so that the index value is zero in the row whose review year equals the strategy year from the second dataframe. I now want to replace each NA with a number that tells how many rows before or after the 0 row that particular row is, grouped by 'Region'.
I can do this using a for loop, but that is tedious, so I am checking whether there is a cleaner way. The final output should look like this:
Sales
Region ReviewYear Sales Index
South Asia 2006 1.5 -2
South Asia 2009 4.5 -1
South Asia 2011 11 0
South Asia 2014 16.7 1
Africa 2008 0.4 -1
Africa 2013 3.5 0
Africa 2017 9.7 1

Join the two datasets by Region, then for each Region create the Index column by subtracting, from each row number, the position of the row where ReviewYear matches StrategyYear.
library(dplyr)
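left_join(Sales, Strategy, by = 'Region') %>%
  # sort by review year within each region so row positions are chronological
  arrange(Region, ReviewYear) %>%
  group_by(Region) %>%
  # row position minus the position of the strategy-year row
  mutate(Index = row_number() - match(first(StrategyYear), ReviewYear))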
# Region ReviewYear Sales Index StrategyYear
# <chr> <int> <dbl> <int> <int>
#1 Africa 2008 0.4 -1 2013
#2 Africa 2013 3.5 0 2013
#3 Africa 2017 9.7 1 2013
#4 SouthAsia 2006 1.5 -2 2011
#5 SouthAsia 2009 4.5 -1 2011
#6 SouthAsia 2011 11 0 2011
#7 SouthAsia 2014 16.7 1 2011
data
Sales <- structure(list(Region = c("SouthAsia", "SouthAsia", "SouthAsia",
"SouthAsia", "Africa", "Africa", "Africa"), ReviewYear = c(2006L,
2009L, 2011L, 2014L, 2008L, 2013L, 2017L), Sales = c(1.5, 4.5,
11, 16.7, 0.4, 3.5, 9.7), Index = c(NA, NA, 0L, NA, NA, 0L, NA
)), class = "data.frame", row.names = c(NA, -7L))
Strategy <- structure(list(Region = c("SouthAsia", "Africa", "Japan", "SEAsia"
), StrategyYear = c(2011L, 2013L, 2007L, 2009L)), class = "data.frame",
row.names = c(NA, -4L))
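As a base R alternative, here is a minimal sketch that works directly from the existing Index column instead of joining. It assumes the rows are already sorted by ReviewYear within each Region and that every Region has exactly one row with Index equal to 0; the code does not check either assumption.

```r
# Sample data from the answer above
Sales <- data.frame(
  Region = c("SouthAsia", "SouthAsia", "SouthAsia", "SouthAsia",
             "Africa", "Africa", "Africa"),
  ReviewYear = c(2006L, 2009L, 2011L, 2014L, 2008L, 2013L, 2017L),
  Sales = c(1.5, 4.5, 11, 16.7, 0.4, 3.5, 9.7),
  Index = c(NA, NA, 0L, NA, NA, 0L, NA)
)

# Within each Region: row position minus the position of the 0 row
# (assumes one 0 per Region and rows sorted by ReviewYear)
Sales$Index <- ave(
  Sales$Index, Sales$Region,
  FUN = function(i) seq_along(i) - which(i == 0)
)
Sales$Index
```

If a Region might lack a 0 row, the join-based answer is the safer choice, because match() then returns NA for that Region instead of producing a zero-length result.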

Related

Error in unique.default(x, nmax = nmax) : unique() applies only to vectors by using with(data_frame, table(year, country))

My data looks like this:
year country
1990 USA
1991 USA
1991 UK
1990 UK
1991 USA
1991 UK
1992 USA
1992 UK
1992 UK
I am trying to execute the following line of code
Freq <- data.frame(with(data_frame, table(year, country)))
to get the frequency of each country in each year of my data set, which should look like this:
year country frequency
1990 USA 1
1991 USA 2
1992 USA 1
1990 UK 1
1991 UK 2
1992 UK 2
Until recently, the code worked fine and this is exactly the output I got.
Today I installed the RSelenium package, and since then executing the line above gives the error
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
I don't know if Selenium has something to do with it. Why does this happen, and how can I fix it? Any tips would be much appreciated.
I already read this question, but unfortunately it did not help.
P.S.: This is my first question here; sorry if it's a little rough.
An option with count
library(dplyr)
count(df1, year, country)
-output
year country n
1 1990 UK 1
2 1990 USA 1
3 1991 UK 2
4 1991 USA 2
5 1992 UK 2
6 1992 USA 1
data
df1 <- structure(list(year = c(1990L, 1991L, 1991L, 1990L, 1991L, 1991L,
1992L, 1992L, 1992L), country = c("USA", "USA", "UK", "UK", "USA",
"UK", "USA", "UK", "UK")), class = "data.frame", row.names = c(NA,
-9L))
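For comparison, here is a minimal base R sketch of the same count, run on the df1 sample data. It passes the two columns to table() through the data frame itself rather than by bare name. One possible explanation for the error in the question (an assumption; the post does not confirm it) is that year or country was not found in data_frame and therefore resolved to a function of the same name from a newly attached package. Calling table() on a function can fail with exactly this unique() message.

```r
df1 <- data.frame(
  year = c(1990L, 1991L, 1991L, 1990L, 1991L, 1991L, 1992L, 1992L, 1992L),
  country = c("USA", "USA", "UK", "UK", "USA", "UK", "USA", "UK", "UK")
)

# Cross-tabulate the two columns, then flatten to one row per combination
freq <- as.data.frame(table(df1[c("year", "country")]),
                      responseName = "frequency")
freq
```

Checking names(data_frame) before the call is a quick way to see whether the expected columns are really there.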

How to summarize two different rows with different values to a single row with that sum using dplyr?

I have the following data frame, but on a larger scale of course:
country year strain num_cases
mex 1996 sp_m014 412
mex 1996 sp_f014 214
mex 1998 sp_m014 150
mex 1998 sp_f014 200
usa 1996 sp_m014 200
usa 1996 sp_f014 180
usa 1997 sp_m014 190
usa 1997 sp_f014 150
I want to get the following result, that is the sum of sp_m014 (male) and sp_f014 (female) for mex and usa individually:
country year strain num_cases
mex 1996 sp 626
mex 1998 sp 350
usa 1996 sp 380
usa 1997 sp 340
In my real data frame I have many more age ranges; here I only show 014 for males and females. I want to summarize them this way for every age range and gender.
Thanks!
Group by 'country' and 'year', then in summarise set 'strain' to 'sp' and take the sum of 'num_cases'.
library(dplyr)
df1 %>%
group_by(country, year) %>%
summarise(strain = 'sp', num_cases = sum(num_cases), .groups = 'drop')
-output
# A tibble: 4 x 4
# country year strain num_cases
#* <chr> <int> <chr> <int>
#1 mex 1996 sp 626
#2 mex 1998 sp 350
#3 usa 1996 sp 380
#4 usa 1997 sp 340
data
df1 <- structure(list(country = c("mex", "mex", "mex", "mex", "usa",
"usa", "usa", "usa"), year = c(1996L, 1996L, 1998L, 1998L, 1996L,
1996L, 1997L, 1997L), strain = c("sp_m014", "sp_f014", "sp_m014",
"sp_f014", "sp_m014", "sp_f014", "sp_m014", "sp_f014"), num_cases = c(412L,
214L, 150L, 200L, 200L, 180L, 190L, 150L)),
class = "data.frame", row.names = c(NA,
-8L))
Here's an approach with tidyr::extract:
library(tidyr);library(dplyr)
df1 %>%
extract(strain, into = c("strain","sex","age"), "(\\w+)_([mf])(.*)") %>%
group_by(country,year,strain) %>%
summarise(across(num_cases,sum))
# A tibble: 4 x 4
# Groups: country, year [4]
country year strain num_cases
<chr> <int> <chr> <int>
1 mex 1996 sp 626
2 mex 1998 sp 350
3 usa 1996 sp 380
4 usa 1997 sp 340
Now that you have the strains fully parsed, you can easily group by sex or age. Thanks to @akrun for the data.
Update:
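To use the age range, you can use parse_number() from readr:
library(dplyr); library(readr)
df1 %>%
  # parse_number() pulls the numeric part out of the strain code, e.g. 14 from "sp_m014"
  mutate(age_range = parse_number(strain)) %>%
  group_by(country, year, age_range) %>%
  summarise(num_cases = sum(num_cases))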
Output:
country year age_range num_cases
<chr> <int> <dbl> <int>
1 mex 1996 14 626
2 mex 1998 14 350
3 usa 1996 14 380
4 usa 1997 14 340
First answer:
Thanks to akrun for the data:
library(tidyverse)
df1 %>%
  # shorten strain to its two-letter prefix before grouping on it
  mutate(strain = str_extract(strain, "^.{2}")) %>%
  group_by(country, year, strain) %>%
  summarise(num_cases = sum(num_cases))
Output:
country year strain num_cases
<chr> <int> <chr> <int>
1 mex 1996 sp 626
2 mex 1998 sp 350
3 usa 1996 sp 380
4 usa 1997 sp 340
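The same grouping can also be sketched in base R with sub() and aggregate(). This is only an illustration on the sample data. It assumes every strain code has the form prefix_rest, so that everything from the first underscore onward can be dropped.

```r
df1 <- data.frame(
  country = rep(c("mex", "usa"), each = 4),
  year = c(1996L, 1996L, 1998L, 1998L, 1996L, 1996L, 1997L, 1997L),
  strain = rep(c("sp_m014", "sp_f014"), times = 4),
  num_cases = c(412L, 214L, 150L, 200L, 200L, 180L, 190L, 150L)
)

# Keep only the part of the strain code before the first underscore
df1$strain <- sub("_.*$", "", df1$strain)

# Sum the cases for each country / year / strain combination
res <- aggregate(num_cases ~ country + year + strain, data = df1, FUN = sum)
res
```

Note that aggregate() orders the rows by its grouping columns, so the row order can differ from the dplyr output even though the sums are the same.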

How do I make a summary and then multiply the result by group?

So this is my data frame. Country1 represents the country where people live now (for example, Germany), and Country2 represents the country where they lived 5 years before moving to Country1.
Country1 Country2 Weight obs
Germany Germany 4 1
Germany Germany 119 2
France Germany 3 3
France Germany 2 4
Italy France 1 5
Basically, what I want is to sum the column Weight for each combination and then multiply that sum by the observation (the column obs). For example, the first row has the combination Germany to Germany, so I want to sum the weights in the column Weight (119 + 4 = 123) and then multiply the result by that row's value of obs (123 * 1 = 123). For the second row the sum of weights for Germany is the same (119 + 4 = 123), and it has to be multiplied by the observation of that row (123 * 2 = 246). In the third row the sum of weights is 3 + 2 = 5, which is multiplied by the observation for that row (5 * 3 = 15), and so on.
The output that I want is the column x, which would look like this:
Country1 Country2 Weight obs x
Germany Germany 4 1 123
Germany Germany 119 2 246
France Germany 3 3 15
France Germany 2 4 20
Italy France 1 5 5
Also, the formula that I'm trying to apply is this one.
You could also solve it as follows:
df1$x <- tapply(df1$Weight, df1$Country1, sum)[df1$Country1] * df1$obs
Country1 Country2 Weight obs x
1 Germany Germany 4 1 123
2 Germany Germany 119 2 246
3 France Germany 3 3 15
4 France Germany 2 4 20
5 Italy France 1 5 5
Try this:
library(dplyr)
#Code
new <- df %>% group_by(Country1) %>%
mutate(x=sum(Weight)*obs)
Output:
# A tibble: 5 x 5
# Groups: Country1 [3]
Country1 Country2 Weight obs x
<chr> <chr> <int> <int> <int>
1 Germany Germany 4 1 123
2 Germany Germany 119 2 246
3 France Germany 3 3 15
4 France Germany 2 4 20
5 Italy France 1 5 5
Some data used:
#Data
df <- structure(list(Country1 = c("Germany", "Germany", "France", "France",
"Italy"), Country2 = c("Germany", "Germany", "Germany", "Germany",
"France"), Weight = c(4L, 119L, 3L, 2L, 1L), obs = 1:5), class = "data.frame", row.names = c(NA,
-5L))
We can use data.table methods
library(data.table)
setDT(df1)[, x := sum(Weight) *obs, by = Country1][]
-output
# Country1 Country2 Weight obs x
#1: Germany Germany 4 1 123
#2: Germany Germany 119 2 246
#3: France Germany 3 3 15
#4: France Germany 2 4 20
#5: Italy France 1 5 5
Or using base R with ave
df1$x <- with(df1, ave(Weight, Country1, FUN = sum) * obs)
data
df1 <- structure(list(Country1 = c("Germany", "Germany", "France", "France",
"Italy"), Country2 = c("Germany", "Germany", "Germany", "Germany",
"France"), Weight = c(4L, 119L, 3L, 2L, 1L), obs = 1:5),
class = "data.frame", row.names = c(NA,
-5L))
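The question asks for sums for each combination of Country1 and Country2, while the answers above group by Country1 only. In the sample data the two groupings give the same result. If they differ in the real data, a sketch that groups by both columns with ave (using the same df1) could look like this:

```r
df1 <- data.frame(
  Country1 = c("Germany", "Germany", "France", "France", "Italy"),
  Country2 = c("Germany", "Germany", "Germany", "Germany", "France"),
  Weight = c(4L, 119L, 3L, 2L, 1L),
  obs = 1:5
)

# Sum Weight within each Country1/Country2 pair, then scale by obs
df1$x <- with(df1, ave(Weight, Country1, Country2, FUN = sum) * obs)
df1$x
```

The dplyr and data.table answers can be adapted in the same way by adding Country2 to group_by() or to the by argument.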

Descending order in ggplot bar_col

Below is the dataset,
# A tibble: 449 x 7
`Country or Area` `Region 1` Year Rate MinCI MaxCI Average
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Southern Asia 2011 4.2 2.6 6.2 4.4
2 Afghanistan Southern Asia 2016 5.5 3.4 8.1 5.75
3 Aland Islands Northern Europe NA NA NA NA NA
4 Albania Southern Europe 2011 18.8 14.8 23 18.9
5 Albania Southern Europe 2016 21.7 17 26.7 21.8
6 Algeria Northern Africa 2011 24 19.9 28.4 24.2
7 Algeria Northern Africa 2016 27.4 22.5 32.7 27.6
8 American Samoa Polynesia NA NA NA NA NA
9 Andorra Southern Europe 2011 24.6 19.8 29.8 24.8
10 Andorra Southern Europe 2016 25.6 20.1 31.3 25.7
I need to draw a bar chart with geom_col from the above dataset to compare the average obesity rate of each region. Further, I need to order the bars from the highest to the lowest.
I have also calculated the average obesity rate, as shown above.
Below is the code I used to generate the ggplot, but I am unable to figure out how to order the bars from highest to lowest.
region_plot <- ggplot(continent) + aes(x = continent$`Region 1`, y = continent$Average, fill = Average) +
geom_col() +
xlab("Region") + ylab("Average Obesity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Average obesity rate of each region")
region_plot
After checking your data, you have multiple regions, so in order to show the average per region you must compute it first and then plot. You can do that with dplyr using group_by() and summarise(). Your shared data is limited, but in the real data NA should not be present. The code below uses part of the shared data, so be careful with column names when using your real data. The reorder() function arranges the bars.
library(dplyr)
library(ggplot2)
#Code
df %>% group_by(Region) %>%
summarise(Avg=mean(Average,na.rm=TRUE)) %>%
filter(!is.na(Avg)) %>%
ggplot(aes(x=reorder(Region,-Avg),y=Avg,fill=Region))+
geom_col() +
xlab("Region") + ylab("Average Obesity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Average obesity rate of each region")
Some data used:
#Data
df <- structure(list(Region = c("Southern Asia", "Southern Asia", "Northern Europe",
"Southern Europe", "Southern Europe", "Northern Africa", "Northern Africa",
"Polynesia", "Southern Europe", "Southern Europe"), Year = c(2011L,
2016L, NA, 2011L, 2016L, 2011L, 2016L, NA, 2011L, 2016L), Rate = c(4.2,
5.5, NA, 18.8, 21.7, 24, 27.4, NA, 24.6, 25.6), MinCI = c(2.6,
3.4, NA, 14.8, 17, 19.9, 22.5, NA, 19.8, 20.1), MaxCI = c(6.2,
8.1, NA, 23, 26.7, 28.4, 32.7, NA, 29.8, 31.3), Average = c(4.4,
5.75, NA, 18.9, 21.8, 24.2, 27.6, NA, 24.8, 25.7)), row.names = c(NA,
-10L), class = "data.frame")
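The answer above works because ggplot2 draws a discrete axis in the order of the factor levels. Here is a small sketch of what reorder() does to those levels, using the per-region means of the sample data. No plot is drawn, and region_avg is a name introduced here for illustration.

```r
# Per-region means computed from the sample data above
region_avg <- data.frame(
  Region = c("Southern Asia", "Southern Europe", "Northern Africa"),
  Avg = c(5.075, 22.8, 25.9)
)

# reorder() sorts the levels by increasing score, so negating Avg
# puts the region with the highest average first
ordered_region <- reorder(region_avg$Region, -region_avg$Avg)
levels(ordered_region)
```

Dropping the minus sign would give ascending bars instead.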
The problem can be solved by preprocessing the data: compute the average per region, sort the result by Average in descending order, and then coerce `Region 1` to a factor whose levels follow that order.
library(ggplot2)
library(dplyr)
continent %>%
group_by(`Region 1`) %>%
summarise(Average = mean(Average, na.rm = TRUE)) %>%
arrange(desc(Average)) %>%
mutate(`Region 1` = factor(`Region 1`, levels = unique(`Region 1`))) %>%
ggplot(aes(x = `Region 1`, y = Average, fill = Average)) +
geom_col() +
xlab("Region") + ylab("Average Obesity") +
ggtitle("Average obesity rate of each region") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) -> region_plot
region_plot
Data
continent <- read.table(text = "
'Country or Area' 'Region 1' Year Rate MinCI MaxCI Average
1 Afghanistan 'Southern Asia' 2011 4.2 2.6 6.2 4.4
2 Afghanistan 'Southern Asia' 2016 5.5 3.4 8.1 5.75
3 'Aland Islands' 'Northern Europe' NA NA NA NA NA
4 Albania 'Southern Europe' 2011 18.8 14.8 23 18.9
5 Albania 'Southern Europe' 2016 21.7 17 26.7 21.8
6 Algeria 'Northern Africa' 2011 24 19.9 28.4 24.2
7 Algeria 'Northern Africa' 2016 27.4 22.5 32.7 27.6
8 'American Samoa' Polynesia NA NA NA NA NA
9 Andorra 'Southern Europe' 2011 24.6 19.8 29.8 24.8
10 Andorra 'Southern Europe' 2016 25.6 20.1 31.3 25.7
", header = TRUE, check.names = FALSE)

Projecting data in R

Here is my dataset. I have 194 countries with values from 2014 to 2018.
# A tibble: 10 x 3
iso3 time y
<chr> <chr> <dbl>
1 AFG 2014 0.50
2 AFG 2015 0.55
3 AFG 2016 0.63
4 AFG 2017 0.68
5 AFG 2018 0.69
6 AGO 2014 0.54
7 AGO 2015 0.58
8 AGO 2016 0.57
9 AGO 2017 0.51
10 AGO 2018 0.61
What I would like to do is project the data until 2023 using this function:
proj <- function(y, time=2014:2018, target=2023){
stopifnot(any(y>0) | any(y<1))
period <- time[1]:target
yhat <- predict(glm(y ~ time, family=quasibinomial), newdata=data.frame(time=period))
return(data.frame(time=period, y=invlogit(yhat)))
}
Now the problem is that I don't know how to use functions. How do I apply the above function to my dataset to create a new dataset that has both the historical data from 2014 to 2018 and the projected data until 2023 for all countries, in the same format as above?
Could you help?
Thank you very much
Let's say this is your data:
df = structure(list(iso3 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("AFG", "AGO"), class = "factor"), time = c(2014L,
2015L, 2016L, 2017L, 2018L, 2014L, 2015L, 2016L, 2017L, 2018L
), y = c(0.5, 0.55, 0.63, 0.68, 0.69, 0.54, 0.58, 0.57, 0.51,
0.61)), class = "data.frame", row.names = c(NA, 10L))
You still need an inverse logit function, which should be:
invlogit =function (x){ 1/(1 + exp(-x)) }
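As a side note, base R already provides the inverse logit as plogis() in the stats package. A quick check that the hand-written function agrees with it:

```r
invlogit <- function(x) 1 / (1 + exp(-x))

x <- c(-3, -0.5, 0, 0.5, 3)
# TRUE when both give the same values up to floating-point tolerance
all.equal(invlogit(x), plogis(x))
```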
Then what you need to provide, are the y data in the time period for each country, so for example:
subset(df,iso3=="AFG" & time %in% 2014:2018)
gives you a subset of the data.frame in country AFG from 2014 to 2018. This only works in your function if the data is complete (goes from 2014 to 2018) and sorted. We try that:
proj(subset(df,iso3=="AFG" & time %in% 2014:2018)$y)
time y
1 2014 0.5057644
2 2015 0.5598027
3 2016 0.6124600
4 2017 0.6626145
5 2018 0.7093585
6 2019 0.7520495
7 2020 0.7903234
8 2021 0.8240714
9 2022 0.8533952
10 2023 0.8785516
I suggest writing a function that handles situations when data is not sorted or is missing:
func = function(data,country){
data = subset(data,iso3==country)
data[match(2014:2018,data$time),]$y
}
res = lapply(unique(df$iso3),function(i){
data.frame(country=i,proj(func(df,country=i)))
})
res = do.call(rbind,res)
country time y
1 AFG 2014 0.5057644
2 AFG 2015 0.5598027
3 AFG 2016 0.6124600
4 AFG 2017 0.6626145
5 AFG 2018 0.7093585
6 AFG 2019 0.7520495
7 AFG 2020 0.7903234
8 AFG 2021 0.8240714
9 AFG 2022 0.8533952
10 AFG 2023 0.8785516
11 AGO 2014 0.5479759
21 AGO 2015 0.5550113
31 AGO 2016 0.5620247
41 AGO 2017 0.5690134
51 AGO 2018 0.5759748
61 AGO 2019 0.5829061
71 AGO 2020 0.5898048
81 AGO 2021 0.5966684
91 AGO 2022 0.6034943
101 AGO 2023 0.6102802
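The result above contains fitted values for 2014 to 2018 rather than the observed ones. If the observed history should be kept and only 2019 to 2023 appended, one possible sketch is below. It assumes time is numeric (in the question's tibble it is character and would need as.integer() first) and uses predict(type = "response") in place of invlogit. project_country is a hypothetical helper name.

```r
df <- data.frame(
  iso3 = rep(c("AFG", "AGO"), each = 5),
  time = rep(2014:2018, times = 2),
  y = c(0.5, 0.55, 0.63, 0.68, 0.69, 0.54, 0.58, 0.57, 0.51, 0.61)
)

# Hypothetical helper: fit one country, append predictions up to `target`
project_country <- function(d, target = 2023) {
  fit <- glm(y ~ time, family = quasibinomial, data = d)
  future <- data.frame(time = (max(d$time) + 1):target)
  # type = "response" returns predictions on the 0-1 scale directly
  future$y <- unname(predict(fit, newdata = future, type = "response"))
  data.frame(iso3 = d$iso3[1], rbind(d[c("time", "y")], future))
}

# Apply the helper to each country and stack the results
res <- do.call(rbind, lapply(split(df, df$iso3), project_country))
rownames(res) <- NULL
res
```

Unlike the function in the question, this sketch does not require each country to have exactly the years 2014 to 2018, because the model is fitted on whatever rows the country has.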
