R using which function after group_by - r

I have a dataset with four variables (a, b, c, d). I want to group the data by a, b and c, then find the outliers of d within each group.
Here is the sample data: https://www.dropbox.com/s/ftp4eehqxzh7nn3/example.csv?dl=0
I tried:
outliers = data %>%
group_by(a,b,c) %>%
which (data$d > quantile (data$d, na.rm=T)[4] + 1.5*IQR(data$d, na.rm = T) | data$d < quantile (data$d, na.rm=T)[2] - 1.5*IQR(data$d, na.rm = T).
However, I got the error "argument to 'which' is not logical".
I would appreciate it if anyone could tell me what I got wrong and how I should fix the problem.

You could use
library(dplyr)
data %>%
group_by(a,b,c) %>%
filter(
d > quantile(d, na.rm = TRUE)[4] + 1.5 * IQR(d, na.rm = TRUE) |
d < quantile(d, na.rm = TRUE)[2] - 1.5 * IQR(d, na.rm = TRUE))
This returns
# A tibble: 2,464 x 5
...1 a d b c
<dbl> <chr> <dbl> <chr> <dbl>
1 10533 gas 321. CAISO 2011
2 10534 gas 51.8 CAISO 2012
3 15067 gas 52.6 CAISO 2013
4 25890 oil 51.0 ISONE 2010
5 26485 gas 416. PJM 2008
6 26489 gas 468. PJM 2012
7 38153 gas Inf SPP 2014
8 38154 gas Inf SPP 2015
9 38155 gas 67.4 SPP 2016
10 38156 gas 58.8 SPP 2017
# ... with 2,454 more rows
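As a rough illustration on made-up data (the data frame toy and its columns g and d are assumptions, not the asker's data), the sketch below shows that bare column names used after group_by() are evaluated within each group, so the quartiles and fences are computed per group rather than on the whole column:

```r
library(dplyr)

# Hypothetical toy data: two groups, one obvious outlier in group "x"
toy <- data.frame(
  g = rep(c("x", "y"), each = 6),
  d = c(1, 2, 3, 4, 5, 100, 10, 11, 12, 13, 14, 15)
)

flagged <- toy %>%
  group_by(g) %>%
  mutate(
    q1  = quantile(d, 0.25, na.rm = TRUE),  # first quartile of this group
    q3  = quantile(d, 0.75, na.rm = TRUE),  # third quartile of this group
    iqr = q3 - q1,
    is_outlier = d < q1 - 1.5 * iqr | d > q3 + 1.5 * iqr
  ) %>%
  ungroup()

# Keep only the rows flagged as outliers within their own group
outliers <- filter(flagged, is_outlier)
```

Writing data$d inside the pipeline instead would bypass the grouping, because it always refers to the full column of the original data frame.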

Related

How could I get the mean and mode summary at the same time for a dataframe?

I have a dataframe with 10 numeric columns and 3 character columns, as a sample I prepare this dataframe:
df <- data.frame(
name = c("ANCON","ANCON","ANCON", "LUNA", "MAGOLLO", "MANCHAY", "MANCHAY","PATILLA","PATILLA"),
destiny = c("sea","reuse","sea","sea", "reuse","sea","sea","sea","sea"),
year = c("2022","2015","2022","2022", "2015","2016","2016","2018","2018"),
QQ = c(10,11,3,4,13,11,12,23,7),
Temp = c(14,16,16,15,16,20,19,14,18))
I need to group it by the column "name", get the mean of the columns "QQ" and "Temp", and the mode of the columns "destiny" and "year". I could get the means, but I couldn't include the mode:
df_mean <- df %>%
group_by(name) %>%
summarise_all(mean, na.rm = TRUE)
name destiny year QQ Temp
<chr> <dbl> <dbl> <dbl> <dbl>
1 ANCON NA NA 8 15.3
2 LUNA NA NA 4 15
3 MAGOLLO NA NA 13 16
4 MANCHAY NA NA 11.5 19.5
5 PATILLA NA NA 15 16
the desired output with the medians is something like this:
name destiny year QQ Temp
1 ANCON sea 2022 8.0 15.3
2 LUNA sea 2022 4.0 15.0
3 MAGOLLO reuse 2015 13.0 16.0
4 MANCHAY sea 2016 11.5 19.5
5 PATILLA sea 2018 15.0 16.0
How could I do it? Please help
Use across and cur_column. Median would only work with ordinal data, though, and for categorical data like the character columns you have, use mode:
mode <- function(x) {
x_unique <- unique(x)
x_unique[which.max(tabulate(match(x, x_unique)))]
}
Then
mode_columns <- c('destiny', 'year')
df %>%
group_by(name) %>%
summarise(
across(
everything(),
~ if (cur_column() %in% mode_columns) mode(.x) else mean(.x)
)
)
# A tibble: 5 × 5
name destiny year QQ Temp
<chr> <chr> <chr> <dbl> <dbl>
1 ANCON sea 2022 8 15.3
2 LUNA sea 2022 4 15
3 MAGOLLO reuse 2015 13 16
4 MANCHAY sea 2016 11.5 19.5
5 PATILLA sea 2018 15 16
UPD: Or you could summarise a bit differently:
summarise(
across(all_of(mode_columns), mode),
across(!all_of(mode_columns), mean)
)
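For reference, here is a small standalone check of how a mode helper of this kind behaves. It is only a sketch: the name mode_value is chosen here to avoid masking base::mode, and the tie-breaking rule (first value seen wins) follows from which.max() returning the first maximum.

```r
# Same logic as the mode helper above, under a hypothetical name
mode_value <- function(x) {
  x_unique <- unique(x)                       # distinct values, in order of appearance
  counts <- tabulate(match(x, x_unique))      # how often each distinct value occurs
  x_unique[which.max(counts)]                 # first value with the highest count
}

most_common <- mode_value(c("sea", "reuse", "sea"))
tie_result  <- mode_value(c("2016", "2018"))  # tie: the first value seen is returned
```

With ties, the result depends on row order, which may or may not be what you want for columns like "year".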

Inflation rate from the CPI for multiple countries, with R

I have to calculate the inflation rate from 2015 to 2019. I have to do this with the CPI, which I have for each month during the 4 years. This means that I have to calculate the percentage growth rate for the same month last year.
They ask me for the calculation of several countries and then calculate or show the average for the period 2015-2019.
This is my database:
data <- read.table("https://pastebin.com/raw/6cetukKb")
I have tried the quantmod, dplyr, lubridate packages, but I can't do the CPI conversion.
I tried this but I know it is not correct:
data$year <- year(data$date)
anual_cpi <- data %>% group_by(year) %>% summarize(cpi = mean(Argentina))
anual_cpi$adj_factor <- anual_cpi$cpi/anual_cpi$cpi[anual_cpi$year == 2014]
UPDATE
My teacher gave us a hint on how to get the result, but when I try to add it to the code, I get an error.
data %>%
tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
group_by(name, year) %>%
summarise(value = mean(value)) %>%
mutate((change=(x-lag(x,1))/lag(x,1)*100))
| name            | value |
|-----------------|-------|
| Antigua_Barbuda | -1.55 |
| Argentina       | 1.03  |
| Aruba           | -1.52 |
| Bahamas         | -1.56 |
| Barbados        | -1.38 |
where "value" corresponds to the average inflation for each country during the entire period 2015-2019
We can use data.table methods
library(data.table)
melt(fread("https://pastebin.com/raw/6cetukKb"),
id.vars = c('date', 'year', 'period', 'periodName'))[,
.(value = mean(value)), .(variable, year)][,
adj_factor := value/value[year == 2014], by = variable][]
# variable year value adj_factor
# 1: Antigua_Barbuda 2014 96.40000 1.0000000
# 2: Antigua_Barbuda 2015 96.55833 1.0016425
# 3: Antigua_Barbuda 2016 96.08333 0.9967151
# 4: Antigua_Barbuda 2017 98.40833 1.0208333
# 5: Antigua_Barbuda 2018 99.62500 1.0334544
# 6: Antigua_Barbuda 2019 101.07500 1.0484959
# 7: Argentina 2014 56.60000 1.0000000
# ..
You should read your data with header = TRUE, since the first row contains the column names. Then reshape the data to long format, which makes the calculation easy.
After this you can perform whichever calculation you want. For example, to repeat the steps of your attempt, i.e. divide all the values by the value in the year 2014 for each country, you can do:
library(dplyr)
data <- read.table("https://pastebin.com/raw/6cetukKb", header = TRUE)
data %>%
tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
group_by(name, year) %>%
summarise(value = mean(value)) %>%
mutate(adj_factor = value/value[year == 2014])
# name year value adj_factor
# <chr> <int> <dbl> <dbl>
# 1 Antigua_Barbuda 2014 96.4 1
# 2 Antigua_Barbuda 2015 96.6 1.00
# 3 Antigua_Barbuda 2016 96.1 0.997
# 4 Antigua_Barbuda 2017 98.4 1.02
# 5 Antigua_Barbuda 2018 99.6 1.03
# 6 Antigua_Barbuda 2019 101. 1.05
# 7 Argentina 2014 56.6 1
# 8 Argentina 2015 64.0 1.13
# 9 Argentina 2016 89.9 1.59
#10 Argentina 2017 113. 2.00
# … with 20 more rows
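The question itself asks for the percentage change against the same month of the previous year, averaged per country. A minimal sketch of that calculation on made-up long-format data follows; the data frame cpi_long and its values are hypothetical, and dplyr's lag() with n = 12 assumes every country has one row per month with no gaps.

```r
library(dplyr)

# Hypothetical long-format CPI data: two countries, 24 months each
cpi_long <- data.frame(
  name  = rep(c("A", "B"), each = 24),
  year  = rep(rep(2015:2016, each = 12), times = 2),
  month = rep(1:12, times = 4),
  value = c(rep(100, 12), rep(110, 12),   # country A: +10% year over year
            rep(50, 12),  rep(51, 12))    # country B: +2% year over year
)

inflation <- cpi_long %>%
  arrange(name, year, month) %>%
  group_by(name) %>%
  # compare each month with the same month one year earlier (12 rows back)
  mutate(yoy = (value / lag(value, 12) - 1) * 100) %>%
  # average the monthly year-over-year rates per country
  summarise(mean_inflation = mean(yoy, na.rm = TRUE))
```

The first 12 months of each country have no earlier year to compare with, so their yoy is NA and they are dropped from the average by na.rm = TRUE.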

ratio calculation and sort the calculated rates

df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
df8 <- read.csv ('https://raw.githubusercontent.com/hirenvadher954/Worldometers-Scraping/master/countries.csv',
stringsAsFactors = FALSE)
install.packages("tidyverse")
library(tidyverse)
df %>%
left_join(df8, by = c("countryName" = "country_name")) %>%
mutate(population = as.numeric(str_remove_all(population, ","))) %>%
group_by(countryName) %>%
group_by(countryName) %>%
unique() %>%
summarize(population = sum(population, na.rm = TRUE),
confirmed = sum(confirmed, na.rm = TRUE),
recovered = sum(recovered, na.rm = TRUE),
death = sum(death, na.rm = TRUE),
death_prop = paste0(as.character(death), "/", as.character(population))
)
This code calculates the death / population rate.
I want to find the 10 countries with the highest death / population rate.
The confirmed and recovered columns should not be included.
10 x 6
countryName population confirmed recovered death death_prop
<chr> <dbl> <int> <int> <int> <chr>
1 Afghanistan 4749258212 141652 16505 3796 3796/4749258212
2 Albania 351091234 37233 22518 1501 1501/351091234
3 Algeria 5349827368 206413 88323 20812 20812/5349827368
4 Andorra 9411324 38518 18054 2015 2015/9411324
5 Angola 4009685184 1620 435 115 115/4009685184
6 Anguilla 1814018 161 92 0 0/1814018
7 Antigua and Barbuda 11947338 1230 514 128 128/11947338
8 Argentina 5513884428 232975 66155 10740 10740/5513884428
9 Armenia 361515646 121702 46955 1626 1626/361515646
10 Aruba 13025452 5194 3135 91 91/13025452
The data above is only an example; the values are not correct.
The data is cumulative, meaning each day's values already include all the values up to the previous day. So take only the maximum of each column per country and then calculate death_prop.
library(dplyr)
library(stringr) # for str_remove_all()
df %>%
left_join(df8, by = c("countryName" = "country_name")) %>%
mutate(population = as.numeric(str_remove_all(population, ","))) %>%
group_by(countryName) %>%
summarise_at(vars(population:death), max, na.rm = TRUE) %>%
mutate(death_prop = death/population * 100) %>%
arrange(desc(death_prop))
# A tibble: 215 x 5
# countryName population year death death_prop
# <chr> <dbl> <dbl> <int> <dbl>
# 1 San Marino 33860 2019 42 0.124
# 2 Belgium 11589623 2020 9312 0.0803
# 3 Andorra 77142 2019 51 0.0661
# 4 Spain 46754778 2020 28752 0.0615
# 5 Italy 60461826 2020 32877 0.0544
# 6 United Kingdom 67886011 2020 36914 0.0544
# 7 France 65273511 2020 28432 0.0436
# 8 Sweden 10099265 2020 4029 0.0399
# 9 Sint Maarten 42388 2019 15 0.0354
#10 Netherlands 17134872 2020 5830 0.0340
# … with 205 more rows
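To see why max() rather than sum() is the right aggregate for cumulative counts, here is a small sketch on invented numbers (the data frame report and its values are hypothetical):

```r
library(dplyr)

# Hypothetical cumulative daily report: each row is the running total so far
report <- data.frame(
  countryName = c("A", "A", "A", "B", "B", "B"),
  death       = c(1, 3, 6, 0, 2, 2),
  population  = c(1000, 1000, 1000, 500, 500, 500)
)

rates <- report %>%
  group_by(countryName) %>%
  # the latest cumulative value is the largest one, so max() picks it
  summarise(death = max(death), population = max(population)) %>%
  mutate(death_prop = death / population * 100) %>%
  arrange(desc(death_prop))
```

Summing the death column for country A would give 10 instead of the true total of 6, and summing population would multiply it by the number of days, which is what inflated the figures in the question's output.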

loop to run model on subset dataframe

I am not very experienced with loops so I am not sure where I went wrong here...
I have a dataframe that looks like:
month year day mean.temp mean.temp.year.month
1 1961 1 4.85 4.090323
1 1961 2 4.90 4.090323
1 1961 3 2.95 4.090323
1 1961 4 3.40 4.090323
1 1961 5 2.90 4.090323
dataset showing 3 months for 2 years can be found here:
https://drive.google.com/file/d/1w7NVeoEh8b7cAkU3cu1sXx6yCh75Inqg/view?usp=sharing
and I want to subset this dataframe by year and month so that I can run one nls model per year and month. Since my dataset contains 56 years (and each year has 12 months), that will give 672 models. Then I want to store the parameter estimates in a separate table.
I've created this code, but I can't work out why it is only giving me the parameter estimates for month 12 (all 56 years, but just month 12):
table <- matrix(99999, nrow=672, ncol=4)
YEARMONTHsel <- unique(df_weather[c("year", "month")])
YEARsel <- unique(df_weather$year)
MONTHsel <- unique(df_weather$month)
for (i in 1:length(YEARsel)) {
for (j in 1:length(MONTHsel)) {
temp2 <- df_weather[df_weather$year==YEARsel[i] & df_weather$month==MONTHsel[j],]
mn <- nls(mean.temp~mean.temp.year.month+alpha*sin(day*pi*2/30+phi),
data = temp2, control=nlc,
start=list(alpha=-6.07043, phi = -10))
cr <- as.vector(coef(mn))
nv <-length(coef(mn))
table[i,1:nv] <- cr
table[i,nv+1]<- YEARsel[i]
table[i,nv+2]<- MONTHsel[j]
}
}
I've tried several options (i.e. without using nested loop) but I'm not getting anywhere.
Any help would be greatly appreciated! Thanks.
Based on your loop, it looks like you want to run the regression grouped by year and month and then extract the coefficients into a new dataframe (correct me if that's wrong):
library(readxl)
library(tidyverse)
df <- read_excel("~/Downloads/df_weather.xlsx")
df %>% nest(-month, -year) %>%
mutate(model = map(data, ~nls(mean.temp~mean.temp.year.month+alpha*sin(day*pi*2/30+phi),
data = .x, control= "nlc",
start=list(alpha=-6.07043, phi = -10))),
coeff = map(model, ~coefficients(.x))) %>%
unnest(coeff %>% map(broom::tidy)) %>%
spread(names, x) %>%
arrange(year)
#> # A tibble: 6 x 4
#> month year alpha phi
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1961 0.561 -10.8
#> 2 2 1961 -1.50 -10.5
#> 3 3 1961 -2.06 -9.77
#> 4 1 1962 -3.35 -5.48
#> 5 2 1962 -2.27 -9.97
#> 6 3 1962 0.959 -10.8
First we nest the data by group (in this case year and month), then we fit the model for each group, then we extract the coefficients for each group. Lastly we unnest the coefficients and spread the data from long to wide.
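As for why the original loop only kept month 12: it writes to table[i, ], and i only changes with the year, so every month of a given year overwrites the same row and the last month wins. If you prefer to keep the nested loop, one possible fix is a running row index. The sketch below uses invented data (df_toy) and mean() as a stand-in for the nls() fit, so it shows only the indexing, not the model:

```r
# Hypothetical stand-in data: 2 years x 3 months, 4 days each
df_toy <- expand.grid(day = 1:4, month = 1:3, year = 1961:1962)
df_toy$mean.temp <- df_toy$month + df_toy$day / 10

years  <- unique(df_toy$year)
months <- unique(df_toy$month)

# One row per (year, month) pair
results <- matrix(NA_real_, nrow = length(years) * length(months), ncol = 3,
                  dimnames = list(NULL, c("year", "month", "estimate")))

k <- 0  # running row index across both loops
for (i in seq_along(years)) {
  for (j in seq_along(months)) {
    k <- k + 1
    sub <- df_toy[df_toy$year == years[i] & df_toy$month == months[j], ]
    # mean() stands in for the nls() fit of the original loop
    results[k, ] <- c(years[i], months[j], mean(sub$mean.temp))
  }
}
```

Because k increases on every inner iteration, each year and month combination gets its own row instead of sharing row i.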

use model object, e.g. panelmodel, to flag data used

Is it possible to use a fit object, specifically the regression object I get from a plm() model, to flag which observations in the data were actually used in the regression? I realize this could be done by looking for complete observations in my original data, but I am curious whether there's a way to use the fit/reg object to flag the data.
Let me illustrate my issue with a minimal working example,
First some packages needed,
# install.packages(c("stargazer", "plm", "tidyverse"), dependencies = TRUE)
library(plm); library(stargazer); library(tidyverse)
Second some data, this example is drawing heavily on Baltagi (2013), table 3.1, found in ?plm,
data("Grunfeld", package = "plm")
dta <- Grunfeld
now I create some semi-random missing values in my data object, dta
dta[c(3:13),3] <- NA; dta[c(22:28),4] <- NA; dta[c(30:33),5] <- NA
final step in the data preparation is to create a data frame with an index attribute that describes its individual and time dimensions, using tidyverse,
dta.p <- dta %>% group_by(firm, year)
Now to the regression
plm.reg <- plm(inv ~ value + capital, data = dta.p, model = "pooling")
the results, using stargazer,
stargazer(plm.reg, type="text") # stargazer(dta, type="text")
#> ============================================
#> Dependent variable:
#> ---------------------------
#> inv
#> ----------------------------------------
#> value 0.114***
#> (0.008)
#>
#> capital 0.237***
#> (0.028)
#>
#> Constant -47.962***
#> (9.252)
#>
#> ----------------------------------------
#> Observations 178
#> R2 0.799
#> Adjusted R2 0.797
#> F Statistic 348.176*** (df = 2; 175)
#> ===========================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Say I know my data has 200 observations, and I want to find the 178 that were used in the regression.
I am wondering if there's some vector in plm.reg that I can (easily) use to create a flag in my original data, dta, showing whether each observation was used or not, i.e. picking up the semi-random missing values I created above. Maybe some broom-like tool.
I imagine something like,
dta <- dta %>% valid_reg_obs(plm.reg)
The desired outcome would look something like this, the new element is the vector plm.reg at the end, i.e.,
dta %>% as_tibble()
#> # A tibble: 200 x 6
#> firm year inv value capital plm.reg
#> * <int> <int> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1935 318 3078 2.80 T
#> 2 1 1936 392 4662 52.6 T
#> 3 1 1937 NA 5387 157 F
#> 4 1 1938 NA 2792 209 F
#> 5 1 1939 NA 4313 203 F
#> 6 1 1940 NA 4644 207 F
#> 7 1 1941 NA 4551 255 F
#> 8 1 1942 NA 3244 304 F
#> 9 1 1943 NA 4054 264 F
#> 10 1 1944 NA 4379 202 F
#> # ... with 190 more rows
Update: I tried broom's augment(), hoping it would create some flag, but unfortunately it gave me this error message:
# install.packages(c("broom"), dependencies = TRUE)
library(broom)
augment(plm.reg, dta)
#> Error in data.frame(..., check.names = FALSE) :
#> arguments imply differing number of rows: 200, 178
The vector is plm.reg$residuals. Not sure of a nice broom solution, but this seems to work:
library(tidyverse)
dta.p %>%
as.data.frame %>%
rowid_to_column %>%
mutate(plm.reg = rowid %in% names(plm.reg$residuals))
For people who use pdata.frame() to create an index attribute that describes the individual and time dimensions, you can use the following code. It is based on another Baltagi example in ?plm:
# == Baltagi (2013), pp. 204-205
data("Produc", package = "plm")
pProduc <- pdata.frame(Produc, index = c("state", "year", "region"))
form <- log(gsp) ~ log(pc) + log(emp) + log(hwy) + log(water) + log(util) + unemp
Baltagi_reg_204_5 <- plm(form, data = pProduc, model = "random", effect = "nested")
pProduc %>% mutate(reg.re = rownames(pProduc) %in% names(Baltagi_reg_204_5$residuals)) %>%
as_tibble() %>% select(state, year, region, reg.re)
#> # A tibble: 816 x 4
#> state year region reg.re
#> <fct> <fct> <fct> <lgl>
#> 1 CONNECTICUT 1970 1 T
#> 2 CONNECTICUT 1971 1 T
#> 3 CONNECTICUT 1972 1 T
#> 4 CONNECTICUT 1973 1 T
#> 5 CONNECTICUT 1974 1 T
#> 6 CONNECTICUT 1975 1 T
#> 7 CONNECTICUT 1976 1 T
#> 8 CONNECTICUT 1977 1 T
#> 9 CONNECTICUT 1978 1 T
#> 10 CONNECTICUT 1979 1 T
#> # ... with 806 more rows
Finally, if you are running the first Baltagi example without index attributes, i.e. the unmodified example from the help file, the code should be:
Grunfeld %>% rowid_to_column %>%
mutate(plm.reg = rowid %in% names(p$residuals)) %>% as_tibble()
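The same idea can be tried without plm installed. The sketch below uses lm() purely as a stand-in for plm() (an assumption made for illustration) on a made-up data frame: rows dropped for missing values have no residual, so matching row names against the names of the residual vector gives the flag.

```r
# Hypothetical toy data with missing values in rows 2 and 5
toy <- data.frame(
  y = c(1, NA, 3, 4, 5, 6),
  x = c(2, 1, 4, 3, NA, 7)
)

# lm() is a stand-in for plm() here; it drops incomplete rows by default
fit <- lm(y ~ x, data = toy)

# Residuals are named after the rows that were actually used in the fit
toy$used <- rownames(toy) %in% names(residuals(fit))
```

In this toy case the flag agrees with complete.cases() on the model variables, which is the manual check mentioned in the question.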
