Lead and lag issue using dplyr - r

I have a data frame with 365 rows, one for each day of the calendar year, and a column for each county. I am trying to shift the county columns up by one row. The data frame doesn't contain any missing values.
I tried the following code to shift them, but every value in the resulting table is NA.
covid_shift <- covid_pivot %>%
  mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
Does anyone know what might be the issue?

Since covid_pivot is grouped by date, and each of these groups has one row, the lead and lag functions return NA.
Try:
covid_shift <- covid_pivot %>%
  ungroup() %>%
  mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
You might also consider using across():
covid_pivot %>%
  ungroup() %>%
  mutate(across(-date, ~lag(.x)))
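To see the effect in isolation, here is a minimal sketch on made-up data. The covid_demo tibble and its values are hypothetical stand-ins for covid_pivot, not the poster's data. It also shows lead(), which may be what the question wants: lag() moves values down one row, while lead() moves them up.

```r
library(dplyr)

# Hypothetical stand-in for covid_pivot: one row per date, grouped by date
covid_demo <- tibble(
  date = as.Date("2020-01-01") + 0:3,
  Maricopa = c(1, 2, 3, 4),
  Cook = c(10, 20, 30, 40)
) %>%
  group_by(date)

# Grouped: every group has a single row, so there is no previous row to take
grouped_lag <- covid_demo %>%
  mutate(Maricopa = lag(Maricopa), Cook = lag(Cook))

# Ungrouped: lag() shifts values down one row, lead() shifts them up one row
ungrouped_lag <- covid_demo %>%
  ungroup() %>%
  mutate(across(-date, ~lag(.x)))

ungrouped_lead <- covid_demo %>%
  ungroup() %>%
  mutate(across(-date, ~lead(.x)))
```

In grouped_lag both county columns are entirely NA, while the ungrouped versions only have an NA in the first row (lag) or the last row (lead).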

Related

Is there an R function to summarize individual level data by country and year?

I am trying to make country-level (by year) summaries of a long-form dataset that contains individual-level data. I have tried using dplyr to average the variable I am interested in and store the result in a new dataset. However, something appears to be wrong with my group_by, because the result is a single observation that appears to be the mean of every observation.
data named: "finaldata.giniE",
country variable: "iso3c",
year variable: "date",
individual-level variable of interest: "Ladder.Life.Present"
Note: there are more variables in my data-- could this be an issue?
country_summmary <- finaldata.giniE %>%
select(iso3c, date, Ladder.Life.Present) %>%
group_by(iso3c, date) %>%
summarize(averaged.M = mean(Ladder.Life.Present))
country_summmary
My output appears like this:
> country_summmary
  averaged.M
1   5.505455
Thank you!
I actually just changed something and added your suggested code to the front and it worked! Here is the code that was able to work!
library(dplyr)
country_summary <- finaldata.gini %>%
group_by(iso3c, date) %>%
select(Ladder.Life.Present) %>%
summarise_each(funs(mean))
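The thread does not say what caused the single-row result. One known way to get that symptom is having plyr attached after dplyr, so that summarize() resolves to plyr's version, which ignores dplyr grouping; that is an assumption here, not something the poster confirmed. Separately, summarise_each() and funs() are deprecated in current dplyr. A minimal sketch that sidesteps both, using a hypothetical survey_demo tibble in place of the real data:

```r
library(dplyr)

# Hypothetical stand-in for the poster's data (values are made up)
survey_demo <- tibble(
  iso3c = c("USA", "USA", "CAN", "CAN"),
  date = c(2018, 2018, 2018, 2019),
  Ladder.Life.Present = c(6, 8, 7, 5)
)

country_summary_demo <- survey_demo %>%
  group_by(iso3c, date) %>%
  # The dplyr:: prefix makes sure dplyr's summarise() is used even if
  # another attached package defines a function with the same name
  dplyr::summarise(
    averaged.M = mean(Ladder.Life.Present, na.rm = TRUE),
    .groups = "drop"
  )
```

This returns one row per country and year rather than a single overall mean.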

Changing a Column to an Observation in a Row in R

I am currently struggling to transition one of my columns in my data to a row as an observation. Below is a representative example of what my data looks like:
library(tidyverse)
test_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409),
abstention=rep(52199))
As seen above, the abstention column exists at the end of my data frame, and I would like my data to look like the following:
library(tidyverse)
desired_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo","abstention"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409,52199))
Here, abstentions are treated like a candidate, in the can column. Thus, the rest of the data is maintained, and the abstention values are their own observation in the cv1 column.
I have tried using pivot_wider, but I am unsure how to use the arguments to get what I want. I have also considered t() to transpose the column into a row, but also having a hard time slotting it back into my data. Any help is appreciated! Thanks!
Here's a strategy that will work if you have multiple unit_names:
test_df %>%
  group_split(unit_name) %>%
  map(function(group_data) {
    slice(group_data, 1) %>%
      mutate(can = "abstention", cv1 = abstention) %>%
      add_row(group_data, .) %>%
      select(-abstention)
  }) %>%
  bind_rows()
Basically we split the data up by unit_name, then we grab the first row for each group and move the values around. Append that as a new row to each group, and then re-combine all the groups.
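The same idea can also be sketched without splitting and mapping, by building the abstention rows in one step and binding them on. This is an alternative under the assumption that the unit-level columns (unit_n, pev1, vot1, vv1, ivv1, abstention) are constant within each unit_name, as they are in the example data.

```r
library(dplyr)

test_df <- tibble(
  unit_name = "Chungcheongbuk-do", unit_n = 2,
  can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo"),
  pev1 = 510014, vot1 = 457815, vv1 = 445955,
  ivv1 = 11860, cv1 = c(25875, 386665, 23006, 10409),
  abstention = 52199
)

# One row per unit, carrying the unit-level values, recast as a "candidate"
abstention_rows <- test_df %>%
  distinct(unit_name, unit_n, pev1, vot1, vv1, ivv1, abstention) %>%
  mutate(can = "abstention", cv1 = abstention)

result <- bind_rows(test_df, abstention_rows) %>%
  select(-abstention) %>%
  arrange(unit_name)  # brings each unit's rows back together
```

With a single unit, the abstention row ends up last, with the abstention count in cv1.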

R spread across multiple value columns

My dataset looks like this -
dataset = data.frame(Site=c(rep('A',6),rep('B',6)),
                     Date=c(rep(c('2019-05-31','2019-04-30','2019-03-31'),4)),
                     Question=c(rep('Q1',3),rep('Q2',3)),
                     Score=runif(12,0.5,1),
                     Average=runif(12,0.5,1))
I'd like to spread the columns so that the first two columns contain the Site and Question and the remaining columns are Score_Date and Average_Date, one pair per date.
Here's an example of what the first line of the resulting table would look like
Site Question Score_2019.03.31 Score_2019.04.30 Score_2019.05.31 Average_2019.03.31 Average_2019.04.30 Average_2019.05.31
A Q1 0.9117566 0.8661078 0.5624139 0.7246694 0.8870703 0.6401099
I tried using unite and spread from tidyr, but got nowhere close to the result.
Any input would be highly appreciated.
Using tidyr and dplyr from the tidyverse, you could do the following:
library(tidyverse)
dataset %>%
nest(Score, Average, .key = 'value_col') %>%
spread(key = Date, value = value_col) %>%
unnest(`2019-03-31`, `2019-04-30`, `2019-05-31`, .sep = "_")
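The nest()/unnest() call style above follows the tidyr interface from before version 1.0.0. With tidyr 1.1.0 or later, the same reshaping can be sketched with pivot_wider(), which accepts several value columns at once. set.seed() is added here only so the random numbers are reproducible; the resulting column names use hyphens (Score_2019-03-31) rather than the dots shown in the question.

```r
library(tidyr)

set.seed(1)  # only so the random scores are reproducible
dataset <- data.frame(
  Site = c(rep("A", 6), rep("B", 6)),
  Date = rep(c("2019-05-31", "2019-04-30", "2019-03-31"), 4),
  Question = rep(c(rep("Q1", 3), rep("Q2", 3)), 2),
  Score = runif(12, 0.5, 1),
  Average = runif(12, 0.5, 1)
)

# One row per Site/Question; one Score_* and one Average_* column per date
wide <- pivot_wider(
  dataset,
  names_from = Date,
  values_from = c(Score, Average),
  names_sort = TRUE  # order the new columns by date rather than by appearance
)
```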

How do I aggregate certain columns from data frame by a Unique ID?

I have a list of statcast data, per day dating back to 2016. I am attempting to aggregate this data for finding the mean for each pitching ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by=list(pitchingstat$PitcherID),
FUN=mean, na.rm = TRUE)
This call aggregates every single column. I only want to aggregate certain columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can use QAsena's approach and add the summarise_at function like so:
pitchingstat %>%
group_by(PitcherID) %>%
summarise_at(vars(col1:coln), mean, na.rm = TRUE)
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
Replace the first argument (pitchingstat) with the column you want to aggregate, e.g. pitchingstat$col1, or with a data frame of just the columns you want, e.g. pitchingstat[c("col1", "col2")].
How about:
library(tidyverse)
aggpitch <- pitchingstat %>%
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(variable)) #replace 'variable' with your variable of interest here
or
library(tidyverse)
aggpitch <- pitchingstat %>%
  select(PitcherID, var_1, var_2) %>%  # keep PitcherID so it can be grouped on
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(var_1),
            pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.
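Since no example data was posted, here is a minimal sketch on a made-up data frame. pitch_demo and the columns velo, spin and game are hypothetical. It shows the dplyr route with across(), which supersedes summarise_at() in current dplyr, next to the base aggregate() route restricted to chosen columns.

```r
library(dplyr)

# Hypothetical stand-in for pitchingstat
pitch_demo <- data.frame(
  PitcherID = c(1, 1, 2, 2),
  velo = c(90, 92, NA, 95),
  spin = c(2200, 2300, 2100, 2500),
  game = c("a", "b", "c", "d")  # a column we do not want to average
)

# dplyr: average only the selected columns within each pitcher
agg_dplyr <- pitch_demo %>%
  group_by(PitcherID) %>%
  summarise(across(c(velo, spin), ~mean(.x, na.rm = TRUE)))

# base R: pass only the columns of interest as the first argument
agg_base <- aggregate(
  pitch_demo[c("velo", "spin")],
  by = list(PitcherID = pitch_demo$PitcherID),
  FUN = mean, na.rm = TRUE
)
```

Both results have one row per PitcherID and leave the game column out.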

Trying to understand dplyr function - group_by

I am trying to understand the way the group_by function works in dplyr. I am using the airquality data set that comes with the datasets package.
I understand that if I do the following, it should arrange the records in increasing order of the Temp variable:
airquality_max1 <- airquality %>% arrange(Temp)
I see that is the case in airquality_max1. I now want to arrange the records by increasing order of Temp but grouped by Month. So the end result should first have all the records for Month == 5 in increasing order of Temp. Then it should have all records of Month == 6 in increasing order of Temp and so on, so I use the following command
airquality_max2 <- airquality %>% group_by(Month) %>% arrange(Temp)
However, what I find is that the results are still in increasing order of Temp only, not grouped by Month, i.e., airquality_max1 and airquality_max2 are equal.
I am not sure why the grouping by Month does not happen before the arrange function. Can anyone help me understand what I am doing wrong here?
More than the problem of trying to sort the data frame by columns, I am trying to understand the behavior of group_by as I am trying to use this to explain the application of group_by to someone.
arrange() ignores grouping by default; see the breaking changes in dplyr 0.5.0. If you need to order by two columns, you can do:
airquality %>% arrange(Month, Temp)
For a grouped data frame, you can also set the .by_group argument to sort by the grouping variable first:
airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)
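A small sketch that checks these claims on airquality itself. The object names are only illustrative:

```r
library(dplyr)

by_temp <- airquality %>% arrange(Temp)
grouped_default <- airquality %>% group_by(Month) %>% arrange(Temp)
by_group <- airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)
two_keys <- airquality %>% arrange(Month, Temp)

# Grouping alone does not change the row order produced by arrange()
same_as_ungrouped <- identical(by_temp$Temp, grouped_default$Temp)

# With .by_group = TRUE the rows are ordered by Month first, then Temp,
# matching arrange(Month, Temp)
same_as_two_keys <- identical(by_group$Month, two_keys$Month) &&
  identical(by_group$Temp, two_keys$Temp)
```

The grouping still matters for verbs that compute per group, such as summarise() or mutate(); it is only the row ordering that arrange() leaves ungrouped unless told otherwise.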
