How to pivot wider in R on one column value

Below is the sample data and the manipulations I have done so far. I have tried this in other ways, but I have an idea that may make it a bit simpler. The intended result is at the bottom. What I am looking for is a way to pivot wider based on when the smb column says Total. There are five possible values for smb: 1, 2, 3, 4, and Total. I want a new column, smb.total, which will have the total for each smb/year/qtr/area combination. I have tried putting a filter in front of the pivot_wider statement (at the bottom).
library(readxl)
library(dplyr)
library(stringr)
library(tidyverse)
library(gt)
employment <- c(1,45,125,130,165,260,600,601,2,46,127,132,167,265,601,602,50,61,110,121,170,305,55,603,52,66,112,123,172,310,604,605)
small <- c(1,1,2,2,3,4,NA,NA,1,1,2,2,3,4,NA,NA,1,1,2,2,3,4,NA,NA,1,1,2,2,3,4,NA,NA)
area <-c(001,001,001,001,001,001,001,001,001,001,001,001,001,001,001,001,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003)
year<-c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020)
qtr <-c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
smbtest <- data.frame(employment,small,area,year,qtr)
smbtest$smb <-0
smbtest <- smbtest %>% mutate(smb = case_when(employment >=0 & employment <100 ~ "1",employment >=0
& employment <150 ~ "2",employment >=0 & employment <250 ~ "3", employment >=0 & employment <500 ~
"4", employment >=0 & employment <100000 ~ "Total"))
smbsummary2<-smbtest %>%
mutate(period = paste0(year,"q",qtr)) %>%
group_by(area,period,smb) %>%
summarise(employment = sum(employment), worksites = n(),
.groups = 'drop_last') %>%
mutate(employment = cumsum(employment),
worksites = cumsum(worksites))
smbsummary2<- smbsummary2%>%
group_by(area,smb)%>%
mutate(empprevyear=lag(employment),
empprevyearpp=employment-empprevyear,
empprevyearpct=((employment/empprevyear)-1),
empprevyearpct=scales::percent(empprevyearpct,accuracy = 0.01)
)
area period smb   employment worksites smb.Total
1    2020q1 1             46         2      1927
1    2020q1 2            301         4      1927
1    2020q1 3            466         5      1927
1    2020q1 4            726         6      1927
1    2020q1 Total       1927         8      1927
smbsummary2<-smbsummary2 %>%
filter(small=='Total')
pivot_wider(names_from = small, values_from = employment)

Maybe this code will solve your question:
employment <- c(1, 45, 125, 130, 165, 260, 600, 601, 2, 46, 127,
132, 167, 265, 601, 602, 50, 61, 110, 121, 170,
305, 55, 603, 52, 66, 112, 123, 172, 310, 604, 605)
small <- c(1, 1, 2, 2, 3, 4, NA, NA, 1, 1, 2, 2, 3, 4, NA, NA, 1, 1,
2, 2, 3, 4, NA, NA, 1, 1, 2, 2, 3, 4, NA, NA)
area <-c(001, 001, 001, 001, 001, 001, 001, 001, 001, 001, 001, 001,
001, 001, 001, 001, 003, 003, 003, 003, 003, 003, 003, 003,
003, 003, 003, 003, 003, 003, 003, 003)
year<-c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020)
qtr <-c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
smbtest <- tibble(employment, small, area, year, qtr)
smbtest$smb <- 0
smbtest <- smbtest %>%
mutate(smb = case_when(employment >=0 & employment <100 ~ "1",
employment >=0 & employment <150 ~ "2",
employment >=0 & employment <250 ~ "3",
employment >=0 & employment <500 ~ "4",
employment >=0 & employment <100000 ~ "Total"))
smbtest <- smbtest %>%
relocate(smb, year, qtr, area, small, employment)
smbsummary2 <- smbtest %>%
mutate(period = paste0(year,"q",qtr)) %>%
group_by(area, period, smb) %>%
summarise(employment = sum(employment),
worksites = n()) %>%
mutate(employment = cumsum(employment),
worksites = cumsum(worksites))
smbsummary2 %>%
group_by(area, period) %>%
mutate(`employ/period (%)` = employment/employment[smb == "Total"]*100)
This is probably not the best answer, but I think it works well for your data.
If not, please let me know.
Good job!
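If the goal is literally the smb.Total column from the intended result, one possible sketch, assuming each area/period group contains exactly one row with smb == "Total", is to copy that row's employment onto every row of the group. The small table below is hand-typed from the intended result in the question, and the names smb_summary and with_total are only illustrative.

```r
library(dplyr)

# Hand-typed copy of the intended result (area 1, 2020q1), minus smb.Total
smb_summary <- tibble(
  area       = 1,
  period     = "2020q1",
  smb        = c("1", "2", "3", "4", "Total"),
  employment = c(46, 301, 466, 726, 1927),
  worksites  = c(2, 4, 5, 6, 8)
)

# Assumption: exactly one "Total" row per area/period group.
# Its employment value is recycled to every row of that group.
with_total <- smb_summary %>%
  group_by(area, period) %>%
  mutate(smb.Total = employment[smb == "Total"]) %>%
  ungroup()

print(with_total)
```

If a group has no "Total" row, the subset has length zero and mutate() stops with an error, so it may be worth checking for that first.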

I don't know if I understand correctly.
Which variable do you want smb.total for? The employment variable?
If so, build your "smbsummary2" object with this code:
smbsummary2 <- smbtest %>%
relocate(smb, year, qtr, area, small, employment) %>%
group_by(smb, year, qtr, area) %>%
mutate(smb.total = n())
If that is not what you meant, could you explain it in more detail?

Related

Easy method for checking for duplicates and errata in panel dataset

Imagine a dataframe:
df1 <- tibble::tribble( ~City, ~Year, ~Burger_cost, ~Cola_cost, ~Resident_AVGGrowth_cm,
"Abu Dhabi", 2020, 2, 3, 175,
"Abu Dhabi", 2019, 1, 3, 174,
"Abu Dhabi", 2018, 1, 2, 173,
"New York", 2020, 4, 5, 500,
"New York", 2019, 3, 5, 184,
"New York", 2018, 2, 3, 183,
"Abu Dhabi", 2020, 2, 3, 175,
"Abu Dhabi", 2019, 1, 3, 174,
"Abu Dhabi", 2018, 1, 2, 173,
"Abu Dhabi", 2017, 1, NA, 100,
"London", 2020, 5, 6, 186,
"London", 2019, 4, 6, 188,
"London", 2018, 3, 5, 185,
"New York", 2020, 4, 5, 185,
"New York", 2019, 3, 5, 184,
"New York", 2018, 3, 3, 183,
"London", 2020, 5, 6, 186,
"London", 2019, 4, 6, 188,
"London", 2018, 3, 5, 185)
There can be many kinds of input. For example, the data for London is identical for all years, so we can delete it. The data for Abu Dhabi is identical for the years 2018-2020 and differs for 2017.
The data for New York contains a discrepancy in 2018 for the burger cost (and in the growth of a city resident).
The value for the growth of a city resident is surely erroneous in row 4 for NY (the discrepancy is too large), but that row has a duplicate in row 14 (in this case we would prefer to delete row 4 and keep row 14, on the criterion that no person can be shorter than 50 cm or taller than 4 meters [400 cm] [especially in the mean values :)]).
Rows 6 and 16 (for NY, 2018) contain different data for the burger cost, which can only be resolved by a human (say, the variant with 3 USD is correct, but R needs to show the discrepancy to the end user).
The question: can you suggest nice and neat methods for performing these operations? What do you usually use to solve such issues?
I have just started thinking about this issue.
It is an easy task in C#. I have several strategies in mind, but I would be very interested to know how you solve such issues :) Any ideas are much appreciated :)
Perhaps this helps. Grouped by 'City', create a flag for outliers in 'Resident_AVGGrowth_cm' with boxplot. Then add 'Year' to the grouping and create a second flag marking the City/Year groups whose values all agree, using n_distinct on the columns Burger_cost to Resident_AVGGrowth_cm inside if_all (which returns TRUE for a row only if every looped column returns TRUE). Finally, grouped by 'City' again, filter out the cities whose rows are all duplicates (e.g. London) and remove the rows flagged by outlier_flag.
library(dplyr)
df1 %>%
group_by(City) %>%
mutate(outlier_flag = Resident_AVGGrowth_cm %in%
boxplot(Resident_AVGGrowth_cm, plot = FALSE)$out) %>%
group_by(Year, .add = TRUE) %>%
mutate(flag_all_unq = if_all(Burger_cost:Resident_AVGGrowth_cm,
~ n_distinct(.x, na.rm = TRUE) == 1)) %>%
group_by(City) %>%
filter(!all(flag_all_unq)) %>%
ungroup %>%
filter((!outlier_flag)|flag_all_unq)
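For the part that needs a human decision, a minimal sketch, assuming City and Year together identify one observation, is to drop exact duplicates and then keep only the City/Year pairs that still have more than one row. The 50 to 400 cm bounds come from the question; the toy rows and the names checks, plausible and conflicts are made up for illustration.

```r
library(dplyr)

# Toy rows modelled on the question's data (not the full dataset)
checks <- tibble::tribble(
  ~City,      ~Year, ~Burger_cost, ~Resident_AVGGrowth_cm,
  "New York", 2020,  4,            500,
  "New York", 2020,  4,            185,
  "New York", 2018,  2,            183,
  "New York", 2018,  3,            183,
  "London",   2018,  3,            185,
  "London",   2018,  3,            185
)

# Drop physically implausible heights first (bounds from the question)
plausible <- checks %>%
  filter(between(Resident_AVGGrowth_cm, 50, 400))

# Exact duplicates collapse to one row; any City/Year pair that still
# has several rows contains values that disagree and needs review
conflicts <- plausible %>%
  distinct() %>%
  group_by(City, Year) %>%
  filter(n() > 1) %>%
  ungroup()

print(conflicts)
```

Here conflicts holds the two New York 2018 rows with different burger costs, which can be shown to the end user instead of being resolved automatically.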

How to add a new column based on a few other variables

I am new to R and am having trouble creating a new variable using conditions from already existing variables. I have a dataset that has a few columns: Name, Month, Binary for Gender, and Price. I want to create a new variable, Price2, that will:
make the price charged 20 if [the month is 6-9(Jun-Sept) and Gender is 0]
make the price charged 30 if [the month is 6-9(Jun-Sept) and Gender is 1]
make the price charged 0 if [the month is 1-5 (Jan-May) or the month is 10-12 (Oct-Dec)]
--
structure(list(Name = c("ADI", "SLI", "SKL", "SNK", "SIIEL", "DJD"), Mon = c(1, 2, 3, 4, 5, 6), Gender = c(1, NA, NA, NA, 1, NA), Price = c(23, 34, 32, 64, 23, 34)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
Using case_when() from the dplyr package:
mydf$newprice <- dplyr::case_when(
mydf$Mon >= 6 & mydf$Mon <= 9 & mydf$Gender == 0 ~ 20,
mydf$Mon >= 6 & mydf$Mon <= 9 & mydf$Gender == 1 ~ 30,
mydf$Mon < 6 | mydf$Mon > 9 ~ 0)
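The same rule can also be written inside mutate(), which avoids repeating mydf$. This is only a sketch: the toy rows are invented so that every branch is exercised, and Price2 is the column name from the question. Rows in June to September with a missing Gender stay NA, as in the code above.

```r
library(dplyr)

# Invented rows: one per branch, plus two with missing Gender
mydf <- tibble(
  Mon    = c(1, 6, 7, 12, 8),
  Gender = c(1, 0, 1, NA, NA)
)

mydf <- mydf %>%
  mutate(Price2 = case_when(
    between(Mon, 6, 9) & Gender == 0 ~ 20,  # Jun-Sep, Gender 0
    between(Mon, 6, 9) & Gender == 1 ~ 30,  # Jun-Sep, Gender 1
    !between(Mon, 6, 9)              ~ 0    # Jan-May or Oct-Dec
  ))

print(mydf)
```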

How do I write a function to plot a line graph for each factor in a dataframe?

I have a dataframe, the head of which looks like this:
|trackName | week| sum|
|:--------------------|----:|---:|
|New Slang | 1| 493|
|You're Somebody Else | 1| 300|
|Mushaboom | 1| 297|
|San Luis | 1| 296|
I am interested in plotting a line graph for each of the 346 unique trackNames in the dataframe, with week on the x-axis and sum on the y-axis. To automate this process, I wrote the following function:
charts <- function(df) {
songs <- df
lim <- nrow(songs)
x <- 1
song_names <- as_tibble(unique(songs$trackName))
while (x <= lim) {
song <- song_names[x, 1]
plot.name <- paste(paste(song), "plot.png", sep = "_")
songs %>% filter(trackName == paste(song[x, 1])) %>%
ggplot(., aes(x = week, y = sum), group = 1) +
geom_line() +
labs(
x = "Week",
y = "Sum of Listens",
title = paste("Week by Week Listening Interest for", song, sep = " "),
subtitle = "Calculated by plotting the sum of percentages of the song listened per week, starting from first listen"
) +
ggsave(plot.name,
width = 20,
height = 15,
units = "cm")
x <- x + 1
}
}
However when I run charts(df), only the following error shows up and then it quits:
> charts(mini)
geom_path: Each group consists of only one observation. Do you need to
adjust the group aesthetic?
>
What am I doing wrong here and what does this error mean?
A sample of the dataframe in DPUT format:
structure(list(trackName = c("New Slang", "You're Somebody Else",
"Mushaboom", "San Luis", "The Trapeze Swinger", "Flightless Bird, American Mouth",
"tere bina - Acoustic", "Only for a Moment", "Upward Over the Mountain",
"Virginia May", "Never to Be Forgotten Kinda Year", "Little Talks",
"Jhak Maar Ke", "Big Rock Candy Mountain", "Sofia", "Aaoge Tum Kabhi",
"Deathcab", "Dil Mere", "Choke", "Phir Le Aya Dil", "Lucille",
"tere bina - Acoustic", "Dil Mere", "Only for a Moment", "This Is The Life",
"San Luis", "Main Bola Hey!", "Choo Lo", "Yeh Zindagi Hai", "Aaftaab",
"Never to Be Forgotten Kinda Year", "Khudi", "Flightless Bird, American Mouth",
"Mere Bina", "Simple Song", "Dil Haare", "Dil Hi Toh Hai", "You're Somebody Else",
"Sofia", "Who's Laughing Now", "Main Bola Hey!", "Lucille", "Eenie Meenie",
"tere bina - Acoustic", "New Slang", "Aaftaab", "Mamma Mia",
"July", "Yeh Zindagi Hai", "Someone You Loved"), week = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3), sum = c(493, 300, 297, 296, 292, 234, 214,
200, 200, 197, 192, 187, 185, 181, 175, 172, 141, 119, 106, 103,
579, 574, 501, 462, 428, 378, 320, 307, 306, 301, 301, 300, 300,
300, 300, 300, 296, 294, 251, 242, 3534, 724, 696, 512, 479,
400, 302, 300, 300, 300)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
How about using purrr::walk instead?
library(tidyverse)
library(hrbrthemes)
walk(unique(songs$trackName),
~{ggsave(plot = ggplot(filter(songs, trackName == .x), aes(x = week, y = sum), group = 1) +
geom_line(color = ft_cols$yellow) +
labs(x = "Week", y = "Sum of Listens", title = paste("Week by Week Listening Interest for", .x, sep = " "),
subtitle = "Calculated by plotting the sum of percentages of the song listened per week, starting from first listen") +
theme_ft_rc(),
file = paste0(.x,"_plot.png"), width = 20, height = 15, units = "cm")})
Note: the question was subsequently edited to remove the hrbrthemes package requirement.
You can split the dataset for each trackName and create a png file for it.
library(tidyverse)
charts <- function(df) {
df %>%
group_split(trackName) %>%
map(~{
track <- first(.x$trackName)
ggplot(.x, aes(x = factor(week), y = sum, group = 1)) +
geom_line() +
labs(
x = "Week",
y = "Sum of Listens",
title = paste("Week by Week Listening Interest for", track),
subtitle = "Calculated by plotting the sum of percentages of the song listened per week, starting from first listen"
) -> plt
ggsave(paste0(track,'.png'), plt, width = 20, height = 15, units = "cm")
})
}
charts(songs)
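On the message in the question: geom_path prints it when a line is requested for a group that contains a single row, so tracks that appear in only one week cannot be drawn as a line. A possible pre-filter, assuming you only want plots for tracks with at least two distinct weeks, is sketched below on invented rows; plottable is a hypothetical name.

```r
library(dplyr)

# Invented rows: one track with two weeks, one with a single week
songs <- tibble(
  trackName = c("New Slang", "New Slang", "Mushaboom"),
  week      = c(1, 3, 1),
  sum       = c(493, 479, 297)
)

# Keep only tracks observed in at least two distinct weeks,
# since a line needs two or more points
plottable <- songs %>%
  group_by(trackName) %>%
  filter(n_distinct(week) > 1) %>%
  ungroup()

print(plottable)
```

Calling charts(plottable) instead of charts(songs) would then skip the single-week tracks.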

Struggling to find the total number of rows that meet a certain variable grouped by another variable

I'm performing some light analysis on an NFL kickers' dataset and am trying to find the total number of kicks made from 18-29 yds for each kicker. Each row of the dataset is a made or missed field goal for a kicker, along with the distance and some other variables irrelevant to this issue. I'm using group_by() and then sum() inside summarise(), but it returns 1 for every kicker. I have tried different combinations, including filter(), but the results keep returning 1 for each kicker. Any help is appreciated :)
Some code I have tried:
kicks20to29 <- nfl_kicks1%>%
group_by(Kicker)%>%
count(filter(nfl_kicks1$`FG Length`>=18 & nfl_kicks1$`FG Length`<=29))
kicks20to29 <- nfl_kicks1%>%
group_by(Kicker)%>%
filter(`FG Length`>=18 & `FG Length`<=29)
dput output:
structure(list(Quarter = c(1, 2, 1, 2, 2, 4), `Possession Team` = c("NE",
"NE", "NE", "NE", "NE", "NE"), `Wind Speed` = c("6", "6", "12",
"12", "12", "12"), Down = c(4, 4, 4, 4, 4, 4), Distance = c(13,
7, 2, 6, 9, 12), YardLine = c(22, 20, 2, 6, 35, 25), `FG Length` = c(39,
37, 19, 23, 52, 42), `4Q to tie or take lead` = c(0, 0, 0, 0,
0, 0), Result = c("Miss", "Miss", "Good", "Good", "Good", "Miss"
), `Success Rate` = c(0, 0, 1, 1, 1, 0), Kicker = c("A.Vinatieri",
"A.Vinatieri", "A.Vinatieri", "A.Vinatieri", "A.Vinatieri", "A.Vinatieri"
), `# career kicks in study` = c(766, 766, 766, 766, 766, 766
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
One approach is to use the tally function, which counts the number of rows per group.
library(tidyverse)
nfl_kicks1 %>%
group_by(Kicker) %>%
dplyr::filter(`FG Length` >= 18 & `FG Length` <= 29) %>%
tally(name = "Number of Kicks")
## A tibble: 1 x 2
# Kicker `Number of Kicks`
# <chr> <int>
#1 A.Vinatieri 2
You can use group_by + summarise :
library(dplyr)
nfl_kicks1 %>%
group_by(Kicker) %>%
summarise(n_kicks = sum(`FG Length` >= 18 & `FG Length` <= 29))
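The two approaches differ for kickers with no attempts in the range: filtering first drops them from the result, while summing a logical condition keeps them with a count of 0. A small sketch on invented rows (the kicker names "A" and "B" are placeholders):

```r
library(dplyr)

# Invented rows: kicker B has no attempt between 18 and 29 yards
kicks <- tibble(
  Kicker      = c("A", "A", "B"),
  `FG Length` = c(19, 45, 50)
)

# Filter, then count: kickers without a matching row disappear
via_filter <- kicks %>%
  filter(`FG Length` >= 18, `FG Length` <= 29) %>%
  count(Kicker, name = "n_kicks")

# Sum a logical per kicker: every kicker is kept, possibly with 0
via_sum <- kicks %>%
  group_by(Kicker) %>%
  summarise(n_kicks = sum(`FG Length` >= 18 & `FG Length` <= 29))

print(via_filter)
print(via_sum)
```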

glmmTMB with autocorrelation of irregular times

I'm putting together a glmmTMB model. I have data collected at a single site over the course of May, every year, for 4 years. Time resolution within year can range from a few minutes (or even same minute) to days apart.
The covariance vignette says that the ar1() structure requires a regular time series, but the ou(times + 0 | group) structure can handle irregular times. That said, it looks like the times argument is a factor. How does that work with an irregular time structure?
So, for example, is this a correct use of the ou() structure?
df <- structure(list(DayYear = c(234, 220, 234, 231, 243, 229, 228,
223, 220, 218, 234, 237, 234, 231, 241, 237, 241, 241, 233, 234,
234, 232, 218, 227, 232, 229, 220, 223, 228, 224), DateTime = structure(c(1495477980,
1399590540, 1495479780, 1495225920, 1464631980, 1495052760, 1463324460,
1494525780, 1494256560, 1494088440, 1495471320, 1495730940, 1495476960,
1495225200, 1432919940, 1495725900, 1432924200, 1432918860, 1495384020,
1495479900, 1463848140, 1495298820, 1399420080, 1463253000, 1463692920,
1495037040, 1494275160, 1494510780, 1463348220, 1494597180), class = c("POSIXct",
"POSIXt"), tzone = ""), Year = c(2017, 2014, 2017, 2017, 2016,
2017, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2015, 2017,
2015, 2015, 2017, 2017, 2016, 2017, 2014, 2016, 2016, 2017, 2017,
2017, 2016, 2017), N = c(2, 2, 7, 2, 6, 4, 1, 4, 1, 3, 1, 6,
2, 2, 2, 2, 5, 5, 3, 5, 3, 2, 4, 1, 6, 2, 2, 3, 5, 2)), row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))
# create sampling factor within year
df <- df %>%
arrange(DateTime) %>%
group_by(Year) %>%
mutate(times = 1:n()) %>%
ungroup() %>%
mutate(YearF = as.factor(Year),
times = numFactor(times))
mod1 <- glmmTMB(N ~ DayYear + YearF +
ou(times + 0 | YearF),
family = nbinom2,
data = df)
This particular model doesn't run too well because the toy dataset is so tiny (and probably doesn't show what I need showing) - but is that a correct specification of the autocorrelation structure under an irregular time series?
No, it's not: you have to use decimal times/dates in numFactor. The way you've done it coerces the data set to be equally spaced. Below I use lubridate::decimal_date(DateTime) %% 1 to get the fraction-of-year variable that's used as the time coordinate.
library(dplyr)
library(lubridate)
library(glmmTMB)
df2 <- (df
%>% arrange(DateTime)
%>% group_by(Year)
%>% mutate(times = lubridate::decimal_date(DateTime) %% 1)
%>% ungroup()
)
df3 <- (df2
%>% mutate(YearF = as.factor(Year),
times = glmmTMB::numFactor(times))
%>% select(N, DayYear, YearF, times)
)
mod1 <- glmmTMB(N ~ DayYear + YearF +
ou(times + 0 | YearF),
family = nbinom2,
data = df3)
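To see why this matters, here is a small sketch with three invented timestamps: an index such as 1:n() places them one unit apart, while the fractional year keeps the ten-minute and two-day gaps distinct.

```r
library(lubridate)

# Invented timestamps: 10 minutes apart, then about 2 days apart
stamps <- as.POSIXct(
  c("2017-05-01 00:00:00", "2017-05-01 00:10:00", "2017-05-03 00:00:00"),
  tz = "UTC"
)

index_gaps <- diff(seq_along(stamps))     # always 1: the spacing is lost
frac_year  <- decimal_date(stamps) %% 1   # fraction of the year elapsed
time_gaps  <- diff(frac_year)             # unequal: the spacing is kept

print(index_gaps)
print(time_gaps)
```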

Resources