Expanding a dataframe from years to months

I have a data frame with a column for years. See below:
D <- as.data.frame(cbind(c(1998,1998,1999,1999,2000,2001,2001), c(1,2,2,5,1,3,4), c(1,5,9,2,NA,7,8)))
colnames(D) <- c('year','var1','var2')
D$start <- D$year*100+1
D$end <- D$year*100+12
print(D)
year var1 var2 start end
1 1998 1 1 199801 199812
2 1998 2 5 199801 199812
3 1999 2 9 199901 199912
4 1999 5 2 199901 199912
5 2000 1 NA 200001 200012
6 2001 3 7 200101 200112
7 2001 4 8 200101 200112
I want to copy each row 12 times, once for each month between the start and end columns. In this example start and end are always January and December, but in general they could differ. My real dataset is very large, so I would like to do this in one or two lines, preferably with dplyr, since that is the package I am most used to.

If you want all months for each row, I would do this as a join:
months = expand.grid(year = unique(D$year), month = 1:12)
left_join(D, months, by = "year")
If you only want some of the months for some years, you can filter out the ones you don't need in a following step.
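As a rough, self-contained sketch of the join-then-filter idea: the March to October cut-off below is a made-up example, and `relationship = "many-to-many"` is an assumption that dplyr 1.1.0 or later is installed (drop it on older versions). It only silences the warning that dplyr gives because `year` is repeated on both sides of the join.

```r
library(dplyr)

# Data from the question (start/end omitted, they are not needed for the join)
D <- data.frame(
  year = c(1998, 1998, 1999, 1999, 2000, 2001, 2001),
  var1 = c(1, 2, 2, 5, 1, 3, 4),
  var2 = c(1, 5, 9, 2, NA, 7, 8)
)

# One row per (year, month) combination
months <- expand.grid(year = unique(D$year), month = 1:12)

# Every row of D is matched with the 12 months of its year.
# ASSUMPTION: dplyr >= 1.1.0 for the `relationship` argument.
expanded <- left_join(D, months, by = "year", relationship = "many-to-many")

# Hypothetical follow-up filter: keep only March through October
trimmed <- filter(expanded, month >= 3, month <= 10)

nrow(expanded)  # 7 rows * 12 months
nrow(trimmed)   # 7 rows * 8 months
```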
If you really want to use the start and end columns you've created, I would do it like this:
D %>% mutate(month = Map(seq, start, end)) %>%
tidyr::unnest(cols = month)
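One caveat with Map(seq, start, end): seq counts in plain integers, so it only gives valid months when start and end fall in the same year (199811 to 199902 would pass through 199813, 199814 and so on). Below is a minimal sketch of one possible workaround, assuming the YYYYMM encoding from the question. D2, to_index and to_yyyymm are hypothetical names introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows; the second range crosses a year boundary
D2 <- data.frame(id = 1:2,
                 start = c(199803, 199811),
                 end   = c(199805, 199902))

# Convert YYYYMM to a running month count and back again
to_index  <- function(ym) (ym %/% 100) * 12 + (ym %% 100 - 1)
to_yyyymm <- function(i) (i %/% 12) * 100 + (i %% 12 + 1)

expanded <- D2 %>%
  mutate(month = Map(function(s, e) to_yyyymm(seq(to_index(s), to_index(e))),
                     start, end)) %>%
  unnest(cols = month)

expanded$month
```

The sequence is built on the running month count, so December rolls over into January of the next year instead of into a 13th month.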

We can use expand() from tidyr:
expand(D, year = unique(year), month = 1:12) %>%
left_join(D, by = 'year')

This also works:
D %>%
rowid_to_column() %>%
gather(key = key, value = date, start, end) %>%
select(-key) %>%
group_by(rowid) %>%
complete(date = full_seq(date, 1)) %>%
fill(everything(), -rowid, .direction = "downup") %>%
ungroup() %>%
arrange(rowid)
If you want to keep the start and end columns add the following before ungroup():
mutate(start = min(date), end = max(date))

Related

R - Filter data to only include date X and following date

I have data structured like below, but with many more columns.
I need to filter the data to include only instances where a person has a date of X and X+1.
In this example only person B and C should remain, and only the rows with directly adjacent dates. So rows 2,3,5,6 should be the only remaining ones.
Once it is filtered I need to count how many times this occurred as well as do calculations on the other values, likely summing up the Values column for the X+1 date.
Person <- c("A","B","B","B","C","C","D","D")
Date <- c("2021-01-01","2021-01-01","2021-01-02","2021-01-04","2021-01-09","2021-01-10","2021-01-26","2021-01-29")
Values <- c(10,15,6,48,71,3,1,3)
df <- data.frame(Person, Date, Values)
df
How would I accomplish this?

end_points <- df %>%
mutate(Date = as.Date(Date)) %>%
group_by(Person) %>%
filter(Date - lag(Date) == 1 | lead(Date) - Date == 1) %>%
ungroup()
Result
end_points
# A tibble: 4 x 3
Person Date Values
<chr> <date> <dbl>
1 B 2021-01-01 15
2 B 2021-01-02 6
3 C 2021-01-09 71
4 C 2021-01-10 3
2nd part:
end_points %>%
group_by(Person) %>%
slice_max(Date) %>%
ungroup() %>%
summarize(total = sum(Values))
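Note that slice_max(Date) keeps only the latest row per person, so a person with several separate pairs of adjacent dates would contribute just one of them. A possible alternative, sketched under the assumption that every row whose previous date for the same person is exactly one day earlier counts as one occurrence (`followers` and `result` are names made up here):

```r
library(dplyr)

df <- data.frame(
  Person = c("A", "B", "B", "B", "C", "C", "D", "D"),
  Date   = c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-04",
             "2021-01-09", "2021-01-10", "2021-01-26", "2021-01-29"),
  Values = c(10, 15, 6, 48, 71, 3, 1, 3)
)

# Keep only the "X+1" rows: the previous date for that person is one day earlier
followers <- df %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(Person) %>%
  arrange(Date, .by_group = TRUE) %>%
  filter(as.numeric(Date - lag(Date)) == 1) %>%
  ungroup()

# Count the occurrences and sum the Values on the X+1 dates
result <- summarise(followers, occurrences = n(), total = sum(Values))
result
```

With the sample data this keeps B on 2021-01-02 and C on 2021-01-10, giving 2 occurrences and a total of 9.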

Create column with a certain week value by group

I would like to create a column, by group, with a certain week's value from another column.
In this example New_column is created with the Number from the 2nd week for each group.
Group Week Number New_column
A 1 19 8
A 2 8 8
A 3 21 8
A 4 5 8
B 1 4 12
B 2 12 12
B 3 18 12
B 4 15 12
C 1 9 4
C 2 4 4
C 3 10 4
C 4 2 4
I've used this method, which works, but I feel is a really messy way to do it:
library(dplyr)
df <- df %>%
group_by(Group) %>%
mutate(New_column = ifelse(Week == 2, Number, NA))
df <- df %>%
group_by(Group) %>%
mutate(New_column = sum(New_column, na.rm = T))

Several solutions are possible, depending on what exactly you need. With your sample data, however, all of them give the same result.
1) It identifies the week number from column Week, even if the dataframe is not sorted
df %>%
group_by(Group) %>%
mutate(New_column = Number[Week == 2])
Note that even if the weeks in a group do not start from 1, this solution still looks only for the row where Week == 2.
2) If df is already sorted by Week inside each group, you could use
df %>%
group_by(Group) %>%
mutate(New_column = Number[2])
This solution does not take the Number from the row where Week == 2, but rather from the second row within each group, regardless of its actual Week value.
3) If df is not sorted by week, you could do it with
df %>%
group_by(Group) %>%
arrange(Week, .by_group = TRUE) %>%
mutate(New_column = Number[2])
This uses the same rationale as solution 2).
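For reference, here is a small runnable sketch using the sample data from the question. It swaps `Number[Week == 2]` for `Number[match(2, Week)]`, which is a variation and not one of the solutions above: match() returns the first matching position, or NA when a group has no week 2, so the result is always a single value per group.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Group  = rep(c("A", "B", "C"), each = 4),
  Week   = rep(1:4, times = 3),
  Number = c(19, 8, 21, 5, 4, 12, 18, 15, 9, 4, 10, 2)
)

res <- df %>%
  group_by(Group) %>%
  # first row where Week is 2; NA if the group has no such row
  mutate(New_column = Number[match(2, Week)]) %>%
  ungroup()

res
```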

Creating a count table of unique values when there are multiple values in a single cell using R

I am trying to create a count table from a data table that looks like this:
df <- data.frame("Spring" = c("skirt, pants, shirt", "tshirt"), "Summer" =
c("shorts, skirt", "pants, shoes"), Fall = c("Scarf", "purse, pants"))
Spring Summer Fall
1 skirt, pants, shirt shorts, skirt Scarf
2 tshirt pants, shoes purse, pants
and then a count table that looks like this in the end:
output <- data.frame("Spring" = 4, "Summer" = 4, Fall = 3)
Spring Summer Fall
1 4 4 3
So, I would just like to count the unique values in each season's column. I am having trouble with this because of the commas separating values within a single cell. I tried using length(unique()), but it is not giving me the correct number because of the commas.
Any help is appreciated!!!

One tidyverse possibility could be:
df %>%
mutate_if(is.factor, as.character) %>%
gather(var, val) %>%
mutate(val = strsplit(val, ", ")) %>%
unnest(val) %>%
group_by(var) %>%
summarise(val = n_distinct(val))
var val
<chr> <int>
1 Fall 3
2 Spring 4
3 Summer 4
If you want to match the desired output exactly, then you can add spread():
df %>%
mutate_if(is.factor, as.character) %>%
gather(var, val) %>%
mutate(val = strsplit(val, ", ")) %>%
unnest(val) %>%
group_by(var) %>%
summarise(val = n_distinct(val)) %>%
spread(var, val)
Fall Spring Summer
<int> <int> <int>
1 3 4 4
Or using the basic idea from @Sonny (this requires just dplyr):
df %>%
mutate_if(is.factor, as.character) %>%
summarise_all(list(~ n_distinct(unlist(strsplit(., ", ")))))
Spring Summer Fall
1 4 4 3

Using summarise_all:
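getCount <- function(x) {
# coerce factors to character before splitting
x <- as.character(x)
# split on a comma plus any following spaces, so " pants" and "pants" count as one value
length(unique(unlist(strsplit(x, ",\\s*"))))
}
library(dplyr)
df %>%
summarise_all(getCount)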
Spring Summer Fall
1 4 4 3
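The same count can also be sketched in base R, with no packages. The helper name count_unique is made up for this example. It trims whitespace after splitting, which matters whenever the same item appears both at the start of a cell and after a comma.

```r
# Data from the question
df <- data.frame(
  Spring = c("skirt, pants, shirt", "tshirt"),
  Summer = c("shorts, skirt", "pants, shoes"),
  Fall   = c("Scarf", "purse, pants"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on commas, trim spaces, count distinct items
count_unique <- function(x) {
  items <- trimws(unlist(strsplit(as.character(x), ",", fixed = TRUE)))
  length(unique(items))
}

counts <- sapply(df, count_unique)
counts

# Without trimws(), " b" and "b" would be counted as two different values
count_unique(c("a, b", "b"))
```

The comparison is case-sensitive, so "Scarf" and "scarf" would still be counted separately.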

Summing the number of occurrences from m/d/y to y/m

I have data on each avalanche that occurred. I need to calculate the number of avalanches for each year and month, but the data only gives the exact day on which each avalanche occurred. How do I count the occurrences in each year-month? I also only need the winter year-months (December (12) through March (3)). Please help!
library(XML)
library(RCurl)
library(dplyr)
avalanche<-data.frame()
avalanche.url<-"https://utahavalanchecenter.org/observations?page="
all.pages<-0:202
for(page in all.pages){
this.url<-paste(avalanche.url, page, sep="")
this.webpage<-htmlParse(getURL(this.url))
thispage.avalanche<-readHTMLTable(this.webpage, which=1, header=T,stringsAsFactors=F)
names(thispage.avalanche)<-c('Date','Region','Location','Observer')
avalanche<-rbind(avalanche,thispage.avalanche)
}
# subset the data to the Salt Lake Region
avalancheslc<-subset(avalanche, Region=="Salt Lake")
str(avalancheslc)
The output should look something like:
Date AvalancheTotal
2000-01 1
2000-02 2
2000-03 8
2000-12 23
2001-01 16
.
.
.
.
.
2019-03 45

Using dplyr, you could get the variable of interest ("year-month") from the Date column, group by this variable, and then compute the number of rows in each group.
In a similar way, you can filter to only get the months you like:
library(dplyr)
winter_months <- c(1:3, 12)
avalancheslc %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
mutate(YearMonth = format(Date,"%Y-%m"),
Month = as.numeric(format(Date,"%m"))) %>%
filter(Month %in% winter_months) %>%
group_by(YearMonth) %>%
summarise(AvalancheTotal = n())
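Since the scraped data is not reproduced here, the following sketch runs the same steps on a few made-up dates (the `toy` data frame is purely hypothetical), assuming the Date column is text in month/day/year form as in the code above:

```r
library(dplyr)

# Hypothetical stand-in for avalancheslc: only a Date column, as m/d/Y text
toy <- data.frame(
  Date = c("12/25/2018", "12/31/2018", "1/2/2019", "7/4/2019"),
  stringsAsFactors = FALSE
)

monthly <- toy %>%
  mutate(Date      = as.Date(Date, "%m/%d/%Y"),
         YearMonth = format(Date, "%Y-%m"),
         Month     = as.integer(format(Date, "%m"))) %>%
  filter(Month %in% c(12, 1:3)) %>%          # keep December through March
  count(YearMonth, name = "AvalancheTotal")  # rows per year-month

monthly
```

The July row is dropped by the filter, and the two December rows are counted together under 2018-12.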

We can convert to yearmon from zoo and use that in the group_by to get the number of rows
library(dplyr)
library(zoo)
dim(avalancheslc)
#[1] 5494 4
out <- avalancheslc %>%
group_by(Date = format(as.yearmon(Date, "%m/%d/%Y"), "%Y-%m")) %>%
summarise(AvalancheTotal = n())
If we need only output from December to March, then filter the data
subOut <- out %>%
filter(as.integer(substr(Date, 6, 7)) %in% c(12, 1:3))
Or it can be filtered earlier in the chain
library(lubridate)
out <- avalancheslc %>%
mutate(Date = as.yearmon(Date, "%m/%d/%Y")) %>%
filter(month(Date) %in% c(12, 1:3)) %>%
count(Date)
dim(out)
#[1] 67 2
Now, to fill the missing year-months with 0s:
library(tidyr) # provides crossing(), unite() and replace_na()
mths <- month.abb[c(12, 1:3)]
out1 <- crossing(Months = mths,
Year = year(min(out$Date)):year(max(out$Date))) %>%
unite(Date, Months, Year, sep= " ") %>%
mutate(Date = as.yearmon(Date)) %>%
left_join(out) %>%
mutate(n = replace_na(n, 0))
tail(out1)
# A tibble: 6 x 2
# Date n
# <S3: yearmon> <dbl>
#1 Mar 2014 100
#2 Mar 2015 94
#3 Mar 2016 96
#4 Mar 2017 93
#5 Mar 2018 126
#6 Mar 2019 163

max([column])where name = (each unique name in the name column) for each year in R

I am using the baby names data in R for practice.
total_n <-babynames %>%
mutate(name_gender = paste(name,sex))%>%
group_by(year) %>%
summarise(total_n = sum(n, na.rm=TRUE)) %>%
arrange(total_n)
bn <- inner_join(babynames,total_n,by = "year")
df <- bn%>%
mutate(pct_of_names = n/total_n)%>%
group_by(name, year)%>%
summarise(pct =sum(pct_of_names))
The dataframe output looked like this:
For each name, it lists every year and the pct for that year. I am stuck on getting the year with the highest pct for each name. How do I do this?

Pretty simple, once you know where the babynames data comes from. You had everything needed:
library(dplyr)
library(babynames)
total_n <-babynames %>%
mutate(name_gender = paste(name,sex))%>%
group_by(year) %>%
summarise(total_n = sum(n, na.rm=TRUE)) %>%
arrange(total_n)
bn <- inner_join(babynames,total_n,by = "year")
df <- bn%>%
mutate(pct_of_names = n/total_n)%>%
group_by(name, year)%>%
summarise(pct =sum(pct_of_names))
You were missing this final step:
df %>%
group_by(name) %>%
filter(pct == max(pct))
# A tibble: 95,025 x 3
# Groups: name [95,025]
name year pct
<chr> <dbl> <dbl>
1 Aaban 2014 4.338256e-06
2 Aabha 2014 2.440269e-06
3 Aabid 2003 1.316094e-06
4 Aabriella 2015 1.363073e-06
5 Aada 2015 1.363073e-06
6 Aadam 2015 5.997520e-06
7 Aadan 2009 6.031433e-06
8 Aadarsh 2014 4.880538e-06
9 Aaden 2009 3.335645e-04
10 Aadesh 2011 1.370356e-06
# ... with 95,015 more rows
group_by and filter are your friends.
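One thing to keep in mind: filter(pct == max(pct)) returns every tied row, so a name could appear more than once if two years share the same maximum. A sketch of an alternative using slice_max(), shown on made-up toy data (the `toy` data frame and its values are hypothetical, not taken from babynames):

```r
library(dplyr)

# Hypothetical data: Bob has a tie for his highest pct
toy <- data.frame(
  name = c("Ann", "Ann", "Bob", "Bob", "Bob"),
  year = c(2000, 2001, 2000, 2001, 2002),
  pct  = c(0.1, 0.3, 0.2, 0.2, 0.1)
)

# All rows at the maximum: Bob appears twice
tied <- toy %>%
  group_by(name) %>%
  filter(pct == max(pct)) %>%
  ungroup()

# Exactly one row per name, even when there are ties
peak <- toy %>%
  group_by(name) %>%
  slice_max(pct, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(tied)
nrow(peak)
```

Which of the two is right depends on whether tied years should all be reported or just one of them.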