Dplyr merge rows based on one column value and sum other columns - r

My current df looks like the following:
WEEK COUNT COUNT2 PERCENTAGE
2017-53 10 15 .05
2018-00 5 10 .1
2018-01 7 9 .1
....
2018-52 10 12 .06
2019-00 6 10 .05
....
What I would like to do is merge week 00 of each year into the final week of the previous year, summing COUNT, COUNT2, and PERCENTAGE. The pairs I currently have that I would like to combine are: 2017-53 and 2018-00, 2018-52 and 2019-00, 2019-52 and 2020-00. These should be merged into 2017-53, 2018-52, and 2019-52. My expected output would be the following:
WEEK COUNT COUNT2 PERCENTAGE
2017-53 15 25 .15
2018-01 7 9 .1
....
2018-52 16 22 .11
....

With tidyverse, after converting 'WEEK' to Date class, arrange by that column, extract the 'year', create a grouping on 'WEEK' based on the difference between adjacent elements of 'year', and then summarise to get the sum of the columns that match 'COUNT' or 'PERCENTAGE'.
library(stringr)
library(lubridate)
library(dplyr) #1.0.0
df1 %>%
mutate(Date = as.Date(str_c(WEEK, "-01"), format = '%Y-%U-%w')) %>%
arrange(Date) %>%
mutate(year = year(Date)) %>%
group_by(WEEK = case_when(lag(year, default = first(year)) - year < 0 ~
lag(WEEK), TRUE ~ WEEK)) %>%
summarise(across(matches("COUNT|PERCENTAGE"), sum))
# A tibble: 3 x 4
# WEEK COUNT COUNT2 PERCENTAGE
# <chr> <int> <int> <dbl>
#1 2017-53 15 25 0.15
#2 2018-01 7 9 0.1
#3 2018-52 16 22 0.11
data
df1 <- structure(list(WEEK = c("2017-53", "2018-00", "2018-01", "2018-52",
"2019-00"), COUNT = c(10L, 5L, 7L, 10L, 6L), COUNT2 = c(15L,
10L, 9L, 12L, 10L), PERCENTAGE = c(0.05, 0.1, 0.1, 0.06, 0.05
)), class = "data.frame", row.names = c(NA, -5L))

You could use colSums() as is shown here, but it's a bit convoluted. I'd recommend using aggregate and pipes, as is shown further down in the same link.
Hope this helps!
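The aggregate route mentioned above could look roughly like the sketch below. It assumes the rows are already sorted by WEEK, so that each "-00" week sits directly after the final week of the previous year. The names prev, key, and merged are introduced here only for illustration.

```r
# Sample data from the question
df1 <- data.frame(
  WEEK = c("2017-53", "2018-00", "2018-01", "2018-52", "2019-00"),
  COUNT = c(10L, 5L, 7L, 10L, 6L),
  COUNT2 = c(15L, 10L, 9L, 12L, 10L),
  PERCENTAGE = c(0.05, 0.1, 0.1, 0.06, 0.05),
  stringsAsFactors = FALSE
)

# Label of the preceding row; the first row falls back to its own label
prev <- c(df1$WEEK[1], head(df1$WEEK, -1))

# Assumption: a "-00" week belongs to the week listed immediately before it
key <- ifelse(grepl("-00$", df1$WEEK), prev, df1$WEEK)

# Sum the numeric columns within each merged week label
merged <- aggregate(df1[c("COUNT", "COUNT2", "PERCENTAGE")],
                    by = list(WEEK = key), FUN = sum)
merged
```

If the input is not sorted, ordering it by WEEK first should be enough, since the "YYYY-WW" labels sort correctly as plain strings.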

Related

From rows to columns, pivot_wider mistake [duplicate]

Hi guys, I have a list that looks like this:
Sample Value
10 152365
10 236548
10 232547
10 145987
22 98564
22 98745
22 236547
And I would like to make it like this
10 22
152365 98564
236548 98745
232547 236547
145987
I have tried pivot_wider, but since I have over 100'000 values, it gives me an error saying that some identical values were found and so it cannot work, whereas spread simply freezes...
Can you help me?
Thanks
Lore
Assuming the blank under 22 can be NA, this works:
library(dplyr)
library(tidyr) # pivot_wider
quux %>%
group_by(Sample) %>%
mutate(rn = row_number()) %>%
pivot_wider(id_cols = rn, names_from = "Sample", values_from = "Value") %>% # name id_cols explicitly; passing it positionally may not work in newer tidyr
select(-rn)
# # A tibble: 4 x 2
# `10` `22`
# <int> <int>
# 1 152365 98564
# 2 236548 98745
# 3 232547 236547
# 4 145987 NA
Data
quux <- structure(list(Sample = c(10L, 10L, 10L, 10L, 22L, 22L, 22L), Value = c(152365L, 236548L, 232547L, 145987L, 98564L, 98745L, 236547L)), class = "data.frame", row.names = c(NA, -7L))
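For a large input, a base R variant may also be worth trying. The sketch below is not from the answer above: it splits Value by Sample, pads the shorter columns with NA, and binds them together. The names cols, n, padded, and wide are made up for illustration.

```r
# Sample data from the answer above
quux <- data.frame(
  Sample = c(10L, 10L, 10L, 10L, 22L, 22L, 22L),
  Value = c(152365L, 236548L, 232547L, 145987L, 98564L, 98745L, 236547L)
)

# One vector of values per sample
cols <- split(quux$Value, quux$Sample)

# Length of the longest column
n <- max(lengths(cols))

# Indexing past the end of a vector yields NA, which pads the short columns
padded <- lapply(cols, function(v) v[seq_len(n)])

# check.names = FALSE keeps the column names "10" and "22" unchanged
wide <- do.call(data.frame, c(padded, list(check.names = FALSE)))
wide
```

Because no row identifier has to be matched, duplicated values within a sample are not a problem here.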

Transform to wide format from long in R

I have a data frame in R which looks like below
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look like this:
Jan Feb
A_Demand 10 40
A_Inventory 20 60
A_coverage
B_Demand 30
B_Inventory 40
B_coverage
A_coverage and B_coverage will be calculated in Excel using a formula. The problem I need help with is pivoting the data frame from its original long format into the wide format shown above.
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
mutate(coverage = NA_real_) %>%
pivot_longer(-c(Model,Month), names_to = "Variable") %>%
pivot_wider(id_cols = c(Model, Variable), names_from = Month ) %>%
unite(Variable, c(Model,Variable), sep = "_")
## A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA
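The reshape() attempt in the question most likely fails because Demand and Inventory are listed in idvar, which makes every row its own group and leaves nothing to spread across months. A base R sketch that first stacks the two measures and then widens by Month could look like this. The names measures, long, and wide are assumptions for illustration, and the empty coverage rows are left out.

```r
# Sample data from the question
data <- data.frame(
  Model = c("A", "B", "A"),
  Month = c("Jan", "Feb", "Feb"),
  Demand = c(10L, 30L, 40L),
  Inventory = c(20L, 40L, 60L),
  stringsAsFactors = FALSE
)

measures <- c("Demand", "Inventory")

# Stack the measure columns: one row per Model/measure/Month combination
long <- data.frame(
  Variable = paste(rep(data$Model, times = length(measures)),
                   rep(measures, each = nrow(data)), sep = "_"),
  Month = rep(data$Month, times = length(measures)),
  value = unlist(data[measures], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Only the combined label identifies a row; Month supplies the new columns
wide <- reshape(long, idvar = "Variable", timevar = "Month", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix
names(wide) <- sub("^value\\.", "", names(wide))
wide
```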

Sorting dataframe by column of letters and numbers

I have been attempting to sort my dataframe by the first column - or day - with multiple different methods listed below to no avail. I suspect it could be because it is attempting to order by the first number but I am unsure how I would change that to get it to order the rows properly. The dataset is as follows:
df1
[day][sample1][sample2]
[1,]day0 22 11
[2,]day11 23 15
[3,]day15 25 14
[4,]day2 21 13
[5,]day8 20 17
...
I am looking to order the entire row by day. I have tried the following
df[sort(as.character(df$day)),]
df[order(as.character(df$day)),]
mixedorder(as.character(df$day)) (gtools package)
The mixedorder call merely outputs an index of numbers.
Current Code:
df_0$day = metadata_df[,3]
df_0 <- df_0[,c(8,1:7)]
df1 <- aggregate(df_0[,2:ncol(df_0)], df_0[1], mean)
df1 <- df1[mixedorder(as.character(df1$day)),]
df1$day <- factor(df1$day, levels = unique(df1$day))
rownames(df1) <- 1:nrow(df1)
##Plotting expression levels
Plot1 <- ggplot() +
geom_line(data=df1, aes(x=day, y=sample1, group=1, color="blue"))+
geom_line(data=df2, aes(x=day, y=sample1, group=2, color="red"))
Note that I have done the same transformations with df2 as I have with df1. Both df1 and df2 are the same, except with slightly different values in them.
The mixedorder gives the ordered index which can be used to order the rows
df1 <- df[mixedorder(as.character(df$day)),]
df1
# day sample1 sample2
#1 day0 22 11
#4 day2 21 13
#5 day8 20 17
#2 day11 23 15
#3 day15 25 14
It is not clear how the OP is plotting.
library(tidyverse)
df1 %>%
mutate(day = factor(day, levels = unique(day))) %>%
gather(key, val, -day) %>%
ggplot(., aes(x = day, y = val, color = key)) +
geom_point()
data
df <- structure(list(day = structure(1:5, .Label = c("day0", "day11",
"day15", "day2", "day8"), class = "factor"), sample1 = c(22L,
23L, 25L, 21L, 20L), sample2 = c(11L, 15L, 14L, 13L, 17L)), .Names = c("day",
"sample1", "sample2"), class = "data.frame", row.names = c(NA,
-5L))
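If gtools is not available, the same ordering can be sketched in base R by sorting on the numeric suffix. This assumes every label has the form "day" followed by digits. The names day_num and df_sorted are hypothetical.

```r
# Sample data from the question; day is a factor, as in the dput above
df <- data.frame(
  day = factor(c("day0", "day11", "day15", "day2", "day8")),
  sample1 = c(22L, 23L, 25L, 21L, 20L),
  sample2 = c(11L, 15L, 14L, 13L, 17L)
)

# Strip the "day" prefix and convert the remainder to a number
day_num <- as.integer(sub("^day", "", as.character(df$day)))

# Order the rows by that number instead of by the character label
df_sorted <- df[order(day_num), ]

# Reset the factor levels so that plots follow the sorted order as well
df_sorted$day <- factor(df_sorted$day, levels = unique(as.character(df_sorted$day)))
df_sorted
```

Sorting the labels as plain character strings puts "day11" before "day2", which is why the attempts in the question with sort() and order() did not give the expected row order.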

Average data by month for a given latitude and longitude?

I have a table with the following headers and example data
Lat Long Date Value.
30.497478 -87.880258 01/01/2016 10
30.497478 -87.880258 01/02/2016 15
30.497478 -87.880258 01/05/2016 20
33.284928 -85.803608 01/02/2016 10
33.284928 -85.803608 01/03/2016 15
33.284928 -85.803608 01/05/2016 20
I would like to average the value column on monthly basis for a particular location.
So example output would be
Lat Long Month Avg Value
30.497478 -87.880258 January 15
A solution using dplyr and lubridate.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Date = mdy(Date), Month = month(Date)) %>%
group_by(Lat, Long, Month) %>%
summarise(`Avg Value` = mean(Value))
dt2
# A tibble: 2 x 4
# Groups: Lat, Long [?]
# Lat Long Month `Avg Value`
# <dbl> <dbl> <dbl> <dbl>
# 1 30.49748 -87.88026 1 15
# 2 33.28493 -85.80361 1 15
You can try the following, but it first modifies the data frame adding an extra column, Month, using package zoo.
library(zoo)
dat$Month <- as.yearmon(as.Date(dat$Date, "%m/%d/%Y"))
aggregate(Value. ~ Lat + Long + Month, dat, mean)
# Lat Long Month Value.
#1 30.49748 -87.88026 jan 2016 15
#2 33.28493 -85.80361 jan 2016 15
If you don't want to change the original data, make a copy dat2 <- dat and change the copy.
DATA
dat <-
structure(list(Lat = c(30.497478, 30.497478, 30.497478, 33.284928,
33.284928, 33.284928), Long = c(-87.880258, -87.880258, -87.880258,
-85.803608, -85.803608, -85.803608), Date = structure(c(1L, 2L,
4L, 2L, 3L, 4L), .Label = c("01/01/2016", "01/02/2016", "01/03/2016",
"01/05/2016"), class = "factor"), Value. = c(10L, 15L, 20L, 10L,
15L, 20L)), .Names = c("Lat", "Long", "Date", "Value."), class = "data.frame", row.names = c(NA,
-6L))
EDIT.
If you want to compute several statistics, you can define a function that computes them and returns a named vector and call it in aggregate, like the following.
stat <- function(x){
c(Mean = mean(x), Median = median(x), SD = sd(x))
}
agg <- aggregate(Value. ~ Lat + Long + Month, dat, stat)
agg <- cbind(agg[1:3], as.data.frame(agg[[4]]))
agg
# Lat Long Month Mean Median SD
#1 30.49748 -87.88026 jan 2016 15 15 5
#2 33.28493 -85.80361 jan 2016 15 15 5
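The expected output in the question shows the month as a name ("January"), which neither answer produces directly. A minimal base R sketch for that is shown below. It assumes all dates fall within a single year; data spanning several years would need a year column in the grouping too. The name res is made up here.

```r
# Sample data from the question
dat <- data.frame(
  Lat = c(30.497478, 30.497478, 30.497478, 33.284928, 33.284928, 33.284928),
  Long = c(-87.880258, -87.880258, -87.880258, -85.803608, -85.803608, -85.803608),
  Date = c("01/01/2016", "01/02/2016", "01/05/2016",
           "01/02/2016", "01/03/2016", "01/05/2016"),
  Value. = c(10L, 15L, 20L, 10L, 15L, 20L),
  stringsAsFactors = FALSE
)

# Parse the month/day/year strings
parsed <- as.Date(dat$Date, format = "%m/%d/%Y")

# month.name is a built-in vector of English month names, independent of locale
dat$Month <- month.name[as.integer(format(parsed, "%m"))]

# Average Value. per location and month
res <- aggregate(Value. ~ Lat + Long + Month, data = dat, FUN = mean)
res
```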

Combine data depending on the value of one column

I have a data frame in R
year group sales
1 2000 1 20
2 2001 1 25
3 2002 1 23
4 2003 1 30
5 2001 2 50
6 2002 2 55
I want to group the data by group, or create some kind of object: one array for each group that stores the year and the sales. Then I will try to save it as a JSON file with this structure:
[{"group": 1, "sales":[[2000,20],[2001, 25], [2002,23], [2003, 30]]},
{"group": 2, "sales":[[2001, 50], [2002,55]]}]
Is it possible to do it automatically?
Thanks a lot
We can use data.table to paste the 'year' and 'sales' columns grouped by 'group'. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'group', we use sprintf to paste 'year' and 'sales' together with the square brackets ([]), then collapse the output to a single string with toString (a wrapper for paste(..., collapse=', ')), paste the outer [], and use toJSON.
library(jsonlite)
library(data.table)
toJSON(setDT(df1)[, list(sales= paste0('[',toString(sprintf('[%d,%d]',
year, sales)),']')), by = group])
#[{"group":1,"sales":"[[2000,20], [2001,25], [2002,23], [2003,30]]"},
#{"group":2,"sales":"[[2001,50], [2002,55]]"}]
The paste by group can also be done in base R. We split the dataset by the 'group' column to create a list, loop through the list with lapply, and paste the 'year' and 'sales' columns as described above. We then create a data.frame with the first element of 'group' and the string from the paste step, rbind the list elements into a single data.frame, and use toJSON.
toJSON(
do.call(rbind,
lapply(
split(df1, df1$group),
function(x) data.frame(group=x$group[1L],
sales=paste0('[',
toString(sprintf('[%d,%d]', x$year, x$sales)),
']')))))
data
df1 <- structure(list(year = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L
), group = c(1L, 1L, 1L, 1L, 2L, 2L), sales = c(20L, 25L, 23L,
30L, 50L, 55L)), .Names = c("year", "group", "sales"),
class = "data.frame", row.names = c(NA, -6L))
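In the output above, "sales" is a quoted string rather than a nested JSON array, because the brackets were pasted together as text. If real arrays are needed, one possible sketch is to give toJSON a matrix for each group. By default jsonlite encodes a matrix row by row, and auto_unbox = TRUE writes the length-one group value as a scalar. The names records and json are made up for illustration.

```r
library(jsonlite)

# Sample data from the answer above
df1 <- data.frame(
  year = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L),
  group = c(1L, 1L, 1L, 1L, 2L, 2L),
  sales = c(20L, 25L, 23L, 30L, 50L, 55L)
)

# One record per group: the group id plus a two-column (year, sales) matrix
records <- lapply(split(df1, df1$group), function(g) {
  list(
    group = g$group[1],
    sales = unname(as.matrix(g[c("year", "sales")]))
  )
})

# unname() drops the list names so that the result is a JSON array of objects
json <- toJSON(unname(records), auto_unbox = TRUE)
json
```

The result can be written to a file with writeLines(json, "sales.json"), where the file name is only a placeholder.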
Since the other answer uses data.table, I thought it would be an interesting exercise to try this in dplyr. This is not the optimal way, but it illustrates do(), which I'm not convinced is documented well enough. I have also shown the more appropriate summarise solution.
df <-read.table(textConnection('
year group sales expenses
2000 1 20 19
2001 1 25 19
2002 1 23 20
2003 1 30 15
2001 2 50 27
2002 2 55 30
'),header=TRUE)
library(dplyr)
library(jsonlite)
df %>%
group_by( group ) %>%
do(
sales = group_by(.,year) %>% select(sales) %>% apply(MARGIN=2,identity),
expenses = group_by(.,year) %>% select(expenses) %>% apply(MARGIN=2,identity)
)
df %>%
group_by( group ) %>%
summarise(
sales = list(apply( data.frame(year,sales), MARGIN=2, identity ))
,expenses = list(apply( data.frame(year,expenses), MARGIN=2, identity )) # pair year with expenses, not sales
) %>% jsonlite::toJSON()
