Create a new dataframe showing the sum of each column - r

I have a dataframe that looks like this
Date Food Utility Travel
01 1.2 12.00 0
02 10.52 0 12.50
03 9.24 0 2.7
04 3.25 0 2.7
I want to create a new dataframe that shows in the first column the type of spending (e.g. food, utility) and then have the sum in another column. I do not need the date column in the new frame but don't want to omit it from the original.
I hope to have the below output.
Category Total
Utility 12.00
Food 24.21
Travel 17.9
I have tried creating a new value for each category and then pulling them together into a dataframe, but that gives the transposed version, and it seems a little long-winded if I were to have lots of categories.

You could do this:
library(tidyverse)
test_data <- read_table2("Date Food Utility Travel
01 1.2 12.00 0
02 10.52 0 12.50
03 9.24 0 2.7
04 3.25 0 2.7")
test_data %>%
  select(Food:Travel) %>%
  pivot_longer(cols = everything(), names_to = "Category", values_to = "val") %>%
  group_by(Category) %>%
  summarise(Total = sum(val))
#> # A tibble: 3 x 2
#> Category Total
#> <chr> <dbl>
#> 1 Food 24.2
#> 2 Travel 17.9
#> 3 Utility 12
First select the columns you want, then go long, then summarize the categories by sum.

With base R, we can stack all columns except the first into a two-column data.frame, and then do a group-by sum with aggregate:
aggregate(values ~ ind, stack(dat[-1]), sum)
# ind values
#1 Food 24.21
#2 Utility 12.00
#3 Travel 17.90
Or apply colSums to the subset of columns and stack the result:
stack(colSums(dat[-1]))[2:1]
data
dat <- structure(list(Date = 1:4, Food = c(1.2, 10.52, 9.24, 3.25),
                      Utility = c(12, 0, 0, 0), Travel = c(0, 12.5, 2.7, 2.7)),
                 class = "data.frame", row.names = c(NA, -4L))
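The stack(colSums(...)) one-liner returns columns named ind and values; to get the Category/Total labels from the expected output, a small self-contained sketch (re-creating dat as a plain data.frame):

```r
dat <- data.frame(Date = 1:4, Food = c(1.2, 10.52, 9.24, 3.25),
                  Utility = c(12, 0, 0, 0), Travel = c(0, 12.5, 2.7, 2.7))

# colSums() totals each numeric column; stack() turns the named
# vector into a two-column data.frame, and [2:1] puts the names first
res <- stack(colSums(dat[-1]))[2:1]
names(res) <- c("Category", "Total")
res
```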


Aggregate week and date in R by some specific rules

I'm not used to using R. I already asked a question on Stack Overflow and got a great answer.
I'm sorry to post a similar question, but I tried many times and got output that I didn't expect.
This time, I want to do something slightly different from my previous question.
Merge two data with respect to date and week using R
I have two data frames. One has a year_month_week column and the other has a date column.
df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2),
                  year_month_week = c(2022051, 2022052, 2022053, 2022041,
                                      2022042, 2022043, 2022044),
                  points = c(65, 58, 47, 21, 25, 27, 43))
df2 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  date = c(20220503, 20220506, 20220512, 20220401, 20220408, 20220409),
                  temperature = c(36.1, 36.3, 36.6, 34.3, 34.9, 35.3))
For df1, 2022051 means the 1st week of May, 2022. Likewise, 2022052 means the 2nd week of May, 2022. For df2, 20220503 means May 3rd, 2022. What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 are both in the 1st week of May, 2022. If more than one date falls in a year_month_week, I will just include the first of them. Now, here's the different part: even if there is no date inside a year_month_week, just leave it NA. So my expected output has the same number of rows as df1, including the year_month_week column. My expected output is as follows:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2),
                 year_month_week = c(2022051, 2022052, 2022053, 2022041,
                                     2022042, 2022043, 2022044),
                 points = c(65, 58, 47, 21, 25, 27, 43),
                 temperature = c(36.1, 36.6, NA, 34.3, 34.9, NA, NA))
First we can convert the dates in df2 into year-month-week format, then join the two tables:
library(dplyr)
library(lubridate)

df2$dt <- ymd(df2$date)
df2$wk <- day(df2$dt) %/% 7 + 1
df2$year_month_week <- as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))

df1 %>%
  left_join(df2 %>% group_by(year_month_week) %>% slice(1) %>%
              select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
id year_month_week points temperature
1 1 2022051 65 36.1
2 1 2022052 58 36.6
3 1 2022053 47 NA
4 2 2022041 21 34.3
5 2 2022042 25 34.9
6 2 2022043 27 NA
7 2 2022044 43 NA
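One detail worth noting: day(dt) %/% 7 + 1 here and the ceiling(day / 7) formula used in the next answer disagree on days that are exact multiples of 7 (the 7th, 14th, 21st, 28th). Neither is wrong; they just define "week of month" differently, and the difference doesn't matter for the sample dates. A quick self-contained check:

```r
d <- 1:14
div_week  <- d %/% 7 + 1     # integer-division definition used above
ceil_week <- ceiling(d / 7)  # ceiling definition used in the next answer
rbind(day = d, div_week, ceil_week)
# the two rows differ only at day 7 and day 14
```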
You can build off of a previous answer by taking the function to count the week of the month, then generating a join key in df2.
df1 <- data.frame(
id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2 <- data.frame(
id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
# Take the function from the previous StackOverflow question
monthweeks.Date <- function(x) {
  ceiling(as.numeric(format(x, "%d")) / 7)
}
# Create a year_month_week variable to join on
df2 <- df2 %>%
  mutate(
    date = lubridate::parse_date_time(x = date, orders = "%Y%m%d"),
    year_month_week = paste0(
      lubridate::year(date),
      0,  # hard-coded zero pad; only correct for single-digit months
      lubridate::month(date),
      monthweeks.Date(date)),
    year_month_week = as.double(year_month_week))
# Remove duplicate year_month_weeks
df2 <- df2 %>%
  arrange(year_month_week) %>%
  distinct(year_month_week, .keep_all = TRUE)
# Join dataframes
df1 <- left_join(df1, df2, by = "year_month_week")
Produces this result
id.x year_month_week points id.y date temperature
1 1 2022051 65 1 2022-05-03 36.1
2 1 2022052 58 1 2022-05-12 36.6
3 1 2022053 47 NA <NA> NA
4 2 2022041 21 2 2022-04-01 34.3
5 2 2022042 25 2 2022-04-08 34.9
6 2 2022043 27 NA <NA> NA
7 2 2022044 43 NA <NA> NA
Edit: forgot to mention that you need tidyverse loaded
library(tidyverse)

Arrange data by variables of a data.frame in R?

I have written the following script to get the data in longer format. How can I get the data.frame arranged by variable and not by Date? That means I should first get the data for variable A for all the dates, followed by variable X.
library(lubridate)
library(tidyverse)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("1979-01-01"), to = as.Date("1979-12-31"), by = "day"),
                 A = runif(365, 1, 10), X = runif(365, 5, 15)) %>%
  pivot_longer(-Date, names_to = "Variables", values_to = "Values")
Maybe I haven't understood right, but you can arrange your data by the Variables column using the arrange() function.
library(tidyverse)
DF <- DF %>%
arrange(Variables)
Resulting in this:
# A tibble: 730 x 3
Date Variables Values
<date> <chr> <dbl>
1 1979-01-01 A 3.59
2 1979-01-02 A 8.09
3 1979-01-03 A 4.68
4 1979-01-04 A 8.95
5 1979-01-05 A 9.46
6 1979-01-06 A 1.41
7 1979-01-07 A 5.75
8 1979-01-08 A 9.03
9 1979-01-09 A 5.96
10 1979-01-10 A 5.11
# ... with 720 more rows
In base R, we can use
DF1 <- DF[order(DF$Variables),]
Am I missing something? This is it.
arrange(DF, Variables, Date) %>% select(Variables, everything())

How to create columns based on multiple conditions from another data frame

I have two data frames like below:
df = data.frame(Vintage = c(2016, 2017, 2018, 2019),
                Mean = c(6.9, 11.5, 7.5, 11.9),
                Upper = c(10.0, 14.5, 13.2, 14.9),
                Median = c(8.3, 10.9, 10.2, 12.1),
                Lower = c(5.3, 8.2, 6.3, 9.4),
                Deviation = c(6.5, 5.1, 9.3, 5.9))
df
Vintage Mean Upper Median Lower Deviation
1 2016 6.9 10.0 8.3 5.3 6.5
2 2017 11.5 14.5 10.9 8.2 5.1
3 2018 7.5 13.2 10.2 6.3 9.3
4 2019 11.9 14.9 12.1 9.4 5.9
df1 = data.frame(Name = c("A", "B", "C"),
                 Year = c(2017, 2018, 2019),
                 Performance = c(7.7, 7.2, 15.2))
df1
Name Year Performance
1 A 2017 7.7
2 B 2018 7.2
3 C 2019 15.2
I'd like to add two columns to df1 based on the following conditions:
df1$Quartile: when df1$Year = df$Vintage,
if df1$Performance > df$Upper, then "Fourth";
if df$Upper>df1$Performance > df$Median, then "Third";
if df$Median>df1$Performance > df$Lower, then "Second";
if df$Lower>df1$Performance , then "First".
df1$Z_Score = (df1$Performance - df$Mean) / df$Deviation when df1$Year = df$Vintage.
The result should look like this:
Name Year Performance Quartile Z_Score
1 A 2017 7.7 First -0.75
2 B 2018 7.2 Second -0.03
3 C 2019 15.2 Fourth 0.56
library(dplyr)
df %>%
  inner_join(df1, by = c(Vintage = 'Year')) %>%
  mutate(Quartile = case_when(Performance > Upper ~ 'Fourth',
                              Performance > Median ~ 'Third',
                              Performance > Lower ~ 'Second',
                              TRUE ~ 'First'),
         Z_Score = (Performance - Mean) / Deviation) %>%
  select(Name, Year = Vintage, Performance, Quartile, Z_Score)
# Name Year Performance Quartile Z_Score
# 1 A 2017 7.7 First -0.74509804
# 2 B 2018 7.2 Second -0.03225806
# 3 C 2019 15.2 Fourth 0.55932203
You could also use cut instead of dplyr::case_when (as #akrun suggests in comments). Same output as above, except Quartile is now a factor instead of character.
df %>%
  inner_join(df1, by = c(Vintage = 'Year')) %>%
  rowwise() %>%
  mutate(Quartile = cut(Performance, c(0, Lower, Median, Upper, Inf),
                        c('First', 'Second', 'Third', 'Fourth')),
         Z_Score = (Performance - Mean) / Deviation) %>%
  select(Name, Year = Vintage, Performance, Quartile, Z_Score)
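The rowwise is what makes this work: cut() takes a single vector of breaks, and here the breaks (Lower, Median, Upper) differ for every row. A single call, using the 2017 row's values, looks like this:

```r
# one Performance value (7.7) binned against the 2017 row's breaks
q <- cut(7.7,
         breaks = c(0, 8.2, 10.9, 14.5, Inf),
         labels = c('First', 'Second', 'Third', 'Fourth'))
as.character(q)
# "First" -- 7.7 falls below the Lower break of 8.2
```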
A data.table option, which modifies df1 by reference rather than creating a new data.frame (df1 must first be converted with setDT()):
library(data.table)
setDT(df1)

df1[df, on = .(Year = Vintage),
    ':='(Quartile =
           mapply(function(p, l, m, u)
             cut(p, c(0, l, m, u, Inf), c('First', 'Second', 'Third', 'Fourth')),
             Performance, i.Lower, i.Median, i.Upper),
         Z_Score = (Performance - i.Mean) / i.Deviation)]
df1
# Name Year Performance Quartile Z_Score
# 1: A 2017 7.7 First -0.74509804
# 2: B 2018 7.2 Second -0.03225806
# 3: C 2019 15.2 Fourth 0.55932203

R aggregate variable but duplicate new values in original dataframe

I am new to R, and I've run into what I imagine is a very simple problem:
I am currently trying to aggregate an hourly variable into daily averages. The trick is I want to keep these new daily averages in my original data frame. While I have been able to use aggregate() or summaryBy() to build a new daily aggregated data frame, I would like to simply repeat the averaged values within my original data frame. Shown below is a head of my frame:
- x y
50 4.650097 2017-3-12-16
51 6.499223 2017-3-12-17
52 8.741650 2017-3-12-18
53 8.358922 2017-3-12-19
54 8.650971 2017-3-12-20
55 6.928252 2017-3-12-21
What I want to do is aggregate x, which is an hourly measurement, into a single daily average, but include those repeated averages as new columns.
For example, let's say the average of x is 6.12 for the first 24 rows. I want 6.12 repeated down a new column for those 24 rows, instead of creating a new single-value vector.
Thank you in advance for any advice!
Here is a dplyr solution:
library(dplyr)

df %>%
  mutate(date = as.Date(as.POSIXct(strptime(y, "%Y-%m-%d-%H")))) %>%
  group_by(date) %>%
  mutate(mean.x = mean(x))
## A tibble: 9 x 5
## Groups: date [2]
# X. x y date mean.x
# <int> <dbl> <fct> <date> <dbl>
#1 50 4.65 2017-3-12-16 2017-03-12 7.30
#2 51 6.50 2017-3-12-17 2017-03-12 7.30
#3 52 8.74 2017-3-12-18 2017-03-12 7.30
#4 53 8.36 2017-3-12-19 2017-03-12 7.30
#5 54 8.65 2017-3-12-20 2017-03-12 7.30
#6 55 6.93 2017-3-12-21 2017-03-12 7.30
#7 100 5.00 2017-4-23-16 2017-04-23 5.00
#8 101 6.00 2017-4-23-17 2017-04-23 5.00
#9 102 4.00 2017-4-23-18 2017-04-23 5.00
Explanation: Convert y to POSIXct format, extract date component, group_by date, and create new column with daily mean.
Sample data
df <- read.table(text =
"- x y
50 4.650097 2017-3-12-16
51 6.499223 2017-3-12-17
52 8.741650 2017-3-12-18
53 8.358922 2017-3-12-19
54 8.650971 2017-3-12-20
55 6.928252 2017-3-12-21
100 5.0 2017-4-23-16
101 6.0 2017-4-23-17
102 4.0 2017-4-23-18", header = T)
This is untested as you haven't provided a reproducible form of your data (check out dput), but this should at least point you in the right direction. Just replace mydf with whatever your dataframe is called.
library(tidyr)
library(dplyr)
aggregated_df <- mydf %>%
  separate(y, c("date", "hour"), sep = -3) %>%
  group_by(date) %>%
  mutate(daily_average = mean(x))
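For what it's worth, a negative sep in separate() counts character positions from the right, so sep = -3 peels the last three characters off y; with the two-digit hours in this data, the hour column keeps a leading dash. A quick self-contained check:

```r
library(tidyr)

# one row in the same shape as the question's data
tmp <- data.frame(x = 4.65, y = "2017-3-12-16")
out <- separate(tmp, y, c("date", "hour"), sep = -3)
out$date  # "2017-3-12"
out$hour  # "-16" (note the leading dash)
```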

Days since a variable changed dplyr

Does anyone know of a dplyr method for calculating the number of days since a variable changed (by groups)? For example, consider the number of days since a particular store last changed its price.
library(dplyr)
df <- data.frame(store = c(34, 34, 34, 34, 34, 28, 28, 28, 81, 81),
date = c(20111231, 20111224, 20111217, 20111210, 20111203,
20111224, 20111217, 20111203, 20111231, 20111224),
price = c(3.45, 3.45, 3.45, 3.36, 3.45, 3.17, 3.25, 3.15,
3.49, 3.17))
df <- df %>%
  mutate(date = as.Date(as.character(date), format = "%Y%m%d")) %>%
  arrange(store, desc(date)) %>%
  group_by(store) %>%
  mutate(pchange = price - lead(price))
df$days.since.change <- c(7, 14, 0, 21, 14, 7, 7, 0, 7, 0)
I'm trying to use dplyr to generate a variable called days.since.change. For example, store 34 charged $3.45 on 2011-12-31, a price which had been in effect for 21 days (since it charged $3.36 on 2011-12-10). The variable appears manually above. The challenge is that a store might change its price back to an earlier price level, which invalidates some grouping strategies.
One option is to calculate the number of days between each price listing for each store and then adding a second grouping variable to group together consecutive dates during which the price didn't change. Then just take the cumulative sum over the days that passed.
I did this with the dataset sorted by date in ascending order, using lag instead of lead to avoid calling arrange twice, but of course you could change this around. I also left the group variable in the dataset, which you likely won't want; it can be removed by ungrouping and then using select.
df %>%
  mutate(date = as.Date(as.character(date), format = "%Y%m%d")) %>%
  arrange(store, date) %>%
  group_by(store) %>%
  mutate(pchange = price - lag(price), dchange = as.numeric(date - lag(date))) %>%
  group_by(store, group = cumsum(c(1, diff(price) != 0))) %>%
  mutate(dchange = cumsum(dchange))
Source: local data frame [10 x 6]
Groups: store, group
store date price pchange dchange group
1 28 2011-12-03 3.15 NA NA 1
2 28 2011-12-17 3.25 0.10 14 2
3 28 2011-12-24 3.17 -0.08 7 3
4 34 2011-12-03 3.45 NA NA 1
5 34 2011-12-10 3.36 -0.09 7 2
6 34 2011-12-17 3.45 0.09 7 3
7 34 2011-12-24 3.45 0.00 14 3
8 34 2011-12-31 3.45 0.00 21 3
9 81 2011-12-24 3.17 NA NA 1
10 81 2011-12-31 3.49 0.32 7 2
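The heart of this answer is the run-id trick: taking cumsum() over a logical vector that flags where the price changes gives every run of consecutive equal prices a shared id, which the second group_by then uses. In isolation:

```r
price <- c(3.45, 3.36, 3.45, 3.45, 3.45)
# 1 for the first row, then 1 wherever the price differs from the previous row
run_id <- cumsum(c(1, diff(price) != 0))
run_id
# 1 2 3 3 3 -- the last three rows share an id because the price held steady
```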
