I have a dataframe with the following sample:
df = data.frame(x1 = c("2000a","2010a","2000b","2010b","2000c","2010c"),
                x2 = c(1,2,3,4,5,6))
I am trying to find a way to calculate the percent change for each "group" (a,b,c) using the change() function. Below is my attempt:
percent_change = change(df,x2, NewVar = "percent_change", slideBy = 1,type = 'percent')
where slideBy is the lag variable that restarts the percent change calculation every other observation. This does not work, and I get the following error:
" Remember to put data in time order before running.
Leading total_units by 1 time units."
Would it be possible to adapt my x1 column to a time series or is there an easier way around this I am missing?
Thank you!
This uses the data.table structure from the data.table package. It first sorts on x1, then calculates the percent change row by row, grouping by the letter in x1.
library(data.table)
setDT(df)
df[order(x1),
   100 * (x2 / shift(x2, 1L) - 1),   # percent change relative to the previous row in the group
   keyby = gsub("[0-9]", "", x1)]
Here is a tidyverse way to do this. First, use extract to separate x1 into year and group, then pivot_wider on the table. Now you can use mutate to create the percent change column (expressed as a fraction; multiply by 100 for a percentage).
library(dplyr)
library(tidyr)
df = data.frame(x1 = c("2000a","2010a","2000b","2010b","2000c","2010c"),x2 = c(1,2,3,4,5,6))
df_new = df %>%
  extract(x1, c("year", "group"), regex = "(\\d{4})(\\D{1})") %>%
  pivot_wider(names_from = year, values_from = x2) %>%
  mutate(percent_change = (`2010` - `2000`) / `2000`)
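For comparison, the same per-group percent change can be sketched in base R without reshaping. This is only an illustration: the helper columns group and year are names introduced here, not part of the original data, and the sketch assumes every x1 value is digits followed by a group label.

```r
df <- data.frame(x1 = c("2000a", "2010a", "2000b", "2010b", "2000c", "2010c"),
                 x2 = c(1, 2, 3, 4, 5, 6))

# Split x1 into its letter part and its numeric part (hypothetical helper columns)
df$group <- gsub("[0-9]", "", df$x1)
df$year  <- as.integer(gsub("[^0-9]", "", df$x1))

# Put the rows in time order within each group before differencing
df <- df[order(df$group, df$year), ]

# Percent change relative to the previous observation of the same group;
# the first observation of each group has no predecessor, so it is NA
df$percent_change <- ave(df$x2, df$group,
                         FUN = function(v) c(NA, 100 * diff(v) / head(v, -1)))
print(df)
```

Unlike the wide-format answer, this keeps one row per observation, which may be more convenient when a group has more than two years.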
Related
I am attempting to combine multiple columns in my dataset. I have been using the unite() function, but this does only half of the work: it combines the columns, whereas I need the mean of all the numbers.
Unite <- Complete_TrainingSet %>%
  unite(col = "PP1-3", PP1, PP2, PP3)
This was my code. How would I also get it to calculate the mean?
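Maybe you are looking for a solution like this, since you explicitly use unite.
See this example with fake data. Here you unite all columns into one and then calculate the mean of that column. The united column is a character vector, so it has to be converted with as.numeric, and its name needs backticks because an unquoted PP1-3 is parsed as PP1 minus 3.
library(dplyr)
library(tidyr)
df %>%
  unite("PP1-3", PP1, PP2, PP3, sep = "") %>%
  summarise(mean = mean(as.numeric(`PP1-3`)))
Output
     mean
1 56575.5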
Data
df <- structure(list(PP1 = 1:10, PP2 = 11:20, PP3 = 21:30), class = "data.frame", row.names = c(NA,
-10L))
Assuming you want the row means of the three columns, you can use rowMeans for this, like:
Unite <- Complete_TrainingSet %>%
  mutate(`PP1-3` = rowMeans(select(., PP1, PP2, PP3)))
With select you select the columns you want.
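To make the difference from unite concrete, here is a small base R sketch of the row-mean approach applied to the fake data from the other answer. The data frame name df and the column name PP1-3 follow the examples above; nothing here comes from the asker's real Complete_TrainingSet.

```r
# Fake data: three numeric columns, ten rows
df <- data.frame(PP1 = 1:10, PP2 = 11:20, PP3 = 21:30)

# Row-wise mean of the three columns, stored under a non-syntactic name
df[["PP1-3"]] <- rowMeans(df[, c("PP1", "PP2", "PP3")])

print(head(df, 3))
```

Because the column name contains a hyphen, it has to be accessed with backticks or with [[ ]], as done here.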
I am looking for a way to assign the values of a column to percentile brackets. The data looks similar to this, but with more complex values in column E:
data.frame(Date=c(rep("2010-01-31", 60), rep("2010-02-28", 60)), E=c(rep(1:20, 6)))
The data should be grouped by the Date variable. The brackets will be used to create a histogram like the one attached below. If you could also help me with code that does that, it would be great.
Do you mean something along the lines of:
df <- df %>%
  group_by(Date) %>%
  mutate(first = quantile(E, 0.5),
         second = quantile(E, 0.95))
With data.table:
library(data.table)
setDT(df)
df[, c("first", "second") := list(quantile(E, 0.5), quantile(E, 0.95)), by = "Date"]
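If the goal is to label every row with the bracket it falls into, rather than to attach the cut points, one possible sketch combines quantile with cut inside each Date group. The quartile probabilities and the column name bracket are assumptions chosen for illustration, and the approach assumes the quantiles within each group are distinct.

```r
df <- data.frame(Date = c(rep("2010-01-31", 60), rep("2010-02-28", 60)),
                 E = rep(1:20, 6))

# Bracket boundaries as probabilities (quartiles here; an assumption)
probs <- seq(0, 1, by = 0.25)

# For each Date, cut E at that group's own quantiles and keep the bracket index
df$bracket <- ave(df$E, df$Date, FUN = function(v) {
  as.integer(cut(v, breaks = quantile(v, probs), include.lowest = TRUE))
})

# Number of observations per bracket
print(table(df$bracket))
```

The counts from table(df$bracket) could then be passed to barplot to draw the histogram-style chart.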
I probably need an ifelse statement similar to this expanded to include all percentiles.
CombData <- CombData %>%
group_by(Date) %>%
mutate(E_P = ifelse(E
I have a df consisting of daily returns for various maturities. The first column consists of dates and the next 12 are maturities. I want to create a new df that calculates the difference in consecutive daily rates. Not sure where to start.
With multiple columns, diff can be applied to a matrix
rbind(0, diff(as.matrix(df[-1])))
Or we can use dplyr
library(dplyr)
df %>%
  mutate_at(vars(-Date), ~ . - lag(.))
Reproducible example
diff(as.matrix(head(mtcars)))
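As a further sketch closer to the question, here is the matrix approach on a small made-up table of rates. The data frame rates, its values, and the column names m1 and m2 are invented for illustration; the real data has a date column followed by 12 maturities.

```r
# Made-up daily rates: a Date column followed by one column per maturity
rates <- data.frame(Date = as.Date("2019-01-01") + 0:3,
                    m1 = c(1.0, 1.2, 1.1, 1.5),
                    m2 = c(2.0, 2.5, 2.4, 2.4))

# diff() on a matrix differences every column at once; the result has one
# row fewer than the input, so it is paired with the dates from day 2 on
changes <- data.frame(Date = rates$Date[-1],
                      diff(as.matrix(rates[-1])))
print(changes)
```

Dropping the first date, instead of padding with a row of zeros as rbind(0, ...) does, avoids implying that the first day had no change.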
In the future, please provide a reprex rather than just a picture!
Here is one way to get what you're looking for:
library(dplyr)
df <-
  data.frame(
    dates = c("2019-01-01", "2019-01-02", "2019-01-03"),
    original_numbers = c(1, 2, 3)
  )
df2 <- df %>%
  mutate(
    difference = original_numbers - lag(original_numbers)
  )
I am trying to generate a new column with values derived from the original table. I would like to calculate the group average for each hotel and date first, then divide the original sales by these group averages.
Here is my code. I tried to calculate the group average using group_by and summarise from the dplyr package; however, it did not produce the results I expected.
hotel = c(rep("Hilton",3), rep("Caesar",3))
date1 = c(rep('2018-01-01',2), '2018-01-02', rep('2018-01-01',3))
dba = c(2,0,1,3,2,1)
sales = c(3,5,7,5,2,3)
df = data.frame(cbind(hotel, date1, dba, sales))
df1 = df %>%
  group_by(date1, hotel) %>%
  dplyr::summarise(avg = mean(sales)) %>%
  acast(., date1~hotel)
Any suggestion would be highly appreciated!
Instead of summarise, we can use mutate. After grouping by 'date1', 'hotel', divide the 'sales' by the mean of 'sales' to create a new column
library(tidyverse)
df %>%
  group_by(date1, hotel) %>%
  mutate(SalesDividedByMean = sales/mean(sales))
NOTE: cbind returns a matrix, and a matrix can hold only a single type, so when the columns have different types, one character vector converts the whole matrix to character. Wrapping the result in data.frame propagates that change: the columns become factor (when stringsAsFactors = TRUE, the default before R 4.0.0) or character.
data
df <- data.frame(hotel, date1, dba, sales)
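The effect described in the note can be checked with a short base R sketch. The names bad and good are introduced here for illustration, and ave is used in place of the dplyr pipeline only to keep the example free of package dependencies.

```r
hotel <- c(rep("Hilton", 3), rep("Caesar", 3))
date1 <- c(rep("2018-01-01", 2), "2018-01-02", rep("2018-01-01", 3))
sales <- c(3, 5, 7, 5, 2, 3)

# cbind() builds a character matrix first, so sales is no longer numeric
bad  <- data.frame(cbind(hotel, date1, sales))
# Passing the vectors directly keeps each column's own type
good <- data.frame(hotel, date1, sales)

print(class(bad$sales))
print(class(good$sales))

# Sales divided by the mean of its (date1, hotel) group
good$ratio <- good$sales / ave(good$sales, good$date1, good$hotel)
print(good)
```

With the non-numeric sales column in bad, mean(sales) would return NA with a warning, which is consistent with the summarise step not giving the expected averages.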
Edit: I asked this question poorly. For a clearer version, please see Find the variance over a sliding window in dplyr
I'm trying to call a function using each row's value and that of the group.
# make some data with categories a and b
library(dplyr)
df = expand.grid(
  a = LETTERS[1:3],
  b = 1:3,
  x = 1:5
)
# add a variable that changes within group
df$b2 = df$b + floor(runif(nrow(df))*100)
df %>%
  # group the data
  group_by(a, b) %>%
  # row by row analysis
  rowwise() %>%
  # do some function based on this row's value and the vector for the group
  mutate(y = x + 100*max(.$b2))
I want .$b2 to correspond to only items in the current group. Instead it's the entire data frame.
Is there any way to get just the group's data?
Note: I don't actually care about max. It's just a standin for a more complicated function. I need to be able to call foo(one_value, group_vector).
Try
df %>%
  group_by(a, b) %>%
  mutate(y = x + 100*max(b2))
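For the more general case mentioned in the question, where a function takes one row value plus the whole group vector, a possible sketch is to iterate over the rows inside a grouped mutate. The function foo below is a hypothetical stand-in (it simply reproduces the max example), and set.seed is added only to make the random data repeatable.

```r
library(dplyr)

set.seed(1)  # only so the random b2 values are repeatable
df <- expand.grid(a = LETTERS[1:3], b = 1:3, x = 1:5)
df$b2 <- df$b + floor(runif(nrow(df)) * 100)

# Hypothetical stand-in for the real function: one value plus the group's vector
foo <- function(one_value, group_vector) one_value + 100 * max(group_vector)

res <- df %>%
  group_by(a, b) %>%
  # inside a grouped mutate, b2 is only the current group's vector;
  # vapply calls foo once per row of the group
  mutate(y = vapply(x, foo, numeric(1), group_vector = b2)) %>%
  ungroup()

print(head(res))
```

Dropping rowwise() is what matters: inside a grouped mutate, bare column names refer to the current group, whereas . always refers to the full data frame passed into the pipe.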