I have a data frame with a large amount of annual data. Consider the following toy example:
dat <- data.frame(id = 1:2, quantity = 3:4, agg_2002 = 5:6, agg_2003 = 7:8, agg_2020 = 9:10)
What I would like to do is the following:
Look for columns whose names contain "agg_" in the set of column names, names(df)
Substitute the "agg_" in those names with "change_"
Calculate the relative change from year to year, so for example,
df$change_2002 <- df$agg_2002/df$agg_2002 (since 2002 is the first year)
df$change_2003 <- df$agg_2003/df$agg_2002
df$change_2004 <- df$agg_2004/df$agg_2003, and so on up to 2020 or the latest year with "agg_" in the column name.
What I have so far is the following function:
func <- function(dat, overwrite = FALSE) {
nms <- grep("agg_[0-9]+$", names(dat), value = TRUE)
revnms <- gsub("agg_", "chg_", nms)
for (i in seq_along(revnms)) {  # this indexing is where I get stuck
dat[, revnms][i] <- lapply(dat[, revnms][i], `/`, dat[, revnms][i - 1])
}
dat
}
What I am struggling with is the indexing. How do I get R to make the above calculations recursively without having to do it manually? The desired result is the "chg_" columns appended to the original dataframe:
id quantity agg_2002 agg_2003 agg_2020 chg_2002 chg_2003 chg_2020
1 1 3 5 7 9 1 1.40 1.28
2 2 4 6 8 10 1 1.33 1.25
I would like to modify the specified function above to produce the desired result via lapply if possible. All ideas are welcome. Thank you.
UPDATE: I would much prefer something using lapply, or something that can accommodate differing data types
You can pivot the table to long form, change the names (e.g. with gsub or str_replace), then pivot back to wide:
library(tidyverse)
library(stringr)
df <- dat %>% pivot_longer(-c(id,quantity), names_to = "agg", values_to = "year") %>%
mutate(agg = str_replace(agg, "agg", "change")) %>%
group_by(id) %>%
mutate(year = ifelse(is.na(lag(year)), year/year, year/lag(year))) %>% # Divide itself if there is no lag(year)
pivot_wider(names_from = "agg", values_from = "year")
inner_join(dat, df, by = c("id","quantity"))
id quantity agg_2002 agg_2003 agg_2020 change_2002 change_2003 change_2020
1 1 3 5 7 9 1 1.400000 1.285714
2 2 4 6 8 10 1 1.333333 1.250000
Here is a solution with dplyr and tidyr:
library(tidyr)
library(dplyr)
dat %>%
pivot_longer(cols = starts_with("agg"),
names_to = "year",
names_prefix = "agg_",
values_to = "agg") %>%
group_by(id) %>%
arrange(year) %>%
mutate(change = agg / lag(agg, 1)) %>%
pivot_wider(names_from = year, values_from = c("agg", "change"))
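Since the update asks for lapply, here is a minimal base-R sketch of the same idea (it assumes the year suffixes sort correctly as text, which holds for four-digit years): each agg_ column is divided by the previous year's column, and the first year by itself.
func <- function(dat) {
  # year columns, sorted so each one can be paired with the previous year
  nms <- sort(grep("^agg_[0-9]+$", names(dat), value = TRUE))
  chg <- gsub("agg_", "chg_", nms)
  prev <- c(nms[1], nms[-length(nms)])   # the first year is divided by itself
  dat[chg] <- lapply(seq_along(nms), function(i) dat[[nms[i]]] / dat[[prev[i]]])
  dat
}
func(dat)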
I have a problem that sounds easy, but I could not find a solution in R. I would like to shift values according to the year of release: the first column holds the release years, and the remaining columns are the years in which devices broke (the values are counts of broken devices).
This is a solution in Python:
def f(x):
    shifted = np.argmin(x.index.astype(int) < x.name[0])
    return x.shift(-shifted)
df = df.set_index(['Delivery Year', 'Freq']).apply(f, axis=1)
df.columns = [f'Year.{i + 1}' for i in range(len(df.columns))]
df = df.reset_index()
df
I would like to have it in R too.
# TEST
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1976` = c(10,NA,NA,NA),
`Year.1977` = c(5,3,NA,NA),
`Year.1978` = c(10,NA,8,NA),
`Year.1979` = c(13,10,5,14)
)
data
# DESIRED
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1` = c(10,3,8,14),
`Year.2` = c(5,NA,5,NA),
`Year.3` = c(10,10,NA,NA),
`Year.4` = c(13,NA,NA,NA)
)
data
In addition, would it also be possible to express the number of broken devices as a percentage of the Freq column?
Thank you
Using tidyverse
data %>%
pivot_longer(!c(Delivery.Year, Freq)) %>%
separate(name, c("Lab", "Year")) %>%
select(-Lab) %>%
mutate_all(as.numeric) %>%
filter(Year >= Delivery.Year) %>%
group_by(Delivery.Year, Freq) %>%
mutate(ind = paste0("Year.", row_number()),
per = value/Freq) %>%
ungroup() %>%
pivot_wider(id_cols = c(Delivery.Year, Freq), names_from = ind, values_from = c(value, per))
I pivoted the data into long form to begin with and separated the original column names (Year.1976, Year.1977, etc.) to keep only the years, dropping the "Year" piece. Then I converted all columns to numeric to allow mathematical operations such as filtering for Year >= Delivery.Year. Next, I created a column with the names you requested (Year.1, Year.2, etc.) and calculated the percentage. Finally, pivot_wider puts the data back into the format you requested. One thing to note: I was unclear whether you wanted both the original values and the percentages or just the percentages; if you only want the percentages, values_from = per should do it.
library(dplyr)
f <- function(df) {
  years <- paste0("Year.", sort(as.integer(na.omit(stringr::str_extract(colnames(df), "\\d+")))))
  df1 <- df %>% select(all_of(years))
  df2 <- df %>% select(-all_of(years))
  # take the first non-NA value of each row and put it in the first year column
  val <- numeric(nrow(df1))
  firstyear <- years[1]
  for (k in seq_len(nrow(df1))) {
    vec <- as.numeric(as.vector(df1[k, ]))
    val[k] <- na.omit(vec)[1]
  }
  df1[firstyear] <- val
  colnames(df1) <- paste0("Year.", seq_len(ncol(df1)))
  df <- cbind(df2, df1)
  df
}
> f(data)
Delivery.Year Freq Year.1 Year.2 Year.3 Year.4
1 1976 120 10 5 10 13
2 1977 100 3 3 NA 10
3 1978 80 8 NA 8 5
4 1979 60 14 NA NA 14
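For comparison, here is a hedged base-R sketch of the full left-shift (a rough translation of the Python idea in the question, assuming the Year.<yyyy> column naming of the example data):
year_cols <- grep("^Year\\.", names(data), value = TRUE)
yrs <- as.integer(sub("Year\\.", "", year_cols))
shifted <- t(apply(data, 1, function(r) {
  v <- as.numeric(r[year_cols])
  k <- sum(yrs < as.integer(r[["Delivery.Year"]]))  # how many leading columns to drop
  c(v[(k + 1):length(v)], rep(NA, k))
}))
out <- data.frame(data[c("Delivery.Year", "Freq")], shifted)
names(out)[-(1:2)] <- paste0("Year.", seq_along(year_cols))
# optionally express the counts as a share of Freq:
# out[-(1:2)] <- out[-(1:2)] / data$Freq
out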
I have already read a variety of threads on dynamically named variables, but I couldn't quite find an answer.
I have two dataframes.
df <- data.frame(qno=c(1,2,3,4))
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
Now I want to create another column, average, in the df data frame that contains the mean of the corresponding column in ref (Q1 for qno 1, Q2 for qno 2, and so on).
Intended output:
df <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
Here is what I have tried:
df <- df %>%
mutate(average := mean(!!as.name(paste0("ref$Q", qno))))
I have also tried a version with a for loop, but that didn't work either.
for (i in 1:length(df$qno)){
df$average[i] <- mean(as.name(paste0("ref$Q",df$qno[i])))
}
df <- df %>%
mutate(average = mean(as.name(paste0("ref$Q", qno))))
Here it is with mutate:
df %>% mutate(average = t(ref %>% summarise(across(everything(), ~mean(.x, na.rm = TRUE)))))
qno average
1 1 10.5
2 2 30.5
3 3 50.5
4 4 70.5
But you can use it without mutate entirely if you want the names from ref:
t(ref %>% summarise(across(everything(), list(mean), .names = "{.col}"))) %>%
data.frame() %>%
rename(average = 1)
average
Q1 10.5
Q2 30.5
Q3 50.5
Q4 70.5
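If qno simply indexes the Q columns, a one-line base-R alternative is possible (a sketch that assumes ref has columns named Q1, Q2, ... matching qno):
# look up the column means of ref by name, in the order given by qno
df$average <- colMeans(ref, na.rm = TRUE)[paste0("Q", df$qno)]
df
This reproduces the intended average column without any reshaping.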
Does this solve your problem?
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
out <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
df <- data.frame(qno=c(1:length(ref)))
for (i in seq_along(ref)) {
df$average[i] <- mean(ref[[i]], na.rm = T)
}
I was not really sure if you want to name the rows like the variables, so you could just add this when you create the df object:
df <- data.frame(qno = paste0("Q", c(1:length(ref))))
I have created a user function in R to multiply two columns to create a third (within a series), so this function creates 4 new columns.
create_mult_var <- function(.data){
.data <-.data%>%
mutate(Q4_1_4 = Q4_1_2_TEXT*Q4_1_3_TEXT) %>%
mutate(Q4_2_4 = Q4_2_2_TEXT*Q4_2_3_TEXT) %>%
mutate(Q4_3_4 = Q4_3_2_TEXT*Q4_3_3_TEXT) %>%
mutate(Q4_4_4 = Q4_4_2_TEXT*Q4_4_3_TEXT)
.data
}
I am trying to modify this function so that I can apply it to a different set of columns that match the same type. For instance, if I want to repeat this on the series of columns that start with "Q8", I know I can do the following:
create_mult_var_2 <- function(.data){
.data <-.data%>%
mutate(Q8_1_4 = Q8_1_2_TEXT*Q8_1_3_TEXT) %>%
mutate(Q8_2_4 = Q8_2_2_TEXT*Q8_2_3_TEXT) %>%
mutate(Q8_3_4 = Q8_3_2_TEXT*Q8_3_3_TEXT) %>%
mutate(Q8_4_4 = Q8_4_2_TEXT*Q8_4_3_TEXT)
.data
}
Instead of creating a different function for each of the Q4 and Q8 series, I would like to pass "Q4" or "Q8" as an argument. I tried the approach below, but R would not accept the argument this way. Is there a way to achieve my desired outcome?
This does not work:
create_mult_var <- function(.data,question){
.data <-.data%>%
mutate(question_1_4 = question_1_2_TEXT*question_1_3_TEXT) %>%
mutate(question_2_4 = question_2_2_TEXT*question_2_3_TEXT) %>%
mutate(question_3_4 = question_3_2_TEXT*question_3_3_TEXT) %>%
mutate(question_4_4 = question_4_2_TEXT*question_4_3_TEXT)
.data
}
I would like to modify the function, such as that I can use the following:
data_in %>% create_mult_var("Q4") %>% create_mult_var("Q8")
Or something similar to create these new columns. Any suggestions are appreciated, thank you! If this is a bad idea, how should I approach it instead?
We could build the column names with str_c and evaluate with !!:
library(dplyr)
library(stringr)
create_mult_var_2 <- function(.data, pat){
  .data <- .data %>%
    mutate(!! str_c(pat, '_1_4') :=
             !! rlang::sym(str_c(pat, '_1_2_TEXT')) *
             !! rlang::sym(str_c(pat, '_1_3_TEXT')))
  .data
}
create_mult_var_2(data_in, "Q4")
# Q4_1_2_TEXT Q4_1_3_TEXT Q4_1_4
#1 1 5 5
#2 2 6 12
#3 3 7 21
#4 4 8 32
Also, based on the pattern shown, this can be automated as well:
library(dplyr)
library(stringr)
create_mult_var_3 <- function(.data, pat) {
.data %>%
mutate(across(matches(str_c("^", pat, "_\\d+_2")), ~
.* get(str_replace(cur_column(), '_2_TEXT', '_3_TEXT')),
.names = '{.col}_new')) %>%
rename_at(vars(ends_with('_new')),
~ str_replace(., '\\d+_TEXT_new', '4'))
}
Testing:
create_mult_var_3(data_in, "Q4")
# Q4_1_2_TEXT Q4_1_3_TEXT Q4_1_4
#1 1 5 5
#2 2 6 12
#3 3 7 21
#4 4 8 32
data
data_in <- data.frame(Q4_1_2_TEXT = 1:4, Q4_1_3_TEXT = 5:8)
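As a quick check of the chained call the asker had in mind, here is a sketch with a hypothetical Q8 series added to the toy data (these Q8 column names are invented for illustration):
data_in2 <- data.frame(Q4_1_2_TEXT = 1:4, Q4_1_3_TEXT = 5:8,
                       Q8_1_2_TEXT = 2:5, Q8_1_3_TEXT = 6:9)
data_in2 %>%
  create_mult_var_3("Q4") %>%
  create_mult_var_3("Q8")
# adds both Q4_1_4 and Q8_1_4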
I have two data frames as following:
df <- data.frame(month = c("mazda", "yamaha"),
april = c(11,12),
may = c(14,15))
df_whole <- data.frame(month = c("car", "bikes"),
april = c(.1,.2),
may = c(.5,.2))
What I want to do is create a row called total_car, which would be the product of the car and mazda rows. However, if df_whole doesn't have a car row, I would still like to create the total_car row, which would then be mazda*0. The output I want to get to is below. How could I calculate something like this in R?
df_car <- data.frame(month = c("mazda", "yamaha","total"),
april = c(11,12,1.1),
may = c(14,15,7))
It sounds like an SQL-style join would work well here to combine your tables and get the summary data.
Here's a tidyverse solution (note the inner_join(), which is the SQL-like part):
library(dplyr)
library(tidyr)
df <- data_frame(month = c("mazda", "yamaha"),
april = c(11,12),
may = c(14,15))
df_whole <- data_frame(month = c("car", "bikes"),
april = c(.1,.2),
may = c(.5,.2))
# adds car row if missing - there is probably a better way to handle the missing data
if (!"car" %in% df_whole$month) df_whole <- bind_rows(df_whole, data_frame(month = "car"))
# converts NAs to 0
df_whole[is.na(df_whole)] <- 0
# convert to long format
df_long <- df %>%
rename("brand" = "month") %>%
gather(key = "month", value = "val", april:may)
df_whole_long <- df_whole %>%
rename("type" = "month") %>%
gather(key = "month", value = "val", april:may)
# calculate the multiplication
dat <- inner_join(df_long, df_whole_long, by = "month") %>% # combine dfs
filter(brand == "mazda", type == "car") %>% # filter out key rows
mutate(total = val.x * val.y) # do multiplication and add into new column
# reformat and append
df_car <- bind_rows(df,
bind_cols(data_frame(month = "total"),
dat %>%
select(month, total) %>%
spread(month, total)))
df_car is:
month april may
1 mazda 11.0 14
2 yamaha 12.0 15
3 total 1.1 7
I used tidyverse data frames (data_frame() rather than data.frame()) so strings weren't converted to factors; the same can be achieved by adding stringsAsFactors = FALSE to data.frame(). I also converted your data to long format, which would probably be a better structure for your database anyway.
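For reference, here is a compact base-R sketch of the same logic, using the data frames from the question and falling back to 0 when df_whole has no "car" row:
car_row <- df_whole[df_whole$month == "car", c("april", "may")]
mult <- if (nrow(car_row) == 1) unlist(car_row) else c(april = 0, may = 0)
total <- data.frame(month = "total",
                    april = df$april[df$month == "mazda"] * mult[["april"]],
                    may   = df$may[df$month == "mazda"] * mult[["may"]])
df_car <- rbind(df, total)
df_car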
Given a data frame like:
df <- data.frame(z_a = 1:2,
z_b = 1:2,
y_a = 3:4,
y_b = 3:4)
I can select columns names that contain a character with:
library(dplyr)
df %>% select(contains("a"), contains("b"))
z_a y_a z_b y_b
1 1 3 1 3
2 2 4 2 4
Note that the column order has changed: columns containing "a" come before columns containing "b".
I'd like to select columns whose names contain any of the strings in a vector, and have the columns reordered in the same way.
searchfor <- letters[1:2]
Using searchfor, I'd like to make the following expression and use it in a select statement:
E <- quote(contains(searchfor[1]), contains(searchfor[2]))
df %>% select_(E)
We can do
df %>%
select_at(vars(matches(paste(searchfor, collapse="|")))) %>%
select(order(sub(".*_", "", names(.))))
purrr solution:
library(purrr)
ind_lgl <- map(letters[1:2], ~ grepl(.x, names(df), fixed = TRUE)) %>%
pmap_lgl(`|`)
df[ind_lgl]
With the pipe:
df %>%
`[`(map(letters[1:2], ~ grepl(.x, names(df), fixed = TRUE)) %>%
pmap_lgl(`|`))
If you want to get the right order:
rank <- map(letters[1:2], ~ grepl(.x, names(df), fixed = TRUE)) %>%
pmap(c) %>%
map(which)
ind_chr <- data_frame(colnames = names(df), rank) %>%
mutate(l = lengths(rank)) %>%
filter(l > 0) %>%
mutate(rank = unlist(map(rank, ~ .x[[1]]))) %>%
arrange(rank) %>%
pull(colnames)
df[ind_chr]
But it is not pretty...
I don't understand the exact requirement, but is this the solution?
select(df, matches("a|b"))
Self answer: here's a solution with select_ that still uses contains(), in case anyone else is interested:
library(iterators)
library(dplyr)
s <- paste0("c(", paste0(sapply(iter(searchfor), function(x) paste0("contains(\"", x, "\")")), collapse=","), ")")
df %>% select_(., s)
z_a y_a z_b y_b
1 1 3 1 3
2 2 4 2 4
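As a side note, recent versions of dplyr/tidyselect let contains() take a character vector directly, which may make the splice above unnecessary (a sketch; worth double-checking that the resulting column order is the one you want):
library(dplyr)
searchfor <- letters[1:2]
df %>% select(contains(searchfor))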