Addressing another dataframe with dynamic variables in R (ideally dplyr, using mutate) - r

I have already read a variety of threads on dynamically named variables, but I couldn't quite find an answer.
I have two dataframes.
df <- data.frame(qno=c(1,2,3,4))
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
Now I want to create another column 'average' in the df dataframe which gives me the average of each column in ref.
Intended output:
df <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
Here is what I have tried:
df <- df %>%
mutate(average := mean(!!as.name(paste0("ref$Q",qno)))
I have also tried a version with a for loop, but that didn't work either.
for (i in 1:length(df$qno)){
df$average[i] <- mean(as.name(paste0("ref$Q",df$qno[i])))
}
df <- df %>%
mutate(average = mean(as.name(paste0("ref$Q",qno))))```.

Here it is with mutate:
df %>% mutate(average = t(ref %>% summarise(across(everything(), ~mean(.x, na.rm = TRUE)))))
qno average
1 1 10.5
2 2 30.5
3 3 50.5
4 4 70.5
But you can use it without mutate entirely if you want the names from ref:
t(ref %>% summarise(across(everything(), list(mean), .names = "{.col}"))) %>%
data.frame() %>%
rename(average = 1)
average
Q1 10.5
Q2 30.5
Q3 50.5
Q4 70.5

Does this solve your problem?
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
out <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
df <- data.frame(qno=c(1:length(ref)))
for (i in seq_along(ref)) {
df$average[i] <- mean(ref[[i]], na.rm = T)
}
I was not really sure if you want to name the rows like the variables, so you could just add this when you create the df object:
df <- data.frame(qno = paste0("Q", c(1:length(ref))))

Related

More efficient way to compute mean for subset

In this dataframe:
df <- data.frame(
comp = c("pre",rep("story",4), rep("x",2), rep("story",3)),
hbr = c(101:110)
)
let's say I need to compute the mean for hbr subsetted to the first stretch where comp=="story", how would I do that more efficiently than this way, which seems bulky and longwinded and requires that I specify the grpI want to compute the mean for manually:
library(dplyr)
library(data.table)
df %>%
mutate(grp = rleid(comp)) %>%
summarise(M = mean(hbr[grp==2]))
M
1 103.5
I'm not sure if this is any better, but at least you only need to specify that you want the first run of 'story':
df %>%
mutate(grp = ifelse(comp == 'story', rleid(comp), NA)) %>%
filter(grp == min(grp, na.rm = TRUE)) %>%
summarise(M = mean(hbr))
#> M
#> 1 103.5
In base R, you can select the desired rows using cumsum and diff, and then choosing which group you need (here it's the first, so 1), and then compute the mean on those rows. With this option, you don't need to get the group you need manually and you don't require any additional packages.
idx <- which(df$comp == "story")
first <- idx[cumsum(c(1, diff(idx) != 1)) == 1]
#[1] 2 3 4 5
mean(df$hbr[first])
#[1] 103.5

Shifting Values in R in rows

I have a problem that sounds easy, however, I could not find a solution in R. I would like to shift values according to the first year of the release. I mean the first column represents the years of the release and the columns are years when the device is broken (values are numbers of broken devices).
This is a solution in Python:
def f(x):
shifted = np.argmin((x.index.astype(int)< x.name[0]))
return x.shift(-shifted)
df = df.set_index(['Delivery Year', 'Freq']).apply(f, axis=1)
df.columns = [f'Year.{i + 1}' for i in range(len(df.columns))]
df = df.reset_index()
df
I would like to have it in R too.
# TEST
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1976` = c(10,NA,NA,NA),
`Year.1977` = c(5,3,NA,NA),
`Year.1978` = c(10,NA,8,NA),
`Year.1979` = c(13,10,5,14)
)
data
# DESIRED
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1` = c(10,3,8,14),
`Year.2` = c(5,NA,5,NA),
`Year.3` = c(10,10,NA,NA),
`Year.4` = c(13,NA,NA,NA)
)
data
In addition, would it be also possible to transform the number of broken devices into the percentage of Freq column?
Thank you
Using tidyverse
data %>%
pivot_longer(!c(Delivery.Year, Freq)) %>%
separate(name, c("Lab", "Year")) %>%
select(-Lab) %>%
mutate_all(as.numeric) %>%
filter(Year >= Delivery.Year) %>%
group_by(Delivery.Year, Freq) %>%
mutate(ind = paste0("Year.", row_number()),
per = value/Freq) %>%
ungroup() %>%
pivot_wider(id_cols = c(Delivery.Year, Freq), names_from = ind, values_from = c(value, per))
I pivoted it into long form to begin with and separated the original column names Year.1976, Year.1977, etc. to just get the years from the columns and dropped the Year piece of it. Then I converted all columns to numeric to allow for mathematical operations like filtering for when Year >= Delivery.Year. I then created a column to get the titles you did request Year.1, Year.2, etc. and calculated the percent. Then I pivot_wider to get it in the format you requested. One thing to note is that I was unclear if you wanted both the original values and the percent or just the percent. If you only want the percent then values_from = per should do it for you.
library(dplyr)
f <- function(df) {
years <- paste0("Year.",sort(as.vector(na.omit(as.integer(stringr::str_extract(colnames(df), "\\d+"))))))
df1 <- df %>% select(years)
df2 <- df %>% select(-years)
val <- c()
firstyear <- years[1]
for (k in 1:nrow(df1) ) {
vec <- as.numeric(as.vector(df1[k,]))
val[k] <- (as.numeric(suppressWarnings(na.omit(vec))))[1]
}
df1[firstyear] <- val
colnames(df1) <- c(paste0("Year.",seq(1:ncol(df1))))
df <- cbind(df2,df1)
print(df)
}
> f(data)
Delivery.Year Freq Year.1 Year.2 Year.3 Year.4
1 1976 120 10 5 10 13
2 1977 100 3 3 NA 10
3 1978 80 8 NA 8 5
4 1979 60 14 NA NA 14

R - recursively create dataframe columns via index inside function

I have a dataframe with a large amount of annual data. For example consider the following toy example like so:
dat <- data.frame(id = 1:2, quantity = 3:4, agg_2002 = 5:6, agg_2003 = 7:8, agg_2020 = 9:10)
What I would like to do is the following:
Look for columns named "agg_",in the set of column names, names(df)
Substitute the "agg_" in names(df) for "change_"
Calculate the relative change from year to year, so for example,
df$change_2002 <- df$agg_2002/df$agg_2002 (since 2002 is first year)
df$change_2003 <- df$agg_2003/df$agg_2002
df$change_2004 <- df$agg_2004/df$agg_2003...all the way up to 2020 or the latest value with "agg_" in the column name.
What I have so far is the following function:
func <- function(dat, overwrite = FALSE) {
nms <- grep("agg_[0-9]+$", names(dat), value = TRUE)
revnms <- gsub("agg_", "chg_", nms)
for i = 1:ncol(df) %in% revnms{
dat[, rvnms][i] <- lapply(dat[, rvnms][i], `/`, dat[, rvnms][i-1])
}
dat
}
What I am struggling with is the indexing. How do I get R to make the above calculations recursively without having to do it manually? The desired result is the "chg_" columns appended to the original dataframe:
id quantity agg_2002 agg_2003 agg_2020 chg_2002 chg_2003 chg_2020
1 1 3 5 7 9 1 1.40 1.28
2 2 4 6 8 10 1 1.33 1.25
I would like to modify the specified function above to produce the desired result via lapply if possible. All ideas are welcome. Thank you.
UPDATE: I would much prefer something using lapply or something that can accomodate differing data types
You can make table to long form, change name (can use gsub), then spread back
library(tidyverse)
library(stringr)
df <- dat %>% pivot_longer(-c(id,quantity), names_to = "agg", values_to = "year") %>%
mutate(agg = str_replace(agg, "agg", "change")) %>%
group_by(id) %>%
mutate(year = ifelse(is.na(lag(year)), year/year, year/lag(year))) %>% # Divide itself if there is no lag(year)
pivot_wider(names_from = "agg", values_from = "year")
inner_join(dat, df, by = c("id","quantity"))
id quantity agg_2002 agg_2003 agg_2020 change_2002 change_2003 change_2020
1 1 3 5 7 9 1 1.400000 1.285714
2 2 4 6 8 10 1 1.333333 1.250000
Here is a solution with dplyr and tidyr:
library(tidyr)
library(dplyr)
dat %>%
pivot_longer(cols = starts_with("agg"),
names_to = "year",
names_prefix = "agg_",
values_to = "agg") %>%
group_by(id) %>%
arrange(year) %>%
mutate(change = agg / lag(agg, 1)) %>%
pivot_wider(names_from = year, values_from = c("agg", "change"))

Filter from a dataframe and create another dataframe using r

I have two data frames as following:
df <- data.frame(month = c("mazda", "yamaha"),
april = c(11,12),
may = c(14,15))
df_whole <- data.frame(month = c("car", "bikes"),
april = c(.1,.2),
may = c(.5,.2))
What I want to do is create a row called total_car which would be a product of car and mazda. However if df_whole doesnt have row car I would still like to create a row total_car which would mazda*0.The output I want to get to is as below. How could I calculate something like this in R
df_car <- data.frame(month = c("mazda", "yamaha","total"),
april = c(11,12,1.1),
may = c(14,15,7))
It sounds like a sql query would work better to join your tables and get the summary data.
Here's a tidyverse solution (note the inner_join(), which is sql):
library(dplyr)
library(tidyr)
df <- data_frame(month = c("mazda", "yamaha"),
april = c(11,12),
may = c(14,15))
df_whole <- data_frame(month = c("car", "bikes"),
april = c(.1,.2),
may = c(.5,.2))
# adds car row if missing - there is probably a better way to handle the missing data
if (!"car" %in% df_whole$month) df_whole <- bind_rows(df_whole, data_frame(month = "car"))
# converts NAs to 0
df_whole[is.na(df_whole)] <- 0
# convert to long format
df_long <- df %>%
rename("brand" = "month") %>%
gather(key = "month", value = "val", april:may)
df_whole_long <- df_whole %>%
rename("type" = "month") %>%
gather(key = "month", value = "val", april:may)
# calcualte the multiplication
dat <- inner_join(df_long, df_whole_long, by = "month") %>% # combine dfs
filter(brand == "mazda", type == "car") %>% # filter out key rows
mutate(total = val.x * val.y) # do multiplication and add into new column
# reformat and append
df_car <- bind_rows(df,
bind_cols(data_frame(month = "total"),
dat %>%
select(month, total) %>%
spread(month, total)))
df_car is:
month april may
1 mazda 11.0 14
2 yamaha 12.0 15
3 total 1.1 7
I used tidyverse data frames (data_frame() rather than data.frame()) so strings weren't converted to factors (the same can be achieved by adding stringsAsFactors = F to data.frame()) and I converted your data to long format, which would probably be a better structure for your database anyway.

Filtering data in a data frame

I have a data frame that looks like this:
S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20
I would like to filter and select the state(from S1State and S2State) that has the maximum value(from S1Value and S2Value). The result should look like this:
SState SValue
VIC 30
WA 30
I am new to R and have been experimenting with dplyr.
The answer I was hinting at is as follows:
library(dplyr)
dt <- read.table(text = "S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20",
header = TRUE, stringsAsFactors = FALSE)
answer = dt %>%
mutate(SState = ifelse(S1Value > S2Value, S1State, S2State),
SValue = ifelse(S1Value > S2Value, S1Value, S2Value)) %>%
select(SState, SValue)
Just to show that this is far from impossible with standard R tools:
nams <- c("State","Value")
tmp <- reshape(dt, direction="long", varying=lapply(nams, grep, x=names(dt)),
v.names=nams, timevar=NULL)
tmp[with(tmp, Value == ave(Value, id, FUN=max)),]
# State Value id
#2.1 WA 30 2
#1.2 VIC 30 1
I assume that the OP may have more states in the data frame, such as S3State, S4State, ...
The following solutions are based on this assumption, trying to be able to process more than one states. If there are only two states, the approach proposed by #lebelinoz is simple and straightforward.
Solution 1
A solution using functions from dplyr and tidyr. dt2 is the final output.
# Load packages
library(dplyr)
library(tidyr)
# Process the data
dt2 <- dt %>%
gather(Num, Value, contains("Value")) %>%
gather(State, Name, contains("State")) %>%
# Only keep records with the same state number
filter(substring(Num, 1, 2) == substring(State, 1, 2)) %>%
mutate(Group = substring(Num, 1, 2)) %>%
group_by(Group) %>%
filter(Value == max(Value)) %>%
ungroup() %>%
select(SState = Name, SSValue = Value)
Solution 2
A solution using functions from dplyr, purrr, and stringr. I loaded the package tidyverse for the first two packages. Again, dt2 is the final output.
# Load packages
library(tidyverse)
library(stringr)
# Extract the column names
Col <- colnames(dt)
# Extract state numbers
ColNum <- Col %>%
str_extract(pattern = "[0-9]") %>%
unique()
# Design a function to process the data
dt_process <- function(pattern, dt){
dt2 <- dt %>%
# Extract columns based on a pattern (numbers)
select(dplyr::contains(pattern)) %>%
# Rename the columns
rename_all(~sub(pattern, "", .)) %>%
# Filter the maximum row
filter(SValue == max(SValue))
return(dt2)
}
# Apply the dt_process function
dt_list <- map(.x = ColNum, .f = dt_process, dt = dt)
# Bind all data frames
dt2 <- bind_rows(dt_list) %>% arrange(SState)
Data Preparation
# Create example data frame
dt <- read.table(text = "S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20",
header = TRUE, stringsAsFactors = FALSE)

Resources