Given a data frame, I want to iterate over several column names, calculate something for each one, and add the result as a new column. Each subsequent calculation should be added as a new column to the already updated data frame.
For example:
Given a simple df:
df <- structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9),
d = c(10, 11, 12)), .Names = c("a", "b", "c", "d"), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
For each of the columns a, b, c, d I want to calculate, say, the square:
a2 = a^2
b2 = b^2 ...
For technical reasons I can't publish the whole data frame, but I am going to pass one column name at a time and expect the function to add a new column (for example a2). When I later add b2, a2 should already be there.
With a for loop it would look like:
for (x in column_names) {
df <- df %>% mutate("{x}2" := .data[[x]]^2)  # x is a column name string, e.g. "a" -> a2
}
So on each iteration my df is updated with the newly calculated column.
Please advise how I can do this without a for loop, using functional programming.
I have tried map and lapply, but the problem is that my df doesn't get updated on each iteration.
Is this the function you are looking for?
add_x2 <- function(df, x) {
df[paste0(x, "2")] <- df[x]^2
df
}
df %>%
add_x2(c("a", "b"))
# A tibble: 3 x 6
a b c d a2 b2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 7 10 1 16
2 2 5 8 11 4 25
3 3 6 9 12 9 36
With tidyverse (note that this overwrites the existing columns instead of adding new ones):
df %>%
mutate_if(is.numeric, ~ .^2)
# A tibble: 3 x 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 1 16 49 100
2 4 25 64 121
3 9 36 81 144
or
df %>% mutate_all(~ .^2)
# A tibble: 3 x 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 1 16 49 100
2 4 25 64 121
3 9 36 81 144
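If the goal is to add new columns (a2, b2, ...) rather than overwrite the originals, two options come to mind. The following is a minimal sketch, assuming dplyr >= 1.0 and purrr are available; `column_names` and `add_square` are hypothetical names chosen for illustration. The first option uses across() with `.names`, the second uses purrr::reduce() to pass the updated data frame from one column name to the next, which is the functional counterpart of the for loop in the question.

```r
library(dplyr)
library(purrr)

df <- tibble::tibble(a = c(1, 2, 3), b = c(4, 5, 6),
                     c = c(7, 8, 9), d = c(10, 11, 12))
column_names <- c("a", "b")  # hypothetical: the columns to square

# Option 1: across() with .names adds a2, b2 and keeps the originals
res_across <- df %>%
  mutate(across(all_of(column_names), ~ .x^2, .names = "{.col}2"))

# Option 2: reduce() feeds the updated data frame into the next step,
# so each new column is added to the result of the previous iteration
add_square <- function(data, col) {
  data[[paste0(col, "2")]] <- data[[col]]^2
  data
}
res_reduce <- reduce(column_names, add_square, .init = df)

res_reduce
```

With reduce(), the accumulated data frame is the first argument of the function and the current column name is the second, which is why plain map or lapply (which have no accumulator) do not carry the update forward.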
I have the time t, speed v and position x of a vehicle, along with other information: info1, info2, status.
How can I extend the tibble by linearly extrapolating the position from the last speed (v = 14) for given timestamps? The added rows should be copies of the last row, except that they contain the predicted positions and the status is changed to "predicted".
Is it possible to do this without using a loop?
tbl <- tibble(info1 = rep("a", 3),
info2 = rep("b", 3),
t = c(1, 2, 3),
v = c(12, 13, 14),
x = c(12, 24, 37),
status = rep("real", 3))
timestamps <- c(4, 5, 6, 8) # timestamps are not evenly spaced
# desired output
tbl_desired <- tibble(info1 = rep("a", 7),
info2 = rep("b", 7),
t = c(1, 2, 3, 4, 5, 6, 8),
v = c(12, 13, 14, 14, 14, 14, 14),
x = c(12, 24, 37, 51, 65, 79, 107),
status = c(rep("real", 3), rep("predicted", 4)))
The conditions are not entirely clear, especially how the predicted values in 'x' are derived. The code below works as follows:
Extract the last row (slice_tail, n = 1)
Update the columns 't', 'v', 'x' (summarise)
Bind the new rows to the original data (bind_rows)
library(dplyr)
tbl %>%
slice_tail(n = 1) %>%
summarise(info1, info2, t = timestamps, v = v,
x = (x + cumsum(c(1, diff(t)) *
rep(last(v), length(t)))), status = 'predicted') %>%
bind_rows(tbl, .)
-output
# A tibble: 7 × 6
info1 info2 t v x status
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 a b 1 12 12 real
2 a b 2 13 24 real
3 a b 3 14 37 real
4 a b 4 14 51 predicted
5 a b 5 14 65 predicted
6 a b 6 14 79 predicted
7 a b 8 14 107 predicted
If there are many columns, then after slicing the last row, use mutate to update only the columns that need to change, wrap any value of length greater than 1 in a list, and then unnest the list columns
library(tidyr)
tbl %>%
slice_tail(n = 1) %>%
mutate(t = list(timestamps), v = v,
x = list((x + cumsum(c(1, diff(timestamps)) *
rep(last(v), length(timestamps))))), status = 'predicted') %>%
unnest(where(is.list)) %>%
bind_rows(tbl, .)
-output
# A tibble: 7 × 6
info1 info2 t v x status
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 a b 1 12 12 real
2 a b 2 13 24 real
3 a b 3 14 37 real
4 a b 4 14 51 predicted
5 a b 5 14 65 predicted
6 a b 6 14 79 predicted
7 a b 8 14 107 predicted
Or use add_row and then fill the NA values in the columns not specified in add_row with the previous non-NA value
library(dplyr)   # last()
library(tidyr)   # fill()
library(tibble)  # add_row()
tbl %>%
add_row(t = timestamps, v = last(.$v),
x = (last(.$x) + cumsum(c(1, diff(timestamps)) *
rep(last(.$v), length(timestamps)))), status = 'predicted') %>%
fill(everything())
-output
# A tibble: 7 × 6
info1 info2 t v x status
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 a b 1 12 12 real
2 a b 2 13 24 real
3 a b 3 14 37 real
4 a b 4 14 51 predicted
5 a b 5 14 65 predicted
6 a b 6 14 79 predicted
7 a b 8 14 107 predicted
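The expression c(1, diff(timestamps)) in the snippets above hard-codes the first step as 1, which only holds when the first new timestamp is exactly one unit after the last observed t. A minimal sketch of a variant that computes the elapsed time directly, assuming the intended rule is x_new = x_last + v_last * (t_new - t_last); `last_row` and `predicted` are names introduced here for illustration:

```r
library(dplyr)
library(tibble)

tbl <- tibble(info1 = rep("a", 3), info2 = rep("b", 3),
              t = c(1, 2, 3), v = c(12, 13, 14), x = c(12, 24, 37),
              status = rep("real", 3))
timestamps <- c(4, 5, 6, 8)

last_row <- slice_tail(tbl, n = 1)

predicted <- last_row %>%
  slice(rep(1, length(timestamps))) %>%    # one copy of the last row per timestamp
  mutate(x = x + v * (timestamps - t),     # uses the old t, so x is computed first
         t = timestamps,
         status = "predicted")

result <- bind_rows(tbl, predicted)
result
```

Because every other column is copied from the last row, this also covers the case with many info columns without listing them.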
I am new to R and have a simple 'how to' question: what is the best way to calculate group and overall percentages on data frame columns? My data looks like this:
# A tibble: 13 x 3
group resp id
<chr> <dbl> <chr>
1 A 1 ssa
2 A 1 das
3 A NA fdsf
4 B NA gfd
5 B 1 dfg
6 B 1 dg
7 C 1 gdf
8 C NA gdf
9 C NA hfg
10 D 1 hfg
11 D 1 trw
12 D 1 jyt
13 D NA ghj
the test data is this:
test <- structure(list(group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "D", "D", "D", "D"), resp = c(1, 1, NA, NA, 1, 1, 1, NA,
NA, 1, 1, 1, NA), id = c("ssa", "das", "fdsf", "gfd", "dfg",
"dg", "gdf", "gdf", "hfg", "hfg", "trw", "jyt", "ghj")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -13L))
I managed to do the group percentages by doing the following (which seems overcomplicated):
a <- test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE))
b <- test %>%
group_by(group) %>%
summarise(all = n_distinct(id, na.rm = TRUE))
result <- a %>%
left_join(b) %>%
mutate(a,resp_rate = round(no_resp/all*100))
this gives me:
# A tibble: 4 x 4
group no_resp all resp_rate
<chr> <dbl> <int> <dbl>
1 A 2 3 67
2 B 2 3 67
3 C 1 2 50
4 D 3 4 75
which is fine, but I wondered how I could make this simpler? Also, how would I do an overall percentage? E.g. an overall distinct count of resp/distinct count of id, without grouping.
Many thanks
You can add multiple statements in summarise so you don't have to create temporary objects a and b. To calculate overall percentage you can divide the number by the sum of the column.
library(dplyr)
test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE),
all = n_distinct(id),
resp_rate = round(no_resp/all*100)) %>%
mutate(no_resp_perc = no_resp/sum(no_resp) * 100)
# group no_resp all resp_rate no_resp_perc
# <chr> <int> <int> <dbl> <dbl>
#1 A 2 3 67 25
#2 B 2 3 67 25
#3 C 1 2 50 12.5
#4 D 3 4 75 37.5
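The question also asks for an overall rate (responses divided by distinct ids, without grouping). A minimal sketch, assuming the same `test` data as in the question: the same summarise call can simply be run without group_by. The name `overall` is introduced here for illustration.

```r
library(dplyr)

test <- tibble::tibble(
  group = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  resp  = c(1, 1, NA, NA, 1, 1, 1, NA, NA, 1, 1, 1, NA),
  id    = c("ssa", "das", "fdsf", "gfd", "dfg", "dg", "gdf",
            "gdf", "hfg", "hfg", "trw", "jyt", "ghj")
)

# No group_by(): every row contributes to a single summary row
overall <- test %>%
  summarise(no_resp   = sum(resp, na.rm = TRUE),
            all       = n_distinct(id),
            resp_rate = round(no_resp / all * 100))

overall
```

Note that the overall distinct id count can be smaller than the sum of the per-group counts, because an id that appears in two groups (here "hfg" in C and D) is counted once overall.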
Using base R we may apply tapply and table functions.
res <- transform(with(test, data.frame(no_resp=tapply(resp, group, sum, na.rm=TRUE),
all=colSums(table(id, group) > 0))),
resp_rate=round(no_resp/all*100),
overall_perc=prop.table(no_resp)*100
)
res
# no_resp all resp_rate overall_perc
# A 2 3 67 25.0
# B 2 3 67 25.0
# C 1 2 50 12.5
# D 3 4 75 37.5
I have the data frame df1 below. (Edited so that the columns have different numbers of repeated values.)
> dput(df1)
structure(list(...1 = c("a", "b", "c", "d", "e"), x = c(5, 10,
20, 20, 25), y = c(2, 6, 6, 6, 10), z = c(6, 2, 1, 8, 1)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
>df1
x y z
a 5 2 6
b 10 6 2
c 20 6 1
d 20 6 8
e 25 10 1
I would like to get a df2 which only has the unique values from each column 'x','y' and 'z'.
I tried:
df2<-apply(df1,2, unique)
df2 <- do.call(cbind, df2)
df2 <- as.data.frame(df2)
Desired output:
>df2
x y z
5 2 6
10 6 2
20 10 1
25 8
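Tibbles can't have row names, so the row names were stored as a new first column (...1) in your data. You can delete that column and then apply unique to all remaining columns. This works when every column has the same number of unique values, as in the output below (which appears to come from the data before the edit).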
library(dplyr)
df1$...1 <- NULL
df1 %>% summarise(across(.fns = unique))
# x y z
# <dbl> <dbl> <dbl>
#1 5 2 6
#2 10 6 2
#3 20 8 1
#4 25 10 8
Or in base R :
df2 <- data.frame(sapply(df1, unique))
When the columns have different numbers of unique values, you can pad the shorter ones with NA:
tmp <- lapply(df1, unique)
data.frame(sapply(tmp, `[`, 1:max(lengths(tmp))))
# x y z
#1 5 2 6
#2 10 6 2
#3 20 10 1
#4 25 NA 8
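The padding step can also be written with the replacement function `length<-`, which extends a vector to a given length and fills the new positions with NA. This is a small base R sketch of the same idea, not a different method; `uniq` and `n_max` are names introduced here for illustration, and the data frame is the edited df1 without the row-name column.

```r
df1 <- data.frame(x = c(5, 10, 20, 20, 25),
                  y = c(2, 6, 6, 6, 10),
                  z = c(6, 2, 1, 8, 1))

uniq  <- lapply(df1, unique)   # list of unique values per column
n_max <- max(lengths(uniq))    # length of the longest column

# `length<-`(v, n_max) pads each vector with NA up to n_max
df2 <- as.data.frame(lapply(uniq, `length<-`, n_max))
df2
```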
Apologies if this is a duplicate question, I saw some questions which were similar to mine, but none exactly addressing my problem.
My data look basically like this:
FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)
df
# A tibble: 5 x 3
FiscalWeek Group Amount
<fct> <chr> <dbl>
1 45 A 1
2 46 A 1
3 48 A 1
4 48 B 5
5 48 C 6
Note that FiscalWeek is a factor. So, when I take a weekly average by Group, I get this:
library(dplyr)
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount))
averages
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 1
2 B 5
3 C 6
But, this is actually a four-week period. Nothing at all happened in Week 47, and groups B and C didn't show data in weeks 45 and 46, but I still want averages that reflect the existence of those weeks. So I need to fill out my original data with zeroes such that this is my desired result:
DesiredGroup <- c("A", "B", "C")
DesiredAvgs <- c(0.75, 1.25, 1.5)
Desired <- tibble(DesiredGroup, DesiredAvgs)
Desired
# A tibble: 3 x 2
DesiredGroup DesiredAvgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
What is the best way to do this using dplyr?
Up front: missing data to me is very different from 0. I'm assuming that you "know" with certainty that missing data should bring all other values down.
The name FiscalWeek suggests integer-like data, but your use of factor suggests ordinal or categorical data. Because of that, you need to define authoritatively what the complete set of factor levels can be. And because your current factor does not contain all possible levels, I'll infer them (you need to adjust your all_groups_weeks accordingly):
all_groups_weeks <- tidyr::expand_grid(FiscalWeek = as.factor(45:48), Group = c("A", "B", "C"))
all_groups_weeks
# # A tibble: 12 x 2
# FiscalWeek Group
# <fct> <chr>
# 1 45 A
# 2 45 B
# 3 45 C
# 4 46 A
# 5 46 B
# 6 46 C
# 7 47 A
# 8 47 B
# 9 47 C
# 10 48 A
# 11 48 B
# 12 48 C
From here, join in the full data in order to "complete" it. A bare tidyr::complete(FiscalWeek, Group) call won't work, because week 47 appears neither in the data nor among the factor levels.
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0))
# # A tibble: 12 x 3
# FiscalWeek Group Amount
# <fct> <chr> <dbl>
# 1 45 A 1
# 2 46 A 1
# 3 48 A 1
# 4 48 B 5
# 5 48 C 6
# 6 45 B 0
# 7 45 C 0
# 8 46 B 0
# 9 46 C 0
# 10 47 A 0
# 11 47 B 0
# 12 47 C 0
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0)) %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount, na.rm = TRUE))
# # A tibble: 3 x 2
# Group Avgs
# <chr> <dbl>
# 1 A 0.75
# 2 B 1.25
# 3 C 1.5
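An alternative sketch: if the factor is first given the full set of levels, tidyr::complete can fill in the missing combinations, since it expands factors over all their levels and not only those present in the data. This assumes weeks 45 to 48 are the whole period, which is an assumption taken from the question and should be adjusted to the real fiscal calendar.

```r
library(dplyr)
library(tidyr)

df <- tibble::tibble(
  FiscalWeek = as.factor(c(45, 46, 48, 48, 48)),
  Group      = c("A", "A", "A", "B", "C"),
  Amount     = c(1, 1, 1, 5, 6)
)

averages <- df %>%
  # Assumption: 45:48 is the complete set of weeks
  mutate(FiscalWeek = factor(FiscalWeek, levels = as.character(45:48))) %>%
  # Adds the missing week/group combinations with Amount = 0
  complete(FiscalWeek, Group, fill = list(Amount = 0)) %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount))

averages
```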
You can try this. I hope this helps.
library(dplyr)
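# Convert the factor to numeric and count the weeks in the full range
df <- df %>% mutate(FiscalWeek = as.numeric(as.character(FiscalWeek)))
n_weeks <- length(seq(min(df$FiscalWeek), max(df$FiscalWeek), by = 1))
# Aggregation: divide each group's total by the number of weeks
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = sum(Amount) / n_weeks)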
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
You can do it without filling if you know the number of weeks:
df %>%
group_by(Group) %>%
summarise(Avgs = sum(Amount) / length(45:48))
Say I have a tibble of values:
raw = tibble(
group = c("A", "B", "C", "A", "B", "C"),
value = c(10, 20, 30, 40, 50, 60)
)
# A tibble: 6 x 2
group value
<chr> <dbl>
1 A 10
2 B 20
3 C 30
4 A 40
5 B 50
6 C 60
I want to subtract a certain amount from each value in my tibble depending on which group it belongs to. The amounts I need to subtract are in another tibble:
corrections = tibble(
group = c("A", "B", "C"),
corr = c(0, 1, 2)
)
# A tibble: 3 x 2
group corr
<chr> <dbl>
1 A 0
2 B 1
3 C 2
What is the most elegant way to achieve this? The following works, but I feel like it is messy - surely there is another way?
mutate(raw, corrected = value - as_vector(corrections[corrections["group"] == group, "corr"]))
# A tibble: 6 x 3
group value corrected
<chr> <dbl> <dbl>
1 A 10 10
2 B 20 19
3 C 30 28
4 A 40 40
5 B 50 49
6 C 60 58
How about first joining raw and corrections and then calculating corrected?
library(dplyr)
left_join(raw, corrections, by = "group") %>%
mutate(corrected = value - corr) %>%
select(-corr)
#> # A tibble: 6 x 3
#> group value corrected
#> <chr> <dbl> <dbl>
#> 1 A 10 10
#> 2 B 20 19
#> 3 C 30 28
#> 4 A 40 40
#> 5 B 50 49
#> 6 C 60 58
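For a small correction table, a named lookup vector in base R is another option that avoids the join. A minimal sketch, assuming each group appears exactly once in `corrections`; `lookup` is a name introduced here for illustration.

```r
raw <- data.frame(group = c("A", "B", "C", "A", "B", "C"),
                  value = c(10, 20, 30, 40, 50, 60))
corrections <- data.frame(group = c("A", "B", "C"),
                          corr  = c(0, 1, 2))

# Named vector: names are the groups, values are the corrections
lookup <- setNames(corrections$corr, corrections$group)

# Indexing by group name picks the matching correction for every row
raw$corrected <- raw$value - unname(lookup[raw$group])
raw
```

A group that is missing from `corrections` would yield NA here, just as it would with the left join, so the join remains the safer choice when the two tables may not line up.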