data frame grouping and column transforming using dplyr - r

I have the below df:
df <- data.frame(
  geokey = c("A","A","A","A","A","A","B","B","B","B","B","B"),
  upc = c("100","100","101","101","102","102","200","200","201","201","202","202"),
  endwk = c("14-07-2021","21-07-2021","14-07-2021","21-07-2021","14-07-2021","21-07-2021",
            "14-07-2021","21-07-2021","14-07-2021","21-07-2021","14-07-2021","21-07-2021"),
  Base_units = c(2,3,1,2,4,1,1,4,2,3,3,2),
  Base_price = c(0.1,0.2,0.2,0.1,0.1,0.1,0.2,0.3,0.4,0.1,0.2,0.3),
  Incr_units = c(2,1,1,1,2,1,3,2,2,3,1,1),
  incr_price = c(0.1,0.1,0.1,0.3,0.2,0.1,0.1,0.2,0.1,0.2,0.1,0.2))
> df
   geokey upc      endwk Base_units Base_price Incr_units incr_price
1       A 100 14-07-2021          2        0.1          2        0.1
2       A 100 21-07-2021          3        0.2          1        0.1
3       A 101 14-07-2021          1        0.2          1        0.1
4       A 101 21-07-2021          2        0.1          1        0.3
5       A 102 14-07-2021          4        0.1          2        0.2
6       A 102 21-07-2021          1        0.1          1        0.1
7       B 200 14-07-2021          1        0.2          3        0.1
8       B 200 21-07-2021          4        0.3          2        0.2
9       B 201 14-07-2021          2        0.4          2        0.1
10      B 201 21-07-2021          3        0.1          3        0.2
11      B 202 14-07-2021          3        0.2          1        0.1
12      B 202 21-07-2021          2        0.3          1        0.2
Expected output: group by geokey, upc, and endwk, with all volume columns totalled (added) and the price columns averaged, as shown below:
df_merged <- data.frame(
  geokey = c("A","A","B","B"),
  upc = c("upc_100_101_102","upc_100_101_102","upc_200_201_202","upc_200_201_202"),
  endwk = c("14-07-2021","21-07-2021","14-07-2021","21-07-2021"),
  Base_units_totalled = c(7,6,6,9),
  Base_price_averaged = c(0.133,0.133,0.2667,0.2333),
  Incr_units_totalled = c(5,3,3,6),
  incr_price_averaged = c(0.1333,0.1,0.1,0.2))
> df_merged
  geokey             upc      endwk Base_units_totalled Base_price_averaged Incr_units_totalled incr_price_averaged
1      A upc_100_101_102 14-07-2021                   7              0.1330                   5              0.1333
2      A upc_100_101_102 21-07-2021                   6              0.1330                   3              0.1000
3      B upc_200_201_202 14-07-2021                   6              0.2667                   3              0.1000
4      B upc_200_201_202 21-07-2021                   9              0.2333                   6              0.2000
Help will be appreciated.

I presume you want to summarize the upc column and not group by it?
library(dplyr)

df %>%
  group_by(geokey, endwk) %>%
  summarize(upc = paste0("upc_", paste(upc, collapse = "_")),
            across(contains("units"), sum),
            across(contains("price"), mean),
            .groups = "drop")
# A tibble: 4 x 7
  geokey endwk      upc             Base_units Incr_units Base_price incr_price
* <chr>  <chr>      <chr>                <dbl>      <dbl>      <dbl>      <dbl>
1 A      14-07-2021 upc_100_101_102          7          5      0.133      0.133
2 A      21-07-2021 upc_100_101_102          6          3      0.133      0.167
3 B      14-07-2021 upc_200_201_202          6          6      0.267      0.1
4 B      21-07-2021 upc_200_201_202          9          6      0.233      0.2
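If you also want the _totalled/_averaged column names from the expected output, the same idea extends with the .names argument of across() (a sketch, assuming dplyr >= 1.0.0, which introduced .names):
library(dplyr)

df %>%
  group_by(geokey, endwk) %>%
  summarize(upc = paste0("upc_", paste(upc, collapse = "_")),
            # sum the unit columns, renaming with a "_totalled" suffix
            across(contains("units"), sum, .names = "{.col}_totalled"),
            # average the price columns, renaming with an "_averaged" suffix
            across(contains("price"), mean, .names = "{.col}_averaged"),
            .groups = "drop")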

Related

R and DPLYR: I'm trying to summarize an average site score for a nonnumeric variable and I can't get it to convert to a percentage

I am analyzing a dataset of physical habitat characteristics of a river. For each site we have a variety of variables that were measured across a 10-point transect. Summarizing most of these variables by Location, Reach, and Transect is simple, but one of them, substrate, is scored using nonnumeric data. I am trying to create a summary for each transect to use in my statistical analysis, but using dplyr's summarize and pivot_wider does not work. When I use summarize, the nonnumeric value of the substrate variable returns an error, and for pivot_wider I do not have a value to enter for the new columns created from the Substrate variable.
This is an example of the raw data I am trying to summarize.
Location Reach Transect Flow Depth Substrate
RIX      1     1        0.4  14    CO
RIX      1     1        0.5  12    BO
RIX      1     1        0.3  11    SA
RIX      1     1        0.4  14    GR
RIX      1     1        0.4  14    CO
RIX      1     2        0.4  17    CO
RIX      1     2        0.5  18    SA
RIX      1     2        0.1  22    SA
RIX      1     2        0.6  15    GR
RIX      1     2        0.4  14    SILT
RIX      2     1        0.4  14    CO
RIX      2     1        0.5  12    BO
RIX      2     1        0.3  11    SA
RIX      2     1        0.4  14    GR
RIX      2     1        0.4  14    CO
RIX      2     2        0.4  17    CO
RIX      2     2        0.5  18    SA
RIX      2     2        0.1  22    SA
RIX      2     2        0.6  15    GR
RIX      2     2        0.4  14    SILT
ARA      1     1        0.4  14    CO
ARA      1     1        0.5  12    BO
ARA      1     1        0.3  11    SA
ARA      1     1        0.4  14    GR
ARA      1     1        0.4  14    CO
ARA      1     2        0.4  17    CO
ARA      1     2        0.5  18    SA
ARA      1     2        0.1  22    SA
ARA      1     2        0.6  15    GR
ARA      1     2        0.4  14    SILT
ARA      2     1        0.4  14    CO
ARA      2     1        0.5  12    BO
ARA      2     1        0.3  11    SA
ARA      2     1        0.4  14    GR
ARA      2     1        0.4  14    CO
ARA      2     2        0.4  17    CO
ARA      2     2        0.5  18    SA
ARA      2     2        0.1  22    SA
RIX      2     2        0.6  15    GR
RIX      2     2        0.4  14    SILT
Here is what I am trying to create.
Location Reach Transect Flow Depth   CO  GR  SA SILT  BO
RIX      1     1        0.4  13     .40 .20 .20    0 .20
RIX      1     2        0.4  17.2   .20 .20 .40  .20   0
RIX      2     1        0.4  13     .40 .20 .20    0 .20
RIX      2     2        0.4  17.2   .20 .20 .40  .20   0
I am unsure how to fill the new columns created with the pivot_wider function (from tidyr) and how to keep the Transects separated by Location and Reach when I use the summarize function.
One option to achieve your desired result would be to add an id column (to make pivot_wider work) and a new column of ones (which serves as an indicator and as the value column when converting to wide format):
library(dplyr)
library(tidyr)
df %>%
  mutate(id = row_number(), value = 1) %>%
  pivot_wider(names_from = "Substrate", values_from = "value", values_fill = 0) %>%
  group_by(Location, Reach, Transect) %>%
  summarise(across(everything(), mean), .groups = "drop") %>%
  select(-id)
#> # A tibble: 8 × 10
#> Location Reach Transect Flow Depth CO BO SA GR SILT
#> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ARA 1 1 0.4 13 0.4 0.2 0.2 0.2 0
#> 2 ARA 1 2 0.4 17.2 0.2 0 0.4 0.2 0.2
#> 3 ARA 2 1 0.4 13 0.4 0.2 0.2 0.2 0
#> 4 ARA 2 2 0.333 19 0.333 0 0.667 0 0
#> 5 RIX 1 1 0.4 13 0.4 0.2 0.2 0.2 0
#> 6 RIX 1 2 0.4 17.2 0.2 0 0.4 0.2 0.2
#> 7 RIX 2 1 0.4 13 0.4 0.2 0.2 0.2 0
#> 8 RIX 2 2 0.429 16.4 0.143 0 0.286 0.286 0.286
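An alternative sketch that avoids the helper id/value columns: compute the per-transect means and the substrate proportions separately, then join them (the intermediate names means and props are mine, not from the question):
library(dplyr)
library(tidyr)

# per-transect means of the numeric measurements
means <- df %>%
  group_by(Location, Reach, Transect) %>%
  summarise(Flow = mean(Flow), Depth = mean(Depth), .groups = "drop")

# per-transect share of each substrate class, spread to wide format
props <- df %>%
  count(Location, Reach, Transect, Substrate) %>%
  group_by(Location, Reach, Transect) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = Substrate, values_from = prop, values_fill = 0)

left_join(means, props, by = c("Location", "Reach", "Transect"))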

Combine an incremental sequence with fixed columns in a dataframe [duplicate]

This question already has answers here:
Alternative to expand.grid for data.frames
(6 answers)
Closed 2 years ago.
I have a dataframe:
df <- data.frame(x=c(1,2,3), y=c(4,5,6))
x y
1 1 4
2 2 5
3 3 6
For each row, I want to repeat x and y for each element within a given sequence, where the sequence is:
E=seq(0,0.2,by=0.1)
So when combined this would give:
x y E
1 1 4 0
2 1 4 0.1
3 1 4 0.2
4 2 5 0
5 2 5 0.1
6 2 5 0.2
7 3 6 0
8 3 6 0.1
9 3 6 0.2
I cannot seem to achieve this with expand.grid; it seems to give me all possible combinations. Am I after a cartesian product?
library(data.table)
dt <- data.table(x=c(1,2,3), y=c(4,5,6))
dt[,.(E=seq(0,0.2,by=0.1)),by=.(x,y)]
#> x y E
#> 1: 1 4 0.0
#> 2: 1 4 0.1
#> 3: 1 4 0.2
#> 4: 2 5 0.0
#> 5: 2 5 0.1
#> 6: 2 5 0.2
#> 7: 3 6 0.0
#> 8: 3 6 0.1
#> 9: 3 6 0.2
Created on 2020-05-01 by the reprex package (v0.3.0)
Yes, you are looking for a cartesian product, but base expand.grid cannot handle data frames.
You can use tidyr functions here :
tidyr::expand_grid(df, E)
# A tibble: 9 x 3
# x y E
# <dbl> <dbl> <dbl>
#1 1 4 0
#2 1 4 0.1
#3 1 4 0.2
#4 2 5 0
#5 2 5 0.1
#6 2 5 0.2
#7 3 6 0
#8 3 6 0.1
#9 3 6 0.2
Or with crossing
tidyr::crossing(df, E)
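If you prefer base R, merge() also works: with no common column names it returns the cartesian product of its two arguments (a sketch; the rows may come back in a different order, hence the final order() call):
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
E <- seq(0, 0.2, by = 0.1)

# no shared column names, so merge() produces the full cross product
res <- merge(df, data.frame(E = E))
res[order(res$x, res$E), ]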

Creating a "time to improvement of +1" variable in R?

I want to create a "time to improvement of +1" variable. I have longitudinal data in long format, measured at baseline, 3, 6, and 9 months; the improvement is measured from baseline. How do I go about it in R?
The data is like this:
sno time WHZ
1 0 -0.5
1 3 1.4
1 6 -0.7
1 9 2.2
2 0 -0.63
2 3 0.7
2 6 -2.64
2 9 2.1
Expected output:
sno time   WHZ  impr First_time_to_imp
  1    0 -0.5   0                    3
  1    3  1.4   1.9                  3
  1    6 -0.7  -0.2                  3
  1    9  2.2   2.7                  3
  2    0 -0.63  0                    3
  2    3  0.7   1.33                 3
  2    6 -2.64 -2.01                 3
  2    9  2.1   2.73                 3
The code I was trying to use to first create the improvement variable:
library(dplyr)
data %>%
  group_by(sno) %>%
  mutate(ImprvWHZ = data$WHZ - lag(data$WHZ, default = data$WHZ[1]))
If I understand the question correctly, here is a dplyr solution.
library(dplyr)

dat %>%
  group_by(sno) %>%
  mutate(Improv = WHZ - WHZ[1],
         TimeToImprov = ifelse(Improv > 1, time - time[1], NA))
## A tibble: 8 x 5
## Groups: sno [2]
# sno time WHZ Improv TimeToImprov
# <int> <int> <dbl> <dbl> <int>
#1 1 0 -0.5 0 NA
#2 1 3 1.4 1.9 3
#3 1 6 -0.7 -0.200 NA
#4 1 9 2.2 2.7 9
#5 2 0 -0.63 0 NA
#6 2 3 0.7 1.33 3
#7 2 6 -2.64 -2.01 NA
#8 2 9 2.1 2.73 9
And here is a base R solution.
res <- lapply(split(dat, dat$sno), function(DF){
DF$Improv <- DF$WHZ - DF$WHZ[1]
DF$TimeToImprov <- ifelse(DF$Improv > 1, DF$time - DF$time[1], NA)
DF
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
# sno time WHZ Improv TimeToImprov
#1 1 0 -0.50 0.00 NA
#2 1 3 1.40 1.90 3
#3 1 6 -0.70 -0.20 NA
#4 1 9 2.20 2.70 9
#5 2 0 -0.63 0.00 NA
#6 2 3 0.70 1.33 3
#7 2 6 -2.64 -2.01 NA
#8 2 9 2.10 2.73 9
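If, as in the expected output, you want the first time to an improvement of more than +1 repeated on every row of a subject rather than a per-row value, here is a variant of the dplyr solution above (a sketch; FirstTimeToImprov is my name for the column):
library(dplyr)

dat %>%
  group_by(sno) %>%
  mutate(Improv = WHZ - WHZ[1],
         # time of the first visit where the improvement over baseline
         # exceeds +1; NA if that never happens for the subject
         FirstTimeToImprov = time[which(Improv > 1)[1]]) %>%
  ungroup()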
DATA.
dat <- read.table(text = "
sno time WHZ
1 0 -0.5
1 3 1.4
1 6 -0.7
1 9 2.2
2 0 -0.63
2 3 0.7
2 6 -2.64
2 9 2.1
", header = TRUE)

Calculate product of frequencies in each column

I have a data frame with 3 columns, each containing a small number of values:
> df
# A tibble: 364 x 3
A B C
<dbl> <dbl> <dbl>
0. 1. 0.100
0. 1. 0.200
0. 1. 0.300
0. 1. 0.500
0. 2. 0.100
0. 2. 0.200
0. 2. 0.300
0. 2. 0.600
0. 3. 0.100
0. 3. 0.200
# ... with 354 more rows
> apply(df, 2, table)
$`A`
0 1 2 3 4 5 6 7 8 9 10
34 37 31 32 27 39 29 28 37 39 31
$B
1 2 3 4 5 6 7 8 9 10 11
38 28 38 37 32 34 29 33 30 35 30
$C
0.1 0.2 0.3 0.4 0.5 0.6
62 65 65 56 60 56
I would like to create a fourth column, which will contain for each row the product of the frequencies of each value within its column. So, for example, the first value of the column "Freq" would be the product of the frequency of 0 within column A, the frequency of 1 within column B, and the frequency of 0.1 within column C.
How can I do this efficiently with dplyr/base R?
To emphasize: this is not the combined frequency of the whole row, but the product of the single-column frequencies.
An efficient approach using a combination of lapply, Map & Reduce from base R:
l <- lapply(df, table)
m <- Map(function(x,y) unname(y[match(x, names(y))]), df, l)
df$D <- Reduce(`*`, m)
which gives:
> head(df, 15)
A B C D
1 3 5 0.4 57344
2 5 6 0.5 79560
3 0 4 0.1 77996
4 2 6 0.1 65348
5 5 11 0.6 65520
6 3 8 0.5 63360
7 6 6 0.2 64090
8 1 9 0.4 62160
9 10 2 0.2 56420
10 5 2 0.2 70980
11 4 11 0.3 52650
12 7 6 0.5 57120
13 10 1 0.2 76570
14 7 10 0.5 58800
15 8 10 0.3 84175
What this does:
lapply(df, table) creates a list of frequency tables, one for each column.
With Map and match, a list m is created in which each list-item has the same length as the number of rows of df; each list-item is a vector of the frequencies corresponding to the values in df.
With Reduce, the product of the vectors in the list m is calculated element-wise: the first values of the vectors in m are multiplied with each other, then the 2nd values, etc.
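For a single column, the matching step inside Map looks like this (a sketch using column A; tab is my name for the frequency table):
tab <- table(df$A)                    # frequencies of each A value
unname(tab[match(df$A, names(tab))])  # frequency of A for every row of df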
The same approach in tidyverse:
library(dplyr)
library(purrr)
df %>%
mutate(D = map(df, table) %>%
map2(df, ., function(x,y) unname(y[match(x, names(y))])) %>%
reduce(`*`))
Used data:
set.seed(2018)
df <- data.frame(A = sample(rep(0:10, c(34,37,31,32,27,39,29,28,37,39,31)), 364),
B = sample(rep(1:11, c(38,28,38,37,32,34,29,33,30,35,30)), 364),
C = sample(rep(seq(0.1,0.6,0.1), c(62,65,65,56,60,56)), 364))
I will use the following small example:
df
A B C
1 3 5 0.4
2 5 6 0.5
3 0 4 0.1
4 2 6 0.1
5 5 11 0.6
6 3 8 0.5
7 6 6 0.2
8 1 9 0.4
9 10 2 0.2
10 5 2 0.2
sapply(df, table)
$A
0 1 2 3 5 6 10
1 1 1 2 3 1 1
$B
2 4 5 6 8 9 11
2 1 1 3 1 1 1
$C
0.1 0.2 0.4 0.5 0.6
2 3 2 2 1
library(tidyverse)

df %>%
  group_by(A) %>%
  mutate(An = n()) %>%
  group_by(B) %>%
  mutate(Bn = n()) %>%
  group_by(C) %>%
  mutate(Cn = n(), prod = An * Bn * Cn)
A B C An Bn Cn prod
<int> <int> <dbl> <int> <int> <int> <int>
1 3 5 0.400 2 1 2 4
2 5 6 0.500 3 3 2 18
3 0 4 0.100 1 1 2 2
4 2 6 0.100 1 3 2 6
5 5 11 0.600 3 1 1 3
6 3 8 0.500 2 1 2 4
7 6 6 0.200 1 3 3 9
8 1 9 0.400 1 1 2 2
9 10 2 0.200 1 2 3 6
10 5 2 0.200 3 2 3 18
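The chained group_by()/mutate() calls can be written more compactly with add_count(), which attaches a frequency column without changing the grouping (a sketch, assuming a dplyr version that supports the name argument of add_count()):
library(dplyr)

df %>%
  add_count(A, name = "An") %>%  # frequency of each A value
  add_count(B, name = "Bn") %>%  # frequency of each B value
  add_count(C, name = "Cn") %>%  # frequency of each C value
  mutate(prod = An * Bn * Cn)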

Cumulative percentages in R

I have the following data frame
d2
# A tibble: 10 x 2
ID Count
<int> <dbl>
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
9 3
10 3
Which states how many counts each person (ID) had.
I would like to calculate the cumulative percentage of each count: up to 1: 50%, up to 2: 80%, up to 3: 100%.
I tried
> d2 %>% mutate(cum = cumsum(Count)/sum(Count))
# A tibble: 10 x 3
ID Count cum
<int> <dbl> <dbl>
1 1 0.05882353
2 1 0.11764706
3 1 0.17647059
4 1 0.23529412
5 1 0.29411765
6 2 0.41176471
7 2 0.52941176
8 2 0.64705882
9 3 0.82352941
10 3 1.00000000
but this result is obviously incorrect because I would expect that the count of 1 would correspond to 50% rather than 29.4%.
What is wrong here? How do I get the correct answer?
We get the count of 'Count', create 'Cum' by taking the cumulative sum of 'n' and dividing it by the sum of 'n', then right_join with the original data:
d2 %>%
  count(Count) %>%
  mutate(Cum = cumsum(n)/sum(n)) %>%
  select(-n) %>%
  right_join(d2) %>%
  select(names(d2), everything())
# A tibble: 10 x 3
# ID Count Cum
# <int> <int> <dbl>
# 1 1 1 0.500
# 2 2 1 0.500
# 3 3 1 0.500
# 4 4 1 0.500
# 5 5 1 0.500
# 6 6 2 0.800
# 7 7 2 0.800
# 8 8 2 0.800
# 9 9 3 1.00
#10 10 3 1.00
If we need the output as @LAP mentioned:
d2 %>%
mutate(Cum = row_number()/n())
# ID Count Cum
#1 1 1 0.1
#2 2 1 0.2
#3 3 1 0.3
#4 4 1 0.4
#5 5 1 0.5
#6 6 2 0.6
#7 7 2 0.7
#8 8 2 0.8
#9 9 3 0.9
#10 10 3 1.0
This works:
d2 %>%
mutate(cum = cumsum(rep(1/n(), n())))
ID Count cum
1 1 1 0.1
2 2 1 0.2
3 3 1 0.3
4 4 1 0.4
5 5 1 0.5
6 6 2 0.6
7 7 2 0.7
8 8 2 0.8
9 9 3 0.9
10 10 3 1.0
One option could be:
library(dplyr)
d2 %>%
group_by(Count) %>%
summarise(proportion = n()) %>%
mutate(Perc = cumsum(100*proportion/sum(proportion))) %>%
select(-proportion)
# # A tibble: 3 x 2
# Count Perc
# <int> <dbl>
# 1 1 50.0
# 2 2 80.0
# 3 3 100.0
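The same per-Count cumulative shares can also be computed in base R with a lookup table built from the cumulative frequencies (a sketch that reproduces the right_join result above):
tab <- cumsum(table(d2$Count)) / nrow(d2)      # cumulative share per distinct Count
d2$Cum <- unname(tab[as.character(d2$Count)])  # look up each row's Count
d2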
