Calculate mean based on first part of row.name() in R - r

I have a data frame that looks like this:
structure(list(value1 = c(1, 2, 3, 4, 5), value2 = c(1, 2, 2,
2, 2), value3 = c(1, 1, 2, 3, 4)), class = "data.frame", row.names = c("apple1",
"apple2", "orange1", "orange2", "plum"))
        value1 value2 value3
apple1       1      1      1
apple2       2      2      1
orange1      3      2      2
orange2      4      2      3
plum         5      2      4
Now I want to run the mean function on every column, grouped by the first part of the row names
(for example, I want the mean of value1 for the apple group, regardless of the apple number).
I figured out that something like this works:
y<-x[grep("apple",row.names(x)),]
mean(y$value1)
mean(y$value2)
mean(y$value3)
y<-x[grep("orange",row.names(x)),]
mean(y$value1)
mean(y$value2)
mean(y$value3)
y<-x[grep("plum",row.names(x)),]
mean(y$value1)
mean(y$value2)
mean(y$value3)
But for a bigger dataset this is going to take ages, so I was wondering whether there is a more efficient way to subset the data by the first part of the row name and then calculate the means.

Using tidyverse:
library(tidyverse)
df %>%
tibble::rownames_to_column("row") %>%
dplyr::mutate(row = str_remove(row, "\\d+")) %>%
dplyr::group_by(row) %>%
dplyr::summarize(across(where(is.numeric), ~ mean(.)), .groups = "drop")
In base R you could do:
df$row <- gsub("\\d+", "", rownames(df))
data.frame(do.call(cbind, lapply(df[,1:3], function(x) by(x, df$row, mean))))
Output
  row    value1 value2 value3
* <chr>   <dbl>  <dbl>  <dbl>
1 apple     1.5    1.5    1
2 orange    3.5    2      2.5
3 plum      5      2      4
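As a further base R sketch (not part of the answers above), aggregate() can compute the mean of every column in one call once a group label has been derived from the row names. The names grp and means are illustrative, and the pattern "\\d+$" assumes the number always sits at the end of the row name.

```r
# Same data as in the question, rebuilt here so the sketch is self-contained
df <- data.frame(
  value1 = c(1, 2, 3, 4, 5),
  value2 = c(1, 2, 2, 2, 2),
  value3 = c(1, 1, 2, 3, 4),
  row.names = c("apple1", "apple2", "orange1", "orange2", "plum")
)

# Group label: the row name with any trailing digits removed
grp <- sub("\\d+$", "", rownames(df))

# Mean of every column within each group
means <- aggregate(df, by = list(row = grp), FUN = mean)
print(means)
# apple: 1.5 1.5 1; orange: 3.5 2 2.5; plum: 5 2 4
```

Because the grouping vector is passed separately through by, the original data frame does not need an extra column.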
Data
df <- structure(list(value1 = 1:5, value2 = c(1, 2, 2, 2, 2), value3 = c(1,
1, 2, 3, 4)), class = "data.frame", row.names = c("apple1", "apple2",
"orange1", "orange2", "plum"))

Related

manipulate a pair data in R

I would like to reshape the sample data below to get the output shown in the table. How can I do that? The idea is to split column e into two columns according to disease: rows with disease 0 go in one column and rows with disease 1 in the other. Thanks in advance.
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), fid = c(1,
1, 2, 2, 3, 3, 4, 4, 5, 5), disease = c(0, 1, 0, 1, 1, 0, 1, 0, 0,
1), e = c(3, 2, 6, 1, 2, 5, 2, 3, 1, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
library(tidyverse)
df %>%
pivot_wider(id_cols = fid, names_from = disease, values_from = e, names_prefix = 'e') %>%
select(-fid)
e0 e1
<dbl> <dbl>
1 3 2
2 6 1
3 5 2
4 3 2
5 1 1
If you want the columns named e1 and e2 instead, you could do:
df %>%
pivot_wider(id_cols = fid, names_from = disease, values_from = e,
names_glue = 'e{disease + 1}') %>%
select(-fid)
# A tibble: 5 x 2
e1 e2
<dbl> <dbl>
1 3 2
2 6 1
3 5 2
4 3 2
5 1 1
We could use lead() combined with ifelse() statements for this:
library(dplyr)
df %>%
mutate(e2 = lead(e)) %>%
filter(row_number() %% 2 == 1) %>%
mutate(e1 = ifelse(disease==1, e2,e),
e2 = ifelse(disease==0, e2,e)) %>%
select(e1, e2)
e1 e2
<dbl> <dbl>
1 3 2
2 6 1
3 5 2
4 3 2
5 1 1
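For comparison, here is a minimal base R sketch of the same reshape. It assumes every fid has exactly one row with disease 0 and one with disease 1; the names d0, d1 and wide are illustrative only.

```r
# Same data as in the question, rebuilt as a plain data frame
df <- data.frame(
  id      = 1:10,
  fid     = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  disease = c(0, 1, 0, 1, 1, 0, 1, 0, 0, 1),
  e       = c(3, 2, 6, 1, 2, 5, 2, 3, 1, 1)
)

# Split by disease status, keeping only the pair id and the value
d0 <- df[df$disease == 0, c("fid", "e")]
d1 <- df[df$disease == 1, c("fid", "e")]

# Join the two halves on fid; the suffixes turn the clashing "e" columns into e0 and e1
wide <- merge(d0, d1, by = "fid", suffixes = c("0", "1"))
print(wide[c("e0", "e1")])
```

Unlike the lead() approach, this does not rely on the two rows of a pair being adjacent in the data.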

If rows match on two conditions, subtract in particular order, Tidyverse

I am trying to write Tidyverse code that finds two rows that have exactly matching values on two conditions. The rows should match on Participant_ID & Indicator. There should be no more than two rows that match exactly on these two values. In this pair of matches, one will have occurred at timepoint 1 and the other at timepoint 4. After the matches are identified, I want the score at timepoint 4 to be subtracted from the score at timepoint 1. I would also like to preserve the Group number in the final tibble.
There will be rows that don't have matches. Those can be omitted, if possible. I don't want them in the resulting tibble.
I am having trouble wrapping my head around this, so thank you very much for your help!
example <- tibble (
Participant_ID = c('Part1','Part2','Part1','Part2','Part1','Part2','Part1','Part2'),
Indicator =c('item1','item1','item1','item1','item2','item2','item2','item2'),
Timepoint = c(1,1,4,4,1,1,4,4),
Score = c(3,3,1.5,3,4,4,3.5,3.5),
Group = c(1,2,1,2,1,2,1,2))
example %>%
pivot_wider(id_cols = c(Participant_ID, Indicator, Group), names_from = Timepoint, values_from = Score) %>%
transmute(Participant_ID, Indicator, Group, Score = `1` - `4`)
# A tibble: 4 x 4
# Participant_ID Indicator Group Score
# <chr> <chr> <dbl> <dbl>
# 1 Part1 item1 1 1.5
# 2 Part2 item1 2 0
# 3 Part1 item2 1 0.5
# 4 Part2 item2 2 0.5
Data
example <- structure(list(Participant_ID = c("Part1", "Part2", "Part1", "Part2", "Part1", "Part2", "Part1", "Part2"), Indicator = c("item1", "item1", "item1", "item1", "item2", "item2", "item2", "item2"), Timepoint = c(1, 1, 4, 4, 1, 1, 4, 4), Score = c(3, 3, 1.5, 3, 4, 4, 3.5, 3.5), Group = c(1, 2, 1, 2, 1, 2, 1, 2)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
One tidyverse approach is to use pivot_wider() from tidyr to place the matches into one row, then calculate the difference between the two scores:
example %>%
pivot_wider(id_cols = c(Participant_ID, Indicator), values_from = Score, names_from = Timepoint, names_prefix = "Score_Timepoint_") %>%
mutate(Score_difference = Score_Timepoint_1 - Score_Timepoint_4)
This produces:
# A tibble: 4 x 5
Participant_ID Indicator Score_Timepoint_1 Score_Timepoint_4 Score_difference
<chr> <chr> <dbl> <dbl> <dbl>
1 Part1 item1 3 1.5 1.5
2 Part2 item1 3 3 0
3 Part1 item2 4 3.5 0.5
4 Part2 item2 4 3.5 0.5
You could arrange the data by descending Timepoint and then use diff by group.
library(dplyr)
example %>%
arrange(Participant_ID, Indicator, desc(Timepoint)) %>%
group_by(Participant_ID, Indicator) %>%
summarise(Score = diff(Score))
# Participant_ID Indicator Score
# <chr> <chr> <dbl>
#1 Part1 item1 1.5
#2 Part1 item2 0.5
#3 Part2 item1 0
#4 Part2 item2 0.5
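The question also asks for rows without a match to be dropped and for Group to be kept. A base R sketch of one way to do that is an inner join of the timepoint 1 rows with the timepoint 4 rows. The extra Part3 row is a hypothetical unmatched case added only for illustration, and t1, t4 and paired are illustrative names.

```r
# Data from the question plus one hypothetical unmatched row (Part3)
example <- data.frame(
  Participant_ID = c("Part1", "Part2", "Part1", "Part2",
                     "Part1", "Part2", "Part1", "Part2", "Part3"),
  Indicator = c("item1", "item1", "item1", "item1",
                "item2", "item2", "item2", "item2", "item1"),
  Timepoint = c(1, 1, 4, 4, 1, 1, 4, 4, 1),
  Score     = c(3, 3, 1.5, 3, 4, 4, 3.5, 3.5, 2),
  Group     = c(1, 2, 1, 2, 1, 2, 1, 2, 1)
)

keys <- c("Participant_ID", "Indicator", "Group")
t1 <- example[example$Timepoint == 1, c(keys, "Score")]
t4 <- example[example$Timepoint == 4, c(keys, "Score")]

# merge() is an inner join by default, so rows without a partner are dropped
paired <- merge(t1, t4, by = keys, suffixes = c("_t1", "_t4"))

# Timepoint 1 score minus timepoint 4 score, as requested in the question
paired$Score <- paired$Score_t1 - paired$Score_t4
paired <- paired[order(paired$Participant_ID, paired$Indicator), c(keys, "Score")]
print(paired)
```

Joining on Group as well assumes a participant stays in the same group at both timepoints.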

How to add new column and calculate recursive cum using dplyr and shift

I have a dataset (in reality with more than 100 groups),
and I want to use dplyr to create a variable y for each group, setting the first value of y to 1 and computing each later value from the previous row:
second y = 1 * first x + 2 * first y
The result would be:
I tried creating a column y filled with 1 and then using
df %>% group_by(group) %>% mutate(var = shift(x) + 2 * shift(y)) %>% ungroup()
but then the formula always uses the initial y value of 1:
second y = 1 * first x + 2 * 1
Could someone give me some ideas about this? Thank you!
The dput of my result data is:
structure(list(group = c("a", "a", "a", "a", "a", "b", "b", "b" ), x =
c(1, 2, 3, 4, 5, 6, 7, 8), y = c(1, 3, 8, 19, 42, 1, 8, 23)),
row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame" ))
To perform such a calculation we can use accumulate from purrr or Reduce in base R.
Since you are already using dplyr, we can use accumulate:
library(dplyr)
df %>%
group_by(group) %>%
mutate(y1 = purrr::accumulate(x[-n()], ~.x * 2 + .y, .init = 1))
# group x y y1
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 3 3
#3 a 3 8 8
#4 a 4 19 19
#5 a 5 42 42
#6 b 6 1 1
#7 b 7 8 8
#8 b 8 23 23
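The base R alternative mentioned above could look roughly like the following sketch, which combines Reduce() with ave() to apply the recursion within each group. The helper name recursive_y is hypothetical.

```r
# x values and groups from the question
df <- data.frame(
  group = c("a", "a", "a", "a", "a", "b", "b", "b"),
  x     = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# y[1] = 1, then y[i] = 2 * y[i - 1] + x[i - 1]
recursive_y <- function(x) {
  Reduce(
    function(y_prev, x_prev) 2 * y_prev + x_prev,
    head(x, -1),        # the last x is never needed
    init = 1,           # starting value of y
    accumulate = TRUE   # keep every intermediate y, not just the final one
  )
}

# ave() applies the helper separately within each group
df$y1 <- ave(df$x, df$group, FUN = recursive_y)
print(df)
# y1 is 1 3 8 19 42 for group a and 1 8 23 for group b
```

As with accumulate, the previous y is carried forward as the running value, so each step sees the updated y rather than the initial 1.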

Selecting cases based on 2 variables

I am sorry if it seems like a foolish question but I want to ask how to select cases that have the same id and index
This is an example of my dataframe:
df1<-structure(list(id = c(10, 10, 10, 11, 11, 11), pnum = c(1,
2, 3, 1, 2, 3), index = c(1, 2, 2, 1, 1, 1)), class = "data.frame", row.names = c(NA,
-6L))
Also, for the case where an id has the same index value across all pnums:
df2<-structure(list(id = c(10, 10, 10, 11, 11, 11), pnum = c(1,
2, 3, 1, 2, 3), index = c(1, 1, 2, 2, 2, 2)), class = "data.frame", row.names = c(NA,
-6L))
I need to select cases that have the same id and index
End table should be this:
for df1
id pnum index
11 1 1
11 2 1
11 3 1
Also when id and index belong to the same group:
df2 outcome
id pnum index
10 1 2
10 2 2
10 3 2
We can use subset from base R
subset(df1, id == index)
# id pnum index
#4 1 1 1
#5 1 2 1
#6 1 3 1
Or with filter
library(dplyr)
df1 %>%
filter(id == index)
For the second case, maybe we can use
df2 %>%
group_by(id) %>%
filter(n_distinct(index) > 1) %>%
mutate(index = 2)
We can select the ids that have only one unique index value.
library(data.table)
setDT(df1)[, .SD[uniqueN(index) == 1], id]
# id pnum index
#1: 11 1 1
#2: 11 2 1
#3: 11 3 1
For df2 this returns:
setDT(df2)[, .SD[uniqueN(index) == 1], id]
# id pnum index
#1: 11 1 2
#2: 11 2 2
#3: 11 3 2
We can translate this to dplyr as:
df1 %>% group_by(id) %>% filter(n_distinct(index) == 1)
and in base R:
subset(df1, ave(index, id, FUN = function(x) length(unique(x))) == 1)

Binning by Subgroup in R

I have a dataframe with Markets, Retailers and Sales. I need to bin the Retailers within each Market into 5 quantiles.
Example:
dataframe <- structure(list(Market = c(1, 1, 1, 2, 2, 2), Retailer = c(1,
2, 3, 4, 5, 6), Sales = c(5, 10, 25, 5, 10, 25), Quantile = c(1,
2, 3, 1, 2, 3)), class = "data.frame", row.names = c(NA, -6L))
One approach is to use group_by and ntile from dplyr (the call below makes 4 bins; use ntile(Sales, 5) for five):
library(dplyr)
dataframe %>%
group_by(Market) %>%
mutate(Quantile = ntile(Sales, 4))
# A tibble: 150 x 4
# Groups: Market [3]
Market Retailer Sales Quantile
<int> <int> <dbl> <int>
1 1 1 16804 1
2 1 2 80752 4
3 1 3 38494 2
4 1 4 32773 2
5 1 5 60210 3
# … with 145 more rows
Data
set.seed(3)
dataframe <- data.frame(Market = rep(1:3, each = 50),
Retailer = rep(1:50, times = 3),
Sales = round(runif(150,0,100000),0))
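Since the question asks for five bins, here is a small base R sketch of rank-based binning within each market. The data below is made up for illustration, the helper name bin5 is hypothetical, and the way leftover rows are assigned when a group size is not a multiple of five may differ from what dplyr's ntile() does.

```r
# Illustrative data: 2 markets with 10 retailers each (not the data from the question)
dataframe <- data.frame(
  Market   = rep(1:2, each = 10),
  Retailer = rep(1:10, times = 2),
  Sales    = c(50, 10, 90, 30, 70, 20, 100, 40, 80, 60,
               5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
)

# Rank the sales within a group, then map the ranks onto bins 1..5
bin5 <- function(sales) {
  r <- rank(sales, ties.method = "first")
  ceiling(5 * r / length(sales))
}

# ave() applies the helper separately within each Market
dataframe$Quantile <- ave(dataframe$Sales, dataframe$Market, FUN = bin5)
print(dataframe)
```

With 10 retailers per market, each of the five bins receives exactly two retailers.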
