I have a data frame, for example:
ID category value1 value2
1  A        2      5
1  A        3      6
1  A        5      7
1  B        6      12
2  A        1      3
2  A        2      5
2  B        5      10
Now I want to add a new column with the percentage. For each ID, each value1 in category A is divided by the value2 of that ID's category B row; for the category B row itself, value1 is divided by its own value2. For example, for ID 1 the denominator is 12, so the first row gives 2/12 ≈ 0.17. The expected result looks like this:
ID category value1 value2 percentage
1  A        2      5      0.17
1  A        3      6      0.25
1  A        5      7      0.42
1  B        6      12     0.50
2  A        1      3      0.10
2  A        2      5      0.20
2  B        5      10     0.50
Thank you very much.
Using dplyr you could do something like this, assuming there is only one category B value for each of your IDs as per your example:
library(dplyr)
df1 <- tibble(
  ID       = c(1, 1, 1, 1, 2, 2, 2),
  category = c('A', 'A', 'A', 'B', 'A', 'A', 'B'),
  value1   = c(2, 3, 5, 6, 1, 2, 5),
  value2   = c(5, 6, 7, 12, 3, 5, 10)
)

df1 %>%
  group_by(ID) %>%
  mutate(percentage = value1 / value2[category == 'B'])
# # A tibble: 7 x 5
# # Groups: ID [2]
# ID category value1 value2 percentage
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 A 2 5 0.167
# 2 1 A 3 6 0.25
# 3 1 A 5 7 0.417
# 4 1 B 6 12 0.5
# 5 2 A 1 3 0.1
# 6 2 A 2 5 0.2
# 7 2 B 5 10 0.5
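If an ID could ever lack a category B row (not the case in this example), value2[category == 'B'] would be length zero and mutate() would error. A hedged variant using match() returns NA for such IDs instead:

df1 %>%
  group_by(ID) %>%
  # match() gives the first B position within each group, or NA if there is none
  mutate(percentage = value1 / value2[match('B', category)]) %>%
  ungroup()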
I currently have a table with a quantity in it.
ID Code Quantity
1  A    1
2  B    3
3  C    2
4  D    1
Is there any way to get this table?
ID Code Quantity
1  A    1
2  B    1
2  B    1
2  B    1
3  C    1
3  C    1
4  D    1
I need to break out the quantity and create that many rows.
Thanks!!!!
Updated
Using uncount() we can repeat each row Quantity times and put the ones in a new NewQ column:
library(dplyr)
library(tidyr)
df %>%
  group_by(ID) %>%
  uncount(Quantity, .remove = FALSE) %>%
  mutate(NewQ = 1)
# A tibble: 7 x 4
# Groups: ID [4]
ID Code Quantity NewQ
<int> <chr> <int> <dbl>
1 1 A 1 1
2 2 B 3 1
3 2 B 3 1
4 2 B 3 1
5 3 C 2 1
6 3 C 2 1
7 4 D 1 1
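Note that uncount() drops the weight column by default; the .remove = FALSE argument is what keeps the original Quantity visible alongside the new NewQ column.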
Updated
In case we opt not to replace the existing Quantity column: collapse a string of ones, then split it back into rows with separate_rows().
df %>%
  group_by(ID) %>%
  mutate(NewQ = ifelse(Quantity != 1, paste(rep(1, Quantity), collapse = ", "),
                       as.character(Quantity))) %>%
  separate_rows(NewQ) %>%
  mutate(NewQ = as.numeric(NewQ))
# A tibble: 7 x 4
# Groups: ID [4]
ID Code Quantity NewQ
<int> <chr> <int> <dbl>
1 1 A 1 1
2 2 B 3 1
3 2 B 3 1
4 2 B 3 1
5 3 C 2 1
6 3 C 2 1
7 4 D 1 1
We could use slice:
library(dplyr)
df %>%
  group_by(ID) %>%
  slice(rep(1:n(), times = Quantity)) %>%  # times = Quantity also handles IDs with more than one row
  mutate(Quantity = 1)
Output:
ID Code Quantity
<dbl> <chr> <dbl>
1 1 A 1
2 2 B 1
3 2 B 1
4 2 B 1
5 3 C 1
6 3 C 1
7 4 D 1
A base R option using rep
transform(
`row.names<-`(df[rep(1:nrow(df), df$Quantity), ], NULL),
Quantity = 1
)
gives
ID Code Quantity
1 1 A 1
2 2 B 1
3 2 B 1
4 2 B 1
5 3 C 1
6 3 C 1
7 4 D 1
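If the data.table package is available, the same repeat-and-reset idea is a one-liner (a sketch, not part of the original answers):

library(data.table)
# repeat row i Quantity[i] times, then overwrite Quantity with 1
setDT(df)[rep(seq_len(.N), Quantity)][, Quantity := 1][]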
I have a data.frame that looks like this
data <- data.frame(group = c("A", "B", "C", "A", "B", "C", "A", "B", "C"),
                   time  = c(rep(1, 3), rep(2, 3), rep(3, 3)),
                   value = c(0, 1, 1, 0.1, 10, 20, 10, 20, 30))
group time value
1 A 1 0.0
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
I would like to find an elegant way to drop a group when its value is below 0.2 at two different time points. Those time points do not have to be consecutive.
In this case I would like to filter out group A, because its values at time points 1 and 2 are below 0.2.
group time value
1 B 1 1.0
2 C 1 1.0
3 B 2 10.0
4 C 2 20.0
5 B 3 20.0
6 C 3 30.0
This solution keeps only the groups that have at most one observation with a value under 0.2, as requested:
library(dplyr)
data %>%
  group_by(group) %>%
  filter(sum(value < 0.2) < 2) %>%
  ungroup()
#> # A tibble: 6 x 3
#> group time value
#> <chr> <dbl> <dbl>
#> 1 B 1 1
#> 2 C 1 1
#> 3 B 2 10
#> 4 C 2 20
#> 5 B 3 20
#> 6 C 3 30
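The idea is that sum(value < 0.2) counts the sub-threshold observations per group, so the filter keeps only groups with at most one such time point.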
But if you are really a fan of base R:
data[ave(data$value<0.2, data$group, FUN = function(x) sum(x)<2), ]
#> group time value
#> 2 B 1 1
#> 3 C 1 1
#> 5 B 2 10
#> 6 C 2 20
#> 8 B 3 20
#> 9 C 3 30
Try this dplyr approach (note that any(value < 0.2) drops a group with even a single value below 0.2, which is stricter than the two-time-points rule but gives the same result here):
library(tidyverse)
#Code
data <- data %>%
  group_by(group) %>%
  mutate(Flag = any(value < 0.2)) %>%
  filter(!Flag) %>%
  select(-Flag)
Output:
# A tibble: 6 x 3
# Groups: group [2]
group time value
<fct> <dbl> <dbl>
1 B 1 1
2 C 1 1
3 B 2 10
4 C 2 20
5 B 3 20
6 C 3 30
I am trying to calculate the mean of time while keeping all the variables in the final dataset, using the dplyr package.
Here is how my sample dataset looks:
library(dplyr)
id       <- c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4)
gender   <- c(1,1,1,1, 2,2,2,2, 2,2,2,2, 1,1,1,1)
item.id  <- c(1,1,1,2, 1,1,2,2, 1,2,3,4, 1,2,2,3)
sequence <- c(1,2,3,1, 1,2,1,2, 1,1,1,1, 1,1,2,1)
time     <- c(5,6,7,1, 2,3,4,9, 1,2,3,9, 5,6,7,8)
data <- data.frame(id, gender, item.id, sequence, time)
> data
id gender item.id sequence time
1 1 1 1 1 5
2 1 1 1 2 6
3 1 1 1 3 7
4 1 1 2 1 1
5 2 2 1 1 2
6 2 2 1 2 3
7 2 2 2 1 4
8 2 2 2 2 9
9 3 2 1 1 1
10 3 2 2 1 2
11 3 2 3 1 3
12 3 2 4 1 9
13 4 1 1 1 5
14 4 1 2 1 6
15 4 1 2 2 7
16 4 1 3 1 8
Here id is the student id, gender the student's gender, item.id the id of the question attempted, sequence the attempt number (students may return to a question and try to answer it again), and time the time spent on each attempt.
When calculating the mean of the time, I need to follow three steps:
(a) students have multiple trials for each question. I need to calculate the mean of the time for each item having multiple trials.
(b) then calculate the overall mean of the time for each id. For example, for id=1, I have two items, the first item has 3 trials and the second item has 1 trial. First I need to aggregate the time for the first item by (5+6+7)/3=6, so id=1 has item1 time 6 and item2 time 1. Second, taking 6 and 1 and calculating the mean for this student (6+1)/2=3.5.
(c) Lastly, I would like to keep all the variables in the dataset.
data <- data %>%
  group_by(id) %>%
  select(id, gender, item.id, sequence, time) %>%
  summarize(mean.time = mean(time))
I got this, but it only takes the overall mean without first averaging within each item, and it also does not keep all the variables:
> data
# A tibble: 4 x 2
id mean.time
<dbl> <dbl>
1 1 4.75
2 2 4.5
3 3 3.75
4 4 6.5
I thought select() was going to keep all variables.
The final dataset should look like this below:
> data
id gender item.id sequence time mean.time
1 1 1 1 1 5 3.5
2 1 1 1 2 6 3.5
3 1 1 1 3 7 3.5
4 1 1 2 1 1 3.5
5 2 2 1 1 2 4.5
6 2 2 1 2 3 4.5
7 2 2 2 1 4 4.5
8 2 2 2 2 9 4.5
9 3 2 1 1 1 3.75
10 3 2 2 1 2 3.75
11 3 2 3 1 3 3.75
12 3 2 4 1 9 3.75
13 4 1 1 1 5 6.5
14 4 1 2 1 6 6.5
15 4 1 2 2 7 6.5
16 4 1 3 1 8 6.5
I used dplyr but open any other solutions.
Thanks in advance!
We can use mutate instead of summarise, since summarise returns one row per group while mutate adds a new column to the full dataset:
... %>%
  mutate(mean.time = mean(time))
If we want the mean of the item means, first group by 'id' and 'item.id' to get each item's mean, then group by 'id' and take the mean of the unique item means:
data %>%
  group_by(id, item.id) %>%
  mutate(mean.time = mean(time)) %>%
  group_by(id) %>%
  mutate(mean.time = mean(unique(mean.time)))
# A tibble: 16 x 6
# Groups: id [4]
# id gender item.id sequence time mean.time
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1 5 3.5
# 2 1 1 1 2 6 3.5
# 3 1 1 1 3 7 3.5
# 4 1 1 2 1 1 3.5
# 5 2 2 1 1 2 4.5
# 6 2 2 1 2 3 4.5
# 7 2 2 2 1 4 4.5
# 8 2 2 2 2 9 4.5
# 9 3 2 1 1 1 3.75
#10 3 2 2 1 2 3.75
#11 3 2 3 1 3 3.75
#12 3 2 4 1 9 3.75
#13 4 1 1 1 5 6.5
#14 4 1 2 1 6 6.5
#15 4 1 2 2 7 6.5
#16 4 1 3 1 8 6.5
Or, instead of a second group_by, use match() to get the first position of each unique 'item.id', extract those 'mean.time' values, and take their mean:
data %>%
  group_by(id, item.id) %>%
  mutate(mean.time = mean(time),
         mean.time = mean(mean.time[match(unique(item.id), item.id)]))
Or use summarise and then join the result back onto the original data:
data %>%
  group_by(id, item.id) %>%
  summarise(mean.time = mean(time)) %>%
  group_by(id) %>%
  summarise(mean.time = mean(mean.time)) %>%
  right_join(data)
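For comparison, a base R sketch of the same two-step mean (assuming data as defined above): average time within each id/item.id first, then average those item means per id, and attach the result with match():

item_means <- aggregate(time ~ id + item.id, data = data, FUN = mean)
id_means <- aggregate(time ~ id, data = item_means, FUN = mean)
data$mean.time <- id_means$time[match(data$id, id_means$id)]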
Suppose I have a tibble tbl_
tbl_ <- tibble(id = c(1,1,2,2,3,3), dta = 1:6)
tbl_
# A tibble: 6 x 2
id dta
<dbl> <int>
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 3 6
There are 3 id groups. I want to resample entire id groups 3 times with replacement. For example the resulting tibble can be:
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 1 2
5 3 5
6 3 6
but not
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 2 4
5 3 5
6 3 6
or
id dta
<dbl> <int>
1 1 1
2 1 1
3 2 3
4 2 4
5 3 5
6 3 6
Here is one option with sample_n and distinct:
library(tidyverse)
distinct(tbl_, id) %>%
  sample_n(nrow(.), replace = TRUE) %>%
  pull(id) %>%
  map_df(~ tbl_ %>% filter(id == .x)) %>%
  arrange(id)
# A tibble: 6 x 2
# id dta
# <dbl> <int>
#1 1.00 1
#2 1.00 2
#3 1.00 1
#4 1.00 2
#5 3.00 5
#6 3.00 6
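In current dplyr, sample_n() has been superseded; slice_sample() is the drop-in replacement (a sketch of the same approach):

library(dplyr)
library(purrr)
distinct(tbl_, id) %>%
  slice_sample(prop = 1, replace = TRUE) %>%  # resample the distinct ids with replacement
  pull(id) %>%
  map_df(~ filter(tbl_, id == .x)) %>%
  arrange(id)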
An option is to get the minimum row number for each id. Those row numbers are then sampled with replace = TRUE to pick whole groups.
library(dplyr)
tbl_ %>%
  mutate(rn = row_number()) %>%
  group_by(id) %>%
  summarise(minrow = min(rn)) -> min_row
indx <- rep(sample(min_row$minrow, nrow(min_row), replace = TRUE), each = 2) +
  rep(c(0, 1), 3)
tbl_[indx,]
# # A tibble: 6 x 2
# id dta
# <dbl> <int>
# 1 1.00 1
# 2 1.00 2
# 3 3.00 5
# 4 3.00 6
# 5 2.00 3
# 6 2.00 4
Note: the answer above assumes exactly 2 rows per id (hence the hard-coded each = 2 and c(0,1)), although it handles any number of ids. Those constants would need to change to allow a varying number of rows per id; a more general sketch follows.
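A hedged generalization using split(): index the per-id list with a resampled vector of ids, so group sizes no longer matter:

grp <- split(tbl_, tbl_$id)  # one data frame per id
picked <- sample(names(grp), length(grp), replace = TRUE)
do.call(rbind, grp[picked])  # bind the resampled groups back together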
I have the following data frame
d2
# A tibble: 10 x 2
ID Count
<int> <dbl>
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
9 3
10 3
Which states how many counts each person (ID) had.
I would like to calculate the cumulative percentage of IDs for each count: count 1: 50%, up to 2: 80%, up to 3: 100%.
I tried
> d2 %>% mutate(cum = cumsum(Count)/sum(Count))
# A tibble: 10 x 3
ID Count cum
<int> <dbl> <dbl>
1 1 0.05882353
2 1 0.11764706
3 1 0.17647059
4 1 0.23529412
5 1 0.29411765
6 2 0.41176471
7 2 0.52941176
8 2 0.64705882
9 3 0.82352941
10 3 1.00000000
but this result is obviously incorrect because I would expect that the count of 1 would correspond to 50% rather than 29.4%.
What is wrong here? How do I get the correct answer?
We get the frequency of each 'Count', create 'Cum' by dividing the cumulative sum of 'n' by its total, then right_join with the original data:
d2 %>%
  count(Count) %>%
  mutate(Cum = cumsum(n)/sum(n)) %>%
  select(-n) %>%
  right_join(d2) %>%
  select(names(d2), everything())
# A tibble: 10 x 3
# ID Count Cum
# <int> <int> <dbl>
# 1 1 1 0.500
# 2 2 1 0.500
# 3 3 1 0.500
# 4 4 1 0.500
# 5 5 1 0.500
# 6 6 2 0.800
# 7 7 2 0.800
# 8 8 2 0.800
# 9 9 3 1.00
#10 10 3 1.00
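Here right_join(d2) joins on the shared Count column, restoring one row per ID, and the final select() puts the columns back in their original order.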
If we need per-row cumulative proportions, as @LAP mentioned:
d2 %>%
  mutate(Cum = row_number()/n())
# ID Count Cum
#1 1 1 0.1
#2 2 1 0.2
#3 3 1 0.3
#4 4 1 0.4
#5 5 1 0.5
#6 6 2 0.6
#7 7 2 0.7
#8 8 2 0.8
#9 9 3 0.9
#10 10 3 1.0
This works:
d2 %>%
  mutate(cum = cumsum(rep(1/n(), n())))
ID Count cum
1 1 1 0.1
2 2 1 0.2
3 3 1 0.3
4 4 1 0.4
5 5 1 0.5
6 6 2 0.6
7 7 2 0.7
8 8 2 0.8
9 9 3 0.9
10 10 3 1.0
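This builds the same sequence as row_number()/n() above, just by accumulating n() equal increments of 1/n().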
One option could be:
library(dplyr)
d2 %>%
  group_by(Count) %>%
  summarise(proportion = n()) %>%
  mutate(Perc = cumsum(100 * proportion / sum(proportion))) %>%
  select(-proportion)
# # A tibble: 3 x 2
# Count Perc
# <int> <dbl>
# 1 1 50.0
# 2 2 80.0
# 3 3 100.0
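The same cumulative shares can be read off directly in base R as a quick check (assuming d2 as above):

tab <- table(d2$Count)
cumsum(tab) / sum(tab)
#   1   2   3
# 0.5 0.8 1.0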