How to create another table in R to calculate the difference?

I have a data frame like the one below:
ID      Parameter  value
123-01  a1         x
123-02  a1         x
123-01  b3         x
123-02  b3         x
124-01  a1         x
125-01  a1         x
126-01  a1         x
124-01  b3         x
125-01  b3         x
126-01  b3         x
I would like to find the sample IDs that end with "-02" and, for each parameter, calculate the difference between each such sample and the sample whose ID has the same first three digits.
For example, calculate the difference between 123-01 and 123-02 for parameter a1, then the difference between 123-01 and 123-02 for parameter b3, and so on.
In the end, I would like to get a table containing:
ID   Parameter  DiffValue
123  a1         y
123  b3         y
127  a1         y
127  b3         y
How can I do it?
I tried using dplyr's filter() to create a table that contains only the duplicated samples, but how do I then match it against the original table and do the calculation?

Try it this way:
library(tidyverse)
df <- read.table(text = "ID Parameter value
123-01 a1 10
123-02 a1 10
123-01 b3 10
123-02 b3 10
124-01 a1 10
125-01 a1 10
126-01 a1 10
124-01 b3 10
125-01 b3 10
126-01 b3 10", header = T)
df %>%
  arrange(Parameter, ID) %>%
  separate(ID, into = c("id_grp", "n"), sep = "-", remove = F) %>%
  group_by(Parameter, id_grp) %>%
  mutate(diff_value = c(NA, diff(value))) %>%
  select(-c(id_grp, n))
#> Adding missing grouping variables: `id_grp`
#> # A tibble: 10 x 5
#> # Groups: Parameter, id_grp [8]
#> id_grp ID Parameter value diff_value
#> <chr> <chr> <chr> <int> <int>
#> 1 123 123-01 a1 10 NA
#> 2 123 123-02 a1 10 0
#> 3 124 124-01 a1 10 NA
#> 4 125 125-01 a1 10 NA
#> 5 126 126-01 a1 10 NA
#> 6 123 123-01 b3 10 NA
#> 7 123 123-02 b3 10 0
#> 8 124 124-01 b3 10 NA
#> 9 125 125-01 b3 10 NA
#> 10 126 126-01 b3 10 NA
Created on 2021-01-26 by the reprex package (v0.3.0)
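To get the table the question asks for (one row per ID prefix and parameter, only for samples that have a "-02" replicate), one possible sketch is to spread the "-01" and "-02" values into separate columns and subtract them. The values below are made up for illustration, and the sign convention ("-02" minus "-01") and the helper column names rep_01 and rep_02 are assumptions, not something the question specifies.

```r
library(dplyr)
library(tidyr)

# Illustrative data: the values are invented so the differences are non-zero
df <- read.table(text = "ID Parameter value
123-01 a1 10
123-02 a1 13
123-01 b3 10
123-02 b3 7
124-01 a1 10
124-01 b3 10", header = TRUE, stringsAsFactors = FALSE)

result <- df %>%
  # Split "123-01" into the prefix "123" and the replicate suffix "01"
  separate(ID, into = c("id_grp", "n"), sep = "-") %>%
  # One row per prefix and parameter, with columns rep_01 and rep_02
  pivot_wider(names_from = n, values_from = value, names_prefix = "rep_") %>%
  # Keep only the samples that actually have a "-02" replicate
  filter(!is.na(rep_02)) %>%
  # Assumed sign convention: "-02" value minus "-01" value
  transmute(ID = id_grp, Parameter, DiffValue = rep_02 - rep_01)

result
```

Samples without a "-02" replicate (124 in this sketch) drop out at the filter() step, so only the paired samples remain.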

Related

Remove a list of levels from a chr or factor column in R

I have a data frame with 1000 IDs, each with more than 100 rows of data.
I want to remove every ID that meets a criterion, based on another column, at least once.
For example, with the dummy data below, I want to remove all IDs where var2 is < 20 at least once.
How do I do this without spelling out each individual ID to be dropped?
Dummy data of similar structure:
data <- data.frame(ID = rep(c('B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10'), each = 5),
var1 = rep(c('a', 'b', 'b', 'c', 'd','a', 'c', 'c', 'b', 'a' ), times = 5),
var2 = sample(1:100, 50))
I have tried using the function droplevels, but I do not want to spell out every individual ID to be dropped.
tidyverse
df <- data.frame(ID = rep(c('B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10'), each = 5),
var1 = rep(c('a', 'b', 'b', 'c', 'd','a', 'c', 'c', 'b', 'a' ), times = 5),
var2 = sample(1:100, 50))
library(tidyverse)
df %>%
  group_by(ID) %>%
  filter(!any(var2 < 20)) %>%
  ungroup()
#> # A tibble: 25 x 3
#> ID var1 var2
#> <chr> <chr> <int>
#> 1 B2 a 100
#> 2 B2 c 67
#> 3 B2 c 64
#> 4 B2 b 78
#> 5 B2 a 73
#> 6 B3 a 83
#> 7 B3 b 32
#> 8 B3 b 23
#> 9 B3 c 65
#> 10 B3 d 96
#> # ... with 15 more rows
Created on 2022-01-14 by the reprex package (v2.0.1)
data.table
library(data.table)
setDT(df)[, .SD[!any(var2 < 20)], by = ID]
#> ID var1 var2
#> 1: B1 a 47
#> 2: B1 b 81
#> 3: B1 b 95
#> 4: B1 c 48
#> 5: B1 d 43
#> 6: B4 a 77
#> 7: B4 c 54
#> 8: B4 c 23
#> 9: B4 b 55
#> 10: B4 a 25
#> 11: B6 a 98
#> 12: B6 c 99
#> 13: B6 c 86
#> 14: B6 b 92
#> 15: B6 a 33
#> 16: B7 a 73
#> 17: B7 b 94
#> 18: B7 b 62
#> 19: B7 c 40
#> 20: B7 d 49
#> 21: B10 a 66
#> 22: B10 c 44
#> 23: B10 c 35
#> 24: B10 b 76
#> 25: B10 a 38
#> ID var1 var2
Created on 2022-01-14 by the reprex package (v2.0.1)
I just found the answer here: How to remove all rows belonging to a particular group when only one row fulfills the condition in R?
This does the trick:
new.data <- subset(data, ave(var2 >=20, ID, FUN = all))
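Because sample() is random and no seed is set, the two outputs above appear to come from different draws, which would explain why they keep different IDs. A small check with a fixed seed (the seed value 42 and the names kept_dplyr and kept_base are arbitrary choices for this sketch) suggests that the grouped filter() and the ave() approach select the same rows:

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the draw reproducible
data <- data.frame(ID = rep(paste0("B", 1:10), each = 5),
                   var2 = sample(1:100, 50),
                   stringsAsFactors = FALSE)

# dplyr: drop every ID that has at least one var2 below 20
kept_dplyr <- data %>%
  group_by(ID) %>%
  filter(!any(var2 < 20)) %>%
  ungroup()

# base R: keep rows whose ID has all var2 values >= 20
kept_base <- subset(data, ave(var2 >= 20, ID, FUN = all))

# Both approaches should keep the same IDs in the same row order
identical(kept_dplyr$ID, kept_base$ID)
```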

R: Dataframe Manipulation

I have the following data frame:
ID  COUNT OF STOCK  YEAR
A1  10              2000
A1  20              2000
A1  18              2000
A1  15              2001
A1  30              2001
A2  35              2002
A2  50              2001
A2  10              2002
A2  22              2002
A3  11              2001
A3  15              2001
A3  28              2000
I would like to transform it into the data frame shown below by grouping on ID and YEAR, summing COUNT OF STOCK within each group, and using YEAR to compute the number of years from 2020.
ID  Sum of COUNT OF STOCK  number of years from 2020 (2020-year)
A1  48                     20
A1  45                     19
A2  67                     18
A2  50                     19
A3  26                     19
A3  28                     20
Thanks in advance!!
This is pretty straightforward. To work with those verbose column names you will have to quote them with backticks, though, which might be a challenge.
dat %>%
  group_by(ID, YEAR) %>%
  summarise(
    `Sum of COUNT OF STOCK` = sum(`COUNT OF STOCK`),
    `number of years from 2020 (2020-year)` = 2020 - first(YEAR)
  ) %>%
  select(-YEAR)
Output:
ID `Sum of COUNT OF STOCK` `number of years from 2020 (2020-year)`
<chr> <int> <dbl>
1 A1 48 20
2 A1 45 19
3 A2 50 19
4 A2 67 18
5 A3 28 20
6 A3 26 19
Simply do this.
df %>%
  group_by(D, number_of_years = 2020 - YEAR) %>%
  summarise(Sum_of_stock = sum(COUNT_OF_STOCK))
# A tibble: 6 x 3
# Groups: D [3]
D number_of_years Sum_of_stock
<chr> <dbl> <int>
1 A1 19 45
2 A1 20 48
3 A2 18 67
4 A2 19 50
5 A3 19 26
6 A3 20 28
data
df <- read.table(text = "D COUNT_OF_STOCK YEAR
A1 10 2000
A1 20 2000
A1 18 2000
A1 15 2001
A1 30 2001
A2 35 2002
A2 50 2001
A2 10 2002
A2 22 2002
A3 11 2001
A3 15 2001
A3 28 2000", header = T)

How to create an incrementing variable with 2 variables in R?

I would like to create an incrementing variable (Id1 or Id2) from two other variables (Var1 and Var2).
Thank you.
Elodie
EDIT (reproducible example for Aaron Montgomery)
I want to create an incrementing variable, "Id". The value of "Id" changes only if VarA is a new value and VarB is a new value. See in particular the rows where Id = 4 in the expected table.
data_example <- data.table::fread("
VarA VarB
A1 B1
A1 B2
A1 B3
A1 B4
A2 B5
A3 B6
A4 B7
A5 B7
A5 B8
A6 B9
A7 B10
A8 B10
A9 B10")
Expected table
VarA VarB Id
A1 B1 1
A1 B2 1
A1 B3 1
A1 B4 1
A2 B5 2
A3 B6 3
A4 B7 4
A5 B7 4
A5 B8 4
A6 B9 5
A7 B10 6
A8 B10 6
A9 B10 6
Here is one solution using the tidyverse
library(tidyverse)
data_example <- data.table::fread("
Var1 Var2 Id1 Id2
604211 1001 3 1
604211 1093 3 1
604211 1146 3 1
604211 1319 3 1
635348 1002 5 2
634849 1005 5 2
620861 1004 4 3
622281 1004 4 3
622281 1041 4 3
600044 1100 1 4
600049 1033 2 5
607692 1033 2 5
612595 1033 2 5")
data_example %>%
  arrange(Var1, Var2) %>%
  group_by(Var1) %>%
  # cur_group_id() replaces the argument-less group_indices(), deprecated since dplyr 1.0.0
  mutate(id1 = cur_group_id()) %>%
  group_by(Var2) %>%
  mutate(id2 = cur_group_id())
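For the edited example with VarA and VarB, a minimal sketch that reproduces the expected table is to start a new Id only when both VarA and VarB differ from the previous row. This assumes the rows are already ordered so that linked rows sit next to each other, as they do in the question's example; data that is not ordered this way would need a more general approach. The helper column new_group is a name introduced here for illustration.

```r
library(dplyr)

data_example <- data.frame(
  VarA = c("A1", "A1", "A1", "A1", "A2", "A3", "A4", "A5", "A5", "A6", "A7", "A8", "A9"),
  VarB = c("B1", "B2", "B3", "B4", "B5", "B6", "B7", "B7", "B8", "B9", "B10", "B10", "B10"),
  stringsAsFactors = FALSE
)

result <- data_example %>%
  mutate(
    # TRUE when both values differ from the previous row; the first row always starts a group
    new_group = VarA != lag(VarA, default = "") & VarB != lag(VarB, default = ""),
    # The running count of group starts gives the incrementing Id
    Id = cumsum(new_group)
  ) %>%
  select(-new_group)

result
```

A row that shares either VarA or VarB with the row above it keeps the same Id, which is what puts A4/B7, A5/B7 and A5/B8 together under Id 4.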

Ascending group by date

I am unable to sort my groups by date in ascending order. Please help!
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
B = c("2017-02-20","2018-02-14","2017-02-06","2018-02-27","2017-02-29","2017-02-28","2017-02-09","2017-02-10"))
Code:
df %>% group_by(A) %>% arrange(A,(as.Date(B)))
I am getting the wrong result, as the b1 group is not sorted:
A B
<fctr> <fctr>
1 a1 2017-02-20
2 a1 2018-02-14
3 b1 2017-02-06
4 b1 2018-02-27
5 b1 2017-02-29
6 c2 2017-02-28
7 d2 2017-02-09
8 d2 2017-02-10
Note that 2017-02-29 is not a real date: February 2017 had only 28 days. So when you convert column B to a Date, that value becomes NA. Fix that entry and your code should work.
Also, you probably do not need to group_by(A).
library(dplyr)
#>
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
B = c("2017-02-20","2018-02-14","2017-02-06","2018-02-27","2017-02-29","2017-02-28","2017-02-09","2017-02-10"))
as.Date(df$B)
#> [1] "2017-02-20" "2018-02-14" "2017-02-06" "2018-02-27" NA
#> [6] "2017-02-28" "2017-02-09" "2017-02-10"
df%>%arrange(A, as.Date(B))
#> A B
#> 1 a1 2017-02-20
#> 2 a1 2018-02-14
#> 3 b1 2017-02-06
#> 4 b1 2018-02-27
#> 5 b1 2017-02-29
#> 6 c2 2017-02-28
#> 7 d2 2017-02-09
#> 8 d2 2017-02-10
Created on 2019-09-16 by the reprex package (v0.2.1)
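One way to spot such entries before sorting is to parse the column and look for values that become NA. This is only a sketch; the names parsed and bad_rows are introduced here for illustration.

```r
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
                 B = c("2017-02-20","2018-02-14","2017-02-06","2018-02-27",
                       "2017-02-29","2017-02-28","2017-02-09","2017-02-10"),
                 stringsAsFactors = FALSE)

# Strings that are not valid calendar dates are parsed as NA
parsed <- as.Date(df$B, format = "%Y-%m-%d")

# Rows whose date string could not be parsed
bad_rows <- df[is.na(parsed), ]
bad_rows
```

Once the invalid entries are corrected, arrange(A, as.Date(B)) sorts each group as expected.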

get previous value to the current value

How can I get the previous value within each group in a new column C? The first row of each group should be empty, since it has no previous value in that group.
Can dplyr do this?
Code:
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
B = c("2017-02-20","2018-02-14","2017-02-06","2017-02-27","2017-02-29","2017-02-28","2017-02-09","2017-02-10"))
Dataframe:
A B
a1 2017-02-20
a1 2018-02-14
b1 2017-02-06
b1 2017-02-27
b1 2017-02-29
c2 2017-02-28
d2 2017-02-09
d2 2017-02-10
Expected Output
A B C
a1 2017-02-20
a1 2018-02-14 2017-02-20
b1 2017-02-06
b1 2017-02-27 2017-02-06
b1 2017-02-29 2017-02-27
c2 2017-02-28
d2 2017-02-09
d2 2017-02-10 2017-02-09
You could use the lag function from dplyr:
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
B = c("2017-02-20","2018-02-14","2017-02-06",
"2017-02-27","2017-02-29","2017-02-28",
"2017-02-09","2017-02-10"))
library(dplyr)
df %>%
  group_by(A) %>%
  mutate(C = lag(B, 1, default = NA))
This will apply the lag function for each group of "A"
Output:
# A tibble: 8 x 3
# Groups: A [4]
A B C
<fct> <fct> <fct>
1 a1 2017-02-20 NA
2 a1 2018-02-14 2017-02-20
3 b1 2017-02-06 NA
4 b1 2017-02-27 2017-02-06
5 b1 2017-02-29 2017-02-27
6 c2 2017-02-28 NA
7 d2 2017-02-09 NA
8 d2 2017-02-10 2017-02-09
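If the first row of each group should be an empty string rather than NA, as in the expected output, one option is to keep B as character and pass an empty string as the default. This is a sketch assuming B does not need to be a factor or a Date:

```r
library(dplyr)

df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
                 B = c("2017-02-20","2018-02-14","2017-02-06",
                       "2017-02-27","2017-02-29","2017-02-28",
                       "2017-02-09","2017-02-10"),
                 stringsAsFactors = FALSE)

result <- df %>%
  group_by(A) %>%
  # The first row of each group has no previous value, so it gets ""
  mutate(C = lag(B, default = "")) %>%
  ungroup()

result
```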
