Recode when there is a missing category in R

I need help with recoding. Here is what my dataset looks like.
df <- data.frame(id = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4),
score = c(0,1,0,1,0, 0,2,0,2,2, 0,3,3,0,0, 0,1,3,1,3))
> df
id score
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 2 0
7 2 2
8 2 0
9 2 2
10 2 2
11 3 0
12 3 3
13 3 3
14 3 0
15 3 0
16 4 0
17 4 1
18 4 3
19 4 1
20 4 3
Some ids have missing score categories. When that is the case for an id, I would like to recode the score categories. So:
a) if the score options are `0,1,2` and score `1` is missing, then `2` needs to be recoded as `1`,
b) if the score options are `0,1,2,3` and scores `1,2` are missing, then `3` needs to be recoded as `1`,
c) if the score options are `0,1,2,3` and score `2` is missing, then `3` needs to be recoded as `2`,
The idea is that there should not be any missing score categories in between.
The desired output would be:
> df.1
id score score.recoded
1 1 0 0
2 1 1 1
3 1 0 0
4 1 1 1
5 1 0 0
6 2 0 0
7 2 2 1
8 2 0 0
9 2 2 1
10 2 2 1
11 3 0 0
12 3 3 1
13 3 3 1
14 3 0 0
15 3 0 0
16 4 0 0
17 4 1 1
18 4 3 2
19 4 1 1
20 4 3 2

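library(dplyr)
df %>%
  group_by(id) %>%
  # factor() levels are the sorted distinct scores within each id, so the codes minus 1 have no gaps
  # (this overwrites score; assign to score.recoded instead to keep the original column)
  mutate(score = as.numeric(factor(score)) - 1)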
# A tibble: 20 x 2
# Groups: id [4]
id score
<dbl> <dbl>
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 2 0
7 2 1
8 2 0
9 2 1
10 2 1
11 3 0
12 3 1
13 3 1
14 3 0
15 3 0
16 4 0
17 4 1
18 4 2
19 4 1
20 4 2
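For anyone without dplyr, the same per-id recoding can be sketched in base R. This is only an illustration: `recode_dense` is a hypothetical helper name, and it assumes the smallest score present in each id should map to 0.

```r
df <- data.frame(id = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4),
                 score = c(0,1,0,1,0, 0,2,0,2,2, 0,3,3,0,0, 0,1,3,1,3))

# Hypothetical helper: position of each score among the sorted distinct
# scores of its group, shifted so the smallest observed score becomes 0.
recode_dense <- function(s) match(s, sort(unique(s))) - 1

# ave() applies the helper separately within each id and keeps row order.
df$score.recoded <- ave(df$score, df$id, FUN = recode_dense)
```

Because the helper only looks at which values occur, it does not need to know the intended score range in advance.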

Using data.table
library(data.table)
# rank the positive scores against the sorted distinct values within each id
setDT(df)[, score.recoded := 0L][
  score > 0, score.recoded := match(score, sort(unique(score))), by = id]
-output
> df
id score score.recoded
<num> <num> <int>
1: 1 0 0
2: 1 1 1
3: 1 0 0
4: 1 1 1
5: 1 0 0
6: 2 0 0
7: 2 2 1
8: 2 0 0
9: 2 2 1
10: 2 2 1
11: 3 0 0
12: 3 3 1
13: 3 3 1
14: 3 0 0
15: 3 0 0
16: 4 0 0
17: 4 1 1
18: 4 3 2
19: 4 1 1
20: 4 3 2
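One detail worth keeping in mind when ranking with match(): matching a vector against itself returns the position of each value's first occurrence, which only coincides with a gap-free rank when the distinct values happen to appear in increasing order with no early repeats. Matching against the sorted distinct values avoids that dependence on row order. A small sketch with a made-up vector (the variable names are illustrative only):

```r
x <- c(1, 1, 3)   # made-up scores: the value 1 repeats before 3 appears

# Position of the first occurrence of each value: 1 1 3 (not gap-free)
first_pos <- match(x, x)

# Rank among the sorted distinct values: 1 1 2 (gap-free)
dense <- match(x, sort(unique(x)))
```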

Related

R How to count by group starting when condition is met

df <- data.frame (id = c(1,1,1,2,2,2,3,3,3,3,4,4,4,4,4,4),
qresult=c(0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0),
count=c(0,0,0,0,1,2,0,0,1,2,1,2,3,4,5,6))
> df
id qresult count
1 1 0 0
2 1 0 0
3 1 0 0
4 2 0 0
5 2 1 1
6 2 0 2
7 3 0 0
8 3 0 0
9 3 1 1
10 3 0 2
11 4 1 1
12 4 0 2
13 4 0 3
14 4 0 4
15 4 0 5
16 4 0 6
What would be a way to obtain the count column, which begins counting when the condition qresult == 1 is met and resets for each new id?
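We could apply cumsum twice after grouping: the inner cumsum marks the rows from the first qresult == 1 onward, and the outer cumsum numbers them. This relies on there being at most one 1 per id.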
library(dplyr)
df %>%
group_by(id) %>%
mutate(count2 = cumsum(cumsum(qresult))) %>%
ungroup
-output
# A tibble: 16 × 4
id qresult count count2
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 0
2 1 0 0 0
3 1 0 0 0
4 2 0 0 0
5 2 1 1 1
6 2 0 2 2
7 3 0 0 0
8 3 0 0 0
9 3 1 1 1
10 3 0 2 2
11 4 1 1 1
12 4 0 2 2
13 4 0 3 3
14 4 0 4 4
15 4 0 5 5
16 4 0 6 6
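If an id could contain more than one qresult == 1, a slightly more defensive variant is to turn the inner cumsum into a logical flag before summing again. The sketch below uses base R's ave() so it runs without extra packages; `count_from_first` is a hypothetical helper name and the two-hit vector is made up for illustration.

```r
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3,4,4,4,4,4,4),
                 qresult = c(0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0))

# Hypothetical helper: TRUE from the first hit onward, then count the TRUEs.
count_from_first <- function(q) cumsum(cumsum(q) > 0)

# Apply per id; the count restarts for every new id.
df$count2 <- ave(df$qresult, df$id, FUN = count_from_first)

# Made-up group with two hits: the count keeps increasing by one.
two_hits <- count_from_first(c(0, 1, 0, 1, 0))
```

With a single hit per id this gives the same numbers as the double cumsum; with several hits it still counts rows rather than adding up the hits.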

how to create a column that determines if a value is missing in a variable in R

I am trying to identify if a column has a missing number category based on a max.score. Here is a sample dataset.
df <- data.frame(id = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3),
score = c(0,0,2,0,2, 0,1,1,0,1, 0,1,0,1,0),
max.score = c(2,2,2,2,2, 1,1,1,1,1, 2,2,2,2,2))
> df
id score max.score
1 1 0 2
2 1 0 2
3 1 2 2
4 1 0 2
5 1 2 2
6 2 0 1
7 2 1 1
8 2 1 1
9 2 0 1
10 2 1 1
11 3 0 2
12 3 1 2
13 3 0 2
14 3 1 2
15 3 0 2
For id = 1, based on max.score, category 1 is missing, so I would like to add a missing column with the value 1. For id = 3, score = 2 is missing, so the missing column should contain 2. If more than one category is missing, the column should list all of them, for example 1,3. The desired output is:
> df
id score max.score missing
1 1 0 2 1
2 1 0 2 1
3 1 2 2 1
4 1 0 2 1
5 1 2 2 1
6 2 0 1 NA
7 2 1 1 NA
8 2 1 1 NA
9 2 0 1 NA
10 2 1 1 NA
11 3 0 2 2
12 3 1 2 2
13 3 0 2 2
14 3 1 2 2
15 3 0 2 2
Any thoughts?
Thanks!
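library(dplyr)
df %>%
  group_by(id) %>%
  # categories from 0 to max.score that never occur for this id, collapsed into one string
  mutate(missing = toString(setdiff(0:max.score[1], unique(score))),
         # an empty string means nothing is missing, so store NA instead
         missing = ifelse(nzchar(missing), missing, NA))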
# A tibble: 15 x 4
# Groups: id [3]
id score max.score missing
<dbl> <dbl> <dbl> <chr>
1 1 0 2 1
2 1 0 2 1
3 1 2 2 1
4 1 0 2 1
5 1 2 2 1
6 2 0 1 NA
7 2 1 1 NA
8 2 1 1 NA
9 2 0 1 NA
10 2 1 1 NA
11 3 0 2 2
12 3 1 2 2
13 3 0 2 2
14 3 1 2 2
15 3 0 2 2
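To see what happens when several categories are missing, here is a small base R sketch on made-up values for a single id (the variable names are illustrative). Note that toString() separates entries with a comma and a space; paste() with collapse = "," would give the exact 1,3 form mentioned in the question.

```r
observed  <- c(0, 2, 0, 2)   # made-up scores for one id
max_score <- 3               # made-up maximum for that id

# Categories expected in 0..max_score that never occur: 1 and 3
missing_cats <- setdiff(0:max_score, unique(observed))

with_space <- toString(missing_cats)                # comma plus space
no_space   <- paste(missing_cats, collapse = ",")   # comma only
```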

Editing each row in column in R

I have a data frame that looks like this:
Twin_Pair zyg CDsumTwin1 CDsumTwin2
<chr> <int> <dbl> <dbl>
1 pair1(2891,2892) 2 0 5
2 pair2(4000,4001) 1 0 0
3 pair3(4006,4007) 2 0 3
4 pair4(4009,4010) 2 1 3
5 pair5(4012,4013) 2 2 0
6 pair6(4015,4016) 2 0 9
7 pair7(4018,4019) 2 0 0
8 pair8(4021,4022) 1 0 0
9 pair9(4024,4025) 1 0 0
10 pair10(4027,4028) 2 2 17
How can I remove "pair1", "pair2", etc. from each row in the first column so that I am left with something like (4027,4028)? I know how to remove the first 5 characters, but the problem is that the prefix goes up to pair100. What would be an efficient way to do this?
You need a regular expression to match the prefix: ^pair[0-9]+ matches "pair" followed by one or more digits at the start of the string. Please test this code to see if it works.
dat$Twin_Pair <- sub("^pair[0-9]+", "", dat$Twin_Pair)
dat
# Twin_Pair zyg CDsumTwin1 CDsumTwin2
# 1 (2891,2892) 2 0 5
# 2 (4000,4001) 1 0 0
# 3 (4006,4007) 2 0 3
# 4 (4009,4010) 2 1 3
# 5 (4012,4013) 2 2 0
# 6 (4015,4016) 2 0 9
# 7 (4018,4019) 2 0 0
# 8 (4021,4022) 1 0 0
# 9 (4024,4025) 1 0 0
# 10 (4027,4028) 2 2 17
Data
dat <- read.table(text = "Twin_Pair zyg CDsumTwin1 CDsumTwin2
1 'pair1(2891,2892)' 2 0 5
2 'pair2(4000,4001)' 1 0 0
3 'pair3(4006,4007)' 2 0 3
4 'pair4(4009,4010)' 2 1 3
5 'pair5(4012,4013)' 2 2 0
6 'pair6(4015,4016)' 2 0 9
7 'pair7(4018,4019)' 2 0 0
8 'pair8(4021,4022)' 1 0 0
9 'pair9(4024,4025)' 1 0 0
10 'pair10(4027,4028)' 2 2 17",
header = TRUE)
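An option with trimws, treating every character that is not an opening parenthesis as "whitespace" to trim from the left (the whitespace argument needs R >= 3.6.0)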
dat$Twin_Pair <- trimws(dat$Twin_Pair, whitespace = "[^(]+", which = 'left')
-output
> dat
Twin_Pair zyg CDsumTwin1 CDsumTwin2
1 (2891,2892) 2 0 5
2 (4000,4001) 1 0 0
3 (4006,4007) 2 0 3
4 (4009,4010) 2 1 3
5 (4012,4013) 2 2 0
6 (4015,4016) 2 0 9
7 (4018,4019) 2 0 0
8 (4021,4022) 1 0 0
9 (4024,4025) 1 0 0
10 (4027,4028) 2 2 17
We could use str_extract with the regex \(.*?\) (written '\\(.*?\\)' as an R string), which extracts the first pair of parentheses together with everything between them:
library(stringr)
library(dplyr)
dat %>%
mutate(Twin_Pair = str_extract(Twin_Pair, '\\(.*?\\)'))
Twin_Pair zyg CDsumTwin1 CDsumTwin2
1 (2891,2892) 2 0 5
2 (4000,4001) 1 0 0
3 (4006,4007) 2 0 3
4 (4009,4010) 2 1 3
5 (4012,4013) 2 2 0
6 (4015,4016) 2 0 9
7 (4018,4019) 2 0 0
8 (4021,4022) 1 0 0
9 (4024,4025) 1 0 0
10 (4027,4028) 2 2 17
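As a quick check that the pattern does not depend on the number of digits, the following base R sketch runs two of the ideas above on a short vector. The third element (pair100) is made up here to cover the three-digit case from the question; the variable names are illustrative.

```r
pairs <- c("pair1(2891,2892)", "pair10(4027,4028)", "pair100(5000,5001)")

# Remove the literal prefix "pair" plus any run of digits at the start.
strip_prefix <- sub("^pair[0-9]+", "", pairs)

# Alternative: remove everything before the first opening parenthesis,
# whatever the prefix looks like.
up_to_paren <- sub("^[^(]*", "", pairs)
```

The second form is a little more general, since it makes no assumption about what precedes the parenthesis.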

R Data Table Assign Subset of Rows and Columns with Zero

I'm trying to explode a data table into a time series by populating future time steps with values of zero. The starting data table has the following structure. Values for V1 and V2 can be thought of as values for the first time step.
dt <- data.table(ID = c(1,2,3), V1 = c(1,2,3), V2 = c(4,5,6))
ID V1 V2
1: 1 1 4
2: 2 2 5
3: 3 3 6
What I want to get to is a data table like this
ID year V1 V2
1: 1 1 1 4
2: 1 2 0 0
3: 1 3 0 0
4: 1 4 0 0
5: 1 5 0 0
6: 2 1 2 5
7: 2 2 0 0
8: 2 3 0 0
9: 2 4 0 0
10: 2 5 0 0
11: 3 1 3 6
12: 3 2 0 0
13: 3 3 0 0
14: 3 4 0 0
15: 3 5 0 0
I've exploded the original data table and appended the year column with the following
dt <- dt[, .(year=1:5), by=ID][dt, on=ID, allow.cartesian=T]
ID year V1 V2
1: 1 1 1 4
2: 1 2 1 4
3: 1 3 1 4
4: 1 4 1 4
5: 1 5 1 4
6: 2 1 2 5
7: 2 2 2 5
8: 2 3 2 5
9: 2 4 2 5
10: 2 5 2 5
11: 3 1 3 6
12: 3 2 3 6
13: 3 3 3 6
14: 3 4 3 6
15: 3 5 3 6
Any ideas on how to populate columns V1 and V2 with zeros for year!=1 would be much appreciated. I also need to avoid spelling out the V1 and V2 column names as the actual data table I'm working with has 58 columns.
I got an error with that last step (note that on normally takes a quoted column name, as in on = "ID"), but if you have a more recent version of data.table that behaves differently then by all means just:
dt[year != 1, V1 := 0] # logical condition in the 'i' position
dt[year != 1, V2 := 0] # data.table assign in the 'j' position
Ooops. Didn't read to the end. Will see if I can test a range of columns.
A range of column names can be built as a character vector on the LHS of data.table's assignment operator (:=); dt2 below holds the exploded table:
> dt2[year != 1, paste0("V", 1:2) := 0 ]
> dt2
ID V1 V2 year
1: 1 1 4 1
2: 1 0 0 2
3: 1 0 0 3
4: 1 0 0 4
5: 1 0 0 5
6: 2 2 5 1
7: 2 0 0 2
8: 2 0 0 3
9: 2 0 0 4
10: 2 0 0 5
11: 3 3 6 1
12: 3 0 0 2
13: 3 0 0 3
14: 3 0 0 4
15: 3 0 0 5
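Since the real table has 58 columns, the column names can also be derived rather than typed or pasted together. The following is a minimal sketch, assuming every column other than ID and year is a value column; `long` and `value_cols` are illustrative names, and the cross join with CJ() is just one way to build the exploded table.

```r
library(data.table)

dt <- data.table(ID = c(1, 2, 3), V1 = c(1, 2, 3), V2 = c(4, 5, 6))

# Cross join every ID with years 1 to 5, then pull in the original values;
# at this point every year still carries the year-1 values.
long <- dt[CJ(ID = dt$ID, year = 1:5), on = "ID"]

# Assumption: everything except the key columns is a value column.
value_cols <- setdiff(names(long), c("ID", "year"))

# The parentheses make := treat value_cols as a vector of column names.
long[year != 1, (value_cols) := 0]
```

The single 0 on the right-hand side is recycled across all selected columns and rows, so the same line works regardless of how many value columns there are.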
You can use tidyr::complete -
library(dplyr)
library(tidyr)
dt %>%
mutate(year = 1) %>%
complete(ID, year = 1:5, fill = list(V1 = 0, V2 = 0))
# ID year V1 V2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 4
# 2 1 2 0 0
# 3 1 3 0 0
# 4 1 4 0 0
# 5 1 5 0 0
# 6 2 1 2 5
# 7 2 2 0 0
# 8 2 3 0 0
# 9 2 4 0 0
#10 2 5 0 0
#11 3 1 3 6
#12 3 2 0 0
#13 3 3 0 0
#14 3 4 0 0
#15 3 5 0 0

Removing the unordered pairs repeated twice in a file in R

I have a file like this in R.
**0 1**
0 2
**0 3**
0 4
0 5
0 6
0 7
0 8
0 9
0 10
**1 0**
1 11
1 12
1 13
1 14
1 15
1 16
1 17
1 18
1 19
**3 0**
As we can see, some unordered pairs occur twice (the marked rows), like
1 0
and
0 1
I wish to remove these duplicated pairs. I also want to count how many times each pair occurs and append the count to the row that is kept. If a pair is not repeated, then 1 should be written in the third column.
For example ( A sample of the output file )
0 1 2
0 2 1
0 3 2
0 4 1
0 5 1
0 6 1
0 7 1
0 8 1
0 9 1
0 10 1
1 11 1
1 12 1
1 13 1
1 14 1
1 15 1
1 16 1
1 17 1
1 18 1
1 19 1
How can I achieve it in R?
Here is a way using transform, pmin and pmax to reorder the data by row, and then aggregate to provide a count:
# data
x <- data.frame(a=c(rep(0,10),rep(1,10),3),b=c(1:10,0,11:19,0))
#logic
aggregate(count~a+b,transform(x,a=pmin(a,b), b=pmax(a,b), count=1),sum)
a b count
1 0 1 2
2 0 2 1
3 0 3 2
4 0 4 1
5 0 5 1
6 0 6 1
7 0 7 1
8 0 8 1
9 0 9 1
10 0 10 1
11 1 11 1
12 1 12 1
13 1 13 1
14 1 14 1
15 1 15 1
16 1 16 1
17 1 17 1
18 1 18 1
19 1 19 1
Here's one approach:
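First, sort the two values within each row and paste them together, so that both orderings of a pair give the same key (mydf is the two-column data frame with columns V1 and V2).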
x <- apply(mydf, 1, function(x) paste(sort(x), collapse = " "))
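Then, use ave to create the counts you are looking for. Passing seq_along(x) as the first argument keeps the counts integer; ave(x, x, FUN = length) would return them as character strings because x is character.
mydf$count <- ave(seq_along(x), x, FUN = length)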
Finally, you can use the "x" vector again, this time to detect and remove duplicated values.
mydf[!duplicated(x), ]
# V1 V2 count
# 1 0 1 2
# 2 0 2 1
# 3 0 3 2
# 4 0 4 1
# 5 0 5 1
# 6 0 6 1
# 7 0 7 1
# 8 0 8 1
# 9 0 9 1
# 10 0 10 1
# 12 1 11 1
# 13 1 12 1
# 14 1 13 1
# 15 1 14 1
# 16 1 15 1
# 17 1 16 1
# 18 1 17 1
# 19 1 18 1
# 20 1 19 1
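The two answers can be combined into one self-contained base R sketch: pmin and pmax build an order-independent key, ave counts the rows per key, and duplicated drops the repeats. The data frame is rebuilt here from the values in the question; `key` and `result` are illustrative names.

```r
mydf <- data.frame(V1 = c(rep(0, 10), rep(1, 10), 3),
                   V2 = c(1:10, 0, 11:19, 0))

# Smaller value first, larger second, so "1 0" and "0 1" share the key "0 1".
key <- paste(pmin(mydf$V1, mydf$V2), pmax(mydf$V1, mydf$V2))

# Number of rows sharing each key, returned as integers.
mydf$count <- ave(seq_along(key), key, FUN = length)

# Keep only the first occurrence of each unordered pair.
result <- mydf[!duplicated(key), ]
```

Unlike the aggregate version, this keeps the rows in their original order and with their original orientation, which may or may not matter for the output file.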
