How to create a column based on conditions with rows - r

I have the following problem:
library(tibble)
Shared_ID <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
Individual_ID <- c(11,12,13,21,22,23,31,32,33,41,42,43,51,52,53)
Individual_Con <- c(1,2,3,1,1,1,2,2,2,3,3,3,3,2,1)
table <- tibble(Shared_ID, Individual_ID, Individual_Con)  # note: this name masks base::table
table
What I'm looking for is a way to make a new column called Shared_Con that, for each Shared_ID, shows a number based on the following:
Individual_Con == 1 ~ 1
Individual_Con == 2 ~ 2
Individual_Con == 3 ~ 3
any combination of Individual_Con ~ 4
For me this means that if all the Individual_Con within a Shared_ID are equal (i.e. all 1), then Shared_Con will be 1; and in the last case, if there are at least 2 different Individual_Con per Shared_ID, then Shared_Con will be 4.
This is my desired result:
# A tibble: 15 x 4
Shared_ID Individual_ID Individual_Con Shared_Con
<dbl> <dbl> <dbl> <dbl>
1 1 11 1 4
2 1 12 2 4
3 1 13 3 4
4 2 21 1 1
5 2 22 1 1
6 2 23 1 1
7 3 31 2 2
8 3 32 2 2
9 3 33 2 2
10 4 41 3 3
11 4 42 3 3
12 4 43 3 3
13 5 51 3 4
14 5 52 2 4
15 5 53 1 4
How can I make this easily? Thanks in advance for any help!

We can group by 'Shared_ID' and check whether the number of distinct elements in 'Individual_Con' is greater than 1; if so, return 4, otherwise return the Individual_Con.
library(dplyr)
table %>%
  group_by(Shared_ID) %>%
  mutate(Shared_Con = if (n_distinct(Individual_Con) > 1) 4 else Individual_Con)
# A tibble: 15 x 4
# Groups: Shared_ID [5]
# Shared_ID Individual_ID Individual_Con Shared_Con
# <dbl> <dbl> <dbl> <dbl>
# 1 1 11 1 4
# 2 1 12 2 4
# 3 1 13 3 4
# 4 2 21 1 1
# 5 2 22 1 1
# 6 2 23 1 1
# 7 3 31 2 2
# 8 3 32 2 2
# 9 3 33 2 2
#10 4 41 3 3
#11 4 42 3 3
#12 4 43 3 3
#13 5 51 3 4
#14 5 52 2 4
#15 5 53 1 4
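Since the question is already written in case_when() notation, here is a minimal alternative sketch using dplyr::case_when() (not from the original answer); it relies on the fact that when all values in a group are equal, the first one is the shared value:
library(dplyr)
table %>%
  group_by(Shared_ID) %>%
  mutate(Shared_Con = case_when(
    n_distinct(Individual_Con) > 1 ~ 4,   # mixed group
    TRUE ~ first(Individual_Con)          # homogeneous group: keep its value
  )) %>%
  ungroup()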

Exclude rows where value used in another row

Imagine you have the following data set:
df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
                 gender = c(1,2,1,2,2,2,2,1,1,2,1,2,1,2,2,2,2,1,1,2),
                 PID = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10))
How can I write code that removes the rows in df whose combination of gender and PID occurs more than once (see picture)? Please imagine that the data frame is over 1000 rows long (so the solution should automatically find the right rows to exclude).
base R
df[ave(rep(TRUE, nrow(df)), df[, c("gender", "PID")], FUN = function(z) !any(duplicated(z))), ]
# ID gender PID
# 1 1 1 1
# 2 2 2 1
# 3 3 1 2
# 4 4 2 2
# 7 7 2 4
# 8 8 1 4
# 9 9 1 5
# 10 10 2 5
# 11 11 1 6
# 12 12 2 6
# 13 13 1 7
# 14 14 2 7
# 17 17 2 9
# 18 18 1 9
# 19 19 1 10
# 20 20 2 10
dplyr
library(dplyr)
df %>%
  group_by(gender, PID) %>%
  filter(!any(duplicated(cbind(gender, PID)))) %>%
  ungroup()
In base R, we may use subset to keep only the observations where the group count for 'gender' and 'PID' is 1:
subset(df, ave(seq_along(gender), gender, PID, FUN = length) == 1)
Or with duplicated
df[!(duplicated(df[-1]) | duplicated(df[-1], fromLast = TRUE)), ]
Output:
ID gender PID
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
7 7 2 4
8 8 1 4
9 9 1 5
10 10 2 5
11 11 1 6
12 12 2 6
13 13 1 7
14 14 2 7
17 17 2 9
18 18 1 9
19 19 1 10
20 20 2 10
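As an aside, the duplicated() trick works because scanning from both ends flags every member of a duplicated pair, not just the later copies. A tiny illustration (hypothetical vector, not from the question's data):
x <- c(1, 2, 2, 3)
duplicated(x)                                   # FALSE FALSE  TRUE FALSE (first copy unflagged)
duplicated(x) | duplicated(x, fromLast = TRUE)  # FALSE  TRUE  TRUE FALSE (both copies flagged)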
Here is one more :-)
library(dplyr)
df %>%
  group_by(gender, PID) %>%
  filter(is.na(ifelse(n() > 1, 1, NA)))  # NA (i.e. keep) only when the group has a single row
ID gender PID
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
5 7 2 4
6 8 1 4
7 9 1 5
8 10 2 5
9 11 1 6
10 12 2 6
11 13 1 7
12 14 2 7
13 17 2 9
14 18 1 9
15 19 1 10
16 20 2 10
Another dplyr option could be:
df %>%
  filter(with(rle(paste0(gender, PID)), rep(lengths == 1, lengths)))
ID gender PID
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
5 7 2 4
6 8 1 4
7 9 1 5
8 10 2 5
9 11 1 6
10 12 2 6
11 13 1 7
12 14 2 7
13 17 2 9
14 18 1 9
15 19 1 10
16 20 2 10
If the duplicated values can occur also between non-consecutive rows:
df %>%
  arrange(gender, PID) %>%
  filter(with(rle(paste0(gender, PID)), rep(lengths == 1, lengths)))
Using aggregate
na.omit(aggregate(. ~ gender + PID, df, function(x)
  ifelse(length(x) == 1, x, NA)))
gender PID ID
1 1 1 1
2 2 1 2
3 1 2 3
4 2 2 4
6 1 4 8
7 2 4 7
8 1 5 9
9 2 5 10
10 1 6 11
11 2 6 12
12 1 7 13
13 2 7 14
15 1 9 18
16 2 9 17
17 1 10 19
18 2 10 20
With dplyr
library(dplyr)
df %>%
  group_by(gender, PID) %>%
  filter(n() == 1) %>%
  ungroup()
# A tibble: 16 × 3
ID gender PID
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
5 7 2 4
6 8 1 4
7 9 1 5
8 10 2 5
9 11 1 6
10 12 2 6
11 13 1 7
12 14 2 7
13 17 2 9
14 18 1 9
15 19 1 10
16 20 2 10
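If data.table is available, the same group-size filter is a one-liner. A hedged sketch (not one of the original answers), assuming the df defined above; note the grouping columns come first in the result:
library(data.table)
as.data.table(df)[, .SD[.N == 1], by = .(gender, PID)]  # .N is the group size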

Add rows to dataframe in R based on values in column

I have a dataframe with 2 columns: time and day. There are 3 days, and for each day, time runs from 1 to 12. I want to add new rows for each day with times -2, 1 and 0. How do I do this?
I have tried using add_row and specifying the row number to add to, but this changes each time a new row is added, making the process tedious. Thanks in advance.
We could use add_row, then slice the desired sequence, and bind all to a dataframe:
library(tibble)
library(dplyr)
df1 <- df %>%
  add_row(time = -2:0, Day = c(1, 1, 1), .before = 1) %>%
  slice(1:15)
df2 <- bind_rows(df1, df1, df1) %>%
  mutate(Day = rep(row_number(), each = 15, length.out = n()))
Output:
# A tibble: 45 x 2
time Day
<dbl> <int>
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
Here's a fast way to create the desired dataframe from scratch using expand.grid(), rather than adding individual rows:
df <- expand.grid(-2:12, 1:3)
colnames(df) <- c("time", "day")
Results:
df
time day
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
You can use tidyr::crossing (here the original data frame is assumed to be named day):
library(dplyr)
library(tidyr)
add_values <- c(-2, 1, 0)
crossing(time = add_values, Day = unique(day$Day)) %>%
  bind_rows(day) %>%
  arrange(Day, time)
# A tibble: 45 x 2
# time Day
# <dbl> <int>
# 1 -2 1
# 2 0 1
# 3 1 1
# 4 1 1
# 5 2 1
# 6 3 1
# 7 4 1
# 8 5 1
# 9 6 1
#10 7 1
# … with 35 more rows
If you meant -2, -1 and 0, you can also use complete:
tidyr::complete(day, Day, time = -2:0)
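For context, a minimal sketch of the complete() call in a pipe, again assuming the data frame is named day: complete() keeps all existing rows and only adds the missing (Day, time) combinations.
library(dplyr)
library(tidyr)
day %>%
  complete(Day, time = -2:0) %>%  # adds times -2, -1, 0 for every Day
  arrange(Day, time)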

How to define rows numbering depending on a group and a value in group's first rows?

A dataframe DD has some missing rows. Based on the values in the 'ID_raw' column I have duplicated the rows in order to replace the missing rows. Now I have to number the rows in such a way that the first value in each group (column 'File') equals the value in the same row of the 'ID_raw' column. This will be a key for joining the dataframe with another one. Below is a dummy example of the DD dataframe:
DD <- data.frame(ID_raw = c(1,5,7,8,5,7,9,13,3,6),
                 Val = c(1,2,8,15,54,23,88,77,32,2),
                 File = c("A","A","A","A","B","B","B","B","C","C"))
ID_raw Val File
1 1 1 A
2 5 2 A
3 7 8 A
4 8 15 A
5 5 54 B
6 7 23 B
7 9 88 B
8 13 77 B
9 3 32 C
10 6 2 C
So far I've successfully duplicated the rows; however, I have a problem numbering the rows in such a way that they start from the same value as the ID_raw column for each group ('File').
DD$ID_diff <- 0
DD$ID_diff[1:(nrow(DD) - 1)] <- as.integer(diff(DD$ID_raw, 1))  # how many times a row has to be duplicated
DD$ID_diff <- sapply(DD$ID_diff, function(x) ifelse(x < 0, 0, x))  # replace the values < 0 (at the 'File' group boundaries)
DD <- DD[rep(seq(nrow(DD)), DD$ID_diff), 1:ncol(DD)]  # row duplication
Based on the code above I receive this output:
ID_raw Val File ID_diff
1 1 1 A 4
1.1 1 1 A 4
1.2 1 1 A 4
1.3 1 1 A 4
2 5 2 A 2
2.1 5 2 A 2
3 7 8 A 1
5 5 54 B 2
5.1 5 54 B 2
6 7 23 B 2
6.1 7 23 B 2
7 9 88 B 4
7.1 9 88 B 4
7.2 9 88 B 4
7.3 9 88 B 4
9 3 32 C 3
9.1 3 32 C 3
9.2 3 32 C 3
I would like to receive this:
ID_raw Val File ID_diff ID_new
1 1 1 A 4 1
1.1 1 1 A 4 2
1.2 1 1 A 4 3
1.3 1 1 A 4 4
2 5 2 A 2 5
2.1 5 2 A 2 6
3 7 8 A 1 7
5 5 54 B 2 5
5.1 5 54 B 2 6
6 7 23 B 2 7
6.1 7 23 B 2 8
7 9 88 B 4 9
7.1 9 88 B 4 10
7.2 9 88 B 4 11
7.3 9 88 B 4 12
9 3 32 C 3 3
9.1 3 32 C 3 4
9.2 3 32 C 3 5
This is one option using dplyr based on the output of your code:
library(dplyr)
DD %>%
  group_by(File) %>%
  mutate(ID_new = seq(1, n()) + first(ID_raw) - 1)
# A tibble: 18 x 5
# Groups: File [3]
ID_raw Val File ID_diff ID_new
<int> <int> <fct> <int> <dbl>
1 1 1 A 4 1
2 1 1 A 4 2
3 1 1 A 4 3
4 1 1 A 4 4
5 5 2 A 2 5
6 5 2 A 2 6
7 7 8 A 1 7
8 5 54 B 2 5
9 5 54 B 2 6
10 7 23 B 2 7
11 7 23 B 2 8
12 9 88 B 4 9
13 9 88 B 4 10
14 9 88 B 4 11
15 9 88 B 4 12
16 3 32 C 3 3
17 3 32 C 3 4
18 3 32 C 3 5
We can do this in the chain from the beginning itself, i.e. instead of creating 'ID_diff' and using sapply, directly use diff on 'ID_raw', then uncount, and, grouped by 'File', create the sequence column:
library(tidyverse)
DD %>%
  mutate(ID_diff = pmax(c(diff(ID_raw), 0), 0)) %>%
  uncount(ID_diff, .remove = FALSE) %>%
  group_by(File) %>%
  mutate(ID_new = seq(first(ID_raw), length.out = n(), by = 1))
# A tibble: 18 x 5
# Groups: File [3]
# ID_raw Val File ID_diff ID_new
# <dbl> <dbl> <fct> <dbl> <dbl>
# 1 1 1 A 4 1
# 2 1 1 A 4 2
# 3 1 1 A 4 3
# 4 1 1 A 4 4
# 5 5 2 A 2 5
# 6 5 2 A 2 6
# 7 7 8 A 1 7
# 8 5 54 B 2 5
# 9 5 54 B 2 6
#10 7 23 B 2 7
#11 7 23 B 2 8
#12 9 88 B 4 9
#13 9 88 B 4 10
#14 9 88 B 4 11
#15 9 88 B 4 12
#16 3 32 C 3 3
#17 3 32 C 3 4
#18 3 32 C 3 5
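For reference, a hedged base R equivalent of the numbering step, assuming DD already holds the duplicated rows produced by the question's code:
DD$ID_new <- ave(DD$ID_raw, DD$File,
                 FUN = function(x) seq_along(x) + x[1] - 1)  # per File: 1..n shifted to start at the first ID_raw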

R: Separate data into combinations of two columns

I have some data where each id is measured by different types, each of which can have different values type_val. The measured value is val. A small dummy data set looks like this:
df <- data.frame(id = rep(letters[1:2], 6),
                 type = c(rep('t1', 6), rep('t2', 6)),
                 type_val = rep(c(1, 1, 2, 2, 3, 3), 2),
                 val = 1:12)
Then df is:
id type type_val val
1 a t1 1 1
2 b t1 1 2
3 a t1 2 3
4 b t1 2 4
5 a t1 3 5
6 b t1 3 6
7 a t2 1 7
8 b t2 1 8
9 a t2 2 9
10 b t2 2 10
11 a t2 3 11
12 b t2 3 12
I need to spread/cast the data so that all combinations of type and type_val for each id are row-wise. I think this must be a job for the packages reshape2 or tidyr, but I have completely failed to generate anything other than errors.
The outcome data structure - somewhat redundant - would be something like this (hope I got it right!), where pairs of type (as given by combinations of type_val) are columns type_t1 and type_t2, and their associated values (val in df) are val_t1 and val_t2 - column names are of course arbitrary:
id type_t1 type_t2 val_t1 val_t2
1 a 1 1 1 7
2 a 1 2 1 9
3 a 1 3 1 11
4 a 2 1 3 7
5 a 2 2 3 9
6 a 2 3 3 11
7 a 3 1 5 7
8 a 3 2 5 9
9 a 3 3 5 11
10 b 1 1 2 8
11 b 1 2 2 10
12 b 1 3 2 12
13 b 2 1 4 8
14 b 2 2 4 10
15 b 2 3 4 12
16 b 3 1 6 8
17 b 3 2 6 10
18 b 3 3 6 12
UPDATE
Note that (@Sotos)
> spread(df, type, val)
id type_val t1 t2
1 a 1 1 7
2 a 2 3 9
3 a 3 5 11
4 b 1 2 8
5 b 2 4 10
6 b 3 6 12
is not the desired output - it fails to deliver the wide format defined by combinations of type and type_val in df.
How about this:
df1 <- df[df$type == "t1", ]
df2 <- df[df$type == "t2", ]
DF <- merge(df1, df2, by = "id")
DF <- DF[, -c(2, 5)]  # drop the two constant 'type' columns
colnames(DF) <- c("id", "type_t1", "val_t1", "type_t2", "val_t2")
Here is something more generic that will work with an arbitrary number of unique type:
library(dplyr)
# This function takes a list of dataframes (.data) and merges them by ID
reduce_merge <- function(.data, ID) {
  return(Reduce(function(x, y) merge(x, y, by = ID), .data))
}
# This function renames the cols columns in .data by appending _identifier
batch_rename <- function(.data, cols, identifier, sep = '_') {
  return(plyr::rename(.data, sapply(cols, function(x) {
    x = paste(x, .data[1, identifier], sep = sep)
  })))
}
# This function creates a list of subsetted dataframes
# (subsetted by values of key),
# uses batch_rename() to give each dataframe more informative column names,
# merges them together, and returns the columns you'd like in a sensible order
multi_spread <- function(.data, grp, key, vals) {
  .data %>%
    plyr::dlply(key, subset) %>%
    lapply(batch_rename, vals, key) %>%
    reduce_merge(grp) %>%
    select(-starts_with(paste0(key, '.'))) %>%
    select(id, sort(setdiff(colnames(.), c(grp, key, vals))))
}
# Your example
df <- data.frame(id = rep(letters[1:2], 6),
                 type = c(rep('t1', 6), rep('t2', 6)),
                 type_val = rep(c(1, 1, 2, 2, 3, 3), 2),
                 val = 1:12)
df %>% multi_spread('id', 'type', c('type_val', 'val'))
id type_val_t1 type_val_t2 val_t1 val_t2
1 a 1 1 1 7
2 a 1 2 1 9
3 a 1 3 1 11
4 a 2 1 3 7
5 a 2 2 3 9
6 a 2 3 3 11
7 a 3 1 5 7
8 a 3 2 5 9
9 a 3 3 5 11
10 b 1 1 2 8
11 b 1 2 2 10
12 b 1 3 2 12
13 b 2 1 4 8
14 b 2 2 4 10
15 b 2 3 4 12
16 b 3 1 6 8
17 b 3 2 6 10
18 b 3 3 6 12
# An example with three unique values of 'type'
df <- data.frame(id = rep(letters[1:2], 9),
                 type = c(rep('t1', 6), rep('t2', 6), rep('t3', 6)),
                 type_val = rep(c(1, 1, 2, 2, 3, 3), 3),
                 val = 1:18)
df %>% multi_spread('id', 'type', c('type_val', 'val'))
id type_val_t1 type_val_t2 type_val_t3 val_t1 val_t2 val_t3
1 a 1 1 1 1 7 13
2 a 1 1 2 1 7 15
3 a 1 1 3 1 7 17
4 a 1 2 1 1 9 13
5 a 1 2 2 1 9 15
6 a 1 2 3 1 9 17
7 a 1 3 1 1 11 13
8 a 1 3 2 1 11 15
9 a 1 3 3 1 11 17
10 a 2 1 1 3 7 13
11 a 2 1 2 3 7 15
12 a 2 1 3 3 7 17
13 a 2 2 1 3 9 13
14 a 2 2 2 3 9 15
15 a 2 2 3 3 9 17
16 a 2 3 1 3 11 13
17 a 2 3 2 3 11 15
18 a 2 3 3 3 11 17
19 a 3 1 1 5 7 13
20 a 3 1 2 5 7 15
21 a 3 1 3 5 7 17
22 a 3 2 1 5 9 13
23 a 3 2 2 5 9 15
24 a 3 2 3 5 9 17
25 a 3 3 1 5 11 13
26 a 3 3 2 5 11 15
27 a 3 3 3 5 11 17
28 b 1 1 1 2 8 14
29 b 1 1 2 2 8 16
30 b 1 1 3 2 8 18
31 b 1 2 1 2 10 14
32 b 1 2 2 2 10 16
33 b 1 2 3 2 10 18
34 b 1 3 1 2 12 14
35 b 1 3 2 2 12 16
36 b 1 3 3 2 12 18
37 b 2 1 1 4 8 14
38 b 2 1 2 4 8 16
39 b 2 1 3 4 8 18
40 b 2 2 1 4 10 14
41 b 2 2 2 4 10 16
42 b 2 2 3 4 10 18
43 b 2 3 1 4 12 14
44 b 2 3 2 4 12 16
45 b 2 3 3 4 12 18
46 b 3 1 1 6 8 14
47 b 3 1 2 6 8 16
48 b 3 1 3 6 8 18
49 b 3 2 1 6 10 14
50 b 3 2 2 6 10 16
51 b 3 2 3 6 10 18
52 b 3 3 1 6 12 14
53 b 3 3 2 6 12 16
54 b 3 3 3 6 12 18
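Since the question asks about tidyr/reshape2, a hedged modern alternative for the original two-type df is a self-join on id; this is a sketch, not one of the original answers (the relationship argument needs dplyr >= 1.1.0):
library(dplyr)
t1 <- df %>% filter(type == "t1") %>% select(id, type_t1 = type_val, val_t1 = val)
t2 <- df %>% filter(type == "t2") %>% select(id, type_t2 = type_val, val_t2 = val)
inner_join(t1, t2, by = "id", relationship = "many-to-many") %>%  # cartesian product within each id
  arrange(id, type_t1, type_t2)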

Data Frame Filter Values

Suppose I have the following data frame:
table <- data.frame(group = c(0,5,10,15,20,25,30,35,40, 0,5,10,15,20,25,30,35,40, 0,5,10,15,20,25,30,35,40),
                    plan = c(1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,3),
                    price = c(1,4,5,6,8,9,12,12,12, 3,5,6,7,10,12,20,20,20, 5,6,8,12,15,20,22,28,28))
group plan price
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
8 35 1 12
9 40 1 12
10 0 2 3
11 5 2 5
12 10 2 6
13 15 2 7
14 20 2 10
15 25 2 12
16 30 2 20
17 35 2 20
18 40 2 20
How can I get, for each plan, the values from the table up to the first occurrence of the maximum price, without duplicates?
So the result would be:
group plan price
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
10 0 2 3
11 5 2 5
12 10 2 6
13 15 2 7
14 20 2 10
15 25 2 12
16 30 2 20
You can use slice in dplyr:
library(dplyr)
table %>%
  group_by(plan) %>%
  slice(1:which.max(price == max(price)))
which.max gives the index of the first occurrence of price == max(price). Using that, I can slice the data.frame to only keep rows for each plan up to the maximum price.
Result:
# A tibble: 22 x 3
# Groups: plan [3]
group plan price
<dbl> <dbl> <dbl>
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
8 0 2 3
9 5 2 5
10 10 2 6
# ... with 12 more rows
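For comparison, a hedged base R equivalent of the same slice logic, using split() and lapply():
do.call(rbind, lapply(split(table, table$plan), function(d)
  d[1:which.max(d$price == max(d$price)), ]))  # per plan: rows up to the first maximum price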
