How to number rows depending on a group and the value in the group's first row? - r

A dataframe DD has some missing rows. Based on the values in the 'ID_raw' column, I have duplicated rows to fill in for the missing ones. Now I have to number the rows in such a way that the first value in each group (column 'File') equals the value of 'ID_raw' in that same row. This will be the key for joining the dataframe with another one. Below is a dummy example of the DD dataframe:
DD<-data.frame(ID_raw=c(1,5,7,8,5,7,9,13,3,6),Val=c(1,2,8,15,54,23,88,77,32,2),File=c("A","A","A","A","B","B","B","B","C","C"))
ID_raw Val File
1 1 1 A
2 5 2 A
3 7 8 A
4 8 15 A
5 5 54 B
6 7 23 B
7 9 88 B
8 13 77 B
9 3 32 C
10 6 2 C
So far I've successfully duplicated the rows. However, I can't work out how to number them so that, within each group ('File'), the numbering starts from the group's first value in the ID_raw column.
DD$ID_diff <- 0
DD$ID_diff[seq_len(nrow(DD) - 1)] <- as.integer(diff(DD$ID_raw, 1)) # how many times each row has to be duplicated
DD$ID_diff <- pmax(DD$ID_diff, 0) # replace negative values (at the last row of each 'File' group) with 0
DD <- DD[rep(seq(nrow(DD)), DD$ID_diff), 1:ncol(DD)] #rows duplication
Based on the code above I receive this output:
ID_raw Val File ID_diff
1 1 1 A 4
1.1 1 1 A 4
1.2 1 1 A 4
1.3 1 1 A 4
2 5 2 A 2
2.1 5 2 A 2
3 7 8 A 1
5 5 54 B 2
5.1 5 54 B 2
6 7 23 B 2
6.1 7 23 B 2
7 9 88 B 4
7.1 9 88 B 4
7.2 9 88 B 4
7.3 9 88 B 4
9 3 32 C 3
9.1 3 32 C 3
9.2 3 32 C 3
I would like to receive this:
ID_raw Val File ID_diff ID_new
1 1 1 A 4 1
1.1 1 1 A 4 2
1.2 1 1 A 4 3
1.3 1 1 A 4 4
2 5 2 A 2 5
2.1 5 2 A 2 6
3 7 8 A 1 7
5 5 54 B 2 5
5.1 5 54 B 2 6
6 7 23 B 2 7
6.1 7 23 B 2 8
7 9 88 B 4 9
7.1 9 88 B 4 10
7.2 9 88 B 4 11
7.3 9 88 B 4 12
9 3 32 C 3 3
9.1 3 32 C 3 4
9.2 3 32 C 3 5

This is one option using dplyr, applied to the output of your code (DD after the row duplication):
library(dplyr)
DD %>%
group_by(File) %>%
mutate(ID_new = seq(1, n()) + first(ID_raw) - 1)
# A tibble: 18 x 5
# Groups: File [3]
ID_raw Val File ID_diff ID_new
<int> <int> <fct> <int> <dbl>
1 1 1 A 4 1
2 1 1 A 4 2
3 1 1 A 4 3
4 1 1 A 4 4
5 5 2 A 2 5
6 5 2 A 2 6
7 7 8 A 1 7
8 5 54 B 2 5
9 5 54 B 2 6
10 7 23 B 2 7
11 7 23 B 2 8
12 9 88 B 4 9
13 9 88 B 4 10
14 9 88 B 4 11
15 9 88 B 4 12
16 3 32 C 3 3
17 3 32 C 3 4
18 3 32 C 3 5

We can do this in a single chain from the beginning: instead of creating 'ID_diff' and using sapply, apply diff directly to 'ID_raw', expand the rows with uncount, then group by 'File' and create the sequence column.
library(tidyverse)
DD %>%
mutate(ID_diff = pmax(c(diff(ID_raw), 0), 0)) %>%
uncount(ID_diff, .remove = FALSE) %>%
group_by(File) %>%
mutate(ID_new = seq(first(ID_raw), length.out = n(), by = 1))
# A tibble: 18 x 5
# Groups: File [3]
# ID_raw Val File ID_diff ID_new
# <dbl> <dbl> <fct> <dbl> <dbl>
# 1 1 1 A 4 1
# 2 1 1 A 4 2
# 3 1 1 A 4 3
# 4 1 1 A 4 4
# 5 5 2 A 2 5
# 6 5 2 A 2 6
# 7 7 8 A 1 7
# 8 5 54 B 2 5
# 9 5 54 B 2 6
#10 7 23 B 2 7
#11 7 23 B 2 8
#12 9 88 B 4 9
#13 9 88 B 4 10
#14 9 88 B 4 11
#15 9 88 B 4 12
#16 3 32 C 3 3
#17 3 32 C 3 4
#18 3 32 C 3 5
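For reference, the same numbering can be sketched in base R. This is only a sketch under assumptions: the name `expanded` and the per-group diff are illustrative choices, not part of the answers above, and ID_raw is assumed to be increasing within each 'File'. Computing the differences within each group avoids relying on the ungrouped diff, where the last row of one group is compared with the first row of the next.

```r
DD <- data.frame(ID_raw = c(1, 5, 7, 8, 5, 7, 9, 13, 3, 6),
                 Val    = c(1, 2, 8, 15, 54, 23, 88, 77, 32, 2),
                 File   = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"))

# Gap to the next ID within the same File; the last row of each group gets 0
DD$ID_diff <- ave(DD$ID_raw, DD$File, FUN = function(x) c(diff(x), 0))

# Repeat each row ID_diff times (rows with 0 are dropped, as in the question)
expanded <- DD[rep(seq_len(nrow(DD)), DD$ID_diff), ]
rownames(expanded) <- NULL

# Number rows consecutively, starting at the group's first ID_raw
expanded$ID_new <- ave(expanded$ID_raw, expanded$File,
                       FUN = function(x) x[1] + seq_along(x) - 1)
expanded
```

On the example data this should give the same 18 rows as the dplyr versions, without loading any package.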

Related

R: How to split a row in a dataframe into a number of rows, conditional on a value in a cell?

I have a data.frame which looks like the following:
id <- c("a","a","a","a","b","b","b","b")
age_from <- c(0,2,3,7,0,1,2,6)
age_to <- c(2,3,7,10,1,2,6,10)
y <- c(100,150,100,250,300,200,100,150)
df <- data.frame(id,age_from,age_to,y)
df$years <- df$age_to - df$age_from
Which gives a df that looks like:
id age_from age_to y years
1 a 0 2 100 2
2 a 2 3 150 1
3 a 3 7 100 4
4 a 7 10 250 3
5 b 0 1 300 1
6 b 1 2 200 1
7 b 2 6 100 4
8 b 6 10 150 4
Instead of having an unequal number of years per row, I would like to have 20 rows, 10 for each id, with each row accounting for one year. This would also involve averaging the y column across the number of years listed in the years column.
I believe this may have to be done using a loop 1:n with the n equaling a value in the years column. Although I am not sure how to start with this.
You can use rep to repeat the rows by the number of given years.
x <- df[rep(seq_len(nrow(df)), df$years),]
x
# id age_from age_to y years
#1 a 0 2 50.00000 2
#1.1 a 0 2 50.00000 2
#2 a 2 3 150.00000 1
#3 a 3 7 25.00000 4
#3.1 a 3 7 25.00000 4
#3.2 a 3 7 25.00000 4
#3.3 a 3 7 25.00000 4
#4 a 7 10 83.33333 3
#4.1 a 7 10 83.33333 3
#4.2 a 7 10 83.33333 3
#5 b 0 1 300.00000 1
#6 b 1 2 200.00000 1
#7 b 2 6 25.00000 4
#7.1 b 2 6 25.00000 4
#7.2 b 2 6 25.00000 4
#7.3 b 2 6 25.00000 4
#8 b 6 10 37.50000 4
#8.1 b 6 10 37.50000 4
#8.2 b 6 10 37.50000 4
#8.3 b 6 10 37.50000 4
If by averaging the y column across the number of years you mean dividing y by the number of years (the y values printed above already reflect this step):
x$y <- x$y / x$years
In case age_from should go from 0 to 9 and age_to from 1 to 10 for each id:
x$age_from <- x$age_from + ave(x$age_from, x$id, x$age_from, FUN=seq_along) - 1
#x$age_from <- ave(x$age_from, x$id, FUN=seq_along) - 1 #Alternative
x$age_to <- x$age_from + 1
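Putting the base R steps together in execution order may make the flow easier to follow. The sketch below rebuilds the example data and adds a quick check, which is an addition for illustration only, that the per-id total of y is unchanged by the split.

```r
df <- data.frame(id       = rep(c("a", "b"), each = 4),
                 age_from = c(0, 2, 3, 7, 0, 1, 2, 6),
                 age_to   = c(2, 3, 7, 10, 1, 2, 6, 10),
                 y        = c(100, 150, 100, 250, 300, 200, 100, 150))
df$years <- df$age_to - df$age_from

# 1. Repeat every row once per year it covers
x <- df[rep(seq_len(nrow(df)), df$years), ]

# 2. Spread y evenly over those years
x$y <- x$y / x$years

# 3. Turn each repeated interval into consecutive one-year intervals
x$age_from <- x$age_from + ave(x$age_from, x$id, x$age_from, FUN = seq_along) - 1
x$age_to   <- x$age_from + 1
rownames(x) <- NULL

# Sanity check: totals per id match the original data
tapply(x$y, x$id, sum)
tapply(df$y, df$id, sum)
```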
Here is a solution with tidyr and dplyr.
First, we complete age_from from 0 to 9 for each of the existing ids.
This leaves NAs in age_to, y and years on the newly added rows, so we fill them downwards within each id, carrying the last observed value forward.
Now you can divide y by years (I assumed this is what you meant by averaging, so that the sum of y stays the same).
At that point, you only need to recalculate age_to accordingly.
Remember to ungroup at the end!
library(tidyr)
library(dplyr)
df %>%
complete(id, age_from = 0:9) %>%
group_by(id) %>%
fill(y, years, age_to) %>%
mutate(y = y/years) %>%
mutate(age_to = age_from + 1) %>%
ungroup()
# A tibble: 20 x 5
id age_from age_to y years
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 0 1 50 2
2 a 1 2 50 2
3 a 2 3 150 1
4 a 3 4 25 4
5 a 4 5 25 4
6 a 5 6 25 4
7 a 6 7 25 4
8 a 7 8 83.3 3
9 a 8 9 83.3 3
10 a 9 10 83.3 3
11 b 0 1 300 1
12 b 1 2 200 1
13 b 2 3 25 4
14 b 3 4 25 4
15 b 4 5 25 4
16 b 5 6 25 4
17 b 6 7 37.5 4
18 b 7 8 37.5 4
19 b 8 9 37.5 4
20 b 9 10 37.5 4
A tidyverse solution.
library(tidyverse)
df %>%
mutate(age_to = age_from + 1) %>%
group_by(id) %>%
complete(nesting(age_from = 0:9, age_to = 1:10)) %>%
fill(y, years) %>%
mutate(y = y / years)
# A tibble: 20 x 5
# Groups: id [2]
id age_from age_to y years
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 0 1 50 2
2 a 1 2 50 2
3 a 2 3 150 1
4 a 3 4 25 4
5 a 4 5 25 4
6 a 5 6 25 4
7 a 6 7 25 4
8 a 7 8 83.3 3
9 a 8 9 83.3 3
10 a 9 10 83.3 3
11 b 0 1 300 1
12 b 1 2 200 1
13 b 2 3 25 4
14 b 3 4 25 4
15 b 4 5 25 4
16 b 5 6 25 4
17 b 6 7 37.5 4
18 b 7 8 37.5 4
19 b 8 9 37.5 4
20 b 9 10 37.5 4

Create a function to impute values from one data frame into another

The NA values in column A should be filled by the A value from the dat data frame and so on for the other variables.
id <- factor(rep(letters[1:2], each=5))
A <- c(1,2,NA,6,8,9,0,6,7,9)
B <- c(5,6,1,9,8,1,NA,9,7,4)
C <- c(2,3,5,NA,NA,2,7,6,4,6)
D <- c(6,5,8,3,2,9,NA,2,6,8)
df <- data.frame(id, A, B,C,D)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA 1 5 8
4 a 6 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
dat <- data.frame(col=c("A","B","C","D"), value=c(23,45,26,89))
dat
col value
1 A 23
2 B 45
3 C 26
4 D 89
It should look like:
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 26 3
5 a 8 8 26 2
6 b 9 1 2 9
7 b 0 45 7 89
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
I was thinking of something like this, but I don't know how to connect those data frames in a function...
test <- function(i){
df[,i][is.na(df[,i])] <- dat$value
}
test(2)
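If you want to keep your function format, look up the value by column name (note that <<- modifies df in the global environment):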
test <- function(i){
df[,i][is.na(df[,i])] <<- dat$value[dat$col==i]
}
test("A")
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
One approach is to iterate over the columns and values and use coalesce():
library(dplyr)
library(purrr)
df[-1] <- map2_df(df[-1], dat$value, coalesce)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 26 3
5 a 8 8 26 2
6 b 9 1 2 9
7 b 0 45 7 89
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
Or the same using replace():
map2_df(df[-1], dat$value, ~ replace(.x, is.na(.x), .y))
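The map2 calls above pair columns and values by position, so they rely on the rows of dat being in the same order as the columns of df. A minimal base R sketch that matches by name instead and returns a new data frame rather than modifying a global one could look like this. The function name `impute_from_lookup` is hypothetical, and it assumes each column name appears at most once in the lookup table.

```r
# Fill NAs in each column of `data` with the value listed for that
# column name in `lookup` (columns `col` and `value`).
impute_from_lookup <- function(data, lookup) {
  for (nm in intersect(names(data), as.character(lookup$col))) {
    fill <- lookup$value[lookup$col == nm]
    data[[nm]][is.na(data[[nm]])] <- fill
  }
  data
}

df <- data.frame(id = factor(rep(letters[1:2], each = 5)),
                 A = c(1, 2, NA, 6, 8, 9, 0, 6, 7, 9),
                 B = c(5, 6, 1, 9, 8, 1, NA, 9, 7, 4),
                 C = c(2, 3, 5, NA, NA, 2, 7, 6, 4, 6),
                 D = c(6, 5, 8, 3, 2, 9, NA, 2, 6, 8))
dat <- data.frame(col = c("A", "B", "C", "D"), value = c(23, 45, 26, 89))

filled <- impute_from_lookup(df, dat)
filled
```

Columns of df that are not listed in dat (such as id) are left untouched.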

How to create a column based on conditions with rows

I have the following problem:
Shared_ID<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
Individual_ID<-c(11,12,13,21,22,23,31,32,33,41,42,43,51,52,53)
Individual_Con<-c(1,2,3,1,1,1,2,2,2,3,3,3,3,2,1)
table<-tibble(Shared_ID,Individual_ID,Individual_Con)
table
What I'm looking for is a way to make a new column called Shared_Con which, for each Shared_ID, shows a number based on the following:
Individual_Con==1 ~ 1
Individual_Con==2 ~ 2
Individual_Con==3 ~ 3
any combination of Individual_Con ~ 4
For me this means that if all the Individual_Con values within a Shared_ID are equal (e.g. all equal to 1), then Shared_Con takes that value (here 1); if there are at least 2 different Individual_Con values within a Shared_ID, then Shared_Con is 4.
This is my desired result:
# A tibble: 15 x 4
Shared_ID Individual_ID Individual_Con Shared_Con
<dbl> <dbl> <dbl> <dbl>
1 1 11 1 4
2 1 12 2 4
3 1 13 3 4
4 2 21 1 1
5 2 22 1 1
6 2 23 1 1
7 3 31 2 2
8 3 32 2 2
9 3 33 2 2
10 4 41 3 3
11 4 42 3 3
12 4 43 3 3
13 5 51 3 4
14 5 52 2 4
15 5 53 1 4
How can I make this easily? Thanks in advance for any help!
We can group by 'Shared_ID' and check whether the number of distinct elements in 'Individual_Con' is greater than 1: if so, return 4, otherwise return Individual_Con.
library(dplyr)
table %>%
group_by(Shared_ID) %>%
mutate(Shared_Con = if(n_distinct(Individual_Con) > 1) 4 else Individual_Con)
# A tibble: 15 x 4
# Groups: Shared_ID [5]
# Shared_ID Individual_ID Individual_Con Shared_Con
# <dbl> <dbl> <dbl> <dbl>
# 1 1 11 1 4
# 2 1 12 2 4
# 3 1 13 3 4
# 4 2 21 1 1
# 5 2 22 1 1
# 6 2 23 1 1
# 7 3 31 2 2
# 8 3 32 2 2
# 9 3 33 2 2
#10 4 41 3 3
#11 4 42 3 3
#12 4 43 3 3
#13 5 51 3 4
#14 5 52 2 4
#15 5 53 1 4
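The same rule can also be sketched without dplyr, using ave() to apply a function per group. This is an illustrative alternative, not the answerer's code. It works on plain vectors, which also sidesteps naming a data object `table` (that name masks the base R function table()).

```r
Shared_ID      <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5)
Individual_Con <- c(1, 2, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 2, 1)

# Per Shared_ID: 4 if the group mixes several values, otherwise the common value
Shared_Con <- ave(Individual_Con, Shared_ID,
                  FUN = function(x) if (length(unique(x)) > 1) 4 else x[1])

data.frame(Shared_ID, Individual_Con, Shared_Con)
```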

count row number first and then insert new row by condition [duplicate]

I need to count the number of rows in each group (after group_by) and, if a group has fewer than 6 rows, add new rows until it has 6.
My df has three variables (v1,v2,v3): v1 = group name, v2 = row number (i.e., 1,2,3,4,5,6). In the new row(s), I want to repeat the v1 value, let v2 continue the row count, and set v3 = NA.
sample df
v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32
expected output
v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
2 5 NA #insert
2 6 NA #insert
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32
3 6 NA #insert
I tried to count the rows first with dplyr, but I don't know whether or how I can add this if/else condition using the pipe. Or is there another, easier function?
My code
df %>%
group_by(v1) %>%
dplyr::summarise(N=n()) %>%
if (N < 6) {
# sth like that?
}
Thanks!
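We can use complete from tidyr, which adds a row for every combination of v1 and v2 that is missing and fills v3 with NA:
library(tidyverse)
complete(df, v1, v2)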
# A tibble: 18 x 3
# v1 v2 v3
# <int> <int> <int>
# 1 1 1 79
# 2 1 2 32
# 3 1 3 53
# 4 1 4 33
# 5 1 5 76
# 6 1 6 11
# 7 2 1 32
# 8 2 2 42
# 9 2 3 44
#10 2 4 12
#11 2 5 NA
#12 2 6 NA
#13 3 1 22
#14 3 2 12
#15 3 3 12
#16 3 4 67
#17 3 5 32
#18 3 6 NA
Here is a way to do it using merge.
df <- read.table(text =
"v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32", header = TRUE)
toMerge <- data.frame(v1 = rep(1:3, each = 6), v2 = rep(1:6, times = 3))
m <- merge(toMerge, df, by = c("v1", "v2"), all.x = TRUE)
m
v1 v2 v3
1 1 1 79
2 1 2 32
3 1 3 53
4 1 4 33
5 1 5 76
6 1 6 11
7 2 1 32
8 2 2 42
9 2 3 44
10 2 4 12
11 2 5 NA
12 2 6 NA
13 3 1 22
14 3 2 12
15 3 3 12
16 3 4 67
17 3 5 32
18 3 6 NA
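A slightly more general variant of the merge idea builds the full grid from the data instead of hard-coding the group ids. This is a sketch under assumptions: the target of 6 rows per group is taken from the question, and the names `target_rows`, `grid` and `padded` are illustrative. Stating the target explicitly also covers the case where no group happens to have all 6 rows.

```r
df <- data.frame(v1 = rep(1:3, times = c(6, 4, 5)),
                 v2 = c(1:6, 1:4, 1:5),
                 v3 = c(79, 32, 53, 33, 76, 11, 32, 42, 44, 12, 22, 12, 12, 67, 32))

target_rows <- 6  # assumed target number of rows per group

# Every (v1, v2) pair that should exist
grid <- expand.grid(v2 = seq_len(target_rows), v1 = unique(df$v1))

# Left join: keep all grid rows, v3 becomes NA where df has no match
padded <- merge(grid, df, by = c("v1", "v2"), all.x = TRUE)
padded <- padded[order(padded$v1, padded$v2), ]
rownames(padded) <- NULL
padded
```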

R: Separate data into combinations of two columns

I have some data where each id is measured by different types, each of which can have different values type_val. The measured value is val. A small dummy dataset looks like this:
df <- data.frame(id=rep(letters[1:2],6),
type=c(rep('t1',6), rep('t2',6)),
type_val=rep(c(1,1,2,2,3,3),2),
val=1:12)
Then df is:
id type type_val val
1 a t1 1 1
2 b t1 1 2
3 a t1 2 3
4 b t1 2 4
5 a t1 3 5
6 b t1 3 6
7 a t2 1 7
8 b t2 1 8
9 a t2 2 9
10 b t2 2 10
11 a t2 3 11
12 b t2 3 12
I need to spread/cast the data so that all combinations of type and type_val for each id are laid out row-wise. I think this must be a job for the packages reshape2 or tidyr, but I have completely failed to generate anything other than errors.
The outcome data structure, somewhat redundant, would be something like this (hope I got it right!), where the pairs of type (as given by combinations of type_val) are columns type_t1 and type_t2, and their associated values (val in df) are val_t1 and val_t2. The column names are of course arbitrary:
id type_t1 type_t2 val_t1 val_t2
1 a 1 1 1 7
2 a 1 2 1 9
3 a 1 3 1 11
4 a 2 1 3 7
5 a 2 2 3 9
6 a 2 3 3 11
7 a 3 1 5 7
8 a 3 2 5 9
9 a 3 3 5 11
10 b 1 1 2 8
11 b 1 2 2 10
12 b 1 3 2 12
13 b 2 1 4 8
14 b 2 2 4 10
15 b 2 3 4 12
16 b 3 1 6 8
17 b 3 2 6 10
18 b 3 3 6 12
UPDATE
Note that (@Sotos)
> spread(df, type, val)
id type_val t1 t2
1 a 1 1 7
2 a 2 3 9
3 a 3 5 11
4 b 1 2 8
5 b 2 4 10
6 b 3 6 12
is not the desired output - it fails to deliver the wide format defined by combinations of type and type_val in df.
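How about this (split by type, then merge the two halves by id so that every t1 row is paired with every t2 row of the same id):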
df1=df[df$type=="t1",]
df2=df[df$type=="t2",]
DF=merge(df1,df2,by="id")
DF=DF[,-c(2,5)]
colnames(DF)<-c("id", "type_t1", "val_t1","type_t2", "val_t2")
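The same two-type merge can be written so that columns are picked by name rather than by position, which may be less fragile if the column order changes. This is only a sketch: it uses the suffixes argument of merge() and the illustrative names `t1`, `t2` and `wide`, and it still assumes exactly two types.

```r
df <- data.frame(id       = rep(letters[1:2], 6),
                 type     = c(rep("t1", 6), rep("t2", 6)),
                 type_val = rep(c(1, 1, 2, 2, 3, 3), 2),
                 val      = 1:12)

# One table per type, keeping only the columns needed for the join
t1 <- df[df$type == "t1", c("id", "type_val", "val")]
t2 <- df[df$type == "t2", c("id", "type_val", "val")]

# Joining on id alone pairs every t1 row with every t2 row of that id
wide <- merge(t1, t2, by = "id", suffixes = c("_t1", "_t2"))

# Rename type_val_t1/type_val_t2 to type_t1/type_t2 and order as in the question
names(wide) <- sub("^type_val", "type", names(wide))
wide <- wide[order(wide$id, wide$type_t1, wide$type_t2),
             c("id", "type_t1", "type_t2", "val_t1", "val_t2")]
rownames(wide) <- NULL
wide
```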
Here is something more generic that will work with an arbitrary number of unique type:
library(dplyr)
# This function takes a list of dataframes (.data) and merges them by ID
reduce_merge <- function(.data, ID) {
return(Reduce(function(x, y) merge(x, y, by = ID), .data))
}
# This function renames the cols columns in .data by appending _identifier
batch_rename <- function(.data, cols, identifier, sep = '_') {
return(plyr::rename(.data, sapply(cols, function(x){
x = paste(x, .data[1, identifier], sep = sep)
})))
}
# This function creates a list of subsetted dataframes
# (subsetted by values of key),
# uses batch_rename() to give each dataframe more informative column names,
# merges them together, and returns the columns you'd like in a sensible order
multi_spread <- function(.data, grp, key, vals) {
.data %>%
plyr::dlply(key, subset) %>%
lapply(batch_rename, vals, key) %>%
reduce_merge(grp) %>%
select(-starts_with(paste0(key, '.'))) %>%
select(id, sort(setdiff(colnames(.), c(grp, key, vals))))
}
# Your example
df <- data.frame(id=rep(letters[1:2],6),
type=c(rep('t1',6), rep('t2',6)),
type_val=rep(c(1,1,2,2,3,3),2),
val=1:12)
df %>% multi_spread('id', 'type', c('type_val', 'val'))
id type_val_t1 type_val_t2 val_t1 val_t2
1 a 1 1 1 7
2 a 1 2 1 9
3 a 1 3 1 11
4 a 2 1 3 7
5 a 2 2 3 9
6 a 2 3 3 11
7 a 3 1 5 7
8 a 3 2 5 9
9 a 3 3 5 11
10 b 1 1 2 8
11 b 1 2 2 10
12 b 1 3 2 12
13 b 2 1 4 8
14 b 2 2 4 10
15 b 2 3 4 12
16 b 3 1 6 8
17 b 3 2 6 10
18 b 3 3 6 12
# An example with three unique values of 'type'
df <- data.frame(id = rep(letters[1:2], 9),
type = c(rep('t1', 6), rep('t2', 6), rep('t3', 6)),
type_val = rep(c(1, 1, 2, 2, 3, 3), 3),
val = 1:18)
df %>% multi_spread('id', 'type', c('type_val', 'val'))
id type_val_t1 type_val_t2 type_val_t3 val_t1 val_t2 val_t3
1 a 1 1 1 1 7 13
2 a 1 1 2 1 7 15
3 a 1 1 3 1 7 17
4 a 1 2 1 1 9 13
5 a 1 2 2 1 9 15
6 a 1 2 3 1 9 17
7 a 1 3 1 1 11 13
8 a 1 3 2 1 11 15
9 a 1 3 3 1 11 17
10 a 2 1 1 3 7 13
11 a 2 1 2 3 7 15
12 a 2 1 3 3 7 17
13 a 2 2 1 3 9 13
14 a 2 2 2 3 9 15
15 a 2 2 3 3 9 17
16 a 2 3 1 3 11 13
17 a 2 3 2 3 11 15
18 a 2 3 3 3 11 17
19 a 3 1 1 5 7 13
20 a 3 1 2 5 7 15
21 a 3 1 3 5 7 17
22 a 3 2 1 5 9 13
23 a 3 2 2 5 9 15
24 a 3 2 3 5 9 17
25 a 3 3 1 5 11 13
26 a 3 3 2 5 11 15
27 a 3 3 3 5 11 17
28 b 1 1 1 2 8 14
29 b 1 1 2 2 8 16
30 b 1 1 3 2 8 18
31 b 1 2 1 2 10 14
32 b 1 2 2 2 10 16
33 b 1 2 3 2 10 18
34 b 1 3 1 2 12 14
35 b 1 3 2 2 12 16
36 b 1 3 3 2 12 18
37 b 2 1 1 4 8 14
38 b 2 1 2 4 8 16
39 b 2 1 3 4 8 18
40 b 2 2 1 4 10 14
41 b 2 2 2 4 10 16
42 b 2 2 3 4 10 18
43 b 2 3 1 4 12 14
44 b 2 3 2 4 12 16
45 b 2 3 3 4 12 18
46 b 3 1 1 6 8 14
47 b 3 1 2 6 8 16
48 b 3 1 3 6 8 18
49 b 3 2 1 6 10 14
50 b 3 2 2 6 10 16
51 b 3 2 3 6 10 18
52 b 3 3 1 6 12 14
53 b 3 3 2 6 12 16
54 b 3 3 3 6 12 18
