I'm trying to lag variables by day, but many IDs don't have an observation on the previous day, so I need to add the missing rows in the process. dplyr gets me close, but I need a way to add the new rows as part of the pipeline, and I have many thousands of cases. Any thoughts would be much appreciated.
ID<-c(1,1,1,1,2,2)
day<-c(0,1,2,5,1,3)
v<-c(2.2,3.4,1.2,.8,6.4,2)
dat1<-as.data.frame(cbind(ID,day,v))
dat1
ID day v
1 1 0 2.2
2 1 1 3.4
3 1 2 1.2
4 1 5 0.8
5 2 1 6.4
6 2 3 2.0
Using dplyr gets me here:
dat2 <- dat1 %>%
  group_by(ID) %>%
  mutate(v.L = dplyr::lead(v, n = 1, default = NA))
dat2
ID day v v.L
1 1 0 2.2 3.4
2 1 1 3.4 1.2
3 1 2 1.2 0.8
4 1 5 0.8 NA
5 2 1 6.4 2.0
6 2 3 2.0 NA
But I need to get here:
ID2<-c(1,1,1,1,1,2,2,2)
day2<-c(0,1,2,4,5,1,2,3)
v2<-c(2.2,3.4,1.2,NA,.8,6.4,NA,2)
v2.L<-c(3.4,1.2,NA,.8,NA,NA,2,NA)
dat3<-as.data.frame(cbind(ID2,day2,v2,v2.L))
dat3
ID2 day2 v2 v2.L
1 1 0 2.2 3.4
2 1 1 3.4 1.2
3 1 2 1.2 NA
4 1 4 NA 0.8
5 1 5 0.8 NA
6 2 1 6.4 NA
7 2 2 NA 2.0
8 2 3 2.0 NA
You could use complete and full_seq from the tidyr package to complete the sequence of days. At the end, you'd need to remove the rows that have NA in both v and v.L:
library(dplyr)
library(tidyr)
dat2 <- dat1 %>%
  group_by(ID) %>%
  complete(day = full_seq(day, 1)) %>%
  mutate(v.L = lead(v)) %>%
  filter(!(is.na(v) & is.na(v.L)))
ID day v v.L
<dbl> <dbl> <dbl> <dbl>
1 0 2.2 3.4
1 1 3.4 1.2
1 2 1.2 NA
1 4 NA 0.8
1 5 0.8 NA
2 1 6.4 NA
2 2 NA 2.0
2 3 2.0 NA
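For many thousands of cases, a data.table variant of the same idea may also be worth trying. This is only a sketch, assuming dat1 is defined as in the question (dt and full are helper names introduced here): complete the day sequence per ID with a join, take the lead within each ID, then drop the all-NA rows.
library(data.table)
dt <- as.data.table(dat1)
# Build the full run of days for each ID and left-join the observed values onto it.
full <- dt[dt[, .(day = seq(min(day), max(day))), by = ID], on = .(ID, day)]
# Lead within ID, then drop rows where both v and v.L are missing.
full[, v.L := shift(v, type = "lead"), by = ID]
full[!(is.na(v) & is.na(v.L))]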
I know how to do basic stuff in R, but I am still a newbie. I am also probably asking a pretty redundant question (but I don't know how to phrase it for Google so that I find the right hits).
I have been getting hits like the below:
Assign value to group based on condition in column
R - Group by variable and then assign a unique ID
I want to assign the subgroups to numbered groups within each ID, and create a new column out of them.
I have data like the following:
dataframe:
ID SubID Values
1 15 0.5
1 15 0.2
2 13 0.1
2 13 0
1 14 0.3
1 14 0.3
2 10 0.2
2 10 1.6
6 31 0.7
6 31 1.0
new dataframe:
ID SubID Values groups
1 15 0.5 2
1 15 0.2 2
2 13 0.1 2
2 13 0 2
1 14 0.3 1
1 14 0.3 1
2 10 0.2 1
2 10 1.6 1
6 31 0.7 1
6 31 1.0 1
I have tried the following in R, but I am not getting the desired results:
newdataframe$groups <- dataframe %>% group_indices(,dataframe$ID, dataframe$SubID)
newdataframe<- dataframe %>% group_by(ID, SubID) %>% mutate(groups=group_indices(,dataframe$ID, dataframe$SubID))
I am not sure how to frame the question in R. I want to group by ID and SubID, assign a group number to each SubID within its ID, and reset the grouping count for each ID.
Any help would be really appreciated.
Here is an alternative approach which uses the rleid() function from the data.table package. rleid() generates a run-length type id column.
According to the expected result, the OP expects SubID to be numbered by order of value and not by order of appearance. Therefore, we need to call arrange().
library(dplyr)
df %>%
  group_by(ID) %>%
  arrange(SubID) %>%
  mutate(groups = data.table::rleid(SubID))
ID SubID Values groups
<int> <int> <dbl> <int>
1 2 10 0.2 1
2 2 10 1.6 1
3 2 13 0.1 2
4 2 13 0 2
5 1 14 0.3 1
6 1 14 0.3 1
7 1 15 0.5 2
8 1 15 0.2 2
9 6 31 0.7 1
10 6 31 1 1
Note that the row order has changed.
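If you need to keep the original row order with the dplyr approach, one option (just a sketch; .row is a helper column introduced here) is to record the row number first and sort back at the end:
library(dplyr)
df %>%
  mutate(.row = row_number()) %>%            # remember the original position
  group_by(ID) %>%
  arrange(SubID, .by_group = TRUE) %>%       # order within each ID by SubID
  mutate(groups = data.table::rleid(SubID)) %>%
  ungroup() %>%
  arrange(.row) %>%                          # restore the original row order
  select(-.row)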
BTW: With data.table, the code is less verbose and the original row order is maintained:
library(data.table)
setDT(df)[order(ID, SubID), groups := rleid(SubID), by = ID][]
ID SubID Values groups
1: 1 15 0.5 2
2: 1 15 0.2 2
3: 2 13 0.1 2
4: 2 13 0.0 2
5: 1 14 0.3 1
6: 1 14 0.3 1
7: 2 10 0.2 1
8: 2 10 1.6 1
9: 6 31 0.7 1
10: 6 31 1.0 1
There are multiple ways to do this. One way would be to group_by ID and create a unique number for each SubID by converting it to a factor and then to an integer.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(groups = as.integer(factor(SubID)))
# ID SubID Values groups
# <int> <int> <dbl> <int>
# 1 1 15 0.5 2
# 2 1 15 0.2 2
# 3 2 13 0.1 2
# 4 2 13 0 2
# 5 1 14 0.3 1
# 6 1 14 0.3 1
# 7 2 10 0.2 1
# 8 2 10 1.6 1
# 9 6 31 0.7 1
#10 6 31 1 1
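A closely related dplyr idiom, shown here only as a sketch, is dense_rank(), which numbers the distinct SubID values within each ID directly:
library(dplyr)
# dense_rank() gives consecutive integers by SubID value within each ID,
# which matches the expected 'groups' column.
df %>%
  group_by(ID) %>%
  mutate(groups = dense_rank(SubID)) %>%
  ungroup()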
In base R, we can use ave with similar logic
df$groups <- with(df, ave(SubID, ID, FUN = factor))
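If you prefer not to rely on the factor being coerced back to its integer codes inside ave(), an explicit variant of the same logic (a sketch) is:
# Convert the factor to its integer codes explicitly before ave() assigns it back.
df$groups <- with(df, ave(SubID, ID, FUN = function(x) as.integer(factor(x))))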
I have data that looks like this
samplesize <- 6
group <- c(1,2,3)
total <- rep(samplesize,length(group))
outcomeTrue <- c(2,1,3)
df <- data.frame(group,total,outcomeTrue)
and would like my data to look like this
group2 <- c(rep(1,6),rep(2,6),rep(3,6))
outcomeTrue2 <- c(rep(1,2),rep(0,6-2),rep(1,1),rep(0,6-1),rep(1,3),rep(0,6-3))
df2 <- data.frame(group2,outcomeTrue2)
That is to say, I have binary data where I am told the total observations and the successful observations, but I would prefer it organised as individual observations with their explicit outcome as 0 or 1 (as in df2 above).
Is there an easy way to do this in R, or will I need to write a loop to automate this myself?
Here is one option with the tidyverse. We uncount to expand the rows using the 'total' column, then, grouped by 'group', create a binary outcome with a logical condition based on row_number() and the value of 'outcomeTrue'.
library(tidyverse)
df %>%
  uncount(total) %>%
  group_by(group) %>%
  mutate(outcomeTrue = as.integer(row_number() <= outcomeTrue[1]))
# A tibble: 18 x 2
# Groups: group [3]
# group outcomeTrue
# <dbl> <int>
# 1 1 1
# 2 1 1
# 3 1 0
# 4 1 0
# 5 1 0
# 6 1 0
# 7 2 1
# 8 2 0
# 9 2 0
#10 2 0
#11 2 0
#12 2 0
#13 3 1
#14 3 1
#15 3 1
#16 3 0
#17 3 0
#18 3 0
You are almost there. Just use the group2 variable in the row-index position of the "[" function:
df[group2, ]
group total outcomeTrue
1 1 6 2
1.1 1 6 2
1.2 1 6 2
1.3 1 6 2
1.4 1 6 2
1.5 1 6 2
2 2 6 1
2.1 2 6 1
2.2 2 6 1
2.3 2 6 1
2.4 2 6 1
2.5 2 6 1
3 3 6 3
3.1 3 6 3
3.2 3 6 3
3.3 3 6 3
3.4 3 6 3
3.5 3 6 3
When a numeric index, or a character value that matches a row name, is put in the row-index position of "[", the whole row is returned, and repeated indices replicate the entire row.
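A variant of the same row-replication idea, without building group2 first (a sketch assuming df from the question; long is a helper name introduced here):
# Repeat each row 'total' times using the same "[" replication trick.
long <- df[rep(seq_len(nrow(df)), df$total), ]
# Within each group, the first 'outcomeTrue' replicated rows get a 1, the rest 0.
long$outcomeTrue2 <- with(long, ave(outcomeTrue, group,
                                    FUN = function(x) as.integer(seq_along(x) <= x)))
df2 <- long[, c("group", "outcomeTrue2")]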
Here is a base R solution.
do.call(rbind, lapply(split(df, df$group), function(x) {
  data.frame(group2   = x$group,
             outcome2 = rep(c(1, 0), times = c(x$outcomeTrue, x$total - x$outcomeTrue)))
}))
# group2 outcome2
# 1.1 1 1
# 1.2 1 1
# 1.3 1 0
# 1.4 1 0
# 1.5 1 0
# 1.6 1 0
# 2.1 2 1
# 2.2 2 0
# 2.3 2 0
# 2.4 2 0
# 2.5 2 0
# 2.6 2 0
# 3.1 3 1
# 3.2 3 1
# 3.3 3 1
# 3.4 3 0
# 3.5 3 0
# 3.6 3 0
I'm working in R with a data frame (data) that looks like this:
ID t P1 P2 P3 P4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100003 0 5 4 3 2
2 100003 0 6 2 1 3
3 100013 0 6 5 7 3
4 100013 0 4 5 4 1
5 100014 0 1 1 1 1
6 100014 0 1 1 1 1
7 100015 0 6 6 1 1
8 100015 0 6 6 1 1
9 100044 0 6 2 5 1
10 100044 0 6 3 1 1
11 100051 0 NA NA NA NA
12 100051 0 4 4 2 2
13 100074 0 4 6 4 3
14 100074 0 5 6 3 2
15 100075 0 2 2 1 1
AIM: I need to aggregate by ID (t is always equal to 0) for each of the variables P1, P2, P3, P4, like this:
new_data<-aggregate(P1~ID+t,data,mean,na.rm=T)
new_data<-aggregate(P2~ID+t,data,mean,na.rm=T)
new_data<-aggregate(P3~ID+t,data,mean,na.rm=T)
new_data<-aggregate(P4~ID+t,data,mean,na.rm=T)
PROBLEM: Is there a loop I can run, or some code from the apply family, instead of going through each variable (P1-P4) manually? Thanks a lot!
Haven't tested it, but this should do the loop:
cols <- c("P1", "P2", "P3", "P4")
dat2 <- lapply(data[cols], function(x) {
  aggregate(x ~ ID + t, data, mean, na.rm = TRUE)
})
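If you would rather end up with a single data frame (one row per ID/t combination and one column per P variable) instead of a list, here is a sketch along the same lines, using reformulate() to build each formula and merge() to combine the results (pieces is a helper name introduced here):
cols <- c("P1", "P2", "P3", "P4")
# One aggregate() call per column, keeping each column's own name in its result.
pieces <- lapply(cols, function(p) {
  aggregate(reformulate(c("ID", "t"), response = p), data, mean, na.rm = TRUE)
})
# Merge the per-column results back together on ID and t.
new_data <- Reduce(function(a, b) merge(a, b, by = c("ID", "t")), pieces)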
You can aggregate multiple variables at once with cbind(P1, P2, P3, P4) ~ ID + t or equivalently using a dot in place of cbind(P1, P2, P3, P4). The dot means every remaining variable.
aggregate(. ~ ID + t, data, mean, na.rm = TRUE)
ID t P1 P2 P3 P4
1 100003 0 5.5 3.0 2.0 2.5
2 100013 0 5.0 5.0 5.5 2.0
3 100014 0 1.0 1.0 1.0 1.0
4 100015 0 6.0 6.0 1.0 1.0
5 100044 0 6.0 2.5 3.0 1.0
6 100051 0 4.0 4.0 2.0 2.0
7 100074 0 4.5 6.0 3.5 2.5
8 100075 0 2.0 2.0 1.0 1.0
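One subtlety worth noting: the formula method of aggregate() uses na.action = na.omit by default, so a row with an NA in any P column is dropped from all of that group's means, not just from the affected column. If you want each mean computed over whatever values are present, pass na.action = na.pass so that na.rm = TRUE does the work instead:
# Keep rows with partial NAs; let mean(..., na.rm = TRUE) handle the missing values.
aggregate(. ~ ID + t, data, mean, na.rm = TRUE, na.action = na.pass)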
I have a data frame with two ID columns that I am grouping by with dplyr, a column of months (as numerics, e.g. 1 through 12), and several columns of statistical data after that (values unimportant). An example:
ID_1 ID_2 month st1 st2
1 1 1 0.5 0.2
1 1 2 0.7 0.9
1 1 3 1.1 1.7
1 1 4 2.6 0.8
1 1 5 1.8 1.3
1 1 6 2.1 2.2
1 1 7 0.5 0.2
1 1 8 0.7 0.9
1 1 9 1.1 1.7
1 1 10 2.6 0.8
1 1 11 1.8 1.3
1 1 12 2.1 2.2
1 2 1 0.5 0.2
1 2 2 0.7 0.9
1 2 3 1.1 1.7
1 2 4 2.6 0.8
1 2 5 1.8 1.3
1 2 6 2.1 2.2
1 2 7 0.5 0.2
1 2 9 1.1 1.7
1 2 10 2.6 0.8
1 2 11 1.8 1.3
1 2 12 2.1 2.2
For the second grouping (ID_1 = 1 and ID_2 = 2), there is a month missing from the data (month = 8). Is there a way I can find this month and insert a row with the correct ID_1 and ID_2 values, the missing month value, and NA values for the rest of the columns? I've been playing around with this using dplyr functions and can't seem to figure it out; perhaps there is even a non-dplyr solution out there as well.
PS: If it helps, each unique grouping of ID_1 and ID_2 will have no more than 1 month missing.
Use expand.grid to make all combinations of the groups and months, then merge:
# make a reference with all needed rows
ref <- data.frame(expand.grid(unique(df1$ID_1),
                              unique(df1$ID_2),
                              1:12))
colnames(ref) <- colnames(df1)[1:3]
# then merge with all = TRUE
res <- merge(df1, ref, all = TRUE)
# to check the output, show only month = 8
res[res$month == 8, ]
# ID_1 ID_2 month st1 st2
# 8 1 1 8 0.7 0.9
# 20 1 2 8 NA NA
This can be done via tidyr::complete:
library(dplyr)
library(tidyr)
dat %>%
  group_by(ID_1, ID_2) %>%
  complete(month = 1:12)
Tail of dataset:
Source: local data frame [6 x 5]
Groups: ID_1, ID_2 [1]
ID_1 ID_2 month st1 st2
<int> <int> <int> <dbl> <dbl>
1 1 2 7 0.5 0.2
2 1 2 8 NA NA
3 1 2 9 1.1 1.7
4 1 2 10 2.6 0.8
5 1 2 11 1.8 1.3
6 1 2 12 2.1 2.2
If you go with tidyr, there is the complete function for this; you can nest ID_1 and ID_2 if you want both variables as your grouping variable:
library(tidyr)
df1 = df %>% complete(nesting(ID_1, ID_2), month)
tail(df1)
# Source: local data frame [6 x 5]
# ID_1 ID_2 month st1 st2
# <int> <int> <int> <dbl> <dbl>
# 1 1 2 7 0.5 0.2
# 2 1 2 8 NA NA
# 3 1 2 9 1.1 1.7
# 4 1 2 10 2.6 0.8
# 5 1 2 11 1.8 1.3
# 6 1 2 12 2.1 2.2
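For completeness, a data.table sketch of the same nesting idea (assuming the data frame is called df, as in the previous answer; dt and ref are helper names introduced here): build months 1 to 12 for each observed ID_1/ID_2 pair and join the original data onto it.
library(data.table)
dt <- as.data.table(df)
# All twelve months for every ID_1/ID_2 pair that actually occurs in the data.
ref <- dt[, .(month = 1:12), by = .(ID_1, ID_2)]
# Left join: missing months get NA in st1 and st2.
dt[ref, on = .(ID_1, ID_2, month)]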
I want to add an extra row for each subject ID in the data frame (below). This row should have TIME = 0 and DV = 0; the values in the other columns should stay the same. The data frame looks like the following:
ID TIME DV DOSE pH
1 1 5 50 4.6
1 5 10 50 4.6
2 1 6 100 6.0
2 7 10 100 6.0
After adding the extra row, it should look like this:
ID TIME DV DOSE pH
1 0 0 50 4.6
1 1 5 50 4.6
1 5 10 50 4.6
2 0 0 100 6.0
2 1 6 100 6.0
2 7 10 100 6.0
How could I achieve this in R?
Try this:
#dummy data
df <- read.table(text="ID TIME DV DOSE pH
1 1 5 50 4.6
1 5 10 50 4.6
2 1 6 100 6.0
2 7 10 100 6.0",header=TRUE)
#data with zeros
df1 <- df
df1[,c(2,3)] <- 0
df1 <- unique(df1)
#rowbind and sort
res <- rbind(df,df1)
res <- res[order(res$ID,res$TIME),]
res
# ID TIME DV DOSE pH
# 11 1 0 0 50 4.6
# 1 1 1 5 50 4.6
# 2 1 5 10 50 4.6
# 31 2 0 0 100 6.0
# 3 2 1 6 100 6.0
# 4 2 7 10 100 6.0
Here's another possible data.table solution
library(data.table)
setDT(df)[, .SD[c(1L, seq_len(.N))], ID][
  , indx := seq_len(.N), ID][indx == 1L, 2:3 := 0][]
# ID TIME DV DOSE pH indx
# 1: 1 0 0 50 4.6 1
# 2: 1 1 5 50 4.6 2
# 3: 1 5 10 50 4.6 3
# 4: 2 0 0 100 6.0 1
# 5: 2 1 6 100 6.0 2
# 6: 2 7 10 100 6.0 3
I changed the indexing from c(.N+1, 1:.N) to c(1L, 1:.N) (from @David Arenburg's post) as it is easier this way :-)
library(data.table)
setDT(df)[, .SD[c(1L, 1:.N)], by = ID][
  , 2:3 := .SD * (!duplicated(.SD, fromLast = TRUE)) + 0L, .SDcols = 2:3][]
# ID TIME DV DOSE pH
#1: 1 0 0 50 4.6
#2: 1 1 5 50 4.6
#3: 1 5 10 50 4.6
#4: 2 0 0 100 6.0
#5: 2 1 6 100 6.0
#6: 2 7 10 100 6.0
Or you could use set that updates by reference (if there are many columns)
DT <- setDT(df)[, .SD[c(1L, 1:.N)], by = ID]
indx <- DT[, !duplicated(.SD, fromLast = TRUE), .SDcols = 2:3]
for (j in 2:3) {
  set(DT, i = NULL, j = j, value = DT[[j]] * (indx + 0L))
}
A concise approach using plyr:
library(plyr)
ldply(split(df, df$ID), function(u) {
  x <- u[1, ]
  x[c("DV", "TIME")] <- 0
  rbind(x, u)
})
# .id ID TIME DV DOSE pH
#1 1 1 0 0 50 4.6
#2 1 1 1 5 50 4.6
#3 1 1 5 10 50 4.6
#4 2 2 0 0 100 6.0
#5 2 2 1 6 100 6.0
#6 2 2 7 10 100 6.0
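And a dplyr sketch of the same idea (assuming df as defined above): take one row per ID with the constant columns, set TIME and DV to zero, and bind it back onto the original data.
library(dplyr)
df %>%
  distinct(ID, DOSE, pH) %>%              # one row per ID, keeping the constant columns
  mutate(TIME = 0, DV = 0) %>%            # the new baseline row
  bind_rows(df) %>%
  arrange(ID, TIME) %>%
  select(ID, TIME, DV, DOSE, pH)          # restore the original column order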