I have a df with patient ids and a binary indicator for whether they have experienced an intervention or not. I want to make a new column called "time_post" which tells me how many timepoints have passed since experiencing the intervention.
Here is my df:
names<-c("tom","tom","tom","tom","tom","tom","tom","tom", "john", "john","john", "john","john", "john","john", "john")
post<-c(0,0,0,1,1,1,1,1,0,1,1,1,1,1,1,1)
df<-data.frame(names,post)
This is what I have tried:
df$time_post<-ifelse(df$post==1[1],1,0) ##this tries to assign 1 to "time_post" for first value of 1 seen in post
df$time_post<-ifelse(df$post==1[2],2,df$time_post) ##trying to apply same logic here, but doesn't work. Introduces NAs into time_post column.
This is my desired output:
names post time_post
1 tom 0 0
2 tom 0 0
3 tom 0 0
4 tom 1 1
5 tom 1 2
6 tom 1 3
7 tom 1 4
8 tom 1 5
9 john 0 0
10 john 1 1
11 john 1 2
12 john 1 3
13 john 1 4
14 john 1 5
15 john 1 6
16 john 1 7
Thank you in advance
Try this:
df<-data.frame(names=c("tom","tom","tom","tom","tom","tom","tom","tom",
"john", "john","john", "john","john", "john","john", "john"),
post=c(0,0,0,1,1,1,1,1,0,1,1,1,1,1,1,1))
df$time_post <- with(df, ave(post,names,FUN=cumsum))
Which gives you:
> df
names post time_post
1 tom 0 0
2 tom 0 0
3 tom 0 0
4 tom 1 1
5 tom 1 2
6 tom 1 3
7 tom 1 4
8 tom 1 5
9 john 0 0
10 john 1 1
11 john 1 2
12 john 1 3
13 john 1 4
14 john 1 5
15 john 1 6
16 john 1 7
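Note that cumsum(post) counts rows where post is 1, so it matches "timepoints since the intervention" only when post stays at 1 once it has started. If post could drop back to 0 later, a minimal sketch along these lines might be closer to what is wanted. It assumes the rows are already in time order within each patient, and since_first is a hypothetical helper name, not something from the original answer:

```r
# Same example data as in the question
df <- data.frame(
  names = rep(c("tom", "john"), each = 8),
  post  = c(0,0,0,1,1,1,1,1, 0,1,1,1,1,1,1,1)
)

# Hypothetical helper: count timepoints from the first 1 onward,
# even if a later row has post == 0 again
since_first <- function(p) cumsum(cumsum(p) > 0)

# Apply the helper within each patient
df$time_post <- ave(df$post, df$names, FUN = since_first)
df
```

For the example data this gives the same time_post column as the cumsum version; the two differ only for a patient whose post goes 0, 1, 0, 1 and so on.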
Suppose I have two data frames df1 and df2 as follows
Df1
Id Price Profit Month
10 5 2 1
10 5 3 2
10 5 2 3
11 7 3 1
11 7 1 2
12 0 0 1
12 5 1 2
Df2
Id Name
9 Kane
10 Jack
10 Jack
11 Will
12 Matt
13 Lee
14 Han
Now I want to insert a new column in Df1 named Name and get its value from Df2 based on matching Id.
So the modified Df1 will be
Id Price Profit Month Name
10 5 2 1 Jack
10 5 3 2 Jack
10 5 2 3 Jack
11 7 3 1 Will
11 7 1 2 Will
12 0 0 1 Matt
12 5 1 2 Matt
df1 <- data.frame(Id=c(10L,10L,10L,11L,11L,12L,12L),Price=c(5L,5L,5L,7L,7L,0L,5L),Profit=c(2L,3L,2L,3L,1L,0L,1L),Month=c(1L,2L,3L,1L,2L,1L,2L),stringsAsFactors=F);
df2 <- data.frame(Id=c(9L,10L,10L,11L,12L,13L,14L),Name=c('Kane','Jack','Jack','Will','Matt','Lee','Han'),stringsAsFactors=F);
df1$Name <- df2$Name[match(df1$Id,df2$Id)];
df1;
## Id Price Profit Month Name
## 1 10 5 2 1 Jack
## 2 10 5 3 2 Jack
## 3 10 5 2 3 Jack
## 4 11 7 3 1 Will
## 5 11 7 1 2 Will
## 6 12 0 0 1 Matt
## 7 12 5 1 2 Matt
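Use left_join in dplyr. Df2 lists Id 10 twice, so deduplicate it first; otherwise each Id 10 row in Df1 is matched twice and duplicated in the result.
library(dplyr)
left_join(df1, unique(df2), by = "Id")
e.g.:
> left_join(df1, unique(df2))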
Joining by: "Id"
Id Price Profit Month Name
1 10 5 2 1 Jack
2 10 5 3 2 Jack
3 10 5 2 3 Jack
4 11 7 3 1 Will
5 11 7 1 2 Will
6 12 0 0 1 Matt
7 12 5 1 2 Matt
Data wrangling cheatsheet by RStudio is a very helpful resource.
Here is an option using data.table
library(data.table)
setDT(Df1)[unique(Df2), on = "Id", nomatch=0]
# Id Price Profit Month Name
#1: 10 5 2 1 Jack
#2: 10 5 3 2 Jack
#3: 10 5 2 3 Jack
#4: 11 7 3 1 Will
#5: 11 7 1 2 Will
#6: 12 0 0 1 Matt
#7: 12 5 1 2 Matt
Or, as @Arun mentioned in the comments, we can assign (:=) the "Name" column after joining on "Id" to reflect the changes in the original dataset "Df1".
setDT(Df1)[Df2, Name := i.Name, on = "Id"]
Df1
A simple base R option could be merge()
merge(Df1,unique(Df2), by="Id")
# Id Price Profit Month Name
#1 10 5 2 1 Jack
#2 10 5 3 2 Jack
#3 10 5 2 3 Jack
#4 11 7 3 1 Will
#5 11 7 1 2 Will
#6 12 0 0 1 Matt
#7 12 5 1 2 Matt
The function unique() is used here because of the duplicate entry in Df2 concerning "Jack". For the example data described in the OP the option by="Id" can be omitted, but it might be necessary in a more general case.
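To see why unique() matters here, a small sketch with the example data (the names with_dups and deduped are only for illustration):

```r
Df1 <- data.frame(Id = c(10L,10L,10L,11L,11L,12L,12L),
                  Price = c(5L,5L,5L,7L,7L,0L,5L),
                  Profit = c(2L,3L,2L,3L,1L,0L,1L),
                  Month = c(1L,2L,3L,1L,2L,1L,2L))
Df2 <- data.frame(Id = c(9L,10L,10L,11L,12L,13L,14L),
                  Name = c("Kane","Jack","Jack","Will","Matt","Lee","Han"))

# Without unique(), every Id 10 row in Df1 matches both "Jack" rows in Df2
with_dups <- merge(Df1, Df2, by = "Id")

# With unique(), there is one lookup row per Id
deduped <- merge(Df1, unique(Df2), by = "Id")

nrow(with_dups)  # 10
nrow(deduped)    # 7
```

The match() approach does not have this problem because match() only ever returns the first matching position.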
I have the following individual data and I want to make a unique household identifier. Every individual already has a rank within the household, so rank 1 marks the start of a new household.
e.g.
rank name
1 John
2 Lisa
3 Stu
1 Phil
1 Mike
1 Florence
2 George
3 David
4 Diana
1 Eleanor
The result I am looking for is this:
rank name id
1 John 1
2 Lisa 1
3 Stu 1
1 Phil 2
1 Mike 3
1 Florence 4
2 George 4
3 David 4
4 Diana 4
1 Eleanor 5
There are around 320 000 individuals, so the group id should go from 1 to sum(df$rank == 1) or something similar. Any other type of unique ID also works; it doesn't have to be seq(1,n,1).
df$id <- cumsum(df$rank == 1)
# rank name id
# 1 1 John 1
# 2 2 Lisa 1
# 3 3 Stu 1
# 4 1 Phil 2
# 5 1 Mike 3
# 6 1 Florence 4
# 7 2 George 4
# 8 3 David 4
# 9 4 Diana 4
# 10 1 Eleanor 5
As @Andre Elrico noted, if rank is NA for any rows, the method above would give you NA for id in all subsequent rows, so you could use the option below instead if you know rank may be NA (but not when it should be 1).
df$id <- cumsum(df$rank %in% 1)
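A quick illustration of the difference, using a made-up rank vector with one missing value (rk, id_eq and id_in are hypothetical names):

```r
rk <- c(1, 2, NA, 1, 2)   # hypothetical ranks, third one missing

id_eq <- cumsum(rk == 1)    # NA == 1 is NA, and cumsum carries it forward
id_in <- cumsum(rk %in% 1)  # NA %in% 1 is FALSE, so the count continues

id_eq  # 1 1 NA NA NA
id_in  # 1 1 1 2 2
```

With %in%, the row with the missing rank is simply treated as belonging to the current household.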
Data used:
df <- read.table(text = '
rank name
1 John
2 Lisa
3 Stu
1 Phil
1 Mike
1 Florence
2 George
3 David
4 Diana
1 Eleanor
', header = T)
I have three character vectors. List 1 contains all independent names; list 2 and 3 only contain a subset of names in list 1. The names can appear multiple times in the lists 2 and 3.
list1 <- c("Jane","Michael","Zach","Zoey","Mary","Joe","Samantha","Eva","Chris","David","James","Kim","John")
list2 <- c("Jane","Jane","Zoey","Joe","Joe","Samantha","Eva","David","Kim","Kim","Kim")
list3 <- c("Michael","Michael","Zach","Mary","Mary","Joe","Eva","Eva","Chris","Chris","James","John","John")
I would like to obtain a data frame in the end, the first column containing list 1, then the second and third containing the number of times the name in the first list appears in list 2 and 3.
Jane 2 0
Michael 0 2
Zach 0 1
Zoey 1 0
Mary 0 2
Joe 2 1
Samantha 1 0
Eva 1 2
Chris 0 2
David 1 0
James 0 1
Kim 3 0
John 0 2
I know how to do this in Excel, but my list1 has more than 10,0000 entries and doing this in Excel is prohibitively slow. Is there any way to do this in R?
Here's a way to do it with data.table
list1 <- c("Jane","Michael","Zach","Zoey","Mary","Joe","Samantha","Eva","Chris","David","James","Kim","John")
list2 <- c("Jane","Jane","Zoey","Joe","Joe","Samantha","Eva","David","Kim","Kim","Kim")
list3 <- c("Michael","Michael","Zach","Mary","Mary","Joe","Eva","Eva","Chris","Chris","James","John","John")
library(data.table)
dt = data.table(list1)
dt[ , "row" := 1:.N ]
dt[ , "list2count" := sum(list1 == list2), by = row]
dt[ , "list3count" := sum(list1 == list3), by = row]
> dt
list1 row list2count list3count
1: Jane 1 2 0
2: Michael 2 0 2
3: Zach 3 0 1
4: Zoey 4 1 0
5: Mary 5 0 2
6: Joe 6 2 1
7: Samantha 7 1 0
8: Eva 8 1 2
9: Chris 9 0 2
10: David 10 1 0
11: James 11 0 1
12: Kim 12 3 0
13: John 13 0 2
Using dplyr:
list1 <- c("Jane","Michael","Zach","Zoey","Mary","Joe","Samantha","Eva","Chris","David","James","Kim","John")
list2 <- data.frame(name = c("Jane","Jane","Zoey","Joe","Joe","Samantha","Eva","David","Kim","Kim","Kim"))
list3 <-data.frame(name = c("Michael","Michael","Zach","Mary","Mary","Joe","Eva","Eva","Chris","Chris","James","John","John"))
list2$listNumber <- rep("list2",nrow(list2)) # nrow(), not length(): length() of a data frame is its column count
list3$listNumber <- rep("list3",nrow(list3))
combList <- rbind(list2,list3)
library(dplyr)
combList%>% group_by(listNumber)%>% count(name)%>% filter( name %in% list1)
Output:
# A tibble: 15 x 3
listNumber name n
<chr> <fctr> <int>
1 list2 David 1
2 list2 Eva 1
3 list2 Jane 2
4 list2 Joe 2
5 list2 Kim 3
6 list2 Samantha 1
7 list2 Zoey 1
8 list3 Eva 2
9 list3 Joe 1
10 list3 Chris 2
11 list3 James 1
12 list3 John 2
13 list3 Mary 2
14 list3 Michael 2
15 list3 Zach 1
In base R, you could use factor, setting the levels as those of list 1, then use table to get the counts, and data.frame to put them all together:
data.frame(list1,
l2=c(table(factor(list2, levels=list1))),
l3=c(table(factor(list3, levels=list1))))
This returns
list1 l2 l3
Jane Jane 2 0
Michael Michael 0 2
Zach Zach 0 1
Zoey Zoey 1 0
Mary Mary 0 2
Joe Joe 2 1
Samantha Samantha 1 0
Eva Eva 1 2
Chris Chris 0 2
David David 1 0
James James 0 1
Kim Kim 3 0
John John 0 2
Here's a base solution that will scale to any number of lists:
list0 <- list(list1, list2, list3)
# merge the per-list frequency tables on the name column and keep the result
res <- Reduce(function(...) merge(..., by = 1, all = TRUE),
              lapply(list0, function(x) as.data.frame(table(x))))
colnames(res) <- c("Name","L1","L2","L3")
res
# Name L1 L2 L3
# 1 Chris 1 NA 2
# 2 David 1 1 NA
# 3 Eva 1 1 2
# 4 James 1 NA 1
# 5 Jane 1 2 NA
# 6 Joe 1 2 1
# 7 John 1 NA 2
# 8 Kim 1 3 NA
# 9 Mary 1 NA 2
# 10 Michael 1 NA 2
# 11 Samantha 1 1 NA
# 12 Zach 1 NA 1
# 13 Zoey 1 1 NA
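The merged result has NA where a name does not occur in a list and is sorted alphabetically. If zeros and the original list1 order are preferred, one possible sketch combines the factor/table idea with vapply so it still works for any number of lists. The names lists, counts, l2 and l3 are just illustrative:

```r
list1 <- c("Jane","Michael","Zach","Zoey","Mary","Joe","Samantha","Eva","Chris","David","James","Kim","John")
list2 <- c("Jane","Jane","Zoey","Joe","Joe","Samantha","Eva","David","Kim","Kim","Kim")
list3 <- c("Michael","Michael","Zach","Mary","Mary","Joe","Eva","Eva","Chris","Chris","James","John","John")

# Named list of the vectors to count; add more entries here as needed
lists <- list(l2 = list2, l3 = list3)

# One column of counts per list; levels = list1 keeps names with zero counts
counts <- vapply(lists,
                 function(x) as.vector(table(factor(x, levels = list1))),
                 integer(length(list1)))

res <- data.frame(name = list1, counts)
res
```

Because every count column is built against the same levels, the rows line up with list1 without any merging.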
I have this data-frame:
df=data.frame(student=c(rep("John",6),rep("Meredith",7),rep("Jeremy",5),rep("Audrey",8)),
semester=c(1,2,3,4,5,6, 1,2,3,4,5,6,7, 1,2,3,4,5, 1,2,3,4,5,6,7,8),
addQual=c( 1,0,0,0,1,0, 0,0,0,0,0,0,0, 0,0,1,0,1, 0,0,0,0,0,0,0,0))
It contains students, all their semesters, and for every semester whether they took an additional qualifications course (dummy variable addQual = 1 if they took it).
How can I get a data-frame dfFilt that only contains those students who ever participated in an additional qualifications course?
My desired output would therefore be:
dfFilt=data.frame(student=c(rep("John",6),rep("Jeremy",5)),
semester=c(1,2,3,4,5,6, 1,2,3,4,5),
addQual=c( 1,0,0,0,1,0, 0,0,1,0,1))
A solution in dplyr is preferred.
A base R solution
df[which(df$student %in% df$student[which(df$addQual >0)]), ]
student semester addQual
1 John 1 1
2 John 2 0
3 John 3 0
4 John 4 0
5 John 5 1
6 John 6 0
14 Jeremy 1 0
15 Jeremy 2 0
16 Jeremy 3 1
17 Jeremy 4 0
18 Jeremy 5 1
In dplyr:
library(dplyr)
dfFilt = df %>% group_by(student) %>% filter(sum(addQual)>=1)
student semester addQual
<fctr> <dbl> <dbl>
1 John 1 1
2 John 2 0
3 John 3 0
4 John 4 0
5 John 5 1
6 John 6 0
7 Jeremy 1 0
8 Jeremy 2 0
9 Jeremy 3 1
10 Jeremy 4 0
11 Jeremy 5 1
I have a df as follows
names numbers
1 john -3
2 john -2
3 john -1
4 john 1
5 john 2
6 mary -2
7 mary -1
8 mary 1
9 mary 2
10 mary 3
11 tom -1
12 tom 1
13 tom 2
14 tom 3
I want to limit the df to people who have a value that begins with -3. Then I want to do the same for -2 and then the same again for people who start off with a value of -1. My end result would be three dfs, one each for john, mary and tom given that they all have different starting values (-3,-2 and -1).
e.g., for mary
names numbers
6 mary -2
7 mary -1
8 mary 1
9 mary 2
10 mary 3
My real dataframe has about 10,000 people in it, so I can't just filter by name as I'm doing here. I'd like a way of doing it by number, something like
df1<-df[df$numbers>=-3,] ##too simplistic
but this pulls in all the rows for everyone in the dataframe (logical, considering they all have values >= -3). I want the code to limit the resultant df to just the person who had a starting value of -3 and then all their values underneath that as shown for mary above.
Thank you in advance!
I would use ave to calculate the first number for each group, then split on it.
df$first <- ave(df$numbers, df$names, FUN=function(x) x[1])
split(df, f = df$first)
yields:
$`-3`
names numbers first
1 john -3 -3
2 john -2 -3
3 john -1 -3
4 john 1 -3
5 john 2 -3
$`-2`
names numbers first
6 mary -2 -2
7 mary -1 -2
8 mary 1 -2
9 mary 2 -2
10 mary 3 -2
$`-1`
names numbers first
11 tom -1 -1
12 tom 1 -1
13 tom 2 -1
14 tom 3 -1
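The result of split() is a named list, so each data frame can be pulled out by its starting value. A small sketch with the example data rebuilt from the question (groups is just an illustrative name):

```r
df <- data.frame(
  names   = rep(c("john", "mary", "tom"), times = c(5, 5, 4)),
  numbers = c(-3,-2,-1,1,2, -2,-1,1,2,3, -1,1,2,3)
)

# First value per person, repeated on every row of that person
df$first <- ave(df$numbers, df$names, FUN = function(x) x[1])

# One data frame per starting value
groups <- split(df, f = df$first)

names(groups)    # "-3" "-2" "-1"
groups[["-2"]]   # everyone whose first value is -2 (here only mary)
```

With 10,000 people, each element would hold all people sharing that starting value, not just one person.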