I want to create a variable using dplyr that takes in a value conditional on another variable.
See example below.
data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1))
I want to create a variable 'baseline' that takes the content of variable 'value' where time = 1 and by group. As such the desired output would be
data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1),
baseline = c(1,1,3,3)))
Tried to run the following code with indexing but am clearly going wrong somewhere
x <- data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1))
x %>% group_by(group) %>%
mutate(baseline = .[[.$time==1,.$value]])
Thanks
We can use which.min
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(baseline = value[which.min(time)])
# A tibble: 4 x 4
# Groups: group [2]
# group time value baseline
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 2 1
#3 b 1 3 3
#4 b 2 4 3
and if it is already ordered by 'time', then simply use first
df1 %>%
group_by(group) %>%
mutate(baseline = first(value))
data
df1 <- data.frame(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1))
Related
I have the following database with several entries per individual:
record_id<-c(21,21,21,15,15,15,2,2,2,2,3,3,3)
var<-c(0,0,0,1,0,0,1,1,0,0,1,1,0)
data<-data.frame(cbind(record_id,var))
I want to create a new data frame with just 1 row per record_id. But it has to fulfill that if the individual (record_id) has a data$var == 1. The outcome data frame must indicate 1.
So, the outcome would be like this:
record_id<-c(21,15,2,3)
var<-c(0,1,1,1)
data_sol<-data.frame(cbind(record_id,var))
I have tried this:
DF1 <- data %>%
group_by(record_id) %>%
mutate(class = ifelse(var==1,1,0)) %>%
ungroup
I know it's not the best way, I was planning to obtain afterwards the unique values... But it did not make the trick.
If your 'var' is all zeroes or ones, you can also use max():
data%>%group_by(record_id)%>%
summarise(new_var=max(var))
# A tibble: 4 x 2
record_id new_var
<dbl> <dbl>
1 2 1
2 3 1
3 15 1
4 21 0
You can use mean() with the mutate to detect if there exsist any non zero value inside a group like,
data %>%
group_by(record_id) %>%
mutate(var = ifelse(mean(var)!=0,1,0)) %>%
distinct(record_id,var)
gives,
# A tibble: 4 x 2
# Groups: record_id [4]
# record_id var
# <dbl> <dbl>
# 1 21 0
# 2 15 1
# 3 2 1
# 4 3 1
We can do
library(dplyr)
data %>%
group_by(record_id) %>%
summarise(var = +(mean(var) != 0))
Or using slice
data %>%
group_by(record_id) %>%
slice_max(n = 1, order_by = var)
I would like to know how to keep ordering after spread.
data<-tibble(var=c("A","C","D","B"), score=c(1,2,4,3))
data_spread <-data%>%spread(key = var, value = score)
I would like to keep the order of c("A","C","D","B").
An option is to convert to factor with levels specified as the unique elements of 'var' will make sure the order is the order of occurrence
library(dplyr)
library(tidyr)
data %>%
mutate(var = factor(var, levels = unique(var))) %>%
spread(var, score)
# A tibble: 1 x 4
# A C D B
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 4 3
I have two data sets with one common variable - ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use left-join because the first or left file so to say needs to stay as it is (I don't want it to return all combinations and add rows). But I also don't want it to link data like vlookup in Excel which finds the first match and returns it so when I have duplicate ID numbers it only returns the first match. I need it to return the first match, then the second, then third (because the dates are sorted so that the newest date is always first for every ID number) and so on BUT I can't have added rows. Is there any way to do this? Since I don't know how else to show you I have included an example picture of what I need. data joining. Not sure if I made myself clear but thank you in advance!
You can add a second column to create subid's that follow the order of the rownumbers. Then you can use an inner_join to join everything together.
Since you don't have example data sets I created two to show the principle.
df1 <- df1 %>%
group_by(ID) %>%
mutate(follow_id = row_number())
df2 <- df2 %>% group_by(ID) %>%
mutate(follow_id = row_number())
outcome <- df1 %>% inner_join(df2)
# A tibble: 7 x 3
# Groups: ID [?]
ID sub_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an autoincrement id for each group, then join as usual
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
df21 <- df2 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05
Let's say that I have a dataset which consists of multiple observations. Sometimes a single observation is actually multple ones that have been condensed into one. To keep track of how many observations were merged together an integer-valued variable exists.
What I want to do is to reverse this process.
Example code:
library(tidyverse)
# Example tibble
df_ex <- tibble(
var1 = seq(1, 3),
var2 = c('Some', 'Random', 'Text'),
var3 = c(1, 3, 2)
)
The code above produces the following tibble:
# A tibble: 3 x 3
var1 var2 var3
<int> <chr> <dbl>
1 1 Some 1
2 2 Random 3
3 3 Text 2
The desired tibble after some tidyverse magic would be:
# A tibble: 6 x 3
var1 var2 var3
<dbl> <chr> <dbl>
1 1 Some 1
2 2 Random 1
3 2 Random 1
4 2 Random 1
5 3 Text 1
6 3 Text 1
There are multiple ways to do this in tidyverse
1) Do a group by 'var1' (assuming it is unique), create a list column for 'var3' by replicating 1 with the value of 'var3' and then unnest
df_ex %>%
group_by(var1) %>%
mutate(var3 = list(rep(1, var3))) %>%
unnest
2) Use map to get the list column for 'var3' and unnest
df_ex %>%
mutate(var3 = map(var3, ~ rep(1, .x))) %>%
unnest
3) With base R, replicate the sequence of rows to expand the data and then transform the 'var3' to 1
transform(df_ex[rep(seq_len(nrow(df_ex)), df_ex$var3),], var3 = 1)
I have a dataframe with groups that essentially looks like this
DF <- data.frame(state = c(rep("A", 3), rep("B",2), rep("A",2)))
DF
state
1 A
2 A
3 A
4 B
5 B
6 A
7 A
My question is how to count the number of consecutive rows where the first value is repeated in its first "block". So for DF above, the result should be 3. The first value can appear any number of times, with other values in between, or it may be the only value appearing.
The following naive attempt fails in general, as it counts all occurrences of the first value.
DF %>% mutate(is_first = as.integer(state == first(state))) %>%
summarize(count = sum(is_first))
The result in this case is 5. So, hints on a (preferably) dplyr solution to this would be appreciated.
You can try:
rle(as.character(DF$state))$lengths[1]
[1] 3
In your dplyr chain that would just be:
DF %>% summarize(count_first = rle(as.character(state))$lengths[1])
# count_first
# 1 3
Or to be overzealous with piping, using dplyr and magrittr:
library(dplyr)
library(magrittr)
DF %>% summarize(count_first = state %>%
as.character %>%
rle %$%
lengths %>%
first)
# count_first
# 1 3
Works also for grouped data:
DF <- data.frame(group = c(rep(1,4),rep(2,3)),state = c(rep("A", 3), rep("B",2), rep("A",2)))
# group state
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 2 B
# 6 2 A
# 7 2 A
DF %>% group_by(group) %>% summarize(count_first = rle(as.character(state))$lengths[1])
# # A tibble: 2 x 2
# group count_first
# <dbl> <int>
# 1 1 3
# 2 2 1
No need of dplyrhere but you can modify this example to use it with dplyr. The key is the function rle
state = c(rep("A", 3), rep("B",2), rep("A",2))
x = rle(state)
DF = data.frame(len = x$lengths, state = x$values)
DF
# get the longest run of consecutive "A"
max(DF[DF$state == "A",]$len)