I have a dataset like this:
df_have <- data.frame(id = rep("a",3), time = c(1,3,5), flag = c(0,1,1))
The data has one row per time per id but I need to have the second row duplicated and put into the data.frame like this:
df_want <- data.frame(id = rep("a",4), time = c(1,3,3,5), flag = c(0,0,1,1))
The flag variable should become 0 in the newly added row, with all other information staying the same. Any help would be appreciated.
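For the single-id case as posed, a minimal dplyr sketch (hand-picking the slice positions for this particular data) would be:

library(dplyr)

# Duplicate row 2, then zero the flag on the first of the two copies,
# so the new row carries flag 0 and the original keeps flag 1
df_have %>%
  slice(c(1, 2, 2, 3)) %>%
  mutate(flag = replace(flag, 2, 0))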
Edit:
The comments below are helpful, but I also need to do this in groups by id, and some ids have more rows than others. After reading this back and seeing the comments, I realize the logic isn't clear. My original data does not have a count variable (what I call flag), but the final output needs one. What I need is for every row except the first and last time point (within each id) to be duplicated, with a counter that increments each time a duplicated row is created and holds its value until the next duplicate.
df_have2 <- data.frame(id = c(rep("a", 3), rep("b", 4)),
                       time = c(1, 3, 5, 1, 3, 5, 7))
df_want2 <- data.frame(id = c(rep("a", 4), rep("b", 6)),
                       time = c(1, 3, 3, 5, 1, 3, 3, 5, 5, 7),
                       flag = c(1, 1, 2, 2, 1, 1, 2, 2, 3, 3))
We can expand the data with slice and then create 'flag' by matching 'time' against the unique values of 'time' and taking the lag of the result:
library(dplyr)
df_have2 %>%
  group_by(id) %>%
  slice(rep(row_number(), c(1, rep(2, n() - 2), 1))) %>%
  mutate(flag = lag(match(time, unique(time)), default = 1)) %>%
  ungroup
# A tibble: 10 x 3
# id time flag
# <chr> <dbl> <dbl>
# 1 a 1 1
# 2 a 3 1
# 3 a 3 2
# 4 a 5 2
# 5 b 1 1
# 6 b 3 1
# 7 b 3 2
# 8 b 5 2
# 9 b 5 3
#10 b 7 3
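To see how the slice step sizes the repeats: for a group of n rows it keeps the first and last row once and doubles everything in between. For example:

n <- 4                   # group "b" has 4 rows
c(1, rep(2, n - 2), 1)   # 1 2 2 1: first and last kept once, middle rows doubled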
In R, I'm trying to average a subset of a column based on the value of an ID in another column. For example, among 100 IDs, I might pick ID number 5 and average the values in a second column that correspond to ID 5; then I want to do the same for each of the remaining IDs. What function should I use?
Using dplyr:
library(dplyr)
dt <- data.frame(ID = rep(1:3, each=3), values = runif(9, 1, 100))
dt %>%
  group_by(ID) %>%
  summarise(avg = mean(values))
Output:
ID avg
<int> <dbl>
1 1 41.9
2 2 79.8
3 3 39.3
Data:
ID values
1 1 8.628964
2 1 99.767843
3 1 17.438596
4 2 79.700918
5 2 87.647472
6 2 72.135906
7 3 53.845573
8 3 50.205122
9 3 13.811414
We can use a group-by mean. In base R, this can be done with aggregate:
dt <- data.frame(ID = rep(1:3, each=3), values = runif(9, 1, 100))
aggregate(values ~ ID, dt, mean)
Output:
ID values
1 1 40.07086
2 2 53.59345
3 3 47.80675
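Another base R option, if a named vector is acceptable instead of a data frame, is tapply:

# Returns a named numeric vector of per-ID means
tapply(dt$values, dt$ID, mean)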
I have a dataset like this:
library(data.table)
set.seed(71)

dat <- data.table(region = rep(c('A','B'), each = 10),
                  place = rep(c('C','D'), 10),
                  start = sample.int(5, 20, replace = TRUE),
                  end = sample.int(10, 20, replace = TRUE),
                  count = sample.int(50, 20, replace = TRUE),
                  para1 = rnorm(20, 3, 1),
                  para2 = rnorm(20, 4, 1))
I would like to loop through this data to conditionally generate another table with the following columns:
region, place, start, end, count, count0
with potentially more than one row for each row in dat.
In the new table, the region, place, and start columns will be copied over from dat, while end, count, and count0 will be generated.
Here are the rules applied to each row of dat (each source row keeps producing output rows until its count reaches 0):
end = end + 1
if (count == 0) {
  count0 = 0
} else {
  count0 = start * para1 + end * para2
}
if (count0 > count) {
  count0 = count
}
count = count - count0
I tried to use a combination of a for loop, if statements, and mutate, but could not get it right.
I expect to get a table like this after going through the first two rows of dat:
region place start end count count0
A C 2 7 6.01673062 17.98326938
A C 2 8 0 6.01673062
A D 3 2 5.34392419 7.65607581
A D 3 3 0 5.34392419
The first two rows of dat I have are:
region place start end count para1 para2
A C 2 6 24 0.39412969 2.45643
A D 3 1 13 0.64372127 2.862456
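For reference, a literal translation of the rules above into an explicit loop could look like the sketch below (assuming, per the expected output, that each source row keeps generating rows until its count hits 0); the vectorized answers that follow are much faster:

expand_row <- function(r) {
  out <- list()
  count <- r$count
  end <- r$end
  repeat {
    # apply the stated rules once per generated row
    end <- end + 1
    count0 <- if (count == 0) 0 else r$start * r$para1 + end * r$para2
    if (count0 > count) count0 <- count
    count <- count - count0
    out[[length(out) + 1]] <- data.frame(region = r$region, place = r$place,
                                         start = r$start, end = end,
                                         count = count, count0 = count0)
    if (count == 0) break
  }
  do.call(rbind, out)
}

result <- do.call(rbind, lapply(seq_len(nrow(dat)), function(i) expand_row(dat[i, ])))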
Edit: Here's a lazy approach that should still be extremely fast, at the cost of temporarily making rows that we'll remove at the end. Rather than figure out how many copies to make of each row, I make a bunch of copies of every row, then apply fast vectorized calculations to get the updated end, count, and count0 values, and remove the rows we don't need.
library(dplyr); library(tidyr)
output <-
  dat %>%
  mutate(orig_row = row_number()) %>%
  uncount(10) %>% # I'm assuming here that 10 copies per source row is enough
  group_by(orig_row) %>%
  mutate(row = row_number()) %>%
  mutate(
    end = end + row,
    count0 = pmin(count, start * para1 + end * para2), # Edit #2
    count = count - cumsum(count0)
  ) %>%
  filter(lag(count, default = 0) >= 0) %>%
  mutate(count = pmax(0, count),
         count0 = if_else(count == 0, lag(count), count0))
output
# A tibble: 4 x 10
# Groups: orig_row [2]
region place start end count para1 para2 orig_row row count0
<chr> <chr> <int> <int> <dbl> <dbl> <dbl> <int> <int> <dbl>
1 A C 2 7 6.02 0.394 2.46 1 1 18.0
2 A C 2 8 0 0.394 2.46 1 2 6.02
3 A D 3 2 5.34 0.644 2.86 2 1 7.66
4 A D 3 3 0 0.644 2.86 2 2 5.34
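A quick sanity check on the 10-copies assumption (a sketch): every source row should end with count 0, and if any group never reaches 0, the uncount multiplier needs to be larger.

# Each orig_row group should contain a final row with count == 0
output %>%
  group_by(orig_row) %>%
  summarise(finished = any(count == 0))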
Initial answer:
I imagine this is in the neighborhood.
Caveat: I didn't get the same sample data as you showed, nor do I understand how the specific numbers in your sample would generate the suggested output. For instance, from the first row of dat you show (different from what I got), the first count0 should be 2*0.394 + 6*2.456 = 15.527, no?
My approach here is to calculate count0, and then figure out how many of count fit into it, then make that many copies of the row, decrementing count by count0 with each row.
library(dplyr); library(tidyr)
output <- dat %>%
  mutate(end = end + 1,
         orig_data = row_number(),
         count0 = if_else(count == 0, 0,
                          start * para1 + end * para2),
         copies = 1 + count %/% count0) %>%
  uncount(copies) %>%
  group_by(orig_data) %>%
  mutate(row = row_number() - 1,
         count = count - row * count0)
BTW, my dat initializes differently using set.seed(71). Could you please confirm if your data initializes as provided in the OP? It will be easier to get aligned if we can start from the same place.
> head(dat)
region place start end count para1 para2
1: A C 2 7 19 3.400587 2.757140
2: A D 3 3 31 1.503740 6.089518
3: A C 2 8 2 2.561869 5.236298
4: A D 2 3 33 3.069835 3.770121
5: A C 2 2 21 2.989221 3.547926
6: A D 5 5 32 2.720636 5.379352
I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use a left join because the first (left) file needs to stay as it is: I don't want the join to return all combinations and add rows. But I also don't want it to work like VLOOKUP in Excel, which finds the first match and returns it, so that duplicate ID numbers only ever get the first match.
I need it to return the first match, then the second, then the third (the dates are sorted so that the newest date is always first for every ID number), and so on, but without added rows. Is there any way to do this? Since I don't know how else to show you, I have included an example picture of what I need. Not sure if I made myself clear, but thank you in advance!
You can add a second column of sub-IDs that follow the row numbers within each ID. Then you can use an inner_join to join everything together.
Since you don't have example data sets, I created two to show the principle.
df1 <- df1 %>%
  group_by(ID) %>%
  mutate(follow_id = row_number())

df2 <- df2 %>%
  group_by(ID) %>%
  mutate(follow_id = row_number())

outcome <- df1 %>% inner_join(df2)
# A tibble: 7 x 3
# Groups: ID [?]
ID follow_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2, 3, 4, 4, 4))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4),
                  var1 = letters[1:11])
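Note that inner_join(df2) with no by argument joins on all shared columns (here both ID and follow_id), which is exactly what prevents extra rows. To make the keys explicit:

outcome <- df1 %>% inner_join(df2, by = c("ID", "follow_id"))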
You need a secondary id column. Since you need the first n matches, just group by the id, create an auto-incrementing id within each group, then join as usual:
library(dplyr)

df1 <- data.frame(id = c(1, 1, 2, 3, 4, 4, 4))
d1 <- sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by = "day"), 11)
df2 <- data.frame(id = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4), d1, d2 = d1 + sample.int(50, 11))

df11 <- df1 %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  ungroup()

df21 <- df2 %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  ungroup()

left_join(df11, df21, by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05
I have a dataframe with groups that essentially looks like this
DF <- data.frame(state = c(rep("A", 3), rep("B",2), rep("A",2)))
DF
state
1 A
2 A
3 A
4 B
5 B
6 A
7 A
My question is how to count the length of the first "block" of consecutive rows containing the first value. So for DF above, the result should be 3. The first value can appear any number of times later, with other values in between, or it may be the only value appearing.
The following naive attempt fails in general, as it counts all occurrences of the first value.
DF %>% mutate(is_first = as.integer(state == first(state))) %>%
  summarize(count = sum(is_first))
The result in this case is 5. So, hints on a (preferably) dplyr solution to this would be appreciated.
You can try:
rle(as.character(DF$state))$lengths[1]
[1] 3
In your dplyr chain that would just be:
DF %>% summarize(count_first = rle(as.character(state))$lengths[1])
# count_first
# 1 3
Or to be overzealous with piping, using dplyr and magrittr:
library(dplyr)
library(magrittr)
DF %>% summarize(count_first = state %>%
                   as.character %>%
                   rle %$%
                   lengths %>%
                   first)
# count_first
# 1 3
This also works for grouped data:
DF <- data.frame(group = c(rep(1, 4), rep(2, 3)), state = c(rep("A", 3), rep("B", 2), rep("A", 2)))
# group state
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 2 B
# 6 2 A
# 7 2 A
DF %>% group_by(group) %>% summarize(count_first = rle(as.character(state))$lengths[1])
# # A tibble: 2 x 2
# group count_first
# <dbl> <int>
# 1 1 3
# 2 2 1
No need for dplyr here, but you can modify this example to use it with dplyr. The key is the function rle:
state = c(rep("A", 3), rep("B",2), rep("A",2))
x = rle(state)
DF = data.frame(len = x$lengths, state = x$values)
DF
# get the longest run of consecutive "A"
max(DF[DF$state == "A",]$len)
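For the first block specifically (what the question asks for), the same rle result gives the answer directly:

# length of the first run, here 3
x$lengths[1]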
I am looking to filter and retrieve all rows from all groups where a specific row meets a condition; in my example, when the value at the highest day per group is more than 3. This is obviously simplified, but it captures the essentials.
# Dummy data
id = rep(letters[1:3], each = 3)
day = rep(1:3, 3)
value = c(2,3,4,2,3,3,1,2,4)
my_data = data.frame(id, day, value, stringsAsFactors = FALSE)
My approach works, but it seems somewhat inelegant:
require(dplyr)
foo <- my_data %>%
  group_by(id) %>%
  slice(which.max(day)) %>% # gets the highest day
  filter(value > 3)         # filters the rows with value > 3
## semi_join with the original data frame gives the required result:
semi_join(my_data, foo, by = 'id')
id day value
1 a 1 2
2 a 2 3
3 a 3 4
4 c 1 1
5 c 2 2
6 c 3 4
Is there a more succinct way to do this?
my_data %>% group_by(id) %>% filter(value[which.max(day)] > 3)
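This works because value[which.max(day)] > 3 evaluates to a single TRUE or FALSE per group, which filter recycles across all of that group's rows. For comparison, a base R sketch of the same logic (assuming the same my_data):

# For each id, test the value at the maximum day, then keep the matching ids
ok <- with(my_data,
           tapply(seq_along(id), id, function(i) value[i][which.max(day[i])] > 3))
my_data[my_data$id %in% names(ok)[ok], ]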