I am working with longitudinal data and want to remove the observations of people who were measured only once (ids 5, 7, and 9 below). How do I do this? Assume id is the unique identifier for people in the data set, so I want to remove every observation associated with ids 5, 7, and 9. I've played with duplicated(), unique(), table(), and count() from plyr, but haven't been successful. Example data below.
y <- sample(1:10, 20, replace = TRUE)
x <- sample(c(0, 1), 20, replace = TRUE)
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,7,8,8,8,9)
data <- data.frame(y, x, id)
You would have received immediate assistance had you tagged the post as r and data.frame.
Here, the ! ("not") operator is used to drop the rows whose id matches any of the values in c(5, 7, 9):
> data[!data$id %in% c(5,7,9),]
y x id
1 3 0 1
2 2 1 1
3 3 0 1
4 9 0 2
5 9 0 2
6 1 0 2
7 9 0 3
8 7 0 3
9 4 0 3
10 9 1 4
11 7 0 4
12 8 1 4
14 4 1 6
15 1 0 6
17 2 0 8
18 8 0 8
19 2 0 8
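The ids don't have to be hard-coded, though. As a more general sketch, you could compute which ids occur only once and drop them:
# keep only ids that appear more than once
counts <- table(data$id)
data[data$id %in% names(counts[counts > 1]), ]
# or, equivalently, in a single step with ave()
data[ave(data$id, data$id, FUN = length) > 1, ]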
I have a data.frame ordered by ID with a column of numeric values that I would like to bin into groups, increasing the group number only when a certain target value/trigger is reached or surpassed. I haven't had success with seq(), seq_along(), or data.table's cumsum(), but I'm sure there must be a way.
An example data.frame with the desired group column is below. Here, the sequence generating the group column should increase only when a number >= 300 appears in the value column.
dat = data.frame(ID=1:10, value=c(0,2,1,12,68,300,41,0,72959,51), group=c(1,1,1,1,1,2,2,2,3,3))
> dat
ID value group
1 1 0 1
2 2 2 1
3 3 1 1
4 4 12 1
5 5 68 1
6 6 300 2
7 7 41 2
8 8 0 2
9 9 72959 3
10 10 51 3
We may use cumsum() on a logical vector to create the group:
library(dplyr)
dat %>%
  mutate(group2 = cumsum(value >= 300) + 1)
-output
ID value group group2
1 1 0 1 1
2 2 2 1 1
3 3 1 1 1
4 4 12 1 1
5 5 68 1 1
6 6 300 2 2
7 7 41 2 2
8 8 0 2 2
9 9 72959 3 3
10 10 51 3 3
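Since cumsum() and the comparison are both base R, the same thing works without dplyr:
# base R equivalent of the pipeline above
dat$group2 <- cumsum(dat$value >= 300) + 1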
I have a vector of consecutive states (you can only go from 3 to 4, from 4 to 5, etc., and there's no way back):
cons_states <- c(3,4,5,6)
Simultaneously, I have this data:
from to status id
2 3 1 1
2 4 0 2
2 5 0 3
2 6 0 4
2 8 0 5
2 16 0 6
3 4 0 7
3 8 0 8
3 16 1 9
16 3 0 10
16 4 0 11
16 5 0 12
16 6 0 13
16 8 1 14
8 3 0 15
8 4 1 16
8 5 0 17
8 6 0 18
There are two assumptions I would like my data to satisfy:
First, once a state has been visited there's no way back. For example, once state 3 has been visited (to == 3 & status == 1), there should be no further possibility of moving to state 3 from later states (no more rows with to == 3):
from to status id
2 3 1 1
2 4 0 2
2 5 0 3
2 6 0 4
2 8 0 5
2 16 0 6
3 4 0 7
3 8 0 8
3 16 1 9
16 4 0 11
16 5 0 12
16 6 0 13
16 8 1 14
8 4 1 16
8 5 0 17
8 6 0 18
I managed to do it with the following (it's ugly, I realize, but it works):
# rows where a state was actually reached
ind <- data[which(data$status == 1), ]
res <- NULL
for (j in 1:nrow(ind)) {
  ind_to <- ind[j, "to"]
  ind_id <- ind[j, "id"]
  # later rows that move back to the already-visited state
  id_remove <- data[which(data$to == ind_to & data$id > ind_id), "id"]
  if (length(id_remove) == 0) next
  res <- c(res, id_remove)
}
This gives me a vector of IDs to remove from my data, which satisfies my first assumption.
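A vectorized sketch of the same filter, assuming each state is reached with status == 1 at most once as in the example above, could replace the loop:
# first id at which each state in "to" was actually reached
first_hit <- with(data[data$status == 1, ], setNames(id, to))
# keep a row unless it moves to a state visited at an earlier id
hit_id <- first_hit[as.character(data$to)]
data1 <- data[is.na(hit_id) | data$id <= hit_id, ]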
Second, I would like to meet the assumption that when moving to a state that belongs to the vector cons_states, we can only go to the next consecutive state not yet visited. As the data shows, if the state in the from column belongs to cons_states, the problem doesn't arise; otherwise, a move into cons_states is only valid if it targets that next consecutive state (moves to states outside cons_states remain possible).
My desired output would be:
from to status id
2 3 1 1
2 8 0 5
2 16 0 6
3 4 0 7
3 8 0 8
3 16 1 9
16 4 0 11
16 8 1 14
8 4 1 16
I have spent a lot of time trying to figure this out, but I keep getting stuck writing complicated loops that don't work. Is there a way to do this that isn't overly complicated?
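One hedged sketch for the second assumption, assuming rows stay ordered by id, rows sharing a from value form a single decision point, and data1 holds the result of the first filter, is to walk through the from groups while tracking which cons_states have been reached:
visited <- integer(0)    # cons_states reached so far
keep <- logical(nrow(data1))
groups <- split(seq_len(nrow(data1)),
                factor(data1$from, levels = unique(data1$from)))
for (g in groups) {
  remaining <- setdiff(cons_states, visited)
  nxt <- if (length(remaining) > 0) remaining[1] else -Inf  # next unvisited state
  keep[g] <- !(data1$to[g] %in% cons_states) |  # moves outside cons_states are fine
    data1$from[g] %in% cons_states |            # so are moves from inside it
    data1$to[g] == nxt                          # otherwise, only the next state
  chosen <- data1$to[g][data1$status[g] == 1]   # state actually moved to here
  visited <- union(visited, chosen[chosen %in% cons_states])
}
data1[keep, ]
This reproduces the desired output on the example above, but it hinges on those ordering assumptions.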
I am trying to create duplicate rows by group. The number of duplicate rows I want to create varies by group, and in the new rows I want to fix the value of one column, Attended = 0.
A minimal working example of the data set DF I am working with is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
2 5 1 2
2 5 1 4
3 7 0 1
For the example above, suppose I want every person (ID) to have 5 rows, with Demo the same across all rows for each individual. Thus, I have to create 1 new row for ID = 1, 3 for ID = 2, and 4 for ID = 3 (I would like to calculate these counts dynamically for each subgroup). In the rows I generate, I want Attended = 0 and t to take on the values of the missing indices, so that the final output is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
1 3 0 2
2 5 1 2
2 5 1 4
2 5 0 1
2 5 0 3
2 5 0 5
3 7 0 1
3 7 0 2
3 7 0 3
3 7 0 4
3 7 0 5
I have been able to create duplicate rows by group, but haven't been able to figure out how to create a different number of duplicates per participant and correctly fill in the index column t.
Here is what I have working:
DF %>%
  group_by(ID) %>%
  rbind(., mutate(., t = row_number()))
I have been trying to create the right number of duplicates using slice() and trying to get the t value to be exactly what I want but to no avail.
Any help would be appreciated!
One tidyverse possibility could be (note the nesting(ID, Demo), which carries Demo into the new rows):
library(dplyr)
library(tidyr)
df %>%
  complete(t, nesting(ID, Demo), fill = list(Attended = 0)) %>%
  arrange(ID)
t ID Demo Attended
<int> <int> <int> <dbl>
1 1 1 3 1
2 2 1 3 0
3 3 1 3 1
4 4 1 3 0
5 5 1 3 1
6 1 2 5 0
7 2 2 5 1
8 3 2 5 0
9 4 2 5 1
10 5 2 5 0
11 1 3 7 0
12 2 3 7 0
13 3 3 7 0
14 4 3 7 0
15 5 3 7 0
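complete() only fills in values of t that already appear somewhere in the data; if some index were missing from every ID, you could supply the full range explicitly (assuming t should run over 1:5):
df %>%
  complete(t = 1:5, nesting(ID, Demo), fill = list(Attended = 0)) %>%
  arrange(ID)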
I'm trying to get the spread() function to work with duplicates in the key column. Yes, this has been covered before, but I can't seem to get it to work, and I've spent the better part of a day on it (I'm somewhat new to R).
I have two columns of data. The first column, snowday, indexes the day of the winter season, with the corresponding snow depth in the depth column. This is several years of data (~62 years), so there should be sixty-two years' worth of first, second, third, etc. days in the snowday column, which produces duplicates in snowday:
snowday row depth
1 1 0
1 2 0
1 3 0
1 4 0
1 5 0
1 6 0
...
75 4633 24
75 4634 4
75 4635 6
75 4636 20
75 4637 29
75 4638 1
I added a "row" column to make the data frame more transient (which I vaguely understand to be hones so 1:4638 rows is the total measurements taken over ~62 years at 75 days per year . Now i'd like to spread it wide:
wide <- spread(seasondata, key = snowday, value = depth, fill = 0)
and I get all zeros:
row 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
What I want it to look like is something like this (the columns are defined by snowday, and the values in each column are the various depths recorded for that day across the years, e.g. days 1 through 14):
1 2 3 4 5 6 7 8 9 10 11 12 13 14
2 1 3 4 0 0 1 0 2 8 9 19 0 3
0 8 0 0 0 4 0 6 6 0 1 0 2 0
3 5 0 0 0 2 0 1 0 2 7 0 12 4
I think I'm fundamentally missing something here. I've tried working through drop = TRUE and convert = TRUE, and the output values are either all zeros or all NAs depending on how I tinker. Also, all values in seasondata are integers. Any thoughts?
It seems to me that what you wish to do is split up the depth column according to the values of snowday, and then bind all 75 columns together.
There is a complication: 62 * 75 = 4650, not 4638, so I assume some years have fewer than 75 observed snowdays. That is, some of the 75 columns (snowdays) will not have 62 observations. We'll make sure all 75 columns are 62 entries long by padding the short ones with NAs.
I make some fake data as an example. We observe 3 "years" of data for snowdays 1 and 2, but only 2 "years" of data for snowdays 3 and 4.
set.seed(1)
seasondata <- data.frame(
snowday = c(rep(1:2, each = 3), rep(3:4, each = 2)),
depth = round(runif(10, 0, 10), 0))
# snowday depth
# 1 1 3
# 2 1 4
# 3 1 6
# 4 2 9
# 5 2 2
# 6 2 9
# 7 3 9
# 8 3 7
# 9 4 6
# 10 4 1
We first figure out how long each column should be. In your case, m == 62; in my example, m == 3 (the years of data).
m <- max(table(seasondata$snowday))
Now, we use the by function to split up depth by values of snowdays, and fill short columns with NAs, and finally cbind all the columns together:
out <- do.call(cbind,
  by(seasondata$depth, seasondata$snowday,
     function(x) {
       c(x, rep(NA, m - length(x)))
     }))
out
# 1 2 3 4
# [1,] 3 9 9 6
# [2,] 4 2 7 1
# [3,] 6 9 NA NA
Using spread:
You can use spread() if you wish. In this case, you have to define row correctly: row should count occurrences within each snowday value, i.e. it is 1 for the first observation with snowday == 1, 2 for the second observation with snowday == 1, and it restarts at 1 for the first observation with snowday == 2, and so on.
seasondata$row <- unlist(sapply(rle(seasondata$snowday)$lengths, seq_len))
seasondata
# snowday depth row
# 1 1 3 1
# 2 1 4 2
# 3 1 6 3
# 4 2 9 1
# 5 2 2 2
# 6 2 9 3
# 7 3 9 1
# 8 3 7 2
# 9 4 6 1
# 10 4 1 2
Now we can use spread:
library(tidyr)
spread(seasondata, key = snowday, value = depth, fill = NA)
# row 1 2 3 4
# 1 1 3 9 9 6
# 2 2 4 2 7 1
# 3 3 6 9 NA NA
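As an aside, spread() has since been superseded by pivot_wider() in tidyr. A sketch of the same reshape, assuming the per-snowday index is built the same way, might look like:
library(dplyr)
library(tidyr)
seasondata %>%
  group_by(snowday) %>%
  mutate(row = row_number()) %>%   # index within each snowday
  ungroup() %>%
  pivot_wider(names_from = snowday, values_from = depth)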
I have a data frame that looks like this:
day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3
and another like
day.of.week count
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13
I want to add the values from df1 to df2 based on day.of.week. I was trying to use ddply:
total <- ddply(merge(total, subtotal, all.x = TRUE, all.y = TRUE),
               .(day.of.week), summarize, count = sum(count))
This almost works, but merge() collapses rows that are identical in both frames. For instance, for day.of.week == 5 above, instead of getting two records each with count 1 (a total of 2), the rows are merged into a single record with count 1, so I get a total of 1:
day.of.week count
1 0 3
2 0 17
3 1 6
4 2 1
5 3 1
6 4 1
7 4 5
8 5 1
9 6 3
10 6 13
There is no need to merge. You can simply do:
ddply(rbind(d1, d2), .(day.of.week), summarize, sum_count = sum(count))
I have assumed that both data frames have identical column names, day.of.week and count.
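A base R sketch of the same idea (under the same column-name assumption) would be:
# stack the two frames and sum count within each day.of.week
aggregate(count ~ day.of.week, data = rbind(d1, d2), FUN = sum)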
In addition to the suggestion Ben gave you about using merge, you could also do this simply using subsetting:
d1 <- read.table(textConnection(" day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3"),sep="",header = TRUE)
d2 <- read.table(textConnection(" day.of.week count1
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13"),sep = "",header = TRUE)
idx <- match(d1[, 1], d2[, 1])
d2[idx, 2] <- d2[idx, 2] + d1[, 2]
> d2
day.of.week count1
1 0 20
2 1 6
3 2 1
4 3 2
5 4 6
6 5 2
7 6 16
This assumes there are no repeated day.of.week rows, since match() returns only the first match.
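If some days appeared in only one of the two frames, match() would also miss them. A hedged alternative, assuming both frames share the column names day.of.week and count, is an outer merge followed by a row sum:
# the outer merge keeps days present in only one frame
m <- merge(d1, d2, by = "day.of.week", all = TRUE, suffixes = c(".1", ".2"))
m$count <- rowSums(m[, c("count.1", "count.2")], na.rm = TRUE)
m[, c("day.of.week", "count")]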