Sequence along numeric vector, increasing sequence only at trigger values [duplicate]

This question already has an answer here:
Incrementing an ID number each time a condition is met
(1 answer)
Closed 1 year ago.
I have a data.frame ordered by ID with a column of numeric values that I would like to bin into groups, increasing the group number only when a certain target/trigger value is reached or exceeded. I haven't had success with seq(), seq_along(), or data.table cumsum(), but I'm sure there must be a way.
Example data.frame with desired group column below. In this example, the sequence generating the group column should increase only when a number >= 300 appears in the value column.
dat = data.frame(ID=1:10, value=c(0,2,1,12,68,300,41,0,72959,51), group=c(1,1,1,1,1,2,2,2,3,3))
> dat
ID value group
1 1 0 1
2 2 2 1
3 3 1 1
4 4 12 1
5 5 68 1
6 6 300 2
7 7 41 2
8 8 0 2
9 9 72959 3
10 10 51 3

We may use cumsum() on a logical vector to create the group:
library(dplyr)
dat %>%
  mutate(group2 = cumsum(value >= 300) + 1)
-output
ID value group group2
1 1 0 1 1
2 2 2 1 1
3 3 1 1 1
4 4 12 1 1
5 5 68 1 1
6 6 300 2 2
7 7 41 2 2
8 8 0 2 2
9 9 72959 3 3
10 10 51 3 3
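The same logical-vector idea also works in base R without dplyr (a minimal sketch using the dat defined above):
# base R: TRUE/FALSE coerces to 1/0 inside cumsum()
dat$group2 <- cumsum(dat$value >= 300) + 1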

anti-join not working - giving 0 rows, why?

I am trying to use anti-join exactly as I have done many times to establish which rows across two datasets do not have matches for two specific columns. For some reason I keep getting 0 rows in the result and I can't understand why.
Below are two dummy data frames containing the two columns I am trying to compare. You will see one is missing an entry (df1, SITE no. 2, PLOT no. 8), so when I use anti-join to compare the two data frames this entry should be returned, but I just get a result of 0 rows.
a<- seq(1:3)
SITE <- rep(a, times = c(16,15,1))
PLOT <- c(1:16,1:7,9:16,1)
df1 <- data.frame(SITE,PLOT)
SITE <- rep(a, times = c(16,16,1))
PLOT <- c(rep(1:16,2),1)
df2 <- data.frame(SITE,PLOT)
 df1            df2
 SITE PLOT      SITE PLOT
    1    1         1    1
    1    2         1    2
    1    3         1    3
    1    4         1    4
    1    5         1    5
    1    6         1    6
    1    7         1    7
    1    8         1    8
    1    9         1    9
    1   10         1   10
    1   11         1   11
    1   12         1   12
    1   13         1   13
    1   14         1   14
    1   15         1   15
    1   16         1   16
    2    1         2    1
    2    2         2    2
    2    3         2    3
    2    4         2    4
    2    5         2    5
    2    6         2    6
    2    7         2    7
    2    9         2    8
    2   10         2    9
    2   11         2   10
    2   12         2   11
    2   13         2   12
    2   14         2   13
    2   15         2   14
    2   16         2   15
    3    1         2   16
                   3    1
a <- anti_join(df1, df2, by=c('SITE', 'PLOT'))
a
<0 rows> (or 0-length row.names)
I'm sure the answer is obvious but I can't see it.
The answer can be found in the help file:
anti_join() returns all rows from x without a match in y.
So reversing the order of df1 and df2 will give you what you expect.
anti_join(df2, df1, by=c('SITE', 'PLOT'))
# SITE PLOT
# 1 2 8
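If you are unsure in advance which direction the mismatch lies, a quick check (a sketch, not part of the original answer; the "source" label column is only illustrative) is to run anti_join() both ways and stack the results:
library(dplyr)
# rows of df1 absent from df2, then rows of df2 absent from df1
mismatches <- bind_rows(
  df1_only = anti_join(df1, df2, by = c("SITE", "PLOT")),
  df2_only = anti_join(df2, df1, by = c("SITE", "PLOT")),
  .id = "source"
)
mismatches
# expected:
#     source SITE PLOT
# 1 df2_only    2    8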

Remove groups with only one individual in R [duplicate]

This question already has an answer here:
Select groups with more than one distinct value per group [duplicate]
(1 answer)
Closed 1 year ago.
Consider the following dataset. The data is grouped with either one or two people per group. However, an individual may have several entries.
group <- c(1,1,1,1,2,2,3,3,3,3,4,4)
individualID <- c(1,1,2,2,3,3,5,5,6,6,7,7)
X <- c(0,1,1,1,1,1,1,1,1,1,0,1)
df1 <- data.frame(group, individualID, X)
> df1
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 2 3 1
6 2 3 1
7 3 5 1
8 3 5 1
9 3 6 1
10 3 6 1
11 4 7 0
12 4 7 1
From the above, groups 1 and 3 have 2 individuals each, whereas groups 2 and 4 have 1 individual each.
> aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
group individualID
1 1 2
2 2 1
3 3 2
4 4 1
How can I subset the data to keep only groups that have more than 1 individual, i.e. omit groups with only 1 individual?
I should end up with only group 1 and group 3.
You could make a lookup table to identify the groups that have more than one unique individualID (similar to what you did with aggregate), then filter df1 based on that:
library(dplyr)
lookup <- df1 %>%
  group_by(group) %>%
  summarise(count = n_distinct(individualID)) %>%
  filter(count > 1)
df1 %>% filter(group %in% unique(lookup$group))
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 3 5 1
6 3 5 1
7 3 6 1
8 3 6 1
Or, as @MrGumble suggests above, you could also merge df1 after creating lookup:
merge(df1, lookup)
group individualID X count
1 1 1 0 2
2 1 1 1 2
3 1 2 1 2
4 1 2 1 2
5 3 6 1 2
6 3 6 1 2
7 3 5 1 2
8 3 5 1 2
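The lookup step can also be folded into a single grouped filter; a minimal equivalent sketch, assuming the same df1 as above:
library(dplyr)
df1 %>%
  group_by(group) %>%
  filter(n_distinct(individualID) > 1) %>%   # keep groups with 2+ distinct individuals
  ungroup()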

R: Return values in a column when the value in another column becomes negative for the first time

For each ID, I want to return the value in the 'distance' column where the value becomes negative for the first time. If the value does not become negative at all, return the value 99 (or some other random number) for that ID. A sample data frame is given below.
df <- data.frame(ID=c(rep(1, 4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)),distance=rep(1:4,5), value=c(1,4,3,-1,2,1,-4,1,3,2,-1,1,-4,3,2,1,2,3,4,5))
> df
ID distance value
1 1 1 1
2 1 2 4
3 1 3 3
4 1 4 -1
5 2 1 2
6 2 2 1
7 2 3 -4
8 2 4 1
9 3 1 3
10 3 2 2
11 3 3 -1
12 3 4 1
13 4 1 -4
14 4 2 3
15 4 3 2
16 4 4 1
17 5 1 2
18 5 2 3
19 5 3 4
20 5 4 5
The desired output is as follows
> df2
ID first_negative_distance
1 1 4
2 2 3
3 3 3
4 4 1
5 5 99
I tried but couldn't figure out how to do it through dplyr. Any help would be much appreciated. The actual data I'm working on has thousands of ID's with 30 different distance levels for each. Bear in mind that for any ID, there could be multiple instances of negative values. I just need the first one.
Edit:
Tried the solution proposed by AntonoisK.
> df %>% group_by(ID) %>% summarise(first_neg_dist = first(distance[value < 0]))
first_neg_dist
1 4
This is the result I am getting. It does not match what AntonoisK got, and I'm not sure why.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(first_neg_dist = first(distance[value < 0]))
# # A tibble: 5 x 2
# ID first_neg_dist
# <dbl> <int>
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 1
# 5 5 NA
If you really prefer 99 instead of NA you can use
summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
instead.
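Putting the two pieces together, the full pipeline with the 99 fallback would look like this (same logic as above, with coalesce() filling in the default for IDs that never go negative):
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
# expected: ID 5 now gets 99 instead of NA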

Replace duplicate values in vector using criteria from other columns in data frame

I have a very similar problem to:
Identify and replace duplicates elements from a vector
I need to replace duplicate values in a column occurring in a sequence BUT based on criteria from other columns in the data frame.
I have a data frame like this (plus a number of extra columns):
ID<- c("1V","1V","1V","1V","2V","2V","4V","4V","4V","4V","4V")
year<- c(1,1,1,2,1,1,2,2,3,3,3)
sequence<- c(1,2,2,1, 1,2,1,2,1,1,1)
score <- c(5,5,5,5,10,10,10,10,11,11,11)
examp <- data.frame(ID,year, sequence, score)
> examp
ID year sequence score
1 1V 1 1 5
2 1V 1 2 5
3 1V 1 2 5
4 1V 2 1 5
5 2V 1 1 10
6 2V 1 2 10
7 4V 2 1 10
8 4V 2 2 10
9 4V 3 1 11
10 4V 3 1 11
11 4V 3 1 11
What I need is to replace the duplicate scores within each combination of ID, year and sequence with NA. The sequence value coupled with the duplicate score should also be replaced with NA. Thus, no rows are deleted, only specific entries.
> examp
ID year sequence score
1 1V 1 1 5
2 1V 1 2 5
3 1V 1 NA NA
4 1V 2 1 5
5 2V 1 1 10
6 2V 1 2 10
7 4V 2 1 10
8 4V 2 2 10
9 4V 3 1 11
10 4V 3 NA NA
11 4V 3 NA NA
All rows are retained. The same scores may occur across different IDs/years/sequences, but only within each unique combination of these three columns can I replace a duplicate score.
Example with a single vector and solution from the other linked question:
a <- c(1, 1, 1, 2, 3, 2, 2, 2, 2, 1, 0, 0, 0, 0, 2, 3, 4, 4, 1, 1)
ifelse(a == c(a[1] - 1, a[1:(length(a) - 1)]), 0, a)
[1] 1 0 0 2 3 2 0 0 0 1 0 0 0 0 2 3 4 0 1 0
I am unsure how to adapt the code from the linked question to multiple criteria. Is it possible?
Primarily, the most important is to replace the scores, but if someone has a solution to replacing both scores and sequence I would be very happy.
In base R, you can use subsetting and is.na.
is.na(examp[duplicated(examp[1:3]), c("sequence", "score")]) <- TRUE
examp
ID year sequence score
1 1V 1 1 5
2 1V 1 2 5
3 1V 1 NA NA
4 1V 2 1 5
5 2V 1 1 10
6 2V 1 2 10
7 4V 2 1 10
8 4V 2 2 10
9 4V 3 1 11
10 4V 3 NA NA
11 4V 3 NA NA
Here, duplicated(examp[1:3]) returns a logical vector the length of your data.frame that signals whether the rows of the first three variables (ID, year, sequence) are duplicates of previous rows. c("sequence", "score") selects the columns to be replaced. is.na() is then set to TRUE in those columns for the duplicated rows.
A longer, but more readable version is to use the variable names rather than their positions.
is.na(examp[duplicated(examp[c("ID", "year", "sequence")]), c("sequence", "score")]) <- TRUE
This is also safer in the long run in case the positions shift due to merging or other manipulations. It may also be easier to read and interpret when reviewing the code six months from now.
We can use data.table. Convert the data.frame to a data.table (setDT(examp)); grouped by ID and year, get the row indices (.I) where sequence is duplicated, then set those rows of the sequence and score columns to NA. This should be very efficient as the values are assigned in place.
library(data.table)
i1 <- setDT(examp)[, .I[duplicated(sequence)], .(ID, year)]$V1
for (j in 3:4) {
  set(examp, i = i1, j = j, value = NA)
}
examp
# ID year sequence score
# 1: 1V 1 1 5
# 2: 1V 1 2 5
# 3: 1V 1 NA NA
# 4: 1V 2 1 5
# 5: 2V 1 1 10
# 6: 2V 1 2 10
# 7: 4V 2 1 10
# 8: 4V 2 2 10
# 9: 4V 3 1 11
#10: 4V 3 NA NA
#11: 4V 3 NA NA
Or with dplyr, computing the duplicate flag once per group so that both columns are blanked on the same rows:
library(dplyr)
examp %>%
  group_by(ID, year) %>%
  mutate(dup = duplicated(sequence),
         across(c(sequence, score), ~ replace(.x, dup, NA))) %>%
  ungroup() %>%
  select(-dup)
With base R, a compact option is:
examp[duplicated(examp[1:3]), 3:4] <- NA
examp
# ID year sequence score
#1 1V 1 1 5
#2 1V 1 2 5
#3 1V 1 NA NA
#4 1V 2 1 5
#5 2V 1 1 10
#6 2V 1 2 10
#7 4V 2 1 10
#8 4V 2 2 10
#9 4V 3 1 11
#10 4V 3 NA NA
#11 4V 3 NA NA
Another option is replace() combined with lapply():
examp[3:4] <- lapply(examp[3:4], function(x) replace(x, duplicated(examp[1:3]), NA))
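As a quick sanity check (not part of the original answers), rebuilding examp from the vectors defined in the question and running both base R variants should give identical results; fresh copies are used so the earlier in-place modifications do not interfere:
examp1 <- examp2 <- data.frame(ID, year, sequence, score)   # fresh copies
examp1[duplicated(examp1[1:3]), 3:4] <- NA
examp2[3:4] <- lapply(examp2[3:4], function(x) replace(x, duplicated(examp2[1:3]), NA))
identical(examp1, examp2)   # expected: TRUE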

Summing two dataframes based on common value

I have a dataframe that looks like
day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3
and another like
day.of.week count
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13
I want to add the values from df1 to df2 based on day.of.week. I was trying to use ddply
total=ddply(merge(total, subtotal, all.x=TRUE,all.y=TRUE),
.(day.of.week), summarize, count=sum(count))
which almost works, but merge() collapses rows that share the same values across both data frames. For instance, for day.of.week = 5 in the example above: rather than keeping two records each with count 1, the merge produces a single record with count 1, so instead of a total count of 2 I get a total count of 1.
day.of.week count
1 0 3
2 0 17
3 1 6
4 2 1
5 3 1
6 4 1
7 4 5
8 5 1
9 6 3
10 6 13
There is no need to merge. You can simply do
ddply(rbind(d1, d2), .(day.of.week), summarize, sum_count = sum(count))
I have assumed that both data frames have identical column names, day.of.week and count.
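A dplyr equivalent of the same stack-then-sum idea (a sketch, assuming both frames are called d1 and d2 and share the column names day.of.week and count, as above) would be:
library(dplyr)
bind_rows(d1, d2) %>%
  group_by(day.of.week) %>%
  summarise(count = sum(count), .groups = "drop")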
In addition to the suggestion Ben gave you about using merge, you could also do this simply using subsetting:
d1 <- read.table(textConnection(" day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3"),sep="",header = TRUE)
d2 <- read.table(textConnection(" day.of.week count1
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13"),sep = "",header = TRUE)
d2[match(d1[,1],d2[,1]),2] <- d2[match(d1[,1],d2[,1]),2] + d1[,2]
> d2
day.of.week count1
1 0 20
2 1 6
3 2 1
4 3 2
5 4 6
6 5 2
7 6 16
This assumes no repeated day.of.week rows, since match will return only the first match.
