Selecting unique rows in R

There is a data.frame with duplicate values for the variable "Time"
> data.old
Time Count Direction
1 100000630955 95 1
2 100000637570 5 0
3 100001330144 7 1
4 100001330144 33 1
5 100001331413 39 0
6 100001331413 43 0
7 100001334038 1 1
8 100001357594 50 0
I need to keep only one row per Time value, summing the values of the variable "Count" over the duplicated rows, i.e.
> data.new
Time Count Direction
1 100000630955 95 1
2 100000637570 5 0
3 100001330144 40 1
4 100001331413 82 0
5 100001334038 1 1
6 100001357594 50 0
All I have managed so far is to find the unique values, using the command
> data.old$Time[!duplicated(data.old$Time)]
[1] 100000630955 100000637570 100001330144 100001331413 100001334038 100001357594
I can do this in a loop, but maybe there is a more elegant solution
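For reference, the kind of loop I would rather avoid looks roughly like this (a sketch; it assumes Direction is constant within each Time):
times <- data.old$Time[!duplicated(data.old$Time)]
data.new <- data.frame(Time = times, Count = NA_real_, Direction = NA_real_)
for (i in seq_along(times)) {
  rows <- data.old$Time == times[i]
  data.new$Count[i] <- sum(data.old$Count[rows])
  data.new$Direction[i] <- data.old$Direction[rows][1]  # first (assumed constant) Direction
}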

Here's one approach using dplyr. Is this what you want to do?
library(tidyverse)
data.old %>%
group_by(Time) %>%
summarise(Count = sum(Count))
Edit: Keeping other variables
OP has indicated a desire to keep the values of other variables in the dataframe, which summarise deletes. Assuming that all values of those other variables are the same for all the rows being summarised, you could use the Mode function from this SO question.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]  # most frequent (modal) value of x
}
Then change my answer to the following, with one call to Mode for each variable you want kept. This works with both numeric and character data.
library(tidyverse)
data.old %>%
group_by(Time) %>%
summarise(Count = sum(Count), Direction = Mode(Direction))

Here is one using the aggregate function:
data.new <- aggregate(Count ~ Time, data = data.old, sum, na.rm = TRUE)

library(dplyr)
data.old %>% group_by(Time) %>% summarise(Count = sum(Count),
Direction = unique(Direction))
Of course, this assumes you want to keep the single unique value of the Direction column within each Time group.
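For completeness, a data.table sketch of the same grouped sum (again assuming Direction is constant within each Time):
library(data.table)
as.data.table(data.old)[, .(Count = sum(Count), Direction = first(Direction)), by = Time]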

Related

Create group variable based on common dates

I have a large data set containing animal ID's and dates. There are two groups within this dataset but there is no grouping variable, so I have to extrapolate who belongs to which group based on the dates they appear to have in common.
Dummy data.
mydf<-data.frame(
Date=sort(rep(seq(as.Date("2012/1/1"),as.Date("2012/1/4"), length.out = 4),5)),
ID = c(1,2,3,4,5,5,6,7,8,9,1,2,3,4,5,6,7,8,9,10))
The other issue I have is that every now and then an ID belonging to group 1 might appear with a date associated with group 2, which is what has thrown off every attempt I've made so far at grouping.
What I need is an output with IDs and a new group variable, like this:
ID Group
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
9 2
10 2
1:5 all appear together on the 1st and the 3rd so they are likely to be one group.
6:10 appear on the 2nd and 4th and are likely to be the 2nd group.
ID 5 belongs to group 1, because even though it was observed once on the 2nd with IDs 6:9, it was observed twice, on the 1st and the 3rd, with IDs 1:4, so it most likely belongs to group 1.
All my attempts have fallen flat. Can anyone offer a solution to this?
Thanks in advance.
EDIT:
I thought we had nailed a solution using Jon's kmeans approach (below):
mydf_wide <- mydf %>%
  select(ID, Date) %>%
  distinct(ID, Date) %>%
  mutate(x = 1) %>%
  spread(Date, x, fill = 0)
mydf_wide$clusters <- mydf_wide %>%
  kmeans(centers = 2) %>%
  pluck("cluster")
but I'm actually finding the kmeans method not quite getting it right every time. See below:
The groups where certain tags (ID) appear on the same day as each other are fairly easy to spot by eye. There are two groups, one is in the center, and the other group appears on either side. The clustering should be vertical by common dates as in Jon's answer below, but it is clustering across the entire date range. (Apologies for the messy axis labels)
The k-means method has worked on other groups, but it's not consistently able to group by common dates. I think the clustering approach is sensible, but I was wondering if there may be other clustering methods that may cope better than kmeans?
Alternatively, could a filtering method help reduce background noise and make the kmeans approach more reliable?
Again, very grateful for any and all advice.
Cheers.
My thinking here is that you just assign each Date to a group, then take the average group for each ID and round to the nearest whole number. In this case, the average group of ID == 5 would be 1.33.
library(dplyr)
mydf %>%
  mutate(group = case_when(
    Date %in% as.Date(c("2012-01-01", "2012-01-03")) ~ 1,
    Date %in% as.Date(c("2012-01-02", "2012-01-04")) ~ 2,
    TRUE ~ NA_real_
  )) %>%
  group_by(ID) %>%
  summarise(likely_group = mean(group) %>% round)
Which gives you the following:
# A tibble: 10 x 2
ID likely_group
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 2
7 7 2
8 8 2
9 9 2
10 10 2
This works as long as there isn't an even split between groups for a single ID. But there isn't currently a way to address this situation with the information provided.
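If you want to at least flag an even split rather than silently rounding it, a small sketch (the tied column is a hypothetical addition):
mydf %>%
  mutate(group = if_else(Date %in% as.Date(c("2012-01-01", "2012-01-03")), 1, 2)) %>%
  group_by(ID) %>%
  summarise(likely_group = round(mean(group)),
            tied = mean(group) == 1.5)  # TRUE when an ID splits evenly between the groups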
As a general solution, you might consider using k-means as an automatic way to split the data into groups based on similarity to other IDs.
First, I converted the data into wide format so that each ID gets one row. Then fed that into the base kmeans function to get the clustering output as a list, and purrr::pluck to extract just the assignment part of that list.
library(tidyverse)
mydf_wide <- mydf %>%
mutate(x = 1) %>%
spread(Date, x, fill = 0)
mydf_wide
# ID 2012-01-01 2012-01-02 2012-01-03 2012-01-04
#1 1 1 0 1 0
#2 2 1 0 1 0
#3 3 1 0 1 0
#4 4 1 0 1 0
#5 5 1 1 1 0
#6 6 0 1 0 1
#7 7 0 1 0 1
#8 8 0 1 0 1
#9 9 0 1 0 1
#10 10 0 0 0 1
clusters <- mydf_wide %>%
kmeans(centers = 2) %>%
pluck("cluster")
clusters
# [1] 2 2 2 2 2 1 1 1 1 1
Here's what that looks like if you add those clusters to the original data and plot it.
mydf_wide %>%
mutate(cluster = clusters) %>%
# ggplot works better with long (tidy) data...
gather(date, val, -ID, -cluster) %>%
filter(val != 0) %>%
arrange(cluster) %>%
ggplot(aes(date, ID, color = as.factor(cluster))) +
geom_point(size = 5) +
scale_y_continuous(breaks = 1:10, minor_breaks = NULL) +
scale_color_discrete(name = "cluster")
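If, as the question's edit suggests, kmeans is inconsistent on noisier data, one alternative worth sketching (illustrative only, not tested on the real data) is hierarchical clustering on the same 0/1 date matrix:
d_mat <- dist(mydf_wide[, -1], method = "binary")  # drop the ID column
hc <- hclust(d_mat)
cutree(hc, k = 2)  # assigns each ID to one of two clusters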

R create sequence in dplyr by id beginning with zero

I'm looking for the cleanest way to create a sequence beginning with zero by id in a dataframe.
df <- data.frame (id=rep(1:10,each=10))
If I wanted to start sequence at 1 the following would do:
library(dplyr)
df<-df %>% group_by(id) %>%
mutate(start = 1:n()) %>%
ungroup()
but starting at 0 doesn't work because it creates an extra number (0-10 rather than 1-10), so I need to add an extra row per id. Is there a way to do this all in one step, perhaps using dplyr? There are obviously a number of workarounds, such as creating another dataset and appending it to the original:
df1 <- data.frame (id=1:10,
start=0)
new<-rbind(df,df1)
That just seems a bit awkward and not very tidy. I know you can use rbind within a dplyr chain, but I'm not sure how to incorporate everything in one step, especially if there were other non-time-varying variables I just wanted to copy over into the new row. Interested to see suggestions, thanks.
You could use complete() from the tidyverse:
library(tidyverse)
df %>%
group_by(id) %>%
mutate(start = 1:n()) %>%
complete(start = c(0:10)) %>%
ungroup()
Which yields
# A tibble: 110 x 2
id start
<int> <int>
1 1 0
2 1 1
3 1 2
4 1 3
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
10 1 9
# ... with 100 more rows
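If the group sizes vary, so that 0:10 can't be hardcoded in complete(), a hedged alternative (assuming dplyr >= 1.1, where reframe() may return several rows per group) would be:
df %>%
  group_by(id) %>%
  reframe(start = 0:n())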

r: Summarise for rowSums after group_by

I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get an error that my lengths don't match. I've tried variations using mutate to create the column, or summarise_all, but neither of these seems to work. I need the row sums within each group, and then to compute the mean within the group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1 1 1 0 2
1 1 1 1 3
1 1 0 0 1
Then the result would be:
Month Average
1 6/3
You can try:
library(tidyverse)
airquality %>%
select(Month, target_vars) %>%
gather(key, value, -Month) %>%
group_by(Month) %>%
summarise(n=length(unique(key)),
Sum=sum(value, na.rm = T)) %>%
mutate(Average=Sum/n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
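gather() and spread() have since been superseded in tidyr; a hedged sketch of the same idea with pivot_longer() (assuming tidyr >= 1.0) would be:
airquality %>%
  pivot_longer(all_of(target_vars), names_to = "key", values_to = "value") %>%
  group_by(Month) %>%
  summarise(n = n_distinct(key),
            Sum = sum(value, na.rm = TRUE)) %>%
  mutate(Average = Sum / n)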
This seems to deliver what you want, using base R. sapply() keeps the months separated by name (from the list produced by split()), and sum() applied to each sub-data.frame collapses all of its columns into a single total. (Correction #2: now using only target_vars.)
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the result scaled per variable, you would divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
group_by(Month) %>%
nest(Ozone, Temp, Solar.R, .key=newcol) %>%
mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
filter(Month == 5) %>%
select(Ozone, Temp, Solar.R) %>%
mutate(newcol = rowSums(., na.rm=TRUE)) %>%
summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581
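The same per-Month mean of row sums can also be sketched without nest(), assuming dplyr >= 1.0 for across():
airquality %>%
  mutate(rsum = rowSums(across(all_of(target_vars)), na.rm = TRUE)) %>%
  group_by(Month) %>%
  summarise(newcol = mean(rsum))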

Grouping and filtering consecutively over a dataframe

I am working with a large dataframe in R, and for the task described below my current solution is far too long. I will use DF as an example of the dataframe I am using:
library(dplyr)
DF<-data.frame(ID=c(1:10),Cause1=c(rep("Yes 1",8),rep("No 1",2)),Cause2=c(rep("Yes 2",6),rep("No 2",4)),
Cause3=c(rep("Yes S",5),rep("No S",5)),Cause4=c(rep("Yes P",3),rep("No P",7)),
Cause5=c(rep("Yes",2),rep("No",8)),stringsAsFactors = F)
DF has the following structure:
ID Cause1 Cause2 Cause3 Cause4 Cause5
1 1 Yes 1 Yes 2 Yes S Yes P Yes
2 2 Yes 1 Yes 2 Yes S Yes P Yes
3 3 Yes 1 Yes 2 Yes S Yes P No
4 4 Yes 1 Yes 2 Yes S No P No
5 5 Yes 1 Yes 2 Yes S No P No
6 6 Yes 1 Yes 2 No S No P No
7 7 Yes 1 No 2 No S No P No
8 8 Yes 1 No 2 No S No P No
9 9 No 1 No 2 No S No P No
10 10 No 1 No 2 No S No P No
DF is composed of six variables: one ID variable and five variables that can be Yes or No. For each of the variables with the prefix Cause I need to, first, compute a summary of that variable and, second, filter the data to the rows where that variable equals Yes. The first stage of this process, with its explanation, looks like this:
#Filtering stage
#N1
DF %>% group_by(Cause1) %>% summarise(N=n()) -> d1
DF %>% filter(Cause1=="Yes 1") -> DF2
Here, using dplyr, I group DF by Cause1 and use summarise() with n() to count its values, saving the result in d1. Then I filter DF to the rows where Cause1 equals "Yes 1" and save that in a new data.frame called DF2. Once I have DF2 I must repeat a similar routine for Cause2, Cause3, Cause4 and Cause5, using the following code:
#N2
DF2 %>% group_by(Cause2) %>% summarise(N=n()) -> d2
DF2 %>% filter(Cause2=="Yes 2") -> DF3
#N3
DF3 %>% group_by(Cause3) %>% summarise(N=n()) -> d3
DF3 %>% filter(Cause3=="Yes S") -> DF4
#N4
DF4 %>% group_by(Cause4) %>% summarise(N=n()) -> d4
DF4 %>% filter(Cause4=="Yes P") -> DF5
#N5
DF5 %>% group_by(Cause5) %>% summarise(N=n()) -> d5
DF5 %>% filter(Cause5=="Yes") -> DF6
The final result is DF6, but I also have to run a check by combining the dataframes d1, d2, d3, d4 and d5 and keeping only the No values. I used the following code for that purpose: it sets common names for the d dataframes, rbinds them and filters on the No pattern.
#Connect
names(d1)<-names(d2)<-names(d3)<-names(d4)<-names(d5)<-c("Cause","N")
#Rbind
d<-rbind(d1,d2,d3,d4,d5)
d_reduced<-d[grepl("No",d$Cause),]
I obtain this:
Cause N
1 No 1 2
2 No 2 2
3 No S 1
4 No P 2
5 No 1
The final step is to compute the sum of N in d_reduced; the number of rows in DF minus that value must equal the number of rows of DF6:
(dim(DF)[1]-sum(d_reduced$N))==dim(DF6)[1]
That in this case is TRUE.
I would like to shorten this code, because in my analysis the number of Cause variables can grow and the code will get even longer. Perhaps an apply-based strategy or reshaping the data would be better. Any help with reducing the amount of code would be marvelous. Thanks in advance.
How about something like this?
First we summarise how many "No" cases are in each column that starts with "Cause":
num_no <- DF %>% summarise_each(funs(sum(substr(., 1, 1) == "N")), starts_with("Cause"))
> num_no
Cause1 Cause2 Cause3 Cause4 Cause5
1 2 4 5 7 8
You are interested in the incremental difference between each subsequent column, so let's just subtract a lagged version of num_no from num_no.
d_reduced <- num_no - lag(num_no, 1, 0)
> d_reduced
Cause1 Cause2 Cause3 Cause4 Cause5
1 2 2 1 2 1
This gives the values you wanted, but they are not labelled. Let's fix that by extracting, for each column, the unique string that begins with N:
labs <- lapply(DF, function(X){unique(X[grep("N", X)])}) %>% unlist
names(d_reduced) <- labs
> d_reduced
No 1 No 2 No S No P No
1 2 2 1 2 1
Then your final step would be summing the counts in d_reduced, subtracting that from the number of rows of DF, and checking whether the result equals the number of rows that are "Yes" across their entire row.
> (nrow(DF) - sum(d_reduced)) == sum(DF[, ncol(DF)] == "Yes")
[1] TRUE
Warning: this only works because whenever a row has Yes in the final column, all preceding columns are also Yes (as in your example). If that assumption no longer holds, this answer will not work.
You could reshape to long format, count the Yes's and No's, and then take the differences between the Yes counts. data.table::melt can use regular expressions to detect measure variables, which is useful for capturing all the Cause variables. Does this work?
library(data.table)
library(dplyr)
d <-
  melt(as.data.table(DF),                 # launch melt.data.table
       id.vars = "ID",
       measure.vars = patterns("Cause"),  # regex-match the Cause columns
       variable.name = "Cause") %>%
  group_by(Cause) %>%                     # tabulate Yes's and No's
  summarise(Yes = sum(grepl("Yes", value)),
            No = sum(grepl("No", value))) %>%
  mutate(N = lag(Yes) - Yes) %>%          # N = difference between consecutive Yes counts
  rowwise() %>%                           # replace the NA in the first row with its No value
  mutate(N = replace(N, is.na(N), No))
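If you would rather keep the original step-by-step logic but make it generic for any number of Cause columns, here is a plain-loop sketch (it assumes the column names all start with "Cause" and the level to keep always starts with "Yes"):
cause_cols <- grep("^Cause", names(DF), value = TRUE)
d_list <- vector("list", length(cause_cols))
DF_cur <- DF
for (i in seq_along(cause_cols)) {
  col <- cause_cols[i]
  # summary of the current Cause at this stage of the filtering
  d_list[[i]] <- as.data.frame(table(Cause = DF_cur[[col]]), responseName = "N")
  # keep only the rows where this Cause is a Yes
  DF_cur <- DF_cur[grepl("^Yes", DF_cur[[col]]), , drop = FALSE]
}
d <- do.call(rbind, d_list)
d_reduced <- d[grepl("^No", d$Cause), ]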

Collapsing data frame by selecting one row per group

I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, the first row from each group.
For example, I'd like to convert this
> d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
> d
x y z
1 1 10 20
2 1 11 19
3 2 12 18
4 4 13 17
Into this:
x y z
1 1 11 19
2 2 12 18
3 4 13 17
I'm using aggregate to do this currently, but the performance is unacceptable with more data:
> d.ordered = d[order(-d$y),]
> aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})
I've tried split/unsplit with the same function argument as here, but unsplit complains about duplicate row numbers.
Is rle a possibility? Is there an R idiom to convert rle's length vector into the indices of the rows that start each run, which I can then use to pluck those rows out of the data frame?
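(A rough sketch of that rle idea, assuming equal x values end up adjacent after the ordering above:)
r <- rle(d.ordered$x)
starts <- cumsum(c(1, head(r$lengths, -1)))  # index of the first row of each run
d.ordered[starts, ]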
Maybe duplicated() can help:
R> d[ !duplicated(d$x), ]
x y z
1 1 10 20
3 2 12 18
4 4 13 17
R>
Edit: Shucks, never mind. This picks the first row in each block of repetitions; you wanted the last. So here is another attempt, using plyr:
R> library(plyr)
R> ddply(d, "x", function(z) tail(z,1))
x y z
1 1 11 19
2 2 12 18
3 4 13 17
R>
Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).
Just to add a little to what Dirk provided... duplicated has a fromLast argument that you can use to select the last row:
d[ !duplicated(d$x,fromLast=TRUE), ]
Here is a data.table solution which will be time and memory efficient for large data sets
library(data.table)
DT <- as.data.table(d) # convert to data.table
setkey(DT, x) # set key to allow binary search using `J()`
DT[J(unique(x)), mult ='last'] # subset out the last row for each x
DT[J(unique(x)), mult ='first'] # if you wanted the first row for each x
There are a couple options using dplyr:
library(dplyr)
df %>% distinct(x, .keep_all = TRUE)
df %>% group_by(x) %>% filter(row_number() == 1)
df %>% group_by(x) %>% slice(1)
You can use more than one column with both distinct() and group_by():
df %>% distinct(x, y, .keep_all = TRUE)
The group_by() and filter() approach can be useful if there is a date or some other sequential field and
you want to ensure the most recent observation is kept, and slice() is useful if you want to avoid ties:
df %>% group_by(x) %>% filter(date == max(date)) %>% slice(1)
