I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, the first row from each group.
For example, I'd like to convert this
> d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
> d
x y z
1 1 10 20
2 1 11 19
3 2 12 18
4 4 13 17
Into this:
x y z
1 1 11 19
2 2 12 18
3 4 13 17
I'm using aggregate to do this currently, but the performance is unacceptable with more data:
> d.ordered = d[order(-d$y),]
> aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})
I've tried split/unsplit with the same function argument as here, but unsplit complains about duplicate row numbers.
Is rle a possibility? Is there an R idiom to convert rle's length vector into the indices of the rows that start each run, which I can then use to pluck those rows out of the data frame?
Maybe duplicated() can help:
R> d[ !duplicated(d$x), ]
x y z
1 1 10 20
3 2 12 18
4 4 13 17
R>
Edit Shucks, never mind. This picks the first in each block of repetitions, you wanted the last. So here is another attempt using plyr:
R> ddply(d, "x", function(z) tail(z,1))
x y z
1 1 11 19
2 2 12 18
3 4 13 17
R>
Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).
Just to add a little to what Dirk provided... duplicated has a fromLast argument that you can use to select the last row:
d[ !duplicated(d$x,fromLast=TRUE), ]
Here is a data.table solution which will be time and memory efficient for large data sets
library(data.table)
DT <- as.data.table(d) # convert to data.table
setkey(DT, x) # set key to allow binary search using `J()`
DT[J(unique(x)), mult ='last'] # subset out the last row for each x
DT[J(unique(x)), mult ='first'] # if you wanted the first row for each x
There are a couple options using dplyr:
library(dplyr)
df %>% distinct(x, .keep_all = TRUE)
df %>% group_by(x) %>% filter(row_number() == 1)
df %>% group_by(x) %>% slice(1)
You can use more than one column with both distinct() and group_by():
df %>% distinct(x, y, .keep_all = TRUE)
The group_by() and filter() approach can be useful if there is a date or some other sequential field and
you want to ensure the most recent observation is kept, and slice() is useful if you want to avoid ties:
df %>% group_by(x) %>% filter(date == max(date)) %>% slice(1)
Related
I have a file that contains multiple individuals and multiple values for the same individual.
I need to remove the first 10 and last 10 values of each individual, putting all the leftover values in a new table.
This is what my data kinda looks like:
Cow Data
NL123456 123
NL123456 456
I tried doing a for-loop, counting per individual how many values there were (but I think, I already got stuck there, because I am not using the right command I think? All variables in Cow are a factor).
I figured removing the first and last had to be something like this:
data1[c(11: n-10),]
If you know you always have more than 20 datapoints by cow you can do the following, illustrated on the iris dataset :
library(dplyr)
dim(iris)
# [1] 150 5
iris_trimmed <-
iris %>%
group_by(Species) %>%
slice(11:(n()-10)) %>%
ungroup()
dim(iris_trimmed)
# [1] 90 5
On your data :
res <-
your_data %>%
group_by(Cow) %>%
slice(11:(n()-10)) %>%
ungroup()
In base R you can do :
iris_trimmed <- do.call(
rbind,
lapply(split(iris, iris$Species),
function(x) head(tail(x,-10),-10)))
dim(iris_trimmed)
# [1] 90 5
Using data.table:
library(data.table)
idt <- as.data.table(iris)
idt[, .SD[11:(.N-10)], Species]
Same logic in base R:
do.call(
rbind,
lapply(
split(iris, iris[["Species"]]),
function(x) x[11:(nrow(x)-10), ]
)
)
Here a solution with dplyr.
In my example I cut only the first and last values. (you can adapt it by changing 2 with any number in filter).
The idea is to add after you group_by id the number of row per each observation starting from the top (n) and in reverse from the bottom (n1), then you simply filter out.
library(dplyr)
data %>%
group_by(id) %>%
mutate(n=1:n(),
n1 = n():1) %>% # n and n1 are the row numbers
filter(n >= 2,n1 >= 2) %>% # change 2 with 10, or whatever
# filter() keeps only the rows that you want
select(-n, -n1) %>%
ungroup()
# # A tibble: 4 x 2
# id value
# <dbl> <int>
# 1 1 6
# 2 1 8
# 3 2 1
# 4 2 2
Data:
set.seed(123)
data <- data.frame(id = c(rep(1,4), rep(2,4)), value=sample(8))
data
# id value
# 1 1 3
# 2 1 6
# 3 1 8
# 4 1 5
# 5 2 4
# 6 2 1
# 7 2 2
# 8 2 7
I have two datasets:
loc <- c("a","b","c","d","e")
id1 <- c(NA,9,3,4,5)
id2 <- c(2,3,7,5,6)
id3 <- c(2,NA,5,NA,7)
cost1 <- c(10,20,30,40,50)
cost2 <- c(50,20,30,30,50)
cost3 <- c(40,20,30,10,20)
dt <- data.frame(loc,id1,id2,id3,cost1,cost2,cost3)
id <- c(1,2,3,4,5,6,7)
rate <- c(0.9,0.8,0.7,0.6,0.5,0.4,0.3)
lookupd_tb <- data.frame(id,rate)
what I want to do, is to match the values in dt with lookup_tb for id1,id2 and id3 and if there is a match, multiply rate for that id to its related cost.
This is my approach:
dt <- dt %>%
left_join(lookupd_tb , by=c("id1"="id")) %>%
dplyr :: mutate(cost1 = ifelse(!is.na(rate), cost1*rate, cost1)) %>%
dplyr :: select (-rate)
what I am doing now, works fine but I have to repeat it 3 times for each variable and I was wondering if there is a more efficient way to do this(probably using apply family?)
I tried to join all three variables with id in my look up table but when rate is joined with my dt, all the costs (cost1, cost2 and cost3) will be multiply by the same rate which I don't want.
I appreciate your help!
A base R approach would be to loop through the columns of 'id' using sapply/lapply, get the matching index from the 'id' column of 'lookupd_tb', based on the index, get the corresponding 'rate', replace the NA elements with 1, multiply with 'cost' columns and update the 'cost' columns
nmid <- grep("id", names(dt))
nmcost <- grep("cost", names(dt))
dt[nmcost] <- dt[nmcost]*sapply(dt[nmid], function(x) {
x1 <- lookupd_tb$rate[match(x, lookupd_tb$id)]
replace(x1, is.na(x1), 1) })
Or using tidyverse, we can loop through the sets of columns i.e. 'id' and 'cost' with purrr::map2, then do the same approach as above. The only diference is that here we created new columns instead of updating the 'cost' columns
library(tidyverse)
dt %>%
select(nmid) %>%
map2_df(., dt %>%
select(nmcost), ~
.x %>%
match(., lookupd_tb$id) %>%
lookupd_tb$rate[.] %>%
replace(., is.na(.),1) * .y ) %>%
rename_all(~ paste0("costnew", seq_along(.))) %>%
bind_cols(dt, .)
In tidyverse you can also try an alternative approach by transforming the data from wide to long
library(tidyverse)
dt %>%
# data transformation to long
gather(k, v, -loc) %>%
mutate(ID=paste0("costnew", str_extract(k, "[:digit:]")),
k=str_remove(k, "[:digit:]")) %>%
spread(k, v) %>%
# left_join and calculations of new costs
left_join(lookupd_tb , by="id") %>%
mutate(cost_new=ifelse(is.na(rate), cost,rate*cost)) %>%
# clean up and expected output
select(loc, ID, cost_new) %>%
spread(ID, cost_new) %>%
left_join(dt,., by="loc") # or %>% bind_cols(dt, .)
loc id1 id2 id3 cost1 cost2 cost3 costnew1 costnew2 costnew3
1 a NA 2 2 10 50 40 10 40 32
2 b 9 3 NA 20 20 20 20 14 20
3 c 3 7 5 30 30 30 21 9 15
4 d 4 5 NA 40 30 10 24 15 10
5 e 5 6 7 50 50 20 25 20 6
The idea ist to bring the data in suitable long format for the lef_joining using a gather & spread combination with new index columns k and ID. After the calculation we will transform to the expected output using a second spread and binding to dt
I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get the error that my lenghts don't match. I've tried variations using mutate to create the column or summarise_all, but neither of these seem to work. I need the row sums within group, and then to compute the mean within group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1 1 1 0 2
1 1 1 1 3
1 1 0 0 3
Then the result would be:
Month Average
1 8/3
You can try:
library(tidyverse)
airquality %>%
select(Month, target_vars) %>%
gather(key, value, -Month) %>%
group_by(Month) %>%
summarise(n=length(unique(key)),
Sum=sum(value, na.rm = T)) %>%
mutate(Average=Sum/n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want. It's regular R. The sapply function keeps the months separated by "name". The sum function applied to each dataframe will not keep the column sums separate. (Correction # 2: used only target_vars):
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the per number of variable results, then you would divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
group_by(Month) %>%
nest(Ozone, Temp, Solar.R, .key=newcol) %>%
mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
filter(Month == 5) %>%
select(Ozone, Temp, Solar.R) %>%
mutate(newcol = rowSums(., na.rm=TRUE)) %>%
summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581
There is a data.frame with duplicate values for the variable "Time"
> data.old
Time Count Direction
1 100000630955 95 1
2 100000637570 5 0
3 100001330144 7 1
4 100001330144 33 1
5 100001331413 39 0
6 100001331413 43 0
7 100001334038 1 1
8 100001357594 50 0
You must leave all values without duplicates. And sum the values of the variable "Count" with duplicate values, i.e.
> data.new
Time Count Direction
1 100000630955 95 1
2 100000637570 5 0
3 100001330144 40 1
4 100001331413 82 0
5 100001334038 1 1
6 100001357594 50 1
All I could find these unique values with the help of the command
> data.old$Time[!duplicated(data.old$Time)]
[1] 100000630955 100000637570 100001330144 100001331413 100001334038 100001357594
I can do this in a loop, but maybe there is a more elegant solution
Here's one approach using dplyr. Is this what you want to do?
library(tidyverse)
data.old %>%
group_by(Time) %>%
summarise(Count = sum(Count))
Edit: Keeping other variables
OP has indicated a desire to keep the values of other variables in the dataframe, which summarise deletes. Assuming that all values of those other variables are the same for all the rows being summarised, you could use the Mode function from this SO question.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Then change my answer to the following, with one call to Mode for each variable you want kept. This works with both numeric and character data.
library(tidyverse)
data.old %>%
group_by(Time) %>%
summarise(Count = sum(Count), Direction = Mode(Direction))
here is the one by using aggregating function
data.new<-aggregate( Count~Time , data=data.old, sum, na.rm=TRUE)
library(dplyr)
data.old %>% group_by(Time) %>% summarise(Count = sum(Count),
Direction = unique(Direction))
Of course, assuming you want to keep unique values of Direction column
My goal is to get the same number of rows for each split (based on column Initial). I am trying to basically pad the number of rows so that each person has the same amount, while retaining the Initial column so I can tell them apart. My attempt failed completely. Anybody have suggestions?
df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)
maxrows=max(table(Initials))+1
arr<-split(df,Initials)
lapply(arr,function(x){
toadd<-maxrows-dim(x)[1]
replicate(toadd,x<-rbind(x,rep(NA,1)))#colnames -1 because col 1 should the the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq.int(max(tabulate(df$Initials)))
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
The analogue in dplyr is
my_rows <- seq.int(max(tabulate(df$Initials)))
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by initials, add row_numbers, ungroup, complete row numbers/Initials combinations, then remove our row numbers:
library(dplyr)
library(tidyr)
df %>% group_by(Initials) %>%
mutate(row = row_number()) %>%
ungroup() %>%
complete(Initials, row) %>%
select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra initials needed then combine the extras with NA values then rbind to the data frame.
max(table(df$Initials)) calculates the the initial with the most repeats. In this case a 2. By subtracting that max amount by the other initials table(df$Initials) we get a vector with the necessary additions. There's an added bonus to this method, by using table we also automatically have a named vector.
We use the names of the new vector to know 1) what initials to repeat, and 2) how many times should they be repeated.
To preserve the class of the data, you can add newdf$data <- as.numeric(newdf$data).