This question already has answers here: Select the first and last row by group in a data frame (6 answers). Closed 2 years ago.
I have one data set. It contains two columns: the first is ID and the second is VALUE. You can see the code and table below:
DATA_TEST <- data.frame(
  ID = c("03740270423222", "03740270423222", "03740270423222", "03740270423222",
         "01380926325248", "01380926325248", "01380926325248"),
  VALUE = c("100", "200", "300", "200", "300", "200", "300"))
But this table contains a lot of duplicates, so my intention is to extract only the last VALUE for each ID. The final result should look like the table below:
ID              VALUE
03740270423222  200
01380926325248  300
Can anybody help me resolve this problem?
A base R solution with aggregate() and tail():
aggregate(VALUE ~ ID, DATA_TEST, tail, 1)
# ID VALUE
# 1 01380926325248 300
# 2 03740270423222 200
or with the dplyr package:
library(dplyr)
Option 1: summarise() + last()
DATA_TEST %>%
  group_by(ID) %>%
  summarise(VALUE = last(VALUE))
Option 2: slice_tail(), equivalent here to slice(n())
DATA_TEST %>%
  group_by(ID) %>%
  slice_tail()
In data.table, with DATA_TEST as defined above:
library(data.table)
DT <- as.data.table(DATA_TEST)
DT[, .(VALUE = last(VALUE)), by = ID]
ID VALUE
1: 03740270423222 200
2: 01380926325248 300
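An equivalent data.table idiom, added here as a sketch (not part of the original answer): since .N is the size of the current group, .SD[.N] returns each group's last row whole.

# .SD holds the non-grouping columns and .N the group size,
# so .SD[.N] is the last row within each ID
DT[, .SD[.N], by = ID]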
Related
As the title says, the question is very straightforward (pardon my ignorance).
I have a character column in a data.table.
Several different words/values are stored in it; some appear only once, others appear multiple times.
How can I select the ones that appear only once?
Any help is appreciated! Thank you!
One option would be to do a group by and then select the groups having a single row:
library(data.table)
dt1 <- dt[, .SD[.N == 1], .(col)]
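A minimal worked sketch of that line; the table dt and its columns are my assumptions, not from the original question:

library(data.table)
dt <- data.table(col = c("a", "a", "b", "c", "c", "c"), val = 1:6)
# keep only the groups of size 1
dt[, .SD[.N == 1], .(col)]
#    col val
# 1:   b   3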
Or with dplyr:
library(dplyr)
df %>%
  group_by(column) %>%
  dplyr::filter(n() == 1) %>%
  ungroup()
Example:
data <- tibble(text = c("a", "a", "b", "c", "c", "c"))
data %>%
  group_by(text) %>%
  dplyr::filter(n() == 1) %>%
  ungroup()
# A tibble: 1 x 1
text
<chr>
1 b
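For completeness, a base R sketch on the same toy data (my addition, not part of the original answer): table() counts the values, and the ones with a count of 1 are kept.

data[data$text %in% names(which(table(data$text) == 1)), ]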
This question already has answers here: Finding ALL duplicate rows, including "elements with smaller subscripts" (9 answers). Closed 5 years ago.
I have a dataframe with one observation per row and two observations per subject. I'd like to select just the rows with duplicated 'day' numbers.
ex <- data.frame(id = rep(1:5, 2), day = c(1:5, 1:3, 5:6))
The following code selects just the second duplicated row, but not the first. Again, I'd like to select both of the duplicated rows.
ex %>%
  group_by(id) %>%
  filter(duplicated(day))
The following code works, but seems clunky. Does anyone have a more efficient solution?
ex %>%
  group_by(id) %>%
  filter(duplicated(day, fromLast = TRUE) | duplicated(day, fromLast = FALSE))
duplicated() can be applied to the whole dataset, and this can be done with just base R methods:
ex[duplicated(ex) | duplicated(ex, fromLast = TRUE), ]
Using dplyr, we can group_by() both columns and filter() the groups where the number of rows (n()) is greater than 1:
ex %>%
  group_by(id, day) %>%
  filter(n() > 1)
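A data.table sketch of the same n() > 1 idea (my addition, not from the original answers): .I collects the original row numbers of the groups with more than one row.

library(data.table)
DT <- as.data.table(ex)
# the inner call returns the row indices (V1) of groups of size > 1
DT[DT[, .I[.N > 1], by = .(id, day)]$V1]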
Single tidyverse pipe (note this keeps the unique records; flip the filter to duplicate > 1 to keep the duplicated rows instead):
exSinglesOnly <-
  ex %>%
  group_by(id, day) %>%         # the complete group of interest
  mutate(duplicate = n()) %>%   # count rows in each group
  filter(duplicate == 1) %>%    # select only unique records
  select(-duplicate)            # remove the group count column
> exSinglesOnly
Source: local data frame [4 x 2]
Groups: id, day [4]
id day
<int> <int>
1 4 4
2 5 5
3 4 5
4 5 6
I have a dataset where each individual (id) has an e_date, and since each individual could have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like to have a dataset with one row per id showing its earliest e_date value.
I've used the aggregate function to find the minimum values, created a new variable combining the date and the id, and finally subset the original dataset based on the one containing the minimums, using the new variable. I've come to this:
new <- aggregate(e_date ~ id, data_full, min)
data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)
new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)
data_fixed <- data_full[which(new$comb %in% data_full$comb),]
The first thing is that the aggregate function doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids appear more than once with different e_date values. Also, the code gives me different results when I use the as.Date format instead of the original integer format for the date. I think the answer is simple, but I'm stuck on this one.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data_full)), order by 'e_date', and then, grouped by 'id', take the first row (head(.SD, 1L)).
library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
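Assuming data_full is already a data.table (as after the setDT() above), an equivalent sketch that avoids ordering the whole table picks each group's row with the smallest e_date:

# which.min() returns the position of the group's minimum e_date
data_full[, .SD[which.min(e_date)], by = id]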
Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.
library(dplyr)
data_full %>%
  group_by(id) %>%
  arrange(e_date) %>%
  slice(1L)
If we need a base R option, ave() can be used:
data_full[with(data_full, e_date == ave(e_date, id, FUN = min)), ]
Another answer that uses dplyr's filter() command (renaming to match the question's data):
data_full %>%
  group_by(id) %>%
  filter(e_date == min(e_date))
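Note that filter(e_date == min(e_date)) keeps every row tied for the minimum. If exactly one row per id is wanted, a sketch (my addition) appends slice(1):

data_full %>%
  group_by(id) %>%
  filter(e_date == min(e_date)) %>%
  slice(1) %>%   # keep a single row even when the minimum date is tied
  ungroup()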
You may use library(sqldf) to get the minimum date as follows:
data1 <- data.frame(
  id = c("789", "123", "456", "123", "123", "456", "789"),
  e_date = c("2016-05-01", "2016-07-02", "2016-08-25", "2015-12-11",
             "2014-03-01", "2015-07-08", "2015-12-11"))
library(sqldf)
data2 <- sqldf("SELECT id,
                       min(e_date) as 'earliest_date'
                FROM data1 GROUP BY 1", method = "name__class")
head(data2)
head(data2)
id earliest_date
123 2014-03-01
456 2015-07-08
789 2015-12-11
I made a reproducible example, supposing that you grouped some dates by which quarter they were in.
library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)
df %>%
  group_by(which_quarter) %>%
  summarise(sort(rand_weeks)[1])
# A tibble: 4 x 2
which_quarter sort(rand_weeks)[1]
<dbl> <time>
1 1 2017-01-05 05:46:32
2 2 2017-04-06 05:46:32
3 3 2016-08-18 05:46:32
4 4 2016-10-06 05:46:32
This question already has answers here: count number of rows in a data frame in R based on group [duplicate] (8 answers). Closed 6 years ago.
Say I have a data table like this:
id days age
"jdkl" 8 23
"aowl" 1 09
"mnoap" 4 82
"jdkl" 3 14
"jdkl" 2 34
"mnoap" 27 56
I want to create a new data table that has one column with the ids and one column with the number of times they appear. I know that data.table has something with .N, but I wasn't sure how to use it for only one column.
The final data table would look like this:
id      count
"jdkl"  3
"mnoap" 2
"aowl"  1
You can just use table() from base R:
as.data.frame(sort(table(df$id), decreasing = TRUE))
However, if you want to do it using data.table:
library(data.table)
setDT(df)[, .(Count = .N), by = id][order(-Count)]
or there is the dplyr solution:
library(dplyr)
df %>% count(id) %>% arrange(desc(n))
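As a small aside, count() can also sort in the same call (assuming a current dplyr version):

df %>% count(id, sort = TRUE)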
We can use
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))
Or using aggregate() from base R:
r1 <- aggregate(cbind(Count = days) ~ id, df, length)
r1[order(-r1$Count), ]
# id Count
#2 jdkl 3
#3 mnoap 2
#1 aowl 1
This question already has answers here: Subset of a data frame with the penultimate values of one of the columns (3 answers). Closed 8 years ago.
I have data with the following format. There is a non-unique ID, the number of times it's shown up, and more data.
I want to add the penultimate row for each ID to a new table, i.e. a 2 and b 4.
What are a couple of methods for accomplishing this?
ID # data
a 1 ...
a 2 ...
a 3 ...
b 1 ...
b 2 ...
b 3 ...
b 4 ...
b 5 ...
...
In addition to #Ben's answer and those in the duplicate answer, you could use dplyr to achieve this:
df %>%  # your data.frame
  group_by(ID) %>%
  mutate(count = 1:n()) %>%
  # if every ID occurs more than once, this simplifies to filter(count %in% max(count - 1))
  filter(count %in% max(c(count - 1, 1))) %>%
  select(-count)
This can also be written in a single line:
df %>% group_by(ID) %>% mutate(count = 1:n()) %>% filter(count %in% max(c(count - 1, 1))) %>% select(-count)
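With current dplyr the same result is a one-line slice(), shown here as a sketch (not from the original answers):

df %>% group_by(ID) %>% slice(max(1, n() - 1)) %>% ungroup()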
I would use plyr::ddply:
library(plyr)
penult <- function(x) head(tail(x, 2), 1)
ddply(mydata, "ID", penult)
Somewhat to my surprise this actually works fine in the edge case (only one row per ID), because tail(x,2) returns a single row in that case.
mydata[unlist(tapply(rownames(mydata), mydata$ID, function(n) n[max(1, length(n) - 1)])), ]
No testing was done in the absence of a valid example. The edge case of a single row for an ID was not considered in your problem formulation, so I decided to use the solitary row in that situation.
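A quick check of that line on mock data (my construction, since the question's table is abbreviated):

mydata <- data.frame(ID = c(rep("a", 3), rep("b", 5)), num = c(1:3, 1:5))
mydata[unlist(tapply(rownames(mydata), mydata$ID, function(n) n[max(1, length(n) - 1)])), ]
#   ID num
# 2  a   2
# 7  b   4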