This question already has answers here:
Select the first and last row by group in a data frame
(6 answers)
Closed 6 years ago.
I have a very large data frame that I need to subset by last values. I know that the data.table library includes the last() function which returns the last value of an array, but what I need is to subset foo by the last value in id for every separate value in track. Values in id are consecutive integers, but the last values will be different for every track.
> head(foo)
track id coords.x coords.y
1 0 0 -79.90732 43.26133
2 0 1 -79.90733 43.26124
3 0 2 -79.90733 43.26124
4 0 3 -79.90733 43.26124
5 0 4 -79.90725 43.26121
6 0 5 -79.90725 43.26121
The output would look something like this.
track id coords.x coords.y
1 0 57 -79.90756 43.26123
2 1 98 -79.90777 43.26231
3 2 61 -79.90716 43.26200
... and so on
How would one apply the last() function (or another function like tail()) to produce this output?
We can try with dplyr, grouping by track and selecting only the last row of every group.
library(dplyr)
df %>%
  group_by(track) %>%
  filter(row_number() == n())
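As a sanity check, here is the same idea on a tiny, made-up stand-in for foo (the coordinate values are invented for illustration):

library(dplyr)

# Toy version of foo: two tracks, ids restarting at 0
foo <- data.frame(track = c(0, 0, 0, 1, 1),
                  id = c(0, 1, 2, 0, 1),
                  coords.x = c(-79.907, -79.908, -79.909, -79.910, -79.911),
                  coords.y = c(43.261, 43.262, 43.263, 43.264, 43.265))

foo %>%
  group_by(track) %>%
  filter(row_number() == n())
# keeps one row per track: id 2 for track 0 and id 1 for track 1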
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), then, grouped by 'track', get the last row of each group with tail:
library(data.table)
setDT(df1)[, tail(.SD, 1), by = track]
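On large data a common data.table idiom is to collect the global row number of each group's last row with .I[.N] and subset once, which avoids building .SD for every group; a sketch assuming the same df1:

library(data.table)
# .I[.N] yields the overall row number of the last row in each 'track' group
setDT(df1)[df1[, .I[.N], by = track]$V1]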
Since the question also notes that the 'id' values are consecutive integers within each track, we can instead build a logical index with diff that flags the end of each consecutive run, extract the row numbers (.I), and use them to subset:
setDT(df1)[df1[, .I[c(diff(id) != 1, TRUE)], by = track]$V1]
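To see why that logical index picks out the last row of each run, evaluate it on a toy id sequence (values invented; three tracks' ids concatenated):

id <- c(0, 1, 2, 0, 1, 0, 1, 2, 3)
c(diff(id) != 1, TRUE)
# [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
# TRUE marks the last id of each consecutive run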
Or we can do this using base R itself
df1[!duplicated(df1$track, fromLast=TRUE),]
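The fromLast trick is easiest to see on the track column alone (toy values):

x <- c(0, 0, 0, 1, 1, 2)          # a track-like column
!duplicated(x, fromLast = TRUE)
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE
# TRUE at the last occurrence of each value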
Or another option is dplyr
library(dplyr)
df1 %>%
  group_by(track) %>%
  slice(n())
This question already has answers here:
Cumulative number of unique values in a column up to current row
(2 answers)
Closed 4 years ago.
My data frame looks something like this:
USER URL
1 homepage.com
1 homepage.com/welcome
1 homepage.com/overview
1 homepage.com/welcome
What I want is a vector with the following values:
UNIQUE
1
2
3
3
How do I do that?
We could use cumsum and duplicated
df$unique <- cumsum(!duplicated(df$URL))
df$unique
#[1] 1 2 3 3
duplicated gives us a logical vector indicating whether each value is a duplicate; we negate it (!) and then apply cumsum over it, giving a running count of the unique values seen so far.
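Spelling out the intermediate steps on the example URLs:

URL <- c("homepage.com", "homepage.com/welcome",
         "homepage.com/overview", "homepage.com/welcome")
duplicated(URL)          # [1] FALSE FALSE FALSE  TRUE
!duplicated(URL)         # [1]  TRUE  TRUE  TRUE FALSE
cumsum(!duplicated(URL)) # [1] 1 2 3 3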
Using dplyr to add a new column:
library(dplyr)
df %>%
  mutate(Dups = cumsum(!duplicated(URL)))
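If the counter should restart for every USER (the example only shows one user, so this is an assumption), group first; a sketch:

library(dplyr)
df %>%
  group_by(USER) %>%                          # assumed requirement: per-user counts
  mutate(Dups = cumsum(!duplicated(URL))) %>%
  ungroup()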
This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 4 years ago.
I have a data frame like this:
x y z country
1 4 1 USA
3 1 1 Canada
0 1 1 Spain
0 2 3 USA
4 1 1 Canada
I need to select the rows whose country appears at least 1000 times in the whole data frame. Let's say, for example, that only the USA and Canada meet that condition. The problem is that I have more than 40 countries and 500000 cases, so I can't do it case by case.
I suppose I need a for loop to do it, but I can't figure out how.
First get the names of the countries you want. Then subset by those names.
tab <- table(df$country)
mycountries <- names(tab[tab >= 1000])
df <- df[df$country %in% mycountries, ]
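A runnable sketch on the five example rows, with the threshold lowered to 2 so that something actually matches:

df <- data.frame(x = c(1, 3, 0, 0, 4),
                 y = c(4, 1, 1, 2, 1),
                 z = c(1, 1, 1, 3, 1),
                 country = c("USA", "Canada", "Spain", "USA", "Canada"))
tab <- table(df$country)
mycountries <- names(tab[tab >= 2])  # use >= 1000 on the real data
df[df$country %in% mycountries, ]
#   x y z country
# 1 1 4 1     USA
# 2 3 1 1  Canada
# 4 0 2 3     USA
# 5 4 1 1  Canada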
With data.table, and assuming your data frame is named df, we can create a variable named count holding the total number of rows for each country, and then subset to the countries with at least 1000 rows:
library(data.table)
setDT(df)
df[ , count := .N, by = country]
df[count >= 1000]
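If the helper column is only needed for the subsetting, it can be left out of the result (a minor variation on the same idea):

df[count >= 1000, !"count"]  # all columns except the helper 'count'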
One possible solution using dplyr, tabulating the qualifying countries and their counts:
library(dplyr)
df %>%
  group_by(country) %>%
  summarise(count = n()) %>%
  filter(count >= 1000) %>%
  arrange(desc(count))
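To subset the original rows instead of listing the countries, filter on the group size directly:

df %>%
  group_by(country) %>%
  filter(n() >= 1000) %>%
  ungroup()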
This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I have a large dataset (more than 500 000 rows) that I want to filter in R. I only want to retain the most relevant information, so I thought it would be a good idea to keep just the rows whose values occur more than some number of times. For example, I have this data:
A B
2 5
4 7
2 8
3 7
2 9
4 2
1 0
And I want to retain the rows whose value in column A occurs more than once. In this case the output would be:
A B
2 5
4 7
2 8
2 9
4 2
I know how to do it with for loops and rbind, but since the dataset I am using is very big, performance suffers greatly. Any advice?
We can do this using data.table, dplyr, or base R. With data.table, we convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'A', return the Subset of Data.table (.SD) only when the group has more than one row:
library(data.table)
setDT(df1)[, if(.N>1) .SD, by = A]
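A quick reproducible check with the posted data (note that data.table returns the rows grouped by A, not in their original order):

library(data.table)
df1 <- data.frame(A = c(2, 4, 2, 3, 2, 4, 1),
                  B = c(5, 7, 8, 7, 9, 2, 0))
setDT(df1)[, if (.N > 1) .SD, by = A]
#    A B
# 1: 2 5
# 2: 2 8
# 3: 2 9
# 4: 4 7
# 5: 4 2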
Or we can use dplyr: group by 'A' and keep the groups that have more than one row (n() > 1).
library(dplyr)
df1 %>%
  group_by(A) %>%
  filter(n() > 1)
Or, using ave from base R, we get a logical index and use that to subset the dataset:
df1[with(df1, ave(seq_along(A), A, FUN = length)) > 1, ]
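The ave call simply replaces every element by its group size; on the example column A:

A <- c(2, 4, 2, 3, 2, 4, 1)
ave(seq_along(A), A, FUN = length)
# [1] 3 2 3 1 3 2 1
# comparing against > 1 keeps rows 1, 2, 3, 5 and 6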
Or without using any groupings, we can use duplicated to get the index and subset
df1[duplicated(df1$A) | duplicated(df1$A, fromLast = TRUE), ]
This question already has answers here:
Select groups with more than one distinct value
(3 answers)
Closed 7 years ago.
I have data like below:
ID category class
1 a m
1 a s
1 b s
2 a m
3 b s
4 c s
5 d s
I want to subset the data by only including those "ID" values that have more than one distinct category.
My expected output:
ID category class
1 a m
1 a s
1 b s
Is there a way of doing so?
I tried
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(category, class) > 1)
But it gave me an error:
# Error: expecting a single value
Using data.table
library(data.table) #see: https://github.com/Rdatatable/data.table/wiki for more
setDT(data) #convert to native 'data.table' type by reference
data[ , if(uniqueN(category) > 1) .SD, by = ID]
uniqueN is data.table's fast native equivalent of length(unique()), and .SD is just the whole data.table (in more general cases, it can represent a subset of columns, e.g. when the .SDcols argument is used). So basically the middle statement (j, the column selection argument) says to return all columns and rows associated with an ID for which there are at least two distinct values of category.
Use the by argument to extend to a case involving counts of multiple columns.
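A reproducible sketch with the posted data:

library(data.table)
data <- data.frame(ID = c(1, 1, 1, 2, 3, 4, 5),
                   category = c("a", "a", "b", "a", "b", "c", "d"),
                   class = c("m", "s", "s", "m", "s", "s", "s"))
setDT(data)
data[, if (uniqueN(category) > 1) .SD, by = ID]
#    ID category class
# 1:  1        a     m
# 2:  1        a     s
# 3:  1        b     s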
This question already has answers here:
Add column with order counts
(2 answers)
Count number of rows within each group
(17 answers)
Closed 7 years ago.
What is the easiest way to count the occurrences of an element in a vector or data.frame within every group?
I don't mean just counting the total (as other Stack Overflow questions ask) but giving a different number to every successive occurrence.
For example, for this simple data frame (though I will work with data frames with more columns):
mydata <- data.frame(A=c("A","A","A","B","B","A", "A"))
I've found this solution:
cbind(mydata, myorder = ave(rep(1, nrow(mydata)), mydata$A, FUN = cumsum))
and here is the result:
A myorder
A 1
A 2
A 3
B 1
B 2
A 4
A 5
Isn't there a single command to do it? Or a specialized package?
I want it to later use tidyr's spread() function.
My question is not the same as
Is there an aggregate FUN option to count occurrences?
because I don't want the total number of occurrences at the end but the cumulative count up to every element.
OK, my problem is a little bit more complex
mydata <- data.frame(group=c("x","x","x","x","y","y", "y"), letter=c("A","A","A","B","B","A", "A"))
I only know how to solve the first example I wrote above.
But what happens when I want it also by a second grouping variable?
something like occurrences(letter) by group.
group letter "occurencies within group"
x A 1
x A 2
x A 3
x B 1
y B 1
y A 1
y A 2
I've found the way with
ave(rep(1, nrow(mydata)), list(mydata$group, mydata$letter), FUN = cumsum)
though there should be something easier.
Using data.table
library(data.table)
setDT(mydata)
mydata[, myorder := 1:.N, by = .(group, letter)]
The by argument makes the table be processed within the groups defined by the columns group and letter. .N is the number of rows within that group (if the by argument were empty it would be the number of rows in the table), so within each sub-table the rows are indexed from 1 to the number of rows in that sub-table.
mydata
group letter myorder
1: x A 1
2: x A 2
3: x A 3
4: x B 1
5: y B 1
6: y A 1
7: y A 2
Or a dplyr solution, which is pretty much the same:
library(dplyr)
mydata %>%
  group_by(group, letter) %>%
  mutate(myorder = 1:n())
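row_number() is the more idiomatic dplyr spelling of 1:n(); the two are equivalent here:

library(dplyr)
mydata %>%
  group_by(group, letter) %>%
  mutate(myorder = row_number()) %>%
  ungroup()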