dplyr: Generate row number/row position in group_by [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 6 years ago.
I have a dataset and I want to generate the row position by group. For example
library(data.table)
data <- data.table(Position = c(1,2,3,4,5,6,7,8,9,10),
                   Category = c("M","M","M","M","F","F","F","M","M","F"))
I group by Category and want to create a column that is the row position within each group, with dplyr or with data.table. Something like:
dataByGroup %>% group_by(Category) %>% mutate(positionInCategory = 1:nrow(Category))
I'm unable to work out how to achieve this.
Desired output:
| Position|Category | positionInCategory|
|--------:|:--------|------------------:|
| 1|M | 1|
| 2|M | 2|
| 3|M | 3|
| 4|M | 4|
| 5|F | 1|
| 6|F | 2|
| 7|F | 3|
| 8|M | 5|
| 9|M | 6|
| 10|F | 4|

Try the following:
library(data.table)
library(dplyr)
data <- data.table(Position = c(1,2,3,4,5,6,7,8,9,10),
                   Category = c("M","M","M","M","F","F","F","M","M","F"))
cleanData <- data %>%
  group_by(Category) %>%
  mutate(positionInCategory = 1:n())
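As a side note, row_number() is the more common dplyr idiom for this counter and is equivalent here; a minimal sketch, assuming the same data as above:
library(dplyr)
# row_number() gives each row's position within its group, same as 1:n()
cleanData <- data %>%
  group_by(Category) %>%
  mutate(positionInCategory = row_number()) %>%
  ungroup()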

Try
data[, new := rowid(Category)]
# or, if you're using 1.9.6 or older
data[, new := 1:.N, by=Category]
    Position Category new
 1:        1        M   1
 2:        2        M   2
 3:        3        M   3
 4:        4        M   4
 5:        5        F   1
 6:        6        F   2
 7:        7        F   3
 8:        8        M   5
 9:        9        M   6
10:       10        F   4
To use rowid, you'll currently need the unstable/devel version of the package.
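If you'd rather avoid both packages for this step, a base R sketch with ave() produces the same within-group counter (assuming the same data as above):
# ave() applies seq_along to the row indices inside each Category group
data$positionInCategory <- ave(seq_len(nrow(data)), data$Category,
                               FUN = seq_along)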

Related

Create a new data.table of percentages based off another data.table

I am trying to create a new data.table of percentages based on another data.table.
I thought about creating new columns and dividing them, but I got lost in the logic of how to do this. Basically, I need to know for each Subject the percentage of visits at which Meet is 1.
datatable_1
+---------------+------------+--------------+
| Meet | Subject | Visit |
+---------------+------------+--------------+
| 1 | a | 1 |
+---------------+------------+--------------+
| 1 | a | 2 |
+---------------+------------+--------------+
| 0 | a | 3 |
+---------------+------------+--------------+
| 1 | b | 1 |
+---------------+------------+--------------+
| 1 | b | 2 |
+---------------+------------+--------------+
| 1 | b | 3 |
+---------------+------------+--------------+
This is what the new data.table should look like
datatable_2
+---------------+------------+
| Subject | Percentage |
+---------------+------------+
| a             | 0.67       |
+---------------+------------+
| b             | 1.00       |
+---------------+------------+
If Meet has only 1/0 values, you can take the average of Meet for each Subject.
library(data.table)
setDT(df)[, .(Percentage = mean(Meet)), Subject]
# Subject Percentage
#1: a 0.667
#2: b 1.000
This can also be written in base R and with dplyr:
#Base R
aggregate(Meet ~ Subject, df, mean)
#dplyr
library(dplyr)
df %>%
  group_by(Subject) %>%
  summarise(Percentage = mean(Meet))
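For reference, an assumed reconstruction of datatable_1 from the question, so the snippets above run as-is:
# Assumed reconstruction of the question's datatable_1
df <- data.frame(
  Meet    = c(1, 1, 0, 1, 1, 1),
  Subject = c("a", "a", "a", "b", "b", "b"),
  Visit   = c(1, 2, 3, 1, 2, 3)
)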

Select highest 3 values and return row and column name in R

I would like to select the 3 highest scores from a data table and return the row and column name of each. I have a data table named score like below.
| |a |b |c |d |
|-|----|----|----|----|
|1|10 |23 |56 |5 |
|2|34 |25 |16 |67 |
|3|9 |11 |32 |45 |
|4|29 |47 |27 |35 |
|5|48 |4 |41 |22 |
This is my expected output:
d 2
c 1
a 5
Thank you in advance.
Get the data in long format and keep the top 3 rows.
library(tidyverse)
df %>%
  rownames_to_column('row') %>%
  pivot_longer(cols = -row) %>%
  slice_max(value, n = 3)
# row name value
# <chr> <chr> <int>
#1 2 d 67
#2 1 c 56
#3 5 a 48
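For comparison, a base R sketch: treat the scores as a numeric matrix (an assumption about the data's type), let order() return the linear indices of the largest values, then map them back to row/column positions with arrayInd():
# Reconstruction of the question's score data as a matrix
score <- matrix(
  c(10, 23, 56,  5,
    34, 25, 16, 67,
     9, 11, 32, 45,
    29, 47, 27, 35,
    48,  4, 41, 22),
  nrow = 5, byrow = TRUE,
  dimnames = list(1:5, c("a", "b", "c", "d"))
)
# Linear indices of the 3 largest values, mapped back to (row, column)
idx <- arrayInd(order(score, decreasing = TRUE)[1:3], dim(score))
data.frame(name  = colnames(score)[idx[, 2]],
           row   = idx[, 1],
           value = score[idx])
#   name row value
# 1    d   2    67
# 2    c   1    56
# 3    a   5    48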

Sort by occurrence or freq in data frame in R? [duplicate]

This question already has answers here:
Count number of times each unique value appearing in R [duplicate]
(1 answer)
How to return 5 topmost values from vector in R?
(4 answers)
Closed 5 years ago.
I have a data frame of URLs and I'm trying to find the top 10 URLs by frequency. This is what I have:
+------------+
|urls |
+------------+
|google.com |
|linkedin.com|
|yahoo.com |
|google.com |
|yahoo.com |
+------------+
I tried to add a freq column but I cannot seem to get it right. I tried count(df, "url"), but it only gives me the freq without the urls, like this:
+----+
|freq|
+----+
|2 |
|1 |
|2 |
|2 |
|2 |
+----+
How can I get a data frame like this,
+---------------+------------+
|urls | freq |
+---------------+------------+
|google.com | 2 |
|linkedin.com | 1 |
|yahoo.com | 2 |
|google.com | 2 |
|yahoo.com | 2 |
+---------------+------------+
Also, how do I sort it and keep only the top 10?
table returns the frequency of each URL. You can then sort it in decreasing order and pick the first 10:
sort(table(df$urls), decreasing = TRUE)[1:10]
If you want just the URL names, use
names(sort(table(df$urls), decreasing = TRUE)[1:10])
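If you want the result as a two-column data frame rather than a named table, a small sketch (the Var1/Freq column names come from as.data.frame's method for tables):
# Convert the sorted frequency table to a data frame; head() avoids
# NA padding when there are fewer than 10 distinct URLs
top <- sort(table(df$urls), decreasing = TRUE)
head(as.data.frame(top), 10)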
Here's a tidyverse solution. Use group_by and n to get the count of each url, then order the rows with arrange:
library(tidyverse)
df <- tibble(urls = c('google.com', 'linkedin.com', 'yahoo.com', 'google.com', 'yahoo.com'))
df %>%
  group_by(urls) %>%
  mutate(freq = n()) %>%
  arrange(desc(freq)) %>%
  head(10)
#> # A tibble: 5 x 2
#> # Groups:   urls [3]
#>   urls          freq
#>   <chr>        <int>
#> 1 google.com       2
#> 2 yahoo.com        2
#> 3 google.com       2
#> 4 yahoo.com        2
#> 5 linkedin.com     1
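Alternatively, dplyr's count() collapses to one row per distinct URL with its frequency, which may be closer to what the count(df, "url") attempt was after; a small sketch:
# One row per distinct URL, most frequent first
df %>%
  count(urls, name = "freq", sort = TRUE) %>%
  head(10)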

R ddply sum value from next row

I want to sum each row's Val with the Val of the next row.
> df
+----+------+--------+------+
| id | Val | Factor | Col |
+----+------+--------+------+
| 1 | 15 | 1 | 7 |
| 3 | 20 | 1 | 4 |
| 2 | 35 | 2 | 8 |
| 7 | 35 | 1 | 12 |
| 5 | 40 | 1 | 11 |
| 6 | 45 | 2 | 13 |
| 4 | 55 | 1 | 4 |
| 8 | 60 | 1 | 7 |
| 9 | 15 | 2 | 12 |
..........
I would like the mean of the sums Row$Val + nextRow$Val, where the next row is the one with id + 1 and a matching Col. I can't assume that the id or Col values are consecutive.
I am using ddply to summarize my df. I have tried
> ddply(df, .(Factor), summarize,
        max(Val),
        sum(Val),
        mean(Val + df[df$id == id+1 & df$Col == Col]$Val)
  )
> "longer object length is not a multiple of shorter object length"
You can build a vector of values with
sapply(df$id, function(x) {
  mean(c(
    subset(df, id == x, select = Val, drop = TRUE),
    subset(df, id == x + 1, select = Val, drop = TRUE)
  ))
})
You could simplify, but I tried to make it as readable as possible.
You can use rollapply from the zoo package. Since you want the mean of only two consecutive rows, you can try
library(zoo)
rollapply(df[order(df$id), 2], 2, function(x) sum(x)/2)
#[1] 25.0 27.5 37.5 47.5 42.5 40.0 47.5 37.5
You can do something like this with the dplyr package:
library(dplyr)
df <- arrange(df, id)
mean(df$Val + lead(df$Val), na.rm = TRUE)
[1] 76.25
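For completeness, an assumed reconstruction of df from the question's table (which is truncated there), so the answers above can be run on the rows shown:
# Assumed reconstruction of the question's df; ids are deliberately
# not in order, as in the original
df <- data.frame(
  id     = c(1, 3, 2, 7, 5, 6, 4, 8, 9),
  Val    = c(15, 20, 35, 35, 40, 45, 55, 60, 15),
  Factor = c(1, 1, 2, 1, 1, 2, 1, 1, 2),
  Col    = c(7, 4, 8, 12, 11, 13, 4, 7, 12)
)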

Get minimum grouped by unique combination of two columns

What I'm trying to achieve in R is the following: given a table (a data frame in my case), I want to get the lowest price for each unique combination of two columns.
For example, given the following table:
+-----+-----------+-------+----------+----------+
| Key | Feature1 | Price | Feature2 | Feature3 |
+-----+-----------+-------+----------+----------+
| AAA | 1 | 100 | whatever | whatever |
| AAA | 1 | 150 | whatever | whatever |
| AAA | 1 | 200 | whatever | whatever |
| AAA | 2 | 110 | whatever | whatever |
| AAA | 2 | 120 | whatever | whatever |
| BBB | 1 | 100 | whatever | whatever |
+-----+-----------+-------+----------+----------+
I want a result that looks like:
+-----+-----------+-------+----------+----------+
| Key | Feature1 | Price | Feature2 | Feature3 |
+-----+-----------+-------+----------+----------+
| AAA | 1 | 100 | whatever | whatever |
| AAA | 2 | 110 | whatever | whatever |
| BBB | 1 | 100 | whatever | whatever |
+-----+-----------+-------+----------+----------+
So I'm working on a solution along the lines of:
s <- lapply(split(data, list(data$Key, data$Feature1)), function(chunk) {
  chunk[which.min(chunk$Price), ]
})
But the result is a 1 x n matrix - so I need to unsplit the result. Also - it seems very slow. How can I improve this logic?
I've seen solutions pointing in the directions of the data.table package. Should I re-write using that package?
Update
Great answers guys - thanks! However, my original data frame contains more columns (Feature2, ...) and I need them all back after the filtering. The rows that do not have the lowest price (for the combination of Key/Feature1) can be discarded, so I'm not interested in their values for Feature2/Feature3.
You can use the dplyr package:
library(dplyr)
data %>%
  group_by(Key, Feature1) %>%
  slice(which.min(Price))
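In more recent dplyr (1.0+), slice_min() expresses the same idea; with_ties = FALSE keeps a single row per group, mirroring which.min:
# One lowest-Price row per Key/Feature1 combination
data %>%
  group_by(Key, Feature1) %>%
  slice_min(Price, n = 1, with_ties = FALSE) %>%
  ungroup()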
Since you referred to the data.table package, here is the solution using it:
library(data.table)
setDT(df)[, .(Price = min(Price)), .(Key, Feature1)]   # initial question
setDT(df)[, .SD[which.min(Price)], .(Key, Feature1)]   # updated question
df is your sample data.frame.
Update: Test using mtcars data
df <- mtcars
library(data.table)
setDT(df)[, .SD[which.min(mpg)], by = am]
   am  mpg cyl disp  hp drat   wt  qsec vs gear carb
1:  1 15.0   8  301 335 3.54 3.57 14.60  0    5    8
2:  0 10.4   8  472 205 2.93 5.25 17.98  0    3    4
The base R solution would be aggregate(Price ~ Key + Feature1, data, FUN = min)
Using R base aggregate
> aggregate(Price ~ Key + Feature1, data = data, FUN = min)
Key Feature1 Price
1 AAA 1 100
2 BBB 1 100
3 AAA 2 110
See this post for other alternatives.
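Note that aggregate returns only the grouping columns and the minimum Price. For the updated question (keeping Feature2/Feature3), one base R option, sketched under the same data assumptions, is to merge the per-group minima back onto the original data:
# merge() joins on the shared columns Key, Feature1 and Price, so only
# rows matching a per-group minimum survive
mins <- aggregate(Price ~ Key + Feature1, data, FUN = min)
merge(data, mins)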
