R count number of rows with duplicate values


Let's say we have this data frame:
column_a <- c("a","a","b","c","c","c")
column_b <- c("xx","zz","nn","mm","vv","yy")
df <- data.frame (column_a, column_b)
I'm looking to count the number of rows with the same unique values in column_a so that I get something like this:
df2 <- data.frame(unique = c("a","b","c"), n = c("2","1","3"))
So far I tried this but it's not exactly what I'm looking for:
df %>% group_by(column_a) %>% mutate(replicate=seq(n()))

You can try this
library(dplyr)
df %>%
select(column_a, column_b) %>%
unique() %>%
group_by(column_a) %>%
summarize(n = n())
This gives the result:
# A tibble: 3 x 2
column_a n
<fct> <int>
1 a 2
2 b 1
3 c 3
You can convert it to a data.frame if required.

I believe you're looking for tally(), or count() as a one-step shorthand:
df %>% group_by(column_a) %>% tally()
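As a minimal sketch of the shorthand, assuming dplyr is installed: count() groups and tallies in one call, and as.data.frame() converts the result to a plain data frame, as the question asks. The name counts is introduced here for illustration only.

```r
library(dplyr)

column_a <- c("a", "a", "b", "c", "c", "c")
column_b <- c("xx", "zz", "nn", "mm", "vv", "yy")
df <- data.frame(column_a, column_b)

# count(column_a) is shorthand for group_by(column_a) %>% tally()
counts <- df %>%
  count(column_a) %>%
  as.data.frame()   # plain data.frame instead of a tibble

counts$n
# 2 1 3
```

Unlike the select() + unique() answer above, this counts every row, including rows that are exact duplicates across both columns.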

Related

Calculating average rle$lengths over grouped data

I would like to calculate duration of state using rle() on grouped data. Here is test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. I have created a function inspired by: https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/
to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
  ID count avg_length
1 L0     2          2
2 L1     2          2
L0 is correct, L1 has no instances of state == 1 so the average should be zero or NA.
I isolated the problem in terms of breaking it down into just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks

As it is a data.frame column, we may need to unnest afterwards
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do() code, the column should be extracted not from the whole data frame (DF$state) but from the group passed in from the left-hand side, i.e. the . placeholder (written .$state). Note that do() is superseded, so it may be better to use summarise() with unnest()/unpack().
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
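To see where these numbers come from, here is a small base R sketch that runs rle() by hand on each group's state values, copied from the test data in the question. The names state_L0, state_L1, r and is_one are illustrative only.

```r
# state values per ID, copied from the question's test data
state_L0 <- c(0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 0)
state_L1 <- c(0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0)

r <- rle(state_L0)
r$values   # 0 1 0 2 1 0
r$lengths  # 3 3 2 1 1 1

# keep only the runs where state == 1
is_one <- r$values == 1
sum(is_one)              # 2 runs
mean(r$lengths[is_one])  # (3 + 1) / 2 = 2

# L1 never reaches state 1, so the mean of zero run lengths is NaN
r1 <- rle(state_L1)
mean(r1$lengths[r1$values == 1])  # NaN
```

Passing DF$state inside do() feeds the full 22-row column to rle() for every group, which is why both IDs showed the same result in the question.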

What is the tidyverse way to apply a function designed to take data.frames as input across a grouped tibble in R?

I've written a function that takes multiple columns as its input that I'd like to apply to a grouped tibble, and I think that something with purrr::map might be the right approach, but I don't understand what the appropriate input is for the various map functions. Here's a dummy example:
myFun <- function(DF){
DF %>% mutate(MyOut = (A * B)) %>% pull(MyOut) %>% sum()
}
MyDF <- data.frame(A = 1:5, B = 6:10)
myFun(MyDF)
This works fine. But what if I want to add some grouping?
MyDF <- data.frame(A = 1:100, B = 1:100, Fruit = rep(c("Apple", "Mango"), each = 50))
MyDF %>% group_by(Fruit) %>% summarize(MyVal = myFun(.))
This doesn't work. I get the same value for every group in my data.frame or tibble. I then tried using something with purrr:
MyDF %>% group_by(Fruit) %>% map(.f = myFun)
Apparently, that's expecting character data as input, so that's not it.
This next variation is basically what I need, but the output is a list of lists rather than a tibble with one row for each value of Fruit:
MyDF %>% group_by(Fruit) %>% group_map(~ myFun(.))

We can use the OP's function in group_modify
library(dplyr)
MyDF %>%
group_by(Fruit) %>%
group_modify(~ .x %>%
summarise(MyVal = myFun(.x))) %>%
ungroup
Output:
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
Or in group_map where the .y is the grouping column
MyDF %>%
group_by(Fruit) %>%
group_map(~ bind_cols(.y, MyVal = myFun(.))) %>%
bind_rows
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
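Since the question asks about purrr specifically, another possible route is to nest the data by group and map the function over the resulting list-column. This is only a sketch, assuming tidyr (1.0.0 or later, for the nest(data = -Fruit) syntax) and purrr are available; res is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(purrr)

# the OP's function: takes a data frame, returns a single number
myFun <- function(DF) {
  DF %>% mutate(MyOut = A * B) %>% pull(MyOut) %>% sum()
}

MyDF <- data.frame(A = 1:100, B = 1:100,
                   Fruit = rep(c("Apple", "Mango"), each = 50))

res <- MyDF %>%
  nest(data = -Fruit) %>%                   # one row per Fruit, other columns in list-column `data`
  mutate(MyVal = map_dbl(data, myFun)) %>%  # call myFun once per nested data frame
  select(-data)

res$MyVal
# 42925 295425
```

map_dbl() returns a double vector, so MyVal prints as <dbl> rather than <int> here.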

R Studio using dplyr to summarize (where like)

I'm trying to output the number of status (that is open) group by ID. Please see below example:
(note: (1 status that is open) is used to show why it's 1, I don't want to output the sentence)
Re-producible code:
ID <- c(1,1,1,2,2,2)
Status <- c("status.open","status.closed", "status.wait", "status.open", "status.open", "status.wait" )
df <- data.frame(ID, Status)
pseudo-code:
df %>%
group_by(ID) %>%
summarize(count = length(Status where status like "%open"))
Please help, thanks!

You may achieve this with the following code:
require(dplyr)
df %>% filter(Status == "status.open") %>% ## you only want status.open
count(ID) ## count members of ID
Which produces:
# A tibble: 2 x 2
# Groups: ID [2]
ID n
<dbl> <int>
1 1 1
2 2 2

Solution (as close as possible to your 'pseudo-code') using dplyr, grepl, and R's implicit conversion of booleans (TRUE becomes 1 when we do math with it):
library(dplyr)
df %>%
group_by(ID) %>%
summarise(count = sum(grepl("open", Status)))
Returns:
# A tibble: 2 x 2
ID count
<dbl> <int>
1 1 1
2 2 2

A rough equivalent of SQL's LIKE '%open' is:
library(stringr)
df %>%
filter(str_detect(Status, "open$")) # open$ = ends with open

What about this solution :
df %>% dplyr::group_by(ID) %>% dplyr::summarize(count = sum(Status == "status.open"))
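The answers above differ in one detail that the example data does not expose: filtering first drops any ID that has no open status, while summing a logical vector keeps it with a count of 0. A small sketch, where ID 3 is a hypothetical extra group added purely for illustration and df3, by_sum and by_filter are made-up names:

```r
library(dplyr)

# ID 3 is hypothetical: it has no status ending in "open"
ID <- c(1, 1, 1, 2, 2, 2, 3)
Status <- c("status.open", "status.closed", "status.wait",
            "status.open", "status.open", "status.wait",
            "status.closed")
df3 <- data.frame(ID, Status)

# summing a logical vector keeps every ID, including zero counts
by_sum <- df3 %>%
  group_by(ID) %>%
  summarise(count = sum(grepl("open$", Status)))  # "open$" = ends with open

# filtering first removes IDs that never match
by_filter <- df3 %>%
  filter(grepl("open$", Status)) %>%
  count(ID)

by_sum$count   # 1 2 0
by_filter$ID   # 1 2
```

Which behaviour is preferable depends on whether groups without a match should appear in the result.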

Select rows by ID with most matches

I have a data frame like this:
df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
torre = c("a","a","b","d","a","q","t","q","g","a","b","c"))
and I would like my code to select, for each id, the torre that repeats the most, or the last torre for that id if none repeats more than the others, so I'll get a new data frame like this:
df2 <- data.frame(id = c(1,2,3,4), torre = c("a","a","q","c"))

You can use aggregate:
aggregate(torre ~ id, data=df,
FUN=function(x) names(tail(sort(table(factor(x, levels=unique(x)))),1))
)
The full explanation for this function is a bit involved, but most of the work is done by the FUN= argument. Here we define a function that gets the frequency counts for each torre (keeping the levels in order of first appearance), sorts them in increasing order, takes the last one with tail(, 1), and returns its name. aggregate() then applies this function separately for each id.
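The steps inside FUN can be traced by hand on a single id. The sketch below uses the torre values for id 3 and id 4 from the question; pick_torre is an illustrative name for the same anonymous function.

```r
# the anonymous FUN from above, given a name for illustration
pick_torre <- function(x) {
  tab <- table(factor(x, levels = unique(x)))  # counts, in order of first appearance
  names(tail(sort(tab), 1))                    # name of the last entry after an increasing sort
}

# id 3: "q" occurs twice, so it ends up last after sorting
x3 <- c("q", "t", "q", "g")
table(factor(x3, levels = unique(x3)))  # q: 2, t: 1, g: 1
pick_torre(x3)                          # "q"

# id 4: all counts tie at 1, ties keep their original order,
# so the last torre seen is returned
x4 <- c("a", "b", "c")
pick_torre(x4)                          # "c"
```

Setting levels = unique(x) is what makes the tie case work: without it, table() would order the values alphabetically rather than by appearance.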

You could do this using the dplyr package: group by id and torre to calculate the number of occurrences of each torre/id combination, then group by id only and select the last occurrence of torre that has the highest in-group frequency.
library(dplyr)
df %>%
group_by(id,torre) %>%
mutate(n=n()) %>%
group_by(id) %>%
filter(n==max(n)) %>%
slice(n()) %>%
select(-n)
id torre
<dbl> <chr>
1 1 a
2 2 a
3 3 q
4 4 c

An approach with the data.table package:
library(data.table)
setDT(df)[, .N, by = .(id, torre)][order(N), .(torre = torre[.N]), by = id]
which gives:
id torre
1: 1 a
2: 2 a
3: 3 q
4: 4 c
And two possible dplyr alternatives:
library(dplyr)
# option 1
df %>%
group_by(id, torre) %>%
mutate(n = n()) %>%
group_by(id) %>%
mutate(f = rank(n, ties.method = "first")) %>%
filter(f == max(f)) %>%
select(-n, -f)
# option 2
df %>%
group_by(id, torre) %>%
mutate(n = n()) %>%
distinct() %>%
arrange(n) %>%
group_by(id) %>%
slice(n()) %>%
select(-n)

Yet another dplyr solution, this time using add_count() instead of mutate():
df %>%
add_count(id, torre) %>%
group_by(id) %>%
filter(n == max(n)) %>%
slice(n()) %>%
select(-n)
# A tibble: 4 x 2
# Groups: id [4]
id torre
<dbl> <fct>
1 1. a
2 2. a
3 3. q
4 4. c

unexpected row when going from long to wide format with dplyr and tidyr

I've got a data frame (dfdat) with two categorical variables, location and employmentstatus.
I'd like to generate a data frame with the proportions of employment status for each location.
mydf_wide (achieved outcome) is almost what I'm looking for. The problem's that employmentstatus is a variable with two levels, yet there're three rows in mydf_wide. I don't understand why that is, because I'd have expected something similar to mytable (expected outcome).
Any help would be much appreciated.
Starting point (df):
dfdat <- data.frame(location=c("GA","GA","MA","OH","RI","GA","AZ","MA","OH","RI"),employmentstatus=c(1,2,1,2,1,1,1,2,1,1))
Expected outcome (table):
mytable <- table(dfdat$employmentstatus,dfdat$location)
mytable <- round(100*(prop.table(mytable, 2)),1)
Achieved outcome (df):
library(dplyr)
mydf <- dfdat %>%
group_by(location,employmentstatus) %>%
summarise (n = n()) %>%
mutate(freq = round((n / sum(n)*100),1))
library(tidyr)
mydf_wide <- spread(mydf, location, freq)
mydf_wide <- as.data.frame(mydf_wide)

We need to do a second group_by with 'location' to get the sum. Also, instead of grouping and then creating the 'n', count function can be used
dfdat %>%
count(location, employmentstatus) %>%
group_by(location) %>%
mutate(n = round(100*n/sum(n), 2)) %>%
spread(location, n, fill = 0)
# A tibble: 2 x 6
# employmentstatus AZ GA MA OH RI
#* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 100 66.67 50 50 100
#2 2 0 33.33 50 50 0
If we are using the OP's code, then remove the 'n' column and then do the spread
dfdat %>%
group_by(location,employmentstatus) %>%
summarise (n = n()) %>%
mutate(freq = round((n / sum(n)*100),1)) %>%
select(-n) %>%
spread(location, freq, fill =0)
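Alternatively, overwrite the 'n' column with the rounded percentage and then spread. Leaving the raw 'n' column in place makes spread() treat it as an identifier alongside employmentstatus, so each distinct (employmentstatus, n) combination gets its own row; that is where the unexpected third row comes from.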
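For readers on a recent tidyr, where spread() is superseded by pivot_wider(), the same reshaping could be sketched as follows. This assumes tidyr 1.0.0 or later; wide is an illustrative name.

```r
library(dplyr)
library(tidyr)

dfdat <- data.frame(
  location = c("GA", "GA", "MA", "OH", "RI", "GA", "AZ", "MA", "OH", "RI"),
  employmentstatus = c(1, 2, 1, 2, 1, 1, 1, 2, 1, 1)
)

wide <- dfdat %>%
  count(location, employmentstatus) %>%
  group_by(location) %>%
  mutate(freq = round(100 * n / sum(n), 1)) %>%  # percentage within each location
  ungroup() %>%
  select(-n) %>%                                 # drop n so it cannot act as an id column
  pivot_wider(names_from = location,
              values_from = freq,
              values_fill = 0)                   # absent combinations become 0

nrow(wide)
# 2
```

Dropping n before reshaping is the key step, exactly as with spread(): only employmentstatus is left to identify the rows.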
