Find a turnover of each value in column [duplicate] - r

This question already has answers here:
Counting unique / distinct values by group in a data frame (12 answers)
I have a dataset, which I define for example like this:
type <- c(1,1,1,2,2,2,2,2,3,3,4,4,5,5)
val <- c(4,1,1,2,8,2,3,2,3,3,4,4,5,7)
tdt <- data.frame(type, val)
So it looks like this:
type val
1 4
1 1
1 1
2 2
2 8
2 2
2 3
2 2
3 3
3 3
4 4
4 4
5 5
5 7
I want to find how many unique vals each type gets (turnover). So desired result is:
type turnover
1 2
2 3
3 1
4 1
5 2
How could I get it? What should this function look like? I know how to count the occurrences of each type, but not the number of unique vals within each type.

With n_distinct, we can get the number of unique elements grouped by 'type'
library(dplyr)
tdt %>%
  group_by(type) %>%
  summarise(turnover = n_distinct(val))
# A tibble: 5 x 2
# type turnover
# <int> <int>
#1 1 2
#2 2 3
#3 3 1
#4 4 1
#5 5 2
Or with distinct and count
tdt %>%
  distinct() %>%
  count(type)
# type n
#1 1 2
#2 2 3
#3 3 1
#4 4 1
#5 5 2
Or using uniqueN from data.table
library(data.table)
setDT(tdt)[, .(turnover = uniqueN(val)), type]
Or with table in base R after getting the unique rows
table(unique(tdt)$type)
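For completeness, a tapply one-liner also works in base R; a sketch, noting it returns a named vector keyed by type rather than a data frame:
tapply(tdt$val, tdt$type, function(v) length(unique(v)))
# 1 2 3 4 5
# 2 3 1 1 2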
data
tdt <- structure(list(type = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
4L, 4L, 5L, 5L), val = c(4L, 1L, 1L, 2L, 8L, 2L, 3L, 2L, 3L,
3L, 4L, 4L, 5L, 7L)), class = "data.frame", row.names = c(NA,
-14L))

Another base R option is using aggregate
tdtout <- aggregate(val ~ ., tdt, function(v) length(unique(v)))
such that
> tdtout
type val
1 1 2
2 2 3
3 3 1
4 4 1
5 5 2
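To match the desired column name, the second column of the result can then be renamed:
names(tdtout)[2] <- "turnover"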
data
> dput(tdt)
structure(list(type = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5,
5), val = c(4, 1, 1, 2, 8, 2, 3, 2, 3, 3, 4, 4, 5, 7)), class = "data.frame", row.names = c(NA,
-14L))

Related

How can I calculate the sum of the column wise differences using dplyr

Despite using R and dplyr on a regular basis, I have not been able to calculate the sum of the absolute differences between all adjacent columns:
sum_diff = ABS(A-B) + ABS(B-C) + ABS(C-D) + ...
A B C D sum_diff
1 2 3 4 3
2 1 3 4 4
1 2 1 1 2
4 1 2 1 5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
We may take the data without its last column, subtract the data without its first column (which lines each column up with its right-hand neighbour), and use rowSums on the absolute values in base R. This can be very efficient compared to a package solution.
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
-output
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
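An apply sketch with diff spells out the same computation row by row; it is clearer but slower than the vectorised versions above (column names A:D as in the data):
apply(df1[c("A", "B", "C", "D")], 1, function(r) sum(abs(diff(r))))
# [1] 3 4 2 5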
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Data from akrun (many thanks)!
This is a bit complicated: the idea is to generate a list of the column pairs. I tried it with combn, but that returns all possible combinations, so I created the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling afterwards:
library(tidyverse)
combinations <- list(c("A", "B"), c("B", "C"), c("C", "D"))
purrr::map_dfc(combinations, ~ {
  df <- tibble(a = data[[.[[1]]]] - data[[.[[2]]]])
  names(df) <- paste0(.[[1]], "_v_", .[[2]])
  df
}) %>%
  transmute(sum_diff = rowSums(abs(.))) %>%
  bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Here is a dplyr version of akrun's elegant approach that calculates the diff of the dataframe with its shifted variant:
df %>%
  mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1)) -
                                identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time every row is treated as a vector from which its shifted self is subtracted.
df %>%
  rowwise() %>%
  mutate(sum_diff = map2_int(c_across(1:last_col(1)),
                             c_across(2:last_col()),
                             ~ abs(.x - .y)) %>% sum())

Number of rows by Group ID conditional

I have a dataframe like this:
ID S1 C
1 1 2 3
2 1 2 3
3 3 1 1
4 6 2 5
5 6 7 5
What I need is the number of rows per group ID where S1 <= C. This is the desired output.
ID Obs
1 1 2
2 3 1
3 6 1
Even though the question was answered below, I have a follow-up question: is it possible to do the same for multiple columns (S1, S2, ...)? For example, for the dataframe below:
ID S1 S2 C
1 1 2 2 3
2 1 2 2 3
3 3 1 1 1
4 6 2 2 5
5 6 7 7 5
And then get:
ID S1.Obs S2.Obs
1 1 2 2
2 3 1 1
3 6 1 1
A base R solution with aggregate().
aggregate(Obs ~ ID, transform(df, Obs = S1 <= C), sum)
# ID Obs
# 1 1 2
# 2 3 1
# 3 6 1
A dplyr solution
library(dplyr)
df %>%
  filter(S1 <= C) %>%
  count(ID, name = "Obs")
# ID Obs
# 1 1 2
# 2 3 1
# 3 6 1
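One caveat: filtering first silently drops any ID for which no row satisfies the condition. Summing the logical inside summarise keeps such IDs with Obs = 0 (a sketch):
df %>%
  group_by(ID) %>%
  summarise(Obs = sum(S1 <= C))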
Data
df <- structure(list(ID = c(1L, 1L, 3L, 6L, 6L), S1 = c(2L, 2L, 1L, 2L, 7L),
C = c(3L, 3L, 1L, 5L, 5L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
Extension
If you want to apply this rule to multiple columns such as S1, S2, S3:
df %>%
  group_by(ID) %>%
  summarise(across(starts_with("S"), ~ sum(.x <= C)))
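The follow-up can also be handled in base R via the formula interface of aggregate (a sketch, assuming df contains S1, S2, and C as in the follow-up example):
aggregate(cbind(S1.Obs = S1 <= C, S2.Obs = S2 <= C) ~ ID, df, sum)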
data <- data.frame(
  ID = c(1, 1, 3, 6, 6),
  S1 = c(2, 2, 1, 2, 7),
  C = c(3, 3, 1, 5, 5)
)
library(dplyr)
data.filtered <- data[data$S1 <= data$C, ]
data.filtered %>%
  group_by(ID) %>%
  summarize(Obs = length(ID))
An option with data.table
library(data.table)
setDT(df)[S1 <=C, .(Obs = .N), ID]
# ID Obs
#1: 1 2
#2: 3 1
#3: 6 1
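The multi-column follow-up also has a data.table counterpart (a sketch; patterns() inside .SDcols requires data.table >= 1.12.0):
setDT(df)[, lapply(.SD, function(s) sum(s <= C)), by = ID, .SDcols = patterns("^S")]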
data
df <- structure(list(ID = c(1L, 1L, 3L, 6L, 6L), S1 = c(2L, 2L, 1L, 2L, 7L),
C = c(3L, 3L, 1L, 5L, 5L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))

R Data Frame remove rows with max values from all columns

Hello, I have a data frame and I need to remove every row that contains the maximum value of some column.
Example
A B C
1 2 3 5
2 4 1 1
3 1 4 3
4 2 1 1
So the output is:
A B C
4 2 1 1
Is there any quick way to do this?
We can do this with %in%
df1[!seq_len(nrow(df1)) %in% sapply(df1, which.max),]
# A B C
#4 2 1 1
If there are ties for the maximum value within a column, then do
df1[!Reduce(`|`, lapply(df1, function(x) x == max(x))),]
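The reason ties need the second form is that which.max only reports the first position of a maximum. A quick illustration with a hypothetical vector:
x <- c(5, 2, 5)
which.max(x)        # 1: only the first maximum
which(x == max(x))  # 1 3: every position of the maximum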
df[-sapply(df, which.max),]
# A B C
#4 2 1 1
DATA
df = structure(list(A = c(2L, 4L, 1L, 2L), B = c(3L, 1L, 4L, 1L),
C = c(5L, 1L, 3L, 1L)), .Names = c("A", "B", "C"),
class = "data.frame", row.names = c(NA,-4L))

R Aggregate and count of not null

I have the following data table
PIECE SAMPLE QC_CODE
1 1 1
2 1 NA
3 2 2
4 2 4
5 2 NA
6 3 6
7 3 3
8 3 NA
9 4 6
10 4 NA
and I would like to count the number of qc_code in each sample and return an output like this
SAMPLE SAMPLE_SIZE QC_CODE_COUNT
1 2 1
2 3 2
3 3 2
4 2 1
Where SAMPLE_SIZE is the count of pieces in each sample, and QC_CODE_COUNT is the count of all QC_CODE values that are not NA.
How would I go about this in R?
You can try
library(dplyr)
df1 %>%
  group_by(SAMPLE) %>%
  summarise(SAMPLE_SIZE = n(), QC_CODE_UNIT = sum(!is.na(QC_CODE)))
# SAMPLE SAMPLE_SIZE QC_CODE_UNIT
#1 1 2 1
#2 2 3 2
#3 3 3 2
#4 4 2 1
Or
library(data.table)
setDT(df1)[,list(SAMPLE_SIZE=.N, QC_CODE_UNIT=sum(!is.na(QC_CODE))), by=SAMPLE]
Or using aggregate from base R
do.call(data.frame, aggregate(QC_CODE ~ SAMPLE, df1, na.action = NULL,
  FUN = function(x) c(SAMPLE_SIZE = length(x), QC_CODE_UNIT = sum(!is.na(x)))))
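A table/tapply sketch in base R gives the same summary without the do.call reshaping (it assumes, as here, that every SAMPLE has at least one row):
sizes  <- table(df1$SAMPLE)
counts <- tapply(!is.na(df1$QC_CODE), df1$SAMPLE, sum)
data.frame(SAMPLE = as.integer(names(sizes)),
           SAMPLE_SIZE = as.vector(sizes),
           QC_CODE_COUNT = as.vector(counts))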
data
df1 <- structure(list(PIECE = 1:10, SAMPLE = c(1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 4L), QC_CODE = c(1L, NA, 2L, 4L, NA, 6L, 3L, NA,
6L, NA)), .Names = c("PIECE", "SAMPLE", "QC_CODE"), class = "data.frame",
row.names = c(NA, -10L))

New dataframe with difference between first and last values of repeated measurements?

I am working with time series data and want to calculate the difference between the first and final measurement times, and put these numbers into a new and simpler dataframe. For example, for this dataframe
structure(list(time = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), indv = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), value = c(1L, 3L, 5L, 8L, 3L, 4L,
7L, 8L)), .Names = c("time", "indv", "value"), class = "data.frame", row.names = c(NA,
-8L))
or
time indv value
1 1 1
2 1 3
3 1 5
4 1 8
1 2 3
2 2 4
3 2 7
4 2 8
I can use this code (with the plyr package loaded):
library(plyr)
ddply(test, .(indv), transform,
      value_change = value[length(value)] - value[1],
      time_change = time[length(time)] - time[1])
to give
time indv value value_change time_change
1 1 1 7 3
2 1 3 7 3
3 1 5 7 3
4 1 8 7 3
1 2 3 5 3
2 2 4 5 3
3 2 7 5 3
4 2 8 5 3
However, I would like to eliminate the redundant rows and make a new and simpler dataframe like this
indv time_change value_change
1 3 7
2 3 5
Does anyone have any clever way to do this?
Thanks!
Just replace transform with summarize. You can also make your code a little prettier by using head and tail:
ddply(test, .(indv), summarize,
      value_change = tail(value, 1) - head(value, 1),
      time_change = tail(time, 1) - head(time, 1))
For maximum readability, write a function:
change <- function(x) tail(x, 1) - head(x, 1)
ddply(test, .(indv), summarize, value_change = change(value),
time_change = change(time))
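For reference, the same computation in dplyr, which has largely superseded plyr (a sketch using first() and last()):
library(dplyr)
test %>%
  group_by(indv) %>%
  summarise(value_change = last(value) - first(value),
            time_change  = last(time)  - first(time))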
