How to count unique values in a column in R

I have a database and would like to know how many people (identified by ID) match a characteristic. The list is like this:
111 A
109 A
112 A
111 A
108 A
I only need to count how many IDs have that feature; the problem is that there are duplicated IDs. I've tried
count(df, vars = ID)
but it does not show the total number of IDs, just how many times each one is repeated. Same with
count(df, c('ID'))
as it shows the total number of IDs, many of which are duplicated; I need to count each one only once.
Do you have any suggestions? Using the table function is not an option because of the size of this database.

We can use n_distinct() from dplyr to count the number of unique values for a column in a data frame.
textFile <- "id var1
111 A
109 A
112 A
111 A
108 A"
df <- read.table(text = textFile,header = TRUE)
library(dplyr)
df %>% summarise(count = n_distinct(id))
...and the output:
> df %>% summarise(count = n_distinct(id))
count
1 4
We can also summarise the counts within one or more group_by() columns.
textFile <- "id var1
111 A
109 A
112 A
111 A
108 A
201 B
202 B
202 B
111 B
112 B
109 B"
df <- read.table(text = textFile,header = TRUE)
df %>% group_by(var1) %>% summarise(count = n_distinct(id))
...and the output:
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
var1 count
<chr> <int>
1 A 4
2 B 5

You can first remove duplicates using unique and then count the remaining rows:
d <- tribble(
~ID,~feature,
111, "A",
109, "A",
112, "A",
111, "A",
108, "A")
count(unique(d,vars = c(ID, feature)),vars=ID)
vars n
<dbl> <int>
1 108 1
2 109 1
3 111 1
4 112 1
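The output above still lists one row per ID rather than a single total. As a minimal base R sketch of the same deduplicate-then-count idea (the data frame d here is rebuilt for illustration and is an assumption, not the asker's real data), the total can be read off directly:

```r
# Illustrative data matching the question (assumed, not the real database)
d <- data.frame(ID = c(111, 109, 112, 111, 108), feature = "A")

# Deduplicate the ID vector, then count what is left
n_ids <- length(unique(d$ID))

# Equivalent: deduplicate a one-column data frame and count its rows
n_ids_alt <- nrow(unique(d["ID"]))

print(n_ids)
```

Neither call builds a frequency table, so this should sidestep the concern about table on a large data set, although memory use still grows with the number of distinct IDs.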

Related

Rearranging data according to rater and subject, simultaneously creating new row names

I have a dataset where multiple raters rate multiple subjects.
I'd like to rearrange the data that looks like this:
data <- data.frame(rater=c("A", "B", "C", "A", "B", "C"),
subject=c(1, 1, 1, 2, 2, 2),
measurment1=c(1, 2, 3, 4, 5,6),
measurment2=c(11, 22, 33, 44, 55,66),
measurment3=c(111, 222, 333, 444, 555, 666))
data
# rater subject measurment1 measurment2 measurment3
# 1 A 1 1 11 111
# 2 B 1 2 22 222
# 3 C 1 3 33 333
# 4 A 2 4 44 444
# 5 B 2 5 55 555
# 6 C 2 6 66 666
into data that looks like this:
data_transformed <- data.frame( A = c(1,11,111,4,44,444),
B = c(2,22,222,5,55,555),
C = c(3,33,333,6,66,666)
)
row.names(data_transformed) <- c("measurment1_1", "measurment2_1", "measurment3_1", "measurment1_2", "measurment2_2", "measurment3_2")
data_transformed
# A B C
# measurment1_1 1 2 3
# measurment2_1 11 22 33
# measurment3_1 111 222 333
# measurment1_2 4 5 6
# measurment2_2 44 55 66
# measurment3_2 444 555 666
In the new data frame, the raters (A, B and C) should become the columns. The measurement should become the rows and I'd also like to add the subject number as a suffix to the row-names.
For the rearranging one could probably use the pivot functions, yet I have no idea on how to combine the measurement-variables with the subject number.
Thanks for your help!
We could use pivot_longer, pivot_wider and unite from the tidyr package.
pivot_longer puts the data in a long (vertical) format: it gathers the measurment columns into a single variable.
pivot_wider does the opposite: it spreads a variable into multiple columns, one for each unique value of that variable.
library(tidyr)
data |>
pivot_longer(measurment1:measurment3) |>
pivot_wider(names_from = rater, values_from = value, values_fill = 0 ) |>
unite("measure_subject",name,subject, remove = TRUE)
Please try the code below, which produces the expected result using pivot_longer, pivot_wider and column_to_rownames.
library(tidyverse)
data_transformed <- data %>%
pivot_longer(c('measurment1', 'measurment2', 'measurment3')) %>%
mutate(rows = paste0(name, '_', subject)) %>%
pivot_wider(id_cols = rows, names_from = rater, values_from = value) %>%
column_to_rownames(var = "rows")
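For comparison, here is a base R sketch of the same rearrangement without tidyr. It assumes every subject was rated by the same raters in the same order (true for the example data); the name meas_cols is introduced here only for illustration.

```r
data <- data.frame(rater = c("A", "B", "C", "A", "B", "C"),
                   subject = c(1, 1, 1, 2, 2, 2),
                   measurment1 = c(1, 2, 3, 4, 5, 6),
                   measurment2 = c(11, 22, 33, 44, 55, 66),
                   measurment3 = c(111, 222, 333, 444, 555, 666))

# Columns holding the measurements (hypothetical helper name)
meas_cols <- c("measurment1", "measurment2", "measurment3")

# For each subject: transpose the measurement block so raters become columns,
# then suffix the row names with the subject number
blocks <- lapply(split(data, data$subject), function(d) {
  m <- t(d[, meas_cols])
  colnames(m) <- d$rater
  rownames(m) <- paste0(rownames(m), "_", d$subject[1])
  m
})

# Stack the per-subject blocks into one data frame
data_transformed <- as.data.frame(do.call(rbind, blocks))
print(data_transformed)
```

If raters can differ between subjects, the pivot-based answers above are the safer choice, because they match on rater names rather than on position.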

In R, how to summarize a data frame in multiple dimensions

There is a data frame raw_data as below. How can I change it to wished_data in an easy way?
I currently know how to group_by/summarise the data several times (adding variables), then rbind the results. But this is a little tedious, especially when there are more variables than in this example.
I want to know whether there is any general method for similar situations. Thanks!
library(tidyverse)
country <- c('UK','US','UK','US')
category <- c("A", "B", "A", "B")
y2021 <- c(17, 42, 21, 12)
y2022 <- c(49, 23, 52, 90)
raw_data <- data.frame(country,category,y2021,y2022)
We may use rollup/cube/groupingsets from data.table
library(data.table)
out <- rbind(setDT(raw_data), groupingsets(raw_data, j = lapply(.SD, sum),
by = c("country", "category"),
sets = list("country", "category", character())))
out[is.na(out)] <- 'TOTAL'
-output
> out
country category y2021 y2022
<char> <char> <num> <num>
1: UK A 17 49
2: US B 42 23
3: UK A 21 52
4: US B 12 90
5: UK TOTAL 38 101
6: US TOTAL 54 113
7: TOTAL A 38 101
8: TOTAL B 54 113
9: TOTAL TOTAL 92 214
Or with cube
out <- rbind(raw_data, cube(raw_data,
j = .(y2021= sum(y2021), y2022=sum(y2022)), by = c("country", "category")))
out[is.na(out)] <- 'TOTAL'
We can use the adorn_totals function from janitor. get_totals accepts a data frame and a column, and it outputs the totals of the numeric columns, one row for each level of the specified column. To do this it adds a total row within each group, extracts just the total rows and, since adorn_totals can rearrange the column order, uses select to restore the original order so that we can later bind multiple instances together. We then bind together the original data frame and each of the total-row data frames that we want.
library(dplyr)
library(janitor)
get_totals <- function(data, col) {
data %>%
group_by({{col}}) %>%
group_modify(~ adorn_totals(.)) %>%
ungroup %>%
filter(rowSums(. == "Total") > 0) %>%
select(any_of(names(data)))
}
bind_rows(
raw_data,
get_totals(raw_data, category),
get_totals(raw_data, country),
get_totals(raw_data)
)
giving:
country category y2021 y2022
1 UK A 17 49
2 US B 42 23
3 UK A 21 52
4 US B 12 90
5 Total A 38 101
6 Total B 54 113
7 UK Total 38 101
8 US Total 54 113
9 Total - 92 214
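The same idea (summarise at several grouping levels, then stack the pieces) can also be sketched with dplyr alone. The helper totals_by is a hypothetical name introduced here, and labelling the missing cells "TOTAL" is a presentation choice, not something dplyr does for you.

```r
library(dplyr)

raw_data <- data.frame(
  country  = c("UK", "US", "UK", "US"),
  category = c("A", "B", "A", "B"),
  y2021    = c(17, 42, 21, 12),
  y2022    = c(49, 23, 52, 90),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum every numeric column within the given grouping columns
totals_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(across(where(is.numeric), sum), .groups = "drop")
}

out <- bind_rows(
  raw_data,
  totals_by(raw_data, country),                       # per-country totals
  totals_by(raw_data, category),                      # per-category totals
  summarise(raw_data, across(where(is.numeric), sum)) # grand total
) %>%
  # Columns absent from a totals block come back as NA; label them
  mutate(across(c(country, category), ~ coalesce(.x, "TOTAL")))

print(out)
```

Adding another dimension then only needs one more totals_by() call inside bind_rows(), which is the kind of generality the question asks about.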

Sampling different x and different sample size in R

Say I have a table like this:
Students  Equipment #
A         101
A         102
A         103
B         104
B         105
B         106
B         107
B         108
C         109
C         110
C         111
C         112
I want to grab equipment # samples from each student in the data frame with varying sample sizes.
For example, I want 1 equipment # from student "A", 2 from student "B", and 3 from student "C". How can I achieve this in R?
This is the code that I have now, but I'm only getting 1 equipment # printed from each student.
students <- unique(df$`Students`)
sample_size <- c(1,2,3)
for (i in students){
s <- sample(df[df$`Students` == i,]$`Equipment #`, size = sample_size, replace = FALSE)
print(s)
}
You can create a data frame that holds, for each student, the number of rows to be sampled. Join it to the data and use sample_n to sample that many rows per student.
library(dplyr)
sample_data <- data.frame(Students = c('A', 'B', 'C'), nr = 1:3)
df %>%
left_join(sample_data, by = 'Students') %>%
group_by(Students) %>%
sample_n(first(nr)) %>%
ungroup() %>%
select(-nr) -> s
s
# Students Equipment
# <chr> <int>
#1 A 102
#2 B 108
#3 B 105
#4 C 110
#5 C 112
#6 C 111
You're close. You need to index the sample_size vector with the loop, otherwise it will just take the first item in the vector for each iteration.
library(dplyr)
# set up data
df <- data.frame(Students = c(rep("A", 3),
rep("B", 5),
rep("C", 4)),
Equipment_num = 101:112)
# create vector of students
students <- df %>%
pull(Students) %>%
unique()
# sample and print
for (i in seq_along(students)) {
p <- df %>%
filter(Students == students[i]) %>%
slice_sample(n = i)
print(p)
}
#> Students Equipment_num
#> 1 A 102
#> Students Equipment_num
#> 1 B 107
#> 2 B 105
#> Students Equipment_num
#> 1 C 109
#> 2 C 110
#> 3 C 112
Created on 2021-08-06 by the reprex package (v2.0.0)
Actually this is a much more elegant and generalizable way to tackle this problem.
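For readers who want to stay close to the original base R loop, here is a minimal sketch of the fix described above: look up one sample size per student instead of passing the whole sample_size vector. The column name Equipment_num and the named vector are assumptions made for this illustration.

```r
# Example data mirroring the table in the question (column name is an assumption)
df <- data.frame(
  Students = rep(c("A", "B", "C"), times = c(3, 5, 4)),
  Equipment_num = 101:112
)

# Named vector: one sample size per student
sample_size <- c(A = 1, B = 2, C = 3)

set.seed(42)  # only to make the draw reproducible
picked <- lapply(names(sample_size), function(s) {
  pool <- df$Equipment_num[df$Students == s]
  # Sampling positions with sample.int() avoids the surprise that
  # sample(x) treats a single number x as the range 1:x
  pool[sample.int(length(pool), size = sample_size[[s]])]
})
names(picked) <- names(sample_size)
print(picked)
```

Keying the sizes by student name rather than by loop position means the result does not depend on the order in which students appear in the data.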

How to reshape a data table

I am using R and have the next table: (example)
ID Euros N Euros N Euros N
1 A 133.911,20 451 134.208,78 450 442,03 328
2 C 9.470,35 2856 26,18 2721 26,28 2699
My desired behaviour is to have Euros on one line and N on another line, instead of in columns:
ID Var1 Var2 Var3 Var4
1 A Euros 133.911,20 134.208,78 442,03
2 A N 451 450 328
3 C Euros 9.470,35 26,18 26,28
4 C N 2856 2721 2699
I have tried to do so only with A group and using the following code:
mydatatable_wide <- spread(mydatatable, Euros, N)
But I don't get my expected result. What I get is:
ID 133.911,20 134.208,78 442,03
1 A 451 450 328
It takes some work to achieve what you want. I am using dplyr & tidyr:
library(dplyr)
library(tidyr)
# Here is the tribble from your question
# Note that in my locale "." is the decimal point and "," is the thousands separator.
# In R code the thousands separator is not used.
df <- tribble(
~ID, ~Euros, ~N, ~Euros, ~N, ~Euros, ~N,
"A", 133911.20, 451, 134208.78, 450, 442.03, 328,
"C", 9470.35, 2856, 26.18, 2721, 26.28, 2699)
df %>%
# First convert the data set into a long version with multiple rows per ID
# that contains all the numerical values (Euros & N)
pivot_longer(cols = where(is.numeric), names_to = "var", values_to = "value") %>%
# then split them into one group per variable (Euros / N) using group_by & group_map
group_by(var) %>%
group_map(~ {
.x %>%
group_by(ID) %>%
# inside group_map, create an index variable for the values within each ID
mutate(index_name = paste0("var_", seq(1, n(), by =1))) %>%
# then pivot them wider to have one row per ID & (Euros/N)
pivot_wider(names_from = "index_name", values_from = value, values_fill = NA)
}, .keep = TRUE) %>%
# Finally combine all the data frames from group_map into one data frame
bind_rows()
Output
ID var var_1 var_2 var_3
<chr> <chr> <dbl> <dbl> <dbl>
1 A Euros 133911. 134209. 442.
2 C Euros 9470. 26.2 26.3
3 A N 451 450 328
4 C N 2856 2721 2699
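If the repeated columns can be given unique, suffixed names when the data is read in, the reshape gets shorter. The sketch below assumes names of the form Euros_1, N_1, Euros_2, and so on; that naming scheme is an assumption for illustration, not part of the original table.

```r
library(tidyr)

# Same values as in the question, with an assumed "<variable>_<index>" naming scheme
df <- data.frame(
  ID = c("A", "C"),
  Euros_1 = c(133911.20, 9470.35), N_1 = c(451, 2856),
  Euros_2 = c(134208.78, 26.18),   N_2 = c(450, 2721),
  Euros_3 = c(442.03, 26.28),      N_3 = c(328, 2699)
)

# Split each column name into the variable ("Euros"/"N") and its position index
long <- pivot_longer(df, cols = -ID,
                     names_to = c("var", "idx"), names_sep = "_")

# Spread the position index back out, leaving one row per ID and variable
res <- pivot_wider(long, names_from = idx, values_from = value,
                   names_prefix = "var_")
print(res)
```

With this layout the rows come out grouped by ID (A Euros, A N, C Euros, C N), which matches the ordering requested in the question.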

reshape dataframe from columns to rows and collapse cell values

Here's the challenge I am facing. I am trying to transform this dataset
a b c
100 0 111
0 137 17
78 117 91
into (column to rows)
col1 col2
a 100,78
b 137,117
c 111,17,91
I know I can do this using the reshape or melt functions, but I am not sure how to collapse and paste the cell values. Any suggestions or pointers are appreciated, folks.
Here is a lightweight option that uses toString() to collapse each column to a string and stack() to reshape the resulting list into your desired output:
stack(lapply(df, function(col) toString(col[col!=0])))
# values ind
#1 100, 78 a
#2 137, 117 b
#3 111, 17, 91 c
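A small variant of the same approach, sketched below, returns the col1/col2 layout and the comma-only separator shown in the question. The data frame is rebuilt here so the snippet runs on its own.

```r
# Example data from the question
df <- data.frame(a = c(100, 0, 78), b = c(0, 137, 117), c = c(111, 17, 91))

# Drop the zeros in each column and paste the remaining values together
collapsed <- vapply(df, function(col) paste(col[col != 0], collapse = ","),
                    character(1))

# Column names become col1, collapsed strings become col2
result <- data.frame(col1 = names(collapsed), col2 = unname(collapsed))
print(result)
```

vapply is used instead of lapply so the result is guaranteed to be a character vector with one string per column.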
I would use dplyr and tidyr rather than reshape.
library(dplyr)
library(tidyr)
Data <- data.frame(a=c(100,0,78),b=c(0,137,117),c=c(111,17,91))
Data %>%
gather(Column, Value) %>%
filter(Value != 0) %>%
group_by(Column) %>%
summarize(Value=paste0(Value,collapse=', '))
The gather function is similar to melt in reshape. The group_by function tells later functions that you want to separate the data based on the values in Column. Finally, summarize calculates whatever summary we want for each of the groups; in this case, it pastes all the terms together.
Which should give you:
# A tibble: 3 × 2
Column Value
<chr> <chr>
1 a 100, 78
2 b 137, 117
3 c 111, 17, 91
With library(data.table)
melt(dt)[, .(value = paste(value[value !=0], collapse=', ')), by=variable]
# variable value
# 1: a 100, 78
# 2: b 137, 117
# 3: c 111, 17, 91
The data:
dt = fread("a b c
100 0 111
0 137 17
78 117 91")
