Count unique rows irrespective of column order - r

EDIT
My question was badly asked. Therefore I re-edited it in order to make it hopefully more useful for others. It has an answer already.
sample data.frame:
set.seed(10)
df <- data.frame(a = sample(1:3, 30, rep=T), b = sample(1:3, 30, rep = T), c = sample(1:3, 30, rep = T))
My question:
I have several columns (in my example a, b, c). Slightly similar to, but different from, this question asked by R-user, I would like to count the distinct 'value sets' across these three columns (in general: n columns), irrespective of the order of the values within a row.
count(df,a,b,c) from dplyr does not help:
require (dplyr)
count(df,a,b,c)
# A tibble: 17 x 4
a b c n
<int> <int> <int> <int>
1 1 1 1 1
2 1 1 2 2
...
7 2 1 1 4
...
In this example, rows 2 and 7 contain the same set of values (1,1,2), and that's not what I want: I do not care about the order of the values within the set, so '1,1,2' and '2,1,1' should be considered the same. How can I count those value sets?
EDIT 2
The neat trick in #Mouad_S's answer is to first sort the values within each row with apply(), then transpose the result with t() (apply() returns each sorted row as a column), after which count() can be used on the columns.

require(dplyr)
set.seed(10)
df <- data.frame(a = sample(1:3, 30, rep=T),
b = sample(1:3, 30, rep = T),
c = sample(1:3, 30, rep = T))
## the old answer
require(dplyr)
count(data.frame(t(apply(df, 1, function(x) sort(x)))), X1, X2, X3)
## the new answer
t(apply(df,1, function(x) sort(x))) %>% # sorting the values of each row
as.data.frame() %>% # turning the resulting matrix into a data frame
distinct() %>% # taking the unique values
nrow() # counting them
[1] 9
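If the count per value set is wanted as well, and not only how many distinct sets exist, one base R option is to collapse each sorted row into a string key and tabulate the keys. This is only a minimal sketch on a small made-up data frame; `toy`, `keys` and `set_counts` are illustrative names that do not appear in the answer above.

```r
# Toy data: rows 1, 2 and 4 hold the same values in a different order
toy <- data.frame(a = c(1, 2, 3, 1),
                  b = c(1, 1, 2, 2),
                  c = c(2, 1, 1, 1))

# Sort the values within each row and collapse them into one string key
keys <- apply(toy, 1, function(x) paste(sort(x), collapse = ","))

# Tabulate the keys: one entry per unordered value set
set_counts <- table(keys)

# Number of distinct unordered value sets
n_sets <- length(set_counts)
print(set_counts)
```

The string key is a convenience for tabulating; it assumes the values do not themselves contain the separator.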

Related

Calculating percentages by group in R

My data set looks like this:
The data is from course student evaluations. The columns include categorical data for courses, and numerical data for scores in various criteria from a rubric. I am trying to use R to calculate, by course, the percentage of values equal to or greater than 3 for all the columns. I can't figure out a straightforward way that is faster than doing it manually.
Thank you
Fernando
The tidyverse packages are well suited for this kind of task.
library(tidyverse)
First let's create some dummy data.
df <- tibble(`1..Course` = rep(LETTERS[1:3], each=5),
col1 = sample(c(NA,1:5), 15, replace=TRUE),
col2 = sample(c(NA,1:5), 15, replace=TRUE),
col3 = sample(c(NA,1:5), 15, replace=TRUE))
Now, for each column we want to look which values are >3:
df$col1 > 3
[1] FALSE NA TRUE NA FALSE NA NA FALSE NA FALSE FALSE TRUE FALSE NA FALSE
So we get a boolean, which will be automatically converted to numbers (0 and 1) if we try to take the sum. So computing a proportion is just taking the mean!
But there are missing values, so we will explicitly ignore them:
mean(df$col1 > 3, na.rm = TRUE)
[1] 0.2222222
So we know how to do it for a whole column, now we can use the functions from the tidyverse to do it by course:
df %>%
group_by(`1..Course`) %>%
summarize(prop_col1 = mean(col1 > 3, na.rm = TRUE),
prop_col2 = mean(col2 > 3, na.rm = TRUE),
prop_col3 = mean(col3 > 3, na.rm = TRUE))
# A tibble: 3 x 4
# `1..Course` prop_col1 prop_col2 prop_col3
# <chr> <dbl> <dbl> <dbl>
#1 A 0.333 0.2 0.5
#2 B 0 0.75 0.2
#3 C 0.25 0 0.25
And it's done.
Possibly, you may want to do this for every criterion without typing each one. To do so, treat the criterion as a variable and convert your data.frame to long format. Then the same code applies.
df %>%
pivot_longer(-`1..Course`, names_to="criterium") %>%
group_by(`1..Course`, criterium) %>%
summarize(prop_value = mean(value > 3, na.rm = TRUE))
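Note that the question asks for values "equal or greater than 3", while the code above tests `> 3`. If the threshold is meant to be inclusive, the comparison would be `>= 3`. Here is a rough base R sketch of that variant, on made-up data; `scores`, `course` and `prop_ge3` are illustrative names, not from the question.

```r
# Made-up scores; all names here are illustrative
scores <- data.frame(course = rep(c("A", "B"), each = 4),
                     col1 = c(1, 3, 4, NA, 2, 2, 5, 3),
                     col2 = c(3, 3, 3, 3, NA, 1, 2, 4))

# Share of non-missing values that are 3 or higher
prop_ge3 <- function(v) mean(v >= 3, na.rm = TRUE)

# Apply it to every score column within each course
res <- aggregate(scores[-1], by = list(course = scores$course), FUN = prop_ge3)
print(res)
```

Multiply the proportions by 100 if percentages are wanted.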

how to filter data by the number of unique values in R [duplicate]

I have some data that I would like to investigate and would like to pull out
all features which have a certain number of unique values, whether that's 2,
5, 10, etc.
I'm not sure how to go about doing this though.
For example :
tst = data.frame(
a = c(1,1,1,0,0),
b = c(1,2,3,3,3),
c = c(1,2,3,4,4),
d = c(1,2,3,4,5)
)
tst
tst %>%
filter(<variables with x unique values>)
Where x=2 would just filter to a, x=3 filter to b, etc
You can use select_if with the n_distinct function.
tst %>%
select_if(~n_distinct(.) == 2)
# a
# 1 1
# 2 1
# 3 1
# 4 0
# 5 0
Here is one way in base R:
x <- 2
# sapply() walks over the columns of the data frame, so the argument is a column
tst[, sapply(tst, function(col) length(unique(col))) == x, drop = FALSE]
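Before filtering, it can help to look at the number of distinct values per column, so you know which x is worth asking for. A small base R sketch, where `cols_with_n_unique` is a hypothetical helper name introduced only for this illustration:

```r
tst <- data.frame(a = c(1, 1, 1, 0, 0),
                  b = c(1, 2, 3, 3, 3),
                  c = c(1, 2, 3, 4, 4),
                  d = c(1, 2, 3, 4, 5))

# Number of distinct values in each column
n_unique <- sapply(tst, function(col) length(unique(col)))
print(n_unique)

# Hypothetical helper: keep the columns with exactly n distinct values
cols_with_n_unique <- function(df, n) {
  keep <- sapply(df, function(col) length(unique(col))) == n
  df[, keep, drop = FALSE]
}

names(cols_with_n_unique(tst, 3))
```

With drop = FALSE the result stays a data frame even when one column, or no column, matches.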
This example code will create a variable combination of abcd. Then will identify which are duplicate combinations, then will return only those combinations that are not duplicates. I hope this is what you were asking for...
tst = data.frame(
a = c(1,1,1,0,0),
b = c(1,2,3,3,3),
c = c(1,2,3,4,4),
d = c(1,2,3,4,5)
)
library(dplyr)
library(tidyr) # for unite()
tst %>%
unite(new,a,b,c,d,sep="") %>%
mutate(duplicate=duplicated(new)) %>%
filter(!duplicate) # keep only the first occurrence of each combination

Sample_n to Get a Maximum Number From Each Group

Using this very simple data example below, my goal would be to sample all 3 of A and only sample 5 out of 7 of B.
id group
1 A
2 A
3 A
4 B
5 B
6 B
7 B
8 B
9 B
10 B
ex_df <- data.frame(id = 1:10, group = c(rep("A", 3), rep("B", 7)))
Now, normally it'd just be a case of using sample_n from dplyr such that the code would be along the lines of
sel_5 <- ex_df %>%
group_by(group) %>%
sample_n(5)
Except this gives the error (for obvious reasons)
Error: size must be less or equal than 2 (size of data), set
replace = TRUE to use sampling with replacement
but sampling with replacement isn't an option. Is there any way that I might be able to set the sample_n size to be the minimum of 5 or the size of the group?
Or maybe another function that I'm unaware of that would be capable of this?
I've had the same problem, and here's what I did.
library(dplyr)
split_up <- split(ex_df, f = ex_df$group)
#split original dataframe into a list of dataframes for each unique group
sel_5 <- lapply(split_up, function(x) {x %>% sample_n(ifelse(nrow(x) < 5, nrow(x), 5))})
#on each dataframe, subsample to 5 or to the number of rows if there are less than 5
sel_5 <- do.call("rbind", sel_5)
#bind it back up!
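A different way to get "at most 5 per group" without splitting is to shuffle all rows once and then keep the first 5 rows of each group in the shuffled order. This is just a base R sketch of that idea; `max_per_group` and `pos_in_group` are names made up for the example.

```r
ex_df <- data.frame(id = 1:10, group = c(rep("A", 3), rep("B", 7)))
max_per_group <- 5  # illustrative name for the cap

# Shuffle all rows once (sampling without replacement)
shuffled <- ex_df[sample(nrow(ex_df)), ]

# Position of each row within its group, in the shuffled order
pos_in_group <- ave(seq_len(nrow(shuffled)), shuffled$group, FUN = seq_along)

# Keep at most max_per_group rows per group
sel_5 <- shuffled[pos_in_group <= max_per_group, ]
table(sel_5$group)
```

Groups smaller than the cap are kept whole, so no special case is needed for them.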

Gather duplicate column sets into single columns

The problem of gathering multiple sets of columns was already addressed here: Gather multiple sets of columns, but in my case, the columns are not unique.
I have the following data:
input <- data.frame(
id = 1:2,
question = c("a", "b"),
points = 0,
max_points = c(3, 5),
question = c("c", "d"),
points = c(0, 20),
max_points = c(5, 20),
check.names = F,
stringsAsFactors = F
)
input
#> id question points max_points question points max_points
#> 1 1 a 0 3 c 0 5
#> 2 2 b 0 5 d 20 20
The first column is an id, then I have many repeated columns (the original dataset has 133 columns):
identifier for question
points given
maximum points
I would like to end up with this structure:
expected <- data.frame(
id = c(1, 2, 1, 2),
question = letters[1:4],
points = c(0, 0, 0, 20),
max_points = c(3, 5, 5, 20),
stringsAsFactors = F
)
expected
#> id question points max_points
#> 1 1 a 0 3
#> 2 2 b 0 5
#> 3 1 c 0 5
#> 4 2 d 20 20
I have tried several things:
tidyr::gather(input, key, val, -id)
reshape2::melt(input, id.vars = "id")
Both do not deliver the desired output. Furthermore, with more columns than shown here, gather doesn't work any more, because there are too many duplicate columns.
As a workaround I tried this:
# add numbers to make col headers "unique"
names(input) <- c("id", paste0(1:(length(names(input)) - 1), names(input)[-1]))
# gather, remove number, spread
input %>%
gather(key, val, -id) %>%
mutate(key = stringr::str_replace_all(key, "[:digit:]", "")) %>%
spread(key, val)
which gives an error: Duplicate identifiers for rows (3, 9), (4, 10), (1, 7), (2, 8)
This problem was already discussed here: Unexpected behavior with tidyr, but I don't know why/how I should add another identifier. Most likely this is not the main problem, because I probably should approach the whole thing differently.
How could I solve my problem, preferably with tidyr or base? I don't know how to use data.table, but in case there is a simple solution, I will settle for that too.
Try this:
do.call(rbind,
lapply(seq(2, ncol(input), 3), function(i){
input[, c(1, i:(i + 2))]
})
)
# id question points max_points
# 1 1 a 0 3
# 2 2 b 0 5
# 3 1 c 0 5
# 4 2 d 20 20
The idiomatic way to do this in data.table is pretty simple:
library(data.table)
setDT(input)
res = melt(
input,
id = "id",
meas = patterns("question", "^points$", "max_points"),
value.name = c("question", "points", "max_points")
)
id variable question points max_points
1: 1 1 a 0 3
2: 2 1 b 0 5
3: 1 2 c 0 5
4: 2 2 d 20 20
You get the extra column called "variable", but you can get rid of it with res[, variable := NULL] afterwards if desired.
Another way to accomplish the same goal without using lapply:
We start by grabbing all the columns for question, max_points, and points then we melt each one individually and cbind them all together.
library(reshape2)
questions <- input[,c(1,c(1:length(names(input)))[names(input)=="question"])]
points <- input[,c(1,c(1:length(names(input)))[names(input)=="points"])]
max_points <- input[,c(1,c(1:length(names(input)))[names(input)=="max_points"])]
questions_m <- melt(questions,id.vars=c("id"),value.name = "questions")[,c(1,3)]
points_m <- melt(points,id.vars=c("id"),value.name = "points")[,3,drop=FALSE]
max_points_m <- melt(max_points,id.vars=c("id"),value.name = "max_points")[,3, drop=FALSE]
res <- cbind(questions_m,points_m, max_points_m)
res
id questions points max_points
1 1 a 0 3
2 2 b 0 5
3 1 c 0 5
4 2 d 20 20
You might need to clarify how you want the ID column to be handled but perhaps something like this ?
runme <- function(word , dat){
grep( paste0("^" , word , "$") , names(dat))
}
l <- mapply( runme , unique(names(input)) , list(input) )
l2 <- as.data.frame(l)
output <- data.frame()
for (i in 1:nrow(l2)) output <- rbind( output , input[, as.numeric(l2[i,]) ])
I'm not sure how robust it is when columns are repeated different numbers of times, but it works for your test data and should work if all your columns are repeated the same number of times.

Boxplot of pre-aggregated/grouped data in R

In R I want to create a boxplot over count data instead of raw data. So my table schema looks like
Value | Count
1 | 2
2 | 1
...
Instead of
Value
1
1
2
...
Where in the second case I could simply do boxplot(x)
I'm sure there's a way to do what you want with the already summarized data, but if not, you can abuse the fact that rep takes vectors:
> dat <- data.frame(Value = 1:5, Count = sample.int(5))
> dat
Value Count
1 1 1
2 2 3
3 3 4
4 4 2
5 5 5
> rep(dat$Value, dat$Count)
[1] 1 2 2 2 3 3 3 3 4 4 5 5 5 5 5
Simply wrap boxplot around that and you should get what you want. I'm sure there's a more efficient / better way to do that, but this should work for you.
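Put together, the rep() approach could look like the sketch below. It uses the counts from the example above, written out by hand so the result is reproducible, and plot = FALSE so that only the box statistics are computed; `expanded` and `bp` are illustrative names.

```r
# Same values and counts as in the example above, typed in explicitly
dat <- data.frame(Value = 1:5, Count = c(1, 3, 4, 2, 5))

# Expand the aggregated table back into one entry per observation
expanded <- rep(dat$Value, dat$Count)

# Compute the boxplot statistics without drawing anything
bp <- boxplot(expanded, plot = FALSE)
print(bp$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker

# boxplot(expanded) would draw the plot itself
```

This materialises every observation, so it is only practical when the total count is not huge.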
I solved a similar issue recently by using the 'apply' function on each column of counts with the 'rep' function:
> datablock <- apply(countblock[-1], 2, function(x){rep(countblock$value, x)})
> boxplot(datablock)
...The above assumes that your values are in the first column and subsequent columns contain count data.
A combination of rep and data.frame can be used as an approach if another variable is needed for classification
Eg.
with(data.frame(v1=rep(data$v1,data$count),v2=rep(data$v2,data$count)),
boxplot(v1 ~ v2)
)
Toy data:
(besides Value and Count, I add a categorical variable Group)
set.seed(12345)
df <- data.frame(Value = sample(1:100, 100, replace = T),
Count = sample(1:10, 100, replace = T),
Group = sample(c("A", "B", "C"), 100, replace = T),
stringsAsFactors = F)
Use purrr::pmap and purrr::reduce to manipulate the data frame:
library(purrr)
data <- pmap(df, function(Value, Count, Group){
data.frame(x = rep(Value, Count),
y = rep(Group, Count))
}) %>% reduce(rbind)
boxplot(x ~ y, data = data)
