Boxplot of pre-aggregated/grouped data in R

In R I want to create a boxplot over count data instead of raw data. So my table schema looks like
Value | Count
1 | 2
2 | 1
...
Instead of
Value
1
1
2
...
Where in the second case I could simply do boxplot(x)

I'm sure there's a way to do what you want with the already summarized data, but if not, you can abuse the fact that rep takes vectors:
> dat <- data.frame(Value = 1:5, Count = sample.int(5))
> dat
Value Count
1 1 1
2 2 3
3 3 4
4 4 2
5 5 5
> rep(dat$Value, dat$Count)
[1] 1 2 2 2 3 3 3 3 4 4 5 5 5 5 5
Simply wrap boxplot around that and you should get what you want. I'm sure there's a more efficient / better way to do that, but this should work for you.
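For completeness, wrapping it up with the toy dat from above might look like:
boxplot(rep(dat$Value, dat$Count))  # one box over the expanded values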

I solved a similar issue recently by using the 'apply' function on each column of counts with the 'rep' function:
> datablock <- apply(countblock[-1], 2, function(x){rep(countblock$value, x)})
> boxplot(datablock)
...The above assumes that your values are in the first column and subsequent columns contain count data.
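A minimal runnable sketch of that idea, using a hypothetical countblock (the column names groupA and groupB are made up for illustration):
# Hypothetical pre-aggregated data: one value column, then one count column per group
countblock <- data.frame(value  = 1:5,
                         groupA = c(2, 1, 3, 0, 1),
                         groupB = c(1, 4, 0, 2, 2))
# Expand each count column back into raw observations, then plot them side by side
datablock <- apply(countblock[-1], 2, function(x) rep(countblock$value, x))
boxplot(datablock)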

A combination of rep and data.frame can be used if another variable is needed for classification, e.g.:
with(data.frame(v1 = rep(data$v1, data$count),
                v2 = rep(data$v2, data$count)),
     boxplot(v1 ~ v2))
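A self-contained sketch of that approach, assuming a hypothetical pre-aggregated data frame with a value column v1, a grouping column v2, and a count column:
# Hypothetical input: each row is a (value, group) pair plus how often it occurred
data <- data.frame(v1    = c(1, 2, 2, 3, 1, 4),
                   v2    = c("A", "A", "A", "B", "B", "B"),
                   count = c(2, 1, 3, 2, 1, 4))
# Expand both columns by the counts, then draw one box per group
with(data.frame(v1 = rep(data$v1, data$count),
                v2 = rep(data$v2, data$count)),
     boxplot(v1 ~ v2))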

Toy data:
(besides Value and Count, I add a categorical variable Group)
set.seed(12345)
df <- data.frame(Value = sample(1:100, 100, replace = TRUE),
                 Count = sample(1:10, 100, replace = TRUE),
                 Group = sample(c("A", "B", "C"), 100, replace = TRUE),
                 stringsAsFactors = FALSE)
Use purrr::pmap and purrr::reduce to manipulate the data frame:
library(purrr)
data <- pmap(df, function(Value, Count, Group) {
  data.frame(x = rep(Value, Count),
             y = rep(Group, Count))
}) %>% reduce(rbind)
boxplot(x ~ y, data = data)
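As a side note not in the original answer, tidyr::uncount() can do the same row expansion in one step; a sketch with the same toy df:
library(tidyr)
# uncount() repeats each row according to its Count column (and drops Count)
long <- uncount(df, Count)
boxplot(Value ~ Group, data = long)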

Related

R creating a function using plyr revalue with multiple inputs

I am new to R and just learning the ropes so thanks in advance for any assistance you can provide.
I have a dataset that I am cleaning as a class project.
I have several sets of categorical data that I want to turn into specific numeric values.
I am repeating the same code pattern for several different columns, which I think would make a good function.
I would like to turn this:
# plyr using revalue
df$Area <- revalue(x = df$Area,
                   replace = c("rural" = 1,
                               "suburban" = 2,
                               "urban" = 3))
df$Area <- as.numeric(df$Area)
into this:
reval_3 <- function(data, columnX,
                    value1, num_val1,
                    value2, num_val2,
                    value3, num_val3) {
  # plyr using revalue
  data$columnX <- revalue(x = data$columnX,
                          replace = c(value1 = num_val1,
                                      value2 = num_val2,
                                      value3 = num_val3))
  # set as numeric
  data$columnX <- as.numeric(data$columnX)
  # return dataset
  return(data)
}
I get the following error:
The following `from` values were not present in `x`: value1, value2, value3
Error: Assigned data `as.numeric(data$columnX)` must be compatible with existing data.
x Existing data has 10000 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unknown or uninitialised column: `columnX`.
I've tried it with a single value1 where value1 <- c("rural" = 1, "suburban" = 2, "urban" = 3)
I know I could just do
df$Area <- as.numeric(as.factor(df$Area))
but I want specific values for each choice rather than letting R choose them.
Any assistance appreciated.
As already mentioned by @MartinGal in his comment, plyr is retired and the package authors themselves recommend using dplyr instead. See https://github.com/hadley/plyr.
Hence, one option to achieve your desired result would be to make use of dplyr::recode. Additionally, if you want to write your own function, I would suggest passing the values and their replacements to recode as vectors, instead of passing each value and replacement as a separate argument:
library(dplyr)
set.seed(42)
df <- data.frame(
  Area = sample(c("rural", "suburban", "urban"), 10, replace = TRUE)
)
recode_table <- c("rural" = 1, "suburban" = 2, "urban" = 3)
recode(df$Area, !!!recode_table)
#> [1] 1 1 1 1 2 2 2 1 3 3
reval_3 <- function(data, x, values, replacements) {
  recode_table <- setNames(replacements, values)
  data[[x]] <- recode(data[[x]], !!!recode_table)
  data
}
df <- reval_3(df, "Area", c("rural", "suburban", "urban"), 1:3)
df
#> Area
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 2
#> 6 2
#> 7 2
#> 8 1
#> 9 3
#> 10 3
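Side note (not part of the original answer): in dplyr 1.1.0 and later, recode() is marked as superseded, and case_match() is the suggested replacement. Applied to the original character column, a sketch might look like:
# Assumes dplyr >= 1.1.0; maps each old value directly to its replacement
df$Area <- case_match(df$Area,
                      "rural"    ~ 1,
                      "suburban" ~ 2,
                      "urban"    ~ 3)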
You can use case_when with across.
If the columns that you want to change are called col1 and col2, you can do:
library(dplyr)
df <- df %>%
  mutate(across(c(col1, col2), ~ case_when(. == 'rural'    ~ 1,
                                           . == 'suburban' ~ 2,
                                           . == 'urban'    ~ 3)))
Based on your actual column names, you can also pass starts_with(), ends_with(), or a range of columns (A:Z) to across().
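For example (with hypothetical column names), a prefix-based selection could look roughly like:
# Hypothetical: recode every column whose name starts with "area_"
df <- df %>%
  mutate(across(starts_with("area_"), ~ case_when(. == 'rural'    ~ 1,
                                                  . == 'suburban' ~ 2,
                                                  . == 'urban'    ~ 3)))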

how to filter data by the number of unique values in R [duplicate]

This question already has answers here:
drop columns that take less than n values?
(2 answers)
Closed 3 years ago.
I have some data that I would like to investigate and would like to pull out
all features which have a certain number of unique values, whether that's 2,
5, 10, etc.
I'm not sure how to go about doing this though.
For example :
tst = data.frame(
  a = c(1, 1, 1, 0, 0),
  b = c(1, 2, 3, 3, 3),
  c = c(1, 2, 3, 4, 4),
  d = c(1, 2, 3, 4, 5)
)
tst
tst %>%
filter(<variables with x unique values>)
Where x=2 would just filter to a, x=3 filter to b, etc
You can use select_if with the n_distinct function.
tst %>%
select_if(~n_distinct(.) == 2)
# a
# 1 1
# 2 1
# 3 1
# 4 0
# 5 0
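A side note beyond the original answer: select_if() is superseded in current dplyr, so on dplyr 1.0.0 or later the same selection can be written with where(), roughly:
# Assumes dplyr >= 1.0.0: keep columns with exactly two distinct values
tst %>%
  select(where(~ n_distinct(.) == 2))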
Here is one way in base R:
x <- 2
tst[, apply(tst, 2, function(col) length(unique(col))) == x, drop = FALSE]
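Another base R option, as a sketch, is to treat the data frame as a list of columns and filter it directly (reusing x from above):
# Filter() keeps only the columns for which the predicate returns TRUE
Filter(function(col) length(unique(col)) == x, tst)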
This example code creates a single key by combining columns a, b, c and d, flags which combinations are duplicates, and then returns only the combinations that are not duplicated. I hope this is what you were asking for...
library(dplyr)
library(tidyr)

tst = data.frame(
  a = c(1, 1, 1, 0, 0),
  b = c(1, 2, 3, 3, 3),
  c = c(1, 2, 3, 4, 4),
  d = c(1, 2, 3, 4, 5)
)

tst %>%
  unite(new, a, b, c, d, sep = "") %>%
  mutate(duplicate = duplicated(new)) %>%
  filter(!duplicate)
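As a side note beyond this answer, if the goal is just to drop duplicated row combinations, dplyr's distinct() does that in one step:
# Keeps the first occurrence of each unique a/b/c/d combination
distinct(tst)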

Sample_n to Get a Maximum Number From Each Group

Using this very simple data example below, my goal would be to sample all 3 of A and only sample 5 out of 7 of B.
id group
1 A
2 A
3 A
4 B
5 B
6 B
7 B
8 B
9 B
10 B
ex_df <- data.frame(id = 1:10, group = c(rep("A", 3), rep("B", 7)))
Now, normally it'd just be a case of using sample_n from dplyr such that the code would be along the lines of
sel_5 <- ex_df %>%
group_by(group) %>%
sample_n(5)
Except this gives the error (for obvious reasons)
Error: size must be less or equal than 2 (size of data), set
replace = TRUE to use sampling with replacement
but sampling with replacement isn't an option. Is there any way that I might be able to set the sample_n size to be the minimum of 5 or the size of the group?
Or maybe another function that I'm unaware of that would be capable of this?
I've had the same problem, and here's what I did.
library(dplyr)
split_up <- split(ex_df, f = ex_df$group)
#split original dataframe into a list of dataframes for each unique group
sel_5 <- lapply(split_up, function(x) {x %>% sample_n(ifelse(nrow(x) < 5, nrow(x), 5))})
#on each dataframe, subsample to 5 or to the number of rows if there are less than 5
sel_5 <- do.call("rbind", sel_5)
#bind it back up!
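A side note for newer dplyr versions (not part of the original answer): slice_sample() silently truncates to the group size when n exceeds it, so a grouped one-liner should work, roughly:
library(dplyr)
# Assumes dplyr >= 1.0.0; groups with fewer than 5 rows are returned in full
sel_5 <- ex_df %>%
  group_by(group) %>%
  slice_sample(n = 5) %>%
  ungroup()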

Count unique rows irrespective of column order

EDIT
My question was badly asked. Therefore I re-edited it in order to make it hopefully more useful for others. It has an answer already.
sample data.frame:
set.seed(10)
df <- data.frame(a = sample(1:3, 30, rep=T), b = sample(1:3, 30, rep = T), c = sample(1:3, 30, rep = T))
My question:
I have several columns (in my example a, b, c). Now, slightly similar to, but different from, this question asked by R-user, I would like to count the possible 'value sets' of, in this case, three columns (but in general n columns), irrespective of their order.
count(df,a,b,c) from dplyr does not help:
require (dplyr)
count(df,a,b,c)
# A tibble: 17 x 4
a b c n
<int> <int> <int> <int>
1 1 1 1 1
2 1 1 2 2
...
7 2 1 1 4
...
In this example, rows 2 and 7 contain the same set of values (1,1,2), which is not what I want, because I do not care about the order of the values within the set; '1,1,2' and '2,1,1' should be considered the same. How can I count those value sets?
EDIT 2
The neat trick of @Mouad_S's answer is that you first sort the values within each row with apply(), then transpose the result (t()), and then you can use count on the columns.
require(dplyr)
set.seed(10)
df <- data.frame(a = sample(1:3, 30, rep = TRUE),
                 b = sample(1:3, 30, rep = TRUE),
                 c = sample(1:3, 30, rep = TRUE))
## the old answer
require(dplyr)
count(data.frame(t(apply(df, 1, function(x) sort(x)))), X1, X2, X3)
## the new answer
t(apply(df, 1, function(x) sort(x))) %>%  # sort the values within each row
  as.data.frame() %>%                     # turn the resulting matrix into a data frame
  distinct() %>%                          # keep the unique value sets
  nrow()                                  # count them
[1] 9
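A base R cross-check of the same idea, as a sketch: build an order-independent key per row and count the distinct keys (this should reproduce the 9 above):
# Paste the sorted values of each row into a single key, then count unique keys
keys <- apply(df, 1, function(x) paste(sort(x), collapse = "-"))
length(unique(keys))
# table(keys) would additionally give the count of each value set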

Gather duplicate column sets into single columns

The problem of gathering multiple sets of columns was already addressed here: Gather multiple sets of columns, but in my case, the columns are not unique.
I have the following data:
input <- data.frame(
id = 1:2,
question = c("a", "b"),
points = 0,
max_points = c(3, 5),
question = c("c", "d"),
points = c(0, 20),
max_points = c(5, 20),
check.names = F,
stringsAsFactors = F
)
input
#> id question points max_points question points max_points
#> 1 1 a 0 3 c 0 5
#> 2 2 b 0 5 d 20 20
The first column is an id, then I have many repeated columns (the original dataset has 133 columns):
identifier for question
points given
maximum points
I would like to end up with this structure:
expected <- data.frame(
id = c(1, 2, 1, 2),
question = letters[1:4],
points = c(0, 0, 0, 20),
max_points = c(3, 5, 5, 20),
stringsAsFactors = F
)
expected
#> id question points max_points
#> 1 1 a 0 3
#> 2 2 b 0 5
#> 3 1 c 0 5
#> 4 2 d 20 20
I have tried several things:
tidyr::gather(input, key, val, -id)
reshape2::melt(input, id.vars = "id")
Both do not deliver the desired output. Furthermore, with more columns than shown here, gather doesn't work any more, because there are too many duplicate columns.
As a workaround I tried this:
# add numbers to make col headers "unique"
names(input) <- c("id", paste0(1:(length(names(input)) - 1), names(input)[-1]))
# gather, remove number, spread
input %>%
gather(key, val, -id) %>%
mutate(key = stringr::str_replace_all(key, "[:digit:]", "")) %>%
spread(key, val)
which gives an error: Duplicate identifiers for rows (3, 9), (4, 10), (1, 7), (2, 8)
This problem was already discussed here: Unexpected behavior with tidyr, but I don't know why/how I should add another identifier. Most likely this is not the main problem, because I probably should approach the whole thing differently.
How could I solve my problem, preferably with tidyr or base? I don't know how to use data.table, but in case there is a simple solution, I will settle for that too.
Try this:
do.call(rbind,
lapply(seq(2, ncol(input), 3), function(i){
input[, c(1, i:(i + 2))]
})
)
# id question points max_points
# 1 1 a 0 3
# 2 2 b 0 5
# 3 1 c 0 5
# 4 2 d 20 20
The idiomatic way to do this in data.table is pretty simple:
library(data.table)
setDT(input)
res = melt(
input,
id = "id",
meas = patterns("question", "^points$", "max_points"),
value.name = c("question", "points", "max_points")
)
id variable question points max_points
1: 1 1 a 0 3
2: 2 1 b 0 5
3: 1 2 c 0 5
4: 2 2 d 20 20
You get the extra column called "variable", but you can get rid of it with res[, variable := NULL] afterwards if desired.
Another way to accomplish the same goal without using lapply:
We start by grabbing all the columns for question, max_points, and points, then we melt each one individually and cbind them all together.
library(reshape2)
questions  <- input[, c(1, which(names(input) == "question"))]
points     <- input[, c(1, which(names(input) == "points"))]
max_points <- input[, c(1, which(names(input) == "max_points"))]
questions_m  <- melt(questions,  id.vars = "id", value.name = "questions")[, c(1, 3)]
points_m     <- melt(points,     id.vars = "id", value.name = "points")[, 3, drop = FALSE]
max_points_m <- melt(max_points, id.vars = "id", value.name = "max_points")[, 3, drop = FALSE]
res <- cbind(questions_m, points_m, max_points_m)
res
id questions points max_points
1 1 a 0 3
2 2 b 0 5
3 1 c 0 5
4 2 d 20 20
You might need to clarify how you want the ID column to be handled, but perhaps something like this?
runme <- function(word, dat){
  grep(paste0("^", word, "$"), names(dat))
}
l  <- mapply(runme, unique(names(input)), list(input))
l2 <- as.data.frame(l)
output <- data.frame()
for (i in 1:nrow(l2)) output <- rbind(output, input[, as.numeric(l2[i, ])])
Not sure how robust it is with respect to handling different numbers of repeated columns, but it works for your test data and should work if your columns are repeated an equal number of times.