Shuffle data frame rows depending on a factor

I have a data frame, for example:
letter class value
A 0 55
B 1 23
C 1 12
D 1 9
E 2 68
F 2 78
G 2 187
I want to randomly shuffle the rows within each class, so that each letter is associated with a new random value (but one drawn from the same class).
Desired example output:
letter class value
A 0 55
B 1 12
C 1 9
D 1 23
E 2 187
F 2 78
G 2 68
I tried something with dplyr like:
tab %>% group_by(class) %>% sample_n(size=3)
But this samples 3 rows per group, and my groups don't all have the same number of rows.
The only solution I have found so far is to split the data into one data frame per class and shuffle each one independently. But since I have many classes, that would be long-winded and messy.
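For reference, that split-and-shuffle workaround might look like the following base R sketch (my illustration, assuming the data frame is called tab as above):
shuffled <- do.call(rbind, lapply(split(tab, tab$class), function(d) {
  # shuffle 'value' within one class; safe even for single-row classes
  d$value <- d$value[sample(nrow(d))]
  d
}))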

We can use sample on the sequence of row indices (row_number()) and rearrange 'value' based on the sampled index:
df1 %>%
group_by(class) %>%
mutate(value = value[sample(row_number())])
Or, as @RonakShah mentioned in the comments: if a group has only a single row, sample applied to that single value samples from a sequence of values instead of returning the value itself. So, if we use sample directly on 'value', an if/else guard is needed:
df1 %>%
group_by(class) %>%
mutate(value = if(n() == 1) value else sample(value, n()))
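The guard matters because of how base R's sample() treats a length-one numeric vector; for example:
sample(c(55))   # a random permutation of 1:55, not just 55
sample(55, 1)   # one random draw from 1:55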
If we want to use sample_n, it can be done within do
df1 %>%
group_by(class) %>%
do(sample_n(., size = nrow(.)))
NOTE: We need to specify nrow(.) instead of n() because n() only works inside certain tidyverse verbs such as mutate/summarise/filter/arrange; it is not implemented to work inside sample_n.
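For reference, a reproducible setup (assumed here: the question's table assigned to the df1 used above) with a seeded shuffle:
library(dplyr)
df1 <- data.frame(
  letter = c("A", "B", "C", "D", "E", "F", "G"),
  class = c(0, 1, 1, 1, 2, 2, 2),
  value = c(55, 23, 12, 9, 68, 78, 187)
)
set.seed(123)  # makes the random shuffle reproducible
df1 %>%
  group_by(class) %>%
  mutate(value = value[sample(row_number())])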

How to conditionally mutate a variable in R based on the values in multiple columns?

There are no recent answers to this question using current tidyverse verbs (R 4.1 and tidyverse 1.3.1 in my case). I've tried mutate() with both case_when() and ifelse() plus select_if() to conditionally fill a new variable with a value calculated from the number of TRUE values in specific other columns, row by row, but neither approach filters the correct columns for the calculation as intended. I could probably pivot longer to replace my column groupings and avoid having to filter which columns feed the mutate calculation, but I want to keep one response per row for merging later. Here's a reproducible example.
library(tidyverse)
set.seed(195)
# create dataframe
response_id <- rep(1:461)
questions <- c("overall","drought","domestic","livestock","distance")
answers <- c("a","b","c","d","e")
colnames <- apply(expand.grid(questions, answers), 1, paste, collapse="_")
df <- tibble(response_id)
# data is actually an unknown mix of TRUE and FALSE values in all columns but just doing that for one column for now for simplicity
df[,colnames] = FALSE
df$overall_a[sample(nrow(df),100)] <- TRUE
# using ifelse and select if to filter which columns to sum
df %>%
mutate(positive = ifelse(select_if(isTRUE), sum(str_detect(colnames(df), "a|b")), NA)) %>%
mutate(negative = ifelse(select_if(isTRUE), sum(str_detect(colnames(df), "c|d|e")), NA)) %>%
select(response_id, positive, negative)
# using case_when
df %>%
mutate(positive = case_when(TRUE ~ sum(str_detect(colnames(df), "a|b"))), NA) %>%
mutate(negative = case_when(TRUE ~ sum(str_detect(colnames(df), "c|d|e"))), NA) %>%
select(response_id, positive, negative)
The desired output should be as follows. Thanks for any help on this!
# A tibble: 461 × 3
response_id positive negative
<int> <int> <int>
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 1 0
6 6 1 0
7 7 0 0
8 8 1 0
9 9 0 0
10 10 1 0
# … with 451 more rows
Having data in column names is not considered "tidy" and the "tidyverse" works best with tidy data. Rather than hacking against the column names, the pivoting approach would be the most consistent with the tidy philosophy. Plus it will scale better for more categories. For example
df %>%
pivot_longer(-response_id) %>%
separate(name, into=c("category", "code")) %>%
mutate(sentiment=case_when(
code %in% c("a", "b") ~ "positive",
code %in% c("c", "d", "e") ~ "negative")) %>%
group_by(response_id, sentiment) %>%
summarize(count=sum(value)) %>%
pivot_wider(response_id, names_from=sentiment, values_from=count)
It's not as concise but it more directly says what it's doing.
But if you really want to keep data in the column names, you can perform rowwise summaries using c_across() with recent dplyr:
df %>%
rowwise() %>%
mutate(
positive=sum(c_across(ends_with(c("_a", "_b")))),
negative=sum(c_across(ends_with(c("_c", "_d", "_e"))))) %>%
select(response_id, positive, negative)
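As a side note (my sketch, not part of the original answer), rowwise() can be slow on wide data; rowSums() applied to the data frame returned by across() gives the same counts fully vectorized:
df %>%
  mutate(
    positive = rowSums(across(ends_with(c("_a", "_b")))),
    negative = rowSums(across(ends_with(c("_c", "_d", "_e"))))
  ) %>%
  select(response_id, positive, negative)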

R: Is there a way to sort messy data where it pivots from long to wide, and as it moves across variables, into one logical key:value column?

I have extremely messy data. A portion of it looks like the following example.
x1_01=c("bearing_coordinates", "bearing_coordinates", "bearing_coordinates", "roadkill")
x1_02=c(146,122,68,1)
x2_01=c("tree_density","animals_on_road","animals_on_road", "tree_density")
x2_02=c(13,2,5,11)
x3_01=c("animals_on_road", "tree_density", "roadkill", "bearing_coordinates")
x3_02=c(3,10,1,1000)
x4_01=c("roadkill","roadkill", "tree_density", "animals_on_road")
x4_02=c(1,1,12,6)
testframe = data.frame(x1_01 = x1_01,x1_02=x1_02,x2_01=x2_01, x2_02=x2_02, x3_01=x3_01, x3_02=x3_02, x4_01=x4_01, x4_02=x4_02)
x1_01 x1_02 x2_01 x2_02 x3_01 x3_02 x4_01
1 bearing_coordinates 146 tree_density 13 animals_on_road 3 roadkill
2 bearing_coordinates 122 animals_on_road 2 tree_density 10 roadkill
3 bearing_coordinates 68 animals_on_road 5 roadkill 1 tree_density
4 roadkill 1 tree_density 11 bearing_coordinates 1000 animals_on_road
x4_02
1 1
2 1
3 12
4 6
I noticed when using spread (from tidyr) that if I spread x1_01 and x1_02 on the initial datasheet, e.g.
test <- testframe %>%
spread(x1_01, x1_02)
and then used spread on that dataframe for x2_01 and x2_02, e.g.
testtest <- test %>%
spread(x2_01, x2_02)
that the second "bearing_coordinates" column would replace the original column, and result in NAs where there were values. To get around that, I went down the route of creating multiple dataframes and merging them together, e.g.
test <- testframe %>%
spread(x1_01, x1_02) %>%
mutate(id = row_number())
test2 <- testframe %>%
spread(x2_01, x2_02) %>%
mutate(id = row_number())
test3 <- testframe %>%
spread(x3_01, x3_02) %>%
mutate(id = row_number())
test4 <- testframe %>%
spread(x4_01, x4_02) %>%
mutate(id = row_number())
merge_test <- merge(test, test2, by="id")
merge_test2 <- merge(merge_test, test3, by ="id")
merge_test3 <- merge(merge_test2, test4, by = "id")
This (long-winded) approach is OK for a small dataset like the test data I have supplied. However, as the number of variables increases (x5_01, x5_02, x6_01, x6_02, etc.), columns begin to be duplicated and to delete previously created columns named e.g. "bearing_coordinates", which results in loss of data. My question is: is there a way to do this where the data pivots from long to wide and, as it moves across variables, into one logical key:value column, so that all values associated with "bearing_coordinates" end up in that column? The data should then look like this:
bearing_coordinates=c(146,122,68,1000)
roadkill=c(1,1,1,1)
tree_density=c(13,10,12,11)
animals_on_road=c(3,2,5,6)
id=c(1,2,3,4)
clean.data = data.frame(bearing_coordinates=bearing_coordinates,roadkill=roadkill,tree_density=tree_density,animals_on_road=animals_on_road,id=id)
bearing_coordinates roadkill tree_density animals_on_road id
1 146 1 13 3 1
2 122 1 10 2 2
3 68 1 12 5 3
4 1000 1 11 6 4
I assume there must be a way to do this surprisingly easily in dplyr, but I rarely have data this messy and so am at a bit of a loss as to which tools will accomplish this.
I've been looking through the dplyr documentation and SO posts and everything seems to be almost what I'm looking for but not quite right. For example, this post indicates that there could be a different strategy of taking "bearing.coordinates.x" and "bearing.coordinates.y" and then making those columns have duplicate names before finally merging them with no loss of data. However, that looks like it could be even more long-winded (particularly with multiple key:value pairs, as in my real dataset) and also potentially prone to error. I've also looked at filter as perhaps being a good option, but it seems to still hit that issue of columns deleting each other, and results in a necessary extra coding step to keep all the rest of the data.
Thank you in advance for help.
EDIT: Ben's answer below is correct, but I initially inaccurately represented the variables as being separated by "." and not "_" as they are in my real data. This could be addressed by simply changing the regex to (.*)_(.*), so:
testframe %>%
pivot_longer(cols = everything(), names_to = c("name", ".value"), names_pattern = "(.*)_(.*)") %>%
select(-name) %>%
pivot_wider(names_from = "01", values_from = "02", values_fn = list) %>%
unnest(cols = everything())
This is a really beautiful and elegant solution. Thank you Ben!
You might try something like the approach below. Based on your needs it could be modified further, but a lot depends on what your actual data looks like. This assumes complete key/value pairs, evenly divided.
I would first use pivot_longer to get your keys and values into two columns. Then you can use pivot_wider so that the values are placed in the appropriate key columns.
library(tidyr)
library(dplyr)
testframe %>%
pivot_longer(cols = everything(), names_to = c("name", ".value"), names_pattern = "x(\\d+)_(\\d+)") %>%
select(-name) %>%
pivot_wider(names_from = `01`, values_from = `02`, values_fn = list) %>%
unnest(cols = everything())
Output
bearing.coordinates tree.density animals.on.road roadkill
<dbl> <dbl> <dbl> <dbl>
1 146 13 3 1
2 122 10 2 1
3 68 12 5 1
4 1000 11 6 1
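As a variant (my sketch, not from the answer above), adding an explicit row id before pivoting avoids the values_fn = list / unnest step and also carries along the id column from the desired output:
testframe %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = c("name", ".value"), names_pattern = "x(\\d+)_(\\d+)") %>%
  select(-name) %>%
  pivot_wider(names_from = `01`, values_from = `02`)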

Count and Assign Consecutive Occurrences of Variable

I wish to count consecutive occurrences of each value and assign that count to every element of the run in a new column. Below is an example of the input and desired output:
dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"))
dataset$count <- c(1,2,2,2,2,1,4,4,4,4,1,1)
dataset
input count
a 1
b 2
b 2
a 2
a 2
c 1
a 4
a 4
a 4
a 4
b 1
c 1
With rle(dataset$input) I can get the run lengths of each value, but I want the resulting output in the format above.
My question is similar to:
R: count consecutive occurrences of values in a single column
But there the output is a running sequence within each run, whereas I want to assign the run length itself to each value.
You can repeat each element of the lengths component returned by rle its own number of times:
with(rle(dataset$input), rep(lengths, lengths))
#[1] 1 2 2 2 2 1 4 4 4 4 1 1
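To attach the result as the new column, assign it back:
dataset$count <- with(rle(dataset$input), rep(lengths, lengths))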
Using dplyr, we can use lag to create groups and then count the number of rows in each group.
library(dplyr)
dataset %>%
group_by(gr = cumsum(input != lag(input, default = first(input)))) %>%
mutate(count = n())
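Note that the helper grouping column gr stays in the result; to drop it, finish the pipe with:
dataset %>%
  group_by(gr = cumsum(input != lag(input, default = first(input)))) %>%
  mutate(count = n()) %>%
  ungroup() %>%
  select(-gr)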
and with data.table
library(data.table)
setDT(dataset)[, count:= .N, rleid(input)]
data
Make sure the input column is character and not factor.
dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"),
stringsAsFactors = FALSE)
We can also use rleid from data.table together with dplyr
library(dplyr)
library(data.table)
dataset %>%
group_by(grp = rleid(input)) %>%
mutate(count = n())

To create a frequency table with dplyr to count the factor levels and missing values and report it

Some questions are similar to this topic (here or here, as an example) and I know one solution that works, but I want a more elegant response.
I work in epidemiology and I have variables 1 and 0 (or NA). Example:
Does patient has cancer?
NA or 0 is no
1 is yes
Let's say I have several variables in my dataset and I want to count only the variables with "1". It's a classic frequency table, but dplyr is making things more complicated than I could have imagined at first glance.
My code is working:
dataset %>%
select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
ComDis, ASD, HealthImpair, DevDelays) %>% # replace to your needs
summarise_all(funs(sum(1-is.na(.))))
And you can reproduce this code here:
library(tidyverse)
dataset <- data.frame(var1 = rep(c(NA,1),100), var2=rep(c(NA,1),100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1-is.na(.))))
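As a side note, funs() is deprecated in current dplyr; an equivalent with across() (a sketch) is:
dataset %>% summarise(across(everything(), ~ sum(!is.na(.x))))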
But I really want to select all the variables of interest, count how many 0's (or NAs) and how many 1's each one has, and report that.
Thanks.
What about the following frequency table per variable?
First, I edit your sample data to also include 0's and load the necessary libraries.
library(tidyr)
library(dplyr)
dataset <- data.frame(var1 = rep(c(NA,1,0),100), var2=rep(c(NA,1,0),100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
select(var1, var2) %>%
gather(var, val) %>%
mutate(val = factor(val)) %>%
group_by(var, val) %>%
count()
# A tibble: 6 x 3
# Groups: var, val [6]
var val n
<chr> <fct> <int>
1 var1 0 100
2 var1 1 100
3 var1 NA 100
4 var2 0 100
5 var2 1 100
6 var2 NA 100
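With current tidyr, gather() is superseded by pivot_longer(); an equivalent version of the same table (my sketch; val stays numeric here rather than being converted to a factor) is:
dataset %>%
  pivot_longer(everything(), names_to = "var", values_to = "val") %>%
  count(var, val)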
A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 = as.factor(dataset$var1)
dataset$var2 = as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() tells you the number of occurrences of each level of the factor, with NAs counted separately.
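For example, with the three-level sample data defined earlier in this thread (rep(c(NA, 1, 0), 100)):
summary(as.factor(dataset$var1))
#   0    1 NA's
# 100  100  100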

Conditionally mutate columns based on column class

My question is based on a previous topic posted here: Mutating multiple columns in a data frame
Suppose I have a tibble as follows:
id char_var_1 char_var_2 num_var_1 num_var_2 ... x_var_n
1 ... ... ... ... ...
2 ... ... ... ... ...
3 ... ... ... ... ...
where id is the key and char_var_x is a character variable and num_var_x is a numerical variable. I have 346 columns in total and I want to write a function that scales all the numerical variables except the id column. I'm looking for an elegant way to mutate these columns using pipes and dplyr functions.
Obviously the following works for all numeric variables:
pre_process_data <- function(dt)
{
# scale numeric variables
dt %>% mutate_if(is.numeric, scale)
}
But I'm looking for a way to exclude the id column from scaling, retaining its original values, while scaling all the other numerical variables. Is there an elegant way to do this?
Try the approach below; it is similar to the select_if post:
library(dplyr)
# Using #Psidom's example data: https://stackoverflow.com/a/48408027
df %>%
mutate_if(function(col) is.numeric(col) &
!all(col == .$id), scale)
# id a b
# 1 1 a -1
# 2 2 b 0
# 3 3 c 1
Not a canonical way to do this, but with a little hack you can do it with mutate_at, constructing the integer indices of the columns to mutate using which with manually built column-selection conditions:
df = data.frame(id = 1:3, a = letters[1:3], b = 2:4)
df %>%
mutate_at(vars(which(sapply(., is.numeric) & names(.) != 'id')), scale)
# id a b
#1 1 a -1
#2 2 b 0
#3 3 c 1
How about the "make the column you're interested in a character, then change it back" approach?
dt %>%
mutate(id = as.character(id)) %>%
mutate_if(is.numeric, scale) %>%
mutate(id = as.numeric(id))
You can use dplyr's across:
df %>% mutate(across(c(where(is.numeric),-id),scale))
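One caveat (my note, not part of the answer): scale() returns a one-column matrix, so the mutated columns become matrix columns; wrap it to keep plain numeric vectors:
df %>% mutate(across(c(where(is.numeric), -id), ~ as.numeric(scale(.x))))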
