Differentiated sampling rate by group - r

To train a machine learning model, I'm trying to sample a dataframe that has a grouping variable, so that each group is treated with a different sampling rule. For instance, my data:
df = data.frame(value = 1:10, label=c("a", "a", "b", rep("c", 7)))
For groups of size under, say, 3, I want to take the whole group and no more, and for bigger groups I want to take a sample of size 3 without replacement.
So here, the result could be: df[c(1:3, 6,9,10),]
If I use group_by and sample_n, I get a size error. I thought of going "manual" with split, differentiated sampling, and then binding the rows back together, but is there a more efficient and direct way?

Use the size of the group, n(), inside sample_n:
df %>% group_by(label) %>% sample_n(min(n(), 3))
# A tibble: 6 x 2
# Groups:   label [3]
#   value label
#   <int> <fct>
# 1     1 a
# 2     2 a
# 3     3 b
# 4     5 c
# 5    10 c
# 6     4 c
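As an aside, sample_n has since been superseded by slice_sample in dplyr 1.0.0+. When sampling without replacement, slice_sample silently truncates to the group size, so the min(n(), 3) trick is no longer needed; a minimal sketch with the same df:

library(dplyr)
df %>%
  group_by(label) %>%
  slice_sample(n = 3)  # groups with fewer than 3 rows are returned whole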

Related

R: Long to wide without time

I am working with a medication prescription dataset which I want to convert from long to wide format.
I tried to use the reshape function; however, this requires a time variable, which I don't have (at least not in a useful format, I believe).
Concept dataset:
id <- c(1, 1, 1, 2, 2, 3, 3, 3)
prescription_date <- c("17JAN2009", "02MAR2009", "20MAR2009", "05JUL2009", "10APR2009", "09MAY2009", "13JUN2009", "29MAY2009")
med <- c("A", "B", "A", "B", "A", "B", "A", "B")
df <- data.frame(id, prescription_date, med)
To create a time variable, I tried numbering the medications per id (1st, 2nd, etc.), but I didn't succeed.
Background: I want this in a wide format to eventually create definitions for diagnoses (i.e. if a patient had >1 prescriptions of A, diagnosis is confirmed). This has to be combined with factors from other datasets, hence the idea to go from long to wide.
Any help is much appreciated, thank you.
You might consider keeping the data in long format to perform some of these calculations. I would also suggest converting your dates into a proper date class so they can be computed on; that will show, for instance, that the last two rows of your data are not chronological:
library(dplyr)
df %>%
  mutate(prescription_date = lubridate::dmy(prescription_date)) %>%
  arrange(id, prescription_date) %>%
  group_by(id) %>%
  mutate(A_cuml = cumsum(med == "A"),
         A_ttl = sum(med == "A")) %>%
  ungroup()
# A tibble: 8 × 5
     id prescription_date med   A_cuml A_ttl
  <dbl> <date>            <chr>  <int> <int>
1     1 2009-01-17        A          1     2
2     1 2009-03-02        B          1     2
3     1 2009-03-20        A          2     2
4     2 2009-04-10        A          1     1
5     2 2009-07-05        B          1     1
6     3 2009-05-09        B          0     1
7     3 2009-05-29        B          0     1
8     3 2009-06-13        A          1     1
If you calculate summary stats for each id, you might save them in a summarized table with one row per id and use joins (e.g. left_join) to append the results of each of these summaries to your other datasets.
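For instance, a minimal sketch of that idea, using the >1 prescriptions of A rule from the question (diag_df and other_df are illustrative names):

library(dplyr)
diag_df <- df %>%
  mutate(prescription_date = lubridate::dmy(prescription_date)) %>%
  group_by(id) %>%
  summarise(A_ttl = sum(med == "A"),
            diagnosis_confirmed = A_ttl > 1)  # >1 prescriptions of A
# then append to another per-id dataset:
# other_df %>% left_join(diag_df, by = "id")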

R dplyr: filter common values by group

I need to find the values common to all groups, ideally using dplyr in R.
From my dataset here:
  group   val
  <fct> <dbl>
1 a         1
2 a         2
3 a         3
4 b         3
5 b         4
6 b         5
7 c         1
8 c         3
the expected output is
  group   val
  <fct> <dbl>
1 a         3
2 b         3
3 c         3
as only the number 3 occurs in all groups.
This code does not work:
# Filter the data
dd %>%
  group_by(group) %>%
  filter(all(val))  # does not work
The example here solves a similar issue, but it has a predefined vector of shared values. What if I do not know which values are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
          "b", "b", "b",
          "c", "c")
val = c(1, 2, 3,
        3, 4, 5,
        1, 3)
dd <- data.frame(group, val)
group_by isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values have all the groups:
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups:   val [1]
#   group   val
#   <chr> <dbl>
# 1 a         3
# 2 b         3
# 3 c         3
This is one of the rare cases where we want to use data$column in a dplyr verb: n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups. (It could also be pre-computed.) By contrast, n_distinct(group) uses the grouped data piped into filter, so it gives the number of distinct groups for each value (because we group_by(val)).
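The pre-computed variant, as a sketch:

# compute the total number of groups once, outside the pipe
n_groups <- n_distinct(dd$group)
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_groups) %>%
  ungroup()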
A base R approach can be:
#Code
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)),]
Output:
  group val
3     a   3
4     b   3
8     c   3
A similar option in data.table to @GregorThomas's dplyr solution is
library(data.table)
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]

How to split a dataframe into multiple dataframes based on a column whose values start with a given prefix?

How can I split a dataframe into different dataframes based on one column, say sensor_name, whose values start with prefixes like "RI_" and "AI_", so that I get two dataframes, one for RI and another for AI?
I have tried the following code, but it only works well after I pivot my dataframe.
map(set_names(c("RI", "AI", "FI")),
    ~ select(temp_df, starts_with(.x), starts_with("time_stamp")))
I expect the output to be two different dataframes:
RI_df:
AI_df:
It would be great if anyone could help me with this, since I have just started working with the R programming language.
An option is split from base R
lst1 <- split(df1, substr(df1$sensor_name, 1,2))
names(lst1) <- paste0(names(lst1), "_df")
If the prefix length is variable
lst1 <- split(df1, sub("_.*", "", df1$sensor_name))
Or using tidyverse
library(dplyr)
library(stringr)
df1 %>%
  group_split(grp = str_remove(sensor_name, "_.*"), .keep = FALSE)
NOTE: It is not recommended to have multiple objects in the global env. For that reason, keep them in the list and do all the analysis on that list itself.
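For instance, a sketch of working directly on the list, assuming the non-ID columns of df1 are numeric measurements:

# one summary per sensor-prefix dataframe, results kept together in a list
lapply(lst1, function(d) colMeans(d[sapply(d, is.numeric)]))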
Another approach from base R
df <- data.frame(sensor_name = c("R1_111", "R1_113", "A1_124", "A1_2444"),
                 A = c(1, 2, 24, 4), B = c(2, 2, 1, 2), C = c(3, 4, 4, 2))
df[grepl("R1", df$sensor_name), ]
  sensor_name A B C
1      R1_111 1 2 3
2      R1_113 2 2 4
df[grepl("A1", df$sensor_name), ]
  sensor_name  A B C
3      A1_124 24 1 4
4     A1_2444  4 2 2
Create a variable to identify each group. After that you can subset the data to separate the groups. Functions from the stringr package can extract the relevant text from the longer sensor name.
library(stringr)
library(dplyr)
# Sample data
X <- tibble(
  sensor = c("RI_1", "RI_2", "AI_1", "AI_2"),
  A = c(1, 2, 3, 4),
  B = c(5, 6, 7, 8),
  C = c(9, 10, 11, 12)
)
# Extract text to identify groups
X <- X %>%
  mutate(prefix = str_replace(sensor, "_.*", ""))
# Subset for desired group
X %>% filter(prefix == "AI")
# A tibble: 2 x 5
  sensor     A     B     C prefix
  <chr>  <dbl> <dbl> <dbl> <chr>
1 AI_1       3     7    11 AI
2 AI_2       4     8    12 AI
# Or, split all the groups
lapply(unique(X$prefix), function(x) {
  X %>% filter(prefix == x)
})
[[1]]
# A tibble: 2 x 5
  sensor     A     B     C prefix
  <chr>  <dbl> <dbl> <dbl> <chr>
1 RI_1       1     5     9 RI
2 RI_2       2     6    10 RI
[[2]]
# A tibble: 2 x 5
  sensor     A     B     C prefix
  <chr>  <dbl> <dbl> <dbl> <chr>
1 AI_1       3     7    11 AI
2 AI_2       4     8    12 AI
Depending on what you are doing with these groups, you may do better to use group_by() from the dplyr package.
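A sketch of that idea with the X built above; summaries then come out per prefix without splitting anything:

X %>%
  group_by(prefix) %>%
  summarise(across(A:C, mean))  # one row per prefix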

How to run a for loop for each group in a dataframe?

This question is similar to this one asked earlier but not quite. I would like to iterate through a large dataset (~500,000 rows) and for each unique value in one column, I would like to do some processing of all the values in another column.
Here is code that I have confirmed to work:
df = matrix(nrow = 783, ncol = 2)
counts = table(csvdata$value)
p = as.vector(counts) / length(csvdata$value)
D = 1 - sum(p**2)
The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.
Say I had data with an ID column and a value column. How would I be able to do the same thing as the code above, but return a D value for each group of rows where ID is the same, rather than for the entire dataset? I imagine this requires a loop, and a matrix to store all the D values with ID in one column and the value of D in the other, but I'm not sure.
Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".
In general you can group rows by the values in one column (e.g. "ID") and then perform some transformation based on values/entries in other columns per group. In the tidyverse this would look like this:
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups:   ID [3]
#   ID    value value.mean
#   <fct> <int>      <dbl>
# 1 a        13       12.6
# 2 a        14       12.6
# 3 a        12       12.6
# 4 a        13       12.6
# 5 a        11       12.6
# 6 b        12       15.5
# 7 b        19       15.5
# 8 cc4      10       10.0
Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.
library(tidyverse)
df %>%
  group_by(ID) %>%
  summarise(value.mean = mean(value))
## A tibble: 3 x 2
#   ID    value.mean
#   <fct>      <dbl>
# 1 a           12.6
# 2 b           15.5
# 3 cc4         10.0
The same can be achieved in base R using one of tapply, ave, or by. As far as I understand your problem statement, there is no need for a for loop; just apply a function (per group).
Sample data
df <- read.table(text =
"ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)
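As a quick sketch of the base R route with this sample data, tapply applies a function to value per ID:

tapply(df$value, df$ID, mean)
#    a    b  cc4
# 12.6 15.5 10.0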
Update
To conclude from the comments and chat, this should be what you're after:
# Sample data
set.seed(2017)
csvdata <- data.frame(
  microsat = rep(c("A", "B", "C"), each = 8),
  allele = sample(20, 3 * 8, replace = T))

csvdata %>%
  group_by(microsat) %>%
  summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
#   microsat     D
#   <fct>    <dbl>
# 1 A        0.844
# 2 B        0.812
# 3 C        0.812
Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.
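A quick illustration of prop.table:

counts <- table(c("a", "a", "b"))
prop.table(counts)  # equivalent to counts / sum(counts)
#         a         b
# 0.6666667 0.3333333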
A base R option, with the same df as above, would be
df$value.mean <- with(df, ave(value, ID))

Iterate through columns and row values (list) in R dplyr

This question is based on the following post with additional requirements (Iterate through columns in dplyr?).
The original code is as follows:
df <- data.frame(col1 = rep(1, 15),
                 col2 = rep(2, 15),
                 col3 = rep(3, 15),
                 group = c(rep("A", 5), rep("B", 5), rep("C", 5)))

for (col in c("col1", "col2", "col3")) {
  filt.df <- df %>%
    filter(group == "A") %>%
    select_(.dots = c('group', col))
  # do other things, like ggplotting
  print(filt.df)
}
My objective is to output a frequency table for each unique COL by GROUP combination. The current example specifies a dplyr filter based on a GROUP value of A, B, or C. In my case, I want to iterate (loop) through a list of values in GROUP (list <- c("A", "B", "C")) and generate a frequency table for each combination.
The frequency table is based on counts. For Col1 the result would look something like the table below. The example dataset is simplified; my real dataset is more complex, with multiple 'values' per 'group'. I need to iterate through Col1-Col3 by group.
group value n prop
A     1     5 .1
B     2     5 .1
C     3     5 .1
A better example of the frequency table is here: How to use dplyr to generate a frequency table
I struggled with this for a couple of days, and I could have done better with my example. Thanks for the posts. Here is what I ended up doing to solve this. The result is a series of frequency tables for each column and each unique value found in group. I had 3 columns (col1, col2, col3) and 3 unique values in group (A, B, C), so 3x3. The result is 9 frequency tables, plus a nonsensical frequency table of the group column itself for each group value. I am sure there is a better way to do this. The output generates some labeling, which is useful.
# Build unique group list
group <- unique(df$group)
# Generate frequency tables via a loop
iterate_by_group <- function(x) {
  for (i in seq_along(group)) {
    filt.df <- x[x$group == group[i], ]
    print(lapply(filt.df, freq))  # freq() comes from a package such as summarytools
  }
}
# Run
iterate_by_group(df)
We could gather into long format and then get the frequency (n()) by group:
library(tidyverse)
gather(df, value, val, col1:col3) %>%
  group_by(group, value = parse_number(value)) %>%
  summarise(n = n(), prop = n / nrow(.))
# A tibble: 9 x 4
# Groups:   group [?]
#   group value     n  prop
#   <fct> <dbl> <int> <dbl>
# 1 A         1     5 0.111
# 2 A         2     5 0.111
# 3 A         3     5 0.111
# 4 B         1     5 0.111
# 5 B         2     5 0.111
# 6 B         3     5 0.111
# 7 C         1     5 0.111
# 8 C         2     5 0.111
# 9 C         3     5 0.111
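gather has since been retired in tidyr; a sketch of the same computation with pivot_longer and count, assuming the same df:

library(tidyverse)
df %>%
  pivot_longer(col1:col3, names_to = "column", values_to = "value") %>%
  count(group, column, value) %>%
  mutate(prop = n / sum(n))  # 45 rows in the long data, so each prop is 5/45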
Is this what you want?
df %>%
  group_by(group) %>%
  summarise_all(funs(freq = sum))
