Sampling different x and different sample size in R

Say I have a table like this:
Students   Equipment #
A          101
A          102
A          103
B          104
B          105
B          106
B          107
B          108
C          109
C          110
C          111
C          112
I want to grab equipment # samples from each student in the data frame with varying sample sizes.
For example, I want 1 equipment # from student "A", 2 from student "B", and 3 from student "C". How can I achieve this in R?
This is the code that I have now, but I'm only getting 1 equipment # printed from each student.
students <- unique(df$`Students`)
sample_size <- c(1, 2, 3)

for (i in students) {
  s <- sample(df[df$`Students` == i, ]$`Equipment #`,
              size = sample_size, replace = FALSE)
  print(s)
}

You can create a dataframe that holds, for each student, the number of rows to be sampled. Join it to the data and use sample_n to sample that many rows per group.
library(dplyr)
sample_data <- data.frame(Students = c('A', 'B', 'C'), nr = 1:3)

df %>%
  left_join(sample_data, by = 'Students') %>%
  group_by(Students) %>%
  sample_n(first(nr)) %>%
  ungroup() %>%
  select(-nr) -> s

s
#  Students Equipment
#  <chr>        <int>
#1 A              102
#2 B              108
#3 B              105
#4 C              110
#5 C              112
#6 C              111
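For comparison, the same idea in base R: split the equipment numbers by student, then map sample over the matching sizes. This is a minimal sketch that assumes the named sizes vector covers every student in df.

# split equipment numbers by student, then draw the matching size from each
eq_by_student <- split(df$`Equipment #`, df$Students)
sizes <- c(A = 1, B = 2, C = 3)

# Map() calls sample() once per student with that student's size
samples <- Map(sample, eq_by_student, sizes[names(eq_by_student)])
samples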

You're close. You need to index the sample_size vector with the loop, otherwise it will just take the first item in the vector for each iteration.
library(dplyr)

# set up data
df <- data.frame(Students = c(rep("A", 3),
                              rep("B", 5),
                              rep("C", 4)),
                 Equipment_num = 101:112)

# create vector of students
students <- df %>%
  pull(Students) %>%
  unique()

# sample and print
for (i in seq_along(students)) {
  p <- df %>%
    filter(Students == students[i]) %>%
    slice_sample(n = i)
  print(p)
}
#>   Students Equipment_num
#> 1        A           102
#>   Students Equipment_num
#> 1        B           107
#> 2        B           105
#>   Students Equipment_num
#> 1        C           109
#> 2        C           110
#> 3        C           112
Created on 2021-08-06 by the reprex package (v2.0.0)
Actually this is a much more elegant and generalizable way to tackle this problem.
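For completeness, the minimal fix to the original base R loop is simply to index sample_size by position. This sketch assumes the df built above, with students and sample_size in matching order:

students <- unique(df$Students)
sample_size <- c(1, 2, 3)

for (i in seq_along(students)) {
  # sample_size[i] picks the size that corresponds to students[i]
  s <- sample(df$Equipment_num[df$Students == students[i]],
              size = sample_size[i], replace = FALSE)
  print(s)
}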

Related

Rearranging data according to rater and subject, simultaneously creating new row names

I have a dataset where multiple raters rate multiple subjects.
I'd like to rearrange the data that looks like this:
data <- data.frame(rater = c("A", "B", "C", "A", "B", "C"),
                   subject = c(1, 1, 1, 2, 2, 2),
                   measurment1 = c(1, 2, 3, 4, 5, 6),
                   measurment2 = c(11, 22, 33, 44, 55, 66),
                   measurment3 = c(111, 222, 333, 444, 555, 666))
data
#   rater subject measurment1 measurment2 measurment3
# 1     A       1           1          11         111
# 2     B       1           2          22         222
# 3     C       1           3          33         333
# 4     A       2           4          44         444
# 5     B       2           5          55         555
# 6     C       2           6          66         666
into data that looks like this:
data_transformed <- data.frame(A = c(1, 11, 111, 4, 44, 444),
                               B = c(2, 22, 222, 5, 55, 555),
                               C = c(3, 33, 333, 6, 66, 666))
row.names(data_transformed) <- c("measurment1_1", "measurment2_1", "measurment3_1",
                                 "measurment1_2", "measurment2_2", "measurment3_2")
data_transformed
#                 A   B   C
# measurment1_1   1   2   3
# measurment2_1  11  22  33
# measurment3_1 111 222 333
# measurment1_2   4   5   6
# measurment2_2  44  55  66
# measurment3_2 444 555 666
In the new data frame, the raters (A, B and C) should become the columns. The measurements should become the rows, and I'd also like to add the subject number as a suffix to the row names.
For the rearranging one could probably use the pivot functions, yet I have no idea on how to combine the measurement-variables with the subject number.
Thanks for your help!
We could use pivot_longer, pivot_wider and unite from the tidyr package.
pivot_longer reshapes the data into a long (vertical) format, collapsing the measurment columns into a single name/value pair of variables.
pivot_wider does the opposite of pivot_longer: it spreads a variable into one new column per unique value of that variable.
library(tidyr)

data |>
  pivot_longer(measurment1:measurment3) |>
  pivot_wider(names_from = rater, values_from = value, values_fill = 0) |>
  unite("measure_subject", name, subject, remove = TRUE)
Please try the code below, which accomplishes the expected result using pivot_longer, pivot_wider and column_to_rownames.
library(tidyverse)
data_transformed <- data %>%
  pivot_longer(c('measurment1', 'measurment2', 'measurment3')) %>%
  mutate(rows = paste0(name, '_', subject)) %>%
  pivot_wider(id_cols = rows, names_from = rater, values_from = value) %>%
  column_to_rownames(var = "rows")

How to merge by two columns aggregating one of them

I'm struggling with how to do a merge using two columns. I have one dataframe containing measures of how much palette was used on certain dates, and another dataframe containing the distance travelled by each car. I need to merge the two, and the join condition is the car together with the sum of that car's distances up to the date on which the palette measure occurred.
Here is a toy example:
# palette measure dataframe
measure <- data.frame(car = c("A", "A", "A", "B"),
                      data1 = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021"),
                      palette = c(5, 4, 3, 5))
#> measure
# car data1 palette
#1 A 20-09-2020 5
#2 A 15-10-2020 4
#3 A 13-05-2021 3
#4 B 20-10-2021 5
# the distance dataframe
dist_ <- data.frame(car = c("A", "C", "B", "A", "A", "A"),
                    data2 = c("20-09-2020", "14-05-2020", "20-10-2021", "10-01-2021", "11-01-2021", "13-01-2021"),
                    distance = c(10, 20, 10, 5, 3, 8))
#> dist_
# car data2 distance
#1 A 20-09-2020 10
#2 C 14-05-2020 20
#3 B 20-10-2021 10
#4 A 10-01-2021 5
#5 A 11-01-2021 3
#6 A 13-01-2021 8
#for result I'd like something like
# car data1 palette distance
#1 A 20-09-2020 5 10
#2 A 15-10-2020 4 0
#3 A 13-05-2021 3 16
#4 B 20-10-2021 5 10
Note that the distances are summed up until a date on which the palette is measured. So I can say that the car has covered a distance of 16 km and its palette is 3 cm.
I thought I could use something like merge(x = measure, y = dist_, by.x = c("car", "data1"), by.y = c("car", "data2"), all.x = TRUE), but I don't know how to sum the distance values up to the date of the palette measure for a specific car.
Any hint on how I could do that?
Something like this would work:
library(tidyverse)
library(lubridate)
result <- left_join(measure, dist_, by = "car") %>%
  mutate(across(c("data1", "data2"), dmy)) %>%
  filter(data1 >= data2) %>%
  group_by(car, data2) %>%
  mutate(threshold = min(data1)) %>%
  ungroup() %>%
  filter(data1 == threshold) %>%
  group_by(car, data1, palette) %>%
  summarise(distance = sum(distance))
result
# A tibble: 3 x 4
# Groups:   car, data1 [3]
  car   data1      palette distance
  <chr> <date>       <dbl>    <dbl>
1 A     2020-09-20       5       10
2 A     2021-05-13       3       16
3 B     2021-10-20       5       10
If you want to keep the non-matches you could then rejoin with measure like so:
result.final <- measure %>%
  mutate(data1 = dmy(data1)) %>%
  left_join(result, by = c("car", "data1", "palette"))
result.final
  car      data1 palette distance
1   A 2020-09-20       5       10
2   A 2020-10-15       4       NA
3   A 2021-05-13       3       16
4   B 2021-10-20       5       10
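If performance matters on larger data, a data.table rolling join expresses the same logic: assign each distance record to the next measure date for the same car (roll = -Inf, i.e. next observation carried backward), then sum. This is only a sketch against the toy data above, not the accepted approach:

library(data.table)
library(lubridate)

m <- as.data.table(measure)[, data1 := dmy(data1)]
d <- as.data.table(dist_)[, data2 := dmy(data2)]

# for each distance record, find the next (>=) measure date for that car
d[, upto := m[d, x.data1, on = .(car, data1 = data2), roll = -Inf]]

# sum distances per (car, measure date) and attach them to the measures
agg <- d[!is.na(upto), .(distance = sum(distance)), by = .(car, data1 = upto)]
merge(m, agg, by = c("car", "data1"), all.x = TRUE)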

Keeping one row and discarding others in R using specific criteria?

I'm working with the data frame below, which is just part of the full data, and I need to condense the duplicate numbers in the id column into one row. I want to preserve the row that has the highest sbp number, unless it's 300 or over, in which case I want to discard that too.
So for example, for the first three rows that have id as 13480, I want to keep the row that has 124 and discard the other two.
id,sex,visits,sbp
13480,M,2,124
13480,M,3,306
13480,M,4,116
13520,M,2,124
13520,M,3,116
13520,M,4,120
13580,M,2,NA
13580,M,3,124
This is the farthest I got, been trying to tweak this but not sure I'm on the right track:
maxsbp <- split(sbp, sbp$sbp)
r <- data.frame()
for (i in 1:length(maxsbp)) {
  one <- maxsbp[[i]]
  index <- which(one$sbp == max(one$sbp))
  select <- one[index, ]
  r <- rbind(r, select)
}
r1 <- r[!(sbp$sbp >= 300), ]
r1
I think a tidy solution would work quite well here. I would first filter out all values of 300 or above, since you do not want to keep any value at or beyond that threshold. Then group_by id, arrange in descending order, and keep the first row.
library(dplyr)

my.df <- data.frame(id  = c(13480, 13480, 13480, 13520, 13520, 13520, 13580, 13580),
                    sex = c("M", "M", "M", "M", "M", "M", "M", "M"),
                    sbp = c(124, 306, 116, 124, 116, 120, NA, 124))

my.df %>%
  filter(sbp < 300) %>%   # retain only values below 300 (NAs are dropped too)
  group_by(id) %>%        # group by id
  arrange(desc(sbp)) %>%  # arrange sbp in descending order
  top_n(1, sbp)           # retain the first value, i.e. the largest
# A tibble: 3 x 3
# Groups:   id [3]
#      id sex     sbp
#   <dbl> <chr> <dbl>
# 1 13480 M       124
# 2 13520 M       124
# 3 13580 M       124
In R, you'll very rarely need explicit for loops for tasks like this.
There are functions available which will help you perform such grouped operations.
For example, in base R you can use subset and ave :
subset(df, sbp == ave(sbp, id, FUN = function(x) max(x[x <= 300], na.rm = TRUE)))
# id sex visits sbp
#1 13480 M 2 124
#4 13520 M 2 124
#8 13580 M 3 124
The same can be done using dplyr whose syntax is a little bit easier to understand.
library(dplyr)

df %>%
  group_by(id) %>%
  filter(sbp == max(sbp[sbp <= 300], na.rm = TRUE))
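In recent dplyr versions, slice_max() states the same intent directly; a hedged equivalent sketch:

library(dplyr)

df %>%
  filter(sbp < 300) %>%                      # drop readings of 300+ (and NAs)
  group_by(id) %>%
  slice_max(sbp, n = 1, with_ties = FALSE) %>%
  ungroup()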
slice_head can also be used
my.df <- data.frame(id  = c(13480, 13480, 13480, 13520, 13520, 13520, 13580, 13580),
                    sex = c("M", "M", "M", "M", "M", "M", "M", "M"),
                    sbp = c(124, 306, 116, 124, 116, 120, NA, 124))
> my.df
     id sex sbp
1 13480   M 124
2 13480   M 306
3 13480   M 116
4 13520   M 124
5 13520   M 116
6 13520   M 120
7 13580   M  NA
8 13580   M 124
Proceed simply like this. Note that rows with sbp of 300 or more must be filtered out before slicing; if you filter afterwards, the 306 row wins the slice for id 13480 and that id is lost entirely.
my.df %>%
  filter(sbp < 300) %>%
  group_by(id, sex) %>%
  arrange(desc(sbp)) %>%
  slice_head()
# A tibble: 3 x 3
# Groups:   id, sex [3]
#      id sex     sbp
#   <dbl> <chr> <dbl>
# 1 13480 M       124
# 2 13520 M       124
# 3 13580 M       124

How to count unique values in a column in R

I have a database and would like to know how many people (identified by ID) match a characteristic. The list is like this:
111 A
109 A
112 A
111 A
108 A
I only need to count how many IDs have that feature; the problem is that there are duplicated IDs. I've tried
count(df, vars = ID)
but it does not show the total number of ID's, just how many times they are repeated. Same with
count(df, c('ID'))
as it shows the total number of IDs, many of them duplicated; I need to count each of them a single time.
Do you have any suggestions? Using the table function is not an option because of the size of this database.
We can use n_distinct() from dplyr to count the number of unique values for a column in a data frame.
textFile <- "id var1
111 A
109 A
112 A
111 A
108 A"
df <- read.table(text = textFile,header = TRUE)
library(dplyr)
df %>% summarise(count = n_distinct(id))
...and the output:
> df %>% summarise(count = n_distinct(id))
count
1 4
We can also summarise the counts within one or more group_by() columns.
textFile <- "id var1
111 A
109 A
112 A
111 A
108 A
201 B
202 B
202 B
111 B
112 B
109 B"
df <- read.table(text = textFile,header = TRUE)
df %>% group_by(var1) %>% summarise(count = n_distinct(id))
...and the output:
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
  var1  count
  <chr> <int>
1 A         4
2 B         5
You can first remove duplicates using unique() and then count the remaining rows:
library(dplyr)

d <- tribble(
  ~ID, ~feature,
  111, "A",
  109, "A",
  112, "A",
  111, "A",
  108, "A")

count(unique(d, vars = c(ID, feature)), vars = ID)
   vars     n
  <dbl> <int>
1   108     1
2   109     1
3   111     1
4   112     1
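If no packages are loaded, base R gives the distinct count in one call:

# number of distinct IDs, duplicates counted once
length(unique(d$ID))
#> [1] 4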

Select grouped rows with at least one matching criterion

I want to select all those groupings that contain at least one of the elements that I am interested in. I was able to do this by creating an intermediate array, but I am looking for something simpler and faster. This is because my actual data set has over 1M rows (and 20 columns) so I am not sure whether I will have sufficient memory to create an intermediate array. More importantly, the below method on my original file takes a lot of time.
Here's my code and data:
a) Data
dput(Data_File)
structure(list(Group_ID = c(123, 123, 123, 123, 234, 345, 444, 444),
               Product_Name = c("ABCD", "EFGH", "XYZ1", "Z123", "ABCD", "EFGH", "ABCD", "ABCD"),
               Qty = c(2, 3, 4, 5, 6, 7, 8, 9)),
          .Names = c("Group_ID", "Product_Name", "Qty"),
          row.names = c(NA, 8L), class = "data.frame")
b) Code: I want to select Group_ID that has at least one Product_Name = ABCD
# Find out transactions
Data_T <- Data_File %>%
  group_by(Group_ID) %>%
  dplyr::filter(Product_Name == "ABCD") %>%
  select(Group_ID) %>%
  distinct()

# Now filter them
Filtered_T <- Data_File %>%
  group_by(Group_ID) %>%
  dplyr::filter(Group_ID %in% Data_T$Group_ID)
c) Expected output is
Group_ID Product_Name   Qty
   <dbl>        <chr> <dbl>
     123         ABCD     2
     123         EFGH     3
     123         XYZ1     4
     123         Z123     5
     234         ABCD     6
     444         ABCD     8
     444         ABCD     9
I've been struggling with this for over 3 hours now. I looked at the thread auto-suggested by SO, Select rows with at least two conditions from all conditions, but my question is very different.
I would do it like this:
Data_File %>%
  group_by(Group_ID) %>%
  filter(any(Product_Name %in% "ABCD"))
# Source: local data frame [7 x 3]
# Groups: Group_ID [3]
#
# Group_ID Product_Name Qty
# <dbl> <chr> <dbl>
# 1 123 ABCD 2
# 2 123 EFGH 3
# 3 123 XYZ1 4
# 4 123 Z123 5
# 5 234 ABCD 6
# 6 444 ABCD 8
# 7 444 ABCD 9
Explanation: any() will return TRUE if there are any rows (within the group) that match the condition. The length-1 result will then be recycled to the full length of the group and the entire group will be kept. You could also do it with sum(Product_name %in% "ABCD") > 0 as the condition, but the any reads very nicely. Use sum instead if you wanted a more complicated condition, like 3 or more matching product names.
I prefer %in% to == for things like this because it behaves better with NA and it is easy to extend if you want to check for any of several products by group.
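For instance, a sketch of the stricter condition mentioned above, keeping only groups with at least two matching rows (with the toy data this keeps only Group_ID 444):

Data_File %>%
  group_by(Group_ID) %>%
  filter(sum(Product_Name %in% "ABCD") >= 2)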
If speed and efficiency are an issue, data.table will be faster. I would do it like this, which relies on a keyed join for the filtering and uses no non-data.table operations, so it should be very fast:
library(data.table)

df <- as.data.table(Data_File)
setkey(df)
groups <- unique(subset(df, Product_Name %in% "ABCD", Group_ID))
df[groups, nomatch = 0]
# Group_ID Product_Name Qty
# 1: 123 ABCD 2
# 2: 123 EFGH 3
# 3: 123 XYZ1 4
# 4: 123 Z123 5
# 5: 234 ABCD 6
# 6: 444 ABCD 8
# 7: 444 ABCD 9
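A more compact data.table variant reuses any() inside groups; a sketch, not benchmarked against the keyed join above:

# return each group's rows only when the group contains at least one "ABCD"
df[, if (any(Product_Name %in% "ABCD")) .SD, by = Group_ID]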
