R (Stratified) Random Sampling for Defined Cases

I have a data frame:
DF <- data.frame(Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK", "KL", "LM"),
ID = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1))
My question: I would like to create a new column that includes a (binary) random number ('0' or '1') for cases 'ID' == 1 with a fixed proportion (or pre-defined prevalence) (e.g., random numbers '0' x 2 and '1' x 4).
EDIT I:
For non-case specific purposes, the solution might be:
DF$RANDOM[sample(1:nrow(DF), nrow(DF), FALSE)] <- rep(RANDOM, c(nrow(DF)-4,4))
But I still need the case-specific assignment, AND the aforementioned solution does not explicitly refer to '0' or '1'.
(Note: The variable 'value' is not relevant for the question; only an identifier.)
I found relevant posts on stratified sampling and random row selection, but this question is not covered by those (or other) posts.
Thank you VERY much in advance.

You can first subset the data to the cases where ID == 1. To guarantee the exact number of 1s and 0s, build the values with rep() and shuffle them with sample() using replace = FALSE.
Here's a solution.
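library(data.table)
setDT(DF)  # convert the data.frame to a data.table by reference
set.seed(121)
# Shuffle two 0s and four 1s across the six rows where ID == 1
DF[ID == 1, new_column := sample(rep(c(0, 1), c(2, 4)), .N, replace = FALSE)]
print(DF)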
Value ID new_column
1: AB 1 1
2: BC 0 NA
3: CD 0 NA
4: DE 1 1
5: EF 0 NA
6: FG 1 1
7: GH 1 1
8: HI 0 NA
9: IJ 0 NA
10: JK 1 0
11: KL 0 NA
12: LM 1 0
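The same idea can be sketched in base R without extra packages. This is only a minimal illustration, assuming the six ID == 1 rows should receive exactly two 0s and four 1s; the names `cases` and `pool` are illustrative and not part of the original answer.

```r
DF <- data.frame(
  Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK", "KL", "LM"),
  ID = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1)
)

set.seed(121)
cases <- DF$ID == 1                     # logical index of the rows to fill
pool <- rep(c(0, 1), times = c(2, 4))   # fixed composition: two 0s, four 1s
stopifnot(length(pool) == sum(cases))   # the pool must match the number of cases

DF$new_column <- NA_real_               # rows with ID == 0 stay NA
DF$new_column[cases] <- sample(pool)    # random permutation of the pool
table(DF$new_column, useNA = "ifany")
```

Because `sample(pool)` only permutes a pre-built vector, the proportion of 0s and 1s is exact rather than approximate; only their order across the cases is random.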

library(dplyr)
DF <- data.frame(Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH",
"HI", "IJ", "JK", "KL", "LM"),
ID = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
stringsAsFactors = FALSE)
DF %>% group_by(ID) %>% sample_n(4, replace = FALSE)

Related

Fill NAs in R for certain columns

I'm running a linear regression, but many of my observations cannot be used because some of the values in the row are NA. I know that if at least one of a set of variables is entered, then an NA is actually 0. However, if all the values are NA, they should stay NA. I will include an example because I know this might be confusing.
What I have looks like this:
df <- data.frame(outcome = c(1, 0, 1, 1, 0),
Var1 = c(1, 0, 1, NA, NA),
Var2 = c(NA, 1, 0, 0, NA),
Var3 = c(0, 1, NA, 1, NA))
For Vars 1-3, the first 4 rows have an NA, but have other entries in other vars. In the last row, however, all values are NA. I know that everything in the last row is NA, but I want the NAs in those first 4 rows to be filled with 0. The desired outcome would look like this:
desired <- data.frame(outcome = c(1, 0, 1, 1, 0),
Var1 = c(1, 0, 1, 0, NA),
Var2 = c(0, 1, 0, 0, NA),
Var3 = c(0, 1, 0, 1, NA))
I know there are messy ways I could go about this, but I was wondering what would be the most streamlined process for this?
I hope this makes sense, I know the question is confusing. I can clarify anything if needed.
We can create a logical vector with rowSums, use that to subset the rows before changing the NA to 0
i1 <- rowSums(!is.na(df[-1])) > 0
df[i1, -1][is.na(df[i1, -1])] <- 0
Checking against desired:
identical(df, desired)
#[1] TRUE
You can use apply to conditionally replace NA in certain rows:
data.frame(t(apply(df, 1, function(x) if (all(is.na(x[-1]))) x else replace(x, is.na(x), 0))))
Output
outcome Var1 Var2 Var3
1 1 1 0 0
2 0 0 1 1
3 1 1 0 0
4 1 0 0 1
5 0 NA NA NA
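The row-wise rule used in both answers can also be wrapped in a small reusable function. The sketch below is one possible way to do it; `fill_partial_na` and its arguments are hypothetical names introduced here for illustration, not part of either answer.

```r
df <- data.frame(outcome = c(1, 0, 1, 1, 0),
                 Var1 = c(1, 0, 1, NA, NA),
                 Var2 = c(NA, 1, 0, 0, NA),
                 Var3 = c(0, 1, NA, 1, NA))

# Hypothetical helper: zero-fill NAs in `cols`, but only in rows that have
# at least one non-missing value among those columns.
fill_partial_na <- function(data, cols) {
  has_value <- rowSums(!is.na(data[cols])) > 0
  # has_value (one entry per row) is recycled down each column of the matrix
  to_fill <- is.na(data[cols]) & has_value
  data[cols][to_fill] <- 0
  data
}

filled <- fill_partial_na(df, c("Var1", "Var2", "Var3"))
filled
```

Passing the column names explicitly avoids relying on the outcome column being first, which the `df[-1]` indexing above assumes.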

I tried to adjust datapoint using index and match and applied it to specified group in dataset in R

I have a dataset recording infection status in chickens. The first column is the group of the chicken, either I or S. The remaining columns are the status (0 or 1) at each sampling time. I need to adjust the data in the I group: I want to replace the last 0 before the first 1 with 0.5.
I tried it on a vector using index and match:
v= c(0,0,1,0,1,1,1)
v[[match(1,v) -1]] = 0.5
But I am struggling to apply this to the dataset.
Here is a simplified version of the data frame:
dftry <- data.frame("Role" = c("I", "I", "S", "S", "S", "I"),
"T1" = c(0,0, 0, 0, 0, 0),
"T2" = c(0,0, 0, 0, 0, 0),
"T3"= c(0,0, 1, 0, 1, 1),
"T4"= c(1,1,1, 1, 1, 1))
and the desired output should look like this
dftry <- data.frame("Role" = c("I", "I", "S", "S", "S", "I"),
"T1" = c(0,0, 0, 0, 0, 0),
"T2" = c(0,0, 0, 0, 0, 0.5),
"T3"= c(0.5,0.5, 1, 0, 1, 1),
"T4"= c(1,1,1, 1, 1, 1))
I've tried using mutate and an inner join, but it does not seem to work. Please help.
Here is one approach. You can add row numbers so that each row is considered independently. Using pivot_longer you can put your data into long format, and then look for transitions from 0 to 1 over time (from T1 to T4) for rows with a Role of "I". The data can be left in long format for further manipulation or analysis, or converted back to wide format as below. Note that this solution marks every transition from one state to another (infection status 0 to 1); it does not check whether the transition involves the "first" status of 1 for a given row.
library(tidyverse)
dftry %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -c(Role, rn)) %>%
group_by(rn) %>%
mutate(value = ifelse(
Role == "I" & value == 0 & lead(value) == 1, .5, value
)) %>%
pivot_wider(id_cols = c(Role, rn))
Output
Role rn T1 T2 T3 T4
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 I 1 0 0 0.5 1
2 I 2 0 0 0.5 1
3 S 3 0 0 1 1
4 S 4 0 0 0 1
5 S 5 0 0 1 1
6 I 6 0 0.5 1 1
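If only the last 0 before the first 1 should change, the asker's own match() idea can be applied row by row. The following is a base R sketch under that assumption; `mark_last_zero`, `time_cols` and `is_I` are illustrative names, not part of the answer above.

```r
dftry <- data.frame(Role = c("I", "I", "S", "S", "S", "I"),
                    T1 = c(0, 0, 0, 0, 0, 0),
                    T2 = c(0, 0, 0, 0, 0, 0),
                    T3 = c(0, 0, 1, 0, 1, 1),
                    T4 = c(1, 1, 1, 1, 1, 1))

time_cols <- c("T1", "T2", "T3", "T4")

# Replace the value just before the first 1 with 0.5.
# Rows with no 1, or with a 1 in the first position, are returned unchanged.
mark_last_zero <- function(v) {
  first_one <- match(1, v)
  if (!is.na(first_one) && first_one > 1) v[first_one - 1] <- 0.5
  v
}

is_I <- dftry$Role == "I"
# apply() returns one column per input row, so transpose back to rows
dftry[is_I, time_cols] <- t(apply(dftry[is_I, time_cols], 1, mark_last_zero))
dftry
```

Unlike the lead()-based version, this touches at most one cell per row, so a later 0 to 1 transition in the same row would be left alone.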

how I can make a vector by using 2 columns in data set

I have these two columns in my data.frame :
df1 <- structure(list(Mode = c("car", "walk", "passenger", "car", "bus"
), Licence = c(1, 1, 0, 1, 1)), row.names = c(NA, -5L), class = "data.frame")
df1
# Mode Licence
# 1 car 1
# 2 walk 1
# 3 passenger 0
# 4 car 1
# 5 bus 1
I want to make an indicator vector b that is 1 if the person's mode is not car and they have a driver's licence, and 0 otherwise. In the above example I need b to be:
df2 <- structure(list(Mode = c("car", "walk", "passenger", "car", "bus"
), Licence = c(1, 1, 0, 1, 1), b = c(0, 1, 0, 0, 1)), row.names = c(NA,
-5L), class = "data.frame")
df2
# Mode Licence b
# 1 car 1 0
# 2 walk 1 1
# 3 passenger 0 0
# 4 car 1 0
# 5 bus 1 1
Here you go. You could use an ifelse statement for this, as it's easier to understand.
data = data.frame(mode = c("car", "walk", "passenger", "car", "bus"), License = c(1,1,0,1,1))
data$b = ifelse(data$mode !="car" & data$License == 1, 1,0)
Another solution using logical operations and implicit conversion between numeric and logical:
df1$b <- with(df1, Mode!="car" & Licence)*1
Note: 0 is equivalent to FALSE and everything else is equivalent to TRUE, so if the possible values are just 0 and 1, we can shorten Licence == 1 to just Licence. The *1 part at the end converts truth values to 0's and 1's again.
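The coercion described in the note can be made explicit with as.integer(), which some readers may find clearer than multiplying by 1. This is a small sketch; the intermediate name `flag` is illustrative.

```r
df1 <- data.frame(Mode = c("car", "walk", "passenger", "car", "bus"),
                  Licence = c(1, 1, 0, 1, 1))

# In a logical operation, Licence is coerced: 0 becomes FALSE, non-zero becomes TRUE
flag <- df1$Mode != "car" & df1$Licence
df1$b <- as.integer(flag)   # TRUE/FALSE back to 1/0
df1
```

Note that as.integer() yields an integer column, whereas `*1` yields a double; for an indicator variable either usually works.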
Another solution with dplyr:
library(dplyr)
df1 %>% mutate(b = if_else(Mode %in% c('walk', 'bus')&Licence == 1, # condition
true = 1,
false = 0))

Return whole group when filtering with dplyr in R [duplicate]

I would like to return all the observations within a group if at least one of the group's observations meets a filtering criterion.
In the example below, I would like only the groups "shoe" and "ship", with all of their values returned, since both of those groups have at least one price under 50.
I tried using group_by, but it seems to return only the observations where the filter criterion is met and not the whole group.
library(dplyr)
test <- data.frame('prod_id'= c("shoe", "shoe", "shoe", "shoe", "shoe",
"shoe", "boat", "boat","boat","boat","boat","boat", "ship", "ship",
"ship",
"ship", "ship", "ship"),
'seller_id'= c("a", "b", "c", "d", "e", "f", "a","g", "h", "r",
"q", "b", "qe", "dj", "d3", "kk", "dn", "de"),
'Dich'= c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
'price' = c(12, 200, 10, 4, 3, 4, 99, 55, 86, 88, 75, 64, 82,
21, 44, 34, 22, 33)
)
Here is what I tried
test2 <- test%>%
group_by(prod_id) %>%
(filter = price < 50)
You need filter with any
library(dplyr)
test%>%
group_by(prod_id) %>%
filter(any(price < 50))
# prod_id seller_id Dich price
# <fct> <fct> <dbl> <dbl>
# 1 shoe a 1 12
# 2 shoe b 0 200
# 3 shoe c 0 10
# 4 shoe d 0 4
# 5 shoe e 0 3
# 6 shoe f 0 4
# 7 ship qe 0 82
# 8 ship dj 0 21
# 9 ship d3 0 44
#10 ship kk 0 34
#11 ship dn 0 22
#12 ship de 0 33
Or the base R approach using ave
test[with(test, ave(price < 50, prod_id, FUN = any)), ]
For completeness sake, one with data.table
library(data.table)
setDT(test)[, if(any(price < 50)) .SD, prod_id]
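To see what any() contributes, it can help to compute the qualifying groups as a separate step first. The sketch below does this in base R on a reduced copy of the data (only prod_id and price); `keep_ids` and `result` are illustrative names.

```r
test <- data.frame(
  prod_id = rep(c("shoe", "boat", "ship"), each = 6),
  price = c(12, 200, 10, 4, 3, 4,
            99, 55, 86, 88, 75, 64,
            82, 21, 44, 34, 22, 33)
)

# One TRUE/FALSE per group: does any price in the group fall below 50?
group_has_cheap <- tapply(test$price < 50, test$prod_id, any)

# Names of the groups that qualify, then all rows belonging to them
keep_ids <- names(which(group_has_cheap))
result <- test[test$prod_id %in% keep_ids, ]
result
```

The filter condition is evaluated once per group rather than once per row, which is why whole groups are kept or dropped together.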

Select or subset variables whose column sums are not zero

I want to select or subset variables in a data frame whose column sum is not zero but also keeping other factor variables as well. It should be fairly simple but I cannot figure out how to run the select_if() function on a subset of variables using dplyr:
df <- data.frame(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
require(dplyr)
df %>%
select_if(funs(sum(.) > 0))
#Error in Summary.factor(c(1L, 1L, 2L, 3L, 3L, 4L), na.rm = FALSE) :
# ‘sum’ not meaningful for factors
Then I tried to only select B, C, D and this works, but I won't have variable A:
df %>%
select(-A) %>%
select_if(funs(sum(.) > 0)) -> df2
df2
# C D
#1 3 0
#2 0 3
#3 0 2
#4 1 1
#5 1 4
#6 2 5
I could simply do cbind(A = df$A, df2) but since I have a dataset with 3000 rows and 200 columns, I am afraid this could introduce errors (if values sort differently for example).
Trying to subset variables B, C, D in the sum() function doesn't work either:
df %>%
select_if(funs(sum(names(.[2:4])) > 0))
#data frame with 0 columns and 6 rows
Try this:
df %>% select_if(~ !is.numeric(.) || sum(.) != 0)
# A C D
# 1 a 3 0
# 2 a 0 3
# 3 b 0 2
# 4 c 1 1
# 5 c 1 4
# 6 d 2 5
The rationale is that for || if the left-side is TRUE, the right-side won't be evaluated.
Note:
The second argument of select_if should be a function name or a formula (lambda function). The ~ is necessary to tell select_if that !is.numeric(.) || sum(.) != 0 should be converted to a function.
As commented below by @zx8754, is.factor(.) should be used if one only wants to keep factor columns.
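The short-circuit behaviour can be checked directly in base R. The sketch below is illustrative only: it applies the same predicate column by column with vapply() and shows that the element-wise | fails where || does not. The names `keep`, `kept` and `res` are introduced here for the example.

```r
df <- data.frame(
  A = c("a", "a", "b", "c", "c", "d"),
  B = c(0, 0, 0, 0, 0, 0),
  C = c(3, 0, 0, 1, 1, 2),
  D = c(0, 3, 2, 1, 4, 5)
)

# With ||, sum() is never called on the non-numeric column A
keep <- vapply(df, function(col) !is.numeric(col) || sum(col) != 0, logical(1))
kept <- df[keep]
names(kept)

# With |, both sides are always evaluated, so sum() fails on column A
res <- tryCatch(!is.numeric(df$A) | sum(df$A) != 0,
                error = function(e) "error")
res
```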
Edit: a base R solution
cols <- c('B', 'C', 'D')
cols.to.keep <- cols[colSums(df[cols]) != 0]
df[!names(df) %in% cols | names(df) %in% cols.to.keep]
Here is an update for everyone who wants to use dplyr 1.0.0 or later, where the scoped variants (like select_if, as nicely shown by @mt1022) are superseded:
df %>%
select(where(is.numeric)) %>%
select(where(~sum(.) != 0))
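To compress the two select statements into one, use the scalar && rather than the element-wise &. Because && short-circuits, sum() is never called on a non-numeric column:
df %>% select(where(~ is.numeric(.x) && sum(.x) != 0))
Note that both versions keep only the numeric columns, so A is dropped. To retain non-numeric columns as in the original question, use where(~ !is.numeric(.x) || sum(.x) != 0) instead.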
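This is a solution using data.table:
library(data.table)
df <- data.table(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
# Column sums; the non-numeric column A becomes NA (with a coercion warning)
df2 <- df[, lapply(.SD, function(x) sum(as.numeric(x)))]
# Keep columns whose sum is NA (non-numeric) or non-zero
keep <- sapply(df2, function(s) is.na(s) || s != 0)
df[, names(df)[keep], with = FALSE]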
