Generate a column with random 1s and 0s with restrictions in R

I have a data set with 500 observations. I'd like to generate 1s and 0s randomly based on two scenarios.
Current Dataset
Id Age Category
1 23 1
2 24 1
3 21 2
. . .
. . .
. . .
500 27 3
Scenario 1
The total number of 1s should be 200 and they should be random. The remaining 300 should be 0s.
Scenario 2
The total number of 1s should be 200. The remaining 300 should be 0s.
40% of the 1s should be in Category 1; that is, 80 1s should be in Category 1.
40% of the 1s should be in Category 2; that is, 80 1s should be in Category 2.
20% of the 1s should be in Category 3; that is, 40 1s should be in Category 3.
Expected Output
Id Age Category Indicator
1 23 1 1
2 24 1 0
3 21 2 1
. . .
. . .
. . .
500 27 3 1
I know the function sample(c(0,1), 500) will generate 1s, but I don't know how to make it generate exactly 200 1s randomly. I am also not sure how to generate 80 1s randomly in Category 1, 80 in Category 2, and 40 in Category 3.

Here's a full worked example.
Let's say your data looked like this:
set.seed(69)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))
head(df)
#> id Age Category
#> 1 1 21 2
#> 2 2 22 2
#> 3 3 28 3
#> 4 4 27 2
#> 5 5 27 1
#> 6 6 26 2
Now, you didn't mention how many of each category you had, so let's check how many there are in our sample:
table(df$Category)
#> 1 2 3
#> 153 179 168
Scenario 1 is straightforward. You need to create a vector of 500 zeros, then write a one into a random sample of 200 of its indexes:
df$label <- numeric(nrow(df))
df$label[sample(nrow(df), 200)] <- 1
head(df)
#> id Age Category label
#> 1 1 21 2 1
#> 2 2 22 2 1
#> 3 3 28 3 0
#> 4 4 27 2 0
#> 5 5 27 1 0
#> 6 6 26 2 1
So we have random zeros and ones, but when we count them, we have:
table(df$label)
#>
#> 0 1
#> 300 200
Scenario 2 is similar but a bit more involved, because we need to perform a similar operation groupwise by category:
df$label <- numeric(nrow(df))
df <- do.call("rbind", lapply(split(df, df$Category), function(d) {
  # 40% of rows get a 1 in Categories 1 and 2, 20% in Category 3:
  # Category %/% 3 is 0 for Categories 1 and 2 and 1 for Category 3,
  # so the divisor is 1 for the first two groups and 2 for the third
  n_ones <- round(nrow(d) * 0.4 / ((d$Category[1] %/% 3) + 1))
  d$label[sample(nrow(d), n_ones)] <- 1
  d
}))
head(df)
#> id Age Category label
#> 1.5 5 27 1 0
#> 1.10 10 24 1 0
#> 1.13 13 23 1 1
#> 1.19 19 24 1 0
#> 1.26 26 22 1 1
#> 1.27 27 24 1 1
Now, since the counts in each category are not evenly divisible by 5, we cannot get exactly 40% and 20% (though you might with your own data), but we get as close as possible, as the following demonstrates:
label_table <- table(df$Category, df$label)
label_table
#> 0 1
#> 1 92 61
#> 2 107 72
#> 3 134 34
apply(label_table, 1, function(x) x[2]/sum(x))
#> 1 2 3
#> 0.3986928 0.4022346 0.2023810
Created on 2020-08-12 by the reprex package (v0.3.0)

Another way to fill in random values is to create a vector of the possible values (80 values of 1 and nrow - 80 values of 0) and then sample from it. This can use a bit more memory than setting values by indexing, but a vector of potential values this small is generally trivial.
set.seed(42)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))
## In Tidyverse
library(tidyverse)
set.seed(42)
df2 <- df %>%
  group_by(Category) %>%
  mutate(Label = case_when(
    Category == 1 ~ sample(c(rep(1, 80), rep(0, n() - 80)), n()),
    Category == 2 ~ sample(c(rep(1, 80), rep(0, n() - 80)), n()),
    Category == 3 ~ sample(c(rep(1, 40), rep(0, n() - 40)), n())
  ))
table(df2$Category,df2$Label)
# 0 1
# 1 93 80
# 2 82 80
# 3 125 40
## In base
df3 <- df
df3[df$Category == 1, "Label"] <- sample(
  c(rep(1, 80), rep(0, nrow(df[df$Category == 1, ]) - 80)),
  nrow(df[df$Category == 1, ])
)
df3[df$Category == 2, "Label"] <- sample(
  c(rep(1, 80), rep(0, nrow(df[df$Category == 2, ]) - 80)),
  nrow(df[df$Category == 2, ])
)
df3[df$Category == 3, "Label"] <- sample(
  c(rep(1, 40), rep(0, nrow(df[df$Category == 3, ]) - 40)),
  nrow(df[df$Category == 3, ])
)
table(df3$Category,df3$Label)
# 0 1
# 1 93 80
# 2 82 80
# 3 125 40

To solve scenario 1, you'll need to create a vector with 300 zeroes and 200 ones and then sample from it without replacement:
pull_from <- c(rep(0, 300), rep(1, 200))
df$Indicator <- sample(pull_from, replace = FALSE)
For scenario 2, I suggest breaking your data into 3 separate chunks based on category, repeating the above step with different values for the numbers of zeroes and ones you need and then recombining into one dataframe.
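A minimal sketch of that approach, assuming a data frame df with a Category column and the 80/80/40 split from the question:
ones_per_cat <- c("1" = 80, "2" = 80, "3" = 40) # how many 1s each category gets
df <- do.call("rbind", lapply(split(df, df$Category), function(chunk) {
  n_ones <- ones_per_cat[[as.character(chunk$Category[1])]]
  # 1s and 0s in the right proportion, shuffled within the chunk
  chunk$Indicator <- sample(c(rep(1, n_ones), rep(0, nrow(chunk) - n_ones)))
  chunk
}))
Note that split/rbind returns the rows grouped by category; reorder with df[order(df$Id), ] if you need the original row order back.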

Related

Is there code to determine the amount of criteria met by a row in R?

I am trying to figure out a way to add a column that lists the number of criteria met by each row. For example, I am looking at how many risk factors for heart disease someone has and trying to run an ordinal regression on those values. I have tried
cvd_status <- ifelse(data_tot$X5_A_01_d_Heart.Disease == "1" | data_tot$X5_A_01_e_Stroke == "1" | data_tot$X5_A_01_f_Chronic.Kidney.Disease == 1, 1, 0)
but that only tells me whether people have any risk factors, not how many risk factors they have. Is there any way to count how many risk factors someone has?
Edit: The variables are not simply binary, but are either 1s or 2s or ranges of numbers.
If the variables contain only 0 or 1, then the following could be used:
with(data_tot,
     rowSums(cbind(X5_A_01_d_Heart.Disease,
                   X5_A_01_e_Stroke,
                   X5_A_01_f_Chronic.Kidney.Disease))
)
Edit:
And if they are coded as 1 (yes) and 2 (no), plus if other risk factors such as blood pressure and cholesterol level are to be included, AND there are no missing values in these risk factor variables, then you can use something similar to the following:
library(dplyr)
data_tot %>%
  mutate(CVD_Risk.Factors =
           (Heart == 1) +
           (Stroke == 1) +
           (CKD == 1) +
           (Systolic_BP >= 130) + (Diastolic_BP >= 80) +
           (Cholesterol > 150))
Heart Stroke CKD Systolic_BP Diastolic_BP Cholesterol CVD_Risk.Factors
1 1 1 2 118 90 200 4
2 2 1 2 125 65 150 1
3 2 1 1 133 95 190 5
4 1 1 2 120 87 250 4
5 2 2 2 155 110 NA NA
6 2 2 2 130 105 140 2
You can see that if there are any missing values, then this would not work. One solution is to use rowwise and then sum.
data_tot %>%
  rowwise() %>% # This tells R to apply a function by the rows of the selected inputs
  mutate(CVD_Risk.Factors = sum( # This function has an "na.rm" argument
    (Heart == 1),
    (Stroke == 1),
    (CKD == 1),
    (Systolic_BP >= 130), (Diastolic_BP >= 80),
    (Cholesterol > 150), na.rm = TRUE)) # Omit NA in the summations
# A tibble: 6 x 7
Heart Stroke CKD Systolic_BP Diastolic_BP Cholesterol CVD_Risk.Factors
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1 1 2 118 90 200 4
2 2 1 2 125 65 150 1
3 2 1 1 133 95 190 5
4 1 1 2 120 87 250 4
5 2 2 2 155 110 NA 2 # not NA
6 2 2 2 130 105 140 2
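As an aside, rowSums() also takes an na.rm argument, so a vectorized sketch that avoids rowwise() (assuming the same columns and 1/2 coding as above) could be:
data_tot %>%
  mutate(CVD_Risk.Factors = rowSums(cbind((Heart == 1),
                                          (Stroke == 1),
                                          (CKD == 1),
                                          (Systolic_BP >= 130),
                                          (Diastolic_BP >= 80),
                                          (Cholesterol > 150)),
                                    na.rm = TRUE)) # NA comparisons count as not met
This operates on the whole logical matrix at once, which is typically much faster than rowwise() on large data.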
Data:
data_tot <- data.frame(Heart = c(1, 2, 2, 1, 2, 2),
                       Stroke = c(1, 1, 1, 1, 2, 2),
                       CKD = c(2, 2, 1, 2, 2, 2),
                       Systolic_BP = c(118, 125, 133, 120, 155, 130),
                       Diastolic_BP = c(90, 65, 95, 87, 110, 105),
                       Cholesterol = c(200, 150, 190, 250, NA, 140))

R - replace all values smaller than a specific value in a column with the nearest bigger value

I have a data frame like this one:
df <- data.frame(c(1,2,3,4,5,6,7), c(0,23,55,0,1,40,21))
names(df) <- c("a", "b")
a b
1 0
2 23
3 55
4 0
5 1
6 40
7 21
Now I want to replace all values smaller than 22 in column b with the nearest bigger value. Of course it is possible to use loops, but since I have quite big datasets this is way too slow.
The solution should look somewhat like this:
a b
1 23
2 23
3 55
4 55
5 40
6 40
7 40
Here is a tidyverse possibility (but note #phiver's comment on replacement ambiguities)
library(tidyverse)
df %>%
  mutate(b = ifelse(b < 22, NA, b)) %>%
  fill(b) %>%
  fill(b, .direction = "up")
# a b
#1 1 23
#2 2 23
#3 3 55
#4 4 55
#5 5 55
#6 6 40
#7 7 40
Explanation: Replace values b < 22 with NA and then use fill to fill NAs with previous/following non-NA entries.
Sample data
df <- data.frame(a = c(1,2,3,4,5,6,7), b = c(0,23,55,0,1,40,21))
You can use zoo::rollapply :
library(zoo)
df$b <- rollapply(df$b, 3, function(x)
  if (x[2] < 22) min(x[x > 22]) else x[2],
  partial = TRUE)
# df
# a b
# 1 1 23
# 2 2 23
# 3 3 55
# 4 4 55
# 5 5 40
# 6 6 40
# 7 7 40
In base R you could do this for the same output:
transform(df, b = sapply(seq_along(b), function(i)
  if (b[i] < 22) {
    bi <- c(b, Inf)[seq(i - 1, i + 1)]
    min(bi[bi >= 22])
  } else b[i]))
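The same replace-then-fill idea from the tidyverse answer can also be sketched with zoo's na.locf (last observation carried forward), with the same caveat about which neighbor "wins":
library(zoo)
b <- ifelse(df$b < 22, NA, df$b)    # blank out values below the threshold
b <- na.locf(b, na.rm = FALSE)      # fill each NA from the previous value
df$b <- na.locf(b, fromLast = TRUE) # fill any leading NAs from the next value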

randomly select rows based on limited random numbers

Seems simple but I can't figure it out.
I have a bunch of animal location data (217 individuals) as a single dataframe. I'm trying to randomly select X locations per individual for further analysis with the caveat that X is within the range of 6-156.
So I'm trying to set up a loop that first randomly selects a value within the range of 6-156 then use that value (say 56) to randomly extract 56 locations from the first individual animal and so on.
for(i in unique(ANIMALS$ID)){
  sub <- sample(6:156, 1)
  sub2 <- i([sample(nrow(i), sub),])
}
This approach didn't seem to work so I tried tweaking it...
for(i in unique(ANIMALS$ID)){
  sub <- sample(6:156, 1)
  rand <- i[sample(1:nrow(i), sub, replace = FALSE),]
}
This did not work either. Any suggestions or previous postings would be helpful!
Head of the datafile...ANIMALS is the name of the df, ID indicates unique individuals
> FID X Y MONTH DAY YEAR HOUR MINUTE SECOND ELKYR SOURCE ID animalid
1 0 510313 4813290 9 5 2008 22 30 0 342008 FG 1 1
2 1 510382 4813296 9 6 2008 1 30 0 342008 FG 1 1
3 2 510385 4813311 9 6 2008 2 0 0 342008 FG 1 1
4 3 510385 4813394 9 6 2008 3 30 0 342008 FG 1 1
5 4 510386 4813292 9 6 2008 2 30 0 342008 FG 1 1
6 5 510386 4813431 9 6 2008 4 1 0 342008 FG 1 1
Here's one way using mapply. This function takes two lists (or something that can be coerced into a list) and applies function FUN to corresponding elements.
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
# calculate number of samples for individual animal
num.samples.per.animal <- sample(3:6, length(unique(xy$animal)), replace = TRUE)
num.samples.per.animal
[1] 6 3 4 4 6 3 3 6 3 5
# subset random x number of rows from each animal
result <- do.call("rbind",
  mapply(num.samples.per.animal, split(xy, f = xy$animal), FUN = function(x, y) {
    y[sample(1:nrow(y), x),]
  }, SIMPLIFY = FALSE)
)
result
animal loc
7 1 0.99483999
1 1 0.50951321
10 1 0.36505294
6 1 0.34058842
8 1 0.26489107
9 1 0.47418823
13 2 0.27213396
12 2 0.28087775
15 2 0.22130069
23 3 0.33646632
21 3 0.02395097
28 3 0.53079981
29 3 0.85287600
35 4 0.84534073
33 4 0.87370167
31 4 0.85646813
34 4 0.11642335
46 5 0.59624723
48 5 0.15379729
45 5 0.57046122
42 5 0.88799675
44 5 0.62171858
49 5 0.75014593
60 6 0.86915983
54 6 0.03152932
56 6 0.66128549
64 7 0.85420774
70 7 0.89262455
68 7 0.40829671
78 8 0.19073661
72 8 0.20648832
80 8 0.71778913
73 8 0.77883677
75 8 0.37647108
74 8 0.65339300
82 9 0.39957202
85 9 0.31188471
88 9 0.10900795
100 10 0.55282999
95 10 0.10145296
96 10 0.09713218
93 10 0.64900866
94 10 0.76099256
EDIT
Here is another (more straightforward) approach that also handles cases where the number of rows is smaller than the number of samples to be drawn.
set.seed(357)
result <- do.call("rbind",
  by(xy, INDICES = xy$animal, FUN = function(x) {
    avail.obs <- nrow(x)
    num.rows <- sample(3:15, 1)
    while (num.rows > avail.obs) {
      message("Sample would be larger than available data points, repeating sampling.")
      num.rows <- sample(3:15, 1)
    }
    x[sample(1:avail.obs, num.rows), ]
  }))
result
I like Stack Overflow because I learn so much. @RomanLustrik provided a simple solution; mine is straightforward as well:
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
newVec <- NULL # Start with an empty object to rbind onto
for(i in unique(xy$animal)){
  # Sample a number between 1 and 10 (or 6 and 156, if you need)
  samp <- sample(1:10, 1)
  # Determine which rows of dataFrame xy correspond with animal i
  rows <- which(xy$animal == i)
  # From xy, sample samp of the rows associated with animal i
  # (replace = TRUE allows duplicate rows; use FALSE to keep locations distinct)
  newVec1 <- xy[sample(rows, samp, replace = TRUE), ]
  # Append everything to the same new dataFrame
  newVec <- rbind(newVec, newVec1)
}
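For reference, a compact dplyr sketch of the same per-animal sampling (assuming dplyr >= 1.0 for slice_sample(); capping the draw at the group size avoids errors when an animal has fewer rows than the draw, and set.seed(1) is only for reproducibility):
library(dplyr)
set.seed(1)
result <- xy %>%
  group_by(animal) %>%
  group_modify(~ slice_sample(.x, n = min(sample(6:156, 1), nrow(.x)))) %>%
  ungroup()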

remove rows based on subtraction results

I have a large data set like this:
df <- data.frame(group = c(rep(1, 6), rep(5, 6)), score = c(30, 10, 22, 44, 6, 5, 20, 35, 2, 60, 14, 5))
group score
1 1 30
2 1 10
3 1 22
4 1 44
5 1 6
6 1 5
7 5 20
8 5 35
9 5 2
10 5 60
11 5 14
12 5 5
...
I want to take the difference between each pair of neighboring scores within each group and, if the difference is greater than 30, remove the smaller score. For example, within group 1: 30 - 10 = 20 < 30; 10 - 22 = -12 < 30; 22 - 44 = -22 < 30; 44 - 6 = 38 > 30, so remove 6; then 44 - 5 = 39 > 30, so remove 5... The expected output should look like this:
group score
1 1 30
2 1 10
3 1 22
4 1 44
5 5 20
6 5 35
7 5 60
...
Does anyone have an idea how to achieve this?
Like this?
repeat {
  df$diff <- unlist(by(df$score, df$group, function(x) c(0, -diff(x))))
  if (all(df$diff < 30)) break
  df <- df[df$diff < 30, ]
}
df$diff <- NULL
df
# group score
# 1 1 30
# 2 1 10
# 3 1 22
# 4 1 44
# 7 5 20
# 8 5 35
# 10 5 60
This seems to require an iterative approach, because the "neighboring score" changes after removal of a row. So before you remove 6, the difference 44 - 6 > 30, but 6 - 5 < 30. After you remove 6, the difference 44 - 5 > 30.
So this calculates difference between successive rows by group (using by(...) and diff(...)), and removes the appropriate rows, then repeats the process until all differences are < 30.
It's not elegant but it should work:
out = data.frame(group = numeric(), score = numeric())
# cycle through the groups
for(g in levels(as.factor(df$group))){
  temp = subset(df, df$group == g)
  # now go through the scores
  left = temp$score[1]
  for(s in seq(2, length(temp$score))){
    if(left - temp$score[s] > 30){ # test the condition
      temp$score[s] = NA
    }else{
      left = temp$score[s] # if the condition is not met, carry this score forward
    }
  }
  # Add only the rows without NAs to the output
  out = rbind(out, temp[which(!is.na(temp$score)), ])
}
There should be a way to do this using ave, but carrying the last value forward when removing the next one (if the diff > 30) is tricky! I'd appreciate a more elegant solution if there is one.
You can try
df
## group score
## 1 1 30
## 2 1 10
## 3 1 22
## 4 1 44
## 5 1 6
## 6 1 5
## 7 5 20
## 8 5 35
## 9 5 2
## 10 5 60
## 11 5 14
## 12 5 5
tmp <- df[!unlist(tapply(df$score, df$group, FUN = function(x) c(FALSE, -diff(x) > 30), simplify = TRUE)), ]
while (!identical(df, tmp)) {
  df <- tmp
  tmp <- df[!unlist(tapply(df$score, df$group, FUN = function(x) c(FALSE, -diff(x) > 30), simplify = TRUE)), ]
}
tmp
tmp
## group score
## 1 1 30
## 2 1 10
## 3 1 22
## 4 1 44
## 7 5 20
## 8 5 35
## 10 5 60
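Since the comparison value is always the last score that survived, a single pass per group is also possible. Here is a sketch with dplyr and purrr (assuming both packages): accumulate() carries the last kept score down each group, and a row survives exactly when the carried value equals its own score (a removed row can never match, because removal requires a gap of more than 30):
library(dplyr)
library(purrr)
df %>%
  group_by(group) %>%
  # .x is the carried (last kept) score, .y is the next score in the group
  filter(score == accumulate(score, ~ if (.x - .y > 30) .x else .y)) %>%
  ungroup()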

Summing up specific entries in subset group (R programming)

So basically I have this format of data:
ID Value
1 32
5 231
2 122
1 11
3 ...
2 ...
5 ...
6 ...
2 ...
1 33
. ...
. ...
. ...
I want to sum up the values with ID '1', but in groups of 5.
i.e.
In the first 5 entries, there are 2 entries with ID '1', so I get a sum of 43,
and then in the next 5 entries, only one entry has ID '1', so I get 33,
and so on...
So at the end I want to get an array with all the sums, i.e. (43, 33, ...)
I can do it with a for loop and tapply, but I think there must be a better way in R that doesn't need a for loop.
Any help is much appreciated! Thank you very much!
Make a new column to reflect the groups of 5:
df = data.frame(
  id = sample(1:5, size = 98, replace = TRUE),
  value = sample(1:98)
)
# This gets you a vector of 1,1,1,1,1, 2,2,2,2,2, 3, ...
groups = rep(1:(ceiling(nrow(df) / 5)), each = 5)
# But it might be longer than the dataframe, so:
df$group = groups[1:nrow(df)]
Then it's pretty easy to get the sums within each group:
library(plyr)
sums = ddply(
  df,
  .(group, id),
  function(df_part) {
    sum(df_part$value)
  }
)
Example output:
> head(df)
id value group
1 4 94 1
2 4 91 1
3 3 22 1
4 5 42 1
5 1 46 1
6 2 38 2
> head(sums)
group id V1
1 1 1 46
2 1 3 22
3 1 4 185
4 1 5 42
5 2 2 55
6 2 3 158
Something like this will do the job:
m <- matrix(d$Value, nrow = 5)
# Zero out entries whose ID is not 1
m[which(d$ID != 1)] <- 0
# Fix for data whose length is not a multiple of 5:
# zero out the recycled entries at the end of the matrix
if ((length(d$Value) %% 5) != 0)
  m[(length(d$Value) + 1):length(m)] <- 0
# The columns contain the groups of 5
colSums(m)
If you add a column to delineate groups, ddply() can work magic:
ID <- c(1, 5, 2, 1, 3, 2, 5, 6, 2, 1)
Value <- c(32, 231, 122, 11, 45, 34, 74, 12, 32, 33)
Group <- rep(seq(100), each=5)[1:length(ID)]
test.data <- data.frame(ID, Value, Group)
library(plyr)
output <- ddply(test.data, .(Group, ID), function(chunk) sum(chunk$Value))
> head(test.data)
ID Value Group
1 1 32 1
2 5 231 1
3 2 122 1
4 1 11 1
5 3 45 1
6 2 34 2
> head(output)
  Group ID  V1
1     1  1  43
2     1  2 122
3     1  3  45
4     1  5 231
5     2  1  33
6     2  2  66
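If only the ID 1 sums are needed as a plain vector, a minimal base R sketch using the same Group column (note that a group of 5 containing no ID 1 entries would simply be absent from the result):
keep <- test.data$ID == 1
tapply(test.data$Value[keep], test.data$Group[keep], sum)
#  1  2
# 43 33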
