Is there code to determine the number of criteria met by a row in R?

I am trying to figure out a way to assign a column that would list the number of criteria met by a given row. For example, I am looking at how many risk factors for heart disease someone has met and trying to run an ordinal regression on those values. I have tried
cvd_status <- ifelse(data_tot$X5_A_01_d_Heart.Disease == "1" |
                       data_tot$X5_A_01_e_Stroke == "1" |
                       data_tot$X5_A_01_f_Chronic.Kidney.Disease == 1,
                     1, 0)
but that only gives me whether people have any risk factors, not how many risk factors they have. Is there any way to figure out how many risk factors someone would have?
Edit: The variables are not simply binary; they are coded either as 1s and 2s or as ranges of numbers.

If the variables contain only 0 or 1, then the following could be used:
with(data_tot,
     rowSums(cbind(X5_A_01_d_Heart.Disease,
                   X5_A_01_e_Stroke,
                   X5_A_01_f_Chronic.Kidney.Disease))
)
Edit:
And if they are coded as 1 (yes) and 2 (no), if other risk factors such as blood pressure and cholesterol level are to be included, and if there are no missing values in these risk factor variables, then you can use something similar to the following:
library(dplyr)
data_tot %>%
  mutate(CVD_Risk.Factors =
           (Heart == 1) +
           (Stroke == 1) +
           (CKD == 1) +
           (Systolic_BP >= 130) + (Diastolic_BP >= 80) +
           (Cholesterol > 150))
Heart Stroke CKD Systolic_BP Diastolic_BP Cholesterol CVD_Risk.Factors
1 1 1 2 118 90 200 4
2 2 1 2 125 65 150 1
3 2 1 1 133 95 190 5
4 1 1 2 120 87 250 4
5 2 2 2 155 110 NA NA
6 2 2 2 130 105 140 2
You can see that if there are any missing values, then this would not work. One solution is to use rowwise and then sum.
data_tot %>%
  rowwise() %>% # This tells R to apply a function by the rows of the selected inputs
  mutate(CVD_Risk.Factors = sum( # sum() has an "na.rm" argument
    (Heart == 1),
    (Stroke == 1),
    (CKD == 1),
    (Systolic_BP >= 130), (Diastolic_BP >= 80),
    (Cholesterol > 150), na.rm = TRUE)) # Omit NA in the summations
# A tibble: 6 x 7
Heart Stroke CKD Systolic_BP Diastolic_BP Cholesterol CVD_Risk.Factors
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1 1 2 118 90 200 4
2 2 1 2 125 65 150 1
3 2 1 1 133 95 190 5
4 1 1 2 120 87 250 4
5 2 2 2 155 110 NA 2 # not NA
6 2 2 2 130 105 140 2
Data:
data_tot <- data.frame(Heart = c(1, 2, 2, 1, 2, 2),
                       Stroke = c(1, 1, 1, 1, 2, 2),
                       CKD = c(2, 2, 1, 2, 2, 2),
                       Systolic_BP = c(118, 125, 133, 120, 155, 130),
                       Diastolic_BP = c(90, 65, 95, 87, 110, 105),
                       Cholesterol = c(200, 150, 190, 250, NA, 140))
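As an aside, a fully vectorised alternative (a sketch reusing the example columns above) avoids rowwise() entirely, since rowSums() also takes an na.rm argument:
data_tot$CVD_Risk.Factors <- rowSums(
  cbind(data_tot$Heart == 1,
        data_tot$Stroke == 1,
        data_tot$CKD == 1,
        data_tot$Systolic_BP >= 130,
        data_tot$Diastolic_BP >= 80,
        data_tot$Cholesterol > 150),
  na.rm = TRUE) # NAs are dropped per row, matching the rowwise() output above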

Related

Most efficient way of determining which ID does not have a pair?

Say that I have a dataframe that looks like the one below. In the dataframe we have the following pairs of IDs: (4330, 4331), (2333, 2334), (3336, 3337), which are +/- 1 of each other. However, 3349 does not have a pair. What would be the most efficient way of filtering out unpaired IDs?
ID sex zyg race SES
1 4330 2 2 2 1
2 4331 2 2 2 1
3 2333 2 2 1 78
4 2334 2 2 1 78
5 3336 2 2 1 18
6 3337 2 2 1 18
7 3349 2 2 1 18
This will return only pairs/twins (no unpaired IDs, triplets, quadruplets, etc.). In base R:
df <- data.frame(ID = c(1:3, 4330, 4331, 2333, 2334, 3336, 3337, 3349), sex = 2)
df <- df[order(df$ID), ]
df[
  rep(
    with(
      rle(diff(df$ID)),
      cumsum(lengths)[lengths == 1L & values == 1]
    ), each = 2
  ) + 0:1,
]
#> ID sex
#> 6 2333 2
#> 7 2334 2
#> 8 3336 2
#> 9 3337 2
#> 4 4330 2
#> 5 4331 2
Explanation:
After sorting the data, only individuals in a group (a twin, triplet, etc.) will have an ID difference of 1 from the individual in the next row. diff(df$ID) returns the difference in ID value from one row to the next along the whole data.frame. To identify twins, we want to find where diff(df$ID) has a 1 that is by itself (i.e., neither the previous value nor the next value is also 1). We use rle to find those lone 1s:
rle(diff(df$ID))
#> Run Length Encoding
#> lengths: int [1:8] 2 1 1 1 1 1 1 1
#> values : num [1:8] 1 2330 1 1002 1 12 981 1
Lone 1s occur when both the value of diff(df$ID) (values) and the length of runs of the same value (lengths) are both 1. This occurs with the third, fifth, and eighth run. The starting rows (within df) of all runs are given by cumsum(lengths), so we subset them at 3, 5, and 8 to get the starting index of each twin pair in df. We repeat each of those indices twice with rep(..., each = 2) then add 0:1 (taking advantage of recycling in R) to get the indices of any individual who is a twin.
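To make that arithmetic concrete, here is the same computation spelled out on the run-length output above (a sketch):
run_ends <- cumsum(c(2, 1, 1, 1, 1, 1, 1, 1)) # cumsum(lengths): 2 3 4 5 6 7 8 9
first_twin <- run_ends[c(3, 5, 8)]            # lone-1 runs sit at diff positions 4, 6, 9
rep(first_twin, each = 2) + 0:1               # rows 4 5, 6 7, 9 10 of the sorted df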
Using dplyr::lag() and lead(), you can filter() to rows where the previous ID is ID - 1 or the next ID is ID + 1:
library(dplyr)
df %>%
filter(lag(ID) == ID - 1 | lead(ID) == ID + 1)
# A tibble: 6 × 5
ID sex zyg race SES
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4330 2 2 2 1
2 4331 2 2 2 1
3 2333 2 2 1 78
4 2334 2 2 1 78
5 3336 2 2 1 18
6 3337 2 2 1 18
Edit: this will not filter out "triplets", "quadruplets", etc., contrary to the additional requirement mentioned in the comments; one workaround is sketched below.
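One way to handle triplets and larger groups as well (a sketch, not part of the original answer): group consecutive IDs into runs and keep only runs of exactly two.
library(dplyr)
df %>%
  arrange(ID) %>%
  mutate(run = cumsum(c(1, diff(ID) != 1))) %>% # start a new run whenever the gap is not 1
  group_by(run) %>%
  filter(n() == 2) %>% # keep true pairs only, dropping singletons, triplets, etc.
  ungroup() %>%
  select(-run)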

R: generate a column with random 1s and 0s with restrictions

I have a data set with 500 observations. I'd like to generate 1s and 0s randomly based on two scenarios.
Current Dataset
Id Age Category
1 23 1
2 24 1
3 21 2
. . .
. . .
. . .
500 27 3
Scenario 1
The total number of 1s should be 200 and they should be random. The remaining 300 should be 0s.
Scenario 2
The total number of 1s should be 200. The remaining 300 should be 0s.
40% of the 1s should be in Category 1. That is, 80 1s should be in Category 1.
40% of the 1s should be in Category 2. That is, 80 1s should be in Category 2.
20% of the 1s should be in Category 3. That is, 40 1s should be in Category 3.
Expected Output
Id Age Category Indicator
1 23 1 1
2 24 1 0
3 21 2 1
. . .
. . .
. . .
500 27 3 1
I know the function sample(c(0,1), 500) will generate 1s, but I don't know how to make this generate exactly 200 1s randomly. I am also not sure how to generate 80 1s randomly in Category 1, 80 1s in Category 2 and 40 1s in Category 3.
Here's a full worked example.
Let's say your data looked like this:
set.seed(69)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))
head(df)
#> id Age Category
#> 1 1 21 2
#> 2 2 22 2
#> 3 3 28 3
#> 4 4 27 2
#> 5 5 27 1
#> 6 6 26 2
Now, you didn't mention how many of each category you had, so let's check how many there are in our sample:
table(df$Category)
#> 1 2 3
#> 153 179 168
Scenario 1 is straightforward. You need to create a vector of 500 zeros, then write a 1 into a random sample of 200 of that vector's indices:
df$label <- numeric(nrow(df))
df$label[sample(nrow(df), 200)] <- 1
head(df)
#> id Age Category label
#> 1 1 21 2 1
#> 2 2 22 2 1
#> 3 3 28 3 0
#> 4 4 27 2 0
#> 5 5 27 1 0
#> 6 6 26 2 1
So we have random zeros and ones, and when we count them, we have:
table(df$label)
#>
#> 0 1
#> 300 200
Scenario 2 is similar but a bit more involved, because we need to perform a similar operation groupwise by category:
df$label <- numeric(nrow(df))
df <- do.call("rbind", lapply(split(df, df$Category), function(d) {
n_ones <- round(nrow(d) * 0.4 / ((d$Category[1] %/% 3) + 1))
d$label[sample(nrow(d), n_ones)] <- 1
d
}))
head(df)
#> id Age Category label
#> 1.5 5 27 1 0
#> 1.10 10 24 1 0
#> 1.13 13 23 1 1
#> 1.19 19 24 1 0
#> 1.26 26 22 1 1
#> 1.27 27 24 1 1
Now, since the number in each category is not nicely divisible by 10, we cannot get exactly 40% and 20% (though you might with your own data), but we get as close as possible to it, as the following demonstrates:
label_table <- table(df$Category, df$label)
label_table
#> 0 1
#> 1 92 61
#> 2 107 72
#> 3 134 34
apply(label_table, 1, function(x) x[2]/sum(x))
#> 1 2 3
#> 0.3986928 0.4022346 0.2023810
Created on 2020-08-12 by the reprex package (v0.3.0)
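As an aside, prop.table() computes those per-category proportions directly (a sketch; it returns both the 0 and 1 columns):
prop.table(label_table, margin = 1)
#>           0         1
#> 1 0.6013072 0.3986928
#> 2 0.5977654 0.4022346
#> 3 0.7976190 0.2023810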
Another way to fill random values is to create a vector of possible values (80 values of 1, and nrow-80 values of 0) and then sample from those possible values. This can use a bit more memory than setting values by indexing, but a vector of potential values is so small that it is generally trivial.
set.seed(42)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))
## In tidyverse
library(tidyverse)
set.seed(42)
df2 <- df %>%
  group_by(Category) %>%
  mutate(Label = case_when(
    Category == 1 ~ sample(c(rep(1, 80), rep(0, n() - 80)), n()),
    Category == 2 ~ sample(c(rep(1, 80), rep(0, n() - 80)), n()),
    Category == 3 ~ sample(c(rep(1, 40), rep(0, n() - 40)), n())
  ))
table(df2$Category,df2$Label)
# 0 1
# 1 93 80
# 2 82 80
# 3 125 40
## In base R
df3 <- df
df3[df$Category == 1, "Label"] <- sample(
  c(rep(1, 80), rep(0, nrow(df[df$Category == 1, ]) - 80)),
  nrow(df[df$Category == 1, ])
)
df3[df$Category == 2, "Label"] <- sample(
  c(rep(1, 80), rep(0, nrow(df[df$Category == 2, ]) - 80)),
  nrow(df[df$Category == 2, ])
)
df3[df$Category == 3, "Label"] <- sample(
  c(rep(1, 40), rep(0, nrow(df[df$Category == 3, ]) - 40)),
  nrow(df[df$Category == 3, ])
)
table(df3$Category,df3$Label)
# 0 1
# 1 93 80
# 2 82 80
# 3 125 40
To solve scenario 1, you'll need to create a vector with 300 zeroes and 200 ones and then sample from that without replacement.
pull_from = c(rep(0,300), rep(1,200))
sample(pull_from, replace = FALSE)
For scenario 2, I suggest breaking your data into 3 separate chunks based on category, repeating the above step with different values for the numbers of zeroes and ones you need and then recombining into one dataframe.
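A sketch of that split-and-recombine idea, assuming the df built above (Category values 1-3, each category with at least as many rows as 1s required):
chunks <- split(df, df$Category)
ones <- c("1" = 80, "2" = 80, "3" = 40) # 1s required per category
df$Indicator <- unsplit(
  lapply(names(chunks), function(k) {
    n <- nrow(chunks[[k]])
    sample(c(rep(1, ones[k]), rep(0, n - ones[k]))) # permute without replacement
  }),
  df$Category)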

Subsetting data.frame to return first 200 rows for specific condition in r

I have a data.frame with 3.3 million rows and 9 columns. Below is an example with the 3 relevant columns.
StimulusName Subject PupilMeans
1 1 101 3.270000
2 1 101 3.145000
3 1 101 3.265000
4 2 101 3.015000
5 2 101 3.100000
6 2 101 3.051250
7 1 102 3.035000
8 1 102 3.075000
9 1 102 3.050000
10 2 102 3.056667
11 2 102 3.059167
12 2 102 3.060000
13 1 103 3.085000
14 1 103 3.125000
15 1 103 3.115000
I want to subset data based on stimulus name and subject and then take either the first few or the last few rows for that subset. So, for example, returning rows 10 and 11 by getting the first 2 rows where df$StimulusName == 2 & df$Subject == 102.
The actual data frame contains thousands of observations per Stimulus and Subject. I want to use it to plot the first and last 200 observations of the stimulus separately.
Have not tested this out, but should work.
First 200
df_filtered <- subset(df, StimulusName == 2 & Subject == 102)
df_filtered <- df_filtered[1:200,]
Then plot df_filtered.
Last 200
df_filtered <- subset(df, StimulusName == 2 & Subject == 102)
df_filtered <- df_filtered[(nrow(df_filtered)-199):nrow(df_filtered),]
Then plot df_filtered.
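Equivalently, head() and tail() avoid the index arithmetic and behave sensibly when a subset has fewer than 200 rows (a sketch on the same subset):
df_filtered <- subset(df, StimulusName == 2 & Subject == 102)
first200 <- head(df_filtered, 200) # first 200 rows of the subset
last200 <- tail(df_filtered, 200)  # last 200 rows of the subset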
Perhaps you want something like this:
subCond <- function(x, r, c) {
  # Keep rows where the first column equals r and the second column equals c
  m <- x[x[, 1] == r & x[, 2] == c, ]
  return(m)
}
Yields e.g.:
> subCond(df, 1, 102)
StimulusName Subject PupilMeans
7 1 102 3.035
8 1 102 3.075
9 1 102 3.050
or
> subCond(df, 2, 101)
StimulusName Subject PupilMeans
4 2 101 3.01500
5 2 101 3.10000
6 2 101 3.05125

Subset specific row and last row from data frame

I have a data frame which contains data relating to the scores of different events. There can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
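For example, to restore the original row order (a sketch; the row names carry the original row numbers):
Data2 <- Data2[order(as.numeric(rownames(Data2))), ]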
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
                  FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
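On dplyr 1.0 or later, slice_max() is the modern replacement for top_n() in the first step (a sketch):
lastrows <- Data %>% group_by(ID) %>% slice_max(Time, n = 1)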

How to use an apply function once for each unique factor value

I'm trying out some commands on the built-in R dataset ChickWeight. The data looks as follows.
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
7 106 12 1 1
8 125 14 1 1
9 149 16 1 1
10 171 18 1 1
11 199 20 1 1
12 205 21 1 1
13 40 0 2 1
14 49 2 2 1
15 58 4 2 1
Now what I would like to do is simply output, for each chick, the difference between the weight at Time 0 and at Time 21 (the last time value), i.e. the weight the chick has put on.
I've been trying tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1]). But this of course applies the value to all rows.
How do I make it so that it applies only once for each unique Chick-value?
If we need a single value for each group (assuming that 'Chick' and 'Diet' are the factor columns)
library(data.table)
setDT(df1)[, list(Diff = abs(weight[Time == 21] - weight[Time == 0])), .(Chick, Diet)]
And if we need to create a column:
setDT(df1)[, Diff := abs(weight[Time == 21] - weight[Time == 0]), .(Chick, Diet)]
I noticed that in the example Time = 21 is not found for Chick no. 2; maybe in that case we need just the one value:
setDT(df1)[, {tmp <- Time %in% c(0, 21)
              list(Diff = if (sum(tmp) > 1) abs(diff(weight[tmp])) else weight[tmp])},
           by = .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 40
If we are taking the difference of 'weight' based on the max and min 'Time' for each group
setDT(df1)[, list(Diff = weight[which.max(Time)] -
                    weight[which.min(Time)]), .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 18
Also, if the 'Time' is ordered
setDT(df1)[, list(Diff = abs(diff(weight[c(1L, .N)]))), by = .(Chick, Diet)]
Using by from base R
by(df1[1:2], df1[3:4], FUN = function(x) with(x,
   abs(weight[which.max(Time)] - weight[which.min(Time)])))
#Chick: 1
#Diet: 1
#[1] 163
#------------------------------------------------------------
#Chick: 2
#Diet: 1
#[1] 18
Here's a solution using dplyr:
library(dplyr)
ChickWeight %>%
  group_by(Chick = as.numeric(as.character(Chick))) %>%
  summarise(weight_gain = last(weight) - first(weight), final_time = last(Time))
(first() and last() as suggested by @ulfelder.)
Note that ChickWeight$Chick is an ordered factor so without coercing it into numeric the final order looks odd.
Using base R:
ChickWeight$Chick <- as.numeric(as.character(ChickWeight$Chick))
tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1])
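Note that this relies on the rows being ordered by Time within each Chick, as they are in ChickWeight; a hedged variant that sorts first:
cw <- ChickWeight[order(ChickWeight$Chick, ChickWeight$Time), ]
tapply(cw$weight, cw$Chick, function(x) x[length(x)] - x[1])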
