Count Number of Values Based on Subgroups and 2 Columns - r

I was wondering if someone could help me produce this data in R. It is rather complicated, and I am not sure how to start. I apologize in advance if my question is not clear. I am trying to create a unique dataset. Essentially, I am trying to divide my data into four groups and count how many times an individual receives a certain value(s) within a group based on a certain column’s value.
I am looking at roll call data among legislators and how they voted. Specifically, I have panel data with four variables: id is the individual legislator’s identification number; the struggle variable is whether a member had trouble voting (dichotomous); vote indicates how the member voted (it can take on any value between 0 and 9 and it is a categorical variable); and rollcall is the roll call number or an id for each roll call.
First, I would like the data separated into two groups. This separation would be based on whether member 999 (id) took any value between 1 and 6 in the vote column. If he did, I would like those roll call votes (and the members) placed in one category. All the remaining roll call votes (where his vote does not equal 1 through 6) would go, along with their members, into a separate group.
Second, I would like to separate both groups that were created from the above step (did member 999 take any value that equals 1-6 on the vote variable or not) by whether an individual legislator struggled to vote (struggle) or they did not struggle to vote. Thus, I would have four groups total.
Third, based on the vote variable, I would like to add up the total number of times an individual legislator received either the value 7, 8, or 9 (in each of the four groups). Thus, I would have four new variables and values for each member.
Here is the code to produce an example of the data:
id=c(999,1,2, 999,1,2,999,1,2,999,1,2)
Struggle=c("NO", "YES", "NO", "NO", "NO", "YES", "NO", "NO", "YES", "YES", "YES", "YES")
Vote=c(1,9,1,9,0,1,2,9,9,9,9,1)
Rollcall=c(1,1,1,2,2,2,3,3,3,4,4,4)
data = data.frame(id, Struggle, Vote, Rollcall)
I would like the result to contain four new columns (A, B, C, D), one for each group, defined as follows:
A indicates the group in which member 999 received a value between 1-6 on the vote variable AND the legislator (id) struggled.
B indicates the group in which member 999 received a value between 1-6 on the vote variable AND the legislator (id) did not struggle.
C indicates the group in which member 999 did not receive a value between 1-6 on the vote variable AND the legislator (id) struggled.
D indicates the group in which member 999 did not receive a value between 1-6 on the vote variable AND the legislator (id) did not struggle.
The value in each group's column indicates the number of times a legislator received either a 7, 8, or 9 within that group (A, B, C, or D).
Does anyone have any advice or potential code to produce this data? I appreciate any assistance someone could provide. Again, I apologize for this complicated question and any lack of clarity.

Interesting question! From what I understand, every group A, B, C, or D in your output is defined by two conditions: whether member 999's Vote on that roll call is in 1:6, and whether the legislator's Struggle is YES or NO.
For any given roll call, the first condition is the same for every member. So we first evaluate that condition per roll call, left_join it back onto the original data, and then summarize.
library(tidyverse)
data <- data.frame(id, Struggle, Vote, Rollcall)  # rebuild as a data frame
data %>%
  filter(id == 999) %>%
  mutate(cond = Vote %in% 1:6) %>%
  select(Rollcall, cond) %>%
  left_join(data, by = "Rollcall") %>%
  group_by(id) %>%
  summarize(A = sum( cond & Struggle == "YES" & Vote %in% 7:9),
            B = sum( cond & Struggle == "NO"  & Vote %in% 7:9),
            C = sum(!cond & Struggle == "YES" & Vote %in% 7:9),
            D = sum(!cond & Struggle == "NO"  & Vote %in% 7:9))
The first four lines of the pipeline evaluate the first condition (whether member 999's Vote is between 1 and 6) for each Rollcall.
We left_join that back onto the original data and then, within each of the four groups, count the rows where Vote is 7, 8, or 9.
Output:
  id        A     B     C     D
  <dbl> <int> <int> <int> <int>
1     1     1     1     1     0
2     2     1     0     0     0
3   999     0     0     1     1
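As a cross-check of the grouping logic (counting only Votes of 7-9, as the question specifies), the same table can be sketched in base R with xtabs. The A-D labels below follow the question's definitions; this is an illustrative alternative, not the answer's own code.

```r
id       <- c(999, 1, 2, 999, 1, 2, 999, 1, 2, 999, 1, 2)
Struggle <- c("NO","YES","NO","NO","NO","YES","NO","NO","YES","YES","YES","YES")
Vote     <- c(1, 9, 1, 9, 0, 1, 2, 9, 9, 9, 9, 1)
Rollcall <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
data <- data.frame(id, Struggle, Vote, Rollcall)

# condition per roll call: did member 999 vote 1-6?
cond_rc <- with(data[data$id == 999, ], setNames(Vote %in% 1:6, Rollcall))
data$cond <- cond_rc[as.character(data$Rollcall)]

# assign each row to one of the four groups
data$group <- ifelse(data$cond,
                     ifelse(data$Struggle == "YES", "A", "B"),
                     ifelse(data$Struggle == "YES", "C", "D"))

# count Votes of 7-9 per id within each group
xtabs(~ id + group, data = data[data$Vote %in% 7:9, ])
```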

Related

Take unique rows in R, but keep most common value of a column, and use hierarchy to break ties in frequency

I have a data frame that looks like this:
df <- data.frame(Set = c("A","A","A","B","B","B","B"), Values=c(1,1,2,1,1,2,2))
I want to collapse the data frame so I have one row for A and one for B. I want the Values column for those two rows to reflect the most common Values from the whole dataset.
I could do this as described here (How to find the statistical mode?), but notably when there's a tie (two values that each occur once, therefore no "true" mode) it simply takes the first value.
I'd prefer to use my own hierarchy to determine which value is selected in the case of a tie.
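The usual mode helper from the linked question behaves exactly as described: on a tie it silently returns whichever value appears first. A quick base R illustration of that behavior:

```r
# the common "statistical mode" helper from the linked question
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Mode(c(1, 1, 1, 2))  # 1: a true mode
Mode(c(2, 2, 1, 1))  # 2: a tie, so the first-seen value wins
```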
Create a data frame that defines the hierarchy, assigning each possible value a numeric score.
hi <- data.frame(Poss = unique(df$Values), Nums = c(105, 104))
In this case, the value 1 gets a score of 105 and the value 2 gets a score of 104 (so 1 would be preferred over 2 in the case of a tie).
Join the hierarchy to the original data frame.
require(dplyr)
matched <- left_join(df, hi, by = c("Values" = "Poss"))
Then, add a frequency column that lists the number of times each unique Set-Values combination occurs.
library(data.table)
setDT(matched)[, freq := .N, by = c("Set", "Values")]
Now that those frequencies have been recorded, we only need one row of each Set-Values combo, so get rid of the rest.
multiplied <- distinct(matched, Set, Values, .keep_all = TRUE)
Now, multiply frequency by the numeric scores.
multiplied$mult <- multiplied$Nums * multiplied$freq
Lastly, sort by Set first (ascending), then mult (descending), and use distinct() to take the highest score within each Set.
check <- multiplied[with(multiplied, order(Set, -mult)), ]
final <- distinct(check, Set, .keep_all = TRUE)
This works because multiple instances of the value 2 (score 104) are added together (3 instances would give it a total mult of 312), but whenever 1 and 2 occur at the same frequency, 1 wins out (105 > 104, 210 > 208, etc.).
If using different numeric scores than the ones provided here, make sure they are spaced out enough for the dataset at hand. For example, using 2 for the value 1 and 1 for the value 2 doesn't work, because it then requires 3 instances of 2 to trump 1, instead of only 2. Likewise, if you anticipate large differences in frequencies, use something like 1005 and 1004, since with the scores above the less frequent value 1 will eventually catch up to 2 (200 * 104 is less than 199 * 105).
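Pulling these steps together, here is a self-contained base R sketch of the same score-and-pick idea on the question's data (the scores 105 for the value 1 and 104 for the value 2 are illustrative assumptions):

```r
df <- data.frame(Set = c("A","A","A","B","B","B","B"),
                 Values = c(1, 1, 2, 1, 1, 2, 2))

# hierarchy: value 1 scores 105, value 2 scores 104 (1 wins ties)
score <- c("1" = 105, "2" = 104)

# frequency of each Set-Values combination
agg <- aggregate(list(freq = rep(1, nrow(df))),
                 by = list(Set = df$Set, Values = df$Values), FUN = sum)
agg$mult <- score[as.character(agg$Values)] * agg$freq

# sort by Set ascending, mult descending; keep the top row per Set
agg <- agg[order(agg$Set, -agg$mult), ]
final <- agg[!duplicated(agg$Set), c("Set", "Values")]
final  # Set A -> 1 (true mode); Set B -> 1 (tie broken by hierarchy)
```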

Is there an equivalent of COUNTIF in R?

I have some forestry data I want to work with. There are two variables in question for this portion of the data frame:
species
status (0 = alive, 2 = dead, 3 = ingrowth, 5 = grew with another tree)
MY GOAL is to count the number of trees that are 0 or 3 (the live trees) and create a tibble with species and number present as columns.
I have tried:
spp_pres_n <- plot9 %>% count(spp, status_2021, sort = TRUE)
Which gives a tibble of every species with each status. But I need a condition that selects only status 0 and 3 to be counted. Would if_else or a simple if statement then count suffice?
A simple way with dplyr:
library(dplyr)
plot9 %>%
  filter(status_2021 %in% c(0, 3)) %>%
  count(spp, status_2021, sort = TRUE)
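For reference, the closest base R analogue of COUNTIF is `sum()` over a logical vector, with `table()` for per-group counts. A small sketch on invented stand-in data (the real `plot9` contents are not shown in the question):

```r
# toy stand-in for the question's forestry data
plot9 <- data.frame(spp = c("oak", "oak", "pine", "pine", "pine"),
                    status_2021 = c(0, 2, 3, 0, 5))

# COUNTIF-style: total live trees (status 0 or 3)
sum(plot9$status_2021 %in% c(0, 3))  # 3

# live trees per species
live <- plot9[plot9$status_2021 %in% c(0, 3), ]
table(live$spp)
```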

mutate column in R: subtract values from column based on another column condition

I'm sorry for the vague question title and because of my inability to think of a concise question name I might have missed an answer that already exists. If someone has a title suggestion I'm happy to edit!
(1) I have a dataframe with id's, values, and a baseline column which is either blank or Y (2) I want to filter the dataframe based on the outliers then create a table with the outlier values AND a column which subtracts the value from the baseline, per id.
set.seed(42)
test <- data.frame(id = c(rep("A", 5), rep("B", 5), rep("C", 5)),
                   values = rnorm(15, 1.5),
                   baseline = rep(c("Y", "", "", "", ""), 3))
Data Frame:
Three unique IDs, each with their own baseline values.
id values baseline
1 A 2.1359504 Y
2 A 1.2157471
3 A -1.1564554
4 A -0.9404669
5 A 2.8201133
6 B 1.1933614 Y
7 B -0.2813084
8 B 1.3280826
9 B 2.7146747
10 B 3.3951935
11 C 1.0695309 Y
12 C 1.2427306
13 C -0.2631631
14 C 1.9600974
15 C 0.8600051
Current Output
I haven't mutated for the third, new column here
test %>% filter(values > (1.5*IQR(test$values)))
The id's and values that are outliers
id values baseline
A 2.820113
B 2.714675
B 3.395193
Desired Output
Per patient, get the value where baseline == "Y" then subtract that value from the values column.
id values v-baseline
A 2.820113 0.6841626 #2.820113-2.1359504 values - A baseline
B 2.714675 1.521314 #2.714675-1.1933614 values - B baseline
B 3.395193 2.201832 #3.395193-1.1933614 values - B baseline
I know this is possible; I think my main issue was my inability to properly google the question!
You can group by id and then calculate values relative to the baseline value for each id. For the outlier filtering, I've selected rows where values is less than the overall 25th percentile or greater than the overall 75th percentile, which seemed to be what you were aiming for. However, you can, of course, tweak this to meet your specific needs.
library(tidyverse)
test %>%
  group_by(id) %>%
  mutate(v_baseline = values - values[baseline == "Y"]) %>%
  ungroup() %>%
  filter(values < quantile(values, probs = 0.25) |
           values > quantile(values, probs = 0.75))
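The baseline subtraction on its own can also be sketched in base R, using a named lookup vector in place of the grouped mutate (`v_baseline` here is the same derived column; the outlier filter is left out):

```r
set.seed(42)
test <- data.frame(id = c(rep("A", 5), rep("B", 5), rep("C", 5)),
                   values = rnorm(15, 1.5),
                   baseline = rep(c("Y", "", "", "", ""), 3))

# named vector: one baseline value per id
base <- setNames(test$values[test$baseline == "Y"],
                 test$id[test$baseline == "Y"])

# subtract each row's id-specific baseline
test$v_baseline <- test$values - base[test$id]
```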

R function that creates indicator variable values unique between several columns

I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and for each of the 2 million patients create a new binary variable column for each unique drug mention, and set the value to 1 for that pt. So a value of 1 for binary variable Heroin indicates that somewhere in the pts DRUGID_1....16 heroin is mentioned.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
  for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
    if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
      if (!is.element(DRUGID.df$allDAWN.DRUGID_1, respDRUGID)){
        Count <- Count + 1
        respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
        assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)
      } else {
        assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)
      }
    }
  }
}
DrugPicker(DRUGID.df)
Here I have tried to first make a list to contain each new DRUGIDx value (respDRUGID) as well as a counter (Count) for the total number unique DRUGID values and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations and, where the value is not NA: if DRUGID_1 is not yet in the list respDRUGID, create a new column variable 'r.DRUGID', set its value to 1, and increase the unique count by 1; otherwise, if the value of DRUGID_1 is already in respDRUGID, just set r.DRUGID = 1.
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.
Taking a guess at your data and required result format, using package tidyverse:
library(tidyverse)

drug_df <- read.csv(text = '
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')

gather(drug_df, key = "slot", value = "DRUGID", -patient, na.rm = TRUE) %>%
  arrange(patient, DRUGID) %>%
  group_by(patient) %>%
  summarize(DRUGIDs = paste(DRUGID, collapse = ","))
# patient DRUGIDs
# <chr>   <chr>
# 1 A     1,2,3
# 2 B     2
# 3 C     1,2
# 4 D     1,2,3
I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99 whose post helped think about the problem in another way.
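For completeness, the binary-indicator layout the question originally described (one 0/1 column per drug) can be sketched in base R on the answer's toy data; the `r.` column prefix mirrors the question's naming and is an assumption:

```r
drug_df <- data.frame(patient = c("A", "B", "C", "D"),
                      DRUGID_1 = c(1, 2, 2, 3),
                      DRUGID_2 = c(2, NA, 1, 1),
                      DRUGID_3 = c(3, NA, NA, 2))

# one 0/1 column per distinct drug id mentioned anywhere
ids <- sort(unique(na.omit(unlist(drug_df[-1]))))
dummies <- sapply(ids, function(d)
  as.integer(rowSums(drug_df[-1] == d, na.rm = TRUE) > 0))
colnames(dummies) <- paste0("r.", ids)
cbind(drug_df["patient"], dummies)
```

Because the result keeps one row per patient, it merges cleanly with the survey weight variable mentioned in the question.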

For loops, lists-of-lists, and conditional analyses (in R)

I'm trying to compute a reaction time score for every subject in an experiment, but only using a subset of trials, contingent on the subject's performance.
Each subject took a quiz on 16 items. They then took a test on the same 16 items. I'd like to get, for each subject, an average reaction time score but only for those items they got both quiz and test questions correct.
My data file looks something like this:
subject quizitem1 quizitem2 testitem1 testitem2 RT1 RT2
1 1 0 1 1 5 10
2 0 1 0 1 3 7
Ideally I'd like another column that represents the average reaction time for each subject when considering only RTs for items i with 1s under both quizitem[i] and testitem[i]. To use the above example, the column would look like this:
newDV
5
7
...since subject 1 only got item 1 correct on both measures, and subject 2 only got item 2 correct on both measures.
I've started by making three vectors, to help keep data from relevant items in the correct order.
quizacclist = c(quizitem1, quizitem2)
testacclist = c(testitem1, testitem2)
RTlist = c(RT1, RT2)
Each of these new vectors is very long, appending the RT1s from all subjects to the RT2s for all subjects, and so forth.
I've tried computing this column using for loops, but can't quite figure out what conditions would be necessary to restrict the analysis to the items meeting the above criteria.
Here is my attempt:
attach(df)
i = 0
j = 0
for(i in subject) {
  for(j in 1:16) {
    denominator[i] = sum(quizacclist[i*j]==1 & testacclist[i*j]==1)
    qualifiedindex[i] = ??
    numerator[i] = sum(RTlist[qualifiedindex])
    meanqualifiedRT[i] = numerator[i]/denominator[i]
  }
}
The denominator variable should be counting the number of items for which a subject has gotten both the quiz and test questions correct.
The numerator variable should be adding up all the RTs for items that contributed to the denominator variable; that is, got quiz and test questions correct for that item.
My specific question at this point is: How do I specify this qualifiedindex? As I conceive of it, it should be a list of lists; each index within the macro list corresponds to a subject, and each subject has a list of their own that pinpoints which items have 1s under both quizacclist[i] and testacclist[i].
For instance:
Qualifiedindex = list(c(1,5,9), c(2,6), c(8,16), etc.)
Ideally, this structure would allow the numerator variable to only add up RTs that meet the accuracy conditions.
How can this list-within-a-list be created?
Alternatively, is there a better way of achieving my aim?
Any help would be appreciated!
Thanks in advance,
Adam
Here's a solution using base R reshape and then dplyr. First reconstruct the question's example data so the code is runnable:
library(dplyr)

quiz <- data.frame(subject = 1:2,
                   quizitem1 = c(1, 0), quizitem2 = c(0, 1),
                   testitem1 = c(1, 0), testitem2 = c(1, 1),
                   RT1 = c(5, 3), RT2 = c(10, 7))

quiz_long <- reshape(quiz, direction = "long",
                     varying = -1, sep = "", idvar = "subject",
                     timevar = "question")

quiz_long %>%
  filter(quizitem == 1 & testitem == 1) %>%
  group_by(subject) %>%
  summarise(mean(RT))
Note this will only include subjects who got at least one usable question. An alternative which will have NA for those subjects:
quiz_long %>%
  mutate(RT = replace(RT, quizitem != 1 | testitem != 1, NA)) %>%
  group_by(subject) %>%
  summarise(mean_RT = mean(RT, na.rm = TRUE))
Thanks for the promising suggestion, Nick! I've tried that out but am currently stuck on an error from the mutate step, where the replacement has a different number of rows than the data. Is there a common reason why that occurs?
Thanks again,
Adam
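For comparison, the question's newDV can also be computed without reshaping at all, using row-wise matrix arithmetic on the wide data (column names taken from the question's table; this is an alternative sketch, not the answer above):

```r
quiz <- data.frame(subject = 1:2,
                   quizitem1 = c(1, 0), quizitem2 = c(0, 1),
                   testitem1 = c(1, 0), testitem2 = c(1, 1),
                   RT1 = c(5, 3), RT2 = c(10, 7))

# TRUE where an item was correct on both the quiz and the test
both <- quiz[c("quizitem1", "quizitem2")] == 1 &
        quiz[c("testitem1", "testitem2")] == 1

# average RT over qualifying items only
rt <- as.matrix(quiz[c("RT1", "RT2")])
quiz$newDV <- rowSums(rt * both) / rowSums(both)
quiz$newDV  # 5 7, matching the desired output
```

With 16 items, the same pattern works by selecting the 16 quiz, test, and RT columns instead of 2.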
