I have some forestry data I want to work with. There are two variables in question for this portion of the data frame:
species
status (0 = alive, 2 = dead, 3 = ingrowth, 5 = grew with another tree)
MY GOAL is to count the number of trees that are 0 or 3 (the live trees) and create a tibble with species and number present as columns.
I have tried:
spp_pres_n <- plot9 %>% count(spp, status_2021, sort = TRUE)
This gives a tibble of every species with each status, but I need a condition so that only statuses 0 and 3 are counted. Would if_else, or a simple if statement followed by count, suffice?
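A simple way with dplyr is to filter to the live statuses first and then count by species only, so that statuses 0 and 3 are pooled into one number per species:
library(dplyr)
plot9 %>%
  filter(status_2021 %in% c(0, 3)) %>%  # keep live trees only
  count(spp, sort = TRUE)               # one row per species, count in column n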
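If species with no live trees should still appear (with a count of 0) rather than drop out after filtering, a conditional sum is one option. This is only a sketch: plot9_toy, its species codes, and the column name n_live are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for plot9
plot9_toy <- tibble(
  spp         = c("ABBA", "ABBA", "PIRU", "PIRU", "PIRU", "ACRU"),
  status_2021 = c(0, 2, 3, 0, 5, 2)
)

# Count live trees (status 0 or 3) per species without dropping any species
live_n <- plot9_toy %>%
  group_by(spp) %>%
  summarise(n_live = sum(status_2021 %in% c(0, 3)), .groups = "drop") %>%
  arrange(desc(n_live))

print(live_n)
```

Summing a logical vector counts its TRUE values, so no if_else is needed.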
I'm looking at the effects of drought on plants, and for that I need to compare data from before, during and after the drought. However, it has proven difficult to select those periods from my data, as their length in days varies. As I have time series spanning several years at daily resolution, I'd like to avoid selecting the periods manually. I have been struggling with this for quite some time and would be really grateful for any tips and advice.
Here's a simplified example of my data:
myData <- tibble(
day = c(1:16),
TWD = c(0,0,0,0.444,0.234,0.653,0,0,0.789,0.734,0.543,0.843,0,0,0,0),
Amp = c(0.6644333,0.4990167,0.3846500,0.5285000,0.4525833,0.4143667,0.3193333,0.5690167,0.2614667,0.2646333,0.7775167,3.5411667,0.4515333,2.3781333,2.4140667,2.6979333)
)
In my data, TWD > 0 means that there is drought, so I identified these periods.
myData <- myData %>%
  mutate(status = case_when(TWD > 0 ~ "drought",
                            TWD == 0 ~ "normal"))
I used the following code to get the length of the individual normal and drought periods
myData$group <- with(myData, rep(seq_along(z<-rle(myData$status)$lengths),z))
with(myData, table(group, status))
     status
group drought normal
    1       0      3
    2       3      0
    3       0      2
    4       4      0
    5       0      4
Here's where I get stuck. Ideally, I would like to have the means of Amp for each drought period and compare them to mean of normal period from before and after the drought, and then move to the next drought period. How can I compare the days of e.g. groups 1, 2 and 3? I found a promising solution here Selecting a specific range of days prior to event in R where map(. , function(x) dat[(x-5):(x), ]) was used, but the problem is that I don't have a fixed number of days I want to compare as the number of days depends on the length of the normal and drought periods.
I thought of creating a nested tibble to compare the different groups like here Compare groups with each other with
tibble(value = myData,
group= myData$group %>%
nest(value))
but that creates an error which I believe is because I'm trying to combine a vector and not a tibble.
One possibility would be to use the pairwise Wilcoxon test to compare the groups with each other (strictly speaking it compares the distributions of Amp via ranks rather than the means, and to be honest, I'm not an expert on whether the Wilcoxon is appropriate for this data):
pairwise.wilcox.test(myData$Amp, myData$group, p.adjust.method = 'none', alternative = 'greater')
The column and row indices are the groups, and in this instance you know that the even-numbered groups are the 'drought' periods.
You may need to correct for multiple comparisons (by investigating the p.adjust.method parameter).
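For a purely descriptive comparison, the per-period means can also be lined up so that each drought row carries the mean of the normal period before and after it. A rough sketch, assuming the periods strictly alternate between normal and drought as in the example; the column names mean_before and mean_after are invented here:

```r
library(dplyr)

myData <- tibble(
  day = 1:16,
  TWD = c(0,0,0,0.444,0.234,0.653,0,0,0.789,0.734,0.543,0.843,0,0,0,0),
  Amp = c(0.6644333,0.4990167,0.3846500,0.5285000,0.4525833,0.4143667,
          0.3193333,0.5690167,0.2614667,0.2646333,0.7775167,3.5411667,
          0.4515333,2.3781333,2.4140667,2.6979333)
)
myData$status <- ifelse(myData$TWD > 0, "drought", "normal")

# Number the consecutive runs of identical status
runs <- rle(myData$status)
myData$group <- rep(seq_along(runs$lengths), runs$lengths)

period_means <- myData %>%
  group_by(group, status) %>%
  summarise(mean_Amp = mean(Amp), n_days = n(), .groups = "drop") %>%
  # neighbouring rows are the adjacent periods because runs alternate
  mutate(mean_before = lag(mean_Amp),
         mean_after  = lead(mean_Amp)) %>%
  filter(status == "drought")

print(period_means)
```

A drought at the very start or end of the series would get NA for the missing neighbour.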
I have a data set with multiple rows per patient, where each row represents a 1-week period of time over the course of 4 months. There is a variable grade that can take on values of 1,2,or 3, and I want to detect when a single patient's grade INCREASES (1 to 2, 1 to 3, or 2 to 3) at any point (the result would be a yes/no variable). I could write a function to do it but I'm betting there is some clever functional programming I could do to make use of existing R functions. Here is a sample data set below. Thank you!
df=data.frame(patient=c(1,1,1,2,2,3,3,3,3),period=c(1,2,3,1,3,1,3,4,5),grade=c(1,1,1,2,3,1,1,2,3))
what I would want is a resulting data frame of:
data.frame(patient=c(1,2,3),grade.increase=c(0,1,1))
library(dplyr)
df %>%
  arrange(patient, period) %>%
  group_by(patient) %>%  # group first so lag() never compares across patients
  mutate(grade.increase = case_when(grade > lag(grade) ~ TRUE, TRUE ~ FALSE)) %>%
  summarise(grade.increase = max(grade.increase))
Combining lag which checks the previous value with case_when allows us to identify each grade.increase.
Summarising the maximum of grade.increase for each patient gets the desired results as boolean calculations treat FALSE as 0 and TRUE as 1.
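One thing worth checking with lag is where it is computed relative to group_by: ungrouped, it compares a patient's first row with the previous patient's last row. A small sketch with made-up data (the toy data frame and the helper column inc are hypothetical) shows the difference:

```r
library(dplyr)

# Hypothetical data: patient 2 starts at a higher grade than patient 1
# ended on, but neither patient's grade ever increases.
toy <- data.frame(patient = c(1, 1, 2, 2),
                  period  = c(1, 2, 1, 2),
                  grade   = c(1, 1, 3, 3))

# lag() computed before grouping: crosses the patient boundary
ungrouped <- toy %>%
  arrange(patient, period) %>%
  mutate(inc = coalesce(grade > lag(grade), FALSE)) %>%
  group_by(patient) %>%
  summarise(grade.increase = max(inc))

# lag() computed within each patient
grouped <- toy %>%
  arrange(patient, period) %>%
  group_by(patient) %>%
  mutate(inc = coalesce(grade > lag(grade), FALSE)) %>%
  summarise(grade.increase = max(inc))

print(ungrouped)  # patient 2 is flagged although the grade never rose
print(grouped)    # no patient is flagged
```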
If you feel like doing this in base R, here's a solution that uses the split-apply-combine approach.
You use split to make a list with a separate data frame for each patient;
you use lapply to iterate a summarization function over each list element, where the summarization function uses diff to look at changes in grade and if and any to summarize; and then
you wrap the whole thing in do.call(rbind, ...) to collapse the resulting list into a data frame.
Here's what that looks like:
do.call(rbind, lapply(split(df, df[,"patient"]), function(i) {
data.frame(patient = i[,"patient"][1],
grade.increase = if (any(diff(i[,"grade"]) > 0)) 1 else 0 )
}))
Result:
patient grade.increase
1 1 0
2 2 1
3 3 1
I have a big data frame with 5 columns and 1000+ rows like this:
cluster sample_id proportion condition patient_id
Basophils Base1001 0.358183106 Base B1001
Every patient has 18 different clusters, 2 samples and 2 conditions. I have to do a log ratio of the proportion of every cluster with its match under the different condition.
I have tried to use automatic conditions like for df$patient_id == B1001 get cluster == Basophils and similar things but I can't get it right.
The only thing I managed to do is subset everything and do a manual log ratio but that's too painful.
prueba1 = subset(ggdf, ggdf$patient_id == "B1001")
prueba2 = subset(prueba1, prueba1$cluster == "Basophils")
prueba3 = prueba2$proportion[1]/prueba2$proportion[2]
prueba4 = log(prueba3)
How can I do it to automatically compare the proportions of clusters with same name and patient but different condition?
Sorry if this is too basic, if it is, could you point me where to find a step by step manual?
Thank you in advance.
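dplyr is well suited to this sort of data manipulation.
Supposing the two proportions for each patient/cluster combination sum to 1, so that the other condition's proportion is 1 - proportion, this should do what you're after (it returns one value per row, i.e. one per condition):
library(dplyr)
ggdf %>%
  select(patient_id, cluster, proportion) %>%  # proportion must be kept, it is used below
  group_by(patient_id, cluster) %>%
  summarise(log_proportions = log(proportion / (1 - proportion)))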
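If the two proportions do not sum to 1, the ratio can be taken directly between the two conditions within each patient/cluster pair, which mirrors the manual subset-and-divide steps from the question. This is a sketch only: ggdf_toy, the condition label "Treat", the sample ids and all numbers are invented.

```r
library(dplyr)

# Hypothetical data; only the column names follow the question
ggdf_toy <- data.frame(
  cluster    = c("Basophils", "Basophils", "Monocytes", "Monocytes"),
  sample_id  = c("Base1001", "Treat1001", "Base1001", "Treat1001"),
  proportion = c(0.4, 0.1, 0.2, 0.2),
  condition  = c("Base", "Treat", "Base", "Treat"),
  patient_id = "B1001"
)

# One log ratio per patient and cluster: Base relative to the other condition
log_ratios <- ggdf_toy %>%
  group_by(patient_id, cluster) %>%
  summarise(log_ratio = log(proportion[condition == "Base"] /
                              proportion[condition == "Treat"]),
            .groups = "drop")

print(log_ratios)
```

This assumes exactly one row per condition in every patient/cluster group; groups with a missing condition would need handling first.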
I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and for each of the 2 million patients create a new binary variable column for each unique drug mention, and set the value to 1 for that pt. So a value of 1 for binary variable Heroin indicates that somewhere in the pts DRUGID_1....16 heroin is mentioned.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
if (!is.element(DRUGID.df$allDAWN.DRUGID_1,respDRUGID)){
Count <- Count + 1
respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)}
else {
assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)}
}
}
}
DrugPicker(DRUGID.df)
Here I have tried to first make a list to contain each new DRUGIDx value (respDRUGID) as well as a counter (Count) for the total number unique DRUGID values and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations and if not NA, then if DRUGID_1 is not in list respDRUGID then create a new column variable 'r.DRUGID' and set value to 1. Also increase the unique count by 1. Otherwise the value of DRUGID_1 is already in list respDRUGID then set r.DRUGID = 1
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.
Taking a guess at your data and required result format. Using package tidyverse
drug_df <- read.csv(text='
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')
library(tidyverse)
gather(drug_df, key = "slot", value = "DRUGID", -patient, na.rm = TRUE) %>%
arrange(patient, DRUGID) %>%
group_by(patient) %>%
summarize(DRUGIDs = paste(DRUGID, collapse=","))
# patient DRUGIDs
# <fctr> <chr>
# 1 A 1,2,3
# 2 B 2
# 3 C 1,2
# 4 D 1,2,3
I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99 whose post helped think about the problem in another way.
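For reference, the one-indicator-column-per-drug layout described in the question can also be sketched in base R without building strings first. The data below reuses the small guessed example, and the r.<code> column naming follows the question's r.DRUGID idea; both are assumptions rather than the real DAWN layout.

```r
# Guessed example data: one row per patient, drug codes in DRUGID_* columns
drug_df <- data.frame(
  patient  = c("A", "B", "C", "D"),
  DRUGID_1 = c(1, 2, 2, 3),
  DRUGID_2 = c(2, NA, 1, 1),
  DRUGID_3 = c(3, NA, NA, 2)
)

id_cols <- grep("^DRUGID_", names(drug_df), value = TRUE)
ids     <- as.matrix(drug_df[id_cols])

# All drug codes that occur anywhere, ignoring missing values
all_ids    <- as.vector(ids)
drug_codes <- sort(unique(all_ids[!is.na(all_ids)]))

# One 0/1 column per code: 1 if the code appears in any DRUGID_* column
for (code in drug_codes) {
  drug_df[[paste0("r.", code)]] <-
    as.integer(rowSums(ids == code, na.rm = TRUE) > 0)
}

print(drug_df)
```

The rows stay in their original order, so the result keeps the same observations-by-variables shape for merging with the weight variable.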
I'm trying to compute a reaction time score for every subject in an experiment, but only using a subset of trials, contingent on the subject's performance.
Each subject took a quiz on 16 items. They then took a test on the same 16 items. I'd like to get, for each subject, an average reaction time score but only for those items they got both quiz and test questions correct.
My data file looks something like this:
subject quizitem1 quizitem2 testitem1 testitem2 RT1 RT2
1 1 0 1 1 5 10
2 0 1 0 1 3 7
Ideally I'd like another column that represents the average reaction time for each subject when considering only RTs for items i with 1s under both quizitem[i] and testitem[i]. To use the above example, the column would look like this:
newDV
5
7
...since subject 1 only got item 1 correct on both measures, and subject 2 only got item 2 correct on both measures.
I've started by making three vectors, to help keep data from relevant items in the correct order.
quizacclist = c(quizitem1, quizitem2)
testacclist = c(testitem1, testitem2)
RTlist = c(RT1, RT2)
Each of these new vectors is very long, appending the RT1s from all subjects to the RT2s for all subjects, and so forth.
I've tried computing this column using for loops, but can't quite figure out what conditions would be necessary to restrict the analysis to the items meeting the above criteria.
Here is my attempt:
attach(df)
i = 0
j = 0
for(i in subject) {
for(j in 1:16) {
denominator[i] = sum(quizacclist[i*j]==1 & testacclist[i*j]==1)
qualifiedindex[i] = ??
numerator[i] = sum(RTlist[qualifiedindex])
meanqualifiedRT[i] = numerator[i]/denominator[i]
}
}
The denominator variable should be counting the number of items for which a subject has gotten both the quiz and test questions correct.
The numerator variable should be adding up all the RTs for items that contributed to the denominator variable; that is, got quiz and test questions correct for that item.
My specific question at this point is: How do I specify this qualifiedindex? As I conceive of it, it should be a list of lists; each index within the macro list corresponds to a subject, and each subject has a list of their own that pinpoints which items have 1s under both quizacclist[i] and testacclist[i].
For instance:
Qualifiedindex = ([1,5,9],[2,6],[8,16],etc)
Ideally, this structure would allow the numerator variable to only add up RTs that meet the accuracy conditions.
How can this list-within-a-list be created?
Alternatively, is there a better way of achieving my aim?
Any help would be appreciated!
Thanks in advance,
Adam
Here's a solution using base R reshape and then dplyr:
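library(dplyr)
# Wide to long: quizitem1, testitem1, RT1, ... become columns quizitem, testitem, RT
quiz_long <- reshape(quiz, direction = "long",
                     varying = -1, sep = "", idvar = "subject",
                     timevar = "question")
quiz_long %>%
  filter(quizitem == 1 & testitem == 1) %>%
  group_by(subject) %>%
  summarise(mean_RT = mean(RT))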
Note this will only include subjects who got at least one usable question. An alternative that keeps every subject, giving NaN for subjects with no usable questions:
quiz_long %>%
mutate(RT = replace(RT, quizitem != 1 | testitem != 1, NA)) %>%
group_by(subject) %>%
summarise(mean_RT = mean(RT, na.rm = TRUE))
Thanks for the promising suggestion Nick! I've tried that out but currently stuck dealing with an error prompted by the mutate feature, where the replacement has a different number of rows than the data. Is there a common reason for why that occurs?
Thanks again,
Adam
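If reshaping is the sticking point, the same average can be sketched directly on the wide data with matrices. This assumes the column names from the example table (quizitem1..., testitem1..., RT1...) and uses only the two example items; n_items would be 16 for the full data.

```r
# Example data from the question (two items only)
quiz <- data.frame(subject   = 1:2,
                   quizitem1 = c(1, 0), quizitem2 = c(0, 1),
                   testitem1 = c(1, 0), testitem2 = c(1, 1),
                   RT1       = c(5, 3), RT2       = c(10, 7))

n_items <- 2  # 16 in the full data set

quiz_ok <- as.matrix(quiz[paste0("quizitem", 1:n_items)])
test_ok <- as.matrix(quiz[paste0("testitem", 1:n_items)])
rt      <- as.matrix(quiz[paste0("RT", 1:n_items)])

# TRUE where the item was correct on both quiz and test
both <- quiz_ok == 1 & test_ok == 1

# Sum of qualifying RTs divided by number of qualifying items, per subject
quiz$newDV <- rowSums(rt * both) / rowSums(both)

print(quiz$newDV)
```

A subject with no qualifying items gets NaN (0 divided by 0), which plays the role of the missing value here.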