I have a big data frame with 5 columns and 1000+ rows like this:
cluster sample_id proportion condition patient_id
Basophils Base1001 0.358183106 Base B1001
Every patient has 18 different clusters, 2 samples and 2 conditions. I have to do a log ratio of the proportion of every cluster with its match under the different condition.
I have tried to use automatic conditions like for df$patient_id == B1001 get cluster == Basophils and similar things but I can't get it right.
The only thing I managed to do is subset everything and do a manual log ratio but that's too painful.
prueba1 = subset(ggdf, ggdf$patient_id == "B1001")
prueba2 = subset(prueba1, prueba1$cluster == "Basophils")
prueba3 = prueba2$proportion[1]/prueba2$proportion[2]
prueba4 = log(prueba3)
How can I automatically compare the proportions of clusters that share the same name and patient but have a different condition?
Sorry if this is too basic, if it is, could you point me where to find a step by step manual?
Thank you in advance.
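dplyr is well suited to this sort of data manipulation.
Supposing the two proportions for each patient/cluster combination sum to 1, the other condition's proportion is 1 - proportion, so this should do what you're after. Note that proportion has to be kept in select(), otherwise the last step cannot find it:
library(dplyr)
ggdf %>%
  # keep proportion and condition: both are needed to read the result
  select(patient_id, cluster, condition, proportion) %>%
  group_by(patient_id, cluster) %>%
  # one value per row: log(this condition / the other condition)
  mutate(log_proportions = log(proportion / (1 - proportion)))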
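If the two proportions do not sum to 1, a sketch that pairs the rows explicitly may be closer to the manual calculation in the question. The toy data, the condition label "Post" and the helper names (keys, base, post, paired) are assumptions made for illustration; only "Base" appears in the question.

```r
# Toy data in the question's layout; "Post" is a made-up second condition.
ggdf <- data.frame(
  cluster    = c("Basophils", "Basophils", "Monocytes", "Monocytes"),
  sample_id  = c("Base1001", "Post1001", "Base1001", "Post1001"),
  proportion = c(0.4, 0.2, 0.1, 0.3),
  condition  = c("Base", "Post", "Base", "Post"),
  patient_id = "B1001",
  stringsAsFactors = FALSE
)

keys <- c("patient_id", "cluster")
base <- ggdf[ggdf$condition == "Base", c(keys, "proportion")]
post <- ggdf[ggdf$condition == "Post", c(keys, "proportion")]

# One row per patient/cluster pair, with both proportions side by side.
paired <- merge(base, post, by = keys, suffixes = c("_base", "_post"))

# Log ratio of the Base proportion to the Post proportion.
paired$log_ratio <- log(paired$proportion_base / paired$proportion_post)
paired
```

Because merge() matches on patient_id and cluster, the result does not depend on the row order of the input, unlike proportion[1]/proportion[2].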
I have a multiple-response-variable with seven possible observations: "Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker".
However, if a respondent chose more than one option, the answers are not separated in the data (Data).
My goal is to create a matrix with all possible observations as variables and marked with 1 (yes) and 0 (No). Currently I am using this command:
einzeln_strategisch_2021 <- data.frame(strategisch_2021[, ! colnames (strategisch_2021) %in% "Q12"], model.matrix(~ Q12 - 1, strategisch_2021)) %>%
This gives me the matrix I want but it does not separate the observations, so now I have a matrix with 20 variables instead of the seven (variables).
I also tried separate() like this:
separate(Q12, into = c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker"), ";") %>%
This does separate the observations, but not in the right order and without the matrix.
How do I separate my observations and create a matrix with the possible observations as variables akin to the third picture (Matrix)?
Thank you very much in advance ;)
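One way to get a 0/1 column per option is to split each answer and test which of the seven options it contains. This is a minimal base R sketch: the three example rows, the id column and the names choices, picked and indicator are made up, and the ";" separator is assumed from the separate() call above.

```r
# The seven possible options from the question.
choices <- c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition",
             "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker")

# Hypothetical stand-in for the real data: Q12 holds ";"-separated answers.
strategisch_2021 <- data.frame(
  id  = 1:3,
  Q12 = c("Inhalt;Arbeit", "Spitzenpolitiker", "Arbeit;Verhindern Koalition;Inhalt"),
  stringsAsFactors = FALSE
)

# Split every answer into the options that were picked.
picked <- strsplit(strategisch_2021$Q12, ";", fixed = TRUE)

# For each respondent, mark which of the seven options occur (1) or not (0).
indicator <- t(vapply(picked,
                      function(p) as.integer(choices %in% trimws(p)),
                      integer(length(choices))))
colnames(indicator) <- choices

# Attach the indicator columns to the remaining variables.
einzeln <- cbind(strategisch_2021["id"], indicator)
einzeln
```

Since the columns are built from the fixed choices vector, the result always has exactly seven indicator columns in a fixed order, regardless of the order in which a respondent listed the options.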
I have some forestry data I want to work with. There are two variables in question for this portion of the data frame:
species
status (0 = alive, 2 = dead, 3 = ingrowth, 5 = grew with another tree)
MY GOAL is to count the number of trees that are 0 or 3 (the live trees) and create a tibble with species and number present as columns.
I have tried:
spp_pres_n <- plot9 %>% count(spp, status_2021, sort = TRUE)
Which gives a tibble of every species with each status. But I need a condition that selects only status 0 and 3 to be counted. Would if_else or a simple if statement then count suffice?
A simple way with dplyr is to keep only the live statuses (0 and 3) before counting:
plot9 %>%
filter(status_2021 %in% c(0,3)) %>%
count(spp, status_2021, sort = TRUE)
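If the goal is a single number of live trees per species, statuses 0 and 3 have to be pooled rather than counted separately. A small base R sketch with made-up species and statuses (the toy plot9, live and live_counts are illustrative only):

```r
# Made-up stand-in for plot9.
plot9 <- data.frame(
  spp         = c("oak", "oak", "pine", "pine", "pine", "fir"),
  status_2021 = c(0, 3, 0, 2, 3, 5),
  stringsAsFactors = FALSE
)

# Keep only live trees (status 0 or 3).
live <- plot9[plot9$status_2021 %in% c(0, 3), ]

# One row per species, with n = number of live trees of that species.
live_counts <- as.data.frame(table(spp = live$spp),
                             responseName = "n",
                             stringsAsFactors = FALSE)
live_counts <- live_counts[order(-live_counts$n), ]
live_counts
```

In the dplyr pipeline, the same pooling comes from using count(spp, sort = TRUE) after the filter, leaving status_2021 out of count().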
I'm looking at the effects of drought on plants and for that I would need to compare data from before, during and after the drought. However, it has proven to be difficult to select those periods from my data, as the length of days varies. As I have timeseries of several years with daily resolution, I'd like to avoid selecting the periods manually. I have been struggling with this for quite some time and would be really grateful for any tips and advice.
Here's a simplified example of my data:
myData <- tibble(
day = c(1:16),
TWD = c(0,0,0,0.444,0.234,0.653,0,0,0.789,0.734,0.543,0.843,0,0,0,0),
Amp = c(0.6644333,0.4990167,0.3846500,0.5285000,0.4525833,0.4143667,0.3193333,0.5690167,0.2614667,0.2646333,0.7775167,3.5411667,0.4515333,2.3781333,2.4140667,2.6979333)
)
In my data, TWD > 0 means that there is drought, so I identified these periods.
myData <- myData %>%
  mutate(status = case_when(TWD > 0 ~ "drought",
                            TWD == 0 ~ "normal"))
I used the following code to get the length of the individual normal and drought periods
runs <- rle(myData$status)$lengths
myData$group <- rep(seq_along(runs), runs)
with(myData, table(group, status))
     status
group drought normal
    1       0      3
    2       3      0
    3       0      2
    4       4      0
    5       0      4
Here's where I get stuck. Ideally, I would like to have the mean of Amp for each drought period and compare it to the means of the normal periods before and after that drought, and then move on to the next drought period. How can I compare the days of e.g. groups 1, 2 and 3? I found a promising solution in "Selecting a specific range of days prior to event in R", where map(. , function(x) dat[(x-5):(x), ]) was used, but I don't have a fixed number of days to compare, because the number of days depends on the length of the normal and drought periods.
I thought of creating a nested tibble to compare the different groups, as in "Compare groups with each other", with
tibble(value = myData,
group= myData$group %>%
nest(value))
but that creates an error which I believe is because I'm trying to combine a vector and not a tibble.
One possibility would be to use the pairwise Wilcoxon test to compare the groups (though, to be honest, I'm not an expert on whether the Wilcoxon is appropriate for this data; note that it is a rank-based test, so it compares the distributions of Amp rather than the means themselves):
pairwise.wilcox.test(myData$Amp, myData$group, p.adjust.method = 'none', alternative = 'greater')
The column and row indices are the groups, and in this instance you know that the even-numbered groups are the 'drought' periods.
You may need to correct for multiple comparisons (by investigating the p.adjust.method parameter).
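For the descriptive part of the question (mean Amp of each drought period next to the normal period before and after it), a base R sketch could look like this. It rebuilds the example data from the question; the names padded, drought_ids and comparison are made up for illustration.

```r
myData <- data.frame(
  day = 1:16,
  TWD = c(0, 0, 0, 0.444, 0.234, 0.653, 0, 0, 0.789, 0.734, 0.543, 0.843, 0, 0, 0, 0),
  Amp = c(0.6644333, 0.4990167, 0.3846500, 0.5285000, 0.4525833, 0.4143667,
          0.3193333, 0.5690167, 0.2614667, 0.2646333, 0.7775167, 3.5411667,
          0.4515333, 2.3781333, 2.4140667, 2.6979333)
)

# Label each day, then number the consecutive runs of equal status.
status <- ifelse(myData$TWD > 0, "drought", "normal")
runs   <- rle(status)
group  <- rep(seq_along(runs$lengths), runs$lengths)

# Mean Amp per run, in run order.
group_mean <- as.numeric(tapply(myData$Amp, group, mean))

# Pad with NA so a drought at the very start or end has no neighbour.
padded      <- c(NA, group_mean, NA)
drought_ids <- which(runs$values == "drought")

# padded[i + 1] is run i, so padded[i] is the run before and padded[i + 2] the run after.
comparison <- data.frame(
  group        = drought_ids,
  mean_before  = padded[drought_ids],
  mean_drought = padded[drought_ids + 1],
  mean_after   = padded[drought_ids + 2]
)
comparison
```

Because runs alternate between "normal" and "drought", the neighbours of a drought run are always normal runs, whatever their length.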
I have a data set with multiple rows per patient, where each row represents a 1-week period of time over the course of 4 months. There is a variable grade that can take on values of 1,2,or 3, and I want to detect when a single patient's grade INCREASES (1 to 2, 1 to 3, or 2 to 3) at any point (the result would be a yes/no variable). I could write a function to do it but I'm betting there is some clever functional programming I could do to make use of existing R functions. Here is a sample data set below. Thank you!
df=data.frame(patient=c(1,1,1,2,2,3,3,3,3),period=c(1,2,3,1,3,1,3,4,5),grade=c(1,1,1,2,3,1,1,2,3))
what I would want is a resulting data frame of:
data.frame(patient=c(1,2,3),grade.increase=c(0,1,1))
library(dplyr)
df %>%
  arrange(patient, period) %>%
  # group first, so lag() never compares against the previous patient's last row
  group_by(patient) %>%
  mutate(grade.increase = case_when(grade > lag(grade) ~ TRUE, TRUE ~ FALSE)) %>%
  summarise(grade.increase = max(grade.increase))
Combining lag which checks the previous value with case_when allows us to identify each grade.increase.
Summarising the maximum of grade.increase for each patient gets the desired results as boolean calculations treat FALSE as 0 and TRUE as 1.
If you feel like doing this in base R, here's a solution that uses the split-apply-combine approach.
You use split to make a list with a separate data frame for each patient;
you use lapply to iterate a summarization function over each list element, where the summarization function uses diff to look at changes in grade and if and any to summarize; and then
you wrap the whole thing in do.call(rbind, ...) to collapse the resulting list into a data frame.
Here's what that looks like:
do.call(rbind, lapply(split(df, df[,"patient"]), function(i) {
data.frame(patient = i[,"patient"][1],
grade.increase = if (any(diff(i[,"grade"]) > 0)) 1 else 0 )
}))
Result:
  patient grade.increase
1       1              0
2       2              1
3       3              1
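Since diff() compares rows in the order they appear, it may be worth sorting by period within patient first. A compact variant of the same idea using tapply, as a sketch (increase and result are names chosen here):

```r
df <- data.frame(patient = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 period  = c(1, 2, 3, 1, 3, 1, 3, 4, 5),
                 grade   = c(1, 1, 1, 2, 3, 1, 1, 2, 3))

# Make sure each patient's rows are in chronological order.
df <- df[order(df$patient, df$period), ]

# TRUE for a patient if any consecutive change in grade is positive.
increase <- tapply(df$grade, df$patient, function(g) any(diff(g) > 0))

result <- data.frame(patient        = as.numeric(names(increase)),
                     grade.increase = as.integer(increase))
result
```

A patient with a single row gets 0, because diff() of one value is empty and any() of an empty vector is FALSE.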
I have a dataset from which I want to select a random sample of rows, but following some pre-defined rules. This may be a very basic question but I am very new to this and still trying to grasp the basic concepts. My dataset includes some 330 rows of data (I have included a simplified version here) with several columns. I want to sample 50 rows out of the 330 (I kept these numbers in the mock dataset for simplicity as this is part of the problem I am having) with the option to add the predefined rules to the sampling process.
Here is a mock version of the data:
bank<-data.frame(matrix(0,nrow=330,ncol=5))
colnames(bank)<-c("id","var1","var2","year","lo")
bank$id<-c(1:330)
bank$var1<-sample(letters[1:5],330,replace=T)
bank$var2<-sample(c("s","r"),330,replace=T)
bank$var3<-sample(2010:2018,330,replace=T)
bank$lo<-sample(c("lo1","lo2","lo3","lo4","lo5","lo6"),330,replace=T)
The code I used to try to sample the correct number of rows is
library(splitstackshape)
x<-splitstackshape::stratified(indt=bank,group=c("var1","var2","year","lo"),0.151)
However this is not selecting 50 rows. I had initially tried to define size=50 but I got the following error:
Groups b s 2012 lo4,... [there is a very long list here],...contain fewer rows than requested. Returning all rows.
Then I tried to define size as a percent: 0.151 (15.1%?) which should be right 50 out of 330 but that samples 5 rows (I tried 0.5 and samples 44 rows and if I try 0.500000001 it samples 287 rows???).
What am I missing? For the moment I am stuck here.
Once I manage to sample the correct number of rows (50) I would like to define some rules, like: only up to 50% of the sample can have 2018 (bank$year) AND only up to half of the bank$year==2018 rows can have bank$var2=="r". Obviously I don't expect someone to do this for me, but could you please provide some advice on
1- Why am I getting the wrong number of rows (probably just syntax?)
2- what package I should look into if splitstackshape::stratified() is not the best or a good choice to achieve this?
Many thanks!
M
I think the issue comes from the fact that your dataset (as you've shared here) is fairly small, you have a large number of strata (5 letters × 2 values of var2 × 9 years × 6 lo categories = 540 possible strata, more than the 330 rows), and it's just not possible to take samples of the desired size from within each stratum. When I bump the number of rows up to 33,000 and take a sample of 15.1%, I get a sample of size 4,994. Putting size = 50 requests a sample of 50 rows from each stratum, which is not remotely possible with the data you've shared.
> bank<-data.frame(matrix(0,nrow=33000,ncol=5))
> colnames(bank)<-c("id","var1","var2","year","lo")
> bank$id<-c(1:33000)
> bank$var1<-sample(letters[1:5],33000,replace=T)
> bank$var2<-sample(c("s","r"),33000,replace=T)
> bank$var3<-sample(2010:2018,33000,replace=T)
> bank$lo<-sample(c("lo1","lo2","lo3","lo4","lo5","lo6"),330,replace=T)
>
> k <- stratified(bank, group = c('var1', 'var2', 'var3', 'lo'), size = .151)
> dim(k)
[1] 4994 6
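The jump between 0.5 and 0.500000001 in the question would be explained if the fraction is applied to each stratum separately and the result rounded to a whole number of rows. That is an assumption about how stratified() treats fractional sizes, not something confirmed in this thread, but the arithmetic for strata of 1, 2 and 3 rows illustrates it:

```r
# Hypothetical stratum sizes, typical when 330 rows are spread over hundreds of strata.
stratum_size <- c(1, 2, 3)

# Rows taken per stratum if size * fraction is rounded to the nearest integer.
take_151  <- round(stratum_size * 0.151)        # 0 0 0: nothing is sampled
take_half <- round(stratum_size * 0.5)          # 0 1 2: R rounds 0.5 to 0 and 1.5 to 2
take_over <- round(stratum_size * 0.500000001)  # 1 1 2: single-row strata now contribute

rbind(take_151, take_half, take_over)
```

Under that assumption, every single-row stratum contributes nothing at 0.5 and one row at 0.500000001, so a dataset dominated by single-row strata would show exactly this kind of jump.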
Another approach, provided by Jenny Bryan, is to specify the number of rows n to draw from each group. Here samp is the random sample of n rows drawn from each var1 group, so n needs to be set in proportion to the size of each group:
library(dplyr)
library(tidyr)
library(purrr)
bank %>%
  group_by(var1) %>%
  nest() %>%
  ungroup() %>%                          # drop the grouping so n is assigned row by row
  mutate(n = c(7, 0, 9, 1, 13),          # one n per row of the nested table
         samp = map2(data, n, sample_n)) %>%
  select(var1, samp) %>%
  unnest(samp)
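For the rules in the question (at most half of the sample from 2018, and at most half of those 2018 rows with var2 == "r"), one simple option is rejection sampling: draw 50 rows at random and redraw until the rules hold. This is a sketch on mock data, not a splitstackshape feature. The helper meets_rules and the cap of 1000 attempts are choices made for illustration.

```r
set.seed(42)  # only so the illustration is repeatable

# Mock data with the two columns the rules refer to.
bank <- data.frame(
  id   = 1:330,
  var2 = sample(c("s", "r"), 330, replace = TRUE),
  year = sample(2010:2018, 330, replace = TRUE),
  stringsAsFactors = FALSE
)

# TRUE when a candidate sample satisfies both rules from the question.
meets_rules <- function(s) {
  n_2018   <- sum(s$year == 2018)
  n_2018_r <- sum(s$year == 2018 & s$var2 == "r")
  n_2018 <= nrow(s) / 2 && n_2018_r <= n_2018 / 2
}

samp <- NULL
for (attempt in 1:1000) {              # cap the retries instead of looping forever
  candidate <- bank[sample(nrow(bank), 50), ]
  if (meets_rules(candidate)) {
    samp <- candidate
    break
  }
}
if (is.null(samp)) stop("no sample met the rules within 1000 attempts")
nrow(samp)
```

This always returns exactly 50 rows and works well when the rules are satisfied by a reasonable share of random draws; if they are rarely met, sampling the 2018 and non-2018 rows separately with fixed sizes would be more efficient.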