Calculating Survival rate from month to month without losing starting values - r

I have a set of code that divides the number of alive specimens by the initial count.
I am trying to determine the survival rate for the entire 5-month experiment, but there seems to be an issue with the computation each month. For the initial month, the code computes the correct survival rate (i.e. 48/50 = 96%). The issue comes in when computing the next month, where the code computes the survival rate out of 48 instead of 50 (i.e. 46/48 survived, instead of the 46/50 that I need). It continues this way for the remainder of the experiment (30/46 for month 3, then 20/30 for month 4).
Additionally, each of the "dead" specimens is then added to an NA group automatically (there should be no NA groups). I think if the first issue is taken care of, the NA issue won't happen. Is there a way to fix this with the code I have, or do I need to rearrange the data in Excel?
I have 2 species in 4 habitats that need this code for analysis.
Thanks!
Month 1
| Species | Cage | nStart | nAlive | PropAlive |
| ------- | ---- | ------ | ------ | --------- |
| X       | 1    | 10     | 9      | .9        |
| Y       | 2    | 10     | 8      | .8        |
Month 2
| Species | Cage | nStart | nAlive | PropAlive (nAlive/nStart) |
| ------- | ---- | ------ | ------ | ------------------------- |
| X       | 1    | 9      | 8      | .89                       |
| Y       | 2    | 8      | 7      | .875                      |
Month 2 should be 8/10 and 7/10 for PropAlive, not 8/9 and 7/8.
library(readxl)
library(tidyverse)
library(lme4)
library(car)
library(emmeans)

JulyData <- read_excel("~/R/Cage Data Final 2016 EMV 1.20.xlsx", sheet = "7.1.2016")
str(JulyData)
summary(JulyData$Lice)

# One row per cage/species combination, each cage starting with 10 specimens
AllCages <- distinct(JulyData, Cage, Species)
AllCages$nStart <- rep(10, nrow(AllCages))

# Count the specimens still alive (non-missing Lice) per cage and species
Alive <- JulyData %>%
  filter(!is.na(Lice)) %>%
  group_by(Cage, Species) %>%
  summarise(nAlive = n())

CleanData <- merge(AllCages, Alive, all = TRUE)
CleanData$nAlive[is.na(CleanData$nAlive)] <- 0
CleanData$nAlive[CleanData$nAlive > 10] <- 10

CleanData <- CleanData %>%
  separate(Cage, c("Habitat", "Rep"), 1, remove = FALSE) %>%
  mutate(nDead = nStart - nAlive)
CleanData

CleanData %>%
  group_by(Species, Habitat) %>%
  summarize(nStart = sum(nStart),
            nAlive = sum(nAlive),
            PropAlive = nAlive / nStart)

So, the issue was related to formatting within the data. The code is right - a simple labeling error was the issue.
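For reference, if later months' sheets only contain the surviving specimens, one way to keep the month-1 denominator is to count nAlive per month as above and join it back to the original starting counts rather than re-deriving nStart each month. This is only a minimal sketch; the data frame names StartCounts and MonthAlive are hypothetical and not from the original post:

library(dplyr)

# Hypothetical inputs (not in the original post):
#   StartCounts: Cage, Species, nStart  -- fixed month-1 starting counts
#   MonthAlive:  Cage, Species, nAlive  -- survivors counted in the current month
MonthSurvival <- StartCounts %>%
  left_join(MonthAlive, by = c("Cage", "Species")) %>%
  mutate(nAlive = ifelse(is.na(nAlive), 0, nAlive),  # cages with no survivors this month
         PropAlive = nAlive / nStart)                # always divided by the month-1 count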

Related

WGCNA package: value matching function output contains wrong NAs

I use the WGCNA package to analyze co-expressed genes. Here I am trying to form a data frame, analogous to the expression data, that will hold the clinical traits, and I use the following code:
Table for traitData:
| x | sample | NoduleperPlant |
|- |- |- |
| 1 | 1021_verbena_rep_1 | 2 |
| 2 | 1021_verbena_rep_2 | 3 |
| 3 | 1021_verbena_rep_3 | 1 |
| 4 | 1021_camporegio_rep_1 | 2 |
| 5 | 1021_camporegio_rep_2 | 3 |
| 6 | 1021_camporegio_rep_3 | 4 |
| 7 | BL225C_camporegio_rep_1 | 5 |
| 8 | BL225C_camporegio_rep_2 | 4 |
| 9 | BL225C_camporegio_rep_3 | 1 |
Table dfxpr (only some of the genes are shown):
|FIELD1 |aacC-1|aacC4-1|aapJ-1|aapM-1|aapP-1|aapQ-1|aarF-1|
|-----------------------|------|-------|------|------|------|------|------|
|X1021_verbena_rep_1 |42 |46 |12412 |935 |3354 |2876 |550 |
|X1021_verbena_rep_2 |52 |37 |11775 |946 |2970 |2824 |514 |
|X1021_verbena_rep_3 |12 |22 |5077 |397 |1462 |1228 |230 |
|X1021_camporegio_rep_1 |52 |71 |12983 |1454 |3408 |3248 |707 |
|X1021_camporegio_rep_2 |20 |65 |9240 |803 |2807 |3146 |445 |
|X1021_camporegio_rep_3 |28 |53 |11030 |1065 |3480 |3410 |582 |
|BL225C_camporegio_rep_1|29 |19 |6346 |375 |938 |768 |118 |
|BL225C_camporegio_rep_2|51 |62 |12938 |781 |1765 |1629 |291 |
|BL225C_camporegio_rep_3|52 |43 |6462 |504 |1120 |1091 |238 |
traitData = read.csv("NodulPerPlantTraitForLowGroup.csv");  # this csv file contains 3 columns: the first is non-relevant information, the second contains the sample names, and the third holds the values measured for the trait
# Remove columns that hold information I do not need.
allTraits = traitData[, -1];
allTraits = allTraits[, 1:2];
# Form a data frame analogous to expression data that will hold the clinical traits.
lowNoduleSamples = rownames(dfxpr)  # dfxpr is a data frame containing 9 observations (i.e. samples) and 6398 variables (i.e. genes)
traitRows = match(lowNoduleSamples, allTraits$sample);  # this is the line where I get wrong values (NAs), although I know they should all match
datTraits = allTraits[traitRows, -1];  # then this line results in NAs too
rownames(datTraits) = allTraits[traitRows, 1];
collectGarbage();
How can I fix the problem?
I added drop = FALSE to this line: datTraits = allTraits[traitRows, -1]
datTraits = allTraits[traitRows, -1, drop = FALSE]
I realized that my allTraits contains only 2 columns; when I remove the first one, I'm left with just one column, and R converts that into a plain vector unless I add the drop = FALSE argument.
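For illustration, here is a small standalone example (made-up data, not from the post) of why drop = FALSE matters when subsetting leaves a single column:

# A two-column data frame, analogous to allTraits
df <- data.frame(sample = c("s1", "s2", "s3"), trait = c(2, 3, 1))

str(df[, -1])                # dropping the first column returns a plain numeric vector
str(df[, -1, drop = FALSE])  # with drop = FALSE it stays a one-column data frame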

How can I create a multiple loop in R?

I am working with a database of daily deaths for a country, so I need to create a dataset that contains the daily deaths aggregated by day, month, and state. My database (def_2020) is something like this:
| State  | Month | Day |
| ------ | ----- | --- |
| state1 | jan   | 1   |
| state1 | jan   | 1   |
| .      | .     | .   |
| .      | .     | .   |
| state2 | dic   | 4   |
I have 24 states (100,000 obs) across different days and months of death. I need to get something like this:
| State  | Month | Day | Deaths |
| ------ | ----- | --- | ------ |
| state1 | jan   | 1   | 25     |
| state1 | jan   | 2   | 35     |
| .      | .     | .   |        |
| .      | .     | .   |        |
| state2 | dic   | 4   |        |
I am new to R, so I created a loop like this:
day <- c(1:31)
death_state1 <- NULL
for (i in day) {
  # number of January deaths in state1 on day i
  death_state1[i] <- sum(with(def_2020 %>% filter(State == "state1", Month == "jan"), Day == i))
}
But I need to optimize this loop to get a data frame by month (columns), day (rows), and state (also rows). Please help; I'm still new to this.
It looks like you are using a mixture of base R and dplyr syntax (the pipe %>% and filter() are exports from the dplyr package).
dplyr has its own syntax for grouped operations that allows you to avoid defining explicit loops. You use group_by() to group your data and summarize() to define variables containing the results of dimension-reducing functions like mean(), min(), n(), etc.
def_2020 %>%
  group_by(State, Month, Day) %>%
  summarize(Deaths = n())
With base R, we can use aggregate
aggregate(Deaths ~ ., transform(def_2020, Deaths = 1), FUN = sum)
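If the months should end up as columns (with states and days as rows), the grouped counts can be reshaped with tidyr. This is only a sketch, assuming def_2020 has the State, Month, and Day columns shown above:

library(dplyr)
library(tidyr)

def_2020 %>%
  count(State, Month, Day, name = "Deaths") %>%   # one row per state/month/day with its death count
  pivot_wider(names_from = Month, values_from = Deaths, values_fill = 0)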

Count merged observations and calculate fraction

I merged two data sets using Stata and now I need to find the fraction and number of projects matched. To do this, I am assuming that I will need to calculate two counts.
How do I get both of the counts to display at the same time, and then divide one by the other?
Below is an example of my _merge variable:
4022. | master only (1) |
4023. | matched (3) |
4024. | using only (2) |
4025. | using only (2) |
4026. | using only (2) |
4027. | matched (3) |
4028. | matched (3) |
4029. | matched (3) |
4030. | matched (3) |
I would first like to count and store all of the observations under _merge, and then count those that don't say "master only". Then divide the second count by the first.
For example:
count1 count2 fraction
6019 4020 .66 (4020/6019)
with count1 being everything under _merge and count2 being everything that was matched (excluding master only).
Using the following toy example:
clear
webuse autosize
merge 1:1 make using http://www.stata-press.com/data/r14/autoexpense
First it is a good idea to confirm the value which corresponds to "master only":
list _merge
+-----------------+
| _merge |
|-----------------|
1. | matched (3) |
2. | matched (3) |
3. | matched (3) |
4. | master only (1) |
5. | matched (3) |
|-----------------|
6. | matched (3) |
+-----------------+
list _merge, nolabel
+--------+
| _merge |
|--------|
1. | 3 |
2. | 3 |
3. | 3 |
4. | 1 |
5. | 3 |
|--------|
6. | 3 |
+--------+
Then generate the three variables by first counting the relevant observations and dividing:
count
generate count1 = r(N)
count if _merge != 1
generate count2 = r(N)
generate fraction = count2 / count1
display count1
6
display count2
5
display fraction
.8333333
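An equivalent approach that avoids storing constants as new variables is to hold the counts in scalars. This is only a sketch using the same toy merge:

* Count all observations, then the non-"master only" ones, and divide
count
scalar total = r(N)
count if _merge != 1
scalar matched = r(N)
display matched / total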

How to do cumulative plots and which statistical test is better?

I need to do some cumulative plots in R, but I really don't know what to use. I have data like the example below.
I want to make some graphs like those in the linked images below. The first shows me that, for example, 80% of the stops happen when Q is a certain value X. The second, starting from the exceeded value (1 mg/L), shows the accumulation of stops over time. And the third shows the accumulation of stops over time.
+---------------------------------------------------------+
| Date | Stops | Q (m3/s) | Concentration (mg/L) |
+---------------------------------------------------------+
| 1/01/2009 | no | 100 | 0,5 |
| 2/01/2009 | no | 98 | --- |
| 3/01/2009 | no | 80 | --- |
| 4/01/2009 | yes | 65 | 1,2 |
| 5/01/2009 | yes | 60 | --- |
| 6/01/2009 | yes | 67 | --- |
| 7/01/2009 | no | 75 | 0,6 |
| 8/01/2009 | no | 70 | --- |
| 9/01/2009 | no | 72 | --- |
| 10/01/2009| yes | 60 | 1,0 |
| 11/01/2009| yes | 63 | --- |
+---------------------------------------------------------+
[%stops and discharge][1] [cumulative stops with concentration][2] [cumulative stops over time][3]
The data I'm using is bigger of course; it covers 10 years.
After doing the plots I would also like to find the proportion of time when a stop happened with low discharge, or with exceeded concentrations. For example, in the 10-year period, 10 months represent stops.
I'm also looking at the relation of the stops to the other variables, but I'm not sure which test is best for that. I'm planning to use Pearson for the relation of discharge with concentration, although I'm not sure if the discontinuous data of concentration is a problem. For the relation of Stops with concentration and discharge, I'm planning Spearman rank, but again, I'm not sure if it's alright with categorical variables (stops) and the discontinuous data (concentration). What do you think is the best option for relating these variables?
[1]: https://i.stack.imgur.com/hYdkD.png
[2]: https://i.stack.imgur.com/N0qNW.png
[3]: https://i.stack.imgur.com/0nSrF.png
Thank you for your help!
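For the first kind of plot, an empirical cumulative distribution of discharge on days with stops is one option. A minimal sketch with ggplot2, assuming the data are in a data frame called df with Stops and Q columns like the table above (those names are assumptions, not from the post):

library(ggplot2)

# Cumulative share of stops at or below each discharge value Q
ggplot(subset(df, Stops == "yes"), aes(x = Q)) +
  stat_ecdf() +
  labs(x = "Q (m3/s)", y = "Proportion of stops")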

Levenshtein logic to get all the strings with minimum difference

Suppose I have a dataframe with values
Mtemp:
| code  |
| ----- |
| Ram   |
| John  |
| Tracy |
| Aman  |
I want to compare it with dataframe
M2:
| code   |
| ------ |
| Vivek  |
| Girish |
| Rum    |
| Rama   |
| Johny  |
| Stacy  |
| Jon    |
I want a result where, for each value in Mtemp, I get a maximum of 2 possible matches in M2 with Levenshtein distance 2.
I have used
library(stringdist)

tp <- as.data.frame(amatch(Mtemp$code, M2$code, method = "lv", maxDist = 2))
tp$orig <- Mtemp$code
colnames(tp) <- c('Res', 'orig')
and I am getting the result as follows:
| Res | orig  |
| --- | ----- |
| 3   | Ram   |
| 5   | John  |
| 6   | Tracy |
| 4   | Aman  |
Please let me know a way to get 2 values (if possible) for every Mtemp string with Levenshtein distance = 2.
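amatch() returns only the single best match per string, so one way to get up to two candidates each is to build the full distance matrix with stringdistmatrix() and pick the closest entries yourself. This is a sketch, assuming the Mtemp and M2 data frames shown above with character code columns:

library(stringdist)

# Full Levenshtein distance matrix: rows = Mtemp$code, columns = M2$code
d <- stringdistmatrix(Mtemp$code, M2$code, method = "lv")

# For each Mtemp value, keep at most the 2 closest M2 values with distance <= 2
matches <- lapply(seq_len(nrow(d)), function(i) {
  ok <- which(d[i, ] <= 2)
  ok <- ok[order(d[i, ok])][seq_len(min(2, length(ok)))]
  M2$code[ok]
})
names(matches) <- Mtemp$code
matches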
