Subsetting a data frame by group means - r

I would like to subset a data frame by group means. I want to subset all data values greater than the group mean. The code I have tried is:
data<-read.csv("TreeData.csv")
library(plyr)
#Calculating the group means
MDBH<-ddply(data, .(PLTPA),summarise, MDBH=mean(DBH))
MDBH
dataDHT<-subset(data,DBH>MDBH)
#The subset data is incorrect: it excluded some values greater than the mean
#and included some values less than the mean.
dataDHT
The data set I created for this problem is at:
https://www.dropbox.com/s/ejnjhg4ogk2g4rw/TreeData.csv?dl=0
Thank you in advance for the help.
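One possible approach, sketched in base R: ave() returns the group mean for every row, as a vector the same length as the data, so DBH can be compared row by row against the mean of its own group rather than against the whole MDBH summary table. The column names PLTPA and DBH come from the question; the small data frame is invented for illustration and stands in for TreeData.csv.

```r
# Invented stand-in for TreeData.csv; only the column names come from the question.
data <- data.frame(
  PLTPA = c("A", "A", "A", "B", "B", "B"),
  DBH   = c(10, 20, 30, 5, 6, 13)
)

# ave() computes the mean of DBH within each PLTPA group and repeats it per row.
group_mean <- ave(data$DBH, data$PLTPA)

# Keep only the rows whose DBH exceeds the mean of their own group.
dataDHT <- data[data$DBH > group_mean, ]
dataDHT
```

With real data, the read.csv() call from the question would replace the invented data frame.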

Related

Calculating the offset between two columns in a dataframe but ignoring some of the outliers in one of those

I have a data frame with two columns: one is the baseline (baseline_CO2), which I calculated from a previous set of data, and the other (CO2_LICOR) is a set of data I believe to be offset with respect to this baseline.
I want to quantify this offset and calculate its value in order to correct my original data (CO2_LICOR). To do this accurately, I need to exclude some of the outlier peak values in the CO2_LICOR data from the offset calculation, say all values over 350.
Can anyone help?
The dataframe looks like the following:
If you want to compare the two columns, you can use the approach Jon Spring suggested:
df$offset <- df$baseline_CO2 - df$CO2_LICOR
If you want to filter these values then something like
df_filtered <- df[df$CO2_LICOR < 350, ]
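A sketch of how the two steps might be combined, so that the offset is estimated only from rows at or below the 350 cutoff and then applied to every row. The data values and the CO2_corrected column name are invented for illustration, and summarising the offset with mean() is an assumption; a median or another estimator may suit the data better.

```r
# Invented example data; only the column names come from the question.
df <- data.frame(
  baseline_CO2 = c(300, 302, 301, 303, 300),
  CO2_LICOR    = c(290, 292, 400, 293, 290)
)

# Rows whose CO2_LICOR reading is not an outlier peak (cutoff of 350 from the question).
keep <- df$CO2_LICOR <= 350

# Average offset, estimated from the non-outlier rows only.
offset <- mean(df$baseline_CO2[keep] - df$CO2_LICOR[keep])

# Apply the single offset to all rows (CO2_corrected is a hypothetical column name).
df$CO2_corrected <- df$CO2_LICOR + offset
offset
```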

Obtaining proportions within subsets of a data frame

I am trying to obtain proportions within subsets of a data frame. The inputs are Grade, Fully Paid and Charged Off. I tried using
DF$proportion<-as.vector(unlist(tapply(DF$Grade,paste(DF$Fully Paid ,DF$ Charged Off,sep="."),FUN=function(x){x/sum(x)}))
based on an answer to the same question in a previous post, Calculate proportions within subsets of a data frame, but I am not having any luck. I am guessing this is because Grade is a character, not a number, in my data.
Based on your comments, here is the code you should try for each column:
DF$Charged_off_proportion <- as.vector(unlist(tapply(DF$Charged_Off,DF$Grade,FUN=function(x){x/sum(x)})))
Similarly, you can change the column names for the other columns, for example:
DF$Fully_Paid_proportion <- as.vector(unlist(tapply(DF$Fully_Paid,DF$Grade,FUN=function(x){x/sum(x)})))
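A caveat on the tapply() approach above: tapply() returns its results grouped by Grade, so unlisting them lines up with the rows of DF only when DF is already sorted by Grade. Below is a sketch of an alternative using ave(), which writes each result back to the row it came from. The data are invented and share() is a hypothetical helper; the column names are taken from the answer.

```r
# Invented data, deliberately not sorted by Grade.
DF <- data.frame(
  Grade       = c("B", "A", "B", "A"),
  Charged_Off = c(1, 2, 3, 6),
  Fully_Paid  = c(4, 1, 4, 3)
)

# Share of each row within its group (hypothetical helper).
share <- function(x) x / sum(x)

# ave() applies share() per Grade and keeps the original row order.
DF$Charged_off_proportion <- ave(DF$Charged_Off, DF$Grade, FUN = share)
DF$Fully_Paid_proportion  <- ave(DF$Fully_Paid,  DF$Grade, FUN = share)
DF
```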

Apply if function to identify the value of variable based on the value of another variable

I am trying to identify the value of a variable in an R data frame conditional on the value of another variable, but I am unable to do it.
Basically, I am giving 3 different doses (Dose) of a vaccine to three groups of animals (5 animals per group, Total) and recording the results as Protec, which is the number of protected animals in each group. From Protec, I calculate the proportion of protection (Protec/Total, as Prop) for each Dose group. For example:
library(dplyr)
Dose=c(1,0.25,0.0625);Dose #Dose or Dilution
Protec=c(5,4,3);Protec
Total=c(rep(5,3));Total
df=as.data.frame(cbind(Dose,Protec,Total));df
df=df %>% mutate(Prop=Protec/Total);df
The question is: what is the log10 of the minimum value of Dose for which Prop==1? It can be found using the following code:
X0=log10(min(df$Dose[df$Prop==1.0]));X0
The result should be X0=0.
If Protec=c(5,5,3), Prop becomes c(1.0,1.0,0.6) and X0 should be -0.60206.
If Protec=c(5,5,5), Prop becomes c(1.0,1.0,1.0), for which I want X0=0.
If Protec=c(5,4,5), Prop becomes c(1.0,0.8,1.0), and I also want X0=0, because I consider them as unordered and take the highest dose for calculating X0.
I think this requires an if function, but I don't know how to write the conditions in code.
Can someone explain how to do this in R? Thank you in advance.
We can use mutate_at to apply the calculation to every column whose name starts with 'Protec':
library(dplyr)
df1 <- df %>%
mutate_at(vars(starts_with("Protec")), list(Prop = ~./Total))
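The four cases listed in the question can be read as one rule: start at the highest dose, follow the unbroken run of fully protected groups downwards, and take the lowest dose in that run, except that when every group is fully protected the highest dose is used. That reading is an assumption, as is returning NA when the highest dose is not fully protected; find_x0 is a hypothetical helper name. A base R sketch of that reading:

```r
# Hypothetical helper implementing one reading of the rules in the question.
find_x0 <- function(dose, protec, total) {
  ord  <- order(dose, decreasing = TRUE)   # walk from the highest dose down
  dose <- dose[ord]
  full <- (protec / total)[ord] == 1       # TRUE where the whole group is protected

  if (!full[1]) return(NA_real_)           # assumption: undefined in this case
  if (all(full)) return(log10(dose[1]))    # every group protected: use highest dose

  run_length <- which(!full)[1] - 1        # length of the unbroken run from the top
  log10(dose[run_length])
}

Dose  <- c(1, 0.25, 0.0625)
Total <- rep(5, 3)

find_x0(Dose, c(5, 4, 3), Total)
find_x0(Dose, c(5, 5, 3), Total)
find_x0(Dose, c(5, 5, 5), Total)
find_x0(Dose, c(5, 4, 5), Total)
```

The four calls correspond to the four Protec vectors given in the question.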

Normalize data by use of ratios based on a changing dataset in R

I am trying to normalize a Y scale by converting all values to percentages.
Therefore, I need to divide every number in a column by the first number in that column. In Excel, this would be equivalent to locking a cell: A1/$A1, B1/$A1, C1/$A1, then D1/$D1, E1/$D1...
The data first needs to meet four criteria (Time, Treatment, Concentration and Type), and the reference value changes at every new treatment. Each treatment has 4 concentrations (0, 0.1, 2 and 50). I would like the values associated with each concentration to be divided by the reference value (the value where the concentration is equal to 0).
The tricky part is that this reference value changes every 4 rows.
I have tried doing this using ddply:
MasterTable <- read.csv("~/Dropbox/Master-table.csv")
MasterTable <- ddply(MasterTable, .(Time, Type, Treatment), transform, pc=(Value/Value$Concentration==0))
But this is not working at all. Any help would be really appreciated!
My data file can be found here:
Master-table
Thank you!
dplyr is very efficient here:
library(dplyr)
result <- group_by(MasterTable, Time, Type, Treatment) %>%
mutate(pc = Value / Value[1])
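Note that Value[1] is the reference only if the Concentration == 0 row comes first within each group. Below is a sketch that looks the reference up by its concentration instead, written in base R. The data are invented (the real Master-table file is not reproduced here) and the column names follow the question.

```r
# Invented data: one Time/Type, two treatments, rows deliberately shuffled.
MasterTable <- data.frame(
  Time          = 1,
  Type          = "x",
  Treatment     = c("T1", "T1", "T1", "T1", "T2", "T2", "T2", "T2"),
  Concentration = c(2, 0, 50, 0.1, 0, 0.1, 2, 50),
  Value         = c(5, 10, 1, 8, 20, 18, 10, 4)
)

# Split into one data frame per Time/Type/Treatment combination.
groups <- split(MasterTable, MasterTable[c("Time", "Type", "Treatment")], drop = TRUE)

normalised <- lapply(groups, function(g) {
  ref <- g$Value[g$Concentration == 0]   # reference value for this group
  stopifnot(length(ref) == 1)            # assumes exactly one reference row per group
  g$pc <- g$Value / ref                  # multiply by 100 for percentages
  g
})

# Reassemble; rows come back ordered by group rather than in the original order.
result <- do.call(rbind, normalised)
rownames(result) <- NULL
result
```

With dplyr, the same lookup could be written inside the mutate() call as Value / Value[Concentration == 0].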

Conditionally create new column in R

I would like to create a new column in my dataframe that assigns a categorical value based on a condition to the other observations.
In detail, I have a column that contains timestamps for all observations. The rows are ordered ascending by timestamp.
Now, I'd like to calculate the difference between each pair of consecutive timestamps, and if it exceeds a certain threshold, the factor should be increased by 1 (see Desired Output).
Desired Output
I tried solving it with a for loop; however, that takes a lot of time because the dataset is huge.
After searching for a bit I found this approach and tried to adapt it: R - How can I check if a value in a row is different from the value in the previous row?
ind <- with(df, c(TRUE, timestamp[-1L] > (timestamp[-length(timestamp)]-7200)))
However, I cannot make it work for my dataset.
Thanks for your help!
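One common vectorised pattern for this, sketched under the assumption that timestamp is numeric (seconds) and sorted ascending: flag every gap larger than the threshold with diff(), then take the running total of the flags with cumsum(). The 7200 threshold is taken from the attempt above; the data and the group column name are invented.

```r
# Invented timestamps in seconds, sorted ascending.
df <- data.frame(timestamp = c(0, 3600, 7000, 20000, 21000, 40000))

threshold <- 7200  # gap (in seconds) that starts a new group

# TRUE where the gap to the previous row exceeds the threshold; the first row has no gap.
new_group <- c(FALSE, diff(df$timestamp) > threshold)

# Running count of group starts, shifted so the first group is numbered 1.
df$group <- factor(cumsum(new_group) + 1)
df
```

For date-time (POSIXct) timestamps, converting with as.numeric() first is safer, since diff() on date-times chooses its own units.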
