NA variables in dplyr summary in R

I am trying to create a table of frequencies (counts) of variables from two groups (A and B) that fall within predefined time intervals. My problem is that if a row starts at 0 seconds (see start_sec), it is not assigned to the 0-5 second interval but is marked as NA (see Output). I would like these cases to be included in that interval.
This is a dummy example:
Variables
group <- c("A","A","A","A","A","A","B","B","B")
person <- c("p1","p1","p1","p3","p2","p2","p1","p1","p2")
start_sec <- c(0,10.7,11.8,3.9,7.4,12.1,0,3.3,0)
dur_sec <- c(7.1,8.2,9.3,10.4,11.5,12.6,13.7,14.8,15.9)
Data frame
df <- data.frame(group,person,start_sec,dur_sec)
df
Pipeline
library(dplyr)
df %>%
  group_by(group, person, interval = cut(start_sec, breaks = c(0, 5, 10, 15))) %>%
  summarise(counts = n(), sum_dur_sec = sum(dur_sec))
Output (so far)
Thank you in advance for all comments and feedback!
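A possible fix, as a minimal sketch: by default cut() builds intervals that are open on the left, such as (0,5], so a value exactly equal to the lowest break falls outside every interval and becomes NA. Setting include.lowest = TRUE closes the first interval, giving [0,5]. The .groups = "drop" argument and the name result are additions for this sketch and not part of the original pipeline.

```r
library(dplyr)

# Dummy data from the question
df <- data.frame(
  group     = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  person    = c("p1", "p1", "p1", "p3", "p2", "p2", "p1", "p1", "p2"),
  start_sec = c(0, 10.7, 11.8, 3.9, 7.4, 12.1, 0, 3.3, 0),
  dur_sec   = c(7.1, 8.2, 9.3, 10.4, 11.5, 12.6, 13.7, 14.8, 15.9)
)

result <- df %>%
  group_by(
    group, person,
    # include.lowest = TRUE makes the first interval [0,5] instead of (0,5]
    interval = cut(start_sec, breaks = c(0, 5, 10, 15), include.lowest = TRUE)
  ) %>%
  summarise(counts = n(), sum_dur_sec = sum(dur_sec), .groups = "drop")

result
```

If left-closed intervals such as [0,5) are preferred throughout, cut(..., right = FALSE) is an alternative, but then a value equal to the highest break (15) would become NA unless include.lowest = TRUE is also set.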

Related

Count the values from a dataframe, to populate a variable in another dataframe (which specifies the counts required)

I have a summary table which is a reference of all the variables and possible values contained in another dataframe. I need to create a new variable which is a count of those variable values from the other dataframe.
Here is some example data,
# the dataframe with data I want to count from
Y1 <- c(1,2,3,2,4,4,2,1,2,3,4,3,4,4,4,4,3,2,1,2)
Y2 <- c(1,2,1,2,1,1,2,1,1,2,2,1,2,1,1,1,2,1,2,1)
Y3 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
df <- data.frame(Y1,Y2,Y3)
# the summary table with all variables and their possible values
Variable <- c("Y1","Y1","Y1","Y1","Y1","Y2","Y2","Y3", "Y3")
Values <- c(1,2,3,4,5,1,2,1,2)
summary <- data.frame(Variable,Values)
> summary
  Variable Values
1       Y1      1
2       Y1      2
3       Y1      3
4       Y1      4
5       Y1      5
6       Y2      1
7       Y2      2
8       Y3      1
9       Y3      2
The summary contains the variable names and all their possible values. I need to create a variable with a count of each variable/value pair, taken from the df dataframe. The example data is simulated so that Y1 and Y3 have no values for 5 and 2 respectively.
In words, I imagine it would be described as: new_variable_count[i] in the "summary" table equals the count in "df" of the rows where the column named Variable[i] equals Values[i].
Some context, if it helps. In a questionnaire asking people to rate something 1-5, you may get no responses of 5. The value 5 is possible, but if you run a frequency table of the variable you'll just get counts for values 1-4. Even if you reshape the data, the possible values would not exist. How would a machine, without context, know what values should exist? Zero counts are actually useful when analysing hundreds of variables; also, if you don't know the possible value range of a variable, you may incorrectly assume that the values counted are the only values that exist.
Any help would be great. Any advice or direction would also be immensely helpful, i.e. is it better to use a loop, or is there a built-in function? Many thanks in advance.
I hope I understood your question. I am quite sure there is already a built-in function for this, but I don't know which one, sorry.
However, a solution can easily be achieved with the tidyverse package and two for loops.
First you produce the summary table:
library(tidyverse)
test <- df %>%
  pivot_longer(everything(), names_to = "question", values_to = "answer") %>%
  group_by(question, answer) %>%
  summarise(count = n()) %>%
  ungroup()
Then, for each question (first for loop), you check whether each possible answer (second for loop) is present in the test data frame. If it is not, nobody gave that answer, so we append a row recording a count of 0 for question q and answer i.
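possible_answers <- 1:5
for (q in colnames(df)) {
  print(q)
  for (i in possible_answers) {
    if (!i %in% filter(test, question == q)$answer) {
      print(i)
      # add_row() keeps the column types; rbind(c(q, i, 0)) passes a character
      # vector and can turn the numeric columns into character
      test <- test %>%
        add_row(question = q, answer = i, count = 0L)
    }
  }
}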
Final data frame:
test %>%
  arrange(question, answer)
I am quite sure there is a more "elegant" way to perform this.
You can also build the summary data frame with base R data manipulation if you do not want to use the tidyverse package.
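As a rough illustration of the base R route, here is a minimal sketch that fills the count directly into the reference table, one row at a time. It assumes df contains all three columns (Y1, Y2, Y3); the names summary_tbl and Count are made up for this example.

```r
# Data to count from (assumes Y3 is part of df)
Y1 <- c(1,2,3,2,4,4,2,1,2,3,4,3,4,4,4,4,3,2,1,2)
Y2 <- c(1,2,1,2,1,1,2,1,1,2,2,1,2,1,1,1,2,1,2,1)
Y3 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
df <- data.frame(Y1, Y2, Y3)

# Reference table of variables and their possible values
summary_tbl <- data.frame(
  Variable = c("Y1", "Y1", "Y1", "Y1", "Y1", "Y2", "Y2", "Y3", "Y3"),
  Values   = c(1, 2, 3, 4, 5, 1, 2, 1, 2)
)

# For each row, count how often that value occurs in the named column of df;
# values that never occur get a count of 0 instead of being dropped
summary_tbl$Count <- mapply(
  function(v, x) sum(df[[v]] == x, na.rm = TRUE),
  as.character(summary_tbl$Variable),
  summary_tbl$Values,
  USE.NAMES = FALSE
)

summary_tbl
```

Because the loop runs over the rows of the reference table and not over the observed values, each variable keeps its own set of possible values (1-5 for Y1, 1-2 for Y2 and Y3).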
Tom

R: Create column showing days leading up to/ since the maximum value in another column was reached?

I have a dataset with repeated measures: measurements nested within participants (ID) nested in groups. A variable G (with range 0-100) was measured on the group-level. I want to create a new column that shows:
1. The first day on which the maximum value of G was reached in a group, coded as zero.
2. How many days each measurement (in this same group) occurred before or after the day on which the maximum was reached. For example: a measurement taken 2 days before the maximum is coded as -2, and a measurement 5 days after the maximum is coded as 5.
Here is an example of what I'm aiming for: Example
I highlighted the days on which the maximum value of G was reached in the different groups. The column 'New' is what I'm trying to get.
I've been trying with dplyr and I managed to get for each group the maximum with group_by, arrange(desc), slice. I then recoded those maxima into zero and joined this dataframe with my original dataframe. However, I cannot manage to do the 'sequence' of days leading up to/ days from the maximum.
EDIT: sorry I didn't include a reprex. I used this code so far:
To find the maximum value: First order by date
data <- data[with(data, order(G, Date)),]
Find maximum and join with original data:
data2 <- data %>%
  dplyr::group_by(Group) %>%
  arrange(desc(G), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()
data2$New <- data2$G
data2 <- data2 %>%
  dplyr::select(c("ID", "New", "Date"))
data3 <- full_join(data, data2, by=c("ID", "Date"))
data3$New[!is.na(data3$New)] <- 0
This gives me the maxima coded as zero and all the other measurements in column New as NA but not yet the number of days leading up to this, and the number of days since. I have no idea how to get to this.
It would help if you would be able to provide the data using dput() in your question, as opposed to using an image.
It looked like you wanted to group_by(Group) in your example to compute number of days before and after the maximum date in a Group. However, you have ID of 3 and Group of A that suggests otherwise, and maybe could be clarified.
Here is one approach using tidyverse I hope will be helpful. After grouping and arranging by Date, you can look at the difference in dates comparing to the Date where G is maximum (the first maximum detected in date order).
Also note, as.numeric is included to provide a number, as the result for New is a difftime (e.g., "7 days").
library(tidyverse)
data %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(New = as.numeric(Date - Date[which.max(G)]))
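Since the question did not include data, here is a small self-contained sketch of the same idea on made-up values (the data frame below is purely hypothetical, including a tie for the maximum in each group). which.max() returns the position of the first maximum, so after sorting by Date the earliest day with the maximum G is the one coded as zero.

```r
library(dplyr)

# Hypothetical data: G is measured at group level, with tied maxima
data <- data.frame(
  Group = c("A", "A", "A", "A", "B", "B", "B"),
  Date  = as.Date(c("2020-01-01", "2020-01-03", "2020-01-05", "2020-01-08",
                    "2020-02-01", "2020-02-02", "2020-02-06")),
  G     = c(10, 50, 50, 20, 70, 30, 70)
)

result <- data %>%
  group_by(Group) %>%
  arrange(Date, .by_group = TRUE) %>%
  # days relative to the first date on which G is at its group maximum
  mutate(New = as.numeric(Date - Date[which.max(G)])) %>%
  ungroup()

result$New
# -2 0 2 5 0 1 5
```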

In R, subset data using longest contiguous stretch of non-NA values

I am prepping data for linear regression and want to address missing values (NA) by using the longest contiguous stretch of non-NA values in a given year and site.
I have tried na.contiguous(), but my code is not applying the function by year or site.
Thanks for your assistance.
The test data is a multivariate time series that spans 2 years and 2 sites. My hope is that the solution will accommodate data with many more years and 32 sites, so some level of automation and QA/QC would be appreciated.
library(dataRetrieval)
library(dplyr)
library(lubridate) # provides ymd(), month() and year() used below
# read in Data, q is discharge and wt is stream temperature
wt<-readNWISdv(siteNumbers=c("08181800","07308500"),
parameterCd=c("00010","00060"), statCd=c("00003"),
startDate="1998-07-01", endDate="1999-09-30" )
dfwt<-wt%>%
group_by(site_no)%>%
select(Date,site_no,X_00010_00003,X_00060_00003)%>%
rename(wt=X_00010_00003,q=X_00060_00003)
#Subset summer season, add dummy air temp (at).
dfwt$Date<-ymd(dfwt$Date, tz=Sys.timezone())
dfwt$month<-month(dfwt$Date)
dfwt$year<-year(dfwt$Date)
df<- dfwt %>%
group_by(site_no)%>%
subset(month>=7 & month<=9)%>%
mutate(at=wt*1.3)
# add NA
df[35:38,3]<-NA
df[155,3]<-NA
df[194,3]<-NA
test<-df%>%
group_by(site_no, year)%>%
na.contiguous(df)
Using a for loop I found the following solution,
library(zoo)
library(plyr)
library(lubridate)
dfz <- zoo(df) # keep the zoo object; the loop below reads from dfz
sites <- as.vector(unique(df$site_no))
bfi_allsites <- data.frame()
for (i in seq_along(sites)) { # loop over every site, not just the first two
  site1 <- subset(dfz, site_no == sites[i])
  str(site1)
  ss1 <- split(site1, site1$year)
  site1result <- lapply(ss1, na.contiguous) #works
  site_df <- ldply(site1result, data.frame)
  bfi_allsites <- rbind(bfi_allsites, site_df)
}
head(bfi_allsites)
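For many sites and years, a loop-free alternative may be easier to check. The following is only a sketch in base R, not a drop-in replacement for the code above: it marks, within each site and year, the rows that belong to the longest run of non-NA values in one column, using rle(). The helper longest_run() and the toy data frame are hypothetical; like na.contiguous(), the helper keeps the first run when there is a tie.

```r
# TRUE for the elements that belong to the longest run of non-NA values
longest_run <- function(x) {
  ok <- !is.na(x)
  r <- rle(ok)
  keep <- rep(FALSE, length(x))
  if (!any(r$values)) return(keep)        # everything is NA
  ends   <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  lens   <- ifelse(r$values, r$lengths, 0) # only non-NA runs compete
  best   <- which.max(lens)                # first longest run wins ties
  keep[starts[best]:ends[best]] <- TRUE
  keep
}

# Toy data: 2 sites x 2 years, 3 observations each
toy <- data.frame(
  site_no = rep(c("s1", "s2"), each = 6),
  year    = rep(c(1998, 1998, 1998, 1999, 1999, 1999), 2),
  wt      = c(1, NA, 3, 4, 5, NA, NA, 2, 3, 1, NA, 2)
)

# ave() applies the helper within each site/year combination
keep <- as.logical(ave(toy$wt, toy$site_no, toy$year, FUN = longest_run))
toy_contiguous <- toy[keep, ]
toy_contiguous
```

The rows are assumed to be in date order within each site and year. With several measured columns, the helper could be applied to a combined indicator such as complete.cases() converted to NA/non-NA, depending on which variables the regression needs.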

Apply if function to identify the value of variable based on the value of another variable

I am trying to identify the value of a variable in an R data frame conditional on the value of another variable, but I am unable to do it.
Basically, I am giving 3 different doses (Dose) of a vaccine to three groups of animals (5 animals per group, Total) and recording the result as Protec, the number of protected animals in each group. From Protec, I calculate the proportion protected (Protec/Total) as Prop for each Dose group. For example:
library(dplyr)
Dose=c(1,0.25,0.0625);Dose #Dose or Dilution
Protec=c(5,4,3);Protec
Total=c(rep(5,3));Total
df=as.data.frame(cbind(Dose,Protec,Total));df
df=df %>% mutate(Prop=Protec/Total);df
Question is, what is the log10 of minimum value of Dose for which Prop==1, which can be found using the following code
X0=log10(min(df$Dose[df$Prop==1.0]));X0
The result should be X0=0.
If Protec=c(5,5,3), Prop becomes c(1.0,1.0,0.6) and X0 should be -0.60206.
If Protec=c(5,5,5), Prop becomes c(1.0,1.0,1.0), for which I want X0=0.
If Protec=c(5,4,5), Prop becomes c(1.0,0.8,1.0), and I also want X0=0, because I consider the results unordered and take the highest dose for calculating X0.
I think this requires an if statement, but I don't know how to write the conditions.
Can someone explain how to do it in R? Thank you in advance.
We can use mutate_at to apply the calculation to multiple columns whose names start with 'Protec':
library(dplyr)
df1 <- df %>%
mutate_at(vars(starts_with("Protec")), list(Prop = ~./Total))
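The mutate_at call computes the proportions but does not return X0 itself. As a rough sketch only: the four cases listed in the question are all reproduced by the rule below, which is an assumption read off those cases and not something the question states. If the lowest dose is fully protected (Prop == 1), use the highest dose; otherwise use the lowest dose with Prop == 1. The function name x0_from_protec is hypothetical.

```r
# Assumed rule (inferred from the four cases in the question, not confirmed):
#  - no dose with full protection      -> NA
#  - lowest dose has full protection   -> log10 of the highest dose
#  - otherwise                         -> log10 of the lowest fully protected dose
x0_from_protec <- function(dose, protec, total) {
  prop <- protec / total
  full <- prop == 1
  if (!any(full)) return(NA_real_)
  if (full[which.min(dose)]) return(log10(max(dose)))
  log10(min(dose[full]))
}

dose  <- c(1, 0.25, 0.0625)
total <- rep(5, 3)

x0_a <- x0_from_protec(dose, c(5, 4, 3), total)  # 0
x0_b <- x0_from_protec(dose, c(5, 5, 3), total)  # log10(0.25), about -0.60206
x0_c <- x0_from_protec(dose, c(5, 5, 5), total)  # 0
x0_d <- x0_from_protec(dose, c(5, 4, 5), total)  # 0
```

If the intended rule differs (for example, for Protec=c(4,5,5)), only the second if condition would need to change.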

recording time a taxa first appears: nested loops and conditional statements in R

Here is some hypothetical data resembling my own. The environmental data describes the metadata of the community data, which is made up of taxa abundances over years in different treatments.
#Elements of Environmental (meta) data
nTrt<-2
Trt<-c("High","High","High","Low","Low","Low")
Year<-c(1,2,3,1,2,3)
EnvData<-cbind(Trt,Year)
#Elements of community data
nTaxa<-2
Taxa1<-c(0,0,2,50,3,4)
Taxa2<-c(0,34,0,0,0,23)
CommData<-cbind(Taxa1,Taxa2)
#Elements of ideal data produced
Ideal_YearIntroduced<-array(0,dim=c(nTrt,nTaxa))
Taxa1_i<-c(2,1)
Taxa2_i<-c(2,3)
IdealData<-cbind(Taxa1_i,Taxa2_i)
rownames(IdealData)<-c("High","Low")
I want to know the Year (in EnvData) in which a given taxon first appears in a particular treatment, i.e. the "introduction year". That is, if the taxon is present in year 1, I want to record "1" in a Treatment x Taxa array, but if that taxon does not arrive in that treatment until year 3 (which means it meets the condition that it is absent in year 2), I want to record year 3.
So I want these conditional statements to only loop within a treatment. In other words, I do not want a taxon to be recorded as "introduced" if it is 0 in year 3 of one treatment and present in year 1 of the next.
I've approached this with several for loops, but the loops and conditional statements are getting out of hand, and there is now an error that I can't figure out. I may not be thinking of the i's and j's correctly.
The real data is more complicated than this: it has 6 years, 1102 taxa and many treatments.
#Get the index number where each treatment starts
Index<-which(EnvData[,2]==1)
TaxaIntro<-array(0,dim=dim(CommData)) #Array to hold results
for (i in 1:length(Index)) { #Loop through treatment (start at year 1 each time)
  for (j in 1:3) { #Loop through years within a treatment
    for (k in 1:ncol(CommData)) { #Loop through Taxa
      if (CommData[Index[i],1]>0 ) { #If Taxa is present in Year 1...want to save that it was introduced at Year 1
        TaxaIntro[i,k]<-EnvData[Index[i],2]
      }
      if (CommData[Index[i+j]]>0 && CommData[Index[((i+j)-j)]] ==0) { #Or if taxa is present in a year AND absent in the previous year
        TaxaIntro[i,k]<-EnvData[Index[i+j],2]
      }
    }
  }
}
With this example, I get an error related to my second conditional statement...I may be going about this the wrong way.
Any help would be greatly appreciated. I am open to other (non-loop) approaches, but please explain thoroughly as I'm not so well-versed.
Current error:
Error in if (CommData[Index[i + j]] > 0 & CommData[Index[((i + j) - j)]] == :
missing value where TRUE/FALSE needed
Based on your example, I think you could combine your environmental and community data into a single data.frame. Then you might approach your problem using functions from the package dplyr.
# Make combined dataset
dat = data.frame(EnvData, CommData)
Since you want to do the work separately for each Trt, you'll want group_by that variable to do everything separately by group.
Then the problem is to find the first time each one of your Taxa columns contains a value greater than 0 and record which year that is. Because you want to do the same thing for many columns, you can use summarise_each. To get the desired summary, I used the function first to choose the first instance of Year where whatever Taxa column you are working with is greater than 0. The . refers to the Taxa columns. The last thing I did in summarise_each is to choose which columns I wanted to do this work on. In this case, you want to do this for all your Taxa columns, so I chose all columns that starts_with the word Taxa.
With chaining, this looks like:
library(dplyr)
dat %>%
  group_by(Trt) %>%
  summarise_each(funs(first(Year[. > 0])), starts_with("Taxa"))
The result is slightly different than yours, but I think this is correct based on the data provided (Taxa1 in High first seen in year 3 not year 2).
Source: local data frame [2 x 3]
   Trt Taxa1 Taxa2
1 High     3     2
2  Low     1     3
The above code assumes that your dataset is already in order by Year. If it isn't, you can use arrange to set the order before summarising.
If you aren't used to chaining, the following code is the equivalent to above.
groupdat = group_by(dat, Trt)
summarise_each(groupdat, funs(first(Year[. > 0])), starts_with("Taxa"))
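summarise_each() and funs() have since been deprecated in dplyr. As a hedged sketch of the same summary with across() (this assumes dplyr 1.0.0 or later, and builds the data frame directly with a numeric Year, which is a small departure from the cbind() version in the question):

```r
library(dplyr)

# Combined environmental and community data from the question
dat <- data.frame(
  Trt   = c("High", "High", "High", "Low", "Low", "Low"),
  Year  = c(1, 2, 3, 1, 2, 3),
  Taxa1 = c(0, 0, 2, 50, 3, 4),
  Taxa2 = c(0, 34, 0, 0, 0, 23)
)

intro <- dat %>%
  arrange(Trt, Year) %>%   # make sure years are in order within each treatment
  group_by(Trt) %>%
  # for every Taxa column: the first Year in which abundance is above 0
  summarise(across(starts_with("Taxa"), ~ first(Year[.x > 0])), .groups = "drop")

intro
```

This gives year 3 and 2 for Taxa1 and Taxa2 in High, and 1 and 3 in Low, matching the table above. A taxon that never appears in a treatment yields NA, because first() of an empty vector returns a missing value.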
