Obtaining CCF values using a loop in R

I have a data frame which looks like this:
files  Time  Male  Female
    A   1.1     0       1
    A   1.2     0       1
    A   1.3     1       1
    A   1.4     1       0
    B   2.4     0       1
    B   2.5     1       1
    B   2.6     0       1
    B   2.7     1       1
The 'files' column represents the recording file names, 'Time' represents discretised time bins of 0.1 seconds, and the 'Male' and 'Female' columns indicate whether the male or female is calling (1) or not (0) during that time bin.
I want to find at which lags the female and male are most correlated for all the different recordings. More specifically, I want the output to be a data frame with three columns: recording file name, peak correlation score between female and male, and the lag value at which the peak correlation occurred.
So far I have been able to measure the peak cross-correlation for the files individually:
file1 <- dataframe %>% filter(files == unique(dataframe$files)[1])
# obtaining observations of the first file entry
Then I used the following function to find the peak correlation:
Find_Abs_Max_CCF <- function(a, b, e = 0) {
  d <- ccf(a, b, plot = FALSE, lag.max = length(a) / 2)
  cor    <- d$acf[, , 1]
  abscor <- abs(d$acf[, , 1])
  lag    <- d$lag[, , 1]
  res    <- data.frame(cor, lag)
  absres <- data.frame(abscor, lag)
  maxcor <- max(absres$abscor)
  absres_max <- res[which(absres$abscor >= maxcor - maxcor * e &
                          absres$abscor <= maxcor + maxcor * e), ]
  return(absres_max)
}
Find_Abs_Max_CCF(file1$Female, file1$Male, 0.05)
Is there a way to use a function or loop to automate the process so that I get the peak correlation value and its respective lag value for all the distinct file entries?
Any help is highly appreciated. Thanks in advance.
Edit:
I used the group_map() function with the following code:
part.cor <- dataframe %>% group_by(files) %>% group_map(~Find_Abs_Max_CCF(dataframe$Female, dataframe$Male, 0.05))
However, it returns the same peak correlation and lag values repeated throughout the output data frame.
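A minimal sketch of the grouped version, assuming the files/Male/Female column names from the example above: inside group_map() each group's rows arrive as .x (and its grouping key as .y), so the correlation has to be computed on .x rather than on the full data frame, which is why the attempt above repeats the same values for every file.
library(dplyr)

part.cor <- dataframe %>%
  group_by(files) %>%
  group_map(~ {
    res <- Find_Abs_Max_CCF(.x$Female, .x$Male, 0.05)
    # keep the single strongest peak and tag it with the recording name
    res[which.max(abs(res$cor)), , drop = FALSE] %>%
      mutate(files = .y$files)
  }) %>%
  bind_rows()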

Related

Filter 0 values and output chi square results to a data frame in R

I have data that consists of many Cases (over 600) where I have two independent assessments to compare. I want to determine, based on the relative abundance of species observed, whether the differences between the assessments are due to random variation (differing plot locations/methodologies) or due to human error. The assessments were conducted by a forest manager (FM; generally an ocular estimate) and the ministry responsible for validating the result (MNRF; intensive plot based survey). A result with a p-value <0.05 would indicate that it is either highly unlikely that the two samples were taken from the same population, or that the less intensive method is not sufficiently accurate.
Species composition has been converted to counts of trees by species based on the number of plots established by MNRF. There are several species that may be encountered, but in each case, there are generally less than 6. Species are identified by a two letter code (e.g. PJ = jack pine, BW = white birch). An example of a single case is:
> head(case545)
Case Source PJ SB BW PO BF SW PR LA MR CW PW
1 545 MNRF 68 21 17 15 1 0 0 0 0 0 0
2 545 FM 101 13 13 0 0 0 0 0 0 0 0
I can calculate the statistic I want for this case using the code:
chisq.test(rbind(c(68,21,17,15,1),c(101,13,13,0,0)))
My problem is that I have many, many cases and I can't figure out how to tell R which values to use in each case. As far as I can tell the logic flow should be:
1 - identify and eliminate species where both assessments recorded a value of 0
2 - ensure that the values are organized correctly for chisq.test
3 - run the test and output a new table with the X2 and p-value for each case
Any help is greatly appreciated.
This might be of use, but it might require some changes based on nuances in your data.
For this example I recreate two cases with the naming convention caseXXX:
case545 <- data.frame(Case = "545",
                      Source = c("XX", "X1"), PJ = c(68, 21), SB = c(17, 13), BW = c(1, 0), SW = c(0, 0))
case546 <- data.frame(Case = "546",
                      Source = c("XX", "X1"), PJ = c(100, 300), SB = c(0, 0), BW = c(400, 0), SW = c(300, 500))
We then collect the names of all the data.frames with that naming convention:
library(dplyr)
DF <- ls(pattern = "case")
We then apply a function to each of those data.frames and bind the rows together to make one single data.frame.
This function does what you are asking.
1-Get rid of columns with only 0s
2-Compute the statistical test
3-Give us the X2 statistic and the p-value as a data.frame
Output <- bind_rows(lapply(DF, function(DF) {
  TMP <- get(DF)
  # keep only the columns that are not all zeros
  TMP <- TMP %>%
    select(grep(pattern = F, colSums(TMP != 0) == 0))
  # chi-squared test on the two count rows (dropping Case and Source)
  TMP <- chisq.test(rbind(TMP[1, -c(1:2)], TMP[2, -c(1:2)]))
  TMP <- data.frame(X2 = TMP$statistic, p = TMP$p.value, case = DF)
  return(TMP)
}))
> Output
X2 p case
1 4.703423 9.520608e-02 case545
2 550.000000 3.706956e-120 case546
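If the cases instead live stacked in one data frame with a Case column (called all_cases here, a name assumed for illustration) rather than as separate caseXXX objects, a hedged variation of the same idea could split by Case:
Output <- bind_rows(lapply(split(all_cases, all_cases$Case), function(TMP) {
  # drop species columns where both assessments recorded 0
  TMP <- TMP[, c(TRUE, TRUE, colSums(TMP[, -c(1:2)]) != 0)]
  tst <- chisq.test(rbind(TMP[1, -c(1:2)], TMP[2, -c(1:2)]))
  data.frame(X2 = tst$statistic, p = tst$p.value, case = TMP$Case[1])
}))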

R For Loop anomaly when expanding the range

Assume the following dataframe:
Application <- c('A','A','B','B','B','C','C','D')
Rating <- c('0','0.6','0.6','2.0','2.0','3.8','3.8','3.9')
DF <- data.frame(Application,Rating)
DF
#  Application Rating
#1           A      0
#2           A    0.6
#3           B    0.6
#4           B    2.0
#5           B    2.0
#6           C    3.8
#7           C    3.8
#8           D    3.9
I want to create an empty results table to be populated through a loop:
1st column - to show the rating being counted (e.g. 0.6)
2nd column - to show the number of times that rating occurs in DF
3rd column - to list total number of ratings in DF (i.e. 8)
4th column - to calculate the proportion of the applications with that rating relative to the overall
# create empty results table
results_rating_bins <- as.data.frame(matrix(nrow = 1, ncol = 4))
# initiate row count
rownr <- 1
# Loop:
for (rating in seq(from = 0, to = 4.0, by = 0.1)) {
  this_rating <- subset(DF, DF$Rating == rating)
  results_rating_bins[rownr, 1] <- rating
  results_rating_bins[rownr, 2] <- nrow(this_rating)
  results_rating_bins[rownr, 3] <- nrow(DF)
  results_rating_bins[rownr, 4] <- nrow(this_rating) / nrow(DF)
  rownr <- rownr + 1
}
The final result is what I expect, except for rating 2.0, where the count is 0 even though it should be 2.
This illustrates at small scale what I see at larger scale with a 30k-line dataset. I have a list of apps with ratings going from 0 to 4.9, so the range in my loop would be set to 0 to 4.9 instead of 0 to 4.0 as in my example. However, when I run the loop on the large dataset I end up with a number of instances where the rating count is 0 even though it shouldn't be. What's even more odd is that by playing around with the ranges, the ratings where the anomaly (i.e. count = 0) happens vary completely randomly.
Any idea what might justify this type of behaviour?
Typically I answer the questions as asked, trying to work through the logic a question poster is already using. However, in this case, it is so much easier to use dplyr to aggregate into the new table that I am breaking with tradition.
require(dplyr)
Application <- c('A','A','B','B','B','C','C','D')
Rating <- c('0','0.6','0.6','2.0','2.0','3.8','3.8','3.9')
DF <- data.frame(Application, Rating)

df2 <- DF %>%
  group_by(Application, Rating) %>%
  summarize(ratio = n() / nrow(DF))
The first part is the same as yours, but with the library call added.
Where df2 starts, you are setting the df2 data frame equal to a grouped version of your initial data frame based on the combinations of Application and Rating. In the summarize statement, for each possible combination, we tell it to count the number of rows n() and divide it by the total number of rows in the original data frame nrow(DF). This creates the third column of your new table: the percent of the total that each pair represents.
It looks like this, and you could add a column with the total number of rows if you need it, but it is not necessary to perform this calculation.
Application Rating ratio
1 A 0 0.125
2 A 0.6 0.125
3 B 0.6 0.125
4 B 2.0 0.250
5 C 3.8 0.250
6 D 3.9 0.125
This will absolutely catch every combination of Application and Rating and calculate the ratio relative to the whole data frame.
EDIT: If you do not care about the Application letter, you can simply remove it from the group_by() function and still get what you want.
And add
%>%
  mutate(rows = nrow(DF))
if you want the total number of rows in the frame on each row.
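If the goal is the four-column table described in the question (rating, count, total, ratio) rather than per-Application ratios, a hedged sketch of the same idea grouped by Rating alone could look like this:
library(dplyr)

rating_bins <- DF %>%
  group_by(Rating) %>%
  summarize(count = n(),             # times this rating occurs in DF
            total = nrow(DF),        # total number of ratings
            ratio = n() / nrow(DF))  # proportion of the whole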

R data frame, sampling with replacement while controling for two variables

I have the following data frame in R, with three variables:
id<-c(1,2,3,4,5,6,7,8,9,10)
frequency<-c(1,2,3,4,5,6,7,8,9,10)
male<-c(1,0,1,0,1,0,1,0,1,0)
df <- data.frame(id, frequency, male)
For df, the mean frequency is 5.5 and 50% of the observations are male. Now I want to take a random sample with replacement from df, of the same size, where the mean frequency of the new sample is 4 and the male proportion remains constant.
I wonder if there is any way to do such a thing in R.
Thanks in advance.
I cannot find any particular function for what you want, but the following will give the results you want. The combination of repeat and an if condition plays the same role as a while loop, and the other line draws a sample of size 4.
repeat {
  df.sample <- df[sample(nrow(df), size = 4, replace = FALSE), ]
  if (mean(df.sample$frequency) == 4.5 & mean(df.sample$male) == 0.5) {
    break
  }
}
The result is
> df.sample
id frequency male
4 4 4 0
2 2 2 0
9 9 9 1
3 3 3 1
For a while loop, df.sample must already exist before the condition is first checked:
df.sample <- df[sample(nrow(df), size = 4, replace = FALSE), ]
while (!(mean(df.sample$frequency) == 4.5 & mean(df.sample$male) == 0.5)) {
  df.sample <- df[sample(nrow(df), size = 4, replace = FALSE), ]
}
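A further hedged sketch, closer to the original request (sampling with replacement, same size as df, male proportion held constant, mean frequency near 4): stratify by sex and reject on the mean. The tolerance of 0.25 is an assumption, since an exact mean of 4 need not be attainable with these values.
males   <- df[df$male == 1, ]
females <- df[df$male == 0, ]
set.seed(1)
repeat {
  # 5 males and 5 females drawn with replacement keeps the proportion at 0.5
  df.sample <- rbind(males[sample(nrow(males), 5, replace = TRUE), ],
                     females[sample(nrow(females), 5, replace = TRUE), ])
  if (abs(mean(df.sample$frequency) - 4) <= 0.25) break
}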

aggregating counts per category

I have a dataset (df) where I would just like to get some summary stats for the entire column variables and then a summary for the variables of 2 specific treatments. So far so good:
summary(var1)
aggregate(var1 ~ treatment, results, summary)
I then have one variable whose values are 1 and 2. I can count these with the sum function:
sum(var3 == 1)
sum(var3 == 2)
However, when I try to sum these by treatment:
aggregate(var3 ~ treatment, results, sum var3 == 1)
I get the following error:
Error in sum == 1 :
comparison (1) is possible only for atomic and list types
I have tried lots of variations on the same theme and taken a look through the textbooks I am using to help me with my first forays into R... but I can't seem to find the answer.
Here's a sample dataset (it's always best to include sample data to make your question reproducible).
set.seed(15)
results <- data.frame(
  var1 = runif(30),
  var3 = sample(1:2, 30, replace = TRUE),
  treatment = gl(2, 15)
)
If you really want to use aggregate, you can do
aggregate(var3==1~treatment, results, sum)
# treatment var3 == 1
# 1 1 9
# 2 2 5
but since you're counting discrete observations, table() may be a better choice to do all the counting at once
with(results, table(var3, treatment))
# treatment
# var3 1 2
# 1 9 5
# 2 6 10
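As a small follow-up sketch (not part of the original answer): if you do want both counts from aggregate() in one call, a cbind() on the left-hand side of the formula should also work with the results data above.
# counts of var3 == 1 and var3 == 2 per treatment in a single aggregate() call
aggregate(cbind(var3 == 1, var3 == 2) ~ treatment, results, sum)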

Function to work out an average number of unique occurrences

I have the following code, which does what I want. But I would like to know if there is a simpler/nicer way of getting there?
The overall aim of me doing this is that I am building a separate summary table for the overall data, so the average which comes out of this will go into that summary.
Test <- data.frame(
  ID = c(1,1,1,2,2,2,3,3,3),
  Thing = c("Apple","Apple","Pear","Pear","Apple","Apple","Kiwi","Apple","Pear"),
  Day = c("Mon","Tue","Wed")
)
library(reshape2)  # for dcast()

countfruit <- function(data){
  df <- as.data.frame(table(data$ID, data$Thing))
  df <- dcast(df, Var1 ~ Var2)
  colnames(df) <- c("ID", "Apple", "Kiwi", "Pear")
  # fixing the counts to a 1 if there is any count there:
  df$Apple[df$Apple > 0] <- 1
  df$Kiwi[df$Kiwi > 0] <- 1
  df$Pear[df$Pear > 0] <- 1
  # making a new column in the summary table of how many fruits for each person
  df$number <- rowSums(df[2:4])
  return(mean(df$number))
}
result <- countfruit(Test)
I think you are overcomplicating the problem. Here is a smaller version keeping the same rationale:
df <- table(Test$ID, Test$Thing)
mean(rowSums(df > 0))  ## mean number of non-zero entries per row (per ID)
EDIT: a one-liner solution:
with(Test, mean(rowSums(table(ID, Thing) > 0)))
It looks like you are trying to count how many nonzero entries there are in each row. If so, either use as.logical(), which will convert any nonzero number to TRUE (aka 1), or just count the number of zeros in a row and subtract that from the number of pertinent columns.
For example, if I followed your code correctly, your dataframe is
  Var1 Apple Kiwi Pear
1    1     2    0    1
2    2     2    0    1
3    3     1    1    1
So (ncol(df) - 1) - sum(df[1, -1] == 0) gives you the count for the first row.
Alternatively, use as.logical to convert all nonzero values to TRUE (aka 1) and calculate the rowSums over the columns of interest.
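A hedged alternative sketch using dplyr instead of table()/dcast(): count the distinct fruits per ID with n_distinct(), then average those counts. Names are the ones from the Test data above.
library(dplyr)

Test %>%
  group_by(ID) %>%
  summarise(number = n_distinct(Thing)) %>%  # distinct fruits per person
  summarise(mean(number))                    # average across people (about 2.33 here)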
