I have a dataframe DF, with two columns A and B shown below:
A B
1 0
3 0
4 0
2 1
6 0
4 1
7 1
8 1
1 0
A sliding window approach is performed as shown below. The mean is calculated for column B in a sliding window of size 3 sliding by 1 using: rollapply(DF$B, width=3, FUN=mean, by=1). The mean value for each window is shown to the right of that window.
A: 1 3 4 2 6 4 7 8 1
B: 0 0 0 1 0 1 1 1 0
[0 0 0] 0
[0 0 1] 0.33
[0 1 0] 0.33
[1 0 1] 0.66
[0 1 1] 0.66
[1 1 1] 1
[1 1 0] 0.66
output: 0 0.33 0.33 0.66 0.66 1 1 1 0.66
Now, for each row/coordinate in column A, all windows containing the coordinate are considered and should retain the highest mean value which gives the results as shown in column 'output'.
I need to obtain the output as shown above. The output should look like:
A B Output
1 0 0
3 0 0.33
4 0 0.33
2 1 0.66
6 0 0.66
4 1 1
7 1 1
8 1 1
1 0 0.66
Any help in R?
Try this:
# form input data
library(zoo)
B <- c(0, 0, 0, 1, 0, 1, 1, 1, 0)
# calculate
k <- 3
rollapply(B, 2*k-1, function(x) max(rollmean(x, k)), partial = TRUE)
The last line returns:
[1] 0.0000000 0.3333333 0.3333333 0.6666667 0.6666667 1.0000000 1.0000000
[8] 1.0000000 0.6666667
If there are NA values you might want to try this:
k <- 3
B <- c(1, 0, 1, 0, NA, 1)
rollapply(B, 2*k-1, function(x) max(rollapply(x, k, mean, na.rm = TRUE)), partial = TRUE)
where the last line gives this:
[1] 0.6666667 0.6666667 0.6666667 0.5000000 0.5000000 0.5000000
Expanding it out these are formed as:
c(mean(B[1:3], na.rm = TRUE), ##
max(mean(B[1:3], na.rm = TRUE), mean(B[2:4], na.rm = TRUE)), ##
max(mean(B[1:3], na.rm = TRUE), mean(B[2:4], na.rm = TRUE), mean(B[3:5], na.rm = TRUE)),
max(mean(B[2:4], na.rm = TRUE), mean(B[3:5], na.rm = TRUE), mean(B[4:6], na.rm = TRUE)),
max(mean(B[3:5], na.rm = TRUE), mean(B[4:6], na.rm = TRUE)), ##
mean(B[4:6], na.rm = TRUE)) ##
If you don't want the k-1 components at each end (marked with ## above) drop partial = TRUE.
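For comparison, here is a minimal base R sketch of the same idea without zoo. The helper name max_window_mean is made up for illustration; it assumes no NA values and at least k elements, and it only uses complete windows, so the ends behave like the partial = TRUE results above.

```r
# Hypothetical helper (not part of zoo): for each position, return the largest
# mean among all complete width-k windows that contain that position.
max_window_mean <- function(x, k = 3) {
  n <- length(x)
  stopifnot(n >= k)
  starts <- seq_len(n - k + 1)
  # Mean of each complete window x[s:(s + k - 1)]
  win_means <- vapply(starts, function(s) mean(x[s:(s + k - 1)]), numeric(1))
  vapply(seq_len(n), function(i) {
    # Windows covering position i start somewhere in (i - k + 1):i
    covering <- starts[starts >= i - k + 1 & starts <= i]
    max(win_means[covering])
  }, numeric(1))
}

DF <- data.frame(A = c(1, 3, 4, 2, 6, 4, 7, 8, 1),
                 B = c(0, 0, 0, 1, 0, 1, 1, 1, 0))
DF$Output <- max_window_mean(DF$B, k = 3)
DF
```

This attaches the result directly as the Output column asked for in the question.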
The R library TTR has a number of functions for calculating averages over sliding windows
SMA = simple moving average
data$sma <- SMA(data$B, 3)
More documentation is here http://cran.r-project.org/web/packages/TTR/TTR.pdf
I have a dataframe "data" containing 10 variables A to J (which all contain 0s and 1s) and 500 rows:
I need to make a second set of 10 variables AY to JY based on the variables A to J such that:
for AY, if A==1 then AY takes the value 1 with 80% probability and if A==0 then AY takes the value 1 with 20% probability
for BY, if B==1 then BY takes the value 1 with 80% probability and if B==0 then BY takes the value 1 with 20% probability
And so on...
Right now, I have the variables A to J stored in the dataframe "data", and have the following as my code:
out <- paste0(LETTERS[1:10], "Y")
data2 <- data.frame(data)
colnames(data2) <- out
for (i in out) {
data2[i] <- ifelse(**???**, rbinom(length(out), 1, 0.8), rbinom(length(out), 1, 0.2))
}
What would I write instead of the question marks to denote "if any value in the list of variables AY:JY is equal to 1, execute the first argument, otherwise execute the second argument"?
Please find below one solution that should work.
Starting data
set.seed(4854)
df <- data.frame("A" = sample(c(0,1), 500, replace = TRUE),
"B" = sample(c(0,1), 500, replace = TRUE),
"C" = sample(c(0,1), 500, replace = TRUE),
"D" = sample(c(0,1), 500, replace = TRUE),
"E" = sample(c(0,1), 500, replace = TRUE),
"F" = sample(c(0,1), 500, replace = TRUE),
"G" = sample(c(0,1), 500, replace = TRUE),
"H" = sample(c(0,1), 500, replace = TRUE),
"I" = sample(c(0,1), 500, replace = TRUE),
"J" = sample(c(0,1), 500, replace = TRUE)
)
Saving original data
df2 <- df
Use apply to run a function over every cell that randomly draws
a 1 or a 0 with the probabilities you indicated,
according to whether the original value is 0 or 1
df2 <- apply(df2, c(1,2), function (x)
ifelse(
x == 1,
sample(c(0, 1), 1, prob = c(0.2, 0.8), replace = TRUE),
sample(c(0, 1), 1, prob = c(0.8, 0.2), replace = TRUE)
))
Renaming of columns
colnames(df2) <- paste0(colnames(df),"Y")
Output
head(df2)
#> AY BY CY DY EY FY GY HY IY JY
#> [1,] 1 0 0 1 0 0 0 1 0 1
#> [2,] 0 1 0 0 0 0 0 0 1 1
#> [3,] 1 1 1 0 1 1 0 0 0 0
#> [4,] 1 0 1 0 1 1 1 1 1 0
#> [5,] 1 1 0 1 0 1 1 0 0 0
#> [6,] 0 0 0 1 1 1 1 1 0 1
Created on 2021-09-24 by the reprex package (v2.0.1)
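A vectorised alternative, as a rough sketch rather than a drop-in replacement: rbinom accepts a vector of probabilities, so every cell can be drawn in one call. The intermediate names m and p are made up for illustration, and the result is a data frame rather than the matrix that apply returns.

```r
set.seed(4854)
# Example 0/1 data with columns A to J (500 rows), as in the question
df <- as.data.frame(matrix(sample(c(0, 1), 500 * 10, replace = TRUE),
                           ncol = 10, dimnames = list(NULL, LETTERS[1:10])))

m <- as.matrix(df)
# Success probability per cell: 0.8 where the original is 1, 0.2 where it is 0
p <- ifelse(m == 1, 0.8, 0.2)
# One Bernoulli draw per cell; rbinom returns a plain vector in column-major
# order, so it is reshaped to the original dimensions
df2 <- as.data.frame(matrix(rbinom(length(p), size = 1, prob = p),
                            nrow = nrow(m),
                            dimnames = list(NULL, paste0(colnames(m), "Y"))))
head(df2)
```

Because the draws happen in a single call, this avoids calling sample once per cell, which may matter for larger data.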
I've a database with 4 columns and 8 observations:
> df1
Rater1 Rater2 Rater4 Rater5
1 3 3 3 3
2 3 3 2 3
3 3 3 2 2
4 0 0 1 0
5 0 0 0 0
6 0 0 0 0
7 0 0 1 0
8 0 0 0 0
I would like to have the mean, median, iqr, sd of all Rater1 and Rater4 observations (16) and all Rater2 and Rater5 observations (16) without creating a new df with 2 variables like this:
> df2
var1 var2
1 3 3
2 3 3
3 3 3
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 3 3
10 2 3
11 2 2
12 1 0
13 0 0
14 0 0
15 1 0
16 0 0
I would like to obtain this (without a new database, just working on the first database):
> stat.desc(df2)
var1 var2
nbr.val 16.0000000 16.0000000
nbr.null 8.0000000 10.0000000
nbr.na 0.0000000 0.0000000
min 0.0000000 0.0000000
max 3.0000000 3.0000000
range 3.0000000 3.0000000
sum 18.0000000 17.0000000
median 0.5000000 0.0000000
mean 1.1250000 1.0625000
SE.mean 0.3275541 0.3590352
CI.mean.0.95 0.6981650 0.7652653
var 1.7166667 2.0625000
std.dev 1.3102163 1.4361407
coef.var 1.1646367 1.3516618
How can I do this in R?
Thank you in advance
Another solution, using a for loop to compute the statistics in one go:
First, create vectors for the raters you want to combine:
# Raters 1 and 4:
r14 <- as.integer(unlist(df1[,c("Rater1", "Rater4")]))
# Raters 2 and 5:
r25 <- as.integer(unlist(df1[,c("Rater2","Rater5")]))
Combine these vectors in a dataframe:
df <- data.frame(r14, r25)
And calculate the statistics:
for(i in 1:ncol(df)){
# prints mean, IQR, median and sd for each column
print(c(mean(df[,i]), IQR(df[,i]), median(df[,i]), sd(df[,i])))
}
[1] 1.125000 2.250000 0.500000 1.310216
[1] 1.062500 3.000000 0.000000 1.436141
A possible base approach:
df <- data.frame( # construct your original dataframe
Rater1 = c(3, 3, 3, 0, 0, 0, 0, 0),
Rater2 = c(3, 3, 3, 0, 0, 0, 0, 0),
Rater4 = c(3, 2, 2, 1, 0, 0, 1, 0),
Rater5 = c(3, 3, 2, 0, 0, 0, 0, 0)
)
combined <- data.frame( # make a new dataframe with your desired variables
R14 = with(df, c(Rater1, Rater4)),
R25 = with(df, c(Rater2, Rater5))
)
sapply(combined, mean) # compute mean of each column
sapply(combined, median) # median
sapply(combined, sd) # standard deviation
sapply(combined, IQR) # interquartile range
We can loop over the column names that are similar, convert to a vector and get the mean, median, IQR and sd
out <- do.call(rbind, Map(function(x, y) {v1 <- c(df1[[x]], df1[[y]])
data.frame(Mean = mean(v1), Median = median(v1),
IQR = IQR(v1), SD = sd(v1))}, names(df1)[1:2], names(df1)[3:4]))
row.names(out) <- paste(names(df1)[1:2], names(df1)[3:4], sep="_")
out
# Mean Median IQR SD
#Rater1_Rater4 1.1250 0.5 2.25 1.310216
#Rater2_Rater5 1.0625 0.0 3.00 1.436141
data
df1 <- structure(list(Rater1 = c(3, 3, 3, 0, 0, 0, 0, 0), Rater2 = c(3,
3, 3, 0, 0, 0, 0, 0), Rater4 = c(3, 2, 2, 1, 0, 0, 1, 0), Rater5 = c(3,
3, 2, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA,
-8L))
A tidyverse/dplyr solution.
library(dplyr)
# stack Rater1 on top of Rater4, and Rater2 on top of Rater5
bind_rows(select(df, r14 = Rater1, r25 = Rater2),
select(df, r14 = Rater4, r25 = Rater5)) %>%
summarise_all(list(
mean = mean,
median = median,
sd = sd,
iqr = IQR
))
#> r14_mean r25_mean r14_median r25_median r14_sd r25_sd r14_iqr r25_iqr
#> 1 1.125 1.0625 0.5 0 1.310216 1.436141 2.25 3
In case you want the output similar to the one in your question, use t() to transpose the result.
t(.Last.value)
I am trying to calculate row means to create an average variable from 3 assessment points. I want to include cases that have 2 or 3 measurement points, but not those that have only one.
For example,
> a <- c(1,0,NA,1,NA,0,1,0,NA,0,NA)
> b <- c(1,0,NA,1,0,1,1,1,NA,0,1)
> c <- c(1,NA,NA,0,NA,0,1,1,1,0,0)
> mydata <- data.frame(a,b,c)
> mydata$M <- rowMeans(subset(mydata, select = c(1:3)), na.rm = TRUE)
> mydata$M
The current output produces a list of means for all the rows except one, which had 3 NAs:
[1] 1.00 0.00 NaN 0.66 0.00 0.33 1.00 0.66 1.00 0.00 0.50
However, my desired output would be:
[1] 1.00 0.00 NaN 0.66 NaN 0.33 1.00 0.66 NaN 0.00 0.50
Such that only rows with at least two data points would be used to calculate the mean, rather than returning the single data point as the row mean.
This is a complicated rule and I'm not sure how to define it. Any help would be appreciated.
(This is a data set with several thousand rows, so doing it manually is unthinkable!)
Thank you!
Sophie
You could make a function that applies a mean to a row based on some condition. In your example, if there are two or more valid measurements, calculate mean.
a <- c(1,0,NA,1,NA,0,1,0,NA,0,NA)
b <- c(1,0,NA,1,0,1,1,1,NA,0,1)
c <- c(1,NA,NA,0,NA,0,1,1,1,0,0)
mydata <- data.frame(a,b,c)
Functions are best read from the inside out. This one takes a vector x and counts how many of its values are not NA: summing the TRUE/FALSE result of !is.na(x) treats TRUE as 1 and FALSE as 0. It then tests whether that count is greater than 1 (that is, 2 or more non-NA values); if so it returns the mean, otherwise NA.
conditionalMean <- function(x) {
if (sum(!is.na(x)) > 1) {
mean(x, na.rm = TRUE)
} else {
NA
}
}
We apply this function to your data.frame row-wise, as denoted by MARGIN = 1. If you had a function that worked column-wise, you would use MARGIN = 2. You can try it out. Compare apply(mydata, MARGIN = 2, FUN = mean, na.rm = TRUE) and colMeans(mydata, na.rm = TRUE).
apply(mydata, MARGIN = 1, FUN = conditionalMean)
[1] 1.0000000 0.0000000 NA 0.6666667 NA 0.3333333 1.0000000
[8] 0.6666667 NA 0.0000000 0.5000000
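For several thousand rows, a vectorised variant may be quicker than apply. This is a minimal sketch under the same rule (at least two non-NA values per row); the names cols and n_valid are made up for illustration.

```r
a <- c(1, 0, NA, 1, NA, 0, 1, 0, NA, 0, NA)
b <- c(1, 0, NA, 1, 0, 1, 1, 1, NA, 0, 1)
c <- c(1, NA, NA, 0, NA, 0, 1, 1, 1, 0, 0)
mydata <- data.frame(a, b, c)

cols <- c("a", "b", "c")
# Number of non-missing measurements in each row
n_valid <- rowSums(!is.na(mydata[, cols]))
# Row means ignoring NA, then blank out rows with fewer than two measurements
M <- rowMeans(mydata[, cols], na.rm = TRUE)
M[n_valid < 2] <- NA
mydata$M <- M
mydata$M
```

Changing the threshold in n_valid < 2 adapts the rule to a different minimum number of measurement points.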
You can try something like this with dplyr:
library(dplyr)
mydata %>%
mutate(row_mean = ifelse((is.na(a)+is.na(b)+is.na(c)) %in% 2:3, NaN,
rowMeans(.[1:3], na.rm = TRUE)))
Result:
a b c row_mean
1 1 1 1 1.0000000
2 0 0 NA 0.0000000
3 NA NA NA NaN
4 1 1 0 0.6666667
5 NA 0 NA NaN
6 0 1 0 0.3333333
7 1 1 1 1.0000000
8 0 1 1 0.6666667
9 NA NA 1 NaN
10 0 0 0 0.0000000
11 NA 1 0 0.5000000
I've been breaking my head all morning over how to do this.
So let's say this is my data set
set.seed(1)
temp <- as.data.frame(cbind(Key = letters[1:5], sapply(1:12, function(x) sample(c(0, 1), 5, replace = T))))
names(temp)[2:13] <- month.abb
temp
# Key Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1 a 0 1 0 0 1 0 0 1 1 1 0 0
# 2 b 0 1 0 1 0 0 1 1 1 0 1 0
# 3 c 1 1 1 1 1 0 0 0 1 0 0 1
# 4 d 1 1 0 0 0 1 0 1 1 1 0 1
# 5 e 0 0 1 1 0 0 1 0 1 1 0 0
What I'm trying to do is to calculate the percentage of occurrences (1s) in two consecutive months.
For example, c and d had an occurrence in Jan. Both had an occurrence in Feb too, so the output will be 1 for this month. In Feb, a-d had occurrences but only c also had an occurrence in Mar, so the output will be .25 for that month, etc.
Desired output for that mini example:
data.frame(Month = month.abb[1:11], OverlapPercent = c(1, 1/4, 1, 1/3, 0, 0, 1/2, 1, 3/5, 0, 0))
# Month OverlapPercent
# 1 Jan 1.0000000
# 2 Feb 0.2500000
# 3 Mar 1.0000000
# 4 Apr 0.3333333
# 5 May 0.0000000
# 6 Jun 0.0000000
# 7 Jul 0.5000000
# 8 Aug 1.0000000
# 9 Sep 0.6000000
# 10 Oct 0.0000000
# 11 Nov 0.0000000
I was thinking of somehow using rle for it, but I'm not sure how to force it to stop at two occurrences each time.
Unless I'm missing something, the following looks valid:
#just to remove 'factor's from "temp"
tmp = do.call(cbind.data.frame, c(temp[1], lapply(temp[-1], function(x) as.numeric(as.character(x)))))
sapply(head(seq_len(ncol(tmp))[-1], -1),
function(i) sum(tmp[[i]] & tmp[[i+1]]) / sum(tmp[[i]]))
#[1] 1.0000000 0.2500000 1.0000000 0.3333333 0.0000000 0.0000000 0.5000000 1.0000000 0.6000000 0.0000000 0.0000000
EDIT:
Out of curiosity I checked @Bathsheba's "bitwise AND" speed, and it seems to be faster than the "logical AND":
#identical results
sapply(head(seq_len(ncol(tmp))[-1], -1),
function(i) sum(bitwAnd(tmp[[i]], tmp[[i+1]])) / sum(tmp[[i]]))
#[1] 1.0000000 0.2500000 1.0000000 0.3333333 0.0000000 0.0000000 0.5000000 1.0000000 0.6000000 0.0000000 0.0000000
#twice as fast
library(microbenchmark)
x1 = sample(0:1, 1e6, TRUE); x2 = sample(0:1, 1e6, TRUE)
identical(sum(x1 & x2) / sum(x1), sum(bitwAnd(x1, x2)) / sum(x1))
#[1] TRUE
microbenchmark(sum(x1 & x2) / sum(x1), sum(bitwAnd(x1, x2)) / sum(x1), times = 50)
#Unit: milliseconds
# expr min lq median uq max neval
# sum(x1 & x2)/sum(x1) 23.95648 25.32448 25.78471 26.56232 49.18491 50
# sum(bitwAnd(x1, x2))/sum(x1) 10.97982 11.07309 11.20237 13.00450 35.67963 50
First fix up temp so that the 0/1 columns are numeric rather than factor. Then apply overlap to each pair of columns:
temp[-1] <- lapply(temp[-1], function(x) as.numeric(as.character(x)))
overlap <- function(x, y) mean(y[x == 1])
data.frame(Month = month.abb[-12],
Overlap = sapply(2:12, function(i) overlap(temp[,i], temp[,i+1])))
The above is preferred as it keeps the independent parts of the solution separate; however, as an alternative we could omit the first line above (which fixes up the factors) and instead incorporate that into overlap like this:
overlap <- function(x, y) mean(as.numeric(as.character(y))[x == 1])
Note that the Overlaps are fractions (as per the output shown in the question) and not percents as the heading in the question suggests.
In pseudocode, represent each column as a binary number.
E.g. Jan = 0b00110 and Feb = 0b11110.
Your formula for Jan is then
Bitcount(Jan AND Feb) / Bitcount(Jan)
Where AND is the bitwise AND operator and Bitcount counts the number of 1 bits in the number. (I can supply a way of bit counting if you need it). Of course, the formula for other months is a trivial generalisation.
Obviously you'll need a branch for the denominator being zero: not well defined in your question.
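The formula and the zero-denominator branch can be sketched in R as follows. This works on 0/1 columns directly rather than on packed binary numbers, the function name overlap_share is made up for illustration, and returning NA for a month with no occurrences is an assumption, since the question does not define that case. The toy data are the first four months from the question's table.

```r
# Hypothetical helper: share of rows with a 1 in a month that also have a 1
# in the following month; NA when the month has no 1s at all (assumption)
overlap_share <- function(m) {
  m <- as.matrix(m)
  cur <- m[, -ncol(m), drop = FALSE]  # every month except the last
  nxt <- m[, -1, drop = FALSE]        # the month that follows each of them
  both <- colSums(cur == 1 & nxt == 1)
  base <- colSums(cur == 1)
  out <- ifelse(base == 0, NA_real_, both / base)
  setNames(out, colnames(cur))
}

occ <- data.frame(Jan = c(0, 0, 1, 1, 0),
                  Feb = c(1, 1, 1, 1, 0),
                  Mar = c(0, 0, 1, 0, 1),
                  Apr = c(0, 1, 1, 0, 1))
res <- overlap_share(occ)
res
```

Here res holds one value per month except the last, named after the earlier month of each pair.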
length(which(!xor(data["Feb"],data["Mar"]) & data["Feb"]==1)) / length(which(data["Feb"]==1))
!xor is the negated exclusive or.
length(which(...)) gives the number of true values in a logical vector.
I have a data frame like the one below, but with a lot more rows
> df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
> df
x1 x2
1 1 a
2 1 a
3 0 b
4 0 a
5 1 c
6 0 c
From df I want a data frame where the rows are the unique values of df$x2 and col1 is the proportion of 1s associated with each letter, and col2 is the count of each letter. So, my output would be
> getprops(df)
prop count
a .6666 3
b 0 1
c 0.5 2
I can think of some elaborate, dirty ways to do this, but I'm looking for something short and efficient. Thanks
I like @RicardoSaporta's solution (+1), but you can use ?prop.table as well:
> df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
> df
x1 x2
1 1 a
2 1 a
3 0 b
4 0 a
5 1 c
6 0 c
> tab <- table(df$x2, df$x1)
> tab
0 1
a 1 2
b 1 0
c 1 1
> ptab <- prop.table(tab, margin=1)
> ptab
0 1
a 0.3333333 0.6666667
b 1.0000000 0.0000000
c 0.5000000 0.5000000
> dframe <- data.frame(values=rownames(tab), prop=ptab[,2], count=rowSums(tab))
> dframe
values prop count
a a 0.6666667 3
b b 0.0000000 1
c c 0.5000000 2
If you'd like, you can put this together into a single function:
getprops <- function(values, indicator){
tab <- table(values, indicator)
ptab <- prop.table(tab, margin=1)
# prop: share of 1s per value; count: number of rows per value
dframe <- data.frame(values=rownames(tab), prop=ptab[,2], count=rowSums(tab))
return(dframe)
}
> getprops(values=df$x2, indicator=df$x1)
values prop count
a a 0.6666667 3
b b 0.0000000 1
c c 0.5000000 2
Try installing plyr and running
library(plyr)
df <- data.frame(x1=c(1, 1, 0, 0, 1, 0),
label=c("a", "a", "b", "a", "c", "c"))
ddply(df, .(label), summarize, prop = mean(x1), count = length(x1))
# label prop count
# 1 a 0.6666667 3
# 2 b 0.0000000 1
# 3 c 0.5000000 2
which under the hood applies a split/apply/combine method similar to this in base R:
do.call(rbind, lapply(split(df, df$label),
with, list(prop = mean(x1),
count = length(x1))))
Here is a one-liner in data.table:
> DT[, list(props=sum(x1) / .N, count=.N), by=x2]
x2 props count
1: a 0.6666667 3
2: b 0.0000000 1
3: c 0.5000000 2
where DT <- data.table(df)
I am not sure if this does what you want.
df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
ones <- with(df, aggregate(x1 ~ x2, FUN = sum))
count <- table(df$x2)
prop <- ones$x1 / count
df2 <- data.frame(prop, count)
df2
rownames(df2) <- df2[,3]
df2 <- df2[,c(2,4)]
colnames(df2) <- c('prop', 'count')
df2
prop count
a 0.6666667 3
b 0.0000000 1
c 0.5000000 2
Try using table
tbl <- table(df$x1, df$x2)
# a b c
# 0 1 1 1
# 1 2 0 1
tbl["1",] / colSums(tbl)
# a b c
# 0.6666667 0.0000000 0.5000000
For nice output use:
data.frame(proportions=tbl["1",] / colSums(tbl))
proportions
a 0.6666667
b 0.0000000
c 0.5000000
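The table approach above gives the proportions only. A minimal base R sketch that returns both columns asked for in the question, assuming x1 contains only 0s and 1s (the names groups and res are made up for illustration):

```r
df <- data.frame(x1 = c(1, 1, 0, 0, 1, 0),
                 x2 = c("a", "a", "b", "a", "c", "c"))

# Split the 0/1 indicator by letter
groups <- split(df$x1, df$x2)
# For a 0/1 vector the mean is the proportion of 1s; lengths() counts rows
res <- data.frame(prop = sapply(groups, mean),
                  count = lengths(groups))
res
```

The row names of res are the unique values of x2, which matches the layout of the desired getprops(df) output.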