R: Average nearby elements in a vector

I have many vectors such as this: c(28, 30, 50, 55, 99, 102) and I would like to obtain a new vector where elements differing less than 10 from one to another are averaged. In this case, I would like to obtain c(29, 52.5, 100.5).

Another way
vec <- c(28, 30, 50, 55, 99, 102)
indx <- cumsum(c(0, diff(vec)) > 10)
tapply(vec, indx, mean)
# 0 1 2
# 29.0 52.5 100.5
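The same idea can be wrapped in a small helper for reuse across many vectors. This is only a sketch: the name average_nearby and the gap argument are made up for illustration, and it assumes that "nearby" means consecutive sorted values less than gap apart, so a chain such as 1, 8, 15 ends up in a single group. It starts a new group when the difference is at least gap, whereas the snippet above uses a strict > 10; the two agree on the example data.

```r
# Hypothetical helper: average runs of sorted values whose consecutive
# differences are all below `gap`.
average_nearby <- function(vec, gap = 10) {
  vec <- sort(vec)                          # diff() only makes sense on sorted input
  grp <- cumsum(c(0, diff(vec)) >= gap)     # increment the group id at each large jump
  as.vector(tapply(vec, grp, mean))         # drop the group names
}

average_nearby(c(28, 30, 50, 55, 99, 102))
# [1]  29.0  52.5 100.5
```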

Related

Replace multiple column values based on the logical set of the same dataframe

I have a data frame df. In the columns c("PhysicalActivity_yn_agesurvey", "smoker_former_or_never_yn_agesurvey", "NOT_RiskyHeavyDrink_yn_agesurvey", "Not_obese_yn_agesurvey", "HEALTHY_Diet_yn_agesurvey"), I want to replace with NA every value that differs from df$SURVEY_MIN in the same row. How do I do that in R?
df <- structure(list(PhysicalActivity_yn_agesurvey = c(58, 47, 47,
50, 53, 59), smoker_former_or_never_yn_agesurvey = c(58, 47,
47, 50, 53, 59), NOT_RiskyHeavyDrink_yn_agesurvey = c(59, 48,
47, 50, 53, 59), Not_obese_yn_agesurvey = c(58, 47, 47, 50, 53,
59), HEALTHY_Diet_yn_agesurvey = c(58, 47, 47, 50, 53, 59), SURVEY_MIN = c(58,
47, 47, 50, 53, 59)), row.names = c(NA, 6L), class = "data.frame")
This is the code I tried:
df[lapply(df, function(x) ifelse(x != df$SURVEY_MIN, TRUE, FALSE))] <- NA
Also tried:
df[c("PhysicalActivity_yn_agesurvey", "smoker_former_or_never_yn_agesurvey", "NOT_RiskyHeavyDrink_yn_agesurvey",
"Not_obese_yn_agesurvey", "HEALTHY_Diet_yn_agesurvey")] [df[c("PhysicalActivity_yn_agesurvey", "smoker_former_or_never_yn_agesurvey", "NOT_RiskyHeavyDrink_yn_agesurvey",
"Not_obese_yn_agesurvey", "HEALTHY_Diet_yn_agesurvey")] != df$SURVEY_MIN] <- NA
Writing for loops is very bad practice in R (99% of the time)!
df[(df != df$SURVEY_MIN)]<-NA
will do the trick.
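If the data frame also holds columns that should not be touched, the comparison can be restricted to a chosen set of columns. A minimal sketch on a cut-down data frame (toy and its short column names a and b are invented here; only SURVEY_MIN comes from the question):

```r
# Toy stand-in for df: two survey columns plus the reference column
toy <- data.frame(a = c(58, 47, 47),
                  b = c(59, 48, 47),
                  SURVEY_MIN = c(58, 47, 47))

cols <- setdiff(names(toy), "SURVEY_MIN")   # columns to check
# Comparing a data frame with a vector of length nrow() works row by row
# within each column, giving a logical matrix of mismatches
mismatch <- toy[cols] != toy$SURVEY_MIN
toy[cols][mismatch] <- NA

toy
#    a  b SURVEY_MIN
# 1 58 NA         58
# 2 47 NA         47
# 3 47 47         47
```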
I hope I understand your question correctly, but this should do the trick:
for (i in 1:nrow(df)) {
for (j in 1:(ncol(df)-1)) {
if (df[i,j] != df$SURVEY_MIN[i]) {
df[i,j] <- NA
}
}
}
You first need to create a data frame of 0 values, which will then be filled in based on your condition (an if statement, once translated to R). This requires a loop in which each cell is compared to the corresponding value in column SURVEY_MIN. So first I create a data frame called df_result that excludes the column you want to compare against (SURVEY_MIN); you can join it back later:
df_result <- data.frame(PhysicalActivity_yn_agesurvey = numeric(nrow(df)),
smoker_former_or_never_yn_agesurvey = numeric(nrow(df)),
NOT_RiskyHeavyDrink_yn_agesurvey = numeric(nrow(df)),
Not_obese_yn_agesurvey = numeric(nrow(df)),
HEALTHY_Diet_yn_agesurvey = numeric(nrow(df)))
Then we loop over the cells of df, compare each one with SURVEY_MIN in the same row, and save the result in df_result:
for (i in 1:nrow(df)) {
for (j in 1:5) {
colname <- names(df[j])
if (df[i, j] == df$SURVEY_MIN[i]) {
df_result[i, j] <- df[i, j]
} else {
df_result[i, j] <- NA
}
}
}
This tells me there are only two values that are different from the corresponding row value in SURVEY_MIN, and they are from NOT_RiskyHeavyDrink_yn_agesurvey:
df_result
PhysicalActivity_yn_agesurvey smoker_former_or_never_yn_agesurvey NOT_RiskyHeavyDrink_yn_agesurvey Not_obese_yn_agesurvey HEALTHY_Diet_yn_agesurvey
58 58 NA 58 58
47 47 NA 47 47
47 47 47 47 47
50 50 50 50 50
53 53 53 53 53
59 59 59 59 59

Writing a median function in R

I have been tasked with writing my own median function in R, without using the built-in median function. If the number of values is even, the result should be the average of the two middle values, as is usual for the median.
This is something I could probably do in Java, but I struggle with some of the syntax in R.
Code:
list1 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47, 71)
sorted=list1[order(list1)]
sorted
n = length(sorted)
n
if(n%2==0) # problem here, implementing mod() and the rest of logic.
Here is a self-written function mymedian:
mymedian <- function(lst) {
n <- length(lst)
s <- sort(lst)
ifelse(n%%2==1,s[(n+1)/2],mean(s[n/2+0:1]))
}
Example
list1 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47, 71)
list2 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47)
mymedian(list1)
mymedian(list2)
such that
> mymedian(list1)
[1] 44.5
> mymedian(list2)
[1] 42
I believe this should get you the median you're looking for:
homemade_median <- function(vec){
sorted <- sort(vec)
n <- length(sorted)
if(n %% 2 == 0){
mid <- sorted[c(floor(n/2),floor(n/2)+1)]
med <- sum(mid)/2
} else {
med <- sorted[ceiling(n/2)]
}
med
}
homemade_median(list1)
median(list1) # for comparison
A short function that does the trick:
my_median <- function(x){
# Order Vector ascending
x <- sort(x)
# For even length, average the two middle values
if((length(x) %% 2) == 0){
return((x[length(x)/2] + x[length(x)/2 + 1]) / 2)
}
# For odd length, just take the value that's right in the center
else{
return(x[(length(x)/2) + 0.5])
}
}
Check to see if it returns desired outcomes:
my_median(list1)
44.5
median(list1)
44.5
#
list2 <- c(1,4,5,90,18)
my_median(list2)
5
median(list2)
5
You don't need to test for evenness; you can just build the index range from floor((n + 1) / 2) to ceiling((n + 1) / 2), where n is the length, and average the sorted values at those positions:
x <- rnorm(100)
y <- rnorm(101)
my_median <- function(x)
{
mid <- seq(floor((length(x)+1)/2),ceiling((length(x)+1)/2))
mean(sort(x)[mid])
}
my_median(x)
[1] 0.1682606
median(x)
[1] 0.1682606
my_median(y)
[1] 0.2473015
median(y)
[1] 0.2473015
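To see why the floor/ceiling trick works, it may help to look at the indices it picks for an even and an odd length, and to compare the result with the built-in median on a fixed input. A small sketch (mid_index is a helper name introduced only for this illustration):

```r
# Indices of the middle element(s) for a vector of length n
mid_index <- function(n) seq(floor((n + 1) / 2), ceiling((n + 1) / 2))

mid_index(10)   # even length: two middle positions
# [1] 5 6
mid_index(9)    # odd length: a single middle position
# [1] 5

my_median <- function(x) mean(sort(x)[mid_index(length(x))])

list1 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47, 71)
my_median(list1) == median(list1)
# [1] TRUE
```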

Generating a sequence where the gap between numbers increases

I need to generate a sequence in R where the gap between elements increases each time.
Seq1:
1, 49, 100, 154, ... 19306
Seq2:
48, 99, 153, 210, ..., 19650
Note the gap between seq1 elements increases by 3 each time. 49-1 = 48, 100-49 =51, 154-100 = 54...
The gap between Seq2 elements also increases by 3 each time 99-48 =51, 153-99 = 54
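Given the advice from @Dason:
seq1 <- seq(48, 19306, 3)        # the gaps: 48, 51, 54, ...
which(cumsum(seq1) == 19650)     # 100, so 100 terms are needed
seq2 <- cumsum(seq1)[1:100]      # Seq2: 48, 99, 153, ..., 19650
seq3 <- seq(47, 19306, 3)
seq4 <- seq2 - seq3[1:100]       # Seq1: 1, 49, 100, ..., 19306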
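Both sequences can also be built directly from their gaps with cumsum. A sketch, assuming 100 terms per sequence (which is what the end points 19306 and 19650 imply); the variable names are chosen here for illustration:

```r
n <- 100                                   # number of terms in each sequence

# Seq2 is the running total of the gaps 48, 51, 54, ...
gaps <- seq(48, by = 3, length.out = n)
s2 <- cumsum(gaps)

# Seq1 starts at 1 and then adds the same gaps (only n - 1 of them are needed)
s1 <- cumsum(c(1, gaps[-n]))

head(s1, 4); tail(s1, 1)
# [1]   1  49 100 154
# [1] 19306
head(s2, 4); tail(s2, 1)
# [1]  48  99 153 210
# [1] 19650
```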

unique pairs or combinations from a vector

Where am I going wrong with my function?
I am trying to create a function which will count all the unique pairs in a vector, say I have the following input:
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
Here 20 forms one pair and 30 forms one pair, so summing these gives a total of 2 unique pairs.
However, everything I try counts 30 as having 2 unique pairs (since 30 occurs 3 times in the vector).
n <- 9
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
CountThePairs <- function(n, ar){
for(i in 1:length(ar)){
sum = ar[i] - ar[]
pairs = length(which(sum == 0))
}
return(sum)
}
CountThePairs(n = NULL, ar)
Is there an easier way of doing this? I prefer the base R version but interested in package versions also.
Here's a simpler way using floor and table from base R -
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
sum(floor(table(ar)/2))
[1] 2
Example 2 - Adding one more 30 to vector so now there are 2 pairs of 30 -
ar <- c(10, 20, 20, 30, 30, 30, 30, 40, 50)
sum(floor(table(ar)/2))
[1] 3
If two pairs of 30 count as one "unique" pair, then the original solution by @tmfmnk was correct -
sum(table(ar) >= 2)
You could use sapply on the unique values of the vector to return a logical vector indicating whether each value is repeated. The sum of that logical vector is then the number of unique pairs.
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
is_pair <- sapply(unique(ar), function(x) length(ar[ar == x]) > 1)
sum(is_pair)
#[1] 2
I'm not sure what behaviour you want if there are four 30's - does this count as one unique pair still or is it now two? If the latter, you would need a slightly different solution:
n_pair <- sapply(unique(ar), function(x) length(ar[ar == x]) %/% 2)
sum(n_pair)
#[1] 2
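The two readings of "unique pairs" discussed above can be put side by side in one function. This is a sketch only; count_pairs and its distinct argument are names invented for this example.

```r
# Hypothetical helper: count pairs of equal values in a vector.
#   distinct = FALSE: every complete pair counts (four 30s make two pairs)
#   distinct = TRUE:  each repeated value counts once, however often it repeats
count_pairs <- function(ar, distinct = FALSE) {
  counts <- table(ar)                      # occurrences of each value
  if (distinct) sum(counts >= 2) else sum(counts %/% 2)
}

count_pairs(c(10, 20, 20, 30, 30, 30, 40, 50))
# [1] 2
count_pairs(c(10, 20, 20, 30, 30, 30, 30, 40, 50))
# [1] 3
count_pairs(c(10, 20, 20, 30, 30, 30, 30, 40, 50), distinct = TRUE)
# [1] 2
```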

Return last match from vector

Is there a simple way to get the index of the last match of a vector?
lastInflectionRow<-function(temp){
m<-match(TRUE,temp,nomatch=NA)
m
}
lastInflectionRow(c(TRUE,FALSE,TRUE,FALSE,FALSE)) # returns 1, the first match
GOAL: 3
Another simple way could be using max on the index of TRUE elements.
x <- c(TRUE,FALSE,TRUE,FALSE,FALSE)
max(which(x))
#[1] 3
?Position is made for this sort of thing, when using the right=TRUE argument. All of the below should be essentially equivalent.
Position(I, x, right=TRUE)
#[1] 3
Position(identity, x, right=TRUE)
#[1] 3
Position(isTRUE, x, right=TRUE)
#[1] 3
Position(function(x) x, x, right=TRUE)
#[1] 3
We could use == if we are comparing with a single element
tail(which(v1 == TRUE),1)
#[1] 3
The == part is not necessary as the vector is logical
tail(which(v1),1)
#[1] 3
NOTE: Here I am assuming that the OP's vector may not always consist of TRUE/FALSE values as shown in the example.
If we need to use match, one option is mentioned here
data
v1 <- c(TRUE,FALSE,TRUE,FALSE,FALSE)
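The approaches in the answers so far differ in what they return when the vector contains no TRUE at all, which may matter when choosing between them. A quick check on an all-FALSE vector (the name none is just for this illustration):

```r
none <- c(FALSE, FALSE, FALSE)

# max() of an empty index vector warns and returns -Inf
suppressWarnings(max(which(none)))
# [1] -Inf

# Position() falls back to its nomatch default, NA
Position(isTRUE, none, right = TRUE)
# [1] NA

# tail() of an empty index vector is itself empty
tail(which(none), 1)
# integer(0)
```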
If performance is a consideration, then the best way I've found of doing this is
length(x) + 1L - match(TRUE, rev(x))
This is significantly faster, particularly in the general case where one desires the rightmost match for more than one entry.
MatchLast <- function (needles, haystack) # This approach
length(haystack) + 1L - match(needles, rev(haystack))
MaxWhich <- function (needles, haystack) # Ronak Shah's approach
vapply(needles, function (needle) max(which(haystack==needle)), integer(1))
Pos <- function (needles, haystack) # thelatemail's suggestion
vapply(needles, function (needle)
Position(function (x) x == needle, haystack, right=TRUE),
integer(1))
Tail <- function (needles, haystack) # akrun's solution
vapply(needles, function (needle) tail(which(haystack==needle), 1), integer(1))
With Rilkon42's data:
library(microbenchmark)
x <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
microbenchmark(MatchLast(TRUE, x), MaxWhich(TRUE, x), Pos(TRUE, x), Tail(TRUE, x))
## function min lq mean median uq max
## MatchLast 10.730 19.1270 175.3851 23.7920 28.458 14757.131
## MaxWhich 11.663 22.1600 275.4657 25.1920 28.224 24355.120
## Pos 25.192 47.5845 194.1296 52.7160 64.612 12890.622
## Tail 39.187 69.7435 223.1278 83.0395 101.233 9223.848
In the more general case:
needles <- 24:45
haystack <- c(45, 45, 44, 44, 43, 43, 42, 42, 41, 41, 40, 40, 39, 39, 38, 38, 37, 37,
36, 36, 35, 35, 34, 34, 33, 33, 32, 32, 31, 31, 30, 30, 29, 29, 28, 28,
27, 27, 26, 26, 25, 25, 24, 24)
microbenchmark(MatchLast(needles, haystack), MaxWhich(needles, haystack),
Pos(needles, haystack), Tail(needles, haystack))
## function min lq mean median uq max
## MatchLast 15.395 30.3240 137.3086 36.8550 48.051 9842.441
## MaxWhich 90.971 102.1665 161.1100 172.3765 214.829 238.854
## Pos 709.563 733.8220 1111.7000 1162.7780 1507.530 1645.383
## Tail 654.981 690.2035 1017.7400 882.6385 1404.197 1595.933
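The index arithmetic in the reversed-match approach can be checked by hand: position p in rev(haystack) corresponds to position length(haystack) + 1 - p in haystack, so the first match in the reversed vector is the last match in the original. A short sketch (match_last is a lower-case copy of MatchLast above, repeated so the block runs on its own; the character vectors are made-up test data):

```r
match_last <- function(needles, haystack) {
  length(haystack) + 1L - match(needles, rev(haystack))
}

x <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
match(TRUE, rev(x))       # first TRUE in the reversed vector
# [1] 3
match_last(TRUE, x)       # 5 + 1 - 3
# [1] 3

# Several needles at once; a needle that never occurs gives NA
match_last(c("b", "z"), c("a", "b", "b"))
# [1]  3 NA
```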
