Writing a median function in R

I have been tasked with writing my own median function in R, without using the built-in median function. If the count of numbers is even, average the two middle values, as is usual for the median.
This is something I could probably do in Java, but I struggle with some of the syntax in R.
Code:
list1 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47, 71)
sorted=list1[order(list1)]
sorted
n = length(sorted)
n
if(n%2==0) # problem here, implementing mod() and the rest of logic.

Here is a self-written function mymedian:
mymedian <- function(lst) {
n <- length(lst)
s <- sort(lst)
ifelse(n%%2==1,s[(n+1)/2],mean(s[n/2+0:1]))
}
Example
list1 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47, 71)
list2 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47)
mymedian(list1)
mymedian(list2)
such that
> mymedian(list1)
[1] 44.5
> mymedian(list2)
[1] 42

I believe this should get you the median you're looking for:
homemade_median <- function(vec){
sorted <- sort(vec)
n <- length(sorted)
if(n %% 2 == 0){
mid <- sorted[c(floor(n/2),floor(n/2)+1)]
med <- sum(mid)/2
} else {
med <- sorted[ceiling(n/2)]
}
med
}
homemade_median(list1)
median(list1) # for comparison

A short function that does the trick:
my_median <- function(x){
# Order Vector ascending
x <- sort(x)
# For even length, average the two values surrounding the center
if((length(x) %% 2) == 0){
return((x[length(x)/2] + x[length(x)/2 + 1]) / 2)
}
# For odd length, just take the value that is right in the center
else{
return(x[(length(x)/2) + 0.5])
}
}
Check to see if it returns desired outcomes:
my_median(list1)
44.5
median(list1)
44.5
#
list2 <- c(1,4,5,90,18)
my_median(list2)
5
median(list2)
5

You don't need to test for evenness; you can just create a sequence from half the length plus one, using floor and ceiling as appropriate:
x <- rnorm(100)
y <- rnorm(101)
my_median <- function(x)
{
mid <- seq(floor((length(x)+1)/2),ceiling((length(x)+1)/2))
mean(sort(x)[mid])
}
my_median(x)
[1] 0.1682606
median(x)
[1] 0.1682606
my_median(y)
[1] 0.2473015
median(y)
[1] 0.2473015
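To see why the floor/ceiling trick works without an explicit parity test, it can help to print the positions it selects. The sketch below uses a hypothetical helper name, `mid_indices`, which is not part of the answer above; it only isolates the index computation.

```r
# Hypothetical helper: the positions that the floor/ceiling approach averages.
mid_indices <- function(n) {
  seq(floor((n + 1) / 2), ceiling((n + 1) / 2))
}

mid_indices(9)   # odd length: a single middle position, 5
mid_indices(10)  # even length: the two middle positions, 5 and 6

# The same median function as above, rebuilt on the helper.
my_median <- function(x) mean(sort(x)[mid_indices(length(x))])

list1 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47, 71)
my_median(list1) == median(list1)  # TRUE
```

For an odd length, floor and ceiling agree, so `mean()` sees one value; for an even length they differ by one, so it averages the two middle values.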

Related

How to find GCD for a list of (1:n) numbers in R

GCD=function(a, b){
m=min(a, b)
while(a%%m>0|b%%m>0){m=m-1}
return(m)}
Here is my code to find the GCD of two integers. How can I modify it to find the GCD of a list of numbers from 1 to n without changing my original code too much?
Thank you very much!
You can use the any function:
GCD <- function(x) {
m = min(x)
while (any(x %% m > 0)){
m = m - 1
}
return(m)
}
GCD(c(12, 24, 28, 36, 200))
# [1] 4
GCD(c(6, 24, 28, 36, 200))
# [1] 2
You can define function gcd based on GCD like below
gcd <- function(...) Reduce(GCD,list(...))
and you can try
> gcd(6, 24, 28, 36, 200)
[1] 2
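The countdown loops above can take up to `min(x)` iterations. As an alternative sketch (a swapped-in technique, not the asker's method), Euclid's algorithm can be folded over the vector with `Reduce`. The names `gcd2` and `gcd_all` are hypothetical, and the inputs are assumed to be positive integers.

```r
# Euclid's algorithm for two non-negative integers.
gcd2 <- function(a, b) {
  while (b != 0) {
    r <- a %% b
    a <- b
    b <- r
  }
  a
}

# Fold the two-argument version over a whole vector.
gcd_all <- function(x) Reduce(gcd2, x)

gcd_all(c(12, 24, 28, 36, 200))  # 4
gcd_all(c(6, 24, 28, 36, 200))   # 2
gcd_all(1:10)                    # 1
```

This works because gcd(a, b, c) equals gcd(gcd(a, b), c), which is the same property the `Reduce(GCD, ...)` wrapper above relies on.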

R - Comparing data of two data.frames with a loop, cell by cell

Basically, I want to compare the same cells (same [X,Y] positions) from two data frames and apply a few operations to them afterward, before storing the result in a new data frame. The tricky part is that I need to process all the values of these data frames.
In other words, I need to compare the value of dataA[1,1] with dataB[1,1]; if they meet certain conditions, I'll perform one operation, otherwise another, then store the result in a new data.frame.
Then rinse and repeat for dataA[1,2] and dataB[1,2], up to dataA[100,100] and dataB[100,100].
Obviously, I have to use a loop here (and some if/else), but I can't seem to figure out the proper structure.
Being used to PHP, I tried the foreach package in R, but it only returns FALSE (and it does so as a vector instead of a matrix with multiple columns). If I do the operation by hand, there are more TRUE than FALSE results, so obviously something is wrong here:
x <- foreach(dataIDH, dataPIB) %do% {
if (dataPIB <= dataIDH+5 & dataPIB >= rankIDH-5) {
x <- mean(dataPIB, dataIDH)
} else { x <- FALSE}
}
x
I also tried a for loop, but I'm simply unable to put the results in a data.frame (let alone one that matches the layout of the data frames used here, which I need):
x <- for(idh in 1:nrow(dataIDH)) {
for(idh in 1:ncol(dataIDH)) {
for(pib in 1:nrow(dataPIB)) {
for(pib in 1:ncol(dataPIB)) {
if (pib<=idh+5 & pib>=idh-5) {
x <- mean(pib,idh)
} else { x <- FALSE}
}
}
}
}
x
For information: the data frames contain numeric values for a set of countries (rows) over a few years (columns).
Any ideas on how to get out of this mess?
Edit 1: an extract of the two data frames used (the first row and column displayed here are actually headers):
dataIDH
CountryCode,2005,2006,2007,2008
AFG,14,14,16,16
ALB,100,98,99,98
DZA,85,86,90,86
dataPIB
CountryCode,2005,2006,2007,2008
AFG, 69, 18, 70, 71
ALB, 102, 98, 97, 63
DZA, 85, 89, 91, 137
Edit 2 : and the final result should be a new data.frame, on the same layout:
x
CountryCode,2005,2006,2007,2008
AFG, FALSE, 16, FALSE, FALSE
ALB, 101, 98, 98, FALSE
DZA, 85, 87.5, 90.5, FALSE
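Here is the basic looping approach:
df <- dataIDH
for (i in seq_len(nrow(dataIDH))) {
  for (j in 2:ncol(dataIDH)) {
    # Keep the pair only if the two values are within 5 of each other
    if (abs(dataIDH[i, j] - dataPIB[i, j]) <= 5) {
      # mean() averages one vector; mean(a, b) would treat b as `trim`
      df[i, j] <- mean(c(dataPIB[i, j], dataIDH[i, j]))
    } else {
      # NA (rather than the string "False") keeps the column numeric
      df[i, j] <- NA
    }
  }
}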
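A detail that is easy to miss when averaging two cells: `mean()` averages the elements of its first argument only, and a second positional argument is taken as `trim`. The question's `mean(pib, idh)` therefore does not average the two values. A small check with illustrative numbers:

```r
a <- 102
b <- 100

mean(a, b)     # 102: b is matched to the `trim` argument, not averaged
mean(c(a, b))  # 101: the average of the two values
(a + b) / 2    # 101: equivalent for exactly two numbers
```

Wrapping the values in `c()` is the smallest change that gives the intended result.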
Here is your answer:
df1 <- data.frame(a= rnorm(1000),
b=rnorm(1000),
c= rnorm(1000))
df2 <- data.frame(aa= rnorm(1000, 3, 3),
bb=rnorm(1000, -2, 3),
cc= rnorm(1000, 5, 3))
df3 <- data.frame(df1, df2)
test <- function(df, column_number1, column_number2){
mean_vec <- apply(df[, c(column_number1, column_number2)], 1, function(x) mean(x, na.rm = TRUE))
dif_vec <- abs(df[,column_number1]-df[,column_number2])
ind_true <- dif_vec<=5
ind_false <- dif_vec>5
column_name <- paste(colnames(df)[column_number1],
colnames(df)[column_number2], sep = "_" )
df[ind_true, (column_name)] <- mean_vec[ind_true]
df[ind_false, (column_name)] <- "FALSE"
return(df)
}
df3 <- test(df3, 1,4)
df3 <- test(df3, 2, 5)
df3 <- test(df3, 3, 6)
Assuming you don't actually want to convert your data to strings (which would be necessary to include "FALSE" in a numeric vector), R is really good at working with vectors and matrices...
dataIDH <- read.csv(header = TRUE, as.is = TRUE, text = "
CountryCode,2005,2006,2007,2008
AFG,14,14,16,16
ALB,100,98,99,98
DZA,85,86,90,86
")
dataPIB <- read.csv(header = TRUE, as.is = TRUE, text = "
CountryCode,2005,2006,2007,2008
AFG, 69, 18, 70, 71
ALB, 102, 98, 97, 63
DZA, 85, 89, 91, 137
")
x <- abs(dataIDH[-1] - dataPIB[-1]) <= 5
y <- (dataIDH[-1] + dataPIB[-1]) / 2
y[!x] <- NA
cbind(dataIDH[1], y)
# CountryCode X2005 X2006 X2007 X2008
# 1 AFG NA 16.0 NA NA
# 2 ALB 101 98.0 98.0 NA
# 3 DZA 85 87.5 90.5 NA

Return last match from vector

Is there a simple way to get the index of the last match in a vector?
lastInflectionRow <- function(temp) {
  m <- match(TRUE, temp, nomatch = NA)
  m
}
lastInflectionRow(c(TRUE, FALSE, TRUE, FALSE, FALSE)) # returns 1, the first match
GOAL: 3
Another simple way could be using max on the index of TRUE elements.
x <- c(TRUE,FALSE,TRUE,FALSE,FALSE)
max(which(x))
#[1] 3
?Position is made for this sort of thing, when using the right=TRUE argument. All of the below should be essentially equivalent.
Position(I, x, right=TRUE)
#[1] 3
Position(identity, x, right=TRUE)
#[1] 3
Position(isTRUE, x, right=TRUE)
#[1] 3
Position(function(x) x, x, right=TRUE)
#[1] 3
We could use == if we are comparing with a single element
tail(which(v1 == TRUE),1)
#[1] 3
The == part is not necessary as the vector is logical
tail(which(v1),1)
#[1] 3
NOTE: Here I am assuming that the OP's vector may not always contain only TRUE/FALSE values, as shown in the example.
If we need to use match, one option is mentioned here
data
v1 <- c(TRUE,FALSE,TRUE,FALSE,FALSE)
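The approaches above agree when at least one TRUE is present, but they behave differently when there is none. The sketch below uses an illustrative all-FALSE vector (`none` is a made-up name) to show what each one returns, which may matter if the result is used as an index afterwards.

```r
none <- c(FALSE, FALSE, FALSE)  # illustrative input with no TRUE at all

suppressWarnings(max(which(none)))      # -Inf (max of an empty set, with a warning)
Position(identity, none, right = TRUE)  # NA
tail(which(none), 1)                    # integer(0)
```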
If performance is a consideration, then the best way I've found of doing this is
length(x) + 1L - match(TRUE, rev(x))
This is significantly faster, particularly in the general case where one desires the rightmost match for more than one entry.
MatchLast <- function (needles, haystack) # This approach
length(haystack) + 1L - match(needles, rev(haystack))
MaxWhich <- function (needles, haystack) # Ronak Shah's approach
vapply(needles, function (needle) max(which(haystack==needle)), integer(1))
Pos <- function (needles, haystack) # thelatemail's suggestion
vapply(needles, function (needle)
Position(function (x) x == needle, haystack, right=TRUE),
integer(1))
Tail <- function (needles, haystack) # akrun's solution
vapply(needles, function (needle) tail(which(haystack==needle), 1), integer(1))
With Rilkon42's data:
x <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
library(microbenchmark)
microbenchmark(MatchLast(TRUE, x), MaxWhich(TRUE, x), Pos(TRUE, x), Tail(TRUE, x))
## function min lq mean median uq max
## MatchLast 10.730 19.1270 175.3851 23.7920 28.458 14757.131
## MaxWhich 11.663 22.1600 275.4657 25.1920 28.224 24355.120
## Pos 25.192 47.5845 194.1296 52.7160 64.612 12890.622
## Tail 39.187 69.7435 223.1278 83.0395 101.233 9223.848
In the more general case:
needles <- 24:45
haystack <- c(45, 45, 44, 44, 43, 43, 42, 42, 41, 41, 40, 40, 39, 39, 38, 38, 37, 37,
36, 36, 35, 35, 34, 34, 33, 33, 32, 32, 31, 31, 30, 30, 29, 29, 28, 28,
27, 27, 26, 26, 25, 25, 24, 24)
microbenchmark(MatchLast(needles, haystack), MaxWhich(needles, haystack),
Pos(needles, haystack), Tail(needles, haystack))
## function min lq mean median uq max
## MatchLast 15.395 30.3240 137.3086 36.8550 48.051 9842.441
## MaxWhich 90.971 102.1665 161.1100 172.3765 214.829 238.854
## Pos 709.563 733.8220 1111.7000 1162.7780 1507.530 1645.383
## Tail 654.981 690.2035 1017.7400 882.6385 1404.197 1595.933
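The index arithmetic in the `match`/`rev` approach can be checked step by step. `match` finds the first hit in the reversed vector, at position `k` counted from the end, and `length(x) + 1L - k` converts that back to a position counted from the front. The second vector, `x2`, is an illustrative addition.

```r
x <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

rev(x)                    # FALSE FALSE TRUE FALSE TRUE
k <- match(TRUE, rev(x))  # 3: first TRUE in the reversed vector
length(x) + 1L - k        # 3: the same element, counted from the front

x2 <- c(FALSE, TRUE, TRUE, FALSE)
length(x2) + 1L - match(TRUE, rev(x2))  # 3: last TRUE is at position 3
```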

R: Average nearby elements in a vector

I have many vectors such as this: c(28, 30, 50, 55, 99, 102) and I would like to obtain a new vector where elements differing by less than 10 from one another are averaged. In this case, I would like to obtain c(29, 52.5, 100.5).
Another way
vec <- c(28, 30, 50, 55, 99, 102)
indx <- cumsum(c(0, diff(vec)) > 10)
tapply(vec, indx, mean)
# 0 1 2
# 29.0 52.5 100.5
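The grouping step in the answer above can be unpacked into its intermediate values. This sketch assumes, as the example does, that the vector is already sorted in ascending order; the names `gaps` and `new_group` are introduced here for illustration.

```r
vec <- c(28, 30, 50, 55, 99, 102)

gaps <- c(0, diff(vec))    # 0 2 20 5 44 3: distance to the previous element
new_group <- gaps > 10     # FALSE FALSE TRUE FALSE TRUE FALSE
indx <- cumsum(new_group)  # 0 0 1 1 2 2: one group label per element

as.vector(tapply(vec, indx, mean))  # 29.0 52.5 100.5
```

Each TRUE in `new_group` marks a jump larger than 10, and `cumsum` turns those marks into a running group number.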

intersection in R

I have two tables.
Both tables have only 1 column.
Both have random integer values between 1 to 1000.
I want to intersect these two tables. The catch is that I want to match numbers even if they differ by up to about 10.
1st table -> 5 , 50, 160, 280
2nd table -> 14, 75, 162, 360
Output ->
1st table -> 5, 160
2nd table -> 14, 162
How can I achieve this in R?
You could do this with the sapply function, checking if each element of x or y is sufficiently close to some member of the other vector:
x <- c(5, 50, 160, 280)
y <- c(14, 75, 162, 360)
new.x <- x[sapply(x, function(z) min(abs(z-y)) <= 10)]
new.y <- y[sapply(y, function(z) min(abs(z-x)) <= 10)]
new.x
# [1] 5 160
new.y
# [1] 14 162
Here is an approach that uses the outer function (so your 2 tables will need to be reasonably sized):
x <- c(5,50,160,280)
y <- c(999,14,75,162,360)
tmp1 <- outer(x,y, function(x,y) abs(x-y))
tmp2 <- which(tmp1 <= 10, arr.ind=TRUE)
rbind(
x=x[ tmp2[,1] ],
y=y[ tmp2[,2] ]
)
This looks at every possible pair between x and y and computes the difference between the 2 values, then finds those with a difference <= 10.
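As a variation on the `outer` idea, the logical matrix can be reduced by rows and columns to recover the two filtered vectors directly. This is only a sketch on the question's example values; `is_close` is a name introduced here.

```r
x <- c(5, 50, 160, 280)
y <- c(14, 75, 162, 360)

# is_close[i, j] is TRUE when x[i] and y[j] differ by at most 10.
is_close <- abs(outer(x, y, "-")) <= 10

x[rowSums(is_close) > 0]  # 5 160
y[colSums(is_close) > 0]  # 14 162
```

As with any `outer`-based approach, the matrix has `length(x) * length(y)` entries, so this suits reasonably sized inputs.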
