Double For Loop and calculate averages in R

I have a minor problem, and I'm unsure how to fix the error.
Basically, I have two columns and I want to use a double for loop to calculate the average of each pair of numbers across the two columns, so that the result is a vector of averages. To clarify, apply and mean aren't the best fit here because I need only half of the total possible combinations to obtain averages. For example:
Col1<-c(1,2,3,4,5)
Col2<-c(1,2,3,4,5)
Q1<-data.frame(cbind(Col1, Col2))
Q1$mean<-0
for (i in 1:length(Q1$Col1)) {
for (j in i+1:length(Q1$Col2)) {
Q1$mean[i]<-(Q1$Col1[i]+Q1$Col2[j])/2
}
}
Basically, for each number in Q1$Col1, I want to average it with each number in Q1$Col2. The reason I want to use a double for loop is to eliminate duplicates. Here is the matrix version, for visualization:
1.0 1.5 2.0 2.5 3.0
1.5 2.0 2.5 3.0 3.5
2.0 2.5 3.0 3.5 4.0
2.5 3.0 3.5 4.0 4.5
3.0 3.5 4.0 4.5 5.0
Here, each row represents a number from Q1$Col1 and each column represents a number from Q1$Col2. However, notice that there is redundancy on both sides of the matrix diagonal. So using the Double For Loop, I eliminate the redundancy to obtain the averages of the unique combination of cases. Using the matrix above, it should look like this:
1.0 1.5 2.0 2.5 3.0
2.0 2.5 3.0 3.5
3.0 3.5 4.0
4.0 4.5
5.0

What I think you're asking is this: given two vectors of numbers, how can I find the mean of the first items in each vector, the mean of the second items in each vector, and so on. If that's the case, then here is a way to do that.
First, you want to use cbind(), not rbind(), in order to get columns rather than rows.
Col1<-c(1,2,3,4,5)
Col2<-c(2,3,4,5,6)
Q1<-cbind(Col1, Col2)
Then you can use the function rowMeans() to figure out (you guessed it) the means of each row. (See also rowSums() and colMeans() and colSums().)
rowMeans(Q1)
#> [1] 1.5 2.5 3.5 4.5 5.5
The more general way to do this is the apply() function, which will let us apply a function to each column or row. Here we use the argument 1 to apply it to rows (because the first row takes the first item from Col1 and Col2, etc.).
apply(Q1, 1, mean)
The results are these:
#> [1] 1.5 2.5 3.5 4.5 5.5
If you really want them in your existing matrix, you could do something like this:
means <- rowMeans(Q1)
cbind(Q1, means)

You do not need loops to get the averages; you can use vectorised operations:
Col1 <- c(1,2,3,4,5)
Col2 <- c(2,3,4,5,6)
Mean <- (Col1+Col2)/2
Q1 <- rbind(Col1, Col2, Mean)
However, rbind treats your vectors as rows; you could use cbind to get columns instead.

You could just use the outer function to first calculate the averages, then use lower.tri to fill the area underneath the diagonal of the matrix with NA values.
matrix<-outer(Q1$Col1, Q1$Col2, "+")/2
matrix[lower.tri(matrix)] = NA
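If a plain vector of the unique averages is wanted rather than a matrix padded with NA values, one possible sketch is below. It assumes the diagonal should be kept, as in the triangle shown in the question, and the names avg, unique_avgs and loop_avgs are only illustrative. It also includes a corrected double loop for comparison: in R, i+1:n is parsed as i + (1:n), so the inner range needs parentheses, or simply i:n when the diagonal is included.

```r
Col1 <- c(1, 2, 3, 4, 5)
Col2 <- c(1, 2, 3, 4, 5)

# All pairwise averages: avg[i, j] is the mean of Col1[i] and Col2[j]
avg <- outer(Col1, Col2, "+") / 2

# Transpose first so that the column-major extraction walks the original
# matrix row by row, keeping the diagonal and everything to its right
unique_avgs <- t(avg)[lower.tri(avg, diag = TRUE)]

# Equivalent double loop; note i:n, since i+1:n would mean i + (1:n)
n <- length(Col1)
loop_avgs <- numeric(0)
for (i in seq_len(n)) {
  for (j in i:n) {
    loop_avgs <- c(loop_avgs, (Col1[i] + Col2[j]) / 2)
  }
}

print(unique_avgs)
print(isTRUE(all.equal(unique_avgs, loop_avgs)))
```

Growing loop_avgs with c() is fine for five values but gets slow for long vectors, which is one reason the outer() version is usually preferable.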

Related

Using distinct() with a vector of column names

I have a question about using distinct() from dplyr on a tibble/data.frame. From the documentation it is clear that you can use it by explicitly naming the columns. I have a data frame with >100 columns and want to use the function on just a subset of them. My intuition was to put the column names in a vector and use it as an argument for distinct. But distinct uses only the first vector element.
Example on iris:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct_(iris, exclude.columns)
This is different from
exclude.columns <- c('Sepal.Width', 'Species')
distinct_(iris, exclude.columns)
I think distinct is not made for this operation. Another option would be to subset the data.frame then use distinct and join again with the excluded columns. But my question is if there is another option using just one function?
As suggested in my comment, you could also try:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct(iris, !!! syms(exclude.columns))
Output (first 10 rows):
Sepal.Width Species
1 3.5 setosa
2 3.0 setosa
3 3.2 setosa
4 3.1 setosa
5 3.6 setosa
6 3.9 setosa
7 3.4 setosa
8 2.9 setosa
9 3.7 setosa
10 4.0 setosa
However, that was suggested more than 2 years ago. With more recent versions of dplyr, a more idiomatic way would be:
distinct(iris, across(all_of(exclude.columns)))
It is not entirely clear to me whether you would like to keep only the exclude.columns or actually exclude them; if the latter then you just put minus in front i.e. distinct(iris, across(-all_of(exclude.columns))).
Your objective sounds unclear. Are you trying to get all distinct rows across all columns except $Species and $Sepal.Width? If so, that doesn't make sense.
Let's say two rows are the same in all other variables except for $Sepal.Width. Using distinct() in the way you described would throw out the second row because it was not distinct from the first. Except that it was in the column you ignored.
You need to rethink your objective and whether it makes sense.
If you are just worried about duplicate rows, then
data %>%
distinct(across(everything()))
will do the trick.
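If the goal is to drop rows that repeat a combination of the listed columns while keeping every other column, a base R sketch along the following lines may be enough. This is an alternative to the dplyr calls above rather than what the answers propose, and the names key_columns, is_repeat and deduped are hypothetical.

```r
data(iris)
key_columns <- c("Species", "Sepal.Width")

# duplicated() on a data frame flags rows whose combination of values
# has already appeared in an earlier row
is_repeat <- duplicated(iris[key_columns])

# Keep the first row for each combination, with all original columns
deduped <- iris[!is_repeat, ]

print(nrow(deduped))
print(head(deduped))
```

Because the column names are passed as a character vector, their order does not matter here. In dplyr, the .keep_all argument of distinct() plays a similar role of retaining the remaining columns.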

Determining observations with same values in one variable

My problem is that I want to determine households which have the same income and then use the rank number (ranked by income) to create another rank variable. [Image: Sample.Data.Frame]
For example, you have a data.frame like the one displayed in the image. The first 2 observations have no income. So there are 2 (= n) observations with the same income, with ranks 1 (= y) and 2 (= x). The new rank variable I want to create for both observations is rank.new = (y+x)/n, so that there is a new column "rank.new" where the value for observations 1 and 2 is 1.5.
Of course I have many more observations and more households with identical income, so I want to ask how I could do this in R.
You are looking for the function rank:
Income = c(0,0,150,300,300,440,500,500,500)
rank(Income)
[1] 1.5 1.5 3.0 4.5 4.5 6.0 8.0 8.0 8.0
I am making your test data a little bigger to show what happens when there are more than two points in the same group. You just need to group the points that have the same income and take the average rank within each group. I am assuming that the data has been sorted by Income.
## Test Data
Income = c(0,0,150,300,300,440,500,500,500)
Rank = 1:length(Income)
Group = cumsum(c(1, diff(Income) != 0))
NewRank = aggregate(Rank, list(Group), mean)[Group,2]
NewRank
[1] 1.5 1.5 3.0 4.5 4.5 6.0 8.0 8.0 8.0
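The grouping approach above relies on the data being sorted by Income. As a small sketch of the rank() route on unsorted data, assuming a made-up households data frame (the id and income values below are hypothetical):

```r
# Hypothetical household data, deliberately not sorted by income
households <- data.frame(
  id     = 1:5,
  income = c(300, 0, 500, 0, 150)
)

# ties.method = "average" is the default; it is spelled out for clarity
households$rank.new <- rank(households$income, ties.method = "average")

print(households)
```

The two households with zero income share ranks 1 and 2, so both get (1 + 2) / 2 = 1.5, which is the rank.new value described in the question.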

Calculate a value in a column for each row

Here is the table that I am trying to manipulate:
colnames sampA sampB
#1 conA conB
#2 1.1 4.4
#3 2.2 5.5
#4 3.3 6.6
I want to calculate log2(x(1-x)) for each number in $sampB. Here is my code so far:
DF[-1,3] <- apply(DF[-1,]$sampB,1,function(x) log2(x(1-x)))
then I got the error message:
dim(X) must have a positive length
You shouldn't need apply(), as log2() is vectorized. Try this:
x <- as.numeric(as.character(DF$sampB[-1]))
log2(x * (1 - x))
I took off the first element because I'm not really sure what that conB part is about (and now you have confirmed it in the comments). I also suspect that the column might be a factor (because of conB), so I wrapped the column in as.numeric(as.character(...)). That may not be necessary, but better safe than sorry.
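As a small sketch of the conversion and the vectorised call, assume the column was read in as text because of the conB entry; the data frame below is rebuilt by hand from the table in the question. Two things are worth noting. In R, x(1-x) is a function call, so the multiplication needs an explicit *. And for values above 1, x * (1 - x) is negative, so log2() returns NaN for the sample values shown.

```r
# Hand-built copy of the table; every column is character because the
# first row holds labels rather than numbers
DF <- data.frame(
  colnames = c("#1", "#2", "#3", "#4"),
  sampA    = c("conA", "1.1", "2.2", "3.3"),
  sampB    = c("conB", "4.4", "5.5", "6.6"),
  stringsAsFactors = FALSE
)

# Drop the label row, then convert the remaining entries to numbers
x <- as.numeric(DF$sampB[-1])

# All three products are negative, so every result is NaN (with a warning,
# suppressed here to keep the output short)
result <- suppressWarnings(log2(x * (1 - x)))
print(result)

# For a value between 0 and 1 the expression is well defined
print(log2(0.5 * (1 - 0.5)))  # 0.25 is 2^-2, so this prints -2
```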

Is there any subtle difference between 'mean' and 'average' in R?

Is there any difference between what these two lines of code do:
mv_avg[i-2] <- (sum(file1$rtn[i-2]:file1$rtn[i+2])/5)
and
mv_avg[i-2] <- mean(file1$rtn[i-2]:file1$rtn[i+2])
I'm trying to calculate the moving average of first 5 elements in my dataset. I was running a for loop and the two lines are giving different outputs. Sorry for not providing the data and the rest of the code for you guys to execute and see (can't do that, some issues).
I just want to know if they both do the same thing or if there's a subtle difference between them both.
It's not an issue with mean or sum. The example below illustrates what's happening with your code:
x = seq(0.5,5,0.5)
i = 8
# Your code
x[i-2]:x[i+2]
[1] 3 4 5
# Index this way to get the five values for the moving average
x[(i-2):(i+2)]
[1] 3.0 3.5 4.0 4.5 5.0
x[i-2]=3 and x[i+2]=5, so x[i-2]:x[i+2] is equivalent to 3:5. You're seeing different results with mean and sum because your code is not returning 5 values. Therefore dividing the sum by 5 does not give you the average. In my example, sum(c(3,4,5))/5 != mean(c(3,4,5)).
@G.Grothendieck mentioned rollmean. Here's an example:
library(zoo)
rollmean(x, k=5, align="center")
[1] 1.5 2.0 2.5 3.0 3.5 4.0
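If installing zoo is not an option, the same centered 5-point moving average can be sketched in base R. The loop below mirrors the structure from the question but with the index range in parentheses, and stats::filter() is used as a cross-check; the names mv_avg and filtered are just for illustration.

```r
x <- seq(0.5, 5, 0.5)
n <- length(x)

# Centered 5-point moving average with the index range in parentheses,
# so that five consecutive elements are selected
mv_avg <- numeric(n - 4)
for (i in 3:(n - 2)) {
  mv_avg[i - 2] <- mean(x[(i - 2):(i + 2)])
}
print(mv_avg)

# Cross-check with stats::filter(), which pads the ends with NA
filtered <- as.numeric(stats::filter(x, rep(1 / 5, 5), sides = 2))
print(filtered)
```

With ten input values there are six complete windows, so mv_avg has six elements, while filtered keeps the original length with two NA values at each end.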

Generate values between non-linear points

I am aiming at smoothing out a curve with set values. To do this, I currently generate a vector between points in my curve like so:
> y.values <- c(values[1], mean(values[1:2]), values[2], ...)
This is not the fastest approach to say the least (and this snippet is just between two of the numbers!). I need a better way to generate a vector with known non-linear values and insert a value between each one, like so:
> values
[1] 1 2 4 6 9
> y.values <- magic(values)
> y.values
[1] 1 1.5 2 3 4 5 6 7.5 9
This question feels basic but I researched it and cannot seem to find a proper method for my non-linear vector, and any help is appreciated. Thank you for reading.
Maybe not the most elegant way to do this but it works:
values <- c(1,2,4,6,9)
#lapply is used to create the mean values and those get merged
#in between your values inside the function
a <- c(unlist(lapply( 1:(length(values)-1 ), function(x) c(values[x],(values[x]+values[x+1])/2))),
values[length(values)])
Output:
> a
[1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.5 9.0
Or as a function:
magic <- function(x) {
c(unlist(lapply( 1:(length(x)-1 ), function(z) c(x[z],(x[z]+x[z+1])/2))),
x[length(x)])
}
> magic(values)
[1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.5 9.0
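A vectorised variant of the same idea, offered as a sketch rather than a drop-in replacement: compute all midpoints at once and write originals and midpoints into alternating positions. The helper name insert_midpoints is hypothetical. The last lines check the result against approx() from the stats package, since a midpoint is just a linear interpolation at a half-integer position.

```r
values <- c(1, 2, 4, 6, 9)

# Hypothetical helper: place each midpoint between its two neighbours
insert_midpoints <- function(x) {
  n <- length(x)
  if (n < 2) return(x)
  mids <- (x[-n] + x[-1]) / 2             # averages of adjacent pairs
  out <- numeric(2 * n - 1)
  out[seq(1, 2 * n - 1, by = 2)] <- x     # odd positions: original values
  out[seq(2, 2 * n - 2, by = 2)] <- mids  # even positions: midpoints
  out
}

print(insert_midpoints(values))

# Linear interpolation at half-integer positions gives the same numbers
interp <- approx(seq_along(values), values,
                 xout = seq(1, length(values), by = 0.5))$y
print(interp)
```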
