Average Cells of Two or More DataFrames - r

So I currently have 3 data frames that I need to average cell by cell, and I am at a loss as to how to do this... Essentially, I need to obtain the mean of the first observation in column 1 across df1, df2, df3, and likewise for every single observation.
Here is some reproducible sample data.
set.seed(789)
df1 <- data.frame(
a = runif(100, 0, 100),
b = runif(100, 0, 100),
c = runif(100, 0, 100),
d = runif(100, 0, 100))
df2 <- data.frame(
a = runif(100, 0, 100),
b = runif(100, 0, 100),
c = runif(100, 0, 100),
d = runif(100, 0, 100))
df3 <- data.frame(
a = runif(100, 0, 100),
b = runif(100, 0, 100),
c = runif(100, 0, 100),
d = runif(100, 0, 100))
I need to create a fourth data frame of dimensions 100 by 4 that is the result of averaging each cell across the first three dataframes. Any ideas are highly appreciated!

We can sum the data frames with Reduce and `+`, then divide by the number of datasets. Keeping the datasets in a list makes this work for any number 'n' of them.
dfAvg <- Reduce(`+`, mget(paste0("df", 1:3)))/3
Another option is to convert to an array and then use apply, which also has the option of removing missing values (na.rm=TRUE)
apply(array(unlist(mget(paste0("df", 1:3))), c(dim(df1), 3)), 2, rowMeans, na.rm = TRUE)
As @user20650 mentioned, rowMeans can be applied directly on the array with the dims argument
rowMeans(array(unlist(mget(paste0("df", 1:3))), c(dim(df1), 3)), dims=2)
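As a quick sanity check, the Reduce and array approaches can be compared on a small list of data frames. This is only a sketch: the list `dfs` and the names `avg_reduce` and `avg_array` are made up for illustration, and the frames are assumed to share the same dimensions and column order.

```r
set.seed(789)
# Hypothetical small example: three 5 x 2 data frames kept in a list
dfs <- replicate(3, data.frame(a = runif(5, 0, 100), b = runif(5, 0, 100)),
                 simplify = FALSE)

# Approach 1: element-wise sum, divided by the number of frames
avg_reduce <- Reduce(`+`, dfs) / length(dfs)

# Approach 2: stack the frames into a rows x cols x frames array and
# average over the third dimension (na.rm = TRUE would skip missing cells)
arr <- array(unlist(dfs), dim = c(dim(dfs[[1]]), length(dfs)))
avg_array <- rowMeans(arr, dims = 2, na.rm = TRUE)

# Both give the same cell-wise means (the second one as a plain matrix)
all.equal(as.matrix(avg_reduce), avg_array, check.attributes = FALSE)
```

Note that the Reduce version returns NA for any cell that is missing in at least one frame, while the array version with na.rm = TRUE averages over the values that are present.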

Related

Optimize / solve equation for unknown exponent

I have a dataframe with the following variables: a and b as predictors and c as outcome. My formula is:
c = (a^x) / (a^x + b^x)
How to solve for x?
Example data:
dat <- data.frame(a = runif(5, 1, 100), b = runif(5, 10, 20), c = runif(5, 0, 1))
Reply to comment:
What is your expected output? A single x-value from least squares fitting, or a column x?
The whole column (sum of all row errors). I want to minimize the error for every row.
You can use the following code
library(minpack.lm)
dataset = data.frame(a = runif(5, 1, 100), b = runif(5, 10, 20), c = runif(5, 0, 1))
fun <- as.formula(c ~ a^x/(a^x + b^x))
#Fitting model using minpack.lm package
nls.out1 <- nlsLM(fun,
data = dataset,
start=list(x=1),
algorithm = "LM",
control = nls.lm.control(maxiter = 500))
summary(nls.out1)
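If a separate x per row is wanted rather than one fitted value, the formula can also be inverted directly: c / (1 - c) = (a / b)^x, so x = log(c / (1 - c)) / log(a / b). The sketch below assumes a and b are positive, a differs from b, and c lies strictly between 0 and 1; the sample data and the column name `x` are illustrative only.

```r
set.seed(1)
# Illustrative data; c is kept away from 0 and 1 so the logit stays finite
dat <- data.frame(a = runif(5, 1, 100),
                  b = runif(5, 10, 20),
                  c = runif(5, 0.05, 0.95))

# qlogis(c) is log(c / (1 - c)), so this is the closed-form solution per row
dat$x <- qlogis(dat$c) / log(dat$a / dat$b)

# Plugging x back into the original formula should reproduce c
check <- with(dat, a^x / (a^x + b^x))
all.equal(check, dat$c)
```

This gives an exact solution for each row, whereas nlsLM estimates a single x that minimizes the squared error over all rows.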

"For" loop with column names as index

I would like to create a loop in which the index is given by the column names of a dataframe. The idea is to select one column at a time and create a map based on the data in that column. I need i being the column name, as it identifies the name of the variable and I'll use that as part of the title of the map. However, I do not seem to be able to associate my index i to the name of the column. My code goes as follows:
# random data
x <- rep(c("AT130", "DEA1A", "DEA2C", "SE125", "SE232"), 4)
y <- c(1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0 ,1, 0, 1, 0, 1)
z <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ,0, 0, 0, 0, 0)
w <- c(0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1 ,0, 1, 0, 1, 0)
d <- as.data.frame(cbind(x,y,z,w))
colnames(d) <- c("id", "typeA", "typeB", "typeC")
for (i in colnames(d[,2:ncol(d)])) {
var_to_map <- d[,c(1,i)]
## do stuff
}
I get the following error at the first line:
Error: Can't subset columns that don't exist.
x Columns `1`, `2`, and `3` don't exist.
Run `rlang::last_error()` to see where the error occurred.
However, if I just run colnames(d[,2:ncol(d)]), it works properly
colnames(d[,2:ncol(d)])
[1] "typeA" "typeB" "typeC"
I could find a workaround by using column numbers, but I would like to keep the column names since I am printing (10+) maps within the loop and I am using i to insert the title of each map, as follows:
# I use geodata files from the library `Eurostat`.
geodata <- get_eurostat_geospatial(resolution = "60", nuts_level = "3", year = 2013)
for (i in colnames(d[,2:ncol(d)])) {
var_to_map <- d[,c(1,i)]
colnames(var_to_map)[1] <- "geo"
# Joining, by = "geo"
map_data <- merge(var_to_map, geodata, by=c("geo"), all.y=T, all.x=T)
## creating ranges
map_data$cat <- with(map_data, cut(value,
breaks= qu <- unique(quantile(value,
probs=c(0, 0.2, 0.5, 0.8,
0.9, 0.95, 0.99, 1),
na.rm=TRUE, include.lowest=T )),
labels=qu[-1]),include.lowest=TRUE )
# Map
print(ggplot(data=map_data) + geom_sf(aes(fill=cat), size=.1) +
scale_fill_brewer(palette = "Darkred", na.value= "grey") + aes(geometry = geometry) +
guides(fill = guide_legend(reverse=T, title = "Percentiles")) +
labs(title = paste("The name of this graph is the column name", i) ## here is where I use the index
)+
theme_minimal() + theme(legend.position=c(.8,.6)) +
coord_sf(xlim=c(-12,44), ylim=c(35,70)) +
theme( axis.text.x=element_blank(), axis.text.y=element_blank()))
}
I could also use column numbers for i and create another object with column names to refer to when pasting the title of the map, but I am wondering why the above approach fails and what I could do to make it work in that setting.
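In base R, you can select columns either by position or by name, but you can't combine both in one command. c(1, i) mixes a number with a string, so the 1 is coerced to the string "1", which is then looked up as a column name that does not exist. If you use dplyr::select you can select columns by name and position in the same command.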
So here are your options -
cols <- colnames(d)
for (i in cols[-1]) {
#Select columns by position
var_to_map <- d[,c(1,match(i, cols))]
#OR select columns by name
var_to_map <- d[,c(cols[1],i)]
#OR select by position and name with dplyr (all_of() marks i as a character vector of names)
var_to_map <- dplyr::select(d, 1, dplyr::all_of(i))
#...rest of the code
}
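The coercion behind the original error can be seen in isolation. This is a minimal sketch with a made-up two-row data frame; only base R is used.

```r
# Hypothetical miniature version of the data
d <- data.frame(id = c("AT130", "DEA1A"), typeA = c(1, 0), typeB = c(0, 1))
i <- "typeA"

# Mixing a number and a string in c() coerces everything to character
idx <- c(1, i)
print(idx)                 # "1" "typeA"
print(idx %in% names(d))   # FALSE TRUE: there is no column called "1"

# Using names for both columns keeps the index consistent
var_to_map <- d[, c(names(d)[1], i)]
print(names(var_to_map))   # "id" "typeA"
```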
There's a lot going on in this question, but perhaps this minimal example will help:
library(tidyverse)
# random data
d <- data.frame(x = rep(c("AT130", "DEA1A", "DEA2C", "SE125", "SE232"), 4),
y = sample(1:10, 20, replace = TRUE),
z = sample(1:10, 20, replace = TRUE),
w = sample(1:10, 20, replace = TRUE))
colnames(d) <- c("id", "typeA", "typeB", "typeC")
for (i in colnames(d[,2:ncol(d)])) {
type <- sym(i) # turn the column name (a string) into a symbol
p <- ggplot(d, aes(y = !!type, x = id, fill = id)) +
geom_boxplot() +
ggtitle(i)
print(p)
}

Combine two list with equal element names into list of lists by element name R

I have two lists, one with matrices and one with vectors. Both have named elements which are equal. This looks like this:
set.seed(1)
mat_list <- list("2009" = matrix(runif(n = 9, min = 0, max = 10), 3, 3),
"2010" = matrix(runif(n = 9, min = 0, max = 10), 3, 3))
vec_list <- list("2009" = c(runif(n = 3, min = 0, max = 10)),
"2010" = c(runif(n = 3, min = 0, max = 10)))
What I want is to create a new list with the elements 2009 and 2010 that contains the respective matrix and vector, so that I can access them both in a lapply call. My actual data has some more years, so it would be nice to not have to reference the years explicitly.
I found a bunch of similar questions, but I couldn't figure out how to apply the answers to my situation. Thanks!
With the purrr package and the map2 function, you can do this:
# install.packages("purrr")
set.seed(1)
mat_list <- list("2009" = matrix(runif(n = 9, min = 0, max = 10), 3, 3),
"2010" = matrix(runif(n = 9, min = 0, max = 10), 3, 3))
vec_list <- list("2009" = c(runif(n = 3, min = 0, max = 10)),
"2010" = c(runif(n = 3, min = 0, max = 10)))
l <- purrr::map2(mat_list, vec_list, function(x,y) list(x,y))
#l <- purrr::map2(mat_list, vec_list, ~list(.x,.y)) #shorter notation
#l <- purrr::map2(mat_list, vec_list, list) #even shorter
#x and y inside the map2 are the elements of each list at each iteration,
#so we can combine them in a list
Thanks to @markus:
l <- Map(list, mat_list, vec_list) # no need for another package
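To illustrate how the combined list might then be used in a single lapply call, here is a small sketch. The inner element names `mat` and `vec` and the matrix-vector product are assumptions chosen for illustration; indexing vec_list by names(mat_list) pairs the elements by name rather than by position.

```r
set.seed(1)
mat_list <- list("2009" = matrix(runif(9, 0, 10), 3, 3),
                 "2010" = matrix(runif(9, 0, 10), 3, 3))
vec_list <- list("2009" = runif(3, 0, 10),
                 "2010" = runif(3, 0, 10))

# Pair each matrix with the vector of the same name; naming the inner
# elements makes them easier to reach later
combined <- Map(function(m, v) list(mat = m, vec = v),
                mat_list, vec_list[names(mat_list)])

# Both pieces are now available inside one lapply call
products <- lapply(combined, function(el) el$mat %*% el$vec)
names(products)
```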

How to create a dataframe representing a 10000 points unit square?

I have to create a dataframe representing a unit square, shaped by 10 000 points. In order to achieve that, I need all the combinations between (coordinates) x and y, where each one goes from 0 to 1,00. The result should be something like this:
x y
1 0,01 0,01
2 0,01 0,02
n 0,12 0,04
10000 1,00 1,00
I would be very glad if you can help me.
10 000 points are just a 100x100 square.
Here, for each fixed value of y, x runs through all of its 100 values.
To do this:
df<-data.frame(
x = rep(seq(from = 0, to = 1, length.out = 100), times = 100),
y = rep(seq(from = 0, to = 1, length.out = 100), each = 100)
)
Using @Heroka's suggestion, for the same output:
df<- expand.grid(x = seq(from = 0, to = 1, length.out = 100),
y = seq(from = 0, to = 1, length.out = 100)
)
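The grids above run from 0 to 1 in steps of 1/99. If the coordinates should instead be 0.01, 0.02, ..., 1.00 with x changing slowest, as in the table in the question, a variant along these lines might fit. This is a sketch; the names `steps` and `grid` are illustrative.

```r
# 100 evenly spaced coordinates: 0.01, 0.02, ..., 1.00
steps <- (1:100) / 100

# expand.grid varies its first argument fastest, so y is listed first
# and the columns are then reordered to x, y
grid <- expand.grid(y = steps, x = steps)[, c("x", "y")]

nrow(grid)      # 10000
head(grid, 2)   # rows (0.01, 0.01) and (0.01, 0.02)
```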

Subset dataframes in a list based on the content of a vector

I have a list of five dataframes. Each dataframe contains one dimension column and 4 value columns. I would like to subset each dataframe in the list based on the contents of a vector.
df <- data.frame(x = 1:100, y2 = runif(100, 0, 100), y3 = runif(100, 0, 100), y4 = runif(100, 0, 100), y5 = runif(100,0,100))
df2 <- data.frame(x = 1:100, y2 = runif(100, 0, 100), y3 = runif(100, 0, 100), y4 = runif(100, 0, 100), y5 = runif(100,0,100))
df3 <- data.frame(x = 1:100, y2 = runif(100, 0, 100), y3 = runif(100, 0, 100), y4 = runif(100, 0, 100), y5 = runif(100,0,100))
df4 <- data.frame(x = 1:100, y2 = runif(100, 0, 100), y3 = runif(100, 0, 100), y4 = runif(100, 0, 100), y5 = runif(100,0,100))
df5 <- data.frame(x = 1:100, y2 = runif(100, 0, 100), y3 = runif(100, 0, 100), y4 = runif(100, 0, 100), y5 = runif(100,0,100))
frames <- list(df, df2, df3, df4, df5)
So in this example, my list is "frames". Let's say I have the following vector:
subs <- 50:60
My goal here would be to subset the list of dataframes such that each dataframe only contains rows where the value of the first column is inside the subs vector.
Any advice?
Thanks,
Ben
It seems to me that almost all of your questions are about a list of data frames with the same columns, which causes you to use lapply loops for every single operation (which seems highly inefficient).
Alternatively, you could vectorize most of your operations by simply binding all the lists into a single object while maintaining the ID of each data.frame and when finished with all the data manipulations, you could split them back into lists using split.
Here's an example using data.table, whose rbindlist has an idcol argument (you could achieve similar results with dplyr::bind_rows and its .id argument). Note that %between% expects a length-2 range on its right-hand side, so membership in subs is tested with %in%:
library(data.table)
Res <- rbindlist(frames, idcol = "ID")[x %in% subs]
# ID x y2 y3 y4 y5
# 1: 1 50 54.692889 58.51886 12.754368 35.61516
# 2: 1 51 21.206308 12.77442 52.440787 93.67734
# ... (one row per frame for each x in subs; the values are random)
Eventually (after you have finished all the data manipulations) you will just do
split(Res, Res$ID)
In order to get the data.frames back into lists
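The same bind, filter, split idea can also be sketched in base R without extra packages. The data below is a smaller made-up stand-in for `frames`, and the names `stacked`, `kept` and `result` are illustrative.

```r
set.seed(42)
# Hypothetical stand-in for the list of data frames
frames <- lapply(1:3, function(k) {
  data.frame(x = 1:100, y2 = runif(100, 0, 100), y3 = runif(100, 0, 100))
})
subs <- 50:60

# Bind all frames into one, tagging each row with the index of its frame
stacked <- do.call(rbind, Map(function(df, id) cbind(ID = id, df),
                              frames, seq_along(frames)))

# One vectorized filter instead of one filter per frame
kept <- stacked[stacked$x %in% subs, ]

# Split back into a list of data frames, dropping the helper column
result <- split(kept[, names(kept) != "ID"], kept$ID)
sapply(result, nrow)
```

Each element of `result` then holds the 11 rows of the corresponding frame whose x value is in subs.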
You can try lapply
lapply(frames, function(.dat) .dat[with(.dat, x %in% subs),])
If your first columns are all named x, you can use lapply on frames:
lapply(frames,function(p){p[p$x %in% subs,]})
