I have a question about applying a function to each element of a list.
Here's my problem:
I have a list of DF (I divided a bigger DF by days):
mydf <- data.frame(x=c(1:5), y=c(21:25),z=rnorm(1:5))
mylist <- rep(list(mydf),5)
names(mylist) <-c("2006-01-01","2006-01-02","2006-01-03","2006-01-04","2006-01-05")
Don't worry that this fake data is identical across the list; it's just for the example. My results are in column "z" of each DF in the list, and the 2 other columns, "x" and "y", are spatial coordinates.
I have another independent DF containing a list of "x" and "y" too, representing some specific regions (imagine 10 regions):
region <- data.frame(x=c(1:10),y=c(21:30),region=c(1:10))
The final aim is to get, for each of the 10 regions, the value "z" (my results) from the nearest point (according to the coordinates) in each DF of my list.
That means 10 "z" results (one per region) from DF1 of my list, then 10 more "z" results from DF2, ...
My final DF should look like this if possible (for the structure):
final1 <- data.frame("2006-01-01"=rnorm(1:10),"2006-02-01"=rnorm(1:10),
"2006-03-01"=rnorm(1:10),"2006-04-01"=rnorm(1:10),"2006-05-01"=rnorm(1:10))
With one column for one day (so one DF of the list) and one value for each row (so for example for 2006-01-01: the value "z" from the nearest point with the first region).
I already have a small function to look for the nearest value:
min.dist <- function(p, coord){
which.min( colSums((t(coord) - p)^2) )
}
Then, I'm trying to write a loop to get what I want, but I have difficulties with the list. I would need to put 2 variables in the loop, but it doesn't work.
This works approximately if I just take 1 DF of my list:
for (j in 1:nrow(region)){
imin <- min.dist(c(region[j,1],region[j,2]),mylist[[1]][,1:2])
imin[j] <- min.dist(c(region[j,1],region[j,2]),mylist[[1]][,1:2])
final <- mylist[[1]][imin[j], "z"]
final[j] <- mylist[[1]][imin[j], "z"]
final <- as.data.frame(final)
}
But if I select my whole list (in order to have one column of results for each DF of the list in the object "final"), I have errors.
I think the first problem is that the length of "region" is different from the length of my list, and the second is maybe about adding a second variable for the length of my list.
I'm not very familiar with loops, and even less with 2-variable loops.
Could you help me change what needs to be changed in the loop in order to get what I'm looking for?
Thank you very much!
You can use lapply() to apply a function over a list.
This should work. It returns a list of vectors.
lapply(
mylist,
FUN = function(mydf)
mydf[apply(
region[, -3],
1,
FUN = function(x)
which.min(apply(
mydf[, -3],
1,
FUN = function(y)
dist(rbind(x, y))
))
), 3]
)
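If the goal is the wide table described in the question (one column per day, one row per region), the same idea can be written with sapply(), which binds the per-day vectors into a matrix. This is only a sketch: it reuses the asker's min.dist() and fake data, and nearest_z is a hypothetical helper name introduced here.

```r
set.seed(1)
mydf <- data.frame(x = 1:5, y = 21:25, z = rnorm(5))
mylist <- rep(list(mydf), 5)
names(mylist) <- c("2006-01-01", "2006-01-02", "2006-01-03",
                   "2006-01-04", "2006-01-05")
region <- data.frame(x = 1:10, y = 21:30, region = 1:10)

# The asker's helper: row index of the point in coord nearest to p
min.dist <- function(p, coord) {
  which.min(colSums((t(coord) - p)^2))
}

# Hypothetical helper: for one daily DF, the z of the nearest point per region
nearest_z <- function(df) {
  idx <- vapply(
    seq_len(nrow(region)),
    function(j) min.dist(c(region$x[j], region$y[j]), df[, c("x", "y")]),
    integer(1)
  )
  df$z[idx]
}

# One column per day, one row per region
final <- as.data.frame(sapply(mylist, nearest_z))
dim(final)
```

The outer iteration runs over the list (days) and the inner one over the rows of region, so the two lengths never have to match.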
Related
I'm combining 12 CSV files into one dataframe in R. Before doing this I want to ensure all the column names are an exact match with each other. I've made a dataframe where each column is the column names of the 12 CSV files.
jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
sep21_cols <- data.frame(colnames(sep21))
oct21_cols <- data.frame(colnames(oct21))
nov21_cols <- data.frame(colnames(nov21))
dec21_cols <- data.frame(colnames(dec21))
jan22_cols <- data.frame(colnames(jan22))
feb22_cols <- data.frame(colnames(feb22))
mar22_cols <- data.frame(colnames(mar22))
apr22_cols <- data.frame(colnames(apr22))
may22_cols <- data.frame(colnames(may22))
jun22_cols <- data.frame(colnames(jun22))
col_df <- cbind(jul21_cols,aug21_cols,sep21_cols,oct21_cols,nov21_cols,dec21_cols,
jan22_cols,feb22_cols,mar22_cols,apr22_cols,may22_cols,jun22_cols)
I've tried using the identical function to compare 2 columns at a time.
identical(col_df[['jul21']], col_df[['aug21']])
identical(col_df[['aug21']], col_df[['sep21']])
identical(col_df[['sep21']], col_df[['oct21']])
identical(col_df[['oct21']], col_df[['nov21']])
identical(col_df[['nov21']], col_df[['dec21']])
identical(col_df[['dec21']], col_df[['jan22']])
identical(col_df[['jan22']], col_df[['feb22']])
identical(col_df[['feb22']], col_df[['mar22']])
identical(col_df[['mar22']], col_df[['apr22']])
identical(col_df[['apr22']], col_df[['may22']])
identical(col_df[['may22']], col_df[['jun22']])
All of the identical() lines return TRUE.
I'm just trying to verify that this code is telling me all my column names are identical in each CSV file before I move on. I'd also like to know if there is a more efficient way to solve this problem.
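One thing worth checking before trusting those TRUE results (an observation from the code as posted, not verified against the real CSV files): data.frame(colnames(jul21)) names its single column colnames.jul21., not jul21. If that holds for col_df, then col_df[['jul21']] is NULL, and identical(NULL, NULL) is TRUE whatever the column names are. A minimal sketch with two made-up data frames whose names clearly differ:

```r
# Made-up data frames with deliberately different column names
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(a = 3, b = 4)

jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
col_df <- cbind(jul21_cols, aug21_cols)

# The columns are not called "jul21" and "aug21"
names(col_df)

# Both lookups return NULL, so this compares NULL with NULL
misleading <- identical(col_df[['jul21']], col_df[['aug21']])

# Comparing the names directly gives the real answer
direct <- identical(colnames(jul21), colnames(aug21))
```

Here misleading is TRUE although the names differ, while direct is FALSE, so comparing colnames() directly is the safer check.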
First, identical() will only return TRUE if the two dataframes have all the same column names in the same order. If you don’t care about order, just that all the same names are in both dataframes, you can sort() the names before comparing as shown below.
Second, you can often use the base::lapply() or purrr::map() families of functions for operations requiring iteration.
For your case, let’s put your dataframes in a list (which they probably should be to begin with), then use sapply() to compare the column names of the first df in the list to the column names of all other dfs.
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, y = 4)
sep21 <- data.frame(y = 6, x = 5)
dfs <- list(jul21,aug21,sep21)
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# TRUE
And as another test case, we’ll add a df with a non-matching column.
oct22 <- data.frame(x = 1, y = 2, z = 3)
dfs[[4]] <- oct22
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# FALSE
We assume that what is needed is to determine if the column names are the same and in same order and if not to determine which differ.
First get a character vector, Names, containing the names of the data frames and from that make a named list L containing the data frames themselves.
Then get a character vector nms whose elements are strings of column names, one for each data frame.
Finally group the names of the data frames using tapply with nms as the grouping so we can see which data frames contain which columns. In the example below aug21 and jul21 have one set of columns, i.e. Time and demand, and sep21 has a different set, i.e. Time and DEMAND. If the result had only one row then all data frames would have the same column names in the same order.
Names <- c("jul21", "aug21", "sep21") # using example in Note
L <- mget(Names)[Names]
nms <- sapply(names(L), function(x) toString(names(L[[x]])))
tab <- stack(tapply(names(nms), nms, toString))
names(tab) <- c("data.frames", "column.names")
nrow(tab)
## [1] 2
tab
##    data.frames column.names
## 1 jul21, aug21 Time, demand
## 2        sep21 Time, DEMAND
graph
Another approach, which could be used instead of or in conjunction with the one above, is to create a graph such that each vertex is a data frame and each edge means that the two vertices on either end of the edge have the same column names in the same order. Each connected component represents a distinct set or order of column names. From the example below we see that jul21 and aug21 form one connected component and sep21 forms a second connected component.
To investigate how data frame column names differ, note that setdiff(names(jul21), names(sep21)) will show names that are in jul21 but not in sep21, and the reverse can be used for the other direction. If the setdiff in both directions is a zero length vector and the names vectors are not identical, then they differ only by order.
library(igraph)
set.seed(123)
isSame <- function(x, y) +identical(names(x), names(y))
A <- outer(L, L, Vectorize(isSame))
diag(A) <- 0
g <- graph_from_adjacency_matrix(A, "undirected")
plot(g, vertex.color = "white", vertex.size = 30)
Note
Test data. BOD comes with R.
jul21 <- aug21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")
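The setdiff() check described above can be sketched on the same test data. The reordered data frame is a hypothetical extra case added here to show the "same names, different order" situation.

```r
jul21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")

# Names present in one data frame but not in the other
only_in_jul <- setdiff(names(jul21), names(sep21))
only_in_sep <- setdiff(names(sep21), names(jul21))

# Hypothetical extra case: same columns as jul21 in a different order
reordered <- BOD[, c("demand", "Time")]
same_set   <- setequal(names(jul21), names(reordered))
same_order <- identical(names(jul21), names(reordered))
```

When both setdiff() results are empty (same_set is TRUE) but same_order is FALSE, the data frames differ only by column order.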
I have a tibble with a variable that contains lists. Each list has a different length. I would like to have two new variables, let’s say “lon” and “lat”. In variable “lon” I’d like to have the first half of each list, and in variable “lat” the second half.
data:
file_url <- "https://github.com/slawomirmatuszak/Covid.UA/raw/master/sample.Rda?raw=true"
load(url(file_url))
I can achieve that by filtering lists, but I’d like to do this by more universal code (based on lengths, not a specific number).
sample.data$lon <- lapply(sample.data$geometry, function(x) unlist(x)[x<40])
sample.data$lat <- lapply(sample.data$geometry, function(x) unlist(x)[x>40])
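You can use length() to take the first half and the second half of each element of the geometry column. As asked in the question, the first half goes to lon and the second half to lat; lapply() is used because it always returns a list, whatever the lengths are.
sample.data$lon <- lapply(sample.data$geometry, function(x)
{tmp <- unlist(x);tmp[1:(length(tmp)/2)]})
sample.data$lat <- lapply(sample.data$geometry, function(x)
{tmp <- unlist(x);tmp[((length(tmp)/2) + 1):length(tmp)]})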
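Since the sample file has to be downloaded, here is a small self-contained sketch of the same length-based split on a made-up list. The geometry values and the split_half helper are assumptions for illustration, and every element is assumed to have an even length.

```r
# Made-up stand-in for the geometry column: longitudes first, then latitudes
geometry <- list(
  c(30.1, 30.2, 50.1, 50.2),
  c(24.5, 24.6, 24.7, 49.1, 49.2, 49.3)
)

# Hypothetical helper: split a vector into its first and second half
split_half <- function(v) {
  v <- unlist(v)
  n <- length(v) %/% 2
  list(first = head(v, n), second = tail(v, n))
}

halves <- lapply(geometry, split_half)
lon <- lapply(halves, `[[`, "first")
lat <- lapply(halves, `[[`, "second")
```

head() and tail() avoid the 1:0 pitfall that 1:(length(tmp)/2) would hit on an empty element.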
I have a named list of vectors, y. The names of the list correspond to the values of variable, x. I need to return the value of the vector in y that matches the value of x at position i. For example, if x == "b" at index 25, I expect to return the 25th value of the "b" vector contained in the list y.
This is my current solution:
x <- sample(letters[1:4], 100, replace = T)
y <- list("a"=rnorm(100), "b"=rnorm(100), "c"=rnorm(100))
i <- match(x, names(y))
m <- sapply(i, function(i) {out <- rep(0,3); out[i] <- 1; out})
final <- apply(t(m) * do.call(cbind, y), 1, sum)
I am hoping for something more idiomatic. As part of the solution, the answer should handle cases where values in x do not appear in the names of y.
The real world use case I am trying to solve is the case where I have several segmented model predictions applied to the entire population that I need to assign to their appropriate segment.
EDIT
Also, trying to avoid the clunky usage of ifelse. Since the names are known, I shouldn't have to specify them manually.
Using matrix subsetting with 2-dimensional indices, you could simply do
do.call(cbind, y)[cbind(1:length(i), i)]
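To see what the matrix indexing does, here is a small sketch on shorter vectors that compares it with an explicit element-by-element lookup. The names picked and expected are introduced here for illustration. Each row of the two-column index matrix is a (row, column) pair, and a row containing NA yields NA, which covers values of x that are not in names(y).

```r
set.seed(42)
x <- sample(letters[1:4], 10, replace = TRUE)
y <- list(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# Column of the bound matrix for each element of x (NA when x is not in names(y))
i <- match(x, names(y))

# Row k, column i[k] of the 10 x 3 matrix
picked <- do.call(cbind, y)[cbind(seq_along(i), i)]

# Same result computed one element at a time, for comparison
expected <- vapply(
  seq_along(x),
  function(k) if (is.na(i[k])) NA_real_ else y[[i[k]]][k],
  numeric(1)
)
isTRUE(all.equal(picked, expected))
```

Note that unmatched values come back as NA here, whereas the original apply() solution returns 0 for them.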
My goal is take a list of dataframes, see if a specific column of the data frames has a max value of 0, and if so, remove that data frame from my list.
Right now I am looping over the names of the list. Given that this is R, there must be a better way. I feel I need some function applied through lapply() to get this right. I've also considered ddply() but I think that may be overkill. Here is what I have so far:
# Make df of First element
myColumn <- rep ("ElementA",times=10)
values <- seq(1,10)
a <- data.frame(myColumn,values)
# Make df of second element
myColumn <- rep ("ElementB",times=10)
values <- rep(0,10)
b <- data.frame(myColumn,values)
# Bind the dataframes together
df <- rbind(a,b)
#Now split the dataframes based on element name
myList <- split(df,df$myColumn)
# Now loop through element lists and check for max of 0 in values
for (name in names(myList)) { # Loop through List
if (max(myList[[name]]$values) == 0) { # Check Max for 0
myList <- myList[[-names]] # If 0, remove element from list
} # Close If
} # Close Loop
Error in -names : invalid argument to unary operator
I've tested my code outside the loop, and it all seems to work.
Any help is greatly appreciated. Thanks!
You can use this:
myList <- myList[sapply(myList, function(d) max(d$values) != 0)]
instead of the for() loop. Note that data frames with zero rows will pass this filter, with a warning: max() of an empty vector returns -Inf, which is not equal to 0.
To ensure empty dataframes are removed, use this:
myList <- myList[sapply(myList, function(d) if(nrow(d)==0) FALSE else max(d$values)!=0)]
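As for the error in the original loop: -names tries to negate the function names() (the loop variable is name), which is what "invalid argument to unary operator" complains about. A sketch of the loop with that line changed to assign NULL, followed by an equivalent Filter() call (kept is a name introduced here):

```r
a <- data.frame(myColumn = rep("ElementA", 10), values = 1:10)
b <- data.frame(myColumn = rep("ElementB", 10), values = rep(0, 10))
df <- rbind(a, b)
myList <- split(df, df$myColumn)

# Loop version: assigning NULL removes the element with that name
for (name in names(myList)) {
  if (max(myList[[name]]$values) == 0) {
    myList[[name]] <- NULL
  }
}
names(myList)

# Filter() keeps the elements for which the predicate is TRUE;
# the nrow() check also drops empty data frames without a warning
kept <- Filter(
  function(d) nrow(d) > 0 && max(d$values) != 0,
  split(df, df$myColumn)
)
names(kept)
```

Both versions leave only ElementA in the list for this example data.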
I have a list of different data types (factors, data.frames, and vectors, all the same length or number of rows). What I would like to do is subset each element of the list by a vector (let's call it rows) that represents row names.
If it was a data.frame() I would:
x <- x[rows,]
If it was a vector() or factor() I would:
x <- x[rows]
So, I've been playing around with this:
x <- lapply(my_list, function(x) ifelse(is.data.frame(x), x[rows,], x[rows]))
So, how do I accomplish my goal of getting a list of subsetted data?
I think this is YAIEP (Yet Another If Else Problem). From ?ifelse:
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
See the trouble? Same shape as test.
So just do this:
l <- list(a = data.frame(x=1:10,y=1:10),b = 1:10, c = factor(letters[1:20]))
rows <- 1:3
fun <- function(x){
if (is.data.frame(x)){
x[rows,]
}
else{
x[rows]
}
}
lapply(l,fun)
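To make the "same shape as test" point concrete, here is a short sketch comparing ifelse() with a plain if/else on a single vector. The names v, res_ifelse and res_if are introduced here for illustration.

```r
rows <- 1:3
v <- 1:10

# is.data.frame(v) is a single FALSE, so ifelse() returns a single value:
# only the first element of v[rows] survives
res_ifelse <- ifelse(is.data.frame(v), NA, v[rows])

# if/else returns whichever branch is chosen, whole
res_if <- if (is.data.frame(v)) NA else v[rows]

length(res_ifelse)
length(res_if)
```

The test has length 1, so ifelse() gives a length-1 result, while if/else returns all three selected elements.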