I have a large list of dataframes with environmental variables from different localities. For each dataframe in the list, I want to summarize the values by locality (i.e. group measurements from the same locality into one), using the name of each dataframe to determine which variable needs to be summarized. For example, for a dataframe named 'salinity' I only want to summarize salinity, not the other environmental variables. Note that the different dataframes contain data from different localities, so I cannot simply merge them into one dataframe.
Let's do this with a dummy dataset:
#create list of dataframes
df1 = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
Temp = c(14, 15, 16, 18, 20, 18, 21),
Sal = c(16, NA, NA, 12, NA, NA, 9))
df2 = data.frame(locality = c(1, 1, 3, 6, 8, 9, 9),
Temp = c(1, 2, 4, 5, 0, 2, -1),
Sal = c(18, NA, NA, NA, 36, NA, NA))
df3 = data.frame(locality = c(1, 3, 4, 4, 5, 5, 9),
Temp = c(14, NA, NA, NA, 17, 18, 21),
Sal = c(16, 8, 24, 23, 11, 12, 9))
df4 = data.frame(locality = c(1, 1, 1, 4, 7, 8, 10),
Temp = c(1, NA, NA, NA, NA, 0, 2),
Sal = c(18, 17, 13, 16, 20, 36, 30))
df_list = list(df1, df2, df3, df4)
names(df_list) = c("Summer_temperature", "Winter_temperature",
"Summer_salinity", "Winter_salinity")
Next, I used lapply to summarize environmental variables:
#select only those dataframes in the list that have either 'salinity' or 'temperature' in the dataframe names
df_sal = df_list[grep("salinity", names(df_list))]
df_temp = df_list[grep("temperature", names(df_list))]
#use lapply to summarize salinity or temperature values in each dataframe
library(dplyr)
##salinity
df_sal2 = lapply(df_sal, function(x) {
x %>%
group_by(locality) %>%
summarise(Sal = mean(Sal, na.rm = TRUE))
})
##temperature
df_temp2 = lapply(df_temp, function(x) {
x %>%
group_by(locality) %>%
summarise(Temp = mean(Temp, na.rm = TRUE))
})
Now, this code is repetitive, so I want to streamline it by combining everything into one function. This is what I tried:
df_env = lapply(df_list, function(x) {
if (grepl("salinity", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Sal = mean(Sal, na.rm = TRUE))}
if (grepl("temperature", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Temp = mean(Temp, na.rm = TRUE))}
})
But I am getting the following output:
$Summer_temperature
NULL
$Winter_temperature
NULL
$Summer_salinity
NULL
$Winter_salinity
NULL
And the following warning messages:
Warning messages:
1: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
2: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
3: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
4: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
5: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
6: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
7: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
8: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
Now, I read here that this warning message can potentially be solved by using ifelse. However, in the final dataset I will have more than two environmental variables, so I would have to add many more if statements; for this reason I believe ifelse is not a solution here. Does anyone have an elegant solution to my problem? I am new to both writing functions and using lapply, and would appreciate any help you can give me.
EDIT:
I tried using the else if option suggested in one of the answers, but this still returns NULL values. I also tried using return() and assigning the output to x, but both have the same problem as the code below - any ideas?
#else if
df_env = lapply(df_list, function(x) {
if (grepl("salinity", names(x)) == TRUE) {
x %>% group_by(locality) %>%
summarise(Sal = mean(Sal, na.rm = TRUE))}
else if (grepl("temperature", names(x)) == TRUE) {
x %>% group_by(locality) %>%
summarise(Temp = mean(Temp, na.rm = TRUE))}
})
df_env
What I think is happening is that my if condition does not get passed on to the summarise call, so nothing is being summarized.
There are several things going on here, including:
as akrun said, if statements must have a condition with a length of 1. Yours are not.
grepl("locality", names(df1))
# [1] TRUE FALSE FALSE
That must be reduced so that it is always exactly length 1. Frankly, grepl is the wrong tool here, since technically a column named notlocality would match and then it would error. I suggest you change to
"locality" %in% names(df1)
# [1] TRUE
You need to return something. Always. You shifted from if ...; if ...; to if ... else if ..., which is a good start, but if neither condition is met, nothing is returned. I suggest one of the following: either add one more } else x, or reassign within each branch as if (..) { x <- x %>% ...; } else if (..) { x <- x %>% ...; } and end the anonymous function with just x (to return it); a sketch of this second variant is shown after the output below.
However, I think ultimately the problem is that you are looking for "temperature" or "salinity" which are in the names of the list-objects, not in the frames themselves. For instance, your reference to names(x) is returning c("locality", "Temp", "Sal"), the names of the frame x itself.
I think this is what you want?
Map(function(x, nm) {
if (grepl("salinity", nm)) {
x %>%
group_by(locality) %>%
summarize(Sal = mean(Sal, na.rm = TRUE))
} else if (grepl("temperature", nm)) {
x %>%
group_by(locality) %>%
summarize(Temp = mean(Temp, na.rm = TRUE))
} else x
}, df_list, names(df_list))
# $Summer_temperature
# # A tibble: 5 x 2
# locality Temp
# <dbl> <dbl>
# 1 1 14
# 2 2 15.5
# 3 5 18
# 4 7 19
# 5 9 21
# $Winter_temperature
# # A tibble: 5 x 2
# locality Temp
# <dbl> <dbl>
# 1 1 1.5
# 2 3 4
# 3 6 5
# 4 8 0
# 5 9 0.5
# $Summer_salinity
# # A tibble: 5 x 2
# locality Sal
# <dbl> <dbl>
# 1 1 16
# 2 3 8
# 3 4 23.5
# 4 5 11.5
# 5 9 9
# $Winter_salinity
# # A tibble: 5 x 2
# locality Sal
# <dbl> <dbl>
# 1 1 16
# 2 4 16
# 3 7 20
# 4 8 36
# 5 10 30
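For completeness, here is a sketch of the reassignment variant mentioned above; it produces the same result as the Map() call shown before:
Map(function(x, nm) {
  if (grepl("salinity", nm)) {
    x <- x %>% group_by(locality) %>% summarize(Sal = mean(Sal, na.rm = TRUE))
  } else if (grepl("temperature", nm)) {
    x <- x %>% group_by(locality) %>% summarize(Temp = mean(Temp, na.rm = TRUE))
  }
  # x is returned in all cases, whether or not it was modified above
  x
}, df_list, names(df_list))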
Related
I'm currently working on a longitudinal database in R. It contains a lot of missing values, because variables that have not changed since the last interview are not entered again in the new wave. For example, in the first wave sex is recorded as boy or girl, and since it does not change between the first and second wave, it is not recorded again in the second wave.
Basically, I would like to merge the data I have selected for the second wave with the data from the first wave, in order to fill in some of the NAs. However, I would like to keep only the columns I have selected from the second wave. After looking on the internet, I was so far only able to merge the two datasets, but not to keep only the data from the second wave.
Here is my code:
library("rqdatatable")
x <- data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, NA,0),
S2 = c(2, NA, 2,2)
)
y <- data.frame(
ID = c(1, 2, 3, 4,5,6,7,8),
S1 = c(1, 2, 5, 1,3,6,8,2),
S3 = c(3, 3, 3, 3,7,1,6,9),
S2 = c(0,0,0,0,0,0,0,0),
S4 = c(0,0,0,0,0,0,0,0)
)
final <- natural_join(x, y,
by = "ID",
jointype = "LEFT")
What I would like to get after my merge is:
z = data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, 5,0),
S2 = c(2, 0, 2,2)
)
Do you have any idea of how I can solve my problem?
It would be very time consuming to merge everything and to select the variables I want again.
Many thanks and best regards!
Here is a base R function that joins the data as in the question. It can also be called via a pipe, in this case R's native pipe operator |> introduced in R 4.1.
x <- data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, NA,0),
S2 = c(2, NA, 2,2)
)
y <- data.frame(
ID = c(1, 2, 3, 4,5,6,7,8),
S1 = c(1, 2, 5, 1,3,6,8,2),
S3 = c(3, 3, 3, 3,7,1,6,9),
S2 = c(0,0,0,0,0,0,0,0),
S4 = c(0,0,0,0,0,0,0,0)
)
joinSpecial <- function(x, y, idcol = "ID"){
idcolx <- which(names(x) == idcol)
idcoly <- which(names(y) == idcol)
idx <- which(names(x) %in% names(y))
idy <- which(names(y) %in% names(x))
idx <- idx[idx != idcolx]
idy <- idy[idy != idcoly]
i <- match(x[[idcolx]], y[[idcoly]])
x[idx] <- mapply(\(a, b, i){
na <- is.na(a)
a[na] <- b[i][na]
a
}, x[idx], y[idy], MoreArgs = list(i = i), SIMPLIFY = FALSE)
x
}
joinSpecial(x, y)
#> ID S1 S2
#> 1 1 1 2
#> 2 2 3 0
#> 3 3 5 2
#> 4 4 0 2
x |> joinSpecial(y)
#> ID S1 S2
#> 1 1 1 2
#> 2 2 3 0
#> 3 3 5 2
#> 4 4 0 2
Created on 2022-03-18 by the reprex package (v2.0.1)
We could use inner_join in combination with coalesce
library(dplyr)
x %>%
inner_join(y, by="ID") %>%
mutate(S1 = coalesce(S1.x, S1.y),
S2 = coalesce(S2.x, S2.y)) %>%
select(ID, S1, S2)
ID S1 S2
1 1 1 2
2 2 3 0
3 3 5 2
4 4 0 2
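With many shared columns, writing one coalesce() per column gets tedious. As a sketch of an alternative (assuming dplyr >= 1.0.0 for rows_patch()), the NAs in x can be filled from y by the ID key without naming each column; the semi_join() drops the IDs in y that have no counterpart in x:
library(dplyr)
# keep only the columns of y that also exist in x, and only the IDs present in x
y_sub <- y %>% select(any_of(names(x))) %>% semi_join(x, by = "ID")
# rows_patch() overwrites only NA values in x with the matching values from y_sub
rows_patch(x, y_sub, by = "ID")
#   ID S1 S2
# 1  1  1  2
# 2  2  3  0
# 3  3  5  2
# 4  4  0  2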
I have a dataframe with two variables (start, end). I would like to create an identifier variable which grows in ascending order of start and, most importantly, is kept constant if the value of start coincides with the end of any other row in the dataframe.
Below is a simple example of the data
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
The output I would be looking for is the following:
output_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17),
NEW_VAR = c(1,1,2,3,4))
You could try adapting this answer to group by ranges that are adjacent to each other. Credit goes entirely to @r2evans.
In this case, you would use expand.grid to get combinations of start and end. Instead of labels you would have row numbers rn to reference.
In the end, you can number the groups based on which rows appear together in the list. The last few lines starting with enframe use tibble/tidyverse. To match the group numbers I re-sorted the results too.
I hope this might be helpful.
library(tidyverse)
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
toy_data$rn = 1:nrow(toy_data)
eg <- expand.grid(a = seq_len(nrow(toy_data)), b = seq_len(nrow(toy_data)))
eg <- eg[eg$a < eg$b,]
together <- cbind(
setNames(toy_data[eg$a,], paste0(names(toy_data), "1")),
setNames(toy_data[eg$b,], paste0(names(toy_data), "2"))
)
together <- subset(together, end1 == start2)
groups <- split(together$rn2, together$rn1)
for (i in toy_data$rn) {
ind <- (i == names(groups)) | sapply(groups, `%in%`, x = i)
vals <- groups[ind]
groups <- c(
setNames(list(unique(c(i, names(vals), unlist(vals)))), i),
groups[!ind]
)
}
min_row <- as.numeric(sapply(groups, min))
ctr <- seq_along(groups)
lapply(ctr[order(match(min_row, ctr))], \(x) toy_data[toy_data$rn %in% groups[[x]], ]) %>%
enframe() %>%
unnest(col = value) %>%
select(-rn)
Output
name start end
<int> <dbl> <dbl>
1 1 1 10
2 1 10 15
3 2 5 9
4 3 6 11
5 4 16 17
The following function should give you the desired identifier variable NEW_VAR.
identifier <- \(df) {
x <- array(0L, dim = nrow(df))
count <- 0L
my_seq <- seq_len(nrow(df))
for (i in my_seq) {
if(!df[i,]$start %in% df$end) {
x[i] <- my_seq[i] + count
} else {
x[i] <- my_seq[i]-1L + count
count <- count - 1L
}
}
x
}
Examples
# your example
toy_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 1 1 2 3 4
# other example
toy_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
end = c(16, 2, 21, 2, 2, 2, 3, 1))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 0 0 0 1 1 1 2 2
I am trying to summarise a list of dataframes. Here is some test data
noms <- list('A', 'B')
A_data <- data.frame('Dis' = c(1, 1, 2, 2),
'adj' = c(3, 2, 6, 7))
B_data <- data.frame('Dis' = c(1, 1, 2, 2),
'adj' = c(2, 6, 3, 6))
frames <- list(A_data, B_data)
I want to produce a list of data frames where 'adj' is summed for each 'Dis' group, and then add a column with the relevant name from 'noms', so I can combine the data frames into a single dataframe in the future.
So far I have this:
totals <- setNames(lapply(frames, function (x)
x %>%
dplyr::group_by(Dis) %>%
dplyr::summarise(total = sum(adj)))
,paste0(unlist(noms)))
But I can't figure out how to add a column with the relevant name. I know I need to use the mutate function, something like this:
totals <- setNames(lapply(frames, function (x)
x %>%
dplyr::group_by(Dis) %>%
dplyr::summarise(total = sum(adj)) %>%
dplyr::mutate(nom = )
,paste0(unlist(noms)))
but I can't figure out how to add the correct name.
The expected output would be a list of two dataframes, one for 'A' and one for 'B'. Here is the expected output for 'A':
Dis total Nom
1 1 5 A
2 2 13 A
How do I do this?
A base R option where we use Map instead of lapply
out <- Map(function(x, y) {
transform(aggregate(adj ~ Dis, data = x, sum), Nom = y)
}, x = frames, y = noms)
out
#[[1]]
# Dis adj Nom
#1 1 5 A
#2 2 13 A
#[[2]]
# Dis adj Nom
#1 1 8 B
#2 2 9 B
The same idea with tidyverse functions
library(purrr); library(dplyr)
map2(.x = frames, .y = noms, ~ .x %>%
group_by(Dis) %>%
summarise(adj = sum(adj)) %>%
mutate(Nom = .y))
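Since the goal is to eventually combine the data frames into a single one, either result can be stacked directly; a minimal sketch using the out object from the base R version above:
# the Nom column already records which original frame each row came from,
# so the list can simply be row-bound into one data frame
dplyr::bind_rows(out)
#   Dis adj Nom
# 1   1   5   A
# 2   2  13   A
# 3   1   8   B
# 4   2   9   B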
I'm trying to check the "pin" numbers of cases with missing data for each variable of interest in my dataset.
Here are some fake data:
c <- data.frame(pin = c(1, 2, 3, 4), type = c(1, 1, 2, 2), v1 = c(1, NA, NA,
NA), v2 = c(NA, NA, 1, 1))
I wrote a function "m.pin" to do this:
m.pin <- function(x, data = "c", return = "$pin") {
sect <- gsub("^.*\\[", "\\[", deparse(substitute(x)))
vect <- eval(parse(text = paste(data, return, sect, sep = "")))
return(vect[is.na(x)])
}
And I use it like so:
m.pin(c$v1[c$type == 1])
[1] 2
I wrote a function to apply "m.pin" over a list of variables to only return pins with missing data:
return.m.pin <- function(x, fun = m.pin) {
val.list <- lapply(x, fun)
condition <- lapply(val.list, function(x) length(x) > 0)
val.list[unlist(condition)]
}
But when I apply it, I get this error:
l <- lst(c$v1[c$type == 1], c$v2[c$type == 2])
return.m.pin(l)
Error in parse(text = paste(data, return, sect, sep = "")) :
<text>:1:9: unexpected ']'
1: c$pin[i]]
^
How can I rewrite my function(s) to avoid this issue?
Many thanks!
Please see Gregor's comment for the most critical issues with your code (to add: don't use return as a variable name as it is the name of a base R function).
It's not clear to me why you want to define a specific function m.pin, nor what you ultimately are trying to do, but I am assuming this is a critical design component.
Rewriting m.pin as
m.pin <- function(df, type, vcol) which(df[, "type"] == type & is.na(df[, vcol]))
we get
m.pin(df, 1, "v1")
#[1] 2
Or to identify rows with NA in "v1" for all types
lapply(unique(df$type), function(x) m.pin(df, x, "v1"))
#[[1]]
#[1] 2
#
#[[2]]
#[1] 3 4
Update
In response to Gregor's comment, perhaps this is what you're after?
by(df, df$type, function(x)
list(v1 = x$pin[which(is.na(x$v1))], v2 = x$pin[which(is.na(x$v2))]))
# df$type: 1
# $v1
# [1] 2
#
# $v2
# [1] 1 2
#
# ------------------------------------------------------------
# df$type: 2
# $v1
# [1] 3 4
#
# $v2
# integer(0)
This returns, for every type, the pin numbers of the NA entries in v1 and v2.
Sample data
df <- data.frame(
pin = c(1, 2, 3, 4),
type = c(1, 1, 2, 2),
v1 = c(1, NA, NA, NA),
v2 = c(NA, NA, 1, 1))
I would suggest rewriting like this (if this approach is to be taken at all). I call your data d because c is already the name of an extremely common function.
# string column names, pass in the data frame as an object
# means no need for eval, parse, substitute, etc.
foo = function(data, na_col, return_col = "pin", filter_col, filter_val) {
if(! missing(filter_col) & ! missing(filter_val)) {
data = data[data[, filter_col] == filter_val, ]
}
data[is.na(data[, na_col]), return_col]
}
# working on the whole data frame
foo(d, na_col = "v1", return_col = "pin")
# [1] 2 3 4
# passing in a subset of the data
foo(d[d$type == 1, ], "v1", "pin")
# [1] 2
# using function arguments to subset the data
foo(d, "v1", "pin", filter_col = "type", filter_val = 1)
# [1] 2
# calling it with changing arguments:
# you could use `Map` or `mapply` to be fancy, but this for loop is nice and clear
inputs = data.frame(na_col = c("v1", "v2"), filter_val = c(1, 2), stringsAsFactors = FALSE)
result = list()
for (i in 1:nrow(inputs)) {
result[[i]] = foo(d, na_col = inputs$na_col[i], return_col = "pin",
filter_col = "type", filter_val = inputs$filter_val[i])
}
result
# [[1]]
# [1] 2
#
# [[2]]
# numeric(0)
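As the comment above notes, the same iteration can also be written with Map; a small sketch, producing the same values:
# Map applies foo to each pair of na_col / filter_val values from inputs
result2 <- Map(function(na_col, filter_val)
  foo(d, na_col = na_col, return_col = "pin",
      filter_col = "type", filter_val = filter_val),
  inputs$na_col, inputs$filter_val)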
A different approach I would suggest is melting your data into a long format, and simply taking a subset of the NA values, hence getting all combinations of type and the v* columns that have NA values at once. Do this once, and no function is needed to look up individual combinations.
d_long = reshape2::melt(d, id.vars = c("pin", "type"))
library(dplyr)
d_long %>% filter(is.na(value)) %>%
arrange(variable, type)
# pin type variable value
# 1 2 1 v1 NA
# 2 3 2 v1 NA
# 3 4 2 v1 NA
# 4 1 1 v2 NA
# 5 2 1 v2 NA
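For reference, a roughly equivalent sketch with tidyr::pivot_longer() in place of reshape2::melt(), assuming the same d data frame:
library(tidyr)
library(dplyr)
d %>%
  pivot_longer(cols = c(v1, v2), names_to = "variable", values_to = "value") %>%
  filter(is.na(value)) %>%
  arrange(variable, type)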
How do I store a few numbers in one element of a data frame?
For example, I want a summary of my data, including the class and the values in each column.
dat = data.frame(STATE = 1:5,
MONTH = 1:5)
should yield:
var class values
STATE numeric c(1,2,3,4,5)
MONTH numeric c(1,2,3,4,5)
Now I try:
dat = data.frame(STATE = 1:5,
MONTH = 1:5)
vars = data.frame(var = colnames(dat), class = NA, values = NA,
stringsAsFactors = F)
vars$class = sapply(dat, class)
vars
# var class values
# 1 STATE integer NA
# 2 MONTH integer NA
vars$values = sapply(dat, function(x) unique(x))
# Error in `$<-.data.frame`(`*tmp*`, "values", value = c(1L, 2L, 3L, 4L, :
# replacement has 5 rows, data has 2
# UPDATE: @jMathew's answer:
vars$values = sapply(dat, function(x) list(unique(x)))
vars
# var class values
# 1 STATE integer 1, 2, 3, 4, 5
# 2 MONTH integer 1, 2, 3, 4, 5
It doesn't work because unique(dat$STATE) is c(1,2,3,4,5), so sapply simplifies the result and R tries to use those 5 values as a full column of a 2-row data frame instead of fitting them into a single element.
But the above code works for many data sets I work with, e.g.:
library(foreign)
dat = read.xport('LLCP2013.XPT')
# download from http://www.cdc.gov/brfss/annual_data/2013/files/LLCP2013XPT.ZIP
dat = dat[1:5, 1:3]
dat
# X_STATE FMONTH IDATE
# 1 1 1 01092013
# 2 1 1 01192013
# 3 1 1 01192013
# 4 1 1 01112013
# 5 1 2 02062013
vars = data.frame(var = colnames(dat), class = NA, values = NA,
stringsAsFactors = F)
vars$class = sapply(dat, class)
vars$values = sapply(dat, function(x) unique(x))
vars
# var class values
# 1 X_STATE numeric 1
# 2 FMONTH numeric 1, 2
# 3 IDATE factor 16, 36, 20, 70
# UPDATE:
class(vars[3,3])
# [1] "list"
# @jMathew was right, it was indeed coerced to a list
Can somebody tell me why this works in the second case but not in the first? Thanks
I suspect that in your second case the result is being coerced to a list: the columns have different numbers of unique values, so sapply cannot simplify them into a matrix and returns a list instead, which a data frame column can hold.
Try this on your first example:
vars$values = sapply(dat, function(x) list(unique(x)))
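An alternative sketch that avoids sapply's simplification rules altogether is to assign a list directly, since a data frame column can itself be a list:
# lapply() always returns a list, so this creates a list-column in one step
vars$values = lapply(dat, unique)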
We could try
do.call(rbind,lapply(seq_along(dat), function(i)
data.frame(var=names(dat)[i], class=class(dat[,i]),
values= sprintf('c(%s)', toString(unique(dat[,i]))))))
# var class values
#1 STATE integer c(1, 2, 3, 4, 5)
#2 MONTH integer c(1, 2, 3, 4, 5)