Can you suggest a better (briefer or more legible) way of converting NULLs (zero-length elements) in a list to NAs, and then converting the list to a vector?
list(1, 2, 3, numeric(0), 5) %>%
purrr::map_dbl(~ ifelse(length(.) == 0, NA_real_, .))
# [1] 1 2 3 NA 5
I would prefer to use if_else rather than ifelse.
Is there another way of doing it with purrr?
If the length of each element of a list is 0 or 1 (e.g. lst.1), you could simply use
lst.1 %>% map_dbl(1, .default = NA)
# [1] 1 2 3 NA 5
A more general way, for a list whose elements have different lengths (e.g. lst.2), is
lst.2 %>%
map_if(~ length(.) == 0, ~ NA) %>%
flatten_dbl()
# [1] 1 2 3 NA 5
Data
lst.1 <- list(1, 2, 3, numeric(0), 5)
lst.2 <- list(1:3, numeric(0), 5)
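For the case where every element has length 0 or 1, the same result can also be sketched in base R with vapply and a plain if/else, which avoids ifelse altogether. The helper name first_or_na is made up for this illustration.

```r
lst.1 <- list(1, 2, 3, numeric(0), 5)

# first_or_na is a hypothetical helper:
# NA for an empty element, otherwise the element's first value
first_or_na <- function(x) if (length(x) == 0) NA_real_ else x[[1]]

res <- vapply(lst.1, first_or_na, numeric(1))
res
# [1]  1  2  3 NA  5
```

vapply also checks that each result is a single number, so an unexpected element type fails loudly rather than being coerced silently.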
If L is the list, then either of these works:
replace(L, lengths(L) == 0, NA)
ifelse(lengths(L), L, NA)
No packages needed.
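Note that both expressions return a list. A minimal sketch of the remaining step, assuming every element has length 0 or 1, is to wrap the result in unlist:

```r
L <- list(1, 2, 3, numeric(0), 5)

filled <- replace(L, lengths(L) == 0, NA)  # still a list; the empty slot is now NA
vec <- unlist(filled)                      # flatten to a numeric vector
vec
# [1]  1  2  3 NA  5
```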
Related
I have the following vector with names:
myvec <- c(`C1-C` = 3, `C2-C` = 1, `C3-C` = NA, `C4-C` = 5, `C5-C` = NA)
C1-C C2-C C3-C C4-C C5-C
3 1 NA 5 NA
I would like to convert it to a data frame/tibble, keeping the names of the elements as row names.
The best way I found was:
mynames <- names(myvec)
myvec <- myvec %>%
as_tibble() %>%
mutate(rownames = mynames) %>%
column_to_rownames("rownames")
How can I do this in a more efficient way?
Thanks all
as.data.frame(myvec)
myvec
C1-C 3
C2-C 1
C3-C NA
C4-C 5
C5-C NA
Or
data.frame(myvec)
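If a specific column name is wanted, a small variation is to pass the vector as a named argument. This is only a sketch; the column name value is an arbitrary choice for the example.

```r
myvec <- c(`C1-C` = 3, `C2-C` = 1, `C3-C` = NA, `C4-C` = 5, `C5-C` = NA)

# "value" is an arbitrary column name; the vector's names become the row names
df <- data.frame(value = myvec)
rownames(df)
```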
I have a large list of dataframes with environmental variables from different localities. For each of the dataframes in the list, I want to summarize the values across locality (= group measurements of the same locality into one), using the name of the dataframes as a condition for which variables need to be summarized. For example, for a dataframe with the name 'salinity' I want to only summarize across salinity, and not the other environmental variables. Note that the different dataframes contain data from different localities, so I cannot simply merge them into one dataframe.
Let's do this with a dummy dataset:
#create list of dataframes
df1 = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
Temp = c(14, 15, 16, 18, 20, 18, 21),
Sal = c(16, NA, NA, 12, NA, NA, 9))
df2 = data.frame(locality = c(1, 1, 3, 6, 8, 9, 9),
Temp = c(1, 2, 4, 5, 0, 2, -1),
Sal = c(18, NA, NA, NA, 36, NA, NA))
df3 = data.frame(locality = c(1, 3, 4, 4, 5, 5, 9),
Temp = c(14, NA, NA, NA, 17, 18, 21),
Sal = c(16, 8, 24, 23, 11, 12, 9))
df4 = data.frame(locality = c(1, 1, 1, 4, 7, 8, 10),
Temp = c(1, NA, NA, NA, NA, 0, 2),
Sal = c(18, 17, 13, 16, 20, 36, 30))
df_list = list(df1, df2, df3, df4)
names(df_list) = c("Summer_temperature", "Winter_temperature",
"Summer_salinity", "Winter_salinity")
Next, I used lapply to summarize environmental variables:
#select only those dataframes in the list that have either 'salinity' or 'temperature' in the dataframe names
df_sal = df_list[grep("salinity", names(df_list))]
df_temp = df_list[grep("temperature", names(df_list))]
#use apply to summarize salinity or temperature values in each dataframe
##salinity
df_sal2 = lapply(df_sal, function(x) {
x %>%
group_by(locality) %>%
summarise(Sal = mean(Sal, na.rm = TRUE))
})
##temperature
df_temp2 = lapply(df_temp, function(x) {
x %>%
group_by(locality) %>%
summarise(Temp = mean(Temp, na.rm = TRUE))
})
Now, this code is repetitive, so I want to downsize this by combining everything into one function. This is what I tried:
df_env = lapply(df_list, function(x) {
if (grepl("salinity", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Sal = mean(Sal, na.rm = TRUE))}
if (grepl("temperature", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Temp = mean(Temp, na.rm = TRUE))}
})
But I am getting the following output:
$Summer_temperature
NULL
$Winter_temperature
NULL
$Summer_salinity
NULL
$Winter_salinity
NULL
And the following warning messages:
Warning messages:
1: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
2: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
3: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
4: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
5: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
6: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
7: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
8: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
Now, I read here that this warning message can potentially be solved by using ifelse. However, in the final dataset I will have more than two environmental variables, so I will have to add many more if statements - for this reason I believe ifelse is not a solution here. Does anyone have an elegant solution to my problem? I am new to using both functions and lapply, and would appreciate any help you can give me.
EDIT:
I tried using the else if option suggested in one of the answers, but this still returns NULL values. I also tried using return and assigning the output to x, but both have the same problem as the code below. Any ideas?
#else if
df_env = lapply(df_list, function(x) {
if (grepl("salinity", names(x)) == TRUE) {
x %>% group_by(locality) %>%
summarise(Sal = mean(Sal, na.rm = TRUE))}
else if (grepl("temperature", names(x)) == TRUE) {
x %>% group_by(locality) %>%
summarise(Temp = mean(Temp, na.rm = TRUE))}
})
df_env
What I think is happening is that my if argument does not get passed to the summarize function, so nothing is being summarized.
Several things are going on here, including:
As akrun said, an if statement must have a condition of length 1. Yours does not.
grepl("locality", names(df1))
# [1] TRUE FALSE FALSE
That must be reduced so that it is always exactly length 1. Frankly, grepl is the wrong tool here, since technically a column named notlocality would match and then it would error. I suggest you change to
"locality" %in% names(df1)
# [1] TRUE
You need to return something. Always. You shifted from if ...; if ...; to if ... else if ..., which is a good start, but really if you meet neither condition, then nothing is returned. I suggest one of the following: either add one more } else x, or reassign as if (..) { x <- x %>% ...; } else if (..) { x <- x %>% ... ; } and then end the anon-func with just x (to return it).
However, I think ultimately the problem is that you are looking for "temperature" or "salinity" which are in the names of the list-objects, not in the frames themselves. For instance, your reference to names(x) is returning c("locality", "Temp", "Sal"), the names of the frame x itself.
I think this is what you want?
Map(function(x, nm) {
if (grepl("salinity", nm)) {
x %>%
group_by(locality) %>%
summarize(Sal = mean(Sal, na.rm = TRUE))
} else if (grepl("temperature", nm)) {
x %>%
group_by(locality) %>%
summarize(Temp = mean(Temp, na.rm = TRUE))
} else x
}, df_list, names(df_list))
# $Summer_temperature
# # A tibble: 5 x 2
# locality Temp
# <dbl> <dbl>
# 1 1 14
# 2 2 15.5
# 3 5 18
# 4 7 19
# 5 9 21
# $Winter_temperature
# # A tibble: 5 x 2
# locality Temp
# <dbl> <dbl>
# 1 1 1.5
# 2 3 4
# 3 6 5
# 4 8 0
# 5 9 0.5
# $Summer_salinity
# # A tibble: 5 x 2
# locality Sal
# <dbl> <dbl>
# 1 1 16
# 2 3 8
# 3 4 23.5
# 4 5 11.5
# 5 9 9
# $Winter_salinity
# # A tibble: 5 x 2
# locality Sal
# <dbl> <dbl>
# 1 1 16
# 2 4 16
# 3 7 20
# 4 8 36
# 5 10 30
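Since the question mentions that more variables will follow, one way to avoid a growing if/else chain is a lookup from keyword to column name. The sketch below is an assumption-laden variation, not the code above: var_lookup and summarise_by_name are hypothetical names, it uses base aggregate instead of dplyr, and it only includes two of the dummy data frames to stay short.

```r
df_list <- list(
  Summer_temperature = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
                                  Temp = c(14, 15, 16, 18, 20, 18, 21),
                                  Sal = c(16, NA, NA, 12, NA, NA, 9)),
  Summer_salinity = data.frame(locality = c(1, 3, 4, 4, 5, 5, 9),
                               Temp = c(14, NA, NA, NA, 17, 18, 21),
                               Sal = c(16, 8, 24, 23, 11, 12, 9))
)

# Assumed lookup: keyword found in the list name -> column to average
var_lookup <- c(temperature = "Temp", salinity = "Sal")

summarise_by_name <- function(x, nm) {
  # which keywords occur in this data frame's name?
  matched <- vapply(names(var_lookup), grepl, logical(1), x = nm, fixed = TRUE)
  col <- unname(var_lookup[matched])
  if (length(col) != 1) return(x)  # no keyword, or an ambiguous name: return unchanged
  # mean of the chosen column per locality, ignoring NAs
  aggregate(x[col], by = x["locality"], FUN = mean, na.rm = TRUE)
}

res <- Map(summarise_by_name, df_list, names(df_list))
res$Summer_salinity
```

Adding a new environmental variable then only requires one more entry in var_lookup.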
Take a list like as.list(rep(c(NA, 4, NA), times = c(5, 1, 2))) i.e.
[[1]]
[1] NA
[[2]]
[1] NA
[[3]]
[1] NA
[[4]]
[1] NA
[[5]]
[1] NA
[[6]] # index of non-NA list element, 6
[1] 4 # ...and its corresponding value, 4
[[7]]
[1] NA
[[8]]
[1] NA
I want to extract the index of the non-NA element (here: 6), and its corresponding value (here: 4). Is there any idiomatic way to get these two numbers?
1) Base R Assuming that the list L contains only scalars and NA's this returns a 2 column matrix with one row for each set of xy coordinates and an attribute recording which positions were omitted.
Omit the x= and y= if you don't want the column names. If you don't want the attribute recording the positions of the NA's append [,] to the end of the line. If you know that there is only one scalar you might want to wrap it in c(...) to produce a 2 element vector. If you prefer data frame output replace cbind with data.frame.
na.omit(cbind(x = seq_along(L), y = unlist(L)))
2) tidyverse Or, using the tidyverse:
library(tibble)
library(tidyr)
drop_na(enframe(unlist(L)))
2a) which could alternately be written using pipes like this:
L %>% unlist %>% enframe %>% drop_na
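To make the base R one-liner in 1) concrete, here is a small sketch applied to the list from the question, pulling the index and the value out of the resulting one-row matrix by column name.

```r
L <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))

# one row per non-NA element: column x is the position, column y the value
res <- na.omit(cbind(x = seq_along(L), y = unlist(L)))

idx <- res[, "x"]  # position of the non-NA element
val <- res[, "y"]  # its value
```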
I am not sure this is elegant enough, but it works:
mylist <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))
x <- (1:length(mylist))[!sapply(mylist,is.na)]
y <- mylist[[x]]
coor <- c(x,y)
coor
Output:
[1] 6 4
Find the index of the non-NA list element (6) to get x.
Remove all NA list elements to get y.
my_list <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))
x <- which(!is.na(my_list))
y <- unlist(Filter(function(a) any(!is.na(a)), my_list))
coordinate <- c(x,y)
coordinate
> coordinate
[1] 6 4
Using stack
na.omit(stack(setNames(L, seq_along(L))))
values ind
6 4 6
I have a dataframe with hundreds of variables, grouped into different factors ("Happy_", "Sad_", etc.), and I want to create a set of new variables indicating whether a participant gave a rating of 4 to any of the variables in one factor. However, if any of the variables in that factor is NA, then the new variable should also be NA.
I have tried the following, but it didn't work:
library(tidyverse)
df <- data.frame(Subj = c("A", "B", "C", "D"),
Happy_1_Num = c(4,2,2,NA),
Happy_2_Num = c(4,2,2,1),
Happy_3_Num = c(1,NA,2,4),
Sad_1_Num = c(2,1,4,3),
Sad_2_Num = c(NA,1,2,4),
Sad_3_Num = c(4,2,2,1))
# Doesn't work
df <- df %>% mutate(Happy_Any4 = ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ is.na(.)), NA,
ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ . == 4),1,0)),
Sad_Any4 = ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ is.na(.)), NA,
ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ . == 4),1,0)))
As a workaround, I first generate a set of variables indicating whether each factor has any NA, and then check whether the participant gave any rating of 4. It works, but since I have many factors, I was wondering if there is a more elegant way of doing it.
# workaround
df <- df %>% mutate(
NA_Happy = ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ is.na(.)), 1,0),
NA_Sad = ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ is.na(.)), 1,0))
df <- df %>% mutate(
Happy_Any4 = ifelse(NA_Happy == 1, NA,
ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ . == 4),1,0)),
Sad_Any4 = ifelse(NA_Sad == 1, NA,
ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ . == 4),1,0)))
Here is a base R option using split.default -
tmp <- df[-1]
cbind(df, sapply(split.default(tmp, sub('_.*', '', names(tmp))),
function(x) as.integer(rowSums(x== 4) > 0)))
# Subj Happy_1_Num Happy_2_Num Happy_3_Num Sad_1_Num Sad_2_Num Sad_3_Num Happy Sad
#1 A 4 4 1 2 NA 4 1 NA
#2 B 2 2 NA 1 1 2 NA 0
#3 C 2 2 2 4 2 2 0 1
#4 D NA 1 4 3 4 1 NA 1
sub keeps only the "Happy" or "Sad" part of each name, split.default splits the columns into groups by that prefix, and sapply computes, for each group, whether any value in a row equals 4. A row with an NA in the group yields NA, because rowSums propagates NA.
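The NA behaviour relied on here can be checked on a toy data frame (the columns a and b are made up for the illustration):

```r
x <- data.frame(a = c(4, 2, NA), b = c(1, 2, 4))

# rowSums without na.rm returns NA for any row that contains an NA
flag <- as.integer(rowSums(x == 4) > 0)
flag
# [1]  1  0 NA
```

The third row contains a 4 but still gives NA, which is what the question asks for.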
If you can afford to write each and every factor manually you can do -
library(dplyr)
df %>%
mutate(Happy = as.integer(rowSums(select(., starts_with('Happy')) == 4) > 0),
Sad = as.integer(rowSums(select(., starts_with('Sad')) == 4) > 0))
Here is another workaround, transposing the data.frame and using apply on the columns. I'm not sure it's more elegant, but here it is:
tmp <- cbind(sub("^((Happy)|(Sad))(_.*_Num)$", "\\1", colnames(df)), t(df))
Happy_Any4 <- apply(tmp[tmp[,1]== "Happy", -1], 2,
function(x) ifelse(any(is.na(x)), NA, length(grep("4", x))) )
Sad_Any4 <- apply(tmp[tmp[,1]== "Sad", -1], 2,
function(x) ifelse(any(is.na(x)), NA, length(grep("4", x))) )
df <- cbind(df, Happy_Any4 = Happy_Any4, Sad_Any4 = Sad_Any4)
EDIT: The above was a clumsy first attempt; the version below is cleaner.
It works because the sum of any vector that contains an NA is NA.
df <- df %>% mutate(Happy_Any4 = apply(df[,grep("^Happy_.*_Num$", colnames(df))],
1, function(x) 1*(sum(x == 4) > 0)),
Sad_Any4 = apply(df[, grep("^Sad_.*_Num$", colnames(df))],
1, function(x) 1*(sum(x == 4) > 0)))
apply visits every row, using only the columns whose names match the pattern (selected with grep). x == 4 produces a logical vector whose sum is the number of occurrences of 4, and any NA in the row turns that sum into NA. I then check whether the sum is above 0, and the 1* converts the resulting logical into a number.
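With many factors, the same idea can be wrapped in a loop over the prefixes so that no factor has to be written out by hand. This is a sketch only: the prefixes vector is an assumption about how the factors are named, and it uses rowSums rather than apply.

```r
df <- data.frame(Subj = c("A", "B", "C", "D"),
                 Happy_1_Num = c(4, 2, 2, NA),
                 Happy_2_Num = c(4, 2, 2, 1),
                 Happy_3_Num = c(1, NA, 2, 4),
                 Sad_1_Num = c(2, 1, 4, 3),
                 Sad_2_Num = c(NA, 1, 2, 4),
                 Sad_3_Num = c(4, 2, 2, 1))

prefixes <- c("Happy", "Sad")  # assumed list of factor prefixes

for (p in prefixes) {
  # columns belonging to this factor, e.g. Happy_1_Num, Happy_2_Num, ...
  cols <- grep(paste0("^", p, "_.*_Num$"), names(df))
  # 1 if any rating is 4, 0 if none, NA if the row has an NA in this factor
  df[[paste0(p, "_Any4")]] <- as.integer(rowSums(df[cols] == 4) > 0)
}

df[c("Subj", "Happy_Any4", "Sad_Any4")]
```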
Imagine an array of numbers called A. At each level of A, you want to find the most recent item with a matching value. You could easily do this with a for loop as follows:
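A = c(1, 1, 2, 2, 1, 2, 2)
answer = rep(NA_integer_, length(A)) # pre-allocate the result
for(i in seq_along(A)){
  prev = seq_len(i - 1) # positions before i (note that 1:i-1 parses as (1:i)-1)
  if(i > 1 && sum(A[prev] == A[i]) > 0){
    answer[i] = max(which(A[prev] == A[i]))
  }else{
    answer[i] = NA
  }
}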
But I want to vectorize this for loop (because I'll be applying this principle to a very large data set). I tried using sapply:
answer = sapply(A, FUN = function(x){max(which(A == x))})
As you can see, I need some way of reducing the array to only values that come before x. Any advice?
We can use seq_along to loop over the index of each element, subset the values before it, and take the max index at which the value last occurred.
c(NA, sapply(seq_along(A)[-1], function(x) max(which(A[1:(x-1)] == A[x]))))
#[1] NA 1 -Inf 3 2 4 6
We can change the -Inf to NA if that format is needed:
inds <- c(NA, sapply(seq_along(A)[-1], function(x) max(which(A[1:(x-1)] == A[x]))))
inds[is.infinite(inds)] <- NA
inds
#[1] NA 1 NA 3 2 4 6
The above method gives a warning (max of an empty vector). To avoid it, we can add a check of the length:
c(NA, sapply(seq_along(A)[-1], function(x) {
inds <- which(A[1:(x-1)] == A[x])
if (length(inds) > 0)
max(inds)
else
NA
}))
#[1] NA 1 NA 3 2 4 6
Here's an approach with dplyr which is more verbose, but easier for me to grok. We start with recording the row_number, make a group for each number we encounter, then record the prior matching row.
library(dplyr)
A2 <- tibble(value = A) %>% # build the one-column tibble directly
mutate(row = row_number()) %>%
group_by(value) %>%
mutate(last_match = lag(row)) %>%
ungroup()
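The same grouped-lag idea can be sketched in base R with ave, which applies a function within each group of equal values and puts the results back in the original order. This is an illustration of the technique, not a benchmarked replacement.

```r
A <- c(1, 1, 2, 2, 1, 2, 2)

# Within each group of equal values, shift the positions down by one:
# the first occurrence gets NA, every later one gets the previous position
last_match <- ave(seq_along(A), A,
                  FUN = function(i) c(NA_integer_, head(i, -1)))
last_match
# [1] NA  1 NA  3  2  4  6
```

Because the work is done once per distinct value rather than once per element, this avoids rescanning the prefix of A at every position.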
You can do:
sapply(seq_along(A)-1, function(x)ifelse(any(a<-A[x+1]==A[sequence(x)]),max(which(a)),NA))
[1] NA 1 NA 3 2 4 6
Here's a function that I made (based upon Ronak's answer):
lastMatch = function(A){
  # positions where a value appears for the first time; these stay NA
  firstInstances = which(!duplicated(A))
  notFirstInstances = setdiff(seq_along(A), firstInstances)
  # for every later occurrence, the largest earlier position holding the same value
  lastMatch_notFirstInstances = vapply(notFirstInstances,
    function(x) max(which(A[1:(x-1)] == A[x])), integer(1))
  X = rep(NA_integer_, length(A))
  X[notFirstInstances] = lastMatch_notFirstInstances
  return(X)
}