I have a list of dataframes:
df1 <- data.frame(c(1:5), c(6:10))
df2 <- data.frame(c(1:7))
df3 <- data.frame(c(1:5), c("a", "b", "c", "d", "e"))
my_list <- list(df1, df2, df3)
From my_list, I want to extract the data frames which have only 2 columns (df1 and df3), and put them in a new list.
Maybe you can try lengths. For a list of data frames, lengths() returns the number of columns of each element, since the length of a data frame is its number of columns:
> my_list[lengths(my_list) == 2]
[[1]]
c.1.5. c.6.10.
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
[[2]]
c.1.5. c..a....b....c....d....e..
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
It's also possible to subset using lapply and a logical condition (sapply will also work; see the sketch after the output below):
my_list[lapply(my_list, ncol) == 2]
[[1]]
c.1.5. c.6.10.
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
[[2]]
c.1.5. c..a....b....c....d....e..
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
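As noted above, sapply works the same way; a minimal sketch of that variant (same output as the lapply version):
my_list[sapply(my_list, ncol) == 2]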
We could use keep from the purrr package with the condition:
library(purrr)
my_list %>% keep(~ ncol(.x) == 2)
[[1]]
c.1.5. c.6.10.
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
[[2]]
c.1.5. c..a....b....c....d....e..
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
I have a vector and list of the same length. The list contains vectors of arbitrary lengths as such:
vec1 <- c("a", "b", "c")
list1 <- list(c(1, 3, 2),
c(4, 5, 8, 9),
c(5, 2))
What is the fastest, most effective way to create a data frame in which each element of vec1 is replicated as many times as the length of the corresponding element of list1?
Expected output:
# col1 col2
# 1 a 1
# 2 a 3
# 3 a 2
# 4 b 4
# 5 b 5
# 6 b 8
# 7 b 9
# 8 c 5
# 9 c 2
I have included a tidy solution as an answer, but I was wondering if there are other ways to approach this task.
In base R, set the names of the list with 'vec1' and use stack to return a two-column data.frame:
stack(setNames(list1, vec1))[2:1]
Output:
ind values
1 a 1
2 a 3
3 a 2
4 b 4
5 b 5
6 b 8
7 b 9
8 c 5
9 c 2
If we want a tidyverse approach, use enframe
library(tibble)
library(dplyr)
library(tidyr)
list1 %>%
set_names(vec1) %>%
enframe(name = 'col1', value = 'col2') %>%
unnest(col2)
# A tibble: 9 × 2
col1 col2
<chr> <dbl>
1 a 1
2 a 3
3 a 2
4 b 4
5 b 5
6 b 8
7 b 9
8 c 5
9 c 2
This tidy solution replicates the vec1 elements according to the lengths of the nested vectors, then flattens both lists into a tibble.
library(purrr)
library(tibble)
tibble(col1 = flatten_chr(map2(vec1, map_int(list1, length), function(x, y) rep(x, times = y))),
col2 = flatten_dbl(list1))
# # A tibble: 9 × 2
# col1 col2
# <chr> <dbl>
# 1 a 1
# 2 a 3
# 3 a 2
# 4 b 4
# 5 b 5
# 6 b 8
# 7 b 9
# 8 c 5
# 9 c 2
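The same replicate-then-flatten idea can also be written directly in base R; a minimal sketch, where lengths(list1) supplies the replication counts:
data.frame(col1 = rep(vec1, lengths(list1)),
           col2 = unlist(list1))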
A tidyr/tibble approach could also be unnest_longer:
library(dplyr)
library(tidyr)
tibble(vec1, list1) |>
unnest_longer(list1)
Output:
# A tibble: 9 × 2
vec1 list1
<chr> <dbl>
1 a 1
2 a 3
3 a 2
4 b 4
5 b 5
6 b 8
7 b 9
8 c 5
9 c 2
Another possible solution, based on purrr::map2_dfr:
library(purrr)
map2_dfr(vec1, list1, ~ data.frame(col1 = .x, col2 = .y))
#> col1 col2
#> 1 a 1
#> 2 a 3
#> 3 a 2
#> 4 b 4
#> 5 b 5
#> 6 b 8
#> 7 b 9
#> 8 c 5
#> 9 c 2
I have the following dataframe
df <- data.frame(ID = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
                 A_Frequency = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 B_Frequency = c(1, 2, NA, 4, 6, 1, 2, 5, 6, 7))
The dataframe appears as follows
ID A_Frequency B_Frequency
1 A 1 1
2 A 2 2
3 A 3 NA
4 A 4 4
5 A 5 6
6 B 1 1
7 B 2 2
8 B 3 5
9 B 4 6
10 B 5 7
I wish to create a new data frame df2 from df that looks as follows:
ID CFreq
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 1
8 B 2
9 B 3
10 B 4
11 B 5
12 B 6
13 B 7
The new data frame has a single column, CFreq, built per ID from the unique values of A_Frequency and B_Frequency, with the NA values dropped.
I have tried dplyr but am unable to get the required result:
df2 <- df %>%
  group_by(ID) %>%
  select(ID, A_Frequency, B_Frequency) %>%
  mutate(Cfreq = unique(A_Frequency, B_Frequency))
This yields the following, which is quite different:
ID A_Frequency B_Frequency Cfreq
<fct> <dbl> <dbl> <dbl>
1 A 1 1 1
2 A 2 2 2
3 A 3 NA 3
4 A 4 4 4
5 A 5 6 5
6 B 1 1 1
7 B 2 2 2
8 B 3 5 3
9 B 4 6 4
10 B 5 7 5
Could someone help me here?
The gather function from the tidyr package will be helpful here:
library(tidyverse)
df %>%
gather(x, CFreq, -ID) %>%
select(-x) %>%
na.omit() %>%
unique() %>%
arrange(ID, CFreq)
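gather is superseded in current tidyr; the same reshape-then-deduplicate idea with pivot_longer could look like this (a sketch, assuming tidyr >= 1.0.0):
df %>%
  pivot_longer(-ID, names_to = "x", values_to = "CFreq") %>%
  select(-x) %>%
  na.omit() %>%
  distinct() %>%
  arrange(ID, CFreq)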
A different tidyverse possibility could be:
df %>%
nest(A_Frequency, B_Frequency, .key = C_Frequency) %>%
mutate(C_Frequency = map(C_Frequency, function(x) unique(x[!is.na(x)]))) %>%
unnest()
ID C_Frequency
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
9 A 6
10 B 1
11 B 2
12 B 3
13 B 4
14 B 5
18 B 6
19 B 7
A base R approach would be to split the data frame based on ID and, for every list element, count the number of unique entries and create a sequence of that length.
do.call(rbind, lapply(split(df, df$ID), function(x)
  data.frame(ID = x$ID[1],
             CFreq = seq_len(length(unique(na.omit(unlist(x[-1]))))))))
# ID CFreq
#A.1 A 1
#A.2 A 2
#A.3 A 3
#A.4 A 4
#A.5 A 5
#A.6 A 6
#B.1 B 1
#B.2 B 2
#B.3 B 3
#B.4 B 4
#B.5 B 5
#B.6 B 6
#B.7 B 7
This will also work when A_Frequency or B_Frequency contains characters, or some other random numbers instead of sequential ones.
In the tidyverse we can do:
library(tidyverse)
df %>%
group_split(ID) %>%
map_dfr(~ data.frame(ID = .$ID[1],
CFreq= seq_len(length(unique(na.omit(flatten_chr(.[-1])))))))
A data.table option
library(data.table)
cols <- c('A_Frequency', 'B_Frequency')
out <- setDT(df)[, .(CFreq = sort(unique(unlist(.SD)))),
.SDcols = cols,
by = ID]
out
# ID CFreq
# 1: A 1
# 2: A 2
# 3: A 3
# 4: A 4
# 5: A 5
# 6: A 6
# 7: B 1
# 8: B 2
# 9: B 3
#10: B 4
#11: B 5
#12: B 6
#13: B 7
I have two lists, each with many vectors (around 500) of different lengths, and I would like to get a tibble data frame with three columns.
My reproducible example is the following:
> a
[[1]]
[1] 1 3 6
[[2]]
[1] 5 4
> b
[[1]]
[1] 3 4
[[2]]
[1] 5 6 7
I would like to get the following tibble data frame:
name index value
a 1 1
a 1 3
a 1 6
a 2 5
a 2 4
b 1 3
b 1 4
b 2 5
b 2 6
b 2 7
I would be grateful if someone could help me with this issue
Using base R:
transform(stack(c(a = a, b = b)), name = substr(ind, 1, 1), ind = substr(ind, 2, 2))
values ind name
1 1 1 a
2 2 1 a
3 3 1 a
4 5 2 a
5 6 2 a
6 3 1 b
7 4 1 b
8 5 2 b
9 6 2 b
10 7 2 b
Using the tidyverse:
library(tidyverse)
list(a = a, b = b) %>%
  map(~ stack(setNames(.x, 1:length(.x)))) %>%
  bind_rows(.id = "name")
name values ind
1 a 1 1
2 a 2 1
3 a 3 1
4 a 5 2
5 a 6 2
6 b 3 1
7 b 4 1
8 b 5 2
9 b 6 2
10 b 7 2
Here is one option with the tidyverse:
library(tidyverse)
list(a= a, b = b) %>%
map_df(enframe, name = "index", .id = 'name') %>%
unnest
# A tibble: 10 x 3
# name index value
# <chr> <int> <dbl>
# 1 a 1 1
# 2 a 1 3
# 3 a 1 6
# 4 a 2 5
# 5 a 2 4
# 6 b 1 3
# 7 b 1 4
# 8 b 2 5
# 9 b 2 6
#10 b 2 7
data
a <- list(c(1, 3, 6), c(5, 4))
b <- list(c(3, 4), c(5, 6, 7))
Let's say our initial data frame looks like this:
df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,NA,NA,NA),C=c(1,2,3,NA,NA,NA))
> df1
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
Another data frame contains new information for columns B and C:
df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5))
> df2
Index B C
1 4 4 5
2 5 4 5
3 6 4 5
How can you update the missing values in df1 so it looks like this:
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 5
5 5 5 4 5
6 6 6 4 5
My attempt:
library(dplyr)
> full_join(df1,df2)
Joining by: c("Index", "B", "C")
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 NA NA
5 5 5 NA NA
6 6 6 NA NA
7 4 NA 4 5
8 5 NA 4 5
9 6 NA 4 5
As you can see, this has created duplicate rows for indices 4, 5 and 6 instead of replacing the NA values.
Any help would be greatly appreciated!
Merge, then aggregate:
aggregate(. ~ Index, data=merge(df1, df2, all=TRUE), na.omit, na.action=na.pass )
# Index B C A
#1 1 1 1 1
#2 2 2 2 2
#3 3 3 3 3
#4 4 4 5 4
#5 5 4 5 5
#6 6 4 5 6
Or in dplyr speak:
df1 %>%
full_join(df2) %>%
group_by(Index) %>%
summarise_each(funs(na.omit))
#Joining by: c("Index", "B", "C")
#Source: local data frame [6 x 4]
#
# Index A B C
# (dbl) (int) (dbl) (dbl)
#1 1 1 1 1
#2 2 2 2 2
#3 3 3 3 3
#4 4 4 4 5
#5 5 5 4 5
#6 6 6 4 5
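summarise_each is deprecated in current dplyr; the same collapse-the-NAs idea with across might look like this (a sketch, assuming dplyr >= 1.0.0):
df1 %>%
  full_join(df2, by = c("Index", "B", "C")) %>%
  group_by(Index) %>%
  # as.vector() drops the na.action attribute that na.omit() leaves behind
  summarise(across(everything(), ~ as.vector(na.omit(.x))))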
We can use a join from data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), join with 'df2' on "Index", and assign (:=) the values of 'B' and 'C' from 'i.B' and 'i.C'.
library(data.table)
setDT(df1)[df2, c('B', 'C') := .(i.B, i.C), on = "Index"]
df1
# Index A B C
#1: 1 1 1 1
#2: 2 2 2 2
#3: 3 3 3 3
#4: 4 4 4 5
#5: 5 5 4 5
#6: 6 6 4 5
For those interested, I've extended this problem to:
- handle updating a data frame with another data frame that has new columns
- replace any existing entries regardless of whether they're NA or not
Here's the solution I found using the aggregate function from @thelatemail :)
df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,3,3,3),C=c(1,2,3,3,3,3))
df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5),D=c(6,6,6),E=c(7,7,7))
df3 = full_join(df1,df2)
# Create a function na.omit.last
na.omit.last = function(x){
  x <- na.omit(x)
  x <- last(x)
}
# For the columns not in df1
dfA = aggregate(. ~ Index, df3, na.omit,na.action = na.pass)
dfA = dfA[,-(1:ncol(df1))]
dfA = data.frame(lapply(dfA,as.numeric))
dfB = aggregate(. ~ Index, df3[,1:ncol(df1)], na.omit.last, na.action = na.pass)
# If there are more columns in df2 append dfA
if (ncol(df2) > ncol(df1)) {
  df3 = cbind(dfB, dfA)
} else {
  df3 = dfB
}
print(df3)
Not sure what the general case or conditions would be, but this works for this instance without dplyr
df3 <- as.matrix(df1[-1])                     # drop the Index column so the NA positions line up with df2
df3[which(is.na(df3))] <- as.matrix(df2[-1])  # fill the NAs column by column with the values from df2
df3 <- as.data.frame(df3)
df3
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 5
5 5 4 5
6 6 4 5
As of dplyr >= 1.0.0 you can use rows_update:
library(dplyr)
df1 %>%
rows_update(df2, by = "Index")
Index A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 5
5 5 5 4 5
6 6 6 4 5
Alternatively, there is rows_patch:
rows_patch() works like rows_update() but only overwrites NA values.
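A minimal sketch of that variant; it gives the same result here because the values being replaced in df1 are all NA:
df1 %>%
  rows_patch(df2, by = "Index")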
I have a large data frame where most subjects have a pair of observations, like this:
set.seed(123)
df<-data.frame(ID=c(letters[1:4],letters[1:6]),x=sample(1:5,10,T))
ID x
1 a 2
2 b 4
3 c 3
4 d 5
5 a 5
6 b 1
7 c 3
8 d 5
9 e 3
10 f 3
I'd like to extract the rows whose IDs are paired, such as:
ID x
1 a 2
5 a 5
2 b 4
6 b 1
3 c 3
7 c 3
4 d 5
8 d 5
What's the best way to do that in R?
Alternatively, I tend to use duplicated:
> df[df$ID %in% df$ID[duplicated(df$ID)],]
ID x
1 a 2
2 b 4
3 c 3
4 d 5
5 a 5
6 b 1
7 c 3
8 d 5
You can use ave to get the number of occurrences of each value in df$ID and use that to subset your data.frame:
out <- df[as.numeric(ave(as.character(df$ID), df$ID, FUN = length)) == 2, ]
out
# ID x
# 1 a 2
# 2 b 4
# 3 c 3
# 4 d 5
# 5 a 5
# 6 b 1
# 7 c 3
# 8 d 5
Use order to sort the output if required.
out[order(out$ID), ]
You can also look into using data.table:
dt <- data.table(df, key = "ID") # Also sorts the output
dt[, n := .N, by = "ID"][n == 2]
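For completeness, the same count-and-filter idea in dplyr could look like this (a sketch, assuming dplyr is loaded):
df %>%
  group_by(ID) %>%
  filter(n() == 2) %>%
  arrange(ID)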