I have a list of nested data, containing lists with two data frames each looking like this:
mylist <- list(
list(
p = data.frame(
id = "01",
stringsAsFactors = F
),
c = data.frame(
text = c("one", "two"),
from = c("A", "B"),
stringsAsFactors = F
)
),
list(
p = data.frame(
id = "02",
stringsAsFactors = F
),
c = data.frame(
text = c("three", "four", "five"),
from = c("C", "D", "E"),
stringsAsFactors = F
)
),
list(
p = data.frame(
id = "03",
stringsAsFactors = F
),
c = data.frame(
text = logical(0),
from = logical(0)
)
)
)
I want to flatten this list into a data frame with one row per observation from the "c" data frames, a column holding the "id" of the level above (stored in the "p" data frames), and non-observations filled with NAs. The result should look like this:
df <- data.frame(
p.id = c("01", "01", "02", "02", "02", "03"),
c.text = c("one", "two", "three", "four", "five", NA),
c.from = c("A", "B", "C", "D", "E", NA)
)
#
# p.id c.text c.from
# 01 one A
# 01 two B
# 02 three C
# 02 four D
# 02 five E
# 03 <NA> <NA>
as.data.frame() provides a very simple solution that comes very close to the desired result, but breaks when there are 0 observations in the "c" data frame.
mylist[[1]] %>% as.data.frame()
mylist[[3]] %>% as.data.frame()
I am aware that there are similar questions on Stack Overflow. I have read several threads closely and tried different things, from bind_rows() to jsonlite::flatten(), tidyr::unnest(), and data.table::rbindlist(), but have not made it work.
Help is highly appreciated!
Here is a tidyverse solution:
library(tidyverse)
map(mylist, ~as_tibble(.)) %>%
enframe() %>%
unnest_longer(value)
Which gives us:
# A tibble: 6 x 2
name value$p$id $c$text $$from
<int> <chr> <chr> <chr>
1 1 01 one A
2 1 01 two B
3 2 02 three C
4 2 02 four D
5 2 02 five E
6 3 NA NA NA
I create a helper function to combine p and c:
foo <- function(x) {
a <- x[[1]]
b <- x[[2]]
if (nrow(b) == 0) b[1, ] <- NA
return(cbind(a, b))
}
Then I run the helper function on each element and bind the rows:
do.call(rbind, lapply(mylist, foo))
The result:
> do.call(rbind, lapply(mylist, foo))
id text from
1 01 one A
2 01 two B
3 02 three C
4 02 four D
5 02 five E
6 03 <NA> <NA>
P.S. The same result using the base R pipe:
lapply(mylist, foo) |> do.call(what = rbind)
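If the prefixed column names from the question (p.id, c.text, c.from) matter, the columns can be renamed before binding. The following is only a minimal sketch along the same lines as the helper above: flatten_one is a hypothetical name, and the shortened mylist merely stands in for the question's data.

```r
# Shortened stand-in for the question's data (assumption: same structure)
mylist <- list(
  list(p = data.frame(id = "01", stringsAsFactors = FALSE),
       c = data.frame(text = c("one", "two"), from = c("A", "B"),
                      stringsAsFactors = FALSE)),
  list(p = data.frame(id = "03", stringsAsFactors = FALSE),
       c = data.frame(text = logical(0), from = logical(0)))
)

# Hypothetical helper: pad an empty "c" with one NA row, prefix the
# column names with their origin, then combine "p" and "c" column-wise
flatten_one <- function(x) {
  p <- x$p
  cc <- x$c
  if (nrow(cc) == 0) cc[1, ] <- NA
  names(p) <- paste0("p.", names(p))
  names(cc) <- paste0("c.", names(cc))
  cbind(p, cc)  # the single "p" row is recycled to match "c"
}

out <- do.call(rbind, lapply(mylist, flatten_one))
out
```

Unlike the unnest_longer() output shown earlier, this keeps the id of an element whose "c" data frame is empty.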
Say I have a df like so:
T1 <- c("a","b","c","d","e")
T2 <- c("f","g","h","i","j")
score1 <- c(NA,0.01,0.5,0.78,NA)
score2 <- c(1, 2, 3, NA, 6)
df <- data.frame(T1, T2, score1, score2)
df
T1 T2 score1 score2
1 a f NA 1
2 b g 0.01 2
3 c h 0.50 3
4 d i 0.78 NA
5 e j NA 6
If I want to randomly create new T1-T2 pairs, how can I see if these new pairs are in the df but only if score1 column is not NA?
In other words, I randomly sample, say, 2 values from T1 and T2:
(l1 <- sample(df$T1, 2))
(l2 <- sample(df$T2, 2))
and get:
> l1
[1] "c" "d"
> l2
[1] "h" "g"
How would one go about getting the score2 of the c-h and d-g pairs from df, but only if score1 is not NA?
My first instinct would be to create a new df2 without NAs in the score1 column:
df2 <- df[which(!is.na(df$score1)), ]
Then I can create a new df for the new pairs:
df3 <- data.frame(X1 = l1, X2 = l2, X3 = l2, X4 = l1)
#stack X3 with X1 and X4 with X2 (considering that T1-T2 pair is the same as T2-T1 pair)
df4 <- data.frame(T1 = c(df3[,"X1"], df3[,"X3"]),
T2 = c(df3[,"X2"], df3[,"X4"]))
> df4
T1 T2
1 c h
2 d g
3 h c
4 g d
But I'm missing the last step: how to see whether the paired columns from df4 match the paired columns in df2. In the end, I want to get something like:
df
T1 T2 score1 score2
1 c h 0.50 3
2 d g NA NA
I think a merge/join operation makes sense here:
res <- merge(df, data.frame(T1=l1, T2=l2, found=TRUE), by = c("T1","T2"), all = TRUE)
subset(res, found, select = -found)
# T1 T2 score1 score2
# 3 c h 0.5 3
# 4 d g NA NA
Data
df <- structure(list(T1 = c("a", "b", "c", "d", "e"), T2 = c("f", "g", "h", "i", "j"), score1 = c(NA, 0.01, 0.5, 0.78, NA), score2 = c(1, 2, 3, NA, 6)), class = "data.frame", row.names = c(NA, -5L))
l1 <- c("c", "d"); l2 <- c("h", "g")
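To also honour the "only if score1 is not NA" condition from the question, one option is to filter df first and then left-join the sampled pairs onto it. This is a sketch under the assumption that the pair order matters (T1-T2 only, reversed pairs are not checked); pairs and valid are names introduced here for illustration.

```r
df <- data.frame(
  T1 = c("a", "b", "c", "d", "e"),
  T2 = c("f", "g", "h", "i", "j"),
  score1 = c(NA, 0.01, 0.5, 0.78, NA),
  score2 = c(1, 2, 3, NA, 6),
  stringsAsFactors = FALSE
)
l1 <- c("c", "d")
l2 <- c("h", "g")

# One row per sampled pair
pairs <- data.frame(T1 = l1, T2 = l2, stringsAsFactors = FALSE)

# Keep only rows of df whose score1 is known
valid <- df[!is.na(df$score1), ]

# Left join: every sampled pair is kept; unmatched pairs get NA scores
res <- merge(pairs, valid, by = c("T1", "T2"), all.x = TRUE)
res
```

Pairs that are absent from df, or present only with a missing score1, come back with NA in both score columns.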
Something like this?
set.seed(2022)
(l1 <- sample(df$T1, 2))
#> [1] "d" "c"
(l2 <- sample(df$T2, 2))
#> [1] "h" "i"
mapply(\(x1, x2, data){
i <- match(x1, data$T1)
j <- match(x2, data$T2)
if(any(is.na(c(data$score1[i], data$score1[j])))) {
NA_real_
} else {
sum(c(data$score2[i], -1*data$score2[j]), na.rm = TRUE)
}
}, l1, l2, MoreArgs = list(data = df))
#> d c
#> -3 3
Created on 2022-01-30 by the reprex package (v2.0.1)
I have this data:
dat=list(structure(list(Group.1 = structure(3:4, .Label = c("A","B", "C", "D", "E", "F"), class = "factor"), Pr1 = c(65, 75)), row.names = c(NA, -2L), class = "data.frame"),NULL, structure(list( Group.1 = structure(3:4, .Label = c("A","B", "C", "D", "E", "F"), class = "factor"), Pr1 = c(81,4)), row.names = c(NA,-2L), class = "data.frame"))
I want to combine these using bind_rows(dat) while keeping the list index as a variable.
The output should include a type column (here from [[1]] and [[3]]):
type Group.1 Pr1
1 1 C 65
2 1 D 75
3 3 C 81
4 3 D 4
data.table solution
Use rbindlist() from the data.table package, which has built-in id support that respects NULL data frames.
library(data.table)
rbindlist( dat, idcol = TRUE )
.id Group.1 Pr1
1: 1 C 65
2: 1 D 75
3: 3 C 81
4: 3 D 4
dplyr - partial solution
bind_rows also has ID-support, but it 'skips' empty elements...
bind_rows( dat, .id = "id" )
id Group.1 Pr1
1 1 C 65
2 1 D 75
3 2 C 81
4 2 D 4
Note that the ID of the third element from dat becomes 2, and not 3.
According to the documentation of bind_rows(), you can supply a name for the .id argument of the function. When you apply bind_rows() to a list of data frames, the names of the list elements are assigned to the identifier column. [EDIT] But there is a problem, mentioned by @Wimpel:
names(dat)
NULL
However, supplying the names to the list will do the thing:
names(dat) <- 1:length(dat)
names(dat)
[1] "1" "2" "3"
bind_rows(dat, .id = "type")
type Group.1 Pr1
1 1 C 65
2 1 D 75
3 3 C 81
4 3 D 4
Or in one line, if you prefer:
bind_rows(setNames(dat, seq_along(dat)), .id = "type")
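For completeness, the same idea can be sketched in base R without either package. This is an illustrative alternative, not taken from the answers above; the simplified dat (character instead of factor Group.1) and the name keep are assumptions.

```r
# Simplified stand-in for the question's data (assumption: plain
# character Group.1 instead of a factor)
dat <- list(
  data.frame(Group.1 = c("C", "D"), Pr1 = c(65, 75)),
  NULL,
  data.frame(Group.1 = c("C", "D"), Pr1 = c(81, 4))
)

# Positions of the non-NULL elements, here 1 and 3
keep <- which(!vapply(dat, is.null, logical(1)))

# Prepend the original list position to each data frame, then stack
res <- do.call(rbind, lapply(keep, function(i) cbind(type = i, dat[[i]])))
res
```

Because the index is taken before the NULL element is dropped, the third element keeps type 3.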
Let's say we have two data frames:
df1 <- data.frame(A = letters[1:3], B = letters[4:6], C = letters[7:9], stringsAsFactors = FALSE)
A B C
1 a d g
2 b e h
3 c f i
df2 <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
V1 V2 V3
1 1 4 7
2 2 5 8
3 3 6 9
I need to build a function that takes as input a single value or a vector containing elements from one of the data frames and returns the elements from the other data frame according to their positional indexes.
The function should work like this:
> matchdf(values = c("a", "e", "i"), dfin = df1, dfout = df2)
[1] 1 5 9
> matchdf(values = c(1, 5, 9), dfin = df2, dfout = df1)
[1] "a" "e" "i"
> matchdf(values = c(1, 1, 1), dfin = df2, dfout = df1)
[1] "a" "a" "a"
This is what I have tried so far:
require(dplyr)
toVec <- function(df) df %>% as.matrix %>% as.vector
matchdf <- function(values, dfin, dfout) toVec(dfout)[toVec(dfin) %in% values]
# But sometimes the output values aren't in correct order:
> matchdf(c("c", "i", "h"), df1, df2)
[1] 3 8 9
# should output 3 9 8
> matchdf(values = c("a", "a", "a"), dfin = df1, dfout = df2)
[1] 1
# Should output 1 1 1
Feel free to use data.table and/or dplyr if it eases the task. I would prefer a solution without for loops.
Assumptions:
elements from df1 are different from df2
dim(df1) = dim(df2)
matchdf <- function(values, dfin, dfout){
unlist(sapply(values,
function(val) dfout[dfin == val],
USE.NAMES = F)
)
}
matchdf(c("c", "i", "h"), df1, df2)
#should output 3 9 8
[1] 3 9 8
matchdf(values = c("a", "a", "a"), dfin = df1, dfout = df2)
#should output 1 1 1
[1] 1 1 1
matchdf(values = c("X", "Y", "a"), dfin = df1, dfout = df2)
#should output vector, not list
[1] 1
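A possible alternative is to flatten both data frames and look positions up with match(), which preserves the order of values and handles repeats. This is a sketch only: matchdf2 is a hypothetical name, and it assumes each element occurs once in dfin, since match() returns the first position found. Unlike the version above, values that are not found yield NA instead of being dropped.

```r
df1 <- data.frame(A = letters[1:3], B = letters[4:6], C = letters[7:9],
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)

# Hypothetical helper: flatten both data frames column by column and
# map each value to the element at the same position in dfout
matchdf2 <- function(values, dfin, dfout) {
  vin <- unlist(dfin, use.names = FALSE)
  vout <- unlist(dfout, use.names = FALSE)
  vout[match(values, vin)]
}

matchdf2(c("c", "i", "h"), df1, df2)
matchdf2(c(1, 1, 1), df2, df1)
matchdf2(c("X", "Y", "a"), df1, df2)
```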
I have a dataframe looking like this:
id value1 value2 value3 value4
A 14 24 22 9
B 51 25 29 33
C 4 16 8 10
D 1 4 2 4
Now I want to compare each row with the other rows, column by column, in order to identify the rows where every value is higher.
So, for example for id D this would be A, B and C.
For C it would be B, for A it's B and for B there is no row.
I tried to do that by looping through the rows and comparing every column, but that takes a lot of time. The original dataset has about 5000 rows and 20 columns to compare.
I am sure that there is a way to do that more efficiently. Thanks for your help!
I don't know a simple function to do this task.
Here is how I would do it.
library(dplyr)
DF <- data.frame(
id = c("A", "B", "C", "D"),
value1 = c(14, 51, 4, 1),
value2 = c(24, 25, 16, 4),
value3 = c(22, 29, 8, 2),
value4 = c(9, 33, 10, 4),
stringsAsFactors = FALSE)
# get the order for each value
tmp <- lapply(select(DF, -id), function(x) DF$id[order(x)])
# find a set of "biggers" for each id
tmp <- lapply(tmp, function(x) data.frame(
id = rep(x, rev(seq_along(x))-1),
bigger = x[lapply(seq_along(x), function(i)
which(seq_along(x) > i)) %>% unlist()],
stringsAsFactors = FALSE))
# inner_join all, this keeps "biggers" in all columns
out <- NULL
for (v in tmp) {
if (is.null(out)) {
out <- v
} else {
out <- inner_join(out, v, by = c("id", "bigger"))
}
}
This gets you:
out
# id bigger
#1 D C
#2 D A
#3 D B
#4 C B
#5 A B
Here's an approach that returns results in a data frame format.
library(tidyr)
library(dplyr)
# reshape data to long format
td <- d %>% gather(key, value, value1:value4)
# create a copy w/ different names for merging
td2 <- td %>% select(id2 = id, key, value2 = value)
# full outer join to produce one row per pair of IDs
dd <- merge(td, td2, by = "key", all = TRUE)
# the result
dd %>%
filter(id != id2) %>%
group_by(id, id2) %>%
summarise(all_less = !any(value >= value2)) %>%
filter(all_less)
results (id is less than id2)
id id2 all_less
(fctr) (fctr) (lgl)
1 A B TRUE
2 C B TRUE
3 D A TRUE
4 D B TRUE
5 D C TRUE
data
d <- structure(list(
id = structure(1:4, .Label = c("A", "B", "C", "D"), class = "factor"),
value1 = c(14L, 51L, 4L, 1L),
value2 = c(24L, 25L, 16L, 4L),
value3 = c(22L, 29L, 8L, 2L), value4 = c(9L, 33L, 10L, 4L)
),
.Names = c("id", "value1", "value2", "value3", "value4"),
class = "data.frame", row.names = c(NA, -4L)
)
I think this works just fine:
ind <- which(names(df) == "id")
apply(df[,-ind],1,function(x) df$id[!rowSums(!t(x < t(df[,-ind])))] )
# [[1]]
# [1] "B"
#
# [[2]]
# character(0)
#
# [[3]]
# [1] "B"
#
# [[4]]
# [1] "A" "B" "C"
My data.frame contains information on the movements completed by an individual and a string (of alpha characters) that represents these movements in a database. It is structured as follows:
MovementAnalysis <- structure(list(Strings = c("AaB", "cZhH", "Bb", "bAc"), Descriptor = c("Jog/ Stop/ Turn", "Change/ Shuffle/ Backwards/ Jump", "Turn/ Duck", "Duck/ Jog/ Change"), Person = c("Sally", "Sally", "Ben", "Ben")), .Names = c("Strings", "Descriptor", "Person"), row.names = c(NA, 4L), class = "data.frame")
I wish to capture the frequency of each alpha letter (for example: A, a, B, b) within all the Strings for each Person. There are 48 alpha upper and lower case letters. My actual data.frame contains the movements of 100 + individuals, so a quick solution to iterate over each individual would be ideal. As an example, my anticipated output would be:
Output <- structure(list(Person = c("Sally", "Sally", "Sally", "Sally", "Ben", "Ben", "Ben", "Ben"), Letter = c("A", "a", "B", "b", "A", "a", "B", "b"), Frequency = c(1, 1, 1, 0, 1, 0, 1, 2)), .Names = c("Person", "Letter", "Frequency"), row.names = c(NA, 8L), class = "data.frame")
Thank you!
One option is using data.table
library(data.table)
df2 <- setDT(df1)[,list(Letter={
tmp <- unlist(strsplit(Strings, ''))
factor(tmp[tmp %in% c("A", "a", "B", "b")],
levels=c("A", "a", "B", "b"))}) , Person]
df2[, ind:="Frequency"]
dcast(df2, Person+Letter~ind, value.var="Letter", length, drop=FALSE)
# Person Letter Frequency
#1: Ben A 1
#2: Ben a 0
#3: Ben B 1
#4: Ben b 2
#5: Sally A 1
#6: Sally a 1
#7: Sally B 1
#8: Sally b 0
Less wizardry than akrun's answer, but I think it works:
your.func <- function(data) {
require(dplyr)
bag.of.letters <- function(strings) {
concat.string <- paste(strings, collapse='')
all.chars.vec <- unlist(strsplit(concat.string,""))
result <- data.frame(table(factor(all.chars.vec,levels = c(letters,LETTERS))))
colnames(result) <- c("Letter","Frequency")
result[order(result[["Letter"]]),]
}
lapply(X = unique(data[["Person"]]),
FUN = function(n) {
strings = data %>% filter(Person == n) %>% .[["Strings"]]
data.frame(Person = n, bag.of.letters(strings))
}) %>% do.call(rbind,.)
}
your.func(MovementAnalysis)
If you want to have only letters with positive Frequency in your Letter column, remove the factor(..., levels = c(letters,LETTERS)) part.
Here's an option using cSplit_e from my "splitstackshape" package. I've combined it with "magrittr" so that you can walk through the steps without having to store any intermediate objects or create a long nested expression.
The first option shows how to get the "wide" form, as described by @alistaire.
library(splitstackshape)
library(magrittr)
data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
.[, lapply(.SD, sum), by = Person] %>%
subset(select = grep("Person|_[AaBb]$", names(.)))
# Person Strings_a Strings_A Strings_b Strings_B
# 1: Sally 1 1 0 1
# 2: Ben 0 1 2 1
To go from the above to the long form, you just need to add a melt line.
data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
.[, lapply(.SD, sum), by = Person] %>%
subset(select = grep("Person|_[AaBb]$", names(.))) %>%
melt(id.vars = "Person")
# Person variable value
# 1: Sally Strings_a 1
# 2: Ben Strings_a 0
# 3: Sally Strings_A 1
# 4: Ben Strings_A 1
# 5: Sally Strings_b 0
# 6: Ben Strings_b 2
# 7: Sally Strings_B 1
# 8: Ben Strings_B 1
It's not clear from your question, but if your restricting the data to just "A", "a", "B", and "b" was just for the purpose of illustration and you're actually interested in the full 48 options, then you can also omit the following line:
subset(select = grep("Person|_[AaBb]$", names(.)))