How to distribute a dataframe's rows among list elements

I have a list that comprises a number of dataframes for different securities: for each security there's a date and a reading. I also have a dataframe that has an additional reading-date pair for each security as a row. I'd like to append the additional reading to the appropriate dataframe element in the list.
I can manage this interactively by using
list_name %>% map_df(slice, .id = "id")
to convert the list to a dataframe, append the new readings with bind_rows and then split to convert back to a list.
BUT: I'm trying to convert this into a function for use in a package. One of the columns in the data is a date. As a standalone function, it also works. But when packaged and loaded as a library, it throws an error:
Error in UseMethod("slice_") :
no applicable method for 'slice_' applied to an object of class "Date"
For example, if I have a list z and dataframe d
b <- data.frame(Rank = c(1, 2, 3), Value = c("A", "B", "C"))
c <- data.frame(Rank = c(1, 2, 3), Value = c("A", "B", "C"))
z <- list(Z = b, Y =c)
d <- data.frame(Sec = c("B", "C"), Rank = 4, Value = c("D"))
I'd like the result to be
$Z
Rank Value
1 1 A
2 2 B
3 3 C
4 4 D
$Y
Rank Value
1 1 A
2 2 B
3 3 C
4 4 D
But have no idea where to start. I've tried the usual googling approaches, but can't get the right words to bring up an answer.

An option would be to split the 'd' dataset by the 'Sec' with group_split and use that in map2 to bind the rows with the corresponding elements of 'z'
library(tidyverse)
d %>%
group_split(Sec, .keep = FALSE) %>%
map2(z, ., bind_rows)
#$Z
# Rank Value
#1 1 A
#2 2 B
#3 3 C
#4 4 D
#$Y
# Rank Value
#1 1 A
#2 2 B
#3 3 C
#4 4 D
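Note that map2 pairs the pieces purely by position, so it relies on the split groups coming out in the same order as the list elements. A minimal base R sketch of a name-based variant follows. It assumes, unlike the example above (where Sec holds "B" and "C"), that Sec contains the names of the list elements; new_rows and res are illustrative names, not part of the original answer.

```r
# Assumption: 'Sec' holds the names of the list elements ("Z", "Y")
z <- list(
  Z = data.frame(Rank = c(1, 2, 3), Value = c("A", "B", "C"), stringsAsFactors = FALSE),
  Y = data.frame(Rank = c(1, 2, 3), Value = c("A", "B", "C"), stringsAsFactors = FALSE)
)
d <- data.frame(Sec = c("Y", "Z"), Rank = 4, Value = c("E", "D"),
                stringsAsFactors = FALSE)

# One data frame of new rows per security, without the 'Sec' column
new_rows <- split(d[setdiff(names(d), "Sec")], d$Sec)

# Look up the new rows by name; rbind() skips NULL when a security has none
res <- lapply(setNames(names(z), names(z)), function(nm) {
  out <- rbind(z[[nm]], new_rows[[nm]])
  rownames(out) <- NULL   # reset row names after binding
  out
})
```

Because the lookup is by name, the order of the rows in d no longer matters, and a security without a new reading is returned unchanged.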

You could also do:
Map(rbind, A= z, B = lapply(split(d,d$Sec), function(x){x[-1]}))
# $Z
# Rank Value
# A.1 1 A
# A.2 2 B
# A.3 3 C
# B 4 D
#
# $Y
# Rank Value
# A.1 1 A
# A.2 2 B
# A.3 3 C
# B 4 D

Related

Is there a way to count values by presence per rows in R?

I want a way to count values on a dataframe based on its presence by row
a = data.frame(c('a','b','c','d','f'),
c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first row and the third row, for a total of two appearances. I wrote this code to count a value when it is present in a row, but I want it to do this automatically for all the values present in the dataframe:
#for counting the value 'a' and assigning the count to the b dataframe
b = data.frame(unique(unique(unlist(a))))
b$count = 0
for(i in 1:nrow(a)){
if(TRUE %in% apply(a[i,], 2, function(x) x %in% 'a') == TRUE){
b$count[1] = b$count[1] + 1
}
}
b$count[1]
[1] 2
The problem is that I have to make this manually for all variables and I want a way to make this automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1

It can be done in base R by taking the unique values of each column separately, unlisting them into a vector, and getting the frequency count with table. If needed, convert the table object to a two-column data.frame with stack
stack(table(unlist(lapply(a, unique))))[2:1]
Output:
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
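On the question's data the per-column and per-row counts happen to agree, so the difference is easy to miss. The sketch below uses a hypothetical data frame a2 (not from the question) in which a value repeats within one row, to show how the two approaches diverge.

```r
# Hypothetical data in which "a" appears twice within the first row
a2 <- data.frame(let = c("a", "a", "c"), let2 = c("a", "b", "a"),
                 stringsAsFactors = FALSE)
m <- as.matrix(a2)

# Count each value at most once per row
per_row <- table(unlist(lapply(seq_len(nrow(m)), function(i) unique(m[i, ]))))

# Count each value at most once per column
per_col <- table(unlist(lapply(a2, unique)))

per_row   # "a" is present in all three rows
per_col   # "a" is present in both columns
```

If the goal is presence per row, as the question title says, the row-wise version is the one to use.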

Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d

Count of number of elements between distinct elements in vector

Suppose I have a vector of values, such as:
A C A B A C C B B C C A A A B B B B C A
I would like to create a new vector that, for each element, contains the number of elements since that element was last seen. So, for the vector above,
NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
(where NA indicates that this is the first time the element has been seen).
For example, the first and second A are in positions 1 and 3 respectively, a difference of 2; the third and fourth A are in positions 5 and 12, a difference of 7, and so on.
Is there a pre-built pipe-compatible function that does this?
I hacked together this function to demonstrate:
# For reproducibility
set.seed(1)
# Example vector
x = sample(LETTERS[1:3], size = 20, replace = TRUE)
compute_lag_counts = function(x, first_time = NA){
  # return vector to fill
  lag_counts = rep(-1, length(x))
  # values to match
  vals = unique(x)
  # find all positions of all elements in the target vector
  match_list = grr::matches(vals, x, list = TRUE)
  # compute the lags, then put them in the appropriate place in the return vector
  for(i in seq_along(match_list))
    lag_counts[x == vals[i]] = c(first_time, diff(sort(match_list[[i]])))
  # return vector
  return(lag_counts)
}
compute_lag_counts(x)
Although it seems to do what it is supposed to do, I'd rather use someone else's efficient, well-tested solution! My searching has turned up empty, which is surprising to me given that it seems like a common task.

Or
ave(seq_along(x), x, FUN = function(x) c(NA, diff(x)))
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
We calculate the first difference of the indices for each group of x.
A data.table option thanks to @Henrik
library(data.table)
dt = data.table(x)
dt[ , d := .I - shift(.I), x]
dt

Here's a function that would work
compute_lag_counts <- function(x) {
seqs <- split(seq_along(x), x)
unsplit(Map(function(i) c(NA, diff(i)), seqs), x)
}
compute_lag_counts (x)
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
Basically you use split() to separate the indexes where values appear by each unique value in your vector. Then we take the difference between consecutive indexes within each group to calculate the distance to the previous occurrence. Finally, we use unsplit() to put those values back in the original order.
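To make the three steps concrete, here is a small sketch on the first five elements of the example vector, with the intermediate results kept in separate variables (idx, gaps and res are illustrative names).

```r
x <- c("A", "C", "A", "B", "A")

# Step 1: positions of each distinct value
idx <- split(seq_along(x), x)
# idx$A is 1 3 5, idx$B is 4, idx$C is 2

# Step 2: gap to the previous occurrence within each group (NA for the first)
gaps <- lapply(idx, function(i) c(NA_integer_, diff(i)))

# Step 3: put the gaps back in the original order
res <- unsplit(gaps, x)
res
# [1] NA NA  2 NA  2
```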

An option with dplyr by taking the difference of adjacent sequence elements after grouping by the original vector
library(dplyr)
tibble(v1) %>%
mutate(ind = row_number()) %>%
group_by(v1) %>%
mutate(new = ind - lag(ind)) %>%
pull(new)
#[1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
data
v1 <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C", "C", "A",
"A", "A", "B", "B", "B", "B", "C", "A")

Take sum of rows for every 3 columns in a dataframe

I have searched high and low and also tried multiple options to solve this but did not get the desired output as mentioned below:
I have a dataframe df3 with dates as headers and values of 0 or 1, as shown below:
df = data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) = c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 = data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 = cbind(df2,df)
Now I need df4 in which sum of first 3 columns in series will form one column. This will be repeated in series for rest of the columns dynamically.
df4
Options I tried:
a) rbind.data.frame(apply(matrix(df3, nrow = n - 1), 1,sum))
b) col_list <- list(c("1/1/2018","1/2/2018","1/3/2018"), c("1/4/2018","1/5/2018","1/6/2018"))
lapply(col_list, function(x)sum(df3[,x])) %>% data.frame

One way would be to split df3 every 3 columns using split.default. To split the data we generate a sequence using rep, then for each dataframe we take rowSums and finally cbind the result together.
cbind(df3[1], sapply(split.default(df3[-1],
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))), rowSums))
# CUST_ID 1 2
#1 A 1 1
#2 B 2 0
#3 C 2 1
#4 D 1 1
#5 E 2 2
#6 F 2 2
FYI, the sequence generated from rep is
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))
#[1] 1 1 1 2 2 2
This makes it possible to split every 3 columns.
The results are different because OP used sample without set.seed.
If rep seems too long then we can generate the same sequence of columns using gl
gl(ncol(df3[-1])/3, 3)
#[1] 1 1 1 2 2 2
#Levels: 1 2
So the final code, would be
cbind(df3[1], sapply(split.default(df3[-1], gl(ncol(df3[-1])/3, 3)), rowSums))
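The gl call assumes the number of value columns is a multiple of 3. A hedged sketch of a variant that also handles a shorter last group is below; it uses small deterministic data (the column names d1 to d5 are made up for illustration) and builds the grouping index with ceiling instead.

```r
# Hypothetical data: five value columns, so the last group has only two
df3 <- data.frame(CUST_ID = c("A", "B", "C"),
                  d1 = c(1, 0, 1), d2 = c(0, 0, 1), d3 = c(1, 1, 1),
                  d4 = c(0, 1, 0), d5 = c(1, 1, 0),
                  stringsAsFactors = FALSE)

vals <- df3[-1]

# Group index per column: 1 1 1 2 2
grp <- ceiling(seq_along(vals) / 3)

# Split the columns by group and sum each group row-wise
sums <- sapply(split.default(vals, grp), rowSums)
df4 <- cbind(df3[1], sums)
df4
#   CUST_ID 1 2
# 1       A 2 1
# 2       B 1 2
# 3       C 3 0
```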

We can use seq to create an index, get each subset of columns as a list, Reduce it by taking the sum, and create new columns
df4 <- df3[1]
df4[paste0('col', c('123', '456'))] <- lapply(seq(2, ncol(df3), by = 3),
function(i) Reduce(`+`, df3[i:min((i+2), ncol(df3))]))
df4
# CUST_ID col123 col456
#1 A 2 2
#2 B 3 3
#3 C 1 3
#4 D 2 3
#5 E 2 1
#6 F 0 1
data
set.seed(123)
df <- data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) <- c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 <- data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 <- cbind(df2, df)

Arranging data.frame's columns based on a reference vector [duplicate]

I have a data.frame that looks like this:
which has 1000+ columns with similar names.
And I have a vector of those column names that looks like this:
The vector is sorted by the cluster_id (which goes up to 11).
I want to sort the columns in the data frame such that the columns are in the order of the names in the vector.
A simple example of what I want is that:
Data:
A B C
1 2 3
4 5 6
Vector:
c("B","C","A")
Sorted:
B C A
2 3 1
5 6 4
Is there a fast way to do this?

UPDATE, with reproducible data added by OP:
df <- read.table(header = TRUE, text = "A B C
1 2 3
4 5 6")
vec <- c("B", "C", "A")
df[vec]
Results in:
B C A
1 2 3 1
2 5 6 4
As OP desires.
How about:
df[df.clust$mutation_id]
Where df is the data.frame you want to sort the columns of and df.clust is the data frame that contains the vector with the column order (mutation_id).
This basically treats df as a list and uses standard vector indexing techniques to re-order it.
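Indexing with a character vector fails with an "undefined columns selected" error if the vector contains a name that is not a column, and it silently drops columns the vector does not mention. A minimal sketch that tolerates both cases follows; vec here is a made-up vector, and known, rest and df_sorted are illustrative names.

```r
df <- data.frame(A = c(1, 4), B = c(2, 5), C = c(3, 6))
vec <- c("B", "X", "C")   # "X" is not a column; "A" is not listed

known <- intersect(vec, names(df))   # keeps the order of 'vec': "B" "C"
rest  <- setdiff(names(df), known)   # columns not mentioned in 'vec': "A"

# Listed columns first, in the requested order, then the remaining ones
df_sorted <- df[c(known, rest)]
names(df_sorted)
# [1] "B" "C" "A"
```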

Brodie's answer does exactly what you're asking for. However, you imply that your data are large, so I will provide an alternative using "data.table", which has a function called setcolorder that will change the column order by reference.
Here's a reproducible example.
Start with some simple data:
mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
matches <- data.frame(X = 1:3, Y = c("C", "A", "B"), Z = 4:6)
mydf
# A B C
# 1 1 3 5
# 2 2 4 6
matches
# X Y Z
# 1 1 C 4
# 2 2 A 5
# 3 3 B 6
Provide proof that Brodie's answer works:
out <- mydf[matches$Y]
out
# C A B
# 1 5 1 3
# 2 6 2 4
Show a more memory efficient way to do the same thing.
library(data.table)
setDT(mydf)
mydf
# A B C
# 1: 1 3 5
# 2: 2 4 6
setcolorder(mydf, as.character(matches$Y))
mydf
# C A B
# 1: 5 1 3
# 2: 6 2 4

A5C1D2H2I1M1N2O1R2T1's solution didn't work for my data (I have a problem similar to Yilun Zhang's), so I found another option:
mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
# A B C
# 1 1 3 5
# 2 2 4 6
matches <- c("B", "C", "A") #desired order
mydf_reorder <- mydf[,match(matches, colnames(mydf))]
colnames(mydf_reorder)
#[1] "B" "C" "A"
match() finds the position of each element of its first argument in the second:
match(matches, colnames(mydf))
#[1] 2 3 1
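One caveat: match() returns NA for any name it cannot find, and indexing columns with NA then fails with a less helpful message. A small sketch of a wrapper that checks for this first is below; reorder_cols is a hypothetical helper name, not a function from base R or any package.

```r
# Hypothetical helper: reorder columns by name, failing early on unknown names
reorder_cols <- function(df, order_vec) {
  pos <- match(order_vec, colnames(df))
  if (anyNA(pos)) {
    stop("Columns not found: ", paste(order_vec[is.na(pos)], collapse = ", "))
  }
  df[, pos, drop = FALSE]   # drop = FALSE keeps a data frame for a single column
}

mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
colnames(reorder_cols(mydf, c("B", "C", "A")))
# [1] "B" "C" "A"
```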
I hope this can offer another solution if anyone is having problems!
