sorting list of dataframes by date - r

I have a list of dataframes in a structure similar to this:
`ID1_01/05/10` <- data.frame(c(1,1))
`ID1_21/02/10` <- data.frame(c(2,1))
`ID2_01/05/10` <- data.frame(c(3,1))
`ID2_21/02/10` <- data.frame(c(4,1))
lst <- list(mget ( ls ( pattern = 'ID\\d+')))
I'd like to order them in the list first by identity and then by date. I.e.:
`ID1_21/02/10`
`ID1_01/05/10`
`ID2_21/02/10`
`ID2_01/05/10`
Is there a way of doing this easily?

We extract the names, pull out the numeric ID ('v1') and the date ('d1'), and order on both:
nm1 <- sapply(lst, names)[,1]
# anchor on the leading non-digits so multi-digit IDs (e.g. ID12) are captured whole
v1 <- as.numeric(sub("^\\D*(\\d+)_.*", "\\1", nm1))
d1 <- as.Date(sub(".*_", "", nm1), "%d/%m/%y")
nm1[order(v1, d1)]
#[1] "ID1_21/02/10" "ID1_01/05/10" "ID2_21/02/10" "ID2_01/05/10"
lapply(lst, function(x) x[order(v1, d1)])
#[[1]]
#[[1]]$`ID1_21/02/10`
# c.2..1.
#1 2
#2 1
#[[1]]$`ID1_01/05/10`
# c.1..1.
#1 1
#2 1
#[[1]]$`ID2_21/02/10`
# c.4..1.
#1 4
#2 1
#[[1]]$`ID2_01/05/10`
# c.3..1.
#1 3
#2 1
Update
In the OP's example, mget was wrapped in list(), which creates a list of lists. Without the wrapper it would be
lst <- mget ( ls ( pattern = 'ID\\d+'))
and in that case the names come straight from the list ('v1' and 'd1' are then recomputed from 'nm1' as above):
nm1 <- names(lst)
lst[order(v1, d1)]
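A self-contained sketch of the same idea on a flat named list, assuming the names follow the ID<number>_<dd/mm/yy> pattern from the question. The toy data frames and the names ord and sorted are illustrative only:

```r
# Toy list mirroring the question's naming scheme (illustrative data)
lst <- list(
  `ID1_01/05/10` = data.frame(x = c(1, 1)),
  `ID1_21/02/10` = data.frame(x = c(2, 1)),
  `ID2_01/05/10` = data.frame(x = c(3, 1)),
  `ID2_21/02/10` = data.frame(x = c(4, 1))
)

nm1 <- names(lst)
# Numeric ID: the digits between the leading non-digits and the underscore
v1 <- as.numeric(sub("^\\D*(\\d+)_.*", "\\1", nm1))
# Date: everything after the underscore, parsed as day/month/two-digit year
d1 <- as.Date(sub(".*_", "", nm1), format = "%d/%m/%y")

# Order by ID first, then by date within each ID
ord <- order(v1, d1)
sorted <- lst[ord]
names(sorted)
```

Because order() takes several keys, ties on the ID are broken by the date without any extra code.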

Related

How to modify a list of data.frame and then output the data.frame

I want to create a second column in each of a list of data.frames that is just a duplicate of the first column, and then output those data.frames:
store the data frames:
> FileList <- list(DF1, DF2)
Add another column to each data frame:
> ModifiedDataFrames <- lapply(1:length(FileList), function (x) {FileList[[x]]$Column2 == FileList[[x]]$Column1})
but ModifiedDataFrames[[1]] just returns a list which contains what I assume is the content from DF1$Column1
What am I missing here?
There are a few problems with your code. First, you are using the comparison operator == where you need assignment (<-); second, your function does not return the modified data frame. Here is a possible solution:
df1 <- data.frame(Column1 = c(1:3))
df2 <- data.frame(Column1 = c(4:6))
FileList <- list(df1, df2)
ModifiedDataFrames <- lapply(FileList, function(x) {
x$Column2 <- x$Column1
return(x)
})
> ModifiedDataFrames
[[1]]
Column1 Column2
1 1 1
2 2 2
3 3 3
[[2]]
Column1 Column2
1 4 4
2 5 5
3 6 6
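The two fixes can also be seen in isolation. This is only a sketch; the helper name add_copy and the variable cmp are hypothetical and not part of the original answer:

```r
df1 <- data.frame(Column1 = 1:3)

# '==' only compares. Column2 does not exist yet, so the left side is NULL
# and the comparison yields an empty logical vector; nothing is assigned.
cmp <- df1$Column2 == df1$Column1

# '<-' inside a function changes a local copy, so the copy must be returned
add_copy <- function(x) {
  x$Column2 <- x$Column1
  x
}
out <- add_copy(df1)

names(out)   # the returned copy has both columns
names(df1)   # the original is untouched
```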

Extract and append data to new datasets in a for loop

I have (what I think) is a really simple question, but I can't figure out how to do it. I'm fairly new to lists, loops, etc.
I have a small dataset:
df <- c("one","two","three","four")
df <- as.data.frame(df)
df
I need to loop through this dataset and create a list of datasets, such that this is the outcome:
[[1]]
one
[[2]]
one
two
[[3]]
one
two
three
This is more or less as far as I've gotten:
blah <- list()
for(i in 1:3){
blah[[i]]<- i
}
The length will be variable when I use this in the future, so I need to automate it in a loop. Otherwise, I would just do
one <- df[1,]
two <- df[2,]
list(one, rbind(one, two))
Any ideas?
You can try using lapply :
result <- lapply(seq(nrow(df)), function(x) df[seq_len(x), , drop = FALSE])
result
#[[1]]
# df
#1 one
#[[2]]
# df
#1 one
#2 two
#[[3]]
# df
#1 one
#2 two
#3 three
#[[4]]
# df
#1 one
#2 two
#3 three
#4 four
seq(nrow(df)) creates a sequence from 1 to the number of rows in your data (4 in this case). The function(x) part is an anonymous function, to which each value from 1 to 4 is passed in turn. seq_len(x) creates a sequence from 1 to x, i.e. 1 to 1 in the first iteration, 1 to 2 in the second, and so on. We use this sequence to subset the rows of the data frame (df[seq_len(x), ]). Because the data frame has only one column, subsetting it would simplify the result to a vector; drop = FALSE prevents that.
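A minimal sketch of what drop = FALSE changes, using the question's data (the names as_vector and as_frame are illustrative):

```r
df <- data.frame(df = c("one", "two", "three", "four"))

# Default behaviour: a one-column subset is simplified to a plain vector
as_vector <- df[seq_len(2), ]

# drop = FALSE keeps the result as a one-column data frame
as_frame <- df[seq_len(2), , drop = FALSE]

is.data.frame(as_vector)
is.data.frame(as_frame)
```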
Base R solution:
# Coerce the 'df' column to character (it is a factor under R < 4.0): str_df => data.frame
str_df <- transform(df, df = as.character(df))
# Split the character column into cumulative pieces:
# df_list => list of character vectors
df_list <- lapply(seq_len(nrow(str_df)), function(i) {
  str_df[seq_len(i), "df"]
})
Data:
df <- c("one","two","three","four")
df <- as.data.frame(df)
df

Column names go away when using lapply

I'm trying to keep the column name in a list of data frames when using lapply function.
I have a list of data frames. Let's say:
lst:
[[1]] [[2]]
A ind C ind
1 0 4 2
2 1 8 0
I'm trying to get elements of the first columns of each dataframe ([[1]] and [[2]]) which has the index 0 in the second column of each data frame.
I'm using the code
aux <- lapply(lst, function(x) x[,1][x[,2]==0])
And it is working. The only problem is that I 'd like to keep the first column names. It means I'd like to get
aux:
[[1]] [[2]]
A C
1 8
but I'm getting
aux:
[[1]] [[2]]
V1 V1
1 8
How can I keep the column names intact?
data
lst <- list(
data.frame(A=1:2, ind = 0:1),
data.frame(C=c(4,8), ind = c(2,0))
)
We can just subset the first column
lapply(lst, function(x) x[x[,2] == 0, 1, drop = FALSE])
Or with the tidyverse, this can be made more compact
library(purrr)
map(lst, ~ .x[!.x[,2],1, drop = FALSE])
Here is another way that might be a bit more readable, using subset,
lapply(lst, function(i) subset(i, i[2] == 0)[1])
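As a quick check that the column names survive, here is a small sketch using the data from the question and the drop = FALSE approach:

```r
lst <- list(
  data.frame(A = 1:2, ind = 0:1),
  data.frame(C = c(4, 8), ind = c(2, 0))
)

# Keep rows where the second column is 0, and keep column 1 as a data frame
aux <- lapply(lst, function(x) x[x[, 2] == 0, 1, drop = FALSE])

# Each element is still a data frame, so its column name is preserved
sapply(aux, names)
```

The names are lost in the original attempt because x[,1] returns a bare vector, which carries no column name.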

Subset a dataframe by matching it to a list and include non-match value too in the output using R

I have a dataframe (myDF) that has 2 columns "A" and "B" and a function (myfunc) which takes a list as an input and if it finds a match in column "A" then it returns a new dataframe that is a subset of myDF containing the value match and the corresponding "B" column.
But I want the function to also return the non-matching value in column A and NULL string in column B.
myDF:
A B
1 11
2 22
3 33
myfunc:
myfunc <- function(x) {
r<- with(myDF, myDF[a %in% x, c("a", "b")])
return(data.frame(r))
}
Input: mylist = c(1,2,"E")
Expected Output:
A B
1 11
2 22
E NULL
We create a logical index and assign
i1 <- with(myDF, !A %in% mylist)
myDF$B[i1] <- "NULL"
myDF$A[i1] <- mylist[i1]
myDF
# A B
#1 1 11
#2 2 22
#3 E NULL
Note: Assigning a character string to the 'B' column changes its type from numeric to character. A better option is to assign NA:
myDF$B[i1] <- NA
Or
data.frame(A= mylist, B = myDF$B[match(mylist, myDF$A)])
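The match() lookup above can be sketched step by step, assuming the data from the question (pos and res are illustrative names):

```r
myDF <- data.frame(A = c(1, 2, 3), B = c(11, 22, 33))
mylist <- c(1, 2, "E")   # mixed input is coerced to character

# For each requested value, match() gives its row position in myDF$A,
# or NA when the value is absent
pos <- match(mylist, myDF$A)

# Indexing with NA yields NA, so unmatched values get a missing B
res <- data.frame(A = mylist, B = myDF$B[pos])
res
```

Unlike the index-based assignment, this does not depend on the unmatched value sitting at the same position in mylist as the row it replaces.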
This is a join operation, which can be done in base R with merge if you first turn the list into a data.frame. The all.y = TRUE argument keeps the rows of mylistDF that have no matching rows in myDF.
mylistDF <- data.frame(A = mylist, stringsAsFactors = FALSE)
merge(myDF, mylistDF, by = 'A', all.y = TRUE)
# A B
# 1 1 11
# 2 2 22
# 3 E NA
Since you tagged tidyr, here's a tidyverse solution (same output)
library(tidyverse)
mylistDF <- tibble(A = mylist)
myDF %>%
mutate_at('A', as.character) %>%
right_join(mylistDF, by = 'A')

Spacing vector by regular pattern

I have a vector
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
which generally follows the pattern of alternating between two different sequences of strings (the first sequence being all alphabetical, the second being numerical with the symbol #).
However there are cases where no # term appears: so in the above between mp and jq, and then again after ez. I would like to define a function which "fills the gaps" with the character string #, so that I would have the output:
[1] "ab" "#4" "gw" "#29" "mp" "#" "jq" "#35" "ez" "#"
which I would then convert to a data frame
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
My attempt so far is rather clunky and relies on looping through the vector and filling the gaps. I'd be interested to see more elegant solutions.
My Solution
greplSpace <- function(pattern, replacement, x){
j <- 1
while( j < length(x) ){
if(grepl(pattern, x[j+1]) ){
j <- j+2
} else {
x <- c( x[1:j], replacement, x[(j+1):length(x)] )
j <- j+2
}
}
if( ! grepl(pattern, tail(x,1) ) ){ x <- c(x, replacement) }
return(x)
}
library(magrittr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
vec %>% greplSpace("#", "#", . ) %>%
matrix(ncol = 2, byrow = TRUE) %>%
as.data.frame
Starting with your vec, we can create the expected data frame directly with functions from dplyr, tidyr, and stringr.
library(dplyr)
library(tidyr)
library(stringr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
dat <- tibble(Value = vec)  # tibble() replaces the deprecated data_frame()
dat2 <- dat %>%
mutate(String = !str_detect(vec, "#"),
Key = ifelse(String, "V1", "V2"),
Row = cumsum(String)) %>%
select(-String) %>%
spread(Key, Value, fill = "#") %>%
select(-Row)
dat2
# # A tibble: 5 x 2
# V1 V2
# <chr> <chr>
# 1 ab #4
# 2 gw #29
# 3 mp #
# 4 jq #35
# 5 ez #
Here is a base R option with split. Create a logical index by checking for "#" in each string, negate it, take the cumulative sum, and split the original vector by this grouping variable into a list ('lst'). List elements shorter than the maximum length (two) are padded with NA at the end via length<-. Then rbind the list elements into a two-column matrix. If needed, convert the NAs to #.
lst <- split(vec, cumsum(!grepl("#", vec)))
out <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
out[,2][is.na(out[,2])] <- "#" #not recommended though
out
# [,1] [,2]
#1 "ab" "#4"
#2 "gw" "#29"
#3 "mp" "#"
#4 "jq" "#35"
#5 "ez" "#"
Wrap it with as.data.frame if a data.frame output is needed.
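The grouping and padding steps can be inspected on their own. A small sketch, where grp and padded are illustrative names:

```r
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

# A new group starts at every element that does not contain '#'
grp <- cumsum(!grepl("#", vec))
lst <- split(vec, grp)

# `length<-` extends a short vector with NA up to the requested length
padded <- `length<-`(lst[["3"]], 2)

grp
padded
```

Group 3 holds only "mp", so it is the first element that needs padding before rbind can build a rectangular matrix.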
You can use base R:
First collapse the vector into a string, inserting # where needed.
Then read it with read.csv.
vec1=gsub("([a-z]),\\s*([a-z])|$","\\1,#,\\2",toString(vec))
read.csv(text=gsub("(#.*?),","\\1\n",vec1),h=F)
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
Explanation:
First collapse the vector into a string with toString.
Then, wherever there are letters on both sides of a comma (i.e. [a-z],\s*[a-z]) or the string ends (i.e. |$), insert a #.
Finally, create line breaks after each # term (with or without a number) and read the data in as a table.
You can also do:
a=read.csv(h=F,text=toString(sub("([a-z]+)","\n\\1",vec)),na=c(" ",""))[1:2]
a
V1 V2
1 ab #4
2 gw #29
3 mp <NA>
4 jq #35
5 ez <NA>
data.frame(replace(as.matrix(a),is.na(a),"#"))
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
Another base possibility:
do.call(rbind, tapply(vec, cumsum(!grepl("^#", vec)), FUN = function(x){
if(length(x) == 1) c(x, "#") else x}))
# [,1] [,2]
# 1 "ab" "#4"
# 2 "gw" "#29"
# 3 "mp" "#"
# 4 "jq" "#35"
# 5 "ez" "#"
Explanation:
1. Check whether each element of vec starts with #, and negate the result: !grepl("^#", vec) creates a logical vector.
2. Create a grouping variable by applying cumsum to the logical vector (note: steps 1 and 2 are similar to @akrun's answer).
3. Use tapply to apply a function to each subset of vec defined by the grouping variable. Check whether the length is 1. If so, pad with a trailing #; otherwise return the subset unchanged: if(length(x) == 1) c(x, "#") else x
4. Bind the resulting list together by row: do.call(rbind, ...)
Another one:
# create a row index
ri <- cumsum(!grepl("^#", vec))
# create a column index
ci <- ave(ri, ri, FUN = seq_along)
# create an empty matrix of desired dimensions
m <- matrix(nrow = max(ri), ncol = 2)
# assign 'vec' to matrix at relevant indices
m[cbind(ri, ci)] <- vec
# replace NA with '#'
m[is.na(m)] <- "#"
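The index-based approach can be wrapped in a small function. This is only a sketch: the helper name pair_up and its fill argument are hypothetical, and it assumes the vector starts with a non-# element and never has two # elements in a row (otherwise the column index would exceed 2).

```r
# Hypothetical helper wrapping the row/column index approach
pair_up <- function(vec, fill = "#") {
  ri <- cumsum(!grepl("^#", vec))      # row index: advances at each non-'#' element
  ci <- ave(ri, ri, FUN = seq_along)   # column index: position within the row
  # Pre-fill with the placeholder so gaps need no separate NA replacement
  m <- matrix(fill, nrow = max(ri), ncol = 2)
  m[cbind(ri, ci)] <- vec              # matrix indexing by (row, column) pairs
  as.data.frame(m)
}

res <- pair_up(c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez"))
res
```

Pre-filling the matrix with the placeholder is a small design change from the code above, which starts from NA and replaces it afterwards; both give the same result.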
Using data.table. Create a grouping variable as above, and reshape from long to wide.
library(data.table)
d <- data.table(vec)
d[ , g := cumsum(!grepl("^#", vec))]
dcast(d, g ~ rowid(g), value.var = "vec", fill = "#")
# g 1 2
# 1: 1 ab #4
# 2: 2 gw #29
# 3: 3 mp #
# 4: 4 jq #35
# 5: 5 ez #

Resources