Please note, I'm not a programmer by trade; I'm a literature student, so please bear with me.
I would like to improve my existing working procedure. The split function is certainly one option, though I'm not sure how to apply it here.
Basically, I'm trying to subdivide an existing data frame into a list of sub-samples such that a run of rows sharing the same id is never split across two list elements.
Here is a working example together with sample data:
df <- data.frame(id=c(rep(1,3),rep(2,2),rep(3,3),rep(4,2),5,6,7,8,9,rep(10,5)),r1=rep(1,40),r2=rep(2,40))
x <- transform(df, rec=ave(df$id,df$id, FUN=seq_along))
x$cum <- cumsum(x$rec)
x$dif <- diff(c(0,x$cum),1)
x$lab <- ifelse(x$dif!=1,0,1)
x$seq <- seq_along(x$id)
x$subs <- x$lab*x$seq
seqrow <- seq(1,nrow(x),3) # how many rows approx. per part
rw <- x$subs[x$subs %in% seqrow]
start_rw <- c(1,rw[2:length(rw)])
end_rw <- c(start_rw[2:length(start_rw)]-1,nrow(x))
df.lst <- list()
for(i in 1:length(start_rw)){
  df.lst[[i]] <- x[(start_rw[i]:end_rw[i]), ]
}
Within each list element the ids should also be sorted in increasing order.
Reading through your code, I would summarize your procedure as:
Compute seqrow, the row numbers at which you would be willing to split the list
Split df only at the positions in seqrow where df$id is new (hasn't appeared above); this list of positions is called start_rw in your code.
You can use duplicated to determine if df$id has appeared above or not, which enables you to grab start_rw more easily:
seqrow <- seq(1,nrow(df),3)
(start_rw <- intersect(which(!duplicated(df$id)), seqrow))
# [1] 1 4 13 16
All that remains is to split df at these positions. You can use diff to compute the number of elements in each grouping:
(groups <- rep(seq(start_rw), times=diff(c(start_rw, nrow(df)+1))))
# [1] 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
df.lst2 <- split(df, groups)
This matches the output of your code:
all.equal(unname(df.lst2), lapply(df.lst, function(x) x[,1:3]))
# [1] TRUE
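As an aside (my sketch, not part of the original answer): findInterval builds the same groups vector by mapping each row number onto the start_rw breakpoints:
groups2 <- findInterval(seq_len(nrow(df)), start_rw)
all(groups2 == groups)
# [1] TRUE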
Say I have the data frame below.
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I want to rename columns whose names include "open" to "a" and columns whose names include "close" to "b".
Namely, I want to obtain the data frame below:
a b
1 1 2
2 4 8
3 5 3
I have a lot of such data frames. The prefixes (here "df.") vary, but "open" and "close" are fixed.
Thanks a lot.
We can create a function for reuse
f1 <- function(dat) {
  names(dat)[grep('open$', names(dat))] <- 'a'
  names(dat)[grep('close$', names(dat))] <- 'b'
  dat
}
and apply on the data
df <- f1(df)
-output
df
a b
1 1 2
2 4 8
3 5 3
if these datasets are in a list
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
Thanks to dear @akrun's insightful suggestion, as always, we can do it in one go. We pass character vectors to the pattern and replacement arguments of str_replace so that both renames happen at once. Each argument accepts a character vector of length one or more; in the latter case the lengths of the two vectors must correspond. More to the point, as the documentation says:
References of the form \1, \2, etc will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
  rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3
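The pattern above doesn't actually use backreferences, so here is a quick illustrative sketch (my addition) of the \1 form the documentation mentions, reusing the question's column names:
# "\\1" re-inserts the captured prefix, so "df.open" -> "df_a"
str_replace(c("df.open", "df.close"),
            c("(.*)\\.open", "(.*)\\.close"),
            c("\\1_a", "\\1_b"))
# [1] "df_a" "df_b"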
Another base R option using gsub + match + setNames
setNames(
df,
c("a", "b")[match(
gsub("[^open|close]", "", names(df)),
c("open", "close")
)]
)
gives
a b
1 1 2
2 4 8
3 5 3
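A note on why this works (my addition): [^open|close] is a character class, not an alternation; it deletes every character outside the set {o, p, e, n, |, c, l, s}, which for these particular names leaves exactly "open" and "close". A quick check of the intermediate step:
gsub("[^open|close]", "", names(df))
# [1] "open"  "close"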
I have a data frame of 6449x743, in which a few rows repeat twice with the same column_X and column_Y values but a higher column_Z value on the second occurrence. I want to keep only the row with the higher column_Z.
I tried the following, but it doesn't get rid of the duplicates and still gives me a 6449x743 output.
output <- unique(Data[,c('column_X', 'column_Y', max('column_Z'))])
Ideally, the output should be (6449 - N)x743: the number of rows decreases while the number of columns stays the same, since the column_X/column_Y pairs become unique after filtering on column_Z.
If anyone has suggestions, please let me know. Thanks.
You can use "not duplicated" (!duplicated) with the option fromLast = TRUE on specific columns, like this:
df <- data.frame(a=c(1,1,2,3,4),b=c(2,2,3,4,5),c=1:5)
df <- df[order(df$c),] #make sure the data is sorted.
a b c
1 1 2 1
2 1 2 2
3 2 3 3
4 3 4 4
5 4 5 5
df[!duplicated(df$a,fromLast = TRUE) & !duplicated(df$b,fromLast = TRUE),]
a b c
2 1 2 2
3 2 3 3
4 3 4 4
5 4 5 5
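One caveat worth noting (my addition, not part of the original answer): ANDing two per-column duplicated() calls can also drop rows where a and b are each duplicated somewhere but never together as a pair. When the key is the (a, b) combination, testing the pair jointly is safer:
# duplicated() on a two-column data frame checks the pair row-wise
df[!duplicated(df[c("a", "b")], fromLast = TRUE), ]
# same result on this df; the two approaches differ when a and b repeat independently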
Try
library(dplyr)
Data %>%
  group_by(column_X, column_Y) %>%
  filter(column_Z == max(column_Z))
It works with the sample data
set.seed(13)
df <- data_frame(a=sample(1:4, 50, rep=T),
                 b=sample(1:3, 50, rep=T),
                 x=runif(50), y=rnorm(50))
df %>% group_by(a,b) %>% filter(x==max(x))
Probably the easiest way would be to order the whole thing by column_Z and then remove the duplicates:
output <- Data[order(Data$column_Z, decreasing=TRUE),]
output <- output[!duplicated(paste(output$column_X, output$column_Y)),]
assuming I understood you correctly.
Here's an older answer which may be trying to accomplish the same thing that you are:
How to make a unique in R by column A and keep the row with maximum value in column B
Editing with relevant code:
A solution using package data.table:
set.seed(42)
dat <- data.frame(A=c('a','a','a','b','b'),B=c(1,2,3,5,200),C=rnorm(5))
library(data.table)
dat <- as.data.table(dat)
dat[,.SD[which.max(B)],by=A]
A B C
1: a 3 0.3631284
2: b 200 0.4042683
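As a side note, a sketch of the .I idiom, which selects row indices instead of materialising .SD per group and is often faster on wide tables:
# .I holds the global row numbers; pick the max-B row per group, then subset
dat[dat[, .I[which.max(B)], by = A]$V1]
#    A   B         C
# 1: a   3 0.3631284
# 2: b 200 0.4042683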
I have a list whose elements are themselves lists of data frames.
The outer list elements represent years and the inner lists represent monthly data.
Now I want to create a final list with one element per month, where each month's columns are cbinded with the corresponding columns from the other years.
Alldata <- list()
Alldata[[1]] <- list(data.frame(Jan_2015_A=c(1,2), Jan_2015_B=c(3,4)), data.frame(Feb_2015_C=c(5,6), Feb_2015_D=c(7,8)))
Alldata[[2]] <- list(data.frame(Jan_2016_A=c(1,2), Jan_2016_B=c(3,4)), data.frame(Feb_2016_C=c(5,6), Feb_2016_D=c(7,8)))
The expected output list is shown below as finalList.
I have managed to do this with nested for loops, but the code is complex and I find it hard to follow myself. I'm hoping for a simpler, tidier approach, ideally built from existing R functions.
First I created a list in which each item collects, as data frames, one month's data across all years:
x2 <- list()
for(l1 in 1:length(Alldata[[1]])){
  temp <- list()
  for(l2 in 1:length(Alldata)){
    temp <- append(temp, list(Alldata[[l2]][[l1]]))
  }
  x2 <- append(x2, list(temp))
}
# Then create the final list, with each month's successive years of data as one list item. This is primarily used for tracking data across years, e.g. what the count was for "A" in Jan_2015 vs Jan_2016.
finalList <- list()
for(l3 in 1:length(x2)){
  temp <- x2[[l3]]
  td2 <- as.data.frame(matrix("", nrow = nrow(temp[[1]])))
  rownames(td2)[rownames(temp[[1]]) != ""] <- rownames(temp[[1]])[rownames(temp[[1]]) != ""]
  for(l4 in 1:ncol(temp[[1]])){
    for(l5 in 1:length(temp)){
      td2 <- cbind(td2, temp[[l5]][, l4, drop = FALSE])
    }
  }
  finalList <- append(finalList, list(td2))
}
> finalList
[[1]]
V1 Jan_2015_A Jan_2016_A Jan_2015_B Jan_2016_B
1 1 1 3 3
2 2 2 4 4
[[2]]
V1 Feb_2015_C Feb_2016_C Feb_2015_D Feb_2016_D
1 5 5 7 7
2 6 6 8 8
You could do the following. The lapply iterates over the outer list and do.call cbinds each inner list of data frames.
lapply(Alldata, do.call, what = 'cbind')
[[1]]
Jan_2015_A Jan_2015_B Feb_2015_C Feb_2015_D
1 1 3 5 7
2 2 4 6 8
[[2]]
Jan_2016_A Jan_2016_B Feb_2016_C Feb_2016_D
1 1 3 5 7
2 2 4 6 8
You can also use dplyr to get the same results.
library(dplyr)
lapply(Alldata, bind_cols)
Here is a third option proposed by J.R.
lapply(Alldata, Reduce, f = cbind)
EDIT
After clarification from OP, the above solution has been modified (see below) to produce the newly specified output. The solution above has been left there since it is a building block for the solution below.
pattern.vec <- c("Jan", "Feb")
### For a given vector of months/patterns, returns a
### list of elements with only that month.
mon_data <- function(mo) {
  # grep() coerces the inner list to character (deparsing each data frame),
  # so matching on the month abbreviation picks that month's data frame
  # out of each year's list
  return(bind_cols(sapply(Alldata, function(x) { x[grep(pattern = mo, x)] })))
}
### Loop through months/patterns.
finalList <- lapply(pattern.vec, mon_data)
finalList
## [[1]]
## Jan_2015_A Jan_2015_B Jan_2016_A Jan_2016_B
## 1 1 3 1 3
## 2 2 4 2 4
##
## [[2]]
## Feb_2015_C Feb_2015_D Feb_2016_C Feb_2016_D
## 1 5 7 5 7
## 2 6 8 6 8
## Ordering the columns as specified in the original question.
## sorting is by the last character in the column name (A or B)
## and then the year.
lapply(finalList, function(x) x[ order(gsub('[^_]+_([^_]+)_(.*)', '\\2_\\1', colnames(x))) ])
## [[1]]
## Jan_2015_A Jan_2016_A Jan_2015_B Jan_2016_B
## 1 1 1 3 3
## 2 2 2 4 4
##
## [[2]]
## Feb_2015_C Feb_2016_C Feb_2015_D Feb_2016_D
## 1 5 5 7 7
## 2 6 6 8 8
I've got one I can't resolve.
Example dataset:
company <- c("compA","compB","compC")
compA <- c(1,2,3)
compB <- c(2,3,1)
compC <- c(3,1,2)
df <- data.frame(company,compA,compB,compC)
I want to create a new column holding, for each row, the value from the column whose name appears in that row's "company" column. The resulting extraction would be:
df$new <- c(1,3,2)
df
The way you have it set up, there's one row and one column for every company, and the rows and columns are in the same order. If that's your real dataset, then as others have said diag(...) is the solution (and you should select that answer).
If your real dataset has more than one instance of company (e.g., more than one row per company), then this is more general:
# using your df
sapply(1:nrow(df),function(i)df[i,as.character(df$company[i])])
# [1] 1 3 2
# more complex case
set.seed(1) # for reproducible example
newdf <- data.frame(company=LETTERS[sample(1:3,10,replace=T)],
                    A=sample(1:3,10,replace=T),
                    B=sample(1:5,10,replace=T),
                    C=1:10)
head(newdf)
# company A B C
# 1 A 1 5 1
# 2 B 1 2 2
# 3 B 3 4 3
# 4 C 2 1 4
# 5 A 3 2 5
# 6 C 2 2 6
sapply(1:nrow(newdf),function(i)newdf[i,as.character(newdf$company[i])])
# [1] 1 2 4 4 3 6 7 2 5 3
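A vectorized alternative (my sketch, same newdf): index the numeric columns as a matrix with one (row, column) pair per row. Matrix-indexing the data frame itself would coerce everything to character because of the company column, hence the as.matrix on the numeric columns only.
m <- as.matrix(newdf[-1])  # numeric columns only
newdf$new <- m[cbind(seq_len(nrow(newdf)),
                     match(as.character(newdf$company), colnames(m)))]
head(newdf$new)
# [1] 1 2 4 4 3 6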
EDIT: eddi's answer is probably better. It is more likely that you would have the dataframe to work with rather than the individual row vectors.
I am not sure I understand your question; it is a little unclear from your description. But it seems you are asking for the diagonal of the data values, since that is where the name in the "company" column matches the column name on the same line. The following will do this:
df$new <- diag(matrix(c(compA,compB,compC), nrow = 3, ncol = 3))
The diag function will return the diagonal of the matrix for you. So I first concatenated the three original vectors into one vector and then specified it to be wrapped into a matrix of three rows and three columns. Then I took the diagonal. The whole thing is then added to the dataframe.
Did that answer your question?
Take this simple data frame of linked ids:
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
> test
id1 id2
1 10 1
2 10 36
3 1 24
4 1 45
5 24 300
6 8 11
I now want to group together all the ids which link.
By 'link', I mean follow through the chain of links so that all ids in one group
are labelled together: a kind of branching structure, i.e.:
Group 1
10 --> 1, 1 --> (24,45)
24 --> 300
300 --> NULL
45 --> NULL
10 --> 36, 36 --> NULL,
Final group members: 10,1,24,36,45,300
Group 2
8 --> 11
11 --> NULL
Final group members: 8,11
Now I roughly know the logic I would want, but don't know how I would implement it elegantly. I am thinking of a recursive use of match or %in% to go down each branch, but am truly stumped this time.
The final result I would be chasing is:
result <- data.frame(group=c(1,1,1,1,1,1,2,2),id=c(10,1,24,36,45,300,8,11))
> result
group id
1 1 10
2 1 1
3 1 24
4 1 36
5 1 45
6 1 300
7 2 8
8 2 11
The Bioconductor package RBGL (an R interface to the BOOST graph library) contains
a function, connectedComp(), which identifies the connected components in a graph --
just what you are wanting.
(To use the function, you will first need to install the graph and RBGL packages, both available from Bioconductor.)
library(RBGL)
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
## Convert your 'from-to' data to a 'node and edge-list' representation
## used by the 'graph' & 'RBGL' packages
g <- ftM2graphNEL(as.matrix(test))
## Extract the connected components
cc <- connectedComp(g)
## Massage results into the format you're after
ld <- lapply(seq_along(cc),
             function(i) data.frame(group = names(cc)[i], id = cc[[i]]))
do.call(rbind, ld)
# group id
# 1 1 10
# 2 1 1
# 3 1 24
# 4 1 36
# 5 1 45
# 6 1 300
# 7 2 8
# 8 2 11
Here's an alternative answer that I have discovered myself after the nudging in the right direction by Josh. This answer uses the igraph package.
For those that are searching and come across this answer, my test dataset is referred to as an "edge list" or "adjacency list" in graph theory (http://en.wikipedia.org/wiki/Graph_theory)
library(igraph)
test <- data.frame(id1=c(10,10,1,1,24,8 ),id2=c(1,36,24,45,300,11))
gr.test <- graph_from_data_frame(test)
links <- data.frame(id=unique(unlist(test)),group=components(gr.test)$membership)
links[order(links$group),]
# id group
#1 10 1
#2 1 1
#3 24 1
#5 36 1
#6 45 1
#7 300 1
#4 8 2
#8 11 2
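A slightly more defensive variant (my sketch): the id order from unique(unlist(test)) happens to match igraph's vertex order here, but taking the ids straight from the names of the membership vector avoids relying on that:
# membership is named by vertex, so ids and groups stay paired by construction
memb <- components(gr.test)$membership
links2 <- data.frame(id = as.numeric(names(memb)), group = memb)
links2[order(links2$group), ]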
Without using packages:
# 2 sets of test data
mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
grouppairs <- function(df){
  # from wide to long format; assumes df is 2 columns of related id's
  test <- data.frame(group = 1:nrow(df), val = unlist(df))
  # keep moving to the next pair until all equal values share a group
  i <- 0
  while(any(duplicated(unique(test)$val))){
    i <- i + 1
    # get groups of matching values
    matches <- test[test$val == test$val[i], 'group']
    # change all groups with matching values to the same group
    test[test$group %in% matches, 'group'] <- test$group[i]
  }
  # renumber starting from 1 and show only unique values in group order
  test$group <- match(test$group, sort(unique(test$group)))
  unique(test)[order(unique(test)$group), ]
}
# test
grouppairs(test)
grouppairs(mytest)
You said recursive... and I thought I'd be super terse while I'm at it.
Test data
mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
Recursive function to get the groupings
aveminrec <- function(v1, v2){
  v2 <- ave(v1, by = v2, FUN = min)
  if(identical(v1, v2)){
    as.numeric(as.factor(v2))
  } else {
    aveminrec(v2, v1)
  }
}
Prep data and simplify after
groupvalues <- function(valuepairs){
  val <- unlist(valuepairs)
  grp <- aveminrec(val, 1:nrow(valuepairs))
  unique(data.frame(grp, val)[order(grp, val), ])
}
Get results
groupvalues(test)
groupvalues(mytest)
aveminrec() is probably along the lines of what you were thinking, though I bet there's a way to be more direct about going down each branch instead of repeatedly calling ave(), which is essentially split() plus lapply(). Maybe recursively split and lapply? As it stands, it's like repeated partial branching, or alternatively, simplifying the two vectors a little at a time without losing group information.
Parts of this might be usable on a real problem, but groupvalues() is too dense to read without at least some comments. I also haven't checked how its performance compares to a for loop that uses ave and flips the groups that way.
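For what it's worth, here is a sketch of the more direct %in%-based branch walk the question hinted at (group_links is a hypothetical name, my addition): each group grows by repeatedly pulling in rows that share an id with a current member until no new ids appear, a fixed-point loop rather than true recursion.
group_links <- function(df) {
  ids <- unique(unlist(df))
  group <- rep(NA_integer_, length(ids))
  g <- 0L
  for (i in seq_along(ids)) {
    if (!is.na(group[i])) next   # already assigned to a group
    g <- g + 1L
    members <- ids[i]
    repeat {
      # rows touching any current member
      hit <- df$id1 %in% members | df$id2 %in% members
      new_members <- unique(c(members, df$id1[hit], df$id2[hit]))
      if (length(new_members) == length(members)) break  # no growth: done
      members <- new_members
    }
    group[ids %in% members] <- g
  }
  data.frame(group = group, id = ids)
}
group_links(test)
#   group  id
# 1     1  10
# 2     1   1
# 3     1  24
# 4     2   8
# 5     1  36
# 6     1  45
# 7     1 300
# 8     2  11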