How to split a data frame in the order I want? - r

I have a data frame df like this
df
x y id
10 5 2
12 10 2
15 0 1
I want to split by the id. I used split(df, df$id) and I get
x y id
15 0 1
and
x y id
10 5 2
12 10 2
But I want the one with id = 2 to come before the one with id = 1.
So basically I want the output to be
x y id
10 5 2
12 10 2
and
x y id
15 0 1
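For reference, a reproducible version of the example data, built from the values shown above:
df <- data.frame(x = c(10, 12, 15), y = c(5, 10, 0), id = c(2, 2, 1))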

According to the documentation of split(), "The components of the list are named by the levels of f (after converting to a factor ...)", where f is the second parameter to split(). So the chunks appear in the order of the factor levels after splitting.
The OP has requested that the chunks should be returned in the same order as they appear in df. This can be achieved conveniently with the fct_inorder() function of Hadley's forcats package:
split(df, forcats::fct_inorder(factor(df$id)))
#$`2`
# x y id
#1 10 5 2
#2 12 10 2
#
#$`1`
# x y id
#3 15 0 1
Note that:
- id itself remains unchanged; fct_inorder() is only used for defining the split.
- the additional call to factor() is only required because id is of type integer.
Edit: This can also be achieved without any packages:
split(df, factor(df$id, levels = unique(df$id)))

Just switch the order of the elements in the list.
Sdf = split(df, df$id)
Sdf = Sdf[c(2,1)]
$`2`
x y id
1 10 5 2
2 12 10 2
$`1`
x y id
3 15 0 1
You could also use rev() to reverse the list:
Sdf = rev(Sdf)
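If there were more than two groups, the same idea generalizes: index the split list by the ids in their order of appearance (a sketch using the same df):
Sdf = split(df, df$id)
Sdf = Sdf[as.character(unique(df$id))]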

Related

How to split a dataframe into multiple at the same time based on the value of one column

crop.genos <- data.frame(crop=rep(1:6, each=4),genos=rep(1:4, 6))
crop.genos$crop.genotype <- paste(crop.genos$crop, crop.genos$genos, sep="")
Here I have a data frame with three columns: crop, genos, crop.genotype. I want to get six different data frames based on the crop category (like the example below), with all the other columns retained:
crop genos crop.genotype
1 1 1 11
2 1 2 12
3 1 3 13
Use split:
l <- split(crop.genos, crop.genos$crop)
names(l) <- paste0('df', names(l))
list2env(l, envir = .GlobalEnv)
output
> df1
# crop genos crop.genotype
#1 1 1 11
#2 1 2 12
#3 1 3 13
#4 1 4 14
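Note that list2env() creates df1 ... df6 as separate objects in the global environment. Keeping the list avoids that, and each frame can still be reached by name:
l$df1
l[["df3"]]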

mutate string into numeric, ignore alphabetical order of factor

I am trying to recode factor levels into numbers using the mutate function, but I want to ignore the alphabetical order in which the factor levels appear. The same level can occur multiple times, and I want each occurrence to be assigned, in the new column, the number of the row in which that level first appeared in the data frame.
Example:
library(stringi)
library(dplyr)
set.seed(234)
data <- stri_rand_strings(20, 1)
data <- as.data.frame(data)
data2 <- data %>% mutate(num = as.numeric(factor(data)))
data2
Expected outcome:
dat<-data2[,-2]
order<-c(1,2,3,2,4,5)
expected_result<-cbind.data.frame(head(dat), order)
expected_result
I think you can just create a new factor and set the levels as unique values of data2$data in your example:
new_fac <- factor(data2$data, levels = unique(data2$data))
The numeric values can be obtained:
new_order <- as.numeric(new_fac)
And this is what your final result would look like:
head(data.frame(new_fac, new_order))
new_fac new_order
1 k 1
2 m 2
3 1 3
4 m 2
5 4 4
6 d 5
Or in your example with dplyr, you can do:
data %>%
  mutate(num = as.numeric(factor(data, levels = unique(data))))
You could accomplish this with a helper table that contains the row number of the first time a string appears in your table. I.e.
library(stringi)
library(tidyverse)
# generate data
data<-stri_rand_strings(20,1)
data<-as.data.frame(data)
Create helper table:
factorlevels <- data %>% unique() %>% mutate(order = row_number())
... and inner join to data
data %>% inner_join(factorlevels)
Output:
> data %>% inner_join(factorlevels)
Joining, by = "data"
data order
1 k 1
2 m 2
3 1 3
4 m 2
5 4 4
6 d 5
7 v 6
8 i 7
9 v 6
10 H 8
11 Y 9
12 X 10
13 a 11
14 a 11
15 0 12
16 R 13
17 J 14
18 j 15
19 8 16
20 s 17
I am sure that there is a one-liner approach to this, but I could not figure it out right away.
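For what it's worth, one possible one-liner (a sketch using match() instead of building a factor, assuming dplyr is loaded as above):
data %>% mutate(order = match(data, unique(data)))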

understanding apply and outer function in R

Suppose i have a data which looks like this
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values so that if an ID has changed its value of the A variable over the period of the B variable (which runs from 1 to 4), its rows go into data frame K, and if it hasn't, they go into data frame L.
So in this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of nested loops and if/else statements it could be approached like the following:
for (i in 1:length(ID)) {
  m = 0
  for (j in 1:length(B)) {
    ifelse(A[j] == A[j+1], m, m = m + 1)
  }
  ifelse(m = 0, L = c[, df[i]], K = c[, df[i]])
}
I have read in some posts that in R nested loops can be replaced by the apply and outer functions. Could someone help me understand how they can be used in such circumstances?
So basically you don't need a loop with conditions here. All you need to do is check, within each ID, whether A varies at all (converting the result to a logical with !) after converting A to a numeric value (I'm assuming it's a factor in your real data set; if it's not a factor, you can use FUN = function(x) length(unique(x)) within ave instead), and then split accordingly. With base R we can use ave for such a task, for example
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
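If you want named K and L data frames rather than the anonymous list, you can unpack the split result (a sketch; the FALSE element holds the IDs whose A changed):
res <- split(df, indx)
K <- res[["FALSE"]]  # A changed within the ID
L <- res[["TRUE"]]   # A stayed constant within the ID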
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(as.numeric(A)), by = ID]
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(indx = !var(as.numeric(A))) %>%
  split(.$indx)
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In a bit plainer English: go through df$A grouped by df$ID, and apply the function to df$A within each grouping (i.e. the x in the embedded function): if the number of unique values is more than 1, it's "K", otherwise it's "L".
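The tapply() call only classifies each ID; to actually split the rows you can map the classification back onto df$ID (a sketch building on the line above):
grp <- tapply(df$A, df$ID, function(x) ifelse(length(unique(x)) > 1, "K", "L"))
lst <- split(df, grp[as.character(df$ID)])
K <- lst$K
L <- lst$L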
We can do this using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1 and create a column 'ind' based on that. We can then split the dataset based on that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1', i.e. the 'ID' and 'A'. By double negating (!!) the output, the zero counts become FALSE and all non-zero counts become TRUE. Taking rowSums and checking whether the result is greater than 1 gives 'indx'. We match the 'ID' column in 'df1' with the names of 'indx', use that to look up the TRUE/FALSE values, and split the dataset with that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
If we need the individual datasets in the global environment, change the names of the list elements to the object names we want and use list2env (not recommended, though):
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)

R Split data.frame using a column that represents and on/off switch

I have data that looks like the following:
a <- data.frame(cbind(x = seq(50),
                      y = rnorm(50),
                      z = c(rep(0, 5),
                            rep(1, 8),
                            rep(0, 3),
                            rep(1, 2),
                            rep(0, 12),
                            rep(1, 12),
                            rep(0, 8))))
I would like to split the data.frame a on the column z, but have each group be a separate data.frame that is a member of a list; i.e. in my example the first 5 rows would be the first item in the list, the next 8 rows would be the next item, the next 3 rows the item after that, etc.
Simple factors combine all the 1s together and all the 0s together.
I'm sure that there is a simple way to do this, but it has eluded me for the moment.
Thanks
Try the rleid function in data.table v > 1.9.5
library(data.table)
split(a, rleid(a$z))
# $`1`
# x y z
# 1 1 -0.03737561 0
# 2 2 -0.48663043 0
# 3 3 -0.98518106 0
# 4 4 0.09014355 0
# 5 5 -0.07703517 0
#
# $`2`
# x y z
# 6 6 0.3884339 1
# 7 7 1.5962833 1
# 8 8 -1.3750668 1
# 9 9 0.7987056 1
# 10 10 0.3483114 1
# 11 11 -0.1777759 1
# 12 12 1.1239553 1
# 13 13 0.4841117 1
....
Or, also with cumsum:
split(a, c(0, cumsum(diff(a$z) != 0)))
Here are some base R options.
Using rle, a variant of the rleid function suggested in the comments by @Spacedman:
split(a, inverse.rle(within.list(rle(a$z), values <- seq_along(values))))
By using cumsum after creating a logical index based on whether the adjacent elements are equal or not
split(a, cumsum(c(TRUE, a$z[-1]!=a$z[-nrow(a)])))
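To see why the cumsum() expression yields run identifiers, here is a small illustration with a made-up z:
z <- c(0, 0, 1, 1, 0)
cumsum(c(TRUE, z[-1] != z[-length(z)]))
# [1] 1 1 2 2 3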

Variable Length Core Name Identification

I have a data set with the following row-naming scheme:
a.X.V
where:
a is a fixed-length core ID
X is a variable-length string that subsets a, which means I should keep X
V is a variable-length ID which specifies the individual elements of a.X to be averaged
. is one of {-,_}
What I am trying to do is take column averages of all the a.X's. A sample:
sampleList <- list("a.12.1"=c(1,2,3,4,5), "b.1.23"=c(3,4,1,4,5), "a.12.21"=c(5,7,2,8,9), "b.1.555"=c(6,8,9,0,6))
sampleList
$a.12.1
[1] 1 2 3 4 5
$b.1.23
[1] 3 4 1 4 5
$a.12.21
[1] 5 7 2 8 9
$b.1.555
[1] 6 8 9 0 6
Currently I am manually gsubbing out the .Vs to get a list of the general a.X names:
sampleList <- t(as.data.frame(sampleList))
y <- rownames(sampleList)
y <- gsub("(\\w\\.\\d+)\\.\\d+", "\\1", y)
Is there a faster way to do this?
This is one half of 2 issues I've encountered in a workflow. The other half was answered here.
You can use a vector of patterns to find the locations of the columns you want to group. I included a pattern I knew wouldn't match anything in order to show that the solution is robust to that situation.
# A *named* vector of patterns you want to group by
patterns <- c(a.12="^a.12",b.12="^b.12",c.12="^c.12")
# Find the locations of those patterns in your list
inds <- lapply(patterns, grep, x=names(sampleList))
# Calculate the mean of each list element that matches the pattern
out <- lapply(inds, function(i)
    if (l <- length(i)) Reduce("+", sampleList[i]) / l else NULL)
# Set the names of the output
names(out) <- names(patterns)
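The result is a named list, so each grouped mean can be inspected directly; entries whose pattern matched nothing are NULL:
out$a.12  # element-wise mean of sampleList$a.12.1 and sampleList$a.12.21
out$c.12  # NULL, nothing matched "^c.12"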
Perhaps you could consider messing with your data structure to make it easier to apply some standard tools:
sampleList <- list("a.12.1"=c(1,2,3,4,5),
"b.1.23"=c(3,4,1,4,5), "a.12.21"=c(5,7,2,8,9),
"b.1.555"=c(6,8,9,0,6))
library(reshape2)
m1 <- melt(do.call(cbind,sampleList))
m2 <- cbind(m1,colsplit(m1$Var2,"\\.",c("coreID","val1","val2")))
The results looks like this:
head(m2)
Var1 Var2 value coreID val1 val2
1 1 a.12.1 1 a 12 1
2 2 a.12.1 2 a 12 1
3 3 a.12.1 3 a 12 1
Then you can more easily do something like this:
aggregate(value ~ val1, mean, data = subset(m2, coreID == "a"))
R is poised to do this stuff if you would just move to data.frames instead of lists. Make your 'a', 'X', and 'V' into their own columns. Then you can use ave, by, aggregate, subset, etc.
data.frame(do.call(rbind, sampleList),
           do.call(rbind, strsplit(names(sampleList), '\\.')))
# X1 X2 X3 X4 X5 X1.1 X2.1 X3.1
# a.12.1 1 2 3 4 5 a 12 1
# b.1.23 3 4 1 4 5 b 1 23
# a.12.21 5 7 2 8 9 a 12 21
# b.1.555 6 8 9 0 6 b 1 555
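Building on that combined data frame, the per-a.X column averages asked for in the question could then be computed with aggregate() on the new grouping columns (a sketch; the X1.1/X2.1 names are the defaults produced by data.frame() above):
wide <- data.frame(do.call(rbind, sampleList),
                   do.call(rbind, strsplit(names(sampleList), '\\.')))
aggregate(cbind(X1, X2, X3, X4, X5) ~ X1.1 + X2.1, data = wide, FUN = mean)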
