R: cumsum-like function for splitting a data frame

Given the following dataframe:
mydf <- data.frame(x=c(1:10,10:1),y=c(10:1,1:10))
How is it possible to split it such that each sub-dataframe will have consecutive values of one column which are greater than the other column?
For example, with mydf the outcome I am hoping for is splitting it into three data frames:
(y > x; should contain the first 5 rows of mydf)
(x > y; should contain rows 6 to 15 of mydf)
(y > x again; should contain the last 5 rows of mydf)
I tried using the following code but it produced bad results where each y > x would be split individually; moreover, dataframes where x > y would contain a y > x in the first row:
split(mydf, cumsum(mydf$x > mydf$y))
Another, less elegant approach I tried was sapply with individual if statements inside the split function, but I don't want to go down this path because of performance concerns.

Try
rl <- with(mydf, rle(x >y))
grp <- inverse.rle(within.list(rl , values <- seq_along(values)))
split(mydf, grp)
#$`1`
# x y
#1 1 10
#2 2 9
#3 3 8
#4 4 7
#5 5 6
#$`2`
# x y
#6 6 5
#7 7 4
#8 8 3
#9 9 2
#10 10 1
#11 10 1
#12 9 2
#13 8 3
#14 7 4
#15 6 5
#$`3`
# x y
#16 5 6
#17 4 7
#18 3 8
#19 2 9
#20 1 10
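To see how the rle-based grouping above produces one id per run, it can help to trace it on a tiny vector. This is only an illustrative sketch; the vector flag and the name grp_small are made up for the example.

```r
# A small logical vector with three runs: FALSE x2, TRUE x3, FALSE x1
flag <- c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)

rl <- rle(flag)
rl$lengths   # run lengths: 2 3 1
rl$values    # run values: FALSE TRUE FALSE

# Replace each run's value by its position (1, 2, 3, ...)
rl$values <- seq_along(rl$values)

# Expand the runs back to full length: one group id per element
grp_small <- inverse.rle(rl)
grp_small    # 1 1 2 2 2 3
```

Because the ids come from run positions rather than run values, the two FALSE runs get different ids, which is what keeps the first and last blocks of mydf apart.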
Or
group <- with(mydf, cumsum(c(1,abs(diff(x >y)))))
split(mydf, group)
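The difference between this and the cumsum attempt in the question can be checked step by step. The sketch below uses the question's mydf; the intermediate names flag, naive and change are only for illustration.

```r
mydf <- data.frame(x = c(1:10, 10:1), y = c(10:1, 1:10))
flag <- mydf$x > mydf$y

# The attempt from the question: the counter grows on every TRUE,
# so each row with x > y ends up in a group of its own
naive <- cumsum(flag)
length(unique(naive))   # 11 distinct group ids

# diff() on a logical vector is 0 inside a run and +1 or -1 where the
# flag flips; abs() turns every flip into 1
change <- abs(diff(flag))

# Start at 1 and add 1 at every flip: one id per run
group <- cumsum(c(1, change))
table(group)            # group sizes 5, 10, 5
```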
Or you can use rleid from the devel version of data.table, i.e. v1.9.5 (suggested in @David Arenburg's comments). Instructions to install it are on the data.table project page.
library(data.table)
split(mydf, rleid(with(mydf, y > x)))

Related

Adding new columns to dataframe with suffix

I want to subtract one column from another and create a new one using the corresponding suffix in the first column. I have approx 50 columns
I can do it "manually" as follows...
df$new1 <- df$col_a1 - df$col_b1
df$new2 <- df$col_a2 - df$col_b2
What is the easiest way to create a loop that does the job for me?
We can use grep to identify the columns that have "a" and "b" in their names and subtract them directly.
a_cols <- grep("col_a", names(df))
b_cols <- grep("col_b", names(df))
df[paste0("new", seq_along(a_cols))] <- df[a_cols] - df[b_cols]
df
# col_a1 col_a2 col_b1 col_b2 new1 new2
#1 10 15 1 5 9 10
#2 9 14 2 6 7 8
#3 8 13 3 7 5 6
#4 7 12 4 8 3 4
#5 6 11 5 9 1 2
#6 5 10 6 10 -1 0
Tested on this data:
df <- data.frame(col_a1 = 10:5, col_a2 = 15:10, col_b1 = 1:6, col_b2 = 5:10)
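The grep approach pairs the columns by position, so it relies on the "a" and "b" columns appearing in the same order. A possible variant that pairs them by suffix instead is sketched below; it assumes every col_a<suffix> column has a matching col_b<suffix> column, and the names a_names, suffix and b_names are made up for the example.

```r
df <- data.frame(col_a1 = 10:5, col_a2 = 15:10, col_b1 = 1:6, col_b2 = 5:10)

# Names of the "a" columns and the suffix that follows "col_a"
a_names <- grep("^col_a", names(df), value = TRUE)
suffix  <- sub("^col_a", "", a_names)

# Build the matching "b" names from the same suffixes
# (assumption: each of these columns exists in df)
b_names <- paste0("col_b", suffix)

# Subtract pairwise and store the result under new<suffix>
df[paste0("new", suffix)] <- df[a_names] - df[b_names]
names(df)
```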

How do I add observations to an existing data frame column?

I have a data frame; let's say it looks like the data frame x defined below.
I have simulated some values and put them into a vector c(4,5,8,8). I want to add these simulated values to columns a, b and c.
I have tried rbind or inserting the vector into the existing data frame, but that replaced the existing values with the simulated ones, instead of adding the simulated values below the existing ones.
x <- data.frame("a" = c(2,3,1), "b" = c(5,1,2), "c" = c(6,4,7))
y <- c(4,5,8,8)
In the output I expect, the simulated values appear below the existing rows of every column.
Help would be greatly appreciated. Thank you.
You can do:
as.data.frame(sapply(x, function(z) append(z, y)))
a b c
1 2 5 6
2 3 1 4
3 1 2 7
4 4 4 4
5 5 5 5
6 8 8 8
7 8 8 8
An option is assignment
n <- nrow(x)
x[n + seq_along(y), ] <- y
x
# a b c
#1 2 5 6
#2 3 1 4
#3 1 2 7
#4 4 4 4
#5 5 5 5
#6 8 8 8
#7 8 8 8
Another option is to replicate 'y' once per column and rbind the result:
rbind(x, `colnames<-`(replicate(ncol(x), y), names(x)))
Or assign by explicit row index (run this on the original 3-row x):
x[(nrow(x)+1):(nrow(x)+length(y)),] <- y
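The same idea can also be written by first building the new rows as a data frame with the same column names and then calling rbind. This is a minimal sketch using the x and y from the question; new_rows and result are hypothetical names.

```r
x <- data.frame(a = c(2, 3, 1), b = c(5, 1, 2), c = c(6, 4, 7))
y <- c(4, 5, 8, 8)

# One copy of y per column of x, named like the columns of x
new_rows <- as.data.frame(setNames(rep(list(y), ncol(x)), names(x)))

# Append the new rows below the existing ones; x itself is not modified
result <- rbind(x, new_rows)
result
```

Keeping x untouched and storing the combined data in a new object makes it easier to rerun the step without appending the simulated values twice.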

deleting first row based on column variable

How do I delete the first row of each new variable? For example, here is some data:
m <- c("a","a","a","a","a","b","b","b","b","b")
n <- c('x','y','x','y','x','y',"x","y",'x',"y")
o <- c(1:10)
z <- data.frame(m,n,o)
I want to delete the first entry for a and b in column m. I have a very large data frame so I want to do this based on the change from a to b and so on.
Here is what I want the data frame to look like.
m n o
1 a y 2
2 a x 3
3 a y 4
4 a x 5
5 b x 7
6 b y 8
7 b x 9
8 b y 10
Thanks.
Just use duplicated:
z[duplicated(z$m),]
# m n o
#2 a y 2
#3 a x 3
#4 a y 4
#5 a x 5
#7 b x 7
#8 b y 8
#9 b x 9
#10 b y 10
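Why does this work? duplicated returns FALSE for the first occurrence of a value and TRUE for every later occurrence. Consider: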
duplicated("a")
#[1] FALSE
duplicated(c("a","a"))
#[1] FALSE TRUE
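One caveat worth checking: duplicated marks the first occurrence of each value over the whole vector, not the first row of each run. If a value can come back later (for example a, b, then a again), a comparison with the previous element may be closer to "based on the change from a to b". The sketch below uses a made-up vector m2 to show the difference.

```r
# Hypothetical column in which "a" appears in two separate runs
m2 <- c("a", "a", "b", "b", "a", "a")

# TRUE for the first element and wherever the value differs from the previous one
first_of_run <- c(TRUE, m2[-1] != m2[-length(m2)])
first_of_run     # TRUE FALSE TRUE FALSE TRUE FALSE

# duplicated() only treats the very first "a" and the first "b" as new
duplicated(m2)   # FALSE TRUE FALSE TRUE TRUE TRUE

# Keep every row that is not the start of a run
keep <- !first_of_run
```

For the data in the question the two approaches agree, because each value of m forms a single block.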
data.table is often preferred for large datasets in R. setDT converts the data frame z to a data.table by reference. Then group by m and drop the first row of each group.
library('data.table')
setDT(z)[, .SD[-1], by = "m"]
Using group_by and row_number from the dplyr package:
library(dplyr)
z %>%
  group_by(m) %>%
  filter(row_number() != 1)  # row_number() without arguments is the row position within each group

understanding apply and outer function in R

Suppose I have data that looks like this:
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other: if an ID has changed its value of variable A over the period given by variable B (which runs from 1 to 4), its rows go into data frame K, and if it hasn't, they go into data frame L.
So in this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
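In terms of nested loops and if/else statements, it could be solved roughly as follows:
K <- L <- df[0, ]                      # empty data frames with the same columns as df
for (id in unique(df$ID)) {
  sub <- df[df$ID == id, ]             # rows of one ID (assumed to be ordered by B)
  m <- 0                               # number of changes in A
  for (j in seq_len(nrow(sub) - 1)) {
    if (sub$A[j] != sub$A[j + 1]) m <- m + 1
  }
  if (m == 0) L <- rbind(L, sub) else K <- rbind(K, sub)
}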
I have read in some posts that nested loops in R can be replaced by the apply and outer functions. Can someone help me understand how they can be used in a case like this?
You don't need a loop with conditions here. All you need is to check, for each ID, whether A has any variance over the cycle of B, negate the result with ! to get a logical, and then split accordingly. This requires converting A to numeric values (I'm assuming it's a factor in your real data set; if it's not a factor, you can use FUN = function(x) length(unique(x)) within ave instead). With base R we can use ave for this task, for example:
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather than a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(as.numeric(A)), by = ID]  # A assumed to be a factor, as above; var() needs numeric input
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(indx = !var(as.numeric(A))) %>%  # A assumed to be a factor, as above
ungroup() %>%
split(., .$indx)
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In plainer English: go through df$A grouped by df$ID, and apply the function to the values of df$A within each group (i.e. the x in the embedded function). If the number of unique values is more than 1, the result is "K"; otherwise it is "L".
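The per-ID labels from tapply can then be expanded back to one label per row and passed to split. The following is a sketch under the assumption that df holds the table from the question (it is rebuilt here so the block runs on its own); label and parts are illustrative names.

```r
df <- data.frame(
  ID = rep(1:3, each = 4),
  A  = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"),
  B  = rep(1:4, times = 3),
  C  = c(10, 10, 15, 12, 15, 13, 13, 13, 16, 18, 19, 10)
)

# One label per ID: "K" if A takes more than one value, otherwise "L"
label <- tapply(df$A, df$ID, function(x) if (length(unique(x)) > 1) "K" else "L")

# Look up each row's label by its ID (tapply names the result by ID)
row_label <- as.vector(label[as.character(df$ID)])

# A list with elements K and L
parts <- split(df, row_label)
parts$L
```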
We can do this using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1 and create a column 'ind' from the result. We can then split the dataset on that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1', i.e. 'ID' and 'A'. Double negation (!!) converts the '0' counts to FALSE and all other frequencies to TRUE. The rowSums of that logical matrix give the number of distinct 'A' values per 'ID', and comparing them with > 1 gives 'indx'. We match the 'ID' column in 'df1' with the names of 'indx' to expand it to one TRUE/FALSE per row, and split the dataset with that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
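To make the table step more concrete, the intermediate values can be inspected on the example data. This is a minimal sketch: df1 is rebuilt from the question's table, tab > 0 is used in place of !!tab (both give TRUE for non-zero counts), and the names tab, present and n_distinct_A are made up for illustration.

```r
df1 <- data.frame(
  ID = rep(1:3, each = 4),
  A  = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"),
  B  = rep(1:4, times = 3),
  C  = c(10, 10, 15, 12, 15, 13, 13, 13, 16, 18, 19, 10)
)

# Counts of each A value per ID (rows are IDs, columns are A values)
tab <- table(df1[1:2])

# TRUE where an A value occurs at least once for that ID
present <- tab > 0

# Number of distinct A values per ID
n_distinct_A <- rowSums(present)
n_distinct_A          # 3 2 1 for IDs 1, 2, 3

# TRUE for IDs whose A changes
indx <- n_distinct_A > 1
```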
If we need the individual datasets in the global environment, rename the list elements to the desired object names and use list2env (not recommended, though):
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)

Mean of each row of data.frames column nested in list

I have many data.frames nested in a list and want to get the row means of a column in the data.frames. Here is my MWE. I am wondering how to accomplish this in R?
set.seed(12345)
df1 <- data.frame(x=rnorm(10))
df2 <- data.frame(x=rnorm(10))
ls1 <- list(df1=df1, df2=df2)
ls1
$df1
x
1 0.5855288
2 0.7094660
3 -0.1093033
4 -0.4534972
5 0.6058875
6 -1.8179560
7 0.6300986
8 -0.2761841
9 -0.2841597
10 -0.9193220
$df2
x
1 -0.1162478
2 1.8173120
3 0.3706279
4 0.5202165
5 -0.7505320
6 0.8168998
7 -0.8863575
8 -0.3315776
9 1.1207127
10 0.2987237
Something like this one
(ls1$df1+ls1$df2)/2
x
1 0.23464051
2 1.26338903
3 0.13066227
4 0.03335964
5 -0.07232227
6 -0.50052806
7 -0.12812949
8 -0.30388085
9 0.41827645
10 -0.31029915
Edited
ls1 <- list(df=df1)
ls2 <- list(df=df2)
How can (ls1$df+ls2$df)/2 be written more coherently?
x
1 0.23464051
2 1.26338903
3 0.13066227
4 0.03335964
5 -0.07232227
6 -0.50052806
7 -0.12812949
8 -0.30388085
9 0.41827645
10 -0.31029915
Extract the column, cbind the data.frames, and calculate the row means:
rowMeans(do.call(cbind, lapply(ls1, "[", "x")))
If there are no NA values, another option is to add the data frames element-wise (+) with Reduce and divide by the length of 'ls1':
Reduce(`+`, ls1)/length(ls1)
# x
#1 0.23464051
#2 1.26338903
#3 0.13066227
#4 0.03335964
#5 -0.07232227
#6 -0.50052806
#7 -0.12812949
#8 -0.30388085
#9 0.41827645
#10 -0.31029915
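If some values might be missing, the rowMeans route can skip them with na.rm = TRUE, which the Reduce version cannot do directly. The sketch below reuses the question's setup and sets one value to NA purely for illustration; m and means are hypothetical names.

```r
set.seed(12345)
df1 <- data.frame(x = rnorm(10))
df2 <- data.frame(x = rnorm(10))
ls1 <- list(df1 = df1, df2 = df2)

# Introduce a missing value for the sake of the example
ls1$df2$x[3] <- NA

# Pull column x out of every data frame and bind the vectors as columns
m <- do.call(cbind, lapply(ls1, `[[`, "x"))

# Row means over the available (non-NA) values only
means <- rowMeans(m, na.rm = TRUE)
means
```

In row 3 the mean is then just the value from df1, since that is the only one available.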
