understanding apply and outer function in R - r

Suppose i have a data which looks like this
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I Wanted to compare these values with each other so if an ID has changed its value of A variable over a period of B variable(which is from 1 to 4) it goes into data frame K and if it hasn't then it goes to data frame L.
so in this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of nested loops and if then else statement it can be solved like following
for ( i in 1:length(ID)){
m=0
for (j in 1: length(B)){
ifelse( A[j] == A[j+1],m,m=m+1)
}
ifelse(m=0, L=c[,df[i]], K=c[,df[i]])
}
I have read in some posts that in R nested loops can be replaced by apply and outer function. if someone can help me understand how it can be used in such circumstances.

So basically you don't need a loop with conditions here, all you need to do is to check if there's a variance (and then converting it to a logical using !) in A during each cycle of B (IDs) by converting A to a numeric value (I'm assuming its a factor in your real data set, if its not a factor, you can use FUN = function(x) length(unique(x)) within ave instead ) and then split accordingly. With base R we can use ave for such task, for example
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(A), by = ID]
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(indx = !var(A)) %>%
split(., indx)

Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In a bit plainer English: go through all df$A grouped by df$ID, and apply the function on df$A within each groupings (i.e. the x in the embedded function): if the number of unique values is more than 1, it's "K", otherwise it's "L".

We can do this using data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'ID', we check the length of unique elements in 'A' (uniqueN(A)) is greater than 1 or not, create a column 'ind' based on that. We can then split the dataset based on that
'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1' i.e. the 'ID' and 'A'. By double negating (!!) the output, we can get the '0' values convert to 'TRUE' and all other frequency as 'FALSE'. Get the rowSums ('indx'). We match the ID column in 'df1' with the names of the 'indx', use that to replace the 'ID' with TRUE/FALSE, and split the dataset with that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
If we need to get individual datasets on the global environment, change the names of the list elements to the object names we wanted and use list2env (not recommended though)
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)

Related

Compare values in a grouped data frame with corresponding value in a vector

Let's say I got a data.frame like the following:
u <- as.numeric(rep(rep(1:5,3)))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now I want to find the first row in the respective group [i] where the value [i] from vector "v" is bigger than the value in column "u".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function (m) {
first(which(m[,2]>v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems like, ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can join the vector by creating a data.frame with 'w' as the unique values from 'w' column of 'q', then do a group_by 'w' and get the first row index where u is greater than the corresponding 'vector' column value
library(dplyr)
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
summarise(n = which(u > new)[1])
# // or use findInterval
#summarise(n = findInterval(new[1], u)+1)
-output
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
or use Map after splitting the data by 'w' column
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
OP mentioned that comparison starts from the beginning and it is not correct because we have a group_by operation. If we create a column of sequence, it resets at each group
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2

Picking up only specific columns based on conditions on multiple columns in R [duplicate]

This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 6 years ago.
I have a data frame, say
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
it looks like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 1 e
6 3 2 f
7 3 3 g
8 6 1 h
9 8 1 i
10 8 2 j
11 8 3 k
12 8 4 l
I would like pick unique elements from column x, based on column y such that y should be maximum (in this case say for row number 5 to 7 are 3'3, I would like to pick the x = 3 corresponding to y = 3 (maximum value) similarly for x = 8 I d like to pick y = 4 row )
the output should look like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 3 g
6 6 1 h
7 8 4 l
I have a solution for that, which I am posting in the solution, but if there is there any better method to achieve this, My solution only works in this specific case (picking the largest) what is the general case solution for this?
One solution using dplyr
library(dplyr)
df %>%
group_by(x) %>%
slice(max(y))
# x y z
# (dbl) (dbl) (chr)
#1 1 1 a
#2 2 1 b
#3 3 3 g
#4 5 1 c
#5 6 1 d
#6 8 4 l
The base R alternative is using aggregate
aggregate(y~x, df, max)
You can achieve the same result using a dplyr chain and dplyr's group_by function. Once you use a group_by function the rest of the functions in the chain are applied within group as opposed to the whole data.frame. So here I filter to where the only rows left are the max(y) per the grouping value of x. This can be extended to be used for the min of y or a particular value.
I think its generally good practice to ungroup the data at the end of a chain using group_by to avoid any unexpected behavior.
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df %>%
group_by(x) %>%
filter(y==max(y)) %>%
ungroup()
To make it more general... say instead you wanted the mean of y for a given x as opposed to the max. You could then use the summarise function instead of the filter as shown below.
df %>%
group_by(x) %>%
summarise(y=mean(y)) %>%
ungroup()
Using data.table we can use df[order(z), .I[which.max(y)], by = x] to get the rownumbers of interest, eg:
library(data.table)
setDT(df)
df[df[order(z), .I[which.max(y)], by = x][, V1]]
x y z
1: 1 1 a
2: 2 1 b
3: 5 1 c
4: 6 1 d
5: 3 3 g
6: 8 4 l
Here is my solution using dplyr package
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df <- arrange(df,desc(y))
df_out <- df[!duplicated(df$x),]
df_out
Printing df_out
x y z
1 8 4 l
2 3 3 g
6 1 1 a
7 2 1 b
8 5 1 c
9 6 1 d
Assuming the data frame is ordered by df[order(df$x, df$y),] as it is in the example, you can use base R functions, split, lapply, and do.call/rbind to extract your desired rows using the "split / apply / combine" methodology.
do.call(rbind, lapply(split(df, df$x), function(i) i[nrow(i),]))
x y z
1 1 1 a
2 2 1 b
3 3 3 g
5 5 1 c
6 6 1 h
8 8 4 l
split breaks up the data.frame into a list based on x. This list is fed to lapply which selects the last row of each data.frame, and returns these one row data.frames as a list. This list is then rbinded into a single data frame using do.call.

Strsplit on a column of a data frame [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
I have a data.frame where one of the variables is a vector (or a list), like this:
MyColumn <- c("A, B,C", "D,E", "F","G")
MyDF <- data.frame(group_id=1:4, val=11:14, cat=MyColumn)
# group_id val cat
# 1 1 11 A, B,C
# 2 2 12 D,E
# 3 3 13 F
# 4 4 14 G
I'd like to have a new data frame with as many rows as the vector
FlatColumn <- unlist(strsplit(MyColumn,split=","))
which looks like this:
MyNewDF <- data.frame(group_id=c(rep(1,3),rep(2,2),3,4), val=c(rep(11,3),rep(12,2),13,14), cat=FlatColumn)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
In essence, for every factor which is an element of the list of MyColumn (the letters A to G), I want to assign the corresponding values of the list. Every factor appears only once in MyColumn.
Is there a neat way for this kind of reshaping/unlisting/merging? I've come up with a very cumbersome for-loop over the rows of MyDF and the length of the corresponding element of strsplit(MyColumn,split=","). I'm very sure that there has to be a more elegant way.
You can use separate_rows from tidyr:
tidyr::separate_rows(MyDF, cat)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
How about
lst <- strsplit(MyColumn, split = ",")
k <- lengths(lst) ## expansion size
FlatColumn <- unlist(lst, use.names = FALSE)
MyNewDF <- data.frame(group_id = rep.int(MyDF$group_id, k),
val = rep.int(MyDF$val, k),
cat = FlatColumn)
# group_id val cat
#1 1 11 A
#2 1 11 B
#3 1 11 C
#4 2 12 D
#5 2 12 E
#6 3 13 F
#7 4 14 G
We can use cSplit from splitstackshape
library(splitstackshape)
cSplit(MyDF, "cat", ",", "long")
# group_id val cat
#1: 1 11 A
#2: 1 11 B
#3: 1 11 C
#4: 2 12 D
#5: 2 12 E
#6: 3 13 F
#7: 4 14 G
We can also use do with base R with strsplit to split the 'cat' column into a list, replicate the sequence of rows of 'MyDF' with the lengths of 'lst', and create the 'cat' column by unlisting the 'lst'.
lst <- strsplit(as.character(MyDF$cat), ",")
transform(MyDF[rep(1:nrow(MyDF), lengths(lst)),-3], cat = unlist(lst))

ratios according to two variables, function aggregate in R?

I've been playing with some data in order to obtain the ratios between two levels within one variable and taking into account two other variables. I've been using the function aggregate(), which is very useful to calculate means and sums. However, I'm stuck when I want to calculate some ratios (divisions).
Here you find a dataframe very similar to my data:
w<-c("A","B","C","D","E","F","A","B","C","D","E","F")
x<-c(1,1,1,1,1,1,2,2,2,2,2,2)
y<-c(3,4,5,6,8,10,3,4,5,7,9,10)
z<-runif(12)
df<-data.frame(w,x,y,z)
df
w x y z
1 A 1 3 0.93767621
2 B 1 4 0.09169992
3 C 1 5 0.49012926
4 D 1 6 0.90886690
5 E 1 8 0.37058120
6 F 1 10 0.83558267
7 A 2 3 0.42670001
8 B 2 4 0.05656252
9 C 2 5 0.70694423
10 D 2 7 0.13634309
11 E 2 9 0.92065671
12 F 2 10 0.56276176
What I want is to obtain the ratios of z from the two levels of x and taking into account the variables w and y. So the level "A" from the variable "w" in the level "3" from the variable "y" should be:
df$z[1]/df$z[7]
With aggregate function should be something like this:
final<-aggregate(z~y:w, data=df)
However, I know that I miss something because in the variable y there are some classes that not appear in the two categories of w (e.g. 7, 8 and 9).
Any help will be welcomed!
We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'w', 'y', if the nrow (.N) is 2, we divide the first value by the second or else return the 'z'. Assign (:=) the output to a new column 'z1'.
library(data.table)
setDT(df)[,z1 :=if(.N==2) z[1]/z[2] else z , by = .(w,y)]
df
# w x y z z1
# 1: A 1 3 0.93767621 2.1975069
# 2: B 1 4 0.09169992 1.6212135
# 3: C 1 5 0.49012926 0.6933068
# 4: D 1 6 0.90886690 0.9088669
# 5: E 1 8 0.37058120 0.3705812
# 6: F 1 10 0.83558267 1.4847894
# 7: A 2 3 0.42670001 2.1975069
# 8: B 2 4 0.05656252 1.6212135
# 9: C 2 5 0.70694423 0.6933068
#10: D 2 7 0.13634309 0.1363431
#11: E 2 9 0.92065671 0.9206567
#12: F 2 10 0.56276176 1.4847894
If we just want the summary output we don't need to use :=
setDT(df)[, list(z=if(.N==2) z[1]/z[2] else z) , by = .(w,y)]
Or using aggregate
aggregate(z~w+y, df, FUN=function(x)
if(length(x)==2) x[1]/x[2] else x)

r cumsum-like function for splitting dataframe

Given the following dataframe:
mydf <- data.frame(x=c(1:10,10:1),y=c(10:1,1:10))
How is it possible to split it such that each sub-dataframe will have consecutive values of one column which are greater than the other column?
For example in mydf, the outcome that I am hoping for is spliting it into three dataframes:
(y > x; should contain the first 5 rows of mydf)
(x > y; should contain rows 6 to 15 of mydf)
(y > x again; should contain the last 5 rows of mydf)
I tried using the following code but it produced bad results where each y > x would be split individually; moreover, dataframes where x > y would contain a y > x in the first row:
split(mydf, cumsum(mydf$x > mydf$y))
Another less elegant approach I tried to do is sapply with individual ifs inside the split function, but I don't want to go this path because of performance issues.
Try
rl <- with(mydf, rle(x >y))
grp <- inverse.rle(within.list(rl , values <- seq_along(values)))
split(mydf, grp)
#$`1`
# x y
#1 1 10
#2 2 9
#3 3 8
#4 4 7
#5 5 6
#$`2`
# x y
#6 6 5
#7 7 4
#8 8 3
#9 9 2
#10 10 1
#11 10 1
#12 9 2
#13 8 3
#14 7 4
#15 6 5
#$`3`
# x y
#16 5 6
#17 4 7
#18 3 8
#19 2 9
#20 1 10
Or
group <- with(mydf, cumsum(c(1,abs(diff(x >y)))))
split(mydf, group)
Or you can use rleid from the devel version of data.table (from #David Arenburg's comments) , i.e. v1.9.5. Onstructions to install it are here
library(data.table)
split(mydf, rleid(with(mydf, y > x)))

Resources