Subset specific rows per group [closed] - r

I have a dataframe such that:
Group V2 V3 V4
1 D F W
1 T A L
1 P F P
2 T F L
2 R R O
2 D Y L
2 D F I
...
And I have a list such that:
[1] 1 3
[2] 4
[3] 2 3 4
Each element of the list indicates which rows I want to keep for each group: rows 1 and 3 of Group == 1, the 4th row of the second group, rows 2, 3 and 4 of the third group, and so on.
I have tried hard but I haven't found a straightforward way, although I'm pretty sure there must be one using apply or something similar.

You can do,
do.call(rbind, Map(function(x, y) x[y,], split(df, df$Group), l1))
# Group V2 V3 V4
#1.1 1 D F W
#1.3 1 P F P
#2 2 D F I
where,
l1 <- list(c(1, 3), 4)
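The same pairing spelled out with mapply(), in case Map() feels opaque (a sketch; it assumes l1 is ordered to match sort(unique(df$Group)), which is the order split() uses):
groups <- split(df, df$Group)                  # one data frame per group
picked <- mapply(function(g, idx) g[idx, , drop = FALSE],
                 groups, l1, SIMPLIFY = FALSE) # keep the pieces as data frames
do.call(rbind, picked)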

Having the following objects to work with, a data.frame and a list similar to yours:
df <- read.table(text = "Group V2 V3 V4
1 D F W
1 T A L
1 P F P
2 T F L
2 R R O
2 D Y L
2 D F I
3 E F I
3 F F I
3 G F I
3 T F I", header = T)
l <- list(c(1, 3), 4, c(2:4))
do.call(rbind, lapply(seq_along(l), function(i) df[df$Group == i,][l[[i]],]))
# Group V2 V3 V4
#1 1 D F W
#3 1 P F P
#7 2 D F I
#9 3 F F I
#10 3 G F I
#11 3 T F I
yields the same result as the simpler data.table approach:
library(data.table)
dt <- as.data.table(df)
dt[, .SD[l[[.GRP]]], Group]
or
dt[, .SD[l[[unlist(.BY)]]], Group]
# Group V2 V3 V4
#1: 1 D F W
#2: 1 P F P
#3: 2 D F I
#4: 3 F F I
#5: 3 G F I
#6: 3 T F I
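If the Group values were not simply 1, 2, 3, ..., you could key the list by the group label rather than the group counter (a sketch; it assumes l is in the same order as the groups appear in df):
names(l) <- as.character(unique(df$Group))
dt[, .SD[l[[as.character(.BY$Group)]]], by = Group]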

An option using tidyverse
library(tidyverse)
df %>%
  group_split(Group) %>%
  map2_df(l, ~ .x %>% slice(.y))
# A tibble: 6 x 4
# Group V2 V3 V4
# <int> <fct> <fct> <fct>
#1 1 D F W
#2 1 P F P
#3 2 D F I
#4 3 F F I
#5 3 G F I
#6 3 T F I
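Roughly the same pipeline without map2_df(), which newer purrr releases supersede in favour of map2() plus an explicit row-bind (a sketch):
library(dplyr)
library(purrr)
df %>%
  group_split(Group) %>%
  map2(l, ~ slice(.x, .y)) %>%  # pick the requested rows per group
  bind_rows()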
data
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), V2 = structure(c(1L, 7L, 5L, 7L, 6L, 1L, 1L, 2L, 3L,
4L, 7L), .Label = c("D", "E", "F", "G", "P", "R", "T"), class = "factor"),
V3 = structure(c(2L, 1L, 2L, 2L, 3L, 4L, 2L, 2L, 2L, 2L,
2L), .Label = c("A", "F", "R", "Y"), class = "factor"), V4 = structure(c(5L,
2L, 4L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("I",
"L", "O", "P", "W"), class = "factor")), class = "data.frame",
row.names = c(NA,
-11L))
l <- list(c(1, 3), 4, 2:4)

Related

filter data.frame based on reactive input

I have the following data frame:
structure(list(A = c(1L, 1L, 1L, 1L, 2L), B = c(1L, 2L, 2L, 2L,
1L), C = c(1L, 1L, 2L, 2L, 1L), D = structure(c(2L, 2L, 1L, 2L,
2L), .Label = c("", "x"), class = "factor"), E = structure(c(2L,
1L, 2L, 2L, 1L), .Label = c("", "x"), class = "factor"), F = structure(c(2L,
1L, 2L, 2L, 2L), .Label = c("", "x"), class = "factor"), G = structure(c(2L,
1L, 1L, 1L, 1L), .Label = c("", "x"), class = "factor"), Y = structure(c(2L,
1L, 2L, 1L, 1L), .Label = c("", "x"), class = "factor")), .Names = c("A",
"B", "C", "D", "E", "F", "G", "Y"), class = "data.frame", row.names = c(NA,
-5L))
I would like to filter this dataframe and remove missing values in the columns (D,E,F,G,Y). I'm doing this using 'complete.cases' in the following code:
completeFun <- function(data, desiredCols) {
  completeVec <- complete.cases(data[, desiredCols])
  return(data[completeVec, ])
}
However, what I noticed is that when I call the function, e.g. completeFun(test, c('E','F')), the following output is returned:
A B C D E F G Y
1 1 1 1 x x x x x
3 1 2 2 <NA> x x <NA> x
4 1 2 2 x x x <NA> <NA>
which removes the rows where E or F is NA, i.e. it keeps only the rows where both E and F are non-NA.
However, what I want instead is to keep the rows where at least one of those columns (E, F) is non-NA, i.e. drop a row only when both E and F are NA, which means this output in this case:
A B C D E F G Y
1 1 1 1 x x x x x
3 1 2 2 <NA> x x <NA> x
4 1 2 2 x x x <NA> <NA>
5 2 1 1 x <NA> x <NA> <NA>
Of course, I would like to keep the function as flexible as possible so that more columns can be included in the calculation.
What is the best R way to do this?
UPDATE
Based on Sotos' answer, here is a case where it does not work:
structure(list(A = c(1L, 1L, 1L, 1L, 2L), B = c(1L, 2L, 2L, 2L,
1L), C = c(1L, 1L, 2L, 2L, 1L), D = structure(c(1L, 1L, NA, 1L,
1L), .Label = "x", class = "factor"), E = structure(c(1L, NA,
1L, 1L, NA), .Label = "x", class = "factor"), F = structure(c(1L,
NA, 1L, 1L, 1L), .Label = "x", class = "factor"), G = structure(c(1L,
NA, NA, NA, NA), .Label = "x", class = "factor"), Y = structure(c(1L,
NA, 1L, NA, 1L), .Label = "x", class = "factor")), .Names = c("A",
"B", "C", "D", "E", "F", "G", "Y"), row.names = c(NA, -5L), class = "data.frame")
For this new data frame, if I call the function as follows: completeFun(test, cols = c('E','F', 'Y')), I get the following output:
A B C D E F G Y
1 1 1 1 x x x x x
NA NA NA NA <NA> <NA> <NA> <NA> <NA>
3 1 2 2 <NA> x x <NA> x
NA.1 NA NA NA <NA> <NA> <NA> <NA> <NA>
NA.2 NA NA NA <NA> <NA> <NA> <NA> <NA>
which is missing the last row of the data frame, where F and Y both have non-missing values.
You can do this via rowSums, i.e.
completeFun <- function(df, cols) {
  return(df[rowSums(df[cols] == '') != length(cols), ])
}
completeFun(dd, cols = c('E', 'F'))
# A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2 x x x
#4 1 2 2 x x x
#5 2 1 1 x x
completeFun(dd, cols = 'Y')
# A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2 x x x
EDIT
In the previous example the OP had empty strings instead of NA, hence we were checking for those. If we want to check for NAs, we can modify the function to use is.na instead.
completeFun <- function(df, cols) {
  df[rowSums(is.na(df[cols])) != length(cols), ]
}
completeFun(df, cols = c('E','F', 'Y'))
# A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2 <NA> x x <NA> x
#4 1 2 2 x x x <NA> <NA>
#5 2 1 1 x <NA> x <NA> x
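The same "keep a row if at least one of the chosen columns is non-NA" condition can also be written with dplyr's if_any() (a sketch; it needs a reasonably recent dplyr):
library(dplyr)
cols <- c('E', 'F', 'Y')                         # columns to check (example)
df %>% filter(if_any(all_of(cols), ~ !is.na(.x)))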
This is similar to Sotos' answer, but a bit more flexible: a row is considered complete if the number of non-NA values is greater than or equal to the threshold thrsh.
completeFun <- function(dtf, cols, na.val = "", thrsh = 1) {
  dtf[dtf == na.val] <- NA                      # treat na.val entries as missing
  ix <- rowSums(!is.na(dtf[, cols])) >= thrsh   # rows with at least thrsh non-NA values
  dtf[ix, ]
}
completeFun(test, cols=c("E", "F"))
# A B C D E F G Y
# 1 1 1 1 x x x x x
# 3 1 2 2 <NA> x x <NA> x
# 4 1 2 2 x x x <NA> <NA>
# 5 2 1 1 x <NA> x <NA> <NA>
completeFun(test, cols=c("D", "E", "F", "Y"), thrsh=3)
# A B C D E F G Y
# 1 1 1 1 x x x x x
# 3 1 2 2 <NA> x x <NA> x
# 4 1 2 2 x x x <NA> <NA>
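As a usage note, raising thrsh tightens the filter; setting it to the number of chosen columns requires every one of them to be non-missing, much like complete.cases() after the na.val conversion:
completeFun(test, cols = c("E", "F"), thrsh = 2)  # both E and F must be non-missing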

Sum the values of groups of 4 contiguous columns in R

Starting from a table of 372 columns and 12,000 rows in R, I need to create a new table whose columns contain the row-wise sums of columns 1:4, then 5:8, then 9:12, and so on up to column 372 of the original table. Here is a short example:
Input:
m = structure(c(3L, 1L, 2L, 6L, 3L, 1L, 1L, 8L, 1L, 5L, 2L, 1L, 3L, 7L,
1L, 1L), .Dim = c(2L, 8L), .Dimnames = list(c("r1", "r2"), c("a", "b",
"c", "d", "e", "f", "g", "h")))
Which looks like this:
a b c d e f g h
r1 3 2 3 1 1 2 3 1
r2 1 6 1 8 5 1 7 1
Expected output:
A B
r1 9 7
r2 16 14
So A = a+b+c+d and B = e+f+g+h. This is easy to do with a small table in Excel. Columns a-d correspond to one group and e-h to another, if that helps.
The question is currently underspecified, but supposing you have a matrix...
m = structure(c(3L, 1L, 2L, 6L, 3L, 1L, 1L, 8L, 1L, 5L, 2L, 1L, 3L,
7L, 1L, 1L), .Dim = c(2L, 8L), .Dimnames = list(c("r1", "r2"),
c("a", "b", "c", "d", "e", "f", "g", "h")))
Make your column mapping:
map = data.frame(old = colnames(m), new = rep(LETTERS, each=4, length.out=ncol(m)))
old new
1 a A
2 b A
3 c A
4 d A
5 e B
6 f B
7 g B
8 h B
And then rowsum by it:
res = rowsum(t(m), map$new)
r1 r2
A 9 16
B 7 14
We have to transpose the data with t here because R has rowsum but no colsum. You can transpose it back afterwards, like t(res).
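For instance, transposing back gives the orientation asked for (a sketch building directly on the rowsum() result above):
res <- rowsum(t(m), map$new)
t(res)
#    A  B
# r1 9  7
# r2 16 14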
A base R solution, supposing df is your data frame:
ncols <- ncol(df)  # 8 in this example
do.call(cbind, lapply(seq(1, ncols, 4), function(i) rowSums(df[i:(i + 3)])))
# [,1] [,2]
# r1 9 7
# r2 16 14
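A small variation: sapply() simplifies the per-block sums to a matrix directly, so the cbind step can be dropped (a sketch; it still assumes ncol(df) is a multiple of 4):
sapply(seq(1, ncol(df), by = 4), function(i) rowSums(df[i:(i + 3)]))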
Another way:
df <- data.frame(t(matrix(colSums(matrix(t(df), nrow=4)),nrow=nrow(df))))
## X1 X2
##1 9 7
##2 16 14
First, transpose the data into a 4 x (ncol(df)/4 * nrow(df)) matrix, so that each column holds one group of four columns from one row of the original data frame.
Then sum each column using colSums.
Finally, transpose the result back into a data frame with the original number of rows.
You can do this in a vectorised way if you transform your original data to a matrix with 4 columns, then use rowSums on that, and then transform it back to match the rows of the original data frame. Here it is in one long command
df <- read.table(header = TRUE, text = "a b c d e f g h
3 2 3 1 1 2 3 1
1 6 1 8 5 1 7 1")
matrix(rowSums(matrix(as.vector(t(as.matrix(df))),
ncol = 4, byrow = TRUE)), ncol = ncol(df) / 4, byrow = TRUE)
# [,1] [,2]
#[1,] 9 7
#[2,] 16 14
Edit: To preserve the row names, if e.g. rownames(df) <- c("r1", "r2"), just apply them to the resulting matrix (the row order is preserved), ie run rownames(result) <- rownames(df).

How to collapse session path data into from-to paths for visualizing network data?

What are some ways to transform session path data such as this:
df
# Session Link1 Link2 Link3 Link4 Link5
# 1 1 A B
# 2 2 C
# 3 3 D A B
# 4 4 C F G H J
# 5 5 A B C
Into a data set that looks like this:
desired
# Session From To
# 1 1 A B
# 2 2 C <NA>
# 3 3 D A
# 4 3 A B
# 5 4 C F
# 6 4 F G
# 7 4 G H
# 8 4 H J
# 9 5 A B
# 10 5 B C
Data for reproducibility:
df <- structure(list(Session = 1:5, Link1 = structure(c(1L, 2L, 3L, 2L, 1L), .Label = c("A", "C", "D"), class = "factor"), Link2 = structure(c(3L, 1L, 2L, 4L, 3L), .Label = c("", "A", "B", "F"), class = "factor"), Link3 = structure(c(1L, 1L, 2L, 4L, 3L), .Label = c("", "B", "C", "G"), class = "factor"), Link4 = structure(c(1L, 1L, 1L, 2L, 1L), .Label = c("", "H"), class = "factor"), Link5 = structure(c(1L, 1L, 1L, 2L, 1L), .Label = c("", "J"), class = "factor")), .Names = c("Session", "Link1", "Link2", "Link3", "Link4", "Link5"), class = "data.frame", row.names = c(NA, -5L))
desired <- structure(list(Session = c(1L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L), From = structure(c(1L, 3L, 4L, 1L, 3L, 5L, 6L, 7L, 1L, 2L), .Label = c("A", "B", "C", "D", "F", "G", "H"), class = "factor"), To = structure(c(2L, NA, 1L, 2L, 4L, 5L, 6L, 7L, 2L, 3L), .Label = c("A", "B", "C", "F", "G", "H", "J"), class = "factor")), .Names = c("Session", "From", "To"), class = "data.frame", row.names = c(NA, -10L))
We could use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and reshape from 'wide' to 'long' format with melt, specifying id.var as 'Session'. Remove the 'value' elements that are empty ([value != '']). Grouped by 'Session', insert an NA in the 'value' column for those sessions that have only a single row (if ... else), then create two columns ('From' and 'To') by dropping the last and the first element of 'V1', respectively, grouped by 'Session'.
library(data.table)#v1.9.5+
melt(setDT(df), id.var = 'Session')[value != ''][,
  if (.N == 1L) c(value, NA) else value, by = Session][,
  list(From = V1[-.N], To = V1[-1L]), by = Session]
# Session From To
#1: 1 A B
#2: 2 C NA
#3: 3 D A
#4: 3 A B
#5: 4 C F
#6: 4 F G
#7: 4 G H
#8: 4 H J
#9: 5 A B
#10: 5 B C
The above could be simplified to a single block after the melt step. Note that tmp[-.N] does not work here: when .N == 1, tmp has two elements (the value plus NA), so -.N would drop the wrong element; tmp[1:(.N-1)] handles that case correctly.
melt(setDT(df), id.var = 'Session')[value != '', {
  tmp <- if (.N == 1L) c(value, NA) else value
  list(From = tmp[1:(.N-1)], To = tmp[-1L]) }, by = Session]
# Session From To
#1: 1 A B
#2: 2 C NA
#3: 3 D A
#4: 3 A B
#5: 4 C F
#6: 4 F G
#7: 4 G H
#8: 4 H J
#9: 5 A B
#10: 5 B C
Inspired by #akrun, this is my personal stab at the problem. Granted, the results are tweaked to include the terminal from-to path for each pair:
library(dplyr)
library(tidyr)
gather(df, "Link_Num", "Value", -Session) %>%
group_by(Session) %>%
mutate(to = Value,
from = lag(to)) %>%
filter(Link_Num != "Link1" &
from != "") %>%
select(Session, from, to, Link_Num) %>%
arrange(Session)
Which yields:
Session from to Link_Num
1 1 A B Link2
2 1 B Link3
3 2 C Link2
4 3 D A Link2
5 3 A B Link3
6 3 B Link4
7 4 C F Link2
8 4 F G Link3
9 4 G H Link4
10 4 H J Link5
11 5 A B Link2
12 5 B C Link3
13 5 C Link4
Another approach, using melt (from reshape2) together with dplyr's lead:
library(reshape2)
library(dplyr)
df$spacer <- ""
df %>%
  melt(id.var = "Session") %>%
  arrange(Session) %>%
  mutate(To = lead(value)) %>%
  filter(To != "" & value != "" | To == "" & variable == "Link1") %>%
  mutate(To = ifelse(To == "", NA, To)) %>%
  select(-variable)
# Session value To
# 1 1 A B
# 2 2 C <NA>
# 3 3 D A
# 4 3 A B
# 5 4 C F
# 6 4 F G
# 7 4 G H
# 8 4 H J
# 9 5 A B
# 10 5 B C
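For completeness, a plain base R sketch of the same collapse, looping over the rows (it assumes the blank cells are empty strings, as in the dput above):
do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  p <- unlist(lapply(df[i, -1], as.character), use.names = FALSE)
  p <- p[p != ""]                     # drop empty link slots
  if (length(p) == 1) p <- c(p, NA)   # single-link sessions get To = NA
  data.frame(Session = df$Session[i], From = p[-length(p)], To = p[-1])
}))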

find highest value within factor levels

If I have the following data frame:
value factorA factorB
1 a e
2 a f
3 a g
1 b k
2 b l
3 b m
1 c e
2 c g
how can I get, for each factorA, the highest value and the entry from factorB associated with it, i.e.
value factorA factorB
3 a g
3 b m
2 c g
Is this possible without first using
blocks <- split(factorA, list(), drop = TRUE)
and then sorting each block$a? This will be performed many times and the number of blocks will always change.
Here is one option, using base R functions:
maxRows <- by(df, df$factorA, function(X) X[which.max(X$value),])
do.call("rbind", maxRows)
# value factorA factorB
# a 3 a g
# b 3 b m
# c 2 c g
With your data
df<- structure(list(value = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L), factorA = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
factorB = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 3L), .Label = c("e",
"f", "g", "k", "l", "m"), class = "factor")), .Names = c("value",
"factorA", "factorB"), class = "data.frame", row.names = c(NA,
-8L))
Using the ddply function from the plyr package:
> library(plyr)
> df2 <- ddply(df, c('factorA'), function(x) x[which(x$value == max(x$value)), ])
> df2
value factorA factorB
1 3 a g
2 3 b m
3 2 c g
Or,
> rownames(df2) <- df2$factorA
> df2
value factorA factorB
a 3 a g
b 3 b m
c 2 c g
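A dplyr equivalent, for comparison (a sketch; slice_max() needs dplyr >= 1.0.0):
library(dplyr)
df %>%
  group_by(factorA) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()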

change data frame in R

I have a data frame generated inside a for loop, with this structure:
V1 V2 V3
1 a a 1
2 a b 3
3 a c 2
4 a d 1
5 a e 3
6 b a 3
7 b b 1
8 b c 8
9 b d 1
10 b e 1
11 c a 2
12 c b 8
The data is longer than this, but that's the idea: I want to transform it to a wide table (V1 by V2).
V3 is a value based on the (V1, V2) pair.
I want to rearrange the data like this, where the first column is the unique values of V1, the first row is the unique values of V2, and the cells between them come from V3:
a b c d e
a 1 3 2 1 3
b 3 1 8 1 1
c 2 8 2 8 2
d 1 1 5 7 2
e 3 5 9 5 3
Thanks in advance.
Reproducible example of yours:
df <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), V2 = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L), .Label = c("a", "b", "c", "d", "e"), class = "factor"), V3 = c(1L, 3L, 2L, 1L, 3L, 3L, 1L, 8L, 1L, 1L, 2L, 8L)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
And compute a basic crosstable based on your variables:
> xtabs(V3~V1+V2, df)
V2
V1 a b c d e
a 1 3 2 1 3
b 3 1 8 1 1
c 2 8 0 0 0
I hope you meant this :)
If df is your data frame, assuming a unique V3 is mapped to each (V1, V2) combination, you can do it with
with(df, tapply(V3, list(V1,V2), identity))
Another method, perhaps slightly more baroque, for widening a data frame from a third column on the basis of the first two, agreeing with Chase that the OP has not given an unambiguous problem description:
df2 <- expand.grid(A=LETTERS[1:5], B=LETTERS[1:5])
df2$N <- 1:25
mtx <- outer(X=LETTERS[1:5],Y=LETTERS[1:5], FUN=function(x,y){
df2[intersect(which(df2$A==x), which(df2$B==y)), "N"] })
colnames(mtx)<-LETTERS[1:5]; rownames(mtx)<-LETTERS[1:5]
mtx
A B C D E
A 1 6 11 16 21
B 2 7 12 17 22
C 3 8 13 18 23
D 4 9 14 19 24
E 5 10 15 20 25
I'm sure there are many other strategies using reshape in base or dcast in reshape2.
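For example, a reshape2 sketch of the same pivot (combinations missing from df come back as NA rather than 0):
library(reshape2)
dcast(df, V1 ~ V2, value.var = "V3")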
