Two-factor group_by, then add row number in R with dplyr

I have a data frame (df):
a <- c("up","up","up","up","down","down","down","down")
b <- c("l","r","l","r","l","l","r","r")
df <- data.frame(a,b)
I would like to add a third column (c) that contains the running order of entries within each combination of columns a and b, so that the result looks something like this:
a b c
1 up l 1
2 up r 1
3 up l 2
4 up r 2
5 down l 1
6 down l 2
7 down r 1
8 down r 2
I have tried solutions using dplyr that have not worked:
order <- df %>%
  group_by(a) %>%
  group_by(b) %>%
  mutate(c = row_number()) # This counts the order based on `b`, ignoring `a`

order <- df %>%
  group_by(a) %>%
  group_by(b) %>%
  mutate(c = seq_len(n())) # This counts the order based on `b`, ignoring `a`
I would prefer to keep using dplyr and pipes if possible, but other suggestions are welcome.

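You need to combine a and b in the same group_by statement. By default, a second group_by() call replaces the existing grouping instead of adding to it, which is why your attempts ended up grouped by b only.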
order <- df %>%
  group_by(a, b) %>%
  mutate(c = row_number())
order
# Source: local data frame [8 x 3]
# Groups: a, b [4]
#
# a b c
# <fctr> <fctr> <int>
# 1 up l 1
# 2 up r 1
# 3 up l 2
# 4 up r 2
# 5 down l 1
# 6 down l 2
# 7 down r 1
# 8 down r 2
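
To see why the two-step attempts in the question collapse to grouping by b alone, it can help to inspect the grouping variables directly. The following is a minimal sketch, assuming dplyr 1.0.0 or later (where group_by() has the .add argument); the names overridden, stacked, result and c_base are illustrative only. The last line shows a base R alternative with ave() for anyone not tied to dplyr.

```r
library(dplyr)

df <- data.frame(
  a = c("up", "up", "up", "up", "down", "down", "down", "down"),
  b = c("l", "r", "l", "r", "l", "l", "r", "r")
)

# A second group_by() replaces the first grouping by default
overridden <- df %>% group_by(a) %>% group_by(b)
group_vars(overridden)

# .add = TRUE keeps the existing grouping and adds b to it
stacked <- df %>% group_by(a) %>% group_by(b, .add = TRUE)
group_vars(stacked)

# Row number within each (a, b) combination
result <- stacked %>%
  mutate(c = row_number()) %>%
  ungroup()

# Base R alternative: number the rows within each (a, b) combination
c_base <- ave(seq_along(df$a), df$a, df$b, FUN = seq_along)
```

Calling ungroup() at the end is optional, but it avoids surprises if the result is passed on to later summarise() or mutate() calls.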

Related

R dplyr filter data based on values in other rows

I am trying to filter a data frame using dplyr and I can't really think of a way to achieve what I want. I have a data frame of the following form:
A B C
-----------
1 2 5
1 4 6
2 2 7
2 4 6
Each value in column A appears exactly 2 times. Column B has exactly 2 different values, each appearing exactly once for each value of A. Column C can have any positive values. I want to keep all rows where for one value of A, the row with the bigger B value has a smaller C value than the row with the smaller B value. In the example above, this would result in:
A B C
-----------
2 2 7
2 4 6
Is there a way to achieve this using dplyr?
1) Sort by A and B so that the row with the larger B always comes second within each A. Then, grouping by A, keep the groups where diff(C) < 0, i.e. where C decreases from the first row to the second.
library(dplyr)
DF %>%
  arrange(A, B) %>%
  group_by(A) %>%
  filter(diff(C) < 0) %>%
  ungroup()
## # A tibble: 2 × 3
## A B C
## <int> <int> <int>
## 1 2 2 7
## 2 2 4 6
2) Another possibility is to ensure that the maximum of B is on the same row as the minimum of C. This would also work with non-numeric data.
DF %>%
  group_by(A) %>%
  filter(which.max(B) == which.min(C)) %>%
  ungroup()
3) If the slope of B with respect to C is negative then keep the group.
DF %>%
  group_by(A) %>%
  filter(coef(lm(B ~ C))[[2]] < 0) %>%
  ungroup()
or we can calculate the slope ourselves:
DF %>%
  group_by(A) %>%
  filter(diff(C) / diff(B) < 0) %>%
  ungroup()
Note: the input DF used above, in reproducible form, is:
Lines <- "A B C
1 2 5
1 4 6
2 2 7
2 4 6"
DF <- read.table(text = Lines, header = TRUE)
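
The condition in the question can also be written quite literally: within each A, compare the C value on the row with the largest B against the C value on the row with the smallest B. This is a sketch only, and it assumes (as the question states) that each B value occurs once per A, so which.max() and which.min() each pick an unambiguous row. The name kept is illustrative.

```r
library(dplyr)

DF <- read.table(text = "A B C
1 2 5
1 4 6
2 2 7
2 4 6", header = TRUE)

kept <- DF %>%
  group_by(A) %>%
  # Keep the group if C at the larger B is smaller than C at the smaller B
  filter(C[which.max(B)] < C[which.min(B)]) %>%
  ungroup()

kept
```

Unlike the diff()-based versions, this does not depend on the rows being sorted by B beforehand.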

I would like to create a new variable with the unique occurrences and their frequency in R

I am a beginner, and while I have searched for an answer to this problem, none of the solutions I found seem to apply. It might be a simple one, but I cannot seem to crack it. I have this data frame:
df <- data.frame(FROM = c("A","A","A","B","D","C","A","D"),
TO = c("B","C","D","A","C","A","B","C"))
I would like to create a new data frame with an extra variable, call it "FREQ", that holds the frequency of each unique combination of "FROM" and "TO", so that the new data set looks like this. I would appreciate some assistance.
df2 <- data.frame(FROM = c("A","A","A","B","D","C"),
TO = c("B","C","D","A","C","A"),
FREQ = c(2,1,1,1,2,1))
If you are using the dplyr package, you can use count, which is a shortcut for group_by(FROM, TO) %>% summarise(n = n()) and counts the number of rows in each group:
library(dplyr)
df %>% count(FROM, TO)
#Source: local data frame [6 x 3]
#Groups: FROM [?]
# FROM TO n
# <fctr> <fctr> <int>
#1 A B 2
#2 A C 1
#3 A D 1
#4 B A 1
#5 C A 1
#6 D C 2
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df), group by 'FROM' and 'TO', and get the number of rows in each group with .N:
library(data.table)
setDT(df)[, .(FREQ = .N) ,.(FROM, TO)]
# FROM TO FREQ
#1: A B 2
#2: A C 1
#3: A D 1
#4: B A 1
#5: D C 2
#6: C A 1
Another option is tally() from dplyr
library(dplyr)
df %>%
group_by(FROM, TO) %>%
tally()
# FROM TO n
# <fctr> <fctr> <int>
#1 A B 2
#2 A C 1
#3 A D 1
#4 B A 1
#5 C A 1
#6 D C 2
Or, using table from base R, we get the frequency of every FROM/TO combination, convert the result to a data.frame, and remove the combinations whose 'Freq' is 0 with subset:
subset(as.data.frame(table(df)), Freq !=0)
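
Since the question asks for the count column to be called FREQ, it may be worth noting that count() can name the column directly. This is a small sketch, assuming dplyr 0.8.0 or later, where count() accepts a name argument; the rows come back sorted by FROM and TO rather than in order of first appearance.

```r
library(dplyr)

df <- data.frame(FROM = c("A", "A", "A", "B", "D", "C", "A", "D"),
                 TO   = c("B", "C", "D", "A", "C", "A", "B", "C"))

# One row per unique FROM/TO pair, with the row count stored in FREQ
df2 <- df %>% count(FROM, TO, name = "FREQ")
df2
```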

R: aggregate by all factor levels (present and not present)

I can aggregate a data.frame trivially with dplyr with the following:
z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5))
library(dplyr)
z %>%
group_by(b) %>%
summarise(out = n())
Source: local data frame [4 x 2]
b out
(fctr) (int)
1 a 5
2 b 5
3 c 5
4 d 5
However, a dataset may sometimes be missing a factor level, in which case I would like the output for that level to be 0.
For example, let's say the typical dataset should have 5 groups:
z$b <- factor(z$b, levels = letters[1:5])
This particular dataset clearly has no rows for level e, although another dataset might. How can I aggregate the data so that the count for missing factor levels is 0?
Desired output:
Source: local data frame [5 x 2]
b out
(fctr) (int)
1 a 5
2 b 5
3 c 5
4 d 5
5 e 0
One way to approach this is to use complete from "tidyr". You first have to use mutate to convert column "b" to a factor that includes all five levels:
library(dplyr)
library(tidyr)
z %>%
mutate(b = factor(b, letters[1:5])) %>%
group_by(b) %>%
summarise(out = n()) %>%
complete(b, fill = list(out = 0))
# Source: local data frame [5 x 2]
#
# b out
# (fctr) (dbl)
# 1 a 5
# 2 b 5
# 3 c 5
# 4 d 5
# 5 e 0
A workaround is to join with a table containing all levels:
z <- full_join(z, data.frame(b = levels(z$b)), by = "b")
This will set all the missing rows for your analysis variables to NA, which in the general case would make more sense than setting them to zero. You can change them to zero if necessary with z[is.na(z)] <- 0.
You could use xtabs:
xtabs(a ~ b, z)
This sums z$a within each level of z$b rather than just counting the rows per level as in your example, but the count is easily obtained with table, which also reports unused factor levels as 0:
table(z$b)
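
For readers who want to stay within dplyr, newer versions can keep empty factor levels directly. The sketch below assumes dplyr 0.8.0 or later, where count() and group_by() accept a .drop argument; set.seed(1) and the names out and out2 are only there to make the example reproducible and are not part of the original question.

```r
library(dplyr)

set.seed(1)
z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5))
z$b <- factor(z$b, levels = letters[1:5])  # level "e" has no rows

# .drop = FALSE keeps factor levels that do not occur in the data
out <- z %>% count(b, .drop = FALSE, name = "out")
out

# Equivalent spelling with group_by() and summarise()
out2 <- z %>%
  group_by(b, .drop = FALSE) %>%
  summarise(out = n())
```

This only works when b is a factor that declares all expected levels; for a plain character column there is no record of the missing group to keep.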

Count occurrences across multiple columns using R & dplyr

This should be a simple solution... I just can't wrap my head around it. I'd like to count the occurrences of each factor level across multiple columns of a data frame. There are 13 columns, abx.1 through abx.13, and a huge number of rows.
Sample data frame:
library(dplyr)
abx.1 <- c('Amoxil', 'Cipro', 'Moxiflox', 'Pip-tazo')
start.1 <- c('2012-01-01', '2012-02-01', '2013-01-01', '2014-01-01')
abx.2 <- c('Pip-tazo', 'Ampicillin', 'Amoxil', NA)
start.2 <- c('2012-01-01', '2012-02-01', '2013-01-01', NA)
abx.3 <- c('Ampicillin', 'Amoxil', NA, NA)
start.3 <- c('2012-01-01', '2012-02-01', NA,NA)
worksheet <-data.frame (abx.1, start.1, abx.2, start.2, abx.3, start.3)
Result I'd like:
name count
Amoxil 3
Ampicillin 2
Pip-tazo 2
Cipro 1
Moxiflox 1
I've tried :
worksheet %>% group_by (abx.1, abx.2, abx.3) %>% summarise(count = n())
This doesn't give me my desired output. Any thoughts would be greatly appreciated.
If you want a dplyr solution, I'd suggest combining it with tidyr in order to convert your data to a long format first
library(tidyr)
worksheet %>%
select(starts_with("abx")) %>%
gather(key, value, na.rm = TRUE) %>%
count(value)
# Source: local data frame [5 x 2]
#
# value n
# 1 Amoxil 3
# 2 Ampicillin 2
# 3 Cipro 1
# 4 Moxiflox 1
# 5 Pip-tazo 2
Alternatively, with base R, it's just
as.data.frame(table(unlist(worksheet[grep("^abx", names(worksheet))])))
# Var1 Freq
# 1 Amoxil 3
# 2 Cipro 1
# 3 Moxiflox 1
# 4 Pip-tazo 2
# 5 Ampicillin 2
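
In current tidyr, gather() is superseded by pivot_longer(), so the same reshape-then-count idea might look like the sketch below. It assumes tidyr 1.0.0 or later and dplyr 0.8.0 or later; the column names column, drug and count are illustrative choices, and only the abx columns of the sample data are recreated here.

```r
library(dplyr)
library(tidyr)

worksheet <- data.frame(
  abx.1 = c("Amoxil", "Cipro", "Moxiflox", "Pip-tazo"),
  abx.2 = c("Pip-tazo", "Ampicillin", "Amoxil", NA),
  abx.3 = c("Ampicillin", "Amoxil", NA, NA)
)

counts <- worksheet %>%
  # Stack all abx.* columns into one long column, dropping the NA cells
  pivot_longer(starts_with("abx"),
               names_to = "column", values_to = "drug",
               values_drop_na = TRUE) %>%
  # Count each drug and sort by descending frequency
  count(drug, name = "count", sort = TRUE)

counts
```

With sort = TRUE the most frequent value comes first, which is close to the layout requested in the question.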

Splitting text in a column and adding the row number

I would like to split some text in a data frame column and save it into a data frame together with the row number or an id column.
I normally used plyr for this, but the same approach no longer works in dplyr.
If I understand correctly, my plyr code only works because of a bug in plyr.
So I am looking for the correct way to do this.
This is a minimal example in plyr:
library(plyr)
set.seed(1)
df <- data.frame(a=seq(2),
b=c(paste(sample(letters,3), collapse=';'),
paste(sample(letters,3), collapse=';')),
stringsAsFactors=FALSE)
ddply(df,.(a),summarise,unlist(strsplit(b,';')))
It turns the original data frame:
a b
1 1 g;j;n
2 2 x;f;v
Into this:
a ..1
1 1 g
2 1 j
3 1 n
4 2 x
5 2 f
6 2 v
What would be the correct dplyr solution?
I'm biased in favor of cSplit from the "splitstackshape" package, but you might be interested in unnest from "tidyr" in conjunction with "dplyr":
library(dplyr)
library(tidyr)
df %>%
mutate(b = strsplit(b, ";")) %>%
unnest(b)
# a b
# 1 1 g
# 2 1 j
# 3 1 n
# 4 2 x
# 5 2 f
# 6 2 v
You could do this using cSplit from splitstackshape
library(splitstackshape)
cSplit(df, 'b', ';', 'long')
# a b
#1: 1 g
#2: 1 j
#3: 1 n
#4: 2 x
#5: 2 f
#6: 2 v
Or using dplyr/tidyr
library(dplyr)
library(tidyr)
separate(df, b, c('b1', 'b2', 'b3'), sep=";") %>%
gather(Var, b, -a) %>%
select(-Var) %>%
arrange(a)
Or another option would be to use do
df %>%
group_by(a) %>%
do(data.frame(b=unlist(strsplit(.$b, ';'))))
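
tidyr also has a helper aimed at exactly this case, separate_rows(), which splits a delimited column and repeats the other columns for each piece. The sketch below hard-codes the two strings shown in the question instead of calling sample(), since the sampled letters depend on the R version; the name long is illustrative. In tidyr 1.3.0 and later, separate_longer_delim() is the newer spelling of the same idea.

```r
library(tidyr)

df <- data.frame(a = 1:2,
                 b = c("g;j;n", "x;f;v"),
                 stringsAsFactors = FALSE)

# One output row per ";"-separated piece of b, with a repeated alongside
long <- separate_rows(df, b, sep = ";")
long
```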
