I have data in the format given below:
a  b
1  A,B
1  A
1  B
2  A,B
2  D,C
2  A
2  A
What I need is, when grouping by column 'a', the number of distinct values in column 'b':
a  count
1  2
2  4
For a = 1 there are only 2 distinct values (A, B),
but for a = 2 there are 4 (A, B, C, D).
I could first explode the data into long format and then do the group-by, but since I have a few other aggregations to do, I was looking for a way to do it in one line.
Thanks in advance
We can use aggregate in base R :
aggregate(b~a,df, function(x) length(unique(unlist(strsplit(x, ',')))))
# a b
#1 1 2
#2 2 4
data
df <- structure(list(a = c(1L, 1L, 1L, 2L, 2L, 2L, 2L), b = c("A,B",
"A", "B", "A,B", "D,C", "A", "A")), class = "data.frame", row.names = c(NA, -7L))
Using tidyr::separate_rows and dplyr::n_distinct this could be achieved like so:
library(dplyr)
d %>%
  tidyr::separate_rows(b) %>%
  group_by(a) %>%
  summarise(count = n_distinct(b))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> a count
#> <int> <int>
#> 1 1 2
#> 2 2 4
DATA
d <- read.table(text = "a b
1 A,B
1 A
1 B
2 A,B
2 D,C
2 A
2 A", header = TRUE)
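For anyone who wants to see what the one-liners above are doing, here is a minimal base R sketch of the same steps: split each string on commas, pool the pieces within each group, and count the unique ones. The helper name count_distinct_tokens is made up for illustration, and the data frame is retyped from the question.

```r
# Data retyped from the question
df <- data.frame(a = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
                 b = c("A,B", "A", "B", "A,B", "D,C", "A", "A"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: number of distinct comma-separated tokens in a character vector
count_distinct_tokens <- function(x) {
  tokens <- unlist(strsplit(x, ",", fixed = TRUE))  # "A,B" -> "A", "B"
  length(unique(trimws(tokens)))                    # trimws() guards against "A, B"
}

# aggregate() hands all values of b within one level of a to the helper
res <- aggregate(b ~ a, data = df, FUN = count_distinct_tokens)
names(res)[2] <- "count"
res
#   a count
# 1 1     2
# 2 2     4
```

Since the helper is an ordinary function, it should also be usable next to other aggregations, which was the concern in the question.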
Base R using Map():
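res <- Map(function(x) length(unique(trimws(unlist(strsplit(x, ","))))),
           with(df, split(b, a)))
# stack() turns the named list into a data frame; put the group key first and label the columns
setNames(stack(res)[2:1], c("a", "count"))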
I have a large dataframe with approximately this pattern:
Person  Rate  Street  a     b     c     d     e     f
A       2     XYZ     1     NULL  3     4     5     NULL
A       2     XYZ     NULL  2     NULL  NULL  NULL  NULL
A       3     XYZ     NULL  NULL  NULL  NULL  NULL  6
B       2     DEF     NULL  NULL  NULL  NULL  5     NULL
B       2     DEF     NULL  2     3     NULL  NULL  6
C       1     DEF     1     2     3     4     5     6
Columns a through f stand in for about 600 columns.
I am trying to combine the rows so that each person becomes one line: columns a-f are combined into a single row using sum, and any conflicting Rate or Street information becomes a new column. So the data should look something like this:
Person  Rate  Rate 2  Street  a     b  c  d     e  f
A       2     3       XYZ     1     2  3  4     5  6
B       2             DEF     NULL  2  3  NULL  5  6
C       1             DEF     1     2  3  4     5  6
I keep trying to make this work with aggregate and summarize but I'm not sure that's the right approach.
Thank you very much for your help!
First we pivot all the unique rates per person and street.
library(reshape2)
tmp1=dcast(unique(df[,c("Person","Rate","Street")]),Person+Street~Rate,value.var="Rate")
colnames(tmp1)[-c(1:2)]=paste("Rate",colnames(tmp1)[-c(1:2)])
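Then we aggregate and sum columns 4 to 9 ("a" to "f") by person and street; change the column range accordingly.
tmp2 = aggregate(df[, 4:9], list(Person = df$Person, Street = df$Street), function(x) {
  # keep NA when a group has no values at all, otherwise sum the non-missing values
  if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
})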
And finally merge the two.
merge(tmp1,tmp2,by=c("Person","Street"))
Person Street Rate 1 Rate 2 Rate 3 a b c d e f
1 A XYZ NA 2 3 1 2 3 4 5 6
2 B DEF NA 2 NA NA 2 3 NA 5 6
3 C DEF 1 NA NA 1 2 3 4 5 6
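A small sketch of the "NA if everything is missing, otherwise sum" rule used in the aggregate step above, in case it helps. The helper name sum_or_na and the value_cols vector are my own; the data is a reduced copy of the sample (columns a, b and f only). Picking the value columns by name rather than by position may be easier with ~600 columns.

```r
# Reduced copy of the sample data (only three of the value columns)
df <- data.frame(
  Person = c("A", "A", "A", "B", "B", "C"),
  Rate   = c(2L, 2L, 3L, 2L, 2L, 1L),
  Street = c("XYZ", "XYZ", "XYZ", "DEF", "DEF", "DEF"),
  a = c(1L, NA, NA, NA, NA, 1L),
  b = c(NA, 2L, NA, NA, 2L, 2L),
  f = c(NA, NA, 6L, NA, 6L, 6L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: NA when every value is missing, otherwise the sum of the rest
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)

sum(c(NA, NA), na.rm = TRUE)  # 0: plain sum() turns an all-missing group into zero
sum_or_na(c(NA, NA))          # NA
sum_or_na(c(NA, 2L, NA))      # 2

# Everything that is not an identifier column is treated as a value column
value_cols <- setdiff(names(df), c("Person", "Rate", "Street"))
sums <- aggregate(df[value_cols], by = df[c("Person", "Street")], FUN = sum_or_na)
sums
```

Person B has no value in column a, so the helper leaves NA there, whereas sum(..., na.rm = TRUE) on its own would report 0.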
Perhaps, you can do this in two-step process -
library(dplyr)
library(tidyr)
#sum columns a-f
table1 <- df %>%
  group_by(Person) %>%
  # a lambda keeps na.rm = TRUE explicit; passing extra arguments through across() is deprecated in recent dplyr
  summarise(across(a:f, ~ sum(.x, na.rm = TRUE)))
#Remove duplicated values and get the data in separate columns
#for Rate and Street columns.
table2 <- df %>%
  group_by(Person) %>%
  mutate(across(c(Rate, Street), ~replace(., duplicated(.), NA))) %>%
  select(Person, Rate, Street) %>%
  filter(if_any(c(Rate, Street), ~!is.na(.))) %>%
  mutate(col = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = col, values_from = c(Rate, Street)) %>%
  select(where(~any(!is.na(.))))
#Join the two data to get final result
inner_join(table1, table2, by = 'Person')
# Person a b c d e f Rate_1 Rate_2 Street_1
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <chr>
#1 A 1 2 3 4 5 6 2 3 XYZ
#2 B 0 2 3 0 5 6 2 NA DEF
#3 C 1 2 3 4 5 6 1 NA DEF
data
It is helpful and easier to help when you share data in a reproducible format which can be copied directly. I have used the below data for the answer.
df <- structure(list(Person = c("A", "A", "A", "B", "B", "C"), Rate = c(2L,
2L, 3L, 2L, 2L, 1L), Street = c("XYZ", "XYZ", "XYZ", "DEF", "DEF",
"DEF"), a = c(1L, NA, NA, NA, NA, 1L), b = c(NA, 2L, NA, NA,
2L, 2L), c = c(3L, NA, NA, NA, 3L, 3L), d = c(4L, NA, NA, NA,
NA, 4L), e = c(5L, NA, NA, 5L, NA, 5L), f = c(NA, NA, 6L, NA,
6L, 6L)), row.names = c(NA, -6L), class = "data.frame")
I have a data frame where I'd like to remove entire groups if their Status value is the same across all 6 time points.
Patients  Time  Status
1         a     5
1         b     5
1         c     5
1         d     5
1         e     5
1         f     5
2         a     4
2         b     4
2         c     5
2         d     5
2         e     5
2         f     5
Basically, I'd like to remove all patients from this data frame who have a status of "5" at ALL time points. If a patient has any value apart from 5 at any point in time I'd like to include them.
I tried
df <- df %>%
filter(a !=5 & b !=5 & c !=5 & d !=5 & e !=5 & f !=5)
To no avail, unfortunately. Would appreciate any help. Thank you!
You can use any/all :
library(dplyr)
df %>% group_by(Patients) %>% filter(any(Status != 5))
#With `all`
#df %>% group_by(Patients) %>% filter(!all(Status == 5))
# Patients Time Status
# <int> <chr> <int>
#1 2 a 4
#2 2 b 4
#3 2 c 5
#4 2 d 5
#5 2 e 5
#6 2 f 5
This can also be written with base R :
subset(df, ave(Status != 5, Patients, FUN = any))
#and `data.table` :
library(data.table)
setDT(df)[, .SD[any(Status != 5)], Patients]
Without grouping by Patients you can do :
subset(df, Patients %in% unique(Patients[Status != 5]))
data
df <- structure(list(Patients = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), Time = c("a", "b", "c", "d", "e", "f", "a", "b",
"c", "d", "e", "f"), Status = c(5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L,
5L, 5L, 5L, 5L)), row.names = c(NA, -12L), class = "data.frame")
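To make the any/all logic above a bit more concrete, here is a rough base R sketch that computes one keep/drop flag per patient both ways and checks that they agree. The variable names keep_any and keep_all are mine, and the data frame is retyped from the question.

```r
# Data retyped from the question
df <- data.frame(Patients = rep(1:2, each = 6),
                 Time     = rep(letters[1:6], 2),
                 Status   = c(rep(5L, 6), 4L, 4L, 5L, 5L, 5L, 5L),
                 stringsAsFactors = FALSE)

# One flag per patient: is there at least one status other than 5?
keep_any <- tapply(df$Status != 5, df$Patients, any)
# Same question asked the other way round: are NOT all statuses equal to 5?
keep_all <- !tapply(df$Status == 5, df$Patients, all)

all(keep_any == keep_all)  # TRUE: the two formulations are equivalent

# Keep every row of the patients that were flagged
result <- df[df$Patients %in% names(keep_any)[keep_any], ]
result
```

Only patient 2 is kept, with all six of its rows, including the ones where Status is 5.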
Something like this?
df <- data.frame(
patients = c(rep(1,6),rep(2,6)),
time = rep(letters[1:6],2),
status = c(rep(5,6),rep(4,2),rep(5,4))
)
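library(dplyr)

df %>%
  group_by(patients) %>%
  # keeps rows where status * 6 differs from the patient's total, so a patient with
  # six identical statuses is dropped; this assumes exactly 6 rows per patient
  dplyr::filter(status*6 != sum(status))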
If I understood your problem correctly, one of these options should help:
library(dplyr)
library(data.table)
# your test data
df <- data.table::fread("Patients Time Status
1 a 5
1 b 5
1 c 5
1 d 5
1 e 5
1 f 5
2 a 4
2 b 4
2 c 5
2 d 5
2 e 5
2 f 5")
# one option to get all rows where Status is different from 5
df %>%
  # exclude everything where Status is 5
  dplyr::filter(Status != 5)
Patients Time Status
1: 2 a 4
2: 2 b 4
# one option to get all distinct patients
df %>%
  # exclude everything where Status is 5
  dplyr::filter(Status != 5) %>%
  # unique values per column or column combination
  dplyr::distinct(Patients)
Patients
1: 2
# one option to get all data of patients with at least one status != 5
df %>%
  # exclude everything where Status is 5
  dplyr::filter(Status != 5) %>%
  # unique values per column or column combination
  dplyr::distinct(Patients) %>%
  # join back on original data to get all values for specific patients
  dplyr::inner_join(df, by = "Patients")
Patients Time Status
1: 2 a 4
2: 2 b 4
3: 2 c 5
4: 2 d 5
5: 2 e 5
6: 2 f 5
My data.frame df looks like this:
A 1
A 2
A 5
B 2
B 3
B 4
C 3
C 7
C 9
I want it to look like this:
A B C
1 2 3
2 3 7
5 4 9
I have tried spread() but probably not in the right way. Any ideas?
We can use unstack from base R
unstack(df1, col2 ~ col1)
# A B C
#1 1 2 3
#2 2 3 7
#3 5 4 9
Or with split
data.frame(split(df1$col2, df1$col1))
Or if we use spread or pivot_wider, make sure to create a sequence column
library(dplyr)
library(tidyr)
df1 %>%
  group_by(col1) %>%
  mutate(rn = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = col1, values_from = col2) %>%
  # or use
  # spread(col1, col2) %>%
  select(-rn)
# A tibble: 3 x 3
# A B C
# <int> <int> <int>
#1 1 2 3
#2 2 3 7
#3 5 4 9
Or using dcast
library(data.table)
dcast(setDT(df1), rowid(col1) ~ col1)[, .(A, B, C)]
data
df1 <- structure(list(col1 = c("A", "A", "A", "B", "B", "B", "C", "C",
"C"), col2 = c(1L, 2L, 5L, 2L, 3L, 4L, 3L, 7L, 9L)),
class = "data.frame", row.names = c(NA,
-9L))
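A rough base R sketch of why the sequence column matters: without it there is nothing that says which A value belongs on the same output row as which B value. Here the within-group row number is built with ave() and the reshaping is done with reshape(); rn is just the helper column name borrowed from the pivot_wider code above, and the data is retyped from the question.

```r
df1 <- data.frame(col1 = rep(c("A", "B", "C"), each = 3),
                  col2 = c(1L, 2L, 5L, 2L, 3L, 4L, 3L, 7L, 9L),
                  stringsAsFactors = FALSE)

# Number the rows within each col1 group: 1, 2, 3, 1, 2, 3, 1, 2, 3
df1$rn <- ave(seq_along(df1$col1), df1$col1, FUN = seq_along)

# One output row per rn, one output column per value of col1
wide <- reshape(df1, idvar = "rn", timevar = "col1", direction = "wide")

# Drop the helper column and strip the "col2." prefix that reshape() adds
wide <- wide[, -1]
names(wide) <- sub("^col2\\.", "", names(wide))
wide
```

This assumes every group has the same number of rows, as in the example; groups of unequal length would be padded with NA.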
In data.table, we can use dcast :
library(data.table)
dcast(setDT(df), rowid(col1)~col1, value.var = 'col2')[, col1 := NULL][]
# A B C
#1: 1 2 3
#2: 2 3 7
#3: 5 4 9
x y z column_indices
6 7 1 1,2
5 4 2 3
1 3 2 1,3
I have the column indices of the values I would like to collect in a separate column, as shown above. What I want to create is something like this:
x y z column_indices values
6 7 1 1,2 6,7
5 4 2 3 2
1 3 2 1,3 1,2
What is the simplest way to do this in R?
Thanks!
In base R, we can use apply, split the column_indices on ',', convert them to integer and get the corresponding value from the row.
df$values <- apply(df, 1, function(x) {
  inds <- as.integer(strsplit(x[4], ',')[[1]])
  toString(x[inds])
})
df
# x y z column_indices values
#1 6 7 1 1,2 6, 7
#2 5 4 2 3 2
#3 1 3 2 1,3 1, 2
data
df <- structure(list(x = c(6L, 5L, 1L), y = c(7L, 4L, 3L), z = c(1L,
2L, 2L), column_indices = structure(c(1L, 3L, 2L), .Label = c("1,2",
"1,3", "3"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
One solution involving dplyr and tidyr could be:
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-column_indices) %>%
  group_by(column_indices) %>%
  mutate(values = toString(value[1:n() %in% unlist(strsplit(column_indices, ","))])) %>%
  pivot_wider(names_from = "name", values_from = "value")
column_indices values x y z
<chr> <chr> <int> <int> <int>
1 1,2 6, 7 6 7 1
2 3 2 5 4 2
3 1,3 1, 2 1 3 2
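The steps described above (split column_indices on ',', convert to integer, pick those columns from the row) can also be sketched without apply(), which avoids converting each row to character first. This is only an illustration: the idx list is my own name, the data is retyped from the question, and as.character() is there in case column_indices is stored as a factor.

```r
df <- data.frame(x = c(6L, 5L, 1L), y = c(7L, 4L, 3L), z = c(1L, 2L, 2L),
                 column_indices = c("1,2", "3", "1,3"),
                 stringsAsFactors = FALSE)

# One integer vector of column positions per row, e.g. "1,2" -> c(1L, 2L)
idx <- lapply(strsplit(as.character(df$column_indices), ",", fixed = TRUE),
              as.integer)

# For each row, pull the requested columns and paste them back together
df$values <- vapply(seq_len(nrow(df)), function(i) {
  paste(unlist(df[i, idx[[i]]]), collapse = ",")
}, character(1))

df
#   x y z column_indices values
# 1 6 7 1            1,2    6,7
# 2 5 4 2              3      2
# 3 1 3 2            1,3    1,2
```

Using collapse = "," gives "6,7" as in the question's desired output; toString() would give "6, 7" instead.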
I have a data set that looks like this:
id a b
1 AA 2
1 AB 5
1 AA 1
2 AB 2
2 AB 4
3 AB 4
3 AB 3
3 AA 1
I need to calculate the cumulative mean for each record within each group and excluding the case where a == 'AA', So sample output should be:
id a b mean
1 AA 2 -
1 AB 5 5
1 AA 1 5
2 AB 2 2
2 AB 4 (4+2)/2
3 AB 4 4
3 AB 3 (4+3)/2
3 AA 1 (4+3)/2
3 AA 4 (4+3)/2
I tried to achieve this using dplyr and cummean, but I am getting an error:
df <- df %>%
group_by(id) %>%
mutate(mean = cummean(b[a != 'AA']))
Error: incompatible size (123), expecting 147 (the group size) or 1
Can you suggest a better way to achieve the same in R ?
The trick here is to reconstruct the cummean by dividing the adjusted cumsum by the adjusted count. As a one-liner:
df %>% group_by(id) %>% mutate(cumsum(b * (a != 'AA')) / cumsum(a != 'AA'))
We can make this a little nicer (multiplying by a != 'AA' is the ugly bit, to my mind) by pulling a != 'AA' out into its own column:
df %>%
group_by(id) %>%
mutate(relevance = 0+(a!='AA'),
mean = cumsum(relevance * b) / cumsum(relevance))
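For comparison, the same "running sum divided by running count" idea can be sketched in base R with ave(). The names keep, run_sum and run_n are mine; the data is retyped from the question's sample output (nine rows). Rows that come before any non-'AA' row would be 0/0, so they are set to NA explicitly.

```r
df <- data.frame(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
                 a  = c("AA", "AB", "AA", "AB", "AB", "AB", "AB", "AA", "AA"),
                 b  = c(2L, 5L, 1L, 2L, 4L, 4L, 3L, 1L, 4L),
                 stringsAsFactors = FALSE)

keep <- df$a != "AA"                                  # rows that count towards the mean

run_sum <- ave(df$b * keep, df$id, FUN = cumsum)      # 'AA' rows contribute 0 to the sum
run_n   <- ave(as.numeric(keep), df$id, FUN = cumsum) # ... and 0 to the count

# Before the first non-'AA' row of a group the count is still 0, so report NA there
df$mean <- ifelse(run_n > 0, run_sum / run_n, NA_real_)
df$mean
# NA 5.0 5.0 2.0 3.0 4.0 3.5 3.5 3.5
```

An 'AA' row simply carries the previous mean forward, because it changes neither the running sum nor the running count.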
There may be an easier approach. Here, we group by 'id' and create a new column 'Mean'. First, the elements of 'b' that correspond to 'AA' in 'a' are converted to NA with b*NA^(a=='AA'): NA^(a=='AA') gives NA where 'a' is 'AA' and 1 for all other values, so multiplying by 'b' keeps the values of 'b' where the mask is 1 while the NA entries stay NA. We then use na.aggregate to replace each NA with the mean of the non-NA elements in its group, and wrap the result in cummean to get the cumulative mean. If the first value of 'a' in a group is 'AA', we can get NA for that row by multiplying with NA^(row_number()==1 & a=='AA'). Note that this reproduces the expected output for the sample data, but na.aggregate fills 'AA' rows with the mean of the whole group, so when an 'AA' row is followed by further non-'AA' rows the running average can differ from a cumulative mean that simply skips 'AA' rows.
library(zoo)
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(Mean = cummean(na.aggregate(b * NA^(a == 'AA'))) *
           NA^(row_number() == 1 & a == 'AA'))
# Source: local data frame [9 x 4]
#Groups: id [3]
# id a b Mean
# (int) (chr) (int) (dbl)
#1 1 AA 2 NA
#2 1 AB 5 5.0
#3 1 AA 1 5.0
#4 2 AB 2 2.0
#5 2 AB 4 3.0
#6 3 AB 4 4.0
#7 3 AB 3 3.5
#8 3 AA 1 3.5
#9 3 AA 4 3.5
data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
a = c("AA",
"AB", "AA", "AB", "AB", "AB", "AB", "AA", "AA"), b = c(2L, 5L,
1L, 2L, 4L, 4L, 3L, 1L, 4L)), .Names = c("id", "a", "b"),
class = "data.frame", row.names = c(NA, -9L))
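The NA^(condition) trick from the explanation above is easy to check in isolation. This tiny sketch uses made-up vectors (is_aa and b are illustrative, not taken from a real data set) and relies only on base R arithmetic: NA^TRUE is NA^1, which is NA, while NA^FALSE is NA^0, which R evaluates to 1.

```r
is_aa <- c(TRUE, FALSE, TRUE)   # stands in for a == 'AA'
b     <- c(2, 5, 1)

mask <- NA^is_aa                # NA where is_aa is TRUE, 1 where it is FALSE
mask
# NA  1 NA

masked <- b * mask              # values under an NA mask become NA, the rest are unchanged
masked
# NA  5 NA
```

An equivalent and perhaps more readable spelling would be replace(b, is_aa, NA).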