Formatting a data.frame with binary values - r

I have a dataframe with 4 columns and 4 rows. For simplicity, I changed it to numeric format. The schema is as follows:
df <- structure(list(a = c(1, 2, 2, 0),
                     b = c(2, 1, 0, 2),
                     c = c(2, 2, 1, 1),
                     d = c(0, 2, 0, 1)),
                row.names = c(NA, -4L), class = "data.frame")
  a b c d
1 1 2 2 0
2 2 1 2 2
3 2 0 1 0
4 0 2 1 1
I would like to change this data frame and obtain the following:
  1   2
1 a   b/c
2 b   a/c/d
3 c   a
4 c/d b
Is there a function or package I should look into? I have been doing lots of text processing in R recently. I'd appreciate your assistance!

tapply fun with some row and col indexes (stealing df from Ronak's answer):
tapply(
  colnames(df)[col(df)],
  list(row(df), unlist(df)),
  FUN = paste, collapse = "/"
)[, -1]
#   1     2
# 1 "a"   "b/c"
# 2 "b"   "a/c/d"
# 3 "c"   "a"
# 4 "c/d" "b"
Basically I'm taking one long vector representing each column name in df, and tabulating it by the combination of the row of df, and the original values in df.
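To make the tabulation above easier to follow, here is a minimal sketch that spells out the intermediate vectors one by one. The names col_names, row_ids, values and res are illustrative only, and the data frame is rebuilt here to match the table printed in the question.

```r
# Example data, matching the table printed in the question
df <- data.frame(a = c(1, 2, 2, 0), b = c(2, 1, 0, 2),
                 c = c(2, 2, 1, 1), d = c(0, 2, 0, 1))

# One entry per cell of df, in column-major order
col_names <- colnames(df)[col(df)]         # column name of each cell
row_ids   <- as.vector(row(df))            # row number of each cell
values    <- unlist(df, use.names = FALSE) # the cell value itself

# Paste together the column names that share a (row, value) combination
res <- tapply(col_names, list(row_ids, values), paste, collapse = "/")

# Drop the column for value 0 by name rather than by position
res <- res[, colnames(res) != "0"]
res
```

Dropping the "0" column by name instead of with [, -1] is a small design choice: it still works if no cell happens to contain 0.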

One way with dplyr and tidyr could be to get data in long format, remove 0 values and paste the column names together for each row and value combination. Finally get the data in wide format.
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = -row) %>%
  filter(value != 0) %>%
  group_by(row, value) %>%
  summarise(val = paste(name, collapse = "/")) %>%
  pivot_wider(names_from = value, values_from = val)
#     row `1`   `2`
#   <int> <chr> <chr>
# 1     1 a     b/c
# 2     2 b     a/c/d
# 3     3 c     a
# 4     4 c/d   b
data
df <- structure(list(a = c(1L, 2L, 2L, 0L), b = c(2L, 1L, 0L, 2L),
                     c = c(2L, 2L, 1L, 1L), d = c(0L, 2L, 0L, 1L)),
                class = "data.frame", row.names = c("1", "2", "3", "4"))

Related

How can I calculate the sum of the column wise differences using dplyr

Despite using R and dplyr on a regular basis, I encountered the issue of not being able to calculate the sum of the absolute differences between all columns:
sum_diff=ABS(A-B)+ABS(B-C)+ABS(C-D)...
A  B  C  D  sum_diff
1  2  3  4  3
2  1  3  4  4
1  2  1  1  2
4  1  2  1  5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
We can subtract the data frame without its first column from the data frame without its last column, take the absolute values, and add them up per row with rowSums in base R. This could be very efficient compared to a package solution
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
-output
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Data from akrun (many thanks)!
This is a bit complicated. The idea is to generate a list of the column pairs to compare. I tried combn, but that returns all possible combinations rather than only the adjacent pairs, so I created the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling afterwards:
library(tidyverse)
combinations <- list(c("A", "B"), c("B", "C"), c("C", "D"))
purrr::map_dfc(combinations, ~{
  df <- tibble(a = data[[.[[1]]]] - data[[.[[2]]]])
  names(df) <- paste0(.[[1]], "_v_", .[[2]])
  df
}) %>%
  transmute(sum_diff = rowSums(abs(.))) %>%
  bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
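The hand-written list of adjacent pairs could also be generated from the column names. The following is a minimal base R sketch, assuming the columns are already in the order in which they should be compared; the names cols, diffs and sum_diff are illustrative only.

```r
data <- data.frame(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
                   C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L))

cols <- names(data)

# Pair each column with its right-hand neighbour: (A, B), (B, C), (C, D)
combinations <- unname(Map(c, head(cols, -1), tail(cols, -1)))

# One column of absolute differences per pair, then add them up per row
diffs <- sapply(combinations, function(p) abs(data[[p[1]]] - data[[p[2]]]))
sum_diff <- rowSums(diffs)
sum_diff
```

Unlike combn(cols, 2), which returns every pair of columns, this only builds the neighbouring pairs that the sum_diff formula in the question uses.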
Here is a dplyr version of @akrun's elegant approach, which computes the difference between the data frame and its shifted variant:
df %>%
  mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1))
                                - identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time each row is treated as a vector from which its shifted self is subtracted.
df %>%
  rowwise() %>%
  mutate(sum_diff = map2_int(c_across(1:last_col(1)),
                             c_across(2:last_col()),
                             ~ abs(.x - .y)) %>% sum())

R using dplyr group_by/ sum in for loop, output as concatenated list

I am using the dplyr package to group by a week variable and get the sum for three variables. The results for the individual variables should be stacked on top of each other.
Here is my data frame df:
week var1 var2 var3
1 1 2 3
1 2 2 3
2 4 4 5
2 2 2 6
3 6 6 6
3 4 4 4
My command is
calculate <- function(vars){
  x <- df %>% group_by(week) %>% summarise(summe = sum(vars)) %>% mutate(group = paste(vars))
  x
}
cols <- c("var1", "var2", "var3")
for (i in 1:length(cols)){
  var <- cols[i]
  cal <- calculate(var)
  total <- rbind(total, cal)
}
The expected output should be
week summe group
1 3 var1
2 6 var1
3 10 var1
1 4 var2
2 6 var2
3 10 var2
1 6 var3
2 11 var3
3 10 var3
My question is: Is there a better way instead of using a for loop?
Cheers,
Andi
We could pivot to 'long' format and then sum by group
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with('var'), names_to = 'group') %>%
  group_by(week, group) %>%
  summarise(summe = sum(value)) %>%
  ungroup %>%
  arrange(group) %>%
  select(week, summe, group)
# A tibble: 9 x 3
# week summe group
# <int> <int> <chr>
#1 1 3 var1
#2 2 6 var1
#3 3 10 var1
#4 1 4 var2
#5 2 6 var2
#6 3 10 var2
#7 1 6 var3
#8 2 11 var3
#9 3 10 var3
We can also do the sum grouped by 'week' first and then pivot to 'long' format
df %>%
  group_by(week) %>%
  summarise_at(vars(-group_cols()), sum) %>%
  pivot_longer(cols = starts_with('var'), names_to = 'group', values_to = 'summe') %>%
  select(week, summe, group)
data
df <- structure(list(week = c(1L, 1L, 2L, 2L, 3L, 3L), var1 = c(1L,
2L, 4L, 2L, 6L, 4L), var2 = c(2L, 2L, 4L, 2L, 6L, 4L), var3 = c(3L,
3L, 5L, 6L, 6L, 4L)), class = "data.frame", row.names = c(NA,
-6L))
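For comparison, the loop from the question can be kept in spirit with only a few changes. In the original, sum(vars) tries to sum the column name (a character string) rather than the column, and total is never initialised before rbind. The sketch below swaps in base R's aggregate() instead of dplyr, looks the column up by name with df[[var]], and binds the pieces with do.call(rbind, ...). It is one possible repair, not the approach used in the answer above.

```r
df <- data.frame(week = c(1L, 1L, 2L, 2L, 3L, 3L),
                 var1 = c(1L, 2L, 4L, 2L, 6L, 4L),
                 var2 = c(2L, 2L, 4L, 2L, 6L, 4L),
                 var3 = c(3L, 3L, 5L, 6L, 6L, 4L))

# Sum one column (given by name) per week and label the result
calculate <- function(var) {
  out <- aggregate(df[[var]], by = list(week = df$week), FUN = sum)
  names(out)[2] <- "summe"  # aggregate() names the value column "x"
  out$group <- var
  out
}

cols <- c("var1", "var2", "var3")

# lapply() replaces the for loop; rbind stacks the per-variable results
total <- do.call(rbind, lapply(cols, calculate))
total
```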

Counting occurrence of a variable without taking account duplicates

I have a big data frame, called data with 1 004 490 obs, and I want to analyse the success of a treatment.
ID POSITIONS TREATMENT
1 0 A
1 1 A
1 2 B
2 0 C
2 1 D
3 0 B
3 1 B
3 2 C
3 3 A
3 4 A
3 5 B
So firstly, I want to count the number of patients (ID) to whom each treatment was given, but one treatment can be given several times to the same ID. Do I need to delete all the duplicates first and then count, or is there a function that ignores the duplicates?
What I want to have :
A : 2
B : 2
C : 2
D : 1
Then, I want to know how many times each treatment was given at the last position, but the last position differs from one ID to another.
What I want to have :
A : 0
B : 2 (for ID = 1 and 3)
C : 0
D : 1 (for ID = 2)
Thanks for your help, I am a new user of R!
Using base R, we can do,
merge(aggregate(ID ~ TREATMENT, df1, FUN = function(i) length(unique(i))),
      aggregate(ID ~ TREATMENT, df1[!duplicated(df1$ID, fromLast = TRUE), ], toString),
      by = 'TREATMENT', all = TRUE)
Which gives,
TREATMENT ID.x ID.y
1 A 2 <NA>
2 B 2 1, 3
3 C 2 <NA>
4 D 1 2
Here is a tidyverse approach, where we get the distinct rows based on 'ID', 'TREATMENT' and get the count of 'TREATMENT'
library(tidyverse)
df1 %>%
distinct(ID, TREATMENT) %>%
count(TREATMENT)
# A tibble: 4 x 2
# TREATMENT n
# <chr> <int>
#1 A 2
#2 B 2
#3 C 2
#4 D 1
For the second output, after grouping by 'ID', we slice the last row (n()), create a column 'ind', use complete to add the missing 'TREATMENT' values with 'ind' filled as 0, and then get the sum of 'ind' after grouping by 'TREATMENT'
df1 %>%
  group_by(ID) %>%
  slice(n()) %>%
  mutate(ind = 1) %>%
  complete(TREATMENT = unique(df1$TREATMENT), fill = list(ind = 0)) %>%
  group_by(TREATMENT) %>%
  summarise(n = sum(ind))
# A tibble: 4 x 2
# TREATMENT n
# <chr> <dbl>
#1 A 0
#2 B 2
#3 C 0
#4 D 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
                      POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L),
                      TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A", "A", "B")),
                 .Names = c("ID", "POSITIONS", "TREATMENT"),
                 class = "data.frame", row.names = c(NA, -11L))
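Both counts can also be cross-checked with table() in base R. This is a minimal sketch using the example data; the names pairs, n_patients, sorted_df, last_rows and n_last are illustrative. It sorts by POSITIONS first, on the assumption that the real data may not already be ordered within each ID.

```r
df1 <- data.frame(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
                  POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L),
                  TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A", "A", "B"),
                  stringsAsFactors = FALSE)

# 1) Number of distinct patients per treatment:
#    keep each (ID, TREATMENT) pair once, then count treatments
pairs <- unique(df1[c("ID", "TREATMENT")])
n_patients <- table(pairs$TREATMENT)

# 2) Treatment given at the last position of each patient:
#    sort, keep the last row per ID, then count treatments.
#    The factor() call keeps treatments with a count of zero.
sorted_df <- df1[order(df1$ID, df1$POSITIONS), ]
last_rows <- sorted_df[!duplicated(sorted_df$ID, fromLast = TRUE), ]
n_last <- table(factor(last_rows$TREATMENT, levels = names(n_patients)))

n_patients
n_last
```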

How to get the top cases for each group using dplyr? [duplicate]

I have a data table with 3 columns: ID, Type, and Count. For each ID, I want to get the Type with top 2 Count in this ID, and flatten the result into one row. For example, if my data table is like below:
ID Type Count
A 1 8
B 1 3
A 2 5
A 3 2
B 2 1
B 3 4
Then I want my output to be two rows like below:
ID Top1Type Top1TypeCount Top2Type Top2TypeCount
A 1 8 2 5
B 3 4 1 3
Can anyone tell me how to achieve this using the dplyr library in R? Thank you very much.
It's usually better to keep your data in a long/tidy format. To get the top two rows per ID in that format, you can use:
df1 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID)
which gives:
ID Type Count
(fctr) (int) (int)
1 A 1 8
2 A 2 5
3 B 1 3
4 B 3 4
When you have ties, you can use slice to select an equal number of observations for each group:
# some example data
df2 <- structure(list(ID = structure(c(1L, 2L, 1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
                      Type = c(1L, 1L, 2L, 3L, 2L, 3L),
                      Count = c(8L, 3L, 8L, 8L, 1L, 4L)),
                 .Names = c("ID", "Type", "Count"), class = "data.frame", row.names = c(NA, -6L))
Without slice():
df2 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID)
gives:
ID Type Count
(fctr) (int) (int)
1 A 1 8
2 A 2 8
3 A 3 8
4 B 1 3
5 B 3 4
With the use of slice():
df2 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID) %>% slice(1:2)
gives:
ID Type Count
(fctr) (int) (int)
1 A 1 8
2 A 2 8
3 B 1 3
4 B 3 4
With arrange you can determine the order of the cases and thus which are selected by slice. The following:
df2 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID, -Type) %>% slice(1:2)
gives this result:
ID Type Count
(fctr) (int) (int)
1 A 3 8
2 A 2 8
3 B 3 4
4 B 1 3
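In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which can drop ties directly. The sketch below assumes such a dplyr version is installed and rebuilds df2 with a character ID column for brevity; the name top2 is illustrative.

```r
library(dplyr)

df2 <- data.frame(ID = c("A", "B", "A", "A", "B", "B"),
                  Type = c(1L, 1L, 2L, 3L, 2L, 3L),
                  Count = c(8L, 3L, 8L, 8L, 1L, 4L))

# Keep exactly two rows per ID, even when Count is tied
top2 <- df2 %>%
  group_by(ID) %>%
  slice_max(order_by = Count, n = 2, with_ties = FALSE) %>%
  ungroup()

top2
```

With with_ties = FALSE each group returns at most n rows, so the extra slice(1:2) step is no longer needed. Which of the tied rows are kept depends on their order in the data, so arrange beforehand if a particular row should win.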
Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(df1)), order by 'Count' in descending order and, grouped by 'ID', take the first two rows (head(.SD, 2)). Then we create a sequence column ('N') grouped by 'ID' and dcast from 'long' to 'wide'. The data.table dcast can take multiple value.var columns.
library(data.table)  # v1.9.6+
DT <- setDT(df1)[order(-Count), head(.SD, 2), by = ID]
DT[, N := 1:.N, by = ID]
dcast(DT, ID ~ paste0('Top', N),
      value.var = c('Type', 'Count'), fill = 0)
# ID Type_Top1 Type_Top2 Count_Top1 Count_Top2
#1: A 1 2 8 5
#2: B 3 1 4 3
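The same long-to-wide step can be sketched in base R with order(), ave() and reshape(), for readers without data.table. This is an illustrative alternative under the assumption that df1 is the example data; the names ord, top2 and wide are not from the original answers.

```r
df1 <- data.frame(ID = c("A", "B", "A", "A", "B", "B"),
                  Type = c(1L, 1L, 2L, 3L, 2L, 3L),
                  Count = c(8L, 3L, 5L, 2L, 1L, 4L))

# Sort by ID, then by Count in descending order
ord <- df1[order(df1$ID, -df1$Count), ]

# Rank the rows within each ID: 1 = largest Count
ord$N <- ave(ord$Count, ord$ID, FUN = seq_along)

# Keep the top two rows per ID
top2 <- ord[ord$N <= 2, ]

# Spread Type and Count into one row per ID, one column set per rank
wide <- reshape(top2, idvar = "ID", timevar = "N",
                direction = "wide", sep = "_Top")
wide
```

The resulting columns are named Type_Top1, Count_Top1, Type_Top2 and Count_Top2, which mirrors the naming in the dcast output above.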
data
df1 <- structure(list(ID = c("A", "B", "A", "A", "B", "B"),
                      Type = c(1L, 1L, 2L, 3L, 2L, 3L),
                      Count = c(8L, 3L, 5L, 2L, 1L, 4L)),
                 .Names = c("ID", "Type", "Count"),
                 class = "data.frame", row.names = c(NA, -6L))

delete the rows with duplicated ids

I want to delete the rows with duplicated ids
data
id V1 V2
1 a 1
1 b 2
2 a 2
2 c 3
3 a 4
The problem is that some people took the test several times, which generated multiple scores in V2. I want to delete the duplicated ids and retain one of the scores in V2 at random.
output
id V1 V2
1 a 1
2 a 2
3 a 4
I tried this:
neu <- unique(neu$userid)
but it didn't work
Using dplyr:
library(dplyr)
set.seed(1)
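# shuffle the rows, then keep the first row seen for each id;
# .keep_all = TRUE retains V1 and V2 as well as id
df %>% sample_frac(1) %>% arrange(id) %>% distinct(id, .keep_all = TRUE)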
Output:
id V1 V2
1 1 b 2
2 2 c 3
3 3 a 4
Data:
df <- structure(list(id = c(1L, 1L, 2L, 2L, 3L),
                     V1 = structure(c(1L, 2L, 1L, 3L, 1L), .Label = c("a", "b", "c"), class = "factor"),
                     V2 = c(1L, 2L, 2L, 3L, 4L)),
                .Names = c("id", "V1", "V2"), class = "data.frame", row.names = c(NA, -5L))
Creating the data frame based on your example:
df <- read.table(text =
"id V1 V2
1 a 1
1 b 2
2 a 2
2 c 3
3 a 4", header = TRUE)
Since you want to remove rows randomly, first sort the rows of your data frame randomly:
df <- df[sample(nrow(df)),]
Then remove duplicates in the order of appearance:
df <- df[!duplicated(df$id),]
Now sort your data frame back:
df <- df[with(df, order(id)),]
Remember to replace df with the name of your data frame.
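A variation on the steps above is to pick one random row index per id instead of shuffling the whole data frame. This is a minimal sketch assuming df is the example data; the helper pick_one is hypothetical. It indexes with sample.int() because sample(x) treats a single number x as 1:x, which would give wrong rows for ids that occur only once.

```r
df <- data.frame(id = c(1L, 1L, 2L, 2L, 3L),
                 V1 = c("a", "b", "a", "c", "a"),
                 V2 = c(1L, 2L, 2L, 3L, 4L))

# Hypothetical helper: choose one element of an index vector at random
pick_one <- function(idx) idx[sample.int(length(idx), 1)]

set.seed(1)  # only so that the example is reproducible

# For every id, collect its row numbers and keep one of them
rows <- tapply(seq_len(nrow(df)), df$id, pick_one)

# The result has one row per id, already in id order
result <- df[as.integer(rows), ]
result
```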

Resources