Grouping and counting instances in R

Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following data frame
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
turns into this (note: y is the value being counted):
EDIT: to explain the transformation: x is the variable I'm grouping by; for each group I want to count how many times 0, 1, and 2 occur in the other columns. For example, the first row of the transformed data frame counts how many times the value 0 (y) appeared within group x = 1: once in column a, twice in column b, and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0

An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary here because the default aggregation function is length. If you do not specify one explicitly, you will get a warning:
Aggregation function missing: defaulting to length
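To silence that warning, you can pass the aggregation function explicitly (a sketch of the same call):
dt.new <- dcast(melt(setDT(df), id.vars = "x"), x + value ~ variable,
                fun.aggregate = length)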
Furthermore, if you do not explicitly convert the data frame to a data.table, data.table will fall back to reshape2 (see the explanation from @Arun in the comments). Consequently this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)

I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
  gather(variable, value, -x) %>%
  count(x, variable, value) %>%
  spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
This format allows you to use count to tally how often each value occurs in columns a to c. After that, you reshape the dataset back to the required format using spread. (The intermediate count result is sketched below.)
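For reference, a sketch of what the intermediate count step produces: one row per (x, variable, value) combination, with the frequency in column n:
df %>%
  gather(variable, value, -x) %>%
  count(x, variable, value) %>%
  head(3)
# x variable value n
# 1        a     0 1
# 1        a     1 1
# 1        a     2 1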

Related

aggregating repeat entries prior to summarizing on other variables

df <- data.frame(id = c(1,2,3,4,3,4,3,1,1,2),
                 event = c('a','a','a','a','a','a','a','b','b','b'),
                 var1 = c(1,1,0,0,1,0,1,1,0,1),
                 var2 = c(1,0,0,0,1,1,0,1,1,0),
                 var_total = c(1,1,0,0,1,1,1,1,1,1))
df
id event var1 var2 var_total
1 1 a 1 1 1
2 2 a 1 0 1
3 3 a 0 0 0
4 4 a 0 0 0
5 3 a 1 1 1
6 4 a 0 1 1
7 3 a 1 0 1
8 1 b 1 1 1
9 1 b 0 1 1
10 2 b 1 0 1
I am reformatting/cleaning data from a data entry site that produces very diagonal data. I have gotten it into a manageable form, but I still have one problem. There are events that repeat, and I would like each id/event combination to be unique. As you can see, rows 5 and 6 duplicate earlier id/event combinations (rows 3 and 4), but their variable values are not identical. The variables are binary responses (yes = 1, no = 0), but if there is any yes in the event, the variable should be 1. Additionally, the 'var_total' column should be 1 if ANY of the variables are positive.
My data set has 77 of these 'repeat events' out of over 6000 entries, and it's likely to change every time more data is entered. How do I isolate ids with repeat events so I can aggregate() (?summarise) them and be sure it's done correctly? There are over 15 variables. I need to report the number of ids per event for all variables.
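One quick way to see which id/event combinations repeat before aggregating (a diagnostic sketch, not part of the answer below):
library(dplyr)
df %>% count(id, event) %>% filter(n > 1)
# id event n
#  1     b 2
#  3     a 3
#  4     a 2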
library(dplyr)
df %>%
  group_by(id, event) %>%
  summarize(
    across(var1:var2, ~ +any(. > 0)),
    var_total = +((var1 + var2) > 0),
    .groups = "drop"
  ) %>%
  arrange(event, id)
# # A tibble: 6 x 5
# id event var1 var2 var_total
# <dbl> <chr> <int> <int> <int>
# 1 1 a 1 1 1
# 2 2 a 1 0 1
# 3 3 a 1 1 1
# 4 4 a 0 1 1
# 5 1 b 1 1 1
# 6 2 b 1 0 1
Notes:
The arrange is purely to restore the order you had in your question; it is not required for the operation of the code.
summarize is wiping out var_total and then recreating it based on the logic you stated (either var* is 1).
I could just as easily have used across(var1:var2, max) instead of ~ +any(. > 0); it produces the same results here. I show the ~ any version purely to demonstrate something a little more complex than max (the max variant is sketched below).
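For comparison, a sketch of the max variant; on this 0/1 data it yields the same result:
df %>%
  group_by(id, event) %>%
  summarize(
    across(var1:var2, max),
    var_total = +((var1 + var2) > 0),
    .groups = "drop"
  ) %>%
  arrange(event, id)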

Arrange rows by pairs in R based on two columns

I need to order a data table by pairs of users who sent messages; the data looks like the sample constructed in the answers below.
I want to rearrange rows so that I can see how many messages users exchanged with each other. If one user sent a message but the other one did not respond, I need a value of 0 in the column Messages_sent.
As a next step, I need to calculate the conversation length between two users, i.e. sum Messages_sent for every two lines.
Please advise how I can rearrange the data table!
With dplyr, this code should produce the table given in your description. If you only want to sum the counts in both directions, the first statement (the merge) already contains everything you need.
library(dplyr)
df <- merge(df, df,
            by.x = c("from_id", "to_id"), by.y = c("to_id", "from_id"),
            all.x = TRUE, all.y = TRUE)
df <- mutate(df,
             Messages_sent.x = coalesce(Messages_sent.x, 0),
             Messages_sent.y = coalesce(Messages_sent.y, 0))
df$row <- 1:nrow(df)
rbind(select(df, -Messages_sent.y) %>%
        rename(Messages_sent = Messages_sent.x),
      select(df, -Messages_sent.x) %>%
        rename(Messages_sent = Messages_sent.y, from_id = to_id, to_id = from_id)
) %>% arrange(row) %>% select(-row)
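As an aside, the self-merge can also be written with dplyr's full_join (a sketch; equivalent to merge(..., all = TRUE) above):
library(dplyr)
df.joined <- full_join(df, df,
                       by = c("from_id" = "to_id", "to_id" = "from_id"))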
Here are the steps using base R functions:
df <- data.frame(
  from_id = c(624227, 624227, 624227, 624227, 624227, 624227,
              667255, 667255, 667255, 7134655, 713465),
  to_id = c(352731, 693915, 184455, 771100, 503940, 91558,
            626814, 857601, 862512, 156874, 419242),
  message_sent = c(1, 6, 2, 1, 1, 1, 2, 7, 3, 1, 1))
# merge dataset together with itself swapping from_id and to_id columns
df.full <- merge(df,df, by.x=c("from_id","to_id"), by.y=c("to_id","from_id"),suffixes = c(".x",".y"), all=TRUE)
# fill missing values with 0
# those records will correspond to all the pairs where
# someone did not send any messages back
df.full[is.na(df.full)] <- 0
# calculate total number of messages for each pair:
df.full$total <- df.full$message_sent.x + df.full$message_sent.y
head(df.full)
# from_id to_id message_sent.x message_sent.y total
# 1 91558 624227 0 1 1
# 2 156874 7134655 0 1 1
# 3 184455 624227 0 2 2
# 4 352731 624227 0 1 1
# 5 419242 713465 0 1 1
# 6 503940 624227 0 1 1
For very large datasets, base R functions might be slow; in that case you can look into the dplyr library (for most of these steps it has similar functions):
library(dplyr)
df.full.2 <- merge(df, df # merge dataframe and switched one
                   , by.x = c("from_id", "to_id"), by.y = c("to_id", "from_id")
                   , all.x = TRUE, all.y = TRUE) %>%
  mutate(message_sent.x = coalesce(message_sent.x, 0), # replace NAs with 0
         message_sent.y = coalesce(message_sent.y, 0)) %>%
  mutate(total = rowSums(.[3:4])) # calculate total number of messages
head(df.full.2)
# from_id to_id message_sent.x message_sent.y total
#1 91558 624227 0 1 1
#2 156874 7134655 0 1 1
#3 184455 624227 0 2 2
#4 352731 624227 0 1 1
#5 419242 713465 0 1 1
#6 503940 624227 0 1 1
If it's important to have the records for each pair follow each other, you can also add the following code:
df.full.3 <- df.full.2 %>%
  mutate(pair.id = sprintf("%06d%06d", pmin(from_id, to_id),
                           pmax(from_id, to_id))) %>%
  arrange(pair.id) %>% select(-pair.id)
head(df.full.3)
# from_id to_id message_sent.x message_sent.y total
#1 91558 624227 0 1 1
#2 624227 91558 1 0 1
#3 156874 7134655 0 1 1
#4 7134655 156874 1 0 1
#5 184455 624227 0 2 2
#6 624227 184455 2 0 2
There is also the data.table package, which is very efficient for very large datasets:
library(data.table)
# convert dataframe to datatable
setDT(df)
df.full <- merge(df, df, by.x = c("from_id", "to_id"), by.y = c("to_id", "from_id"),
                 suffixes = c(".x", ".y"), all = TRUE)
# substitute NAs with zeros
for (j in 3:4) set(df.full, which(is.na(df.full[[j]])), j, 0)
# calculate the total number of messages
df.full[, total := message_sent.x + message_sent.y]
head(df.full)
# from_id to_id message_sent.x message_sent.y total
# 1: 91558 624227 0 1 1
# 2: 156874 7134655 0 1 1
# 3: 184455 624227 0 2 2
# 4: 352731 624227 0 1 1
# 5: 419242 713465 0 1 1
# 6: 503940 624227 0 1 1
Depending on the size of your dataset one of these methods might be more efficient than the other two.
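If performance matters, a sketch for timing the merge step with the microbenchmark package (assumed installed); the dplyr and data.table pipelines can be added as further arguments for a direct comparison:
library(microbenchmark)
microbenchmark(
  base = merge(df, df, by.x = c("from_id", "to_id"),
               by.y = c("to_id", "from_id"), all = TRUE),
  times = 100
)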

Collecting success and total from incomplete binary groups in dplyr

Let's say I have the following:
>blob
id group growth
1 A 1
2 A 1
3 B 0
4 B 1
5 B 0
6 C 0
7 C 0
8 C 0
I would like to eventually pull the number of successes out of the totals. I have gotten this far:
blob %>%
  group_by(group, growth) %>%
  tally()
group growth n
A 1 2
B 0 2
B 1 1
C 0 3
I would like to have something like
group success total
A 2 2
B 1 3
C 0 3
I have also tried
blob %>%
  group_by(group, growth) %>%
  tally() %>%
  summarise(fail = n[factor(growth) == 1], total = sum(n))
but I get an error because not every group contains a growth value equal to 1.
n() is a dplyr function that counts the number of rows. If we group_by the group column, we can use n() to count the rows per group and sum to add up the number of successes.
library(dplyr)
dt2 <- dt %>%
  group_by(group) %>%
  summarise(success = sum(growth), total = n())
Data Preparation
dt <- read.table(text = "id group growth
1 A 1
2 A 1
3 B 0
4 B 1
5 B 0
6 C 0
7 C 0
8 C 0",
header = TRUE, stringsAsFactors = FALSE)
Here's a simple example with data.table
require(data.table)
setDT(df1)
df1[, .(success = sum(growth), total = .N), by=group]
group success total
1: A 2 2
2: B 1 3
3: C 0 3
# apply both sum and length to growth within each group
a <- Map(tapply, list(dt$growth), list(dt$group), c(sum, length))
`names<-`(do.call(cbind.data.frame, a), c("Successes", "Totals"))
Successes Totals
A 2 2
B 1 3
C 0 3
You can use mapply instead of Map:
mapply(tapply,list(dt$growth),list(dt$group),c(sum,length))
[,1] [,2]
A 2 2
B 1 3
C 0 3
Then you can assign whatever names you want to the columns. (Note that mapply returns a matrix here, so convert it to a data frame, as sketched below.)
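A sketch of that conversion:
m <- mapply(tapply, list(dt$growth), list(dt$group), c(sum, length))
res <- as.data.frame(m) # matrix -> data frame
names(res) <- c("success", "total")
res
#   success total
# A       2     2
# B       1     3
# C       0     3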

propagate changes down a column

I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A == 0, lag(B), B)) then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
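For the fully general self-referential case, one option (not used in the answers below) is base R's Reduce with accumulate = TRUE, which threads the running value through each row; a minimal sketch:
dat <- data.frame(A = c(1,0,0,0,1), B = c(0,1,1,1,1))
dat$B <- Reduce(
  function(prev, i) if (dat$A[i] == 0) prev else dat$B[i],
  seq_len(nrow(dat))[-1], # rows 2..n
  init = dat$B[1],
  accumulate = TRUE
)
dat$B
# [1] 0 0 0 0 1
For this particular problem, though, the fill-based answers below are simpler and faster.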
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
       B = ifelse(A == 0, NA, B),
       B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
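One caveat (my addition, not from the original answer): na.locf() drops leading NAs by default, so if the very first row had A == 0 the result would be shorter than the input. Passing na.rm = FALSE keeps the length intact, leaving the leading NAs in place:
mutate(dat,
       B = ifelse(A == 0, NA, B),
       B = zoo::na.locf(B, na.rm = FALSE))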
We could use fill from tidyr after changing to NA the 'B' values that correspond to 0 in 'A'. (NA^(!A) is NA when A is 0 and 1 otherwise, so NA^(!A)*B sets exactly those values to NA.)
library(dplyr)
library(tidyr)
dat %>%
  mutate(B = NA^(!A) * B) %>%
  fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
Here's a solution using grouping and rleid (run-length encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, while rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid if A == 1. Then we group and take the first B value of the group for every case where A == 0.
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
  mutate(grp = data.table::rleid(A),
         grp = ifelse(A == 1, grp + c(diff(grp), 0), grp)) %>%
  group_by(grp) %>%
  mutate(B = ifelse(A == 0, B[1], B)) # EDIT: always carry forward B when A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, I switched the condition: it should be if all A != 1, not if not all A == 1.)
set.seed(30)
dat <- data.frame(A=sample(0:1,15,replace = TRUE),
B=sample(0:1,15,replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7
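If you don't need the helper column afterwards, you can drop it (a small follow-up sketch):
dat %>% ungroup() %>% select(-grp)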

Splitting one column into multiple columns in R, with a logical value if true

I am trying to split one column of a data frame into multiple columns whose names are the values from the original column. Then, if there is an occurrence of that value in the original row, the new column should get a 1, or 0 if there is no match. I realize this is not the best way to explain, so for example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and I would like to expand it to wide format with 1s and 0s (or TRUE and FALSE), something like:
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into the separate function from tidyr and the cast function from reshape2, but I seem to be getting hung up on assigning the logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
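If you want the bare A/B/C/D column names from your desired output, you can strip the prefix afterwards (a sketch):
out <- cSplit_e(data = df, split.col = "Location", sep = "/",
                type = "character", drop = TRUE, fill = 0)
names(out) <- sub("^Location_", "", names(out))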
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
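Since you mentioned tidyr: a sketch of the same result using separate_rows (tidyr >= 0.5) together with spread, as an alternative to the approaches above:
library(dplyr)
library(tidyr)
df %>%
  mutate(flag = 1L) %>%                    # indicator value to spread
  separate_rows(Location, sep = "/") %>%   # one row per Location value
  spread(Location, flag, fill = 0L)
#   subject A B C D
# 1       1 1 0 0 0
# 2       2 1 1 0 0
# 3       3 0 1 1 1
# 4       4 1 1 1 1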
