I am attempting to do two things to a dataset I currently have:
ID IV1 DV1 DV2 DV3 DV4 DV5 DV6 DV7
1 97330 3 0 0 0 0 0 1 0
2 118619 0 0 0 0 0 1 1 0
3 101623 2 0 0 0 0 0 0 0
4 202626 0 0 0 0 0 0 0 0
5 182925 1 1 0 0 0 0 0 0
6 179278 1 0 0 0 0 0 0 0
Find the number of unique combinations of the 7 binary variables (DV1 - DV7).
Find the sum of the count variable (IV1) within each unique group.
I have been able to determine the number of unique column combinations by using the following:
uniq <- unique(dat[,c('DV1','DV2','DV3','DV4','DV5','DV6','DV7')])
This indicates there are 101 unique combinations present in the dataset. What I haven't been able to figure out is how to sum the variable "IV1" within each unique group. I've been reading around on this site, and I'm fairly certain there is an easy dplyr answer to this, but it has eluded me so far.
NOTE: I'm essentially trying to find an R solution for performing a "conjunctive analysis" as described in this paper. There is sample code for SPSS, SAS and Stata at the end of the paper.
library(dplyr)
group_by(dat, DV1, DV2, DV3, DV4, DV5, DV6, DV7) %>%
summarize(sumIV1 = sum(IV1))
The number of rows in the result is the number of unique combinations present in your data. The sumIV1 column, of course, has the group-wise sum of IV1.
Thanks to Frank in the comments, we can use strings with group_by_ to simplify:
group_by_(dat, .dots = paste0("DV", 1:7)) %>%
summarize(sumIV1 = sum(IV1))
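A note for newer dplyr versions: group_by_ has since been deprecated. Assuming dplyr 1.0 or later, the same string-driven grouping can be written with across and all_of (a sketch, not part of the original answer):
library(dplyr)

# build the grouping columns from strings; across(all_of(...)) is the
# modern replacement for the group_by_(.dots = ...) idiom
dat %>%
  group_by(across(all_of(paste0("DV", 1:7)))) %>%
  summarize(sumIV1 = sum(IV1), .groups = "drop")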
Here's a reproducible example:
library(data.table)
DT <- data.table(X = c(1, 1, 1, 1), Y = c(2, 2, 3, 4), Z = c(1, 1, 3, 1))
Where X, Y ... are your columns.
Then use the Reduce function:
DT[, join_grp := Reduce(paste,list(X,Y,Z))]
This gives:
DT
X Y Z join_grp
1: 1 2 1 1 2 1
2: 1 2 1 1 2 1
3: 1 3 3 1 3 3
4: 1 4 1 1 4 1
And we can find:
unique(DT[, join_grp])
[1] "1 2 1" "1 3 3" "1 4 1"
For the sums:
DT[ , sum(X), by = join_grp]
Just put whichever column you want to sum in place of X.
Concise Solution
DT[, join_grp := Reduce(paste,list(X,Y,Z))][ , sum(X), by = join_grp]
or
DT[ , sum(X), by = list(Reduce(paste,list(X,Y,Z)))]
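If you want the sum column named rather than left as the default V1, data.table's .() list syntax handles that (a small variation on the line above; sumX is an assumed name):
# name the aggregate column explicitly instead of getting V1
DT[, .(sumX = sum(X)), by = list(Reduce(paste, list(X, Y, Z)))]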
My question is best expressed with the below example. Let's start with the below dataframe:
> myData
Name Group Code
1 R 0 0
2 R 0 2
3 T 0 2
4 R 0 0
5 N 1 3
6 N 1 0
7 T 0 4
myData <-
data.frame(
Name = c("R","R","T","R","N","N","T"),
Group = c(0,0,0,0,1,1,0),
Code = c(0,2,2,0,3,0,4)
)
Now, I'd like to add a column, CodeGrp: if a row's Group is > 0, allocate the max Code for that Group to every row with the same Group number; otherwise just copy over Code. The result should look like this (note that only one member of any Group > 0 can have a Code > 0, with the rest at 0, so maybe there is something simpler than my proposed max):
Name Group Code CodeGrp Explain
1 R 0 0 0 Copy over Code since Group = 0
2 R 0 2 2 Copy over Code since Group = 0
3 T 0 2 2 Copy over Code since Group = 0
4 R 0 0 0 Copy over Code since Group = 0
5 N 1 3 3 Group is > 0 so insert in CodeGrp column the max Code in this Group
6 N 1 0 3 Group is > 0 so insert in CodeGrp column the max Code in this Group
7 T 0 4 4 Copy over Code since Group = 0
Any recommendations for how to do this, in a simple manner, using base R or dplyr?
Here is one approach. Group the data by Group, then assign CodeGrp conditionally: where Group is 0, keep the value of Code; otherwise use the maximum Code within the group.
group_by(myData, Group) %>%
mutate(CodeGrp = if_else(Group == 0, Code, max(Code)))
Name Group Code CodeGrp
<chr> <dbl> <dbl> <dbl>
1 R 0 0 0
2 R 0 2 2
3 T 0 2 2
4 R 0 0 0
5 N 1 3 3
6 N 1 0 3
7 T 0 4 4
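Since the question also allows base R, here is a minimal sketch of the same logic using ave (assuming the myData frame defined above):
# keep Code as-is where Group == 0; otherwise take the group maximum
myData$CodeGrp <- with(myData,
  ifelse(Group == 0, Code, ave(Code, Group, FUN = max)))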
I've been trying to solve this issue for too long now. I have binary insect outbreak data in annual time series format for 300+ years (rows) and 70+ trees (columns).
I'd like to conditionally fill a dataframe / matrix / data table of the same dimensions with cumulative sums, and have it reset to 0 at the end of each outbreak period. I've found a wealth of similar questions / answers that I just can't seem to translate to my issue.
A snippet of the dataframe looks like this:
t1 t2 t3 t4 t5
2000 1 0 0 1 0
2001 1 0 0 0 1
2002 1 1 0 0 1
2003 0 1 0 1 1
2004 1 1 1 1 1
And I want to create a new df that looks like this:
t1 t2 t3 t4 t5
2000 1 0 0 1 0
2001 2 0 0 0 1
2002 3 1 0 0 2
2003 0 2 0 1 3
2004 1 3 1 2 4
I feel I've gotten close with both data.table and the base rle function, although I've also been going in circles (I'm pretty sure I got it working for a single column once, but now can't remember what I did, or why I couldn't get it to work in a loop over all columns).
I've only gotten the following methods to work to some extent, usually for a single column, or by adding one df on top of a shifted df, so that a single column might look like 0 1 2 2 1 0 instead of 0 1 2 3 4 0. My attempts have been variations on code like this:
setDT(dt)[, new := t1 + shift(t1, fill = 0)]

apply(rle(matrix)$lengths, 2, seq)

rle(matrix[, 1])$lengths

for (i in 1:dim(dt)[1]) {
  for (j in 1:dim(dt)[2]) {
    cols <- names(dt)  # tried in place of .SD with negative results
    if (dt[i, j] == 1) {
      dt[, new := .SD + shift(.SD, 1L, fill = 0, type = "lag", give.names = TRUE)]
    } else { dt }
  }
}
Some of the main SO sources I've used include these pages: data.table, dplyr, rle
Let me know if I'm missing any important info (I'm new!). And thank you so much for any help!
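For reference, the snippet above can be reconstructed like this (df1 is an assumed object name, matching the one used in the answers below):
df1 <- data.frame(t1 = c(1, 1, 1, 0, 1),
                  t2 = c(0, 0, 1, 1, 1),
                  t3 = c(0, 0, 0, 0, 1),
                  t4 = c(1, 0, 0, 1, 1),
                  t5 = c(0, 1, 1, 1, 1),
                  row.names = 2000:2004)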
We can use rle with sequence from base R:
df2 <- df1 #create a copy of df1
#loop through the columns of 'df2', apply the `rle`, get the 'sequence'
#of 'lengths' and multiply with the column values.
df2[] <- lapply(df2, function(x) sequence(rle(x)$lengths)*x)
df2
# t1 t2 t3 t4 t5
#2000 1 0 0 1 0
#2001 2 0 0 0 1
#2002 3 1 0 0 2
#2003 0 2 0 1 3
#2004 1 3 1 2 4
You can use data.table combined with the ave function to calculate the cumsum of each column grouped by the rleid of the column itself:
library(data.table)
setDT(dt)[, names(dt) := lapply(.SD, function(col) ave(col, rleid(col), FUN = cumsum))][]
# t1 t2 t3 t4 t5
#1: 1 0 0 1 0
#2: 2 0 0 0 1
#3: 3 1 0 0 2
#4: 0 2 0 1 3
#5: 1 3 1 2 4
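If you prefer to stay in dplyr, the same run-based cumsum can be expressed with across (a sketch assuming dplyr >= 1.0 and the df1 constructed above):
library(dplyr)

# cumulative sum within each run of identical values, per column;
# runs of 0 sum to 0, runs of 1 count upwards and reset afterwards
df1 %>%
  mutate(across(everything(),
                ~ ave(.x, data.table::rleid(.x), FUN = cumsum)))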
I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A == 0, lag(B), B)) then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
B = ifelse(A == 0, NA, B),
B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
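One caveat: zoo::na.locf removes leading NAs by default, so if the very first row had A == 0 the result would be shorter than the input. Passing na.rm = FALSE keeps the length (a defensive variant, not from the original answer):
# keep leading NAs instead of dropping them
mutate(dat, B = zoo::na.locf(ifelse(A == 0, NA, B), na.rm = FALSE))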
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
We could use fill from tidyr after changing to NA the 'B' values that correspond to 0 in 'A'. The expression NA^(!A)*B does this in one step: when A is 0, !A is TRUE and NA^1 is NA; when A is 1, NA^0 is 1, leaving B unchanged.
library(dplyr)
library(tidyr)
dat %>%
mutate(B = NA^(!A)*B) %>%
fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
Here's a solution using grouping and rleid (run-length encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, while rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid where A == 1. Then we group and take the first B-value of the group for every case where A == 0.
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
mutate(grp = data.table::rleid(A),
grp = ifelse(A == 1, grp + c(diff(grp),0),grp)) %>%
group_by(grp) %>%
mutate(B = ifelse(A == 0, B[1],B)) # EDIT: Always carry forward B on A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, I switched the condition: it should be if all A != 1, not if not all A == 1.)
set.seed(30)
dat <- data.frame(A=sample(0:1,15,replace = TRUE),
B=sample(0:1,15,replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7
Is it possible to group by one column and count instances of values in all other columns using R (dplyr)? For example, the following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
Turns into this (note: y is the value being counted):
EDIT: to explain the transformation: x is what I'm grouping by; for each group, I want to count how many times 0, 1 and 2 appear in the other columns. In the first row of the transformed dataframe, we counted how many times the value 0 appears for x = 1 in the other columns (y): once in column a, twice in column b, and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary in this case, as the default aggregation function is length. Without an explicit aggregation function, you will get a warning about that:
Aggregation function missing: defaulting to length
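To silence the warning you can spell out the default explicitly (same result, with fun.aggregate named):
dt.new <- dcast(melt(setDT(df), id.vars = "x"),
                x + value ~ variable, fun.aggregate = length)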
Furthermore, if you do not explicitly convert the dataframe to a data table, data.table will redirect to reshape2 (see the explanation from #Arun in the comments). Consequently this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
This allows you to use count to get the information on how often certain values occur in columns a to c. After that, you reshape the dataset to the required format using spread.
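A side note: gather and spread have since been superseded in tidyr. Assuming tidyr 1.0 or later, the same pipeline can be written with pivot_longer and pivot_wider:
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-x, names_to = "variable", values_to = "value") %>%
  count(x, variable, value) %>%
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)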
I want a simple way to create a new variable in an R data frame indicating whether a boolean is ever true within a group.
Here is an example:
Suppose the dataset contains two relevant variables (among others), 'a' and 'b': 'a' defines a group, while 'b' is a boolean with values TRUE (1) or FALSE (0). I want to create a variable 'c', also a boolean, which is 1 for all rows of groups in which 'b' is TRUE at least once, and 0 for all rows of groups in which 'b' is never TRUE.
From entries like below:
a b
-----
1 1
2 0
1 0
1 0
1 1
2 0
2 0
3 0
3 1
3 0
I want to get variable 'c' like below:
a b c
-----------
1 1 1
2 0 0
1 0 1
1 0 1
1 1 1
2 0 0
2 0 0
3 0 1
3 1 1
3 0 1
-----------
I know how to do it in Stata, but I haven't done similar things in R yet, and it is difficult to find information on that on the internet.
In fact, I am doing this only in order to later remove all observations for which 'c' is 0, so other suggestions would be fine as well. The application relates to multinomial logit estimation, where never-chosen alternatives need to be removed from the dataset before estimation.
If X is your data frame:
library(dplyr)
X <- X %>%
group_by(a) %>%
mutate(c = any(b == 1))
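Since the stated end goal is to drop the rows of groups where 'b' is never TRUE, a grouped filter gets there directly without creating c (a sketch using the same grouping):
X %>%
  group_by(a) %>%
  filter(any(b == 1)) %>%   # keep only groups where b is ever 1
  ungroup()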
A base R option would be
df1$c <- with(df1, ave(b, a, FUN=any))
Or
library(sqldf)
sqldf('select * from df1
left join(select a, b,
(sum(b))>0 as c
from df1
group by a)
using(a)')
Simple data.table approach
require(data.table)
data <- data.table(data)
data[, c := any(b), by = a]
Even though logical and numeric (0-1) columns behave identically for all intents and purposes, if you'd like a numeric result you can simply wrap the call to any with as.numeric.
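For example, a one-line numeric variant of the call above (names as in the answer):
data[, c := as.numeric(any(b)), by = a]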
An answer with base R, assuming a and b are in dataframe x.
The c value is a 1-to-1 mapping from a, so I first create that mapping:
cmap <- ifelse(sapply(split(x, x$a), function(x) sum(x[, "b"])) > 0, 1, 0)
Then just add in the mapped value into the data frame
x$c <- cmap[x$a]
Final output
> x
a b c
1 1 1 1
2 2 0 0
3 1 0 1
4 1 0 1
5 1 1 1
6 2 0 0
7 2 0 0
8 3 0 1
9 3 1 1
10 3 0 1