R: Aggregate and create columns based on counts [duplicate] - r

This question already has answers here:
Frequency counts in R [duplicate]
(2 answers)
Closed 7 years ago.
I'm sure this question has been asked before, but I can't seem to find an answer anywhere, so I apologize if this is a duplicate.
I'm looking for R code that allows me to aggregate a variable in R, but while doing so creates new columns that count instances of levels of a factor.
For example, let's say I have the data below:
Week Var1
1 a
1 b
1 a
1 b
1 b
2 c
2 c
2 a
2 b
2 c
3 b
3 a
3 b
3 a
First, I want to aggregate by week. I'm sure this can be done with group_by in dplyr. I then need to be able to cycle through the data and create a new column each time a new level appears in Var1. Finally, I need counts of each level of Var1 within each week. Note that I could probably figure out a way to do this manually, but I'm looking for an automated solution, as I will have thousands of unique values in Var1. The result would be something like this:
Week a b c
1 2 3 0
2 1 1 3
3 2 2 0

I think from the way you worded your question, you've been looking for the wrong thing, or for something too complicated. It's a simple data-reshaping problem (long to wide), and as such can be solved with dcast from reshape2:
library(reshape2)
# create a wide data frame (from long), counting the rows in each Week/Var1 cell
res <- dcast(data, Week ~ Var1, value.var = "Var1",
             fun.aggregate = length)
> res
  Week a b c
1    1 2 3 0
2    2 1 1 3
3    3 2 2 0
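For comparison, the same counts can be obtained without any package. The following is a minimal sketch in base R, assuming the example data from the question is stored in a data frame called data (the names counts, wide and res_base are only illustrative):

```r
# Rebuild the example data from the question
data <- data.frame(
  Week = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  Var1 = c("a", "b", "a", "b", "b", "c", "c", "a", "b", "c", "b", "a", "b", "a")
)

# Cross-tabulate Week against Var1; combinations that never occur are counted as 0
counts <- table(data$Week, data$Var1)

# Turn the contingency table into a wide data frame with one column per level
wide <- as.data.frame.matrix(counts)
res_base <- cbind(Week = as.numeric(rownames(wide)), wide)
rownames(res_base) <- NULL

res_base
```

Because table() discovers the levels itself, no column has to be named by hand, which matters when Var1 has thousands of unique values.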

Related

How do I use the tidyverse packages to get a running total of unique values occurring in a column? [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 3 years ago.
I'm trying to use the tidyverse (whatever package is appropriate) to add a column (via mutate()) that is a running total of the unique values that have occurred in the column so far. Here is some toy data, showing the desired output.
data.frame("n"=c(1,1,1,6,7,8,8),"Unique cumsum"=c(1,1,1,2,3,4,4))
Who knows how to accomplish this in the tidyverse?
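Here is an option with group_indices:
library(dplyr)
# group_indices() numbers the groups by the sorted values of n;
# passing columns to it this way is deprecated as of dplyr 1.0.0
df1 %>%
  mutate(unique_cumsum = group_indices(., n))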
# n unique_cumsum
#1 1 1
#2 1 1
#3 1 1
#4 6 2
#5 7 3
#6 8 4
#7 8 4
data
df1 <- data.frame("n"=c(1,1,1,6,7,8,8))
Here's one way, using the fact that a factor assigns a sequential integer code to each unique value (in sorted order), and then extracting those underlying codes with as.numeric:
data.frame("n"=c(1,1,1,6,7,8,8)) %>% mutate(unique_cumsum=as.numeric(factor(n)))
n unique_cumsum
1 1 1
2 1 1
3 1 1
4 6 2
5 7 3
6 8 4
7 8 4
Another solution:
df <- data.frame("n"=c(1,1,1,6,7,8,8))
df <- df %>% mutate(`unique cumsum` = cumsum(!duplicated(n)))
This should work even if your data is not sorted.
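The approaches in these answers agree on the sorted toy data but diverge once the column is unsorted. The sketch below uses a made-up unsorted vector (not from the question) to show the difference; match(n, unique(n)) is added here only as a third point of comparison.

```r
# Hypothetical unsorted input, chosen to make the approaches disagree
n <- c(8, 1, 1, 6, 8, 7)

# Factor codes: rank of each value among the sorted unique values
by_value <- as.numeric(factor(n))

# Running total: how many distinct values have been seen so far
running_total <- cumsum(!duplicated(n))

# Order of first appearance: a stable id per value, not a running total
first_seen <- match(n, unique(n))

data.frame(n, by_value, running_total, first_seen)
```

Only running_total is non-decreasing down the column, which is what a running total of unique values requires; the other two are per-value group ids.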

Go through a column and collect a running total in new column [duplicate]

This question already has answers here:
Creation of a specific vector without loop or recursion in R
(2 answers)
Split data.frame by value
(2 answers)
Closed 4 years ago.
I have a dataframe whose rows represent people. For a given family, the first row has the value 1 in column A, and all following rows contain members of the same family until another row has the value 1 in column A. Then a new family starts.
I would like to assign IDs to all families in my dataset. In other words, I would like to take:
A
1
2
3
1
3
3
1
4
And turn it into:
A family_id
1 1
2 1
3 1
1 2
3 2
3 2
1 3
4 3
I'm playing with a dataframe of 3 million rows, so a simple for-loop solution I came up with falls short of necessary efficiency. Also, the family_id need not be sequential.
I'll take a dplyr solution.
data:
df <- data.frame(A = c(1:3,1,3,3,1,4))
code:
df$family_id <- cumsum(c(-1, diff(df$A)) < 0)
result:
# A family_id
#1 1 1
#2 2 1
#3 3 1
#4 1 2
#5 3 2
#6 3 2
#7 1 3
#8 4 3
Please note:
This solution starts a new group whenever a number is smaller than the previous one.
If it's 100% certain that a new group always begins with a 1, then ronak's solution is perfect.
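If every family is guaranteed to start with A == 1, the boundary can also be tested directly. The sketch below is an illustration under that assumption and is not necessarily the answer referred to above; the vector A2 is made-up data showing where the two rules disagree.

```r
df <- data.frame(A = c(1:3, 1, 3, 3, 1, 4))

# Rule 1: a new family starts wherever the value drops
by_drop <- cumsum(c(-1, diff(df$A)) < 0)

# Rule 2: a new family starts wherever A equals 1 (assumes this always holds)
by_one <- cumsum(df$A == 1)

# On the question's data both rules give the same ids
df$family_id <- by_one

# Hypothetical data where a value drops inside a family (3 followed by 2)
A2 <- c(1, 3, 2, 1, 4)
drop_ids <- cumsum(c(-1, diff(A2)) < 0)  # splits at the 3 -> 2 drop as well
one_ids  <- cumsum(A2 == 1)              # splits only at the 1s
```

Both expressions are vectorised, so neither needs a loop on a data frame with millions of rows.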

Summing data for individuals over a series of rounds in R [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 6 years ago.
I am currently working on my masters thesis and part of my data analysis is in R. I am completely new to it and so am learning as I go along.
The experiments we are running consist of individuals playing a token allocation game, over a series of rounds.
I need to change the current csv file in R so that each individual appears in one row, with ingroup, outgroup and self giving summed over the 40 rounds they played.
Currently, the data frame is as follows:
id roundno tokenstoingroup tokenstooutgroup tokenstoself
0001 1 1 0 0
0001 2 0 1 0
0002 1 0 0 1
etc...
There are many participants (over a thousand), and every round's allocation for each participant is entered.
My question is:
How do I sum this up so that the data frame looks more like this?
id totalrounds tokenstoingroup tokenstooutgroup tokenstoself
0001 40 25 13 2
0002 40 13 13 14
etc...
As I have said, I am totally new to this. I have tried to look online for aggregating and summing things up, but I have no idea where to start with something a bit more complex like this.
You can use the aggregate function with cbind. As an example, let's create a data frame:
test <- data.frame('id'=rep(c('A','B','C'),each=2),'C1'=rep(1,6),'C2'=1:6)
> test
id C1 C2
1 A 1 1
2 A 1 2
3 B 1 3
4 B 1 4
5 C 1 5
6 C 1 6
Then:
test <- aggregate(cbind(C1,C2)~id,data=test,sum)
> test
id C1 C2
1 A 2 3
2 B 2 7
3 C 2 11
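We can use summarise_each from dplyr (summarise_each and funs are deprecated in later dplyr releases). Note that summing roundno adds up the round numbers rather than counting the rounds:
library(dplyr)
df1 %>%
  group_by(id) %>%
  summarise_each(funs(sum), roundno, tokenstoingroup, tokenstooutgroup, tokenstoself)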
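With a current dplyr, the same idea can be written with across(), and the number of rounds can be counted with n() instead of summed. This is a sketch assuming dplyr 1.0.0 or later; df1 here holds only the three sample rows shown in the question, and the name totals is illustrative.

```r
library(dplyr)

# The three sample rows from the question; id kept as character to preserve leading zeros
df1 <- data.frame(
  id = c("0001", "0001", "0002"),
  roundno = c(1, 2, 1),
  tokenstoingroup = c(1, 0, 0),
  tokenstooutgroup = c(0, 1, 0),
  tokenstoself = c(0, 0, 1)
)

totals <- df1 %>%
  group_by(id) %>%
  summarise(
    totalrounds = n(),                     # number of rows (rounds) per participant
    across(starts_with("tokensto"), sum),  # sum every token column
    .groups = "drop"
  )

totals
```

When reading the csv, the id column would need to be read as character (for example via colClasses in read.csv) if the leading zeros are to survive.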

R reset cumsum when it found 0 [duplicate]

This question already exists:
R ffdfdply reset cumsum using data.table
Closed 9 years ago.
I am using the ff package to load an excel file.
i=as.ffdf(data.frame(a=c(1,1,1,1,1,1), b=c(1,4,6,2,5,3), c=c(1,1,1,1,1,1), d=c(1,0,1,1,0,1)))
I am trying to get the cumulative sum of column d, resetting it whenever a 0 is encountered. I am trying to get the output below.
a b c d Result
1 1 1 1 1
1 4 1 0 0
1 6 1 1 1
1 2 1 1 2
1 5 1 0 0
1 3 1 1 1
I know I can easily achieve this with ddply, but I have a large data set, i.e. more than 5000000 rows.
Thanks
This works, though it is a little slow with 24385601 rows. I created a unique key from the combination of columns a and c and used Arun's solution. The key column (key_a_c) is used to split the data set, i.e. to reset the cumsum.
Create a unique key on columns a and c:
i$key_a_c <- ikey(i[c("a", "c")])
Generate the cumulative series by splitting on key_a_c:
p1 <- ffdfdply(i, split = as.character(i$key_a_c), FUN = function(x) {
  # number the positions within each run of equal d values; multiplying by d zeroes the 0 runs
  x$Result <- as.ff(x[, "d"] * sequence(rle(x[, "d"])$lengths))
  as.data.frame(x)
}, trace = TRUE)
Please share your views and code if you have a more optimized solution.
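The expression that does the resetting can be checked in isolation on a plain vector, without ff. The sketch below uses column d from the question; the ave() variant is an alternative added here for comparison and is not part of the answer above.

```r
d <- c(1, 0, 1, 1, 0, 1)

# rle() finds runs of equal values; sequence() numbers the positions within each run.
# Multiplying by d turns the positions inside the 0 runs back into 0.
result_rle <- d * sequence(rle(d)$lengths)

# Alternative: start a new group at every 0 and take the cumulative sum within each group
result_ave <- ave(d, cumsum(d == 0), FUN = cumsum)
```

Both variants assume d contains only 0s and 1s, as in the question; the rle() version counts run positions rather than summing, so it would give different results for other values.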
