Combining an individual and aggregate level data sets

Combining an individual and aggregate level data sets - r

I've got two different data frames, lets call them "Months" and "People".
Months looks like this:
Month Site X
1 1 4
2 1 3
3 1 5
1 2 10
2 2 7
3 2 5
and People looks like this:
ID Month Site
1 1 1
2 1 2
3 1 1
4 2 2
5 2 2
6 2 2
7 3 1
8 3 2
I'd like to combine them so essentially each time an entry in "People" has a particular Month and Site combination, it's added to the appropriate aggregated data frame, so I'd get something like the following:
Month Site X People
1 1 4 2
2 1 3 0
3 1 5 1
1 2 10 1
2 2 7 3
3 2 5 1
But I haven't the foggiest idea of how to go about doing that. Any suggestions?

Using base packages
> aggregate( ID ~ Month + Site, data=People, FUN = length )
Month Site ID
1 1 1 2
2 3 1 1
3 1 2 1
4 2 2 3
5 3 2 1
> res <- merge(Months, aggdata, all.x = TRUE)
> res
Month Site X ID
1 1 1 4 2
2 1 2 10 1
3 2 1 3 NA
4 2 2 7 3
5 3 1 5 1
6 3 2 5 1
> res[is.na(res)] <- 0
> res
Month Site X ID
1 1 1 4 2
2 1 2 10 1
3 2 1 3 0
4 2 2 7 3
5 3 1 5 1
6 3 2 5 1

Assuming your data.frames are months and people, here's a data.table solution:
require(data.table)
m.dt <- data.table(months, key=c("Month", "Site"))
p.dt <- data.table(people, key=c("Month", "Site"))
# one-liner
dt.f <- p.dt[m.dt, list(X=X[1], People=sum(!is.na(ID)))]
> dt.f
# Month Site X People
# 1: 1 1 4 2
# 2: 1 2 10 1
# 3: 2 1 3 0
# 4: 2 2 7 3
# 5: 3 1 5 1
# 6: 3 2 5 1

Related

Group by each increasing sequence in data frame

If I have a data frame with a column of monotonically increasing values such as:
x
1
2
3
4
1
2
3
1
2
3
4
5
6
1
2
How do I add a column to group each increasing sequence that results in:
x y
1 1
2 1
3 1
4 1
1 2
2 2
3 2
1 3
2 3
3 3
4 3
5 3
6 3
1 4
2 4
I can only think of using a loop which will be slow.

You may choose cumsum function to do it.
> x <- c(1,2,3,4,1,2,3,1,2,4,5,1,2)
> cumsum(x==1)
[1] 1 1 1 1 2 2 2 3 3 3 3 4 4

I would use diff and compute the cumulative sum:
df$y <- c(1, cumsum(diff(df$x) < 0 ) + 1)
> df
x y
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 2 2
7 3 2
8 1 3
9 2 3
10 3 3
11 4 3
12 5 3
13 6 3
14 1 4
15 2 4

Using R code to reorganize data frame by randomly selecting one row from each combination

I have a data frame that looks like this:
Subject N S
Sub1-1 3 1
Sub1-2 3 1
Sub1-3 3 1
Sub1-4 3 1
Sub2-1 3 1
Sub2-2 3 1
Sub2-3 3 1
Sub2-4 3 1
Sub3-1 3 2
Sub3-2 3 2
Sub3-3 3 2
Sub4-1 3 2
Sub4-2 3 2
Sub4-3 3 2
Sub5-1 3 2
Sub5-2 3 2
Sub6-1 1 1
Sub6-2 1 1
Sub6-3 1 1
Sub7-1 1 1
Sub7-2 1 1
Sub7-3 1 1
Sub8-1 1 1
Sub8-2 1 1
Sub8-3 1 2
Sub9-1 1 2
Sub9-2 1 2
Sub1-1 1 2
Sub1-2 1 2
Sub1-3 1 2
Sub5-1 1 2
Sub5-2 1 2
Sub1-5 2 1
Sub1-6 2 1
Sub1-7 2 1
Sub1-5 2 1
Sub2-6 2 1
Sub2-5 2 1
Sub2-6 2 1
Sub2-7 2 1
Sub3-8 2 2
Sub3-5 2 2
Sub3-6 2 2
Sub4-7 2 2
Sub4-5 2 2
Sub4-6 2 2
Sub5-7 2 2
Sub5-8 2 2
As you can see in this data frame there are 6 different combinations in the N and S columns, and 8 consecutive rows of each combination. I want to create a new data frame where one row from each combination (be it 3 & 1 or 1 & 2) is randomly selected and then put into a new data frame so there are 8 consecutive rows of each different combination. That way the entire data frame of all 48 rows is completely reorganized. Is this possible in R code?
Edit: The desired output would be something like this, but repeating until all 48 rows are full and the subject number for each row would have be random because it is a randomly selected row of each N & S combo.
Subject N S
3 1
1 1
3 2
1 2
2 2
2 1
2 2
3 2
2 1
1 1
3 1
1 2

A solution using functions from dplyr.
# Load package
library(dplyr)
# Set seed for reproducibility
set.seed(123)
# Process the data
dt2 <- dt %>%
group_by(N, S) %>%
sample_n(size = 1)
# View the result
dt2
## A tibble: 6 x 3
## Groups: N, S [6]
# Subject N S
# <chr> <int> <int>
#1 Sub6-3 1 1
#2 Sub5-1 1 2
#3 Sub1-5 2 1
#4 Sub5-8 2 2
#5 Sub2-4 3 1
#6 Sub3-1 3 2
Update: Reorganize the row
The following randomize all rows.
dt3 <- dt %>% slice(sample(1:n(), n()))
Data Preparation
dt <- read.table(text = "Subject N S
Sub1-1 3 1
Sub1-2 3 1
Sub1-3 3 1
Sub1-4 3 1
Sub2-1 3 1
Sub2-2 3 1
Sub2-3 3 1
Sub2-4 3 1
Sub3-1 3 2
Sub3-2 3 2
Sub3-3 3 2
Sub4-1 3 2
Sub4-2 3 2
Sub4-3 3 2
Sub5-1 3 2
Sub5-2 3 2
Sub6-1 1 1
Sub6-2 1 1
Sub6-3 1 1
Sub7-1 1 1
Sub7-2 1 1
Sub7-3 1 1
Sub8-1 1 1
Sub8-2 1 1
Sub8-3 1 2
Sub9-1 1 2
Sub9-2 1 2
Sub1-1 1 2
Sub1-2 1 2
Sub1-3 1 2
Sub5-1 1 2
Sub5-2 1 2
Sub1-5 2 1
Sub1-6 2 1
Sub1-7 2 1
Sub1-5 2 1
Sub2-6 2 1
Sub2-5 2 1
Sub2-6 2 1
Sub2-7 2 1
Sub3-8 2 2
Sub3-5 2 2
Sub3-6 2 2
Sub4-7 2 2
Sub4-5 2 2
Sub4-6 2 2
Sub5-7 2 2
Sub5-8 2 2",
header = TRUE, stringsAsFactors = FALSE)

Count with table() and exclude 0's

I try to count triplets; for this I use three vectors that are packed in a dataframe:
X=c(4,4,4,4,4,4,4,4,1,1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
Y=c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,3,4,2,2,2,2,3,4,1,1,2,2,3,3,4,4)
Z=c(4,4,5,4,4,4,4,4,6,1,1,1,1,1,1,1,2,2,2,2,7,2,3,3,3,3,3,3,3,3)
Count_Frame=data.frame(matrix(NA, nrow=(length(X)), ncol=3))
Count_Frame[1]=X
Count_Frame[2]=Y
Count_Frame[3]=Z
Counts=data.frame(table(Count_Frame))
There is the following problem: if I increase the value range in the vectors or use even more vectors the "Counts" dataframe quickly approaches its size limit due to the many 0-counts. Is there a way to exclude the 0-counts while generating "Counts"?

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(Count_Frame)), grouped by all the columns (.(X, Y, Z)), we get the number or rows (.N).
library(data.table)
setDT(Count_Frame)[,.N ,.(X, Y, Z)]
# X Y Z N
# 1: 4 1 4 7
# 2: 4 1 5 1
# 3: 1 1 6 1
# 4: 1 1 1 3
# 5: 1 2 1 2
# 6: 1 3 1 1
# 7: 1 4 1 1
# 8: 2 2 2 4
# 9: 2 3 7 1
#10: 2 4 2 1
#11: 3 1 3 2
#12: 3 2 3 2
#13: 3 3 3 2
#14: 3 4 3 2
Instead of naming all the columns, we can use names(Count_Frame) as well (if there are many columns)
setDT(Count_Frame)[,.N , names(Count_Frame)]

You can accomplish this with aggregate:
Count_Frame$one <- 1
aggregate(one ~ X1 + X2 + X3, data=Count_Frame, FUN=sum)
This will calculate the positive instances of table, but will not list the zero counts.

One solution is to create a combination of the column values and count those instead:
library(tidyr)
as.data.frame(table(unite(Count_Frame, tmp, X1, X2, X3))) %>%
separate(Var1, c('X1', 'X2', 'X3'))
Resulting output is:
X1 X2 X3 Freq
1 1 1 1 3
2 1 1 6 1
3 1 2 1 2
4 1 3 1 1
5 1 4 1 1
6 2 2 2 4
7 2 3 7 1
8 2 4 2 1
9 3 1 3 2
10 3 2 3 2
11 3 3 3 2
12 3 4 3 2
13 4 1 4 7
14 4 1 5 1

Or using plyr:
library(plyr)
count(Count_Frame, colnames(Count_Frame))
output
# > count(Count_Frame, colnames(Count_Frame))
# X1 X2 X3 freq
# 1 1 1 1 3
# 2 1 1 6 1
# 3 1 2 1 2
# 4 1 3 1 1
# 5 1 4 1 1
# 6 2 2 2 4
# 7 2 3 7 1
# 8 2 4 2 1
# 9 3 1 3 2
# 10 3 2 3 2
# 11 3 3 3 2
# 12 3 4 3 2
# 13 4 1 4 7
# 14 4 1 5 1

How to select only the last row among the subset of rows satisfying a condition in R programming

The dataframe looks like this :
Customer_id A B C D E F G
10000001 1 1 2 3 1 3 1
10000001 1 2 3 1 2 1 3
10000002 2 2 2 3 1 3 1
10000002 2 2 1 4 2 3 1
10000003 1 5 2 4 7 2 4
10000003 1 5 2 6 3 7 2
10000003 1 1 2 2 1 2 1
10000004 1 2 3 1 2 3 1
10000004 1 3 2 3 1 3 2
10000004 1 3 2 1 3 2 1
10000004 1 4 1 4 1 3 1
10000006 1 2 3 4 5 1 2
10000006 1 3 1 4 1 2 1
10000008 2 3 2 3 2 1 2
10000008 2 3 1 1 2 1 2
10000008 1 3 1 1 2 2 1
There are multiple entries for each customer_id. I need to create another data frame from this existing data frame. The new data frame should contain only the last row for every customer_id. It should look like this
10000001 1 1 2 3 1 3 1
10000002 2 2 1 4 2 3 1
10000003 1 1 2 2 1 2 1
10000004 1 4 1 4 1 3 1
10000006 1 3 1 4 1 2 1
10000008 1 3 1 1 2 2 1

Something like this (hard to code without the data in R format):
dataframe[ rev(!duplicated(rev(dataframe$Customer_id))),]
or better
dataframe[ !duplicated(dataframe$Customer_id,fromLast=TRUE),]

You can also use aggregate
aggregate(. ~ Customer_id, data = DF, FUN = tail, 1)
## Customer_id A B C D E F G
## 1 10000001 1 2 3 1 2 1 3
## 2 10000002 2 2 1 4 2 3 1
## 3 10000003 1 1 2 2 1 2 1
## 4 10000004 1 4 1 4 1 3 1
## 5 10000006 1 3 1 4 1 2 1
## 6 10000008 1 3 1 1 2 2 1

Assume your data is named dat,
Here's one way using by and rbind, although the other two methods (aggregate and duplicated) are much nicer:
> do.call(rbind, by(dat,dat$Customer_id,FUN=tail,1))
## Customer_id A B C D E F G
## 2 10000001 1 2 3 1 2 1 3
## 4 10000002 2 2 1 4 2 3 1
## 7 10000003 1 1 2 2 1 2 1
## 11 10000004 1 4 1 4 1 3 1
## 13 10000006 1 3 1 4 1 2 1
## 16 10000008 1 3 1 1 2 2 1

Episode count for each row

I'm sure this has been asked before but for the life of me I can't figure out what to search for!
I have the following data:
x y
1 3
1 3
1 3
1 2
1 2
2 2
2 4
3 4
3 4
And I would like to output a running count that resets everytime either x or y changes value.
x y o
1 3 1
1 3 2
1 3 3
1 2 1
1 2 2
2 2 1
2 4 1
3 4 1
3 4 2

Try something like
df<-read.table(header=T,text="x y
1 3
1 3
1 3
1 2
1 2
2 2
2 4
3 4
3 4")
cbind(df,o=sequence(rle(paste(df$x,df$y))$lengths))
> cbind(df,o=sequence(rle(paste(df$x,df$y))$lengths))
x y o
1 1 3 1
2 1 3 2
3 1 3 3
4 1 2 1
5 1 2 2
6 2 2 1
7 2 4 1
8 3 4 1
9 3 4 2

After seeing #ttmaccer's I see my first attempt with ave was wrong and this is perhaps what is needed:
> dat$o <- ave(dat$y, list(dat$y, dat$x), FUN=seq )
# there was a warning but the answer is corect.
> dat
x y o
1 1 3 1
2 1 3 2
3 1 3 3
4 1 2 1
5 1 2 2
6 2 2 1
7 2 4 1
8 3 4 1
9 3 4 2

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Combining an individual and aggregate level data sets - r

Related

Group by each increasing sequence in data frame

Using R code to reorganize data frame by randomly selecting one row from each combination

Count with table() and exclude 0's

How to select only the last row among the subset of rows satisfying a condition in R programming

Episode count for each row

Categories

Resources