Turning data long to wide with repeating values

fill W id X T
1 403 29730 100 111
1 8395 10766 100 92
1 4170 14291 100 98
1 2768 20506 200 110
1 3581 15603 100 112
6 1 10504 200 87
9 48 29730 100 89
1 4790 10766 200 80
This is a slightly modified random sample from my actual data. I'd like:
id      X    T    403 8395 4170 2768 3581    1   48 4790
29730  100  111     1
10766  100   92          1
14291  100   98               1
20506  200  110                    1
15603  100  112                         1
10504  200   87                              6
29730  100   89                                   9
10766  200   80                                        1
Notice that ID 29730 appears at both T 89 and T 111. I think this should just be reshape2::dcast, however
data_wide <- reshape2::dcast(data_long, id + T + X ~ W, value.var = "fill") gives an illogical result. Is there generally a way to keep the same ID at T1 and T2 while casting a data frame?

If I understand correctly, this is not a trivial long-to-wide reshape question, considering the OP's requirements:
The row order must be maintained.
The columns must be ordered by first appearance of W.
Missing entries should appear blank rather than NA.
This requires:
adding a row number to be included in the reshape formula,
turning W into a factor whose levels are ordered by appearance, using forcats::fct_inorder(),
using an aggregation function which turns NA into "", such as toString(),
and removing the row numbers from the reshaped result.
Here, the data.table implementation of dcast() is used as data.table appears a bit more convenient, IMHO.
library(data.table)
dcast(setDT(data_long)[, rn := .I],
      rn + id + T + X ~ forcats::fct_inorder(factor(W)),
      toString, value.var = "fill")[, rn := NULL][]
      id   T   X  403 8395 4170 2768 3581    1   48 4790
1: 29730 111 100    1
2: 10766  92 100         1
3: 14291  98 100              1
4: 20506 110 200                   1
5: 15603 112 100                        1
6: 10504  87 200                             6
7: 29730  89 100                                  9
8: 10766  80 200                                       1
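A quick illustration of the two tricks used above, run on toy input: fct_inorder() keeps factor levels in order of first appearance instead of sorting them, and toString() applied to a zero-length vector returns "", which is what fills the empty cells.
forcats::fct_inorder(factor(c("b", "a", "c")))
## [1] b a c
## Levels: b a c
toString(integer(0))  # what dcast's aggregation returns for an empty cell
## [1] ""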
Data
library(data.table)
data_long <- fread(" fill W id X T
1 403 29730 100 111
1 8395 10766 100 92
1 4170 14291 100 98
1 2768 20506 200 110
1 3581 15603 100 112
6 1 10504 200 87
9 48 29730 100 89
1 4790 10766 200 80")

Related

R - Reducing a matrix

I have a square matrix that is like:
A <- c("111","111","111","112","112","113")
B <- c(100,10,20,NA,NA,10)
C <- c(10,20,40,NA,10,20)
D <- c(10,20,NA,NA,40,200)
E <- c(20,20,40,10,10,20)
F <- c(NA,NA,40,100,10,20)
G <- c(10,20,NA,30,10,20)
df <- data.frame(A,B,C,D,E,F,G)
names(df) <- c("Codes","111","111","111","112","112","113")
# Codes 111 111 111 112 112 113
# 1 111 100 10 10 20 NA 10
# 2 111 10 20 20 20 NA 20
# 3 111 20 40 NA 40 40 NA
# 4 112 NA NA NA 10 100 30
# 5 112 NA 10 40 10 10 10
# 6 113 10 20 200 20 20 20
I want to reduce it so that observations with the same row and column names are summed up.
So I want to end up with:
# Codes 111 112 113
# 1 111 230 120 30
# 2 112 50 130 40
# 3 113 230 40 20
I tried to first combine the rows with the same "Codes" number, but I was having a lot of trouble.
In tidyverse
library(tidyverse)
df %>%
pivot_longer(-Codes, values_drop_na = TRUE) %>%
group_by(Codes, name) %>%
summarise(value = sum(value), .groups = 'drop')%>%
pivot_wider()
# A tibble: 3 x 4
Codes `111` `112` `113`
<chr> <dbl> <dbl> <dbl>
1 111 230 120 30
2 112 50 130 40
3 113 230 40 20
One way in base R:
tapply(unlist(df[-1]), list(names(df)[-1][col(df[-1])], df[,1][row(df[-1])]), sum, na.rm = TRUE)
111 112 113
111 230 50 230
112 120 130 40
113 30 40 20
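This result is transposed relative to the desired output (its rows come from the original column names, its columns from Codes); if the orientation matters, wrapping the same call in t() fixes it:
t(tapply(unlist(df[-1]),
         list(names(df)[-1][col(df[-1])], df[,1][row(df[-1])]),
         sum, na.rm = TRUE))
##     111 112 113
## 111 230 120  30
## 112  50 130  40
## 113 230  40  20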
Note that this can be simplified, as noted by @thelatemail, to
grp <- expand.grid(df$Codes, names(df)[-1])
tapply(unlist(df[-1]), grp, FUN=sum, na.rm=TRUE)
You can also use xtabs:
xtabs(vals~., na.omit(cbind(grp, vals = unlist(df[-1]))))
Var2
Var1 111 112 113
111 230 120 30
112 50 130 40
113 230 40 20
When dealing with actual matrices, especially large ones, expressing the operation as (sparse) linear algebra should be most efficient.
library(Matrix) ## for sparse matrix operations
idx <- c("111","111","111","112","112","113")
mat <- matrix(c(100,10,20,NA,NA,10,
10,20,40,NA,10,20,
10,20,NA,NA,40,200,
20,20,40,10,10,20,
NA,NA,40,100,10,20,
10,20,NA,30,10,20),
nrow=length(idx),
byrow=TRUE, dimnames=list(idx, idx))
## convert NA's to zero
mat[is.na(mat)] <- 0
## examine matrix
mat
## 111 111 111 112 112 113
## 111 100 10 20 0 0 10
## 111 10 20 40 0 10 20
## 111 10 20 0 0 40 200
## 112 20 20 40 10 10 20
## 112 0 0 40 100 10 20
## 113 10 20 0 30 10 20
## indicator matrix
## converts between "code" and "idx" spaces
M_code_idx <- fac2sparse(idx)
## project to "code_code" space
M_code_idx %*% mat %*% t(M_code_idx)
## 3 x 3 Matrix of class "dgeMatrix"
## 111 112 113
## 111 230 50 230
## 112 120 130 40
## 113 30 40 20
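Note that mat, as typed above, stores the data frame's columns as rows (its first row equals column B of df), so this product is the transpose of the table asked for; t() restores the original orientation:
t(M_code_idx %*% mat %*% t(M_code_idx))
## 3 x 3 Matrix of class "dgeMatrix"
##     111 112 113
## 111 230 120  30
## 112  50 130  40
## 113 230  40  20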

R - more effective left_join [duplicate]

This question already has answers here:
Overlap join with start and end positions
(5 answers)
Closed 1 year ago.
I have got two dataframes: one containing names and ranges of limits (only a few hundred rows, 1000 at most), which need to be assigned to a "measurements" dataframe which can consist of millions of rows (or tens of millions).
Currently I am doing a left_join and filtering value to get a specific limit assigned to each measurement. This, however, is quite inefficient and costs a lot of resources. For larger dataframes, the code is even unable to run.
Any ideas for more effective solutions will be helpful.
library(dplyr)
## this one has only a few hundred rows
df_limits <- read.table(text="Title station_id limit_from limit_to
Level_3_Low 1 0 70
Level_2_Low 1 70 90
Level_1_Low 1 90 100
Optimal 1 100 110
Level_1_High 1 110 130
Level_2_High 1 130 150
Level_3_High 1 150 180
Level_3_Low 2 0 70
Level_2_Low 2 70 90
Level_1_Low 2 90 100
Optimal 2 100 110
Level_1_High 2 110 130
Level_2_High 2 130 150
Level_3_High 2 150 180
Level_3_Low 3 0 70
Level_2_Low 3 70 90
Level_1_Low 3 90 100
Optimal 3 100 110
Level_1_High 3 110 130
Level_2_High 3 130 150
Level_3_High 3 150 180
",header = TRUE, stringsAsFactors = TRUE)
# this DF has got millions of rows
df_measurements <- read.table(text="measurement_id station_id value
12121534 1 172
12121618 1 87
12121703 1 9
12121709 2 80
12121760 2 80
12121813 2 115
12121881 3 67
12121907 3 100
12121920 3 108
12121979 1 102
12121995 1 53
12122022 1 77
12122065 2 158
12122107 2 144
12122113 2 5
12122135 3 100
12122187 3 136
12122267 3 130
12122359 1 105
12122366 1 126
12122398 1 143
",header = TRUE, stringsAsFactors = TRUE)
df_results <- left_join(df_measurements,df_limits, by = "station_id") %>%
filter ((value >= limit_from & value < limit_to) | is.na(Title)) %>%
select(names(df_measurements), Title)
Another data.table solution, using non-equi joins:
library(data.table)
setDT(df_measurements)
setDT(df_limits)
df_limits[df_measurements, .(station_id, measurement_id, value, Title),
on=.(station_id = station_id, limit_from < value, limit_to >= value)]
station_id measurement_id value Title
1: 1 12121534 172 Level_3_High
2: 1 12121618 87 Level_2_Low
3: 1 12121703 9 Level_3_Low
4: 2 12121709 80 Level_2_Low
5: 2 12121760 80 Level_2_Low
6: 2 12121813 115 Level_1_High
7: 3 12121881 67 Level_3_Low
8: 3 12121907 100 Level_1_Low
9: 3 12121920 108 Optimal
10: 1 12121979 102 Optimal
11: 1 12121995 53 Level_3_Low
12: 1 12122022 77 Level_2_Low
13: 2 12122065 158 Level_3_High
14: 2 12122107 144 Level_2_High
15: 2 12122113 5 Level_3_Low
16: 3 12122135 100 Level_1_Low
17: 3 12122187 136 Level_2_High
18: 3 12122267 130 Level_1_High
19: 1 12122359 105 Optimal
20: 1 12122366 126 Level_1_High
21: 1 12122398 143 Level_2_High
A simple base R option (no additional packages needed) using subset + merge
subset(
merge(
df_measurements,
df_limits,
all = TRUE
),
limit_from < value & limit_to >= value
)
gives
station_id measurement_id value Title limit_from limit_to
7 1 12121534 172 Level_3_High 150 180
9 1 12121618 87 Level_2_Low 70 90
15 1 12121703 9 Level_3_Low 0 70
23 1 12122022 77 Level_2_Low 70 90
34 1 12122398 143 Level_2_High 130 150
39 1 12121979 102 Optimal 100 110
43 1 12121995 53 Level_3_Low 0 70
54 1 12122366 126 Level_1_High 110 130
60 1 12122359 105 Optimal 100 110
65 2 12121760 80 Level_2_Low 70 90
75 2 12121813 115 Level_1_High 110 130
79 2 12121709 80 Level_2_Low 70 90
91 2 12122065 158 Level_3_High 150 180
97 2 12122107 144 Level_2_High 130 150
99 2 12122113 5 Level_3_Low 0 70
108 3 12121907 100 Level_1_Low 90 100
116 3 12121920 108 Optimal 100 110
124 3 12122267 130 Level_1_High 110 130
127 3 12121881 67 Level_3_Low 0 70
136 3 12122135 100 Level_1_Low 90 100
146 3 12122187 136 Level_2_High 130 150
Another option is using dplyr
df_measurements %>%
group_by(station_id) %>%
mutate(Title = with(
df_limits,
Title[
findInterval(
value,
unique(unlist(cbind(limit_from, limit_to)[station_id == first(.$station_id)])),
left.open = TRUE
)
]
)) %>%
ungroup()
which gives
# A tibble: 21 x 4
measurement_id station_id value Title
<int> <int> <int> <fct>
1 12121534 1 172 Level_3_High
2 12121618 1 87 Level_2_Low
3 12121703 1 9 Level_3_Low
4 12121709 2 80 Level_2_Low
5 12121760 2 80 Level_2_Low
6 12121813 2 115 Level_1_High
7 12121881 3 67 Level_3_Low
8 12121907 3 100 Level_1_Low
9 12121920 3 108 Optimal
10 12121979 1 102 Optimal
# ... with 11 more rows
Benchmarking
f_TIC1 <- function() {
subset(
merge(
df_measurements,
df_limits,
all = TRUE
),
limit_from < value & limit_to >= value
)
}
f_TIC2 <- function() {
df_measurements %>%
group_by(station_id) %>%
mutate(Title = with(
df_limits,
Title[
findInterval(
value,
unique(unlist(cbind(limit_from, limit_to)[station_id == first(station_id)])),
left.open = TRUE
)
]
)) %>%
ungroup()
}
dt_limits <- as.data.table(df_limits)
dt_measurements <- as.data.table(df_measurements)
f_Waldi <- function() {
dt_limits[
dt_measurements,
.(station_id, measurement_id, value, Title),
on = .(station_id, limit_from < value, limit_to >= value)
]
}
f_TimTeaFan <- function() {
setkey(dt_limits, station_id, limit_from, limit_to)
foverlaps(dt_measurements[, value2 := value],
dt_limits,
by.x = c("station_id", "value", "value2"),
type = "within",
)[
value < limit_to,
.(measurement_id, station_id, value, Title)
]
}
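The timing call itself is not shown; a minimal sketch, assuming the microbenchmark package (which matches the "Unit: relative" output below), would be:
library(microbenchmark)
bm <- microbenchmark(f_TIC1(), f_TIC2(), f_Waldi(), f_TimTeaFan(), times = 100)
print(bm, unit = "relative")  # report timings relative to the fastest expression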
Running this, you will see that the subset + merge approach comes out fastest on this (small) example:
Unit: relative
expr min lq mean median uq max neval
f_TIC1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
f_TIC2() 4.848639 4.909985 4.895588 4.942616 5.124704 2.580819 100
f_Waldi() 3.182027 3.010615 3.069916 3.114160 3.397845 1.698386 100
f_TimTeaFan() 5.523778 5.112872 5.226145 5.112407 5.745671 2.446987 100
Here is one way to do it. The problematic part was the condition value < limit_to: foverlaps checks for the condition value <= limit_to, which results in double matches, so here we apply the filter condition after the overlapping join and then select the desired columns. Note that the result is not in the same order as the df_results generated with dplyr.
library(data.table)
dt_limits <- as.data.table(df_limits)
dt_measurements <- as.data.table(df_measurements)
setkey(dt_limits, station_id, limit_from, limit_to)
dt_results <- foverlaps(dt_measurements[, value2 := value],
dt_limits,
by.x = c("station_id", "value", "value2"),
type = "within",
)[value < limit_to,
.(measurement_id , station_id, value, Title)]
dt_results[]
#> measurement_id station_id value Title
#> 1: 12121534 1 172 Level_3_High
#> 2: 12121618 1 87 Level_2_Low
#> 3: 12121703 1 9 Level_3_Low
#> 4: 12121709 2 80 Level_2_Low
#> 5: 12121760 2 80 Level_2_Low
#> 6: 12121813 2 115 Level_1_High
#> 7: 12121881 3 67 Level_3_Low
#> 8: 12121907 3 100 Optimal
#> 9: 12121920 3 108 Optimal
#> 10: 12121979 1 102 Optimal
#> 11: 12121995 1 53 Level_3_Low
#> 12: 12122022 1 77 Level_2_Low
#> 13: 12122065 2 158 Level_3_High
#> 14: 12122107 2 144 Level_2_High
#> 15: 12122113 2 5 Level_3_Low
#> 16: 12122135 3 100 Optimal
#> 17: 12122187 3 136 Level_2_High
#> 18: 12122267 3 130 Level_2_High
#> 19: 12122359 1 105 Optimal
#> 20: 12122366 1 126 Level_1_High
#> 21: 12122398 1 143 Level_2_High
#> measurement_id station_id value Title
Created on 2021-08-09 by the reprex package (v0.3.0)

R: Extracting values from a dataframe when sequential row values differ

I have a data.frame df with 2 columns:
A contains positive values.
B contains values (either zero or a positive values).
I wish to generate a new data.frame or vector (of unknown length) containing
the values from df[i+1, A], but ONLY when df[i, B] == 0 & df[i+1, B] != 0.
I can visualize how to do this by sequentially stepping through the data.frame with a loop, but that will take forever with >200,000 rows. What is the vectorized solution to a problem like this, which requires arithmetic on sequential rows of a vector or data.frame?
Data is in this form:
A B
1 5 5
2 10 3
3 15 0
4 20 6
5 25 5
6 30 0
7 35 0
8 40 11
9 45 3
etc etc etc
I'd then like to extract the values of A from row 4 (A = 20) and row 8 (A = 40) etc.
You could use
df$A[-1][diff(df$B != 0) > 0]
[1] 20 40
The idea is as follows. First, given a vector c(1, 2), one way to extract 2 is of course c(1, 2)[2]. Another way is c(1, 2)[c(FALSE, TRUE)], i.e. you might subset a vector by using a logical vector.
After you edited your question, I see that we are no longer interested in the first row of df, so that is why I start with df$A[-1]. Then one way that is longer and very likely less efficient, but follows more readable logic, is
df$A[-1][df$B[-nrow(df)] == 0 & df$B[-1] != 0]
where df$B[-1] != 0 returns a logical vector corresponding to your condition df[i+1, B] != 0, and df$B[-nrow(df)] == 0 returns another logical vector corresponding to df[i, B] == 0. The & operator then performs an element-wise AND, returning the final logical vector that selects the result.
Now diff(df$B != 0) > 0 is just a trickier way to write the same thing. df$B != 0 gives a logical vector. Then while performing diff(df$B != 0) we are taking differences of 1's (correspond to entries TRUE) and 0's (correspond to FALSE). For example, c(0, 1) != 0 gives c(FALSE, TRUE), which can be seen as c(0, 1), and then diff gives 1. So, we have ones in diff(df$B != 0) where entry 0 is followed by some nonzero (in your case - positive) number. To use these results for subsetting df$A[-1] we obtain the final logical vector with diff(df$B != 0) > 0.
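To make this concrete, here is the chain of intermediate results on the B column of the example data:
B <- c(5, 3, 0, 6, 5, 0, 0, 11, 3)
B != 0
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
diff(B != 0)       # logicals are coerced to 0/1 before differencing
## [1]  0 -1  1  0 -1  0  1  0
diff(B != 0) > 0   # TRUE exactly where a zero is followed by a nonzero
## [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE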
Another option comes through 'dplyr' with the following code:
library(dplyr)
df %>% filter(B != 0 & lag(B, 1) == 0)
This keeps the rows where B doesn't equal zero and the prior B does equal zero, returning both columns A and B. If you want to see only certain columns, add %>% select(...) with the desired variables separated by commas.
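On the question's data this keeps rows 4 and 8, matching the expected A values of 20 and 40; a quick check, rebuilding the example as a data frame named df:
library(dplyr)
df <- data.frame(A = seq(5, 45, by = 5),
                 B = c(5, 3, 0, 6, 5, 0, 0, 11, 3))
df %>% filter(B != 0 & lag(B, 1) == 0)
##    A  B
## 1 20  6
## 2 40 11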
My example (adds together sequential values from two vectors):
> i1=c(1:100)
> i2=c(100:1)
> i3=i1[-length(i1)]+i2[-1]
> i3
[1] 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
[55] 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
My example (adds together sequential values from a single vector):
> i1=c(1:100)
> i2=i1[-length(i1)]+i1[-1]
> i2
[1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99 101 103 105 107 109
[55] 111 113 115 117 119 121 123 125 127 129 131 133 135 137 139 141 143 145 147 149 151 153 155 157 159 161 163 165 167 169 171 173 175 177 179 181 183 185 187 189 191 193 195 197 199

Counting Instances of Multiple Variables in R

I have a large data table Divvy (over 2.4 million records) that appears as such (some columns removed):
X trip_id from_station_id.x to_station_id.x
1 1109420 94 69
2 1109421 69 216
3 1109427 240 245
4 1109431 113 94
5 1109433 127 332
3 1109429 240 245
I would like to find the number of trips from each station to each opposing station. So for example,
From X To Y Sum
94 69 1
240 245 2
etc., and then join it back to the initial table using dplyr to make something like the below, and then limit it to distinct from_station_id/to_station_id combos, which I'll use to map routes (I have lat/long for each station):
X trip_id from_station_id.x to_station_id.x Sum
1 1109420 94 69 1
2 1109421 69 216 1
3 1109427 240 245 2
4 1109431 113 94 1
5 1109433 127 332 1
3 1109429 240 245 1
I successfully used count to get some of this, such as:
count(Divvy$from_station_id.x==94 & Divvy$to_station_id.x == 69)
x freq
1 FALSE 2454553
2 TRUE 81
But this is obviously labor-intensive as there are 300 unique stations, so well over 44k possible combinations. I created a helper table thinking I could loop it.
n <- select(Divvy, from_station_id.y )
from_station_id.x
1 94
2 69
3 240
4 113
5 113
6 127
count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1])
x freq
1 FALSE 2454553
2 TRUE 81
I felt like a loop such as
output <- matrix(ncol=variables, nrow=iterations)
output <- matrix()
for(i in 1:n)(output[i, count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1]))
should work but come to think of it that will still only return 300 rows, not 44k, so it would have to then loop back and do n[2] & n[1] etc...
I felt like there might also be a quicker dplyr solution that would let me return a count of each combo and append it directly without the extra steps/table creation, but I haven't found it.
I'm newer to R and I have searched around/think I'm close, but I can't quite connect that last dot of joining that result to Divvy. Any help appreciated.
Here is the data.table solution, which is useful if you are working with large data:
library(data.table)
setDT(DF)[,sum:=.N,by=.(from_station_id.x,to_station_id.x)][] #DF is your dataframe
X trip_id from_station_id.x to_station_id.x sum
1: 1 1109420 94 69 1
2: 2 1109421 69 216 1
3: 3 1109427 240 245 2
4: 4 1109431 113 94 1
5: 5 1109433 127 332 1
6: 3 1109429 240 245 2
Since you said "limit it to distinct from_station_id/to_combos", the following code seems to provide what you are after. Your data is called mydf.
library(dplyr)
group_by(mydf, from_station_id.x, to_station_id.x) %>%
count(from_station_id.x, to_station_id.x)
# from_station_id.x to_station_id.x n
#1 69 216 1
#2 94 69 1
#3 113 94 1
#4 127 332 1
#5 240 245 2
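To attach these counts back to the original rows, as the question describes, dplyr's add_count() is a convenient shortcut; a sketch (the name argument needs a reasonably recent dplyr):
library(dplyr)
mydf %>% add_count(from_station_id.x, to_station_id.x, name = "Sum")
This keeps every original row and appends the per-combination count; distinct() can then reduce it to unique routes.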
I'm not entirely sure that's what you're looking for as a result, but this calculates the number of trips having the same origin and destination. Feel free to comment and let me know if that's not quite what you expect as a final result.
dat <- read.table(text="X trip_id from_station_id.x to_station_id.x
1 1109420 94 69
2 1109421 69 216
3 1109427 240 245
4 1109431 113 94
5 1109433 127 332
3 1109429 240 245", header=TRUE)
dat$from.to <- paste(dat$from_station_id.x, dat$to_station_id.x, sep="-")
freqs <- as.data.frame(table(dat$from.to))
names(freqs) <- c("from.to", "sum")
dat2 <- merge(dat, freqs, by="from.to")
dat2 <- dat2[order(dat2$trip_id),-1]
Results
dat2
# X trip_id from_station_id.x to_station_id.x sum
# 6 1 1109420 94 69 1
# 5 2 1109421 69 216 1
# 3 3 1109427 240 245 2
# 4 3 1109429 240 245 2
# 1 4 1109431 113 94 1
# 2 5 1109433 127 332 1

Summarizing a data frame

I am trying to take the following data and then use it to create a table which has the information broken down by state.
Here's the data:
> head(mydf2, 10)
lead_id buyer_account_id amount state
1 52055267 62 300 CA
2 52055267 64 264 CA
3 52055305 64 152 CA
4 52057682 62 75 NJ
5 52060519 62 750 OR
6 52060519 64 574 OR
15 52065951 64 152 TN
17 52066749 62 600 CO
18 52062751 64 167 OR
20 52071186 64 925 MN
I've already subset the data to just the states I'm interested in:
mydf2 = subset(mydf, state %in% c("NV","AL","OR","CO","TN","SC","MN","NJ","KY","CA"))
Here's an idea of what I'm looking for:
State Amount Count
NV 1 50
NV 2 35
NV 3 20
NV 4 15
AL 1 10
AL 2 6
AL 3 4
AL 4 1
...
For each state, I'm trying to find a count for each amount "level." I don't necessarily need to group the amount variable, but keep in mind that the amounts are not just 1, 2, 3, etc.:
> mydf$amount
[1] 300 264 152 75 750 574 113 152 750 152 675 489 188 263 152 152 600 167 34 925 375 156 675 152 488 204 152 152
[29] 600 489 488 75 152 152 489 222 563 215 452 152 152 75 100 113 152 150 152 150 152 452 150 152 152 225 600 620
[57] 113 152 150 152 152 152 152 152 152 152 640 236 152 480 152 152 200 152 560 152 240 222 152 152 120 257 152 400
Is there an elegant solution for this in R, or will I be stuck using Excel (yuck!)?
Here's my understanding of what you're trying to do:
Start with a simple data.frame with 26 states and amounts only ranging from 1 to 50 (which is much more restrictive than what you have in your example, where the range is much higher).
set.seed(1)
mydf <- data.frame(
state = sample(letters, 500, replace = TRUE),
amount = sample(1:50, 500, replace = TRUE)
)
head(mydf)
# state amount
# 1 g 28
# 2 j 35
# 3 o 33
# 4 x 34
# 5 f 24
# 6 x 49
Here's some straightforward tabulation. I've also removed any instances where frequency equals zero, and I've reordered the output by state.
temp1 <- data.frame(table(mydf$state, mydf$amount))
temp1 <- temp1[!temp1$Freq == 0, ]
head(temp1[order(temp1$Var1), ])
# Var1 Var2 Freq
# 79 a 4 1
# 157 a 7 2
# 391 a 16 1
# 417 a 17 1
# 521 a 21 1
# 1041 a 41 1
dim(temp1) # How many rows/cols
# [1] 410 3
Here's a little bit different tabulation. We are tabulating after grouping the "amount" values. Here, I've manually specified the breaks, but you could just as easily let R decide what it thinks is best.
temp2 <- data.frame(table(mydf$state,
cut(mydf$amount,
breaks = c(0, 12.5, 25, 37.5, 50),
include.lowest = TRUE)))
temp2 <- temp2[!temp2$Freq == 0, ]
head(temp2[order(temp2$Var1), ])
# Var1 Var2 Freq
# 1 a [0,12.5] 3
# 27 a (12.5,25] 3
# 79 a (37.5,50] 3
# 2 b [0,12.5] 2
# 28 b (12.5,25] 6
# 54 b (25,37.5] 5
dim(temp2)
# [1] 103 3
I am not sure if I understand correctly (you have two data.frames mydf and mydf2). I'll assume your data is in mydf. Using aggregate:
mydf$count <- 1:nrow(mydf)
aggregate(data = mydf, count ~ amount + state, length)
Is this what you are looking for?
Note: here count is a variable created just so that the third column of the output comes out directly named count.
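If you'd rather not add a dummy column to mydf, the renaming can be done inside the formula instead; a sketch of the same aggregate() call:
aggregate(cbind(count = lead_id) ~ amount + state, data = mydf, FUN = length)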
Alternatives with ddply from plyr:
# no need to create a variable called count
ddply(mydf, .(state, amount), summarise, count=length(lead_id))
Here one could use any column that exists in one's data instead of lead_id, even state:
ddply(mydf, .(state, amount), summarise, count=length(state))
Or equivalently without using summarise:
ddply(mydf, .(state, amount), function(x) c(count=nrow(x)))
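For completeness, the same tabulation in current dplyr is a one-liner (not part of the original answers; this count() is dplyr's, not plyr's):
library(dplyr)
mydf %>% count(state, amount, name = "count")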
