I have a matrix whose cell values are all in -1:1, and a data frame that identifies the row/column of the matrix cell to look up for each row. I want to add columns to the data frame containing the values of the cells identified in the matrix. An example of what I have and what I want is below:
HAVE MATRIX:
1 2 3 4 5 6 7 8 9 10
1 0 0 0 1 0 1 1 0 0 0
2 0 0 -1 -1 1 1 -1 -1 0 0
3 0 0 -1 0 0 -1 0 -1 0 0
4 -1 1 0 0 1 1 0 -1 1 -1
5 -1 1 0 -1 -1 0 0 0 0 1
6 1 -1 1 1 0 0 -1 -1 0 1
7 0 -1 1 0 1 1 0 1 -1 0
8 0 0 -1 -1 -1 0 1 -1 0 1
9 -1 1 0 1 1 -1 0 1 -1 -1
10 -1 1 -1 -1 -1 -1 1 0 1 -1
HAVE DATAFRAME:
i j k
1 3 4 2
2 4 8 10
3 10 7 5
4 2 6 8
5 9 10 7
6 2 10 4
7 7 8 10
8 6 10 8
9 2 9 5
10 9 7 1
WANT DATAFRAME:
i j k j,i k,i k,j
1 3 4 2 0 -1 -1
2 4 8 10 -1 -1 0
3 10 7 5 0 1 0
4 2 6 8 -1 0 0
5 9 10 7 1 -1 0
6 2 10 4 1 1 -1
7 7 8 10 1 1 0
8 6 10 8 -1 0 1
9 2 9 5 1 1 0
10 9 7 1 -1 0 1
One option would be to use combn, or sapply if the combinations need to be in a specific order: loop through the list, extract the columns of the second dataset, use those as row/column indices to extract the corresponding elements from the first dataset, and cbind.
indList <- list(ji = c('j', 'i'), ki = c('k', 'i'), kj = c('k', 'j'))
cbind(df2, sapply(indList, function(x) m1[as.matrix(df2[x])]))
# i j k ji ki kj
#1 3 4 2 0 -1 -1
#2 4 8 10 -1 -1 0
#3 10 7 5 0 1 0
#4 2 6 8 -1 0 0
#5 9 10 7 1 -1 0
#6 2 10 4 1 1 -1
#7 7 8 10 1 1 0
#8 6 10 8 -1 0 1
#9 2 9 5 1 1 0
#10 9 7 1 -1 0 1
It is also possible with combn:
cbind(df2, combn(df2, 2, FUN = function(x) m1[as.matrix(x[2:1])]))
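Note that combn returns a matrix without column names here, so you may want to relabel the bound columns to match the first approach. A small sketch (not part of the original answer):
# combn() leaves the derived columns unnamed; rename them after binding
out <- cbind(df2, combn(df2, 2, FUN = function(x) m1[as.matrix(x[2:1])]))
names(out)[4:6] <- c("ji", "ki", "kj")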
data
df2 <- structure(list(i = c(3L, 4L, 10L, 2L, 9L, 2L, 7L, 6L, 2L, 9L),
j = c(4L, 8L, 7L, 6L, 10L, 10L, 8L, 10L, 9L, 7L), k = c(2L,
10L, 5L, 8L, 7L, 4L, 10L, 8L, 5L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
m1 <- structure(c(0L, 0L, 0L, -1L, -1L, 1L, 0L, 0L, -1L, -1L, 0L, 0L,
0L, 1L, 1L, -1L, -1L, 0L, 1L, 1L, 0L, -1L, -1L, 0L, 0L, 1L, 1L,
-1L, 0L, -1L, 1L, -1L, 0L, 0L, -1L, 1L, 0L, -1L, 1L, -1L, 0L,
1L, 0L, 1L, -1L, 0L, 1L, -1L, 1L, -1L, 1L, 1L, -1L, 1L, 0L, 0L,
1L, 0L, -1L, -1L, 1L, -1L, 0L, 0L, 0L, -1L, 0L, 1L, 0L, 1L, 0L,
-1L, -1L, -1L, 0L, -1L, 1L, -1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, -1L, 0L, -1L, 1L, 0L, 0L, 0L, -1L, 1L, 1L, 0L, 1L, -1L, -1L
), .Dim = c(10L, 10L), .Dimnames = list(c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10"), c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10")))
I'm trying to get the frequency of a value in each column when that frequency is over a certain number. In my data I have multiple columns, and I want to write code that reports the frequency of "0" in each column where that frequency is greater than 3.
My dataset is like this:
```
a b c d e f g h
0 1 0 1 1 1 1 1
2 0 0 0 0 0 0 0
0 1 2 2 2 1 0 1
0 0 0 0 1 0 0 0
1 0 2 1 1 0 0 0
1 1 0 0 1 0 0 0
0 1 2 2 2 2 2 2
```
The output of the code that I need is :
```
Variable Frequency
a 4
c 4
f 4
g 5
h 4
```
So this shows the number of "0"s in each column of the data frame where that count is greater than 3.
Thank you.
You can use colSums to count the number of 0's in each column, stack the result, and subset the values which are greater than 3.
subset(stack(colSums(df == 0, na.rm = TRUE)), values > 3)
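which should print something like (the values/ind column names come from stack):
#  values ind
#1      4   a
#3      4   c
#6      4   f
#7      5   g
#8      4   h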
A tidyverse way would be:
library(dplyr)
df %>%
summarise(across(.fns = ~sum(. == 0, na.rm = TRUE))) %>%
tidyr::pivot_longer(cols = everything()) %>%
filter(value > 3)
# name value
# <chr> <int>
#1 a 4
#2 c 4
#3 f 4
#4 g 5
#5 h 4
data
df <- structure(list(a = c(0L, 2L, 0L, 0L, 1L, 1L, 0L), b = c(1L, 0L,
1L, 0L, 0L, 1L, 1L), c = c(0L, 0L, 2L, 0L, 2L, 0L, 2L), d = c(1L,
0L, 2L, 0L, 1L, 0L, 2L), e = c(1L, 0L, 2L, 1L, 1L, 1L, 2L), f = c(1L,
0L, 1L, 0L, 0L, 0L, 2L), g = c(1L, 0L, 0L, 0L, 0L, 0L, 2L), h = c(1L,
0L, 1L, 0L, 0L, 0L, 2L)), class = "data.frame", row.names = c(NA, -7L))
In my dataset I have information on the ZIP code of 600K+ IDs. If an ID moved to a different address, I want to determine at which ZIP code they lived the longest and put a '1' for that specific year in that row only (no need to combine rows, as I want to know where they lived in each year). That way an ID has a '1' for a given year in only one row (if there are multiple rows for that ID). The yellow highlight is what I don't want; in that case there is a '1' in two rows for the same year. In the preferred dataset only one '1' per year per ID is possible.
For example: ID 4 lived in two places in 2013 (NY and LA), so there are two rows. At this point there is a 1 in each row for 2013, and I only want a 1 in the row where the ID lived the longest between 1-1-2013 and 31-12-2018. ID 4 lived longer in LA than in NY in 2013, so the 1 should only be in the row for LA (and in this case the row for NY will be removed, because only '0's remain).
I can also provide this file for RStudio.
Thank you!
> v1
ID CITY ZIPCODE DATE_START DATE_END DATE_END.1 X2013 X2014 X2015 X2016 X2017 X2018
1 1 NY 1234EF 1-12-2003 31-12-2018 1 1 1 1 1 1
2 2 NY 1234CD 1-12-2003 14-1-2019 14-1-2019 1 1 1 1 1 1
3 2 NY 1234AB 15-1-2019 31-12-2018 0 0 0 0 0 0
4 3 NY 1234AB 15-1-2019 31-12-2018 0 0 0 0 0 0
5 3 NY 1234CD 1-12-2003 14-1-2019 14-1-2019 1 1 1 1 1 1
6 4 LA 1111AB 4-5-2013 31-12-2018 1 1 1 1 1 1
7 4 NY 2222AB 1-12-2003 3-5-2013 3-5-2013 1 0 0 0 0 0
8 5 MIAMI 5555CD 6-2-2015 20-6-2016 20-6-2016 0 0 1 1 0 0
9 5 VEGAS 3333AB 1-1-2004 31-12-2018 1 1 1 1 1 1
10 5 ORLANDO 4444AB 26-2-2004 5-2-2015 5-2-2015 1 1 1 0 0 0
11 5 MIAMI 5555AB 21-6-2016 31-12-2018 31-12-2018 0 0 0 1 1 1
12 5 MIAMI 5555AB 1-1-2019 31-12-2018 0 0 0 0 0 0
13 6 AUSTIN 6666AB 28-2-2017 3-11-2017 3-11-2017 0 0 0 0 1 0
14 6 AUSTIN 6666AB 4-11-2017 31-12-2018 0 0 0 0 1 1
15 6 AUSTIN 7777AB 20-1-2017 27-2-2017 27-2-2017 0 0 0 0 1 0
16 6 AUSTIN 8888AB 1-12-2003 19-1-2017 19-1-2017 1 1 1 1 1 0
> dput(v1)
structure(list(ID = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L), CITY = structure(c(4L, 4L, 4L, 4L, 4L,
2L, 4L, 3L, 6L, 5L, 3L, 3L, 1L, 1L, 1L, 1L), .Label = c("AUSTIN",
"LA", "MIAMI", "NY", "ORLANDO", "VEGAS"), class = "factor"),
ZIPCODE = structure(c(4L, 3L, 2L, 2L, 3L, 1L, 5L, 9L, 6L,
7L, 8L, 8L, 10L, 10L, 11L, 12L), .Label = c("1111AB", "1234AB",
"1234CD", "1234EF", "2222AB", "3333AB", "4444AB", "5555AB",
"5555CD", "6666AB", "7777AB", "8888AB"), class = "factor"),
DATE_START = structure(c(3L, 3L, 4L, 4L, 3L, 10L, 3L, 11L,
1L, 7L, 6L, 2L, 8L, 9L, 5L, 3L), .Label = c("1-1-2004", "1-1-2019",
"1-12-2003", "15-1-2019", "20-1-2017", "21-6-2016", "26-2-2004",
"28-2-2017", "4-11-2017", "4-5-2013", "6-2-2015"), class = "factor"),
DATE_END = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 7L, 4L, 1L,
9L, 8L, 1L, 6L, 1L, 5L, 3L), .Label = c("", "14-1-2019",
"19-1-2017", "20-6-2016", "27-2-2017", "3-11-2017", "3-5-2013",
"31-12-2018", "5-2-2015"), class = "factor"), DATE_END.1 = structure(c(7L,
1L, 7L, 7L, 1L, 7L, 6L, 3L, 7L, 8L, 7L, 7L, 5L, 7L, 4L, 2L
), .Label = c("14-1-2019", "19-1-2017", "20-6-2016", "27-2-2017",
"3-11-2017", "3-5-2013", "31-12-2018", "5-2-2015"), class = "factor"),
X2013 = c(1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L,
0L, 0L, 0L, 1L), X2014 = c(1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L), X2015 = c(1L, 1L, 0L, 0L,
1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L), X2016 = c(1L,
1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L
), X2017 = c(1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L,
0L, 1L, 1L, 1L, 1L), X2018 = c(1L, 1L, 0L, 0L, 1L, 1L, 0L,
0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-16L))
You can use a little help from the lubridate package to calculate how many days are spent at each location. Then we can group_by ID and use case_when to assign 1 when the time is the max or 0 otherwise.
library(lubridate)
library(dplyr)
v1 %>%
dplyr::select(ID,CITY,ZIPCODE,DATE_START,DATE_END.1) %>%
rowwise() %>%
mutate("X2013" = max(0, min(dmy("31-12-2013"),dmy(DATE_END.1)) - max(dmy("1-1-2013"),dmy(DATE_START))),
"X2014" = max(0, min(dmy("31-12-2014"),dmy(DATE_END.1)) - max(dmy("1-1-2014"),dmy(DATE_START))),
"X2015" = max(0, min(dmy("31-12-2015"),dmy(DATE_END.1)) - max(dmy("1-1-2015"),dmy(DATE_START))),
"X2016" = max(0, min(dmy("31-12-2016"),dmy(DATE_END.1)) - max(dmy("1-1-2016"),dmy(DATE_START))),
"X2017" = max(0, min(dmy("31-12-2017"),dmy(DATE_END.1)) - max(dmy("1-1-2017"),dmy(DATE_START))),
"X2018" = max(0, min(dmy("31-12-2018"),dmy(DATE_END.1)) - max(dmy("1-1-2018"),dmy(DATE_START)))) %>%
ungroup %>%
group_by(ID) %>%
mutate_at(vars(starts_with("X")),list(~ case_when(. == max(.) ~ 1,
TRUE ~ 0)))
# A tibble: 16 x 11
# Groups: ID [6]
ID CITY ZIPCODE DATE_START DATE_END.1 X2013 X2014 X2015 X2016 X2017 X2018
<int> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NY 1234EF 1-12-2003 31-12-2018 1 1 1 1 1 1
2 2 NY 1234CD 1-12-2003 14-1-2019 1 1 1 1 1 1
3 2 NY 1234AB 15-1-2019 31-12-2018 0 0 0 0 0 0
4 3 NY 1234AB 15-1-2019 31-12-2018 0 0 0 0 0 0
5 3 NY 1234CD 1-12-2003 14-1-2019 1 1 1 1 1 1
6 4 LA 1111AB 4-5-2013 31-12-2018 1 1 1 1 1 1
7 4 NY 2222AB 1-12-2003 3-5-2013 0 0 0 0 0 0
8 5 MIAMI 5555CD 6-2-2015 20-6-2016 0 0 0 0 0 0
9 5 VEGAS 3333AB 1-1-2004 31-12-2018 1 1 1 1 1 1
10 5 ORLANDO 4444AB 26-2-2004 5-2-2015 1 1 0 0 0 0
11 5 MIAMI 5555AB 21-6-2016 31-12-2018 0 0 0 0 1 1
12 5 MIAMI 5555AB 1-1-2019 31-12-2018 0 0 0 0 0 0
13 6 AUSTIN 6666AB 28-2-2017 3-11-2017 0 0 0 0 1 0
14 6 AUSTIN 6666AB 4-11-2017 31-12-2018 0 0 0 0 0 1
15 6 AUSTIN 7777AB 20-1-2017 27-2-2017 0 0 0 0 0 0
16 6 AUSTIN 8888AB 1-12-2003 19-1-2017 1 1 1 1 0 0
There is certainly a way to implement the first mutate call without manually writing out each year, but it would take more work than just typing it out.
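For what it's worth, a rough sketch of that idea: a hypothetical days_in_year() helper replaces the six hand-written mutate lines, and across() (dplyr >= 1.0) replaces mutate_at. Untested against the original data, so treat it as a starting point:
library(lubridate)
library(dplyr)

# Days that the interval [DATE_START, DATE_END.1] overlaps a given calendar year
days_in_year <- function(start, end, year) {
  pmax(0, as.numeric(pmin(dmy(paste0("31-12-", year)), dmy(end)) -
                       pmax(dmy(paste0("1-1-", year)), dmy(start))))
}

years <- 2013:2018
year_cols <- sapply(years, function(y) days_in_year(v1$DATE_START, v1$DATE_END.1, y))
colnames(year_cols) <- paste0("X", years)

v1 %>%
  dplyr::select(ID, CITY, ZIPCODE, DATE_START, DATE_END.1) %>%
  bind_cols(as.data.frame(year_cols)) %>%
  group_by(ID) %>%
  mutate(across(starts_with("X"), ~ +(.x == max(.x)))) %>%
  ungroup()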
I have a dataset with the variables ID, Month (or period), and the incomes of that month. What I need is to put a 1 if the client buys in the next 3 months, or a 0 if not, and do it for every ID.
For example, if I am in month 1 and there is a purchase in the next 3 months, then put a 1 in that row for that client.
In the last periods, where there are not 3 months left, an NA appears.
df<-tibble::tribble(
~ID, ~Month, ~Incomes,
1L, 1L, 5000L,
1L, 2L, 0L,
1L, 3L, 0L,
1L, 4L, 0L,
1L, 5L, 0L,
1L, 6L, 0L,
1L, 7L, 400L,
1L, 8L, 300L,
1L, 9L, 0L,
1L, 10L, 0L,
1L, 11L, 0L,
1L, 12L, 0L,
1L, 13L, 400L,
2L, 1L, 0L,
2L, 2L, 100L,
2L, 3L, 0L,
2L, 4L, 0L,
2L, 5L, 0L,
2L, 6L, 0L,
2L, 7L, 0L,
2L, 8L, 1500L,
2L, 9L, 0L,
2L, 10L, 0L,
2L, 11L, 0L,
2L, 12L, 100L,
2L, 13L, 750L,
3L, 1L, 0L,
3L, 2L, 0L,
3L, 3L, 0L,
3L, 4L, 0L,
3L, 5L, 700L,
3L, 6L, 240L,
3L, 7L, 100L,
3L, 8L, 0L,
3L, 9L, 0L,
3L, 10L, 0L,
3L, 11L, 0L,
3L, 12L, 500L,
3L, 13L, 760L
)
df<-as.data.frame(df)
# ID Month Incomes
# 1 1 5000
# 1 2 0
# 1 3 0
# 1 4 0
# 1 5 0
# 1 6 0
# 1 7 400
# 1 8 300
# 1 9 0
# 1 10 0
# 1 11 0
# 1 12 0
# 1 13 400
# 2 1 0
# 2 2 100
# 2 3 0
# 2 4 0
# 2 5 0
# 2 6 0
# 2 7 0
# 2 8 1500
# 2 9 0
# 2 10 0
# 2 11 0
# 2 12 100
# 2 13 750
# 3 1 0
# 3 2 0
# 3 3 0
# 3 4 0
# 3 5 700
# 3 6 240
# 3 7 100
# 3 8 0
# 3 9 0
# 3 10 0
# 3 11 0
# 3 12 500
# 3 13 760
What I hope to get looks like this:
dffinal<- tibble::tribble(
~ID_RUT, ~Month, ~Incomes, ~Quarter,
1L, 1L, 5000L, 0L,
1L, 2L, 0L, 0L,
1L, 3L, 0L, 0L,
1L, 4L, 0L, 1L,
1L, 5L, 0L, 1L,
1L, 6L, 0L, 1L,
1L, 7L, 400L, 1L,
1L, 8L, 300L, 0L,
1L, 9L, 0L, 0L,
1L, 10L, 0L, 0L,
1L, 11L, 0L, NA,
1L, 12L, 0L, NA,
1L, 13L, 400L, NA,
2L, 1L, 0L, 1L,
2L, 2L, 100L, 0L,
2L, 3L, 0L, 0L,
2L, 4L, 0L, 0L,
2L, 5L, 0L, 1L,
2L, 6L, 0L, 1L,
2L, 7L, 0L, 1L,
2L, 8L, 1500L, 0L,
2L, 9L, 0L, 1L,
2L, 10L, 0L, 1L,
2L, 11L, 0L, NA,
2L, 12L, 100L, NA,
2L, 13L, 750L, NA,
3L, 1L, 0L, 0L,
3L, 2L, 0L, 1L,
3L, 3L, 0L, 1L,
3L, 4L, 0L, 1L,
3L, 5L, 700L, 1L,
3L, 6L, 240L, 1L,
3L, 7L, 100L, 0L,
3L, 8L, 0L, 0L,
3L, 9L, 0L, 1L,
3L, 10L, 0L, 1L,
3L, 11L, 0L, NA,
3L, 12L, 500L, NA,
3L, 13L, 760L, NA
)
# ID Month Incomes Quarterly
# 1 1 5000 0
# 1 2 0 0
# 1 3 0 0
# 1 4 0 1
# 1 5 0 1
# 1 6 0 1
# 1 7 400 1
# 1 8 300 0
# 1 9 0 0
# 1 10 0 0
# 1 11 0 NA
# 1 12 0 NA
# 1 13 400 NA
# 2 1 0 1
# 2 2 100 0
# 2 3 0 0
# 2 4 0 0
# 2 5 0 1
# 2 6 0 1
# 2 7 0 1
# 2 8 1500 0
# 2 9 0 1
# 2 10 0 1
# 2 11 0 NA
# 2 12 100 NA
# 2 13 750 NA
# 3 1 0 0
# 3 2 0 1
# 3 3 0 1
# 3 4 0 1
# 3 5 700 1
# 3 6 240 1
# 3 7 100 0
# 3 8 0 0
# 3 9 0 1
# 3 10 0 1
# 3 11 0 NA
# 3 12 500 NA
# 3 13 760 NA
Does anyone know how to do it? Thanks for your time.
1) rollapply Roll forward along Incomes > 0, returning TRUE if any are TRUE and FALSE otherwise, and convert that to numeric using +. list(1:3) means use offsets 1, 2, 3 from the current point, i.e. the next three incomes. Add the partial = TRUE argument to rollapply if you want to consider just the next one or two incomes near the end of each group, where there are not three left.
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(Quarter = +rollapply(Incomes > 0, list(1:3), any, fill = NA)) %>%
ungroup
2) SQL An SQL solution would be:
library(sqldf)
over <- "partition by ID rows between 1 following and 3 following"
fn$sqldf("select
*,
(max(Incomes > 0) over ($over)) +
(case when (count(*) over ($over)) = 3 then 0 else Null end) as Quarter
from df")
This can be simplified if it is OK to process elements for which there are fewer than 3 rows following. over is from above:
fn$sqldf("select *, (max(Incomes > 0) over ($over)) as Quarter from df")
A dplyr solution: sum the next three months using lead and take the sign of the result.
df %>%
group_by(ID) %>%
mutate(quarter = sign(lead(Incomes, 3) + lead(Incomes, 2) + lead(Incomes))) %>%
as.data.frame()
#> ID Month Incomes quarter
#> 1 1 1 5000 0
#> 2 1 2 0 0
#> 3 1 3 0 0
#> 4 1 4 0 1
#> 5 1 5 0 1
#> 6 1 6 0 1
#> 7 1 7 400 1
#> 8 1 8 300 0
#> 9 1 9 0 0
#> 10 1 10 0 1
#> 11 1 11 0 NA
#> 12 1 12 0 NA
#> 13 1 13 400 NA
#> 14 2 1 0 1
#> 15 2 2 100 0
#> 16 2 3 0 0
#> 17 2 4 0 0
#> 18 2 5 0 1
#> 19 2 6 0 1
#> 20 2 7 0 1
#> 21 2 8 1500 0
#> 22 2 9 0 1
#> 23 2 10 0 1
#> 24 2 11 0 NA
#> 25 2 12 100 NA
#> 26 2 13 750 NA
#> 27 3 1 0 0
#> 28 3 2 0 1
#> 29 3 3 0 1
#> 30 3 4 0 1
#> 31 3 5 700 1
#> 32 3 6 240 1
#> 33 3 7 100 0
#> 34 3 8 0 0
#> 35 3 9 0 1
#> 36 3 10 0 1
#> 37 3 11 0 NA
#> 38 3 12 500 NA
#> 39 3 13 760 NA
Another option:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(
Quarterly = c(
sapply(1:(n() - 3), function(x) +any(Incomes[(x + 1):(x + 3)] > 0)),
rep(NA, 3)
)
) %>% as.data.frame
Output:
ID Month Incomes Quarterly
1 1 1 5000 0
2 1 2 0 0
3 1 3 0 0
4 1 4 0 1
5 1 5 0 1
6 1 6 0 1
7 1 7 400 1
8 1 8 300 0
9 1 9 0 0
10 1 10 0 1
11 1 11 0 NA
12 1 12 0 NA
13 1 13 400 NA
14 2 1 0 1
15 2 2 100 0
16 2 3 0 0
17 2 4 0 0
18 2 5 0 1
19 2 6 0 1
20 2 7 0 1
21 2 8 1500 0
22 2 9 0 1
23 2 10 0 1
24 2 11 0 NA
25 2 12 100 NA
26 2 13 750 NA
27 3 1 0 0
28 3 2 0 1
29 3 3 0 1
30 3 4 0 1
31 3 5 700 1
32 3 6 240 1
33 3 7 100 0
34 3 8 0 0
35 3 9 0 1
36 3 10 0 1
37 3 11 0 NA
38 3 12 500 NA
39 3 13 760 NA
And a base equivalent:
transform(df, Quarterly = ave(Incomes, ID,
FUN = function(x) c(
sapply(1:(length(x) - 3), function(y) +any(x[(y + 1):(y + 3)] > 0)),
rep(NA, 3)
)
)
)
I'm following the vcd docs, where assocstats is called on an xtabs result for multiple subsets of a data frame. However, I get NaNs with a specific subset because the expected counts for many cells are 0:
factor.2
factor.1 0 1 2 3 4 5 or more
0 0 12 7 1 0 1
1 0 2 1 1 0 0
2 0 8 2 1 0 0
3 0 5 4 0 0 0
4 0 1 2 2 0 0
5 0 6 8 0 0 0
6 0 5 3 0 0 0
7 0 5 1 0 0 0
8 0 5 4 0 1 0
9 0 1 1 0 1 0
10 0 5 6 0 0 1
temp.table <- structure(c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 12L,
2L, 8L, 5L, 1L, 6L, 5L, 5L, 5L, 1L, 5L, 7L, 1L, 2L, 4L, 2L, 8L,
3L, 1L, 4L, 1L, 6L, 1L, 1L, 1L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L), .Dim = c(11L, 6L), .Dimnames = structure(list(
factor.1 = c("0", "1", "2", "3", "4", "5", "6", "7", "8",
"9", "10"), factor.2 = c("0", "1", "2", "3", "4", "5 or more"
)), .Names = c("factor.1", "factor.2")), class = c("xtabs",
"table"), call = xtabs(data = cases.limited, na.action = na.omit))
library(vcd)
assocstats(temp.table)
X^2 df P(> X^2)
Likelihood Ratio 35.004 50 0.94676
Pearson NaN 50 NaN
Phi-Coefficient : NA
Contingency Coeff.: NaN
Cramer's V : NaN
Is there a way to quickly and efficiently avoid including these cases in the analysis without extensively rewriting what assocstats or xtabs do? I understand that there is arguably less statistical power, but Cramer's V is already an optimistic estimator, so the results will still be useful to me.
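One common workaround (a sketch, not from the vcd docs) is to drop any rows and columns whose marginal totals are zero before calling assocstats, since zero margins are what force the expected counts to zero:
library(vcd)
# keep only rows/columns with a nonzero marginal total
temp.trimmed <- temp.table[rowSums(temp.table) > 0, colSums(temp.table) > 0]
assocstats(temp.trimmed)
Here this drops the all-zero "0" column of factor.2, after which the Pearson statistic is defined.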
I have a data table:
> COUNT_ID_CATEGORY
id 706 799 1703 1726 2119 2202 3203 3504 3509 4401 4517 5122 5558 5616 5619 5824 6202 7205 9115 9909
1: 86246 9 0 15 4 28 0 15 63 39 5 7 25 27 43 12 64 1 16 0 96
2: 86252 3 0 17 6 21 0 6 62 24 6 7 12 25 32 6 49 1 26 0 103
3: 12262064 3 0 1 1 12 0 0 2 1 0 0 0 2 4 0 4 0 0 0 12
4: 12277270 2 0 0 0 1 0 3 0 3 0 0 0 0 24 0 6 2 5 0 60
5: 12332190 2 0 2 0 4 0 1 2 0 0 0 1 0 3 0 1 3 2 0 46
---
310661: 4837642552 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0
310662: 4843417324 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0
310663: 4847628950 2 0 1 1 16 0 0 2 3 0 0 2 9 5 0 3 3 2 3 14
310664: 4847787712 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
310665: 4853598737 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
> class(COUNT_ID_CATEGORY)
[1] "data.table" "data.frame"
and I wish to read the data as quickly as possible, for example:
COUNT_ID_CATEGORY for (id == 86246) & (category == 706)
which should return the value 9 (top left in the table).
I can get the row with:
COUNT_ID_CATEGORY[id==86246,]
but how do I get the column?
> dput(head(COUNT_ID_CATEGORY))
structure(list(id = c(86246, 86252, 12262064, 12277270, 12332190,
12524696), `706` = c(9L, 3L, 3L, 2L, 2L, 0L), `799` = c(0L, 0L,
0L, 0L, 0L, 0L), `1703` = c(15L, 17L, 1L, 0L, 2L, 0L), `1726` = c(4L,
6L, 1L, 0L, 0L, 0L), `2119` = c(28L, 21L, 12L, 1L, 4L, 0L), `2202` = c(0L,
0L, 0L, 0L, 0L, 0L), `3203` = c(15L, 6L, 0L, 3L, 1L, 0L), `3504` = c(63L,
62L, 2L, 0L, 2L, 11L), `3509` = c(39L, 24L, 1L, 3L, 0L, 3L),
`4401` = c(5L, 6L, 0L, 0L, 0L, 1L), `4517` = c(7L, 7L, 0L,
0L, 0L, 1L), `5122` = c(25L, 12L, 0L, 0L, 1L, 0L), `5558` = c(27L,
25L, 2L, 0L, 0L, 1L), `5616` = c(43L, 32L, 4L, 24L, 3L, 18L
), `5619` = c(12L, 6L, 0L, 0L, 0L, 0L), `5824` = c(64L, 49L,
4L, 6L, 1L, 10L), `6202` = c(1L, 1L, 0L, 2L, 3L, 6L), `7205` = c(16L,
26L, 0L, 5L, 2L, 4L), `9115` = c(0L, 0L, 0L, 0L, 0L, 0L),
`9909` = c(96L, 103L, 12L, 60L, 46L, 1L)), .Names = c("id",
"706", "799", "1703", "1726", "2119", "2202", "3203", "3504",
"3509", "4401", "4517", "5122", "5558", "5616", "5619", "5824",
"6202", "7205", "9115", "9909"), sorted = "id", class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x043a24a0>)
First setkey for fast lookup using data.table's binary search/subset feature:
setkey(COUNT_ID_CATEGORY, id)
Then you can do:
COUNT_ID_CATEGORY[J(86246)][, '706', with = FALSE]
The first part, COUNT_ID_CATEGORY[J(86246)], performs a fast subset using binary search; J(.) constructs the value to join against the key.
The next part, [, '706', with = FALSE], takes the subset result, which is a data.table, and selects just the column 706.
Just to be complete, there are more ways of selecting/subsetting columns from a data.table, but this covers the common case.
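If you want the value itself rather than a one-row data.table, one option (a sketch) is to extract the column as a plain vector with [[:
# returns 9 as a length-1 integer vector instead of a data.table
COUNT_ID_CATEGORY[J(86246)][['706']]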