Contingency table based on third variable (numeric) - r

Some time ago I asked a question about creating market basket data. Now I would like to create a similar data.frame, but based on a third variable. Unfortunately, I run into problems when trying to do so. Previous question: Effecient way to create market basket matrix in R
@shadow and @SimonO101 gave me good answers, but I was not able to adapt their answers correctly. I have the following data:
Customer <- as.factor(c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003))
Product <- as.factor(c(100001,100001,100001,100004,100004,100002,100003,100003,100003,100002,100003,100008))
input <- data.frame(Customer,Product)
I can now create a contingency table in the following way:
input_df <- as.data.frame.matrix(table(input))
However, I have a third (numeric) variable that I want as the values in the table.
Number <- c(3,1,-4,1,1,1,1,1,1,1,1,1)
input <- data.frame(Customer,Product,Number)
Now the code no longer works (of course, since there are now three variables). The result I am looking for has the unique Customer values as row names and the unique Product values as column names, with Number as the cell value (or 0 if the combination is not present). This number could be calculated by:
input_agg <- aggregate( Number ~ Customer + Product, data = input, sum)
Hope my question is clear; please comment if something is not.
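A base-R sketch of the desired output, using tapply() on the sample data above (this is only an illustrative sketch; missing Customer/Product pairs come back as NA and need to be zeroed):
# Customer x Product matrix of summed Number values.
wide <- tapply(input$Number, list(input$Customer, input$Product), sum)
wide[is.na(wide)] <- 0   # combinations that never occur become 0 instead of NA
wide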

You can use xtabs for that:
R> xtabs(Number ~ Customer + Product, data = input)
         Product
Customer  100001 100002 100003 100004 100008
  1000001      0      1      0      2      0
  1000002      0      0      3      0      0
  1000003      0      1      1      0      1
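If, as in the table() approach from the question, a plain data.frame is wanted rather than an xtabs object, the same conversion used in the question should work (a small addition, not part of the original answer):
# Convert the xtabs result to a data.frame, mirroring as.data.frame.matrix(table(input)).
input_df <- as.data.frame.matrix(xtabs(Number ~ Customer + Product, data = input))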

This class of problem is exactly what reshape2::dcast is designed for...
require(reshape2)
# For very large inputs, see the data.table version further down.
dcast(input, Customer ~ Product, fun.aggregate = sum, value.var = "Number")
#   Customer 100001 100002 100003 100004 100008
# 1  1000001      0      1      0      2      0
# 2  1000002      0      0      3      0      0
# 3  1000003      0      1      1      0      1
Recently, a method for using dcast with data.table objects was implemented by @Arun in response to FR #2627. Great stuff. You will have to use the development version, 1.8.11. Also, at the moment it has to be called as dcast.data.table, because dcast is not yet an S3 generic in the reshape2 package. That is, you can do:
require(reshape2)
require(data.table)
input <- data.table(input)
dcast.data.table(input, Customer ~ Product, fun.aggregate = sum, value.var = "Number")
#    Customer 100001 100002 100003 100004 100008
# 1:  1000001      0      1      0      2      0
# 2:  1000002      0      0      3      0      0
# 3:  1000003      0      1      1      0      1
This should work quite well on bigger data, and it should be much faster than reshape2::dcast as well.
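If you want to check that speed claim on your own machine, a rough benchmark sketch (the sizes below are arbitrary, not from the original answer) could look like this:
# Time both dcast flavours on a larger, simulated basket.
library(reshape2)
library(data.table)
set.seed(1)
big <- data.table(Customer = sample(1e4, 1e6, replace = TRUE),
                  Product  = sample(500, 1e6, replace = TRUE),
                  Number   = rnorm(1e6))
system.time(dcast(as.data.frame(big), Customer ~ Product,
                  fun.aggregate = sum, value.var = "Number"))
system.time(dcast.data.table(big, Customer ~ Product,
                             fun.aggregate = sum, value.var = "Number"))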
Alternatively, you can try the reshape::cast version, which may or may not crash... Try it!!
require(reshape)
input <- data.table(input)
cast(input, Customer ~ Product, fun = sum, value = .(Number))

Related

Problem finding number of elements in a dataframe in R

I have downloaded the data frame casos_hosp_uci_def_sexo_edad_provres_60_mas.csv, which describes the number of people infected with Covid-19 in Spain, classified by province, age, gender, etc., from this webpage. I read and represent the dataframe as:
db<-read.csv(file = 'casos_hosp_uci_def_sexo_edad_provres.csv')
The first five rows are shown below:
  provincia_iso sexo grupo_edad      fecha num_casos num_hosp num_uci num_def
1             A    H        0-9 2020-01-01         0        0       0       0
2             A    H      10-19 2020-01-01         0        0       0       0
3             A    H      20-29 2020-01-01         0        0       0       0
4             A    H      30-39 2020-01-01         0        0       0       0
5             A    H      40-49 2020-01-01         0        0       0       0
The first four columns of the data frame show the name of the province, the gender of the people, the age group, and the date; the last four columns show the number of people who got ill, were hospitalized, were in ICU, or died.
I want to use R to find the day with the highest number of contagions. To do that, I have to sum the elements of the fifth column, num_casos, for each different value of the column fecha.
I have already been able to calculate the number of sick males as hombresEnfermos = sum(db[which(db$sexo == "H"), 5]). However, I think there has to be a better way to find the days with the highest contagion than counting manually, but I cannot figure out how.
Can someone please help me?
Using dplyr to get the total by date:
library(dplyr)
db %>% group_by(fecha) %>% summarise(total = sum(num_casos))
Two alternatives in base R:
data.frame(fecha = sort(unique(db$fecha)),
           total = sapply(split(db, f = db$fecha), function(x) {sum(x[['num_casos']])}))
Or more simply,
aggregate(db$num_casos, list(db$fecha), FUN=sum)
An alternative in data.table:
library(data.table)
db <- as.data.table(db)
db[, list(total=sum(num_casos)), by = fecha]
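Whichever of these aggregations you use, the day with the most cases is then just the row with the maximum total. A short follow-up sketch, assuming the dplyr result is stored in totals:
library(dplyr)
totals <- db %>% group_by(fecha) %>% summarise(total = sum(num_casos))
totals %>% filter(total == max(total))   # date(s) with the highest number of cases
# Base R equivalent on the aggregate() result:
# agg <- aggregate(db$num_casos, list(fecha = db$fecha), FUN = sum)
# agg[which.max(agg$x), ]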

Factor Level issues after filling data frame using match

I am using two large data files, each having >2m records. The sample data frames are
x <- data.frame("ItemID" = c(1,2,1,1,3,4,2,3,4,1), "SessionID" = c(111,112,111,112,113,114,114,115,115,115), "Avg" = c(1.0,0.45,0.5,0.5,0.46,0.34,0.5,0.6,0.10,0.15),"Category" =c(0,0,0,0,0,0,0,0,0,0))
y <- data.frame("ItemID" = c(1,2,3,4,3,4,5,7),"Category" = c("1","0","S","120","S","120","512","621"))
I successfully filled x$Category using the following command:
x$Category <- y$Category[match(x$ItemID,y$ItemID)]
but
x$Category
gave me
[1] 1 0 1 1 S 120 0 S 120 1
Levels: 0 1 120 512 621 S
In x there are only four distinct categories, but the levels show six. Similarly, the frequency table shows 512 and 621 with a frequency of 0. I am using the same data for classification, where it shows six classes instead of four, which negatively affects the F-measure, recall, etc.
table(x$Category)
  0   1 120 512 621   S
  2   4   2   0   0   2
while I want
table(x$Category)
  0   1 120   S
  2   4   2   2
I tried merging, following this and this along with a number of other questions, but it gives me an error message. I found here, Practical limits of R data frame, that it is a limitation of R.
I would omit the Category column from your x data.frame, since it seems to only be serving as a placeholder until values from the y data.frame are filled in. Then, you can use left_join from dplyr with ItemID as the key variable, followed by droplevels() as suggested by TingITangIBob.
This gets you close, but my table does not exactly match yours:
dplyr::select(x, -Category) %>%
  dplyr::left_join(y, by = "ItemID") %>%
  droplevels()

  0   1 120   S
  2   4   4   4
I think this may have to do with the repeated ItemIDs in x?
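For completeness, a sketch that stays closer to the original match() approach: dropping the unused levels directly on the matched factor should reproduce the table the asker expects (using the x and y defined above).
# Keep the match()-based fill, then drop the levels (512, 621) that never occur in x.
x$Category <- droplevels(y$Category[match(x$ItemID, y$ItemID)])
table(x$Category)
#   0   1 120   S
#   2   4   2   2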

Creating a subset of unique entries for a recursive list in R

I have the following data set df
name draught nav_status       date
   A      22          0 24/12/2014
   A      22          0 25/12/2014
   A      11          5 26/12/2014
   A      11          1 27/12/2014
   B      22          0 24/12/2014
   B      22          0 25/12/2014
   B      22          0 26/12/2014
   B      22          5 27/12/2014
   B       9          0 28/12/2014
   B      22          0 29/12/2014
From this data set, I need to extract the unique draught values for each object of the list.
I am fairly new to R and have made the following attempts:
y <- subset(df,!duplicated(df[,draught]),)
and
Dup <- function(x){
  x <- x[!duplicated(x$draught), ]
}
y <- lapply(df, Dup)
But this deletes the duplicated draught entries across the entire data. I went through some literature regarding split-apply-combine techniques and also tried those options.
Please provide some guidance or literature to help solve this problem.
The result should be
name draught nav_status       date
   A      22          0 24/12/2014
   A      11          5 26/12/2014
   A      11          1 27/12/2014
   B      22          0 25/12/2014
   B       9          0 28/12/2014
I even tried to subset the data based on first and last entries by arranging them sequentially and deleting the duplicate entries, but there was a loss of data. Thank you!!
Using the data.table library, you can arrive at the result with:
library(data.table)
dt <- as.data.table(df)
unique(dt, by = c('name', 'draught'))
One thing, though: why do you have two entries for the pair A 11 in your desired result?
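If data.table is not an option, a base-R sketch of the same idea (keeping only the first row of each name/draught pair, so, like the call above, only one A 11 row) would be:
# Drop rows whose name/draught combination has already appeared.
df[!duplicated(df[c("name", "draught")]), ]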

Aggregating big data in R

I have a dataset (dat) that looks like this:
Team   Person Performance1 Performance2
   1 36465930            1          101
   1 37236856            1          101
   1 34940210            1          101
   1 29135524            1          101
   2 10318268            1          541
   2   641793            1          541
   2 32352593            1          541
   2  2139024            1          541
   3 35193922            2          790
   3 32645504            2          890
   3 32304024            2          790
   3 22696491            2          790
I am trying to identify and remove all teams that have variance on Performance1 or Performance2. For example, team 3 in the example has variance on Performance2, so I would want to remove that team from the dataset. Here is the code as I've written it:
tda <- aggregate(dat, by = list(dat$Team), FUN = sd)
tda1 <- tda[which(tda$Performance1 != 0 | tda$Performance2 != 0), ]
The problem is that there are over 100,000 teams in my dataset, so my first line of code is taking an extremely long time, and I'm not sure if it will ever finish aggregating the dataset. What would be a more efficient way to solve this problem?
Thanks in advance! :)
Sincerely,
Amy
The dplyr package is generally very fast. Here's a way to select only those teams with standard deviation equal to zero for both Performance1 and Performance2:
library(dplyr)
datAggregated = dat %>%
  group_by(Team) %>%
  summarise(sdP1 = sd(Performance1),
            sdP2 = sd(Performance2)) %>%
  filter(sdP1 == 0 & sdP2 == 0)
datAggregated
  Team sdP1 sdP2
1    1    0    0
2    2    0    0
Using data.table for big datasets
library(data.table)
setDT(dat)[, setNames(lapply(.SD, sd), paste0("sdP", 1:2)),
           .SDcols = 3:4, by = Team][, .SD[!sdP1 & !sdP2]]
#    Team sdP1 sdP2
# 1:    1    0    0
# 2:    2    0    0
If you have more Performance columns, you could use summarise_each from dplyr:
datNew <- dat %>%
  group_by(Team) %>%
  summarise_each(funs(sd), starts_with("Performance"))
colnames(datNew)[-1] <- paste0("sdP", head(seq_along(datNew), -1))
datNew[!rowSums(datNew[-1]), ]
which gives the output
#   Team sdP1 sdP2
# 1    1    0    0
# 2    2    0    0
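Once the zero-variance teams have been identified, actually removing the varying teams from the original data takes one more step; a short sketch using the dplyr result datAggregated from above:
# Keep only rows whose Team is among the zero-variance teams.
datClean <- dat[dat$Team %in% datAggregated$Team, ]
# or, with dplyr:
# datClean <- semi_join(dat, datAggregated, by = "Team")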

plyr to calculate relative aggregation

I have a data.frame that looks like this:
> head(activity_data)
    ev_id cust_id active previous_active start_date
1 1141880     201      1               0 2008-08-17
2 4927803     201      1               0 2013-03-17
3 1141880     244      1               0 2008-08-17
4 2391524     244      1               0 2011-02-05
5 1141868     325      1               0 2008-08-16
6 1141872     325      1               0 2008-08-16
for each cust_id
  for each ev_id
    create a new variable $recent_active (= sum $active across all rows with this cust_id where $start_date > [this_row]$start_date - 10)
I am struggling to do this using ddply, as my split grouping was .(cust_id) and I wanted to return rows with both cust_id and ev_id.
Here is what I tried:
ddply(activity_data, .(cust_id), function(x) recent_active=sum(x[this_row,]$active))
If ddply is not an option, what other efficient ways do you recommend? My dataset has ~200mn rows, and I need to do this about 10-15 times per row.
Sample data is here.
You actually need to use a two-step approach here (and you also need to convert the date into Date format before using the following code):
ddply(activity_data, .(cust_id), transform, recent_active = your function)  # not clear what you are asking regarding the function
ddply(activity_data, .(cust_id, ev_id), summarize, recent_active = sum(recent_active))
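Since the answer leaves the inner function open, here is one hedged sketch of what it could look like: for every row, sum active over the rows of the same cust_id whose start_date is later than that row's start_date minus 10 days. This assumes start_date has already been converted with as.Date, and it will be slow on ~200mn rows (a data.table non-equi join would be the next thing to try).
# Sketch only: one possible replacement for the "your function" placeholder above.
library(plyr)
activity_data$start_date <- as.Date(activity_data$start_date)

recent_active_fun <- function(x) {
  # For each row in this cust_id group, sum `active` over rows whose
  # start_date falls after this row's start_date minus 10 days.
  x$recent_active <- sapply(x$start_date, function(d) sum(x$active[x$start_date > d - 10]))
  x
}

result <- ddply(activity_data, .(cust_id), recent_active_fun)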
