Aggregating big data in R

I have a dataset (dat) that looks like this:
Team Person Performance1 Performance2
1 36465930 1 101
1 37236856 1 101
1 34940210 1 101
1 29135524 1 101
2 10318268 1 541
2 641793 1 541
2 32352593 1 541
2 2139024 1 541
3 35193922 2 790
3 32645504 2 890
3 32304024 2 790
3 22696491 2 790
I am trying to identify and remove all teams that have variance on Performance1 or Performance2. So, for example, team 3 above has variance on Performance2, so I would want to remove that team from the dataset. Here is the code as I've written it:
tda <- aggregate(dat, by=list(dat$Team), FUN=sd)
tda1 <- tda[ which(tda$Performance1 != 0 | tda$Performance2 != 0), ]
The problem is that there are over 100,000 teams in my dataset, so my first line of code is taking an extremely long time, and I'm not sure if it will ever finish aggregating the dataset. What would be a more efficient way to solve this problem?
Thanks in advance! :)
Sincerely,
Amy

The dplyr package is generally very fast. Here's a way to select only those teams with standard deviation equal to zero for both Performance1 and Performance2:
library(dplyr)
datAggregated = dat %>%
  group_by(Team) %>%
  summarise(sdP1 = sd(Performance1),
            sdP2 = sd(Performance2)) %>%
  filter(sdP1 == 0 & sdP2 == 0)
datAggregated
Team sdP1 sdP2
1 1 0 0
2 2 0 0
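If you also want to drop those teams from the original data (rather than just list the zero-variance team ids), you can join the summary back onto dat. A minimal sketch continuing from the pipeline above (datFiltered is just an illustrative name):
# keep only rows of dat whose Team appears in datAggregated,
# i.e. teams with zero standard deviation on both Performance columns
datFiltered <- semi_join(dat, datAggregated, by = "Team")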

Using data.table for big datasets
library(data.table)
setDT(dat)[, setNames(lapply(.SD, sd), paste0("sdP", 1:2)),
           .SDcols = 3:4, by = Team][, .SD[!sdP1 & !sdP2]]
# Team sdP1 sdP2
#1: 1 0 0
#2: 2 0 0
If you have more Performance columns, you could use summarise_each from dplyr:
datNew <- dat %>%
  group_by(Team) %>%
  summarise_each(funs(sd), starts_with("Performance"))
colnames(datNew)[-1] <- paste0("sdP", head(seq_along(datNew), -1))
datNew[!rowSums(datNew[-1]), ]
which gives the output
# Team sdP1 sdP2
#1 1 0 0
#2 2 0 0
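As a side note, summarise_each() and funs() have since been superseded in newer dplyr releases; the same idea can be written with across(). A rough equivalent, as a sketch (not tested against your data):
library(dplyr)
datNew <- dat %>%
  group_by(Team) %>%
  summarise(across(starts_with("Performance"), sd, .names = "sd{.col}"))
# keep teams whose Performance columns all have zero standard deviation
datNew[rowSums(datNew[-1]) == 0, ]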

Related

How to take data elements in a dataset and turn them into rows and columns?

Before I start off, I should say I had never used R until yesterday; I only know some Python and am very much a beginner. I'm using R because I cannot figure out how to do anything in Excel, and I have already made more progress with R.
So I have a seemingly unique problem I'm trying to solve. I have a data set that looks similar to this:
ID Contaminant
1 123 Lead
2 123 Copper
3 456 Lead
4 678 Iron
5 456 Lead
6 111 Iron
7 222 Arsenic
I want to take this data and create a new xlsx or csv file from it for data analysis. I want to see how many times each ID has had a contaminant and what that contaminant is. I think I have figured out how to get the unique counts of how many IDs are associated with each Contaminant type and how many IDs have a unique Contaminant associated with them, if that makes sense.
I want the new data sheet to look something like this:
ID Lead Copper Iron Arsenic
123 1 1 0 0
456 2 0 0 0
678 0 0 1 0
111 0 0 1 0
222 0 0 0 1
So far I have figured out how to take my original data sheet, which contains a lot of variables, and turn it into the first data set I listed above, that only contains IDs and Contaminants.
My code is a bit rough, as I'm mimicking others' work, but what I have so far is:
violations <- tbl_df(`2015_Violations`)
new_violations <- select(violations, -TypeofViolation, -"CODETypeof Vio")
with(unique(violations[c("ID", "Contaminant")]), table(Contaminant))
with(unique(violations[c("ID", "Contaminant")]), table(ID))
write.csv(new_violations, file = "C:/r_stuff/new_violations.csv",
          row.names = F)
This spits out the unique numbers for Contaminants and ID's into some tables.
I am then using a different .R file to test this new .csv file. It simply contains this:
mydata <- read.csv("C:/r_stuff/new_violations.csv")
View(mydata)
So my question is, how can I take my data in the first table and turn it into a new file with the structure of the second? I assume this isn't a very easy task, but doing it by hand will be impossible as there are thousands of entries in the original data file.
I propose doing it like this:
library(tidyverse)
data %>%
  group_by(ID, Contaminant) %>%
  mutate(Count = n()) %>%
  distinct() %>%
  pivot_wider(names_from = Contaminant, values_from = Count) %>%
  ungroup() %>%
  mutate_all(replace_na, 0)
where:
data <-
  tibble(
    ID = c(123, 123, 456, 678, 456, 111, 222),
    Contaminant = c("Lead", "Copper", "Lead", "Iron", "Lead", "Iron", "Arsenic")
  )
It gives:
# A tibble: 5 x 5
ID Lead Copper Iron Arsenic
<dbl> <dbl> <dbl> <dbl> <dbl>
1 123 1 1 0 0
2 456 2 0 0 0
3 678 0 0 1 0
4 111 0 0 1 0
5 222 0 0 0 1
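Since you ultimately want the result written out as a new csv file, a slightly shorter variant of the same idea uses count() plus the values_fill argument of pivot_wider(), then writes the table with write_csv(). A sketch (the output path is only an example):
library(tidyverse)
data %>%
  count(ID, Contaminant) %>%    # one row per ID/Contaminant pair with its count
  pivot_wider(names_from = Contaminant, values_from = n, values_fill = 0) %>%
  write_csv("C:/r_stuff/contaminant_counts.csv")    # example path, adjust as needed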

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0
Now I require the output as:
KId sales_month quantity_sold result
100 1 0 1
100 2 0 1
100 3 0 1
496 2 6 1
511 2 10 1
846 1 4 1
846 2 6 1
846 3 1 0
338 1 6 1
338 2 0 1
Here, the calculation has to work as follows: if the quantity sold in March (month 3) is less than 60% of the combined quantity sold in January (1) and February (2), then the result should be 1; otherwise it should be 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I can suggest the dplyr package, which offers the nice feature of grouping rows and mutating columns in your data frame.
resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  mutate(monthMinus1Qty = lag(quantity_sold, 1),
         monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  mutate(result = ifelse(quantity_sold / previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)
Adding
select(KId, sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NAs in the result column are due to 0/0 division or no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add a year column and adjust the group_by() arguments appropriately.
For more information, see the dplyr package documentation.
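As an aside, the same logic can also be written without the intermediate lag columns and without the 0/0 case, by comparing the quantity directly against 60% of the previous two months. A sketch under the same assumptions about the data (result = 1 when the month is below 60% of the previous two months combined, else 0):
library(dplyr)
result <- data %>%
  group_by(KId) %>%
  arrange(sales_month, .by_group = TRUE) %>%
  mutate(previous2MonthsQty = coalesce(lag(quantity_sold, 1), 0) +
                              coalesce(lag(quantity_sold, 2), 0),
         result = ifelse(quantity_sold < 0.6 * previous2MonthsQty, 1, 0)) %>%
  ungroup() %>%
  select(KId, sales_month, quantity_sold, result)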

How to classify percentages by category in R

Input
titanic <- read.csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv")
names(titanic)
tbl <- table(titanic$survived)
cbind(tbl, prop.table(tbl))
Desired Output: Percentage survival categorized by gender, e.g. titanic$sex.
My attempt:
tapply(titanic$survived, titanic$sex, table)
This gets me a gender breakdown of survival by raw numbers;
$female
0 1
127 339
$male
0 1
682 161
but how do I get a breakdown by percentages? Ie, the desired output:
$female
0 1
.27 .72
$male
0 1
.81 .19
To generate the specific form you asked for:
lapply(split(titanic, titanic$sex), function(x) prop.table(table(x$survived)))
# $female
#
# 0 1
# 0.2725322 0.7274678
#
# $male
#
# 0 1
# 0.8090154 0.1909846
You're looking for the prop.table function
> prop.table(table(titanic$survived, titanic$sex), 2)
female male
0 0.2725322 0.8090154
1 0.7274678 0.1909846
Passing 1 as the margin instead would give the proportions within each survival outcome (row-wise) rather than within each sex.
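If you just want the survival rate per sex as a single number, a dplyr sketch (assuming survived is coded 0/1, as it is in titanic3.csv):
library(dplyr)
titanic %>%
  group_by(sex) %>%
  summarise(survival_rate = mean(survived, na.rm = TRUE))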

Optimization of an R loop taking 18 hours to run

I've got R code that works and does what I want, but it takes a huge amount of time to run. Here is an explanation of what the code does, followed by the code itself.
I've got a data frame of 200,000 rows containing street addresses (strings): data.
Example :
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a 131x2 matrix of string elements, which are 5-grams (parts of words) and the ids of the bags of n-grams (example of a 5-gram bag: ["stack", "tacko", "ackov", "ckove", "kover", "overf", ...]): list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I also have a 200000x31 numerical matrix initialized with 0: idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the string addresses and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the corresponding column, which represents the id of the bag that contains the 5-gram.
Example :
In this address: "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" exists in the bag ["resid", "eside", "dence", ...], whose id is 5. So I'm going to put 1 in the column called 5. Therefore the corresponding line of the idv_x_bags matrix will look like the following:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does :
idv_x_bags <- matrix(rep(0, nrow(data) * 31), nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for (i in 1:nrow(idv_x_bags))
{
  for (ngram in list_ngrams$ngram)
  {
    if (grepl(ngram, data[i, ]) == TRUE)
    {
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
      idv_x_bags[i, as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I aim to do, but it takes about 18 hours, which is huge. I tried to recode it in C++ using the Rcpp library, but I encountered many problems. I also tried to recode it using apply, but I couldn't get it to work.
Here is what I did :
apply(cbind(data, 1:nrow(data)), 1, function(x) {
  apply(list_ngrams, 1, function(y) {
    if (grepl(y[2], x[1]) == TRUE) {
      idv_x_bags[x[2], str_trim(as.character(y[1]))] <- 1
    }
  })
})
I need some help with recoding my loop using apply or some other method that runs faster than the current one. Thank you very much.
Check this one and run the simple example step by step to see how it works.
My n-grams don't make much sense, but it will work with actual n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2, 2, 2, 3, 3, 3),
                    ngr = c("th", "go", "tt", "drf", "ytu", "bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
  data.frame() %>%
  rename(sen = Var1,
         id_ngr = Var2) %>%
  left_join(dt_ngr, by = "id_ngr") %>%
  group_by(sen, id_ngr, ngr) %>%
  do(data.frame(match = grepl(.$ngr, .$sen))) %>%
  group_by(sen, id_ngr) %>%
  summarise(sum_success = sum(match)) %>%
  mutate(match = ifelse(sum_success > 0, 1, 0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
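If that pipeline is still too slow at 200,000 addresses, a fully vectorized base-R alternative is to collapse each bag into a single alternation pattern and run one grepl() per bag over all addresses. A rough sketch using the object names from the question, assuming the address column is called address (as in the printout) and that the n-grams contain no regex metacharacters:
# one regex per bag, e.g. "stree|tree_|_stre|treet" for bag 4
patterns <- tapply(list_ngrams$ngram, list_ngrams$idSac,
                   function(x) paste(x, collapse = "|"))
# one grepl() call per bag over the whole address vector gives a
# 200000 x 31 0/1 matrix, with column names taken from the bag ids
idv_x_bags <- sapply(patterns, function(p) as.integer(grepl(p, data$address)))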

plyr to calculate relative aggregation

I have a data.frame that looks like this:
> head(activity_data)
ev_id cust_id active previous_active start_date
1 1141880 201 1 0 2008-08-17
2 4927803 201 1 0 2013-03-17
3 1141880 244 1 0 2008-08-17
4 2391524 244 1 0 2011-02-05
5 1141868 325 1 0 2008-08-16
6 1141872 325 1 0 2008-08-16
for each cust_id
  for each ev_id
    create a new variable $recent_active (= sum of $active across all rows with this cust_id where $start_date > [this_row]$start_date - 10)
I am struggling to do this using ddply, as my split grouping was .(cust_id) and I wanted to return rows with cust_id and ev_id
Here is what I tried
ddply(activity_data, .(cust_id), function(x) recent_active=sum(x[this_row,]$active))
If ddply is not an option, what other efficient ways do you recommend? My dataset has ~200mn rows and I need to do this about 10-15 times per row.
sample data is here
You actually need a two-step approach here (and you also need to convert the date column into Date format before using the following code):
ddply(activity_data, .(cust_id), transform, recent_active = your function) # not clear what you are asking regarding the function
ddply(activity_data, .(cust_id, ev_id), summarize, recent_active = sum(recent_active))
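For completeness, here is a sketch of how the windowed sum itself could be written with dplyr, taking recent_active for each row as the sum of active over rows of the same cust_id whose start_date is later than that row's start_date minus 10 days. It is a naive per-group illustration of the logic; for ~200mn rows you would want a keyed data.table or a non-equi join instead:
library(dplyr)
activity_data %>%
  mutate(start_date = as.Date(start_date)) %>%
  group_by(cust_id) %>%
  mutate(recent_active = sapply(start_date, function(d)
    sum(active[start_date > d - 10]))) %>%
  ungroup()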
