I have two data sets. One has 2 million cases (individual donations to various causes), the other has about 38,000 (all zip codes in the U.S.).
I want to sort through the first data set and tally up the total number of donations by zip code. (Additionally, the total for each zip code will be broken down by cause.) Each case in the first data set includes the zip code of the corresponding donation and information about what kind of cause it went to.
Is there an efficient way to do this? The only approach that I (very much a novice) can think of is to use a for ... if loop to go through each case and count them up one by one. This seems like it would be really slow, though, for data sets of this size.
edit: thanks, @josilber. This gets me a step closer to what I'm looking for.
One more question, though. table seems to generate frequencies, correct? What if I'm actually looking for the sum of the donation amounts for each cause by zip code? For example, if the data frame looks like this:
dat3 <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
                   cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE),
                   amt = sample(250:2500, 2000000, replace=TRUE))
Suppose that instead of frequencies, I want to end up with output that looks like this:
# Cause 1(amt) Cause 2(amt) Cause 3(amt)
# Zip 1 (sum) (sum) (sum)
# Zip 2 (sum) (sum) (sum)
# Zip 3 (sum) (sum) (sum)
# etc. ... ... ...
Does that make sense?
Sure, you can accomplish what you're looking for with the table command in R. First, let's start with a reproducible example (I'll create one with 2 million cases, 3 zip codes, and 3 causes; I know you have more zip codes and causes, but that won't make the code take much longer to run):
# Data
set.seed(144)
dat <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
                  cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE))
Please note that it's a good idea to include a reproducible example with all your questions on Stack Overflow because it helps make sure we understand what you're asking! Basically you should include a sample dataset (like the one I've just included) along with your desired output for that dataset.
Now you can use the table function to count the number of donations in each zip code, broken down by cause:
table(dat$zip, dat$cause)
# Cause 1 Cause 2 Cause 3
# Zip 1 222276 222004 222744
# Zip 2 222068 222791 222363
# Zip 3 221015 221930 222809
This took about 0.3 seconds on my computer.
Could this work?
aggregate(amt~cause+zip,data=dat3,FUN=sum)
cause zip amt
1 Cause 1 Zip 1 306231179
2 Cause 2 Zip 1 306600943
3 Cause 3 Zip 1 305964165
4 Cause 1 Zip 2 305788668
5 Cause 2 Zip 2 306306940
6 Cause 3 Zip 2 305559305
7 Cause 1 Zip 3 304898918
8 Cause 2 Zip 3 304281568
9 Cause 3 Zip 3 303939326
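That gives the sums in long format (one row per zip/cause pair). If you want the wide layout from the question, with one row per zip and one column per cause, a base R option (a sketch using the same dat3 example) is xtabs(), which sums the left-hand-side variable instead of counting rows:
# Sum of amt for every zip/cause combination, with causes spread across columns
xtabs(amt ~ zip + cause, data = dat3)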
I have a data table as a CSV file that I use to create metrics for a dashboard. The data table includes Metric IDs and associates these with field names. This table--this definition of metrics--is largely static, and I'd like to include it within R code rather than, for example, importing a CSV file containing these headings.
The table looks something like this:
Metric_ID   Metric_Name             Numerator             Denominator
AB0001      Number_of_Customers     No_of_Customers
AB0002      Percent_New_Customers   No_of_New_Customers   No_of_Customers
This has about 40 rows of data, and I'd like to set this table up in code so that it is created at the time the R query is run. I'll then use it to associate metric IDs with measures I retrieve through SQL queries. Sometimes this table may change: for example, new metrics might be added or existing metrics modified. This would need some modification in the code to incorporate those metrics.
The closest way I could find was to create a data table, along the lines described in the code below.
dt<-data.table(x=c(1,2,3),y=c(2,3,4),z=c(3,4,5))
dt
x y z
1: 1 2 3
2: 2 3 4
3: 3 4 5
This works for a table with a few rows or columns, but it will be unwieldy for tables with 40+ rows. For example, if I wanted to modify a metric 20 rows down, I'd have to go 20 rows down in each column and then check the table to ensure I changed the metric in the right place in each column, especially where some metrics have empty cells. For example, I might correct the metric ID in row 20 but accidentally put the definition (a separate column) in row 19.
Is there a more straightforward way of, in essence, creating a table in code?
(I appreciate the most straightforward way would be to keep a CSV file accessible and use read_csv to import it into R. However, this doesn't work so well if colleagues are running this query on their machine and have a different file path to the CSV -- it also raises the risk of them running the query with an out-of-date metrics table, as they may not have the latest version in their files).
Thanks in advance for any guidance you might have!
Tony
Here are two options (examples taken from respective help pages):
data.table::fread()
fread("A,B
1,2
3,4
")
#> A B
#> <int> <int>
#> 1: 1 2
#> 2: 3 4
https://rdatatable.gitlab.io/data.table/reference/fread.html
tibble::tribble()
tribble(
~colA, ~colB,
"a", 1,
"b", 2,
"c", 3
)
#> # A tibble: 3 × 2
#> colA colB
#> <chr> <dbl>
#> 1 a 1
#> 2 b 2
#> 3 c 3
https://tibble.tidyverse.org/reference/tribble.html
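For a hand-maintained table of about 40 rows, tribble() is arguably the more readable of the two, since each row of the table is one line of code and the columns line up visually. A sketch using the columns from your example (with NA for the empty Denominator cell):
library(tibble)
metrics <- tribble(
  ~Metric_ID, ~Metric_Name,            ~Numerator,            ~Denominator,
  "AB0001",   "Number_of_Customers",   "No_of_Customers",     NA,
  "AB0002",   "Percent_New_Customers", "No_of_New_Customers", "No_of_Customers"
)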
Other options:
If you already have the data.frame from somewhere, you can also use dput() to get a structure() call that you can paste into the files you are distributing (see the short example after this list).
Use the reprex package: https://reprex.tidyverse.org/
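To illustrate the dput() option, here is a minimal round trip on a small, made-up data frame (not from the question):
df <- data.frame(x = c(1, 2, 3), y = c(2, 3, 4))
dput(df)
#> structure(list(x = c(1, 2, 3), y = c(2, 3, 4)), class = "data.frame",
#> row.names = c(NA, -3L))
# Pasting that structure() call into a script recreates the same object:
df2 <- structure(list(x = c(1, 2, 3), y = c(2, 3, 4)),
                 class = "data.frame", row.names = c(NA, -3L))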
Firstly: I have seen other posts about translating AVERAGEIF from Excel into R, but I didn't see one that covered my specific case, and I couldn't get any of them to work.
I have a dataset which encompasses the daily pricings of a bunch of listings.
It looks like this:
listing_id date price
1 1000 1/2/2015 $100
2 1200 2/4/2016 $150
Sample of the dataset (and desired outcome): https://send.firefox.com/download/228f31e39d18738d/#rlMmm6UeGxgbkzsSD5OsQw
The dataset I would like to have has only the date and the average prices of all listings on that date. The goal is to get a (different) dataframe which would look something like this so I can work with it:
Date Average Price
1 4/5/2015 204.5438
2 4/6/2015 182.6439
3 4/7/2015 176.553
4 4/8/2015 182.0448
5 4/9/2015 183.3617
6 4/10/2015 205.0997
7 4/11/2015 197.0118
8 4/12/2015 172.2943
I created this in Excel using the AVERAGEIF function (and copy-pasting by value) from the sample provided above.
I tried to format the data in Excel first, where I could use AVERAGEIF to take the average if the row matches a specific date. The problem with this is that the dataset consists of 30 million rows and Excel only allows about 1 million, so it didn't work.
What I have done so far: I created a data frame in R (where I want the average prices to go) using
Avg = data.frame("Date" =1:2, "Average Price"=1:2)
Avg[nrow(Avg) + 2036,] = list("v1","v2")
Avg$Date = seq(from = as.Date("2015-04-05"), to = as.Date("2020-11-01"), by = 'day')
I tried to create an AVERAGEIF-like function following this article and another one, but could not get it to work.
I hope this is enough information to go on otherwise I would be more than happy to provide more.
If your question is how to replicate the AVERAGEIF function, you can use logical indexing:
R code :
> df
Dates Prices
1 1 100
2 2 120
3 3 150
4 1 320
5 2 250
6 3 210
7 1 102
8 2 180
9 3 150
idx <- df$Dates == 1 # Positions where condition is true
mean(df$Prices[idx]) # Prints same output as Excel
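To get the whole table you're after (one row per date with the average price of all listings on that date), the same idea can be applied to every date at once with an aggregation, so there is no need to loop over dates. A sketch, assuming your data frame is called listings with columns date and price, and that price may still be text like "$100":
# If price was read in as text ("$100"), strip the symbols and convert to numeric first
listings$price <- as.numeric(gsub("[$,]", "", listings$price))
# Mean price per date in one call (base R)
avg <- aggregate(price ~ date, data = listings, FUN = mean)
names(avg) <- c("Date", "Average Price")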
I have a dataframe structured like this
path clicks
/a/b/index.html 1000
/a/b/index.html#1 500
/a/index.html#1 250
R Code:
path <- c('/a/b/index.html','/a/b/index.html#1','/a/index.html#1')
clicks <- c(1000, 500, 250)
d.f <- data.frame(path,clicks)
The first two rows are basically the same URL path. Hence, I would like to merge these two rows into one by path adding clicks and reducing the path name of the result to simply '#1' while getting rid of the old names. The result would look something like this:
path clicks
#1 1500
/a/index.html#1 250
From what I read this can be achieved by using aggregate(), but I can't quite find a decent introduction that thoroughly explains how this function works.
Anyway, I'd be thankful if you could either provide me with a solution or point me to a beginner-friendly source so I can educate myself on the relevant material.
This is really what you want, I think (I will explain why at the end).
path <- c('/a/b/index.html','/a/b/index.html#1','/a/index.html#1')
clicks <- c(1000, 500, 250)
d.f <- data.frame(path,clicks)
d.f$path <- gsub("\\#\\d", "", d.f$path)
d.f
aggregate(clicks ~ path, data = d.f, FUN = sum)
Reducing the link to "#1" would be next to impossible since that would make rows 1 and 3 identical, which is not what you want. Plus I assume if you had "/a/b/index.html#2" you would want that aggregated with rows 1 and 2 and not kept separately.
The other option would be to append a "#1" to all links that do not have one and then aggregate
d.f$path[grep("html$", d.f$path)] <- paste0(d.f$path[grep("html$", d.f$path)], "#1")
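After that append, the aggregation step is the same call as above, for example:
aggregate(clicks ~ path, data = d.f, FUN = sum)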
A possible dplyr solution with #1 and #2:
df = data.frame(path = c("/a/b/index.html", "/a/b/index.html#1", "/a/b/index.html#2", "/a/index.html"),
                clicks = c(1000, 500, 150, 250))
path clicks
1 /a/b/index.html 1000
2 /a/b/index.html#1 500
3 /a/b/index.html#2 150
4 /a/index.html 250
df %>%
  mutate(path_simp = gsub("#.*", "", path)) %>%
  transform(path = gsub("^[^#]*", "", path, perl = TRUE)) %>%
  group_by(path_simp) %>%
  mutate(path = ifelse(any(!path == ""), path[path != ""][length(path[path != ""])], path_simp)) %>%
  summarise(clicks = sum(clicks), path = last(path)) %>%
  select(path, clicks)
Which gives:
path clicks
<chr> <dbl>
1 #2 1650
2 /a/index.html 250
The idea is to create a new column path_simp, which contains the path with everything from the # onwards removed, and to replace path with just the #number suffix where there is one.
path_simp is used for grouping, and within each group path is set to the group's #number if any row has one, and to path_simp otherwise.
The clicks are then summed with sum(), and the last path value in each group is kept with last().
I searched, but I couldn't find a similar question, so I apologize if I may have missed it.
My problem is actually pretty simple. I have two lists, a large one and a smaller one. The smaller one consists of the averages of the data in the large list (ten lines have been aggregated to form the small list, so it is about one tenth the size of the larger one). All I want now is to add a new column to the large list (which is no problem) showing the averages next to the original data. I am aware that I will see each average ten times, but that's fine.
I tried to solve this "problem" with simple list comparisons, e.g. (the relevant averages, as well as the original data have identical identifiers in the first column):
Large_List$Average_column[ Large_List$identifier == Small_List$identifier ] <- Small_List$Average[ Large_List$identifier == Small_List$identifier ];
Yet for some reason, it doesn't work, probably because the target vector is larger than the source vector. I really tried a lot, and the only thing that seems to work is a loop structure. But that is not an option because my list is way too large... I am sure there must be a smart solution to this simple issue.
UPDATE & SPECIFICATION
Thank you for your suggestions, but it seems I need to be more specific. The problem is that in most, but not all, cases the average is formed from ten consecutive data points. Fewer may be used because of holes in the sample. Therefore, a simple replication will unfortunately not do the job.
Here's an example (1_Ident is the minute identifier, 10_Ident the ten-minute identifier):
Original_List:
1_Ident | 10_Ident|Minute_value|
July1-0| July1-0d| 1
July1-2| July1-0d| 1
(..)
July1-10| July1-0d| 1
July1-11| July1-1d| 1
July1-12| July1-1d| 2
July1-21| July1-2d| 3
July1-31| July1-3d| 2
Resulting Small_list:
10_Ident|Minute_average|
July1-0d| 1
July1-1d| 1.5
July1-2d| 3
July1-3d| 2
Desired outcome:
Large_List:
1_Ident |10_Ident|Minute_value|Minute_average|
July1-0| July1-0d| 1 1
July1-2| July1-0d| 1 1
(..)
July1-10| July1-0d| 1 1
July1-11| July1-1d| 1 1.5
July1-12| July1-1d| 2 1.5
July1-21| July1-2d| 3 3
July1-31| July1-3d| 2 2
I think the main problem is that the Small_list$Minute_average vector is not the same size as the Large_list$Minute_value vector. As said, one could compare the two lists line by line with a loop, but the tables are more than 1M lines long, so that won't work.
What I want to do is basically the following:
1) Look in Large_List$10_Ident and compare it to Small_List$10_Ident
2) Where the values match, transfer the corresponding Small_List$Minute_average value to Large_List$Minute_average
Thanks!
You could use match or merge to do that, but why not just calculate the averages off the groupings?
Large_List$Average_column <- ave(Large_List$col_to_be_avgd,
                                 Large_List$group_var,
                                 FUN = function(x) mean(x, na.rm = TRUE))
Note that na.rm has to go inside FUN: any extra arguments passed to ave() are treated as additional grouping variables.
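With the column names from your update, that would look something like this (a sketch; the backticks are needed because the column name starts with a digit):
Large_List$Minute_average <- ave(Large_List$Minute_value,
                                 Large_List$`10_Ident`,
                                 FUN = function(x) mean(x, na.rm = TRUE))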
The merge code might look like
merge(Large_List, Small_List[c('identifier', 'Average')], by='identifier', all.x=TRUE)
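If you'd rather transfer the averages you've already computed in Small_List, the match() route would look roughly like this (a sketch using the column names from your update):
idx <- match(Large_List$`10_Ident`, Small_List$`10_Ident`)    # find each row's group in the small list
Large_List$Minute_average <- Small_List$Minute_average[idx]   # copy the corresponding average across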
This is a seemingly simple R question, but I don't see an exact answer here. I have a data frame (alldata) that looks like this:
Case zip market
1 44485 NA
2 44488 NA
3 43210 NA
There are over 3.5 million records.
Then, I have a second data frame, 'zipcodes'.
market zip
1 44485
1 44486
1 44488
... ... (100 zips in market 1)
2 43210
2 43211
... ... (100 zips in market 2, etc.)
I want to find the correct value for alldata$market for each case based on alldata$zip matching the appropriate value in the zipcode data frame. I'm just looking for the right syntax, and assistance is much appreciated, as usual.
Since you don't care about the market column in alldata, you can first strip it off and then merge the remaining columns of alldata with zipcodes on the zip column using merge:
merge(alldata[, c("Case", "zip")], zipcodes, by="zip")
The by parameter specifies the key criteria, so if you have a compound key, you could do something like by=c("zip", "otherfield").
Another option that worked for me and is very simple:
alldata$market<-with(zipcodes, market[match(alldata$zip, zip)])
With such a large data set you may want the speed of an environment lookup. You can use the lookup function from the qdapTools package as follows:
library(qdapTools)
alldata$market <- lookup(alldata$zip, zipcodes[, 2:1])
Or
alldata$zip %l% zipcodes[, 2:1]
Here's the dplyr way of doing it:
library(tidyverse)
alldata %>%
select(-market) %>%
left_join(zipcodes, by="zip")
which, on my machine, is roughly the same performance as lookup.
The syntax of match is a bit clumsy. You might find the lookup package easier to use.
alldata <- data.frame(Case=1:3, zip=c(44485,44488,43210), market=c(NA,NA,NA))
zipcodes <- data.frame(market=c(1,1,1,2,2), zip=c(44485,44486,44488,43210,43211))
alldata$market <- lookup(alldata$zip, zipcodes$zip, zipcodes$market)
alldata
## Case zip market
## 1 1 44485 1
## 2 2 44488 1
## 3 3 43210 2