I am working with a database of daily deaths for a country, and I need to create a dataset containing the aggregated daily deaths by day, month, and state. My database (def_2020) looks something like this:
| State  | Month | Day |
| ------ | ----- | --- |
| state1 | jan   | 1   |
| state1 | jan   | 1   |
| ...    | ...   | ... |
| state2 | dec   | 4   |
I have 24 states (about 100,000 observations) with different days and months of death. I need to get something like this:
| State  | Month | Day | Deaths |
| ------ | ----- | --- | ------ |
| state1 | jan   | 1   | 25     |
| state1 | jan   | 2   | 35     |
| ...    | ...   | ... | ...    |
| state2 | dec   | 4   |        |
I am new to R, so I created a loop like this:
library(dplyr) # for %>% and filter()

days <- 1:31
death_state1 <- NULL
for (i in days) {
  # count the state1 deaths in January that occurred on day i
  death_state1[i] <- sum(with(def_2020 %>% filter(State == "state1", Month == "jan"), Day == i))
}
But I need to optimize this loop to get a data frame organized by month (columns), day (rows), and state (also rows). Please help me; I'm still new to this.
It looks like you are using a mixture of base R and dplyr syntax (the pipe %>% and filter() are exports from the dplyr package).
dplyr has its own syntax for grouped operations that allows you to avoid defining explicit loops. You use group_by() to group your data and summarize() to define variables containing the results of dimension-reducing functions like mean(), min(), n(), etc.
def_2020 %>%
  group_by(State, Month, Day) %>%
  summarize(Deaths = n())
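As a shorthand, dplyr's count() collapses the group_by() + summarize(n()) pattern into one call (a minimal sketch using the same column names):
library(dplyr)

# equivalent to group_by(State, Month, Day) %>% summarize(Deaths = n())
def_2020 %>%
  count(State, Month, Day, name = "Deaths")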
With base R, we can use aggregate:
# add a helper column of 1s, then sum it within each State/Month/Day combination
aggregate(Deaths ~ ., transform(def_2020, Deaths = 1), FUN = sum)
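If you then want months spread into columns, as the question describes, tidyr can reshape the aggregated result. A sketch, assuming the tidyr package is installed:
library(dplyr)
library(tidyr)

# one row per State/Day combination, one Deaths column per Month
def_2020 %>%
  count(State, Month, Day, name = "Deaths") %>%
  pivot_wider(names_from = Month, values_from = Deaths)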
I have two dataframes. One is a set of ≈4000 entries that looks similar to this:
| grade_col1 | grade_col2 |
| --- | --- |
| A-| A-|
| B | 86|
| C+| C+|
| B-| D |
| A | A |
| C-| 72|
| F | 96|
| B+| B+|
| B | B |
| A-| A-|
The other is a set of ≈700 entries that look similar to this:
| grade | scale |
| --- | --- |
| A+|100|
| A+| 99|
| A+| 98|
| A+| 97|
| A | 96|
| A | 95|
| A | 94|
| A | 93|
| A-| 92|
| A-| 91|
| A-| 90|
| B+| 89|
| B+| 88|
...and so on.
What I'm trying to do is create a new column that shows whether grade_col2 matches grade_col1 with a binary, 0-1 output (0 = no match, 1 = match). Most of grade_col2 is shown by letter grade. But every once in a while an entry in grade_col2 was accidentally entered as a numeric grade instead. I want this match column to give me a "1" even when grade_col2 is a numeric grade instead of a letter grade. In other words, if grade_col1 is B and grade_col2 is 86, I want this to still be read as a match. Only when grade_col1 is F and grade_col2 is 96 would this not be a match (similar to when grade_col1 is B- and grade_col2 is D = not a match).
The second data frame gives me the information I need to translate between one and the other (entries between 97-100 are A+, between 93-96 are A, and so on). I just don't know how to run a script that uses this information to find matches through all ≈4000 entries. Theoretically, I could do this manually, but the real dataset is so lengthy that this isn't realistic.
I had been thinking of using nested if_else statements with dplyr. But once I got past the first "if" statement, I got stuck. I'd appreciate any help with this people can offer.
You can do this using a join.
Let your first dataframe be grades_df and your second dataframe be lookup_df, then you want something like the following:
output = grades_df %>%
  # join on the lookup table, keeping everything in the grades table
  left_join(lookup_df, by = c(grade_col2 = "scale")) %>%
  # combine grade_col2 from grades_df and grade from lookup_df
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  # indicator column
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
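One caveat: if grade_col2 is character (it mixes letters and numbers) while scale is numeric, the join will fail on incompatible key types, so coerce one side first. A minimal sketch with made-up rows:
library(dplyr)

# hypothetical sample rows mirroring the question
grades_df <- data.frame(grade_col1 = c("A-", "B", "F"),
                        grade_col2 = c("A-", "86", "96"))
lookup_df <- data.frame(grade = c("A", "B"),
                        scale = c(96, 86))

grades_df %>%
  # convert the numeric scale to character so the join keys share a type
  left_join(mutate(lookup_df, scale = as.character(scale)),
            by = c(grade_col2 = "scale")) %>%
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade),
         indicator   = ifelse(grade_col1 == grade_col2b, 1, 0))
# indicator comes out 1, 1, 0 for these three rows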
I am trying to create a plot of two variables (DATE and INT_RATE), using the content of a third variable, GRADE, to filter the rows.
The problem is that I can't figure out how to use the variable GRADE as a row filter.
Below I provide a sample of the starting data as well as a sketch of the plot I'm trying to achieve.
Thanks in advance.
STARTING DATA
| DATE  | INT_RATE | GRADE |
| ----- | -------- | ----- |
| 1-jan | 5%       | A     | <-- A
| 5-feb | 3%       | B     |
| 9-feb | 2%       | D     |
| 1-apr | 3%       | A     | <-- A
| 5-jun | 5%       | A     | <-- A
| 1-aug | 3%       | G     |
| 1-sep | 2%       | E     |
| 3-nov | 1%       | C     |
| 8-dec | 8%       | A     | <-- A
| ...   | ...      | ...   |
And this is the kind of graph I would like to achieve, which is a very basic one, except for the filtering work needed beforehand.
WANTED RESULT: a plot for GRADE "A" only, with DATE on the x-axis and INT_RATE on the y-axis, connecting the points (1-jan, 5%), (1-apr, 3%), (5-jun, 5%), and (8-dec, 8%).
EDIT 1:
Following the valuable help from @apax, I managed to get a plot, but the result is not satisfying because of the weird way R displays it (I think it might be related to the fact that the dataset in question is very large, about 800k rows). Do you have any suggestions?
By the way, this solved my problem:
plot(INT_RATE ~ DATE, data = filter(df, GRADE == "A"))
I am also uploading a PNG of the malformed chart.
Thanks again to all.
Here's a quick one-liner solution, where I assume your data is stored in an object named df:
library(dplyr) ## for the filter() function below
plot(INT_RATE ~ DATE, data = filter(df, GRADE == "A"))
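If the axes look malformed, DATE may be stored as character or factor rather than as a Date; converting it first usually fixes the drawing. A sketch, where the format string "%d-%b" is an assumption about how your dates are written:
library(dplyr)

df_a <- filter(df, GRADE == "A")
# "%d-%b" assumes values like "1-jan"; adjust to your actual date format
df_a$DATE <- as.Date(df_a$DATE, format = "%d-%b")
plot(INT_RATE ~ DATE, data = df_a)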
You could use ggplot2 and facet_wrap(...)
library(ggplot2)
ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  facet_wrap(~cyl)
For your data:
ggplot(data, aes(x = DATE, y = INT_RATE)) +
  geom_line() +
  facet_wrap(~GRADE)
P.S. This gives separate graphs for all grades. But that should not be a problem.
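If you want only the GRADE "A" panel, filter before plotting. A sketch, assuming your data frame is named data:
library(ggplot2)
library(dplyr)

ggplot(filter(data, GRADE == "A"), aes(x = DATE, y = INT_RATE)) +
  geom_line()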
I want to create a calculated field to use with the rpivotTable package, similar to the functionality seen in Excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A | P | X | 173247 | 34649 |
| A | P | Y | 451598 | 225799 |
| A | P | Z | 759695 | 463414 |
| A | Q | X | 358040 | 225565 |
| A | Q | Y | 102068 | 36744 |
| A | Q | Z | 994961 | 228841 |
| A | R | X | 454672 | 231883 |
| A | R | Y | 275994 | 124197 |
| A | R | Z | 691100 | 165864 |
| B | P | X | 755594 | 302238 |
| . | . | . | . | . |
| . | . | . | . | . |
+--------------+--------+---------+-------------+-----------------+
(my actual table has many more columns, both dimensions and measures, time, etc. and I need to define multiple such "calculated columns")
If I want to calculate defect rate (which would be Defective Units/Total Units) and I want to aggregate by either of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't seem to work and summed up defect rates (i.e., sum(Defective_Units/Total_Units)), instead of sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving me defect rates greater than 1. Is there any way I can declare a calculated field as a formula that is evaluated after aggregation?
You're in luck: the creator of pivottable.js foresaw cases like yours (and mine, earlier today) and implemented an aggregator called "Sum over Sum", along with a few similar ones; cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll pass "Sum over Sum" as the "aggregatorName" parameter, and the columns whose quotient we want as the "vals" parameter.
Here's a meaningless usage example from the mtcars data for reproducibility:
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars, rows = "gear", cols = c("cyl", "carb"),
            aggregatorName = "Sum over Sum",
            vals = c("mpg", "disp"),
            width = "100%", height = "400px")
I have 150 stops (Cod), and each one has a number of services that it uses.
| Cod | SERVICE1 | SERVICE2 | SERVICE3 | Position |
| --- | -------- | -------- | -------- | -------- |
| P05 | XRS10    | XRS07    | XRS05    | 12455    |
| R07 | FR05     |          |          | 4521     |
| X05 | XRS07    | XRS10    |          | 57541    |
I need to put all the services (SERVICE1, SERVICE2, SERVICE3) in one column. That means I need the following result.
| Cod | SERVICE | Position |
| --- | ------- | -------- |
| P05 | XRS10   | 12455    |
| P05 | XRS07   | 12455    |
| P05 | XRS05   | 12455    |
| R07 | FR05    | 4521     |
| X05 | XRS07   | 57541    |
| X05 | XRS10   | 57541    |
Is there any way to do this using the sqldf package in R, or any other way to do it?
Try this:
library(magrittr) ## for the pipe, %>%
library(dplyr)    ## for filtering observations and selecting columns
library(tidyr)    ## for making your dataset long/tidy

new_data <- original_data %>%
  tidyr::gather(key = service_type, value = SERVICE, SERVICE1:SERVICE3) %>%
  dplyr::filter(!is.na(SERVICE), SERVICE != "") %>% ## drop empty service slots
  dplyr::select(-service_type)
Unfortunately, I am not familiar with sqldf.
Note that if you want to keep the information on whether the service comes from SERVICE1, SERVICE2, or SERVICE3, you should omit the last line (dplyr::select) entirely.
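For the sqldf route the question asks about, stacking the three columns with UNION ALL should also work. An untested sketch, assuming blanks mark missing services:
library(sqldf)

new_data <- sqldf("
  SELECT Cod, SERVICE1 AS SERVICE, Position FROM original_data WHERE SERVICE1 <> ''
  UNION ALL
  SELECT Cod, SERVICE2, Position FROM original_data WHERE SERVICE2 <> ''
  UNION ALL
  SELECT Cod, SERVICE3, Position FROM original_data WHERE SERVICE3 <> ''
")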
I want the sum of wght values by group, with the value belonging to a certain sTicker excluded from the sum, conditioned on the other variables.
How can I do this elegantly, without transposing?
So in the table below, for each (fTicker, DATE_f) pair, I seek to sum the values of wght, excluding the wght of the given sTicker.
For example, excl_val for sTicker = A given (fTicker = XLK, DATE_f = 6/20/2003) is the wght of AAPL plus the wght of AA on that date for XLK, but not the wght for sTicker = A itself.
+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker | DATE_f | wght | excl_val |
+---------+---------+-----------+-------------+-------------+
| A | XLK | 6/20/2003 | 0.087600002 | 1.980834016 |
| A | XLK | 6/23/2003 | 0.08585 | 1.898560068 |
| A | XLK | 6/24/2003 | 0.085500002 | |
| AAPL | XLK | 6/20/2003 | 0.070080002 | |
| AAPL | XLK | 6/23/2003 | 0.06868 | |
| AAPL | XLK | 6/24/2003 | 0.068400002 | |
| AA | XLK | 6/20/2003 | 1.910754014 | |
| AA | XLK | 6/23/2003 | 1.829880067 | |
| AA | XLK | 6/24/2003 | 1.819775 | |
+---------+---------+-----------+-------------+-------------+
There are several fTicker groups with many sTicker in them (10 to 70), some sTicker may belong to several fTicker. The end result should be an excl_val for each sTicker on each DATE_f and for each fTicker.
I did it by transposing in SAS, with a resulting file of about 6 GB, but the same approach in R blew memory up to 40 GB and is basically unworkable.
In R, I got as far as this:
weights$excl_val <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))
but it's just a simple sum (without excluding the necessary observation), and there is a mismatch in row lengths. If I could condition the sum to exclude the sTicker observation's wght from the summation, I think it might work.
About the excl_val length: I computed it in Excel for just 2 cells; that's why it's short.
Thank you!
Arsenio
When you have data in a data.frame, it is better if the rows are meaningful (in particular, the columns should have the same length): in this case, excl_val looks like a separate vector. After putting the information it contains into the data.frame, things become easier.
# Sample data
k <- 5
d <- data.frame(
sTicker = rep(LETTERS[1:k], k),
fTicker = rep(LETTERS[1:k], each=k),
DATE_f = sample( seq(Sys.Date(), length=2, by=1), k*k, replace=TRUE ),
wght = runif(k*k)
)
excl_val <- sample(d$wght, k)
# Add a "valid" column to the data.frame
d$valid <- ! d$wght %in% excl_val
# Compute the sum
library(plyr)
ddply(d, c("fTicker","DATE_f"), summarize, sum=sum(wght[valid]))
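If what you ultimately want is a leave-one-out sum (each row's (fTicker, DATE_f) group total minus that row's own wght), a grouped mutate avoids transposing entirely. A dplyr sketch, assuming your real table is named weights:
library(dplyr)

weights <- weights %>%
  group_by(fTicker, DATE_f) %>%
  # group total minus the row's own weight = sum excluding that sTicker
  mutate(excl_val = sum(wght, na.rm = TRUE) - coalesce(wght, 0)) %>%
  ungroup()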