Dictionary-style conditional referencing between two dataframes - r

I have two data frames. DF1 is a list of homicides, with a date and location attached to each row. DF2 is a lookup table of the unique locations mentioned in DF1, with a latitude and longitude for each location. NOTE: the locations in DF2 are shared, i.e. one location may correspond to multiple homicides in DF1, so the two data frames have different lengths.
I want to create latitude and longitude variables in DF1 wherever a location in DF2 equals the location in DF1 (assuming location names match exactly between the two data frames). How do I pull the latitude and longitude from DF2 for the location that corresponds to a given homicide record in DF1?
Small reproducible example:
DF1: (dataframe of incidents)
| Incident | Place |
| -------- | -------|
| Incident 1| Place 1|
| Incident 2| Place 2|
| Incident 3| Place 2|
| Incident 4| Place 3|
| Incident 5| Place 1|
| Incident 6| Place 3|
| Incident 7| Place 2|
DF2: (dictionary-style lat-lon manual)
| Place |Latitude |Longitude |
| -------| ------- | ---------|
| Place 1| A | B |
| Place 2| C | D |
| Place 3| E | F |
| Place 4| G | H |
DF3 (what I want)
| Incident | Latitude | Longitude |
| -------- | -------- | --------- |
|Incident 1| A | B |
|Incident 2| C | D |
|Incident 3| C | D |
|Incident 4| E | F |
|Incident 5| A | B |
|Incident 6| E | F |
|Incident 7| C | D |
I have tried:
DF1$latitude <- DF2$latitude[which(DF2$location == DF1$location), ]
It returned the following error:
Error in DF2$latitude[which(DF2$location == DF1$location), ] :
incorrect number of dimensions
In addition: Warning message:
In DF2$location == DF1$location :
longer object length is not a multiple of shorter object length
In response to a comment suggestion, I also tried:
DF2$latitude[which(DF2$location == DF1$location)]
However, I got the error:
Error in `$<-.data.frame`(`*tmp*`, latitude, value = numeric(0)) :
replacement has 0 rows, data has 1220
In addition: Warning message:
In DF1$location == DF2$location :
longer object length is not a multiple of shorter object length

You can try dplyr's left_join(). The code below keeps all the rows in DF1 and adds the variables from DF2 wherever it finds a match on the shared location column.
library(dplyr)
DF3 <- left_join(DF1, DF2, by = "location")
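If you prefer base R, match() gives the same lookup; a minimal sketch, assuming the shared key column is named Place as in the example tables:
# For each row of DF1, find the index of the matching row in DF2
idx <- match(DF1$Place, DF2$Place)
DF1$Latitude <- DF2$Latitude[idx]
DF1$Longitude <- DF2$Longitude[idx]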

Related

Control digits in specific cells

I have a table that looks like this:
+-----------------------------------+-------+--------+------+
| | Male | Female | n |
+-----------------------------------+-------+--------+------+
| way more than my fair share | 2,4 | 21,6 | 135 |
| a little more than my fair share | 5,4 | 38,1 | 244 |
| about my fair share | 54,0 | 35,3 | 491 |
| a little less than my fair share  | 25,1  | 3,0    | 153  |
| way less than my fair share | 8,7 | 0,7 | 51 |
| Can't say | 4,4 | 1,2 | 31 |
| n | 541,0 | 564,0 | 1105 |
+-----------------------------------+-------+--------+------+
Everything is fine, but I would like to show no decimal digits in the last row, since those are the margins (raw counts). Is there a way in R to manipulate specific cells and their digits?
Thanks!
You could use ifelse to output the numbers in different formats in different rows, as in the example below. However, it will take some additional finagling to get the values in the last row to line up by place value with the previous rows:
library(knitr)
library(tidyverse)

# Fake data
set.seed(10)
dat <- data.frame(category = c(LETTERS[1:6], "n"), replicate(3, rnorm(7, 100, 20)))

# Format the "n" row with no decimals and all other rows with one decimal
dat %>%
  mutate_if(is.numeric, ~ sprintf(ifelse(category == "n", "%1.0f", "%1.1f"), .x)) %>%
  kable(align = "lrrr")
|category | X1| X2| X3|
|:--------|-----:|-----:|-----:|
|A | 100.4| 92.7| 114.8|
|B | 96.3| 67.5| 101.8|
|C | 72.6| 94.9| 80.9|
|D | 88.0| 122.0| 96.1|
|E | 105.9| 115.1| 118.5|
|F | 107.8| 95.2| 109.7|
|n | 76| 120| 88|
The huxtable package makes it easy to decimal-align the values (see the Vignette for more on table formatting):
library(huxtable)
tab <- dat %>%
  mutate_if(is.numeric, ~ sprintf(ifelse(category == "n", "%1.0f", "%1.1f"), .x)) %>%
  hux() %>%
  add_colnames()
align(tab)[-1] <- "."
tab
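If you'd rather keep the columns numeric, huxtable can also apply the formats itself. A sketch of that alternative, assuming the same dat as above: number_format() accepts per-row assignment, so the "n" row gets zero decimal places while the body rows keep one.
# Let huxtable do the rounding instead of sprintf()
tab2 <- as_hux(dat) %>% add_colnames()
number_format(tab2)[2:(nrow(tab2) - 1), -1] <- 1  # body rows: one decimal place
number_format(tab2)[nrow(tab2), -1] <- 0          # last ("n") row: no decimals
tab2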
Here's what the PDF output looks like when knitted to PDF from an rmarkdown document:

Count rows in a dataframe object with criteria in R

Okay, I have a bit of a noob question, so please excuse me. I have a data frame object as follows:
| order_id| department_id|department | n|
|--------:|-------------:|:-------------|--:|
| 1| 4|produce | 4|
| 1| 15|canned goods | 1|
| 1| 16|dairy eggs | 3|
| 36| 4|produce | 3|
| 36| 7|beverages | 1|
| 36| 16|dairy eggs | 3|
| 36| 20|deli | 1|
| 38| 1|frozen | 1|
| 38| 4|produce | 6|
| 38| 13|pantry | 1|
| 38| 19|snacks | 1|
| 96| 1|frozen | 2|
| 96| 4|produce | 4|
| 96| 20|deli | 1|
This is the code I've used to arrive at this object:
temp5 <- opt %>%
  left_join(products, by = "product_id") %>%
  left_join(departments, by = "department_id") %>%
  group_by(order_id, department_id, department) %>%
  tally() %>%
  group_by(department_id)
kable(head(temp5, 14))
As you can see, the object contains the departments present in each order_id. Now what I want to do is count the number of departments for each order_id.
I tried using the summarise() method from the dplyr package, but it throws the following error:
Error in summarise_impl(.data, dots) :
Evaluation error: no applicable method for 'groups' applied to an object of class "factor".
It seems so simple, but I can't figure out how to do it. Any help will be appreciated.
Edit: This is the code that I tried to run. After that I read about the count() function in the plyr package and tried that as well, but it is of no use here, since it needs a data frame as input, whereas I only want to count occurrences within the data frame:
temp5 <- opt %>%
  left_join(products, by = "product_id") %>%
  left_join(departments, by = "department_id") %>%
  group_by(order_id, department_id, department) %>%
  tally() %>%
  group_by(department_id) %>%
  summarise(count(department))
In the output, I need to know the average number of departments ordered from per order_id, so I need something like this:
| order_id | no. of departments |
|---------:|-------------------:|
|        1 |                  3 |
|       36 |                  4 |
|       38 |                  4 |
|       96 |                  3 |
And then I should be able to plot, using ggplot, the number of orders vs. the number of departments per order. I hope this is clear.
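A minimal sketch of one way to get those counts, assuming temp5 is the tallied frame shown above: regroup by order_id and count the distinct departments in each order.
library(dplyr)
# One row per order_id, with the number of distinct departments in that order
dept_counts <- temp5 %>%
  ungroup() %>%
  group_by(order_id) %>%
  summarise(n_departments = n_distinct(department))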

Dividing the time into 30-minute periods

I have a DataFrame that contains a "time" column. I want to add a new column containing the period number after dividing the day into 30-minute periods.
For example,
The original Dataframe
l = [('A','2017-01-13 00:30:00'),('A','2017-01-13 00:00:01'),('E','2017-01-13 14:00:00'),('E','2017-01-13 12:08:15')]
df = spark.createDataFrame(l,['test','time'])
df1 = df.select(df.test,df.time.cast('timestamp'))
df1.show()
+----+-------------------+
|test| time|
+----+-------------------+
| A|2017-01-13 00:30:00|
| A|2017-01-13 00:00:01|
| E|2017-01-13 14:00:00|
| E|2017-01-13 12:08:15|
+----+-------------------+
The desired DataFrame is as follows:
+----+-------------------+------+
|test| time|period|
+----+-------------------+------+
| A|2017-01-13 00:30:00| 2|
| A|2017-01-13 00:00:01| 1|
| E|2017-01-13 14:00:00| 29|
| E|2017-01-13 12:08:15| 25|
+----+-------------------+------+
Are there ways to achieve that?
You can use the built-in hour and minute functions together with the when function. Each hour contains two 30-minute periods, so F.hour(time) * 2 counts the periods completed before the current hour; adding 1 makes the index 1-based, and adding 1 more when the minute is 30 or later selects the second half of the hour:
from pyspark.sql import functions as F

df1.withColumn(
    'period',
    (F.hour(df1['time']) * 2) + 1 +
    F.when(F.minute(df1['time']) >= 30, 1).otherwise(0)
).show(truncate=False)
You should be getting
+----+---------------------+------+
|test|time |period|
+----+---------------------+------+
|A |2017-01-13 00:30:00.0|2 |
|A |2017-01-13 00:00:01.0|1 |
|E |2017-01-13 14:00:00.0|29 |
|E |2017-01-13 12:08:15.0|25 |
+----+---------------------+------+
I hope the answer is helpful

Addition of calculated field in rpivotTable

I want to create a calculated field to use with the rpivotTable package, similar to the functionality seen in Excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A | P | X | 173247 | 34649 |
| A | P | Y | 451598 | 225799 |
| A | P | Z | 759695 | 463414 |
| A | Q | X | 358040 | 225565 |
| A | Q | Y | 102068 | 36744 |
| A | Q | Z | 994961 | 228841 |
| A | R | X | 454672 | 231883 |
| A | R | Y | 275994 | 124197 |
| A | R | Z | 691100 | 165864 |
| B | P | X | 755594 | 302238 |
| . | . | . | . | . |
| . | . | . | . | . |
+--------------+--------+---------+-------------+-----------------+
(my actual table has many more columns, both dimensions and measures, time, etc. and I need to define multiple such "calculated columns")
If I want to calculate the defect rate (which would be Defective Units / Total Units) and aggregate by any of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't work: it summed up the per-row defect rates, i.e. sum(Defective_Units/Total_Units), instead of sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving me defect rates greater than 1. Is there anywhere I can declare a calculated field, which is just a formula evaluated post-aggregation?
You're lucky - the creator of pivottable.js foresaw cases like yours (and mine, earlier today) by implementing an aggregator called "Sum over Sum" and a few more, likewise, cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll use "Sum over Sum" as parameter "aggregatorName", and the columns whose quotient we want in the "vals" parameter.
Here's a meaningless usage example from the mtcars data for reproducibility:
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars, rows = "gear", cols = c("cyl", "carb"),
            aggregatorName = "Sum over Sum",
            vals = c("mpg", "disp"),
            width = "100%", height = "400px")

By group: sum of variable values under condition

I want the sum of a variable's values by group, with certain values excluded conditioned on another variable. How can I do this elegantly, without transposing?
In the table below, for each (fTicker, DATE_f) pair, I want to sum the values of wght, excluding the wght of the given sTicker. For example, excl_val for (sTicker = A, fTicker = XLK, DATE_f = 6/20/2003) is the wght of AAPL plus the wght of AA on 6/20/2003 in XLK, but not the wght of A itself.
+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker | DATE_f | wght | excl_val |
+---------+---------+-----------+-------------+-------------+
| A | XLK | 6/20/2003 | 0.087600002 | 1.980834016 |
| A | XLK | 6/23/2003 | 0.08585 | 1.898560068 |
| A | XLK | 6/24/2003 | 0.085500002 | |
| AAPL | XLK | 6/20/2003 | 0.070080002 | |
| AAPL | XLK | 6/23/2003 | 0.06868 | |
| AAPL | XLK | 6/24/2003 | 0.068400002 | |
| AA | XLK | 6/20/2003 | 1.910754014 | |
| AA | XLK | 6/23/2003 | 1.829880067 | |
| AA | XLK | 6/24/2003 | 1.819775 | |
+---------+---------+-----------+-------------+-------------+
There are several fTicker groups, each containing many sTickers (10 to 70), and an sTicker may belong to several fTickers. The end result should be an excl_val for each sTicker on each DATE_f within each fTicker.
I did this by transposing in SAS, with a resulting file of about 6 GB; the same approach in R blew memory up to 40 GB and is basically unworkable.
In R, I got as far as this:
weights$excl_val <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))
but it's just a simple sum (without excluding the necessary observation), and there is a mismatch between row lengths. If I could condition the sum to exclude the sTicker's wght observation from the summation, I think it might work.
About the excl_val length: I computed it in Excel for just two cells, which is why the column is mostly empty.
Thank you!
Arsenio
When you have data in a data.frame, it is better if the rows are meaningful (in particular, the columns should have the same length): in this case, excl_val looks like a separate vector. After putting the information it contains into the data.frame, things become easier.
# Sample data
k <- 5
d <- data.frame(
  sTicker = rep(LETTERS[1:k], k),
  fTicker = rep(LETTERS[1:k], each = k),
  DATE_f  = sample(seq(Sys.Date(), length = 2, by = 1), k * k, replace = TRUE),
  wght    = runif(k * k)
)
excl_val <- sample(d$wght, k)

# Add a "valid" column to the data.frame
d$valid <- !d$wght %in% excl_val

# Compute the sum
library(plyr)
ddply(d, c("fTicker", "DATE_f"), summarize, sum = sum(wght[valid]))
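Note that this sums only the rows flagged valid; for the question's leave-one-out version (each sTicker's group total minus its own weight), base R's ave() avoids transposing entirely. A minimal sketch, assuming the column names from the question's table:
# Group total of wght per (fTicker, DATE_f), computed for every row
group_sum <- ave(weights$wght, weights$fTicker, weights$DATE_f,
                 FUN = function(x) sum(x, na.rm = TRUE))
# Leave-one-out sum: subtract each row's own weight
weights$excl_val <- group_sum - weights$wght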
