Count rows in a dataframe object with criteria in R

Okay, I have a bit of a noob question, so please excuse me. I have a data frame object as follows:
| order_id| department_id|department | n|
|--------:|-------------:|:-------------|--:|
| 1| 4|produce | 4|
| 1| 15|canned goods | 1|
| 1| 16|dairy eggs | 3|
| 36| 4|produce | 3|
| 36| 7|beverages | 1|
| 36| 16|dairy eggs | 3|
| 36| 20|deli | 1|
| 38| 1|frozen | 1|
| 38| 4|produce | 6|
| 38| 13|pantry | 1|
| 38| 19|snacks | 1|
| 96| 1|frozen | 2|
| 96| 4|produce | 4|
| 96| 20|deli | 1|
This is the code I've used to arrive at this object:
temp5 <- opt %>%
  left_join(products, by = "product_id") %>%
  left_join(departments, by = "department_id") %>%
  group_by(order_id, department_id, department) %>%
  tally() %>%
  group_by(department_id)
kable(head(temp5, 14))
As you can see, the object lists the departments present in each order_id. What I want to do now is count the number of departments for each order_id.
I tried using the summarise() function from the dplyr package, but it throws the following error:
Error in summarise_impl(.data, dots) :
Evaluation error: no applicable method for 'groups' applied to an object of class "factor".
It seems so simple, but I can't figure out how to do it. Any help will be appreciated.
Edit: This is the code that I tried to run. Afterwards I read about the count() function in the plyr package and tried that as well, but it is of no use here since it needs a data frame as input, whereas I only want to count the number of occurrences within the data frame:
temp5 <- opt %>%
  left_join(products, by = "product_id") %>%
  left_join(departments, by = "department_id") %>%
  group_by(order_id, department_id, department) %>%
  tally() %>%
  group_by(department_id) %>%
  summarise(count(department))
In the output, I ultimately need the average number of departments ordered from per order, so first I need something like this:
|Order_id | no. of departments|
|:--------|------------------:|
|1        |                  3|
|36       |                  4|
|38       |                  4|
|96       |                  3|
And then I should be able to plot, using ggplot, the number of orders vs. the number of departments in each order. Hope this is clear.
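One way to get that count (a minimal sketch, assuming temp5 is the grouped tibble shown above): since temp5 already has one row per (order_id, department) pair, counting the distinct departments per order_id is enough. The error most likely comes from calling count(), which expects a data frame, on a single factor column inside summarise().
library(dplyr)
# temp5 has one row per (order_id, department) combination,
# so counting distinct departments per order_id gives the desired table
dept_per_order <- temp5 %>%
  ungroup() %>%
  group_by(order_id) %>%
  summarise(n_departments = n_distinct(department))
dept_per_order
mean(dept_per_order$n_departments)   # average number of departments per order
dept_per_order then feeds straight into ggplot (e.g. a bar chart of n_departments) for the orders-vs-departments plot.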

Related

Dictionary-style conditional referencing between two dataframes

I have two data frames. DF1 is a list of homicides with a date and location attached to each row. DF2 consists of a set of shared locations mentioned in DF1.
DF2 contains a latitude and longitude for each unique location. I want to pull these out. NOTE: DF2 contains shared locations, which may correspond to multiple homicides in DF1, which means the two DFs are different lengths.
I want to create latitude and longitude vars in DF1 when a location in DF2 is equal to the location in DF1 (assuming location names are exact between the two DFs). How do I pull the latitude and longitude from DF2 for which the location in DF2 corresponds to a given homicide record in DF1?
Small reproducible example:
DF1: (dataframe of incidents)
| Incident | Place |
| -------- | -------|
| Incident 1| Place 1|
| Incident 2| Place 2|
| Incident 3| Place 2|
| Incident 4| Place 3|
| Incident 5| Place 1|
| Incident 6| Place 3|
| Incident 7| Place 2|
DF2: (dictionary-style lat-lon manual)
| Place |Latitude |Longitude |
| -------| ------- | ---------|
| Place 1| A | B |
| Place 2| C | D |
| Place 3| E | F |
| Place 4| G | H |
DF3 (what I want)
| Incident | Latitude | Longitude |
| -------- | -------- | --------- |
|Incident 1| A | B |
|Incident 2| C | D |
|Incident 3| C | D |
|Incident 4| E | F |
|Incident 5| A | B |
|Incident 6| E | F |
|Incident 7| C | D |
I have tried:
DF1$latitude <- DF2$latitude[which(DF2$location == DF1$location), ]
It returned the following error:
Error in DF2$latitude[which(DF2$location == DF1$location), ] :
incorrect number of dimensions
In addition: Warning message:
In DF2$location == DF1$location :
longer object length is not a multiple of shorter object length
In response to a comment suggestion, I also tried:
DF2$latitude[which(DF2$location == DF1$location)]
However, I got the error:
Error in `$<-.data.frame`(`*tmp*`, latitude, value = numeric(0)) :
replacement has 0 rows, data has 1220
In addition: Warning message:
In DF1$location == DF2$location :
longer object length is not a multiple of shorter object length
You can try dplyr's left_join(). The code below keeps all the rows in DF1 and adds the variables from DF2 wherever it finds a match on the location column.
library(dplyr)
DF3 <- left_join(DF1, DF2, by = "location")
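Note that in the reproducible example above the key column is named Place rather than location, and the coordinate columns are Latitude and Longitude. A minimal sketch with those names (an assumption based on the tables shown) would be:
library(dplyr)
# Every incident keeps its row; Latitude/Longitude are filled in
# from DF2 wherever a matching Place exists
DF3 <- DF1 %>%
  left_join(DF2, by = "Place") %>%
  select(Incident, Latitude, Longitude)
Unmatched places (if any) would simply get NA coordinates rather than dropping the incident.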

Control digits in specific cells

I have a table that looks like this:
+-----------------------------------+-------+--------+------+
|                                   | Male  | Female | n    |
+-----------------------------------+-------+--------+------+
| way more than my fair share       | 2,4   | 21,6   | 135  |
| a little more than my fair share  | 5,4   | 38,1   | 244  |
| about my fair share               | 54,0  | 35,3   | 491  |
| a little less than my fair share  | 25,1  | 3,0    | 153  |
| way less than my fair share       | 8,7   | 0,7    | 51   |
| Can't say                         | 4,4   | 1,2    | 31   |
| n                                 | 541,0 | 564,0  | 1105 |
+-----------------------------------+-------+--------+------+
Everything is fine, but I would like to show no decimal digits at all in the last row, since it holds the margins (actual case counts). Is there any way in R to manipulate the digits of specific cells?
Thanks!
You could use ifelse to output the numbers in different formats in different rows, as in the example below. However, it will take some additional finagling to get the values in the last row to line up by place value with the previous rows:
library(knitr)
library(tidyverse)
# Fake data
set.seed(10)
dat = data.frame(category=c(LETTERS[1:6],"n"), replicate(3, rnorm(7, 100,20)))
dat %>%
  mutate_if(is.numeric, funs(sprintf(ifelse(category=="n", "%1.0f", "%1.1f"), .))) %>%
  kable(align="lrrr")
|category | X1| X2| X3|
|:--------|-----:|-----:|-----:|
|A | 100.4| 92.7| 114.8|
|B | 96.3| 67.5| 101.8|
|C | 72.6| 94.9| 80.9|
|D | 88.0| 122.0| 96.1|
|E | 105.9| 115.1| 118.5|
|F | 107.8| 95.2| 109.7|
|n | 76| 120| 88|
The huxtable package makes it easy to decimal-align the values (see the Vignette for more on table formatting):
library(huxtable)
tab = dat %>%
  mutate_if(is.numeric, funs(sprintf(ifelse(category=="n", "%1.0f", "%1.1f"), .))) %>%
  hux %>% add_colnames()
align(tab)[-1] = "."
tab
Here's what the PDF output looks like when knitted to PDF from an rmarkdown document:
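As a possible alternative (a sketch, not tested against your data): huxtable also has a per-cell number_format property, so you could keep the columns numeric and only change the formatting of the last row. The row and column indices below assume the layout produced by add_colnames() above, i.e. the header in row 1 and the "n" row in row 8.
library(huxtable)
tab2 <- dat %>% hux() %>% add_colnames()
number_format(tab2)[2:7, -1] <- 1   # one decimal place for the data rows
number_format(tab2)[8, -1] <- 0     # no decimals for the "n" row
align(tab2)[-1] <- "."
tab2
This keeps the underlying values numeric, so the decimal alignment works without the sprintf() pre-formatting.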

Dividing time into 30-minute periods

I have a DataFrame that contains a "time" column. I want to add a new column containing the period number, after dividing the day into 30-minute periods.
For example, the original DataFrame:
l = [('A','2017-01-13 00:30:00'),('A','2017-01-13 00:00:01'),('E','2017-01-13 14:00:00'),('E','2017-01-13 12:08:15')]
df = spark.createDataFrame(l,['test','time'])
df1 = df.select(df.test,df.time.cast('timestamp'))
df1.show()
+----+-------------------+
|test| time|
+----+-------------------+
| A|2017-01-13 00:30:00|
| A|2017-01-13 00:00:01|
| E|2017-01-13 14:00:00|
| E|2017-01-13 12:08:15|
+----+-------------------+
The desired DataFrame looks as follows:
+----+-------------------+------+
|test| time|period|
+----+-------------------+------+
| A|2017-01-13 00:30:00| 2|
| A|2017-01-13 00:00:01| 1|
| E|2017-01-13 14:00:00| 29|
| E|2017-01-13 12:08:15| 25|
+----+-------------------+------+
Are there ways to achieve that?
You can simply use the hour and minute built-in functions together with the when built-in function: each hour spans two 30-minute periods, so the period number is hour*2 + 1, plus 1 when the minute is 30 or greater (for example, 12:08:15 gives 12*2 + 1 + 0 = 25).
from pyspark.sql import functions as F
df1.withColumn('period', (F.hour(df1['time'])*2)+1+(F.when(F.minute(df1['time']) >= 30, 1).otherwise(0))).show(truncate=False)
You should be getting
+----+---------------------+------+
|test|time |period|
+----+---------------------+------+
|A |2017-01-13 00:30:00.0|2 |
|A |2017-01-13 00:00:01.0|1 |
|E |2017-01-13 14:00:00.0|29 |
|E |2017-01-13 12:08:15.0|25 |
+----+---------------------+------+
I hope the answer is helpful

Proxy for Excel's split cell in R

Apologies in advance if this is a duplicate of a question answered before; I haven't found anything, so here it is:
I have a 3x3 contingency table that I made in RStudio (I am specifying it as a data frame below, but I can also produce it with as.matrix if that works better):
mat.s=data.frame("WT(H)"=11,"DEL(H)"=2)
mat.s[2,1]=13
mat.s[2,2]=500369
row.names(mat.s)=c("DEL(T)", "WT(T)")
mat.s=cbind(mat.s, Total=rowSums(mat.s))
mat.s=rbind(mat.s, Total=colSums(mat.s))
which looks like:
kable(mat.s)
| | WT.H.| DEL.H.| Total|
|:------|-----:|------:|------:|
|DEL(T) | 11| 2| 13|
|WT(T) | 13| 500369| 500382|
|Total | 24| 500371| 500395|
However, if I wanted to split a cell in this table (like you can do in Excel) into two, how would I do that? So I'd like to get something like the following when I render the document with kable:
| | WT.H.| DEL.H.| Total|
|:------|-----:|------:|------:|
|DEL(T) | S D | 2| 13|
| | 8 3 | | |
|WT(T) | 13| 500369| 500382|
|Total | 24| 500371| 500395|
That way, when I want to calculate something from this table, I can refer to the split values 8 and 3 individually. Sorry if this is something very simple and easy to do! Still learning. Thanks!
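kable has no concept of a split cell, so one possible workaround (a minimal sketch, with the split values 8 and 3 hard-coded purely as an assumption for illustration) is to keep the two values as ordinary R objects for calculations and paste them into a character copy of the table that is used only for display:
library(knitr)
# hypothetical split of the DEL(T) x WT(H) cell into two values
s_val <- 8
d_val <- 3
# character copy of mat.s used only for rendering
mat.disp <- mat.s
mat.disp[] <- lapply(mat.disp, as.character)
mat.disp["DEL(T)", "WT.H."] <- paste0("S: ", s_val, " / D: ", d_val)
kable(mat.disp, align = "r")
Any calculation can then use s_val and d_val directly (e.g. s_val + d_val), while the rendered table shows both numbers in the one cell.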

How can I calculate the correlation of my residuals? Spark-Scala

I need to know whether my residuals are correlated or not. I didn't find a way to do it using Spark-Scala on Databricks, and I conclude that I should export my project to R to use the acf function.
Does someone know a trick to do it using Spark-Scala on Databricks?
For those who need more information: I'm currently working on sales forecasting. I used a regression forest with different features, and now I need to evaluate the quality of my forecast. To check this, I read in this paper that residuals are a good way to see whether your forecasting model is good or bad. In any case you can still improve it, but it's just to form my opinion of my forecast model and compare it to other models.
Currently, I have a dataframe like the one below. It's part of the testing/out-of-sample data. (I cast prediction and residuals to IntegerType, which is why in the 3rd row 40 - 17 = 22.)
I am using Spark 2.1.1.
You can find the correlation between columns using Spark's built-in functions.
Let's first import the required classes:
import org.apache.spark.sql.functions.corr
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
Now prepare the input DataFrame:
scala> val seqRow = Seq(
| ("2017-04-27",13,21),
| ("2017-04-26",7,16),
| ("2017-04-25",40,17),
| ("2017-04-24",17,17),
| ("2017-04-21",10,20),
| ("2017-04-20",9,19),
| ("2017-04-19",30,30),
| ("2017-04-18",18,25),
| ("2017-04-14",32,28),
| ("2017-04-13",39,18),
| ("2017-04-12",2,4),
| ("2017-04-11",8,24),
| ("2017-04-10",18,27),
| ("2017-04-07",6,17),
| ("2017-04-06",13,29),
| ("2017-04-05",10,17),
| ("2017-04-04",6,8),
| ("2017-04-03",20,32)
| )
seqRow: Seq[(String, Int, Int)] = List((2017-04-27,13,21), (2017-04-26,7,16), (2017-04-25,40,17), (2017-04-24,17,17), (2017-04-21,10,20), (2017-04-20,9,19), (2017-04-19,30,30), (2017-04-18,18,25), (2017-04-14,32,28), (2017-04-13,39,18), (2017-04-12,2,4), (2017-04-11,8,24), (2017-04-10,18,27), (2017-04-07,6,17), (2017-04-06,13,29), (2017-04-05,10,17), (2017-04-04,6,8), (2017-04-03,20,32))
scala> val rdd = sc.parallelize(seqRow)
rdd: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[51] at parallelize at <console>:34
scala> val input_df = spark.createDataFrame(rdd).toDF("date","amount","prediction").withColumn("residuals",'amount - 'prediction)
input_df: org.apache.spark.sql.DataFrame = [date: string, amount: int ... 2 more fields]
scala> input_df.show(false)
+----------+------+----------+---------+
|date |amount|prediction|residuals|
+----------+------+----------+---------+
|2017-04-27|13 |21 |-8 |
|2017-04-26|7 |16 |-9 |
|2017-04-25|40 |17 |23 |
|2017-04-24|17 |17 |0 |
|2017-04-21|10 |20 |-10 |
|2017-04-20|9 |19 |-10 |
|2017-04-19|30 |30 |0 |
|2017-04-18|18 |25 |-7 |
|2017-04-14|32 |28 |4 |
|2017-04-13|39 |18 |21 |
|2017-04-12|2 |4 |-2 |
|2017-04-11|8 |24 |-16 |
|2017-04-10|18 |27 |-9 |
|2017-04-07|6 |17 |-11 |
|2017-04-06|13 |29 |-16 |
|2017-04-05|10 |17 |-7 |
|2017-04-04|6 |8 |-2 |
|2017-04-03|20 |32 |-12 |
+----------+------+----------+---------+
The residuals for rows 2017-04-14 and 2017-04-13 don't match yours, since here I am simply computing residuals as amount - prediction.
Now let's proceed to calculate the correlation between all the columns.
This approach is useful when you have many columns and need the correlation of each column with every other one.
First we drop the column whose correlation is not to be calculated:
scala> val drop_date_df = input_df.drop('date)
drop_date_df: org.apache.spark.sql.DataFrame = [amount: int, prediction: int ... 1 more field]
scala> drop_date_df.show
+------+----------+---------+
|amount|prediction|residuals|
+------+----------+---------+
| 13| 21| -8|
| 7| 16| -9|
| 40| 17| 23|
| 17| 17| 0|
| 10| 20| -10|
| 9| 19| -10|
| 30| 30| 0|
| 18| 25| -7|
| 32| 28| 4|
| 39| 18| 21|
| 2| 4| -2|
| 8| 24| -16|
| 18| 27| -9|
| 6| 17| -11|
| 13| 29| -16|
| 10| 17| -7|
| 6| 8| -2|
| 20| 32| -12|
+------+----------+---------+
Since there are more than two columns, we need a correlation matrix.
To calculate the correlation matrix we need an RDD[Vector], as you can see in the Spark example for correlation.
scala> val dense_rdd = drop_date_df.rdd.map{row =>
| val first = row.getAs[Integer]("amount")
| val second = row.getAs[Integer]("prediction")
| val third = row.getAs[Integer]("residuals")
| Vectors.dense(first.toDouble,second.toDouble,third.toDouble)}
dense_rdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[62] at map at <console>:40
scala> val correlMatrix: Matrix = Statistics.corr(dense_rdd, "pearson")
correlMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0 0.40467032516705076 0.782939330961529
0.40467032516705076 1.0 -0.2520531290688281
0.782939330961529 -0.2520531290688281 1.0
The order of the columns remains the same, but you lose the column names.
You can find good resources about the structure of a correlation matrix.
Since you only want the correlation of the residuals with the other two columns, we can explore other options:
Hive corr UDAF
scala> drop_date_df.createOrReplaceTempView("temp_table")
scala> val corr_query_df = spark.sql("select corr(amount,residuals) as amount_residuals_corr,corr(prediction,residuals) as prediction_residual_corr from temp_table")
corr_query_df: org.apache.spark.sql.DataFrame = [amount_residuals_corr: double, prediction_residual_corr: double]
scala> corr_query_df.show
+---------------------+------------------------+
|amount_residuals_corr|prediction_residual_corr|
+---------------------+------------------------+
| 0.7829393309615287| -0.252053129068828|
+---------------------+------------------------+
Spark corr function
scala> val corr_df = drop_date_df.select(
| corr('amount,'residuals).as("amount_residuals_corr"),
| corr('prediction,'residuals).as("prediction_residual_corr"))
corr_df: org.apache.spark.sql.DataFrame = [amount_residuals_corr: double, prediction_residual_corr: double]
scala> corr_df.show
+---------------------+------------------------+
|amount_residuals_corr|prediction_residual_corr|
+---------------------+------------------------+
| 0.7829393309615287| -0.252053129068828|
+---------------------+------------------------+
