SAS data reorganization - linking information from one column to another - networking

I have a dataset with a village name, a market, churches, and hospitals. Markets, churches, and hospitals are not given names directly; rather, they are listed by village, so they can be linked directly to the village name. Coordinates are only listed by village. I need to reorganize the dataset in SAS so that the GPS coordinates for villages are also linked to their respective markets, churches, and hospitals.
Here's a better visualization for what I'm trying to do:
Transform this dataset:
| Long | Lat  | Village | Market | Church |
|------|------|---------|--------|--------|
| X(A) | Y(A) | A       | A      | A      |
| X(B) | Y(B) | B       | B      | B      |
| X(C) | Y(C) | C       | A      | A      |
| X(D) | Y(D) | D       | A      | D      |
| X(E) | Y(E) | E       | B      | B      |
| X(F) | Y(F) | F       | F      | F      |
| X(G) | Y(G) | G       | F      | F      |
| X(H) | Y(H) | H       | H      | F      |
To something that looks like this, with newly created columns for market and church coordinates (based on the original village coordinates):
| Long | Lat  | Village | Market | Market_Long | Market_Lat | Church | Church_Long | Church_Lat |
|------|------|---------|--------|-------------|------------|--------|-------------|------------|
| X(A) | Y(A) | A       | A      | X(A)        | Y(A)       | A      | X(A)        | Y(A)       |
| X(B) | Y(B) | B       | B      | X(B)        | Y(B)       | B      | X(B)        | Y(B)       |
| X(C) | Y(C) | C       | A      | X(A)        | Y(A)       | A      | X(A)        | Y(A)       |
| X(D) | Y(D) | D       | A      | X(A)        | Y(A)       | D      | X(D)        | Y(D)       |
| X(E) | Y(E) | E       | B      | X(B)        | Y(B)       | B      | X(B)        | Y(B)       |
| X(F) | Y(F) | F       | F      | X(F)        | Y(F)       | F      | X(F)        | Y(F)       |
| X(G) | Y(G) | G       | F      | X(F)        | Y(F)       | F      | X(F)        | Y(F)       |
| X(H) | Y(H) | H       | H      | X(H)        | Y(H)       | F      | X(F)        | Y(F)       |
Thanks for any light you can shed on this!

As was commented, the answer is as simple as defining them as new variables. Since you are going to be defining the new variables anyway, you can use a length statement to set the order in which you want them to appear in the dataset.
data old;
    input lon $ lat $ village $ market $ church $ hosp $;
    datalines;
X(A) Y(A) A A A B
X(B) Y(B) B B B B
X(C) Y(C) C A A D
X(D) Y(D) D A D D
X(E) Y(E) E B B B
;
run;
data new;
    /* length sets both the storage length and the column order in the new dataset */
    length lon lat village market market_lon market_lat church church_lon church_lat
           hosp hosp_lon hosp_lat $ 100;
    set old;
    /* copy the village coordinates into the new *_lon / *_lat variables */
    market_lon = lon;
    market_lat = lat;
    church_lon = lon;
    church_lat = lat;
    hosp_lon   = lon;
    hosp_lat   = lat;
run;
This will work for a simple case, but if there are many more variables than just market, church, and hosp, you may want to use the macro language and a %do loop to iterate over a list of the variable names instead of defining the new variables manually; a sketch follows the link below.
Great article for macro %do: http://blogs.sas.com/content/sastraining/2015/01/30/sas-authors-tip-getting-the-macro-language-to-perform-a-do-loop-over-a-list-of-values/
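As a rough illustration of that approach, here is a minimal sketch, assuming the input dataset is called old and that the point-type variables are passed as a space-separated list (the macro name add_coords and the vars parameter are made up for this example):

%macro add_coords(vars=market church hosp);
    data new;
        set old;
        /* iterate over the assumed space-separated list of variable names */
        %do i = 1 %to %sysfunc(countw(&vars));
            %let var = %scan(&vars, &i);
            /* create <name>_lon / <name>_lat copies of the village coordinates */
            &var._lon = lon;
            &var._lat = lat;
        %end;
    run;
%mend add_coords;

%add_coords(vars=market church hosp);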

Related

How to not automatically sort my y axis bar plot in ggplot & lemon

I am trying to display data by Species that has different values depending on group Letter. The best way I have found to display my data is by putting my categorical data on the y-axis and displaying the Total_Observed on the x-axis. Lemon allows me to have different y-axis labels. Unfortunately, the graph sorts by my y-axis labels instead of using my data as is, which is sorted by most abundant species to least abundant. Any suggestions?
Using libraries: dplyr, ggplot2, lemon
My data:
|Letter |Species | Total_Observed|
|:------|:------------------------|--------------:|
|A |Yellowtail snapper | 155|
|A |Sharksucker | 119|
|A |Tomtate | 116|
|A |Mutton snapper | 104|
|A |Little tunny | 96|
|B |Vermilion snapper | 1655|
|B |Red snapper | 1168|
|B |Gray triggerfish | 689|
|B |Tomtate | 477|
|B |Red porgy | 253|
|C |Red snapper | 391|
|C |Vermilion snapper | 114|
|C |Lane snapper | 95|
|C |Atlantic sharpnose shark | 86|
|C |Tomtate | 73|
|D |Lane snapper | 627|
|D |Red grouper | 476|
|D |White grunt | 335|
|D |Gray snapper | 102|
|D |Sand perch | 50|
|E |White grunt | 515|
|E |Red grouper | 426|
|E |Red snapper | 150|
|E |Black sea bass | 142|
|E |Lane snapper | 88|
|E |Gag | 88|
|F |Yellowtail snapper | 385|
|F |White grunt | 105|
|F |Gray snapper | 88|
|F |Mutton snapper | 82|
|F |Lane snapper | 59|
Then I run my ggplot/lemon code:
ggplot(test,aes(y=Species,x=Total_Observed))+geom_histogram(stat='identity')+facet_wrap(.~test$Letter,scales='free_y')
And my graphs print like this:
Answered via the blog post Johan Rosa shared (https://juliasilge.com/blog/reorder-within/): the solution is to use the tidytext library, with the functions reorder_within and scale_x_reordered.
The corrected code:
test %>%
  mutate(Species = reorder_within(Species, Total_Observed, Letter)) %>%
  ggplot(aes(Species, Total_Observed)) +
  geom_histogram(stat = 'identity') +
  facet_wrap(~Letter, scales = 'free_y') +
  coord_flip() +
  scale_x_reordered()
This now generates the graphs ordered correctly.

Dictionary-style conditional referencing between two dataframes

I have two data frames. DF1 is a list of homicides with a date and location attached to each row. DF2 consists of a set of shared locations mentioned in DF1.
DF2 contains a latitude and longitude for each unique location. I want to pull these out. NOTE: DF2 contains shared locations, which may correspond to multiple homicides in DF1, which means the two DFs are different lengths.
I want to create latitude and longitude variables in DF1 when a location in DF2 is equal to the location in DF1 (assuming location names match exactly between the two DFs). How do I pull the latitude and longitude from DF2 whose location corresponds to a given homicide record in DF1?
Small reproducible example:
DF1: (dataframe of incidents)
| Incident | Place |
| -------- | -------|
| Incident 1| Place 1|
| Incident 2| Place 2|
| Incident 3| Place 2|
| Incident 4| Place 3|
| Incident 5| Place 1|
| Incident 6| Place 3|
| Incident 7| Place 2|
DF2: (dictionary-style lat-lon manual)
| Place |Latitude |Longitude |
| -------| ------- | ---------|
| Place 1| A | B |
| Place 2| C | D |
| Place 3| E | F |
| Place 4| G | H |
DF3 (what I want)
| Incident | Latitude | Longitude |
| -------- | -------- | --------- |
|Incident 1| A | B |
|Incident 2| C | D |
|Incident 3| C | D |
|Incident 4| E | F |
|Incident 5| A | B |
|Incident 6| E | F |
|Incident 7| C | D |
I have tried:
DF1$latitude <- DF2$latitude[which(DF2$location == DF1$location), ]
It returned the following error:
Error in DF2$latitude[which(DF2$location == DF1$location), ] :
incorrect number of dimensions
In addition: Warning message:
In DF2$location == DF1$location :
longer object length is not a multiple of shorter object length
In response to a comment suggestion, I also tried:
DF2$latitude[which(DF2$location == DF1$location)]
However, I got the error:
Error in `$<-.data.frame`(`*tmp*`, latitude, value = numeric(0)) :
replacement has 0 rows, data has 1220
In addition: Warning message:
In DF1$location == DF2$location :
longer object length is not a multiple of shorter object length
You can try dplyr's left_join(). The code below keeps all the rows in DF1 and adds the variables from DF2 wherever it finds a match on the location column.
library(dplyr)
DF3 <- left_join(DF1, DF2, by = "location")
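With the column names from the reproducible example above (assuming the shared key is called Place in both data frames rather than location), the join, plus an equivalent base-R match() lookup closer to what the original indexing attempt was going for, might look like this:

library(dplyr)

# dplyr: keep every incident, pull Latitude/Longitude where Place matches
DF3 <- DF1 %>%
  left_join(DF2, by = "Place") %>%
  select(Incident, Latitude, Longitude)

# base R: match() gives, for each row of DF1, the matching row index in DF2
idx <- match(DF1$Place, DF2$Place)
DF1$Latitude  <- DF2$Latitude[idx]
DF1$Longitude <- DF2$Longitude[idx]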

Dividing time into 30-minute periods

I have a DataFrame containing a "time" column. I want to add a new column containing the period number after dividing the day into 30-minute periods.
For example, the original DataFrame:
l = [('A','2017-01-13 00:30:00'),('A','2017-01-13 00:00:01'),('E','2017-01-13 14:00:00'),('E','2017-01-13 12:08:15')]
df = spark.createDataFrame(l,['test','time'])
df1 = df.select(df.test,df.time.cast('timestamp'))
df1.show()
+----+-------------------+
|test| time|
+----+-------------------+
| A|2017-01-13 00:30:00|
| A|2017-01-13 00:00:01|
| E|2017-01-13 14:00:00|
| E|2017-01-13 12:08:15|
+----+-------------------+
The Desired Dataframe as follow:
+----+-------------------+------+
|test| time|period|
+----+-------------------+------+
| A|2017-01-13 00:30:00| 2|
| A|2017-01-13 00:00:01| 1|
| E|2017-01-13 14:00:00| 29|
| E|2017-01-13 12:08:15| 25|
+----+-------------------+------+
Are there ways to achieve that?
You can simply use the built-in hour and minute functions, together with the when function, to get your final result:
from pyspark.sql import functions as F

df1.withColumn('period',
               (F.hour(df1['time']) * 2) + 1 +
               F.when(F.minute(df1['time']) >= 30, 1).otherwise(0)
              ).show(truncate=False)
You should be getting
+----+---------------------+------+
|test|time |period|
+----+---------------------+------+
|A |2017-01-13 00:30:00.0|2 |
|A |2017-01-13 00:00:01.0|1 |
|E |2017-01-13 14:00:00.0|29 |
|E |2017-01-13 12:08:15.0|25 |
+----+---------------------+------+
I hope the answer is helpful
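If the period length ever needs to change, a hedged alternative (not part of the original answer) is to derive the period index from the total minutes since midnight; period_minutes and df2 are names introduced here purely for illustration:

from pyspark.sql import functions as F

period_minutes = 30  # assumed period length in minutes

# period = floor(minutes since midnight / period length) + 1
df2 = df1.withColumn(
    'period',
    F.floor((F.hour('time') * 60 + F.minute('time')) / period_minutes) + 1
)
df2.show(truncate=False)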

How can I calculate the correlation of my residuals? Spark-Scala

I need to know whether my residuals are correlated or not. I couldn't find a way to do it using Spark-Scala on Databricks, and I concluded that I would have to export my project to R to use the acf function.
Does someone know a trick to do it using Spark-Scala on Databricks?
For those who need more information: I'm currently working on sales forecasting. I used a regression forest with different features, and now I need to evaluate the quality of my forecast. To check this, I read in a paper that residuals are a good way to see whether a forecasting model is good or bad. In any case, the model can still be improved, but this is just to form an opinion on my forecast model and compare it to other models.
Currently, I have one DataFrame like the one below. It's part of the testing/out-of-sample data. (I cast prediction and residuals to IntegerType, which is why in the 3rd row 40 - 17 = 22.)
I am using Spark 2.1.1.
You can find the correlation between columns using the Spark ML library.
Let's first import the required classes.
import org.apache.spark.sql.functions.corr
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
Now prepare the input DataFrame:
scala> val seqRow = Seq(
| ("2017-04-27",13,21),
| ("2017-04-26",7,16),
| ("2017-04-25",40,17),
| ("2017-04-24",17,17),
| ("2017-04-21",10,20),
| ("2017-04-20",9,19),
| ("2017-04-19",30,30),
| ("2017-04-18",18,25),
| ("2017-04-14",32,28),
| ("2017-04-13",39,18),
| ("2017-04-12",2,4),
| ("2017-04-11",8,24),
| ("2017-04-10",18,27),
| ("2017-04-07",6,17),
| ("2017-04-06",13,29),
| ("2017-04-05",10,17),
| ("2017-04-04",6,8),
| ("2017-04-03",20,32)
| )
seqRow: Seq[(String, Int, Int)] = List((2017-04-27,13,21), (2017-04-26,7,16), (2017-04-25,40,17), (2017-04-24,17,17), (2017-04-21,10,20), (2017-04-20,9,19), (2017-04-19,30,30), (2017-04-18,18,25), (2017-04-14,32,28), (2017-04-13,39,18), (2017-04-12,2,4), (2017-04-11,8,24), (2017-04-10,18,27), (2017-04-07,6,17), (2017-04-06,13,29), (2017-04-05,10,17), (2017-04-04,6,8), (2017-04-03,20,32))
scala> val rdd = sc.parallelize(seqRow)
rdd: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[51] at parallelize at <console>:34
scala> val input_df = spark.createDataFrame(rdd).toDF("date","amount","prediction").withColumn("residuals",'amount - 'prediction)
input_df: org.apache.spark.sql.DataFrame = [date: string, amount: int ... 2 more fields]
scala> input_df.show(false)
+----------+------+----------+---------+
|date |amount|prediction|residuals|
+----------+------+----------+---------+
|2017-04-27|13 |21 |-8 |
|2017-04-26|7 |16 |-9 |
|2017-04-25|40 |17 |23 |
|2017-04-24|17 |17 |0 |
|2017-04-21|10 |20 |-10 |
|2017-04-20|9 |19 |-10 |
|2017-04-19|30 |30 |0 |
|2017-04-18|18 |25 |-7 |
|2017-04-14|32 |28 |4 |
|2017-04-13|39 |18 |21 |
|2017-04-12|2 |4 |-2 |
|2017-04-11|8 |24 |-16 |
|2017-04-10|18 |27 |-9 |
|2017-04-07|6 |17 |-11 |
|2017-04-06|13 |29 |-16 |
|2017-04-05|10 |17 |-7 |
|2017-04-04|6 |8 |-2 |
|2017-04-03|20 |32 |-12 |
+----------+------+----------+---------+
The residuals for rows 2017-04-14 and 2017-04-13 don't match your table because I am computing residuals as amount - prediction.
Now let's proceed to calculate the correlation between all the columns.
This method is useful when there are many columns and you need the correlation of each column with every other one.
First we drop the column whose correlation is not to be calculated:
scala> val drop_date_df = input_df.drop('date)
drop_date_df: org.apache.spark.sql.DataFrame = [amount: int, prediction: int ... 1 more field]
scala> drop_date_df.show
+------+----------+---------+
|amount|prediction|residuals|
+------+----------+---------+
| 13| 21| -8|
| 7| 16| -9|
| 40| 17| 23|
| 17| 17| 0|
| 10| 20| -10|
| 9| 19| -10|
| 30| 30| 0|
| 18| 25| -7|
| 32| 28| 4|
| 39| 18| 21|
| 2| 4| -2|
| 8| 24| -16|
| 18| 27| -9|
| 6| 17| -11|
| 13| 29| -16|
| 10| 17| -7|
| 6| 8| -2|
| 20| 32| -12|
+------+----------+---------+
Since there are more than two columns, we need to compute a correlation matrix.
Calculating a correlation matrix requires an RDD[Vector], as you can see in the Spark example for correlation.
scala> val dense_rdd = drop_date_df.rdd.map{row =>
| val first = row.getAs[Integer]("amount")
| val second = row.getAs[Integer]("prediction")
| val third = row.getAs[Integer]("residuals")
| Vectors.dense(first.toDouble,second.toDouble,third.toDouble)}
dense_rdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[62] at map at <console>:40
scala> val correlMatrix: Matrix = Statistics.corr(dense_rdd, "pearson")
correlMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0 0.40467032516705076 0.782939330961529
0.40467032516705076 1.0 -0.2520531290688281
0.782939330961529 -0.2520531290688281 1.0
The order of the columns remains the same, but you lose the column names.
You can find good resources about the structure of a correlation matrix.
Since you only want the correlation of the residuals with the other two columns, we can explore other options:
Hive corr UDAF
scala> drop_date_df.createOrReplaceTempView("temp_table")
scala> val corr_query_df = spark.sql("select corr(amount,residuals) as amount_residuals_corr,corr(prediction,residuals) as prediction_residual_corr from temp_table")
corr_query_df: org.apache.spark.sql.DataFrame = [amount_residuals_corr: double, prediction_residual_corr: double]
scala> corr_query_df.show
+---------------------+------------------------+
|amount_residuals_corr|prediction_residual_corr|
+---------------------+------------------------+
| 0.7829393309615287| -0.252053129068828|
+---------------------+------------------------+
Spark corr function:
scala> val corr_df = drop_date_df.select(
| corr('amount,'residuals).as("amount_residuals_corr"),
| corr('prediction,'residuals).as("prediction_residual_corr"))
corr_df: org.apache.spark.sql.DataFrame = [amount_residuals_corr: double, prediction_residual_corr: double]
scala> corr_df.show
+---------------------+------------------------+
|amount_residuals_corr|prediction_residual_corr|
+---------------------+------------------------+
| 0.7829393309615287| -0.252053129068828|
+---------------------+------------------------+

Append data in sqlite3 c++

I have a table in SQLite like this:

| a | b | c | d | e |
|---|---|---|---|---|
| 5 | 6 | 4 | 3 | 6 |

Columns a - e each hold an integer. Now I need to add a number to some of the columns; for example, add 3 to column 'c' so that c then holds 7. How can I do this?
I think this could be done with a simple update query like this:
UPDATE <table> SET c = c + 3;
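If the question is also about how to run that statement from C++, here is a minimal sketch using the sqlite3 C API; the database file name test.db and table name t are placeholders:

#include <cstdio>
#include <sqlite3.h>

int main() {
    sqlite3 *db = nullptr;
    // Open (or create) the database file; "test.db" is a placeholder name.
    if (sqlite3_open("test.db", &db) != SQLITE_OK) {
        std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    // Add 3 to column c in every row; "t" is a placeholder table name.
    char *errmsg = nullptr;
    if (sqlite3_exec(db, "UPDATE t SET c = c + 3;", nullptr, nullptr, &errmsg) != SQLITE_OK) {
        std::fprintf(stderr, "update failed: %s\n", errmsg);
        sqlite3_free(errmsg);
    }

    sqlite3_close(db);
    return 0;
}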
