Assigning coordinates to data of a different year in R?

I have got a data frame of Germany from 2012 with 8187 rows for 8187 postal codes (and about 10 variables listed as columns), but with no coordinates. Additionally, I have got coordinates from a more recent shapefile with 8203 rows (covering mostly the same postal codes).
I need the correct coordinates of the 8203 cases to be assigned to the 8187 cases of the initial data frame.
The problem: the mismatch is not just the 16 extra cases (8203 - 8187 = 16), it is larger. There are some towns (with postal codes) from 2012 which are not listed in the more recent shapefile and vice versa.
(I) Perhaps the easiest solution would be to obtain the coordinates from 2012 themselves (unprojected: CRS("+init=epsg:4326")). --> Does anybody know an open-source platform for this purpose? And would it contain exactly 8187 postal codes?
(II) Or: Does anybody have experience with assigning coordinates to a data set from a different year? Or should this be avoided altogether, because of slightly changing borders and coordinates (especially when the data should be mapped and visualized with polygons from 2012) and because some towns are listed only in the older or only in the newer data set?
I would appreciate your expert advice on how to approach (and hopefully solve) this issue!
EDIT - MWE:
# data set from 2012
> df1
# A tibble: 9 x 4
     ID  PLZ5 Name             Var1
  <dbl> <dbl> <chr>           <dbl>
1     1  1067 Dresden 01067      40
2     2  1069 Dresden 01069     110
3   224  4571 Rötha               0
4   225  4574 Deutzen           120
5   226  4575 Neukieritzsch     144
6   262  4860 Torgau             23
7   263  4862 Mockrehna          57
8  8186 99996 Menteroda           0
9  8187 99998 Körner             26
# coordinates of recent shapefile
> df2
# A tibble: 9 x 5
     ID  PLZ5 Name          Longitude Latitude
  <dbl> <dbl> <chr>             <dbl>    <dbl>
1     1  1067 Dresden-01067  13.71832 51.06018
2     2  1069 Dresden-01069  13.73655 51.03994
3   224  4571 Roetha         12.47311 51.20390
4   225  4575 Neukieritzsch  12.41355 51.15278
5   260  4860 Torgau         12.94737 51.55790
6   261  4861 Bennewitz      13.00145 51.51125
7   262  4862 Mockrehna      12.83097 51.51125
8  8202 99996 Obermehler     10.59146 51.28864
9  8203 99998 Koerner        10.55294 51.21257
Hence,
4 225 4574 Deutzen 120
--> is not listed in df2 and:
6 261 4861 Bennewitz 13.00145 51.51125
--> is not listed in df1.
Any ideas concerning (I) and (II)?
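For what it is worth, once a coordinate source is chosen, the mismatch described in (II) can be made explicit with a join plus two anti-joins on the shared postal code. A minimal sketch, assuming PLZ5 is the key in both tables and using df1 and df2 from the MWE above (the result names are only illustrative):
library(dplyr)

# attach coordinates where a postal code exists in both data sets
df_joined <- left_join(df1, select(df2, PLZ5, Longitude, Latitude), by = "PLZ5")

# 2012 postal codes with no match in the newer shapefile (e.g. Deutzen)
missing_in_df2 <- anti_join(df1, df2, by = "PLZ5")

# newer postal codes not present in the 2012 data (e.g. Bennewitz)
missing_in_df1 <- anti_join(df2, df1, by = "PLZ5")
The unmatched rows then have to be handled case by case (dropped or matched manually), which is essentially the caveat raised in (II): polygons and coordinates from different years will not line up perfectly.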

Related

how to calculate mean based on conditions in for loop in r

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like for R to calculate average activity based on the age of the colony in the data frame. Specifically, I want it to only calculate the average activity of the colonies that are the same age or older than the colony in that row, not including the activity of the colony in that row. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data. That would include colony 25077 and colony 4865; and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric", nrow(test.df))
for (i in 1:10){
  test.avg[i] <- mean(subset(test.df$activity, test.df$age >= age[i])[-i])
}
R returns a list of values where half of them are correct and the other half are not (I'm not even sure how it calculated those incorrect numbers..). The numbers that are correct are also out of order compared to how they're listed in the data frame. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
  mutate(result = map_dbl(age, ~ mean(activity[age > .x])))
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)
The issue in your solution is that the [-i] index is meant to refer to row i of the original data.frame, yet you apply it to the already-subsetted vector, so the positions no longer match.
Try something like this: first store the current row's age as the minimum-age threshold, then exclude the current row and calculate the average activity of the cases with age >= that threshold.
for (i in 1:10){
  test.avg[i] <- {amin = age[i]; mean(subset(test.df[-i,], age >= amin)$activity)}
}
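A quick sanity check, assuming the corrected loop above has been run on test.df: the first element should reproduce the hand calculation from the question, (45 + 33) / 2 = 39 for colony 29683.
test.avg[1]
#> [1] 39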
You can use map_df :
library(tidyverse)
test.df %>%
  mutate(map_df(1:nrow(test.df), ~
    test.df %>%
      filter(age >= test.df$age[.x]) %>%
      summarise(av_acti = mean(activity))))

Merging spatial data with non-spatial data brings up NA values for Non-Spatial Data

I'm working with U.S. Census data (containing attributes as well as spatial data/geometry) that I'm trying to merge with my own database, which I created in Excel (police stop rates and counts within census tracts) and converted to a CSV file. Both databases share a unique column identifier, "GEOID", and the same number of observations, but when I use merge(), left_join(), or even inner_join(), I continuously get all of the data from my spatial file back while the variables from my other data table all come back as NA. What should I do? Thanks for the help!
What I'm working with:
library(readr)
SDPD_Data_Census <- read_csv("SDPD_Data_Census.csv",
                             col_types = cols(GEOID = col_character(),
                                              policestop = col_integer(),
                                              policestoprate = col_number(),
                                              totp = col_skip()))
View(SDPD_Data_Census)
#I read my census data (with geometry) in from a shapefile
SD.city.tracts <- st_read("SD.city.tracts.shp", stringsAsFactors = FALSE)
#My SPD_Variable_List is missing geometry data that would allow me to plot the policerate variable onto a map. To fix this, I merged my census data (that has geometry values) and my police data together
#I merge my police data with my census data using GEOID as the common factor
SD_Police_Census <- left_join(SD.city.tracts, SDPD_Data_Census)
#I check whether the datasets were merged: the policestoprate and policestop columns are now included with the census data, but they only contain NA values
head(SD_Police_Census, n=5)
Joining, by = "GEOID"
Simple feature collection with 5 features and 34 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -117.1949 ymin: 32.73966 xmax: -117.1554 ymax: 32.75932
epsg (SRID): NA
proj4string: +proj=longlat +ellps=GRS80 +no_defs
GEOID tpop tpopr medincome pfpov powner phsgrad pbach pdiv psingm pnhwhite nhwhite pnhasn nhasn pnhblk nhblk phisp
1 06073000100 3250 3250 138864 0.0000000 36.83077 1.969231 40.86154 7.323077 0.2153846 76.67692 2492 4.369231 142 0.0000000 0 15.876923
2 06073000201 1915 1915 90673 0.9921671 24.90862 3.342037 41.35770 12.584856 2.2454308 84.38642 1616 2.140992 41 0.5221932 10 7.049608
3 06073000202 4583 4583 66438 0.6764128 18.93956 4.494872 43.42134 12.000873 2.4874536 71.61248 3282 9.382501 430 0.8727907 40 13.855553
4 06073000300 5094 5094 69028 0.9422850 13.42756 3.945819 45.75972 13.172360 2.0416176 72.49706 3693 2.179034 111 5.1040440 260 16.195524
5 06073000400 3758 3758 75559 0.0000000 11.09633 5.268760 40.89941 11.362427 3.1665780 61.76158 2321 11.043108 415 5.0026610 188 19.425226
hisp pnonwhite nonwhite pfborn nfborn poth oth nhwhitec nonwhitec nhasnc nhblkc othc hispc tpoprc ent policestoprate policestop
1 516 23.32308 758 13.384615 435 3.076923 100 646438 853300 248715 89133 67268 448184 1499738 0.7397115 NA NA
2 135 15.61358 299 6.370757 122 5.900783 113 646438 853300 248715 89133 67268 448184 1499738 0.6069625 NA NA
3 635 28.38752 1301 15.775693 723 4.276675 196 646438 853300 248715 89133 67268 448184 1499738 0.9111694 NA NA
4 825 27.50294 1401 9.187279 468 4.024342 205 646438 853300 248715 89133 67268 448184 1499738 0.8925200 NA NA
5 730 38.23842 1437 18.121341 681 2.767429 104 646438 853300 248715 89133 67268 448184 1499738 1.1083576 NA NA
geometry
1 MULTIPOLYGON (((-117.1922 3...
2 MULTIPOLYGON (((-117.1789 3...
3 MULTIPOLYGON (((-117.1785 3...
4 MULTIPOLYGON (((-117.1686 3...
5 MULTIPOLYGON (((-117.1709 3...
#When I try to map the policestoprate variable it shows that all policestoprate data is missing
Hopefully someone can help me out. I really need this to work since it's for a thesis, and I'd be sad to abandon this project because of 2 variables...
EDIT:
when I use head(SDPD_Data_Census) it shows:
GEOID policestoprate policestop
<chr> <dbl> <int>
6073000100 0.0000000 0
6073000201 1.5665796 3
6073000202 0.6545931 3
6073000300 3.1409501 16
6073000400 26.3437999 99
6073000500 1.5285845 5
So the data is there and has no NA values when left in its original form, but when merged with my census data, only the two columns from my police data show NA values throughout. Using full_join() produced the same results as well.
EDIT 2:
I looked over my police database and it turns out all of my GEOID values are missing a 0 at the beginning, which is why they couldn't match the GEOID values from the census database (which has these zeroes). Very silly mistake, but now I have to manually insert 0s into all of my GEOID values in Excel and hopefully they merge this time. (When I did a full_join() on the two datasets, it turned out that the police data was preserved but added at the very bottom of the newly made dataset, because it couldn't match the census GEOID values.)
EDIT 3: I manually fixed my police database and added 0s in front of my GEOIDs to match the ones from the census database. Using full_join() after that worked perfectly and now I can map my police stop rates with no issues! Lesson learned: try not to code at 2am, because you can make silly mistakes like this.
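For reference, the zero-padding can also be done in R rather than in Excel. A minimal sketch, assuming the census-tract GEOIDs should be 11 characters wide (as in the shapefile output above, e.g. 06073000100) and that GEOID was read as character:
library(dplyr)
library(stringr)

# pad the police GEOIDs with leading zeros to 11 characters before joining
SDPD_Data_Census <- SDPD_Data_Census %>%
  mutate(GEOID = str_pad(GEOID, width = 11, side = "left", pad = "0"))

SD_Police_Census <- left_join(SD.city.tracts, SDPD_Data_Census, by = "GEOID")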

R Panel data: Create new variable based on ifelse() statement and previous row

My question refers to the following (simplified) panel data, for which I would like to create some sort of xrd_stock.
#Setup data
library(tidyverse)
firm_id <- c(rep(1, 5), rep(2, 3), rep(3, 4))
firm_name <- c(rep("Cosco", 5), rep("Apple", 3), rep("BP", 4))
fyear <- c(seq(2000, 2004, 1), seq(2003, 2005, 1), seq(2005, 2008, 1))
xrd <- c(49,93,121,84,37,197,36,154,104,116,6,21)
df <- data.frame(firm_id, firm_name, fyear, xrd)
#Define variables
growth = 0.08
depr = 0.15
For a new variable called xrd_stock I'd like to apply the following mechanics:
each firm_id should be handled separately: group_by(firm_id)
where fyear is at minimum, calculate xrd_stock as: xrd/(growth + depr)
otherwise, calculate xrd_stock as: xrd + (1-depr) * [xrd_stock from previous row]
With the following code, I already succeeded with step 1. and 2. and parts of step 3.
df2 <- df %>%
  ungroup() %>%
  group_by(firm_id) %>%
  arrange(firm_id, fyear, decreasing = TRUE) %>% #Ensure that data is arranged in ascending fyear order; not required in this specific example as df is already in correct order
  mutate(xrd_stock = ifelse(fyear == min(fyear), xrd/(growth + depr), xrd + (1-depr)*lag(xrd_stock)))
Difficulties occur in the else part of the function, such that R returns:
Error: Problem with `mutate()` input `xrd_stock`.
x object 'xrd_stock' not found
i Input `xrd_stock` is `ifelse(...)`.
i The error occured in group 1: firm_id = 1.
Run `rlang::last_error()` to see where the error occurred.
From this error message, I understand that R cannot refer to the xrd_stock value just created in the previous row (which makes sense if R does not strictly work from top to bottom); however, when I simply put a 9 in the else part, the code above runs without any errors.
Can anyone help me with this problem so that the results eventually look as shown below? I am more than happy to answer additional questions if required. Thank you very much in advance to everyone who looks at my question :-)
Target results (Excel-calculated):
id  name   fyear  xrd  xrd_stock  Calculation for xrd_stock
1   Cosco   2000   49        213  =49/(0.08+0.15)
1   Cosco   2001   93        274  =93+(1-0.15)*213
1   Cosco   2002  121        354  …
1   Cosco   2003   84        385  …
1   Cosco   2004   37        364  …
2   Apple   2003  197        857  =197/(0.08+0.15)
2   Apple   2004   36        764  =36+(1-0.15)*857
2   Apple   2005  154        803  …
3   BP      2005  104        452  …
3   BP      2006  116        500  …
3   BP      2007    6        431  …
3   BP      2008   21        388  …
Arrange the data by fyear so that the minimum year is always the first row of each group; you can then use accumulate to calculate the stock.
library(dplyr)
df %>%
  arrange(firm_id, fyear) %>%
  group_by(firm_id) %>%
  mutate(xrd_stock = purrr::accumulate(xrd[-1], ~ .y + (1 - depr) * .x,
                                       .init = first(xrd) / (growth + depr)))
# firm_id firm_name fyear xrd xrd_stock
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 Cosco 2000 49 213.
# 2 1 Cosco 2001 93 274.
# 3 1 Cosco 2002 121 354.
# 4 1 Cosco 2003 84 385.
# 5 1 Cosco 2004 37 364.
# 6 2 Apple 2003 197 857.
# 7 2 Apple 2004 36 764.
# 8 2 Apple 2005 154 803.
# 9 3 BP 2005 104 452.
#10 3 BP 2006 116 500.
#11 3 BP 2007 6 431.
#12 3 BP 2008 21 388.
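The recurrence that accumulate() evaluates can also be checked against base R's Reduce(), which performs the same left fold. A small sketch for a single firm (Cosco), re-stating the growth and depr values from the question; cosco_xrd is just an illustrative helper vector:
growth <- 0.08
depr   <- 0.15
cosco_xrd <- c(49, 93, 121, 84, 37)

# start from xrd/(growth + depr), then apply xrd + (1 - depr) * previous stock
Reduce(function(prev, x) x + (1 - depr) * prev,
       cosco_xrd[-1],
       init = cosco_xrd[1] / (growth + depr),
       accumulate = TRUE)
#> roughly 213.04, 274.09, 353.97, 384.88, 364.15 -- matching the Excel targets 213, 274, 354, 385, 364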

Select TailNumber with MaxAirTime by UniqueCarrier [duplicate]

This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 5 years ago.
I have a similar problem with my own dataset and decided to practice on an example dataset. I'm trying to select the TailNumbers associated with the Max Air Time by Carrier.
Here's my solution thus far:
library(hflights)
hflights %>%
  group_by(UniqueCarrier, TailNum) %>%
  summarise(maxAT = max(AirTime)) %>%
  arrange(desc(maxAT))
This provides three columns from which I can eyeball the max Air Time values and then narrow them down using filter() statements. However, I feel like there's a more elegant way to do so.
You can use which.max to find out the row with the maximum AirTime and then slice the rows:
hflights %>%
  select(UniqueCarrier, TailNum, AirTime) %>%
  group_by(UniqueCarrier) %>%
  slice(which.max(AirTime))
# A tibble: 15 x 3
# Groups: UniqueCarrier [15]
# UniqueCarrier TailNum AirTime
# <chr> <chr> <int>
# 1 AA N3FNAA 161
# 2 AS N626AS 315
# 3 B6 N283JB 258
# 4 CO N77066 549
# 5 DL N358NB 188
# 6 EV N716EV 173
# 7 F9 N905FR 190
# 8 FL N176AT 186
# 9 MQ N526MQ 220
#10 OO N744SK 225
#11 UA N457UA 276
#12 US N950UW 212
#13 WN N256WN 288
#14 XE N11199 204
#15 YV N907FJ 150
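One aside, not part of the answer above: which.max() returns only the first maximum per group, so a carrier with two tail numbers tied for the longest AirTime would lose one of them. If ties should be kept, a common alternative is to filter on the group maximum; a sketch using the same columns as above:
library(dplyr)
library(hflights)

hflights %>%
  select(UniqueCarrier, TailNum, AirTime) %>%
  group_by(UniqueCarrier) %>%
  # keeps every row equal to the group maximum; rows with NA AirTime are dropped
  filter(AirTime == max(AirTime, na.rm = TRUE))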

Deleting rows dynamically based on certain condition in R

Problem description:
From the below table, I would want to remove all the rows above the quarter value of 2014-Q3, i.e. rows 1 and 2.
Also note that this is a dynamic data set, which means that when we move on to the next quarter, i.e. 2016-Q3, I would want to remove all the rows above the quarter value of 2014-Q4 automatically, through code and without any manual intervention
(and when we move to the next quarter, 2016-Q4, I would want to remove all rows above 2015-Q1, and so on).
I have a variable which captures the first quarter I would like to see in my final data frame (in this case 2014-Q3), and this variable would change as we progress in the future.
QTR Revenue
1 2014-Q1 456
2 2014-Q2 3113
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
.. .. ..
10 2016-Q2 232
How do I code this?
Here is a semi-automated method using which:
myFunc <- function(df, year, quarter) {
  dropper <- paste(year, paste0("Q", (quarter-1)), sep="-")
  df[-(1:which(as.character(df$QTR)==dropper)),]
}
myFunc(df, 2014, 3)
QTR Revenue
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
To subset, you can just assign output
dfNew <- myFunc(df, 2014, 3)
At this point, you can pretty easily change the year and quarter to perform a new subset.
Thanks lmo
I was going through articles and I think we can use the dplyr package to do this in a much simpler way:
df <- df %>% slice((nrow(df)-7):(nrow(df)))
Get the below result
>df
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
.. .. ..
10 2016-Q2 232
This would act in a dynamic way too: once more rows are entered beyond 2016-Q2, the range of 8 rows to be selected is maintained by the nrow() function.
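Since the question mentions a variable that already stores the first quarter to keep, a direct filter on that variable may be more robust than counting rows from the end. A sketch, assuming the variable is called first_qtr (hypothetical name) and that QTR is stored as character or factor in the YYYY-Qn format, which compares correctly as text:
library(dplyr)

first_qtr <- "2014-Q3"  # update this value each quarter

# keep everything from first_qtr onwards; as.character() also covers factor QTR columns
df_new <- df %>%
  filter(as.character(QTR) >= first_qtr)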
