I have a data frame containing a number of projects, their start dates, and their coordinates (long/lat), and a second data frame containing a number of (fictional) respondents, the dates they were surveyed, and their coordinates:
respond_id<- c(1:5)
survey_year<- c(2007, 2005, 2008, 2004, 2005)
lat_1<- c(53.780928, 54.025200, 53.931432, 53.881048, 54.083359)
long_1<- c(9.614991, 9.349862, 9.473498, 10.685581, 10.026894)
project_id<- c(1111:1114)
year_start<- c(2007, 2007, 2006, 2008)
lat_2<- c(54.022881, 54.022881, 53.931753, 53.750523)
long_2<- c(9.381104, 9.381104, 9.505700, 9.666336)
survey<- data.frame(respond_id, survey_year, lat_1, long_1)
projects<- data.frame(project_id, year_start, lat_2, long_2)
Now, I want to create a new variable survey$projects_nearby that counts the number of projects located near each respondent (here: within 5 km). So the data frame survey should look somewhat like this (other results possible):
> survey
respond_id survey_year lat_1 long_1 projects_nearby
1 1 2007 53.780928 9.614991 0
2 2 2005 54.025200 9.349862 0
3 3 2008 53.931432 9.473498 1
4 4 2004 53.881048 10.685581 0
5 5 2005 54.083359 10.026894 0
Special attention needs to be paid to the start years of the projects and the years the surveys were conducted: if a respondent was surveyed in 2007 but the nearby project did not start until 2008, that project naturally does not count as a nearby project.
I thought of creating a distance matrix and then just counting the number of rows containing a distance smaller than 5km... but I don't know how to create this distance matrix. And maybe a for loop would be easier?
Could anyone help me or give me a hint, what would be the code for doing this?
EDIT: I edited the expected values of survey$projects_nearby. These values should now match the actual number of projects located near the respective respondents.
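I don't think the expected answer shown in the question is correct. Below, I left_join by year so that every row of survey is replicated for every project that started in the survey year. Then I filter to rows where the distance is below the threshold, count them, and join back to the original survey. Note that the filter compares a Euclidean distance in degrees against 5, so it is not a true 5 km cutoff, and the equality join ignores projects that started in earlier years.
The results are also slightly confusing because projects 1111 and 1112 started in the same year at the same location; this code counts them twice.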
>survey
respond_id survey_year lat_1 long_1
1 1 2007 53.78093 9.614991
2 2 2005 54.02520 9.349862
3 3 2008 53.93143 9.473498
4 4 2004 53.88105 10.685581
5 5 2005 54.08336 10.026894
> projects
project_id year_start lat_2 long_2
1 1111 2007 54.02288 9.381104
2 1112 2007 54.02288 9.381104
3 1113 2006 53.93175 9.505700
4 1114 2008 53.75052 9.666336
> left_join(survey, projects, by = c( "survey_year"="year_start")) %>%
+ dplyr::filter( sqrt((lat_1-lat_2)^2 + (long_1-long_2)^2 ) < 5) %>%
+ group_by(respond_id, survey_year, lat_1, long_1) %>%
+ summarise(projects_nearby = n()) %>%
+ right_join(survey)
Joining, by = c("respond_id", "survey_year", "lat_1", "long_1")
Source: local data frame [5 x 5]
Groups: respond_id, survey_year, lat_1 [?]
respond_id survey_year lat_1 long_1 projects_nearby
<int> <dbl> <dbl> <dbl> <int>
1 1 2007 53.78093 9.614991 2
2 2 2005 54.02520 9.349862 NA
3 3 2008 53.93143 9.473498 1
4 4 2004 53.88105 10.685581 NA
5 5 2005 54.08336 10.026894 NA
.. you can of course change NA to zero if appropriate...
You can use the sp package to find the distances, and then just count the number that are nearby. That is,
library(sp)
# spDists() expects longitude in the first column and latitude in the second
survey.loc <- as.matrix(survey[, c("long_1", "lat_1")])
project.loc <- as.matrix(projects[, c("long_2", "lat_2")])
distances <- spDists(survey.loc, project.loc, longlat = TRUE)
survey$project_nearby <- apply(distances, 1, function(x) sum(x<5))
I hope this helps!
EDIT:
My apologies for not considering the date.
library(sp)
# spDists() expects longitude in the first column and latitude in the second
survey.loc <- as.matrix(survey[, c("long_1", "lat_1")])
project.loc <- as.matrix(projects[, c("long_2", "lat_2")])
distances <- spDists(survey.loc, project.loc, longlat = TRUE)
# TRUE where the project started after the respondent was surveyed
too.late <- sapply(projects$year_start, function(x) survey$survey_year < x)
# push those pairs out of range so they are never counted
distances[too.late] <- Inf
survey$project_nearby <- apply(distances, 1, function(x) sum(x < 5))
I think you have to convert your lat/long coordinates to coordinates in a plane, or compute the haversine distance as described in this previous post:
https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula
Once you have distances to a particular location in the projects data frame, you may need to find similar points using knn or any other technique of your preference.
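As a rough illustration of the haversine route, here is a minimal base R sketch applied to the data from the question. The helper name haversine_km and the Earth radius of 6371 km are assumptions made for this example, not part of any package; a project counts only if it lies within 5 km and started no later than the survey year.

```r
# Hypothetical helper: great-circle distance in km via the haversine formula.
# radius_km = 6371 is an assumed mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Data from the question
survey <- data.frame(
  respond_id  = 1:5,
  survey_year = c(2007, 2005, 2008, 2004, 2005),
  lat_1  = c(53.780928, 54.025200, 53.931432, 53.881048, 54.083359),
  long_1 = c(9.614991, 9.349862, 9.473498, 10.685581, 10.026894)
)
projects <- data.frame(
  project_id = 1111:1114,
  year_start = c(2007, 2007, 2006, 2008),
  lat_2  = c(54.022881, 54.022881, 53.931753, 53.750523),
  long_2 = c(9.381104, 9.381104, 9.505700, 9.666336)
)

# For each respondent, count projects within 5 km that had already started
survey$projects_nearby <- sapply(seq_len(nrow(survey)), function(i) {
  d <- haversine_km(survey$lat_1[i], survey$long_1[i],
                    projects$lat_2, projects$long_2)
  sum(d < 5 & projects$year_start <= survey$survey_year[i])
})

print(survey)
```

With this data only respondent 3 ends up with a nearby project (project 1113, started in 2006); the projects close to respondents 1 and 2 started after those respondents were surveyed.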
I have a corpus of news articles with date and time of publication as 'docvars'.
readtext object consisting of 6 documents and 8 docvars.
# Description: df[,10] [6 × 10]
doc_id text year month day hour minute second title source
* <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <chr>
1 2014_01_01_10_51_00… "\"新华网伦敦1… 2014 1 1 10 51 0 docid报告称若不减… RMWenv
2 2014_01_01_11_06_00… "\"新华网北京1… 2014 1 1 11 6 0 docid盘点2013… RMWenv
3 2014_01_02_08_08_00… "\"原标题:报告… 2014 1 2 8 8 0 docid报告称若不减… RMWenv
4 2014_01_03_08_42_00… "\"地球可能毁灭… 2014 1 3 8 42 0 docid地球可能毁灭… RMWenv
5 2014_01_03_08_44_00… "\"北美鼠兔看起… 2014 1 3 8 44 0 docid北美鼠兔为应… RMWenv
6 2014_01_06_10_30_00… "\"欣克力C点核… 2014 1 6 10 30 0 docid英国欲建50… RMWenv
I would like to measure the changing relative frequency with which a particular term (e.g. 'development') occurs in these articles, either as a proportion of the total terms in the article or as a proportion of the total terms in all the articles published on a particular day or in a particular month. I know that I can count the number of times the term occurs in all the articles in a month using:
dfm(corp, select = "term", groups = "month")
and that I can get the relative frequency of the word to the total words in the document using:
dfm_weight(dfm, scheme = "prop")
But how do I combine these together to get the frequency of a specific term relative to the total number of words on a particular day or in a particular month?
What I would like to be able to do is measure the change in the amount of times a term is used over time, but accounting for the fact that the total number of words used is also changing. Thanks for any help!
@DaveArmstrong gives a good answer here and I upvoted it, but I can add a bit of efficiency using some of the newest quanteda syntax, which is a bit simpler.
The key here is preserving the date format created by zoo::as.yearmon(), since the dfm grouping coerces it to a character. So we pack it into a docvar, which is preserved by the grouping, and then retrieve it in the ggplot() call.
load(file("https://www.dropbox.com/s/kl2cnd63s32wsxs/music.rda?raw=1"))
library("quanteda")
## Package version: 2.1.1
## create corpus and dfm
corp <- corpus(m, text_field = "body_text")
corp$date <- m$first_publication_date %>%
zoo::as.yearmon()
D <- dfm(corp, remove = stopwords("english")) %>%
dfm_group(groups = "date") %>%
dfm_weight(scheme = "prop")
library("ggplot2")
convert(D[, "wonderfully"], to = "data.frame") %>%
ggplot(aes(x = D$date, y = wonderfully, group = 1)) +
geom_line() +
labs(x = "Date", y = "Wonderfully/Total # Words")
I suspect someone will come up with a better solution within quanteda, but in the event they don't, you could always extract the word from the dfm, put it in a dataset along with the date, and then make the graph. In the code below, I'm using some music reviews I scraped from the Guardian's website. I've commented out the functions that read in the data from an .rda file on Dropbox. You're welcome to use it if you like; it's clean, but I don't want anyone to inadvertently download a file from the web that they're not aware of.
library(quanteda)
library(dplyr)
library(ggplot2)
# f <- file("https://www.dropbox.com/s/kl2cnd63s32wsxs/music.rda?raw=1")
# load(f)
## create corpus and dfm
corp <- corpus(as.character(m$body_text))
docvars(corp, "date") <- m$first_publication_date
D <- dfm(corp, remove=stopwords("english"))
## take word frequencies "wonderfully" in the dfm
## along with the date
tmp <- tibble(
word = as.matrix(D)[,"wonderfully"],
date = docvars(corp)$date,
## calculate the total number of words in each document
total = rowSums(D)
)
tmp <- tmp %>%
## turn date into year-month
mutate(yearmon =zoo::as.yearmon(date)) %>%
## group by year-month
group_by(yearmon) %>%
## calculate the sum of the instances of "wonderfully"
## divided by the sum of the total words across all
## documents in the month
summarise(prop = sum(word)/sum(total))
## make a plot.
ggplot(tmp, aes(x=yearmon, y=prop)) +
geom_line() +
labs(x= "Date", y="Wonderfully/Total # Words")
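To make the arithmetic behind this concrete, here is a small base R sketch with made-up per-document counts (the numbers and the column names month, term and total are assumptions for illustration only). It pools the term counts and the total token counts within each month before dividing, which is not the same as averaging the per-document proportions.

```r
# Toy per-document counts (made up for illustration)
counts <- data.frame(
  month = c(1, 1, 2, 2),
  term  = c(2, 1, 0, 3),        # occurrences of the term in each document
  total = c(100, 50, 80, 120)   # total tokens in each document
)

# Pool within month: sum(term) / sum(total)
monthly <- aggregate(cbind(term, total) ~ month, data = counts, FUN = sum)
monthly$prop <- monthly$term / monthly$total

# For comparison: the mean of per-document proportions weights short and
# long documents equally, so it can differ from the pooled ratio
monthly$mean_doc_prop <- tapply(counts$term / counts$total, counts$month, mean)

print(monthly)
```

In this toy data the two measures agree for month 1 but differ for month 2, where the documents have different lengths and different per-document rates.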
I have a tbl_df in which I'm trying to create new columns based on a variety of filters. I did some reading on using ifelse and other mutate functions but wasn't able to figure this one out on my own. The data frame is named test and is shown below.
name team salary season position
<chr> <int> <int> <int> <chr>
AP 6 15 2017 OF
AN 11 8 2016 SP
AS 1 8 2014 SP
AR 3 11 2018 SS
AB 2 10 2015 3B
AC 8 7 2017 RP
Abe 11 10 2016 OF
AM 7 12 2014 RP
Ari 11 48 2018 1B
BH 13 29 2015 OF
I'm trying to create a variety of columns based on the results of specific filters. The code I have so far is as follows:
summary <- test %>%
group_by(team, season)
The mutate columns I'm trying to create are:
Hitting: Sum all values from Salary for rows with a position that does not equal SP or RP, and seasons 2016-2018
Pitching: Sum all values from Salary for rows with a position that equals SP or RP, and seasons 2016-2018
Relievers: Count all rows with a position equal to RP
Over_40: Count all rows with a salary over 40
Over_40_H: Count all rows with a salary over 40 and position not equal to SP or RP
I want all of these columns and results to be grouped by team, and season (as shown above)
You may have struggled to work this out because it looks like you're trying to summarise your data rather than mutate it. As such, you can use ifelse() within summarise() – rather than mutate() – to do this.
library(tidyverse)
test %>%
group_by(team, season) %>%
summarise(
players = n(),
hitting = sum(ifelse(!position %in% c("SP", "RP"), salary, 0)),
pitching = sum(ifelse(position %in% c("SP", "RP"), salary, 0)),
relievers = sum(ifelse(position == "RP", 1, 0)),
over_40 = sum(ifelse(salary > 40, 1, 0)),
over_40_h = sum(ifelse(salary > 40 & !position %in% c("SP", "RP"), 1, 0))
) %>%
mutate(
hitting = ifelse(season < 2016 | season > 2018, NA, hitting),
pitching = ifelse(season < 2016 | season > 2018, NA, pitching)
) %>%
arrange(team, season)
Note that:
I've added a column showing the number of players because it provides a basic check that the grouping has worked.
The mutate() command is included to remove the sum of pitcher and hitter salaries for years that are not between 2016 and 2018, as you specified in the question. However, since you've grouped the data by season anyway it may be equally easy to simply ignore the salaries for years you're not interested in.
I've arranged the data by team and season at the end, but this is mainly to make the results more readable.
Since you're using dplyr, I've changed the column names to follow the tidyverse style guide.
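For readers without the tidyverse installed, the same conditional sums can be sketched in base R with aggregate(). This is only an illustration on the ten rows shown in the question; the helper column pitcher is an assumption introduced for the example, and the step that blanks out seasons outside 2016-2018 is left out for brevity.

```r
# The ten rows shown in the question
test <- data.frame(
  name     = c("AP", "AN", "AS", "AR", "AB", "AC", "Abe", "AM", "Ari", "BH"),
  team     = c(6, 11, 1, 3, 2, 8, 11, 7, 11, 13),
  salary   = c(15, 8, 8, 11, 10, 7, 10, 12, 48, 29),
  season   = c(2017, 2016, 2014, 2018, 2015, 2017, 2016, 2014, 2018, 2015),
  position = c("OF", "SP", "SP", "SS", "3B", "RP", "OF", "RP", "1B", "OF")
)

# Helper flag (assumed name): TRUE for starting and relief pitchers
test$pitcher <- test$position %in% c("SP", "RP")

# Per-row contributions; summing them within a group gives the totals
test$hitting   <- ifelse(!test$pitcher, test$salary, 0)
test$pitching  <- ifelse(test$pitcher, test$salary, 0)
test$relievers <- as.integer(test$position == "RP")
test$over_40   <- as.integer(test$salary > 40)

res <- aggregate(cbind(hitting, pitching, relievers, over_40) ~ team + season,
                 data = test, FUN = sum)
print(res)
```

Each condition becomes a 0/1 (or 0/salary) column, so a plain sum per group yields both the counts and the salary totals.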
I have a column that specifies the type of sanctions used in my data. This is what it looks like:
country sanction_type
(chr) (int)
1 China 2
2 Austria 5
3 South Africa 1
4 Poland 6
5 Poland 7
6 Bolivia, Plurinational State of 2
The types of sanctions range from 1 to 10. How can I create two extra columns, one including sanction types 1, 2, 3, 4 and the other 5, 6, 7, 8, 9, 10? I would also like to keep the existing one with all sanction types. Many thanks!
The dataset has more than 6 observations, this is just a sample of the data. Sorry for the confusion.
Let your data frame be dat,
dat$less4 <- as.integer(dat$sanction_type <= 4L)
dat$great5 <- 1L - dat$less4
I saw that your sanction_type column has integer type, so I am doing integer operation all the time, to get integer result.
Using dplyr package:
country <- c("China","Austria","South Africa","Poland", "Poland", "Bolivia")
sanction_type <- c(2,5,1,6,7,2)
df <- data.frame(country, sanction_type)
library(dplyr)
df <- mutate(df, srange1 = ifelse(sanction_type <= 4, 1, 0),
srange2 = ifelse(sanction_type >= 5, 1, 0))
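Both answers read the question as asking for 0/1 indicator columns. If the intent was instead to keep the sanction code itself in whichever column applies, a base R sketch might look like the following; the column names type_1_4 and type_5_10 are made up for this example.

```r
# Sample rows from the question
dat <- data.frame(
  country = c("China", "Austria", "South Africa", "Poland", "Poland",
              "Bolivia, Plurinational State of"),
  sanction_type = c(2L, 5L, 1L, 6L, 7L, 2L)
)

# Keep the code where it falls in the range, NA otherwise
dat$type_1_4  <- ifelse(dat$sanction_type <= 4L, dat$sanction_type, NA_integer_)
dat$type_5_10 <- ifelse(dat$sanction_type >= 5L, dat$sanction_type, NA_integer_)

print(dat)
```

The original sanction_type column is left untouched, so all three views of the data are available side by side.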
I have data on college course completions, with estimated numbers of students from each cohort completing after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students outputting from each College and Course in any year.
The output of students in a given year will be the sum of the previous 7 cohorts outputting after 1, 2, 3, ... 7 years.
For example, the number of students outputting in 2014 from COLLEGE 1, COURSE A is equal to the sum of:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years
So there are two dataframes: a lookup table that contains all the output estimates, and a smaller summary table that I'm trying to modify. I want to update dummy.summary$output with, for each row, the total output based on the above calculation.
The following code will replicate my data pretty well
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
The following code does not work, but shows the approach I've been attempting.
dummy.summary$output <- sapply(dummy.summary$output, function(x){
# empty vector to fill with output values
vec <- c()
# Find relevant output for college + course, from each cohort and exit year
for(j in 1:7){
append(x = vec,
values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
dummy.lookup$course==dummy.summary[x, "course"] &
dummy.lookup$cohort==dummy.summary[x, "year"]-j &
dummy.lookup$output.year==j, "output"])
}
# Sum and return total output
sum_vec <- sum(vec)
return(sum_vec)
}
)
I guess it doesn't work because I was hoping to use 'x' in the anonymous function to index particular values of the dummy.summary dataframe. But that clearly isn't happening and is only returning zero for each row, presumably because the starting value of 'x' is zero each time. I don't know if it is possible to access the index position of each value that sapply loops over, and use that to index my summary dataframe.
Is this approach fixable or do I need a completely different approach?
Even if it is fixable, is there a more elegant/faster way to achieve what I'm trying to do?
Thanks in anticipation.
I've replaced your output.year with output.year2, which holds the calendar year of exit (cohort + output.year) instead of a value from 1 to 7.
I've realised that the output information you want corresponds to the output.year, but the intake information you want corresponds to the cohort. So, I calculate them separately and then I join tables/information. This automatically creates empty (NA that I transform to 0) output info for 1998.
# fix your random sampling
set.seed(24)
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
library(dplyr)
# create result table for output info
dt_output =
dummy.lookup %>%
mutate(output.year2 = output.year+cohort) %>% # update output.year to get a year value
group_by(output.year2, college, course) %>% # for each output year, college, course
summarise(SumOutput = sum(output)) %>% # calculate sum of output
ungroup() %>%
arrange(college,course,output.year2) %>% # for visualisation purposes
rename(cohort = output.year2) # rename column
# create result for intake info
dt_intake =
dummy.lookup %>%
select(cohort, college, course, intake) %>% # select useful columns
distinct() # keep distinct rows/values
# join info
dt_intake %>%
full_join(dt_output, by=c("cohort","college","course")) %>%
mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
arrange(college,course,cohort) %>% # for visualisation purposes
tbl_df() # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...
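The same shift-and-sum idea can be sketched in base R on a tiny deterministic lookup table, which makes the totals easy to check by hand. The data and the column name exit.year are made up for this illustration. The second half shows that the loop-style approach from the question does work once sapply() iterates over row indices rather than over the values of the output column.

```r
# Tiny made-up lookup: three cohorts, each exiting after 1 and 2 years
lookup <- data.frame(
  cohort      = rep(2000:2002, each = 2),
  college     = "College 1",
  course      = "Course A",
  output.year = rep(1:2, times = 3),
  output      = c(10, 20, 30, 40, 50, 60)
)

# Calendar year in which each group of students exits
lookup$exit.year <- lookup$cohort + lookup$output.year

# Total output per exit year, college and course
totals <- aggregate(output ~ exit.year + college + course,
                    data = lookup, FUN = sum)
print(totals)

# Same totals by iterating over row indices of the summary table
by_index <- sapply(seq_len(nrow(totals)), function(i) {
  sum(lookup$output[lookup$college == totals$college[i] &
                    lookup$course == totals$course[i] &
                    lookup$cohort + lookup$output.year == totals$exit.year[i]])
})
print(by_index)
```

The grouped version does one pass over the lookup table, whereas the index-based version rescans it for every summary row, so the former should scale better on the full data.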
I have data(e - 32 obs. of 3 variables) that contains the following columns
Month Years Seats
10 2011 4477
11 2011 12210
12 2011 12617
1 2012 12617
...and so on, up to
5 2014 25234
Another data (f - 101 obs. of 3 variables) that contains
Month Years Seats
1 2006 27787
up to
5 2014 29017
My purpose is to divide the number of seats in e by the number of seats in f, if the year and month for both e and f are the same. My effective result would be getting a table that displays the result of division in percentage
Month Years Change in Seats
10 2011 14.72%
11 2011 42.28%
I tried taking -
a subset of "f" and then compare with "e" to perform division, but failed at doing so
a merge of (e,f) and then perform division
running a for loop, but didn't help
g<-{
for(i in 2006:2014)
{
for (j in 1:12)
{
if(i==e[,2] && i==f[,2] && j==e[,1] && j==f[,1])
{
(e[,3]/f[,3])
}
else
{
'NA'
}
}
}
}
g
Any help on this would be highly appreciated. Just begun working in R a couple of days ago. Please let me know if you would like any further information to attempt this question.
I think merge will be your best bet.
df1 <- data.frame(month = 1:12, year = rep(2011,12), seats = round(runif(12,10000,20000)))
df2 <- data.frame(month = 2:10, year = rep(2011,9), seats = round(runif(9,10000,20000)))
df3 <- merge(df1, df2, by=c("month", "year"))
df3$change <- df3$seats.x/df3$seats.y
If you need to display the change as a percent rather than a decimal, check How to format a number as percentage in R?
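As a sketch of the whole pipeline with the column names from the question: the two rows of e below come from the question, while the rows of f are made-up values for illustration, so the resulting percentages will not match the ones quoted there. The suffixes argument of merge() keeps the two Seats columns apart, and sprintf() formats the ratio as a percentage string.

```r
# e: first two rows from the question
e <- data.frame(Month = c(10, 11), Years = c(2011, 2011),
                Seats = c(4477, 12210))

# f: made-up values for illustration only
f <- data.frame(Month = c(10, 11, 12), Years = c(2011, 2011, 2011),
                Seats = c(30000, 30000, 30000))

# Inner join on month and year; rows of f without a match in e are dropped
g <- merge(e, f, by = c("Month", "Years"), suffixes = c(".e", ".f"))

# Ratio of seats in e to seats in f, formatted as a percentage
g$Change <- sprintf("%.2f%%", 100 * g$Seats.e / g$Seats.f)

print(g[, c("Month", "Years", "Change")])
```

Because merge() performs an inner join by default, months that appear in only one of the two tables simply drop out; all = TRUE would keep them with NA instead.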