Conditional mathematical optimization in R

I have the following data frames:
Required <- data.table(
  Country = c("AT Iron", "AT Energy", "BE Iron", "BE Energy", "BG Iron", "BG Energy"),
  Prod1 = c(5, 10, 0, 5, 0, 5),
  Prod2 = c(25, 5, 10, 0, 0, 5))
Supplied <- data.table(
  Country = c("AT Iron", "AT Energy", "BE Iron", "BE Energy", "BG Iron", "BG Energy"),
  Prod1 = c(10, 5, 5, 10, 5, 10),
  Prod2 = c(20, 20, 20, 0, 15, 10))
> Required
     Country Prod1 Prod2
1:   AT Iron     5    25
2: AT Energy    10     5
3:   BE Iron     0    10
4: BE Energy     5     0
5:   BG Iron     0     0
6: BG Energy     5     5
> Supplied
     Country Prod1 Prod2
1:   AT Iron    10    20
2: AT Energy     5    20
3:   BE Iron     5    20
4: BE Energy    10     0
5:   BG Iron     5    15
6: BG Energy    10    10
"Required" shows the initial material and energy requirements to manufacture two products; the materials and energy are supplied by three different countries. For example, product 1 requires 10 units of energy from AT, 5 units from BE and 5 units from BG. "Supplied" shows each country's actual supply capacity. Following the example, AT cannot supply 10 units of energy, only 5, so another country must supply the remaining units. I assume that the country with the most net supply capacity (that is, capacity left over once its own initial requirement is discounted) will provide the remaining units. In this case, BE and BG both have 5 units of net supply capacity, so each will provide an equal share, 2.5 units.
I am looking for an optimization algorithm that creates a new table, "RequiredNew", from "Required", taking the supply constraints and the assumption above into account. The resulting table should look like:
> RequiredNew
     Country Prod1 Prod2
1:   AT Iron   5.0    20
2: AT Energy  10.0     5
3:   BE Iron   0.0    10
4: BE Energy   7.5     0
5:   BG Iron   0.0     5
6: BG Energy   7.5     5
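A sketch of one way to encode that rule in base R (plain data frames for brevity; the redistribution logic is my reading of the question: each requirement is capped at the country's own supply, and the shortfall goes to the country or countries of the same material with the largest net capacity, split equally on ties). Note that under this rule AT Energy's own Prod1 entry would also be capped at 5:

```r
# Plain-data.frame version of the question's tables
Required <- data.frame(
  Country = c("AT Iron", "AT Energy", "BE Iron", "BE Energy", "BG Iron", "BG Energy"),
  Prod1 = c(5, 10, 0, 5, 0, 5),
  Prod2 = c(25, 5, 10, 0, 0, 5))
Supplied <- data.frame(
  Country = Required$Country,
  Prod1 = c(10, 5, 5, 10, 5, 10),
  Prod2 = c(20, 20, 20, 0, 15, 10))

redistribute <- function(req, sup, country, material) {
  idx <- grep(material, country)          # rows of the same material
  for (i in idx) {
    shortfall <- req[i] - sup[i]
    if (shortfall > 0) {
      req[i] <- sup[i]                    # cap at own supply capacity
      others <- setdiff(idx, i)
      net <- sup[others] - req[others]    # net capacity of the other countries
      best <- others[net == max(net)]     # largest net capacity; ties share
      req[best] <- req[best] + shortfall / length(best)
    }
  }
  req
}

RequiredNew <- Required
for (p in c("Prod1", "Prod2")) {
  for (m in c("Iron", "Energy")) {
    RequiredNew[[p]] <- redistribute(RequiredNew[[p]], Supplied[[p]],
                                     Required$Country, m)
  }
}
```

For the example data this reproduces the BE/BG Energy split of 7.5 each and moves the 5 surplus Iron units for product 2 to BG.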
In the link below I posted a similar question, which user digEmAll solved, so a similar approach would be suitable. However, I have rephrased the question here so that it is clearer and more closely resembles my actual data.
Mathematical optimization in R
I apologise for the multiple posts. Thank you in advance.

Related

Using spacyr for named entity recognition - inconsistent results

I plan to use the spacyr R library to perform named entity recognition across several news articles (spacyr is an R wrapper for the Python spaCy package). My goal is to identify partners for network analysis automatically. However, spacyr is not recognising common entities as expected. Here is sample code to illustrate my issue:
library(quanteda)
library(spacyr)
text <- data.frame(
  doc_id = 1:5,
  sentence = c(
    "Brightmark LLC, the global waste solutions provider, and Florida Keys National Marine Sanctuary (FKNMS), today announced a new plastic recycling partnership that will reduce landfill waste and amplify concerns about ocean plastics.",
    "Brightmark is launching a nationwide site search for U.S. locations suitable for its next set of advanced recycling facilities, which will convert hundreds of thousands of tons of post-consumer plastics into new products, including fuels, wax, and other products.",
    "Brightmark will be constructing the facility in partnership with the NSW government, as part of its commitment to drive economic growth and prosperity in regional NSW.",
    "Macon-Bibb County, the Macon-Bibb County Industrial Authority, and Brightmark have mutually agreed to end discussions around building a plastic recycling plant in Macon",
    "Global petrochemical company SK Global Chemical and waste solutions provider Brightmark have signed a memorandum of understanding to create a partnership that aims to take the lead in the circular economy of plastic by construction of a commercial scale plastics renewal plant in South Korea"))
corpus <- corpus(text, text_field = "sentence")
spacy_initialize(model = "en_core_web_sm")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)
I expect the company "Brightmark" to be recognised in all 5 sentences. However, this is what I get:
entity
  doc_id sentence_id                                 entity entity_type
1      1           1 Florida_Keys_National_Marine_Sanctuary         ORG
2      1           1                                  FKNMS         ORG
3      2           1                                   U.S.         GPE
4      3           1                                    NSW         ORG
5      4           1                    Macon_-_Bibb_County         ORG
6      4           1                             Brightmark         ORG
7      4           1                                  Macon         GPE
8      5           1                     SK_Global_Chemical         ORG
9      5           1                            South_Korea         GPE
"Brightmark" appears as an ORG entity only in the 4th sentence (doc_id refers to the sentence number); it should show up in all of them. "NSW Government" does not appear at all.
I am still figuring out spaCy and spacyr. Perhaps someone can advise me why this is happening and what steps I should take to remedy it. Thanks in advance.
I changed the model and achieved better results:
spacy_initialize(model = "en_core_web_trf")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)
entity
   doc_id sentence_id                                       entity entity_type
1       1           1                               Brightmark_LLC         ORG
2       1           1                                 Florida_Keys         GPE
3       1           1                                        FKNMS         ORG
4       2           1                                   Brightmark         ORG
5       2           1                                         U.S.         GPE
6       3           1                                   Brightmark         ORG
7       3           1                                          NSW         GPE
8       3           1                                          NSW         GPE
9       4           1                          Macon_-_Bibb_County         GPE
10      4           1 the_Macon_-_Bibb_County_Industrial_Authority         ORG
11      4           1                                   Brightmark         ORG
12      4           1                                        Macon         GPE
13      5           1                           SK_Global_Chemical         ORG
14      5           1                                   Brightmark         ORG
15      5           1                                  South_Korea         GPE
The only downside is that "NSW Government" and "Florida Keys National Marine Sanctuary" are still not resolved as single entities. I also get this warning: UserWarning: User provided device_type of 'cuda', but CUDA is not available.
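Since the stated goal is network analysis of partners, one way to use the cleaned-up entity table is to turn it into a co-occurrence edge list, pairing entities mentioned in the same document. This is a base-R sketch over a small hypothetical extract; the column names follow entity_extract()'s output:

```r
# Hypothetical extract: one row per entity mention, as returned by entity_extract()
entity <- data.frame(
  doc_id = c("1", "1", "4", "4", "5", "5"),
  entity = c("Brightmark_LLC", "FKNMS", "Brightmark", "Macon",
             "SK_Global_Chemical", "Brightmark"),
  entity_type = c("ORG", "ORG", "ORG", "GPE", "ORG", "ORG"))

# Pair up every two distinct entities mentioned in the same document
edges <- do.call(rbind, lapply(split(entity$entity, entity$doc_id), function(e) {
  e <- unique(e)
  if (length(e) < 2) return(NULL)
  t(combn(sort(e), 2))   # all unordered pairs within the document
}))
edges <- as.data.frame(edges)
names(edges) <- c("from", "to")
```

The resulting from/to pairs can be fed directly into igraph or similar packages for the network step.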

Tabulate number of attacks within time and distance range

This website has helped me with so much over the years, but I can't seem to figure this part out. I am working on modeling terrorist attacks in Afghanistan and want to create a new variable to reflect the clustering of attacks. For each attack I want to calculate the number of attacks that fall into two range criteria, distance and time.
head(timedist_terr_talib, 15)
    eventid      lat      lon     event1 Cluster_Num
1  20110104 32.07333 64.83389 2011-01-04          NA
2  20110107 31.00806 66.39806 2011-01-07          NA
3  20110112 34.53306 69.16611 2011-01-12          NA
4  20110112 34.87417 71.15278 2011-01-12          NA
5  20110114 31.65003 65.65002 2011-01-14           1
6  20110115 33.42977 66.21314 2011-01-15           0
7  20110116 35.95000 68.70000 2011-01-16           0
8  20110119 32.68556 68.23778 2011-01-19           0
9  20110119 34.08056 68.51917 2011-01-19           1
10 20110123 34.89000 71.18000 2011-01-23
11 20110128 34.53306 69.16611 2011-01-28
12 20110129 31.61767 65.67594 2011-01-29
13 20110131 35.03924 69.00633 2011-01-31
14 20110201 31.61767 65.67594 2011-02-01
15 20110207 31.48623 64.32139 2011-02-07
I want to create a new column whose values are the number of attacks that happened within the last 14 days and 100 km of that attack.
event1 <- strptime(timedist_terr_talib$eventid, format = "%Y%m%d", tz = "UTC")
I found code that makes a matrix with the distance between each point:
http://eurekastatistics.com/calculating-a-distance-matrix-for-geographic-points-using-r/
# distance in metres; divide by 1000 to get km
# dis_talib_mat <- round(GeoDistanceInMetresMatrix(timedist_terr_talib) / 1000)
dis_talib_mat1 <- GeoDistanceInMetresMatrix(timedist_terr_talib) / 1000
And I have a matrix that calculates the time difference between every pair:
timediff_talib1 <- t(outer(timedist_terr_talib$event1,
                           timedist_terr_talib$event1, difftime))
timediff_talib1 <- timediff_talib1 / (60 * 60 * 24)  # seconds to days
So, for example, attacks 1:4 are NA because the data does not yet cover a complete 14 days. For attack 5, I look back at attacks 1:4 because they happened within 14 days; the distance matrix shows that one of those attacks was within 100 km, so I manually count 1 attack that is under 100 km away.
My current data set is 2813 attacks, so everything runs slowly, but if I could get code that works for these 15 rows and apply it to my full set, I would be so happy!
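Given a pairwise distance matrix and a pairwise day-difference matrix like the ones above, the count described can be sketched as a row-wise sum. This is a self-contained base-R sketch on toy data; the haversine helper stands in for GeoDistanceInMetresMatrix, and the coordinates and dates are made up:

```r
# Toy data: date and coordinates per attack
attacks <- data.frame(
  event1 = as.Date(c("2011-01-04", "2011-01-07", "2011-01-14", "2011-01-25")),
  lat = c(32.07, 31.01, 31.65, 32.00),
  lon = c(64.83, 67.00, 65.65, 64.90))

# Great-circle distance in km (haversine formula)
haversine_km <- function(lat1, lon1, lat2, lon2) {
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  6371 * 2 * asin(sqrt(a))
}

n <- nrow(attacks)
dist_km <- outer(seq_len(n), seq_len(n), function(i, j)
  haversine_km(attacks$lat[i], attacks$lon[i], attacks$lat[j], attacks$lon[j]))
days_apart <- outer(attacks$event1, attacks$event1, function(a, b)
  as.numeric(a - b, units = "days"))

# For each attack: number of earlier attacks within the last 14 days and 100 km
attacks$Cluster_Num <- sapply(seq_len(n), function(i)
  sum(days_apart[i, ] > 0 & days_apart[i, ] <= 14 & dist_km[i, ] <= 100))
```

With the two matrices precomputed, the final sapply() is a single vectorized pass per row, so 2813 attacks should be fast; the matrix construction itself is the O(n^2) part.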

Make a percentage depending on DF

I have a training set here and I need some help with it.
This is the df:
              Jobs       Agency      Location       Date RXH   HS  TMM Payed
14 Netapp Gitex F1 Events House          DWTC 2015-10-19 100  8.0  800  TRUE
5     RWC Heineken    Lightblue           EGC 2015-10-09  90  4.0  360 FALSE
45        Rugby 7s          CEO Seven Stadium 2015-12-04 100 10.0 1000 FALSE
29     Playstation    Lightblue    Mirdiff CC 2015-11-11  90  7.0  630 FALSE
24    RWC Heineken    Lightblue           EGC 2015-10-31  90  4.5  405 FALSE
33     Playstation    Lightblue    Mirdiff CC 2015-11-15  90 10.0  900 FALSE
46        Rugby 7s          CEO Seven Stadium 2015-12-05 100 10.0 1000 FALSE
44        Rugby 7s          CEO Seven Stadium 2015-12-03 100 10.0 1000 FALSE
For example, if the total number of rows is 10 and I worked for the "CEO" agency 3 times, I want CEO to show a 30% value for that month, if that makes sense. In other words, based on the number of observations, I want to know what percentage of my work went to each agency.
This is just a demo df to show what I am talking about.
Thanks
If I understand correctly, you want to summarize by Agency and by month. Here's how to do it with dplyr:
library(dplyr)
table1 %>%
  mutate(Month = format(Date, "%m-%Y")) %>%
  group_by(Month, Agency) %>%
  summarise(Total = n()) %>%
  mutate(Pct = round(Total / sum(Total) * 100))
Source: local data frame [4 x 4]
Groups: Month [3]

    Month       Agency Total   Pct
    (chr)        (chr) (int) (dbl)
1 10-2015 Events House     1    33
2 10-2015    Lightblue     2    67
3 11-2015    Lightblue     2   100
4 12-2015          CEO     3   100
This is just a simple approach, and I suspect you might be looking for more. However, here's some code that answers your sample question:
length(df$Agency[df$Agency == "CEO"]) / length(df$Agency)
The first length() call counts how many cells in df$Agency equal "CEO"; the second counts the total number of cells in that column. Dividing one by the other gives you the proportion.
This gets more complicated if you want to do it automatically for every agency in the column, but those are the basics.
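To extend that to every agency at once, base R's table() and prop.table() do it in one line. A sketch on a made-up Agency vector (10 rows, 3 of them "CEO", matching the 30% example in the question):

```r
# Hypothetical Agency column matching the proportions in the question
agency <- c("CEO", "CEO", "CEO", "Lightblue", "Lightblue",
            "Events House", "Events House", "Events House",
            "Events House", "Events House")

# Share of rows per agency, as a rounded percentage
pct <- round(prop.table(table(agency)) * 100)
pct
```

To get this per month as well, the same table() call can be applied within each month group, which is essentially what the dplyr answer above does.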

Create list of elements which match a value

I have a table of values with the name, zipcode and opening date of recreational pot shops in WA state.
                       name   zip    opening
1             The Stash Box 98002 2014-11-21
3                 Greenside 98198 2015-01-01
4                Bud Nation 98106 2015-06-29
5 West Seattle Cannabis Co. 98168 2015-02-28
6               Nimbin Farm 98168 2015-04-25
...
...
I'm analyzing this data to see if there are any correlations between drug usage and location and opening of recreational stores. For one of the visualizations I'm doing, I am organizing the data by number of shops per zipcode using the group_by() and summarize() functions in dplyr.
    zip count
  (int) (int)
1 98002     1
2 98106     1
3 98168     2
4 98198     1
...
This data is then plotted onto a leaflet map, with the radius of each circle representing the relative number of shops in that zipcode.
I would like to reorganize the name variable into a third column so that this can popup in my visualization when scrolling over each circle. Ideally, the data would look something like this:
    zip count                                   name
  (int) (int)                            (character)
1 98002     1                          The Stash Box
2 98106     1                             Bud Nation
3 98168     2 Nimbin Farm, West Seattle Cannabis Co.
4 98198     1                              Greenside
...
Where all shops in the same zipcode appear together in the third column. I've tried various for loops and if statements, but I'm sure there is a better way to do this and my R skills just aren't there yet. Any help would be appreciated.
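One way to sketch the grouped summary described above in base R is with aggregate(), counting rows per zip and collapsing the names with paste(collapse = ", "); the equivalent dplyr pipeline would use group_by(zip) %>% summarize(count = n(), name = paste(sort(name), collapse = ", ")). Here shops is a hypothetical stand-in for the original table:

```r
# Hypothetical stand-in for the original shop table
shops <- data.frame(
  name = c("The Stash Box", "Greenside", "Bud Nation",
           "West Seattle Cannabis Co.", "Nimbin Farm"),
  zip = c(98002L, 98198L, 98106L, 98168L, 98168L),
  opening = as.Date(c("2014-11-21", "2015-01-01", "2015-06-29",
                      "2015-02-28", "2015-04-25")))

# Count shops per zip...
counts <- aggregate(name ~ zip, data = shops, FUN = length)
names(counts)[2] <- "count"

# ...and collapse the shop names into one comma-separated string per zip
labels <- aggregate(name ~ zip, data = shops,
                    FUN = function(x) paste(sort(x), collapse = ", "))

by_zip <- merge(counts, labels, by = "zip")
```

The name column of by_zip can then be passed straight to leaflet's popup argument.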

R - Bootstrap by several column criteria

What I have is data on cod weights at different ages, taken at several locations over time.
What I would like to compute is "weight at age": a mean weight at each age, for each location in each year.
However, the ages are not sampled the same way (all old fish caught are measured, while younger fish are sub-sampled), so I can't just take a plain average; I would like to bootstrap the samples.
The bootstrap should draw 5 random weight values at an age (with replacement, so values can be reused), compute their mean, repeat this 1,000 times, and then average the means. This should be done for each Age at every AreaCode in every Year, i.e. the grouping factors are Year-location-Age.
So here's an example of what my data could look like.
df <- data.frame(
  Year = rep(2000:2008, 2),
  AreaCode = c("39G4", "38G5", "40G5"),
  Age = 0:8,
  IndWgt = rnorm(18, mean = 5, sd = 3))
> df
   Year AreaCode Age       IndWgt
1  2000     39G4   0  7.317489899
2  2001     38G5   1  7.846606144
3  2002     40G5   2  0.009212455
4  2003     39G4   3  6.498688035
5  2004     38G5   4  3.121134937
6  2005     40G5   5 11.283096043
7  2006     39G4   6  0.258404136
8  2007     38G5   7  6.689780137
9  2008     40G5   8 10.180511929
10 2000     39G4   0  5.972879108
11 2001     38G5   1  1.872273650
12 2002     40G5   2  5.552962065
13 2003     39G4   3  4.897882549
14 2004     38G5   4  5.649438631
15 2005     40G5   5  4.525012587
16 2006     39G4   6  2.985615831
17 2007     38G5   7  8.042884181
18 2008     40G5   8  5.847629941
AreaCode contains the locations; in reality I have 85 different levels. The time series stretches from 1991 to 2013 and the ages from 0 to 15. IndWgt contains the weight. My whole data frame has 185726 rows.
Also, not every age exists for every location and every year. I don't know whether that is a problem, as long as the script doesn't rely on references to specific row numbers. There are some NA values in the weight column, but I can remove them beforehand.
I was thinking that I should maybe use replicate together with apply or another plyr function. I've tried to understand the boot function, but I don't really know whether my arguments belong under statistic, and in that case how to write them. So yeah, basically I have no idea.
I would be thankful for any help I can get!
How about this with plyr? I think from the question you wanted to bootstrap only the "young" fish weights and use actual means for the older ones. If not, just keep the bootstrap branch for every age.
require(plyr)
# cod <- read.csv("cod.csv", header = TRUE)  # I loaded your data from csv
bootstrap <- function(Age, IndWgt) {
  if (Age[1] > 2) {
    # old fish: plain mean
    mean(IndWgt)
  } else {
    # young fish: average of 1000 means of 5 resampled weights
    mean(replicate(1000, mean(sample(IndWgt, 5, replace = TRUE))))
  }
}
ddply(cod, .(Year, AreaCode, Age), summarize, boot_mean = bootstrap(Age, IndWgt))
  Year AreaCode Age boot_mean
1 2000     39G4   0  6.650294
2 2001     38G5   1  4.863024
3 2002     40G5   2  2.724541
4 2003     39G4   3  5.698285
5 2004     38G5   4  4.385287
6 2005     40G5   5  7.904054
7 2006     39G4   6  1.622010
8 2007     38G5   7  7.366332
9 2008     40G5   8  8.014071
PS: If you want to sample all ages the same way, there is no need for the function, just:
ddply(cod, .(Year, AreaCode, Age),
      summarize,
      boot_mean = mean(replicate(1000, mean(sample(IndWgt, 5, replace = TRUE)))))
Since you don't provide enough code, it's hard for me to test this properly, but the following should get you the first step. If you wrap it in replicate, you get the end result, which you can then average.
part.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = data, FUN = function(x) {
  mean(sample(x, size = 5, replace = TRUE))
})
To handle any missing combination of year/age/location, you could add an if statement checking for NULL/NA and producing a warning and/or skipping the iteration.
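The replicate wrapping mentioned above could be sketched like this, in base R only, with a tiny made-up data frame standing in for the cod data (column names follow the question):

```r
set.seed(1)
# Made-up stand-in for the cod data
cod <- data.frame(
  Year = rep(2000, 8),
  AreaCode = rep(c("39G4", "38G5"), each = 4),
  Age = rep(c(0, 0, 1, 1), 2),
  IndWgt = c(5.1, 4.9, 6.2, 6.0, 3.8, 4.2, 7.1, 6.9))

# One bootstrap replicate: a resampled mean of 5 weights per Year/AreaCode/Age group
one_rep <- function() {
  aggregate(IndWgt ~ Year + AreaCode + Age, data = cod,
            FUN = function(x) mean(sample(x, size = 5, replace = TRUE)))$IndWgt
}

# 1000 replicates, then average across replicates for each group
boot <- replicate(1000, one_rep())
groups <- aggregate(IndWgt ~ Year + AreaCode + Age, data = cod, FUN = mean)
groups$boot_mean <- rowMeans(boot)
```

The group order from aggregate() is the same in every replicate because the same formula is applied to the same data, so rowMeans() lines up correctly with the groups table.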
