How can I select specific regions from a shapefile?

I have the following shapefile:
heitaly<- readOGR("ProvCM01012017/ProvCM01012017_WGS84.shp")
FinalData <- merge(heitaly, HT, by.x = "COD_PROV", by.y = "Domain")
But I'm not interested in all of Italy, only in some provinces. How can I get them?

There are many ways to select a category from a shapefile; it depends on what you want to do, for example colour a specific region in a plot or select a row from the shapefile's attribute table.
To plot:
plot(shape, col = shape$column_name == "element") # general example
plot(heitaly, col = heitaly$COD_PROV == "name of province") # your shapefile
To attribute table:
df <- shape %>% data.frame
This will give you the complete attribute table
row <- shape %>% data.frame %>% slice(1)
This will give you the first row with all columns. If you change the number 1 to another number, for example 3, it will give you the information for row number 3.
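If instead you want to keep only certain provinces as a new, smaller spatial object, you can also subset the SpatialPolygonsDataFrame by rows. A minimal sketch, where the codes in wanted are placeholders for the provinces you actually need:
wanted <- c("15", "58")                               # placeholder province codes
sub_italy <- heitaly[heitaly$COD_PROV %in% wanted, ]  # keep only those rows/polygons
plot(sub_italy)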
I hope this has been useful.

Related

How to use window functions?

I'm struggling to get window functions working in R to rank groups by the number of rows.
Here's my sample code:
data <- read_csv("https://dq-content.s3.amazonaws.com/498/book_reviews.csv")
data %>%
  group_by(state) %>%
  mutate(num_books = n(),
         state_rank = dense_rank(num_books)) %>%
  arrange(num_books)
The expected output is that the original data will have a new column that tells me the rank for each row (book, state, price and review) depending on whether that row is for a state with the most book reviews (would have state_rank of 1); second most books (rank 2), etc.
Manually I can get the output like this:
manual_ranks <- data %>%
  count(state) %>%
  mutate(state_rank = rank(state))
desired_output <- data %>%
  left_join(manual_ranks)
In other words, I want the last column of this table:
data %>%
  count(state) %>%
  mutate(state_rank = rank(state))
added to each row of the original table (without having to create this table and then using left_join by state; that's the point of window functions).
Anyway, with the original code you'll see that every state_rank is just 1, when I would expect states with the most book reviews to be ranked 1, the second most to be ranked 2, etc.
My goal is then to be able to filter by, say, state_rank <= 4. That is, I want to keep all the rows in the original data for the top 4 states with the most book reviews.
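One possible fix (a sketch, not from the original thread): inside group_by(state), dense_rank() only ever sees the single value of num_books for that group, so every row is ranked 1. Computing the rank after dropping the grouping, and ranking in descending order, gives the expected result:
library(readr)
library(dplyr)

data <- read_csv("https://dq-content.s3.amazonaws.com/498/book_reviews.csv")

ranked <- data %>%
  group_by(state) %>%
  mutate(num_books = n()) %>%                      # per-state row count, repeated on every row
  ungroup() %>%                                    # rank across states, not within each state
  mutate(state_rank = dense_rank(desc(num_books))) # most reviews -> rank 1

# keep the rows for the 4 states with the most book reviews
top4 <- ranked %>% filter(state_rank <= 4)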

Locating bold text in excel using R

I am trying to make a couple of spreadsheets in excel accessible. I need to replace bold text and some contents of the cells depending on their specific grouping. For example, if I have this table:
I would like to have the equivalent "accessible" table:
I am not worried about writing in the excel file, my goal is to read the table from the spreadsheet and create a data frame that looks like the table above with the variable names in the first column.
My idea was to identify where there is a bold text in the first column so I could prepend that text to the names that are not in bold as bold represents the subgroups.
I understand this might not be the best solution to the problem, I hope someone can find a proper solution.
Thank you all very much.
--EDITED-- It turns out you can tell which cells have which style; how well this works depends on how many styles there are and how consistently they are used in the workbook. I will leave the other answer, using the Total column approach, below. The first approach relies on the bold text being used consistently, and the second relies on the subcategory totals in the Total column always summing to the category total. Both end up using a similar approach; only the initial strategy for identifying the category text differs.
---I don't believe that openxlsx can determine which cells have a bolded style-- only that a bolded style exists in the workbook.--- I couldn't have been more wrong!
---Bold text search answer --
library(openxlsx)
library(tidyverse)
wb <- loadWorkbook("Path\\Your_File_Name.xlsx")
#Examine structure of the workbook
str(wb)
#Tells number of styles in workbook
wb$styleObjects %>% length()
#In this example there is just one style, so index the first style object and test whether its fontDecoration is bold (here the answer is TRUE)
wb$styleObjects[[1]]$style$fontDecoration == "BOLD"
Since this is the style we want, pull out the rows that have this fontDecoration. Note that if BOLD is combined with another style type in a different cell (e.g., if Motorcycle were in red font), flagging/collecting the rows with bolded font may get more complex (hence approach 2 may be the safer option).
#This indicates that rows 1 and 5 have bolded text (i.e., Car and Motorcycle)
thesebold <- wb$styleObjects[[1]]$rows
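If the workbook held more than one style object, a hedged drop-in alternative for the line above (it only uses the fields already shown: styleObjects, style$fontDecoration and rows) would be to collect the bolded rows across all style objects:
#Collect row numbers from every style object whose font decoration includes BOLD
thesebold <- sort(unique(unlist(lapply(wb$styleObjects, function(s) {
  if ("BOLD" %in% s$style$fontDecoration) s$rows else NULL
}))))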
df <- read.xlsx("Path\\filename", colNames = FALSE)
This identifies the number of repetitions each category should have: take the difference between row positions 1 and 5, then the remaining length of the dataset. See the second approach if there are more than two categories.
thesereps <- c(diff(thesebold), dim(df)[1] - diff(thesebold))
#Named variables for ease
df %>%
  set_names("Category", "Total") %>%
  bind_cols(newcat = rep(df[thesebold, 1], thesereps)) %>%
  mutate(Category = case_when(Category == newcat ~ Category,
                              Category != newcat ~ paste0(newcat, ":", Category))) %>%
  select(-newcat)
--Second approach --
So, this answer isn't using the bold text approach; it assumes the structure of the dataset is as displayed in your example. The data are structured so that you have categories (Car, Motorcycle) and subcategories (Tesla, Honda, Toyota, etc.), with the Total column holding the total for each category followed by the subcategory subtotals that contribute to it. Using this column you can define category boundaries (i.e., subtracting the subtotals from the category total reaches 0 just before switching to the next category). For the demo, I added two more categories of varying lengths but with the same restriction (the sum of the subcategories' totals must equal the category total). I made a note where things may need to be adjusted for your purposes, since I am creating the data frame from scratch instead of reading it in with openxlsx.
library(tidyverse)
#Make expanded data set for demo - adding extra categories
thesenames <- c("Car", "Tesla", "Honda", "Toyota", "Motorcycle", "Honda", "Yamaha", "Suzuki", "Fruit", "Apple", "Orange", "Grape", "Strawberry", "Lemon", "Lime", "Shape", "Circle", "Square", "Octagon")
thesetotals <- c(12, 3, 2, 7, 20, 13, 5, 2, 32, 8, 4, 4, 8, 4, 4, 24, 2, 4, 18)
df <- bind_cols(thesenames, thesetotals) %>%
  set_names("Type", "Total")
#Empty tibble to save running total result to
y <- tibble(NULL)
#Initialize the current.total as 0
current.total = 0
for(i in thesetotals){
  if (current.total == 0){
    current.total = current.total + i
  } else {
    current.total = current.total - i
  }
  tmp <- current.total
  y <- rbind(y, tmp)
}
y <- as_tibble(y) %>%
  set_names("RT")
#Gets the number of subcategories between each of main categories
thislong <- c(diff(which(y$RT ==0)))
thislong <- c((length(y$RT) - sum(thislong)),thislong)
#This part assumes the structure of the df I created above which may need modified in your dataset
#This pulls from first column, first row, which here is "Car"
firstrow <- df[1,1] %>% pull()
#Gets vector of each category; determines category by looking at the lag RT value
thesetypes <- bind_cols(df, y) %>%
  mutate(Category = case_when(firstrow == Type ~ Type,
                              RT > lag(RT) ~ Type,
                              TRUE ~ "0")) %>%
  filter(Category != "0") %>%
  pull(Category)
#Adds new category to existing df, repeating the specified number of times
df$Category <- rep(thesetypes,thislong)
#Modifies the subcategory text with prefixing the category membership then drops Category
df <- df %>%
  mutate(Type = case_when(Type != Category ~ paste0(Category, ":", Type),
                          TRUE ~ Type)) %>%
  select(-Category)
df

How to group twice?

I'd like to know how to group twice in a data set. I must answer the following question: "For each state, which municipalities have the lowest and the highest infection and death rates?". This question is part of a homework assignment (https://github.com/umbertomig/intro-prob-stat-FGV/blob/master/assignments/hw6.Rmd) and I don't know how to do it. I've tried to use top_n, but I am not sure if this is the best way.
I wanted to generate a data set in which, for each state, there were four municipalities (two with the highest rates of infection and death from coronavirus and two with the lowest). This is what I have done so far:
library(tidyverse)
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
brazilcorona_hl_rates <- select(brazilcorona, (estado:emAcompanhamentoNovos)) %>%
  filter(data >= "2020-05-15") %>%
  subset(!(coduf == 76)) %>%
  mutate(av_inf = (casosAcumulado/populacaoTCU2019)*100000,
         av_dth = (obitosAcumulado/populacaoTCU2019)*100000)
brazilcorona_hilow_rates <- brazilcorona_hl_rates %>%
  group_by(estado) %>%
  summarize(top_dth = top_n(1, av_dth))
In my example, I find two cities per state with the maximum and minimum values for the column "obitosAcumulado"; to solve your problem you should simply change it to the column containing the information you want to extract.
rm(list=ls())
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
#
#remove NA's from municipios
brazilcorona<-brazilcorona[!is.na(brazilcorona$municipio),]
#here I am gonna use the column "obitosAcumulado" but you should use the one you want
brazilcorona$obitosAcumulado<-as.numeric(brazilcorona$obitosAcumulado)
states<-as.list(unique(brazilcorona$estado))
result <- lapply(states, FUN = function(x){
  df <- brazilcorona[brazilcorona$estado == x, ]
  df <- df[order(df$obitosAcumulado, decreasing = T), ]
  return(c(paste(x), as.character(df[1:2, "municipio"]),
           as.character(df[(nrow(df)-1):nrow(df), "municipio"])))
})
I hope it helps you...
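A possible tidyverse alternative (a sketch, assuming dplyr >= 1.0 and the columns used above): slice_max() and slice_min() pick the extreme municipalities within each state directly.
library(dplyr)

top2 <- brazilcorona %>%
  filter(!is.na(municipio)) %>%
  group_by(estado) %>%
  slice_max(obitosAcumulado, n = 2)   # two municipalities with the highest value per state
bottom2 <- brazilcorona %>%
  filter(!is.na(municipio)) %>%
  group_by(estado) %>%
  slice_min(obitosAcumulado, n = 2)   # two with the lowest value per state
bind_rows(top2, bottom2) %>%
  arrange(estado)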

Binding dataframes with matching country names

I have two data frames of country data.
df1 has all the countries of the world.
df2 has a subset of countries but has the populations in one of its columns.
I want to take the population data and add it to df1 where the country names are a match.
If df1$Column1 = df2$Column1 (same country name), then populate df1$Column2 (currently empty) with the information from df2$Column2 (the country's population) in the row for that country match.
I tried to merge the two using the column "Name", which they both have for country names:
total <- merge(map,Co2_2x, by="NAME")
the columns are all there but I get empty rows in my new dataframe.
I'd like to be able to say "for this row and column matrix position in df1 (the country), get the row (country name match in df2) and column X (population data). Then put it in this row and column Y matrix position in df1 (new population column in df1 for the matched country name)"... There must be an easier way :-)
Here is my code : I'd like to fill map$measure with data from Co2_2x$premium where the countries match.
library(XML)
library(raster)
library(rgdal)
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())
polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")
polygons
map <- as.data.frame(polygons)
map$Measure <- 0
library(rvest)
Co2 <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions")
Co2_2x <- Co2 %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
names(Co2_2x)[2]<-paste("premium")
names(Co2_2x)[1]<-paste("NAME")
total <- merge(map,Co2_2x, by="NAME")
Thanks!
To have the rows of the first dataset that have no match in the second dataset appear, you just need to add the all.x = T option, as follows (have a look at the documentation for details):
total <- merge(map,Co2_2x, by="NAME",all.x=T)
These rows will then appear with NA in the second dataset columns.
If the matching doesn't seem to work, you may want to make sure that your matching variable (in your case, NAME) is filled exactly the same way in the two datasets (letter case, possible spaces at the ends...).
This answer provides a fine way of doing so.
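For instance, a minimal sketch of that kind of clean-up before merging (it reuses the object and column names from the code above; adjust them to your actual key):
# normalise the key so letter case and stray spaces don't prevent matches
map$NAME <- trimws(toupper(map$NAME))
Co2_2x$NAME <- trimws(toupper(Co2_2x$NAME))
total <- merge(map, Co2_2x, by = "NAME", all.x = TRUE)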
You can use the sqldf library in R.
Just follow the code below. You'll be able to merge (join) the two dataset that you have:
library(sqldf)
merged_data <- sqldf("select a.country, b.population from df1 as a
left join df2 as b on (a.country = b.country) group by 1")
Thanks and happy R-programming!!!

How to use ggplot to group and show top X categories?

I am trying to use ggplot to plot production data by company and use the color of the point to designate the year. The following chart shows an example based on sample data:
However, often my real data has 50-60 different companies, which makes the company names on the Y axis tightly grouped and not very aesthetically pleasing.
What is the easiest way to show data for only the top 5 companies (ranked by 2011 quantities) and then show the rest aggregated as "Other"?
Below is some sample data and the code I have used to create the sample chart:
# create some sample data
c=c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ")
q=c(1,2,3,4,5,6,7,8,9,10)
y=c(2010)
df1=data.frame(Company=c, Quantity=q, Year=y)
q=c(3,4,7,8,5,14,7,13,2,1)
y=c(2011)
df2=data.frame(Company=c, Quantity=q, Year=y)
df=rbind(df1, df2)
# create plot
p=ggplot(data=df,aes(Quantity,Company))+
geom_point(aes(color=factor(Year)),size=4)
p
I started down the path of a brute force approach but thought there is probably a simple and elegant way to do this that I should learn. Any assistance would be greatly appreciated.
What about this:
df2011 <- subset (df, Year == 2011)
companies <- df2011$Company [order (df2011$Quantity, decreasing = TRUE)]
ggplot (data = subset (df, Company %in% companies [1 : 5]),
aes (Quantity, Company)) +
geom_point (aes (color = factor (Year)), size = 4)
BTW: in order for the code to be called elegant, spend a few more spaces, they aren't that expensive...
See if this is what you want. It takes your df dataframe and some of the ideas already suggested by @cbeleites. The steps are:
1. Select the 2011 data and order the companies from highest to lowest on Quantity.
2. Split df into two bits: dftop, which contains the data for the top 5, and dfother, which contains the aggregated data for the other companies (using ddply() from the plyr package).
3. Put the two dataframes together to give dfnew.
4. Set the order in which the levels of Company are plotted: top to bottom is highest to lowest, then "Other". The order is mostly given by companies, plus "Other".
5. Plot as before.
library(ggplot2)
library(plyr)
# Step 1
df2011 <- subset (df, Year == 2011)
companies <- df2011$Company [order (df2011$Quantity, decreasing = TRUE)]
# Step 2
dftop = subset(df, Company %in% companies [1:5])
dftop$Company = droplevels(dftop$Company)
dfother = ddply(subset(df, !(Company %in% companies [1:5])), .(Year), summarise, Quantity = sum(Quantity))
dfother$Company = "Other"
# Step 3
dfnew = rbind(dftop, dfother)
# Step 4
dfnew$Company = factor(dfnew$Company, levels = c("Other", rev(as.character(companies)[1:5])))
levels(dfnew$Company) # Check that the levels are in the correct order
# Step 5
p = ggplot (data = dfnew, aes (Quantity, Company)) +
geom_point (aes (color = factor (Year)), size = 4)
p
The code produces the desired plot: the top five companies plus "Other" on the y axis, with point color indicating the year.
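A more compact alternative (a sketch, not from the original answers, assuming the dplyr and forcats packages are available): fct_other() lumps every company outside the top 5 into "Other" before aggregating and plotting.
library(dplyr)
library(forcats)
library(ggplot2)

# companies ranked in the top 5 by 2011 Quantity
top5 <- df %>%
  filter(Year == 2011) %>%
  slice_max(Quantity, n = 5) %>%
  pull(Company) %>%
  as.character()

df %>%
  mutate(Company = fct_other(Company, keep = top5, other_level = "Other")) %>%
  group_by(Company, Year) %>%
  summarise(Quantity = sum(Quantity), .groups = "drop") %>%
  ggplot(aes(Quantity, Company)) +
  geom_point(aes(color = factor(Year)), size = 4)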
