Adding a column from a dataframe to a SpatialPolygon Dataframe - r

I've been trying to add a column of numerical data from a dataframe to a SpatialPolygon dataframe but every attempt leads to the latter dataframe being converted to a standard dataframe similar to the former. I needed to add the column so that I can create a choropleth map with the column's variable as the focus. Obviously the standard dataframe is no good since I'm trying to create a map using tmap.
This is how I've been trying to add the column (where shapefilecomb is the spatial dataframe and wardturnout is the variable containing the column in question):
shapefilecomb <- c(wardturnout)

Adding a column to the data slot of a SpatialPolygonsDataFrame with the assignment operator, shapefilecomb$wardturnout <- wardturnout, works, but it is not the safest way to do the job. It relies only on position (the first data item goes to the first polygon, the second to the second, and so on), so it gets messy as soon as the two objects are sorted differently.
It is best reserved for calculated fields: the shapefile$valuepercapita <- shapefile$value / shapefile$population kind of assignment.
For data from external sources it is a much better idea to assign values by key. The append_data function from the tmap package does this very nicely: it gives you a message not only when an error occurs, but also a confirmation when all data was matched perfectly (a nice touch when working with large sets of imperfect data).
outShape <- append_data(srcShape, frmData, key.shp = "KOD_LAU1", key.data = "LAU1")
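For readers without tmap at hand, the same key-based idea can be sketched in base R with match(). This is only an illustration: the plain data frame shape_data stands in for the attribute table of the spatial object, and every column name and value below is made up.

```r
# Hypothetical stand-in for the attribute table of a spatial object;
# rows are in polygon order.
shape_data <- data.frame(ward_id = c("W3", "W1", "W2"),
                         stringsAsFactors = FALSE)

# External data, sorted differently and missing ward W3 (made-up values).
turnout <- data.frame(ward = c("W1", "W2"),
                      turnout = c(61.5, 48.2),
                      stringsAsFactors = FALSE)

# Matching by key looks up each polygon's row in the external data,
# instead of trusting that both objects share the same order.
idx <- match(shape_data$ward_id, turnout$ward)
shape_data$turnout <- turnout$turnout[idx]

# Report keys that found no partner.
unmatched <- shape_data$ward_id[is.na(idx)]
if (length(unmatched) > 0) {
  message("No data for: ", paste(unmatched, collapse = ", "))
}
```

Polygons without a match end up with NA rather than with a value that belongs to another polygon.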
Edit (as of 9/2019): This answer seems to be still going strong... The world has changed though.
tmap::append_data() has been moved to tmaptools::append_data() and is by now deprecated.
sf has replaced sp as the go-to package for spatial data in R.
In the sf world spatial data are stored in modified data.frames, and the most appropriate way to assign data items by key is one of the *_join() functions from dplyr: either dplyr::left_join() to be on the safe side, or dplyr::inner_join() if filtering on both sides is actually the desired behavior.
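A minimal sketch of the sf plus dplyr route, assuming both packages are installed. The two square "wards", the column names ward_id and ward, and the turnout values are invented for illustration only.

```r
library(sf)
library(dplyr)

# Two made-up square polygons standing in for wards.
square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}
wards <- st_sf(ward_id = c("W1", "W2"),
               geometry = st_sfc(square(0), square(1)))

# External data in a different order than the polygons (made-up values).
turnout <- data.frame(ward = c("W2", "W1"), turnout = c(48.2, 61.5))

# The sf object goes first so the result keeps its geometry and sf class.
wards_joined <- left_join(wards, turnout, by = c("ward_id" = "ward"))
```

Because the sf object is the left-hand table, every polygon is kept, and wards without a match simply get NA in the new column.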

Related

Load file to create a dataframe in r

I have the file data/spatial/tissue_pos.csv and want to load it and create a dataframe object called spatial.data with the two columns x and y holding the x and y coordinates of the data. I am new to R and do not know how this is done. I am given some hints:
The second-to-last and last columns in tissue_pos.csv represent the x and y coordinates respectively. You can ignore all other columns except the barcodes of course.
There are more entries in the file than there are spots in the expression data, and the entries are also sorted differently. Make sure to adjust for this, and check that the rownames and colnames match once you've loaded the data.
To find common elements in two sets of vectors, there's a nifty command called intersect.
But I am still unsure of how to do this. All help is truly appreciated!
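One way the hints could fit together is sketched below. Nothing about the real file layout is known here, so the sketch ASSUMES that the file has no header row, that the barcode is in the first column, and that the expression data is a matrix (called expr here, a hypothetical name) with barcodes as column names. Small stand-ins are built so the code runs on its own.

```r
# Stand-in for data/spatial/tissue_pos.csv, written to a temp file.
# ASSUMED layout: no header, barcode in column 1, x and y in the last two columns.
pos_file <- tempfile(fileext = ".csv")
writeLines(c("BC_C,1,0,0,10,20",
             "BC_A,1,0,1,30,40",
             "BC_X,0,1,0,50,60",
             "BC_B,1,1,1,70,80"), pos_file)

# Stand-in for the expression data: spots (barcodes) as column names.
expr <- matrix(1:6, nrow = 2,
               dimnames = list(c("gene1", "gene2"), c("BC_A", "BC_B", "BC_C")))

pos <- read.csv(pos_file, header = FALSE, stringsAsFactors = FALSE)
n <- ncol(pos)

# Keep only the x/y columns and use the barcodes as row names.
spatial.data <- data.frame(x = pos[[n - 1]], y = pos[[n]],
                           row.names = pos[[1]])

# Barcodes present in both objects, in the order used by the expression data.
common <- intersect(colnames(expr), rownames(spatial.data))
spatial.data <- spatial.data[common, ]
expr <- expr[, common]
```

After this, rownames(spatial.data) and colnames(expr) hold the same barcodes in the same order, and the extra entry BC_X is gone.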

Updating a File in R by adding a column/vector

Is there any way that I can update an existing .csv file by adding a column/vector that I have scraped from the web? I have a web scraper that pulls COVID-19 data, and I am trying to create a file that has positive cases in columns, where each column is the list of cases for a day in each county (x-axis is counties, y-axis is date). I have toyed around with many different ideas at this point and seem to have hit a roadblock. I'm fairly new to R so any ideas would be appreciated!
Packages I am Currently Using/Planning to Use:
library(tidyverse)
library(funModeling)
library(Hmisc)
library(rvest)
library(ggplot2)
CODE:
#writing the original file
positive <- data.frame(Counties= counties_list, "06/12/2020"= positive_data)
positive[is.na(positive)]= 0
positive = positive[-c(76),]
write.csv(positive, "C:/Users/Nathan May/Desktop/Research Files (ABI)/Covid/Data For Shiny/Positive/Positive Data.csv")
#creating the new vector and updating the existing file with it
datap <- read.csv("C:/Users/Nathan May/Desktop/Research Files (ABI)/Covid/Data For Shiny/Positive/Positive Data.csv")
positive_data = positive_data[-c(76),]
datap$DATE <- positive_data
NOTE: The end goal is to create a ShinyApp that displays bar charts for positives, recoveries, and deaths by day in each county. This is the data wrangling portion.
First things first, if you are going to use the tidyverse, use tibble instead of data.frame. Tibbles are the Tidyverse version of data frames.
Next, be aware of the structure of your data frame. The way you create your data.frame now (and later probably your tibble) you get a variable "Counties" and one additional variable for each day. That means that you will have to add columns as time passes (the opposite of what you described: Moving along the x axis (along columns) will move along dates while moving along the y-axis (moving along rows) will move along counties). It's possible but I think a bit unconventional. You might want to initialize your data frame with one column for each county and an additional variable called "date". Then whenever you get new data you can add a row in your dataframe instead of a column (so you're "adding a new case" instead of "adding a new variable").
To actually add the row you will have to load the data as you do in your code, create the new row (or column, if you insist) and then "glue" it to the rest of the data.
Depending on how your data looks, you can create a single-row data frame using tibble_row() with the same counties as variable names as you have in your main data frame, and then glue them together with add_row(datap, your_new_row). Alternatively, if you want to add the row using only position and not column names, you can have the new row as a vector and use rbind() instead of add_row.
If you persist with the "one variable per date" approach, there are column equivalents (add_column and cbind) for both these functions.
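The read, glue, and write cycle for the "one row per date" layout can be sketched in base R as below. The county names, counts, and the temporary file are made up; the real path and scraped vectors from the question are not used.

```r
# Temp file standing in for the real "Positive Data.csv".
csv_path <- tempfile(fileext = ".csv")

# Day 1: one row per date, one column per county (made-up names and counts).
day1 <- data.frame(date = "2020-06-12", Adams = 5, Brown = 0, Clark = 12)
write.csv(day1, csv_path, row.names = FALSE)

# Day 2: read the file back, build the new row, glue it on, write again.
datap <- read.csv(csv_path, stringsAsFactors = FALSE)
day2 <- data.frame(date = "2020-06-13", Adams = 7, Brown = 1, Clark = 12)
datap <- rbind(datap, day2)   # rbind on two data frames matches columns by name
write.csv(datap, csv_path, row.names = FALSE)
```

Writing with row.names = FALSE keeps the file from gaining an extra index column each time it is read back and rewritten.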
Hope this helps, Cheers

R empty data frame after subsetting by factor

I need to subset my data depending on the content of one factor variable.
I tried to do it with subset:
new <- subset(data, original$Group1=="SALAD")
data is already a subset from a bigger data frame, in original I have the factor variable which should identify the wanted rows.
This works perfectly for one level of the factor variable, but (and I really don't understand why!!) when I do it with the other factor level "BREAD" it creates the data frame but says "no data available" - so it is empty. I've imported the data from SPSS, if this matters. I've already checked the factor levels, but the naming should be right!
Would be really grateful for help, I spent 3 hours on this problem and wasn't able to find a solution.
I've also tried other ways to subset my data (e.g. split), but I want a data frame as output.
Do you have advice in general on the best way to subset a data frame if I want, e.g., 3 columns of this data frame and the rows should be extracted depending on the level of a factor? (Most code examples are only for one or all columns.)
The entire point of the subset function (as I understand it) is to look inside the data frame for the right variable - so you can type
subset(data, var1 == "value")
instead of
data[data$var1 == "value", ]
Please correct me anyone if that is incorrect.
Now, in your case, you are explicitly taking Group1 from the data frame original and using that to subset data, which you say is a subset of original. Based on this, I see no reason to believe (and every reason not to believe) that the elements of original$Group1 will align with the rows of data. If Group1 is defined within data, why not just use the copy defined there, which is aligned correctly? If not, you need to be very explicit about what you are trying to accomplish, so that you can ensure that things are aligned correctly.
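A small sketch of subsetting on a factor that lives inside the data frame itself, while keeping only three columns. The data frame and its columns (price, weight, shop) are invented for illustration.

```r
# Made-up data; Group1 is a column of the data frame being subset.
data <- data.frame(Group1 = factor(c("SALAD", "BREAD", "SALAD", "BREAD")),
                   price = c(3.5, 2.0, 4.0, 2.5),
                   weight = c(200, 500, 250, 450),
                   shop = c("A", "A", "B", "B"))

# Rows where Group1 is "BREAD", keeping only three columns.
bread <- subset(data, Group1 == "BREAD", select = c(Group1, price, weight))

# Equivalent with bracket indexing.
bread2 <- data[data$Group1 == "BREAD", c("Group1", "price", "weight")]

# Drop the now-unused "SALAD" level if it gets in the way later.
bread <- droplevels(bread)
```

Both forms take the condition and the rows from the same object, so they cannot fall out of alignment.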

What's the easiest way to ignore one row of data when creating a histogram in R?

I have this csv with 4000+ entries and I am trying to create a histogram of one of the variables. Because of the way the data was collected, there was a possibility that if data was uncollectable for that entry, it was coded as a period (.). I still want to create a histogram and just ignore that specific entry.
What would be the best or easiest way to go about this?
I tried making it so that the histogram would only use the data for every entry except the one with the period by doing
newlist <- data1$var[1:3722]+data1$var[3724:4282]
where 3723 is the entry with the period, but R said that + is not meaningful for factors. I'm not sure if I went about this the right way, my intention was to create a vector or list or table conjoining those two subsets above into one bigger list called newlist.
Your problem is deeper than you realize. When R read in the data and saw the lone . it interpreted that column as a factor (categorical variable).
You need to either convert the factor back to a numeric variable (this is R FAQ 7.10) or reread the data, forcing that column to be read as numeric. If you are using read.table or one of the functions that call read.table, you can set the colClasses argument to specify a numeric column.
Once the column of data is a numeric variable then a negative subscript or !is.na will work (or some functions will automatically ignore the missing value).
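Both routes can be sketched as follows. The CSV content is made up, and the first option uses the na.strings argument of read.csv (rather than colClasses alone) to declare that "." means missing, which is one possible way to reread the data.

```r
# Made-up CSV content with one missing value coded as "."
csv_text <- "id,var\n1,2.5\n2,.\n3,4.0\n4,3.5"

# Option 1: tell the reader that "." means missing, so the column stays numeric.
data1 <- read.csv(text = csv_text, na.strings = ".")
clean <- data1$var[!is.na(data1$var)]

# Option 2: the column is already a factor; convert via character (R FAQ 7.10).
f <- factor(c("2.5", ".", "4.0", "3.5"))
converted <- suppressWarnings(as.numeric(as.character(f)))  # "." becomes NA

# plot = FALSE only keeps this sketch from opening a graphics device;
# drop it to draw the histogram.
h <- hist(clean, plot = FALSE)
```

Calling as.numeric() directly on the factor would return the internal level codes, which is why the conversion goes through as.character() first.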

Adding extra data column to shapefile using convert.to.shapefile in R's shapefiles package

My goal is very simple, namely to add 1 column of statistical data to a shapefile so that I can use it for example to colour a geographical area. The data are a country file from gadm. To this end I usually use the foreign package in R thus:
library(foreign)
newdbf <- read.dbf("CHN_adm1.dbf") #original shape file
incrdata <- read.csv("CHN_test.csv") #.csv file with same region names column + new data column
mergedbf <- merge(newdbf,incrdata)
write.dbf(mergedbf,"CHN_New")
This achieves what I want in almost all circumstances, but one of the pieces of software I am dealing with external to R will only recognize .shp files and will not read .dbf (although clearly in a sense that statement is a slight contradiction). Not sure why it won't. Anyhow, essentially it leaves me needing to do the same thing as above, but with a shapefile. I think that according to notes on shapefiles package, the process should run something like this:
library(shapefiles)
shaper <- read.shp("CHN_adm1.shp")
simplified <- convert.to.simple(shaper)
simplified <- change.id(simplified,incrdata$DataNew) #DataNew being new column of data from the .csv
simpleAsList <- by(simplified,simplified[,1],function(x)x)
####This is where I hit problems####
backToShape <- convert.to.shapefile(simplified,
data.frame(index=c("20","30","40","50","60","70","80")),"index",5)
write.shapefile(backToShape,"CHN_TestShape")
I'm afraid that I can't get my head around shapefiles, since I can't unpick them or visualize them in a way I can with dataframes, and so the resultant shape has been screwed up when it goes back to the external charting package.
To be clear: in 'backToShape' I just want to add the column of data and reconstruct the shapefile. It so happens that the data I have appears as a factor, ie 20,30,40 etc, but the data could just as easily be continuous, and I'm sure I don't need to type in all possibilities, but it was the only way I could seem to get it to be accepted. Can somebody please put me on the right track, and if I'm missing a simpler way, I'd also be extremely grateful to hear a suggestion. Many thanks in advance.
Stop using the shapefiles package.
Install the sp and rgdal packages.
Read shapefile with:
chn = readOGR(".","CHN_adm1") # first arg is path, second is shapefile name w/o .shp
Now chn is like a data frame. In fact chn@data is a data frame. Do what you like to that data frame but keep it in the same order, and then you can save the updated shapefile with the new data by:
writeOGR(chn, ".", "CHN_new", driver="ESRI Shapefile")
Note you shouldn't really manipulate the chn@data data frame directly; you can work with chn as if it were a data frame in many respects. For example, chn$foo gets the column named foo, and chn$popden = chn$pop/chn$area would create a new column of population density if you have population and area columns.
spplot(chn, "popden")
will map by the popden column you just created, and:
head(as.data.frame(chn))
should show you the first few lines of the shapefile data.
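The same round trip can also be sketched with the sf package in place of sp and rgdal, which sidesteps the "keep it in the same order" caveat by joining on a key. Everything below is an assumption for illustration: the two squares stand in for the real regions, NAME_1 is assumed to be the shared region-name column, and the DataNew values are made up.

```r
library(sf)

# Two made-up squares standing in for the regions in CHN_adm1.
square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}
chn <- st_sf(NAME_1 = c("RegionA", "RegionB"),
             geometry = st_sfc(square(0), square(1), crs = 4326))

# New data keyed by region name (made-up values), in a different order.
incrdata <- data.frame(NAME_1 = c("RegionB", "RegionA"),
                       DataNew = c(30, 20))

# merge() has an sf method; with the sf object first, geometry is kept.
chn_new <- merge(chn, incrdata, by = "NAME_1", all.x = TRUE)

# Writing a .shp also writes the .dbf/.shx siblings next to it.
out <- tempfile(fileext = ".shp")
st_write(chn_new, out, quiet = TRUE)
back <- st_read(out, quiet = TRUE)
```

Shapefile field names are limited to 10 characters, so short column names such as DataNew survive the round trip unchanged.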
