R - 'NA' text treated as N/A

I have a data frame in R including country iso codes. The iso code for Namibia happens to be 'NA'. R treats this text 'NA' as N/A.
For example the code below gives me the row with Namibia.
test <- subset(country.info,is.na(country.info$iso.code))
I initially thought it might be a factor issue, so I made sure the iso code column is character. But this didn't help.
How can this be solved?

This probably relates to how you read in the data. Just because it's character doesn't mean your "NA" isn't an NA, e.g.:
z <- c("NA",NA,"US")
class(z)
#[1] "character"
You could confirm this by giving us a dput() of (part of) your data.
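As a quick check (a minimal sketch using the same vector z as above; the variable names missing_flags and is_text_na are only illustrative), is.na() shows which elements are truly missing and which are just the two-letter string:

```r
z <- c("NA", NA, "US")

# TRUE only for the real missing value, not for the string "NA"
missing_flags <- is.na(z)

# TRUE only for the literal two-letter string "NA"
is_text_na <- !is.na(z) & z == "NA"

print(missing_flags)
print(is_text_na)
```

If the Namibia row shows up under is.na(), the value was already converted to a missing value when the file was read.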
When you read in your data, try changing na.strings = "NA" (e.g., in read.csv) to something else and see if it works.
For example, with na.strings = "":
read.table(text="code country
NA Namibia
DE Germany
FR France", stringsAsFactors=FALSE, header=TRUE, na.strings="")
#   code country
# 1   NA Namibia
# 2   DE Germany
# 3   FR France
Make sure to check that using "" doesn't change anything else. Otherwise, you can use a string that will definitely not occur in your file, such as "z_z_z". You can replace the text=.. argument with your file name.
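The same idea can be sketched with read.csv and a sentinel string. This is only an illustration: the inline CSV text stands in for your file, and the column names iso.code and country are assumed from the question.

```r
# Inline text stands in for the real file (assumption for illustration)
csv_text <- "iso.code,country
NA,Namibia
DE,Germany
FR,France"

# "z_z_z" is a sentinel that should never occur in the data,
# so the string "NA" is kept as ordinary text
country.info <- read.csv(text = csv_text,
                         na.strings = "z_z_z",
                         stringsAsFactors = FALSE)

# No code is treated as missing, and Namibia can be found by its code
print(is.na(country.info$iso.code))
namibia <- subset(country.info, iso.code == "NA")
print(namibia)
```

With a real file, the text = argument would be replaced by the file name.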

If Thomas' solution doesn't work, you can always use the countrycode package to change your country codes to something that causes fewer problems.
In your case, for instance, from ISO 2-character to ISO 3-character codes.
library(countrycode)
country.info$iso.code<-countrycode(country.info$iso.code,"iso2c","iso3c", warn=TRUE)
If iso2c causes problems, use country.name as the origin, hoping the Republic of Congo and the Democratic Republic of Congo don't mess things up.

Related

Make only numeric entries blank

I have a data frame with UK postcodes in it. Unfortunately, some of the postcode data is incorrect, i.e. the entries are only numeric (all UK postcodes should start with an alphabetic character).
I have done some research and found the grepl command, which I've used to generate a TRUE/FALSE vector showing whether the entry is only numeric,
Data$NewPostCode <- grepl("^.*[0-9]+[A-Za-z]+.*$|.*[A-Za-z]+[0-9]+.*$",Data$PostCode)
however, what I really want is to make the postcode blank wherever it starts with a number.
Note, I don't want to remove the rows with an incorrect postcode, as I would lose information from the other variables. I simply want to remove that postcode.
Example data
Area Postcode
Birmingham B1 1AA
Manchester M1 2BB
Bristol BS1 1LM
Southampton 1254
London 1290C
Newcastle N1 3DC
Desired output
Area Postcode
Birmingham B1 1AA
Manchester M1 2BB
Bristol BS1 1LM
Southampton
London
Newcastle N1 3DC
There are a few ways to go between TRUE/FALSE vectors and the kind of task you want, but I prefer ifelse. A simpler way to generate the type of logical vector you're looking for is
grepl("^[0-9]", Data$PostCode)
which will be TRUE whenever PostCode starts with a number, and FALSE otherwise. You may need to adjust the regex if your needs are more complex.
You can then define a new column which is blank whenever the vector is TRUE and the old value whenever the vector is FALSE, as follows:
Data$NewPostCode <- ifelse(grepl("^[0-9]", Data$PostCode), "", Data$PostCode)
(May I suggest using NA instead of blank?)
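Here is a minimal sketch of the NA variant on the example data from the question. The data frame is rebuilt by hand for illustration, and stringsAsFactors = FALSE is assumed so that ifelse() returns the postcode text rather than factor codes.

```r
# Example data from the question, rebuilt by hand
Data <- data.frame(
  Area = c("Birmingham", "Manchester", "Bristol",
           "Southampton", "London", "Newcastle"),
  PostCode = c("B1 1AA", "M1 2BB", "BS1 1LM", "1254", "1290C", "N1 3DC"),
  stringsAsFactors = FALSE
)

# TRUE where the postcode starts with a digit
starts_with_digit <- grepl("^[0-9]", Data$PostCode)

# Keep valid postcodes, replace the others with a missing value
Data$NewPostCode <- ifelse(starts_with_digit, NA_character_, Data$PostCode)

print(Data)
```

Using NA_character_ keeps the column of type character, and the rows themselves are untouched, so the other variables are preserved.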

display a subset of regions using a shapefile in R

I have a shapefile of the UK: https://geoportal.statistics.gov.uk/Docs/Boundaries/Local_authority_district_(GB)_2014_Boundaries_(Generalised_Clipped).zip
I've read the shapefile into a variable, UK
>UK <- readOGR(dsn = "....."
>England <- UK
I'd like to only display English Local Authority regions. They are specified in the LAD_DEC_2014_GB_BGC.dbf where LAD14CD starts with "E"
>UK@data
LAD14CD LAD14NM LAD14NMW
0 E06000001 Hartlepool <NA>
1 E06000002 Middlesbrough <NA>
2 E06000003 Redcar and Cleveland <NA>
371 W06000015 Cardiff Caerdydd
>#filter UK@data and replace England@data with only English regions
>England@data <- UK@data$LAD14CD[c(grep("^E", UK$LAD14CD))]
>plot(England)
But the grep command appears to change the shapefile into a factor, which breaks the plot.
With this command:
England <- UK@data$LAD14CD[c(grep("^E", UK$LAD14CD))]
...you are subsetting just one column from the data slot, not the whole shapefile and assigning that to England.
This ought to do the job:
England <- UK[grep("^E", UK@data$LAD14CD),]
Note, you need the trailing comma in there! Also, you don't need to wrap the grep statement in c(); it doesn't hurt, it's just unnecessary.
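The difference between the two forms can be sketched on a plain data frame standing in for the attribute table. This is an analogy only: the object lad below is a hand-built data.frame, not a spatial object, and it assumes that row indexing on the shapefile object follows the same [rows, ] convention as shown in the answer above.

```r
# Hand-built stand-in for the attribute table shown in the question
lad <- data.frame(
  LAD14CD = c("E06000001", "E06000002", "E06000003", "W06000015"),
  LAD14NM = c("Hartlepool", "Middlesbrough", "Redcar and Cleveland", "Cardiff"),
  stringsAsFactors = FALSE
)

rows <- grep("^E", lad$LAD14CD)   # positions of the English codes

# Subsetting one column returns a bare vector; the table structure is lost
codes_only <- lad$LAD14CD[rows]

# Row subsetting with the trailing comma keeps every column
england <- lad[rows, ]

print(codes_only)
print(england)
```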
I ended up using dplyr and grepl instead to make things simpler:
library('rgdal')
library('dplyr')
UK <- readOGR(dsn="LAD_DEC_2014_GB_BGC.shp", layer="LAD_DEC_2014_GB_BGC") %>%
subset(grepl("^E", LAD14CD))
plot(UK)

Why cannot I plot the graph in R?

I am having trouble plotting the graph. Every time I try to plot it, I get a histogram instead of a line graph.
I have attached the link to the csv file - https://docs.google.com/spreadsheets/d/1qaTqw9sSoOpeKIa5GnHr2cJ2_DKBb1-89eTukTtrKOQ/edit?usp=sharing
First 4 lines of data
Date Comid Low High Average Close Trdno Volume Turnover Company
01-01-2005 14,259.00 138.60 139.10 138.84 138.80 14.00 1,500.00 208,230.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
02-01-2005 14,259.00 139.00 140.00 139.43 139.40 24.00 2,750.00 383,665.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
03-01-2005 14,259.00 138.50 139.00 138.70 138.60 26.00 3,600.00 499,300.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
04-01-2005 14,259.00 135.20 138.50 136.76 136.70 23.00 2,300.00 314,865.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
I am trying to plot the 6th column (the one titled "Close"), and I typed the following commands.
batbc <- read.csv("batbc.csv")
plot(batbc[, 6], type="l")
The problem is the commas as thousand separators. There are a few ways of solving this, but the neatest I've seen is from another SO answer.
For your data in particular, you need to do this:
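# Helper class: character values with thousands separators, read as numeric
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))
# Date and Company stay character; the 8 columns in between are numeric
batbc <- read.csv("batbc.csv",
                  colClasses = c("character", rep("num.with.commas", 8), "character"))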
It should then work fine.
Note with the commas in place, the numbers are treated as character, and then converted to factors per the default behaviour of read.csv. When you try to plot a factor, you get a histogram. In that context, the type = "l" is ignored with a warning.
You need to read the csv with automatic factor conversion turned off.
Then you need to get rid of the thousands comma separator in that column (or for any relevant column).
Then coerce the character column to numeric. Coercing directly to numeric without first removing the thousands separators will generate NA for every value that contains a comma.
Next you can plot normally.
batbc <- read.csv('batbc.csv', as.is = TRUE)
batbc$Close <- gsub(',','',batbc$Close)
batbc$Close <- as.numeric(batbc$Close)
plot(batbc[, 6], type="l")
HTH.
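The steps above can be tried on a tiny inline sample. This is a sketch only: the two-row CSV text and the toy data frame are made up for illustration and stand in for the real file.

```r
# Made-up sample; quoted fields contain thousands separators
csv_text <- 'Date,Close,Volume
01-01-2005,138.80,"1,500.00"
02-01-2005,139.40,"2,750.00"'

toy <- read.csv(text = csv_text, as.is = TRUE)

# Direct coercion fails for values containing a comma
naive <- suppressWarnings(as.numeric(toy$Volume))

# Strip the separators first, then coerce
toy$Volume <- as.numeric(gsub(",", "", toy$Volume))

print(naive)
print(toy$Volume)
```

After the conversion, plot(toy$Volume, type = "l") draws a line, because the column is numeric rather than character or factor.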

Simple way to subset SpatialPolygonsDataFrame (i.e. delete polygons) by attribute in R

I would like to simply delete some polygons from a SpatialPolygonsDataFrame object based on corresponding attribute values in the @data data frame so that I can plot a simplified/subsetted shapefile. So far I haven't found a way to do this.
For example, let's say I want to delete all polygons from this world shapefile that have an area of less than 30000. How would I go about doing this?
Or, similarly, how can I delete Antarctica?
require(maptools)
getinfo.shape("TM_WORLD_BORDERS_SIMPL-0.3.shp")
# Shapefile type: Polygon, (5), # of Shapes: 246
world.map <- readShapeSpatial("TM_WORLD_BORDERS_SIMPL-0.3.shp")
class(world.map)
# [1] "SpatialPolygonsDataFrame"
# attr(,"package")
# [1] "sp"
head(world.map@data)
# FIPS ISO2 ISO3 UN NAME AREA POP2005 REGION SUBREGION LON LAT
# 0 AC AG ATG 28 Antigua and Barbuda 44 83039 19 29 -61.783 17.078
# 1 AG DZ DZA 12 Algeria 238174 32854159 2 15 2.632 28.163
# 2 AJ AZ AZE 31 Azerbaijan 8260 8352021 142 145 47.395 40.430
# 3 AL AL ALB 8 Albania 2740 3153731 150 39 20.068 41.143
# 4 AM AM ARM 51 Armenia 2820 3017661 142 145 44.563 40.534
# 5 AO AO AGO 24 Angola 124670 16095214 2 17 17.544 -12.296
If I do something like this, the plot does not reflect any changes.
world.map@data = world.map@data[world.map@data$AREA > 30000,]
plot(world.map)
I get the same result if I do this:
world.map@data = world.map@data[world.map@data$NAME != "Antarctica",]
plot(world.map)
Any help is appreciated!
It looks like you're overwriting the data, but not removing the polygons. If you want to cut down the dataset, including both data and polygons, try e.g.
world.map <- world.map[world.map$AREA > 30000,]
plot(world.map)
[[Edit 19 April, 2016]]
That solution used to work, but @Bonnie reports otherwise for a newer R version (though perhaps the data has changed too?):
world.map <- world.map[world.map@data$AREA > 30000, ]
Upvote @Bonnie's answer if that helped.
When I tried to do this in R 3.2.1, tim riffe's technique above did not work for me, although modifying it slightly fixed the problem. I found that I had to specifically reference the data slot as well before specifying the attribute to subset on, as below:
world.map <- world.map[world.map@data$AREA > 30000, ]
plot(world.map)
Adding this as an alternative answer in case others come across the same issue.
Just to mention that subset also does the job, and it avoids having to write the name of the data in the condition.
world.map <- subset(world.map, AREA > 30000)
plot(world.map)
I used the above technique to make a map of just Australia:
australia.map <- world.map[world.map$NAME == "Australia",]
plot(australia.map)
The comma after "Australia" is important, as it turns out.
One flaw with this method is that it appears to retain all of the attribute columns and rows for all of the other countries, and just populates them with zeros. I found that if I wrote out a .shp file, then read it back in using readOGR (rgdal package), it automatically removes the null geographic data. Then I could write another shape file with only the data I want.
writeOGR(australia.map,".","australia",driver="ESRI Shapefile")
australia.map <- readOGR(".","australia")
writeOGR(australia.map,".","australia_small",driver="ESRI Shapefile")
On my system, at least, it's the "read" function that removes the null data, so I have to write the file after reading it back once (and if I try to re-use the filename, I get an error). I'm sure there's a simpler way, but this seems to work well enough for my purposes anyway.
As a second pointer: this does not work for shapefiles with "holes" in the shapes, because it is subsetting by index.

Need help formatting date in R

I am trying to get a simple bar chart of activity count by date; however, when I import my data into R, it is either skipping some records or not properly converting the date format.
Here is the script I am using:
ua <- read.table('report_users_activities_byrole 2.txt',sep='|',header=T)
qplot(date,
data=ua,
geom="bar",
weight=count,
ylab="User Count",
fill=factor(un_region)) +
opts(axis.text.x =theme_text(angle=45, size=5))
And my data
head(ua)
date role name un_region un_subregion us_state count
1 2012-06-21 ENTREPRENEUR Australia Oceania Australia and New Zealand 2
2 2012-06-21 ENTREPRENEUR Belgium Europe Western Europe 1
3 2012-06-21 ENTREPRENEUR Bosnia and Herzegovina Europe Southern Europe 1
I suspect you need something like
ua[,"date"] <- as.Date(ua[,"date"])
to turn the textual representation of the dates you got from reading the file into an actual Date type.
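A minimal sketch of that conversion, assuming the dates arrive as text in the year-month-day form shown in head(ua). The two-row data frame is made up for illustration.

```r
# Made-up sample with dates stored as text
ua <- data.frame(
  date = c("2012-06-21", "2012-06-22"),
  count = c(2, 1),
  stringsAsFactors = FALSE
)

# An explicit format makes the expected layout clear
ua$date <- as.Date(ua$date, format = "%Y-%m-%d")

str(ua$date)
day_gap <- as.numeric(ua$date[2] - ua$date[1])   # date arithmetic now works
print(day_gap)
```

Values that do not match the format come back as NA, which is a quick way to spot rows that were read incorrectly.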
I'm not sure what's wrong with your code, but something like this should work (it's a version of the example at http://had.co.nz/ggplot2/scale_date.html):
df = data.frame(date=sample(seq(Sys.Date(), len=100, by="1 day"),size=100,replace=TRUE))
qplot(x=date,data=df,geom="bar")
df is a data.frame where some dates appear more often than others (that's what the sample() call does). I'm not sure why you want the "weight" argument in your qplot() call. Also, make sure your date variable is a proper date (not a string), i.e. check with
str(df$date)
otherwise
qplot(x=factor(date),data=df,geom="bar")
should work as well.
Looks like I had some encoding issues with my data extract. I used Google Refine to clean up the import and then ran
ua <- read.csv("~/Desktop/R Working/report_users_activities_byrole.csv")
and it worked.
