Why cannot I plot the graph in R? - r

I am having trouble plotting the graph. Every time I try to plot it, I get a histogram instead of a line graph.
I have attached the link to the csv file - https://docs.google.com/spreadsheets/d/1qaTqw9sSoOpeKIa5GnHr2cJ2_DKBb1-89eTukTtrKOQ/edit?usp=sharing
First 4 lines of data
Date Comid Low High Average Close Trdno Volume Turnover Company
01-01-2005 14,259.00 138.60 139.10 138.84 138.80 14.00 1,500.00 208,230.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
02-01-2005 14,259.00 139.00 140.00 139.43 139.40 24.00 2,750.00 383,665.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
03-01-2005 14,259.00 138.50 139.00 138.70 138.60 26.00 3,600.00 499,300.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
04-01-2005 14,259.00 135.20 138.50 136.76 136.70 23.00 2,300.00 314,865.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
I am trying to plot the 6th column (the one titled "Close"), and I typed the following commands.
batbc <- read.csv("batbc.csv")
plot(batbc[, 6], type="l")

The problem is the commas used as thousands separators. There are a few ways of solving this, but the neatest I've seen is from another SO answer.
For your data in particular, you need to do this:
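# Define a column class that strips thousands separators before converting
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))
# The header shown above has 10 columns: Date, 8 numeric columns, Company
batbc <- read.csv("batbc.csv",
                  colClasses = c("character", rep("num.with.commas", 8), "character"))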
It should then work fine.
Note that with the commas in place, the numbers are read as character and then converted to factors, which was the default behaviour of read.csv before R 4.0.0. When you plot a factor, you get a bar chart of the level counts, which looks like a histogram. In that context, type = "l" is ignored with a warning.
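To see the effect without the original file, here is a minimal sketch on a made-up inline sample. The names csv_text, raw and volume are only for illustration, and the sample assumes the comma-formatted numbers are quoted in the CSV:

```r
# Hypothetical inline sample; comma-formatted numbers are quoted
csv_text <- 'Date,Close,Volume
01-01-2005,138.80,"1,500.00"
02-01-2005,139.40,"2,750.00"
03-01-2005,138.60,"3,600.00"'

raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Close parses as numeric, but Volume stays character because of the commas
class(raw$Close)
class(raw$Volume)

# Strip the separators first, then convert
volume <- as.numeric(gsub(",", "", raw$Volume, fixed = TRUE))
```

Once the column is numeric, plot(volume, type = "l") draws a line as expected.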

You need to read the csv with automatic factor conversion turned off.
Then you need to get rid of the thousands comma separator in that column (or for any relevant column).
Then coerce the character column to numeric. Coercing directly, without first removing the thousands separators, produces NA for every value that contains a comma.
Next you can plot normally.
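batbc <- read.csv("batbc.csv", as.is = TRUE)   # keep strings as character, not factor
batbc$Close <- gsub(",", "", batbc$Close)      # drop the thousands separators
batbc$Close <- as.numeric(batbc$Close)         # now safe to convert
plot(batbc[, 6], type = "l")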
HTH.

Related

How to reformat similar text for merging in R?

I am working with the NYC open data, and I want to merge two data frames based on community board. The issue is that the two data frames have slightly different ways of representing this. I have provided an example of the two different formats below.
CommunityBoards <- data.frame(FormatOne = c("01 BRONX", "05 QUEENS", "15 BROOKLYN", "03 STATEN ISLAND"),
FormatTwo = c("BRONX COMMUNITY BOARD #1", "QUEENS COMMUNITY BOARD #5",
"BROOKLYN COMMUNITY BOARD #15", "STATEN ISLAND COMMUNITY BD #3"))
Along with the issue of the placement of the numbers and the "#", the second data frame shortens "COMMUNITY BOARD" to "COMMUNITY BD" just on Staten Island. I don't have a strong preference of what string looks like, so long as I can discern borough and community board number. What would be the easiest way to reformat one or both of these strings so I could merge the two sets?
Thank you for any and all help!
You can use a regex to extract just the district numbers. For the first format, the only thing that matters is the beginning of the string before the space, hence you could do
districtsNrs1 <- as.numeric(gsub("(\\d+) .*","\\1",CommunityBoards$FormatOne))
For the second, I assume that every entry looks like "something #number", hence you could do
districtsNrs2 <- as.numeric(gsub(".* #(\\d+)","\\1",CommunityBoards$FormatTwo))
to get the pure district numbers.
Now you know how to extract the district numbers. With that information, you can name/reformat the district-names how you want.
To know which district number is which district, you can create a translation data.frame between the districts and numbers like
districtNumberTranslations <- data.frame(
  districtNumber = districtsNrs2,
  # as.character() guards against factor columns (the default before R 4.0.0)
  districtName = sapply(strsplit(as.character(CommunityBoards$FormatTwo), " COMMUNITY "), "[[", 1)
)
giving
#   districtNumber  districtName
# 1              1         BRONX
# 2              5        QUEENS
# 3             15      BROOKLYN
# 4              3 STATEN ISLAND
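Since the same district number occurs in several boroughs, the number alone is not enough for the merge. Here is a rough sketch that builds a combined "BOROUGH number" key for each format and merges on it. The two data frames, their value columns (complaints, population) and the helper names are made up for illustration:

```r
# Hypothetical data frames, one per format
boards_a <- data.frame(
  board = c("01 BRONX", "05 QUEENS", "03 STATEN ISLAND"),
  complaints = c(10, 20, 30),
  stringsAsFactors = FALSE
)
boards_b <- data.frame(
  board = c("BRONX COMMUNITY BOARD #1", "STATEN ISLAND COMMUNITY BD #3",
            "QUEENS COMMUNITY BOARD #5"),
  population = c(100, 300, 200),
  stringsAsFactors = FALSE
)

# Format one: "<number> <BOROUGH>"
make_key_a <- function(x) {
  nr <- as.integer(sub("^(\\d+) .*$", "\\1", x))
  borough <- sub("^\\d+ ", "", x)
  paste(borough, nr)
}

# Format two: "<BOROUGH> COMMUNITY BOARD #<number>" (or "BD")
make_key_b <- function(x) {
  nr <- as.integer(sub("^.*#(\\d+)$", "\\1", x))
  borough <- sub(" COMMUNITY .*$", "", x)
  paste(borough, nr)
}

boards_a$key <- make_key_a(boards_a$board)
boards_b$key <- make_key_b(boards_b$board)

merged <- merge(boards_a, boards_b, by = "key")
```

Converting the number with as.integer() drops the leading zero in "01", so both formats end up with keys like "BRONX 1".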

Plot a histogram of subset of a data

(Image: screenshot of the .txt file of the data.)
The data consists of 2,075,259 rows and 9 columns
Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
Only data from the dates 2007-02-01 and 2007-02-02 is needed.
I was trying to plot a histogram of "Global_active_power" in the above mentioned dates.
Note that in this dataset missing values are coded as "?".
This is the code I was using to plot the histogram:
{
data <- read.table("household_power_consumption.txt", header=TRUE)
my_data <- data[data$Date %in% as.Date(c('01/02/2007', '02/02/2007'))]
my_data <- gsub(";", " ", my_data) # replace ";" with " "
my_data <- gsub("?", "NA", my_data) # convert "?" to "NA"
my_data <- as.numeric(my_data) # turn into numbers
hist(my_data["Global_active_power"])
}
After running the code it is showing this error:
Error in hist.default(my_data["Global_active_power"]) :
invalid number of 'breaks'
Can you please help me spot the mistake in the code?
Link of the data file : https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip
You need to provide the separator (";") explicitly, and your types aren't what you think they are. Observe:
data <- read.table("household_power_consumption.txt", header=TRUE, sep=';', na.strings='?')
data$Date <- as.Date(data$Date, format='%d/%m/%Y')
bottom.date <- as.Date('01/02/2007', format='%d/%m/%Y')
top.date <- as.Date('02/02/2007', format='%d/%m/%Y')
# inclusive bounds, so that both days are kept; column 3 is Global_active_power
my_data <- data[data$Date >= bottom.date & data$Date <= top.date, 3]
hist(my_data)
This gives the histogram as the plot. Hope that helps.
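The same steps can be tried on a few made-up rows without downloading the file. This is only a sketch: the sample values and the names txt, power, wanted and subset_power are invented for illustration.

```r
# Hypothetical sample in the same layout: ";" separated, "?" for missing
txt <- "Date;Time;Global_active_power
31/01/2007;23:59:00;1.2
01/02/2007;00:00:00;0.326
01/02/2007;00:01:00;?
02/02/2007;00:00:00;0.5
03/02/2007;00:00:00;2.1"

power <- read.table(text = txt, header = TRUE, sep = ";",
                    na.strings = "?", stringsAsFactors = FALSE)

# The dates are day/month/year, so the format must be given explicitly
power$Date <- as.Date(power$Date, format = "%d/%m/%Y")

# Keep only the two days of interest
wanted <- as.Date(c("2007-02-01", "2007-02-02"))
subset_power <- power[power$Date %in% wanted, ]
```

Because "?" was declared in na.strings, Global_active_power comes in as numeric, and hist(subset_power$Global_active_power) can be called directly.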
Given you have 2m rows (though not too many columns), you're firmly in fread territory.
Here's how I would do what you want:
library(data.table)
data<-fread("household_power_consumption.txt",sep=";", #1
na.strings=c("?","NA"),colClasses="character" #2
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% seq(from=as.Date("2007-02-01"), #3
to=as.Date("2007-02-02"),by="day")]
numerics<-setdiff(names(data),c("Date","Time")) #4
data[,(numerics):=lapply(.SD,as.numeric),.SDcols=numerics]
data[,hist(Global_active_power)] #5
A brief explanation of what's going on
1: See the data.table vignettes for great introductions to the package. Here, given the structure of your data, we tell fread up front that ; is what separates fields (which is nonstandard)
2: We can tell fread up front that it can expect ? in some of the columns and should treat them as NA--e.g., here's data[8640] before setting na.strings:
         Date     Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00                   ?                     ?       ?                ?              ?              ?             NA
Once we set na.strings, we sidestep having to replace ? with NA later:
         Date     Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00                  NA                    NA      NA               NA             NA             NA             NA
On the other hand, we also have to read those fields as characters, even though they're numeric. This is something I'm hoping fread will be able to handle automatically in the future.
3: data.table commands can be chained (from left to right); I'm using this to subset the data before it's assigned. It's up to you whether you find that more or less readable, as there are only marginal performance differences.
4: Since we had to read the numeric fields as strings, we now recast them as numeric; this is the standard data.table syntax for doing so.
5: Once we've got our data subset as we like and of the right type, we can pass hist as an argument in j and get what we want.
Note that if all you wanted from this data set was the histogram, you could have condensed the code a bit:
ok_dates<-seq(from=as.Date("2007-02-01"),
to=as.Date("2007-02-02"),by="day")
fread("household_power_consumption.txt",sep=";",
select=c("Date","Global_active_power"),
na.strings=c("?","NA"),colClasses="character"
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% ok_dates,hist(as.numeric(Global_active_power))]

remove degree symbol from numeric values in data frame

I have some temperature measurements in .csv format and am trying to analyse them in R. For some reason the data files contain the temperature with degree C following the numeric value. Is there a way to remove the degree C symbol and return the numeric value? I thought of producing an example here but did not know how to generate a degree symbol in a string in R. Anyhow, this is what the data looks like:
> head(mm)
dateTime Temperature
1 2009-04-23 17:01:00 15.115 °C
2 2009-04-23 17:11:00 15.165 °C
3 2009-04-23 17:21:00 15.183 °C
where the class of mm[,2] is 'factor'
Can anyone suggest a method for converting the second column to 15.115 etc?
You can remove the unwanted part and convert the rest to numeric all at the same time with scan(). Setting flush = TRUE treats the last field (after the last space) as a comment and it gets discarded (since sep expects whitespace separators by default).
mm <- read.table(text = "dateTime Temperature
1 '2009-04-23 17:01:00' '15.115 °C'
2 '2009-04-23 17:11:00' '15.165 °C'
3 '2009-04-23 17:21:00' '15.183 °C'", header = TRUE)
replace(mm, 2, scan(text = as.character(mm$Temp), flush = TRUE))
# dateTime Temperature
# 1 2009-04-23 17:01:00 15.115
# 2 2009-04-23 17:11:00 15.165
# 3 2009-04-23 17:21:00 15.183
Or you can use a Unicode general category to match the unicode characters for the degree symbol.
type.convert(sub("\\p{So}C", "", mm$Temp, perl = TRUE))
# [1] 15.115 15.165 15.183
Here, the regular expression \p{So} matches various symbols that are not math symbols, currency signs, or combining characters. C matches the character C literally (case sensitive). And type.convert() takes care of the extra whitespace.
If all of your temperature values have the same number of digits you can make left and right functions (similar to those in Excel) to select the digits that you want. Such as in this answer from a different post: https://stackoverflow.com/a/26591121/4459730
First make the left function:
left = function (string,char){
substr(string,1,char)
}
Then recreate your Temperature string using just the digits you want:
mm$Temperature <- as.numeric(left(mm$Temperature, 6))  # keep the first 6 characters, then convert to numeric
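The degree symbol is represented as \u00b0, so it can be matched directly in a replacement. Note that the following line is Python (pandas) syntax, not R, and it strips only the degree symbol, leaving the trailing C in place: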
df['Temperature'] = df['Temperature'].replace('\u00b0','', regex=True)
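In R itself, the same \u00b0 escape can be used in a string literal. Here is a minimal sketch on made-up values (temps, cleaned and values are just illustrative names); if the column is a factor, as in the question, wrap it in as.character() first:

```r
# Hypothetical sample values; "\u00b0" is the degree sign
temps <- c("15.115 \u00b0C", "15.165 \u00b0C", "15.183 \u00b0C")

# Remove the literal degree-C suffix, trim the leftover space, then convert
cleaned <- trimws(sub("\u00b0C", "", temps, fixed = TRUE))
values <- as.numeric(cleaned)
```

Using fixed = TRUE treats the pattern as a plain string, so no regex escaping is needed.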

Using abline() when x-axis is date (ie, time-series data)

I want to add multiple vertical lines to a plot.
Normally you would specify abline(v=x-intercept) but my x-axis is in the form Jan-95 - Dec-09. How would I adapt the abline code to add a vertical line for example in Feb-95?
I have tried abline(v=as.Date("Jan-95")) and other variants of this piece of code.
Following on from this, is it possible to add multiple vertical lines with one piece of code, for example at Feb-95, Feb-97 and Jan-98?
An alternative solution could be to alter my plot. I have a column with the month and a column with the year; how do I combine these to get a year-month on the x-axis?
example[25:30,]
Year Month YRM TBC
25 1997 1 Jan-97 136
26 1997 2 Feb-97 157
27 1997 3 Mar-97 163
28 1997 4 Apr-97 152
29 1997 5 May-97 151
30 1997 6 Jun-97 170
The first note: your YRM column is probably a factor, not a datetime object, unless you converted it manually. I assume we do not want to do that and our plot is looking fine with YRM as a factor.
In that case
# match() returns the position of each requested month among the levels
vline_month <- function(s) abline(v = match(s, levels(df$YRM)))
# keep original order of levels
df$YRM <- factor(df$YRM, levels=unique(df$YRM))
plot(df$YRM, df$TBC)
vline_month(c("Jan-97", "Apr-97"))
Disclaimer: this solution is a quick hack; it is neither universal nor scalable. For accurate representation of datetime objects and extensible tools for them, see packages zoo and xts.
I see two issues:
a) converting your data to a date/POSIX element, and
b) actually plotting vertical lines at specific rows.
For the first, create a proper date string then use strptime().
The second issue is resolved by converting the POSIX date to numeric using as.numeric().
# dates need Y-M-D
example$ymd <- paste(example$Year, '-', example$Month, '-01', sep='')
# convert to POSIX date
example$ymdPX <- strptime(example$ymd, format='%Y-%m-%d')
# may want to define tz otherwise system tz is used
# plot your data
plot(example$ymdPX, example$TBC, type='b')
# add vertical lines at first and last record
abline(v=as.numeric(example$ymdPX[1]), lwd=2, col='red')
abline(v=as.numeric(example$ymdPX[nrow(example)]), lwd=2, col='red')
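For the multiple-lines part of the question, abline() accepts a vector for v, so several dates can be passed at once. Here is a small sketch using the Date class instead of POSIX. The rebuilt example data frame and the names date and marks are assumptions for illustration, and parsing "%b" assumes an English locale for the month abbreviations:

```r
# Rebuild a small version of the data shown in the question
example <- data.frame(Year = 1997, Month = 1:6,
                      TBC = c(136, 157, 163, 152, 151, 170))

# Combine year and month into a Date on the first day of each month
example$date <- as.Date(paste(example$Year, example$Month, "01", sep = "-"))

# Parse labels such as "Feb-97" by prepending a day (English locale assumed)
marks <- as.Date(paste("01", c("Feb-97", "Apr-97"), sep = "-"),
                 format = "%d-%b-%y")

plot(example$date, example$TBC, type = "b")
abline(v = marks, col = "red")  # one call, several vertical lines
```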

R - 'NA' text treated as N/A

I have a data frame in R including country iso codes. The iso code for Namibia happens to be 'NA'. R treats this text 'NA' as N/A.
For example the code below gives me the row with Namibia.
test <- subset(country.info,is.na(country.info$iso.code))
I initially thought it might be a factor issue, so I made sure the iso code column is character. But this didn't help.
How can this be solved?
This probably relates to how you read in the data. Just because it's character doesn't mean your "NA" isn't an NA, e.g.:
z <- c("NA",NA,"US")
class(z)
#[1] "character"
You could confirm this by giving us a dput() of (part of) your data.
When you read in your data, try changing na.strings = "NA" (e.g., in read.csv) to something else and see if it works.
For example, with na.strings = "":
read.table(text="code country
NA Namibia
DE Germany
FR France", stringsAsFactors=FALSE, header=TRUE, na.strings="")
#   code country
# 1   NA Namibia
# 2   DE Germany
# 3   FR France
Make sure that using "" doesn't change anything else in your data. Otherwise, you can use a string that will definitely not occur in your file, such as "z_z_z". You can replace the text = ... argument with your file name.
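To check the difference yourself, here is a short sketch comparing the default na.strings with na.strings = "" on a made-up CSV. The names csv_text, default_read and kept_read, and the row with an empty code, are only for illustration:

```r
# Hypothetical sample: "NA" is Namibia's code, the last row has no code at all
csv_text <- "code,country
NA,Namibia
DE,Germany
,Unknown"

# Default: the text "NA" is turned into a missing value
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# na.strings = "": "NA" is kept as text, only empty fields become missing
kept_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                      na.strings = "")

is.na(default_read$code)
is.na(kept_read$code)
```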
If Thomas' solution doesn't work you can always use the countrycode package to change your countrycodes to something that causes fewer problems.
In your case from ISO2-character to ISO3-character for instance.
library(countrycode)
country.info$iso.code <- countrycode(country.info$iso.code, "iso2c", "iso3c", warn = TRUE)
If iso2c causes problems, use country.name as the origin instead, hoping the Republic of Congo and the Democratic Republic of Congo don't mess things up.
