remove degree symbol from numeric values in data frame - r

I have some temperature measurements in .csv format and am trying to analyse them in R. For some reason the data files contain the temperature with "°C" following the numeric value. Is there a way to remove the degree C symbol and return the numeric value? I thought of producing an example here but did not know how to generate a degree symbol in a string in R. Anyhow, this is what the data looks like:
> head(mm)
dateTime Temperature
1 2009-04-23 17:01:00 15.115 °C
2 2009-04-23 17:11:00 15.165 °C
3 2009-04-23 17:21:00 15.183 °C
where the class of mm[,2] is 'factor'
Can anyone suggest a method for converting the second column to 15.115 etc?

You can remove the unwanted part and convert the rest to numeric in one step with scan(). With flush = TRUE, scan() reads the fields it was asked for (here a single number per line) and discards the rest of the line, so the trailing "°C" is dropped. This works because scan() splits on whitespace by default.
mm <- read.table(text = "dateTime Temperature
1 '2009-04-23 17:01:00' '15.115 °C'
2 '2009-04-23 17:11:00' '15.165 °C'
3 '2009-04-23 17:21:00' '15.183 °C'", header = TRUE)
replace(mm, 2, scan(text = as.character(mm$Temp), flush = TRUE))
# dateTime Temperature
# 1 2009-04-23 17:01:00 15.115
# 2 2009-04-23 17:11:00 15.165
# 3 2009-04-23 17:21:00 15.183
Or you can use a Unicode general category to match the unicode characters for the degree symbol.
type.convert(sub("\\p{So}C", "", mm$Temp, perl = TRUE))
# [1] 15.115 15.165 15.183
Here, the regular expression \p{So} matches various symbols that are not math symbols, currency signs, or combining characters. C matches the character C literally (case sensitive). And type.convert() takes care of the extra whitespace.

If all of your temperature values have the same number of digits, you can write left and right functions (similar to those in Excel) to select the digits you want, as in this answer from a different post: https://stackoverflow.com/a/26591121/4459730
First make the left function:
left <- function(string, char) {
  substr(string, 1, char)
}
Then rebuild the Temperature column from just the characters you want, converting the result to numeric:
mm$Temperature <- as.numeric(left(as.character(mm$Temperature), 6))

The degree symbol is represented as \u00b0, so the following code should work (note that this is pandas syntax in Python, not R):
df['Temperature'] = df['Temperature'].replace('\u00b0','', regex=True)
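For an R data frame, a comparable approach might look like the sketch below. It assumes the column holds values such as "15.115 °C"; the sample data frame is made up for illustration. Writing the symbol as the escape "\u00b0" also avoids having to type a degree sign in R source.

```r
# Illustrative data; "\u00b0" is the Unicode escape for the degree sign.
mm <- data.frame(
  dateTime = c("2009-04-23 17:01:00", "2009-04-23 17:11:00"),
  Temperature = paste(c("15.115", "15.165"), "\u00b0C"),
  stringsAsFactors = TRUE  # mimic the factor column from the question
)

# Factor -> character, strip the trailing " °C", then parse as numeric.
mm$Temperature <- as.numeric(
  sub("\\s*\u00b0C$", "", as.character(mm$Temperature))
)
```

After this, mm$Temperature is a plain numeric column that can be summarised or plotted directly.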

Related

R insert a comma after a value

On a UNIX system, I can easily do a global change in a file. E.g., let's say I have a year value of /2021 that is unique in the file and associated with the date, so I can do a global change and insert a comma after the /2021. This then lets me read the file into R using comma delimiters. Is there any way I can read a string, e.g.
7/06/2021 23:45, and change that to 7/06/2021, 23:45 in R running on Windows?
Thanks.
The data is as follows with previous columns removed and linefeeds inserted to show the data as a list.
ReadingDate Units Read.Type
08/06/2021 0:00 0 Actual
07/06/2021 23:45 0 Actual
07/06/2021 23:30 0 Actual
07/06/2021 23:15 0 Actual
07/06/2021 23:00 0 Actual
07/06/2021 22:45 0 Actual
ReadingDate is the date and time, so there are three columns. I would like four with time separated from date via a comma.
If your input would always be a date and time component, separated by a single space, then just use sub here in fixed mode:
date <- "7/06/2021 23:45"
output <- sub(" ", ", ", date, fixed=TRUE)
output
[1] "7/06/2021, 23:45"
To apply the above logic to a data frame column, use:
df$ReadingDate <- sub(" ", ", ", df$ReadingDate, fixed=TRUE)
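If the end goal is four columns rather than a comma inside the string, the date and time can also be split directly. This is only a sketch: it assumes every ReadingDate value contains exactly one space, and the small data frame below is made up from the sample rows in the question.

```r
# Made-up sample mirroring the question's three columns.
df <- data.frame(
  ReadingDate = c("08/06/2021 0:00", "07/06/2021 23:45"),
  Units = c(0, 0),
  Read.Type = c("Actual", "Actual"),
  stringsAsFactors = FALSE
)

# Split each value at the single space; rbind gives a 2-column character matrix.
parts <- do.call(rbind, strsplit(df$ReadingDate, " ", fixed = TRUE))
df$Date <- parts[, 1]
df$Time <- parts[, 2]
df$ReadingDate <- NULL  # drop the combined column
```

The result has the columns Units, Read.Type, Date and Time.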

How to remove decimal points from dataframe column?

I have a data frame read from a .csv file in which one of the columns is a ZIP code. The ZIP code is a factor. Here is an example:
Country <- c("US", "US", "US", "CAN", "CAN")
ZIP <- c("00210", "01210", "65483.0", "H3P", "H3P3C")
data <- data.frame(Country, ZIP, stringsAsFactors = TRUE)
I did the following but the output is not what I want:
data$ZIP <- round(as.numeric(as.character(data$ZIP)), 0)
It removed the decimals, but the ZIP codes 00210 and 01210 became 210 and 1210, and the Canadian codes became NA. I want to keep the US ZIP codes at 5 digits and preserve the Canadian codes.
How can I do that?
Thank you.
Try this
data$ZIP <- sub("\\.\\d+$", "", data$ZIP)
# Country ZIP
# 1 US 00210
# 2 US 01210
# 3 US 65483
# 4 CAN H3P
# 5 CAN H3P3C
Explanation
From the help page, a typical usage of sub is
sub(pattern, replacement, x)
x is a character vector where matches are sought...
In our case x will be the ZIP column (the values of the ZIP column, to be specific).
The pattern is ("\\.\\d+$"):
\\. matches the dot
\\d+ matches one or more numeric characters
$ matches the end of the input string.
The replacement pattern is "".
So everything from the dot through the trailing digits at the end of the string is replaced with an empty string.
For example
sub("\\.\\d+$", "", 21358.222)
# "21358"
Hope that helps.
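If the data comes from a .csv file, the leading zeros can also be protected at read time by forcing the column to character. A minimal sketch, assuming the file has a column named ZIP (the inline text below stands in for the real file):

```r
# Inline text standing in for the real .csv file.
csv_text <- "Country,ZIP
US,00210
US,01210
US,65483.0
CAN,H3P
CAN,H3P3C"

# A named colClasses entry forces ZIP to be read as character.
zips <- read.csv(text = csv_text, colClasses = c(ZIP = "character"))

# Same sub() call as in the answer above: drop a trailing ".digits" part.
zips$ZIP <- sub("\\.\\d+$", "", zips$ZIP)
```

With a real file, the text argument would be replaced by the file path.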

How to read OMI/MLS unstructured ASCII data in R?

I am trying to read the following data in R. I could find code written for numpy, but none for R.
According to the link: "After skipping the first three lines, each following 12 lines contains one row of the total 2D array. It's concatenated integers of three digits each." Reading ASCII in R needs a separator keyword, but I don't have a separator here. Any help would be really appreciated.
April 2008 OMI/MLS Tropo O3 Column (Dobson Units) X 10
Longitudes: 288 bins centered on 179.375W to 179.375E (1.25 degree steps)
Latitudes: 120 bins centered on -59.5S to 59.5N (1.00 degree steps)
166217223211241326285354290338228257199268342304270284206277284257222239296
260308184247224202214256242220241234247297278270237201247186238239284263212
201204206191184186177199218199234242242205240231246269233208210227275242213
246222272320276248166201197200188173209231199198234238231188250228285253247
249256181180233249252211210184212242242179287233269310305315234236228135128
117219226194243273280321323389427283352446431379385319265243288309242222346
289306281288271267274235277306306273250235252264274260300304322305357336297
281286324324316347308296323294256289246239197256279298275340317337289322356
315286280281276206228225211220251255244210208232239280274268237252238226181
209246254242274248252250240262253226223229204204228197200209245263253247260
267313334255264206163165188220261300266266240247224208223213262248169260298
208356315279242223222168245286303325329 lat = -59.5
312293362274234235237227305283172142294249334273287306283264267228238348370
314255273250279264261285254275261254246249282223225206214199241236244254276
224224232214225226205192175195204202183176194231285284274277284264275286265
242253247187264238203212213181184188219214221192227209233185217255254325236
206328342260262279277296265222275339239319292237352279311238273169141268188
222266281238251310281268268326363389305366312251281289443469386293271355354
342301281240278232224246244218213251291257222242233254230300323350348289258
256264309269273251241264273292271275236245250280277292280299335323345348367
352358346323265265254246235235266260258206221218212232255259260258246233233
223240299281223208269260233246261257250238232236238210215210218259279291298
304301291271240244205193221244252287306297278214235176261322276282196239296
339340295266256241263348321274309308267 lat = -58.5
Edited: I manipulated the original data. First, I removed the first three lines using "sed '1,3d' -i file.txt", which is not strictly necessary. Second, I removed the leading blank column so that the data is consistently three digits per value, using "sed -i 's/^ *//g' file.txt". Then, following Andrew Gustar's suggestion, I tried read.fwf. Now it reads only the first three characters of each line, whereas I want every group of three digits read as a separate value. As a result, the output has only one column.
data=read.fwf(fils[1],skip = 0,widths = 3,
comment.char = "l")
It is part of the output:
head(data)
V1
1 166
2 260
3 201
4 246
5 249
6 117
They are two digits long while they should be three.
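One possible way to get every three-digit group as its own value is to skip read.fwf and split the text manually. The sketch below is not taken from the linked numpy code: the function name parse_fixed_width_grid and its arguments are hypothetical, and it assumes each latitude row spans a fixed number of lines and ends with a "lat = ..." label, as in the sample above. It is shown on a tiny made-up input rather than the real file.

```r
# Hypothetical helper: turn lines of concatenated fixed-width integers into a matrix.
parse_fixed_width_grid <- function(lines, header_lines = 3,
                                   lines_per_row = 12, width = 3) {
  body <- lines[seq_along(lines) > header_lines]  # skip the header lines
  body <- sub("\\s*lat\\s*=.*$", "", body)        # drop trailing "lat = ..." labels
  body <- gsub("\\s", "", body)                   # drop the leading blank column
  body <- body[nzchar(body)]                      # ignore empty lines

  # Glue together the lines that belong to the same latitude row.
  group <- ceiling(seq_along(body) / lines_per_row)
  joined <- vapply(split(body, group), paste, character(1), collapse = "")

  # Cut each joined string into chunks of `width` characters.
  rows <- lapply(joined, function(s) {
    starts <- seq(1, nchar(s), by = width)
    as.integer(substring(s, starts, starts + width - 1))
  })
  do.call(rbind, unname(rows))
}

# Tiny made-up input: 2 header lines, 2 lines per row, 5 values per row.
demo <- c("header 1", "header 2",
          " 001002003", " 004005 lat = -59.5",
          " 011012013", " 014015 lat = -58.5")
grid <- parse_fixed_width_grid(demo, header_lines = 2, lines_per_row = 2)
```

For the real file, calling parse_fixed_width_grid(readLines("file.txt")) with the defaults (3 header lines, 12 lines per row) should give a 120 x 288 matrix if the layout matches the header description. Dividing by 10 would then undo the "X 10" scaling mentioned in the first header line.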

Combine separate column/rows as one column/row in R

I have this dataset in a .txt file:
Xenopsylla cheopis Echinolaelaps sp.
Maxomys rajah 1 3
Callosciurus prevostii borneensis 4 2
Using this function,
test<-read.table("data.txt",header=T)
Xenopsylla cheopis Echinolaelaps sp.
Maxomys rajah 1 3
Callosciurus prevostii borneensis 4 2
R seems to treat the words in each name as separate columns and produces this error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 2 did not have 4 elements
I tried to use textConnection, but it does not seem to produce what I want.
First of all just store your data in a character vector as I did here:
test<-readChar("C:/Users/Julian/Downloads/file.txt", file.info("C:/Users/Julian/Downloads/file.txt")$size)
Obviously, you need to replace the path of my file with yours.
Then you get rid of the space between Genus and Species using gsub()
test<-gsub("([[:lower:]])([[:space:]])([[:lower:]])", "\\1\\3",test)
Finally, you can read your data using read.table() with the text argument:
a<-read.table(text=test,sep="\t",header=TRUE,row.names = 1)
a
Xenopsyllacheopis Echinolaelapssp. Ixodessp.
Maxomysrajah 3 8 9
Callosciurusprevostiiborneensis 5 7 1
Sundamysmuelleri 3 5 7
Niviventercremoriventer 6 8 9
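To see what the gsub() step does in isolation, here is a small sketch on two made-up tab-separated lines. The underscore variant at the end is an assumption of mine, not part of the answer; it keeps the names readable while still removing the space.

```r
# Two made-up tab-separated lines in the style of the data file.
x <- c("Xenopsylla cheopis\tEchinolaelaps sp.", "Maxomys rajah\t1\t3")

# Same pattern as above: lowercase letter, whitespace, lowercase letter.
# Keeping groups 1 and 3 drops the whitespace in between.
joined <- gsub("([[:lower:]])([[:space:]])([[:lower:]])", "\\1\\3", x)

# Variant: replace the space with an underscore instead of removing it.
underscored <- gsub("([[:lower:]]) ([[:lower:]])", "\\1_\\2", x)
```

The tabs are left alone because each one is followed by an uppercase letter or a digit, so read.table() can still split on them.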
EDIT:
To answer OP's new question in the comments:
"([[:lower:]])([[:space:]])([[:lower:]])"
enables us to find all the parts of the strings that we created with readChar() that match this pattern. This pattern is: a lowercase letter followed by a blank space followed by a lowercase letter.
This matches the space between a genus and its species name, but not the gap between a species name and the following genus, because a genus starts with an uppercase letter.
Now the "\\1\\3" part means that we keep the first and third groups of our
"([[:lower:]])([[:space:]])([[:lower:]])" pattern, that is, the two ([[:lower:]]) groups. Because there is no space between \\1 and \\3 in "\\1\\3", they are joined without a space. Therefore we get Genusspecies instead of Genus species.

Why cannot I plot the graph in R?

I am having trouble plotting the graph. Every time I try to plot it, I get a histogram instead of a line graph.
I have attached the link to the csv file - https://docs.google.com/spreadsheets/d/1qaTqw9sSoOpeKIa5GnHr2cJ2_DKBb1-89eTukTtrKOQ/edit?usp=sharing
First 4 lines of data
Date Comid Low High Average Close Trdno Volume Turnover Company
01-01-2005 14,259.00 138.60 139.10 138.84 138.80 14.00 1,500.00 208,230.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
02-01-2005 14,259.00 139.00 140.00 139.43 139.40 24.00 2,750.00 383,665.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
03-01-2005 14,259.00 138.50 139.00 138.70 138.60 26.00 3,600.00 499,300.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
04-01-2005 14,259.00 135.20 138.50 136.76 136.70 23.00 2,300.00 314,865.00 BRITISH AMERICAN TOBACCO BANGLADESH COMPANY LIMITED
I am trying to plot the 6th column (the one titled "Close") and I typed the following commands.
batbc <- read.csv("batbc.csv")
plot(batbc[, 6], type="l")
The problem is the commas as thousand separators. There are a few ways of solving this, but the neatest I've seen is from another SO answer.
For your data in particular, you need to do this:
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))
# 10 columns: Date, then 8 numeric columns (Comid to Turnover), then Company
batbc <- read.csv("batbc.csv",
                  colClasses = c("character", rep("num.with.commas", 8), "character"))
It should then work fine.
Note that with the commas in place, the numbers are read as character and then converted to factors by the default behaviour of read.csv (in R versions before 4.0). When you plot a factor, you get a bar chart of level counts, which looks like a histogram. In that context, the type = "l" is ignored with a warning.
You need to read the csv with automatic factor conversion turned off.
Then you need to get rid of the thousands comma separator in that column (or for any relevant column).
Then coerce the character column to numeric. Directly coercing to numeric without thousands comma separator being handled will generate NA for rows having comma in.
Next you can plot normally.
batbc <- read.csv("batbc.csv", as.is = TRUE)  # keep strings as character
batbc$Close <- gsub(",", "", batbc$Close)     # remove thousands separators
batbc$Close <- as.numeric(batbc$Close)
plot(batbc[, 6], type = "l")
HTH.
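For readers without the CSV file at hand, the same steps can be tried on a small inline sample. This is a sketch under assumptions: the three rows below are cut down from the sample rows in the question, and the Volume column is used because it is the one that actually contains thousands separators in that sample.

```r
# Inline sample standing in for the real file; quoted fields contain commas.
csv_text <- 'Date,Close,Volume
01-01-2005,138.80,"1,500.00"
02-01-2005,139.40,"2,750.00"
03-01-2005,138.60,"3,600.00"'

# Read strings as character rather than factor.
batbc_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Strip the thousands separators, then coerce to numeric.
batbc_demo$Volume <- as.numeric(gsub(",", "", batbc_demo$Volume, fixed = TRUE))
```

Once the column is numeric, plot(batbc_demo$Volume, type = "l") draws a line graph as expected.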
