How to remove decimal points from dataframe column? - r

I have a data frame, read from a .csv file, in which one of the columns is a ZIP code. The ZIP code column is a factor. Here is an example:
Country<- c("US","US","US","CAN","CAN")
ZIP<- c("00210","01210","65483.0","H3P","H3P3C")
data<- data.frame(Country,ZIP)
I did the following but the output is not what I want:
data$ZIP<-round(as.numeric(as.character(data$ZIP)), 0)
That removed the decimals, but now the ZIP codes 00210 and 01210 became 210 and 1210, and the Canadian codes became NA. I want to keep the US ZIP codes at 5 digits and preserve the Canadian codes.
How can I do that?
Thank you.

Try this
data$ZIP <- sub("\\.\\d+$", "", data$ZIP)
# Country ZIP
# 1 US 00210
# 2 US 01210
# 3 US 65483
# 4 CAN H3P
# 5 CAN H3P3C
Explanation
From the help page, a typical usage of sub is
sub(pattern, replacement, x)
x is a character vector where matches are sought...
In our case, x is the ZIP column (more precisely, its values).
The pattern is "\\.\\d+$":
\\. matches a literal dot
\\d+ matches one or more digits
$ matches the end of the input string.
The replacement is "" (an empty string).
So a dot followed by digits at the end of the string is replaced with an empty string, i.e. removed.
For example
sub("\\.\\d+$", "", 21358.222)
# "21358"
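As a minimal sketch, the same call can be run end to end on the example data. It assumes the ZIP values are quoted strings and that the column is a factor (forced here with stringsAsFactors = TRUE, since newer R versions no longer create factors by default); as.character() is added only to make the conversion explicit.

```r
# Example data from the question, with ZIP stored as a factor (assumption).
Country <- c("US", "US", "US", "CAN", "CAN")
ZIP <- c("00210", "01210", "65483.0", "H3P", "H3P3C")
data <- data.frame(Country, ZIP, stringsAsFactors = TRUE)

# Strip a trailing ".digits" part; leading zeros and letters are left alone
# because the values are never converted to numbers.
data$ZIP <- sub("\\.\\d+$", "", as.character(data$ZIP))
print(data$ZIP)
```

The column comes back as a character vector, which is usually what you want for ZIP codes.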
Hope that helps.

Related

Trying to remove "ZCTA" from rows

I am trying to extract only the zip code values from my imported ACS data file, however, the rows all include "ZCTA" before the 5 digit zip code. Is there a way to remove that so just the 5 digit zip code remains?
Example:
I tried using strtrim on the data but I can't figure out how to target the last 5 digits. I imagine there is a function or loop that could also do this, since the dataset is so large.
To remove "ZCTA5":
gsub("ZCTA5", "", df$zip) # df - your data.frame name
or
library(stringr)
str_replace(df$zip,"ZCTA5","")
To extract ZIP CODE:
str_sub(df$zip,-5,-1)
Here are a few others for fun:
#option 1
stringr::str_extract(df$zip, "(?<=\\s)\\d+$")
#option 2
gsub("^.*\\s(\\d+)$", "\\1", df$zip)
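The question's example data is not shown, so the following is only a sketch on made-up values. It assumes each entry looks like "ZCTA5 " followed by a 5-digit code; zip_raw is a hypothetical stand-in for df$zip. It uses base R only.

```r
# Hypothetical input; the real column layout is an assumption.
zip_raw <- c("ZCTA5 00210", "ZCTA5 65483")

# Keep only the 5 digits at the end of each string.
zip_regex <- sub("^.*\\s(\\d{5})$", "\\1", zip_raw)

# Same result by position: take the last 5 characters.
zip_tail <- substring(zip_raw, nchar(zip_raw) - 4)

print(zip_regex)
print(zip_tail)
```

Both approaches are vectorised, so no loop is needed even for a large dataset. Note that simply deleting "ZCTA5" would leave the separating space behind if one is present, so trimws() may be needed in that case.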

splitting strings using regex in R

I have a really long list of strings that look like the following, and I want to split each into several pieces.
strings<-c("https://www.website.com/stats/stat.227.y2020.eon.t879.html",
"https://www.website.com/stats/stat.229.y2019.eoff.t476.html")
and the desired output is as below:
links Year Seas Tour
https://www.website.com/stats/stat.227. y2020 eon t879
https://www.website.com/stats/stat.229. y2019 eoff t476
How can I achieve this using regex?
Using str_match :
stringr::str_match(strings, '.*\\.(y\\d+)\\.(\\w+)\\.(t\\d+)')
You can use the same regex in tidyr::extract if you put strings in a dataframe.
tidyr::extract(data.frame(strings), strings, c("Year","Seas", "Tour"),
'\\.(y\\d+)\\.(\\w+)\\.(t\\d+)', remove = FALSE)
# strings Year Seas Tour
#1 https://www.website.com/stats/stat.227.y2020.eon.t879.html y2020 eon t879
#2 https://www.website.com/stats/stat.229.y2019.eoff.t476.html y2019 eoff t476
Here, we capture data in 3 parts (capture groups)
1st part - 'y' followed by a number
2nd part - next word following part 1
3rd part 't' followed by a number.
You could use {unglue} :
library(unglue)
unglue::unglue_data(
strings, "{links}.{Year=[^.]+}.{Seas=[^.]+}.{Tour=[^.]+}.html")
#> links Year Seas Tour
#> 1 https://www.website.com/stats/stat.227 y2020 eon t879
#> 2 https://www.website.com/stats/stat.229 y2019 eoff t476
Here "[^.]+" means "one or more non-dot characters", which is what we want for Year, Seas, and Tour.
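If you would rather avoid extra packages, a base R sketch along the same lines is possible with regexec() and regmatches(). The pattern below is my own variant (not taken from the answers above); it adds a first capture group so that the links column keeps its trailing dot, as in the desired output.

```r
strings <- c("https://www.website.com/stats/stat.227.y2020.eon.t879.html",
             "https://www.website.com/stats/stat.229.y2019.eoff.t476.html")

# Four capture groups: link prefix, year, season, tournament.
pattern <- "^(.*\\.)(y\\d+)\\.(\\w+)\\.(t\\d+)\\.html$"

# regmatches() returns one character vector per string:
# the full match followed by the four groups.
m <- regmatches(strings, regexec(pattern, strings))
parts <- do.call(rbind, m)

result <- data.frame(links = parts[, 2], Year = parts[, 3],
                     Seas = parts[, 4], Tour = parts[, 5],
                     stringsAsFactors = FALSE)
print(result)
```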

Convert 5 digit zip code to 3 digit zip code in R

I have a large data set and all the zip codes are in 5 digit numeric form. I need to take these and make them in to 3 digit zip codes (so that it keeps the first 3 digits of the zip code, including any 0s). So
State Zip
A 12345
becomes
State Zip
A 123
How is this done?
Convert to string, then trim it:
> zip=12345
> strtrim(as.character(zip), 3)
[1] "123"
You can then convert back to a number if needed, although that drops any leading zeros.
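One caveat: if the ZIP codes are stored as numbers, any leading zeros are already gone, so trimming the plain character form gives the wrong prefix for codes below 10000. A sketch that pads back to 5 digits first, assuming the values are whole numbers (zip5 and zip3 are just illustrative names):

```r
zip <- c(12345, 1234, 210)

# Pad to 5 digits with leading zeros, then keep the first 3 characters.
zip5 <- sprintf("%05d", as.integer(zip))
zip3 <- substr(zip5, 1, 3)

print(zip5)  # "12345" "01234" "00210"
print(zip3)  # "123" "012" "002"
```

Keeping the result as character is what preserves prefixes such as "012".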

Combine separate column/rows as one column/row in R

Using a txt file, I have this dataset:
Xenopsylla cheopis Echinolaelaps sp.
Maxomys rajah 1 3
Callosciurus prevostii borneensis 4 2
Using this call,
test<-read.table("data.txt",header=T)
Xenopsylla cheopis Echinolaelaps sp.
Maxomys rajah 1 3
Callosciurus prevostii borneensis 4 2
R seems to recognize my data as different columns/rows and produce this error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 2 did not have 4 elements
I tried to use textConnection, but it does not seem to produce what I want.
First of all just store your data in a character vector as I did here:
test<-readChar("C:/Users/Julian/Downloads/file.txt", file.info("C:/Users/Julian/Downloads/file.txt")$size)
Obviously, you need to replace the path of my file with yours.
Then you get rid of the space between Genus and Species using gsub()
test<-gsub("([[:lower:]])([[:space:]])([[:lower:]])", "\\1\\3",test)
Finally, you can read your data using read.table() with the text argument:
a<-read.table(text=test,sep="\t",header=TRUE,row.names = 1)
a
Xenopsyllacheopis Echinolaelapssp. Ixodessp.
Maxomysrajah 3 8 9
Callosciurusprevostiiborneensis 5 7 1
Sundamysmuelleri 3 5 7
Niviventercremoriventer 6 8 9
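The steps above can also be tried without a file. The sketch below builds a small tab-separated string by hand (an assumption about how the real file is delimited, using the values from the question) and applies the same gsub() and read.table() calls.

```r
# Hypothetical tab-separated content standing in for the text read by readChar().
test <- paste0("Xenopsylla cheopis\tEchinolaelaps sp.\n",
               "Maxomys rajah\t1\t3\n",
               "Callosciurus prevostii borneensis\t4\t2\n")

# Drop whitespace that sits between two lowercase letters (genus/species gaps).
test <- gsub("([[:lower:]])([[:space:]])([[:lower:]])", "\\1\\3", test)

# Tabs between columns survive, so the table now parses cleanly.
a <- read.table(text = test, sep = "\t", header = TRUE, row.names = 1)
print(a)
```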
EDIT:
To answer OP's new question in the comments:
"([[:lower:]])([[:space:]])([[:lower:]])"
enables us to find all the parts of the string that we created with readChar() that match this pattern: a lowercase letter, followed by a whitespace character, followed by a lowercase letter.
This matches the space between a genus and a species name, but not the gap between a species name and the following genus, because a genus starts with an uppercase letter.
Now the "\\1\\3" part means that we keep the first and third parts of our
"([[:lower:]])([[:space:]])([[:lower:]])" pattern, that is, the two ([[:lower:]]) groups. Because there is no space between \\1 and \\3 in "\\1\\3", they are joined without a space. Therefore we get Genusspecies instead of Genus species.

Creating a vector from a file in R

I am new to R and my question should be trivial. I need to create a word cloud from a txt file containing the words and their occurrence counts. For that purpose I am using the snippets package.
As can be seen at the bottom of the link, first I have to create a vector (is it right that words is a vector?) like below.
> words <- c(apple=10, pie=14, orange=5, fruit=4)
My problem is to do the same thing but create the vector from a file which would contain words and their occurrence number. I would be very happy if you could give me some hints.
Moreover, to understand the format of the file to be inserted I write the vector words to a file.
> write(words, file="words.txt")
However, the file words.txt contains only the values but not the names(apple, pie etc.).
$ cat words.txt
10 14 5 4
Thanks.
words is a named vector, the distinction is important in the context of the cloud() function if I read the help correctly.
Write the data out correctly to a file:
write.table(words, file = "words.txt")
Create your word occurrence file in the same format as the txt file this produces. When you read it back in to R, you need to do a little manipulation:
> newWords <- read.table("words.txt", header = TRUE)
> newWords
x
apple 10
pie 14
orange 5
fruit 4
> words <- newWords[,1]
> names(words) <- rownames(newWords)
> words
apple pie orange fruit
10 14 5 4
What we are doing here is reading the file into newWords, then subsetting it to take the one and only column (variable), which we store in words. The last step is to take the row names from the file read in and apply them as the "names" on the words vector. We do the last step using the names() function.
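The whole round trip can be sketched in one go. This version writes to a temporary file (an assumption made so the example does not touch your working directory) and uses setNames() to combine the last two steps; restored is just an illustrative name.

```r
words <- c(apple = 10, pie = 14, orange = 5, fruit = 4)

# Write the named vector; write.table() stores the names as row names.
path <- tempfile(fileext = ".txt")
write.table(words, file = path)

# Read it back and rebuild the named vector from column 1 and the row names.
newWords <- read.table(path, header = TRUE)
restored <- setNames(newWords[, 1], rownames(newWords))
print(restored)
```

The values may come back as integers rather than doubles, which does not matter for counts.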
Yes, 'vector' is the proper term.
EDIT:
A better method than write.table would be to use save() and load():
save(words, file="svwrd.rda")
load(file="svwrd.rda")
The save/load combo preserves all the structure rather than coercing it. The write.table followed by names()<- is kind of a hassle, as you can see in both Gavin's answer here and my answer on rhelp.
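A short sketch of that round trip, using a temporary file (an assumption, so nothing is left in the working directory) and keeping a copy in original only to check the result:

```r
words <- c(apple = 10, pie = 14, orange = 5, fruit = 4)
original <- words

# save() stores the object together with its name and attributes.
path <- tempfile(fileext = ".rda")
save(words, file = path)

# Remove it, then load() recreates `words` under the same name.
rm(words)
load(path)
print(identical(words, original))
```

Because load() restores the object under its saved name, it silently overwrites any existing variable called words.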
Initial answer:
Suggest you use as.data.frame to coerce to a dataframe and then write.table() to write to a file.
write.table(as.data.frame(words), file="savew.txt")
saved <- read.table(file="savew.txt")
saved
words
apple 10
pie 14
orange 5
fruit 4
