Convert 5 digit zip code to 3 digit zip code in R - r

I have a large data set and all the zip codes are in 5 digit numeric form. I need to turn these into 3 digit zip codes (keeping the first 3 digits of the zip code, including any 0s). So
State Zip
A 12345
becomes
State Zip
A 123
How is this done?

Convert to string, then trim it:
> zip=12345
> strtrim(as.character(zip), 3)
[1] "123"
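You can then convert back to a number, but note that doing so drops any leading zeros (for example, "012" becomes 12), so keep the result as a character string if those matter.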
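A caveat worth sketching: if the zip codes are stored as numbers, any leading zeros are already gone before strtrim() runs, so 01234 would come out as "123" rather than "012". The snippet below is a minimal sketch that pads back to five digits first; the sample values are made up for illustration and it assumes every value is a whole number of at most five digits.

```r
# Hypothetical example values; numeric zips have already lost leading zeros.
zip <- c(12345, 1234, 501)

# Pad back to five digits, then keep the first three characters.
zip5 <- sprintf("%05d", as.integer(zip))
zip3 <- substr(zip5, 1, 3)
zip3
```

The result stays a character vector so that values such as "012" keep their leading zero.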

Related

Trying to remove "ZCTA" from rows

I am trying to extract only the zip code values from my imported ACS data file, however, the rows all include "ZCTA" before the 5 digit zip code. Is there a way to remove that so just the 5 digit zip code remains?
Example:
I tried using strtrim on the data but I can't figure out how to target the last 5 digits. I imagine there is a function or loop that could also do this, since the dataset is so large.
To remove "ZCTA5":
gsub("ZCTA5", "", df$zip) # df - your data.frame name
or
library(stringr)
str_replace(df$zip,"ZCTA5","")
To extract ZIP CODE:
str_sub(df$zip,-5,-1)
Here are a few others for fun:
#option 1
stringr::str_extract(df$zip, "(?<=\\s)\\d+$")
#option 2
gsub("^.*\\s(\\d+)$", "\\1", df$zip)
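For reference, here is a small base R sketch of the same idea that needs no extra packages. The sample strings are hypothetical and assume each value looks like "ZCTA5 12345", with the zip code as the last five characters after a space.

```r
# Hypothetical sample values in the "ZCTA5 12345" style described in the question.
zip_raw <- c("ZCTA5 01234", "ZCTA5 90210")

# Remove everything up to and including the last whitespace character.
zip_clean <- sub("^.*\\s", "", zip_raw)

# Alternatively, keep only the last five characters of each string.
n <- nchar(zip_raw)
zip_last5 <- substr(zip_raw, n - 4, n)
```

Both calls are vectorised, so no loop is needed even for a large data set.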

read.xlsx file with one column consisting "numbers as text"

I have an Excel file that contains numeric variables, but the first column (index column) uses custom formatting: those are numbers that should be presented as text (or similar to text), always with a fixed number of digits, some of which are leading zeroes. Here is my example table from Excel:
And here is formatting for bad_col1 (rest are numbers or general):
When I try to import my data by using read.xlsx function from either openxlsx or xlsx package it produces something like this:
read.xlsx(file_dir,sheet=1)#for openxlsx
bad_col1 col2 col3
1 5 11 974
2 230 15 719
3 10250 6 944
4 2340 7 401
So as you can see, the zeroes are gone. Is there any way to read the 1st column as "text" and the others as numeric? I cannot convert it to text afterwards, because the leading zeroes are already gone. I can think of a workaround, but it would be more feasible for my project to have them converted while importing.
Thank you in Advance
You can use a vector to filter your desired format, with library readxl:
library(readxl)
filter <- c('text','numeric','numeric')
the_file <- read_xlsx("sample.xlsx", col_types = filter)
Moreover, you can skip columns by putting 'skip' in the corresponding position of the filter, which helps if you have many columns.
Regards
With this https://readxl.tidyverse.org/reference/read_excel.html you can use the parameter col_types so that the first column is read as character.
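When the sheet has many columns, typing the col_types vector by hand gets tedious. The sketch below builds it with rep(); it assumes one text column followed by numeric columns, and n_numeric is a made-up name for that count. The readxl call is left commented out because it needs the package and a real file.

```r
# Assumption: one text column followed by n_numeric numeric columns.
n_numeric <- 2
col_types <- c("text", rep("numeric", n_numeric))
col_types

# With readxl installed and a real file, the vector would be passed as:
# the_file <- readxl::read_xlsx("sample.xlsx", col_types = col_types)
```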

How to remove decimal points from dataframe column?

I've a .csv dataframe in which one of the columns is a ZIP code. The ZIP code is a factor. Here is an example:
Country<- c("US","US","US","CAN","CAN")
ZIP<- c("00210","01210","65483.0","H3P","H3P3C")
data<- data.frame(Country,ZIP)
I did the following but the output is not what I want:
data$ZIP<-round(as.numeric(as.character(data$ZIP)), 0)
It removed the decimals, but the zip codes 00210 and 01210 became 210 and 1210, and the Canadian zip codes became NA. I want to keep the US zip codes at 5 digits and preserve the Canadian ones.
How can I do that?
Thank you.
Try this
data$ZIP <- sub("\\.\\d+$", "", data$ZIP)
# Country ZIP
# 1 US 00210
# 2 US 01210
# 3 US 65483
# 4 CAN H3P
# 5 CAN H3P3C
Explanation
From the help page, a typical usage of sub is
sub(pattern, replacement, x)
x is a character vector where matches are sought...
In our case x'll be the ZIP column (values of the ZIP column to be specific).
The pattern is ("\\.\\d+$"):
\\. matches the dot
\\d+ matches one or more numeric characters
$ matches the end of the input string.
The replacement pattern is "".
It replaces numeric chars beginning from a match of dot till the end with an empty string.
For example
sub("\\.\\d+$", "", 21358.222)
# "21358"
Hope that helps.

Reshape specific rows into columns in R

My sample data frame would look like the following:
1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y
I want to make rows 1, 3, and 5 column names and have the data below each fall into each column, respectively. I was looking into the reshape function, but I only saw examples where an entire column of values needed to be individual columns. So I wasn't sure what to do in this scenario--apologies if it's obvious.
Here is the desired output:
1 Number Type Code Reason Date Amount Damage Act State City Zip Phone
2 0123 06 09 010 08/31/16 10,000 Y N WI GB 1234 Y
Thanks
As some people have commented, you could build a data frame out of the rows of your starting data frame, but I think it's a little easier to work on the lines of text.
If your starting file looks something like this
Number , Type , Code ,Reason
0123 , 06 , 09 , 010
Date , Amount , Damage , Act
08/31/16 , 10000 , Y , N
State , City , Zip , Phone
WI , GB , 1234, Y
we can read it in with each line as an element of a character vector:
lines <- readLines("start.csv")
make all the odd lines into a single line:
oddind <- seq(from=1, to= length(lines), by=2)
namelines <- paste(lines[oddind], collapse=",")
make all the even lines into a single line:
datlines <- paste(lines[oddind+1], collapse=",")
make those lines into a new CSV to read:
writeLines(text= c(namelines, datlines), con= "nice.csv")
print(read.csv("nice.csv"))
This gives
Number Type Code Reason Date Amount Damage Act State
1 123 6 9 10 08/31/16 10000 Y N WI
City Zip Phone
1 GB 1234 Y
So, it's all in one row of the data frame and all the variable names show up correctly in the data frame.
The benefits of this strategy are:
It will work for starting CSV files where the number of variables isn't a multiple of 4.
It will work for starting CSV files with any number of rows
There is no chance of weird dynamic casting happening with unlist() or as.character().
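The same steps can also be sketched without touching the disk. This variant is an illustration rather than the original answer's code: it uses hypothetical in-memory lines in place of start.csv, passes the two combined lines to read.csv() through its text argument, and adds colClasses = "character" so that values such as "0123" keep their leading zeros instead of becoming 123 as in the output above.

```r
# Hypothetical input lines mirroring the example file above.
lines <- c("Number,Type,Code,Reason",
           "0123,06,09,010",
           "Date,Amount,Damage,Act",
           "08/31/16,10000,Y,N",
           "State,City,Zip,Phone",
           "WI,GB,1234,Y")

# Odd lines hold the names, even lines hold the values.
oddind    <- seq(from = 1, to = length(lines), by = 2)
namelines <- paste(lines[oddind], collapse = ",")
datlines  <- paste(lines[oddind + 1], collapse = ",")

# colClasses = "character" stops read.csv from turning "0123" into 123.
wide <- read.csv(text = c(namelines, datlines), colClasses = "character")
wide
```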
First, create a data frame that roughly resembles yours (although it necessarily has column names). Those are probably factor columns if you just used one of the standard read.* functions without stringsAsFactors=FALSE, hence the need to convert with as.character.
dat=read.table(text="1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y")
Then you can set odd number rows as names of the values-vector of the even number rows with:
setNames( unlist( lapply( dat[!c(TRUE,FALSE), ] ,as.character)),
unlist( lapply( dat[c(TRUE,FALSE), ] ,as.character)) )
1 3 5 Number Date State Type
"2" "4" "6" "0123" "08/31/16" "WI" "06"
Amount City Code Damage Zip Reason Act
"10,000" "GB" "09" "Y" "1234" "010" "N"
Phone
"Y"
The !c(TRUE,FALSE) and its logical complement in the next extract operation get recycled along all the possible rows. Obviously there would be better ways of doing this if you posted a version of a text file rather than saying that the starting point was a dataframe. You would need to remove what were probably rownames. If you want a "clean" solution, then post either dput(.) from your dataframe or the raw text file.
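To make the recycling concrete, here is a minimal sketch on a single hypothetical column (V1 and its values are made up to mirror the question). A length-two logical index is repeated down the rows, so c(TRUE, FALSE) selects rows 1, 3, 5 and its negation selects rows 2, 4, 6.

```r
# Hypothetical six-row data frame standing in for the questioner's data.
dat <- data.frame(V1 = c("Number", "0123", "Date", "08/31/16", "State", "WI"),
                  stringsAsFactors = FALSE)

odd_rows  <- dat[c(TRUE, FALSE), "V1"]   # rows 1, 3, 5 (the names)
even_rows <- dat[!c(TRUE, FALSE), "V1"]  # rows 2, 4, 6 (the values)

# Attach the odd-row entries as names of the even-row values.
res <- setNames(even_rows, odd_rows)
res
```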

Creating a vector from a file in R

I am new to R and my question should be trivial. I need to create a word cloud from a txt file containing the words and their occurrence number. For that purpose I am using the snippets package.
As can be seen at the bottom of the link, first I have to create a vector (is it right that words is a vector?) like below.
> words <- c(apple=10, pie=14, orange=5, fruit=4)
My problem is to do the same thing but create the vector from a file which would contain words and their occurrence number. I would be very happy if you could give me some hints.
Moreover, to understand the format of the file to be inserted I write the vector words to a file.
> write(words, file="words.txt")
However, the file words.txt contains only the values but not the names(apple, pie etc.).
$ cat words.txt
10 14 5 4
Thanks.
words is a named vector, the distinction is important in the context of the cloud() function if I read the help correctly.
Write the data out correctly to a file:
write.table(words, file = "words.txt")
Create your word occurrence file like the txt file created. When you read it back in to R, you need to do a little manipulation:
> newWords <- read.table("words.txt", header = TRUE)
> newWords
x
apple 10
pie 14
orange 5
fruit 4
> words <- newWords[,1]
> names(words) <- rownames(newWords)
> words
apple pie orange fruit
10 14 5 4
What we are doing here is reading the file into newWords, then subsetting it to take the one and only column (variable), which we store in words. The last step is to take the row names from the file read in and apply them as the "names" on the words vector. We do the last step using the names() function.
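The round trip described above can be sketched end to end as follows. It writes to a temporary file instead of words.txt so that it runs anywhere, and it uses setNames() to combine the last two steps; restored is a made-up name for the result.

```r
words <- c(apple = 10, pie = 14, orange = 5, fruit = 4)

# Write the named vector out; the names become row names in the file.
tmp <- tempfile(fileext = ".txt")
write.table(words, file = tmp)

# Read it back and rebuild the named vector from column 1 and the row names.
newWords <- read.table(tmp, header = TRUE)
restored <- setNames(newWords[, 1], rownames(newWords))
restored
```

Note that the values may come back as integers rather than doubles, since read.table() guesses the column type from the file contents.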
Yes, 'vector' is the proper term.
EDIT:
A better method than write.table would be to use save() and load():
save(words, file="svwrd.rda")
load(file="svwrd.rda")
The save/load combo preserved all the structure rather than doing coercion. The write.table followed by names()<- is kind of a hassle as you can see in both Gavin's answer here and my answer on rhelp.
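A quick way to check that claim is the sketch below, which uses a temporary file instead of svwrd.rda and keeps a copy of the vector (original, a made-up name) for comparison after the reload.

```r
words <- c(apple = 10, pie = 14, orange = 5, fruit = 4)

# Save the object, remove it, and load it back under its original name.
rda <- tempfile(fileext = ".rda")
save(words, file = rda)
original <- words
rm(words)
load(file = rda)   # recreates `words` in the current environment

# Names and storage type survive the round trip unchanged.
identical(words, original)
```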
Initial answer:
I suggest you use as.data.frame to coerce to a dataframe and then write.table() to write to a file.
write.table(as.data.frame(words), file="savew.txt")
saved <- read.table(file="savew.txt")
saved
words
apple 10
pie 14
orange 5
fruit 4
