Trying to remove "ZCTA" from rows - r

I am trying to extract only the zip code values from my imported ACS data file. However, every row includes "ZCTA" before the 5-digit zip code. Is there a way to remove that so just the 5-digit zip code remains?
Example:
I tried using strtrim on the data but I can't figure out how to target the last 5 digits. I imagine there is a function or loop that could also do this, since the dataset is so large.

To remove "ZCTA5":
gsub("ZCTA5", "", df$zip) # df - your data.frame name
or
library(stringr)
str_replace(df$zip,"ZCTA5","")
To extract ZIP CODE:
str_sub(df$zip,-5,-1)

Here are a few others for fun:
#option 1
stringr::str_extract(df$zip, "(?<=\\s)\\d+$")
#option 2
gsub("^.*\\s(\\d+)$", "\\1", df$zip)
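The approaches above can be compared on a couple of values. A minimal base R sketch, assuming the column looks like "ZCTA5 00601" (the sample values are made up; the real file may be formatted differently):

```r
# Hypothetical sample values; the real column may be formatted differently.
zip_raw <- c("ZCTA5 00601", "ZCTA5 90210")

# Removing the prefix alone leaves a leading space, so trim it as well.
zip_a <- trimws(sub("ZCTA5", "", zip_raw, fixed = TRUE))

# Take the last five characters regardless of the prefix.
n <- nchar(zip_raw)
zip_b <- substr(zip_raw, n - 4, n)

# Capture the trailing run of exactly five digits.
zip_c <- sub("^.*(\\d{5})$", "\\1", zip_raw)
```

All three results are character vectors, so leading zeros such as the one in "00601" are kept. Converting them with as.numeric would drop those zeros.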

Related

read.xlsx file with one column consisting "numbers as text"

I have an Excel file that contains numeric variables, but the first column (index column) uses custom formatting: its values are numbers that should be presented as text (or similar to text), always with a fixed number of digits, some of which are leading zeroes. Here is my example table from Excel:
And here is formatting for bad_col1 (rest are numbers or general):
When I try to import my data by using read.xlsx function from either openxlsx or xlsx package it produces something like this:
read.xlsx(file_dir,sheet=1)#for openxlsx
bad_col1 col2 col3
1 5 11 974
2 230 15 719
3 10250 6 944
4 2340 7 401
So as you can see, the zeroes are gone. Is there any way to read the 1st column as "text" and the others as numeric? I cannot convert it to text afterwards, because the leading zeroes are already gone. I can think of a workaround, but it would be more feasible for my project to have them converted while importing.
Thank you in Advance
You can use a vector to filter your desired format, with library readxl:
library(readxl)
filter <- c('text','numeric','numeric')
the_file <- read_xlsx("sample.xlsx", col_types = filter)
Even more, you can skip columns if you use in your filter 'skip' in the desired position, considering that you might have many columns.
Regards
With read_excel (https://readxl.tidyverse.org/reference/read_excel.html) you can use the col_types parameter so that the first column is read as character.

Checking for number of items in a string in R

I have a very large csv file (1.4 million rows). It is supposed to have 22 fields and 21 commas in each row. It was created by taking quarterly text files and compiling them into one large text file so that I could import into SQL. In the past, one field was not in the file. I don't have the time to go row by row and check for this.
In R, is there a way to verify that each row has 22 fields or 21 commas? Below is a small sample data set. The possibly missing field is the 0 in the 10th slot.
32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1
You can use the base R function count.fields to do this:
count.fields(tmp, sep=",")
[1] 22 22
The input for this function is the name of a file or a connection. Below, I supplied a textConnection. For large files, you would probably want to feed this into table:
table(count.fields(tmp, sep=","))
Note that this can also be used to count the number of rows in a file using length, similar to the output of wc -l in the *nix OSs.
data
tmp <- textConnection(
"32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1"
)
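To find which rows are short rather than just tabulating the counts, the result can be compared with the expected number of fields. A minimal sketch, assuming a hypothetical two-row file in which the second row has lost the 0 in the 10th slot:

```r
# Hypothetical sample: the second row is missing one field (21 instead of 22).
rows <- c(
  "32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1",
  "32,01,01,01,01,01,000001,123,456,132,345,456,456,789,235,256,88,5,1,2,1"
)
path <- tempfile(fileext = ".csv")
writeLines(rows, path)

# One count per line of the file.
n_fields <- count.fields(path, sep = ",")

# Line numbers of the rows that do not have the expected 22 fields.
bad_rows <- which(n_fields != 22)

unlink(path)
```

Here bad_rows holds the line numbers to inspect, so only those rows need to be checked by hand.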
Assuming df is your dataframe
apply(df, 1, length)
This will give you the length of each row.

Import fixed width data file with no line separator

I have fixed width data files (.dbf) that don't have line separators. Here is what two lines of that data file look like:
20141101 77h 3.210 0 3 20141102 76h 3.090 0 3
The widths of one line are c(8,4,7,41) for date (8), some time measure (4), the data point (7), and some other columns that I can summarize in one "rest" column (41). After one line there is no separator and the next line is just appended to the first line. All time steps are basically written consecutively in one massive line. The file contains only numbers, characters and white space.
With read.fwf('filepath', widths = c(8,4,7,41)) R stops reading after the first line due to lack of line separator.
Is there an argument to tell read.fwf() when to start reading the new line when there is no line separator? Or should I use a different read command?
Thanks in advance.
Maybe not the best idea but this should work:
content <- scan('filepath','character',sep='~') # Warning: choose a sep that does not appear in the data, so the whole file is read as one string.
# Split content in lines:
lines <- regmatches(content,gregexpr('.{60}',content))[[1]]
x <- tempfile()
write(lines,x)
data <- read.fwf(x, widths = c(8,4,7,41))
unlink(x)
The idea is to read the whole file, get each occurrence of 60 chars into a single entry, write this to a tempfile, and read the data from this tempfile before deleting the temporary file.
Another approach is doable with regexes and package stringr (still with content resulting from scan above):
library(stringr)
d <- data.frame( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5], stringsAsFactors=FALSE)
which gives:
V1 V2 V3 V4
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
str_match_all returns a list, here with 1 element because there's only one line as input, so we extract it with [[1]].
The result is a matrix with 5 columns: the first is the full match and the others are the capture groups. We subset the matrix to columns 2 to 5 to keep only the 4 columns we need, and wrap it in data.frame to get a data.frame at the end.
You can then name the columns with colnames(d) <- c('date','time','data_point','rest')
If you wish to clean up the white space, you can wrap the str_match_all result in trimws (thanks to @jaap for the reminder about this function) like this:
td <- data.frame( trimws( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5] ), stringsAsFactors=FALSE)
Output:
X1 X2 X3 X4
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
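The chunking step can also be done in memory with base R only, without a temporary file. A minimal sketch, assuming every record is exactly 60 characters; the sample string is built here for illustration and the column names are made up:

```r
widths  <- c(8, 4, 7, 41)
rec_len <- sum(widths)  # 60 characters per record

# Hypothetical content: two records glued together with no separator.
make_rec <- function(date, time, value) {
  sprintf("%-8s%-4s%7s%-41s", date, time, value, "0 3")
}
content <- paste0(make_rec("20141101", "77h", "3.210"),
                  make_rec("20141102", "76h", "3.090"))

# Cut the long string into fixed-length records.
starts  <- seq(1, nchar(content), by = rec_len)
records <- substring(content, starts, starts + rec_len - 1)

# Cut every record into its fields and trim the padding.
ends   <- cumsum(widths)
begins <- ends - widths + 1
cols <- lapply(seq_along(widths), function(i) {
  trimws(substring(records, begins[i], ends[i]))
})
names(cols) <- c("date", "time", "data_point", "rest")
d <- as.data.frame(cols, stringsAsFactors = FALSE)
```

All columns come back as character. type.convert can be applied afterwards if numeric classes are wanted.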
A different, and probably less elegant, solution with readLines, substr, trimws, separate (tidyr) and mutate_all (dplyr):
txt <- readLines('filepath')
dfx <- data.frame(V1 = sapply(seq(from=1, to=nchar(txt), by=60),
function(x) substr(txt, x, x+59)))
library(dplyr)
library(tidyr)
dfx %>%
separate(V1, c(paste0("V",LETTERS[1:5])), c(8,12,19,55)) %>%
mutate_all(trimws)
which gives:
VA VB VC VD VE
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
To get different column names, just replace c(paste0("V",LETTERS[1:5])) with a vector of the column names you want.
If you want to transform the columns into the correct classes instead of into character, you can use funs(ul = type.convert(trimws(.))) inside mutate_all.
In addition to the other answers, some general info about dbf files:
Unless this is a one time read of a static file, it would be best to check the file/fields structure first in case that changes over time. See here for the internal structure of a dbf file.
But maybe even more important:
Each record in a dbf file is preceded by one byte for the delete flag. If this is a space, the record is not deleted, if it's an asterisk * the record is marked for deletion (records are not removed from a dbf file until the file is packed), and you probably want to skip those records. The first part of the data could also be overwritten with "DELETED" for example.
So, in your record c(8,4,7,41), the last byte of the rest column (41) is actually the delete flag of the record that follows it - and the last record in the file will only have 40 bytes for that field (but if you're lucky, the file has an EOF marker (0x1a), so maybe you didn't have a problem with the size there).
Thus, your record should actually be: c(1,8,4,7,40), where the 1 is the delete flag, and starting one byte sooner.
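To illustrate the delete flag, here is a rough sketch that drops records marked with an asterisk. It assumes the dbf header has already been skipped and that the data area has been read into one string. The sample records are made up.

```r
# Hypothetical records laid out as flag(1) + date(8) + time(4) + value(7) + rest(40).
make_rec <- function(flag, date, time, value) {
  sprintf("%s%-8s%-4s%7s%-40s", flag, date, time, value, "0 3")
}
content <- paste0(
  make_rec(" ", "20141101", "77h", "3.210"),
  make_rec("*", "20141102", "76h", "3.090"),  # marked for deletion
  make_rec(" ", "20141103", "75h", "2.980")
)

# Split into 60-character records, flag byte first.
starts  <- seq(1, nchar(content), by = 60)
records <- substring(content, starts, starts + 59)

# Keep only the records whose delete flag is not an asterisk.
live  <- records[substr(records, 1, 1) != "*"]
dates <- substr(live, 2, 9)
```

With the flag included, the date starts at position 2 of each record rather than position 1.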

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to do this:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
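The column access can be tried on a few rows of the question's data. A small sketch that reads the sample from an inline string instead of foo.txt, and leaves prob() out because it comes from a separate package:

```r
# A few rows of the question's sample, read from an inline string.
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.82107587343e-05,0.0")

area   <- tbl$area         # numeric vector, selected by column name
energy <- tbl[["energy"]]  # same idea, with the name given as a string

# A name that is not a column gives NULL, which is why tbl$first did not work.
missing_col <- tbl$first
```

The names after $ must match the header line of the file, here area and energy, not the position of the column.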

Creating a vector from a file in R

I am new to R and my question should be trivial. I need to create a word cloud from a txt file containing the words and their occurrence number. For that purpose I am using the snippets package.
As can be seen at the bottom of the link, first I have to create a vector (is it right that words is a vector?) like below.
> words <- c(apple=10, pie=14, orange=5, fruit=4)
My problem is to do the same thing but create the vector from a file which would contain words and their occurrence number. I would be very happy if you could give me some hints.
Moreover, to understand the format of the file to be inserted I write the vector words to a file.
> write(words, file="words.txt")
However, the file words.txt contains only the values but not the names(apple, pie etc.).
$ cat words.txt
10 14 5 4
Thanks.
words is a named vector, the distinction is important in the context of the cloud() function if I read the help correctly.
Write the data out correctly to a file:
write.table(words, file = "words.txt")
Create your word-occurrence file in the same layout as the txt file this produces. When you read it back into R, you need to do a little manipulation:
> newWords <- read.table("words.txt", header = TRUE)
> newWords
x
apple 10
pie 14
orange 5
fruit 4
> words <- newWords[,1]
> names(words) <- rownames(newWords)
> words
apple pie orange fruit
10 14 5 4
What we are doing here is reading the file into newWords, then subsetting it to take the one and only column (variable), which we store in words. The last step is to take the row names from the file read in and apply them as the "names" on the words vector. We do the last step using the names() function.
Yes, 'vector' is the proper term.
EDIT:
A better method than write.table would be to use save() and load():
save(words, file="svwrd.rda")
load(file="svwrd.rda")
The save/load combo preserves all the structure rather than doing coercion. The write.table followed by names()<- is kind of a hassle, as you can see in both Gavin's answer here and my answer on rhelp.
Initial answer:
I suggest you use as.data.frame to coerce to a data frame and then write.table() to write to a file.
write.table(as.data.frame(words), file="savew.txt")
saved <- read.table(file="savew.txt")
saved
words
apple 10
pie 14
orange 5
fruit 4
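Both routes can be checked end to end. A minimal sketch using temporary files (the file paths are placeholders) that rebuilds the named vector from the text file and, separately, from the saved binary file:

```r
words <- c(apple = 10, pie = 14, orange = 5, fruit = 4)

# Text round trip: names travel as row names and must be reattached.
txt_path <- tempfile(fileext = ".txt")
write.table(as.data.frame(words), file = txt_path)
saved    <- read.table(txt_path)
from_txt <- setNames(saved$words, rownames(saved))

# Binary round trip: load() recreates the object under its original name.
rda_path <- tempfile(fileext = ".rda")
save(words, file = rda_path)
env <- new.env()
load(rda_path, envir = env)
from_rda <- env$words

unlink(c(txt_path, rda_path))
```

Loading into a separate environment is only done here to keep the restored copy apart from the original. A plain load(file) puts words straight into the workspace.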
