Currently, I am trying to import the National Incident-Based Reporting System (NIBRS) 2019 data into R. The data comes in an ASCII text format, and so far I've tried readr::read_tsv and readr::read_fwf. However, I can't seem to import the data correctly: read_tsv produces only one column, while read_fwf requires column-position arguments that I don't know how to derive from the text file.
Here is the link to the NIBRS. I used the Master File Downloads to download the zipped file for the NIBRS in 2019.
My overall goal is to have a typical dataframe/tibble for this data set, with the columns being the types of crime and the rows being the numbers of incidents.
I have seen a few other examples of importing this data through this help page, but their copies of the data only cover up to 2015 (my data needs to range from 2015-2019).
Use read.fwf(). Column widths are listed here.
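For example, a minimal sketch with base R, using placeholder widths and column names (the real values must come from the NIBRS record layout):
# Placeholder widths and names; substitute the values from the codebook.
nibrs <- read.fwf("nibrs_2019.txt",
                  widths = c(2, 12, 4),
                  col.names = c("segment", "incident_id", "year"))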
We can use read_fwf with col_positions (note the argument is col_positions, and it expects a specification built with helpers such as fwf_widths() or fwf_positions()):
library(readr)
read_fwf(filename, col_positions = fwf_widths(c(2, 5, 10)))
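If you know the start and end position of each field rather than its width, fwf_positions() lets you name the columns as well. A sketch with placeholder positions and a hypothetical file name (the real positions come from the NIBRS record layout):
library(readr)
nibrs <- read_fwf(
  "nibrs_2019.txt",            # hypothetical file name
  col_positions = fwf_positions(
    start = c(1, 3, 15),       # placeholder start positions
    end = c(2, 14, 18),        # placeholder end positions
    col_names = c("segment", "incident_id", "year")
  )
)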
I have two columns of data in a table loaded from Excel: ID and BeforeConv, containing numbers and text, respectively.
After loading this data from Excel into RStudio, I did some tm_map-based text cleaning on a copy of the BeforeConv column and stored the result in a new column, AfterConv.
This is my code:
library(readxl)
library(tm)

x <- read_excel(file.choose())
corpus <- Corpus(VectorSource(x$BeforeConv))
corpus <- tm_map(corpus, tolower)
x$AfterConv <- data.frame(text = sapply(corpus, as.character))  # <- here, in RStudio, the data displays correctly
write.csv(x = x, file = "R2test.csv")  # <- now the data is displayed incorrectly
My issue is that while in RStudio the conversion looks correct (see image below), when exporting to either CSV or XLSX the last column (AfterConv) displays totally messy data (see below). What did I do wrong, and how can I fix it?
Ideally I'd like to see the proper code to convert the last column so it exports correctly, along with the two existing columns, to Excel or CSV.
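A likely culprit (an assumption, not confirmed here): wrapping sapply(corpus, as.character) in data.frame() makes x$AfterConv a nested data frame, which write.csv cannot serialize cleanly. A minimal sketch of the usual fix, keeping the same names:
# Assign a plain character vector instead of a nested data.frame
x$AfterConv <- sapply(corpus, as.character)
write.csv(x, file = "R2test.csv", row.names = FALSE)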
I'm using openxlsx's read.xlsx to import a data frame from a column of mixed classes. The desired result is to import all values as strings, exactly as they're represented in Excel. However, some decimals come in as very long floats.
Sample data is simply an Excel file with a column containing the following rows:
abc123,
556.1,
556.12,
556.123,
556.1234,
556.12345
require(openxlsx)
df <- read.xlsx('testnumbers.xlsx')
Using the above R code to read the file results in df containing these string values:
abc123,
556.1,
556.12,
556.12300000000005,
556.12339999999995,
556.12345000000005
The Excel file provided in production has the column formatted as "General". If I format the column as Text, there is no change unless I explicitly double-click each cell in Excel and hit enter. In that case, the number is correctly displayed as a string. Unfortunately, clicking each cell isn't an option in the production environment. Any solution, Excel, R, or otherwise is appreciated.
Edit: I've read through Why Are Floating Point Numbers Inaccurate? and believe I understand the math behind what's going on. At this point, I suppose I'm looking for a workaround: how can I get a float from Excel into an R data frame as text, without changing the representation?
I was able to get the correct formats into a data frame using pandas in Python.
import pandas as pd
test = pd.read_excel('testnumbers.xlsx', dtype = str)
This will suffice as a workaround, but I'd like to see a solution built in R.
Here is a workaround in R using openxlsx that I used to solve a similar issue. I think it will solve your question, or at least allow you to format cells as text in the Excel files programmatically.
I use it to reformat specific cells in a large number of files (in my case I'm converting from General to 'scientific', as an example of how you might alter this for another format).
This uses functions in the openxlsx package that you reference in the OP.
First, load the xlsx file in as a workbook (stored in memory, which preserves all the xlsx formatting, etc.; slightly different from the method shown in the question, which pulls in only the data):
testnumbers <- loadWorkbook(here::here("test_data/testnumbers.xlsx"))
Then create a "style" that converts the numbers to text, and apply it to the worksheet in memory:
numbersAsText <- createStyle(numFmt = "TEXT")
addStyle(testnumbers, sheet = "Sheet1", style = numbersAsText, cols = 1, rows = 1:10)
Finally, save the workbook (here, to a new file):
saveWorkbook(testnumbers,
             file = here::here("test_data/testnumbers_formatted.xlsx"),
             overwrite = TRUE)
When you open the Excel file, the numbers will be stored as text.
Let's say I have an Excel column with 10 different cells with values. How do I create a variable in R that includes only the first four or first six cells in that column?
This question is very vague; please provide more information if you need specifics...
First of all, you'll want to use a library to import the contents of the Excel file; I recommend readxl (http://readxl.tidyverse.org).
You can then follow the documentation to read specific ranges from the Excel file (see the sketch below) or just import all the contents and trim the resulting tibble.
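For instance, a minimal sketch using readxl's range argument (the file name is an assumption; cell_rows() is re-exported from the cellranger package):
library(readxl)
# Rows 2-5 hold the first four data rows when row 1 holds the headers
firstFour <- read_excel("myData.xlsx", range = cell_rows(1:5))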
Probably something like this:
# Install -readxl- package that loads in Excel spreadsheets
install.packages("readxl")
# Load -readxl- package for use
require(readxl)
# Change working directory to directory where spreadsheet is saved in
setwd("<Insert path here>")
# Save spreadsheet data to memory
myData <- read_excel("myData.xlsx", sheet = 1)
# Subset first four or six observations
firstFour <- myData[1:4,]
firstSix <- myData[1:6,]
Let me know if you don't understand.
I'm trying to read a zipped file called etfreit.zip, linked from Purchases from April 2016 onward.
Inside the zipped folder is a file called 2016.xls which is difficult to read as it contains empty rows along with Japanese text.
I have tried various ways of reading the xls from R, but I keep getting errors. This is the code I tried:
download.file("http://www3.boj.or.jp/market/jp/etfreit.zip", destfile="etfreit.zip")
unzip("etfreit.zip")
data <- read.csv(text=readLines("2016.xls")[-(1:10)])
I'm trying to skip the first 10 rows as I simply wish to read the data in the xls file. The code works only to the extent that it runs, but the data looks truly bizarre.
Would greatly appreciate any help on reading the spreadsheet properly in R for purposes of performing analysis.
There is more than one bizarre thing going on here, I think, but I had some success with the (somewhat older) gdata package:
data = gdata::read.xls("2016.xls")
By the way, treating an xls file as CSV seldom works; actually, it shouldn't work at all :) Find the proper import function for your type of data and then use it; don't assume that read.csv is going to take care of anything other than (proper) CSV.
As per your comment: I'm not sure what you mean by "not properly aligned", but here is some code that cleans the data a bit, and gives you numeric variables instead of factors (note I'm using tidyr for that):
# Drop the Japanese header rows and two unneeded columns
data2 = data[-c(1:7), -c(1, 6)]
names(data2) = c("date", "var1", "var2", "var3")
# extract_numeric() strips non-numeric characters
# (deprecated in newer tidyr; readr::parse_number() is the modern equivalent)
data2[, c(2:4)] = sapply(data2[, c(2:4)], tidyr::extract_numeric)
# Optionally convert the column with factor dates to POSIXct
data2$date = as.POSIXct(data2$date)
Also, note that I am removing only the top 7 rows; this seems to be the portion of the data that contains the Japanese header.
"Odd" unusual excel tables cab be read with the jailbreakr package. It is still in development, but looks pretty ace:
https://github.com/rsheets/jailbreakr
I am attempting to read data from the National Health Interview Survey in R: http://www.cdc.gov/nchs/nhis/nhis_2011_data_release.htm . The data is Sample Adult. The SAScii library actually has a function read.SAScii whose documentation has an example for the same data set I would like to use. The issue is it "doesn't work":
library(SAScii)

NHIS.11.samadult.SAS.read.in.instructions <-
  "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Program_Code/NHIS/2011/SAMADULT.sas"
NHIS.11.samadult.file.location <-
  "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2011/samadult.zip"

# store the NHIS file as an R data frame!
NHIS.11.samadult.df <-
  read.SAScii(
    NHIS.11.samadult.file.location,
    NHIS.11.samadult.SAS.read.in.instructions,
    zipped = TRUE
  )

# or store the NHIS SAS import instructions for use in a
# read.fwf function call outside of the read.SAScii function
NHIS.11.samadult.sas <- parse.SAScii(NHIS.11.samadult.SAS.read.in.instructions)

# save the data frame now for instantaneous loading later
save(NHIS.11.samadult.df, file = "NHIS.11.samadult.data.rda")
However, when running it I get the error Error in toupper(SASinput) : invalid multibyte string 533.
Others on Stack Overflow with a similar error, but for functions such as read.delim and read.csv, have recommended trying, for example, fileEncoding = "latin1". The problem with read.SAScii is that it has no such fileEncoding parameter.
See:
R: invalid multibyte string and Invalid multibyte string in read.csv
Just in case anyone has a similar problem: the issue and solution for me was to run options(encoding = "windows-1252") right before running the above code for read.SAScii, since the ASCII file is meant for use in SAS, and therefore on Windows, while I am using Linux.
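In other words, a minimal sketch of the same fix (reusing the variable names from the question's code):
# Set the encoding before calling read.SAScii; the NHIS fixed-width
# files were produced for SAS on Windows.
options(encoding = "windows-1252")
NHIS.11.samadult.df <- read.SAScii(
  NHIS.11.samadult.file.location,
  NHIS.11.samadult.SAS.read.in.instructions,
  zipped = TRUE
)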
The author of the SAScii library actually has another GitHub repository, asdfree, where he has working code for downloading CDC-NHIS datasets for all available years, as well as many other datasets from various surveys such as the American Housing Survey, FDA Drug Surveys, and many more.
The following link points to the author's solution to the issue in this question; from there, you can easily find a link to the asdfree repository: https://github.com/ajdamico/SAScii/issues/3 .
As far as this dataset goes, the code in https://github.com/ajdamico/asdfree/blob/master/National%20Health%20Interview%20Survey/download%20all%20microdata.R#L8-L13 does the trick, but it doesn't encode the columns as factors or numerics properly. The good thing is that for any given NHIS year there are only about ten to twenty truly numeric columns, so encoding these as numeric one by one is not so painful, and encoding the rest of the columns as factors requires only a loop through the non-numeric columns (sketched below).
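A minimal sketch of that loop, assuming a data frame df read in as all-character columns and a hand-made vector of the truly numeric columns (the column names here are hypothetical):
# Hypothetical names; replace with the actual numeric columns for your year.
numeric_cols <- c("AGE_P", "BMI")

for (col in names(df)) {
  if (col %in% numeric_cols) {
    df[[col]] <- as.numeric(df[[col]])  # the handful of truly numeric columns
  } else {
    df[[col]] <- as.factor(df[[col]])   # everything else becomes a factor
  }
}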
The easiest solution for me, since I only require the Sample Adult dataset for 2011 and I was able to get my hands on a machine with SAS installed, was to run the SAS program included at http://www.cdc.gov/nchs/nhis/nhis_2011_data_release.htm to encode the columns as necessary. Finally, I used proc export to export the SAS dataset to a CSV file, which I then opened in R easily, with no edits to the data needed except in dealing with missing values.
In case you want to work with NHIS datasets besides Sample Adult, it is worth noting that when I ran the available SAS program for the 2010 "Sample Adult Cancer" file (http://www.cdc.gov/nchs/nhis/nhis_2010_data_release.htm) and exported the data to a CSV, the resulting file had fewer column names than actual columns when I attempted to read it into R. Skipping the first line resolves this issue, but you lose the descriptive column names. You can, however, import this same data easily, without encoding, with the R code in the asdfree repository. Please read the documentation there for more info.