I am trying to import the SIPP 2014 panel data into R but am having some trouble.
It can be found here:
https://www.census.gov/programs-surveys/sipp/data/2014-panel/wave-1.html
Normally, this would be a pretty simple process and I could just use
data = read.csv("pu2014w1.dat")
The issue stems from the size of the dataset and the fact that I do not know what it is separated by nor how the column headers are done. Sadly, I cannot find documentation for importing this file into R.
Any help would be greatly appreciated.
It seems that the file https://thedataweb.rm.census.gov/pub/sipp/2014/pu2014w1.dat.gz
after unzipping, is a fixed-width format text file. So, to read it, we can use
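library(readr)
library(dplyr)  # needed for %>% and mutate()
# Parse the SAS input statement, which lists the position of each field
read_delim("https://thedataweb.rm.census.gov/pub/sipp/2014/pu2014w1.sas",
delim = " ", col_names = FALSE, skip = 6) -> foo
fwf_positions(start = foo$X4, end = foo$X6) -> bar
# Drop rows 5223-5231 (presumably the non-field lines at the end of the SAS file)
bar[ - c(5223:5231), ] -> bar2
# fwf_positions() stores a zero-based begin, so end - begin is the field width
bar3 <- bar2 %>% mutate(width = end - begin)
foobar <- fwf_widths(bar3$width)
read_fwf("pu2014w1.dat.gz", col_positions = foobar)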
Note that when reading a fixed-width text file, we need to specify the positions of the fields. I do this by manipulating the contents of the SAS input file, which tells us the positions of the fields (for use with SAS). Also, I had to download the gzipped file before I could successfully read it. Typically, I think, one can read directly from the URL. I'm not sure why reading from the URL didn't work here.
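To make the idea of field positions concrete, here is a minimal sketch on made-up data. It uses base R's read.fwf rather than readr, and the widths and column names are hypothetical (they are not the SIPP layout):

```r
# Two made-up fixed-width records: id (3 chars), birth year (4),
# sex (1), income (7, right-aligned). Not the real SIPP layout.
records <- c("0011985M  52000",
             "0021990F  48500")

tmp <- tempfile(fileext = ".dat")
writeLines(records, tmp)

# Each field is identified only by its width; there is no delimiter
df <- read.fwf(tmp,
               widths = c(3, 4, 1, 7),
               col.names = c("id", "birth_year", "sex", "income"),
               colClasses = c("character", "integer", "character", "numeric"))
print(df)
```

Reading id as character keeps the leading zeros, which would be lost if the column were converted to a number.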
I am trying to load a codebook from GitHub in RStudio. The URL is in the code below. The link points to an .md file, but I want to load its raw version. (As pic1 shows, there is a tab called Raw at the top right, and clicking it shows pic2.) I tried the link as given, but it does not work. Could anyone tell me how to do this? Thanks a lot!
cddf<-url("https://github.com/HimesGroup/BMIN503/blob/master/DataFiles/NHANES_2007to2008_DataDictionary.md")
cd<-read.table(cddf )
Update:
When I changed the code to:
codebook<-read.table("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008_DataDictionary.md",skip = 4, sep = "|", head = TRUE)
R successfully read most of the rows, but the "|" separator did not work for two variables: INDHHIN2 and MCQ010. See pic. Can anyone help figure out why? Thanks!
There are two issues here.
First, the raw file is available at the link https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008_DataDictionary.md. However, read.table is not going to be able to read that file without some help: read.table is used for tab or comma delimited files, and that's a table marked up for Markdown. This comes close:
read.table("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008_DataDictionary.md",
skip = 4, sep = "|", head = TRUE)
but it will still need some cleanup, to remove the first and last columns of junk it added, and to delete the first line.
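The cleanup could look something like this. It is a sketch on a made-up two-row table, not the real NHANES file, so the column names and rows are hypothetical. I also set quote = "" and comment.char = "" so that apostrophes and # characters in the descriptions are read literally; I have not checked whether that is what breaks INDHHIN2 and MCQ010, so treat it as a guess.

```r
# Hypothetical codebook in Markdown table form (not the real NHANES file)
md <- c(
  "| Variable | Description        |",
  "|----------|--------------------|",
  "| AGE      | Age in years       |",
  "| SEX      | Participant's sex  |"
)

raw <- read.table(text = md, sep = "|", header = TRUE, strip.white = TRUE,
                  quote = "", comment.char = "", stringsAsFactors = FALSE)

# Drop the empty first and last columns created by the leading and
# trailing pipes, and the first row, which is the |---|---| rule
codebook <- raw[-1, -c(1, ncol(raw))]
rownames(codebook) <- NULL
print(codebook)
```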
I'm trying to practice making word clouds in R and I've seen the process nicely explained in sites like this (http://www.r-bloggers.com/building-wordclouds-in-r/) and in some videos on YouTube. So I thought I'd pick some random long document to practice myself.
I chose the script for Good Will Hunting. It is available here (https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html). What I did is copy that into Notepad++ and start removing blank lines, names, etc. to try to clean up the data before saving. Saving as a .csv file doesn't seem to be an option so I saved it as a .txt file and R doesn't seem to want to read it in.
Both of the following lines return errors in R.
goodwillhunting <- read.csv("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
goodwillhunting <- read.table("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
My question is: starting from an HTML document, what is the best way to save it so that it can be read in and used for something like this? I know the rvest package can read in web pages. The tutorials for word clouds have used .csv files, so I'm not sure if that's what my end goal needs to be.
This might be a way to read in the data going that route?
test = read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html")
text = html_text(test)
Any help is appreciated!
Here's one way:
library(rvest)
library(wordcloud)
test <- read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html")
text <- html_text(test)
content <- stringi::stri_extract_all_words(text, simplify = TRUE)
wordcloud(content, min.freq = 10, colors = RColorBrewer::brewer.pal(5,"Spectral"))
This draws a word cloud of the words that appear at least 10 times in the script.
Here is a simple example:
library(wordcloud)
text = scan("fulltext.txt", character(0), strip.white = TRUE)
frequency_table = as.data.frame(table(text))
wordcloud(frequency_table$text, frequency_table$Freq)
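If you want to see what the frequency table looks like before plotting, here is a small base-R sketch on a made-up sentence (the sample text and the column names word and freq are my own, not from the script):

```r
# Made-up sample text; in practice this would be the cleaned script
text <- "The cat sat on the mat. The dog sat too."

# Lower-case, split on anything that is not a letter or apostrophe,
# and drop any empty strings left over from the split
words <- unlist(strsplit(tolower(text), "[^a-z']+"))
words <- words[nzchar(words)]

# Count each word and put the most frequent first
freq <- sort(table(words), decreasing = TRUE)
frequency_table <- as.data.frame(freq, stringsAsFactors = FALSE)
names(frequency_table) <- c("word", "freq")
print(head(frequency_table))

# With the wordcloud package installed, the table can then be plotted:
# wordcloud::wordcloud(frequency_table$word, frequency_table$freq, min.freq = 1)
```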
I would like to print several items, one after the other, to the same text file (outfile.txt).
For instance, first I would like to print a data frame, u, to outfile.txt. Afterwards, a written message 'hello', and finally the summary of a model.
How can I do it? Is sink("outfile.txt") appropriate for this case?
It is generally a very bad idea to mix data in the same file. I advise against it in the strongest terms: it makes the data file next to unusable for other programs.
That said, most functions to save data have an append argument. You can set this to TRUE to append to an existing file rather than overwriting its contents. No need for sink.
Where you do need sink (or equivalent) is when you want to write contents formatted in the same way as it’s written on the console. This, for instance, is the case for summary.
Here’s an example similar to your requirements:
filename = 'test.txt'
write.table(head(cars), filename, quote = FALSE, col.names = NA)
cat('\nHello\n\n', file = filename, append = TRUE)
capture.output(print(summary(cars)), file = filename, append = TRUE)
Rather than sink, this uses capture.output, which is a convenience wrapper around sink.
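For comparison, the same thing done with sink directly might look like the sketch below. The names u and model follow the question; the data (head(cars)) and the lm fit are stand-ins I picked for illustration.

```r
outfile <- tempfile(fileext = ".txt")

u <- head(cars)                          # stand-in for the question's data frame
model <- lm(dist ~ speed, data = cars)   # stand-in model

sink(outfile)            # divert console output to the file
print(u)
cat("\nhello\n\n")
print(summary(model))
sink()                   # restore output to the console

out <- readLines(outfile)
cat(out, sep = "\n")
```

Everything printed between the two sink calls ends up in the file, formatted exactly as it would be on the console. If an error occurs in between, output stays diverted until sink() is called, which is one reason to prefer capture.output.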
I am writing an R script that reads in a template .R file and a list of dates, then creates a bunch of folders corresponding to the dates, each containing a copy of the .R file in which text substitution has been performed in R to customize the script for the given date.
I'm stuck on the part where I write out the .R file though, because the formatting and/or character representation keeps getting screwed up.
Here's a minimal, reproducible example:
RMapsDemo <- readLines("https://raw.githubusercontent.com/hack-r/RMapsDemo/master/RMapsDemo.R")
RMapsDemo <- gsub("## File: RMapsDemo.R", "## File: RMapsDemo.R ####", RMapsDemo)
save(RMapsDemo, file = "RMapsDemo.R") # Doesn't work right
save(RMapsDemo, file = "RMapsDemo.R", ascii = T) # Doesn't work right
dput(RMapsDemo, file = "RMapsDemo.R") # Close, but no cigar
dput(RMapsDemo, file = "RMapsDemo.R", control = c("keepNA", "keepInteger")) # Close, but no cigar
Ricardo Saporta pointed out the solution in the comments -- use writeLines.
I feel stupid for not thinking of this myself. It works beautifully.
writeLines(RMapsDemo, con = "RMapsDemo.R")
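For the full workflow described in the question (one folder and one customized script per date), a minimal sketch could look like this. The template lines, the DATE_PLACEHOLDER token, the dates, and the output file name are all made up for illustration:

```r
# Hypothetical template; in practice this would come from readLines()
template <- c("## File: template.R",
              "run_date <- 'DATE_PLACEHOLDER'",
              "print(run_date)")
dates <- c("2015-01-01", "2015-01-02")

root <- tempfile("scripts_")   # temporary output directory for this demo

for (d in dates) {
  dir.create(file.path(root, d), recursive = TRUE)
  # Substitute the date, then write the character vector back out line by line
  script <- gsub("DATE_PLACEHOLDER", d, template, fixed = TRUE)
  writeLines(script, con = file.path(root, d, "template.R"))
}

print(readLines(file.path(root, "2015-01-02", "template.R")))
```

writeLines writes each element of the character vector as one line of plain text, whereas save writes R's serialized format and dput writes the R expression that would recreate the vector, which is why neither produced a usable script.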
I have some data that I am trying to load into R. It is in .csv files and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada.)
The data is coded in an unusual manner. A typical line is:
12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11
There is a " at the end of Central Nova but not at the beginning. So in order to read in the data, I suppressed the quotes, which worked fine for the first few files, i.e.
test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")
Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:
12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28
Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R want to split this into 2 variables. The data can't be read in, since R tries to add a column halfway through constructing the data frame.
Excel and OpenOffice both can open these files no problem. Somehow, Excel and OpenOffice know that quotation marks only matter if they are at the beginning of a variable entry.
Do you know what option I need to enable on R to get this data in? I have >300 files that I need to load (each with ~1000 rows each) so a manual fix is not an option...
I have looked all over the place for a solution but can't find one.
Building on my comments, here is a solution that would read all the CSV files into a single list.
# Deal with French properly
options(encoding="latin1")
# Set your working directory to where you have
# unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")
# Get the file names (CSV files only)
temp <- list.files(pattern = "\\.csv$")
# Extract the 5-digit code which we can use as names
Codes <- gsub("pollresults_resultatsbureau|\\.csv$", "", temp)
# Read all the files into a single list named "pollResults"
pollResults <- lapply(seq_along(temp), function(x) {
T0 <- readLines(temp[x])
T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
final <- read.csv(text = T0, header = TRUE)
final
})
names(pollResults) <- Codes
You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".
invisible(lapply(seq_along(pollResults), function(x) {
NewFilename <- paste("Corrected", temp[x], sep = "_")
write.csv(pollResults[[x]], file = NewFilename,
quote = TRUE, row.names = FALSE)
}))
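Another thing the list format makes easy is stacking everything into one data.frame with the district code as a column. This is only a sketch: the tiny pollResults list below is made up (two districts, two columns) so that it runs without the real files.

```r
# Made-up stand-in for the real list of 308 data.frames
pollResults <- list(
  "12002" = data.frame(poll = c("1", "6-1"), votes = c(11L, 28L),
                       stringsAsFactors = FALSE),
  "24058" = data.frame(poll = "1", votes = 40L, stringsAsFactors = FALSE)
)

# Add the district code to each data.frame, then bind them row-wise
combined <- do.call(rbind, Map(function(df, code) {
  cbind(district = code, df, stringsAsFactors = FALSE)
}, pollResults, names(pollResults)))
rownames(combined) <- NULL
print(combined)
```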
Hope this helps!
This answer is mainly thanks to @AnandaMahto (see comments to the original question).
First, it helps to set some options globally because of the French accents in the data:
options(encoding="latin1")
Next, read in the data verbatim using readLines():
temp <- readLines("pollresults_resultatsbureau13001.csv")
Following this, simply replace the first comma in each line of data with a comma+quotation. This works because the first field is always 5 characters long. Note that it leaves the header untouched.
temp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', temp[-1])
Penultimately, write over the original file.
fileConn<-file("pollresults_resultatsbureau13001.csv")
writeLines(temp,fileConn)
close(fileConn)
Finally, simply read the data back into R:
data<-read.csv(file="pollresults_resultatsbureau13001.csv",header = TRUE,sep=",")
There is probably a more parsimonious way to do this (and one that can be iterated more easily) but this process made sense to me.
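To see the fix in isolation, here it is applied to the two sample lines from the question, without touching any files. The only thing I have added is reading from text = instead of a file, with header = FALSE because the sample has no header row.

```r
# The two sample lines from the question (missing the opening quote
# before Central Nova)
raw <- c(
  '12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11',
  '12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28'
)

# Insert a quotation mark after the first 6 characters
# (the 5-digit code plus the comma)
fixed <- gsub('^(.{6})(.*)$', '\\1\\"\\2', raw)

# With balanced quotes, the comma inside "Pictou, Subd. A" is no longer
# treated as a field separator
polls <- read.csv(text = fixed, header = FALSE, stringsAsFactors = FALSE)
print(polls[, 1:5])
```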