This might be a rather specific question, but I'm currently learning with the book "Advanced R Statistical Programming and Data Models", which has a chapter called "Data Setup". However, I can't download any data in the way the book describes. Has anyone here worked with the book and has an idea of how I can get my hands on the "04690-0001-Data.rda" file?
Thank you in advance!
If you go to the GitHub repository for the book, there is an open issue stating that the resource is 'lost' and that you have to download it from the source: https://www.icpsr.umich.edu/web/NACDA/studies/4690/versions/V9 (you can pick one of several formats, but it looks like you need to create a login).
Edit
So I was interested to find out what this book was about, and it looks like a great resource. It turns out this file is the basis for many of the examples, and the "intro" to the book is basically taking the raw data and processing it for use in the other examples.
I used my ORCiD to access the site, downloaded the raw data (tab-delimited format), and loaded/processed it with:
library(data.table)
library(vroom)

# read the raw tab-delimited ICPSR export
df <- vroom(file = "~/Downloads/advanced-r-statistical-programming-and-data-models-master/ICPSR_04690/DS0001/04690-0001-Data.tsv")

options(
  width = 70,
  stringsAsFactors = FALSE,
  digits = 2)

acl <- as.data.table(df)
str(acl)

# keep only the variables used in the book's examples
acl <- acl[, .(
  V1, V1801, V2101, V2064,
  V3007, V2623, V2636, V2640,
  V2000,
  V2200, V2201, V2202,
  V2613, V2614, V2616,
  V2618, V2681,
  V7007, V6623, V6636, V6640,
  V6201, V6202,
  V6613, V6614, V6616,
  V6618, V6681
)]

# replace the ICPSR variable codes with readable names
setnames(acl, names(acl), c(
  "ID", "Sex", "RaceEthnicity", "SESCategory",
  "Employment_W1", "BMI_W1", "Smoke_W1", "PhysActCat_W1",
  "AGE_W1",
  "SWL_W1", "InformalSI_W1", "FormalSI_W1",
  "SelfEsteem_W1", "Mastery_W1", "SelfEfficacy_W1",
  "CESD11_W1", "NChronic12_W1",
  "Employment_W2", "BMI_W2", "Smoke_W2", "PhysActCat_W2",
  "InformalSI_W2", "FormalSI_W2",
  "SelfEsteem_W2", "Mastery_W2", "SelfEfficacy_W2",
  "CESD11_W2", "NChronic12_W2"
))

acl[, ID := factor(ID)]
acl[, SESCategory := factor(SESCategory)]
acl[, SWL_W1 := SWL_W1 * -1]  # flip the sign of the satisfaction-with-life score

saveRDS(acl, "advancedr_acl_data.RDS", compress = "xz")
That left me with a file called "advancedr_acl_data.RDS", which I then loaded for the GLM2.R section. The example code has some minor bugs that you will need to iron out, but it looks like an excellent resource - thanks!
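In case it helps, a minimal sketch of pulling the processed file back in for the later chapters (the object name acl is just my choice, the book's scripts may use a different one):
library(data.table)

# reload the processed ACL data saved above
acl <- readRDS("advancedr_acl_data.RDS")
str(acl)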
Related
In R, I am extracting data from PDF tables using the tabulizer library, and the names are in Nepali.
After extracting, I get this table:
https://i.stack.imgur.com/Ltpqv.png
Now I want to change the names in column 2 to their English equivalents. Is there any way to do this in R?
The R code I wrote was:
library(tabulizer)

location <- "https://citizenlifenepal.com/wp-content/uploads/2019/10/2nd-AGM.pdf"
out <- extract_tables(location, pages = 113)
## write.table(out, file = "try.txt")

final <- do.call(rbind, out)
final <- as.data.frame(final)  ### creating df

col_name <- c("S.No.", "Types of Insurance", "Inforce Policy Count", "",
              "Sum Assured of Inforce Policies", "", "Sum at Risk", "",
              "Sum at Risk Transferred to Re-Insurer", "",
              "Sum At Risk Retained By Insurer", "")
names(final) <- col_name
final <- final[-1, ]

write.csv(final, file = "/cloud/project/Extracted_data/Citizen_life.csv", row.names = FALSE)
View(final)
It appears that the document is using a non-Unicode encoding. This web site https://www.ashesh.com.np/preeti-unicode/ can convert some Nepali encodings to Unicode, which would display properly in R, assuming you have the right fonts loaded. When I tried it on the output of your code, it did something that looked okay to me, but I don't know Nepali:
> out[[1]][1,2]
[1] ";fjlws hLjg aLdf"
When I convert the contents of that string, I get
सावधिक जीवन बीमा
which looks to me like the text on that page of the document. If it's actually written correctly, then converting it to English will need a Nepali speaker to do the translation: hopefully that's you, but if I use Google Translate, it gives
Term life insurance
So here's my suggestion: contact the owner of the www.ashesh.com.np website and find out if they can give you the conversion rules. Write an R function to implement them (a rough sketch of the idea is below) if you can't find one already written by someone else. Then do the English translations manually.
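For illustration only, here is a sketch of what such a function could look like. The preeti_map entries are guessed purely from the single example string above, so the table is incomplete and may be wrong in places; the real rules would have to come from the website owner or a proper Preeti-to-Unicode reference.
library(stringr)

# very partial Preeti -> Unicode map, inferred only from ";fjlws hLjg aLdf" above
preeti_map <- c(
  ";" = "स", "f" = "ा", "j" = "व", "l" = "ि", "w" = "ध", "s" = "क",
  "h" = "ज", "L" = "ी", "g" = "न", "a" = "ब", "d" = "म"
)

preeti_to_unicode <- function(x) {
  out <- str_replace_all(x, preeti_map)
  # Preeti places the short-i matra before its consonant; Unicode wants it after
  str_replace_all(out, "ि(.)", "\\1ि")
}

preeti_to_unicode(";fjlws hLjg aLdf")
# should give "सावधिक जीवन बीमा"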
I have a question related to the R programming language.
In R I know how to define a miodin project, which I have done like this:
library(miodin)

mp <- MiodinProject(
  name = "My Project",
  author = "lee",
  path = ".")

mshow(mp)
But I need a little help with defining a case-control study design for a dataset that is on my computer, in a file named "seq.txt", rather than in some online database. How can I define a study design for that dataset?
install.packages("readtext")
library(readtext)
Will get you the work direct you are on in R
getwd()
Set the work direct to best fit your project, use quotes " " and / forward slash to path inside ( ) for setwd().
setwd("/Users/r/Desktop/Prog/R/")
Put the file in the work direct, make sure using getwd()
Use this and store in a variable
df <- readtext("seq.txt")
If you want more help forward that post example of the data so we can help you figure it out.
TIP: Always crate a New project in R--> File --> New Project
We have recently moved from Slack to Microsoft Teams. There was a useful package (slackr) that allowed files to be uploaded to Slack from R (example below), so I am wondering if there is an equivalent for Microsoft Teams.
library(slackr)

slackrSetup(incoming_webhook_url = "webhook-url",
            api_token = "api-token")

d1 <- data.frame(col1 = "a", col2 = "b")

write.table(d1, file = paste0("my-location/export.csv"))

slackr_upload(paste0("my-location/export.csv"),
              channel = "my-channel")
I have found that there is a teamr package, which is useful for messages but doesn't allow uploading of files. I have attempted to at least format the contents of the data frame as a markdown table in the message sent from teamr, but as the tables can be quite large (500 rows, 20-30 columns), this isn't convenient for the Microsoft Teams users to extract the data.
Alternatively, I can create and send an email with an attachment from R, but hoping there is an approach to keep it to teams that I have missed.
Like @Gakku said, I think this could be achieved with the Microsoft365R package.
I think something along these lines would put it in a specific team, even a specific channel, creating the upload folder along the way:
library(Microsoft365R)

team <- get_team("NAME OF YOUR TEAM")
channel <- team$get_channel("NAME OF YOUR CHANNEL")

# create a folder in the channel's file area, then upload into it
channel$get_folder()$create_folder("UPLOAD LOCATION")
channel$get_folder()$get_item("UPLOAD LOCATION")$upload("UPLOAD_FILE.CSV")
I know this is old, but in case someone comes across this, look at Microsoft365R, which lets you upload files and do much more in MS Teams.
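For example, combining the CSV export from the question with the Microsoft365R calls from the other answer, a minimal sketch (team, channel, and file names are placeholders) could look like:
library(Microsoft365R)

# write the data frame to CSV, as in the original slackr workflow
d1 <- data.frame(col1 = "a", col2 = "b")
write.csv(d1, "export.csv", row.names = FALSE)

# upload it to the channel's Files tab
team <- get_team("my-team")
channel <- team$get_channel("my-channel")
channel$get_folder()$upload("export.csv")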
I need to extract the body text from the articles in my corpus for text mining, as my code currently includes the references, which bias my results. All coding is performed in R using RStudio. I have tried many techniques.
I have text mining code (of which only the first bit is included below), but I recently found out that simply text mining a corpus of research articles is insufficient, as the reference sections will bias the results; the reference sections alone may provide another analysis, which would be a bonus.
EDIT: perhaps there is an R package that I am not aware of.
My initial approach was to clean the text after converting from PDF to text, using regex commands within quanteda. As a reference I was intending to follow: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005962&rev=1 . Their method confuses me, not just in coding a parallel regex, but in how to recognize the last reference section so that portions of the text are not cut off when "references" appears before that section; I have been in contact with their team, but am waiting to learn more about their code, since it appears they use a streamlined program now.
PubChunks and LAPDF-text were my next two options, the latter of which is referenced in the paper above. In order to use the PubChunks package I need to convert all of my PDF (now converted to text) files into XML. This should be straightforward, only the packages I found (fileToPDF, pdf2xml, trickypdf) did not appear to work; this seems to be a within-R concern. (Code relating to trickypdf is included below.)
For LAPDF-text, ...[see edit]... the code did not seem to run properly. There are also very limited resources out there for this package in terms of guides etc., and they have shifted their focus to a larger package, written in a different language, that does happen to include LAPDF-text.
EDIT: I installed Java 1.6 (SE 6) and Maven 2.0, then ran the LAPDF-text installer, which seemed to work. That being said, I am still having issues with this process and with mvn commands recognizing folders, though I am continuing to work through it.
I am guessing there is someone else out there, as there are related research papers with similarly vague processes, who has done this before and has also got their hands dirty. Any recommendations are greatly appreciated.
Cheers
library(quanteda)
library(pdftools)
library(tm)
library(methods)
library(stringi) # regex pattern
library(stringr) # simpler than stringi ; uses stringi on backend
setwd('C:\\Users\\Hunter S. Baggen\\Desktop\\ZS_TestSet_04_05')
files <- list.files(pattern = 'pdf$')
summary(files)
files
# Length 63
corpus_tm <- Corpus(URISource(files),
                    readerControl = list(reader = readPDF()))
corpus_tm
# documents 63
inspect(corpus_tm)
meta(corpus_tm[[1]])
# convert tm::Corpus to quanteda::corpus
corpus_q <- corpus(corpus_tm)
summary(corpus_q, n = 2)
# Add Doc-level Variables here *by folder and meta-variable year
corpus_q
head(docvars(corpus_q))
metacorpus(corpus_q)
#_________
# extract segments ~ later to remove segments
# corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)
corpus_q_refA <- corpus_reshape(corpus_q, to = "paragraphs", showmeta = TRUE)
corpus_q_refA
# Based upon Westergaard et al (15 Million texts; removing references)
corpus_q_refB <- corpus_trim(corpus_q, what = c('sentences'),
                             exclude_pattern = '^\\[\\d+\\]\\s[A-Za-z]')
corpus_q_refB  # note: backslashes in the regex must be doubled in R strings

corpus_tm[1]
sum(str_detect(texts(corpus_q), '^Referen'))

corpus_qB <- corpus_q
RemoveRef_B <- corpus_segment(corpus_q, pattern = 'Reference', valuetype = 'regex')
cbind(texts(RemoveRef_B), docvars(corpus_qB))
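# A possible alternative sketch (not from the paper): truncate each document at the
# LAST line that looks like a reference-section heading, so an earlier mention of
# "References" in the body does not cut the text short. Uses stringi (loaded above);
# the heading alternatives below are assumptions - extend as needed.
drop_references <- function(txt) {
  loc <- stri_locate_last_regex(
    txt, '(?mi)^\\s*(References|Bibliography|Literature Cited)\\s*$')
  ifelse(is.na(loc[, 1]), txt, stri_sub(txt, 1, loc[, 1] - 1))
}
corpus_q_body <- corpus(drop_references(texts(corpus_q)), docvars = docvars(corpus_q))
summary(corpus_q_body, n = 2)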
# -------------------------
# Idea taken from guide (must reference guide)
setGeneric('removeCitations', function(object, ...) standardGeneric('removeCitations'))

setMethod('removeCitations', signature(object = 'PlainTextDocument'),
  function(object, ...) {
    txt <- content(object)
    # remove citation lines starting with '>'
    # e.g. for > : citations <- grep('^[[:blank:]]*>.*', txt); if (length(citations) > 0) txt <- txt[-citations]
    # e.g. for -- : signatureStart <- grep('^-- $', txt); if (length(signatureStart) > 0) txt <- txt[-(signatureStart:length(txt))]
    # using the Westergaard et al. (15 million texts) removal guideline
    citations <- grep('^\\[\\d+\\]\\s[A-Za-z]', txt)
    if (length(citations) > 0) txt <- txt[-citations]
    content(object) <- txt
    object
  })
# TRICKY PDF download from github
library(pubchunks)
library(polmineR)
library(githubinstall)
library(devtools)
library(tm)
githubinstall('trickypdf') # input Y then 1 if want all related packages
# library(trickypdf)
# This time suggested I install via 'PolMine/trickypdf'
# Second attempt issue with RPoppler
install_github('PolMine/trickypdf')
library(trickypdf) # Not working
# Failed to install: package 'Rpoppler' is not available for R 3.6.0
Aside from the Rpoppler issue above, the initial description should be sufficient.
UPDATE: Having reached out to several research groups, the TALN-UPF researchers got back to me and provided a pdfx Java program that has allowed me to convert my PDFs easily into XML. Of course, now I learn that PubChunks is built around its sister program that extracts XMLs from search engines, and is therefore of little use to me. That being said, the TALN-UPF group will hopefully advise whether I can extract the body of each text via their other programs (Dr Inventor and Grobid). If this is possible then everything will be accomplished. Of course, if not, I will be back at regex.
I wish to read data into R from SAS data sets in Windows. The read.ssd function allows me to do so; however, it seems to have an issue when I try to import a SAS data set that has any non-alphabetic symbols in its name. For example, I can import table.sas7bdat using the following:
directory <- "C:/sas data sets"
sashome <- "/Program Files/SAS/SAS 9.1"
table.df <- read.ssd(directory, "table", sascmd = file.path(sashome, "sas.exe"))
but I can't do the same for a table SAS data set named table1.sas7bdat. It returns an error:
Error in file.symlink(oldPath, linkPath) :
symbolic links are not supported on this version of Windows
Given that I do not have the option to rename these data sets, is there a way to read a SAS data set that has non-alphabetic symbols in its name in to R?
Looking around, it seems others have run into your problem as well. Perhaps it's just a bug.
Anyway, try the suggestion from this (old) R-help post, posted by the venerable Dan Nordlund, who's pretty good at this stuff - and who is also active on SAS-L (sasl@listserv.uga.edu) if you want to try cross-posting your question there.
https://stat.ethz.ch/pipermail/r-help/2008-December/181616.html
Also, you might consider the transport-file method if you don't mind 8-character variable names (a minimal sketch is below).
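For example, if you can export the data set to a SAS transport file first (the file name table1.xpt here is just a placeholder), foreign::read.xport will read it directly, sidestepping read.ssd's temporary-file machinery:
library(foreign)

# read a SAS transport (.xpt) file previously exported from SAS
table1.df <- read.xport("C:/sas data sets/table1.xpt")
str(table1.df)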
Use:
directory <- "C:/sas data sets"
sashome <- "/Program Files/SAS/SAS 9.1"
table.df <- read.ssd(library=directory, mem="table1", formats=F,
sasprog=file.path(sashome, "sas.exe"))