Defining a case-control study design for a local dataset in R

I am here to ask a question related to the R programming language.
In R I know how to define a miodin project, which I have done like this:
library(miodin)
mp <- MiodinProject(
name = "My Project",
author = "lee",
path = ".")
mshow(mp)
But I need a little help with defining a case-control study design for a dataset that is on my computer, in a file called "seq.txt", rather than in some online database. So how can I define a study design for that dataset?

install.packages("readtext")
library(readtext)
This will show you the working directory you are currently in:
getwd()
Set the working directory to whatever best fits your project; use quotes " " and forward slashes / in the path you pass to setwd():
setwd("/Users/r/Desktop/Prog/R/")
Put the file in that working directory (verify it with getwd()), then read it and store it in a variable:
df <- readtext("seq.txt")
If you want more help beyond that, post an example of the data so we can help you figure it out.
TIP: Always create a new project in R --> File --> New Project.
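If seq.txt is actually a tabular dataset (e.g. tab-delimited with a header row) rather than free text, a base-R import may be more useful than readtext(), because it gives you a data frame with proper columns. A minimal sketch, assuming a tab-delimited file with a header (adjust sep and header to match your file):
# Assumes seq.txt is in the working directory, tab-delimited, with a header row.
seq_data <- read.delim("seq.txt", header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)
str(seq_data)  # inspect the columns before setting up the study design
From there, the case/control grouping for the study design would come from one of these columns; check the miodin vignette for the exact study-design constructor and its arguments, since those are not shown here.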

Related

Basic XML R package question - how to return other attributes for matching entries?

I've downloaded an XML database (Cellosaurus - https://web.expasy.org/cellosaurus/) and I'm trying to use the XML package in R to find all misspellings of a cell line name and return the misspelling and accession.
I've never used XML or XPath expressions before and I'm having real difficulties, so I also hope I've used the correct terminology in my question...
I've loaded the database like so:
doc <- XML::xmlInternalTreeParse(file)
and I can see an example entry which looks like this:
<cell-line category="Cancer cell line">
  <accession-list>
    <accession type="primary">CVCL_6774</accession>
  </accession-list>
  <name-list>
    <name type="identifier">DOV13</name>
  </name-list>
  <comment-list>
    <comment category="Misspelling"> DOR 13; In ArrayExpress E-MTAB-2706, PubMed=25485619 and PubMed=25877200 </comment>
  </comment-list>
</cell-line>
I think I've managed to pull out all of the misspellings (which is slightly useful already):
mispelt <- XML::getNodeSet(doc, "//comment[@category=\"Misspelling\"]")
but now I have no idea how to get the accession associated with each misspelling. Perhaps there's a different function I should be using?
Can anyone help me out or point me towards a simple XML R package tutorial please?
It's difficult to help with an incomplete example. But the basic idea is to navigate up the tree structure to get to the data you want. I've used the more current xml2 package but the same idea should hold for XML. For example
library(xml2)
xx <- read_xml("cell.xml")
nodes <- xml_find_all(xx, "//comment[@category=\"Misspelling\"]")
xml_find_first(nodes, ".//../../accession-list/accession") |> xml_text()
# [1] "CVCL_6774"
It's not clear if you have multiple comments or how your data is structured. You may need to lapply() or purrr::map() the second node selector over the results of the first if you have multiple nodes; one way of pairing them up is sketched below.
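As a rough sketch of that pairing, each misspelling comment can be matched with its cell line's primary accession by walking up to the enclosing cell-line element (the file name cellosaurus.xml is just a placeholder for your local copy of the database):
library(xml2)
doc <- read_xml("cellosaurus.xml")  # placeholder path to the downloaded database
comments <- xml_find_all(doc, "//comment[@category='Misspelling']")
# xml_find_first() returns one match per input node, so the two vectors line up.
accessions <- xml_find_first(comments,
  "ancestor::cell-line/accession-list/accession[@type='primary']")
data.frame(
  accession   = xml_text(accessions),
  misspelling = trimws(xml_text(comments))
)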

Access Data "Advanced R Statistical Programming and Data Models"

This might be a rather special question, but I'm currently learning with the book "Advanced R Statistical Programming and Data Models", which has a chapter "Data Setup". However, I can't download the data in the way the book describes. Has anyone here worked with the book and has an idea of how I can get my hands on the "04690-0001-Data.rda" file?
Thank you in advance!
If you go to the GitHub repository for the book, there is an open issue stating that the resource is 'lost' and that you have to download it from the source: https://www.icpsr.umich.edu/web/NACDA/studies/4690/versions/V9 (you can pick one of many formats, but it looks like you need to create a login).
Edit
So I was interested to find out what this book was about. It looks like a great resource. Turns out this file is the basis for many of the examples, and the "intro" to the book is basically taking the raw data and processing it for use in other examples.
I used my ORCiD to get access, downloaded the raw data (delimited format), and loaded/processed it using:
library(data.table)
library(vroom)
df <- vroom(file = "~/Downloads/advanced-r-statistical-programming-and-data-models-master/ICPSR_04690/DS0001/04690-0001-Data.tsv")
options(
width = 70,
stringsAsFactors = FALSE,
digits = 2)
acl <- as.data.table(df)
str(acl)
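# Keep only the columns used in the book's examples (ICPSR variable codes).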
acl <- acl[, .(
V1, V1801, V2101, V2064,
V3007, V2623, V2636, V2640,
V2000,
V2200, V2201, V2202,
V2613, V2614, V2616,
V2618, V2681,
V7007, V6623, V6636, V6640,
V6201, V6202,
V6613, V6614, V6616,
V6618, V6681
)]
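# Rename the ICPSR variable codes to the descriptive names used in the book.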
setnames(acl, names(acl), c(
"ID", "Sex", "RaceEthnicity", "SESCategory",
"Employment_W1", "BMI_W1", "Smoke_W1", "PhysActCat_W1",
"AGE_W1",
"SWL_W1", "InformalSI_W1", "FormalSI_W1",
"SelfEsteem_W1", "Mastery_W1", "SelfEfficacy_W1",
"CESD11_W1", "NChronic12_W1",
"Employment_W2", "BMI_W2", "Smoke_W2", "PhysActCat_W2",
"InformalSI_W2", "FormalSI_W2",
"SelfEsteem_W2", "Mastery_W2", "SelfEfficacy_W2",
"CESD11_W2", "NChronic12_W2"
))
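# Convert ID and SESCategory to factors and reverse the sign of SWL_W1.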
acl[, ID := factor(ID)]
acl[, SESCategory := factor(SESCategory)]
acl[, SWL_W1 := SWL_W1 * -1]
saveRDS(acl, "advancedr_acl_data.RDS", compress = "xz")
That left me with a file called "advancedr_acl_data.RDS", which I then loaded for the GLM2.R section. The example code has some minor bugs that you will need to iron out, but it looks like an excellent resource - thanks!
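For reference, the saved object can then be reloaded in the later scripts with readRDS() (the file name is the one chosen above):
acl <- readRDS("advancedr_acl_data.RDS")
str(acl)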

Is there some way to change the character encoding to its English equivalent in R?

In R, I am extracting data from PDF tables using the tabulizer library, and the names are in Nepali. After extracting, I get this table:
(screenshot: https://i.stack.imgur.com/Ltpqv.png)
Now I want to change column 2's names to their English equivalents. Is there any way to do this in R?
The R code I wrote was:
library(tabulizer)
location <- "https://citizenlifenepal.com/wp-content/uploads/2019/10/2nd-AGM.pdf"
out <- extract_tables(location,pages = 113)
##write.table(out,file = "try.txt")
final <- do.call(rbind,out)
final <- as.data.frame(final) ### creating df
col_name <- c("S.No.","Types of Insurance","Inforce Policy Count", "","Sum Assured of Inforce Policies","","Sum at Risk","","Sum at Risk Transferred to Re-Insurer","","Sum At Risk Retained By Insurer","")
names(final) <- col_name
final <- final[-1,]
write.csv(final,file = "/cloud/project/Extracted_data/Citizen_life.csv",row.names = FALSE)
View(final)
It appears that the document is using a non-Unicode encoding. This web site https://www.ashesh.com.np/preeti-unicode/ can convert some Nepali encodings to Unicode, which would display properly in R, assuming you have the right fonts loaded. When I tried it on the output of your code, it did something that looked okay to me, but I don't know Nepali:
> out[[1]][1,2]
[1] ";fjlws hLjg aLdf"
When I convert the contents of that string, I get
सावधिक जीवन बीमा
which looks to me something like the text on that page in the document. If it's actually written correctly, then converting it to English will need a Nepali speaker to do the translation: hopefully that's you, but if I use Google Translate, it gives
Term life insurance
So here's my suggestion: contact the owner of that www.ashesh.com.np website, and find out if they can give you the translation rules. Write an R function to implement them if you can't find one by someone else. Then do the English translations manually.
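Once the conversion or translation rules are known (from that site or a Nepali speaker), applying them in R is just a lookup over column 2. A minimal sketch, using the single string worked out above as a placeholder entry:
# Hypothetical glossary mapping Preeti-encoded strings to English equivalents;
# the one entry below is the example from this answer, the rest must be filled in.
glossary <- c(";fjlws hLjg aLdf" = "Term life insurance")
col2 <- as.character(final[[2]])
final[[2]] <- ifelse(col2 %in% names(glossary), glossary[col2], col2)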

For R: How to exclude some data files based on file language

I'm rather new to R (and in law school so this is all very new to me), so apologies if this is poorly worded.
I have a series of about 1500 documents that I am importing into R to categorize and analyze later. The first thing that I need to do is exclude all documents that are written in French, which are labelled with an "FR" in the title/doc.info. I was curious what kind of code I could use to exclude those before importing the files, so I have a clean data set before analyzing anything (since it will obviously make a mess of processes like sentiment analysis).
Any help is appreciated (even if that help is explaining how to better talk about coding).
Kind regards!
edit 1
The code that I am using is readtext(folder), which you can see below:
folder<-"C:/[pathway]"
submissions<-readtext(folder)
submissions_text<-submissions$text
submission_number<- numeric()
submission_person<- factor()
submission_code<- factor()
submission_language<-factor()
submission_location<-factor()
for (submission_name in submissions$doc_id) {
  submission_name <- gsub(".txt", "", submission_name)
  number <- as.numeric(strsplit(submission_name, "_|-")[[1]][1])
  submission_number <- c(submission_number, number)
  person <- strsplit(submission_name, "_")[[1]][2]
  submission_person <- c(submission_person, person)
  code <- strsplit(submission_name, "_")[[1]][3]
  submission_code <- c(submission_code, code)
  lang <- strsplit(submission_name, "_")[[1]][4]
  submission_language <- c(submission_language, lang)
  location <- strsplit(submission_name, "_")[[1]][5]
  submission_location <- c(submission_location, location)
}
submissions<-cbind(submissions,submission_number)
submissions<-cbind(submissions,submission_person)
submissions<-cbind(submissions,submission_code)
submissions<-cbind(submissions,submission_language)
submissions<-cbind(submissions,submission_location)
submissions<-submissions[order(submissions$submission_number, decreasing = FALSE),]
This is just the organizational aspect of my code. I am looking to hopefully exclude all of the French data before this point (but if it comes afterward, I would also be more than happy with that).
The functionality you are after can be found in the list.files() function (see ?list.files for the documentation).
In short, your code will likely end up looking something like this:
setwd("c:/path/to/your/data/here")
files <- list.files()
non_french_files <- files[!grepl("FR", files)]
lapply(non_french_files, function(x) {
f <- read.csv(x)
#do stuff with f
}]
Note - you could directly leverage the pattern parameter of list.files(), but I chose to do it in two steps in case you wanted to do something else with the French files. This also simplifies what each line of code is doing...
...good luck and welcome to R!
Here's an alternative similar to @Chase's:
# set wd
files <- list.files()[!grepl("FR", list.files())]
lapply(files, function(x) read.csv(x)) # reads all at once, might want to save each
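Since the question imports with readtext() rather than read.csv(), the same filtering idea can also be applied to the file list before the import; readtext() should accept a vector of file paths (the folder path is the placeholder from the question):
library(readtext)
folder <- "C:/[pathway]"
# List the .txt files, drop any whose name contains "FR", import the rest.
txt_files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
txt_files <- txt_files[!grepl("FR", basename(txt_files))]
submissions <- readtext(txt_files)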

R: Improving workflow and keeping track of output

I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of ending up with a folder full of output (plots, RData files, csv, etc.) and, after some time, having no clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is.

So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function MetaInfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. In particular, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
  # message  - character string - Any message to be written into the information
  #            file (e.g., data used).
  # filename - character string - the name of the txt file (including relative
  #            path). Should be the same as the output file it describes (RData,
  #            csv, pdf).
  #
  if (is.null(filename))
  {
    stop('Provide an output filename - parameter filename.')
  }
  filename <- paste(filename, '.txt', sep='')
  # Try to get as close as possible to getting the file name from which the
  # function is called.
  source.file <- lapply(sys.frames(), function(x) x$ofile)
  source.file <- Filter(Negate(is.null), source.file)
  t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
              silent=TRUE)
  if (class(t.sf) == 'try-error')
  {
    source.file <- NULL
  }
  func <- deparse(sys.call(-1))
  # MetaInfo isn't always called from within another function, so func could
  # return as NULL or as general environment.
  if (any(grepl('eval', func, ignore.case=TRUE)))
  {
    func <- NULL
  }
  time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
  git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
  meta <- list(Message=message,
               Source=paste(source.file, ' on ', time, sep=''),
               Functions=func,
               System=Sys.info(),
               Session=sessionInfo(),
               Git.hash=git.h)
  sink(file=filename)
  print(meta)
  sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
  fn <- 'random_plot'
  pdf(file=paste(fn, '.pdf', sep=''))
  plot(x, y)
  MetaInfo(message=NULL, filename=fn)
  dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project; each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like:
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called ProjectTemplate that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, like Workflow of Statistical Data Analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question betrays a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or ProjectTemplate, which I will try to pick up on. Thanks again for the suggestions, @noah and @alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is a package developed by RStudio called pins that might address this problem; a rough sketch of how it could be used to cache outputs is below.
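A minimal sketch, assuming the current pins API (board_local(), pin_write(), pin_read(), pin_meta()); the object and pin name are placeholders:
library(pins)
board <- board_local()                      # a board stored locally on this machine
pin_write(board, mtcars, "analysis_output") # cache an object together with its metadata
pin_read(board, "analysis_output")          # reload it later, in any session
pin_meta(board, "analysis_output")          # inspect when and how it was written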
