Import DAT File - Parsing Issue


I have a tab-delimited DAT file that I want to read into R. When I import the data using read.delim, my data frame has the correct number of columns, but has more rows than expected.
My datafile represents responses to a survey. After digging a little deeper, it appears that R is creating a new record when there is a "." in a column that represents an open-ended response. It appears that there are times when a respondent may have hit "enter" to add a new line.
Is there a way to get around this? I read the help, but I am not sure how I can tell R to ignore this character in the character response.
Here is an example response that parses incorrectly. This is one response, but you can see that there are returns that put this onto multiple lines when parsed by R.
possible ask for size before giving free tshirt.
Also maybe have the interview in conference rooms instead of tight offices. I felt very cramped.
I would of loved to have gone, but just had to make a choices and had more options then I expected.
I am analyzing the data with SPSS, where the data were brought in fine; however, I need to use R for more advanced modeling.
Any help will be greatly appreciated. Thanks in advance.

There is an 'na.strings' argument. You don't offer any test case, but perhaps you can do this:
read.delim(file="myfil.DAT", na.strings=".")
I think it would be good if you could produce an edit to your question that better demonstrated the problem. I cannot create an error with a simple effort:
> read.delim(text="a\tb\t.\nc\td\te\n",header=FALSE)
V1 V2 V3
1 a b .
2 c d e
> read.delim(text="a\tb\t.\nc\td\te\n",header=FALSE, na.strings=".")
V1 V2 V3
1 a b <NA>
2 c d e
(After the clarification that above comments are not particularly relevant.) This will bring in a field that has a linefeed in it .... but it requires that the "field" be quoted in the original file:
> scan(file=textConnection("'a\nb'\nx\t.\nc\td\te\n"), what=list("","","") )
Read 2 records
[[1]]
[1] "a\nb" "c"
[[2]]
[1] "x" "d"
[[3]]
[1] "." "e"
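If the export can be produced with the open-ended field wrapped in double quotes, read.delim should keep an embedded line break inside a single record. A minimal sketch, assuming a made-up two-column layout (id, comment) rather than the real survey file:

```r
# Hypothetical two-record file; the first comment contains a line break
# inside a double-quoted field. The column names are assumptions.
txt <- paste0(
  "id\tcomment\n",
  "1\t\"possible ask for size before giving free tshirt.\nAlso maybe use conference rooms.\"\n",
  "2\tno comment\n"
)

# quote = "\"" is already the read.delim default; it is spelled out here
# to show which setting keeps the quoted newline inside one field.
survey <- read.delim(text = txt, quote = "\"", stringsAsFactors = FALSE)

nrow(survey)        # number of records read
survey$comment[1]   # first response, with its line break preserved
```

If the fields are not quoted in the file, R cannot distinguish a line break inside a response from the end of a record, so the quoting would have to be added when the file is exported.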

Related

How do I find the sum of a category under a subset?

So... I'm very illiterate when it comes to RStudio and I'm using this program for a class... I'm trying to figure out how to sum a subset of a category. I apologize in advance if this doesn't make sense but I'll do my best to explain because I have no clue what I'm doing and would also appreciate an explanation of why and not just what the answer would be. Note: The two lines I included are part of the directions I have to follow, not something I just typed in because I knew how to - I don't... It's the last part, the sum, that I am not explained how to do and thus I don't know what to do and would appreciate help figuring out.
For example,
I have this:
category_name category2_name
1 ABC
2 ABC
3 ABC
4 ABC
5 ABC
6 BDE
5 EFG
7 EFG
I wanted to find the sum of these numbers, so I was told to put in this:
sum(dataname$category_name)
After doing this, I'm asked to type this in, apparently creating a subset.
allabc <- subset(dataname, dataname$category_name2 == "abc")
I created this subset and now I have a new table popped up with this subset. I'm asked to sum only the numbers of this ABC subset... I have absolutely no clue on how to do this. If someone could help me out, I'd really appreciate it!

R is the software you are using. It is case-sensitive. So "abc" is not equal to "ABC".
The arguments are the "things" you put inside functions. Some arguments have the same name as the functions (which is a little confusing at first, but you get used to this eventually). So when I say the subset argument, I am talking about your second argument to the subset function, which you didn't name. That's ok, but when starting to learn R, try to always name your arguments.
So,
allabc <- subset(dataname, dataname$category_name2 == "abc")
Needs to be changed to:
allabc <- subset(dataname, subset=category2_name == "ABC")
You also don't need to specify the name of the data again in the subset argument, since you've already done that in the first argument (which you didn't name, but almost no one bothers to name that one).
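To finish the exercise, the sum is taken on the subset in the same way it was taken on the full data. A small sketch, assuming the table from the question is stored as dataname with the column names category_name and category2_name:

```r
# Rebuild the example table from the question (column names assumed).
dataname <- data.frame(
  category_name  = c(1, 2, 3, 4, 5, 6, 5, 7),
  category2_name = c(rep("ABC", 5), "BDE", "EFG", "EFG")
)

# Keep only the rows whose category2_name is exactly "ABC" (case matters).
allabc <- subset(dataname, subset = category2_name == "ABC")

# Sum the numeric column of the subset, just like the full-data sum.
abc_total <- sum(allabc$category_name)
abc_total  # 1 + 2 + 3 + 4 + 5 = 15
```

The only change from sum(dataname$category_name) is that the data frame before the $ is now the subset, so only the ABC rows are added up.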

This is most easily done using tidyverse.
# Your data
data <- data.frame(category_name = 1:8, category_name2 = c(rep("ABC", 5), "BDE", "EFG", "EFG"))
# Installing tidyverse
install.packages("tidyverse")
# Loading tidyverse
library(tidyverse)
# For each category_name2 the category_name is summed
data %>%
  group_by(category_name2) %>%
  summarise(sum_by_group = sum(category_name))
# Output
category_name2 sum_by_group
ABC 15
BDE 6
EFG 15

How to use regexpr to identify patterns in ICD-10 data

I am working with ICD-10 data, and I wish to create a new variable called complication based on the pattern "E1X.9X" using a regular expression, but I keep getting an error. Please help.
dm_2$icd9_9code<- (E10.49, E11.51, E13.52, E13.9, E10.9, E11.21, E16.0)
dm_2$DM.complications<- "present"
dm_2$DM.complications[regexpr("^E\\d{2}.9$",dm_2$icd9_code)]<- "None"
# Error in dm_2$DM.complications[regexpr("^E\\d{2}.9", dm_2$icd9_code)] <-
# "None" : only 0's may be mixed with negative subscripts
I want
icd9_9code complications
E10.49 present
E11.51 present
E13.52 present
E13.9 none
E10.9 none
E11.21 present

This problem has already been solved. The 'icd' R package, which my co-authors and I have been maintaining for five years, can do this. In particular, it uses standardized sets of comorbidities, including the diabetes with complications you seek, from AHRQ, the original Elixhauser, Charlson, etc.
E.g., for ICD-10 AHRQ, you can see the codes for diabetes with complications here. From icd 4.0, these include ICD-10 codes from the WHO, and all years of ICD-10-CM.
icd::icd10_map_ahrq$DMcx
To use them, first just take your patient data frame and try:
library(icd)
pts <- data.frame(visit_id = c("encounter-1", "encounter-2", "encounter-3",
"encounter-4", "encounter-5", "encounter-6"), icd10 = c("I70401",
"E16", "I70.449", "E13.52", "I70.6", "E11.51"))
comorbid_ahrq(pts)
# and for diabetes with complications only:
comorbid_ahrq(pts)[, "DMcx"]
Or, you can get a data frame instead of a matrix this way:
comorbid_ahrq(pts, return_df = TRUE)
# then you can do:
comorbid_ahrq(pts, return_df = TRUE)$DMcx
If you give an example of the source data and your goal, I can help more.

Seems like there are a few errors in your code, I'll note them in the code below:
You'll want to start with wrapping your ICD codes with quotes: "E13.9"
dm_2 <- data.frame(icd9_9code = c("E10.49", "E11.51", "E13.52", "E13.9", "E10.9", "E11.21", "E16.0"))
Next let's use grepl() to search for the particular ICD pattern. Make sure you're applying it to the proper column, your code above is attempting to use dm_2$icd9_code and not dm_2$icd9_9code:
dm_2$DM.complications <- "present"
dm_2$DM.complications[grepl("^E\\d{2}.9$", dm_2$icd9_9code)] <- "None"
Finally,
dm_2
#> icd9_9code DM.complications
#> 1 E10.49 present
#> 2 E11.51 present
#> 3 E13.52 present
#> 4 E13.9 None
#> 5 E10.9 None
#> 6 E11.21 present
#> 7 E16.0 present
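The original error comes from regexpr(), which returns match positions (-1 where there is no match), so using its result as an index mixes positive and negative subscripts. grepl() returns TRUE/FALSE, which is what logical indexing needs. A small sketch on the sample codes; the escaped dot (\\.) and the narrower E1 prefix are my own tightening of the pattern, on the assumption that only a literal "." and codes E10 to E19 should match:

```r
codes <- c("E10.49", "E11.51", "E13.52", "E13.9", "E10.9", "E11.21", "E16.0")

# regexpr() gives positions: 1 where the pattern matches at the start,
# -1 where it does not. That mix is what breaks the indexing.
positions <- as.integer(regexpr("^E1\\d\\.9$", codes))
positions  # -1 -1 -1  1  1 -1 -1

# grepl() gives one TRUE/FALSE per code, which is safe to index with.
no_complication <- grepl("^E1\\d\\.9$", codes)

complications <- ifelse(no_complication, "None", "present")
data.frame(code = codes, complications = complications)
```

Without the backslashes, "." matches any character, so a pattern like "^E\\d{2}.9$" would also accept a code such as "E1099".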
A quick side note -- there is a wonderful ICD package you may find handy as well: https://cran.r-project.org/web/packages/icd/index.html

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#', I need to extract all of them and save each mention in that particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a string separated by spaces, as mentioned earlier?
Thanks in advance.

I trust it would be best if you used an asis column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")

The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a list of single CSV terms, then you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
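Since the question asks for the form "#mention1 #mention2", the same idea works with a space as the separator. The sketch below swaps in base R's regmatches()/gregexpr() for str_extract_all() so that it runs without stringr; the three shortened tweets are stand-ins for the real column:

```r
# Shortened stand-in tweets (assumed data, not the real column).
tweets <- c(
  "It is #DineshKarthik's birthday, captain of #KKRiders.",
  "CHAMPIONS!!",
  "#ChennaiIPL beat #SRH by 8 wickets"
)

# One character vector of hashtags per tweet (empty if there are none).
found <- regmatches(tweets, gregexpr("#\\w+", tweets))

# Collapse each vector into a single space-separated string.
# vapply() guarantees exactly one string per tweet.
mentions <- vapply(found, paste, character(1), collapse = " ")
mentions
```

A tweet with no hashtags ends up as an empty string, so the result is always a plain character vector that fits in an ordinary data frame column.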

Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice.

Gene ontology (GO) analysis for a list of Genes (with ENTREZID) in R?

I am very new to GO analysis and I am a bit confused about how to do it for my list of genes.
I have a list of genes (n=10):
gene_list
SYMBOL ENTREZID GENENAME
1 AFAP1 60312 actin filament associated protein 1
2 ANAPC11 51529 anaphase promoting complex subunit 11
3 ANAPC5 51433 anaphase promoting complex subunit 5
4 ATL2 64225 atlastin GTPase 2
5 AURKA 6790 aurora kinase A
6 CCNB2 9133 cyclin B2
7 CCND2 894 cyclin D2
8 CDCA2 157313 cell division cycle associated 2
9 CDCA7 83879 cell division cycle associated 7
10 CDCA7L 55536 cell division cycle associated 7-like
and I simply want to find their function and I've been suggested to use GO analysis tools.
I am not sure if it's a correct way to do so.
here is my solution:
x <- org.Hs.egGO
# Get the entrez gene identifiers that are mapped to a GO ID
xx<- as.list(x[gene_list$ENTREZID])
So, I've got a list with EntrezID that are assigned to several GO terms for each genes.
for example:
> xx$`60312`
$`GO:0009966`
$`GO:0009966`$GOID
[1] "GO:0009966"
$`GO:0009966`$Evidence
[1] "IEA"
$`GO:0009966`$Ontology
[1] "BP"
$`GO:0051493`
$`GO:0051493`$GOID
[1] "GO:0051493"
$`GO:0051493`$Evidence
[1] "IEA"
$`GO:0051493`$Ontology
[1] "BP"
My question is :
How can I find the function of each of these genes in a simpler way? I also wonder whether I am doing this right,
because I want to add the function to gene_list as a function/GO column.
Thanks in advance,

EDIT: There is a new Bioinformatics SE (currently in beta mode).
I hope I understand what you are aiming for here.
BTW, for bioinformatics-related topics, you can also have a look at Biostars, which serves the same purpose as SO but for bioinformatics.
If you just want a list of the functions related to each gene, you can query a database such as Ensembl through the biomaRt Bioconductor package, which is an API for querying BioMart databases.
You will need an internet connection to run the query.
Bioconductor provides packages for bioinformatics studies, and these packages generally come with good vignettes that take you through the different steps of the analysis (and even highlight how you should design your data, or what some of the pitfalls would be).
In your case, directly from the biomaRt vignette (task 2 in particular):
Note: there are slightly quicker ways than the one I report below:
# load the library
library("biomaRt")
# I prefer Ensembl, so that is the one I will query, but you can
# query other databases; try out: listMarts()
ensembl=useMart("ensembl")
# as it seems that you are looking for human genes:
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
# if you want other model organisms have a look at:
#listDatasets(ensembl)
You need to create your query (your list of ENTREZ ids). To see which filters you can query:
filters = listFilters(ensembl)
And then you want to retrieve attributes : your GO number and description. To see the list of available attributes
attributes = listAttributes(ensembl)
For you, the query would look something like this:
goids = getBM(
#you want entrezgene so you know which is what, the GO ID and
# name_1006 is actually the identifier of 'Go term name'
attributes=c('entrezgene','go_id', 'name_1006'),
filters='entrezgene',
values=gene_list$ENTREZID,
mart=ensembl)
The query itself can take a while.
Then you can always collapse the information into two columns (but I wouldn't recommend it for anything other than reporting purposes).
# For each ENTREZ id, paste all of its GO ids and GO term names together
Go.collapsed <- Reduce(rbind, lapply(gene_list$ENTREZID, function(x) {
  tempo <- goids[goids$entrezgene == x, ]
  return(
    data.frame('ENTREZGENE' = x,
               'Go.ID' = paste(tempo$go_id, collapse = ' ; '),
               'GO.term' = paste(tempo$name_1006, collapse = ' ; '))
  )
}))
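The collapse step can be tried without an internet connection on a small stand-in for the getBM() result. In this sketch the goids data frame is mocked: the two GO ids for gene 60312 come from the question, while the term names and the row for 6790 are placeholders, not real annotations. aggregate() is used here as a base R alternative to the Reduce/lapply construction:

```r
# Mock of the table getBM() would return (placeholder values, see above).
goids <- data.frame(
  entrezgene = c(60312, 60312, 6790),
  go_id      = c("GO:0009966", "GO:0051493", "GO:0000000"),
  name_1006  = c("term A", "term B", "term C"),
  stringsAsFactors = FALSE
)

# One row per gene; all GO ids and term names pasted into single strings.
collapsed <- aggregate(
  goids[c("go_id", "name_1006")],
  by = list(entrezgene = goids$entrezgene),
  FUN = paste, collapse = " ; "
)
collapsed
```

The result has one row per ENTREZ id, so it could then be merged into gene_list by that id, for example with merge().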
Edit:
If you want to query a past version of the ensembl database:
ens82<-useMart(host='sep2015.archive.ensembl.org',
biomart='ENSEMBL_MART_ENSEMBL',
dataset='hsapiens_gene_ensembl')
and then the query would be:
goids = getBM(attributes=c('entrezgene','go_id', 'name_1006'),
filters='entrezgene',values=gene_list$ENTREZID,
mart=ens82)
However, if you had in mind to do a GO enrichment analysis, your list of genes is too short.

Creating a vector from a file in R

I am new to R and my question should be trivial. I need to create a word cloud from a txt file containing the words and their occurrence numbers. For that purpose I am using the snippets package.
As can be seen at the bottom of the link, I first have to create a vector (is it right that words is a vector?) like below.
> words <- c(apple=10, pie=14, orange=5, fruit=4)
My problem is to do the same thing but create the vector from a file which would contain words and their occurrence number. I would be very happy if you could give me some hints.
Moreover, to understand the format of the file to be inserted I write the vector words to a file.
> write(words, file="words.txt")
However, the file words.txt contains only the values but not the names(apple, pie etc.).
$ cat words.txt
10 14 5 4
Thanks.

words is a named vector, the distinction is important in the context of the cloud() function if I read the help correctly.
Write the data out correctly to a file:
write.table(words, file = "words.txt")
Create your word occurrence file like the txt file created. When you read it back in to R, you need to do a little manipulation:
> newWords <- read.table("words.txt", header = TRUE)
> newWords
x
apple 10
pie 14
orange 5
fruit 4
> words <- newWords[,1]
> names(words) <- rownames(newWords)
> words
apple pie orange fruit
10 14 5 4
What we are doing here is reading the file into newWords, then subsetting it to take the one and only column (variable), which we store in words. The last step is to take the row names from the file read in and apply them as the "names" on the words vector. We do the last step using the names() function.
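The whole round trip can be checked in one go. A minimal sketch, assuming a temporary file instead of words.txt so that nothing in the working directory is overwritten:

```r
words <- c(apple = 10, pie = 14, orange = 5, fruit = 4)

# Write to a temporary file (stand-in for "words.txt").
path <- tempfile(fileext = ".txt")
write.table(words, file = path)

# Read it back: the names come back as row names, the counts as column 1.
newWords <- read.table(path, header = TRUE)

# Rebuild the named vector in a single step.
restored <- setNames(newWords[, 1], rownames(newWords))
restored
```

setNames() does the same job as assigning with names()<-, just without the intermediate variable.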

Yes, 'vector' is the proper term.
EDIT:
A better method than write.table would be to use save() and load():
save(words, file="svwrd.rda")
load(file="svwrd.rda")
The save/load combo preserves all the structure rather than doing coercion. The write.table followed by names()<- is kind of a hassle, as you can see in both Gavin's answer here and my answer on rhelp.
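A quick way to convince yourself that nothing is lost is to remove the object and load it back. This sketch assumes a temporary file rather than svwrd.rda:

```r
words <- c(apple = 10, pie = 14, orange = 5, fruit = 4)

# Save the object itself (names included) to a temporary .rda file.
path <- tempfile(fileext = ".rda")
save(words, file = path)

# Remove it from the workspace, then restore it under the same name.
rm(words)
load(file = path)

words  # names and values are back exactly as saved
```

Note that load() restores the object under its original name, so it will silently overwrite any existing object called words.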
Initial answer:
I suggest you use as.data.frame to coerce to a data frame and then write.table() to write to a file.
write.table(as.data.frame(words), file="savew.txt")
saved <- read.table(file="savew.txt")
saved
words
apple 10
pie 14
orange 5
fruit 4
