I have a data table as a CSV file that I use to create metrics for a dashboard. The data table includes Metric IDs and associates them with field names. This table, which defines the metrics, is largely static, and I'd like to include it within the R code rather than, for example, importing a CSV file containing these headings.
The table looks something like this:
Metric_ID   Metric_Name             Numerator             Denominator
AB0001      Number_of_Customers     No_of_Customers
AB0002      Percent_New_Customers   No_of_New_Customers   No_of_Customers
This has about 40 rows of data, and I'd like to set this table up in code so that it is created at the time the R query is run. I'll then use it to associate metric IDs with measures I retrieve through SQL queries. Sometimes this table may change: new metrics might be added or existing metrics modified. That would need some modification of the code to incorporate the changes.
The closest way I could find was to create a data table, along the lines described below.
library(data.table)
dt <- data.table(x = c(1,2,3), y = c(2,3,4), z = c(3,4,5))
dt
x y z
1: 1 2 3
2: 2 3 4
3: 3 4 5
This works for a table with a few rows or columns, but becomes unwieldy for tables with 40+ rows. For example, if I wanted to modify a metric 20 rows down, I'd have to go 20 rows down in each column, and then test the table to ensure I switched the metric at the right place in each column, especially where some metrics have empty cells. For example, I may correct the metric ID in row 20 but accidentally put the definition (a separate column) in row 19.
Is there a more straightforward way of, in essence, creating a table in code?
(I appreciate the most straightforward way would be to keep a CSV file accessible and use read_csv to import it into R. However, this doesn't work so well if colleagues running this query on their machines have a different file path to the CSV; it also raises the risk of them running the query with an out-of-date metrics table, as they may not have the latest version in their files.)
Thanks in advance for any guidance you might have!
Tony
Here are two options (examples taken from respective help pages):
data.table::fread()
fread("A,B
1,2
3,4
")
#> A B
#> <int> <int>
#> 1: 1 2
#> 2: 3 4
https://rdatatable.gitlab.io/data.table/reference/fread.html
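Applied to your metrics table, the same pattern might look like this (a sketch using the column names and two rows from your example; the trailing comma in the AB0001 line leaves its Denominator field empty):

fread("Metric_ID,Metric_Name,Numerator,Denominator
AB0001,Number_of_Customers,No_of_Customers,
AB0002,Percent_New_Customers,No_of_New_Customers,No_of_Customers
")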
tibble::tribble()
tribble(
~colA, ~colB,
"a", 1,
"b", 2,
"c", 3
)
#> # A tibble: 3 × 2
#> colA colB
#> <chr> <dbl>
#> 1 a 1
#> 2 b 2
#> 3 c 3
https://tibble.tidyverse.org/reference/tribble.html
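Applied to your table, tribble() is written row by row, so modifying the metric 20 rows down means editing a single line rather than counting down each column vector. A sketch with the two rows from your example (NA marks the empty Denominator cell):

library(tibble)

metrics <- tribble(
  ~Metric_ID, ~Metric_Name,            ~Numerator,            ~Denominator,
  "AB0001",   "Number_of_Customers",   "No_of_Customers",     NA,
  "AB0002",   "Percent_New_Customers", "No_of_New_Customers", "No_of_Customers"
)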
Other options:
If you already have the data.frame from somewhere, you can also use dput() to get structure() code you can paste into the files you are distributing.
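For example, assuming you have loaded the table once as metrics:

dput(metrics)   # prints structure(...) code that recreates `metrics` when pasted into a script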
Use the reprex package: https://reprex.tidyverse.org/
Related
After going off to find out how to summarize a data frame, I did it. I can see the results in my console; they are shown below, after the first two lines of code.
library(dplyr)
byTue <- group_by(luckyloss.3, L_byUXR)
( sumMon <- summarize(byTue, count = n()) )
Below is what I see in the console. It feels good, because it shows I got what I was looking for.
The results come from a column of 234 rows which has many repeated values, so the summarise collapses those 234 rows: ANA comes up 8 times, ARI 14, and so on.
# A tibble: 30 × 2
L_byUXR count
<chr> <int>
1 ANA 8
2 ARI 14
3 ATL 16
4 BAL 4
5 BOS 6
6 CHA 12
7 CHN 8
8 CIN 10
9 CLE 4
10 COL 8
# ... with 20 more rows
What I want is to take this output of 30 rows by two columns into a Word document, or it could even be HTML.
I tried writing it out with write.csv(byTue), but what I received was the 234 rows of the original data frame; it's as if the summarise had disappeared. I have checked other ways, like markdown or creating new files, and tried to see if the knitr package could help, but nothing.
library(stringi) # ONLY NECESSARY FOR DATA SIMULATION
library(officer) # <<= install this
library(tidyverse)
Simulate some data:
set.seed(2017-11-18)
data_frame(
L_byUXR = stri_rand_strings(30, 3, pattern="[A-Z]"),
count = sample(20, 30, replace=TRUE)
) -> sumMon
Start a new Word doc and add the table, saving to a new doc:
read_docx() %>% # a new, empty document
body_add_table(sumMon, style = "table_template") %>%
print(target="new.docx")
I kept looking for an answer and found the stargazer package for R, which allowed me to get the data frame as text that can be further edited.
When you write the R instruction, name the file you want as output in out =, and stargazer will place it there in your session's folder.
The instruction I used was:
stargazer(sumMon, type = "text", summary = FALSE, title = "Any Title", digits = 1, out = "table1.txt")
Even though I found the answer, I could not have done it without the help of hrbrmstr, who showed me there was a package to do it; I just needed to work more on it.
I have a long data frame of genes and various forms of ids for them (e.g. OMIM, Ensembl, Genatlas). I want to get the list of all SNPs that are associated with each gene. (This is the reverse of this question.)
So far, the best solution I have found is using the biomaRt package (Bioconductor). There is an example of the kind of lookup I need to do here. Adapted for my purposes, here is my code:
library(biomaRt)

# load the human variation data
variation <- useEnsembl(biomart = "snp", dataset = "hsapiens_snp")

# look up a single gene and get SNP data
getBM(
  attributes = c(
    'ensembl_gene_stable_id',
    'refsnp_id',
    'chr_name',
    'chrom_start',
    'chrom_end',
    'minor_allele',
    'minor_allele_freq'
  ),
  filters = 'ensembl_gene',
  values = 'ENSG00000166813',
  mart = variation
)
This outputs a data frame that begins like this:
ensembl_gene_stable_id refsnp_id chr_name chrom_start chrom_end minor_allele minor_allele_freq
1 ENSG00000166813 rs8179065 15 89652777 89652777 T 0.242412
2 ENSG00000166813 rs8179066 15 89652736 89652736 C 0.139776
3 ENSG00000166813 rs12899599 15 89629243 89629243 A 0.121006
4 ENSG00000166813 rs12899845 15 89621954 89621954 C 0.421126
5 ENSG00000166813 rs12900185 15 89631884 89631884 A 0.449681
6 ENSG00000166813 rs12900805 15 89631593 89631593 T 0.439297
(4612 rows)
The code works, but the running time is extremely long. For the above, it takes about 45 seconds. I thought maybe this was related to the allele frequencies, which the server perhaps calculates on the fly. But looking up the bare minimum, only the SNPs' rs ids, takes something like 25 seconds. I have a few thousand genes, so this would take an entire day (assuming no timeouts or other errors). This can't be right. My internet connection is not slow (20-30 Mbit/s).
I tried looking up more genes per query. This did not help: looking up 10 genes at once is roughly 10 times as slow as looking up a single gene.
What is the best way to get a vector of SNPs that are associated with a vector of gene ids?
If I could just download two tables, one with genes and their positions and one with SNPs and their positions, then I could easily solve this problem using dplyr (or maybe data.table). I haven't been able to find such tables.
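For what it's worth, if you did obtain two such tables, a rough dplyr sketch of that join might look like this (hypothetical data frames: genes with columns gene_id/chr/start/end, snps with columns rsid/chr/pos):

library(dplyr)

# join on chromosome, then keep SNPs whose position falls inside the gene;
# note the chromosome-level join can be memory-hungry on full genome tables
snps_by_gene <- genes %>%
  inner_join(snps, by = "chr") %>%
  filter(pos >= start, pos <= end) %>%
  select(gene_id, rsid)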
Since you're using R, here's an idea that uses the package rentrez. It utilizes NCBI's Entrez database system, in particular the E-utilities function elink. You'll have to write some code around this and probably tweak parameters, but it could be a good start.
library(rentrez)
# for converting gene name -> gene id
gene_search <- entrez_search(db="gene", term="(PTEN[Gene Name]) AND Homo sapiens[Organism]", retmax=1)
geneId <- gene_search$ids
# elink function
snp_links <- entrez_link(dbfrom='gene', id=geneId, db='snp')
# access results with $links
length(snp_links$links$gene_snp)
5779
head(snp_links$links$gene_snp)
'864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
I suggest you manually double-check that the number of SNPs is about what you'd expect for your genes of interest; you may need to drill down further and limit by transcript, etc.
For multiple gene ids:
multi_snp_links <- entrez_link(dbfrom='gene', id=c("5728", "374654"), db='snp', by_id=TRUE)
lapply(multi_snp_links, function(x) head(x$links$gene_snp))
1. '864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
2. '797045093' '797044466' '797044465' '797044464' '797044463' '797016353'
The results are grouped by gene when by_id=TRUE.
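If your starting point is gene names rather than Entrez gene ids, one rough way to map them first, reusing the entrez_search() pattern above (the symbols vector is a hypothetical input):

symbols <- c("PTEN", "TP53")  # hypothetical input vector of gene names
gene_ids <- vapply(symbols, function(sym) {
  res <- entrez_search(db = "gene",
                       term = paste0(sym, "[Gene Name] AND Homo sapiens[Organism]"),
                       retmax = 1)
  if (length(res$ids) == 0) NA_character_ else res$ids[[1]]
}, character(1))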
I have the following data frame:
group_id date_show date_med
1 1976-02-07 1971-04-14
1 1976-02-09 1976-12-11
1 2011-03-02 1970-03-22
2 1993-08-04 1997-06-13
2 2008-07-25 2006-09-01
2 2009-06-18 2005-11-12
3 2009-06-18 1999-11-03
I want to subset my data frame so that the new data frame only shows the rows in which the values of date_show are more than 10 days apart, but this condition should only be applied per group. That is, if the values in the date_show column are less than 10 days apart but the group_ids are different, I need to keep both entries. Based on the above table, I want my result to look like this:
group_id date_show date_med
1 1976-02-07 1971-04-14
1 2011-03-02 1970-03-22
2 1993-08-04 1997-06-13
2 2008-07-25 2006-09-01
2 2009-06-18 2005-11-12
3 2009-06-18 1999-11-03
Which row gets deleted isn't important, because the reason I'm subsetting in the first place is to calculate the number of rows I am left with after applying this criterion.
I've tried playing around with the diff function, but I'm not sure how to go about it in the simplest possible way, because this problem already sits within another sapply call, so I'm trying to avoid any additional loop (in this case over group_id).
The data frame I'm working with has around 100,000 rows. Ideally, I would like to do this with base R, because I have no rights to install additional packages on the machine I'm working on, but if this is not possible (or if solving it with an additional package would be significantly better), I can ask my admin to install one.
Any tips would be appreciated!
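A minimal base R sketch of one reading of the requirement (keep a row only when its date_show falls more than 10 days after the previously kept date in the same group; column names as in the example above):

df$date_show <- as.Date(df$date_show)

keep <- unlist(lapply(split(seq_len(nrow(df)), df$group_id), function(idx) {
  d <- as.numeric(df$date_show[idx])
  o <- order(d)
  sel <- logical(length(idx))
  last <- -Inf
  for (i in o) {
    if (d[i] - last > 10) {   # more than 10 days since the last kept row
      sel[i] <- TRUE
      last <- d[i]
    }
  }
  idx[sel]
}))

nrow(df[sort(keep), ])   # the row count after subsetting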
I have two data sets. One has 2 million cases (individual donations to various causes), the other has about 38,000 (all zip codes in the U.S.).
I want to sort through the first data set and tally up the total number of donations by zip code. (Additionally, the total for each zip code will be broken down by cause.) Each case in the first data set includes the zip code of the corresponding donation and information about what kind of cause it went to.
Is there an efficient way to do this? The only approach that I (very much a novice) can think of is to use a for ... if loop to go through each case and count them up one by one. This seems like it would be really slow, though, for data sets of this size.
edit: thanks, @josilber. This gets me a step closer to what I'm looking for.
One more question, though. table seems to generate frequencies, correct? What if I'm actually looking for the sum for each cause by zip code? For example, if the data frame looks like this:
dat3 <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
                   cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE),
                   amt = sample(250:2500, 2000000, replace=TRUE))
Suppose that, instead of frequencies, I want to end up with output that looks like this:
# Cause 1(amt) Cause 2(amt) Cause 3(amt)
# Zip 1 (sum) (sum) (sum)
# Zip 2 (sum) (sum) (sum)
# Zip 3 (sum) (sum) (sum)
# etc. ... ... ...
Does that make sense?
Sure, you can accomplish what you're looking for with the table command in R. First, let's start with a reproducible example (I'll create an example with 2 million cases, 3 zip codes, and 3 causes; I know you have more zip codes and more causes but that won't cause the code to take too much longer to run):
# Data
set.seed(144)
dat <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
                  cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE))
Please note that it's a good idea to include a reproducible example with all your questions on Stack Overflow because it helps make sure we understand what you're asking! Basically you should include a sample dataset (like the one I've just included) along with your desired output for that dataset.
Now you can use the table function to count the number of donations in each zip code, broken down by cause:
table(dat$zip, dat$cause)
# Cause 1 Cause 2 Cause 3
# Zip 1 222276 222004 222744
# Zip 2 222068 222791 222363
# Zip 3 221015 221930 222809
This took about 0.3 seconds on my computer.
Could this work?
aggregate(amt ~ cause + zip, data = dat3, FUN = sum)
cause zip amt
1 Cause 1 Zip 1 306231179
2 Cause 2 Zip 1 306600943
3 Cause 3 Zip 1 305964165
4 Cause 1 Zip 2 305788668
5 Cause 2 Zip 2 306306940
6 Cause 3 Zip 2 305559305
7 Cause 1 Zip 3 304898918
8 Cause 2 Zip 3 304281568
9 Cause 3 Zip 3 303939326
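If you want the wide Zip-by-Cause layout from your sketch directly, base R's xtabs() sums a numeric left-hand side over the cross-classifying variables, giving the same sums as the aggregate() output but arranged as a matrix:

xtabs(amt ~ zip + cause, data = dat3)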
I was wondering how I could view certain rows of data based on specific values, e.g. for spotting anomalies in results.
For example, I have the following results from the command table(df$A):
2 3 4 5 6 19
143914 52194 30856 10662 2901 1
I'm surprised by the 1 observation where df$A == 19. How can I see this observation easily in the console without having to make a subset (x <- subset(df, df$A == 19))?
Thanks in advance
If your goal is to just view the output in an interactive session, and you have no interest in storing that value, you can use [ to "interactively" subset and view the result:
df[df$A == 19, ]
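If you only need to know where that row sits, which() returns the row index without creating a subset:

which(df$A == 19)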