So... I've got some data coming to me in small batches, in spreadsheets. I'm having trouble figuring out the best way to put this information into a data.frame in R for later analysis.
An example of what I get would be something like this:
-----------------------------------
| Component combination A+B+C |
-----------------------------------
| | Meter | RCBS | AND |
-----------------------------------
| 1 | 2509 | 2486 | 2535 |
-----------------------------------
| 2 | 2435 | 2484 | 2539 |
-----------------------------------
| 3 | 2507 | 2493 | 2565 |
-----------------------------------
| 4 | 2558 | 2483 | 2538 |
-----------------------------------
| 5 | 2510 | 2515 | 2530 |
-----------------------------------
...where the numbers are repeated individual measurements (runs) using the listed component combination, with one of the components dispensed using one of three different devices ("Meter", etc.). There would typically be ~20 values in each column.
Should I be going for a wide or long format when I put this data into a frame? Do I want the data in the rows or in the columns?
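For illustration, here is a minimal sketch of both layouts using the five runs shown above. The column names (run, device, value, combination) are my own assumptions, not anything from the spreadsheet. A long layout (one row per measurement) is often the easier one to feed into modelling and plotting functions, but which one suits you depends on the analysis.

```r
# Wide layout: one row per run, one column per dispensing device.
wide <- data.frame(
  run   = 1:5,
  Meter = c(2509, 2435, 2507, 2558, 2510),
  RCBS  = c(2486, 2484, 2493, 2483, 2515),
  AND   = c(2535, 2539, 2565, 2538, 2530)
)

# Long layout: one row per single measurement.
# "combination" lets later batches be appended with rbind().
long <- data.frame(
  combination = "A+B+C",
  run         = rep(wide$run, times = 3),
  device      = rep(c("Meter", "RCBS", "AND"), each = nrow(wide)),
  value       = c(wide$Meter, wide$RCBS, wide$AND)
)

# Example use of the long layout: mean value per device.
device_means <- aggregate(value ~ device, data = long, FUN = mean)
```

With the long layout, a new spreadsheet for another component combination becomes a few more rows rather than a new set of columns.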
I want to merge two files using a unique ID and a timestamp, and also get the measurements for the next n intervals.
The first file has over 15,000 unique IDs. Each ID has measurements taken at 15 minute intervals from Jan 1, 00:00 to Dec 31, 23:45. The database is quite big (35 GB) with over 500 million rows. The file looks something like this.
First file
| ID | Time | Measurement|
|:----:|:---------------:|:------:|
| 1 |2012-12-31 22:45| 61 |
| 1 |2012-12-31 23:00| 60 |
| 1 |2012-12-31 23:15| 61 |
| 1 |2012-12-31 23:30| 59 |
| 1 |2012-12-31 23:45| 59 |
| 2 |2012-01-01 0:00| 60 |
| 2 |2012-01-01 0:15| 61 |
| 2 |2012-01-01 0:30| 60 |
| 2 |2012-01-01 0:45| 62 |
The second file has unique IDs and a timestamp. The IDs in this file are a subset of the IDs in the first file. The file is relatively small (~50 MB) compared to the first file.
Second file
| ID | Time |
|:----:|:---------------:|
| 1 |2012-12-31 22:48|
| 1 |2012-12-31 23:48|
| 2 |2012-01-01 0:16|
I want to merge the two files such that the measurements are extracted for the current interval and the following intervals. I also want to be able to specify n and run the code dynamically.
The merged file should look like this for n = 3. Note that the measurements for the following intervals should not be taken from another ID (see the second row, for example).
After merge
| ID | Time | Measurement 1| Measurement 2| Measurement 3|
|:----:|:---------------:|:----:|:---------------:|:----:|
| 1 | 2012-12-31 22:48| 61| 60| 61 |
| 1 | 2012-12-31 23:48| 59| 59| 59 |
| 2 | 2012-01-01 0:16| 61| 60| 62 |
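To make the requirement concrete, here is a rough base R sketch on the toy rows above. It is not tuned for a 35 GB file, and the function name next_n and the output column names are made up for illustration. It assumes every ID has one row per 15-minute interval with no gaps, and it counts the current interval as the first of the n measurements, as in the table above. Where an ID has no further rows it returns NA, whereas the table above repeats 59; that difference is an assumption on my part.

```r
first <- data.frame(
  ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  Time = as.POSIXct(
    c("2012-12-31 22:45", "2012-12-31 23:00", "2012-12-31 23:15",
      "2012-12-31 23:30", "2012-12-31 23:45", "2012-01-01 00:00",
      "2012-01-01 00:15", "2012-01-01 00:30", "2012-01-01 00:45"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"),
  Measurement = c(61, 60, 61, 59, 59, 60, 61, 60, 62)
)
second <- data.frame(
  ID = c(1, 1, 2),
  Time = as.POSIXct(
    c("2012-12-31 22:48", "2012-12-31 23:48", "2012-01-01 00:16"),
    format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# Hypothetical helper: current interval plus the following n - 1 intervals.
next_n <- function(first, second, n) {
  first <- first[order(first$ID, first$Time), ]
  # Round each timestamp down to the start of its 15-minute (900 s) interval.
  slot <- floor(as.numeric(second$Time) / 900) * 900
  # Locate the row of `first` holding the current interval for each ID.
  start <- match(paste(second$ID, slot),
                 paste(first$ID, as.numeric(first$Time)))
  out <- second
  for (k in seq_len(n)) {
    idx <- start + (k - 1L)
    ok <- !is.na(idx) & idx <= nrow(first)
    # Only accept a row if it still belongs to the same ID.
    ok[ok] <- first$ID[idx[ok]] == second$ID[ok]
    vals <- rep(NA_real_, nrow(second))
    vals[ok] <- first$Measurement[idx[ok]]
    out[[paste0("Measurement_", k)]] <- vals
  }
  out
}

res <- next_n(first, second, n = 3)
```

For the real data, the same idea (round down to the interval, look up the start row, then step forward within the ID) would probably need a keyed or chunked approach rather than loading everything into one data frame.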
I'm trying to use the merge() function in R (via RStudio). I have two tables with 5000+ rows each; both have the same number of rows, but they share no column to merge by. The rows are in the same order and correspond to each other, i.e. the first row of dataframe1 should be merged with the first row of dataframe2, the second row of dataframe1 with the second row of dataframe2, and so on.
Here's an example of what they could look like:
Dataframe1(df1):
+-------------------------------------+
| Name | Sales | Location |
+-------------------------------------+
| Rod | 123 | USA |
| Kelly | 142 | CAN |
| Sam | 183 | USA |
| Joyce | 99 | NED |
+-------------------------------------+
Dataframe2(df2):
+---------------------+
| Sex | Age |
+---------------------+
| M | 23 |
| M | 33 |
| M | 31 |
| F | 45 |
+---------------------+
NOTE: this is a downsized example only.
I've tried to use the merge function in RStudio, here's what I've done:
DFMerged <- merge(df1, df2)
This however increases both the rows and columns. It returns 16 rows and 5 columns for this example.
What am I missing about this function? I know there is a by= argument, as in merge(x, y, by=), but I have no column to match the tables on.
The output I would like is:
+----------------------------------------------------------+
| Name | Sales | Location | Sex | Age |
+----------------------------------------------------------+
| Rod | 123 | USA | M | 23 |
| Kelly | 142 | CAN | M | 33 |
| Sam | 183 | USA | M | 31 |
| Joyce | 99 | NED | F | 45 |
+----------------------------------------------------------+
I've considered adding an extra column to each data frame, say row#, and matching on that.
You could use cbind:
cbind(df1, df2)
If you want to use merge you could use:
merge(df1, df2, by=0)
You could use:
cbind(df1,df2)
This requires the two data frames to have the same number of rows.
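As a small check of the suggestions above, here is a sketch using the four example rows from the question. The variable names crossed, bound and by_rowname are mine. As far as I know, merge() with by = 0 matches on row names, adds a Row.names column and sorts on those names as text, so for larger tables the row order may differ from the input; cbind() keeps the original order.

```r
df1 <- data.frame(
  Name     = c("Rod", "Kelly", "Sam", "Joyce"),
  Sales    = c(123, 142, 183, 99),
  Location = c("USA", "CAN", "USA", "NED")
)
df2 <- data.frame(
  Sex = c("M", "M", "M", "F"),
  Age = c(23, 33, 31, 45)
)

# No shared columns: merge() pairs every row of df1 with every row of df2.
crossed <- merge(df1, df2)

# Row-by-row binding; relies on both tables being in the same order.
bound <- cbind(df1, df2)

# Matching on row names; note the extra Row.names column in the result.
by_rowname <- merge(df1, df2, by = 0)

dim(crossed)
dim(bound)
names(by_rowname)
```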
Everything is in the title: I pulled many columns from a database, paired two by two, each pair holding the codes and the labels for one variable. I want an easy way to create half as many factors, where each factor's levels and labels match the original pair of columns.
Here is an example of the original data for two factors:
| customer_type | customer_type_name | customer_status | customer_status_name |
|----------------------|----------------------|----------------------|----------------------|
| 1 | A | 2 | Beta |
| 2 | B | 2 | Beta |
| 3 | C | 1 | Alpha |
| 2 | B | 3 | Gamma |
| 1 | A | 4 | Delta |
| 3 | C | 2 | Beta |
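In other words, I'm looking for a simpler way (one that is easy to wrap in a function and apply to many variables) to do the following for the data frame "accounts":
# Lookup table of unique code/label pairs, sorted by code
a <- accounts[, c("customertypecode", "customertypecodename")]
a <- a[!duplicated(a), ]
a <- a[order(a$customertypecode), ]
# Turn the code column into a factor labelled with the matching names
accounts$customertypecode <- factor(accounts$customertypecode, labels = a$customertypecodename[!is.na(a$customertypecodename)])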
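One way to generalise this is a small helper that takes the names of the code column and the label column. The sketch below uses the column names from the example table (customer_type, customer_type_name, ...) and assumes every label column is named after its code column with a "_name" suffix. The helper name make_factor is hypothetical.

```r
accounts <- data.frame(
  customer_type        = c(1, 2, 3, 2, 1, 3),
  customer_type_name   = c("A", "B", "C", "B", "A", "C"),
  customer_status      = c(2, 2, 1, 3, 4, 2),
  customer_status_name = c("Beta", "Beta", "Alpha", "Gamma", "Delta", "Beta"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build a factor from a code column and its label column.
make_factor <- function(data, code_col, label_col) {
  # Unique code/label pairs, without missing codes, sorted by code.
  pairs <- unique(data[, c(code_col, label_col)])
  pairs <- pairs[!is.na(pairs[[code_col]]), ]
  pairs <- pairs[order(pairs[[code_col]]), ]
  # Passing levels explicitly keeps codes and labels aligned.
  factor(data[[code_col]],
         levels = pairs[[code_col]],
         labels = pairs[[label_col]])
}

# Apply to every code column; assumes the "<code>_name" naming pattern.
code_cols <- c("customer_type", "customer_status")
for (col in code_cols) {
  accounts[[col]] <- make_factor(accounts, col, paste0(col, "_name"))
}
```

Giving factor() both levels and labels avoids relying on the sorted unique values of the column happening to line up with the label vector.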
I need to correlate some data.
I have two data frames: df, with the patients' health conditions (253 columns), and tax2.melt, with the patients' microbiota analyses (3 columns).
tax2.melt is:
| bac_name | pat_id | percent |
|----------------------|--------|--------------|
| Unclassified | 1 | 5.4506702563 |
| Serratia_entomophila | 1 | 0 |
| Faecalibacterium | 1 | 4.0394862303 |
| Clostridium | 1 | 5.215098996 |
df is a data frame with patient ID_CODE and 253 variables
| ID_CODE | DIAB_GR | SEX | AGE | .... |
|---------|---------|-----|-----|--------|
| 1 | 232 | 0 | 0 | .... |
| 2 | 99 | 0 | 0 | .... |
So I need to correlate each of the patients' conditions (such as abdominal obesity or diabetes) with the percentage of each gut bacterium in the total gut microbiota (such as Faecalibacterium or Clostridium).
The result should be a data frame with the columns bac_name, df_testvalue and corr.
Thank you!
Could you give me some advice on how best to do this in R?
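To show the shape of the result I have in mind, here is a rough sketch on made-up toy data (six invented patients, two bacteria, two condition columns; none of these numbers come from my real data). It merges the two tables on the patient ID and computes one correlation per bacterium and condition. The choice of Spearman correlation is only an assumption for the example.

```r
set.seed(1)
# Toy stand-ins for the real tables; all values are invented.
tax2.melt <- data.frame(
  bac_name = rep(c("Faecalibacterium", "Clostridium"), each = 6),
  pat_id   = rep(1:6, times = 2),
  percent  = runif(12, min = 0, max = 10)
)
df <- data.frame(
  ID_CODE = 1:6,
  DIAB_GR = c(232, 99, 150, 180, 120, 200),
  AGE     = c(34, 51, 47, 62, 29, 58)
)

# One row per patient and bacterium, with the condition columns attached.
merged <- merge(tax2.melt, df, by.x = "pat_id", by.y = "ID_CODE")
vars <- setdiff(names(df), "ID_CODE")

# For each bacterium, correlate its percentage with every condition column.
per_bac <- lapply(split(merged, merged$bac_name), function(d) {
  data.frame(
    bac_name     = as.character(d$bac_name[1]),
    df_testvalue = vars,
    corr         = vapply(vars, function(v) {
      cor(d$percent, d[[v]], method = "spearman")
    }, numeric(1), USE.NAMES = FALSE)
  )
})
result <- do.call(rbind, per_bac)
rownames(result) <- NULL
```

With 253 condition columns and many bacteria this produces a lot of correlations, so some correction for multiple testing would presumably be needed if p-values are added later.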
I'm just starting out with R and trying to figure out what works for my needs when it comes to creating "summary tables". I am used to Custom Tables in SPSS, and the CrossTable function in the gmodels package gets me almost where I need to be; it is also easy to navigate for someone just starting out in R.
That said, it seems like the Hmisc package is very good at creating various summaries and exporting them to LaTeX (ultimately what I need to do).
My questions are: 1) Can you easily create the table below with the Hmisc package? 2) If so, can I interact variables (two in the columns)? And finally, 3) can I access the p-values of significance tests (chi-square)?
Thanks in advance,
Brock
Cell Contents
|-------------------------|
| Count |
| Row Percent |
| Column Percent |
|-------------------------|
Total Observations in Table: 524
| asq[, 23]
asq[, 4] | 1 | 2 | 3 | 4 | 5 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 76 | 54 | 93 | 46 | 54 | 323 |
| 23.529% | 16.718% | 28.793% | 14.241% | 16.718% | 61.641% |
| 54.286% | 56.250% | 63.265% | 63.889% | 78.261% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 64 | 42 | 54 | 26 | 15 | 201 |
| 31.841% | 20.896% | 26.866% | 12.935% | 7.463% | 38.359% |
| 45.714% | 43.750% | 36.735% | 36.111% | 21.739% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 140 | 96 | 147 | 72 | 69 | 524 |
| 26.718% | 18.321% | 28.053% | 13.740% | 13.168% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
The gmodels package has a function called CrossTable, which is very nice for those used to SPSS and SAS output. Try this example:
library(gmodels) # run install.packages("gmodels") if you haven't installed the package yet
x <- sample(c("up", "down"), 100, replace = TRUE)
y <- sample(c("left", "right"), 100, replace = TRUE)
CrossTable(x, y, format = "SPSS")
This should give you output just like the table you displayed in your question, very SPSS-y. :)
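If you also need the numbers behind such a table as R objects, for example the chi-square p-value, base R can produce them directly. The following is only a sketch on random example data; the variable z and the object names are made up for illustration.

```r
set.seed(42)
x <- sample(c("up", "down"), 100, replace = TRUE)
y <- sample(c("left", "right"), 100, replace = TRUE)
z <- sample(c("a", "b"), 100, replace = TRUE)

# Counts, row percentages and column percentages.
tab     <- table(x, y)
row_pct <- prop.table(tab, margin = 1) * 100
col_pct <- prop.table(tab, margin = 2) * 100

# Chi-square test of independence; the p-value is a plain number.
test <- chisq.test(tab)
p    <- test$p.value

# Two variables interacted in the columns.
tab2 <- table(x, interaction(y, z))
```

The objects tab, row_pct and col_pct are ordinary tables, so they can be passed on to an export function of your choice.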
If you are coming from SPSS, you may be interested in the package Deducer ( http://www.deducer.org ). It has a contingency table function:
> library(Deducer)
> data(tips)
> tables<-contingency.tables(
+ row.vars=d(smoker),
+ col.vars=d(day),data=tips)
> tables<-add.chi.squared(tables)
> print(tables,prop.r=T,prop.c=T,prop.t=F)
================================================================================================================
==================================================================================
========== Table: smoker by day ==========
| day
smoker | Fri | Sat | Sun | Thur | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|
No Count | 4 | 45 | 57 | 45 | 151 |
Row % | 2.649% | 29.801% | 37.748% | 29.801% | 61.885% |
Column % | 21.053% | 51.724% | 75.000% | 72.581% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Yes Count | 15 | 42 | 19 | 17 | 93 |
Row % | 16.129% | 45.161% | 20.430% | 18.280% | 38.115% |
Column % | 78.947% | 48.276% | 25.000% | 27.419% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 19 | 87 | 76 | 62 | 244 |
Column % | 7.787% | 35.656% | 31.148% | 25.410% | |
Large Sample
Test Statistic DF p-value | Effect Size est. Lower (%) Upper (%)
Chi Squared 25.787 3 <0.001 | Cramer's V 0.325 0.183 (2.5) 0.44 (97.5)
-----------
================================================================================================================
You can export the counts and the test results to LaTeX or HTML using the xtable package:
> library(xtable)
> xtable(drop(extract.counts(tables)[[1]]))
> test <- contin.tests.to.table((tables[[1]]$tests))
> xtable(test)