I am working on healthcare data. For the sake of simplicity, I am providing data on only one patient ID. Every patient has a unique ID and over a period of time, the doctors monitor the BCR_ABL value as shown in the table below.
structure(list(PatientId = c("Hospital1_124", "Hospital1_124",
"Hospital1_124", "Hospital1_124", "Hospital1_124", "Hospital1_124",
"Hospital1_124"), TestDate = c("2007-11-13", "2008-09-01", "2011-02-24",
"2013-05-01", "2016-02-16", "2017-05-12", "2017-08-29"), BCR_ABL = c(0.029,
0, 0, 0, 0, 100, 0)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), .Names = c("PatientId", "TestDate",
"BCR_ABL"))
At the start of the treatment, each patient has a BCR_ABL value of 100 and ideally post treatment, this value should drop down to 0. The patients undergo tests for BCR_ABL at various stages as shown in the TestDate column.
The patients also visit the hospital for follow-up visits, and these are recorded in another table, which contains the follow-up date as well as the date the medication was started. The table looks like this:
structure(list(PatientId = c("Hospital1_124", "Hospital1_124",
"Hospital1_124", "Hospital1_124"), FollowupDate = structure(c(11323,
17298, 17407, 17553), class = "Date"), dateofStarting = structure(c(11323,
17318, 17318, 17318), class = "Date"), nameTKI = c("Imatinib",
"Imatinib", "Imatinib", "Imatinib"), brandTKI = c("Glivec", "Glivec",
"Glivec", "Glivec"), dailydose = c("100", "400", "400", "400"
)), class = "data.frame", row.names = c(NA, -4L), .Names = c("PatientId",
"FollowupDate", "dateofStarting", "nameTKI", "brandTKI", "dailydose"
))
Now the aim of the analysis is to find out the efficacy of the drug (nameTKI) being prescribed. To my mind, the best representation would be a line graph with Date on the x-axis and BCR_ABL on the y-axis. However, I am stuck on how to go about combining the dates. I am looking at a new table with the following variables: PatientId, Date, BCR_ABL, nameTKI, brandTKI and dailydose. I don't think the follow-up date has much significance, so neglecting it, the Date variable needs to be a combination of TestDate from the first table and dateofStarting from the second table, arranged chronologically for each individual patient (I could use group_by() for that). The value of BCR_ABL would start at 100, stay there until the first test result is obtained, and then follow the test values for the subsequent Date entries.
I have been trying various joins from dplyr without any success. Would appreciate some help please.
A bit hard to follow your code there, but you could join the tables together using PatientId as the primary key. However, you should also think carefully about the structure of the data. If the first table is at the patient/test level and the second is supposed to be at the patient level, why are there multiple dateofStarting values for a single PatientId?
library(tidyverse)
t1 <- data.frame(PatientId = rep("Hospital1_124", 7),
                 TestDate = as.Date(c("2007-11-13", "2008-09-01", "2011-02-24", "2013-05-01",
                                      "2016-02-16", "2017-05-12", "2017-08-29")),
                 BCR_ABL = c(0.029, 0, 0, 0, 0, 100, 0),
                 stringsAsFactors = FALSE)
t2 <- data.frame(PatientId = rep("Hospital1_124", 4),
                 FollowupDate = as.Date(c(11323, 17298, 17407, 17553), origin = "1970-01-01"),
                 dateofStarting = as.Date(c(11323, 17318, 17318, 17318), origin = "1970-01-01"),
                 nameTKI = rep("Imatinib", 4),
                 brandTKI = rep("Glivec", 4),
                 dailydose = c(100, 400, 400, 400),
                 stringsAsFactors = FALSE)
data <- t2 %>%
  select(-FollowupDate) %>%
  inner_join(t1, by = "PatientId")
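To get from the join to the combined chronological table described in the question (a single Date column built from TestDate and dateofStarting, with BCR_ABL starting at 100 until the first test), here is a minimal sketch; it assumes each distinct dateofStarting should contribute one starting row with BCR_ABL = 100 and uses tidyr::fill() to carry the drug information onto the test rows:
library(dplyr)
library(tidyr)
# One starting row per distinct dateofStarting, with BCR_ABL = 100
starts <- t2 %>%
  distinct(PatientId, dateofStarting, nameTKI, brandTKI, dailydose) %>%
  transmute(PatientId, Date = dateofStarting, BCR_ABL = 100,
            nameTKI, brandTKI, dailydose)
# Test rows keep their measured BCR_ABL values
tests <- t1 %>%
  transmute(PatientId, Date = TestDate, BCR_ABL)
# Stack, sort chronologically per patient, and fill the drug columns
combined <- bind_rows(starts, tests) %>%
  group_by(PatientId) %>%
  arrange(Date, .by_group = TRUE) %>%
  fill(nameTKI, brandTKI, dailydose, .direction = "downup") %>%
  ungroup()
The combined table then has the PatientId, Date, BCR_ABL, nameTKI, brandTKI and dailydose columns and can be plotted directly as a line graph of BCR_ABL over Date.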
Is there an easy way to get ASV richness for each Phylum for each Station using the estimate_richness function in phyloseq? Or is there another simple way of extracting the abundance data for each taxonomic rank and calculating richness that way?
So far I have just been subsetting individual Phyla of interest using for example:
ps.Prymnesiophyceae <- subset_taxa(ps, Phylum == "Prymnesiophyceae")
alpha_diversity <- estimate_richness(ps.Prymnesiophyceae, measures = c("Shannon", "Observed"))
H <- alpha_diversity$Shannon
S1 <- alpha_diversity$Observed
S <- log(S1)
evenness <- H/S
alpha <- cbind(Shannon = H, Richness = S1, Evenness = evenness, sample_data(ps.Prymnesiophyceae))
But this is rather a pain when having to do it for e.g. the top 20 phyla.
EDIT:
The suggestion by @GTM works well until the last step. See comment + dput:
> dput(head(sample_names(ps.transect), n = 2))
c("2-1-DCM_S21_L001_R1_001.fastq", "2-1-SA_S9_L001_R1_001.fastq")
> dput(head(alpha, n = 2))
structure(list(Observed = c(31, 25),
    Shannon = c(2.84184012598765, 2.53358345702604),
    taxon = c("Prymnesiophyceae", "Prymnesiophyceae"),
    sample_id = c("X2.1.DCM_S21_L001_R1_001.fastq", "X2.1.SA_S9_L001_R1_001.fastq"),
    S = c(3.43398720448515, 3.2188758248682),
    evenness = c(0.827562817437384, 0.787101955736294)),
    row.names = c("X2.1.DCM_S21_L001_R1_001.fastq", "X2.1.SA_S9_L001_R1_001.fastq"),
    class = "data.frame")
> dput(head(smpl_data, n = 1))
new("sample_data", .Data = list("001_DCM", 125L,
    structure(1L, .Label = "DCM", class = "factor"),
    structure(1L, .Label = "Transect", class = "factor"),
    structure(1L, .Label = "STZ", class = "factor"),
    structure(1L, .Label = "STFW", class = "factor"),
    "Oligotrophic", 16L, -149.9978333, -29.997, 130.634, 17.1252, 35.4443,
    1025.835008, 1.1968, 1e-12, 5.387, 2.8469, 52.26978546, 98.0505, 0, 0,
    0.02, 0.9, 0, 0, 2069.47, 8.057, 377.3),
    names = c("Station_neat", "Depth_our", "Depth_bin", "Loc", "Front", "Water",
              "Zone", "Bottle", "Lon", "Lat", "pressure..db.", "Temperature",
              "Salinity", "Density_kgm.3", "Fluorescence_ugL", "PAR",
              "BottleO2_mLL", "CTDO2._mLL", "OxygenSat_.", "Beam_Transmission",
              "N_umolL", "NO3_umolL", "PO4_umolL", "SIL_umolL", "NO2_umolL",
              "NH4_umolL", "DIC_uMkg", "pH", "pCO2_matm"),
    row.names = "2-1-DCM_S21_L001_R1_001.fastq",
    .S3Class = "data.frame")
You can wrap your code in a for loop to do so. I've slightly modified your code to make it a bit more flexible, see below.
require("phyloseq")
require("dplyr")
# Calculate alpha diversity measures for a specific taxon at a specified rank.
# You can pass any parameters that you normally pass to `estimate_richness`
estimate_diversity_for_taxon <- function(ps, taxon_name, tax_rank = "Phylum", ...){
  # Subset to taxon of interest
  tax_tbl <- as.data.frame(tax_table(ps))
  keep <- tax_tbl[, tax_rank] == taxon_name
  keep[is.na(keep)] <- FALSE
  ps_phylum <- prune_taxa(keep, ps)
  # Calculate alpha diversity and generate a table
  alpha_diversity <- estimate_richness(ps_phylum, ...)
  alpha_diversity$taxon <- taxon_name
  alpha_diversity$sample_id <- row.names(alpha_diversity)
  return(alpha_diversity)
}
# Load data
data(GlobalPatterns)
ps <- GlobalPatterns
# Estimate alpha diversity for each phylum
phyla <- get_taxa_unique(ps, taxonomic.rank = "Phylum")
phyla <- phyla[!is.na(phyla)]
alpha <- data.frame()
for (phylum in phyla){
  a <- estimate_diversity_for_taxon(ps = ps,
                                    taxon_name = phylum,
                                    measures = c("Shannon", "Observed"))
  alpha <- rbind(alpha, a)
}
# Calculate the additional alpha diversity measures
alpha$S <- log(alpha$Observed)
alpha$evenness <- alpha$Shannon/alpha$S
# Add sample data
smpl_data <- as.data.frame(sample_data(ps))
alpha <- left_join(alpha, smpl_data, by = c("sample_id" = "X.SampleID"))
This is a reproducible example with GlobalPatterns. Make sure to alter the code to match your data by replacing X.SampleID in the left join with the name of the column that contains the sample IDs in your sample_data. If there is no such column, you can create it from the row names:
smpl_data <- as.data.frame(sample_data(ps))
smpl_data$sample_id <- row.names(smpl_data)
alpha <- left_join(alpha, smpl_data, by = "sample_id")
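Regarding the EDIT above: the dput output suggests the last step fails because the sample IDs in alpha have been run through make.names() somewhere along the way ("2-1-DCM_S21_L001_R1_001.fastq" became "X2.1.DCM_S21_L001_R1_001.fastq"), so they no longer match the row names of the sample data. A hedged workaround, assuming that is the only difference, is to apply the same transformation to the key before joining:
smpl_data <- as.data.frame(sample_data(ps.transect))
# make.names() reproduces the mangling seen in alpha$sample_id ("2-1-..." becomes "X2.1....")
smpl_data$sample_id <- make.names(row.names(smpl_data))
alpha <- left_join(alpha, smpl_data, by = "sample_id")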
I have a panel data set with the following columns: "ID", "Year", "Poverty rate", "Health services".
I have data from 2011-2013, and the table is ordered by ID, looking something like this:
merged_data_frame = structure(list(ID = c(1001,1001,1001,2001,2001,2001,2002,2002,2002,2003,2003,2003,3001,3001,3001),
Year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013,2011,2012,2013,2011,2012,2013),
Poverty_rate = c(0.5,0.4,0.3,0.45,0.1,0.35,0.55,0.55,0.55,0.6,0.7,0.8,0.1,0.11,0.1 )), row.names = c(1:15), class = "data.frame")
How do I remove the rows with ID between 2001 and 2003? My actual dataset has more than 5000 values, so I need something that removes everything between 2001 and 2xxx.
I managed to remove values one at a time, but that is not an option given the size of the data set:
new_data_frame<-subset(merged_data_frame, merged_data_frame$ID!=20013)
Try this, using filter(!ID %in% 2001:2003):
merged_data_frame = structure(list(ID = c(1001,1001,1001,2001,2001,2001,2002,2002,2002,2003,2003,2003,3001,3001,3001),
Year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013,2011,2012,2013,2011,2012,2013),
Poverty_rate = c(0.5,0.4,0.3,0.45,0.1,0.35,0.55,0.55,0.55,0.6,0.7,0.8,0.1,0.11,0.1 )), row.names = c(1:15), class = "data.frame")
library(dplyr)
df <- merged_data_frame %>%
  filter(!ID %in% 2001:2003)
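Since the real data set has thousands of rows and the upper bound is only given as "2xxx", a range condition may scale better than listing IDs; here is a small sketch, where 2999 is a placeholder for the actual upper cut-off:
library(dplyr)
new_data_frame <- merged_data_frame %>%
  filter(!between(ID, 2001, 2999))  # keep only rows whose ID falls outside 2001-2999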
I have a data.table, dt, in which there appear to be two copies of each row, yet unique(dt) does not treat them as duplicates and removes nothing.
The output of dput on the file is below
structure(list(region_code.IMPACT159 = c("CHM", "CHM"),
    c_Crust.elas = c(1, 1),
    c_Mllsc.elas = c(0.437389655806453, 0.437389655806453),
    c_FrshD.elas = c(0.361233613522818, 0.361233613522818),
    c_OPelag.elas = c(0.361774165068678, 0.361774165068678),
    c_ODmrsl.elas = c(1, 1),
    c_OMarn.elas = c(-0.09, -0.09),
    c_FshOil.elas = c(0.382700000000001, 0.382700000000001),
    c_aqan.elas = c(0, 0),
    c_aqpl.elas = c(0, 0)),
    sorted = "region_code.IMPACT159",
    class = c("data.table", "data.frame"),
    row.names = c(NA, -2L),
    .internal.selfref = <pointer: 0x7fd6af00e2e0>)
I can't run this directly because of the internal selfref pointer, but when I delete that part, the resulting object does show that one row is duplicated. I've been running this type of code for a long time, so I'm not sure what has changed to cause this. I'm using version 1.11.9 of data.table.
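One hedged way to check whether this is a print-precision issue: dput deparses doubles to roughly 15 significant digits, so two rows that differ only beyond that would look identical in the dput output (and become genuine duplicates once re-created from it) while unique(dt) still treats them as distinct. Column-wise ranges make any such tiny differences visible:
library(data.table)
num_cols <- names(dt)[sapply(dt, is.numeric)]
# A non-zero but tiny value here means the two rows are not exact duplicates,
# even though they print (and dput) identically.
dt[, lapply(.SD, function(x) max(x) - min(x)), .SDcols = num_cols]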
I have many tibbles similar to this:
dftest_tw <- structure(list(text = c("RT #BitMEXdotcom: A new high: US$500M turnover in the last 24 hours, over 80% of it on $XBTUSD. Congrats to the team and thank you to our u…",
"RT #Crowd_indicator: Thank you for this nice video, #Nicholas_Merten",
"RT #Crowd_indicator: Review of #Cindicator by DataDash: t.co/D0da3u5y3V"
), Tweet.id = c("896858423521837057", "896858275689398272", "896858135314538497"
), created.date = structure(c(17391, 17391, 17391), class = "Date"),
created.week = c(33, 33, 33)), .Names = c("text", "Tweet.id",
"created.date", "created.week"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
For testing, we add another one:
dftest2_tw <- dftest_tw
I have this list of my data frames:
myUserList <- ls(pattern = "_tw")
What I am looking to do is:
1- add a new column named Twitter.name
2- fill the column with the data frame's name, all of this inside a function. The following code works for each data frame taken one by one:
dftest_tw %>% rowwise() %>% mutate(Twitter.name = myUserList[1])
The desired result is this:
MyRes <- structure(list(text = c("RT #BitMEXdotcom: A new high: US$500M turnover in the last 24 hours, over 80% of it on $XBTUSD. Congrats to the team and thank you to our u…",
"RT #Crowd_indicator: Thank you for this nice video, #Nicholas_Merten",
"RT #Crowd_indicator: Review of #Cindicator by DataDash: t.co/D0da3u5y3V"
), Tweet.id = c("896858423521837057", "896858275689398272", "896858135314538497"
), created.date = structure(c(17391, 17391, 17391), class = "Date"),
created.week = c(33, 33, 33), retweet = c(0, 0, 0), custom = c(0,
0, 0), Twitter.name = c("dftest_tw", "dftest_tw", "dftest_tw"
)), .Names = c("text", "Tweet.id", "created.date", "created.week",
"retweet", "custom", "Twitter.name"), class = c("rowwise_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))
When it comes to writing a function that can then be applied to all my data frames (more than 100), I can't achieve it. Any help would be appreciated.
We can use tidyverse options: get the values of multiple string objects with mget, then, with map2 from purrr, create the new column 'Twitter.name' in each dataset of the list from the corresponding string element of myUserList.
library(tidyverse)
lst <- mget(myUserList) %>%
  map2(myUserList, ~ mutate(.data = .x, Twitter.name = .y))
If we need to modify the objects in the global environment, use list2env
list2env(lst, envir = .GlobalEnv)
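An equivalent, slightly more compact spelling (assuming a purrr version that provides imap, 0.2.3 or later) uses the names that mget already attaches to the list:
library(tidyverse)
lst <- mget(myUserList) %>%
  imap(~ mutate(.x, Twitter.name = .y))  # .y is the name of each list element
list2env(lst, envir = .GlobalEnv)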
I have a large dataset and I use dplyr's summarize to generate some means.
Occasionally, I would like to perform arithmetic on that output.
For example, I would like to get the mean of the means in the output below, say of "m.biomass".
I've tried mean(data.sum[,7]) and mean(as.list(data.sum[,7])). Is there a quick and easy way to achieve this?
data.sum <-structure(list(scenario = c("future", "future", "future", "future"
), state = c("fl", "ga", "ok", "va"), m.soc = c(4090.31654013689,
3654.45350562628, 2564.33199749487, 4193.83388887064), m.npp = c(1032.244475,
821.319385, 753.401315, 636.885535), sd.soc = c(56.0344229400332,
97.8553643582118, 68.2248389927858, 79.0739969429246), sd.npp = c(34.9421782033153,
27.6443555578531, 26.0728757486901, 24.0375040705595), m.biomass = c(5322.76631158111,
3936.79457763176, 3591.0902359206, 2888.25308402464), sd.m.biomass = c(3026.59250918009,
2799.40317348016, 2515.10516340438, 2273.45510178843), max.biomass = c(9592.9303,
8105.109, 7272.4896, 6439.2259), time = c("1980-1999", "1980-1999",
"1980-1999", "1980-1999")), .Names = c("scenario", "state", "m.soc",
"m.npp", "sd.soc", "sd.npp", "m.biomass", "sd.m.biomass", "max.biomass",
"time"), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4), vars = list(quote(scenario)), labels = structure(list(
scenario = "future"), class = "data.frame", row.names = c(NA,
-1), vars = list(quote(scenario)), drop = TRUE, .Names = "scenario"), indices = list(0:3))
We can use [[ to extract the column as a vector, since mean only works on a vector or a matrix, not on a data.frame. If the OP wants to do this on a single column:
mean(data.sum[[7]])
#[1] 3934.726
If the object had only the data.frame class, data.sum[,7] would extract a vector, but the tbl_df class prevents it from collapsing to a vector.
For multiple columns, dplyr also has specialised functions:
data.sum %>%
summarise_each(funs(mean), 3:7)
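Note that summarise_each() and funs() have since been deprecated in dplyr; with a current version the same per-group column means can be written with across(), selecting the same m.soc through m.biomass columns by name:
library(dplyr)
data.sum %>%
  summarise(across(m.soc:m.biomass, mean))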