Dataframe with all information in one row. How can spilt them? - r

I have a dataframe where all information is in one row. see picture below
untidy data
I need to change it to something like this
tidy data
so the first value (suffix_name) in the row should be changed to a variable and the second value (none) should be first value of new variable (suffix_name)
please see images

with the following code you can split the information of the row.
You need to use the library data.table to subset the vars or values columns.
library(data.table)
data_raw.dt <- data.table(
V2_00011 = "'SUFFIX_NAME'",
V2_00012 = 'NONE}}',
V2_00013 = "'PATIENT_ID'",
V2_00014 = "'CZMIl1844982497'",
V2_00015 = "'BIRTH_DATE'",
V2_00016 = "'1987-01-01'",
V2_00017 = "'GENDER'",
V2_00018 = "'Unknown'",
V2_00019 = "'OBSCURITY_LEVEL'",
V2_00020 = "'Normal'")
vars <- seq(1, ncol(data_raw.dt), by = 2)
vals <- seq(2, ncol(data_raw.dt), by = 2)
data_ref.dt <- data.table(matrix(data_raw.dt[, ..vals], ncol = length(vals)))
names(data_ref.dt) <- paste(data_raw.dt[, ..vars])
Here you can see the results.
print(data_raw.dt)
V2_00011 V2_00012 V2_00013 V2_00014 V2_00015 V2_00016 V2_00017 V2_00018 V2_00019 V2_00020
'SUFFIX_NAME' NONE}} 'PATIENT_ID' 'CZMIl1844982497' 'BIRTH_DATE' '1987-01-01' 'GENDER' 'Unknown' 'OBSCURITY_LEVEL' 'Normal'
print(data_ref.dt)
'SUFFIX_NAME' 'PATIENT_ID' 'BIRTH_DATE' 'GENDER' 'OBSCURITY_LEVEL'
NONE}} 'CZMIl1844982497' '1987-01-01' 'Unknown' 'Normal'

Related

R for loop going wrong when applied to function

I am trying to work on a for loop to make running a function I've developed more efficient.
However, when I put it in a for loop, it is overwriting columns that it should not be and returning incorrect results.
Edit: The error is that in the resulting dataframe MiSeq_Bord_Outliers_table0, the resulting columns containing label Outlier_type is returning incorrect outputs.
As per the Outlier_Hunter function, when Avg_Trim_Cov and S2_Total_Read_Pairs_Processed are below their
respective Q1 Thresholds their respective Outlier_type columns should read "Lower_Outlier", if between Q1 & Q3 Threshold, "Normal" and if above Q3 Threshold then "Upper_outlier". But when the for loop is executed, only "Upper_outlier" is shown in the Outlier_type columns.
Edit: The inputs have been simplified and tested on the different computer with a clean console. If there were any artifacts there before, they should have been eliminated now, and there should be no errors here now. It is important to run the outlier_results_1var part first. If you test run this code and get errors, please let me know which part failed.
Edit: MiSeq_Bord_Outliers_table0_error is the error that is being reproduced. This is the error result, not an input.
Can someone please tell me why is it returning these incorrect results and what I can do to fix it? I will upload the relevant code below. Or is there another way to do this without a for loop?
#libraries used
library(tidyverse)
library(datapasta)
library(data.table)
library(janitor)
library(ggpubr)
library(labeling)
#2.) Outlier_Hunter Function
#Function to Generate the Outlier table
#Outlier Hunter function takes 4 arguments: the dataset, column/variable of interest,
#Q1 and Q3. Q1 and Q3 are stored in the results of Quartile_Hunter.
#Input ex: MiSeq_Bord_final_report0, Avg_Trim_Cov, MiSeq_Bord_Quartiles_ATC$First_Quartile[1], MiSeq_Bord_Quartiles_ATC$Third_Quartile[1]
#Usage ex: Outlier_Hunter(MiSeq_Bord_final_report0, Avg_Trim_Cov,
#MiSeq_Bord_Quartiles_ATC$First_Quartile[1], MiSeq_Bord_Quartiles_ATC$Third_Quartile[1])
#Here is the Function to get the Outlier Table
Outlier_Hunter <- function(Platform_Genus_final_report0, my_col, Q1, Q3) {
#set up and generalize the variable name you want to work with
varname <- enquo(my_col)
#print(varname) #just to see what variable the function is working with
#get the outliers
Platform_Genus_Variable_Outliers <- Platform_Genus_final_report0 %>%
select(ReadID, Platform, Genus, !!varname) %>%
#Tell if it is an outlier, and if so, what kind of outlier
mutate(
Q1_Threshold = Q1,
Q3_Threshold = Q3,
Outlier_type =
case_when(
!!varname < Q1_Threshold ~ "Lower_Outlier",
!!varname >= Q1_Threshold & !!varname <= Q3_Threshold ~ "Normal",
!!varname > Q3_Threshold ~ "Upper_Outlier"
)
)
}
#MiSeq_Bord_Quartiles entries
MiSeq_Bord_Quartiles <- data.frame(
stringsAsFactors = FALSE,
row.names = c("Avg_Trim_Cov", "S2_Total_Read_Pairs_Processed"),
Platform = c("MiSeq", "MiSeq"),
Genus = c("Bord", "Bord"),
Min = c(0.03, 295),
First_Quartile = c(80.08, 687613.25),
Median = c(97.085, 818806.5),
Third_Quartile = c(121.5625, 988173.75),
Max = c(327.76, 2836438)
)
#Remove the hashtag below to test if what you have is correct
#datapasta::df_paste(head(MiSeq_Bord_Quartiles, 5))
#dataset entry
MiSeq_Bord_final_report0 <- data.frame(
stringsAsFactors = FALSE,
ReadID = c("A005_20160223_S11_L001","A050_20210122_S6_L001",
"A073_20210122_S7_L001",
"A076_20210426_S11_L001",
"A080_20210426_S12_L001"),
Platform = c("MiSeq","MiSeq",
"MiSeq","MiSeq","MiSeq"),
Genus = c("Bordetella",
"Bordetella","Bordetella",
"Bordetella","Bordetella"),
Avg_Raw_Read_bp = c(232.85,241.09,
248.54,246.99,248.35),
Avg_Trimmed_Read_bp = c(204.32,232.6,
238.56,242.54,244.91),
Avg_Trim_Cov = c(72.04,101.05,
92.81,41.77,54.83),
Genome_Size_Mb = c(4.1, 4.1, 4.1, 4.1, 4.1),
S1_Input_reads = c(1450010L,
1786206L,1601542L,710792L,925462L),
S1_Contaminant_reads = c(12220L,6974L,
7606L,1076L,1782L),
S1_Total_reads_removed = c(12220L,6974L,
7606L,1076L,1782L),
S1_Result_reads = c(1437790L,
1779232L,1593936L,709716L,923680L),
S2_Read_Pairs_Written = c(712776L,882301L,
790675L,352508L,459215L),
S2_Total_Read_Pairs_Processed = c(718895L,889616L,
796968L,354858L,461840L)
)
MiSeq_Bord_final_report0
#Execution for 1 variable
outlier_results_1var <- Outlier_Hunter(MiSeq_Bord_final_report0, Avg_Trim_Cov,
MiSeq_Bord_Quartiles$First_Quartile[1], MiSeq_Bord_Quartiles$Third_Quartile[1])
#Now do it with a for loop
col_var_outliers <- row.names(MiSeq_Bord_Quartiles)
#col_var_outliers <- c("Avg_Trim_Cov", "S2_Total_Read_Pairs_Processed")
#change line above to change input of variables few into Outlier Hunter Function
outlier_list_MiSeq_Bord <- list()
for (y in col_var_outliers) {
outlier_results0 <- Outlier_Hunter(MiSeq_Bord_final_report0, y, MiSeq_Bord_Quartiles[y, "First_Quartile"], MiSeq_Bord_Quartiles[y, "Third_Quartile"])
outlier_results1 <- outlier_results0
colnames(outlier_results1)[5:7] <- paste0(y, "_", colnames(outlier_results1[, c(5:7)]), sep = "")
outlier_list_MiSeq_Bord[[y]] <- outlier_results1
}
MiSeq_Bord_Outliers_table0 <- reduce(outlier_list_MiSeq_Bord, left_join, by = c("ReadID", "Platform", "Genus"))
#the columns containing label Outlier_type is where the code goes wrong.
#When Avg_Trim_Cov and S2_Total_Read_Pairs_Processed are below their
#respective Q1 Thresholds their respective Outlier_type columns should read
#"Lower_Outlier", if between Q1 & Q3 Threshold, "Normal" and if above Q3
#Threshold then "Upper_outlier". But when the for loop is executed, only
"Upper_outlier" is shown in the Outlier_type columns.
datapasta::df_paste(head(MiSeq_Bord_Outliers_table0, 5))
MiSeq_Bord_Outliers_table0_error <- data.frame(
stringsAsFactors = FALSE,
ReadID = c("A005_20160223_S11_L001",
"A050_20210122_S6_L001",
"A073_20210122_S7_L001","A076_20210426_S11_L001",
"A080_20210426_S12_L001"),
Platform = c("MiSeq",
"MiSeq","MiSeq","MiSeq",
"MiSeq"),
Genus = c("Bordetella","Bordetella","Bordetella",
"Bordetella","Bordetella"),
Avg_Trim_Cov = c(72.04,
101.05,92.81,41.77,54.83),
Avg_Trim_Cov_Q1_Threshold = c(80.08,
80.08,80.08,80.08,80.08),
Avg_Trim_Cov_Q3_Threshold = c(121.5625,
121.5625,121.5625,121.5625,
121.5625),
Avg_Trim_Cov_Outlier_type = c("Upper_Outlier","Upper_Outlier",
"Upper_Outlier","Upper_Outlier",
"Upper_Outlier"),
S2_Total_Read_Pairs_Processed = c(718895L,
889616L,796968L,354858L,
461840L),
S2_Total_Read_Pairs_Processed_Q1_Threshold = c(687613.25,
687613.25,687613.25,
687613.25,687613.25),
S2_Total_Read_Pairs_Processed_Q3_Threshold = c(988173.75,
988173.75,988173.75,
988173.75,988173.75),
S2_Total_Read_Pairs_Processed_Outlier_type = c("Upper_Outlier","Upper_Outlier",
"Upper_Outlier","Upper_Outlier",
"Upper_Outlier")
)
For use in a loop like you do, it would be more useful to write your Outlier_Hunter() function to take the target column as a character string rather than an expression.
To do that, try replacing all instances of !!varname in your function with .data[[my_col]], and remove the enquo() line altogether.
Note that with these changes, you also need to change how you call the function when you don't have the column name in a variable. For example, your single execution would become:
Outlier_Hunter(
MiSeq_Bord_final_report0,
"Avg_Trim_Cov",
MiSeq_Bord_Quartiles$First_Quartile[1],
MiSeq_Bord_Quartiles$Third_Quartile[1]
)
For more info about programming with tidy evaluation functions, you may find this rlang vignette useful.

Using an R function to hash values produces a repeating value across rows

I'm using the following query:
let
Source = {1..5},
#"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), {"Numbers"}, null, ExtraValues.Error),
#"Added Custom" = Table.AddColumn(#"Converted to Table", "Letters", each Character.FromNumber([Numbers] + 64)),
#"Run R script" = R.Execute("# 'dataset' holds the input data for this script#(lf)#(lf)library(""digest"")#(lf)#(lf)dataset$SuffixedLetters <- paste(dataset$Letters, ""_suffix"")#(lf)dataset$HashedLetters <- digest(dataset$Letters, ""md5"", serialize = TRUE)#(lf)output<-dataset",[dataset=#"Added Custom"]),
output = #"Run R script"{[Name="output"]}[Value]
in
output
which leads to the resulting table:
And the here is the R script with better formatting:
# 'dataset' holds the input data for this script
library("digest")
dataset$SuffixedLetters <- paste(dataset$Letters, "_suffix")
dataset$HashedLetters <- digest(dataset$Letters, "md5", serialize = TRUE)
output<-dataset
The 'paste' function appears to iterate over rows and resolve on each row with the new input. But the 'digest' function only appears to return the first value in the table across all rows.
I don't know why the behavior of the two functions would seem to operate differently. Can anyone advise how to get the 'HashedLetters' column to resolve using the values from each row instead of just the initial one?
Use:
dataset$HashedLetters <- sapply(dataset$Letters, digest, algo = "md5", serialize = TRUE)
digest works on a whole object at a time, not individual elements of a vector.
vec <- letters[1:3]
digest::digest(vec, algo="md5", serialize=TRUE)
# [1] "38ce1fe9e19a222505e693e8bdd8aeec"
sapply(vec, digest::digest, algo="md5", serialize=TRUE)
# a b c
# "127a2ec00989b9f7faf671ed470be7f8" "ddf100612805359cd81fdc5ce3b9fbba" "6e7a8c1c098e8817e3df3fd1b21149d1"

Exporting Seurat Object Data by Cluster

I'm using Seurat to perform a single cell analysis and am interested in exporting the data for all cells within each of my clusters. I tried to use the below code but have had no success.
My Seurat object is called Patients. I also attached a screenshot of my Seurat object. I am looking to extract all the clusters (i.e. Ductal1, Macrophage1, Macrophage2, etc...)
meta.data.cluster <- unique(x = Patients#meta.data$active.ident)
for(group in meta.data.cluster) {
group.cells <- WhichCells(object = Patients, subset.name = "active.ident" , accept.value = group)
data_to_write_out <- as.data.frame(x = as.matrix(x = Patients#raw.data[, group.cells]))
write.csv(x = data_to_write_out, row.names = TRUE, file = paste0(save_dir,"/",group, "_cluster_outfile.csv"))
}
I am new to R and coding so any help is greatly appreciated! :)
It doesn't work because there is no active.ident column under your metadata. For example if we use an example dataset like yours and set the ident:
library(Seurat)
M = matrix(rnbinom(5000,mu=20,size=1),ncol=50)
colnames(M) = paste0("P",1:50)
rownames(M) = paste0("gene",1:100)
Patients = CreateSeuratObject(M)
Patients$grp = sample(c("Ductal1","Macrophage1","Macrophage2"),50,replace=TRUE)
Idents(Patients) = Patients$grp
You can see this line of code gives you no value:
meta.data.cluster <- unique(x = Patients#meta.data$active.ident)
meta.data.cluster
NULL
You can do:
meta.data.cluster <- unique(Idents(Patients))
for(group in meta.data.cluster) {
group.cells <- WhichCells(object = Patients, idents = group)
data_to_write_out <- as.data.frame(GetAssayData(Patients,slot = 'counts')[,group.cells])
write.csv(data_to_write_out, row.names = TRUE, file = paste0(save_dir,"/",group, "_cluster_outfile.csv"))
}
Note also you can get the counts out using GetAssayData . You can subset one group and write out like this:
wh <- which(Idents(Patients) =="Macrophage1" )
da = as.data.frame(GetAssayData(Patients,slot = 'counts')[,wh])
write.csv(da,...)

Create and combine 2 grid.tables

I have created a grid.table object to display a dataframe in PowerBi, below there is my code:
dataset <- data.frame(BDS_ID = c("001","002"),
PRIORITY = c("high","medium"),
STATUS = c("onair","onair"),
COMPANY = c("airfr","fly"))
my.result <- melt(dataset, id = c("BDS_ID"))
mytheme <- ttheme_default(base_size = 10,
core=list(fg_params=list(hjust=0, x=0.01),
bg_params=list(fill=c("white", "grey90"))))
for (i in 1:nrow(tg)) {
tg$grobs[[i]] <- editGrob(tg$grobs[[i]], gp=gpar(fontface="bold"))
}
grid.draw(tg)
and this is my output:
I would like to improve my output in the following way: I would like that the row headers to be unique and have a different column for each different value of each variable repeating the column with the row headers each time.
I tried to do this using the statement t(dataset), but I do not get the desired result because the row headers are not repeated.
I would like to get an output (always classy grob) similar to this:
**PRIORITY** high **PRIORITY** medium
**STATUS** onair **STATUS** onair
**COMPANY** airfr **COMPANY** fly
Does anyone knows how to achive this?
Thanks
I'm unable to reproduce the grob format you've shown based on the code you've provided, but I've got something similar:
dataset <- data.frame(BDS_ID = c("001","002"),
PRIORITY = c("high","medium"),
STATUS = c("onair","onair"),
COMPANY = c("airfr","fly"))
dataset <- data.frame(t(dataset))
dataset$label1 <- rownames(dataset)
dataset$label2 <- rownames(dataset)
colnames(dataset) <- c("status1", "status2", "label1", "label2")
dataset <- dataset[c(2:nrow(dataset)), c(3, 1, 4, 2)]
rownames(dataset) <- NULL
test <- grid.draw(tableGrob(dataset))
The above code produces the following object. It doesn't look exactly like yours, but it's in the general structure you're looking for:

Lag in R dataframe

I have the following sample dataset (below and/or as CSVs here: http://goo.gl/wK57T) which I want to transform as follows. For each person in a household I want to create two new variables OrigTAZ and DestTAZ. It should take the value in TripendTAZ and put that in DestTAZ. For OrigTAZ it should put value of TripendTAZ from the previous row. For the first trip of every person in a household (Tripid = 1) the OrigTAZ = hometaz. For each person in a household, from the second trip OrigTAZ = TripendTAZ_(n-1) and DestTAZ = TripEndTAZ. The sample input and output data are shown below. I tried the suggestions shown here: Basic lag in R vector/dataframe but have not had luck. I am used to doing something like this in SAS.
Any help is appreciated.
TIA,
Krishnan
SAS Code Sample
if Houseid = lag(Houseid) then do;
if Personid = lag(Personid) then do;
DestTAZ = TripendTAZ;
if Tripid = 1 then OrigTAZ = hometaz
else
OrigTAZ = lag(TripendTAZ);
end;
end;
INPUT DATA
Houseid,Personid,Tripid,hometaz,TripendTAZ
1,1,1,45,4
1,1,2,45,7
1,1,3,45,87
1,1,4,45,34
1,1,5,45,45
2,1,1,8,96
2,1,2,8,4
2,1,3,8,2
2,1,4,8,1
2,1,5,8,8
2,2,1,8,58
2,2,2,8,67
2,2,3,8,9
2,2,4,8,10
2,2,5,8,8
3,1,1,7,89
3,1,2,7,35
3,1,3,7,32
3,1,4,7,56
3,1,5,7,7
OUTPUT DATA
Houseid,Personid,Tripid,hometaz,TripendTAZ,OrigTAZ,DestTAZ
1,1,1,45,4,45,4
1,1,2,45,7,4,7
1,1,3,45,87,7,87
1,1,4,45,34,87,34
1,1,5,45,45,34,45
2,1,1,8,96,8,96
2,1,2,8,4,96,4
2,1,3,8,2,4,2
2,1,4,8,1,2,1
2,1,5,8,8,1,8
2,2,1,8,58,8,58
2,2,2,8,67,58,67
2,2,3,8,9,67,9
2,2,4,8,10,9,10
2,2,5,8,8,10,8
3,1,1,7,89,7,89
3,1,2,7,35,89,35
3,1,3,7,32,35,32
3,1,4,7,56,32,56
3,1,5,7,7,56,7
Just proceed through the steps you outlined step-by-step and it isn't so bad.
First I'll read in your data by copying it:
df <- read.csv(file('clipboard'))
Then I'll sort to make sure the data frame is ordered by houseid, then personid, then tripid:
# first sort so that it's ordered by Houseid, then Personid, then Tripid:
df <- with(df, df[order(Houseid,Personid,Tripid),])
Then follow the steps you specified:
# take value in TripendTAZ and put it in DestTAZ
df$DestTAZ <- df$TripendTAZ
# Set OrigTAZ = value from previous row
df$OrigTAZ <- c(NA,df$TripendTAZ[-nrow(df)])
# For the first trip of every person in a household (Tripid = 1),
# OrigTAZ = hometaz.
df$OrigTAZ[ df$Tripid==1 ] <- df$hometaz[ df$Tripid==1 ]
You'll notice that df is then what you're after.

Resources