Setting row order in R using levels - r

Please don't mark this as a duplicate; I will take this question down once I find out what's wrong. I have used levels() with a very high degree of success, and today it refuses to work no matter what. Here's what I'm trying to achieve. I have two data frames with an identical column. I am using the simple merge() function as follows:
mergedData<-merge(df1, df2, by='Index')
Now, I want to reorder the 'Index' column, i.e. reorder the rows of 'mergedData' to match the order in either of the original data frames. This is the command I am using to achieve the reordering:
mergedData$Index<-factor(mergedData$Index,
levels=c("ND","TC","PR","W","MI"))
When I check the levels after running the above command, they show the desired order; however, when I export the table it retains the original row order. I am extremely confused as to why this isn't working. I have other scripts in which I've used this approach to set the desired order, and it works perfectly fine except in this instance.
Any help, suggestions, or advice would be greatly appreciated.
I have attached data from the two dataframes for you all to play around with:
df1
structure(list(Index = structure(1:5, .Label = c("ND", "TC",
"PR", "W", "MI"), class = "factor"), `CP` = c(0.7102,
0.059, -0.0469, 1.0137, 0.6116), FA1 = c(0.5218, 0.0249, -0.0532,
0.9561, 1.1676), FA2 = c(0.5625, 0.0397, -0.0712, 0.9636, 0.9569
), FA3 = c(0.5934, 0.0332, -0.0442, 0.9873, 0.8929)), .Names = c("Index",
"CP", "FA1", "FA2", "FA3"), row.names = c(NA, 5L), class = "data.frame")
df2
structure(list(Index = structure(1:5, .Label = c("ND", "TC",
"PR", "W", "MI"), class = "factor"), `CP SD` = c(0.0241,
0.0184, 0.0021, 0.0114, 0.0947), `FA1 SD` = c(0.1891, 0.0171,
0.0104, 0.0559, 0.5321), `FA2 SD` = c(0.1273, 0.0243, 0.0173,
0.0565, 0.3292), `FA3 SD` = c(0.0518, 0.0094, 0.0078, 0.0195,
0.1581)), .Names = c("Index", "CP SD", "FA1 SD", "FA2 SD",
"FA3 SD"), row.names = c(NA, 5L), class = "data.frame")
Thanks

levels only controls the order of the factor levels (how they are displayed by levels(x), in table(), etc.), not the order of the rows in a data.frame. To order a data frame by that factor, use this:
mergedData <- mergedData[order(mergedData$Index),]
Or with dplyr:
library(dplyr)
mergedData <- arrange(mergedData,Index)
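To make the difference concrete, here is a minimal base R sketch. It assumes a small stand-in for the merged data (the name merged and the alphabetical starting order are illustrative, with the CP values taken from df1):

```r
# Stand-in for the merged data, with rows in alphabetical order of Index
merged <- data.frame(Index = c("MI", "ND", "PR", "TC", "W"),
                     CP = c(0.6116, 0.7102, -0.0469, 0.059, 1.0137),
                     stringsAsFactors = FALSE)

# Setting levels changes the factor's metadata only; the rows stay where they are
merged$Index <- factor(merged$Index, levels = c("ND", "TC", "PR", "W", "MI"))
rows_after_factor <- as.character(merged$Index)

# order() on a factor sorts by level position, so this is what moves the rows
sorted <- merged[order(merged$Index), ]
rows_after_order <- as.character(sorted$Index)
```

Setting the levels first and then calling order() (or arrange()) gives the custom row order; either step alone does not.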

Related

Creating new variables from messy data in multiple variables

I am busy cleaning the data of a TB drug register. I have already done it in Stata but would like to move across to R (and improve the new variables slightly).
There are 5 variables that describe each drug regimen:
drugname1, drugstartdate1, drugenddate1, drugdose1 (e.g. "500"), drugunit1 (e.g. "mg"). The numbers go up to 60, so I know I may need to use looping in one form or another.
drugname1 is not tied to one specific drug, so bedaquiline (bdq) can be in drugname1 or drugname2 or drugname50.
Ultimately I would like to have variables that relate to a specific drug, for instance (bedaquiline = bdq):
onbdq bdqstartdate bdqenddate bdqdose
*There may also be patients who start on bdq, then stop, then start again, so something along the lines of
onbdq1 bdqstartdate1 bdqenddate1 bdqdose1 onbdq2 bdqstartdate2 bdqenddate2 bdqdose2 would be more accurate.
#Essentially, I have this:
mre <- as.data.frame(structure(list(drugname1 = c("Bedaquiline", "Bedaquiline", "Bedaquiline",
NA, "Amikacin"), drugstartdate1 = structure(c(18875, 18383, 18795,
NA, 16743), class = "Date"), drugenddate1 = structure(c(NA, NA,
18808, NA, NA), class = "Date"), drugdose1 = c(NA, "400", "200",
NA, NA), drugunit1 = c(NA, "mg", "mg", NA, NA), drugname2 = c("Levofloxacin",
"Levofloxacin", "Levofloxacin", "p-Aminosalicylic Acid", "Terizidone"
), drugstartdate2 = structure(c(18875, 18383, 18795, 16709, 16743
), class = "Date"), drugenddate2 = structure(c(NA, NA, 19139,
NA, NA), class = "Date"), drugdose2 = c(NA, "750", "1000", NA,
NA), drugunit2 = c(NA, "mg", "mg", "mg", NA), drugname3 = c("Linezolid",
"Linezolid", "Linezolid", "Ethionamide", "Ethambutol"), drugstartdate3 = structure(c(18875,
18383, 18795, 16709, 16743), class = "Date"), drugenddate3 = structure(c(NA,
18438, 19139, NA, NA), class = "Date"), drugdose3 = c(NA, "600",
"600", NA, NA), drugunit3 = c(NA, "mg", "mg", "mg", NA)), row.names = c(NA,
5L), class = "data.frame"))
#and I want something like this for each drug (bedaquiline = bdq) and so on:
output <- as.data.frame(structure(list(
onbdq = c(TRUE, TRUE),
bdqstartdate = structure(c(18875, 18383 ), class= "Date"),
bdqenddate = structure(c(NA, NA), class = "Date"),
bdqdose = structure(c(NA, "200mg")))))
Some starting points have been
library(dplyr)
#combining drugdose and drugunit into variable dose
EDRsub <- mutate(EDRsub, dose1= paste0(drugdose1, drugunit1))
#but I am unsure of how to go about looping, in my head this should work:
for(i in 1:3){
mre <- mutate(mre, paste0(dose,i) = paste0(drugdose,i,drugunit,i))
}
#or
for(i in 1:3){
mre <- mutate(mre, dose[[i]]) = (drugdose[[i]],drugunit[[i]])
}
but neither works. There is clearly something fundamental I am not cracking with the R syntax.
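For what it's worth, the loops above fail because mutate() needs a literal name on the left of `=`, and paste0(drugdose, i) looks for an object called drugdose rather than the column drugdose1. A minimal base R sketch, assuming a small stand-in data frame (mre_small is hypothetical, not the real register), builds each column name as a string and uses [[ ]] to read and write it:

```r
# Hypothetical two-slot stand-in for the register
mre_small <- data.frame(
  drugdose1 = c("400", "200"), drugunit1 = c("mg", "mg"),
  drugdose2 = c("750", NA),    drugunit2 = c("mg", NA),
  stringsAsFactors = FALSE
)

for (i in 1:2) {
  dose <- mre_small[[paste0("drugdose", i)]]  # column looked up by its name string
  unit <- mre_small[[paste0("drugunit", i)]]
  # paste0() would turn NA into the text "NA", so keep missing doses as NA
  mre_small[[paste0("dose", i)]] <- ifelse(is.na(dose), NA_character_,
                                           paste0(dose, unit))
}
```

The same pattern extends to 1:60; whether a wide layout is worth keeping at all is a separate question.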
# I also had a go at reshaping, with mixed results: the code runs, but it seems clunky and inefficient as well as not quite giving me what I want.
long <- mre %>%
pivot_longer(
cols = starts_with(c("drugname")),
names_to = "observation",
names_patter = "new_?(.*)",
values_to = c("drugname"),
values_drop_na = TRUE
) %>%
pivot_longer(
cols = starts_with(c("as.character(drugstartdate)")),
names_to = "observationdrugstartdate",
names_patter = "new_?(.*)",
values_to = c("drugstartdate"),
values_drop_na = TRUE
) %>%
distinct()
I have tried bits and pieces of code but generally feel as though I have thrown myself into the middle of the Atlantic. I am struggling to carry my looping and reshaping knowledge over from Stata, and at the same time I feel that R has a better way of doing things where the logic of Stata (which I am used to) is not helpful.
This is some of the code Statalist came up with: https://www.statalist.org/forums/forum/general-stata-discussion/general/1680179-forvalues-%60i-1%60-and-wide-to-long-dataset
But it isn't exactly what I am after with this question.
Any advice on coding is welcome, as well as suggestions for how to arrange the data the "R" way.
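One common way to arrange data like this in R is a long layout with one row per patient and drug slot, so that a specific drug becomes a plain row filter. The sketch below is only an illustration in base R under stated assumptions: reg, id, and slot are hypothetical names, and the toy data has two slots rather than 60. tidyr's pivot_longer() with a ".value" entry in names_to is another route to the same shape.

```r
# Hypothetical stand-in: three patients, two drug slots each
reg <- data.frame(
  id = 1:3,
  drugname1 = c("Bedaquiline", "Amikacin", NA),
  drugdose1 = c("400", NA, NA),
  drugunit1 = c("mg", NA, NA),
  drugname2 = c("Linezolid", "Bedaquiline", "Ethionamide"),
  drugdose2 = c("600", "200", NA),
  drugunit2 = c("mg", "mg", "mg"),
  stringsAsFactors = FALSE
)

# Pull slot i of every patient into one block of rows, then bind the blocks
slots <- lapply(1:2, function(i) {
  data.frame(
    id       = reg$id,
    slot     = i,
    drugname = reg[[paste0("drugname", i)]],
    drugdose = reg[[paste0("drugdose", i)]],
    drugunit = reg[[paste0("drugunit", i)]],
    stringsAsFactors = FALSE
  )
})
long <- do.call(rbind, slots)
long <- long[!is.na(long$drugname), ]  # drop empty slots

# All bedaquiline episodes, whichever slot they were recorded in
bdq <- long[long$drugname == "Bedaquiline", ]
```

A patient who stops and restarts a drug simply has several rows for it, which avoids the onbdq1/onbdq2 column numbering.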

ggpattern and swimplot : issue with aligning patterns and repeating x-axis values

I am having a hard time figuring out how to fix this graph. I thought it was a bug in ggpattern (see here and the response to my bug report here); however, others seem to think it is not a bug but an issue with "overlapping columns". When there are no repeated patterns/values (e.g., "NA", "S", "V+S"), the x-axis and patterns align; but when some values repeat, as in the code below, the problems occur. I do not know how to reconcile these packages to create a graph with the appropriate patterns and x-axis values; both consistently seem to be incorrect. Thanks!
library(swimplot)
library(ggpattern)
library(tidyverse)
df <- data.frame(
study_id = c(3, 3, 3, 3), primary_therapy = c("Si", "Si", "Si", "Si"),
additional_therapy = c("NA", "S", "NA", "V+S"), end_yr = c(0.08, 0.39, 3.03, 3.4))
df <- df %>% mutate(additional_therapy = factor(additional_therapy,
levels = c("S", "V", "V+S", "NA")))
swimmer_plot(
df = df, id = "study_id",
end = "end_yr", name_fill = "primary_therapy",
width = 0.85, color = NA) +
geom_col_pattern(aes(study_id, end_yr,
pattern = additional_therapy), color=NA,
fill = NA,
show.legend=FALSE, width=0.85,
pattern_spacing = 0.01, pattern_fill="black", pattern_color=NA) +
scale_pattern_manual(name="Additional Therapy", values = c("S"="stripe","V"="circle","V+S"="crosshatch","NA"="none"),
breaks=c("S","V","V+S"))
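One thing that may be worth checking (an assumption on my part, not something confirmed in this thread): end_yr holds cumulative end times, and a column geom that stacks its y values would add those end times together rather than the lengths of the individual segments. The base R sketch below only illustrates the arithmetic; the duration column is a hypothetical helper, not part of swimplot or ggpattern.

```r
df <- data.frame(
  additional_therapy = c("NA", "S", "NA", "V+S"),
  end_yr = c(0.08, 0.39, 3.03, 3.4)
)

# Stacking the cumulative end times overshoots the true 3.4-year follow-up
stacked_total <- sum(df$end_yr)

# Per-segment durations: each end minus the previous end, starting from 0
df$duration <- diff(c(0, df$end_yr))
duration_total <- sum(df$duration)
```

If stacking is indeed what happens in the pattern layer, mapping a per-segment length instead of end_yr might bring the pattern layer back in line with the swimmer bars.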

Lubridate: Statement returns TRUE/FALSE on its own, but incurs "missing value where TRUE/FALSE needed" error when in function

I will start by saying that I am fully aware that similar questions have been answered before, but after hours of reading and troubleshooting, I believe I have a unique issue. Apologies if I have missed something. The answer given in the much up-voted similar question points to NAs in the data, but as explained in my question, I do not seem to have any nor do I know where they may be popping up.
I am running a for-loop in R 4.1.2 using the lubridate, readr, and dplyr packages that seeks to mark as invalid data taken by individuals before they have passed a reliability test. Tests are unique to specific groups, so an individual may be reliable for one group, many, all, or none. The function I've written is meant to take a dataframe "x" and for each individual observer, check that the data point is valid against a dataframe "key" that has a column of observers (observer), test pass date (begin_valid), and the group they are now valid for (group_valid). The key may have multiple rows per observer if they have passed multiple tests. I've used tools from the Lubridate package to create POSIXct values for the dates that can be arithmetically manipulated and compared to each other. The user can set y = "remove" if they want to remove invalid data, or leave if they want to label and keep invalid data. Here is the code:
invalidata <- function(x, y){
library(lubridate)
library(readr)
library(dplyr)
x$valid <- rep(1, length(rownames(x)))
alts <- 0
key <- read_csv("updatable csv file")
key$begin_valid <- parse_date_time(key$begin_valid, c("mdy", "dmy", "ydm", "mdy"), tz= "Africa/Lubumbashi")
for(i in unique(x$observer)){
subkey <- subset(key, key$observer == i)
subx <- subset(x, x$observer == i)
if(is.na(subkey$begin_valid) == TRUE || is.na(subkey$group_valid) == TRUE){ #if reliable for nothing, remove
x[x$observer == i]$valid <- 0
print("removed completely unreliable")
}else{
for(j in rownames(subx)){
if(subx$group[j] %in% subkey$group_valid == FALSE && "All" %in% subkey$group_valid == FALSE){ #if not reliable for specific group or all groups, remove
x$valid[j] <- 0
print("removed unreliable for group")
}
if(subx$group[j] %in% subkey$group_valid){ #remove if before reliability date for group
if(subx$date[j] < subset(subkey, subkey$group_valid == subx$group[j])$begin_valid){
x$valid[j] <- 0
print("removed pre-reliability")
}
} else{ #remove if not reliable for specific group, and before reliability date for all
if(subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid){
x$valid[j] <- 0
print("removed pre-reliability")
}
}
}
}
}
if(y == "remove"){ #remove all invalid data and validity column
x <- subset(x, x$valid == 1)
x <- select(x, -valid)
}
return(x)}
My issue is with the line
if(subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid)
which returns the error:
Error in if (subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid) { :
missing value where TRUE/FALSE needed
However, when I run the code inside the parentheses
subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid
outside of the context of the loop, I receive either a TRUE or FALSE value as relevant. I've checked all dates for any NULL or NA values, as well as addressed any data with NAs in a previous step of the code:
if(is.na(subkey$begin_valid) == TRUE || is.na(subkey$group_valid) == TRUE){}
else{ #code at issue }
I am not having issues with this very similar line:
if(subx$date[j] < subset(subkey, subkey$group_valid == subx$group[j])$begin_valid){
My best guess is that something may be going wrong with the date formatting? I know that this error is usually a symptom of NULLs or NAs floating in the data, but for the life of me I cannot figure out where they could be coming from. Dates in "x" have already been parsed and contain no NAs or NULLs. I have not included the data as it is proprietary, but I can come up with mock data if people are interested/think it would be necessary. Thank you in advance for reading through and for any thoughts/troubleshooting suggestions!
MRE:
dput output for x:
structure(list(date = structure(c(1486764000, 1486764000, 1486850400,
1486936800, 1487023200, 1487109600, 1487109600, 1487196000, 1487196000,
1487368800, 1487368800, 1487368800, 1487368800, 1487368800, 1487368800,
1487455200, 1487455200, 1487455200, 1487541600, 1487887200), class = c("POSIXct",
"POSIXt"), tzone = "Africa/Lubumbashi"), time = structure(c(23734,
53419, 41352, 33034, 24220, 34812, 35624, 27949, 27950, 49192,
49286, 49392, 49401, 62719, 62725, 26046, 26047, 27246, 46611,
61228), class = c("hms", "difftime"), units = "secs"), observer = c("MA",
"LE", "VI", "VI", "MI", "MA", "MA", "ME", "VI", "BA", "MA", "BA",
"MA", "ME", "MI", "MA", "BA", "MI", "BA", "MA"), group = c("EKK",
"EKK", "KKL", "EKK", "KKL", "KKL", "KKL", "EKK", "EKK", "EKK",
"EKK", "EKK", "EKK", "KKL", "KKL", "EKK", "EKK", "KKL", "EKK",
"KKL")), row.names = c(NA, -20L), spec = structure(list(cols = list(
date = structure(list(), class = c("collector_character",
"collector")), time = structure(list(format = ""), class = c("collector_time",
"collector")), observer = structure(list(), class = c("collector_character",
"collector")), group = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x000001f6f2f7af70>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
for the key:
structure(list(observer = c("BA", "MI", "VI", "ME", "DA", "OK",
"FR", "MA", "LA", "DE", "JD", "JD", "JD", "BR", "DA", "DA", "PA",
"PA", "JA", "JE", "DI", "JP", "LE", "MR", "NG", "TR", "TE"),
begin_valid = c("8/12/2016", "12/21/2019", "8/11/2016", "8/11/2016",
"12/11/2019", "12/17/2019", "12/11/2019", "11/2/2016", "1/11/2020",
"12/12/2019", "12/16/2019", "12/16/2019", "11/22/2020", "6/19/2021",
"11/26/2020", "11/26/2020", "7/25/2021", "7/25/2021", NA,
NA, NA, NA, NA, NA, NA, NA, NA), group_valid = c("All", "All",
"All", "All", "All", "All", "FKK", "All", "FKK", "FKK", "EKK",
"KKL", "All", "EKK", "EKK", "KKL", "EKK", "KKL", NA, NA,
NA, NA, NA, NA, NA, NA, NA), subgroup = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, "S", NA, NA, NA, "S", NA, "N",
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -27L
), spec = structure(list(cols = list(observer = structure(list(), class = c("collector_character",
"collector")), begin_valid = structure(list(), class = c("collector_character",
"collector")), group_valid = structure(list(), class = c("collector_character",
"collector")), subgroup = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
There are two errors in this code:
1. Because rownames(.) returns strings, you cannot use subx$group[j]. Two options:
   - Preferred: use for (j in seq_len(nrow(subx))), and all of the references work without change.
   - Keep for(j in rownames(subx)), but change all subx$ references to be akin to subx[j,"group"].
2. x[x$observer == i]$valid is wrong code; change it to x$valid[x$observer == i].
After those two changes, your code runs without error, and in this example prints "removed pre-reliability" four times on the console.
When troubleshooting, you cannot intermingle subx$group[1] and subx$group["1"]; they are very different, and the latter (as expected) will produce NA.
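A small sketch of why the two index types behave differently, using a hypothetical two-row subx rather than the real data:

```r
subx <- data.frame(group = c("EKK", "KKL"), stringsAsFactors = FALSE)

by_position <- subx$group[1]    # integer index: the first element
by_name     <- subx$group["1"]  # character index: looks for an element *named* "1"

# rownames() yields character values, so looping over them indexes by name
idx_class <- class(rownames(subx))

# An NA condition is what makes if() fail; capture the error message instead of stopping
msg <- tryCatch(if (by_name == "EKK") "match",
                error = function(e) conditionMessage(e))

# seq_len(nrow(.)) yields integer positions, which index the column as intended
groups_seen <- character(0)
for (j in seq_len(nrow(subx))) groups_seen <- c(groups_seen, subx$group[j])
```

This also explains why the comparison looked fine when run by hand: typing subx$group[1] at the console uses an integer, while the loop was passing "1".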

Create a contingency table with 2 factors from messy data

I have the following data in messy format:
structure(list(com_level = c("B", "B", "B", "B", "A", "A"),
hf_com = c(1, 1, 1, 1, 1, 1),
sal_level = c("2", "3", "1", "2", "1", "4"),
exp_sal = c(NA, 1, 1, NA, 1, NA)),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -6L))
Column com_level is the factor with 2 levels and column hf_com gives the frequency count for that level.
Column sal_level is the factor with 4 levels and column exp_sal gives the frequency count for that level.
I want to create a contingency table similar to this:
structure(list(`1` = c(1L, 2L),
`2` = c(0L, 1L),
`3` = c(0L, 2L),
`4` = c(1L, 0L)),
row.names = c("A", "B"), class = "data.frame")
I have code that works when I want to compare two columns with the same factor:
# 1 step to create table with frequency counts for exp_sal and curr_sal per category of level
cs_es_table <- df_not_na_num %>%
dplyr::count(sal_level, exp_sal, curr_sal) %>%
tidyr::spread(key = sal_level,value = n) %>% # this code spreads on just one key
select(curr_sal, exp_sal, 1, 2, 3, 4, 5, 6, 7, -8) %>% # reorder columns and omit Column 8 (no answer)
as.data.frame()
# step 2- convert cs_es_table to long format and summarise exp_sal and curr_sal frequencies
cs_es_table <- cs_es_table %>%
gather(key, value, -curr_sal,-exp_sal) %>% # crucial step to make data long
mutate(curr_val = ifelse(curr_sal == 1,value,NA),
exp_val = ifelse(exp_sal == 1,value,NA)) %>% #mutate actually cleans up the data and assigns a value to each new column for 'exp' and 'curr'
group_by(key) %>% #for your summary, because you want to sum up your previous rows which are now assigned a key in a new column
summarise_at( .vars = vars(curr_val, exp_val), .funs = sum, na.rm = TRUE)
This code produces this table but just spreads on one key in step 1:
structure(list(curr_val = c(533L, 448L, 237L, 101L, 56L), exp_val = c(179L,
577L, 725L, 401L, 216L)), row.names = c("< 1000 EUR", "1001-1500 EUR",
"2001-3000 EUR", "3001-4000 EUR", "4001-5000 EUR"), class = "data.frame")
Will I need to use pivot_wider as in this example?
Is it possible to use spread on multiple columns in tidyr similar to dcast?
or
tidyr::spread() with multiple keys and values
Any help would be appreciated to compare the two columns with different factors.
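As a possible starting point, base R's xtabs() can cross two factors while summing a count column. The sketch below is one reading of the question only: it weights by hf_com and ignores exp_sal, because how exp_sal should enter the table is not stated, so the counts for row B differ from the illustrative table above. The names dat, tab, and wide are hypothetical.

```r
dat <- data.frame(
  com_level = c("B", "B", "B", "B", "A", "A"),
  hf_com    = c(1, 1, 1, 1, 1, 1),
  sal_level = c("2", "3", "1", "2", "1", "4"),
  stringsAsFactors = FALSE
)

# Sum hf_com within every com_level x sal_level cell
tab <- xtabs(hf_com ~ com_level + sal_level, data = dat)

# Convert the table to a data frame with com_level as row names
wide <- as.data.frame.matrix(tab)
```

Cells with no observations come out as 0, which gives the rectangular A/B by 1 to 4 layout directly, without a spread or pivot step.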

Error in ggtexttable (ggpubr)

I'm trying to create a publication-ready table using the ggtexttable function from ggpubr. I have a data frame:
dput(df)
structure(list(feature = list("start_codon", "stop_codon", "intergenic",
"3UTR", "5UTR", "exon", "intron", "ncRNA", "pseudogene"),
observed = list(structure(1L, .Names = "start_codon"), structure(1L, .Names = "stop_codon"),
structure(418L, .Names = "intergenic"), structure(48L, .Names = "3UTR"),
structure(28L, .Names = "5UTR"), structure(223L, .Names = "exon"),
structure(578L, .Names = "intron"), structure(20L, .Names = "ncRNA"),
structure(1L, .Names = "pseudogene")), expected = list(
0.286, 0.286, 369.02, 72.461, 33.165, 257.869, 631.189,
48.491, 3.172), fc = list(3.5, 3.5, 1.1, 0.7, 0.8, 0.9,
0.9, 0.4, 0.3), test = list("enrichment", "enrichment",
"enrichment", "depletion", "depletion", "depletion",
"depletion", "depletion", "depletion"), sig = list("F",
"F", "T", "T", "F", "T", "T", "T", "F"), p_val = list(
"0.249", "0.249", "0.00186", "0.00116", "0.209", "0.00814",
"0.00237", "<1e-04", "0.175")), class = "data.frame", row.names = c(NA,
-9L), .Names = c("feature", "observed", "expected", "fc", "test",
"sig", "p_val"))
And when I try to turn this into a table:
ggtexttable(df)
I get the error:
Error in (function (label, parse = FALSE, col = "black", fontsize =
12, : unused arguments (label.feature = dots[[5]][1],
label.observed = dots[[6]][1], label.expected = dots[[7]][1],
label.fc = dots[[8]][1], label.test = dots[[9]][1], label.sig_val
= dots[[10]][1], label.p_val = dots[[11]][1])
Does anyone know what might be causing this?
This works fine:
df <- head(iris)
ggtexttable(df)
I have found the problem and a solution that should work for you. First of all, your data is not in the proper format: the columns are nested lists, which is why you get this error when trying to display it. You can easily check the structure of a dataset by running this in your console (data here stands for your data frame, df in the question): str(data)
Here is the solution to convert your data to a plain data.frame:
first.step <- lapply(data, unlist)
second.step <- as.data.frame(first.step, stringsAsFactors = FALSE)
Then you can call ggtexttable(second.step), and it displays the table with your data.
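To see what the conversion does without needing ggpubr, here is a minimal base R sketch on a hypothetical two-row data frame (nested and flat are illustrative names) whose columns are lists, as in the question:

```r
# Hypothetical data frame with list-columns
nested <- data.frame(id = 1:2)
nested$feature  <- list("exon", "intron")
nested$observed <- list(223L, 578L)

# Which columns are lists before flattening?
is_list_before <- vapply(nested, is.list, logical(1))

# unlist() each column, then rebuild an ordinary data frame of atomic vectors
flat <- as.data.frame(lapply(nested, unlist), stringsAsFactors = FALSE)
is_list_after <- vapply(flat, is.list, logical(1))
```

Note that this relies on every list element having length one; a cell holding several values would make unlist() return more entries than there are rows.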
