I am busy cleaning the data in a TB drug register. I have already done it in Stata but would like to move across to R (and improve the new variables slightly).
There are 5 variables that describe each drug regimen:
drugname1, drugstartdate1, drugenddate1, drugdose1 (e.g. "500"), drugunit1 (e.g. "mg"). The numbers go up to 60, so I know I may need to make use of looping in one form or another.
The drugname1 variable is not standardised to one specific drug, so bedaquiline (bdq) can be in drugname1 or drugname2 or drugname50.
At the end of the day I would like to have variables that relate to a specific drug, for instance (bedaquiline = bdq):
onbdq bdqstartdate bdqenddate bdqdose
*There may also be patients who start on bdq, then stop, then start again, so something along the lines of
onbdq1 bdqstartdate1 bdqenddate1 bdqdose1 onbdq2 bdqstartdate2 bdqenddate2 bdqdose2 would be more accurate.
#Essentially, I have this:
mre <- as.data.frame(structure(list(drugname1 = c("Bedaquiline", "Bedaquiline", "Bedaquiline",
NA, "Amikacin"), drugstartdate1 = structure(c(18875, 18383, 18795,
NA, 16743), class = "Date"), drugenddate1 = structure(c(NA, NA,
18808, NA, NA), class = "Date"), drugdose1 = c(NA, "400", "200",
NA, NA), drugunit1 = c(NA, "mg", "mg", NA, NA), drugname2 = c("Levofloxacin",
"Levofloxacin", "Levofloxacin", "p-Aminosalicylic Acid", "Terizidone"
), drugstartdate2 = structure(c(18875, 18383, 18795, 16709, 16743
), class = "Date"), drugenddate2 = structure(c(NA, NA, 19139,
NA, NA), class = "Date"), drugdose2 = c(NA, "750", "1000", NA,
NA), drugunit2 = c(NA, "mg", "mg", "mg", NA), drugname3 = c("Linezolid",
"Linezolid", "Linezolid", "Ethionamide", "Ethambutol"), drugstartdate3 = structure(c(18875,
18383, 18795, 16709, 16743), class = "Date"), drugenddate3 = structure(c(NA,
18438, 19139, NA, NA), class = "Date"), drugdose3 = c(NA, "600",
"600", NA, NA), drugunit3 = c(NA, "mg", "mg", "mg", NA)), row.names = c(NA,
5L), class = "data.frame"))
#and I want something like this for each drug (bedaquiline = bdq) and so on:
output <- as.data.frame(structure(list(
onbdq = c(TRUE, TRUE),
bdqstartdate = structure(c(18875, 18383 ), class= "Date"),
bdqenddate = structure(c(NA, NA), class = "Date"),
bdqdose = structure(c(NA, "200mg")))))
Some starting points have been
library(dplyr)
#combining drugdose and drugunit into variable dose
EDRsub <- mutate(EDRsub, dose1= paste0(drugdose1, drugunit1))
#but I am unsure of how to go about looping, in my head this should work:
for(i in 1:3){
mre <- mutate(mre, paste0(dose,i) = paste0(drugdose,i,drugunit,i))
}
#or
for(i in 1:3){
mre <- mutate(mre, dose[[i]]) = (drugdose[[i]],drugunit[[i]])
}
but neither does. There is clearly something fundamental I am not cracking with the R syntax.
# I also had a go at reshaping with mixed results. The code runs, but it seems clunky and inefficient as well as not quite giving me what I want.
long <- mre %>%
pivot_longer(
cols = starts_with(c("drugname")),
names_to = "observation",
names_patter = "new_?(.*)",
values_to = c("drugname"),
values_drop_na = TRUE
) %>%
pivot_longer(
cols = starts_with(c("as.character(drugstartdate)")),
names_to = "observationdrugstartdate",
names_patter = "new_?(.*)",
values_to = c("drugstartdate"),
values_drop_na = TRUE
) %>%
distinct()
I have tried bits and pieces of code but generally feel as though I have thrown myself into the middle of the Atlantic. I am struggling to convert my looping and reshaping knowledge from Stata, and at the same time I feel that R has a better way of doing things where the logic of Stata (which I am used to) is not helpful.
This is some of the code Statalist came up with: https://www.statalist.org/forums/forum/general-stata-discussion/general/1680179-forvalues-%60i-1%60-and-wide-to-long-dataset
But it isn't exactly what I am after with this question.
Any advice on coding is welcome as well as suggestions for how to arrange the data in the "R" way.
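On the looping attempt above, a minimal base R sketch may help: column names can be built as strings with paste0() and then used with [[ ]], which accepts a string where mutate() expects a bare name. The small data frame below is made up for illustration and only mimics the drugdoseN/drugunitN columns of mre.

```r
# Made-up stand-in for two of the dose/unit column pairs in mre
mre <- data.frame(
  drugdose1 = c(NA, "400"),
  drugunit1 = c(NA, "mg"),
  drugdose2 = c("750", "1000"),
  drugunit2 = c("mg", "mg"),
  stringsAsFactors = FALSE
)

for (i in 1:2) {
  # Look the columns up by their constructed names
  dose <- mre[[paste0("drugdose", i)]]
  unit <- mre[[paste0("drugunit", i)]]
  # Keep NA when the dose is missing instead of producing the string "NANA"
  mre[[paste0("dose", i)]] <- ifelse(is.na(dose), NA_character_, paste0(dose, unit))
}

print(mre[, c("dose1", "dose2")])
```

The same pattern should extend to 1:60. A row with a dose but no unit would still come out as e.g. "400NA", so that case may need separate handling.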
I will start by saying that I am fully aware that similar questions have been answered before, but after hours of reading and troubleshooting, I believe I have a unique issue. Apologies if I have missed something. The answer given in the much up-voted similar question points to NAs in the data, but as explained in my question, I do not seem to have any nor do I know where they may be popping up.
I am running a for-loop in R 4.1.2 using the lubridate, readr, and dplyr packages that seeks to mark as invalid any data taken by individuals before they have passed a reliability test. Tests are unique to specific groups, so an individual may be reliable for one group, many, all, or none. The function I've written is meant to take a dataframe "x" and for each individual observer, check that the data point is valid against a dataframe "key" that has a column of observers (observer), test pass date (begin_valid), and the group they are now valid for (group_valid). The key may have multiple rows per observer if they have passed multiple tests. I've used tools from the lubridate package to create POSIXct values for the dates that can be arithmetically manipulated and compared to each other. The user can set y = "remove" if they want to remove invalid data, or leave if they want to label and keep invalid data. Here is the code:
invalidata <- function(x, y){
library(lubridate)
library(readr)
library(dplyr)
x$valid <- rep(1, length(rownames(x)))
alts <- 0
key <- read_csv("updatable csv file")
key$begin_valid <- parse_date_time(key$begin_valid, c("mdy", "dmy", "ydm", "mdy"), tz= "Africa/Lubumbashi")
for(i in unique(x$observer)){
subkey <- subset(key, key$observer == i)
subx <- subset(x, x$observer == i)
if(is.na(subkey$begin_valid) == TRUE || is.na(subkey$group_valid) == TRUE){ #if reliable for nothing, remove
x[x$observer == i]$valid <- 0
print("removed completely unreliable")
}else{
for(j in rownames(subx)){
if(subx$group[j] %in% subkey$group_valid == FALSE && "All" %in% subkey$group_valid == FALSE){ #if not reliable for specific group or all groups, remove
x$valid[j] <- 0
print("removed unreliable for group")
}
if(subx$group[j] %in% subkey$group_valid){ #remove if before reliability date for group
if(subx$date[j] < subset(subkey, subkey$group_valid == subx$group[j])$begin_valid){
x$valid[j] <- 0
print("removed pre-reliability")
}
} else{ #remove if not reliable for specific group, and before reliability date for all
if(subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid){
x$valid[j] <- 0
print("removed pre-reliability")
}
}
}
}
}
if(y == "remove"){ #remove all invalid data and validity column
x <- subset(x, x$valid == 1)
x <- select(x, -valid)
}
return(x)}
My issue is with the line
if(subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid)
which returns the error:
Error in if (subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid) { :
missing value where TRUE/FALSE needed
However, when I run the code inside the parentheses
subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid
outside of the context of the loop, I receive either a TRUE or FALSE value as relevant. I've checked all dates for any NULL or NA values, as well as addressed any data with NAs in a previous step of the code:
if(is.na(subkey$begin_valid) == TRUE || is.na(subkey$group_valid) == TRUE){}
else{ #code at issue }
I am not having issues with this very similar line:
if(subx$date[j] < subset(subkey, subkey$group_valid == subx$group[j])$begin_valid){
My best guess is that something may be going wrong with the date formatting? I know that this error is usually a symptom of NULLs or NAs floating in the data, but for the life of me I cannot figure out where they could be coming from. Dates in "x" have already been parsed and contain no NAs or NULLs. I have not included the data as it is proprietary, but I can come up with mock data if people are interested/think it would be necessary. Thank you in advance for reading through and for any thoughts/troubleshooting suggestions!
MRE:
dput output for x:
structure(list(date = structure(c(1486764000, 1486764000, 1486850400,
1486936800, 1487023200, 1487109600, 1487109600, 1487196000, 1487196000,
1487368800, 1487368800, 1487368800, 1487368800, 1487368800, 1487368800,
1487455200, 1487455200, 1487455200, 1487541600, 1487887200), class = c("POSIXct",
"POSIXt"), tzone = "Africa/Lubumbashi"), time = structure(c(23734,
53419, 41352, 33034, 24220, 34812, 35624, 27949, 27950, 49192,
49286, 49392, 49401, 62719, 62725, 26046, 26047, 27246, 46611,
61228), class = c("hms", "difftime"), units = "secs"), observer = c("MA",
"LE", "VI", "VI", "MI", "MA", "MA", "ME", "VI", "BA", "MA", "BA",
"MA", "ME", "MI", "MA", "BA", "MI", "BA", "MA"), group = c("EKK",
"EKK", "KKL", "EKK", "KKL", "KKL", "KKL", "EKK", "EKK", "EKK",
"EKK", "EKK", "EKK", "KKL", "KKL", "EKK", "EKK", "KKL", "EKK",
"KKL")), row.names = c(NA, -20L), spec = structure(list(cols = list(
date = structure(list(), class = c("collector_character",
"collector")), time = structure(list(format = ""), class = c("collector_time",
"collector")), observer = structure(list(), class = c("collector_character",
"collector")), group = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x000001f6f2f7af70>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
for the key:
structure(list(observer = c("BA", "MI", "VI", "ME", "DA", "OK",
"FR", "MA", "LA", "DE", "JD", "JD", "JD", "BR", "DA", "DA", "PA",
"PA", "JA", "JE", "DI", "JP", "LE", "MR", "NG", "TR", "TE"),
begin_valid = c("8/12/2016", "12/21/2019", "8/11/2016", "8/11/2016",
"12/11/2019", "12/17/2019", "12/11/2019", "11/2/2016", "1/11/2020",
"12/12/2019", "12/16/2019", "12/16/2019", "11/22/2020", "6/19/2021",
"11/26/2020", "11/26/2020", "7/25/2021", "7/25/2021", NA,
NA, NA, NA, NA, NA, NA, NA, NA), group_valid = c("All", "All",
"All", "All", "All", "All", "FKK", "All", "FKK", "FKK", "EKK",
"KKL", "All", "EKK", "EKK", "KKL", "EKK", "KKL", NA, NA,
NA, NA, NA, NA, NA, NA, NA), subgroup = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, "S", NA, NA, NA, "S", NA, "N",
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -27L
), spec = structure(list(cols = list(observer = structure(list(), class = c("collector_character",
"collector")), begin_valid = structure(list(), class = c("collector_character",
"collector")), group_valid = structure(list(), class = c("collector_character",
"collector")), subgroup = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
Two errors in this code:
1. Because rownames(.) returns strings, you cannot use subx$group[j]. Two options:
   a. Preferred. Use for (j in seq_len(nrow(subx))), and all of the references work without change.
   b. Keep for(j in rownames(subx)), but change all subx$ references to be akin to subx[j,"group"].
2. x[x$observer == i]$valid is wrong code, change to x$valid[x$observer == i].
After those two changes, your code runs without error, and in this example prints "removed pre-reliability" four times on the console.
When troubleshooting, you cannot intermingle subx$group[1] and subx$group["1"], they are very different, and the latter (as expected) will produce NA.
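A small illustration of that difference, using a made-up three-row data frame rather than the real subx:

```r
# Toy stand-in for subx; default row names are "1", "2", "3"
subx <- data.frame(group = c("EKK", "KKL", "EKK"), stringsAsFactors = FALSE)

by_position <- subx$group[1]      # integer index: first element
by_name     <- subx$group["1"]    # character index on an unnamed vector: NA
by_rowname  <- subx["1", "group"] # character index on the data frame: row named "1"

# An NA condition inside if() is what raises
# "missing value where TRUE/FALSE needed"
result <- try(if (by_name == "EKK") "match", silent = TRUE)

print(by_position)
print(by_name)
print(by_rowname)
print(inherits(result, "try-error"))
```

This is why the same comparison can look fine when typed at the console with a number and still fail inside a loop that iterates over rownames().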
In the interest of learning better coding practices, can anyone show me a more efficient way of solving my problem? Maybe one that doesn't require new columns...
Problem: I have two data frames: one is my main data table (t) and the other contains changes I need to replace in the main table (Manual_changes). Example: Sometimes the CaseID is matched with the wrong EmployeeID in the file.
I can't provide the main data table, but the Manual_changes file looks like this:
Manual_changes = structure(list(`Case ID` = c(46605, 25321, 61790, 43047, 12157,
16173, 94764, 38700, 41798, 56198, 79467, 61907, 89057, 34232,
100189), `Employee ID` = c(NA, NA, NA, NA, NA, NA, NA, NA, 906572,
164978, 145724, 874472, 654830, 846333, 256403), `Age in Days` = c(3,
3, 3, 12, 0, 0, 5, 0, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
temp = merge(t, Manual_changes, by = "Case ID", all.x = TRUE)
temp$`Employee ID.y` = ifelse(is.na(temp$`Employee ID.y`), temp$`Employee ID.x`, temp$`Employee ID.y`)
temp$`Age in Days.y`= ifelse(is.na(temp$`Age in Days.y`), temp$`Age in Days.x`, temp$`Age in Days.y`)
temp$`Age in Days.x` = NULL
temp$`Employee ID.x` = NULL
colnames(temp) = colnames(t)
t = temp
We could use coalesce
library(dplyr)
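# coalesce() returns the first non-NA value, so the Manual_changes columns (.y) go first
left_join(t, Manual_changes, by = "Case ID") %>%
   mutate(`Employee ID` = coalesce(`Employee ID.y`, `Employee ID.x`),
         `Age in Days` = coalesce(`Age in Days.y`, `Age in Days.x`)) %>%
   select(-ends_with(".x"), -ends_with(".y"))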
Or with data.table
library(data.table)
# in an update join, columns coming from Manual_changes are referenced with the i. prefix
setDT(t)[Manual_changes,
   c('Employee ID', 'Age in Days') :=
     .(fcoalesce(`i.Employee ID`, `Employee ID`),
       fcoalesce(`i.Age in Days`, `Age in Days`)),
   on = "Case ID"]
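For comparison, here is a minimal base R sketch of the same idea that creates no extra columns, using match() to line up the two tables. The names main and changes and their contents are made up for illustration; they stand in for t and Manual_changes.

```r
# Made-up stand-ins for the main table and the table of manual changes
main <- data.frame(
  `Case ID` = c(1, 2, 3),
  `Employee ID` = c(11, 22, 33),
  `Age in Days` = c(5, 6, 7),
  check.names = FALSE
)
changes <- data.frame(
  `Case ID` = c(2, 3),
  `Employee ID` = c(NA, 99),
  `Age in Days` = c(0, NA),
  check.names = FALSE
)

# Row of `changes` that corresponds to each row of `main` (NA when there is none)
idx <- match(main$`Case ID`, changes$`Case ID`)

for (col in c("Employee ID", "Age in Days")) {
  new <- changes[[col]][idx]
  # Take the manual value where one exists, otherwise keep the original
  main[[col]] <- ifelse(is.na(new), main[[col]], new)
}

print(main)
```

This assumes each Case ID appears at most once in the changes table; with duplicates, match() would silently use the first occurrence.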
Good evening guys, I have 6 million rows of data, and they have four types.
z=structure(list(date = structure(c(11866, 16190, 14729, 11718), class = "Date"),
beg1 = structure(c(12264, 12264, 13970, 12264), class = "Date"),
end1 = structure(c(17621, 14760, 14760, 13298), class = "Date"),
ID1 = c(1003587, 1000396, 1010743, 1002113), beg2 = structure(c(NA,
14790, 14790, 13299), class = "Date"), end2 = structure(c(NA,
17621, 15217, 13969), class = "Date"), ID2 = c(NA, 1024488,
1027877, 1002824), beg3 = structure(c(NA, NA, 15218, 13970
), class = "Date"), end3 = structure(c(NA, NA, 17621, 14760
), class = "Date"), ID3 = c(NA, NA, 1031361, 1002113), beg4 = structure(c(NA,
NA, NA, 14790), class = "Date"), end4 = structure(c(NA, NA,
NA, 17621), class = "Date"), ID4 = c(NA, NA, NA, 1021290),
realID = c(NA, NA, NA, NA)), row.names = c(267365L, 193587L,
5294385L, 2039421L), class = "data.frame")
and I tried to assign a suitable ID to each row based on which date range its date falls in (using a loop).
for(i in 1:nrow(z)){tryCatch({print(i)
if(between(z$date[i],z$beg1[i],z$end1[i])==T){z$realID[i]=z$ID1[i]}
if(between(z$date[i],z$beg2[i],z$end2[i])==T){z$realID[i]=z$ID2[i]}
if(between(z$date[i],z$beg3[i],z$end3[i])==T){z$realID[i]=z$ID3[i]}
if(between(z$date[i],z$beg4[i],z$end4[i])==T){z$realID[i]=z$ID4[i]}},error=function(e){})}
The code works.
But now the problem is that I have too much data and the loop is inefficient; it may take almost a day to run.
Does anyone know how I can improve or replace the code?
Thank you so much.
Since R is a vectorized language, the best way to speed up this code is to operate on entire vectors as opposed to looping through each element.
A simple solution is to use a series of ifelse statements. The comparisons use >= and <= so that the range endpoints are included, as they are with between() in the loop.
z$realID <- ifelse(!is.na(z$beg1) & z$date>= z$beg1 & z$date<= z$end1, z$ID1, z$realID)
z$realID <- ifelse(!is.na(z$beg2) & z$date>= z$beg2 & z$date<= z$end2, z$ID2, z$realID)
z$realID <- ifelse(!is.na(z$beg3) & z$date>= z$beg3 & z$date<= z$end3, z$ID3, z$realID)
z$realID <- ifelse(!is.na(z$beg4) & z$date>= z$beg4 & z$date<= z$end4, z$ID4, z$realID)
When a condition evaluates to TRUE, realID is updated; otherwise it retains its prior value.
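The repeated statements can also be folded into a short loop over the range number, where each pass is still fully vectorized over rows. This is only a sketch on a made-up three-row table with two ranges; the column naming (begN, endN, IDN) follows the question.

```r
# Made-up data: row 1 falls in range 1, row 2 in range 2, row 3 in neither
z <- data.frame(
  date = as.Date(c("2010-06-01", "2012-03-15", "2009-01-01")),
  beg1 = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01")),
  end1 = as.Date(c("2010-12-31", "2010-12-31", "2010-12-31")),
  ID1  = c(101, 102, 103),
  beg2 = as.Date(c(NA, "2011-01-01", "2011-01-01")),
  end2 = as.Date(c(NA, "2012-12-31", "2012-12-31")),
  ID2  = c(NA, 202, 203),
  realID = NA_real_
)

for (k in 1:2) {
  beg <- z[[paste0("beg", k)]]
  end <- z[[paste0("end", k)]]
  # which() drops rows where the comparison is NA (missing range)
  hit <- which(z$date >= beg & z$date <= end)
  z$realID[hit] <- z[[paste0("ID", k)]][hit]
}

print(z$realID)
```

As in the original loop, a later range overwrites an earlier one if a date happens to fall in both.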
I have the following data frame, each row containing four dates ("y") and four measurements ("x"):
df = structure(list(x1 = c(69.772808673525, NA, 53.13125414839,
17.3033274666411,
NA, 38.6120670385487, 57.7229000792707, 40.7654208618078, 38.9010405201831,
65.7108936694177), y1 = c(0.765671296296296, NA, 1.37539351851852,
0.550277777777778, NA, 0.83037037037037, 0.0254398148148148,
0.380671296296296, 1.368125, 2.5250462962963), x2 = c(81.3285388496182,
NA, NA, 44.369872853302, NA, 61.0746827226573, 66.3965114460601,
41.4256874481852, 49.5461413070349, 47.0936997726146), y2 =
c(6.58287037037037,
NA, NA, 9.09377314814815, NA, 7.00127314814815, 6.46597222222222,
6.2462962962963, 6.76976851851852, 8.12449074074074), x3 = c(NA,
60.4976916064608, NA, 45.3575294731303, 45.159758146854, 71.8459173097114,
NA, 37.9485456227131, 44.6307631013742, 52.4523342186143), y3 = c(NA,
12.0026157407407, NA, 13.5601157407407, 16.1213657407407, 15.6431018518519,
NA, 15.8986805555556, 13.1395138888889, 17.9432638888889), x4 = c(NA,
NA, NA, 57.3383407228293, NA, 59.3921356160536, 67.4231673171527,
31.853845252547, NA, NA), y4 = c(NA, NA, NA, 18.258125, NA,
19.6074768518519,
20.9696527777778, 23.7176851851852, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I would like to create an additional column containing the slope of all the y's versus all the x's, for each row (each row is a patient with these 4 measurements).
Here is what I have so far:
df <- df %>% mutate(Slope = lm(vars(starts_with("y") ~
vars(starts_with("x"), data = .)
I am getting an error:
invalid type (list) for variable 'vars(starts_with("y"))'...
What am I doing wrong, and how can I calculate the rowwise slope?
You are using a tidyverse syntax but your data is not tidy...
Maybe you should rearrange your data.frame and rethink the way you store your data.
Here is how to do it in a quick and dirty way (at least if I understood your explanations correctly):
df <- merge(reshape(df[,(1:4)*2-1], dir="long", varying = list(1:4), v.names = "x", idvar = "patient"),
reshape(df[,(1:4)*2], dir="long", varying = list(1:4), v.names = "y", idvar = "patient"))
df$patient <- factor(df$patient)
Then you could loop over the patients, perform a linear regression and get the slopes as a vector:
sapply(levels(df$patient), function(pat) {
coef(lm(y~x,df[df$patient==pat,],na.action = "na.omit"))[2]
})
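If reshaping is not wanted, another option is to compute the slope per row directly: for a simple regression of y on x, the slope equals cov(x, y) / var(x). The sketch below uses a small made-up data frame with three x/y pairs per row, and the helper row_slope is a hypothetical name introduced here.

```r
# Made-up data with the same x1, y1, x2, y2, ... layout as the question
df <- data.frame(
  x1 = c(1, 1, NA), y1 = c(2, 5, 1),
  x2 = c(2, NA, NA), y2 = c(4, 1, 2),
  x3 = c(3, 3, 4),  y3 = c(6, 1, 3)
)

# Matrices of the x and y columns, in matching order
xs <- as.matrix(df[, grep("^x", names(df))])
ys <- as.matrix(df[, grep("^y", names(df))])

# Slope of y on x for one row, using only complete (x, y) pairs
row_slope <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)
  if (sum(ok) < 2) return(NA_real_)  # a slope needs at least two points
  cov(x[ok], y[ok]) / var(x[ok])
}

df$Slope <- vapply(seq_len(nrow(df)),
                   function(i) row_slope(xs[i, ], ys[i, ]),
                   numeric(1))

print(df$Slope)
```

Rows with fewer than two complete pairs get NA. Rows where all x values are identical have zero variance and would need separate handling.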
Please don't mark this as duplicate as I will take this question down once I find out what's wrong. I have used levels() with a very high degree of success and today it refuses to work come what may. Here's what I'm trying to achieve. I have two data frames with an identical column. I am using the simple merge() function as follows:
mergedData<-merge(df1, df2, by='Index')
Now, I want to reorder the 'Index' column i.e. reorder the rows in the 'mergedData' file to match the order in either of the original dataframes. This is the command I am using to achieve the reordering:
mergedData$Index<-factor(mergedData$Index,
levels=c("ND","TC","PR","W","MI"))
When I test the levels after running the above command, it shows the desired order; however, when I export the table it retains the original order. I am extremely confused as to why this isn't working. I have other scripts wherein I've used this approach of setting the desired order and it is working perfectly fine except in this instance.
Any help/suggestions/advice would be greatly appreciated.
I have attached data from the two dataframes for you all to play around with:
df1
structure(list(Index = structure(1:5, .Label = c("ND", "TC",
"PR", "W", "MI"), class = "factor"), `CP` = c(0.7102,
0.059, -0.0469, 1.0137, 0.6116), FA1 = c(0.5218, 0.0249, -0.0532,
0.9561, 1.1676), FA2 = c(0.5625, 0.0397, -0.0712, 0.9636, 0.9569
), FA3 = c(0.5934, 0.0332, -0.0442, 0.9873, 0.8929)), .Names = c("Index",
"CP", "FA1", "FA2", "FA3"), row.names = c(NA, 5L), class = "data.frame")
df2
structure(list(Index = structure(1:5, .Label = c("ND", "TC",
"PR", "W", "MI"), class = "factor"), `CP SD` = c(0.0241,
0.0184, 0.0021, 0.0114, 0.0947), `FA1 SD` = c(0.1891, 0.0171,
0.0104, 0.0559, 0.5321), `FA2 SD` = c(0.1273, 0.0243, 0.0173,
0.0565, 0.3292), `FA3 SD` = c(0.0518, 0.0094, 0.0078, 0.0195,
0.1581)), .Names = c("Index", "CP SD", "FA1 SD", "FA2 SD",
"FA3 SD"), row.names = c(NA, 5L), class = "data.frame")
Thanks
levels only controls the order of the factor levels (how it will be displayed by levels(x), in table, etc.), not the order of the rows in a data.frame. To order a data frame, use this:
mergedData <- mergedData[order(mergedData$Index),]
Or with dplyr:
library(dplyr)
mergedData <- arrange(mergedData,Index)
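A small base R illustration of the two steps together, on a made-up table rather than the real mergedData: setting the levels defines the order, and order() then applies it to the rows.

```r
# Made-up merged table, in the alphabetical order that merge() tends to produce
merged <- data.frame(
  Index = c("MI", "ND", "PR", "TC", "W"),
  value = 1:5,
  stringsAsFactors = FALSE
)

# Step 1: define the desired order through the factor levels (rows do not move yet)
merged$Index <- factor(merged$Index, levels = c("ND", "TC", "PR", "W", "MI"))

# Step 2: sort the rows; order() on a factor follows the level order
merged <- merged[order(merged$Index), ]

print(merged)
```

After the second step the rows come out as ND, TC, PR, W, MI, which is also the order an export would see.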