Linear regression model with dplyr on specified columns by name - r

I have the following data frame, each row containing four dates ("y") and four measurements ("x"):
df = structure(list(x1 = c(69.772808673525, NA, 53.13125414839,
17.3033274666411,
NA, 38.6120670385487, 57.7229000792707, 40.7654208618078, 38.9010405201831,
65.7108936694177), y1 = c(0.765671296296296, NA, 1.37539351851852,
0.550277777777778, NA, 0.83037037037037, 0.0254398148148148,
0.380671296296296, 1.368125, 2.5250462962963), x2 = c(81.3285388496182,
NA, NA, 44.369872853302, NA, 61.0746827226573, 66.3965114460601,
41.4256874481852, 49.5461413070349, 47.0936997726146), y2 =
c(6.58287037037037,
NA, NA, 9.09377314814815, NA, 7.00127314814815, 6.46597222222222,
6.2462962962963, 6.76976851851852, 8.12449074074074), x3 = c(NA,
60.4976916064608, NA, 45.3575294731303, 45.159758146854, 71.8459173097114,
NA, 37.9485456227131, 44.6307631013742, 52.4523342186143), y3 = c(NA,
12.0026157407407, NA, 13.5601157407407, 16.1213657407407, 15.6431018518519,
NA, 15.8986805555556, 13.1395138888889, 17.9432638888889), x4 = c(NA,
NA, NA, 57.3383407228293, NA, 59.3921356160536, 67.4231673171527,
31.853845252547, NA, NA), y4 = c(NA, NA, NA, 18.258125, NA,
19.6074768518519,
20.9696527777778, 23.7176851851852, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I would like to create an additional column containing the slope of all the y's versus all the x's, for each row (each row is a patient with these 4 measurements).
Here is what I have so far:
df <- df %>% mutate(Slope = lm(vars(starts_with("y") ~
vars(starts_with("x"), data = .)
I am getting an error:
invalid type (list) for variable 'vars(starts_with("y"))'...
What am I doing wrong, and how can I calculate the rowwise slope?

You are using tidyverse syntax, but your data is not tidy.
Consider rearranging your data.frame into long format (one row per patient and time point) and rethinking the way you store your data.
Here is a quick and dirty way to do it (at least if I understood your explanation correctly):
df <- merge(reshape(df[,(1:4)*2-1], dir="long", varying = list(1:4), v.names = "x", idvar = "patient"),
reshape(df[,(1:4)*2], dir="long", varying = list(1:4), v.names = "y", idvar = "patient"))
df$patient <- factor(df$patient)
Then you could loop over the patients, perform a linear regression and get the slopes as a vector:
sapply(levels(df$patient), function(pat) {
coef(lm(y~x,df[df$patient==pat,],na.action = "na.omit"))[2]
})
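As for what went wrong in the original attempt: vars() returns a list of quosures rather than data vectors, so lm() cannot use it inside a formula, and mutate() works on whole columns rather than on one row at a time. If reshaping is not wanted, the slopes can also be computed directly from the wide layout. The following is a minimal base R sketch, assuming the x and y columns pair up by their numeric suffix; the toy data frame and the helper row_slope are made up for illustration.

```r
# Toy data (hypothetical): three patients, three x/y pairs each.
toy <- data.frame(
  x1 = c(1, 2, NA), x2 = c(2, 4, 5), x3 = c(3, 6, NA),
  y1 = c(2, 1, NA), y2 = c(4, 2, 7), y3 = c(6, 3, NA)
)

# Select the columns by name prefix.
x_cols <- grep("^x", names(toy), value = TRUE)
y_cols <- grep("^y", names(toy), value = TRUE)

# Slope of y ~ x for row i, using only complete x/y pairs.
# Returns NA when fewer than two pairs are available.
row_slope <- function(i) {
  x <- unlist(toy[i, x_cols], use.names = FALSE)
  y <- unlist(toy[i, y_cols], use.names = FALSE)
  ok <- !is.na(x) & !is.na(y)
  if (sum(ok) < 2) return(NA_real_)
  unname(coef(lm(y[ok] ~ x[ok]))[2])
}

toy$Slope <- vapply(seq_len(nrow(toy)), row_slope, numeric(1))
toy$Slope
```

Rows with a single usable measurement get NA instead of a slope, which mirrors what lm() reports when it cannot estimate the coefficient.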

Related

Joining 'n' number of lists and perform a function in R

I have a data frame that contains many triplicates (sets of 3 columns), and I have grouped the data frame so that each triplicate is a separate element of a list.
The example dataset is,
example_data <- structure(list(`1_3ng` = c(69648445400, 73518145600, NA, NA,
73529102400, 75481088000, NA, 73545910600, 74473949200, 77396199900
), `2_3ng` = c(71187990600, 70677690400, NA, 73675407400, 73215342700,
NA, NA, 69996254800, 69795686400, 76951318300), `3_3ng` = c(65032022000,
71248214000, NA, 72393058300, 72025550900, 71041067000, 73604692000,
NA, 73324202000, 75969608700), `4_7-5ng` = c(NA, 65845061600,
75009245100, 64021237700, 66960666600, 69055643600, NA, 64899540900,
NA, NA), `5_7-5ng` = c(65097201700, NA, NA, 69032126500, NA,
70189899800, NA, 74143529100, 69299087400, NA), `6_7-5ng` = c(71964413900,
69048485800, NA, 71281569700, 71167596500, NA, NA, 68389822800,
69322289200, NA), `7_10ng` = c(71420403700, 67552276500, 72888076300,
66491357100, NA, 68165019600, 70876631000, NA, 69174190100, 63782945300
), `8_10ng` = c(NA, 71179401200, 68959365100, 70570182700, 73032738800,
NA, 74807496700, NA, 71812102100, 73855098500), `9_10ng` = c(NA,
70403756100, NA, 70277421000, 69887731700, 69818871800, NA, 71353886700,
NA, 74115466700), `10_15ng` = c(NA, NA, 68487581700, NA, NA,
69056997400, NA, 67780479400, 66804467800, 72291939500), `11_15ng` = c(NA,
63599643700, NA, NA, 60752029700, NA, NA, 63403655600, NA, 64548492900
), `12_15ng` = c(NA, 67344750600, 61610182700, 67414425600, 65946654700,
66166118400, NA, 70830837700, 67288305700, 69911451300)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
After grouping I get four list elements, since the example dataset above contains 4 groups. I used the following R code to group the data:
grouping_data <- function(df) {  # df = data frame
  df_col <- ncol(df)  # number of columns in the data frame
  groups <- sort(rep(0:((df_col / 3) - 1), 3))  # group index for each column
  id <- list()  # empty list for the column positions
  for (i in 1:length(unique(groups))) {
    id[[i]] <- which(groups == unique(groups)[i])  # column positions per group
  }
  names(id) <- paste0("id", unique(groups))  # group-based names for the list "id"
  data <- list()  # empty list for the grouped columns
  for (i in 1:length(id)) {
    data[[i]] <- df[, id[[i]]]  # data frame columns for each group
  }
  names(data) <- paste0("data", unique(groups))  # group-based names for the list "data"
  return(data)
}
group_data <-grouping_data(example_data)
Please suggest R code that applies a function to all the list elements at once.
For example, I applied the function below in the following way:
#VSN Normalization
vsnNorm <- function(dat) {
dat<-as.data.frame(dat)
vsnNormed <- suppressMessages(vsn::justvsn(as.matrix(dat)))
colnames(vsnNormed) <- colnames(dat)
row.names(vsnNormed) <- rownames(dat)
return(as.matrix(vsnNormed))
}
And I called it like this:
vsn.dat0 <- vsnNorm(group_data$data0)
vsn.dat1 <- vsnNorm(group_data$data1)
vsn.dat2 <- vsnNorm(group_data$data2)
vsn.dat3 <- vsnNorm(group_data$data3)
vsn.dat <- cbind (vsn.dat0,vsn.dat1,vsn.dat2,vsn.dat3)
It works well.
But the number of triplicates (sets of 3 columns) may change from dataset to dataset, and calling every list element by hand becomes tedious.
So kindly share some code that applies a function to all the resulting list elements and combines the results into a single object.
Thank you in advance.
The shortcut you are looking for is:
vsn.dat <- do.call("cbind", lapply(group_data, vsnNorm))
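The same lapply plus do.call pattern works for any number of groups and any group size. Below is a small self-contained sketch of it; the toy data frame, the group_size variable and the stand-in function center_cols (a plain column centering used here instead of the vsn normalization, which needs the Bioconductor package) are all assumptions made for illustration.

```r
# Toy data (hypothetical): two groups of three replicate columns.
toy <- data.frame(a1 = 1:3, a2 = 4:6, a3 = 7:9,
                  b1 = 10:12, b2 = 13:15, b3 = 16:18)

group_size <- 3  # columns per group; change this for other layouts

# Group index for every column: 0, 0, 0, 1, 1, 1, ...
group_id <- (seq_along(toy) - 1) %/% group_size

# split.default() treats the data frame as a list of columns,
# so each element of `groups` is a data frame with one group's columns.
groups <- split.default(toy, group_id)

# Stand-in for the per-group function (here: centre every column).
center_cols <- function(dat) scale(as.matrix(dat), center = TRUE, scale = FALSE)

# Apply the function to every group and bind the results column-wise.
result <- do.call(cbind, lapply(groups, center_cols))
result
```

Because the grouping is derived from group_size, nothing in the code has to be edited when a dataset has more or fewer groups.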

how to separate the mixed models, and fit separate linear models

I am trying to separate a mixture of two linear models and then fit separate linear models, model_steeper and model_flatter, to the two parts. First, I create training samples with Input >= 5 and separate the points:
nSample<-length(data$Input)
Train.Sample<-data.frame(trainInput=data$Input,trainOutput=rep(NA,nSample))
Train.Sample.Steeper<-data.frame(trainSteepInput=data$Input,trainSteepOutput=rep(NA,nSample))
Train.Sample.Flatter<-data.frame(trainFlatInput=data$Input,trainFlatOutput=rep(NA,nSample))
head(cbind(data,Train.Sample,Train.Sample.Steeper,Train.Sample.Flatter))
and the result is:
dput(head(cbind(data,Train.Sample,Train.Sample.Steeper,Train.Sample.Flatter)))
structure(list(Output = c(0.430030802963404, -0.387872242279496,
-0.773463398992163, 3.47962503801818, -1.18311295613965, -0.534018180113726
), Input = c(-0.707348558586091, -0.596670078579336, -1.55126970726997,
2.00976222474128, -1.69353070948273, -0.437843651510775), trainInput = c(-0.707348558586091,
-0.596670078579336, -1.55126970726997, 2.00976222474128, -1.69353070948273,
-0.437843651510775), trainOutput = c(NA, NA, NA, NA, NA, NA),
trainSteepInput = c(-0.707348558586091, -0.596670078579336,
-1.55126970726997, 2.00976222474128, -1.69353070948273, -0.437843651510775
), trainSteepOutput = c(NA, NA, NA, NA, NA, NA), trainFlatInput = c(-0.707348558586091,
-0.596670078579336, -1.55126970726997, 2.00976222474128,
-1.69353070948273, -0.437843651510775), trainFlatOutput = c(NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
Then, I tried:
Train.Sample.Steep.lm <- lm(trainSteepOutput ~ trainSteepInput, Train.Sample.Steeper)
But the error is:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases
I do not know what to do next. Does anyone know how to fix this?
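The error message itself can be explained from the code shown: trainSteepOutput is filled entirely with NA, lm() drops every row with a missing response, and nothing is left to fit. The sketch below reproduces that on simulated data and then fills in the outputs for the rows assigned to the steeper model before fitting. The simulated data and the logical vector steep, which marks those rows, are hypothetical; how to decide which points belong to which model is not specified in the question.

```r
set.seed(1)

# Simulated mixture (hypothetical): half the points follow a steep line,
# half a flat one.
input <- rnorm(200)
steep <- rep(c(TRUE, FALSE), each = 100)  # assumed known here
output <- ifelse(steep, 2 * input, 0.5 * input) + rnorm(200, sd = 0.1)
dat <- data.frame(Input = input, Output = output)

# Same construction as in the question: the output column is all NA.
train_steep <- data.frame(trainSteepInput = dat$Input,
                          trainSteepOutput = rep(NA_real_, nrow(dat)))

# Fitting now fails, because no row has a non-missing response.
all_na_fit <- try(lm(trainSteepOutput ~ trainSteepInput, train_steep),
                  silent = TRUE)

# Copy the observed outputs for the rows assigned to the steeper model.
train_steep$trainSteepOutput[steep] <- dat$Output[steep]

# lm() drops the remaining NA rows and fits on the steep points only.
steep_fit <- lm(trainSteepOutput ~ trainSteepInput, train_steep)
coef(steep_fit)
```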

Conditionally replace cells in data frame based on another data frame

In the interest of learning better coding practices, can anyone show me a more efficient way of solving my problem? Maybe one that doesn't require new columns...
Problem: I have two data frames: one is my main data table (t) and the other contains changes I need to replace in the main table (Manual_changes). Example: Sometimes the CaseID is matched with the wrong EmployeeID in the file.
I can't provide the main data table, but the Manual_changes file looks like this:
Manual_changes = structure(list(`Case ID` = c(46605, 25321, 61790, 43047, 12157,
16173, 94764, 38700, 41798, 56198, 79467, 61907, 89057, 34232,
100189), `Employee ID` = c(NA, NA, NA, NA, NA, NA, NA, NA, 906572,
164978, 145724, 874472, 654830, 846333, 256403), `Age in Days` = c(3,
3, 3, 12, 0, 0, 5, 0, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
temp = merge(t, Manual_changes, by = "Case ID", all.x = TRUE)
temp$`Employee ID.y` = ifelse(is.na(temp$`Employee ID.y`), temp$`Employee ID.x`, temp$`Employee ID.y`)
temp$`Age in Days.y`= ifelse(is.na(temp$`Age in Days.y`), temp$`Age in Days.x`, temp$`Age in Days.y`)
temp$`Age in Days.x` = NULL
temp$`Employee ID.x` = NULL
colnames(temp) = colnames(t)
t = temp
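We could use coalesce, putting the Manual_changes columns (suffix .y) first so that they take precedence wherever they are not NA:
library(dplyr)
left_join(t, Manual_changes, by = "Case ID") %>%
mutate(`Employee ID` = coalesce(`Employee ID.y`, `Employee ID.x`),
`Age in Days` = coalesce(`Age in Days.y`, `Age in Days.x`)) %>%
select(-ends_with(".x"), -ends_with(".y"))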
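Or with data.table, using an update join. Inside the join, the columns of Manual_changes are available with the i. prefix (this assumes the columns have the same type in both tables):
library(data.table)
setDT(t)[as.data.table(Manual_changes),
c('Employee ID', 'Age in Days') :=
.(fcoalesce(`i.Employee ID`, `Employee ID`),
fcoalesce(`i.Age in Days`, `Age in Days`)),
on = .(`Case ID`)]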
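If no helper columns and no extra packages are wanted, the same update can be written in base R with match(). This is only a sketch on made-up tables; the names main and changes and their columns are hypothetical stand-ins for t and Manual_changes.

```r
# Hypothetical stand-ins for the main table and the table of corrections.
main <- data.frame(case_id = c(1, 2, 3),
                   employee_id = c(10, 20, 30),
                   age_days = c(5, 6, 7))
changes <- data.frame(case_id = c(2, 3),
                      employee_id = c(NA, 99),
                      age_days = c(0, NA))

# Row of `main` that each correction refers to (NA if the case is absent).
idx <- match(changes$case_id, main$case_id)

# Overwrite a cell only where the correction exists and is not NA.
for (col in c("employee_id", "age_days")) {
  fill <- !is.na(idx) & !is.na(changes[[col]])
  main[[col]][idx[fill]] <- changes[[col]][fill]
}
main
```

The main table is modified in place, so its column names and column order stay exactly as they were.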

r Replace multiple strings in a data frame column with multiple strings from a column of another data frame

I have a data frame (df1) with a column "Partcipant_ID". Some participant IDs are wrong and should be replaced with the correct participant ID. I have another data frame (df2) where all participant IDs appear in columns Goal_ID to T4. The participant IDs in column "Goal_ID" are the correct IDs.
Now I want to replace every wrong participant ID in df1 with the matching Goal_ID from df2.
This is my original dataframe (df1):
structure(list(Partcipant_ID = c("AA_SH_RA_91", "AA_SH_RA_91",
"AB_BA_PR_93", "AB_BH_VI_90", "AB_BH_VI_90", "AB_SA_TA_91", "AJ_BO_RA_92",
"AJ_BO_RA_92", "AK_SH_HA_91", "AL_EN_RA_95", "AL_MA_RA_95", "AL_SH_BA_99",
"AM_BO_AB_49", "AM_BO_AB_94", "AM_BO_AB_94", "AM_BO_AB_94", "AN_JA_AN_91",
"AN_KL_GE_11", "AN_KL_WO_91", "AN_MA_DI_95", "AN_MA_DI_95", "AN_SE_RA_95",
"AN_SE_RA_95", "AN_SI_RA_97", "AN_SO_PU_94", "AN_SU_RA_91", "AR_BO_RA_92",
"AR_KA_VI_94", "AR_KA_VI_94", "AS_AR_SO_90", "AS_AR_SU_95", "AS_KU_SO_90",
"AS_MO_AS_97", "AW_SI_OJ_97", "AW_SI_OJ_97", "AY_CH_SU_97", "BH_BE_LD_84",
"BH_BE_LI_83", "BH_BE_LI_83", "BH_BE_LI_84", "BH_KO_SA_87", "BH_PE_AB_89",
"BH_YA_SA_87", "BI_CH_PR_94", "BI_CH_PR_94"), Start_T2 = structure(c(NA,
NA, NA, NA, 1579514871, 1576658745, NA, 1579098225, NA, NA, 1576663067,
1576844759, NA, 1577330639, NA, NA, 1576693930, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 1577718380, 1577718380, 1577454467, NA,
NA, 1576352237, NA, NA, NA, NA, 1576420656, 1576420656, NA, NA,
1578031772, 1576872938, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), End_T2 = structure(c(NA, NA, NA, NA, 1579515709,
1576660469, NA, 1579098989, NA, NA, 1576693776, 1576845312, NA,
1577331721, NA, NA, 1576694799, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 1577719049, 1577719049, 1577455167, NA, NA, 1576352397,
NA, NA, NA, NA, 1576421607, 1576421607, NA, NA, 1578032408, 1576873875,
NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
45L), class = "data.frame")
And this is the reference data frame (df2):
structure(list(Goal_ID = c("AJ_BO_RA_92", "AL_EN_RA_95", "AM_BO_AB_49",
"AS_KU_SO_90", "BH_BE_LI_84", "BH_YA_SA_87", "BI_CH_PR_94", "BI_CH_PR_94"
), T2 = c("AJ_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", "AS_AR_SO_90",
"BH_BE_LI_83", "BH_YA_SA_87", "BI_NA_PR_94", "BI_NA_PR_94"),
T3 = c("AR_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", NA, "BH_BE_LI_83",
NA, "BI_CH_PR_94", "BI_CH_PR_94"), T4 = c("AJ_BO_RA_92",
"AL_MA_RA_95", "AM_BO_AB_94", NA, "BH_BE_LI_83", "BH_KO_SA_87",
"BI_CH_PR_94", "BI_CH_PR_94")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
For example, in my df1, I want
"AR_BO_RA_92" to be replaced by "AJ_BO_RA_92";
"AL_MA_RA_95" to be replaced by "AL_EN_RA_95";
"AM_BO_AB_94" to be replaced by "AM_BO_AB_49"
and so on...
I thought about using str_replace and I started with this:
df1$Partcipant_ID <- str_replace(df1$Partcipant_ID, "AR_BO_RA_92", "AJ_BO_RA_92")
But that is of course very inefficient because I have so many replacements, and it would be nice to make use of my reference data frame. I just cannot figure it out myself.
I hope this is understandable. Please ask if you need additional information.
Thank you so much already!
You can use match to find where each string is located in the T2 to T4 columns, then exchange those which have been found (i.e. are not NA), like:
i <- (match(df1$Partcipant_ID, unlist(df2[-1])) - 1) %% nrow(df2) + 1 # row of df2; the -1/+1 keeps the last row from wrapping to 0
j <- !is.na(i)
df1$Partcipant_ID[j] <- df2$Goal_ID[i[j]]
df1$Partcipant_ID
# [1] "AA_SH_RA_91" "AA_SH_RA_91" "AB_BA_PR_93" "AB_BH_VI_90" "AB_BH_VI_90"
# [6] "AB_SA_TA_91" "AJ_BO_RA_92" "AJ_BO_RA_92" "AK_SH_HA_91" "AL_EN_RA_95"
#[11] "AL_EN_RA_95" "AL_SH_BA_99" "AM_BO_AB_49" "AM_BO_AB_49" "AM_BO_AB_49"
#[16] "AM_BO_AB_49" "AN_JA_AN_91" "AN_KL_GE_11" "AN_KL_WO_91" "AN_MA_DI_95"
#[21] "AN_MA_DI_95" "AN_SE_RA_95" "AN_SE_RA_95" "AN_SI_RA_97" "AN_SO_PU_94"
#[26] "AN_SU_RA_91" "AJ_BO_RA_92" "AR_KA_VI_94" "AR_KA_VI_94" "AS_KU_SO_90"
#[31] "AS_AR_SU_95" "AS_KU_SO_90" "AS_MO_AS_97" "AW_SI_OJ_97" "AW_SI_OJ_97"
#[36] "AY_CH_SU_97" "BH_BE_LD_84" "BH_BE_LI_84" "BH_BE_LI_84" "BH_BE_LI_84"
#[41] "BH_YA_SA_87" "BH_PE_AB_89" "BH_YA_SA_87" "BI_CH_PR_94" "BI_CH_PR_94"
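The match idea can also be written with an explicit two-column lookup table, which avoids the index arithmetic. Here is a minimal base R sketch on a cut-down, partly made-up version of the data; the names ids, ref, lookup and fixed are chosen for illustration only.

```r
# Cut-down example data (IDs taken from the question, layout simplified).
ids <- c("AR_BO_RA_92", "AL_MA_RA_95", "AA_SH_RA_91")
ref <- data.frame(Goal_ID = c("AJ_BO_RA_92", "AL_EN_RA_95"),
                  T2 = c("AJ_BO_RA_92", "AL_MA_RA_95"),
                  T3 = c("AR_BO_RA_92", NA),
                  stringsAsFactors = FALSE)

# Stack every non-goal column next to its Goal_ID: one row per known variant.
variant_cols <- setdiff(names(ref), "Goal_ID")
lookup <- data.frame(
  wrong = unlist(ref[variant_cols], use.names = FALSE),
  right = rep(ref$Goal_ID, times = length(variant_cols)),
  stringsAsFactors = FALSE
)
lookup <- unique(lookup[!is.na(lookup$wrong), ])

# Replace IDs found in the lookup and keep all others unchanged.
hit <- match(ids, lookup$wrong)
fixed <- ifelse(is.na(hit), ids, lookup$right[hit])
fixed
```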
I think this might work. Create a true lookup table with one column of correct codes and one column of incorrect codes, i.e. stack the columns. Then join the resulting df3 to df1 and use coalesce to create a new ID column. You spelt participant wrong, which made me feel more human; I always do that.
library(dplyr)
df3 <- df2[1:2] %>%
bind_rows(df2[c(1,3)] %>% rename(T2 = T3),
df2[c(1,4)] %>% rename(T2 = T4)) %>%
distinct()
df1 %>%
left_join(df3, by = c("Partcipant_ID" = "T2")) %>%
mutate(Goal_ID = coalesce(Goal_ID, Partcipant_ID)) %>%
select(Goal_ID, Partcipant_ID, Start_T2, End_T2)

How to create a proper dataset for boxplots

I'm having trouble creating a proper boxplot of my dataset. None of the solutions on this platform work for me, because their datasets look different, with variables plotted against each other.
So I want to ask: how do I need to format my dataset if it only contains 3 variables and their measured values in 3 columns? In the boxplot examples here, one variable is plotted against another, but that is not the case here, right?
Using boxplot(data) gives me 3 boxplots. But I want to show the mean and also the population size on each boxplot. I don't know how to use the existing solutions, as they are all about ggplot2 or about boxplots with variables plotted against each other.
I know that this must be simple, but I think I'm plotting the boxplots in the wrong way, and that's why the solutions on this site don't work.
Data:
structure(list(Rest = c(3.479386607, 3.478445796, 2.52227462,
1.726115552, 3.917693859, 2.300840122), Peat = c(16.79515746,
22.76673699, 24.43289941, 15.64168939, 31.60459098, 16.2369787
), Top.culture = c(8.288, 8.732, 5.199, 6.539, 3.248, 10.156)), .Names = c("Rest",
"Peat", "Top.culture"), row.names = c(NA, 6L), class = "data.frame")
If text annotation is what is meant by 'show the mean and also the population size' then:
boxplot(dat)
text(1:3, 12.5, paste( "Mean= ",round(sapply(dat,mean, na.rm=TRUE), 2),
"\n N= ",
sapply(dat, function(x) length( x[!is.na(x)] ) )
) )
This used your more complex data-object from the other (duplicated) question.
dat <- structure(list(Rest = c(3.479386607, 3.478445796, 2.52227462, 1.726115552, 3.917693859, 2.300840122, 2.326307503, 2.344828287, 4.654278623, 3.68669447, 3.343706863, 0.712228306, 2.735897248, 1.936723375, 2.724260325, 2.069633651, 1.741484154, 2.304391217, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Peat = c(16.79515746, 22.76673699, 24.43289941, 15.64168939, 31.60459098, 16.2369787, 32.63285246, 35.91852324, 19.27802839, 21.78974576, 30.39119451, 35.4846573, 42.21807817, 42.00913743, 40.96996704, 19.85075354, 17.247096, 22.81689524, 43.35990368, 37.57273508, 23.76889902, 38.34604591, 20.98376674, 16.44173119, 17.27639888, NA, NA, NA, NA, NA, NA), Top.culture = c(8.288, 8.732, 5.199, 6.539, 3.248, 10.156, 3.436, 5.584, 4.483, 2.087, 3.28, 2.71, 2.196, 4.971, 4.475, 6.361, 5.49, 9.085, 3.52, 5.772, 9.308, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Rest", "Peat", "Top.culture" ), class = "data.frame", row.names = c(NA, -31L))
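A variation on the same idea, sketched with the six-row data from the question: the sample size goes into the axis labels and the mean is drawn as a point on each box, so no y position has to be picked by hand. The variable names means and n_obs and the choice of marker are illustrative assumptions.

```r
# The six-row data from the question, one column per group (wide format).
dat <- data.frame(
  Rest = c(3.479386607, 3.478445796, 2.52227462,
           1.726115552, 3.917693859, 2.300840122),
  Peat = c(16.79515746, 22.76673699, 24.43289941,
           15.64168939, 31.60459098, 16.2369787),
  Top.culture = c(8.288, 8.732, 5.199, 6.539, 3.248, 10.156)
)

# Per-column mean and number of non-missing values.
means <- sapply(dat, mean, na.rm = TRUE)
n_obs <- sapply(dat, function(x) sum(!is.na(x)))

# One box per column, with the sample size in the axis label.
boxplot(dat, names = paste0(names(dat), " (n = ", n_obs, ")"))

# Mark each mean with a red diamond on top of its box.
points(seq_along(means), means, pch = 18, col = "red")
```

The wide layout with one column per group is already a valid input for boxplot(), so no reformatting of the dataset is needed for this kind of plot.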
