How to create a proper dataset for boxplots - r

I'm having trouble creating a proper boxplot of my dataset. None of the solutions on this platform work for me, because their datasets all look different, with variables plotted against each other.
So I want to ask: how do I need to format my dataset if it only contains 3 variables and their measured values in 3 columns? In the boxplot examples here, one variable is plotted against another, but that is not the case here, right?
Using boxplot(data) gives me 3 boxplots, but I want to show the mean and the population size on each boxplot. I don't know how to use the existing solutions, as they are all about ggplot2 or about boxplots with variables against each other.
I know that this must be simple, but I think I'm plotting the boxplots with the wrong method, and that's why the solutions on this site don't work?
Data:
structure(list(Rest = c(3.479386607, 3.478445796, 2.52227462,
1.726115552, 3.917693859, 2.300840122), Peat = c(16.79515746,
22.76673699, 24.43289941, 15.64168939, 31.60459098, 16.2369787
), Top.culture = c(8.288, 8.732, 5.199, 6.539, 3.248, 10.156)), .Names = c("Rest",
"Peat", "Top.culture"), row.names = c(NA, 6L), class = "data.frame")

If text annotation is what is meant by 'show the mean and also the population size' then:
boxplot(dat)
text(1:3, 12.5, paste( "Mean= ",round(sapply(dat,mean, na.rm=TRUE), 2),
"\n N= ",
sapply(dat, function(x) length( x[!is.na(x)] ) )
) )
This uses the more complex data object from your other (duplicate) question:
dat <- structure(list(Rest = c(3.479386607, 3.478445796, 2.52227462, 1.726115552, 3.917693859, 2.300840122, 2.326307503, 2.344828287, 4.654278623, 3.68669447, 3.343706863, 0.712228306, 2.735897248, 1.936723375, 2.724260325, 2.069633651, 1.741484154, 2.304391217, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Peat = c(16.79515746, 22.76673699, 24.43289941, 15.64168939, 31.60459098, 16.2369787, 32.63285246, 35.91852324, 19.27802839, 21.78974576, 30.39119451, 35.4846573, 42.21807817, 42.00913743, 40.96996704, 19.85075354, 17.247096, 22.81689524, 43.35990368, 37.57273508, 23.76889902, 38.34604591, 20.98376674, 16.44173119, 17.27639888, NA, NA, NA, NA, NA, NA), Top.culture = c(8.288, 8.732, 5.199, 6.539, 3.248, 10.156, 3.436, 5.584, 4.483, 2.087, 3.28, 2.71, 2.196, 4.971, 4.475, 6.361, 5.49, 9.085, 3.52, 5.772, 9.308, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Rest", "Peat", "Top.culture" ), class = "data.frame", row.names = c(NA, -31L))
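As a complement to the text() annotation, here is a minimal sketch using the six-row data from the question. It assumes the wide layout (one column per variable) is kept, which boxplot() accepts directly. The names means, n and long are introduced here for illustration only. The stack() call at the end shows the long layout that the "one variable against another" examples expect.

```r
# Six-row example data from the question (wide layout: one column per variable)
dat <- data.frame(
  Rest        = c(3.479386607, 3.478445796, 2.52227462, 1.726115552, 3.917693859, 2.300840122),
  Peat        = c(16.79515746, 22.76673699, 24.43289941, 15.64168939, 31.60459098, 16.2369787),
  Top.culture = c(8.288, 8.732, 5.199, 6.539, 3.248, 10.156)
)

means <- colMeans(dat, na.rm = TRUE)   # per-column mean, ignoring NA
n     <- colSums(!is.na(dat))          # per-column count of non-missing values

boxplot(dat)                                                     # one box per column
points(seq_along(means), means, pch = 18, col = "red")           # mark each mean
mtext(paste("n =", n), side = 3, at = seq_along(n), line = 0.5)  # sample size above each box

# Long layout: one column of values, one column naming the variable
long <- stack(dat)                     # columns are called "values" and "ind"
boxplot(values ~ ind, data = long)     # draws the same three boxes
```

Both calls draw the same boxes, so the wide format is not wrong; the formula interface simply needs the stacked version.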

Related

Joining 'n' number of lists and perform a function in R

I have a dataframe that contains many triplicates (sets of 3 columns), and I have split it so that each triplicate is a separate element of a list.
The example dataset is:
example_data <- structure(list(`1_3ng` = c(69648445400, 73518145600, NA, NA,
73529102400, 75481088000, NA, 73545910600, 74473949200, 77396199900
), `2_3ng` = c(71187990600, 70677690400, NA, 73675407400, 73215342700,
NA, NA, 69996254800, 69795686400, 76951318300), `3_3ng` = c(65032022000,
71248214000, NA, 72393058300, 72025550900, 71041067000, 73604692000,
NA, 73324202000, 75969608700), `4_7-5ng` = c(NA, 65845061600,
75009245100, 64021237700, 66960666600, 69055643600, NA, 64899540900,
NA, NA), `5_7-5ng` = c(65097201700, NA, NA, 69032126500, NA,
70189899800, NA, 74143529100, 69299087400, NA), `6_7-5ng` = c(71964413900,
69048485800, NA, 71281569700, 71167596500, NA, NA, 68389822800,
69322289200, NA), `7_10ng` = c(71420403700, 67552276500, 72888076300,
66491357100, NA, 68165019600, 70876631000, NA, 69174190100, 63782945300
), `8_10ng` = c(NA, 71179401200, 68959365100, 70570182700, 73032738800,
NA, 74807496700, NA, 71812102100, 73855098500), `9_10ng` = c(NA,
70403756100, NA, 70277421000, 69887731700, 69818871800, NA, 71353886700,
NA, 74115466700), `10_15ng` = c(NA, NA, 68487581700, NA, NA,
69056997400, NA, 67780479400, 66804467800, 72291939500), `11_15ng` = c(NA,
63599643700, NA, NA, 60752029700, NA, NA, 63403655600, NA, 64548492900
), `12_15ng` = c(NA, 67344750600, 61610182700, 67414425600, 65946654700,
66166118400, NA, 70830837700, 67288305700, 69911451300)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
After grouping I get four list elements, since the example dataset above contains 4 groups. I used the following R code to group the data:
grouping_data<-function(df){ #df= dataframe
df_col<-ncol(df) #calculates no. of columns in dataframe
groups<-sort(rep(0:((df_col/3)-1),3)) #creates user determined groups
id<-list() #creates empty list
for (i in 1:length(unique(groups))){
id[[i]]<-which(groups == unique(groups)[i])} #creates list of groups
names(id)<-paste0("id",unique(groups)) #assigns group based names to the list "id"
data<-list() #creates empty list
for (i in 1:length(id)){
data[[i]]<-df[,id[[i]]]} #creates list of dataframe columns sorted by groups
names(data)<-paste0("data",unique(groups)) #assigns group based names to the list "data"
return(data)}
group_data <-grouping_data(example_data)
Please suggest R code that applies a particular function to all of the list elements at once.
For example, I applied the function below in the following way:
#VSN Normalization
vsnNorm <- function(dat) {
dat<-as.data.frame(dat)
vsnNormed <- suppressMessages(vsn::justvsn(as.matrix(dat)))
colnames(vsnNormed) <- colnames(dat)
row.names(vsnNormed) <- rownames(dat)
return(as.matrix(vsnNormed))
}
And I called it like this:
vsn.dat0 <- vsnNorm(group_data$data0)
vsn.dat1 <- vsnNorm(group_data$data1)
vsn.dat2 <- vsnNorm(group_data$data2)
vsn.dat3 <- vsnNorm(group_data$data3)
vsn.dat <- cbind (vsn.dat0,vsn.dat1,vsn.dat2,vsn.dat3)
It is working well.
But the size of the replicate groups (here 3 columns per set) may change from dataset to dataset, and calling each list element by hand every time becomes tedious.
So please share some code that applies a function to all of the resulting list elements and combines the results into a single object.
Thank you in advance.
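The shortcut you are looking for is lapply plus do.call. Use cbind to match the manual cbind call above (rbind would stack the groups row-wise instead):
vsn.dat <- do.call("cbind", lapply(group_data, vsnNorm))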
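The same pattern can be sketched end to end on toy data. This is only an illustration under stated assumptions: the values are made up, log2_norm is a hypothetical stand-in for vsnNorm (so the example runs without the vsn package), and group_size assumes that replicate columns sit next to each other in equally sized sets.

```r
# Toy data: 6 columns = 2 groups of 3 replicates (values are made up)
df <- data.frame(a1 = c(1, 2, 4, 8),  a2 = c(2, 4, 8, 16), a3 = c(4, 8, 16, 32),
                 b1 = c(8, 4, 2, 1),  b2 = c(16, 8, 4, 2), b3 = c(32, 16, 8, 4))

group_size <- 3  # assumption: replicates are adjacent and every group has this size

# Hypothetical stand-in for vsnNorm: any function that maps a block to a matrix
log2_norm <- function(dat) log2(as.matrix(dat))

# Group index per column: 1 1 1 2 2 2
groups <- ceiling(seq_along(df) / group_size)

# split.default() splits a data frame by columns into a list of data frames
group_data <- split.default(df, groups)

# Apply the function to every group, then bind the pieces side by side
result <- do.call(cbind, unname(lapply(group_data, log2_norm)))
```

Changing group_size is then the only edit needed when the number of replicates per set differs between datasets.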

r Replace multiple strings in a data frame column with multiple strings from a column of another data frame

I have a dataframe (df1) with a column "Partcipant_ID". Some participant IDs are wrong and should be replaced with the correct ID. I have another dataframe (df2) in which all participant IDs appear in the columns Goal_ID to T4. The IDs in column "Goal_ID" are the correct ones.
Now I want to replace every participant ID in df1 with the matching Goal_ID from df2.
This is my original dataframe (df1):
structure(list(Partcipant_ID = c("AA_SH_RA_91", "AA_SH_RA_91",
"AB_BA_PR_93", "AB_BH_VI_90", "AB_BH_VI_90", "AB_SA_TA_91", "AJ_BO_RA_92",
"AJ_BO_RA_92", "AK_SH_HA_91", "AL_EN_RA_95", "AL_MA_RA_95", "AL_SH_BA_99",
"AM_BO_AB_49", "AM_BO_AB_94", "AM_BO_AB_94", "AM_BO_AB_94", "AN_JA_AN_91",
"AN_KL_GE_11", "AN_KL_WO_91", "AN_MA_DI_95", "AN_MA_DI_95", "AN_SE_RA_95",
"AN_SE_RA_95", "AN_SI_RA_97", "AN_SO_PU_94", "AN_SU_RA_91", "AR_BO_RA_92",
"AR_KA_VI_94", "AR_KA_VI_94", "AS_AR_SO_90", "AS_AR_SU_95", "AS_KU_SO_90",
"AS_MO_AS_97", "AW_SI_OJ_97", "AW_SI_OJ_97", "AY_CH_SU_97", "BH_BE_LD_84",
"BH_BE_LI_83", "BH_BE_LI_83", "BH_BE_LI_84", "BH_KO_SA_87", "BH_PE_AB_89",
"BH_YA_SA_87", "BI_CH_PR_94", "BI_CH_PR_94"), Start_T2 = structure(c(NA,
NA, NA, NA, 1579514871, 1576658745, NA, 1579098225, NA, NA, 1576663067,
1576844759, NA, 1577330639, NA, NA, 1576693930, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 1577718380, 1577718380, 1577454467, NA,
NA, 1576352237, NA, NA, NA, NA, 1576420656, 1576420656, NA, NA,
1578031772, 1576872938, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), End_T2 = structure(c(NA, NA, NA, NA, 1579515709,
1576660469, NA, 1579098989, NA, NA, 1576693776, 1576845312, NA,
1577331721, NA, NA, 1576694799, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 1577719049, 1577719049, 1577455167, NA, NA, 1576352397,
NA, NA, NA, NA, 1576421607, 1576421607, NA, NA, 1578032408, 1576873875,
NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
45L), class = "data.frame")
And this is the reference data frame (df2):
structure(list(Goal_ID = c("AJ_BO_RA_92", "AL_EN_RA_95", "AM_BO_AB_49",
"AS_KU_SO_90", "BH_BE_LI_84", "BH_YA_SA_87", "BI_CH_PR_94", "BI_CH_PR_94"
), T2 = c("AJ_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", "AS_AR_SO_90",
"BH_BE_LI_83", "BH_YA_SA_87", "BI_NA_PR_94", "BI_NA_PR_94"),
T3 = c("AR_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", NA, "BH_BE_LI_83",
NA, "BI_CH_PR_94", "BI_CH_PR_94"), T4 = c("AJ_BO_RA_92",
"AL_MA_RA_95", "AM_BO_AB_94", NA, "BH_BE_LI_83", "BH_KO_SA_87",
"BI_CH_PR_94", "BI_CH_PR_94")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
For example, in my df1, I want
"AR_BO_RA_92" to be replaced by "AJ_BO_RA_92";
"AL_MA_RA_95" to be replaced by "AL_EN_RA_95";
"AM_BO_AB_94" to be replaced by "AM_BO_AB_49"
and so on...
I thought about using str_replace and I started with this:
df1$Partcipant_ID <- str_replace(df1$Partcipant_ID, "AR_BO_RA_92", "AJ_BO_RA_92")
But that is of course very inefficient, because I have so many replacements, and it would be nice to make use of my reference data frame. I just cannot figure it out myself.
I hope this is understandable. Please ask if you need additional information.
Thank you so much already!
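You can use match to find where each string is located and exchange those that were found (i.e. are not NA):
# unlist() stacks T2, T3, T4 column by column, so the position maps back to a row of df2.
# Subtracting 1 before the modulo and adding 1 afterwards keeps the last row from becoming index 0.
i <- (match(df1$Partcipant_ID, unlist(df2[-1])) - 1) %% nrow(df2) + 1
j <- !is.na(i)
df1$Partcipant_ID[j] <- df2$Goal_ID[i[j]]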
df1$Partcipant_ID
# [1] "AA_SH_RA_91" "AA_SH_RA_91" "AB_BA_PR_93" "AB_BH_VI_90" "AB_BH_VI_90"
# [6] "AB_SA_TA_91" "AJ_BO_RA_92" "AJ_BO_RA_92" "AK_SH_HA_91" "AL_EN_RA_95"
#[11] "AL_EN_RA_95" "AL_SH_BA_99" "AM_BO_AB_49" "AM_BO_AB_49" "AM_BO_AB_49"
#[16] "AM_BO_AB_49" "AN_JA_AN_91" "AN_KL_GE_11" "AN_KL_WO_91" "AN_MA_DI_95"
#[21] "AN_MA_DI_95" "AN_SE_RA_95" "AN_SE_RA_95" "AN_SI_RA_97" "AN_SO_PU_94"
#[26] "AN_SU_RA_91" "AJ_BO_RA_92" "AR_KA_VI_94" "AR_KA_VI_94" "AS_KU_SO_90"
#[31] "AS_AR_SU_95" "AS_KU_SO_90" "AS_MO_AS_97" "AW_SI_OJ_97" "AW_SI_OJ_97"
#[36] "AY_CH_SU_97" "BH_BE_LD_84" "BH_BE_LI_84" "BH_BE_LI_84" "BH_BE_LI_84"
#[41] "BH_YA_SA_87" "BH_PE_AB_89" "BH_YA_SA_87" "BI_CH_PR_94" "BI_CH_PR_94"
I think this might work. Create a true lookup table with one column of correct codes and one column of incorrect codes, i.e. stack the columns, then join the resulting df3 to df1 and use coalesce to fill in the corrected ID. You spelt participant wrong, which made me feel more human; I always do that.
library(dplyr)
df3 <- df2[1:2] %>%
bind_rows(df2[c(1,3)] %>% rename(T2 = T3),
df2[c(1,4)] %>% rename(T2 = T4)) %>%
distinct()
df1 %>%
left_join(df3, by = c("Partcipant_ID" = "T2")) %>%
mutate(Goal_ID = coalesce(Goal_ID, Partcipant_ID)) %>%
select(Goal_ID, Partcipant_ID, Start_T2, End_T2)
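For comparison, the lookup-table idea can also be sketched in base R with a named vector. This is a minimal illustration on three rows taken from df2 and a handful of IDs from df1; the names ids, variants, lookup and hit are introduced here and are not part of the original code.

```r
# Three rows of the reference table from the question
df2 <- data.frame(
  Goal_ID = c("AJ_BO_RA_92", "AL_EN_RA_95", "AM_BO_AB_49"),
  T2      = c("AJ_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94"),
  T3      = c("AR_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94"),
  stringsAsFactors = FALSE
)

# A few IDs as they might appear in df1$Partcipant_ID
ids <- c("AA_SH_RA_91", "AR_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", "AM_BO_AB_49")

# Stack all variant columns; unlist() goes column by column
variants <- unlist(df2[-1], use.names = FALSE)

# Named vector: name = variant spelling, value = correct Goal_ID.
# rep() repeats the Goal_ID column once per variant column, matching the stacking order.
lookup <- setNames(rep(df2$Goal_ID, times = ncol(df2) - 1), variants)

# Drop missing variants and keep the first entry for each spelling
lookup <- lookup[!is.na(names(lookup)) & !duplicated(names(lookup))]

# Replace only the IDs that have an entry; everything else stays untouched
hit <- ids %in% names(lookup)
ids[hit] <- unname(lookup[ids[hit]])
```

IDs that never occur in the variant columns, such as "AA_SH_RA_91", are left as they are.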

how to paste an array to rows which contain a certain value in a certain column in R

I would like to copy values from a certain data.frame row into other rows that have a certain value in a certain column; not the whole row, just a couple of its values. Concretely, it looks like this:
z <- c(NA, NA, 3,4,2,3,5)
x <- c(NA, NA, 2,5,5,3,3)
a <- c("Hank", NA, NA, NA, NA, NA, NA)
b <- c("Hank", NA, NA, NA, NA, NA, NA)
c <- c(NA, NA, NA, NA, NA, NA, NA)
d <- c("Bobby", NA, NA, NA, NA, NA, NA)
df <- as.data.frame(rbind( a, b, c, d, z, x))
Now, I would like to copy df["z", 3:7] into columns 3:7 of the rows that have V1 == "Hank", and copy df["x", 3:7] into the rows where V1 == "Bobby".
Does anybody have a hint for me? I guess it should be a function with sapply or something like that. Maybe dplyr could give a solution? Thanks for any advice!
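One possible approach, sketched in base R, is a small helper that copies the chosen columns from a source row into every row whose key column matches. The helper name copy_values and its arguments are hypothetical. The sketch also passes stringsAsFactors = FALSE so that the columns are plain character vectors regardless of R version, and it builds the same six rows with named arguments to rbind.

```r
# Same rows as in the question; rbind() coerces everything to character
df <- as.data.frame(rbind(
  a = c("Hank",  rep(NA, 6)),
  b = c("Hank",  rep(NA, 6)),
  c = rep(NA, 7),
  d = c("Bobby", rep(NA, 6)),
  z = c(NA, NA, 3, 4, 2, 3, 5),
  x = c(NA, NA, 2, 5, 5, 3, 3)
), stringsAsFactors = FALSE)

# Hypothetical helper: copy `cols` of `source_row` into all rows where key_col == key
copy_values <- function(df, source_row, key, cols = 3:7, key_col = "V1") {
  targets <- which(df[[key_col]] == key)   # which() drops the NA comparisons
  for (r in targets) {
    df[r, cols] <- df[source_row, cols]    # one-row to one-row assignment
  }
  df
}

df <- copy_values(df, source_row = "z", key = "Hank")
df <- copy_values(df, source_row = "x", key = "Bobby")
```

Rows without a matching key, such as row c, are left unchanged.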

custom rmeta - forest plot generation does not work: " 'x' and 'units' must have length > 0"

I tried to generate a "forest plot" without summary estimates using the rmeta package. However, reading ?forestplot and starting from the description or the example does not help; I always get the same error. I assume it is a simple one that has to do with the matrix/vector lengths somehow not lining up, but I kept changing and adjusting things and still cannot find the error...
Here is the example code:
tabletext<-cbind(c(NA, NA, NA, NA, NA, NA),
c(NA, NA, NA, NA, NA, NA),
c("variable1","subgroup","2nd", "3rd", "4th", "5th"),
c(NA,"mean","1.8683639", "2.5717301", "4.4966049, 9.0008054")
)
tabletext
png("forestplot.png")
forestplot(tabletext, mean = c(NA, NA, 1.8683639, 2.5717301, 4.4966049, 9.0008054), lower = c(NA, NA, 1.4604643, 2.0163468, 3.5197956, 6.9469213), upper = c(NA, NA, 2.3955105, 3.2897459, 5.7672966, 11.7288609),
is.summary = c(rep(FALSE, 6)), zero = 1, xlog=FALSE, boxsize=0.75, xticks = NULL, clip = c(0.9, 12))
dev.off()
Error message:
clip = c(0.9, 12))
Error in unit(rep(1, sum(widthcolumn)), "grobwidth", labels[[1]][widthcolumn]) :
'x' and 'units' must have length > 0
dev.off()
Any help is very much appreciated!
This works with the forestplot package, although you need to remove xticks = NULL:
library(forestplot)
tabletext<-cbind(c(NA, NA, NA, NA, NA, NA),
c(NA, NA, NA, NA, NA, NA),
c("variable1","subgroup","2nd", "3rd", "4th", "5th"),
# the last two values need their own quotes, otherwise this column has only 5 entries
c(NA,"mean","1.8683639", "2.5717301", "4.4966049", "9.0008054")
)
png("forestplot.png")
forestplot(tabletext,
mean = c(NA, NA, 1.8683639, 2.5717301, 4.4966049, 9.0008054),
lower = c(NA, NA, 1.4604643, 2.0163468, 3.5197956, 6.9469213),
upper = c(NA, NA, 2.3955105, 3.2897459, 5.7672966, 11.7288609),
is.summary = c(rep(FALSE, 6)), zero = 1,
xlog=FALSE, boxsize=0.75, clip = c(0.9, 12))
dev.off()
This gives the plot (I recommend some polishing before submitting it for publication).
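On the suspicion about vector lengths: in the question's tabletext, the last label vector has only five elements, because two values share one pair of quotes, and cbind() then recycles it with a warning. The following base R sketch shows a quick length check before plotting. The names labels_bad and labels_ok are illustrative, and deriving the labels from the numeric vector is just one way to keep the two in sync.

```r
means <- c(NA, NA, 1.8683639, 2.5717301, 4.4966049, 9.0008054)

# As typed in the question: the last two numbers ended up in one string
labels_bad <- c(NA, "mean", "1.8683639", "2.5717301", "4.4966049, 9.0008054")
length(labels_bad)   # 5

# Deriving the labels from the numbers avoids this kind of slip
labels_ok <- c(NA, "mean", as.character(means[3:6]))
length(labels_ok)    # 6

tabletext <- cbind(
  c("variable1", "subgroup", "2nd", "3rd", "4th", "5th"),
  labels_ok
)

# Every label column must have one row per estimate
stopifnot(nrow(tabletext) == length(means))
```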

How to change xticks locations and customize legend using levelplot (lattice library)

I am trying to move the x-ticks and x-labels from the bottom of the figure to the top.
In addition, my data has a bunch of NAs. Currently, levelplot just drops them and leaves them as white space in the plot. I am wondering whether it is possible to add these NAs to the legend as well.
Any suggestions? Thanks!
Here is my code and its output:
require(lattice)
# see data from dput() below
rownames(data)=data[,1]
data_matrix=as.matrix(data[,2:11])
color = colorRampPalette(rev(c("#D73027", "#FC8D59", "#FEE090", "#FFFFBF", "#E0F3F8", "#91BFDB", "#4575B4")))(100)
levelplot(data_matrix, scale=list(x=list(rot=45)), ylab="Days", xlab="Strains", col.regions = color)
Data
data <-
structure(list(X = structure(1:17, .Label = c("Arcobacter", "Bacillus",
"Bordetella", "Campylobacter", "Chlamydia", "Clostridium ", "Corynebacterium",
"Enterococcus", "Escherichia", "Francisella", "Legionella", "Mycobacterium",
"Pseudomonas", "Rickettsia", "Staphylococcus", "Streptococcus",
"Treponema"), class = "factor"), day.0 = c(NA, -3.823301154,
NA, NA, NA, -3.518606107, NA, NA, NA, NA, NA, -4.859479387, NA,
NA, NA, -2.588402346, -2.668136603), day.2 = c(-4.006281239,
-3.024823788, NA, -5.202804501, NA, -3.237622321, NA, NA, -5.296138823,
-5.105469059, NA, NA, -4.901775198, NA, NA, -2.979144202, -3.050083791
), day.4 = c(-2.880770182, -3.210165554, -4.749097175, -5.209064234,
NA, -2.946480184, NA, -5.264113795, -5.341881713, -4.435780293,
NA, -4.810650076, -4.152531609, NA, NA, -3.106172794, -3.543161966
), day.6 = c(-2.869833226, -3.293283924, -3.831346387, NA, NA,
-3.323947791, NA, NA, NA, NA, NA, -4.397581863, -4.068855504,
NA, NA, -3.27028378, -3.662618619), day.8 = c(-3.873589331, -3.446192193,
-3.616207965, NA, NA, -3.13869325, NA, -5.010807453, NA, NA,
NA, -4.091502649, -4.412399025, -4.681675749, NA, -3.404738625,
-3.955464159), day.15 = c(-5.176583159, -2.512963066, -3.392832457,
NA, NA, -3.194662968, NA, -3.60440455, NA, NA, -4.875554468,
-2.507376205, -4.727255906, -5.27116754, -3.200499549, -3.361296145,
-4.320554841), day.22 = c(-4.550052847, -3.654013004, -3.486879661,
NA, NA, -3.614890858, NA, NA, NA, NA, -4.706690492, -2.200533317,
-4.836957953, NA, -4.390423731, NA, NA), day.29 = c(-4.730006329,
-3.46707372, -3.594457287, NA, NA, -3.800757834, NA, NA, NA,
NA, -4.285154089, -2.121152491, -4.816807055, -5.064577888, -2.945243736,
-4.479177287, -5.226435146), day.43 = c(-4.398680025, -3.144603215,
-3.642065153, NA, NA, -3.8268662, NA, NA, NA, NA, -4.762539208,
-2.156862316, -4.118608495, NA, -4.030291084, -4.678213147, NA
), day.57 = c(-4.689982547, -2.713502214, -3.51279797, NA, -5.069579266,
-3.495580794, NA, NA, NA, NA, -4.515973639, -1.90591075, -4.134826117,
-4.479351427, -3.482134037, -4.538534489, NA)), .Names = c("X",
"day.0", "day.2", "day.4", "day.6", "day.8", "day.15", "day.22",
"day.29", "day.43", "day.57"), class = "data.frame", row.names = c("Arcobacter",
"Bacillus", "Bordetella", "Campylobacter", "Chlamydia", "Clostridium ",
"Corynebacterium", "Enterococcus", "Escherichia", "Francisella",
"Legionella", "Mycobacterium", "Pseudomonas", "Rickettsia", "Staphylococcus",
"Streptococcus", "Treponema"))
Figure
The request to move the labels to the top is pretty easy (see the scales section of ?xyplot):
levelplot(data_matrix, scales=list(x=list(rot=45,alternating=2)),
ylab="Days", xlab="Strains", col.regions = color)
Getting the NA values into the color legend may take a bit more thought, but sensible values for the at and col arguments of colorkey might suffice:
levelplot(data_matrix, scales=list(x=list(rot=45,alternating=2)),
ylab="Days", xlab="Strains", col.regions = color,
colorkey=list(at=as.numeric( factor( c( seq(-5.5, -2, by=0.5),
"NA"))),
labels=as.character( c( seq(-5.5, -2, by=0.5),
"NA")),
col=c(color, "#FFFFFF") ) )
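A different way to make the NA cells visible, offered only as a sketch: fill the panel background with a dedicated colour before drawing the level plot, so that missing cells show that colour, and add a separate one-entry key for it. This swaps in a different technique from the colorkey approach above. The small matrix m and the colour na_col are made up for illustration.

```r
library(lattice)

# Small made-up matrix with missing cells (rows = strains, columns = days)
m <- matrix(c(-4.0, NA, -3.0,
              -2.5, NA, -5.0),
            nrow = 3,
            dimnames = list(c("Arcobacter", "Bacillus", "Bordetella"),
                            c("day.0", "day.2")))

na_col <- "grey80"  # assumption: any colour outside the main palette

p <- levelplot(
  m,
  scales = list(x = list(rot = 45, alternating = 2)),  # x labels on top
  xlab = "Strains", ylab = "Days",
  panel = function(...) {
    panel.fill(col = na_col)   # background shows through wherever a cell is NA
    panel.levelplot(...)       # regular cells are drawn on top
  },
  # Separate one-entry legend for the NA colour, below the plot
  key = list(space = "bottom",
             rectangles = list(col = na_col),
             text = list("NA"))
)
print(p)
```

The regular colorkey still describes the numeric range, while the extra key explains the grey cells.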
