I have a simple data set with two groups and a value for each group at 4 different time points. I want to display this data set as grouped boxplots over time but ggplot2 doesn't separate the time points.
This is my data:
matrix
Replicate Line Day Treatment X A WT Marker Proportion
1 C 10 low NA HuCHuD_Pos 8.62
2 C 10 low NA HuCHuD_Pos NA
1 C 18 low NA HuCHuD_Pos 30.50
3 C 18 low NA HuCHuD_Pos NA
2 C 18 low NA HuCHuD_Pos NA
1 C 50 low NA HuCHuD_Pos 26.10
2 C 50 low NA HuCHuD_Pos 31.90
1 C 80 low NA HuCHuD_Pos 12.70
2 C 80 low NA HuCHuD_Pos 26.20
1 C 10 normal NA HuCHuD_Pos NA
2 C 10 normal NA HuCHuD_Pos 17.20
1 C 18 normal NA HuCHuD_Pos 3.96
2 C 18 normal NA HuCHuD_Pos NA
1 C 50 normal NA HuCHuD_Pos 25.60
2 C 50 normal NA HuCHuD_Pos 17.50
1 C 80 normal NA HuCHuD_Pos 19.00
NA C 80 normal NA HuCHuD_Pos NA
And this is my code:
matrix = as.data.frame(subset(data.long, Line == line_single & Marker == marker_single & Day != "30"))
pdf(paste(line_name_single, marker_name_single, ".pdf"), width=10, height=10)
plot <-
ggplot(data=matrix,aes(x=Day, y=Proportion, group=Treatment, fill=Treatment)) +
geom_boxplot(position=position_dodge(1))
print(plot)
dev.off()
What am I doing wrong?
[Image: what I want]
[Image: what I get]
Thanks very much for your help!
Cheers,
Paula
Edit:
This is what a minimal reproducible example for your question could look like:
matrix <- structure(list(Day = c(10L, 10L, 18L, 18L, 18L, 50L, 50L, 80L, 80L, 10L, 10L, 18L, 18L, 50L, 50L, 80L, 80L),
Treatment = c("low", "low", "low", "low", "low", "low", "low", "low", "low", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal"),
Proportion = c(8.62, NA, 30.5, NA, NA, 26.1, 31.9, 12.7, 26.2, NA, 17.2, 3.96, NA, 25.6, 17.5, 19, NA)),
class = "data.frame", row.names = c(NA, -17L))
Suggested answer using factor to discretize the variable Day:
ggplot(data=matrix,aes(x=factor(Day), y=Proportion, fill=Treatment)) +
geom_boxplot(position=position_dodge(1)) +
labs(x ="Day")
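Explanation: If we map a continuous variable to the 'x' axis of a box plot, ggplot2 does not treat the axis as discrete, so it does not split the boxes by the values of that variable. With group=Treatment as the only grouping, we therefore get one box per treatment spanning all days. If we convert the variable to something discrete, like a factor or a string, and drop the explicit group aesthetic, ggplot2 groups by every combination of Day and Treatment and we get the desired behavior.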
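If Day should stay numeric on the x axis (so that the spacing between days reflects time), one option is to group by every Day/Treatment combination explicitly. The following is a minimal sketch, assuming ggplot2 is installed; the data frame name `dat` is an illustrative choice and not part of the original answer:

```r
library(ggplot2)

# Same values as the reproducible example above, stored under the name `dat`
# (a name chosen here to avoid shadowing base::matrix).
dat <- data.frame(
  Day = c(10, 10, 18, 18, 18, 50, 50, 80, 80, 10, 10, 18, 18, 50, 50, 80, 80),
  Treatment = rep(c("low", "normal"), times = c(9, 8)),
  Proportion = c(8.62, NA, 30.5, NA, NA, 26.1, 31.9, 12.7, 26.2,
                 NA, 17.2, 3.96, NA, 25.6, 17.5, 19, NA)
)

# One group per Day/Treatment combination; Day itself stays continuous.
day_by_treatment <- interaction(dat$Day, dat$Treatment)

p <- ggplot(dat, aes(x = Day, y = Proportion, fill = Treatment,
                     group = interaction(Day, Treatment))) +
  geom_boxplot()
```

Printing `p` draws the plot. Whether the boxes look good at their default width depends on how far apart the days are, so the factor(Day) version is usually the simpler choice when equal spacing is acceptable.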
Also, if you share your data with dput or one of the techniques described here, it is much easier to find and test an answer than when working from a printed table as in the question (at least I couldn't figure out how to load that example data).
P.S. I think it's a bit confusing to name a variable of class data.frame 'matrix' since matrix is its own data type in R... ;)
I am trying to delete the rows in my dataset that contain NAs, but none of the functions work. What could be the reason?
Here is a sample of my code:
Site_cov<- read.csv("site_cov.csv")
colnames(Site_cov)<- c("Point", "Basal", "Short.Saps", "Tall.Saps")
head(Site_cov)
Point Basal Short.Saps Tall.Saps
1 DEL001 Na 2 0
2 DEL002 Na 1 6
3 DEL003 Na 0 5
4 DEL004 10 21 22
I thought that the mix of upper- and lower-case NAs could be the problem, so this is what I ran:
Site_cov$Basal<-toupper(Site_cov$Basal)
Site_cov$Short.Saps<-toupper(Site_cov$Short.Saps)
Site_cov$Tall.Saps<-toupper(Site_cov$Tall.Saps)
Then I try to delete the NAs:
Site_cov_NA <- Site_cov[complete.cases(Site_cov[ , c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
But the NAs are still there:
head(Site_cov_NA)
Point Basal Short.Saps Tall.Saps
1 DEL001 NA 2 0
2 DEL002 NA 1 6
3 DEL003 NA 0 5
4 DEL004 10 21 22
5 DEL005 60 8 17
6 DEL006 80 17 13
You have 'Na' strings, which are fake NAs (after toupper they are 'NA' strings, still not real missing values). Replace them with real NAs and your code should work.
dat <- replace(dat, dat == 'Na', NA)
dat[complete.cases(dat[, c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
# Point Basal Short.Saps Tall.Saps
# 4 DEL004 10 21 22
Data:
dat <- structure(list(Point = c("DEL001", "DEL002", "DEL003", "DEL004"
), Basal = c("Na", "Na", "Na", "10"), Short.Saps = c(2L, 1L,
0L, 21L), Tall.Saps = c(0L, 6L, 5L, 22L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
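An alternative to replacing the strings afterwards is to declare them as missing when the file is read, using the na.strings argument of read.csv. This is a minimal sketch; the CSV text below is a hypothetical reconstruction of the first rows shown in the question, not the actual file:

```r
# Hypothetical file contents, reconstructed from the head() output in the question.
csv_text <- "Point,Basal,Short.Saps,Tall.Saps
DEL001,Na,2,0
DEL002,Na,1,6
DEL003,Na,0,5
DEL004,10,21,22"

# Treat both "NA" and "Na" as missing values at import time.
site_cov <- read.csv(text = csv_text, na.strings = c("NA", "Na"))

# Basal is now numeric with real NAs, so complete.cases() drops those rows.
site_cov_complete <- site_cov[complete.cases(site_cov), ]
```

With a real file, `text = csv_text` would be replaced by the file name. A side benefit is that Basal is read as a number rather than as text.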
Try the complete.cases() function (https://stat.ethz.ch/R-manual/R-patched/library/stats/html/complete.cases.html)
try <- data.frame("a"=c(1,3,NA,NA), "b"=c(3,5,2,3))
try1<-try[complete.cases(try),]
try1
For easier explanation, I'm going to use a smaller example.
I have two data frames:
DF1: T01 T02 T03 T04 T05
1 15 20 48 25 5
2 12 18 35 30 12
3 13 15 50 60 42
DF2: MEDIAN SD
T01 13 1.24
T02 18 2.05
T03 45 6.64
T04 30 15.45
T05 12 16.04
What I want to do is create a loop that adds a dummy variable to DF1 for each variable. The dummy takes the value 1 if DF1$T01 is almost equal (≈) to DF2$MEDIAN[1] and 0 if it is not; the loop then moves on to T02, T03, and so on through the last variable.
So far I haven't been able to write a loop that does this (I'm not very good at writing loops). I did manage to create the dummy for one of the variables (T01), but the real DF has over 40 variables, so doing it by hand is not efficient at all. What I have right now is:
DF1$dummyt01 <- ifelse(almost.equal(DF1$T01, DF2$MEDIAN[1], tolerance = 2),1,0)
Expected outcome:
DF1: T01 T02 T03 T04 T05 dummyT01 dummyT02 ... dummyT05
1 15 20 48 25 5 1 1 ... 0
2 12 18 35 30 12 1 1 ... 1
3 13 15 50 60 42 1 0 ... 0
Note: I'm not a native English speaker. Sorry for any mistakes.
EDIT: Expected Outcome.
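We may use tidyverse. Loop across the columns of 'DF1', get the name of the column currently being processed with cur_column(), and use it to look up the matching 'MEDIAN' element in 'DF2' (whose row names are the variable names). The comparison with almost.equal returns a logical vector, which is coerced to binary with as.integer or +. The .names argument adds the prefix 'dummy' so that the results are created as new columns. Note that the code below uses tolerance = 1 rather than the tolerance = 2 from the question, which is why the output differs from the expected outcome.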
library(dplyr)
library(berryFunctions)
DF1 <- DF1 %>%
mutate(across(everything(), ~ +(almost.equal(.,
DF2[cur_column(), "MEDIAN"], tolerance = 1)),
.names = "dummy{.col}"))
-output
DF1
T01 T02 T03 T04 T05 dummyT01 dummyT02 dummyT03 dummyT04 dummyT05
1 15 20 48 25 5 0 0 0 0 0
2 12 18 35 30 12 1 1 0 1 1
3 13 15 50 60 42 1 0 0 0 0
Or using a for loop
for(i in seq_along(DF1))
DF1[paste0('dummy', names(DF1)[i])] <- +(almost.equal(DF1[[i]],
DF2[names(DF1)[i], "MEDIAN"], tolerance = 1))
data
DF1 <- structure(list(T01 = c(15L, 12L, 13L), T02 = c(20L, 18L, 15L),
T03 = c(48L, 35L, 50L), T04 = c(25L, 30L, 60L), T05 = c(5L,
12L, 42L)), class = "data.frame", row.names = c("1", "2",
"3"))
DF2 <- structure(list(MEDIAN = c(13L, 18L, 45L, 30L, 12L), SD = c(1.24,
2.05, 6.64, 15.45, 16.04)), class = "data.frame", row.names = c("T01",
"T02", "T03", "T04", "T05"))
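For comparison, the same idea can be sketched in base R without berryFunctions, under the assumption that "almost equal" means an absolute difference of at most `tol` (a name introduced here for illustration; 2 mirrors the tolerance used in the question):

```r
DF1 <- data.frame(T01 = c(15, 12, 13), T02 = c(20, 18, 15), T03 = c(48, 35, 50),
                  T04 = c(25, 30, 60), T05 = c(5, 12, 42))
DF2 <- data.frame(MEDIAN = c(13, 18, 45, 30, 12),
                  row.names = c("T01", "T02", "T03", "T04", "T05"))

tol <- 2            # assumed meaning of "almost equal": |value - median| <= tol
vars <- names(DF1)  # fix the column list before new columns are added

# One 0/1 column per variable, looked up in DF2 by row name.
dummies <- sapply(vars, function(v) {
  as.integer(abs(DF1[[v]] - DF2[v, "MEDIAN"]) <= tol)
})
colnames(dummies) <- paste0("dummy", vars)

DF1 <- cbind(DF1, dummies)
```

Storing the column names in `vars` first keeps the code safe to extend, because the dummy columns added at the end are never themselves compared against DF2.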
Note: This question is a follow up to a previous question: r - Finding closest coordinates between two large data sets.
I am aiming to identify the nearest entry in dataset 2 to each entry in dataset 1, based on the coordinates in both datasets. Dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and dataset 2 contains 4,500 rows (all 4,500 coordinates unique).
The previously referenced post contains a solution to the problem, but it uses RANN::nn2, which works with Euclidean distance, whereas the aim is to use ellipsoidal (Vincenty) distance.
Current code:
df1[ , c(4,5)] <- as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))
df1[,4] <- df2[df1[, 4], 1]
# id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
# 1 1 52.88144 -2.873778 44 0.7990743
# 2 2 57.80945 -2.234544 5688 2.1676868
# 3 4 34.02335 -3.098445 61114 1.4758202
# 4 5 63.80879 -2.439163 23 4.2415854
# 5 6 53.68881 -7.396112 54 3.6445416
# 6 7 63.44628 -5.162345 23 2.3577811
# 7 8 21.60755 -8.633113 440 8.2123762
# 8 9 78.32444 3.813290 76 11.4936496
# 9 10 66.85533 -3.994326 55 1.9296370
# 10 3 51.62354 -8.906553 54 3.2180026
I suspect that the solution would involve geosphere::distVincentyEllipsoid but I am unsure as to how to integrate it into the existing code.
Data:
r details
platform x86_64-w64-mingw32
version.string R version 3.5.3 (2019-03-11)
data set 1 input (not narrowed down to unique coordinates)
df1 <- structure(list(id = c(1L, 2L, 4L, 5L,
6L, 7L, 8L, 9, 10L, 3L),
HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529,
63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207,
78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822,
-2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454,
-5.162345043546359, -8.63311254098095, 3.813289888829932,
-3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")
data set 2 input
df2 <- structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L,
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432,
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA,
10L), class = "data.frame")
Using distVincentyEllipsoid function:
library(geosphere)
t(
apply(
apply(df1[,c(3,2)], 1, function(mrow){distVincentyEllipsoid(mrow, df2[,c(3,2)])}),
2, function(x){ c(SRC_ID=df2[which.min(x),1],distance=min(x))}
)
)
SRC_ID distance
1 44 74680.48
2 5688 238553.51
3 61114 137385.18
4 23 340642.70
5 44 308458.73
6 23 256176.88
7 440 908292.28
8 76 1064419.47
9 55 185119.29
10 54 251580.45
Just use df1[,c(4,5)] <- t(apply(... to assign the values to the columns of df1.
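Because the question mentions that only about 1,800 of the 180,000 coordinates in dataset 1 are unique, it may pay off to compute the nearest neighbour once per unique coordinate and then map the result back to all rows. The following base R sketch illustrates that idea. It uses a haversine (spherical great-circle) distance as a self-contained stand-in, which is not the Vincenty ellipsoidal distance; geosphere::distVincentyEllipsoid could be swapped in where ellipsoidal accuracy is needed. The function names and the toy data are hypothetical:

```r
# Great-circle (haversine) distance in metres on a sphere of radius r.
# This is a stand-in for an ellipsoidal distance, not Vincenty itself.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

nearest_source <- function(df1, df2) {
  # Identify unique coordinates so each one is processed only once.
  key <- paste(df1$HIGH_PRCN_LAT, df1$HIGH_PRCN_LON)
  first <- !duplicated(key)
  uniq <- df1[first, c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")]

  # For each unique coordinate: row index in df2 and distance of the nearest point.
  best <- vapply(seq_len(nrow(uniq)), function(i) {
    d <- haversine_m(uniq$HIGH_PRCN_LON[i], uniq$HIGH_PRCN_LAT[i],
                     df2$HIGH_PRCN_LON, df2$HIGH_PRCN_LAT)
    c(which.min(d), min(d))
  }, numeric(2))

  # Map the per-coordinate results back to every row of df1.
  pos <- match(key, key[first])
  df1$SRC_ID <- df2$SRC_ID[best[1, pos]]
  df1$distance <- best[2, pos]
  df1
}

# Toy data (hypothetical): rows 1 and 3 share the same coordinates.
pts <- data.frame(id = 1:3,
                  HIGH_PRCN_LAT = c(52.5, 59, 52.5),
                  HIGH_PRCN_LON = c(-3, -5, -3))
src <- data.frame(SRC_ID = c(1L, 2L),
                  HIGH_PRCN_LAT = c(52, 60),
                  HIGH_PRCN_LON = c(-3, -5))
res <- nearest_source(pts, src)
```

The deduplication step is independent of the distance function, so the same structure applies whichever distance is used.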
Using rgeos::gDistance. This is Cartesian distance, but starting from the solution below I managed to post the updated answer above.
library(sp);library(rgeos)
#convert to spatial datasets
df1rgsp <- SpatialPointsDataFrame(df1[,c(3,2)], df1[,-c(3,2)])
df2rgsp <- SpatialPointsDataFrame(df2[,c(3,2)], data.frame(SRC_ID=df2[,1]))
#gDistance(..., byid=TRUE) returns one row per df2 point and one column per df1 point,
#so apply over the columns (margin 2) to get one result per row of df1
#find the minimum value and the corresponding row number of df2
#transpose the result so it becomes two columns and assign it to the columns of `df1`
df1[,c(4,5)] <- t( apply(gDistance(df1rgsp, df2rgsp, byid=TRUE), 2, function(x){
c(SRC_ID=which.min(x),distance=min(x))}))
#replace row numbers with `SRC_ID`
df1[,4] <- df2[as.integer(df1[, 4]), 1] #same as what you have in the Q
# id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
# 1 1 52.88144 -2.873778 44 0.7990743
# 2 2 57.80945 -2.234544 5688 2.1676868
# 3 4 34.02335 -3.098445 61114 1.4758202
# 4 5 63.80879 -2.439163 23 4.2415854
# 5 6 53.68881 -7.396112 54 3.6445416
# 6 7 63.44628 -5.162345 23 2.3577811
# 7 8 21.60755 -8.633113 440 8.2123762
# 8 9 78.32444 3.813290 76 11.4936496
# 9 10 66.85533 -3.994326 55 1.9296370
# 10 3 51.62354 -8.906553 54 3.2180026
I have 2 data frames in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 data frames are matched by a primary key ('pid'). dfnew is a subset of dfold: all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). dfold also has additional variables, which I will need in the analysis phase. I would like to merge the 2 data frames into dfmerge so that the common variables are updated from dfnew to dfold while the other pre-existing variables in dfold are retained. I have tried merge(), match(), dplyr, and sqldf, but I obtain either a dfmerge with only the 75 updated variables (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but rather long (10 lines) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one is available?
Thank you in advance
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the data frames:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[match(dfold$pid, dfnew$pid)], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[match(dfold$pid, dfnew$pid)], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, restrict cols to the columns the two data frames share:
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
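If the goal is to overwrite every common variable with the value from dfnew wherever a matching pid exists (rather than filling only the NAs), a match-based assignment is one way to do it. This is a minimal sketch using data modelled on the printed example above; the rows of dfnew are deliberately shuffled here to show that the lookup goes by pid and not by row position:

```r
dfold <- data.frame(
  Country = c("France", "Spain", "Germany", "Spain", "Germany"),
  Age = c(NA, 27, 30, 38, 40),
  Salary = c(72000, 48000, 54000, 61000, NA),
  Purchased = c("No", "Yes", "No", "No", "Yes"),
  pid = 1:5
)
# Illustrative dfnew with rows in a different order than dfold.
dfnew <- data.frame(
  Age = c(40, 38, 30, 27, 44),
  Salary = c(70000, 61000, 54000, 48000, 72000),
  pid = c(5, 4, 3, 2, 1)
)

# Columns present in both data frames, excluding the key.
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")

# Row of dfnew that corresponds to each row of dfold (NA if there is none).
pos <- match(dfold$pid, dfnew$pid)
has_new <- !is.na(pos)

# Overwrite the common columns for rows that have a match; keep the rest.
dfold[has_new, cols] <- dfnew[pos[has_new], cols]
```

Rows of dfold whose pid does not appear in dfnew are left untouched, and the columns that exist only in dfold (Country, Purchased) are never modified.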
I have looked through other posts and I think I have an idea of what I could do, but I want to be clear!
I have a very large data frame that contains 4 variables and a number of rows.
Chain ResId ResNum Energy
1 C O17 500 -37.03670
2 A ARG 8 -0.84560
3 A LEU 24 -0.56739
4 A ASP 25 -0.98583
5 B ARG 8 -0.64880
6 B LEU 24 -0.58380
7 B ASP 25 -0.85930
Each row contains Chain (A, B, or C), ResId, ResNum, and Energy. I would like to sort this data so that all of the energy values belonging to a specific ResId and ResNum, across the chains, are clustered together. By cluster I mean that either all of the values for "ARG 8" are grouped or all of the rows containing "ARG 8" are grouped; I don't know which is more efficient. Ideally, I would like the output for all residues to be
ARG 8
0.000
0.000
0.000
where the "0.000" are the energy values for ARG 8 or O17 and so on.
Sorry for the header breaks, I wanted the data to be clean, but I can't insert images.
data
structure(list(Chain = structure(c(3L, 1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), ResId = structure(c(4L,
1L, 3L, 2L, 1L, 3L, 2L), .Label = c("ARG", "ASP", "LEU", "O17"
), class = "factor"), ResNum = c(500L, 8L, 24L, 25L, 8L, 24L,
25L), Energy = c(-37.0367, -0.8456, -0.56739, -0.98583, -0.6488,
-0.5838, -0.8593)), .Names = c("Chain", "ResId", "ResNum", "Energy"
), class = "data.frame", row.names = c(NA, -7L))
If you want to convert to wide format
library(reshape2)
dcast(df, ResId+ResNum~paste0('Energy.',Chain), value.var='Energy')
# ResId ResNum Energy.A Energy.B Energy.C
#1 ARG 8 -0.84560 -0.6488 NA
#2 ASP 25 -0.98583 -0.8593 NA
#3 LEU 24 -0.56739 -0.5838 NA
#4 O17 500 NA NA -37.0367
After your edit, the output you are most likely looking for is:
library(reshape2)
dcast(df, ResId~Chain, value.var= 'Energy')
ResId A B C
1 ARG -0.84560 -0.6488 NA
2 ASP -0.98583 -0.8593 NA
3 LEU -0.56739 -0.5838 NA
4 O17 NA NA -37.0367
This will put the values together. You can further specify based on your desired output.
df[order(df$ResId), ]
Chain ResId ResNum Energy
2 A ARG 8 -0.84560
5 B ARG 8 -0.64880
4 A ASP 25 -0.98583
7 B ASP 25 -0.85930
3 A LEU 24 -0.56739
6 B LEU 24 -0.58380
1 C O17 500 -37.03670
#With dplyr
library(dplyr)
df %>%
arrange(ResId)
Chain ResId ResNum Energy
1 A ARG 8 -0.84560
2 B ARG 8 -0.64880
3 A ASP 25 -0.98583
4 B ASP 25 -0.85930
5 A LEU 24 -0.56739
6 B LEU 24 -0.58380
7 C O17 500 -37.03670
Data
df <- read.table(text = '
Chain ResId ResNum Energy
C O17 500 -37.0367
A ARG 8 -0.8456
A LEU 24 -0.56739
A ASP 25 -0.98583
B ARG 8 -0.6488
B LEU 24 -0.5838
B ASP 25 -0.8593', header = TRUE)
Try this:
df <- df[order(df$Chain, df$ResId, df$ResNum),]
where df is the name of your dataframe. This should order it for you.
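To get the layout sketched in the question (a residue label followed by its energy values), splitting the Energy column by residue is another option. This is a minimal base R sketch; the names `residue` and `energy_by_residue` are introduced here for illustration:

```r
df <- data.frame(
  Chain = c("C", "A", "A", "A", "B", "B", "B"),
  ResId = c("O17", "ARG", "LEU", "ASP", "ARG", "LEU", "ASP"),
  ResNum = c(500, 8, 24, 25, 8, 24, 25),
  Energy = c(-37.0367, -0.8456, -0.56739, -0.98583, -0.6488, -0.5838, -0.8593)
)

# Label each row by residue name and number, e.g. "ARG 8".
residue <- paste(df$ResId, df$ResNum)

# Named list with one vector of energies per residue.
energy_by_residue <- split(df$Energy, residue)

# Print each residue label followed by its energy values, one per line.
for (label in names(energy_by_residue)) {
  cat(label, energy_by_residue[[label]], sep = "\n")
}
```

The list form is convenient for follow-up calculations, for example sapply(energy_by_residue, mean) for a per-residue average.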