I have two data frames sharing the same row IDs but with different columns.
Here is an example:
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure with chrom and coord combined on the x axis (so 3 points) and, for each x value, 2 boxplots side by side corresponding to the two data frames.
What is the best way of doing this? Should I somehow merge the two data frames into one and loop over the boxplot rendering 3 columns at a time?
Any idea how this can be done?
The problem is that the two data frames have the same number of rows but can differ in the number of columns:
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the data frames in order to get the same number of columns, but got lost on how to do this properly.
Thanks in advance
UPDATE
This is what I tried to do:
I merged the chrom and coord columns together to create a single ID.
I used reshape to melt the data frames.
I merged the 2 melted data frames into a single one.
The head looks like this
I have two variables, A2 and A4, corresponding to the 2 data frames.
Then I created a boxplot using this:
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem but the boxplot looks very busy with 99 x values with 2 boxplots each
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
Then I would combine and reshape the data to make it easier to plot. Here's what I'd do:
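library(reshape2)  # provides melt()
library(ggplot2)   # used for the plot below
# Turn the sample columns into rows, keeping the identifying columns
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
# Tag each table with its source and stack them
mm<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
# Build a chrom:coord factor whose levels follow numeric chrom/coord order
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(mm[order(mm[[1]],mm[[2]]),1:2]), sep=":")))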
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from and rbind them together. And finally I do a bit of messy work to make a factor out of the chr/coord pairs sorted in the correct order.
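The factor-building step is the least obvious part, so here is a more spelled-out sketch of the same idea. It assumes a long-format table with chrom and coord columns; the toy `mm` below only reuses the coordinates from the question, and `key` and `lev` are names introduced just for illustration.

```r
# Toy long-format data standing in for the melted table
mm <- data.frame(chrom = c(11L, 10L, 10L, 10L),
                 coord = c(104972190L, 38894616L, 3178881L, 3178881L))

# Unique chrom/coord pairs, sorted numerically by chrom and then coord
key <- unique(mm[order(mm$chrom, mm$coord), c("chrom", "coord")])

# Level labels in that sorted order
lev <- paste(key$chrom, key$coord, sep = ":")

# The factor keeps the numeric ordering instead of sorting the labels as text
mm$pos <- factor(paste(mm$chrom, mm$coord, sep = ":"), levels = lev)
levels(mm$pos)
```

Setting the levels explicitly matters because factor() otherwise sorts the pasted labels as strings, which in general does not match numeric coordinate order.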
With all that done, I'll make the plot like
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")
and it looks like
I have a dataset (test_df) that looks like:
Species TreatmentA TreatmentB X0  L  K
Apple   Hot        Cloudy      1  2  3
Apple   Cold       Cloudy      4  5  6
Orange  Hot        Sunny       7  8  9
Orange  Cold       Sunny      10 11 12
I would like to display the effect of the treatments by using the X0, L, and K values as coefficients in a standard logistic function and plotting the same species across various treatments on the same plot. I would like a grid of plots with the logistic curves for each species on its own plot, with each treatment then being grouped by color within every plot. In the above example, Plot1.Grid1 would have 2 logistic curves corresponding to Apple Hot and Apple Cold, and Plot1.Grid2 would have 2 logistic curves corresponding to Orange Hot and Orange Cold.
The below code will create a single logistic function curve which can then be layered, but manually adding the layers for multiple treatments is tedious.
testx0 <- 1
testL <- 2
testk <- 3
days <- seq(from = -5, to = 5, by = 1)
functionmultitest <- function(x,testL,testK,testX0) {
(testL)/(1+exp((-1)*(testK) *(x - testX0)))
}
ggplot()+aes(x = days, y = functionmultitest(days,testL,testk,testx0))+geom_line()
The method described in (https://statisticsglobe.com/draw-multiple-function-curves-to-same-plot-in-r) works for data frames with few species or treatments, but it becomes very tedious to define each curve individually if you have many treatments/species. Is there a way to programmatically pass the list of coefficients and have ggplot handle the grouping?
Thank you!
Your current code shows how to compute the curve for a single row in your data frame. What you can do is pre-compute the curve for each row and then feed to ggplot.
Setup:
# Packages
library(ggplot2)
# Your days vector
days <- seq(from = -5, to = 5, by = 1)
# Your sample data frame above
df = structure(list(Species = c("Apple", "Apple", "Orange", "Orange"
), TreatmentA = c("Hot", "Cold", "Hot", "Cold"), TreatmentB = c("Cloudy",
"Cloudy", "Sunny", "Sunny"), X0 = c(1L, 4L, 7L, 10L), L = c(2L,
5L, 8L, 11L), K = c(3L, 6L, 9L, 12L)), class = "data.frame", row.names = c(NA,
-4L))
# Your function
functionmultitest <- function(x,testL,testK,testX0) {
(testL)/(1+exp((-1)*(testK) *(x - testX0)))
}
We'll "expand" each row of your data frame with the days vector:
# Define first a data frame of days:
days_df = data.frame(days = days)
# Perform a cross join: by = NULL returns every combination of days and rows of df
df_all = merge(days_df, df, by = NULL)
At this point, you will have a data frame where each original row is repeated once for every value in days.
Now, just as you did for one row, we'll compute the value of the function for each row and store in the df_all as result:
df_all$result = mapply(functionmultitest, df_all$days, df_all$L, df_all$K, df_all$X0)
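As a side note, the logistic formula only uses arithmetic and exp(), which are already vectorized in R, so calling the function directly on the columns should give the same result as mapply. A minimal sketch, with a toy `curves` data frame and a shortened `logistic` function that are illustrative names rather than part of the original code:

```r
# Same formula as functionmultitest, written with shorter argument names
logistic <- function(x, L, K, X0) L / (1 + exp(-K * (x - X0)))

# Toy expanded table: three days for one set of coefficients
curves <- data.frame(days = c(-1, 0, 1), L = 2, K = 3, X0 = 1)

# Element-by-element evaluation, as in the mapply call above
via_mapply <- mapply(logistic, curves$days, curves$L, curves$K, curves$X0)

# Direct vectorized call on whole columns
direct <- logistic(curves$days, curves$L, curves$K, curves$X0)

all.equal(via_mapply, direct)
```

Either form works here; the direct call simply avoids the per-element function calls on large tables.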
I'm not sure how you intended to handle TreatmentA and TreatmentB, so I'll just combine them for illustration purposes:
df_all$combined_treatment = paste0(df_all$TreatmentA, "-", df_all$TreatmentB)
We can now feed this data frame to ggplot, set the color to be combined_treatment, and use the facet_grid function to split by species
ggplot(data = df_all, aes(x = days, y = result, color = combined_treatment))+
geom_line() +
facet_grid(Species ~ ., scales = "free")
The result is as follows:
I have a data frame that lists a bunch of objects and their values.
Name NumCpu MemoryMB
1 BEAVERTN-SVR-C5 1 3072
2 BEAVERTN-SVR-UK 4 4096
3 BEAVERTN-SVR-JV 1 1024
I want to take my data frame and create a new column that groups these numbers by ranges.
Ranges: 0-1024, 1025-2048, 2049-4096
And then output the counts of those ranges into a new data frame:
Range Count
0-1024 1
1025-2048 0
2049-4096 2
I learn by doing, so this is a real work problem I'm trying to use R to solve. Any help greatly appreciated. Thank you!
Data
DF <- structure(list(Name = c("BEAVERTN-SVR-C5", "BEAVERTN-SVR-UK",
"BEAVERTN-SVR-JV"), NumCpu = c(1L, 4L, 1L), MemoryMB = c(3072L,
4096L, 1024L), Range = structure(c(3L, 3L, 1L), .Label = c("(0,1.02e+03]",
"(1.02e+03,2.05e+03]", "(2.05e+03,4.1e+03]"), class = "factor")), .Names = c("Name",
"NumCpu", "MemoryMB", "Range"), row.names = c("1", "2", "3"), class = "data.frame")
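The Range column in the dput above looks like the output of cut(), so one plausible route to the requested counts is cut() followed by table(). The sketch below rebuilds the three sample rows and assumes the range boundaries from the question; `counts` is a name introduced for illustration.

```r
# The three sample rows from the question
DF <- data.frame(
  Name = c("BEAVERTN-SVR-C5", "BEAVERTN-SVR-UK", "BEAVERTN-SVR-JV"),
  NumCpu = c(1L, 4L, 1L),
  MemoryMB = c(3072L, 4096L, 1024L)
)

# Bin MemoryMB into right-closed intervals: [0,1024], (1024,2048], (2048,4096]
DF$Range <- cut(DF$MemoryMB,
                breaks = c(0, 1024, 2048, 4096),
                labels = c("0-1024", "1025-2048", "2049-4096"),
                include.lowest = TRUE)

# Count rows per bin; table() keeps empty factor levels, so a zero count is reported
counts <- as.data.frame(table(Range = DF$Range), responseName = "Count")
counts
```

Because Range is a factor with all three levels, the empty 1025-2048 bin still appears in the result with a count of 0.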
I have a list that's 1314 element long. Each element is a data frame consisting of two rows and four columns.
Game.ID Team Points Victory
1 201210300CLE CLE 94 0
2 201210300CLE WAS 84 0
I would like to use the lapply function to compare points for each team in each game, and change Victory to 1 for the winning team.
I'm trying to use this function:
test_vic <- lapply(all_games, function(x) {if (x[1,3] > x[2,3]) {x[1,4] = 1}})
But the result it produces is a list 1314 elements long with just the Game ID and either a 1 or a null, a la:
$`201306200MIA`
[1] 1
$`201306160SAS`
NULL
How can I fix my code so that each data frame maintains its shape? (I'm guessing solving the null part involves if-else, but I need to figure out the right syntax.)
Thanks.
Try
lapply(all_games, function(x) {x$Victory[which.max(x$Points)] <- 1; x})
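For comparison, here is a sketch of the if/else form the question was reaching for. The original function returned the value of the if expression itself (1 when the condition held, NULL otherwise) instead of the data frame, so the key change is ending the function with x. The toy `games` list and the `mark_winner` name are made up for illustration, and sending ties to the second row is an arbitrary assumption.

```r
# Toy list with the same shape as all_games (two games, two rows each)
games <- list(
  data.frame(Game.ID = "201210300CLE", Team = c("CLE", "WAS"),
             Points = c(94L, 84L), Victory = 0L),
  data.frame(Game.ID = "201210300CME", Team = c("CLE", "WAS"),
             Points = c(90L, 92L), Victory = 0L)
)

mark_winner <- function(x) {
  if (x$Points[1] > x$Points[2]) {
    x$Victory[1] <- 1L
  } else {
    x$Victory[2] <- 1L  # assumption: a tie is credited to the second row
  }
  x  # return the whole data frame, not the value of the if expression
}

result <- lapply(games, mark_winner)
result
```

The which.max version above is shorter and does the same thing for two-row games, except that it credits a tie to the first row.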
Or another option would be to convert the list to data.table by using rbindlist and then do the conversion
library(data.table)
rbindlist(all_games)[,Victory:= +(Points==max(Points)) ,Game.ID][]
data
all_games <- list(structure(list(Game.ID = c("201210300CLE",
"201210300CLE"
), Team = c("CLE", "WAS"), Points = c(94L, 84L), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
class = "data.frame", row.names = c("1",
"2")), structure(list(Game.ID = c("201210300CME", "201210300CME"
), Team = c("CLE", "WAS"), Points = c(90, 92), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
row.names = c("1", "2"), class = "data.frame"))
You could try dplyr:
library(dplyr)
all_games %>%
bind_rows() %>%
group_by(Game.ID) %>%
mutate(Victory = row_number(Points)-1)
Which gives:
#Source: local data frame [4 x 4]
#Groups: Game.ID
#
# Game.ID Team Points Victory
#1 201210300CLE CLE 94 1
#2 201210300CLE WAS 84 0
#3 201210300CME CLE 90 0
#4 201210300CME WAS 92 1
I have a data frame with a column that contains some elements that are lists. I would like to find out which rows of the data frame contain a keyword in that column.
The data frame, df, looks a bit like this
idstr tag
1 wl
2 other.to
3 other.from
4 c("wl","other.to")
5 wl
6 other.wl
7 c("ll","other.to")
The goal is to assign all of the rows with 'wl' in their tag to a new data frame. In this example, I would want a new data frame that looks like:
idstr tag
1 wl
4 c("wl","other.to")
5 wl
I tried something like this
df_wl <- df[which(is.element('wl',df$tag)),]
but this only returns the first element of the data frame (whether or not it contains 'wl'). I think the trouble lies in iterating through the rows and applying the "is.element" function. Here are two calls to the function and their results:
is.element('wl',df$tag[[4]])  # TRUE
is.element('wl',df$tag[4])    # FALSE
How do you suggest I iterate through the data frame to assign df_wl its proper values?
PS: Here's the dput:
structure(list(idstr = 1:7, tag = structure(c(6L, 5L, 4L, 2L, 6L, 3L, 1L), .Label = c("c(\"ll\",\"other.to\")", "c(\"wl\",\"other.to\")", "other.wl", "other.from", "other.to", "wl"), class = "factor")), .Names = c("idstr", "tag"), row.names = c(NA, -7L), class = "data.frame")
Based on your dput data, this may work. The regular expression (^wl$)|(\"wl\") matches a tag that is exactly wl, or any occurrence of "wl" wrapped in double quotes.
df[grepl("(^wl$)|(\"wl\")", df$tag),]
# idstr tag
# 1 1 wl
# 4 4 c("wl","other.to")
# 5 5 wl
I imagine that there's some way to do this with sqldf, though I'm not familiar enough with that package's syntax to get it to work. Here's the issue:
I have two data frames, each of which describe genomic regions and contain some other data. I have to combine the two if the region described in the one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")
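One base R way to approach this, offered as a sketch rather than a definitive solution, is to cross join the two data frames and keep the pairs where either endpoint from g falls inside the start-end range from l. The data below is a cut-down copy of the sample, and `all_pairs`, `hit`, and `merged` are names introduced for illustration.

```r
# Cut-down versions of g and l from the question
g <- data.frame(start_position = c(22926178L, 22887317L),
                end_position   = c(22928035L, 22889471L))
l <- data.frame(name  = c("GRMZM2G001024", "GRMZM2G441511"),
                start = c(11149187L, 22762447L),
                end   = c(11511198L, 23762447L))

# Cross join: every row of g paired with every row of l
all_pairs <- merge(g, l, by = NULL)

# Keep pairs where start_position OR end_position lies within [start, end]
hit <- with(all_pairs,
            (start_position >= start & start_position <= end) |
            (end_position   >= start & end_position   <= end))
merged <- all_pairs[hit, ]
merged
```

The cross join creates nrow(g) * nrow(l) rows before filtering, so this is only practical for moderately sized tables; a dedicated interval join would likely be a better fit for genome-scale data.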