I have a dataframe with 2 columns namely p1 and p2. I need to split the p1 column into a range of values like 10-50, 50-100, 100-150, etc. After splitting the values of p1, the corresponding values of p2 should be printed. The sample input is given below.
df = data.frame(p1 = c(10,20,70,80,150,200),p2 = c(1000, 1111.7, 15522.1, 15729.3,18033.8,19358.2))
The sample output is attached below.
When I try to do this on a large dataset, the p2 values get mixed up with p1.
One way of doing it:
library(dplyr)
df %>%
mutate(
p1 = cut(p1, breaks = 0:(max(p1) %/% 50 + 1) * 50, include.lowest = TRUE)
) %>%
group_by(p1) %>%
summarise(p2 = list(p2))
Maybe this?
setNames(
aggregate(
p2 ~ cut(p1, c(10, 50, 100, 150, 200), include.lowest = TRUE),
df,
c
), names(df)
)
gives
p1 p2
1 [10,50] 1000.0, 1111.7
2 (50,100] 15522.1, 15729.3
3 (100,150] 18033.8
4 (150,200] 19358.2
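To see how the bins are formed, here is a minimal base-R sketch using the sample data from the question. The 50-wide breaks starting at 0 and the use of split() are assumptions chosen for illustration, not part of either answer above.

```r
df <- data.frame(p1 = c(10, 20, 70, 80, 150, 200),
                 p2 = c(1000, 1111.7, 15522.1, 15729.3, 18033.8, 19358.2))

# Breaks at 0, 50, 100, ... up to the first multiple of 50 at or above max(p1)
breaks <- seq(0, ceiling(max(df$p1) / 50) * 50, by = 50)

# Intervals are closed on the right by default, so 150 falls in (100,150]
bins <- cut(df$p1, breaks = breaks, include.lowest = TRUE)

# One list element per bin, holding the matching p2 values
p2_by_bin <- split(df$p2, bins)
```

Because bins is computed from df$p1 row by row, each p2 value stays attached to the bin of its own row, however large the data frame is.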
So, let's say I have a 1000-row, 6-column dataframe with columns a1, a2, b1, b2, c1, c2. I want to run some t-tests on the a's, b's, and c's and get an output df with 3 columns for the t-values of a, b, and c and another three for the significance information for those values, making a total of 6 columns. The problem I have is with the rows: I want to loop over chunks of 20, giving a 50-row (1000/20), 6-column output df.
I have already tried creating an index column for my initial df which repeats a 1 for the first 20 rows, a 2 for the next 20 rows, and so on.
convert_n <- function(df) {
df <- df %T>% {.$n_for_t_tests = rep(c(1:(nrow(df)/20)), each = 20)}
}
df <- convert_n(df)
However, I can't seem to find a way to properly use the values in this column as indices for a "for" loop or any other kind of loop.
Below you can see the relevant code that creates a 1-row, 6-column df; I need to modify the [0:20] parts and create a loop that does this for each group of 20 rows and binds the results.
t_test_a <- t.test(df$a1[0:20], df$a2[0:20], paired = T, conf.level = 0.95)
t_test_b <- t.test(df$b1[0:20], df$b2[0:20], paired = T, conf.level = 0.95)
t_test_c <- t.test(df$c1[0:20], df$c2[0:20], paired = T, conf.level = 0.95)
t_tests_df <- data.frame(t_a = t_test_a$statistic[["t"]],
t_b = t_test_b$statistic[["t"]],
t_c = t_test_c$statistic[["t"]])
t_tests_df <- t_tests_df %T>% {.$dif_significance_a = ifelse(.$t_a >
2, "YES", "NO")} %T>%
{.$dif_significance_b = ifelse(.$t_b >
2, "YES", "NO")} %T>%
{.$dif_significance_c = ifelse(.$t_c >
2, "YES", "NO")} %>%
dplyr::select(t_a, dif_significance_a,
t_b, dif_significance_b,
t_c, dif_significance_c)
Thank you in advance for your help.
You can use split() and sapply():
set.seed(42)
df <- data.frame(a1 = sample(1000, 1000), a2 = sample(1000, 1000),
b1 = sample(1000, 1000), b2 = sample(1000, 1000),
c1 = sample(1000, 1000), c2 = sample(1000, 1000))
group <- gl(50, 20)
D <- split(df, group)
myt <- function(Di)
with(Di, c(at=t.test(a1, a2)$statistic, ap=t.test(a1, a2)$p.value,
bt=t.test(b1, b2)$statistic, bp=t.test(b1, b2)$p.value,
ct=t.test(c1, c2)$statistic, cp=t.test(c1, c2)$p.value))
sapply(D, FUN=myt) ### or
t(sapply(D, FUN=myt))
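The answer above runs unpaired tests and returns a matrix. If the paired tests and the 6-column data frame from the question are wanted, the same split() idea can be sketched as follows. The helper name paired_row, the sig_* column names, and the 0.05 threshold on the p-value are assumptions made for illustration, and rnorm() data is used instead of the sample() data above.

```r
set.seed(42)
df <- data.frame(a1 = rnorm(1000), a2 = rnorm(1000),
                 b1 = rnorm(1000), b2 = rnorm(1000),
                 c1 = rnorm(1000), c2 = rnorm(1000))

# Run the three paired t-tests on one 20-row chunk; return a one-row data frame
paired_row <- function(chunk) {
  ta <- t.test(chunk$a1, chunk$a2, paired = TRUE)
  tb <- t.test(chunk$b1, chunk$b2, paired = TRUE)
  tc <- t.test(chunk$c1, chunk$c2, paired = TRUE)
  data.frame(t_a = unname(ta$statistic), sig_a = ta$p.value < 0.05,
             t_b = unname(tb$statistic), sig_b = tb$p.value < 0.05,
             t_c = unname(tc$statistic), sig_c = tc$p.value < 0.05)
}

# gl(50, 20) labels rows 1-20 as group 1, rows 21-40 as group 2, and so on
chunks <- split(df, gl(nrow(df) / 20, 20))
result <- do.call(rbind, lapply(chunks, paired_row))
```

Flagging significance from the p-value rather than from t > 2 also catches large negative t-values, which the threshold in the question would miss.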
This is not the prettiest solution, but I did it with a for loop like this:
df <- data.frame(a1 = sample(1000, 1000),
a2 = sample(1000, 1000),
b1 = sample(1000, 1000),
b2 = sample(1000, 1000),
c1 = sample(1000, 1000),
c2 = sample(1000, 1000))
df_ttest <- data.frame(p_a = c(1:50),
t_a = c(1:50),
p_b = c(1:50),
t_b = c(1:50),
p_c = c(1:50),
t_c = c(1:50))
index <- 0:50 * 20   # chunk boundaries: 0, 20, 40, ..., 1000
for (i in 1:50) {
  # rows 1-20, 21-40, ...; the + 1 keeps neighbouring chunks from overlapping
  rows <- (index[i] + 1):index[i + 1]
  # paired tests of a1 vs a2 (and b, c), as asked in the question
  test_a <- t.test(df$a1[rows], df$a2[rows], paired = TRUE)
  test_b <- t.test(df$b1[rows], df$b2[rows], paired = TRUE)
  test_c <- t.test(df$c1[rows], df$c2[rows], paired = TRUE)
  df_ttest$p_a[i] <- test_a$p.value
  df_ttest$p_b[i] <- test_b$p.value
  df_ttest$p_c[i] <- test_c$p.value
  df_ttest$t_a[i] <- test_a$statistic
  df_ttest$t_b[i] <- test_b$statistic
  df_ttest$t_c[i] <- test_c$statistic
}
This gives a 50x6 dataframe with separate columns of p and t values for every 20-row chunk of a, b and c.
You could even go further and use a nested for loop over the columns of df_ttest to make this a bit prettier.
I want to collect the linear regression coefficients for each column ~ ind.
Here is my data:
temp <- data.frame(
ind = c(1:10),
`9891` = runif(10, 15, 75),
`7891` = runif(10, 15, 75),
`5891` = runif(10, 15, 75)
)
I had tried
result = data.frame()
cols <- colnames(temp)[-1]
for (code in cols) {
fit <- lm(temp[, code] ~ temp$ind)
coef <- coef(fit)['ind']
result$ind <- code
result$coef <- coef
}
But this doesn't work.
Can anyone fix my method, or provide a better solution?
Also, I was wondering if lapply() and summarise_at() can do the work.
Thank you!
Here is a summarise_at option
library(dplyr)
library(tidyr)
temp %>%
  summarise_at(vars(-contains("ind")), list(coef = ~list(lm(. ~ ind)$coef))) %>%
  unnest(cols = everything())
# X9891_coef X7891_coef X5891_coef
#1 25.927946 52.5668120 35.152330
#2 2.459137 0.3158741 1.013678
The first row gives the intercepts and the second row the slope coefficients.
Or, to extract only the slope coefficient and store the result in a long data.frame:
temp %>%
summarise_at(vars(-contains("ind")), list(coef = ~list(lm(. ~ ind)$coef[2]))) %>%
unnest(cols = everything()) %>%
stack() %>%
setNames(c("slope", "column"))
# slope column
# 1 2.4591375 X9891_coef
# 2 0.3158741 X7891_coef
# 3 1.0136783 X5891_coef
PS. It's always good practice to include a fixed random seed when working with random data to ensure reproducibility of results.
Sample data
set.seed(2018)
temp <- data.frame(
ind = c(1:10),
`9891` = runif(10, 15, 75),
`7891` = runif(10, 15, 75),
`5891` = runif(10, 15, 75)
)
You can use sapply
sapply(temp[-1], function(x) coef(lm(x ~ temp$ind))[2])
#X9891.temp$ind X7891.temp$ind X5891.temp$ind
# -0.01252979 -2.94773367 2.57816244
To get the final data frame, you could do
data.frame(ind = names(temp)[-1],
coef = sapply(temp[-1], function(x) coef(lm(x ~ temp$ind))[2]), row.names = NULL)
# ind coef
#1 X9891 -0.01252979
#2 X7891 -2.94773367
#3 X5891 2.57816244
where each row holds the slope for one column.
data
set.seed(1234)
temp <- data.frame(
ind = c(1:10),
`9891` = runif(10, 15, 75),
`7891` = runif(10, 15, 75),
`5891` = runif(10, 15, 75)
)
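As for the loop in the question, it fails for two reasons: result starts as a zero-row data frame, so result$ind <- code cannot assign a single value to it, and the slope from lm(temp[, code] ~ temp$ind) is named "temp$ind" rather than "ind", so coef(fit)['ind'] is NA. A minimal sketch of a repaired version with lapply(), assuming one row per column bound together at the end is acceptable:

```r
set.seed(1234)
temp <- data.frame(ind = 1:10,
                   `9891` = runif(10, 15, 75),
                   `7891` = runif(10, 15, 75),
                   `5891` = runif(10, 15, 75))

# data.frame() prefixes the numeric names with "X": X9891, X7891, X5891
cols <- colnames(temp)[-1]

rows <- lapply(cols, function(code) {
  # reformulate() builds e.g. X9891 ~ ind, so the slope is named "ind"
  fit <- lm(reformulate("ind", response = code), data = temp)
  data.frame(ind = code, coef = unname(coef(fit)["ind"]))
})

# Bind the one-row data frames into the final result
result <- do.call(rbind, rows)
```

Passing data = temp to lm() keeps the coefficient names clean and avoids mixing temp[, code] with temp$ind inside the formula.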
I'm trying to create a single graph that contains boxplots of gene expression for 3 different variant types (synonymous, missense, and nonsense). Currently, these variant types are separated into 3 different data frames, each of which contains a Gene, SampleID, and Expression column.
In order to plot all 3 boxplots on a single graph, I need to normalize all the expression data for each variant type, which means I need to get the z-scores. My question is, how do I do that and then how do I plot all 3 variant types on one graph?
I've come across the solution:
missense$Zscore <- ave(m$expr, m$Gene, FUN = scale)
nonsense$Zscore <- ave(n$expr, n$Gene, FUN = scale)
synonymous$Zscore <- ave(s$expr, s$Gene, FUN = scale)
Is this the right approach? If so, where do I go from here?
Example dataframe (missense):
SampleID Expression Gene
HSB100 5.239237 ENSG00000188976
HSB105 4.443808 ENSG00000188976
HSB104 4.425764 ENSG00000188976
HSB121 4.063259 ENSG00000188976
Use the scale() function to get z-scores.
missense <- data.frame(SampleID = c('HSB100', 'HSB105', 'HSB104', 'HSB121'),
Expression = c(5.239237, 4.443808, 4.425764, 4.063259),
Gene = c('ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976'))
missense$Zscore <- scale(missense$Expression)
missense
mean(missense$Zscore)
sd(missense$Zscore)
# Create fake data here
nonsense <-
data.frame(SampleID = c('HSB100', 'HSB105', 'HSB104', 'HSB121'),
Expression = c(1, 2, 3, 4),
Gene = c('ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976'))
nonsense$Zscore <- scale(nonsense$Expression)
synonymous <-
data.frame(SampleID = c('HSB100', 'HSB105', 'HSB104', 'HSB121'),
Expression = c(3, 4, 5, 6),
Gene = c('ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976'))
synonymous$Zscore <- scale(synonymous$Expression)
The trick is to bind all three data frames together and then plot using ggplot. I am not familiar with base plotting, but this is what I would do:
# Add identifier
missense$Type <- 'missense'
nonsense$Type <- 'nonsense'
synonymous$Type <- 'synonymous'
# Bind three together
data_all <- rbind(missense, nonsense, synonymous)
# Use ggplot to draw the boxplots
library(ggplot2)
ggplot(data = data_all, aes(x = Type, y = Zscore)) + geom_boxplot()
If all the genes are the same within each data frame, then ave is not needed, since there is only one group per data frame. Hence, you can run a simple calculation: m$Zscore <- scale(m$expr). From there, as @emilliman5 comments, plot all three vectors by passing a list, and use a named list to label the x-axis:
# WITH SEABORN COLORS
boxplot(list(missense=m$Zscore, nonsense=n$Zscore, synonymous=s$Zscore),
col = c("#4c72b0","#55a868","#c44e52"))
Alternatively, row-bind all the data frames and add a new variant_type indicator column. Then use ave, since the combined data frame now contains several groups. You can also use the formula interface of boxplot instead of list():
all_gene_df <- rbind(transform(m, variant_type='missense'),
transform(n, variant_type='nonsense'),
transform(s, variant_type='synonymous'))
all_gene_df$Zscore <- with(all_gene_df, ave(expr, variant_type, FUN = scale))
# WITH SEABORN COLORS
boxplot(Zscore ~ variant_type, data = all_gene_df,
col = c("#4c72b0","#55a868","#c44e52"),
main = "ZScore Boxplots by Variant Type",
xlab = "Variant type",
ylab = "ZScore")
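When a data frame does contain several genes, ave() standardises within each gene, which is what the approach in the question is aiming for. A small sketch with made-up data (the gene names and expression values are hypothetical); as.numeric() is used here only to drop the matrix attributes that scale() returns:

```r
set.seed(1)
expr_df <- data.frame(gene = rep(c("GENE_A", "GENE_B"), each = 5),
                      expr = c(rnorm(5, mean = 5), rnorm(5, mean = 50, sd = 10)))

# Z-score within each gene: (x - mean(x)) / sd(x), computed per group
expr_df$Zscore <- ave(expr_df$expr, expr_df$gene,
                      FUN = function(x) as.numeric(scale(x)))

# After standardising, each gene has mean ~0 and standard deviation 1
group_means <- tapply(expr_df$Zscore, expr_df$gene, mean)
group_sds   <- tapply(expr_df$Zscore, expr_df$gene, sd)
```

Checking the per-group mean and standard deviation like this is a quick way to confirm the grouping variable was the intended one.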
Data
set.seed(103018)
m <- data.frame(SampleID = paste0(sample(LETTERS, 50, replace=TRUE), sample(LETTERS, 50, replace=TRUE),
sample(LETTERS, 50, replace=TRUE), sample(100:999, 50, replace=TRUE)),
expr = runif(50)*10,
gene = 'MISSENSE0001')
n <- data.frame(SampleID = paste0(sample(LETTERS, 50, replace=TRUE), sample(LETTERS, 50, replace=TRUE),
sample(LETTERS, 50, replace=TRUE), sample(100:999, 50, replace=TRUE)),
expr = runif(50)*10,
gene = 'NONSENSE0001')
s <- data.frame(SampleID = paste0(sample(LETTERS, 50, replace=TRUE), sample(LETTERS, 50, replace=TRUE),
sample(LETTERS, 50, replace=TRUE), sample(100:999, 50, replace=TRUE)),
expr = runif(50)*10,
gene = 'SYNONYMOUS0001')
I have 20 points (X1, Y1, …, Xn, Yn) on a pyramid and a random base point (Xbase, Ybase). I wish to calculate the area of the triangle formed by (Xi, Yi), (Xi+1, Yi+1), and (Xbase, Ybase). I wrote a loop that calculates the area, but I cannot store the result in the data.frame (myDF). Furthermore, is there a more elegant way to calculate the area?
Script:
library(ggplot2)
myDF <- data.frame(area=double())
nElem <- 100
xData <- as.data.frame(seq(1,nElem,5))
yData1 <- seq(5,nElem/2,5)
yData2 <- rev(yData1-4)
yData<- as.data.frame((c(yData1, yData2)))
xyDATA<- cbind(xData,yData)
colnames(xyDATA) <- c("xCoord","yCoord")
Xbase <-runif(1, 90, 91)
Ybase <-runif(1, 1.0, 1.5)
for(i in 1:19)
{
x1 <- Xbase
y1 <- Ybase
x2 <- xyDATA[i,1]
y2 <- xyDATA[i,2]
x3 <- xyDATA[i+1,1]
y3 <- xyDATA[i+1,2]
s <- 0.5*sqrt((x2*x3-x3*y2)^2+(x3*y1-x1*y3)^2+(x1*y2-x2*y1)^2)
myDF[i] <-s
}
P1 <- ggplot(xyDATA) + geom_point(aes(x = xCoord, y = yCoord))
P2 <- P1 + geom_point(aes(x = x1, y = y1),colour="red",size=4)
P2
Thanks a lot.
As written, you are assigning the value of s to an entire column of the data frame. You probably want to create an area column and then assign into row i of that column.
# before the loop, create the column:
myDF['area'] <- NA
# inside the loop, assign into row i:
....
myDF[i, "area"] <- s
Here is a solution using the dplyr package:
nElem <- 100
xData <- as.data.frame(seq(1,nElem,5))
yData1 <- seq(5,nElem/2,5)
yData2 <- rev(yData1-4)
yData<- as.data.frame((c(yData1, yData2)))
xyDATA<- cbind(xData,yData)
colnames(xyDATA) <- c("xCoord","yCoord")
Xbase <-runif(1, 90, 91)
Ybase <-runif(1, 1.0, 1.5)
library(dplyr)
myDF <- xyDATA %>%
mutate("s" = 0.5*sqrt(
(xCoord*lead(xCoord)-lead(xCoord)*yCoord)^2+
(lead(xCoord)*Ybase-Xbase*lead(yCoord))^2+
(Xbase*yCoord-xCoord*Ybase)^2
))
head(myDF)
xCoord yCoord s
1 1 5 502.2731
2 6 10 807.6995
3 11 15 1118.5987
4 16 20 1431.4092
5 21 25 1745.1034
6 26 30 2059.2776
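On the question of a more elegant calculation: for points in the plane, the standard shoelace formula gives the triangle area as half the absolute value of a signed sum, rather than the square root of a sum of squares used above, so it produces different numbers from the output shown. A vectorised base-R sketch follows. The fixed base coordinates are an assumption made purely so the example is reproducible (the question draws them with runif()).

```r
nElem <- 100
xCoord <- seq(1, nElem, 5)
yData1 <- seq(5, nElem / 2, 5)
yCoord <- c(yData1, rev(yData1 - 4))

# Assumed fixed base point (random in the question)
Xbase <- 90.5
Ybase <- 1.25

n <- length(xCoord)
x2 <- xCoord[-n]; y2 <- yCoord[-n]   # points 1..19
x3 <- xCoord[-1]; y3 <- yCoord[-1]   # points 2..20

# Shoelace formula for the triangle (base, point i, point i+1)
area <- 0.5 * abs(Xbase * (y2 - y3) + x2 * (y3 - Ybase) + x3 * (Ybase - y2))

# One row per triangle, no loop needed
myDF <- data.frame(area = area)
```

Dropping the last and first elements to form the "current" and "next" points plays the same role as lead() in the dplyr answer, but yields exactly 19 rows with no trailing NA.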
This is the code I'm currently running:
n <- 7
N <- 52
r <- 13
reps <- 1000000
deck <- rep(c('h','d','c','s'), each = r)
diamonds <- rep(NA, length.out = reps)
pos <- sample(x = 1:52, size = 7, replace = FALSE)
for(i in 1:reps) {
hand <- sample(x = deck, replace = FALSE)[pos]
diamonds[i] <- sum(ifelse(hand == 'd', 1, 0))
}
barplot(table(diamonds), col = 'red', xlab = '# of diamonds',
ylab = paste('frequency out of',reps,'trials'),
main = paste('Positions:',pos[1],pos[2],pos[3],pos[4],
pos[5],pos[6],pos[7]))
What I'd really like is to be able to give a title to the barplot with something like the following
barplot(..., main = paste('Positions:',pos))
and have the title say "Positions: p1 p2 p3 p4 p5 p6 p7", where p1,p2,...,p7 are the elements of pos.
For anyone that's interested, this code randomly chooses 7 positions from 52 and then counts the number of diamonds ('d') within those 7 positions after each shuffle of the deck for 1000000 shuffles. Then the empirical distribution of the number of diamonds within those 7 cards is plotted.
Use the collapse argument of paste() to join the elements of a vector containing the base text and pos into a single string:
paste(c('Positions:', pos), collapse=" ")
Otherwise, when you paste "Positions:" to pos you get the former recycled to the length of pos.
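The difference between the two calls can be seen in a short sketch. The positions below are arbitrary example values, not output from the simulation in the question.

```r
pos <- c(3, 17, 21, 30, 38, 44, 51)   # example positions, chosen arbitrarily

# Without collapse: "Positions:" is recycled, giving one string per element
recycled <- paste("Positions:", pos)

# With collapse: a single string, suitable for a plot title
title <- paste(c("Positions:", pos), collapse = " ")
# title is "Positions: 3 17 21 30 38 44 51"
```

The resulting title can then be passed directly as main = title in the barplot() call.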