The wording of my question may not be great so I will try and illustrate via this example.
A    Rank
0.5  1
0.6  2
0.7  3
0.8  4
0.9  5
1.0  6
I would like to add a column containing the result of this formula:
(i/m) * Q
Where:
i = the rank of each value in column A (e.g. 1, 2, 3, 4, 5, 6)
m = 6 (total rows)
Q = 0.05
The result would be a new column holding the value of that expression for each row. I have 730 rows, so I would like to compute this in one go rather than by hand for each row.
A    Rank  C
0.5  1     (1/6) * 0.05
0.6  2     (2/6) * 0.05
0.7  3     (3/6) * 0.05
0.8  4     (4/6) * 0.05
0.9  5     (5/6) * 0.05
1.0  6     (6/6) * 0.05
Many Thanks
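One possible way to do this in a single vectorised step is sketched below. It assumes the data sits in a data frame (called df here, a name of my own choosing), that the rank should be computed from column A, and that the example rows mean multiplication by Q. Arithmetic on a column applies to every row at once, so nothing has to be done by hand for the 730 rows.

```r
# Hypothetical data frame holding the example values from the question
df <- data.frame(A = c(0.5, 0.6, 0.7, 0.8, 0.9, 1.0))

Q <- 0.05          # fixed constant from the question
m <- nrow(df)      # total number of rows (6 here, 730 in the real data)

# Rank of each value in A; ties.method = "first" is an assumption that
# gives every row a distinct integer rank even when values are tied
df$Rank <- rank(df$A, ties.method = "first")

# Column arithmetic is element-wise, so this fills all rows at once
df$C <- (df$Rank / m) * Q

print(df)
```

If the Rank column already exists in the data, the rank() line can simply be dropped.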
I would like to take, row by row, the minimum of a fixed value and a value calculated from a column.
For example, I have the following data:
beta = data.frame(A = c(1,2,3,4,5,6,7,8,9))
I would like the minimum of A/2 and the value 3 for each row.
My problem is that when I use the min function in R, it takes the minimum over the whole column instead of calculating row by row. So when I create beta$min = min(beta$A/2, 3), I get this:
> beta
  A min
1 1 0.5
2 2 0.5
3 3 0.5
4 4 0.5
5 5 0.5
6 6 0.5
7 7 0.5
8 8 0.5
9 9 0.5
So it always takes the minimum over the whole column (1/2 = 0.5). How can I do this row by row?
Thanks for reading
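min() collapses all of its arguments into a single number, which is why every row ends up with 0.5. Base R also has pmin(), the element-wise ("parallel") minimum, which recycles the constant 3 against each row. A minimal sketch using the data from the question:

```r
beta <- data.frame(A = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

# pmin() compares the two arguments position by position,
# recycling the scalar 3 across all rows
beta$min <- pmin(beta$A / 2, 3)

print(beta)
```

The counterpart for row-wise maxima is pmax().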
My data frame looks like this.
data <- data.frame(group = c("A","B","C","A","B","C","A","B","C"),
                   time  = c(rep(1,3), rep(2,3), rep(3,3)),
                   value = c(0.2,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
I would like to use the values at time point 1 to decide which groups to keep, and then filter those groups out of the later time points as well.
Specifically, I would like to delete every group whose value at time point 1 is bigger than 0.5 or smaller than 0.1.
I want my data.frame to look like this.
group time value
1 A 1 0.2
2 A 2 0.1
3 A 3 10.0
Any help is highly appreciated.
You can keep the groups whose value at time == 1 lies between 0.1 and 0.5 (between() is inclusive at both ends).
library(dplyr)
data %>%
  group_by(group) %>%
  filter(between(value[time == 1], 0.1, 0.5))
# group time value
# <chr> <dbl> <dbl>
#1 A 1 0.2
#2 A 2 0.1
#3 A 3 10
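For comparison, the same idea can be sketched in base R without dplyr. This assumes the value of group A at time 1 is 0.2, as in the printed data, and that each group has exactly one row at time 1. The variable names first, keep and result are my own.

```r
data <- data.frame(group = rep(c("A", "B", "C"), 3),
                   time  = rep(1:3, each = 3),
                   value = c(0.2, 1, 1, 0.1, 10, 20, 10, 20, 30))

# Rows observed at the first time point
first <- data[data$time == 1, ]

# Groups whose time-1 value lies in [0.1, 0.5]
keep <- first$group[first$value >= 0.1 & first$value <= 0.5]

# Keep every row (all time points) of the qualifying groups
result <- data[data$group %in% keep, ]
print(result)
```

Unlike the dplyr output, the original row names (1, 4, 7) are kept here; rownames(result) <- NULL resets them if that matters.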
My data is quite large, so I created a small matrix to illustrate what I need.
test <- matrix(c(1:3, rep(0.5,3),4:1), nrow = 1, dimnames = list(1, 1:10))
The matrix looks like this:
  1 2 3   4   5   6 7 8 9 10
1 1 2 3 0.5 0.5 0.5 4 3 2  1
I want to subset this matrix, keeping every column whose value equals a specific value such as 0.5:
4 5 6
1 0.5 0.5 0.5
Since my data has more than 10,000 columns, I'm looking for code that can do this programmatically.
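One way this could be done is with a logical column index, sketched below for the one-row matrix from the question. The names target and sub are my own, and drop = FALSE is there so the result stays a matrix with its column names even when only one column matches.

```r
test <- matrix(c(1:3, rep(0.5, 3), 4:1), nrow = 1, dimnames = list(1, 1:10))

target <- 0.5

# TRUE for every column whose value in row 1 equals the target
hit <- test[1, ] == target

# Logical indexing scales to any number of columns
sub <- test[, hit, drop = FALSE]
print(sub)
```

Exact comparison with == is fine for 0.5, but for values that result from floating-point arithmetic a tolerance such as abs(test[1, ] - target) < 1e-8 may be safer.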
I am trying to create a difference variable by subtracting, from each value in a column, the highest value within the three years prior, for each individual id.
My data looks like this:
data <- data.frame(id = c(1,1,1,1,2,2,3,3,3,3,4,4,4),
testocc = c(1,2,3,4,1,2,1,2,3,4,1,2,3),
score = c(0.8,0.3,0.1,0.2,0.1,0.5,0.9,0.5,0.7,0.6,0.3,0.2,0.6),
time = c(0,1,1,3,0,4,0,4,2,1,0,3,2))
And my desired output looks like this:
> data
id  testocc  score  time  Score.Maximum.Within.3.Years.Prior  Difference
1   1        0.8    0     -                                   0
1   2        0.3    1     0.8                                 -0.5
1   3        0.1    1     0.8                                 -0.7
1   4        0.2    3     0.1                                 0.1
2   1        0.1    0     -                                   0
2   2        0.5    4     0.1 (or NA)                         0.4 (or NA)
3   1        0.9    0     -                                   0
3   2        0.5    4     0.9 (or NA)                         -0.4 (or NA)
3   3        0.7    2     0.5                                 0.2
3   4        0.6    1     0.7                                 -0.1
4   1        0.3    0     -                                   0
4   2        0.2    3     0.3                                 -0.1
4   3        0.6    2     0.2                                 0.4
(Difference = Score - Score Maximum within 3 Years Prior)
Time here (in years) is the time since the previous testocc, and I want to find the highest score within the three years before each testocc. Then I want to subtract that highest score from the current score. Each individual id is treated separately.
I am also hoping for two versions of this:
1. If the only prior value is more than 3 years old, I still want to compute the difference against it (as shown in the desired output above).
2. If the only prior value is more than 3 years old, I want an NA instead (as shown in brackets in the desired output above).
I figure I'll have to calculate all the pairwise times between all testocc's, apply a cutoff at 3 years, and then subtract the max value within that window; I just have no idea how to go about this.
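Here's a way to solve your version 2 using dplyr. You first group by id and calculate the running time as the cumulative sum of the time steps. Then, for each row, you calculate the maximum score among the earlier tests at most 3 years back, using the do syntax. Finally you set the maximum of each id's first test to its own score, turn non-finite maxima (no earlier test inside the window) into NA, and calculate the difference.
library(dplyr)
data %>%
  group_by(id) %>%
  # running time since each id's first test
  mutate(cumtime = cumsum(time)) %>%
  do({
    mutate(.,
           # highest score among earlier tests at most 3 years back;
           # max() of an empty set returns -Inf (with a warning)
           max = sapply(.$cumtime, function(t) {
             max(.$score[.$cumtime < t & t - .$cumtime <= 3])
           }))
  }) %>%
  mutate(max = ifelse(cumtime == 0, score, max),    # first test: compare with itself
         max = ifelse(!is.finite(max), NA, max),    # no earlier test in window -> NA
         difference = score - max)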
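Both versions can also be sketched in base R. The helper prior_max() and its fallback argument are hypothetical names of mine, not part of any package. The sketch assumes rows are ordered by testocc within each id, and for version 1 it assumes "the only prior value" means the most recent earlier score when nothing falls inside the 3-year window.

```r
data <- data.frame(id      = c(1,1,1,1,2,2,3,3,3,3,4,4,4),
                   testocc = c(1,2,3,4,1,2,1,2,3,4,1,2,3),
                   score   = c(0.8,0.3,0.1,0.2,0.1,0.5,0.9,0.5,0.7,0.6,0.3,0.2,0.6),
                   time    = c(0,1,1,3,0,4,0,4,2,1,0,3,2))

# Hypothetical helper: highest earlier score within `window` years, for one id.
# fallback = TRUE  -> version 1: use the most recent earlier score if none is in the window
# fallback = FALSE -> version 2: return NA in that case
prior_max <- function(score, time, window = 3, fallback = FALSE) {
  elapsed <- cumsum(time)                 # years since the id's first test
  vapply(seq_along(score), function(i) {
    if (i == 1) return(score[1])          # first test: compare with itself
    prev <- seq_len(i - 1)
    in_window <- prev[elapsed[i] - elapsed[prev] <= window]
    if (length(in_window) > 0) return(max(score[in_window]))
    if (fallback) score[i - 1] else NA_real_
  }, numeric(1))
}

# Apply the helper per id and put the pieces back in the original row order
pieces <- lapply(split(data, data$id), function(d) {
  d$max_v1 <- prior_max(d$score, d$time, fallback = TRUE)
  d$max_v2 <- prior_max(d$score, d$time, fallback = FALSE)
  d
})
out <- unsplit(pieces, data$id)

out$diff_v1 <- out$score - out$max_v1
out$diff_v2 <- out$score - out$max_v2
print(out)
```

Because the empty-window case is handled explicitly, max() is never called on an empty vector and no -Inf clean-up is needed.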
I have a data.frame in which each gene name is repeated and contains values for 2 conditions:
df <- data.frame(gene=c("A","A","B","B","C","C"),
condition=c("control","treatment","control","treatment","control","treatment"),
count=c(10, 2, 5, 8, 5, 1),
sd=c(1, 0.2, 0.1, 2, 0.8, 0.1))
gene condition count sd
1 A control 10 1.0
2 A treatment 2 0.2
3 B control 5 0.1
4 B treatment 8 2.0
5 C control 5 0.8
6 C treatment 1 0.1
I want to determine whether "count" increases or decreases after treatment, and label the genes accordingly and/or subset them. That is (pseudo code):
for each unique(gene) do
    if df[geneRow2,3] - df[geneRow1,3] > 0 then gene is "up"
    else gene is "down"
This is what it should look like in the end (the last column is optional):
up-regulated
gene condition count sd regulation
B control 5 0.1 up
B treatment 8 2.0 up
down-regulated
gene condition count sd regulation
A control 10 1.0 down
A treatment 2 0.2 down
C control 5 0.8 down
C treatment 1 0.1 down
I have been racking my brain over this, including playing with ddply, and I've failed to find a solution. Please help a hapless biologist.
Cheers.
The plyr solution would look something like:
library(plyr)
reg.fun <- function(x) {
  # change in count after treatment (treatment minus control) for one gene
  reg.diff <- x$count[x$condition=='treatment'] - x$count[x$condition=='control']
  x$regulation <- ifelse(reg.diff > 0, 'up', 'down')
  x
}
ddply(df, .(gene), reg.fun)
  gene condition count  sd regulation
1    A   control    10 1.0       down
2    A treatment     2 0.2       down
3    B   control     5 0.1         up
4    B treatment     8 2.0         up
5    C   control     5 0.8       down
6    C treatment     1 0.1       down
You could also think about doing this with a different package and/or with data in a different shape:
df.w <- reshape(df, direction='wide', idvar='gene', timevar='condition')
library(data.table)
DT <- data.table(df.w, key='gene')
# treatment minus control: a positive change means the count went up
DT[, regulation:=ifelse(count.treatment-count.control > 0, 'up', 'down'), by=gene]
DT
   gene count.control sd.control count.treatment sd.treatment regulation
1:    A            10        1.0               2          0.2       down
2:    B             5        0.1               8          2.0         up
3:    C             5        0.8               1          0.1       down
Something like this:
# diff() gives treatment minus control, assuming control precedes treatment within each gene
df$up.down <- with(df, ave(count, gene,
                           FUN = function(diffs) c("up", "down")[1 + (diff(diffs) < 0)]))
spltdf <- split(df, df$up.down)
> df
gene condition count sd up.down
1 A control 10 1.0 down
2 A treatment 2 0.2 down
3 B control 5 0.1 up
4 B treatment 8 2.0 up
5 C control 5 0.8 down
6 C treatment 1 0.1 down
> spltdf
$down
gene condition count sd up.down
1 A control 10 1.0 down
2 A treatment 2 0.2 down
5 C control 5 0.8 down
6 C treatment 1 0.1 down
$up
gene condition count sd up.down
3 B control 5 0.1 up
4 B treatment 8 2.0 up
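If the rows are not guaranteed to come as control followed by treatment, the counts can be looked up by condition and gene name instead of by position. The sketch below is one way to do that in base R. The "unchanged" label for a zero difference is my own addition, since the question does not say how ties should be handled.

```r
df <- data.frame(gene = c("A", "A", "B", "B", "C", "C"),
                 condition = c("control", "treatment", "control",
                               "treatment", "control", "treatment"),
                 count = c(10, 2, 5, 8, 5, 1),
                 sd = c(1, 0.2, 0.1, 2, 0.8, 0.1))

# Counts per gene for each condition, named by gene
ctrl  <- with(df[df$condition == "control", ],   setNames(count, gene))
treat <- with(df[df$condition == "treatment", ], setNames(count, gene))

# Change after treatment, matched by gene name rather than row order
change <- treat[names(ctrl)] - ctrl

# "unchanged" is an assumed label for a zero difference
label <- ifelse(change > 0, "up", ifelse(change < 0, "down", "unchanged"))

# Map the per-gene label back onto every row, then split into subsets
df$regulation <- unname(label[as.character(df$gene)])
by_regulation <- split(df, df$regulation)
print(by_regulation)
```

This assumes each gene has exactly one control row and one treatment row.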