Related
I have a data that looks like this
gene=c("A","A","A","A","B","B","B","B")
frequency=c(1,1,0.8,0.6,0.3,0.2,1,1)
time=c(1,2,3,4,1,2,3,4)
df <- data.frame(gene,frequency,time)
gene frequency time
1 A 1.0 1
2 A 1.0 2
3 A 0.8 3
4 A 0.6 4
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
I want to remove each a gene group, in this case A or B when they have
frequency > 0.9 at time==1
In this case I want to remove A and my data to look like this
gene frequency time
1 B 0.3 1
2 B 0.2 2
3 B 1.0 3
4 B 1.0 4
Any hint or help are appreciated
We may use subset from base R i.e. create a logical vector with multiple expressions extract the 'gene' correspond to that, use %in% to create a logical vector, negate (!) to return the genes that are not. Or may also change the > to <= and remove the !
subset(df, !gene %in% gene[frequency > 0.9 & time == 1])
-ouptut
gene frequency time
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
I have a dataframe that looks like this:
df <- data.frame("CB_1.1"=c(0,5,6,2), "CB_1.16"=c(1,5,3,6), "HC_2.11"=c(3,3,4,5), "HC_1.12"=c(2,3,4,5), "HC_1.13"=c(1,0,0,5))
> df
CB_1.1 CB_1.16 HC_2.11 HC_1.12 HC_1.13
1 0 1 3 2 1
2 5 5 3 3 0
3 6 3 4 4 0
4 2 6 5 5 5
I would like to take the mean of rows that share substring of the column name, before the ".". Resulting in a dataframe like this:
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
You'll notice that the column HC_2.11 values remain the same, because no other column has HC_2 in this dataframe.
Any help would be appreciated!
1) apply/tapply For each row use tapply on it using an INDEX of the name prefixes and a function mean. Transpose the result. No packages are used.
prefix <- sub("\\..*", "", names(df))
t(apply(df, 1, tapply, prefix, mean))
giving this matrix (wrap it in data.frame(...) if you need a data frame result):
CB_1 HC_1 HC_2
[1,] 0.5 1.5 3
[2,] 5.0 1.5 3
[3,] 4.5 2.0 4
[4,] 4.0 5.0 5
2) lm Run the indicated regression. The +0 in the formula means don't add on an intercept. The transpose of the coefficients will be the required matrix, m. In the next line make the names nicer. prefix is from (1). No packages are used.
m <- t(coef(lm(t(df) ~ prefix + 0)))
colnames(m) <- sub("prefix", "", colnames(m))
m
giving this matrix
CB_1 HC_1 HC_2
[1,] 0.5 1.5 3
[2,] 5.0 1.5 3
[3,] 4.5 2.0 4
[4,] 4.0 5.0 5
This follows from the facts that (1) the model matrix X contains only ones and zeros and (2) distinct columns of it are orthogonal. The model matrix is shown here:
X <- model.matrix(~ prefix + 0) # model matrix
X
giving:
prefixCB_1 prefixHC_1 prefixHC_2
1 1 0 0
2 1 0 0
3 0 0 1
4 0 1 0
5 0 1 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$prefix
[1] "contr.treatment"
Because the columns of the model matrix X are orthogonal the coefficient corresponding to any column for a particular row, y, of df (column of t(df)) is just sum(x * y) / sum(x * x) and since x is a 0/1 vector that equals the mean of the values of y corresponding to the 1's in x.
3) stack/tapply Convert to long form inserting an id column at the same time. Then use tapply to convert back to wide form tapply-ing mean. No packages are used.
long <- transform(stack(df), ind = sub("\\..*", "", ind), id = c(row(df)))
with(long, tapply(values, long[c("id", "ind")], mean))
giving this table. Wrap it in as.data.frame.matrix if you want a data.frame.
ind
id CB_1 HC_1 HC_2
1 0.5 1.5 3
2 5.0 1.5 3
3 4.5 2.0 4
4 4.0 5.0 5
Here is a base R solution using rowMeans + split.default, i.e.,
dfout <- as.data.frame(Map(rowMeans, split.default(df,factor(s <- gsub("\\..*$","",names(df)), levels = unique(s)))))
such that
> dfout
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
If you do not mind the order of column names, you can use the shorter code below
dfout <- as.data.frame(Map(rowMeans,split.default(df,gsub("\\..*$","",names(df)))))
such that
> dfout
CB_1 HC_1 HC_2
1 0.5 1.5 3
2 5.0 1.5 3
3 4.5 2.0 4
4 4.0 5.0 5
One option involving dplyr and purrr could be:
map_dfc(.x = unique(sub("\\..*$", "", names(df))),
~ df %>%
transmute(!!.x := rowMeans(select(., starts_with(.x)))))
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
A base option could be:
#find column names splitting on "."
cols <- unique(sapply(strsplit(names(df),".", fixed = T), `[`, 1))
#loop through each column name and find the rowMeans
as.data.frame(sapply(cols, function (x) rowMeans(df[grep(x, names(df))])))
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
I have a dataset that looks like this:
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
head(sample.data)
groups A B position
1 1 1 3 2
2 2 3 2 1
3 3 2 4 2
4 4 4 1 1
5 5 2 5 2
6 6 5 2 1
The "position" column always alternates between 2 and 1. I want to do this calculation in R: starting from the first row, if it's in position 1, ignore it. If it starts at 2 (as in this example), then calculate as follows:
Take the first 2 values of column A that are at position 2, average them, then subtract the first value that is at position 1 (in this example: (1+2)/2 - 3 = -1.5). Then repeat the calculation for the next set of values, using the last position 2 value as the starting point, i.e. the next calculation would be (2+2)/2 - 4 = -2.
So basically, in this example, the calculations are done for the values of these sets of groups: 1-2-3, 3-4-5, 5-6-7, etc. (the last value of the previous is the first value of the next set of calculation)
Repeat the calculation until the end. Also do the same for column B.
Since I need the original data frame intact, put the newly calculated values in a new data frame(s), with columns dA and dB corresponding to the calculated values of column A and B, respectively (if not possible then they can be created as separated data frames, and I will extract them into one afterwards).
Desired output (from the example):
dA dB
1 -1.5 1.5
2 -2 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
start <- match(2, sample.data$position)
twos <- seq(from = start, to = nrow(sample.data), by = 2)
df <-
sapply(c("A", "B"), function(l) {
sapply(twos, function(i) {
mean(sample.data[c(i, i+2), l]) - sample.data[i+1, l]
})
})
df <- setNames(as.data.frame(df), c('dA', 'dB'))
As your values in position always alternate between 1 and 2, you can define an index of odd rows i1 and an index of even rows i2, and do your calculations:
## In case first row has position==1, we add an increment of 1 to the indexes
inc=0
if(sample.data$position[1]==1)
{inc=1}
i1=seq(1+inc,nrow(sample.data),by=2)
i2=seq(2+inc,nrow(sample.data),by=2)
res=data.frame(dA=(lead(sample.data$A[i1])+sample.data$A[i1])/2-sample.data$A[i2],
dB=(lead(sample.data$B[i1])+sample.data$B[i1])/2-sample.data$B[i2]);
This returns:
dA dB
1 -1.5 1.5
2 -2.0 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4.0
7 -3.5 2.5
8 -3.0 3.0
9 -3.0 4.5
10 NA NA
The last row returns NA, you can remove it if you need.
res=na.omit(res)
I have a data frame that consisting of a non-unique identifier (ID) and measures of some property of the objects within that ID, something like this:
ID Sph
A 1.0
A 1.2
A 1.1
B 0.5
B 1.8
C 2.2
C 1.1
D 2.1
D 3.0
First, I get the number of instances of each ID as X using table(df$ID), i.e. A=3, B=2 ,C=2 and D=2. Next, I would like to apply a threshold in the "Sph" category after getting the number of instances, limiting to rows where the Sph value exceeds the threshold. With threshold 2.0, for instance, I would use thold=df[df$Sph>2.0,]. Finally I would like to replace the ID column with the X value that I computed using table above. For instance, with a threshold of 1.1 in the "Sph" columns I would like the following output:
ID Sph
3 1.0
2 1.8
2 2.2
2 2.1
2 3.0
In other words, after using table() to get an x value corresponding to the number of times an ID has occurred, say 3, I would like to then assign that number to every value in that ID, Y, that is over some threshold.
There are some inconsistencies in your question and you didn't give a reproducible example, however here's my attempt.
I like to use the dplyr library, in this case I had to break out an sapply, maybe someone can improve on my answer.
Here's the short version:
library(dplyr)
#your data
x <- data.frame(ID=c(rep("A",3),rep("B",2),rep("C",2),rep("D",2)),Sph=c(1.0,1.2,1.1,0.5,1.8,2.2,1.1,2.1,3.0),stringsAsFactors = FALSE)
#lookup table
y <- summarise(group_by(x,ID), IDn=n())
#fill in original table
x$IDn <- sapply(x$ID,function(z) as.integer(y[y$ID==z,"IDn"]))
#filter for rows where Sph greater or equal to 1.1
x <- x %>% filter(Sph>=1.1)
#done
x
And here's the longer version with explanatory output:
> library(dplyr)
> #your data
> x <- data.frame(ID=c(rep("A",3),rep("B",2),rep("C",2),rep("D",2)),Sph=c(1.0,1.2,1.1,0.5,1.8,2.2,1.1,2.1,3.0),stringsAsFactors = FALSE)
> x
ID Sph
1 A 1.0
2 A 1.2
3 A 1.1
4 B 0.5
5 B 1.8
6 C 2.2
7 C 1.1
8 D 2.1
9 D 3.0
>
> #lookup table
> y <- summarise(group_by(x,ID), IDn=n())
> y
Source: local data frame [4 x 2]
ID IDn
1 A 3
2 B 2
3 C 2
4 D 2
>
> #fill in original table
> x$IDn <- sapply(x$ID,function(z) as.integer(y[y$ID==z,"IDn"]))
> x
ID Sph IDn
1 A 1.0 3
2 A 1.2 3
3 A 1.1 3
4 B 0.5 2
5 B 1.8 2
6 C 2.2 2
7 C 1.1 2
8 D 2.1 2
9 D 3.0 2
>
> #filter for rows where Sph greater or equal to 1.1
> x <- x %>% filter(Sph>=1.1)
>
> #done
> x
ID Sph IDn
1 A 1.2 3
2 A 1.1 3
3 B 1.8 2
4 C 2.2 2
5 C 1.1 2
6 D 2.1 2
7 D 3.0 2
You can actually do this in one step after computing X and thold as you did in your question:
X <- table(df$ID)
thold <- df[df$Sph > 1.1,]
thold$ID <- X[as.character(thold$ID)]
thold
# ID Sph
# 2 3 1.2
# 5 2 1.8
# 6 2 2.2
# 8 2 2.1
# 9 2 3.0
Basically you look up the frequency of each ID value in the table X that you built.
I'm trying to work on code to build a function for three stage cluster sampling, however, I am just working with dummy data right now so I can understand what is going into my function.
I am working on for loops and have a data frame with grouped values. I'm have a data frame that has data:
Cluster group value value.K.bar value.M.bar N.bar
1 1 A 1 1.5 2.5 4
2 1 A 2 1.5 2.5 4
3 1 B 3 4.0 2.5 4
4 1 B 4 4.0 2.5 4
5 2 B 5 4.0 6.0 4
6 2 C 6 6.5 6.0 4
7 2 C 7 6.5 6.0 4
and I am trying to run the for loop
n <- dim(data)[1]
e <- 0
total <- 0
for(i in 1:n) {e = data.y$value.M.bar[i] - data$N.bar[i]
total = total + e^2}
My question is: Is there a way to run the same loop but for the unique values in the group? Say by:
Group 'A', 'B', 'C'
Any help would be greatly appreciated!
Edit: for correct language
You can use by for example, to apply your data per group. First I wrap your code in a function that take data as input.
get.total <- function(data){
n <- dim(data)[1]
e <- 0
total <- 0
for(i in 1:n) {
e <- data$value.M.bar[i] - data$N.bar[i] ## I correct this line
total <- total + e^2
}
total
}
Then to compute total just for group B and C you do this :
by(data,data$group,FUN=get.total)
data$group: A
[1] 4.5
----------------------------------------------------------------------------------------------------
data$group: B
[1] 8.5
----------------------------------------------------------------------------------------------------
data$group: C
[1] 8
But better , Here a vectorized version of your function
by(data,data$group,
function(dat)with(dat, sum((value.M.bar - N.bar)^2)))