I have a table. I would like to count how many entries have a name starting with 11_ and, at the same time, a value equal to 1.
11_AAACCCAAGAGCTGCA 11_AAACCCACAAAGACGC 11_AAACCCAGTCACTTAG 11_AAACGAACAAAGGCTG
                  6                   3                   1                   1
11_AAACGAATCCACACAA 13_AAACGCTCACATGAAA 13_AAACGCTCAGCGGTCT 11_AAACGCTCATGGAAGC
                  7                   1                   3                   1
Do you have a named vector?
You can combine the two conditions to filter on both the names and the values.
x <- c('11_AAACCCAAGAGCTGCA' = 6,
       '11_AAACCCACAAAGACGC' = 3,
       '11_AAACCCAGTCACTTAG' = 1,
       '11_AAACGAACAAAGGCTG' = 1,
       '11_AAACGAATCCACACAA' = 7,
       '13_AAACGCTCACATGAAA' = 1,
       '13_AAACGCTCAGCGGTCT' = 3,
       '11_AAACGCTCATGGAAGC' = 1)
x[startsWith(names(x), '11_') & x == 1]
#11_AAACCCAGTCACTTAG 11_AAACGAACAAAGGCTG 11_AAACGCTCATGGAAGC
#                  1                   1                   1
#To count
sum(startsWith(names(x), '11_') & x == 1)
#[1] 3
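As a small follow-up (not part of the original answer), if you also want the matching barcodes rather than just the count, wrap the filtered vector in names():
names(x[startsWith(names(x), '11_') & x == 1])
# [1] "11_AAACCCAGTCACTTAG" "11_AAACGAACAAAGGCTG" "11_AAACGCTCATGGAAGC"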
We can use grepl
sum(grepl("^11_", names(x)) & x == 1)
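As a sketch for the case where the counts sit in a data frame rather than a named vector (the column names barcode and count here are assumptions, not taken from the question), the same two conditions apply:
df <- data.frame(barcode = names(x), count = as.numeric(x))
sum(startsWith(df$barcode, "11_") & df$count == 1)
# [1] 3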
Maybe the title is a little bit vague but I didn't know how to better describe it. Suppose the following table/column is given:
tab0 <- data.frame(month = c(1, 3, 4, 7, 9, 12))
What I would love to achieve by using dplyr is the following table:
tab1 <- data.frame(month = c(1, 3, 4, 7, 9, 12), group = c(1, 1, 2, 3, 3, 4))
A month is assigned to a group such that there is a maximum time lag of 2 months within a group. This is only an example; in the end I want to apply it to much more data and use days instead of months. I hope it's clear what I am after.
# example dataframe
tab0 <- data.frame(month = c(1, 3, 4, 7, 9, 12))
# input your lag
lag = 2
# create group
tab0$group = 1 + (tab0$month - tab0$month[1]) %/% (lag + 1)
# see the updated dataset
tab0
#   month group
# 1     1     1
# 2     3     1
# 3     4     2
# 4     7     3
# 5     9     3
# 6    12     4
The group number is calculated as follows: for each row we get the distance between the current month and the first month. Then we divide the result by 3 (your lag of 2 plus 1) and keep the integer part of the division. Finally we add 1 to the result.
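Since the question asks for a dplyr approach, here is a sketch of the same calculation written as a mutate() call; the lag is renamed to max_lag here only so it does not collide with dplyr's lag() function:
library(dplyr)
max_lag <- 2
tab0 %>%
  mutate(group = 1 + (month - first(month)) %/% (max_lag + 1))
The same integer-division idea carries over to days: take the difference from the first date, convert it with as.numeric(), and divide by max_lag + 1 in the same way.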
I have vectors in R containing a lot of 0s and a few non-zero numbers. Each vector starts with a non-zero number.
For example <1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0>
I would like to set all of the zeros equal to the most recent non-zero number.
I.e. this vector would become <1,1,1,1,1,1,2,2,2,2,2,2,4,4,4,4>
I need to do this for about 100 vectors containing around 6 million entries each. Currently I am using a for loop:
for(k in 1:length(vector)){
if(vector[k] == 0){
vector[k] <- vector[k-1]
}
}
Is there a more efficient way to do this?
Thanks!
One option would be to replace those 0s with NA, then use zoo::na.locf:
x <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
x[x == 0] <- NA
zoo::na.locf(x) ## you possibly need: `install.packages("zoo")`
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4
Thanks to Richard for showing me how to use replace:
zoo::na.locf(replace(x, x == 0, NA))
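If you would rather not depend on zoo, the data.table package offers the same last-observation-carried-forward fill via nafill(); this is an alternative, not part of the answer above:
x <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
data.table::nafill(replace(x, x == 0, NA), type = "locf")  # needs data.table installed
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4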
You could try this:
k <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
k[which(k != 0)[cumsum(k != 0)]]
or another case, where cummax would not be appropriate:
k <- c(1,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0)
k[which(k != 0)[cumsum(k != 0)]]
Logic:
I am keeping "track" of the indices of the vector elements that are non-zero with which(k != 0); let's denote this new vector as x: x = c(1, 7, 13).
Next I am going to "sample" this new vector. How? From k I create a new vector that increments every time there is a non-zero element, cumsum(k != 0); let's denote this new vector as y: y = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3).
I am "sampling" from vector x: x[y], i.e. taking the first element of x 6 times, then the second element 6 times and the third element 4 times. Let's denote this new vector as z: z = c(1, 1, 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 13, 13, 13, 13).
Finally I am "sampling" from vector k, k[z], i.e. I am taking the first element 6 times, then the 7th element 6 times, then the 13th element 4 times.
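To illustrate the remark above about cummax: in the first example vector the non-zero values never decrease, so cummax() happens to give the same result, while it would fail on the second vector, where a later non-zero value (1) is smaller than an earlier one (2):
k <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
cummax(k)
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4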
Adding to #李哲源's answer:
If it is required to replace the leading NAs with the nearest non-NA value, and to replace the other NAs with the last non-NA value, the code can be:
x <- c(0,0,1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
zoo::na.locf(zoo::na.locf(replace(x, x == 0, NA),na.rm=FALSE),fromLast=TRUE)
# you possibly need: `install.packages("zoo")`
# [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4
I have carried out three experiments, each resulting in a list of numbers.
data1 = c(1,1,1,2,2)
data2 = c(2,2,3,3,3,4)
data3 = c(1,1,1,4,4,4,4,4,4, 5, 6)
Now I want to count the occurrences of each number in each of the experiments. I do this with table, since hist uses class mids (the nice thing about hist would be that I could give it the list of unique values).
# save histograms
result = list()
result$values[[1]] = as.data.frame(table(data1), stringsAsFactors=F)
result$values[[2]] = as.data.frame(table(data2), stringsAsFactors=F)
result$values[[3]] = as.data.frame(table(data3), stringsAsFactors=F)
str(result)
Now I only have a list of data frames of different lengths, but I'd like to have a single data frame containing columns of the same length (I want to subtract them):
nerv=data.frame(names=c(1, 2, 3, 4, 5, 6))
nerv[[2]] = c(3, 2, 0, 0, 0, 0)
nerv[[3]] = c(0, 2, 3, 1, 0, 0)
nerv[[4]] = c(3, 0, 0, 6, 1, 1)
Is it somehow possible to tell table() which values to count? Or is there another function that allows counting the values of one list in another list (count unique(data1, data2, data3) in data1)?
Or should I merge the data.frames and fill zeros into empty spaces?
This will generate the data frame:
lev <- unique(c(data1, data2, data3)) # the unique values
data.frame(names = lev,
           do.call(cbind,
                   lapply(list(data1, data2, data3),
                          function(x) table(factor(x, levels = lev)))))
The trick is to transform the numeric vectors into factors with specified levels; table then counts every level, including levels that do not occur in a given vector, which is what produces the zeros.
The output:
  names X1 X2 X3
1     1  3  0  3
2     2  2  2  0
3     3  0  3  0
4     4  0  1  6
5     5  0  0  1
6     6  0  0  1
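As a usage note: once the result above is stored in an object (say res, a name not used in the answer), the equal-length columns can be subtracted directly, which is what the question was after:
res <- data.frame(names = lev,
                  do.call(cbind,
                          lapply(list(data1, data2, data3),
                                 function(x) table(factor(x, levels = lev)))))
res$X1 - res$X2  # counts in experiment 1 minus experiment 2, per value
# [1]  3  0 -3 -1  0  0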
Assume you have a vector with runs of consecutive values:
v <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3, 3)
How can it best be reduced to one value per run plus the length of each run? I.e. the first run is 1 repeated three times; 2nd run: 2 repeated four times; 3rd run: 1 repeated two times, and so on:
v.df <- data.frame(value = c(1, 2, 1, 3),
                   repetitions = c(3, 4, 2, 4))
In a procedural language I might just iterate through a loop and build the data.frame as I go, but with a large dataset in R such an approach is inefficient. Any advice?
with(rle(v), data.frame(values, lengths))
should get you what you need, or more simply:
data.frame(rle(v)[])
The first form gives:
  values lengths
1      1       3
2      2       4
3      1       2
4      3       4
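And if you ever need to go back from the run-length form to the original vector, base R's inverse.rle() reverses the process:
r <- rle(v)
inverse.rle(r)
# [1] 1 1 1 2 2 2 2 1 1 3 3 3 3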