Command for expanding data in R

I have some data:
Length(cm) Frequency
1 5
2 2
3 3
4 5
Is there a way to expand these counts in R without typing the values out by hand, so that I can work out the standard error of the mean length? I would like to end up with a dataset like:
1 1 1 1 1 2 2 3 3 3 4 4 4 4 4
which I can then work on? Thanks

You can use rep.
> l <- 1:4
> f <- c(5,2,3,5)
> rep(l,f)
[1] 1 1 1 1 1 2 2 3 3 3 4 4 4 4 4

In addition to using rep to replicate the observations, you could use the wtd.mean and wtd.var functions in the Hmisc package to compute the weighted summaries without expanding the data. This is preferable when the expanded vector would take up a large portion of memory.
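As a rough illustration of the "without expanding" idea, the sketch below computes the standard error of the mean directly from the values and their frequencies in base R, then checks it against the expanded vector. It does not use Hmisc; the variable names (len, freq, se_weighted and so on) are made up for this example, and the n - 1 denominator is an assumption chosen to match what sd() does.

```r
# Values and their frequencies from the question
len  <- 1:4
freq <- c(5, 2, 3, 5)

n <- sum(freq)                               # total number of observations
m <- sum(len * freq) / n                     # frequency-weighted mean
v <- sum(freq * (len - m)^2) / (n - 1)       # sample variance (n - 1 denominator)
se_weighted <- sqrt(v / n)                   # standard error of the mean

# Same quantity from the expanded vector, for comparison
x <- rep(len, freq)
se_expanded <- sd(x) / sqrt(length(x))

print(all.equal(se_weighted, se_expanded))
```

Both routes should agree up to floating-point error; the weighted route never builds the long vector.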

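I recommend keeping the values in a data frame (here called data, with columns length and freq) and expanding them on the fly:
x <- rep(data$length, data$freq)
sd(x) / sqrt(length(x))  # sd(x) is the standard deviation; dividing by sqrt(n) gives the standard error of the mean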

Related

runner::streak_run shows unexpected result when k remains unchanged

I'm using runner::streak_run to count runs of consecutive 0s and 1s in a column called "inactive_indicator".
The column is: 0,0,0,1,1,1,0,1,1,0,0,0,0,0,0,0,0,1,1,1,1
For runner::streak_run(inactive_indicator)
I get the following:
1,2,3,1,2,3,1,1,2,1,2,3,4,5,5,5,5,1,2,3,4
Why is it stuck on 5 when it should go up to 8?
The documentation says: "k - running window size. By default window size equals length(x). Allow varying window size specified by vector of length(x)".
As I understand it, the default should therefore be enough.
The problem goes away and I get the expected result when running:
runner::streak_run(inactive_indicator, k = length(inactive_indicator))
Why doesn't it work in the first place?
This can be solved with rle from base R
sequence(rle(inactive_indicator)$lengths)
#[1] 1 2 3 1 2 3 1 1 2 1 2 3 4 5 6 7 8 1 2 3 4
Checked with runner
runner::streak_run(inactive_indicator)
#[1] 1 2 3 1 2 3 1 1 2 1 2 3 4 5 6 7 8 1 2 3 4
It is possible that the column contains leading/trailing whitespace and is therefore not numeric. In that case, use trimws
runner::streak_run(trimws(inactive_indicator))
data
inactive_indicator <- c(0,0,0,1,1,1,0,1,1,0,0,0,0,0,0,0,0,1,1,1,1)
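To make the rle approach above a little more concrete, here is a small sketch that shows the intermediate run-length encoding before sequence() turns it into streak counters. It uses only base R; the names runs and streak are chosen for this illustration.

```r
inactive_indicator <- c(0,0,0,1,1,1,0,1,1,0,0,0,0,0,0,0,0,1,1,1,1)

# rle() collapses the vector into runs of identical values
runs <- rle(inactive_indicator)
print(runs$lengths)  # length of each run: 3 3 1 2 8 4
print(runs$values)   # value of each run:  0 1 0 1 0 1

# sequence() expands each run length n into 1, 2, ..., n,
# which is exactly a running streak counter
streak <- sequence(runs$lengths)
print(streak)
```

The longest run here has length 8, so the counter reaches 8 inside that run rather than stopping at 5.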

How to BiCluster with constant values in columns - in R

My problem in general:
I have a data frame in which I would like to find all biclusters with constant values in columns.
For example, the initial data frame:
> df
v1 v2 v3
1 0 2 1
2 1 3 2
3 2 4 3
4 3 3 4
5 4 2 3
6 5 2 4
7 2 2 3
8 3 1 2
And, for example, I would like to find a cluster like this:
> cluster1
v1 v3
1 2 3
2 2 3
I tried to use the biclust package and tested several functions, but the result was never what I want to achieve.
I figured out that I might be able to use the BCPlaid function with fit.model = y ~ m, but it looks like this also produces different results.
Is there a way to achieve this task efficiently?

Using factors in R programming

If I have the code:
x <- c(rnorm(10),runif(10), rnorm(10,1))
f <- gl(3,10)
f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
tapply(x,f,mean)
1 2 3
0.07368817 0.42992416 0.64212383
How are the 1, 2, 3's decided? I am assuming they are levels of something.
Furthermore, why is f used as the second argument? I don't see why it is an index, and how does tapply know when to stop running through the index?
I tried looking up the function definition, but to no avail.
If you are asking about how tapply works (rather than gl) consider another simpler example:
> x1 <- c(1,1,2,2,3,3)
> tapply(x1, x1, mean)
1 2 3
1 2 3
> f2 <- c(2,2,2,2,3,3)
> tapply(x1, f2, mean)
2 3
1.5 3.0
In the first case, tapply has grouped the first two items (both labelled 1) and found their mean, giving 1 for group 1; the next two items (2 and 2) have mean 2, and so on.
In the second case, the first four items are treated as 2's, having mean (1+1+2+2)/4 = 1.5, and the last two as 3's, having mean (3+3)/2 = 3.
In effect, the "index" labels the data, and the requested function is applied to each "group".
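One way to see the "labelling" idea is to do the grouping by hand. The sketch below splits the data by the index with split(), applies mean to each piece with sapply(), and compares the result with tapply. It also shows that gl(3, 10) from the question is simply a factor with three levels, each repeated ten times. The variable names (groups, by_hand, via_tapply) are illustrative only.

```r
x1 <- c(1, 1, 2, 2, 3, 3)
f2 <- c(2, 2, 2, 2, 3, 3)

# Step 1: split the data into one piece per distinct label in f2
groups <- split(x1, f2)
print(groups)

# Step 2: apply the function to each piece
by_hand <- sapply(groups, mean)

# tapply does both steps in one call
via_tapply <- tapply(x1, f2, mean)
print(all.equal(as.vector(by_hand), as.vector(via_tapply)))

# gl(3, 10) produces the labels 1, 2, 3, each repeated 10 times, as a factor
f <- gl(3, 10)
print(identical(as.integer(f), rep(1:3, each = 10)))
```

Because the index has the same length as the data, there is no separate stopping rule: every element of x gets exactly one label.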

R table function

If I have a vector numbers <- c(1,1,2,4,2,2,2,2,5,4,4,4), and I use 'table(numbers)', I get
names 1 2 4 5
counts 2 5 4 1
What if I want it to include 3 as well or, more generally, all numbers in 1:max(numbers), even if they do not occur in numbers? That is, how would I generate output such as:
names 1 2 3 4 5
counts 2 5 0 4 1
If you want R to count values that do not occur in the data, you should create a factor and explicitly set the levels. table will return a count for each level, including zero counts.
table(factor(numbers, levels=1:max(numbers)))
# 1 2 3 4 5
# 2 5 0 4 1
For this particular example (positive integers), tabulate would also work:
numbers <- c(1,1,2,4,2,2,2,2,5,4,4,4)
tabulate(numbers)
# [1] 2 5 0 4 1
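If the range should extend beyond max(numbers), both approaches can be given an explicit upper bound. The sketch below uses a hypothetical bound of 7 (the name upper is made up for this example); levels= controls the range for table, and the nbins argument does the same for tabulate. Since tabulate returns an unnamed vector, names are attached by hand.

```r
numbers <- c(1, 1, 2, 4, 2, 2, 2, 2, 5, 4, 4, 4)
upper <- 7  # hypothetical upper bound, larger than max(numbers)

# table: the factor levels define which values get a count
counts_table <- table(factor(numbers, levels = seq_len(upper)))

# tabulate: nbins defines how many bins (1..nbins) are counted
counts_tab <- tabulate(numbers, nbins = upper)
names(counts_tab) <- seq_len(upper)

print(counts_table)
print(counts_tab)
```

Note that tabulate only handles positive integers, so the factor approach is the more general of the two.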

Create sequence of repeated values, in sequence?

I need a sequence of repeated numbers, i.e. 1 1 ... 1 2 2 ... 2 3 3 ... 3 etc. The way I implemented this was:
nyear <- 20
names <- c(rep(1,nyear),rep(2,nyear),rep(3,nyear),rep(4,nyear),
rep(5,nyear),rep(6,nyear),rep(7,nyear),rep(8,nyear))
which works, but is clumsy, and obviously doesn't scale well.
How do I repeat the N integers M times each in sequence?
I tried nesting seq() and rep() but that didn't quite do what I wanted.
I can obviously write a for-loop to do this, but there should be an intrinsic way to do this!
You missed the each= argument to rep():
R> n <- 3
R> rep(1:5, each=n)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
R>
so your example can be done with a simple
R> rep(1:8, each=20)
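For comparison, here is a short sketch of how each= differs from times=, which is likely what went wrong when nesting seq() and rep(). The variable names are illustrative.

```r
# each= repeats every element before moving on to the next
by_each <- rep(1:3, each = 2)
print(by_each)    # 1 1 2 2 3 3

# times= (a single number) repeats the whole vector
by_times <- rep(1:3, times = 2)
print(by_times)   # 1 2 3 1 2 3

# times= (a vector) gives a separate repeat count per element
by_counts <- rep(1:3, times = c(2, 1, 3))
print(by_counts)  # 1 1 2 3 3 3
```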
Another base R option could be gl():
gl(5, 3)
Where the output is a factor:
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Levels: 1 2 3 4 5
If integers are needed, you can convert it:
as.integer(gl(5, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
For your example, Dirk's answer is perfect. If you instead had a data frame and wanted to add that sort of sequence as a column, you could also use group from groupdata2 (disclaimer: my package) to greedily divide the datapoints into groups.
# Attach groupdata2
library(groupdata2)
# Create a random data frame
df <- data.frame("x" = rnorm(27))
# Create groups with 5 members each (except last group)
group(df, n = 5, method = "greedy")
x .groups
<dbl> <fct>
1 0.891 1
2 -1.13 1
3 -0.500 1
4 -1.12 1
5 -0.0187 1
6 0.420 2
7 -0.449 2
8 0.365 2
9 0.526 2
10 0.466 2
# … with 17 more rows
There's a whole range of methods for creating this kind of grouping factor, e.g. by number of groups, by a list of group sizes, or by having groups start when the value in some column differs from the value in the previous row (e.g. if a column is c("x","x","y","z","z"), the grouping factor would be c(1,1,2,3,3)).
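For readers who want to see the idea without installing a package, two of these groupings can be approximated in base R. This is only a sketch of the concept, not how groupdata2 implements it, and the variable names (greedy_groups, change_groups) are made up for the example.

```r
# Fixed-size ("greedy") groups: 27 rows in groups of 5, last group smaller
n_rows <- 27
group_size <- 5
greedy_groups <- factor(ceiling(seq_len(n_rows) / group_size))
print(table(greedy_groups))

# New group whenever the value differs from the previous row
values <- c("x", "x", "y", "z", "z")
is_start <- c(TRUE, values[-1] != values[-length(values)])
change_groups <- cumsum(is_start)
print(change_groups)  # 1 1 2 3 3
```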
