Permuting the positions of numbers in R

I'm looking for a function in R that can do a permutation. For example, I have a vector with five 1s and ten 0s, like this:
> status=c(rep(1,5),rep(0,10))
> status
[1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
Now I'd like to randomly permute the positions of these numbers while keeping the same number of 0s and 1s in the vector, to get a new series of numbers, for example something like this:
1 1 0 1 0 1 0 0 0 0 0 1 0 0 0
or
1 0 0 0 0 0 0 1 1 0 0 1 0 1 0
I found that the function sample() can help with sampling, but the number of 1s and 0s is not the same each time. Do you know how I can do this in R? Thanks in advance.

We can use sample
sample(status)
#[1] 1 0 0 1 0 0 1 0 0 0 0 1 0 1 0
sample(status)
#[1] 0 0 0 0 1 1 0 0 1 1 0 0 0 1 0
If we pass the entire vector to sample() with no other arguments, it returns a random permutation, so the frequency of each unique element stays the same:
colSums(replicate(5, sample(status)))
#[1] 5 5 5 5 5
That is, every permutation contains five 1s, so the remaining ten elements are 0s.
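The question does not show the original sample() call, so the following is only a guess at why the counts varied: sampling with replace = TRUE (or from c(0, 1) instead of from status) draws each position independently. A minimal sketch contrasting the two, where set.seed() is there only to make the example reproducible:

```r
set.seed(42)  # only so the example is reproducible

status <- c(rep(1, 5), rep(0, 10))

# Permutation: sampling the whole vector without replacement keeps the counts.
perm <- sample(status)
table(perm)   # always ten 0s and five 1s

# Assumed cause of the problem: sampling with replacement draws each
# position independently, so the number of 1s changes between calls.
draws <- replicate(1000, sum(sample(status, replace = TRUE)))
range(draws)
```

If the number of 1s has to stay fixed, leave replace at its default of FALSE and sample the full vector.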

Related

Creating a repeated sequence of zero and ones with uneven "breaks" between

I am trying to create a sequence consisting of 1s and 0s using RStudio.
My desired output is a sequence that first has five 1s, then six 0s, followed by four 1s, then six 0s. This should then all be repeated until the end of a given vector.
The result should be like this:
1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 .....
Hope someone has a good solution, and sorry if I have some grammar mistakes
Best,
HB
rep(c(rep(1,5),rep(0,6),rep(1,4),rep(0,6)),n)
This repeats your pattern n times, where n is the number of repetitions you want.
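Since the question asks for the pattern to run "until the end of a given vector", a fixed output length may be more convenient than a repetition count. A small sketch using rep_len(), where target_len is a hypothetical placeholder for the length of your vector:

```r
# One period of the pattern: five 1s, six 0s, four 1s, six 0s.
pattern <- rep(c(1, 0, 1, 0), times = c(5, 6, 4, 6))

# Hypothetical target length; replace with length(your_vector).
target_len <- 50

# rep_len() recycles the pattern and cuts it off at target_len,
# even if that falls in the middle of a period.
out <- rep_len(pattern, target_len)
out
```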
You could use Map.
unlist(Map(function(x, ...) c(rep(x, ...), rep(0, 6)), 1, times=length(v):1))
# [1] 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0
Instead of length(v):1 you may also use rev(seq(v)) but it's slower.
Data
v <- c("Vector", "of", "specific", "length", "five")

R - Creating a new column within a data frame when two or more columns are a match in a row

I'm currently stuck on a part of my code that feels intuitive, but I can't figure out a way to do it. I have a very big data frame (nrows = 34036, ncol = 43) in which I want to create a continuous sequence of the variables where the value of the row is 1 (without having multiple columns with 1). It consists of only zeros and ones, similar to the following:
A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1
I was able to remove the zeroes using:
#find the sum of each row
placeholderData <- transform(placeholderData, sum=rowSums(placeholderData))
placeholderData <- placeholderData[!(placeholderData$sum <= 0),]
And the data frame now looks like:
A B C D sum
1 0 0 0 1
0 0 0 1 1
0 0 0 1 1
1 0 1 0 2
1 0 1 0 2
0 1 0 0 1
0 1 0 0 1
1 0 0 1 2
My main problem comes when there are two or more 1's in a row. To try to solve this, I used the following code to identify the columns that have a sum of 2 or more:
placeholderData$Matches <- lapply(apply(placeholderData == 1, 1, which), names)
Which added the following column to the data frame:
A B C D sum Matches
1 0 0 0 1 A
0 0 0 1 1 D
0 0 0 1 1 D
1 0 1 0 2 c("A","C")
1 0 1 0 2 c("A","C")
0 1 0 0 1 B
0 1 0 0 1 B
1 0 0 1 2 c("A", "D")
I added the Matches column as an approach to solving the problem, but I'm not sure how I would do it without using a lot of logical operators (I don't know in advance which columns have matches). What I would like to do is aggregate the rows that have two or more 1s into a new column, so that I end up with a data frame like this:
A B C D AC AD sum Matches
1 0 0 0 0 0 1 A
0 0 0 1 0 0 1 D
0 0 0 1 0 0 1 D
0 0 0 0 1 0 1 c("A","C")
0 0 0 0 1 0 1 c("A","C")
0 1 0 0 0 0 1 B
0 1 0 0 0 0 1 B
0 0 0 0 0 1 1 c("A", "D")
Then, I would be able to use my code as normal (It works just fine when there are no repeated values in rows). I tried searching to find similar questions, but I'm not sure if I was even asking the right question. I was wondering if anyone could provide some help or some ideas that I could try.
Thank you very much!
This seems a lot like making dummy variables, so I would use the model.matrix function commonly used for dummy variables (one-hot encoding):
m = read.table(header = TRUE, text = "A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1")
m = m[rowSums(m) > 0, ]
d = factor(sapply(apply(m == 1, 1, which), function(x) paste(names(m)[x], collapse = "")))
result = data.frame(model.matrix(~ d + 0))
names(result) = levels(d)
# A AC AD B D
# 1 1 0 0 0 0
# 2 0 0 0 0 1
# 3 0 0 0 0 1
# 4 0 1 0 0 0
# 5 0 1 0 0 0
# 6 0 0 0 1 0
# 7 0 0 0 1 0
# 8 0 0 1 0 0
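As a cross-check, the same one-hot layout can be sketched in base R without model.matrix. This is an alternative, not the answer's method; the names key, lv and onehot are made up for the illustration, and the columns come out in order of first appearance rather than alphabetically:

```r
# Example data from the question, rebuilt column by column.
m <- data.frame(
  A = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 1),
  B = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0),
  C = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0),
  D = c(0, 1, 1, 0, 0, 0, 0, 0, 0, 1)
)
m <- m[rowSums(m) > 0, ]  # drop all-zero rows

# Label each row by the names of the columns that contain a 1, e.g. "AC".
key <- apply(m == 1, 1, function(r) paste(names(m)[r], collapse = ""))

# One indicator column per distinct label.
lv <- unique(key)
onehot <- 1L * outer(key, lv, "==")
colnames(onehot) <- lv
onehot
```

Every row of onehot contains exactly one 1, which is the property the question needs before the rest of the code can run.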

Count number of unique instances in a column depending on values in other columns

I've got the following table (which is called train) (in reality much bigger)
UNSPSC adaptor alert bact blood collection packet patient ultrasoft whit
514415 0 0 0 0 0 0 0 1 0
514415 0 0 0 1 0 0 0 1 0
514415 0 0 1 0 0 0 0 1 0
514415 0 0 0 0 0 0 0 1 0
514415 0 0 0 0 0 0 0 1 0
514415 0 0 0 0 0 0 0 1 0
422018 0 0 0 0 0 0 0 1 0
422018 0 0 0 0 0 0 0 1 0
422018 0 0 0 1 0 0 0 1 0
411011 0 0 0 0 0 0 0 1 0
I want to calculate the number of unique UNSPSC per column where the value is equal to 1. So for column blood it will be 2 and for column ultrasoft will be 3.
I'm doing this but don't know how to continue:
apply(train[,-1], 2, ......)
I'm trying not to use loops.
To continue from where you left, we can use apply with margin=2 and calculate the length of unique values of "UNSPSC" for each column.
apply(train[-1], 2, function(x) length(unique(train$UNSPSC[x==1])))
#adaptor alert bact blood collection packet
# 0 0 1 2 0 0
#patient ultrasoft whit
# 0 3 0
A better option is sapply/lapply, which gives the same result but, unlike apply, does not convert the data frame into a matrix.
sapply(train[-1], function(x) length(unique(train$UNSPSC[x==1])))
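If you only need the number of 1s in each column, and the columns contain only 0 and 1 as in the example, colSums is enough. Note that it counts rows rather than unique UNSPSC values, so it gives 10 for ultrasoft instead of 3:
colSums(train[,-1]) # drop identifier columns such as UNSPSC before use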
# adaptor alert bact blood collection packet patient
# 0 0 1 2 0 0 0
# ultrasoft whit
# 10 0
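To make the difference between the two answers concrete, here is a small sketch on a cut-down version of the example table (only three of the indicator columns are kept; n_unique and n_ones are names chosen for the illustration):

```r
# Reduced copy of the example: 10 rows, three indicator columns.
train <- data.frame(
  UNSPSC    = c(rep(514415, 6), rep(422018, 3), 411011),
  bact      = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
  blood     = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 0),
  ultrasoft = rep(1, 10)
)

# Distinct UNSPSC codes among the rows where each column equals 1.
n_unique <- sapply(train[-1], function(x) length(unique(train$UNSPSC[x == 1])))

# colSums() counts the rows that equal 1, regardless of UNSPSC.
n_ones <- colSums(train[-1])

rbind(n_unique, n_ones)
```

The two rows agree for bact and blood but not for ultrasoft, where the same UNSPSC code appears in several rows.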

Merging two columns with two values

I have columns whose names I know and whose values are 0 and 1.
I would like to merge them into one column: if a row contains a 1, take that value, and if both columns are 1, keep 1.
Example of data:
stockI stockII
1 0
1 0
0 0
0 0
0 0
0 0
0 0
1 0
0 0
1 1
The output I expect:
stockI/stockII
0
1
0
0
0
0
0
0
0
1
Is there a cbind method that can do this?
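We can try the following, which returns 1 where stockI is 1 in both the current and the previous row, or where both columns are 1: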
as.integer(with(df1, (c(FALSE,stockI[-1] &
stockI[-nrow(df1)]) & stockI) | (stockI & stockII)))
#[1] 0 1 0 0 0 0 0 0 0 1
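The wording of the question can also be read as a plain row-wise OR (keep 1 if either column is 1). That reading does not reproduce the expected output listed in the question, so treat the following only as a sketch under that assumption; df1 is rebuilt here from the example data:

```r
# Example data from the question.
df1 <- data.frame(
  stockI  = c(1, 1, 0, 0, 0, 0, 0, 1, 0, 1),
  stockII = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
)

# Row-wise OR: 1 if at least one of the two columns is 1.
merged_or <- as.integer(df1$stockI | df1$stockII)

# Equivalent for 0/1 data: the element-wise maximum.
merged_max <- pmax(df1$stockI, df1$stockII)

merged_or
#[1] 1 1 0 0 0 0 0 1 0 1
```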

How can I calculate an empirical CDF in R?

I'm reading a sparse table from a file which looks like:
1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1
Note row lengths are different.
Each row represents a single simulation. The value in the i-th column in each row says how many times value i-1 was observed in this simulation. For example, in the first simulation (first row), we got a single result with value '0' (first column), 7 results with value '2' (third column) etc.
I wish to create an average cumulative distribution function (CDF) for all the simulation results, so I could later use it to calculate an empirical p-value for true results.
To do this I can first sum up each column, but I need to treat the missing columns as zeros.
How do I read such a table with different row lengths? How do I sum up the columns, replacing undefined values with 0? And finally, how do I create the CDF? (I can do this manually, but I guess there is some package that can do it.)
This will read the data in:
dat <- textConnection("1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1")
df <- data.frame(scan(dat, fill = TRUE, what = as.list(rep(1, 29))))
names(df) <- paste("Val", 1:29)
close(dat)
Resulting in:
> head(df)
Val 1 Val 2 Val 3 Val 4 Val 5 Val 6 Val 7 Val 8 Val 9 Val 10 Val 11 Val 12
1 1 0 7 0 0 1 0 0 0 5 0 0
2 1 0 0 1 0 0 0 3 0 0 0 0
3 0 0 0 1 0 0 0 2 0 0 0 0
4 1 0 0 1 0 3 0 0 0 0 1 0
5 0 0 0 1 0 0 0 2 0 0 0 0
....
If the data are in a file, provide the file name instead of dat. This code presumes that there are a maximum of 29 columns, as per the data you supplied. Alter the 29 to suit the real data.
We get the column sums using
df.csum <- colSums(df, na.rm = TRUE)
The ecdf() function then generates the ECDF of these column sums:
df.ecdf <- ecdf(df.csum)
and we can plot it using the plot() method:
plot(df.ecdf, verticals = TRUE)
You can use the ecdf() (in base R) or Ecdf() (from the Hmisc package) functions.
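Note that ecdf() treats its input as a sample of observations. If the column sums are instead read the way the question describes them, as the number of times each value 0, 1, 2, ... was observed, a CDF over those values can be built directly with cumsum(). The sketch below follows that reading; it also uses count.fields() so the column count is not hard-coded, and observed is a hypothetical value used only to show a p-value lookup:

```r
txt <- "1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1"

# Find the longest row so the number of columns is not hard-coded.
con <- textConnection(txt)
n_col <- max(count.fields(con))
close(con)

# fill = TRUE pads short rows with NA; col.names fixes the width.
df <- read.table(text = txt, fill = TRUE,
                 col.names = paste0("Val", seq_len(n_col)))

# Total count per value, treating missing cells as 0.
counts <- colSums(df, na.rm = TRUE)
values <- seq_len(n_col) - 1   # column i holds the count for value i - 1

# Cumulative distribution over the values 0, 1, ..., n_col - 1.
cdf <- cumsum(counts) / sum(counts)

# Hypothetical observed value: share of simulated results at least as large.
observed <- 9
p_value <- sum(counts[values >= observed]) / sum(counts)
```

For data in a file, pass the file name to count.fields() and read.table() instead of using txt.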
