b = c(1,1,2,2,3,3,4,4,1)
c = c(10,10,20,20,30,30,40,40,5)
a <- NULL
a <- matrix(c(b,c), ncol=2)
What I want to do is compare the numbers in the first column of this matrix. If a number is equal to the next consecutive number in that column (in this case 1 = 1, and so on), I want to add the corresponding numbers in the second column together (as in 10 + 10 = 20, and so on), giving a single value, and then store these sums in a separate vector.
The output matrix I am looking for is as follows:
     [,1] [,2] [,3]
[1,]    1   10   20
[2,]    1   10   40
[3,]    2   20   60
[4,]    2   20   80
[5,]    3   30    5
[6,]    3   30
[7,]    4   40
[8,]    4   40
[9,]    1    5
I am quite new to R and struggling with this. Thank you in advance!
This sounds like a job for rle and tapply:
b = c(1,1,2,2,3,3,4,4,1)
c = c(10,10,20,20,30,30,40,40,5)
a <- NULL
a <- matrix(c(b,c), ncol=2)
A <- rle(a[, 1])$lengths
tapply(a[, 2], rep(seq_along(A), A), sum)
# 1 2 3 4 5
# 20 40 60 80 5
Explanation:
rle identifies the run-lengths of the items in the first column of matrix "a".
We create a grouping variable for tapply from the run-lengths using rep(seq_along(A), A).
We put those two things together in tapply to get the sums you want.
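For clarity, here are the intermediate values those two steps produce for the matrix "a" above:
A
# [1] 2 2 2 2 1
rep(seq_along(A), A)
# [1] 1 1 2 2 3 3 4 4 5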
Is this what you want? I bet there are cleaner base solutions, but I'll give it a try with rollsum from the zoo package:
library(zoo)
mm <- cbind(c(1, 1, 2, 2, 3, 3, 4, 4, 1), c(10, 10, 20, 20, 30, 30, 40, 40, 5))
# calculate all lagged sums of column 2
sums <- rollsum(x = mm[ , 2], k = 2)
# calculate differences between consecutive numbers in column 1
diffs <- diff(mm[ , 1])
# select sums where diff is 0, i.e. where the two consecutive numbers in column 1 are equal.
sums2 <- sums[diffs == 0]
sums2
# [1] 20 40 60 80
I'm just starting to use R and am feeling a bit confused. Suppose I have three columns:
data = data.frame(id=c(101, 102, 103),column1=c(2, 4, 9),
column2=c(3, 4, 2), column3=c(5, 15, 7))
How can I create a new column (e.g., colmean) that is the mean of the two columns closest in value? I thought about doing a bunch of ifelse statements, but that seemed unnecessarily messy.
In this case, for instance, colmean=c(2.5, 4, 8).
Borrowing the function findClosest() created here by #Cole, we can do the following:
findClosest <- function(x, n) {
  # Sort the values, then take the n consecutive sorted values
  # that span the smallest range (smallest lag-(n-1) difference)
  x <- sort(x)
  x[seq.int(which.min(diff(x, lag = n - 1L)), length.out = n)]
}
colMeans(apply(data[-1], 1, function(i)findClosest(i, 2)))
#[1] 2.5 4.0 8.0
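To see what findClosest() does for a single row, take the first row of data (values 2, 3 and 5): the two closest values are 2 and 3, and their mean is 2.5.
findClosest(c(2, 3, 5), 2)
# [1] 2 3
mean(findClosest(c(2, 3, 5), 2))
# [1] 2.5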
A vectorized function using the Rfast package:
library(Rfast)
fClosest <- function(m, n) {
  m <- colSort(t(m))
  matrix(
    m[
      sequence(
        rep(n, ncol(m)),
        seq(0, nrow(m)*(ncol(m) - 1), nrow(m)) + colMins(diff(m, lag = n - 1))
      )
    ],
    ncol(m), n, TRUE
  )
}
m <- matrix(sample(10, 24, 1), 4)
m
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 4 2 6 2 5 3
#> [2,] 3 4 7 3 4 7
#> [3,] 4 2 7 6 10 2
#> [4,] 8 1 10 8 2 9
fClosest(m, 3L)
#> [,1] [,2] [,3]
#> [1,] 2 2 3
#> [2,] 3 3 4
#> [3,] 2 2 4
#> [4,] 8 8 9
rowMeans(fClosest(m, 3L))
#> [1] 2.333333 3.333333 2.666667 8.333333
Here is a version with a loop:
data = data.frame(id=c(101, 102, 103),column1=c(2, 4, 9),
column2=c(3, 4, 2), column3=c(5, 15, 7))
data$colmean <- NaN # set up empty column for results
for(i in seq(nrow(data))){
  data.i <- data[i,-1] # get ith row
  d <- as.matrix(dist(c(data.i))) # get distances between values
  diag(d) <- NaN # replace diagonal of distance matrix with NaN
  hit <- which.min(d) # identify the position of the lowest distance
  pos <- c(row(d)[hit], col(d)[hit]) # get the row/column position (i.e. the two values that are closest)
  data$colmean[i] <- mean(unlist(data.i[pos])) # calculate their mean
}
data
# id column1 column2 column3 colmean
# 1 101 2 3 5 2.5
# 2 102 4 4 15 4.0
# 3 103 9 2 7 8.0
Here's a self-contained solution, based on the tidyverse, that is independent of the number of columns to be compared.
library(tidyverse)
data %>%
  # Add the means of the smallest pairwise differences to the input data
  bind_cols(
    data %>%
      # Make the data tidy (and hence independent of the number of "column"s)
      pivot_longer(starts_with("column")) %>%
      # For each id/row (replace with rowwise() if appropriate)
      group_by(id) %>%
      group_map(
        function(.x, .y) {
          # Form a tibble of all pairwise combinations of values
          as_tibble(t(combn(.x$value, 2))) %>%
            # Calculate pairwise differences
            mutate(difference = abs(V1 - V2)) %>%
            # Find the smallest pairwise difference
            arrange(difference) %>%
            head(1) %>%
            # Calculate the mean of this pair
            pivot_longer(starts_with("V")) %>%
            summarise(colmean = mean(value))
        }
      ) %>%
      # Convert list of values to column
      bind_rows()
  )
id column1 column2 column3 colmean
1 101 2 3 5 2.5
2 102 4 4 15 4.0
3 103 9 2 7 8.0
Below is a swapping function which swaps values less than 10 in a list:
swapFun <- function(x, n = 10){
  inx <- which(x < n)
  x[sample(inx)] <- x[inx]
  x
}
For example, the original list is 1, 2, 3, 10, 4, 11.
After swapping by sampling, this list might become 2, 1, 4, 10, 3, 11 or 1, 3, 2, 10, 4, 11.
But I want to swap each value less than 10 to a different value less than 10.
For example, the first outcome (i.e. 2, 1, 4, 10, 3, 11) is what I want, because each value less than 10 has been swapped to a different value less than 10.
However, the second outcome (i.e. 1, 3, 2, 10, 4, 11) is not what I want, because 1 and 4 have not been swapped to different values less than 10.
If there is no feasible solution, just print 'no feasible solution'.
Any suggestions?
Many thanks.
You are looking for a derangement of the values less than 10. By the theory of derangements, approximately 1/e (37%) of randomly chosen permutations are derangements, so a hit or miss approach is reasonable, with an important caveat.
There might be repetitions among the items less than n. Not all permutations of those items are distinguishable, so not all derangements of the items look like derangements: swapping two 2s with each other (for example) is in some sense a derangement, but it wouldn't look like a derangement. The 1/e heuristic applies to raw permutations of positions, not distinguishable permutations of values. If the number of repetitions is high, it might take longer than 1/e would suggest. If in your use-case the performance isn't satisfactory, you would need to replace sample() in the function definitions by a more sophisticated function that picks random distinguishable permutations.
As far as feasibility goes, there will be a feasible solution as long as the most common element less than n doesn't account for more than 50% of the items less than n:
derangement <- function(x){
  # no derangement exists if one value accounts for more than half the items
  if(max(table(x)) > length(x)/2) return(NA)
  # rejection sampling: reshuffle until no element stays in its original place
  while(TRUE){
    y <- sample(x)
    if(all(y != x)) return(y)
  }
}

swapFun <- function(x, n = 10){
  inx <- which(x < n)
  y <- derangement(x[inx])
  if(length(y) == 1) return(NA) # NA from derangement() means no feasible solution
  x[inx] <- y
  x
}
For example,
> set.seed(10)
> swapFun(c(1,2,10,4,11,2,12))
[1] 2 4 10 2 11 1 12
> swapFun(c(2,2,10,4,11,2,12))
[1] NA
Note that no valid derangement has length 1, but NA has length 1, so testing the length of y is an effective way to check whether the values can be deranged. The function returns NA if no derangement of the values less than n exists, so you can test for NA and print "no feasible solution" if you want.
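A minimal sketch of that check, using the infeasible example above:
res <- swapFun(c(2, 2, 10, 4, 11, 2, 12))
if (length(res) == 1 && is.na(res)) print("no feasible solution")
# [1] "no feasible solution"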
This function gives you all the unique permutations for the numbers < m while keeping the positions of numbers >= m the same.
require(combinat)
x <- c(1,2,10,4,11,2,12)
m <- 10
swapFun <- function(x, m){
  # determine positions of values to be permuted or fixed
  xi <- which(x < m)
  xj <- which(x >= m)
  # make permutations
  xp <- do.call(rbind, permn(x[xi]))
  # make matrix with permuted and fixed values
  xn <- matrix(nrow = nrow(xp), ncol = length(x))
  xn[ ,xi] <- xp
  xn[ ,xj] <- sort(rep(x[xj], nrow(xp)))
  # delete duplicates
  d <- !duplicated(apply(xn, 1, paste, collapse = "_"))
  xn <- xn[d,]
  return(xn)
}
> swapFun(x,m)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 2 10 4 11 2 12
[2,] 1 2 10 2 11 4 12
[3,] 2 1 10 2 11 4 12
[4,] 2 1 10 4 11 2 12
[5,] 1 4 10 2 11 2 12
[6,] 4 1 10 2 11 2 12
[7,] 4 2 10 1 11 2 12
[8,] 2 4 10 1 11 2 12
[9,] 2 4 10 2 11 1 12
[10,] 4 2 10 2 11 1 12
[11,] 2 2 10 4 11 1 12
[12,] 2 2 10 1 11 4 12
I'm trying to create a vector whose elements add up to a specific number. For example, let's say I want to create a vector with 4 elements that must add up to 20; its elements could be 6, 6, 4, 4 or 2, 5, 7, 6, whatever. I tried a few things using sample() and seq() but could not make it work.
Any help appreciated.
To divide 20 into 4 parts, you need three breakpoints chosen from the 19 possible breaks between the 20 units. The parts are then just the sizes of the intervals between 0, your breakpoints, and 20:
> sort(sample(19,3))
[1] 5 7 12
> diff(c(0, 5,7,12,20))
[1] 5 2 5 8
Test: let's create a big matrix of them. Each column is one instance:
> trials = sapply(1:1000, function(X){diff(c(0,sort(sample(19,3)),20))})
> trials[,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3 1 8 13 3 2
[2,] 4 7 10 2 9 5
[3,] 2 11 1 4 3 7
[4,] 11 1 1 1 5 6
Do they all add to 20?
> all(apply(trials,2,sum)==20)
[1] TRUE
Are there any weird cases?
> range(trials)
[1] 1 17
No: there are no zeroes and nothing bigger than 17, which would be the (1, 1, 1, 17) case. You can't have an 18 without a zero.
foo = function(n, sum1){
  #Divide sum1 into 'n' parts
  x = rep(sum1/n, n)
  #For each x, sample a value from 1 to that value minus one
  f = sapply(x, function(a) sample(1:(a-1), 1))
  #Add and subtract f from 'x' so that sum(x) does not change
  x = x + sample(f)
  x = x - sample(f)
  x = floor(x)
  x[n] = x[n] - (sum(x) - sum1)
  return(x)
}
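As a quick sanity check of foo() (the exact values vary from run to run, but each result sums to the requested total):
v <- foo(4, 20)
v        # four positive integers; exact values vary between runs
sum(v)
# [1] 20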
I want to count the number of zeros in each column of an R data frame and express it as a percentage. Can this percentage then be added as the last row of the original data frame?
example
x <- c(0, 4, 6, 0, 10)
y <- c(3, 0, 9, 12, 15)
z <- c(3, 6, 9, 0, 15)
data_a <- cbind(x,y,z)
I want to see the zeros in each column and express them as a percentage.
Thanks
x <- c(0, 4, 6, 0, 10)
y <- c(3, 0, 9, 12, 15)
z <- c(3, 6, 9, 0, 15)
data_a <- cbind(x,y,z)
#This is a matrix not a data.frame.
res <- colSums(data_a==0)/nrow(data_a)*100
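res is then a named vector of percentages:
res
#  x  y  z
# 40 20 20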
If you must, rbind to the matrix (usually not really a good idea).
rbind(data_a, res)
# x y z
# 0 3 3
# 4 0 6
# 6 9 9
# 0 12 0
# 10 15 15
# res 40 20 20
Here is one more method using lapply; note that this works on a data frame rather than a matrix.
lapply(data_a, function(x){ length(which(x==0))/length(x)})
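For instance, applied to a data frame version of the example data (data_a above is a matrix, so convert it first):
data_df <- as.data.frame(data_a)
lapply(data_df, function(x){ length(which(x==0))/length(x)})
# $x
# [1] 0.4
#
# $y
# [1] 0.2
#
# $z
# [1] 0.2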
A combination of prop.table and some *apply work can give you the same answer as #Roland's:
> prop <- apply(data_a, 2, function(x) prop.table(table(x))*100)
> rbind(data_a, sapply(prop, "[", 1))
x y z
[1,] 0 3 3
[2,] 4 0 6
[3,] 6 9 9
[4,] 0 12 0
[5,] 10 15 15
[6,] 40 20 20
This is probably inelegant, but this is how I went about it when my columns had NAs:
#Returns the number of zeroes in a column
numZero <- colSums(vars == 0, na.rm = T)
#Returns the number of NA entries in each column
numNA <- colSums(is.na(vars))
#Returns total sample size
numSamp <- rep(nrow(vars), ncol(vars))
#Combine the three
varCheck <- as.data.frame(cbind(numZero, numNA, numSamp))
#Number of observations for that variable
varCheck$numTotal <- varCheck$numSamp - varCheck$numNA
#Percentage zero
varCheck$pctZero <- varCheck$numZero / varCheck$numTotal
#Check which variables are zero in more than 99% of observations (i.e. less than 1% non-zero)
varCheck[which(varCheck$pctZero > 0.99),]
I have a data set of dimension 401*5677. Among the columns of this matrix there are columns which are identical but have different column names.
Now, I want to keep only one column from each set of repeated columns, and also get the indices j of the columns removed.
Let us use the following as an example matrix:
B=matrix(c(1,4,0,2,56,7,1,4,0,33,2,5), nrow=3)
colnames(B)<-c("a","b","c","d")
What I did so far (on my real matrix G) is:
corrG<-cor(G)
Gtest=G
for (i in 1:nrow(corrG)){
  for (j in 1:ncol(corrG)){
    if (i<j && corrG[i,j]==1){
      Gtest[,j]=NA
    }
  }
}
Gfinal<-Gtest[,complete.cases(t(Gtest))]
My code returns a matrix that still contains (!) some duplicated columns.
Any help?
Try the duplicated() function on the transpose of the matrix:
duplicated.columns <- duplicated(t(your.matrix))
new.matrix <- your.matrix[, !duplicated.columns]
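Since the question also asks for the indices of the removed columns, the same logical vector gives you those; illustrated on the small example matrix B from the question:
duplicated.columns <- duplicated(t(B))
which(duplicated.columns)        # index of the dropped column: 3
colnames(B)[duplicated.columns]  # its name: "c"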
A one-line answer:
B = matrix(c(1, 4, 0, 2, 56, 7, 1, 4, 0, 33, 2, 5), nrow = 3)
colnames(B) <- c("a", "b", "c", "d")
B
## a b c d
## [1,] 1 2 1 33
## [2,] 4 56 4 2
## [3,] 0 7 0 5
B[, !duplicated(t(B))]
## a b d
## [1,] 1 2 33
## [2,] 4 56 2
## [3,] 0 7 5