Create frequency vector based on input vector - r

I have a variable test with the following structure:
> test <- c(9,87)
> names(test) <- c("VGP", "GGW")
> dput(test)
structure(c(9, 87), .Names = c("VGP", "GGW"))
> class(test)
[1] "numeric"
This is a very simplified version of the input vector. I want the output to be a vector of length 100 containing the frequency (as a percentage) of each number from 1 to 100 inclusive. The real input vector has a length of ~1000000, so I am looking for an approach that works for a vector of any length, assuming it contains only the numbers 1-100.
In this example, every position except 9 and 87 will be 0, and the 9th and 87th elements will both be 50.
How can I generate this output?

If we want proportions that also cover the values absent from the vector, with those values reported as 0, convert the vector to a factor with the levels specified and then apply table and prop.table:
100*prop.table(table(factor(test, levels = 1:100)))
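For a very long input, a base R alternative that skips the factor conversion is tabulate, which counts positive integers into a fixed number of bins. This is only a sketch under the question's assumption that every value is a whole number between 1 and 100; counts and pct are illustrative names.

```r
# Example input from the question
test <- c(VGP = 9, GGW = 87)

# Count occurrences of each integer 1..100; values absent from test get 0
counts <- tabulate(test, nbins = 100)

# Convert the counts to percentages of all counted values
pct <- 100 * counts / sum(counts)
pct[c(9, 87)]
```

Unlike the table approach, the result is a plain unnamed numeric vector, so position k holds the percentage for the value k.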

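freq <- numeric(100)
for (i in X) {
  # count only values inside the range 1..100
  if (i >= 1 && i <= 100) {
    freq[i] <- freq[i] + 1
  }
}
freq
Here X is the input vector (10000 elements in my test).
The if condition ensures that only values in the range [1, 100] are counted. freq holds counts; use 100 * freq / sum(freq) if you need percentages.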
Hope this helps.

If you have a numeric vector and just want to get a frequency table of the values, use the table function.
set.seed(1234)
d <- sample(1:10, 1000, replace = TRUE)
x <- table(d)
x
# 1 2 3 4 5 6 7 8 9 10
# 92 98 101 104 87 112 104 94 88 120
If some possible values may be missing from the data (say 11 is a possible value in my example), then I'd do the following:
y <- rep(0, 11)
names(y) <- as.character(1:11)
y[as.numeric(names(x))] <- x
y
# 1 2 3 4 5 6 7 8 9 10 11
# 92 98 101 104 87 112 104 94 88 120 0

Related

Looping over the first 100 values then looping over the next 100 values

I have a vector of True and False values. The length of the vector is 1000.
vect <- [T T F T F F..... x1000]
I want to loop over the first 100 values (i.e. 1:100), count the true and false values, and store the result in some variable (e.g. True <- 51, False <- 49). Then I want to loop over the next 100 values (101:200) and do the same computation, and so on until I reach 1000.
The code below is pretty standard but, instead of slicing the vector, it calculates sums for the entire vector.
count_True <- 0
count_False <- 0
for (i in vect) {
  # vect is logical, so each element can be tested directly
  if (i) {
    count_True <- count_True + 1
  } else {
    count_False <- count_False + 1
  }
}
I am aware you can split the vector by
vect_splt <- split(vect,10)
but is there a way to combine these to do what I wanted or any other way?
Does something like this work:
library(dplyr)
set.seed(42)
vect <- sample(rep(c(T, F), 500))
vect <- tibble(vect)
vect %>%
mutate(seq = row_number() %/% 100) %>%
group_by(seq) %>%
summarise(n_TRUE = sum(vect),
n_FALSE = sum(!vect))
# A tibble: 11 x 3
#      seq n_TRUE n_FALSE
#    <dbl>  <int>   <int>
#  1     0     42      57
#  2     1     56      44
#  3     2     50      50
#  4     3     55      45
#  5     4     43      57
#  6     5     48      52
#  7     6     48      52
#  8     7     54      46
#  9     8     51      49
# 10     9     53      47
# 11    10      0       1
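The 11 groups in the output above come from row_number() %/% 100, which puts rows 1-99 in group 0 and row 1000 alone in group 10. A minimal sketch of a zero-based index that yields ten groups of exactly 100, written in base R so that no package is assumed (grp and counts are illustrative names, not part of the answer):

```r
set.seed(42)
vect <- sample(rep(c(TRUE, FALSE), 500))

# Subtract 1 before the integer division so rows 1..100 fall in group 0,
# rows 101..200 in group 1, and so on
grp <- (seq_along(vect) - 1) %/% 100

# Count TRUE and FALSE values within each group
n_TRUE <- tapply(vect, grp, sum)
n_FALSE <- tapply(!vect, grp, sum)
counts <- data.frame(seq = as.integer(names(n_TRUE)),
                     n_TRUE = as.vector(n_TRUE),
                     n_FALSE = as.vector(n_FALSE))
counts
```

The same correction in the dplyr pipeline would be mutate(seq = (row_number() - 1) %/% 100).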
We can also split the vector and tabulate each piece. Create a grouping index with gl, split the vector into a list of vectors, get the counts with table, and store them in a list:
out <- lapply(split(vect, as.integer(gl(length(vect), 100, length(vect)))), table)
It can be converted to a single dataset with rbind:
out1 <- do.call(rbind, out)
data
set.seed(24)
vect <- sample(c(TRUE, FALSE), 1000, replace = TRUE)
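One caveat with the rbind step: if a chunk happened to contain only TRUE or only FALSE values, its table would have a single entry and the rows would no longer line up. A hedged variation that fixes the levels before tabulating is sketched below; chunk is an illustrative name, and ceiling(seq_along(vect) / 100) is used here in place of gl purely for readability.

```r
set.seed(24)
vect <- sample(c(TRUE, FALSE), 1000, replace = TRUE)

# Chunk index: 1 for elements 1..100, 2 for 101..200, and so on
chunk <- ceiling(seq_along(vect) / 100)

# Tabulate each chunk with both levels always present
out <- lapply(split(vect, chunk),
              function(v) table(factor(v, levels = c(FALSE, TRUE))))

# One row per chunk, one column per level
out1 <- do.call(rbind, out)
out1
```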

Two-Way Data Table in R

I am trying to do the following in R.
Along the rows, I have a set of values for a variable X. Along the columns, I have a set of values for a variable Y.
For each combination of X and Y, I would like to perform a calculation and then summarize the results in a two-way data table.
One way I thought of was to create a row matrix containing each combination of row and column, and then rbind all the rows. But that process would be tedious and time-consuming. Is there a more efficient way to build this table using R?
Thanks.
What you need is the function outer. Here is a simple example of its use.
x = 1:5
y = seq(1, 9, 2)
names(x) = x
names(y) = y
MyFunction = function(x,y) x^2 + y^2
outer(x, y, MyFunction)
#    1  3  5  7   9
# 1  2 10 26 50  82
# 2  5 13 29 53  85
# 3 10 18 34 58  90
# 4 17 25 41 65  97
# 5 26 34 50 74 106
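Note that outer calls the function once with two long vectors, so the function must be vectorized, as x^2 + y^2 is. If the calculation only handles one pair of values at a time, wrapping it in Vectorize is one way around that. The sketch below uses a made-up scalar function (scalar_fun is hypothetical, chosen only because its if statement is not vectorized):

```r
x <- 1:5
y <- seq(1, 9, 2)

# A function that only works on single values because of the scalar `if`
scalar_fun <- function(a, b) if (a > b) a - b else a + b

# Vectorize() wraps it so outer() can pass whole vectors
res <- outer(x, y, Vectorize(scalar_fun))
dimnames(res) <- list(x, y)
res
```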

R code for repeating value into column

I am basically new to using R software.
I have a list of repeating codes (numeric/categorical) from an Excel file. I need to add another column of values (they can even be random) such that every occurrence of the same code gets the same value.
Codes Value
1 122
1 122
2 155
2 155
2 155
4 101
4 101
5 251
5 251
Thank you.
We can use match:
n <- length(code0 <- unique(code))
value <- sample(4 * n, n)[match(code, code0)]
or factor:
n <- length(unique(code))
value <- sample(4 * n, n)[factor(code)]
The random integers generated are between 1 and 4 * n. The number 4 is arbitrary; you can also put 100.
Example
set.seed(0); code <- rep(1:5, sample(5))
code
# [1] 1 1 1 1 1 2 2 3 3 3 3 4 4 4 5
n <- length(code0 <- unique(code))
sample(4 * n, n)[match(code, code0)]
# [1] 5 5 5 5 5 18 18 19 19 19 19 12 12 12 11
Comment
The above gives the most general treatment: it does not assume that code is already sorted or that it takes consecutive values.
If code is sorted (no matter what value it takes), we can also use rle:
if (!is.unsorted(code)) {
n <- length(k <- rle(code)$lengths)
value <- rep.int(sample(4 * n, n), k)
}
If code takes consecutive values 1, 2, ..., n (but not necessarily sorted), we can skip match or factor and do:
n <- max(code)
value <- sample(4 * n, n)[code]
Further note: if code is not numerical but categorical, the match and factor methods will still work.
You could also do the following, which is perhaps more intuitive for a beginner:
data <- data.frame('a' = c(122,122,155,155,155,101,101,251,251))
duplicates <- unique(data)
duplicates[, 'b'] <- rnorm(nrow(duplicates))
data <- merge(data, duplicates, by='a')
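Keep in mind that merge returns its result sorted by the key column by default, so the original row order can change. If the order matters, a lookup with match leaves the rows where they are. The following is only a sketch using the codes from the question; the column names Codes and Value follow the question, while lookup and the range 100:999 are arbitrary choices for illustration.

```r
set.seed(1)
data <- data.frame(Codes = c(1, 1, 2, 2, 2, 4, 4, 5, 5))

# One random value per distinct code (sampled without replacement)
codes <- unique(data$Codes)
lookup <- sample(100:999, length(codes))

# match() gives each row the position of its code in `codes`
data$Value <- lookup[match(data$Codes, codes)]
data
```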

Sum over an increasing number of columns of a data frame in R

I need to extract summed subsets of a data.frame row-by-row and use the output to return a new data.frame. However, I want to increase the number of columns to sum across by 4 each time. So, for example, I want to extract the 1st column by itself, then the sum of columns 2 to 6 on a row-by-row basis, then columns 7 to 15 and so on.
I have this code that returns the sum of a constant number of columns across a data.frame (by a maximum number of trials) into a new data.frame- I just need to find a way to add the escalating function.
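# maximum number of trials (renamed from t to avoid confusion with t())
n.trials <- max(as.numeric(df[, 5]))
process.row <- function(x) {
  sapply(1:n.trials, function(i) {
    # sum a fixed block of 5 columns for trial i
    sum(as.numeric(x[(6 + (i - 1) * 5):(10 + (i - 1) * 5)]))
  })
}
collated.data <- t(apply(df, 1, process.row))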
I've been really struggling with a way to do this so thanks very much for any help. I couldn't find an answer to this elsewhere so apologies if I've missed something.
I was thinking you wanted to sum the rows of the selected subset of columns. If so, perhaps this will help.
# fake data
mydf <- as.data.frame(matrix(sample(45*5), nrow=5))
mydf
# prepare matrix of start and end columns
n <- 20
i <- 1:n
ncols <- 1 + (i-1)*4
endcols <- cumsum(ncols)
startcols <- c(1, endcols[-n] + 1)
mymat <- cbind(startcols, endcols)
# function to sum the rows
myfun <- function(df, m) {
  # keep only the groups whose end column lies within the given df
  subm <- m[m[, "endcols"] <= ncol(df), , drop = FALSE]
  # sum up the selected columns of df by rows
  sapply(seq_len(nrow(subm)), function(j)
    rowSums(df[, subm[j, "startcols"]:subm[j, "endcols"], drop = FALSE]))
}
mydf
myfun(df=mydf, m=mymat)
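The same row sums can also be obtained by labelling every column with its group and splitting the data frame by columns. This is a sketch under the assumption that the data frame has exactly enough columns for complete groups (45 columns for widths 1, 5, 9, 13, 17); widths, group, and sums are illustrative names.

```r
set.seed(1)
mydf <- as.data.frame(matrix(sample(45 * 5), nrow = 5))

# Group widths grow by 4 each time: 1, 5, 9, 13, 17 (45 columns in total)
widths <- 1 + 4 * (0:4)

# Label each column with the index of the group it belongs to
group <- rep(seq_along(widths), times = widths)

# split.default() splits a data frame by columns; rowSums() sums each piece
sums <- sapply(split.default(mydf, group), rowSums)
sums
```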
What you are looking for is a function that gives the lower value (the first column) of part i of the sequence. Going by the code below, it is:
min(i) = 1 + sum over k = 1, ..., i-1 of (1 + 4 * (k - 1)), with min(1) = 1
In R, the code looks like this:
# the foo part of the function
foo <- function(x) ifelse(x > 0, 1 + (x - 1) * 4, 0)
# the wrapper of the function
min.val <- function(i){
ifelse(i == 1, 1, 1 + sum(sapply(1:(i - 1), foo)))
}
# takes only one value
min.val(1)
# [1] 1
min.val(2)
# [1] 2
min.val(3)
# [1] 7
# to calculate multiple values, use it like this
sapply(1:5, min.val)
#[1] 1 2 7 16 29
If you want to get the maximum number, you can create another function, which looks like this
max.val <- function(i) min.val(i + 1) - 1
sapply(1:5, max.val)
#[1] 1 6 15 28 45
Testing:
# creating a series to test it
series <- 1:20
min.vals <- sapply(series, min.val)
max.vals <- sapply(series, max.val)
dat <- data.frame(min = min.vals, max = max.vals)
# dat
# min max
# 1 1 1
# 2 2 6
# 3 7 15
# 4 16 28
# 5 29 45
# 6 46 66
# 7 67 91
# 8 92 120
# 9 121 153
# 10 154 190
# 11 191 231
# 12 232 276
# 13 277 325
# 14 326 378
# 15 379 435
# 16 436 496
# 17 497 561
# 18 562 630
# 19 631 703
# 20 704 780
Does that give you what you want?
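Because the group widths 1, 5, 9, ... form an arithmetic series, the sums above also have closed forms: min(i) = 1 + (i - 1) + 4 * (i - 1) * (i - 2) / 2 = 2i^2 - 5i + 4, and max(i) = min(i + 1) - 1 = i * (2i - 1). A short sketch that avoids the nested sapply calls (min_col, max_col, and bounds are illustrative names, not part of the answer):

```r
# Closed forms for the first and last column of group i
min_col <- function(i) 2 * i^2 - 5 * i + 4
max_col <- function(i) i * (2 * i - 1)

# Both functions are vectorized, so a whole series can be computed at once
bounds <- data.frame(min = min_col(1:20), max = max_col(1:20))
head(bounds, 5)
#   min max
# 1   1   1
# 2   2   6
# 3   7  15
# 4  16  28
# 5  29  45
```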

Reordering rows in a dataframe by multiple column permutations

I am trying to reorder a data.frame that contains around 250,000 rows and 7 columns. The rows I want at the top of the data.frame are those whose values increase from lowest to highest across the columns in this sequence: 2,5,1,4,6,3,7 (so column 2 contains the lowest value, column 5 the second lowest, and so on up to column 7, which contains the highest).
Once any rows that match this sequence are identified it would move on to find rows where the columns values go from lowest to highest in the sequence 2,5,1,4,6 and then 2,5,1,4 and so on until only rows where column 2 is the lowest and the other column values are randomly assorted. Any row that does not have column 2 as the lowest value would be ignored and left unsorted below the sorted rows. I am struggling to come up with any workable solution to my problem - the best I can do in terms of providing similar data to that I am working with is this:
df<-data.frame(matrix(rnorm(70000), nrow=10000))
df<-abs(df)
If anyone has any ideas, I am all ears.
Thanks!
Given that you have a largish dataset of uniform type (numeric), I would suggest using a matrix not a data.frame
tt <- abs(matrix(rnorm(70000), nrow=10000))
You have a desired order you wish to match against
desiredOrder <- c(2,5,1,4,6,3,7)
You need to find what order each of your rows is in. I think it is easiest here to make sure you get back a list with an element for each row. I'd suggest something like this:
orders <- lapply(apply(tt, 1, function(x) list(order(x))), unlist)
You will then need to go through (from desiredOrder[seq_len(7)] down to desiredOrder[seq_len(1)]) to test when the required subset of the order for a particular row is equal to the required subset of the desired order. (I am thinking of some combination of sapply with which and all.)
Once you have identified all the rows that match your required result, you can use setdiff to find the unmatched ones, and then reorder tt using this new order vector.
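The steps described above can be sketched as follows. This is one possible reading of the approach, not the answerer's code: it uses a matrix of orders instead of a list, and hits, prefix_len, matched, unmatched, and sorted are illustrative names.

```r
set.seed(1)
tt <- abs(matrix(rnorm(70000), nrow = 10000))
desiredOrder <- c(2, 5, 1, 4, 6, 3, 7)

# Row i of ord lists the column indices of tt[i, ] from lowest to highest value
ord <- t(apply(tt, 1, order))

# TRUE where a row's order agrees with desiredOrder at that position
hits <- ord == matrix(desiredOrder, nrow(ord), ncol(ord), byrow = TRUE)

# Length of the leading run of agreements (7 = full match, 0 = column 2 not lowest)
prefix_len <- apply(hits, 1, function(h) if (all(h)) length(h) else which(!h)[1] - 1)

# Matched rows first, longest prefix first; the rest keep their original order
matched <- order(prefix_len, decreasing = TRUE)
matched <- matched[prefix_len[matched] > 0]
unmatched <- setdiff(seq_len(nrow(tt)), matched)
sorted <- tt[c(matched, unmatched), ]
```

Rows with prefix_len equal to 0 are exactly those where column 2 does not hold the lowest value, so they stay unsorted at the bottom as the question asks.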
One possible approach would be to weight the rankings of the values in the columns. It would be something like rank regression. 7 columns of 250K rows is not that big. For the columns where you want the low values to have higher weight, you could subtract the rank from NROW(dfrm). If you want to scale the weighting across that column ordering scheme, then just multiply by a weighting vector, say c(1, .6, .3, 0, .3, .6, 1):
dmat <- matrix(sample(20, 20*7, replace=TRUE), 20, 7)
dfrm <- as.data.frame(dmat)
dfrm$wt <- sapply( dfrm[ c(2,5,1,4,6,3,7)] , rank); dfrm
dfrm$wt[,1:3] <- rep(NROW(dfrm),3) - dfrm$wt[ , 1:3]
dfrm$wt <- dfrm$wt*rep(c(1, .6, .3, 0, .3, .6, 1), each=NROW(dfrm))
dfrm[ order( apply( dfrm$wt, 1, FUN=sum), decreasing=TRUE ) , ]
This does not force the lowest value for V2 to be first, since you implied a multiple criterion. You still have the ability to re-weight if this is not exactly what you imagined.
Like this:
dat <- as.matrix(df)
rnk <- t(apply(dat, 1, rank))
desiredRank <- order(c(2,5,1,4,6,3,7))
rnk.match <- rnk == matrix(desiredRank, nrow(rnk), ncol(rnk), byrow = TRUE)
match.score <- apply(rnk.match, 1, match, x = FALSE) - 1
match.score[is.na(match.score)] <- ncol(dat)
out <- dat[order(match.score, decreasing = TRUE), ]
head(out)
# X1 X2 X3 X4 X5 X6 X7
#[1,] 0.7740246 0.19692680 1.5665696 0.9623104 0.2882492 1.367786 1.8644204
#[2,] 0.5895921 0.00498982 1.7143083 1.2698382 0.1776051 2.494149 1.4216615
#[3,] 0.1981111 0.11379934 1.0379619 0.2130251 0.1660568 1.227547 0.9248101
#[4,] 0.7507257 0.23353923 1.6502192 1.2232615 0.7497352 2.032547 1.4409475
#[5,] 0.5418513 0.06987903 1.8882399 0.6923557 0.3681018 2.172043 1.2215323
#[6,] 0.1731943 0.01088604 0.6878847 0.2450998 0.0125614 1.197478 0.3087192
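In this example, the first row matches the whole rank sequence; in the next rows, columns 1 to 5 hold their desired ranks. Note that match.score counts matching columns from left to right (X1, X2, ...), not positions along the sequence 2,5,1,4,6,3,7: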
head(match.score[order(match.score, decreasing = TRUE)])
# [1] 7 5 5 5 5 5
You can use the fact that order() returns the index to the ordering, which is exactly what you are trying to match.
For example, if we apply `order` to each row of
[1,] 23 17 118 57 20 66 137
[2,] 56 42 52 66 47 8 29
[3,] 35 5 76 35 29 217 89
We would get
[1,] 2 5 1 4 6 3 7
[2,] 6 7 2 5 3 1 4
[3,] 2 5 1 4 3 7 6
Then you simply need to check which rows match what you are looking for.
There are several ways to implement this. Below is an example where we create a logical matrix, comparisons, which indicates whether each element of a row is in the "correct" position, as given by expectedOrder.
We then order the original rows by how many elements are in the "correct column" (using this phrase loosely, of course).
# assuming mydf is your data frame or matrix
# the expected order of the columns
expectedOrder <- c(2,5,1,4,6,3,7)
# apply the order function to each row.
ordering <- apply(mydf, 1, function(r) order(r) )
# Recall that the output of apply is transposed relative to the input.
# We make use of this along with the recycling of vectors for the comparison
comparisons <- ordering == expectedOrder
# find all rows that match at least 2,5,1,4
topRows <- which(colSums(comparisons[1:4, ])==4)
# reorder the indices based on the total number of matches in comparisons
# i.e. first all 7-matches, then 5-matches, then 4-matches
topRows <- topRows[order(colSums(comparisons[,topRows]), decreasing=TRUE)]
# reorder the dataframe (or matrix)
mydf.ordered <-
rbind(mydf[topRows, ],
mydf[-topRows,])
head(mydf.ordered)
# X1 X2 X3 X4 X5 X6 X7
# 23 17 118 57 20 66 137
# 39 21 102 50 24 53 163
# 80 6 159 116 44 139 248
# 131 5 185 132 128 147 202
# 35 18 75 40 33 67 151
# 61 14 157 82 57 105 355
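To make the comparison step concrete, here is a small sketch that runs it on the three example rows shown earlier in this answer. The names m, n_match, and prefix4 are illustrative, not part of the answer.

```r
m <- rbind(c(23, 17, 118, 57, 20, 66, 137),
           c(56, 42, 52, 66, 47, 8, 29),
           c(35, 5, 76, 35, 29, 217, 89))
expectedOrder <- c(2, 5, 1, 4, 6, 3, 7)

# apply() returns one column per input row, so ordering is 7 x 3
ordering <- apply(m, 1, order)

# expectedOrder is recycled down each column of ordering
comparisons <- ordering == expectedOrder

# Total matches per row, and whether the first four positions all match
n_match <- colSums(comparisons)
prefix4 <- colSums(comparisons[1:4, , drop = FALSE]) == 4
n_match
# [1] 7 0 4
prefix4
# [1]  TRUE FALSE  TRUE
```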
