I want to generate 4 new columns from an existing variable total by random sampling. The results for each row should meet the condition s1 + s2 + s3 + s4 == total. For example,
> tabulate(sample.int(4, 100, replace = TRUE))
[1] 22 21 27 30
The following code does not work since the function appears to recycle the first row and applies it column-wise.
DT <- data.table(total = c(100, 110, 90, 92))
DT[, c(paste0("s", 1:4)) := tabulate(sample.int(4, total, replace = TRUE))]
> DT
total s1 s2 s3 s4
1: 100 31 31 31 31
2: 110 25 25 25 25
3: 90 22 22 22 22
4: 92 22 22 22 22
How can I get around this? I am clearly missing some basic understanding of how R vectors and lists work. Your help will be much appreciated.
Edited following edited question:
data.table expects a list on the right-hand side of := when you assign to several columns at once. To get an independent draw for each row, group by row with by:
DT <- data.table(total = c(100, 110, 90, 102, 92))
DT[, c(paste0("s", 1:4)) := {
as.list(tabulate(sample.int(4, total, replace = TRUE)))
}, by = seq(NROW(DT))]
Which outputs the following, satisfying the OP criteria:
> DT
total s1 s2 s3 s4
1: 100 27 28 28 17
2: 110 25 23 36 26
3: 90 26 19 26 19
4: 102 28 24 21 29
5: 92 17 27 22 26
> apply(DT[, 2:5],1, sum)
[1] 100 110 90 102 92
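As an alternative sketch (not part of the answer above), one multinomial draw per row gives the same kind of split. It is written in base R with a plain data.frame so it runs without data.table; the seed and the equal probabilities of 0.25 are assumptions for illustration. rmultinom() always returns one count per category, whereas tabulate() without nbins = 4 can return fewer than four counts if the highest category is never drawn, which matters for small totals.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
total <- c(100, 110, 90, 102, 92)

# One multinomial draw per total, with 4 equally likely categories.
# vapply() returns a 4 x length(total) matrix, so transpose it.
parts <- t(vapply(
  total,
  function(n) as.integer(rmultinom(1, size = n, prob = rep(0.25, 4))),
  integer(4)
))
colnames(parts) <- paste0("s", 1:4)

res <- cbind(data.frame(total = total), parts)

# Every row of counts adds up to its total.
stopifnot(all(rowSums(parts) == total))
```

The resulting matrix can be bound to a data.table the same way with cbind().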
Maybe you can try the code below
DTout <- cbind(
DT,
do.call(
rbind,
lapply(DT$total, function(x) diff(sort(c(0, sample(x - 1, 3), x))))
)
)
which gives
total V1 V2 V3 V4
1: 100 51 5 17 27
2: 110 41 1 40 28
3: 90 32 34 14 10
4: 102 5 73 13 11
5: 92 17 13 17 45
Test
> rowSums(DTout[,-1])
[1] 100 110 90 102 92
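To see why the diff(sort(...)) trick always sums to the total, here is a small worked sketch with fixed cut points instead of random ones (the values 20, 50 and 75 are made up for illustration). The cut points split the range 0..total into four segments, and the segment lengths are the parts.

```r
total <- 100
cuts <- c(20, 50, 75)  # fixed here; the answer draws them with sample(total - 1, 3)

# Segment lengths between consecutive boundaries 0, 20, 50, 75, 100
parts <- diff(sort(c(0, cuts, total)))
parts
# [1] 20 30 25 25

# Random version: 3 distinct cut points in 1..(total - 1),
# so every part is at least 1 and the parts sum to total.
set.seed(1)  # arbitrary seed
rand_parts <- diff(sort(c(0, sample(total - 1, 3), total)))
stopifnot(sum(rand_parts) == total, all(rand_parts >= 1))
```

Note that this draws from a different distribution than the tabulate() approach: the parts here can be very unequal, as the output above shows.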
We are looking to create a vector with the following sequence:
1,4,5,8,9,12,13,16,17,20,21,...
Start with 1, then skip 2 numbers, then add 2 numbers, then skip 2 numbers, etc., not going above 2000. We also need the inverse sequence 2,3,6,7,10,11,...
We may use a recycled logical vector to filter the sequence
(1:21)[c(TRUE, FALSE, FALSE, TRUE)]
[1] 1 4 5 8 9 12 13 16 17 20 21
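A minimal sketch of how the same mask could cover both sequences up to 2000 (the variable names are mine, not from the question). The length-4 logical vector is recycled along the index, and negating it gives the inverse sequence.

```r
v <- seq_len(2000)
mask <- c(TRUE, FALSE, FALSE, TRUE)  # recycled every 4 elements

s   <- v[mask]    # 1, 4, 5, 8, 9, 12, ...
inv <- v[!mask]   # 2, 3, 6, 7, 10, 11, ...

head(s, 6)
# [1]  1  4  5  8  9 12
head(inv, 4)
# [1] 2 3 6 7
```

Because 2000 is a multiple of 4, the mask recycles exactly and both results have 1000 elements.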
Here's an approach using rep and cumsum. Effectively, "add up alternating increments of 1 (successive #s) and 3 (skip two)."
cumsum(rep(c(1,3), 500))
and
cumsum(rep(c(3,1), 500)) - 1
Got this one myself - head(sort(c(seq(1, 2000, 4), seq(4, 2000, 4))), 20)
We can try like below
> (v <- seq(21))[v %% 4 %in% c(0, 1)]
[1] 1 4 5 8 9 12 13 16 17 20 21
You may arrange the data in a matrix and extract 1st and 4th column.
val <- 1:100
sort(c(matrix(val, ncol = 4, byrow = TRUE)[, c(1, 4)]))
# [1] 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29 32 33
#[18] 36 37 40 41 44 45 48 49 52 53 56 57 60 61 64 65 68
#[35] 69 72 73 76 77 80 81 84 85 88 89 92 93 96 97 100
A tidyverse option.
library(purrr)
library(dplyr)
map_int(1:11, ~ case_when(. == 1 ~ as.integer(1),
. %% 2 == 0 ~ as.integer(.*2),
TRUE ~ as.integer((.*2)-1)))
# [1] 1 4 5 8 9 12 13 16 17 20 21
How to count the number of values per column above a sequence of thresholds ?
i.e.: calculate for each column, the number of values above 100, then above 150, then above ... and store the results in a data frame ?
# Reproducible data
# (Original data is daily streamflow values organized in columns per year)
set.seed(1234)
data = data.frame("1915" = runif(365, min = 60, max = 400),
"1916" = runif(365, min = 60, max = 400),
"1917" = runif(365, min = 60, max = 400))
# my code chunk
mymin = 75
mymax = 400
mystep = 25
apply(data, 2, function (x) {
for(i in seq(mymin,mymax,mystep)) {
res = (sum(x > i)) # or nrow(data[x > i,])
return(res)
}
})
This code works well for one iteration, but I can't store the result of each iteration in a data frame.
I also tried approaches such as :
for (i in 1:n){
seuil = seq(mymin, mymax, mystep)
lapply(data, function(x) {
res [[i]] = nrow(data[ x > seuil[i], ])
return(res)}
})
Which does not work really well...
The output would be something like :
year  n values above 75  n values above 100  n values above ...
1915  348                329                 ...
1916  351                325                 ...
...   ...                ...                 ...
Thanks for your comments and suggestions :)
You can try :
vals <- seq(mymin,mymax,mystep)
mat <- sapply(vals, function(x) sapply(data, function(y) sum(y > x)))
colnames(mat) <- paste0('values_above_', vals)
mat
# values_above_75 values_above_100 values_above_125 values_above_150 values_above_175
#X1915 348 329 303 276 235
#X1916 351 325 305 277 252
#X1917 345 315 291 260 236
# values_above_200 values_above_225 values_above_250 values_above_275 values_above_300
#X1915 212 186 153 126 104
#X1916 226 204 181 146 118
#X1917 208 186 161 133 99
# values_above_325 values_above_350 values_above_375 values_above_400
#X1915 74 49 28 0
#X1916 92 62 40 0
#X1917 81 60 34 0
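Since the question asks for a data frame with a year column, here is a hedged sketch of one way to get there from the matrix. It rebuilds the question's data, uses colSums() on the logical comparison instead of the inner sapply(), and strips the X prefix that data.frame() adds to numeric column names; the name res is mine.

```r
set.seed(1234)
data <- data.frame("1915" = runif(365, min = 60, max = 400),
                   "1916" = runif(365, min = 60, max = 400),
                   "1917" = runif(365, min = 60, max = 400))

vals <- seq(75, 400, 25)

# data > x is a logical matrix; colSums() counts the TRUEs per column
mat <- sapply(vals, function(x) colSums(data > x))
colnames(mat) <- paste0("values_above_", vals)

# One row per year, one column per threshold
res <- data.frame(year = sub("^X", "", rownames(mat)), mat, row.names = NULL)
```

The original loop only ever returned one number per column because return() inside the for loop exits the function on the first iteration.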
myseq <- seq(75, 400, by=25)
as.data.frame(do.call(rbind, lapply(data, function(z) table(findInterval(z, myseq)))))
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
# X1915 17 19 26 27 41 23 26 33 27 22 30 25 21 28
# X1916 14 26 20 28 25 26 22 23 35 28 26 30 22 40
# X1917 20 30 24 31 24 28 22 25 28 34 18 21 26 34
or if you like the factor levels that R will come up with using cut, then
as.data.frame(do.call(rbind, lapply(data, function(z) table(cut(z, myseq)))))
# (75,100] (100,125] (125,150] (150,175] (175,200] (200,225] (225,250] (250,275] (275,300] (300,325] (325,350] (350,375] (375,400]
# X1915 19 26 27 41 23 26 33 27 22 30 25 21 28
# X1916 26 20 28 25 26 22 23 35 28 26 30 22 40
# X1917 30 24 31 24 28 22 25 28 34 18 21 26 34
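Note that the findInterval and cut tables give counts per bin, not counts above each threshold. If the cumulative counts are wanted, a reverse cumulative sum over the bins should do it. The sketch below uses a small made-up vector x rather than the question's data, and counts values at or above each threshold (>= rather than >, which makes no practical difference for continuous data).

```r
x <- c(80, 120, 130, 260, 390)      # made-up values for illustration
myseq <- seq(75, 400, by = 25)

# Count per bin [myseq[i], myseq[i + 1]); values below 75 fall in bin 0 and are ignored
bins <- tabulate(findInterval(x, myseq), nbins = length(myseq))

# Summing from the top bin downwards gives the count at or above each threshold
at_or_above <- rev(cumsum(rev(bins)))
names(at_or_above) <- myseq

# Cross-check against a direct count
stopifnot(all(at_or_above == sapply(myseq, function(t) sum(x >= t))))
```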
I have a data frame with a lot of company information separated by an id variable. I want to sort one of the variables and repeat it for every id. Let's take this example,
df <- structure(list(id = c(110, 110, 110, 90, 90, 90, 90, 252, 252
), var1 = c(26, 21, 54, 10, 18, 9, 16, 54, 39), var2 = c(234,
12, 43, 32, 21, 19, 16, 34, 44)), .Names = c("id", "var1", "var2"
), row.names = c(NA, -9L), class = "data.frame")
Which looks like this
df
id var1 var2
1 110 26 234
2 110 21 12
3 110 54 43
4 90 10 32
5 90 18 21
6 90 9 19
7 90 16 16
8 252 54 34
9 252 39 44
Now, I want to sort the data frame according to var1 by the vector id. Easiest solution I can think of is using apply function like this,
> apply(df, 2, sort)
id var1 var2
[1,] 90 9 12
[2,] 90 10 16
[3,] 90 16 19
[4,] 90 18 21
[5,] 110 21 32
[6,] 110 26 34
[7,] 110 39 43
[8,] 252 54 44
[9,] 252 54 234
However, this is not the output I am seeking. The correct output should be,
id var1 var2
1 110 21 12
2 110 26 234
3 110 54 43
4 90 9 19
5 90 10 32
6 90 16 16
7 90 18 21
8 252 39 44
9 252 54 34
Group by id and sort by var1 column and keep original id column order.
Any idea how to sort like this?
Note. As mentioned by Moody_Mudskipper, there is no need to use tidyverse; it can also be done easily with base R:
df[order(ordered(df$id, unique(df$id)), df$var1), ]
A one-liner tidyverse solution w/o any temp vars:
library(tidyverse)
df %>% arrange(ordered(id, unique(id)), var1)
# id var1 var2
# 1 110 21 12
# 2 110 26 234
# 3 110 54 43
# 4 90 9 19
# 5 90 10 32
# 6 90 16 16
# 7 90 18 21
# 8 252 39 44
# 9 252 54 34
Explanation of why apply(df, 2, sort) does not work
What you were trying to do is to sort each column independently. apply runs over the specified dimension (2 in this case which corresponds to columns) and applies the function (sort in this case).
apply tries to further simplify the results, in this case to a matrix. So you are getting back a matrix (not a data.frame) where each column is sorted independently. For example this row from the apply call:
# [1,] 90 9 12
does not even exist in the original data.frame.
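A tiny sketch with a made-up two-row data frame shows the effect described above: apply(..., 2, sort) pairs up values that never shared a row, while indexing with order() moves whole rows.

```r
m <- data.frame(id = c(2, 1), val = c(10, 20))

# Each column is sorted on its own, so rows are torn apart
broken <- apply(m, 2, sort)
broken
#      id val
# [1,]  1  10
# [2,]  2  20

# order() returns row positions, so rows stay intact
intact <- m[order(m$id), ]
intact
#   id val
# 2  1  20
# 1  2  10
```

In the original data id 1 goes with val 20, which only the second result preserves.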
Another base R option using order and match
df[with(df, order(match(id, unique(id)), var1, var2)), ]
# id var1 var2
#2 110 21 12
#1 110 26 234
#3 110 54 43
#6 90 9 19
#4 90 10 32
#7 90 16 16
#5 90 18 21
#9 252 39 44
#8 252 54 34
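For anyone wondering what match(id, unique(id)) contributes here, a short sketch on the question's id column: it turns each id into the rank of its first appearance, which is what keeps 110 ahead of 90 in the result.

```r
id <- c(110, 110, 110, 90, 90, 90, 90, 252, 252)

# unique(id) is 110, 90, 252 in order of first appearance
grp <- match(id, unique(id))
grp
# [1] 1 1 1 2 2 2 2 3 3

# Ordering by id directly would put 90 first instead
id[order(id)][1]
# [1] 90
```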
We can convert the id to factor in order to split while preserving the original order. We can then loop over the list and order, and rbind again, i.e.
df$id <- factor(df$id, levels = unique(df$id))
do.call(rbind, lapply(split(df, df$id), function(i)i[order(i$var1),]))
# id var1 var2
#110.2 110 21 12
#110.1 110 26 234
#110.3 110 54 43
#90.6 90 9 19
#90.4 90 10 32
#90.7 90 16 16
#90.5 90 18 21
#252.9 252 39 44
#252.8 252 54 34
NOTE: You can reset the rownames by rownames(new_df) <- NULL
In base R we could use split<- :
split(df,df$id) <- lapply(split(df,df$id), function(x) x[order(x$var1),] )
or as @Markus suggests :
split(df, df$id) <- by(df, df$id, function(x) x[order(x$var1),])
output in either case :
df
# id var1 var2
# 1 110 21 12
# 2 110 26 234
# 3 110 54 43
# 4 90 9 19
# 5 90 10 32
# 6 90 16 16
# 7 90 18 21
# 8 252 39 44
# 9 252 54 34
With the following tidyverse pipe, the question's output is reproduced.
library(tidyverse)
df %>%
mutate(tmp = cumsum(c(0, diff(id) != 0))) %>%
group_by(id) %>%
arrange(tmp, var1) %>%
select(-tmp)
## A tibble: 9 x 3
## Groups: id [3]
# id var1 var2
# <dbl> <dbl> <dbl>
#1 110 21 12
#2 110 26 234
#3 110 54 43
#4 90 9 19
#5 90 10 32
#6 90 16 16
#7 90 18 21
#8 252 39 44
#9 252 54 34
I have several files with the following structure:
data <- matrix(c(1:100000), nrow=1000, ncol=100)
The first 500 rows are X coordinates and the final 500 rows are Y coordinates of several object contours. Row # 1 (X) and row 501 (Y) correspond to coordinates of the same object. I need to:
transpose the whole matrix and arrange it so now row 1 is column 1 and row 501 is column 2 and have paired x, y coordinates in contiguous columns. Row 2 and row 502 should be in column 1 and column 2 below the data of previous object.
ideally, have an extra column with filename info.
thanks.
Simpler version:
Transpose the matrix, then create a vector with the column indices and subset with them:
mat <- matrix(1:100, nrow = 10)
mat2 <- t(mat)
# use ncol(mat2), not nrow(mat2): the two only coincide for a square matrix
cols <- unlist(lapply(1:(ncol(mat2)/2), function(i) c(i, i+ncol(mat2)/2)))
mat3 <- mat2[,cols]
Then just make it a dataframe as below.
You can subset pairs of rows separated by nrow/2, make them a 2-column matrix and then cbind them all:
df <- as.data.frame(do.call(cbind, lapply(1:(nrow(mat)/2), function(i) {
matrix(mat[c(i, nrow(mat)/2 + i),], ncol = 2, byrow = TRUE)
})))
Then just add the new column as necessary, since it's now a dataframe:
df$fname <- sample(letters, nrow(df), TRUE)
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 fname
# 1 1 6 2 7 3 8 4 9 5 10 a
# 2 11 16 12 17 13 18 14 19 15 20 e
# 3 21 26 22 27 23 28 24 29 25 30 e
# 4 31 36 32 37 33 38 34 39 35 40 o
# 5 41 46 42 47 43 48 44 49 45 50 y
# 6 51 56 52 57 53 58 54 59 55 60 q
# 7 61 66 62 67 63 68 64 69 65 70 v
# 8 71 76 72 77 73 78 74 79 75 80 b
# 9 81 86 82 87 83 88 84 89 85 90 v
# 10 91 96 92 97 93 98 94 99 95 100 y
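The question also describes a stacked layout, with each object's x, y pairs placed below the previous object's in two columns. A possible sketch of that reading, on a small made-up matrix (3 objects, 4 points each); the object column and the file name are hypothetical additions for illustration.

```r
# Rows 1..3 hold X coordinates, rows 4..6 the matching Y coordinates
mat <- rbind(matrix(1:12, nrow = 3, byrow = TRUE),
             matrix(101:112, nrow = 3, byrow = TRUE))

half <- nrow(mat) / 2

# t() puts each object in a column; as.vector() then stacks the columns,
# so all points of object 1 come first, then object 2, and so on.
long <- data.frame(
  object = rep(seq_len(half), each = ncol(mat)),
  x = as.vector(t(mat[seq_len(half), , drop = FALSE])),
  y = as.vector(t(mat[half + seq_len(half), , drop = FALSE])),
  fname = "example_file.txt"  # hypothetical file name
)

head(long, 5)
#   object x   y            fname
# 1      1 1 101 example_file.txt
# 2      1 2 102 example_file.txt
# 3      1 3 103 example_file.txt
# 4      1 4 104 example_file.txt
# 5      2 5 105 example_file.txt
```

For the real files, half would be 500 and the same code should apply unchanged.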
What about
n <- 500
df <- data.frame(col1 = data[1:n, ],
col2 = data[(nrow(data) - 500):nrow(data), ],
fileinfo = "this is the name of the file...")
Try David's answer, but this way. nrow(data) - 500 selects 501 rows rather than 500, so subtract n - 1 instead:
n <- 500
df <- data.frame(col1 = data[1:n, ],
col2 = data[(nrow(data) - (n-1)):nrow(data), ],
fileinfo = "this is the name of the file...")
I have a list which has multiple vectors (total 80) of various lengths. On the x-axis I want the names of these vectors. On the y-axis I want to plot the values corresponding to each vector. How can I do it in R?
One way to do this is to reshape the data using reshape2::melt or some other method. Please try and make a reproducible example. I think this is the gist of what you are after:
set.seed(4)
mylist <- list(a = sample(1:50, 10, T),
b = sample(25:40, 15, T),
c = sample(51:75, 20, T))
mylist
# $a
# [1] 30 1 15 14 41 14 37 46 48 4
#
# $b
# [1] 37 29 26 40 31 32 40 34 40 37 36 40 33 32 35
#
# $c
# [1] 71 63 72 63 64 65 56 72 67 63 75 62 66 60 51 74 57 65 55 73
library(ggplot2)
library(reshape2)
df <- melt(mylist)
head(df)
# value L1
# 1 30 a
# 2 1 a
# 3 15 a
# 4 14 a
# 5 41 a
# 6 14 a
ggplot(df, aes(x = factor(L1), y = value)) + geom_point()
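If you would rather avoid the extra packages, a base R sketch along the same lines: stack() reshapes a named list of vectors into a two-column data frame, and stripchart() accepts the list directly. The small list below is made up for illustration.

```r
mylist <- list(a = c(3, 7, 1),
               b = c(10, 12),
               c = c(5, 6, 8, 9))

# Long format: one row per value, with the vector name in column 'ind'
long <- stack(mylist)
head(long, 4)
#   values ind
# 1      3   a
# 2      7   a
# 3      1   a
# 4     10   b

# One strip per vector name on the x-axis, values on the y-axis
stripchart(mylist, vertical = TRUE, pch = 16)
```

With 80 vectors the axis labels will probably need rotating, for example with las = 2.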