R: Automatically name results of a loop

I have a list of dataframes:
df_DJF = data.frame(replicate(2,sample(0:130,30,rep=TRUE)))
df_JJA = data.frame(replicate(2,sample(0:130,20,rep=TRUE)))
df_MAM = data.frame(replicate(2,sample(0:130,25,rep=TRUE)))
df_SON = data.frame(replicate(2,sample(0:130,15,rep=TRUE)))
df_list = list(df_DJF, df_JJA, df_MAM, df_SON)
I want to randomly choose 80% of the rows of each dataframe. I can do that manually like this, using sample_size to pick the row indices:
sample_size = floor(0.8*nrow(df_DJF))
picked_DJF = sample(seq_len(nrow(df_DJF)), size = sample_size)
My problem is that I have many data frames with different numbers of rows, so I want to automate this process. In the end I want four sample sizes, each holding the correct number. The names of the sample sizes should be:
samplenames = paste("sample_size", c("DJF", "JJA", "MAM", "SON"), sep = "_")
The same goes for the "picked" objects: they should be named picked_DJF, and so on.

Keep using lists, not assign. Set your names(df_list) = c("DJF", "JJA", "MAM", "SON"), then use the same names for subsequent lists, like a picked list.
# for a single sample size
picked = lapply(df_list, function(x) x[sample(1:nrow(x), size = floor(0.8 * nrow(x))), ])
Using lapply will keep the names of the original list so you don't have to worry about it.
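As a minimal sketch of that advice, assuming small stand-in data frames (the column x and the object names sample_size and picked_idx are made up for illustration), the sample sizes can live in one named vector and the picked row indices in one named list:

```r
# Stand-in data frames with the same row counts as in the question
df_list <- list(
  DJF = data.frame(x = 1:30),
  JJA = data.frame(x = 1:20),
  MAM = data.frame(x = 1:25),
  SON = data.frame(x = 1:15)
)

# One sample size per data frame; sapply keeps the list names
sample_size <- sapply(df_list, function(d) floor(0.8 * nrow(d)))

# Picked row indices per data frame; lapply keeps the list names too
picked_idx <- lapply(df_list, function(d) {
  sample(seq_len(nrow(d)), size = floor(0.8 * nrow(d)))
})

sample_size[["JJA"]]   # sample size for JJA
picked_idx[["JJA"]]    # row indices picked for JJA
```

Looking things up by name (sample_size[["JJA"]]) plays the role that a separate variable such as sample_size_JJA would have played.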
For multiple sample sizes from each of the data frames, you could create a nested list with a nested lapply:
names(df_list) = c("DJF", "JJA", "MAM", "SON")
sample_prop = list(s1 = 0.2, s2 = 0.4, s3 = 0.6, s4 = 0.8)
picked = lapply(df_list, function(df) lapply(sample_prop, function(sp) {
df[sample(nrow(df), size = floor(sp * nrow(df))), ]
}))
# then access individual data frames with `$` or `[[`
picked$JJA$s3
# X1 X2
# 17 70 128
# 7 94 121
# 1 57 125
# 8 32 75
# 9 15 8
# 19 58 15
# 20 55 17
# 10 42 15
# 4 51 67
# 12 89 13
# 2 74 50
# 14 77 36
To divide a data frame into "picked" and "unpicked", split makes sense. It already returns a list. This will give a triple-nested list result:
result = lapply(df_list, function(df) lapply(sample_prop, function(sp) {
n_pick = floor(sp * nrow(df))
n_unpick = nrow(df) - n_pick
split(df, f = c(rep("picked", n_pick), rep("unpicked", n_unpick))[sample(nrow(df))])
}))
result$JJA$s3$unpicked
# X1 X2
# 2 74 50
# 3 62 78
# 4 51 67
# 6 103 42
# 7 94 121
# 11 59 60
# 14 77 36
# 16 83 72
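One way to convince yourself that split really partitions the rows is a quick check on a single data frame. This is only a sketch; the id column and the object names are assumptions added for illustration:

```r
set.seed(42)  # only so the example is repeatable
df <- data.frame(id = 1:20, value = sample(0:130, 20, replace = TRUE))

n_pick <- floor(0.6 * nrow(df))

# A shuffled label vector with exactly n_pick "picked" entries
labels <- sample(rep(c("picked", "unpicked"),
                     times = c(n_pick, nrow(df) - n_pick)))
parts <- split(df, f = labels)

# Every row lands in exactly one of the two pieces
nrow(parts$picked) + nrow(parts$unpicked) == nrow(df)
length(intersect(parts$picked$id, parts$unpicked$id)) == 0
```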

Related

How to convert a string into assignment statement in R?

There are many columns in the list, and I'd like to handle them automatically with a function instead of by hand.
I have a data.frame, myData.
It has no home_player_X columns yet, so I'm adding them one by one manually.
If I do it manually, the code looks like this:
myData$home_player_1 <- lDataFrames[[3]]$home_player_1
myData$home_player_2 <- lDataFrames[[3]]$home_player_2
...
myData$home_player_11 <- lDataFrames[[3]]$home_player_11
If we only consider the part after <-, I can convert it into an expression:
eval(parse(text=paste("lDataFrames[[3]]$home_player_",i,sep="")))
But I want to convert the whole string. The whole string is this:
paste("myData$home_player_",i," <- lDataFrames[[3]]$home_player_", i,sep="")
I want to convert the string into an assignment statement, so that I can run it in a for loop.
Instead of playing with strings, you can directly copy the required columns into myData.
cols <- grep("^home_player", names(lDataFrames[[3]]), value = TRUE)
myData[cols] <- lDataFrames[[3]][cols]
Using reproducible example,
df <- data.frame(home_player_1 = 1:5, home_player_2 = 6:10, home_player_3 = 11:15)
cols <- grep("^home_player", names(df), value = TRUE)
mydata <- data.frame(matrix(nrow = nrow(df), ncol = length(cols),
dimnames = list(NULL, cols)))
mydata[cols] <- df[cols]
mydata
# home_player_1 home_player_2 home_player_3
#1 1 6 11
#2 2 7 12
#3 3 8 13
#4 4 9 14
#5 5 10 15
Instead of using the $ notation, just use the variable name as an index. I am substituting Y for your lDataFrames[[3]], but it should be easy to translate.
myData = data.frame(Var1 = 1:10)
Y = data.frame(home_player_1 = 11:20,
home_player_2 = 21:30, home_player_3 = 31:40)
for(i in 1:3) {
VarName = paste0("home_player_", i)
myData[ ,VarName] = Y[ ,VarName]
}
myData
Var1 home_player_1 home_player_2 home_player_3
1 1 11 21 31
2 2 12 22 32
3 3 13 23 33
4 4 14 24 34
5 5 15 25 35
6 6 16 26 36
7 7 17 27 37
8 8 18 28 38
9 9 19 29 39
10 10 20 30 40
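The same loop can also be written with `[[`, which accepts a column name held in a variable. A small sketch, where src and dst are hypothetical stand-ins for lDataFrames[[3]] and myData:

```r
src <- data.frame(home_player_1 = 11:13, home_player_2 = 21:23)
dst <- data.frame(Var1 = 1:3)

for (i in 1:2) {
  col <- paste0("home_player_", i)
  # `[[` takes the name as a string, so no eval(parse()) is needed
  dst[[col]] <- src[[col]]
}

names(dst)
```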

Identify sequences between identical values

I have a large matrix:
id v1 v2 v3 v4 v5 v6 v7 v8
1001 37 15 30 37 4 11 35 37
2111 44 31 44 30 24 39 44 18
3121 43 49 39 34 44 43 26 24
4532 45 31 26 33 12 47 37 15
5234 23 27 34 23 30 34 23 4
6345 9 46 39 34 8 43 26 24
For each row (id), I would like to identify intervals of numbers in column v1 to v8. An interval is here defined as a sequence of numbers which starts and ends with the same number.
For example, in the first row, there are two sequences which both start and end with 37: from column 1 to column 4 (37, 15, 30, 37) and from column 4 to column 8 (37, 4, 11, 35, 37).
The focal value should only occur in start and end positions. For example, in the first row, the sequence from 37 at V1, to 37 at V8 is not included, because 37 also occurs in V4.
For each interval, I want the index of the start and end columns, the focal start and end value, and the sequence of numbers in between.
Desired output:
1001 [v1] to [v4] 37 to 37: 15,30
1001 [v4] to [v8] 37 to 37: 4, 11, 35
2111 [v1] to [v3] 44 to 44: 31
2111 [v3] to [v7] 44 to 44: 30, 24, 39
Any suggestions? Algorithm?
I managed to get the indices for a vector, but not for a matrix:
a <- which(x == 37)
from <- a[!(a-1) %in% a]
to <- a[!(a+1) %in% a]
rbind(from, to)
Very brute-force method. Get unique elements for the given row, check if they are present more than once but not side-by-side, then lapply through each, getting the elements of the row x between them.
apply(m, 1, function(x) {
u <- unique(x)
u <- u[sapply(u, function(u) any(diff(which(x == u)) > 1))]
lapply(setNames(u, u), function(u){
ind <- which(x == u)
lapply(seq(length(ind) - 1),
function(i) x[seq(ind[i] + 1, ind[i + 1] - 1)])
})
})
Output:
# [[1]]
# [[1]]$`37`
# [[1]]$`37`[[1]]
# [1] 15 30
#
# [[1]]$`37`[[2]]
# [1] 4 11 35
#
#
#
# [[2]]
# [[2]]$`44`
# [[2]]$`44`[[1]]
# [1] 31
#
# [[2]]$`44`[[2]]
# [1] 30 24 39
#
#
#
# [[3]]
# [[3]]$`43`
# [[3]]$`43`[[1]]
# [1] 49 39 34 44
#
#
#
# [[4]]
# named list()
#
# [[5]]
# [[5]]$`23`
# [[5]]$`23`[[1]]
# [1] 27 34
#
# [[5]]$`23`[[2]]
# [1] 30 34
#
#
# [[5]]$`34`
# [[5]]$`34`[[1]]
# [1] 23 30
#
#
#
# [[6]]
# named list()
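If a flat table closer to the desired output is preferred, the same brute-force idea can be sketched for a single row. The helper name find_intervals and its column names are assumptions, not part of the answer above:

```r
# Hypothetical helper: for one vector, list every stretch that lies
# between two consecutive occurrences of the same value
find_intervals <- function(x) {
  out <- list()
  for (u in unique(x)) {
    ind <- which(x == u)
    if (length(ind) < 2) next
    for (k in seq_len(length(ind) - 1)) {
      from <- ind[k]
      to <- ind[k + 1]
      if (to - from > 1) {  # skip side-by-side occurrences
        out[[length(out) + 1]] <- data.frame(
          value = u, from = from, to = to,
          between = paste(x[(from + 1):(to - 1)], collapse = ","),
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, out)  # NULL when no interval exists
}

row1 <- c(37, 15, 30, 37, 4, 11, 35, 37)
find_intervals(row1)
# two rows: 37 from 1 to 4 (15,30) and 37 from 4 to 8 (4,11,35)
```

Applying it to each row of the value columns (for example with lapply over the row indices) would give one such table per id.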
Edit: Henrik's answer inspired me to do a join-based version
library(data.table)
library(magrittr)
d <- melt(as.data.table(m), "id", variable.name = 'ci')[, ci := rowid(id)]
setorder(d, id)
options(datatable.nomatch = 0)
d[d, on = .(id, value, ci > ci)
, .(id, value, i.ci, x.ci)
, mult = 'first'] %>%
.[d, on = .(id, i.ci < ci, x.ci > ci)
, .(id, value, from_ci = x.i.ci, to_ci = x.x.ci, i.value)] %>%
.[, .(val = .(i.value))
, by = setdiff(names(.), 'i.value')]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 34 3 6 23,30
# 8: 5234 23 4 7 30,34
Here's a data.table alternative.
Convert matrix to data.table and melt to long format. Create a column index 'ci' to keep track of the original columns (rowid(id)). Order by 'id'.
For each 'id' and 'value' (by = .(id, value)), check if the number of rows is larger than one (if(.N > 1)), i.e. if there is at least one sequence. If so, grab the row index (.I) of the sequences and their column indexes (in the original data). For each sequence, grab the corresponding values between start and end index. Wrap in list twice (.(.(...))) to create a list column.
library(data.table)
d <- melt(as.data.table(m), id.vars = "id")
d[ , `:=`(
ci = rowid(id),
variable = NULL)]
setorder(d, id)
d2 <- d[ , if(.N > 1){
.(from = .I[-.N], to = .I[-1],
from_ci = ci[-.N], to_ci = ci[ -1])
}, by = .(id, value)]
d2[ , val := .(.(d$value[seq(from + 1, to - 1)])), by = 1:nrow(d2)]
d2[ , `:=`(from = NULL, to = NULL)]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 23 4 7 30,34
# 8: 5234 34 3 6 23,30

Add number to vector repeatedly and duplicate vector

I have two values,
3 and 12,
and I make a vector from them:
num1 <- 3
num2 <- 12
a <- c(num1, num2)
I want to add a number (12) to vector "a" repeatedly,
appending each result to build a new vector
like this:
3,12, 15,24, 27,36, 39,48 ....
The number of repeats, "n", is 6.
I don't have any idea how to do this.
Here are two methods in base R.
with outer, you could do
c(outer(c(3, 12), (12 * 0:4), "+"))
[1] 3 12 15 24 27 36 39 48 51 60
or with sapply, you can explicitly loop through and calculate the pairs of sums.
c(sapply(0:4, function(i) c(3, 12) + (12 * i)))
[1] 3 12 15 24 27 36 39 48 51 60
outer returns a matrix where every pair of elements of the two vectors have been added together. c is used to return a vector. sapply loops through 0:4 and then calculates the element-wise sum. It also returns a matrix in this instance, so c is used to return a vector.
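To make the matrix step visible, here is a small sketch with the number of pairs pulled out into a variable (step and n_pairs are names introduced here for illustration):

```r
a <- c(3, 12)
step <- 12
n_pairs <- 4

# Rows follow the elements of a, columns follow the offsets 0, 12, 24, 36
m <- outer(a, step * (seq_len(n_pairs) - 1), "+")
dim(m)   # 2 rows, 4 columns

# c() reads the matrix column by column, which restores the pair order
c(m)
# [1]  3 12 15 24 27 36 39 48
```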
Here is a somewhat generic function that takes as input your original vector a, the number to add (12), and n:
f1 <- function(vec, x, n){
len1 <- length(vec)
v1 <- sapply(seq(n/len1), function(i) x*i)
v2 <- rep(v1, each = n/length(v1))
v3 <- rep(vec, n/len1)
return(c(vec, v3 + v2))
}
f1(a, 12, 6)
#[1] 3 12 15 24 27 36 39 48
f1(a, 11, 12)
#[1] 3 12 14 23 25 34 36 45 47 56 58 67 69 78
f1(a, 3, 2)
#[1] 3 12 6 15
EDIT
If by n=6 you mean 6 times the whole vector then,
f1 <- function(vec, x, n){
len1 <- length(vec)
v1 <- sapply(seq(n), function(i) x*i)
v2 <- rep(v1, each = len1)
v3 <- rep(vec, n)
return(c(vec, v3 + v2))
}
f1(a, 12, 6)
#[1] 3 12 15 24 27 36 39 48 51 60 63 72 75 84
Using rep for repeating and cumsum for the addition:
n = 6
rep(a, n) + cumsum(rep(c(12, 0), n))
# [1] 15 24 27 36 39 48 51 60 63 72 75 84
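Note that the result above starts at 15, so the original pair 3, 12 is not part of it. If the starting pair should be included, one possible variant (a sketch, not taken from the answer above) offsets each copy of a by a multiple of 12 starting from zero:

```r
a <- c(3, 12)
n <- 4

# Offsets 0, 0, 12, 12, 24, 24, ... line up with the repeated copies of a
with_start <- rep(a, n) + rep(12 * (seq_len(n) - 1), each = length(a))
with_start
# [1]  3 12 15 24 27 36 39 48
```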

Creating new columns from long strings split into 300 substrings?

I have a column containing 1200 character strings. In each one, every four character group is hexadecimal for a number. i.e. 300 numbers in hexadecimal crammed into a 1200 character string, in every row. I need to get each number out into decimal, and into its own column (300 new columns) named 1-300.
Here's what I've figured out so far:
Data.frame:
BigString
[1] 0043003E803C0041004A...(etc...)
Here's what I've done so far:
decimal.fours <- function(x) {
strtoi(substring(BigString[x], seq(1,1197,4), seq(4,1197,4)), 16L)
}
decimal.fours(1)
[1] 283 291 239 177 ...
But now I'm stuck. How can I output these individual numbers (and the remaining 296) into new columns? I have fifty rows/strings in total. It would be great to do them all at once, i.e. 300 new columns containing the split-up substrings from 50 strings.
You can use read.fwf, which reads files with a fixed width for each column:
# an example vector of big strings
BigString = c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
n = 5 # n is the number of columns for your result(300 for your real case)
as.data.frame(
lapply(read.fwf(file = textConnection(BigString),
widths = rep(4, n),
colClasses = "character"),
strtoi, base = 16))
# V1 V2 V3 V4 V5
#1 67 62 32828 65 74
#2 67 62 32828 65 74
#3 67 62 32828 65 74
If you'd like to keep the decimal.fours function, you can modify it as follows and call lapply to convert BigString to a list of integer vectors, which can then be combined into a matrix (and from there a data.frame) with the do.call(rbind, ...) pattern:
decimal.fours <- function(x) {
# end positions must run to 1200 so the last group gets all four characters
strtoi(substring(x, seq(1, 1197, 4), seq(4, 1200, 4)), 16L)
}
do.call(rbind, lapply(BigString, decimal.fours))
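A version that does not hard-code the string length can be sketched as follows. The helper name hex_to_int and the second example string are assumptions made for illustration; with the real data each string would yield 300 values:

```r
BigString <- c("0043003E803C0041004A", "00010002000300040005")

# Hypothetical helper: split a string into fixed-width hex groups
hex_to_int <- function(s, width = 4) {
  starts <- seq(1, nchar(s), by = width)
  strtoi(substring(s, starts, starts + width - 1), base = 16L)
}

# One row per string, one column per 4-character group
mat <- do.call(rbind, lapply(BigString, hex_to_int))
colnames(mat) <- seq_len(ncol(mat))   # columns named 1, 2, 3, ...
mat
```

The matrix can then be attached to the original data with cbind or converted with as.data.frame.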
Obligatory tidyverse example:
library(tidyverse)
Set up some data:
set.seed(1492)
bet <- c(0:9, LETTERS[1:6]) # alphabet for hex digit sequences
i <- 8 # number of rows
n <- 10 # number of 4-hex-digit sequences
df <- tibble(
some_other_col=LETTERS[1:i],
big_str=map_chr(1:i, ~sample(bet, 4*n, replace=TRUE) %>% paste0(collapse=""))
)
df
## # A tibble: 8 × 2
## some_other_col big_str
## <chr> <chr>
## 1 A 432100D86CAA388C15AEA6291E985F2FD3FB6104
## 2 B BC2673D112925EBBB3FD175837AF7176C39B4888
## 3 C B4E99FDAABA47515EADA786715E811EE0502ABE8
## 4 D 64E622D7037D35DE6ADC40D0380E1DC12D753CBC
## 5 E CF7CDD7BBC610443A8D8FCFD896CA9730673B181
## 6 F ED86AEE8A7B65F843200B823CFBD17E9F3CA4EEF
## 7 G 2B9BCB73941228C501F937DA8E6EF033B5DD31F6
## 8 H 40823BBBFDF9B14839B7A95B6E317EBA9B016ED5
Do the manipulation
read_fwf(paste0(df$big_str, collapse="\n"),
fwf_widths(rep(4, n)),
col_types=paste0(rep("c", n), collapse="")) %>%
mutate_all(strtoi, base=16) %>%
bind_cols(df) %>%
select(some_other_col, everything(), -big_str)
## # A tibble: 8 × 11
## some_other_col X1 X2 X3 X4 X5 X6 X7 X8 X9
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 A 17185 216 27818 14476 5550 42537 7832 24367 54267
## 2 B 48166 29649 4754 24251 46077 5976 14255 29046 50075
## 3 C 46313 40922 43940 29973 60122 30823 5608 4590 1282
## 4 D 25830 8919 893 13790 27356 16592 14350 7617 11637
## 5 E 53116 56699 48225 1091 43224 64765 35180 43379 1651
## 6 F 60806 44776 42934 24452 12800 47139 53181 6121 62410
## 7 G 11163 52083 37906 10437 505 14298 36462 61491 46557
## 8 H 16514 15291 65017 45384 14775 43355 28209 32442 39681
## # ... with 1 more variables: X10 <int>
Just an attempt using base R:
BigString = c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
df = data.frame(BigString)
t(sapply(df$BigString, function(x) strtoi(substring(x, seq(1, 297, 4)[1:5],
seq(4, 300, 4)[1:5]), base = 16)))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 67 62 32828 65 74
#[2,] 67 62 32828 65 74
#[3,] 67 62 32828 65 74
# you can set the column names at the end using `paste0("new_col", 1:300)`
# [1:5] was used only for this example, because my strings are just 20 characters long

Move cells to new column at each 48th row

I have a list of names in A1:A144 and I want to move A49:A96 to B1:B48 and A97:A144 to C1:C48.
So after every 48th row, I want the next 48 rows moved to a new column.
How can I do that?
If you want to consider a VBA alternative then:
Sub MoveData()
nF = 1
nL = 48
nSize = Cells(Rows.Count, "A").End(xlUp).Row
nBlock = nSize / nL
For k = 1 To nBlock
nF = nF + 48
nL = nL + 48
Range("A" & nF & ":A" & nL).Copy Cells(1, k + 1)
Range("A" & nF & ":A" & nL).ClearContents
Next k
End Sub
Not sure how scalable this solution is, but it does work.
First let's pretend your names are x and you want the solution to be in new.df
number.shifts <- ceiling(length(x) / 48) # work out how many columns we need
# create an empty (NA) matrix with the dimensions we need
new.df <- matrix(data = NA, nrow = 48, ncol = number.shifts)
# run a for-loop over x, moving to the next column after every 48th element
for (i in seq_along(x)) {
  j <- (i - 1) %/% 48 + 1 # column: 1 for elements 1-48, 2 for 49-96, ...
  r <- (i - 1) %% 48 + 1  # row within that column
  new.df[r, j] <- x[i]
}
I think you have to elaborate on your question a little more. Do you have the data in R or in Excel, and do you want the output to be in R or in Excel?
That being said, if x is your vector indicating clusters
x <- rep(1:3, each = 48)
and y is the variable containing names or whatever that you want to distribute over columns A:C (each having 48 rows),
y <- sample(letters, 3 * 48, replace = TRUE)
you can do this:
y.wide <- do.call(cbind, split(y, x))
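When the groups are simply consecutive blocks of 48, filling a matrix column by column gives the same layout without a grouping vector. A sketch, assuming the names are held in a character vector called nms (a made-up name) whose length need not be a multiple of 48:

```r
nms <- paste0("name", 1:144)   # stand-in for the 144 names
block <- 48

# Pad with NA so the length is a whole number of blocks
n_col <- ceiling(length(nms) / block)
length(nms) <- n_col * block

# matrix() fills column by column, so each block of 48 becomes a column
wide <- matrix(nms, nrow = block)
colnames(wide) <- LETTERS[seq_len(n_col)]

wide[1, ]    # first entry of each block
```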
Just as there is stack in R to create a very long representation of a group of columns, there is unstack to take a long column and make it into a wide form.
Here's a basic example:
mydf <- data.frame(A = 1:144)
mydf$groups <- paste0("A", gl(n=3, k=48)) ## One of many ways to create groups
mydf2 <- unstack(mydf)
head(mydf2)
# A1 A2 A3
# 1 1 49 97
# 2 2 50 98
# 3 3 51 99
# 4 4 52 100
# 5 5 53 101
# 6 6 54 102
tail(mydf2)
# A1 A2 A3
# 43 43 91 139
# 44 44 92 140
# 45 45 93 141
# 46 46 94 142
# 47 47 95 143
# 48 48 96 144
