Random number selection from a data-frame - r

I have created a data frame of "errors" following the steps outlined in Bernaards & Sijtsma's (2000) two-way method for missing data imputation. In order to complete my calculation for missing data, I need to make a random selection of a SINGLE NUMBER from this error data frame and add it to my already calculated missing data values.
I am familiar with the sample() function, but I am not looking for a random sample of a row or a column, but rather one individual cell from a data-frame. Is there a simple way to do this, such as a single "select random number()" command? Is there an alternative method I have yet to explore?
Any help is greatly appreciated.

It's easier if you can convert to a matrix instead of a data frame, but on the assumption that you need to keep different data types or have some similar limitation,
foo <- as.data.frame(matrix(runif(20), nrow = 4, ncol = 5))
foo[sample(nrow(foo), 1), sample(ncol(foo), 1)]
will pick a random element.

Similar to what @CarlWitthoft answered, you can convert your data frame to a matrix and sample a single random cell from it:
> set.seed(10)
> M <- data.frame(matrix(runif(20), nrow = 4, ncol = 5))
> M
# X1 X2 X3 X4 X5
# 1 0.5074782 0.08513597 0.6158293 0.1135090 0.05190332
# 2 0.3067685 0.22543662 0.4296715 0.5959253 0.26417767
# 3 0.4269077 0.27453052 0.6516557 0.3580500 0.39879073
# 4 0.6931021 0.27230507 0.5677378 0.4288094 0.83613414
> sample(as.matrix(M), 1)
# [1] 0.2641777 ## came from row 2, column 5
> sample(as.matrix(M), 1)
# [1] 0.113509 ## came from row 1, column 4
> sample(as.matrix(M), 1)
# [1] 0.4288094 ## came from row 4, column 4
> sample(as.matrix(M), 1)
# [1] 0.2723051 ## came from row 4, column 2
seq(as.matrix(M)) will show you all the cell numbers (numbered down each column, with the columns taken left to right). You could also sample from that.
> seq(as.matrix(M))
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> sample(seq(as.matrix(M)), 1)
# [1] 15
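If the position of the sampled cell is needed as well as its value, one possible sketch is to sample a single cell number and translate it back into a row and column with arrayInd(). The names cell, pos and value are only illustrative, and the example assumes every column of the data frame is numeric:

```r
set.seed(10)
M <- data.frame(matrix(runif(20), nrow = 4, ncol = 5))

# Sample one cell number between 1 and the total number of cells.
cell <- sample(nrow(M) * ncol(M), 1)

# Translate the cell number into (row, column); cells are numbered
# down each column first.
pos <- arrayInd(cell, dim(M))

# Pull the value out of the data frame by row and column.
value <- M[pos[1], pos[2]]
```

This gives the same value as as.matrix(M)[cell], while also recording where it came from.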

Related

How can I create a vector of row numbers from a matrix in R (order of acquisition)?

I have two files. The first file is a data frame that simply has times in one column and individuals in a second:
# [Time] [Individual]
# [1] 1528142 C5A1790
# [2] 1528142 C5A1059
# [3] 1528142 C5A1084
# [4] 1528142 C5A1564
# [5] 1528142 C5A1239
# [6] 1528142 C5A1180
The second is an N x N matrix in which both rows and columns are individuals, including those in the first data frame.
# [C5A1084] [C5A1059] [C5A1790] [C5A1180]
# 1 [C5A1084] 0 0.5 1 0
# 2 [C5A1059] 0.5 0 0 1
# 3 [C5A1790] 1 1 0 0.5
# 4 [C5A1180] 0 1 0.5 0
I need to create a vector containing the row numbers in the matrix at which I can find the individuals from the data frame, and in the order that they are listed in the data frame. For these example data it would be (3,2,1,4).
I tried to use the which() function as
RingIndex <- which(Matrix$IDcolumn == FrameIDs)
and received the "longer object length is not a multiple of shorter object length" message, presumably because the matrix includes more individuals than the data frame. %in% and match() are also returning errors stating that replacement has fewer rows than data.
Following the advice in the comments, I tried
RingIndex <- which(Matrix$IDcolumn %in% FrameIDs)
which successfully returned the correct row numbers, but in ascending order rather than the order of the original data. The match() function continues to complain of different replacement and original lengths.
What approach could I use to get my vector?
Many thanks!
df <- data.frame(Time = runif(6,1528142,1528150),
Individuals = c("C5A1790","C5A1791","C5A1792","C5A1793","C5A1794","C5A1795"))
> df
Time Individuals
1 1528144 C5A1790
2 1528143 C5A1791
3 1528144 C5A1792
4 1528148 C5A1793
5 1528145 C5A1794
6 1528143 C5A1795
nnMatrix <- matrix(runif(36,0,1),6,6)
colnames(nnMatrix) <- df$Individuals
rownames(nnMatrix) <- df$Individuals
> nnMatrix
C5A1790 C5A1791 C5A1792 C5A1793 C5A1794 C5A1795
C5A1790 0.08096946 0.8716328 0.6895134 0.05692825 0.4555460 0.53224424
C5A1791 0.42568532 0.5920239 0.4523232 0.11516185 0.8053652 0.72299411
C5A1792 0.42439187 0.6101881 0.8534429 0.86010851 0.1269521 0.41066857
C5A1793 0.26043345 0.8011337 0.8032234 0.30930988 0.2298927 0.93320166
C5A1794 0.43065533 0.2161525 0.6702832 0.89304071 0.6765714 0.09769635
C5A1795 0.70594252 0.1048099 0.7478553 0.87839534 0.5173364 0.69957502
> sapply(df$Individuals, function(t) which(colnames(nnMatrix) == t))
[1] 1 2 3 4 5 6
If you change the order
colnames(nnMatrix) <- rev(colnames(nnMatrix))
[1] 6 5 4 3 2 1
You may want to check for repetition and missing values, but the main approach is the same.
As suggested in the comments (@GKi), match() will also work. With the column names reversed as above:
> match(df$Individuals,colnames(nnMatrix))
[1] 6 5 4 3 2 1
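As a small sketch on the IDs from the question (the object names frame_ids, mat_ids and m are made up here, and the matrix values are placeholders), match() returns the positions in the order of the data frame and gives NA for individuals that are not in the matrix:

```r
# IDs in the order they appear in the data frame from the question.
frame_ids <- c("C5A1790", "C5A1059", "C5A1084", "C5A1564", "C5A1239", "C5A1180")

# Row/column IDs of the matrix from the question (values are placeholders).
mat_ids <- c("C5A1084", "C5A1059", "C5A1790", "C5A1180")
m <- matrix(0, nrow = 4, ncol = 4, dimnames = list(mat_ids, mat_ids))

# Position of each data-frame ID among the matrix row names.
idx <- match(frame_ids, rownames(m))
idx
# [1]  3  2  1 NA NA  4

# Drop the individuals that are not present in the matrix.
ring_index <- idx[!is.na(idx)]
ring_index
# [1] 3 2 1 4
```

Keeping the NA entries may be preferable if the result has to stay aligned with the rows of the data frame.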

r - find maximum length "chain" of numerically increasing pairs of numbers

I have a two column dataframe of number pairs:
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
> dfPairs
ODD EVEN
1 1 10
2 1 8
3 1 2
4 3 2
5 3 6
6 3 4
7 5 2
8 7 6
9 7 8
10 9 4
11 9 8
Each row of this dataframe is a pair of numbers, and I would like to find the longest possible numerically increasing combination of pairs. Conceptually, this is analogous to making a chain link of number pairs, with the added conditions that 1) links can only be formed using the same number and 2) the final chain must increase numerically. Visually, the program I am looking for will accomplish this:
For instance, row three is pair (1,2), which increases left to right. The next link in the chain would need to have a 2 in the EVEN column and increase right to left, such as row four (3,2). Then the pattern repeats, so the next link would need to have a 3 in the ODD column, and increase left to right, such as rows 5 or 6. The chain doesn't have to start at 1, or end at 9 - this was simply a convenient example.
If you try to make all possible linked pairs, you will find that many unique chains of various lengths are possible. I would like to find the longest possible chain. In my real data, I will likely encounter a situation in which more than one chain ties for the longest, in which case I would like all of these returned.
The final result should return the longest possible chain that meets these requirements as a dataframe, or a list of dataframes if more than one solution is possible, containing only the rows in the chain.
Thanks in advance. This one has been perplexing me all morning.
Edited to deal with df that does not start at 1 and returns maximum chains rather than chain lengths
Take advantage of graph data structure using igraph
Your data, dfPairs
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
New data, dfTest
ODD <- c(3,3,3,5,7,7,9,9)
EVEN <- c(2,6,4,2,6,8,4,8)
dfTest <- data.frame(ODD, EVEN)
Make graph of your data. A key to my solution is to rbind the reverse (rev(dfPairs)) of the data frame to the original data frame. This will allow for building directional edges from odd numbers to even numbers. Graphs can be used to construct directional paths fairly easily.
library(igraph)
library(dplyr)
GPairs <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1))
GTest <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfTest, c("X1", "X2")), setNames(rev(dfTest), c("X1", "X2"))), X1))
Here's the first three elements of all_simple_paths(GPairs, 1) (starting at 1)
[[1]]
+ 2/10 vertices, named, from f8e4f01:
[1] 1 2
[[2]]
+ 3/10 vertices, named, from f8e4f01:
[1] 1 2 3
[[3]]
+ 4/10 vertices, named, from f8e4f01:
[1] 1 2 3 4
I create a function to 1) convert all simple paths to list of numeric vectors, 2) filter each numeric vector for only elements that satisfy left->right increasing, and 3) return the maximum chain of left->right increasing numeric vector
max_chain_only_increasing <- function(gpath) {
list_vec <- lapply(gpath, function(v) as.numeric(names(unclass(v)))) # convert to list of numeric vector
only_increasing <- lapply(list_vec, function(v) v[1:min(which(v >= dplyr::lead(v, default=tail(v, 1))))]) # subset vector for only elements that are left->right increasing
return(unique(only_increasing[lengths(only_increasing) == max(lengths(only_increasing))])) # return maximum chain length
}
This is the output of the above function using all paths that start from 1
max_chain_only_increasing(all_simple_paths(GPairs, 1))
# [[1]]
# [1] 1 2 3 6 7 8 9
Now I'll output the first few max chains, one for each unique starting value in dfPairs, your original data
start_vals <- sort(unique(unlist(dfPairs)))
# [1] 1 2 3 4 5 6 7 8 9 10
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GPairs, i)))
names(max_chains) <- start_vals
# $`1`
# [1] 1 2 3 6 7 8 9
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# etc
And finally with dfTest, the newer data
start_vals <- sort(unique(unlist(dfTest)))
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GTest, i)))
names(max_chains) <- start_vals
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# $`6`
# [1] 6 7 8 9
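The question asks for the rows of the longest chain rather than the vector of values. As a rough base-R sketch (not part of the igraph approach above; the helper names chains_from and chain_rows are made up for illustration), one could orient every pair from its smaller to its larger value, enumerate the increasing chains recursively, and map the longest ones back to rows of dfPairs. It assumes there are no duplicated pairs, and because it enumerates every chain it may be slow on large inputs:

```r
ODD  <- c(1, 1, 1, 3, 3, 3, 5, 7, 7, 9, 9)
EVEN <- c(10, 8, 2, 2, 6, 4, 2, 6, 8, 4, 8)
dfPairs <- data.frame(ODD, EVEN)

# Orient every pair from its smaller value (lo) to its larger value (hi).
lo <- pmin(dfPairs$ODD, dfPairs$EVEN)
hi <- pmax(dfPairs$ODD, dfPairs$EVEN)

# All increasing chains that start at value v and cannot be extended further.
chains_from <- function(v) {
  nxt <- sort(unique(hi[lo == v]))
  if (length(nxt) == 0) return(list(v))
  tails <- unlist(lapply(nxt, chains_from), recursive = FALSE)
  lapply(tails, function(tl) c(v, tl))
}

# Enumerate chains from every possible start and keep the longest one(s).
all_chains <- unlist(lapply(sort(unique(lo)), chains_from), recursive = FALSE)
longest <- all_chains[lengths(all_chains) == max(lengths(all_chains))]
longest[[1]]
# [1] 1 2 3 6 7 8 9

# Rows of dfPairs that form a given chain (first matching row per link).
chain_rows <- function(ch) {
  idx <- vapply(seq_len(length(ch) - 1),
                function(k) which(lo == ch[k] & hi == ch[k + 1])[1],
                integer(1))
  dfPairs[idx, ]
}
longest_rows <- lapply(longest, chain_rows)
```

longest_rows is a list with one data frame per tied longest chain, which is the shape of result the question describes.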
In spite of Cpak's efforts I ended up writing my own function to solve this. In essence I realized I could turn the right-to-left chain links into left-to-right ones by using this section of code from Cpak's answer:
output <- arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1)
To ensure the resulting chains were sequential, I deleted all decreasing links:
output$increase <- with(output, ifelse(X2>X1, "Greater", "Less"))
output <- filter(output, increase == "Greater")
output <- select(output, -increase)
I realized that if I split the dataframe output by unique values in X1, I could join each of these dataframes sequentially by joining the last column of the first dataframe to the first column of the next dataframe, which would create rows of sequentially increasing chains. The only problem I needed to resolve was the issue of NAs in the last column of the merged dataframe. So I ended up splitting the joined dataframe after each merge, shifting the dataframe to remove the NAs, and rbinding the result back together.
This is the actual code:
out_split <- split(output, output$X1)
df_final <- Reduce(join_shift, out_split)
The function, join_shift, is this:
join_shift <- function(dtf1,dtf2){
abcd <- full_join(dtf1, dtf2, setNames(colnames(dtf2)[1], colnames(dtf1)[ncol(dtf1)]))
abcd[is.na(abcd)]<-0
colnames(abcd)[ncol(abcd)] <- "end"
# print(abcd)
abcd_na <- filter(abcd, end==0)
# print(abcd_na)
abcd <- filter(abcd, end != 0)
abcd_na <- abcd_na[moveme(names(abcd_na), "end first")]
# print(abcd_na)
names(abcd_na) <- names(abcd)
abcd<- rbind(abcd, abcd_na)
z <- length(colnames(abcd))
colnames(abcd)<- c(paste0("X", 1:z))
# print(abcd)
return(abcd)
}
Finally, I found there were a lot of columns that had only zeros in them, so I wrote this to delete them and trim the final dataframe:
df_final_trim = df_final[,colSums(df_final) > 0]
Overall I'm happy with this. I imagine it could be a little more elegant, but it has worked on everything I have tried, including some rather huge and complicated data. It will produce ~241,700 solutions from a dataset of 700 pairs.
I also used a moveme function that I found on stackoverflow (see below). I employed it to move NA values around to achieve the shift aspect of the join_shift function.
moveme <- function (invec, movecommand) {
movecommand <- lapply(strsplit(strsplit(movecommand, ";")[[1]],
",|\\s+"), function(x) x[x != ""])
movelist <- lapply(movecommand, function(x) {
Where <- x[which(x %in% c("before", "after", "first",
"last")):length(x)]
ToMove <- setdiff(x, Where)
list(ToMove, Where)
})
myVec <- invec
for (i in seq_along(movelist)) {
temp <- setdiff(myVec, movelist[[i]][[1]])
A <- movelist[[i]][[2]][1]
if (A %in% c("before", "after")) {
ba <- movelist[[i]][[2]][2]
if (A == "before") {
after <- match(ba, temp) - 1
}
else if (A == "after") {
after <- match(ba, temp)
}
}
else if (A == "first") {
after <- 0
}
else if (A == "last") {
after <- length(myVec)
}
myVec <- append(temp, values = movelist[[i]][[1]], after = after)
}
myVec
}

Find string in data.frame

How do I search for a string in a data.frame? As a minimal example, how do I find the locations (columns and rows) of 'horse' in this data.frame?
> df = data.frame(animal=c('goat','horse','horse','two', 'five'), level=c('five','one','three',30,'horse'), length=c(10, 20, 30, 'horse', 'eight'))
> df
animal level length
1 goat five 10
2 horse one 20
3 horse three 30
4 two 30 horse
5 five horse eight
... so row 4 and 5 have the wrong order. Any output that would allow me to identify that 'horse' has shifted to the level column in row 5 and to the length column in row 4 is good. Maybe:
> magic_function(df, 'horse')
col row
'animal', 2
'animal', 3
'length', 4
'level', 5
Here's what I want to use this for: I have a very large data frame (around 60 columns, 20.000 rows) in which some columns are messed up for some rows. It's too large to eyeball in order to identify the different ways that order can be wrong, so searching would be nice. I will use this info to move data to the correct columns for these rows.
What about:
which(df == "horse", arr.ind = TRUE)
# row col
# [1,] 2 1
# [2,] 3 1
# [3,] 5 2
# [4,] 4 3
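To get output closer to the col/row listing sketched in the question, the index matrix returned by which() can be combined with colnames(). This is only a sketch; the names hits and res are illustrative:

```r
df <- data.frame(animal = c("goat", "horse", "horse", "two", "five"),
                 level  = c("five", "one", "three", 30, "horse"),
                 length = c(10, 20, 30, "horse", "eight"))

# Matrix of (row, col) positions of every cell equal to "horse".
hits <- which(df == "horse", arr.ind = TRUE)

# Replace the column numbers with column names.
res <- data.frame(col = colnames(df)[hits[, "col"]],
                  row = unname(hits[, "row"]))
res
#      col row
# 1 animal   2
# 2 animal   3
# 3  level   5
# 4 length   4
```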
Another way around:
l <- sapply(colnames(df), function(x) grep("horse", df[,x]))
$animal
[1] 2 3
$level
[1] 5
$length
[1] 4
If you want the output to be matrix:
sapply(l,'[',1:max(lengths(l)))
animal level length
[1,] 2 5 4
[2,] 3 NA NA
We can get the linear indices where the value is equal to horse. Because R numbers the cells column by column, integer division of (index - 1) by the number of rows (nrow) gives the column indices, and the remainder gives the row indices.
We use colnames to get column names instead of indices.
idx <- which(df == "horse")
data.frame(col = colnames(df)[(idx - 1) %/% nrow(df) + 1],
row = (idx - 1) %% nrow(df) + 1)
# col row
#1 animal 2
#2 animal 3
#3 level 5
#4 length 4
Another way to do it is the following. Only base R functions are used, so no packages need to be loaded.
Say the data frame is named tbl_account.
First, transpose it:
temp = t(tbl_account)
Then, put it into a list:
temp = list(temp)
This turns the whole data frame into a single matrix, which grep treats as one long vector, allowing you to search every cell in one go.
Then do the searching (grep is case-sensitive, so the pattern must match the case used in the data):
temp[[1]][grep("Horse",temp[[1]])] # brings back the actual value occurrences
grep("Horse", temp[[1]]) # brings back the positions of the matching elements in the transposed matrix
hope this helps :)

'Random' Sorting with a condition in R for Psychology Research

I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried various packages and functions such as dplyr, sample, order and sort, and nothing so far solves the problem.
I think there's a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
1. Take the vector of valences and sample from it.
2. If runs longer than the desired length are found, then (for each such run) take the last value of that run and place it "somewhere else".
3. Check if the problem is solved. If so, return the reordered vector. If not, go back to step 2.
# some vector of valences
val <- rep(1:3,each=50)
pseudoRandomize <- function(x, n){
# take an initial sample
out <- sample(x)
# check if the sample is "bad" (containing sequences longer than n)
bad.seq <- any(rle(out)$lengths > n)
# length of the whole sample
l0 <- length(out)
while(bad.seq){
# get lengths of all subsequences
l1 <- rle(out)$lengths
# find the bad ones
ind <- l1 > n
# take the last value of each bad sequence, and...
for(i in cumsum(l1)[ind]){
# take it out of the original sample
tmp <- out[-i]
# pick new position at random
pos <- sample(2:(l0-2),1)
# put the value back into the sample at the new position
out <- c(tmp[1:(pos-1)],out[i],tmp[pos:(l0-1)])
}
# check if bad sequences (still) exist
# if TRUE, then 'while' continues; if FALSE, then it doesn't
bad.seq <- any(rle(out)$lengths > n)
}
# return the reordered sequence
out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
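Whichever method produces the ordering, it may be worth checking the result against the constraint before running the experiment. A minimal sketch using rle(), with a hypothetical helper name respects_max_run:

```r
# TRUE when no value occurs more than n times in a row.
respects_max_run <- function(x, n) {
  all(rle(unname(x))$lengths <= n)
}

respects_max_run(c(2, 2, 1, 3, 3, 1), 2)
# [1] TRUE
respects_max_run(c(2, 2, 2, 1, 3), 2)
# [1] FALSE
```

Applied to the reordered data frame above, the check would be respects_max_run(new.dat$val, 2).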
I believe this loop will set the valence categories appropriately. I've called the valence categories treat.
#Generate example data
s1 = data.frame(id=c(1:10),treat=NA)
#Setting the first two rows
s1[1,"treat"] <- sample(1:3,1)
s1[2,"treat"] <- sample(1:3,1)
#Looping through the remainder of the rows
for (i in 3:length(s1$id))
{
s1[i,"treat"] <- sample(1:3,1)
#Check if the treat value is equal to the previous two values.
if (s1[i,"treat"]==s1[i-1,"treat"] && s1[i-1,"treat"]==s1[i-2,"treat"])
#If so draw one of the values not equal to that value
{
a = 1:3
remove <- s1[i,"treat"]
a=a[!a==remove]
s1[i,"treat"] <- sample(a,1)
}
}
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.

using sample() or an equivalent on 2 variables of a dataframe

I have a large dataset, a subset of which looks like this:
Var1 Var2
9 29_13x
14 41y
9 51_13x
4 101_13x
14 105y
14 109y
9 113_13x
9 114_13x
14 116y
14 123y
4 124_13x
14 124y
14 126y
4 134_13x
4 135_13x
4 137_13x
9 138_13x
4 139_13x
14 140y
9 142_13x
4 143_13x
My code sits inside a loop and I would like to be able to sample without replacement, a certain number of Var2 (defined by the loop iteration) from each of the different Var1 categories. So for i=4 I'd like to get something like this:
29_13x
51_13x
113_13x
138_13x
which are all from Var1=9
41y
109y
126y
140y
from Var1=14, and
101_13x
134_13x
137_13x
139_13x
all from Var1=4.
I can't get sample() to work across more than one variable and can't find any other way to do this. Any suggestions would be greatly appreciated.
Here are two options.
Using sample with by or tapply:
by(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
tapply(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
Here's some example output with tapply:
out <- tapply(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
out
# $`4`
# [1] "101_13x" "143_13x" "124_13x" "134_13x"
#
# $`9`
# [1] "114_13x" "113_13x" "142_13x" "29_13x"
#
# $`14`
# [1] "116y" "109y" "140y" "105y"
You can also extract individual vectors by index position or by name:
out[[3]]
# [1] "116y" "109y" "140y" "105y"
out[["14"]]
# [1] "116y" "109y" "140y" "105y"
Subsetting based on a random variable ranked by a grouping variable:
x <- rnorm(nrow(mydf))
mydf[ave(x, mydf$Var1, FUN = rank) %in% 1:4, ]
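Since the sample size changes with the loop iteration, a group may at some point have fewer than i values, in which case sample(x, i) stops with an error. One possible sketch, using a hypothetical helper sample_per_group, caps the size at the group length and samples positions rather than values (sampling positions also avoids the behaviour of sample() when it is handed a single number):

```r
mydf <- data.frame(
  Var1 = c(9, 14, 9, 4, 14, 14, 9, 9, 14, 14, 4, 14, 14, 4, 4, 4, 9, 4, 14, 9, 4),
  Var2 = c("29_13x", "41y", "51_13x", "101_13x", "105y", "109y", "113_13x",
           "114_13x", "116y", "123y", "124_13x", "124y", "126y", "134_13x",
           "135_13x", "137_13x", "138_13x", "139_13x", "140y", "142_13x",
           "143_13x"),
  stringsAsFactors = FALSE
)

# Draw up to i values without replacement from each group.
sample_per_group <- function(values, groups, i) {
  lapply(split(values, groups), function(v) {
    v[sample.int(length(v), min(i, length(v)))]
  })
}

picked <- sample_per_group(mydf$Var2, mydf$Var1, 4)
names(picked)
# [1] "4"  "9"  "14"
```

Each element of picked holds the sampled Var2 values for one Var1 category, as with the tapply version above.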
