I may have discovered one of the problems in the code I posted previously ("R: using foreach() with sample() procedures in randomForest() call"); it relates to the script I was using to draw a random subsample of columns from a data frame.
The fake data (below) has 19 columns, "A" through "S", and I want to draw a random subset of 5 columns while excluding the third column, "C", from the draw. Simply excluding the third column from the first argument of the sample() call does not work (i.e., some of the samples still contain the "C" column). I'm hoping someone has a suggestion on how to do this. This is the script that does not work:
# This draws 5 values from 1:18 (the column count after dropping "C"),
# but the draws are then used to index the FULL 19-column data frame,
# where position 3 still maps to column "C".
randsCOLs <- sample(1:dim(FAKEinput[, c(1:2, 4:19)])[2], 5, replace = FALSE)
# randsCOLs <- sample(dim(FAKEinput[, c(1:2, 4:19)])[2], 5, replace = FALSE)  # also doesn't work, same reason
out <- FAKEinput[, randsCOLs]
FAKEinput <- data.frame(
  A = sample(25:75, 20, replace = T), B = sample(1:2, 20, replace = T),
  C = as.factor(sample(0:1, 20, replace = T, prob = c(0.3, 0.7))),
  D = sample(200:350, 20, replace = T), E = sample(2300:2500, 20, replace = T),
  F = sample(92000:105000, 20, replace = T), G = sample(280:475, 20, replace = T),
  H = sample(470:550, 20, replace = T), I = sample(2537:2723, 20, replace = T),
  J = sample(2984:4199, 20, replace = T), K = sample(222:301, 20, replace = T),
  L = sample(28:53, 20, replace = T), M = sample(3:9, 20, replace = T),
  N = sample(0:2, 20, replace = T), O = sample(0:5, 20, replace = T),
  P = sample(0:2, 20, replace = T), Q = sample(0:2, 20, replace = T),
  R = sample(0:2, 20, replace = T), S = sample(0:7, 20, replace = T))
It looks like dropping the dim() call will work: sample() treats a data frame as a list of columns, so this returns 5 randomly chosen columns of FAKEinput[-3] directly.
randsCOLs <- sample(FAKEinput[-3], 5, replace = FALSE)
Here is a more general approach (in case the C column is not the 3rd column):
FAKEinput[sample(which(names(FAKEinput) !='C'),5, replace=FALSE)]
Or you could use setdiff:
FAKEinput[sample(setdiff(names(FAKEinput),'C'), 5, replace=FALSE)]
Or, adapting the OP's 1:dim(...) code (assuming that C is column 3):
FAKEinput[sample((1:dim(FAKEinput)[2])[-3], 5, replace=FALSE)]
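As a quick sanity check (a sketch, not part of the original answers), you can repeat the setdiff() version many times and confirm that "C" is never drawn:
set.seed(42)
# collect the column names from 1000 independent draws of 5 columns
drawn <- replicate(1000, names(FAKEinput[sample(setdiff(names(FAKEinput), 'C'), 5, replace = FALSE)]))
any(drawn == "C")
# [1] FALSE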
Related
I have a large data set I am attempting to sample rows from. Each row has a family ID, and there may be one or multiple rows for each family ID. I want to parse the data set by randomly sampling one row for each family ID. I have attempted to accomplish this by using both tapply() and split() + lapply() functions, but to no avail. Below is code that reproduces my issue - the size and scope of the factor levels and data entries mirror the data set I am working with.
set.seed(63)
f1 <- factor(c(rep(30000:32000, times = 1),
               rep(30500:31700, times = 2),
               rep(30900:31900, times = 3)))
f2 <- factor(rep(sample(1:7, replace = TRUE), times = length(f1)/7))
x1 <- round(matrix(rnorm(length(f1)*300), nrow = length(f1), ncol = 300),3)
df <- data.frame(f1, f2, x1)
Next, I used tapply to sample one row per level of f1, and then checked for repeats. (f2 is a secondary factor that indexes another aspect of the observations, but is [hopefully] irrelevant here; I only include it for full disclosure of the structure of my data set.)
s1 <- tapply(1:nrow(df), df$f1, sample, size=1)
any(duplicated(s1))
The output for the second line of code using duplicated is TRUE, which means there are repeats. Stumped, I tried split to see if that was the problem.
df.split <- split(1:nrow(df), df$f1)
any(duplicated(df.split))
The output here for duplicated is FALSE, so the problem is not split. I then used the output df.split with lapply and sample to see if the problem was with tapply.
df.unique <- unlist(lapply(df.split, sample, size = 1, replace = FALSE,
                           prob = NULL))
any(duplicated(df.unique))
In the first line, I sampled one value from each element of df.split which outputs a list, then I used unlist to convert into a vector. The output for duplicated here is also TRUE.
Somewhere within sample and lapply there is funky stuff going on (since tapply merely calls lapply). I'm not sure how to fix the issue (I searched SO and Google and found nothing related to my issue), so any help would be greatly appreciated!
EDIT: I'm hoping someone could tell me why the above code using tapply and lapply is not working as intended. Arthur has provided a nice answer, and I have coded a loop for sample as well. I'm wondering why the above code is misbehaving.
I would do it like this:
library(data.table)
data.table(df)[,.SD[sample(.N,1)],by='f1']
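As a quick check that this does what you want (a sketch using the df built above), the result should contain exactly one row per family ID:
library(data.table)
set.seed(63)
picked <- data.table(df)[, .SD[sample(.N, 1)], by = 'f1']
nrow(picked) == nlevels(df$f1)
# [1] TRUE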
...but actually your original approach with tapply is faster if you just want an index and not the actual subset table. However, you must notice that sample(n) actually samples from 1:n when length(n) == 1 (see ?sample), and that is exactly what happens here whenever a family ID has only one row. This version is error-proof:
s1 <- tapply(1:nrow(df), list(df$f1), function(v) v[sample(1:length(v), 1)])
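To see the gotcha in isolation (a minimal sketch, not from the original answer): when a group holds a single row index, sample() silently switches to sampling from 1:n.
v <- 1234                   # a one-row group whose row index is 1234
sample(v, 1)                # draws from 1:1234 -- almost never 1234 itself
v[sample(1:length(v), 1)]   # always 1234 -- the error-proof pattern above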
I am trying to get the mvr function in the R package pls to work. When having a look at the example dataset yarn, I realized that all 268 NIR columns are in fact treated as one column:
library(pls)
data(yarn)
head(yarn)
colnames(yarn)
I would need that to use the function with my data (so that a multivariate dataset is treated as one entity), but I have no idea how to achieve that. I tried:
TT <- matrix(NA, 2, 3)
colnames(TT) <- rep("NIR", ncol(TT))
TT
colnames(TT)
You will notice that while all columns have the same heading, colnames(TT) shows a vector of length three, because each column is treated separately. What I would need is what can be found in yarn: the colname "NIR" occurs only once and applies to columns 1-268 alike.
Does anybody know how to do that?
You can just assign the matrix to a column of a data.frame:
TT <- matrix(1:6, 2, 3)
# assign to an existing dataframe
out <- data.frame(density = 1:nrow(TT))
out$NIR <- TT
str(out)
# assign to empty dataframe
out <- data.frame(matrix(integer(0), nrow = nrow(TT)))
out$NIR <- TT
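As a quick check (a sketch; rebuilding the first version of out so the names are unambiguous), the matrix column now behaves like yarn's NIR column: one name covering many underlying columns.
out <- data.frame(density = 1:nrow(TT))
out$NIR <- TT
colnames(out)   # "density" "NIR" -- a single name for the whole matrix
out$NIR[, 2]    # the individual columns of the matrix remain accessible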
x <- c("a", 2, 3, 1.0)
y <- c("b", 1, 6, 7.9)
z <- c("c", 1, 8, 2.0)
p <- c("d", 2, 9, 3.3)
df1 <- data.frame(x,y,z,p)
Here is a quick example data set, but it doesn't mirror exactly what I'm trying to do. Say I wanted to take 50 random samples from each level of the factor in row 2 (in this case we only have 2 levels of the factor)... How would I go about coding that efficiently? I have a version working in a loop, but it feels needlessly complex.
edit: When I say I want to take 50 random samples I mean take 50 columns from each level of the factor.
You will need to extract the factor first (assuming that the 2nd row really holds a factor).
fact <- as.factor(as.matrix(df1[2,]))
Then select the columns that match a given level of that factor. For example, to take all columns with the first level:
df1[, df1[2, ] == levels(fact)[1]]
Or, for getting exactly 50:
df1[, df1[2, ] == levels(fact)[1]][1:50]
Maybe you're looking to do something like this:
x1 <- df1[,sample(c(1,4),50,replace = TRUE)]
x2 <- df1[,sample(c(2,3),50,replace = TRUE)]
...but your question is very confusing. "factor" refers to something very specific in R: a type of variable that is generally stored in a column of a data frame, never a row. Additionally, you appear to be forcing all your columns themselves to be factors (or possibly characters), which seems an odd way to store the value 3.3.
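If the grouping in row 2 is what you have, a more general sketch (assuming each level has at least 50 columns; otherwise keep replace = TRUE as above) is to split the column indices by level and sample within each:
fact <- as.factor(as.matrix(df1[2, ]))            # grouping read off row 2
cols_by_level <- split(seq_len(ncol(df1)), fact)  # column indices per level
samples <- lapply(cols_by_level,
                  function(idx) df1[, sample(idx, 50, replace = TRUE)])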
I have a dataset containing something like this:
case,group,val1,val2,val3,val4
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3
I'm trying to programmatically compute the Euclidean distance between the vectors of values in groups.
This means that I have x number of cases in n number of groups. The Euclidean distance is computed between pairs of rows and then averaged for the group. So, in the example above, first I compute the mean and std dev of group 1 (cases 1, 2 and 5), then standardise the values (i.e., (original value - mean)/std dev), then compute the ED between case 1 and case 2, case 2 and case 5, and case 1 and case 5, and finally average the ED for the group.
Can anyone suggest a neat way of achieving this in a reasonably efficient way?
Yes, it is probably easier in R...
Your data:
dat <- data.frame(case = 1:5,
                  group = c(1, 1, 2, 2, 1),
                  val1 = c(3, 2, 1, 5, 8),
                  val2 = c(5, 7, 3, 4, 6),
                  val3 = c(6, 5, 6, 3, 5),
                  val4 = c(8, 4, 8, 7, 3))
A short solution:
library(plyr)
ddply(dat[c("group", "val1", "val2", "val3", "val4")],
"group", function(x)c(mean.ED = mean(dist(scale(as.matrix(x))))))
# group mean.ED
# 1 1 3.121136
# 2 2 3.162278
As an example of how I would approach this in SPSS, first lets read the example data into SPSS.
data list list (",") / case group val1 val2 val3 val4 (6F1.0).
begin data
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3
end data.
dataset name orig.
Then we can use SPLIT FILE and PROXIMITIES to get our distance matrix by group. Note, as you mentioned in the comments to flodel's answer, this produces a separate dataset we need to work with (also note that case practically never matters in SPSS syntax, e.g. split file and SPLIT FILE are equivalent).
sort cases by group.
split file by group.
dataset declare dist.
PROXIMITIES val1, val2, val3, val4
/STANDARDIZE = Z
/MEASURE = EUCLID
/PRINT = NONE
/MATRIX = OUT('dist').
Unlike R, basically everything within an SPSS data matrix is like an R data.frame, so SPLIT FILE more or less replaces all the different *ply functions in R. Very convenient, but less flexible in general. So now we need to aggregate the distances in the dist file I saved the results to. We first sum across rows, and then sum by group via an AGGREGATE command.
dataset activate dist.
compute dist_sum = SUM(VAR1 to VAR3).
*It appears SPSS keeps empty cases - we don't want them in the aggregation.
select if MISSING(dist_sum) = 0.
dataset activate dist.
DATASET DECLARE dist_agg.
AGGREGATE
/OUTFILE='dist_agg'
/BREAK=group
/dist_sum = SUM(dist_sum)
/N_Cases=N.
dataset activate dist_agg.
compute mean_dist = dist_sum /(N_Cases*(N_Cases - 1)).
Here I save the aggregated results into another dataset named dist_agg. Because SPSS (annoyingly) saves the full distance matrix, the divisor for the mean is n*(n-1), not n*(n-1)/2 as it would be in the equivalent R syntax, assuming you do not want to count the diagonal elements towards the mean. Then we can just merge these back into the orig data file via a match files command.
*merge back into the original dataset.
dataset activate orig.
match files file = *
/table = 'dist_agg'
/by group.
exe.
*clean out old datasets if you like.
dataset close dist.
dataset close dist_agg.
The flexibility of R to go back and forth between matrix and data.frame objects makes SPSS a bit more clunky for this job. I could write a much more concise program to do this in SPSS's MATRIX language, but to do it across groups in MATRIX is a pain in the butt (compared to R's *ply syntax).
Here is a much simpler solution using base R, feeding by() the same columns (group plus the four val columns) that the ddply() call above uses:
d <- by(dat[, 2:6], dat$group, function(x) dist(scale(as.matrix(x))))
sapply(d, mean)
This reproduces the mean.ED values shown above.
I am working with a dataframe that has 65 variables in it. The first variable catalogs a person, and the next 64 variables indicate the geographic distance that person is from each of 64 locations. Using R, I would like to create a new variable that catalogs the shortest distance for each person to one of those 64 locations.
For example: if person X is 35, 50, 79, 100, 450...miles away from the locations, I would like the new variable to automatically assign them a 35, because this is the shortest distance.
Any help with this would be much appreciated. Thanks.
Or, using the example data from Justin's answer below:
df$shortest <- do.call(pmin, df[-1])
See also ?pmin and ?do.call, and note that you can drop the first variable in your data frame by using list indexing (so no comma at all; see also ?Extract).
df <- data.frame(let=letters[1:25], d1=sample(1:25,25), d2=sample(1:25,25), d3=sample(1:25,25))
df$shortest <- apply(df[,2:4],1,min)
The second line applies the function min to each row and assigns the result to the new column in my data.frame df. See ?apply for more explanation of what the second line is doing. Be careful to skip the first column, or any columns that aren't distances:
apply(df, 1, min) gives completely different answers, since it's finding the "min" of strings:
> min(2:10)
[1] 2
> min(as.character(2:10))
[1] "10"
I'd approach this with apply, but transform or another approach could work.
# fake data set
DF <- as.data.frame(matrix(sample(1:100, 100, replace = TRUE), 5, 20))
DF <- data.frame(ID = LETTERS[1:5], DF)
# solution
DF$newvar <- apply(DF[, -1], 1, min)
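If you also want to know which of the locations is the closest one, the same pattern works with which.min (a sketch; dist_cols and nearest are made-up names, not from the original answer):
# all distance columns: everything except ID and the newly created newvar
dist_cols <- setdiff(names(DF), c("ID", "newvar"))
DF$nearest <- dist_cols[apply(DF[, dist_cols], 1, which.min)]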