How to subset columns by factor in a row? - r

x <- c("a", 2, 3, 1.0)
y <- c("b", 1, 6, 7.9)
z <- c("c", 1, 8, 2.0)
p <- c("d", 2, 9, 3.3)
df1 <- data.frame(x,y,z,p)
Here is a quick example data set, but it doesn't exactly mirror what I'm trying to do. Say I wanted to take 50 random samples from each level of the factor in row 2 (in this case we only have 2 levels of the factor). How would I go about coding that efficiently? I have a version working in a loop, but it feels needlessly complex.
edit: When I say I want to take 50 random samples I mean take 50 columns from each level of the factor.

You will need to extract a factor (assuming that 2nd row is a factor).
fact <- as.factor(as.matrix(df1[2,]))
Then use the second row, which you want to treat as a factor, to select columns. For example, to select all columns for the first level of the factor:
df1[, fact == levels(fact)[1]]
Or to keep exactly the first 50 of them (this is not a random draw, and it needs at least 50 matching columns):
df1[, fact == levels(fact)[1]][1:50]
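A random draw per level could be sketched along these lines. This is only a sketch under assumptions: draw_cols is a hypothetical helper, not part of any package, and n is set to 1 here because the toy data has just two columns per level (it would be 50 with the real data).

```r
# Example data from the question
x <- c("a", 2, 3, 1.0)
y <- c("b", 1, 6, 7.9)
z <- c("c", 1, 8, 2.0)
p <- c("d", 2, 9, 3.3)
df1 <- data.frame(x, y, z, p)

# Grouping values taken from row 2, one per column
grp <- as.character(as.matrix(df1[2, ]))

# Hypothetical helper: draw up to n column positions from one group
draw_cols <- function(idx, n) idx[sample.int(length(idx), min(n, length(idx)))]

n <- 1  # would be 50 with the real data
picked <- lapply(split(seq_along(grp), grp), draw_cols, n = n)

# One data frame of sampled columns per level of the grouping row
samples <- lapply(picked, function(idx) df1[, idx, drop = FALSE])
```

Indexing with sample.int() on the group size avoids the surprise that sample() treats a single number k as 1:k when a group happens to contain only one column.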

Maybe you're looking to do something like this:
x1 <- df1[,sample(c(1,4),50,replace = TRUE)]
x2 <- df1[,sample(c(2,3),50,replace = TRUE)]
...but your question is very confusing. "factor" refers to something very specific in R: a type of variable that is generally stored in a column of a data frame, never a row. Additionally, you appear to be forcing all your columns themselves to be factors (or characters possibly), which seems an odd way to store the value 3.3.

Related

R assign levels to factor variable

I was given an Excel table similar to this:
datos <- data.frame(op= 1:4, var1= c(4, 2, 3, 2))
Now, there are other tables with the keys to op and var1, which happen to be categorical variables. Suppose that after loading them, they become:
set.seed(1)
op <- paste("op",c(1:4),sep="")
var1 <- sample(LETTERS, 19, replace= FALSE)
As you can see, there are unused levels in the data frame. I want to replace the numbers for the proper associated levels. This is what I've tried:
datos[] <- lapply(datos, factor)
levels(datos$op) <- op
levels(datos$var1) <- var1
This fails, because it reorders the factors alphabetically and gives a wrong output. I then tried:
datos$var1 <- factor(datos$var1, levels= var1, ordered= TRUE)
but this sets everything in datos$var1 to NA (I guess that's because of the mismatched lengths).
What would be the right way to do this?
Following the kind advice of @docendoDiscimus, I post this answer for future reference:
For the data provided in the question:
datos$var1 <- factor(var1[datos$var1], levels= unique(var1))
datos
## op
Please notice that this solution should be applied without first converting datos$var1 to a factor (that is, without running datos[] <- lapply(datos, factor)).
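The same lookup-by-position idea, shown with small key vectors so the result is easy to follow. The key values below are hypothetical (they are not the ones produced by sample() in the question); position i in each key vector is assumed to hold the label for code i.

```r
datos <- data.frame(op = 1:4, var1 = c(4, 2, 3, 2))

# Hypothetical key tables: position i holds the label for code i
op_keys   <- paste0("op", 1:4)
var1_keys <- c("K", "D", "Q", "A", "M")

# Index each key vector with the numeric codes, then fix the level order
datos$op   <- factor(op_keys[datos$op], levels = op_keys)
datos$var1 <- factor(var1_keys[datos$var1], levels = var1_keys)
```

Because the labels are looked up before factor() is called, unused keys (here "K" and "M") remain as levels without shifting the labels of the codes that do occur.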

R: random sample of columns excluding one column

I may have discovered one of the problems in the code posted previously, "R: using foreach() with sample() procedures in randomForest() call" and it relates to the script I was using to draw a random subsample of columns from a dataframe.
The fake data (below) has 19 columns, "A" through "S" and I want to draw a random subset of 5 columns, but I want to exclude the third column, "C", from the draw. Simply excluding the third column from the first argument of sample() call does not work (i.e., some of the samples contain the 'C' column). I'm hoping someone has a suggestion on how to do this. This is the script that does not work:
randsCOLs= sample(1:dim(FAKEinput[,c(1:2,4:19)])[2], 5, replace=FALSE)
#randsCOLs= sample(dim(FAKEinput[,c(1:2,4:19)])[2], 5, replace=FALSE) - also doesn't work
out <- FAKEinput[,randsCOLs]
FAKEinput <-
data.frame(A=sample(25:75,20, replace=T), B=sample(1:2,20,replace=T), C=as.factor(sample(0:1,20,replace=T,prob=c(0.3,0.7))),
D=sample(200:350,20,replace=T), E=sample(2300:2500,20,replace=T), F=sample(92000:105000,20,replace=T),
G=sample(280:475,20,replace=T),H=sample(470:550,20,replace=T),I=sample(2537:2723,20,replace=T),
J=sample(2984:4199,20,replace=T),K=sample(222:301,20,replace=T),L=sample(28:53,20,replace=T),
M=sample(3:9,20,replace=T),N=sample(0:2,20,replace=T),O=sample(0:5,20,replace=T),P=sample(0:2,20,replace=T),
Q=sample(0:2,20,replace=T), R=sample(0:2,20,replace=T), S=sample(0:7,20,replace=T))
It looks like dropping the dim() call will work, if I'm not mistaken. Sampling the data frame directly returns five of its columns (not their indices):
randsCOLs = sample(FAKEinput[-3], 5, replace=FALSE)
Here is a more general approach (in case the C column is not the 3rd column)
FAKEinput[sample(which(names(FAKEinput) !='C'),5, replace=FALSE)]
Or you could use setdiff
FAKEinput[sample(setdiff(names(FAKEinput),'C'), 5, replace=FALSE)]
Or by changing the OP's code of 1:dim and assuming that C is the column 3
FAKEinput[sample((1:dim(FAKEinput)[2])[-3], 5, replace=FALSE)]
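To see why the original script can still return "C": it draws from 1:18, which are positions in the reduced data frame, and then applies them to the full one, where position 3 is "C" and position 19 can never be drawn. A minimal sketch of the fix, using a hypothetical stand-in for FAKEinput with the same column names:

```r
# Hypothetical stand-in for FAKEinput: 19 columns named A to S
FAKEinput <- as.data.frame(matrix(seq_len(19 * 4), nrow = 4,
                                  dimnames = list(NULL, LETTERS[1:19])))

# Positions in the full data frame that may be drawn (everything but "C")
candidates <- which(names(FAKEinput) != "C")

# Sample from the positions themselves, not from 1:length(candidates)
randsCOLs <- sample(candidates, 5, replace = FALSE)
out <- FAKEinput[, randsCOLs]
```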

r - Multiple boxplots of vector with breaks (and variable widths)

I have a vector of numeric samples. I have calculated a smaller vector of breaks that group the values. I would like to create a boxplot that has one box for every interval, with the width of each box coming from a third vector, the same length as the breaks vector.
Here is some sample data. Please note that my real data has thousands of samples and at least tens of breaks:
v <- c(seq.int(5), seq.int(7) * 2, seq.int(4) * 3)
v1 <- c(1, 6, 13) # breaks
v2 <- c(5, 10, 2) # relative widths
This is how I might make separate boxplots, ignorant of the widths:
boxplot(v[v1[1]:(v1[2]-1)])
boxplot(v[v1[2]:(v1[3]-1)])
boxplot(v[v1[3]:length(v)])
I would like a solution that does a single boxplot() call without excessive data conditioning. For example, putting the vector in a data frame and adding a column for region/break number seems inelegant, but I'm not yet "thinking in R", so perhaps that is best.
Base R is preferred, but I will take what I can get.
Thanks.
Try this:
v1 <- c(v1, length(v) + 1)
a01 <- unlist(mapply(rep, 1:(length(v1)-1), diff(v1)))
boxplot(v ~ a01, names= paste0("g", 1:(length(v1)-1)))
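The relative widths can probably be handled in the same single call through the width argument of boxplot(). A sketch under that assumption, using findInterval() to build the grouping vector from the break positions (g and res are names introduced here for illustration):

```r
v  <- c(seq.int(5), seq.int(7) * 2, seq.int(4) * 3)
v1 <- c(1, 6, 13)   # start index of each interval
v2 <- c(5, 10, 2)   # relative widths

# Interval number for every position of v
g <- findInterval(seq_along(v), v1)

# width gives the relative widths of the boxes, one per group
res <- boxplot(v ~ g, width = v2, names = paste0("g", seq_along(v1)))
```

Here the breaks are treated as positions in v rather than as values, matching the sample data in the question.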

euclidean distance between vectors grouped by other variable in SPSS, R or Excel

I have a dataset containing something like this:
case,group,val1,val2,val3,val4
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3
I'm trying to compute programmatically the Euclidean distance between the vectors of values in groups.
This means that I have x cases in n groups. The Euclidean distance is computed between pairs of rows and then averaged for the group. So, in the example above, first I compute the mean and standard deviation of group 1 (cases 1, 2 and 5), then standardise the values, i.e. (original value - mean) / standard deviation, then compute the ED between cases 1 and 2, cases 2 and 5, and cases 1 and 5, and finally average the ED for the group.
Can anyone suggest a neat way of achieving this in a reasonably efficient way?
Yes, it is probably easier in R...
Your data:
dat <- data.frame(case = 1:5,
group = c(1, 1, 2, 2, 1),
val1 = c(3, 2, 1, 5, 8),
val2 = c(5, 7, 3, 4, 6),
val3 = c(6, 5, 6, 3, 5),
val4 = c(8, 4, 8, 7, 3))
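A short solution (only the value columns are passed to scale(), because the constant group column would standardise to NaN within each group):
library(plyr)
vals <- c("val1", "val2", "val3", "val4")
ddply(dat, "group",
function(x) c(mean.ED = mean(dist(scale(as.matrix(x[vals]))))))
# group mean.ED
# 1 1 2.791629
# 2 2 2.828427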
As an example of how I would approach this in SPSS, first let's read the example data into SPSS.
data list list (",") / case group val1 val2 val3 val4 (6F1.0).
begin data
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3
end data.
dataset name orig.
Then we can use SPLIT FILE and PROXIMITIES to get our distance matrix by group. Note, as you mentioned in the comments to flodel's answer, this produces a separate dataset we need to work with (also note that case practically never matters in SPSS syntax, e.g. split file and SPLIT FILE are equivalent).
sort cases by group.
split file by group.
dataset declare dist.
PROXIMITIES val1, val2, val3, val4
/STANDARDIZE = Z
/MEASURE = EUCLID
/PRINT = NONE
/MATRIX = OUT('dist').
Unlike in R, basically everything in SPSS lives in a data matrix that resembles an R data.frame, so SPLIT FILE functionally replaces nearly all of the different *ply functions in R. Very convenient, but less flexible in general. So now we need to aggregate the distances in the dist file I saved the results to. We first sum across rows, and then sum by group via an AGGREGATE command.
dataset activate dist.
compute dist_sum = SUM(VAR1 to VAR3).
*it appears SPSS keeps empty cases - we don't want them in the aggregation.
select if MISSING(dist_sum) = 0.
dataset activate dist.
DATASET DECLARE dist_agg.
AGGREGATE
/OUTFILE='dist_agg'
/BREAK=group
/dist_sum = SUM(dist_sum)
/N_Cases=N.
dataset activate dist_agg.
compute mean_dist = dist_sum /(N_Cases*(N_Cases - 1)).
Here I save the aggregated results into another dataset named dist_agg. Because SPSS (annoyingly) saves the full distance matrix, the denominator for the mean is not n*(n-1)/2 (as in the equivalent R syntax) but n*(n-1), assuming you do not want to count the diagonal elements towards the mean. Then we can just merge these back into the orig data file via a match files command.
*merge back into the original dataset.
dataset activate orig.
match files file = *
/table = 'dist_agg'
/by group.
exe.
*clean out old datasets if you like.
dataset close dist.
dataset close dist_agg.
The flexibility of R to go back and forth between matrix and data.frame objects makes SPSS a bit more clunky for this job. I could write a much more concise program to do this in SPSS's MATRIX language, but to do it across groups in MATRIX is a pain in the butt (compared to R's *ply syntax).
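The point about the denominator can be checked with a small R sketch: averaging the lower triangle (what mean(dist(...)) does) gives the same number as summing the full symmetric matrix and dividing by n*(n-1). The matrix below holds the group 1 rows from the example data; the variable names are introduced here for illustration only.

```r
# Group 1 of the example data (cases 1, 2 and 5), one row per case
vals <- matrix(c(3, 5, 6, 8,
                 2, 7, 5, 4,
                 8, 6, 5, 3), nrow = 3, byrow = TRUE)

d    <- dist(scale(vals))   # lower triangle only: n*(n-1)/2 distances
full <- as.matrix(d)        # full symmetric matrix with a zero diagonal
n    <- nrow(full)

mean_lower <- mean(d)                     # divides by n*(n-1)/2
mean_full  <- sum(full) / (n * (n - 1))   # each pair is counted twice
```

Both values agree because the full matrix contains every pairwise distance twice and the diagonal contributes only zeros.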
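Here is a much simpler solution using base R. The value columns are columns 3 to 6, and scale() standardises them within each group as the question asks.
d <- by(dat[, 3:6], dat$group, function(x) dist(scale(x)))
sapply(d, mean)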

Numeric sequence gets out of order in plot series

I have a data frame with 30 columns numbered from 0 to 29.
I call stack on this data frame to plot a series of boxplots, one for each column number.
But instead of getting the boxplots in the sequence 0, 1, 2, 3, ... it prints 0, 1, 10, 11, ..., 19, 2, ..., 7, 8, 9.
In other words, I want the boxplots to appear in the same sequence as the columns, which is natural.
I'm using boxplot(values ~ column, data = mydata).
I don't want to fix that by changing the column names.
Is there another solution?
Thanks!
stack stores the column name as a factor,
and the default order is alphabetic.
You can either fix the order once it has been tampered with,
or just use melt instead of stack:
the column order will be preserved.
# Sample data
d <- matrix( rnorm(300), ncol=30 )
d <- as.data.frame( d )
colnames(d) <- as.character(0:29)
# Plot
library(reshape2)
boxplot( value ~ variable, melt(d) )
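The first option, fixing the order after stack(), could look like the sketch below. It assumes the grouping column produced by stack() is called ind (its documented name) and simply re-declares the factor with the levels in column order, so it should be harmless on R versions where stack() already keeps that order.

```r
set.seed(1)
d <- as.data.frame(matrix(rnorm(300), ncol = 30))
colnames(d) <- as.character(0:29)

s <- stack(d)   # two columns: values and ind

# Re-declare the factor with levels in the original column order
s$ind <- factor(as.character(s$ind), levels = colnames(d))

boxplot(values ~ ind, data = s)
```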
