Preparing trimmed means of multiple data files for repeated measures ANOVA - R

I have multiple data files (tab-separated txt files) formatted like so.
I've made some example files:
https://docs.google.com/file/d/0B20HmmYd0lsFVGhTQ0EzRFFmYXc/edit?usp=sharing
https://docs.google.com/file/d/0B20HmmYd0lsFbWxmQzV6X0o2Y1E/edit?usp=sharing
Condition Block Session Stimuli Score Reqresponse Act RT   extra
X         3     3       asdfa   1     a           a   500  0
Y         1     2       qewrq   0     b           a   1100 0
I want to exclude outlying RTs and perform an ANOVA on the means of RT and Score across the files (with Condition as a factor). So far I've done this in an extremely ugly fashion, with rows by subject (I'd prefer to format it with rows as subject x condition).
My current attempt uses a for loop:
all_data <- data.frame(rbind(1:27))  # placeholder 1 x 27 data.frame (9 Score cells + 18 RT cells)
all_data
for (i in 1:2) {
  n <- paste(i, ".txt", sep = "")
  data <- read.table(n, header = TRUE, sep = "\t")
  # fill columns 1:9 with the mean Score per Condition x Reqresponse cell
  Score <- as.vector(tapply(data$Score, list(data$Condition, data$Reqresponse), mean))
  for (o in 1:9) {
    all_data[i, o] <- Score[o]
  }
  # then trim the RT values the way I want and put them in columns 10 onward
  data <- data[which(data$RT > 200), ]
  data <- do.call(rbind, by(data, data$Condition, function(x) x[!abs(scale(x$RT)) > 3, ]))
  RT <- as.vector(tapply(data$RT, list(data$Condition, data$Reqresponse, data$Score), mean))
  for (j in 1:18) {
    all_data[i, j + 9] <- RT[j]
  }
}
Also, this code must be aesthetically offensive to anyone decent at R; please tell me how to tidy it up if you like.

I would do this using ddply from the plyr package. For example:
require(plyr)
res <- lapply(list.files(pattern='^[12]\\.txt$'), function(ff){
  ## read the file
  data <- read.table(ff, header=TRUE, quote="\"")
  ## remove the outliers
  data <- data[data$RT > 200, ]
  data <- ddply(data, .(Condition), function(x) x[!abs(scale(x$RT)) > 3, ])
  ## compute the means
  ddply(data, .(Condition, Reqresponse, Score), summarise, RT=mean(RT))
})
[[1]]
   Condition Reqresponse Score   RT
1          X           a     0  500
2          X           a     1  750
3          X           b     0  500
4          X           b     1  500
5          Y           a     0  400
6          Y           a     1  640
7          Y           b     1 1000
8          Z           a     0 1000
9          Z           a     1 1675
10         Z           b     0  400

[[2]]
   Condition Reqresponse Score   RT
1          X           a     0  500
2          X           a     1  750
3          X           b     0  500
4          X           b     1  500
5          Y           a     0  400
6          Y           a     1  640
7          Y           b     1 1000
8          Z           a     0 1000
9          Z           a     1 1675
10         Z           b     0  400
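From there, if you want a single data frame with rows as subject x condition (plus a subject identifier for the repeated-measures ANOVA), you can name the list elements and stack them. A minimal sketch, assuming one input file per subject and a recent plyr (for ldply's .id argument):
## sketch: stack the per-file summaries with a subject id, aggregate to
## condition means, then run a repeated-measures ANOVA on those means
names(res) <- paste0("subj", 1:2)        # one file per subject (assumption)
all_data <- ldply(res, .id = "subject")  # one row per summary cell
cond_means <- ddply(all_data, .(subject, Condition), summarise, RT = mean(RT))
cond_means$subject <- factor(cond_means$subject)
summary(aov(RT ~ Condition + Error(subject/Condition), data = cond_means))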


Find what values were changed to after normalization

How can I see the original values after normalization, or change them back in the final output?
I want to change my final output back to the original values, or at least close to them, considering I aggregate and take the mean.
I have a dataset that has 10 columns and 5,000 rows. After cleaning up the data and selecting which columns and rows I want, I run a normalization code.
Then I run a kmeans and get my output. How can I see what the values were changed to after normalization? For example, if I have Regions 1, 2, 3, 4, and 5, and after normalization they change to 0.00, 0.25, 0.50, 0.75, and 1.00, is there a way to change them back to the originals in the kmeans output?
normalize <- function(X) {
  abs(X - min(X)) / (max(X) - min(X))
}
df_age_norm <- as.data.frame(lapply(df_age, normalize))
clusters <- kmeans(df_age_norm, 9)[['cluster']]
df_age_norm$clusters <- clusters
df_age_norm <- aggregate(df_age_norm[, 1:4],
                         list(df_age_norm$clusters), FUN = mean)
Head of dataset before normalization:
Age HHIncome Region MaritalStatus group
 18    11000      5             0     1
 18    11000      5             1     1
 18    12000      2             0     1
 18    12000      4             0     1
 18    13000      1             0     1
Head of dataset after normalization:
Age    HHIncome Region MaritalStatus group
  0 0.001879699   1.00             0     0
  0 0.001879699   1.00             1     0
  0 0.002819549   0.25             0     0
  0 0.002819549   0.75             0     0
  0 0.003759398   0.00             0     0
This solution is inspired by the base R function scale, which centers and scales a vector by subtracting its mean and dividing by its standard deviation. These two values, mean(x) and sd(x), are returned as attributes.
x <- -4:5
y <- scale(x)
attributes(y)
#$dim
#[1] 10 1
#
#$`scaled:center`
#[1] 0.5
#
#$`scaled:scale`
#[1] 3.02765
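Those attributes are exactly what is needed to undo the transformation; for instance, scale can be inverted by hand:
x2 <- y * attr(y, "scaled:scale") + attr(y, "scaled:center")
all.equal(as.vector(x2), x)
#[1] TRUE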
I have therefore rewritten the normalize function to also set and return min(x) and max(x) as attributes; they will be used later to denormalize.
normalize <- function(X, na.rm = FALSE) {
  if (na.rm) X <- X[!is.na(X)]
  Min <- min(X)
  Max <- max(X)
  Y <- X - Min
  if (Min != Max) Y <- Y / (Max - Min)
  attr(Y, "scaled:min") <- Min
  attr(Y, "scaled:max") <- Max
  Y
}
denormalize <- function(X) {
  Min <- attr(X, "scaled:min")
  Max <- attr(X, "scaled:max")
  attr(X, "scaled:min") <- NULL
  attr(X, "scaled:max") <- NULL
  Y <- if (Min != Max) X * (Max - Min) else X
  Y + Min
}
df_age_norm <- as.data.frame(lapply(df_age, normalize))
df_age_2 <- as.data.frame(lapply(df_age_norm, denormalize))
df_age_2
# Age HHIncome Region MaritalStatus group
#1 18 11000 5 0 1
#2 18 11000 5 1 1
#3 18 12000 2 0 1
#4 18 12000 4 0 1
#5 18 13000 1 0 1
Data.
df_age <- read.table(text = "
Age HHIncome Region MaritalStatus group
18 11000 5 0 1
18 11000 5 1 1
18 12000 2 0 1
18 12000 4 0 1
18 13000 1 0 1
", header = TRUE)

Generate a number of random samples of specified length in R

I think this is down to my poor knowledge of R, especially of loop-related code.
Here is what I am trying to do. Assume I have the numbers from 1 to 500. First, generate 300 sets of two randomly picked numbers from 1 to 500, label each set with a number from 1 to 300, and then combine them into one dataset; call it dataset A.
So A will look like:
A_no random_num
   1         26
   1        256
   2          3
   2        113
...
Then I need to create a dataset B with numbered sets of 3 random numbers out of 500:
B_no random_num
   1         16
   1        113
   1          4
   2         67
   2         25
   2          7
I believe the R code for this can be very elegant and simple. I would really appreciate any help.
This is simplifying somewhat, taking advantage of the (assumed) fact that all randomly picked numbers will be independent.
df1 <- data.frame(A_no = rep(1:300, each = 2),
                  random_num = sample(1:500, 2*300, replace = TRUE))
head(df1)
##   A_no random_num
## 1    1        249
## 2    1        117
## 3    2        108
## 4    2         44
## 5    3        138
## 6    3        247
dim(df1)
## [1] 600   2
df2 <- data.frame(B_no = rep(1:300, each = 3),
                  random_num = sample(1:500, 3*300, replace = TRUE))
head(df2)
##   B_no random_num
## 1    1        276
## 2    1         50
## 3    1        237
## 4    2        153
## 5    2        225
## 6    2          4
dim(df2)
## [1] 900   2
# Set random seed
set.seed(83)
# Generate sequence
z <- seq(1, 300)
# Generate first sample
x <- sample(1:500, 300)
# Combine
zx <- data.frame(z, x)
# Set seed again
set.seed(82)
# Generate second sample
x <- sample(1:500, 300)
# Combine
zy <- data.frame(z, x)
xyz <- rbind(zx, zy)
Or you could do it this way:
# Set random seed
set.seed(83)
# Generate sequence
B_no <- seq(1, 300)
# Generate first sample
x <- sample(1:500, 300)
y <- sample(1:500, 300)
z <- sample(1:500, 300)
wide <- data.frame(B_no, x, y, z)
library(reshape2)
B <- melt(wide, id.vars = c("B_no"))
Bx <- B[order(B$B_no),]
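For completeness, both datasets can also come from one small helper; this is just a sketch (gen_sets is a made-up name), again assuming all draws are independent:
# k labelled sets of n draws from 1:pool, with replacement
gen_sets <- function(k, n, pool = 500) {
  data.frame(set_no = rep(seq_len(k), each = n),
             random_num = sample(pool, k * n, replace = TRUE))
}
A <- gen_sets(300, 2)  # 300 sets of two numbers
B <- gen_sets(300, 3)  # 300 sets of three numbers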

Sorting data between rows in R

Hypothetical data:
a <- c(400,500,600,700,100,600,700,100)
b <- c(2,2,1,2,2,1,2,1)
c <- c('NA','R','NA','G','NA','R','NA','G')
data <- data.frame(a,b,c)
Output:
    a b  c
1 400 2 NA
2 500 2  R
3 600 1 NA
4 700 2  G
5 100 2 NA
6 600 1  R
7 700 2 NA
8 100 1  G
You can easily subset if it is in the same row:
subset(data, b== '1' & c =='R')
Output:
    a b c
6 600 1 R
My question is how do I subset between rows? For example, how do I find all values of c = 'R' when b = '2' on the above row?
    a b c
2 500 2 R
6 600 1 R
How do I find all values of c = 'R' when b = '2' on the above row?
How about
b2above <- which(data$b == 2) + 1L
cR <- which(data$c == "R")
id <- cR[cR %in% b2above] ## or `id <- intersect(cR, b2above)`
data[id, ]
# a b c
#2 500 2 R
#6 600 1 R
You can try this too:
indices.b <- which(data$b == 2)
indices.c <- which(data$c == 'R')
if ((length(indices.b) > 0) && (length(indices.c) > 0)) { # if such rows exist
indices <- which((indices.c - 1) %in% indices.b) # check if consecutive rows
if(length(indices)>0) data[indices.c[indices],] # if consecutive rows exist
}
# a b c
# 2 500 2 R
# 6 600 1 R
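A compact base R alternative (just a sketch of the same idea) is to shift b down one position so each row can see the b value from the row above:
prev_b <- c(NA, head(data$b, -1))  # b from the previous row
data[which(data$c == "R" & prev_b == 2), ]
#     a b c
# 2 500 2 R
# 6 600 1 R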

Find frequencies of combinations where the data.frame needs to be parsed

I'm sure there's a simple solution to this, but I can't figure it out! Suppose I have a data frame with the following information:
aaa<-c("A,B","B,C","B,D,E")
vvv<-c("101","101,102","102,103,104")
data_h<-data.frame(aaa,vvv)
data_h
    aaa         vvv
1   A,B         101
2   B,C     101,102
3 B,D,E 102,103,104
Desired output is a frequency map of individual hits, for subsequent analysis in a heat map. So:
  101 102 103 104
A   1   0   0   0
B   2   2   1   1
C   1   1   0   0
D   0   1   1   1
E   0   1   1   1
How do I make this transformation? I've seen many similar examples, but none where the contents of the data frame need to be parsed.
The goal is to ultimately use heatmap or something similar on the output table to visualize the correlation between "aaa" and "vvv".
Here is a base R solution in 4 lines of code. First we define a function, spl, which splits a comma-separated string and returns a vector of its fields. eg takes two string arguments, applies spl to each of them, and creates a grid from the results of the splitting. Finally we apply eg to each row of data_h, rbind the results together, and tabulate them with xtabs:
spl <- function(x) strsplit(as.character(x), ",")[[1]]
eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))
dd <- do.call("rbind", Map(eg, data_h$aaa, data_h$vvv))
xtabs(data = dd)
The result is:
   vvv
aaa 101 102 103 104
  A   1   0   0   0
  B   2   2   1   1
  C   1   1   0   0
  D   0   1   1   1
  E   0   1   1   1
dcast. Alternatively, replace the last line of code above (the one with the xtabs) with:
library(reshape2)
dcast(dd, aaa ~ vvv, fun = length, value.var = "vvv")
in which case the result is:
  aaa 101 102 103 104
1   A   1   0   0   0
2   B   2   2   1   1
3   C   1   1   0   0
4   D   0   1   1   1
5   E   0   1   1   1
tapply. Another alternative would be tapply (however, it will fill in empty cells with NA rather than 0):
tapply(1:nrow(dd), dd, length)
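If you want those NA cells to be zeros, to match the xtabs result, you can fill them in afterwards:
tab <- tapply(1:nrow(dd), dd, length)
tab[is.na(tab)] <- 0  # empty combinations become 0 instead of NA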
ADDED: some alternatives and improvements.
The shape of the data.frame suggests using the splitstackshape package. I don't know this package very well, so I just use it to reshape the data and then compute the frequencies by hand with table:
library(splitstackshape)
data_h_split <- concat.split.multiple(data_h,1:2)
# aaa_1 aaa_2 aaa_3 vvv_1 vvv_2 vvv_3
# 1 A B <NA> 101 NA NA
# 2 B C <NA> 101 102 NA
# 3 B D E 102 103 104
Once you have the data in this format (no commas, regular columns), it is easy to compute frequencies using table (you could also use tapply or reshape):
table(cbind.data.frame(ff = unlist(data_h_split[1:3]),
                       xx = unlist(data_h_split[4:6])))
   xx
ff  101 102 103 104
  A   1   0   0   0
  B   1   1   0   0
  C   0   1   0   0
  D   0   0   1   0
  E   0   0   0   1
Note that this pairs the split columns positionally rather than crossing them, which is why the counts differ from the desired output; the edit below fixes that.
Ananda's edit
Here's a multi-step approach using "splitstackshape" to get the desired result.
library(splitstackshape)
## Split the "vvv" column first, and reshape at the same time
x <- concat.split.multiple(data_h, split.cols="vvv", ",", "long")
## Add an ID column
x$id <- 1:nrow(x)
## Split the "aaa" column next, again reshaping as we do so
x <- concat.split.multiple(x[complete.cases(x), ], split.cols="aaa", ",", "long")
## Use `table` with `droplevels`
with(droplevels(x), table(aaa, vvv))
#    vvv
# aaa 101 102 103 104
#   A   1   0   0   0
#   B   2   2   1   1
#   C   1   1   0   0
#   D   0   1   1   1
#   E   0   1   1   1
My concat.split.multiple function is in great need of a rewrite to improve its efficiency. I've done some work on that in my cSplit function, which might be useful if you have a particularly large dataset.
Here's how I would solve your given problem with cSplit:
table(
  cSplit(
    cSplit(data_h, splitCols = 2, sep = ",",
           direction = "long", makeEqual = FALSE),
    splitCols = 1, sep = ",", direction = "long",
    makeEqual = FALSE))
#    vvv
# aaa 101 102 103 104
#   A   1   0   0   0
#   B   2   2   1   1
#   C   1   1   0   0
#   D   0   1   1   1
#   E   0   1   1   1
It seems to be pretty efficient too...
First, the functions to test:
fun1 <- function() table(cSplit(cSplit(df, 2, ",", "long", FALSE), 1, ",", "long", FALSE))
fun2 <- function() {
  spl <- function(x) strsplit(as.character(x), ",")[[1]]
  eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))
  dd <- do.call("rbind", Map(eg, df$A, df$V))
  xtabs(data = dd)
}
Second, some sample data. Change Nrow and regenerate to see the effect on different-sized data.frames.
set.seed(1)
Nrow <- 100
aaa <- 100:200
vvv <- LETTERS
maxA <- 10
maxV <- 10
Aaa <- sample(maxA, Nrow, TRUE)
Vvv <- sample(maxV, Nrow, TRUE)
A <- vapply(seq_along(Aaa), function(x)
  paste(sample(aaa, Aaa[x], TRUE), collapse = ","), character(1L))
V <- vapply(seq_along(Vvv), function(x)
  paste(sample(vvv, Vvv[x], TRUE), collapse = ","), character(1L))
df <- data.frame(A, V)
head(df)
#                                          A                   V
# 1                              127,122,152       E,E,O,S,W,S,M
# 2                          127,118,152,156             V,A,Z,Q
# 3                  113,125,172,197,110,177               L,A,T
# 4 195,182,131,165,196,196,134,126,116,132 F,Z,X,S,T,M,W,E,Q,H
# 5                              151,193,151       L,B,E,B,Y,I,N
# 6      126,104,142,186,135,113,137,163,139              Q,G,N
Compare the two approaches to make sure the results are the same:
X <- fun1()
Y <- fun2()
all(X == Y[dimnames(X)[[1]], dimnames(X)[[2]]])
# [1] TRUE
Benchmark (on 100 rows).
library(microbenchmark)
## Nrow = 100
microbenchmark(fun1(), fun2(), times = 10)
# Unit: milliseconds
#    expr       min        lq    median        uq      max neval
#  fun1()  7.263802  7.326237  7.440843  7.868905 10.26451    10
#  fun2() 62.869130 64.046836 68.525880 73.595061 80.02027    10
Benchmark (on 1000 rows).
## Nrow = 1000
microbenchmark(fun1(), fun2(), times = 10)
# Unit: milliseconds
#    expr      min        lq    median        uq       max neval
#  fun1()  19.2303  20.21857  23.14337  26.97776  35.56338    10
#  fun2() 775.6586 815.01639 835.98951 852.47804 888.15345    10

Frequency table comparison using R

I have two frequency tables created using R's table() function:
freq1 <- table(unlist(strsplit(topic_list1, split=";")))
freq2 <- table(unlist(strsplit(topic_list2, split=";")))
topic_list1 and topic_list2 are strings containing textual representations of topics separated by ;.
I want a way to compare the two frequencies, graphically if possible.
So if the two lists contain the same topic with different frequencies, I would like to be able to see it. The same goes for topics present in one frequency table, but not in the other.
There's probably a more elegant way to do this, but this ought to work:
# here I'm generating some example data
set.seed(5)
topic_list1 <- paste(sample(letters, 20, replace=TRUE), collapse=";")
topic_list2 <- paste(sample(letters, 15, replace=TRUE), collapse=";")
# I don't make the tables right away
tl1 <- unlist(strsplit(topic_list1, split=";"))
tl2 <- unlist(strsplit(topic_list2, split=";"))
big_list <- unique(c(tl1, tl2))
# this computes your frequencies
lbl <- length(big_list)
tMat1 <- matrix(rep(tl1, lbl), byrow=T, nrow=lbl)
tMat2 <- matrix(rep(tl2, lbl), byrow=T, nrow=lbl)
tMat1 <- cbind(big_list, tMat1)
tMat2 <- cbind(big_list, tMat2)
counts1 <- apply(tMat1, 1, function(x){sum(x[1]==x[2:length(x)])})
counts2 <- apply(tMat2, 1, function(x){sum(x[1]==x[2:length(x)])})
total_freqs <- rbind(counts1, counts2, counts1-counts2)
# this makes it nice looking & user friendly
colnames(total_freqs) <- big_list
rownames(total_freqs) <- c("topics1", "topics2", "difference")
total_freqs <- total_freqs[ ,order(total_freqs[3,])]
total_freqs
            d  l  a  z  b f s y m r x h n i g k c v o
topics1     0  0  0  0  0 2 1 1 1 1 2 2 1 1 1 1 2 2 2
topics2     2  2  2  1  1 2 1 1 1 0 1 1 0 0 0 0 0 0 0
difference -2 -2 -2 -1 -1 0 0 0 0 1 1 1 1 1 1 1 2 2 2
From there you could just use the straight numbers or visualize them however you want (e.g., dotplots). Here's a simple dotplot:
windows()  # opens a graphics device on Windows; use quartz() or x11() elsewhere
dotchart(t(total_freqs)[,3], main="Frequencies of topics1 - topics2")
abline(v=0)
You can simply barplot them (with the beside=TRUE argument), which gives you a way to visually compare the counts per level.
Below is an example:
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, col=c("darkblue","red"), legend=rownames(counts), beside=T)
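To apply that to the two topic tables, they first need a common set of levels; here is a sketch that aligns both tables over the union of topic names, zero-filling topics missing from either list:
all_topics <- sort(union(names(freq1), names(freq2)))
f1 <- as.integer(freq1[all_topics]); f1[is.na(f1)] <- 0  # 0 where a topic is absent
f2 <- as.integer(freq2[all_topics]); f2[is.na(f2)] <- 0
cmp <- rbind(topics1 = f1, topics2 = f2)
colnames(cmp) <- all_topics
barplot(cmp, beside = TRUE, col = c("darkblue", "red"), legend.text = rownames(cmp))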
