I have a bit of code which aggregates data:
pivot.present.RT <- with(
subset(correct.data, relevantTarget == 1),
aggregate(
data.frame(RT = RT),
list(
identifier = identifier,
set.size = relevantSS,
stimulus = stimulus
),
mean
)
)
I would like to make this more flexible by specifying different column names to take the place of "relevantSS". I thought I could do this with eval:
set.size.options <- c("relevantSS","irrelevantSS")
pivot.present.RT <- with(
subset(correct.data, relevantTarget == 1),
aggregate(
data.frame(RT = RT),
list(
identifier = identifier,
eval(parse(text = paste("set.size = ", set.size.options[relevant.index]))),
stimulus = stimulus
),
mean
)
)
However, when I run the second bit of code, while it does correctly aggregate the data, I lose the variable name "set.size". If I call str, I get output like this:
'data.frame': 48 obs. of 4 variables:
$ identifier: Factor w/ 9 levels "aks","ejr","ejr3",..: 1 2 4 5 6 7 8 9 1 2 ...
$ Group.2 : int 4 4 4 4 4 4 4 4 8 8 ...
$ stimulus : Factor w/ 2 levels "moving","stationary": 1 1 1 1 1 1 1 1 1 1 ...
$ RT : num 1161 1026 1257 1264 1324 ...
If I run the original code, it correctly identifies the second variable as "set.size".
Any idea what I'm missing here?
I think get might be more appropriate than eval/parse.
set.size.options <- c("relevantSS","irrelevantSS")
pivot.present.RT <- with(
subset(correct.data, relevantTarget == 1),
aggregate(
data.frame(RT = RT),
list(
identifier = identifier,
set.size = get(set.size.options[relevant.index]),
stimulus = stimulus
),
mean
)
)
That said, I'd probably prefer something like this:
d2 <- subset(correct.data, relevantTarget == 1)
doby <- subset(d2, select=c("identifier", set.size.options[relevant.index], "stimulus"))
names(doby) <- c("identifier", "set.size", "stimulus")
aggregate(d2[,"RT",drop=FALSE], doby, mean)
And others will undoubtedly chime in with plyr solutions...
Put the grouping variable name outside of eval(parse(...)), like this:
set.size.options <- c("relevantSS","irrelevantSS")
pivot.present.RT <- with(
subset(correct.data, relevantTarget == 1),
aggregate(
data.frame(RT = RT),
list(
identifier = identifier,
set.size = eval(parse(text = set.size.options[relevant.index])),
stimulus = stimulus
),
mean
)
)
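Why the original attempt lost the name: inside list(...), the text "set.size = relevantSS" is parsed and evaluated as an ordinary assignment, so the list element itself is unnamed and aggregate falls back to Group.2. A minimal sketch of name-preserving dynamic grouping with `[[`, assuming a small made-up data frame (`toy` and its values are illustrative, not the asker's data):

```r
# Toy stand-in for correct.data (hypothetical values)
toy <- data.frame(
  identifier   = rep(c("a", "b"), each = 4),
  relevantSS   = rep(c(4, 8), times = 4),
  irrelevantSS = rep(c(2, 2, 6, 6), times = 2),
  RT           = c(1000, 1100, 1200, 1300, 900, 950, 1000, 1050)
)

set.size.options <- c("relevantSS", "irrelevantSS")
relevant.index <- 2

# The element name is fixed in the list; only the column lookup is dynamic
by.list <- list(
  identifier = toy$identifier,
  set.size   = toy[[set.size.options[relevant.index]]]
)
res <- aggregate(toy["RT"], by.list, mean)
names(res)
```

Because the name is written directly in the list, it survives regardless of which column is selected.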
I am using Dataset 2 (jester_dataset_2.zip) of joke ratings from the Jester project, and I would like to divide the jokes into groups with similar ratings and visualize the results appropriately.
The data look like this
> str(tabulka)
'data.frame': 1761439 obs. of 3 variables:
$ User : int 1 1 1 1 1 1 1 1 1 1 ...
$ Joke : int 5 7 8 13 15 16 17 18 19 20 ...
$ Rating: num 0.219 -9.281 -9.281 -6.781 0.875 ...
Here is a subset of Dataset 2.
> head(tabulka)
User Joke Rating
1 1 5 0.219
2 1 7 -9.281
3 1 8 -9.281
4 1 13 -6.781
5 1 15 0.875
6 1 16 -9.656
I found out I can't use ANOVA because the homogeneity-of-variance assumption does not hold. Hence I am using the Kruskal–Wallis method from the agricolae package in R.
KWtest <- with ( tabulka , kruskal ( Rating , Joke ))
Here are the groups.
> head(KWtest$groups)
trt means M
1 53 1085099 a
2 105 1083264 a
3 89 1077435 ab
4 129 1072706 b
5 35 1070016 bc
6 32 1062102 c
The thing is, I don't know how to visualize the joke groups appropriately. I am using a boxplot to show the distribution of ratings for each joke.
barvy <- c ("yellow", "grey")
boxplot (Rating ~ Joke, data = tabulka,
col = barvy,
xlab = "Joke",
ylab = "Rating",
ylim=c(-7,7))
It would be nice to somehow color each box (each joke) with an appropriate color according to the color given by the KW test.
How could I do that? Or is there some better way to find the best and the worst jokes in the dataset?
An interesting question in itself. It's easy to color each box according to the group the joke belongs to. However, I think this is just an intermediate solution; there must be a better visualization for these data. So, certainly not the best one, but here is my version:
library(tidyverse)
# download data (jokes, part 1) to a temporary file, and unzip
tmp <- tempfile()
download.file("http://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip", tmp)
tmp <- unzip(tmp)
# read data from temp
vtipy <- readxl::read_excel(tmp, col_names = F, na = '99')
# clean data
vtipy <- vtipy %>%
mutate(user = 1:n()) %>%
gather(key = 'joke', value = 'rating', -c('..1', 'user')) %>%
rename(n = '..1') %>%
filter(!is.na(rating)) %>%
mutate(joke = as.character(as.numeric(gsub('\\.+', '', joke)) - 1)) %>%
select(user, n, joke, rating)
# your code
KWtest <- with(vtipy, agricolae::kruskal(rating, joke))
# join groups from KWtest to original data, clean and plot
KWtest$groups %>%
rownames_to_column('joke') %>%
select(joke, groups) %>%
right_join(vtipy, by = 'joke') %>%
mutate(joke = stringi::stri_pad_left(joke, 3, '0')) %>%
ggplot(aes(x = joke, y = rating, fill = groups)) +
geom_boxplot(show.legend = F) +
scale_x_discrete(breaks = stringi::stri_pad_left(c(1, seq(5, 100, by = 5)), 3, '0')) +
ggthemes::theme_tufte() +
labs(x = 'Joke', y = 'Rating')
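The same idea can be sketched in base R, closer to the boxplot call in the question: build one color per joke from its group label and pass that vector to col. The data, the group letters in `grp`, and the color choices below are all made up for illustration; in practice the letters would be taken from KWtest$groups, whose column layout differs between agricolae versions.

```r
set.seed(1)
# Toy ratings for four jokes (hypothetical values)
toy <- data.frame(
  Joke   = factor(rep(1:4, each = 20)),
  Rating = c(rnorm(20, 2), rnorm(20, 2), rnorm(20, 0), rnorm(20, -2))
)

# Hypothetical group letters per joke, as a test like kruskal() might assign
grp <- c("1" = "a", "2" = "a", "3" = "b", "4" = "c")

# One color per group, then one color per box in factor-level order
palette.by.group <- c(a = "gold", b = "grey", c = "steelblue")
box.cols <- unname(palette.by.group[grp[levels(toy$Joke)]])

boxplot(Rating ~ Joke, data = toy,
        col = box.cols,
        xlab = "Joke", ylab = "Rating")
```

boxplot draws the boxes in the order of the factor levels, so the color vector has to follow that same order.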
I'm trying to create a function as I need to apply the same code multiple times to different columns in my data.
My data (df) looks like this:
WEEK1.x WEEK1.y WEEK2.x WEEK2.y WEEK3.x WEEK3.y
1 660.14 1 690.74 2 821.34 1
2 -482.89 99 -368.12 99 -368.12 99
3 284.48 3 399.90 1 375.32 1
4 -554.18 99 -300.28 99 -300.28 99
Then my function looks like:
extra<-function(first_var, second_var){
df$first_var=ifelse((df$first_var == 99),"99",
ifelse((df$first_var %in% c(1,2,3,4,5)),"1-5",NA))
output=as.data.frame(aggregate(second_var~first_var, data = df, mean))
return(output)
}
WEEK1<-extra("WEEK1.y", "WEEK1.x")
WEEK2<-extra("WEEK2.y", "WEEK2.y")
This then gives me the error:
Error in $<-.data.frame(*tmp*, first_var, value = logical(0)) :
replacement has 0 rows, data has 1416
When I press view traceback this is what it says:
stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
"replacement has %d rows, data has %d"), N, nrows), domain = NA)
$<-.data.frame(*tmp*, first_var, value = logical(0))
$<-(*tmp*, first_var, value = logical(0))
extra("WEEK1.y", "WEEK1.x")
I'm not sure what the problem is?
Here is a working version of your function.
I have used a variant of the suggestion by @A.Suliman, but with [[.
extra <- function(first_var, second_var){
df[[first_var]] <- ifelse((df[[first_var]] == 99), "99",
ifelse((df[[first_var]] %in% c(1,2,3,4,5)), "1-5", NA))
fmla <- as.formula(paste(second_var, first_var, sep = "~"))
aggregate(fmla, data = df, mean, na.rm = TRUE)
}
WEEK1 <- extra("WEEK1.y", "WEEK1.x")
WEEK1
# WEEK1.y WEEK1.x
#1 1-5 472.310
#2 99 -518.535
WEEK2 <- extra("WEEK2.y", "WEEK2.x")
WEEK2
# WEEK2.y WEEK2.x
#1 1-5 545.32
#2 99 -334.20
Note that I would also suggest passing df as an argument to the function. It is generally considered bad practice to rely on objects that exist outside the function's environment. In this case, df exists in .GlobalEnv, so R has to leave the environment where df is needed in order to find it.
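A minimal sketch of that suggestion, with the data frame passed in explicitly. The name extra_df and the argument name data are hypothetical, and reformulate() is swapped in for the paste/as.formula step; the values are copied from the example data.

```r
# Same toy values as the example data, built directly so the block is self-contained
df <- data.frame(
  WEEK1.x = c(660.14, -482.89, 284.48, -554.18),
  WEEK1.y = c(1, 99, 3, 99),
  WEEK2.x = c(690.74, -368.12, 399.90, -300.28),
  WEEK2.y = c(2, 99, 1, 99)
)

# extra_df is a hypothetical name; the data frame is an explicit argument
extra_df <- function(data, first_var, second_var) {
  # Recode the grouping column on the local copy only
  data[[first_var]] <- ifelse(data[[first_var]] == 99, "99",
                       ifelse(data[[first_var]] %in% 1:5, "1-5", NA))
  # Build second_var ~ first_var from the character names
  fmla <- reformulate(first_var, response = second_var)
  aggregate(fmla, data = data, FUN = mean)
}

WEEK1 <- extra_df(df, "WEEK1.y", "WEEK1.x")
WEEK2 <- extra_df(df, "WEEK2.y", "WEEK2.x")
```

The recoding happens on the function's local copy, so the caller's df keeps its numeric columns.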
DATA.
df <- read.table(text = "
WEEK1.x WEEK1.y WEEK2.x WEEK2.y WEEK3.x WEEK3.y
1 660.14 1 690.74 2 821.34 1
2 -482.89 99 -368.12 99 -368.12 99
3 284.48 3 399.90 1 375.32 1
4 -554.18 99 -300.28 99 -300.28 99
", header = TRUE)
I have a 1.5Mx7 data.table that I need to process through. The code I have written is running very slowly (.18s per row, estimated 75 hours to complete), and I'm hoping I can optimize it.
I'll put the pseudo-example code at the end, because it's long.
str(review)
Classes ‘data.table’ and 'data.frame': 1500000 obs. of 7 variables:
$ user_id : Factor w/ 375000 levels "aA1aJ9lJ1lB5yH5uR6jR7",..: 275929 313114 99332 277686 57473 31780 236964 44371 210127 217770 ...
$ stars : int 2 1 3 3 1 1 2 1 2 2 ...
$ business_id : Factor w/ 60000 levels "aA1kR2bK6nH8yQ9gU2uI9",..: 40806 29885 43018 58297 58444 31626 26018 2493 37883 34204 ...
$ votes.funny : int 3 0 0 7 2 9 6 8 2 7 ...
$ votes.useful: int 4 1 0 5 9 2 4 7 4 9 ...
$ votes.cool : int 5 3 6 8 3 2 0 8 10 9 ...
$ IDate : IDate, format: "2012-01-01" "2012-01-01" "2012-01-01" ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "IDate"
I need to subset the dataset by date, and then compute several columns by business_id.
setkey(review, IDate)
system.time(
review[
#(IDate >= window.start) & (IDate <= window.end),
1:10,
.SD,
keyby = business_id
][
,
list(
review.num = .N,
review.users = length(unique(user_id)),
review.stars = mean(stars),
review.votes.funny = sum(votes.funny),
review.votes.useful = sum(votes.useful),
review.votes.cool = sum(votes.cool)
),
by = business_id
]
)
user system elapsed
1.534 0.000 1.534
Timing for smaller versions of the example dataset is
# 1% of original size - 15000 rows
user system elapsed
0.02 0.00 0.02
# 10% of original size - 150000 rows
user system elapsed
0.25 0.00 0.25
So, even though I'm only processing 10 rows, the time increases with the size of the original dataset.
I tried commenting out the review.users variable above, and the computation time on the original dataset fell tremendously:
user system elapsed
0 0 0
So, my challenge is making unique() work more quickly.
I need to count the unique values in user_id for each grouping of business_id.
Not sure what else to specify, but I'm happy to answer questions.
Here is some code to create a pseudo-example dataset. I'm not sure exactly what is the cause of the slowdown, so I've tried to recreate the data as specifically as possible, but because the processing time for the random variables is so long I've reduced the size by ~90%.
z <- c()
x <- c()
for (i in 1:6000) {
z <<- c(z, paste0(
letters[floor(runif(7, min = 1, max = 26))],
LETTERS[floor(runif(7, min = 1, max = 26))],
floor(runif(7, min = 1, max = 10)),
collapse = ""
))
}
z <- rep(z, 25)
for (i in 1:37500) {
x <<- c(x, paste0(
letters[floor(runif(7, min = 1, max = 26))],
LETTERS[floor(runif(7, min = 1, max = 26))],
floor(runif(7, min = 1, max = 10)),
collapse = ""
))
}
x <- rep(x, 4)
review2 <- data.table(
user_id = factor(x),
stars = as.integer(round(runif(150000) * 5, digits = 0)),
business_id = factor(z),
votes.funny = as.integer(round(runif(150000) * 10, digits = 0)),
votes.useful = as.integer(round(runif(150000) * 10, digits = 0)),
votes.cool = as.integer(round(runif(150000) * 10, digits = 0)),
IDate = rep(as.IDate("2012-01-01"), 150000)
)
setkey(review2, IDate)
How about this - an alternative to unique using an extra data.table within an anonymous function:
review2[,{
uid <- data.table(user_id)
rev_user <- uid[, .N, by = user_id][, .N]
#browser()
list(
review.num = .N,
review.users = rev_user,
review.stars = mean(stars),
review.votes.funny = sum(votes.funny),
review.votes.useful = sum(votes.useful),
review.votes.cool = sum(votes.cool)
)}, by = business_id]
It seems that length(unique()) is inefficient in calculating the length of factor variables as levels become very large.
Using uniqueN() instead (thanks @Frank):
user system elapsed
0.12 0.00 0.12
Using set(review, NULL, "user_id", as.character(review$user_id)) and length(unique()):
user system elapsed
0.11 0.00 0.11
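For reference, a small sketch of the uniqueN() variant on made-up data (the table `toy` and its values are assumptions). The slowdown with factors is probably because unique() on a factor returns a factor that still carries every level, so each group pays a cost related to the total number of levels; uniqueN() only counts distinct values.

```r
library(data.table)

# Tiny made-up table with the same column roles as review
toy <- data.table(
  business_id = c("b1", "b1", "b1", "b2", "b2"),
  user_id     = c("u1", "u1", "u2", "u3", "u3"),
  stars       = c(1L, 2L, 3L, 4L, 5L)
)

res <- toy[, .(
  review.num   = .N,                 # rows per business
  review.users = uniqueN(user_id),   # distinct users per business
  review.stars = mean(stars)
), by = business_id]
res
```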
I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
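For the paired case in the edited question, the same logical test can be applied to each column and the results combined, so that a row is dropped when either value falls at or below its column's threshold. A rough sketch, assuming no missing values; filter_pairs and its prob argument are hypothetical names, and the data are made up:

```r
# Keep only rows where every listed column exceeds its own quantile threshold
filter_pairs <- function(data, cols, prob = 0.25) {
  keep <- rep(TRUE, nrow(data))
  for (col in cols) {
    keep <- keep & data[[col]] > quantile(data[[col]], probs = prob)
  }
  data[keep, , drop = FALSE]
}

# Made-up paired values using the column names from the question
Values <- data.frame(Abundance_O = 1:8, Abundance_S = 8:1)

kept <- filter_pairs(Values, c("Abundance_O", "Abundance_S"))
cor(kept$Abundance_O, kept$Abundance_S)
```

Changing prob changes the threshold, so it could be supplied from outside the script in the same way as the input file name.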
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numeric vector into its quantile groups. The parameter n sets the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
> A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we keep only the rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]
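As a side note, the per-element sapply in qgroup() can probably be replaced by base R's findInterval(), which counts how many breakpoints are less than or equal to each value. A sketch under that assumption (qgroup_fast is a hypothetical name):

```r
# Vectorised variant of qgroup(): same breakpoints, no per-element loop
qgroup_fast <- function(numvec, n = 4) {
  qtile <- quantile(numvec, probs = seq(0, 1, 1 / n))
  # Drop the last breakpoint (the maximum) so the top value lands in group n
  findInterval(numvec, qtile[-(n + 1)])
}

v <- 1:20
qgroup_fast(v)
```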
I would like to write a function using ddply that outputs summary statistics for two columns of the data.frame mat, given their names.
mat is a big data.frame with the name of columns "metric", "length", "species", "tree", ...,"index"
index is factor with 2 levels "Short", "Long"
"metric", "length", "species", "tree" and others are all continuous variables
Function:
summary1 <- function(arg1,arg2) {
...
ss <- ddply(mat, .(index), function(X) data.frame(
arg1 = as.list(summary(X$arg1)),
arg2 = as.list(summary(X$arg2)),
.parallel = FALSE)
ss
}
I expect the output to look like this after calling summary1("metric","length")
Short metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.
....
Long metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.
....
At the moment the function does not produce the desired output. What modification should be made here?
Thanks for your help.
Here is a toy example
mat <- data.frame(
metric = rpois(10,10), length = rpois(10,10), species = rpois(10,10),
tree = rpois(10,10), index = c(rep("Short",5),rep("Long",5))
)
As Nick wrote in his answer, you can't use $ to reference a variable whose name is passed as a character string. When you write X$arg1, R searches for a column literally named "arg1" in the data.frame X. You can refer to the column either by X[,arg1] or X[[arg1]].
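A tiny illustration of the difference, on a made-up one-column data frame:

```r
X <- data.frame(metric = c(1, 2, 3))
arg1 <- "metric"

# $ looks for a column literally called "arg1", which does not exist
by.dollar <- X$arg1

# [[ evaluates arg1 first and then looks up the column "metric"
by.brackets <- X[[arg1]]

is.null(by.dollar)
by.brackets
```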
And if you want nicely named output, I propose the solution below:
summary1 <- function(arg1, arg2) {
ss <- ddply(mat, .(index), function(X) data.frame(
setNames(
list(as.list(summary(X[[arg1]])), as.list(summary(X[[arg2]]))),
c(arg1,arg2)
)), .parallel = FALSE)
ss
}
summary1("metric","length")
Output for toy data is:
index metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu.
1 Long 5 7 10 8.6 10
2 Short 7 7 9 8.8 10
metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu.
1 11 9 10 11 10.8 12
2 11 4 9 9 9.0 11
length.Max.
1 12
2 12
Is this more like what you want?
summary1 <- function(arg1,arg2) {
ss <- ddply(mat, .(index), function(X){ data.frame(
arg1 = as.list(summary(X[,arg1])),
arg2 = as.list(summary(X[,arg2])))},
.parallel = FALSE)
ss
}