I've combined the outputs for each user and item (for a recommendation system) into this all x all R data.table. For each row in this table, I need to calculate the correlation between user scores 1,2,3 & item scores 1,2,3 (e.g. for the first row what is the correlation between 0.5,0.6,-0.2 and 0.2,0.8,-0.3) to see how well the user and the item match.
user item user_score_1 user_score_2 user_score_3 item_score_1 item_score_2 item_score_3
A 1 0.5 0.6 -0.2 0.2 0.8 -0.3
A 2 0.5 0.6 -0.2 0.4 0.1 -0.8
A 3 0.5 0.6 -0.2 -0.2 -0.4 -0.1
B 1 -0.6 -0.1 0.9 0.2 0.8 -0.3
B 2 -0.6 -0.1 0.9 0.4 0.1 -0.8
B 3 -0.6 -0.1 0.9 -0.2 -0.4 -0.1
I have a solution that works - which is:
scoresDT[, cor(c(user_score_1,user_score_2,user_score_3), c(item_score_1,item_score_2,item_score_3)), by= .(user, item)]
...where scoresDT is my data.table.
This is all well and good, and it works...but I can't get it to work with dynamic variables instead of hard-coding the variable names.
Normally in a data.frame I could create a list of names and just pass that in, but since it's in character format, the data.table doesn't like it. I've tried using a list with "with=FALSE" and have had some success with basic subsetting of the data.table, but not with the correlation syntax that I need...
Any help is much, much appreciated!
Thanks,
Andrew
Here's what I would do:
mDT = melt(scoresDT,
           id.vars = c("user", "item"),
           measure.vars = patterns("item_score_", "user_score_"),
           value.name = c("item_score", "user_score")
)
mDT[, cor(item_score, user_score), by=.(user,item)]
user item V1
1: A 1 0.8955742
2: A 2 0.9367659
3: A 3 -0.8260332
4: B 1 -0.6141324
5: B 2 -0.9958706
6: B 3 0.5000000
I'd keep the data in its molten/long form, which fits more naturally with R and data.table functionality.
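That said, since the question specifically asked about dynamic variable names, here's a minimal sketch of how the hard-coded version could be driven by character vectors using mget() inside j (the toy data below just reconstructs the example table):

```r
library(data.table)

# reconstruct the example table from the question
scoresDT <- data.table(
  user = rep(c("A", "B"), each = 3), item = rep(1:3, 2),
  user_score_1 = rep(c(0.5, -0.6), each = 3),
  user_score_2 = rep(c(0.6, -0.1), each = 3),
  user_score_3 = rep(c(-0.2, 0.9), each = 3),
  item_score_1 = rep(c(0.2, 0.4, -0.2), 2),
  item_score_2 = rep(c(0.8, 0.1, -0.4), 2),
  item_score_3 = rep(c(-0.3, -0.8, -0.1), 2)
)

# the column names are now plain character vectors...
user_cols <- paste0("user_score_", 1:3)
item_cols <- paste0("item_score_", 1:3)

# ...and mget() looks them up inside j, so nothing is hard coded
result <- scoresDT[, cor(unlist(mget(user_cols)), unlist(mget(item_cols))),
                   by = .(user, item)]
```

The first row of `result` matches the hard-coded version (0.8955742 for user A, item 1).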
Related
I am using the 'diamonds' dataset from ggplot2 and want to find the average of the 'carat' column. However, I want the average for every 0.1-wide interval:
Between
0.2 and 0.29
0.3 and 0.39
0.4 and 0.49
etc.
You can use the aggregate function to take the mean by group, where the group is calculated with carat %/% 0.1:
library(ggplot2)
averageBy <- 0.1
aggregate(diamonds$carat, list(diamonds$carat %/% averageBy * averageBy), mean)
This gives the mean for each 0.1-wide bin:
Group.1 x
1 0.2 0.2830764
2 0.3 0.3355529
3 0.4 0.4181711
4 0.5 0.5341423
5 0.6 0.6821408
6 0.7 0.7327491
...
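The same binning also works with base R's tapply; here is a small sketch on made-up carat values (assumed data, not the real diamonds column):

```r
# hypothetical carat values standing in for diamonds$carat
carat <- c(0.23, 0.21, 0.29, 0.31, 0.35, 0.42)
averageBy <- 0.1

# floor each value to its 0.1-wide bin, then average within each bin
binned <- tapply(carat, carat %/% averageBy * averageBy, mean)
```

Here `binned` has one entry per occupied bin (0.2, 0.3, 0.4), same grouping logic as the aggregate call above.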
I have a big data frame (104029 x 142).
I want to filter for rows where the value is > 0 in any of several specific columns.
df
word abrasive abrasives abrasivefree abrasion slurry solute solution ....
1 composition -0.2 0.2 -0.3 -0.40 0.2 0.1 0.20 ....
2 ceria 0.1 0.2 -0.4 -0.20 -0.1 -0.2 0.20 ....
3 diamond 0.3 -0.5 -0.6 -0.10 -0.1 -0.2 -0.15 ....
4 acid -0.1 -0.1 -0.2 -0.15 0.1 0.3 0.20 ....
....
I have tried to use the filter() function, and it works.
But this approach is not efficient for me: because I need to spell out every column name, the process becomes hard to maintain.
column_names <- c("agent", "agents", "liquid", "liquids", "slurry",
"solute", "solutes", "solution", "solutions")
df_filter <- filter(df, agent>0 | agents>0 | liquid>0 | liquids>0 | slurry>0 | solute>0 |
                    solutes>0 | solution>0 | solutions>0)
df_filter
word abrasive abrasives abrasivefree abrasion slurry solute solution ....
1 composition -0.2 0.2 -0.3 -0.40 0.2 0.1 0.20 ....
2 ceria 0.1 0.2 -0.4 -0.20 -0.1 -0.2 0.20 ....
4 acid -0.1 -0.1 -0.2 -0.15 0.1 0.3 0.20 ....
....
Is there a more efficient way to do this?
This line will return a vector of TRUE/FALSE indicating which rows meet the condition:
filter_condition <- apply(df[, column_names], 1, function(x) sum(x > 0)) > 0
Then you can use
df[filter_condition, ]
I'm sure there is something nicer in dplyr.
Use dplyr::filter_at(), which allows you to use select()-style helpers to choose the columns to test:
library(dplyr)
df_filter <- df %>%
filter_at(
# select all the columns that are in your column_names vector
vars(one_of(column_names))
# if any of those variables are greater than zero, keep the row
, any_vars( . > 0)
)
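In recent dplyr versions filter_at() has been superseded; if I'm not mistaken, the current idiom uses if_any() with all_of(). A sketch on a toy frame (the real df and column_names come from the question):

```r
library(dplyr)

# toy frame standing in for the larger df in the question
df <- data.frame(word   = c("a", "b", "c"),
                 slurry = c(0.2, -0.1, -0.3),
                 solute = c(-0.5, -0.2, 0.4))
column_names <- c("slurry", "solute")

# keep rows where any of the named columns is greater than zero
df_filter <- df %>%
  filter(if_any(all_of(column_names), ~ .x > 0))
```

Rows "a" and "c" survive (each has at least one positive value in the named columns); row "b" is dropped.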
I am trying to meta-analyze p-values from different studies. I have a data frame
DF1
p-value1 p-value2 p-value3 m
0.1 0.2 0.3 a
0.2 0.3 0.4 b
0.3 0.4 0.5 c
0.4 0.4 0.5 a
0.6 0.7 0.9 b
0.6 0.7 0.3 c
I am trying to get a fourth column of meta-analyzed values from p-value1 to p-value3.
I tried to use the metap package:
p <- rbind(DF1$`p-value1`, DF1$`p-value2`, DF1$`p-value3`)
pv <- split(p, p$m)
library(metap)
for (i in 1:length(pv)) {
  pvalue <- sumlog(pv[[i]]$pvalue)
}
But it results in only one p-value. Thank you for any help.
You can try applying sumlog() across the rows. Note that sumlog() returns an object, so extract the combined p-value with $p:
apply(DF1[, 1:3], 1, function(x) sumlog(x)$p)
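sumlog() implements Fisher's method, so the same combination can also be sketched dependency-free with base R's pchisq (DF1 below is a toy stand-in built from the question's first two rows):

```r
# Fisher's method: -2 * sum(log(p)) follows a chi-squared distribution
# with 2k degrees of freedom under the null (k = number of p-values)
fisher_combine <- function(p) {
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}

# toy stand-in for DF1 (column names simplified for illustration)
DF1 <- data.frame(p1 = c(0.1, 0.2), p2 = c(0.2, 0.3), p3 = c(0.3, 0.4))
combined <- apply(DF1[, 1:3], 1, fisher_combine)
```

Each row gets its own combined p-value, which is the per-row result the question asks for.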
I want to calculate a correlation score between two sets of numbers, but these numbers are within each row.
The background is that I'm building a recommender system, using PCA to give me scores for each user and each item against each derived feature (1, 2, 3 in this case).
user item user_score_1 user_score_2 user_score_3 item_score_1 item_score_2 item_score_3
A 1 0.5 0.6 -0.2 0.2 0.8 -0.3
A 2 0.5 0.6 -0.2 0.4 0.1 -0.8
A 3 0.5 0.6 -0.2 -0.2 -0.4 -0.1
B 1 -0.6 -0.1 0.9 0.2 0.8 -0.3
B 2 -0.6 -0.1 0.9 0.4 0.1 -0.8
B 3 -0.6 -0.1 0.9 -0.2 -0.4 -0.1
I've combined the outputs for each user and item into this all x all table. For each row in this table, I need to calculate the correlation between user scores 1,2,3 & item scores 1,2,3 (e.g. for the first row what is the correlation between 0.5,0.6,-0.2 and 0.2,0.8,-0.3) to see how well the user and the item match.
The other alternative would be to do the correlation before I join the users & items into an all x all dataset, but I'm not sure how best to do that either.
I don't think I can transpose the table as in reality the users and items total is very large.
Any thoughts on a good approach?
Thanks,
Andrew
This question already has answers here:
How do you create vectors with specific intervals in R?
(3 answers)
Closed 8 years ago.
I can't find how to tell R that I want to make a kind of "continuous" vector.
I want something like x<-c(-1,1) to give me a vector of n values with a specific interval (e.g, 0.1 or anything I want) so I can generate a "continuous" vector such as
x
[1] -1.0 -0.9 -0.8 -0.7 ... 1.0
I know this should be basic but I can't find my way to the solution.
Thank you
It sounds like you're looking for seq:
seq(-1, 1, by = .1)
# [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4
# [16] 0.5 0.6 0.7 0.8 0.9 1.0
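If you care about the number of points rather than the step size, seq() also takes a length.out argument; 21 points over [-1, 1] gives the same 0.1 step:

```r
# 21 evenly spaced points from -1 to 1, implying a step of 0.1
x <- seq(-1, 1, length.out = 21)
```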