horizontal correlation across variables in an R data frame

I want to calculate a correlation score between two sets of numbers, but these numbers sit within each row.
The background is that I'm building a recommender system, using PCA to give me a score for each user and each item against each derived feature (1, 2, 3 in this case):
user item user_score_1 user_score_2 user_score_3 item_score_1 item_score_2 item_score_3
A    1    0.5          0.6         -0.2          0.2          0.8         -0.3
A    2    0.5          0.6         -0.2          0.4          0.1         -0.8
A    3    0.5          0.6         -0.2         -0.2         -0.4         -0.1
B    1   -0.6         -0.1          0.9          0.2          0.8         -0.3
B    2   -0.6         -0.1          0.9          0.4          0.1         -0.8
B    3   -0.6         -0.1          0.9         -0.2         -0.4         -0.1
I've combined the outputs for each user and item into this all x all table. For each row in this table, I need to calculate the correlation between user scores 1,2,3 & item scores 1,2,3 (e.g. for the first row what is the correlation between 0.5,0.6,-0.2 and 0.2,0.8,-0.3) to see how well the user and the item match.
The alternative would be to compute the correlation before I join the users & items into an all x all dataset, but I'm not sure how best to do that either.
I don't think I can transpose the table as in reality the users and items total is very large.
Any thoughts on a good approach?
Thanks,
Andrew
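For concreteness, here is a minimal base-R sketch of the row-wise calculation described above (column names taken from the table; this is an illustration, not a tuned solution for large data):
# Rebuild the small example table from the question.
scores <- data.frame(
  user = c("A", "A", "A", "B", "B", "B"),
  item = c(1, 2, 3, 1, 2, 3),
  user_score_1 = c(0.5, 0.5, 0.5, -0.6, -0.6, -0.6),
  user_score_2 = c(0.6, 0.6, 0.6, -0.1, -0.1, -0.1),
  user_score_3 = c(-0.2, -0.2, -0.2, 0.9, 0.9, 0.9),
  item_score_1 = c(0.2, 0.4, -0.2, 0.2, 0.4, -0.2),
  item_score_2 = c(0.8, 0.1, -0.4, 0.8, 0.1, -0.4),
  item_score_3 = c(-0.3, -0.8, -0.1, -0.3, -0.8, -0.1)
)
user_cols <- paste0("user_score_", 1:3)
item_cols <- paste0("item_score_", 1:3)
# For each row, correlate the user scores with the item scores.
scores$match <- sapply(seq_len(nrow(scores)), function(i)
  cor(unlist(scores[i, user_cols]), unlist(scores[i, item_cols]))
)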

Related

Show what the calculated bins breaks are in a histogram

It is my understanding that when plotting a histogram, it's not that every unique data point gets its own bin; an algorithm calculates how many bins to use. How do I find out how the data were partitioned to create the bins, e.g. 0-5, 6-10, ...? How do I get R to show me where the breaks are via text output?
I've found various methods to calculate the number of bins, but that's just theory.
I think you need to use $breaks:
set.seed(10)
hist(rnorm(200, 0, 1), breaks = 20)$breaks
[1] -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4
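Note that hist() also draws the plot as a side effect; if you only want the computed breaks, pass plot = FALSE and the histogram object is returned without plotting:
set.seed(10)
x <- rnorm(200, 0, 1)
# Compute the histogram without drawing it, then inspect the bin edges.
h <- hist(x, breaks = 20, plot = FALSE)
h$breaks  # bin edges chosen by hist()
h$counts  # number of observations per bin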

efficient filter rows by multi column patterns

I have a big data frame (104029 x 142).
I want to keep the rows where the value is greater than 0 in any of several specific columns.
df
  word        abrasive abrasives abrasivefree abrasion slurry solute solution ....
1 composition     -0.2       0.2         -0.3    -0.40    0.2    0.1     0.20 ....
2 ceria            0.1       0.2         -0.4    -0.20   -0.1   -0.2     0.20 ....
3 diamond          0.3      -0.5         -0.6    -0.10   -0.1   -0.2    -0.15 ....
4 acid            -0.1      -0.1         -0.2    -0.15    0.1    0.3     0.20 ....
....
I have tried using the filter() function, and it works. But this approach is not efficient: because I need to spell out each column name, the code becomes hard to maintain.
column_names <- c("agent", "agents", "liquid", "liquids", "slurry",
                  "solute", "solutes", "solution", "solutions")
df_filter <- filter(df, agent > 0 | agents > 0 | liquid > 0 | liquids > 0 |
                        slurry > 0 | solute > 0 | solutes > 0 |
                        solution > 0 | solutions > 0)
df_filter
  word        abrasive abrasives abrasivefree abrasion slurry solute solution ....
1 composition     -0.2       0.2         -0.3    -0.40    0.2    0.1     0.20 ....
2 ceria            0.1       0.2         -0.4    -0.20   -0.1   -0.2     0.20 ....
4 acid            -0.1      -0.1         -0.2    -0.15    0.1    0.3     0.20 ....
....
Is there a more efficient way to do this?
This line will return a logical vector of TRUE/FALSE for the condition you are testing:
filter_condition <- apply(df[, column_names], 1, function(x) sum(x > 0)) > 0
Then you can use
df[filter_condition, ]
I'm sure there is something nicer in dplyr.
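(A base-R aside: the same logical index can also be built without apply() by comparing the whole block at once with rowSums(), which is usually faster on a data frame of this size. A sketch under the same column_names assumption:)
# TRUE where at least one of the named columns is positive.
filter_condition <- rowSums(df[, column_names] > 0) > 0
df[filter_condition, ]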
Use dplyr::filter_at(), which allows you to use select()-style helpers to pick the columns:
library(dplyr)
df_filter <- df %>%
  filter_at(
    # select all the columns that are in your column_names vector
    vars(one_of(column_names)),
    # if any of those variables are greater than zero, keep the row
    any_vars(. > 0)
  )
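Note: in current dplyr (1.0.4 and later), filter_at() and one_of() are superseded; the same filter can be written with if_any() and all_of():
library(dplyr)
# Keep rows where any of the listed columns is greater than zero.
df_filter <- df %>%
  filter(if_any(all_of(column_names), ~ .x > 0))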

How to meta analyze p values of different observations

I am trying to meta-analyze p-values from different studies. I have a data frame:
DF1
p-value1 p-value2 p-value3 m
     0.1      0.2      0.3 a
     0.2      0.3      0.4 b
     0.3      0.4      0.5 c
     0.4      0.4      0.5 a
     0.6      0.7      0.9 b
     0.6      0.7      0.3 c
I am trying to get a fourth column of meta-analyzed p-values, combining p-value1 to p-value3.
I tried to use the metap package:
p <- rbind(DF1$`p-value1`, DF1$`p-value2`, DF1$`p-value3`)
pv <- split(p, p$m)
library(metap)
for (i in 1:length(pv)) {
  pvalue <- sumlog(pv[[i]]$pvalue)
}
But it results in one p value. Thank you for any help.
You can try
apply(DF1[,1:3], 1, sumlog)
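Note that sumlog() returns a full result object (chi-square statistic, degrees of freedom, combined p), so the apply() call above yields a list of those objects rather than a numeric vector. To attach just the combined p-value as a new column, extract the p component, for example:
library(metap)
# Fisher's method (sumlog) per row, keeping only the combined p-value.
DF1$meta_p <- apply(DF1[, 1:3], 1, function(x) sumlog(x)$p)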

dynamic column names in data.table correlation

I've combined the outputs for each user and item (for a recommendation system) into this all x all R data.table. For each row in this table, I need to calculate the correlation between user scores 1,2,3 & item scores 1,2,3 (e.g. for the first row what is the correlation between 0.5,0.6,-0.2 and 0.2,0.8,-0.3) to see how well the user and the item match.
user item user_score_1 user_score_2 user_score_3 item_score_1 item_score_2 item_score_3
A    1    0.5          0.6         -0.2          0.2          0.8         -0.3
A    2    0.5          0.6         -0.2          0.4          0.1         -0.8
A    3    0.5          0.6         -0.2         -0.2         -0.4         -0.1
B    1   -0.6         -0.1          0.9          0.2          0.8         -0.3
B    2   -0.6         -0.1          0.9          0.4          0.1         -0.8
B    3   -0.6         -0.1          0.9         -0.2         -0.4         -0.1
I have a solution that works - which is:
scoresDT[, cor(c(user_score_1,user_score_2,user_score_3), c(item_score_1,item_score_2,item_score_3)), by= .(user, item)]
...where scoresDT is my data.table.
This is all well and good, and it works...but I can't get it to work with dynamic variables instead of hard coding in the variable names.
Normally with a data.frame I could create a list of names and just pass that in, but since it's in character format, data.table doesn't like it. I've tried using a list with with=FALSE and had some success with basic subsetting of the data.table, but not with the correlation syntax I need...
Any help is much, much appreciated!
Thanks,
Andrew
Here's what I would do:
mDT = melt(scoresDT,
id.vars = c("user","item"),
measure.vars = patterns("item_score_", "user_score_"),
value.name = c("item_score", "user_score")
)
mDT[, cor(item_score, user_score), by=.(user,item)]
user item V1
1: A 1 0.8955742
2: A 2 0.9367659
3: A 3 -0.8260332
4: B 1 -0.6141324
5: B 2 -0.9958706
6: B 3 0.5000000
I'd keep the data in its molten/long form, which fits more naturally with R and data.table functionality.
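If you'd rather keep the wide table and drive the calculation from character vectors of column names, note that data.table evaluates j inside the table, so mget() can look the score columns up by name. A minimal sketch of that alternative (user_cols and item_cols are assumed to be built dynamically):
user_cols <- paste0("user_score_", 1:3)
item_cols <- paste0("item_score_", 1:3)
# mget() resolves the character vectors to columns inside j;
# each by-group is a single row, so unlist() gives the 3 scores.
scoresDT[, cor(unlist(mget(user_cols)), unlist(mget(item_cols))),
         by = .(user, item)]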

Define interval to create a continuous vector [duplicate]

This question already has answers here: How do you create vectors with specific intervals in R?
I can't find how to tell R that I want to make a kind of "continuous" vector.
I want something like x <- c(-1, 1) to give me a vector of n values with a specific interval (e.g., 0.1 or anything I want), so I can generate a "continuous" vector such as
x
[1] -1.0 -0.9 -0.8 -0.7 ... 1.0
I know this should be basic but I can't find my way to the solution.
Thank you
It sounds like you're looking for seq:
seq(-1, 1, by = .1)
# [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4
# [16] 0.5 0.6 0.7 0.8 0.9 1.0
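Since the question asks for "a vector of n values", note that seq() can also be given the desired length instead of the step:
# Same endpoints, specifying the number of values instead of the step size:
seq(-1, 1, length.out = 21)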
