I am not very good at R, and I have a problem. I want to run a linear regression between two variables from different datasets, but I run into the problem that one dataset is much bigger than the other. So, to get around that, I want to create a smaller variable of equal length, randomly selected from the larger dataset's variable. What is the command for that? And if any further specification is needed, please let me know! Thank you so much for your help!
I tried to make a linear regression out of the two datasets, but as one is bigger than the other, it did not work, and this error appeared:
Error in model.frame.default(formula = lobby_expenditure$expend ~ compustat$lct, :
variable lengths differ (found for 'compustat$lct')
Here is a simple example; y comes from d2, and a sample of rows from d1 is selected for x:
d1 <- data.frame(x = rnorm(100))  # the larger dataset
d2 <- data.frame(y = rnorm(10))   # the smaller dataset
# sample as many rows from d1 as d2 has, then regress
lm(d2$y ~ d1[sample(1:nrow(d1), nrow(d2)), "x"])
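Since the rows are drawn at random, fixing a seed makes the result reproducible. A minimal variation of the same idea (the names d1s and fit are just illustrative):
set.seed(42)  # any fixed seed, purely for reproducibility
d1s <- d1[sample(nrow(d1), nrow(d2)), , drop = FALSE]  # keep it a data.frame
fit <- lm(d2$y ~ d1s$x)
summary(fit)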
To get random sample rows, use dplyr::sample_n.
Example dataset:
library(readr)  # read_table() comes from the readr package
df2 <- read_table('Individual Site
1 A
2 B
3 A
4 C
5 C
6 B
7 A
8 B
9 C')
With sample_n(df2, 2), where 2 is the number of rows you want, you get random rows. The following output may differ in your case since it's random:
# A tibble: 2 x 2
Individual Site
<dbl> <chr>
1 4 C
2 5 C
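Side note: in dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), so the equivalent call would be:
library(dplyr)
slice_sample(df2, n = 2)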
I was originally running a PCA to reduce a large number of correlated measures (>10 behaviours) down to fewer variables (in PCA I used the first and second principal components). But this is not appropriate (similar situation to this OP) because we have repeated measures from the same individuals over time (Budaev 2010, pg. 6: "Using multiple measures from the same individuals as independent while computing the correlation matrix is pseudoreplication and is incorrect."). Because of this, it is recommended I use a PARAFAC model instead of PCA to do this (available through the PTAk package in R) - see Leibovici (2010) for details.
My data is stored as a data.frame object, where each row is for one individual; individuals can be sampled multiple times in a year and across their lifetimes.
Sample of my data (data available here):
individual beh1 beh2 beh3 beh4 year
11979 0 0.0333 0 0 2014
12026 0.176 0.0882 0.441 0.0882 2014
12435 0.405 0.189 0 0.243 2014
12524 0 0 1 0 2014
12625 0 0 0 0 2014
12678 0 0 0 0 2014
To use the PTAk package, the data needs to be converted into an array. The code to do this is:
my_df <- array(as.vector(as.matrix(subset_data)), c(x, y, z))
where x is the number of rows, y is the number of columns, and z is the number of arrays.
My general question:
Which components of my data.frame should correspond to which measures in the array?
My initial guess would be that x should correspond to the number of individuals sampled (i.e., the number of rows in the original data.frame), but I am not sure what the y and z components should be.
Like this:
my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 4, 9))
where x is 5393 individuals, y is the number of variables (e.g., 4 behaviours), and z is the number of years (9 years).
This generates 9 arrays with each individual’s record as the rows, and each variable as a column (identifier, 4 behaviours, and the year of sampling). In theory each array would correspond to a certain year of sampling, but that is currently not the case.
My question in detail:
If this is the correct formatting for my array, how do I ensure that only one year of sampling data is included in each array (i.e., only samples from 2008 are in array 1, only 2009 in array 2, etc.)?
Alternatively, if my formatting is wrong, what is the correct array format for my data and question?
For example, should I group the data into arrays according to the behaviour (beh1, beh2, etc.), so the code looks like:
my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 3, 4))
where there would be three columns per array corresponding to the identifier, value for the behaviour, and year of observation? If this is the proper formatting, how would I ensure that the arrays are divided based on the behaviours rather than the identifier and/or year columns?
First of all, in your subset_data the variables individual and year need to be discarded (or moved into the rownames), as they are just identifiers; otherwise, in your as.vector(subset_data) they would get mixed up with the data. So use as.vector(subset_data[, -c(1, 6)]) (columns 1 and 6 hold individual and year in the sample you showed).
Then, look at the little example below:
A <- matrix(1:6, 2, 3)
as.vector(A) is
[1] 1 2 3 4 5 6
So imagine 2 individuals and 3 behaviours: that works!
In building A, the first dimension, dim(A)[1] (2), runs faster than dim(A)[2] (3), and this extends to arrays.
So now imagine you have 4 years; X[,,1] is your first year, A:
X <- array(0, c(2, 3, 4)); X[,,1] <- A
X[,,2] <- A*2; X[,,3] <- A*10; X[,,4] <- A/10
Note this could be a way of building your my_df:
my_df[,,1] <- subset_data[subset_data[, 6] == 2014, -c(1, 6)], etc.
My point was: as.vector(X) is then
1 2 3 4 5 6 2 4 6 8 10 12 ...
so the first year, then the second year, etc.
So to come back to (or in fact start from) a matrix of individual x variable,
you'll need to permute the data: AA <- matrix(aperm(X, c(1, 3, 2)), 8, 3)
Basically, 8 is 2 individuals times 4 years, with 3 variables.
So if you start with that matrix AA, your array will be array(AA, dim = c(2, 4, 3)), i.e. individual x year x variable.
So with:
AA <- as.matrix(subset_data[, -c(1, 6)])
you'll need to say array(AA, dim = c(nb_indi_repeated, 9, 4)) for 9 years and 4 variables... but 5393/9 suggests you do not have full, exact repetition for all individuals. So you'll need either to select the 'best sample' of the repeated individuals to define the years and the selected individuals, or to estimate the missing values, or to do something completely different! That could be defining a repetition not from years but from the series of repeated measures, the next one being either in the same year or later.
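To make the year-slicing concrete, here is a minimal sketch, assuming subset_data has the columns individual, beh1..beh4, year as in the sample shown; it fills an individual x behaviour x year array one year at a time and leaves NA where an individual was not sampled (the names years, ids and X are illustrative):
years <- sort(unique(subset_data$year))
ids   <- sort(unique(subset_data$individual))
X <- array(NA_real_, dim = c(length(ids), 4, length(years)),
           dimnames = list(as.character(ids), paste0("beh", 1:4), as.character(years)))
for (k in seq_along(years)) {
  yr <- subset_data[subset_data$year == years[k], ]        # one year's rows
  X[match(yr$individual, ids), , k] <- as.matrix(yr[, paste0("beh", 1:4)])
}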
First, I can't understand the aggregate function and cbind; I need an explanation in really simple words. Second, I have this data:
permno number mean std
1 10107 120 0.0117174000 0.06802718
2 11850 120 0.0024398083 0.04594591
3 12060 120 0.0005072167 0.08544500
4 12490 120 0.0063569167 0.05325215
5 14593 120 0.0200060583 0.08865493
6 19561 120 0.0154743500 0.07771348
7 25785 120 0.0184815583 0.16510082
8 27983 120 0.0025951333 0.09538822
9 55976 120 0.0092889000 0.04812975
10 59328 120 0.0098526167 0.07135423
I need to process this with
data_processed2 <- aggregate(cbind(return) ~ permno, Data_summary, median)
I can't understand this command; please explain it to me very simply. Thank you!
cbind takes two or more tables (data frames), puts them side by side, and then makes them into one big table. So, for example, if you have one table with columns A, B and C, and another with columns D and E, after you cbind them you'll have one table with five columns: A, B, C, D and E. For the rows, cbind assumes all tables are in the same order.
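For instance (toy data; the names left and right are just illustrative):
left  <- data.frame(A = 1:2, B = 3:4, C = 5:6)
right <- data.frame(D = 7:8, E = 9:10)
cbind(left, right)  # one table with columns A, B, C, D, E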
As noted by Rui, in your example cbind doesn't do anything, because return is not a table, and even if it were, it's only one thing.
aggregate takes a table, divides it by some variable, and then calculates a statistic on a variable within each group. For example, if I have data for sales by month and day of month, I can aggregate by month and calculate the average sales per day for each of the months.
The command you provided uses the following syntax:
aggregate(VARIABLES~GROUPING, DATA, FUNCTION)
Variables (cbind(return), which doesn't really make sense here) is the list of all the variables for which your statistic will be calculated.
Grouping (permno) is the variable by which you will break the data into groups (in the sample data you provided, each row has a unique value for this variable, so that doesn't really make sense either).
Data is the dataframe you're using.
Function is median.
So this call will break Data_summary into groups that have the same permno, and calculate the median for each of the columns.
With the data you provided, you'll basically get the same table back, since you're grouping the data into groups of one row each... Actually, since your variable list names a column (return) that doesn't exist in your data, as far as I can tell, you'll get nothing back.
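To make the sales illustration above concrete, here is a tiny self-contained example (toy data, not the OP's):
sales <- data.frame(month  = rep(c("Jan", "Feb"), each = 3),
                    amount = c(10, 12, 8, 20, 18, 22))
aggregate(amount ~ month, data = sales, FUN = mean)
#   month amount
# 1   Feb     20
# 2   Jan     10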
I have a data frame with 11 columns, out of which 9 are numeric. I am trying to find the correlation of 8 columns together against the remaining column, i.e., the correlation of 8 variables with 1 variable, which should generate one correlation value instead of generating 9 different values in a matrix.
Is that possible? Or do I need to calculate the average correlation after calculating the individual correlations? E.g., I am trying to find the correlation of X, Y and Z with A. Using the mentioned methods, I get a matrix which gives me an individual association score for each of X, Y and Z with A, whereas I need one score which takes into account all three of X, Y and Z.
A simulated df is presented below for illustration purposes:
x y z a
1 1.72480753 0.007053053 0.32435032 10
2 0.97227885 -0.844118498 -0.75534119 20
3 -0.53844294 -0.036178789 0.89396765 30
4 1.34695331 0.870119744 0.99400826 40
5 0.02336335 0.514481676 0.95894286 50
6 -0.15239307 0.386061290 0.73541287 60
7 -0.29878116 1.615012645 -0.04416341 70
8 -1.10907706 -1.581093487 -0.93293702 80
9 2.73021114 -0.130141775 1.85304372 90
10 0.22417487 1.170900385 -0.68312974 100
I can compute the correlation of each individual variable with a, but what I want is the correlation of x, y and z combined with a:
corr.test(df[, 1:3], df[, 4])  # corr.test() is from the psych package
I would appreciate any help with this problem.
Regards,
Pearson correlation is defined as a number relating one sequence (or vector) of values to another (look it up). As far as I know, there is no roughly equivalent definition for relating a group of vectors to another vector, but you could do something like take the average vector (of the 3 vectors) and correlate a with that.
To me, at least, that has a more immediate geometric meaning than taking the average of the 3 correlation values.
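A minimal sketch of that suggestion, using the simulated df above: average the three columns row-wise, then correlate a with that average.
cor(df$a, rowMeans(df[, c("x", "y", "z")]))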
If you want to compute the correlation of each variable with a, you could do something like:
head(cor(df)[,"a"], -1)
# x y z
# -0.14301569 0.19188340 -0.06561505
You said you wanted to combine these values by averaging, so I suppose you could just take the mean of that:
mean(head(cor(df)[,"a"], -1))
# [1] -0.005582445
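For completeness, one other standard single-number summary (not suggested in the answers above) is the multiple correlation coefficient: the correlation between a and its best linear predictor from x, y and z, i.e. the square root of R^2 from a linear model:
sqrt(summary(lm(a ~ x + y + z, data = df))$r.squared)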
I have sample data below, taken from a larger data set, where each participant is given multiple conditions for scoring.
Participant<-c("p1","p1","p2","p2","p3","p3")
Condition<-c( "c1","c2","c1","c2","c1","c2")
Score<-c(4,5, 5,7,8,2)
T<-data.frame(Participant, Condition, Score)
I am trying to use k-means clustering to split participants into different groups. Is there any good way to do it, considering that the condition is not numeric?
Thanks!
@Anony has the right idea. You actually do have numeric data: there is (evidently) a c1 score and a c2 score for each participant. So you need to convert your data from "long" format (data in a single column, Score, with a second column, Condition, differentiating the scores) to "wide" format (scores under different conditions in separate columns). Then you can run k-means clustering on the scores to group the participants.
Here is how you would do that in R, using a slightly larger example to demonstrate the clusters.
# example with 100 Participants in 3 clusters
set.seed(1) # for reproducible example
T <- data.frame(Participant=rep(paste0("p",sprintf("%03i",1:100)),each=2),
Condition =paste0("c",1:2),
Score =c(rpois(70,c(10,25)),rpois(70,c(25,10)),rpois(60,c(15,10))))
head(T)
# Participant Condition Score
# 1 p001 c1 8
# 2 p001 c2 25
# 3 p002 c1 7
# 4 p002 c2 27
# 5 p003 c1 14
# 6 p003 c2 28
library(reshape2) # for dcast(...)
# convert from long to wide format
result <- dcast(T,Participant~Condition,value.var="Score")
# k-means on the columns containing scores - look for 3 clusters
result$clust <- kmeans(result[,2:ncol(result)],centers=3)$clust
result[sample(1:100,6),] # just a random sample of 6 rows
# Participant c1 c2 clust
# 12 p012 13 21 1
# 24 p024 7 32 1
# 85 p085 10 6 2
# 43 p043 27 5 3
# 48 p048 29 11 3
# 66 p066 24 17 3
Now we can plot the scores, showing how the participants cluster.
# plot the scores for each Participant, color coded by cluster.
plot(c2~c1,result,col=result$clust, pch=20)
EDIT: Response to OP's comment.
OP wants to know what to do if there is more than one score for a participant/condition. The answer depends on why there are multiple scores. If the replicates are random and have a central tendency, then probably taking the mean is justified, although in theory participants with more replicates should be more heavily weighted.
On the other hand, suppose these are test scores. Then generally (but not always) the scores go up with multiple sittings, so these scores would not be random: there is a trend. In that case it might be more meaningful to take the most recent score.
As a third example, if the scores are used to make a decision based on some policy (such as with the SAT, where most colleges use the highest score), then the most appropriate aggregating function might be max, not mean.
Finally, it might be the case that the number of replicates is in fact an important distinguishing characteristic. In that case you would include not just the scores but also the number of replicates for each participant/condition when clustering. This is relevant in certain kinds of standardized testing under NCLB, where students take the test over and over again until they pass.
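For the simple mean case, reshape2::dcast can do the aggregation in the same step via its fun.aggregate argument; a sketch (swap mean for max or another function as appropriate):
result <- dcast(T, Participant ~ Condition, value.var = "Score",
                fun.aggregate = mean)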
BTW: This type of question (the one in your comment) definitely belongs on https://stats.stackexchange.com/.
You should pivot your data, so that
each participant is a row
each condition is a column
the scores are your data
Try the reshape2 package.
You have 3 variables which will be used to split your data into groups. Two of them are categorical, which might cause a problem. You can use k-means to split your data into groups, but you will need to make dummies for your categorical data (condition and participant) and scale your continuous variable Score.
Using categorical data in k-means is not optimal because k-means cannot handle it well. The dummies will be highly correlated, which might cause the algorithm to put too much weight on them and produce suboptimal results.
For the reasons above, you can use different techniques, such as hierarchical clustering, or run a PCA on your data (to obtain continuous, uncorrelated data) and then fit a normal k-means model on the PC scores.
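A minimal sketch of that last idea, assuming a numeric participant-by-condition table scores_wide (for example, the score columns from the dcast step in the earlier answer; the names pca and cl are illustrative):
pca <- prcomp(scores_wide, scale. = TRUE)   # continuous, uncorrelated PC scores
cl  <- kmeans(pca$x, centers = 3)$cluster   # k-means on the PC scores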
These links give good answers:
link1
link2
Hope that helps!