This question already has answers here:
Correlation between two dataframes by row
(2 answers)
Closed 2 years ago.
I've got two datasets from the same people and I want to compute a correlation for each person over the two datasets.
Example dataset:
dat1 <- read.table(header=TRUE, text="
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5
5 1 2 1 5
3 1 3 3 4
2 1 3 1 3
4 2 5 5 3
5 1 4 1 2
")
dat2 <- read.table(header=TRUE, text="
ItemY1 ItemY2 ItemY3 ItemY4 ItemY5
4 2 1 1 4
4 3 1 2 5
1 5 3 2 2
5 2 4 4 1
5 1 5 2 1
")
Does anybody know how to compute the correlation rowwise for each person and NOT for the whole two datasets?
Thank you!
One possible solution uses {purrr} to iterate over the rows of both data frames and compute the correlation between each row of dat1 and the matching row of dat2.
library(purrr)
dat1 <- read.table(header=TRUE, text="
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5
5 1 2 1 5
3 1 3 3 4
2 1 3 1 3
4 2 5 5 3
5 1 4 1 2
")
dat2 <- read.table(header=TRUE, text="
ItemY1 ItemY2 ItemY3 ItemY4 ItemY5
4 2 1 1 4
4 3 1 2 5
1 5 3 2 2
5 2 4 4 1
5 1 5 2 1
")
n_person <- nrow(dat1)
cormat <- purrr::map_df(
  .x = setNames(1:n_person, paste0("person_", 1:n_person)),
  .f = ~cor(t(dat1[.x,]), t(dat2[.x,]))
)
cormat
#> # A tibble: 1 x 5
#> person_1[,"1"] person_2[,"2"] person_3[,"3"] person_4[,"4"] person_5[,"5"]
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.917 0.289 -0.330 0.723 0.913
Created on 2020-11-16 by the reprex package (v0.3.0)
Following the post mentioned by @Ravi, we can transpose the data frames and then calculate the correlations. One additional step is to vectorise the cor function, so that only the row-by-row correlations are computed and none that you do not need. Consider something like this:
tp <- function(x) unname(as.data.frame(t(x)))
Vectorize(cor, c("x", "y"))(tp(dat1), tp(dat2))
Output
[1] 0.9169725 0.2886751 -0.3296902 0.7234780 0.9132660
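For comparison, here is a minimal base R sketch of the same per-person calculation. It assumes dat1 and dat2 have the same number of rows, with row i of each belonging to the same person; the name row_cor is only illustrative and does not come from the answers above.

```r
dat1 <- read.table(header = TRUE, text = "
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5
5 1 2 1 5
3 1 3 3 4
2 1 3 1 3
4 2 5 5 3
5 1 4 1 2
")
dat2 <- read.table(header = TRUE, text = "
ItemY1 ItemY2 ItemY3 ItemY4 ItemY5
4 2 1 1 4
4 3 1 2 5
1 5 3 2 2
5 2 4 4 1
5 1 5 2 1
")

# Correlate row i of dat1 with row i of dat2, one person at a time.
# unlist() turns each one-row data frame into a plain numeric vector.
row_cor <- sapply(seq_len(nrow(dat1)), function(i) {
  cor(unlist(dat1[i, ]), unlist(dat2[i, ]))
})
row_cor
```

The values should agree with the output of the two answers above.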
I am trying to create consecutive ID numbers for each distinct study. I found an example dataset in which such an ID number was created in the esid variable:
> dat <- dat.assink2016
> head(dat, 9)
  study esid id      yi     vi pubstatus year deltype
1     1    1  1  0.9066 0.0740         1  4.5 general
2     1    2  2  0.4295 0.0398         1  4.5 general
3     1    3  3  0.2679 0.0481         1  4.5 general
4     1    4  4  0.2078 0.0239         1  4.5 general
5     1    5  5  0.0526 0.0331         1  4.5 general
6     1    6  6 -0.0507 0.0886         1  4.5 general
7     2    1  7  0.5117 0.0115         1  1.5 general
8     2    2  8  0.4738 0.0076         1  1.5 general
9     2    3  9  0.3544 0.0065         1  1.5 general
I would like to create the same for my study. Can anyone show me how to do it?
The key is to group_by study, then use row_number:
library(dplyr)
df %>%
  group_by(study) %>%
  mutate(esid = row_number())
with the example data from @njp:
# A tibble: 9 × 3
# Groups:   study [3]
  study    id  esid
  <dbl> <int> <int>
1     1     1     1
2     1     2     2
3     1     3     3
4     2     4     1
5     2     5     2
6     2     6     3
7     2     7     4
8     3     8     1
9     3     9     2
If the id column is consecutive (i.e. no jumps or repeated values) you could subtract the minimum value of id for each study and add one:
# Example data
df = data.frame(study=c(1,1,1,2,2,2,2,3,3),
id=1:9)
# Calculate minima
min.id = tapply(X=df$id,
INDEX=df$study,
FUN=min)
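# merge this with the data; index by name so that study labels
# that are not 1, 2, 3, ... still pick up the right minimum
df$min.id = as.vector(min.id[as.character(df$study)])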
# Calculate consecutive id as required
df$esid = df$id - df$min.id+1
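If the id column cannot be relied on to be consecutive, a base R sketch with ave() numbers the rows within each study directly. This is an alternative technique rather than part of the answer above, and it assumes the rows are already in the order the IDs should follow.

```r
# Example data (same as above)
df <- data.frame(study = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                 id = 1:9)

# ave() splits the first argument by study, applies FUN to each piece,
# and puts the results back in the original row order.
# seq_along() returns 1, 2, ..., n for a piece of length n.
df$esid <- ave(seq_along(df$study), df$study, FUN = seq_along)
df$esid
```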
I have a dataset, for example like this:
dat1 <- read.table(header=TRUE, text="
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")
I'd like to use the select function to gather the variables that contain Trust AND T1.
dat1 <- dat1 %>%
mutate(Trust_T1 = select(., contains("Trust")))
Does anybody know how to use two arguments there, so that I get the columns that contain Trust AND T1? If I use:
dat1 <- dat1 %>%
  mutate(Trust_T1 = select(., contains("Trust"), contains("T1")))
it gives me the variables that contain EITHER Trust or T1.
best!
If we need both, then use matches with a regex that selects the column names that start (^) with 'Trust' and end ($) with 'T1' (assuming these are the only patterns):
library(dplyr)
dat1 %>%
select(matches("^Trust_.*T1$"))
It is not clear what the mutate is meant to create, since several columns match 'Trust' followed by 'T1'. If the intention is to do some operation on the selected columns, either across or c_across with rowwise can be used (the post does not say which operation is wanted).
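As an illustration of the c_across with rowwise route, here is a minimal sketch that stores one value per row in Trust_T1. The row mean is an assumed operation, since the question does not say what should be computed.

```r
library(dplyr)

dat1 <- read.table(header = TRUE, text = "
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")

# rowwise() makes mutate() work one row at a time;
# c_across() collects the selected columns of that row into a vector.
# mean() is an assumption: swap in whatever summary is actually needed.
out <- dat1 %>%
  rowwise() %>%
  mutate(Trust_T1 = mean(c_across(matches("^Trust_.*T1$")))) %>%
  ungroup()

out$Trust_T1
```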
One solution could be to combine the two selection helpers with &, which keeps only the columns that satisfy both conditions:
library(dplyr)
df %>% select(starts_with('Trust') & contains('_T1'))
#>   Trust_01_T1 Trust_02_T1 Trust_03_T1
#> 1           5           1           2
#> 2           3           1           3
#> 3           2           1           3
#> 4           4           2           5
#> 5           5           1           4
DATA
df <- read.table(text =
"
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
", header =T)
This question already has answers here:
Fill missing dates by group
(3 answers)
Fastest way to add rows for missing time steps?
(4 answers)
Closed 3 years ago.
I have a data frame of ids with number column
df <- read.table(text="
id nr
1 1
2 1
1 2
3 1
1 3
", header=TRUE)
I'd like to create a new data frame from it in which every id is paired with every unique nr from df. As you may notice, id 3 has only nr 1, but not 2 and 3. So the result should be:
result <- read.table(text="
id nr
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
", header=TRUE)
You can use expand.grid as:
library(dplyr)
result <- expand.grid(id = unique(df$id), nr = unique(df$nr)) %>%
arrange(id)
result
id nr
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
We can do:
tidyr::expand(df,id,nr)
# A tibble: 9 x 2
id nr
<int> <int>
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
I am trying to clean my data so that only duplicate values that have an observation in my first sampling period are kept. For instance, if my data frame looks like this:
df <- data.frame(ID = c(1,1,1,2,2,2,3,3,4,4), period = c(1,2,3,1,2,3,2,3,1,3), mass = rnorm(10, 5, 2))
df
ID period mass
1 1 1 3.313674
2 1 2 6.371979
3 1 3 5.449435
4 2 1 4.093022
5 2 2 2.615782
6 2 3 3.622842
7 3 2 4.466666
8 3 3 6.940979
9 4 1 6.226222
10 4 3 4.233397
I would like to keep only the duplicated observations of individuals that were measured during period 1. My new data frame would then look like this:
ID period mass
1 1 1 3.313674
2 1 2 6.371979
3 1 3 5.449435
4 2 1 4.093022
5 2 2 2.615782
6 2 3 3.622842
9 4 1 6.226222
10 4 3 4.233397
Using suggestions on this page (Remove all unique rows) I have tried using the following command, but it leaves in the observations for individual 3 (which was not measured in period 1).
subset(df, duplicated(ID) | duplicated(ID, fromLast=T))
If you want a base solution, the following should work, as well.
> df_new <- df[df$ID %in% df$ID[df$period == 1], ]
> df_new
ID period mass
1 1 1 3.238832
2 1 2 3.428847
3 1 3 1.205347
4 2 1 8.498452
5 2 2 7.523085
6 2 3 3.613678
9 4 1 3.324095
10 4 3 1.932733
You can use dplyr as follows:
library(dplyr)
df %>% group_by(ID) %>% filter(1 %in% period)
#Source: local data frame [8 x 3]
#Groups: ID [3]
# ID period mass
# <dbl> <dbl> <dbl>
#1 1 1 7.622950
#2 1 2 7.960665
#3 1 3 5.045723
#4 2 1 4.366568
#5 2 2 4.400645
#6 2 3 6.088367
#7 4 1 2.282713
#8 4 3 2.461640
I have some survey data that I'd like to reshape to be able to interactively slice and dice using filters. However, I'm stuck on how to reshape the data in traditional ways, and I couldn't figure out the appropriate use of the reshape package. Please help!
The data is as follows: each respondent is in a row, along with the responses to each question. Additional columns hold several demographic variables for the respondent.
ID  Q1  Q2  Q3  …  Q30  Demo1  Demo2  Demo3  Average Score
1   1   2   2   …  2    1      1      1      2.5
2   2   3   1   …  5    1      2      1      2.7
3   4   1   5   …  4    2      3      2      1.6
4   1   5   4   …  3    2      1      2      2.5
5   3   4   4   …  1    1      2      2      1.4
The goal is to reshape the data so that each question/demographic combination appears once, with the average score and the number of respondents for that combination as values.
Question  Demo1  Demo2  Demo3  Average  NumResp
1         1      1      1      3.4      2
1         1      1      2      2.3      5
1         1      1      3      3.1      1
…         …      …      …      …        ...
30        4      5      3      1.3      9
As a part 2 to the question, there are also calculations that change the responses from the 1-5 scale into "positive", "neutral" or "negative". It would be great to add this as a column that shows % of all respondents in that specific demographic that was either one of the three, with all 3 values adding up to 100%.
Q   Sentiment  Demo1  Demo2  Demo3  Average
1   Positive   1      1      1      3.4
1   Neutral    1      1      1      2.3
1   Negative   1      1      1      3.1
…   …          …      …      …
30  Negative   4      5      3      1.3
Any help is greatly appreciated! Would prefer to do this in R, though Python will work too.
With melt we can specify the id variables (grouping) or the measure variables (to collapse to "long"). The argument variable.name allows us to name the new variable created by collapsing the wide columns, and value.name allows us to name the value column. All of this and more is covered in the documentation at ?melt.data.frame.
To create the Sentiment variable we use cut to break the value range of scores into thirds. There is an argument called labels that allows us to choose the names of the new values.
library(reshape2)
m <- melt(df, variable.name="Question", value.name="Average", id=c("Demo1", "Demo2", "Demo3"))
m$Question <- gsub("Q", "", m$Question)
a <- aggregate(Average~., m, mean)
a$Sentiment <- cut(a$Average, seq(1,5,length.out=4), labels=c("Negative", "Neutral", "Positive"), include.lowest=T)
#    Demo1 Demo2 Demo3 Question Average Sentiment
# 1      1     1     1        1       1  Negative
# 2      1     2     1        1       2  Negative
# 3      2     1     2        1       1  Negative
# 4      1     2     2        1       3   Neutral
# 5      2     3     2        1       4  Positive
# 6      1     1     1        2       2  Negative
# 7      1     2     1        2       3   Neutral
# 8      2     1     2        2       5  Positive
# 9      1     2     2        2       4  Positive
# 10     2     3     2        2       1  Negative
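To see where the thirds fall, this small sketch applies the same cut call to the five possible whole-number scores. The vector 1:5 is just illustrative input, not part of the survey data.

```r
# Break points that split the 1-5 range into three equal-width bins:
# 1, 2.33, 3.67, 5
breaks <- seq(1, 5, length.out = 4)

# include.lowest = TRUE makes the first bin closed on the left,
# so a score of exactly 1 is labelled instead of becoming NA
sentiment <- cut(1:5, breaks,
                 labels = c("Negative", "Neutral", "Positive"),
                 include.lowest = TRUE)

data.frame(score = 1:5, sentiment = sentiment)
```

Scores 1 and 2 fall in the lowest bin, 3 in the middle one, and 4 and 5 in the highest, which is consistent with the Sentiment column above.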
Note below that I deleted the "ID" and "Average.Score" columns as they will be recalculated in the process.
Data
df <- read.table(text="
ID Q1 Q2 Q3 Q30 Demo1 Demo2 Demo3 Average.Score
1 1 2 2 2 1 1 1 2.5
2 2 3 1 5 1 2 1 2.7
3 4 1 5 4 2 3 2 1.6
4 1 5 4 3 2 1 2 2.5
5 3 4 4 1 1 2 2 1.4", header=T)
df <- df[,!names(df) %in% c("ID", "Average.Score")]
Assuming that you have a data set like this (convert it to a data.table):
ID Q1 Q2 ... Demo1 Demo2 Demo3
1: 1 7 8 2 7 3
2: 2 3 7 6 10 1
3: 3 6 1 5 5 8
4: 4 5 9 10 1 7
5: 5 10 4 8 4 6
and a dictionary of answer scores (called vals_enc below):
value Question Score
1: 7 1 17
2: 3 1 6
3: 6 1 19
Let's transform the data to have Question, Answer, ID and Demo columns:
d2 <- melt(dt, id.vars=c('ID', 'Demo1', 'Demo2', 'Demo3'), measure.vars=grep('^Q[0-9]+$', colnames(dt), value=TRUE))
d2[, c('Question', 'variable'):=list(substring(variable,2), NULL)]
R> d2
ID Demo1 Demo2 Demo3 value Question
1: 1 2 7 3 7 1
2: 2 6 10 1 3 1
3: 3 5 5 8 6 1
Now let's add scores:
d3 <- merge(d2, vals_enc, by=c('Question', 'value'))
And finally get the average score and the number of respondents for each combination of Question and demographics:
d3[, list(Avg=mean(Score), Number=.N), .(Question,Demo1,Demo2,Demo3)]
Question Demo1 Demo2 Demo3 Avg Number
1: 1 6 10 1 6 1
2: 1 10 1 7 18 1
3: 1 5 5 8 19 1
Note:
for each ID the demographic status is the same, so the number of respondents for each combination of demographics should be the same for every question.
As for part 2 of the question:
do you already have such calculations, or are you looking for them?
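If the recoding still has to be defined, here is a minimal base R sketch of one way to get the percentage shares. The toy data frame long and the cut-offs (1-2 negative, 3 neutral, 4-5 positive) are assumptions for illustration, not taken from the question.

```r
# Toy long-format responses: one row per respondent and question.
# These values are made up for the sketch.
long <- data.frame(
  Question = c(1, 1, 1, 1, 2, 2, 2, 2),
  Demo1    = c(1, 1, 1, 1, 1, 1, 1, 1),
  value    = c(1, 3, 4, 5, 2, 2, 3, 5)
)

# Assumed recoding: (0,2] -> Negative, (2,3] -> Neutral, (3,5] -> Positive
long$Sentiment <- cut(long$value, breaks = c(0, 2, 3, 5),
                      labels = c("Negative", "Neutral", "Positive"))

# Count sentiments per question, then convert each row to percentages;
# margin = 1 makes every row sum to 100
counts <- table(long$Question, long$Sentiment)
shares <- 100 * prop.table(counts, margin = 1)
shares
```

With more demographic columns, the same idea applies after adding them as further arguments to table() or as grouping columns.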