How to combine columns with the same ID using R?

I want to combine V1 and V2 into a single column, keeping the matching ID number, using R. What's the simplest way to go about it?
Below is an example of how I want to combine my data. Hopefully this makes sense; if not, I can try to be clearer. I did try a group by, but I don't know if that's the best way to go about it.
ID V1 V2
1 3 2
2 3 4
3 5 1
3 2 3
4 2 3
4 5 7
4 1 3
This is what I would like it to look like
ID V3
1 3
1 2
2 3
2 4
3 5
3 1
3 2
3 3
4 2
4 3
4 5
4 7
4 1
4 3

Try using pivot_longer with names_to = NULL to remove the unwanted column.
tidyr::pivot_longer(df, V1:V2, values_to = "V3", names_to = NULL)
Output:
# ID V3
# <int> <int>
# 1 1 3
# 2 1 2
# 3 2 3
# 4 2 4
# 5 3 5
# 6 3 1
# 7 3 2
# 8 3 3
# 9 4 2
# 10 4 3
# 11 4 5
# 12 4 7
# 13 4 1
# 14 4 3
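For reference, the same interleaving can be sketched in base R without extra packages. The data frame df below is reconstructed from the table in the question, so its exact column types are an assumption.

```r
# Example data reconstructed from the question (column types are assumed).
df <- data.frame(
  ID = c(1, 2, 3, 3, 4, 4, 4),
  V1 = c(3, 3, 5, 2, 2, 5, 1),
  V2 = c(2, 4, 1, 3, 3, 7, 3)
)

# t() puts each row's (V1, V2) pair into one column of a matrix; c() then
# reads that matrix column by column, which interleaves V1 and V2 row by row.
long <- data.frame(
  ID = rep(df$ID, each = 2),
  V3 = c(t(df[c("V1", "V2")]))
)
long
```

This keeps the row order asked for in the question (V1 then V2 for each original row).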

You may try
library(dplyr)
reshape2::melt(df, "ID") %>% select(ID, value) %>% arrange(ID)
ID value
1 1 3
2 1 2
3 2 3
4 2 4
5 3 5
6 3 2
7 3 1
8 3 3
9 4 2
10 4 5
11 4 1
12 4 3
13 4 7
14 4 3

Related

identify whenever values repeat in r

I have a dataframe like this.
data <- data.frame(Condition = c(1,1,2,3,1,1,2,2,2,3,1,1,2,3,3))
I want to populate a new variable Sequence which identifies whenever Condition starts again from 1.
So the new dataframe would look like this.
Thanks in advance for the help!
data <- data.frame(Condition = c(1,1,2,3,1,1,2,2,2,3,1,1,2,3,3),
Sequence = c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3))
base R
data$Sequence2 <- cumsum(c(TRUE, data$Condition[-1] == 1 & data$Condition[-nrow(data)] != 1))
data
# Condition Sequence Sequence2
# 1 1 1 1
# 2 1 1 1
# 3 2 1 1
# 4 3 1 1
# 5 1 2 2
# 6 1 2 2
# 7 2 2 2
# 8 2 2 2
# 9 2 2 2
# 10 3 2 2
# 11 1 3 3
# 12 1 3 3
# 13 2 3 3
# 14 3 3 3
# 15 3 3 3
dplyr
library(dplyr)
data %>%
  mutate(
    Sequence2 = cumsum(Condition == 1 & lag(Condition != 1, default = TRUE))
  )
# Condition Sequence Sequence2
# 1 1 1 1
# 2 1 1 1
# 3 2 1 1
# 4 3 1 1
# 5 1 2 2
# 6 1 2 2
# 7 2 2 2
# 8 2 2 2
# 9 2 2 2
# 10 3 2 2
# 11 1 3 3
# 12 1 3 3
# 13 2 3 3
# 14 3 3 3
# 15 3 3 3
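Both versions above encode the same rule: a new sequence starts on a row where Condition is 1 and the previous row is not 1. As a minimal sketch, the rule can be wrapped in a small base R helper and checked on runs of 1s of different lengths. The name sequence_id is hypothetical, not part of any package, and the helper follows the dplyr variant, so a vector that does not begin with 1 would start counting at 0.

```r
# Hypothetical helper: number the blocks that begin whenever x returns to 1.
sequence_id <- function(x) {
  prev_not_one <- c(TRUE, head(x, -1) != 1)  # first element has no predecessor
  cumsum(x == 1 & prev_not_one)
}

# Data from the question: every run of 1s has length two.
sequence_id(c(1, 1, 2, 3, 1, 1, 2, 2, 2, 3, 1, 1, 2, 3, 3))

# Made-up vector with runs of 1s of length three, one, and one.
sequence_id(c(1, 1, 1, 2, 3, 1, 2, 1))
```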
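This took a while, but I finally found this solution:
library(dplyr)
# Note: the counter increases on every row where Condition is 1 and the next
# row is also 1, so this relies on each run of 1s having exactly two rows.
data %>%
  group_by(Sequence = cumsum(
    ifelse(Condition == 1, lead(Condition) + 1, Condition)
    - Condition == 1)
  )
Condition Sequence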
<dbl> <int>
1 1 1
2 1 1
3 2 1
4 3 1
5 1 2
6 1 2
7 2 2
8 2 2
9 2 2
10 3 2
11 1 3
12 1 3
13 2 3
14 3 3
15 3 3

Convert a small dataset written in SPSS to CSV

I have a small dataset written in SPSS syntax which comes from Table 5.3 p. 189 of this book (type 210 in the page slot to see the table).
I was wondering if there might be a way to convert this data to a .csv file? (I want to use the data in R afterwards.)
# SPSS Code:
DATA LIST FREE/gpid anx socskls assert.
BEGIN DATA.
1 5 3 3 1 5 4 3 1 4 5 4 1 4 5 4
1 3 5 5 1 4 5 4 1 4 5 5 1 4 4 4
1 5 4 3 1 5 4 3 1 4 4 4
2 6 2 1 2 6 2 2 2 5 2 3 2 6 2 2
2 4 4 4 2 7 1 1 2 5 4 3 2 5 2 3
2 5 3 3 2 5 4 3 2 6 2 3
3 4 4 4 3 4 3 3 3 4 4 4 3 4 5 5
3 4 5 5 3 4 4 4 3 4 5 4 3 4 6 5
3 4 4 4 3 5 3 3 3 4 4 4
END DATA.
EDIT: To make it easier to check answers, here is how the data looks after reading it into SPSS:
gpid anx socskls assert
1 5 3 3
1 5 4 3
1 4 5 4
1 4 5 4
1 3 5 5
1 4 5 4
1 4 5 5
1 4 4 4
1 5 4 3
1 5 4 3
1 4 4 4
2 6 2 1
2 6 2 2
2 5 2 3
2 6 2 2
2 4 4 4
2 7 1 1
2 5 4 3
2 5 2 3
2 5 3 3
2 5 4 3
2 6 2 3
3 4 4 4
3 4 3 3
3 4 4 4
3 4 5 5
3 4 5 5
3 4 4 4
3 4 5 4
3 4 6 5
3 4 4 4
3 5 3 3
3 4 4 4
If I understand correctly, the 1st, 5th, 9th, and 13th columns of the dataset belong to variable gpid, the 2nd, 6th, 10th, and 14th columns belong to variable anx, and so on. So, we need to reshape from wide to long format, with multiple measure variables, where each measure variable spans several columns, and where some values are missing.
Many roads lead to Rome.
This is what I would do using my favourite tools. In particular, this approach uses the feature of data.table::melt() to reshape multiple measure columns simultaneously. There is no manual cleanup of the data section in a text editor required.
The resulting dataset result can be used directly afterwards in any subsequent R code as requested by the OP. There is no need to take a detour using a .csv file (However, feel free to save result as a .csv file).
library(data.table)
library(magrittr)
cols <- c("gpid", "anx", "socskls", "assert")
raw <- fread(text = "
1 5 3 3 1 5 4 3 1 4 5 4 1 4 5 4
1 3 5 5 1 4 5 4 1 4 5 5 1 4 4 4
1 5 4 3 1 5 4 3 1 4 4 4
2 6 2 1 2 6 2 2 2 5 2 3 2 6 2 2
2 4 4 4 2 7 1 1 2 5 4 3 2 5 2 3
2 5 3 3 2 5 4 3 2 6 2 3
3 4 4 4 3 4 3 3 3 4 4 4 3 4 5 5
3 4 5 5 3 4 4 4 3 4 5 4 3 4 6 5
3 4 4 4 3 5 3 3 3 4 4 4",
fill = TRUE)
mv <- colnames(raw) %>%
matrix(ncol = 4L, byrow = TRUE) %>%
as.data.table() %>%
setnames(new = cols)
result <- melt(raw, measure.vars = mv, na.rm = TRUE)[
order(rowid(variable))][
, variable := NULL]
result
gpid anx socskls assert
1: 1 5 3 3
2: 1 5 4 3
3: 1 4 5 4
4: 1 4 5 4
5: 1 3 5 5
6: 1 4 5 4
7: 1 4 5 5
8: 1 4 4 4
9: 1 5 4 3
10: 1 5 4 3
11: 1 4 4 4
12: 2 6 2 1
13: 2 6 2 2
14: 2 5 2 3
15: 2 6 2 2
16: 2 4 4 4
17: 2 7 1 1
18: 2 5 4 3
19: 2 5 2 3
20: 2 5 3 3
21: 2 5 4 3
22: 2 6 2 3
23: 3 4 4 4
24: 3 4 3 3
25: 3 4 4 4
26: 3 4 5 5
27: 3 4 5 5
28: 3 4 4 4
29: 3 4 5 4
30: 3 4 6 5
31: 3 4 4 4
32: 3 5 3 3
33: 3 4 4 4
gpid anx socskls assert
Some explanations
fread() returns a data.table raw with default column names V1, V2, ... V16 and with missing values filled with NA
mv is a data.table which indicates which columns of raw belong to each target variable:
mv
gpid anx socskls assert
1: V1 V2 V3 V4
2: V5 V6 V7 V8
3: V9 V10 V11 V12
4: V13 V14 V15 V16
This information is used by melt(), which also removes rows with missing values from the resulting long format.
After reshaping, the rows are ordered by the variable number but need to be reordered in the original row order by using rowid(variable). Finally, the variable column is removed.
EDIT: Improved version
On second thought, here is a streamlined version of the code which skips the creation of mv and uses data.table chaining:
library(data.table)
cols <- c("gpid", "anx", "socskls", "assert")
result <- fread(
text = "
1 5 3 3 1 5 4 3 1 4 5 4 1 4 5 4
1 3 5 5 1 4 5 4 1 4 5 5 1 4 4 4
1 5 4 3 1 5 4 3 1 4 4 4
2 6 2 1 2 6 2 2 2 5 2 3 2 6 2 2
2 4 4 4 2 7 1 1 2 5 4 3 2 5 2 3
2 5 3 3 2 5 4 3 2 6 2 3
3 4 4 4 3 4 3 3 3 4 4 4 3 4 5 5
3 4 5 5 3 4 4 4 3 4 5 4 3 4 6 5
3 4 4 4 3 5 3 3 3 4 4 4",
fill = TRUE, col.names = rep(cols, 4L))[
, melt(.SD, measure.vars = patterns(cols), value.name = cols, na.rm = TRUE)][
order(rowid(variable))][
, variable := NULL][]
result
Here, the columns are renamed within the call to fread(). In this case, duplicated column names are desirable (as opposed to the usual use case) because the patterns() function in the subsequent call to melt() uses the duplicated column names to combine the columns which belong to one measure variable.
This requires some manual clean-up in Notepad or similar to place the data in the right format. But essentially, this could be imported using the following
df <- data.frame(
gpid = c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,
2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3),
anx = c(5,5,4,4,3,4,4,4,5,5,4,6,6,5,6,
4,7,5,5,5,5,6,4,4,4,4,4,4,4,4,4,5,4),
socskls = c(3,4,5,5,5,5,5,4,4,4,4,2,2,2,2,
4,1,4,2,3,4,2,4,3,4,5,5,4,5,6,4,3,4),
assert = c(3,3,4,4,5,4,5,4,3,3,4,1,2,3,2,
4,1,3,3,3,3,3,4,3,4,5,5,4,4,5,4,3,4)
)
write.csv(df, "df.csv", row.names = FALSE)
Note that the first 4 values (1, 5, 3, 3) are the gpid, anx, socskls, and assert values for row 1. Whereas the values 1, 5, 4, 3 which appear to be in the next column of the pasted data in SPSS syntax (i.e. the next 4 values reading the syntax left to right) are actually the values for participant 10.
Note: I'm assuming you don't have SPSS installed. If you did, the easiest option would be to use the SPSS syntax to create the dataset in SPSS and then export it for R.
Using readLines and some string manipulating tools.
tmp <- readLines("spss1.txt") ## read from .txt
tmp <- trimws(gsub("[A-Z/.]", "", tmp)) ## remove caps and specials
nm <- strsplit(tmp[[1]], " ")[[1]] ## split names
tmp <- unlist(strsplit(tmp[3:11], "\\s{2,}") ) ## split data blocks
Finally, splitting at the spaces gives the result.
dat <- setNames(
type.convert(do.call(rbind.data.frame, strsplit(tmp, "\\s"))),
nm)
Result
dat
# gpid anx socskls assert
# 1 1 5 3 3
# 2 1 5 4 3
# 3 1 4 5 4
# 4 1 4 5 4
# 5 1 3 5 5
# 6 1 4 5 4
# 7 1 4 5 5
# 8 1 4 4 4
# 9 1 5 4 3
# 10 1 5 4 3
# 11 1 4 4 4
# 12 2 6 2 1
# 13 2 6 2 2
# 14 2 5 2 3
# 15 2 6 2 2
# 16 2 4 4 4
# 17 2 7 1 1
# 18 2 5 4 3
# 19 2 5 2 3
# 20 2 5 3 3
# 21 2 5 4 3
# 22 2 6 2 3
# 23 3 4 4 4
# 24 3 4 3 3
# 25 3 4 4 4
# 26 3 4 5 5
# 27 3 4 5 5
# 28 3 4 4 4
# 29 3 4 5 4
# 30 3 4 6 5
# 31 3 4 4 4
# 32 3 5 3 3
# 33 3 4 4 4
Note: This results in the same Wilks' lambda as @emily-kothe's method. Maybe the authors used different data, or your MANOVA method is flawed?
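Since DATA LIST FREE reads whitespace-separated values in order, one case per four values regardless of line breaks, the same table can also be sketched with base R alone: read all numbers as one stream and fold them into four columns. The object names and the output path csv_path below are placeholders chosen for illustration.

```r
# Data section copied from the SPSS syntax in the question.
txt <- "
1 5 3 3 1 5 4 3 1 4 5 4 1 4 5 4
1 3 5 5 1 4 5 4 1 4 5 5 1 4 4 4
1 5 4 3 1 5 4 3 1 4 4 4
2 6 2 1 2 6 2 2 2 5 2 3 2 6 2 2
2 4 4 4 2 7 1 1 2 5 4 3 2 5 2 3
2 5 3 3 2 5 4 3 2 6 2 3
3 4 4 4 3 4 3 3 3 4 4 4 3 4 5 5
3 4 5 5 3 4 4 4 3 4 5 4 3 4 6 5
3 4 4 4 3 5 3 3 3 4 4 4
"

# scan() ignores line breaks and returns one long numeric vector.
vals <- scan(text = txt, quiet = TRUE)

# Every four consecutive values form one case, so fill the matrix by row.
dat <- as.data.frame(matrix(vals, ncol = 4, byrow = TRUE))
names(dat) <- c("gpid", "anx", "socskls", "assert")

# Placeholder output location; replace with any path you like.
csv_path <- file.path(tempdir(), "spss_table.csv")
write.csv(dat, csv_path, row.names = FALSE)
```

The result has 33 rows, 11 per group, in the same order as the SPSS listing in the question's edit.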

Retrieve a value by another column criteria in R

I need some help.
I have this df:
df <- data.frame(month = c(1,1,1,1,1,2,2,2,2,2),
day = c(1,2,3,4,5,1,2,3,4,5),
flow = c(2,5,7,8,5,4,6,7,9,2))
month day flow
1 1 1 2
2 1 2 5
3 1 3 7
4 1 4 8
5 1 5 5
6 2 1 4
7 2 2 6
8 2 3 7
9 2 4 9
10 2 5 2
But I want to know the day of the minimum flow per month:
month day flow dayminflowofthemonth
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
This repetition is not a problem; I will use a pivot function.
Thanks, people!
We can use which.min to return the index of 'min'imum 'flow' per group and use that to get the corresponding 'day' to create the column with mutate
library(dplyr)
df <- df %>%
  group_by(month) %>%
  mutate(dayminflowofthemonth = day[which.min(flow)]) %>%
  ungroup()
-output
df
# A tibble: 10 x 4
# month day flow dayminflowofthemonth
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 1
# 2 1 2 5 1
# 3 1 3 7 1
# 4 1 4 8 1
# 5 1 5 5 1
# 6 2 1 4 5
# 7 2 2 6 5
# 8 2 3 7 5
# 9 2 4 9 5
#10 2 5 2 5
Another option using indexing inside dplyr pipeline:
library(dplyr)
#Code
newdf <- df %>% group_by(month) %>% mutate(Val=day[flow==min(flow)][1])
Output:
# A tibble: 10 x 4
# Groups: month [2]
month day flow Val
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
Here is a base R option using ave
transform(
  df,
  dayminflowofthemonth = ave(day * (ave(flow, month, FUN = min) == flow), month, FUN = max)
)
which gives
month day flow dayminflowofthemonth
1 1 1 2 1
2 1 2 5 1
3 1 3 7 1
4 1 4 8 1
5 1 5 5 1
6 2 1 4 5
7 2 2 6 5
8 2 3 7 5
9 2 4 9 5
10 2 5 2 5
One more base R approach:
df$dayminflowofthemonth <- by(
df,
df$month,
function(x) x$day[which.min(x$flow)]
)[df$month]
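The by() variant above subsets the per-month result with df$month as a numeric index, which lines up here because the months are 1 and 2. A minimal sketch that looks the result up by month name instead, assuming the same df as in the question (min_row and rows are illustrative names):

```r
df <- data.frame(month = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 day   = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 flow  = c(2, 5, 7, 8, 5, 4, 6, 7, 9, 2))

# Row index of the minimum flow within each month (first row on ties),
# returned as an array named by month.
min_row <- tapply(seq_len(nrow(df)), df$month,
                  function(i) i[which.min(df$flow[i])])

# Look up each row's month by name, then take the day stored at that row.
rows <- as.vector(min_row[as.character(df$month)])
df$dayminflowofthemonth <- df$day[rows]
df
```

Because the lookup goes through as.character(df$month), it should not depend on the months being consecutive integers starting at 1.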

How do I add a vector where I collapse scores from individuals within pairs?

I have done an experiment in which participants have solved a task in pairs, with another participant. Each participant has then received a score for how well they did the task. Pairs have gone through different amounts of trials.
I have a data frame similar to the one below:
participant <- c(1,1,2,2,3,3,3,4,4,4,5,6)
pair <- c(1,1,1,1,2,2,2,2,2,2,3,3)
trial <- c(1,2,1,2,1,2,3,1,2,3,1,1)
score <- c(2,3,6,3,4,7,3,1,8,5,4,3)
data <- data.frame(participant, pair, trial, score)
participant pair trial score
1 1 1 2
1 1 2 3
2 1 1 6
2 1 2 3
3 2 1 4
3 2 2 7
3 2 3 3
4 2 1 1
4 2 2 8
4 2 3 5
5 3 1 4
6 3 1 3
I would like to add a new vector to the data frame, where each participant gets the numeric difference between their own score and the other participant's score within each trial.
Does someone have an idea about how one might do that?
It should end up looking something like this:
participant pair trial score difference
1 1 1 2 4
1 1 2 3 0
2 1 1 6 4
2 1 2 3 0
3 2 1 4 3
3 2 2 7 1
3 2 3 3 2
4 2 1 1 3
4 2 2 8 1
4 2 3 5 2
5 3 1 4 1
6 3 1 3 1
Here's a solution that involves first reordering data such that each sequential pair of rows corresponds to a single pair within a single trial. This allows us to make a single call to diff() to extract the differences:
data <- data[order(data$trial,data$pair,data$participant),];
data$diff <- rep(diff(data$score)[c(T,F)],each=2L)*c(-1L,1L);
data;
## participant pair trial score diff
## 1 1 1 1 2 -4
## 3 2 1 1 6 4
## 5 3 2 1 4 3
## 8 4 2 1 1 -3
## 11 5 3 1 4 1
## 12 6 3 1 3 -1
## 2 1 1 2 3 0
## 4 2 1 2 3 0
## 6 3 2 2 7 -1
## 9 4 2 2 8 1
## 7 3 2 3 3 -2
## 10 4 2 3 5 2
I assumed you wanted the sign to capture the direction of the difference. So, for instance, if a participant has a score 4 points below the other participant in the same trial-pair, then I assumed you would want -4. If you want all-positive values, you can remove the multiplication by c(-1L,1L) and add a call to abs():
data$diff <- rep(abs(diff(data$score)[c(T,F)]),each=2L);
data;
## participant pair trial score diff
## 1 1 1 1 2 4
## 3 2 1 1 6 4
## 5 3 2 1 4 3
## 8 4 2 1 1 3
## 11 5 3 1 4 1
## 12 6 3 1 3 1
## 2 1 1 2 3 0
## 4 2 1 2 3 0
## 6 3 2 2 7 1
## 9 4 2 2 8 1
## 7 3 2 3 3 2
## 10 4 2 3 5 2
Here's a solution built around ave() that doesn't require reordering the whole data.frame first:
data$diff <- ave(data$score,data$trial,data$pair,FUN=function(x) abs(diff(x)));
data;
## participant pair trial score diff
## 1 1 1 1 2 4
## 2 1 1 2 3 0
## 3 2 1 1 6 4
## 4 2 1 2 3 0
## 5 3 2 1 4 3
## 6 3 2 2 7 1
## 7 3 2 3 3 2
## 8 4 2 1 1 3
## 9 4 2 2 8 1
## 10 4 2 3 5 2
## 11 5 3 1 4 1
## 12 6 3 1 3 1
Here's how you can get the score of the other participant in the same trial-pair:
data$other <- ave(data$score,data$trial,data$pair,FUN=rev);
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 2 1 1 2 3 3
## 3 2 1 1 6 2
## 4 2 1 2 3 3
## 5 3 2 1 4 1
## 6 3 2 2 7 8
## 7 3 2 3 3 5
## 8 4 2 1 1 4
## 9 4 2 2 8 7
## 10 4 2 3 5 3
## 11 5 3 1 4 3
## 12 6 3 1 3 4
Or, assuming the data.frame has been reordered as per the initial solution:
data$other <- c(rbind(data$score[c(F,T)],data$score[c(T,F)]));
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 3 2 1 1 6 2
## 5 3 2 1 4 1
## 8 4 2 1 1 4
## 11 5 3 1 4 3
## 12 6 3 1 3 4
## 2 1 1 2 3 3
## 4 2 1 2 3 3
## 6 3 2 2 7 8
## 9 4 2 2 8 7
## 7 3 2 3 3 5
## 10 4 2 3 5 3
Alternative, using matrix() instead of rbind():
data$other <- c(matrix(data$score,2L)[2:1,]);
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 3 2 1 1 6 2
## 5 3 2 1 4 1
## 8 4 2 1 1 4
## 11 5 3 1 4 3
## 12 6 3 1 3 4
## 2 1 1 2 3 3
## 4 2 1 2 3 3
## 6 3 2 2 7 8
## 9 4 2 2 8 7
## 7 3 2 3 3 5
## 10 4 2 3 5 3
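All of the approaches above rely on every pair-trial combination having exactly two rows. As a hedged sketch, that assumption can be checked first and the signed and absolute differences derived from the other participant's score. The column names signed_diff and difference and the helper vectors are illustrative choices.

```r
participant <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6)
pair        <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3)
trial       <- c(1, 2, 1, 2, 1, 2, 3, 1, 2, 3, 1, 1)
score       <- c(2, 3, 6, 3, 4, 7, 3, 1, 8, 5, 4, 3)
data <- data.frame(participant, pair, trial, score)

# Number of rows in each pair-trial cell, repeated for every row of that cell.
n_per_cell <- ave(data$score, data$pair, data$trial, FUN = length)
stopifnot(all(n_per_cell == 2))  # stop early if a partner is missing

# With exactly two rows per cell, rev() swaps the two scores.
other <- ave(data$score, data$pair, data$trial, FUN = rev)

data$signed_diff <- data$score - other    # own score minus partner's score
data$difference  <- abs(data$signed_diff) # magnitude only
data
```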
Here is an option using data.table:
library(data.table)
setDT(data)[,difference := abs(diff(score)), by = .(pair, trial)]
data
# participant pair trial score difference
# 1: 1 1 1 2 4
# 2: 1 1 2 3 0
# 3: 2 1 1 6 4
# 4: 2 1 2 3 0
# 5: 3 2 1 4 3
# 6: 3 2 2 7 1
# 7: 3 2 3 3 2
# 8: 4 2 1 1 3
# 9: 4 2 2 8 1
#10: 4 2 3 5 2
#11: 5 3 1 4 1
#12: 6 3 1 3 1
A slightly faster option would be:
setDT(data)[, difference := abs((score - shift(score))[2]) , by = .(pair, trial)]
If we need the score of the other participant in the pair:
data[, other:= rev(score) , by = .(pair, trial)]
data
# participant pair trial score difference other
# 1: 1 1 1 2 4 6
# 2: 1 1 2 3 0 3
# 3: 2 1 1 6 4 2
# 4: 2 1 2 3 0 3
# 5: 3 2 1 4 3 1
# 6: 3 2 2 7 1 8
# 7: 3 2 3 3 2 5
# 8: 4 2 1 1 3 4
# 9: 4 2 2 8 1 7
#10: 4 2 3 5 2 3
#11: 5 3 1 4 1 3
#12: 6 3 1 3 1 4
Or using dplyr:
library(dplyr)
data %>%
group_by(pair, trial) %>%
mutate(difference = abs(diff(score)))
# participant pair trial score difference
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 2 4
#2 1 1 2 3 0
#3 2 1 1 6 4
#4 2 1 2 3 0
#5 3 2 1 4 3
#6 3 2 2 7 1
#7 3 2 3 3 2
#8 4 2 1 1 3
#9 4 2 2 8 1
#10 4 2 3 5 2
#11 5 3 1 4 1
#12 6 3 1 3 1

From table to data.frame

I have a table that looks like:
dat = data.frame(expand.grid(x = 1:10, y = 1:10),
z = sample(LETTERS[1:3], size = 100, replace = TRUE))
tabl <- with(dat, table(z, y))
tabl
y
z 1 2 3 4 5 6 7 8 9 10
A 5 3 1 1 3 6 3 7 2 4
B 4 5 3 6 5 1 3 1 4 4
C 1 2 6 3 2 3 4 2 4 2
Now how do I transform it into a data.frame that looks like
1 2 3 4 5 6 7 8 9 10
A 5 3 1 1 3 6 3 7 2 4
B 4 5 3 6 5 1 3 1 4 4
C 1 2 6 3 2 3 4 2 4 2
Here are a couple of options.
The reason as.data.frame(tabl) doesn't work is that it dispatches to the S3 method as.data.frame.table() which does something useful but different from what you want.
as.data.frame.matrix(tabl)
# 1 2 3 4 5 6 7 8 9 10
# A 5 4 3 1 1 3 3 2 6 2
# B 1 4 3 4 5 3 4 4 3 3
# C 4 2 4 5 4 4 3 4 1 5
## This will also work
as.data.frame(unclass(tabl))
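To see the difference between the two conversions, here is a small sketch on reproducible data. The seed and the explicit factor levels are assumptions added so that the table always has three rows; the counts will differ from those shown in the question.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
dat <- data.frame(expand.grid(x = 1:10, y = 1:10),
                  z = factor(sample(LETTERS[1:3], size = 100, replace = TRUE),
                             levels = LETTERS[1:3]))
tabl <- with(dat, table(z, y))

# Default method for tables: long format, one row per (z, y) cell.
long <- as.data.frame(tabl)
names(long)

# Matrix method: keeps the 3 x 10 layout, with z as row names.
wide <- as.data.frame.matrix(tabl)
dim(wide)
```

The long form has columns z, y, and Freq, while the wide form keeps one column per value of y.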
