How to subset multiple times from a larger data frame in R

I have the following wide data in a .csv:
Subj Show1_judgment1 Show1_judgment2 Show1_judgment3 Show1_judgment4 Show2_judgment1 Show2_judgment2 Show2_judgment3 Show2_judgment4
1    3               2               5               6               3               2               5               7
2    1               3               5               4               3               4               1               6
3    1               2               6               2               3               7               2               6
The columns keep going for the same four judgments for a series of 130 different shows.
I want to change this data into long form so that it looks like this:
Subj show  judgment1 judgment2 judgment3 judgment4
1    show1 2         5         6         1
1    show2 3         5         4         4
1    show3 2         6         2         5
Usually, I would use base R to subset the columns into their own data frames and then use rbind to combine them into one data frame.
But since there are so many different shows, doing it that way would be very inefficient. I am relatively new to R, so I can only write very basic for loops, but I think a for loop that subsets the subject column (the first column in the data) along with each group of 4 sequential columns would do this much more efficiently.
Can anyone help me create a for loop for this?
Thank you in advance for your help!

No for loop required; this is reshaping ("pivoting") from wide to long format.
tidyr
tidyr::pivot_longer(dat, -Subj, names_pattern = "(.*)_(.*)",
                    names_to = c("show", ".value"))
# # A tibble: 6 x 6
#    Subj show  judgment1 judgment2 judgment3 judgment4
#   <int> <chr>     <int>     <int>     <int>     <int>
# 1     1 Show1         3         2         5         6
# 2     1 Show2         3         2         5         7
# 3     2 Show1         1         3         5         4
# 4     2 Show2         3         4         1         6
# 5     3 Show1         1         2         6         2
# 6     3 Show2         3         7         2         6
data.table
Requires data.table >= 1.14.3, which is relatively new (or can be installed from GitHub).
data.table::melt(
  dat, id.vars = "Subj",
  measure.vars = measure(show, value.name, sep = "_"))
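For older data.table versions that lack measure(), a hedged sketch of the same reshape using patterns() with one pattern per judgment column group; here the "show" column holds the show index (1, 2, ...) rather than the show name. The sample data is typed in from the question (first two subjects only):

```r
library(data.table)

# Sample data from the question (first two subjects only)
dat <- data.table(
  Subj = 1:2,
  Show1_judgment1 = c(3, 1), Show1_judgment2 = c(2, 3),
  Show1_judgment3 = c(5, 5), Show1_judgment4 = c(6, 4),
  Show2_judgment1 = c(3, 3), Show2_judgment2 = c(2, 4),
  Show2_judgment3 = c(5, 1), Show2_judgment4 = c(7, 6)
)

# One pattern per judgment group; value.name supplies the output column
# names, and "show" becomes a factor of show indices ("1", "2", ...)
long <- melt(
  dat, id.vars = "Subj",
  measure.vars = patterns("judgment1$", "judgment2$", "judgment3$", "judgment4$"),
  value.name = paste0("judgment", 1:4),
  variable.name = "show"
)
```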


How do I create an index variable based on three variables in R? [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 26 days ago.
I'm trying to create an index variable in R based on an individual identifier, a test name, and the date the test was taken. My data has repeated students taking the same test multiple times with different scores, and I'd like to identify which attempt each observation represents for that specific test. My data looks something like the following, and I'd like to create a variable like the id variable shown below: it should restart at 1 and count, in date order, the observations with the same student and test name.
student <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
test <- c("math", "math", "reading", "math", "reading", "reading",
          "reading", "math", "reading", "math", "math", "math", "reading", "reading")
date <- c(1, 2, 3, 3, 4, 5, 2, 3, 5, 1, 2, 3, 4, 5)
data <- data.frame(student, test, date)
print(data)
   student    test date
1        1    math    1
2        1    math    2
3        1 reading    3
4        1    math    3
5        1 reading    4
6        1 reading    5
7        2 reading    2
8        2    math    3
9        2 reading    5
10       3    math    1
11       3    math    2
12       3    math    3
13       3 reading    4
14       3 reading    5
I want to add a variable that indicates the attempt number for a test taken by the same student so it looks something like this:
   student    test date id
1        1    math    1  1
2        1    math    2  2
3        1 reading    3  1
4        1    math    3  3
5        1 reading    4  2
6        1 reading    5  3
7        2 reading    2  1
8        2    math    3  1
9        2 reading    5  2
10       3    math    1  1
11       3    math    2  2
12       3    math    3  3
13       3 reading    4  1
14       3 reading    5  2
I figured out how to create an ID variable based on only one other variable, for example the student number, but I don't know how to do it based on multiple variables. I also tried cumsum, but that keeps counting across new values and doesn't restart at 1 when a new group starts.
tests <- transform(tests, ID = as.numeric(factor(EMPLID)))
tests$id <- cumsum(!duplicated(tests[1:3]))
library(dplyr)
data %>%
  group_by(student, test) %>%
  arrange(date, .by_group = TRUE) %>%  ## make sure rows are sorted by date
  mutate(id = row_number()) %>%
  ungroup()
# # A tibble: 14 × 4
#    student test     date    id
#      <dbl> <chr>   <dbl> <int>
#  1       1 math        1     1
#  2       1 math        2     2
#  3       1 math        3     3
#  4       1 reading     3     1
#  5       1 reading     4     2
#  6       1 reading     5     3
#  7       2 math        3     1
#  8       2 reading     2     1
#  9       2 reading     5     2
# 10       3 math        1     1
# 11       3 math        2     2
# 12       3 math        3     3
# 13       3 reading     4     1
# 14       3 reading     5     2
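If you prefer to avoid dplyr, the same per-group counter can be sketched in base R with ave(): sort by date first, then number the rows within each (student, test) group. This is an alternative I'm adding here, not part of the original answer:

```r
# Rebuild the question's sample data
student <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
test <- c("math", "math", "reading", "math", "reading", "reading",
          "reading", "math", "reading", "math", "math", "math", "reading", "reading")
date <- c(1, 2, 3, 3, 4, 5, 2, 3, 5, 1, 2, 3, 4, 5)
data <- data.frame(student, test, date)

# Sort so that attempts are in date order within each group
data <- data[order(data$student, data$test, data$date), ]

# ave() applies seq_along within each (student, test) group,
# restarting the counter at 1 for every new group
data$id <- ave(seq_along(data$date), data$student, data$test, FUN = seq_along)
```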

Correlation over rows [duplicate]

This question already has answers here:
Correlation between two dataframes by row
(2 answers)
Closed 2 years ago.
I've got two datasets from the same people, and I want to compute a correlation for each person across the two datasets.
Example dataset:
dat1 <- read.table(header=TRUE, text="
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5
5 1 2 1 5
3 1 3 3 4
2 1 3 1 3
4 2 5 5 3
5 1 4 1 2
")
dat2 <- read.table(header=TRUE, text="
ItemY1 ItemY2 ItemY3 ItemY4 ItemY5
4 2 1 1 4
4 3 1 2 5
1 5 3 2 2
5 2 4 4 1
5 1 5 2 1
")
Does anybody know how to compute the correlation rowwise for each person and NOT for the whole two datasets?
Thank you!
One possible solution uses {purrr} to iterate over the rows of both data frames and compute the correlation between each row of dat1 and the corresponding row of dat2.
library(purrr)
dat1 <- read.table(header=TRUE, text="
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5
5 1 2 1 5
3 1 3 3 4
2 1 3 1 3
4 2 5 5 3
5 1 4 1 2
")
dat2 <- read.table(header=TRUE, text="
ItemY1 ItemY2 ItemY3 ItemY4 ItemY5
4 2 1 1 4
4 3 1 2 5
1 5 3 2 2
5 2 4 4 1
5 1 5 2 1
")
n_person <- nrow(dat1)
cormat <- purrr::map_df(
  .x = setNames(1:n_person, paste0("person_", 1:n_person)),
  .f = ~ cor(t(dat1[.x, ]), t(dat2[.x, ])))
cormat
#> # A tibble: 1 x 5
#> person_1[,"1"] person_2[,"2"] person_3[,"3"] person_4[,"4"] person_5[,"5"]
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.917 0.289 -0.330 0.723 0.913
Created on 2020-11-16 by the reprex package (v0.3.0)
Following the post mentioned by @Ravi, we can transpose the data frames and then calculate the correlations. One additional step is to vectorise the cor function if you want a not-so-wasteful approach. Consider something like this:
tp <- function(x) unname(as.data.frame(t(x)))
Vectorize(cor, c("x", "y"))(tp(dat1), tp(dat2))
Output
[1] 0.9169725 0.2886751 -0.3296902 0.7234780 0.9132660
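For completeness, a base R sketch of the same row-wise correlation that needs no packages at all: loop over the row indices and correlate each pair of rows (data rebuilt from the question).

```r
# Rebuild the question's sample data
dat1 <- read.table(header = TRUE, text = "
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5
5 1 2 1 5
3 1 3 3 4
2 1 3 1 3
4 2 5 5 3
5 1 4 1 2
")
dat2 <- read.table(header = TRUE, text = "
ItemY1 ItemY2 ItemY3 ItemY4 ItemY5
4 2 1 1 4
4 3 1 2 5
1 5 3 2 2
5 2 4 4 1
5 1 5 2 1
")

# Correlate row i of dat1 with row i of dat2; unlist() turns each
# one-row data frame into a plain numeric vector for cor()
rowcors <- sapply(seq_len(nrow(dat1)),
                  function(i) cor(unlist(dat1[i, ]), unlist(dat2[i, ])))
rowcors  # same five values as the outputs shown above
```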

Split dataframe based on one column in r, with a non-fixed width column [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 5 years ago.
I have a problem that is an extension of a well-covered issue here on SE, i.e.:
Split a column of a data frame to multiple columns
My data has a column of comma-separated strings, but of no fixed length.
data = data.frame(id = c(1,2,3), treatments = c("1,2,3", "2,3", "8,9,1,2,4"))
So I would like to have my dataframe eventually be in the proper tidy/long form of:
id treatments
1 1
1 2
1 3
...
3 1
3 2
3 4
Something like separate or strsplit doesn't seem, on its own, to be the solution. separate fails with warnings that some rows have too many pieces (NB: id 3 has more values than id 1).
Thanks
You can use tidyr::separate_rows:
library(tidyr)
separate_rows(data, treatments)
#    id treatments
# 1   1          1
# 2   1          2
# 3   1          3
# 4   2          2
# 5   2          3
# 6   3          8
# 7   3          9
# 8   3          1
# 9   3          2
# 10  3          4
Using dplyr and tidyr packages:
data %>%
  separate(treatments, paste0("v", 1:5)) %>%
  gather(var, treatments, -id) %>%
  na.exclude %>%
  select(id, treatments) %>%
  arrange(id)
   id treatments
1   1          1
2   1          2
3   1          3
4   2          2
5   2          3
6   3          8
7   3          9
8   3          1
9   3          2
10  3          4
You can also use unnest:
library(tidyverse)
data %>%
  mutate(treatments = stringr::str_split(treatments, ",")) %>%
  unnest()
   id treatments
1   1          1
2   1          2
3   1          3
4   2          2
5   2          3
6   3          8
7   3          9
8   3          1
9   3          2
10  3          4
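A base R sketch of the same split, added here for reference: strsplit() each string on commas, then repeat each id by the number of pieces that lengths() reports for it.

```r
# Rebuild the question's sample data
data <- data.frame(id = c(1, 2, 3),
                   treatments = c("1,2,3", "2,3", "8,9,1,2,4"))

# Split each comma-separated string into its pieces
pieces <- strsplit(as.character(data$treatments), ",")

# Repeat each id once per piece, and unlist the pieces alongside
long <- data.frame(id = rep(data$id, lengths(pieces)),
                   treatments = unlist(pieces))
```

Note the resulting treatments column is character; wrap it in as.numeric() if numbers are needed.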

Replace NA in column by value corresponding to column name in separate table

I have a data frame which looks like this
data <- data.frame(ID = c(1, 2, 3, 4, 5),
                   A = c(1, 4, NA, NA, 4),
                   B = c(1, 2, NA, NA, NA),
                   C = c(1, 2, 3, 4, NA))
> data
  ID  A  B  C
1  1  1  1  1
2  2  4  2  2
3  3 NA NA  3
4  4 NA NA  4
5  5  4 NA NA
I have a mapping file as well which looks like this
reference <- data.frame(Names = c("A","B","C"),Vals = c(2,5,6))
> reference
  Names Vals
1     A    2
2     B    5
3     C    6
I want to modify my data frame using the reference file in a way that yields this final data frame:
> final_data
  ID A B C
1  1 1 1 1
2  2 4 2 2
3  3 2 5 3
4  4 2 5 4
5  5 4 5 6
What is the fastest way I can achieve this in R?
We can do this with Map
data[as.character(reference$Names)] <- Map(
  function(x, y) replace(x, is.na(x), y),
  data[as.character(reference$Names)], reference$Vals)
data
# ID A B C
#1 1 1 1 1
#2 2 4 2 2
#3 3 2 5 3
#4 4 2 5 4
#5 5 4 5 6
EDIT: Based on @thelatemail's comments.
NOTE: No external packages used.
As we are looking for an efficient solution, another approach would be set from data.table:
library(data.table)
setDT(data)
v1 <- as.character(reference$Names)
for (j in seq_along(v1)) {
  set(data, i = which(is.na(data[[v1[j]]])), j = v1[j], value = reference$Vals[j])
}
NOTE: Only a single efficient external package used.
One approach is to compute a logical matrix of the target columns capturing which cells are NA. We can then index-assign the NA cells with the replacement values. The tricky part is ensuring the replacement vector aligns with the indexed cells:
im <- is.na(data[as.character(reference$Names)])
data[as.character(reference$Names)][im] <- rep(reference$Vals, colSums(im))
data
## ID A B C
## 1 1 1 1 1
## 2 2 4 2 2
## 3 3 2 5 3
## 4 4 2 5 4
## 5 5 4 5 6
If reference were in the same wide format as data, dplyr's new (v0.5.0) coalesce function is built for replacing NAs; together with purrr, which offers alternate notations for *apply functions, it makes the process very simple:
library(dplyr)
# spread reference to wide, add ID column for mapping
reference_wide <- data.frame(ID = NA_real_, tidyr::spread(reference, Names, Vals))
reference_wide
# ID A B C
# 1 NA 2 5 6
# now coalesce the two column-wise and return a df
purrr::map2_df(data, reference_wide, coalesce)
# Source: local data frame [5 x 4]
#
# ID A B C
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 2 4 2 2
# 3 3 2 5 3
# 4 4 2 5 4
# 5 5 4 5 6
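One more option, sketched here as an addition (it is not among the original answers): tidyr::replace_na() accepts a named list of per-column replacement values, and that list can be built straight from the reference table with setNames().

```r
library(tidyr)

# Rebuild the question's sample data
data <- data.frame(ID = c(1, 2, 3, 4, 5),
                   A = c(1, 4, NA, NA, 4),
                   B = c(1, 2, NA, NA, NA),
                   C = c(1, 2, 3, 4, NA))
reference <- data.frame(Names = c("A", "B", "C"), Vals = c(2, 5, 6))

# Turn the reference table into a named list: list(A = 2, B = 5, C = 6)
repl <- as.list(setNames(reference$Vals, reference$Names))

# replace_na() fills each column's NAs with its entry from the list;
# columns without an entry (ID) are left untouched
data_filled <- replace_na(data, replace = repl)
```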

More elegant way of transforming data frames from wide to long format by using reshape() or melt()

I am looking for a more elegant way of reshaping my data frame by using the melt (reshape2) or reshape function.
Let’s assume I have a simple data frame like this:
d <- data.frame(PID = factor(c(1, 1, 1, 2, 2, 2)),
                Cue1 = factor(c(1, 2, 3, 1, 2, 3)),
                Cue2 = factor(c(5, 5, 5, 5, 5, 5)))
And I would like to transform the second and third columns to a single long one. My code below works but I am looking for a more elegant way:
d1 <- data.frame(trigger = as.vector(t(d[, 2:3])))
d1$PID <- factor(rep(c(1, 2), each = 6))
It is important that the two factors have different numbers of levels (Cue1 has 3, Cue2 has 1). My code above gives me the new column that looks like this (this is actually what I want):
trigger
1
5
2
5
3
5
...
Unfortunately, most of the examples on the internet about reshape discusses the following (and in my case, non-preferred) example:
trigger
1
2
3
1
2
3
...
But I need the former one.
Thanks for your suggestions in advance.
The simplest would be to use melt. This gives the same result as your initial data frame (d1), unless the exact order of trigger is important.
library(reshape2)
d2 <- melt(d, id="PID", value.name="trigger")[,c(3,1)]
> d2
   trigger PID
1        1   1
2        2   1
3        3   1
4        1   2
5        2   2
6        3   2
7        5   1
8        5   1
9        5   1
10       5   2
11       5   2
12       5   2
If you are fond of using base functions you can also use reshape
d3 <- reshape(d, direction = "long",
              varying = list(names(d)[2:3]),
              v.names = "trigger",
              idvar = "PID",
              new.row.names = seq(12))[, c(3, 1)]
You can see they are both identical by ordering by trigger
> d2[order(d2$trigger),]
   trigger PID
1        1   1
4        1   2
2        2   1
5        2   2
3        3   1
6        3   2
7        5   1
8        5   1
9        5   1
10       5   2
11       5   2
12       5   2
> d1[order(d1$trigger),]
   trigger PID
1        1   1
7        1   2
3        2   1
9        2   2
5        3   1
11       3   2
2        5   1
4        5   1
6        5   1
8        5   2
10       5   2
12       5   2
I think that "elegance" is subjective, but if you are looking for an alternative, you can consider merged.stack from my "splitstackshape" package. In order for merged.stack to work correctly, though, your ID variables need to be unique. For this, you can use getanID (also from "splitstackshape"):
library(splitstackshape)
packageVersion("splitstackshape")
# [1] ‘1.4.2’
merged.stack(getanID(d, "PID"), var.stubs = "Cue",
             sep = "var.stubs")[, c("PID", "Cue"), with = FALSE]
# PID Cue
# 1: 1 1
# 2: 1 5
# 3: 1 2
# 4: 1 5
# 5: 1 3
# 6: 1 5
# 7: 2 1
# 8: 2 5
# 9: 2 2
# 10: 2 5
# 11: 2 3
# 12: 2 5
## factor levels retained as desired
str(.Last.value)
# Classes ‘data.table’ and 'data.frame': 12 obs. of 2 variables:
# $ PID: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
# $ Cue: Factor w/ 4 levels "1","2","3","5": 1 4 2 4 3 4 1 4 2 4 ...
# - attr(*, "sorted")= chr "PID"
# - attr(*, ".internal.selfref")=<externalptr>
By default, this approach would create a few extra columns if you simply did:
merged.stack(getanID(d, "PID"), var.stubs = "Cue", sep = "var.stubs")
The two extra columns would be:
.id, created by getanID. This column, when combined with the "PID" column would create unique IDs.
.time_1, which is the result of the "stacking" step to indicate which "Cue" column the value came from (in this case, cycling between 1 and 2 to represent "Cue1" and "Cue2").
The part of the code that reads [, c("PID", "Cue"), with = FALSE] means to just show us those two columns (since that's all you seem to be interested in).
If you are just looking for a one-liner using melt, below is an approach (the desired order is kept):
# assume DF is your data frame
DF_new = data.frame(trigger = melt(t(DF[,2:3]))[,3], PID = rep(DF[,1], each=2))
DF_new
#    trigger PID
# 1        1   1
# 2        5   1
# 3        2   1
# 4        5   1
# 5        3   1
# 6        5   1
# 7        1   2
# 8        5   2
# 9        2   2
# 10       5   2
# 11       3   2
# 12       5   2
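With current tidyr, a hedged sketch of the same interleaved result using pivot_longer(), which I'm adding as a modern alternative: it walks the data row by row, so each input row emits its Cue1 value and then its Cue2 value, giving exactly the 1, 5, 2, 5, 3, 5 ordering asked for (this assumes the vctrs machinery underneath can union the two factors' differing level sets, which recent versions do).

```r
library(tidyr)

# Rebuild the question's sample data
d <- data.frame(PID = factor(c(1, 1, 1, 2, 2, 2)),
                Cue1 = factor(c(1, 2, 3, 1, 2, 3)),
                Cue2 = factor(c(5, 5, 5, 5, 5, 5)))

# pivot_longer() processes the data row-wise: for each input row it
# emits Cue1 then Cue2, producing the interleaved trigger column
d_long <- pivot_longer(d, c(Cue1, Cue2), values_to = "trigger")
```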
