I have a dataset for determining interrater reliability and am trying to restructure it from wide to long form. Here is my data.
Subject Rater Item_1 Item_2
AB 1 6 4
AB 2 5 5
CD 1 4 5
CD 2 6 5
EF 1 4 4
EF 2 7 5
I want to restructure it so that it looks like this:
Subject Item Rater_1 Rater_2
AB 1 6 5
AB 2 4 5
CD 1 4 6
CD 2 5 5
EF 1 4 7
EF 2 4 5
I've tried pivot_longer but am unable to separate "rater" into two columns. Any ideas?
Get the data in long format and use a different key to get it in wide format again.
library(dplyr)
library(tidyr)
# Thanks to @Dan Adams for the `NA` trick.
df %>%
  pivot_longer(cols = starts_with('Item'),
               names_to = c(NA, 'Item'),
               names_sep = "_") %>%
  pivot_wider(names_from = Rater, values_from = value, names_prefix = "Rater_")
# Subject Item Rater_1 Rater_2
# <chr> <chr> <int> <int>
#1 AB 1 6 5
#2 AB 2 4 5
#3 CD 1 4 6
#4 CD 2 5 5
#5 EF 1 4 7
#6 EF 2 4 5
data
df <- structure(list(Subject = c("AB", "AB", "CD", "CD", "EF", "EF"
), Rater = c(1L, 2L, 1L, 2L, 1L, 2L), Item_1 = c(6L, 5L, 4L,
6L, 4L, 7L), Item_2 = c(4L, 5L, 5L, 5L, 4L, 5L)),
class = "data.frame", row.names = c(NA, -6L))
Here is a base R solution. You are really just transposing the data by group in this particular case.
Map(\(s) {
  x <- subset(df, df$Subject == s)
  x[,c("Item_1", "Item_2")] <- t(x[,c("Item_1", "Item_2")])
  colnames(x) <- c("Subject", "Item", "Rater_1", "Rater_2")
  x
}, unique(df$Subject)) |>
  do.call(what = rbind)
#> # A tibble: 6 x 4
#> Subject Item Rater_1 Rater_2
#> * <chr> <dbl> <dbl> <dbl>
#> 1 AB 1 6 5
#> 2 AB 2 4 5
#> 3 CD 1 4 6
#> 4 CD 2 5 5
#> 5 EF 1 4 7
#> 6 EF 2 4 5
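The transpose works here because every subject has exactly two raters and two items, and the rater numbers happen to coincide with the item numbers. When that does not hold, a more general base R sketch could go long first and then wide again. It assumes the `df` from the data block above; the names `item_cols`, `long` and `wide` are only illustrative.

```r
df <- data.frame(
  Subject = c("AB", "AB", "CD", "CD", "EF", "EF"),
  Rater   = c(1L, 2L, 1L, 2L, 1L, 2L),
  Item_1  = c(6L, 5L, 4L, 6L, 4L, 7L),
  Item_2  = c(4L, 5L, 5L, 5L, 4L, 5L)
)

# Step 1: stack the item columns (one row per subject/rater/item)
item_cols <- grep("^Item_", names(df), value = TRUE)
long <- data.frame(
  Subject = rep(df$Subject, times = length(item_cols)),
  Rater   = rep(df$Rater, times = length(item_cols)),
  Item    = rep(sub("^Item_", "", item_cols), each = nrow(df)),
  value   = unlist(df[item_cols], use.names = FALSE)
)

# Step 2: spread the raters back out into columns
wide <- reshape(long, idvar = c("Subject", "Item"), timevar = "Rater",
                direction = "wide", sep = "_")
names(wide) <- sub("^value", "Rater", names(wide))
wide <- wide[order(wide$Subject, wide$Item), ]
rownames(wide) <- NULL
wide
```

The row order after `reshape` follows the long table, so the final `order()` call is only there to match the layout asked for in the question.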
Related
So let's say I have a data table that consists of monthly stock returns:
Company Year return next years return
1 1 5
1 2 6
1 3 2
1 4 4
For a large dataset of multiple companies and years, how can I get a new column that consists of next year's returns? For example, the first row would contain the second year's return of 6%, and so on. In Excel I could simply use INDEX/MATCH, but I have no idea how it's done in R. The reason for not using Excel is that it takes over 20 hours to compute all the functions, as INDEX/MATCH is extremely slow. The code needs to do this for all companies, so it has to find the correct company for the correct year and then put the value into the new column.
You could group by the company and use lead() to get the next value:
library(dplyr)
df <- data.frame(
  company = c(1L, 1L, 1L, 1L, 2L, 2L),
  year = c(1L, 2L, 3L, 4L, 1L, 2L),
  return_ = c(5L, 6L, 2L, 4L, 2L, 4L))
df
#> company year return_
#> 1 1 1 5
#> 2 1 2 6
#> 3 1 3 2
#> 4 1 4 4
#> 5 2 1 2
#> 6 2 2 4
df %>% group_by(company) %>%
  mutate(next.years.return = lead(return_, order_by = year))
#> # A tibble: 6 × 4
#> # Groups: company [2]
#> company year return_ next.years.return
#> <int> <int> <int> <int>
#> 1 1 1 5 6
#> 2 1 2 6 2
#> 3 1 3 2 4
#> 4 1 4 4 NA
#> 5 2 1 2 4
#> 6 2 2 4 NA
Created on 2023-02-10 with reprex v2.0.2
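For readers who want to see what the grouped `lead()` amounts to, a rough base R equivalent might look like the sketch below. `next_in_group` is a hypothetical helper, not a dplyr function, and the sketch simply takes the following row within each company after sorting, without checking for gaps in the years.

```r
df <- data.frame(
  company = c(1L, 1L, 1L, 1L, 2L, 2L),
  year    = c(1L, 2L, 3L, 4L, 1L, 2L),
  return_ = c(5L, 6L, 2L, 4L, 2L, 4L)
)

# Shift a vector up by one position, padding the end with NA
next_in_group <- function(x) c(x[-1], NA)

# Sort so that "next row" means "next year", then shift within each company
df <- df[order(df$company, df$year), ]
df$next.years.return <- ave(df$return_, df$company, FUN = next_in_group)
df
```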
Get the next year's return only if the following row really is the next year.
library(dplyr)
df %>%
  group_by(Company) %>%
  arrange(Company, Year) %>%
  mutate("next years return" =
           if_else(lead(Year) - Year == 1, lead(`return`), NA)) %>%
  ungroup()
# A tibble: 8 × 4
Company Year return `next years return`
<dbl> <dbl> <int> <int>
1 1 1 5 NA
2 1 3 2 4
3 1 4 4 6
4 1 5 6 NA
5 2 1 5 6
6 2 2 6 2
7 2 3 2 4
8 2 4 4 NA
Data
df <- structure(list(Company = c(1, 1, 1, 1, 2, 2, 2, 2), Year = c(1,
5, 3, 4, 4, 3, 2, 1), return = c(5L, 6L, 2L, 4L, 4L, 2L, 6L,
5L)), row.names = c("1", "2", "3", "4", "41", "31", "21", "11"
), class = "data.frame")
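Since the question mentions INDEX/MATCH, a lookup-style sketch in base R may also be useful for comparison. It builds a company/year key and looks up the key for the following year with `match()`, so gaps in the years yield `NA` without any sorting. The names `key_this_year`, `key_next_year` and `next_return` are made up for this illustration.

```r
df <- data.frame(
  Company = c(1, 1, 1, 1, 2, 2, 2, 2),
  Year    = c(1, 5, 3, 4, 4, 3, 2, 1),
  return  = c(5L, 6L, 2L, 4L, 4L, 2L, 6L, 5L)
)

# Key for each row, and the key of the row we want to look up
key_this_year <- paste(df$Company, df$Year)
key_next_year <- paste(df$Company, df$Year + 1)

# match() gives the row position of next year's key, or NA if it is absent
df$next_return <- df[["return"]][match(key_next_year, key_this_year)]
df
```

This keeps the original row order, which differs from the sorted output shown above.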
I want to calculate the mean of every five rows for each column, by group. I tried:
name<-colnames(df[,4:10])
df1<-for (i in name){
df%>%
group_by(A)%>%
summarise(!!paste(i,"mean"):=rollapplyr(get(i),5,mean,fill = NA,by.column=T))
}
The result, df1, is NULL.
Then I tried:
for (i in name){
df%>%
group_by(A)%>%
mutate(!!paste(i,"mean"):=rollapplyr(get(i),5,mean,fill = NA,by.column=T))
}
This runs, but nothing happens: df remains the same. And if I assign the code above to df1, df1 is still NULL.
I also tried rollmean:
df1<- for (i in name){
  df%>%
    group_by(CONM)%>%
    mutate(!!paste(i,"mean"):=rollmean(get(i),5,fill = NA,align = "right"))
}
But I still get NULL.
My data is like this:
CONM A B C
a 1 2 3
a 2 3 4
a 3 4 5
a 4 5 6
a 5 6 7
a 6 7 8
And I want to get this result for each CONM:
CONM A B C A_mean B_mean C_mean
a 1 2 3 NA NA NA
a 2 3 4 NA NA NA
a 3 4 5 NA NA NA
a 4 5 6 NA NA NA
a 5 6 7 3 4 5
a 6 7 8 4 5 6
b 1 2 3 NA NA NA
Could someone help me with this? Should I use other packages? Thanks
We can use mutate with across to loop over the columns A to C, and specify a lambda function (function(.) or the tidyverse shorthand ~) that applies rollmean to each column.
library(dplyr)
library(zoo)
df %>%
  group_by(CONM) %>%
  mutate(across(A:C, ~ rollmean(., 5, fill = NA, align = 'right'),
                .names = '{col}_mean')) %>%
  ungroup
-output
# A tibble: 7 x 7
# CONM A B C A_mean B_mean C_mean
# <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#1 a 1 2 3 NA NA NA
#2 a 2 3 4 NA NA NA
#3 a 3 4 5 NA NA NA
#4 a 4 5 6 NA NA NA
#5 a 5 6 7 3 4 5
#6 a 6 7 8 4 5 6
#7 b 1 2 3 NA NA NA
Or, as @G. Grothendieck mentioned, rollmeanr does the right alignment by default
df %>%
  group_by(CONM) %>%
  mutate(across(A:C, ~ rollmeanr(., 5, fill = NA), .names = '{col}_mean'))
data
df <- structure(list(CONM = c("a", "a", "a", "a", "a", "a", "b"), A = c(1L,
2L, 3L, 4L, 5L, 6L, 1L), B = c(2L, 3L, 4L, 5L, 6L, 7L, 2L), C = c(3L,
4L, 5L, 6L, 7L, 8L, 3L)), class = "data.frame", row.names = c(NA,
-7L))
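As for why the loops in the question end up as NULL: in R a `for` loop always returns `NULL` (invisibly), and the result of a pipeline inside the loop is thrown away unless it is assigned. A minimal base R sketch of a loop that does keep its results could look like this. `roll_mean_right` is a hypothetical helper written for this illustration, not a zoo function, and `out` is just an illustrative name.

```r
df <- data.frame(
  CONM = c("a", "a", "a", "a", "a", "a", "b"),
  A = c(1, 2, 3, 4, 5, 6, 1),
  B = c(2, 3, 4, 5, 6, 7, 2),
  C = c(3, 4, 5, 6, 7, 8, 3)
)

# A for loop evaluates to NULL, whatever happens in its body
res <- for (i in 1:3) i * 2
is.null(res)

# Right-aligned rolling mean over k values, NA where fewer than k are available
roll_mean_right <- function(x, k = 5) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= k) {
    for (j in k:n) out[j] <- mean(x[(j - k + 1):j])
  }
  out
}

# Assign inside the loop so that each new column is actually stored
out <- df
for (i in c("A", "B", "C")) {
  out[[paste0(i, "_mean")]] <- ave(df[[i]], df$CONM, FUN = roll_mean_right)
}
out
```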
I have the following dataset
clust T2 n
1 a 1
1 b 3
1 c 3
2 d 5
3 a 4
3 b 3
4 b 5
4 c 8
4 t 6
4 e 7
etc..
using the following function:
library(dplyr)
table <- data %>% group_by(clust) %>% summarise(max = max(n), name1 = T2[which.max(n)])
I get this output
clust max name1
1 3 b
2 5 d
3 4 a
4 8 c
etc
However, there are cases where two or more T2 values correspond to max(n). How can I record those values too?
i.e.
clust max name1
1 3 b,c
2 5 d
3 4 a
4 8 c
etc
or
clust max name1
1 3 b
1 3 c
2 5 d
3 4 a
4 8 c
etc
We can use == instead of which.max (which returns only the index of the first maximum) and paste the matching values together with toString
library(dplyr)
library(tidyr)
data %>%
  group_by(clust) %>%
  summarise(max = max(n), name1 = toString(T2[n == max(n)]))
# A tibble: 4 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b, c
#2 2 5 d
#3 3 4 a
#4 4 8 c
and this can be expanded with separate_rows in the next step
data %>%
  group_by(clust) %>%
  summarise(max = max(n), name1 = toString(T2[n == max(n)])) %>%
  separate_rows(name1, sep=",\\s+")
# A tibble: 5 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b
#2 1 3 c
#3 2 5 d
#4 3 4 a
#5 4 8 c
Or have a list column and then unnest
data %>%
  group_by(clust) %>%
  summarise(max = max(n), name1 = list(T2[n == max(n)])) %>%
  unnest(c(name1))
# A tibble: 5 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b
#2 1 3 c
#3 2 5 d
#4 3 4 a
#5 4 8 c
data
data <- structure(list(clust = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L, 4L,
4L), T2 = c("a", "b", "c", "d", "a", "b", "b", "c", "t", "e"),
n = c(1L, 3L, 3L, 5L, 4L, 3L, 5L, 8L, 6L, 7L)),
class = "data.frame", row.names = c(NA,
-10L))
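The difference between `which.max` and the `==` comparison is easiest to see on a single cluster. The following small sketch uses the values of cluster 1 from the data above; the variable names `first_only` and `all_ties` are just for illustration.

```r
# Values for cluster 1: "b" and "c" are tied at the maximum
n  <- c(1L, 3L, 3L)
T2 <- c("a", "b", "c")

# which.max() stops at the first maximum
first_only <- T2[which.max(n)]

# The logical comparison keeps every position equal to the maximum
all_ties <- T2[n == max(n)]

first_only
toString(all_ties)
```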
I have a data frame with 163 observations and 65 columns with some animal data. The 163 observations are from 56 animals, and each was supposed to have triplicated records, but some information was lost so for the majority of animals, I have triplicates ("A", "B", "C") and for some I have only duplicates (which vary among "A" and "B", "A" and "C" and "B" and "C").
Columns 13:65 contain some information I would like to sum, and I want to retain only the one replicate with the highest rowSums value. So my data frame would be something like this:
ID Trip Acet Cell Fibe Mega Tera
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3
I am not sure if what I need is to write my own function, or a loop, or what the best alternative actually is - sorry I am still learning and unfortunately for me, I don't think like a programmer so that makes things even more challenging...
So what I want is to keep only rows 2 and 6 (which have the highest rowSums among the triplicates of each animal), but for the whole data frame. What I want as a result is
ID Trip Acet Cell Fibe Mega Tera
1 4 B 9 3 7 5 5
2 12 C 5 5 7 3 3
REALLY sorry if the question is poorly elaborated or if it doesn't make sense, this is my first time asking a question here and I have only recently started learning R.
We can compute the row sums separately and use ave to find, within each ID, the row whose sum equals the group maximum. The resulting logical vector is then used to subset the rows of the dataset.
nm1 <- startsWith(names(df1), "V")
The OP has since updated the column names. In that case, either use an index
nm1 <- 3:7
Or select the columns with setdiff
nm1 <- setdiff(names(df1), c("ID", "Trip"))
v1 <- rowSums(df1[nm1], na.rm = TRUE)
i1 <- with(df1, v1 == ave(v1, ID, FUN = max))
df1[i1,]
# ID Trip V1 V2 V3 V4 V5
#2 4 B 9 3 7 5 5
#6 12 C 5 5 7 3 3
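To make the role of `ave` more concrete, here is a sketch of the intermediate vectors, written against the column names from the updated question (Acet to Tera) rather than V1 to V5. `group_max` is an illustrative name that does not appear in the code above.

```r
df1 <- data.frame(
  ID   = c(4L, 4L, 4L, 12L, 12L, 12L),
  Trip = c("A", "B", "C", "A", "B", "C"),
  Acet = c(2L, 9L, 1L, 4L, 6L, 5L),
  Cell = c(4L, 3L, 2L, 6L, 8L, 5L),
  Fibe = c(9L, 7L, 4L, 7L, 1L, 7L),
  Mega = c(8L, 5L, 8L, 2L, 1L, 3L),
  Tera = c(3L, 5L, 6L, 3L, 2L, 3L)
)

nm1 <- setdiff(names(df1), c("ID", "Trip"))

# One sum per row
v1 <- rowSums(df1[nm1], na.rm = TRUE)

# ave() repeats the group maximum on every row of that group
group_max <- ave(v1, df1$ID, FUN = max)

# TRUE only where a row's own sum equals its group's maximum
i1 <- v1 == group_max
df1[i1, ]
```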
data
df1 <- structure(list(ID = c(4L, 4L, 4L, 12L, 12L, 12L), Trip = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
V1 = c(2L, 9L, 1L, 4L, 6L, 5L), V2 = c(4L, 3L, 2L, 6L, 8L,
5L), V3 = c(9L, 7L, 4L, 7L, 1L, 7L), V4 = c(8L, 5L, 8L, 2L,
1L, 3L), V5 = c(3L, 5L, 6L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Here is one way.
library(tidyverse)
dat2 <- dat %>%
  mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
  group_by(ID) %>%
  filter(Sum == max(Sum)) %>%
  select(-Sum) %>%
  ungroup()
dat2
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
Here is another one. This method makes sure only one row is kept per ID, even if there are multiple rows whose row sum equals the maximum.
dat3 <- dat %>%
  arrange_all <- NULL
dat3
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
DATA
dat <- read.table(text = " ID Trip V1 V2 V3 V4 V5
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3 ",
header = TRUE)
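The two approaches only differ when sums are tied within an ID. A small base R sketch of that difference, using a made-up data frame `tie_dat` (hypothetical, not part of the question's data), might look like this:

```r
# Hypothetical data: the two rows of ID 1 have the same row sum (7)
tie_dat <- data.frame(
  ID = c(1L, 1L, 2L),
  V1 = c(3L, 4L, 5L),
  V2 = c(4L, 3L, 1L)
)
s <- rowSums(tie_dat[c("V1", "V2")])

# Comparison against the group maximum keeps every tied row
keep_all <- s == ave(s, tie_dat$ID, FUN = max)

# which.max() within each group keeps only the first of the tied rows
keep_one <- unlist(lapply(
  split(seq_len(nrow(tie_dat)), tie_dat$ID),
  function(idx) idx[which.max(s[idx])]
))

tie_dat[keep_all, ]
tie_dat[keep_one, ]
```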
I've tried searching for an answer for this, but most data.frame/matrix transpositions aren't as complicated as what I am trying to accomplish. Basically I have a data.frame which looks like
F M A
2008_b 1 5 6
2008_r 3 3 6
2008_a 4 1 5
2009_b 1 1 2
2009_r 5 4 9
2009_a 2 2 4
I'm trying to transpose it and rename the column and row names as such:
F_b M_b A_b F_r M_r A_r F_a M_a A_a
2008 1 5 6 3 3 6 4 1 5
2009 1 1 2 5 4 9 2 2 4
Essentially, every three rows are being collapsed into a single row. I assume this can be done with some clever plyr or reshape2 commands, but I'm at a total loss as to how to accomplish it.
You could try
library(dplyr)
library(tidyr)
lvl <- c(outer(colnames(df), unique(gsub(".*_", "", rownames(df))),
               FUN=paste, sep="_"))
res <- cbind(Var1=row.names(df), df) %>%
  gather(Var2, value, -Var1) %>%
  separate(Var1, c('Var11', 'Var12')) %>%
  unite(VarN, Var2, Var12) %>%
  mutate(VarN=factor(VarN, levels=lvl)) %>%
  spread(VarN, value)
row.names(res) <- res[,1]
res1 <- res[,-1]
res1
# F_b M_b A_b F_r M_r A_r F_a M_a A_a
#2008 1 5 6 3 3 6 4 1 5
#2009 1 1 2 5 4 9 2 2 4
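For comparison, the same collapse can be sketched in base R by splitting the rows on the suffix and binding the pieces side by side. This is only a sketch under the assumption that every suffix (b, r, a) covers the same years in the same order; `type`, `pieces` and `res_base` are illustrative names.

```r
df <- data.frame(
  F = c(1L, 3L, 4L, 1L, 5L, 2L),
  M = c(5L, 3L, 1L, 1L, 4L, 2L),
  A = c(6L, 6L, 5L, 2L, 9L, 4L),
  row.names = c("2008_b", "2008_r", "2008_a", "2009_b", "2009_r", "2009_a")
)

# Suffix after the underscore, kept in order of first appearance (b, r, a)
type <- sub(".*_", "", rownames(df))
pieces <- split(df, factor(type, levels = unique(type)))

# For each suffix: keep the year as row name and tag the columns with the suffix
res_base <- do.call(cbind, lapply(names(pieces), function(tp) {
  p <- pieces[[tp]]
  rownames(p) <- sub("_.*", "", rownames(p))
  names(p) <- paste(names(p), tp, sep = "_")
  p
}))
res_base
```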
data
df <- structure(list(F = c(1L, 3L, 4L, 1L, 5L, 2L), M = c(5L, 3L, 1L,
1L, 4L, 2L), A = c(6L, 6L, 5L, 2L, 9L, 4L)), .Names = c("F",
"M", "A"), class = "data.frame", row.names = c("2008_b", "2008_r",
"2008_a", "2009_b", "2009_r", "2009_a"))