Conditional Column Formatting - r

I have a data frame that looks like this:
cat df1 df2 df3
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 2 1 NA NA
5 2 2 NA NA
6 2 3 NA NA
I want to populate df3 so that when cat = 1, df3 = df2 and when cat = 2, df3 = df1. However I am getting a few different error messages.
My current code looks like this:
df$df3[df$cat == 1] <- df$df2
df$df3[df$cat == 2] <- df$df1

Try this code, which uses the same row condition on both sides of each assignment:
df[df$cat==1,"df3"]<-df[df$cat==1,"df2"]
df[df$cat==2,"df3"]<-df[df$cat==2,"df1"]
The output:
df
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3

You can try
ifelse(df$cat == 1, df$df2, df$df1)
[1] 1 2 3 1 2 3
# saving
df$df3 <- ifelse(df$cat == 1, df$df2, df$df1)
# if cat can take values other than 1 and 2, use a nested ifelse
# that sets df3 to NA for those rows
ifelse(df$cat == 1, df$df2, ifelse(df$cat == 2, df$df1, NA))
# or you can try a tidyverse solution.
library(tidyverse)
df %>%
mutate(df3=case_when(cat == 1 ~ df2,
cat == 2 ~ df1))
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3
# data
df <- structure(list(cat = c(1L, 1L, 1L, 2L, 2L, 2L), df1 = c(NA, NA,
NA, 1L, 2L, 3L), df2 = c(1L, 2L, 3L, NA, NA, NA), df3 = c(NA,
NA, NA, NA, NA, NA)), .Names = c("cat", "df1", "df2", "df3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
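As to why the code in the question misbehaves: the left-hand side df$df3[df$cat == 1] selects three elements, while the right-hand side df$df2 supplies all six, so the lengths do not line up. A minimal sketch, assuming the example data from the question, that reuses one logical mask on both sides (the names is_one and is_two are illustrative):

```r
# Example data from the question, rebuilt with data.frame()
df <- data.frame(
  cat = c(1L, 1L, 1L, 2L, 2L, 2L),
  df1 = c(NA, NA, NA, 1L, 2L, 3L),
  df2 = c(1L, 2L, 3L, NA, NA, NA),
  df3 = NA_integer_
)

# Build each mask once and apply it to the target AND the source,
# so both sides of the assignment have the same length
is_one <- df$cat == 1
is_two <- df$cat == 2

df$df3[is_one] <- df$df2[is_one]
df$df3[is_two] <- df$df1[is_two]

df
```

Storing the mask in a variable also avoids typing the condition twice, which is where mismatched conditions tend to creep in.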

Related

Merging using index as key in R

I have two data frames
df1:
01.2020 02.2020 03.2020
11190 4 1 2
12345 3 3 1
11323 1 2 2
df2
08.2020 04.2020 09.2020
11190 1 2 2
12345 1 2 3
11324 1 2 2
Dummy Data -
df1 <- structure(list(`01.2020` = c(4L, 3L, 1L), `02.2020` = c(1L, 3L, 2L), `03.2020` = c(2L, 1L, 2L)), class = "data.frame", row.names = c("11190","12345", "11323"))
df2 <- structure(list(`08.2020` = c(1L, 1L, 1L), `04.2020` = c(2L, 2L, 2L), `09.2020` = c(2L, 3L, 2L)), class = "data.frame", row.names = c("11190", "12345", "11324"))
I want to do an outer merge of these two data frames, using the index (row names) as the key.
How can I do that? What should go in place of by=?
merge(x = sheet1_UN, y = sheet2_UN, by = "" , all = TRUE)
I want my final dataframe to look something like this
01.2020 02.2020 03.2020 08.2020 04.2020 09.2020
11190 4 1 2 1 2 2
12345 3 3 1 1 2 3
11323 1 2 2 - - -
11324 - - - 1 2 2
Thanks in advance.
One option is base merge() with the row names as the key:
df3 <- merge(df1, df2, by = "row.names", all = TRUE)
output:
Row.names 01.2020 02.2020 03.2020 08.2020 04.2020 09.2020
1 11190 4 1 2 1 2 2
2 11323 1 2 2 NA NA NA
3 11324 NA NA NA 1 2 2
4 12345 3 3 1 1 2 3
With dplyr and tibble, move the row names into an id column and use full_join():
library(dplyr)
library(tibble)
df1 %>% rownames_to_column('id') %>%
full_join(df2 %>% rownames_to_column('id'), by='id')
output:
id 01.2020 02.2020 03.2020 08.2020 04.2020 09.2020
1 11190 4 1 2 1 2 2
2 12345 3 3 1 1 2 3
3 11323 1 2 2 NA NA NA
4 11324 NA NA NA 1 2 2
If you want '-' instead of NA, you can use tidyr::replace_na('-'). Note that this converts every column to character:
df1 %>% rownames_to_column('id') %>%
full_join(df2 %>% rownames_to_column('id'), by='id') %>%
mutate(across(everything(), ~.x %>% as.character %>% replace_na('-')))
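If the merged result should keep the IDs as row names rather than in a Row.names column, something along these lines may work. This is a sketch using the dummy data from the question; the name merged is illustrative.

```r
df1 <- structure(list(`01.2020` = c(4L, 3L, 1L), `02.2020` = c(1L, 3L, 2L),
                      `03.2020` = c(2L, 1L, 2L)),
                 class = "data.frame", row.names = c("11190", "12345", "11323"))
df2 <- structure(list(`08.2020` = c(1L, 1L, 1L), `04.2020` = c(2L, 2L, 2L),
                      `09.2020` = c(2L, 3L, 2L)),
                 class = "data.frame", row.names = c("11190", "12345", "11324"))

# Outer merge keyed on the row names (by = 0 is an equivalent spelling)
merged <- merge(df1, df2, by = "row.names", all = TRUE)

# merge() stores the key in a column called Row.names;
# move it back into the row names and drop the column
rownames(merged) <- merged$Row.names
merged$Row.names <- NULL

merged
```

The row names are compared as character strings, so the result is ordered by the key as text rather than by the original row order.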

Finding if a value is within the range of other columns

I have a dataframe df which looks like this:
Input:
df <- read.table(text =
"ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_LS Q2_overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA"
, header = TRUE)
Desired Output:
To explain a little further, my desired output is as below:
ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_LS Q2_overall Q1_check Q2_check
1 1 2 3 1 2 2 "above" "within"
2 0 NA NA 2 1 1 NA "within"
3 2 1 1 3 4 0 "within" "below"
4 1 0 2 4 0 2 "above" "within"
5 NA 1 NA 0 NA 0 NA "within"
6 2 0 1 1 NA NA "within" NA
Explanation:
Example 1:
Based on the values in columns Q1_PM and Q1_TP, I want to check whether the value in column Q1_overall lies within their range. If it does not, is it above or below the range? To track this, I want to add a column Q1_check.
Example 2:
Similarly, based on the values of Q2_PM and Q2_LS, I want to check whether the value of Q2_overall lies within their range and, if not, whether it is above or below it. Again, I want to track this in an additional column, Q2_check.
Requirements:
1- I want to add columns Q1_check and Q2_check: the first for the comparisons that involve the Q1 items, the second for those that involve the Q2 items.
2- The columns can contain the values above, below and within.
3- Where an overall column is NA, the corresponding check column can also be NA.
Related posts:
I have looked for related posts such as:
Add column with values depending on another column to a dataframe
and Create categories by comparing a numeric column with a fixed value
but I am running into errors as discussed below.
Partial Solution:
The only solution, I can think of is, along these lines:
df$Q1_check <- ifelse(data$Q1_overall < data$Q1_PM, 'below',
ifelse(data$Q1_overall > data$Q1_TP, 'above',
ifelse(is.na(data$Q1_overall), NA, 'within')))
But it results in following error: Error in data$Q1_overall : object of type 'closure' is not subsettable. I do not understand what the possible issue could be.
OR
df %>%
mutate(Regulation = case_when(Q1_overall < Q1_PM ~ 'below',
Q1_overall > Q1_TP ~ 'above',
Q1_PM < Q1_overall < Q1_TP, 'within'))
This also results in error Error: unexpected '<' in: "Q1_overall > Q1_TP ~ 'above', Q1_PM < Q1_overall <"
Edit 1:
How can the solution be extended if (let's say) the columns are these:
"Q1 Comm - 01 Scope Thesis"
"Q1 Comm - 02 Scope Project"
"Q1 Comm - 03 Learn Intern"
"Q1 Comm - 04 Biography"
"Q1 Comm - 05 Exhibit"
"Q1 Comm - 06 Social Act"
"Q1 Comm - 07 Post Project"
"Q1 Comm - 08 Learn Plant"
"Q1 Comm - 09 Study Narrate"
"Q1 Comm - 10 Learn Participate"
"Q1 Comm - 11 Write 1"
"Q1 Comm - 12 Read 2"
"Q1 Comm - Overall Study Plan"
How can we identify when the column Q1 Comm - Overall Study Plan is:
1 - Below the min() of all the other columns, or
2 - Above the max() of all the other columns, or
3 - Within the range of all the other columns
Edit 2:
For the updated fields, I am also including the dput(df)
dput(df)
structure(list(ï..ID = c(10L, 31L, 225L, 243L), Q1.Comm...01.Scope.Thesis = c(NA,
2L, 0L, NA), Q1.Comm...02.Scope.Project = c(NA, NA, NA, 2L),
Q1.Comm...03.Learn.Intern = c(4L, NA, NA, NA), Q1.Comm...04.Biography = c(NA,
NA, NA, 1L), Q1.Comm...05.Exhibit = c(4L, 2L, NA, NA), Q1.Comm...06.Social.Act = c(NA,
NA, NA, 3L), Q1.Comm...07.Post.Project = c(NA, NA, 3L, NA
), Q1.Comm...08.Learn.Plant = c(NA, NA, NA, 4L), Q1.Comm...09.Study.Narrate = c(NA,
NA, 0L, NA), Q1.Comm...10.Learn.Participate = c(4L, NA, NA,
NA), Q1.Comm...11.Write.1 = c(NA, 2L, NA, NA), Q1.Comm...12.Read.2 = c(NA,
NA, 1L, NA), Q1.Comm...Overall.Study.Plan = c(4L, 1L, 2L,
NA), X = c(NA, NA, NA, NA), X.1 = c(NA, NA, NA, NA), X.2 = c(NA,
NA, NA, NA)), class = "data.frame", row.names = c(NA, -4L
))
Any advice on how to achieve this would be greatly appreciated. Thank you!
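This is a rather long-winded approach. The column names below use dots (Q1.PM and so on); change them to Q1_PM etc. if your data uses underscores: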
library(dplyr)
comparison <- function(x, y, z) {
case_when(is.na(z) ~ NA_character_,
z >= x & z <= y |
z >= y & z <= x |
is.na(x) & y == z |
is.na(y) & x == z ~ 'within',
z > x & z > y ~ 'above',
TRUE ~ 'below')
}
df %>%
mutate(Q1_check = comparison(Q1.PM, Q1.TP, Q1.overall),
Q2_check = comparison(Q2.PM, Q2.LS, Q2.overall))
# ID Q1.PM Q1.TP Q1.overall Q2.PM Q2.LS Q2.overall Q1_check Q2_check
#1 1 1 2 3 1 2 2 above within
#2 2 0 NA NA 2 1 1 <NA> within
#3 3 2 1 1 3 4 0 within below
#4 4 1 0 2 4 0 2 above within
#5 5 NA 1 NA 0 NA 0 <NA> within
#6 6 2 0 1 1 NA NA within <NA>
df <- read.table(text =
"ID Q1-PM Q1-TP Q1-overall Q2-PM Q2-LS Q2-overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA"
, header = TRUE)
library(tidyverse)
f <- function(x, y, z){
case_when(
z < pmin(x, y, na.rm = TRUE) ~ "below",
z > pmax(x, y, na.rm = TRUE) ~ "above",
between(z, pmin(x, y, na.rm = TRUE), pmax(x, y, na.rm = TRUE)) ~ "within"
)
}
df %>%
rowwise() %>%
mutate(Q1_check = f(Q1.PM, Q1.TP, Q1.overall),
Q2_check = f(Q2.PM, Q2.LS, Q2.overall))
#> # A tibble: 6 x 9
#> # Rowwise:
#> ID Q1.PM Q1.TP Q1.overall Q2.PM Q2.LS Q2.overall Q1_check Q2_check
#> <int> <int> <int> <int> <int> <int> <int> <chr> <chr>
#> 1 1 1 2 3 1 2 2 above within
#> 2 2 0 NA NA 2 1 1 <NA> within
#> 3 3 2 1 1 3 4 0 within below
#> 4 4 1 0 2 4 0 2 above within
#> 5 5 NA 1 NA 0 NA 0 <NA> within
#> 6 6 2 0 1 1 NA NA within <NA>
Created on 2021-06-09 by the reprex package (v2.0.0)
If your columns are named consistently, you can do this for any number of Qs at once. To make that possible I changed two things in the sample data:
- replaced the - in the column names with _, which is valid in R names
- renamed Q2_LS to Q2_TP so that both questions follow the same pattern
What it does:
- It picks up every column that ends with _overall (two here, but it can be any number).
- For each of those columns it looks up the matching _PM and _TP columns by replacing _overall in the current column name, using get() together with cur_column() and stringr::str_replace().
- If the value is less than both the _PM and the _TP value, it assigns below.
- If the value is greater than both, it assigns above.
- If the value is NA, it assigns NA_character_.
- Otherwise it assigns within.
- The new columns are named by removing _overall from the source column name and appending _check. This uses stringr::str_remove() inside the .names argument of across(), which takes a glue-style specification.
df <- read.table(text =
"ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA"
, header = TRUE)
df
#> ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall
#> 1 1 1 2 3 1 2 2
#> 2 2 0 NA NA 2 1 1
#> 3 3 2 1 1 3 4 0
#> 4 4 1 0 2 4 0 2
#> 5 5 NA 1 NA 0 NA 0
#> 6 6 2 0 1 1 NA NA
library(tidyverse)
df %>% mutate(across(ends_with('overall'), ~ case_when(. < pmin(get(str_replace(cur_column(), '_overall', '_PM')),
get(str_replace(cur_column(), '_overall', '_TP'))) ~ 'below',
. > pmax(get(str_replace(cur_column(), '_overall', '_PM')),
get(str_replace(cur_column(), '_overall', '_TP'))) ~ 'above',
is.na(.) ~ NA_character_,
TRUE ~ 'within'),
.names = '{str_remove(.col,"_overall")}_check'))
#> ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall Q1_check Q2_check
#> 1 1 1 2 3 1 2 2 above within
#> 2 2 0 NA NA 2 1 1 <NA> within
#> 3 3 2 1 1 3 4 0 within below
#> 4 4 1 0 2 4 0 2 above within
#> 5 5 NA 1 NA 0 NA 0 <NA> within
#> 6 6 2 0 1 1 NA NA within <NA>
Created on 2021-06-09 by the reprex package (v2.0.0)
Largely based on Ronak's great solution:
df <- structure(list(ID = c(10L, 31L, 225L, 243L),
`Q1 Comm - 01 Scope Thesis` = c(NA, 2L, 0L, NA),
`Q1 Comm - 02 Scope Project` = c(NA, NA, NA, 2L),
`Q1 Comm - 03 Learn Intern` = c(4L, NA, NA, NA),
`Q1 Comm - 04 Biography` = c(NA, NA, NA, 1L),
`Q1 Comm - 05 Exhibit` = c(4L, 2L, NA, NA),
`Q1 Comm - 06 Social Act` = c(NA, NA, NA, 3L),
`Q1 Comm - 07 Post Project` = c(NA, NA, 3L, NA),
`Q1 Comm - 08 Learn Plant` = c(NA, NA, NA, 4L),
`Q1 Comm - 09 Study Narrate` = c(NA, NA, 0L, NA),
`Q1 Comm - 10 Learn Participate` = c(4L, NA, NA,NA),
`Q1 Comm - 11 Write 1` = c(NA, 2L, NA, NA),
`Q1 Comm - 12 Read 2` = c(NA, NA, 1L, NA),
`Q1 Comm - Overall Study Plan` = c(4L, 1L, 2L, NA),
X = c(NA, NA, NA, NA),
`X 1` = c(NA, NA, NA, NA),
`X 2` = c(NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -4L))
library(dplyr)
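comparison <- function(df, prefix) {
# keep only the columns that belong to this prefix
df <- df[grep(prefix, colnames(df))]
# row-wise min and max over every column except the "Overall" one
row_min <- apply(df[-grep("Overall", colnames(df))], 1, min, na.rm = TRUE)
row_max <- apply(df[-grep("Overall", colnames(df))], 1, max, na.rm = TRUE)
# extract the "Overall" column as a plain vector
z <- df[[grep("Overall", colnames(df))]]
case_when(is.na(z) ~ NA_character_,
z >= row_min & z <= row_max ~ 'within',
z > row_max ~ 'above',
TRUE ~ 'below')
}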
prefixes <- sub(" \\- Overall.*", "", colnames(df[grep("Overall", colnames(df))]))
for (i in prefixes) {
df <- df %>%
mutate("{i} - Check" := comparison(df, i))
}
> print(df)
ID Q1 Comm - 01 Scope Thesis Q1 Comm - 02 Scope Project Q1 Comm - 03 Learn Intern Q1 Comm - 04 Biography
1 10 NA NA 4 NA
2 31 2 NA NA NA
3 225 0 NA NA NA
4 243 NA 2 NA 1
Q1 Comm - 05 Exhibit Q1 Comm - 06 Social Act Q1 Comm - 07 Post Project Q1 Comm - 08 Learn Plant
1 4 NA NA NA
2 2 NA NA NA
3 NA NA 3 NA
4 NA 3 NA 4
Q1 Comm - 09 Study Narrate Q1 Comm - 10 Learn Participate Q1 Comm - 11 Write 1 Q1 Comm - 12 Read 2
1 NA 4 NA NA
2 NA NA 2 NA
3 0 NA NA 1
4 NA NA NA NA
Q1 Comm - Overall Study Plan X X 1 X 2 Q1 Comm - Check
1 4 NA NA NA within
2 1 NA NA NA below
3 2 NA NA NA within
4 NA NA NA NA <NA>
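A data.table version that handles each question group in turn (it expects the Q1_PM / Q1_overall style names from the original input):
library(data.table)
library(purrr)
library(stringr)
# data.table::transpose() is spelled out below because purrr also exports a transpose()
comparison <- function(dt, group_cols, new_col, compare_col){
dt[,
c("min", "max") := data.table::transpose(pmap(.SD, range, na.rm = TRUE)), .SDcols = group_cols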
][,(new_col) := fcase(
is.na(get(compare_col)), NA_character_,
get(compare_col) < min, "below",
get(compare_col) > max, "above",
default = "within"
)
][]
}
group_cols <- names(df) %>%
str_subset("^Q[0-9]+") %>%
str_subset("overall", negate = TRUE) %>%
split(str_extract(., "^Q[0-9]+"))
new_cols <- names(group_cols) %>% str_c("_check")
compare_cols <- names(group_cols) %>% str_c("_overall")
setDT(df)
pwalk(list(group_cols, new_cols, compare_cols), ~comparison(df, ...))
df[, c("min", "max") := NULL]
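For the case in Edit 1, where the overall column has to be compared against many other columns, a base R helper built on pmin() and pmax() may be enough. This is only a sketch: check_range is a hypothetical name, and it is demonstrated on the small example data from the question rather than on the wider table.

```r
df <- read.table(text =
"ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_LS Q2_overall
1 1 2 3 1 2 2
2 0 NA NA 2 1 1
3 2 1 1 3 4 0
4 1 0 2 4 0 2
5 NA 1 NA 0 NA 0
6 2 0 1 1 NA NA", header = TRUE)

# others:  data frame holding any number of columns to compare against
# overall: vector holding the value to classify
check_range <- function(others, overall) {
  cols <- unname(as.list(others))
  # row-wise minimum and maximum, ignoring NAs within each row
  lo <- do.call(pmin, c(cols, na.rm = TRUE))
  hi <- do.call(pmax, c(cols, na.rm = TRUE))
  out <- rep("within", length(overall))
  # which() drops NA comparisons, so rows with an undefined range stay "within"
  out[which(overall < lo)] <- "below"
  out[which(overall > hi)] <- "above"
  out[is.na(overall)] <- NA_character_
  out
}

df$Q1_check <- check_range(df[c("Q1_PM", "Q1_TP")], df$Q1_overall)
df$Q2_check <- check_range(df[c("Q2_PM", "Q2_LS")], df$Q2_overall)
df
```

For the wider table, the first argument would be the data frame restricted to all the "Q1 Comm" columns except the overall one, for example selected with grep() on the column names.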

R Aggregate multiple rows

My question seems to be a very common one, but the solutions I found on the internet don't work.
I would like to aggregate rows in a data frame in R.
Here is the structure of my data frame (df); it is a table of citations:
Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Kaufmann V NA 1 NA NA
Kaufmann V NA NA 1 NA
Kaufmann V NA NA NA 1
Orfeuil P 1 NA NA NA
Orfeuil P NA 1 NA NA
Sorokin P NA NA NA 1
That is I would like to have :
Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 4 NA NA NA
Kaufmann V NA 1 1 1
Orfeuil P 1 1 NA NA
Sorokin P NA NA NA 1
I have tried these solutions, but they don't work:
ddply(df,"Autors", numcolwise(sum))
and
df %>% group_by(Autors) %>% summarize_all(sum)
The rows are aggregated correctly, but the values (the sums of the 1 values) are not correct at all, and I don't understand why.
Do you have any idea?
Thank you very much !
Joël
You can also do the summing using rowsum(), although it (perhaps misleadingly) gives sums of 0 rather than NA for cells in the output that had only NA's for input.
rowsum(df[,c(2:5)],df$Autors,na.rm=TRUE)
Gives:
Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 4 0 0 0
Kaufmann V 0 1 1 1
Orfeuil P 1 1 0 0
Sorokin P 0 0 0 1
It could be because na.rm is not used; without it, any group that contains an NA sums to NA:
library(dplyr)
df %>%
group_by(Autors) %>%
summarize_all(sum, na.rm = TRUE)
If both plyr and dplyr are loaded, summarise() can be masked by the plyr version. summarise_all(), however, exists only in dplyr, so masking is unlikely to be the problem here.
With na.rm = TRUE all NAs are removed, so a group that contains only NAs returns 0. To get NA in that case, as in the expected output, add a condition:
df %>%
group_by(Autors) %>%
# funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
summarize_all(~ if (all(is.na(.))) NA else sum(., na.rm = TRUE))
# A tibble: 4 x 5
# Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
# <chr> <int> <int> <int> <int>
#1 Burgess E 4 NA NA NA
#2 Kaufmann V NA 1 1 1
#3 Orfeuil P 1 1 NA NA
#4 Sorokin P NA NA NA 1
data
df <- structure(list(Autors = c("Burgess E", "Burgess E", "Burgess E",
"Burgess E", "Kaufmann V", "Kaufmann V", "Kaufmann V", "Orfeuil P",
"Orfeuil P", "Sorokin P"), Lannoy_2016 = c(1L, 1L, 1L, 1L, NA,
NA, NA, 1L, NA, NA), Ramadier_2014 = c(NA, NA, NA, NA, 1L, NA,
NA, NA, 1L, NA), Lord_2009 = c(NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA), Ortar_2008 = c(NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L
)), .Names = c("Autors", "Lannoy_2016", "Ramadier_2014", "Lord_2009",
"Ortar_2008"), class = "data.frame", row.names = c(NA, -10L))
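The same "NA only when the whole group is NA" rule can also be sketched in base R with aggregate(). The helper name sum_or_na is illustrative, and the data are the example data from the question.

```r
df <- data.frame(
  Autors = c("Burgess E", "Burgess E", "Burgess E", "Burgess E", "Kaufmann V",
             "Kaufmann V", "Kaufmann V", "Orfeuil P", "Orfeuil P", "Sorokin P"),
  Lannoy_2016   = c(1L, 1L, 1L, 1L, NA, NA, NA, 1L, NA, NA),
  Ramadier_2014 = c(NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA),
  Lord_2009     = c(NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA),
  Ortar_2008    = c(NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L)
)

# Sum a group, but return NA when the group has no non-missing value
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
}

# The data frame method passes NAs in the value columns through to FUN
res <- aggregate(df[-1], by = list(Autors = df$Autors), FUN = sum_or_na)
res
```

This keeps the distinction between "no citations recorded" (NA) and an actual count, which rowsum() with na.rm = TRUE does not.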

How to combine rowSums and ifelse with mutate

I need to combine rowSums and ifelse in order to create a new variable. My data looks like this:
boss var1 var2 var3 newvar
1 NA NA 3 NA
1 2 3 3 8
2 NA NA NA 0
2 NA NA NA 0
2 NA NA NA 0
1 1 NA 2 3
if boss==1, and there's more than one missing value in var1 to var3, newvar should be NA, otherwise, it should be the result of var1+var2+var3
If boss==2, newvar should be automatically 0.
So far, I have been able to solve parts of the problem using dplyr:
mutate(newvar=rowSums(.[,2:4],na.rm=TRUE) +
ifelse(rowSums(is.na(.[,2:4]))>1 & boss==2,NA,0))
mutate(newvar=ifelse(boss==2,0,NA))
However, I'm struggling to combine the two. Any help is much appreciated.
Here is one option with case_when where we create an index ('i1') which computes the number of NA elements in the row. The index is used in the case_when to create logical conditions to assign the values
df %>%
mutate(i1 = rowSums(is.na(.[-1]))) %>%
mutate(newvar = case_when(i1 > 1 & boss==1 ~ NA_integer_,
boss==2 ~ 0L,
i1 <=1 & boss != 2~ as.integer(rowSums(.[2:4], na.rm = TRUE)))) %>%
select(-i1)
# boss var1 var2 var3 newvar
#1 1 NA NA 3 NA
#2 1 2 3 3 8
#3 2 NA NA NA 0
#4 2 NA NA NA 0
#5 2 NA NA NA 0
#6 1 1 NA 2 3
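In base R, this can be done with a logical index and without any ifelse. The trick is that NA^FALSE is 1 and NA^TRUE is NA, so multiplying by NA^(condition) turns the rows that meet the condition into NA and leaves the others unchanged: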
i1 <- df$boss != 2
tmp <- i1 * df[-1]
df$newvar <- NA^(rowSums(is.na(tmp)) > 1 & i1) * rowSums(tmp, na.rm = TRUE)
df$newvar
#[1] NA 8 0 0 0 3
data
df <- structure(list(boss = c(1L, 1L, 2L, 2L, 2L, 1L), var1 = c(NA,
2L, NA, NA, NA, 1L), var2 = c(NA, 3L, NA, NA, NA, NA), var3 = c(3L,
3L, NA, NA, NA, 2L)), .Names = c("boss", "var1", "var2", "var3"
), row.names = c(NA, -6L), class = "data.frame")
A base R solution using apply(), which evaluates the conditions row by row:
df$newvar <- apply(df,1, function(x){
if(x["boss"]==2){
0
} else if(sum(is.na(x[-1])) > 1){
NA
} else{
sum(x[-1], na.rm = TRUE)
}
})
# boss var1 var2 var3 newvar
# 1 1 NA NA 3 NA
# 2 1 2 3 3 8
# 3 2 NA NA NA 0
# 4 2 NA NA NA 0
# 5 2 NA NA NA 0
# 6 1 1 NA 2 3
Data:
df <- read.table(text =
"boss var1 var2 var3
1 NA NA 3
1 2 3 3
2 NA NA NA
2 NA NA NA
2 NA NA NA
1 1 NA 2",
header = TRUE, stringsAsFactors = FALSE)
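For comparison, the two conditions from the question can also be combined directly with a nested ifelse(), which is closest to what the question attempted. A minimal sketch on the same data; the helper names vars, n_missing and total are illustrative.

```r
df <- read.table(text =
"boss var1 var2 var3
1 NA NA 3
1 2 3 3
2 NA NA NA
2 NA NA NA
2 NA NA NA
1 1 NA 2",
header = TRUE, stringsAsFactors = FALSE)

vars <- c("var1", "var2", "var3")
n_missing <- rowSums(is.na(df[vars]))        # missing values per row
total     <- rowSums(df[vars], na.rm = TRUE) # row sum, ignoring NAs

# boss == 2          -> 0
# more than one NA   -> NA
# otherwise          -> sum of the available values
df$newvar <- ifelse(df$boss == 2, 0,
                    ifelse(n_missing > 1, NA, total))
df
```

Checking boss == 2 first means the all-NA rows for that group never reach the missing-value test.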

Rearrange data by matching columns

I am having trouble rearranging some data.
The original data is:
structure(list(id = 1:3, artery.1 = structure(c(1L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), artery.2 = structure(c(1L, NA, 2L), .Label = c("b",
"c"), class = "factor"), artery.3 = structure(c(1L, NA, 2L), .Label = c("c",
"d"), class = "factor"), artery.4 = structure(c(NA, NA, 1L), .Label = "e", class = "factor"), artery.5 = structure(c(NA, NA, 1L), .Label = "f", class = "factor"),
diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L), diameter.3 = c(3L,
NA, 3L), diameter.4 = c(NA, NA, 4L), diameter.5 = c(NA, NA,
5L)), .Names = c("id", "artery.1", "artery.2", "artery.3",
"artery.4", "artery.5", "diameter.1", "diameter.2", "diameter.3",
"diameter.4", "diameter.5"), class = "data.frame", row.names = c(NA,
-3L))
# id artery.1 artery.2 artery.3 artery.4 artery.5 diameter.1 diameter.2 diameter.3 diameter.4 diameter.5
# 1 1 a b c <NA> <NA> 3 2 3 NA NA
# 2 2 a <NA> <NA> <NA> <NA> 2 NA NA NA NA
# 3 3 b c d e f 1 2 3 4 5
I would like to get to this:
structure(list(id = 1:3, a = c(3L, 2L, NA), b = c(2L, NA, 1L),
c = c(3L, NA, 2L), d = c(NA, NA, 3L), e = c(NA, NA, 4L),
f = c(NA, NA, 5L)), .Names = c("id", "a", "b", "c", "d",
"e", "f"), class = "data.frame", row.names = c(NA, -3L))
# id a b c d e f
# 1 1 3 2 3 NA NA NA
# 2 2 2 NA NA NA NA NA
# 3 3 NA 1 2 3 4 5
Basically, a to f represents arteries and the numerical values represent the corresponding diameter. Each row represents a patient.
Is there a neat way to sort this dataframe out?
Modern tidyr makes the solution even more succinct via the pivot_ functions:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-id, names_pattern = '(artery|diameter)\\.(\\d+)', names_to = c('.value', NA)) %>%
filter(!is.na(artery)) %>%
pivot_wider(names_from = artery, values_from = diameter)
id a b c d e f
<int> <int> <int> <int> <int> <int> <int>
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Here is the older solution, which uses the deprecated gather and spread functions:
library(dplyr)
library(tidyr)
new.df <- gather(df, variable, value, artery.1:diameter.5) %>%
separate(variable, c('variable', 'num')) %>%
spread(variable, value) %>%
subset(!is.na(artery)) %>%
mutate(diameter = as.numeric(diameter)) %>%
select(-num) %>%
spread(artery, diameter)
Output:
id a b c d e f
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Or using melt/dcast combination with data.table while selecting variables using regex in the patterns function
library(data.table) #v>=1.9.6
dcast(melt(setDT(df),
id = "id",
measure = patterns("artery", "diameter")),
id ~ value1,
sum,
value.var = "value2",
subset = .(!is.na(value2)),
fill = NA)
# id a b c d e f
# 1: 1 3 2 3 NA NA NA
# 2: 2 2 NA NA NA NA NA
# 3: 3 NA 1 2 3 4 5
As you can see, both melt and dcast are very flexible and you can use regex, specify a subset, pass multiple functions and specify how you want to fill missing values.
You can use xtabs with reshape from base R. Use the latter to transform the data to long format and the former to cross-tabulate the diameters by id and artery. Note that combinations with no data show up as 0 rather than NA:
xtabs(diameter ~ id + artery, reshape(df, varying = 2:11, sep = '.', dir = "long"))
# artery
#id a b c d e f
# 1 3 2 3 0 0 0
# 2 2 0 0 0 0 0
# 3 0 1 2 3 4 5
This can be done with two reshape() calls. First, we can longify both artery and diameter on id, then widen with artery as the time variable. To prevent a column of NAs, we also must subset out rows with NA values for artery in the intermediate frame.
reshape(
subset(
reshape(df, direction = 'long', varying = setdiff(names(df), 'id'), timevar = NULL),
!is.na(artery)
),
direction = 'wide', timevar = 'artery'
)
## id diameter.a diameter.b diameter.c diameter.d diameter.e diameter.f
## 1.1 1 3 2 3 NA NA NA
## 2.1 2 2 NA NA NA NA NA
## 3.1 3 NA 1 2 3 4 5
The diameter. prefixes can be removed afterward, if desired. However, an advantage of this solution is that it would be capable of preserving multiple column sets, whereas the xtabs() solution cannot. The prefixes would be essential to distinguish the column sets in that case.
