I have a .csv file with demographic data for my participants. The data are coded and downloaded from my study database (REDCap) in a way that each race has its own separate column. That is, each participant has a value in each of these columns (1 if endorsed, 0 if unendorsed).
It looks something like this:
SubjID Sex Age White AA Asian Other
001 F 62 0 1 0 0
002 M 66 1 0 0 0
I have to use a roundabout way to get my demographic summary stats. There's gotta be a simpler way to do this. My question is, how can I combine these columns into one column so that there is only one value for race for each participant? (i.e. recoding so 1 = white, 2 = AA, etc, and only the endorsed category is being pulled for each participant and added to this column?)
This is what I would like for it to look:
SubjID Sex Age Race
001 F 62 2
002 M 66 1
This is more or less similar to our approach with similar data from REDCap. We use pivot_longer for dummy variables. The final Race variable could also be made a factor. Please let me know if this is what you had in mind.
Edit: Race is now converted to a factor inside pivot_longer (instead of with mutate). This originally used names_ptypes; in tidyr 1.1.0 and later, names_ptypes only checks the column type, so names_transform is the argument that performs the conversion.
library(tidyverse)
df <- data.frame(
  SubjID = c("001", "002"),
  Sex = c("F", "M"),
  Age = c(62, 66),
  White = c(0, 1),
  AA = c(1, 0),
  Asian = c(0, 0),
  Other = c(0, 0)
)
df %>%
  pivot_longer(
    cols = c("White", "AA", "Asian", "Other"),
    names_to = "Race",
    names_transform = list(Race = as.factor),
    values_to = "Value"
  ) %>%
  filter(Value == 1) %>%  # keep only the endorsed category
  select(-Value)
Result:
# A tibble: 2 x 4
SubjID Sex Age Race
<fct> <fct> <dbl> <fct>
1 001 F 62 AA
2 002 M 66 White
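Since the question asks for a numeric code (1 = White, 2 = AA, etc.), here is a minimal sketch of one way to get it from the long result. It assumes the coding order White, AA, Asian, Other; `race_levels`, `coded` and `Race_code` are names made up for this illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  SubjID = c("001", "002"),
  Sex = c("F", "M"),
  Age = c(62, 66),
  White = c(0, 1), AA = c(1, 0), Asian = c(0, 0), Other = c(0, 0),
  stringsAsFactors = FALSE
)

# Assumed coding order: 1 = White, 2 = AA, 3 = Asian, 4 = Other
race_levels <- c("White", "AA", "Asian", "Other")

coded <- df %>%
  pivot_longer(cols = all_of(race_levels), names_to = "Race", values_to = "Value") %>%
  filter(Value == 1) %>%
  select(-Value) %>%
  mutate(
    Race = factor(Race, levels = race_levels),  # fix the level order explicitly
    Race_code = as.integer(Race)                # integer code follows the level order
  )

coded
```

Setting the levels explicitly matters here: without it, the factor levels would be sorted alphabetically and the codes would not match the intended scheme.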
Here is another approach using reshape2
df[df == 0] <- NA
df <- reshape2::melt(df, measure.vars = c("White", "AA", "Asian", "Other"), variable.name = "Race", na.rm = TRUE)
df <- subset(df, select = -value)
# SubjID Sex Age Race
# 002 M 66 White
# 001 F 62 AA
Here's a base approach:
race_cols <- 4:7
ind <- max.col(df[, race_cols])
df$Race_number <- ind
df$Race <- names(df[, race_cols])[ind]
df[, -race_cols]
SubjID Sex Age Race_number Race
1 001 F 62 2 AA
2 002 M 66 1 White
Data from @Ben
df <- data.frame(
SubjID = c("001", "002"),
Sex = c("F", "M"),
Age = c(62, 66),
White = c(0, 1),
AA = c(1, 0),
Asian = c(0, 0),
Other = c(0, 0)
)
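One caveat with the base approach above: max.col() always returns a column index, and its default ties.method is "random", so a participant who endorsed zero or several races would still get a single (possibly arbitrary) race. A minimal sketch of a guard follows. It assumes that such participants should be set to NA, and participant 003 is a hypothetical row added only to show the multi-race case.

```r
df <- data.frame(
  SubjID = c("001", "002", "003"),  # 003 is hypothetical, added for illustration
  Sex = c("F", "M", "F"),
  Age = c(62, 66, 70),
  White = c(0, 1, 1),
  AA = c(1, 0, 0),
  Asian = c(0, 0, 1),
  Other = c(0, 0, 0),
  stringsAsFactors = FALSE
)

race_cols <- c("White", "AA", "Asian", "Other")

# Number of endorsed categories per participant
n_endorsed <- unname(rowSums(df[, race_cols]))

# Index of the first endorsed column; ties.method = "first" makes this deterministic
ind <- max.col(df[, race_cols], ties.method = "first")

# Assumption: only participants with exactly one endorsed race get a value
df$Race <- ifelse(n_endorsed == 1, race_cols[ind], NA_character_)
df$Race_number <- ifelse(n_endorsed == 1, ind, NA_integer_)

df[, c("SubjID", "Sex", "Age", "Race_number", "Race")]
```

Whether NA, "Multiple", or something else is the right value for those rows depends on how the study reports race, so treat the NA choice as a placeholder.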
I have my data organized into classes. If all of the score1 values in a class are above 100, then I want each person's true_score to be score1. However, if anyone in the class has a score1 value below 100, then I want true_score to be the sum of score1 and score2 for everyone in that class.
I've tried and would like to use dplyr's if_all(), but I can't get it to work.
Here's my mock data:
library(tidyverse)
test <- tibble(person = c("c", "s", "j"),
class = c(1, 2, 2),
score1 = c(101, 200, 23),
score2 = c(200, 100, 25))
Here's what I want:
answer <- tibble(person = c("c", "s", "j"),
class = c(1, 2, 2),
score1 = c(101, 200, 23),
score2 = c(200, 100, 25),
true_score = c(101, 300, 48))
And here's my (failed) attempt:
test %>%
group_by(class) %>%
mutate(true_score = case_when(
if_all(score1 > 100), score1 > 100 ~ score1,
score1 + score2 > 100 ~ score1 + score2
))
Error in `mutate()`:
! Problem while computing `true_score = case_when(...)`.
ℹ The error occurred in group 1: class = 1.
Caused by error in `if_all()`:
! object 'score1' not found
if_all() (and if_any()) are for use across columns rather than within. In this case, you want plain old all():
library(dplyr)
test %>%
group_by(class) %>%
mutate(true_score = case_when(all(score1 > 100) ~ score1,
TRUE ~ score1 + score2)) %>%
ungroup()
# A tibble: 3 × 5
person class score1 score2 true_score
<chr> <dbl> <dbl> <dbl> <dbl>
1 c 1 101 200 101
2 s 2 200 100 300
3 j 2 23 25 48
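To make the distinction concrete, here is a small sketch on the same mock data that contrasts the two. if_all() combines a condition across several columns for each row, while all() collapses one column over all rows of a group. The names `both_high` and `by_class` are made up for this example.

```r
library(dplyr)

test <- tibble(person = c("c", "s", "j"),
               class = c(1, 2, 2),
               score1 = c(101, 200, 23),
               score2 = c(200, 100, 25))

# if_all(): per row, is the condition TRUE in every selected column?
both_high <- test %>%
  filter(if_all(c(score1, score2), ~ .x > 100))

# all(): per group, is the condition TRUE for every row of one column?
by_class <- test %>%
  group_by(class) %>%
  summarise(all_high = all(score1 > 100))

both_high
by_class
```

Only person "c" has both scores above 100, and only class 1 has every score1 above 100, which is the group-level test the question needs.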
I am stuck trying to use pivot_longer() over multiple sets of columns. Here is the sample dataset:
df <- data.frame(
id = c(1, 2),
uid = c("m1", "m2"),
germ_kg = c(23, 24),
mineral_kg = c(12, 17),
perc_germ = c(45, 34),
perc_mineral = c(78, 10))
I need the output data frame to look like this:
out <- data.frame(
  id = c(1, 1, 2, 2),
  uid = c("m1", "m1", "m2", "m2"),
  crop = c("germ", "mineral", "germ", "mineral"),
  kg = c(23, 12, 24, 17),
  perc = c(45, 78, 34, 10))
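library(tidyverse)
df %>%
  # move the measure to the front (germ_kg -> kg_germ) so every name reads measure_crop
  rename_with(~ str_replace(.x, "(.*)_kg", "kg_\\1")) %>%
  pivot_longer(-c(id, uid), names_to = c(".value", "crop"), names_sep = "_")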
# A tibble: 4 x 5
id uid crop kg perc
<dbl> <chr> <chr> <dbl> <dbl>
1 1 m1 germ 23 45
2 1 m1 mineral 12 78
3 2 m2 germ 24 34
4 2 m2 mineral 17 10
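To see why the rename step is needed, it may help to look at the column names on their own. The sketch below only manipulates the name strings; `old_names`, `new_names` and `parts` are illustrative names, not part of the answer.

```r
library(stringr)

old_names <- c("germ_kg", "mineral_kg", "perc_germ", "perc_mineral")

# Same replacement as in the answer: put the measure first
new_names <- str_replace(old_names, "(.*)_kg", "kg_\\1")
new_names

# After renaming, splitting on "_" always gives measure first, crop second,
# which is what names_to = c(".value", "crop") with names_sep = "_" relies on
parts <- str_split_fixed(new_names, "_", 2)
colnames(parts) <- c(".value", "crop")
parts
```

The first piece of each name becomes an output column (kg, perc) and the second piece becomes the value of crop.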
If you were to use data.table:
library(data.table)
melt(setDT(df), c('id', 'uid'), patterns(kg = 'kg', perc = 'perc'))
id uid variable kg perc
1: 1 m1 1 23 45
2: 2 m2 1 24 34
3: 1 m1 2 12 78
4: 2 m2 2 17 10
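Note that the variable column in that result holds 1 and 2 rather than the crop names. A possible way to map it back is sketched below. It assumes the crops appear in the same order (germ, then mineral) within each pattern, which holds for this sample data; `crops` and `long` are names made up here.

```r
library(data.table)

dt <- data.table(
  id = c(1, 2),
  uid = c("m1", "m2"),
  germ_kg = c(23, 24),
  mineral_kg = c(12, 17),
  perc_germ = c(45, 34),
  perc_mineral = c(78, 10)
)

long <- melt(dt, id.vars = c("id", "uid"),
             measure.vars = patterns(kg = "kg", perc = "perc"),
             variable.name = "crop")

# Assumption: within each pattern the columns are ordered germ, mineral
crops <- c("germ", "mineral")
long[, crop := crops[crop]]  # a factor index uses its integer codes (1, 2)

setorder(long, id, crop)
long
```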
I suspect there might be a simpler way using pivot_longer_spec, but one tricky thing here is that your column names don't have a consistent ordering of their semantic components. @Onyambu's answer deals with this nicely by fixing it upstream.
library(tidyverse)
df %>%
pivot_longer(-c(id, uid)) %>%
separate(name, c("col1", "col2")) %>% # only needed
mutate(crop = if_else(col2 == "kg", col1, col2), # because name
meas = if_else(col2 == "kg", col2, col1)) %>% # structure
select(id, uid, crop, meas, value) %>% # is
pivot_wider(names_from = meas, values_from = value) # inconsistent
# A tibble: 4 x 5
id uid crop kg perc
<dbl> <chr> <chr> <dbl> <dbl>
1 1 m1 germ 23 45
2 1 m1 mineral 12 78
3 2 m2 germ 24 34
4 2 m2 mineral 17 10
I have 2 dfs. There are NA values for 2 variables in 1 data frame that I want to replace with values in another df. Here is my sample data:
df1
id Sex Race Income
1 M White 1
2 NA Hispanic 2
3 NA NA 3
df2
id Sex Race
1 M White
2 F Hispanic
3 M White
4 F Black
I want the data to look like this where the NA values for df1 for sex and race are filled in by the values for df2.
df2
id Sex Race Income
1 M White 1
2 F Hispanic 2
3 M White 3
4 F Black NA
Can someone please help?
A base R option using merge
subset(
merge(df1, df2, by = "id", all.y = TRUE),
select = c("id", "Sex.y", "Race.y", "Income")
)
which gives
id Sex.y Race.y Income
1 1 M White 1
2 2 F Hispanic 2
3 3 M White 3
4 4 F Black NA
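The merge above simply takes Sex and Race from df2 and leaves the ".y" suffix on the names. If you would rather keep the df1 value when it is present and fall back to df2 only where df1 is NA, a base R sketch could look like this. The suffixes ".1" and ".2" and the names `m` and `res` are choices made for this example.

```r
df1 <- data.frame(id = 1:3, Sex = c("M", NA, NA),
                  Race = c("White", "Hispanic", NA), Income = 1:3,
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 1:4, Sex = c("M", "F", "M", "F"),
                  Race = c("White", "Hispanic", "White", "Black"),
                  stringsAsFactors = FALSE)

# Full outer join; columns present in both get the suffixes .1 (df1) and .2 (df2)
m <- merge(df1, df2, by = "id", all = TRUE, suffixes = c(".1", ".2"))

# Prefer the df1 value, fall back to df2 where df1 is NA
for (v in c("Sex", "Race")) {
  from_df1 <- m[[paste0(v, ".1")]]
  from_df2 <- m[[paste0(v, ".2")]]
  m[[v]] <- ifelse(is.na(from_df1), from_df2, from_df1)
}

res <- m[, c("id", "Sex", "Race", "Income")]
res
```

With this sample data the result is the same as taking the df2 columns, but the fallback logic matters when df1 and df2 can disagree.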
We can use a join here
library(data.table)
setDT(df2)[df1, Income := Income, on = .(id)]
-output
df2
# id Sex Race Income
#1: 1 M White 1
#2: 2 F Hispanic 2
#3: 3 M White 3
#4: 4 F Black NA
If we need to choose the 'Sex', 'Race' between the non-NA elements
nm1 <- names(df2)[-1]
setDT(df2)[df1, c(nm1, 'Income') := c(Map(fcoalesce,
.SD[, nm1, with = FALSE], mget(paste0('i.', nm1))), list(Income)), on = .(id)]
-output
df2
# id Sex Race Income
#1: 1 M White 1
#2: 2 F Hispanic 2
#3: 3 M White 3
#4: 4 F Black NA
Or using tidyverse, with just dplyr functions
library(dplyr)
left_join(df2, df1, by = 'id') %>%
transmute(id, Sex = coalesce(Sex.x, Sex.y),
Race = coalesce(Race.x, Race.y),
Income)
-output
# id Sex Race Income
#1 1 M White 1
#2 2 F Hispanic 2
#3 3 M White 3
#4 4 F Black NA
data
df1 <- structure(list(id = 1:3, Sex = c("M", NA, NA), Race = c("White",
"Hispanic", NA), Income = 1:3), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(id = 1:4, Sex = c("M", "F", "M", "F"), Race = c("White",
"Hispanic", "White", "Black")), class = "data.frame", row.names = c(NA,
-4L))
A tidyverse approach is to reshape both data frames to long format (using pivot_longer()), join them, fill in the missing values, and then reshape back to wide (using pivot_wider()) to obtain the expected result. Here is the code:
library(tidyverse)
#Code
newdf <- df2 %>%
mutate(across(-id,~as.character(.))) %>%
pivot_longer(-id) %>%
full_join(df1 %>%
mutate(across(-id,~as.character(.))) %>%
pivot_longer(-id) %>% rename(value2=value)) %>%
mutate(value=ifelse(is.na(value),value2,value)) %>% select(-value2) %>%
pivot_wider(names_from = name,values_from=value) %>%
mutate(Income=as.numeric(Income))
Output:
# A tibble: 4 x 4
id Sex Race Income
<int> <chr> <chr> <dbl>
1 1 M White 1
2 2 F Hispanic 2
3 3 M White 3
4 4 F Black NA
Some data used:
#Data 1
df1 <- structure(list(id = 1:3, Sex = c("M", NA, NA), Race = c("White",
"Hispanic", NA), Income = 1:3), class = "data.frame", row.names = c(NA,
-3L))
#Data 2
df2 <- structure(list(id = 1:4, Sex = c("M", "F", "M", "F"), Race = c("White",
"Hispanic", "White", "Black")), class = "data.frame", row.names = c(NA,
-4L))
I have the following data:
ID cancer cancer_date stroke stroke_date diabetes diabetes_date
1 1 Feb2017 0 Jan2015 1 Jun2015
2 0 Feb2014 1 Jan2015 1 Jun2015
I would like to get
ID condition date
1 cancer xx
1 diabetes xx
2 stroke xx
2 diabetes xx
I tried reshape and gather, but they did not do what I want. Any ideas how I can do this?
This should do it. The key to make it work easily is to change the names of cancer, stroke and diabetes to x_val and then you can use pivot_longer() from tidyr to do the work.
library(tidyr)
library(dplyr)
dat <- tibble::tribble(
~ID, ~cancer, ~cancer_date, ~stroke, ~stroke_date, ~diabetes, ~diabetes_date,
1, 1, "Feb2017", 0, "Jan2015", 1, "Jun2015",
2, 0, "Feb2014", 1, "Jan2015", 1, "Jun2015")
dat %>%
rename("cancer_val" = "cancer",
"stroke_val" = "stroke",
"diabetes_val" = "diabetes") %>%
pivot_longer(cols=-ID,
names_to = c("diagnosis", ".value"),
names_pattern="(.*)_(.*)") %>%
filter(val == 1)
# # A tibble: 4 x 4
# ID diagnosis val date
# <dbl> <chr> <dbl> <chr>
# 1 1 cancer 1 Feb2017
# 2 1 diabetes 1 Jun2015
# 3 2 stroke 1 Jan2015
# 4 2 diabetes 1 Jun2015
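As a rough illustration of why the rename makes this work, the sketch below applies the same regular expression to the renamed column names using base R. The first capture group plays the role of diagnosis and the second the role of .value; `split_names` and `value_col` are names made up for this example.

```r
nms <- c("cancer_val", "cancer_date", "stroke_val", "stroke_date",
         "diabetes_val", "diabetes_date")
pattern <- "(.*)_(.*)"

split_names <- data.frame(
  column = nms,
  diagnosis = sub(pattern, "\\1", nms),  # first group: goes to the diagnosis column
  value_col = sub(pattern, "\\2", nms),  # second group: becomes the output column name
  stringsAsFactors = FALSE
)
split_names
```

Without the rename, names such as cancer contain no underscore, so the pattern would not match them and they could not be split into a diagnosis and a value part.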
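A data.table option: melt only the date columns, then keep the rows whose condition flag is 1.
library(data.table)
data <- data.table(
  ID = c(1, 2),
  cancer = c(1, 0), cancer_date = c("Feb2017", "Feb2014"),
  stroke = c(0, 1), stroke_date = c("Jan2015", "Jan2015"),
  diabetes = c(1, 1), diabetes_date = c("Jun2015", "Jun2015")
)
# Despite its name, datawide is in long format: one row per ID and date column
datawide <-
  melt(data, id.vars = c("ID", "cancer", "stroke", "diabetes"),
       measure.vars = c("cancer_date", "stroke_date", "diabetes_date"))
# Keep a date only when the matching flag is 1; drop the "_date" suffix from the label
datawide[(cancer == 1 & variable == "cancer_date") |
         (stroke == 1 & variable == "stroke_date") |
         (diabetes == 1 & variable == "diabetes_date"),
         .(ID, condition = sub("_date$", "", variable), date = value)]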
Try this solution using pivot_longer() and a flag variable to filter the desired states. After the first pivot, keep only the nonzero condition flags; after the second, keep only the date column that matches each condition. Here is the code:
library(tidyverse)
#Code
df2 <- df %>% pivot_longer(cols = -c(ID,contains('_'))) %>%
filter(value!=0) %>% rename(condition=name) %>% select(-value) %>%
pivot_longer(-c(ID,condition)) %>%
separate(name,c('v1','v2'),sep='_') %>%
mutate(Flag=ifelse(condition==v1,1,0)) %>%
filter(Flag==1) %>% select(-c(v1,v2,Flag)) %>%
rename(date=value)
Output:
# A tibble: 4 x 3
ID condition date
<int> <chr> <chr>
1 1 cancer Feb2017
2 1 diabetes Jun2015
3 2 stroke Jan2015
4 2 diabetes Jun2015
Some data used:
#Data
df <- structure(list(ID = 1:2, cancer = 1:0, cancer_date = c("Feb2017",
"Feb2014"), stroke = 0:1, stroke_date = c("Jan2015", "Jan2015"
), diabetes = c(1L, 1L), diabetes_date = c("Jun2015", "Jun2015"
)), class = "data.frame", row.names = c(NA, -2L))
If the first option is too complex, here is another choice:
#Code 2
df2 <- df %>% mutate(across(everything(),~as.character(.))) %>%
pivot_longer(cols = -c(ID)) %>%
separate(name,c('condition','v2'),sep = '_') %>%
replace(is.na(.),'val') %>%
pivot_wider(names_from = v2,values_from=value) %>%
filter(val==1) %>% select(-val)
Output:
# A tibble: 4 x 3
ID condition date
<chr> <chr> <chr>
1 1 cancer Feb2017
2 1 diabetes Jun2015
3 2 stroke Jan2015
4 2 diabetes Jun2015
Not sure if tidyr::gather can be used to take multiple columns and merge them in multiple key columns.
Similar questions have been asked but they all refer to gathering multiple columns in one key column.
I'm trying to gather 4 columns into 2 key and 2 value columns like in the following example:
Sample data:
df <- data.frame(
subject = c("a", "b"),
age1 = c(33, 35),
age2 = c(43, 45),
weight1 = c(90, 67),
weight2 = c(70, 87)
)
subject age1 age2 weight1 weight2
1 a 33 43 90 70
2 b 35 45 67 87
Desired result:
dfe <- data.frame(
subject = c("a", "a", "b", "b"),
age = c("age1", "age2", "age1", "age2"),
age_values = c(33, 43, 35, 45),
weight = c("weight1", "weight2", "weight1", "weight2"),
weight_values = c(90, 70, 67, 87)
)
subject age age_values weight weight_values
1 a age1 33 weight1 90
2 a age2 43 weight2 70
3 b age1 35 weight1 67
4 b age2 45 weight2 87
Here's one way to do it:
library(dplyr)
library(tidyr)
df %>%
  gather(key = "age", value = "age_values", age1, age2) %>%
  gather(key = "weight", value = "weight_values", weight1, weight2) %>%
  # keep rows where the age and weight suffixes match (1 with 1, 2 with 2)
  filter(substring(age, 4) == substring(weight, 7))
subject age age_values weight weight_values
1 a age1 33 weight1 90
2 b age1 35 weight1 67
3 a age2 43 weight2 70
4 b age2 45 weight2 87
Here's one approach. The idea is to use gather, then split the resulting data frame by variable (age and weight), run the mutate operations separately on each of the two data frames, and then merge them back together using subject and the variable number (1 or 2).
library(dplyr)
library(tidyr)
library(purrr)
df %>%
gather(age1:weight2, key = key, value = value) %>%
separate(key, sep = -1, into = c("var", "num")) %>%
split(.$var) %>%
map(~mutate(., !!.$var[1] := paste0(var, num), !!paste0(.$var[1], "_values") := value)) %>%
map(~select(., -var, -value)) %>%
Reduce(f = merge, x = .) %>%
select(-num)
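For comparison, here is a sketch of how the same result might be reached with pivot_longer(), which has superseded gather() in tidyr. It assumes every column name is a lowercase word followed by a number; `long` and `res` are names made up for this example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  subject = c("a", "b"),
  age1 = c(33, 35),
  age2 = c(43, 45),
  weight1 = c(90, 67),
  weight2 = c(70, 87),
  stringsAsFactors = FALSE
)

# ".value" sends the word part (age, weight) to column names,
# and the trailing number to a helper column called num
long <- df %>%
  pivot_longer(-subject,
               names_to = c(".value", "num"),
               names_pattern = "([a-z]+)(\\d+)")

# Rebuild the key columns (age1, weight1, ...) requested in the question
res <- long %>%
  rename(age_values = age, weight_values = weight) %>%
  mutate(age = paste0("age", num), weight = paste0("weight", num)) %>%
  select(subject, age, age_values, weight, weight_values)

res
```

The intermediate `long` table (subject, num, age, weight) is often all that is needed; the key columns are rebuilt here only to match the layout asked for.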