I am trying to recode multiple columns of data from string variables (e.g. "None of the time", "Some of the time", "Often"...) to numeric values (e.g. "None of the time" = 0). I have seen a number of different responses to similar questions but when I have tried these they seem to remove all of the data and replace it with NA.
For_Analysis <- data.frame(Q11_1=c("None of the time", "Often", "Sometimes"),
Q11_2=c("Sometimes", "Often", "Never"), Q11_3=c("Never", "Never", "Often"))
For_Analysis <- For_Analysis%>%
mutate_at(c("Q11_1", "Q11_2", "Q11_3"),
funs(recode(., "None of the time"=1, "Rarely"=2,
"Some of the time"=3, "Often"=4, "All of the time"=5)))
When I run this second bit of code I get the following output
## There were 14 warnings (use warnings() to see them)
And all of the data within the dataframe is recoded to NA instead of the numeric values I want.
You are getting NA values (with warnings, not an error) because some values in your data, such as "Sometimes" and "Never", do not match any of the values listed in recode. Also, you can replace mutate_at with across.
library(dplyr)
For_Analysis <- For_Analysis%>%
mutate(across(starts_with('Q11'), ~recode(., "None of the time"=1, "Rarely"=2,
"Sometimes"=3, "Often"=4, "All of the time"=5, "Never" = 1)))
For_Analysis
# Q11_1 Q11_2 Q11_3
#1 1 3 1
#2 4 4 1
#3 3 1 4
I have taken the liberty of assuming "Never" is the same as "None of the time" and coded it as 1.
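To see which labels are causing the NAs before recoding, one option is to compare the labels in the data against the names of a lookup vector. The following is a minimal base R sketch, not the answer's own method; the names codes, codes_full, unmatched and recoded are illustrative, and the codes for "Sometimes" and "Never" follow the assumption made in the answer above.

```r
For_Analysis <- data.frame(
  Q11_1 = c("None of the time", "Often", "Sometimes"),
  Q11_2 = c("Sometimes", "Often", "Never"),
  Q11_3 = c("Never", "Never", "Often"),
  stringsAsFactors = FALSE
)

# Lookup vector: names are the old labels, values are the new codes
codes <- c("None of the time" = 1, "Rarely" = 2, "Some of the time" = 3,
           "Often" = 4, "All of the time" = 5)

# Labels present in the data that the lookup does not cover
unmatched <- setdiff(unique(unlist(For_Analysis)), names(codes))
print(unmatched)

# Assumption (same as the answer above): "Sometimes" = 3, "Never" = 1
codes_full <- c(codes, "Sometimes" = 3, "Never" = 1)

# Recode every column by indexing the lookup vector with the labels
recoded <- For_Analysis
recoded[] <- lapply(For_Analysis, function(x) unname(codes_full[x]))
print(recoded)
```

Any label still missing from the lookup would show up as NA in recoded, so checking unmatched first makes the cause visible.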
The following method seems to have worked for my issue (recoding string variables to numeric in multiple columns):
For_Analysis <- data.frame(Q11_1=c("Never", "Often", "Sometimes"),
Q11_2=c("Sometimes", "Often", "Never"), Q11_3=c("Never", "Never", "Often"))
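library(plyr)  # mapvalues() comes from plyr
Old_Values <- unique(For_Analysis$Q11_1)
# mapvalues() needs `from` and `to` of equal length; codes follow order of first appearance
New_Values <- seq_along(Old_Values)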
For_Analysis[1:3] <- as.data.frame(sapply(For_Analysis[1:3],
mapvalues, from = Old_Values, to = New_Values))
Thanks for the help!
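The easiest way is to convert it to a variable of type factor and then to numeric. Note that fct_inorder numbers the levels in order of first appearance within each column, so the same label can get a different code in different columns.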
library(tidyverse)
For_Analysis <- data.frame(Q11_1=c("None of the time", "Often", "Sometimes"),
Q11_2=c("Sometimes", "Often", "Never"), Q11_3=c("Never", "Never", "Often"))
fRecode = function(x) x %>% fct_inorder() %>% as.numeric()
For_Analysis %>% mutate_all(fRecode)
output
Q11_1 Q11_2 Q11_3
1 1 1 1
2 2 2 1
3 3 3 2
I am trying to do a large data check for a database. Some fields in the database are hidden, so when I am doing the datacheck, I need to ignore all hidden fields. Fields are hidden based on conditional logic stored in the database. I have exported this conditional logic and have stored it in a dataframe in R. Now I need to automate the data check by somehow using the text string of a conditional argument to automate the script writing itself, which I do not think is possible, or finding a way around this problem.
Below is example code that I need to solve:
id <- c(1001, 1002, 1003, 1004, 1005, 1001, 1002, 1003, 1004, 1005)
target_var <- c("race","race","race","race","race", "race_other",
"race_other", "race_other", "race_other", "race_other")
value <- c(1, NA, 1, 1, 6, NA, NA, NA, NA, "Asian")
branching_logic <- c(NA, NA, NA, NA, NA,
"race == 6", "race == 6", "race == 6",
"race == 6", "race == 6")
race <- c(NA, NA, NA,NA, NA, 1, 1, 1, 6, 6)
data <- data.frame(id, target_var, value, branching_logic, race) %>%
mutate(data_check_result = case_when(
!is.na(value) ~ "No Missing Data",
is.na(value) & is.na(branching_logic) ~ "Missing Data 1",
is.na(value) & race == 6 ~ "Missing Data 2",
is.na(value) & race != 6 ~ "Hidden field",
))
It would be great if I could replace (race==6) with a variable or somehow directing the script to the conditional expression already saved as a string, but I know that R can't do that.
The above problem has four categories which the data could fall into:
No Missing Data: only if value is non-na
Missing Data 1: if the value is NA, and there is no branching logic that hid the variable.
Missing Data 2: if the value is NA and the branching logic is met to show the field
Hidden Field: if the value is NA and the branching logic is NOT met to show the field
I have thousands of fields to check with accompanying branching logic, so I need a way to use the branching logic saved in the "branching_logic" column within the script.
IMPORTANT NOTE: The case here is the simplest case. Many target_var variables and value variables have branching logic that looks at multiple other variables to determine whether to hide the field (Ex. race==6 & race==1)
This is only my second time posting, and I usually do not see such in depth problems here, but it would be great if someone has an idea!
You can store the expression you want to evaluate as a string if you pass it into parse() first as explained in this answer.
Here's a simple example of how you can store the expression in a column and then feed it to dplyr::case_when().
library(tidyverse)
set.seed(1)
d <- tibble(
a = sample(10),
b = sample(10),
c = "a > b"
)
d %>%
mutate(a_bigger = case_when(
eval(parse(text = c)) ~ "Y",
TRUE ~ "N"
))
#> # A tibble: 10 x 4
#> a b c a_bigger
#> <int> <int> <chr> <chr>
#> 1 9 3 a > b Y
#> 2 4 1 a > b Y
#> 3 7 5 a > b Y
#> 4 1 8 a > b N
#> 5 2 2 a > b N
#> 6 5 6 a > b N
#> 7 3 10 a > b N
#> 8 10 9 a > b Y
#> 9 6 4 a > b Y
#> 10 8 7 a > b Y
Created on 2022-03-07 by the reprex package (v2.0.1)
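The example above uses the same expression in every row. When the stored condition differs from row to row, as with the branching_logic column in the question, one possible approach is to evaluate each string against its own row. This is a base R sketch under assumptions: eval_row_logic and the checks data frame are hypothetical, and eval(parse()) should only be used on strings from a trusted source since it runs arbitrary code.

```r
# Evaluate the condition stored in `logic_col` for each row, using that
# row's own column values. Rows with no condition (NA) return NA.
eval_row_logic <- function(df, logic_col) {
  vapply(seq_len(nrow(df)), function(i) {
    expr <- df[[logic_col]][i]
    if (is.na(expr)) return(NA)
    # The one-row data frame acts as the environment for the expression
    isTRUE(eval(parse(text = expr), envir = df[i, , drop = FALSE]))
  }, logical(1))
}

# Hypothetical data: conditions may reference several columns
checks <- data.frame(
  race = c(1, 6, 6, 2),
  age = c(30, 15, 40, 10),
  branching_logic = c("race == 6", "race == 6", "race == 6 & age > 18", NA),
  stringsAsFactors = FALSE
)

checks$shown <- eval_row_logic(checks, "branching_logic")
print(checks)
```

The resulting logical column can then be used inside case_when in place of a hard-coded race == 6.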
My dataframe has a column called "LandType" of characters, either "Rural" or "Urban" for a bunch of samples. All I want to do is convert them to 1's and 0's, where "Rural" is 1, and "Urban" is 0.
I thought it would be as simple as:
data$LandType[data$LandType == "Rural"] <- 1
data$LandType[data$LandType == "Urban"] <- 0
But after running this with no errors and then checking my data df, the crazy thing is that ONLY "Rural" has changed to 1's but Urban still remains as a string. I tried with different numbers but same thing happened, only Rural would change to the value I assigned.
Just use ifelse
#your data
data = data.frame(Landtype = c("Rural", "Urban", "Rural", "Urban"))
#ifelse condition
data$Landtype = ifelse(data$Landtype == "Rural", 1,0)
A tidyverse option using recode()
library(dplyr)
mutate(data, Landtype = recode(Landtype, Rural = 1, Urban = 0))
# # A tibble: 4 x 1
# Landtype
# <dbl>
# 1 1
# 2 0
# 3 1
# 4 0
Data
data <- tibble(Landtype = c("Rural", "Urban", "Rural", "Urban"))
We could use as.integer
data$Landtype <- as.integer(data$Landtype == "Rural")
data
data = data.frame(Landtype = c("Rural", "Urban", "Rural", "Urban"))
In my case your way works:
set.seed(42)
data <- data.frame(LandType = sample(c(rep("Rural", 72), rep("Urban", 93))))
data$LandType[data$LandType == "Rural"] <- 1
data$LandType[data$LandType == "Urban"] <- 0
table(data$LandType)
# 0 1
#93 72
To get binary values I would recommend using the logical type (TRUE or FALSE).
data$LandType <- data$LandType == "Rural"
If 0 and 1 are needed, just add a +
data$LandType <- +(data$LandType == "Rural")
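If only one of the two labels gets replaced, it can help to look at the exact strings in the column before recoding. The sketch below assumes, purely for illustration, that some "Urban" values carry stray whitespace; that is a guess at one possible cause, not something confirmed by the question.

```r
# Hypothetical data: one "Urban" value has a trailing space
data <- data.frame(LandType = c("Rural", "Urban ", "Rural", "Urban"),
                   stringsAsFactors = FALSE)

# Inspect the distinct labels; the padded value shows up as a separate entry
print(unique(data$LandType))

# Strip surrounding whitespace, then convert to 1 (Rural) / 0 (anything else)
data$LandType <- trimws(data$LandType)
data$LandType <- as.integer(data$LandType == "Rural")
print(data$LandType)
```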
I have 2 data frames with the same row and column structure that both have a lot of NA values. I want to create another data frame that simply tells me which cells in the 2 original data frames actually have values. For example
So far I have been able to do this manually by mutating a series of if else statements for each column like this:
combined <- trial_1[,1:2] %>%
mutate("Part1" = ifelse(!is.na(trial_1$Part1) & !is.na(trial_2$Part1), "1 & 2",
ifelse(!is.na(trial_1$Part1) & is.na(trial_2$Part1), "1 only", ifelse(is.na(trial_1$Part1) & !is.na(trial_2$Part1),
"2 only", ifelse(is.na(trial_1$Part1) & is.na(trial_2$Part1),
"NA", "Failed"))))) %>%
mutate("Part2" = ifelse(!is.na(trial_1$Part2) & !is.na(trial_2$Part2),
"1 & 2",ifelse(!is.na(trial_1$Part2) & is.na(trial_2$Part2), "1 only",
ifelse(is.na(trial_1$Part2) & !is.na(trial_2$Part2), "2 only", ifelse(is.na(trial_1$Part2) & is.na(trial_2$Part2), "NA", "Failed"))))) %>%
mutate("Part3" = ifelse(!is.na(trial_1$Part3) & !is.na(trial_2$Part3), "1 & 2",
ifelse(!is.na(trial_1$Part3) & is.na(trial_2$Part3),
"1 only", ifelse(is.na(trial_1$Part3) & !is.na(trial_2$Part3), "2 only", ifelse(is.na(trial_1$Part3) & is.na(trial_2$Part3),
"NA", "Failed"))))) %>%
mutate("Part4" = ifelse(!is.na(trial_1$Part4) & !is.na(trial_2$Part4),
"1 & 2", ifelse(!is.na(trial_1$Part4) & is.na(trial_2$Part4), "1 only", ifelse(is.na(trial_1$Part4) & !is.na(trial_2$Part4),
"2 only", ifelse(is.na(trial_1$Part4) & is.na(trial_2$Part4), "NA", "Failed")))))
But this is obviously not efficient so I tried using a for loop, which does not work:
participants <- list('Part1', 'Part2', 'Part3', 'Part4')
combined <- trial_1[,1:2]
for (i in participants) {
combined <- combined %>%
mutate(i = ifelse(!is.na(trial_1$i) & !is.na(trial_2$i), "1 & 2",
ifelse(!is.na(trial_1$i) & is.na(trial_2$i), "1 only",
ifelse(is.na(trial_1$i) & !is.na(trial_2$i), "2 only",
ifelse(is.na(trial_1$i) & is.na(trial_2$i), "NA", "Failed")))))
}
Any help on how to restructure this for loop, which I think is the way to go, would be very helpful. Thanks!
Here is something to try using tidyverse. First, merge the two data frames together with a join, based on number and status. You can indicate the trial number here if you'd like.
Then, you can put your data into long form, and look at each element in Part individually. With mutate create a new string based on which trials have non-missing values.
Finally, use pivot_wider to put the data into wide form.
library(tidyverse)
trial_1 %>%
left_join(trial_2, by = c("number", "status"), suffix = c(".t1", ".t2")) %>%
pivot_longer(cols = starts_with("Part"), names_to = c("Part", ".value"), names_pattern = "Part(\\d+).(t[1-9])") %>%
mutate(part_string = case_when(
!is.na(t1) & !is.na(t2) ~ "1 & 2",
!is.na(t1) ~ "1 only",
!is.na(t2) ~ "2 only",
TRUE ~ NA_character_
)) %>%
pivot_wider(id_cols = c(number, status), names_from = "Part", values_from = "part_string", names_prefix = "Part")
Output
number status Part1 Part2 Part3 Part4
<int> <chr> <chr> <chr> <chr> <chr>
1 1 very low 1 only NA 2 only NA
2 2 low NA 1 only 1 & 2 NA
3 3 medium 2 only NA 1 only NA
4 4 high NA NA NA NA
5 5 very high NA NA 1 only 1 & 2
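For reference, the for loop from the question can also be made to work. Inside the loop, trial_1$i looks for a column literally named i, and mutate(i = ...) creates a column called i; indexing with [[ ]] uses the value of the loop variable instead. The following is a base R sketch with made-up stand-in data, since the original trial_1 and trial_2 are not shown.

```r
# Hypothetical stand-ins for the two data frames in the question
trial_1 <- data.frame(number = 1:3, status = c("low", "medium", "high"),
                      Part1 = c(5, NA, NA), Part2 = c(NA, 2, NA))
trial_2 <- data.frame(number = 1:3, status = c("low", "medium", "high"),
                      Part1 = c(1, NA, 7), Part2 = c(NA, NA, NA))

participants <- c("Part1", "Part2")
combined <- trial_1[, 1:2]

for (p in participants) {
  # [[p]] looks up the column whose name is stored in p
  in_1 <- !is.na(trial_1[[p]])
  in_2 <- !is.na(trial_2[[p]])
  combined[[p]] <- ifelse(in_1 & in_2, "1 & 2",
                   ifelse(in_1, "1 only",
                   ifelse(in_2, "2 only", NA_character_)))
}
print(combined)
```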
I have a dataset that is basically a set of responses to the PHQ-9 questionnaire. There are 9 columns, each a factor with the levels "Not at all", "Sometimes", "Several Days", "More than half the days", "Nearly everyday". These are scored 0, 1, 1, 2, 3 respectively.
The response to all the 9 questions finally gives a PHQ score out of 27.
In my dataset, I however have the responses to these questions stored as :
$ Interest : Factor w/ 5 levels "More than half the days",..: 1 4 2 2 4 5 4 4 4 5 ...
Now what I actually want is another column adjacent to each feature like the one above, containing the corresponding score. Moreover, I want to use these scores to calculate the overall depression score at the end.
This is the output I am looking at:
Interest I_Factor Pleasure P_factor Score
Not at all 0 Nearly Everyday 2 2
Creating a simulated dataframe for you:
df <- data.frame(id = c("001", "002", "003", "004", "005"),
PHQ_1 = c("Not at all", "Not at all", "Sometimes", "Sometimes", "Several Days"),
PHQ_2 = c("Sometimes", "Sometimes", "Several Days", "More than half the days", "Nearly everyday"))
Using mutate_at to select the questionnaire items, and then applying recode to all of them to change the Likert scales from text to numeric (the named "old" = new form used below is dplyr's recode). Giving the function a name inside funs() (e.g. "numeric_columns" in the example below) creates new columns with that suffix, so the old columns are not replaced.
Once this is done, using mutate again to compute the row sums and put it into a new column.
library(dplyr)
library(psych)
test <- df %>%
mutate_at(vars(PHQ_1:PHQ_2), funs(numeric_columns = recode(.,
"Not at all" = 0,
"Sometimes" = 1,
"Several Days" = 1,
"More than half the days" = 2,
"Nearly everyday" = 3))) %>%
mutate(total = rowSums(select(., contains("numeric_columns"))))
The sample output is as follows. The original columns are retained and you have the new columns in numeric format as well as the total score of the questionnaire.
id PHQ_1 PHQ_2 PHQ_1_numeric_columns PHQ_2_numeric_columns total
1 001 Not at all Sometimes 0 1 1
2 002 Not at all Sometimes 0 1 1
3 003 Sometimes Several Days 1 1 2
4 004 Sometimes More than half the days 1 2 3
5 005 Several Days Nearly everyday 1 3 4
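The same scoring can be sketched in base R with a named lookup vector, which avoids funs() (deprecated in recent dplyr versions). This is an alternative under stated assumptions, not the method above; phq_scores, items and score_cols are illustrative names, and the scores are the ones given in the question.

```r
df <- data.frame(
  id = c("001", "002", "003", "004", "005"),
  PHQ_1 = c("Not at all", "Not at all", "Sometimes", "Sometimes", "Several Days"),
  PHQ_2 = c("Sometimes", "Sometimes", "Several Days",
            "More than half the days", "Nearly everyday"),
  stringsAsFactors = FALSE
)

# Label -> score lookup, using the scoring stated in the question
phq_scores <- c("Not at all" = 0, "Sometimes" = 1, "Several Days" = 1,
                "More than half the days" = 2, "Nearly everyday" = 3)

items <- c("PHQ_1", "PHQ_2")
score_cols <- paste0(items, "_score")

# as.character() makes this work for factor columns as well
df[score_cols] <- lapply(df[items], function(x) unname(phq_scores[as.character(x)]))

# Total score per respondent
df$total <- rowSums(df[score_cols])
print(df)
```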
I have the following four lists.
varnames <- list("beefpork", "breakfast", "breakfast_yn", "diet_soda", "food_label", "fruit_and_veggie", "fruit_juice", "fruits", "milk", "min_foods","regular_soda", "ssb", "total_fruit", "vegetables", "asthma", "bmiclass3", "bmiclass4","bmiclass5", "dental_absence", "dental_appt", "diabetes", "food_allergies", "sore_teeth", "trying_weight", "count_pa60days", "count_vigpa20days", "gaming_bedroom", "other_organized_pa", "pa30outdoor","paguidelines", "pc_time", "school_transport", "sport_teams", "tv_bedroom", "tv_time_char", "video_games_char")
grades <- list("2", "4", "8", "11")
groups <- list("none", "ethnic", "bordercounty")
regions <- list("state", "hsr")
And the following function, which returns an integer:
all_empty = function(outcome, groupvar, gradevar, regionvar){
#How many observations?
if (groupvar == "none")
fmla <- as.formula(paste0("~", outcome))
else
fmla <- as.formula(paste0("~", outcome, "+", groupvar))
if (regionvar == "hsr")
mydata = span_phrwts
else if (regionvar == "state" & groupvar %in% c("none", "ethnic"))
mydata = span_statewts
else if (regionvar == "state" & groupvar == "bordercounty")
mydata = span_borderwts
else mydata = span_statewts
myrow = svytable(fmla, subset(mydata, grade==gradevar)) %>% nrow()
return(myrow)
}
I'm trying to write code that will run the function on all 864 possible combinations of the values from my lists, and create one data table with 864 rows and 5 columns.
I would like the final table to look something like this, but I have not been successful:
Variable Grade Group Region Obs
beefpork 2 none state 5
beefpork 4 none state 5
beefpork 8 none state 3
beefpork 11 none state 0
This is my attempt, but I am unable to calculate rownum correctly.
output_all <- matrix(ncol = 5, nrow = length(varnames)*length(grades)*length(groups)*length(regions))
for(l in 1:length(regions)) {
for (k in 1:length(grades)) {
for(j in 1:length(groups)) {
for(i in 1:length(varnames)){
rownum = i + ((length(groups)*length(grades)*length(regions)) - 1)
output_all[rownum, 1] = varnames[[i]]
output_all[rownum, 2] = groups[[j]]
output_all[rownum, 3] = grades[[k]]
output_all[rownum, 4] = regions[[l]]
output_all[rownum, 5] = all_empty(varnames[[i]], groups[[j]], grades [[k]], regions[[l]])
}
}
}
}
output_all %>% as_data_frame() %>% View()
Any help/advice would be much appreciated!
Using data.table you have the function CJ to create the cross-join. Then we add a row num (Idx) to perform row-wise call of function. We finally remove the Idx column
library(data.table)
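# CJ() expects atomic vectors, so flatten the lists from the question first
dt <- CJ(varnames=unlist(varnames),grades=unlist(grades),groups=unlist(groups),regions=unlist(regions))
dt[,Idx:=.I]
# pass the columns in the order all_empty(outcome, groupvar, gradevar, regionvar) expects
dt[, Obs:=all_empty(varnames, groups, grades, regions), by=Idx]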
dt[,Idx:=NULL]
If it's ok to use vectors and not lists, tidyr::crossing seems like a straightforward approach.
varnames <- c("beefpork", "breakfast", "breakfast_yn", "diet_soda", "food_label", "fruit_and_veggie", "fruit_juice", "fruits", "milk", "min_foods","regular_soda", "ssb", "total_fruit", "vegetables", "asthma", "bmiclass3", "bmiclass4","bmiclass5", "dental_absence", "dental_appt", "diabetes", "food_allergies", "sore_teeth", "trying_weight", "count_pa60days", "count_vigpa20days", "gaming_bedroom", "other_organized_pa", "pa30outdoor","paguidelines", "pc_time", "school_transport", "sport_teams", "tv_bedroom", "tv_time_char", "video_games_char")
grades <- c("2", "4", "8", "11")
groups <- c("none", "ethnic", "bordercounty")
regions <- c("state", "hsr")
tidyr::crossing(varnames, grades, groups, regions)
# A tibble: 864 x 4
varnames grades groups regions
<chr> <chr> <chr> <chr>
1 asthma 11 bordercounty hsr
2 asthma 11 bordercounty state
3 asthma 11 ethnic hsr
4 asthma 11 ethnic state
5 asthma 11 none hsr
6 asthma 11 none state
7 asthma 2 bordercounty hsr
8 asthma 2 bordercounty state
9 asthma 2 ethnic hsr
10 asthma 2 ethnic state
Consider expand.grid, then call your function with mapply to pass column values elementwise to user-defined method.
varnames <- c("beefpork", "breakfast", "breakfast_yn", "diet_soda",
"food_label", "fruit_and_veggie", "fruit_juice",
"fruits", "milk", "min_foods", "regular_soda",
"ssb", "total_fruit", "vegetables", "asthma",
"bmiclass3", "bmiclass4","bmiclass5", "dental_absence",
"dental_appt", "diabetes", "food_allergies",
"sore_teeth", "trying_weight", "count_pa60days",
"count_vigpa20days", "gaming_bedroom", "other_organized_pa",
"pa30outdoor","paguidelines", "pc_time", "school_transport",
"sport_teams", "tv_bedroom", "tv_time_char", "video_games_char")
grades <- c("2", "4", "8", "11")
groups <- c("none", "ethnic", "bordercounty")
regions <- c("state", "hsr")
df <- expand.grid(varnames=varnames, grades=grades, groups=groups, regions=regions,
stringsAsFactors = FALSE)
str(df)
# 'data.frame': 864 obs. of 4 variables:
# $ varnames: chr "beefpork" "breakfast" "breakfast_yn" "diet_soda" ...
# $ grades : chr "2" "2" "2" "2" ...
# $ groups : chr "none" "none" "none" "none" ...
# $ regions : chr "state" "state" "state" "state" ...
# ...
# build the formula string from the grid columns (outcome = varnames, groupvar = groups)
df$fmla <- ifelse(df$groups == "none", paste0("~", df$varnames), paste0("~", df$varnames, "+", df$groups))
df$mydata <- ifelse(df$regions == "hsr", "span_phrwts",
ifelse(df$regions == "state" & df$groups %in% c("none", "ethnic"), "span_statewts",
ifelse(df$regions == "state" & df$groups == "bordercounty", "span_borderwts",
"span_statewts")))
Function call
all_empty <- function(outcome, groupvar, gradevar, regionvar, fmla, mydata){
# How many observations?
myrow <- svytable(as.formula(fmla), subset(get(mydata), grade==gradevar))
return(nrow(myrow))
}
df$Obs <- mapply(all_empty, df$varnames, df$groups, df$grades,
df$regions, df$fmla, df$mydata)
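Because the survey objects (span_statewts and so on) are not available here, the expand.grid plus mapply pattern can be tried out on a small grid with a stand-in function. In the sketch below, fake_all_empty is hypothetical and only returns a dummy number; the point is how each row of the grid is passed to the function element by element.

```r
# Smaller versions of the vectors from the question
varnames <- c("beefpork", "milk")
grades <- c("2", "4")
groups <- c("none", "ethnic")
regions <- c("state", "hsr")

# One row per combination; the first argument varies fastest
grid <- expand.grid(Variable = varnames, Grade = grades, Group = groups,
                    Region = regions, stringsAsFactors = FALSE)

# Hypothetical stand-in for all_empty(): returns a dummy integer
fake_all_empty <- function(outcome, groupvar, gradevar, regionvar) {
  nchar(outcome) + as.integer(gradevar)
}

# mapply() calls the function once per row, matching arguments by position
grid$Obs <- mapply(fake_all_empty, grid$Variable, grid$Group,
                   grid$Grade, grid$Region, USE.NAMES = FALSE)
print(head(grid))
```

With the full vectors from the question the grid has 864 rows, and swapping fake_all_empty for the real function gives the five-column table asked for.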