This is my first ever post, so bear with me. I am trying to manipulate a data set in R by adding new columns based on existing data. I've converted my data to a data frame and have employed the mutate function. The function works. However, when I call my dataset again to look at the changes, the new column disappears. What am I doing wrong?
# Converting raw data into a tibble data frame for easier data analysis:
spdata <- as_tibble(rawdata)
# Creating a new Grade column based on Math Scores
spdata %>%
  mutate(math.grade = case_when(math.score < 60 ~ "F",
                                math.score >= 60 & math.score <= 69 ~ "D",
                                math.score >= 70 & math.score <= 79 ~ "C",
                                math.score >= 80 & math.score <= 89 ~ "B",
                                math.score >= 90 & math.score <= 100 ~ "A"))
Here is the output that automatically generates after I run my mutate function:
# A tibble: 1,000 x 9
gender race.ethnicity parental.level.of.education lunch test.preparation.course math.score reading.score writing.score math.grade
<fct> <fct> <fct> <fct> <fct> <int> <int> <int> <chr>
1 female group B bachelor's degree standard none 72 72 74 C
2 female group C some college standard completed 69 90 88 D
3 female group B master's degree standard none 90 95 93 A
4 male group A associate's degree free/reduced none 47 57 44 F
5 male group C some college standard none 76 78 75 C
6 female group B associate's degree standard none 71 83 78 C
7 female group B some college standard completed 88 95 92 B
8 male group B some college free/reduced none 40 43 39 F
9 male group D high school free/reduced completed 64 64 67 D
10 female group B high school free/reduced none 38 60 50 F
# ... with 990 more rows
My new math.grade variable shows up as expected.
However, when I call spdata again to look at it, the math.grade column is missing:
# A tibble: 1,000 x 8
gender race.ethnicity parental.level.of.education lunch test.preparation.course math.score reading.score writing.score
<fct> <fct> <fct> <fct> <fct> <int> <int> <int>
1 female group B bachelor's degree standard none 72 72 74
2 female group C some college standard completed 69 90 88
3 female group B master's degree standard none 90 95 93
4 male group A associate's degree free/reduced none 47 57 44
5 male group C some college standard none 76 78 75
6 female group B associate's degree standard none 71 83 78
7 female group B some college standard completed 88 95 92
8 male group B some college free/reduced none 40 43 39
9 male group D high school free/reduced completed 64 64 67
10 female group B high school free/reduced none 38 60 50
# ... with 990 more rows
You need to assign the data frame with the additional column to a new variable using <-:
new_df <- spdata %>%
  mutate(math.grade = case_when(math.score < 60 ~ "F",
                                math.score >= 60 & math.score <= 69 ~ "D",
                                math.score >= 70 & math.score <= 79 ~ "C",
                                math.score >= 80 & math.score <= 89 ~ "B",
                                math.score >= 90 & math.score <= 100 ~ "A"))
new_df
This should work...
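Alternatively, if you want to update spdata in place without repeating its name on both sides of the assignment, magrittr's compound assignment pipe %<>% does the assignment for you. A minimal sketch, using toy data standing in for spdata (the scores here are made up):

```r
library(dplyr)
library(magrittr)

# Toy stand-in for spdata (hypothetical scores)
spdata <- tibble(math.score = c(55, 65, 75, 85, 95))

# %<>% pipes spdata into the right-hand side and assigns the result
# back to spdata in one step, so the new column persists:
spdata %<>%
  mutate(math.grade = case_when(math.score < 60 ~ "F",
                                math.score <= 69 ~ "D",
                                math.score <= 79 ~ "C",
                                math.score <= 89 ~ "B",
                                TRUE ~ "A"))

spdata$math.grade
#> [1] "F" "D" "C" "B" "A"
```

Note that case_when() evaluates its conditions in order, so the later branches can drop the lower-bound checks.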
I want to calculate the percentage for each column in a data frame by adding a new column after each existing column in R. How can I achieve this?
The percentage is calculated based on another column.
Here is an example of a dataset. The percentage is calculated based on the total column for col2, col3, and col4:
Var total col2 col3 col4
A 217 77 62 78
D 112 14 47 51
B 91 15 39 37
R 89 77 7 5
V 80 8 53 19
The output should look like
Var total col2 col2_percent col3 col3_percent col4 col4_percent
A 217 77 35.48% 62 28.57% 78 35.94%
D 112 14 12.50% 47 41.96% 51 45.54%
B 91 15 16.48% 39 42.86% 37 40.66%
R 89 77 86.52% 7 7.87% 5 5.62%
V 80 8 10.00% 53 66.25% 19 23.75%
You can use across:
library(dplyr)
df %>%
  mutate(across(-c(Var, total), ~ sprintf('%.2f%%', .x / total * 100), .names = "{col}_percent")) %>%
  relocate(Var, total, sort(colnames(.)))
Var total col2 col2_percent col3 col3_percent col4 col4_percent
1 A 217 77 35.48% 62 28.57% 78 35.94%
2 D 112 14 12.50% 47 41.96% 51 45.54%
3 B 91 15 16.48% 39 42.86% 37 40.66%
4 R 89 77 86.52% 7 7.87% 5 5.62%
5 V 80 8 10.00% 53 66.25% 19 23.75%
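If you'd rather avoid dplyr, the same percent columns can be built in base R with a simple loop. A sketch using the first two rows of the example data (note this appends the percent columns at the end rather than interleaving them, so you would still need to reorder columns if the layout matters):

```r
# Toy copy of the first two rows of the example data
df <- data.frame(Var = c("A", "D"), total = c(217, 112),
                 col2 = c(77, 14), col3 = c(62, 47), col4 = c(78, 51))

# Loop over the score columns and append a formatted percent column for each
for (col in c("col2", "col3", "col4")) {
  df[[paste0(col, "_percent")]] <- sprintf("%.2f%%", df[[col]] / df$total * 100)
}

df$col2_percent
#> [1] "35.48%" "12.50%"
```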
Let's say I have the following dataframe
country_df <- tibble(
population = c(328, 38, 30, 56, 1393, 126, 57),
population2 = c(133, 12, 99, 83, 1033, 101, 33),
population3 = c(89, 39, 33, 56, 193, 126, 58),
pop = 45
)
All I need is a concise way inside the mutate function to get the number of columns (population to population3) that are greater than the value of the pop column within each row.
So what I need is the following result (more specifically, the GreaterTotal column). Note: I can get the answer by working through each column individually, but that would take a while with more columns.
population population2 population3 pop GreaterThan0 GreaterThan1 GreaterThan2 GreaterTotal
<dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <int>
1 328 133 89 45 TRUE TRUE TRUE 3
2 38 12 39 45 FALSE FALSE FALSE 0
3 30 99 33 45 FALSE TRUE FALSE 1
4 56 83 56 45 TRUE TRUE TRUE 3
5 1393 1033 193 45 TRUE TRUE TRUE 3
6 126 101 126 45 TRUE TRUE TRUE 3
7 57 33 58 45 TRUE FALSE TRUE 2
I've tried using apply with the row index, but I can't get at it. Can somebody please point me in the right direction?
You can select the 'population' columns, compare those columns with pop, and use rowSums to count how many of them are greater in each row.
cols <- grep('population', names(country_df))
country_df$GreaterTotal <- rowSums(country_df[cols] > country_df$pop)
# population population2 population3 pop GreaterTotal
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 328 133 89 45 3
#2 38 12 39 45 0
#3 30 99 33 45 1
#4 56 83 56 45 3
#5 1393 1033 193 45 3
#6 126 101 126 45 3
#7 57 33 58 45 2
In dplyr 1.0.0, you can do this with rowwise and c_across:
country_df %>%
  rowwise() %>%
  mutate(GreaterTotal = sum(c_across(population:population3) > pop))
Using tidyverse, we can do
library(dplyr)
country_df %>%
  mutate(GreaterTotal = rowSums(select(., starts_with('population')) > .$pop))
Output:
# A tibble: 7 x 5
# population population2 population3 pop GreaterTotal
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 328 133 89 45 3
#2 38 12 39 45 0
#3 30 99 33 45 1
#4 56 83 56 45 3
#5 1393 1033 193 45 3
#6 126 101 126 45 3
#7 57 33 58 45 2
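On recent dplyr versions you can also combine across() with rowSums() and skip both rowwise() and the select(.) idiom. A sketch using the first three rows of the example data:

```r
library(dplyr)

country_df <- tibble(
  population  = c(328, 38, 30),
  population2 = c(133, 12, 99),
  population3 = c(89, 39, 33),
  pop = 45
)

# across() builds a data-frame column of the population* values; comparing
# it against pop yields a logical matrix that rowSums() counts per row:
res <- country_df %>%
  mutate(GreaterTotal = rowSums(across(starts_with("population")) > pop))

res$GreaterTotal
#> [1] 3 0 1
```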
Thank you for taking a look at my question!
I have the following (dummy) data for patient performance on 3 tasks:
patient_df = data.frame(id = seq(1:5),
age = c(30,72,46,63,58),
education = c(11, 22, 18, 12, 14),
task1 = c(21, 28, 20, 24, 22),
task2 = c(15, 15, 10, 11, 14),
task3 = c(82, 60, 74, 78, 78))
> patient_df
id age education task1 task2 task3
1 1 30 11 21 15 82
2 2 72 22 28 15 60
3 3 46 18 20 10 74
4 4 63 12 24 11 78
5 5 58 14 22 14 78
I also have the following (dummy) lookup table for age and education-based cutoff values to define a patient's performance as impaired or not impaired on each task:
cutoffs = data.frame(age = rep(seq(from = 35, to = 70, by = 5), 2),
education = c(rep("<16", 8), rep(">=16",8)),
task1_cutoff = c(rep(24, 16)),
task2_cutoff = c(11,11,11,11,10,10,10,10,9,13,13,13,13,12,12,11),
task3_cutoff = c(rep(71,8), 70, rep(74,2), rep(73, 5)))
> cutoffs
age education task1_cutoff task2_cutoff task3_cutoff
1 35 <16 24 11 71
2 40 <16 24 11 71
3 45 <16 24 11 71
4 50 <16 24 11 71
5 55 <16 24 10 71
6 60 <16 24 10 71
7 65 <16 24 10 71
8 70 <16 24 10 71
9 35 >=16 24 9 70
10 40 >=16 24 13 74
11 45 >=16 24 13 74
12 50 >=16 24 13 73
13 55 >=16 24 13 73
14 60 >=16 24 12 73
15 65 >=16 24 12 73
16 70 >=16 24 11 73
My goal is to create 3 new variables in patient_df that indicate whether or not a patient is impaired on each task with a binary indicator. For example, for id=1 in patient_df, their age is <=35 and their education is <16 years, so the cutoff value for task1 would be 24, the cutoff value for task2 would be 11, and the cutoff value for task3 would be 71, such that scores below these values would denote impairment.
I would like to do this for each id by referencing the age and education-associated cutoff value in the cutoff dataset, so that the outcome would look something like this:
> goal_patient_df
id age education task1 task2 task3 task1_impaired task2_impaired task3_impaired
1 1 30 11 21 15 82 1 1 0
2 2 72 22 28 15 60 0 0 1
3 3 46 18 20 10 74 1 1 0
4 4 63 12 24 11 78 1 0 0
5 5 58 14 22 14 78 1 0 0
In actuality, my patient_df has 600+ patients and there are 7+ tasks each with age- and education-associated cutoff values, so a 'clean' way of doing this would be greatly appreciated! My only alternative that I can think of right now is writing a TON of if_else statements or case_whens which would not be incredibly reproducible for anyone else who would use my code :(
Thank you in advance!
I would recommend putting both your lookup table and patient_df dataframe in long form. I think that might be easier to manage with multiple tasks.
Your education column is numeric, so converting it to the character values "<16" or ">=16" will help with matching in the lookup table.
Using fuzzy_inner_join will match data with the lookup table where task and education match exactly (==), while age falls between age_low and age_high, provided you specify an age range for each lookup-table row.
Finally, impaired is calculated by comparing the values from the two data frames for the particular task.
Please note that in the output, id 1 is missing, as it falls outside the age range covered by the lookup table. You can add more rows to that table to address this.
library(tidyverse)
library(fuzzyjoin)
cutoffs_long <- cutoffs %>%
  pivot_longer(cols = starts_with("task"), names_to = "task", values_to = "cutoff_value", names_pattern = "task(\\d+)") %>%
  mutate(age_low = age,
         age_high = age + 4) %>%
  select(-age)

patient_df %>%
  pivot_longer(cols = starts_with("task"), names_to = "task", values_to = "patient_value", names_pattern = "(\\d+)") %>%
  mutate(education = ifelse(education < 16, "<16", ">=16")) %>%
  fuzzy_inner_join(cutoffs_long, by = c("age" = "age_low", "age" = "age_high", "education", "task"), match_fun = list(`>=`, `<=`, `==`, `==`)) %>%
  mutate(impaired = +(patient_value < cutoff_value))
Output
# A tibble: 12 x 11
id age education.x task.x patient_value education.y task.y cutoff_value age_low age_high impaired
<int> <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <int>
1 2 72 >=16 1 28 >=16 1 24 70 74 0
2 2 72 >=16 2 15 >=16 2 11 70 74 0
3 2 72 >=16 3 60 >=16 3 73 70 74 1
4 3 46 >=16 1 20 >=16 1 24 45 49 1
5 3 46 >=16 2 10 >=16 2 13 45 49 1
6 3 46 >=16 3 74 >=16 3 74 45 49 0
7 4 63 <16 1 24 <16 1 24 60 64 0
8 4 63 <16 2 11 <16 2 10 60 64 0
9 4 63 <16 3 78 <16 3 71 60 64 0
10 5 58 <16 1 22 <16 1 24 55 59 1
11 5 58 <16 2 14 <16 2 10 55 59 0
12 5 58 <16 3 78 <16 3 71 55 59 0
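As a side note, if you are on dplyr 1.1.0 or later, the age-range condition can also be expressed as a non-equi join with join_by(), without pulling in fuzzyjoin. A minimal sketch with cut-down, hypothetical stand-ins for the two long-format tables built above:

```r
library(dplyr)  # >= 1.1.0 for join_by()

# Cut-down stand-ins for the patient data and lookup table in long form
patient_long <- tibble(id = c(2, 3), age = c(72, 46),
                       education = ">=16", task = "1",
                       patient_value = c(28, 20))
cutoffs_long <- tibble(education = ">=16", task = "1",
                       cutoff_value = 24,
                       age_low = c(70, 45), age_high = c(74, 49))

# Exact match on education and task, range match on age
res <- patient_long %>%
  inner_join(cutoffs_long,
             by = join_by(education, task,
                          between(age, age_low, age_high))) %>%
  mutate(impaired = +(patient_value < cutoff_value))

res$impaired
#> [1] 0 1
```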
As part of a more complex procedure, I found myself lost in this passage. Below is a reproducible example of what I am dealing with. I need to add a column to each nested dataset, with the same number within each dataset but a different number between them. Specifically, the number has to be the value in c1$Age. The code cbind(k, AgeGroup = 3) is only for demonstration. In fact, when I used cbind(k, AgeGroup = Age), R gave me the following error: Error in mutate_impl(.data, dots): Evaluation error: arguments imply differing number of rows: 5, 2.
library(dplyr)
library(purrr)
library(magrittr)
library(tidyr)
c <- read.table(header = TRUE, text = "Age Verbal Fluid Speed
2 89 94 103
1 98 88 100
1 127 115 102
2 83 101 71
2 102 92 87
1 91 97 120
1 96 129 98
2 79 92 84
2 107 95 102")
c1 <- c %>%
  group_by(Age) %>%
  nest() %>%
  dplyr::mutate(db = data %>% map(function(k) cbind(k, AgeGroup = 3)))
#> c1
# A tibble: 2 x 3
# Age data db
# <int> <list> <list>
#1 2 <tibble [5 x 3]> <data.frame [5 x 4]>
#2 1 <tibble [4 x 3]> <data.frame [4 x 4]>
This is what I have now:
#> c1$db
#[[1]]
# Verbal Fluid Speed AgeGroup
#1 89 94 103 3
#2 83 101 71 3
#3 102 92 87 3
#4 79 92 84 3
#5 107 95 102 3
#
#[[2]]
# Verbal Fluid Speed AgeGroup
#1 98 88 100 3
#2 127 115 102 3
#3 91 97 120 3
#4 96 129 98 3
This is what I would like to get.
#> c1$db
#[[1]]
# Verbal Fluid Speed AgeGroup
#1 89 94 103 2
#2 83 101 71 2
#3 102 92 87 2
#4 79 92 84 2
#5 107 95 102 2
#
#[[2]]
# Verbal Fluid Speed AgeGroup
#1 98 88 100 1
#2 127 115 102 1
#3 91 97 120 1
#4 96 129 98 1
You could replace map with map2 and in this way keep track of the corresponding value of Age:
c1 <- c %>%
  group_by(Age) %>%
  nest() %>%
  dplyr::mutate(db = data %>% map2(Age, function(k, age) cbind(k, AgeGroup = age)))
c1$db
# [[1]]
# Verbal Fluid Speed AgeGroup
# 1 89 94 103 2
# 2 83 101 71 2
# 3 102 92 87 2
# 4 79 92 84 2
# 5 107 95 102 2
#
# [[2]]
# Verbal Fluid Speed AgeGroup
# 1 98 88 100 1
# 2 127 115 102 1
# 3 91 97 120 1
# 4 96 129 98 1
When you tried cbind(k, AgeGroup = Age) directly, the problem was that Age was a vector 2:1, rather than a single corresponding value.
We can use map2 to loop through both Age and data columns and update the data columns using mutate.
library(dplyr)
library(purrr)
library(magrittr)
library(tidyr)
c1 <- c %>%
group_by(Age) %>%
nest()
c2 <- c1 %>%
mutate(data = map2(data, Age, ~mutate(.x, AgeGroup = .y)))
c2$data
# [[1]]
# # A tibble: 5 x 4
# Verbal Fluid Speed AgeGroup
# <int> <int> <int> <int>
# 1 89 94 103 2
# 2 83 101 71 2
# 3 102 92 87 2
# 4 79 92 84 2
# 5 107 95 102 2
#
# [[2]]
# # A tibble: 4 x 4
# Verbal Fluid Speed AgeGroup
# <int> <int> <int> <int>
# 1 98 88 100 1
# 2 127 115 102 1
# 3 91 97 120 1
# 4 96 129 98 1
You can now use dplyr::rowwise:
c <- read.table(header = TRUE, text = "Age Verbal Fluid Speed
2 89 94 103
1 98 88 100
1 127 115 102
2 83 101 71
2 102 92 87
1 91 97 120
1 96 129 98
2 79 92 84
2 107 95 102")
c1 <- c %>%
  group_by(Age) %>%
  nest() %>%
  rowwise() %>%
  mutate(db = list(cbind(data, Age)))
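Simpler still, since the value you want inside each nested frame is just the grouping variable, you can duplicate it into a new column before nesting, so every nested tibble carries its own copy. A sketch on a cut-down version of the example data:

```r
library(dplyr)
library(tidyr)

dat <- read.table(header = TRUE, text = "Age Verbal
2 89
1 98
1 127")

# Duplicate Age before nesting; the copy is a non-grouping column,
# so nest() carries it into each nested tibble
c1 <- dat %>%
  mutate(AgeGroup = Age) %>%
  group_by(Age) %>%
  nest()

c1$data[[which(c1$Age == 1)]]$AgeGroup
#> [1] 1 1
```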
I have a dataset with 15 numeric columns, col1 to col15. I have 100 rows of data, with a name attached to each row as a factor. I want to compute a summary for each row across all 15 columns.
head(df2phcl[,c(1:16)])
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 col15 NAME
78 95 101 100 84 93 93 85 81 97 80 94 81 79 87 R04-001
100 61 96 75 98 92 99 99 102 83 84 NA 101 93 96 R04-002
81 84 82 83 77 86 90 92 92 78 86 91 59 80 84 R04-003
91 84 87 95 103 93 92 95 86 92 107 96 94 87 97 R04-004
72 79 66 98 84 75 85 83 75 80 91 65 90 81 73 R04-005
72 75 68 44 79 64 83 71 81 82 85 63 87 94 60 R04-006
My code for this is:
library(dplyr)
####Rachis
SUMCL <- df2phcl %>%
  group_by(name) %>%
  summarise(CL = mean(c(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15), na.rm=T),
            CLMAX = max(c(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15), na.rm=T),
            CLMIN = min(c(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15), na.rm=T),
            CLSTD = sd(c(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15), na.rm=T),
            OUT = outliers(c(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15), na.rm=T))
head(SUMCL)
tail(SUMCL)
My resulting analysis comes out as...
Error:
Evaluation error: missing value where TRUE/FALSE needed.
I've also tried this...
df2phcl$col1+col2+col3+col4+col5+col6+col7+col8+col9+col10+col11+col12+col13+co114+col15[!df2phcl$col1+col2+col3+col4+col5+col6+col7+col8+col9+col10+col11+col12+col13+col14+col15%in%boxplot.stats(df2phcl$col1+col2+col3+col4+col5+col6+col7+col8+col9+col10+co111+col12+col13+col14+col15)$out]
This returns ....
Error: object 'col2' not found
Not sure what I'm doing wrong; this works with mean, max, min, and sd.
> head(SUMCL)
# A tibble: 6 x 11
# Groups: ENTRY, NAME, HEADCODE, RHTGENES, HEAD, PL [6]
ENTRY NAME HEADCODE RHTGENES HEAD PL PH CL CLMAX CLMIN CLSTD
<int> <fctr> <fctr> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 R04-001 CAW Rht1 Club 319 83 88.53333 101 78 7.989875
2 2 R04-002 LBW Wildtype Common 330 102 91.35714 102 61 11.770936
3 3 R04-003 CBW Rht2 Club 230 82 83.00000 92 59 8.220184
4 4 R04-004 LBW Rht1 Common 328 117 93.26667 107 84 6.192930
5 5 R04-005 CBW Rht1 Club 280 97 79.80000 98 65 9.182281
6 6 R04-006 LAW Rht1 Common 310 92 73.86667 94 44 12.749603
I just want to filter out the outliers at 3 sd or more and then use the dplyr package to do my statistics...
I'm not exactly sure what you're trying to do, so let me know if the code below is on the right track.
The approach below is to convert the data from wide to long format, which makes it much easier to do the summaries for each level of name.
library(tidyverse)
# Fake data
set.seed(2)
dat = as.data.frame(replicate(15, rnorm(100)))
names(dat) = paste0("col", 1:15)
dat$name = paste0(rep(LETTERS[1:10], each=10), rep(letters[1:10], 10))
# Convert data to long format, remove outliers and summarize
dat %>%
gather(column, value, -name) %>% # reshape from wide to long
group_by(name) %>% # summarize by name
mutate(value = replace(value, abs(value - mean(value)) > 2*sd(value), NA)) %>% # set outliers to NA
summarise(mean = mean(value, na.rm=TRUE),
max = max(value, na.rm=TRUE),
sd = sd(value, na.rm=TRUE))
name mean max sd
1 Aa 0.007848188 1.238744 0.8510016
2 Ab -0.208536464 1.980401 1.2764606
3 Ac -0.152986713 1.587845 0.8443106
4 Ad -0.413543054 0.965692 0.7225872
5 Ae -0.112648322 1.178716 0.7269527
6 Af 0.442268890 2.048040 1.0350119
7 Ag 0.390627994 1.978260 0.8716681
8 Ah 0.080505879 2.396349 1.3128403
9 Ai 0.257925059 1.984474 1.0196722
10 Aj 0.137469703 1.470177 0.7192616
# ... with 90 more rows
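Note that gather() has since been superseded by pivot_longer(), so on current tidyr the reshape step can be written as below. A sketch with a tiny, hypothetical stand-in for the wide data:

```r
library(dplyr)
library(tidyr)

# Tiny stand-in for the wide data (two rows, two score columns)
dat <- data.frame(col1 = c(1, 2), col2 = c(3, 10),
                  name = c("R04-001", "R04-002"))

# pivot_longer() is the current replacement for gather():
summ <- dat %>%
  pivot_longer(cols = -name, names_to = "column", values_to = "value") %>%
  group_by(name) %>%
  summarise(CL = mean(value), CLMAX = max(value))

summ$CLMAX
#> [1]  3 10
```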
I managed to get some of the column standard deviations changed; however, I'm not sure how many observations it took out. I wanted to remove an even amount from the top and the bottom of the distribution. A trimmed mean, for example, would take out 20% of the observations from the top and bottom of the distribution. What I was curious about was instead removing only the observations beyond ±3 SD at the top and bottom of the distribution.
> SUMCL <- df2phcl %>%
+ gather(column, value, -c(ENTRY, NAME, HEADCODE, RHTGENES, HEAD,PL,PH)) %>% # reshape from wide to long
+ group_by(ENTRY, NAME, HEADCODE, RHTGENES, HEAD,PL,PH) %>% # summarize by name
+ mutate(value = replace(value, abs(value - mean(value)) > 2*sd(value), NA)) %>% # set outliers to NA
+ summarise(CL = mean(value, na.rm=TRUE),
+ CLMAX = max(value, na.rm=TRUE),
+ CLMIN = min(value, na.rm=TRUE),
+ N = sum(!is.na(value), na.rm=TRUE),
+ CLSTD= sd(value, na.rm=TRUE),
+ CLSE = (CLSTD / sqrt(N)))
> head(SUMCL)
# A tibble: 6 x 13
# Groups: ENTRY, NAME, HEADCODE, RHTGENES, HEAD, PL [6]
ENTRY NAME HEADCODE RHTGENES HEAD PL PH CL CLMAX CLMIN N CLSTD CLSE
<int> <fctr> <fctr> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 1 R04-001 CAW Rht1 Club 319 83 88.53333 101 78 15 7.989875 2.062977
2 2 R04-002 LBW Wildtype Common 330 102 91.35714 102 61 14 11.770936 3.145915
3 3 R04-003 CBW Rht2 Club 230 82 84.71429 92 77 14 5.029583 1.344213
4 4 R04-004 LBW Rht1 Common 328 117 92.28571 103 84 14 5.075258 1.356420
5 5 R04-005 CBW Rht1 Club 280 97 79.80000 98 65 15 9.182281 2.370855
6 6 R04-006 LAW Rht1 Common 310 92 76.00000 94 60 14 10.076629 2.693093