I have an issue with dplyr that I cannot resolve. I also do not have a fully reproducible example, since the problem only occurs with the full set of data (which I cannot share with you).
I do the following:
t %>% group_by(id, add=TRUE) %>%
summarise(minbplevel = min(ref, na.rm=T)
,maxbplevel = max(ref, na.rm=T)
) %>% filter(id %in% c(caseA,caseB))
Which results in
id minbplevel maxbplevel
(dbl) (dbl) (dbl)
1 B 33.0 73.0
2 A 39.4 80.4
But when I do
t %>% group_by(id, add=TRUE) %>%
mutate(minbplevel = min(ref, na.rm=T)
,maxbplevel = max(ref, na.rm=T)
) %>% filter(id %in% c(caseA,caseB))
It results in:
id Level refparmax refparmin ref meanbptest minbplevel maxbplevel
(dbl) (chr) (int) (int) (dbl) (dbl) (dbl) (dbl)
1 B 0SD 69 68 49.0 52.00000 33 73
2 B min1SD 69 68 41.0 52.00000 33 73
3 B min2SD 69 68 33.0 52.00000 33 73
4 B plus1SD 69 68 59.0 52.00000 33 73
5 B plus2SD 69 68 73.0 52.00000 33 73
6 A 0SD 100 95 56.4 35.33333 NA NA
7 A min1SD 100 95 47.4 35.33333 NA NA
8 A min2SD 100 95 39.4 35.33333 NA NA
9 A plus1SD 100 95 67.4 35.33333 NA NA
10 A plus2SD 100 95 80.4 35.33333 NA NA
I have no clue why the NAs in case A are produced. It seems that each time I try it on a subset of the data, the second case is the one with the problem, but that is just a hunch.
It is only one case of the 18850 that gives this issue, but there is nothing identifiable that makes the problem case different from the rest.
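A quick diagnostic sketch for hunting such cases (assuming, as it later turned out, that NAs in ref are the trigger) would be:
library(dplyr)
# List the ids whose ref column contains NAs -- these are the
# groups where a grouped min()/max() can misbehave.
t %>%
  group_by(id) %>%
  summarise(n_missing = sum(is.na(ref))) %>%
  filter(n_missing > 0)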
Please advise on what I can try to solve this.
I can think of workarounds, such as creating the summarized data and then merging the result with the original data, but I thought that dplyr would allow me to do this in one step.
I tried removing or adding the add = TRUE option. That does not make any difference.
Maybe I am using this in the wrong way.
Based on a comment, I tried:
subset(with(t,aggregate(ref~id, t, FUN= min, na.rm=TRUE, na.action= na.pass)),id %in% c(caseA,caseB))
Which results in
id ref
4 B 33.0
5 A 39.4
I have to mask some parts of the data.
dput(head(subset(t,id %in% c(caseA,caseB)) , 12))
gives:
Again, I replaced the actual ids with the variables caseB and caseA. Also, this is not the full dataset in which the problem occurs.
structure(list(id = c(caseB, caseB, caseB, caseB, caseB,
caseA, caseA, caseA, caseA, caseA), Level = c("0SD", "min1SD",
"min2SD", "plus1SD", "plus2SD", "0SD", "min1SD", "min2SD", "plus1SD",
"plus2SD"), refparmax = c(69L, 69L, 69L, 69L, 69L, 100L, 100L,
100L, 100L, 100L), refparmin = c(68L, 68L, 68L, 68L, 68L, 95L,
95L, 95L, 95L, 95L), ref = c(49, 41, 33, 59, 73, 56.4, 47.4,
39.4, 67.4, 80.4), meanbptest = c(52, 52, 52, 52, 52, 35.3333333333333,
35.3333333333333, 35.3333333333333, 35.3333333333333, 35.3333333333333
)), .Names = c("id", "Level", "refparmax", "refparmin", "ref",
"meanbptest"), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), vars = list(id), drop = TRUE, indices = list(
0:4, 5:9), group_sizes = c(5L, 5L), biggest_group_size = 5L, labels = structure(list(
id = c(caseB, caseA)), class = "data.frame", row.names = c(NA,
-2L), vars = list(id), drop = TRUE, .Names = "id"))
If I replace all NAs in the ref column with zeros, the mutate step works fine. As aosmith suggested, it probably has something to do with the mutate-and-NA issue that is fixed in the development version of dplyr.
I cannot test this suggestion due to workstation restrictions, though. So I will work around the issue with the NA replacement step and process the zero values after the summary steps.
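For reference, the merge-based workaround mentioned above could look like this minimal sketch (using the same t, id, and ref as in the question):
library(dplyr)
# Compute the per-id summaries once, then join them back onto the
# original rows, avoiding the grouped mutate() entirely.
bounds <- t %>%
  group_by(id) %>%
  summarise(minbplevel = min(ref, na.rm = TRUE),
            maxbplevel = max(ref, na.rm = TRUE))
t_with_bounds <- left_join(t, bounds, by = "id")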
I'm trying to calculate percent change in R, with each of the time points included in the column label (table below). I have dplyr loaded, my dataset was loaded in R, and I named it data. Below is the code I'm using, but it's not calculating correctly. I want to create a new dataframe called data_per_chg which contains the percent change from "v1" for each variable. For instance, for the wbc variable, I would like to calculate the percent change of wbc.v1 from wbc.v1, wbc.v2 from wbc.v1, wbc.v3 from wbc.v1, etc., and do that for all the remaining variables in my dataset. I'm assuming I can probably use a loop to do this easily, but I'm pretty new to R, so I'm not quite sure how to proceed. Any guidance will be greatly appreciated.
id  wbc.v1  wbc.v2  wbc.v3  rbc.v1  rbc.v2  rbc.v3  hct.v1  hct.v2  hct.v3
a1      23      63      30      23      56      90      13      89      47
a2      81      45      46     N/A      18      78      14      45      22
a3      NA      27      14      29      67      46      37      34      33
data_per_chg <- data %>%
  group_by(id) %>%
  arrange(id) %>%
  mutate(change = (wbc.v2 - wbc.v1)/(wbc.v1))
data_per_chg
The data contains both NA and the string "N/A"; first convert "N/A" to real NA and re-type the columns, then compute each change column relative to its matching v1 column:
library(dplyr)
library(stringr)
data <- data %>%
na_if("N/A") %>%
type.convert(as.is = TRUE) %>%
mutate(across(-c(id, matches("\\.v1$")), ~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change"))
-output
data
id wbc.v1 wbc.v2 wbc.v3 rbc.v1 rbc.v2 rbc.v3 hct.v1 hct.v2 hct.v3 wbc.v2_change wbc.v3_change rbc.v2_change rbc.v3_change hct.v2_change hct.v3_change
1 a1 23 63 30 23 56 90 13 89 47 1.7391304 0.3043478 1.434783 2.9130435 5.84615385 2.6153846
2 a2 81 45 46 NA 18 78 14 45 22 -0.4444444 -0.4320988 NA NA 2.21428571 0.5714286
3 a3 NA 27 14 29 67 46 37 34 33 NA NA 1.310345 0.5862069 -0.08108108 -0.1081081
If we want to keep the 'v1' columns as well
data %>%
na_if("N/A") %>%
type.convert(as.is = TRUE) %>%
mutate(across(ends_with('.v1'), ~ .x - .x,
.names = "{str_replace(.col, 'v1', 'v1change')}")) %>%
transmute(id, across(ends_with('change')),
across(-c(id, matches("\\.v1$"), ends_with('change')),
~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change")) %>%
select(id, starts_with('wbc'), starts_with('rbc'), starts_with('hct'))
-output
id wbc.v1change wbc.v2_change wbc.v3_change rbc.v1change rbc.v2_change rbc.v3_change hct.v1change hct.v2_change hct.v3_change
1 a1 0 1.7391304 0.3043478 0 1.434783 2.9130435 0 5.84615385 2.6153846
2 a2 0 -0.4444444 -0.4320988 NA NA NA 0 2.21428571 0.5714286
3 a3 NA NA NA 0 1.310345 0.5862069 0 -0.08108108 -0.1081081
data
data <- structure(list(id = c("a1", "a2", "a3"), wbc.v1 = c(23L, 81L,
NA), wbc.v2 = c(63L, 45L, 27L), wbc.v3 = c(30L, 46L, 14L), rbc.v1 = c("23",
"N/A", "29"), rbc.v2 = c(56L, 18L, 67L), rbc.v3 = c(90L, 78L,
46L), hct.v1 = c(13L, 14L, 37L), hct.v2 = c(89L, 45L, 34L), hct.v3 = c(47L,
22L, 33L)), class = "data.frame", row.names = c(NA, -3L))
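If the across()/get() idiom above is hard to follow, a reshape-based sketch (assuming the same measure.visit column-naming pattern) computes the same per-visit changes in long format:
library(dplyr)
library(tidyr)
data %>%
  na_if("N/A") %>%
  type.convert(as.is = TRUE) %>%
  # split names like "wbc.v2" into measure = "wbc", visit = "v2"
  pivot_longer(-id, names_to = c("measure", "visit"), names_sep = "\\.") %>%
  group_by(id, measure) %>%
  mutate(change = (value - value[visit == "v1"]) / value[visit == "v1"]) %>%
  ungroup()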
I have multiple data.frames with an equal number of columns. I want to combine these into a single pivot table that I can write to Excel.
Example data.frames:
> net_imports[,1:5]
1979 1980 1981 1982 1983
beginning_stocks NA -53 -83 -110 -60.000
production NA -390 -585 -510 -434.996
consumption NA 370 380 390 410.000
ending_stocks 53 83 110 60 46.000
predicted NA 10 -178 -170 -38.996
> area_harvested_output[,1:5]
1979 1980 1981 1982 1983
area_harvested_lag 51.22632 51.2263243 41.6213885 57.6296148 54.4279695
area_harvested_trend 0.00000 0.1007849 0.2015699 0.3023548 0.4031397
import_price_cpi NA 20.4610740 18.7566970 16.8987151 15.2273790
predicted NA 71.7881832 60.5796553 74.8306847 70.0584883
error NA 58.2118168 119.4203447 95.1693153 99.9415117
pred_err NA 130.0000000 180.0000000 170.0000000 170.0000000
I want the resulting table in Excel to look something like this:
Basically, I just want to maintain the variable names like "net_imports" and "area_harvested_output" as grouped data.
I'd pivot_longer both data.frames to long format, so that year becomes one column instead of (in your example) five; rbind or bind_rows them; and export the accumulated long table to Excel (where I'd then build the interactive Excel pivot table).
your example data:
net_imports <- structure(list(parameter = c("beginning_stocks", "production",
"consumption", "ending_stocks", "predicted"), X1979 = c(NA, NA,
NA, 53L, NA), X1980 = c(-53L, -390L, 370L, 83L, 10L), X1981 = c(-83L,
-585L, 380L, 110L, -178L), X1982 = c(-110L, -510L, 390L, 60L,
-170L), X1983 = c(-60, -434.996, 410, 46, -38.996)), class = "data.frame", row.names = c(NA,
5L))
area_harvested_output <- structure(list(parameter = c("area_harvested_lag", "area_harvested_trend", "import_price_cpi", "predicted", "error", "pred_err"), X1979 = c(51.22632,
0, NA, NA, NA, NA), X1980 = c(51.2263243, 0.1007849, 20.461074,
71.7881832, 58.2118168, 130), X1981 = c(41.6213885, 0.2015699,
18.756697, 60.5796553, 119.4203447, 180), X1982 = c(57.6296148,
0.3023548, 16.8987151, 74.8306847, 95.1693153, 170), X1983 = c(54.4279695,
0.4031397, 15.227379, 70.0584883, 99.9415117, 170)), class = "data.frame", row.names = c(NA,
6L))
the code:
library(dplyr)
library(tidyr)
library(rio) ## convenience package for imports/exports
long_table <-
net_imports %>%
pivot_longer(cols = -parameter,
names_to = 'year') %>%
bind_rows(
area_harvested_output %>%
pivot_longer(cols = -parameter,
names_to = 'year')
)
long_table %>% export('long_table.xlsx')
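To also keep track of which data.frame each row came from (the grouping by variable name asked about above), a variant of the same idea (a sketch) binds from a named list and records the name in a source column:
long_table <- list(net_imports = net_imports,
                   area_harvested_output = area_harvested_output) %>%
  lapply(function(d) pivot_longer(d, cols = -parameter, names_to = 'year')) %>%
  bind_rows(.id = 'source')   # 'source' holds the data.frame name
long_table %>% export('long_table.xlsx')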
In R Markdown through RStudio (R v. 4.0.3), I'm looking for a better solution for combining similarly structured dataframes while keeping all rows and matching entries on a key. Piping full_join() into a filter() into a bind_rows() directly wasn't working, possibly because of the error message:
Error: Can't combine ..1$term_code <character> and ..2$term_code <integer>.
I have 23 dataframes (let's call these "semester data") that I'm looking to combine into a single dataframe (intended to be a single dataset of individuals' outcomes from semester to semester).
Each semester dataframe is roughly 3000-4000 observations (individuals) with 45-47 variables of relevant data. A simplified example of a semester (or term) dataframe is shown below.
Simplified example of a "semester" dataframe:
id    ACT_math  course_code  section_code  term_code  grade  term_GPA
0001  23        101          001           FA12       3.45   3.8
0002  28        201          003           FA12       3.2    3.4
Individuals will show up in multiple semester dataframes as they progress through the program (taking course 101 in the fall and course 102 in the spring).
I want to use the dplyr full_join() to match these individuals on an ID key.
Using the suffix argument, I hope to keep track of which semester and course a set of data (grade, term_GPA, etc) for an individual comes from.
There's some data (ACT score, gender, state residency, etc.) that is stable for an individual across semester dataframes. Ideally I could take the first input and drop the rest, but if I had to clean this afterwards, that's fine.
I started by defining an object programmatic_database using the first semester of data, SP11. To cut down on the duplication of stable data for an individual, I selected the relevant columns that I wanted to join.
programmatic_database <- programmatic_database %>%
full_join(select(fa12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".sp11", ".fa12"), keep = FALSE, name = "id")
However, every semester new students join the program. I would like to add these entries to the bottom of the growing programmatic_database.
I'm also looking to use rbind() or bind_rows() to add these individuals to the bottom of the programmatic_database, along with their relevant data.
After full_join(), I'm filtering out the entries that have already been added horizontally to the dataframe, then piping the remaining entries into bind_rows()
programmatic_database <- fa12[!which(fa12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, fa12)
Concatenated example of what my code is producing after several iterations:
id    ACT_math  course_code  section_code  section_code.db  section_code.db.db  term_code  grade.sp11  grade.fa12  grade.sp13  grade.sp15  term_GPA.sp11  term_GPA.fa12  term_GPA.sp15
0001  23        102          001           001              001                 FA12       3.45        3.8         3.0         -           3.8            3.7            -
0002  28        201          003           003              003                 FA12       3.2         3.4         3.0         -           3.8            3.7            -
1020  28        201          003           003              003                 FA12       3.2         3.4         -           -           3.8            3.7            -
6783  30        101          -             -                -                   SP15       -           -           -           3.8         -              -              4.0
where I have successfully added horizontally for students 0001 and 0002 for outcomes in subsequent courses in subsequent semesters. I have also managed to add vertically, like with student 6783, leaving blanks for previous semesters before they enrolled but still adding the relevant columns.
Questions:
Is there a way to pipe full_join() into a filter() into a bind_rows() without running into these errors?
rbind number of columns do not match
OR
Error: Can't combine ..1$term_code <character> and ..2$term_code <integer>.
Is there an easy way to keep certain columns and only add the suffix ".fa12" to certain columns? As you can see, the .db suffixes are piling up.
Is there any way to automate this? Loops aren't my strong suit, but I'm sure there's a better-looking code than doing each of the 23 joins/binds by hand.
Thank you for your assistance!
Current code for simplicity:
#reproducible example
fa11 <- structure(list(id = c("1001", "1002", "1003",
"1013"), act6_05_composite = c(33L, 26L, 27L, 25L), course_code = c("101",
"101", "101", "101"), term_code = c("FA11", "FA11", "FA11", "FA11"
), section_code = c(1L, 1L, 1L, 1L), grade = c(4, 0, 0, 2.5
), repeat_status_flag = c(NA, "PR", NA, NA), class_code = c(1L,
1L, 1L, 1L), cum_atmpt_credits_prior = c(16, 0, 0, 0), cum_completed_credits_prior = c(0L,
0L, 0L, 0L), cum_passed_credits_prior = c(16, 0, 0, 0), cum_gpa_prior = c(0,
0, 0, 0), cum_atmpt_credits_during = c(29, 15, 18, 15), cum_completed_credits_during = c(13L,
1L, 10L, 15L), cum_passed_credits_during = c(29, 1, 14, 15),
term_gpa = c(3.9615, 0.2333, 2.3214, 2.9666)), row.names = c(NA,
4L), class = "data.frame")
sp12 <- structure(list(id = c("1007", "1013", "1355",
"2779", "2302"), act6_05_composite = c(24L, 26L, 25L, 24L,
24L), course_code = c(101L, 102L, 101L, 101L, 101L
), term_code = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_), section_code = c(1L, 1L, 1L, 1L, 1L), grade = c(2,
2.5, 2, 1.5, 3.5), repeat_status_flag = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), class_code = c(2L, 2L, 1L, 2L, 2L), cum_atmpt_credits_prior = c(44,
43, 12, 43, 30), cum_completed_credits_prior = c(41L, 43L,
12L, 43L, 12L), cum_passed_credits_prior = c(41, 43, 12,
43, 30), cum_gpa_prior = c(3.3125, 3.186, 3.5416, 3.1785,
3.8636), cum_atmpt_credits_during = c(56, 59, 25, 64, 43),
cum_completed_credits_during = c(53L, 56L, 25L, 56L, 25L),
cum_passed_credits_during = c(53, 59, 25, 64, 43), term_gpa = c(2.8333,
3.423, 3.1153, 2.1923, 3.6153)), row.names = c(NA,
5L), class = "data.frame")
# make object from fall 2011 semester dataframe
programmatic_database <- fa11
# join the spring 2012 semester dataframe by id using select variables and attaching relevant suffix
programmatic_database <- programmatic_database %>%
full_join(select(sp12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".fa11", ".sp12"), keep = FALSE, name = "id")
#view results of join, force integer type on certain variables as needed (see error above)
#filter the joined entries from the spring 2012 dataframe, then bind the remaining entries to the bottom of the growing dataset
programmatic_database <- sp12[!which(sp12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, sp12)
It would be possible to use bind_rows here if you make the column types consistent between tables. For instance, you could make a function to re-type any particular columns that aren't consistent in your original data. (That might also be something you could fix upstream as you read it in.)
library(dplyr)
set_column_types <- function(df) {
df %>%
mutate(term_code = as.character(term_code),
course_code = as.character(course_code))
}
bind_rows(
fa11 %>% set_column_types(),
sp12 %>% set_column_types() %>% mutate(term_code = "SP12")
)
This will stack your data into a relatively "long" format, like below. You may want to then reshape it depending on what kind of subsequent calculations you want to do.
id act6_05_composite course_code term_code section_code grade repeat_status_flag class_code cum_atmpt_credits_prior cum_completed_credits_prior cum_passed_credits_prior cum_gpa_prior cum_atmpt_credits_during cum_completed_credits_during cum_passed_credits_during term_gpa
1 1001 33 101 FA11 1 4.0 <NA> 1 16 0 16 0.0000 29 13 29 3.9615
2 1002 26 101 FA11 1 0.0 PR 1 0 0 0 0.0000 15 1 1 0.2333
3 1003 27 101 FA11 1 0.0 <NA> 1 0 0 0 0.0000 18 10 14 2.3214
4 1013 25 101 FA11 1 2.5 <NA> 1 0 0 0 0.0000 15 15 15 2.9666
5 1007 24 101 SP12 1 2.0 <NA> 2 44 41 41 3.3125 56 53 53 2.8333
6 1013 26 102 SP12 1 2.5 <NA> 2 43 43 43 3.1860 59 56 59 3.4230
7 1355 25 101 SP12 1 2.0 <NA> 1 12 12 12 3.5416 25 25 25 3.1153
8 2779 24 101 SP12 1 1.5 <NA> 2 43 43 43 3.1785 64 56 64 2.1923
9 2302 24 101 SP12 1 3.5 <NA> 2 30 12 30 3.8636 43 25 43 3.6153
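On question 3 (automation): rather than writing the 23 joins/binds by hand, one possibility is to collect the semester dataframes into a named list and bind them in one go, a sketch reusing set_column_types() from above (the list shown covers only the two example frames):
library(dplyr)
semesters <- list(fa11 = fa11, sp12 = sp12)  # extend to all 23 dataframes
programmatic_database <- semesters %>%
  lapply(set_column_types) %>%
  bind_rows(.id = "semester")  # records which dataframe each row came from
# sp12's missing term_code values would still need filling in, as above.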
I have a dataset I initially manipulate with the gather() function. I am now attempting to create averages of groups in the gathered data. I am having issues understanding the best way to create averages of the data provided here. My hope is to create an average associated with each group. Here I am averaging scores for 'observers'.
EDIT: I need an average for each observer over all dates of observation.
EDIT-2: Each observer has any number of individuals they will be assessing. If I use group_by(observer) the average will be over all observations total, not an average for the observer.
EDIT-3: I am hoping to see averages of each observation date's fidelity score. If I have 3 scores (90, 100, 120), I would like to see an average of these values attributed to the observer, but still be able to display the scores over time. The output I am hoping for would be:
Important Note: My fidelity scores are all out of 129 possible points.
EDIT-4: I would like to average observer scores over the observation dates (date_of_observation).
Here is the function I am using to create my averages.
LPLC_Group %>%
group_by(observer,date_of_observation)%>%
summarize(fidelity_score = sum(value,na.rm=TRUE),
average_fidelity = round(mean(fidelity_score,na.rm=TRUE),2))
The following dput is related to the output of the function above. I cannot post my full dataset. The output of this function should be enough to work with.
dput output:
structure(list(observer = c("Cristianne", "Cristianne", "Cristianne",
"Deb", "Deb", "Deb", "Lori", "Lori", "Lori", "Pauline", "Pauline",
"Pauline"), date_of_observation = c("6/24/19", "7/24/19", "8/24/19",
"6/24/19", "7/24/19", "8/24/19", "6/24/19", "7/24/19", "8/24/19",
"6/24/19", "7/24/19", "8/24/19"), fidelity_score = c(100L, 87L,
95L, 89L, 106L, 98L, 85L, 104L, 102L, 94L, 85L, 113L), average_fidelity = c(100,
87, 95, 89, 106, 98, 85, 104, 102, 94, 85, 113)), row.names = c(NA,
-12L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(
observer = c("Cristianne", "Deb", "Lori", "Pauline"), .rows = list(
1:3, 4:6, 7:9, 10:12)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
library(dplyr)
LPLC_Group %>%
group_by(observer) %>%
mutate(average_fidelity = mean(fidelity_score))
# A tibble: 12 x 4
# Groups: observer [4]
observer date_of_observation fidelity_score average_fidelity
<chr> <chr> <int> <dbl>
1 Cristianne 6/24/19 100 94
2 Cristianne 7/24/19 87 94
3 Cristianne 8/24/19 95 94
4 Deb 6/24/19 89 97.7
5 Deb 7/24/19 106 97.7
6 Deb 8/24/19 98 97.7
7 Lori 6/24/19 85 97
8 Lori 7/24/19 104 97
9 Lori 8/24/19 102 97
10 Pauline 6/24/19 94 97.3
11 Pauline 7/24/19 85 97.3
12 Pauline 8/24/19 113 97.3
If the output you get does not match mine for this input, then you have probably succumbed to the mistake of loading plyr after dplyr and ignoring the warning. I would suggest restarting R and being careful to load plyr before dplyr (if at all).
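If a single row per observer is wanted instead of the average repeated on every row, swapping mutate() for summarise() collapses the groups (a sketch on the same dput data):
library(dplyr)
LPLC_Group %>%
  group_by(observer) %>%
  summarise(average_fidelity = round(mean(fidelity_score), 2))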
I have the following data, and I would like to apply the function diff() only on consecutive days. diff(data$ch, differences = 1, lag = 1) returns the differences between all consecutive values of ch (23-12, 4-23, 78-4, 120-78, 94-120, ...). I would like diff() to return NA when the dates are not consecutive. The output I am trying to obtain from the data below is:
11, -19, 74, NA, -26, NA, -34, 39, NA
Is there anyone who knows how I can do that?
Date ch
2013-01-01 12
2013-01-02 23
2013-01-03 4
2013-01-04 78
2013-01-10 120
2013-01-11 94
2013-02-26 36
2013-02-27 2
2013-02-28 41
2003-03-05 22
You can do this in base R without installing any external packages.
Assuming that the 'Date' column is of Date class, we take the diff of the 'Date' and based on whether the difference between adjacent elements are greater than 1 or not, we can create a grouping index ('indx') by taking the cumulative sum (cumsum) of the logical vector.
indx <- cumsum(c(TRUE,abs(diff(df1$Date))>1))
In the second step, we can use ave with 'indx' as the grouping vector and take the diff of 'ch'. The length of the output of diff will be 1 less than the length of the 'ch' column, so we can append NA to make the lengths the same.
ave(df1$ch, indx, FUN=function(x) c(diff(x),NA))
#[1] 11 -19 74 NA -26 NA -34 39 NA NA
data
df1 <- structure(list(Date = structure(c(15706, 15707, 15708, 15709,
15715, 15716, 15762, 15763, 15764, 12116), class = "Date"), ch = c(12L,
23L, 4L, 78L, 120L, 94L, 36L, 2L, 41L, 22L)), .Names = c("Date",
"ch"), row.names = c(NA, -10L), class = "data.frame")
The following just "...returns NA when the dates are not consecutive", unless there are tricky cases that it won't account for:
replace(diff(df1$ch), abs(diff(df1$Date)) > 1, NA)
#[1] 11 -19 74 NA -26 NA -34 39 NA
Try this with the libraries lubridate and dplyr.
If you don't have them, run this once: install.packages("dplyr"); install.packages("lubridate")
Code
library(lubridate)
library(dplyr)
data$Date <- ymd(data$Date)
data2 <- data %>% mutate(diff=ifelse(Date==lag(Date)+days(1), ch-lag(ch), NA))
Data
data <-
data.frame(Date=c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-10",
"2013-01-11", "2013-01-26", "2013-01-27", "2013-01-28", "2013-03-05"),
ch=c(12, 23, 4, 78, 120, 94, 36, 2, 41, 22))
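Note that this mutate() attaches each difference to the later row of each consecutive pair (so the first value is NA), whereas the question's expected output aligns each difference with the earlier row. A lead()-based sketch matches that alignment:
data3 <- data %>%
  mutate(diff = ifelse(lead(Date) == Date + days(1), lead(ch) - ch, NA))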