Cut function alternative in R

I have some data in the form:
Person.ID Household.ID Composition
1 4593 1A_0C
2 4992 2A_1C
3 9843 1A_1C
4 8385 2A_2C
5 9823 8A_1C
6 3458 1A_9C
7 7485 2A_0C
: : :
We can think of the Composition variable as a count of adults and children, i.e. 2A_1C equates to two adults and one child.
What I want to do is reduce the number of possible levels of Composition. For person 5 we have a composition of 8A_1C, and I am looking for a way to reduce this to 4+A_1C. So, for example, we would have 4+ for any count greater than 4.
Person.ID Household.ID Composition
5 9823 4+A_1C
6 3458 1A_4+C
: : :
I am unsure of how to do this in R. I am thinking of using filter() or select() from dplyr. Otherwise I would need to use some sort of regular expression.
Any help would be appreciated. Thanks

Data:
library(tibble)  # tibble()
library(dplyr)   # %>%
Person.ID <- c(1,2,3,4,5,6,7)
Household.ID <- c(4593,4992,9843,8385,9823,3458,7485)
Composition <- c("1A_0C","2A_1C","1A_1C","2A_2C","8A_1C","1A_9C","2A_0C")
dat <- tibble(Person.ID, Household.ID, Composition)
Function:
above4 <- function(f){
  # strip the letter and compare as a number (a string comparison would rank "10" below "4")
  ff <- as.numeric(gsub("[^0-9]","",f))
  if(ff>4){return("4+")}
  as.character(ff)
}
Apply function (done on separated data, but can recombine after):
dat_ <- dat %>%
  tidyr::separate(col = Composition,
                  into = c("Adults", "Children"),
                  sep = "_") %>%
  dplyr::mutate(Adults_ = unlist(lapply(Adults, above4)),
                Children_ = unlist(lapply(Children, above4)))
You might then use select, filter to get your required dataset.
dat_ %>% dplyr::mutate(Composition_ = paste0(Adults_, "A_", Children_, "C")) %>%
dplyr::select(Person.ID, Household.ID, Composition=Composition_)
# A tibble: 7 x 3
Person.ID Household.ID Composition
<dbl> <dbl> <chr>
1 1. 4593. 1A_0C
2 2. 4992. 2A_1C
3 3. 9843. 1A_1C
4 4. 8385. 2A_2C
5 5. 9823. 4+A_1C
6 6. 3458. 1A_4+C
7 7. 7485. 2A_0C

We can use gsub:
df$Composition <- gsub("(?<!\\d)([5-9]|\\d{2,})(?=[AC])", "4+", df$Composition, perl = TRUE)
This assumes that 2 or more consecutive digits represent a number that's always greater than 4 (i.e. no 01, 02, or 001).
Output:
Person.ID Household.ID Composition
1 1 4593 1A_0C
2 2 4992 2A_1C
3 3 9843 1A_1C
4 4 8385 2A_2C
5 5 9823 4+A_1C
6 6 3458 1A_4+C
7 7 7485 2A_0C
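If leading zeros such as 08A_01C could occur, the regular expression's assumption no longer holds. The following is only a sketch of one way around that, parsing both counts as numbers before capping them. The function name cap_counts and its cap argument are made up for illustration, and it assumes every value has the form <digits>A_<digits>C.

```r
# Hypothetical helper: cap adult/child counts above `cap` as "<cap>+".
# Assumes every element of x matches <digits>A_<digits>C.
cap_counts <- function(x, cap = 4) {
  # regexec() returns the full match plus the two captured digit groups
  parts <- regmatches(x, regexec("^(\\d+)A_(\\d+)C$", x))
  vapply(parts, function(p) {
    n <- as.integer(p[2:3])  # adults, children as numbers
    lab <- ifelse(n > cap, paste0(cap, "+"), as.character(n))
    paste0(lab[1], "A_", lab[2], "C")
  }, character(1))
}

capped <- cap_counts(c("1A_0C", "8A_1C", "1A_9C", "10A_04C"))
print(capped)
```

Because the counts are compared numerically, values such as 04 or 10 are handled without any special casing in the pattern.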


Vectorized function usage and joining individual terms into a single tibble

The title is vague, but let me explain:
I have a non-vectorized function that outputs a 15-row table of volume estimates for a tree. Each row is a different measurement unit or portion of the input tree. I have a Tables argument to help the user decide what units and measurement protocol they're looking to find, but in 99% of use case scenarios, the output for a single tree's volume estimate is a tibble with more than one row.
I've removed ~20 other arguments from the function for demonstration's sake. DBH is a tree's diameter at breast height. Vol column is arbitrary.
Est1 <- TreeVol(Tables = "All", DBH = 7)
Est1
# A tibble: 15 x 3
Tables DBH Vol
<chr> <dbl> <dbl>
1 1. Total_Above_Ground_Cubic_Volume 7 2
2 2. Gross_Inter_1/4inch_Vol 7 4
3 3. Net_Scribner_Vol 7 6
4 4. Gross_Merchantable_Vol 7 8
5 5. Net_Merchantable_Vol 7 10
6 6. Merchantable_Vol 7 12
7 7. Gross_SecondaryProduct_Vol 7 14
8 8. Net_SecondaryProduct_Vol 7 16
9 9. SecondaryProduct 7 18
10 10. Gross_Inter_1/4inch_Vol 7 20
11 11. Net_Inter_1/4inch_Vol 7 22
12 12. Gross_Scribner_SecondaryProduct 7 24
13 13. Net_Scribner_SecondaryProduct 7 26
14 14. Stump_Volume 7 28
15 15. Tip_Volume 7 30
The user can use the Tables argument like so:
Est2 <- TreeVol(Tables = "Scribner_BF", DBH = 7)
# A tibble: 3 x 3
Tables DBH Vol
<chr> <dbl> <dbl>
1 3. Net_Scribner_Vol 7 6
2 12. Gross_Scribner_SecondaryProduct 7 24
3 13. Net_Scribner_SecondaryProduct 7 26
The problem arises in that I'd like to write a vectorized version of this function that can calculate the volume for an entire .csv of tree inventory data. Ideally, I'd like the multi-row outputs that relate to a single tree to output as one long tibble, with each 15-row default output filtered by what the user passes to the Tables argument as so:
Est3 <- VectorizedTreeVol(Tables = "Scribner_BF", DBH = c(7, 21, 26))
# A tibble: 9 x 3
Tables DBH Vol
<chr> <dbl> <dbl>
1 3. Net_Scribner_Vol 7 6
2 12. Gross_Scribner_SecondaryProduct 7 24
3 13. Net_Scribner_SecondaryProduct 7 26
4 3. Net_Scribner_Vol 21 18
5 12. Gross_Scribner_SecondaryProduct 21 72
6 13. Net_Scribner_SecondaryProduct 21 76
7 3. Net_Scribner_Vol 26 8
8 12. Gross_Scribner_SecondaryProduct 26 78
9 13. Net_Scribner_SecondaryProduct 26 84
To achieve this, I wrote a for() loop that acts as the heart of the vectorized function. I've heard from multiple people that it's very inefficient (and I agree), but it works with the principle I'd like to achieve, in theory. Nothing I've found on this topic has suggested a better idea for application in a vectorized function like mine.
The general setup for the loop looks like this:
for(i in 1:length(DBH)){
  Output <- VectorizedTreeVol(Tables = Tables[[i]], DBH = DBH[[i]]) %>%
    purrr::reduce(dplyr::full_join, by = NULL) %>%
    suppressWarnings()
}
and in functions where the non-vectorized output is always a single row, the heart of its respective vectorized function doesn't need to be encased in a for() loop and looks like this:
Output <- OtherVectorizedFunction(Tables = Tables, DBH = DBH) %>%
  purrr::reduce(dplyr::full_join, by = ColumnNames) %>% # ColumnNames is a vector with all of the output's column names
  suppressWarnings()
This specific call to reduce() has worked pretty well when I've used it to vectorize the other functions in the project, but I'm open to suggestions regarding how to join the output tables. I've been stuck on this dilemma for a few months now, and any help regarding how to achieve what this for() loop is striving for in theory would be awesome. Is having a vectorized function that outputs a tibble like Est3 even possible? Any feedback/comments are much appreciated.
Given this function:
TreeVol <- function(DBH) {
data.frame(Tables = c("Tree_Vol", "Intercapillary_transfusion", "Woodiness"),
Vol = c(DBH^2, sqrt(DBH) + 3, sin(DBH)),
DBH)
}
We could put our DBH parameters into purrr::map and then bind_rows to get a data.frame.
library(dplyr)  # for %>% and bind_rows()
VecTreeVol <- function(DBH) {
  DBH %>%
    purrr::map(TreeVol) %>%
    bind_rows()
}
Result
> VecTreeVol(DBH = 1:3)
Tables Vol DBH
1 Tree_Vol 1.0000000 1
2 Intercapillary_transfusion 4.0000000 1
3 Woodiness 0.8414710 1
4 Tree_Vol 4.0000000 2
5 Intercapillary_transfusion 4.4142136 2
6 Woodiness 0.9092974 2
7 Tree_Vol 9.0000000 3
8 Intercapillary_transfusion 4.7320508 3
9 Woodiness 0.1411200 3
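The same idea can carry the Tables filter from the question. This is a rough base R sketch, not the real TreeVol: tree_vol_demo, vec_tree_vol_demo, the table names, and the volume formulas are all placeholders invented for the example.

```r
# Placeholder single-tree function: returns one row per table,
# optionally filtered by the Tables argument.
tree_vol_demo <- function(Tables = "All", DBH) {
  out <- data.frame(Tables = c("Tree_Vol", "Stump_Volume", "Tip_Volume"),
                    DBH = DBH,
                    Vol = c(DBH^2, DBH / 2, DBH / 4))  # made-up formulas
  if (!identical(Tables, "All")) {
    out <- out[out$Tables %in% Tables, , drop = FALSE]
  }
  out
}

# Vectorized wrapper: one call per tree, then stack the pieces row-wise.
vec_tree_vol_demo <- function(Tables = "All", DBH) {
  pieces <- lapply(DBH, function(d) tree_vol_demo(Tables = Tables, DBH = d))
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

res <- vec_tree_vol_demo(Tables = c("Stump_Volume", "Tip_Volume"),
                         DBH = c(7, 21, 26))
print(res)
```

Each tree contributes as many rows as the filter keeps, so the result is one long table in the shape of Est3.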

Is there a reason RowSums(df[grep wouldn't work accurately?

I used
df$Total.P.n <- rowSums(df[grep('p.n', names(df), ignore.case = FALSE)])
to sum count values from any column name containing p.n, but the values it produced are way off. The columns are counts of certain combinations of language types in a language corpus. I want to get a summary of all times p.n. was used within other combinations, but am struggling. It seems like perhaps it is counting other occurrences like e.sp.NR in my variable names, but shouldn't ignore.case=FALSE take care of that? I've also tried tidyverse and dplyr solutions to no avail.
Here's example of df structure:
ID. do.p.n.NP do.p.n.SE p.d.e.sp.SR
1510 4 6 2
1515 2 0 1
and what I need:
ID. do.p.n.NP do.p.n.SE p.d.e.sp.SR Total.P.n
1510 4 6 2 10
1515 2 0 1 2
Update following the OP's edit (new column names):
The code would be:
df$Total.P.n <- rowSums(df[grep('p.n', names(df), ignore.case = FALSE)])
df$p.d.e.sp.SR <- rowSums(df[,2:3]!=0)
ID. do.p.n.NP do.p.n.SE. p.d.e.sp.SR Total.P.n
1 1510 4 6 2 10
2 1515 2 0 1 2
First answer:
The pattern you are searching for, p.n, does not exist in the column names of df. Therefore I think you mean pn. Then your code works as expected:
df$Total.P.n <- rowSums(df[grep('pn', names(df), ignore.case = FALSE)])
ID. do.pn.NP do.pn.SE. p.d.e.sp.SR Total.P.n
1 1510 4 6 0 10
2 1515 2 0 1 2
If we can use dplyr, I would suggest using a tidy-select function / selection helper like matches. And please mind that your regex is likely wrong. If we need to match literal dots ., we need to escape the metacharacter with a double backslash. The appropriate regex would be p\\.n.
library(dplyr)
data
df <- tibble(`ID.` = c(1510, 1515), `do.p.n.NP` = c(4,2), `do.p.n.SE.` = c(6,0), `p.d.e.sp.SR` = c(0,1))
answer
df %>%
mutate(Total.P.n = rowSums(across(matches('p\\.n'))))
# A tibble: 2 × 5
ID. do.p.n.NP do.p.n.SE. p.d.e.sp.SR Total.P.n
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1510 4 6 0 10
2 1515 2 0 1 2
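To see why the unescaped pattern can pick up extra columns, here is a small base R sketch. The column open.q.SR is invented purely to show a false match. Note that ignore.case only controls letter case; it does not stop "." from matching any character.

```r
# Hypothetical data; "open.q.SR" is made up to demonstrate a false match.
df_demo <- data.frame(ID. = c(1510, 1515),
                      do.p.n.NP = c(4, 2),
                      do.p.n.SE = c(6, 0),
                      open.q.SR = c(2, 1))

loose   <- grepl("p.n", names(df_demo))                # "." matches any character, so "pen" matches
escaped <- grepl("p\\.n", names(df_demo))              # only a literal "p.n"
literal <- grepl("p.n", names(df_demo), fixed = TRUE)  # no regex interpretation at all

# Sum only the columns whose names contain a literal "p.n"
df_demo$Total.P.n <- rowSums(df_demo[escaped])
print(df_demo)
```

Either escaping the dots or passing fixed = TRUE restricts the match to the literal string.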

An easier way to get average of a table with some conditions in R

I am trying to get the average of all 6 quizzes for each male student.
Here is part of the code that I've tried:
a<-subset(mydf,Sex=="M")
b<-a[4:9]
b
sum(b[1:6])
My logic is to get a table that contains only the male students with each of their 6 quizzes, then sum the table and divide by the number of male students. But I think there should be an easier way to do this.
Sample data:
df <- data.frame(Section=c(rep('A',9)),
Degree=c(rep('MBA',4),'MS','MBA','MBA','MS','MBA'),
Sex=c(rep('M',5),'F','M','M','F'),
Quiz1=c(0,10,2,2,8,6,6,2,3),
Quiz2=c(0,1,4,4,1,5,0,3,9),
Quiz3=c(6,5,6,6,4,2,7,9,3),
Quiz4=c(5,4,5,5,10,5,7,7,3),
Quiz5=c(7,3,6,3,10,7,6,10,5),
Quiz6=c(3,8,6,6,5,8,10,10,5))
How about this:
data.frame(df[which(df$Sex=='M'),],QuizMeans=rowMeans(df[which(df$Sex=='M'),c(4:9)]))
Note: "c(4:9)" in the code above selects quiz columns 4-9, so rowMeans() takes the row average over them.
So we're calculating quiz scores for each individual this way.
Output:
Section Degree Sex Quiz1 Quiz2 Quiz3 Quiz4 Quiz5 Quiz6 QuizMeans
1 A MBA M 0 0 6 5 7 3 3.500000
2 A MBA M 10 1 5 4 3 8 5.166667
3 A MBA M 2 4 6 5 6 6 4.833333
4 A MBA M 2 4 6 5 3 6 4.333333
5 A MS M 8 1 4 10 10 5 6.333333
7 A MBA M 6 0 7 7 6 10 6.000000
8 A MS M 2 3 9 7 10 10 6.833333
Then if you wanted to take the mean of their means (i.e. the grand mean), you could store the above as something like "df", then use mean() to calculate the mean of the column QuizMeans, like this:
df <- data.frame(df[which(df$Sex=='M'),],QuizMeans=rowMeans(df[which(df$Sex=='M'),c(4:9)]))
mean(df$QuizMeans)
[1] 5.285714
If there are missing values in your data, you'll need to add na.rm=TRUE to either the mean() or rowMeans() function, like this:
mean(df$QuizMeans, na.rm=TRUE)
[1] 5.285714
You could use the following without specifying column positions
ans <- sum(df[df$Sex=="M", grepl("Quiz",names(df))])/sum(df$Sex=="M")
# 31.71429
If you know the column positions
ans <- sum(df[df$Sex=="M", 4:9])/sum(df$Sex=="M")
# 31.71429
Data
df <- data.frame(Section=c(rep('A',9)),
Degree=c(rep('MBA',4),'MS','MBA','MBA','MS','MBA'),
Sex=c(rep('M',5),'F','M','M','F'),
Quiz1=c(0,10,2,2,8,6,6,2,3),
Quiz2=c(0,1,4,4,1,5,0,3,9),
Quiz3=c(6,5,6,6,4,2,7,9,3),
Quiz4=c(5,4,5,5,10,5,7,7,3),
Quiz5=c(7,3,6,3,10,7,6,10,5),
Quiz6=c(3,8,6,6,5,8,10,10,5))
Use dplyr.
library(dplyr)
mydf %>% filter(Sex == "M") %>%  # Sex is coded as "M" in the sample data
  summarise(avg_q6 = mean(Quiz6))
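The dplyr snippet above averages a single quiz. A possible extension to all six quizzes is sketched below, assuming the quiz columns all start with "Quiz" and Sex is coded "M"/"F" as in the sample data. The names quiz_df, QuizMean and quiz_summary are chosen only for this example.

```r
library(dplyr)

quiz_df <- data.frame(Section = rep("A", 9),
                      Degree = c(rep("MBA", 4), "MS", "MBA", "MBA", "MS", "MBA"),
                      Sex = c(rep("M", 5), "F", "M", "M", "F"),
                      Quiz1 = c(0, 10, 2, 2, 8, 6, 6, 2, 3),
                      Quiz2 = c(0, 1, 4, 4, 1, 5, 0, 3, 9),
                      Quiz3 = c(6, 5, 6, 6, 4, 2, 7, 9, 3),
                      Quiz4 = c(5, 4, 5, 5, 10, 5, 7, 7, 3),
                      Quiz5 = c(7, 3, 6, 3, 10, 7, 6, 10, 5),
                      Quiz6 = c(3, 8, 6, 6, 5, 8, 10, 10, 5))

quiz_summary <- quiz_df %>%
  filter(Sex == "M") %>%
  # per-student mean over every column whose name starts with "Quiz"
  mutate(QuizMean = rowMeans(across(starts_with("Quiz")))) %>%
  # then the mean of those per-student means
  summarise(n_students = n(), grand_mean = mean(QuizMean))

print(quiz_summary)
```

Selecting the columns by name prefix avoids hard-coding positions 4 to 9.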

Summing depth data (consecutive rows) in R

How is it possible to sum up consecutive depth data with R?
For instance:
a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood")),
depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)))
The given output should be something like:
Label Depth
Air 7
Wood 3
Stone 1
First the removal of negative values is done with cummax(), because depth can only increase in this special case. Hence:
label depth
1 Air 1
2 Air 2
3 Air 3
4 Air 3
5 Air 4
6 Air 5
7 Wood 5
8 Wood 5
9 Wood 5
10 Wood 6
11 Wood 8
12 Air 9
13 Air 9
14 Air 9
15 Air 10
16 Stone 10
17 Stone 10
18 Stone 11
19 Stone 11
20 Air 11
21 Air 12
22 Air 12
23 Air 12
24 Air 13
25 Wood 14
26 Wood 14
Now by max-min the increase in depth for every consecutive row you would get: (the question is how to do this step)
label depth
1 Air 4
2 Wood 3
3 Air 1
4 Stone 1
5 Air 2
6 Wood 0
And finally summing up those max-min values the output is the one presented above.
Steps tried to achieve the output:
The first obvious solution would be for instance for Air:
diff(cummax(a[a$label=="Air",]$depth))
This solution gets rid of the negative data, which is necessary due to an expected constant increase in depth.
The problem is the output also takes into account the big steps in between each consecutive subset. Hence, the sum for Air would be 12 instead of 7.
[1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1
Even worse would be a solution with aggregate, e.g.:
aggregate(depth~label, a, FUN=function(x){sum(x>0)})
Note: solutions that filter out big jumps are not what I'm looking for. Sure, you could hard-code a limit, for instance <2, for the example of Air once again:
sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])
This gives almost the right result, but it does not work as expected here. I'm pretty sure there is already a function for what I'm looking for, because it is not an uncommon problem for many different tasks.
I guess taking the minimum and maximum value of each set of consecutive rows per material and summing those up would be one possible solution, but I'm not sure how to apply a function to only the consecutive subsets.
You can use data.table::rleid to quickly group by run, or reconstruct it with rle if you really like. After that, aggregating is fairly easy in any grammar. In dplyr,
library(dplyr)
a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood"),
depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))
a2 <- a %>%
# filter to rows where previous value is lower, equal, or NA
filter(depth >= lag(depth) | is.na(lag(depth))) %>%
# group by label and its run
group_by(label, run = data.table::rleid(label)) %>%
summarise(depth = max(depth) - min(depth)) # aggregate
a2 %>% arrange(run) # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups: label [3]
#> label run depth
#> <fctr> <int> <dbl>
#> 1 Air 1 4
#> 2 Wood 2 3
#> 3 Air 3 1
#> 4 Stone 4 1
#> 5 Air 5 2
#> 6 Wood 6 0
a3 <- a2 %>% summarise(depth = sum(depth)) # a2 is still grouped, so aggregate more
a3
#> # A tibble: 3 x 2
#> label depth
#> <fctr> <dbl>
#> 1 Air 7
#> 2 Stone 1
#> 3 Wood 3
A base R method using aggregate is
aggregate(cbind(val=cummax(a$depth)),
list(label=a$label, ID=c(0, cumsum(diff(as.integer(a$label)) != 0))),
function(x) diff(range(x)))
The first argument to aggregate computes the cumulative maximum of the input vector, as the OP does above; wrapping it in cbind names the resulting column val in the final output. The second argument is the grouping list. Instead of rle, the run ID is built as the cumulative sum of an indicator that is TRUE wherever the integer code of the label changes from one row to the next. Finally, the third argument is the function that calculates the desired output by taking the difference of the range for each group.
This returns
label ID val
1 Air 0 4
2 Wood 1 3
3 Air 2 1
4 Stone 3 1
5 Air 4 2
6 Wood 5 0
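The aggregate call stops at the per-run table. As a sketch of the remaining step, the version below builds the run ID with rle instead and then sums the gains per label, all in base R. The column name run and the intermediate names per_run and total are chosen for illustration, and the label vector is written with rep() but has the same 26 values as in the question.

```r
a <- data.frame(
  label = rep(c("Air", "Wood", "Air", "Stone", "Air", "Wood"),
              times = c(6, 5, 4, 4, 5, 2)),
  depth = c(1, 2, 3, -1, 4, 5, 4, 5, 4, 6, 8, 9, 8, 9, 10,
            9, 10, 11, 10, 11, 12, 10, 12, 13, 14, 14))

# Depth can only increase, so replace dips by the running maximum
a$depth <- cummax(a$depth)

# Number each run of identical consecutive labels: 1, 1, ..., 2, 2, ...
r <- rle(as.character(a$label))
a$run <- rep(seq_along(r$lengths), r$lengths)

# Depth gained within each run (max - min), then summed per label
per_run <- aggregate(depth ~ run + label, data = a,
                     FUN = function(x) diff(range(x)))
total <- aggregate(depth ~ label, data = per_run, FUN = sum)
print(total)
```

The final table should agree with the totals the question asks for (Air 7, Wood 3, Stone 1), listed in alphabetical order of label.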
The data.table way (borrowing in part from @alistaire):
library(data.table)
setDT(a)
a[, depth := cummax(depth)]
depth_gain <- a[,
  list(
    depth = max(depth) - depth[1], # Only need the starting and max values
    label = label[1]
  ),
  by = rleidv(label)
]
result <- depth_gain[, list(depth = sum(depth)), by = label]

Extracting corresponding other values in mutate when group_by is applied

I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this, but with more lab-value variables:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,2),
time=c(0,3,7,35,0,7,14,28,42),
labvalue1=c(4.04,NA,2.93,NA,NA,3.78,3.66,NA,2.54),
labvalue2=c(NA,63.8,62.8,61.2,78.1,NA,77.6,75.3,NA))
> df
id time labvalue1 labvalue2
1 1 0 4.04 NA
2 1 3 NA 63.8
3 1 7 2.93 62.8
4 1 35 NA 61.2
5 2 0 NA 78.1
6 2 7 3.78 NA
7 2 14 3.66 77.6
8 2 28 NA 75.3
9 2 42 2.54 NA
For each patient (identified by a unique id), I want to calculate the decrease, i.e. the slope per day, between the first and last measurement, so that I can compare slopes between patients. Time is in days. So, eventually I want a new variable for each lab value, e.g. diff_labvalues, that gives me for labvalue1:
For patient 1: (2.93-4.04)/ (7-0) and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last-first); etc for labvalue2, and so forth.
So far I have used dplyr and created the first1 and last1 functions, because first() and last() did not work with the NA values.
After that, I grouped by 'id' and used mutate_all (because there are more lab values in the original df) to calculate the difference between the last1() and first1() lab values for each patient.
But I cannot find how to extract the corresponding time values (the delta-time), which I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
first(na.omit(x))
}
last1 <- function(x) {
last(na.omit(x))
}
df2 = df %>%
group_by(id) %>%
mutate_all(funs(diff=(last1(.)-first1(.)) / #it works until here
(time[position of last1(.)]-time[position of first1(.)]))) #something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try
library(dplyr)
df %>%
  group_by(id) %>%
  filter(!is.na(labvalue1)) %>%
  summarise(diff_labs = (last(labvalue1) - first(labvalue1))/(last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labvalue1), .(diff_labs = (labvalue1[.N] - labvalue1[1])/(time[.N] - time[1])) , id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857
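Both snippets handle one lab column at a time. Since the original data has several, here is a base R sketch that applies the same first/last non-NA logic to every lab column. The helper slope_first_last and the vector lab_cols are names made up for this example, and the helper returns NA when a patient has fewer than two non-missing values.

```r
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
                 time = c(0, 3, 7, 35, 0, 7, 14, 28, 42),
                 labvalue1 = c(4.04, NA, 2.93, NA, NA, 3.78, 3.66, NA, 2.54),
                 labvalue2 = c(NA, 63.8, 62.8, 61.2, 78.1, NA, 77.6, 75.3, NA))

# Hypothetical helper: slope between the first and last non-missing value,
# using the time points that belong to those two values.
slope_first_last <- function(value, time) {
  ok <- !is.na(value)
  if (sum(ok) < 2) return(NA_real_)
  v <- value[ok]
  t <- time[ok]
  (v[length(v)] - v[1]) / (t[length(t)] - t[1])
}

lab_cols <- c("labvalue1", "labvalue2")

# One named vector of slopes per patient, stacked into a matrix
# with one row per id and one column per lab value.
slopes <- do.call(rbind, lapply(split(df, df$id), function(g) {
  vapply(lab_cols, function(col) slope_first_last(g[[col]], g$time), numeric(1))
}))
print(slopes)
```

Dropping the NA rows separately for each column keeps the time values aligned with the lab values they belong to.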
