Error in `mutate()` while creating a new variable using R

I have a data frame and want to create a new variable randomly from other variables. My data contains these key variables:
iq  Age  Educ_y
5   23   15
4   54   17
2   43   6
3   13   7
5   14   8
1   51   16
I want to generate a new variable (years of experience) randomly using these criteria:
If Age >= 15 & iq <= 2, then "Exp_y" takes a random number between (Age-15)/2 and Age-15.
If Age >= 15 & (iq == 3 | iq == 4), then "Exp_y" takes a random number between (Age-Educ_y-6)/2 and (Age-Educ_y-6).
Otherwise it is 0.
I tried this code:
Df <- Df %>%
rowwise() %>%
mutate(Exep_y = case_when(
Age > 14 & iq <= 2 ~ sample(seq((Age-15)/2, Age-15, 1), 1),
Age > 14 & between(iq, 3, 4) ~ sample(seq((Age-Educ_y-6)/2, Age-Educ_y-6, 1), 1),
TRUE ~ 0
))
But I end up with this error message:
Error in `mutate()`:
! Problem while computing `Exep_y = case_when(...)`.
i The error occurred in row 3.
Caused by error in `seq.default()`:
! wrong sign in 'by' argument
Any ideas, please?
Best regards

This error occurs because case_when() evaluates all the right-hand-side (RHS) expressions and then selects among them based on the left-hand side. Therefore, even though row 4 of your sample dataset, for example, falls through to TRUE ~ 0, the RHS of the first two conditions is still evaluated. Here the first condition's RHS is seq((13-15)/2, 13-15, 1), which returns an error: from = -1 and to = -2, so the by argument cannot be 1 (it has the wrong sign).
seq((13-15)/2, 13-15, 1)
Error in seq.default((13 - 15)/2, 13 - 15, 1) :
wrong sign in 'by' argument
You could do something like this:
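f <- function(i, a, e) {
  # iq above 4 or age under 15: no experience
  if (i > 4 || a < 15) return(0)
  # iq of 1 or 2: draw between (a - 15) / 2 and a - 15
  if (i <= 2) return(sample(seq((a - 15) / 2, a - 15), 1))
  # iq of 3 or 4: draw between (a - e - 6) / 2 and a - e - 6
  return(sample(seq((a - e - 6) / 2, a - e - 6), 1))
}
Df %>% rowwise() %>% mutate(Exep_y = f(iq, Age, Educ_y))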
Output:
iq Age Educ_y Exep_y
<int> <int> <int> <dbl>
1 5 23 15 0
2 4 54 17 16.5
3 2 43 6 21
4 3 13 7 0
5 5 14 8 0
6 1 51 16 27
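Two details are worth guarding against in a helper of this kind: seq() with by = 1 fails whenever the start exceeds the end, and sample(x, 1) treats a single number x >= 1 as the range 1:x. The following is a minimal base R sketch, not the answerer's code; pick_between is a hypothetical helper name, and returning 0 for a negative upper bound is an assumption about the desired behaviour.

```r
# Hypothetical helper: draw one value from seq(lower, upper) in unit steps.
# Assumption: a negative upper bound means "no experience", so return 0.
pick_between <- function(lower, upper) {
  if (upper < 0 || lower > upper) return(0)
  candidates <- seq(lower, upper, by = 1)
  # Index with sample.int() so a single candidate is returned unchanged
  # instead of being expanded to 1:candidate by sample().
  candidates[sample.int(length(candidates), 1)]
}

# The failing call from the question, reproduced for Age = 13:
seq_result <- tryCatch(
  seq((13 - 15) / 2, 13 - 15, 1),
  error = function(e) "error"
)

set.seed(42)
young <- pick_between((13 - 15) / 2, 13 - 15)  # upper bound is negative
adult <- pick_between((43 - 15) / 2, 43 - 15)  # somewhere in 14..28
```

Because the guard lives inside the helper, it would also be safe to call from case_when() under rowwise(), where every right-hand side is evaluated.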

You could try using if_else() rather than case_when:
Documentation can be found here: https://dplyr.tidyverse.org/reference/if_else.html

Vectorized function usage and joining individual terms into a single tibble

The title is vague, but let me explain:
I have a non-vectorized function that outputs a 15-row table of volume estimates for a tree. Each row is a different measurement unit or portion of the input tree. I have a Tables argument to help the user decide what units and measurement protocol they're looking to find, but in 99% of use case scenarios, the output for a single tree's volume estimate is a tibble with more than one row.
I've removed ~20 other arguments from the function for demonstration's sake. DBH is a tree's diameter at breast height. Vol column is arbitrary.
Est1 <- TreeVol(Tables = "All", DBH = 7)
Est1
# A tibble: 15 x 3
Tables DBH Vol
<chr> <dbl> <dbl>
1 1. Total_Above_Ground_Cubic_Volume 7 2
2 2. Gross_Inter_1/4inch_Vol 7 4
3 3. Net_Scribner_Vol 7 6
4 4. Gross_Merchantable_Vol 7 8
5 5. Net_Merchantable_Vol 7 10
6 6. Merchantable_Vol 7 12
7 7. Gross_SecondaryProduct_Vol 7 14
8 8. Net_SecondaryProduct_Vol 7 16
9 9. SecondaryProduct 7 18
10 10. Gross_Inter_1/4inch_Vol 7 20
11 11. Net_Inter_1/4inch_Vol 7 22
12 12. Gross_Scribner_SecondaryProduct 7 24
13 13. Net_Scribner_SecondaryProduct 7 26
14 14. Stump_Volume 7 28
15 15. Tip_Volume 7 30
The user can use the Tables argument like so:
Est2 <- TreeVol(Tables = "Scribner_BF", DBH = 7)
# A tibble: 3 x 3
Tables DBH Vol
<chr> <dbl> <dbl>
1 3. Net_Scribner_Vol 7 6
2 12. Gross_Scribner_SecondaryProduct 7 24
3 13. Net_Scribner_SecondaryProduct 7 26
The problem arises in that I'd like to write a vectorized version of this function that can calculate the volume for an entire .csv of tree inventory data. Ideally, I'd like the multi-row outputs that relate to a single tree to output as one long tibble, with each 15-row default output filtered by what the user passes to the Tables argument as so:
Est3 <- VectorizedTreeVol(Tables = "Scribner_BF", DBH = c(7, 21, 26))
# A tibble: 9 x 3
Tables DBH Vol
<chr> <dbl> <dbl>
1 3. Net_Scribner_Vol 7 6
2 12. Gross_Scribner_SecondaryProduct 7 24
3 13. Net_Scribner_SecondaryProduct 7 26
4 3. Net_Scribner_Vol 21 18
5 12. Gross_Scribner_SecondaryProduct 21 72
6 13. Net_Scribner_SecondaryProduct 21 76
7 3. Net_Scribner_Vol 26 8
8 12. Gross_Scribner_SecondaryProduct 26 78
9 13. Net_Scribner_SecondaryProduct 26 84
To achieve this, I wrote a for() loop that acts as the heart of the vectorized function. I've heard from multiple people that it's very inefficient (and I agree), but it works with the principle I'd like to achieve, in theory. Nothing I've found on this topic has suggested a better idea for application in a vectorized function like mine.
The general setup for the loop looks like this:
for (i in seq_along(DBH)) {
Output <- VectorizedTreeVol(Tables = Tables[[i]], DBH = DBH[[i]]) %>%
purrr::reduce(dplyr::full_join, by = NULL) %>%
suppressWarnings()
}
In functions where the non-vectorized output is always a single row, the heart of the respective vectorized function doesn't need to be wrapped in a for() loop and looks like this:
Output <- OtherVectorizedFunction(Tables = Tables, DBH = DBH) %>%
purrr::reduce(dplyr::full_join, by = ColumnNames) %>% # ColumnNames is a vector with all of the output's column names
suppressWarnings()
This specific call to reduce() has worked pretty well when I've used it to vectorize the other functions in the project, but I'm open to suggestions regarding how to join the output tables. I've been stuck on this dilemma for a few months now, and any help regarding how to achieve what this for() loop is striving for in theory would be awesome. Is having a vectorized function that outputs a tibble like Est3 even possible? Any feedback/comments are much appreciated.
Given this function:
TreeVol <- function(DBH) {
data.frame(Tables = c("Tree_Vol", "Intercapillary_transfusion", "Woodiness"),
Vol = c(DBH^2, sqrt(DBH) + 3, sin(DBH)),
DBH)
}
We could put our DBH parameters into purrr::map and then bind_rows to get a data.frame.
VecTreeVol <- function(DBH) {
DBH %>%
purrr::map(TreeVol) %>%
bind_rows()
}
Result
> VecTreeVol(DBH = 1:3)
Tables Vol DBH
1 Tree_Vol 1.0000000 1
2 Intercapillary_transfusion 4.0000000 1
3 Woodiness 0.8414710 1
4 Tree_Vol 4.0000000 2
5 Intercapillary_transfusion 4.4142136 2
6 Woodiness 0.9092974 2
7 Tree_Vol 9.0000000 3
8 Intercapillary_transfusion 4.7320508 3
9 Woodiness 0.1411200 3
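The same map-then-bind idea can carry the Tables filter from the question along with it. The sketch below uses base R only (lapply() and rbind() in place of purrr::map() and bind_rows()); the TreeVol() body, its row names, and its formulas are made-up stand-ins for the real function, and filtering by exact row name is an assumption about how Tables works.

```r
# Hypothetical stand-in for the real TreeVol(); rows and formulas are made up.
TreeVol <- function(Tables = "All", DBH) {
  out <- data.frame(
    Tables = c("Net_Scribner_Vol", "Stump_Volume", "Tip_Volume"),
    DBH = DBH,
    Vol = c(DBH * 2, DBH * 3, DBH * 4)
  )
  # Keep only the requested rows unless the caller asked for everything
  if (!identical(Tables, "All")) {
    out <- out[out$Tables %in% Tables, , drop = FALSE]
  }
  out
}

# Call TreeVol() once per tree, then stack the per-tree tables
VectorizedTreeVol <- function(Tables = "All", DBH) {
  pieces <- lapply(DBH, function(d) TreeVol(Tables = Tables, DBH = d))
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

Est3 <- VectorizedTreeVol(
  Tables = c("Net_Scribner_Vol", "Tip_Volume"),
  DBH = c(7, 21, 26)
)
```

Each tree contributes its filtered rows in input order, so Est3 has two rows per DBH value.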

Unable to run Two-way repeated measures ANOVA; 0 (non-NA) cases

I am trying to follow the tutorial by Datanovia for Two-way repeated measures ANOVA.
A quick overview of my dataset:
I have measured the number of different bacterial species in 12 sampling units over time. I have 16 time points and 2 groups. I have organised my data as a tibble called "richness":
# A tibble: 190 x 4
id selection.group Day value
<fct> <fct> <fct> <dbl>
1 KRH1 KR 2 111.
2 KRH2 KR 2 141.
3 KRH3 KR 2 110.
4 KRH1 KR 4 126
5 KRH2 KR 4 144
6 KRH3 KR 4 135.
7 KRH1 KR 6 115.
8 KRH2 KR 6 113.
9 KRH3 KR 6 107.
10 KRH1 KR 8 119.
The id refers to each sampling unit, and the selection group is a factor with two levels (KR and RK).
richness <- tibble(
id = factor(c("KRH1", "KRH3", "KRH2", "RKH2", "RKH1", "RKH3")),
selection.group = factor(c("KR", "KR", "KR", "RK", "RK", "RK")),
Day = factor(c(2,2,4,2,4,4)),
value = c(111, 110, 144, 92, 85, 69)) # subset of original data
My tibble appears to be in an identical format as the one in the tutorial;
> str(selfesteem2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 72 obs. of 4 variables:
$ id : Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ treatment: Factor w/ 2 levels "ctr","Diet": 1 1 1 1 1 1 1 1 1 1 ...
$ time : Factor w/ 3 levels "t1","t2","t3": 1 1 1 1 1 1 1 1 1 1 ...
$ score : num 83 97 93 92 77 72 92 92 95 92 ..
Before I can run the repeated measures ANOVA I must check for normality in my data. I copied the framework proposed in the tutorial.
#my code
richness %>%
group_by(selection.group, Day) %>%
shapiro_test(value)
#tutorial code
selfesteem2 %>%
group_by(treatment, time) %>%
shapiro_test(score)
But I get the error message "Error: Column variable is unknown" when I try to run the code. Does anyone know why this happens?
I tried to continue without assurance that my data are normally distributed and ran the ANOVA:
res.aov <- rstatix::anova_test(
data = richness, dv = value, wid = id,
within = c(selection.group, Day)
)
But I get this error message:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
I have checked for NA values with any(is.na(richness)) which returns FALSE. I have also checked table(richness$selection.group, richness$Day) to be sure my setup is correct
    2  4  6  8 12 16 20 24 28 29 30 32 36 40 44 50
KR  6  6  6  6  6  6  6  6  6  6  6  5  6  6  6  6
RK  6  6  6  6  6  5  6  6  6  6  6  6  6  6  6  6
And the setup appears correct. I would be very grateful for tips on solving this.
Best regards Madeleine
Below is a subset of my dataset in a reproducible format:
library(tidyverse)
library(rstatix)
library(tibble)
richness_subset = data.frame(
id = c("KRH1", "KRH3", "KRH2", "RKH2", "RKH1", "RKH3"),
selection.group = c("KR", "KR", "KR", "RK", "RK", "RK"),
Day = c(2,2,4,2,4,4),
value = c(111, 110, 144, 92, 85, 69))
richness_subset$Day = factor(richness_subset$Day)
richness_subset$selection.group = factor(richness_subset$selection.group)
richness_subset$id = factor(richness_subset$id)
richness_subset = tibble::as_tibble(richness_subset)
richness_subset %>%
group_by(selection.group, Day) %>%
shapiro_test(value)
# gives Error: Column `variable` is unknown
res.aov <- rstatix::anova_test(
data = richness_subset, dv = value, wid = id,
within = c(selection.group, Day)
)
# gives Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# 0 (non-NA) cases
I create something like the design of your data:
set.seed(111)
richness = data.frame(id=rep(c("KRH1","KRH2","KRH3"),6),
selection.group=rep(c("KR","RK"),each=9),
Day=rep(c(2,4,6),each=3,times=2),value=rpois(18,100))
richness$Day = factor(richness$Day)
richness$id = factor(richness$id)
First, shapiro_test: there is a bug in the function, and the column you want to test cannot be named "value":
# gives Error: Column `variable` is unknown
richness %>% shapiro_test(value)
#works
richness %>% mutate(X = value) %>% shapiro_test(X)
# A tibble: 1 x 3
variable statistic p
<chr> <dbl> <dbl>
1 X 0.950 0.422
1 X 0.963 0.843
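If renaming the column is inconvenient, the per-cell normality check can also be run without rstatix. This is a minimal sketch using stats::shapiro.test() through aggregate(); it swaps in base R for the tutorial's shapiro_test(), and the toy data frame is made up to mimic the group-by-day layout.

```r
set.seed(111)
# Made-up data in the same layout as the question: two groups, three days,
# six observations per group/day cell
toy <- data.frame(
  selection.group = factor(rep(c("KR", "RK"), each = 18)),
  Day = factor(rep(rep(c(2, 4, 6), each = 6), times = 2)),
  value = rnorm(36, mean = 100, sd = 10)
)

# One Shapiro-Wilk p-value per group/day cell
p_values <- aggregate(
  value ~ selection.group + Day,
  data = toy,
  FUN = function(v) shapiro.test(v)$p.value
)
names(p_values)[3] <- "p"
```

Note that shapiro.test() needs at least three observations per cell, so very sparse cells would have to be handled separately.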
Second, for the anova, this works for me.
rstatix::anova_test(
data = richness, dv = value, wid = id,
within = c(selection.group, Day)
)
In my example every term can be estimated. What I suspect is that one of your terms is a linear combination of the others. Using my example,
set.seed(111)
richness =
data.frame(id=rep(c("KRH1","KRH2","KRH3","KRH4","KRH5","KRH6"),3),
selection.group=rep(c("KR","RK"),each=9),
Day=rep(c(2,4,6),each=3,times=2),value=rpois(18,100))
richness$Day = factor(richness$Day)
richness$id = factor(richness$id)
rstatix::anova_test(
data = richness, dv = value, wid = id,
within = c(selection.group, Day)
)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
Gives the exact same error. This can be checked using:
lm(value~id+Day:selection.group,data=richness)
Call:
lm(formula = value ~ id + Day:selection.group, data = richness)
Coefficients:
(Intercept) id1 id2
101.667 -3.000 -6.000
id3 id4 id5
-6.000 1.889 11.556
Day2:selection.groupKR Day4:selection.groupKR Day6:selection.groupKR
1.667 -12.000 9.333
Day2:selection.groupRK Day4:selection.groupRK Day6:selection.groupRK
-1.667 NA NA
The Day4:selection.groupRK and Day6:selection.groupRK coefficients are not estimable because they are linear combinations of the preceding terms.
The solution for running the Shapiro_test proposed above worked.
And I figured out I have some linear combination by running lm(value~id+Day:selection.group,data=richness). However, I don't understand why? I know I have data points for each group (see graph). Where does this linear combination come from?
Repeated measure ANOVA appears so appropriate for me as I am following sampling units over time.
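One possible source of the linear dependence, offered as an assumption since the thread does not confirm it: the id labels in the question (KRH1, RKH1, ...) suggest that each sampling unit belongs to only one selection group. In that case the id columns already determine the group, so id and Day:selection.group overlap in the model matrix. A minimal sketch with made-up data:

```r
# Toy data (made up): every id is measured on all days,
# but each id appears in only one group
d <- data.frame(
  id = factor(rep(c("A1", "A2", "B1", "B2"), times = 3)),
  group = factor(rep(c("A", "A", "B", "B"), times = 3)),
  Day = factor(rep(c(2, 4, 6), each = 4)),
  value = c(10, 12, 9, 11, 13, 15, 12, 14, 16, 18, 15, 17)
)

# Cross-tabulation: each id has observations in exactly one group column
nesting <- table(d$id, d$group)

# Same model form as the check in the answer above
fit <- lm(value ~ id + Day:group, data = d)
n_coef <- length(coef(fit))          # columns in the model matrix
n_missing <- sum(is.na(coef(fit)))   # coefficients lm() could not estimate
```

Here the model matrix has more columns than its rank, so lm() reports NA for the redundant coefficients even though every group/day cell contains data. If that matches your design, selection.group is a between-subjects factor, and passing it through the between argument of anova_test() rather than within may be the appropriate specification.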
I had the same issue and couldn't find a solution. Finally, the following worked:
Install the "ez" package, then:
newModel <- ezANOVA(data = dataFrame, dv = .(outcome variable), wid = .(variable that identifies participants), within = .(repeated measures predictors), between = .(between-group predictors), detailed = FALSE, type = 2)
Example: bushModel <- ezANOVA(data = longBush, dv = .(Retch), wid = .(Participant), within = .(Animal), detailed = TRUE, type = 3)

Error in 2 * "X2B" : non-numeric argument to binary operator

I am trying to look at baseball data from 1903 through 1960 from the Lahman database for my own research. I want to use the batting table, which does not include batting average, slugging, OBP, or OPS.
I want to calculate those, but I first need total bases. I am having trouble getting the program to calculate total bases from X2B and X3B.
I've looked into as.numeric, but I couldn't get it to work. This is in R and RStudio. I've tried the doubles and triples columns both with and without quotes around X2B and X3B.
batting_1960 <- batting_1903 %>%
filter(yearID <= 1960 & G >= 90) %>%
mutate(Batting_Average = H/AB, TB = (2*"X2B")+(3*"X3B")+HR+(H-"X2B"-"X3B"-HR)) %>%
arrange(yearID, desc(Batting_Average))
I expect the total bases to be calculated in a new column for each row of data, but I get the error:
Error in 2 * "X2B" : non-numeric argument to binary operator
This would be so that I could eventually calculate OPS, OBP and slugging.
Your code is trying to multiply 2 by the literal string "X2B", which is not going to work. Column names should be unquoted in mutate().
Your error:
> tibble(X2B = 1:10) %>% mutate(TB = 2 * "X2B")
Error in 2 * "X2B" : non-numeric argument to binary operator
Should be, for example:
> tibble(X2B = 1:10) %>% mutate(TB = 2 * X2B)
# A tibble: 10 x 2
X2B TB
<int> <dbl>
1 1 2
2 2 4
3 3 6
4 4 8
5 5 10
6 6 12
7 7 14
8 8 16
9 9 18
10 10 20
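Applied to the batting columns, the unquoted form might look like the sketch below. The two batting lines are made up, and X1B and SLG are column names introduced here for illustration. It also uses the standard total-bases definition (singles + 2 x doubles + 3 x triples + 4 x home runs), which weights home runs by 4 and therefore differs from the expression in the question.

```r
library(dplyr)

# Made-up batting lines for two hypothetical players
batting <- data.frame(
  playerID = c("p1", "p2"),
  AB = c(500, 400),
  H = c(150, 100),
  X2B = c(30, 20),
  X3B = c(5, 2),
  HR = c(20, 10)
)

batting <- batting %>%
  mutate(
    X1B = H - X2B - X3B - HR,               # singles
    TB = X1B + 2 * X2B + 3 * X3B + 4 * HR,  # total bases
    Batting_Average = H / AB,
    SLG = TB / AB                           # slugging percentage
  )
```

mutate() evaluates its arguments in order, so TB can refer to X1B and SLG can refer to TB within the same call.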

R creating variable with satisfying condition

Help sought from anyone.
I have a household survey data set named h2004 and would like to create a variable equal to another variable where a certain condition is satisfied. Here is a sample of observations:
cq15 expen
10 0.4616136
10 1.538712
11 2.308068
11 0.384678
12 2.576797822
12 5.5393632
13 5.4624276
14 2.6158104
14 20.157127
and I tried the following command:
h2004$crops[h2004$cq15>=12 & h2004$cq15<=14]=h2004$expen
This produces wrong results in R; I know the correct result from Stata. In the original data set, the above command takes values of 'expen' even where cq15<12 and uses them to replace the observations where cq15>=12 & cq15<=14.
I also tried dplyr's filter(), which correctly subsets the data frame, but I don't know how to apply it to a specific variable.
fil<- filter(h2004, cq15>=12 & cq15<=14)
I think my subsetting (cq15>=12 & cq15<=14) is wrong. Please advise. Thanks
The problem is in the command. When the command is executed, the following warning message is issued:
Warning message:
In h2004$crops[h2004$cq15 >= 12 & h2004$cq15 <= 14] = h2004$expen :
number of items to replace is not a multiple of replacement length
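The reason is that the LHS of this command selects only the elements satisfying h2004$cq15 >= 12 & h2004$cq15 <= 14, while the RHS supplies the complete vector h2004$expen, so the lengths do not match. R then fills the selected positions with the first values of expen in order, which is why values from rows with cq15 < 12 end up in the result.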
Solution:
> h2004$crops[h2004$cq15>=12 & h2004$cq15<=14]=h2004$expen[h2004$cq15>=12 & h2004$cq15<=14]
> h2004
cq15 expen crops
1 10 0.4616136 NA
2 10 1.5387120 NA
3 11 2.3080680 NA
4 11 0.3846780 NA
5 12 2.5767978 2.576798
6 12 5.5393632 5.539363
7 13 5.4624276 5.462428
8 14 2.6158104 2.615810
9 14 20.1571270 20.157127
Or alternatively:
> indices <- which(h2004$cq15>=12 & h2004$cq15<=14)
> h2004$crops[indices] = h2004$expen[indices]
> h2004
cq15 expen crops
1 10 0.4616136 NA
2 10 1.5387120 NA
3 11 2.3080680 NA
4 11 0.3846780 NA
5 12 2.5767978 2.576798
6 12 5.5393632 5.539363
7 13 5.4624276 5.462428
8 14 2.6158104 2.615810
9 14 20.1571270 20.157127
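A third option, sketched here as an alternative to the indexed assignments above, is ifelse(), which builds the whole column in one step and so avoids the length mismatch. in_range, n_selected and n_supplied are names introduced for illustration; the data are the nine sample rows from the question.

```r
h2004 <- data.frame(
  cq15 = c(10, 10, 11, 11, 12, 12, 13, 14, 14),
  expen = c(0.4616136, 1.538712, 2.308068, 0.384678, 2.576797822,
            5.5393632, 5.4624276, 2.6158104, 20.157127)
)

in_range <- h2004$cq15 >= 12 & h2004$cq15 <= 14

# The mismatch behind the warning: 5 positions selected, 9 values supplied
n_selected <- sum(in_range)
n_supplied <- length(h2004$expen)

# ifelse() works element by element, so row i only ever receives expen[i]
h2004$crops <- ifelse(in_range, h2004$expen, NA_real_)
```

Rows outside the 12 to 14 range get NA, matching the output of the two solutions above.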

Converting R data frame with RDS package: recruitment id error?

I am using the RDS package for respondent-driven sampling survey data. I want to convert a regular R data frame to an rds.data.frame. To do so, I have been trying to use the as.rds.data.frame function from RDS.
Here is an excerpted section of my data frame, where the first case (id=1) is the 'seed' respondent (who has no recruiter). It contains the variables: id (respondent id number), recruit.id(id number of respondent who recruited him/her), netsize (respondent's network size) and population (estimate of whole population size).
df<-data.frame(id=c(1,2,3,4,5,6,7,8,9,10),
recruit.id=c(-1,1,1,2,2,4,5,3,8,3),
netsize=c(6,6,6,5,5,4,4,3,4,6), population=rep(22,000, 10))
I then (try to) apply the relevant function:
new.df <-as.rds.data.frame(df,id=df$id,
recruiter.id=df$recruit.id,
network.size=df$netsize,
population.size=df$population,
max.coupons=2)
I get the error message:
Error in as.rds.data.frame(df, id = df$id, recruiter.id = df$recruit.id,: Invalid id
and the warning
In addition: Warning message:In if (!(id %in% names(x))) stop("Invalid id") :
the condition has length > 1 and only the first element will be used
I have tried assigning various 'recruiter id' values for seed participants, including -1,0 or their own id number but I still get the same message. I have also tried eliminating function arguments (coupon.max, population) or deleting seed respondents, but I still get the same message.
Package documentation says the function will fail if recruitment information is incomplete. As far as I can tell, this is not the case.
I am new to this, so if anyone can point me in the right direction I would be really grateful.
This seems to work:
colnames(df)[2:4] <- c("recruiter.id", "network.size.variable", "population.size")
as.rds.data.frame(df,max.coupons=2)
Passing the column names as strings also gives a result, with a warning:
as.rds.data.frame(df, id="id", recruiter.id="recruit.id",
network.size="netsize", population.size="population", max.coupons=2)
# An object of class "rds.data.frame"
#id: 1 2 3 4 5 6 7 8 9 10
#recruiter.id: -1 1 1 2 2 4 5 3 8 3
# id recruit.id netsize population
#1 1 -1 6 22
#2 2 1 6 22
#3 3 1 6 22
#4 4 2 5 22
#5 5 2 5 22
#6 6 4 4 22
#7 7 5 4 22
#8 8 3 3 22
#9 9 8 4 22
#10 10 3 6 22
# Warning message:
#In as.rds.data.frame(df, id = "id", recruiter.id = "recruit.id", :
#NAs introduced by coercion
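The condition quoted in the original warning, id %in% names(x), suggests that the function expects column names rather than the columns themselves. The base R sketch below reproduces only that condition; it does not call the RDS package, and the variable names are introduced for illustration. It also shows that rep(22,000, 10) is parsed as three separate arguments and yields ten copies of 22, which matches the population column in the output above.

```r
df <- data.frame(
  id = 1:10,
  recruit.id = c(-1, 1, 1, 2, 2, 4, 5, 3, 8, 3)
)

# Passing the column itself: none of the values 1..10 is a column name
passes_with_vector <- any(df$id %in% names(df))

# Passing the column name as a string: the check succeeds
passes_with_name <- "id" %in% names(df)

# The comma splits the number: x = 22, times = 000, length.out = 10
population <- rep(22,000, 10)
```

If a population of twenty-two thousand was intended, writing rep(22000, 10) would express that.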
