"grouping variable must not contain purely numeric items" - r

I am doing an analysis with groups and, as such, need to make a grouping variable, for which I wanted to use gender (0=male, 1=female). What I first did was create a vector of this variable (the manual told me to do this), but then I got the error: "grouping variable must not contain purely numeric items". Then I transformed my vector into a logical (TRUE/FALSE), but somehow I still get this error.
So my question is, does anyone know, in general terms, what may be the problem when I get this error?
Attached below is the code to the head of my dataset:
structure(c(7, 8, 7, 5, 6, 6, 4.9, NA, 6.9, 5.1, 5.8, NA, NA,
NA, 7, 3, 7, NA, NA, NA, 6.7, 4.1, 5.9, NA, NA, NA, 5, 6, 7,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 8, NA, NA, NA, 6.2,
4.3, 6.3, NA, NA, NA, 7, 5, 7, NA, NA, NA, 6.5, NA, NA, NA, NA,
NA, 6, NA, 7, NA, NA, NA, NA, NA, 5, NA, NA, NA, NA, NA, 7, NA,
NA, NA, NA, NA, 6.1, NA, NA, NA, NA, NA, 7, NA, NA, NA, NA, NA,
NA, NA, 16, 0.001, 12, 11, 11, 0.001, 0.001, 0.001, 12, 12, 12,
0.001, 0.001, 0.001, 12, 12, 12, 0.001, 0.001, 0.001, 15, 12,
12, 0.001, 0.001, 0.001, 16, 0.001, 12, 0.001, 0.001, 0.001,
0.001, 0.001, 15, 0.001, 0.001, 0.001, 0.001, 0.001, 16, 0.001,
0, 1, 0, 0, 1, 0), .Dim = c(6L, 24L), .Dimnames = list(c("800009",
"800012", "800015", "800033", "800042", "800045"), c("gener_sat_T0",
"sel_T0", "gener_sat_T1", "sel_T1", "gener_sat_T2", "sel_T2",
"gener_sat_T3", "sel_T3", "gener_sat_T4", "sel_T4", "gener_sat_T5",
"sel_T5", "gener_sat_T6", "sel_T6", "gener_sat_T7", "sel_T7",
"dT1", "dT2", "dT3", "dT4", "dT5", "dT6", "dT7", "female")))
Then what I am trying to do is fit a CT model (have used it before on non-group data and that worked fine).
CTMODEL <- ctModel(n.latent = 2, n.manifest = 2, Tpoints = 8,
manifestNames = c("gener_sat", "sel"),
latentNames = c("gener_sat", "sel"), LAMBDA = diag(2))
fit_CTMODEL <- ctMultigroupFit(datawide = data_wide, groupings=female, ctmodelobj = CTMODEL)
Thanks a bunch!

OK, I reran your computations. Your code was not directly reproducible, so I made some changes, and now it works:
# create the structure object (data_wide), and change it to remove the
# grouping:
data_wide = structure(c(7, 8, 7, 5, 6, 6, 4.9, NA, 6.9, 5.1, 5.8, NA, NA,
NA, 7, 3, 7, NA, NA, NA, 6.7, 4.1, 5.9, NA, NA, NA, 5, 6, 7,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 8, NA, NA, NA, 6.2,
4.3, 6.3, NA, NA, NA, 7, 5, 7, NA, NA, NA, 6.5, NA, NA, NA, NA,
NA, 6, NA, 7, NA, NA, NA, NA, NA, 5, NA, NA, NA, NA, NA, 7, NA,
NA, NA, NA, NA, 6.1, NA, NA, NA, NA, NA, 7, NA, NA, NA, NA, NA,
NA, NA, 16, 0.001, 12, 11, 11, 0.001, 0.001, 0.001, 12, 12, 12,
0.001, 0.001, 0.001, 12, 12, 12, 0.001, 0.001, 0.001, 15, 12,
12, 0.001, 0.001, 0.001, 16, 0.001, 12, 0.001, 0.001, 0.001,
0.001, 0.001, 15, 0.001, 0.001, 0.001, 0.001, 0.001, 16, 0.001),
.Dim = c(6L, 23L),
.Dimnames = list(c("800009", "800012", "800015", "800033", "800042", "800045"),
c("gener_sat_T0", "sel_T0", "gener_sat_T1", "sel_T1", "gener_sat_T2", "sel_T2",
"gener_sat_T3", "sel_T3", "gener_sat_T4", "sel_T4", "gener_sat_T5", "sel_T5",
"gener_sat_T6", "sel_T6", "gener_sat_T7", "sel_T7",
"dT1", "dT2", "dT3", "dT4", "dT5", "dT6", "dT7")))
CTMODEL <- ctModel(n.latent = 2, n.manifest = 2, Tpoints = 8,
manifestNames = c("gener_sat", "sel"),
latentNames = c("gener_sat", "sel"), LAMBDA = diag(2))
fem = c("m", "f", "m", "m", "f", "m") # grouping (0 = male, 1 = female in the
# original column), which needs to be a character vector
fit_CTMODEL <- ctMultigroupFit(dat = data_wide, groupings = fem,
ctmodelobj = CTMODEL) # dat instead of datawide
So in the end it's just a matter of making the grouping variable a character vector.
Addendum: the code runs but gives various errors:
Not all eigenvalues of Hessian are greater than 0
Fit attempt generated errors
Retry limit reached
I guess that's because of the model, and I leave the solution to you :)
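As a minimal sketch of the conversion step, assuming the grouping information sits in a numeric 0/1 column called "female" (0 = male, 1 = female, as in the question), the column can be turned into a character vector and dropped from the data before fitting. The toy matrix and the names grp and data_nogroup are made up for illustration; this only prepares the inputs and does not call ctMultigroupFit itself.

```r
# Toy stand-in for the wide data: one measurement column plus the 0/1 "female" column
data_wide <- cbind(gener_sat_T0 = c(7, 8, 7, 5, 6, 6),
                   female       = c(0, 1, 0, 0, 1, 0))

# Recode the numeric grouping column into a character vector
grp <- ifelse(data_wide[, "female"] == 1, "f", "m")

# Drop the grouping column so only model variables remain in the data
data_nogroup <- data_wide[, colnames(data_wide) != "female", drop = FALSE]

print(grp)
print(colnames(data_nogroup))
```

The recoded vector could then be passed as the groupings argument, with the reduced matrix as the data.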


UPGMA with hclust plotting branch lengths as raw distances

I'm working on a presentation regarding utilizing UPGMA with the hclust() function within our research lab. According to the literature, the branch length calculated by UPGMA for any pair of elements would be 1/2 the pairwise distance between those two elements.
I'm noticing that the example dendrogram I'm building for the presentation isn't calculating branch lengths that I expected. I'm not finding anything in ?hclust that would make me think that I'm missing a function argument that is causing the UPGMA algorithm to use the raw distances as the branch lengths. I understand that in certain situations, due to the limitations of computation accuracy, having a dendrogram which is exactly ultrametric may not always be possible (from here and here, and I'm sure elsewhere as well). That still doesn't explain why I see the raw pairwise distances being plotted as the branch length between two elements.
Using the data below, here's the code I used to plot an example dendrogram...
demoDend <- hclust(d = demoTable, method = "average") # make an hclust object
# use the ggdendro package to extract segments and labels for ggplot plotting
dendData <- ggdendro::dendro_data(demoDend)
dendSegs <- dendData$segments
dendLabs <- dendData$labels
library(ggplot2)
ggplot()+
geom_segment(data = dendSegs, aes(x = x, y = y, xend = xend, yend = yend))+
geom_text(data = dendLabs, aes(x = x, y = y-0.05, label = label, angle = 90))+
geom_hline(aes(yintercept = 0.333), linetype = 2, color = "blue")+
geom_hline(aes(yintercept = 0.2), linetype = 2, color = "red")+
theme_bw()
The two elements that stand out are 13195 and 13199 which have a distance of 0.2, and whose branch length is being plotted as 0.2 (red line in ggplot).
Even after examining the hclust object, some of the heights for the branches are the raw distances in the input matrix, and not 1/2 the distance. Do I need to manually halve the heights in the object before plotting? Maybe I don't understand UPGMA as well as I thought? Any help or insight into the implementation of UPGMA with hclust() would be greatly appreciated.
Here's the sample distance data that I'm working with, from dput()
demoTable <- structure(c(0, 0.333333333333333, 0.333333333333333, 0, 0, 0.333333333333333,
0.333333333333333, 1, 1, 1, 1, 1, 1, NA, 0, 0, 0.333333333333333,
0.333333333333333, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, 0, 0.333333333333333,
0.333333333333333, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA, 0, 0,
0.333333333333333, 0.333333333333333, 1, 1, 1, 1, 1, 1, NA, NA,
NA, NA, 0, 0.333333333333333, 0.333333333333333, 1, 1, 1, 1,
1, 1, NA, NA, NA, NA, NA, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA,
NA, NA, NA, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA,
0, 0.6, 0, 1, 0.6, 0.333333333333333, NA, NA, NA, NA, NA, NA,
NA, NA, 0, 0.6, 1, 0.5, 0.2, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0, 1, 0.6, 0.333333333333333, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 0, 0.5, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0, 0.6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0
), .Dim = c(13L, 13L), .Dimnames = list(c("13187", "13188", "13189",
"13190", "13191", "13192", "13193", "13194", "13195", "13196",
"13197", "13198", "13199"), NULL))

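A small sketch may help with the hclust question above. In an hclust object, the height component records the dissimilarity at which two clusters are merged, so with method = "average" a pair that is 0.2 apart is joined at height 0.2. The UPGMA branch length from that node down to each tip is half of that value. Assuming a half-distance scale is wanted for plotting, one option is to halve the heights on a copy of the object. The 3 x 3 matrix and the name hc_half are toy assumptions, not the asker's data.

```r
# Toy distance matrix (assumed values): A-B = 0.2, A-C = 0.6, B-C = 0.6
m <- matrix(c(0,   0.2, 0.6,
              0.2, 0,   0.6,
              0.6, 0.6, 0),
            nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# hclust() expects a "dist" object, so convert the matrix first
hc <- hclust(as.dist(m), method = "average")

# Merge heights are the inter-cluster distances themselves
print(hc$height)

# Copy with halved heights, matching the UPGMA node-to-tip branch lengths
hc_half <- hc
hc_half$height <- hc$height / 2
print(hc_half$height)
```

The halved copy can be passed to plot() or ggdendro::dendro_data() in place of the original object.
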
Remove legend in ggplot

I am working with ggeffects package
I have the following syntax
data_example <- structure(list(paciente = structure(c(6171, 6488, 6300, 6446,
6489, 6445, 6473, 6351, 6212, 6387), label = "Paciente", format.spss = "F6.0"),
edad_s1 = structure(c(69, 62, 60, 71, 67, 59, 63, 66, 67,
70), label = "Edad", format.spss = "F3.0"), sexo_s1 = structure(c(1L,
2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L), .Label = c("Hombre",
"Mujer"), label = "Sexo", class = "factor"), grupo_int_v00 = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A", "B"), label = "Grupo de intervención", class = "factor"),
time = c(0, 0, 0, 2, 2, 2, 1, 2, 1, 1), peso1 = c(89.9, 62,
91.5, 75.2, 68.2, 88.4, 93.6, 79, 88.3, 84.4), cintura1 = c(113,
90, 112, NA, 87.5, 116, 98.5, 104, 112.5, 108.5), tasis2_e = c(132,
132, 149, NA, 145, 137, 129, 152, 146, 129), tadias2_e = c(81,
58, 79, NA, 80, 60, 79, 87, 79, 68), p17_total = c(7, 9,
10, 10, 10, 10, 10, 7, 10, 11), geaf_tot = c(3412.59, 3524.48,
559.44, 5454.55, 4293.71, 839.16, 3146.85, 7552.45, 4335.66,
566.9), glucosa = c(102, 97, 89, NA, 88, 168, 104, NA, 114,
121), albumi = c(4.94, 4.68, 4.75, NA, 4.34, 5.06, 4.56,
NA, 5.06, 3.96), coltot = c(232, 253, 215, NA, 202, 287,
255, NA, 217, 147), hdl = c(59, 64, 68, NA, 71, 46, 61, NA,
40, 42), ldl_calc = c(143, 150, 127, NA, 114, NA, 170, NA,
143, 86), trigli = c(152, 195, 99, NA, 85, 378, 121, NA,
170, 93), hba1c = c(5.61, 5.66, 5.43, NA, 5.38, 8.14, 5.81,
NA, 6, 6.38), i_hucpeptide = c(988.91, 673.5, 1036.03, NA,
734.29, 1266.3, 610.9, NA, 1144.8, 672.08), i_hughrelin = c(1133.35,
1230.06, 1109.98, NA, 1064.79, 725.35, 1437.85, NA, 866.07,
822.83), i_hugip = c(2.67, 2.67, 2.67, NA, 2.67, 2.67, 2.67,
NA, 2.67, 2.67), i_huglp1 = c(145.43, 138.32, 194.14, NA,
99.37, 166.27, 218.33, NA, 184.04, 222.84), i_huglucagon = c(513.89,
357.35, 624.73, NA, 464.85, 448.49, 304.29, NA, 310.61, 426.52
), i_huinsulin = c(234.23, 229.06, 358.86, NA, 175.38, 466,
99.02, NA, 367.95, 77.33), i_huleptin = c(7898.28, 5211.27,
14670.25, NA, 7161.39, 3218.49, 2659.8, NA, 3766.01, 1207.58
), i_hupai1 = c(3468.4, 1977.9, 4101.1, NA, 1613.4, 2847.27,
2442.49, NA, 1953.26, 1752.88), i_huresistin = c(4783.28,
2676.05, 3064.57, NA, 2165.52, 3878.48, 8343.46, NA, 2822.68,
6496.73), i_huvisfatin = c(831.6, 649.45, 2270.65, NA, 1578.88,
9.63, 185.09, NA, 162.8, 8.64), col_rema = c(30, 39, 20,
NA, 17, NA, 24, NA, 34, 19), homa = c(1061.843, 987.503,
1419.491, NA, 685.931, 3479.467, 457.692, NA, 1864.28, 415.864
), i_pcr = c(0.05, NA, 0.27, NA, 0.03, 0.23, 0.04, NA, 0.09,
0.09), d_homa = c(NA, NA, NA, NA, -2.629, 33.042, -181.211,
NA, -929.683, -89.108), d_hughrelin = c(NA, NA, NA, NA, -213.59,
48.43, 95.27, NA, -228.62, -146.8), d_huinsulin = c(NA, NA,
NA, NA, 3.24, -68.79, -43.31, NA, -147.33, -7.46), d_hucpeptide = c(NA,
NA, NA, NA, 192.39, -263.54, -71.56, NA, -437.38, -215.44
), d_huglucagon = c(NA, NA, NA, NA, 38.99, -112.45, -10.75,
NA, -133.55, -259.73), d_huleptin = c(NA, NA, NA, NA, 409.76,
-1081.5, -1778.69, NA, -353.91, -679.7), d_huresistin = c(NA,
NA, NA, NA, 391.02, -155.41, -436.47, NA, -1137.79, -922.75
), d_huvisfatin = c(NA, NA, NA, NA, 457.54, -260.79, -341.02,
NA, -426.89, 0), d_glucosa = c(NA, NA, NA, NA, -2, 23, 3,
NA, -8, -13), d_coltot = c(NA, NA, NA, NA, -52, 36, -11,
NA, 15, -12), d_hdl = c(NA, NA, NA, NA, 1, 3, -1, NA, 1,
4), d_ldl_calc = c(NA, NA, NA, NA, -50, NA, -10, NA, 12,
-15), d_col_rema = c(NA, NA, NA, NA, -3, NA, 0, NA, 2, -1
), d_trigli = c(NA, NA, NA, NA, -14, 132, -1, NA, 8, -5),
d_hba1c = c(NA, NA, NA, NA, -0.11, -0.04, -0.18, NA, -1.76,
-0.67), d_tasis2_e = c(NA, NA, NA, NA, 0, 6, -1, 7, -21,
-9), d_tadias2_e = c(NA, NA, NA, NA, 0, 2, -8, 8, -10, -17
), d_peso1 = c(NA, NA, NA, -6, -2.3, 0.2, -11.4, 0.8, -4.1,
-9.3), d_cintura1 = c(NA, NA, NA, NA, -2.5, -4, -12.5, 6,
-3.5, -4.5), d_geaf_tot = c(NA, NA, NA, 699.31, 2055.95,
-2181.82, 1748.25, 3776.23, 867.13, -6593.94), d_p17_total = c(NA,
NA, NA, 1, 4, 5, 4, -5, 5, 2), d_hupai1 = c(NA, NA, NA, NA,
-185.03, 204.77, 202.01, NA, -1551.91, 57.2), d_hugip = c(NA,
NA, NA, NA, 0, 0, 0, NA, 0, 0), d_huglp1 = c(NA, NA, NA,
NA, -42.07, -163.02, 107.28, NA, -95.82, -87.5), d_pcr = c(NA,
NA, NA, NA, NA, NA, NA, NA, -0.18, -0.22), ln_trigli = c(5.024,
5.273, 4.595, NA, 4.443, 5.935, 4.796, NA, 5.136, 4.533),
ln_homa = c(6.968, 6.895, 7.258, NA, 6.531, 8.155, 6.126,
NA, 7.531, 6.03), ln_hba1c = c(1.725, 1.733, 1.692, NA, 1.683,
2.097, 1.76, NA, 1.792, 1.853), ln_geaf_tot = c(8.135, 8.167,
6.327, 8.604, 8.365, 6.732, 8.054, 8.93, 8.375, 6.34), i_ratiolg = c(6.969,
4.237, 13.217, NA, 6.726, 4.437, 1.85, NA, 4.348, 1.468)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
The mixed model I have created following the syntax
lme_peso <- lme(peso1 ~ sexo_s1 + edad_s1 + poly(time, 2)*grupo_int_v00 + p17_total,
random = ~ poly(time, 2)|paciente, control=lmeControl(opt="optim"),
data = dat_longer, subset = !is.na(peso1), na.action = na.omit)
And then to plot it
ggpredict(lme_peso, c("time [all]", "grupo_int_v00"), type="fixed") %>%
ggplot(aes(x = x, y = predicted, colour = group)) +
geom_point() +
geom_line() +
stat_smooth(method = "loess",se = T) +
labs(x = "time (months)", y = "Weight (kg)") +
scale_color_manual(labels = c("Control", "Intervention"), values = c("orange", "green")) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high, fill = F),alpha = 1/5) +
scale_x_continuous(breaks = 0:2, labels = c(0, 6, 12))
When I suppress the fill argument in geom_ribbon, the fill stays black. But I don't know how to keep just one legend with the 2 groups (Control and Intervention); I get an extra legend (showing F in this case).
Thanks in advance
I couldn't run your code, but I rebuilt it with iris.
Like Matt suggested, one thing would be, remove fill=F:
# column names follow R's built-in iris data set
ggplot(data=iris, aes(x = Sepal.Length, y = Petal.Length, group=Species)) +
geom_point() +
geom_line() +
stat_smooth(method = "loess",se = T, aes(color=Species)) +
geom_ribbon(aes(ymin = 1, ymax = 3),alpha = 1/5) +
scale_x_continuous(breaks = 0:2, labels = c(0, 6, 12))
Or if you need it for some reason, use guides(fill="none"):
# column names follow R's built-in iris data set
ggplot(data=iris, aes(x = Sepal.Length, y = Petal.Length, group=Species)) +
geom_point() +
geom_line() +
stat_smooth(method = "loess",se = T, aes(color=Species)) +
geom_ribbon(aes(ymin = 1, ymax = 3, fill=FALSE),alpha = 1/5) +
scale_x_continuous(breaks = 0:2, labels = c(0, 6, 12)) +
guides(fill="none")
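Another option, sketched here under assumptions, is to map fill to the same grouping variable as colour and give both scales the same title, labels and values. ggplot2 then merges the two legends into one. The data frame pred below is a made-up stand-in for the ggpredict() output (its numbers are toy values), and the names grp_labels and grp_colours are hypothetical.

```r
library(ggplot2)

# Made-up stand-in for ggpredict() output: x, predicted, CI bounds and group
pred <- data.frame(
  x = rep(0:2, times = 2),
  predicted = c(88, 85, 84, 89, 83, 80),
  group = factor(rep(c("A", "B"), each = 3))
)
pred$conf.low  <- pred$predicted - 2
pred$conf.high <- pred$predicted + 2

grp_labels  <- c("Control", "Intervention")
grp_colours <- c("orange", "green")

# colour and fill share one variable and identical scale settings,
# so a single combined legend is drawn
p <- ggplot(pred, aes(x = x, y = predicted, colour = group, fill = group)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 1/5, colour = NA) +
  geom_line() +
  geom_point() +
  scale_colour_manual(name = "Group", labels = grp_labels, values = grp_colours) +
  scale_fill_manual(name = "Group", labels = grp_labels, values = grp_colours) +
  scale_x_continuous(breaks = 0:2, labels = c(0, 6, 12)) +
  labs(x = "time (months)", y = "Weight (kg)")

print(p)
```

Compared with guides(fill = "none"), this keeps the ribbons coloured per group while still showing only one legend.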

How to merge multiple rows in R with multiple columns in a dataset

I want to merge the rows for each record_id into one row based on the type column, except for the volunteers in the record_id column who have two repeats in the repeat column. I would like a second row for these. Each record_id corresponds to one person who has come in for a test either once (repeat=1) or twice, and therefore has two entries in the repeat column.
Here is what my data look like
structure(list(record_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4), type = c(NA, "data_collection", "test", NA, "data_collection",
"test", NA, "data_collection", "test", "test", NA, "cata_collection",
"test", "test"), `repeat` = c(NA, 1, 1, NA, 1, 1, NA, 1, 1, 2,
NA, 1, 1, 2), dt_volunteer_reg = structure(c(1597246320, NA,
NA, 1599217080, NA, NA, 1596184500, NA, NA, NA, 1598192280, NA,
NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), age = c(26,
NA, NA, 64, NA, NA, 51, NA, NA, NA, 39, NA, NA, NA), gender = c(0,
NA, NA, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA, NA), case_type = c(NA,
1, NA, NA, 2, NA, NA, 1, NA, NA, NA, 1, NA, NA), test_dis_dt = structure(c(NA,
NA, 1597250220, NA, NA, 1600012980, NA, NA, 1596382080, 1601980740,
NA, NA, 1598284020, 1603118700), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), test_dis_res = c(NA, NA, 1, NA, NA, 1, NA,
NA, 2, 2, NA, NA, 2, 2), test_dis_in = c(NA, NA, NA, NA, NA,
0.02, NA, NA, 6.13, 4.75, NA, NA, 7.23, 3.85), test_cont_dt = structure(c(NA,
NA, 1597250280, NA, NA, 1608636120, NA, NA, NA, 1601980740, NA,
NA, 1605704940, 1603205340), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
test_cont_res = c(NA, NA, 2, NA, NA, 1, NA, NA, NA, 2, NA,
NA, 2, 2), test_cont_val = c(NA, NA, 123, NA, NA, 0, NA,
NA, NA, 40000, NA, NA, 471.6, 306.5)), row.names = c(NA,
-14L), class = c("tbl_df", "tbl", "data.frame"))
And this is what I'm hoping to get
structure(list(record_id = c(1, 2, 3, 3, 4, 4), `repeat` = c(1,
1, 1, 2, 1, 2), dt_volunteer_reg = structure(c(1597246320, 1599217080,
1596184500, 1596184500, 1598192280, 1598192280), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), age = c(26, 64, 51, 51, 39, 39), gender = c(0,
1, 0, 0, 1, 1), case_type = c(1, 2, 1, 1, 1, 1), test_dis_dt = structure(c(1597250220,
1600012980, 1596382080, 1601980740, 1598284020, 1603118700), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), test_dis_res = c(1, 1, 2, 2, 2, 2),
test_dis_in = c(NA, 0.02, 6.13, 4.75, 7.23, 3.85), test_cont_dt = structure(c(1597250280,
1608636120, NA, 1601980740, 1605704940, 1603205340), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), test_cont_res = c(2, 1, NA, 2,
2, 2), test_cont_val = c(123, 0, NA, 40000, 471.6, 306.5)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Assuming the first dataframe is called input and you are happy using the tidyverse you can do it like this.
input %>%
nest(data = c(-record_id)) %>%
mutate(
data = map(data, ~replace_na(., as.list(head(., 1)))), # Fill in specimen details
data = map(data, filter, !is.na(`repeat`)), # Remove specimen details
data = map(data, ~replace_na(., as.list(head(., 1)))), # Fill in test data with data collection details
data = map(data, filter, type == "test") # Remove data collection rows
) %>%
unnest(data) %>%
select(-type)
There are ways to do this more concisely and/or faster but this may be more readable.
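One more concise variant, offered as a sketch rather than the definitive approach, fills the person-level and data-collection columns within each record_id using tidyr::fill() and then keeps only the test rows. The toy tibble below is a cut-down, assumed version of the input with only a few of its columns; with the full data, the columns passed to fill() would be dt_volunteer_reg, age, gender and case_type.

```r
library(dplyr)
library(tidyr)

# Cut-down toy version of the input (assumed values, fewer columns)
input <- tibble(
  record_id    = c(1, 1, 1, 2, 2, 2, 2),
  type         = c(NA, "data_collection", "test",
                   NA, "data_collection", "test", "test"),
  `repeat`     = c(NA, 1, 1, NA, 1, 1, 2),
  age          = c(26, NA, NA, 51, NA, NA, NA),
  case_type    = c(NA, 1, NA, NA, 2, NA, NA),
  test_dis_res = c(NA, NA, 1, NA, NA, 2, 2)
)

result <- input %>%
  group_by(record_id) %>%
  # copy registration and data-collection values onto every row of the record
  fill(age, case_type, .direction = "downup") %>%
  ungroup() %>%
  # keep one row per test, then drop the helper column
  filter(type == "test") %>%
  select(-type)

print(result)
```

Each record_id ends up with one row per test, so people tested twice keep two rows.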

Using for-loops in R to process several columns in a data frame

I am trying to edit 50 columns in my data frame into dummy variables depending on an exact match with a given vector of 50 values using a for-loop function.
I never used loop functions before and can't figure out how to do it.
I first wanted to code this "by hand" for each of the 50 columns like that:
dBGK1a <- as.numeric(BGK1a == BGKright[1])
dBGK2a <- as.numeric(BGK2a == BGKright[2])
dBGK3a <- as.numeric(BGK3a == BGKright[3])
....
dBGK50a <- as.numeric(BGK50a == BGKright[50])
As this is very tedious, I tried to come up with a for-loop that can handle this.
for(i in 1:50) {
for (j in seq(from = 348, to = 448, by = 2)){
data1[j] <- as.numeric(data1[j] == BGKright[i])
}
}
Somehow this doesn't work, since I get the value "0" in every column for every observation.
data1 is my data frame. Here is a shorter version of the data frame:
dput(head(data1[348:354], 20))
structure(list(BGK1a = c(NA, NA, NA, NA, NA, NA, NA, NA, 2, NA,
NA, NA, NA, NA, 2, 2, 2, 2, 1, 2), BGK1b = c(NA, NA, NA, NA,
NA, NA, NA, NA, 50, NA, NA, NA, NA, NA, 100, 100, 100, 99, 89,
50), BGK2a = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA,
NA, NA, 1, 2, 1, 2, 1, 1), BGK2b = c(NA, NA, NA, NA, NA, NA,
NA, NA, 50, NA, NA, NA, NA, NA, 100, 50, 96, 62, 93, 50), BGK3a = c(NA,
NA, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 2, 1, 1, 1,
1, 2), BGK3b = c(NA, NA, NA, NA, NA, NA, NA, NA, 50, NA, NA,
NA, NA, NA, 100, 100, 50, 85, 82, 74), BGK4a = c(NA, NA, NA,
NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 1, 2, 2, 2, 1, 1)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
What the loop should do is select the respective value of "BGKright" with "i" and the column to process with "j". Note that "j" needs to jump 2 steps on every iteration because I only need to process every second column (from column 348 to column 448).
I would appreciate any help regarding this loop and other solutions that are possible for this task without loops.
Thank you in advance.
OK, I used BGKa=select(data1[348:448],ends_with("a")) to make a new data frame with only the relevant columns.
Then I used the for-loop to create the dummies.
for(i in 1:50) {
BGKa[i]=as.numeric(BGKa[i]==BGKright[i])
}
Seems to work. Ty for help.
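For a version without an explicit loop, a sketch with base R's Map() pairs each column with its matching entry of BGKright. A likely reason the original nested loop returned only zeros is that every column was compared against all 50 values in turn, each pass overwriting the previous result. The small BGKa and BGKright objects below are toy assumptions standing in for the real 50 columns and 50 values.

```r
# Toy stand-ins: three "a" columns and the three values they should match
BGKa <- data.frame(BGK1a = c(2, 1, NA),
                   BGK2a = c(1, 1, 2),
                   BGK3a = c(2, 1, 1))
BGKright <- c(2, 1, 1)

# Compare column k with BGKright[k]; NA answers stay NA
dummies <- as.data.frame(
  Map(function(col, right) as.numeric(col == right), BGKa, BGKright)
)

# Prefix the new columns with "d", as in dBGK1a, dBGK2a, ...
names(dummies) <- paste0("d", names(BGKa))
print(dummies)
```

The result can be attached to the original data with cbind(data1, dummies) if the dummy columns are needed alongside the raw answers.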

Progression of non-missing values that have missing values in-between

To continue on a previous topic:
Finding non-missing values between missing values
I would like to also find whether the value before the missing value is smaller, equal to or larger than the one after the missing.
To use the same example from before:
df = structure(list(FirstYStage = c(NA, 3.2, 3.1, NA, NA, 2, 1, 3.2,
3.1, 1, 2, 5, 2, NA, NA, NA, NA, 2, 3.1, 1), SecondYStage = c(NA,
3.1, 3.1, NA, NA, 2, 1, 4, 3.1, 1, NA, 5, 3.1, 3.2, 2, 3.1, NA,
2, 3.1, 1), ThirdYStage = c(NA, NA, 3.1, NA, NA, 3.2, 1, 4, NA,
1, NA, NA, 3.2, NA, 2, 3.2, NA, NA, 2, 1), FourthYStage = c(NA,
NA, 3.1, NA, NA, NA, 1, 4, NA, 1, NA, NA, NA, 4, 2, NA, NA, NA,
2, 1), FifthYStage = c(NA, NA, 2, NA, NA, NA, 1, 5, NA, NA, NA,
NA, 3.2, NA, 2, 3.2, NA, NA, 2, 1)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -20L))
Rows 13, 14 and 16 have non-missing values in between missing values. The output this time should be: "same", "larger" and "same" for rows 13, 14, and 16, and say "N/A" for the other rows.
A straightforward approach would be to split, convert to numeric, take the last 2 values and compare with an ifelse statement, i.e.
sapply(strsplit(do.call(paste, df)[c(13, 14, 16)], 'NA| '), function(i){
v1 <- as.numeric(tail(i[i != ''], 2));
ifelse(v1[1] > v1[2], 'greater',
ifelse(v1[1] == v1[2], 'same', 'smaller'))
})
#[1] "same" "smaller" "same"
NOTE
I took previous answer as a given (do.call(paste, df)[c(13, 14, 16)])
A more generic approach (as noted by Ronak, last 2 digits will fail in some cases) would be,
sapply(strsplit(gsub("([[:digit:]])+\\s+[NA]+\\s+([[:digit:]])", '\\1_\\2',
do.call(paste, df)[c(13, 14, 16)]), ' '), function(i) {
v1 <- i[grepl('_', i)];
v2 <- as.numeric(strsplit(v1, '_')[[1]]); # compare as numbers, not as strings
ifelse(v2[1] > v2[2], 'greater',
ifelse(v2[1] == v2[2], 'same', 'smaller')) })
#[1] "same" "smaller" "same"
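An alternative that avoids pasting rows into strings, sketched under the assumption that only the first gap in a row matters, works on the numeric values directly: find the positions of the non-missing values, look for a jump between consecutive positions, and compare the values on either side. The function name compare_around_gap and the toy matrix are hypothetical; the labels describe the value before the gap relative to the value after it, as in the code above.

```r
compare_around_gap <- function(x) {
  ok <- which(!is.na(x))          # positions of non-missing values
  gap <- which(diff(ok) > 1)      # jumps mean NAs sit between two values
  if (length(gap) == 0) return("N/A")
  before <- x[ok[gap[1]]]
  after  <- x[ok[gap[1] + 1]]
  if (before > after) "greater" else if (before == after) "same" else "smaller"
}

# Toy rows: two with a gap (patterned on rows 13 and 14) and one without
m <- rbind(c(2,  3.1, 3.2, NA, 3.2),
           c(NA, 3.2, NA,  4,  NA),
           c(1,  1,   1,   1,  NA))

res <- apply(m, 1, compare_around_gap)
print(res)
```

On the full data this could be run as apply(as.matrix(df), 1, compare_around_gap), which returns "N/A" for rows without a value on both sides of a gap.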
