Concentrate data frame information in R

I have two data frames:
> df1
2013-04-1 2013-04-2 2013-04-3 2013-04-4 2013-04-5 2013-04-6 2013-04-7 2013-04-8 2013-04-9 2013-04-10 2013-04-11
bin_1 32 489 32 32 364 19 312 0 0 0 346
bin_2 8 346 8 0 98 8 12 12 46 364 346
bin_3 9 98 346 46 9 312 6 1912 0 489 0
bin_4 4 12 9 12 0 12 0 987 9 19 12
bin_5 0 0 8 8 0 0 312 6 312 12 4
df1 contains 5 rows (bins) and 23 columns (date)
> df2
orange apple pear banana watermelon lemon
2013-04-1 1 1 1 1 0 1
2013-04-2 1 1 0 1 0 0
2013-04-3 1 1 1 1 0 1
2013-04-4 0 1 0 1 1 1
2013-04-5 1 0 0 0 1 1
df2 contains 23 rows(date) and 6 columns (types of fruits)
So now, I want to concentrate these 2 dfs into 1 big data frame that contains all the information, like:
> df3
orange apple pear banana watermelon lemon
bin_1 ? ? ? ? ? ?
bin_2 ? ? ? ? ? ?
bin_3 ? ? ? ? ? ?
bin_4 ? ? ? ? ? ?
bin_5 ? ? ? ? ? ?
But how can I concentrate the data? For example,
on 2013-04-1,
bin_1 contains 32 fruits, bin_2 contains 8 fruits, ..., bin_5 contains 0 fruits (based on df1)
only orange, apple, pear, banana, and lemon are available (based on df2)
Q. I want df3 to contain the concentrated information, e.g. bin_1 contains on average x oranges, and so on. How can I model this?
Code:
> dput(df1)
structure(list(`2013-04-1` = c(32, 8, 9, 4, 0), `2013-04-2` = c(489,
346, 98, 12, 0), `2013-04-3` = c(32, 8, 346, 9, 8), `2013-04-4` = c(32,
0, 46, 12, 8), `2013-04-5` = c(364, 98, 9, 0, 0), `2013-04-6` = c(19,
8, 312, 12, 0), `2013-04-7` = c(312, 12, 6, 0, 312), `2013-04-8` = c(0,
12, 1912, 987, 6), `2013-04-9` = c(0, 46, 0, 9, 312), `2013-04-10` = c(0,
364, 489, 19, 12), `2013-04-11` = c(346, 346, 0, 12, 4), `2013-04-12` = c(0,
9, 12, 46, 489), `2013-04-13` = c(32, 8, 19, 46, 0), `2013-04-14` = c(0,
987, 12, 0, 6), `2013-04-15` = c(0, 346, 4, 346, 0), `2013-04-16` = c(0,
1912, 1912, 12, 364), `2013-04-17` = c(12, 98, 32, 32, 1912),
`2013-04-18` = c(12, 12, 12, 0, 346), `2013-04-19` = c(9,
46, 98, 312, 4), `2013-04-20` = c(32, 987, 46, 9, 312), `2013-04-21` = c(4,
98, 12, 32, 12), `2013-04-22` = c(19, 0, 4, 346, 0), `2013-04-23` = c(1912,
364, 0, 0, 489)), row.names = c("bin_1", "bin_2", "bin_3",
"bin_4", "bin_5"), class = "data.frame")
> dput(df2)
structure(list(orange = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0), apple = c(1, 1, 1, 1, 0, 1,
0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0), pear = c(1,
0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1,
0), banana = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 0), watermelon = c(0, 0, 0, 1, 1, 0, 1, 1,
1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0), lemon = c(1, 0,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0
)), row.names = c("2013-04-1", "2013-04-2", "2013-04-3", "2013-04-4",
"2013-04-5", "2013-04-6", "2013-04-7", "2013-04-8", "2013-04-9",
"2013-04-10", "2013-04-11", "2013-04-12", "2013-04-13", "2013-04-14",
"2013-04-15", "2013-04-16", "2013-04-17", "2013-04-18", "2013-04-19",
"2013-04-20", "2013-04-21", "2013-04-22", "2013-04-23"), class = "data.frame")

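No answer is recorded for this question. One possible reading, offered only as a sketch, is that each cell of df3 should be the mean count in a bin over the days on which that fruit was available. Under that assumption, a matrix product of df1 with df2 sums the counts over the available days, and dividing by the number of available days gives the mean. The example uses only the first five dates from the dput output to stay short. The interpretation is an assumption, not something the asker confirmed.

```r
# First five dates only, copied from the dput() output above
df1 <- data.frame(
  `2013-04-1` = c(32, 8, 9, 4, 0),
  `2013-04-2` = c(489, 346, 98, 12, 0),
  `2013-04-3` = c(32, 8, 346, 9, 8),
  `2013-04-4` = c(32, 0, 46, 12, 8),
  `2013-04-5` = c(364, 98, 9, 0, 0),
  row.names = paste0("bin_", 1:5),
  check.names = FALSE
)
df2 <- data.frame(
  orange     = c(1, 1, 1, 0, 1),
  apple      = c(1, 1, 1, 1, 0),
  pear       = c(1, 0, 1, 0, 0),
  banana     = c(1, 1, 1, 1, 0),
  watermelon = c(0, 0, 0, 1, 1),
  lemon      = c(1, 0, 1, 1, 1),
  row.names = names(df1)
)

# Align the dates of df2 with the columns of df1 before multiplying
avail  <- as.matrix(df2[names(df1), ])
# (bins x dates) %*% (dates x fruits): counts summed over available days
totals <- as.matrix(df1) %*% avail
# Number of days on which each fruit was available
days   <- colSums(avail)
# Assumed definition of "average": total divided by available days
df3    <- as.data.frame(sweep(totals, 2, days, "/"))
df3
```

A fruit that is never available would have zero days and produce NaN in its column, so that case would need separate handling.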
Related

Count the frequency of consecutive zeros every time they appear in each row

I have this data frame and would like to count the length of each run of consecutive zeros every time one appears, so that the output would be A: 2 4, B: 1 2 1, C: 2 5, D: 2 3, E: 1 1
df <- data.frame(
A=c(1, 0, 0, 1, 1, 0, 0, 0, 0),
B=c(0, 1, 1, 0, 0, 1, 0, 1, 1),
C=c(0, 0, 1, 1, 0, 0, 0, 0, 0),
D=c(0, 0, 1, 1, 1, 1, 0, 0, 0),
E=c(1, 0, 1, 1, 1, 1, 0, 1, 1)
)
In base R, we can loop over the columns of the data.frame with lapply and use rle to get the lengths of the runs of 0 values (the data is named df1 here, see below):
lapply(df1, function(x) with(rle(x), lengths[!values]))
-output
$A
[1] 2 4
$B
[1] 1 2 1
$C
[1] 2 5
$D
[1] 2 3
$E
[1] 1 1
data
df1 <- structure(list(A = c(1, 0, 0, 1, 1, 0, 0, 0, 0), B = c(0, 1,
1, 0, 0, 1, 0, 1, 1), C = c(0, 0, 1, 1, 0, 0, 0, 0, 0), D = c(0,
0, 1, 1, 1, 1, 0, 0, 0), E = c(1, 0, 1, 1, 1, 1, 0, 1, 1)), row.names = c(NA,
-9L), class = "data.frame")
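To see why the one-liner works, it can help to look at what rle returns for a single column. The sketch below uses column A from the data and spells out the intermediate values; the variable names are only illustrative.

```r
x <- c(1, 0, 0, 1, 1, 0, 0, 0, 0)  # column A

r <- rle(x)
r$lengths  # 1 2 2 4 : length of each run
r$values   # 1 0 1 0 : value repeated in each run

# Keep only the run lengths whose value is 0.
# `!values` in the answer is equivalent because 0 coerces to FALSE.
zero_runs <- r$lengths[r$values == 0]
zero_runs  # 2 4
```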

How can I represent one column's values using multiple columns in R where one new column is conditional?

Looking at similar questions, I could not find one that matched my need.
If one does contain a solution, please share its link.
I have this dput-produced data:
structure(list(Player = c("Seth Lugo", "Jacob deGrom", "Rick Porcello",
"David Peterson", "Michael Wacha", "Seth Lugo", "Jacob deGrom",
"Rick Porcello", "David Peterson", "Steven Matz", "Seth Lugo",
"Jacob deGrom", "Rick Porcello", "David Peterson", "Seth Lugo",
"Jacob deGrom", "Rick Porcello", "Michael Wacha", "David Peterson",
"Jacob deGrom", "Seth Lugo", "Rick Porcello", "Robert Gsellman",
"Michael Wacha", "Ariel Jurado", "Jacob deGrom", "Rick Porcello",
"Seth Lugo", "Robert Gsellman", "David Peterson"), Date = structure(c(1601164800,
1601078400, 1601078400, 1600905600, 1600819200, 1600732800, 1600646400,
1600560000, 1600473600, 1600387200, 1600300800, 1600214400, 1600128000,
1599955200, 1599868800, 1599782400, 1599609600, 1599523200, 1599436800,
1599350400, 1599264000, 1599177600, 1599091200, 1599004800, 1598918400,
1598832000, 1598745600, 1598745600, 1598659200, 1598572800), tzone = "UTC", class = c("POSIXct",
"POSIXt")), DblHdr = c(0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 2), DateStr = c("09/27/2020",
"09/26/2020", "09/26/2020", "09/24/2020", "09/23/2020", "09/22/2020",
"09/21/2020", "09/20/2020", "09/19/2020", "09/18/2020", "09/17/2020",
"09/16/2020", "09/15/2020", "09/13/2020", "09/12/2020", "09/11/2020",
"09/09/2020", "09/08/2020", "09/07/2020", "09/06/2020", "09/05/2020",
"09/04/2020", "09/03/2020", "09/02/2020", "09/01/2020", "08/31/2020",
"08/30/2020", "08/30/2020", "08/29/2020", "08/28/2020"), Month = c("09",
"09", "09", "09", "09", "09", "09", "09", "09", "09", "09", "09",
"09", "09", "09", "09", "09", "09", "09", "09", "09", "09", "09",
"09", "09", "08", "08", "08", "08", "08"), Tm = c("NYM", "NYM",
"NYM", "NYM", "NYM", "NYM", "NYM", "NYM", "NYM", "NYM", "NYM",
"NYM", "NYM", "NYM", "NYM", "NYM", "NYM", "NYM", "NYM", "NYM",
"NYM", "NYM", "NYM", "NYM", "NYM", "NYM", "NYM", "NYM", "NYM",
"NYM"), Opp = c("WSN", "WSN", "WSN", "WSN", "TBR", "TBR", "TBR",
"ATL", "ATL", "ATL", "PHI", "PHI", "PHI", "TOR", "TOR", "TOR",
"BAL", "BAL", "PHI", "PHI", "PHI", "PHI", "NYY", "BAL", "BAL",
"MIA", "NYY", "NYY", "NYY", "NYY"), Rslt = c("L 5-15", "L 3-4",
"L 3-5", "W 3-2", "L 5-8", "W 5-2", "L 1-2", "L 0-7", "W 7-2",
"L 2-15", "W 10-6", "W 5-4", "L 1-4", "L 3-7", "L 2-3", "W 18-1",
"W 7-6", "L 2-11", "L 8-9", "W 14-1", "W 5-1", "L 3-5", "W 9-7",
"W 9-4", "L 5-9", "L 3-5", "L 7-8", "L 2-5", "L 1-2", "W 4-3"
), W_L = c("L", "L", "L", "W", "L", "W", "L", "L", "W", "L",
"W", "W", "L", "L", "L", "W", "W", "L", "L", "W", "W", "L", "W",
"W", "L", "L", "L", "L", "L", "W"), temp = c("L 5", "L 3", "L 3",
"W 3", "L 5", "W 5", "L 1", "L 0", "W 7", "L 2", "W 10", "W 5",
"L 1", "L 3", "L 2", "W 18", "W 7", "L 2", "L 8", "W 14", "W 5",
"L 3", "W 9", "W 9", "L 5", "L 3", "L 7", "L 2", "L 1", "W 4"
), RS = c(5, 3, 3, 3, 5, 5, 1, 0, 7, 2, 10, 5, 1, 3, 2, 18, 7,
2, 8, 14, 5, 3, 9, 9, 5, 3, 7, 2, 1, 4), RA = c(15, 4, 5, 2,
8, 2, 2, 7, 2, 15, 6, 4, 4, 7, 3, 1, 6, 11, 9, 1, 1, 5, 7, 4,
9, 5, 8, 5, 2, 3), Rdiff = c(-10, -1, -2, 1, -3, 3, -1, -7, 5,
-13, 4, 1, -3, -4, -1, 17, 1, -9, -1, 13, 4, -2, 2, 5, -4, -2,
-1, -3, -1, 1), absV = c(10, 1, 2, 1, 3, 3, 1, 7, 5, 13, 4, 1,
3, 4, 1, 17, 1, 9, 1, 13, 4, 2, 2, 5, 4, 2, 1, 3, 1, 1), App_Dec = c("GS-2, L",
"GS-5", "GS-3, L", "GS-7, W", "GS-6, L", "GS-7, W", "GS-7, L",
"GS-7, L", "GS-6, W", "GS-3, L", "GS-2", "GS-2", "GS-6, L", "GS-5, L",
"GS-6, L", "GS-6, W", "GS-4", "GS-4, L", "GS-2", "GS-7, W", "GS-5, W",
"GS-6", "GS-2", "GS-3", "GS-4", "GS-6, L", "GS-5", "GS-4", "GS-4",
"GS-4"), IP = c(1.1, 5, 3, 7, 6, 6.1, 7, 7, 6, 2.2, 1.2, 2, 6,
5, 5.1, 6, 4, 4, 2, 7, 5, 6, 1.2, 3, 4, 6, 5, 3.2, 4, 4), H = c(5,
5, 8, 4, 6, 4, 4, 3, 3, 8, 8, 4, 6, 3, 7, 3, 10, 7, 3, 3, 4,
3, 4, 4, 9, 6, 4, 4, 4, 4), R = c(6, 3, 5, 1, 4, 2, 2, 1, 1,
6, 6, 3, 4, 2, 3, 1, 5, 5, 5, 1, 1, 2, 4, 2, 5, 4, 2, 1, 1, 3
), ER = c(6, 3, 3, 1, 4, 1, 2, 1, 1, 6, 6, 3, 4, 2, 3, 1, 5,
4, 5, 1, 1, 2, 4, 2, 5, 1, 2, 1, 1, 3), BB = c(2, 2, 1, 1, 0,
1, 2, 2, 4, 3, 0, 1, 2, 2, 1, 2, 0, 0, 4, 2, 2, 2, 4, 1, 0, 2,
2, 2, 0, 3), SO = c(1, 10, 3, 4, 4, 7, 14, 10, 10, 5, 3, 1, 5,
2, 5, 9, 3, 3, 3, 12, 8, 6, 0, 2, 2, 9, 2, 7, 4, 3), HR = c(0,
2, 1, 0, 2, 1, 1, 1, 1, 2, 4, 0, 1, 1, 0, 0, 0, 2, 1, 1, 1, 0,
0, 0, 1, 1, 0, 1, 1, 0), UER = c(0, 0, 2, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0),
Pit = c(38, 113, 67, 107, 66, 95, 112, 100, 102, 76, 52,
40, 94, 81, 91, 102, 66, 71, 70, 108, 81, 100, 52, 69, 84,
103, 86, 60, 57, 70), Str = c(24, 78, 42, 68, 45, 66, 70,
70, 62, 45, 30, 25, 66, 52, 60, 68, 45, 49, 37, 74, 50, 65,
22, 41, 53, 72, 55, 39, 33, 37), GSc = c(19, 53, 29, 68,
48, 65, 73, 75, 68, 20, 18, 36, 47, 53, 46, 69, 25, 33, 29,
77, 61, 62, 27, 44, 26, 57, 51, 54, 54, 42), BF = c(12, 22,
19, 26, 23, 24, 26, 26, 24, 18, 14, 11, 26, 20, 24, 23, 21,
20, 14, 26, 21, 23, 13, 15, 21, 27, 20, 16, 15, 18), AB = c(8,
20, 18, 24, 23, 23, 23, 23, 20, 15, 13, 9, 24, 18, 22, 21,
21, 20, 9, 24, 19, 21, 8, 13, 20, 25, 18, 14, 15, 15), H2B = c(2,
0, 1, 1, 1, 0, 2, 0, 2, 2, 1, 2, 1, 0, 2, 1, 1, 1, 1, 1,
0, 0, 1, 0, 2, 2, 2, 0, 1, 0), H3B = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 1, 0), IBB = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
HBP = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), SH = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0), SF = c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0), GDP = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1), SB = c(0, 1,
1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 2, 0,
1, 0, 0, 0, 3, 0, 0, 0, 0), CS = c(0, 0, 0, 0, 1, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0), PO = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), BK = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), WP = c(0, 1, 1, 1, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 1, 0, 0), ERA = c("40.5", "5.4", "9", "1.29", "6", "1.42",
"2.57", "1.29", "1.5", "20.25", "32.4", "13.5", "6", "3.6",
"5.0599999999999996", "1.5", "11.25", "9", "22.5", "1.29",
"1.8", "3", "21.6", "6", "11.25", "1.5", "3.6", "2.4500000000000002",
"2.25", "6.75"), WPA = c(-0.471, -0.087, -0.256, 0.34, -0.22,
0.18, 0.107, 0.219, 0.229, -0.358, -0.487, -0.186, -0.156,
0.036, -0.047, 0.049, -0.329, -0.321, -0.34, 0.193, 0.156,
0.07, -0.312, -0.042, -0.278, -0.271, 0.029, 0.02, 0.092,
-0.174), RE24 = c(-5.122, -0.193, -3.316, 2.931, -1.08, 1.509,
1.406, 2.406, 1.92, -4.641, -5.444, -1.919, -0.758, 0.679,
0.245, 2.215, -3.054, -3.054, -4.027, 2.406, 1.433, 0.92,
-3.788, -0.359, -2.812, -1.08, 0.707, 0.364, 1.166, -0.834
), aLI = c(1.45, 1.244, 0.974, 1.271, 0.965, 0.921, 0.955,
0.888, 1.066, 0.962, 0.767, 1.073, 0.941, 0.852, 1.353, 0.392,
0.857, 0.805, 0.904, 0.75, 1.037, 0.861, 1.232, 1.355, 0.914,
1.239, 1.213, 1.28, 0.748, 1.407)), row.names = c(NA, -30L
), class = c("tbl_df", "tbl", "data.frame"))
Desired output:
The numbers starting in the second column are the total absV values for each player for each column. The last column contains the sum of all the absV values for each player where absV > 5. Only a sample of the first 3 rows are shown, and the absV values are just filler numbers.
| Player | 1 | 2 | 3 | 4 | 5 | >5 |
| deGrom | 2 | 3 | 5 | 0 | 1 | 3 |
| Matz | 2 | 3 | 5 | 0 | 1 | 3 |
Code tried (I need help getting beyond the point shown). I would prefer if the code uses dplyr:
starter %>%
select(Player, absV) %>%
group_by(Player, absV) %>%
summarize(numG= n()) %>%
arrange(Player,absV)
To do this, you need to split your data into the first five rows per player and the remaining rows, summarise the remainder, rbind the two parts together, and then pivot_wider. Follow this code:
library(dplyr)
library(tidyr)
df <- starter %>% group_by(Player) %>%
mutate(row = row_number()) %>%
select(Player, absV, row) %>% arrange(Player)
df %>% filter(row <= 5) %>%
mutate(row = as.character(row)) %>%
rbind(df %>% filter(row > 5) %>%
summarise( absV = sum(absV)) %>%
mutate(row = ">5")) %>%
pivot_wider(id_cols = Player, names_from = row, values_from = absV)
# A tibble: 8 x 7
# Groups: Player [8]
Player `1` `2` `3` `4` `5` `>5`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ariel Jurado 4 NA NA NA NA NA
2 David Peterson 1 5 4 1 1 NA
3 Jacob deGrom 1 1 1 17 13 2
4 Michael Wacha 3 9 5 NA NA NA
5 Rick Porcello 2 7 3 1 2 1
6 Robert Gsellman 2 1 NA NA NA NA
7 Seth Lugo 10 3 4 1 4 3
8 Steven Matz 13 NA NA NA NA NA
Note: Loading the tidyverse package directly attaches dplyr and tidyr at once, which is advised.
Note 2: If you want to sort by absV before changing the data format, add absV to the arrange() call before the row numbers are computed:
df <- starter %>% group_by(Player) %>%
arrange(Player, absV) %>%
mutate(row = row_number()) %>%
select(Player, absV, row)
df %>% filter(row <= 5) %>%
mutate(row = as.character(row)) %>%
rbind(df %>% filter(row > 5) %>%
summarise( absV = sum(absV)) %>%
mutate(row = ">5")) %>%
pivot_wider(id_cols = Player, names_from = row, values_from = absV)
# this gives the following, different output
# A tibble: 8 x 7
# Groups: Player [8]
Player `1` `2` `3` `4` `5` `>5`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ariel Jurado 4 NA NA NA NA NA
2 David Peterson 1 1 1 4 5 NA
3 Jacob deGrom 1 1 1 2 13 17
4 Michael Wacha 3 5 9 NA NA NA
5 Rick Porcello 1 1 2 2 3 7
6 Robert Gsellman 1 2 NA NA NA NA
7 Seth Lugo 1 3 3 4 4 10
8 Steven Matz 13 NA NA NA NA NA
Additional Question in comments below
Follow this code to work out frequency of each absV
df %>% group_by(Player, absV) %>% mutate(freq = n()) %>% ungroup()
#check it
df %>% group_by(Player, absV) %>% mutate(freq = n()) %>% ungroup() %>% select(Player, absV, freq)
Player absV freq
<chr> <dbl> <int>
1 Seth Lugo 10 1
2 Jacob deGrom 1 3
3 Rick Porcello 2 2
4 David Peterson 1 3
5 Michael Wacha 3 1
6 Seth Lugo 3 2
7 Jacob deGrom 1 3
8 Rick Porcello 7 1
9 David Peterson 5 1
10 Steven Matz 13 1
# ... with 20 more rows
Using data.table
library(data.table)
dcast(setDT(starter), Player ~ rowid(Player), value.var = 'absV')
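The asker's attempted code groups by Player and absV and counts rows, which suggests another reading of the desired table: column headers are absV values and cells are numbers of games, with everything above 5 pooled into one column. The base R sketch below follows that reading only as an assumption; it uses the first ten Player/absV pairs from the dput data as a toy subset, and the name bucket is hypothetical.

```r
# Toy subset: first ten Player/absV pairs from the dput() data above
starter <- data.frame(
  Player = c("Seth Lugo", "Jacob deGrom", "Rick Porcello", "David Peterson",
             "Michael Wacha", "Seth Lugo", "Jacob deGrom", "Rick Porcello",
             "David Peterson", "Steven Matz"),
  absV = c(10, 1, 2, 1, 3, 3, 1, 7, 5, 13)
)

# Intervals (0,1], (1,2], ..., (4,5], (5,Inf] labelled 1..5 and ">5".
# An absV of 0 would fall outside these intervals and become NA.
bucket <- cut(starter$absV, breaks = c(0:5, Inf), labels = c(1:5, ">5"))

# Cross-tabulate: one row per player, one column per bucket, cells = games
games <- table(Player = starter$Player, absV = bucket)
games
```

If the last column should hold the sum of absV rather than a count, the cells would need to be computed with something like tapply on absV instead of table.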

Subset a numeric matrix with a numeric vector of column values

This is a simple problem but I have not found an explicit solution in the archives. Say I have a matrix m:
m <- structure(c(2, 0, 1, 1, 0, 2, 2, 2, 2, 1, 0, 2, 2, 2, 1, 2, 0,
1, 0, 1, 0, 2, 2, 0, 1, 2, 2, 1, 2, 0, 2, 0, 1, 0, 2, 1, 2, 1,
0, 1, 0, 2, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 2, 2, 1, 1, 1,
0, 2, 2, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 2, 0, 26, 18, 26, 18,
22, 21, 13, 22, 27, 20, 27, 24, 18, 21, 18, 22, 16, 22, 19, 15,
22, 27, 20, 20, 17), .Dim = c(25L, 4L), .Dimnames = list(NULL,
c("r", "s", "t", "u")))
I want to take the subset of the matrix whose rows have one of the following values in column u:
vec <- c(20, 21, 22, 24, 26)
In other words select the rows containing those values. Suggestions on how to do that or a link to the solution?
You could use which() together with %in%, but %in% alone is enough for indexing directly (many thanks and credit to @GKi):
#Code
newmat <- m[m[,'u'] %in% vec,]
Output:
r s t u
[1,] 2 2 0 26
[2,] 1 1 0 26
[3,] 0 0 2 22
[4,] 2 2 2 21
[5,] 2 1 1 22
[6,] 1 2 0 20
[7,] 2 2 2 24
[8,] 2 0 1 21
[9,] 2 0 1 22
[10,] 1 0 1 22
[11,] 0 1 0 22
[12,] 2 0 0 20
[13,] 0 0 2 20
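As a small illustration of the same idea, the sketch below builds a tiny matrix (hypothetical values, not the asker's data) and shows that logical indexing with %in% and integer indexing with which() select the same rows. Adding drop = FALSE is an extra precaution that keeps the result a matrix even when only one row matches.

```r
# Tiny stand-in matrix with the same column naming idea as m
small <- matrix(c(2, 0, 1, 1,
                  26, 18, 22, 21),
                ncol = 2, dimnames = list(NULL, c("r", "u")))
vec <- c(20, 21, 22, 24, 26)

keep <- small[, "u"] %in% vec          # logical vector, one entry per row
by_logical <- small[keep, , drop = FALSE]
by_which   <- small[which(keep), , drop = FALSE]

by_logical
```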

Is there an R function similar to foreach loops in Stata for creating new variables based on the name (or root) of existing variables?

I have a list of 60 variables (30 pairs, essentially), and I need to combine the information across all the pairs to create new variables based on the data stored in each pair.
To give some context, I am working on a systematic review of prediction model studies, and I extracted data on which variables were considered for inclusion in the prediction model of each study (the first 30 variables) and which variables were included in the model (the second 30 variables)
All variables are binary.
The first 30 variables are written in the form “p_[varname]”
The second 30 are written in the form “p_[varname]_inc”.
I want to create a new variable that is called [varname] and takes the values “Not considered”, “Considered”, and “Included”.
In Stata, I could easily do this like so:


foreach v of [varname1]-[varname30] {
gen `v' = "Not considered" if p_`v' == 0
replace `v' = "Considered" if p_`v' == 1 & p_`v'_inc == 0
replace `v' = "Included" if p_`v' == 1 & p_`v'_inc == 1
}
In R, the only way I can figure out to do it is by copy and pasting the same ifelse statement for all variables, for example:
predictor_vars %>%
mutate(age = ifelse(p_age==1 & p_age_inc==1, "Included",
ifelse(p_age==1 & p_age_inc==0, "Considered", "Not considered")),
sex = ifelse(p_sex==1 & p_sex_inc==1, "Included",
ifelse(p_sex==1 & p_sex_inc==0, "Considered", "Not considered")),
....
[varname] = ifelse(p_[varname]==1 & p_[varname]_inc==1, "Included",
ifelse(p_[varname]==1 & p_[varname]_inc==0, "Considered", "Not considered"))
)
Is there an easier way to do this in R / dplyr?
Edit: Sorry for not providing enough detail before (new here, but really appreciate the fast responses!). Here is a sample of the data
structure(list(p_age = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0), label = "Age", class = c("labelled",
"numeric")), p_age_inc = structure(c(1, 0, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0
), label = "Age", class = c("labelled", "numeric")), p_sex = structure(c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0), label = "Sex", class = c("labelled", "numeric"
)), p_sex_inc = structure(c(1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), label = "Sex", class = c("labelled",
"numeric")), p_nation = structure(c(0, 0, 0, 0, 1, 1, 0, 1, 0,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), label = "Nationality / country", class = c("labelled",
"numeric")), p_nation_inc = structure(c(0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0), label = "Nationality / country", class = c("labelled", "numeric"
)), p_prevtb = structure(c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), label = "Treatment regimen / treatment status (retreatment)", class = c("labelled",
"numeric")), p_prevtb_inc = structure(c(0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), label = "Previous TB / retreated TB", class = c("labelled",
"numeric"))), row.names = c(NA, 50L), class = "data.frame")
The first 6 rows (with 4 sets of selected predictors) look like this:
p_age p_age_inc p_sex p_sex_inc p_nation p_nation_inc p_prevtb
1 1 1 1 1 0 0 0
2 1 0 1 0 0 0 0
3 1 0 1 1 0 0 0
4 1 1 1 1 0 0 0
5 1 1 1 0 1 0 1
6 1 1 1 0 1 0 1
p_prevtb_inc
1 0
2 0
3 0
4 0
5 0
6 0
And I'd like to create the new variables like this:
p_age p_age_inc p_sex p_sex_inc p_nation p_nation_inc p_prevtb
1 1 1 1 1 0 0 0
2 1 0 1 0 0 0 0
3 1 0 1 1 0 0 0
4 1 1 1 1 0 0 0
5 1 1 1 0 1 0 1
6 1 1 1 0 1 0 1
p_prevtb_inc age sex nation prevtb
1 0 Included Included Not considered Not considered
2 0 Considered Considered Not considered Not considered
3 0 Considered Included Not considered Not considered
4 0 Included Included Not considered Not considered
5 0 Included Considered Considered Considered
6 0 Included Considered Considered Considered
This solution could be improved upon, but it works. The function creates the variables the question asks for in a standard for loop over the p_* variables (those without the _inc suffix) and then returns the result.
Setting the argument Bind = FALSE returns just the newly created variables.
library(dplyr)  # provides case_when() and %>%
create_var <- function(X, Bind = TRUE){
xnames <- names(X)
p_only <- grep('p_([^_]+$)', xnames, value = TRUE)
res <- vector('list', length = length(p_only))
for(i in seq_along(p_only)){
x <- X[[ p_only[i] ]]
y <- X[[paste0(p_only[i], '_inc')]]
res[[i]] <- case_when(
as.logical(x) & as.logical(y) ~ "Included",
as.logical(x) & !as.logical(y) ~ "Considered",
!as.logical(x) ~ "Not considered",
TRUE ~ "Not considered"
)
}
names(res) <- sub('^p_', '', p_only)
res <- do.call(cbind.data.frame, res)
if(Bind) cbind(X, res) else res
}
create_var(df1)
df1 %>% create_var()
df1 %>% create_var(Bind = FALSE)
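For readers coming from Stata, the loop can also be written in base R in a form that mirrors the foreach block line by line. This is only a sketch on a made-up four-column data frame; it assumes, like the regex in the answer above, that the variable roots contain no underscore, and the names dat and roots are illustrative.

```r
# Made-up data with two predictor pairs
dat <- data.frame(
  p_age = c(1, 1, 0), p_age_inc = c(1, 0, 0),
  p_sex = c(1, 0, 1), p_sex_inc = c(0, 0, 1)
)

# Roots: names of the form p_<root>, excluding the p_<root>_inc columns
roots <- sub("^p_", "", grep("^p_[^_]+$", names(dat), value = TRUE))

for (v in roots) {
  considered <- dat[[paste0("p_", v)]] == 1
  included   <- dat[[paste0("p_", v, "_inc")]] == 1
  # New column named after the root, like `gen`/`replace` in Stata
  dat[[v]] <- ifelse(considered & included, "Included",
              ifelse(considered, "Considered", "Not considered"))
}
dat
```

Indexing with dat[[name]] plays the role of the Stata macro substitution, since the column name is built as a string inside the loop.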

Code multiple levels as 2 factor labels

I have a data frame with some columns that I want to transform into factors. The different levels are coded as -2, -1, 0, 1, 2, 3, 4, and I want the levels to be labeled as 0 or 1 following this convention:
-2 = 1
-1 = 1
0 = 0
1 = 1
2 = 1
3 = 1
4 = 0
I have the following code:
#Convert to factor
dat[idx] <- lapply(dat[idx], factor, levels = -2:4, labels = c(1, 1, 0, 1, 1, 1, 0))
#Drop unused factor levels
dat <- droplevels(dat)
This works, but it gives me the following warning:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
I tried the following code (per Ananda Mahto's suggestion) but no luck:
levels(dat[idx]) <- list(`0` = c(0, 4), `1` = c(-2, -1, 1, 2, 3))
I figured there has to be a better way to do this, any suggestions?
My data looks like this:
structure(list(Timestamp = structure(c(1380945601, 1380945603,
1380945605, 1380945607, 1380945609, 1380945611, 1380945613, 1380945615,
1380945617, 1380945619), class = c("POSIXct", "POSIXt"), tzone = ""),
FCB2C01 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), RCB2C01 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), FCB2C02 = c(1, 1, 1, 1, 1, 1,
1, 1, 1, 1), RCB2C02 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), FCB2C03 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), RCB2C03 = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), FCB2C04 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), RCB2C04 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), FCB2C05 = c(1, 1, 1, 1, 1, 1,
1, 1, 1, 1), RCB2C05 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), FCB2C06 = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1), RCB2C06 = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), FCB2C07 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), RCB2C07 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), FCB2C08 = c(1, 1, 1, 1, 1, 1,
1, 1, 1, 1), RCB2C08 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), FCB2C09 = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1), RCB2C09 = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), FCB2C10 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), RCB2C10 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Timestamp", "FCB2C01",
"RCB2C01", "FCB2C02", "RCB2C02", "FCB2C03", "RCB2C03", "FCB2C04",
"RCB2C04", "FCB2C05", "RCB2C05", "FCB2C06", "RCB2C06", "FCB2C07",
"RCB2C07", "FCB2C08", "RCB2C08", "FCB2C09", "RCB2C09", "FCB2C10",
"RCB2C10"), row.names = c(NA, 10L), class = "data.frame")
And the column index:
idx <- seq(2,21,2)
If I correctly understand what you want to do, the "right" way would be to use the levels function to specify your levels. Compare the following:
set.seed(1)
x <- sample(-2:4, 10, replace = TRUE)
YourApproach <- factor(x, levels = -2:4, labels = c(1, 1, 0, 1, 1, 1, 0))
# Warning message:
# In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
# duplicated levels in factors are deprecated
YourApproach
# [1] 1 0 1 0 1 0 0 1 1 1
# Levels: 1 1 0 1 1 1 0
xFac <- factor(x, levels = -2:4)
levels(xFac) <- list(`0` = c(0, 4), `1` = c(-2, -1, 1, 2, 3))
xFac
# [1] 1 0 1 0 1 0 0 1 1 1
# Levels: 0 1
Note the difference in the "Levels" in each of those. This also means that the underlying numeric representation is going to be different:
> as.numeric(YourApproach)
[1] 2 3 5 7 2 7 7 5 5 1
> as.numeric(xFac)
[1] 2 1 2 1 2 1 1 2 2 2
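The attempt in the question assigned levels to dat[idx], which is a data frame rather than a single factor, and that is presumably why it had no effect. A sketch of applying the same recoding column by column follows; the toy data frame and the helper name recode01 are hypothetical.

```r
# Toy data: column 1 is an id, columns 2 and 3 hold codes from -2 to 4
dat <- data.frame(id = 1:4, a = c(-2, 0, 4, 3), b = c(1, 4, 0, -1))
idx <- 2:3

# Hypothetical helper: build the factor first, then merge its levels
recode01 <- function(x) {
  f <- factor(x, levels = -2:4)
  levels(f) <- list(`0` = c(0, 4), `1` = c(-2, -1, 1, 2, 3))
  f
}

# Apply to each selected column, not to the data frame as a whole
dat[idx] <- lapply(dat[idx], recode01)
str(dat)
```

Because the merged factor has only the levels 0 and 1, no duplicated-levels warning arises and a later droplevels call is only needed if one of the two levels never occurs.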
