How to get proportion of specific value across columns? - r

I have a sample dataframe as below:
self race1 race2 race3 race4
1 1 2 2 1
2 1 1 1 1
3 1 3 1 1
4 2 1 3 1
I would like to get the proportion of 1s in the race columns as a new column. So for each row, I would count the number of 1 and divide it by 4. The desired output dataframe would look like below.
self race1 race2 race3 race4 prop_race_as1
1 1 2 2 1 2/4
2 1 1 1 1 4/4
3 1 3 1 1 3/4
4 2 1 3 1 2/4
How do I write a function that incorporates rowwise() to get the desired output?

Assuming your data is in df, you can get ratios as
ratios <- apply(data.matrix(df)[,-1], 1, function(x) length(which(x == 1)) / (ncol(df)-1))
then cbind(df, ratios).

Please find below two possibilities.
Reprex
1. With dplyr (and rowwise())
Code
library(dplyr)
df %>%
dplyr::rowwise() %>%
dplyr::mutate(prop_race_as1 = sum(c_across(starts_with("race")) == 1) / 4)
Output
#> # A tibble: 4 x 6
#> # Rowwise:
#> self race1 race2 race3 race4 prop_race_as1
#> <int> <int> <int> <int> <int> <dbl>
#> 1 1 1 2 2 1 0.5
#> 2 2 1 1 1 1 1
#> 3 3 1 3 1 1 0.75
#> 4 4 2 1 3 1 0.5
2. Using only base R
Code
df$prop_race_as1 <- rowSums(df[startsWith(names(df), "race")] == 1) / 4
Output
df
#> self race1 race2 race3 race4 prop_race_as1
#> 1 1 1 2 2 1 0.50
#> 2 2 1 1 1 1 1.00
#> 3 3 1 3 1 1 0.75
#> 4 4 2 1 3 1 0.50
Data
df <- structure(list(self = 1:4, race1 = c(1L, 1L, 1L, 2L), race2 = c(2L,
1L, 3L, 1L), race3 = c(2L, 1L, 1L, 3L), race4 = c(1L, 1L, 1L,
1L)), class = "data.frame", row.names = c(NA, -4L))
Created on 2022-02-16 by the reprex package (v2.0.1)
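Because the mean of a logical vector is the share of TRUE values, the count-then-divide step can also be written with rowMeans(), which avoids hard-coding the 4. This is only a minimal sketch, assuming the same df as in the Data block above; race_cols is a name introduced here for illustration.

```r
# Same data as in the Data block above
df <- data.frame(
  self  = 1:4,
  race1 = c(1L, 1L, 1L, 2L),
  race2 = c(2L, 1L, 3L, 1L),
  race3 = c(2L, 1L, 1L, 3L),
  race4 = c(1L, 1L, 1L, 1L)
)

# Logical selector for the race columns (race_cols is a hypothetical name)
race_cols <- startsWith(names(df), "race")

# df[race_cols] == 1 is a logical matrix; rowMeans() gives the share of TRUE per row
df$prop_race_as1 <- rowMeans(df[race_cols] == 1)
df$prop_race_as1
```

If more race columns are added later, the denominator adjusts automatically because it is the number of selected columns.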

Related

How to add new rows conditionally on R

I have a df with
v1 t1 c1 o1
1 1 9 1
1 1 12 2
1 2 2 1
1 2 7 2
2 1 3 1
2 1 6 2
2 2 3 1
2 2 12 2
And I would like to add 2 rows each time that v1 changes its value, in order to get this:
v1 t1 c1 o1
1 1 1 1
1 1 1 2
1 2 9 1
1 2 12 2
1 3 2 1
1 3 7 2
2 1 1 1
2 1 1 2
2 2 3 1
2 2 6 2
2 3 3 1
2 3 12 2
So what I'm doing is that every time v1 changes its value I'm adding 2 rows of ones and adding a 1 to the values of t1. This is kind of tricky. I've been able to do it in Excel but I would like to scale to big files in R.
We may do the expansion in group_modify
library(dplyr)
df1 %>%
group_by(v1) %>%
group_modify(~ .x %>%
slice_head(n = 2) %>%
mutate(across(-o1, ~ 1)) %>%
bind_rows(.x) %>%
mutate(t1 = as.integer(gl(n(), 2, n())))) %>%
ungroup
-output
# A tibble: 12 × 4
v1 t1 c1 o1
<int> <int> <dbl> <int>
1 1 1 1 1
2 1 1 1 2
3 1 2 9 1
4 1 2 12 2
5 1 3 2 1
6 1 3 7 2
7 2 1 1 1
8 2 1 1 2
9 2 2 3 1
10 2 2 6 2
11 2 3 3 1
12 2 3 12 2
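Or do a grouped summarise (from dplyr 1.1.0 on, returning several rows per group from summarise() is deprecated in favour of reframe(), which always returns an ungrouped result)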
df1 %>%
group_by(v1) %>%
summarise(t1 = as.integer(gl(n() + 2, 2, n() + 2)),
c1 = c(1, 1, c1), o1 = rep(1:2, length.out = n() + 2),
.groups = 'drop')
-output
# A tibble: 12 × 4
v1 t1 c1 o1
<int> <int> <dbl> <int>
1 1 1 1 1
2 1 1 1 2
3 1 2 9 1
4 1 2 12 2
5 1 3 2 1
6 1 3 7 2
7 2 1 1 1
8 2 1 1 2
9 2 2 3 1
10 2 2 6 2
11 2 3 3 1
12 2 3 12 2
data
df1 <- structure(list(v1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), t1 = c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), c1 = c(9L, 12L, 2L, 7L, 3L, 6L,
3L, 12L), o1 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
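For comparison, the same expansion can be sketched in base R with split() and rbind(). This is an illustration rather than the answer's method: expand_group is a hypothetical helper name, and the sketch assumes every v1 group has an even number of rows, so that t1 can be renumbered in pairs.

```r
# Same data as in the data block above
df1 <- data.frame(
  v1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  t1 = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
  c1 = c(9L, 12L, 2L, 7L, 3L, 6L, 3L, 12L),
  o1 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)
)

# For one v1 group: prepend two rows of ones (o1 = 1, 2), then renumber t1 in pairs
expand_group <- function(g) {
  header <- data.frame(v1 = g$v1[1], t1 = 1L, c1 = 1L, o1 = 1:2)
  out <- rbind(header, g)
  out$t1 <- rep(seq_len(nrow(out) / 2), each = 2)
  out
}

# Apply the helper per group and stack the pieces back together
res <- do.call(rbind, lapply(split(df1, df1$v1), expand_group))
rownames(res) <- NULL
res
```

For large files the per-group rbind() may be slow; the dplyr versions above are likely the more practical choice there.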

Define an indicator when the number of duplicate rows minus 1 equals the value of one of the columns

I have some rows that are duplicates of each other in some columns. I want to define an indicator that is 1 if the number of duplicate rows minus 1 equals the value in one of the columns.
example
SAMPN PERNO ARR_HR HHMEM
1 1 2 1
1 2 2 1
2 1 3 2
2 3 3 2
3 1 4 2
3 2 4 2
3 3 4 2
Rows are duplicates if they have the same values in SAMPN, ARR_HR and HHMEM. I want the indicator to be 1 if the number of duplicate rows minus 1 equals HHMEM.
For example, the first 2 rows are duplicates, so 2-1=1=HHMEM and the indicator is 1.
Output
SAMPN PERNO ARR_HR HHMEM indicator
1 1 2 1 1
1 2 2 1 1
2 1 3 2 0
2 3 3 2 0
3 1 4 2 1
3 2 4 2 1
3 3 4 2 1
After grouping by 'SAMPN' and other grouping variables (from OP's comments) create the 'indicator' by coercing the logical vector ((n()- 1) == HHMEM) into binary with as.integer
library(dplyr)
df1 %>%
group_by(SAMPN, ARR_HR, HHMEM) %>%
mutate(indicator = as.integer((n()-1) == HHMEM))
# A tibble: 7 x 5
# Groups: SAMPN, ARR_HR, HHMEM [3]
# SAMPN PERNO ARR_HR HHMEM indicator
# <int> <int> <int> <int> <int>
#1 1 1 2 1 1
#2 1 2 2 1 1
#3 2 1 3 2 0
#4 2 3 3 2 0
#5 3 1 4 2 1
#6 3 2 4 2 1
#7 3 3 4 2 1
NOTE: We don't need to create any additional column and then remove it later
Or the same logic in base R with ave
df1$indicator <- +(with(df1, HHMEM == ave(HHMEM, HHMEM, SAMPN,
ARR_HR, FUN = length)-1))
Or using duplicated with table
i1 <- table(cumsum(!duplicated(df1[c(1, 3, 4)])))
as.integer(rep(i1, i1) - 1 == df1$HHMEM)
data
df1 <- structure(list(SAMPN = c(1L, 1L, 2L, 2L, 3L, 3L, 3L), PERNO = c(1L,
2L, 1L, 3L, 1L, 2L, 3L), ARR_HR = c(2L, 2L, 3L, 3L, 4L, 4L, 4L
), HHMEM = c(1L, 1L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA,
-7L))
We can use add_count to get count and compare it with HHMEM.
library(dplyr)
df1 %>%
add_count(SAMPN, ARR_HR, HHMEM) %>%
mutate(indicator = as.integer(n - 1 == HHMEM)) %>%
select(-n)
# SAMPN PERNO ARR_HR HHMEM indicator
# <int> <int> <int> <int> <int>
#1 1 1 2 1 1
#2 1 2 2 1 1
#3 2 1 3 2 0
#4 2 3 3 2 0
#5 3 1 4 2 1
#6 3 2 4 2 1
#7 3 3 4 2 1
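To make the "group size minus 1" logic easier to inspect, the count can be stored in its own vector before comparing it with HHMEM. A minimal base R sketch, assuming the df1 from the data block above; group_size is a name introduced here.

```r
# Same data as in the data block above
df1 <- data.frame(
  SAMPN  = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  PERNO  = c(1L, 2L, 1L, 3L, 1L, 2L, 3L),
  ARR_HR = c(2L, 2L, 3L, 3L, 4L, 4L, 4L),
  HHMEM  = c(1L, 1L, 2L, 2L, 2L, 2L, 2L)
)

# Size of the (SAMPN, ARR_HR, HHMEM) group that each row belongs to
group_size <- ave(seq_len(nrow(df1)), df1$SAMPN, df1$ARR_HR, df1$HHMEM,
                  FUN = length)

# 1 when the group size minus 1 equals HHMEM, otherwise 0
df1$indicator <- as.integer(group_size - 1 == df1$HHMEM)
df1
```

Printing group_size next to HHMEM is a quick way to see why the SAMPN == 2 rows get 0: the group has 2 rows, so 2 - 1 = 1, which differs from HHMEM = 2.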

How to add values from different columns according to another column

suppose I have
SAMPN PERNO loop car bus walk mode
1 1 1 3.4 2.5 1.5 1
1 1 1 3 2 1 2
1 1 1 4 2 5 3
1 1 2 14 1 3 1
1 1 2 5 8 2 1
2 1 1 1 5 5 3
2 1 1 9 4 3 3
The mode column corresponds to car, bus and walk.
mode==1 walk
mode==2 car
mode==3 bus
SAMPN is the family index, PERNO the member within the family, and loop the tour of each person. I want to add up the value of the chosen mode for each person in each family in each loop.
For example, in the first family (SAMPN==1), the first person (PERNO==1) has 3 rows for the first trip (loop==1). In this tour the mode of the first row is walk (mode==1), the mode of the second row is car (mode==2), and the mode of the third row is bus (mode==3),
so I add the walk of the first row, the car of the second and the bus of the third: 3.4+2+5=10.4. The same goes for the others.
Output:
SAMPN PERNO loop car bus walk mode utility
1 1 1 3.4 2.5 1.5 1 10.4
1 1 1 3 2 1 2 10.4
1 1 1 4 2 5 3 10.4
1 1 2 14 1 3 1 19
1 1 2 5 8 2 1 19
2 1 1 1 5 5 3 8
2 1 1 9 4 3 3 8
library(dplyr)
df <- as_tibble(df1) # df1 is defined under "data" below
df %>%
mutate(utility = case_when(mode == 1 ~ car, # using the order in the example,
mode == 2 ~ bus, # not the order in the table
mode == 3 ~ walk,
TRUE ~ 0)) %>%
count(SAMPN, PERNO, loop, wt = utility, name = "utility")
## A tibble: 3 x 4
# SAMPN PERNO loop utility
# <int> <int> <int> <dbl>
#1 1 1 1 10.4
#2 1 1 2 19
#3 2 1 1 8
Or, to get the exact output:
df %>%
mutate(utility= case_when(mode == 1 ~ car,
mode == 2 ~ bus,
mode == 3 ~ walk,
TRUE ~ 0)) %>%
group_by(SAMPN, PERNO, loop) %>%
mutate(utility = sum(utility))
## A tibble: 7 x 8
## Groups: SAMPN, PERNO, loop [3]
# SAMPN PERNO loop car bus walk mode utility
# <int> <int> <int> <dbl> <dbl> <dbl> <int> <dbl>
#1 1 1 1 3.4 2.5 1.5 1 10.4
#2 1 1 1 3 2 1 2 10.4
#3 1 1 1 4 2 5 3 10.4
#4 1 1 2 14 1 3 1 19
#5 1 1 2 5 8 2 1 19
#6 2 1 1 1 5 5 3 8
#7 2 1 1 9 4 3 3 8
Here is an option using base R. Create a column index by matching 'mode' against a named vector of column names ('nm1'), then cbind it with the row index, extract the corresponding elements from the dataset, and use ave to get the sum grouped by 'SAMPN', 'PERNO' and 'loop', assigning it to 'utility'
nm1 <- setNames(names(df1)[4:6], 1:3)[as.character(df1$mode)]
i1 <- cbind(seq_len(nrow(df1)), match(nm1, names(df1)))
df1$utility <- ave(df1[i1], df1$SAMPN, df1$PERNO, df1$loop, FUN = sum)
df1$utility
#[1] 10.4 10.4 10.4 19.0 19.0 8.0 8.0
data
df1 <- structure(list(SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), PERNO = c(1L,
1L, 1L, 1L, 1L, 1L, 1L), loop = c(1L, 1L, 1L, 2L, 2L, 1L, 1L),
car = c(3.4, 3, 4, 14, 5, 1, 9), bus = c(2.5, 2, 2, 1, 8,
5, 4), walk = c(1.5, 1, 5, 3, 2, 5, 3), mode = c(1L, 2L,
3L, 1L, 1L, 3L, 3L)), class = "data.frame", row.names = c(NA,
-7L))
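The key step in the base R option is indexing with a two-column (row, column) matrix, which pulls exactly one cell per row. The sketch below isolates that step. It assumes the mapping used in the worked example (mode 1 -> car, 2 -> bus, 3 -> walk); mode_cols, m and picked are names introduced here.

```r
# Same data as in the data block above
df1 <- data.frame(
  SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L),
  PERNO = c(1L, 1L, 1L, 1L, 1L, 1L, 1L),
  loop  = c(1L, 1L, 1L, 2L, 2L, 1L, 1L),
  car   = c(3.4, 3, 4, 14, 5, 1, 9),
  bus   = c(2.5, 2, 2, 1, 8, 5, 4),
  walk  = c(1.5, 1, 5, 3, 2, 5, 3),
  mode  = c(1L, 2L, 3L, 1L, 1L, 3L, 3L)
)

# Assumed mapping, following the worked example: mode 1 -> car, 2 -> bus, 3 -> walk
mode_cols <- c("car", "bus", "walk")
m <- as.matrix(df1[mode_cols])

# Each row of the index matrix is (row number, column number): one cell per row
picked <- m[cbind(seq_len(nrow(m)), df1$mode)]
picked

# Sum the picked values within each SAMPN/PERNO/loop group
df1$utility <- ave(picked, df1$SAMPN, df1$PERNO, df1$loop, FUN = sum)
df1$utility
```

If the intended mapping is instead the one listed in the question (1 = walk, 2 = car, 3 = bus), only the order of mode_cols would need to change.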

How to refill a column with the help of 2 other columns?

I have data with 3 grouping columns: SAMPN, PERNO, loop.
There are 2 columns, mode1 and mode2, and a column called int.
SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 NA
1 1 1 2 1 NA
1 1 1 3 2 0
1 2 1 3 2 NA
1 2 1 1 1 2
2 2 1 3 2 NA
2 2 1 1 3 NA
2 2 1 3 1 0
2 2 2 1 2 NA
2 2 2 3 1 2
SAMPN is the family index, PERNO is the index of persons in each family, and loop is the tour of each person. The last row of each loop for each person is 0 or 2 and the rest of the loop is NA. For each family, person and loop, I want to copy column mode1 into int if the last row of the loop is 0, and copy mode2 if the last row of the loop is 2.
output
SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 1
1 1 1 2 1 2
1 1 1 3 2 3
1 2 1 3 2 2
1 2 1 1 1 1
2 2 1 3 2 3
2 2 1 1 3 1
2 2 1 3 1 3
2 2 2 1 2 2
2 2 2 3 1 1
The first 3 rows are the loop of the first person in the first family; I filled that loop with mode1 because the third row was 0, and so on.
Here's a way using dplyr
df <- read.table(header = TRUE, text = "SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 NA
1 1 1 2 1 NA
1 1 1 3 2 0
1 2 1 3 2 NA
1 2 1 1 1 2
2 2 1 3 2 NA
2 2 1 1 3 NA
2 2 1 3 1 0
2 2 2 1 2 NA
2 2 2 3 1 2")
library(dplyr)
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = if(last(int) == 0) mode1 else mode2) %>%
ungroup()
#> # A tibble: 10 x 6
#> SAMPN PERNO loop mode1 mode2 int
#> <int> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 2 1
#> 2 1 1 1 2 1 2
#> 3 1 1 1 3 2 3
#> 4 1 2 1 3 2 2
#> 5 1 2 1 1 1 1
#> 6 2 2 1 3 2 3
#> 7 2 2 1 1 3 1
#> 8 2 2 1 3 1 3
#> 9 2 2 2 1 2 2
#> 10 2 2 2 3 1 1
If you have more values than 0 or 2, switch could be a good alternative :
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = switch(
as.character(last(int)),
`0` = mode1,
`2` = mode2)) %>%
ungroup()
# same output!
We can also use case_when
library(dplyr)
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = case_when(rep(last(int) == 0, n()) ~ mode1, TRUE ~mode2))
# A tibble: 10 x 6
# Groups: loop, SAMPN, PERNO [4]
# SAMPN PERNO loop mode1 mode2 int
# <int> <int> <int> <int> <int> <int>
# 1 1 1 1 1 2 1
# 2 1 1 1 2 1 2
# 3 1 1 1 3 2 3
# 4 1 2 1 3 2 2
# 5 1 2 1 1 1 1
# 6 2 2 1 3 2 3
# 7 2 2 1 1 3 1
# 8 2 2 1 3 1 3
#9 2 2 2 1 2 2
#10 2 2 2 3 1 1
data
df <- structure(list(SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), PERNO = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), loop = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), mode1 = c(1L, 2L, 3L, 3L,
1L, 3L, 1L, 3L, 1L, 3L), mode2 = c(2L, 1L, 2L, 2L, 1L, 2L, 3L,
1L, 2L, 1L), int = c(NA, NA, 0L, NA, 2L, NA, NA, 0L, NA, 2L)),
class = "data.frame", row.names = c(NA,
-10L))
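The same "broadcast the last value, then choose a column" idea can be sketched in base R. This is only an illustration, assuming the df from the data block above; g and flag are names introduced here, and the sketch assumes the last int of every group is either 0 or 2.

```r
# Same data as in the data block above
df <- data.frame(
  SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  PERNO = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  loop  = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L),
  mode1 = c(1L, 2L, 3L, 3L, 1L, 3L, 1L, 3L, 1L, 3L),
  mode2 = c(2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 1L),
  int   = c(NA, NA, 0L, NA, 2L, NA, NA, 0L, NA, 2L)
)

# One grouping factor per SAMPN/PERNO/loop combination that actually occurs
g <- interaction(df$SAMPN, df$PERNO, df$loop, drop = TRUE)

# Broadcast the last int of each group to all rows of that group
flag <- ave(df$int, g, FUN = function(x) x[length(x)])

# 0 -> take mode1, otherwise (assumed to be 2) -> take mode2
df$int <- ifelse(flag == 0, df$mode1, df$mode2)
df$int
```

Keeping flag as a separate vector makes it easy to check which rule each row received before int is overwritten.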

how to remove observations in R dependent on a specific condition

I am trying to drop observations in R from my dataset. I need each Person_ID to have wave 0 AND (wave 1 OR wave 3 OR wave 6 OR wave 12 OR wave 18). Can someone help me?
Initial dataset
Person_ID wave
1 0
1 1
1 3
1 6
1 12
1 18
2 0
3 0
3 1
4 6
4 12
Wanted result
Person_ID wave
1 0
1 1
1 3
1 6
1 12
1 18
3 0
3 1
Thanks!
You can do a grouped filter. We keep a person if both 0 and any of 1, 3, 6, 12, 18 are in their corresponding wave values.
library(tidyverse)
# read_table2() is deprecated since readr 2.0.0; read_table() is its replacement there
tbl <- read_table2(
"Person_ID wave
1 0
1 1
1 3
1 6
1 12
1 18
2 0
3 0
3 1
4 6
4 12"
)
tbl %>%
group_by(Person_ID) %>%
filter(0 %in% wave, any(c(1, 3, 6, 12, 18) %in% wave))
#> # A tibble: 8 x 2
#> # Groups: Person_ID [2]
#> Person_ID wave
#> <dbl> <dbl>
#> 1 1 0
#> 2 1 1
#> 3 1 3
#> 4 1 6
#> 5 1 12
#> 6 1 18
#> 7 3 0
#> 8 3 1
Created on 2019-03-25 by the reprex package (v0.2.1)
We can also do this in base R
df1[with(df1, Person_ID %in% intersect(Person_ID[wave %in% c(1, 3, 6, 12, 18)],
Person_ID[!wave])),]
# Person_ID wave
#1 1 0
#2 1 1
#3 1 3
#4 1 6
#5 1 12
#6 1 18
#8 3 0
#9 3 1
data
df1 <- structure(list(Person_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 4L, 4L), wave = c(0L, 1L, 3L, 6L, 12L, 18L, 0L, 0L, 1L, 6L,
12L)), class = "data.frame", row.names = c(NA, -11L))
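The two conditions can also be computed as separate per-person flags, which makes it easier to see why a given Person_ID is dropped. A minimal base R sketch, assuming the df1 above; followups, has_baseline, has_followup and kept are names introduced here.

```r
# Same data as in the data block above
df1 <- data.frame(
  Person_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L),
  wave      = c(0L, 1L, 3L, 6L, 12L, 18L, 0L, 0L, 1L, 6L, 12L)
)

followups <- c(1, 3, 6, 12, 18)

# Per person: is there a wave 0, and is there at least one follow-up wave?
has_baseline <- ave(df1$wave == 0, df1$Person_ID, FUN = any)
has_followup <- ave(df1$wave %in% followups, df1$Person_ID, FUN = any)

# Keep only rows of persons satisfying both conditions
kept <- df1[has_baseline & has_followup, ]
kept
```

Here person 2 fails has_followup and person 4 fails has_baseline, so only persons 1 and 3 remain.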
