How to add 2 columns with respect to 2 groups? - r

I want to find the total utility of each person in each household. SAMPN is the household index and PERNO is the person index.
There are 2 utilities for each person, utility1 and utility2. For each person, I want to add that person's utility1 to the utility2 of the other persons in the same household.
SAMPN PERNO utility1 utility2
1 1 3 4
1 2 4 5
1 3 6 8
2 1 1 2
2 2 2 3
output
SAMPN PERNO utility1 utility2 HH-utility
1 1 3 4 3+5+8=16
1 2 4 5 4+4+8=16
1 3 6 8 6+4+5=15
2 1 1 2 1+3=4
2 2 2 3 2+2=4

One option, after grouping by 'SAMPN', is to take the sum of 'utility2', subtract each row's own 'utility2' from it to get the sum over the other household members, and add 'utility1' to the result
library(dplyr)
df1 %>%
group_by(SAMPN) %>%
mutate(HHutility = sum(utility2) - utility2 + utility1)
# A tibble: 5 x 5
# Groups: SAMPN [2]
# SAMPN PERNO utility1 utility2 HHutility
# <int> <int> <int> <int> <int>
#1 1 1 3 4 16
#2 1 2 4 5 16
#3 1 3 6 8 15
#4 2 1 1 2 4
#5 2 2 2 3 4
Or with base R
transform(df1, HHutility = utility1 + ave(utility2, SAMPN, FUN = sum) - utility2)
data
df1 <- structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L), PERNO = c(1L, 2L,
3L, 1L, 2L), utility1 = c(3L, 4L, 6L, 1L, 2L), utility2 = c(4L,
5L, 8L, 2L, 3L)), class = "data.frame", row.names = c(NA, -5L
))
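As a quick sanity check of the "group total minus own value" shortcut used above, here is a minimal base R sketch that compares it with an explicit sum over the other household members. The names `shortcut` and `explicit` are made up for this illustration; the data is the same as `df1` above.

```r
# Same values as df1 in the answer above
df1 <- data.frame(
  SAMPN    = c(1L, 1L, 1L, 2L, 2L),
  PERNO    = c(1L, 2L, 3L, 1L, 2L),
  utility1 = c(3L, 4L, 6L, 1L, 2L),
  utility2 = c(4L, 5L, 8L, 2L, 3L)
)

# Shortcut: household total of utility2 minus the person's own utility2
shortcut <- with(df1, utility1 + ave(utility2, SAMPN, FUN = sum) - utility2)

# Explicit version: for each row, sum utility2 over the *other* members
# of the same household, then add the row's own utility1
explicit <- sapply(seq_len(nrow(df1)), function(i) {
  others <- df1$SAMPN == df1$SAMPN[i] & df1$PERNO != df1$PERNO[i]
  df1$utility1[i] + sum(df1$utility2[others])
})

shortcut
# 16 16 15 4 4
all(shortcut == explicit)
# TRUE
```

The shortcut avoids a per-row loop, which is why it is usually preferable on larger data.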

Related

R - delete rows according to the value of another row

I am quite a beginner in R but thanks to the community of Stackoverflow I am improving!
However, I am stuck with a problem:
I have a dataset with 5 variables:
id_house represents the id for each household
id_ind is an id which takes the value 1 for the first individual in the household, 2 for the next, 3 for the third...
indicator_tb_men indicates whether the first person has answered the survey (1 = yes, 0 = no). All the other members of the household take the value 0.
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
3 1 0
3 2 0
3 3 0
4 1 1
5 1 0
I would like to delete all members of households where the first individual has not answered the survey.
So it would give:
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
4 1 1
Using dplyr, here is one way:
library(dplyr)
df %>%
arrange(id_house, id_ind) %>%
group_by(id_house) %>%
filter(first(indicator_tb_men) != 0)
# id_house id_ind indicator_tb_men
# <int> <int> <int>
#1 1 1 1
#2 1 2 NA
#3 2 1 1
#4 4 1 1
data
df <- structure(list(id_house = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L),
id_ind = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 1L), indicator_tb_men = c(1L,
NA, 1L, 0L, NA, NA, 1L, 0L)), class = "data.frame", row.names = c(NA, -8L))
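In base R we can use nested indexing: find the households whose first individual answered, then keep every row belonging to those households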
df[df$id_house %in% df$id_house[df$id_ind == 1 & df$indicator_tb_men == 1],]
id_house id_ind indicator_tb_men
1 1 1 1
2 1 2 NA
3 2 1 1
7 4 1 1
Data: Using Ronak Shah's data
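If the indicator of the first individual could itself be missing, a comparison such as `== 1` would return NA. The following is a small base R sketch, not taken from the answers above, that uses `%in%` so missing values count as "not answered". The names `head_answered` and `keep` are assumptions introduced for this example.

```r
# Same values as the df used in the answers above
df <- data.frame(
  id_house = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L),
  id_ind = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 1L),
  indicator_tb_men = c(1L, NA, 1L, 0L, NA, NA, 1L, 0L)
)

# TRUE only on the first individual's row when that individual answered;
# `%in%` yields FALSE (never NA) for a missing indicator
head_answered <- df$id_ind == 1 & df$indicator_tb_men %in% 1

# Spread that flag to every row of the same household
keep <- as.logical(ave(head_answered, df$id_house, FUN = any))

result <- df[keep, ]
result$id_house
# 1 1 2 4
```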

Completing or inserting empty rows in-between ordered factors

I have a data frame where I have aggregated the total activity per member for 15 different months (as ordered factors). Months/levels in which a member had no activity are simply skipped, as there are no rows for them in the original data.
The data looks like this:
MemberID MonthYr freq
1 04-2014 2
1 05-2014 3
1 07-2014 2
1 08-2014 5
2 04-2014 3
2 05-2014 3
3 06-2014 6
3 07-2014 4
3 11-2014 2
3 12-2014 3
I want to insert new rows in between the active months, so that the months show a frequency of 0.
Like this:
MemberID MonthYr freq
1 04-2014 2
1 05-2014 3
1 06-2014 0
1 07-2014 2
1 08-2014 5
2 04-2014 3
2 05-2014 3
3 06-2014 6
3 07-2014 4
3 08-2014 0
3 09-2014 0
3 10-2014 0
3 11-2014 2
3 12-2014 3
However, not every member joined at the same time, so the 0's can only be between the min and max MonthYr for each member.
We can use complete to do this. Convert 'MonthYr' to Date class; then, grouped by 'MemberID', use complete to expand 'MonthYr' from the min to the max date by month, filling 'freq' with 0. If needed, convert 'MonthYr' back to its original format
library(dplyr)
library(tidyr)
library(zoo)
df1 %>%
mutate(MonthYr = as.Date(as.yearmon(MonthYr, "%m-%Y"))) %>%
group_by(MemberID) %>%
complete(MonthYr = seq(min(MonthYr), max(MonthYr), by = '1 month'),
fill = list(freq = 0)) %>%
mutate(MonthYr = format(MonthYr, "%m-%Y"))
# A tibble: 14 x 3
# Groups: MemberID [3]
# MemberID MonthYr freq
# <int> <chr> <dbl>
# 1 1 04-2014 2
# 2 1 05-2014 3
# 3 1 06-2014 0
# 4 1 07-2014 2
# 5 1 08-2014 5
# 6 2 04-2014 3
# 7 2 05-2014 3
# 8 3 06-2014 6
# 9 3 07-2014 4
#10 3 08-2014 0
#11 3 09-2014 0
#12 3 10-2014 0
#13 3 11-2014 2
#14 3 12-2014 3
data
df1 <- structure(list(MemberID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L,
3L), MonthYr = c("04-2014", "05-2014", "07-2014", "08-2014",
"04-2014", "05-2014", "06-2014", "07-2014", "11-2014", "12-2014"
), freq = c(2L, 3L, 2L, 5L, 3L, 3L, 6L, 4L, 2L, 3L)),
class = "data.frame", row.names = c(NA,
-10L))
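For readers without tidyr or zoo, the same idea (build the full monthly sequence per member, then look up the observed frequencies) can be sketched in base R. This is only an illustration under the assumption that every MonthYr string has the form "mm-yyyy"; the names `full` and `out` are made up for this example.

```r
# Same values as df1 above
df1 <- data.frame(
  MemberID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  MonthYr = c("04-2014", "05-2014", "07-2014", "08-2014", "04-2014",
              "05-2014", "06-2014", "07-2014", "11-2014", "12-2014"),
  freq = c(2L, 3L, 2L, 5L, 3L, 3L, 6L, 4L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Parse "mm-yyyy" as the first day of that month
df1$Date <- as.Date(paste0("01-", df1$MonthYr), format = "%d-%m-%Y")

# For each member, every month between the first and last active month
full <- do.call(rbind, lapply(split(df1, df1$MemberID), function(d) {
  data.frame(MemberID = d$MemberID[1],
             Date = seq(min(d$Date), max(d$Date), by = "month"))
}))
rownames(full) <- NULL
full$MonthYr <- format(full$Date, "%m-%Y")

# Look up the observed freq; months with no match become 0
idx <- match(paste(full$MemberID, full$MonthYr),
             paste(df1$MemberID, df1$MonthYr))
full$freq <- df1$freq[idx]
full$freq[is.na(full$freq)] <- 0L

out <- full[, c("MemberID", "MonthYr", "freq")]
nrow(out)
# 14
```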

how to change a column with some information in the data set?

I have the columns household, person (persons in each household), tour (each tour contains different trips for each person), trip (number of the trip within each tour), and mode (mode of travel of each person in each trip).
I want to change the mode column with respect to the tour column as follows:
mode == car if there is at least one trip in the tour with mode car
mode == non-car if none of the trips in a tour has mode == car
example:
household. person. trip. tour. mode
1 1 1 1 car
1 1 2 1 walk
1 1 4 1 bus
1 1 1 2 bus
1 1 2 2 walk
1 2 1 1 walk
1 2 2 1 bus
1 2 3 1 walk
2 1 1 1 walk
2 1 1 1 car
output
household. person. trip. tour. mode
1 1 1 1 car
1 1 2 1 car
1 1 4 1 car
1 1 1 2 non-car
1 1 2 2 non-car
1 2 1 1 non-car
1 2 2 1 non-car
1 2 3 1 non-car
2 1 1 1 car
2 1 1 1 car
We can group by 'household.', 'person.' and 'tour.' and collapse 'mode' to two values by checking whether any 'car' occurs in the group. Adding 1 to that logical result gives a numeric index (TRUE -> 2, FALSE -> 1), which we use to pick the replacement from a vector of strings
library(dplyr)
df1 %>%
group_by(household., person., tour.) %>%
mutate(mode = c('non-car', 'car')[1+any(mode == "car")])
# A tibble: 10 x 5
# Groups: household., person., tour. [4]
# household. person. trip. tour. mode
# <int> <int> <int> <int> <chr>
# 1 1 1 1 1 car
# 2 1 1 2 1 car
# 3 1 1 4 1 car
# 4 1 1 1 2 non-car
# 5 1 1 2 2 non-car
# 6 1 2 1 1 non-car
# 7 1 2 2 1 non-car
# 8 1 2 3 1 non-car
# 9 2 1 1 1 car
#10 2 1 1 1 car
data
df1 <- structure(list(household. = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L), person. = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L),
trip. = c(1L, 2L, 4L, 1L, 2L, 1L, 2L, 3L, 1L, 1L), tour. = c(1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), mode = c("car", "walk",
"bus", "bus", "walk", "walk", "bus", "walk", "walk", "car"
)), class = "data.frame", row.names = c(NA, -10L))
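The logical-to-index trick described above also works without dplyr. Below is a rough base R sketch; the names `key`, `has_car` and `labels` are hypothetical and only serve this example, and the grouping key is built with `paste` as a simple stand-in for `group_by`.

```r
# Same values as df1 above
df1 <- data.frame(
  household. = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L),
  person. = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L),
  trip. = c(1L, 2L, 4L, 1L, 2L, 1L, 2L, 3L, 1L, 1L),
  tour. = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L),
  mode = c("car", "walk", "bus", "bus", "walk", "walk", "bus", "walk",
           "walk", "car"),
  stringsAsFactors = FALSE
)

# One key per (household, person, tour) combination
key <- with(df1, paste(household., person., tour.))

# TRUE for every row of a tour that contains at least one car trip
has_car <- as.logical(ave(df1$mode == "car", key, FUN = any))

# FALSE + 1 = 1 picks "non-car", TRUE + 1 = 2 picks "car"
labels <- c("non-car", "car")
df1$mode <- labels[1 + has_car]
df1$mode
```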

Conditionally remove rows from a database using R

ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6
Each person has several records.
Each person has exactly one record whose Number is 1; the rest are 2.
The variable Var has different values for the same person.
When Number equals 1, the corresponding Var (call it P) differs from person to person.
Now, I want to delete the rows where Var > P for every person.
At the end, I want this
ID Number Var
1 2 6
1 2 7
1 1 8
2 2 3
2 2 4
2 1 5
You can use dplyr::first where Number == 1 to get the first Var value
library(dplyr)
df %>% group_by(ID) %>% mutate(Flag=first(Var[Number==1])) %>%
filter(Var <= Flag) %>% select(-Flag)
# short version, if you are sure there is exactly one Number == 1 per ID
df %>% group_by(ID) %>% filter(Var <= Var[Number==1])
Here is a solution with data.table:
library(data.table)
dt <- fread(
"ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6")
dt[, .SD[Var <= Var[Number==1]], ID]
# ID Number Var
# 1: 1 2 6
# 2: 1 2 7
# 3: 1 1 8
# 4: 2 2 3
# 5: 2 2 4
# 6: 2 1 5
A base R option would be
df1[with(df1, Var <= ave(Var * (Number == 1), ID, FUN = function(x) x[x!=0])),]
# ID Number Var
#1 1 2 6
#2 1 2 7
#3 1 1 8
#6 2 2 3
#7 2 2 4
#8 2 1 5
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Number = c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), Var = c(6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L, 6L)), row.names = c(NA, -9L), class = "data.frame")
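Another way to read the problem is as a lookup: find each person's P once, then compare every row against it. The following base R sketch illustrates that reading, assuming exactly one Number == 1 row per ID as the question states; `ref` and `P` are names invented for this example.

```r
# Same values as df1 above
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Number = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L),
  Var = c(6L, 7L, 8L, 9L, 10L, 3L, 4L, 5L, 6L)
)

# One row per person: the Var recorded where Number == 1 (P in the question)
ref <- df1[df1$Number == 1, c("ID", "Var")]

# Look up each row's P through its ID
P <- ref$Var[match(df1$ID, ref$ID)]

# Keep the rows whose Var does not exceed that person's P
result <- df1[df1$Var <= P, ]
result$Var
# 6 7 8 3 4 5
```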

Tidy up and reshape messy dataset (reshape/gather/unite function)?

Following my earlier question:
R: reshape/gather function to create dataset ready for multilevel analysis
I discovered it is a bit more complicated. My dataset is actually 'messier' than I hoped. So here's the full story:
I have a big dataset, 240 cases. Each row is a case (breast cancer patient). Somewhere at the end of the dataset (say from column 417 onwards) I have data on the patients' partners, who also filled in a questionnaire.
In the beginning, there are demographic variables for both patients and partners, followed by test outcomes of patients only, followed in turn by the partner data.
I want to create a dataset where I 'split' the patient and partner data, but keep them coupled. Thus: I want to duplicate the subject ID and create a new column with 1s and 2s (1 corresponding to patient and 2 to partner).
Then, I want my data as it is now, except that some variables can be matched. For example, I now have "date of birth" separately for the patient [pgebdat] and for the partner [prgebdat]. Of course, I can turn this into 'gebdat' with the two birth dates below each other.
This code worked for me for a small subset of my data:
mydf_long <- mydf4 %>%
unite(bb1:bb50rec, col = `1`, sep = ";") %>% # Combine responses of 'p1' through 'p3'
unite(pbb1:pbb50recM, col = `2`, sep = ";") %>% # Combine responses of 'pr1' through 'pr3'
gather(couple, value, `1`:`2`) %>% # Form into long data
separate(value, sep = ";", into = c(paste0("bb", seq(1:104),"", sep = ','))) %>% # Separate and retrieve original answers
arrange(id)
results in:
id groep_MNC zkhs fbeh pgebdat couple bb1,
1 3 1 1 1 1955-12-01 1 4
2 3 1 1 1 1955-12-01 2 5
3 5 1 1 1 1943-04-09 1 2
4 5 1 1 1 1943-04-09 2 2
But now it copies and pastes the date of birth of the patient also to 'partner' row.
I'm stuck, and don't even quite know what data you would need to be able to answer my question, so please do ask. I'll provide something of an example below:
Example of data
id groep_MNC zkhs fbeh pgebdat p_age pgesl prgebdat pr_age prgesl relpnst
1 3 1 1 1 1955-12-01 42.50000 1 <NA> NA 2 1
2 5 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.50000 1 2
3 7 1 1 1 1958-04-10 40.25000 1 <NA> NA 2 1
4 10 1 1 1 1958-04-17 40.25000 1 1957-07-31 41.33333 2 1
5 12 1 1 2 1947-11-01 50.66667 1 1944-06-08 54.58333 2 1
And then, after couple of hundred variables for only patients, this partner data comes along:
pbb1 pbb2 pbb3 pbb4 pbb5 pbb6 pbb7 pbb8 pbb9
1 5 5 5 5 2 5 4 2 3
2 2 1 4 1 3 4 3 3 4
3 5 3 4 4 4 3 5 3 4
4 5 3 5 5 5 5 4 4 4
5 5 5 5 5 5 4 4 3 4
note, I didn't create this dataset myself - I'm just here to tidy up the mess :)
Edit: The dataset is in dutch. Pgesl = gender for patient, prgesl = gender for partner... etc.
Using the melt function from the data.table-package you can use multiple measures by patterns and as a result create more than one value column:
library(data.table)
melt(setDT(df), measure.vars = patterns('_age','gesl','gebdat'),
value.name = c('age','geslacht','geboortedatum')
)[, variable := c('patient','partner')[variable]][]
you get:
id groep_MNC zkhs fbeh relpnst pbb1 pbb2 variable age geslacht geboortedatum
1: 3 1 1 1 1 5 5 patient 42.50000 1 1955-12-01
2: 5 1 1 1 2 2 1 patient 55.16667 1 1943-04-09
3: 7 1 1 1 1 5 3 patient 40.25000 1 1958-04-10
4: 10 1 1 1 1 5 3 patient 40.25000 1 1958-04-17
5: 12 1 1 2 1 5 5 patient 50.66667 1 1947-11-01
6: 3 1 1 1 1 5 5 partner NA 2 <NA>
7: 5 1 1 1 2 2 1 partner 36.50000 1 1962-04-18
8: 7 1 1 1 1 5 3 partner NA 2 <NA>
9: 10 1 1 1 1 5 3 partner 41.33333 2 1957-07-31
10: 12 1 1 2 1 5 5 partner 54.58333 2 1944-06-08
Instead of patterns you could also use a list of column indexes or column names.
HTH
Used data:
df <- structure(list(id = c(3L, 5L, 7L, 10L, 12L),
groep_MNC = c(1L, 1L, 1L, 1L, 1L),
zkhs = c(1L, 1L, 1L, 1L, 1L),
fbeh = c(1L, 1L, 1L, 1L, 2L),
pgebdat = c("1955-12-01", "1943-04-09", "1958-04-10", "1958-04-17", "1947-11-01"),
p_age = c(42.5, 55.16667, 40.25, 40.25, 50.66667),
pgesl = c(1L, 1L, 1L, 1L, 1L),
prgebdat = c("<NA>", "1962-04-18", "<NA>", "1957-07-31", "1944-06-08"),
pr_age = c(NA, 36.5, NA, 41.33333, 54.58333),
prgesl = c(2L, 1L, 2L, 2L, 2L),
relpnst = c(1L, 2L, 1L, 1L, 1L),
pbb1 = c(5L, 2L, 5L, 5L, 5L),
pbb2 = c(5L, 1L, 3L, 3L, 5L)),
.Names = c("id", "groep_MNC", "zkhs", "fbeh", "pgebdat", "p_age", "pgesl", "prgebdat", "pr_age", "prgesl", "relpnst", "pbb1", "pbb2"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
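To see what the multi-measure melt does conceptually, here is a rough base R sketch that stacks a "patient" block on top of a "partner" block. It uses only a subset of the columns above and stores the missing birth dates as real NA values rather than the string "<NA>"; both simplifications, as well as the helper name `stack_role`, are assumptions made for this illustration.

```r
# Subset of the data above (missing birth dates stored as NA)
df <- data.frame(
  id       = c(3L, 5L, 7L, 10L, 12L),
  pgebdat  = c("1955-12-01", "1943-04-09", "1958-04-10", "1958-04-17",
               "1947-11-01"),
  p_age    = c(42.5, 55.16667, 40.25, 40.25, 50.66667),
  pgesl    = c(1L, 1L, 1L, 1L, 1L),
  prgebdat = c(NA, "1962-04-18", NA, "1957-07-31", "1944-06-08"),
  pr_age   = c(NA, 36.5, NA, 41.33333, 54.58333),
  prgesl   = c(2L, 1L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull one role's columns out under common names
stack_role <- function(d, role, age, gesl, gebdat) {
  data.frame(id = d$id, variable = role, age = d[[age]],
             geslacht = d[[gesl]], geboortedatum = d[[gebdat]],
             stringsAsFactors = FALSE)
}

# Patient rows first, partner rows below, sharing the same id
long <- rbind(
  stack_role(df, "patient", "p_age", "pgesl", "pgebdat"),
  stack_role(df, "partner", "pr_age", "prgesl", "prgebdat")
)
nrow(long)
# 10
```

Each id now appears twice, once per role, so the birth date of the patient is no longer copied onto the partner row.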
