Completing or inserting empty rows in between ordered factors - R

I have a data frame where I have aggregated the total activity per member for 15 different months (as ordered factors). The months/levels where a member has not had any activity are simply skipped, as there are no corresponding rows in the original data.
The data looks like this:
MemberID MonthYr freq
1 04-2014 2
1 05-2014 3
1 07-2014 2
1 08-2014 5
2 04-2014 3
2 05-2014 3
3 06-2014 6
3 07-2014 4
3 11-2014 2
3 12-2014 3
I want to insert new rows between the active months, so that the missing months show a frequency of 0.
Like this:
MemberID MonthYr freq
1 04-2014 2
1 05-2014 3
1 06-2014 0
1 07-2014 2
1 08-2014 5
2 04-2014 3
2 05-2014 3
3 06-2014 6
3 07-2014 4
3 08-2014 0
3 09-2014 0
3 10-2014 0
3 11-2014 2
3 12-2014 3
However, not every member joined at the same time, so the 0s should only be filled in between the minimum and maximum MonthYr for each member.

We can use complete to do this. Convert 'MonthYr' to Date class, then, grouped by 'MemberID', use complete to expand 'MonthYr' from the minimum to the maximum date by month while filling 'freq' with 0, and, if needed, convert 'MonthYr' back to its original format.
library(dplyr)
library(tidyr)
library(zoo)
df1 %>%
  mutate(MonthYr = as.Date(as.yearmon(MonthYr, "%m-%Y"))) %>%
  group_by(MemberID) %>%
  complete(MonthYr = seq(min(MonthYr), max(MonthYr), by = '1 month'),
           fill = list(freq = 0)) %>%
  mutate(MonthYr = format(MonthYr, "%m-%Y"))
# A tibble: 14 x 3
# Groups: MemberID [3]
# MemberID MonthYr freq
# <int> <chr> <dbl>
# 1 1 04-2014 2
# 2 1 05-2014 3
# 3 1 06-2014 0
# 4 1 07-2014 2
# 5 1 08-2014 5
# 6 2 04-2014 3
# 7 2 05-2014 3
# 8 3 06-2014 6
# 9 3 07-2014 4
#10 3 08-2014 0
#11 3 09-2014 0
#12 3 10-2014 0
#13 3 11-2014 2
#14 3 12-2014 3
data
df1 <- structure(list(MemberID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L,
3L), MonthYr = c("04-2014", "05-2014", "07-2014", "08-2014",
"04-2014", "05-2014", "06-2014", "07-2014", "11-2014", "12-2014"
), freq = c(2L, 3L, 2L, 5L, 3L, 3L, 6L, 4L, 2L, 3L)),
class = "data.frame", row.names = c(NA,
-10L))
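For comparison, here is a minimal data.table sketch of the same expansion (not part of the original answer; it assumes the df1 above and mirrors what complete does with a per-group month grid and a join):
library(data.table)
library(zoo)

dt <- as.data.table(df1)
dt[, MonthYr := as.Date(as.yearmon(MonthYr, "%m-%Y"))]
# build the full month grid per member, then left-join the observed counts onto it
grid <- dt[, .(MonthYr = seq(min(MonthYr), max(MonthYr), by = "1 month")), by = MemberID]
out <- dt[grid, on = c("MemberID", "MonthYr")]
out[is.na(freq), freq := 0L]            # months without activity get frequency 0
out[, MonthYr := format(MonthYr, "%m-%Y")]
out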

Related

How to add 2 columns with respect to 2 groups?

I want to find the total utility of each person in each household. SAMPN is the household index, PERNO is the person index.
There are 2 utilities for each person, utility1 and utility2. For each person, I want to add that person's utility1 to the utility2 of the other persons in the household.
SAMPN PERNO utility1 utility2
1 1 3 4
1 2 4 5
1 3 6 8
2 1 1 2
2 2 2 3
output
SAMPN PERNO utility1 utility2 HH-utility
1 1 3 4 3+5+8=16
1 2 4 5 4+4+8=16
1 3 6 8 6+4+5=15
2 1 1 2 1+3=4
2 2 2 3 2+2=4
One option, after grouping by 'SAMPN', is to take the sum of 'utility2', subtract the row's own 'utility2' to get the sum without that element, and then add 'utility1' to it.
library(dplyr)
df1 %>%
  group_by(SAMPN) %>%
  mutate(HHutility = sum(utility2) - utility2 + utility1)
# A tibble: 5 x 5
# Groups: SAMPN [2]
# SAMPN PERNO utility1 utility2 HHutility
# <int> <int> <int> <int> <int>
#1 1 1 3 4 16
#2 1 2 4 5 16
#3 1 3 6 8 15
#4 2 1 1 2 4
#5 2 2 2 3 4
Or with base R
transform(df1, HHutility = utility1 + ave(utility2, SAMPN, FUN = sum) - utility2)
data
df1 <- structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L), PERNO = c(1L, 2L,
3L, 1L, 2L), utility1 = c(3L, 4L, 6L, 1L, 2L), utility2 = c(4L,
5L, 8L, 2L, 3L)), class = "data.frame", row.names = c(NA, -5L
))
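For completeness, the same logic can be written in data.table (a sketch, assuming the df1 above):
library(data.table)
# per household: total utility2 minus the row's own utility2, plus the row's own utility1
setDT(df1)[, HHutility := sum(utility2) - utility2 + utility1, by = SAMPN][]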

Conditionally remove rows from a database using R

ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6
Each person has several records.
For each person there is exactly one record whose Number is 1; the rest are 2.
The variable Var takes different values for the same person.
When Number equals 1, the corresponding Var (call it P) differs between persons.
Now, I want to delete the rows where Var > P for every person.
At the end, I want this
ID Number Var
1 2 6
1 2 7
1 1 8
2 2 3
2 2 4
2 1 5
You can use dplyr::first on the Var values where Number == 1 to get the reference value for each ID:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Flag = first(Var[Number == 1])) %>%
  filter(Var <= Flag) %>%
  select(-Flag)

# shorter version, if you are sure there is exactly one Number == 1 row per ID
df %>%
  group_by(ID) %>%
  filter(Var <= Var[Number == 1])
Here is a solution with data.table:
library(data.table)
dt <- fread(
"ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6")
dt[, .SD[Var <= Var[Number==1]], ID]
# ID Number Var
# 1: 1 2 6
# 2: 1 2 7
# 3: 1 1 8
# 4: 2 2 3
# 5: 2 2 4
# 6: 2 1 5
A base R option would be
df1[with(df1, Var <= ave(Var * (Number == 1), ID, FUN = function(x) x[x!=0])),]
# ID Number Var
#1 1 2 6
#2 1 2 7
#3 1 1 8
#6 2 2 3
#7 2 2 4
#8 2 1 5
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Number = c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), Var = c(6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L, 6L)), row.names = c(NA, -9L), class = "data.frame")
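If you cannot be certain that every ID actually has a Number == 1 row, a hedged variant of the dplyr answer above keeps all rows for such IDs instead of dropping them (a sketch only, using the same data):
library(dplyr)
df %>%
  group_by(ID) %>%
  # fall back to keeping the whole group when no Number == 1 row exists
  filter(if (any(Number == 1)) Var <= first(Var[Number == 1]) else TRUE) %>%
  ungroup()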

Subsetting and repetition of rows in a dataframe using R

Suppose we have the following data with column names "id", "time" and "x":
df<-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
x = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id has multiple observations for time and x. I want to extract the last observation for each id and form a new data frame that repeats these observations according to the number of observations for each id in the original data. I am able to extract the last observations for each id using the following code:
library(dplyr)
df <- df %>%
  group_by(id) %>%
  filter((x == 0 & row_number() == n()) | (x == 1 & row_number() == n()))
What is left unresolved is the repetition aspect. The expected output would look like
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(7L, 7L, 7L, 13L, 13L, 6L, 6L),
x = c(0L, 0L, 0L, 1L, 1L, 0L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Thanks for your help in advance.
We can use ave to get the last (maximum) row number for each id, repeated once per row of that id, and use it to subset the data frame.
df[ave(1:nrow(df), df$id, FUN = max), ]
# id time x
#3 1 7 0
#3.1 1 7 0
#3.2 1 7 0
#5 2 13 1
#5.1 2 13 1
#7 3 6 0
#7.1 3 6 0
You can do this by using last() to grab the last row within each id.
df %>%
  group_by(id) %>%
  mutate(time = last(time),
         x = last(x))
Because last(x) returns a single value, it gets expanded out to fill all the rows in the mutate() call.
This can also be applied to an arbitrary number of variables using mutate_at:
df %>%
  group_by(id) %>%
  mutate_at(vars(-id), ~ last(.))
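In current dplyr versions mutate_at() is superseded by across(); an equivalent sketch (column names as in df):
df %>%
  group_by(id) %>%
  mutate(across(c(time, x), last))  # overwrite each column with its last value per id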
slice will be your friend in the tidyverse I reckon:
df %>%
  group_by(id) %>%
  slice(rep(n(), n()))
## A tibble: 7 x 3
## Groups: id [3]
# id time x
# <int> <int> <int>
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
In data.table, you could also use the mult= argument of a join:
library(data.table)
setDT(df)
df[df[,.(id)], on="id", mult="last"]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0
And in base R, a merge will get you there too:
merge(df["id"], df[!duplicated(df$id, fromLast=TRUE),])
# id time x
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
Using data.table you can try
library(data.table)
setDT(df)[,.(time=rep(time[.N],.N), x=rep(x[.N],.N)), by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0
Following @thelatemai, to avoid naming the columns you can also try
df[, .SD[rep(.N,.N)], by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0

Tidy up and reshape messy dataset (reshape/gather/unite function)?

Following my earlier question:
R: reshape/gather function to create dataset ready for multilevel analysis
I discovered it is a bit more complicated. My dataset is actually 'messier' than I hoped. So here's the full story:
I have a big dataset, 240 cases. Each row is a case (breast cancer patient). Somewhere near the end of the dataset (say from column 417 onwards) I have data on the patients' partners, who also filled in a questionnaire.
At the beginning there are demographic variables for both patients and partners, followed by test outcomes only for patients, followed in turn by the partner data.
I want to create a dataset where I 'split' the patient and partner data but keep it coupled. That is, I want to duplicate the subject ID and create a new column with 1s and 2s (1 corresponding to the patient and 2 to the partner).
Then I want my data much as it is now, but some variables can be matched (for example, I now have "date of birth" for the patient [pgebdat] and for the partner [prgebdat] separately; of course, I can turn this into a single 'gebdat' with the two birth dates below each other).
This code worked for me for a small subset of my data:
mydf_long <- mydf4 %>%
  unite(bb1:bb50rec, col = `1`, sep = ";") %>%     # combine the patient responses (bb1 to bb50rec)
  unite(pbb1:pbb50recM, col = `2`, sep = ";") %>%  # combine the partner responses (pbb1 to pbb50recM)
  gather(couple, value, `1`:`2`) %>%               # reshape into long format
  separate(value, sep = ";",
           into = c(paste0("bb", seq(1:104), "", sep = ','))) %>% # separate and retrieve the original answers
  arrange(id)
results in:
id groep_MNC zkhs fbeh pgebdat couple bb1,
1 3 1 1 1 1955-12-01 1 4
2 3 1 1 1 1955-12-01 2 5
3 5 1 1 1 1943-04-09 1 2
4 5 1 1 1 1943-04-09 2 2
But now it copies and pastes the date of birth of the patient also to 'partner' row.
I'm stuck, and don't even quite know what data you would need to be able to answer my question, so please do ask. I'll provide something of an example below:
Example of data
id groep_MNC zkhs fbeh pgebdat p_age pgesl prgebdat pr_age prgesl relpnst
1 3 1 1 1 1955-12-01 42.50000 1 <NA> NA 2 1
2 5 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.50000 1 2
3 7 1 1 1 1958-04-10 40.25000 1 <NA> NA 2 1
4 10 1 1 1 1958-04-17 40.25000 1 1957-07-31 41.33333 2 1
5 12 1 1 2 1947-11-01 50.66667 1 1944-06-08 54.58333 2 1
And then, after a couple of hundred patient-only variables, this partner data comes along:
pbb1 pbb2 pbb3 pbb4 pbb5 pbb6 pbb7 pbb8 pbb9
1 5 5 5 5 2 5 4 2 3
2 2 1 4 1 3 4 3 3 4
3 5 3 4 4 4 3 5 3 4
4 5 3 5 5 5 5 4 4 4
5 5 5 5 5 5 4 4 3 4
note, I didn't create this dataset myself - I'm just here to tidy up the mess :)
Edit: The dataset is in Dutch. pgesl = gender of the patient, prgesl = gender of the partner, etc.
Using the melt function from the data.table package, you can specify multiple measure variables via patterns and thereby create more than one value column:
library(data.table)
melt(setDT(df), measure.vars = patterns('_age', 'gesl', 'gebdat'),
     value.name = c('age', 'geslacht', 'geboortedatum')
)[, variable := c('patient', 'partner')[variable]][]
you get:
id groep_MNC zkhs fbeh relpnst pbb1 pbb2 variable age geslacht geboortedatum
1: 3 1 1 1 1 5 5 patient 42.50000 1 1955-12-01
2: 5 1 1 1 2 2 1 patient 55.16667 1 1943-04-09
3: 7 1 1 1 1 5 3 patient 40.25000 1 1958-04-10
4: 10 1 1 1 1 5 3 patient 40.25000 1 1958-04-17
5: 12 1 1 2 1 5 5 patient 50.66667 1 1947-11-01
6: 3 1 1 1 1 5 5 partner NA 2 <NA>
7: 5 1 1 1 2 2 1 partner 36.50000 1 1962-04-18
8: 7 1 1 1 1 5 3 partner NA 2 <NA>
9: 10 1 1 1 1 5 3 partner 41.33333 2 1957-07-31
10: 12 1 1 2 1 5 5 partner 54.58333 2 1944-06-08
Instead of patterns you could also use a list of column indexes or column names.
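For example, a sketch with an explicit list of column names instead of patterns (same df as in the used data below):
melt(setDT(df),
     measure.vars = list(c("p_age", "pr_age"),
                         c("pgesl", "prgesl"),
                         c("pgebdat", "prgebdat")),
     value.name = c("age", "geslacht", "geboortedatum")
)[, variable := c("patient", "partner")[variable]][]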
HTH
Used data:
df <- structure(list(id = c(3L, 5L, 7L, 10L, 12L),
groep_MNC = c(1L, 1L, 1L, 1L, 1L),
zkhs = c(1L, 1L, 1L, 1L, 1L),
fbeh = c(1L, 1L, 1L, 1L, 2L),
pgebdat = c("1955-12-01", "1943-04-09", "1958-04-10", "1958-04-17", "1947-11-01"),
p_age = c(42.5, 55.16667, 40.25, 40.25, 50.66667),
pgesl = c(1L, 1L, 1L, 1L, 1L),
prgebdat = c("<NA>", "1962-04-18", "<NA>", "1957-07-31", "1944-06-08"),
pr_age = c(NA, 36.5, NA, 41.33333, 54.58333),
prgesl = c(2L, 1L, 2L, 2L, 2L),
relpnst = c(1L, 2L, 1L, 1L, 1L),
pbb1 = c(5L, 2L, 5L, 5L, 5L),
pbb2 = c(5L, 1L, 3L, 3L, 5L)),
.Names = c("id", "groep_MNC", "zkhs", "fbeh", "pgebdat", "p_age", "pgesl", "prgebdat", "pr_age", "prgesl", "relpnst", "pbb1", "pbb2"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
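A tidyverse counterpart is also possible (a sketch, not from the original answer, using pivot_longer with a ".value" spec; the column-name pattern is an assumption based on the p/pr prefixes in the sample data):
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = c(p_age, pr_age, pgesl, prgesl, pgebdat, prgebdat),
               names_to = c("who", ".value"),
               names_pattern = "^(pr|p)_?(.*)$") %>%   # "pr..." columns -> partner, "p..." -> patient
  mutate(who = ifelse(who == "p", "patient", "partner"))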

How can I exclude zeros when finding a frequency and separate into 4 categories

I have a data frame data.2016 and am trying to find the frequency with which "DIPL" occurs (excluding zeros); "DIPL" is the number of worm parasites found in a fish.
Data looks something like this:
data.2016
Site DIPL
1 0
1 1
1 1
2 6
2 8
2 1
2 1
3 0
3 0
3 0
4 1258
4 501
I want to output to look like this:
Site freq
1 2
2 4
3 0
4 2
From this I can interpret that, of the 3 fish found at site #1 (in the data frame), 2 had worm parasites.
I've tried
aggregate(DIPL ~ Site, data = data.2016, frequency) # and get:
Site DIPL
1 1 1
2 2 1
3 3 1
4 4 1
Is there a way to count the number of fish with worms from the DIPL column (meaning the value in the column is higher than zero) per site?
Just use a custom function that removes the zeros.
aggregate(DIPL ~ Site, data.2016, function(x) length(x[x != 0])) # or sum(x != 0)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
Another option would be to temporarily transform the DIPL column then just take the sum.
aggregate(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0), sum)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
xtabs() is fun too ...
xtabs(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0))
# Site
# 1 2 3 4
# 2 4 0 2
By the way, frequency is for use on time-series data.
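To illustrate: stats::frequency() reports the sampling frequency of a time series, and for a plain vector it is simply 1, which is why the aggregate() call above returned 1 for every site.
frequency(ts(1:24, frequency = 12))  # 12: a monthly time series
frequency(c(0, 1, 1))                # 1: a plain vector has no sampling frequency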
Data:
data.2016 <- structure(list(Site = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L), DIPL = c(0L, 1L, 1L, 6L, 8L, 1L, 1L, 0L, 0L, 0L, 1258L,
501L)), .Names = c("Site", "DIPL"), class = "data.frame", row.names = c(NA,
-12L))
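If you prefer dplyr, a sketch of the same per-site count (assuming the data.2016 above):
library(dplyr)
data.2016 %>%
  group_by(Site) %>%
  summarise(freq = sum(DIPL > 0))  # count fish with at least one parasite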
Might something like this be what you're looking for?
# first some fake data
site <- c("A","A","A","B","B","B")
numworms <- c(1,0,3,0,0,42)
data.frame(site,numworms)
site numworms
1 A 1
2 A 0
3 A 3
4 B 0
5 B 0
6 B 42
tapply(numworms, site, function(x) sum(x>0))
A B
2 1
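If you want that result back as a data frame like your expected output, a small follow-up sketch:
res <- tapply(numworms, site, function(x) sum(x > 0))
data.frame(site = names(res), freq = as.vector(res), row.names = NULL)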
