creating a summary data frame from long-formatted data - R

Following this worked example:
case <- c('a','a','a','b','b','c','c','c','c','d','d','e','e')
ID <- c('aa','bb','zz','aa','cc','ee','ff','gg','kk','aa','kk','cc','dd')
score <- c(1,1,3,4,2,3,2,2,1,1,3,3,2)
df1 <- data.frame(case, ID, score)
identifier <- c('aa','bb','ff')
For each unique case (that is, a, b, c, d, ...), I want to scan the ID column and count how often an identifier value occurs.
So we look at the 3 rows where case == a and count how many of their IDs appear in identifier (in this case 2 times).
We then look at the 2 rows where case == b and again count how many IDs appear in identifier (in this case 1 time).
We do this for all unique cases.
I have used the following command, but it operates on the whole sample rather than per unique case:
df1$ID %in% identifier
What I want as an end result is a table with one column holding each unique case and a second column holding the number of times ID and identifier were equal.
So I want to loop/automate the process and return output similar to:
data.frame(c('a','b','c','d','e'), c(2,1,1,1,0))

You can use tapply():
tapply(df1$ID, df1$case, FUN = function(id) sum(id %in% identifier))
a b c d e
2 1 1 1 0
But as @Jaap pointed out, you can use aggregate() to get a data.frame:
aggregate(ID ~ case, data = df1, FUN = function(id) sum(id %in% identifier))
case ID
1 a 2
2 b 1
3 c 1
4 d 1
5 e 0
And if you want more grouping you can do:
df <- aggregate(ID ~ case+(score>1), data = df1, FUN = function(id) sum(id %in% identifier))
df[df$`score > 1`,c(1,3)]
case ID
4 a 0
5 b 1
6 c 1
7 d 0
8 e 0
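For completeness, a dplyr sketch of the same per-case count (assuming the dplyr package is available):
library(dplyr)
df1 %>%
  group_by(case) %>%
  summarise(ID = sum(ID %in% identifier))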

Related

How to transpose a long data frame every n rows

I have a data frame like this:
x=data.frame(type = c('a','b','c','a','b','a','b','c'),
value=c(5,2,3,2,10,6,7,8))
Every item has attributes a, b, and c, but some items may be missing records, i.e. only have a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
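For comparison, a tidyr/dplyr sketch of the same reshape (assuming tidyr >= 1.0 for pivot_wider); it numbers occurrences within each type just like rowid():
library(dplyr)
library(tidyr)
x %>%
  group_by(type) %>%
  mutate(item = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = type, values_from = value)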
Create a grouping vector g assuming each occurrence of a starts a new group, use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector, slightly more complex but needed if some groups have a missing type, is the following. It assumes that x lists the type values in order of their levels within each group, so that if a level is less than the prior level it must be the start of a new group.
g <- cumsum(c(-1, diff(as.numeric(x$type))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise, the problem is ambiguous. For example, if one group can have b and c missing and the next group can have a missing, then whether the b and c in the second group actually form a second group or are part of the first group is not determinable.
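As a quick check of the alternate definition on the example data (this assumes type is a factor so that as.numeric() gives level codes; wrap it in factor() if your R version reads strings as character):
g <- cumsum(c(-1, diff(as.numeric(factor(x$type)))) < 0)
g
## [1] 1 1 1 2 2 3 3 3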

R - Compare column values in data frames of differing lengths by unique ID

I'm sure I can figure out a straightforward solution to this problem, but I didn't see a comparable question so I thought I'd post one.
I have a longitudinal dataset with thousands of respondents over several time intervals. Everything from the questions to the data types can differ between the waves, and constructing indicators or dummy variables often requires long series of bools, but each respondent has a unique ID and no additional respondents are added to the surveys after the first wave, so easy enough.
The issue is that while the early waves consist of one (Stata) file each, the later waves contain lots of addendum files, structured differently. So, for example, in constructing indicators for the sex of previous partners, one wave had columns called partnerNum and sex, with up to 16 rows for each unique ID (respondent). It was easy enough to spread (or cast) that data to create a single row for each unique ID, with columns partnerNum_1 ... partnerNum_16 holding the values from the sex column, in partnerDF. Then it's easy to construct indicators like:
sexuality$newIndicator[mainDF$bioSex == "Male" & apply(partnerDF[1:16] == "Male", 1, any)] <- 1
For other addendum files in the last two waves the data is structured long like the partner data, with multiple rows for each unique ID, but rather than just one variable like sex there are hundreds I need to test against to construct indicators, all coded with different types, so it's impractical to spread (or cast) the data wide (never mind writing those bools). There are actually several of these files for each wave and, the way they are structured, some respondents (unique IDs) occupy just 1 row, some a few dozen. (I've left_join'ed the addendum files together for each wave.)
What I'd like to be able to do is test something like:
newDF$indicator[any(waveIIIAdds$var1 == 1) & any(waveIIIAdds$var2 == 1)] <- 1
or
newDF$indicator[mainDF$var1 == 1 & any(waveIIIAdds$var2 == 1)] <- 1
where newDF is the same length as mainDF (one row per unique ID).
So, for example, if I had two dfs:
df1 <- data.frame(ID = c(1:4), A = rep("a"))
df2 <- data.frame(ID = rep(1:4, each=2), B = rep(1:2, 2), stringsAsFactors = FALSE)
df1$A[1] <- "b"
df1$A[3] <- "b"
df2$B[8] <- 3
> df1
ID A
1 b
2 a
3 b
4 a
> df2
ID B
1 1
1 2
2 1
2 2
3 1
3 2
4 1
4 3
I'd like to test something like this (assuming df3 has one column, just the unique IDs from df1):
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] & df2$ID[df2$B == 2]] <- 1
So df3 would have one unique ID per row. Since there is an "a" in df1$A for all IDs but df1$A[1], and a 2 in at least one row of df2$B for all IDs except the last ID (df2$B[7:8]), the result would be:
> df3
ID new
1 0
2 1
3 1
4 0
and
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] | df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 1
4 0
This does it...
df3 <- data.frame(ID = unique(df1$ID),
                  new = sapply(unique(df1$ID), function(x)
                    as.numeric(x %in% df1$ID[df1$A == "a"] & x %in% df2$ID[df2$B == 2])))
df3
ID new
1 1 1
2 2 1
3 3 1
4 4 0
I came up with a parsimonious solution after returning to the problem for a few minutes (rather than in the wee hours of the morning of the post).
I wanted something that a graduate student, who will likely construct thousands of indicators or dummy variables this way and may learn R first, or even only ever learn R, could use. The following provides a solution for both the example and the actual data using the same schema.
Assuming the data frame was already created with the IDs and the dummy-indicator column initialized to zero, as in the example:
df3 <- data.frame(ID = df1$ID)
df3$new <- 0
My solution was:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] & df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 0
2 1
3 0
4 1
Using | (or) instead:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] | df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 0
4 1
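For reference, the same per-ID flags can be computed once and reused, which keeps the result independent of df1's row order; a base R sketch (helper names illustrative):
ids <- sort(unique(df1$ID))
has_a <- ids %in% df1$ID[df1$A == "a"]  # an "a" somewhere in df1 for this ID
has_2 <- ids %in% df2$ID[df2$B == 2]    # a 2 somewhere in df2 for this ID
df3_and <- data.frame(ID = ids, new = as.numeric(has_a & has_2))
df3_or  <- data.frame(ID = ids, new = as.numeric(has_a | has_2))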

R: Subset data frame based on multiple values for multiple variables

I need to pull records from a first data set (called df1 here) based on a combination of specific dates, ID#s, event start time, and event end time that match with a second data set (df2). Everything works fine when there is just 1 date, ID, and event start and end time, but some of the matching records between the data sets contain multiple IDs, dates, or times, and I can't get the records from df1 to subset properly in those cases. I ultimately want to put this in a FOR loop or independent function since I have a rather large dataset. Here's what I've got so far:
I started just by matching the dates between the two data sets as follows:
match_dates <- as.character(intersect(df1$Date, df2$Date))
Then I selected the records in df2 based on the first matching date, also keeping the other columns so I have the other ID and time information I need:
records <- df2[which(df2$Date == match_dates[1]), ]
The date, ID, start, and end time from records are then:
[1] "01-04-2009" "599091" "12:00" "17:21"
Finally, I subset df1 for before and after the event based on the date, ID, and times in records, and combined the results into a new data frame called final to get at the data in df1 that I ultimately need.
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
after <- subset(df1, NUM==records$ID & Date==records$Date & Time>records$End)
final <- rbind(before, after)
Here's the real problem - some of the matching dates have more than 1 corresponding row in df2, and return multiple IDs or times. Here is what an example of multiple records looks like:
records <- df2[which(df2$Date == match_dates[25]), ]
> records$ID
[1] 507646 680845 680845
> records$Date
[1] "04-02-2009" "04-02-2009" "04-02-2009"
> records$Start
[1] "09:43" "05:37" "11:59"
> records$End
[1] "05:19" "11:29" "16:47"
When I try to subset df1 based on this I get recycling warnings (and incorrect results):
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
Warning messages:
1: In NUM == records$ID :
longer object length is not a multiple of shorter object length
2: In Date == records$Date :
longer object length is not a multiple of shorter object length
3: In Time < records$Start :
longer object length is not a multiple of shorter object length
Trying to do it manually for each ID-date-time combination would be way too tedious. I have 9 years' worth of data, all with multiple matching dates for a given year between the data sets, so ideally I would like to set this up as a FOR loop, or a function with a FOR loop in it, but I can't get past this. Thanks in advance for any tips!
If you're asking what I think you are, the filter() function from the dplyr package combined with %in% does what you're looking for.
> df1 <- data.frame(A = c(rep(1,4),rep(2,4),rep(3,4)), B = c(rep(1:4,3)))
> df1
A B
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
9 3 1
10 3 2
11 3 3
12 3 4
> df2 <- data.frame(A = c(1,2), B = c(3,4))
> df2
A B
1 1 3
2 2 4
> library(dplyr)
> filter(df1, A %in% df2$A, B %in% df2$B)
A B
1 1 3
2 1 4
3 2 3
4 2 4
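Note that %in% tests each column independently, which is why rows like (1, 4) appear above even though (1, 4) is not a row of df2. If you instead need matching (A, B) combinations, a semi_join sketch (dplyr):
semi_join(df1, df2, by = c("A", "B"))
#   A B
# 1 1 3
# 2 2 4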

How to label ties when creating a variable capturing the most frequent occurrence of a group?

In the following example, how do I ask R to identify a tie as "tie" when I want to determine the most frequent value within a group?
I am basically following on from a previous question that used which.max or which.is.max and a custom function (Create a variable capturing the most frequent occurrence by group), but I want to acknowledge the ties as a tie. Any ideas?
df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = as.character(c("a", "b", "b", rep("c", 3)))
)
I want to create a third variable freq that contains the most frequent observation in v1 by id, but that also identifies ties as "tie".
From previous answers, this code works to create the freq variable, but just doesn't deal with the ties:
library(plyr)

myFun <- function(x){
  tbl <- table(x$v1)
  x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  x
}
ddply(df1, .(id), .fun = myFun)
You could slightly modify your function by testing if the maximum count occurs more than once. This happens in sum(tbl == max(tbl)). Then proceed accordingly.
df1 <- data.frame(
  id = rep(1:2, each = 4),
  v1 = rep(letters[1:4], c(2, 2, 3, 1))
)
myFun <- function(x){
  tbl <- table(x$v1)
  nmax <- sum(tbl == max(tbl))
  if (nmax == 1)
    x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  else
    x$freq <- "tie"
  x
}
ddply(df1, .(id), .fun = myFun)
id v1 freq
1 1 a tie
2 1 a tie
3 1 b tie
4 1 b tie
5 2 c c
6 2 c c
7 2 c c
8 2 d c
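The same logic ports to dplyr if you prefer (a sketch):
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(freq = {
    tbl <- table(v1)
    if (sum(tbl == max(tbl)) > 1) "tie" else names(which.max(tbl))
  }) %>%
  ungroup()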

identifying and extracting the prior observation for a specific source

I have data in the form:
id source
1 m
1 p
1 l1
1 l1
2 t
2 q
3 p
3 l1
3 n
3 l1
Now for every id, I want to identify l1 when it occurs in source and extract the observation prior to it.
For example: for id 1, the 3rd source is l1 and the observation prior to that is p.
so my data should look like this:
id source
1 p
3 p
3 n
How can I create this in R?
A data.table solution
library(data.table)
dd <- data.table(df)
dd[, source[match('l1', source)-1L],by = id]
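Note that match() only finds the first l1 per id. A sketch extending the same idea to every l1 within an id, dropping cases where the prior observation is itself l1 (an l1 in the first row of an id is silently skipped):
prior <- dd[, .(source = source[which(source == "l1") - 1L]), by = id]
prior[source != "l1"]
##    id source
## 1:  1      p
## 2:  3      p
## 3:  3      n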
There might be a more direct method, but try this:
#get your data
test <- read.table(text="id source
1 m
1 p
1 l1
1 l1
2 t
2 q
3 p
3 l1
3 n
3 l1",header=TRUE)
# do some picking of the cases
result <- do.call(rbind,by(test,test$id,function(x) x[which(x$source=="l1")-1,]))
result <- result[result$source!="l1",]
Which gives:
> result
id source
2 1 p
7 3 p
9 3 n
Here is another data.table solution. I wasn't able to get what seemed like a correct answer with the earlier version from @mnel.
library(data.table)
## Create the test data table:
dt <- data.table(id=c(1,1,1,1,2,2,3,3,3,3),
source1=c("m","p","l1","l1","t","q","p","l1","n","l1"))
dt[,list(id, source1, source0=c(NA,source1[seq_len(.N-1L)]))][source1=="l1"]
## id source1 source0
## 1: 1 l1 p
## 2: 1 l1 l1
## 3: 3 l1 p
## 4: 3 l1 n
This adds a column source0 to the data table that holds the previous row's value (or NA for the first row): .N is the number of rows, so source1[seq_len(.N-1L)] is every value except the last, shifted down one position by the leading NA. The result is then subset to rows where the original source1 has a value of "l1".
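Recent data.table versions also provide shift(), which expresses the same lag more directly (a sketch; like the code above it lags over the whole table, not within id):
dt[, source0 := shift(source1)][source1 == "l1"]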
Here is a vectorized solution using only simple functions from the base of R.
If DF is the input data frame then sel is a logical vector whose TRUE components select out the required rows. The three terms connected by & signs select those rows:
for which the following row's source column equals "l1" and
whose source column is not "l1" and
are such that the following row is not the first with that id
The length of sel is one less than the number of rows in DF so we use which to avoid recycling of sel.
is.l1 <- DF$source == "l1"
sel <- is.l1[-1] & !is.l1[-nrow(DF)] & duplicated(DF$id)[-1]
DF[which(sel),]
The result of the last line is:
id source
2 1 p
7 3 p
9 3 n
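A dplyr sketch of the same idea, using lead() within each id (reusing the test data frame read in above):
library(dplyr)
test %>%
  group_by(id) %>%
  filter(lead(source) == "l1" & source != "l1") %>%
  ungroup()
## giving the rows (1, p), (3, p), (3, n)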
