R: Subset data frame based on multiple values for multiple variables - r

I need to pull records from a first data set (called df1 here) based on a combination of specific dates, ID#s, event start time, and event end time that match with a second data set (df2). Everything works fine when there is just 1 date, ID, and event start and end time, but some of the matching records between the data sets contain multiple IDs, dates, or times, and I can't get the records from df1 to subset properly in those cases. I ultimately want to put this in a FOR loop or independent function since I have a rather large dataset. Here's what I've got so far:
I started just by matching the dates between the two data sets as follows:
match_dates <- as.character(intersect(df1$Date, df2$Date))
Then I selected the records in df2 based on the first matching date, also keeping the other columns so I have the other ID and time information I need:
records <- df2[which(df2$Date == match_dates[1]), ]
The date, ID, start, and end time from records are then:
[1] "01-04-2009" "599091" "12:00" "17:21"
Finally I subset df1 for before and after the event based on the date, ID, and times in records and combined them into a new data frame called final to get at the data contained in df1 that I ultimately need.
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
after <- subset(df1, NUM==records$ID & Date==records$Date & Time>records$End)
final <- rbind(before, after)
Here's the real problem - some of the matching dates have more than 1 corresponding row in df2, and return multiple IDs or times. Here is what an example of multiple records looks like:
records <- df2[which(df2$Date == match_dates[25]), ]
> records$ID
[1] 507646 680845 680845
> records$Date
[1] "04-02-2009" "04-02-2009" "04-02-2009"
> records$Start
[1] "09:43" "05:37" "11:59"
> records$End
[1] "05:19" "11:29" "16:47"
When I try to subset df1 based on this I get an error:
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
Warning messages:
1: In NUM == records$ID :
longer object length is not a multiple of shorter object length
2: In Date == records$Date :
longer object length is not a multiple of shorter object length
3: In Time < records$Start :
longer object length is not a multiple of shorter object length
Trying to do it manually for each ID-date-time combination would be way to tedious. I have 9 years worth of data, all with multiple matching dates for a given year between the data sets, so ideally I would like to set this up as a FOR loop, or a function with a FOR loop in it, but I can't get past this. Thanks in advance for any tips!

If you're asking what I think you are the filter() function from the dplyr package combined with the match function does what you're looking for.
> df1 <- data.frame(A = c(rep(1,4),rep(2,4),rep(3,4)), B = c(rep(1:4,3)))
> df1
A B
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
9 3 1
10 3 2
11 3 3
12 3 4
> df2 <- data.frame(A = c(1,2), B = c(3,4))
> df2
A B
1 1 3
2 2 4
> filter(df1, A %in% df2$A, B %in% df2$B)
A B
1 1 3
2 1 4
3 2 3
4 2 4

Related

Function to recode multiple variables conditional on other variables

I have a dataset with multiple variables. Each question has the actual survey answer and three other characteristics. So there are four variables for each question. I want to specify if Q135_L ==1 , leave Q135_RT as it is, otherwise code it as NA. I can do that with an ifelse statement.
df$Q135_RT <- ifelse(df$Q135_L == 1, df$Q22_RT, NA)
However, I have hundreds of variables and the names are not related. For example, in the picture we can see Q135, SG1_1 and so on. How can I specify for the whole dataset if a variable ends at _L, then for the same variable ending at _RT should remain as it is, otherwise the variable ending at _RT should be coded as NA.
I tried this but it only returns NAs
ifelse(grepl("//b_L" ==1, df), "//b_RT" , NA)
If I understand your problem correctly, you have a data frame of which the columns represent survey question variables. Each column contains two identifiers, namely: a survey question number (134, 135, etc) and a variable letter (L, R, etc). Because you provide no reproducible example, I tried to make a simplified example of your data frame:
set.seed(5)
DF <- data.frame(array(sample(1:4, 24, replace = TRUE), c(4,6)))
colnames(DF) <- c("Q134_L","Q135_L", "Q134_R", "Q135_R", "Q_L1", "Q134_S")
DF
# Q134_L Q135_L Q134_R Q135_R Q_L1 Q134_S
# 1 2 3 2 3 1 1
# 2 3 1 3 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 3 3 2 1
What you want is that if Q135_L == 1, leave Q135_RT as it is, otherwise code it as NA. Here is a function that implements this recoding logic:
recode <- function(yourdf, questnums) {
for (k in 1:length(questnums)) {
charnum <- as.character(questnums)
col_end_L_k <- yourdf[grepl("_L\\b", colnames(yourdf)) &
grepl(charnum[k], colnames(yourdf))]
col_end_R_k <- yourdf[grepl("_RT\\b", colnames(yourdf)) &
grepl(charnum[k], colnames(yourdf))]
row_is_1 <- which(col_end_L_k == 1)
col_end_R_k[-row_is_1, ] <- NA
yourdf[, colnames(col_end_R_k)] <- col_end_R_k
}
return(yourdf)
}
This function takes a data frame and a vector of question numbers, and then returns the data frame that has been recoded.
What this function does:
Selecting each question number using for.
Using grepl to identify any column that contains the selected number and contains _L at the end of the column name.
Similar with above but for _RT at the end of the column name.
Using which to identify the location of rows in the _L column that contain 1.
Keeping the values of the _RT column, which has the same question number with the corresponding _L column, in those rows, and change values on other rows to NA.
The result:
recode(DF, 134:135)
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 NA NA 1 1
# 2 3 1 NA 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 NA 3 2 1
Note that the Q_L1 column is not affected because _L in this column is not located on the end of the column name.
As for how to define questnums, the question numbers, you just need to create a numeric vector. Examples:
Your questnums are 1 to 200. Then use 1:200 or seq(200), so recode(DF, 1:200).
Your questnums are 1, 3, 134, 135. Then, use recode(DF, c(1, 3, 134, 135)).
You can also assign the question numbers to an object first, such as n = c(25, 135, 145) and the use it : recode(DF, n)

Changing elements in one data frame [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 1 year ago.
I have a df with two columns where the elements in them are codes:
> head(listaNombres)
ocupacion1 ocupacion2
1 11-2020 11-9190
2 11-2020 41-1010
3 11-2020 41-2030
4 11-2020 41-3090
5 11-2020 41-4010
6 11-3030 11-9190
And then a separate df with the meaning for each code:
> head(descripcion)
# A tibble: 6 x 2
broadGroup Desc
<chr> <chr>
1 11-1010 Chief Executives
2 11-1020 General and Operations Managers
3 11-1030 Legislators
4 11-2010 Advertising and Promotions Managers
5 11-2020 Marketing and Sales Managers
6 11-2030 Public Relations and Fundraising Managers
How can I convert the codes in the first df with the Desc column in the second?
This question has been answered a few times but non seem to have an answer that is simple and also uses base R. I'm not a fan of making people use unnecessary packages, so I'll write this up since there is an easy and straight forward solution that requires no extra packages.
Using the 'match' function we can
oldvalues <- descripcion$broadGroup
# sets up values we wish to change from
newvalues <- descripcion$Desc
# sets up the values we want to change to
listaNombres$ocupacion1 = newvalues[ match(listaNombres$ocupacion1, oldvalues) ]
# Overwrite current ocupacion1 values with desired recode
listaNombres$ocupacion2 = newvalues[ match(listaNombres$ocupacion2, oldvalues) ]
# Overwrite current ocupacion2 values with desired recode
Say we have
v3$recode = v2[ match(v3$recod, v1) ]
What this does is, is takes our three vectors, v1, v2 and v3 and using match(v3,v1), match returns a vector of positions in v1 where the first match between an element of v3 and v1 occurs. We then select elements from v2 using this vector of positions, which gives us the recoded version of v3$record. We then feed this recoded vector of values straight back into v3$record overwriting the old values.
edit: I've since had a look using R and this solution works using the following mockup dataset
ocupacion1 = c(1,2,3,4)
ocupacion2 = c(3,4,4,2)
listaNombres = data.frame(ocupacion1,ocupacion2)
broadGroup = c(1,2,3,4)
Desc = c("one","two","three","four")
descripcion = data.frame(broadGroup,Desc)
combining everything gives the following
> ocupacion1 = c(1,2,3,4)
> ocupacion2 = c(3,4,4,2)
> listaNombres = data.frame(ocupacion1,ocupacion2)
> head(listaNombres)
ocupacion1 ocupacion2
1 1 3
2 2 4
3 3 4
4 4 2
> broadGroup = c(1,2,3,4)
> Desc = c("one","two","three","four")
> descripcion = data.frame(broadGroup,Desc)
> head(descripcion)
broadGroup Desc
1 1 one
2 2 two
3 3 three
4 4 four
> oldvalues <- descripcion$broadGroup
> newvalues <- descripcion$Desc
> listaNombres$ocupacion1 = newvalues[ match(listaNombres$ocupacion1, oldvalues) ]
> listaNombres$ocupacion2 = newvalues[ match(listaNombres$ocupacion2, oldvalues) ]
> # Overwrite current ocupacion2 values with desired recode
> head(listaNombres)
ocupacion1 ocupacion2
1 one three
2 two four
3 three four
4 four two

R:How to intersect list of dataframes and specifc column

I am trying to find all matching values in a specific column, in a list of data.frames. However, I keep getting a returned value of character(0).
I have tried the following:
Simple subset (very time consuming) -> e.g. dat[[i]][[i]]
lapply w/ Reduce and intersect (as seen here
LocA<-data.frame(obs.date=c("2018-01-10","2018-01-14","2018-01-20),
obs.count=c(2,0,1))
LocB<-data.frame(obs.date=c("2018-01-09","2018-01-14","2018-01-20),
obs.count=c(0,3,5))
LocC<-data.frame(obs.date=c("2018-01-12","2018-01-14","2018-01-19"),
obs.count=c(2,0,1))
LocD<-data.frame(obs.date=c("2018-01-11","2018-01-16","2018-01-21"),
obs.count=c(2,0,1))
dfList<-list(LocA,LocB,LocC,LocD)
##List of all dates
lapply(dfList,'[[',1)
[1]"2018-01-10" "2018-01-14" "2018-01-20" "2018-01-09"...
Attempts (failure)
>Reduce(intersect,lapply(dfList,'[[',1))
character (0)
I expect the output of this function to be an output identifying the data.frames that share a common date.
*Extra smiles if someone know how to identify shared dates and mutate in to a single data frame where..Col1 = dataframe name, Col2=obs.date,Col3 = obs.count
You can first merge all the data frames so you only have one:
a <- Reduce(function(x, y) merge(x, y, all = TRUE), dfList)
Or you can merge them like this:
a <-rbind(LocA,LocB,LocC,LocD)
Afterwards, you can extract all the duplicates:
b <- a[duplicated(a$obs.date), ]
Or if you want to keep all the unique ones and keep the duplicates:
c <- a[!duplicated(a$obs.date), ]
If by "intersect" you mean doing an "inner join" or "merging" with a specific column as key, then -- you want to use dplyr::inner_join or merge.
First, between two data.frames:
library(dplyr)
inner_join(LocA, LocB, by='obs.date')
# 2 rows
inner_join(LocC, LocD, by='obs.date')
# zero rows
So, not infinite merging.
Stack, then count
We'll combine the data first, then count the occurences. Notice the use of the .id-argument to track where the row originated.
library(dplyr)
bind_rows(dfList, .id = 'id') %>%
add_count(obs.date) %>%
filter(n > 1)
# A tibble: 5 x 4
id obs.date obs.count n
<chr> <chr> <dbl> <int>
1 1 2018-01-14 0 3
2 1 2018-01-20 1 2
3 2 2018-01-14 3 3
4 2 2018-01-20 5 2
5 3 2018-01-14 0 3

How to label consecutive periods with identical statuses

I have long vector of patient statuses in R that are chronologically sorted, and a label of associated patient IDs. This vector is an element of a dataframe. I would like to label consecutive rows of data for which the patient status is the same. If the status changes, then reverts to its original value, that would be three separate events. This is different than most situations I have searched where duplicated or match would suffice.
An example would be along the lines of:
s <- c(0,0,0,1,1,1,0,0,2,1,1,0,0)
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2)
and the desired output would be
flag <- c(1,1,1,2,2,2,3,1,2,3,4,4)
or
flag <- c(1,1,1,2,2,2,3,4,5,6,7,7)
One inelegant approach would be to generate the sequence:
unlist(tapply(s, id, function(x) cumsum(c(T, x[-1] != rev(rev(x)[-1])))))
Is there a better way?
I think you could use rleid from data.table for this:
library(data.table)
rleid(s,id)
Output:
1 1 1 2 2 2 3 4 5 6 6 7 7
Or for the first sequence:
data.table(s,id)[,rleid(s),id]$V1
Output:
1 1 1 2 2 2 3 1 2 3 3 4 4
Run Length Encoding - rle()
tapply(s, id, function(x) {
v<-rle(x)$length
rep(1:length(v), v)
})

R - Compare column values in data frames of differing lengths by unique ID

I'm sure I can figure out a straightforward solution to this problem, but I didn't see a comparable question so I thought I'd post a question.
I have a longitudinal dataset with thousands of respondents over several time intervals. Everything from the questions to the data types can differ between the waves and often requires constructing long series of bools to construct indicators or dummy variables, but each respondent has a unique ID with no additional respondents add to the surveys after the first wave, so easy enough.
The issue is that while the early wave consist of one (Stata) file each, the latter waves contain lots of addendum files, structured differently. So, for example, in constructing previous indicators for the sex of previous partners there were columns (for one wave) called partnerNum and sex and there were up to 16 rows for each unique ID (respondent). Easy enough to spread (or cast) that data to be able to create a single row for each unique ID and columns partnerNum_1 ... partnerNum_16 with the value from the sex column as the entry in partnerDF. Then it's easy to construct indicators like:
sexuality$newIndicator[mainDF$bioSex = "Male" & apply(partnerDF[1:16] == "Male", 1, any)] <- 1
For other addendum files in the last two waves the data is structured long like the partner data, with multiple rows for each unique ID, but rather than just one variable like sex there are hundreds that I need to use to test against to construct indicators, all coded with different types, so it's impractical to spread (or cast) the data wide (never mind writing those bools). There are actually several of these files for each wave and the way they are structured some respondents (unique ID) occupy just 1 row, some a few dozen. (I've left_join'ed the addendum files together for each wave.)
What I'd like to be able to do to is test something like:
newDF$indicator[any(waveIIIAdds$var1 == 1) & any(waveIIIAdds$var2 == 1)] <- 1
or
newDF$indicator[mainDF$var1 == 1 & any(waveIIIAdds$var2 == 1)] <- 1
where newDF is the same length as mainDF (one row per unique ID).
So, for example, if I had two dfs.
df1 <- data.frame(ID = c(1:4), A = rep("a"))
df2 <- data.frame(ID = rep(1:4, each=2), B = rep(1:2, 2), stringsAsFactors = FALSE)
df1$A[1] <- "b"
df1$A[3] <- "b"
df2$B[8] <- 3
> df1 > df2
ID A ID B
1 b 1 1
2 a 1 2
3 b 2 1
4 a 2 2
3 1
3 2
4 1
4 3
I'd like to test like (assuming df3 has one column, just the unique IDs from df1)
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] & df2$ID[df2$B == 2]] <- 1
So that df3 would have one unique ID per row and since there is an "a" in df1$A for all IDs but df1$A[1] and a 2 in at least one row of df2$B for all IDs except the last ID (df2$B[7:8]) the result would be:
> df3
ID new
1 0
2 1
3 1
4 0
and
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] | df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 1
4 0
This does it...
df3 <- data.frame(ID=unique(df1$ID),
new=sapply(unique(df1$ID),function(x)
as.numeric(x %in% df1$ID[df1$A == "a"] & x %in% df2$ID[df2$B == 2])))
df3
ID new
1 1 1
2 2 1
3 3 1
4 4 0
I came up with a parsimonious solution thinking about it for a few minutes after returning to the problem (rather than the wee hours of the morning of the post).
I wanted something a graduate student who will likely construct thousands of indicators or dummy variables this way and may learn R first, or even only ever learn R, could use. The following provides a solution for the example and actual data using the same schema:
if the DF was already created with the IDs and the column values for the dummy indicator initiated to zero already as assumed in the example:
df3 <- data.frame(ID = df1$ID)
df3$new <- 0
My solution was:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] & df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 0
2 1
3 0
4 1
Using | (or) instead:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] | df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 0
4 1

Resources