Merge two rows in columns based on condition - r

I am new here, as well as to R, and I couldn't find any past queries that answered my following question, so I apologise if this has already been brought up before.
I am trying to merge the ID columns from two different datasets into one, but some of the IDs in the rows have been coded differently. I need to replace all the "LNZ_" IDs with "LNZ." ones, but I cannot figure out how I would go about doing this.
df_1 <- data.frame(ID_1 = c("LNZ_00001", "LNZ_00002", "LNZ_00003", "DFG00001", "CWD00001"),
Sex=c("M","F","F","M","F"))
df_2 <- data.frame(ID_2 = c("LNZ.00001", "LNZ.00002", "LNZ_00003", "DFG00001", "CWD00001"),
Type=c("S","S","B","B","B"),
AGE=c(56,75,66,64,64))
The above is similar to the datasets that I have, only more scaled down. I hope this is somewhat clear, and any help would be appreciated.
Thanks!

The issue with merging is that your ID columns have different formatting for some of the entries which are supposed to be matched. Therefore you need to modify those values to match before performing the merge. In the examples you gave, the difference is between a period separator (.) and an underscore (_). If your real data has more complex issues, you may need to use different code to clean up those values.
However, once that is resolved, you can perform your merge easily. Here I've used the {tidyverse} packages to accomplish both steps in one pipe chain.
library(tidyverse)
df_1 <- data.frame(ID_1 = c("LNZ_00001", "LNZ_00002", "LNZ_00003", "DFG00001", "CWD00001"), Sex=c("M","F","F","M","F"))
df_2 <- data.frame(ID_2 = c("LNZ.00001", "LNZ.00002", "LNZ_00003", "DFG00001", "CWD00001"), Type=c("S","S","B","B","B"), AGE=c(56,75,66,64,64))
df_2 %>%
  mutate(ID_2 = str_replace(ID_2, "\\.", "\\_")) %>%
  left_join(df_1, by = c("ID_2" = "ID_1"))
#> ID_2 Type AGE Sex
#> 1 LNZ_00001 S 56 M
#> 2 LNZ_00002 S 75 F
#> 3 LNZ_00003 B 66 F
#> 4 DFG00001 B 64 M
#> 5 CWD00001 B 64 F
Created on 2022-07-17 by the reprex package (v2.0.1)
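If you prefer base R, the same normalise-then-merge idea can be sketched with gsub() and merge() (assuming the df_1 and df_2 defined above):
# Base R sketch: make the separators match, then left-join with merge()
df_2$ID_2 <- gsub(".", "_", df_2$ID_2, fixed = TRUE)
merge(df_2, df_1, by.x = "ID_2", by.y = "ID_1", all.x = TRUE)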

Related

Combining/aggregating data in R

I feel like this is a really simple question, and I've looked a lot of places to try to find an answer to it, but everything seems to be looking to do a lot more than what I want--
I have a dataset that has multiple observations from multiple participants. One of the factors is where they're from (e.g. Buckinghamshire, Sussex, London). I want to combine everything that isn't London so I have two categories, London and notLondon. How would I do this? I'd then want to be able to run a lm on these two, so how would I edit my dataset so that I could do lm(fom ~ [other factor]) where it would be the combined category?
Also, how would I combine all observations from each respective participant for a category? e.g. I have a category that's birth year, but currently when I do a summary of my data it will say, for example, 1996:265, because there are 265 observations from people born in '96. But I just want it to tell me how many participants were born in 1996.
Thanks!
There are multiple parts to your question so let's take it step by step.
1.
For the first part this is a great use of forcats::fct_collapse(). See example here:
library(tidyverse)
set.seed(1)
d <- sample(letters[1:5], 20, T) %>% factor()
# original distribution
table(d)
#> d
#> a b c d e
#> 6 4 3 1 6
# lumped distribution
fct_collapse(d, a = "a", other_level = "other") %>% table()
#> .
#> a other
#> 6 14
Created on 2022-02-10 by the reprex package (v2.0.1)
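Applied to your situation, here is a sketch with hypothetical names based on your description (a data frame df with a factor column location and a numeric outcome fom); London stays its own level, everything else gets lumped into notLondon, and that new column can go straight into lm():
df <- df %>%
  mutate(location2 = fct_collapse(location,
                                  London = "London",
                                  other_level = "notLondon"))
lm(fom ~ location2, data = df)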
2.
For the second part, you will have to clarify and share some data to get more help.
3.
Here you can use dplyr::summarize(n = n()) but you need to share some data to get an answer with your specific case.
However something like:
df %>% group_by(birth_year) %>% summarize(n = n())
will give you the number of rows for each birth year value.
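If each participant contributes several rows, you can count each participant only once by first reducing to one row per participant (this assumes a participant identifier column, here called participant_id):
df %>%
  distinct(participant_id, birth_year) %>%
  count(birth_year)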

How to assign unambiguous values for each row in a data frame based on values found in rows from another data frame using R?

I have been struggling with this question for a couple of days.
I need to scan every row of a data frame and then assign an unambiguous identifier to each row, based on values found in a second data frame. Here is a toy example.
df1<-data.frame(c(99443975,558,99009680,99044573,599,99172478))
names(df1)<-"Building"
V1<-c(558,134917,599,120384)
V2<-c(4400796,14400095,99044573,4500481)
V3<-c(NA,99009680,99340705,99132792)
V4<-c(NA,99156365,NA,99132794)
V5<-c(NA,99172478,NA, 99181273)
V6<-c(NA, NA, NA,99443975)
row_number<-1:4
df2<-data.frame(cbind(V1, V2,V3,V4,V5,V6, row_number))
The output I expect is what follows.
row_number_assigned<-c(4,1,2,3,3,2)
output<-data.frame(cbind(df1, row_number_assigned))
Any hints?
Here's an efficient method using the arr.ind feature of the which() function:
sapply(df1$Building,                     # will send Building entries one-by-one
       function(inp){ which(inp == df2,  # find matching values
                            arr.ind = TRUE)[1] })  # return only the row, not the column
[1] 4 1 2 3 3 2
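To attach those row numbers to df1 in the shape shown in the question, the same call can be stored and bound on as a column:
row_number_assigned <- sapply(df1$Building,
                              function(inp) which(inp == df2, arr.ind = TRUE)[1])
output <- data.frame(df1, row_number_assigned)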
Incidentally, your use of the data.frame(cbind(.)) construction is very dangerous. A much less dangerous method (which also needs fewer keystrokes) for data frame construction would be:
df2<-data.frame( V1=c(558,134917,599,120384),
V2=c(4400796,14400095,99044573,4500481),
V3=c(NA,99009680,99340705,99132792),
V4=c(NA,99156365,NA,99132794),
V5=c(NA,99172478,NA, 99181273),
V6=c(NA, NA, NA,99443975) )
(It didn't cause coding errors this time, but if there were any character columns it would have changed all the numbers to character values.) If you learned this from a teacher, could you approach them gently and do their future students a favor by letting them know that cbind() will coerce all of its arguments to the "lowest common denominator"?
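A quick illustration of that coercion, comparing cbind() with a direct data.frame() call:
cbind(1:3, c("a", "b", "c"))               # the numbers become "1", "2", "3"
data.frame(x = 1:3, y = c("a", "b", "c"))  # x stays numeric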
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
df1 %>%
  left_join(df2 %>%
              pivot_longer(-row_number) %>%
              select(-name),
            by = c("Building" = "value"))
This returns
Building row_number
1 99443975 4
2 558 1
3 99009680 2
4 99044573 3
5 599 3
6 99172478 2

How can I count and compare data over multiple columns in R?

I am working with a dataframe which contains text data which has been categorised and coded. Each numerical value from 1-12 represents a type of word.
I want to count the frequency of occurrence of each number (1 to 12) over 6 columns (pre1 to pre6) so I know how many types of words have been used. Could anyone please advise on how to do this?
My df is structured as such:
Would something like that work for you?
library(dplyr)
df <- data.frame(pre1 = c(sample(1:12, 10)),
                 pre2 = c(sample(1:12, 10)),
                 pre3 = c(sample(1:12, 10)),
                 pre4 = c(sample(1:12, 10)),
                 pre5 = c(sample(1:12, 10)),
                 pre6 = c(sample(1:12, 10)))
count <- count(df, pre1, pre2, pre3, pre4, pre5, pre6)
One solution is this:
library(tidyverse)
mtcars %>%
select(cyl, am, gear) %>% # select the columns of interest
gather(column, number) %>% # reshape
count(column, number) # get counts of numbers for each column
# # A tibble: 8 x 3
# column number n
# <chr> <dbl> <int>
# 1 am 0 19
# 2 am 1 13
# 3 cyl 4 11
# 4 cyl 6 7
# 5 cyl 8 14
# 6 gear 3 15
# 7 gear 4 12
# 8 gear 5 5
In your case, column will get values pre1, pre2, etc., number will get values 1 to 12, and n will be the count of a specific number in a specific column.
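The same idea applied to the pre1:pre6 columns, using the sample df constructed in the earlier answer and pivot_longer(), the newer replacement for gather():
df %>%
  pivot_longer(pre1:pre6, names_to = "column", values_to = "number") %>%
  count(column, number)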
It is not entirely clear from the question whether you want frequency tables for all of these columns together or for each column separately. In future questions you should also make clear whether those numbers are coded as numerics, as characters or as factors (the result of str(pCat) is a good way to do that). For this particular question, it fortunately does not matter.
The answers I have already given in the comments are
table(unlist(pCat[,4:9]))
and
table(pCat$pre3)
as an extension for the latter, I shall also point to the comment by ANG , which boils down to
lapply(pCat[,4:9], table)
These are straightforward solutions in base R without any further unnecessary packages. The answers by JonGrub and AntoniosK are based on the tidyverse. There is no obvious need to import dplyr or the tidyverse for this problem, but I guess the authors load those packages anyway whenever they use R, so it does not really impose any cost on them. Other great packages to base good answers on are data.table and sqldf. Those are good packages, and many people do a lot of things in them that could be done in base R. The packages promise to be clearer or faster, or to reuse knowledge you might already have. Nothing is wrong with that. However, I take your question as an indication that you are still learning R, and I would advise learning R first, before you become distracted by learning special packages and DSLs.
People have been using base R for decades and they will continue to do so. Learning base R will not distract you from R, and the knowledge will continue to be worthwhile in decades. Whether the same can be said for the tidyverse or data.table, time will tell (although sqldf is probably also a solid investment in the future, maybe more so than R).
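For a runnable illustration of those base R calls (pCat itself was not shared), they can be applied to the sample df built in the first answer, with selection by name standing in for pCat[,4:9]:
table(unlist(df[, paste0("pre", 1:6)]))   # one combined frequency table for all six columns
lapply(df[, paste0("pre", 1:6)], table)   # a separate table per column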

Dplyr/Lubridate: How to summarise overlapping intervals after grouping

I would like to group agreements and then compare how much their periods overlap (or are apart).
My dataframe may look like:
library(tidyverse)
library(lubridate)
dft <- tribble(
  ~ShipTo, ~Code,   ~Start,            ~End,
  "xxxx",  "AAA11", ymd("2018-01-01"), ymd("2018-03-01"),
  "yyyy",  "BBB23", ymd("2018-02-01"), ymd("2018-05-11"),
  "yyyy",  "BBB23", ymd("2018-03-01"), ymd("2018-06-11"),
  "cccc",  "AAA11", ymd("2018-01-06"), ymd("2018-03-12"),
  "yyyy",  "CCC04", ymd("2018-01-16"), ymd("2018-03-31"),
  "xxxx",  "DDD",   ymd("2018-01-21"), ymd("2018-03-25")
)
I would like to mutate a column to create lubridate periods and evaluate them after grouping by ShipTo and Code. What I tried was:
dft3 <- dft %>% filter(concat1 %in% to_filter2) %>%
  arrange(ShipTo, Code) %>%
  group_by(ShipTo, Code) %>%
  mutate(period = interval(Start, End),
         nextperiod = interval(lead(Start), lead(End)),
         interv = day(as.period(intersect(period, nextperiod), "days"))) %>%
  group_by(ShipTo, Code) %>%
  summarise(count = n(),
            intervmax = max(interv),
            intervmin = min(interv))
If I remove the line group_by(ShipTo,Code)%>% the intervals are created correctly and also the lead intervals are correctly calculated from the next line. But when I naively use group_by, the intervals are not calculated correctly.
I suspect that perhaps my database should be split into many tables by groups and then, after the operation of creating and comparing intervals it should be glued back together.
Is there a succinct way to do it? Or perhaps there is a simpler way I have not yet learned? Thank you in advance for the hint in the right direction.
EDIT: The desired output should be a column with the overlap of the intervals in days (or the distance between intervals if there is no overlap). Grouping destroys the calculation. I would like to have these values calculated within groups (not across them).
EDIT2: I am trying to solve the problem by splitting the dataframe into a list of dataframes and then combining them, but I am not sure of the syntax. It does not quite work; it produces tables with one column. This is based on help I was given on another forum (perhaps it can illustrate the issue). The idea is to split the database, create the new columns and combine the tables back into a single table.
fnOverlaps <- function(x) {
mutate(x,okres=interval(Start,End),
nastokres=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(okres, nastokres), "days")))
}
dft3<-dft3 %>%
split(list(.$ShipTo, .$Code), drop = TRUE) %>%
map_df(fnOverlaps) %>%
flatten_dfr()
The result (for one group) that I expect would look like this.
tribble(
  ~ShipTo, ~Code,   ~interv,
  "yyyy",  "BBB23", 70,  # say there is a 70-day overlap
  "yyyy",  "BBB23", NA   # there is no next row to compare against
)
It looks like the issue is being caused by trying to combine vectors with the class "Interval." Specifically, they appear to be getting converted to numeric and losing their inherent information.
I think the only viable solution is to split the data.frame, run the analysis on each component separately with lapply, then bring them back together with bind_rows. Groups with only one entry present an issue, as max and min return -Inf and Inf when the argument is empty after removing NAs. But that is easy enough to correct for.
This code should work. Note that I am using group_by to ensure the ShipTo/Code columns are kept, though you could do that in other ways.
dft %>%
split(paste(.$ShipTo, "XXX", .$Code)) %>%
lapply(function(x){
x %>%
arrange(ShipTo,Code) %>%
mutate(period=interval(Start,End)
, nextperiod=interval(lead(Start),lead(End))
, interv=day(as.period(intersect(period, nextperiod), "days"))
) %>%
group_by(ShipTo,Code)%>%
summarise(count=n(),
intervmax=max(interv, na.rm = TRUE),
intervmin=min(interv, na.rm = TRUE)) %>%
ungroup()
}) %>%
bind_rows() %>%
mutate(intervmax = ifelse(is.infinite(intervmax)
, NA, intervmax)
, intervmin = ifelse(is.infinite(intervmin)
, NA, intervmin))
Returns
# A tibble: 5 x 5
ShipTo Code count intervmax intervmin
<chr> <chr> <int> <dbl> <dbl>
1 cccc AAA11 1 NA NA
2 xxxx AAA11 1 NA NA
3 xxxx DDD 1 NA NA
4 yyyy BBB23 2 71.0 71.0
5 yyyy CCC04 1 NA NA
I am putting this here just for the record. I received an answer from Jake Knaupp on the r4ds Slack group using the modern map_df() syntax. It calculates the overlap of the periods, but it converts the periods to numeric, and there are a bunch of warnings saying it will do that.
myFun <- function(x) {
mutate(x,period=interval(Start,End),
nextperiod=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(period, nextperiod), "days")))
}
df %>%
split(list(.$ShipTo, .$Code), drop = TRUE) %>%
map_df(myFun)
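For the record, one way to sidestep the Interval class entirely is to compute the overlap in days directly from the dates: for consecutive rows it is pmin of the ends minus pmax of the starts, which is plain numeric and therefore survives group_by(). A sketch, assuming Start and End are Date columns as in the ymd() version of the data above (negative values are the gap between periods; NA marks the last row of each group):
dft %>%
  arrange(ShipTo, Code, Start) %>%
  group_by(ShipTo, Code) %>%
  mutate(interv = as.numeric(pmin(End, lead(End)) - pmax(Start, lead(Start)))) %>%
  ungroup()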

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump straight into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so I am wondering if anyone can offer suggestions for a speedy way of doing this.
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  CURRENT_EVENT <- with(rle(LOG_MESSAGE),   # list with 'values' and 'lengths'
                        rep(replace(values,
                                    nchar(values) == 0,
                                    values[nchar(values) != 0]),
                            lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
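For completeness, the carry-forward step can also be done with tidyr's fill(); a sketch, assuming the blanks are first recoded as NA:
library(dplyr)
library(tidyr)
d <- data.frame(LOG_MESSAGE = c("FIRST_EVENT", "", "SECOND_EVENT", "", ""))
d %>%
  mutate(CURRENT_EVENT = na_if(LOG_MESSAGE, "")) %>%  # recode blanks as NA
  fill(CURRENT_EVENT)                                 # carry the last event forward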
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that is how the data were created in dat, given the old stringsAsFactors default) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two, and we want the first elements of these vectors.
4. So I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
5. The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) the result is added as a new variable directly into the data dat.
