I want to merge several fields from 2 dataframes into a new dataframe. The merge is based on ID and Date, and the Date must equal or fall between the start and end dates in the second dataframe.
The following answer to a similar question almost works for me. However, if the Date in the first dataframe equals the start date in the second dataframe, I get NA instead of the matching colour. Any help on ways to include the colour when the Date falls on the start date would be very much appreciated.
library(tidyverse)
library(lubridate)
df1 <- data.frame(ID=c(1, 2, 2, 3),
actual.date=mdy('3/31/2017', '2/11/2016','4/10/2016','5/15/2015'))
df2 <- data.frame(ID = c(1, 1, 1, 2, 3),
start = mdy('1/1/2000', '4/1/2011', '3/31/2017', '2/11/2016', '1/12/2012'),
end = mdy('3/31/2011', '6/4/2012', '04/04/2017', '3/31/2017', '2/12/2014'),
colour = c("blue", "purple", "blue", "red", "purple"))
df <- full_join(df1, df2, by = "ID") %>%
mutate(test = ifelse(actual.date <= end & actual.date > start,
TRUE,
FALSE)) %>%
filter(test) %>%
left_join(df1, ., by = c("ID", "actual.date")) %>%
select(ID, actual.date, colour)
If you could show us a dataframe of the output you're looking for, that would be useful, but I think this achieves what you're trying to do. The missing colour comes from the strict comparison actual.date > start, which excludes dates that fall exactly on the start date; use >= instead. I also don't think you need to join twice. The filter() drops the rows that fail the date test, and joining back to df1 reintroduces them with NA for colour, because they now exist in only one of the dataframes.
full_join(df1, df2, by = "ID") %>%
filter(actual.date <= end & actual.date >= start) %>%
select(ID, actual.date, colour)
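For reference, here is a minimal sketch of the same inclusive match written as a single non-equi join. It assumes dplyr 1.1.0 or later, where join_by() and its between() helper are available. The dates are entered as ISO strings with as.Date() purely to keep the example free of lubridate, and res is just an illustrative name.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, needed for join_by()

df1 <- data.frame(
  ID = c(1, 2, 2, 3),
  actual.date = as.Date(c("2017-03-31", "2016-02-11", "2016-04-10", "2015-05-15"))
)
df2 <- data.frame(
  ID = c(1, 1, 1, 2, 3),
  start = as.Date(c("2000-01-01", "2011-04-01", "2017-03-31", "2016-02-11", "2012-01-12")),
  end = as.Date(c("2011-03-31", "2012-06-04", "2017-04-04", "2017-03-31", "2014-02-12")),
  colour = c("blue", "purple", "blue", "red", "purple"),
  stringsAsFactors = FALSE
)

# Match on ID and require start <= actual.date <= end.
# between() inside join_by() includes both end points by default.
res <- left_join(df1, df2, by = join_by(ID, between(actual.date, start, end))) %>%
  select(ID, actual.date, colour)

print(res)
```

Unlike the full_join() plus filter() version, the left_join() keeps rows of df1 that fall in no interval (ID 3 here) and gives them NA for colour, which may or may not be what you want.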
I am trying to extract the set of indirect colleagues of doctors. I call doctors who work together in the same hospital colleagues. An indirect colleague is a doctor who works with a colleague of a doctor in another hospital. In the example below, doctor "a" works with doctor "b" in hospital 1, who in turn works with doctor "c" in hospital 2. Therefore "c" is an indirect colleague of "a".
The code below works well when the physician ids are strings (df0) or low numeric values (df1), but not when they are high numeric values (df2). I would like to fix the code to work with high numeric values (while keeping the original ids of the physicians).
df0 <- tribble(
~hospital, ~doctors,
1, c("a", "b"),
2, c("b", "c"),
3, c("a", "d"),
) %>%
unnest(doctors)
# Below, I replaced doctor id with numeric values
df1 <- tribble(
~hospital, ~doctors,
1, c(1, 2),
2, c(2, 3),
3, c(1, 4),
) %>%
unnest(doctors)
# Now I added +5 to each physician id
df2 <- tribble(
~hospital, ~doctors,
1, c(6, 7),
2, c(7, 8),
3, c(6, 9)
) %>%
unnest(doctors)
df <- df2 # The code only works with df0 and df1, not with df2
colleagues <- full_join(df, df, by = c("hospital")) %>%
rename(doctor = doctors.x, colleagues = doctors.y) %>%
filter(doctor != colleagues) %>%
distinct(doctor, colleagues) %>%
chop(colleagues) %>%
deframe()
colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
mutate(ego_colleagues = map(ego, ~ colleagues[[.x]]),
alter_colleagues = map(alter, ~ colleagues[[.x]]),
alter_colleague_only = map2(alter_colleagues, ego_colleagues, ~ .x[!(.x %in% .y)])) %>%
unnest(alter_colleague_only) %>%
filter(ego != alter_colleague_only) %>%
select(ego, alter, alter_colleague_only)
The issue is in your map calls. Using df2, when you map(ego, ~ colleagues[[.x]]), colleagues[[.x]] is indexing by position, not by name. When the ids are character strings, [[ looks the element up by name. When the ids are numeric and happen to be 1, 2, 3, 4, it works by luck, because position and name coincide. But when you have a list of 4 and you call colleagues[[6]], you get a "subscript out of bounds" error. If that's not totally clear, print these:
colleagues[[1]] vs. colleagues[[6]] vs. colleagues$`6` .
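A small base R sketch of that difference. The list below is a hand-built stand-in for the colleagues object that the df2 pipeline produces, so its exact contents are an assumption made for illustration.

```r
# A list whose names are the doctor ids 6 to 9, stored as strings
colleagues <- list(`6` = c(7, 9), `7` = c(6, 8), `8` = 7, `9` = 6)

by_position <- colleagues[[1]]    # first element, whatever it is called
by_name     <- colleagues[["6"]]  # the element named "6"

# A numeric 6 asks for the sixth element, which a list of four does not have,
# so the call raises an error; it is caught here instead of stopping the script
out_of_range <- tryCatch(colleagues[[6]], error = function(e) e)

identical(by_position, by_name)  # same element, reached two different ways
inherits(out_of_range, "error")
```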
A quick fix would be to wrap the first part of those map statements in as.character like this:
colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
mutate(ego_colleagues = map(as.character(ego), ~ colleagues[[.x]]),
alter_colleagues = map(as.character(alter), ~ colleagues[[.x]]),
alter_colleague_only = map2(alter_colleagues, ego_colleagues, ~ .x[!(.x %in% .y)])) %>%
unnest(alter_colleague_only) %>%
filter(ego != alter_colleague_only) %>%
select(ego, alter, alter_colleague_only)
UPDATE:
Depending on your setup, you could try using the furrr package with future_map and future_map2, but at least in this minimal example that was a much slower approach. I don't know if that holds true on your real data.
Here's another option. While ugly because it has a lot of intermediate objects, it may be helpful. It uses matrices and leverages the fact that you have these reciprocal relationships (if I'm interpreting correctly). I benchmarked it and it takes half as long.
t1 <- colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
filter(!duplicated(paste0(pmax(ego, alter), pmin(ego, alter)))) %>%
as.matrix()
t2 <- t1 %>%
rbind(t1[1:nrow(t1),c(2,1)])
alter_colleague_only <- t2[match(t2[,2], t2[,1]), "alter"]
t3 <- cbind(t2, alter_colleague_only)
t4 <- t3[which(t2[,1] != t3[,3]),]
t5 <- t4[,c(3,2,1)]
t6 <- rbind(t4, t5) %>%
as_tibble() %>%
arrange(ego)
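If the list lookups are the main source of trouble, a join-only formulation sidesteps name versus position indexing altogether. This is a rough base R sketch and not the approach used above: the names edges, two_step, direct and result are made up for the example, and it has not been benchmarked.

```r
# Same data as df2 in the question, already in long form
df <- data.frame(hospital = c(1, 1, 2, 2, 3, 3),
                 doctors  = c(6, 7, 7, 8, 6, 9))

# Direct colleagues: every ordered pair of different doctors sharing a hospital
pairs <- merge(df, df, by = "hospital")
edges <- unique(pairs[pairs$doctors.x != pairs$doctors.y,
                      c("doctors.x", "doctors.y")])
names(edges) <- c("ego", "alter")

# Second step: attach the colleagues of each alter
two_step <- merge(edges,
                  setNames(edges, c("alter", "alter_colleague_only")),
                  by = "alter")

# Drop paths that lead back to ego or to someone ego already works with
direct <- paste(edges$ego, edges$alter)
keep <- two_step$ego != two_step$alter_colleague_only &
  !(paste(two_step$ego, two_step$alter_colleague_only) %in% direct)

result <- two_step[keep, c("ego", "alter", "alter_colleague_only")]
result <- result[order(result$ego), ]
print(result)
```

Because the ids are only ever compared as values, it makes no difference whether they are strings, small numbers or large numbers.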
I am trying to create a for loop in R that will make a new data frame ("results") when the values of a column ("AreaName2") in one data frame (df2) match the values in a column ("ISLAND") from a different data frame (df1).
If there is no match for the first column in df2, then I want it to move on to a second pair of columns from df2 and df1 (df2: "AreaName1" and df1: "ARCHIP"). Again, if there is a match, it should be written to the new data frame. If there is still no match, then I want it to move on to a third pair of columns (df2: "Country" and df1: "COUNTRY").
If all columns in df2 are blank, then I would like to skip that row.
If there is some information in one of the columns in df2, but it doesn't match df1, I would like the result to state that somehow, if that is possible.
I have made an example of df1, df2, and results:
ID <- c(1,2,3,4,5, 6)
COUNTRY <- c("country1", 'country2', 'country3','country4', 'country5', 'country6')
ARCHIP <- c('archipelago1', 'archipelago2', 'archipelgao3', 'archipelago4', 'archipelago5', 'archipelago6')
ISLAND <- c('someIsland1', 'someIsland2', 'someIsland3', 'someIsland4', 'someIsland5', 'someIsland6')
df1 <- data.frame(ID, COUNTRY, ARCHIP, ISLAND)
Sciname <- c("scientificName1", "scientificName2", "scientificName3", "scientificName4", "scientificName5", "scientificName6")
AreaName2 <- c("someIsland1", NA, "someIsland3", NA, NA, 'unrecognisableIsland')
AreaName1 <- c("archipelago1", "archipelago2", "archipelago3", NA, NA, 'archipelago6')
Country <- c("country1", "country2", "country3", 'country4', NA, 'country6')
df2 <- data.frame(Sciname, Country, AreaName1, AreaName2)
Species <- c("scientificName1","scientificName2", "scientificName3", "scientificName4", 'scientificName6')
Location <- c("someIsland1", "archipelago2", "someIsland3", 'country4', 'UNRECOGNISED')
results <- data.frame(Species, Location)
I was thinking that I need to do something along the lines of this for each column set
for (i in df2$AreaName2) {
results[[i]] <- if(df2$AreaName2 %in% df1$ISLAND)
}
But I am not sure how to make it work for each set, or how to make it run through several columns. Maybe I should make a for loop for each of the sets of columns I wish to match?
Any ideas? Thanks!
# I like to use tidyverse :)
library(tidyverse)
# First, to create our datasets - (Thank you for providing sample data!)
# I've set this up in a slightly different way, in an attempt to keep our workspace clear.
# I've also used tibble in place of data.frame, to line up with the tidyverse approach.
df1 <- tibble( ID = 1:6,
COUNTRY = c("country1", 'country2', 'country3','country4', 'country5', 'country6'),
ARCHIP = c('archipelago1', 'archipelago2', 'archipelgao3', 'archipelago4', 'archipelago5', 'archipelago6'),
ISLAND = c('someIsland1', 'someIsland2', 'someIsland3', 'someIsland4', 'someIsland5', 'someIsland6'))
df2 <- tibble( Sciname = c("scientificName1", "scientificName2", "scientificName3", "scientificName4", "scientificName5", "scientificName6"),
Country = c("country1", "country2", "country3", 'country4', NA, 'country6'),
AreaName1 = c("archipelago1", "archipelago2", "archipelago3", NA, NA, 'archipelago6'),
AreaName2 = c("someIsland1", NA, "someIsland3", NA, NA, 'unrecognisableIsland'))
# Rather than use a for loop, I'll use full_join to match the two tables, then filter for the conditions you're looking for.
# Merge data
join_country <- full_join(df2, df1, by = c("Country" = "COUNTRY"))
# Identify scinames with matching island names
# I use _f to signify my goal here - filtering
island_f <- join_country %>%
filter(AreaName2 == ISLAND) %>%
# Keep only relevant columns
select(Sciname, Location = AreaName2)
# Identify scinames with matching archip names
archip_f <- join_country %>%
filter(
# Exclude scinames we've identified with matching island names.
!(Sciname %in% island_f$Sciname),
AreaName1 == ARCHIP) %>%
select(Sciname, Location = AreaName1)
# Identify scinames left over (countries already matched from full_join)
country_f <- join_country %>%
filter(
# Exclude scinames we've identified with matching island or archip names.
!(Sciname %in% island_f$Sciname),
!(Sciname %in% archip_f$Sciname)) %>%
select(Sciname, Location = Country)
sciname_location <- bind_rows(island_f,
archip_f,
country_f) %>%
arrange(Sciname)
# Finally, to identify records that are populated but don't match at all, we can use anti_join.
records_no_match <- anti_join(df1, df2, by = c("COUNTRY" = "Country"))
You can learn more about relational data from R for Data Science, chapter 13.
Please let me know if you have any questions!
A different solution might be to prioritise the locations first, and then filter for the locations with the highest priority.
Just like Rebecca, I would opt for the tidyverse ;-)
library(tidyverse)
# Bring df2 into long format
df_long2 <- pivot_longer(df2, -Sciname) %>%
select(-name) %>%
mutate(value = replace_na(value, "UNRECOGNISED"))
# Bring df1 into long format
df_long1 <- pivot_longer(df1, -ID) %>%
select(-ID)
results <- df_long2 %>%
left_join(df_long1, by = "value") %>%
# Prioritize names
mutate(lvl = case_when(
name == "ISLAND" ~ 1,
name == "ARCHIP" ~ 2,
name == "COUNTRY" ~ 3,
is.na(name)~ 4
)) %>%
# Group by species (Sciname)
group_by(Sciname) %>%
# Filter for groups with lowest lvl/highest priority
filter(lvl == min(lvl)) %>%
# Drop duplicate rows
distinct() %>%
select(-name, -lvl) %>%
# Rename
rename(Species = 1,
Location = 2)
Good luck!
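The same priority order can also be written row by row with case_when(), without any join. This is only a sketch of the rules as stated in the question (island first, then archipelago, then country, skip rows that are entirely blank); the data is copied from the answers above, spelling included. Note that with these rules scientificName6 resolves to archipelago6 rather than UNRECOGNISED, because its archipelago does match.

```r
library(dplyr)

df1 <- data.frame(
  ID = 1:6,
  COUNTRY = c("country1", "country2", "country3", "country4", "country5", "country6"),
  ARCHIP = c("archipelago1", "archipelago2", "archipelgao3", "archipelago4", "archipelago5", "archipelago6"),
  ISLAND = c("someIsland1", "someIsland2", "someIsland3", "someIsland4", "someIsland5", "someIsland6"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Sciname = paste0("scientificName", 1:6),
  Country = c("country1", "country2", "country3", "country4", NA, "country6"),
  AreaName1 = c("archipelago1", "archipelago2", "archipelago3", NA, NA, "archipelago6"),
  AreaName2 = c("someIsland1", NA, "someIsland3", NA, NA, "unrecognisableIsland"),
  stringsAsFactors = FALSE
)

results <- df2 %>%
  # Skip rows where every location column is blank
  filter(!(is.na(AreaName2) & is.na(AreaName1) & is.na(Country))) %>%
  # case_when() returns the first branch whose condition is TRUE,
  # so the branch order encodes the priority; NA %in% ... is FALSE
  mutate(Location = case_when(
    AreaName2 %in% df1$ISLAND  ~ AreaName2,
    AreaName1 %in% df1$ARCHIP  ~ AreaName1,
    Country   %in% df1$COUNTRY ~ Country,
    TRUE                       ~ "UNRECOGNISED"
  )) %>%
  select(Species = Sciname, Location)

print(results)
```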
I have 11 variables in my dataframe. The first is unique identifier of observation (a plane). The second one is a number from 1 to 21 representing flight of a given plane. The rest of the variables are time, velocity, distance, etc.
What I want to do is make new variables for every group (number) of flight e.g. time_1, time_2,..., velocity_1, velocity_2, etc. and consequently, reduce the number of observations (the repeating ones).
I don't really have an idea how to start. I was thinking about a mutate function like:
mutate(df, time_1 = ifelse(n_flight == 1, time, NA))
But that would be a lot of typing and a new problem may appear, perhaps.
Basically, you want to convert long to wide data for each variable. You can lapply over these with tidyr::spread in that case. Suppose the data looks like the following:
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c(rep("A", 3), rep("B", 3)),
n_flight = rep(seq(3), 2),
time = seq(19, 24),
velocity = rev(seq(65, 60))
)
Then the following will generate your outcome of interest. Each variable contributes its own copy of the ID column (time_ID, velocity_ID), so drop the extra ones afterwards.
lapply(
setdiff(names(df), c("ID", "n_flight")), function(x) {
df %>%
select(ID, n_flight, !!x) %>%
tidyr::spread(., key = "n_flight", value = x) %>%
setNames(paste(x, names(.), sep = "_"))
}
) %>%
bind_cols()
Let me know if this wasn't what you were going for.
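For what it's worth, newer versions of tidyr can do the same reshape in one call. A short sketch, assuming tidyr 1.0.0 or later (where pivot_wider() was introduced) and the same toy data as above; wide is just an illustrative name.

```r
library(tidyr)  # assumption: tidyr >= 1.0.0 for pivot_wider()

df <- data.frame(
  ID = c(rep("A", 3), rep("B", 3)),
  n_flight = rep(seq(3), 2),
  time = seq(19, 24),
  velocity = rev(seq(65, 60))
)

# One row per ID; each value column is spread over the flight numbers,
# and the new columns are named <variable>_<n_flight>
wide <- pivot_wider(df,
                    names_from = n_flight,
                    values_from = c(time, velocity))

print(wide)
```

This keeps a single ID column, so there is nothing to clean up afterwards.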
I am a novice R programmer. Below is the dataframe I am using.
I am currently running into a filtering problem with the full_join() from tidyverse.
library(tidyverse)
set.seed(1234)
df <- data.frame(
trial = rep(0:1, each = 8),
sex = rep(c('M','F'), 4),
participant = rep(1:4, 4),
x = runif(16, 1, 10),
y = runif(16, 1, 10))
df
I am currently doing the following operation to do the full_join()
df <- df %>% mutate(k = 1)
df <- df %>%
full_join(df, by = "k")
I am restricting the results to obtain the combination of points for the same participant between the trials
df2 <- filter(df, sex.x == sex.y, participant.x == participant.y, trial.x != trial.y)
df3 <- filter(df2, participant.x == 1)
df3
Here, at this step, I am running into trouble. I do not care about the order of the points. How do I condense the duplicates into one row?
Thank you
Depending on which columns you want to compare, use the duplicated() function. The first line below weeds out duplicates based on the first 5 columns. The second weeds out duplicates based on columns 7 to 11.
df3[!duplicated(df3[,1:5]),]
df3[!duplicated(df3[,7:11]),]
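Another way to look at it: each pair shows up twice only because the self-join produces both orientations (trial 0 with trial 1, and trial 1 with trial 0). A sketch that keeps just one orientation by replacing trial.x != trial.y with trial.x < trial.y. The cross join uses base merge() on the constant key k, and pairs is a name made up for the example.

```r
library(dplyr)

set.seed(1234)
df <- data.frame(
  trial = rep(0:1, each = 8),
  sex = rep(c("M", "F"), 4),
  participant = rep(1:4, 4),
  x = runif(16, 1, 10),
  y = runif(16, 1, 10),
  k = 1  # constant key, so the merge below pairs every row with every row
)

pairs <- merge(df, df, by = "k")

# trial.x < trial.y keeps each unordered pair of points exactly once
df3 <- pairs %>%
  filter(sex.x == sex.y,
         participant.x == participant.y,
         trial.x < trial.y,
         participant.x == 1)

print(df3)
```

With two points per trial for participant 1, this leaves the four distinct cross-trial pairs instead of eight rows.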
I have the following problem:
When using dplyr to mutate a numeric column after group_by(), it fails if a group contains only one value and that value is NaN.
If every group contains a number, the new column is correctly typed as dbl, but as soon as one group consists of only a NaN, it fails because dplyr types that group's result as lgl while all the other groups are dbl.
My first (and more general question) is:
Is there a way to tell dplyr, when using group_by(), to always define a column in a certain way?
Secondly, can someone help me with a hack for the problem explained in the MWE below:
# ERROR: This will provide the column defining error mentioned:
df <- data_frame(a = c(rep(LETTERS[1:2],4),"C"),g = c(rep(LETTERS[5:7],3)), x = c(7, 8,3, 5, 9, 2, 4, 7,8)) %>% tbl_df()
df <- df %>% group_by(a) %>% mutate_each(funs(sd(., na.rm=TRUE)),x)
df <- df %>% mutate(Winsorise = ifelse(x>2,2,x))
# NO ERROR (as no groups have single entry with NaN):
df2 <- data_frame(a = c(rep(LETTERS[1:2],4),"C"),g = c(rep(LETTERS[5:7],3)), x = c(7, 8,3, 5, 9, 2, 4, 7,8)) %>% tbl_df()
df2 <- df2 %>% group_by(a) %>% mutate_each(funs(sd(., na.rm=TRUE)),x)
# Update the Group for the row with an NA - Works
df2[9,1] <- "A"
df2 <- df2 %>% mutate(Winsorise = ifelse(x>3,3,x))
# REASON FOR ERROR: What happens for groups with one member = NaN, although we want the winsorise column to be dbl not lgl:
df3 <- data_frame(g = "A",x = NaN)
df3 <- df3 %>% mutate(Winsorise = ifelse(x>3,3,x))
The reason is, as you rightly pointed out with df3, that ifelse() returns a logical vector when its test is entirely NA, and x > 2 is NA when x is NaN/NA. The mutate result for that group is therefore lgl.
To circumvent this, cast your answer as numeric:
df <- df %>% mutate(Winsorise = as.numeric(ifelse(x>2,2,x)))
Perhaps @hadley could shed some light on why the mutate result is cast as lgl?
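A small sketch of where the lgl comes from, plus two alternatives that stay numeric without the explicit cast. It assumes a dplyr version that provides if_else() and tibble(); the toy tibble and the name winsorised are made up for the example.

```r
library(dplyr)

x <- NaN

# The test x > 3 is NA, and ifelse() then returns a bare logical NA
typeof(ifelse(x > 3, 3, x))

# if_else() takes its type from the true/false values, so it stays double
typeof(if_else(x > 3, 3, x))

# For capping values at a ceiling, pmin() avoids the branching altogether
typeof(pmin(x, 3))

# Grouped version: group "A" holds only a NaN, group "B" holds real values
winsorised <- tibble(g = c("A", "B", "B"), x = c(NaN, 1, 5)) %>%
  group_by(g) %>%
  mutate(Winsorise = pmin(x, 3)) %>%
  ungroup()

print(winsorised)
```

Either alternative keeps every group as dbl, so the groups can be combined without a type conflict.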