Compare and join dataframes in R

I have two dataframes that have mostly the same variables and I want to compare the cases of the two dataframes. I want to create a new dataframe with all the cases that are the same in df1 and df2.
Cases are considered the same if all values of the variables present in both dataframes are equal. There are two exceptions: for "Age", values may differ by at most 1 year, and for "Time", a difference of up to 1 hour is acceptable.
ID1 <- c(100, 101, 102, 103)
V1 <- c(1, 1, 2, 1)
V2 <- c(1, 2, 3, 4)
Age <- c(25, 16, 74, 46)
Time <- c("9:30", "13:25", "17:20", "7:45")
X <- c (1, 3, 4, 1)
df1 <- data.frame(ID1, V1, V2, Age, Time, X)
ID2 <- c(250, 251, 252, 253)
V1 <- c(1, 2, 1, 2)
V2 <- c(1, 2, 2, 4)
Age <- c(26, 55, 16, 80)
Time <- c("9:30", "12:00", "12:55", "18:00")
Y <- c (3, 2, 1, 1)
df2 <- data.frame(ID2, V1, V2, Age, Time, Y)
In this example ID1=100 and ID2=250 are the same and also ID1=101 and ID2=252.
I'd like to have a new dataframe-output
like this one
Note that it does not matter whether the values for "Age" and "Time" are taken from df1 or df2. The important variables are X and Y.
I hope someone can help me out with this problem. Thanks a lot in advance :)
Kind regards
Philip

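In base R (this matches on V1 and V2 only and takes Age and Time from df1; it does not check the Age and Time tolerances):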
df3 <- merge(df1, subset(df2, select = -c(Age, Time)), by = c("V1", "V2"))
df3[,c("ID1", "ID2", "V1", "V2", "Age", "Time", "X", "Y")]
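A minimal sketch of one way to also enforce the Age and Time tolerances from the question, in base R. The helper to_minutes and the suffixes .1/.2 are illustrative names, not part of the original answer, and Time is assumed to be in H:MM format:

```r
df1 <- data.frame(ID1 = c(100, 101, 102, 103), V1 = c(1, 1, 2, 1),
                  V2 = c(1, 2, 3, 4), Age = c(25, 16, 74, 46),
                  Time = c("9:30", "13:25", "17:20", "7:45"),
                  X = c(1, 3, 4, 1))
df2 <- data.frame(ID2 = c(250, 251, 252, 253), V1 = c(1, 2, 1, 2),
                  V2 = c(1, 2, 2, 4), Age = c(26, 55, 16, 80),
                  Time = c("9:30", "12:00", "12:55", "18:00"),
                  Y = c(3, 2, 1, 1))

# Hypothetical helper: convert "H:MM" strings to minutes since midnight
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Exact match on V1 and V2; Age and Time from both sides get suffixes
m <- merge(df1, df2, by = c("V1", "V2"), suffixes = c(".1", ".2"))

# Keep only pairs within 1 year of age and 60 minutes of time
keep <- abs(m$Age.1 - m$Age.2) <= 1 &
  abs(to_minutes(m$Time.1) - to_minutes(m$Time.2)) <= 60

# Age and Time are taken from df1 here; the question allows either source
df3 <- m[keep, c("ID1", "ID2", "V1", "V2", "Age.1", "Time.1", "X", "Y")]
names(df3)[5:6] <- c("Age", "Time")
df3
```

With the sample data this keeps the pairs ID1=100/ID2=250 and ID1=101/ID2=252, the two matches named in the question. If more variables are shared between the dataframes, they would need to be added to the by argument.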

Related

Bar plot in loop for each observation

Here is sample data where ID is a categorical variable.
ID <- c(12, 34, 560, 45, 235)
W1 <- c(0, 5, 7, 6, 0)
W2 <- c(7, 8, 9, 5, 2)
W3 <- c(0, 0, 3, 5, 9)
df <- data.frame(ID, W1, W2, W3)
df$ID <- as.factor(df$ID)
I want to draw five bar plots for each of these IDs using the frequency data for the three weeks W1:W3. In the actual dataset, I have 30+ weeks and around 150 IDs, hence the intention here is to do this efficiently. Nothing fancy, but ggplot would be ideal as I would need to manipulate some aesthetics.
How can I do this using a loop and save the images in one file (PDF)?
Thanks for your help!
This sort of problem is usually a data reformatting problem. See reshaping data.frame from wide to long format. After reshaping the data, the plot is faceted by ID, avoiding loops.
library(ggplot2)
ID <- c(12, 34, 560, 45, 235)
W1 <- c(0, 5, 7, 6, 0)
W2 <- c(7, 8, 9, 5, 2)
W3 <- c(0, 0, 3, 5, 9)
df <- data.frame(ID, W1, W2, W3)
df$ID <- as.factor(df$ID)
df[-1] <- lapply(df[-1], as.integer)
df |>
tidyr::pivot_longer(-ID, names_to = "Week", values_to = "Frequency") |>
ggplot(aes(Week, Frequency, fill = Week)) +
geom_col() +
scale_y_continuous(breaks = scales::pretty_breaks()) +
facet_wrap(~ ID) +
theme_bw(base_size = 16)
Created on 2022-09-30 with reprex v2.0.2
Edit
If there is a mix of week numbers with 1 and 2 digits, the lexicographic order is not the numbers' order. For instance, after W1 comes W11, not W2. Package stringr function str_sort sorts by numbers when argument numeric = TRUE.
In the example below I reuse the data, changing W2 to W11. The correct bar order should therefore be W1, W3, W11.
library(ggplot2)
library(stringr)
ID <- c(12, 34, 560, 45, 235)
W1 <- c(0, 5, 7, 6, 0)
W11 <- c(7, 8, 9, 5, 2)
W3 <- c(0, 0, 3, 5, 9)
df <- data.frame(ID, W1, W11, W3)
df$ID <- as.factor(df$ID)
df[-1] <- lapply(df[-1], as.integer)
df |>
tidyr::pivot_longer(-ID, names_to = "Week", values_to = "Frequency") |>
dplyr::mutate(Week = factor(Week, levels = str_sort(unique(Week), numeric = TRUE))) |>
ggplot(aes(Week, Frequency, fill = Week)) +
geom_col() +
scale_y_continuous(breaks = scales::pretty_breaks()) +
facet_wrap(~ ID) +
theme_bw(base_size = 16)
Created on 2022-10-01 with reprex v2.0.2
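With around 150 IDs a single faceted figure may become unreadable, so here is a rough sketch of the loop-and-PDF approach the question asks about: one page per ID in a single PDF. It assumes one row per ID; the output file name and the page size are arbitrary choices, not something from the question.

```r
library(ggplot2)

df <- data.frame(ID = factor(c(12, 34, 560, 45, 235)),
                 W1 = c(0, 5, 7, 6, 0),
                 W2 = c(7, 8, 9, 5, 2),
                 W3 = c(0, 0, 3, 5, 9))

# All columns except ID are treated as weeks, in their column order
week_cols <- setdiff(names(df), "ID")

# Hypothetical output location; every print() below adds one page
out_file <- file.path(tempdir(), "barplots_by_id.pdf")
pdf(out_file, width = 7, height = 5)
for (i in seq_len(nrow(df))) {
  d <- data.frame(Week = factor(week_cols, levels = week_cols),
                  Frequency = unlist(df[i, week_cols], use.names = FALSE))
  p <- ggplot(d, aes(Week, Frequency, fill = Week)) +
    geom_col() +
    ggtitle(paste("ID", df$ID[i])) +
    theme_bw(base_size = 16)
  print(p)  # plots must be printed explicitly inside a loop
}
dev.off()
```

Fixing the factor levels to the column order also sidesteps the W1/W11/W2 ordering issue, provided the columns are already in week order.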

Adding rows to make a full long dataset for longitudinal data analysis

I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. In order to perform certain analyses I need to make sure that each person has the same number of rows, even if some rows consist of NAs because they did not complete a certain time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that would automatically add rows to the dataset conditionally on whether the participant has had 1, 2 or 3 visits. Ideally it would make rest of data all NAs while copying Participant_ID and site_code but if not possible I would be satisfied just with creating the right number of rows.
We could use fill after doing a complete (assuming ExpandedDataset holds the data shown in the question):
library(dplyr)
library(tidyr)
ExpandedDataset %>%
complete(P_ID, Event_code) %>%
fill(Site_code)
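For comparison, a base R sketch of the same idea that needs no extra packages. It assumes each participant belongs to exactly one site; the names dat, grid, sites and full are illustrative.

```r
dat <- structure(list(Values = c(23, 24, 45, 12, 34, 23),
                      P_ID = c(1, 1, 2, 2, 2, 3),
                      Event_code = c(1, 2, 1, 2, 3, 1),
                      Site_code = c(1, 1, 3, 3, 3, 1)),
                 class = "data.frame", row.names = c(NA, -6L))

# Every combination of participant and event
grid <- expand.grid(P_ID = unique(dat$P_ID),
                    Event_code = sort(unique(dat$Event_code)))

# One site per participant (assumption: site does not vary within P_ID)
sites <- unique(dat[c("P_ID", "Site_code")])

# Left join the observed values onto the full grid; missing visits become NA
full <- merge(grid, dat[c("P_ID", "Event_code", "Values")],
              by = c("P_ID", "Event_code"), all.x = TRUE)

# Attach the site per participant rather than filling downwards
full <- merge(full, sites, by = "P_ID")
full <- full[order(full$P_ID, full$Event_code),
             c("Values", "P_ID", "Event_code", "Site_code")]
rownames(full) <- NULL
full
```

Joining the site by participant avoids a possible pitfall of an ungrouped fill: if a participant's first event were missing, a downward fill could pick up the previous participant's site.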
I came up with rather long code, but you could wrap it in a function to make it easier to use:
Here's your dataframe:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
Event_code = c(1, 2, 1, 2, 3, 1),
Site_code = c(1, 1, 2, 2, 2, 1))
How many records do you have per ID?
values <- table(df$ID)
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have fewer records than the maximum?
uncompliant <- names(which(values<target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values<target)]
So now, let's create the vectors of the data frame we will add to your original one. First, IDs:
IDs <- vector()
for(i in 1:length(rowcount)){
y <- rep(uncompliant[i], target - rowcount[i])
IDs <- c(IDs, y)
}
And now, the sitecodes:
SC <- vector()
for(i in 1:length(rowcount)){
y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
SC <- c(SC, y)
}
Finally, a data frame with the values we will introduce:
introduce <- data.frame(ID = IDs, Event = rep(NA, length(IDs)),
Event_code = rep(NA, length(IDs)),
Site_code = SC)
Combine the original dataframe with the new values to be added and sort it so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(final$ID), ]

R data.table keys and column names. Harmonisation

I am trying to set keys to a data.table and keep the original column names on the second row. Everything I have tried so far changes the column names to keys and erases the original variables. I have ten data.tables to merge and all the variables have different names like in the example. So I made keys but would like to keep the originals as well before harmonisation just to be sure.
library(tidyverse)
library(lubridate)
library(forcats)
library(stringr)
library(data.table)
library(rio)
library(dplyr)
1. Keys
keys1 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
keys2 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
2. data.table example with variable names.
TD3 = data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4), q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))
TD3
TD4 = data.table(q128 = c(1, 1, 1, 2), q129 = c(1, 3, 2, 999), q130 = c(0.9, 3.1, NA, 9.0), q131 = c(58, 60, 45, NA))
TD4
I'm not sure this is really the data structure you want to have, that is to have mixed variable types like r2evans said. However...this solution works. Just put all your little data.tables into a list and voila.
I noticed that keys1 and keys2 are identical, so I just used one of them. If they should be different keys for each they can also be listed.
keys1 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
TD <- list()
TD[[1]] = data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4), q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))
TD[[2]] = data.table(q128 = c(1, 1, 1, 2), q129 = c(1, 3, 2, 999), q130 = c(0.9, 3.1, NA, 9.0), q131 = c(58, 60, 45, NA))
TD <- lapply(TD, FUN = function(x){
oldcolumns <- colnames(x)
td <- data.table(
'V1' = oldcolumns[1],
'V2' = oldcolumns[2],
'V3' = oldcolumns[3],
'V4' = oldcolumns[4]
)
colnames(td) <- keys1
colnames(x) <- keys1
x <- rbind(td, x)
return(x)
})
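Because binding the old names in as a data row turns every column into character, an alternative is to keep the original names in a separate lookup table and rename the columns in place. This is only a sketch of that alternative, not what the question literally asked for; name_map is a hypothetical name.

```r
library(data.table)

keys1 <- c("SDC_GENDER", "SDC_CHILD_NB", "LAB_CRP", "PM_HIP")
TD3 <- data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4),
                  q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))

# Record the original-to-harmonised mapping before renaming
name_map <- data.table(original = names(TD3), harmonised = keys1)

# Rename by reference; the column types stay numeric
setnames(TD3, old = name_map$original, new = name_map$harmonised)

name_map
TD3
```

With ten tables, one such mapping per table could be kept in a list alongside the data, so the original variable names remain available after the merge.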

Leave one out cross validation by leaving out two ID during the training process

I have a dataframe df
df<-structure(list(ID = c(4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5,
6, 6, 6, 6, 8, 8, 8, 9, 9), Y = c(2268.14043972082, 2147.62290922552,
2269.1387550775, 2247.31983098201, 1903.39138268307, 2174.78291538358,
2359.51909126411, 2488.39004804939, 212.851575751527, 461.398994384333,
567.150629704352, 781.775113821961, 918.303706148872, 1107.37695799186,
1160.80594193377, 1412.61328924168, 1689.48879626486, 685.154353165934,
574.088067465695, 650.30821636616, 494.185166497016, 436.312162090908
), P = c(1750.51986303926, 1614.11541634798, 951.847023338079,
1119.3682884872, 1112.38984390156, 1270.65773075982, 1234.72262170166,
1338.46096616983, 1198.95775346458, 1136.69287367165, 1265.46480803983,
1364.70149818063, 1112.37006707489, 1346.49240261316, 1740.56677791104,
1410.99217295647, 1693.18871380948, 275.447173420805, 396.449789014179,
251.609239829704, 215.432550271042, 55.5336257666349), A = c(49,
50, 51, 52, 53, 54, 55, 56, 1, 2, 3, 4, 5, 14, 15, 16, 17, 163,
164, 165, 153, 154), TA = c(9.10006221322572, 7.65505467142961,
8.21480062559674, 8.09251754304318, 8.466220758789, 8.48094407814006,
8.77304120569444, 8.31727518543397, 8.14410265791868, 8.80921738865237,
9.04091478341757, 9.66233618146246, 8.77015716015164, 9.46037931956657,
9.59702379240667, 10.1739258740118, 9.39524442215692, -0.00568604734662462,
-2.12940164413048, -0.428603434930109, 1.52337963973006, -1.04714984064565
), TS = c(9.6499861763085, 7.00622420539595, 7.73511170298675,
7.68006974050443, 8.07442411510912, 8.27687965909096, 8.76025039592727,
8.3345638889156, 9.23658956753677, 8.98160722605782, 8.98234210211611,
9.57066566368204, 8.74444401914267, 8.98719629775988, 9.18169205278566,
9.98225438314085, 9.56196773059615, 5.47788158053928, 2.58106090926808,
3.22420704848299, 1.36953555753786, 0.241334267522977), R = c(11.6679680423377,
11.0166459173372, 11.1851268491296, 10.7404563561694, 12.1054055597684,
10.9551321815546, 11.1975918244469, 10.7242192465965, 10.1661703705992,
11.4840412725324, 11.1248456370953, 11.2529612597628, 10.7694642397996,
12.3300887767583, 12.0478558531771, 12.3212362249214, 11.5650773932264,
9.56070414783612, 9.61762902218185, 10.2076240621201, 11.8234628013552,
10.9184029778985)), .Names = c("ID", "Y", "P", "A", "TA", "TS",
"R"), na.action = structure(77:78, .Names = c("77", "78"), class = "omit"), row.names = c(NA,
22L), class = "data.frame")
I am currently doing a linear regression in leave-one-out cross-validation mode. In other words, in each iteration I remove one site from the training set and test the model on the site left out. The procedure is shown below:
df$prediction <- NA
for(id in unique(df$ID)){
train.df <- df[df$ID != id,]
test.df <- df[df$ID == id, c("P", "A", "TA", "TS","R")]
lm.df<- glm(Y ~ P+A+TA+TS+R, data=train.df)
step.df<- step(lm.df, direction = "backward")
df.pred = predict(object = step.df, newdata = test.df)
df$prediction[df$ID== id] <- df.pred
}
However, I would like to remove 2 IDs for each iteration during the cross validation instead of one. Therefore, my test set will contain two IDs instead of one every time. Anyone know how I could do it?
If you change == into %in% and unique(df$ID) into split(unique(df$ID), c(1,1,2,2,3)) it seems to be working. Essentially, in each iteration you pass two ids instead of one, so the test.df set contains those two.
See this:
df$prediction <- NA
for(id in split(unique(df$ID), c(1,1,2,2,3))){
print(id)
train.df <- df[!df$ID %in% id,]
test.df <- df[df$ID %in% id, c("P", "A", "TA", "TS","R")]
lm.df<- glm(Y ~ P+A+TA+TS+R, data=train.df)
step.df<- step(lm.df, direction = "backward",trace=0)
df.pred = predict(object = step.df, newdata = test.df)
df$prediction[df$ID %in% id] <- df.pred
}
Output:
[1] 4 5
[1] 6 8
[1] 9
I have set trace to zero above so that it only prints the ids passed in the loop. As you can see you have two instead of one (apart from the last one obviously). split splits the vector unique(df$ID) in 2-element pieces which we can then use within the loop.
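The grouping vector c(1,1,2,2,3) is hard-coded for five IDs. A small sketch of how the groups could be built for any number of IDs and any group size; ids, k and folds are illustrative names:

```r
ids <- c(4, 5, 6, 8, 9)   # stands in for unique(df$ID)
k <- 2                    # number of IDs to leave out per iteration

# ceiling(seq_along(ids) / k) yields 1, 1, 2, 2, 3 for five IDs and k = 2
folds <- split(ids, ceiling(seq_along(ids) / k))
folds
```

The loop can then iterate over folds instead of the hard-coded split. To randomise which IDs end up together, ids could be shuffled with sample() first.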

Lookup value in one dataframe based on column name stored as a value in another dataframe

Please see the reproducible (cut + paste) example below. The actual data set has over 4000 serial observations on 11000 people. I need to create columns A, B, C, etc. showing the NUMBER of the "Drug" variables X,Y, Z etc. that corresponds to the first occurrence of a particular value of a "Disease" variable. The numbers refer to actions that were taken with particular drugs (start, stop, increase dose etc.) The "disease" variable refers to whether the disease flared or not in a disease that has many stages including flares and remissions.
For example:
Animal <- c("aardvark", "1", "cheetah", "dromedary", "eel", "1", "bison", "cheetah", "dromedary",
"eel")
Plant <- c("apple_tree", "blossom", "cactus", "1", "bronze", "apple_tree", "bronze", "cactus",
"dragonplant", "1")
Mineral <- c("amber", "bronze", "1", "bronze", "emerald", "1", "bronze", "bronze", "diamond",
"emerald")
Bacteria <- c("acinetobacter", "1", "1", "d-strep", "bronze", "acinetobacter", "bacillus",
"chlamydia", "bronze", "enterobacter" )
AnimalDrugA <- c(1, 11, 12, 13, 14, 15, 16, 17, 18, 19)
AnimalDrugB <- c(20, 1, 22, 23, 24, 25, 26, 27, 28, 29)
PlantDrugA <- c(301, 302, 1, 304, 305, 306, 307, 308, 309, 310)
PlantDrugB <- c(401, 402, 1, 404, 405, 406, 407, 408, 409, 410)
MineralDrugA <- c(1, 2, 3, 4, 1, 6, 7, 8, 9, 10)
MineralDrugB <- c(11, 12, 13, 1, 15, 16, 17, 18, 19, 20)
BacteriaDrugA <- c(1, 2, 3, 4, 5, 6 , 7, 8, 9, 1)
BacteriaDrugB <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
dummy_id <- c(1001, 2002, 3003, 4004, 5005, 6006, 7007, 8008, 9009, 10101)
Elements <- data.frame(dummy_id, Animal, Plant, Mineral, Bacteria, AnimalDrugA, AnimalDrugB,
PlantDrugA, PlantDrugB, MineralDrugA, MineralDrugB, BacteriaDrugA, BacteriaDrugB)
ds <- Elements[,order(names(Elements))]
ds #Got it in alphabetical order... The real data set will be re-ordered chronologically
#Now I want the first occurrence of the word "bronze" for each id
# for each subject 1 through 10. (That is, "bronze" corresponds to start of disease flare.)
first.bronze <- colnames(ds)[apply(ds,1,match,x="bronze")]
first.bronze
#Now, I want to find the number in the DrugA, DrugB variable that corresponds to the first
#occurrence of bronze.
#Using the alphabetically ordered data set, the answer should be:
#dummy_id DrugA DrugB
#1... NA NA
#2... 2 12
#3... NA NA
#4... 4 1
#5... 5 6
#6... NA NA
#7... 7 17
#8... 8 18
#9... 9 2
#10... NA NA
#Note that all first occurrences of "bronze"
# are in Mineral or Bacteria.
#As a first step, join first.bronze to the ds
ds$first.bronze <- first.bronze
ds
#Make a new ds where those who have an NA for first.bronze are excluded:
ds2 <- ds[complete.cases(ds$first.bronze),]
ds2
# Create a template data frame
out <- data.frame(matrix(nr = 1, nc = 3))
colnames(out) <- c("Form Number", "DrugA", "DrugB") # Gives correct column names
out
#Then grow the data frame...yes I realize potential slowness of computation
test <- for(i in ds2$first.bronze){
data <- rbind(colnames(ds2)[grep(i, names(ds2), ignore.case = FALSE, fixed = TRUE)])
colnames(data) <- c("Form Number", "DrugA", "DrugB") # Gives correct column names
out <- rbind(out, data)
}
out
#Then delete the first row of NAs
out <- na.omit(out)
out
#Then add the appropriate dummy_ids
dummy_id <- ds2$dummy_id
out_with_ids <- as.data.frame(cbind(dummy_id, out))
out_with_ids
Now I am stuck. I have the names of the columns from ds2 listed as values of DrugA and DrugB in the out_with_ids dataset. I have searched through Stack Overflow thoroughly, but solutions based on match, merge, replace, and the data.table package don't seem to work.
Thank you!
I think the problem here is data format. May I suggest you store it in "long" table, like this:
library(data.table)
dt <- data.table(dummy_id = rep(dummy_id, 4),
type = rep(c("Animal", "Bacteria", "Mineral", "Plant"), each = 10),
name = c(Animal, Bacteria, Mineral, Plant),
drugA = c(AnimalDrugA, BacteriaDrugA, MineralDrugA, PlantDrugA),
drugB = c(AnimalDrugB, BacteriaDrugB, MineralDrugB, PlantDrugB))
Then it is much easier to filter and do other operations. For example,
dt[name == "bronze"][order(dummy_id)]
Frankly I'm not sure I understand what you want to achieve in the end.
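If the goal is the table listed in the question (DrugA and DrugB at the first occurrence of "bronze" per id), the long format gets there in a few steps. The sketch below uses base R data frames rather than data.table and treats the alphabetical order of type as the "first occurrence" order, as the question does for its example; with real data that ordering column would presumably be chronological.

```r
Animal <- c("aardvark", "1", "cheetah", "dromedary", "eel", "1", "bison",
            "cheetah", "dromedary", "eel")
Plant <- c("apple_tree", "blossom", "cactus", "1", "bronze", "apple_tree",
           "bronze", "cactus", "dragonplant", "1")
Mineral <- c("amber", "bronze", "1", "bronze", "emerald", "1", "bronze",
             "bronze", "diamond", "emerald")
Bacteria <- c("acinetobacter", "1", "1", "d-strep", "bronze", "acinetobacter",
              "bacillus", "chlamydia", "bronze", "enterobacter")
AnimalDrugA <- c(1, 11, 12, 13, 14, 15, 16, 17, 18, 19)
AnimalDrugB <- c(20, 1, 22, 23, 24, 25, 26, 27, 28, 29)
PlantDrugA <- c(301, 302, 1, 304, 305, 306, 307, 308, 309, 310)
PlantDrugB <- c(401, 402, 1, 404, 405, 406, 407, 408, 409, 410)
MineralDrugA <- c(1, 2, 3, 4, 1, 6, 7, 8, 9, 10)
MineralDrugB <- c(11, 12, 13, 1, 15, 16, 17, 18, 19, 20)
BacteriaDrugA <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1)
BacteriaDrugB <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
dummy_id <- c(1001, 2002, 3003, 4004, 5005, 6006, 7007, 8008, 9009, 10101)

# Long format: one row per id and type
long <- data.frame(
  dummy_id = rep(dummy_id, 4),
  type = rep(c("Animal", "Bacteria", "Mineral", "Plant"), each = 10),
  name = c(Animal, Bacteria, Mineral, Plant),
  drugA = c(AnimalDrugA, BacteriaDrugA, MineralDrugA, PlantDrugA),
  drugB = c(AnimalDrugB, BacteriaDrugB, MineralDrugB, PlantDrugB)
)

# Rows where the flare marker occurs, ordered by id and then by type
hits <- long[long$name == "bronze", ]
hits <- hits[order(hits$dummy_id, hits$type), ]

# First hit per id
first <- hits[!duplicated(hits$dummy_id), c("dummy_id", "type", "drugA", "drugB")]

# Left join so that ids without any "bronze" get NA
result <- merge(data.frame(dummy_id = dummy_id), first,
                by = "dummy_id", all.x = TRUE)
result
```

For the sample data this gives the DrugA and DrugB values listed in the question, with NA for the ids that never have "bronze".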
