Cross joining two data frames without repeating values in R

I have two dataframes
DataFrame1 <- data.frame(StudentId = c(1:20), Subject = c(rep("Algebra", 4), rep("Geometry", 4), rep("English", 4), rep("Zoology", 4), rep("Botany", 4)), CGPA = c(random::randomNumbers(20, 70, 100, 1)), Country = c(rep("USA", 4), rep("UK", 4), rep("Germany", 4), rep("France", 4), rep("Japan", 4)))
and
DataFrame2 <- data.frame(StudentId = c(1:10), State = c(rep("NYC", 2), rep("Illinois", 2), rep("Texas", 2), rep("Virginia", 2), rep("Florida", 2)), Age = c(random::randomNumbers(10, 16, 20, 1)), Gender = c(rep("Male", 3), rep("Female", 3), rep("Male", 2), rep("Female", 2)))
I can merge the above two using inner join as
merge(DataFrame1, DataFrame2)
How can I cross join the two data frames without repeating values?

Try merge(DataFrame1, DataFrame2, all = TRUE), which performs a full outer join on the shared StudentId column.

Try this for a cross join:
knitr::kable(merge(x = DataFrame1, y = DataFrame2, by = NULL))
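To make the difference between the joins concrete, here is a minimal sketch on two small, made-up data frames (a and b are illustrative names, not from the question). With by = NULL, merge() returns every combination of rows, and columns that share a name are disambiguated with the .x/.y suffixes.

```r
# Two tiny illustrative data frames that share the column name "id"
a <- data.frame(id = 1:3, subject = c("Algebra", "Geometry", "English"))
b <- data.frame(id = 1:2, state = c("Texas", "Florida"))

inner <- merge(a, b)              # inner join on the shared column "id"
cross <- merge(a, b, by = NULL)   # Cartesian product: every row of a with every row of b

nrow(inner)   # one row per id present in both frames
nrow(cross)   # nrow(a) * nrow(b)
names(cross)  # the shared name appears as id.x and id.y
```

A cross join repeats each row of one frame once per row of the other by definition, so if repeated values are unwanted, an inner or outer join on StudentId is probably the better fit.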


Match two equal-sized data.frames and then filter results on a third

I have the following three data.frame:
area1 <- data.frame(ua = c(1, 2, 3),
sub_ua1 = c(0, 100, 0),
sub_ua2 = c(100, 100, 100),
sub_ua3 = c(100, 0, 0))
area2 <- data.frame(ua = c(1, 2, 3),
sub_ua1 = c(100, 100, 0),
sub_ua2 = c(100, 100, 0),
sub_ua3 = c(100, 0, 0))
df <- data.frame(ua = c(rep(1, 5), rep(2, 4), rep(3, 7)),
subua = c(rep("sub_ua1", 3), "sub_ua2", "sub_ua3",
"sub_ua1", "sub_ua1", "sub_ua2", "sub_ua3",
"sub_ua1", c(rep("sub_ua2", 2)), rep("sub_ua3", 4)),
value = c(rep(2, 3), rep(4, 3), rep(2, 2), rep(1, 8)))
What I'm trying to do is, for each ua, keep only the sub_ua columns (1 to 3) whose value is 100 in both area1 and area2. For example, the first value of sub_ua2 is 100 in both area1 and area2, so this is a "sub_ua" I want.
Then, once I have this list of "sub_ua" per "ua", I want to keep only those combinations in df to obtain the filtered value.
The results should be:
For ua == 1, get both sub_ua2 and sub_ua3
For ua == 2, get both sub_ua1 and sub_ua2
For ua == 3, get sub_ua2
EDIT:
I was using the following approach to obtain a data.frame of rows and columns indices:
library(prodlim)
# Indices for data frame 1 and 2 for values = 100
indices_1 <- which(area1 == 100, arr.ind = TRUE)
indices_2 <- which(area2 == 100, arr.ind = TRUE)
# Rows where indices are matched between the two data frame indices
indices_rows <- na.omit(row.match(as.data.frame(indices_1), as.data.frame(indices_2)))
# Row-column indices where both data frames have values of 100
indices_2[indices_rows, ]
I just don't know how to use this to filter the final dataset df.
If I understood correctly this should work:
area1 <- data.frame(ua = c(1, 2, 3),
sub_ua1 = c(0, 100, 0),
sub_ua2 = c(100, 100, 100),
sub_ua3 = c(100, 0, 0))
area2 <- data.frame(ua = c(1, 2, 3),
sub_ua1 = c(100, 100, 0),
sub_ua2 = c(100, 100, 0),
sub_ua3 = c(100, 0, 0))
library(dplyr)
library(tidyr)
area1 %>%
left_join(area2, by = "ua", suffix = c(".area1",".area2")) %>%
pivot_longer(cols = -ua,names_to = "var",values_to = "value") %>%
separate(col = var,into = c("var","area"),sep = "\\.") %>%
pivot_wider(names_from = area,values_from = value) %>%
filter(area1 == 100, area2 == 100) %>%
select(-starts_with("area"))
# A tibble: 4 x 2
ua var
<dbl> <chr>
1 1 sub_ua2
2 1 sub_ua3
3 2 sub_ua1
4 2 sub_ua2
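The remaining step, filtering df by the matched pairs, could be sketched in base R as follows. This assumes area1 and area2 list the same ua values in the same row order; the names both, keep and filtered are introduced here for illustration only.

```r
area1 <- data.frame(ua = c(1, 2, 3),
                    sub_ua1 = c(0, 100, 0),
                    sub_ua2 = c(100, 100, 100),
                    sub_ua3 = c(100, 0, 0))
area2 <- data.frame(ua = c(1, 2, 3),
                    sub_ua1 = c(100, 100, 0),
                    sub_ua2 = c(100, 100, 0),
                    sub_ua3 = c(100, 0, 0))
df <- data.frame(ua = c(rep(1, 5), rep(2, 4), rep(3, 7)),
                 subua = c(rep("sub_ua1", 3), "sub_ua2", "sub_ua3",
                           "sub_ua1", "sub_ua1", "sub_ua2", "sub_ua3",
                           "sub_ua1", rep("sub_ua2", 2), rep("sub_ua3", 4)),
                 value = c(rep(2, 3), rep(4, 3), rep(2, 2), rep(1, 8)))

# Logical matrix: TRUE where both data frames hold 100 (first column, ua, dropped)
both <- area1[, -1] == 100 & area2[, -1] == 100

# Turn the TRUE cells into (ua, subua) pairs
idx  <- which(both, arr.ind = TRUE)
keep <- data.frame(ua = area1$ua[idx[, "row"]],
                   subua = colnames(both)[idx[, "col"]])

# Keep only the rows of df whose (ua, subua) pair is in the list
filtered <- merge(df, keep, by = c("ua", "subua"))
filtered
```

With the sample data, no rows for ua == 3 survive, because sub_ua2 is 100 in area1 but 0 in area2 for that ua.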

Adding rows to make a full long dataset for longitudinal data analysis

I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. To perform certain analyses, I need every person to have the same number of rows, even if some rows consist of NAs because the person did not complete a given time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that automatically adds rows to the dataset depending on whether the participant has had 1, 2 or 3 visits. Ideally the added rows would copy P_ID and Site_code and set the remaining columns to NA, but if that is not possible I would be satisfied with just creating the right number of rows.
We could use fill() after complete() (ExpandedDataset here stands for the data frame from the question):
library(dplyr)
library(tidyr)
ExpandedDataset %>%
complete(P_ID, Event_code) %>%
fill(Site_code)
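A self-contained variant of the same idea on the sample data might look like the sketch below. The name dat and the grouped fill are additions made here: grouping by P_ID keeps a site code from leaking from one participant into the next if a participant's first visit is the missing one.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (dat is an illustrative name)
dat <- data.frame(Values = c(23, 24, 45, 12, 34, 23),
                  P_ID = c(1, 1, 2, 2, 2, 3),
                  Event_code = c(1, 2, 1, 2, 3, 1),
                  Site_code = c(1, 1, 3, 3, 3, 1))

full <- dat %>%
  complete(P_ID, Event_code) %>%              # one row per participant/event pair
  group_by(P_ID) %>%
  fill(Site_code, .direction = "downup") %>%  # copy the site code within a participant
  ungroup()

full
```

Note that complete() moves P_ID and Event_code to the front, so the column order differs from the desired output even though the content matches.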
I came up with fairly long code, but you could wrap it in a function to make it easier to use.
Here's your data frame:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
Event_code = c(1, 2, 1, 2, 3, 1),
Site_code = c(1, 1, 2, 2, 2, 1))
How many records do you have per ID? (ID is converted to a factor first, because summary() only returns counts for factors.)
values <- summary(factor(df$ID))
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have fewer records than the maximum?
uncompliant <- names(which(values<target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values<target)]
So now, let's create the vectors of the data frame we will add to your original one. First, IDs:
IDs <- vector()
for(i in 1:length(rowcount)){
y <- rep(uncompliant[i], target - rowcount[i])
IDs <- c(IDs, y)
}
And now, the sitecodes:
SC <- vector()
for(i in 1:length(rowcount)){
y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
SC <- c(SC, y)
}
Finally, a data frame with the values we will introduce:
introduce <- data.frame(ID = IDs, Event = rep(NA, length(IDs)),
Event_code = rep(NA, length(IDs)),
Site_code = SC)
Combine the original data frame with the new rows and sort it so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(final$ID), ]
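The steps above could be wrapped in a function along these lines. This is only a sketch: pad_visits() is a hypothetical name, the column names follow the df defined in this answer, and it assumes each ID has a single site code.

```r
# pad_visits() is a hypothetical helper; it adds NA rows until every ID
# has as many rows as the ID with the most records
pad_visits <- function(d) {
  counts <- lengths(split(d$ID, d$ID))     # named vector: records per ID
  target <- max(counts)
  short  <- names(counts)[counts < target] # IDs with fewer records than the maximum
  pads <- lapply(short, function(id) {
    n_missing <- target - counts[[id]]
    data.frame(ID = rep(id, n_missing),
               Event = NA_character_,
               Event_code = NA_real_,
               Site_code = unique(d$Site_code[d$ID == id])[1],
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, c(list(d), pads))
  out <- out[order(out$ID), ]              # padded rows sort after the original ones
  rownames(out) <- NULL
  out
}

df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
                 Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
                 Event_code = c(1, 2, 1, 2, 3, 1),
                 Site_code = c(1, 1, 2, 2, 2, 1))
final <- pad_visits(df)
final
```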

How to add unique IDs for aligned sequences of sequential numbers?

I have a data table of integer coordinates that align between two groups labelled A and B. For example:
dt_long <- data.table(LABEL_A = c(rep("A", 20), rep("A", 15), rep ("A", 10), rep ("A", 15), rep ("A", 10)),
SEQ_A = c(11:30, 61:75, 76:85, 86:100, 110:119),
LABEL_B= c(rep("C", 20), rep("D", 15), rep("F", 10), rep("G",15), rep("D", 10)),
SEQ_B = c(1:20, 25:11, 16:25, 15:1, 1:5, 8:12))
How can I add an ID column to this data.table with a unique ID for each of the continuous aligned sequences? Each aligned sequence needs a separate ID if either SEQ_A or SEQ_B is not sequentially continuous, or if the rows belong to a different group (i.e. LABEL). For example:
dt_long_ID <- data.table(LABEL_A = c(rep("A", 20), rep("A", 15), rep ("A", 10), rep ("A", 15), rep ("A", 10)),
SEQ_A = c(11:30, 61:75, 76:85, 86:100, 110:119),
LABEL_B= c(rep("C", 20), rep("D", 15), rep("F", 10), rep("G",15), rep("D", 10)),
SEQ_B = c(1:20, 25:11, 16:25, 15:1, 1:5, 8:12),
ID = c(rep(1, 20), rep(2, 15), rep(3, 10), rep(4, 15), rep(5, 5), rep(6, 5) ))
Updated answer based on the clarified question and the updated data. This will work whether or not the LABEL columns are numeric.
# helper function for the sequential check
# the & !is.na() just corrects for the first NA value introduced by shift()
foo = function(x) cumsum(abs(x - shift(x)) > 1 & !is.na(shift(x)))
dt_long_ID[, ID2 := .GRP, by = .(rleid(LABEL_A), rleid(LABEL_B), foo(SEQ_A), foo(SEQ_B))]
all(dt_long_ID$ID == dt_long_ID$ID2)
# [1] TRUE
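To see what the helper contributes, here is a base R sketch of the same run-numbering idea using diff() instead of shift(). The name run_id() is introduced here for illustration, and the function assumes a vector with at least one element.

```r
# run_id() is a hypothetical name: it starts at 0 and increments
# whenever two neighbouring values differ by more than 1
run_id <- function(x) c(0L, cumsum(abs(diff(x)) > 1))

run_id(c(1, 2, 3, 7, 8, 6, 5))
# [1] 0 0 0 1 1 2 2
```

Because of abs(), a descending stretch such as 25:11 counts as continuous, which matches the SEQ_B column in the example data. Two equal neighbouring values are also treated as continuous, since their difference is 0.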

Centering column names with wrapped text with Kable

I am having some trouble centering column names that have wrapped text. The top line of the wrapped text centers but the second line does not.
test_data <- data.frame(Mean = runif(5, 3.71, 7.72),
N = sample(57:59, 5, replace = TRUE),
sd = c(1, rep(0, 4)),
d = rep(1, 5),
naod = sample(1:4, 5, replace = TRUE),
a = sample(5:12, 5, replace = TRUE),
sa = sample(37:44, 5, replace = TRUE))
test <-as.data.frame(t(as.matrix(sapply(2:6,function(i) vec_fun5(test_data,i)))))
kable(test,"latex" ,booktabs=T, align="c",col.names=linebreak(c('Mean','\\textit{N}' ,'Strongly\n Disagree','Disagree','Neither Agree\n or Disagree','Agree','Strongly\n Agree')),row.names = T,escape=F)%>%
row_spec(0,align = "c")
Output Table
I would like to have both lines centered within the cell.
You could use tableHTML for that:
Test data:
set.seed(1)
test_data <- data.frame(Mean = runif(5, 3.71, 7.72),
N = sample(57:59, 5, replace = TRUE),
sd = c(1, rep(0, 4)),
d = rep(1, 5),
naod = sample(1:4, 5, replace = TRUE),
a = sample(5:12, 5, replace = TRUE),
sa = sample(37:44, 5, replace = TRUE))
library(tableHTML)
test_data %>%
tableHTML(round = 2,
widths = c(50, 50, 50, 50,
80, 120, 50, 50),
headers = c("Mean", "N",
"Strongly <br>Disagree",
"Disagree",
"Neither Agree <br> or Disagree",
"Agree",
"Strongly <br>Agree"),
escape = FALSE) %>%
add_theme("scientific")
The result looks like this:
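If staying with kable and LaTeX output is preferred, one thing worth trying is the align argument of kableExtra::linebreak(), which defaults to "l" and would explain why the second line of each wrapped header is not centered. The sketch below applies the headers directly to test_data rather than to the question's test object (which depends on the unshown vec_fun5()), and it assumes the LaTeX makecell package is available when the document is compiled.

```r
library(knitr)
library(kableExtra)

set.seed(1)
test_data <- data.frame(Mean = runif(5, 3.71, 7.72),
                        N = sample(57:59, 5, replace = TRUE),
                        sd = c(1, rep(0, 4)),
                        d = rep(1, 5),
                        naod = sample(1:4, 5, replace = TRUE),
                        a = sample(5:12, 5, replace = TRUE),
                        sa = sample(37:44, 5, replace = TRUE))

# align = "c" centers every line of a wrapped header, not just the first
hdr <- linebreak(c("Mean", "\\textit{N}", "Strongly\nDisagree", "Disagree",
                   "Neither Agree\nor Disagree", "Agree", "Strongly\nAgree"),
                 align = "c")

tbl <- kable(test_data, "latex", booktabs = TRUE, align = "c",
             col.names = hdr, escape = FALSE)
tbl
```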

gvisMotionChart: change the default variables

I have a data.frame df that consists of four sites (1 to 4). Each site has values for four parameters (A to D) from 2011 to 2014. I want to create a motion chart for site1.
library(dplyr)
siteID <- c(rep("site1", 16), rep("site2", 16), rep("site3", 16), rep("site4", 16))
YEAR <- as.numeric(rep(c("2011", "2012", "2013", "2014"), 16))
parameter <- c(rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4),
rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4),
rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4),
rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4))
value <- c(seq(1, 4, by=1), seq(10, 40, by=10), seq(12, 18, by=2), seq(5, 20, by=5),
seq(3, 12, by=3), sample(13:18, 4), sample(15:22, 4), sample(10:18, 4),
seq(7, 1, by=-2), sample(15:22, 4), sample(15:19, 4), sample(10:20, 4),
seq(8, 5, by=-1), seq(50, 20, by=-10), seq(16, 10, by=-2), seq(20, 5, by=-5))
df <- data.frame(siteID, YEAR, parameter, value)
df$YEAR <- as.numeric(df$YEAR)
df1 <- df %>%
dplyr::filter(siteID =="site1")
I created the motion chart for site 1 using the following code
library(googleVis)
site1 = gvisMotionChart(data=df1,
idvar="parameter",
timevar="YEAR",
chartid="site1")
plot(site1)
It worked fine.
However, the default x axis and y axis were value. I had to change x axis myself from value to YEAR.
I wanted to change the default values so that x-axis will be YEAR, colorvar will be parameter, and sizevar will be value. I did that using this code
site1_1 = gvisMotionChart(data=df1,
idvar="parameter",
timevar="YEAR",
chartid="site1",
xvar="YEAR",
yvar="value",
colorvar="parameter",
sizevar="value")
plot(site1_1)
It kept showing as loading but the plot was not created.
Any suggestions would be appreciated.
I think the below should get you just about there. All that's left is to set the options appropriately to get rid of the commas and such.
df1 <- df %>%
dplyr::filter(siteID =="site1") %>%
mutate(Date = YEAR) %>%
mutate(colorValue = parameter) %>%
mutate(sizeValue = value)
library(googleVis)
site1 = gvisMotionChart(data=df1,
idvar="parameter",
timevar="YEAR",
chartid="site1",
xvar = "Date",
yvar = "value",
colorvar = "colorValue",
sizevar = "sizeValue")
plot(site1)
