R data.table keys and column names. Harmonisation - r

I am trying to set keys on a data.table while keeping the original column names as the second row. Everything I have tried so far replaces the column names with the keys and erases the original variable names. I have ten data.tables to merge, and all the variables have different names, as in the example. So I made keys, but I would like to keep the original names as well before harmonisation, just to be sure.
library(tidyverse)
library(lubridate)
library(forcats)
library(stringr)
library(data.table)
library(rio)
library(dplyr)
1. Keys
keys1 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
keys2 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
2. data.table example with variable names.
TD3 = data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4), q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))
TD3
TD4 = data.table(q128 = c(1, 1, 1, 2), q129 = c(1, 3, 2, 999), q130 = c(0.9, 3.1, NA, 9.0), q131 = c(58, 60, 45, NA))
TD4

I'm not sure this is the data structure you really want: as r2evans said, putting the old names into a data row mixes variable types, so every column is coerced to character. That said, this solution works. Put all your data.tables into a list and apply the same function to each element.
I noticed that keys1 and keys2 are identical, so I used only one of them. If each table needs its own keys, the key vectors can be put in a list as well.
keys1 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
TD <- list()
TD[[1]] = data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4), q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))
TD[[2]] = data.table(q128 = c(1, 1, 1, 2), q129 = c(1, 3, 2, 999), q130 = c(0.9, 3.1, NA, 9.0), q131 = c(58, 60, 45, NA))
TD <- lapply(TD, FUN = function(x) {
  oldcolumns <- colnames(x)
  td <- data.table(
    'V1' = oldcolumns[1],
    'V2' = oldcolumns[2],
    'V3' = oldcolumns[3],
    'V4' = oldcolumns[4]
  )
  colnames(td) <- keys1
  colnames(x) <- keys1
  x <- rbind(td, x)
  return(x)
})
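If the goal is only to remember which original variable each key came from, a separate lookup table avoids the type mixing. The following is a minimal sketch under that assumption, not the answerer's method; `name_map` and its column names `original` and `harmonised` are hypothetical names chosen for illustration.

```r
library(data.table)

keys1 <- c("SDC_GENDER", "SDC_CHILD_NB", "LAB_CRP", "PM_HIP")
TD3 <- data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4),
                  q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))

# Hypothetical lookup table: one row per column, pairing the original
# name with its harmonised key.
name_map <- data.table(original = names(TD3), harmonised = keys1)

# Rename by reference; the data columns keep their numeric type.
setnames(TD3, old = name_map$original, new = name_map$harmonised)
```

The original names can then be looked up in `name_map` at any time, for example `name_map[harmonised == "LAB_CRP", original]`.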

Related

Using map() function to apply for each element

I need to apply the above to each element using the map() function.
How can I do that?
As dt is of class data.table, you can make a vector of columns of interest (i.e. your items; below I use grepl on the names), and then apply your weighting function to each of those columns using .SD and .SDcols, with by
qs = names(dt)[grepl("^q", names(dt))]
dt[, (paste0(qs,"wt")) := lapply(.SD, \(q) 1/(sum(!is.na(q))/.N)),
   .(sex, education_code, age), .SDcols = qs]
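To make the grouped `.SD` pattern concrete, here is a small sketch on made-up data. The table `toy`, its columns, and the `_wt` suffix are assumptions for illustration; the weight is the same ratio as above, written as `.N / sum(!is.na(q))`.

```r
library(data.table)

# Made-up data: two groups, two item columns with some missing answers.
toy <- data.table(
  sex = c("F", "F", "M", "M"),
  q1  = c(1, NA, 2, 3),
  q2  = c(NA, NA, 1, 1)
)

qs <- grep("^q", names(toy), value = TRUE)

# For each group and each item: group size divided by the number of responses.
toy[, (paste0(qs, "_wt")) := lapply(.SD, function(q) .N / sum(!is.na(q))),
    by = sex, .SDcols = qs]
```

Note that a group with no responses at all for an item (here `q2` for `sex == "F"`) divides by zero and gets a weight of `Inf`, which may need separate handling.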
As mentioned in the comments, you miss a dt <- in your dt[, .(ID, education_code, age, sex, item = q1_1)] which makes the column item unavailable in the following line dt[, no_respond := is.na(item)].
Your weighting scheme is not entirely clear to me. However, assuming you want to do what your code does here, I would use a dplyr solution to iterate over the columns.
# your data without no_respond column and correcting missing value in q2_3
dt <- data.table::data.table(
ID = c(1,2,3,4, 5, 6, 7, 8, 9, 10),
education_code = c(20,50,20,60, 20, 10,5, 12, 12, 12),
age = c(87,67,56,52, 34, 56, 67, 78, 23, 34),
sex = c("F","M","M","M", "F","M","M","M", "M","M"),
q1_1 = c(NA,1,5,3, 1, NA, 3, 4, 5,1),
q1_2 = c(NA,1,5,3, 1, 2, NA, 4, 5,1),
q1_3 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q1_text = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_1 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_2 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_3 = c(NA,1,5,3, 1, NA, NA, 4, 5,1),
q2_text = c(NA,1,5,3, 1, NA, 3, 4, 5,1))
library(dplyr)
dt %>%
  group_by(sex, education_code, age) %>%   # group the data by sex, education_code, age
  add_count() %>%                          # add a column n with the number of rows in each group
  mutate(across(starts_with("q"),          # for each column starting with "q"
                ~ 1/(sum(!is.na(.))/n),    # compute the weight as in your calculation
                .names = '{.col}_wgt')) %>% # name the new column with the suffix "_wgt"
  ungroup()

Reshape data to long form based on a pattern and not unique identifier

I have some data that comes from the measurement of an image where essentially the columns signify position (x) and height (z) data. The problem is that this data gets spit out as a .csv file in the wide format. I am trying to find a way to convert this to the long format but I'm unsure how to do this because I can't designate an identifier.
I know there are a lot of questions on reshaping data but I didn't find anything quite like this.
As an example:
df <- data.frame(V1 = c("Profile", "x", "[m]", 0, 2, 4, 6, 8, 10, 12, NA, NA),
                 V2 = c("1", "z", "[m]", 3, 3, 4, 10, 12, 9, 2, NA, NA),
                 V3 = c("Profile", "x", "[m]", 0, 2, 4, 6, NA, NA, NA, NA, NA),
                 V4 = c("2", "z", "[m]", 4, 8, 10, 10, NA, NA, NA, NA, NA),
                 V5 = c("Profile", "x", "[m]", 0, 2, 4, 6, 8, 10, 12, 14, 17),
                 V6 = c("3", "z", "[m]", 0, 1, 1, 10, 14, 11, 6, 2, 0))
Each pair of columns holds the X,Z data for one profile (you can see them grouped as Profile 1, Profile 2, Profile 3, etc.). However, the measurements are not of equal length, hence the rows with NAs. Is there a programmatic way to reshape this data into long form? i.e.:
profile x z
Profile 1 0 3
Profile 1 2 3
Profile 1 4 4
... ... ...
Profile 2 0 4
Profile 2 2 8
Profile 2 4 10
... ... ...
Thank you in advance for your help!
You can do the following (it's a bit verbose, feel free to optimize):
dfcols <- NCOL(df)
xColInds <- seq(1, dfcols, by = 2)
zColInds <- seq(2, dfcols, by = 2)
longdata <- do.call("rbind", lapply(seq_along(xColInds), function(i) {
  xValInd <- xColInds[i]
  zValInd <- zColInds[i]
  profileName <- paste0(df[1, xValInd], " ", df[1, zValInd])
  xVals <- as.numeric(df[-(1:3), xValInd])
  zVals <- as.numeric(df[-(1:3), zValInd])
  data.frame(profile = rep(profileName, length(xVals)),
             x = xVals,
             z = zVals)
}))
If you want it to be faster, don't create a data.frame in every iteration. One conversion at the end is enough, like this:
xColInds <- seq(1, NCOL(df), by = 2)
longdataList <- lapply(xColInds, function(xci) {
  list(profileName = paste0(df[1, xci], " ", df[1, xci + 1]),
       x = df[-(1:3), xci],
       z = df[-(1:3), xci + 1])
})
longdata <- data.frame(profile = rep(unlist(lapply(longdataList, "[[", "profileName")), each = NROW(df) - 3),
                       x = as.numeric(unlist(lapply(longdataList, "[[", "x"))),
                       z = as.numeric(unlist(lapply(longdataList, "[[", "z"))))
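Both versions keep the padding rows where a shorter profile has only NAs, whereas the desired output in the question lists only measured points. A minimal sketch of dropping those rows, on a reduced made-up table with two profiles (`wide` and `long` are hypothetical names):

```r
# Reduced stand-in for the question's data: two profiles of unequal length.
wide <- data.frame(
  V1 = c("Profile", "x", "[m]", 0, 2, 4),
  V2 = c("1", "z", "[m]", 3, 3, 4),
  V3 = c("Profile", "x", "[m]", 0, 2, NA),
  V4 = c("2", "z", "[m]", 4, 8, NA),
  stringsAsFactors = FALSE
)

x_cols <- seq(1, ncol(wide), by = 2)

# Stack each (x, z) column pair, labelling rows with the profile name from row 1.
long <- do.call(rbind, lapply(x_cols, function(j) {
  data.frame(
    profile = paste(wide[1, j], wide[1, j + 1]),
    x = as.numeric(wide[-(1:3), j]),
    z = as.numeric(wide[-(1:3), j + 1])
  )
}))

# Drop the padding rows that exist only because profiles differ in length.
long <- long[complete.cases(long), ]
```

This assumes a missing x or z always means padding; if a genuine measurement can be NA, filter on `x` only.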

Adding rows to make a full long dataset for longitudinal data analysis

I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. To perform certain analyses, I need each person to have the same number of rows, even if some rows consist of NAs because the person did not complete a given time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that would automatically add rows to the dataset conditionally on whether the participant has had 1, 2 or 3 visits. Ideally it would make rest of data all NAs while copying Participant_ID and site_code but if not possible I would be satisfied just with creating the right number of rows.
We could use fill() after complete():
library(dplyr)
library(tidyr)
ExpandedDataset %>%
complete(P_ID, Event_code) %>%
fill(Site_code)
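For reference, here is the same idea as a self-contained sketch on the sample data from the question. The name `visits` is a stand-in for the asker's data frame, and grouping by `P_ID` before `fill()` is an added precaution (an assumption, not part of the answer above) so that a site code can never be carried over from one participant to the next.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, under a hypothetical name.
visits <- data.frame(
  Values     = c(23, 24, 45, 12, 34, 23),
  P_ID       = c(1, 1, 2, 2, 2, 3),
  Event_code = c(1, 2, 1, 2, 3, 1),
  Site_code  = c(1, 1, 3, 3, 3, 1)
)

expanded <- visits %>%
  complete(P_ID, Event_code) %>%              # one row per participant and event
  group_by(P_ID) %>%
  fill(Site_code, .direction = "downup") %>%  # copy the site code within a participant
  ungroup()
```

The added rows keep `Values` as NA, while `P_ID`, `Event_code` and `Site_code` are filled in.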
I came up with fairly long code, but you could wrap it in a function to make it easier to use.
Here's your data frame:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
Event_code = c(1, 2, 1, 2, 3, 1),
Site_code = c(1, 1, 2, 2, 2, 1))
How many records do you have per ID?
# table() counts rows per ID; summary() only does so when ID is a factor
values <- table(df$ID)
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have fewer records than the maximum?
uncompliant <- names(which(values < target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values < target)]
So now, let's create the vectors for the data frame we will add to your original one. First, the IDs:
IDs <- vector()
for (i in seq_along(rowcount)) {
  y <- rep(uncompliant[i], target - rowcount[i])
  IDs <- c(IDs, y)
}
And now, the site codes:
SC <- vector()
for (i in seq_along(rowcount)) {
  y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
  SC <- c(SC, y)
}
Finally, a data frame with the rows we will add:
introduce <- data.frame(ID = IDs, Event = rep(NA, length(IDs)),
                        Event_code = rep(NA, length(IDs)),
                        Site_code = SC)
Combine the original data frame with the new rows and sort it by ID so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(final$ID), ]

data.table operations on columns

I would like to do some operations on a list of columns in a data.table.
Here is an example. In reality I have around 30 variables in the data.table (DT), which is why I loop over the variables I am interested in.
To preserve the input DT, I make a copy with DT2 <- DT and operate on that.
Eventually this could also be done on the original DT.
The error message says:
Error in eval(DT[, (var)]) * eval(DT[, variable6]) : non-numeric argument to binary operator
I have not found an appropriate answer on the forum yet.
Sorry for the presentation of my question; I am not used to this. Thanks for your help.
library(data.table)
DT <- data.table(
variable1 = c("a", "b", "c", "d", "e"),
variable2 = 1:5,
variable3 = c(1, 2, 5, 6, 8),
variable4 = c(1, 2, 5, 6, 8),
variable5 = c(1, 2, 5, 6, 8),
variable6 = c(12, 14, 18, 100, 103),
variable7 = c(0.02, 0.02, 0.02, 0.02, 0.02))
cols = sapply(DT, is.numeric)
cols = cols[-c(6, 7)]
cols = names(cols)[cols]
DT2 <- DT
for(var in cols) {
DT2[, (var)] == eval(DT[, (var)]) * eval(DT[, variable6]) / eval(DT[variable7])
}
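A minimal sketch of one way to get there, assuming the intent is to replace each selected column with `column * variable6 / variable7`. Three things in the loop above stand out: `==` is a comparison rather than an assignment, `DT[variable7]` is missing the comma that would make it a column reference, and the error suggests that `DT[, (var)]` returns the column name as a string rather than the column's values. Also, `DT2 <- DT` does not copy a data.table; both names refer to the same object, so `copy()` is needed to leave `DT` untouched.

```r
library(data.table)

DT <- data.table(
  variable1 = c("a", "b", "c", "d", "e"),
  variable2 = 1:5,
  variable3 = c(1, 2, 5, 6, 8),
  variable4 = c(1, 2, 5, 6, 8),
  variable5 = c(1, 2, 5, 6, 8),
  variable6 = c(12, 14, 18, 100, 103),
  variable7 = c(0.02, 0.02, 0.02, 0.02, 0.02))

# Numeric columns, excluding the two used as multiplier and divisor.
cols <- setdiff(names(DT)[sapply(DT, is.numeric)], c("variable6", "variable7"))

# copy() makes a real copy, so := on DT2 does not modify DT.
DT2 <- copy(DT)

# Update all selected columns at once; variable6 and variable7 remain
# visible inside j even though .SD only holds the columns in cols.
DT2[, (cols) := lapply(.SD, function(x) x * variable6 / variable7), .SDcols = cols]
```

The same `:=` call also works inside a `for` loop over `cols` (using `get(var)` on the right-hand side), but the `lapply(.SD, ...)` form avoids the loop entirely.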

crossJoining two data frames without repeating values

I have two dataframes
DataFrame1 <- data.frame(StudentId = c(1:20), Subject = c(rep("Algebra", 4), rep("Geometry", 4), rep("English", 4), rep("Zoology", 4), rep("Botany", 4)), CGPA = c(random::randomNumbers(20, 70, 100, 1)), Country = c(rep("USA", 4), rep("UK", 4), rep("Germany", 4), rep("France", 4), rep("Japan", 4)))
and
DataFrame2 <- data.frame(StudentId = c(1:10), State = c(rep("NYC", 2), rep("Illinois", 2), rep("Texas", 2), rep("Virginia", 2), rep("Florida", 2)), Age = c(random::randomNumbers(10, 16, 20, 1)), Gender = c(rep("Male", 3), rep("Female", 3), rep("Male", 2), rep("Female", 2)))
I can merge the two using an inner join:
merge(DataFrame1, DataFrame2)
How can I cross join the two data frames without repeating values?
Try merge(DataFrame1, DataFrame2, all = TRUE). Note that this is a full outer join on StudentId rather than a cross join.
Try this for a cross join:
knitr::kable(merge(x = DataFrame1, y = DataFrame2, by = NULL))
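To see how the three `merge()` calls differ, here is a small sketch on deterministic stand-in data (the data frames `students` and `details` and their values are made up, since the question's data depends on `random::randomNumbers`):

```r
# Made-up stand-ins: four students, details known for only two of them.
students <- data.frame(StudentId = 1:4,
                       Subject = c("Algebra", "Algebra", "Geometry", "Geometry"))
details  <- data.frame(StudentId = 1:2,
                       State = c("NYC", "Texas"))

# Inner join: only StudentIds present in both data frames.
inner <- merge(students, details)

# Full outer join: every student once, NA where no details exist.
full <- merge(students, details, all = TRUE)

# Cross join: every row of students paired with every row of details.
cross <- merge(students, details, by = NULL)
```

Here `inner` has 2 rows, `full` has 4 and `cross` has 4 * 2 = 8. If "without repeating values" means one row per student, the full outer join is probably the one wanted, since a cross join repeats every student once per row of the other table.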
