I need, with the help of the map() function, to apply the above to each element.
How can I do so?
As dt is of class data.table, you can make a vector of the columns of interest (i.e. your items; below I use grepl() on the names) and then apply your weighting function to each of those columns using .SD and .SDcols, with by:
qs <- names(dt)[grepl("^q", names(dt))]
dt[, (paste0(qs, "wt")) := lapply(.SD, \(q) 1 / (sum(!is.na(q)) / .N)),
   by = .(sex, education_code, age), .SDcols = qs]
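If you specifically want the map() idiom from the question, a rough purrr equivalent (a sketch, assuming the same dt, the qs vector above, and the same grouping columns) uses walk(), the side-effect member of the map() family, since data.table assigns by reference:

library(data.table)
library(purrr)

# walk over the item columns and add one weight column per item, by reference
walk(qs, function(q) {
  dt[, (paste0(q, "wt")) := 1 / (sum(!is.na(get(q))) / .N),
     by = .(sex, education_code, age)]
})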
As mentioned in the comments, you are missing a dt <- in your dt[, .(ID, education_code, age, sex, item = q1_1)], which makes the column item unavailable in the following line dt[, no_respond := is.na(item)].
Your weighting scheme is not absolutely clear to me; however, assuming you want to do what your code does here, I would go with a dplyr solution to iterate over the columns.
# your data, without the no_respond column and with the missing value in q2_3 corrected
dt <- data.table::data.table(
ID = c(1,2,3,4, 5, 6, 7, 8, 9, 10),
education_code = c(20,50,20,60, 20, 10,5, 12, 12, 12),
age = c(87,67,56,52, 34, 56, 67, 78, 23, 34),
sex = c("F","M","M","M", "F","M","M","M", "M","M"),
q1_1 = c(NA,1,5,3, 1, NA, 3, 4, 5,1),
q1_2 = c(NA,1,5,3, 1, 2, NA, 4, 5,1),
q1_3 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q1_text = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_1 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_2 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_3 = c(NA,1,5,3, 1, NA, NA, 4, 5,1),
q2_text = c(NA,1,5,3, 1, NA, 3, 4, 5,1))
library(dplyr)

dt %>%
  group_by(sex, education_code, age) %>% # group the data by sex, education_code, age
  add_count() %>% # add a column n with the number of rows in each group
  mutate(across(starts_with("q"), # for each column starting with "q"
                ~ 1 / (sum(!is.na(.)) / n), # create a new column following your weight calculation
                .names = '{.col}_wgt')) %>% # name the new column with the suffix "_wgt"
  ungroup()
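If you would rather not keep the helper column n from add_count(), the same weights can be computed with n() directly inside the across() call (a minor variation, not a change in logic):

dt %>%
  group_by(sex, education_code, age) %>%
  mutate(across(starts_with("q"),
                ~ 1 / (sum(!is.na(.)) / n()), # n() returns the group size directly
                .names = '{.col}_wgt')) %>%
  ungroup()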
I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. In order to perform certain analyses I need to make sure that each person has the same number of rows, even if some of them consist of NAs because the person did not complete a given time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that would automatically add rows to the dataset conditional on whether the participant has had 1, 2 or 3 visits. Ideally it would set the rest of the data to NA while copying P_ID and Site_code, but if that is not possible I would be satisfied with just creating the right number of rows.
We could use fill() after doing a complete() (here ExpandedDataset stands for your data frame):
library(dplyr)
library(tidyr)
ExpandedDataset %>%
complete(P_ID, Event_code) %>%
fill(Site_code)
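One caveat: fill() runs down the whole column, so if a participant's first row were one of the newly created NA rows, it would inherit the previous participant's Site_code. A slightly safer variant (a sketch under the same assumptions) fills within each participant only:

library(dplyr)
library(tidyr)

ExpandedDataset %>%
  complete(P_ID, Event_code) %>%
  group_by(P_ID) %>%
  fill(Site_code, .direction = "downup") %>% # fill only within each participant
  ungroup()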
I came up with quite long code, but you could group it into a function and make it easier:
Here's your dataframe:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
Event_code = c(1, 2, 1, 2, 3, 1),
Site_code = c(1, 1, 2, 2, 2, 1))
How many records do you have per ID?
values <- table(df$ID) # table() counts character IDs; summary() would only count a factor
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have fewer records than the maximum?
uncompliant <- names(which(values < target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values<target)]
So now, let's create the vectors of the data frame we will add to your original one. First, IDs:
IDs <- vector()
for(i in 1:length(rowcount)){
y <- rep(uncompliant[i], target - rowcount[i])
IDs <- c(IDs, y)
}
And now, the sitecodes:
SC <- vector()
for(i in 1:length(rowcount)){
y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
SC <- c(SC, y)
}
Finally, a data frame with the values we will introduce:
introduce <- data.frame(ID = IDs, Event = rep(NA, length(IDs)),
Event_code = rep(NA, length(IDs)),
Site_code = SC)
Combine the original dataframe with the new values to be added and sort it so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(final$ID), ]
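As mentioned at the start, all of this can be grouped into a function. A minimal sketch (the name pad_records is mine) that reproduces the steps above in one call:

# pad every ID up to the maximum number of records, copying Site_code
pad_records <- function(df) {
  values <- table(df$ID)                       # records per ID
  target <- max(values)                        # maximum record count
  uncompliant <- names(which(values < target)) # IDs that need padding
  add <- do.call(rbind, lapply(uncompliant, function(id) {
    data.frame(ID = id,
               Event = NA,
               Event_code = NA,
               Site_code = unique(df$Site_code[df$ID == id]))[rep(1, target - values[id]), ]
  }))
  out <- rbind(df, add)
  out[order(out$ID), ]
}

pad_records(df)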
I have a data table with counts for changes for multiple groups. For example:
library(data.table)

input <- data.table(from = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
                    to = c(letters[1:6], letters[1:6]),
                    from_N = c(100, 100, 100, 50, 50, 50, 60, 60, 60, 80, 80, 80),
                    to_N = c(10, 20, 40, 5, 5, 15, 10, 5, 10, 20, 5, 10),
                    group = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2))
How can I calculate the total for each change across groups? I can do this using a for loop, for example:
out <- list()
for (i in 1:length(unique(input$from))){
sub <- input[from == unique(input$from)[i]]
out2 <- list()
for (j in 1:length(unique(sub$to))){
sub2 <- sub[to == unique(sub$to)[j]]
out2[[j]] <- data.table(from = sub2$from[1],
to = sub2$to[1],
from_N = sum(sub2$from_N),
to_N = sum(sub2$to_N))
print(unique(sub$to)[j])
}
out[[i]] <- do.call("rbind", out2)
print(unique(input$from)[i])
}
output <- do.call("rbind", out)
However, the data table I need to apply this to is very large, and I therefore need to maximise performance. Is there a data.table method? Any help will be greatly appreciated!
Perhaps I've overlooked something, but it seems you're just after:
library(data.table)
setDT(input)[, .(from_N = sum(from_N), to_N = sum(to_N)), by = .(from, to)]
Output:
from to from_N to_N
1: A a 160 20
2: A b 160 25
3: A c 160 50
4: B d 130 25
5: B e 130 10
6: B f 130 25
An option with dplyr
library(dplyr)
input %>%
group_by(from, to) %>%
summarise_at(vars(ends_with('_N')), sum)
Or in data.table
library(data.table)
setDT(input)[, lapply(.SD, sum), by = .(from, to), .SDcols = patterns('_N$')]
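Note that summarise_at() is superseded in current dplyr; on a recent version, the equivalent with across() would be:

library(dplyr)

input %>%
  group_by(from, to) %>%
  summarise(across(ends_with("_N"), sum), .groups = "drop")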
I am trying to set keys to a data.table and keep the original column names in the second row. All that I have tried so far changes the column names to the keys and erases the original variables. I have ten data.tables to merge, and all the variables have different names, like in the example. So I made keys but would like to keep the originals as well before harmonisation, just to be sure.
library(tidyverse)
library(lubridate)
library(forcats)
library(stringr)
library(data.table)
library(rio)
library(dplyr)
1. Keys
keys1 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
keys2 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
2. data.table example with variable names.
TD3 = data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4), q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))
TD3
TD4 = data.table(q128 = c(1, 1, 1, 2), q129 = c(1, 3, 2, 999), q130 = c(0.9, 3.1, NA, 9.0), q131 = c(58, 60, 45, NA))
TD4
I'm not sure this is really the data structure you want to have, that is, with mixed variable types, as r2evans said. However, this solution works. Just put all your little data.tables into a list and voilà.
I noticed that keys1 and keys2 are identical, so I just used one of them. If the keys should be different for each table, they can also be listed.
keys1 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
TD <- list()
TD[[1]] = data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4), q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))
TD[[2]] = data.table(q128 = c(1, 1, 1, 2), q129 = c(1, 3, 2, 999), q130 = c(0.9, 3.1, NA, 9.0), q131 = c(58, 60, 45, NA))
TD <- lapply(TD, FUN = function(x) {
  oldcolumns <- colnames(x)
  # one-row data.table holding the original variable names
  td <- data.table(
    'V1' = oldcolumns[1],
    'V2' = oldcolumns[2],
    'V3' = oldcolumns[3],
    'V4' = oldcolumns[4]
  )
  # rename both the marker row and the data to the harmonised keys
  colnames(td) <- keys1
  colnames(x) <- keys1
  # stack the name row on top; note this coerces all columns to character
  x <- rbind(td, x)
  return(x)
})
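With every table in TD now carrying the harmonised names (and its original names preserved in the first row), one way to combine the ten tables is to stack the list with rbindlist(); the marker rows travel along with each block:

library(data.table)

# stack all harmonised tables into one; each block keeps its
# original-name marker row on top (all columns are character here)
merged <- rbindlist(TD)
merged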
library(dplyr)

df1 <- tibble(time1 = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
              time2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
              id = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"))
df2 <- tibble(time = sort(runif(100, 0, 10)),
              C = rbinom(100, 1, 0.5))
For every row in df1, I want to find the rows in df2 that overlap for time, then assign the median C value for this group of df2 rows to a new column in df1. I'm sure there's some simple way to do this with dplyr's between function, but I'm new to R and haven't been able to figure it out. Thanks!
Here's a way, using the merge function to do a SQL-style cross join (df1 and df2 share no column names, so every row of df1 is paired with every row of df2), then using the between function:
library(tidyverse)
merge(df1, df2, all = TRUE) %>%
rowwise() %>%
mutate(time_between = between(time, time1, time2)) %>%
filter(time_between) %>%
group_by(time1, time2, id) %>%
summarise(med_C = median(C))
Using the filter function may result in losing rows of df1 that have no overlapping times in df2, so an alternative method that keeps every row would be:
merge(df1, df2, all = TRUE) %>%
rowwise() %>%
mutate(time_between = between(time, time1, time2)) %>%
group_by(time1, time2, id) %>%
summarise(med_C = median(ifelse(time_between, C, NA), na.rm = TRUE))
You can do this in base R with sapply:
df1$median_c <- sapply(seq_along(df1$id), function(i) {
median(df2$C[df2$time > df1$time1[i] & df2$time < df1$time2[i]])
})
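One small difference from the dplyr answers: between() is inclusive on both ends, whereas the comparisons above are strict, so a time falling exactly on time1 or time2 is treated differently. For an exact match to between(), use non-strict comparisons:

# inclusive bounds, matching dplyr::between()
df1$median_c <- sapply(seq_along(df1$id), function(i) {
  median(df2$C[df2$time >= df1$time1[i] & df2$time <= df1$time2[i]])
})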