I need to create new columns in a data.table based on criteria set relative to some of the existing columns. I encountered some problems with missing data, however. Specifically, for each person a few data points are missing, but for some individuals the data of an entire questionnaire is missing (see column p == 3 or 4 in the example data below). In such cases (i.e. entire questionnaire missing) I would like data.table to enter NA in the output for this particular person. I have tried resolving this using if_else from the dplyr package. However, data.table returns NaN or 0 instead of NA as a result even when all data of a person is missing (i.e. when column p is 3 or 4).
This is my current script, which only partially produces the desired output (i.e. correct output for p == 1 or 2, but not for p == 3 or 4).
library(data.table)
library(dplyr)
# Create example data.table
set.seed(4)
p <- c(rep(1, 5), rep(2, 5), rep(3, 5), rep(4, 5))
time1 <- as.integer(c(sample(1:20, 5, replace = TRUE), sample(21:40, 5, replace = TRUE), rep(NA, 10)))
closeness1 <- as.integer(c(NA, NA, sample(c(1:40, NA), 7, replace = TRUE), NA, rep(NA, 10)))
dt <- data.table::data.table(p, time1, closeness1)
# Compute new columns
dt[, c("mean1", "sum1") := .(
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.numeric(NA), .SD[time1 <= 10, mean(closeness1, na.rm=TRUE)]),
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.integer(NA), .SD[time1 <= 10, sum(closeness1, na.rm=TRUE)])),
by = p, .SDcols = c("time1", "closeness1")]
The following script produces the output I would want to see. It is obviously just for illustrative purposes, however; what I need to know is how to modify the script above so that it produces this outcome:
# Select rows from original data that were as intended
p12 <- dplyr::filter(dt, p %in% c(1,2))
# Create new data.table with corrected output
p <- c(rep(3, 5), rep(4, 5))
time1 <- rep(NA_integer_, 10)
closeness1 <- rep(NA_integer_, 10)
mean1 <- rep(NA_integer_, 10)
sum1 <- rep(NA_integer_, 10)
dt.des <- data.table::data.table(p, time1, closeness1, mean1, sum1)
# Desired output
dsrd.opt <- dplyr::bind_rows(p12, dt.des)
dsrd.opt
p time1 closeness1 mean1 sum1
1 1 12 NA 21.5 43
2 1 1 NA 21.5 43
3 1 6 31 21.5 43
4 1 6 12 21.5 43
5 1 17 5 21.5 43
6 2 26 40 NaN 0
7 2 35 18 NaN 0
8 2 39 19 NaN 0
9 2 39 40 NaN 0
10 2 22 NA NaN 0
11 3 NA NA NA NA
12 3 NA NA NA NA
13 3 NA NA NA NA
14 3 NA NA NA NA
15 3 NA NA NA NA
16 4 NA NA NA NA
17 4 NA NA NA NA
18 4 NA NA NA NA
19 4 NA NA NA NA
20 4 NA NA NA NA
Edit:
It looks like I simplified the above example too much. I basically need to compute the mean of closeness1 under two separate conditions, once for time1 <= 10 and once for time1 > 10 & time1 <= 21. The respective outputs should then be saved in two new columns. I have updated the example script accordingly; see below:
dt[, c("mean1", "mean2") := .(
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.numeric(NA), .SD[time1 <= 10, mean(closeness1, na.rm=TRUE)]),
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.numeric(NA), .SD[time1 > 10 & time1 <= 21, mean(closeness1, na.rm=TRUE)])),
by = p, .SDcols = c("time1", "closeness1")]
Updated example output:
dsrd.opt
p time1 closeness1 mean1 mean2
1 1 12 NA 21.5 5
2 1 1 NA 21.5 5
3 1 6 31 21.5 5
4 1 6 12 21.5 5
5 1 17 5 21.5 5
6 2 26 40 NaN NaN
7 2 35 18 NaN NaN
8 2 39 19 NaN NaN
9 2 39 40 NaN NaN
10 2 22 NA NaN NaN
11 3 NA NA NA NA
12 3 NA NA NA NA
13 3 NA NA NA NA
14 3 NA NA NA NA
15 3 NA NA NA NA
16 4 NA NA NA NA
17 4 NA NA NA NA
18 4 NA NA NA NA
19 4 NA NA NA NA
20 4 NA NA NA NA
If I understood you correctly, I would suggest using a simple left join. I think this is pretty straightforward and produces the desired result.
dt_result <- merge(x = dt
, y = dt[time1 <= 10, .(mean1 = mean(closeness1, na.rm = TRUE)
, sum1 = sum(closeness1, na.rm = TRUE)), by = list(p)]
, by.x = "p"
, by.y = "p"
, all.x = TRUE
)
> dt_result
p time1 closeness1 mean1 sum1
1: 1 12 NA 21.5 43
2: 1 1 NA 21.5 43
3: 1 6 31 21.5 43
4: 1 6 12 21.5 43
5: 1 17 5 21.5 43
6: 2 26 40 NA NA
7: 2 35 18 NA NA
8: 2 39 19 NA NA
9: 2 39 40 NA NA
10: 2 22 NA NA NA
11: 3 NA NA NA NA
12: 3 NA NA NA NA
13: 3 NA NA NA NA
14: 3 NA NA NA NA
15: 3 NA NA NA NA
16: 4 NA NA NA NA
17: 4 NA NA NA NA
18: 4 NA NA NA NA
19: 4 NA NA NA NA
20: 4 NA NA NA NA
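For the updated question (two means over two time windows), the same join pattern should carry over. Here is a sketch, assuming a freshly built dt from the example data, before any := assignments: aggregate once per condition, then left-join both summaries back.
m1 <- dt[time1 <= 10, .(mean1 = mean(closeness1, na.rm = TRUE)), by = p]
m2 <- dt[time1 > 10 & time1 <= 21, .(mean2 = mean(closeness1, na.rm = TRUE)), by = p]
dt_result2 <- merge(merge(dt, m1, by = "p", all.x = TRUE), m2, by = "p", all.x = TRUE)
Groups whose time1 is entirely NA (p == 3 or 4) never enter m1 or m2, so the left joins fill their new columns with NA. As in the output above, p == 2 also ends up with NA rather than NaN, since it has no rows in either time window. (Since the key has the same name in both tables, by = "p" replaces the by.x/by.y pair.)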
Related
I need to create a data frame containing all possible lagged versions of a variable. I found an example using data.table that works like this:
df <- data.frame("Age"=1:10)
df <- setDT(df)
df[,lag.Age1 := c(NA,Age[-.N])]
That creates this:
Age lag.Age1
1: 1 NA
2: 2 1
3: 3 2
.. .. ..
10: 10 9
Now, I want to keep adding lagged vectors that produce something like this:
Age lag.Age1 lag.Age2 lag.Age3
1: 1 NA NA NA
2: 2 1 NA NA
3: 3 2 1 NA
.. .. .. .. ..
10: 10 9 8 7
I tried this for the third column:
df[,lag.Age2 := c(NA,NA,Age[1:8])]
But I really don't get how data.table works here. That line runs but it doesn't do anything.
EDIT: what if the dataframe has a group variable and I want the lag to be done by group? For the first lag it is just:
df <- data.frame("Age"=1:10, "Group"=c(rep("A",4),rep("B",6)))
df[,lag.Age1 := c(NA,Age[-.N]), by="Group"]
How would this work now? Note that the groups have different lengths.
data.table::shift() is very powerful, because you can provide a vector of offsets. For example, if you want n lag columns (lags 1 to n), you can do this:
n=3
cols = paste0("lag.Age",1:n)
df[, c(cols):=shift(Age,1:n), Group]
Output:
Age Group lag.Age1 lag.Age2 lag.Age3
<int> <char> <int> <int> <int>
1: 1 A NA NA NA
2: 2 A 1 NA NA
3: 3 A 2 1 NA
4: 4 A 3 2 1
5: 5 B NA NA NA
6: 6 B 5 NA NA
7: 7 B 6 5 NA
8: 8 B 7 6 5
9: 9 B 8 7 6
10: 10 B 9 8 7
Alternatively:
df[, c(paste0("lag.Age",1:3)):=shift(Age,1:3), Group]
If you want the number of lags to vary by group, where the number equals the number of observations in that group minus 1, then one approach is to do this:
# make function to return lags based on length of x
f <- function(x) shift(x, 1:(length(x) - 1))
# get unique groups
grps <- unique(df$Group)
# set as DT, and use lapply()
setDT(df)
grp_lags <- lapply(grps, \(g) f(df[Group == g, Age]))
names(grp_lags) <- grps
Output:
$A
$A[[1]]
[1] NA 1 2 3
$A[[2]]
[1] NA NA 1 2
$A[[3]]
[1] NA NA NA 1
$B
$B[[1]]
[1] NA 5 6 7 8 9
$B[[2]]
[1] NA NA 5 6 7 8
$B[[3]]
[1] NA NA NA 5 6 7
$B[[4]]
[1] NA NA NA NA 5 6
$B[[5]]
[1] NA NA NA NA NA 5
Or, if you are okay with lots of extra columns (i.e. all-NA lags for the groups with fewer observations), you can do this:
n <- df[, .N, Group][, max(N)]
cols <- paste0("lag.Age", 1:n)
df[, c(cols) := shift(Age, 1:n), Group]
Output:
Age Group lag.Age1 lag.Age2 lag.Age3 lag.Age4 lag.Age5 lag.Age6
1: 1 A NA NA NA NA NA NA
2: 2 A 1 NA NA NA NA NA
3: 3 A 2 1 NA NA NA NA
4: 4 A 3 2 1 NA NA NA
5: 5 B NA NA NA NA NA NA
6: 6 B 5 NA NA NA NA NA
7: 7 B 6 5 NA NA NA NA
8: 8 B 7 6 5 NA NA NA
9: 9 B 8 7 6 5 NA NA
10: 10 B 9 8 7 6 5 NA
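A small note on this last variant: the largest informative lag within a group is its row count minus one, so lag.Age6 above is all NA by construction (group B has only 6 rows). If you want to avoid that trailing all-NA column, subtract one when computing n:
n <- df[, .N, Group][, max(N)] - 1
cols <- paste0("lag.Age", 1:n)
df[, c(cols) := shift(Age, 1:n), Group]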
In the data below, the index column denotes the value, while t1:t4 give the number of times that specific value was recorded at a specific point in time. For example, index 10 at t1 equals 1, meaning 1 record was made; at t2 there are 4 records, while at t3 and t4 there is just 1 each. I would like to expand the values of the index column into columns t1:t4 according to these counts.
Input:
index t1 t2 t3 t4
10 1 4 1 1
20 2 5 1 0
30 3 6 1 0
40 0 0 0 2
Output:
t1 t2 t3 t4
10 10 10 10
20 10 20 40
20 10 30 40
30 10 NA NA
30 20 NA NA
30 20 NA NA
NA 20 NA NA
NA 20 NA NA
NA 30 NA NA
NA 30 NA NA
NA 30 NA NA
NA 30 NA NA
NA 30 NA NA
NA 30 NA NA
Sample data:
df <- structure(list(index = c(10, 20, 30, 40),
                     t1 = c(1, 2, 3, 0),
                     t2 = c(4, 5, 6, 0),
                     t3 = c(1, 1, 1, 0),
                     t4 = c(1, 0, 0, 2)),
                row.names = c(NA, 4L), class = "data.frame")
df
One dplyr, tidyr and purrr solution could be the following: each count column is expanded separately with uncount(), given a row id, and the pieces are then joined back together on rowid.
library(dplyr)
library(tidyr)
library(purrr)
library(tibble) # rowid_to_column()

map(.x = names(df)[-1],
    ~ df %>%
      uncount(get(.x)) %>%
      select(!!.x := index) %>%
      rowid_to_column()) %>%
  reduce(full_join)
rowid t1 t2 t3 t4
1 1 10 10 10 10
2 2 20 10 20 40
3 3 20 10 30 40
4 4 30 10 NA NA
5 5 30 20 NA NA
6 6 30 20 NA NA
7 7 NA 20 NA NA
8 8 NA 20 NA NA
9 9 NA 20 NA NA
10 10 NA 30 NA NA
11 11 NA 30 NA NA
12 12 NA 30 NA NA
13 13 NA 30 NA NA
14 14 NA 30 NA NA
15 15 NA 30 NA NA
Base R and one line of code.
Map(function(x) rep(df$index, x), df[,-1])
After the update (padding each vector with NA up to the common maximum length, so that they can form a data.frame):
maxy <- max(apply(df[,-1], 2, sum))
data.frame(Map(function(x) c(rep(df$index, x), rep(NA, maxy - sum(x))), df[,-1]))
Using base R with lapply
lst1 <- lapply(df[-1], function(x) rep(df$index, x))
data.frame(lapply(lst1, `length<-`, max(lengths(lst1))))
Output:
# t1 t2 t3 t4
#1 10 10 10 10
#2 20 10 20 40
#3 20 10 30 40
#4 30 10 NA NA
#5 30 20 NA NA
#6 30 20 NA NA
#7 NA 20 NA NA
#8 NA 20 NA NA
#9 NA 20 NA NA
#10 NA 30 NA NA
#11 NA 30 NA NA
#12 NA 30 NA NA
#13 NA 30 NA NA
#14 NA 30 NA NA
#15 NA 30 NA NA
Here is a base R option
list2DF(
lapply(
df[-1],
function(x) `length<-`(rep(df$index, x), max(colSums(df[-1])))
)
)
which gives
t1 t2 t3 t4
1 10 10 10 10
2 20 10 20 40
3 20 10 30 40
4 30 10 NA NA
5 30 20 NA NA
6 30 20 NA NA
7 NA 20 NA NA
8 NA 20 NA NA
9 NA 20 NA NA
10 NA 30 NA NA
11 NA 30 NA NA
12 NA 30 NA NA
13 NA 30 NA NA
14 NA 30 NA NA
15 NA 30 NA NA
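Since the rest of this thread leans on data.table, here is a sketch of the same pad-to-max-length idea in data.table (assuming the df from the sample data above):
library(data.table)
# expand index by each count column, then pad the shorter vectors with NA
lst <- lapply(df[-1], function(x) rep(df$index, x))
setDT(lapply(lst, `length<-`, max(lengths(lst))))[]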
I'm trying to clean my data. Let's imagine that we've got a vector of 20 values with several NAs:
set.seed(1234)
x <- round(rnorm(20, mean = 10, sd = 5))
x[c(6, 8, 12, 16, 19)] <- NA
So it looks something like this:
> 4 11 15 -2 12 NA 7 NA 7 6 8 NA 6 10 15 NA 7 5 NA 22
I need to replace values that are enclosed by NAs with NA. E.g., the 7 at position 7 of my vector should be NA, because the previous and next values are NA. I can do that with an ifelse statement and some dplyr functions:
library(dplyr)
ifelse(is.na(lag(x))&is.na(lead(x)), NA, x)
> 4 11 15 -2 12 NA NA NA 7 6 8 NA 6 10 15 NA 7 5 NA NA
The question is how I can replace two values enclosed by NAs, the 7 and 5 near the end for example. I tried duplicating the condition, i.e. adding lag(lag(x)) and lead(lead(x)), but I get a mess.
ifelse(is.na(lag(x))&is.na(lead(x)) | is.na(lead(lead(x)))&is.na(lag(lag(x))), NA, x)
> 4 11 15 -2 12 NA NA NA 7 NA 8 NA 6 NA 15 NA 7 5 NA NA
We can group by the running count of NAs and check the length of each group. If a group has length 3, it consists of NA, value, value, i.e. two values enclosed by NAs. We simply replace those values with NA.
i1 <- cumsum(is.na(x))
x[ave(i1, i1, FUN = function(i)length(i)) == 3] <- NA
#[1] 4 11 15 -2 12 NA 7 NA 7 6 8 NA 6 10 15 NA NA NA NA 22
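If you need to generalize this to enclosed runs of any chosen length, one sketch (starting again from the original x) is to run-length encode the NA pattern and blank every interior non-NA run of at most k values; with k = 2 this catches both the lone 7 and the 7, 5 pair:
k <- 2                          # maximum length of enclosed runs to blank
r <- rle(is.na(x))
ends <- cumsum(r$lengths)       # last index of each run
starts <- ends - r$lengths + 1  # first index of each run
idx <- which(!r$values & r$lengths <= k &
             seq_along(r$values) > 1 & seq_along(r$values) < length(r$values))
for (i in idx) x[starts[i]:ends[i]] <- NA
# x is now: 4 11 15 -2 12 NA NA NA 7 6 8 NA 6 10 15 NA NA NA NA 22
This works because rle() on is.na(x) alternates NA runs and value runs, so any interior value run is necessarily flanked by NAs on both sides.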
I have data with over 6k columns. Each result has columns whose data are always the same.
XCODE Age Sex ResultA Sex ResultB
1 X001 12 2 2 2 4
2 X002 23 2 4 2 66
3 X003 NA NA NA NA NA
4 X004 32 1 1 1 3
5 X005 NA NA NA NA NA
6 X001 NA NA NA NA NA
7 X002 NA NA NA NA NA
8 X003 33 1 8 1 6
9 X004 NA NA NA NA NA
10 X005 55 2 8 2 8
I would like to remove the duplicated columns, e.g. the Sex variable. Is there a possibility of doing that with data.table?
You can use match(), which compares whole columns, if you need to check for equality of all values.
df2 <- df[, unique(match(df, df)), with = FALSE]
df2
# XCODE Age Sex ResultA ResultB
# 1 X001 12 2 2 4
# 2 X002 23 2 4 66
# 3 X003 NA NA NA NA
# 4 X004 32 1 1 3
# 5 X005 NA NA NA NA
# 6 X001 NA NA NA NA
# 7 X002 NA NA NA NA
# 8 X003 33 1 8 6
# 9 X004 NA NA NA NA
# 10 X005 55 2 8 8
Data used:
df <- fread('
XCODE Age Sex ResultA Sex ResultB
1 X001 12 2 2 2 4
2 X002 23 2 4 2 66
3 X003 NA NA NA NA NA
4 X004 32 1 1 1 3
5 X005 NA NA NA NA NA
6 X001 NA NA NA NA NA
7 X002 NA NA NA NA NA
8 X003 33 1 8 1 6
9 X004 NA NA NA NA NA
10 X005 55 2 8 2 8
')[, -'V1']
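Why this works: a data.table is internally a list of columns, and match() compares list elements as whole objects, so each column maps to the position of its first identical occurrence. On the df above, it should print something like:
match(df, df)
#[1] 1 2 3 4 3 6
i.e. the second Sex column (position 5) matches the first (position 3), and unique() then keeps positions 1, 2, 3, 4 and 6, one copy of each distinct column.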
Try this:
df[, unique(colnames(df))]
One caveat: it will delete all columns with duplicated names. In your case, it would delete one of the Sex columns even if the two had the same name but different content.
If you have duplicated columns with different names, you can transpose your data frame, which allows you to use the unique function to solve your problem. You then transpose it back and convert it back to a data frame (because it became a matrix when you transposed it).
df = data.frame(c = 1:5, a = c("A", "B","C","D","E"), b = 1:5)
df = t(df)
df = unique(df)
df = t(df)
df = data.frame(df)
Edit: as markus points out, this is probably not a good option if you have columns of multiple types, because when t() coerces your data frame to a matrix it also coerces all your variables to a single type.
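A type-safe alternative (a sketch for a plain data.frame df) is to apply duplicated() to the list of columns directly, which avoids the matrix coercion entirely:
# duplicated() compares whole columns; later copies of an identical column are dropped
df <- df[, !duplicated(as.list(df)), drop = FALSE]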
I have a data set in which there are a few missing values. The data set looks like the following:
a b c0 d0 c1 d1 g h
1 5 20 10 NA NA 2 NA
1 6 NA NA 8 2 NA 4
2 5 25 10 NA NA 2.5 NA
2 7 NA NA 2 2 NA 1
2 8 50 10 NA NA 5 NA
3 9 10 10 NA NA 1 NA
3 6 NA NA 8 4 NA 2
3 10 NA NA 5 1 NA 5
4 5 NA NA 6 2 NA 3
4 11 25 10 NA NA 2.5 NA
My data is in the format shown above. Column a is a kind of time period: it is sequential, with multiple codes corresponding to each period.
Column b shows an item. An item either has repeated entries over time or a single unique one.
Columns g and h are made by dividing columns: c0/d0 = g and c1/d1 = h. Of the two, column g holds more importance.
As you can see, there are a few NAs, and some of the column b entries are duplicated whereas the rest are unique.
I have to perform the following steps in order to fill in the NAs in column g:
First, I have to determine for each entry in column b whether it is repeated or unique. E.g., entries 6 and 5 are repeated, whereas 7, 8, 9, 10 and 11 are unique.
Once that is established, the next step is to check whether the item already has a value in column g or not.
If it does, then we take the average of the repeated item's non-NA values in column g. For item 5, for example, the values are 2 and 2.5, so their average of 2.25 should be placed in column g for the repeated value 5 at a = 4.
If an item is repeated but its column g values are all NA, then I can simply take the column h value as the value of column g.
For the non-repeated items like 9, 10, 7, etc., since they are unique, just replace the column g entry by the column h value.
The final output should be as follows:
a b c0 d0 c1 d1 g h
1 5 20 10 NA NA 2 NA
1 6 NA NA 8 2 4 4
2 5 25 10 NA NA 2.5 NA
2 7 NA NA 2 2 1 1
2 8 50 10 NA NA 5 NA
3 9 10 10 NA NA 1 NA
3 6 NA NA 8 4 2 2
3 10 NA NA 5 1 5 5
4 5 NA NA 6 2 2.25 3
4 11 25 10 NA NA 2.5 NA
I would appreciate your help with this. If anything in the question is unclear or more details are required, do let me know.
Your desired output is inconsistent: one row is missing, column h has been altered, and hence column g in the seventh row looks inconsistent too.
Either way, following your description, I would do this in two steps:
First, subset your data to the b instances that have dupes, and replace their NAs with the mean of the rest of the group.
Then, replace all the remaining NAs with the corresponding values in column h.
I'd suggest data.table, as it allows comfortable operations on subsets:
library(data.table)
setDT(df)[duplicated(b) | duplicated(b, fromLast = TRUE), # operate only on the dupes
g := replace(g, is.na(g), mean(g, na.rm = TRUE)), by = b] # replace NA by group
df[is.na(g), g := as.double(h)] # subset by NAs and replace with corresponding values in h
df
# a b c0 d0 c1 d1 g h
# 1: 1 5 20 10 NA NA 2.00 NA
# 2: 1 6 NA NA 8 2 4.00 4
# 3: 2 5 25 10 NA NA 2.50 NA
# 4: 2 7 NA NA 2 2 1.00 1
# 5: 2 8 50 10 NA NA 5.00 NA
# 6: 3 9 10 10 NA NA 1.00 NA
# 7: 3 6 NA NA 8 2 4.00 4
# 8: 3 10 NA NA 5 1 5.00 5
# 9: 4 5 NA NA 6 2 2.25 3
# 10: 4 11 25 10 NA NA 2.50 NA
We can reduce it to "one" step once we recognize that, when grouped by b, duplicates imply that there is more than one row in the group. Therefore, the condition for replacing the NA values in g by the mean of the group's non-NA values is:
the number of rows in the b group is greater than one, and not all of g in the group is NA.
Otherwise, replace the NA values in g with h:
library(data.table)
setDT(df)[, g := if (.N > 1 & !all(is.na(g))) {
replace(g, is.na(g), mean(g, na.rm = TRUE))
} else {
replace(g, is.na(g), as.double(h))
}, by=b][]
## a b c0 d0 c1 d1 g h
## 1: 1 5 20 10 NA NA 2.00 NA
## 2: 1 6 NA NA 8 2 4.00 4
## 3: 2 5 25 10 NA NA 2.50 NA
## 4: 2 7 NA NA 2 2 1.00 1
## 5: 2 8 50 10 NA NA 5.00 NA
## 6: 3 9 10 10 NA NA 1.00 NA
## 7: 3 6 NA NA 8 2 4.00 4
## 8: 3 10 NA NA 5 1 5.00 5
## 9: 4 5 NA NA 6 2 2.25 3
##10: 4 11 25 10 NA NA 2.50 NA
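One subtlety in the else branch: replace(g, is.na(g), as.double(h)) fills the NA positions of g with the h values in order. That lines up correctly here because this branch only runs when the b group has a single row or when every g in the group is NA, so the NA positions and the corresponding h values are positionally aligned.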