Modify a column's value based on other columns in R

I have a CSV table loaded as a data frame. I want to modify the values in specific columns based on the values of another column.
I have written some code, but it doesn't work.
The data frame contains 1076 rows and 156 columns.
The logic should be like this:
if ((a[i, "0Q-state"] == "done") && is.na(a[i, "0Q-01"])) a[i, "0Q-01"] = 0;
else a[i, "0Q-01"] = a[i, "0Q-01"];
but I don't know how to do this in R.
>dataset4
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 NA
2: 1 1 1 1 1 1 NA 1 1
3: done 1 1 1 NA 1 1 1 1 1
5: done 1 1 1 1 0 0 0 1 0
6: done 1 1 1 1 0 0 0 1 0
7: 1 1 NA 1 0 0 0 1 0
8: done 1 1 1 1 0 0 0 1 0
sapply(c("0Q-01","0Q-02","0Q-03","0Q-04","0Q-05","0Q-06","0Q-07","0Q-08","0Q-09"),
function(y) {
dataset4[,y] <- sapply(c(1:1076), function(x)
ifelse (((is.na(dataset4[x,y])) && (dataset4[x,c("0Q-state")] == "done"))
,0, dataset4[x,y]))}
)
Output has to be:
>dataset4
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 0
2: 1 1 1 1 1 1 NA 1 1
3: done 1 1 1 0 1 1 1 1 1
5: done 1 1 1 1 0 0 0 1 0
6: done 1 1 1 1 0 0 0 1 0
7: 1 1 NA 1 0 0 0 1 0
8: done 1 1 1 1 0 0 0 1 0

We could try:
df[rep(df[, 1] == "done", ncol(df)) & is.na(df)] <- 0
df
1 done 1 1 1 1 1 1 1 1 0
2 1 1 1 1 1 1 NA 1 1
3 done 1 1 1 0 1 1 1 1 1
4 done 1 1 1 1 0 0 0 1 0
5 done 1 1 1 1 0 0 0 1 0
6 1 1 NA 1 0 0 0 1 0
7 done 1 1 1 1 0 0 0 1 0
or using sapply():
myFunc <- function(x, y) ifelse(is.na(x) & y == "done", 0, x)
data.frame(df[, 1], sapply(df[, -1], myFunc, y = df[, 1]))
1 done 1 1 1 1 1 1 1 1 0
2 1 1 1 1 1 1 NA 1 1
3 done 1 1 1 0 1 1 1 1 1
4 done 1 1 1 1 0 0 0 1 0
5 done 1 1 1 1 0 0 0 1 0
6 1 1 NA 1 0 0 0 1 0
7 done 1 1 1 1 0 0 0 1 0
Here you can always substitute df[, 1] with df[, "0Q-state"] and df[, -1] with df[, namesOfDummyVars].
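The masking idea above can be sketched on a tiny, self-contained example. The data frame and the column names (state, q01, q02) are made up for illustration and are not the asker's real data; the sketch only assumes that the state column holds "done" or an empty string.

```r
# Hypothetical miniature of the question's table (names are illustrative)
df <- data.frame(
  state = c("done", "", "done", "done"),
  q01   = c(1, NA, NA, 1),
  q02   = c(NA, 1, 1, NA),
  stringsAsFactors = FALSE
)

value_cols <- c("q01", "q02")
is_done    <- df$state == "done"

# is.na() on a data frame returns a logical matrix; is_done has one entry
# per row, so it is recycled down each column of that matrix.
mask <- is.na(df[value_cols]) & is_done

# Replace only the cells that are NA in a row marked "done"
df[value_cols][mask] <- 0
df
```

Restricting the mask to the value columns avoids touching the state column, which is a design choice of this sketch rather than something the answer requires.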

The question has been tagged with data.table and the printed output of dataset4 suggests that dataset4 already is a data.table object.
Here are three variants in data.table syntax to replace NAs in rows which are marked as "done".
# create vector of names of columns to be changed
cols <- sprintf("0Q-%02i", 1:9)
# variant 1
dataset4[`0Q-state` == "done",
(cols) := lapply(.SD, function(x) replace(x, is.na(x), 0L)),
.SDcols = cols][]
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 0
2: 1 1 1 1 1 NA 1 1 NA
3: done 1 1 1 0 1 1 1 1 1
4: done 1 1 1 1 0 0 0 1 0
5: done 1 1 1 1 0 0 0 1 0
6: 1 NA 1 0 0 0 1 0 NA
7: done 1 1 1 1 0 0 0 1 0
or
# variant 2
lapply(cols, function(i) dataset4[`0Q-state` == "done" & is.na(get(i)), (i) := 0L])
dataset4
returning the same as above
or
# variant 3 --- data.table development version 1.10.5
for (i in cols)
set(dataset4, which(dataset4[, "0Q-state"] == "done" & is.na(dataset4[, ..i])), i, 0L)
dataset4
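As a side note on the code in the question: one likely reason it appears to do nothing is that an assignment made inside a function only changes that function's local copy. The following base R sketch illustrates this on a made-up two-row data frame (d, fill_local and fill_col are hypothetical names).

```r
# Hypothetical toy data: one "done" row and one unfinished row
d <- data.frame(state = c("done", ""),
                v = c(NA_real_, NA_real_),
                stringsAsFactors = FALSE)

# Assigning inside a function modifies a local copy of d only
fill_local <- function(col) {
  d[[col]] <- ifelse(is.na(d[[col]]) & d$state == "done", 0, d[[col]])
  invisible(NULL)
}
fill_local("v")
unchanged <- d$v   # the global d is untouched

# Returning the new column and assigning it outside the function works
fill_col <- function(col) {
  ifelse(is.na(d[[col]]) & d$state == "done", 0, d[[col]])
}
d[["v"]] <- fill_col("v")
d
```

The data.table variants above sidestep this because := and set() modify the table by reference.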

Related

Binary Variables Combinations Analysis in R

I have a data set, which has a lot of binary variables. For the ease of illustration, here is a smaller version with only 4 variables:
set.seed(5)
my_data<-data.frame("Slept Well"=sample(c(0,1),10,TRUE),
"Had Breakfast"=sample(c(0,1),10,TRUE),
"Worked out"=sample(c(0,1),10,TRUE),
"Meditated"=sample(c(0,1),10,TRUE))
In the above, each row corresponds to an observation. I am interested in analysing the frequency of each unique combination of the variables. For example, how many observations said that they both slept well and meditated, but did not have breakfast or work out?
I would like to be able to rank the unique combinations from most frequently occurring to the least frequently occurring. What is the best way to go about coding that up?
You can use aggregate.
x <- aggregate(list(n=rep(1, nrow(my_data))), my_data, length)
#x <- aggregate(list(n=my_data[,1]), my_data, length) #Alternative
x[order(-x$n),]
# Slept.Well Had.Breakfast Worked.out Meditated n
#4 0 1 1 0 2
#1 0 0 0 0 1
#2 1 1 0 0 1
#3 0 0 1 0 1
#5 0 0 0 1 1
#6 1 0 0 1 1
#7 0 1 0 1 1
#8 0 0 1 1 1
#9 0 1 1 1 1
What about a dplyr solution:
library(dplyr)
my_data %>%
# group it
group_by_all() %>%
# frequencies
summarise(freq = n()) %>%
# order decreasing
arrange(-freq)
# A tibble: 9 x 5
Slept.Well Had.Breakfast Worked.out Meditated freq
<dbl> <dbl> <dbl> <dbl> <int>
1 0 1 1 0 2
2 0 0 0 0 1
3 0 0 0 1 1
4 0 0 1 0 1
5 0 0 1 1 1
6 0 1 0 1 1
7 0 1 1 1 1
8 1 0 0 1 1
9 1 1 0 0 1
Or with data.table:
library(data.table)
res <- setorder(data.table(my_data)[, .(freq = .N), by = names(my_data)], -freq)
res
Slept.Well Had.Breakfast Worked.out Meditated freq
1: 0 1 1 0 2
2: 1 0 0 1 1
3: 0 0 1 0 1
4: 0 0 0 0 1
5: 0 1 0 1 1
6: 0 1 1 1 1
7: 0 0 1 1 1
8: 0 0 0 1 1
9: 1 1 0 0 1
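If no extra package is wanted, the same counting can be sketched in base R by pasting each row into a single key and tabulating the keys. The small data frame below is a deterministic stand-in (its column names a and b are assumptions) so that the result does not depend on the random seed.

```r
# Hypothetical stand-in for my_data with two binary columns
my_data <- data.frame(a = c(0, 1, 0, 1, 0),
                      b = c(1, 1, 1, 0, 1))

# Build one string key per row, e.g. "0-1", then count identical keys
key    <- do.call(paste, c(my_data, sep = "-"))
counts <- sort(table(key), decreasing = TRUE)
counts
```

The key encodes the combination in column order, so "0-1" here means a = 0 and b = 1.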

How to remove duplicate values from different rows per unique identifier?

I'm just starting to use R. I have a dataset with unique identifiers (1958 patients) in the first column and 0s and 1s in columns 2-35.
For example:
Patient A: 0 1 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 NA NA
I want to change this to:
Patient A: 0 1 0 1 0 1
Thanks in advance.
We can use tapply, grouping the values by whether they change from the previous value, i.e.
tapply(x[!is.na(x)], cumsum(c(TRUE, diff(x[!is.na(x)]) != 0)), FUN = unique)
#1 2 3 4 5 6
#0 1 0 1 0 1
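A closely related base R sketch uses rle(), which splits a vector into runs of equal values; keeping only the run values collapses consecutive duplicates. The vector x below is an assumed, shortened version of the patient row from the question.

```r
# Assumed example row: runs of 0s and 1s followed by trailing NAs
x <- c(0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, NA, NA)

# Drop the NAs, then keep one value per run of identical neighbours
collapsed <- rle(x[!is.na(x)])$values
collapsed
```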
Based on your example, it is not clear whether NAs can also occur in the middle, and how you would want to deal with that situation: either an NA marks a boundary, so that 1 NA 1 becomes 1 1 and both 1s are kept (option 1), or NAs are removed first, so that 1 NA 1 collapses to a single 1 (option 2).
That determines at which point to remove NAs in the code.
You could use S4Vectors run length encoding, which would allow you to have more than just 0 and 1.
library(S4Vectors)
## create example data
set.seed(1)
x <- sample(c(0,1), (1958*34), replace=TRUE, prob=c(.4, .6))
x[sample(length(x), 200)] <- NA
x <- matrix(x, nrow=1958, ncol=34)
df <- data.frame(patient.id = paste0("P", seq_len(1958)), x, stringsAsFactors = FALSE)
## define function to remove NA values
# option 1
fun.NA.boundary <- function(x) {
a <- runValue(Rle(x))
a[!is.na(a)]
}
# option 2
fun.NA.remove <- function(x) runValue(Rle(x[!is.na(x)]))
## calculate results (run only one option; the second assignment overwrites the first)
# option 1
reslist <- apply(df[, -1], 1, fun.NA.boundary)
# option 2
reslist <- apply(df[, -1], 1, fun.NA.remove)
names(reslist) <- df$patient.id
head(reslist)
#> $P1
#> [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
#>
#> $P2
#> [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#>
#> $P3
#> [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#>
#> $P4
#> [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#>
#> $P5
#> [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
#>
#> $P6
#> [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
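The difference between the two NA treatments can also be seen with base R's rle(), which puts every NA in its own run. This is only a sketch on a made-up vector, not a replacement for the S4Vectors code above.

```r
# Made-up example with an NA between two runs of 1s
x <- c(1, 1, NA, 1, 0, 0)

# NAs removed first: the 1s on both sides of the NA merge into one run
remove_first <- rle(x[!is.na(x)])$values

# NA kept as a boundary: encode runs first, then drop the NA run
runs     <- rle(x)$values
boundary <- runs[!is.na(runs)]

remove_first
boundary
```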

Generate a dummy variable satisfying a condition for the same individual in a panel dataframe

I have a dataframe of this form
ID panelid dummy1 dummy2
1 1 0 1
1 2 1 0
2 1 1 0
2 2 0 1
3 1 1 0
3 2 1 0
4 1 0 1
4 2 0 1
I want to generate a dummy variable equal to one where panelid==2, but only if the same individual has dummy1 equal to 1 in panelid==1 and dummy2 equal to 1 in panelid==2. Thus I want to obtain something like this:
ID panelid dummy1 dummy2 result
1 1 0 1 0
1 2 1 0 0
2 1 1 0 0
2 2 0 1 1
3 1 1 0 0
3 2 1 0 0
4 1 0 1 0
4 2 0 1 0
Can someone help me with this?
Many thanks to everyone.
This is an almost identical solution to @Cole's solution.
dataset <- read.table(text = 'ID panelid dummy1 dummy2
1 1 0 1
1 2 1 0
2 1 1 0
2 2 0 1
3 1 1 0
3 2 1 0
4 1 0 1
4 2 0 1',
header = TRUE)
temp_ID <- dataset$ID[(dataset$panelid == 1) & (dataset$dummy1 == 1)]
dataset$result <- as.integer(x = ((dataset$panelid == 2) & (dataset$dummy2 == 1) & (dataset$ID %in% temp_ID)))
dataset
ID panelid dummy1 dummy2 result
1 1 1 0 1 0
2 1 2 1 0 0
3 2 1 1 0 0
4 2 2 0 1 1
5 3 1 1 0 0
6 3 2 1 0 0
7 4 1 0 1 0
8 4 2 0 1 0
Here's a base R approach:
dummy1_in_panelid <- with(df, ID[panelid == 1 & dummy1 == 1])
#initialize
df$result <- 0
df$result[with(df, which(panelid == 2 & ID %in% dummy1_in_panelid & dummy2 == 1))] <- 1
df
ID panelid dummy1 dummy2 result
1 1 1 0 1 0
2 1 2 1 0 0
3 2 1 1 0 0
4 2 2 0 1 1
5 3 1 1 0 0
6 3 2 1 0 0
7 4 1 0 1 0
8 4 2 0 1 0
And the data...
df <- as.data.frame(data.table::fread('
ID panelid dummy1 dummy2
1 1 0 1
1 2 1 0
2 1 1 0
2 2 0 1
3 1 1 0
3 2 1 0
4 1 0 1
4 2 0 1'))
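A variation on the same idea, sketched here with ave() instead of %in%, spreads the "dummy1 is 1 in panel 1" condition to every row of the same ID and then combines it with the panel 2 condition. The object names dat, cond_p1 and id_has_p1 are made up for this sketch.

```r
# The example data from the question, typed in directly
dat <- data.frame(
  ID      = c(1, 1, 2, 2, 3, 3, 4, 4),
  panelid = c(1, 2, 1, 2, 1, 2, 1, 2),
  dummy1  = c(0, 1, 1, 0, 1, 1, 0, 0),
  dummy2  = c(1, 0, 0, 1, 0, 0, 1, 1)
)

# 1 on rows where the panel 1 condition holds, 0 elsewhere
cond_p1 <- as.integer(dat$panelid == 1 & dat$dummy1 == 1)

# Per-ID maximum: TRUE on every row of an ID that has such a row
id_has_p1 <- ave(cond_p1, dat$ID, FUN = max) == 1

dat$result <- as.integer(id_has_p1 & dat$panelid == 2 & dat$dummy2 == 1)
dat
```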

Ifelse statement with dataframe subset using date

I am trying to create a function to apply to a variable in a dataframe that, for a window of 2 days forward from the current observation, changes the value of VarD if it always takes the value 1 within that date window.
The dataframe looks like this:
VarA VarB Date Diff VarD
1 1 2007-04-09 NA 0
1 1 2007-04-10 0 0
1 1 2007-04-11 -2 1
1 1 2007-04-12 0 1
1 1 2007-04-13 2 0
1 1 2007-04-14 0 0
1 1 2007-04-15 -2 1
1 1 2007-04-16 1 0
1 1 2007-04-17 -4 1
1 1 2007-04-18 0 1
1 1 2007-04-19 0 1
1 1 2007-04-20 0 1
The new dataframe should look like the following:
VarA VarB Date Diff VarD VarC
1 1 2007-04-09 NA 0 0
1 1 2007-04-10 0 0 0
1 1 2007-04-11 -2 1 1
1 1 2007-04-12 0 1 1
1 1 2007-04-13 2 0 0
1 1 2007-04-14 0 0 0
1 1 2007-04-15 -2 1 1
1 1 2007-04-16 1 0 0
1 1 2007-04-17 -4 1 0
1 1 2007-04-18 0 1 0
1 1 2007-04-19 0 1 0
1 1 2007-04-20 0 1 0
I have tried the following code:
db$VarC <- 0
for (i in unique(db$VarA)) {
for (j in unique(db$VarB)) {
for (n in 1 : length(db$Date)) {
if (db$VarD[n] == 0) {db$VarC[n] <- 0}
else { db$VarC[n] <- ifelse(0 %in% db[(db$Date >= n & db$Date < n+3),]$VarC, 1, 0)}
}
}
}
But I obtain just zeroes in VarC. I have checked the code without the else and it works fine. R reports no error when the complete code is run. I have no idea where the problem could be.
Here are some alternatives. The first one avoids some messy indexing but the last two do not require any packages.
1) rollapply. This applies the VarC function in a rolling fashion to each window of 3 elements of db$VarD. align = "left" says that when x is passed to the function VarC, x[1] is the current element, x[2] the next and x[3] the one after that, i.e. the current element is the leftmost. partial = TRUE says that if fewer than 3 elements are available (which is the case for the last and next-to-last elements) then just pass however many remain.
library(zoo)
VarC <- function(x) if (all(x[-1] == 1)) 0 else x[1]
db$VarC <- rollapply(db$VarD, 3, VarC, partial = TRUE, align = "left")
giving:
> db
VarA VarB Date Diff VarD VarC
1 1 1 2007-04-09 NA 0 0
2 1 1 2007-04-10 0 0 0
3 1 1 2007-04-11 -2 1 1
4 1 1 2007-04-12 0 1 1
5 1 1 2007-04-13 2 0 0
6 1 1 2007-04-14 0 0 0
7 1 1 2007-04-15 -2 1 1
8 1 1 2007-04-16 1 0 0
9 1 1 2007-04-17 -4 1 0
10 1 1 2007-04-18 0 1 0
11 1 1 2007-04-19 0 1 0
12 1 1 2007-04-20 0 1 0
2) sapply. Alternatively, using VarC from above:
n <- nrow(db)
db$VarC <- sapply(1:n, function(i) VarC(db$VarD[i:min(i+2, n)]))
3) for. Alternatively, using n and VarC from above:
db$VarC <- NA
for(i in 1:n) db$VarC[i] <- VarC(db$VarD[i:min(i+2, n)])
Note: The input db in reproducible form is:
Lines <- "VarA VarB Date Diff VarD VarC
1 1 2007-04-09 NA 0 0
1 1 2007-04-10 0 0 0
1 1 2007-04-11 -2 1 1
1 1 2007-04-12 0 1 1
1 1 2007-04-13 2 0 0
1 1 2007-04-14 0 0 0
1 1 2007-04-15 -2 1 1
1 1 2007-04-16 1 0 0
1 1 2007-04-17 -4 1 0
1 1 2007-04-18 0 1 0
1 1 2007-04-19 0 1 0
1 1 2007-04-20 0 1 0 "
db <- read.table(text = Lines, header = TRUE)
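To see how the window function behaves at the end of the series, here is a package-free sketch that runs the same logic on the VarD column typed in as a plain vector. The names var_d and var_c are made up for this sketch. Note that for the last element x[-1] is empty, and all() of an empty logical vector is TRUE, so the last row gets 0.

```r
# VarD from the example, as a plain vector
var_d <- c(0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1)

# Same rule as VarC above: 0 if all following values in the window are 1,
# otherwise keep the current value
var_c <- function(x) if (all(x[-1] == 1)) 0 else x[1]

n <- length(var_d)
result <- vapply(seq_len(n),
                 function(i) var_c(var_d[i:min(i + 2, n)]),
                 numeric(1))
result
```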

R data.table condition within group, but recorded at first instance in group

I have data that looks a bit like this:
df <- data.frame(ID=c(rep(1,4),rep(2,2),rep(3,2),4), TYPE=c(1,3,2,4,1,2,2,3,2),
SEQUENCE=c(seq(1,4),1,2,1,2,1))
ID TYPE SEQUENCE
1 1 1
1 3 2
1 2 3
1 4 4
2 1 1
2 2 2
3 2 1
3 3 2
4 2 1
I now need to check whether a certain type is present in each ID block (binary), but only record the answer in the first record per block (SEQUENCE == 1).
The best I came up with so far is coding them in the row they are present in, e.g.
library(data.table)
DT <- data.table(df)
DT$A[DT$TYPE==1] <- 1
DT$B[DT$TYPE==2] <- 1
DT$C[DT$TYPE==3] <- 1
DT$D[DT$TYPE==4] <- 1
DT[is.na(DT)] <- 0
RESULT:
ID TYPE SEQUENCE A B C D
1 1 1 1 0 0 0
1 3 2 0 0 1 0
1 2 3 0 1 0 0
1 4 4 0 0 0 1
2 1 1 1 0 0 0
2 2 2 0 1 0 0
3 2 1 0 1 0 0
3 3 2 0 0 1 0
4 2 1 0 1 0 0
However, the result should look like this:
ID TYPE SEQUENCE A B C D
1 1 1 1 1 1 1
1 3 2 0 0 0 0
1 2 3 0 0 0 0
1 4 4 0 0 0 0
2 1 1 1 1 0 0
2 2 2 0 0 0 0
3 2 1 0 1 1 0
3 3 2 0 0 0 0
4 2 1 0 1 0 0
I assume this can be done with data.table, but I haven't quite found the correct syntax.
This makes one copy of the data.table:
DT[, FAC := factor(TYPE, labels=LETTERS[1:4])]
DT <- dcast.data.table(DT, ID+TYPE+SEQUENCE~FAC, fun.aggregate=length)
DT[,LETTERS[1:4] := lapply(.SD,
function(x) c(any(as.logical(x)), rep(0L, length(x)-1))),
.SDcols=LETTERS[1:4], by=ID]
# ID TYPE SEQUENCE A B C D
#1: 1 1 1 1 1 1 1
#2: 1 2 3 0 0 0 0
#3: 1 3 2 0 0 0 0
#4: 1 4 4 0 0 0 0
#5: 2 1 1 1 1 0 0
#6: 2 2 2 0 0 0 0
#7: 3 2 1 0 1 1 0
#8: 3 3 2 0 0 0 0
#9: 4 2 1 0 1 0 0
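For comparison, a base R sketch of the same result that keeps the original row order: for each type, ave() marks every row of an ID in which that type occurs, and the flag is then kept only where SEQUENCE == 1. The mapping of letters to type codes (flags) is an assumption taken from the question's A to D columns.

```r
df <- data.frame(ID       = c(1, 1, 1, 1, 2, 2, 3, 3, 4),
                 TYPE     = c(1, 3, 2, 4, 1, 2, 2, 3, 2),
                 SEQUENCE = c(1, 2, 3, 4, 1, 2, 1, 2, 1))

# Assumed mapping from output column name to TYPE code
flags <- c(A = 1, B = 2, C = 3, D = 4)

for (nm in names(flags)) {
  # 1 on every row of an ID that contains this type at least once
  present <- ave(as.integer(df$TYPE == flags[[nm]]), df$ID, FUN = max)
  # record the answer only in the first record of each block
  df[[nm]] <- as.integer(present == 1 & df$SEQUENCE == 1)
}
df
```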
