I am trying to create a function to apply to a variable in a dataframe that, for a window of 2 days forward from the current observation, changes the value of VarD if it always takes the value 1 within that date window.
The dataframe looks like this:
VarA VarB Date Diff VarD
1 1 2007-04-09 NA 0
1 1 2007-04-10 0 0
1 1 2007-04-11 -2 1
1 1 2007-04-12 0 1
1 1 2007-04-13 2 0
1 1 2007-04-14 0 0
1 1 2007-04-15 -2 1
1 1 2007-04-16 1 0
1 1 2007-04-17 -4 1
1 1 2007-04-18 0 1
1 1 2007-04-19 0 1
1 1 2007-04-20 0 1
The new dataframe should look like the following:
VarA VarB Date Diff VarD VarC
1 1 2007-04-09 NA 0 0
1 1 2007-04-10 0 0 0
1 1 2007-04-11 -2 1 1
1 1 2007-04-12 0 1 1
1 1 2007-04-13 2 0 0
1 1 2007-04-14 0 0 0
1 1 2007-04-15 -2 1 1
1 1 2007-04-16 1 0 0
1 1 2007-04-17 -4 1 0
1 1 2007-04-18 0 1 0
1 1 2007-04-19 0 1 0
1 1 2007-04-20 0 1 0
I have tried the following code:
db$VarC <- 0
for (i in unique(db$VarA)) {
for (j in unique(db$VarB)) {
for (n in 1:length(db$Date)) {
if (db$VarD[n] == 0) {db$VarC[n] <- 0}
else { db$VarC[n] <- ifelse(0 %in% db[(db$Date >= n & db$Date < n+3),]$VarC, 1, 0)}
}
}
}
But I obtain just zeroes in VarC. I have checked the code without the else branch and it works fine. R reports no error when the complete code is run. I have no clue where the problem could be.
Here are some alternatives. The first one avoids some messy indexing but the last two do not require any packages.
1) rollapply. This applies the function VarC (defined in the code below) in a rolling fashion to each window of 3 elements of db$VarD. align = "left" means that when x is passed to VarC, x[1] is the current element, x[2] the next and x[3] the one after that, i.e. the current element is the leftmost. partial = TRUE means that if fewer than 3 elements are available (as is the case for the last and next-to-last elements), whatever elements remain are passed. For the final element x[-1] is empty, all() of an empty vector is TRUE, and VarC therefore returns 0.
library(zoo)
VarC <- function(x) if (all(x[-1] == 1)) 0 else x[1]
db$VarC <- rollapply(db$VarD, 3, VarC, partial = TRUE, align = "left")
giving:
> db
VarA VarB Date Diff VarD VarC
1 1 1 2007-04-09 NA 0 0
2 1 1 2007-04-10 0 0 0
3 1 1 2007-04-11 -2 1 1
4 1 1 2007-04-12 0 1 1
5 1 1 2007-04-13 2 0 0
6 1 1 2007-04-14 0 0 0
7 1 1 2007-04-15 -2 1 1
8 1 1 2007-04-16 1 0 0
9 1 1 2007-04-17 -4 1 0
10 1 1 2007-04-18 0 1 0
11 1 1 2007-04-19 0 1 0
12 1 1 2007-04-20 0 1 0
2) sapply. Alternatively, using VarC from above:
n <- nrow(db)
db$VarC <- sapply(1:n, function(i) VarC(db$VarD[i:min(i+2, n)]))
3) for. Or with a loop, using n and VarC from above:
db$VarC <- NA
for(i in 1:n) db$VarC[i] <- VarC(db$VarD[i:min(i+2, n)])
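For comparison, the same rule can also be written without an explicit loop. This is only a sketch, assuming the rows are consecutive days within a single group; the names VarD_demo, padded, ahead1 and ahead2 are made up for illustration. Positions beyond the end of the data are padded with 1 so that the short windows at the end behave like partial = TRUE above.

```r
# VarD values copied from the example data in the question
VarD_demo <- c(0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1)
n_demo <- length(VarD_demo)

# Pad with 1s so that "missing" future days count as 1,
# which mirrors all(x[-1] == 1) on a shortened window
padded <- c(VarD_demo, 1, 1)
ahead1 <- padded[seq_len(n_demo) + 1]  # value one row ahead
ahead2 <- padded[seq_len(n_demo) + 2]  # value two rows ahead

# 0 when both following values are 1, otherwise keep the current value
VarC_demo <- ifelse(ahead1 == 1 & ahead2 == 1, 0, VarD_demo)
VarC_demo
```

The result can be compared against the VarC column in the desired output shown in the question.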
Note: The input db in reproducible form is:
Lines <- "VarA VarB Date Diff VarD VarC
1 1 2007-04-09 NA 0 0
1 1 2007-04-10 0 0 0
1 1 2007-04-11 -2 1 1
1 1 2007-04-12 0 1 1
1 1 2007-04-13 2 0 0
1 1 2007-04-14 0 0 0
1 1 2007-04-15 -2 1 1
1 1 2007-04-16 1 0 0
1 1 2007-04-17 -4 1 0
1 1 2007-04-18 0 1 0
1 1 2007-04-19 0 1 0
1 1 2007-04-20 0 1 0 "
db <- read.table(text = Lines, header = TRUE)
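The loops over VarA and VarB in the question suggest that the window should not cross group boundaries. Below is a minimal sketch of a group-wise version using ave() from base R, assuming the rows are already sorted by date within each group. The data frame db2 and the helper roll_left are hypothetical names, and the data are invented for the illustration.

```r
# Invented two-group data; only VarD and the grouping columns matter here
db2 <- data.frame(
  VarA = rep(c(1, 2), each = 4),
  VarB = 1,
  VarD = c(1, 1, 1, 0,  1, 1, 1, 1)
)

# Same window function as in the answer above
VarC <- function(x) if (all(x[-1] == 1)) 0 else x[1]

# Apply VarC to left-aligned windows of up to 3 elements of one group
roll_left <- function(v) {
  n <- length(v)
  sapply(seq_len(n), function(i) VarC(v[i:min(i + 2, n)]))
}

# ave() splits VarD by the grouping columns and reassembles the results
db2$VarC <- ave(db2$VarD, db2$VarA, db2$VarB, FUN = roll_left)
db2
```

With this layout the last rows of each group get shortened windows, so values from the next group never enter the calculation.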
Related
I have a dataframe of this form
ID panelid dummy1 dummy2
1 1 0 1
1 2 1 0
2 1 1 0
2 2 0 1
3 1 1 0
3 2 1 0
4 1 0 1
4 2 0 1
I want to generate a dummy variable equal to one in the rows where panelid==2, but only if the same individual has dummy1 equal to 1 in panelid==1 and dummy2 equal to 1 in panelid==2. Thus I want to obtain something like this
ID panelid dummy1 dummy2 result
1 1 0 1 0
1 2 1 0 0
2 1 1 0 0
2 2 0 1 1
3 1 1 0 0
3 2 1 0 0
4 1 0 1 0
4 2 0 1 0
Can someone help me with these?
Many thanks to everyone
This is almost identical to @Cole's solution.
dataset <- read.table(text = 'ID panelid dummy1 dummy2
1 1 0 1
1 2 1 0
2 1 1 0
2 2 0 1
3 1 1 0
3 2 1 0
4 1 0 1
4 2 0 1',
header = TRUE)
temp_ID <- dataset$ID[(dataset$panelid == 1) & (dataset$dummy1 == 1)]
dataset$result <- as.integer(x = ((dataset$panelid == 2) & (dataset$dummy2 == 1) & (dataset$ID %in% temp_ID)))
dataset
ID panelid dummy1 dummy2 result
1 1 1 0 1 0
2 1 2 1 0 0
3 2 1 1 0 0
4 2 2 0 1 1
5 3 1 1 0 0
6 3 2 1 0 0
7 4 1 0 1 0
8 4 2 0 1 0
Here's a base R approach:
dummy1_in_panelid <- with(df, ID[panelid == 1 & dummy1 == 1])
#initialize
df$result <- 0
df$result[with(df, which(panelid == 2 & ID %in% dummy1_in_panelid & dummy2 == 1))] <- 1
df
ID panelid dummy1 dummy2 result
1 1 1 0 1 0
2 1 2 1 0 0
3 2 1 1 0 0
4 2 2 0 1 1
5 3 1 1 0 0
6 3 2 1 0 0
7 4 1 0 1 0
8 4 2 0 1 0
And the data...
df <- as.data.frame(data.table::fread('
ID panelid dummy1 dummy2
1 1 0 1
1 2 1 0
2 1 1 0
2 2 0 1
3 1 1 0
3 2 1 0
4 1 0 1
4 2 0 1'))
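Both answers first collect the qualifying IDs and then test membership with %in%. A possible alternative, sketched here under the assumption that each ID has at most one row per panelid, is to broadcast the panelid 1 condition to every row of the same ID with ave(). The names panel and has_d1 are made up for this example.

```r
panel <- read.table(text = "ID panelid dummy1 dummy2
1 1 0 1
1 2 1 0
2 1 1 0
2 2 0 1
3 1 1 0
3 2 1 0
4 1 0 1
4 2 0 1", header = TRUE)

# TRUE on every row of an ID that has dummy1 == 1 in panelid 1
has_d1 <- ave(panel$panelid == 1 & panel$dummy1 == 1, panel$ID, FUN = any)

# Flag only the panelid 2 rows that also have dummy2 == 1
panel$result <- as.integer(panel$panelid == 2 & panel$dummy2 == 1 & has_d1)
panel
```

Only the second row of ID 2 satisfies all three conditions, which agrees with the output shown above.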
I have a CSV table (as a data frame). I want to modify a specific column value by other columns values.
I have written some code, but it doesn't work.
The data frame contains 1076 rows and 156 columns.
The rule has to be like this (pseudocode):
if ((a[i,"0Q-state"] == "done") && (a[i,"0Q-01"] is NA)) a[i,"0Q-01"] = 0;
else a[i,"0Q-01"] = a[i,"0Q-01"];
but I don't know how to do this in R.
>dataset4
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 NA
2: 1 1 1 1 1 1 NA 1 1
3: done 1 1 1 NA 1 1 1 1 1
5: done 1 1 1 1 0 0 0 1 0
6: done 1 1 1 1 0 0 0 1 0
7: 1 1 NA 1 0 0 0 1 0
8: done 1 1 1 1 0 0 0 1 0
sapply(c("0Q-01","0Q-02","0Q-03","0Q-04","0Q-05","0Q-06","0Q-07","0Q-08","0Q-09"),
function(y) {
dataset4[,y] <- sapply(c(1:1076), function(x)
ifelse (((is.na(dataset4[x,y])) && (dataset4[x,c("0Q-state")] == "done"))
,0, dataset4[x,y]))}
)
Output has to be:
>dataset4
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 0
2: 1 1 1 1 1 1 NA 1 1
3: done 1 1 1 0 1 1 1 1 1
5: done 1 1 1 1 0 0 0 1 0
6: done 1 1 1 1 0 0 0 1 0
7: 1 1 NA 1 0 0 0 1 0
8: done 1 1 1 1 0 0 0 1 0
We could try:
df[rep(df[, 1] == "done", ncol(df)) & is.na(df)] <- 0
df
1 done 1 1 1 1 1 1 1 1 0
2 1 1 1 1 1 1 NA 1 1
3 done 1 1 1 0 1 1 1 1 1
4 done 1 1 1 1 0 0 0 1 0
5 done 1 1 1 1 0 0 0 1 0
6 1 1 NA 1 0 0 0 1 0
7 done 1 1 1 1 0 0 0 1 0
or using sapply():
myFunc <- function(x, y) ifelse(is.na(x) & y == "done", 0, x)
data.frame(df[, 1], sapply(df[, -1], myFunc, y = df[, 1]))
1 done 1 1 1 1 1 1 1 1 0
2 1 1 1 1 1 1 NA 1 1
3 done 1 1 1 0 1 1 1 1 1
4 done 1 1 1 1 0 0 0 1 0
5 done 1 1 1 1 0 0 0 1 0
6 1 1 NA 1 0 0 0 1 0
7 done 1 1 1 1 0 0 0 1 0
where you can always substitute df[, 1] with df[, "0Q-state"] and df[, -1] with df[, namesOfDummyVars]
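To see why the one-line matrix version works, it may help to build the logical mask step by step. The sketch below uses a small invented data frame (toy, row_done and mask are hypothetical names). It relies on R recycling a vector of length nrow down each column of the matrix returned by is.na().

```r
# Invented miniature of the question's layout
toy <- data.frame(state = c("done", "", "done"),
                  q1 = c(NA, NA, 1),
                  q2 = c(1, NA, NA))

# One logical value per row
row_done <- toy$state == "done"

# is.na(toy) is a 3 x 3 logical matrix; row_done is recycled column by column,
# so a cell is TRUE only if it is NA and its row is marked "done"
mask <- is.na(toy) & row_done

# Logical-matrix indexing replaces only the masked cells
toy[mask] <- 0
toy
```

The NA values in the second row stay untouched because that row is not marked "done".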
The question has been tagged with data.table and the printed output of dataset4 suggests that dataset4 already is a data.table object.
Here are three variants in data.table syntax to replace NAs in rows which are marked as "done".
# create vector of names of columns to be changed
cols <- sprintf("0Q-%02i", 1:9)
# variant 1
dataset4[`0Q-state` == "done",
(cols) := lapply(.SD, function(x) replace(x, is.na(x), 0L)),
.SDcols = cols][]
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 0
2: 1 1 1 1 1 NA 1 1 NA
3: done 1 1 1 0 1 1 1 1 1
4: done 1 1 1 1 0 0 0 1 0
5: done 1 1 1 1 0 0 0 1 0
6: 1 NA 1 0 0 0 1 0 NA
7: done 1 1 1 1 0 0 0 1 0
or
# variant 2
lapply(cols, function(i) dataset4[`0Q-state` == "done" & is.na(get(i)), (i) := 0L])
dataset4
returning the same as above
or
# variant 3 --- data.table development version 1.10.5
for (i in cols)
set(dataset4, which(dataset4[, "0Q-state"] == "done" & is.na(dataset4[, ..i])), i, 0L)
dataset4
I'm trying to calculate participant average scores on the following scheme:
1. Take a series of values from multiple variables (test items),
2. Calculate an average score only for items answered Yes or No,
3. Keep NA values from affecting the mean, while still counting how many there are and getting the coordinates (row and column) of every NA value,
4. Storing that newfound mean value in a new variable.
I need to do this with binary questions (1 = Yes, 0 = No, -99 = Missing / NA), such as below:
id var1 var2 var3 var4 var5
1 1 0 0 0 0
2 1 1 0 1 1
3 1 0 0 1 0
4 1 0 0 1 0
5 1 0 0 0 0
6 1 1 0 0 1
7 1 1 0 0 1
8 1 1 0 0 0
9 1 0 1 0 1
10 1 0 0 -99 1
11 1 1 0 1 0
12 1 0 0 1 0
13 1 0 0 -99 0
14 1 -99 0 1 1
15 1 0 0 1 0
16 1 0 0 0 1
17 1 0 0 1 0
18 1 0 -99 0 1
19 1 0 0 1 0
20 1 0 0 1 1
21 1 0 0 1 0
22 1 0 0 1 1
23 1 0 0 1 0
24 1 0 0 0 1
25 1 0 0 0 0
26 1 0 0 1 0
27 1 0 0 0 0
28 1 1 0 1 1
And with Likert scale questions (0 = Strongly Disagree / 6 = Strongly Agree, -99 = Missing / NA).
var10 var11 var12 var13 var14
1 1 1 1 0
4 1 1 1 1
1 1 1 1 1
2 1 1 1 1
4 1 1 1 1
2 1 1 1 0
1 1 1 1 0
1 1 1 1 1
2 1 1 1 1
1 1 1 1 0
4 1 1 1 1
4 1 1 1 1
-99 1 1 1 1
1 1 2 1 1
1 4 2 2 0
4 1 1 1 1
4 1 1 1 1
1 1 1 1 1
2 1 1 1 1
4 1 1 1 0
1 1 1 1 1
4 1 1 1 1
1 1 1 1 1
4 1 1 1 1
1 1 1 1 1
Any ideas of how to go about this? I'm sure it can be done by selecting individual columns or by indicating a range of columns from which to draw data. However, I'm inexperienced in writing such a complex, multi-stepped function in R so I'm hoping to get a veteran's advice.
Thanks in advance.
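One possible way to approach the four steps, sketched on a small made-up data frame rather than the full data: recode -99 to NA, then use rowMeans(), rowSums() and which() from base R. The names scores, items, mean_score, n_missing and na_coords are assumptions introduced for this example. The same code should apply to the Likert items, since only the -99 code is treated specially.

```r
# Invented subset with the same coding as the question (-99 = missing)
scores <- data.frame(id = 1:4,
                     var1 = c(1, 1, 1, 1),
                     var2 = c(0, 1, -99, 0),
                     var3 = c(0, 0, 0, -99),
                     var4 = c(0, -99, 1, -99))

# Step 1: choose the item columns (a range such as paste0("var", 1:4) also works)
items <- c("var1", "var2", "var3", "var4")

# Recode the missing-value code to NA in the item columns only
scores[items][scores[items] == -99] <- NA

# Steps 2 and 4: per-participant mean over answered items, stored in a new column
scores$mean_score <- rowMeans(scores[items], na.rm = TRUE)

# Step 3: number of missing items per participant, and their (row, col) positions
scores$n_missing <- rowSums(is.na(scores[items]))
na_coords <- which(is.na(scores[items]), arr.ind = TRUE)

scores
na_coords
```

The col values in na_coords index into items, so items[na_coords[, "col"]] gives the variable names of the missing cells.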
Here is an example for the dataset (d):
rs3 rs4 rs5 rs6
1 0 0 0
1 0 1 0
0 0 0 0
2 0 1 0
0 0 0 0
0 2 0 1
0 2 NA 1
0 2 2 1
NA 1 2 1
To check the frequency of the SNP genotype (0,1,2), we can use the table command
table (d$rs3)
The output would be
0 1 2
5 2 1
Here we want to recode genotype 2 as 1 in every variable where genotype 2 occurs fewer than 3 times. The recoded output should be
rs3 rs4 rs5 rs6
1 0 0 0
1 0 1 0
0 0 0 0
1 0 1 0
0 0 0 0
0 2 0 1
0 2 NA 1
0 2 1 1
NA 1 1 1
I have 70000 SNPs that need to be checked and recoded. How can I use a for loop or some other method to do that in R?
Here's another possible (vectorized) solution
indx <- colSums(d == 2, na.rm = TRUE) < 3 # Select columns by condition
d[indx][d[indx] == 2] <- 1 # Insert 1 where the subset selected by the condition equals 2
d
# rs3 rs4 rs5 rs6
# 1 1 0 0 0
# 2 1 0 1 0
# 3 0 0 0 0
# 4 1 0 1 0
# 5 0 0 0 0
# 6 0 2 0 1
# 7 0 2 NA 1
# 8 0 2 1 1
# 9 NA 1 1 1
We can try
d[] <- lapply(d, function(x)
if(sum(x==2, na.rm=TRUE) < 3) replace(x, x==2, 1) else x)
d
# rs3 rs4 rs5 rs6
#1 1 0 0 0
#2 1 0 1 0
#3 0 0 0 0
#4 1 0 1 0
#5 0 0 0 0
#6 0 2 0 1
#7 0 2 NA 1
#8 0 2 1 1
#9 NA 1 1 1
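Or the same methodology can be used in dplyr (mutate_each() and funs() are deprecated, so across() from dplyr >= 1.0.0 is used here)
library(dplyr)
d %>%
mutate(across(everything(), ~ if (sum(.x == 2, na.rm = TRUE) < 3)
replace(.x, .x == 2, 1) else .x))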
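With 70000 SNP columns it may be worth working on a numeric matrix instead of a data frame. The following is a sketch only, using the example data from the question; snp, m, rare and sub are hypothetical names, and whether this is actually faster on the real data has not been measured here.

```r
snp <- data.frame(rs3 = c(1, 1, 0, 2, 0, 0, 0, 0, NA),
                  rs4 = c(0, 0, 0, 0, 0, 2, 2, 2, 1),
                  rs5 = c(0, 1, 0, 1, 0, 0, NA, 2, 2),
                  rs6 = c(0, 0, 0, 0, 0, 1, 1, 1, 1))

m <- as.matrix(snp)

# Columns in which genotype 2 occurs fewer than 3 times
rare <- colSums(m == 2, na.rm = TRUE) < 3

# Recode 2 -> 1 inside those columns only; which() skips the NA cells
sub <- m[, rare, drop = FALSE]
sub[which(sub == 2)] <- 1
m[, rare] <- sub
m
```

Here rs3 and rs5 are recoded, while rs4 keeps its three 2s because the count is not below 3.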
I have data that looks a bit like this:
df <- data.frame(ID=c(rep(1,4),rep(2,2),rep(3,2),4), TYPE=c(1,3,2,4,1,2,2,3,2),
SEQUENCE=c(seq(1,4),1,2,1,2,1))
ID TYPE SEQUENCE
1 1 1
1 3 2
1 2 3
1 4 4
2 1 1
2 2 2
3 2 1
3 3 2
4 2 1
I now need to check if a certain type is present in each ID block (binary), but only record the answer in the first record per block (SEQUENCE == 1).
The best I came up with so far is coding them in the row they are present in, e.g.
library(data.table)
DT <- data.table(df)
DT$A[DT$TYPE==1] <- 1
DT$B[DT$TYPE==2] <- 1
DT$C[DT$TYPE==3] <- 1
DT$D[DT$TYPE==4] <- 1
DT[is.na(DT)] <- 0
RESULT:
ID TYPE SEQUENCE A B C D
1 1 1 1 0 0 0
1 3 2 0 0 1 0
1 2 3 0 1 0 0
1 4 4 0 0 0 1
2 1 1 1 0 0 0
2 2 2 0 1 0 0
3 2 1 0 1 0 0
3 3 2 0 0 1 0
4 2 1 0 1 0 0
However, the result should look like this:
ID TYPE SEQUENCE A B C D
1 1 1 1 1 1 1
1 3 2 0 0 0 0
1 2 3 0 0 0 0
1 4 4 0 0 0 0
2 1 1 1 1 0 0
2 2 2 0 0 0 0
3 2 1 0 1 1 0
3 3 2 0 0 0 0
4 2 1 0 1 0 0
I assume this can be done with data.table, but I haven't quite found the correct syntax.
This makes one copy of the data.table:
DT[, FAC := factor(TYPE, labels=LETTERS[1:4])]
DT <- dcast.data.table(DT, ID+TYPE+SEQUENCE~FAC, fun.aggregate=length)
DT[,LETTERS[1:4] := lapply(.SD,
function(x) c(any(as.logical(x)), rep(0L, length(x)-1))),
.SDcols=LETTERS[1:4], by=ID]
# ID TYPE SEQUENCE A B C D
#1: 1 1 1 1 1 1 1
#2: 1 2 3 0 0 0 0
#3: 1 3 2 0 0 0 0
#4: 1 4 4 0 0 0 0
#5: 2 1 1 1 1 0 0
#6: 2 2 2 0 0 0 0
#7: 3 2 1 0 1 1 0
#8: 3 3 2 0 0 0 0
#9: 4 2 1 0 1 0 0
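If keeping the original row order matters (the dcast step above sorts by ID and TYPE), a base R variant is possible. This is a sketch under the assumption that TYPE only takes the values 1 to 4; blocks and has_type are made-up names.

```r
blocks <- data.frame(ID = c(rep(1, 4), rep(2, 2), rep(3, 2), 4),
                     TYPE = c(1, 3, 2, 4, 1, 2, 2, 3, 2),
                     SEQUENCE = c(seq(1, 4), 1, 2, 1, 2, 1))

for (k in 1:4) {
  # TRUE on every row of an ID block that contains type k somewhere
  has_type <- ave(blocks$TYPE == k, blocks$ID, FUN = any)
  # Record the answer only on the first record of each block
  blocks[[LETTERS[k]]] <- as.integer(has_type & blocks$SEQUENCE == 1)
}
blocks
```

The columns A to D match the desired result in the question, with the rows left in their original order.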