Creating new columns based on a pattern - r

I have a large dataset which has a pattern similar to the dataPattern below. I need help with the code to create the desiredresult dataset.
library(data.table)
V1 <- rep(c(rep("a", times = 2), letters[2:5],
rep("f", times = 2)), times = 2)
V2 <- c(c(c(0.24, 0.25), 2:5, c(0.95, 1.05)),
c(c(0.34, 0.35), 2:5, c(1.95, 2.05)) )
(dataPattern <- data.table(V1, V2))
(desiredresult <- data.table(V1, V2, c(rep(c(0.24, 0.25), times = 4),
rep(c(0.34, 0.35), times = 4)),
c(rep(c(0.95, 1.05), times = 4),
rep(c(1.95, 2.05), times = 4))))
I need help to create column V3 in the desiredresult. The pattern is as follows:
if V1 == "a" then V3 = V2
if V1 != "a", we repeat the previous corresponding set of V2 values until a new value of a is reached; then the new values of V2 are placed in V3, and so on. This repeats for every new run of a.
I also need your help with the code to create column V4 in the desiredresult, which is similar to column V3 except that it checks if V1 == "f", places the values of f from V2 into V4, and repeats them if V1 != "f".
I have tried:
rle(dataPattern$V1 == "a" )
# Run Length Encoding
# lengths: int [1:4] 2 6 2 6
# values : logi [1:4] TRUE FALSE TRUE FALSE
The sequence where V1 != "a" or V1 != "f" appears to be equal to the number of FALSE minus the number of TRUE. This is how many times each a sequence needs to be repeated until a new a is reached.
Many Thanks

OK, here is a better way, I think, to get the values of V2 into the column depending on V1=='a'.
V1 <- rep(c(rep("a", times = 2), letters[2:5],
rep("f", times = 2)), times = 2)
V2 <- c(c(c(0.24, 0.25), 2:5, c(0.95, 1.05)),
c(c(0.34, 0.35), 2:5, c(1.95, 2.05)) )
dataPattern <- data.frame(V1, V2)
dataPattern$V3 <- ifelse(dataPattern$V1 == "a", dataPattern$V2, NA)
dataPattern$V4 <- ifelse(dataPattern$V1 == "f", dataPattern$V2, NA)
for (i in 1:nrow(dataPattern)){
if (dataPattern$V1[i] == "a"){
tmpa <- dataPattern$V3[i]
}
if (is.na(dataPattern$V3[i])){
dataPattern$V3[i] <- tmpa
}
if (dataPattern$V1[nrow(dataPattern)-(i-1)] == "f"){
tmpf <- dataPattern$V4[nrow(dataPattern)-(i-1)]
}
if (is.na(dataPattern$V4[nrow(dataPattern)-(i-1)])){
dataPattern$V4[nrow(dataPattern)-(i-1)] <- tmpf
}
}
Output, which I think is more correct according to your stated rules than desiredresult:
> dataPattern
V1 V2 V3 V4
1 a 0.24 0.24 0.95
2 a 0.25 0.25 0.95
3 b 2.00 0.25 0.95
4 c 3.00 0.25 0.95
5 d 4.00 0.25 0.95
6 e 5.00 0.25 0.95
7 f 0.95 0.25 0.95
8 f 1.05 0.25 1.05
9 a 0.34 0.34 1.95
10 a 0.35 0.35 1.95
11 b 2.00 0.35 1.95
12 c 3.00 0.35 1.95
13 d 4.00 0.35 1.95
14 e 5.00 0.35 1.95
15 f 1.95 0.35 1.95
16 f 2.05 0.35 2.05
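For what it's worth, the same forward/backward fill can probably be done without an explicit loop. This is only a minimal sketch, assuming a data.table version that provides fifelse() and nafill(); the name dt is just for illustration and is not from the question.

```r
library(data.table)

# Same sample data as in the question
V1 <- rep(c(rep("a", times = 2), letters[2:5], rep("f", times = 2)), times = 2)
V2 <- c(0.24, 0.25, 2:5, 0.95, 1.05, 0.34, 0.35, 2:5, 1.95, 2.05)
dt <- data.table(V1, V2)   # hypothetical name

# Keep V2 only on "a" rows, then carry the last seen value forward
dt[, V3 := nafill(fifelse(V1 == "a", V2, NA_real_), type = "locf")]
# Keep V2 only on "f" rows, then carry the next seen value backward
dt[, V4 := nafill(fifelse(V1 == "f", V2, NA_real_), type = "nocb")]
dt
```

This should reproduce the loop's output above (last a value carried forward, next f value carried backward), not the desiredresult from the question.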

This seems to work:
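# Recent data.table versions do not recycle a length-2 RHS in `:=`,
# so repeat the two values explicitly up to the group size (.N).
dataPattern[, `:=`(
  V3 = rep_len(head(V2, 2), .N),
  V4 = rep_len(tail(V2, 2), .N)
), by = cumsum(V1 == "a" & shift(V1, type = "lead") == "a")]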
The result passes the all.equal(dataPattern, desiredresult) check. Depending on what your actual use-case looks like, you might need to put something different inside the cumsum.
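As one example of "something different inside the cumsum", here is a sketch that starts a new group wherever a run of a begins, so the a and f runs need not have length 2. It assumes every group contains at least one a row and one f row, and that each group starts with its a run; those are my assumptions, not something stated in the question.

```r
library(data.table)

V1 <- rep(c(rep("a", times = 2), letters[2:5], rep("f", times = 2)), times = 2)
V2 <- c(0.24, 0.25, 2:5, 0.95, 1.05, 0.34, 0.35, 2:5, 1.95, 2.05)
dataPattern <- data.table(V1, V2)

# A new group starts on an "a" row whose previous row is not "a"
dataPattern[, `:=`(
  V3 = rep_len(V2[V1 == "a"], .N),   # cycle the group's "a" values
  V4 = rep_len(V2[V1 == "f"], .N)    # cycle the group's "f" values
), by = cumsum(V1 == "a" & shift(V1, fill = "") != "a")]
dataPattern
```

On the sample data this gives the same V3 and V4 as desiredresult.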

Related

Get unique pairs of row names and column names based on data table entries

Here is my data table:
A B C
A 1 0.8 0.2
B 0.8 1 0.3
C 0.2 0.3 1
I am trying to get the unique pairs of row names and column names based on the entries. For example, if I am looking at > 0.5, my output would be:
A B 0.8
If I am looking at < 0.5, my output would be:
B C 0.3
A C 0.2
This is a classical melt situation (though it needs some seasoning with upper or lower.tri)
library(data.table)  # needed for setDT() and melt() below
dat <- read.table(text=
" A B C
A 1 0.8 0.2
B 0.8 1 0.3
C 0.2 0.3 1
", header=TRUE )
dat[ !upper.tri(dat) ] <- NA
dat <- as.data.frame( dat )
dat <- tibble::rownames_to_column( dat, "V1" )
setDT(dat)
use.this <- melt( dat, id.vars="V1", variable.name="V2" )[ !is.na(value) ]
use.this[ value < .5 ]
use.this[ value > 0.5 ]
It looks like this:
> use.this[ value < .5 ]
V1 V2 value
1: A C 0.2
2: B C 0.3
> use.this[ value > .5 ]
V1 V2 value
1: A B 0.8
In base R, you can use which with arr.ind = TRUE to get the row and column numbers that meet the condition:
df[lower.tri(df, diag = TRUE)] <- NA
mat <- which(df < 0.5, arr.ind = TRUE)
data.frame(rowname = rownames(df)[mat[, 1]],
colname = colnames(df)[mat[, 2]],
value = df[mat])
# rowname colname value
#1 A C 0.2
#2 B C 0.3
data
df <- structure(list(A = c(1, 0.8, 0.2), B = c(0.8, 1, 0.3), C = c(0.2,
0.3, 1)), class = "data.frame", row.names = c("A", "B", "C"))
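If you need this for several thresholds, the base R steps can be wrapped in a small helper. This is just a sketch; pairs_where and its keep argument are names I made up for illustration. It combines upper.tri() with the condition instead of overwriting part of the data with NA.

```r
df <- structure(list(A = c(1, 0.8, 0.2), B = c(0.8, 1, 0.3), C = c(0.2, 0.3, 1)),
                class = "data.frame", row.names = c("A", "B", "C"))

# keep: a function that takes the matrix and returns a logical matrix
pairs_where <- function(m, keep) {
  m <- as.matrix(m)
  # only look above the diagonal so each pair is reported once
  idx <- which(upper.tri(m) & keep(m), arr.ind = TRUE)
  data.frame(rowname = rownames(m)[idx[, 1]],
             colname = colnames(m)[idx[, 2]],
             value   = m[idx])
}

below <- pairs_where(df, function(v) v < 0.5)
above <- pairs_where(df, function(v) v > 0.5)
```

Here below should hold the pairs A/C and B/C, and above the single pair A/B.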
Here is another option:
M <- matrix(c(1,0.8,0.2,0.8,1,0.3,0.2,0.3,1), nrow=3L,
dimnames=list(LETTERS[1:3], LETTERS[1:3]))
allDT <- data.table(rn=rep(rownames(M), nrow(M)),
cn=rep(colnames(M), each=ncol(M)),
val=as.vector(M))
DT <- unique(allDT[, .(val=val), .(rn=pmin(rn, cn), cn=pmax(rn, cn))])
DT[val<0.5]
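The pmin/pmax step is what makes the pairs unique: it puts the two names of each pair into a fixed order, so (B, A) and (A, B) become the same row. A tiny illustration with made-up vectors (rn and cn here are not the columns of allDT):

```r
rn <- c("A", "B", "B", "C")
cn <- c("B", "A", "C", "B")

first  <- pmin(rn, cn)   # alphabetically smaller name of each pair
second <- pmax(rn, cn)   # alphabetically larger name of each pair

# (A,B) and (B,A) now look identical, so unique() keeps one of each
pairs <- unique(data.frame(first, second))
pairs
```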
The question has been tagged data.table. So, here is a simple solution which uses only data.table syntax. In addition, it suggests lookup functions (EDIT: in 3 different flavours) which require fewer keystrokes.
library(data.table)
wide <- fread(" A B C
A 1 0.8 0.2
B 0.8 1 0.3
C 0.2 0.3 1")
long <- melt(wide, id.vars = "V1", variable.name = "V2",
variable.factor = FALSE)[V1 < V2]
long
V1 V2 value
1: A B 0.8
2: A C 0.2
3: B C 0.3
Note that the upper triangular part of wide is picked after reshaping to long format by subsetting [V1 < V2] which ensures that only unique pairs are considered.
long can be queried by subsetting, e.g.,
long[value < 0.5]
V1 V2 value
1: A C 0.2
2: B C 0.3
long[value > 0.5]
V1 V2 value
1: A B 0.8
lookup function
long can be queried by defining a lookup function:
l <- function(cond) eval(parse(text = paste0("long[value", cond, "]")))
which can be called, e.g.,
l("< .5")
V1 V2 value
1: A C 0.2
2: B C 0.3
l("> .5")
V1 V2 value
1: A B 0.8
l("== .3")
V1 V2 value
1: B C 0.3
EDIT: lookup function with 2 arguments
Alternatively, the lookup function can be defined to allow for 2 arguments, one for the comparison operator, one for the numerical values:
l2 <- function(op, v) long[do.call(op, list(value, v))]
l2("%between%", c(0.25, 0.95))
V1 V2 value
1: A B 0.8
2: B C 0.3
Or, with the new interface for programming on data.table (available with data.table development version 1.14.1):
l3 <- function(op, v) long[op(value, v), env = list(op = as.name(op), v = v)]
l3("%in%", c(0.2, 0.3))
V1 V2 value
1: A C 0.2
2: B C 0.3

Conditionally changing only some cells in a data frame - ifelse() failure?

I am trying to conditionally change some items when cleaning survey data.
I've got two questions, Question X and Question Y. If they respond 1 or 2 for Question X, they go on to answer Question Y. If they answer 3 or 4 for Question X, they skip Question Y.
If they answer X with 1 or 2 and then skip Y, I want to record their 'NULL!' entries as NA - they just didn't answer the question when they should have.
If they answer X with 3 or 4 and then skip Y, I want to record their 'NULL!' entries as 0 - they weren't supposed to answer the question, so they didn't.
Here's a reproducible dataset I made:
set.seed(1)
df <- data.frame(
X = as.factor(sample(c("1.00", "2.00", "3.00", "4.00"), 10, replace = TRUE)),
Y = as.factor(sample(c("1.00", "2.00", "#NULL!"), 10, replace = TRUE))
)
df
I'm trying to replace the aforementioned 'NULL!' fields with either NA or 0 respectively. I've been trying it with ifelse() and have had little luck - it appears to return anything that is 1.00 or 2.00 as NA and 3.00 or 4.00 as 0. Is there a better way to do this? What am I doing wrong?
levels(df$Y) <- c(levels(df$Y), 0)
df$Y <- ifelse(df$X == '3.00'| df$X == '4.00', df$Y[df$y == 'NULL!'] <- 0, df$Y[df$Y == '#NULL!'] <- NA)
df
Thank you for your help!
You're doing a couple of things the hard way. First, using factors constrains one to only use levels that exist in a particular factor, which may not be what you want. Second, you have levels of "#NULL!" but are attempting (unsuccessfully) to test for a level of "NULL!". I'm guessing you wanted them to be the same level. Third, you are attempting to use "<-" inside the second and third arguments of ifelse. That will not succeed in the manner you intended. The LHS of such an expression is not evaluated by ifelse.
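To illustrate the first point with a toy vector (f, g and h are made-up names): assigning a value that is not an existing level into a factor yields NA with a warning, while the same assignment into a character vector just works.

```r
f <- factor(c("1.00", "2.00", "#NULL!"))

g <- f
# "0" is not one of the levels of g, so R warns and stores NA instead
suppressWarnings(g[3] <- "0")
is.na(g[3])

h <- as.character(f)
h[3] <- "0"   # no level restriction on character vectors
h
```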
You can instead use nested ifelse:
df$Y <- ifelse( (df$X == '3.00'| df$X == '4.00') & df$Y == "#NULL!", 0,
ifelse( df$Y == "#NULL!", NA, df$Y) ) # only mess with "Nulls"
df
X Y
1 2.00 1.00
2 2.00 1.00
3 3.00 0
4 4.00 2.00
5 1.00 <NA>
6 4.00 2.00
7 4.00 0
8 3.00 0
9 3.00 2.00
10 1.00 <NA>
To prevent the missing levels problem which you handled by adding a "0" level, I instead made my dataframe so it contained character vectors:
set.seed(1)
df <- data.frame(X = sample(c("1.00", "2.00", "3.00", "4.00"), 10, replace = TRUE),
Y = sample(c("1.00", "2.00", "#NULL!"), 10, replace = TRUE),
stringsAsFactors=FALSE)
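With character columns, another way to express the two rules is plain logical-index assignment instead of nested ifelse. This is a sketch on a small hand-made data frame (not the seeded sample above), and skipped and routed_past are names I picked for illustration.

```r
df <- data.frame(X = c("1.00", "3.00", "4.00", "2.00", "3.00"),
                 Y = c("#NULL!", "#NULL!", "2.00", "1.00", "1.00"),
                 stringsAsFactors = FALSE)

skipped     <- df$Y == "#NULL!"              # Y was left unanswered
routed_past <- df$X %in% c("3.00", "4.00")   # respondent was not supposed to answer Y

df$Y[skipped & routed_past]  <- "0"   # legitimately skipped
df$Y[skipped & !routed_past] <- NA    # should have answered but did not
df
```

Both masks are computed before any assignment, so the second assignment is not affected by the first.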
Earlier tidyverse code:
library(tidyverse)
df %>% mutate(Y = case_when(
X == "3.00" ~ "0",
X == "4.00" ~ "0",
TRUE ~ as.character(Y)))
How about this?
set.seed(1)
df <- data.frame(
X = as.factor(sample(c("1.00", "2.00", "3.00", "4.00"), 10, replace = TRUE)),
Y = as.factor(sample(c("1.00", "2.00", "#NULL!"), 10, replace = TRUE))
)
df$X <- as.character(df$X)
df$Y <- as.character(df$Y)
df$Y <- ifelse((df$X=="1.00" | df$X=="2.00") & df$Y == "#NULL!", NA, df$Y)  # parentheses needed: & binds tighter than |
df$Y <- ifelse(df$X=="3.00" | df$X=="4.00", "0.00", df$Y)
df
X Y
1 2.00 1.00
2 2.00 1.00
3 3.00 0.00
4 4.00 0.00
5 1.00 <NA>
6 4.00 0.00
7 4.00 0.00
8 3.00 0.00
9 3.00 0.00
10 1.00 <NA>

Replace a complex, conditional for loop with apply in R

I'm relatively new to R and I'm hoping to replace my messy loop with something more eloquent and faster (apply?). Basically, I want to populate a new matrix based on if values in the same position in other matrices match one another. Let me illustrate:
>df1
V1 V2 V3
1 A G A
2 T T T
3 C A A
4 G C G
>df2
V1
1 A
2 T
3 C
4 G
>df3
V1 V2 V3
1 .25 .99 .41
2 .21 .25 .75
3 .35 .65 .55
4 .75 .21 .11
>newdf <- data.frame(matrix(ncol= ncol(df3), nrow = nrow(df3)))
Note that df1 and df3 will always have the same dimensions as one another, and df2 will always have the same nrow.
If positions Match: If df1[i,j] == df2[i], then I want newdf[i,j] = df3[i,j]
If positions don't match: If df1[i,j] != df2[i], then I want newdf[i,j] = 1-df3[i,j]
For instance df1[1,2] = 'G' and df2[1] = 'A', so I want newdf[1,2] = (1- df3[1,2])
I wrote a very gross for loop to perform this successfully:
df1<- as.matrix(df1)
df2<- as.matrix(df2)
df3<- as.matrix(df3)
newdf <- data.frame(matrix(ncol= ncol(df3), nrow = nrow(df3)))
for (i in (1:nrow(df1))){
for (j in (1:ncol(df1))){
if (df1[i,j] == df2[i]) {
newdf[i,j] = df3[i,j] }
else {
newdf[i,j] = 1- df3[i,j] }
}
}
Which gives me the desired results:
>newdf
X1 X2 X3
1 0.25 0.01 0.41
2 0.21 0.25 0.75
3 0.35 0.35 0.45
4 0.75 0.79 0.11
This is a very slow and messy process when I have lots of data. Are there any suggestions for other ways to solve this, perhaps using the apply family? Thanks and sorry for the nasty code.
You can use apply to create an index of those values that don't match, then simply subtract them from one:
idx <- (!apply(df1, 2, function(x) x == df2))
## alternatively, you can use x != df2 too
## idx <- (apply(df1, 2, function(x) x != df2))
df3[idx] <- 1 - df3[idx]
df3
# V1 V2 V3
# 1 0.25 0.01 0.41
# 2 0.21 0.25 0.75
# 3 0.35 0.35 0.45
# 4 0.75 0.79 0.11
Explanation
Where the apply gives a matrix of TRUE/FALSE based on whether df1 matches df2
V1 V2 V3
[1,] TRUE FALSE TRUE
[2,] TRUE TRUE TRUE
[3,] TRUE FALSE FALSE
[4,] TRUE FALSE TRUE
So taking the negation of this using ! gives the opposite values.
!apply(df1, 2, function(x) x == df2)
V1 V2 V3
[1,] FALSE TRUE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE TRUE TRUE
[4,] FALSE TRUE FALSE
which then tells us which values of df3 we need to change
df3[idx]
[1] 0.01 0.35 0.79 0.45
An alternative is to make df2 the same size as df1:
# repeat the single column of df2 once per column of df1
df2 <- matrix(rep(as.matrix(df2), ncol(df1)), nrow = nrow(df1))
df1 != df2
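A third possibility, sketched here under the assumption that df1 and df3 are matrices and df2 is a plain vector with one entry per row: R recycles the vector down each column, so df1 == df2 already compares every column against df2, and ifelse can pick the value directly. The names same and newdf are just for illustration.

```r
# Data from the question, entered column by column
df1 <- matrix(c("A", "T", "C", "G",
                "G", "T", "A", "C",
                "A", "T", "A", "G"), nrow = 4)
df2 <- c("A", "T", "C", "G")
df3 <- matrix(c(0.25, 0.21, 0.35, 0.75,
                0.99, 0.25, 0.65, 0.21,
                0.41, 0.75, 0.55, 0.11), nrow = 4)

# df2 has one value per row, so it is recycled down every column of df1
same  <- df1 == df2
newdf <- ifelse(same, df3, 1 - df3)
newdf
```

This relies on length(df2) being equal to nrow(df1); with any other length the recycling would silently misalign.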

R Conditional standard deviation

I have a large data set and I need to get the standard deviation for the Main column based on the number of rows in other columns. Here is a sample data set:
df1 <- data.frame(
Main = c(0.33, 0.57, 0.60, 0.51),
B = c(NA, NA, 0.09,0.19),
C = c(NA, 0.05, 0.07, 0.05),
D = c(0.23, 0.26, 0.23, 0.26)
)
View(df1)
# Main B C D
# 1 0.33 NA NA 0.23
# 2 0.57 NA 0.05 0.26
# 3 0.60 0.09 0.07 0.23
# 4 0.51 0.19 0.05 0.26
Take column B as an example, since row 1&2 are NA, its standard deviation will be sd(df1[3:4,1]); column C&D will be sd(df1[2:4,1]) and sd(df1[1:4,1]). Therefore, the result will be:
# B C D
# 1 0.06 0.05 0.12
I did the following, but it only returned one number, 0.0636:
df2 <- df1[,-1]!=0
sd(df1[df2,1], na.rm = T)
My data set has many more columns, and I'm wondering if there is a more efficient way to get it done? Many thanks!
Try:
sapply(df1[,-1], function(x) sd(df1[!is.na(x), 1]))
# B C D
# 0.06363961 0.04582576 0.12093387
If you instead want the standard deviation of each column itself, ignoring NAs:
x <- colnames(df1) # list all columns you want to calculate sd of
value <- sapply(seq_along(x), function(i) sd(df1[, x[i], drop = TRUE], na.rm = TRUE))
names(value) <- x
# Main B C D
# 0.12093387 0.07071068 0.01154701 0.01732051
We can get this with colSds from matrixStats
library(matrixStats)
colSds(`dim<-`(df1[,1][NA^is.na(df1[-1])*row(df1[-1])], dim(df1[,-1])), na.rm = TRUE)
#[1] 0.06363961 0.04582576 0.12093387
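That one-liner is dense, so here is my reading of it unpacked step by step, with base R's apply standing in for colSds so it runs without matrixStats. The intermediate names (others, present, idx, vals) are mine.

```r
df1 <- data.frame(
  Main = c(0.33, 0.57, 0.60, 0.51),
  B = c(NA, NA, 0.09, 0.19),
  C = c(NA, 0.05, 0.07, 0.05),
  D = c(0.23, 0.26, 0.23, 0.26)
)

others  <- as.matrix(df1[-1])
present <- NA^is.na(others)        # NA^TRUE is NA, NA^FALSE is 1
idx     <- present * row(others)   # row number where a value exists, NA elsewhere
# Look up Main by row number; an NA index gives NA, masking the missing cells
vals <- matrix(df1$Main[idx], nrow = nrow(others),
               dimnames = list(NULL, colnames(others)))
res <- apply(vals, 2, sd, na.rm = TRUE)
res
```

So each column of vals is a copy of Main with NA wherever the corresponding column was NA, and the column-wise sd with na.rm = TRUE gives the conditional standard deviations.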

How to delete a duplicate row in R

I have the following data
x y z
1 2 a
1 2
data[2,3] is a factor, but nothing is displayed.
The data has a lot of rows like this. How can I delete a row when z is empty?
I mean deleting rows such as the second row.
output should be
x y z
1 2 a
OK. Stabbing a little bit in the dark here.
Imagine the following dataset:
mydf <- data.frame(
x = c(.11, .11, .33, .33, .11, .11),
y = c(.22, .22, .44, .44, .22, .44),
z = c("a", "", "", "f", "b", ""))
mydf
# x y z
# 1 0.11 0.22 a
# 2 0.11 0.22
# 3 0.33 0.44
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
From the combination of your title and your description (neither of which seems to fully describe your problem), I would decode that you want to drop rows 2 and 3, but not row 6. In other words, you want to first check whether the row is duplicated (presumably only the first two columns), and then, if the third column is empty, drop that row. By those instructions, row 5 should remain (column "z" is not blank) and row 6 should remain (the combination of columns 1 and 2 is not a duplicate).
If that's the case, here's one approach:
# Copy the data.frame, "sorting" by column "z"
mydf2 <- mydf[rev(order(mydf$z)), ]
# Subset according to your conditions
mydf2 <- mydf2[duplicated(mydf2[1:2]) & mydf2$z %in% "", ]
mydf2
# x y z
# 3 0.33 0.44
# 2 0.11 0.22
^^ Those are the data that we want to remove. One way to remove them is using setdiff on the rownames of each dataset:
mydf[setdiff(rownames(mydf), rownames(mydf2)), ]
# x y z
# 1 0.11 0.22 a
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
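A variant that avoids the sort and the setdiff, sketched under a slightly different reading of the rule: drop a blank-z row only if some other row with the same x and y has a non-blank z. The names key, blank and has_value are mine, and the "|" separator assumes it cannot occur inside the values.

```r
mydf <- data.frame(
  x = c(.11, .11, .33, .33, .11, .11),
  y = c(.22, .22, .44, .44, .22, .44),
  z = c("a", "", "", "f", "b", ""),
  stringsAsFactors = FALSE)

key       <- paste(mydf$x, mydf$y, sep = "|")   # one label per x/y combination
blank     <- mydf$z == ""
has_value <- key %in% key[!blank]               # combination also occurs with a non-blank z

result <- mydf[!(blank & has_value), ]
result
```

On this sample it keeps rows 1, 4, 5 and 6, the same as above. The two approaches can differ when a combination occurs only with blank z values more than once.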
Some example data:
df = data.frame(x = runif(100),
y = runif(100),
z = sample(c(letters[1:10], ""), 100, replace = TRUE))
> head(df)
x y z
1 0.7664915 0.86087017 a
2 0.8567483 0.83715022 d
3 0.2819078 0.85004742 f
4 0.8241173 0.43078311 h
5 0.6433988 0.46291916 e
6 0.4103120 0.07511076
Spot row six with the missing value. You can subset using a logical vector (TRUE, FALSE):
df[df$z != "",]
And as @AnandaMahto commented, you can even check against multiple conditions:
df[!df$z %in% c("", " "),]
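If the "empty" cells may contain any amount of whitespace, listing each variant gets tedious. A small sketch, assuming z is a character column with no NA values (the toy data below is made up): strip the whitespace with trimws() and keep the rows where something is left.

```r
df <- data.frame(x = 1:5,
                 z = c("a", "", " ", "b", "   "),
                 stringsAsFactors = FALSE)

# trimws() removes leading/trailing whitespace; nzchar() is TRUE for non-empty strings
kept <- df[nzchar(trimws(df$z)), ]
kept
```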
