Conditionally changing only some cells in a data frame - ifelse() failure? - r

I am trying to conditionally change some items when cleaning survey data.
I've got two questions, Question X and Question Y. If they respond 1 or 2 for Question X, they go on to answer Question Y. If they answer 3 or 4 for Question X, they skip Question Y.
If they answer X with 1 or 2 and then skip Y, I want to record their 'NULL!' entries as NA - they just didn't answer the question when they should have.
If they answer X with 3 or 4 and then skip Y, I want to record their 'NULL!' entries as 0 - they weren't supposed to answer the question, so they didn't.
Here's a reproducible dataset I made:
set.seed(1)
df <- data.frame(
X = as.factor(sample(c("1.00", "2.00", "3.00", "4.00"), 10, replace = TRUE)),
Y = as.factor(sample(c("1.00", "2.00", "#NULL!"), 10, replace = TRUE))
)
df
I'm trying to replace the aforementioned 'NULL!' fields with either NA or 0 respectively. I've been trying it with ifelse() and have had little luck - it appears to return anything that is 1.00 or 2.00 as NA and 3.00 or 4.00 as 0. Is there a better way to do this? What am I doing wrong?
levels(df$Y) <- c(levels(df$Y), 0)
df$Y <- ifelse(df$X == '3.00'| df$X == '4.00', df$Y[df$y == 'NULL!'] <- 0, df$Y[df$Y == '#NULL!'] <- NA)
df
Thank you for your help!

You're doing a couple of things the hard way. First, using factors constrains you to the levels that already exist in that particular factor, which may not be what you want. Second, you have a level of "#NULL!" but are attempting (unsuccessfully) to test for a level of "NULL!"; I'm guessing you wanted them to be the same level. Third, you are attempting to use "<-" inside the second and third arguments of ifelse. That will not work in the manner you intended: ifelse only sees the value of each assignment expression, which is its right-hand side (0 or NA), so every element of the result is either 0 or NA.
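A small sketch of the first and third points, using made-up vectors (f, v, val and res are illustrative names, not objects from the question):

```r
# Point 1: a factor rejects values that are not among its levels.
f <- factor(c("1.00", "#NULL!"))
f[2] <- "0"                  # warning: invalid factor level, NA generated
is.na(f[2])                  # TRUE

# Point 3: the value of an assignment expression is its right-hand side.
v <- c(1, 2, 3)
val <- (v[v > 5] <- 0)       # nothing in v matches, yet the expression is worth 0
val                          # 0

# So ifelse() only ever chooses between the two right-hand sides.
res <- ifelse(c(TRUE, FALSE), val, NA)
res                          # 0 NA
```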
You can instead use nested ifelse:
df$Y <- ifelse( (df$X == '3.00'| df$X == '4.00') & df$Y == "#NULL!", 0,
ifelse( df$Y == "#NULL!", NA, df$Y) ) # only mess with "Nulls"
df
X Y
1 2.00 1.00
2 2.00 1.00
3 3.00 0
4 4.00 2.00
5 1.00 <NA>
6 4.00 2.00
7 4.00 0
8 3.00 0
9 3.00 2.00
10 1.00 <NA>
To prevent the missing levels problem which you handled by adding a "0" level, I instead made my dataframe so it contained character vectors:
set.seed(1)
df <- data.frame(X = sample(c("1.00", "2.00", "3.00", "4.00"), 10, replace = TRUE),
Y = sample(c("1.00", "2.00", "#NULL!"), 10, replace = TRUE),
stringsAsFactors=FALSE)
Earlier tidyverse code:
library(tidyverse)
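df %>% mutate(Y = case_when(
X %in% c("3.00", "4.00") & Y == "#NULL!" ~ "0", # not asked, so record 0
Y == "#NULL!" ~ NA_character_, # asked but skipped
TRUE ~ as.character(Y))) # leave real answers alone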
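The same rule can also be written with plain logical indexing instead of ifelse(). This is only a minimal sketch on a small hand-made data frame; the helper names skipped and not_asked are illustrative and do not appear in the question.

```r
# Hand-made example with character columns (not the sampled data above)
df <- data.frame(
  X = c("1.00", "3.00", "4.00", "2.00"),
  Y = c("#NULL!", "#NULL!", "2.00", "1.00"),
  stringsAsFactors = FALSE
)

skipped   <- df$Y == "#NULL!"               # no answer recorded for Y
not_asked <- df$X %in% c("3.00", "4.00")    # Y was not supposed to be answered

df$Y[skipped & not_asked]  <- "0"           # legitimately skipped
df$Y[skipped & !not_asked] <- NA            # should have answered but did not
df
```

Both masks are computed before any cell is changed, so the order of the two assignments does not matter.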

How about this?
set.seed(1)
df <- data.frame(
X = as.factor(sample(c("1.00", "2.00", "3.00", "4.00"), 10, replace = TRUE)),
Y = as.factor(sample(c("1.00", "2.00", "#NULL!"), 10, replace = TRUE))
)
df$X <- as.character(df$X)
df$Y <- as.character(df$Y)
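# Parentheses are needed: & binds more tightly than |
df$Y <- ifelse((df$X=="1.00" | df$X=="2.00") & df$Y == "#NULL!", NA, df$Y)
# Only touch the "#NULL!" entries, so real answers are kept
df$Y <- ifelse((df$X=="3.00" | df$X=="4.00") & df$Y == "#NULL!", "0.00", df$Y)
df
X Y
1 2.00 1.00
2 2.00 1.00
3 3.00 0.00
4 4.00 2.00
5 1.00 <NA>
6 4.00 2.00
7 4.00 0.00
8 3.00 0.00
9 3.00 2.00
10 1.00 <NA>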
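When | and & are mixed in one condition, as in the ifelse() tests here, it helps to remember that R evaluates & before |. A short sketch with made-up vectors (x, y and the two result names are illustrative):

```r
x <- c("1.00", "1.00", "3.00")
y <- c("#NULL!", "2.00", "#NULL!")

# Parsed as: x == "1.00" | (x == "2.00" & y == "#NULL!")
without_parens <- x == "1.00" | x == "2.00" & y == "#NULL!"

# Groups the two X tests first, then requires the "#NULL!" marker
with_parens <- (x == "1.00" | x == "2.00") & y == "#NULL!"

without_parens  # TRUE  TRUE FALSE
with_parens     # TRUE FALSE FALSE
```

Without the parentheses, the second element is selected even though its Y value is a real answer.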

Related

Apply and append data frame in R

I am trying to convert lets say dataframe ALPHA:
A B C D E
1 0.80 2.00 0.09 201.1 335.00
to dataframe BETA
A B C D E A1 B1 C1 D1 E1
1 0.80 2.00 0.09 201.1 335.00 1.60 3.00 0.18 402.2 670.00
so pretty much multiplies by 2 and appends.
Currently doing it as:
curveCalculator <- function(variable, variableName){
# Need variableName here for another part
return(variable*2)
}
BETA <- lapply(ALPHA, function(variableName, variable){
calculated <- curveCalculator(variable, variableName)
return(calculated)
}, names(optional))
bind_cols(ALPHA, as.data.frame(BETA,col.names=paste(names(BETA), 1, sep="")))
However, it passes curveCalculator ALL NAMES, so for A it would pass 0.80 for variable and c("A","B","C","D","E") for variable name. I want it to only pass "A" for A, "B" for B and so on..
Try this
library(purrr)
BETA <- map2(ALPHA, names(ALPHA), curveCalculator) %>%
as.data.frame()
names(BETA) <- paste0(names(BETA), 1)
cbind(ALPHA, BETA)
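If you would rather not load purrr, base R's Map() pairs each column with its own name in the same way. A minimal sketch, assuming ALPHA is the one-row data frame from the question and using a stand-in curveCalculator that ignores the name:

```r
ALPHA <- data.frame(A = 0.80, B = 2.00, C = 0.09, D = 201.1, E = 335.00)

# Stand-in for the real function; variableName is available but unused here
curveCalculator <- function(variable, variableName) {
  variable * 2
}

# Map() walks over the columns and the names in parallel:
# ("A" with column A, "B" with column B, ...)
BETA <- as.data.frame(Map(curveCalculator, ALPHA, names(ALPHA)))
names(BETA) <- paste0(names(ALPHA), 1)

result <- cbind(ALPHA, BETA)
result
```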

Summing values in columns based on other values in R

I'm relatively new to R and am having trouble creating a vector that sums certain values based on other values. I'm not quite sure what the problem is. I don't receive an error, but the output is not what I was looking for. Here is a reproducible example:
fakeprice <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <-c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
fakedata <- data.frame(fakeprice, fakeconversion)
fake.list <- sort(unique(fakedata$fakeprice))
fake.sum <- vector(,5)
So, fakedata looks like:
fakeprice fakeconversion
1 1 0.20
2 2 0.15
3 2 0.07
4 1 0.25
5 NA NA
6 5 0.40
7 4 0.36
8 4 NA
9 3 0.67
10 3 0.42
11 NA 0.01
I think the problem lies in the NAs, but I'm not quite sure (there are quite a few in the original data set). Here are the for loops with nested if statements. I kept getting an error when the price was 'NA' and so I added the is.na():
for(i in fake.list){
sum=0
for(j in fakedata$fakeprice){
if(is.na(fakedata$fakeprice[j])==TRUE){
NULL
} else {
if(fakedata$fakeprice[j]==fake.list[i]){
sum <- sum+fakedata$fakeconversion[j]
}}
}
fake.sum[i]=sum
}
sumdata <- data.frame(fake.list, fake.sum)
I'm looking for an output that adds up fakeconversion for each unique price. So, for fakeprice=1, fake.sum=0.45. The resulting data I am looking for would look like:
fake.list fake.sum
1 1 0.45
2 2 0.22
3 3 1.09
4 4 0.36
5 5 0.40
What I get, however, is:
sumdata
fake.list fake.sum
1 1 0.90
2 2 0.44
3 3 0.00
4 4 0.00
5 5 0.00
Any help is very much appreciated!
aggregate(fakedata$fakeconversion, list(price = fakedata$fakeprice), sum, na.rm = TRUE)
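The na.rm = TRUE argument is passed on to sum, so the NA conversion value at fakeprice 4 is ignored. Rows where fakeprice itself is NA are left out of the result.
The aggregate function works by splitting your data into subsets and then running a function, FUN, on each subset.
So:
aggregate(x, by, FUN, ...)
x is what you wish to run FUN on. by must be a list of grouping vectors; give it several elements if you wish to split the data by multiple columns.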
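As for why the loop in the question misbehaves: for(j in fakedata$fakeprice) makes j take the price values (1, 2, 2, 1, NA, ...) rather than the row numbers, and those values are then used as row indices. Looping over seq_along(fakedata$fakeprice) would avoid that, but a grouped sum needs no loop at all. A sketch with tapply(), rebuilding the sumdata layout from the question:

```r
fakeprice      <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <- c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)

# Sum fakeconversion within each price; NA prices form no group,
# and na.rm = TRUE skips NA conversions inside a group.
fake.sum <- tapply(fakeconversion, fakeprice, sum, na.rm = TRUE)

sumdata <- data.frame(
  fake.list = as.numeric(names(fake.sum)),
  fake.sum  = as.vector(fake.sum)
)
sumdata
```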

Creating new columns based on a pattern

I have a large dataset which has a pattern similar to the dataPattern below. I need help with the code to create the desiredresult dataset
library(data.table)
V1 <- rep(c(rep("a", times = 2), letters[2:5],
rep("f", times = 2)), times = 2)
V2 <- c(c(c(0.24, 0.25), 2:5, c(0.95, 1.05)),
c(c(0.34, 0.35), 2:5, c(1.95, 2.05)) )
(dataPattern <- data.table(V1, V2))
(desiredresult <- data.table(V1, V2, c(rep(c(0.24, 0.25), times = 4),
rep(c(0.34, 0.35), times = 4)),
c(rep(c(0.95, 1.05), times = 4),
rep(c(1.95, 2.05), times = 4))))
I need help to create column V3 in the desiredresult. The pattern is as follows:
if V1 == "a" then V3 = V2
if V1 != "a" we repeat the previous corresponding set of V2 values until a new value of a is reached then the new values of V2 is placed in V3, etc. The above repeats for all new values of a.
I also need your help with the code to create Column V4 in the desiredresult which is similar to column V3 except it checks if V1 == "f" and places the values of f from V2 into V4 and repeats it if V1 != "f"
I have tried:
rle(dataPattern$V1 == "a" )
# Run Length Encoding
# lengths: int [1:4] 2 6 2 6
# values : logi [1:4] TRUE FALSE TRUE FALSE
The sequence where V1 != "a" or V1 != "f" appears to be equal to the number of FALSE minus Number of TRUE. This is how many times each a sequence need to be repeated until a new a is reached
Many Thanks
OK, here is a better way, I think, to get the values of V2 into the column depending on V1=='a'.
V1 <- rep(c(rep("a", times = 2), letters[2:5],
rep("f", times = 2)), times = 2)
V2 <- c(c(c(0.24, 0.25), 2:5, c(0.95, 1.05)),
c(c(0.34, 0.35), 2:5, c(1.95, 2.05)) )
dataPattern <- data.frame(V1, V2)
dataPattern$V3 <- ifelse(dataPattern$V1 == "a", dataPattern$V2, NA)
dataPattern$V4 <- ifelse(dataPattern$V1 == "f", dataPattern$V2, NA)
for (i in 1:nrow(dataPattern)){
if (dataPattern$V1[i] == "a"){
tmpa <- dataPattern$V3[i]
}
if (is.na(dataPattern$V3[i])){
dataPattern$V3[i] <- tmpa
}
if (dataPattern$V1[nrow(dataPattern)-(i-1)] == "f"){
tmpf <- dataPattern$V4[nrow(dataPattern)-(i-1)]
}
if (is.na(dataPattern$V4[nrow(dataPattern)-(i-1)])){
dataPattern$V4[nrow(dataPattern)-(i-1)] <- tmpf
}
}
Output, which I think matches your stated rules more closely than desiredresult:
> dataPattern
V1 V2 V3 V4
1 a 0.24 0.24 0.95
2 a 0.25 0.25 0.95
3 b 2.00 0.25 0.95
4 c 3.00 0.25 0.95
5 d 4.00 0.25 0.95
6 e 5.00 0.25 0.95
7 f 0.95 0.25 0.95
8 f 1.05 0.25 1.05
9 a 0.34 0.34 1.95
10 a 0.35 0.35 1.95
11 b 2.00 0.35 1.95
12 c 3.00 0.35 1.95
13 d 4.00 0.35 1.95
14 e 5.00 0.35 1.95
15 f 1.95 0.35 1.95
16 f 2.05 0.35 2.05
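The loop above is a "carry the last seen value along" operation, which can also be vectorised in base R. The sketch below is one possible way to do it; fill_forward is a hypothetical helper written for this example, and it assumes the first element of its input is not NA.

```r
V1 <- rep(c("a", "a", letters[2:5], "f", "f"), times = 2)
V2 <- c(0.24, 0.25, 2:5, 0.95, 1.05, 0.34, 0.35, 2:5, 1.95, 2.05)

# Replace each NA with the most recent non-NA value before it.
# Assumes x[1] is not NA (otherwise the index would contain 0).
fill_forward <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # position of the latest non-NA value
  x[idx]
}

# V3: keep V2 where V1 is "a", then fill downwards
V3 <- fill_forward(ifelse(V1 == "a", V2, NA))

# V4: keep V2 where V1 is "f", then fill upwards (reverse, fill, reverse back)
V4 <- rev(fill_forward(rev(ifelse(V1 == "f", V2, NA))))

data.frame(V1, V2, V3, V4)
```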
This seems to work:
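# rep_len() recycles each pair explicitly to the group size (.N);
# recent data.table versions do not recycle a length-2 value inside :=
dataPattern[, `:=`(
V3 = rep_len(head(V2,2), .N),
V4 = rep_len(tail(V2,2), .N)
), by=cumsum( V1 == "a" & shift(V1,type="lead") == "a" )]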
The result passes the all.equal(dataPattern, desiredresult) check. Depending on what your actual use-case looks like, you might need to put something different inside the cumsum.
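For comparison, the same grouping idea can be sketched in base R with ave(). This is an illustration rather than a drop-in replacement; nxt and grp are names made up for the example, and a new group is assumed to start wherever an "a" is followed by another "a".

```r
V1 <- rep(c("a", "a", letters[2:5], "f", "f"), times = 2)
V2 <- c(0.24, 0.25, 2:5, 0.95, 1.05, 0.34, 0.35, 2:5, 1.95, 2.05)

# Group id: increases at the first "a" of each "a", "a" pair
nxt <- c(V1[-1], NA)                                  # the next row's V1
grp <- cumsum(V1 == "a" & !is.na(nxt) & nxt == "a")

# Within each group, repeat the first two / last two V2 values
V3 <- ave(V2, grp, FUN = function(x) rep_len(head(x, 2), length(x)))
V4 <- ave(V2, grp, FUN = function(x) rep_len(tail(x, 2), length(x)))

data.frame(V1, V2, V3, V4)
```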

Replicate rows by different N

I’ve the following data
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
1) I need to repeat the row for each id according to n. For example, n=2.63 for id=1, so the id=1 row should appear three times. If n=0.5, the row should appear only once. In other words, n needs to be rounded up.
2) Create a new variable called t, where the sum of t for each id must equal n.
3) Create another new variable called accumulated.t
Here is how the output should look:
id n t accumulated.t
1 2.63 1 1
1 2.63 1 2
1 2.63 0.63 2.63
2 1.5 1 1
2 1.5 0.5 1.5
3 0.5 0.5 0.5
4 3.5 1 1
4 3.5 1 2
4 3.5 1 3
4 3.5 0.5 3.5
5 4 1 1
5 4 1 2
5 4 1 3
5 4 1 4
Take the ceiling of the 'n' column and use it to expand the rows of 'mydata' (rep(1:nrow(mydata), ceiling(mydata$n))).
Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(mydata1)) and, grouped by the 'id' column, replicate (rep) 1 as many times as the truncated first value of 'n' (rep(1, trunc(n[1]))); call this 'tmp'. The remainder 'tmp2' is the difference between the group's value of 'n' and the sum of 'tmp' (n[1]-sum(tmp)). If the remainder is greater than 0, we concatenate 'tmp' and 'tmp2' (c(tmp, tmp2)); if it is 0, we keep only 'tmp'. The result, 'tmp3', and its cumulative sum (cumsum(tmp3)) are returned in a list to create the two columns 't' and 'taccum'.
library(data.table)
mydata1 <- mydata[rep(1:nrow(mydata),ceiling(mydata$n)),]
setDT(mydata1)[, c('t', 'taccum') := {
tmp <- rep(1, trunc(n[1]))
tmp2 <- n[1]-sum(tmp)
tmp3 <- if(tmp2==0) tmp else c(tmp, tmp2)
list(tmp3, cumsum(tmp3)) },
by = id]
mydata1
# id n t taccum
# 1: 1 2.63 1.00 1.00
# 2: 1 2.63 1.00 2.00
# 3: 1 2.63 0.63 2.63
# 4: 2 1.50 1.00 1.00
# 5: 2 1.50 0.50 1.50
# 6: 3 0.50 0.50 0.50
# 7: 4 3.50 1.00 1.00
# 8: 4 3.50 1.00 2.00
# 9: 4 3.50 1.00 3.00
#10: 4 3.50 0.50 3.50
#11: 5 4.00 1.00 1.00
#12: 5 4.00 1.00 2.00
#13: 5 4.00 1.00 3.00
#14: 5 4.00 1.00 4.00
An alternative that utilizes base R.
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
mynewdata <- data.frame(id = rep(x = mydata$id,times = ceiling(x = mydata$n)),
n = mydata$n[match(x = rep(x = mydata$id,ceiling(mydata$n)),table = mydata$id)],
t = rep(x = mydata$n / ceiling(mydata$n),times = ceiling(mydata$n)))
mynewdata$t.accum <- unlist(x = by(data = mynewdata$t,INDICES = mynewdata$id,FUN = cumsum))
We start by creating a data.frame with three columns, id, n, and t. id is calculated using rep and ceiling to repeat the ID variable the appropriate number of times. n is obtained by using match to look up the right value in mydata$n. t is the ratio of n to the ceiling of n, repeated the appropriate number of times (ceiling of n again). Note that this splits n evenly across the rows of each id, so t still sums to n per id but does not follow the 1, 1, 0.63 pattern shown in the question.
Then, we use cumsum to get the cumulative sum, called using by to allow by-group processing for each group of IDs. You could probably use tapply() here as well.
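If the exact 1, 1, 0.63 pattern from the question is wanted in base R, one possible sketch is to compute the running total first and derive t from it. The names reps, step and out are made up for this example.

```r
mydata <- data.frame(id = c(1, 2, 3, 4, 5), n = c(2.63, 1.5, 0.5, 3.5, 4))

reps <- ceiling(mydata$n)                        # rows needed per id
out  <- mydata[rep(seq_len(nrow(mydata)), reps), ]
rownames(out) <- NULL

step <- sequence(reps)                           # 1, 2, 3, ... within each id

# The running total grows by 1 per row but is capped at n
out$accumulated.t <- pmin(step, out$n)
# Each row's share is the running total minus what earlier rows contributed
out$t <- out$accumulated.t - (step - 1)

out <- out[c("id", "n", "t", "accumulated.t")]
out
```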

How to delete a duplicate row in R

I have the following data
x y z
1 2 a
1 2
data[2,3] is a factor, but nothing shows.
The data has a lot of rows like this. How do I delete a row when z is empty?
I mean deleting rows such as the second row.
The output should be
x y z
1 2 a
OK. Stabbing a little bit in the dark here.
Imagine the following dataset:
mydf <- data.frame(
x = c(.11, .11, .33, .33, .11, .11),
y = c(.22, .22, .44, .44, .22, .44),
z = c("a", "", "", "f", "b", ""))
mydf
# x y z
# 1 0.11 0.22 a
# 2 0.11 0.22
# 3 0.33 0.44
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
From the combination of your title and your description (neither of which seems to fully describe your problem), I would guess that you want to drop rows 2 and 3, but not row 6. In other words, you want to first check whether the row is duplicated (presumably on the first two columns only), and then, if the third column is empty, drop that row. By those instructions, row 5 should remain (column "z" is not blank) and row 6 should remain (the combination of columns 1 and 2 is not a duplicate).
If that's the case, here's one approach:
# Copy the data.frame, "sorting" by column "z"
mydf2 <- mydf[rev(order(mydf$z)), ]
# Subset according to your conditions
mydf2 <- mydf2[duplicated(mydf2[1:2]) & mydf2$z %in% "", ]
mydf2
# x y z
# 3 0.33 0.44
# 2 0.11 0.22
^^ Those are the data that we want to remove. One way to remove them is using setdiff on the rownames of each dataset:
mydf[setdiff(rownames(mydf), rownames(mydf2)), ]
# x y z
# 1 0.11 0.22 a
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
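The same rows can also be dropped in one step, without sorting or comparing row names. This is a sketch under a slightly different reading of the rule: a blank row is removed whenever its x/y combination appears in any other row, so if every row sharing a combination were blank, all of them would be removed. key, shared and kept are illustrative names.

```r
mydf <- data.frame(
  x = c(.11, .11, .33, .33, .11, .11),
  y = c(.22, .22, .44, .44, .22, .44),
  z = c("a", "", "", "f", "b", ""),
  stringsAsFactors = FALSE
)

key <- mydf[c("x", "y")]
# TRUE for every row whose x/y combination occurs more than once
shared <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Drop rows that are blank in z and share their x/y with another row
kept <- mydf[!(shared & mydf$z == ""), ]
kept
```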
Some example data:
df = data.frame(x = runif(100),
y = runif(100),
z = sample(c(letters[1:10], ""), 100, replace = TRUE))
> head(df)
x y z
1 0.7664915 0.86087017 a
2 0.8567483 0.83715022 d
3 0.2819078 0.85004742 f
4 0.8241173 0.43078311 h
5 0.6433988 0.46291916 e
6 0.4103120 0.07511076
Note row six, with the missing value. You can subset using a vector of logicals (TRUE, FALSE):
df[df$z != "",]
And as @AnandaMahto commented, you can even check against multiple conditions:
df[!df$z %in% c("", " "),]
