Error: expecting a string in R - r

I am running the following code:
mydataframe <- mydataframe %>%
mutate(newVar1 = abs(as.numeric(CanBe1 == 0 & lead(var_id, default = 0) == (var_id + 1)) - 1)) %>%
group_by(pt, item) %>%
mutate(newVar2 = abs(as.numeric((CanBe2 == 0 & lag(var_id, default = 0) == (var_id - 1)) ) -1),
newVar2 = ifelse(lag(newVar1, default = 0) == 1, 1, newVar2))
but I get an error: Error: expecting a string. What does it mean? Where exactly should there be a string?
Here are a few rows of the data I have, together with the newVar1 and newVar2 values I expect:
pt item var_id CanBe1 CanBe2 newVar1 newVar2
1 9 2 0 0 0 1
1 9 3 0 0 0 0
1 9 4 1 0 0 0
1 9 5 0 0 1 0
1 9 7 0 0 0 1
1 9 8 1 0 1 0
1 9 10 0 1 0 1
1 9 11 0 0 0 0
1 9 12 1 0 1 0
1 9 2 1 0 0 1
The variables I am using are:
class(mydataframe$pt) = `factor` #even if I change this one to `character` the code doesn't work
class(mydataframe$item) = `character`
class(mydataframe$var_id) = `character`
class(mydataframe$CanBe1) = `numeric`
class(mydataframe$CanBe2) = `numeric`

var_id is currently a character string. Reclassifying it as numeric ahead of time will fix it.
mydataframe <- mydataframe %>%
mutate(var_id = as.numeric(var_id), #Switching from character to numeric
newVar1 = abs(as.numeric(CanBe1 == 0 & lead(var_id, default = 0) == (var_id + 1)) - 1)) %>%
group_by(pt, item) %>%
mutate(newVar2 = abs(as.numeric((CanBe2 == 0 & lag(var_id, default = 0) == (var_id - 1)) ) -1),
newVar2 = ifelse(lag(newVar1, default = 0) == 1, 1, newVar2))
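To see why the conversion matters, here is a minimal sketch on made-up data (the toy data frame and the names toy, arith_fails and fixed are my own, not from the question). It only shows that arithmetic on a character var_id fails and that the newVar1 expression runs once the column is numeric; it does not try to reproduce the exact error message, which may depend on the dplyr version.

```r
library(dplyr)

# Hypothetical toy data: var_id stored as character, as in the question
toy <- data.frame(pt = 1, item = 9,
                  var_id = c("2", "3", "4", "5", "7"),
                  CanBe1 = c(0, 0, 1, 0, 0),
                  stringsAsFactors = FALSE)

# Adding 1 to a character vector is an error in R
arith_fails <- inherits(try(toy$var_id + 1, silent = TRUE), "try-error")

# After converting var_id, the same expression from the answer evaluates
fixed <- toy %>%
  mutate(var_id = as.numeric(var_id),
         newVar1 = abs(as.numeric(CanBe1 == 0 &
                                    lead(var_id, default = 0) == (var_id + 1)) - 1))
```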

Related

Vectorized recoding of rows in R data frame based on value in another column

I have a dataset of binary responses (0, 1) to a number of questions like below. Each row is an individual and each column is a response to a question. "completed" indicates how many questions were reached. For example, if completed = 2, only q_1 and q_2 were responded to by that individual. I want to rescore each "q" column such that any column whose number is greater than "completed" becomes 0; otherwise the value from the corresponding "q" column is used.
have = data.frame(q_1 = c(1,0,1,1,0),
q_2 = c(1,1,1,1,0),
q_3 = c(0,0,1,1,0),
q_4 = c(1,0,0,1,1),
q_5 = c(1,0,0,0,1),
completed = c(2, 3, 2, 4, 1))
> have
q_1 q_2 q_3 q_4 q_5 completed
1 1 1 0 1 1 2
2 0 1 0 0 0 3
3 1 1 1 0 0 2
4 1 1 1 1 0 4
5 0 0 0 1 1 1
How can I get to this output? Would it be easier to transform the dataset?
> want
q_1 q_2 q_3 q_4 q_5 completed scored_1 scored_2 scored_3 scored_4 scored_5
1 1 1 0 1 1 2 1 1 0 0 0
2 0 1 0 0 0 3 0 1 0 0 0
3 1 1 1 0 0 2 1 1 0 0 0
4 1 1 1 1 0 4 1 1 1 1 0
5 0 0 0 1 1 1 0 0 0 0 0
This code will get the correct output. However, my real dataset is very large so I would need to be able to loop through the columns.
want = have %>%
mutate(scored_1 = ifelse(completed >= 1, q_1, 0),
scored_2 = ifelse(completed >= 2, q_2, 0),
scored_3 = ifelse(completed >= 3, q_3, 0),
scored_4 = ifelse(completed >= 4, q_4, 0),
scored_5 = ifelse(completed >= 5, q_5, 0))
You may also try mutate(across(...)) in dplyr. This way you can mutate as many columns as you want (e.g. all columns whose names start with q_).
have = data.frame(q_1 = c(1,0,1,1,0),
q_2 = c(1,1,1,1,0),
q_3 = c(0,0,1,1,0),
q_4 = c(1,0,0,1,1),
q_5 = c(1,0,0,0,1),
completed = c(2, 3, 2, 4, 1))
library(tidyverse)
have %>%
mutate(across(starts_with('q_'), ~ifelse(completed >= as.numeric(str_remove(cur_column(), 'q_')), ., 0),
.names = 'scored_{.col}')) %>%
rename_with(~str_remove(., 'q_'), starts_with('scored'))
#> q_1 q_2 q_3 q_4 q_5 completed scored_1 scored_2 scored_3 scored_4 scored_5
#> 1 1 1 0 1 1 2 1 1 0 0 0
#> 2 0 1 0 0 0 3 0 1 0 0 0
#> 3 1 1 1 0 0 2 1 1 0 0 0
#> 4 1 1 1 1 0 4 1 1 1 1 0
#> 5 0 0 0 1 1 1 0 0 0 0 0
Created on 2021-05-14 by the reprex package (v2.0.0)
If you do not need any other logic, you can easily convert your completed score into conditions for populating your columns.
library(dplyr)
have %>%
mutate( scored_1 = ifelse(completed != 0, q_1, 0)
,scored_2 = ifelse(completed >= 2, q_2, 0)
,scored_3 = ifelse(completed >= 3, q_3, 0)
,scored_4 = ifelse(completed >= 4, q_4, 0)
,scored_5 = ifelse(completed == 5, q_5, 0)
)
I'm assuming if "completed" = N, then the individual actually wrote down the answers to the questions up to N. Right? Please correct me if I'm wrong.
If that's the case, I have a vectorized solution:
have = data.frame(q_1 = c(1,0,1,1,0),
q_2 = c(1,1,1,1,0),
q_3 = c(0,0,1,1,0),
q_4 = c(1,0,0,1,1),
q_5 = c(1,0,0,0,1),
completed = c(2, 3, 2, 4, 1))
check <- function(x) {
# which column number is "completed"
stop <- which(names(x) == "completed")
start <- x[["completed"]]
x <- as.numeric(x)
x[(1:length(x) > start) & (1:length(x) < stop)] <- 0
return(x)
}
want <- have
for (line in 1:nrow(want)) {
want[line, ] <- check(want[line, ])
}
Using col:
h = have[startsWith(names(have), "q")]
h[col(h) > have$completed] = 0
cbind(have, setNames(h, paste0("s_", 1:ncol(h))))
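A slightly more spelled-out sketch of the same col() idea, in case it helps to see what the comparison does. col(q) gives each cell its column number, and have$completed is recycled down the columns, so cell [i, j] is compared with completed[i]. The names q, mask and scored are mine, and this assumes the q_ columns are in question order.

```r
have <- data.frame(q_1 = c(1, 0, 1, 1, 0),
                   q_2 = c(1, 1, 1, 1, 0),
                   q_3 = c(0, 0, 1, 1, 0),
                   q_4 = c(1, 0, 0, 1, 1),
                   q_5 = c(1, 0, 0, 0, 1),
                   completed = c(2, 3, 2, 4, 1))

# Question columns as a numeric matrix (assumed to be in order q_1, q_2, ...)
q <- as.matrix(have[startsWith(names(have), "q_")])

# TRUE where the question number is within the number of completed questions
mask <- col(q) <= have$completed

# Multiplying by the logical mask zeroes every unreached question
scored <- q * mask
colnames(scored) <- paste0("scored_", seq_len(ncol(scored)))

want <- cbind(have, scored)
```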

How to randomly replace a value

I have a vector in which I want to randomly replace every 2 with 0 or 1, where 1 is drawn with probability 0.4. I used the code below and expected a separate draw (0 or 1) for each 2 that is replaced, but every 2 is replaced by the same value.
vec<-c(rep(2,18),1,0)
ifelse(vec==2, rbinom(1,1,0.40), vec)
here is one output
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
and another output
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
When you go into the source code of ifelse via typing View(ifelse), you will see a piece of code showing that
len <- length(ans)
ypos <- which(test)
npos <- which(!test)
if (length(ypos) > 0L)
ans[ypos] <- rep(yes, length.out = len)[ypos]
if (length(npos) > 0L)
ans[npos] <- rep(no, length.out = len)[npos]
ans
That means that when yes or no is a single value, ifelse repeats that value len times and places it at the corresponding logical positions.
In your case, rbinom(1,1,0.40) produces a single draw for yes, so that one realization is repeated for every element equal to 2.
One workaround is like below
> ifelse(vec == 2, rbinom(sum(vec == 2), 1, 0.40), vec)
[1] 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 1 0
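A small sketch of the recycling behaviour, assuming the same vec as in the question (the names single and per_elem are mine). Drawing length(vec) values is one way to make sure each position gets its own draw regardless of where the 2s sit in the vector, since yes is indexed by position after recycling.

```r
set.seed(1)  # only so the sketch is reproducible
vec <- c(rep(2, 18), 1, 0)

# One draw for 'yes': ifelse recycles it to every position where vec == 2
single <- ifelse(vec == 2, rbinom(1, 1, 0.40), vec)
n_distinct_single <- length(unique(single[vec == 2]))  # always 1

# One draw per position: each 2 is replaced by its own 0/1 value
per_elem <- ifelse(vec == 2, rbinom(length(vec), 1, 0.40), vec)
```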
This replaces all 2 values with either 0 or 1
vec[vec == 2] <- rbinom(sum(vec == 2), 1, prob = .4)
If you draw a 0 and want the value to remain 2 then you could use sample, which would be equivalent to a binomial draw:
vec[vec == 2] <- sample(c(1, 2), sum(vec == 2), prob = c(0.4, 0.6), replace = T)
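Try the following code. Note that it replaces every second element (positions 2, 4, ...) with a random draw, rather than the elements equal to 2: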
#Code
vec<-c(rep(2,18),1,0)
vec2 <- unlist(lapply(seq(2,length(vec),by=2), function(x) {vec[x] <- rbinom(1,1,0.40)}))
vec[seq(2,length(vec),by=2)] <-vec2
Output:
vec
[1] 2 0 2 0 2 1 2 0 2 0 2 0 2 1 2 0 2 0 1 1

Calculate number of time streak of categories change in a row in R

I have the following data frame in R:
Row number  A   B   C  D  E  F  G  H  I  J
1           NA  1   1  0  0  1  0  0  1  1
2           NA  NA  1  0  0  0  1  0  0  1
3           NA  1   0  0  0  1  0  0  1  1
I am trying to calculate, for each row, the number of times the value changes between 1 and 0, excluding the nulls (missing values).
The result I am expecting is this:
Row Number No of changes
---------- --------------
1 4
2 4
3 4
An explanation for row 1:
In row 1, A is null, so we exclude it.
B and C are 1, which is our first set of values.
D and E are 0, which is our second set of values. Now Change = 1
F is 1, which is our third set of values. Now Change = 1+1
G and H are 0, which is our fourth set of values. Now Change = 1+1+1
I and J are 1, which is our fifth set of values. Now Change = 1+1+1+1 = 4
Here's a tidyverse approach.
I gather into longer format (from tidyr::pivot_longer), then add a helper column noting when we have a change from 0 to 1 or from 1 to 0, and then sum those by row.
library(tidyverse)
df %>%
# before tidyr 1.0, this would be gather(col, value, -1)
pivot_longer(-1, names_to = "col") %>%
group_by(Row.number) %>%
mutate(chg = value == 1 & lag(value) == 0 |
value == 0 & lag(value) == 1) %>%
summarize(no_chgs = sum(chg, na.rm = T))
# A tibble: 3 x 2
Row.number no_chgs
<int> <int>
1 1 4
2 2 4
3 3 4
Sample data:
df <- read.table(
header = T,
stringsAsFactors = F,
text = "'Row number' A B C D E F G H I J
1 NA 1 1 0 0 1 0 0 1 1
2 NA NA 1 0 0 0 1 0 0 1
3 NA 1 0 0 0 1 0 0 1 1")
Here's a data.table solution:
library(data.table)
dt <- as.data.table(df)
dt[,
no_change := max(rleid(na.omit(t(.SD)))) - 1,
by = Row.number
]
dt
Alternatively, here's a base version:
apply(df[, -1],
1,
function(x) {
complete_case = complete.cases(x)
if (sum(complete_case) > 0) {
return(length(rle(x[complete_case])$lengths) - 1)
} else {
return (0)
}
}
)
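Another base R option, swapping rle() for diff(): after dropping the missing values, a change is any place where consecutive values differ. This is a sketch on a matrix copy of the sample data; the names m, count_changes and no_chgs are mine.

```r
# Sample data from above as a plain matrix (one row per record)
m <- rbind(c(NA, 1, 1, 0, 0, 1, 0, 0, 1, 1),
           c(NA, NA, 1, 0, 0, 0, 1, 0, 0, 1),
           c(NA, 1, 0, 0, 0, 1, 0, 0, 1, 1))

count_changes <- function(x) {
  x <- x[!is.na(x)]   # drop the nulls first
  sum(diff(x) != 0)   # a non-zero difference marks a 0/1 switch
}

no_chgs <- apply(m, 1, count_changes)
no_chgs
```

A row with no values at all (or a single value) gives 0, because diff() then returns an empty vector.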

Mutate over every possible combination of columns

I have a data frame of binary variables:
df <-data.frame(a = c(0,1,0,1,0), b = c(1, 1, 0, 0, 1), c = c(1,0,1,1,0))
And I'd like to create a column for each possible combination of my pre-existing columns:
library(tidyverse)
df %>%
mutate(d = case_when(a==1 & b==1 & c==1 ~ 1),
e = case_when(a==1 & b==1 & c!=1 ~ 1),
f = case_when(a==1 & b!=1 & c==1 ~ 1),
g = case_when(a!=1 & b==1 & c==1 ~ 1))
But my real dataset has too many columns to do this without a function or loop. Is there an easy way to do this in R?
First note that do.call(paste0, df) will combine all of your columns into one string, however many they are:
do.call(paste0, df)
# [1] "011" "110" "001" "101" "010" "011"
Then you can use spread() from the tidyr package to give each its own column. Note that you have to add an extra row column so that it knows to keep each of the rows separate (instead of trying to combine them).
# I added a sixth row that copied the first to make the effect clear
df<-data.frame(a = c(0,1,0,1,0,0), b = c(1, 1, 0, 0, 1, 1), c = c(1,0,1,1,0,1))
# this assumes you want `type_` at the start of each new column,
# but you could use a different convention
df %>%
mutate(type = paste0("type_", do.call(paste0, df)),
value = 1,
row = row_number()) %>%
spread(type, value, fill = 0) %>%
select(-row)
Result:
a b c type_001 type_010 type_011 type_101 type_110
1 0 0 1 1 0 0 0 0
2 0 1 0 0 1 0 0 0
3 0 1 1 0 0 1 0 0
4 0 1 1 0 0 1 0 0
5 1 0 1 0 0 0 1 0
6 1 1 0 0 0 0 0 1
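If you would rather avoid reshaping, the same kind of indicator columns can be built directly from the pasted strings. This is only a sketch in base R using the original five-row df; the names combo, lv, ind and out are mine, and the type_ prefix is the same arbitrary convention as above.

```r
df <- data.frame(a = c(0, 1, 0, 1, 0), b = c(1, 1, 0, 0, 1), c = c(1, 0, 1, 1, 0))

# One string per row describing that row's combination
combo <- do.call(paste0, df)

# One 0/1 indicator column per combination that actually occurs
lv <- sort(unique(combo))
ind <- sapply(lv, function(l) as.numeric(combo == l))
colnames(ind) <- paste0("type_", lv)

out <- cbind(df, ind)
```

Unlike the spread() version, this keeps the rows in their original order. Only combinations present in the data get a column.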
An alternative to David's answer, but I recognize it's a little awkward:
df %>%
unite(comb, a:c, remove = FALSE) %>%
spread(key = comb, value = comb) %>%
mutate_if(is.character, funs(if_else(is.na(.), 0, 1)))
#> a b c 0_0_1 0_1_0 0_1_1 1_0_1 1_1_0
#> 1 0 0 1 1 0 0 0 0
#> 2 0 1 0 0 1 0 0 0
#> 3 0 1 1 0 0 1 0 0
#> 4 1 0 1 0 0 0 1 0
#> 5 1 1 0 0 0 0 0 1
EDIT: funs() is being deprecated as of version 0.8.0 of dplyr, so the last line should be revised to:
mutate_if(is.character, list(~ if_else(is.na(.), 0, 1)))

R: combine rows of a matrix by group

I am attempting to reformat the data set my.data to obtain the output shown below the my.data2 statement. Specifically, I want to put the last 4 columns of my.data on one line per record.id, where the last four columns of my.data will occupy columns 2-5 of the new data matrix if group=1 and columns 6-9 if group=2.
I wrote the cumbersome code below, but the double for-loop is causing an error that I simply cannot locate. Even if the double for-loop worked, I suspect there is a much more efficient way of accomplishing the same thing (maybe reshape?).
Thank you for any help correcting the double for-loop or with more efficient code.
my.data <- "record.id group s1 s2 s3 s4
1 1 2 0 1 3
1 2 0 0 0 12
2 1 0 0 0 0
3 1 10 0 0 0
4 1 1 0 0 0
4 2 0 0 0 0
8 2 0 2 2 0
9 1 0 0 0 0
9 2 0 0 0 0"
my.data2 <- read.table(textConnection(my.data), header=T)
# desired output
#
# 1 2 0 1 3 0 0 0 12
# 2 0 0 0 0 0 0 0 0
# 3 10 0 0 0 0 0 0 0
# 4 1 0 0 0 0 0 0 0
# 8 0 0 0 0 0 2 2 0
# 9 0 0 0 0 0 0 0 0
Code:
dat_sorted <- sort(unique(my.data2[,1]))
my.seq <- match(my.data2[,1],dat_sorted)
my.data3 <- cbind(my.seq, my.data2)
group.min <- tapply(my.data3$group, my.data3$my.seq, min)
group.max <- tapply(my.data3$group, my.data3$my.seq, max)
# my.min <- group.min[my.data3[,1]]
# my.max <- group.max[my.data3[,1]]
my.records <- matrix(0, nrow=length(unique(my.data3$record.id)), ncol=9)
x <- 1
for(i in 1:max(my.data3$my.seq)) {
for(j in group.min[i]:group.max[i]) {
if(my.data3[x,1] == i) my.records[i,1] = i
# the two lines below seem to be causing an error
if((my.data3[x,1] == i) & (my.data3[x,3] == 1)) (my.records[i,2:5] = my.data3[x,4:7])
if((my.data3[x,1] == i) & (my.data3[x,3] == 2)) (my.records[i,6:9] = my.data3[x,4:7])
x <- x + 1
}
}
You are right, reshape helps here.
library(reshape2)
m <- melt(my.data2, id.var = c("record.id", "group"))
dcast(m, record.id ~ group + variable, fill = 0)
record.id 1_s1 1_s2 1_s3 1_s4 2_s1 2_s2 2_s3 2_s4
1 1 2 0 1 3 0 0 0 12
2 2 0 0 0 0 0 0 0 0
3 3 10 0 0 0 0 0 0 0
4 4 1 0 0 0 0 0 0 0
5 8 0 0 0 0 0 2 2 0
6 9 0 0 0 0 0 0 0 0
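For reference, base R's reshape() can do the same long-to-wide step without reshape2. A sketch under the assumption that the data look like my.data2 above; missing record/group combinations come out as NA and are set to 0 afterwards. The name wide is mine, and the columns come out named s1.1, ..., s4.2 rather than 1_s1, ..., 2_s4.

```r
my.data2 <- read.table(header = TRUE, text = "record.id group s1 s2 s3 s4
1 1 2 0 1 3
1 2 0 0 0 12
2 1 0 0 0 0
3 1 10 0 0 0
4 1 1 0 0 0
4 2 0 0 0 0
8 2 0 2 2 0
9 1 0 0 0 0
9 2 0 0 0 0")

# One row per record.id, one block of s1..s4 columns per group
wide <- reshape(my.data2, idvar = "record.id", timevar = "group",
                direction = "wide")

# Groups that were absent for a record are NA; the desired output uses 0
wide[is.na(wide)] <- 0
wide
```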
Comparison:
dfTest <- data.frame(record.id = rep(1:10e5, each = 2), group = 1:2,
s1 = sample(1:10, 10e5 * 2, replace = TRUE),
s2 = sample(1:10, 10e5 * 2, replace = TRUE),
s3 = sample(1:10, 10e5 * 2, replace = TRUE),
s4 = sample(1:10, 10e5 * 2, replace = TRUE))
system.time({
...# Your code
})
Error in my.records[i, 1] = i : incorrect number of subscripts on matrix
Timing stopped at: 41.61 0.36 42.56
system.time({m <- melt(dfTest, id.var = c("record.id", "group"))
dcast(m, record.id ~ group + variable, fill = 0)})
user system elapsed
25.04 2.78 28.72
Julius' answer is better, but for completeness, I think I managed to get the following for-loop to work:
dat_x <- (unique(my.data2[,1]))
my.seq <- match(my.data2[,1],dat_x)
my.data3 <- as.data.frame(cbind(my.seq, my.data2))
my.records <- matrix(0, nrow=length(unique(my.data3$record.id)), ncol=9)
my.records <- as.data.frame(my.records)
my.records[,1] = unique(my.data3[,2])
for(i in seq_len(nrow(my.data3))) {
if(my.data3[i,3] == 1) (my.records[my.data3[i,1],c(2:5)] = my.data3[i,c(4:7)])
if(my.data3[i,3] == 2) (my.records[my.data3[i,1],c(6:9)] = my.data3[i,c(4:7)])
}
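The row-by-row loop can also be replaced by one matrix assignment per group. This is a sketch assuming group only takes the values 1 and 2 and that s1 to s4 are columns 3 to 6 of my.data2; the names ids, out, rows and target are mine.

```r
my.data2 <- read.table(header = TRUE, text = "record.id group s1 s2 s3 s4
1 1 2 0 1 3
1 2 0 0 0 12
2 1 0 0 0 0
3 1 10 0 0 0
4 1 1 0 0 0
4 2 0 0 0 0
8 2 0 2 2 0
9 1 0 0 0 0
9 2 0 0 0 0")

ids <- sort(unique(my.data2$record.id))
out <- matrix(0, nrow = length(ids), ncol = 9)
out[, 1] <- ids

for (g in 1:2) {
  rows <- my.data2$group == g
  # Output row for each input row of this group
  target <- match(my.data2$record.id[rows], ids)
  # Group 1 fills columns 2-5, group 2 fills columns 6-9
  out[target, (2:5) + 4 * (g - 1)] <- as.matrix(my.data2[rows, 3:6])
}
out
```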
