I want to create five new columns that count how often a given value of "stars" has occurred for the business before this particular row (i.e., summing over all rows with a smaller rolingcount while holding the business constant).
For the first row of each business (i.e., where rolingcount == 1), it should be NA, because there have been no previous observations for this business.
Here is an exemplary dataset:
business <-c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5,5)
df <- cbind(business, rolingcount, stars)
I feel my problem is related to this question, but I can't get its approach to work for my case: Numbering rows within groups in a data frame
I also unsuccessfully experimented with while loops.
Ideally, something like this would be the output. (I leave out previousthree, previoustwo, and previousone, because I believe they will work identically.)
business <- c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5,5)
previousfives <- c(NA,1,2,NA,1,2,2,2,NA,1,1,1,1,1,2,3)
previousfours <- c(NA,0,0,NA,0,0,0,0,NA,0,0,0,0,1,1,1)
df <- cbind(business, rolingcount, stars, previousfives, previousfours)
Since I will have to do this for more than 10M rows, a fast option would be great. Your help is much appreciated! :)
I don't know if this option is really fast; I'm not used to dealing with that many rows.
Here is a solution using the dplyr package from the tidyverse:
library(tidyverse)
df %>%
  as.data.frame() %>%
  group_by(business) %>%
  # lag_stars == 0 only happens on the first row of each business (the
  # default fill), since the actual star values run from 1 to 5
  mutate(stars = as.numeric(stars),
         lag_stars = lag(stars, 1, default = 0),
         previousfives = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 5)),
         previousfours = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 4)),
         previousthrees = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 3)),
         previoustwos = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 2)),
         previousones = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 1))) %>%
  ungroup() %>%
  select(-lag_stars)
Output:
# A tibble: 16 x 8
business rolingcount stars previousfives previousfours previousthrees previoustwos previousones
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 aaa 1 5 NA NA NA NA NA
2 aaa 2 5 1 0 0 0 0
3 aaa 3 3 2 0 0 0 0
4 bbb 1 5 NA NA NA NA NA
5 bbb 2 5 1 0 0 0 0
6 bbb 3 1 2 0 0 0 0
7 bbb 4 2 2 0 0 0 1
8 bbb 5 3 2 0 0 1 1
9 ccc 1 5 NA NA NA NA NA
10 ccc 2 1 1 0 0 0 0
11 ccc 3 2 1 0 0 0 1
12 ccc 4 3 1 0 0 1 1
13 ccc 5 4 1 0 1 1 1
14 ccc 6 5 1 1 1 1 1
15 ccc 7 5 2 1 1 1 1
16 ccc 8 5 3 1 1 1 1
Basically, group_by makes the operation run separately for each business, and the mutate builds a cumulative sum over the lagged stars.
Maybe it'll lead you to another, faster idea if it is too slow.
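If it does turn out to be too slow at 10M rows, the same logic translates to data.table, which is usually faster for grouped operations of this size. A rough sketch, untested at that scale (the previous1s, ..., previous5s column names are just my own shorthand):
library(data.table)

dt <- as.data.table(df)            # df came from cbind(), so all columns are character
dt[, stars := as.numeric(stars)]

# For each star value s, count its occurrences in earlier rows of the same
# business: shift() lags stars within the group, cumsum() counts the matches,
# and the first row per business stays NA because its lagged value is NA.
for (s in 1:5) {
  newcol <- paste0("previous", s, "s")
  dt[, (newcol) := {
    lagged <- shift(stars)
    ifelse(is.na(lagged), NA_real_, cumsum(!is.na(lagged) & lagged == s))
  }, by = business]
}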
Hope it helped.
I have a large data set and want to replace many NAs, but not all.
In one group I want to replace all NAs with 0.
In the other group I want to replace all NAs with 0, but only in variables whose names do not include a certain string, e.g. 'b'.
Here is an example:
group <- c(1,1,2,2,2)
abc <- c(1,NA,NA,NA,NA)
bcd <- c(2,1,NA,NA,NA)
cde <- c(5,NA,NA,1,2)
df <- data.frame(group,abc,bcd,cde)
group abc bcd cde
1 1 1 2 5
2 1 NA 1 NA
3 2 NA NA NA
4 2 NA NA 1
5 2 NA NA 2
This is what I want:
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
This is what I tried:
# set 0 in the first group: this works fine
df[is.na(df) & df$group == 1] <- 0
# set 0 in the second group, but only in variables whose names do not include 'b': does not work
df[is.na(df) & df$group == 2 & !grepl('b', colnames(df))] <- 0
dplyr solutions are welcome, as well as base R.
For the second group, create a column index with grepl and use that to subset the data while assigning:
j1 <- !grepl('b',colnames(df))
df[j1][df$group == 2 & is.na(df[j1])] <- 0
df
# group abc bcd cde
#1 1 1 2 5
#2 1 0 1 0
#3 2 NA NA 0
#4 2 NA NA 1
#5 2 NA NA 2
Using dplyr::mutate_at you can also do:
library(dplyr)
vars_mutate_1 <- names(df)[-1]
vars_mutate_2 <- grep(x = names(df)[-1], pattern = '^(?!.*b).*$', perl = TRUE, value = TRUE)
df %>%
  mutate_at(.vars = vars_mutate_1, .funs = funs(if_else(group == 1 & is.na(.), 0, .))) %>%
  mutate_at(.vars = vars_mutate_2, .funs = funs(if_else(group == 2 & is.na(.), 0, .)))
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
Alternatively, you can use:
library(dplyr)
df2 <- df %>%
  mutate_at(vars(names(df)[-1]),
            function(x) case_when(group == 1 & is.na(x) ~ 0,
                                  group == 2 & is.na(x) & !grepl("b", deparse(substitute(x))) ~ 0,
                                  TRUE ~ x))
> df2
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
I would like to know if/how I can turn the call below into a function that I can use for a task I do fairly often with my data. Sadly, I can't figure out how to build a function from a call that involves mutate and case_when; both of these functions come from the dplyr package and require a number of additional arguments.
Alternatively, the call itself seems redundant to me with so many case_when blocks; perhaps it's possible to reduce how many times it's used.
Any help and information about alternative approaches is welcome.
The call looks like this:
library(dplyr)
library(stringr)
test_data %>%
mutate(
multipleoptions_o1 = case_when(
str_detect(multipleoptions, "option1") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0),
multipleoptions_o2 = case_when(
str_detect(multipleoptions, "option2") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0),
multipleoptions_o3 = case_when(
str_detect(multipleoptions, "option3") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0),
multipleoptions_o4 = case_when(
str_detect(multipleoptions, "option4") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0)
)
Sample data:
structure(list(multipleoptions = c("option1", "option2", "option3",
NA, "option2,option3", "option4")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Desired output of the function:
structure(list(multipleoptions = c("option1", "option2", "option3",
NA, "option2,option3", "option4"), multipleoptions_o1 = c(1,
0, 0, NA, 0, 0), multipleoptions_o2 = c(0, 1, 0, NA, 1, 0), multipleoptions_o3 = c(0,
0, 1, NA, 1, 0), multipleoptions_o4 = c(0, 0, 0, NA, 0, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
Arguments of the function should probably be:
- data (i.e., the input dataset)
- multipleoptions (i.e., the column from data containing the answer options; always one column)
- patterns_to_look_for (i.e., the str_detect patterns to look up in multipleoptions)
- number_of_options (ideally the number of options can be more or less than 4; I am not sure if that's achievable)
- output_columns (i.e., names of the new columns; always the name of the original column followed by the option number or option name)
You can avoid the lengthy case_when code by splitting the options into separate elements, taking advantage of nesting/unnesting to get a single column of options, and then spreading to get a separate column for each option.
Updated Answer
library(tidyverse)
# Arguments
# data A data frame
# patterns Regular expression giving the pattern(s) at which to split the options strings
# ... Grouping columns, the first of which must be the "options" column.
# If options has repeated values, then there must be a second grouping
# column (an "ID" column) to differentiate these repeated values.
fnc = function(data, patterns, ...) {
  col = quos(...)
  data %>%
    mutate(option = str_split(!!col[[1]], patterns)) %>%
    unnest %>%
    mutate(value = 1) %>%
    group_by(!!!col) %>%
    mutate(num_chosen = ifelse(is.na(!!col[[1]]), 0, sum(value))) %>%
    spread(option, value, fill = 0) %>%
    select_at(vars(-matches("NA")))
}
fnc(test_data, ",", multipleoptions)
multipleoptions num_chosen option1 option2 option3 option4
1 option1 1 1 0 0 0
2 option2 1 0 1 0 0
3 option2,option3 2 0 1 1 0
4 option3 1 0 0 1 0
5 option4 1 0 0 0 1
6 <NA> 0 0 0 0 0
# Fake data
ops = paste0("option",1:4)
set.seed(2)
d = data_frame(var=replicate(20, paste(sample(ops, sample(1:4,1, prob=c(10,8,5,1))), collapse=",")))
# Add missing values
d = bind_rows(d[1:5,], data.frame(var=rep(NA,3)), d[6:nrow(d),])
fnc(d %>% mutate(ID=1:n()), ",", var, ID)
var ID num_chosen option1 option2 option3 option4
1 option1 17 1 1 0 0 0
2 option1,option2 12 2 1 1 0 0
3 option1,option2,option3 5 3 1 1 1 0
4 option1,option2,option4,option3 9 4 1 1 1 1
5 option1,option3 2 2 1 0 1 0
6 option1,option3,option4 3 3 1 0 1 1
7 option1,option4,option2 20 3 1 1 0 1
8 option1,option4,option3,option2 13 4 1 1 1 1
9 option2 11 1 0 1 0 0
10 option2,option3 23 2 0 1 1 0
11 option2,option3,option4 21 3 0 1 1 1
12 option3 1 1 0 0 1 0
13 option3 15 1 0 0 1 0
14 option3,option1 4 2 1 0 1 0
15 option3,option2,option4 14 3 0 1 1 1
16 option3,option4,option2,option1 22 4 1 1 1 1
17 option4 10 1 0 0 0 1
18 option4 16 1 0 0 0 1
19 option4 18 1 0 0 0 1
20 option4,option2,option3 19 3 0 1 1 1
21 <NA> 6 0 0 0 0 0
22 <NA> 7 0 0 0 0 0
23 <NA> 8 0 0 0 0 0
Original Answer
test_data %>%
  filter(!is.na(multipleoptions)) %>%
  mutate(option = str_split(multipleoptions, ",")) %>%
  unnest %>%
  mutate(value = 1) %>%
  spread(option, value)
multipleoptions option1 option2 option3 option4
<chr> <dbl> <dbl> <dbl> <dbl>
1 option1 1 NA NA NA
2 option2 NA 1 NA NA
3 option2,option3 NA 1 1 NA
4 option3 NA NA 1 NA
5 option4 NA NA NA 1
Packaging this into a function:
fnc = function(data, col, patterns) {
  col = enquo(col)
  data %>%
    filter(!is.na(!!col)) %>%
    mutate(option = str_split(!!col, patterns)) %>%
    unnest %>%
    mutate(value = 1) %>%
    spread(option, value)
}
fnc(test_data, multipleoptions, ",")
If your real data has more than one row with the same value of multipleoptions, then this code will work only if there's also an ID column that distinguishes different rows with the same value of multipleoptions. For example:
# Fake data
ops = paste0("option",1:4)
set.seed(2)
d = data.frame(var=replicate(20, paste(sample(ops, sample(1:4,1, prob=c(10,8,5,1))), collapse=",")))
fnc(d, var, ",")
Error: Duplicate identifiers for rows (1, 27), (16, 28, 30)
# Add unique row identifier
fnc(d %>% mutate(ID = 1:n()), var, ",")
var ID option1 option2 option3 option4
1 option1 14 1 NA NA NA
2 option1,option2 9 1 1 NA NA
3 option1,option2,option3 5 1 1 1 NA
4 option1,option2,option4,option3 6 1 1 1 1
5 option1,option3 2 1 NA 1 NA
6 option1,option3,option4 3 1 NA 1 1
7 option1,option4,option2 17 1 1 NA 1
8 option1,option4,option3,option2 10 1 1 1 1
9 option2 8 NA 1 NA NA
10 option2,option3 20 NA 1 1 NA
11 option2,option3,option4 18 NA 1 1 1
12 option3 1 NA NA 1 NA
13 option3 12 NA NA 1 NA
14 option3,option1 4 1 NA 1 NA
15 option3,option2,option4 11 NA 1 1 1
16 option3,option4,option2,option1 19 1 1 1 1
17 option4 7 NA NA NA 1
18 option4 13 NA NA NA 1
19 option4 15 NA NA NA 1
20 option4,option2,option3 16 NA 1 1 1
So, I want to generate a new vector from the information in two existing (numerical) ones: one giving the id of the participant, the other indicating the observation number. Each participant has been observed a different number of times.
Now, the new vector should be: 0 when obs_no == 1; 1 when obs_no equals the last observation for that id; NA for the cases in between.
id obs_no new_vector
1 1 0
1 2 NA
1 3 NA
1 4 NA
1 5 1
2 1 0
2 2 1
3 1 0
3 2 NA
3 3 1
I figure I could do this separately for every id using code like this:
new_vector <- c(0, rep(NA, times=length(obs_no[id==1])-2), 1)
Or I guess just using max(), but it wouldn't make any difference.
But adding each participant manually is really inconvenient since I have a lot of cases, and I can't figure out how to make a generic function. I tried to define a function(x) using sapply but can't get it to work, since x is positioned within the subsetting brackets.
Any advice would be helpful. Thanks.
ave to the rescue:
dat$newvar <- NA
dat$newvar <- with(dat,
  # within each id, set the last element to 1 and the first to 0
  ave(newvar, id, FUN = function(x) replace(x, c(length(x), 1), c(1, 0)))
)
Or use a bit of duplicated() fun:
dat$newvar <- NA
dat$newvar[!duplicated(dat$id, fromLast=TRUE)] <- 1
dat$newvar[!duplicated(dat$id)] <- 0
Both giving:
# id obs_no new_vector newvar
#1 1 1 0 0
#2 1 2 NA NA
#3 1 3 NA NA
#4 1 4 NA NA
#5 1 5 1 1
#6 2 1 0 0
#7 2 2 1 1
#8 3 1 0 0
#9 3 2 NA NA
#10 3 3 1 1
You can also do this with dplyr:
str <- "
id obs_no new_vector
1 1 0
1 2 NA
1 3 NA
1 4 NA
1 5 1
2 1 0
2 2 1
3 1 0
3 2 NA
3 3 1
"
dt <- read.table(textConnection(str), header = T)
library(dplyr)
dt %>%
  group_by(id) %>%
  mutate(newvar = if_else(obs_no == 1, 0L,
                          if_else(obs_no == max(obs_no), 1L, NA_integer_)))
We can use data.table
library(data.table)
# row numbers of the first and last observation within each id
i1 <- setDT(df1)[, .I[seq_len(.N) %in% c(1, .N)], id]$V1
# recycle c(0, 1) over those rows: 0 for each first row, 1 for each last
df1[i1, newvar := c(0, 1)]
df1
# id obs_no new_vector newvar
# 1: 1 1 0 0
# 2: 1 2 NA NA
# 3: 1 3 NA NA
# 4: 1 4 NA NA
# 5: 1 5 1 1
# 6: 2 1 0 0
# 7: 2 2 1 1
# 8: 3 1 0 0
# 9: 3 2 NA NA
#10: 3 3 1 1
Use split:
result = lapply(split(obs_no, id), function (x) c(0, rep(NA, length(x) - 2), 1))
This gives you a list of vectors. You can concatenate them back together like this:
do.call(c, result)
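Note that split() returns the groups in the sorted order of id, so this assumes the data are already ordered by id. Also, do.call(c, result) keeps the group names as prefixes on the element names; if you want a clean, unnamed vector, a simpler alternative is:
# drops the names that c()/unlist() would otherwise attach
new_vector <- unlist(result, use.names = FALSE)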
I would like to tabulate by row within a data frame. I can obtain adequate results using table within apply in the following example:
df.1 <- read.table(text = '
state county city year1 year2 year3 year4 year5
1 2 4 0 0 0 1 2
2 5 3 10 20 10 NA 10
2 7 1 200 200 NA NA 200
3 1 1 NA NA NA NA NA
', na.strings = "NA", header=TRUE)
tdf <- t(df.1)
apply(tdf[4:nrow(tdf),1:nrow(df.1)], 2, function(x) {table(x, useNA = "ifany")})
Here are the results:
[[1]]
x
0 1 2
3 1 1
[[2]]
x
10 20 <NA>
3 1 1
[[3]]
x
200 <NA>
3 2
[[4]]
x
<NA>
5
However, in the following example each row consists of a single repeated value.
df.2 <- read.table(text = '
state county city year1 year2 year3 year4 year5
1 2 4 0 0 0 0 0
2 5 3 1 1 1 1 1
2 7 1 2 2 2 2 2
3 1 1 NA NA NA NA NA
', na.strings = "NA", header=TRUE)
tdf.2 <- t(df.2)
apply(tdf.2[4:nrow(tdf.2),1:nrow(df.2)], 2, function(x) {table(x, useNA = "ifany")})
The output I obtain is:
# [1] 5 5 5 5
As such, I cannot tell from this output that the first 5 is for 0, the second 5 is for 1, the third 5 is for 2 and the last 5 is for NA. Is there a way I can have R return the value represented by each 5 in the second example?
You can use lapply to systematically output a list. You would have to loop over the row indices:
sub.df <- as.matrix(df.2[grepl("year", names(df.2))])
lapply(seq_len(nrow(sub.df)),
       function(i) table(sub.df[i, ], useNA = "ifany"))
Protect the result by wrapping with list:
apply(tdf.2[4:nrow(tdf.2), 1:nrow(df.2)], 2,
      function(x) { list(table(x, useNA = "ifany")) })
Here's a table solution:
table(
  rep(rownames(df.1), 5),
  unlist(df.1[, 4:8]),
  useNA = "ifany")
This gives
0 1 2 10 20 200 <NA>
1 3 1 1 0 0 0 0
2 0 0 0 3 1 0 1
3 0 0 0 0 0 3 2
4 0 0 0 0 0 0 5
...and for your df.2:
0 1 2 <NA>
1 5 0 0 0
2 0 5 0 0
3 0 0 5 0
4 0 0 0 5
Well, this is a solution unless you really like having a list of tables for some reason.
I think the problem is stated in apply's help:
... If n equals 1, apply returns a vector if MARGIN has length 1 and
an array of dimension dim(X)[MARGIN] otherwise ...
The inconsistencies in the return values of base R's apply family are the reason why I shifted completely to plyr's **ply functions. So this works as desired:
library(plyr)
alply(df.2[4:8], 1, function(x) table(unlist(x), useNA = "ifany"))
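For completeness: since R 4.1.0, base apply() also has a simplify argument, so the always-return-a-list behaviour no longer requires plyr. A minimal sketch of the same call:
# R >= 4.1.0: simplify = FALSE forces apply() to always return a list,
# so each row's table keeps its value labels even when it has length 1
apply(df.2[4:8], 1, function(x) table(x, useNA = "ifany"), simplify = FALSE)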