Related
I have a data frame that looks like this:
df <- data.frame(a=c(1,1,2,NA,1),
b=c(2,3,3,NA,6),
c=c(1,NA,1,1,1),
d=c(2,2,3,1,1),
e=c(1,2,1,2,1))
head(df)
a b c d e
1 1 2 1 2 1
2 1 3 NA 2 2
3 2 3 1 3 1
4 NA NA 1 1 2
5 1 6 1 1 1
I would like to add character "Q" to each value, but NA. For example, the desired output will look like that:
a b c d e
1 Q1 Q2 Q1 Q2 Q1
2 Q1 Q3 <NA> Q2 Q2
3 Q2 Q3 Q1 Q3 Q1
4 <NA> <NA> Q1 Q1 Q2
5 Q1 Q6 Q1 Q1 Q1
I was trying the following line of code: df[] <- sub("^", "Q", df[]), but it does not produce the desired output.
Thank you for your help!
Olha
We can use ifelse with paste after looping over the columns with lapply
df[] <- lapply(df, function(x) ifelse(!is.na(x), paste0("Q", x), x))
-output
df
# a b c d e
#1 Q1 Q2 Q1 Q2 Q1
#2 Q1 Q3 <NA> Q2 Q2
#3 Q2 Q3 Q1 Q3 Q1
#4 <NA> <NA> Q1 Q1 Q2
#5 Q1 Q6 Q1 Q1 Q1
Or it can be also done with a logical index on both sides of the assignment
df[!is.na(df)] <- paste0("Q", df[!is.na(df)])
NOTE:
sub needs a vector as input for 'x' and not a data.frame. Here, it is more easier to use paste instead of sub
Or with dplyr
library(dplyr)
library(stringr)
df %>%
mutate(across(everything(),
~ case_when(!is.na(.) ~ str_c('Q', .), TRUE ~ NA_character_)))
I have the following data frame in R
df1 <- data.frame(
"ID" = c("A", "B", "A", "B"),
"Value" = c(1, 2, 5, 5),
"freq" = c(1, 3, 5, 3)
)
I wish to obtain the following data frame
Value freq ID
1 1 A
2 NA A
3 NA A
4 NA A
5 1 A
1 NA B
2 2 B
3 NA B
4 NA B
5 5 B
I have tried the following code
library(tidyverse)
df_new <- bind_cols(df1 %>%
select(Value, freq, ID) %>%
complete(., expand(.,
Value = min(df1$Value):max(df1$Value))),)
I am getting the following output
Value freq ID
<dbl> <dbl> <fct>
1 1 A
2 3 B
3 NA NA
4 NA NA
5 5 A
5 3 B
I request someone to help me.
Using tidyr::full_seq we can find the full version of Value but nesting(full_seq(Value,1) will return an error:
Error: by can't contain join column full_seq(Value, 1) which is missing from RHS
so we need to add a name, hence nesting(Value=full_seq(Value,1)
library(tidyr)
df1 %>% complete(ID, nesting(Value=full_seq(Value,1)))
# A tibble: 10 x 3
ID Value freq
<fct> <dbl> <dbl>
1 A 1. 1.
2 A 2. NA
3 A 3. NA
4 A 4. NA
5 A 5. 5.
6 B 1. NA
7 B 2. 3.
8 B 3. NA
9 B 4. NA
10 B 5. 3.
Using data.table:
library(data.table)
setDT(df1)
setkey(df1, ID, Value)
df1[CJ(ID = c("A", "B"), Value = 1:5)]
ID Value freq
1: A 1 1
2: A 2 NA
3: A 3 NA
4: A 4 NA
5: A 5 5
6: B 1 NA
7: B 2 3
8: B 3 NA
9: B 4 NA
10: B 5 3
Would the following approach work for you?
with(data = df1,
expr = {
data.frame(Value = rep(wrapr::seqi(min(Value), max(Value)), length(unique(ID))),
ID = unique(ID))
}) %>%
left_join(y = df1,
by = c("ID" = "ID", "Value" = "Value")) %>%
arrange(ID, Value)
Results
Value ID freq
1 1 A 1
2 2 A NA
3 3 A NA
4 4 A NA
5 5 A 5
6 1 B NA
7 2 B 3
8 3 B NA
9 4 B NA
10 5 B 3
Comments
If I'm following your example correctly, your ID group takes values from 1 to 5. If this is the case, my approach would be to generate that reading unique combinations of both from the original data frame.
The only variable that is carried from the original data frame is freq that may / may not be available for a given par ID-Value. I would join that variable via left_join (as you seem to like tidyverse)
In your example, you have freq variable with values 1,3,5 but then in the example you list 1,2,5? In my example, I took original freq and left join it. You can modify it further using normal dplyr pipeline, if this is something you intended to do.
I'm trying to replace ids for their respective values. The problem is that each id has a different value according to the previous column type, like this:
>df
type id
1 q1 1
2 q1 2
3 q2 1
4 q2 3
5 q3 1
6 q3 2
Here's the type ids with its value:
>q1
id value
1 1 yes
2 2 no
>q2
id value
1 1 one hour
2 2 two hours
3 3 more than two hours
>q3
id value
1 1 blue
2 2 yellow
I've tried something like this:
df <- left_join(subset(df, type %in% c("q1"), q1, by = "id"))
But it removes the other values.
I' like to know how to do a one liner solution (or kind of) because there are more than 20 vectors with types description.
Any ideias on how to do it?
This is the df i'm expecting:
>df
type id value
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow
You can join on more than one variable. The example df you give would actually make a suitable lookup table for this:
value_lookup <- data.frame(
type = c('q1', 'q1', 'q2', 'q2', 'q3', 'q3'),
id = c(1, 2, 1, 3, 1, 2),
value = c('yes', 'no', 'one hour', 'more than two hours', 'blue', 'yellow')
)
Then you just merge on both type and id:
df <- left_join(df, value_lookup, by = c('type', 'id'))
Usually when I need a lookup table like that I store it in a CSV rather than write it all out in the code, but do whatever suits you.
tempList = split(df, df$type)
do.call(rbind,
lapply(names(tempList), function(nm)
merge(tempList[[nm]], get(nm))))
# id type value
#1 1 q1 yes
#2 2 q1 no
#3 1 q2 one hour
#4 3 q2 more than two hours
#5 1 q3 blue
#6 2 q3 yellow
Get the values of 'q\d+' data.frame object identifiers in a list, bind them together into a single data.frame with bind_rows while creating the 'type' column as the identifier name and right_join with the dataset object 'df'
library(tidyverse)
mget(paste0("q", 1:3)) %>%
bind_rows(.id = 'type') %>%
right_join(df)
# type id value
#1 q1 1 yes
#2 q1 2 no
#3 q2 1 one hour
#4 q2 3 more than two hours
#5 q3 1 blue
#6 q3 2 yellow
You can do it by a series of left joins:
df1 = left_join(df, q1, by='id') %>% filter(type=="q1")
> df1
type id value
1 q1 1 yes
2 q1 2 no
df2 = left_join(df, q2, by='id') %>% filter(type=="q2")
> df2
type id value
1 q2 1 one hour
2 q2 3 more than two hours
df3 = left_join(df, q3, by='id') %>% filter(type=="q3")
> df3
type id value
1 q3 1 blue
2 q3 2 yellow
> rbind(df1,df2,df3)
type id value
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow
One liner would be:
rbind(left_join(df, q1, by='id') %>% filter(type=="q1"),
left_join(df, q2, by='id') %>% filter(type=="q2"),
left_join(df, q3, by='id') %>% filter(type=="q3"))
If you have more vectors then probably you should loop through the names of vector types and execute left_join and bind_rows one by one as:
vecQs = c(paste("q", seq(1,3,1),sep="")) #Types of variables q1, q2 ...
result = tibble()
#Execute left_join for the types and store it in result.
for(i in vecQs) {
result = bind_rows(result, left_join(df,eval(as.symbol(i)) , by='id') %>% filter(type==!!i))
}
This will give:
> result
# A tibble: 6 x 3
type id value
<chr> <int> <chr>
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow
I have a survey where some questions were not answered by some participants. Here is a simplified version of my data
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA),
Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
df
I would like to see which ID numbers did not answer which questions. The following code is very close to the output I want but identifies the subject by row number - I would like the subject identified by ID number
table(data.frame(which(is.na(df), arr.ind=TRUE)))
right now the output shows that rows 1,3,5 did not answer at least one question and it identifies the column with the missing value. I would like it show me the same thing but with ID numbers 12,14,16. It would be a bonus if you could have the column names (eg Q1,Q2,Q3) in the output as well instead of column number.
We can get the column names which are NA row-wise using apply and make it into a comma separated string and attach it to a new dataframe along with it's ID.
new_df <- data.frame(ID =df$ID, ques = apply(df, 1, function(x)
paste0(names(which(is.na(x))), collapse = ",")))
new_df
# ID ques
#1 12 Q3
#2 13
#3 14 Q2
#4 15
#5 16 Q1,Q2
Similar equivalent would be
new_df <- data.frame(ID = df$ID, ques = apply(is.na(df), 1, function(x)
paste0(names(which(x)), collapse = ",")))
In base R:
res <- df[!complete.cases(df),]
res[-1] <- as.numeric(is.na(res[-1]))
res
# ID Q1 Q2 Q3
# 12 12 0 0 1
# 14 14 0 1 0
# 16 16 1 1 0
If you wish to avoid apply type operations and continue from which(..., T), you can do something like the following:
tmp <- data.frame(which(is.na(df[, 2:4]), T))
# change to character
tmp[, 2] <- paste0('Q', tmp[, 2])
# gather column numbers together for each row number
tmp_split <- split(tmp[, 2], tmp[, 1])
# preallocate new column in df
df$missing <- vector('list', 5)
df$missing[as.numeric(names(tmp_split))] <- tmp_split
This produces
> df
ID Q1 Q2 Q3 missing
1 12 a a <NA> Q3
2 13 b a a NULL
3 14 a <NA> a Q2
4 15 a b a NULL
5 16 <NA> <NA> b Q1, Q2
You can convert data in long format using tidyr::gather. Filter for Answer not available. Finally, you can summarise your data using toString as:
library(tidyverse)
df %>% gather(Question, Ans, -ID) %>%
filter(is.na(Ans)) %>%
group_by(ID) %>%
summarise(NotAnswered = toString(Question))
# # A tibble: 3 x 2
# ID NotAnswered
# <int> <chr>
# 1 12 Q3
# 2 14 Q2
# 3 16 Q1, Q2
If, OP wants to include all IDs in result then, solution can be as:
df %>% gather(Question, Ans, -ID) %>%
group_by(ID) %>%
summarise(NoAnswered = toString(Question[is.na(Ans)])) %>%
as.data.frame()
# ID NoAnswered
# 1 12 Q3
# 2 13
# 3 14 Q2
# 4 15
# 5 16 Q1, Q2
How's this with tidyverse:
data:
library(tidyverse)
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA), Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
code:
x <- df %>% filter(is.na(Q1) | is.na(Q2) | is.na(Q3)) # filter out NAs
y <- cbind(x %>% select(ID),
x %>% select(Q1, Q2, Q3) %>% sapply(., function(x) ifelse(is.na(x), 1, 0))
) # in 1/0 format
output:
x:
ID Q1 Q2 Q3
1 12 a a <NA>
2 14 a <NA> a
3 16 <NA> <NA> b
y:
ID Q1 Q2 Q3
1 12 0 0 1
2 14 0 1 0
3 16 1 1 0
My attempt is no better than any already offered, but it's a fun problem, so here's mine. Because why not?:
library( magrittr )
df$ques <- df %>%
is.na() %>%
apply( 1, function(x) {
x %>%
which() %>%
names() %>%
paste0( collapse = "," )
} )
df
# ID Q1 Q2 Q3 ques
# 1 12 a a <NA> Q3
# 2 13 b a a
# 3 14 a <NA> a Q2
# 4 15 a b a
# 5 16 <NA> <NA> b Q1,Q2
Most of the answer comes from your question:
df[which(is.na(df), arr.ind=TRUE)[,1],]
# ID Q1 Q2 Q3
# 5 16 <NA> <NA> b
# 3 14 a <NA> a
# 5.1 16 <NA> <NA> b
# 1 12 a a <NA>
I am trying to match two datasets by nearest preceding date, by group.
So within a group, I would like to add the variables of a second dataset (d2) to that of the first (d1) when the date of the first is the nearest date on or before the date in the second. If two rows in the second dataset are matched with one row in the first I would like to add the larger of the values. (there will always be at least one date in d1 less then the date in d2, by group)
Here is an example, which hopefully makes it clearer
d1 = data.frame(id=c(1,1,1,2,2),
ref=as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2013-11-07", "2014-11-07" )))
d1
# id ref
# 1 1 2013-12-07
# 2 1 2014-12-07
# 3 1 2015-12-07
# 4 2 2013-11-07
# 5 2 2014-11-07
d2 = data.frame(id=c(1,1,2),
date=as.Date(c("2014-05-07","2014-12-05", "2015-11-05")),
x1 = factor(c(1,2,2), ordered = TRUE),
x2 = factor(c(2, NA ,2), ordered=TRUE))
d2
# id date x1 x2
# 1 1 2014-05-07 1 2
# 2 1 2014-12-05 2 <NA>
# 3 2 2015-11-05 2 2
With the expected outcome
output = data.frame(id=c(1,1,1,2,2),
ref=as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2013-11-07", "2014-11-07" )),
x1 = c(2, NA, NA, NA, 2),
x2 = c(2, NA, NA, NA, 2))
output
# id ref x1 x2
# 1 1 2013-12-07 2 2
# 2 1 2014-12-07 NA NA
# 3 1 2015-12-07 NA NA
# 4 2 2013-11-07 NA NA
# 5 2 2014-11-07 2 2
So for example, the first two observations of d2, id=1, with dates "2014-05-07","2014-12-05", are matched to the earlier date "2013-12-07" in d1. As there are two rows matched to one row in d1,
then the highest level is selected.
I could do this in base R by looping the following calculations through
each group but I was hoping for something more efficient.
I would love to see a data.table approach (but I am limited to R v3.1 and data.table v1.9.4). Thanks
real dataset:
d1: rows 1M / 100K groups
d2: rows 11K / 4K groups
# for one group
x = d1[d1$id==1, ]
y = d2[d2$id==1, ]
id = apply(outer(x$ref, y$date, "-"), 2, which.min)
temp = cbind(y, ref=x$ref[id])
# aggregate variables by ref
temp = merge(aggregate(x1 ~ ref, data=temp, max),
aggregate(x2 ~ ref, data=temp, max)
)
merge(x, temp, all=T)
ps: I had looked at How to match by nearest date from two data frames? and Join data.table on exact date or if not the case on the nearest less than date with no success.
You can do this using dplyr:
d2$ind <- 0
library(dplyr)
out <- d1 %>% full_join(d2,by=c("id","ref"="date")) %>%
arrange(id,ref) %>%
mutate(ind=cumsum(ifelse(is.na(ind),1,ind))) %>%
group_by(ind) %>%
summarise(ref=min(ref),x1=max(x1,na.rm=TRUE),x2=max(x2,na.rm=TRUE))
### A tibble: 5 x 4
## ind ref x1 x2
## <dbl> <date> <fctr> <fctr>
##1 1 2013-12-07 2 2
##2 2 2014-12-07 NA NA
##3 3 2015-12-07 NA NA
##4 4 2013-11-07 NA NA
##5 5 2014-11-07 2 2
We first add a column of indicators to d2 and set those to zero. Then, we perform a full outer join between d1 and d2. Those rows in d1 will have ind of NA. We sort by id and ref (i.e., the date), and we replace the NA entries of ind with 1 and perform a cumsum. This results in:
id ref x1 x2 ind
1 1 2013-12-07 <NA> <NA> 1
2 1 2014-05-07 1 2 1
3 1 2014-12-05 2 <NA> 1
4 1 2014-12-07 <NA> <NA> 2
5 1 2015-12-07 <NA> <NA> 3
6 2 2013-11-07 <NA> <NA> 4
7 2 2014-11-07 <NA> <NA> 5
8 2 2015-11-05 2 2 5
From this we can easily see that we can group by ind and summarise appropriately to get your result.