I have the following data frame in R
df1 <- data.frame(
"ID" = c("A", "B", "A", "B"),
"Value" = c(1, 2, 5, 5),
"freq" = c(1, 3, 5, 3)
)
I wish to obtain the following data frame
Value freq ID
1 1 A
2 NA A
3 NA A
4 NA A
5 1 A
1 NA B
2 2 B
3 NA B
4 NA B
5 5 B
I have tried the following code
library(tidyverse)
df_new <- bind_cols(df1 %>%
select(Value, freq, ID) %>%
complete(., expand(.,
Value = min(df1$Value):max(df1$Value))),)
I am getting the following output
Value freq ID
<dbl> <dbl> <fct>
1 1 A
2 3 B
3 NA NA
4 NA NA
5 5 A
5 3 B
I request someone to help me.
Using tidyr::full_seq() we can generate the full sequence of Value, but an unnamed nesting(full_seq(Value, 1)) returns an error:
Error: by can't contain join column full_seq(Value, 1) which is missing from RHS
so the argument needs a name, hence nesting(Value = full_seq(Value, 1)):
library(tidyr)
df1 %>% complete(ID, nesting(Value=full_seq(Value,1)))
# A tibble: 10 x 3
ID Value freq
<fct> <dbl> <dbl>
1 A 1. 1.
2 A 2. NA
3 A 3. NA
4 A 4. NA
5 A 5. 5.
6 B 1. NA
7 B 2. 3.
8 B 3. NA
9 B 4. NA
10 B 5. 3.
Using data.table:
library(data.table)
setDT(df1)
setkey(df1, ID, Value)
df1[CJ(ID = c("A", "B"), Value = 1:5)]
ID Value freq
1: A 1 1
2: A 2 NA
3: A 3 NA
4: A 4 NA
5: A 5 5
6: B 1 NA
7: B 2 3
8: B 3 NA
9: B 4 NA
10: B 5 3
Would the following approach work for you?
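library(dplyr)

with(data = df1,
     expr = {
       vals <- wrapr::seqi(min(Value), max(Value))
       ids <- unique(ID)
       # each = pairs every ID with every Value; plain recycling of ID only
       # covers all pairs when the two lengths share no common factor
       data.frame(Value = rep(vals, times = length(ids)),
                  ID = rep(ids, each = length(vals)))
     }) %>%
  left_join(y = df1,
            by = c("ID" = "ID", "Value" = "Value")) %>%
  arrange(ID, Value)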
Results
Value ID freq
1 1 A 1
2 2 A NA
3 3 A NA
4 4 A NA
5 5 A 5
6 1 B NA
7 2 B 3
8 3 B NA
9 4 B NA
10 5 B 3
Comments
If I'm following your example correctly, Value runs from 1 to 5 within each ID group. If so, my approach is to generate every ID-Value combination from the unique values in the original data frame.
The only variable carried over from the original data frame is freq, which may or may not be available for a given ID-Value pair. I would join that variable via left_join (as you seem to like the tidyverse).
In your example, the freq variable has values 1,3,5, but the desired output lists 1,2,5? In my example, I took the original freq and left-joined it. You can modify it further with a normal dplyr pipeline, if that is what you intended.
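The two steps described above (build every ID-Value combination, then left-join freq) can also be sketched in base R. This is only an illustration, assuming Value should run over whole numbers from its minimum to its maximum; the names grid and out are made up for the example.

```r
df1 <- data.frame(
  ID = c("A", "B", "A", "B"),
  Value = c(1, 2, 5, 5),
  freq = c(1, 3, 5, 3)
)

# Step 1: every combination of ID and Value (hypothetical helper object)
grid <- expand.grid(
  ID = unique(df1$ID),
  Value = seq(min(df1$Value), max(df1$Value)),
  stringsAsFactors = FALSE
)

# Step 2: left join; all.x = TRUE keeps combinations that have no freq
out <- merge(grid, df1, by = c("ID", "Value"), all.x = TRUE)

# Sort for readability and reset the row names
out <- out[order(out$ID, out$Value), ]
rownames(out) <- NULL
out
```

Combinations that do not occur in df1 end up with NA in freq, the same as in the left_join version.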
Let's say I have the following data frame:
dt <- data.frame(id= c(1),
parameter= c("a","b","c"),
start_day = c(1,8,4),
end_day = c(16,NA,30))
I need to combine the start_day and end_day columns (let's call the new column day) while preserving all the other columns. I also need to create another column that indicates whether each row shows start_day or end_day. To clarify, I am looking to create the following data frame
I am creating the above data frame using the following code:
dt1 <- subset(dt, select = -c(end_day))
dt1 <- dt1 %>% rename(day = start_day)
dt1$start <- 1
dt2 <- subset(dt, select = -c(start_day))
dt2 <- dt2 %>% rename(day = end_day)
dt2$end <- 1
dt <- bind_rows(dt1, dt2)
dt <- dt[order(dt$id, dt$parameter),]
Although my code works, I am not happy with my solution. I am certain that there is a better and cleaner way to do this. I would appreciate any input on better ways of tackling this problem.
(tidyr::pivot_longer(dt, cols = c(start_day, end_day), values_to = "day")
|> dplyr::mutate(start = ifelse(name == "start_day", 1, NA),
end = ifelse(name == "end_day", 1, NA))
)
Result:
# A tibble: 6 × 6
id parameter name day start end
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 a start_day 1 1 NA
2 1 a end_day 16 NA 1
3 1 b start_day 8 1 NA
4 1 b end_day NA NA 1
5 1 c start_day 4 1 NA
6 1 c end_day 30 NA 1
You could get rid of the name column, but maybe it would be more useful than your new start/end columns?
Using base R (faster than data.table up to ~300 rows; faster than tidyr up to ~1k rows):
cbind(dt[1:2], day = c(dt$start_day,dt$end_day)) |>
(\(x) x[order(x$id, x$parameter),])() |>
(`[[<-`)("start", value = c(1, NA)) |>
(`[[<-`)("end", value = c(NA, 1))
id parameter day start end
1 1 a 1 1 NA
4 1 a 16 NA 1
2 1 b 8 1 NA
5 1 b NA NA 1
3 1 c 4 1 NA
6 1 c 30 NA 1
Using the data.table package (faster than tidyr up to ~500k rows):
library(data.table)
dt <- as.data.table(dt)
dt[,.(day = c(start_day, end_day),
start = rep(c(1, NA), .N),
end = rep(c(NA, 1), .N)),
by = .(id, parameter)]
id parameter day start end
1: 1 a 1 1 NA
2: 1 a 16 NA 1
3: 1 b 8 1 NA
4: 1 b NA NA 1
5: 1 c 4 1 NA
6: 1 c 30 NA 1
I have a dataset where the first line is the header, the second line is some explanatory data, and rows 3 onward are numbers. When I read in the data with this second explanatory row, the columns are automatically converted to factors (or to character if I set stringsAsFactors = FALSE).
What I would like to do is remove the second row, and have a function that goes through all columns and detects if they're just numbers and change the class type to the appropriate type. Is there something like that available? Perhaps using dplyr? I have many columns so I'd like to avoid manually reassigning them.
A simplified example below
> df <- data.frame(A = c("col 1",1,2,3,4,5), B = c("col 2",1,2,3,4,5))
> df
A B
1 col 1 col 2
2 1 1
3 2 2
4 3 3
5 4 4
6 5 5
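If all the values after the explanatory row are numbers, we can do this:
library(tidyverse)
# as.character() first, so factor columns convert by label rather than by level code
df[-1, ] %>% mutate_all(~ as.numeric(as.character(.)))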
Depending on the task, it can also be done this way (convert a column if any of its values is numeric):
df <- tibble(A = c("col 1",1,2,3,4,5),
B = c("col 2",1,2,3,4,5),
C = c(letters[1:5], 6))
df[-1, ] %>% mutate_if(~ any(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 6
or this way (convert a column only if all of its values are numeric):
df[-1, ] %>% mutate_if(~ all(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <chr>
1 1 1 b
2 2 2 c
3 3 3 d
4 4 4 e
5 5 5 6
In base R, we can drop the explanatory row and then convert every column:
df <- df[-1, ]
df[] <- lapply(df, function(x) as.numeric(as.character(x)))
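If the columns are a mix of numbers and text, base R's type.convert() can pick a type per column. A minimal sketch, assuming the data were read in as character (stringsAsFactors = FALSE); the text column C and the name clean are additions made up for this example.

```r
# Toy data as in the question, plus a text column C (an assumption added
# here to show a column that should stay as text).
df <- data.frame(A = c("col 1", 1, 2, 3, 4, 5),
                 B = c("col 2", 1, 2, 3, 4, 5),
                 C = c("col 3", "a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)

# Drop the explanatory row, then let type.convert() choose a type per column.
# as.is = TRUE keeps non-numeric columns as character instead of factor.
clean <- type.convert(df[-1, ], as.is = TRUE)
str(clean)
```

Here A and B come back as integer columns and C stays character, so no column has to be reassigned by hand.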
I have a data frame like so:
ID = c(1,1,1,2,2,2,3,3,3,4,4,4,4)
VAR_1 = c(2,4,6,1,7,9,4,4,3,1,7,4,0)
VAR_2 = c(NA,NA,NA,NA,NA,20190101,20190101,20190101,NA,20190101,NA,NA,NA)
df2 = data.frame(ID,VAR_1,VAR_2)
And I would like to subset from this data frame all the rows for every group (ID) ONLY if the first observation by group in VAR_2 has a value. In this simple case, the new subset should be all the rows from IDs 3 and 4.
To represent this better:
df df_subset
ID VAR_1 VAR_2 ID VAR_1 VAR_2
1 2 NA 3 4 20190101
1 4 NA 3 4 20190101
1 6 NA 3 3 NA
2 1 NA 4 1 20190101
2 7 NA 4 7 NA
2 9 20190101 4 4 NA
3 4 20190101 4 0 NA
3 4 20190101
3 3 NA
4 1 20190101
4 7 NA
4 4 NA
4 0 NA
I managed to do this in several steps (I subset the original taking only the first observation by group, assign VAR_1 a special value, re-merge and then finally filter by the special value), but I would like to know if there's a simpler, more elegant and probably more efficient way. I don't need VAR_1, so that can be changed if needed to provide a faster solution.
Any help would be appreciated.
Using dplyr, we can group_by ID and select groups only if first value in each group is non-NA.
library(dplyr)
df2 %>%
group_by(ID) %>%
filter(!is.na(VAR_2[1L]))
# ID VAR_1 VAR_2
# <dbl> <dbl> <dbl>
#1 3 4 20190101
#2 3 4 20190101
#3 3 3 NA
#4 4 1 20190101
#5 4 7 NA
#6 4 4 NA
#7 4 0 NA
Some variations to extract the first value could be (thanks to @tmfmnk)
df2 %>% group_by(ID) %>% filter(!is.na(first(VAR_2)))
OR
df2 %>% group_by(ID) %>% filter(!is.na(nth(VAR_2, 1)))
The same using base R ave:
df2[with(df2, ave(!is.na(VAR_2), ID, FUN = function(x) x[1L])), ]
or a slightly more complicated one with split and subset:
subset(df2, ID %in% names(na.omit(sapply(split(df2$VAR_2, df2$ID), head, 1))))
I am trying to run a robust ANOVA in R. This requires my two variables to be in a very specific format. Basically, I need to unstack two columns in my current data frame and build a data frame of outcome frequencies by predictor (a categorical variable). Normally the unstack() function would do this automatically, i.e.
newDataFrame <- unstack(oldDataFrame, scores ~ columns)
However, the list returned has unequal rows for each category. Here is an example:
$A
[1] 2 4 2 3 3
$B
[1] 3 3
$C
[1] 5
$D
[1] 4 4 3
A, B, C and D are my categories, and the numbers are the outcome. The outcome has to be 1, 2, 3, 4, 5 or 6.
What I am working towards is the category as the 'header' and the outcome as a reference column, with the frequencies as the other columns, such that the dataframe looks like this:
A B C D
1 NA NA NA NA
2 2 NA NA NA
3 2 2 NA 1
4 1 NA NA 2
5 NA NA 1 NA
6 NA NA NA NA
What I have tried:
On another SO post, I found this -
library(stringi)
res <- as.data.frame(t(stri_list2matrix(myUnstackedList)))
colnames(res) <- unique(unlist(sapply(myUnstackedList, names)))
Outcome:
res
1 2 4 2 3 3
2 3 3 <NA> <NA> <NA>
3 5 <NA> <NA> <NA> <NA>
4 4 4 3 <NA> <NA>
Note that the categories A, B, C, D have been changed to 1, 2, 3, 4
Also tried this (another SO post):
df <- as.data.frame(plyr::ldply(myUnstackedList, rbind))
Outcome:
df
outcome group score
2 A 2
3 A 2
4 A 1
3 B 2
etc
Any tips?
This gets you most of the way to your answer:
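library(dplyr)
library(tidyr)

test <- list(A=c(2,4,2,3,3),
             B=c(3,3),
             C=c(5),
             D=c(4,4,3))
# one two-column data frame (ID, Value) per category
test <- lapply(seq_along(test), function(i){
  x <- data.frame(names(test)[i], test[i],
                  stringsAsFactors=FALSE)
  names(x) <- c("ID", "Value")
  x})
# count each ID/Value pair, then spread the categories into columns
test <- bind_rows(test) %>% table %>% as.data.frame
test <- spread(test, key=ID, value=Freq)
# zero counts become NA, as in the desired output
replace(test, test==0, NA)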
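What is still missing from that result is a row for every possible outcome from 1 to 6, including the ones that never occur. One way to get there, sketched here in base R, is to tabulate each category against a factor with fixed levels. The names lst and freq are made up for this example, and the 1:6 range is taken from the question.

```r
lst <- list(A = c(2, 4, 2, 3, 3),
            B = c(3, 3),
            C = c(5),
            D = c(4, 4, 3))

# table() on a factor with fixed levels counts every outcome 1..6,
# so all categories return a vector of the same length
freq <- sapply(lst, function(x) table(factor(x, levels = 1:6)))

# Outcomes that never occur become NA instead of 0
freq[freq == 0] <- NA

res <- as.data.frame(freq)
res
```

The row names of res are the outcomes 1 to 6 and the columns are the categories, which matches the layout asked for in the question.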
I'm not sure what the issue was with your previous dplyr attempt; however, I offer:
library(tidyr)
library(dplyr)
df <- tibble(
outcome = c(1:5, 1:2, 1, 1:3),
group = c(rep("A", 5), rep("B", 2), "C", rep("D", 3)),
score = c(2, 4, 2, 3, 3, 3, 3, 5, 4, 4, 3)
)
df %>%
group_by(outcome) %>%
spread(group, score) %>%
ungroup() %>%
select(-outcome)
# # A tibble: 5 x 4
# A B C D
# * <dbl> <dbl> <dbl> <dbl>
# 1 2 3 5 4
# 2 4 3 NA 4
# 3 2 NA NA 3
# 4 3 NA NA NA
# 5 3 NA NA NA
I have a data.frame that looks like:
a b c d
1 2 NA 1
NA 2 2 1
3 2 NA 1
NA NA 20 2
And I want to replace the NAs with c / d (and delete c and d) to look like:
a b
1 2
2 2
3 2
10 10
Some background: d is a sum of NAs in that particular row.
I don't know the names of the columns, so I tried a few variations of things like:
df2[, 1:(length(colnames(df2)) - 2)][is.na(df2[, 1:(length(colnames(df2)) - 2)])] = df2$c / df2$d
but got:
Error in `[<-.data.frame`(`*tmp*`, is.na(df2[, 1:(length(colnames(df2)) - :
'value' is the wrong length
Here's a way you can do this with dplyr.
library(dplyr)
df <- tibble(
a = c(1, NA, 3, NA),
b = c(2, 2, 2, NA),
c = c(NA, 2, NA, 20L),
d = c(1, 1, 1, 2)
)
df %>%
# funs() is deprecated in current dplyr; a formula lambda does the same job
mutate_at(vars(-c, -d), ~ if_else(is.na(.), c / d, .)) %>%
select(-c, -d)
#> # A tibble: 4 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
#> 3 3 2
#> 4 10 10
You can specify the variables in the vars() call using any of the functions from ?dplyr::select_helpers. These could be regex, a simple vector of names, or you can just use all columns except c and d (as I've changed this example to now).
library(data.table)
data<-fread("a b c d
1 2 NA 1
NA 2 2 1
3 2 NA 1
NA NA 20 2")
names_to_loop<-names(data)
names_to_loop<-names_to_loop[names_to_loop!="c"&names_to_loop!="d"]
for (ntl in names_to_loop){
set(data,j=ntl,value=ifelse(is.na(data[[ntl]]),data[["c"]]/data[["d"]],data[[ntl]]))
}
data[,c:=NULL]
data[,d:=NULL]
> data
a b
1: 1 2
2: 2 2
3: 3 2
4: 10 10
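For comparison, the same fill can be sketched in base R without naming any columns. The assumption, taken from the question's own attempt, is that the two helper columns (c and d here) are always the last two; the names n, target, fill and res are made up for this example.

```r
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c(2, 2, 2, NA),
                 c = c(NA, 2, NA, 20),
                 d = c(1, 1, 1, 2))

n <- ncol(df)
target <- seq_len(n - 2)         # every column except the last two
fill <- df[[n - 1]] / df[[n]]    # one replacement value per row

# Replace column by column, so each NA gets the value from its own row
for (j in target) {
  miss <- is.na(df[[j]])
  df[[j]][miss] <- fill[miss]
}

res <- df[target]
res
```

Working one column at a time keeps the replacement vector aligned with the rows, which is what the single matrix-style assignment in the question could not do ('value' is the wrong length).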