I am trying to fill a new column with appropriate values from a list using dplyr. I tried to come up with a simple reproducible example, which can be found below. In short, I want to add a column "Param" to a dataframe, based on the values of the existing columns. The matching values are found in a separate list. I've tried functions such as ifelse() and switch(), but I cannot make it work. Any tips on how this can be achieved?
Thank you in advance!
library(dplyr)
# Dataframe to start with
df <- as.data.frame(matrix(data = c(rep("A", times = 3),
rep("B", times = 3),
rep(1:3, times = 2)), ncol = 2))
colnames(df) <- c("Method", "Type")
df
#> Method Type
#> 1 A 1
#> 2 A 2
#> 3 A 3
#> 4 B 1
#> 5 B 2
#> 6 B 3
# Desired dataframe
desired <- cbind(df, Param = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4))
desired
#> Method Type Param
#> 1 A 1 0.9
#> 2 A 2 0.8
#> 3 A 3 0.7
#> 4 B 1 0.6
#> 5 B 2 0.5
#> 6 B 3 0.4
# Failed attempt
param <- list("A" = c("1" = 0.9, "2" = 0.8, "3" = 0.7),
"B" = c("1" = 0.6, "2" = 0.5, "3" = 0.4))
param
#> $A
#> 1 2 3
#> 0.9 0.8 0.7
#>
#> $B
#> 1 2 3
#> 0.6 0.5 0.4
df %>%
mutate(Param = ifelse(.$Method == "A", param$A[[.$Type]],
ifelse(.$Method == "B", param$B[[.$Type]], NA)))
#> Error: Problem with `mutate()` column `Param`.
#> ℹ `Param = ifelse(...)`.
#> x attempt to select more than one element in vectorIndex
You can unlist your list and just add it to your df. Note that this relies on the rows of df being in the same order as the values in param.
df$Param <- unlist(param)
Method Type Param
1 A 1 0.9
2 A 2 0.8
3 A 3 0.7
4 B 1 0.6
5 B 2 0.5
6 B 3 0.4
As mentioned by @dario, keeping the matching data in a dataframe would be easier.
library(dplyr)
library(tidyr)
df %>%
nest(data = Type) %>%
left_join(stack(param) %>% nest(data1 = values), by = c('Method' = 'ind')) %>%
unnest(c(data, data1))
# Method Type values
# <chr> <chr> <dbl>
#1 A 1 0.9
#2 A 2 0.8
#3 A 3 0.7
#4 B 1 0.6
#5 B 2 0.5
#6 B 3 0.4
Sure, this could be cleaner, but it will get the job done.
Option 1:
df %>%
mutate(
Param = unlist(param)[
match(
paste0(
df$Method,
df$Type
),
names(
do.call(
c,
lapply(
param,
names
)
)
)
)
]
)
Option 2 (cleaner version):
df %>%
type.convert(as.is = TRUE) %>% # as.is = TRUE keeps Method as character and avoids a warning in R >= 4.0
left_join(
do.call(cbind, param) %>%
data.frame() %>%
mutate(Type = as.integer(row.names(.))) %>%
pivot_longer(!Type, names_to = "Method", values_to = "Param"),
by = c("Type", "Method")
)
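For what it's worth, the original attempt fails because [[ extracts exactly one element, while .$Type is the whole column. Here is a minimal base R sketch that looks up one value per row with mapply(). It assumes every Method/Type combination exists as a name in param, and the sample data is rebuilt here so the snippet runs on its own:

```r
# Sample data from the question
df <- data.frame(Method = rep(c("A", "B"), each = 3),
                 Type = rep(1:3, times = 2))
param <- list(A = c("1" = 0.9, "2" = 0.8, "3" = 0.7),
              B = c("1" = 0.6, "2" = 0.5, "3" = 0.4))

# For each row, pick param[[Method]][[Type]]; Type is converted to
# character so that the lookup goes by name rather than by position
df$Param <- mapply(function(m, t) param[[m]][[t]],
                   as.character(df$Method), as.character(df$Type),
                   USE.NAMES = FALSE)
df
```

Unlike the unlist() shortcut, this does not depend on the row order of df.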
Related
I have this table (inputdf):
Number  Value
1       0.2
1       0.3
1       0.4
2       0.2
2       0.7
3       0.1
and I want to obtain this (outputdf):
Number1  Number2  Number3
0.2      0.2      0.1
0.3      0.7      NA
0.4      NA       NA
I have tried iterating with a for loop through the numbers in column 1 and subsetting the dataframe by that number, but I have trouble appending the result to an output dataframe:
inputdf <- read.table("input.txt", sep="\t", header = TRUE)
outputdf <- data.frame()
i=1
total=3 ###user has to modify it
for(i in seq(1:total)) {
cat("Collecting values for number", i, "\n")
values <- subset(input, Number == i, select=c(Value))
cbind(outputdf, NewColumn= values, )
names(outputdf)[names(outputdf) == "NewColumn"] <- paste0("Number", i)
}
Any help or hint would be very welcome. Thanks in advance!
In the tidyverse, you can create an id for each element of the groups and then use tidyr::pivot_wider:
library(tidyverse)
dat %>%
group_by(Number) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = Number, names_prefix = "Number", values_from = "Value")
# A tibble: 3 × 4
id Number1 Number2 Number3
<int> <dbl> <dbl> <dbl>
1 1 0.2 0.2 0.1
2 2 0.3 0.7 NA
3 3 0.4 NA NA
In base R, the same idea applies: create the id column and then reshape to wide:
transform(dat, id = with(dat, ave(rep(1, nrow(dat)), Number, FUN = seq_along))) |>
reshape(direction = "wide", timevar = "Number")
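The answers above use dat without defining it; I am assuming it is the inputdf from the question. Under that assumption, here is another base R sketch that skips the id column: split the values by Number and pad each group with NA up to the longest group.

```r
# Assumed stand-in for inputdf / dat from the question
dat <- data.frame(Number = c(1, 1, 1, 2, 2, 3),
                  Value  = c(0.2, 0.3, 0.4, 0.2, 0.7, 0.1))

# One numeric vector per Number
groups <- split(dat$Value, dat$Number)

# `length<-` extends a vector to length n, filling with NA
n <- max(lengths(groups))
outputdf <- as.data.frame(lapply(groups, `length<-`, n))
names(outputdf) <- paste0("Number", names(groups))
outputdf
```

This also removes the need for the user to set the total number of groups by hand.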
I have 5 data frames like the ones below:
df_mon <- data.frame(mon = as.factor(c(6, 7, 8, 9, 10)),
number = c(1.11, 1.02, 0.95, 0.92, 0.72))
df_year <- data.frame(year = as.factor(c(1, 2)),
number = c(1.61, 0.4))
df_cat <- data.frame(cat = c("A", "B", "C"),
number = c(1.11, 1.02, 0.44))
df_bin <- data.frame(bin = as.factor(c(1, 2)),
number = c(1.42, 0.56))
df_cat2 <- data.frame(cat2 = c("A", "B", "C", "D", "AA"),
number = c(0.11, 1.22, 1.34, 0.88, 0.75))
I need to multiply all the numbers in the 'number' columns from each of these data frames with each other. So, look at all the possible combinations of the first column in each data set, then take the corresponding numbers and multiply them. The final results data frame should look something like this (the first 3 are done):
results_df <- data.frame(combi = c("mon6_year1_catA_bin1_cat2A", "mon6_year1_catA_bin1_cat2B", "mon6_year1_catA_bin1_cat2C"),
final_number = c(1.11*1.61*1.11*1.42*0.11, 1.11*1.61*1.11*1.42*1.22, 1.11*1.61*1.11*1.42*1.34))
We can see that the first column in results_df shows which combination was used to calculate final_number. In the first example, the 'number' value for category 6 (1.11) from df_mon is taken and multiplied with the following:
category 1 (1.61) from df_year
category A (1.11) from df_cat
category 1 (1.42) from df_bin
category A (0.11) from df_cat2
The answer for this combination is 1.11 x 1.61 x 1.11 x 1.42 x 0.11 ≈ 0.3099.
The 2nd row shows the next possible combination and so on.
I'm not sure how to achieve this, so any help will be greatly appreciated!
Maybe you can try expand.grid like below
lst <- list(df_mon, df_year, df_cat, df_bin, df_cat2)
results_df <- data.frame(
combi = do.call(
paste,
c(do.call(
expand.grid,
lapply(lst, function(v) paste0(names(v[1]), v[, 1]))
), sep = "_")
),
final_number = Reduce(
"*",
do.call(
expand.grid,
lapply(lst, `[[`, 2)
)
)
)
which gives
> head(results_df)
combi final_number
1 mon6_year1_catA_bin1_cat2A 0.30985097
2 mon7_year1_catA_bin1_cat2A 0.28472792
3 mon8_year1_catA_bin1_cat2A 0.26518777
4 mon9_year1_catA_bin1_cat2A 0.25681342
5 mon10_year1_catA_bin1_cat2A 0.20098441
6 mon6_year2_catA_bin1_cat2A 0.07698161
Here is an approach using dplyr and tidyr.
df_all <- df_mon %>%
full_join(df_year, by = character()) %>% # by = character() requests a cross join; dplyr >= 1.1.0 provides cross_join() for this
full_join(df_cat, by = character()) %>%
full_join(df_bin, by = character()) %>%
full_join(df_cat2, by = character()) %>%
pivot_longer(cols = c(-mon, -year, -cat, -bin, -cat2)) %>%
group_by(mon, year, cat, bin, cat2) %>%
summarize(final_number = prod(value), .groups = "keep")
# A tibble: 300 x 6
# Groups: mon, year, cat, bin, cat2 [300]
mon year cat bin cat2 final_number
<fct> <fct> <chr> <fct> <chr> <dbl>
1 6 1 A 1 A 0.310
2 6 1 A 1 AA 2.11
3 6 1 A 1 B 3.44
4 6 1 A 1 C 3.77
5 6 1 A 1 D 2.48
6 6 1 A 2 A 0.122
7 6 1 A 2 AA 0.833
8 6 1 A 2 B 1.36
9 6 1 A 2 C 1.49
10 6 1 A 2 D 0.978
# ... with 290 more rows
It keeps the variables from the other data.frames intact as columns for further analysis, but you could create your combi column with a little paste().
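As a rough sketch of that last step, the combi column can be pasted together from the key columns. The small df_all below is a hypothetical stand-in for the first two combinations only, so that the snippet runs without the joins:

```r
library(dplyr)

# Hypothetical stand-in for two rows of df_all (products written out by hand)
df_all <- data.frame(
  mon = factor(c(6, 6)), year = factor(c(1, 1)),
  cat = c("A", "A"), bin = factor(c(1, 1)), cat2 = c("A", "B"),
  final_number = c(1.11 * 1.61 * 1.11 * 1.42 * 0.11,
                   1.11 * 1.61 * 1.11 * 1.42 * 1.22)
)

res <- df_all %>%
  ungroup() %>%  # drop the grouping left over from summarize()
  mutate(combi = paste0("mon", mon, "_year", year, "_cat", cat,
                        "_bin", bin, "_cat2", cat2)) %>%
  select(combi, final_number)
res
```

paste0() converts the factor columns to their labels, so mon comes out as "6" rather than as its internal level code.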
When using lapply() over a vector, each element of the resulting list doesn't have a name, but only an index:
library(dplyr)
vector = c("df1", "df2")
df1 = data.frame(a = rnorm(5), b = rnorm(5, sd = 1.1))
df2 = data.frame(a = rnorm(5), b = rnorm(5, sd = 1.1))
lapply(vector, function(x){
x = get(x) %>%
mutate(c = a+b)
})
#> [[1]]
#> a b c
#> 1 -0.4098768 1.6712810 1.2614041
#> 2 0.7101722 -0.1025184 0.6076538
#> 3 -0.6696859 0.5690932 -0.1005928
#> 4 1.1642214 -0.4660378 0.6981836
#> 5 -0.5158280 1.4574039 0.9415759
#>
#> [[2]]
#> a b c
#> 1 0.91047848 -1.308707 -0.3982281
#> 2 1.87336493 -1.429567 0.4437977
#> 3 0.54171333 1.849589 2.3913028
#> 4 -0.02978158 2.500763 2.4709817
#> 5 1.49926589 1.602463 3.1017284
It does have names when applying over a named list:
list = list(
df1 = data.frame(a = rnorm(5), b = rnorm(5, sd = 1.1)),
df2 = data.frame(a = rnorm(5), b = rnorm(5, sd = 1.1))
)
lapply(list, function(x){
x = x %>%
mutate(c = a+b)
})
#> $df1
#> a b c
#> 1 0.8228400 -2.5232496 -1.70040963
#> 2 0.3890213 -0.4349408 -0.04591949
#> 3 0.5084719 1.4089123 1.91738415
#> 4 0.2533964 -0.7379615 -0.48456516
#> 5 -0.2474338 1.0520906 0.80465685
#>
#> $df2
#> a b c
#> 1 0.1376350 -1.32304077 -1.1854058
#> 2 0.1314702 1.14775210 1.2792223
#> 3 0.9757047 -1.24806193 -0.2723573
#> 4 -0.5118045 0.09277009 -0.4190344
#> 5 -0.1631715 -0.47573087 -0.6389024
Is there a simple way to use the vector elements as the resulting list names?
Instead of using get on each name separately, you can use mget to get all the dataframes in a list with their names:
lapply(mget(vector), function(x) transform(x, c = a + b))
#$df1
# a b c
#1 -0.60421251 2.4792735 1.8750610
#2 0.06163947 2.0295196 2.0911590
#3 -0.56318825 2.1496891 1.5865009
#4 -2.46292843 1.1641211 -1.2988073
#5 -1.05692446 0.4365812 -0.6203432
#$df2
# a b c
#1 -0.33388039 0.6690467 0.3351663
#2 0.83637236 1.3321715 2.1685439
#3 0.05839826 0.1017032 0.1601015
#4 -0.20686296 0.8667050 0.6598420
#5 0.52682053 0.4629632 0.9897837
You could name the vector before calling lapply:
vector = c("df1", "df2")
names(vector) <- vector # <-- here
df1 = data.frame(a = rnorm(5), b = rnorm(5, sd = 1.1))
df2 = data.frame(a = rnorm(5), b = rnorm(5, sd = 1.1))
lapply(vector, function(x){
x = get(x) %>%
mutate(c = a+b)
})
# $df1
# a b c
# 1 -0.36671838 0.8733203 0.5066019
# 2 -0.05029296 -0.8823471 -0.9326401
# 3 0.54252815 -0.2087211 0.3338071
# 4 -1.16789527 0.2598863 -0.9080090
# 5 -0.80664672 -0.4968422 -1.3034889
#
# $df2
# a b c
# 1 0.9042845 1.2978663 2.2021508
# 2 -0.3848533 -0.4563623 -0.8412156
# 3 -1.1681873 1.3224087 0.1542215
# 4 1.4872564 -2.0281272 -0.5408708
# 5 -0.2717229 -0.3780464 -0.6497694
We can also use map
library(purrr)
library(dplyr)
map(mget(vector), ~ .x %>%
mutate(c = a + b))
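One more base R option, as a small sketch: sapply() with simplify = FALSE behaves like lapply() but, because USE.NAMES defaults to TRUE, it uses the elements of a character vector as the names of the result. The data frames below use fixed values instead of rnorm() so the result is reproducible:

```r
df1 <- data.frame(a = 1:3, b = c(0.5, 1.5, 2.5))
df2 <- data.frame(a = 4:6, b = c(1, 2, 3))
vector <- c("df1", "df2")

# simplify = FALSE keeps the result as a list; the names come from `vector`
res <- sapply(vector, function(x) transform(get(x), c = a + b),
              simplify = FALSE)
names(res)
```

This only applies to character vectors; for other input types the names still have to be set explicitly, for example with setNames().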
I'm trying to divide a long-formatted dataframe by a factor (e.g. for each subject) and then put the factor (subject) before the data of each one as a label. The simplified dataframe looks like this: columns X and Y are numbers, and column Subject is a factor. The real dataset actually has hundreds of subjects.
X <- c(1,1,2,2)
Y <- c(0.2, 0.3, 1, 0.5)
Subject <- as.factor(c("A", "A", "B", "B"))
M <- tibble(X,Y,Subject)
> M
# A tibble: 4 x 3
X Y Subject
<dbl> <dbl> <fct>
1 1 0.2 A
2 1 0.3 A
3 2 1 B
4 2 0.5 B
The resulting dataframe should look like this:
> M_trans
A
1 0.2
1 0.3
B
2 1
2 0.5
Thank you for your help!
I tried the code below and it produces the output shown. I couldn't find a way to keep the subject as a factor, because a column in R holds a single type, so X is converted to character. If you find a better solution, please post it.
library(dplyr)
library(tibble)
X <- c(1, 1, 2, 2, 3, 3)
Y <- c(0.2, 0.3, 1, 0.5, 0.2, 0.9)
Subject <- as.factor(c("A", "A", "B", "B", "C", "C"))
M <- tibble(X, Y, Subject)
unq_subjects <- unique(Subject)
final <- data.frame()
for (i in seq_along(unq_subjects))
{
  sub <- unq_subjects[i]
  tmp <- as.data.frame(M %>% filter(Subject == sub) %>%
    select(-Subject) %>%
    mutate(X = as.character(X)) %>% # X must be character to hold the label
    add_row(X = as.character(sub), Y = NA, .before = 1)) # label row goes first
  final <- bind_rows(tmp, final) # prepend, so the last subject ends up on top
}
final
Output:
X Y
1 C NA
2 3 0.2
3 3 0.9
4 B NA
5 2 1.0
6 2 0.5
7 A NA
8 1 0.2
9 1 0.3
Does it answer your question now?
Using dplyr and tidyr
library(dplyr)
library(tidyr)
M %>%
group_by(Subject) %>%
nest()
Hope this helps!
Here is an inelegant solution that worked for me, inspired by Bertil Baron's answer. I would be happy to see any simpler code...
trans_output <- function(M){
M1 <- M %>%
group_by(Subject) %>%
nest()
df <- NULL
for (i in seq_len(nrow(M1))) # one iteration per subject
{
output2 <- M1$data[[i]]
df_sub <- rbind(as.character(M1$Subject[[i]]), # subject ID
output2) # output data
idx <- c(1L)
df_sub <- df_sub %>%
mutate(Y = ifelse(row_number() %in% idx, NA, Y)) %>% # else, stay as Y
transmute(X = X,
Y = as.numeric(Y))
df <- rbind(df, df_sub)
rm(df_sub)
}
return(df)
}
M_trans <- trans_output(M)
The output looks like this:
> M_trans
# A tibble: 6 x 2
X Y
<chr> <dbl>
1 A NA
2 1 0.2
3 2 0.3
4 B NA
5 3 1
6 4 0.5
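For a loop-free variant, here is a base R sketch: split the data by Subject, put a label row on top of each piece, and bind the pieces back together. As in the answers above, X has to become character so that it can hold both the labels and the numbers:

```r
M <- data.frame(X = c(1, 1, 2, 2),
                Y = c(0.2, 0.3, 1, 0.5),
                Subject = factor(c("A", "A", "B", "B")))

# One block per subject: a label row followed by that subject's data
blocks <- lapply(split(M, M$Subject), function(d) {
  rbind(
    data.frame(X = as.character(d$Subject[1]), Y = NA_real_,
               stringsAsFactors = FALSE),
    data.frame(X = as.character(d$X), Y = d$Y,
               stringsAsFactors = FALSE)
  )
})

M_trans <- do.call(rbind, blocks)
rownames(M_trans) <- NULL
M_trans
```

Since split() returns one element per factor level, this should scale to hundreds of subjects without changes, and the subjects keep their factor-level order.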
I have this reproducible dataframe:
df <- data.frame(ID = c("A", "A", "B", "B", "B","C", "C", "D"), cost = c("0.5", "0.4", "0.7", "0.8", "0.5", "1.3", "1.3", "2.6"))
I'm trying to group by ID, test whether the cost values differ within each group, and record the result in a new column called Testdiff.
Intermediate Output
ID cost Testdiff
1 A 0.5 Y
2 A 0.4 Y
3 B 0.7 Y
4 B 0.8 Y
5 B 0.5 Y
6 C 1.3 N
7 C 1.3 N
8 D 2.6 N
I'm looking at using dplyr to do this, but I'm unsure if match is the correct function.
df %>% group_by(ID) %>% mutate(Testdiff = ifelse(match(cost) == T, "Y", "N"))
Once that is completed, I want to keep only the first row for each unique ID, giving me this output
ID cost Testdiff
1 A 0.5 Y
2 B 0.7 Y
3 C 1.3 N
4 D 2.6 N
We could use n_distinct and then slice
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Testdiff = n_distinct(cost) > 1) %>%
slice(1)
# ID cost Testdiff
# <fct> <fct> <lgl>
#1 A 0.5 TRUE
#2 B 0.7 TRUE
#3 C 1.3 FALSE
#4 D 2.6 FALSE
If you want output to be "Y"/"N" instead of TRUE/FALSE
df %>%
group_by(ID) %>%
mutate(Testdiff = ifelse(n_distinct(cost) > 1, "Y", "N")) %>%
slice(1)
We could use ave and aggregate to solve it using base R
df$Testdiff <- ifelse(with(df, ave(cost, ID, FUN = function(x)
length(unique(x)))) > 1, "Y", "N")
aggregate(.~ID, df, head, n = 1)
# ID cost Testdiff
#1 A 0.5 Y
#2 B 0.7 Y
#3 C 1.3 N
#4 D 2.6 N
Since we have dplyr and base R already why not add in data.table:
library(data.table)
setDT(df)[, .(cost = cost[1], testdiff = uniqueN(cost) > 1), by = ID]
ID cost testdiff
1: A 0.5 TRUE
2: B 0.7 TRUE
3: C 1.3 FALSE
4: D 2.6 FALSE
A different tidyverse possibility could be:
df %>%
group_by(ID) %>%
mutate(Testdiff = ifelse(all(cost == first(cost)), "N", "Y")) %>%
filter(row_number() == 1)
ID cost Testdiff
<fct> <fct> <chr>
1 A 0.5 Y
2 B 0.7 Y
3 C 1.3 N
4 D 2.6 N
Or:
df %>%
group_by(ID) %>%
mutate(Testdiff = ifelse(all(cost == first(cost)), "N", "Y")) %>%
top_n(1, wt = desc(row_number()))
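Since only one row per ID is kept in the end, both steps can also be folded into a single summarise(). This is a sketch rather than a drop-in replacement: it returns only the columns named in the call, so any other columns would have to be listed explicitly.

```r
library(dplyr)

df <- data.frame(ID = c("A", "A", "B", "B", "B", "C", "C", "D"),
                 cost = c("0.5", "0.4", "0.7", "0.8", "0.5", "1.3", "1.3", "2.6"),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(ID) %>%
  summarise(
    # Testdiff is computed first: once cost is overwritten below,
    # later expressions would only see the single summarised value
    Testdiff = if (n_distinct(cost) > 1) "Y" else "N",
    cost = first(cost)
  ) %>%
  select(ID, cost, Testdiff)
res
```

Note that cost is stored as text in the example data; the comparison still works on the strings, but values such as "1.3" and "1.30" would count as different unless cost is converted with as.numeric() first.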