I am trying to write a loop over the IDs in a data.frame df. What I do right now is build another object, dm, which contains the unique IDs from df$ID:
dm<-df %>% select(ID) %>% unique()
for (i in 1:length( dm$ID)){
df_new<-df %>% filter(ID %in% dm$ID[i])
...
The current code does what I need, but I wonder whether there is a way to do it without building dm. I want to build a subset for each ID in df. Any suggestions?
Instead of looping over the unique 'ID' values and subsetting, a faster option is split, which splits the data.frame into a list of data.frames based on the unique values of 'ID'
df_list <- split(df, df$ID)
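As a quick illustration of what split returns, here is a minimal sketch on a made-up data frame. The toy_df object and its values are assumptions for illustration only, not the asker's data.

```r
# Toy data: three IDs with a different number of rows each (hypothetical values)
toy_df <- data.frame(
  ID    = c("a", "b", "a", "c", "b", "a"),
  value = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# One data.frame per unique ID; the list is named by the ID values
toy_list <- split(toy_df, toy_df$ID)

names(toy_list)         # the unique IDs
sapply(toy_list, nrow)  # number of rows in each subset
```

Because the list is named, a single subset can also be pulled out directly, e.g. toy_list[["a"]].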
From here, we can either use lapply or a for loop
pdf(paste0(out_dir, output_date,'.pdf'))
for(i in seq_along(df_list)) {
  # ggplot objects are not auto-printed inside a for loop, so call print() explicitly
  print(
    ggplot(data = df_list[[i]]) +
      ...
  )
}
dev.off()
Or with lapply
pdf(paste0(out_dir, output_date,'.pdf'))
lapply(df_list, function(dat)
ggplot(data = dat) +
...
)
dev.off()
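Since the plotting code above is elided, here is a fully runnable sketch of the same pattern (one page per ID in a single PDF). It swaps in base graphics for ggplot and writes to a temporary file; toy_df, the plot type, and out_file are all assumptions made for the example.

```r
# Toy data: three IDs with four points each (hypothetical values)
toy_df <- data.frame(
  ID = rep(c("a", "b", "c"), each = 4),
  x  = rep(1:4, times = 3),
  y  = c(1, 3, 2, 4, 2, 2, 5, 1, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)
toy_list <- split(toy_df, toy_df$ID)

# Open one PDF device, draw one page per ID, then close the device
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
for (id in names(toy_list)) {
  dat <- toy_list[[id]]
  plot(dat$x, dat$y, type = "b", main = paste("ID", id))
}
dev.off()

file.exists(out_file)
```

Looping over names(toy_list) rather than an index makes the ID available for the plot title.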
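If the goal is only to avoid creating a separate object of unique 'ID' values, loop over unique(df$ID) directly:
for(un in unique(df$ID)) {
  df_new <- df %>%
    filter(ID == un)
  # print() is needed for the plot to be drawn from inside the loop
  print(
    ggplot(df_new) +
      ...
  )
}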
Starting from a list of data frames like outcome:
id <- c(1,2,3,4,5,1,2,3,4,5)
month <- c(3,4,2,1,5,7,3,1,8,9)
preds <- c(0.5,0.1,0.15,0.23,0.75,0.6,0.49,0.81,0.37,0.14)
l_1 <- data.frame(id, preds, month)
preds <- c(0.45,0.18,0.35,0.63,0.25,0.63,0.29,0.11,0.17,0.24)
l_2 <- data.frame(id, preds, month)
preds <- c(0.58,0.13,0.55,0.13,0.76,0.3,0.29,0.81,0.27,0.04)
l_3 <- data.frame(id, preds, month)
preds <- c(0.3,0.61,0.18,0.29,0.85,0.76,0.56,0.91,0.48,0.91)
l_4 <- data.frame(id, preds, month)
outcome <- list(l_1, l_2, l_3, l_4)
I want to take the row names that R assigned to each data frame and store them in a new variable, as if we did:
sample <- outcome[[1]]
sample$unique_id <- rownames(sample)
However, I don't want to do this manually because my list has 100 data frames.
Moreover, I don't want to assign values to each row by hand because I want to preserve the row names generated by R.
Any clue?
We may also use rownames_to_column
library(dplyr)
library(purrr)
library(tibble)
map(outcome, ~ .x %>%
rownames_to_column('unique_id'))
With lapply and cbind:
lapply(outcome, function(x) {
cbind(unique_id=rownames(x), x)
})
Another base R option is to use Map
Map(function(x){
x$unique_id <- rownames(x)
x
}, outcome)
Try using lapply
lapply(outcome, function(x) {
x$unique_id <- rownames(x)
x
})
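To check that this approach really preserves the row names R generated, here is a small sketch on a made-up two-element list. The toy_outcome object is hypothetical; one row is dropped on purpose so that the row names are no longer simply 1..n.

```r
# Two small data frames standing in for the 100 in the real list (hypothetical values)
toy_outcome <- list(
  data.frame(id = 1:3, preds = c(0.5, 0.1, 0.15)),
  data.frame(id = 1:3, preds = c(0.45, 0.18, 0.35))
)

# Drop the middle row of the second data frame; its row names become "1" and "3"
toy_outcome[[2]] <- toy_outcome[[2]][-2, ]

# Copy the existing row names into a new column in every data frame
with_ids <- lapply(toy_outcome, function(x) {
  x$unique_id <- rownames(x)
  x
})

with_ids[[2]]$unique_id
```

Note that rownames() returns character strings, so unique_id is a character column.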
I have two data frames. This is just a sample; the real database has approximately 1 million records.
The columns can hold names, emails, alphanumeric codes, etc.
data1<-data.frame(
  'ID 1' = c(86364,"ARV_2612","AGH_2212","IND_2622","CHG_2622"),
  sector = c(3,3,1,2,5),
  name=c("nhug","hugy","mjuk","ghtr","kuld"),
  'Enternal code'=c(1,1,1,1,3),
  col3=c(1,1,0,0,0),
  col4=c(1,0,0,0,0),
  col5=c(1,0,1,1,1),
  check.names = FALSE  # keep 'ID 1' as-is instead of converting it to ID.1
)
data2<-data.frame(
  'ID 1' = c(53265,"ARV_7362",76354,"IND_2622","CHG_9762"),
  sector = c(3,3,1,2,5),
  name=c("nhug","hugy","mjuk","ghtr","kuld"),
  'Enternal code'=c(1,1,1,1,3),
  col3=c(1,1,0,0,0),
  col4=c(1,0,0,0,0),
  col5=c(1,0,1,1,1),
  check.names = FALSE  # needed so that `ID 1` can be referenced below
)
data2 %>% mutate(
duplicated = factor(if_else(`ID 1` %in%
pull(data1, `ID 1`),
1,
0)))
Now I am looking for a function that mutates one data frame (data2) like this: given a column name from data1 and one from data2, it checks whether each value or string in the data2 column already exists in the data1 column, and adds a new column holding 1 for true and 0 for false.
The function would look like:
func(data1 = "name",data2="name",mutated_com="name_exist")
In base R, you can write this function as:
func <- function(data1, data2, data1col, data2col, newcol) {
data2[[newcol]] <- factor(as.integer(data2[[data2col]] %in% data1[[data1col]]))
data2
}
and call it as:
func(data1, data2, 'name', 'name', 'duplicate')
This will create a column named duplicate in data2, giving 1 where the name in data2 is also present in the name column of data1 and 0 otherwise.
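To make the behaviour concrete, here is a small sketch that repeats the base R function and runs it on two made-up data frames. toy1, toy2, and the column name name_exist are assumptions for illustration.

```r
# Same function as above: flag values of data2[[data2col]] that occur in data1[[data1col]]
func <- function(data1, data2, data1col, data2col, newcol) {
  data2[[newcol]] <- factor(as.integer(data2[[data2col]] %in% data1[[data1col]]))
  data2
}

# Hypothetical inputs: "hugy" and "nhug" exist in toy1, "zzzz" does not
toy1 <- data.frame(name = c("nhug", "hugy", "mjuk"), stringsAsFactors = FALSE)
toy2 <- data.frame(name = c("hugy", "zzzz", "nhug"), stringsAsFactors = FALSE)

res <- func(toy1, toy2, "name", "name", "name_exist")
res$name_exist
```

One thing to be aware of: the new column is a factor, and if every value matches (or none does) it will have a single level. If both levels should always be present, factor(..., levels = 0:1) is one way to guarantee that.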
Using dplyr syntax the above can be written as :
library(dplyr)
library(rlang)
func <- function(data1, data2, data1col, data2col, newcol) {
data2 %>%
mutate(!!newcol := factor(as.integer(.data[[data2col]] %in%
data1[[data1col]])))
}
You can use an inner_join (from dplyr) to determine the overlap between the two data frames. If you do not specify the 'by' argument, the join uses all columns the two data frames have in common, so a row counts as overlapping only when it matches on every shared column.
You can then add a column 'duplicated' and join back to the original data frame (data1 or data2) to get the desired result.
overlap <- data1 %>%
inner_join(data2) %>%
mutate(duplicated = 1)
data1 %>% #or data2
left_join(overlap) %>%
mutate(duplicated = ifelse(is.na(duplicated),0,1))
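For comparison, the same "match on every shared column" idea can be sketched in base R without a join, by pasting the shared columns into one key per row. This is a swapped-in technique, not the dplyr code above; toy1, toy2, and the row_key helper are hypothetical names introduced for the example.

```r
# Hypothetical inputs: only the B2 row agrees on both id and sector
toy1 <- data.frame(id = c("A1", "B2", "C3"), sector = c(3, 1, 2),
                   stringsAsFactors = FALSE)
toy2 <- data.frame(id = c("B2", "C3", "D4"), sector = c(1, 5, 4),
                   stringsAsFactors = FALSE)

# Build one string key per row from the chosen columns
# (the separator is assumed not to occur in the data)
row_key <- function(d, cols) do.call(paste, c(d[cols], sep = "\r"))

# Columns present in both data frames
shared <- intersect(names(toy1), names(toy2))

# 1 where the whole row of toy1 also appears in toy2, 0 otherwise
toy1$duplicated <- as.integer(row_key(toy1, shared) %in% row_key(toy2, shared))
toy1
```

Unlike a join, this never adds rows to toy1 when toy2 contains the same row several times.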
I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10,300,replace = T)), group = c(rep("A",100), rep("B",100), rep("C",100)), stringsAsFactors = F)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only, with split. It should be faster than filtering with == for each unique group
with(df, split(id, group))
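Here is a minimal sketch of what this returns, on a made-up data frame (toy_df and its values are assumptions for illustration):

```r
# Hypothetical mapping of ids to groups; id1 appears in two groups
toy_df <- data.frame(
  id    = c("id1", "id2", "id3", "id1", "id4"),
  group = c("A", "A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Split the id vector by group: one character vector per group
id_list <- with(toy_df, split(id, group))

id_list$A
names(id_list)
```

One difference from the lapply version in the question: the result is a named list, with elements ordered by the sorted group values. Wrapping it in unname() drops the names if they are not wanted.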
Or with tidyverse, we can pull the column after group_split. group_split returns a list of tibbles, so it could be slower than the split-only method above. Here we can save some work by dropping the group column (.keep = FALSE; older dplyr versions spelled this argument keep) and then pulling the 'id' column from each list element to get a list of vectors
library(dplyr)
library(purrr)
df %>%
  group_split(group, .keep = FALSE) %>%
  map(~ .x %>%
    pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap with with
df %>%
with(., split(id, group))
I have a data frame in R with a variable a that is a list column: each entry is a list of character strings.
An entry looks like: list('5', '7', '9')
When I iterate by using a for loop, I'm able to calculate it:
for(i in 1:nrow(df)) {
df$a[i] <- sum(as.numeric(unlist(df$a[i])))
}
But, when I try that by using mutate, it returns NA.
df %>% mutate(
c <- sum(as.numeric(unlist(a)))
)
What is the problem with this code, and what should I do?
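The mutate call has two problems: <- does not name the new column (use = instead), and unlist(a) flattens the whole column, so sum() returns a single total for all rows, which is NA as soon as any element is NA or fails to convert. As 'a' is a list column, we can loop over its elements using map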
library(purrr)
library(dplyr)
df %>%
mutate(c = map_dbl(a, ~ sum(as.numeric(.x))))
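The same per-row sum can be sketched in base R with vapply, which may help to see what map_dbl is doing. The toy_df object below is a made-up stand-in for the asker's data.

```r
# Hypothetical data frame with a list column 'a' holding lists of character strings
toy_df <- data.frame(row = 1:2)
toy_df$a <- list(list("5", "7", "9"), list("1", "2"))

# Visit each entry of the list column separately and return one number per row
toy_df$c <- vapply(
  toy_df$a,
  function(x) sum(as.numeric(unlist(x))),
  numeric(1)
)

toy_df$c
```

The key point is that the function is applied to one entry of 'a' at a time, instead of to the whole column at once.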
Given
base <- data.frame( a = 1)
f <- function() c(2,3,4)
I am looking for a solution that would result in a function f being applied to each row of base data frame and the result would be appended to each row. Neither of the following works:
result <- base %>% rowwise() %>% mutate( c(b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( (b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( b,c,d = f() )
What is the correct syntax for this task?
This appears to be a similar problem (Assign multiple new variables on LHS in a single line in R) but I am specifically interested in solving this with functions from tidyverse.
I think the best you are going to do is to use do() to modify the data.frame. Perhaps
base %>% do(cbind(., setNames(as.list(f()), c("b","c","d"))))
It would probably be best if f() returned a list in the first place, with one element per column.
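A rough base R sketch of that last suggestion, under two assumptions that differ from the question: f() is rewritten to return a named list, and base has two rows so that the row-wise behaviour is visible.

```r
# Assumption: a two-row version of the data frame from the question
base <- data.frame(a = 1:2)

# Assumption: f() returns a named list, one element per new column
f <- function() list(b = 2, c = 3, d = 4)

# Call f() once per row and stack the results into a data frame
new_cols <- do.call(rbind, lapply(seq_len(nrow(base)), function(i) {
  as.data.frame(f())
}))

# Append the new columns to the original rows
result <- cbind(base, new_cols)
result
```

Because the list is named, the column names come from f() itself and no setNames() call is needed.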
In case you're willing to do this without dplyr:
# starting data frame
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)
# the function you want applied to a given column
add_to <- function(x) { x + 100 }
# run this function on your base data frame, specifying the column you want to apply the function to:
add_computed_col <- function(frame, funct, col_choice) {
frame[paste(floor(runif(1, min=0, max=10000)))] = lapply(frame[col_choice], funct)
return(frame)
}
Usage:
df <- add_computed_col(base_frame, add_to, 'col_a')
head(df)
And add as many columns as needed:
df_b <- add_computed_col(df, add_to, 'col_b')
head(df_b)
The new columns get random numeric names, so rename them afterwards.
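If the renaming step is a nuisance, one possible variation is to pass the new column name in directly. The function add_named_col and its new_name argument are hypothetical, introduced here only as a sketch.

```r
# Starting data frame and function, as in the answer above
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)
add_to <- function(x) { x + 100 }

# Hypothetical variant: the caller chooses the name of the computed column
add_named_col <- function(frame, funct, col_choice, new_name) {
  frame[[new_name]] <- funct(frame[[col_choice]])
  frame
}

df <- add_named_col(base_frame, add_to, "col_a", "col_a_plus_100")
head(df)
```

This also makes the result reproducible, since no random number is involved in naming the column.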