I have a dataframe that I want to gather so that it is in tall format, and then mutate on another column with values based on membership of a string from another column in a list of lists. For example, I have the following data frame and list of lists:
dummy_data <- data.frame("id" = 1:20,"test1_10" = sample(1:100, 20),"test2_11" = sample(1:100, 20),
"test3_12" = sample(1:100, 20),"check1_20" = sample(1:100, 20),
"check2_21" = sample(1:100, 20),"sound1_30" = sample(1:100, 20),
"sound2_31" = sample(1:100, 20),"sound3_32" = sample(1:100, 20))
dummylist <- list(c('test1_','test2_','test3_'),c('check1_','check2_'),c('sound1_','sound2_','sound3_'))
names(dummylist) <- c('shipments','arrivals','departures')
And then I gather the data frame like so:
dummy_data <- dummy_data %>%
gather("part", "number", 2:ncol(.))
What I want to do is add a column that has the name of the list found in dummylist where the string before the underscore in the part column is a member. And I can do that like this:
dummydata <- dummydata %>%
mutate(Group = case_when(
str_extract(part,'.*_') %in% dummylist[[1]] ~ names(dummylist[1]),
str_extract(part,'.*_') %in% dummylist[[2]] ~ names(dummylist[2]),
str_extract(part,'.*_') %in% dummylist[[3]] ~ names(dummylist[3])
))
However, this requires a separate str_extract line for each list/group within the dummylist. And my real data has way more than 3 lists/groups. So I'm wondering if there is a more efficient way to do this mutate step to get the names of the lists in?
Any help is much appreciated, thanks!
It may be easier with a regex_left_join after converting the 'dummylist' to a two column dataset
library(fuzzyjoin)
library(dplyr)
library(tidyr)
library(tibble)
dummy_data %>%
# // reshape to long format - pivot_longer instead of gather
pivot_longer(cols = -id, names_to = 'part', values_to = 'number') %>%
# // join with the tibble/data.frame converted dummylist
regex_left_join(dummylist %>%
enframe(name = 'Group', value = 'part') %>%
unnest(part)) %>%
rename(part = part.x) %>%
select(-part.y)
-output
# A tibble: 160 × 4
id part number Group
<int> <chr> <int> <chr>
1 1 test1_10 72 shipments
2 1 test2_11 62 shipments
3 1 test3_12 17 shipments
4 1 check1_20 89 arrivals
5 1 check2_21 54 arrivals
6 1 sound1_30 39 departures
7 1 sound2_31 94 departures
8 1 sound3_32 95 departures
9 2 test1_10 77 shipments
10 2 test2_11 4 shipments
# … with 150 more rows
If you prepare your lookup table beforehand, you don't need any extra libraries, but dplyr and tidyr:
lookup <- sapply(
names(dummylist),
\(nm) { setNames(rep(nm, length(dummylist[[nm]])), dummylist[[nm]]) }
) |>
setNames(nm = NULL) |>
unlist()
lookup
# test1_ test2_ test3_ check1_ check2_ sound1_ sound2_ sound3_
# "shipments" "shipments" "shipments" "arrivals" "arrivals" "departures" "departures" "departures"
Now you just gsubing on the fly, and translating your parts, within usual mutate() verb:
dummy_data |>
pivot_longer(-id, names_to = 'part', values_to = 'number') |>
mutate(group = lookup[gsub('^(\\w+_).*$', '\\1', part)])
# # A tibble: 160 × 4
# id part number group
# <int> <chr> <int> <chr>
# 1 1 test1_10 91 shipments
# 2 1 test2_11 74 shipments
# 3 1 test3_12 46 shipments
# 4 1 check1_20 62 arrivals
# 5 1 check2_21 7 arrivals
# 6 1 sound1_30 35 departures
# 7 1 sound2_31 23 departures
# 8 1 sound3_32 84 departures
# 9 2 test1_10 59 shipments
# 10 2 test2_11 73 shipments
# # … with 150 more rows
Related
Edit: I found the solution with na.locf().
data <-
data %>%
group_by(country) %>%
arrange(wave) %>%
mutate(weight.io = na.locf(weight)) %>%
mutate(lag_weight = weight - lag(weight.io)
I have a dataset below.
set.seed(42000)
data <- data_frame(
country = sample(letters[1:20], size = 100, replace = TRUE),
weight = round(runif(100, min = 48, max = 90)))
data <- data %>%
group_by(country) %>%
arrange(weight) %>%
mutate(wave = seq_along(weight))
n_rows <- nrow(data)
perc_missing <- 10
data[sample(1:n_rows, sample(1:n_rows, round(perc_missing/100 * n_rows, 0))), c("weight")] <- NA
I would like to obtain the difference between one country's current "weight" and the last observed "weight for each wave.
For country "a" wave 5, I want the value to be 69 - 65 (last observed weight at wave < 5).
And for wave 8, 82(weight at wave 8) - 69(weight at wave 5).
My approach was the one below, but it didn't work.
data <-
data %>%
group_by(country) %>%
arrange(wave) %>%
mutate(lag_weight = weight - lag(weight, default = first(weight, na.rm = TRUE)))
Thank you!
I think this is a combination of diff (instead of lag, though that could work just as well) and more important tidyr::fill (or zoo::na.locf, not demonstrated):
BTW, na.rm= is not an argument for first, I've removed it.
library(dplyr)
# library(tidyr) # fill
data %>%
group_by(country) %>%
tidyr::fill(weight) %>%
filter(country == "a") %>%
mutate(lag_weight = weight - lag(weight, default = first(weight)))
# # A tibble: 10 x 4
# # Groups: country [1]
# country weight wave lag_weight
# <chr> <dbl> <int> <dbl>
# 1 a 54 1 0
# 2 a 55 2 1
# 3 a 65 3 10
# 4 a 65 4 0
# 5 a 69 5 4
# 6 a 69 6 0
# 7 a 69 7 0
# 8 a 82 8 13
# 9 a 82 9 0
# 10 a 85 10 3
The issue here is that weight is over-written with the LOCF (last-observation carried forward) value instead of preserving the NA values. If that's important, then you can make another weight variable for temporary use (and remove it):
data %>%
mutate(tmpweight = weight) %>%
group_by(country) %>%
tidyr::fill(tmpweight) %>%
filter(country == "a") %>%
mutate(lag_weight = tmpweight - lag(tmpweight, default = first(tmpweight))) %>%
select(-tmpweight)
# # A tibble: 10 x 4
# # Groups: country [1]
# country weight wave lag_weight
# <chr> <dbl> <int> <dbl>
# 1 a 54 1 0
# 2 a 55 2 1
# 3 a 65 3 10
# 4 a NA 4 0
# 5 a 69 5 4
# 6 a NA 6 0
# 7 a NA 7 0
# 8 a 82 8 13
# 9 a 82 9 0
# 10 a 85 10 3
FYI, you can use c(0, diff(weight)) instead of weight - lag(weight) for the same effect. Since it returns length of 1 shorter (since it is the gap between each value), we prepend a 0 here:
data %>%
group_by(country) %>%
tidyr::fill(weight) %>%
filter(country == "a") %>%
mutate(lag_weight = c(0, diff(weight)))
(The filter(country == "a") is purely for demonstration to match your example, not that it is required for this solution.)
I have a large data for which I'm attempting to remove repeated row entries based on several columns. The column headings and sample entries are
count freq, cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart
5036 0.0599 TGCAGTGCTAGAG CSARDPDR TRBV20-1 TRBD1 TRBJ1-5 15 17 43 21
There are several thousand rows, and for two rows to match all the values except for "count" and "freq" must be the same. I want to remove the repeated entries, but before that, I need to change the "count" value of the one repeated row with the sum of the individual repeated row "count" to reflect the true abundance. Then, I need to recalculate the frequency of the new "count" based on the sum of all the counts of the entire table.
For some reason, the script is not changing anything, and I know for a fact that the table has repeated entries.
Here's my script.
library(dplyr)
# Input sample replicate table.
dta <- read.table("/data/Sample/ci1371.txt", header=TRUE, sep="\t")
# combine rows with identical data. Recalculation of frequency values.
dta %>% mutate(total = sum(count)) %>%
group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
summarize(count_new = sum(count), freq = count_new/mean(total))
dta_clean <- dta
Any help is greatly appreciated. Here's a screenshot of how the datatable looks like.
Preliminary step: transform in data.table and store column names that are not count and freq
library(data.table)
setDT(df)
cols <- colnames(df)[3:ncol(df)]
(in your example, count and freq are in the first two positions)
To recompute count and freq:
df_agg <- df[, .(count = sum(count)), by = cols]
df_agg[, 'freq' := count/sum(count)]
If you want to keep unique values by all columns except count and freq
df_unique <- unique(df, by = cols)
Sample data, where grp1 and grp2 are intended to be all of your grouping variables.
set.seed(42)
dat <- data.frame(
grp1 = sample(1:2, size=20, replace=TRUE),
grp2 = sample(3:4, size=20, replace=TRUE),
count = sample(100, size=20, replace=TRUE),
freq = runif(20)
)
head(dat)
# grp1 grp2 count freq
# 1 2 4 38 0.6756073
# 2 2 3 44 0.9828172
# 3 1 4 4 0.7595443
# 4 2 4 98 0.5664884
# 5 2 3 44 0.8496897
# 6 2 4 96 0.1894739
Code:
library(dplyr)
dat %>%
group_by(grp1, grp2) %>%
summarize(count = sum(count)) %>%
ungroup() %>%
mutate(freq = count / sum(count))
# # A tibble: 4 x 4
# grp1 grp2 count freq
# <int> <int> <int> <dbl>
# 1 1 3 22 0.0206
# 2 1 4 208 0.195
# 3 2 3 383 0.358
# 4 2 4 456 0.427
I'm new to R and still struggling with loops.
I'm trying to create a loop where, based on a condition (variable_4 == 1), it will concatenate the content of variable_5, separated by comma.
data1 <- data.frame(
ID = c(123:127),
agent_1 = c('James', 'Lucas','Yousef', 'Kyle', 'Marisa'),
agent_2 = c('Sophie', 'Danielle', 'Noah', 'Alex', 'Marcus'),
agent_3 = c('Justine', 'Adrienne', 'Olivia', 'Janice', 'Josephine'),
Flag_1 = c(1,0,1,0,1),
Flag_2 = c(0,1,0,0,1),
Flag_3 = c(1,0,1,0,1)
)
data1$new_var<- ""
for(i in 2:10){
variable_4 <- paste0("flag_", i)
variable_5 <- paste0("agent_", i)
data1 <- data1 %>%
mutate(!! new_var = case_when(variable_4 == 1,paste(new_var, variable_5, sep=",")))
}
I've created new_var in a previous step because the code was giving me an error that the variable was not found. Ideally, the loop will accumulate the contents of variable_5, only if variable_4 is equal 1 and the result would be big string, separate by comma.
The loop will paste in the new var only the name of the agents which the flags are = 1. If Flag_1=1, then paste the name of the agent in the new_var, if not, ignore. If flag_2 =1, then concatenate the name of the agent in the new var, separating by comma, if not, then ignore...
You shouldn't need to use a loop for this. The data is in wide format which makes it harder, but if we convert to long format, we can easily find a vectorized solution rather than using a loop.
The pivot_longer function is useful here which requires tidyr version >= 1.0.0.
library(tidyr)
library(dplyr)
pivot_longer(data1,
cols = -ID,
names_to = c(".value", "group"),
names_sep = "_") %>%
group_by(ID) %>%
mutate(new_var = paste0(agent[Flag==1], collapse = ',')) %>%
pivot_wider(names_from = c("group"),
values_from = c('agent', 'Flag'),
names_sep = '_') %>%
ungroup() %>%
select(ID, starts_with('agent'), starts_with('Flag'), new_var)
## A tibble: 5 x 8
# ID agent_1 agent_2 agent_3 Flag_1 Flag_2 Flag_3 new_var
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 123 James Sophie Justine 1 0 1 James,Justine
#2 124 Lucas Danielle Adrienne 0 1 0 Danielle
#3 125 Yousef Noah Olivia 1 0 1 Yousef,Olivia
#4 126 Kyle Alex Janice 0 0 0 ""
#5 127 Marisa Marcus Josephine 1 1 1 Marisa,Marcus,Josephine
Details:
pivot_longer puts our data into a more natural format where each row represents one observation of the variables agent and flag, rather than several:
pivot_longer(data1,
cols = -ID,
names_to = c(".value", "group"),
names_sep = "_")
## A tibble: 15 x 4
# ID group agent Flag
# <int> <chr> <chr> <chr>
# 1 123 1 James 1
# 2 123 2 Sophie 0
# 3 123 3 Justine 1
# 4 124 1 Lucas 0
# 5 124 2 Danielle 1
# 6 124 3 Adrienne 0
# ...
For each ID, we can then paste together the agents which have flag values of 1. This is easy now that our variables are contained in single columns.
Lastly, we revert back to the wide format with pivot_wider. We also ungroup the data we previously grouped, and re-order the columns to the desired format.
There are a few different ways to do this in BaseR or the tidyverse, or a combination of both, if you stick to using tidyverse then consider this:
I have used mtcars as your dataframe instead!
#load dplyr or tidyverse
library(tidyverse)
# create data as mtcars
df <- mtcars
# create two new columns flag and agent as rownumbers
df <- df %>%
mutate(flag = paste0("flag", row_number())) %>%
mutate(agent = paste0("agent", row_number()))
# using case when in mutate statement
df2 <- df %>%
mutate(new_column = ifelse(flag == "flag1", yes = paste0(agent, " this is a new variable"), no = flag))
print(df2)
an ifelse statement might be more appropriate if you have one case - but if you have many then use case_when instead.
I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>% group_by(id) %>%
summarise(meanFreq= mean(freq),minFreq=min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics. But if I merge by id I get redundant rows. I should only have one row per id but with the species name appended.
df3<-merge(df2,df1,by = "id")
This is what it should look like but my real data is messier than this neat set up here:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
From the summarised output ('df2') we can join with the distinct rows of the selected columns of original data
library(dplyr)
df2 %>%
left_join(df1 %>%
distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
# id meanFreq minFreq species study
# <fct> <dbl> <dbl> <fct> <fct>
#1 a 10.5 10 dog UK
#2 b 14.5 12 cat UK
#3 c 14.5 12 bird UK
#4 d 10 7 cat UK
#5 e 11 6 bee UK
Or use the same logic with the base R
merge(df2,unique(df1[c(1:2, 4)]),by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>% group_by(id) %>%
mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
distinct(id, .keep_all = T)
Now actually there are two possibilities: either id and species are essentially the same in your df, one is just a label for the other, or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = T).
This would get you:
# A tibble: 5 x 6
# Groups: id [5]
id species freq study meanFreq minFreq
<fct> <fct> <int> <fct> <dbl> <dbl>
1 a dog 10 UK 10.5 10
2 b cat 17 UK 14.5 12
3 c bird 12 UK 14.5 12
4 d cat 13 UK 10 7
5 e bee 6 UK 11 6
If your only goal is to keep the species & they are indeed the same as id, you could also just include it in the group_by:
df1 %>% group_by(id, species) %>%
summarise(meanFreq = mean(freq), minFreq = min(freq))
This would then remove study and freq - if you have the need to keep them, you can again replace summarise with mutate and then distinct with .keep_all = T argument.
I'd like to recode the values in the df1 data frame using the df2 data frame so that I end up with a data frame like df3.
The current code almost does the trick, but there are two problems. First, it introduces NA when there's no match, e.g. there is no match in df2 for the df1 aed_bloodpr variable value "1,2" so the value becomes NA. Second, when a variable in df1 can't be mapped to df2, the code won't run (error message).
Have looked into the nomatch argument for match() and the .default argument for Map(), but I can't figure out how to use them so that I end up with df3.
Starting point:
Df1 <- data.frame("aed_bloodpr" = c("1,2","2","1","1"),
"aed_gluco" = c("2","1","3","2"),
"add_bmi" = c("2","5,7","7","5"),
"add_asthma" = c("2","2","7","5"),
"nausea" = c("3","3","4","5"))
Df2 <- data.frame("NameOfVariable" = c("aed_bloodpr","aed_bloodpr","aed_gluco","aed_gluco","aed_gluco","add_bmi","add_bmi","add_bmi"),
"VariableLevel" = c(1,2,1,2,3,2,5,7),
"VariableDef" = c("high","normal","elevated","normal","NA","above","normal","below"))
End point:
Df3 <- data.frame("aed_bloodpr" = c("1,2","normal","high","high"),
"aed_gluco" = c("normal","elevated","NA","normal"),
"add_bmi" = c("above","5,7","below","normal"),
"add_asthma"=c("2","2","7","5"),
"nausea" = c("3","3","4","5"))
Current code:
data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
Df1,
split(Df2[2:3], Df2[1])[names(Df1)]))
You need to clean up before you can relabel. The actual relabeling is more easily accomplished by a join. Here using the tidyverse (translate as you like):
library(tidyverse)
Df1 <- data.frame("aed_bloodpr" = c("1,2","2","1","1"),
"aed_gluco" = c("2","1","3","2"),
"add_bmi" = c("2","5,7","7","5"),
"add_asthma" = c("2","2","7","5"),
"nausea" = c("3","3","4","5"))
Df2 <- data.frame("NameOfVariable" = c("aed_bloodpr","aed_bloodpr","aed_gluco","aed_gluco","aed_gluco","add_bmi","add_bmi","add_bmi"),
"VariableLevel" = c(1,2,1,2,3,2,5,7),
"VariableDef" = c("high","normal","elevated","normal","NA","above","normal","below"))
Df1_long <- Df1 %>%
mutate_all(as.character) %>% # change factors to strings
rowid_to_column('i') %>% # add row index to enable later long-to-wide reshape
gather(variable, value, -i) %>% # reshape to long form
separate_rows(value, convert = TRUE) # unnest nested values and convert to numeric
str(Df1_long)
#> 'data.frame': 22 obs. of 3 variables:
#> $ i : int 1 1 2 3 4 1 2 3 4 1 ...
#> $ variable: chr "aed_bloodpr" "aed_bloodpr" "aed_bloodpr" "aed_bloodpr" ...
#> $ value : int 1 2 2 1 1 2 1 3 2 2 ...
Df2_clean <- Df2 %>%
mutate_if(is.factor, as.character) %>% # change factors to strings
mutate_all(na_if, 'NA') # change "NA" to NA
Df3 <- Df1_long %>%
left_join(Df2_clean, by = c('variable' = 'NameOfVariable', # merge
'value' = 'VariableLevel')) %>%
mutate(VariableDef = coalesce(VariableDef, as.character(value))) %>% # combine labels and values
group_by(i, variable) %>%
summarise(value = toString(VariableDef)) %>% # re-aggregate multiple values
spread(variable, value) # reshape to wide form
Df3
#> # A tibble: 4 x 6
#> # Groups: i [4]
#> i add_asthma add_bmi aed_bloodpr aed_gluco nausea
#> * <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 2 above high, normal normal 3
#> 2 2 2 normal, below normal elevated 3
#> 3 3 7 below high 3 4
#> 4 4 5 normal high normal 5