Join rows in one row based on common ID - r

I have a data frame with 3 columns: ID, F and M. I want to join the values of F and M into one row per ID; at the moment most of them sit in two separate rows, padded with NAs.
Unfortunately, I do have some duplicate rows, and the data is still a bit messy (see the example below).
I tried this, but I get Error: Expecting a single value: [extent=2].
test2 <- test %>% mutate(grouped_id = row_number()) %>%
group_by(BroodID) %>%
summarise_each(funs(na.omit))
Here is a reproducible example of what my data looks like:
structure(list(ID = c(2010.3, 2010.3, 2010.3, 2010.3, 2010.33,
2010.34, 2010.38, 2010.38, 2010.39, 2010.39, 2010.4, 2010.4,
2010.4, 2010.4, 2010.4, 2010.41, 2010.41, 2010.42, 2010.42, 2010.44,
2010.44, 2010.46, 2010.46), F = structure(c(5L, 5L, 12L, 12L,
11L, 8L, NA, 3L, NA, 1L, NA, 2L, 2L, 6L, 6L, NA, 7L, NA, 9L,
NA, 4L, NA, 10L), .Label = c("T206434", "T206553", "T931169",
"T931286", "T961275", "V470937", "X250041", "X250109", "X250195",
"X250568", "X251067", "X251069"), class = "factor"), M = structure(c(2L,
2L, 11L, 11L, 6L, NA, 9L, NA, 10L, NA, 1L, 1L, 4L, 4L, NA, 8L,
NA, 3L, NA, 7L, NA, 5L, NA), .Label = c("T206824", "T206994",
"T960191", "T961486", "X250567", "X250779", "X250851", "X251046",
"X251066", "X251074", "X251116"), class = "factor")), row.names = c(NA,
23L), class = "data.frame")
I'd like the rows where the values for F and M are split into two rows to be merged into one row based on ID.

We could use unite with na.rm = TRUE to remove NA values and use distinct to have only unique rows.
library(dplyr)
test %>%
mutate_at(2:3, as.character) %>%
tidyr::unite(combined, F, M, na.rm = TRUE, sep = ",") %>%
distinct()
# ID combined
#1 2010.30 T961275,T206994
#2 2010.30 X251069,X251116
#3 2010.33 X251067,X250779
#4 2010.34 X250109
#5 2010.38 X251066
#6 2010.38 T931169
#7 2010.39 X251074
#8 2010.39 T206434
#9 2010.40 T206824
#10 2010.40 T206553,T206824
#11 2010.40 T206553,T961486
#12 2010.40 V470937,T961486
#13 2010.40 V470937
#14 2010.41 X251046
#15 2010.41 X250041
#16 2010.42 T960191
#17 2010.42 X250195
#18 2010.44 X250851
#19 2010.44 T931286
#20 2010.46 X250567
#21 2010.46 X250568
If we want to further summarise by ID, we can do
test %>%
mutate_at(2:3, as.character) %>%
tidyr::unite(combined, F, M, na.rm = TRUE, sep = ",") %>%
distinct() %>%
group_by(ID) %>%
summarise(combined = toString(combined))
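As a side note, the original error most likely arises because summarise expects exactly one value per group, while na.omit returns two values for IDs that have duplicate rows. A minimal base R sketch of collapsing to one row per ID while keeping F and M as separate columns is shown below; the helper name collapse_non_na and the small toy data are assumptions made for illustration, not part of the question.

```r
# Toy data in the same shape as the question (hypothetical values)
test <- data.frame(
  ID = c(1, 1, 2, 2, 3, 3),
  F  = c("a", NA, NA, "c", "e", "e"),
  M  = c(NA, "b", "d", NA, "f", "f"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique non-NA values pasted together, NA if none exist
collapse_non_na <- function(x) {
  vals <- unique(as.character(x[!is.na(x)]))
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = ",")
}

# One row per ID; aggregate() applies the helper to F and M within each ID
merged <- aggregate(test[c("F", "M")], by = list(ID = test$ID),
                    FUN = collapse_non_na)
merged
```

IDs that carry several distinct values in F or M would end up with comma-separated values in that column, so whether this is the desired shape depends on how the remaining duplicates should be treated.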

Related

How to select specific element from nested dataframes

I have a list of nested data frames and I want to extract the observations of the earliest year. My problem is that the first year differs between data frames: it is either 1992 or 2005.
I want to create a list to store them. I tried which, but since the same year appears more than once, observations are repeated, and I want them kept apart.
new_df<- which(df[[i]]==1992 | df[[i]]==2005)
I've also tried ifelse(), but I need to run lm afterwards and that doesn't work. I can't simply take the first rows either, because the years are repeated.
my code looks like this:
df<- list(a<-data.frame(a_1<-(1992:2015),
a_2<-sample(1:24)),
b<-data.frame(b_1<-(1992:2015),
b_2<-sample(1:24)),
c<-data.frame(c_1<-(2005:2015),
c_2<-sample(1:11)),
d<-data.frame(d_1<-(2005:2015),
d_2<-sample(1:11)))
You can define a function to get the data on one data.frame and loop on the list to extract values.
Below I use map from the purrr package but you can also use lapply and for loops
Please do not use <- when assigning values inside a function call (here data.frame()), because it will mess up the column names. = is used in function calls to name arguments, and it's fine to use it there. You can read this ;)
library(purrr)  # provides map()
df <- list(a = data.frame(a_1 = (1992:2015),
                          a_2 = sample(1:24)),
           b = data.frame(b_1 = (1992:2015),
                          b_2 = sample(1:24)),
           c = data.frame(c_1 = (2005:2015),
                          c_2 = sample(1:11)),
           d = data.frame(d_1 = (2005:2015),
                          d_2 = sample(1:11)))
extract_miny <- function(df){
  # earliest year found in the first column
  miny <- min(df[,1])
  # value(s) in the second column recorded for that year
  res <- df[df[,1] == miny, 2]
  names(res) <- miny
  return(res)
}
map(df, extract_miny)
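The same idea can be written with lapply from base R, as mentioned above. The sketch below keeps the whole row (or rows) of the earliest year rather than only the value; the list name df_list and the column names year and value are assumptions for illustration.

```r
set.seed(42)  # only to make the toy data reproducible
df_list <- list(
  a = data.frame(year = 1992:2015, value = sample(1:24)),
  c = data.frame(year = 2005:2015, value = sample(1:11))
)

# For each data frame, keep the rows whose first column equals its minimum
earliest <- lapply(df_list, function(d) d[d[[1]] == min(d[[1]]), , drop = FALSE])
earliest
```

Because the result is still a list of data frames, each element can be passed on separately to later steps.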
If the data is sorted as in the example, you can slice() the first row to get the information. Notice the use of = rather than <- when creating the nested data frames.
library(tidyverse)
df <- list(
a = data.frame(a_1 = (1992:2015),
a_2 = sample(1:24)),
b = data.frame(b_1 = (1992:2015),
b_2 = sample(1:24)),
c = data.frame(c_1 = (2005:2015),
c_2 = sample(1:11)),
d = data.frame(d_1 = (2005:2015),
d_2 = sample(1:11))
)
df %>%
imap_dfr( ~ slice(.x, 1) %>%
set_names(c("year", "value")) %>%
mutate(dataframe = .y) %>%
as_tibble())
# A tibble: 4 x 3
year value dataframe
<int> <int> <chr>
1 1992 19 a
2 1992 2 b
3 2005 1 c
4 2005 5 d
You may subset with an anonymous function.
lapply(df, \(x) setNames(x[x[[1]] == min(x[[1]]), ], c('year', 'value'))) |> do.call(what=rbind)
# year value
# 1 1992 6
# 2 1992 9
# 3 2005 11
# 4 2005 11
Or, maybe better, first create a variable that records which sample each value stems from.
Map(`[<-`, df, 'sample', value=letters[seq_along(df)]) |>
lapply(\(x) setNames(x[x[[1]] == min(x[[1]]), ], c('year', 'value', 'sample'))) |>
do.call(what=rbind)
# year value sample
# 1 1992 6 a
# 2 1992 9 b
# 3 2005 11 c
# 4 2005 11 d
Data:
df <- list(structure(list(a_1.....1992.2015. = 1992:2015, a_2....sample.1.24. = c(6L,
18L, 23L, 5L, 7L, 14L, 4L, 10L, 19L, 17L, 15L, 1L, 11L, 22L,
13L, 8L, 20L, 16L, 2L, 3L, 24L, 21L, 9L, 12L)), class = "data.frame", row.names = c(NA,
-24L)), structure(list(b_1.....1992.2015. = 1992:2015, b_2....sample.1.24. = c(9L,
24L, 18L, 8L, 16L, 11L, 13L, 23L, 15L, 20L, 19L, 21L, 12L, 22L,
7L, 3L, 6L, 17L, 2L, 5L, 4L, 10L, 1L, 14L)), class = "data.frame", row.names = c(NA,
-24L)), structure(list(c_1.....2005.2015. = 2005:2015, c_2....sample.1.11. = c(11L,
2L, 5L, 10L, 9L, 6L, 1L, 7L, 3L, 8L, 4L)), class = "data.frame", row.names = c(NA,
-11L)), structure(list(d_1.....2005.2015. = 2005:2015, d_2....sample.1.11. = c(11L,
2L, 5L, 1L, 6L, 9L, 3L, 7L, 10L, 4L, 8L)), class = "data.frame", row.names = c(NA,
-11L)))

R: populate data.frame within function in mapply

A data.frame df1 is queried (fuzzy match) against another data.frame df2 with agrep. Via iterating over its output (a list called matches holding the row number of the respective matches in df2), df1 is populated with affiliated values from df2.
The goal is a function that is passed to mapply; however, in all my attempts df1 remains unchanged.
In a for-loop, the code works as expected and populates df1 with the affiliated variables from df2. Still, I would be interested how to solve this with a function that is passed to mapply.
First, the two data.frames:
df1 <- structure(list(Species = c("Alisma plantago-aquatica", "Alnus glutinosa",
"Carex davalliana", "Carex echinata",
"Carex elata"),
CheckPoint = c(NA, NA, NA, NA, NA),
L = c(NA, NA, NA, NA, NA),
R = c(NA, NA, NA, NA, NA),
K = c(NA, NA, NA, NA, NA)),
row.names = c(NA, 5L), class = "data.frame")
df2 <- structure(list(Species = c("Alisma gramineum", "Alisma lanceolatum",
"Alisma plantago-aquatica", "Alnus glutinosa",
"Alnus incana", "Alnus viridis",
"Carex davalliana", "Carex depauperata",
"Carex diandra", "Carex digitata",
"Carex dioica", "Carex distans",
"Carex disticha", "Carex echinata",
"Carex elata"),
L = c(7L, 7L, 7L, 5L, 6L, 7L, 9L, 4L, 8L, 3L, 9L, 9L, 8L,
8L, 8L),
R = c(7L, 7L, 5L, 5L, 4L, 3L, 4L, 7L, 6L, NA, 4L, 6L, 6L,
NA, NA),
K = c(6L, 2L, NA, 3L, 5L, 4L, 4L, 2L, 7L, 4L, NA, 3L, NA,
3L, 2L)),
row.names = seq(1:15), class = "data.frame")
Then, fuzzy match by Species:
matches <- lapply(df1$Species, agrep, x = df2$Species, value = FALSE,
max.distance = c(deletions = 0,
insertions = 1,
substitutions = 1))
Populating df1 with the values from df2 works as expected:
for (i in 1:dim(df1)[1]){
df1[i, 2:5] <- df2[matches[[i]], ]
}
In contrast, my approach with mapply does return the correct values, but as a list of disassembled values that are never written into df1. No combination (with or without return(df1), writing the result into another variable, or desperate attempts with the state of SIMPLIFY or USE.NAMES) yielded the desired result.
populatedf1 <- function(matches, index){
df1[index, 2:5] <- df2[matches, ]
#return(df1)
}
mapply(populatedf1, matches, seq_along(matches), SIMPLIFY = FALSE,
USE.NAMES = FALSE)
Would be great if someone knows the solution or could point me into a certain direction, thanks! :)
Actually, you would not need any loop here (for or mapply) if you replace lapply with sapply (so that it returns a vector instead of a list, provided every species has exactly one match) and then do a direct assignment.
matches <- sapply(df1$Species, agrep, x = df2$Species, value = FALSE,
max.distance = c(deletions = 0,
insertions = 1,
substitutions = 1))
df1[, 2:5] <- df2[matches,]
df1
# Species CheckPoint L R K
#1 Alisma plantago-aquatica Alisma plantago-aquatica 7 5 NA
#2 Alnus glutinosa Alnus glutinosa 5 5 3
#3 Carex davalliana Carex davalliana 9 4 4
#4 Carex echinata Carex echinata 8 NA 3
#5 Carex elata Carex elata 8 NA 2
As far as your approach is concerned you can use Map or mapply with SIMPLIFY = FALSE and bring the list of dataframes into one dataframe using do.call and rbind and then assign.
df1[, 2:5] <- do.call(rbind, Map(populatedf1, matches, seq_along(matches)))
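The reason df1 stays unchanged in the mapply attempt is that an assignment inside a function modifies a local copy of df1, not the object in the calling environment. A small sketch of that behaviour, with hypothetical toy data and function names (fill_local, lookup), might look like this:

```r
df1 <- data.frame(key = c("a", "b"), val = c(NA_real_, NA_real_),
                  stringsAsFactors = FALSE)
df2 <- data.frame(key = c("a", "b", "c"), val = c(10, 20, 30),
                  stringsAsFactors = FALSE)

# Assigning inside the function only changes the function's local copy of df1
fill_local <- function(i) {
  df1[i, "val"] <- df2$val[match(df1$key[i], df2$key)]
  invisible(NULL)
}
invisible(lapply(seq_len(nrow(df1)), fill_local))
untouched <- all(is.na(df1$val))  # the global df1 was not modified

# Returning the looked-up values and assigning them outside the function works
lookup <- function(i) df2$val[match(df1$key[i], df2$key)]
df1$val <- vapply(seq_len(nrow(df1)), lookup, numeric(1))
df1
```

This is why the working variants above collect the values returned by the function and perform the assignment to df1 outside of it.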

Removing cases in a long dataframe

I am currently experiencing a problem where I have a long dataframe (i.e., multiple rows per subject) and want to remove cases that don't have any measurements (in any of the rows) on one variable. I've tried transforming the data to wide format, but this was a problem as I can't go back anymore (going from long to wide "destroys" my timeline variable). Does anyone have an idea about how to fix this problem?
Below is some code to simulate the head of my data. Specifically, I want to remove cases that don't have a measurement of extraversion on any of the measurement occasions ("time").
structure(list(id = c(1L, 1L, 2L, 3L, 3L, 3L), time = c(79L, 95L, 79L, 28L, 40L, 52L),
extraversion = c(3.2, NA, NA, 2, 2.4, NA), satisfaction = c(3L, 3L, 4L, 5L, 5L, 9L),
`self-esteem` = c(4.9, NA, NA, 6.9, 6.7, NA)), .Names = c("id", "time", "extraversion",
"satisfaction", "self-esteem"), row.names = c(NA, 6L), class = "data.frame")
Note: I realise that the missing values in my extraversion variable coincide with those in my self-esteem variable.
To drop an entire id if they don't have any measurements for extraversion you could do:
library(data.table)
setDT(df)[, drop := all(is.na(extraversion)) ,by= id][!df$drop]
# id time extraversion satisfaction self-esteem drop
#1: 1 79 3.2 3 4.9 FALSE
#2: 1 95 NA 3 NA FALSE
#3: 3 28 2.0 5 6.9 FALSE
#4: 3 40 2.4 5 6.7 FALSE
#5: 3 52 NA 9 NA FALSE
Or you could use .I which I believe should be faster:
setDT(df)[df[,.I[!all(is.na(extraversion))], by = id]$V1]
Lastly, a base R solution could use ave (thanks to @thelatemail for the suggestion to make it shorter/more expressive):
df[!ave(is.na(df$extraversion), df$id, FUN = all),]
Assuming the data frame is named mydata, use a dplyr filter:
library(dplyr)
mydata %>%
group_by(id) %>%
filter(!all(is.na(extraversion))) %>%
ungroup()
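A base R variant of the same per-id filter, sketched on the example data (the names d, ids_with_data and kept are assumptions), stays in long format throughout, so the time variable is left untouched:

```r
d <- data.frame(
  id           = c(1L, 1L, 2L, 3L, 3L, 3L),
  time         = c(79L, 95L, 79L, 28L, 40L, 52L),
  extraversion = c(3.2, NA, NA, 2, 2.4, NA)
)

# ids that have at least one non-missing extraversion measurement
ids_with_data <- unique(d$id[!is.na(d$extraversion)])

# keep every row (including NA rows) that belongs to one of those ids
kept <- d[d$id %in% ids_with_data, ]
kept
```

Here id 2 is dropped entirely, while the NA rows of ids 1 and 3 are kept because those ids have at least one measurement.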
d <-
structure(
list(
id = c(1L, 1L, 2L, 3L, 3L, 3L),
time = c(79L, 95L, 79L, 28L, 40L, 52L),
extraversion = c(3.2, NA, NA, 2, 2.4, NA),
satisfaction = c(3L, 3L, 4L, 5L, 5L, 9L),
`self-esteem` = c(4.9, NA, NA, 6.9, 6.7, NA)
),
.Names = c("id", "time", "extraversion",
"satisfaction", "self-esteem"),
row.names = c(NA, 6L),
class = "data.frame"
)
d[complete.cases(d$extraversion), ]  # rows where extraversion is not missing (row-wise, not per id)
d[is.na(d$extraversion), ]  # rows where extraversion is missing
complete.cases is great if you want to remove any rows with missing data: complete.cases(d)

Mutate repeats first row value

I have a dataset with taxonomy assignment and I want to extract the genus in a new column.
library(tidyverse)
library(magrittr)
library(stringr)
df <- structure(list(C043 = c(18361L, 59646L, 27575L, 163L, 863L, 3319L,
0L, 6L), C057 = c(20020L, 97610L, 13427L, 1L, 161L, 237L, 2L,
105L), taxonomy = structure(c(3L, 2L, 1L, 6L, 4L, 4L, 5L, 2L), .Label = c("k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri"
), class = "factor")), .Names = c("C043", "C057", "taxonomy"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 8L, 10L), class = "data.frame")
So this is my function (it works)
extract_genus <- function(str){
genus <- str_split(str, pattern = ";")[[1]][6]
genus %<>% str_sub(start = 4) #%>% as.character
return(genus)
}
But when I apply it in mutate (with or without as.character), it repeats the first row's value in the new column.
df %>% mutate(genus = extract_genus(taxonomy))
C043 C057 taxonomy genus
1 18361 20020 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli Escherichia
2 59646 97610 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae Escherichia
3 27575 13427 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA Escherichia
4 163 1 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri Escherichia
5 863 161 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__ Escherichia
When I use sapply (but I don't want to, I want a solution with dplyr pipeline), it works.
df_group_gen$genus <- sapply(df_group_gen$taxonomy, extract_genus)
C043 C057 taxonomy genus
1 18361 20020 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli Escherichia
2 59646 97610 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae Enterobacter
3 27575 13427 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA Enterobacter
4 163 1 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri Pseudomonas
5 863 161 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__ Klebsiella
Why doesn't mutate compute what we would expect? I found this question, but no explanation is provided, only ad hoc code.
Thank you :)
You can Vectorize your function to allow mutate to occur on every row:
ex_gen <- Vectorize(extract_genus, vectorize.args='str')
df %>% mutate(genus=ex_gen(taxonomy))
Alternatively, you can use rowwise to mutate each row:
df %>%
rowwise() %>%
mutate(genus = extract_genus(taxonomy))
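The underlying cause is that str_split returns a list with one element per input string, and [[1]] keeps only the first of them; mutate then recycles that single value down the column. A vectorised version can be sketched in base R as follows; the function name extract_genus_vec and the shortened taxonomy strings are assumptions for illustration.

```r
taxonomy <- c(
  "k__Bacteria;p__P;c__G;o__E;f__Ent;g__Escherichia;s__coli",
  "k__Bacteria;p__P;c__G;o__P;f__Pse;g__Pseudomonas;s__"
)

extract_genus_vec <- function(x) {
  # split every string, giving a list with one character vector per element
  parts <- strsplit(as.character(x), ";", fixed = TRUE)
  # take the sixth field of each element instead of only the first element
  genus <- vapply(parts, function(p) p[6], character(1))
  # drop the three-character "g__" prefix
  substring(genus, 4)
}

extract_genus_vec(taxonomy)
```

A function written this way returns one value per input, so it can be used directly inside mutate without Vectorize or rowwise.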

Add rows when values in columns are equal in df

For a sample dataframe:
df <- structure(list(animal.1 = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 3L, 1L, 1L), .Label = c("cat", "dog", "horse", "rabbit"), class = "factor"),
animal.2 = structure(c(1L, 2L, 2L, 2L, 4L, 4L, 1L, 1L, 3L,
1L), .Label = c("cat", "dog", "hamster", "rabbit"), class = "factor"),
number = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L)), .Names = c("animal.1",
"animal.2","number"), class = "data.frame", row.names = c(NA,
-10L))
... I wish to make a new df with 'animal' duplicates all added together. For example multiple rows with the same animal in columns 1 and 2 will be put together. So for example the dataframe above would read:
cat cat 16
dog dog 7
cat dog 3 etc. etc... (those with different animals would be left as they are). Importantly the sum of 'number' in both dataframes would be the same.
My real df has >400K observations, so anything anyone could recommend that copes with a large dataset would be great!
Thanks in advance.
One option would be to use data.table. Convert the "data.frame" to a "data.table" (setDT(df)); where "animal.1" equals "animal.2", replace "number" with the sum of "number" after grouping by the two columns; finally, get the unique rows.
library(data.table)
setDT(df)[as.character(animal.1)==as.character(animal.2),
number:=sum(number) ,.(animal.1, animal.2)]
unique(df)
# animal.1 animal.2 number
#1: cat cat 16
#2: cat dog 3
#3: dog dog 7
#4: dog rabbit 1
#5: rabbit rabbit 4
#6: rabbit cat 6
#7: horse cat 7
#8: cat hamster 1
Or an option with dplyr. The approach is similar to data.table. We group by "animal.1", "animal.2", then replace the "number" with sum only when "animal.1" is equal to "animal.2", and get the unique rows
library(dplyr)
df %>%
group_by(animal.1, animal.2) %>%
mutate(number=replace(number,as.character(animal.1)==
as.character(animal.2),
sum(number))) %>%
unique()
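An alternative that avoids relying on unique() is to sum the same-animal rows separately and bind them back to the untouched rows. This is a base R sketch on a reduced version of the data; the object names same, summed and result are assumptions. Unlike unique(), it never drops a mixed-animal row that happens to duplicate another one, so the total of number is preserved.

```r
df <- data.frame(
  animal.1 = c("cat", "cat", "dog", "dog", "cat"),
  animal.2 = c("cat", "dog", "dog", "dog", "cat"),
  number   = c(5L, 3L, 2L, 5L, 11L),
  stringsAsFactors = FALSE
)

# rows where both columns name the same animal
same <- df$animal.1 == df$animal.2

# sum 'number' within each same-animal pair
summed <- aggregate(number ~ animal.1 + animal.2, data = df[same, ], FUN = sum)

# put the summed rows back together with the rows that are left as they are
result <- rbind(summed, df[!same, ])
result
```

With factor columns, as in the original data, the comparison would need as.character() on both sides, as in the answers above.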
