I have a data set as shown in the Input table below. I want to combine rows (4,5,6), rows (8,9) and rows (11,12) of the Input table so that each group becomes a single row sharing one ID, as shown in rows 4, 6 and 8 of the Output table below.
I tried merge(), but that didn't work as expected. The key here is the ID column which has unique values.
Any suggestions on how I can achieve this efficiently?
Input
Row Name Val1 Val2 Unit ID
1 -0.5 5.5 V UI-001
2 a -0.5 2.5 V UI-002
3 b -0.5 5.5 V UI-003
4 c -0.5 5.5 V UI-004
5 d
6 e
7 -45 125 Ohms UI-005
8 f 2 kV UI-006
9 g
10 h 500 V UI-007
11 i 15 kV UI-008
12 j
13 k UI-009
dput() of Input
structure(list(Name = c(NA, "a", "b", "c", "d", "e", NA, "f",
"g", "h", "i", "j", "k"), Val1 = c(-0.5, -0.5, -0.5, -0.5, NA,
NA, -45, 2, NA, 500, 15, NA, NA), Val2 = c(5.5, 2.5, 5.5, 5.5,
NA, NA, 125, NA, NA, NA, NA, NA, NA), Unit = c("V", "V", "V",
"V", NA, NA, "Ohms", "kV", NA, "V", "kV", NA, NA), ID = c("UI-001",
"UI-002", "UI-003", "UI-004", NA, NA, "UI-005", "UI-006", NA,
"UI-007", "UI-008", NA, "UI-009")), row.names = c(NA, -13L), class =
c("tbl_df", "tbl", "data.frame"))
Output
Row Name Val1 Val2 Unit ID
1 -0.5 5.5 V UI-001
2 a -0.5 2.5 V UI-002
3 b -0.5 5.5 V UI-003
4 cde -0.5 5.5 V UI-004
5 -45 125 Ohms UI-005
6 fg 2 kV UI-006
7 h 500 V UI-007
8 ij 15 kV UI-008
9 k UI-009
dput() of Output
structure(list(Name = c(NA, "a", "b", "cde", NA, "fg", "h", "ij",
"k"), Val1 = c(-0.5, -0.5, -0.5, -0.5, -45, 2, 500, 15, NA),
Val2 = c(5.5, 2.5, 5.5, 5.5, 125, NA, NA, NA, NA), Unit = c("V",
"V", "V", "V", "Ohms", "kV", "V", "kV", NA), ID = c("UI-001",
"UI-002", "UI-003", "UI-004", "UI-005", "UI-006", "UI-007",
"UI-008", "UI-009")), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
We may use
out <- df[!is.na(df$ID), ]
out$Name[!is.na(out$Name)] <- tapply(df$Name, cumsum(!is.na(df$ID)), paste, collapse = "")[!is.na(out$Name)]
out
# Name Val1 Val2 Unit ID
# 1 <NA> -0.5 5.5 V UI-001
# 2 a -0.5 2.5 V UI-002
# 3 b -0.5 5.5 V UI-003
# 4 cde -0.5 5.5 V UI-004
# 7 <NA> -45.0 125.0 Ohms UI-005
# 8 fg 2.0 NA kV UI-006
# 10 h 500.0 NA V UI-007
# 11 ij 15.0 NA kV UI-008
# 13 k NA NA <NA> UI-009
The first line gets rid of all the rows where ID is NA. Then
tapply(df$Name, cumsum(!is.na(df$ID)), paste, collapse = "")
# 1 2 3 4 5 6 7 8 9
# "NA" "a" "b" "cde" "NA" "fg" "h" "ij" "k"
constructs the correct values for Name and !is.na(out$Name) gives us which rows of out should be modified (which is needed since "NA" isn't the same as NA).
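To see why cumsum(!is.na(df$ID)) works as a grouping key, here is a minimal base R sketch using the ID and Name columns from the question. The helper collapse_names() is not part of the answer above; it is a hypothetical variant that skips NA values, so the "NA" string never appears and no second indexing step is needed.

```r
# ID and Name columns copied from the question's input
id <- c("UI-001", "UI-002", "UI-003", "UI-004", NA, NA, "UI-005",
        "UI-006", NA, "UI-007", "UI-008", NA, "UI-009")
name <- c(NA, "a", "b", "c", "d", "e", NA, "f", "g", "h", "i", "j", "k")

# Every non-NA ID starts a new group; NA rows inherit the current group number
grp <- cumsum(!is.na(id))

# Hypothetical helper: paste the non-NA names, keep a real NA if there are none
collapse_names <- function(x) {
  if (all(is.na(x))) NA_character_ else paste(x[!is.na(x)], collapse = "")
}

combined <- as.vector(tapply(name, grp, collapse_names))
print(grp)
print(combined)
```

Because each group yields either a real NA or the pasted names, combined can be assigned to out$Name directly.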
Also a dplyr possibility:
df %>%
mutate(grp = ifelse((is.na(lead(ID, default = last(ID))) & !is.na(ID)) | is.na(ID), 1, 0),
grp = ifelse(grp != 0, cumsum(grp != lag(grp, 1, default = first(grp))), 0)) %>%
group_by(grp) %>%
mutate(Name = ifelse(grp != 0, paste(Name, collapse = ""), Name)) %>%
filter(!is.na(ID)) %>%
ungroup() %>%
select(-grp)
Name Val1 Val2 Unit ID
<chr> <dbl> <dbl> <chr> <chr>
1 <NA> -0.500 5.50 V UI-001
2 a -0.500 2.50 V UI-002
3 b -0.500 5.50 V UI-003
4 cde -0.500 5.50 V UI-004
5 <NA> -45.0 125. Ohms UI-005
6 fg 2.00 NA kV UI-006
7 h 500. NA V UI-007
8 ij 15.0 NA kV UI-008
9 k NA NA <NA> UI-009
First, it creates a grouping variable for NA cases on "ID" and the last non-NA cases on "ID" before those NA cases. Then, it groups by that grouping variable and combines the values from "Name" into one. Finally, it filters out the cases where "ID" is NA and removes the redundant grouping variable.
Or the same using rleid() from data.table to more conveniently create the grouping variable:
df %>%
mutate(grp = ifelse((is.na(lead(ID, default = last(ID))) & !is.na(ID)) | is.na(ID), 1, 0),
grp = ifelse(grp == 1, rleid(grp), grp)) %>%
group_by(grp) %>%
mutate(Name = ifelse(grp != 0, paste(Name, collapse = ""), Name)) %>%
filter(!is.na(ID)) %>%
ungroup() %>%
select(-grp)
Or a different possibility using fill():
df %>%
mutate(ID_temp = ID) %>%
fill(ID, .direction = "down") %>%
group_by(ID) %>%
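# skip NA names so that an all-NA group stays NA instead of becoming the string "NA"
mutate(Name = if (all(is.na(Name))) NA_character_ else paste(Name[!is.na(Name)], collapse = "")) %>%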
filter(!is.na(ID_temp)) %>%
select(-ID_temp)
Here, you are filling the missing "ID" values with the previous non-missing value, grouping by it, and then combining the rows per groups.
Related
I have a few large dataframes in RStudio, that have this structure:
Original data structure
structure(list(CHROM = c("scaffold1000|size223437", "scaffold1000|size223437",
"scaffold1000|size223437", "scaffold1000|size223437"), POS = c(666,
1332, 3445, 4336), REF = c("A", "TA", "CTTGA", "GCTA"), RO = c(20,
14, 9, 25), ALT_1 = c("GAT", "TGC", "AGC", "T"), ALT_2 = c("CAG",
"TGA", "CGC", NA), ALT_3 = c("G", NA, "TGA", NA), ALT_4 = c("AGT",
NA, NA, NA), AO_1 = c(13, 4, 67, 120), AO_2 = c(12, 5, 34, NA
), AO_3 = c(6, NA, 18, NA), AO_4 = c(101, NA, NA, NA), AOF_1 = c(8.55263157894737,
17.3913043478261, 52.34375, 82.7586206896552), AOF_2 = c(7.89473684210526,
21.7391304347826, 26.5625, NA), AOF_3 = c(3.94736842105263, NA,
14.0625, NA), AOF_4 = c(66.4473684210526, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
But for an analysis I need it to look like this:
Desired output
structure(list(CHROM = c("scaffold1000|size223437", "scaffold1000|size223437",
"scaffold1000|size223437", "scaffold1000|size223437"), POS = c(666,
1332, 3445, 4336), REF = c("A", "TA", "CTTGA", "GCTA"), RO = c(20,
14, 9, 25), ALT_1 = c("AGT", "TGA", "AGC", "T"), ALT_2 = c("CAG",
"TGC", "CGC", NA), ALT_3 = c("G", NA, "TGA", NA), ALT_4 = c("GAT",
NA, NA, NA), AO_1 = c(101, 5, 67, 120), AO_2 = c(12, 4, 34, NA
), AO_3 = c(6, NA, 18, NA), AO_4 = c(13, NA, NA, NA), AOF_1 = c(66.4473684210526,
21.7391304347826, 52.34375, 82.7586206896552), AOF_2 = c(7.89473684210526,
17.3913043478261, 26.5625, NA), AOF_3 = c(3.94736842105263, NA,
14.0625, NA), AOF_4 = c(8.55263157894737, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
So what I would like to do is rearrange the contents of each row so that the columns ALT_1, ALT_2, ALT_3, ALT_4 are sorted alphabetically, while also rearranging the corresponding AO and AOF columns so that the values still match.
(The value of AO_1 should still match the sequence that was in ALT_1.
So if ALT_1 becomes ALT_2 in the sorted dataframe, AO_1 should also become AO_2.)
What I tried so far, but didn't work:
Pasting the values of ALT_1, AO_1, AOF_1 all in one field, so I have them together with
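for (i in seq_len(nrow(X))) {
  # paste the ALT, AO and AOF values of one allele into a single field
  if (!is.na(X[i, 6])) {
    X[i, 6] <- paste(X[i, 6], X[i, 10], X[i, 14], sep = " ")
  }
}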
And then I wanted to extract every row as a vector to sort the values and put it back in the dataframe, but I didn't manage to do this.
So the question would be how I can order the dataframe to get the desired output?
(I need to apply this to 32 dataframes with each having >100.000 values)
Here is a dplyr solution. It took me some time, and I needed some help (see "pivot_wider dissolves arrange"):
library(dplyr)
library(tidyr)
df1 %>%
mutate(id = row_number()) %>%
unite("conc1", c(ALT_1, AO_1, AOF_1), sep = "_") %>%
unite("conc2", c(ALT_2, AO_2, AOF_2), sep = "_") %>%
unite("conc3", c(ALT_3, AO_3, AOF_3), sep = "_") %>%
unite("conc4", c(ALT_4, AO_4, AOF_4), sep = "_") %>%
pivot_longer(
starts_with("conc")
) %>%
mutate(value = ifelse(value=="NA_NA_NA", NA_character_, value)) %>%
group_by(id) %>%
mutate(value = sort(value, na.last = TRUE)) %>%
ungroup() %>%
pivot_wider(
names_from = name,
values_from = value,
values_fill = "0"
) %>%
separate(conc1, c("ALT_1", "AO_1", "AOF_1"), sep = "_") %>%
separate(conc2, c("ALT_2", "AO_2", "AOF_2"), sep = "_") %>%
separate(conc3, c("ALT_3", "AO_3", "AOF_3"), sep = "_") %>%
separate(conc4, c("ALT_4", "AO_4", "AOF_4"), sep = "_") %>%
select(CHROM, POS, REF, RO, starts_with("ALT"), starts_with("AO_"), starts_with("AOF_")) %>%
type.convert(as.is=TRUE)
CHROM POS REF RO ALT_1 ALT_2 ALT_3 ALT_4 AO_1 AO_2 AO_3 AO_4 AOF_1 AOF_2 AOF_3 AOF_4
<chr> <int> <chr> <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 scaffold1000|size223437 666 A 20 AGT CAG G GAT 101 12 6 13 66.4 7.89 3.95 8.55
2 scaffold1000|size223437 1332 TA 14 TGA TGC NA NA 5 4 NA NA 21.7 17.4 NA NA
3 scaffold1000|size223437 3445 CTTGA 9 AGC CGC TGA NA 67 34 18 NA 52.3 26.6 14.1 NA
4 scaffold1000|size223437 4336 GCTA 25 T NA NA NA 120 NA NA NA 82.8 NA NA NA
Here is a data.table approach:
library(data.table)
# Set to data.table format
setDT(mydata)
# Melt to long format
DT.melt <- melt(mydata, measure.vars = patterns(ALT = "^ALT_", AO = "^AO_", AOF = "^AOF_"))
# order by groups, na's at the end
setorderv(DT.melt, cols = c("CHROM", "POS", "ALT"), na.last = TRUE)
# cast to wide again, use rowid() for numbering
dcast(DT.melt, CHROM + POS + REF + RO ~ rowid(REF), value.var = list("ALT", "AO", "AOF"))
# CHROM POS REF RO ALT_1 ALT_2 ALT_3 ALT_4 AO_1 AO_2 AO_3 AO_4 AOF_1 AOF_2 AOF_3 AOF_4
# 1: scaffold1000|size223437 666 A 20 AGT CAG G GAT 101 12 6 13 66.44737 7.894737 3.947368 8.552632
# 2: scaffold1000|size223437 1332 TA 14 TGA TGC <NA> <NA> 5 4 NA NA 21.73913 17.391304 NA NA
# 3: scaffold1000|size223437 3445 CTTGA 9 AGC CGC TGA <NA> 67 34 18 NA 52.34375 26.562500 14.062500 NA
# 4: scaffold1000|size223437 4336 GCTA 25 T <NA> <NA> <NA> 120 NA NA NA 82.75862 NA NA NA
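For comparison, the same per-row reordering can be sketched in base R with matrix indexing. This is only an illustration, not taken from either answer: it assumes a reduced copy of the question's data (first two rows, ALT and AO columns only), and the AOF columns would be handled the same way as AO.

```r
# Reduced copy of the question's data: rows 1-2, ALT_* and AO_* only
df1 <- data.frame(
  POS   = c(666, 1332),
  ALT_1 = c("GAT", "TGC"), ALT_2 = c("CAG", "TGA"),
  ALT_3 = c("G", NA),      ALT_4 = c("AGT", NA),
  AO_1  = c(13, 4),  AO_2 = c(12, 5),
  AO_3  = c(6, NA),  AO_4 = c(101, NA),
  stringsAsFactors = FALSE
)

alt_cols <- grep("^ALT_", names(df1), value = TRUE)
ao_cols  <- grep("^AO_",  names(df1), value = TRUE)

alt <- as.matrix(df1[alt_cols])
ao  <- as.matrix(df1[ao_cols])

# One ordering per row, derived from the ALT values (NA sorted last)
ord <- t(apply(alt, 1, order, na.last = TRUE))

# (row, column) index pairs in column-major order of the result matrix
idx <- cbind(rep(seq_len(nrow(alt)), times = ncol(alt)), as.vector(ord))

# Apply the same ordering to ALT and to the matching AO values
df1[alt_cols] <- matrix(alt[idx], nrow = nrow(alt))
df1[ao_cols]  <- matrix(ao[idx],  nrow = nrow(ao))
print(df1)
```

Computing the ordering once and reusing it for every column block is what keeps ALT, AO and AOF aligned.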
help <- data.frame(
id = c(100, 100, 101, 102, 102),
q1 = c(NA, 1, NA, NA, 3),
q2 = c(1, NA, 2, NA, NA),
q3 = c(NA, 1, NA, 4, NA),
q4 = c(NA, NA, 4, NA, 5),
group = c("a", "b", "c", "a", "c"))
help$group <- as.character(help$group)
I am trying to pivot longer so dataset looks like this:
id score group
100 NA a
100 1 b
100 NA c
...
But I get an error with the numeric values of q1-q4 and the character string group.
pivot_longer(help, !id, names_to = "score",
values_to = "group", values_ptypes = list(group = 'character'))
Error: Can't convert <double> to <character>.
How can I pivot longer while preserving the group variable? (Many of the q1-q4 values are missing, but every row has an id and a group.)
library(tidyr)
output <- pivot_longer(help, -c(id, group), names_to = "question",
values_to = "score") %>%
dplyr::select(-question) %>%
dplyr::arrange(id, group)
Output
head(output)
# A tibble: 6 × 3
id group score
<dbl> <chr> <dbl>
1 100 a NA
2 100 a 1
3 100 a NA
4 100 a NA
5 100 b 1
6 100 b NA
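A likely cause of the error in the question is that !id also selects the character column group for pivoting, so numeric and character values would have to share one column. The base R sketch below shows the same reshaping without tidyr, as an illustration only. The name help_df is an assumption, used here to avoid masking the help() function.

```r
# Same data as in the question, under the assumed name help_df
help_df <- data.frame(
  id = c(100, 100, 101, 102, 102),
  q1 = c(NA, 1, NA, NA, 3),
  q2 = c(1, NA, 2, NA, NA),
  q3 = c(NA, 1, NA, 4, NA),
  q4 = c(NA, NA, 4, NA, 5),
  group = c("a", "b", "c", "a", "c"),
  stringsAsFactors = FALSE
)

q_cols <- c("q1", "q2", "q3", "q4")

# Stack only the numeric q columns; repeat id and group once per stacked column
long <- data.frame(
  id    = rep(help_df$id, times = length(q_cols)),
  group = rep(help_df$group, times = length(q_cols)),
  score = unlist(help_df[q_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long <- long[order(long$id, long$group), ]
print(head(long))
```

Only q1 to q4 are stacked, so score stays numeric and group stays character.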
I want to replace the NA values for observations within a particular sub-group, but the sequence of the observations in that group is not ordered properly. So I am wondering if there exists some dplyr or plyr command that would allow me to replace missing values in a column belonging to one dataframe using the values from the same column from another dataframe while matching on the values of that "key" column.
Here's what I got. Hope someone could shed light on this. Thanks.
## data frame that contains missing values in "diff" column
df <- data.frame(type = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3),
diff = c(0.1, 0.3, NA, NA, NA, NA, NA, 0.2, 0.7, NA, 0.5, NA),
name = c("A", "B", "C", "D", "E", "A", "B", "C", "F", "A", "B", "C"))
## replace with values from this smaller data frame
df2 <- data.frame(diff_rep = c(0.3, 0.2, 0.4), name = c("A", "B", "C"))
## replace using ifelse
df$diff <- ifelse(is.na(df$diff) & (df$type == 2), df2$diff_rep , df$diff)
df
type diff name
1 1 0.1 A
2 1 0.3 B
3 1 NA C
4 2 0.3 D
5 2 0.2 E
6 2 0.4 A
7 2 0.3 B
8 2 0.2 C
9 2 0.7 F
10 3 NA A
11 3 0.5 B
12 3 NA C
## desired output
type diff name
1 1 0.1 A
2 1 0.3 B
3 1 NA C
4 2 NA D
5 2 NA E
6 2 0.3 A
7 2 0.2 B
8 2 0.4 C
9 2 0.7 F
10 3 NA A
11 3 0.5 B
12 3 NA C
Assuming row 9 is a mistake, you can use a left join first and then use ifelse() and coalesce() to get your desired result. coalesce() returns the first non-missing value.
left_join(df, df2, by = "name") %>%
mutate(diff_wanted = if_else(type == 2,
coalesce(diff, diff_rep),
diff),
diff_wanted = ifelse(name %in% df2$name,
diff_wanted,
NA)) %>%
select(type, diff_wanted, name)
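The ifelse() attempt in the question fails because df2$diff_rep has only three values, which R recycles by position instead of looking them up by name. As a hedged base R sketch of a key-based lookup (not the answer's method), match() can find the right replacement for each name. Unlike the join above, it only touches NA values, so rows 8 and 9 keep their existing values.

```r
df <- data.frame(
  type = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3),
  diff = c(0.1, 0.3, NA, NA, NA, NA, NA, 0.2, 0.7, NA, 0.5, NA),
  name = c("A", "B", "C", "D", "E", "A", "B", "C", "F", "A", "B", "C"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(diff_rep = c(0.3, 0.2, 0.4), name = c("A", "B", "C"),
                  stringsAsFactors = FALSE)

# Rows to fill: missing diff within type 2
needs_fill <- is.na(df$diff) & df$type == 2

# match() returns the row of df2 for each name, or NA when the name is absent,
# so names without a replacement (D, E) simply stay NA
df$diff[needs_fill] <- df2$diff_rep[match(df$name[needs_fill], df2$name)]
print(df)
```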
I have a data frame like below:
How do I remove the NA values and shift the values below them up?
Thanks
id name.america name.europe name.asia
1 a <NA> <NA>
2 <NA> b <NA>
3 <NA> <NA> c
4 d <NA> <NA>
Change to:
id name.america name.europe name.asia
1 a b c
2 d
We can loop through the columns and drop the NA values, then pad each list element with NA at the end so that all elements have the length of the longest one. Finally, subset the 'id' column to that length and bind it to the result.
lst <- lapply(df1[-1], na.omit)
lst1 <- lapply(lst, `length<-`, max(lengths(lst)))
out <- data.frame(lst1)
out1 <- cbind(id = df1$id[seq_len(nrow(out))], out)
out1
# id name.america name.europe name.asia
#1 1 a b c
#2 2 d <NA> <NA>
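The padding step is the less obvious part: assigning a larger length to a vector appends NA values. A minimal sketch on plain vectors, with column values copied from the question and an anonymous function standing in for `length<-`:

```r
cols <- list(
  america = c("a", NA, NA, "d"),
  europe  = c(NA, "b", NA, NA),
  asia    = c(NA, NA, "c", NA)
)

# Drop the NA values from each column
compact <- lapply(cols, function(x) x[!is.na(x)])

# Pad every element to the length of the longest one;
# extending a vector's length fills the new slots with NA
n <- max(lengths(compact))
padded <- lapply(compact, function(x) {
  length(x) <- n
  x
})

print(as.data.frame(padded, stringsAsFactors = FALSE))
```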
If we need NA to be changed to blanks ("") - not recommended
out1[is.na(out1)] <- ""
data
df1 <- structure(list(id = 1:4, name.america = c("a", NA, NA, "d"),
name.europe = c(NA, "b", NA, NA), name.asia = c(NA, NA, "c",
NA)), class = "data.frame", row.names = c(NA, -4L))
tidyverse-based solution
require(tidyverse)
df1 %>%
gather(key = "name", value = "val", -id) %>%
na.omit() %>%
select(-id) %>%
group_by(name) %>%
mutate(id = 1:n()) %>%
spread(key = name, value = val)
Results
# A tibble: 2 x 4
id name.america name.asia name.europe
<int> <chr> <chr> <chr>
1 1 a c b
2 2 d NA NA
Notes
If desired, you can re-order the columns with select(), or order that variable prior to the transformation.
NAs are left as such. If desired, you can use tidyr::replace_na to insert some string or space. I would discourage you from doing that.
Data
Taken from @akrun's answer above.
df1 <- structure(
list(
id = 1:4,
name.america = c("a", NA, NA, "d"),
name.europe = c(NA, "b", NA, NA),
name.asia = c(NA, NA, "c",
NA)
),
class = "data.frame",
row.names = c(NA, -4L)
)
df1[, -1] <- lapply(df1[,-1], function(x) c(na.omit(x), rep("",length(x)-length(na.omit(x)))))
df1[1:max(colSums(!(df1[,-1]==""))),]
# id name.america name.europe name.asia
#1 1 a b c
#2 2 d
Here is the sample dataframe:
df <- data.frame(
id = c("A", "A", "A", "A", "B", "B", "B", "B"),
num = c(1, NA, 6, 3, 7, NA , NA, 2))
How do I get the forward and backward differences between rows within each id category? There should be two new columns: one with the difference between the current row and the previous row, and the other with the difference between the current row and the next row. If the previous row is NA, the difference should be taken against the nearest previous row that contains a number. The same holds for the forward difference.
Many thanks!!
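# Difference between x[i] and the nearest non-NA value before it
# (or after it, if forward = TRUE); NA when no such value exists.
nearest_diff <- function(x, forward = FALSE) {
  sapply(seq_along(x), function(i) {
    idx <- if (forward) i + seq_len(length(x) - i) else rev(seq_len(i - 1))
    x[i] - x[idx][!is.na(x[idx])][1]
  })
}
# ave() applies the function within each id, so differences never cross groups
df$backdiff <- ave(df$num, df$id, FUN = nearest_diff)
df$forward.diff <- ave(df$num, df$id, FUN = function(x) nearest_diff(x, forward = TRUE))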
One solution uses the fill() function from tidyr to create two helper columns (one for the previous-value and one for the next-value calculation) in which the NA values are filled in.
df <- data.frame(
id = c("A", "A", "A", "A", "B", "B", "B", "B"),
num = c(1, NA, 6, 3, 7, NA , NA, 2))
library("tidyverse")
df %>% mutate(dup_num_prv = num, dup_num_nxt = num) %>%
group_by(id) %>%
fill(dup_num_prv, .direction = "down") %>%
fill(dup_num_nxt, .direction = "up") %>%
mutate(prev_diff = ifelse(is.na(num), NA, num - lag(dup_num_prv))) %>%
mutate(next_diff = ifelse(is.na(num), NA, num - lead(dup_num_nxt))) %>%
as.data.frame()
# Result is shown in columns 'prev_diff' and 'next_diff'
# id num dup_num_prv dup_num_nxt prev_diff next_diff
#1 A 1 1 1 NA -5
#2 A NA 1 6 NA NA
#3 A 6 6 6 5 3
#4 A 3 3 3 -3 NA
#5 B 7 7 7 NA 5
#6 B NA 7 2 NA NA
#7 B NA 7 2 NA NA
#8 B 2 2 2 -5 NA
Note: A few points in the question still need clarification from the OP; the solution can be fine-tuned afterwards. dup_num_prv and dup_num_nxt are kept only to make the steps easier to follow and can be removed.
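For readers without tidyr, the downward fill that produces dup_num_prv can be sketched in base R. The helper locf() ("last observation carried forward") is a hypothetical name, not a function from any package used above.

```r
# Hypothetical helper: replace each NA with the most recent non-NA value
locf <- function(x) {
  # position of the latest non-NA element seen so far (0 if none yet)
  last_seen <- cummax(seq_along(x) * !is.na(x))
  x[ifelse(last_seen == 0, NA, last_seen)]
}

id  <- c("A", "A", "A", "A", "B", "B", "B", "B")
num <- c(1, NA, 6, 3, 7, NA, NA, 2)

# ave() applies locf() separately within each id
filled_down <- ave(num, id, FUN = locf)
print(filled_down)
```

Filling upward works the same way on the reversed vector, for example rev(locf(rev(x))).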