I am trying to move data from one column to another, due to the underlying forms being filled out incorrectly.
The form collects information on a household and asks for the age (AGE) and gender (SEX) of each member, allowing up to 5 people per household. However, some users have filled in information for persons 1, 3 and 4 but left person 2 blank, because they filled out person 2 incorrectly, crossed out the details, and wrote person 2's details into the person 3 boxes, and so on.
The data looks like this (ref 1 and 5 are correct in this data, all others are incorrect)
df <- data.frame(
ref = c(1, 2, 3, 4, 5, 6),
AGE1 = c(45, 36, 26, 47, 24, NA),
AGE2 = c(NA, 24, NA, 13, 57, 28),
AGE3 = c(NA, NA, 35, NA, NA, 26),
AGE4 = c(NA, NA, 15, 11, NA, NA),
AGE5 = c(NA, 15, NA, NA, NA, NA),
SEX1 = c("M", "F", "M", "M", "M", NA),
SEX2 = c(NA, "M", NA, "F", "F", "F"),
SEX3 = c(NA, NA, "M", NA, NA, "M"),
SEX4 = c(NA, NA, "F", "F", NA, NA),
SEX5 = c(NA, "F", NA, NA, NA, NA)
)
This is what the table looks like currently
(I have replaced NA with - to make reading easier)
ref  AGE1  AGE2  AGE3  AGE4  AGE5  SEX1  SEX2  SEX3  SEX4  SEX5
1    45    -     -     -     -     M     -     -     -     -
2    36    24    -     -     15    F     M     -     -     F
3    26    -     35    15    -     M     -     M     F     -
4    47    13    -     11    -     M     F     -     F     -
5    24    57    -     -     -     M     F     -     -     -
6    -     28    26    -     -     -     F     M     -     -
But I would like it to look like this:
ref  AGE1  AGE2  AGE3  AGE4  AGE5  SEX1  SEX2  SEX3  SEX4  SEX5
1    45    -     -     -     -     M     -     -     -     -
2    36    24    15    -     -     F     M     F     -     -
3    26    35    15    -     -     M     M     F     -     -
4    47    13    11    -     -     M     F     F     -     -
5    24    57    -     -     -     M     F     -     -     -
6    28    26    -     -     -     F     M     -     -     -
Is there a way of correcting this using dplyr? If not, is there another way in R of correcting the data?
Here is a way using dplyr and tidyr. The approach involves pivoting the data to longer format, sorting the NA values to the end, renumbering the column names, and then pivoting to wide form again.
library(dplyr)
library(tidyr)
# df is the data frame defined in the question
df %>%
mutate(across(starts_with("AGE"), as.character)) %>%
pivot_longer(2:11) %>%
separate(name, into = c("cat", "num"), 3) %>%
arrange(is.na(value)) %>%
group_by(ref, cat) %>%
mutate(num = seq_along(value)) %>%
ungroup() %>%
arrange(cat) %>%
unite(name, cat, num, sep = "") %>%
pivot_wider(id_cols = ref) %>%
mutate(across(starts_with("AGE"), as.numeric))
# A tibble: 6 x 11
ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 1 45 NA NA NA NA M NA NA NA NA
2 2 36 24 15 NA NA F M F NA NA
3 3 26 35 15 NA NA M M F NA NA
4 4 47 13 11 NA NA M F F NA NA
5 5 24 57 NA NA NA M F NA NA NA
6 6 28 26 NA NA NA F M NA NA NA
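The reshaping above boils down to shifting the non-missing values in each row to the left. A rough base R sketch of that idea follows. The helper name shift_left is made up for illustration, and the sketch assumes that AGE and SEX are missing in the same positions within a row, as they are in the sample data; if they are not, the two sets of columns would drift out of step.

```r
# Sample data from the question
df <- data.frame(
  ref = c(1, 2, 3, 4, 5, 6),
  AGE1 = c(45, 36, 26, 47, 24, NA),
  AGE2 = c(NA, 24, NA, 13, 57, 28),
  AGE3 = c(NA, NA, 35, NA, NA, 26),
  AGE4 = c(NA, NA, 15, 11, NA, NA),
  AGE5 = c(NA, 15, NA, NA, NA, NA),
  SEX1 = c("M", "F", "M", "M", "M", NA),
  SEX2 = c(NA, "M", NA, "F", "F", "F"),
  SEX3 = c(NA, NA, "M", NA, NA, "M"),
  SEX4 = c(NA, NA, "F", "F", NA, NA),
  SEX5 = c(NA, "F", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: within each row, move the non-NA values of `cols`
# to the front (keeping their original order) and pad the end with NA.
# Assumes `cols` names at least two columns of the same type.
shift_left <- function(data, cols) {
  m <- as.matrix(data[cols])
  shifted <- t(apply(m, 1, function(x) c(x[!is.na(x)], x[is.na(x)])))
  for (j in seq_along(cols)) {
    data[[cols[j]]] <- unname(shifted[, j])
  }
  data
}

out <- shift_left(df, paste0("AGE", 1:5))
out <- shift_left(out, paste0("SEX", 1:5))
out
```

Because the values are only shifted and never sorted, household members keep the order in which they were entered on the form.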
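Here's another way using dplyr and tidyr. Note that arrange(ref, AGE, SEX) moves the NA values to the end by sorting each household by age, so the members come out in ascending age order rather than in their original order, and the columns are named AGE_1, SEX_1, and so on.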
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -ref,
names_to = c('.value', 'num'),
names_pattern = '([A-Z]+)(\\d+)') %>%
arrange(ref, AGE, SEX) %>%
group_by(ref) %>%
mutate(num = row_number()) %>%
ungroup %>%
pivot_wider(names_from = num, values_from = c(AGE, SEX))
# ref AGE_1 AGE_2 AGE_3 AGE_4 AGE_5 SEX_1 SEX_2 SEX_3 SEX_4 SEX_5
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
#1 1 45 NA NA NA NA M NA NA NA NA
#2 2 15 24 36 NA NA F M F NA NA
#3 3 15 26 35 NA NA F M M NA NA
#4 4 11 13 47 NA NA F F M NA NA
#5 5 24 57 NA NA NA M F NA NA NA
#6 6 26 28 NA NA NA M F NA NA NA
Try the base code below
u1 <- reshape(
setNames(df, sub("(\\d)", ".\\1", names(df))),
direction = "long",
idvar = "ref",
varying = -1
)
u2 <- reshape(
transform(
u1[with(u1, order(is.na(AGE), is.na(SEX))), ],
time = ave(time, ref, FUN = seq_along)
),
direction = "wide",
idvar = "ref"
)
out <- u2[match(names(df),sub("\\.","",names(u2)))]
and you will get
> out
ref AGE.1 AGE.2 AGE.3 AGE.4 AGE.5 SEX.1 SEX.2 SEX.3 SEX.4 SEX.5
1.1 1 45 NA NA NA NA M <NA> <NA> <NA> <NA>
2.1 2 36 24 15 NA NA F M F <NA> <NA>
3.1 3 26 35 15 NA NA M M F <NA> <NA>
4.1 4 47 13 11 NA NA M F F <NA> <NA>
5.1 5 24 57 NA NA NA M F <NA> <NA> <NA>
6.2 6 28 26 NA NA NA F M <NA> <NA> <NA>
data
df <- data.frame(
ref = c(1, 2, 3, 4, 5, 6),
AGE1 = c(45, 36, 26, 47, 24, NA),
AGE2 = c(NA, 24, NA, 13, 57, 28),
AGE3 = c(NA, NA, 35, NA, NA, 26),
AGE4 = c(NA, NA, 15, 11, NA, NA),
AGE5 = c(NA, 15, NA, NA, NA, NA),
SEX1 = c("M", "F", "M", "M", "M", NA),
SEX2 = c(NA, "M", NA, "F", "F", "F"),
SEX3 = c(NA, NA, "M", NA, NA, "M"),
SEX4 = c(NA, NA, "F", "F", NA, NA),
SEX5 = c(NA, "F", NA, NA, NA, NA)
)
Here is a solution using package dedupewider:
library(dedupewider)
df <- data.frame(
ref = c(1, 2, 3, 4, 5, 6),
AGE1 = c(45, 36, 26, 47, 24, NA),
AGE2 = c(NA, 24, NA, 13, 57, 28),
AGE3 = c(NA, NA, 35, NA, NA, 26),
AGE4 = c(NA, NA, 15, 11, NA, NA),
AGE5 = c(NA, 15, NA, NA, NA, NA),
SEX1 = c("M", "F", "M", "M", "M", NA),
SEX2 = c(NA, "M", NA, "F", "F", "F"),
SEX3 = c(NA, NA, "M", NA, NA, "M"),
SEX4 = c(NA, NA, "F", "F", NA, NA),
SEX5 = c(NA, "F", NA, NA, NA, NA)
)
age_moved <- na_move(df, cols = names(df)[grepl("^AGE\\d$", names(df))]) # 'right' direction is by default
sex_moved <- na_move(age_moved, cols = names(df)[grepl("^SEX\\d$", names(df))])
sex_moved
#> ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
#> 1 1 45 NA NA NA NA M <NA> <NA> NA NA
#> 2 2 36 24 15 NA NA F M F NA NA
#> 3 3 26 35 15 NA NA M M F NA NA
#> 4 4 47 13 11 NA NA M F F NA NA
#> 5 5 24 57 NA NA NA M F <NA> NA NA
#> 6 6 28 26 NA NA NA F M <NA> NA NA
I have simplified my df to:
A <- c("a", "b", "c", "d", "e", "f", "g", "NA", "h", "I")
B <- c(NA, 2, 3, 4, NA, NA, 5, 6, 8, NA)
C <- c(NA, 9, 8, 4, 5, 7, 5, 6, NA, NA)
D <- c(NA, 1, NA, 3, NA, 5, NA, NA, 8, NA)
E <- c(1,2,3,4,5,6,7,8,9,10)
df <- data.frame(A, B, C, D, E)
I would like general code that sets the values in columns B and C to NA wherever column D is NA.
The resulting df2 would be:
A <- c("a", "b", "c", "d", "e", "f", "g", "NA", "h", "I")
B <- c(NA, 2, NA, 4, NA, NA, NA, NA, 8, NA)
C <- c(NA, 9, NA, 4, NA, 7, NA, NA, NA, NA)
D <- c(NA, 1, NA, 3, NA, 5, NA, NA, 8, NA)
E <- c(1,2,3,4,5,6,7,8,9,10)
df2 <- data.frame(A, B, C, D, E)
The code I have tried so far is below; it does not work and gives the error "unused argument (as.numeric(B))":
df2 <- df %>% na_if(is.na(D), as.numeric(B)) %>%
na_if(is.na(D), as.numeric(C))
Any help would be greatly appreciated. I cannot install library(naniar), so please no solutions that use replace_with_na_at.
Thank you!
With dplyr, we can apply a simple ifelse statement to both B and C using across and replace with NA when they meet the condition (i.e., D is NA).
library(dplyr)
output <- df %>%
mutate(across(B:C, ~ ifelse(is.na(D), NA, .x)))
Output
A B C D E
1 a NA NA NA 1
2 b 2 9 1 2
3 c NA NA NA 3
4 d 4 4 3 4
5 e NA NA NA 5
6 f NA 7 5 6
7 g NA NA NA 7
8 NA NA NA NA 8
9 h 8 NA 8 9
10 I NA NA NA 10
Test
identical(output, df2)
# [1] TRUE
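For comparison, the same rule ("where D is NA, blank out B and C") can be written as a single base R row/column assignment. This is only a minimal sketch on the sample data from the question, not a replacement for the dplyr answer above.

```r
# Sample data from the question
A <- c("a", "b", "c", "d", "e", "f", "g", "NA", "h", "I")
B <- c(NA, 2, 3, 4, NA, NA, 5, 6, 8, NA)
C <- c(NA, 9, 8, 4, 5, 7, 5, 6, NA, NA)
D <- c(NA, 1, NA, 3, NA, 5, NA, NA, 8, NA)
E <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
df <- data.frame(A, B, C, D, E)

# Select the rows where D is missing and overwrite B and C in those rows
df[is.na(df$D), c("B", "C")] <- NA
df
```

The column types are unchanged, because assigning NA into a numeric column keeps it numeric.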
data.table
A <- c("a", "b", "c", "d", "e", "f", "g", "NA", "h", "I")
B <- c(NA, 2, 3, 4, NA, NA, 5, 6, 8, NA)
C <- c(NA, 9, 8, 4, 5, 7, 5, 6, NA, NA)
D <- c(NA, 1, NA, 3, NA, 5, NA, NA, 8, NA)
E <- c(1,2,3,4,5,6,7,8,9,10)
df <- data.frame(A, B, C, D, E)
library(data.table)
cols <- c("B", "C")
setDT(df)[is.na(D), (cols) := NA][]
#> A B C D E
#> 1: a NA NA NA 1
#> 2: b 2 9 1 2
#> 3: c NA NA NA 3
#> 4: d 4 4 3 4
#> 5: e NA NA NA 5
#> 6: f NA 7 5 6
#> 7: g NA NA NA 7
#> 8: NA NA NA NA 8
#> 9: h 8 NA 8 9
#> 10: I NA NA NA 10
Created on 2022-03-02 by the reprex package (v2.0.1)
Base R
A base R solution with Map and is.na<-.
A <- c("a", "b", "c", "d", "e", "f", "g", "NA", "h", "I")
B <- c(NA, 2, 3, 4, NA, NA, 5, 6, 8, NA)
C <- c(NA, 9, 8, 4, 5, 7, 5, 6, NA, NA)
D <- c(NA, 1, NA, 3, NA, 5, NA, NA, 8, NA)
E <- c(1,2,3,4,5,6,7,8,9,10)
df <- data.frame(A, B, C, D, E)
df[c("B", "C")] <- Map(\(x, y) {
is.na(x) <- is.na(y)
x
}, df[c("B", "C")], df["D"])
df
#> A B C D E
#> 1 a NA NA NA 1
#> 2 b 2 9 1 2
#> 3 c NA NA NA 3
#> 4 d 4 4 3 4
#> 5 e NA NA NA 5
#> 6 f NA 7 5 6
#> 7 g NA NA NA 7
#> 8 NA NA NA NA 8
#> 9 h 8 NA 8 9
#> 10 I NA NA NA 10
Created on 2022-03-01 by the reprex package (v2.0.1)
dplyr
And a solution with dplyr, but the same is.na<-.
library(dplyr)
df %>%
mutate(across(B:C, \(x) {is.na(x) <- is.na(D); x}))
#> A B C D E
#> 1 a NA NA NA 1
#> 2 b 2 9 1 2
#> 3 c NA NA NA 3
#> 4 d 4 4 3 4
#> 5 e NA NA NA 5
#> 6 f NA 7 5 6
#> 7 g NA NA NA 7
#> 8 NA NA NA NA 8
#> 9 h 8 NA 8 9
#> 10 I NA NA NA 10
Created on 2022-03-01 by the reprex package (v2.0.1)
I have a few large dataframes in RStudio, that have this structure:
Original data structure
structure(list(CHROM = c("scaffold1000|size223437", "scaffold1000|size223437",
"scaffold1000|size223437", "scaffold1000|size223437"), POS = c(666,
1332, 3445, 4336), REF = c("A", "TA", "CTTGA", "GCTA"), RO = c(20,
14, 9, 25), ALT_1 = c("GAT", "TGC", "AGC", "T"), ALT_2 = c("CAG",
"TGA", "CGC", NA), ALT_3 = c("G", NA, "TGA", NA), ALT_4 = c("AGT",
NA, NA, NA), AO_1 = c(13, 4, 67, 120), AO_2 = c(12, 5, 34, NA
), AO_3 = c(6, NA, 18, NA), AO_4 = c(101, NA, NA, NA), AOF_1 = c(8.55263157894737,
17.3913043478261, 52.34375, 82.7586206896552), AOF_2 = c(7.89473684210526,
21.7391304347826, 26.5625, NA), AOF_3 = c(3.94736842105263, NA,
14.0625, NA), AOF_4 = c(66.4473684210526, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
But for an analysis I need it to look like this:
Desired output
structure(list(CHROM = c("scaffold1000|size223437", "scaffold1000|size223437",
"scaffold1000|size223437", "scaffold1000|size223437"), POS = c(666,
1332, 3445, 4336), REF = c("A", "TA", "CTTGA", "GCTA"), RO = c(20,
14, 9, 25), ALT_1 = c("AGT", "TGA", "AGC", "T"), ALT_2 = c("CAG",
"TGC", "CGC", NA), ALT_3 = c("G", NA, "TGA", NA), ALT_4 = c("GAT",
NA, NA, NA), AO_1 = c(101, 5, 67, 120), AO_2 = c(12, 4, 34, NA
), AO_3 = c(6, NA, 18, NA), AO_4 = c(13, NA, NA, NA), AOF_1 = c(66.4473684210526,
21.7391304347826, 52.34375, 82.7586206896552), AOF_2 = c(7.89473684210526,
17.3913043478261, 26.5625, NA), AOF_3 = c(3.94736842105263, NA,
14.0625, NA), AOF_4 = c(8.55263157894737, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
So what I would like to do is to rearrange the content of a row in a way, that the columns ALT_1, ALT_2, ALT_3, ALT_4 are alphabetically sorted, but at the same time I also need to rearrange the corresponding columns of AO and AOF, so that the values still match.
(The value of AO_1 should still match with the sequence that was in ALT_1.
So if ALT_1 becomes ALT_2 in the sorted dataframe, AO_1 should also become AO_2)
What I tried so far, but didn't work:
Pasting the values of ALT_1, AO_1, AOF_1 all into one field, so that I have them together:
# assumed loop over the rows of X
for (i in seq_len(nrow(X))) {
if (is.na(X[i,6]) == FALSE) {
X[i,6] <- paste(X[i,6],X[i,10],X[i,14],sep=" ")
}
}
And then I wanted to extract every row as a vector to sort the values and put it back in the dataframe, but I didn't manage to do this.
So the question would be how I can order the dataframe to get the desired output?
(I need to apply this to 32 dataframes with each having >100.000 values)
Here is a dplyr solution. It took me some time, and I needed some help along the way (see the related question "pivot_wider dissolves arrange"):
library(dplyr)
library(tidyr)
df1 %>%
mutate(id = row_number()) %>%
unite("conc1", c(ALT_1, AO_1, AOF_1), sep = "_") %>%
unite("conc2", c(ALT_2, AO_2, AOF_2), sep = "_") %>%
unite("conc3", c(ALT_3, AO_3, AOF_3), sep = "_") %>%
unite("conc4", c(ALT_4, AO_4, AOF_4), sep = "_") %>%
pivot_longer(
starts_with("conc")
) %>%
mutate(value = ifelse(value=="NA_NA_NA", NA_character_, value)) %>%
group_by(id) %>%
mutate(value = sort(value, na.last = TRUE)) %>%
ungroup() %>%
pivot_wider(
names_from = name,
values_from = value,
values_fill = "0"
) %>%
separate(conc1, c("ALT_1", "AO_1", "AOF_1"), sep = "_") %>%
separate(conc2, c("ALT_2", "AO_2", "AOF_2"), sep = "_") %>%
separate(conc3, c("ALT_3", "AO_3", "AOF_3"), sep = "_") %>%
separate(conc4, c("ALT_4", "AO_4", "AOF_4"), sep = "_") %>%
select(CHROM, POS, REF, RO, starts_with("ALT"), starts_with("AO_"), starts_with("AOF_")) %>%
type.convert(as.is=TRUE)
CHROM POS REF RO ALT_1 ALT_2 ALT_3 ALT_4 AO_1 AO_2 AO_3 AO_4 AOF_1 AOF_2 AOF_3 AOF_4
<chr> <int> <chr> <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 scaffold1000|size223437 666 A 20 AGT CAG G GAT 101 12 6 13 66.4 7.89 3.95 8.55
2 scaffold1000|size223437 1332 TA 14 TGA TGC NA NA 5 4 NA NA 21.7 17.4 NA NA
3 scaffold1000|size223437 3445 CTTGA 9 AGC CGC TGA NA 67 34 18 NA 52.3 26.6 14.1 NA
4 scaffold1000|size223437 4336 GCTA 25 T NA NA NA 120 NA NA NA 82.8 NA NA NA
Here is a data.table approach:
library(data.table)
# Set to data.table format
setDT(mydata)
# Melt to long format
DT.melt <- melt(mydata, measure.vars = patterns(ALT = "^ALT_", AO = "^AO_", AOF = "^AOF_"))
# order by groups, na's at the end
setorderv(DT.melt, cols = c("CHROM", "POS", "ALT"), na.last = TRUE)
# cast to wide again, use rowid() for numbering
dcast(DT.melt, CHROM + POS + REF + RO ~ rowid(REF), value.var = list("ALT", "AO", "AOF"))
# CHROM POS REF RO ALT_1 ALT_2 ALT_3 ALT_4 AO_1 AO_2 AO_3 AO_4 AOF_1 AOF_2 AOF_3 AOF_4
# 1: scaffold1000|size223437 666 A 20 AGT CAG G GAT 101 12 6 13 66.44737 7.894737 3.947368 8.552632
# 2: scaffold1000|size223437 1332 TA 14 TGA TGC <NA> <NA> 5 4 NA NA 21.73913 17.391304 NA NA
# 3: scaffold1000|size223437 3445 CTTGA 9 AGC CGC TGA <NA> 67 34 18 NA 52.34375 26.562500 14.062500 NA
# 4: scaffold1000|size223437 4336 GCTA 25 T <NA> <NA> <NA> 120 NA NA NA 82.75862 NA NA NA
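The answers above reshape to long format, sort, and reshape back. The underlying idea is to compute one ordering per row from the ALT columns and apply that same ordering to AO and AOF. A base R sketch of that idea is below; the function name sort_alt_rows and its argument n (the number of ALT slots) are illustrative assumptions, and the sketch has not been benchmarked on data of the size mentioned in the question.

```r
# Sample data from the question
df1 <- data.frame(
  CHROM = rep("scaffold1000|size223437", 4),
  POS = c(666, 1332, 3445, 4336),
  REF = c("A", "TA", "CTTGA", "GCTA"),
  RO = c(20, 14, 9, 25),
  ALT_1 = c("GAT", "TGC", "AGC", "T"),
  ALT_2 = c("CAG", "TGA", "CGC", NA),
  ALT_3 = c("G", NA, "TGA", NA),
  ALT_4 = c("AGT", NA, NA, NA),
  AO_1 = c(13, 4, 67, 120),
  AO_2 = c(12, 5, 34, NA),
  AO_3 = c(6, NA, 18, NA),
  AO_4 = c(101, NA, NA, NA),
  AOF_1 = c(8.55263157894737, 17.3913043478261, 52.34375, 82.7586206896552),
  AOF_2 = c(7.89473684210526, 21.7391304347826, 26.5625, NA),
  AOF_3 = c(3.94736842105263, NA, 14.0625, NA),
  AOF_4 = c(66.4473684210526, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: order ALT_1..ALT_n alphabetically within each row
# (NA last) and apply the same per-row permutation to AO_* and AOF_*.
sort_alt_rows <- function(d, n = 4) {
  alt <- as.matrix(d[paste0("ALT_", seq_len(n))])
  # ord[i, k] is the original slot that ends up in position k of row i
  ord <- t(apply(alt, 1, order, na.last = TRUE))
  # (row, column) index pairs, listed in column-major order
  idx <- cbind(rep(seq_len(nrow(d)), times = n), as.vector(ord))
  for (prefix in c("ALT_", "AO_", "AOF_")) {
    cols <- paste0(prefix, seq_len(n))
    m <- as.matrix(d[cols])
    sorted <- matrix(m[idx], nrow = nrow(d))
    for (j in seq_len(n)) d[[cols[j]]] <- sorted[, j]
  }
  d
}

res <- sort_alt_rows(df1)
res
```

Since the permutation is computed once from ALT and reused, each AO and AOF value stays attached to the sequence it belonged to.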
I'm a fairly new R user, trying to teach myself based on forums, videos, and trial and error. I have a very large dataset and would like to calculate the number of members in each household who are considered children (aged under 18). I have a column for the number of household members, as well as 11 columns for each household member's age. My initial thought was to select those who are under 18 and subtract from the total household members. I've tried a few different lines of code unsuccessfully and I'm not sure how best to go about this. Any help is greatly appreciated!
There are a few ways to do this. I'm using something called a datastep from the libr package.
First, here is your data:
df <- data.frame(num_hhmem = c(6, 4, 4, 5, 4, NA, 8, NA),
ChildAge = c(9, 8, 10, 10, 9, NA, 8, NA),
hhm2_Age = c(36, 44, 52, 40, 33, NA, 37, NA),
hhm3_Age = c(34, 16, 53, 15, 15, NA, 39, NA),
hhm4_Age = c(15, 10, 92, 17, 11, NA, NA, NA),
hhm5_Age = c(7, NA, NA, 20, NA, NA, 10, NA),
hhm6_Age = c(11, NA, NA, NA, NA, NA, 6, NA),
hhm7_Age = c(NA, NA, NA, NA, NA, NA, 68, NA),
hhm8_Age = c(NA, NA, NA, NA, NA, NA, 78, NA),
hhm9_Age = c(NA, NA, NA, NA, NA, NA, NA, NA))
Then I set up the datastep with an array for the columns you want to iterate over. I also set up a childCount variable with a starting value of 0. The datastep loops through the dataframe row by row, so you just iterate through the array and add any children to the childCount variable.
library(libr)
res <- datastep(df,
arrays = list(ages = dsarray("ChildAge", "hhm2_Age", "hhm3_Age",
"hhm4_Age", "hhm5_Age", "hhm6_Age",
"hhm7_Age", "hhm8_Age", "hhm9_Age")),
calculate = { childCount <- 0 },
drop = "age",
{
for(age in ages) {
if (!is.na(ages[age])) {
if (ages[age] < 18)
childCount <- childCount + 1
}
}
})
Here are the results:
res
# num_hhmem ChildAge hhm2_Age hhm3_Age hhm4_Age hhm5_Age hhm6_Age hhm7_Age hhm8_Age hhm9_Age childCount
# 1 6 9 36 34 15 7 11 NA NA NA 4
# 2 4 8 44 16 10 NA NA NA NA NA 3
# 3 4 10 52 53 92 NA NA NA NA NA 1
# 4 5 10 40 15 17 20 NA NA NA NA 3
# 5 4 9 33 15 11 NA NA NA NA NA 3
# 6 NA NA NA NA NA NA NA NA NA NA 0
# 7 8 8 37 39 NA 10 6 68 78 NA 3
# 8 NA NA NA NA NA NA NA NA NA NA 0
Here is another potential solution using tidyverse functions and the data formatted by @David J. Bosak:
df1 <- data.frame(num_hhmem = c(6, 4, 4, 5, 4, NA, 8, NA),
ChildAge = c(9, 8, 10, 10, 9, NA, 8, NA),
hhm2_Age = c(36, 44, 52, 40, 33, NA, 37, NA),
hhm3_Age = c(34, 16, 53, 15, 15, NA, 39, NA),
hhm4_Age = c(15, 10, 92, 17, 11, NA, NA, NA),
hhm5_Age = c(7, NA, NA, 20, NA, NA, 10, NA),
hhm6_Age = c(11, NA, NA, NA, NA, NA, 6, NA),
hhm7_Age = c(NA, NA, NA, NA, NA, NA, 68, NA),
hhm8_Age = c(NA, NA, NA, NA, NA, NA, 78, NA),
hhm9_Age = c(NA, NA, NA, NA, NA, NA, NA, NA))
library(dplyr)
# Count ages strictly under 18, as the question asks
df2 <- df1 %>%
rowwise() %>%
mutate(total_kids = rowSums(across(-c(num_hhmem), ~sum(.x < 18, na.rm = TRUE))))
df2
#> # A tibble: 8 × 11
#> # Rowwise:
#> num_hhmem ChildAge hhm2_Age hhm3_Age hhm4_Age hhm5_Age hhm6_Age hhm7_Age hhm8_Age hhm9_Age total_kids
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
#> 1 6 9 36 34 15 7 11 NA NA NA 4
#> 2 4 8 44 16 10 NA NA NA NA NA 3
#> 3 4 10 52 53 92 NA NA NA NA NA 1
#> 4 5 10 40 15 17 20 NA NA NA NA 3
#> 5 4 9 33 15 11 NA NA NA NA NA 3
#> 6 NA NA NA NA NA NA NA NA NA NA 0
#> 7 8 8 37 39 NA 10 6 68 78 NA 3
#> 8 NA NA NA NA NA NA NA NA NA NA 0
Or, if you just want the counts in a dataframe on their own:
df3 <- df1 %>%
rowwise() %>%
summarise(total_kids = rowSums(across(-c(num_hhmem), ~sum(.x < 18, na.rm = TRUE))))
df3
#> # A tibble: 8 × 1
#> total_kids
#> <dbl>
#> 1 4
#> 2 3
#> 3 1
#> 4 3
#> 5 3
#> 6 0
#> 7 3
#> 8 0
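The row-by-row counting in the answers above can also be expressed as one vectorised comparison in base R, which may matter on a very large dataset. The sketch below assumes that every column other than num_hhmem holds an age, as in the sample data; the name age_cols is just for illustration.

```r
# Sample data as formatted in the answers above
df <- data.frame(num_hhmem = c(6, 4, 4, 5, 4, NA, 8, NA),
                 ChildAge = c(9, 8, 10, 10, 9, NA, 8, NA),
                 hhm2_Age = c(36, 44, 52, 40, 33, NA, 37, NA),
                 hhm3_Age = c(34, 16, 53, 15, 15, NA, 39, NA),
                 hhm4_Age = c(15, 10, 92, 17, 11, NA, NA, NA),
                 hhm5_Age = c(7, NA, NA, 20, NA, NA, 10, NA),
                 hhm6_Age = c(11, NA, NA, NA, NA, NA, 6, NA),
                 hhm7_Age = c(NA, NA, NA, NA, NA, NA, 68, NA),
                 hhm8_Age = c(NA, NA, NA, NA, NA, NA, 78, NA),
                 hhm9_Age = c(NA, NA, NA, NA, NA, NA, NA, NA))

# Assumption: all columns except num_hhmem are age columns
age_cols <- setdiff(names(df), "num_hhmem")

# TRUE/FALSE/NA matrix of "is under 18", summed per row with NAs ignored
df$childCount <- unname(rowSums(df[age_cols] < 18, na.rm = TRUE))
df$childCount
```

The number of adults, if needed, would then be num_hhmem minus childCount for rows where num_hhmem is not missing.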
Given the following example dataset:
df <- structure(list(Id = 1:10,
Department = c("A", "B", "A", "C",
"A", "B", "B", "C", "D", "A"),
Q1 = c("US", NA, NA, "US",
NA, "US", NA, "US", NA, "US"),
Q2 = c("Comp B", NA, NA,
"Comp B", "Comp B", NA, "Comp B", NA, "Comp B", "Comp B"),
Q3 = c(NA, NA, NA, NA, NA, NA, "Comp C", NA, NA, NA),
Q4 = c(NA, "Comp D", NA, "Comp D", NA, NA, NA, NA, "Comp D", NA),
Sales = c(10, 23, 12, 5, 5, 76, 236, 4, 3, 10)),
row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
Is there a way to replace all non-NA values in columns Q2:Q4 with, for instance, the word "Competitor" all at once? I know how to use str_replace on individual columns, but with over 100 columns, each with different words to be replaced, I'm hoping there is a quicker way. I tried various versions of mutate(across(Q2:Q4, ~str_replace(.x, !is.na, "Competitor"))), which I modelled on mutate(across(Q2:Q4, ~replace_na(.x, 0))), but that didn't work. I'm still not clear on the syntax of across except for the simplest operations, and I don't even know if it is applicable here.
Thanks!
str_replace replaces a substring that matches a pattern, so it is the wrong tool here. In your attempt the second argument, !is.na, refers to the function is.na itself, which is never called on the column. Use replace to overwrite every non-NA element instead:
library(dplyr)
df1 <- df %>%
mutate(across(Q2:Q4, ~ replace(., !is.na(.), "Competitor")))
Output
# A tibble: 10 x 7
Id Department Q1 Q2 Q3 Q4 Sales
<int> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 A US Competitor <NA> <NA> 10
2 2 B <NA> <NA> <NA> Competitor 23
3 3 A <NA> <NA> <NA> <NA> 12
4 4 C US Competitor <NA> Competitor 5
5 5 A <NA> Competitor <NA> <NA> 5
6 6 B US <NA> <NA> <NA> 76
7 7 B <NA> Competitor Competitor <NA> 236
8 8 C US <NA> <NA> <NA> 4
9 9 D <NA> Competitor <NA> Competitor 3
10 10 A US Competitor <NA> <NA> 10
Or in base R
nm1 <- grep("^Q[2-4]$", names(df), value = TRUE)
df[nm1][!is.na(df[nm1])] <- "Competitor"
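A column-by-column variant of the same base R idea is sketched below, in case the matrix-style indexing above is hard to read. It uses a plain data.frame built from the question's values; the vector name cols is only illustrative.

```r
# Sample data from the question, as a plain data.frame
df <- data.frame(
  Id = 1:10,
  Department = c("A", "B", "A", "C", "A", "B", "B", "C", "D", "A"),
  Q1 = c("US", NA, NA, "US", NA, "US", NA, "US", NA, "US"),
  Q2 = c("Comp B", NA, NA, "Comp B", "Comp B", NA, "Comp B", NA, "Comp B", "Comp B"),
  Q3 = c(NA, NA, NA, NA, NA, NA, "Comp C", NA, NA, NA),
  Q4 = c(NA, "Comp D", NA, "Comp D", NA, NA, NA, NA, "Comp D", NA),
  Sales = c(10, 23, 12, 5, 5, 76, 236, 4, 3, 10),
  stringsAsFactors = FALSE
)

# Columns to rewrite; with 100+ columns this could come from grep() on names(df)
cols <- c("Q2", "Q3", "Q4")

# For each column, overwrite every non-NA element and leave the NAs alone
df[cols] <- lapply(df[cols], function(x) replace(x, !is.na(x), "Competitor"))
df
```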
Here is another option:
library(dplyr)
library(purrr)
df %>%
mutate(pmap_df(select(df, Q2:Q4), ~ replace(c(...), !is.na(c(...)), "Competitor")))
# A tibble: 10 x 7
Id Department Q1 Q2 Q3 Q4 Sales
<int> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 A US Competitor NA NA 10
2 2 B NA NA NA Competitor 23
3 3 A NA NA NA NA 12
4 4 C US Competitor NA Competitor 5
5 5 A NA Competitor NA NA 5
6 6 B US NA NA NA 76
7 7 B NA Competitor Competitor NA 236
8 8 C US NA NA NA 4
9 9 D NA Competitor NA Competitor 3
10 10 A US Competitor NA NA 10
I have a data frame such as this (but of size 16 Billion):
structure(list(id1 = c(1, 2, 3, 4, 4, 4, 4, 4, 4, 4), id2 = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j"), b1 = c(NA, NA,
NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L), b2 = c(1, NA, NA, NA, NA, NA,
1, 1, 1, 1), b3 = c(NA, 1, NA, NA, NA, NA, NA, NA, 1, 1), b4 = c(NA,
NA, 1, NA, NA, NA, NA, NA, 1, 1)), .Names = c("id1", "id2", "b1",
"b2", "b3", "b4"), row.names = c(NA, 10L), class = "data.frame")
df
id1 id2 b1 b2 b3 b4
1 1 a NA 1 NA NA
2 2 b NA NA 1 NA
3 3 c NA NA NA 1
4 4 d 1 NA NA NA
5 4 e 1 NA NA NA
6 4 f 1 NA NA NA
7 4 g 1 1 NA NA
8 4 h 1 1 NA NA
9 4 i 1 1 1 1
10 4 j 1 1 1 1
I need to get it into long format, while ONLY keeping values of 1. Of course, I tried using gather from tidyr and also melt from data.table to no avail as the memory requirements of them are explosive. My original data had zeros and ones, but I filled zeroes with NA and hoped na.rm = TRUE option will help with memory issue. But, it does not.
With just ones retained and lengthened, my data frame will fit easily in memory I have.
Is there a better way to get at this vs. using the standard methods - reasonable compute as a tradeoff for better memory fit is acceptable.
My desired output is the equivalent of:
library(dplyr)
library(tidyr)
df %>% gather(b, value, -id1, -id2, na.rm = TRUE)
id1 id2 b value
1 4 d b1 1
2 4 e b1 1
3 4 f b1 1
4 4 g b1 1
5 4 h b1 1
6 4 i b1 1
7 4 j b1 1
8 1 a b2 1
9 4 g b2 1
10 4 h b2 1
11 4 i b2 1
12 4 j b2 1
13 2 b b3 1
14 4 i b3 1
15 4 j b3 1
16 3 c b4 1
17 4 i b4 1
18 4 j b4 1
# or
reshape2::melt(df, id=c("id1","id2"), na.rm=TRUE)
# or
library(data.table)
melt(setDT(df), id=c("id1","id2"), na.rm=TRUE)
Currently, the call to gather on my full data set gives me this error, which I believe is due to memory issue:
Error in .Call("tidyr_melt_dataframe", PACKAGE = "tidyr", data, id_ind, :
negative length vectors are not allowed
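One way to keep peak memory down is to never build the full long table: handle one b column at a time, keep only the rows where it equals 1, and bind the small pieces together at the end. The sketch below illustrates this on the sample data. The function name melt_ones is hypothetical, and whether the approach is fast enough, or fits in memory, on the full data set has not been tested here.

```r
# Sample data from the question
df <- data.frame(
  id1 = c(1, 2, 3, 4, 4, 4, 4, 4, 4, 4),
  id2 = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  b1 = c(NA, NA, NA, 1, 1, 1, 1, 1, 1, 1),
  b2 = c(1, NA, NA, NA, NA, NA, 1, 1, 1, 1),
  b3 = c(NA, 1, NA, NA, NA, NA, NA, NA, 1, 1),
  b4 = c(NA, NA, 1, NA, NA, NA, NA, NA, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: long format that keeps only the cells equal to 1.
# Each value column is processed on its own, so the intermediate objects
# are never larger than the number of 1s in that column.
melt_ones <- function(d, id_cols, value_cols) {
  pieces <- lapply(value_cols, function(col) {
    keep <- which(d[[col]] == 1)   # which() silently drops NA
    data.frame(d[keep, id_cols, drop = FALSE],
               b = rep(col, length(keep)),
               value = rep(1, length(keep)),
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

long <- melt_ones(df, c("id1", "id2"), paste0("b", 1:4))
long
```

The same column-at-a-time loop could append each piece to a file instead of holding all pieces in a list, if even the filtered result is large.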