Set value of a column to NA based on conditions in R - r

I have a data frame, a reproducible example is as follows:
structure(list(subscriberid = c(1177460837L, 1177460837L, 1177460837L,
1146526049L, 1146526049L, 1146526049L), variable = c("3134",
"4550", "4550", "5160", "2530", "2530"), value = c(1, 2, 2, 1,
2, 2), gender = c(2, 2, 2, 1, 2, 2), cwe = c(NA, 50L, 50L, NA,
30L, 30L), hw = c(NA, 48L, 48L, NA, 26L, 26L), resp = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), cna = c(3L, 1L, 1L, 3L, 1L, 1L)), .Names = c("subscriberid",
"variable", "value", "gender", "cwe", "hw", "resp", "cna"), row.names = c(4L,
5L, 6L, 9L, 10L, 11L), class = "data.frame")
The actual data frame looks like this:
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 48 NA 1
6 1177460837 4550 2 2 50 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 26 NA 1
11 1146526049 2530 2 2 30 26 NA 1
In the df above, rows 5 and 6 are exactly the same. From row 5 I want to remove 48, and from row 6 I want to remove 50. Essentially, I want to keep only one age per row and set the other to NA. I tried a for loop, but it sets the column I refer to to NA in both rows.
for (i in 1:nrow(test)) {
test$hw[i] <- ifelse(!is.na(test$cwe[i]) & !is.na(test$hw[i]), NA, test$hw[i])
}
I am trying to write an if condition that identifies when two rows are the same, so that I can remove one of the values from the first row and the other value from the second.
The desired output is as follows:
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 NA NA 1
6 1177460837 4550 2 2 NA 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 NA NA 1
11 1146526049 2530 2 2 NA 26 NA 1

You can use a combination of which() and duplicated() to find the duplicated rows.
Because two different rows of each pair are modified, work on a copy of the data frame, so that duplicated() is always evaluated on the unmodified original. Note that this only works if identical rows are always consecutive.
dfNA <- df
dfNA$hw[which(duplicated(df))-1] <- NA
dfNA$cwe[which(duplicated(df))] <- NA
dfNA
# subscriberid variable value gender cwe hw resp cna
#4 1177460837 3134 1 2 NA NA NA 3
#5 1177460837 4550 2 2 50 NA NA 1
#6 1177460837 4550 2 2 NA 48 NA 1
#9 1146526049 5160 1 1 NA NA NA 3
#10 1146526049 2530 2 2 30 NA NA 1
#11 1146526049 2530 2 2 NA 26 NA 1
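If the identical rows are not guaranteed to be consecutive, one possible variation is to call duplicated() twice, once with fromLast = TRUE. This is only a sketch on a reduced, hypothetical stand-in for the question's data, and it assumes identical rows come in pairs (with three or more copies, the middle rows would lose both values):

```r
# Reduced stand-in for the question's data (hypothetical ids, four columns only)
df <- data.frame(
  subscriberid = c(1L, 1L, 1L, 2L, 2L, 2L),
  variable     = c("3134", "4550", "4550", "5160", "2530", "2530"),
  cwe          = c(NA, 50L, 50L, NA, 30L, 30L),
  hw           = c(NA, 48L, 48L, NA, 26L, 26L)
)

# TRUE where a later identical row exists (first row of each pair)
first_of_pair <- duplicated(df, fromLast = TRUE)
# TRUE where an earlier identical row exists (second row of each pair)
second_of_pair <- duplicated(df)

# Both masks are computed on the unmodified df, then applied to a copy
dfNA <- df
dfNA$hw[first_of_pair]   <- NA
dfNA$cwe[second_of_pair] <- NA
dfNA
```

Because neither mask relies on the row position minus one, the pairs may be separated by other rows.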

A possible solution :
# create a logical vector indicating if current row is identical to previous one
# N.B.: do.call("paste",c(DF,sep="\r")) is used internally by "duplicated.data.frame" function
rowStrings <- do.call("paste", c(DF, sep = "\r"))
currRowIsEqualToPrev <- rowStrings[-1] == rowStrings[-length(rowStrings)]
# set hw = NA in the first row of each identical pair and cwe = NA in the second
DF[c(currRowIsEqualToPrev, FALSE), 'hw'] <- NA
DF[c(FALSE, currRowIsEqualToPrev), 'cwe'] <- NA
> DF
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 NA NA 1
6 1177460837 4550 2 2 NA 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 NA NA 1
11 1146526049 2530 2 2 NA 26 NA 1
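To see which mask marks which row of an identical pair, here is a minimal sketch of the shifted comparison on a toy character vector (x is a hypothetical stand-in for rowStrings):

```r
# Toy stand-in for rowStrings: elements 2 and 3 are identical
x <- c("a", "b", "b", "c")

# Compare every element with its predecessor (result has length(x) - 1)
eq_prev <- x[-1] == x[-length(x)]

# Padding at the front aligns the flags with the SECOND row of each pair
second_of_pair <- c(FALSE, eq_prev)
# Padding at the back aligns the flags with the FIRST row of each pair
first_of_pair <- c(eq_prev, FALSE)

which(second_of_pair)  # 3
which(first_of_pair)   # 2
```

So the mask built with a leading FALSE selects the later row of a pair, and the mask built with a trailing FALSE selects the earlier one.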

Using lead() and lag() from the dplyr package:
library(dplyr)
df1 %>%
group_by(subscriberid, variable) %>%
mutate(cwe = if_else(lead(cwe) == cwe, cwe, NA_integer_),
hw = if_else(lag(hw) == hw, hw, NA_integer_)) %>%
ungroup()
# # A tibble: 6 x 8
# subscriberid variable value gender cwe hw resp cna
# <int> <int> <int> <int> <int> <int> <lgl> <int>
# 1 1177460837 3134 1 2 NA NA NA 3
# 2 1177460837 4550 2 2 50 NA NA 1
# 3 1177460837 4550 2 2 NA 48 NA 1
# 4 1146526049 5160 1 1 NA NA NA 3
# 5 1146526049 2530 2 2 30 NA NA 1
# 6 1146526049 2530 2 2 NA 26 NA 1

I took a shot at it. This relies on group_by from dplyr to find duplicate rows. The method assumes that rows can reliably be identified as identical using the subscriberid, variable, value, gender, resp, and cna columns alone.
Because it operates within groups only, it works even if a preceding non-identical row contains the same value for cwe (I did check this, but I would confirm it for yourself if I were you).
library(dplyr)
ndf <- df %>%
group_by(subscriberid, variable, value, gender, resp, cna) %>%
mutate(cwe = na_if(cwe, lag(cwe)),
hw = na_if(hw, lead(hw))) %>%
ungroup()
Output:
# A tibble: 6 x 8
subscriberid variable value gender cwe hw resp cna
<int> <chr> <dbl> <dbl> <int> <int> <int> <int>
1 1177460837 3134 1. 2. NA NA NA 3
2 1177460837 4550 2. 2. 50 NA NA 1
3 1177460837 4550 2. 2. NA 48 NA 1
4 1146526049 5160 1. 1. NA NA NA 3
5 1146526049 2530 2. 2. 30 NA NA 1
6 1146526049 2530 2. 2. NA 26 NA 1

Related

How to create a new column with the derivative of a set of time series values

I'm looking for help with R. I want to add three columns to existing data frames that contain time series data and have a lot of NA values. The data is about test scores. The first column I want to add is the first test score available. In the second column, I want the last test score available. In the third column, I want to calculate the derivative for each row by dividing the difference between the first and last scores by the number of tests that have passed. Important is that some of these past tests have NA values but I still want to include these when dividing. However, NA values that come after the last available test score I don't want to count.
Some explanation of my data:
I have a couple of data frames that hold test scores of different people. Each row is a person and each column is a test score. There are multiple scores per person for the same test. Column T1 shows the first score, T2 the second score, gathered a week later, and so on. Some people started sooner than others and therefore have more scores available. Also, some scores at the beginning and in the middle are missing for various reasons. See the two example tables below, where the INDEX column is the actual index of the data frame and not a separate column. Some numbers are missing from the index (like 3) because that person had only NA values in their row, which I removed. It is important to me that the index stays this way.
Example 1 (test A):
INDEX T1 T2 T3 T4 T5 T6
1     NA NA NA  3  4  5
2     57 57 57 57 NA NA
4     44 NA NA NA NA NA
5      9 11 11 17 12 NA
Example 2 (test B):
INDEX T1 T2 T3 T4
1     NA NA NA 17
2     11 16 20 20
4      1 20 NA NA
5     20 20 20 20
My goal now is to add to these data frames the three columns mentioned before. For example 1 this would look like:
INDEX T1 T2 T3 T4 T5 T6 FirstScore LastScore Derivative
1     NA NA NA  3  4  5          3         5       0.33
2     57 57 57 57 NA NA         57        57       0
4     44 NA NA NA NA NA         44        44       0
5      9 11 11 17 12 NA          9        12       0.6
And for example 2:
INDEX T1 T2 T3 T4 FirstScore LastScore Derivative
1     NA NA NA 17         17        17       0
2     11 16 20 20         11        20       2.25
4      1 20 NA NA          1        20       9.5
5     20 20 20 20         20        20       0
I hope I have made myself clear and that someone can help me, thanks in advance!
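Using a single pmap_* call:
library(dplyr)
library(purrr)
# c(...)[-1] drops INDEX, leaving only the score columns of the current row
pmap_dfr(df1, ~{c(...) %>% t %>% as.data.frame() %>%
mutate(first_score = first(na.omit(c(...)[-1])),
last_score = last(na.omit(c(...)[-1])),
deriv = (last_score - first_score)/max(which(!is.na(c(...)[-1]))))})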
INDEX T1 T2 T3 T4 T5 T6 first_score last_score deriv
1 1 NA NA NA 3 4 5 3 5 0.3333333
2 2 57 57 57 57 NA NA 57 57 0.0000000
3 4 44 NA NA NA NA NA 44 44 0.0000000
4 5 9 11 11 17 12 NA 9 12 0.6000000
In dplyr only, using cur_data() and grouping by INDEX instead of rowwise(), which slows the operation down:
df1 %>% group_by(INDEX) %>%
mutate(first_score = c_across(starts_with('T'))[min(which(!is.na(cur_data())))],
last_score = c_across(starts_with('T'))[max(which(!is.na(cur_data()[1:6])))],
deriv = (last_score - first_score)/max(which(!is.na(cur_data()[1:6]))))
I think you can use the following solution. It turned out a little verbose and convoluted, but I think it is quite effective. If the last available score is not in the last T column, I detect its column index and divide the difference by that index, so NA values after the last score do not count. Otherwise the difference is divided by the total number of T columns.
library(dplyr)
library(purrr)
df %>%
select(T1:T6) %>%
pmap(., ~ {x <- c(...)[!is.na(c(...))]; c(x[1], x[length(x)])}) %>%
exec(rbind, !!!.) %>%
as_tibble() %>%
set_names(c("First", "Last")) %>%
bind_cols(df) %>%
relocate(First, Last, .after = last_col()) %>%
rowwise() %>%
mutate(Derivative = ifelse(!is.na(T6) & T6 == Last, (Last - First)/(length(df)-1),
(Last - First)/last(which(c_across(T1:T6) == Last))))
# First Sample Data
# A tibble: 4 x 10
# Rowwise:
INDEX T1 T2 T3 T4 T5 T6 First Last Derivative
<int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 3 4 5 3 5 0.333
2 2 57 57 57 57 NA NA 57 57 0
3 4 44 NA NA NA NA NA 44 44 0
4 5 9 11 11 17 12 NA 9 12 0.6
Second Sample Data
df2 %>%
select(T1:T4) %>%
pmap(., ~ {x <- c(...)[!is.na(c(...))]; c(x[1], x[length(x)])}) %>%
exec(rbind, !!!.) %>%
as_tibble() %>%
set_names(c("First", "Last")) %>%
bind_cols(df2) %>%
relocate(First, Last, .after = last_col()) %>%
rowwise() %>%
mutate(Derivative = ifelse(!is.na(T4) & T4 == Last, (Last - First)/(length(df2)-1),
(Last - First)/last(which(c_across(T1:T4) == Last))))
# A tibble: 4 x 8
# Rowwise:
INDEX T1 T2 T3 T4 First Last Derivative
<int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 17 17 17 0
2 2 11 16 20 20 11 20 2.25
3 4 1 20 NA NA 1 20 9.5
4 5 20 20 20 20 20 20 0
Here's a tidyverse solution with no hardcoding. First I pivot longer, then extract the stats for each INDEX.
library(tidyverse)
df1 %>%
pivot_longer(-INDEX, names_to = "time", names_prefix = "T", names_transform = list(time = as.integer)) %>%
filter(!is.na(value)) %>%
group_by(INDEX) %>%
summarize(FirstScore = first(value), LastScore = last(value), divisor = max(time)) %>%
mutate(Derivative = (LastScore - FirstScore) / divisor) %>%
right_join(df1) %>%
select(INDEX, T1:T6, FirstScore, LastScore, Derivative)
for this output:
# A tibble: 4 x 10
INDEX T1 T2 T3 T4 T5 T6 FirstScore LastScore Derivative
<int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 3 4 5 3 5 0.333
2 2 57 57 57 57 NA NA 57 57 0
3 4 44 NA NA NA NA NA 44 44 0
4 5 9 11 11 17 12 NA 9 12 0.6
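Output for the 2nd data. Note that the T1:T6 columns shown below still come from df1, because only the first df1 in the pipeline was swapped for df2; right_join(df1) and select(INDEX, T1:T6, ...) also have to be changed to show df2's own columns. The FirstScore, LastScore and Derivative values are those of df2: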
# A tibble: 4 x 10
INDEX T1 T2 T3 T4 T5 T6 FirstScore LastScore Derivative
<int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 3 4 5 17 17 0
2 2 57 57 57 57 NA NA 11 20 2.25
3 4 44 NA NA NA NA NA 1 20 9.5
4 5 9 11 11 17 12 NA 20 20 0
Sample data
df1 <- data.frame(
INDEX = c(1L, 2L, 4L, 5L),
T1 = c(NA, 57L, 44L, 9L),
T2 = c(NA, 57L, NA, 11L),
T3 = c(NA, 57L, NA, 11L),
T4 = c(3L, 57L, NA, 17L),
T5 = c(4L, NA, NA, 12L),
T6 = c(5L, NA, NA, NA)
)
df2 <- data.frame(
INDEX = c(1L, 2L, 4L, 5L),
T1 = c(NA, 11L, 1L, 20L),
T2 = c(NA, 16L, 20L, 20L),
T3 = c(NA, 20L, NA, 20L),
T4 = c(17L, 20L, NA, 20L)
)
You could also do:
df1 %>%
rowwise()%>%
mutate(firstScore = first(na.omit(c_across(T1:T6))),
lastScore = last(na.omit(c_across(T1:T6))),
Derivative = (lastScore-firstScore)/max(which(!is.na(c_across(T1:T6)))))
# A tibble: 4 x 10
# Rowwise:
INDEX T1 T2 T3 T4 T5 T6 firstScore lastScore Derivative
<int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 3 4 5 3 5 0.333
2 2 57 57 57 57 NA NA 57 57 0
3 4 44 NA NA NA NA NA 44 44 0
4 5 9 11 11 17 12 NA 9 12 0.6
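For comparison, here is a base R sketch of the same calculation without dplyr or purrr. It assumes the score columns are exactly those named T followed by digits (the regular expression is my assumption, not something from the question) and that every row has at least one non-NA score:

```r
df1 <- data.frame(
  INDEX = c(1L, 2L, 4L, 5L),
  T1 = c(NA, 57L, 44L, 9L),
  T2 = c(NA, 57L, NA, 11L),
  T3 = c(NA, 57L, NA, 11L),
  T4 = c(3L, 57L, NA, 17L),
  T5 = c(4L, NA, NA, 12L),
  T6 = c(5L, NA, NA, NA)
)

# Assumption: score columns are named T1, T2, ...
score_cols <- grep("^T[0-9]+$", names(df1), value = TRUE)

stats <- t(apply(df1[score_cols], 1, function(x) {
  idx   <- which(!is.na(x))         # positions of the available scores
  first <- x[[idx[1]]]
  last  <- x[[idx[length(idx)]]]
  # Divide by the position of the last available score: NAs before it
  # count towards the divisor, trailing NAs do not
  c(FirstScore = first, LastScore = last,
    Derivative = (last - first) / max(idx))
}))

res <- cbind(df1, stats)
res
```

Since the columns are found by pattern, the same code should run on df2 without edits, as long as its score columns follow the same naming.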

How to handle or ignore NAs when using ifelse to mutate a new column with multiple conditions (solved)

I am a newcomer to dplyr and tried to create a new composite variable from three different age variables using dplyr and ifelse. I made a data frame to explain the situation as follows:
library(dplyr)
z <- data.frame("j6" = c(6, 19, NA, NA, NA, NA, NA, 8, 20, 20, NA),
"j7" = c(27, 20, NA, 7, 19, NA, NA, 20, 30, 9, NA),
"j8" = c(8, 22, NA, 20, NA, 8, 30, NA, NA, NA, 3))
z <- z %>%
mutate(., age_event = NA) %>%
mutate(., age_event = ifelse(j6 < 18 | j7 < 18 | j8 < 18, 1, 0))
My expectations:
The three columns (j6, j7, and j8) hold ages. If at least one of them is less than 18, the new column (age_event) should be 1, otherwise 0.
If two of the three columns are 18 or older and the third is NA, age_event should be 0.
Likewise, if one of the three columns is 18 or older and the other two are NA, age_event should be 0.
It should be NA only if all three columns are NA.
However, the result and problems are shown as follows:
> z
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA NA <-- should be 0, but NA
6 NA NA 8 1
7 NA NA 30 NA <-- should be 0, but NA
8 8 20 NA 1
9 20 30 NA NA <-- should be 0, but NA
10 20 9 NA 1
11 NA NA 3 1
I'd like to know if there is a way to turn the 5th, 7th, and 9th observations above into 0s using mutate and ifelse. Any suggestions would be greatly appreciated!
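The NA results in rows 5, 7 and 9 come from R's three-valued logic: NA | TRUE is TRUE, but NA | FALSE is NA. A minimal base R sketch for row 5 (the vector row5 is only an illustration):

```r
NA | TRUE    # TRUE: the result is TRUE whatever the missing value is
NA | FALSE   # NA: the result depends on the missing value

# Row 5 of z: only j7 is known, and it is not below 18
row5 <- c(j6 = NA, j7 = 19, j8 = NA)

naive <- row5[["j6"]] < 18 | row5[["j7"]] < 18 | row5[["j8"]] < 18
naive   # NA

# any() with na.rm = TRUE drops the missing comparisons first
fixed <- any(row5 < 18, na.rm = TRUE)
fixed   # FALSE
```

Note that any(..., na.rm = TRUE) also returns FALSE when all three values are NA, so the all-NA case still needs its own check.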
Update (2/27/2020): I found a solution with pmin when using mutate and ifelse:
z <- z %>%
mutate(., age_event = ifelse(is.na(j6) & is.na(j7) & is.na(j8), NA,
ifelse(pmin(j6, j7, j8, na.rm = TRUE) < 18, 1, 0)))
> z
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA 0
6 NA NA 8 1
7 NA NA 30 0
8 8 20 NA 1
9 20 30 NA 0
10 20 9 NA 1
11 NA NA 3 1
You can use rowMeans() in place of the nested ifelse(). It also handles rows that are all NA, because with na.rm = TRUE the mean of such a row is NaN, and NaN > 0 evaluates to NA.
z %>%
mutate(age_event = +(rowMeans(. < 18, na.rm = TRUE) > 0))
j6 j7 j8 age_event
1 6 27 8 1
2 19 20 22 0
3 NA NA NA NA
4 NA 7 20 1
5 NA 19 NA 0
6 NA NA 8 1
7 NA NA 30 0
8 8 20 NA 1
9 20 30 NA 0
10 20 9 NA 1
11 NA NA 3 1
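A small base R sketch of why the rowMeans() trick returns NA for an all-NA row. The matrix m is a hypothetical three-row excerpt in the spirit of z:

```r
# Rows: one known age >= 18, all missing, two ages below 18
m <- rbind(c(NA, 19, NA),
           c(NA, NA, NA),
           c(6, 27, 8))

# Share of non-missing ages below 18; an all-NA row has nothing to average
share_under_18 <- rowMeans(m < 18, na.rm = TRUE)
share_under_18   # 0, NaN, 0.6666667

# NaN > 0 is NA, and unary plus turns the logical into an integer
age_event <- +(share_under_18 > 0)
age_event        # 0 NA 1
```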
We can use rowSums to count the non-NA values in each row and the values that are less than 18. We can then use case_when to assign numbers based on the different conditions.
library(dplyr)
z %>%
mutate(calc = rowSums(!is.na(.), na.rm = TRUE),
ls18 = rowSums(. < 18, na.rm = TRUE),
age_event = case_when(calc == 0 & ls18 == 0 ~ NA_integer_,
ls18 > 0 ~ 1L,
TRUE ~ 0L)) %>%
select(-calc, -ls18)
# j6 j7 j8 age_event
#1 6 27 8 1
#2 19 20 22 0
#3 NA NA NA NA
#4 NA 7 20 1
#5 NA 19 NA 0
#6 NA NA 8 1
#7 NA NA 30 0
#8 8 20 NA 1
#9 20 30 NA 0
#10 20 9 NA 1
#11 NA NA 3 1

Convert list of lists to data frame retaining names and all columns

I'd like to convert chemical formulas to a data frame containing columns for 1) the mineral name, 2) the chemical formula and 3) a set of columns for each element that is extracted from the formula. I am given the first two columns and I can extract the number of elements from each formula using CHNOSZ::makeup(). However, I'm not familiar with working with lists and I'm not sure how to rbind() the lists back into a data frame that contains everything I'm looking for (i.e. see 1-3 above).
Here is what I have so far. I'd appreciate any help (including a link to a good tutorial on how to convert data from nested lists into data frames).
library(tidyverse)
library(CHNOSZ)
formulas <- structure(list(Mineral = c("Abelsonite", "Abernathyite", "Abhurite",
"Abswurmbachite", "Acanthite", "Acetamide"), Composition = c("C31H32N4Ni",
"K(UO2)(AsO4)4(H2O)", "Sn3O(OH)2Cl2", "CuMn6(SiO4)O8", "Ag2S",
"CH3CONH2")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
test <- formulas %>%
select(Composition) %>%
map(CHNOSZ::makeup) %>%
flatten
test2 <- do.call(rbind,test)
> test2
As H K O U
[1,] 31 32 4 1 31
[2,] 4 2 1 19 1
[3,] 2 2 3 3 2
[4,] 1 6 12 1 1
[5,] 2 1 2 1 2
[6,] 2 5 1 1 2
which is not right.
You could do something like this:
library(tidyverse)
library(CHNOSZ)
test <- formulas %>%
mutate(res = map(Composition, ~stack(makeup(.x)))) %>%
unnest(cols = res) %>%
spread(ind, values)
## A tibble: 6 x 17
# Mineral Composition C H N Ni As K O U Cl
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Abelso… C31H32N4Ni 31 32 4 1 NA NA NA NA NA
#2 Aberna… K(UO2)(AsO… NA 2 NA NA 4 1 19 1 NA
#3 Abhuri… Sn3O(OH)2C… NA 2 NA NA NA NA 3 NA 2
#4 Abswur… CuMn6(SiO4… NA NA NA NA NA NA 12 NA NA
#5 Acanth… Ag2S NA NA NA NA NA NA NA NA NA
#6 Acetam… CH3CONH2 2 5 1 NA NA NA 1 NA NA
## … with 6 more variables: Sn <dbl>, Cu <dbl>, Mn <dbl>, Si <dbl>, Ag <dbl>,
## S <dbl>
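The do.call(rbind, ...) attempt in the question goes wrong because rbind() matches vectors by position, not by name. A minimal base R sketch of the problem and of aligning by element name instead. The two count vectors are hand-written stand-ins for what makeup() returns, so CHNOSZ is not needed to run it:

```r
# Hand-written stand-ins for makeup() results (named element counts)
counts <- list(
  Ag2S     = c(Ag = 2, S = 1),
  CH3CONH2 = c(C = 2, H = 5, N = 1, O = 1)
)

# rbind() works by position: shorter vectors are recycled and the column
# names come from just one of the vectors, so the counts end up misaligned
do.call(rbind, counts)

# Align by name instead: index every vector by the full set of elements,
# which yields NA wherever a formula lacks that element
elements <- unique(unlist(lapply(counts, names)))
mat <- t(sapply(counts, function(x) unname(x[elements])))
colnames(mat) <- elements
mat
```

The stack() plus spread() pipeline above achieves the same alignment, since spreading on the element name creates one column per element.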

lapply alternative to for loop to append to data frame

I have a data frame:
df<-structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L, 40L,
45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", "pos"), row.names = c(NA, -14L), class = "data.frame")
> head(df)
chrom pos
1 1 10
2 1 200
3 1 134
4 1 400
5 1 600
6 1 1000
And I want to calculate pos[i+1] - pos[i] within the same chromosome (chrom).
By using a for loop over each chrom level, and another over each row, I get the expected results:
for (c in levels(df$chrom)){
df_chrom<-filter(df, chrom == c)
df_chrom<-arrange(df_chrom, df_chrom$pos)
for (i in 1:nrow(df_chrom)){
dist<-(df_chrom$pos[i+1] - df_chrom$pos[i])
logdist<-log10(dist)
cat(c, i, df_chrom$pos[i], dist, logdist, "\n")
}
}
However, I want to save this to a data frame, and think that lapply or apply is the right way to go about this. I can't work out how to make the pos[i+1] - pos[i] calculation though (seeing as lapply works on each row/column).
Any pointers would be appreciated.
Here's the output from my solution:
chrom index pos dist log10dist
1 1 10 124 2.093422
1 2 134 66 1.819544
1 3 200 200 2.30103
1 4 400 200 2.30103
1 5 600 400 2.60206
1 6 1000 NA NA
2 1 20 13 1.113943
2 2 33 NA NA
3 1 40 5 0.69897
3 2 45 NA NA
4 1 50 5 0.69897
4 2 55 45 1.653213
4 3 100 23 1.361728
4 4 123 NA NA
We could do this with a grouped difference. Convert the 'data.frame' to a 'data.table' (setDT(df)), order by 'pos', and, grouped by 'chrom', take the difference of 'pos' (diff) and the log10 of that difference:
library(data.table)
setDT(df)[order(pos), {v1 <- diff(pos)
.(index = seq_len(.N), pos = pos,
dist = c(v1, NA), logdiff = c(log10(v1), NA))}
, by = chrom]
# chrom index pos dist logdiff
# 1: 1 1 10 124 2.093422
# 2: 1 2 134 66 1.819544
# 3: 1 3 200 200 2.301030
# 4: 1 4 400 200 2.301030
# 5: 1 5 600 400 2.602060
# 6: 1 6 1000 NA NA
# 7: 2 1 20 13 1.113943
# 8: 2 2 33 NA NA
# 9: 3 1 40 5 0.698970
#10: 3 2 45 NA NA
#11: 4 1 50 5 0.698970
#12: 4 2 55 45 1.653213
#13: 4 3 100 23 1.361728
#14: 4 4 123 NA NA
Running the OP's code prints:
#1 1 10 124 2.093422
#1 2 134 66 1.819544
#1 3 200 200 2.30103
#1 4 400 200 2.30103
#1 5 600 400 2.60206
#1 6 1000 NA NA
#2 1 20 13 1.113943
#2 2 33 NA NA
#3 1 40 5 0.69897
#3 2 45 NA NA
#4 1 50 5 0.69897
#4 2 55 45 1.653213
#4 3 100 23 1.361728
#4 4 123 NA NA
We split df by df$chrom (note that we reorder both df and df$chrom before splitting). Then we go through each subgroup (called a in this example) using lapply. On the pos column of each subgroup, we calculate the difference (diff) of consecutive elements and take log10; note that the resulting column is named dist although it holds the log10 values. Since diff returns one element fewer than its input, we append an NA. Finally, we rbind all the subgroups together using do.call.
do.call(rbind, lapply(split(df[order(df$chrom, df$pos),], df$chrom[order(df$chrom, df$pos)]),
function(a) data.frame(a, dist = c(log10(diff(a$pos)), NA))))
# chrom pos dist
#1.1 1 10 2.093422
#1.3 1 134 1.819544
#1.2 1 200 2.301030
#1.4 1 400 2.301030
#1.5 1 600 2.602060
#1.6 1 1000 NA
#2.7 2 20 1.113943
#2.8 2 33 NA
#3.9 3 40 0.698970
#3.10 3 45 NA
#4.11 4 50 0.698970
#4.12 4 55 1.653213
#4.13 4 100 1.361728
#4.14 4 123 NA
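Another base R option, sketched here on a shortened, hypothetical version of the data, is ave(), which applies a function within each group and returns the results in the original row order, so no split/rbind step is needed:

```r
# Shortened stand-in for the question's data
df <- data.frame(
  chrom = factor(c("1", "1", "1", "2", "2")),
  pos   = c(10L, 200L, 134L, 20L, 33L)
)

# Sort by chromosome and position first, so diff() sees increasing positions
df <- df[order(df$chrom, df$pos), ]

# Within each chrom: distance to the next position, NA for the last row
df$dist <- ave(df$pos, df$chrom, FUN = function(p) c(diff(p), NA))
df$log10dist <- log10(df$dist)
df
```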

Keep unique entries by name and by time

A bit of code golf that I am struggling with quite a bit. I got hold of a complex dataset in long format, which I need in wide format for analysis. The conversion itself was easy. However, the converted dataset contains redundancy because of how the data was entered. Here is a MWE of the problem:
id <- c("ana","ana","ana", "brad","ana","brad","brad","brad", "matt", "matt", "matt")
hour <- c(0, 0, 24, 0, 48, 24, NA, 72, 0 , 24, 48 )
assessment <- c("memory", "memory", "attention", "verbal", "attention", "memory", "attention","attention", "memory", "attention", "attention")
value <- c(0.000,NA,0.895,0.000,15.000, 3, 5, NA,2, 4,5 )
mydata<-data.frame(id, hour, assessment, value)
Results in:
> mydata
id hour assessment value
1 ana 0 memory 0.000
2 ana 0 memory NA
3 ana 24 attention 0.895
4 brad 0 verbal 0.000
5 ana 48 attention 15.000
6 brad 24 memory 3.000
7 brad NA attention 5.000
8 brad 72 attention NA
9 matt 0 memory 2.000
10 matt 24 attention 4.000
11 matt 48 attention 5.000
and after:
library(dplyr)
library(tidyr)
mydata %>%
group_by(id) %>%
mutate(i1=row_number()) %>%
spread(assessment, value)
gets to:
Source: local data frame [11 x 6]
Groups: id [3]
id hour i1 attention memory verbal
* <fctr> <dbl> <int> <dbl> <dbl> <dbl>
1 ana 0 1 NA 0 NA
2 ana 0 2 NA NA NA
3 ana 24 3 0.895 NA NA
4 ana 48 4 15.000 NA NA
5 brad 0 1 NA NA 0
6 brad 24 2 NA 3 NA
7 brad 72 4 NA NA NA
8 brad NA 3 5.000 NA NA
9 matt 0 1 NA 2 NA
10 matt 24 2 4.000 NA NA
11 matt 48 3 5.000 NA NA
Note that ana has two entries for hour 0 and memory; and brad has one entry with zero and another with missing. That missing should be considered as zero as well, that was a typing error of whoever collected the data.
The table below shows how ana's and brad's entries should be. Repetitions for the same id and hour (including NA) should be collapsed/merged (look at lines 1 and 5 below).
id hour i1 attention memory verbal
* <fctr> <dbl> <int> <dbl> <dbl> <dbl>
1 ana 0 1 NA 0 NA
2 ana 24 3 0.895 NA NA
4 ana 48 4 15.000 NA NA
5 brad 0 1 5.000 NA 0
6 brad 24 2 NA 3 NA
7 brad 72 4 NA NA NA
9 matt 0 1 NA 2 NA
10 matt 24 2 4.000 NA NA
11 matt 48 3 5.000 NA NA
Question:
How do I reduce the duplicates for each subject+hour in such a dataset, so that it will look like the previous table?
One option is to replace the NA with 0, get the distinct rows and then proceed as in the OP's code. (funs() is deprecated since dplyr 0.8.0, so a formula lambda is used here.)
mydata %>%
mutate_at(vars(hour, value), ~ replace(., is.na(.), 0)) %>%
arrange(id, hour, desc(value)) %>%
distinct() %>%
group_by(id, hour, assessment) %>%
spread(assessment, value)
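The same idea can be sketched in base R, for readers without dplyr and tidyr. As in the answer above, this assumes a missing hour or value really means 0; the names clean and wide are mine:

```r
id <- c("ana", "ana", "ana", "brad", "ana", "brad", "brad", "brad",
        "matt", "matt", "matt")
hour <- c(0, 0, 24, 0, 48, 24, NA, 72, 0, 24, 48)
assessment <- c("memory", "memory", "attention", "verbal", "attention",
                "memory", "attention", "attention", "memory", "attention",
                "attention")
value <- c(0.000, NA, 0.895, 0.000, 15.000, 3, 5, NA, 2, 4, 5)
mydata <- data.frame(id, hour, assessment, value)

# Treat missing hours and values as 0, then drop rows that became identical
clean <- mydata
clean$hour[is.na(clean$hour)]   <- 0
clean$value[is.na(clean$value)] <- 0
clean <- unique(clean)

# One row per id and hour, one value column per assessment
wide <- reshape(clean, idvar = c("id", "hour"), timevar = "assessment",
                direction = "wide")
wide
```

reshape() names the new columns value.memory, value.attention and value.verbal, and brad's hour-0 row now carries both the verbal score and the attention score.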
