Required order of column names which do not have NULL values - R

My data is a bit atypical and I need to find the field/column order that follows a certain pattern.
For instance, one field (say Sub3) has values up to some row and is then followed by NULL values; after that another field (like Sub1) continues with values and is again followed by NULLs.
In some cases multiple fields may have values in the same rows (like Sub2 and Sub4).
In the case below, the solution is a vector of field names that follows the pattern c(Sub3, Sub1, c(Sub2, Sub4), Sub5).
Here is a reproducible version of the data:
structure(list(RollNo = 1:10, Sub1 = c(NA, NA, NA, NA, 3L, 2L,
NA, NA, NA, NA), Sub2 = c(NA, NA, NA, NA, NA, NA, "A", "B", NA,
NA), Sub3 = c(4L, 3L, 5L, 6L, NA, NA, NA, NA, NA, NA), Sub4 = c(NA,
NA, NA, NA, NA, NA, 2L, 5L, NA, NA), Sub5 = c(NA, NA, NA, NA,
NA, NA, NA, NA, 7L, NA)), .Names = c("RollNo", "Sub1", "Sub2",
"Sub3", "Sub4", "Sub5"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000000200788>)

Sounds like you are sorting on the order of first non-NA data. If df is your data:
sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1)))
# RollNo Sub1 Sub2 Sub3 Sub4 Sub5
# 1 5 7 1 7 9
Gives the first non-NA row for each column. This should be a natural sort, meaning ties retain the original order between the ties.
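That stability can be checked directly on the vector of first non-NA rows shown above (ties at positions 1 & 4 and at 3 & 5):

```r
v <- c(1, 5, 7, 1, 7, 9)  # first non-NA row per column, as printed above
order(v)                  # ties keep their original left-to-right order
#> [1] 1 4 2 3 5 6
```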
There are a couple of ways to use this, one such:
order(sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1))))
# [1] 1 4 2 3 5 6
df[,order(sapply(df, function(x) min(Inf, head(which(!is.na(x)),n=1))))]
# RollNo Sub3 Sub1 Sub2 Sub4 Sub5
# 1 1 4 NA <NA> NA NA
# 2 2 3 NA <NA> NA NA
# 3 3 5 NA <NA> NA NA
# 4 4 6 NA <NA> NA NA
# 5 5 NA 3 <NA> NA NA
# 6 6 NA 2 <NA> NA NA
# 7 7 NA NA A 2 NA
# 8 8 NA NA B 5 NA
# 9 9 NA NA <NA> NA 7
# 10 10 NA NA <NA> NA NA
I'm inferring from the column names that RollNo should always be first, so:
df[,c(1, 1 + order(sapply(df[-1], function(x) min(Inf, head(which(!is.na(x)),n=1)))))]
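One caveat, as a sketch under the assumption that the posted object really is a data.table (as the dput suggests): numeric column selection in j behaves differently there, so data.table::setcolorder may be the safer route:

```r
library(data.table)

# first non-NA row per column (Inf for all-NA columns)
first_row <- sapply(DT, function(x) min(Inf, head(which(!is.na(x)), n = 1)))

# reorder columns by reference; RollNo ties with Sub3 at row 1 and,
# being first originally, stays first after the stable sort
setcolorder(DT, order(first_row))
```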

Using:
DT[, nms := paste0(names(.SD)[!is.na(.SD)], collapse = ','), 1:nrow(DT), .SDcols = 2:6]
will get you:
> DT
RollNo Sub1 Sub2 Sub3 Sub4 Sub5 nms
1: 1 NA NA 4 NA NA Sub3
2: 2 NA NA 3 NA NA Sub3
3: 3 NA NA 5 NA NA Sub3
4: 4 NA NA 6 NA NA Sub3
5: 5 3 NA NA NA NA Sub1
6: 6 2 NA NA NA NA Sub1
7: 7 NA A NA 2 NA Sub2,Sub4
8: 8 NA B NA 5 NA Sub2,Sub4
9: 9 NA NA NA NA 7 Sub5
10: 10 NA NA NA NA NA
If you just want the specified vector:
unique(DT[, paste0(names(.SD)[!is.na(.SD)], collapse = ','), 1:nrow(DT), .SDcols = 2:6][V1!='']$V1)
which returns:
[1] "Sub3" "Sub1" "Sub2,Sub4" "Sub5"
As @Frank pointed out in the comments, you can also use:
melt(DT, id=1, na.rm = TRUE)[, toString(unique(variable)), by = RollNo][order(RollNo)]

Related

Merge survey columns across variables in R

I am analyzing a very large survey in which I want to combine four parts of the survey through several combinations of 4 questions. Below I have created a small example. A little background: a respondent answered either q2, q5, q8 or q9, because they only filled in 1 of the 4 parts of the survey, based on their answer in q1 (not shown here). Therefore, only one of the four columns contains an answer (1 or 2), while the others contain NAs. q2, q5, q8 and q9 are similar questions with the same answer options, which is why I want to combine them to make my dataset less wide and easier to analyze further.
q2_1 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_1 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_1 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_1 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
q2_2 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_2 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_2 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_2 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
df <- data.frame(q2_1, q5_1, q8_1, q9_1, q2_2, q5_2, q8_2, q9_2)
df
# running df shows:
q2_1 q5_1 q8_1 q9_1 q2_2 q5_2 q8_2 q9_2
1 NA NA NA 1 NA NA NA 1
2 NA NA NA 2 NA NA NA 2
3 NA NA 1 NA NA NA 1 NA
4 NA NA 2 NA NA NA 2 NA
5 NA 1 NA NA NA 1 NA NA
6 NA 2 NA NA NA 2 NA NA
7 1 NA NA NA 1 NA NA NA
8 2 NA NA NA 2 NA NA NA
My desired end result would be a dataframe with only columns for questions starting with q2_ (so, in the example, that would be q2_1 and q2_2; in reality there are about 20 for this question), but with the NAs replaced with the answer options from the corresponding q5_, q8_, and q9_ columns.
# desired end result
q2_1 q2_2
1 1 1
2 2 2
3 1 1
4 2 2
5 1 1
6 2 2
7 1 1
8 2 2
For single questions, I've done this using the code below, but this is very manual, and because q2, q5, q8, and q9 all go up to _20, I'm looking for a way to automate this.
# example single question
library(tidyverse)
df <- df %>%
mutate(q2_1 = case_when(!is.na(q2_1) ~ q2_1,
!is.na(q5_1) ~ q5_1,
!is.na(q8_1) ~ q8_1,
!is.na(q9_1) ~ q9_1))
I hope I explained myself well enough and am looking forward to some directions!
Here's one way, using coalesce:
df %>%
mutate(q2_1 = do.call(coalesce, across(ends_with('_1'))),
q2_2 = do.call(coalesce, across(ends_with('_2')))) %>%
select(q2_1, q2_2)
#> q2_1 q2_2
#> 1 1 1
#> 2 2 2
#> 3 1 1
#> 4 2 2
#> 5 1 1
#> 6 2 2
#> 7 1 1
#> 8 2 2
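Since the real survey runs up to _20, the same coalesce idea can be generalized over all suffixes; a sketch, assuming every question group shares the suffix pattern used above:

```r
library(dplyr)
library(purrr)

suffixes <- paste0("_", 1:2)  # extend to 1:20 for the full survey
combined <- map_dfc(suffixes, function(s) {
  # build one q2_* column per suffix by coalescing all columns sharing it
  transmute(df, !!paste0("q2", s) := do.call(coalesce, across(ends_with(s))))
})
```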
q2_1 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_1 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_1 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_1 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
q2_2 <- c(NA, NA, NA, NA, NA, NA, rep(c(1:2), 1))
q5_2 <- c(NA, NA, NA, NA, rep(c(1:2), 1), NA, NA)
q8_2 <- c(NA, NA, rep(c(1:2), 1), NA, NA, NA, NA)
q9_2 <- c(rep(c(1:2), 1), NA, NA, NA, NA, NA, NA)
df <- data.frame(q2_1, q5_1, q8_1, q9_1, q2_2, q5_2, q8_2, q9_2)
df
#> q2_1 q5_1 q8_1 q9_1 q2_2 q5_2 q8_2 q9_2
#> 1 NA NA NA 1 NA NA NA 1
#> 2 NA NA NA 2 NA NA NA 2
#> 3 NA NA 1 NA NA NA 1 NA
#> 4 NA NA 2 NA NA NA 2 NA
#> 5 NA 1 NA NA NA 1 NA NA
#> 6 NA 2 NA NA NA 2 NA NA
#> 7 1 NA NA NA 1 NA NA NA
#> 8 2 NA NA NA 2 NA NA NA
library(tidyverse)
suffix <- str_c("_", 1:2)
map_dfc(.x = suffix,
        .f = ~ transmute(df, !!str_c("q2", .x) := rowSums(across(ends_with(.x)), na.rm = TRUE)))
#> q2_1 q2_2
#> 1 1 1
#> 2 2 2
#> 3 1 1
#> 4 2 2
#> 5 1 1
#> 6 2 2
#> 7 1 1
#> 8 2 2
Created on 2022-04-04 by the reprex package (v2.0.1)

Collapsing column values in a specific order and leaving the missing values as NA in R

I am using R.
I have 4 different databases. Each one has values for my variables. Some of the bases have more values than others, so I want to use first the one that has the most values and last the one that has the fewest. The data looks like this...
Variables   A   B   C   D
John        2   4
Mike            6
Walter          7
Jennifer        9   8
Amanda      3
Carlos      9
Michael         3
James                   5
Kevin       4
Dennis              7
Frank
Steven
Joseph
Elvis           2
Maria           1
So, in order to fill the data, I need to create a new column that first uses the data of column B (because it contains the most values), then A, then C, and then D; the ones that are missing should stay NA. I also need to add another column that gives the source of the data. In other words, if I am using column B to get John's value, I need a column telling me that the data comes from column B.
The new columns should look like this...
Variables E D
John 4 B
Mike 6 B
Walter 7 B
Jennifer 9 B
Amanda 3 A
Carlos 9 A
Michael 3 B
James 5 D
Kevin 4 A
Dennis 7 C
Frank NA NA
Steven NA NA
Joseph NA NA
Elvis 2 B
Maria 1 B
With tidyverse you can do the following...
Use pivot_longer to put into long form. Make name an ordered factor by "B", "A", "C", and "D". Then when you arrange, you can get the first value by this order within each person's name.
This assumes your missing data are NA. If they are instead blank character values, you can filter those out with filter(value != "") instead of drop_na(value).
library(tidyverse)
df %>%
pivot_longer(cols = -Variables) %>%
mutate(name = ordered(name, levels = c('B', 'A', 'C', 'D'))) %>%
group_by(Variables) %>%
drop_na(value) %>%
arrange(name) %>%
summarise(E = first(value),
New_D = first(name)) %>%
right_join(df)
Output
Variables E New_D A B C D
<chr> <dbl> <ord> <dbl> <dbl> <dbl> <dbl>
1 Amanda 3 A 3 NA NA NA
2 Carlos 9 A 9 NA NA NA
3 Dennis 7 C NA NA 7 NA
4 Elvis 2 B NA 2 NA NA
5 James 5 D NA NA NA 5
6 Jennifer 9 B NA 9 8 NA
7 John 4 B 2 4 NA NA
8 Kevin 4 A 4 NA NA NA
9 Maria 1 B NA 1 NA NA
10 Michael 3 B NA 3 NA NA
11 Mike 6 B NA 6 NA NA
12 Walter 7 B NA 7 NA NA
13 Frank NA NA NA NA NA NA
14 Steven NA NA NA NA NA NA
15 Joseph NA NA NA NA NA NA
Data
df <- structure(list(Variables = c("John", "Mike", "Walter", "Jennifer",
"Amanda", "Carlos", "Michael", "James", "Kevin", "Dennis", "Frank",
"Steven", "Joseph", "Elvis", "Maria"), A = c(2, NA, NA, NA, 3,
9, NA, NA, 4, NA, NA, NA, NA, NA, NA), B = c(4, 6, 7, 9, NA,
NA, 3, NA, NA, NA, NA, NA, NA, 2, 1), C = c(NA, NA, NA, 8, NA,
NA, NA, NA, NA, 7, NA, NA, NA, NA, NA), D = c(NA, NA, NA, NA,
NA, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-15L))
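The same result can also be sketched with coalesce, taking the columns in the preference order from the question (B, A, C, D) and recording which column supplied each value; the helper below is an illustration, not part of the original answer:

```r
library(dplyr)

pref <- c("B", "A", "C", "D")  # preference order assumed from the question
df %>%
  mutate(
    E = coalesce(B, A, C, D),
    # first non-NA column in preference order; NA when the whole row is empty
    New_D = apply(across(all_of(pref)), 1, function(r) pref[which(!is.na(r))[1]])
  )
```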

Find max value within a data frame interval

I have a dataframe that has x/y values every 5 seconds, with a depth value every second (time column). There is no depth where there is an x/y value.
x <- c("1430934", NA, NA, NA, NA, "1430939")
y <- c("4943206", NA, NA, NA, NA, "4943210")
time <- c(1:6)
depth <- c(NA, 10, 19, 84, 65, NA)
data <- data.frame(x, y, time, depth)
data
x y time depth
1 1430934 4943206 1 NA
2 NA NA 2 10
3 NA NA 3 19
4 NA NA 4 84
5 NA NA 5 65
6 1430939 4943210 6 NA
I would like to calculate the maximum depth between the x/y values that are not NA and add this to a new column in the row of the starting x/y value. So, the max depth of rows 2-5. An example of the desired output:
x y time depth newvar
1 1430934 4943206 1 NA 84
2 NA NA 2 10 NA
3 NA NA 3 19 NA
4 NA NA 4 84 NA
5 NA NA 5 65 NA
6 1430939 4943210 6 NA NA
This is to repeat whenever a new x/y value is present.
You can use ave and cumsum with !is.na to get the groups for ave like:
data$newvar <- ave(data$depth, cumsum(!is.na(data$x)), FUN=
function(x) if(all(is.na(x))) NA else {
c(max(x, na.rm=TRUE), rep(NA, length(x)-1))})
data
# x y time depth newvar
#1 1430934 4943206 1 NA 84
#2 <NA> <NA> 2 10 NA
#3 <NA> <NA> 3 19 NA
#4 <NA> <NA> 4 84 NA
#5 <NA> <NA> 5 65 NA
#6 1430939 4943210 6 NA NA
Using dplyr, we can create groups of every 5 rows and update the first row in each group with the max value of the group, ignoring NA values.
library(dplyr)
df %>%
group_by(grp = ceiling(time/5)) %>%
mutate(depth = ifelse(row_number() == 1, max(depth, na.rm = TRUE), NA))
In base R, we can use tapply:
inds <- seq(1, nrow(df), 5)
df$depth[inds] <- tapply(df$depth, ceiling(df$time/5), max, na.rm = TRUE)
df$depth[-inds] <- NA
Maybe you can try ave as below
df <- within(df,
             newvar <- ave(depth,
                           ceiling(time/5),
                           # relies on the x/y row (where depth is NA) coming first in each block
                           FUN = function(x) ifelse(length(x) > 1 & is.na(x), max(na.omit(x)), NA)))
such that
> df
x y time depth newvar
1 1430934 4943206 1 NA 84
2 NA NA 2 10 NA
3 NA NA 3 19 NA
4 NA NA 4 84 NA
5 NA NA 5 65 NA
6 1430939 4943210 6 NA NA
DATA
df <- structure(list(x = c(1430934L, NA, NA, NA, NA, 1430939L), y = c(4943206L,
NA, NA, NA, NA, 4943210L), time = 1:6, depth = c(NA, 10L, 19L,
84L, 65L, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))
Here is another option using data.table:
library(data.table)
setDT(data)[, newvar := replace(frollapply(depth, 5L, max, na.rm=TRUE, align="left"),
seq(.N) %% 5L != 1L, NA_integer_)]

R Aggregate multiple rows

My question seems to be a very common one, but the solutions I found on the internet don't work...
I would like to aggregate rows in a data frame in R.
Here is the structure of my data frame (df); it is a table of citations:
Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Kaufmann V NA 1 NA NA
Kaufmann V NA NA 1 NA
Kaufmann V NA NA NA 1
Orfeuil P 1 NA NA NA
Orfeuil P NA 1 NA NA
Sorokin P NA NA NA 1
That is I would like to have :
Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 4 NA NA NA
Kaufmann V NA 1 1 1
Orfeuil P 1 1 NA NA
Sorokin P NA NA NA 1
I have tried these solutions, but they don't work:
ddply(df,"Autors", numcolwise(sum))
and
df %>% group_by(Autors) %>% summarize_all(sum)
It aggregates the rows well, but the values (the sums of the 1s) are not correct at all! And I don't understand why...
Do you have an idea?
Thank you very much!
Joël
You can also do the summing using rowsum(), although it (perhaps misleadingly) gives sums of 0 rather than NA for cells in the output that had only NAs as input.
rowsum(df[, 2:5], df$Autors, na.rm = TRUE)
Gives:
Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 4 0 0 0
Kaufmann V 0 1 1 1
Orfeuil P 1 1 0 0
Sorokin P 0 0 0 1
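If NA rather than 0 is wanted for cells whose group had no data at all, the zeros from rowsum() can be patched afterwards; a small sketch:

```r
res <- rowsum(df[, 2:5], df$Autors, na.rm = TRUE)
# count how many non-NA values fed each cell; 0 means the sum was vacuous
counts <- rowsum((!is.na(df[, 2:5])) + 0, df$Autors)
res[counts == 0] <- NA
res
```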
It could be because na.rm is not used:
library(dplyr)
df %>%
group_by(Autors) %>%
summarize_all(sum, na.rm = TRUE)
If both plyr and dplyr are loaded, summarise gets masked, though that shouldn't affect summarise_all, which is a dplyr-only function.
Based on the expected output: with na.rm = TRUE it removes all NAs, and for cases having only NAs it returns 0. To avoid that, we can add a condition:
df %>%
group_by(Autors) %>%
summarize_all(funs(if(all(is.na(.))) NA else sum(., na.rm = TRUE)))
# A tibble: 4 x 5
# Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
# <chr> <int> <int> <int> <int>
#1 Burgess E 4 NA NA NA
#2 Kaufmann V NA 1 1 1
#3 Orfeuil P 1 1 NA NA
#4 Sorokin P NA NA NA 1
data
df <- structure(list(Autors = c("Burgess E", "Burgess E", "Burgess E",
"Burgess E", "Kaufmann V", "Kaufmann V", "Kaufmann V", "Orfeuil P",
"Orfeuil P", "Sorokin P"), Lannoy_2016 = c(1L, 1L, 1L, 1L, NA,
NA, NA, 1L, NA, NA), Ramadier_2014 = c(NA, NA, NA, NA, 1L, NA,
NA, NA, 1L, NA), Lord_2009 = c(NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA), Ortar_2008 = c(NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L
)), .Names = c("Autors", "Lannoy_2016", "Ramadier_2014", "Lord_2009",
"Ortar_2008"), class = "data.frame", row.names = c(NA, -10L))

With R, how to scan interval values and print out the matching subset of another dataset?

Dataset 1:
dput(kk)
structure(list(V1 = c(1.05, NA, NA, NA, NA, NA, NA, NA, NA,
NA,
1.06, NA, NA, NA, NA, NA, NA, NA), V2 = c(NA, NA, 105.11, 105.12,
105.13, 105.14, 105.15, NA, 105.94, 105.99, NA, NA, 106.11, 106.12,
106.13, 106.14, 106.19, 106.2)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-18L))
show(kk)
V1 V2
1 1.05 NA
2 NA NA
3 NA 105.11
4 NA 105.12
5 NA 105.13
6 NA 105.14
7 NA 105.15
8 NA NA
9 NA 105.94
10 NA 105.99
11 1.06 NA
12 NA NA
13 NA 106.11
14 NA 106.12
15 NA 106.13
16 NA 106.14
17 NA 106.19
18 NA 106.20
Dataset 2:
structure(list(V1 = structure(1:4, .Label = c("1.05 ~ 1.06", "1.07",
"1.08", "1.09 ~ 1.10"), class = "factor")), .Names = "V1", class =
"data.frame", row.names = c(NA,
-4L))
V1
1 1.05 ~ 1.06
2 1.07
3 1.08
4 1.09 ~ 1.10
How can I scan the interval values of V1 in dataset 2 and print out the subset of dataset 1 covered by each interval, as a new dataset like the above?
If I understand correctly you are after something like this:
lapply(df2$V1, function(x) {
z <- as.numeric(unlist(strsplit(as.character(x), split = " ~ ")))
b <- which(df1$V1 %in% z)
if(length(b)==0) return(NULL)
if(length(b)==1) return(df1[b,])
if(length(b)==2) return(df1[b[1]:b[2],])
})
#result
[[1]]
V1 V2
1 1.05 NA
2 NA NA
3 NA 105.11
4 NA 105.12
5 NA 105.13
6 NA 105.14
7 NA 105.15
8 NA NA
9 NA 105.94
10 NA 105.99
11 1.06 NA
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
Explanation
with
lapply(df2$V1, function(x)...
you go through elements of df2$V1 one by one and apply a function on each.
The function first splits the string at " ~ ", then converts to numeric after unlisting (strsplit returns a list, not a vector):
z <- as.numeric(unlist(strsplit(as.character(x), split = " ~ ")))
Then it determines which elements of df1$V1 are in z
b <- which(df1$V1 %in% z)
If b has 0 elements, it returns NULL;
if b has 1 element, it returns just that one row of df1;
if b has 2 elements, it returns the range of rows from b[1] to b[2].
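If the interval bounds may not appear verbatim in df1$V1 (say an interval whose upper bound is missing from the data), a numeric range test can be sketched instead of the exact %in% match, treating a single value as a degenerate range:

```r
lapply(df2$V1, function(x) {
  z <- as.numeric(unlist(strsplit(as.character(x), split = " ~ ")))
  if (length(z) == 1) z <- c(z, z)           # single value -> degenerate range
  # rows whose V1 marker falls inside [lower, upper]
  rows <- which(!is.na(df1$V1) & df1$V1 >= z[1] & df1$V1 <= z[2])
  if (length(rows) == 0) return(NULL)
  df1[min(rows):max(rows), ]                 # span from first to last matching marker
})
```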