Split up a dataframe by number of NAs in each row - r

Consider a data frame of a thousand rows and columns that includes several NAs. I'd like to split this data frame into smaller ones based on the number of NAs in each row: all rows that contain the same number of NAs, if any, should end up in the same group. The new data frames are then saved separately.
> DF
ID C1 C2 C3 C4 C5
aa 12 13 10 NA 12
ff 12 NA NA 23 13
ee 67 23 NA NA 21
jj 31 14 NA 41 11
ss NA 15 11 12 11
The desired output will be:
> DF_chunk_1
ID C1 C2 C3 C4 C5
aa 12 13 10 NA 12
jj 31 14 NA 41 11
ss NA 15 11 12 11
> DF_chunk_2
ID C1 C2 C3 C4 C5
ff 12 NA NA 23 13
ee 67 23 NA NA 21
I'd appreciate any suggestions.

Try the following, per the useful comments. You can split() on a group built with apply():
#Code
new <- split(DF,apply(DF[,-1],1,function(x)sum(is.na(x))))
Output:
$`1`
ID C1 C2 C3 C4 C5
1 aa 12 13 10 NA 12
4 jj 31 14 NA 41 11
5 ss NA 15 11 12 11
$`2`
ID C1 C2 C3 C4 C5
2 ff 12 NA NA 23 13
3 ee 67 23 NA NA 21
A more practical way (many thanks and credit to @RuiBarradas):
#Code2
new <- split(DF, rowSums(is.na(DF[-1])))
Same output.
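Since the question also asks to save the new data frames separately, here is a minimal sketch of one way to do that (the CSV file names are illustrative assumptions, not from the question):

```r
# Reconstruct the example data frame from the question
DF <- data.frame(
  ID = c("aa", "ff", "ee", "jj", "ss"),
  C1 = c(12, 12, 67, 31, NA),
  C2 = c(13, NA, 23, 14, 15),
  C3 = c(10, NA, NA, NA, 11),
  C4 = c(NA, 23, NA, 41, 12),
  C5 = c(12, 13, 21, 11, 11)
)

# Split by the number of NAs per row, excluding the ID column
chunks <- split(DF, rowSums(is.na(DF[-1])))

# Write each chunk to its own file; names(chunks) are the NA counts ("1", "2", ...)
for (k in names(chunks)) {
  write.csv(chunks[[k]], paste0("DF_chunk_", k, ".csv"), row.names = FALSE)
}
```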

Related

Discrepancy in dplyr in R: n() and length(variable-name) giving different answers after group_by

I would think that for any data frame, using group_by and then n(), or group_by and then length() of any variable in the data frame, should give the same answer.
However, today I noticed that this is not the case.
I am not allowed to post this data, but here is the code.
Can someone try to understand why total count and c2 are not the same?
Please note that in the used data frame, WAVE_NO and REF_PERIOD_WAVE will give rise to the same groups. I just used this for printing nicely. Also DATE_OF_INTERVIEW is all NA in WAVE_NO = 1 to 24.
library(dplyr)
library(RMySQL)
con <- dbConnect(dbDriver("MySQL"), host = Sys.getenv("mydb"), db = "hhd", user = Sys.getenv("MY_USER"), password = Sys.getenv("MY_PASSWORD"))
dbListTables(con)
asp <- tbl(con,"my_table")
> asp %>% group_by(WAVE_NO,REF_PERIOD_WAVE) %>%
summarise(total_count = n(), c2 = length(DATE_OF_INTERVIEW)) %>% as.data.frame
`summarise()` has grouped output by 'WAVE_NO'. You can override using the `.groups` argument.
WAVE_NO REF_PERIOD_WAVE total_count c2
1 1 W1 2014 166744 NA
2 2 W2 2014 160705 NA
3 3 W3 2014 157442 NA
4 4 W1 2015 158443 NA
5 5 W2 2015 158666 NA
6 6 W3 2015 158624 NA
7 7 W1 2016 158624 NA
8 8 W2 2016 159778 NA
9 9 W3 2016 160511 NA
10 10 W1 2017 161167 NA
11 11 W2 2017 160847 NA
12 12 W3 2017 168165 NA
13 13 W1 2018 169215 NA
14 14 W2 2018 172365 NA
15 15 W3 2018 173181 NA
16 16 W1 2019 174405 NA
17 17 W2 2019 174405 NA
18 18 W3 2019 174405 NA
19 19 W1 2020 174405 NA
20 20 W2 2020 174405 NA
21 21 W3 2020 174405 NA
22 22 W1 2021 176661 NA
23 23 W2 2021 178677 NA
24 24 W3 2021 178677 NA
25 25 W1 2022 178677 11
26 26 W2 2022 178677 11
>
The problem is that while n() translates to COUNT(*) in MySQL, length() translates to MySQL's LENGTH(), which returns the length of a string:
library(dbplyr)
library(dplyr)
md <- lazy_frame(a = gl(5, 3), b = rnorm(15), con = simulate_mysql())
md %>%
group_by(a) %>%
summarize(n = n(), len = length(b))
# <SQL>
# SELECT `a`, COUNT(*) AS `n`, length(`b`) AS `len`
# FROM `df`
# GROUP BY `a`
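If the intent was to count non-missing values per group, one workaround (a sketch against the same simulated backend, not tested on a real MySQL server; `non_missing` is an illustrative name) is sum(!is.na(...)), which dbplyr translates into a SQL SUM over a boolean test rather than the string function LENGTH():

```r
library(dbplyr)
library(dplyr)

md <- lazy_frame(a = gl(5, 3), b = rnorm(15), con = simulate_mysql())
md %>%
  group_by(a) %>%
  summarize(n = n(), non_missing = sum(!is.na(b), na.rm = TRUE)) %>%
  show_query()
# the generated SQL uses SUM over a `b` IS NULL test instead of length(`b`)
```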

apply hpfilter to grouped variables with NAs using dplyr

I am trying to apply the hpfilter to one of the variables in my dataset that has a panel structure (id + year) and then add the filtered series to my dataset. It works perfectly fine as long as I do not have any NAs in one of the variables, but it yields an error if one of the ids has missing values. The reason for this is that the hpfilter function does not work with NAs (it yields only NAs).
Here's a reproducible example:
df1 <- read.table(text="country year X1 X2 W
A 1990 10 20 40
A 1991 12 15 NA
A 1992 14 17 41
A 1993 17 NA 44
B 1990 20 NA 45
B 1991 NA 13 61
B 1992 12 12 67
B 1993 14 10 68
C 1990 10 20 70
C 1991 11 14 50
C 1992 12 15 NA
C 1993 14 16 NA
D 1990 20 17 80
D 1991 16 20 91
D 1992 15 21 70
D 1993 14 22 69
", header=TRUE, stringsAsFactors=FALSE)
My approach was to use the dplyr group_by function to apply the hpfilter by country to variable X1:
library(mFilter)
library(plm)
# Organizing the Data as a Panel
df1 <- pdata.frame(df1, index = c("country","year"))
# Apply hpfilter to X1 and add trend to the sample
df1 <- df1 %>%
  group_by(country) %>%
  mutate(X1_trend = mFilter::hpfilter(na.exclude(X1), type = "lambda", freq = 6.25)$trend)
However, this yields the following error:
Error in `[[<-.data.frame`(`*tmp*`, col, value = c(11.1695436493374, 12.7688604220353, :
replacement has 15 rows, data has 16
The error occurs because the filtered series is shortened after applying the hp filter (by the NAs).
Since I have a large dataset with many countries, it would be great if there were a workaround that ignores the NAs when passing the series to the hp filter but does not remove them from the data. Thank you!
Here is a way to drop the NAs and calculate the trend:
df2 <- df1 %>% group_by(country) %>%
filter(!is.na(X1)) %>%
pdata.frame(., index = c("country","year")) %>%
mutate(X1_trend = mFilter::hpfilter(X1, type = "lambda", freq = 6.25)$trend)
> df2
country year X1 X2 W X1_trend
1 A 1990 10 20 40 11.16954
2 A 1991 12 15 NA 12.76886
3 A 1992 14 17 41 14.18105
4 A 1993 17 NA 44 15.09597
5 B 1990 20 NA 45 15.17450
6 B 1992 12 12 67 14.38218
7 B 1993 14 10 68 13.45663
8 C 1990 10 20 70 12.75429
9 C 1991 11 14 50 12.71858
10 C 1992 12 15 NA 13.35221
11 C 1993 14 16 NA 14.38293
12 D 1990 20 17 80 15.32211
13 D 1991 16 20 91 15.61990
14 D 1992 15 21 70 15.47486
15 D 1993 14 22 69 15.14639
EDIT: To keep missing values in the final output, we do one more operation:
df3 <- merge(df1,df2, by = colnames(df1),all.x = T)
> df3
country year X1 X2 W X1_trend
1 A 1990 10 20 40 11.16954
2 A 1991 12 15 NA 12.76886
3 A 1992 14 17 41 14.18105
4 A 1993 17 NA 44 15.09597
5 B 1990 20 NA 45 15.17450
6 B 1991 NA 13 61 NA
7 B 1992 12 12 67 14.38218
8 B 1993 14 10 68 13.45663
9 C 1990 10 20 70 12.75429
10 C 1991 11 14 50 12.71858
11 C 1992 12 15 NA 13.35221
12 C 1993 14 16 NA 14.38293
13 D 1990 20 17 80 15.32211
14 D 1991 16 20 91 15.61990
15 D 1992 15 21 70 15.47486
16 D 1993 14 22 69 15.14639
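An alternative sketch that skips the merge step: compute the trend only on the non-missing values and write it back into a full-length vector, leaving NA where X1 is missing. This starts from the plain df1 as read in above, before the pdata.frame conversion, and assumes the mFilter package is installed:

```r
library(dplyr)
library(mFilter)

# The example data from the question (before the pdata.frame conversion)
df1 <- read.table(text = "country year X1 X2 W
A 1990 10 20 40
A 1991 12 15 NA
A 1992 14 17 41
A 1993 17 NA 44
B 1990 20 NA 45
B 1991 NA 13 61
B 1992 12 12 67
B 1993 14 10 68
C 1990 10 20 70
C 1991 11 14 50
C 1992 12 15 NA
C 1993 14 16 NA
D 1990 20 17 80
D 1991 16 20 91
D 1992 15 21 70
D 1993 14 22 69
", header = TRUE, stringsAsFactors = FALSE)

df3 <- df1 %>%
  group_by(country) %>%
  mutate(X1_trend = {
    out <- rep(NA_real_, dplyr::n())   # full-length result, NA by default
    ok  <- !is.na(X1)                  # positions that have data
    out[ok] <- drop(mFilter::hpfilter(X1[ok], type = "lambda", freq = 6.25)$trend)
    out
  }) %>%
  ungroup()
```

This keeps every original row in place, so no merge back onto df1 is needed.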

Ranking data that have the same values [duplicate]

This question already has answers here:
Rank vector with some equal values [duplicate]
(3 answers)
Closed 4 years ago.
I have a large data set including a column of counts for different genetic markers. I want to generate an overall ranking that takes into account the count number regardless of the genetic marker. For instance if 2 or more genetic markers all have a count of 5 they should all have the same rank number and I want the rank numbers to be displayed in a separate column. I have this dataframe;
SNP count
a1 26
a2 18
a3 16
a4 15
a5 14
a6 14
a7 14
a8 15
a9 13
a10 12
a11 12
a12 11
a13 10
a14 9
a15 8
I want the output to be:
SNP count rank
a1 26 1
a2 18 2
a3 16 3
a4 15 4
a8 15 4
a5 14 5
a6 14 5
a7 14 5
a9 13 7
a10 12 8
a11 12 8
a12 11 9
a13 10 10
a14 9 11
a15 8 12
Note that SNPs a4 and a8 have the same count, as do a5, a6, and a7, and also a10 and a11. I've tried
transform(df, x = ave(count, FUN = function(x) order(x, decreasing = TRUE)))
but it's not what I want.
What you are looking for is the rleid function from the data.table package.
data.table::rleid(df$count)
[1] 1 2 3 4 5 5 5 6 7 8 8 9 10 11 12
df is obtained like so:
df <- read.table(text ="SNP count
a1 26
a2 18
a3 16
a4 15
a5 14
a6 14
a7 14
a8 15
a9 13
a10 12
a11 12
a12 11
a13 10
a14 9
a15 8",
stringsAsFactors =FALSE,
header = TRUE)
And for thoroughness:
df$rank <- data.table::rleid(df$count)
df
SNP count rank
1 a1 26 1
2 a2 18 2
3 a3 16 3
4 a4 15 4
5 a5 14 5
6 a6 14 5
7 a7 14 5
8 a8 15 6
9 a9 13 7
10 a10 12 8
11 a11 12 8
12 a12 11 9
13 a13 10 10
14 a14 9 11
15 a15 8 12
Edit:
Thanks to @Frank, a better solution is to sort the data frame by count before applying rleid:
setDT(df)[order(-count), rank := rleid(count)]
Which gives:
df
SNP count rank
1: a1 26 1
2: a2 18 2
3: a3 16 3
4: a4 15 4
5: a5 14 5
6: a6 14 5
7: a7 14 5
8: a8 15 4
9: a9 13 6
10: a10 12 7
11: a11 12 7
12: a12 11 8
13: a13 10 9
14: a14 9 10
15: a15 8 11
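For comparison, a base R sketch that produces the same dense ranking directly, without sorting the data frame or loading data.table: match each count against the distinct counts in decreasing order.

```r
# Counts from the example data frame
count <- c(26, 18, 16, 15, 14, 14, 14, 15, 13, 12, 12, 11, 10, 9, 8)

# Dense rank: ties share a rank, the next distinct value gets the next integer
rank_dense <- match(count, sort(unique(count), decreasing = TRUE))
rank_dense
# [1]  1  2  3  4  5  5  5  4  6  7  7  8  9 10 11
```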

Compare values of two dataframes and substitute them

I have two data frames with the same number of rows and columns (113x159), with this structure:
df1:
1 2 3 4
a AT AA AG CT
b NA AG AT CC
c AG GG GT AA
d NA NA TT TC
df2:
1 2 3 4
a NA 23 12 NA
b NA 23 44 12
c 11 14 27 55
d NA NA 12 34
I want to compare df1 and df2 value by value: if the value in df2 is NA and the corresponding value in df1 isn't, replace it with NA (and likewise if the df1 value is NA and the df2 value isn't).
In the end, my data frame should be:
1 2 3 4
a NA AA AG NA
b NA AG AT CC
c AG GG GT AA
d NA NA TT CC
I've written this loop but it doesn't work:
merge.na<-function(x){
for (i in df2) AND (k in df1){
if (i==NA) AND (k!=NA)
k==NA}
Any idea?
We can use replace():
replace(df1, is.na(df2), NA)
# X1 X2 X3 X4
#a <NA> AA AG <NA>
#b <NA> AG AT CC
#c AG GG GT AA
#d <NA> <NA> TT TC
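A self-contained sketch of the same idea, with the example frames reconstructed from the question. Masking on either frame's NAs is equivalent here, since df1's own NAs stay NA in any case:

```r
# Reconstruct the example frames from the question
df1 <- data.frame(X1 = c("AT", NA, "AG", NA),
                  X2 = c("AA", "AG", "GG", NA),
                  X3 = c("AG", "AT", "GT", "TT"),
                  X4 = c("CT", "CC", "AA", "TC"),
                  row.names = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X1 = c(NA, NA, 11, NA),
                  X2 = c(23, 23, 14, NA),
                  X3 = c(12, 44, 27, 12),
                  X4 = c(NA, 12, 55, 34),
                  row.names = c("a", "b", "c", "d"))

# Set df1 to NA wherever either frame has an NA in that cell
result <- replace(df1, is.na(df1) | is.na(df2), NA)
```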

R: Function “diff” over various groups

While searching for a solution to my problem I found this thread: Function "diff" over various groups in R. I've got a very similar question so I'll just work with the example there.
This is what my desired output should look like:
name class year diff
1 a c1 2009 NA
2 a c1 2010 67
3 b c1 2009 NA
4 b c1 2010 20
I have two variables which form subgroups - class and name. So I want to compare only the values which have the same name and class. I also want to have the differences from 2009 to 2010. If there is no 2008, diff 2009 should return NA (since it can't calculate a difference).
I'm sure it works very similarly to the other thread, but I just can't make it work. I also used the code below (and simply handled the ascending-year requirement by sorting the data differently), but somehow R still calculates a difference instead of returning NA.
ddply(df, .(class, name), summarize, year=head(year, -1), value=diff(value))
Using the data set from the other post, I would do something like
library(data.table)
df <- df[df$year != 2008, ]
setkey(setDT(df), class, name, year)
df[, diff := lapply(.SD, function(x) c(NA, diff(x))),
.SDcols = "value", by = list(class, name)]
Which returns
df
# name class year value diff
# 1: a c1 2009 33 NA
# 2: a c1 2010 100 67
# 3: b c1 2009 80 NA
# 4: b c1 2010 90 10
# 5: a c2 2009 80 NA
# 6: a c2 2010 90 10
# 7: b c2 2009 90 NA
# 8: b c2 2010 100 10
# 9: a c3 2009 90 NA
#10: a c3 2010 100 10
#11: b c3 2009 80 NA
#12: b c3 2010 99 19
Using dplyr
df %>%
filter(year!=2008)%>%
arrange(name, class, year)%>%
group_by(class, name)%>%
mutate(diff=c(NA,diff(value)))
# Source: local data frame [12 x 5]
# Groups: class, name
# name class year value diff
# 1 a c1 2009 33 NA
# 2 a c1 2010 100 67
# 3 a c2 2009 80 NA
# 4 a c2 2010 90 10
# 5 a c3 2009 90 NA
# 6 a c3 2010 100 10
# 7 b c1 2009 80 NA
# 8 b c1 2010 90 10
# 9 b c2 2009 90 NA
# 10 b c2 2010 100 10
# 11 b c3 2009 80 NA
# 12 b c3 2010 99 19
Update:
With relative difference
df %>%
filter(year!=2008)%>%
arrange(name, class, year)%>%
group_by(class, name)%>%
mutate(diff1=c(NA,diff(value)), rel_diff=round(diff1/value[row_number()-1],2))
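An equivalent dplyr sketch uses lag(), which avoids assembling the c(NA, diff(...)) vector by hand and gives the relative difference in one step (the example data is reconstructed here to match the output above):

```r
library(dplyr)

# Example data matching the output shown above
df <- data.frame(
  name  = rep(c("a", "b"), each = 2, times = 3),
  class = rep(c("c1", "c2", "c3"), each = 4),
  year  = rep(c(2009, 2010), 6),
  value = c(33, 100, 80, 90, 80, 90, 90, 100, 90, 100, 80, 99)
)

res <- df %>%
  arrange(class, name, year) %>%
  group_by(class, name) %>%
  mutate(diff     = value - lag(value),                 # NA for the first year in each group
         rel_diff = round(diff / lag(value), 2)) %>%    # relative to the previous value
  ungroup()
```

lag() returns NA for the first row of each group automatically, so no 2008 filtering trick is needed when each group starts at its first observed year.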
