Rank instances by missing amount in descending order - r

I want to sort the rows of this dataset by the number of missing values they contain, in descending order, so that the rows with the most NAs come first.
Is there a command to do this in R?
df=data.frame(x=c(1,4,6,NA,7,NA,9,10,4,NA),
y=c(10,12,NA,NA,14,18,20,15,12,17),
z=c(225,198,NA,NA,NA,130,NA,200,NA,99),
v=c(44,51,NA,NA,45,NA,25,36,75,NA))
df
x y z v
1 1 10 225 44
2 4 12 198 51
3 6 NA NA NA
4 NA NA NA NA
5 7 14 NA 45
6 NA 18 130 NA
7 9 20 NA 25
8 10 15 200 36
9 4 12 NA 75
10 NA 17 99 NA
I want to get this result :
x y z v
4 NA NA NA NA
3 6 NA NA NA
6 NA 18 130 NA
10 NA 17 99 NA
5 7 14 NA 45
7 9 20 NA 25
9 4 12 NA 75
1 1 10 225 44
2 4 12 198 51
8 10 15 200 36

In my comment I incorrectly remembered the name of the argument for changing the direction of an order result. The fix is simply to use the correct name:
> df[ order(rowSums(is.na(df)), decreasing=TRUE), ]
x y z v
4 NA NA NA NA
3 6 NA NA NA
6 NA 18 130 NA
10 NA 17 99 NA
5 7 14 NA 45
7 9 20 NA 25
9 4 12 NA 75
1 1 10 225 44
2 4 12 198 51
8 10 15 200 36
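To see why this works, the one-liner can be unpacked into steps. This is only a sketch on the question's data; the names na_count and ranked are introduced here for illustration.

```r
df <- data.frame(x = c(1, 4, 6, NA, 7, NA, 9, 10, 4, NA),
                 y = c(10, 12, NA, NA, 14, 18, 20, 15, 12, 17),
                 z = c(225, 198, NA, NA, NA, 130, NA, 200, NA, 99),
                 v = c(44, 51, NA, NA, 45, NA, 25, 36, 75, NA))

# is.na(df) is a logical matrix; rowSums() counts the TRUEs in each row
na_count <- rowSums(is.na(df))

# order() returns the row positions sorted by that count, largest first
ranked <- df[order(na_count, decreasing = TRUE), ]
ranked
```

Keeping the count in its own vector also makes it easy to inspect it or attach it as a column before sorting.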

Related

R: How to swap values in a data frame with condition?

My dataset has two IDs per row, one for a parent and one for a child, but I don't know which is which. I do, however, have their ages.
This is the table I am working with:
ID1 ID2 sex1 sex2 age1 age2
1 8 9 1 2 44 11
2 17 7 1 1 56 76
3 1 44 NA NA 16 55
4 3 13 NA NA NA NA
5 55 6 2 NA 56 10
6 4 33 2 NA 45 9
7 2 66 1 NA 12 45
8 72 99 NA NA NA NA
9 12 11 2 2 30 12
By using an if statement, I want to identify who's who according to their age.
Here is the code I made but it is not working:
install.packages('seqinr')
library(seqinr)
for (i in 1:nrow(data)){
if (data$age2[i]> data$age1[i]){
swap(data$age1[i], data$age2[i])
}
}
The error message:
Error in if (data$age2[i] > data$age1[i]) { :
missing value where TRUE/FALSE needed
I want to put the parent's age in age1 and the child's age in age2.
Does someone have a better idea of how to do it?
Welcome to SO!
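You can do this without a for loop if you only need the higher value in age1 and the lower value in age2, comparing the two columns row by row:
# New columns age_1/age_2 are used so the result can be compared with the original data.
# To overwrite age1/age2 instead, compute both values first and then assign them,
# because the second line would otherwise read the already-updated age1.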
df$age_1 <- pmax(df$age1, df$age2)
df$age_2 <- pmin(df$age1, df$age2)
With result:
ID1 ID2 sex1 sex2 age1 age2 age_1 age_2
1 8 9 1 2 44 11 44 11
2 17 7 1 1 56 76 76 56
3 1 44 NA NA 16 55 55 16
4 3 13 NA NA NA NA NA NA
5 55 6 2 NA 56 10 56 10
6 4 33 2 NA 45 9 45 9
7 2 66 1 NA 12 45 45 12
8 72 99 NA NA NA NA NA NA
9 12 11 2 2 30 12 30 12
With data:
df <- read.table(text = 'ID1 ID2 sex1 sex2 age1 age2
1 8 9 1 2 44 11
2 17 7 1 1 56 76
3 1 44 NA NA 16 55
4 3 13 NA NA NA NA
5 55 6 2 NA 56 10
6 4 33 2 NA 45 9
7 2 66 1 NA 12 45
8 72 99 NA NA NA NA
9 12 11 2 2 30 12', header = T)
library(tidyverse)
df <- read_table(
"ID1 ID2 sex1 sex2 age1 age2
8 9 1 2 44 11
17 7 1 1 56 76
1 44 NA NA 16 55
3 13 NA NA NA NA
55 6 2 NA 56 10
4 33 2 NA 45 9
2 66 1 NA 12 45
72 99 NA NA NA NA
12 11 2 2 30 12"
)
Method 1:
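df %>%
  # transform() evaluates both expressions against the original columns
  transform(age1 = case_when(age1 > age2 ~ age1,
                             TRUE ~ age2),
            # age2 takes the smaller value, so the condition mirrors the one above
            age2 = case_when(age1 > age2 ~ age2,
                             TRUE ~ age1))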
Method 2:
df %>%
transform(age1 = pmax(age1, age2),
age2 = pmin(age1, age2))
ID1 ID2 sex1 sex2 age1 age2
1 8 9 1 2 44 11
2 17 7 1 1 76 56
3 1 44 NA NA 55 16
4 3 13 NA NA NA NA
5 55 6 2 NA 56 10
6 4 33 2 NA 45 9
7 2 66 1 NA 45 12
8 72 99 NA NA NA NA
9 12 11 2 2 30 12
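The error in the original loop comes from rows where an age is NA: the comparison then yields NA, and if() cannot branch on it. A possible base R sketch that sidesteps this and swaps whole rows at once is shown below. It assumes, beyond what the answers above do, that the ID and sex columns should travel with the age; the names swap, first and second are made up for this example.

```r
df <- read.table(text = "ID1 ID2 sex1 sex2 age1 age2
8 9 1 2 44 11
17 7 1 1 56 76
1 44 NA NA 16 55
3 13 NA NA NA NA
55 6 2 NA 56 10
4 33 2 NA 45 9
2 66 1 NA 12 45
72 99 NA NA NA NA
12 11 2 2 30 12", header = TRUE)

# TRUE only where both ages are known and person 2 is older
swap <- !is.na(df$age1) & !is.na(df$age2) & df$age2 > df$age1

# Exchange the paired columns for those rows in one assignment
# (assumption: ID and sex belong to the same person as the age)
first  <- c("ID1", "sex1", "age1")
second <- c("ID2", "sex2", "age2")
df[swap, c(first, second)] <- df[swap, c(second, first)]
df
```

Rows with a missing age are left untouched, which matches what pmax() and pmin() return for rows where both ages are NA.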

NA value in a dataframe

I am trying to apply a function to a column of a dataframe, but when I do this I get a column full of NA values. I don't understand why.
Here is my code :
courbe <- function(x) exp(coef(regression)[1]*x+coef(regression[2]))
dataT[,c(2)] <- courbe(dataT[,c(1)])
And here is my dataframe:
DateRep Cases
1 25 NA
2 24 NA
3 23 NA
4 22 NA
5 21 NA
6 20 NA
7 19 NA
8 18 NA
9 17 NA
10 16 NA
11 15 NA
12 14 NA
13 13 NA
14 12 NA
15 11 NA
16 10 NA
17 9 NA
18 8 NA
19 7 NA
20 6 NA
21 5 NA
22 4 NA
23 3 NA
24 2 NA
25 1 NA
26 0 NA
The output of print(coef(regression)) :
Coefficients:
(Intercept) dataT$DateRep
2.7095 0.2211
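As figured out in the comments, the mistake was in the placement of the indices: the first term uses coef(regression)[1], but the second was written as coef(regression[2]), which indexes the model object itself rather than its coefficient vector. It should be coef(regression)[2].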
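A minimal sketch of the corrected function follows. The question does not show how regression was fitted, so the obs data frame and the lm() call here are hypothetical stand-ins; only the indexing fix comes from the thread.

```r
# Hypothetical data and fit; the question does not show how `regression` was built
obs <- data.frame(DateRep = 0:10)
obs$Cases <- exp(2.7 + 0.22 * obs$DateRep)
regression <- lm(log(Cases) ~ DateRep, data = obs)

# Index the coefficient vector, not the model object
courbe <- function(x) {
  b <- coef(regression)
  exp(b[1] * x + b[2])  # formula kept as written in the question
}

dataT <- data.frame(DateRep = 25:0, Cases = NA_real_)
dataT[, 2] <- courbe(dataT[, 1])
head(dataT)
```

Note that for a fit of this kind coef()[1] is the intercept and coef()[2] the slope, so it may also be worth checking which coefficient is meant to multiply x.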

Removing row if number of NA's is larger than 2 (or any number) in a certain amount of rows

I have the following panel data frame:
X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5
Ind 1 7 NA NA NA NA 1 4 6 8 6
Ind 2 2 NA 16 NA NA 5 16 12 3 4
Ind 3 NA NA NA 19 92 13 NA 12 NA NA
Ind 4 32 5 12 3 5 NA NA NA NA 4
Ind 5 44 3 46 3 47 3 2 NA 3 4
Ind 6 NA 34 NA 8 NA 14 15 12 3 4
Ind 7 49 55 67 49 89 6 17 2 3 4
Ind 8 NA NA 49 NA NA 11 20 6 NA 4
Ind 9 1 1 5 NA 9 NA NA NA NA NA
In pastable format:
df <- read.table(text="Index_name X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5
Ind_1 7 NA NA NA NA 1 4 6 8 6
Ind_2 2 NA 16 NA NA 5 16 12 3 4
Ind_3 NA NA NA 19 92 13 NA 12 NA NA
Ind_4 32 5 12 3 5 NA NA NA NA 4
Ind_5 44 3 46 3 47 3 2 NA 3 4
Ind_6 NA 34 NA 8 NA 14 15 12 3 4
Ind_7 49 55 67 49 89 6 17 2 3 4
Ind_8 NA NA 49 NA NA 11 20 6 NA 4
Ind_9 1 1 5 NA 9 NA NA NA NA NA",row.names=1,
header=TRUE, stringsAsFactors=FALSE)
I want to filter out all rows that don't have at least 2 non-NA values in both the columns that start with X and the columns that start with Y.
For example:
Ind1: Drop (only 1 value in X1-X5)
Ind2: Keep (at least 2 values in both X and Y)
Ind3: Keep (both X and Y have 2 or more values)
Ind4: Drop (only 1 value in Y1-Y5)
Ind5: Keep
Ind6: Keep
Ind7: Keep
Ind8: Drop (only 1 value in X1-X5)
Ind9: Drop (X is fine, but Y has no values)
You could do this. You count (with rowSums) the number of non-NA values, first in X1-X5 and then in Y1-Y5. To identify non-NAs, I use !is.na(); the ! is a negation, so the expression means "not an NA". You then keep only the rows where the count of non-NAs is >= 2 for X1-X5 AND (&) for Y1-Y5. To be clear about the indexing: there are 10 columns in your data.frame, so df[,1:5] is the first 5 columns (X1-X5) and df[,6:10] is the last 5 (Y1-Y5).
df[rowSums(!is.na(df[,1:5]))>=2 & rowSums(!is.na(df[,6:10]))>=2,]
X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5
Ind_2 2 NA 16 NA NA 5 16 12 3 4
Ind_3 NA NA NA 19 92 13 NA 12 NA NA
Ind_5 44 3 46 3 47 3 2 NA 3 4
Ind_6 NA 34 NA 8 NA 14 15 12 3 4
Ind_7 49 55 67 49 89 6 17 2 3 4
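If the column positions might change, the same filter can be written against the column names instead. This is a sketch of that variant, not part of the original answer; x_cols, y_cols and keep are names chosen here, and it assumes every relevant column name starts with "X" or "Y".

```r
df <- read.table(text = "Index_name X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5
Ind_1 7 NA NA NA NA 1 4 6 8 6
Ind_2 2 NA 16 NA NA 5 16 12 3 4
Ind_3 NA NA NA 19 92 13 NA 12 NA NA
Ind_4 32 5 12 3 5 NA NA NA NA 4
Ind_5 44 3 46 3 47 3 2 NA 3 4
Ind_6 NA 34 NA 8 NA 14 15 12 3 4
Ind_7 49 55 67 49 89 6 17 2 3 4
Ind_8 NA NA 49 NA NA 11 20 6 NA 4
Ind_9 1 1 5 NA 9 NA NA NA NA NA", row.names = 1, header = TRUE)

# Pick the two column groups by name prefix rather than by position
x_cols <- startsWith(names(df), "X")
y_cols <- startsWith(names(df), "Y")

# Keep rows with at least 2 non-NA values in each group
keep <- rowSums(!is.na(df[, x_cols])) >= 2 &
        rowSums(!is.na(df[, y_cols])) >= 2
df[keep, ]
```

Changing the threshold is then a matter of replacing the two 2s, or pulling them out into a variable.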

Expand a dataframe based on columns in the dataframe in R

I have the following dataframe in R
df<-data.frame(
"Val1"=seq(from=1, to=40, by=5), 'Val2'=c(2,4,2,5,11,3,5,3),
"Val3"=seq(from=5, to=40, by=5), "Val4"=c(3,5,7,3,7,5,7,8))
The resulting dataframe looks as follows. Val1 and Val3 are the causal variables, and Val2 and Val4 are the dependent variables.
Val1 Val2 Val3 Val4
1 1 2 5 3
2 6 4 10 5
3 11 2 15 7
4 16 5 20 3
5 21 11 25 7
6 26 3 30 5
7 31 5 35 7
8 36 3 40 8
I wish to obtain the following dataframe as an output
Val1 Val2 Val3 Val4
1 1 2 1 NA
2 2 NA 2 NA
3 3 NA 3 3
4 4 NA 4 NA
5 5 NA 5 NA
6 6 4 6 NA
7 7 NA 7 NA
8 8 NA 8 NA
9 9 NA 9 NA
10 10 NA 10 5
11 11 2 11 NA
12 12 NA 12 NA
13 13 NA 13 NA
14 14 NA 14 NA
15 15 NA 15 7
16 16 5 16 NA
17 17 NA 17 NA
18 18 NA 18 NA
19 19 NA 19 NA
20 20 NA 20 3
21 21 11 21 NA
22 22 NA 22 NA
23 23 NA 23 NA
24 24 NA 24 NA
25 25 NA 25 7
26 26 3 26 NA
27 27 NA 27 NA
28 28 NA 28 NA
29 29 NA 29 NA
30 30 NA 30 5
31 31 5 31 NA
32 32 NA 32 NA
33 33 NA 33 NA
34 34 NA 34 NA
35 35 NA 35 7
36 36 3 36 NA
37 37 NA 37 NA
38 38 NA 38 NA
39 39 NA 39 NA
40 40 NA 40 8
How do I accomplish this? I have written the following code, but it involves creating a second dataframe and then copying data from the first to the second. Is there a way to overwrite the existing dataframe instead? I would also like to avoid loops.
df2<-data.frame('Val1'=
seq(from=min(na.omit(c(df$Val1, df$Val3))), to= max(na.omit(c(df$Val1,
df$Val3))), by=1), "Val3"=seq(from=min(na.omit(c(df$Val1, df$Val3))), to=
max(na.omit(c(df$Val1, df$Val3))), by=1))
###### Create two loops
for(i in df$Val1){
for(j in df2$Val1){
if(i==j){
df2$Val2[df2$Val1==j]=df$Val2[df$Val1==i]
} else{df2$Val2[df2$Val1==j]=NA}}}
for(i in df$Val3){ for(j in df2$Val3){
if(i==j){df2$Val4[df2$Val3==j]=df$Val4[df$Val3==i]
} else{df2$Val4[df2$Val3==j]=NA}}}
Is there a faster, vectorised way to accomplish the same thing? Any help would be appreciated.
Assuming there's a slight error in your output example (row 3 should show NA for Val4 and the 3 in row 3 should be in row 5), this works:
library(tidyverse)
df_new <- bind_cols(
df %>%
select(Val1, Val2) %>%
complete(., expand(., Val1 = 1:40)),
df %>%
select(Val3, Val4) %>%
complete(., expand(., Val3 = 1:40))
)
> df_new
# A tibble: 40 x 4
Val1 Val2 Val3 Val4
<dbl> <dbl> <dbl> <dbl>
1 1 2 1 NA
2 2 NA 2 NA
3 3 NA 3 NA
4 4 NA 4 NA
5 5 NA 5 3
6 6 4 6 NA
7 7 NA 7 NA
8 8 NA 8 NA
9 9 NA 9 NA
10 10 NA 10 5
# ... with 30 more rows
We use bind_cols() to put together two parts of the dataframe:
First we select the first two columns, expand() the causal variable and complete() the data, then we do it again for the third and fourth column.
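For comparison, the same expansion can be sketched in base R with match(), which looks up each value of the full 1:40 grid in the causal column and returns NA where there is no match. This is an alternative to the tidyverse answer, not a restatement of it; the name grid is introduced here.

```r
df <- data.frame(
  Val1 = seq(from = 1, to = 40, by = 5), Val2 = c(2, 4, 2, 5, 11, 3, 5, 3),
  Val3 = seq(from = 5, to = 40, by = 5), Val4 = c(3, 5, 7, 3, 7, 5, 7, 8))

# Every value the causal variables should take after expansion
grid <- 1:40

# match() gives the row of df holding each grid value, or NA if absent,
# so indexing with it fills the gaps with NA automatically
df_new <- data.frame(
  Val1 = grid,
  Val2 = df$Val2[match(grid, df$Val1)],
  Val3 = grid,
  Val4 = df$Val4[match(grid, df$Val3)]
)
head(df_new, 10)
```

Assigning the result back to df overwrites the original, which addresses the wish to avoid keeping a second dataframe around.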

Inputting values into data frame of longer length

I have 2 data frames with different numbers of rows (A has 55 and B has 41). I would like to take the Py values from data frame B and put them into A$Py corresponding to the "Link".
I tried
link.list <- A$Link
for(i in 1:length(link.list)){
A$Py[i] <- B[which(B$Link==link.list[i]), "Py"]
}
But get:
Error in A$Py[i] <- B[which(B$Link == link.list[i]), "Py"] :
replacement has length zero
I assume this error is triggered when an A$Link value is not present in B. Any ideas on how to solve this problem?
Thanks
data frame A:
Link VU Py
1 DVH1-1 1 NA
2 DVH1-10 9 NA
3 DVH1-2 1 NA
4 DVH1-3 1 NA
5 DVH1-4 9 NA
6 DVH1-5 9 NA
7 DVH1-6 1 NA
8 DVH1-7 1 NA
9 DVH1-8 10 NA
10 DVH1-9 10 NA
11 DVH2-1 2 NA
12 DVH2-2 1 NA
13 DVH2-3 9 NA
14 DVH2-4 9 NA
15 DVH2-5 10 NA
16 DVH2-6 9 NA
17 DVH2-7 4 NA
18 DVH2-8 9 NA
19 DVH3-1 1 NA
20 DVH3-2 12 NA
21 DVH3-3 12 NA
22 DWH1-1 4 NA
23 DWH1-10 8 NA
24 DWH1-2 4 NA
25 DWH1-3 4 NA
26 DWH1-4 8 NA
27 DWH1-5 8 NA
28 DWH1-6 4 NA
29 DWH1-7 4 NA
30 DWH1-8 9 NA
31 DWH1-9 9 NA
32 DWH2-1 4 NA
33 DWH2-2 4 NA
34 DWH2-3 8 NA
35 DWH2-4 8 NA
36 DWH2-5 8 NA
37 DWH2-6 8 NA
38 DWH2-7 7 NA
39 DWH2-8 5 NA
40 DWH3-1 3 NA
41 DWH3-2 49 NA
42 DWH3-3 0 NA
43 MH1-1 0 NA
44 MH1-2 1 NA
45 MH1-3 1 NA
46 MH1-4 1 NA
47 MH1-5 1 NA
48 UH1-1 17 NA
49 UH1-2 17 NA
50 UH1-3 17 NA
51 UH1-4 19 NA
52 UH2-1 4 NA
53 UH2-2 15 NA
54 UH3-1 24 NA
55 UH3-2 25 NA
data frame B:
Link Py
1 DVH1-1 0
2 DVH1-10 4
3 DVH1-2 0
4 DVH1-3 14
5 DVH1-4 0
6 DVH1-5 2
7 DVH1-6 12
8 DVH1-7 11
9 DVH1-8 9
10 DVH1-9 9
11 DVH2-1 0
12 DVH2-2 14
13 DVH2-3 3
14 DVH2-4 0
15 DVH2-5 10
16 DVH2-6 0
17 DVH2-7 2
18 DVH2-8 4
19 DVH3-1 16
20 DVH3-3 8
21 DWH1-1 6
22 DWH1-10 2
23 DWH1-2 0
24 DWH1-3 7
25 DWH1-5 0
26 DWH1-6 12
27 DWH1-7 10
28 DWH1-8 0
29 DWH1-9 3
30 DWH2-1 0
31 DWH2-2 10
32 DWH2-7 0
33 DWH2-8 9
34 DWH3-1 0
35 DWH3-2 0
36 MH1-1 0
37 UH1-3 6
38 UH1-4 4
39 UH2-1 0
40 UH2-2 9
41 UH3-2 4
Use merge and merge by Link; all.x = TRUE returns all rows of x (in your case x = A), with NA where B has no matching Link.
I've only passed the first two columns of A, as the A$Py values in your example were all NA.
merge(A[,1:2],B,by='Link', all.x = TRUE)
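A small sketch of what this does, using three rows taken from the question's data (DVH3-2 is present in A but missing from B). It also reproduces the length-zero lookup that made the loop fail.

```r
# Three rows from the question: DVH3-2 has no counterpart in B
A <- data.frame(Link = c("DVH3-1", "DVH3-2", "DVH3-3"),
                VU = c(1, 12, 12), Py = NA)
B <- data.frame(Link = c("DVH3-1", "DVH3-3"), Py = c(16, 8))

# The lookup used in the loop returns nothing for an unmatched Link,
# which is what triggers "replacement has length zero"
n_found <- length(B[which(B$Link == "DVH3-2"), "Py"])
n_found

# Left join: every row of A is kept, unmatched Links get NA in Py
merged <- merge(A[, c("Link", "VU")], B, by = "Link", all.x = TRUE)
merged
```

Note that merge() sorts the result by the key column by default, so the row order can differ from A's if A is not already sorted by Link.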
> head(a)
X Link VU Py
1 1 DVH1-1 1 NA
2 2 DVH1-10 9 NA
3 3 DVH1-2 1 NA
4 4 DVH1-3 1 NA
5 5 DVH1-4 9 NA
6 6 DVH1-5 9 NA
> head(b)
X Link Py
1 1 DVH1-1 0
2 2 DVH1-10 4
3 3 DVH1-2 0
4 4 DVH1-3 14
5 5 DVH1-4 0
6 6 DVH1-5 2
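# Look up each a$Link in b$Link; links with no match in b get NA.
# (Indexing b with a$Link %in% b$Link would misalign rows, since a and b differ in length.)
a$Py1 <- b$Py[match(a$Link, b$Link)]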
> head(a)
X Link VU Py Py1
1 1 DVH1-1 1 NA 0
2 2 DVH1-10 9 NA 4
3 3 DVH1-2 1 NA 0
4 4 DVH1-3 1 NA 14
5 5 DVH1-4 9 NA 0
6 6 DVH1-5 9 NA 2
