How to replace all values of rows with NA? - r

I need to replace all values in a range of rows with NA. How can I do it?
For example:
x <- c(1:30)
y <- c("a","b","c")
z <- rep(3)
df1 <- data.frame(x,y,z)
I need to replace all values in rows 1:10 with NA.

We can use the row index for assignment:
df1[1:10, ] <- NA
Output:
df1
x y z
1 NA <NA> NA
2 NA <NA> NA
3 NA <NA> NA
4 NA <NA> NA
5 NA <NA> NA
6 NA <NA> NA
7 NA <NA> NA
8 NA <NA> NA
9 NA <NA> NA
10 NA <NA> NA
11 11 b 3
12 12 c 3
13 13 a 3
14 14 b 3
15 15 c 3
16 16 a 3
17 17 b 3
18 18 c 3
19 19 a 3
20 20 b 3
21 21 c 3
22 22 a 3
23 23 b 3
24 24 c 3
25 25 a 3
26 26 b 3
27 27 c 3
28 28 a 3
29 29 b 3
30 30 c 3
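If you prefer a tidyverse style, the same replacement can be written with dplyr. This is only a sketch of the idea, assuming df1 as created above; the base R assignment above is the simpler option:
library(dplyr)
df1 <- df1 %>%
  # set the first 10 elements of every column to NA
  mutate(across(everything(), ~ replace(.x, 1:10, NA)))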

Related

NA value in a dataframe

I'm trying to apply a function to a column of a data frame, but when I do this I get a column full of NA values. I don't understand why.
Here is my code:
courbe <- function(x) exp(coef(regression)[1]*x+coef(regression[2]))
dataT[,c(2)] <- courbe(dataT[,c(1)])
And here is my data frame:
DateRep Cases
1 25 NA
2 24 NA
3 23 NA
4 22 NA
5 21 NA
6 20 NA
7 19 NA
8 18 NA
9 17 NA
10 16 NA
11 15 NA
12 14 NA
13 13 NA
14 12 NA
15 11 NA
16 10 NA
17 9 NA
18 8 NA
19 7 NA
20 6 NA
21 5 NA
22 4 NA
23 3 NA
24 2 NA
25 1 NA
26 0 NA
The output of print(coef(regression)):
Coefficients:
(Intercept) dataT$DateRep
2.7095 0.2211
As figured out in the comments, the mistake was in the placement of the index: coef(regression[2]) subsets the model object itself, whereas coef(regression)[2] extracts the second coefficient.
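For reference, a minimal sketch of the corrected call, assuming regression and dataT are defined as in the question:
# index moved outside coef(): coef(regression)[2] is the second coefficient
courbe <- function(x) exp(coef(regression)[1]*x + coef(regression)[2])
dataT[, 2] <- courbe(dataT[, 1])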

Removing a row if the number of NAs is larger than 2 (or any number) in a certain set of columns

I have the following panel data frame:
X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5
Ind 1 7 NA NA NA NA 1 4 6 8 6
Ind 2 2 NA 16 NA NA 5 16 12 3 4
Ind 3 NA NA NA 19 92 13 NA 12 NA NA
Ind 4 32 5 12 3 5 NA NA NA NA 4
Ind 5 44 3 46 3 47 3 2 NA 3 4
Ind 6 NA 34 NA 8 NA 14 15 12 3 4
Ind 7 49 55 67 49 89 6 17 2 3 4
Ind 8 NA NA 49 NA NA 11 20 6 NA 4
Ind 9 1 1 5 NA 9 NA NA NA NA NA
In pastable format:
df <- read.table(text="Index_name X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5
Ind_1 7 NA NA NA NA 1 4 6 8 6
Ind_2 2 NA 16 NA NA 5 16 12 3 4
Ind_3 NA NA NA 19 92 13 NA 12 NA NA
Ind_4 32 5 12 3 5 NA NA NA NA 4
Ind_5 44 3 46 3 47 3 2 NA 3 4
Ind_6 NA 34 NA 8 NA 14 15 12 3 4
Ind_7 49 55 67 49 89 6 17 2 3 4
Ind_8 NA NA 49 NA NA 11 20 6 NA 4
Ind_9 1 1 5 NA 9 NA NA NA NA NA",row.names=1,
header=TRUE, stringsAsFactors=FALSE)
I want to filter out all rows that don't have at least 2 non-NA values in both the columns that start with X and the columns that start with Y.
For example:
Ind1: Drop (only 1 value in X1-X5)
Ind2: Keep (at least 2 values in both X and Y)
Ind3: Keep (both X and Y have 2 or more observations)
Ind4: Drop (only 1 value in Y1-Y5)
Ind5: Keep
Ind6: Keep
Ind7: Keep
Ind8: Drop (only 1 value in X1-X5)
Ind9: Drop (X is fine, but Y is not)
You could do this. Basically, you are counting (with rowSums) the number of non-NA data points, first in X1-X5 and then in Y1-Y5. To identify non-NAs, I use !is.na(). The ! is a negation, so the expression means "not an NA". Finally, you keep only the rows where the row sum of non-NAs is >= 2 for X1-X5 AND (&) for Y1-Y5. To be clear about the indexing: there are 10 columns in your data.frame, and df[,1:5] represents the first 5 columns, which are X1-X5.
df[rowSums(!is.na(df[,1:5]))>=2 & rowSums(!is.na(df[,6:10]))>=2,]
X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5
Ind_2 2 NA 16 NA NA 5 16 12 3 4
Ind_3 NA NA NA 19 92 13 NA 12 NA NA
Ind_5 44 3 46 3 47 3 2 NA 3 4
Ind_6 NA 34 NA 8 NA 14 15 12 3 4
Ind_7 49 55 67 49 89 6 17 2 3 4
DATA
df <- read.table(text="Index_name X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5
Ind_1 7 NA NA NA NA 1 4 6 8 6
Ind_2 2 NA 16 NA NA 5 16 12 3 4
Ind_3 NA NA NA 19 92 13 NA 12 NA NA
Ind_4 32 5 12 3 5 NA NA NA NA 4
Ind_5 44 3 46 3 47 3 2 NA 3 4
Ind_6 NA 34 NA 8 NA 14 15 12 3 4
Ind_7 49 55 67 49 89 6 17 2 3 4
Ind_8 NA NA 49 NA NA 11 20 6 NA 4
Ind_9 1 1 5 NA 9 NA NA NA NA NA",row.names=1,
header=TRUE, stringsAsFactors=FALSE)
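A sketch of the same filter that picks the column groups by name rather than by position, assuming the df built above; this avoids hard-coding the indices 1:5 and 6:10:
x_cols <- grep("^X", names(df))  # columns X1-X5
y_cols <- grep("^Y", names(df))  # columns Y1-Y5
df[rowSums(!is.na(df[x_cols])) >= 2 & rowSums(!is.na(df[y_cols])) >= 2, ]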

Expand a dataframe based on columns in the dataframe in R

I have the following dataframe in R
df <- data.frame(
  "Val1" = seq(from = 1, to = 40, by = 5), "Val2" = c(2, 4, 2, 5, 11, 3, 5, 3),
  "Val3" = seq(from = 5, to = 40, by = 5), "Val4" = c(3, 5, 7, 3, 7, 5, 7, 8))
The resulting data frame looks as follows. Val1 and Val3 are the causal variables, and Val2 and Val4 are the dependent variables:
Val1 Val2 Val3 Val4
1 1 2 5 3
2 6 4 10 5
3 11 2 15 7
4 16 5 20 3
5 21 11 25 7
6 26 3 30 5
7 31 5 35 7
8 36 3 40 8
I wish to obtain the following dataframe as an output
Val1 Val2 Val3 Val4
1 1 2 1 NA
2 2 NA 2 NA
3 3 NA 3 3
4 4 NA 4 NA
5 5 NA 5 NA
6 6 4 6 NA
7 7 NA 7 NA
8 8 NA 8 NA
9 9 NA 9 NA
10 10 NA 10 5
11 11 2 11 NA
12 12 NA 12 NA
13 13 NA 13 NA
14 14 NA 14 NA
15 15 NA 15 7
16 16 5 16 NA
17 17 NA 17 NA
18 18 NA 18 NA
19 19 NA 19 NA
20 20 NA 20 3
21 21 11 21 NA
22 22 NA 22 NA
23 23 NA 23 NA
24 24 NA 24 NA
25 25 NA 25 7
26 26 3 26 NA
27 27 NA 27 NA
28 28 NA 28 NA
29 29 NA 29 NA
30 30 NA 30 5
31 31 5 31 NA
32 32 NA 32 NA
33 33 NA 33 NA
34 34 NA 34 NA
35 35 NA 35 7
36 36 3 36 NA
37 37 NA 37 NA
38 38 NA 38 NA
39 39 NA 39 NA
40 40 NA 40 8
How do I accomplish this? I have created the following code, but it involves creating a second data frame and then copying data from the first to the second. Is there a way to overwrite the existing data frame? I would like to avoid loops.
df2 <- data.frame(
  "Val1" = seq(from = min(na.omit(c(df$Val1, df$Val3))),
               to = max(na.omit(c(df$Val1, df$Val3))), by = 1),
  "Val3" = seq(from = min(na.omit(c(df$Val1, df$Val3))),
               to = max(na.omit(c(df$Val1, df$Val3))), by = 1))
###### Create two loops
# initialise the target columns to NA first, so unmatched rows stay NA
# and later iterations do not overwrite earlier matches
df2$Val2 <- NA
df2$Val4 <- NA
for (i in df$Val1) {
  for (j in df2$Val1) {
    if (i == j) df2$Val2[df2$Val1 == j] <- df$Val2[df$Val1 == i]
  }
}
for (i in df$Val3) {
  for (j in df2$Val3) {
    if (i == j) df2$Val4[df2$Val3 == j] <- df$Val4[df$Val3 == i]
  }
}
Is there a faster, vectorised way to accomplish the same? Any help would be appreciated.
Assuming there's a slight error in your output example (row 3 should show NA for Val4 and the 3 in row 3 should be in row 5), this works:
library(tidyverse)
df_new <- bind_cols(
  df %>%
    select(Val1, Val2) %>%
    complete(., expand(., Val1 = 1:40)),
  df %>%
    select(Val3, Val4) %>%
    complete(., expand(., Val3 = 1:40))
)
> df_new
# A tibble: 40 x 4
Val1 Val2 Val3 Val4
<dbl> <dbl> <dbl> <dbl>
1 1 2 1 NA
2 2 NA 2 NA
3 3 NA 3 NA
4 4 NA 4 NA
5 5 NA 5 3
6 6 4 6 NA
7 7 NA 7 NA
8 8 NA 8 NA
9 9 NA 9 NA
10 10 NA 10 5
# ... with 30 more rows
We use bind_cols() to put together two parts of the data frame:
First we select the first two columns, expand() the causal variable and complete() the data; then we do the same for the third and fourth columns.
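A base R sketch of the same expansion using merge(), assuming df as defined in the question; vals covers the full range spanned by Val1 and Val3:
vals <- seq(min(df$Val1, df$Val3), max(df$Val1, df$Val3))
# left joins against the full sequence introduce NA wherever there is no match
left  <- merge(data.frame(Val1 = vals), df[, c("Val1", "Val2")], all.x = TRUE)
right <- merge(data.frame(Val3 = vals), df[, c("Val3", "Val4")], all.x = TRUE)
df_new <- cbind(left, right)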

Finding the first time all players have played two, three, ... games in R

For the following example:
set.seed(24)
D <- data.frame(Team=sample(LETTERS[1:6],100,TRUE),stringsAsFactors=FALSE)
If I want to find the first row at which all players have had 1 turn, then the following works:
max(match(unique(D$Team),D$Team))
# [1] 18
But what if I want to find the first row by which all teams have played 2 games, or 3 games, or more? I'm stuck on how to do this. I guess what I would be looking for is the first index i at which all elements of table(D$Team[1:i]) are at least 2, 3, 4, and so on, but that is quite slow and clunky.
You could add a column with the running number of matches each team has played up to that row, and then use max(which(...)) to look up a given number of matches:
D$Matches <- vapply(1:nrow(D), FUN = function(r) sum(D$Team[1:r] == D$Team[r]), 1)
getWhenAllTeamsHavePlayedNMatches <- function(nMatches) {
  if (sum(D$Matches == nMatches) == length(unique(D$Team))) {
    return(max(which(D$Matches == nMatches)))
  }
  return(NA)
}
getWhenAllTeamsHavePlayedNMatches(4)
# e.g. returns 42
If you want to precalculate all values and add a column to D:
D$Matches <- vapply(1:nrow(D), FUN = function(r) sum(D$Team[1:r] == D$Team[r]), 1)
nTeams <- length(unique(D$Team))
D$NumMatchesWithAllTeam <- vapply(1:nrow(D),
                                  FUN = function(r) {
                                    if (sum(D$Matches[1:r] == D$Matches[r]) == nTeams)
                                      return(D$Matches[r])
                                    return(NA)
                                  },
                                  1)
Resulting data.frame:
> D
Team Matches NumMatchesWithAllTeam
1 B 1 NA
2 B 2 NA
3 E 1 NA
4 D 1 NA
5 D 2 NA
6 F 1 NA
7 B 3 NA
8 E 2 NA
9 E 3 NA
10 B 4 NA
11 D 3 NA
12 C 1 NA
13 E 4 NA
14 E 5 NA
15 B 5 NA
16 F 2 NA
17 B 6 NA
18 A 1 1
19 D 4 NA
20 A 2 NA
21 A 3 NA
22 D 5 NA
23 E 6 NA
24 A 4 NA
25 B 7 NA
26 E 7 NA
27 A 5 NA
28 D 6 NA
29 D 7 NA
30 A 6 NA
31 B 8 NA
32 B 9 NA
33 C 2 2
34 A 7 NA
35 F 3 NA
36 B 10 NA
37 E 8 NA
38 D 8 NA
39 E 9 NA
40 F 4 NA
41 C 3 3
42 C 4 4
43 B 11 NA
44 B 12 NA
45 A 8 NA
46 A 9 NA
47 C 5 NA
48 C 6 NA
49 B 13 NA
50 C 7 NA
51 C 8 NA
52 F 5 5
53 C 9 NA
54 E 10 NA
55 D 9 NA
56 F 6 6
57 C 10 NA
58 B 14 NA
59 B 15 NA
60 A 10 NA
61 C 11 NA
62 B 16 NA
63 B 17 NA
64 A 11 NA
65 E 11 NA
66 B 18 NA
67 F 7 7
68 F 8 8
69 E 12 NA
70 C 12 NA
71 A 12 NA
72 B 19 NA
73 A 13 NA
74 F 9 9
75 D 10 NA
76 C 13 NA
77 D 11 NA
78 E 13 NA
79 A 14 NA
80 E 14 NA
81 D 12 NA
82 A 15 NA
83 D 13 NA
84 B 20 NA
85 C 14 NA
86 C 15 NA
87 B 21 NA
88 F 10 10
89 C 16 NA
90 F 11 11
91 B 22 NA
92 E 15 NA
93 F 12 12
94 A 16 NA
95 C 17 NA
96 D 14 NA
97 D 15 NA
98 A 17 NA
99 C 18 NA
100 C 19 NA
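A vectorised sketch of the same idea: ave() computes the running match count per team in one call, and the small helper below reproduces the lookups shown above (it assumes D as created from the seed; the results 18 and 42 match the outputs earlier in this question):
D$Matches <- ave(seq_along(D$Team), D$Team, FUN = seq_along)  # running count per team
firstRowAllPlayedN <- function(n) {
  idx <- which(D$Matches == n)                 # row where each team reaches its n-th match
  if (length(idx) < length(unique(D$Team))) return(NA)  # some team never reached n matches
  max(idx)                                     # row at which the last team reaches n
}
firstRowAllPlayedN(1)  # 18
firstRowAllPlayedN(4)  # 42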

Rank instances by missing amount in descending order

I want to sort this dataset so that the instances (rows) are ranked by their number of missing values in descending order.
Can someone help me do this in R? Is there a command for it?
df <- data.frame(x = c(1, 4, 6, NA, 7, NA, 9, 10, 4, NA),
                 y = c(10, 12, NA, NA, 14, 18, 20, 15, 12, 17),
                 z = c(225, 198, NA, NA, NA, 130, NA, 200, NA, 99),
                 v = c(44, 51, NA, NA, 45, NA, 25, 36, 75, NA))
df
x y z v
1 1 10 225 44
2 4 12 198 51
3 6 NA NA NA
4 NA NA NA NA
5 7 14 NA 45
6 NA 18 130 NA
7 9 20 NA 25
8 10 15 200 36
9 4 12 NA 75
10 NA 17 99 NA
I want to get this result :
x y z v
4 NA NA NA NA
3 6 NA NA NA
6 NA 18 130 NA
10 NA 17 99 NA
5 7 14 NA 45
7 9 20 NA 25
9 4 12 NA 75
1 1 10 225 44
2 4 12 198 51
8 10 15 200 36
In my comment I misremembered the name of the argument that reverses the direction of an order() result. The fix is simply to use the correct name, decreasing = TRUE:
> df[ order(rowSums(is.na(df)), decreasing=TRUE), ]
x y z v
4 NA NA NA NA
3 6 NA NA NA
6 NA 18 130 NA
10 NA 17 99 NA
5 7 14 NA 45
7 9 20 NA 25
9 4 12 NA 75
1 1 10 225 44
2 4 12 198 51
8 10 15 200 36
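A dplyr sketch of the same ordering; note that, unlike the base R solution, it discards the original row names (n_missing is a helper column used only for sorting):
library(dplyr)
df %>%
  mutate(n_missing = rowSums(is.na(across(everything())))) %>%  # count NAs per row
  arrange(desc(n_missing)) %>%
  select(-n_missing)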
