Add new observations to a column in a dataframe in R

Let's start with two data frames:
m1 <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
df1 <- as.data.frame(m1)
df1
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 5 9 8 3 8 7 1 5 5
2 2 1 NA 6 6 NA 3 8 8 2
3 NA 5 7 2 1 10 8 6 5 7
4 8 1 1 6 8 4 5 3 5 2
5 10 4 9 9 1 NA 7 8 6 2
6 1 8 NA 6 5 7 9 9 9 3
7 1 10 2 4 NA 10 6 5 5 4
8 7 3 10 7 5 5 2 1 NA 1
9 NA NA 8 10 6 4 3 10 7 7
10 7 10 2 2 9 4 NA 1 2 10
m2 <- matrix(sample(c(NA, 2:20), 100, replace = TRUE), 10)
df2 <- as.data.frame(m2)
df2
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 5 NA NA 19 20 15 5 11 4 17
2 4 13 20 NA 9 18 7 11 5 12
3 17 3 14 4 6 2 11 16 11 7
4 14 10 9 16 NA 7 20 5 8 6
5 5 14 10 20 19 16 NA 7 NA NA
6 12 14 14 8 3 20 15 7 15 17
7 4 15 18 12 4 2 19 13 9 8
8 14 11 4 20 5 17 NA 13 19 12
9 15 3 14 16 14 19 17 8 5 NA
10 2 2 11 2 16 4 NA 18 20 NA
Now, I do not want to merge both data frames, only some columns.
How can I move df2$V10 to df1$V4?
The resulting data frame would have 20 rows, with rows 11:20 of V4 filled by the 10 values of df2$V10. The remaining columns in those rows should be NA.

Extract the 'V10' column from 'df2', wrap it in a data.frame whose column is named V4, and use bind_rows to bind the two datasets. Columns that are missing from the second data frame are filled with NA by default.
library(dplyr)
bind_rows(df1, data.frame(V4 = df2$V10))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#1 2 10 NA 9 7 NA NA 8 1 5
#2 2 5 10 10 8 8 3 7 NA 2
#3 3 7 NA 5 4 5 2 5 7 2
#4 9 4 6 4 8 6 7 9 8 2
#5 3 6 2 3 3 6 10 5 9 5
#6 1 NA 3 7 5 4 6 3 7 10
#7 6 3 1 3 4 10 2 6 NA 7
#8 9 1 5 4 4 7 4 2 2 1
#9 3 1 6 6 1 7 7 6 6 1
#10 NA 6 10 9 10 10 6 4 3 9
#11 NA NA NA 10 NA NA NA NA NA NA
#12 NA NA NA 3 NA NA NA NA NA NA
#13 NA NA NA 4 NA NA NA NA NA NA
#14 NA NA NA 18 NA NA NA NA NA NA
#15 NA NA NA 20 NA NA NA NA NA NA
#16 NA NA NA 11 NA NA NA NA NA NA
#17 NA NA NA 15 NA NA NA NA NA NA
#18 NA NA NA 2 NA NA NA NA NA NA
#19 NA NA NA 3 NA NA NA NA NA NA
#20 NA NA NA 14 NA NA NA NA NA NA
For multiple columns, subset the dataset and rename the columns of interest before calling bind_rows:
bind_rows(df1, setNames(df2[c('V10', 'V8')], c('V4', 'V2')))
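If dplyr is not available, a base R sketch of the same idea is to build a block of all-NA rows with df1's columns, fill in V4, and rbind it on. The small data frames below are made-up stand-ins for df1 and df2, not the data from the question:

```r
# Made-up stand-ins for df1 and df2 (not the question's data).
df1 <- data.frame(V1 = 1:3, V4 = c(10, NA, 30), V10 = 7:9)
df2 <- data.frame(V1 = 4:6, V10 = c(100, 200, 300))

# Indexing rows with NA yields all-NA rows that keep df1's columns and types.
filler <- df1[rep(NA_integer_, nrow(df2)), , drop = FALSE]
filler$V4 <- df2$V10            # only V4 receives real values

result <- rbind(df1, filler)
rownames(result) <- NULL        # replace "NA", "NA.1", ... with 1..n
print(result)
```

Resetting the row names at the end only tidies the labels; the values are the same either way.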

Related

How to transfer values from one dataframe to another?

Consider the following code:
df1 <- data.frame("ID"=c("A", "A", "A", "A", "A", "B", "B", 'B', "B", "B"),
"X_A"=c(1,2,3,4,5,NA, NA, 8, 9,10), "X_B"=c(1,2,3,4,5,NA,NA, 8,9,10)
,"Y_A"=c(1,2,NA,NA, 10, 8,9,10,NA,NA), "Y_B"=c(1,2,NA, NA, 10,8,
9, 10, NA, NA))
It yields the following dataframe:
ID X_A X_B Y_A Y_B
1 A 1 1 1 1
2 A 2 2 2 2
3 A 3 3 NA NA
4 A 4 4 NA NA
5 A 5 5 10 10
6 B NA NA 8 8
7 B NA NA 9 9
8 B 8 8 10 10
9 B 9 9 NA NA
10 B 10 10 NA NA
I wish to transfer data from this dataframe to df2
ID X_A Y_A
1 A 1 1
2 A 2 2
3 A 3 3
4 A 4 4
5 A 5 5
6 A 6 6
7 A 7 7
8 A 8 8
9 A 9 9
10 A 10 10
11 B 1 1
12 B 2 2
13 B 3 3
14 B 4 4
15 B 5 5
16 B 6 6
17 B 7 7
18 B 8 8
19 B 9 9
20 B 10 10
The end data frame should be like this
ID X_A Y_A X_B Y_B
1 A 1 1 1 1
2 A 2 2 2 2
3 A 3 3 3 NA
4 A 4 4 4 NA
5 A 5 5 5 NA
6 A 6 6 NA NA
7 A 7 7 NA NA
8 A 8 8 NA NA
9 A 9 9 NA NA
10 A 10 10 NA NA
11 B 1 1 NA NA
12 B 2 2 NA NA
13 B 3 3 NA NA
14 B 4 4 NA NA
15 B 5 5 NA NA
16 B 6 6 NA NA
17 B 7 7 NA NA
18 B 8 8 8 8
19 B 9 9 9 9
20 B 10 10 10 10
The final output is like the result of a VLOOKUP: the ID and X_A columns, and the ID and Y_A columns, of df1 and df2 are matched so that the corresponding values of X_B and Y_B are filled into df2. Where there is no match, the result should be NA. I have tried the following code:
merge(df1, df2)
This, however, slows down my system. I have also tried
library(dplyr)
df2 %>% right_join(df1, by = c("ID", "X_A", "Y_A"))
This results in all the rows appearing. Can the expected output be produced in R?
Do you mean joining once on ID and X_A to get X_B, and then on ID and Y_A to get Y_B? Note that row 10 differs from your expected output, because the code that builds df1 has a row with ID "A", Y_A = 10 and Y_B = 10:
df2 %>%
  left_join(select(df1, ID, X_A, X_B),
            by = c("ID", "X_A")) %>%
  left_join(select(df1, ID, Y_A, Y_B),
            by = c("ID", "Y_A"))
# ID X_A Y_A X_B Y_B
# 1 A 1 1 1 1
# 2 A 2 2 2 2
# 3 A 3 3 3 NA
# 4 A 4 4 4 NA
# 5 A 5 5 5 NA
# 6 A 6 6 NA NA
# 7 A 7 7 NA NA
# 8 A 8 8 NA NA
# 9 A 9 9 NA NA
# 10 A 10 10 NA 10
# 11 B 1 1 NA NA
# 12 B 2 2 NA NA
# 13 B 3 3 NA NA
# 14 B 4 4 NA NA
# 15 B 5 5 NA NA
# 16 B 6 6 NA NA
# 17 B 7 7 NA NA
# 18 B 8 8 8 8
# 19 B 9 9 9 9
# 20 B 10 10 10 10
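Base R (note that merge() sorts the result by the key columns by default, so the row order may differ from df2):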
want <- merge(df2, subset(df1, select = c(ID, X_A, X_B)), by = c("ID", "X_A"), all.x = TRUE)
(want <- merge(want, subset(df1, select = c(ID, Y_A, Y_B)), by = c("ID", "Y_A"), all.x = TRUE))
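Since the question describes this as a VLOOKUP, another way to sketch it in base R is with match() on a composite key. The miniature data frames and the "|" separator below are assumptions made for illustration, and only the X_A to X_B lookup is shown:

```r
# Made-up miniature versions of the lookup table (df1) and the target (df2).
df1 <- data.frame(ID = c("A", "A", "B"), X_A = c(1, 2, 8), X_B = c(1, 2, 8))
df2 <- data.frame(ID = c("A", "A", "B", "B"), X_A = c(1, 3, 8, 9))

# Build one composite key per row; "|" is an arbitrary separator that is
# assumed not to occur in ID.
key1 <- paste(df1$ID, df1$X_A, sep = "|")
key2 <- paste(df2$ID, df2$X_A, sep = "|")

# match() gives, for each df2 row, the first matching df1 row (or NA).
df2$X_B <- df1$X_B[match(key2, key1)]
print(df2)
```

Unlike merge(), this keeps df2's row order and never adds rows, because only the first match is used. The Y_A to Y_B lookup would follow the same pattern.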

Is there a way to fix rows that have NA values for all attributes and NA for the row name?

I am trying to store all rows with NA in my columns Math_G1, Math_G2 and Math_G3 in a separate dataset. However, when I do this, additional rows appear that have NA in every column, including the row name (e.g. NA.1, NA.2, ...). How do I fix this?
I have already tried using c() to filter out these rows, and I have also tried which(), but they are still there.
Here is my code:
dat <- read.csv(file = "final merged.csv", stringsAsFactors=FALSE, na.strings=c("NA", "NULL"))
dat_small <- dat[c("age","traveltime","studytime",
"failures","famrel","freetime","goout","Dalc","Walc",
"health","absences","Math_G1","Math_G2","Math_G3","Por_G1","Por_G2","Por_G3","DoubleSub")]
sample_size <- 500
all_set <- sample(1:length(dat[,1]),sample_size,replace = F)
dat <- dat_small[all_set,]
index_na_math <- which(is.na(c(dat$Math_G1,dat$Math_G2,dat$Math_G3)))
index_na_por <- which(is.na(c(dat$Por_G1,dat$Por_G2,dat$Por_G3)))
index_na_both <- c(index_na_math,index_na_por)
#each row of my dataset describes a specific student
#Portuguese and math are subjects that students in the dataset take
dat_purepor <- dat[which(index_na_math),] #students who take only Portuguese
dat_puremath <- dat[c(index_na_por),] # students who take only math
dat_math <- dat[c(-index_na_math),] #students who take only math + students who take both
dat_por <- dat[c(-index_na_por),] #students who take only Portuguese + students who take both
dat_both <- dat[c(-index_na_both),] #students who take both math and Portuguese
dat_purepor
dat_puremath
I expected the output to be filtered according to my conditions, without any rows that have NA in every column, so I don't understand why the final results contain these NA rows.
Here is a preview of the dataset dat_small:
> dat_small
age traveltime studytime failures famrel freetime goout Dalc Walc health absences Math_G1 Math_G2 Math_G3 Por_G1 Por_G2 Por_G3 DoubleSub
1 18 2 2 0 4 3 4 1 1 3 6 5 6 6 13 13 13 1
2 17 1 2 0 5 3 3 1 1 3 4 5 5 6 15 15 15 1
3 15 1 2 3 4 3 2 2 3 3 10 7 8 10 10 12 13 1
4 15 1 3 0 3 2 2 1 1 5 2 15 14 15 14 14 14 1
5 16 1 2 0 4 3 2 1 2 5 4 6 10 10 13 13 13 1
6 16 1 2 0 5 4 2 1 2 5 10 15 15 15 10 13 13 1
7 16 1 2 0 4 4 4 1 1 3 0 12 12 11 14 14 16 1
8 17 2 2 0 4 1 4 1 1 1 6 6 5 6 12 13 13 1
9 15 1 2 0 4 2 2 1 1 1 0 16 18 19 13 17 17 1
10 15 1 2 0 5 5 1 1 1 5 0 14 15 15 9 10 11 1
11 15 1 2 0 3 3 3 1 2 2 0 10 8 9 15 15 15 1
12 15 3 3 0 5 2 2 1 1 4 4 10 12 12 10 12 13 1
13 15 1 1 0 4 3 3 1 3 5 2 14 14 14 13 14 15 1
14 15 2 2 0 5 4 3 1 2 3 2 10 10 11 14 14 14 1
15 15 1 3 0 4 5 2 1 1 3 0 14 16 16 11 12 14 1
16 16 1 1 0 4 4 4 1 2 2 4 14 14 14 9 8 9 1
17 16 1 3 0 3 2 3 1 2 2 6 13 14 14 10 10 16 1
18 16 3 2 0 5 3 2 1 1 4 4 8 10 10 11 11 11 1
19 17 1 1 3 5 5 5 2 4 5 16 6 5 5 10 13 13 1
20 16 1 1 0 3 1 3 1 3 5 4 8 10 10 14 14 14 1
21 15 1 2 0 4 4 1 1 1 1 0 13 14 15 9 8 10 1
22 15 1 1 0 5 4 2 1 1 5 0 12 15 15 10 13 13 1
23 16 1 2 0 4 5 1 1 3 5 2 15 15 16 11 10 11 1
24 16 2 2 0 5 4 4 2 4 5 0 13 13 12 14 14 14 1
25 15 1 3 0 4 3 2 1 1 5 2 10 9 8 10 11 10 1
26 16 1 1 2 1 2 2 1 3 5 14 6 9 8 13 13 13 1
27 15 1 1 0 4 2 2 1 2 5 2 12 12 11 12 11 12 1
28 15 1 1 0 2 2 4 2 4 1 4 15 16 15 14 12 12 1
29 16 1 2 0 5 3 3 1 1 5 4 11 11 11 10 10 1 1
30 16 1 2 0 4 4 5 5 5 5 16 10 12 11 9 12 12 1
31 15 1 2 0 5 4 2 3 4 5 0 9 11 12 9 10 11 1
32 15 2 2 0 4 3 1 1 1 5 0 17 16 17 14 14 16 1
33 15 1 2 0 4 5 2 1 1 5 0 17 16 16 14 14 16 1
34 15 1 2 0 5 3 2 1 1 2 0 8 10 12 10 13 13 1
35 16 1 1 0 5 4 3 1 1 5 0 12 14 15 9 12 12 1
36 15 2 1 0 3 5 1 1 1 5 0 8 7 6 14 13 12 1
37 15 1 3 0 5 4 3 1 1 4 2 15 16 18 14 14 16 1
38 16 2 3 0 2 4 3 1 1 5 7 15 16 15 9 9 8 1
39 15 1 3 0 4 3 2 1 1 5 2 12 12 11 14 13 12 1
40 15 1 1 0 4 3 1 1 1 2 8 14 13 13 14 13 12 1
41 16 2 2 1 3 3 3 1 2 3 25 7 10 11 13 13 13 1
42 15 1 1 0 5 4 3 2 4 5 8 12 12 12 10 13 13 1
43 15 1 2 0 4 3 3 1 1 5 2 19 18 18 9 12 12 1
44 15 1 1 0 5 4 1 1 1 1 0 8 8 11 10 13 13 1
45 16 2 2 1 4 3 3 2 2 5 14 10 10 9 11 11 11 1
46 15 1 2 0 5 2 2 1 1 5 8 8 8 6 12 11 12 1
47 16 1 2 0 2 3 5 1 4 3 12 11 12 11 10 11 11 1
48 16 1 4 0 4 2 2 1 1 2 4 19 19 20 14 14 16 1
49 15 1 2 0 4 3 3 2 2 5 2 15 15 14 10 13 13 1
50 15 1 2 1 4 4 4 1 1 3 2 7 7 7 15 15 15 1
51 16 3 2 0 4 3 3 2 3 4 2 12 13 13 13 13 13 1
52 15 1 2 0 4 3 3 1 1 5 2 11 13 13 16 14 16 1
53 15 2 1 1 5 5 5 3 4 5 6 11 11 10 14 14 16 1
54 15 1 1 0 3 3 4 2 3 5 0 8 10 11 11 12 13 1
55 15 1 1 0 5 3 4 4 4 1 6 10 13 13 13 12 13 1
[ reached getOption("max.print") -- omitted 889 rows ]
Here is a preview of what happens when I print dat_puremath:
> dat_puremath
age traveltime studytime failures famrel freetime goout Dalc Walc health absences Math_G1 Math_G2 Math_G3 Por_G1 Por_G2 Por_G3 DoubleSub
918 15 2 4 0 4 4 2 2 3 3 12 16 16 16 NA NA NA 0
931 16 1 2 3 2 3 3 2 2 4 5 7 7 7 NA NA NA 0
933 16 1 2 0 3 3 4 1 1 4 0 12 13 14 NA NA NA 0
935 16 1 1 0 4 5 2 1 1 5 20 13 12 12 NA NA NA 0
927 16 2 2 0 3 4 4 1 4 5 2 13 13 11 NA NA NA 0
929 17 1 2 0 5 3 3 1 1 3 0 8 8 9 NA NA NA 0
942 17 1 3 0 3 3 2 2 2 3 3 11 11 11 NA NA NA 0
928 16 1 2 0 1 2 2 1 2 1 14 12 13 12 NA NA NA 0
936 17 1 3 0 3 2 3 1 1 4 4 10 9 9 NA NA NA 0
939 17 1 4 0 5 2 2 1 2 5 0 17 17 18 NA NA NA 0
941 17 1 2 0 4 2 2 1 1 3 12 11 9 9 NA NA NA 0
937 17 1 2 0 5 4 5 1 2 5 4 10 9 11 NA NA NA 0
925 16 1 2 0 4 4 2 1 1 3 0 14 14 14 NA NA NA 0
938 17 1 3 0 4 3 3 1 1 3 6 13 12 12 NA NA NA 0
921 15 1 3 0 4 2 2 1 1 5 2 9 11 11 NA NA NA 0
943 17 1 3 0 4 4 3 1 1 5 7 12 14 14 NA NA NA 0
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.7 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.8 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.11 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.12 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.13 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.15 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.16 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.17 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.18 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.19 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.20 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.21 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.22 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.23 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.24 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.25 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.26 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.27 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.28 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.29 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.30 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.31 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Can someone explain why this happens and how I can fix it? Thank you!
Calling is.na(c(dat$Math_G1,dat$Math_G2,dat$Math_G3)) creates a vector of length 3*nrow(dat), so which() returns positions up to 3*nrow(dat). Any index greater than nrow(dat) refers to a row that does not exist, and R returns an all-NA row (named NA, NA.1, ...) for each of them.
Try the following:
index_na_math <- (is.na(dat$Math_G1) | is.na(dat$Math_G2) | is.na(dat$Math_G3))
Do the same for index_na_por, and then
index_na_both <- index_na_math | index_na_por
# or depending what you mean by 'both'
index_na_both <- index_na_math & index_na_por
Subsetting with dat_math <- dat[!index_na_math,] will then yield the expected result (and likewise for the others).
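The effect can be reproduced on a tiny data frame. This is a minimal sketch with made-up values; only the column names are borrowed from the question:

```r
# Made-up three-row example; row 2 has no math grades.
dat <- data.frame(Math_G1 = c(5, NA, 7),
                  Math_G2 = c(6, NA, 8),
                  Por_G1  = c(NA, 9, 10))

# Concatenating the columns gives a length-6 vector, so which() returns
# positions 2 and 5, and position 5 is not a valid row of dat.
bad_index <- which(is.na(c(dat$Math_G1, dat$Math_G2)))
bad_rows  <- dat[bad_index, ]     # second row is all NA, with row name "NA"

# A logical vector with one entry per row avoids the problem.
na_math   <- is.na(dat$Math_G1) | is.na(dat$Math_G2)
good_rows <- dat[na_math, ]       # just row 2

# Equivalent test over several columns at once.
na_math_alt <- !complete.cases(dat[c("Math_G1", "Math_G2")])
```

Here bad_rows contains the real row 2 plus one phantom all-NA row, while good_rows contains row 2 only.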

store list values in data.frame with unequal rows

I am trying to create a data frame by extracting the values from a list. I have a list of 157 elements of unequal length,
and I want to rbind all of them into one data frame. I tried to do it with a for loop, but it only stored the values of the first element.
What I could do is:
porturn1=data.table::rbindlist(lapply(porturn[1], as.data.frame), idcol = "id")
porturn2=data.table::rbindlist(lapply(porturn[2], as.data.frame), idcol = "id")
porturn3=data.table::rbindlist(lapply(porturn[3], as.data.frame), idcol = "id")
porturn4=data.table::rbindlist(lapply(porturn[4], as.data.frame), idcol = "id")
porturn5=data.table::rbindlist(lapply(porturn[5], as.data.frame), idcol = "id")
and then apply rbind.fill to all of these, but that seems cumbersome and impractical, even though the result of rbind.fill is what I want.
How can I write a loop that creates the required data frame, with one row for each of the 157 list elements?
You can run do.call with rbind.fill, which applies rbind.fill on individual entries of a list and assembles the results, in this case into a data.frame.
library(plyr)
## make test data
set.seed(0)
porturn <- sapply(sample(1:20, 10), function(x) 1:x)
str(porturn)
#> List of 10
#> $ : int [1:18] 1 2 3 4 5 6 7 8 9 10 ...
#> $ : int [1:6] 1 2 3 4 5 6
#> $ : int [1:7] 1 2 3 4 5 6 7
#> $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
#> $ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
#> $ : int [1:4] 1 2 3 4
#> $ : int [1:13] 1 2 3 4 5 6 7 8 9 10 ...
#> $ : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
#> $ : int [1:8] 1 2 3 4 5 6 7 8
#> $ : int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## work
porturnall = do.call(rbind.fill,lapply(porturn, function(x) as.data.frame(t(x))))
print(porturnall)
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
#> 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 NA NA
#> 2 1 2 3 4 5 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> 3 1 2 3 4 5 6 7 NA NA NA NA NA NA NA NA NA NA NA NA NA
#> 4 1 2 3 4 5 6 7 8 9 10 NA NA NA NA NA NA NA NA NA NA
#> 5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 NA NA NA NA NA
#> 6 1 2 3 4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> 7 1 2 3 4 5 6 7 8 9 10 11 12 13 NA NA NA NA NA NA NA
#> 8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 NA NA NA NA NA NA
#> 9 1 2 3 4 5 6 7 8 NA NA NA NA NA NA NA NA NA NA NA NA
#> 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## Created on 2018-07-21 by the reprex package (v0.2.0).
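For plain vectors like these, a base R alternative that avoids plyr is to pad every element with NA up to the longest length and then rbind the padded vectors. This is only a sketch on a made-up three-element list, assuming every element is an atomic vector of the same type:

```r
# Made-up list with elements of unequal length.
porturn <- list(1:3, 1:5, 1:2)

max_len <- max(lengths(porturn))

# `length<-` pads a vector with NA when the new length is larger.
padded <- lapply(porturn, function(x) {
  length(x) <- max_len
  x
})

# Each padded vector becomes one row; columns are named V1, V2, ...
porturnall <- as.data.frame(do.call(rbind, padded))
print(porturnall)
```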

R: In dataframe: set first non-NA value in column to NA

I have a large dataframe, 300+ columns (time series) with about 2600 observations. The columns are filled with a lot of NA's and then a short time series, and then typically NA's again. I would like to find the first non-NA value in each column and replace it with NA.
This is what I'm hoping to achieve, only with a much bigger dataframe:
Before:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 1 1 NA NA
4 2 2 1 1
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
After:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 2 2 NA NA
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
I've searched around and found a way to do this for a single column, but my efforts to apply it to the whole dataframe have proven difficult.
I have created an example dataframe that mimics my original dataframe:
#Dataframe with NA
x1=x2=c(NA,NA,1:10,NA,NA)
x3=x4=c(NA,NA,NA,1:7,NA,NA,NA,NA)
df=data.frame(x1,x2,x3,x4)
I have used this to replace the first value with NA in one column (provided by @Joshua Ulrich); however, I would like to apply it to all columns without manually editing the code for each of the 300+ columns:
NonNAindex <- which(!is.na(df[,1]))
firstNonNA <- min(NonNAindex)
is.na(df[,1]) <- seq(firstNonNA, length.out=1)
I have tried to set the above as a function and run it for all columns with apply/lapply, as well as a for loop, but haven't really figured out how to apply the changes to my dataframe. I'm sure there is something I've completely overlooked as I'm just taking my first small steps in R.
All suggestions would be highly appreciated!
We can use base R:
df[] <- lapply(df, function(x) replace(x, which(!is.na(x))[1], NA))
df
# x1 x2 x3 x4
#1 NA NA NA NA
#2 NA NA NA NA
#3 NA NA NA NA
#4 2 2 NA NA
#5 3 3 2 2
#6 4 4 3 3
#7 5 5 4 4
#8 6 6 5 5
#9 7 7 6 6
#10 8 8 7 7
#11 9 9 NA NA
#12 10 10 NA NA
#13 NA NA NA NA
#14 NA NA NA NA
Or as @thelatemail suggested:
df[] <- lapply(df, function(x) replace(x, Position(Negate(is.na), x), NA))
Since you would like to do this for all columns, you could use the mutate_all function from dplyr. See http://dplyr.tidyverse.org/ for more information.
library(dplyr)
# funs() is deprecated in current dplyr; a formula lambda does the same job
mutate_all(df, ~ if_else(row_number() == min(which(!is.na(.))), NA_integer_, .))
#> x1 x2 x3 x4
#> 1 NA NA NA NA
#> 2 NA NA NA NA
#> 3 NA NA NA NA
#> 4 2 2 NA NA
#> 5 3 3 2 2
#> 6 4 4 3 3
#> 7 5 5 4 4
#> 8 6 6 5 5
#> 9 7 7 6 6
#> 10 8 8 7 7
#> 11 9 9 NA NA
#> 12 10 10 NA NA
#> 13 NA NA NA NA
#> 14 NA NA NA NA
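The three-line snippet from the question can also be wrapped in a small function and applied to every column. The sketch below uses a made-up data frame and a hypothetical helper name, blank_first, and adds an explicit guard for columns that contain no values at all:

```r
# Made-up example; x3 has no non-NA values at all.
df <- data.frame(x1 = c(NA, NA, 1:4, NA),
                 x2 = c(NA, NA, NA, 1:3, NA),
                 x3 = NA_integer_)

# Hypothetical helper: set the first non-NA entry of a vector to NA.
blank_first <- function(x) {
  pos <- which(!is.na(x))
  if (length(pos) > 0) x[pos[1]] <- NA   # nothing to do for all-NA columns
  x
}

# df[] keeps df a data frame while every column is replaced.
df[] <- lapply(df, blank_first)
print(df)
```

With dplyr 1.0 or later, the same helper could presumably be passed to mutate(df, across(everything(), blank_first)) instead.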

replacing NA by row with value in a list

I have a table that looks kind of like this:
# item 1 2 3 4 5 6 7 8
#1 1 2 4 6 NA NA NA NA NA
#2 2 1 4 5 6 NA NA NA NA
#3 3 NA NA NA NA NA NA NA NA
#4 4 1 2 6 NA NA NA NA NA
#5 5 2 3 4 6 7 8 NA NA
and I have a list
list1<-11:13
I want to replace the NAs with the elements in the list by row and result should be like this:
# item 1 2 3 4 5 6 7 8
#1 1 2 4 6 11 12 13 NA NA
#2 2 1 4 5 6 11 12 13 NA
#3 3 11 12 13 NA NA NA NA NA
#4 4 1 2 6 11 12 13 NA NA
#5 5 2 3 4 6 7 8 11 12
I tried
for(i in 1:5){
res<-which(is.na(Mydata[i,]))
Mydata[i,res]<-c(list1, rep(NA, 8))
}
It seems to work with the table in the example but gives many warning messages, and when I run it on a really large table it sometimes gives the wrong result. Can anyone tell me what is wrong with my code? Or is there a better way to do this?
We loop over the rows of 'Mydata' using apply with MARGIN = 1, find the positions of the NA elements ('i1'), take the smaller of the number of NA elements and the length of list1 ('l1'), and replace that many elements.
t(apply(Mydata, 1, function(x) {
  i1 <- which(is.na(x))
  l1 <- min(length(i1), length(list1))
  # seq_len() is safe when a row has no NA (l1 == 0)
  replace(x, i1[seq_len(l1)], list1[seq_len(l1)])}))
# item X1 X2 X3 X4 X5 X6 X7 X8
#1 1 2 4 6 11 12 13 NA NA
#2 2 1 4 5 6 11 12 13 NA
#3 3 11 12 13 NA NA NA NA NA
#4 4 1 2 6 11 12 13 NA NA
#5 5 2 3 4 6 7 8 11 12
Or as @RichardSciven mentioned, we can use na.omit with apply by looping over the rows:
t(apply(Mydata, 1, function(x) {
  w <- na.omit(which(is.na(x))[1:3])
  x[w] <- list1[seq_along(w)]
  x }))
You could do it all in one go using matrix indexing (dat here is the table stored as a matrix):
sel <- pmin(outer( 0:2, max.col(is.na(dat), "first"), `+`), ncol(dat))
dat[unique(cbind(c(col(sel)),c(sel)))] <- 11:13
# item 1 2 3 4 5 6 7 8
#[1,] 1 2 4 6 11 12 13 NA NA
#[2,] 2 1 4 5 6 11 12 13 NA
#[3,] 3 11 12 13 NA NA NA NA NA
#[4,] 4 1 2 6 11 12 13 NA NA
#[5,] 5 2 3 4 6 7 8 11 12
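For comparison, the loop from the question can be kept if the replacement is cut to the number of NA cells actually available in each row. This is a sketch on a made-up table, reusing the names Mydata and list1 from the question:

```r
# Made-up table in the spirit of the question.
Mydata <- data.frame(item = 1:3,
                     X1 = c(2, NA, 1),
                     X2 = c(NA, NA, 2),
                     X3 = c(NA, NA, 3),
                     X4 = c(NA_real_, NA_real_, NA_real_))
list1 <- 11:13

for (i in seq_len(nrow(Mydata))) {
  na_cols <- which(is.na(Mydata[i, ]))          # columns that are NA in row i
  n <- min(length(na_cols), length(list1))      # how many values fit
  if (n > 0) {
    # Assign exactly n values to exactly n cells, so nothing is recycled.
    Mydata[i, na_cols[seq_len(n)]] <- list1[seq_len(n)]
  }
}
print(Mydata)
```

The warnings in the original loop most likely come from assigning a vector whose length does not match the number of NA cells in the row; matching the two lengths avoids that.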
