Setting rows of a data frame to NA using R

I have a data frame as follows,
aid=c(1:10)
x1_var=rnorm(10,0,1)
x2_var=rnorm(10,0,1)
x3_var=rbinom(10,1,0.5)
data=data.frame(aid,x1_var,x2_var,x3_var)
head(data)
aid x1_var x2_var x3_var
1 1 -0.99759448 -0.2882535 1
2 2 -0.12755695 -1.3706875 0
3 3 1.04709366 0.8977596 1
4 4 0.48883458 -0.1965846 1
5 5 -0.40264114 0.2925659 1
6 6 -0.08409966 -1.3489460 1
I want to set every value to NA in the rows of this data frame where x3_var==1 (without setting the aid column to NA).
I tried the following code.
> data[which(data$x3_var==1),]=NA
> data
aid x1_var x2_var x3_var
1 NA NA NA NA
2 2 -0.12755695 -1.3706875 0
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
7 NA NA NA NA
8 8 -1.78160459 -1.8677633 0
9 9 -1.65895704 -0.8086148 0
10 10 -0.06281384 1.8888726 0
But this code also set the values of the aid column to NA. Can anybody help me fix this?
Also, are there any other methods that do the same thing?
Thank you

Your code would work if you excluded the aid column from the assignment.
data[which(data$x3_var==1),-1]=NA
You can also do this without which :
data[data$x3_var==1, -1]=NA
In the above two cases I am assuming that you know the position of the aid column, i.e. 1. If in reality you don't know the position of the column, you can use match to get its position.
data[data$x3_var==1, -match('aid', names(data))] = NA
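The same idea can also be written by selecting columns by name instead of by negative position. This is a minimal base R sketch, assuming the question's column names; the seed value and the helper names flag and cols_to_blank are illustrative and not part of the original answer.

```r
set.seed(42)  # arbitrary seed, only so the example is reproducible
data <- data.frame(aid    = 1:10,
                   x1_var = rnorm(10, 0, 1),
                   x2_var = rnorm(10, 0, 1),
                   x3_var = rbinom(10, 1, 0.5))

# Compute the row flag once, before anything is overwritten.
flag <- data$x3_var == 1

# Every column except aid, chosen by name rather than position.
cols_to_blank <- setdiff(names(data), "aid")

data[flag, cols_to_blank] <- NA
data
```

Storing the flag first keeps the condition available after x3_var itself has been set to NA.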

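A dplyr solution, assuming the columns to be altered begin with "x", as in the example data. Note that the condition reads x3_var, which is itself one of the altered columns; this works here because x3_var is the last of the matching columns.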
library(dplyr)
set.seed(1001)
df1 <- data.frame(aid = 1:10,
x1_var = rnorm(10,0,1),
x2_var = rnorm(10,0,1),
x3_var = rbinom(10,1,0.5))
df1 %>%
mutate(across(starts_with("x"), ~ifelse(x3_var == 1, NA, .x)))
aid x1_var x2_var x3_var
1 1 2.1886481 0.3026445 0
2 2 -0.1775473 1.6343924 0
3 3 NA NA NA
4 4 -2.5065362 0.4671611 0
5 5 NA NA NA
6 6 -0.1435595 0.1102652 0
7 7 NA NA NA
8 8 -0.6229437 -1.0302508 0
9 9 NA NA NA
10 10 NA NA NA

Related

Picking LHS column and RHS column of data.table assignment using other column values in R

Here is the code to produce a sample dataset:
require(data.table)
testdata <- data.table(
X = rep(sample(1:3),5),
Y = rep(sample(1:3),5),
X1 = rnorm(15),
X2 = rnorm(15),
X3 = rnorm(15),
Y1 = NA_character_,
Y2 = NA_character_,
Y3 = NA_character_
)
Initial data table:
X Y X1 X2 X3 Y1 Y2 Y3
1: 3 3 -0.7098927 0.63342935 0.94470612 NA NA NA
2: 1 2 0.3008547 -1.40043977 1.53781754 NA NA NA
3: 2 1 0.3423140 0.34897695 -0.38402565 NA NA NA
4: 3 3 -0.5726456 -2.24526957 -1.10947867 NA NA NA
5: 1 2 -1.3239474 -0.53924617 -0.04103982 NA NA NA
6: 2 1 0.2493801 0.85806647 0.96488021 NA NA NA
7: 3 3 -2.0653505 0.05481703 1.75161043 NA NA NA
8: 1 2 -1.3919774 0.34282832 0.50834289 NA NA NA
9: 2 1 0.5928025 -1.11899399 0.35967102 NA NA NA
10: 3 3 -0.4704720 0.64004313 -0.17343794 NA NA NA
11: 1 2 0.3056093 2.14544631 0.43740447 NA NA NA
12: 2 1 -0.1568971 1.05091249 1.18884487 NA NA NA
13: 3 3 -1.3078670 1.07482123 -0.65367957 NA NA NA
14: 1 2 0.4622123 -0.60308532 -1.11104235 NA NA NA
15: 2 1 -0.7894978 0.33018926 -0.04700393 NA NA NA
Here is the action I want to perform:
In each row,
if X = 2 and Y = 3 then Y3 <- X2
Expected Output:
X Y X1 X2 X3 Y1 Y2 Y3
1: 3 3 -0.7098927 0.63342935 0.94470612 NA NA 0.94470612
2: 1 2 0.3008547 -1.40043977 1.53781754 NA 0.3008547 NA
3: 2 1 0.3423140 0.34897695 -0.38402565 0.34897695 NA NA
4: 3 3 -0.5726456 -2.24526957 -1.10947867 NA NA -1.10947867
5: 1 2 -1.3239474 -0.53924617 -0.04103982 NA -1.3239474 NA
6: 2 1 0.2493801 0.85806647 0.96488021 0.85806647 NA NA
7: 3 3 -2.0653505 0.05481703 1.75161043 NA NA 1.75161043
8: 1 2 -1.3919774 0.34282832 0.50834289 NA -1.3919774 NA
9: 2 1 0.5928025 -1.11899399 0.35967102 -1.11899399 NA NA
10: 3 3 -0.4704720 0.64004313 -0.17343794 NA NA -0.17343794
11: 1 2 0.3056093 2.14544631 0.43740447 NA 0.3056093 NA
12: 2 1 -0.1568971 1.05091249 1.18884487 1.05091249 NA NA
13: 3 3 -1.3078670 1.07482123 -0.65367957 NA NA -0.65367957
14: 1 2 0.4622123 -0.60308532 -1.11104235 NA 0.4622123 NA
15: 2 1 -0.7894978 0.33018926 -0.04700393 0.33018926 NA NA
How can I achieve this using simple data.table syntax? I have tried get, eval(parse) etc. but I run into trouble each time.
Note that my actual dataset is quite large (100-plus columns), so I need a solution that doesn't rely on column numbers. I could possibly write a large number of if statements as well, but that looks like a bad way to do this for the 30-odd columns that need to be assigned in a similar way.
data.table version is 1.10.4 and the R version is 3.3.2
Edit: I solved it using a function. I'm not sure this is the best way though, as it is very, very slow.
populateY <- function(input_table) {
  for (i in 1:nrow(input_table)) {
    # This row's X and Y values pick the source and target columns
    k <- input_table$X[i]
    j <- input_table$Y[i]
    tempX <- paste0("input_table$X", k, "[i]")
    tempY <- paste0("input_table$Y", j, "[i]")
    eval(parse(text = paste0(tempY, " <- ", tempX)))
  }
  return(input_table)
}
If you're open to using the tidyverse and tibble data frames, I would do it this way.
library(dplyr)   # provides %>% and mutate
library(tibble)
testdata <- as_tibble(testdata)
testdata <- testdata %>%
mutate(Y3 = ifelse(X == 2 & Y == 3, X2, NA))
You can then add one line per assignment, easily and legibly, inside the mutate call.
Otherwise, if you're set on using data.table, I'd go with akrun's suggestion, though you'll need to change the data type of column Y3 to double, or make sure the column doesn't exist before you run that code.
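With around 30 target columns, one line per X/Y combination gets long. A vectorized alternative is matrix indexing, where each row's X value picks the source column and its Y value picks the target column. This is a base R sketch on a plain data frame, not data.table syntax; the toy values and the names x_cols, y_cols, x_mat and y_mat are made up for illustration, and it assumes X and Y hold the numeric suffixes 1 to 3.

```r
# Toy data mirroring the question's layout (values are illustrative).
df <- data.frame(
  X  = c(3, 1, 2),
  Y  = c(3, 2, 1),
  X1 = c(0.1, 0.2, 0.3),
  X2 = c(1.1, 1.2, 1.3),
  X3 = c(2.1, 2.2, 2.3)
)

x_cols <- paste0("X", 1:3)  # source columns, selected by name
y_cols <- paste0("Y", 1:3)  # target columns, created below

rows  <- seq_len(nrow(df))
x_mat <- as.matrix(df[x_cols])

# Start from an all-NA numeric matrix, then fill one cell per row:
# row i, column Y[i] receives the value found in row i, column X[i].
y_mat <- matrix(NA_real_, nrow = nrow(df), ncol = length(y_cols),
                dimnames = list(NULL, y_cols))
y_mat[cbind(rows, df$Y)] <- x_mat[cbind(rows, df$X)]

df[y_cols] <- as.data.frame(y_mat)
df
```

Because the columns are built from names with paste0, nothing depends on where they sit in the table.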

Faster way to convert a list to a data.frame with some column values missing

I have this list of lists
> head(train)
[[1]]
[[1]]$Physics
[1] 8
[[1]]$Chemistry
[1] 7
[[1]]$PhysicalEducation
[1] 3
[[1]]$English
[1] 4
[[1]]$Mathematics
[1] 6
[[1]]$serial
[1] 195490
.
.
[[6]]
[[6]]$Physics
[1] 2
[[6]]$Chemistry
[1] 1
[[6]]$Biology
[1] 2
[[6]]$English
[1] 4
[[6]]$Mathematics
[1] 8
[[6]]$serial
[1] 182318
Each sub-list has five of the subject elements listed below, plus one extra element named serial:
columns <- c("Physics", "Chemistry", "PhysicalEducation", "English",
"Mathematics", "serial", "ComputerScience", "Hindi", "Biology",
"Economics", "Accountancy", "BusinessStudies")
I am trying to convert this list into a data frame.
Presently, I am doing this with the for loop below, filling one row at a time. Although this works, it takes a huge amount of time.
colclass <- rep("numeric",12)
comby <- read.table(text = '', colClasses = colclass, col.names = columns)
for(i in 1:length(train)){
comby[i,names(train[[i]])] <- train[[i]]
}
I tried using do.call(rbind, train) but that doesn't work as it keeps adding new data into the old columns from the first iteration.
What's a better, faster way? I have around 1.5 million observations.
Desired output: the data frame should have all the columns, with NA where there is no value. I am also interested in whether it could be done faster without using any additional packages.
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics Accountancy
1 8 7 3 4 6 195490 NA NA NA NA NA
2 1 1 1 3 3 190869 NA NA NA NA NA
3 1 2 2 1 2 3111 NA NA NA NA NA
4 8 7 6 7 7 47738 NA NA NA NA NA
5 1 1 1 3 2 85520 NA NA NA NA NA
6 2 1 NA 4 8 182318 NA NA 2 NA NA
BusinessStudies
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
Here is the reproducible code
train <- "[{\"Physics\":8,\"Chemistry\":7,\"PhysicalEducation\":3,\"English\":4,\"Mathematics\":6,\"serial\":195490},{\"Physics\":1,\"Chemistry\":1,\"PhysicalEducation\":1,\"English\":3,\"Mathematics\":3,\"serial\":190869},{\"Physics\":1,\"Chemistry\":2,\"PhysicalEducation\":2,\"English\":1,\"Mathematics\":2,\"serial\":3111},{\"Physics\":8,\"Chemistry\":7,\"PhysicalEducation\":6,\"English\":7,\"Mathematics\":7,\"serial\":47738},{\"Physics\":1,\"Chemistry\":1,\"PhysicalEducation\":1,\"English\":3,\"Mathematics\":2,\"serial\":85520},{\"Physics\":2,\"Chemistry\":1,\"Biology\":2,\"English\":4,\"Mathematics\":8,\"serial\":182318},{\"Physics\":3,\"Chemistry\":4,\"PhysicalEducation\":5,\"English\":5,\"Mathematics\":8,\"serial\":77482},{\"Accountancy\":2,\"BusinessStudies\":5,\"Economics\":3,\"English\":6,\"Mathematics\":7,\"serial\":152940},{\"Physics\":5,\"Chemistry\":6,\"Biology\":7,\"English\":3,\"Mathematics\":8,\"serial\":132620}]"
train <- rjson::fromJSON(train)
As a starting point you can use purrr::map as follows:
A sample data set:
x <- list(list(physics=8,
Chemistry=7,
PhysicalEducation=3,
English=4,
serial=195490),
list(physics=2,
Chemistry=1,
Biology=2,
English=4,
Mathematics=8,
serial=182318))
Sol.1 [Shortest to avoid loops]
library(purrr)
# .default supplies the value used when a sub-list lacks the element
zzz <- sapply(columns, function(n) map_dbl(x, n, .default = NA)) %>%
data.frame()
Which gives:
> zzz
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics
1 NA 7 3 4 NA 195490 NA NA NA NA
2 NA 1 NA 4 8 182318 NA NA 2 NA
Accountancy BusinessStudies
1 NA NA
2 NA NA
If you would like to understand how this works, you can check the longer solutions below.
Sol.2 [Manual assignment]
Pick the values for each column:
z <- data.frame(
serial = map_dbl(x, "serial", .default = NA),
Biology = map_dbl(x, "Biology", .default = NA),
Chemistry = map_dbl(x, "Chemistry", .default = NA)
)
Which gives:
> z
serial Biology Chemistry
1 195490 NA 7
2 182318 2 1
>
Sol.3 [Pre-defined dataframe and for-loop]
Create a data frame with a fixed size:
zz <- data.frame(matrix(NA, nrow = length(x), ncol = 12))
Assign names:
names(zz) <- columns
Assign values from the lists:
for(i in 1:ncol(zz)){
zz[columns[i]] <- map_dbl(x, columns[i], .default = NA)
}
Which gives:
> zz
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics
1 NA 7 3 4 NA 195490 NA NA NA NA
2 NA 1 NA 4 8 182318 NA NA 2 NA
Accountancy BusinessStudies
1 NA NA
2 NA NA
You can accomplish this in base R by combining Reduce and Map.
data
Here is a dataset that matches your structure.
set.seed(1234)
temp <- replicate(7, setNames(replicate(7, sample(1:10, 1), simplify=FALSE), letters[1:7]),
simplify=FALSE)
To produce a data.frame from this, you can use
Reduce(rbind, Map(data.frame, temp))
a b c d e f g
1 2 7 7 7 9 7 1
2 3 7 6 7 6 3 10
3 3 9 3 3 2 3 4
4 4 2 1 3 9 6 10
5 9 1 5 3 4 6 2
6 8 3 3 10 9 6 7
7 4 7 4 6 7 5 3
Where data.frame constructs data.frames with the inner elements. Map applies this to each element of the outer list, resulting in a list of data.frames. Finally, Reduce rbinds the data.frames in the list and produces a single data.frame.
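Note that rbind on data frames expects matching column names, so the Reduce approach fits lists whose elements all share the same fields. For records with missing fields, as in the question, a base R sketch is to build one vector per column and fill absent entries with NA. The records and the helper pull_column below are hypothetical, and the column list is shortened for brevity; this has not been benchmarked on 1.5 million rows.

```r
columns <- c("Physics", "Chemistry", "Biology", "English",
             "Mathematics", "serial")

# Two hypothetical records with different subject fields.
records <- list(
  list(Physics = 8, Chemistry = 7, English = 4, serial = 195490),
  list(Physics = 2, Biology = 2, Mathematics = 8, serial = 182318)
)

# Build one numeric vector for a column; a missing field becomes NA.
pull_column <- function(records, name) {
  vapply(records, function(rec) {
    value <- rec[[name]]  # NULL when the record lacks this field
    if (is.null(value)) NA_real_ else as.numeric(value)
  }, numeric(1))
}

cols <- lapply(columns, function(nm) pull_column(records, nm))
names(cols) <- columns
result <- as.data.frame(cols)
result
```

Filling whole columns avoids growing the data frame one row at a time, which is usually the slow part of the original loop.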

Add an integer to every element of data frame

Say I have a data frame as follows
rsi5 rsi10
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 44.96650 NA
7 39.68831 NA
8 28.35625 NA
9 37.77910 NA
10 53.54822 NA
11 52.05308 46.01867
12 80.44368 66.09973
13 60.88418 56.04507
14 53.59851 52.10633
15 46.45874 48.23648
I wish to simply add 1 (i.e. 9 becomes 10) to each non-NA element of this data frame. There is probably a very simple solution, but simple arithmetic on data frames does not seem to work in R and gives very strange results.
Just use + 1 as you would expect. Below is a mock example, as it wasn't worth copying your data for this.
Step One: Create a data.frame
R> df <- data.frame(A=c(NA, 1, 2, 3), B=c(NA, NA, 12, 13))
R> df
A B
1 NA NA
2 1 NA
3 2 12
4 3 13
R>
Step Two: Add one
R> df + 1
A B
1 NA NA
2 2 NA
3 3 13
4 4 14
R>
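The question doesn't show the strange results, so the cause is a guess, but one common source of trouble is a data frame that also holds non-numeric columns. A small sketch under that assumption, restricting the arithmetic to numeric columns (the label column and the name num_cols are made up for illustration):

```r
df <- data.frame(A = c(NA, 1, 2, 3),
                 B = c(NA, NA, 12, 13),
                 label = c("a", "b", "c", "d"),
                 stringsAsFactors = FALSE)

# TRUE for the columns that arithmetic can be applied to.
num_cols <- vapply(df, is.numeric, logical(1))

# Add 1 to the numeric columns only; NA stays NA, label is untouched.
df[num_cols] <- df[num_cols] + 1
df
```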

How to replace columns containing NA with the contents of the previous column?

I have a large dataframe with random columns which contain NA values. It looks like this:
2002-06-26 2002-06-27 2002-06-28 2002-07-01 2002-07-02 2002-07-03 2002-07-05
1 US1718711062 NA BMG4388N1065 US0116591092 NA AN8068571086 GB00BYMT0J19
2 US9837721045 NA US0025671050 US03662Q1058 NA BMG3223R1088 US0097281069
3 NA US00847J1051 US06652V2088 NA BMG4388N1065 US0305061097
4 NA US04351G1013 US1046741062 NA BMG7496G1033 US03836W1036
5 NA US2925621052 US1431301027 NA CA88157K1012 US06652V2088
6 NA US34988V1061 US1897541041 NA CH0044328745 US1547604090
7 NA US3596941068 US2053631048 NA GB00B5BT0K07 US1778351056
8 NA US4180561072 US2567461080 NA IE00B5LRLL25 US1999081045
9 NA US4198791018 US2925621052 NA IE00B8KQN827 US3498531017
10 NA US45071R1095 US3989051095 NA IE00BGH1M568 US42222N1037
I need code that identifies the NA columns and fills them with the contents of the previous column. So, for example, column "2002-06-27" should contain "US1718711062" and "US9837721045". The NA columns are at irregular intervals.
Columns are also of varying length, some containing only one element, so I think the best way to identify columns with no values is to look at the first row, like so:
row.has.na <- which(is.na(data[1,]))
[1] 2 5
To complete my comment: as you have already computed row.has.na, the vector of indices for the NA column, here is a way to use it and get what you need:
data[, row.has.na] <- data[, row.has.na - 1]
This should work. Note that this also works if two (or more) NA columns are next to each other. Maybe there is a way around the while-loop, but...
# Create some data
data <- data.frame(col1 = 1:10, col2 = NA, col3 = 10:1, col4 = NA, col5 = NA, col6 = NA)
# Find which columns contain NA in the first row
col_NA <- which(is.na(data[1,]))
# Select the previous columns
col_replace <- col_NA - 1
# Check if any NA columns are next to each other and fix it:
while(any(diff(col_replace) == 1)){
ind <- which(diff(col_replace) == 1) + 1
col_replace[ind] <- col_replace[ind] - 1
}
# Replace the NA columns with the previous columns
data[,col_NA] <- data[,col_replace]
col1 col2 col3 col4 col5 col6
1 1 1 10 10 10 10
2 2 2 9 9 9 9
3 3 3 8 8 8 8
4 4 4 7 7 7 7
5 5 5 6 6 6 6
6 6 6 5 5 5 5
7 7 7 4 4 4 4
8 8 8 3 3 3 3
9 9 9 2 2 2 2
10 10 10 1 1 1 1
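One possible way around the while-loop is cummax, which can carry the index of the last non-NA column forward across any run of NA columns. This is a sketch on the same mock data, assuming the first column is never NA; the names is_na_col, source_col and filled are illustrative.

```r
data <- data.frame(col1 = 1:10, col2 = NA, col3 = 10:1,
                   col4 = NA, col5 = NA, col6 = NA)

# TRUE for columns whose first row is NA (the question's criterion).
is_na_col <- vapply(data, function(col) is.na(col[1]), logical(1))

# NA columns contribute 0 and other columns their own index, so the
# running maximum is the nearest non-NA column at or before each position.
source_col <- cummax(ifelse(is_na_col, 0L, seq_along(is_na_col)))

filled <- data
filled[is_na_col] <- data[source_col[is_na_col]]
filled
```

If the first column could be NA, its source index would be 0 and that case would need separate handling.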

Replace negative values by NA values

I have positive, negative and NA values in a table, and I need to replace the negative values with NA. Positive and NA values should remain as they are. My data set is similar to the one below:
NO. q
1 NA
2 NA
3 -133.6105198
4 -119.6991209
5 28.84460104
6 66.05345087
7 84.7058947
8 -134.4522694
9 NA
10 NA
11 73.20465643
12 -69.90723514
13 NA
14 69.70833003
15 65.27859906
I tried this:
if (q>0) {
q=NA
} else {
q=q
}
Or use replace:
> df$q2 <- replace(df$q, which(df$q < 0), NA)
> df
NO. q q2
1 1 NA NA
2 2 NA NA
3 3 -133.61052 NA
4 4 -119.69912 NA
5 5 28.84460 28.84460
6 6 66.05345 66.05345
7 7 84.70589 84.70589
8 8 -134.45227 NA
9 9 NA NA
10 10 NA NA
11 11 73.20466 73.20466
12 12 -69.90724 NA
13 13 NA NA
14 14 69.70833 69.70833
15 15 65.27860 65.27860
Or with data.table:
library(data.table)
setDT(df)[q < 0, q := NA]
Or with replace in a dplyr pipe:
library(dplyr)
df %>% mutate(q = replace(q, which(q<0), NA))
You could try this:
sample <- c(1, -2, NA)
sample[sample < 0] <- NA
sample
[1] 1 NA NA
Or if you're using a data.frame (suppose it's called df):
df$q[df$q < 0] <- NA
You could try
df1$q1 <- NA^(df1$q <0) * df1$q
df1
# NO. q q1
#1 1 NA NA
#2 2 NA NA
#3 3 -133.61052 NA
#4 4 -119.69912 NA
#5 5 28.84460 28.84460
#6 6 66.05345 66.05345
#7 7 84.70589 84.70589
#8 8 -134.45227 NA
#9 9 NA NA
#10 10 NA NA
#11 11 73.20466 73.20466
#12 12 -69.90724 NA
#13 13 NA NA
#14 14 69.70833 69.70833
#15 15 65.27860 65.27860
Or use ifelse
with(df1, ifelse(q < 0, NA, q))
Or
is.na(df1$q) <- df1$q < 0
Another way of accomplishing the same thing (I now see this is almost the same as another answer by akrun, sorry for that):
daf$q = ifelse(daf$q < 0, NA_real_, daf$q)
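For context on why the attempt in the question fails: if expects a single TRUE or FALSE rather than a whole column, and the condition q>0 also targets the positive values instead of the negative ones. The vectorized forms above all test every element at once. A short sketch on the first rows of the question's data, checking that three of them agree (the variable names are illustrative):

```r
q <- c(NA, NA, -133.6105198, -119.6991209, 28.84460104, 66.05345087)
df <- data.frame(NO. = seq_along(q), q = q)

# Three vectorized ways to blank out the negative values.
by_replace <- replace(df$q, which(df$q < 0), NA)
by_ifelse  <- ifelse(df$q < 0, NA_real_, df$q)
by_index   <- df$q
by_index[which(by_index < 0)] <- NA

by_replace
```

Existing NA values pass through unchanged in all three, because a comparison with NA yields NA rather than TRUE.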
