I have a data frame as follows,
aid=c(1:10)
x1_var=rnorm(10,0,1)
x2_var=rnorm(10,0,1)
x3_var=rbinom(10,1,0.5)
data=data.frame(aid,x1_var,x2_var,x3_var)
head(data)
aid x1_var x2_var x3_var
1 1 -0.99759448 -0.2882535 1
2 2 -0.12755695 -1.3706875 0
3 3 1.04709366 0.8977596 1
4 4 0.48883458 -0.1965846 1
5 5 -0.40264114 0.2925659 1
6 6 -0.08409966 -1.3489460 1
I want to set every value in a row to NA when x3_var == 1, but leave the aid column untouched.
I tried the following code.
> data[which(data$x3_var==1),]=NA
> data
aid x1_var x2_var x3_var
1 NA NA NA NA
2 2 -0.12755695 -1.3706875 0
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
7 NA NA NA NA
8 8 -1.78160459 -1.8677633 0
9 9 -1.65895704 -0.8086148 0
10 10 -0.06281384 1.8888726 0
But this code also set the values in the aid column to NA. Can anybody help me fix this?
Also, are there other methods that do the same thing?
Thank you
Your code would work if you excluded the aid column from the assignment.
data[which(data$x3_var==1),-1]=NA
You can also do this without which :
data[data$x3_var==1, -1]=NA
In both cases above I am assuming that you know the position of the aid column, i.e. 1. If you don't know the column's position, you can use match to get its position.
data[data$x3_var==1, -match('aid', names(data))] = NA
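If you would rather not depend on column positions at all, the same assignment can be written with column names. This is only a minimal sketch on made-up values (not the data from the question), and `cols_to_blank` is just an illustrative name.

```r
# Small illustrative data frame with the same column names as in the question
data <- data.frame(aid    = 1:4,
                   x1_var = c(0.5, -1.2, 0.3, 2.0),
                   x2_var = c(1.1, 0.4, -0.7, 0.9),
                   x3_var = c(1, 0, 1, 0))

# Every column except aid, selected by name rather than by position
cols_to_blank <- setdiff(names(data), "aid")

# which() drops NA results of the comparison, so rows with a missing
# x3_var would simply be left alone
data[which(data$x3_var == 1), cols_to_blank] <- NA
data
```

Selecting by name keeps working if the columns are later reordered, which a negative position such as -1 does not.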
A dplyr solution, assuming the columns to be altered begin with "x", as in the example data.
library(dplyr)
set.seed(1001)
df1 <- data.frame(aid = 1:10,
x1_var = rnorm(10,0,1),
x2_var = rnorm(10,0,1),
x3_var = rbinom(10,1,0.5))
df1 %>%
mutate(across(starts_with("x"), ~ifelse(x3_var == 1, NA, .x)))
aid x1_var x2_var x3_var
1 1 2.1886481 0.3026445 0
2 2 -0.1775473 1.6343924 0
3 3 NA NA NA
4 4 -2.5065362 0.4671611 0
5 5 NA NA NA
6 6 -0.1435595 0.1102652 0
7 7 NA NA NA
8 8 -0.6229437 -1.0302508 0
9 9 NA NA NA
10 10 NA NA NA
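For comparison, the same "columns starting with x" selection could probably be done in base R as well. The sketch below assumes the same `df1` construction as above; `x_cols` and `flagged` are illustrative names, and the random values will depend on your R version.

```r
set.seed(1001)
df1 <- data.frame(aid    = 1:10,
                  x1_var = rnorm(10, 0, 1),
                  x2_var = rnorm(10, 0, 1),
                  x3_var = rbinom(10, 1, 0.5))

# Logical vector marking the columns whose name begins with "x"
x_cols <- startsWith(names(df1), "x")

# Compute the row condition once, before x3_var itself is overwritten
flagged <- df1$x3_var == 1

df1[flagged, x_cols] <- NA
df1
```

Storing the condition in `flagged` first avoids relying on the order in which the columns are modified.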
I have a dataframe A like below.
Notice that the first column is the row name with random order.
ID
5 10
3 10
1 10
Then I have another 5 x 1 data frame B filled with NAs. I am trying to copy A into B by matching the row names of A. I want to get a data frame like the one below.
ID
1 10
2 NA
3 10
4 NA
5 10
What you are trying to do is potentially dangerous. If you are 100% sure that the rows contain identifiers that would match between the 2 data frames, here's the code.
library(tidyverse)
# Generate a data frame that looks like yours (you don't need this)
df <- data.frame(ID=c(10, NA, 10, NA, 10))
# Assign row names to a new column on the df
df$names <- row.names(df)
# Drop the NA rows so that df looks like your data frame A
df <- df[complete.cases(df), ]
# Make a second df
df2 <- data.frame(names=as.character(1:20))
# Join by names (what are other possible columns to join by ?)
left_join(df2, df, by="names")
This will produce
names ID
1 1 10
2 2 NA
3 3 10
4 4 NA
5 5 10
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
11 11 NA
12 12 NA
13 13 NA
14 14 NA
15 15 NA
16 16 NA
17 17 NA
18 18 NA
19 19 NA
20 20 NA
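If the row names really are the identifiers, a join may not even be needed: base R lets you index rows of a data frame by row name. Here is a minimal sketch, assuming A and B look like the frames described in the question (the construction of `A` and `B` below is my own guess at their shape).

```r
# A: three rows whose row names are in arbitrary order
A <- data.frame(ID = c(10, 10, 10), row.names = c("5", "3", "1"))

# B: five rows of NA with row names "1" to "5"
B <- data.frame(ID = rep(NA_real_, 5), row.names = as.character(1:5))

# Character row indices are matched against the row names of B
B[rownames(A), "ID"] <- A$ID
B
```

Note that a row name in A that does not exist in B would add a new row to B rather than raise an error, so the same caution about matching identifiers applies.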
I have this list of list
> head(train)
[[1]]
[[1]]$Physics
[1] 8
[[1]]$Chemistry
[1] 7
[[1]]$PhysicalEducation
[1] 3
[[1]]$English
[1] 4
[[1]]$Mathematics
[1] 6
[[1]]$serial
[1] 195490
.
.
[[6]]
[[6]]$Physics
[1] 2
[[6]]$Chemistry
[1] 1
[[6]]$Biology
[1] 2
[[6]]$English
[1] 4
[[6]]$Mathematics
[1] 8
[[6]]$serial
[1] 182318
Each sub-list has any five elements out of these 12, plus one extra element named serial.
columns <- c("Physics", "Chemistry", "PhysicalEducation", "English",
"Mathematics", "serial", "ComputerScience", "Hindi", "Biology",
"Economics", "Accountancy", "BusinessStudies")
I am trying to convert this list into a data frame.
Presently, I am doing this with a for loop that fills one row at a time. Although this works, it takes a huge amount of time.
colclass <- rep("numeric",12)
comby <- read.table(text = '', colClasses = colclass, col.names = columns)
for(i in 1:length(train)){
comby[i,names(train[[i]])] <- train[[i]]
}
I tried using do.call(rbind, train) but that doesn't work as it keeps adding new data into the old columns from the first iteration.
What's a better, faster way? I have around 1.5 million observations.
Desired output: the data frame should have all the columns, with NA where there is no value. I am also interested in whether it could be done faster without using any additional packages.
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics Accountancy
1 8 7 3 4 6 195490 NA NA NA NA NA
2 1 1 1 3 3 190869 NA NA NA NA NA
3 1 2 2 1 2 3111 NA NA NA NA NA
4 8 7 6 7 7 47738 NA NA NA NA NA
5 1 1 1 3 2 85520 NA NA NA NA NA
6 2 1 NA 4 8 182318 NA NA 2 NA NA
BusinessStudies
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
Here is the reproducible code
train <- "[{\"Physics\":8,\"Chemistry\":7,\"PhysicalEducation\":3,\"English\":4,\"Mathematics\":6,\"serial\":195490},{\"Physics\":1,\"Chemistry\":1,\"PhysicalEducation\":1,\"English\":3,\"Mathematics\":3,\"serial\":190869},{\"Physics\":1,\"Chemistry\":2,\"PhysicalEducation\":2,\"English\":1,\"Mathematics\":2,\"serial\":3111},{\"Physics\":8,\"Chemistry\":7,\"PhysicalEducation\":6,\"English\":7,\"Mathematics\":7,\"serial\":47738},{\"Physics\":1,\"Chemistry\":1,\"PhysicalEducation\":1,\"English\":3,\"Mathematics\":2,\"serial\":85520},{\"Physics\":2,\"Chemistry\":1,\"Biology\":2,\"English\":4,\"Mathematics\":8,\"serial\":182318},{\"Physics\":3,\"Chemistry\":4,\"PhysicalEducation\":5,\"English\":5,\"Mathematics\":8,\"serial\":77482},{\"Accountancy\":2,\"BusinessStudies\":5,\"Economics\":3,\"English\":6,\"Mathematics\":7,\"serial\":152940},{\"Physics\":5,\"Chemistry\":6,\"Biology\":7,\"English\":3,\"Mathematics\":8,\"serial\":132620}]"
train <- rjson::fromJSON(train)
As a starting point you can use purrr::map as follows:
A sample data set:
x <- list(list(physics=8,
Chemistry=7,
PhysicalEducation=3,
English=4,
serial=195490),
list(physics=2,
Chemistry=1,
Biology=2,
English=4,
Mathematics=8,
serial=182318))
Sol.1 [Shortest to avoid loops]
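library(purrr)  # provides map_dbl()
# .default is the current name of the older .null argument
zzz <- sapply(columns, function(n) map_dbl(x, n, .default = NA_real_)) %>%
data.frame()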
Which gives:
> zzz
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics
1 NA 7 3 4 NA 195490 NA NA NA NA
2 NA 1 NA 4 8 182318 NA NA 2 NA
Accountancy BusinessStudies
1 NA NA
2 NA NA
If you would like to understand how this works, you can check the longer solutions below.
Sol.2 [Manual assignment]
-pick the values for each column:
z <- data.frame(
  serial = map_dbl(x, "serial", .default = NA_real_),
  Biology = map_dbl(x, "Biology", .default = NA_real_),
  Chemistry = map_dbl(x, "Chemistry", .default = NA_real_)
)
Which gives:
> z
serial Biology Chemistry
1 195490 NA 7
2 182318 2 1
>
Sol.3 [Pre-defined dataframe and for-loop]
create a dataframe with a fixed size
zz <- data.frame(matrix(NA, nrow = length(x), ncol = 12))
assign names
names(zz) <- columns
assign values from the lists
for(i in seq_along(columns)){
  zz[columns[i]] <- map_dbl(x, columns[i], .default = NA_real_)
}
Which gives:
> zz
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics
1 NA 7 3 4 NA 195490 NA NA NA NA
2 NA 1 NA 4 8 182318 NA NA 2 NA
Accountancy BusinessStudies
1 NA NA
2 NA NA
You can accomplish this in base R by combining Reduce and Map.
data
Here is a dataset that matches your structure.
set.seed(1234)
temp <- replicate(7, setNames(replicate(7, sample(1:10, 1), simplify=FALSE), letters[1:7]),
simplify=FALSE)
To produce a data.frame from this, you can use
Reduce(rbind, Map(data.frame, temp))
a b c d e f g
1 2 7 7 7 9 7 1
2 3 7 6 7 6 3 10
3 3 9 3 3 2 3 4
4 4 2 1 3 9 6 10
5 9 1 5 3 4 6 2
6 8 3 3 10 9 6 7
7 4 7 4 6 7 5 3
Where data.frame constructs data.frames with the inner elements. Map applies this to each element of the outer list, resulting in a list of data.frames. Finally, Reduce rbinds the data.frames in the list and produces a single data.frame.
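The rbind approach needs every inner list to have the same names, which is not the case in the question, where each record holds a different subset of subjects. One package-free alternative is to fill a pre-allocated numeric matrix in a single vectorised assignment. The sketch below uses three records copied from the question and assumes every inner element is a single number whose name appears in `columns`; `row_idx`, `col_idx` and `filled` are illustrative names.

```r
columns <- c("Physics", "Chemistry", "PhysicalEducation", "English",
             "Mathematics", "serial", "ComputerScience", "Hindi", "Biology",
             "Economics", "Accountancy", "BusinessStudies")

train <- list(
  list(Physics = 8, Chemistry = 7, PhysicalEducation = 3, English = 4,
       Mathematics = 6, serial = 195490),
  list(Physics = 2, Chemistry = 1, Biology = 2, English = 4,
       Mathematics = 8, serial = 182318),
  list(Accountancy = 2, BusinessStudies = 5, Economics = 3, English = 6,
       Mathematics = 7, serial = 152940)
)

# One (row, column) pair per stored value, in the order unlist() returns them
row_idx <- rep(seq_along(train), lengths(train))
col_idx <- match(unlist(lapply(train, names)), columns)
values  <- unlist(train, use.names = FALSE)

# Start from an all-NA matrix and fill every known cell at once
filled <- matrix(NA_real_, nrow = length(train), ncol = length(columns),
                 dimnames = list(NULL, columns))
filled[cbind(row_idx, col_idx)] <- values

comby <- as.data.frame(filled)
comby
```

Because there is no per-row loop and no repeated rbind, this should scale much better to millions of records, although I have not timed it on data of that size.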
I have a large dataframe with random columns which contain NA values. It looks like this:
2002-06-26 2002-06-27 2002-06-28 2002-07-01 2002-07-02 2002-07-03 2002-07-05
1 US1718711062 NA BMG4388N1065 US0116591092 NA AN8068571086 GB00BYMT0J19
2 US9837721045 NA US0025671050 US03662Q1058 NA BMG3223R1088 US0097281069
3 NA US00847J1051 US06652V2088 NA BMG4388N1065 US0305061097
4 NA US04351G1013 US1046741062 NA BMG7496G1033 US03836W1036
5 NA US2925621052 US1431301027 NA CA88157K1012 US06652V2088
6 NA US34988V1061 US1897541041 NA CH0044328745 US1547604090
7 NA US3596941068 US2053631048 NA GB00B5BT0K07 US1778351056
8 NA US4180561072 US2567461080 NA IE00B5LRLL25 US1999081045
9 NA US4198791018 US2925621052 NA IE00B8KQN827 US3498531017
10 NA US45071R1095 US3989051095 NA IE00BGH1M568 US42222N1037
I need code that identifies the NA columns and fills them with the contents of the previous column. For example, column "2002-06-27" should contain "US1718711062" and "US9837721045". The NA columns occur at irregular intervals.
The columns also have varying lengths, some containing only one element, so I think the best way to identify columns with no values is to look at the first row, like so:
row.has.na <- which(is.na(data[1,]))
[1] 2 5
To complete my comment: as you have already computed row.has.na, the vector of indices for the NA column, here is a way to use it and get what you need:
data[, row.has.na] <- data[, row.has.na - 1]
This should work. Note that this also works if two (or more) NA columns are next to each other. Maybe there is a way around the while-loop, but...
# Create some data
data <- data.frame(col1 = 1:10, col2 = NA, col3 = 10:1, col4 = NA, col5 = NA, col6 = NA)
# Find which columns contain NA in the first row
col_NA <- which(is.na(data[1,]))
# Select the previous columns
col_replace <- col_NA - 1
# Check if any NA columns are next to each other and fix it:
while(any(diff(col_replace) == 1)){
ind <- which(diff(col_replace) == 1) + 1
col_replace[ind] <- col_replace[ind] - 1
}
# Replace the NA columns with the previous columns
data[,col_NA] <- data[,col_replace]
col1 col2 col3 col4 col5 col6
1 1 1 10 10 10 10
2 2 2 9 9 9 9
3 3 3 8 8 8 8
4 4 4 7 7 7 7
5 5 5 6 6 6 6
6 6 6 5 5 5 5
7 7 7 4 4 4 4
8 8 8 3 3 3 3
9 9 9 2 2 2 2
10 10 10 1 1 1 1
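One possible way around the while-loop is to compute, for every column, the index of the nearest non-NA column to its left with cummax. This is a sketch under the same assumption as above (a column counts as empty when its first row is NA); `fill_from_previous` is a hypothetical helper name.

```r
fill_from_previous <- function(df) {
  # TRUE for columns whose first row is NA
  is_empty <- vapply(df, function(col) is.na(col[1]), logical(1))

  # Index of the last non-empty column at or before each position
  # (0 if there is none, e.g. when the very first column is empty)
  src <- cummax(ifelse(is_empty, 0L, seq_along(is_empty)))

  # Only fill empty columns that actually have a source on their left
  fill <- is_empty & src > 0
  if (any(fill)) {
    df[fill] <- df[src[fill]]
  }
  df
}

data <- data.frame(col1 = 1:10, col2 = NA, col3 = 10:1,
                   col4 = NA, col5 = NA, col6 = NA)
result <- fill_from_previous(data)
result
```

Runs of adjacent NA columns are handled automatically, because cummax carries the last seen non-empty index forward.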
I have positive, negative and NA values in a table, and I need to replace the negative values with NA. Positive and NA values should remain as they are. My data set is similar to the one below:
NO. q
1 NA
2 NA
3 -133.6105198
4 -119.6991209
5 28.84460104
6 66.05345087
7 84.7058947
8 -134.4522694
9 NA
10 NA
11 73.20465643
12 -69.90723514
13 NA
14 69.70833003
15 65.27859906
I tried this:
if (q>0) {
q=NA
} else {
q=q
}
Or use replace:
> df$q2 <- replace(df$q, which(df$q < 0), NA)
> df
NO. q q2
1 1 NA NA
2 2 NA NA
3 3 -133.61052 NA
4 4 -119.69912 NA
5 5 28.84460 28.84460
6 6 66.05345 66.05345
7 7 84.70589 84.70589
8 8 -134.45227 NA
9 9 NA NA
10 10 NA NA
11 11 73.20466 73.20466
12 12 -69.90724 NA
13 13 NA NA
14 14 69.70833 69.70833
15 15 65.27860 65.27860
Or with data.table:
library(data.table)
setDT(df)[q < 0, q := NA]
Or with replace in a dplyr pipe:
library(dplyr)
df %>% mutate(q = replace(q, which(q<0), NA))
You could try this:
sample <- c(1, -2, NA)
sample[sample < 0] <- NA
sample
[1] 1 NA NA
Or if you're using a data.frame (suppose it's called df):
df$q[df$q < 0] <- NA
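As for why the if attempt in the question does not work: if() expects a single TRUE or FALSE, while a comparison on a whole column returns one value per row (and NA for the missing ones). The sketch below illustrates this on a few values taken from the question's data; `attempt` is an illustrative name.

```r
q <- c(NA, -133.6105198, 28.84460104, 66.05345087)

# if() cannot decide on a whole vector (here one that starts with NA),
# so the call fails; tryCatch() turns the failure into a string
attempt <- tryCatch(
  if (q < 0) "negative" else "other",
  error = function(e) "error"
)
attempt

# Logical indexing handles every element at once; elements where the
# comparison is NA are already NA, so assigning NA to them is harmless
q[q < 0] <- NA
q
```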
You could try
df1$q1 <- NA^(df1$q <0) * df1$q
df1
# NO. q q1
#1 1 NA NA
#2 2 NA NA
#3 3 -133.61052 NA
#4 4 -119.69912 NA
#5 5 28.84460 28.84460
#6 6 66.05345 66.05345
#7 7 84.70589 84.70589
#8 8 -134.45227 NA
#9 9 NA NA
#10 10 NA NA
#11 11 73.20466 73.20466
#12 12 -69.90724 NA
#13 13 NA NA
#14 14 69.70833 69.70833
#15 15 65.27860 65.27860
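The NA^ trick above relies on the arithmetic rule that anything raised to the power 0 is 1, even NA, while NA raised to 1 (or to NA) stays NA. A small sketch with made-up values, where `mask` is an illustrative name:

```r
q <- c(NA, -133.6, 28.8)

# q < 0 is NA, TRUE, FALSE, which is treated as NA, 1, 0 in arithmetic
mask <- NA^(q < 0)
mask        # NA where q is negative or missing, 1 elsewhere

# Multiplying by 1 keeps the value, multiplying by NA blanks it
q1 <- mask * q
q1
```

It is compact, but the indexing or ifelse versions are likely easier for other readers to follow.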
Or use ifelse
with(df1, ifelse(q < 0, NA, q))
Or
is.na(df1$q) <- df1$q < 0
Another way of accomplishing the same thing is (now I see this is ALMOST the same as another answer by akrun, sorry for that)
daf$q = ifelse(daf$q < 0, NA_real_, daf$q)