Impute only certain NA's for a variable in a data frame [closed] - r

I'm new to R and exploring different beautiful options in it. I'm working on a data frame where I have a variable with about 900 missing values, i.e. NAs.
I want to impute 3 different values for the NAs:
1st 300 NA's with Value 1.
2nd 300 NA's with Value 2.
3rd 300 NA's with Value 3.
There are a total of 23272 rows in the data.
dim(data)
[1] 23272 2
colSums(is.na(data))
month year
884 884
summary(data$month)
1 2 3 4 5 6 7 8 9 10 11 12 NA's
1977 1658 1837 1584 1703 1920 1789 2046 1955 2026 1845 2048 884
If we check months 8, 10 and 12, there is not much difference between them, hence I thought of assigning these 3 months to the NAs by splitting them in the ratio 300:300:284. Usually we go by the mode, but I want to try this approach.

I assume you mean you have a long list, some of whose values are NAs:
set.seed(42)
df <- data.frame(val = sample(c(1:3, NA_real_), size = 1000, replace = TRUE))
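We can keep a running tally of the NAs and map each one to an imputed value using integer division with %/%. Note that NA_num %/% 100 + 1 puts the 100th NA into group 2, so the first group holds only 99 NAs; use (NA_num - 1) %/% 100 + 1 if every group should hold exactly 100.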
library(tidyverse)
df2 <- df %>%
  mutate(NA_num = if_else(is.na(val),
                          cumsum(is.na(val)),
                          NA_integer_),
         imputed = NA_num %/% 100 + 1)
Output:
df2 %>%
slice(397:410) # based on manual examination using this seed
val NA_num imputed
1 NA 98 1
2 NA 99 1
3 3 NA NA
4 1 NA NA
5 1 NA NA
6 3 NA NA
7 3 NA NA
8 2 NA NA
9 NA 100 2
10 1 NA NA
11 NA 101 2
12 2 NA NA
13 1 NA NA
14 2 NA NA
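To carry the running-tally idea through to the question's goal of actually filling the NAs, here is a minimal base R sketch. It assumes a stand-in data frame with 884 NAs in month (the real data is not shown), the fill values 8, 10 and 12 mentioned in the question, and blocks of 300; the names fill_values, chunk_size, na_idx and block are illustrative only.

```r
# Stand-in for the question's data (assumption): 2000 rows, 884 of them NA
set.seed(42)
data <- data.frame(month = rep(1:12, length.out = 2000))
data$month[sample(nrow(data), 884)] <- NA

fill_values <- c(8L, 10L, 12L)  # values to impute, in order (from the question)
chunk_size  <- 300              # NAs per block (from the question)

na_idx <- which(is.na(data$month))   # row positions of the NAs, top to bottom
# 1st-300th NA -> block 1, 301st-600th -> block 2, the rest -> block 3
block <- (seq_along(na_idx) - 1) %/% chunk_size + 1
block <- pmin(block, length(fill_values))  # guard: extra NAs fall into the last block

data$month[na_idx] <- fill_values[block]
table(data$month[na_idx])
```

With 884 NAs this splits them 300:300:284, in row order. If a random rather than positional assignment is wanted, shuffling na_idx with sample() before indexing would be one option.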

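Without an example to test on, I think this will work.
Basically, filter the NA rows into a new table, fill in the values there, and bind the result back onto the rest of the data. Here new_dt stands in for the original data filtered down to the rows with NAs, and dt is assumed to hold the remaining rows.
library(tidyverse)
new_dt <- data.frame(x1 = 1:900, x2 = NA) %>%
  filter(is.na(x2)) %>%
  # (row_number() - 1) %/% 300 is 0 for rows 1-300, 1 for 301-600, 2 for 601-900;
  # the result overwrites the NA column x2 (column name assumed)
  mutate(x2 = case_when((row_number() - 1) %/% 300 == 0 ~ 1,
                        (row_number() - 1) %/% 300 == 1 ~ 2,
                        (row_number() - 1) %/% 300 == 2 ~ 3))
dt <- rbind(dt, new_dt)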

Related

Paste date in new column if condition is true in another R [duplicate]

I want to extract the date from a variable if the condition in another variable is true.
Example: if comorbidity1==10, extract the date from smr_01, otherwise NA
I also need to do this for the case where comorbidity1==11 OR comorbidity1==12: extract the date from smr_01, otherwise NA
This is what I want my data to look like
comorbidity1 smr_01 NewDate
1 20120607 NA
10 20120607 20120607
10 20120613 20120613
3 20121103 NA
6 20150607 NA
12 20140509 NA
11 20120405 NA
I have tried this
fulldata$NewDate<-ifelse(fulldata$comorbidity1==10, fulldata$smr_01, NA)
but it is not pasting the date in the correct format.
what I am getting looks like this
comorbidity1 smr_01 NewDate
1 20120607 NA
10 20120607 4675
10 20120613 17856
3 20121103 NA
6 20150607 NA
12 20140509 NA
11 20120405 NA
smr_01 is classed as a date
Thank you
Try:
df$NewDate <- as.Date(NA)
inds <- df$comorbidity1 == 10
#For more than 1 value use %in%
#inds <- df$comorbidity1 %in% 10:12
df$NewDate[inds] <- df$smr_01[inds]
df
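The reason the original attempt shows plain numbers is that ifelse() returns a vector shaped like the test and drops attributes such as the Date class, so only the underlying day counts survive. A small sketch of the difference, assuming smr_01 is a Date column built from the values shown in the question (the data frame below is a reconstruction, not the poster's real data):

```r
# Reconstructed example data (assumption): smr_01 stored as Date
df <- data.frame(
  comorbidity1 = c(1, 10, 10, 3, 6, 12, 11),
  smr_01 = as.Date(c("20120607", "20120607", "20120613", "20121103",
                     "20150607", "20140509", "20120405"), format = "%Y%m%d")
)

# ifelse() drops the Date class: the result is a bare numeric vector
bad <- ifelse(df$comorbidity1 == 10, df$smr_01, NA)
class(bad)

# Assigning by index into a column that is already a Date keeps the class
df$NewDate <- as.Date(NA)
inds <- df$comorbidity1 %in% 10:12   # 10, 11 or 12
df$NewDate[inds] <- df$smr_01[inds]
class(df$NewDate)
df
```

Wrapping the ifelse() result in as.Date(..., origin = "1970-01-01") would also recover the dates, but the index assignment avoids the round trip through numbers.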

Selecting the first non 0 value in a row [closed]

I have a large data frame of data from across months and I want to select the first number that is not NA in each row. For instance, ID 895 would correspond to the value in Feb15, 687.
ID Jan15 Feb15 Mar15 Apr15
----- ------- ------- ------- -------
100 NA NA NA 625
113 451 586 NA NA
895 NA 687 313 17
454 NA 977 NA 146
It would be helpful to store them in a variable so I could perform further calculations by month.
apply(tempdat[,32:43],1, function(x) head(which(x>0),1))
This data frame contains thousands of rows, so is it possible to have all the numbers returned for each month stored in their own new variables, or in one new data frame by month?
In this case:
AggJan15 = 451
AggFeb15 = 687
AggMar15 = 0
AggApr15 = 625
The two answers below are based on different assumptions on what the question is saying.
1) In this answer we assume you want the first non-NA in each row. First find the column index of the first non-NA in each row using max.col, giving ix. Then create an output data frame whose first column is ID, whose second is the first non-NA month for that row, and whose third is the value in that month. The last line sets month to NA for any row that has no non-NA value; it is not needed if you know that every row has at least one non-NA. Note that we have converted the month/year names to class yearmon so that they sort properly.
library(zoo)
DF1 <- DF[-1]
ix <- max.col(!is.na(DF1), "first")
out <- data.frame(ID = DF$ID,
month = as.yearmon(names(DF1)[ix], "%b%y"),
value = DF1[cbind(1:nrow(DF1), ix)])
out$month[is.na(out$value)] <- NA
## ID month value
## 1 100 Apr 2015 625
## 2 113 Jan 2015 451
## 3 895 Feb 2015 687
In a comment the poster says they want the sum by month so in that case we first sum by month giving ag and then we merge that with all months within the range to fill it out. The third line can be omitted if it is OK to have absent months filled in with NA; otherwise, use it and they will be filled with 0.
ag <- aggregate(value ~ month, out, sum)
m <- merge(ag, seq(min(ag$month), max(ag$month), 1/12), by = 1, all = TRUE)
m$value[is.na(m$value)] <- 0
## month value
## 1 Jan 2015 451
## 2 Feb 2015 687
## 3 Mar 2015 0
## 4 Apr 2015 625
2) Originally I thought you wanted the first non-NA in each column and this answer addresses that.
Assuming DF is as shown reproducibly in the Note at the end, use na.locf with fromLast = TRUE so that values are carried backward, and take the first row.
library(zoo)
Agg <- na.locf(DF[-1], fromLast = TRUE)[1, ]
Agg
## Jan15 Feb15 Mar15 Apr15
## 1 451 586 313 625
Agg$Jan15
## [1] 451
Note
Lines <- "ID Jan15 Feb15 Mar15 Apr15
----- ------- ------- ------- -------
100 NA NA NA 625
113 451 586 NA NA
895 NA 687 313 17 "
DF <- read.table(text = Lines, header = TRUE, comment.char = "-")

how to extract the value from multiple columns in a specific order [duplicate]

I have this dataset that contains variables from three previous years.
data <- read.table(text="
a 2015 2016 2017
1 100 100 100
2 1000 5 NA
3 10000 NA NA", header=TRUE)
I would like to create a new column in my data which contains the value from the most recent year. The order is 2017 ->2016 ->2015.
output <- read.table(text="
a 2015 2016 2017 recent
1 100 100 100 100
2 1000 5 NA 5
3 10000 NA NA 10000", header=TRUE)
I know that I can use "if" command to achieve it, but I am wondering if there is a quick and simple way to do it.
Thanks!
Here's a simple base R solution. It assumes that the years are sorted from left to right, oldest first.
data$recent <- apply(data, 1, function(x) tail(na.omit(x), 1))
a X2015 X2016 X2017 recent
1 1 100 100 100 100
2 2 1000 5 NA 5
3 3 10000 NA NA 10000
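As a possible variation, the same result can be obtained without a row-wise apply by looking only at the year columns and using max.col with ties.method = "last" to find the right-most non-NA in each row. This is a sketch under the same left-to-right assumption; years and last are illustrative names.

```r
data <- read.table(text = "
a 2015 2016 2017
1 100 100 100
2 1000 5 NA
3 10000 NA NA", header = TRUE)

years <- data[-1]   # year columns only (read.table names them X2015, X2016, X2017)

# Column index of the last non-NA entry in each row
last <- max.col(!is.na(years), ties.method = "last")

# Pick one cell per row via a (row, column) index matrix
data$recent <- years[cbind(seq_len(nrow(years)), last)]
data
```

Restricting the lookup to the year columns means the id column a can never be picked up, and a row whose years are all NA simply gets NA.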

Removing "outer rows" to allow for interpolation (and prevent extrapolation)

I have (left)joined two data frames by country-year.
df<- left_join(df, df2, by="country-year")
leading to the following example output:
country country-year a b
1 France France2000 NA NA
2 France France2001 1000 1000
3 France France2002 NA NA
4 France France2003 1600 2200
5 France France2004 NA NA
6 UK UK2000 1000 1000
7 UK UK2001 NA NA
8 UK UK2002 1000 1000
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I initially wanted to remove all values for which both of the added columns (a,b) were NA.
df<-df[!is.na( df$a | df$b ),]
However, on second thought, I decided I wanted to interpolate the data I had (but not extrapolate). So instead I would like to remove all the rows for which I cannot interpolate; in the example:
1 France France2000 NA NA
5 France France2004 NA NA
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I believe there are 2 options. First I somehow adapt this function:
library(tidyerse)
TRcomplete<-TRcomplete%>%
group_by(country) %>%
mutate_at(a:b,~na.fill(.x,"extend"))
to interpolate only, and then apply df<-df[!is.na( df$a | df$b ),]
or I write code to remove the "outer" rows first and then use extend as normal. Desired output:
country country-year a b
2 France France2001 1000 1000
3 France France2002 1300 1600
4 France France2003 1600 2200
6 UK UK2000 1000 1000
7 UK UK2001 0 0
8 UK UK2002 1000 1000
Any suggestions?
There are options in na.fill to specify what is done. If you look at ?na.fill, you see that fill can specify the left, interior and right, so if you specify the left and right are NA and the interior is "extend", then it will only fill the interior data. You can then filter the rows with NA.
library(tidyverse)
library(zoo)
df %>%
group_by(country) %>%
mutate_at(vars(a:b),~na.fill(.x,c(NA, "extend", NA))) %>%
filter(!is.na(a) | !is.na(b))
By the way, you have a typo in your library(tidyverse) statement; you are missing the v.
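For comparison, here is a base R sketch of the same interpolate-but-do-not-extrapolate idea using approx() instead of na.fill. With rule = 1, approx() returns NA outside the range of known points, so leading and trailing NAs stay NA and can be filtered out afterwards. The numeric year column and the helper interp_inner are assumptions made for this illustration; the question only shows a combined country-year key.

```r
# Reconstructed example data (assumption: a numeric year column is available)
df <- data.frame(
  country = rep(c("France", "UK"), each = 5),
  year    = rep(2000:2004, times = 2),
  a = c(NA, 1000, NA, 1600, NA, 1000, NA, 1000, NA, NA),
  b = c(NA, 1000, NA, 2200, NA, 1000, NA, 1000, NA, NA)
)

# Hypothetical helper: linear interpolation between known points only
interp_inner <- function(y, x) {
  ok <- !is.na(y)
  if (sum(ok) < 2) return(y)                  # approx() needs two known points
  approx(x[ok], y[ok], xout = x, rule = 1)$y  # rule = 1: NA outside known range
}

for (col in c("a", "b")) {
  # interpolate within each country, then put the pieces back in row order
  pieces <- lapply(split(df, df$country),
                   function(g) interp_inner(g[[col]], g$year))
  df[[col]] <- unsplit(pieces, df$country)
}

# Drop the rows that could not be interpolated
df <- df[!is.na(df$a) | !is.na(df$b), ]
df
```

On this data, France keeps 2001 to 2003 and the UK keeps 2000 to 2002, with the interior gaps filled linearly.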

cross sectional sub-sets in data.table

I have a data.table which contains multiple columns, which is well represented by the following:
DT <- data.table(date = as.IDate(rep(c("2012-10-17", "2012-10-18", "2012-10-19"), each=10)),
session = c(1,2,3), price = c(10, 11, 12,13,14),
volume = runif(30, min=10, max=1000))
I would like to extract a multiple column table which shows the volume traded at each price in a particular type of session -- with each column representing a date.
At present, I extract this data one date at a time using the following:
DT[session==1,][date=="2012-10-17", sum(volume), by=price]
and then bind the columns.
Is there a way of obtaining the end product (a table with each column referring to a particular date) without sticking all the single queries together, as I'm currently doing?
thanks
Does the following do what you want?
It uses a combination of reshape2 and data.table.
library(data.table)
library(reshape2)
# total volume per price, date and session; the unnamed sum becomes column V1
.DT <- DT[, sum(volume), by = list(price, date, session)][, DATE := as.character(date)]
# reshape2 for casting to wide -- it doesn't seem to like IDate columns, hence
# the character DATE column
dcast(.DT, session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
6 2 10 NA 755.2650 998.7646
7 2 11 251.3691 695.0153 NA
8 2 12 791.6882 NA 275.4777
9 2 13 NA 111.7700 240.3329
10 2 14 230.6461 817.9438 NA
11 3 10 902.9220 NA 870.3641
12 3 11 NA 719.8441 963.1768
13 3 12 361.8612 563.9518 NA
14 3 13 393.6963 NA 718.7878
15 3 14 NA 871.4986 582.6158
If you just wanted session 1:
dcast(.DT[session == 1L], session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
