Selecting the first non 0 value in a row [closed] - r

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I have a large data frame with one column per month, and I want to select the
first value in each row that is not NA. For instance, ID 895 would correspond to the Feb15 value, 687.
ID Jan15 Feb15 Mar15 Apr15
----- ------- ------- ------- -------
100 NA NA NA 625
113 451 586 NA NA
895 NA 687 313 17
454 NA 977 NA 146
It would be helpful to store them in a variable so I could perform further calculations by month.
apply(tempdat[,32:43],1, function(x) head(which(x>0),1))
The data frame contains thousands of rows, so is it possible to have all the numbers returned for each month stored in their own new variables, or in one new data frame by month?
In this case:
AggJan15 = 451
AggFeb15 = 687
AggMar15 = 0
AggApr15 = 625

The two answers below are based on different assumptions on what the question is saying.
1) In this answer we are assuming you want the first non-NA in each row. First find the column index of the first non-NA in each row using max.col, giving ix. Then create an output data frame whose first column is ID, whose second column is the first non-NA month for that row, and whose third column is the value in that month. The last line sets month to NA for any row that has no non-NA value; it is not needed if you know that every row has at least one non-NA. Note that we convert month/year to class yearmon so that the months sort properly.
library(zoo)
DF1 <- DF[-1]
ix <- max.col(!is.na(DF1), "first")
out <- data.frame(ID = DF$ID,
                  month = as.yearmon(names(DF1)[ix], "%b%y"),
                  value = DF1[cbind(1:nrow(DF1), ix)])
out$month[is.na(out$value)] <- NA
## ID month value
## 1 100 Apr 2015 625
## 2 113 Jan 2015 451
## 3 895 Feb 2015 687
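To see what max.col is doing here, the following is a minimal base R sketch of the same indexing step without zoo. The names vals, first_value and first_month are only illustrative, and the months are kept as plain column names rather than converted to yearmon.

```r
# Same three rows as in the Note at the end
DF <- data.frame(ID = c(100, 113, 895),
                 Jan15 = c(NA, 451, NA),
                 Feb15 = c(NA, 586, 687),
                 Mar15 = c(NA, NA, 313),
                 Apr15 = c(625, NA, 17))

vals <- as.matrix(DF[-1])

# !is.na(vals) is TRUE where a value is present; with ties.method = "first",
# max.col returns the position of the first TRUE in each row
ix <- max.col(!is.na(vals), ties.method = "first")

# Pick one cell per row using a two-column (row, column) index matrix
first_value <- vals[cbind(seq_len(nrow(vals)), ix)]
first_month <- colnames(vals)[ix]
```

For a row that is entirely NA, max.col still returns 1, so first_value is NA for that row and first_month would need to be reset to NA, which is what the last line of the answer's code does.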
In a comment the poster says they want the sum by month. In that case we first sum by month, giving ag, and then merge that with all months within the range to fill in any absent months. The third line can be omitted if it is OK for absent months to be filled with NA; otherwise, keep it and they will be filled with 0.
ag <- aggregate(value ~ month, out, sum)
m <- merge(ag, seq(min(ag$month), max(ag$month), 1/12), by = 1, all = TRUE)
m$value[is.na(m$value)] <- 0
## month value
## 1 Jan 2015 451
## 2 Feb 2015 687
## 3 Mar 2015 0
## 4 Apr 2015 625
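If zoo is not available, a rough base R equivalent of the sum-by-month step might look like the sketch below. It assumes the full set of month labels is known in advance (month_levels is a hypothetical name) and keeps them as strings instead of yearmon objects.

```r
# First non-NA month and value per row, as produced by the step above
out <- data.frame(month = c("Apr15", "Jan15", "Feb15"),
                  value = c(625, 451, 687))

# Assumed: the complete, ordered list of months we want in the result
month_levels <- c("Jan15", "Feb15", "Mar15", "Apr15")

# tapply sums within each factor level; levels with no rows come back as NA
sums <- tapply(out$value, factor(out$month, levels = month_levels), sum)

# Replace the NA for absent months with 0
sums[is.na(sums)] <- 0
sums
```

Using a factor with explicit levels plays the role that the merge against the month sequence plays in the answer: it forces absent months to appear in the result.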
2) Originally I thought you wanted the first non-NA in each column and this answer addresses that.
Assuming DF is as shown reproducibly in the Note at the end, use na.locf with fromLast = TRUE (so that each NA is filled from the next non-NA value below it) and take the first row.
library(zoo)
Agg <- na.locf(DF[-1], fromLast = TRUE)[1, ]
Agg
## Jan15 Feb15 Mar15 Apr15
## 1 451 586 313 625
Agg$Jan15
## [1] 451
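For comparison, the same column-wise result can be sketched in base R without na.locf by taking the first non-NA element of each column. This is only an illustrative alternative; first_non_na is a hypothetical helper name.

```r
DF <- data.frame(ID = c(100, 113, 895),
                 Jan15 = c(NA, 451, NA),
                 Feb15 = c(NA, 586, 687),
                 Mar15 = c(NA, NA, 313),
                 Apr15 = c(625, NA, 17))

# Hypothetical helper: first non-NA element of a vector (NA if there is none)
first_non_na <- function(x) x[!is.na(x)][1]

# Apply it to every column except ID
agg <- sapply(DF[-1], first_non_na)
agg
```

The result is a named numeric vector, so agg["Jan15"] plays the role of Agg$Jan15.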
Note
Lines <- "ID Jan15 Feb15 Mar15 Apr15
----- ------- ------- ------- -------
100 NA NA NA 625
113 451 586 NA NA
895 NA 687 313 17 "
DF <- read.table(text = Lines, header = TRUE, comment.char = "-")

Related

Merging Multiple (and different datasets)

I'd like to merge multiple (around ten) datasets in R. Quite a few of the datasets are different from each other, so I don't need to match them by row name or anything. I'd just like to paste them side by side, on a single dataframe so I can export them into a single sheet. For instance, I have the following two datasets:
Month  Engagement  Test
Jan    51          1
Feb    123         2

Variable  Engagement
Hot       412
Cold      4124
Warm      4fd4
I'd simply like to put them side by side (as in left and right) in a single data frame for exporting purposes, like this:
Month  Engagement  Test  Variable  Engagement
Jan    51          1     Hot       412
Feb    123         2     Cold      4124
NA     NA          NA    Warm      4fd4
Is there any way to accomplish this? It might seem like a strange request, but do let me know if I should provide any more info! Thank you so much.
Put the data frames in a list and find the maximum number of rows among them. Then subset each data frame with the row indices from 1 to that maximum; a data frame with fewer rows is padded with rows of NA.
data <- list(df1, df2)
n <- seq_len(max(sapply(data, nrow)))
result <- do.call(cbind, lapply(data, `[`, n, ))
result
# Month Engagement Test Variable Engagement
#1 Jan 51 1 Hot 412
#2 Feb 123 2 Cold 4124
#NA <NA> NA NA Warm 4fd4
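A self-contained sketch of the same padding idea, with the two example data sets typed in by hand. The helper name pad_rows is hypothetical, and Engagement in the second data set is kept as character because of the value "4fd4".

```r
df1 <- data.frame(Month = c("Jan", "Feb"),
                  Engagement = c(51, 123),
                  Test = 1:2)
df2 <- data.frame(Variable = c("Hot", "Cold", "Warm"),
                  Engagement = c("412", "4124", "4fd4"))

# Hypothetical helper: indexing rows past the end of a data frame yields NA rows
pad_rows <- function(d, n) d[seq_len(n), , drop = FALSE]

data <- list(df1, df2)
n_max <- max(vapply(data, nrow, integer(1)))

# Pad every data frame to n_max rows, then bind them side by side
result <- do.call(cbind, lapply(data, pad_rows, n = n_max))
result
```

Because both inputs have an Engagement column, the result has two columns with that name; renaming one of them before export may be worth considering.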
Index both data frames, merge by the index, then drop the index column:
library(dplyr) # provides %>% and select()
df1 <- read.csv("Book1.csv", header = TRUE, na.strings = "")
df2 <- read.csv("Book2.csv", header = TRUE, na.strings = "")
# Assign index to the dataframe
rownames(df1) <- 1:nrow(df1)
rownames(df2) <- 1:nrow(df2)
# Merge by index (by = 0 merges on row names), then drop the Row.names column:
merged <- merge(df1, df2, by = 0, all = TRUE) %>%
  select(-1)
merged
Output:
Month Engagement Test Variable Engagement
1 Jan 51 1 Hot 412
2 Feb 123 2 Cold 4124
3 <NA> NA NA Warm 4fd4

Subset a df by the last non-NA value in a column

My dataframe looks like this:
Year aquil_7 aquil_8 aquil_9
2018 NA 201 222
2019 192 145 209
2020 166 121 NA
2021 190 NA NA
I want to subset this dataframe so as to include only those columns where the last non-NA year is less than or equal to 2020. In the example above, this means deleting the aquil_7 column, since its last non-NA year is 2021.
How could I do this?
A simple baseR answer.
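Explanation: apply iterates column-wise (that is what the second argument, 2, means) over every column except the first and checks the condition for each one. Multiplying !is.na(df[-1]) by df$Year gives the year where a value is present and 0 where it is NA, so max(x) is the last non-NA year of the column. The result is combined with a leading T using c() so that the first column (Year) is always kept.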
df <- read.table(text = "Year aquil_7 aquil_8 aquil_9
2018 NA 201 222
2019 192 145 209
2020 166 121 NA
2021 190 NA NA", header = T)
df[c(T, apply((!is.na(df[-1]))*df$Year, 2, function(x){max(x) < 2021}))]
Year aquil_8 aquil_9
1 2018 201 222
2 2019 145 209
3 2020 121 NA
4 2021 NA NA
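The same logic can be spelled out a little more explicitly, which may be easier to adapt. This is a sketch under the assumption that every column has at least one non-NA value (otherwise max() of an empty vector returns -Inf with a warning); last_year and keep are illustrative names.

```r
df <- data.frame(Year = 2018:2021,
                 aquil_7 = c(NA, 192, 166, 190),
                 aquil_8 = c(201, 145, 121, NA),
                 aquil_9 = c(222, 209, NA, NA))

# Last year with a non-NA value, computed per column (Year itself excluded)
last_year <- sapply(df[-1], function(col) max(df$Year[!is.na(col)]))

# Keep only the columns whose last non-NA year is 2020 or earlier
keep <- names(last_year)[last_year <= 2020]
result <- df[c("Year", keep)]
result
```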
Not sure if there's a better way to implement this (but I do hope so). In the meantime, you could e.g. do
library(tidyverse)
cols_to_keep <- df %>%
  pivot_longer(-Year) %>%
  group_by(name) %>%
  # keep a column if its last non-NA year is 2020 or earlier
  summarize(var = max(Year[!is.na(value)]) <= 2020) %>%
  filter(var) %>%
  pull(name)
df %>%
  select(Year, all_of(cols_to_keep))

Impute only certain NA's for a variable in a data frame [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I'm new to R and exploring different beautiful options in it. I'm working on a data frame where I have a variable with 900 missing values, i.e NAs.
I want to impute 3 different values for NAs;
1st 300 NA's with Value 1.
2nd 300 NA's with Value 2.
3rd 300 NA's with Value 3.
There are a total of 23272 rows in the data.
dim(data)
[1] 23272 2
colSums(is.na(data))
month year
884 884
summary(data$month)
1 2 3 4 5 6 7 8 9 10 11 12 NA's
1977 1658 1837 1584 1703 1920 1789 2046 1955 2026 1845 2048 884
Months 8, 10 and 12 have very similar counts, so I thought of assigning these three months to the NAs, split in the ratio 300:300:284. Usually we would impute with the mode, but I want to try this approach.
I assume you mean you have a long list, some of the values of which are NAs:
set.seed(42)
df <- data.frame(val = sample(c(1:3, NA_real_), size = 1000, replace = TRUE))
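We can keep a running tally of the NAs and map each one to an imputed value using integer division with %/%. Note that NA_num %/% 100 + 1 puts NAs 1 to 99 in group 1 and NA number 100 in group 2; use (NA_num - 1) %/% 100 + 1 if each group should hold exactly 100.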
library(tidyverse)
df2 <- df %>%
mutate(NA_num = if_else(is.na(val),
cumsum(is.na(val)),
NA_integer_),
imputed = NA_num %/% 100 + 1)
Output:
df2 %>%
slice(397:410) # based on manual examination using this seed
val NA_num imputed
1 NA 98 1
2 NA 99 1
3 3 NA NA
4 1 NA NA
5 1 NA NA
6 3 NA NA
7 3 NA NA
8 2 NA NA
9 NA 100 2
10 1 NA NA
11 NA 101 2
12 2 NA NA
13 1 NA NA
14 2 NA NA
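Applied to the numbers in the question (884 NAs to be filled with months 8, 10 and 12 in blocks of 300), the running-tally idea might be sketched in base R as follows. The input vector here is made up purely for illustration, and fill_values and block are hypothetical names.

```r
# Made-up stand-in for data$month: 884 NAs interleaved with known values
month <- rep(c(3L, NA), times = 884)

na_idx <- which(is.na(month))      # positions of the NAs, in order
fill_values <- c(8L, 10L, 12L)     # months to assign, one per block

# (k - 1) %/% 300 + 1 maps the k-th NA to block 1, 2 or 3;
# pmin guards against more NAs than 3 * 300
block <- pmin((seq_along(na_idx) - 1) %/% 300 + 1, length(fill_values))

imputed <- month
imputed[na_idx] <- fill_values[block]
table(imputed[na_idx])
```

Subtracting 1 before the integer division is what makes the blocks exactly 300, 300 and 284 long.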
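Without an example, I think this will work.
Basically, filter the rows with NAs into a new table, assign the values there, and bind it back to the rest. Here new_dt stands in for the original data filtered to contain only the NA rows.
library(tidyverse)
new_dt <- data.frame(x1 = 1:900, x2 = NA) %>%
  filter(is.na(x2)) %>%
  # (row_number() - 1) %/% 300 is 0 for rows 1-300, 1 for 301-600, 2 for 601-900
  mutate(x2 = case_when((row_number() - 1) %/% 300 == 0 ~ 1,
                        (row_number() - 1) %/% 300 == 1 ~ 2,
                        (row_number() - 1) %/% 300 == 2 ~ 3))
# dt is assumed to hold the rows of the original data that had no NA
dt <- rbind(dt, new_dt)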

Reshaping data in R with multiple variable levels - "aggregate function missing" warning

I'm trying to use dcast in reshape2 to transform a data frame from long to wide format. The data is hospital visit dates and a list of diagnoses. (Dx.num lists the sequence of diagnoses in a single visit. If the same patient returns, this variable starts over and the primary diagnosis for the new visit starts at 1.) I would like there to be one row per individual (id). The data structure is:
id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1
I'm imagining I would end up with columns like this:
id, date_visit1, date_visit2, visit.id_visit1, visit.id_visit2, bill.num_visit1, bill.num_visit2, dx.code_visit1_dx1, dx.code_visit1_dx2 dx.code_visit2_dx1, FY_visit1_dx1, FY_visit1_dx2, FY_visit2_dx1
Originally, I tried creating a visit_dx column like this one:
**visit.dx**
v1dx1 (visit 1, dx 1)
v2dx1 (visit 2, dx 1)
v1dx1 (...)
v1dx2
v1dx1
And used the following code, omitting "Dx.num" from the DF, as it's accounted for in "visit.dx":
wide <-
dcast(
setDT(long),
id + visit.date + visit.id + bill.num ~ visit.dx,
value.var = c(
"dx.code",
"FY"
)
)
When I run this, I get the warning "Aggregate function missing, defaulting to 'length'" and a new data frame full of 0's and 1's. There are no duplicate rows in the data frame, however. I'm beginning to think I should go about this completely differently.
Any help would be much appreciated.
The data.table package extends dcast to accept multiple value.var columns and provides rowid to number the rows within each id, so...
library(data.table)
dcast(setDT(DF), id ~ rowid(id), value.var=setdiff(names(DF), "id"))
id visit.date_1 visit.date_2 visit.id_1 visit.id_2 bill.num_1 bill.num_2 dx.code_1 dx.code_2 FY_1 FY_2 Dx.num_1 Dx.num_2
1: 1 1/2/12 3/4/12 203 506 1234 4567 409 512 2012 2013 1 1
2: 2 5/6/18 5/6/18 222 222 3452 3452 488 122 2018 2018 1 2
3: 3 2/9/14 <NA> 567 NA 6798 NA 923 NA 2014 NA 1 NA
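A cut-down, self-contained sketch of the same call, using only three of the question's columns, shows what rowid contributes. It assumes the data.table package is installed.

```r
library(data.table)

DF <- data.table(
  id         = c(1, 1, 2, 2, 3),
  visit.date = c("1/2/12", "3/4/12", "5/6/18", "5/6/18", "2/9/14"),
  dx.code    = c(409, 512, 488, 122, 923)
)

# rowid numbers the rows within each id: 1, 2, 1, 2, 1
row_number_within_id <- rowid(DF$id)

# One row per id; each value.var is spread into <name>_1, <name>_2, ...
wide <- dcast(DF, id ~ rowid(id), value.var = c("visit.date", "dx.code"))
wide
```

Because id together with rowid(id) identifies each input row uniquely, dcast has nothing to aggregate, so it does not fall back to length.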

R-Adding rows between rows of a data frame [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 5 years ago.
I have a dataframe that looks like this:
The first two rows show the Albumin lab values 3 and 63 hours after admission of the patient. Between these two rows, I want to add 59 new rows, incrementing HoursFromAdmitLab by one each time, so that I have one row for each hour after admission. For the 59 newly added rows between the first and second row, I want to repeat the first row's values in every column, except that AbnormalityCode and Value should be NA and, as mentioned before, HoursFromAdmitLab should be incremented by 1 in each row.
So I want to have one row for each hour after admission (HoursFromAdmitLab) and for the hours that lab was not taken I want the Value and AbnormalityCode as NA that means there is no value available. The second row of my result data frame should look like this:
I want to repeat this process between the second and third rows, and so on. I tried to do this with a loop, but it takes forever and I know there should be a better way.
One possible way to achieve this is to use two different joins:
1) a join with the columns which should not be filled
2) a rolling join with the columns which should be filled
The data.table package is used for this as the OP has indicated that performance could be crucial for his settings.
library(data.table) # CRAN version 1.10.4
# make sure data is in correct order
setorder(setDT(DT), GUID, Hours)
# create sequence of hours for each case
Hours <- DT[, .(Hours = seq(min(Hours), max(Hours))), by = GUID]
# 1st join with columns which should not be filled
tmp <- DT[, c("GUID", "Hours", "Value", "AbnormCode")][Hours, on = c("GUID", "Hours")]
# 2nd, rolling join with columns which should be filled
result <- DT[, -c("Value", "AbnormCode")][tmp, on = .(GUID, Hours), roll = TRUE]
result
# GUID BirthYearNum GenderCode Hours Value AbnormCode
# 1: 27632200200 1949 Female 3 4.3 N
# 2: 27632200200 1949 Female 4 NA NA
# 3: 27632200200 1949 Female 5 NA NA
# 4: 27632200200 1949 Female 6 NA NA
# 5: 27632200200 1949 Female 7 NA NA
# ---
#273: 27632200200 1949 Female 275 NA NA
#274: 27632200200 1949 Female 276 NA NA
#275: 27632200200 1949 Female 277 NA NA
#276: 27632200200 1949 Female 278 NA NA
#277: 27632200200 1949 Female 279 3.0 L
Note that the approach relies on GUID being the unique key, i.e., it is assumed that a separate sequence has to be created for each GUID.
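To make the two joins easier to follow, here is a much smaller sketch of the same pattern with one patient and two lab times. The table and column names (labs, grid, Gender) are made up for illustration, and it assumes the data.table package is installed.

```r
library(data.table)

labs <- data.table(GUID = "A",
                   Hours = c(3L, 6L),
                   Gender = "Female",
                   Value = c(4.3, 3.8))

# One row per hour between the first and last lab, per GUID
grid <- labs[, .(Hours = seq(min(Hours), max(Hours))), by = GUID]

# Exact join: Value appears only at hours where a lab was taken
measured <- labs[, .(GUID, Hours, Value)][grid, on = .(GUID, Hours)]

# Rolling join: Gender is carried forward from the most recent lab row
filled <- labs[, .(GUID, Hours, Gender)][measured, on = .(GUID, Hours), roll = TRUE]
filled
```

The exact join leaves Value as NA at hours 4 and 5, while roll = TRUE fills Gender on those rows from the row at hour 3.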
Data
As the OP has failed to provide reproducible data the following data are used:
library(data.table)
DT <- data.table(
GUID = "27632200200",
BirthYearNum = 1949L,
GenderCode = "Female",
Hours = c(3, 63, 111, 159, 231, 279),
Value = c(4.3, 3.8, 3.6, 3.3, 3, 3),
AbnormCode = c(rep("N", 3), rep("L", 3))
)
DT
# GUID BirthYearNum GenderCode Hours Value AbnormCode
#1: 27632200200 1949 Female 3 4.3 N
#2: 27632200200 1949 Female 63 3.8 N
#3: 27632200200 1949 Female 111 3.6 N
#4: 27632200200 1949 Female 159 3.3 L
#5: 27632200200 1949 Female 231 3.0 L
#6: 27632200200 1949 Female 279 3.0 L
Note that HoursFromAdmitLab has been abbreviated to Hours and AbnormalityCode to AbnormCode.
