R - Adding rows between rows of a data frame [closed]

I have a dataframe that looks like this:
The first two rows show the Albumin lab values 3 and 63 hours after the patient's admission. Between these two rows, I want to add 59 new rows, incrementing HoursFromAdmitLab by one each time, so that I have one row for each hour after admission. For the 59 newly added rows between the first and second row, I want to repeat the first row's values in every column, except that AbnormalityCode and Value should be NA and, as mentioned, HoursFromAdmitLab should be incremented by 1 in each row.
So I want one row for each hour after admission (HoursFromAdmitLab), and for the hours in which no lab was taken, I want Value and AbnormalityCode to be NA, meaning no value is available. The second row of my result data frame should look like this:
I want to repeat this process between the second and third row, and so on. I tried to do this with a loop, but it takes forever, and I know there should be a better way.

One possible way to achieve this is to use two different joins:
1. a join with the columns which should not be filled, and
2. a rolling join with the columns which should be filled.
The data.table package is used for this, as the OP has indicated that performance could be crucial in their setting.
library(data.table) # CRAN version 1.10.4
# make sure data is in correct order
setorder(setDT(DT), GUID, Hours)
# create sequence of hours for each case
Hours <- DT[, .(Hours = seq(min(Hours), max(Hours))), by = GUID]
# 1st join with columns which should not be filled
tmp <- DT[, c("GUID", "Hours", "Value", "AbnormCode")][Hours, on = c("GUID", "Hours")]
# 2nd, rolling join with columns which should be filled
result <- DT[, -c("Value", "AbnormCode")][tmp, on = .(GUID, Hours), roll = TRUE]
result
# GUID BirthYearNum GenderCode Hours Value AbnormCode
# 1: 27632200200 1949 Female 3 4.3 N
# 2: 27632200200 1949 Female 4 NA NA
# 3: 27632200200 1949 Female 5 NA NA
# 4: 27632200200 1949 Female 6 NA NA
# 5: 27632200200 1949 Female 7 NA NA
# ---
#273: 27632200200 1949 Female 275 NA NA
#274: 27632200200 1949 Female 276 NA NA
#275: 27632200200 1949 Female 277 NA NA
#276: 27632200200 1949 Female 278 NA NA
#277: 27632200200 1949 Female 279 3.0 L
Note that the approach relies on GUID being the unique key, i.e., it is assumed that a separate sequence has to be created for each GUID.
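If in doubt, a quick check along these lines (a one-liner sketch using the same DT) can confirm that there is at most one reading per GUID and hour:
# any rows returned here would indicate duplicate GUID/Hours pairs
DT[, .N, by = .(GUID, Hours)][N > 1L]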
Data
As the OP has not provided reproducible data, the following data are used:
library(data.table)
DT <- data.table(
GUID = "27632200200",
BirthYearNum = 1949L,
GenderCode = "Female",
Hours = c(3, 63, 111, 159, 231, 279),
Value = c(4.3, 3.8, 3.6, 3.3, 3, 3),
AbnormCode = c(rep("N", 3), rep("L", 3))
)
DT
# GUID BirthYearNum GenderCode Hours Value AbnormCode
#1: 27632200200 1949 Female 3 4.3 N
#2: 27632200200 1949 Female 63 3.8 N
#3: 27632200200 1949 Female 111 3.6 N
#4: 27632200200 1949 Female 159 3.3 L
#5: 27632200200 1949 Female 231 3.0 L
#6: 27632200200 1949 Female 279 3.0 L
Note that HoursFromAdmitLab has been abbreviated to Hours and AbnormalityCode to AbnormCode.
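For comparison, a roughly equivalent tidyverse sketch (an alternative to the data.table approach above, assuming the same DT): tidyr::complete() expands each GUID to one row per hour, and tidyr::fill() carries the descriptor columns forward, leaving Value and AbnormCode as NA for the inserted hours.
library(dplyr)
library(tidyr)

result_tidy <- as.data.frame(DT) %>%
  group_by(GUID) %>%
  complete(Hours = full_seq(Hours, 1)) %>%  # one row per hour within each GUID
  fill(BirthYearNum, GenderCode) %>%        # repeat the descriptor columns downward
  ungroup()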

Related

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column <- c("A", "A", "A", "A", "B", "B", "B", "B")
Date_column <- c(20040131, 20041231, 20051231, 20061231, 20051231, 20061231, 20071231, 20081231)
Price_column <- c(12, 13, 17, 19, 35, 38, 39, 41)
Data <- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim is to convert the Date column, which is currently in numeric YYYYMMDD format, into YYYY by simply taking the first four digits of each entry in the Date column as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal would be to employ the plm package for panel data regression, but when applying the package and using pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be given the tag A,2004). To solve this issue, I would like to delete row 1 in the original data and only keep the newer observation from the year 2004. This would then provide me with unique ID/Time pairs across the whole data.
Therefore I was hoping someone could help me out with a loop or a package suggestion with which I can keep only the row with the newer/later observation within a year, where this occurs, also for application to larger data sets. I believe this involves a couple of commands of conditional formatting which I am having difficulty putting together currently. I believe a loop that evaluates whether the first four digits of consecutive date observations are identical and then deletes the one with the "smaller" date/takes the "larger" date would do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference to pick the later observation and mutating a new column for only the year, since you want the latest observation in each year.
library(dplyr)

Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)

Data %>%
  arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
Since we arranged by the actual date in descending order, the rows dropped for each unique combination of ID and year are guaranteed to be the oldest. You can reverse the arrangement to keep the oldest occurrence instead.
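A minimal sketch of that variant, using the same Data as above:
# keep the oldest observation per ID/year instead of the newest
Data %>%
  arrange(Date_column) %>%
  distinct(ID_column, year, .keep_all = TRUE)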

Selecting the first non 0 value in a row [closed]

I have a large data frame of data from across months and I want to select the
first number that is not NA in each row. For instance, ID 895 would correspond to the value in Feb15, 687.
ID Jan15 Feb15 Mar15 Apr15
----- ------- ------- ------- -------
100 NA NA NA 625
113 451 586 NA NA
895 NA 687 313 17
454 NA 977 NA 146
It would be helpful to store them in a variable so I could perform further calculations by month.
apply(tempdat[,32:43],1, function(x) head(which(x>0),1))
This data frame contains thousands of rows, so is it possible to have all the numbers returned for each month stored in their own new variables or in one new data frame by month?
In this case:
AggJan15 = 451
AggFeb15 = 687
AggMar15 = 0
AggApr15 = 625
The two answers below are based on different assumptions about what the question is asking.
1) In this answer we assume you want the first non-NA in each row. First find the index of the first non-NA, one per row, using max.col, giving ix. Then create an output data frame whose first column is ID, whose second is the first non-NA month for that row, and whose third column is the value in that month. The next line NAs out the month for any row that has no non-NA value; it is not needed if you know that every row has at least one non-NA. Note that we convert month/year to class yearmon so that the months sort properly.
library(zoo)
DF1 <- DF[-1]
ix <- max.col(!is.na(DF1), "first")
out <- data.frame(ID = DF$ID,
                  month = as.yearmon(names(DF1)[ix], "%b%y"),
                  value = DF1[cbind(1:nrow(DF1), ix)])
out$month[is.na(out$value)] <- NA
## ID month value
## 1 100 Apr 2015 625
## 2 113 Jan 2015 451
## 3 895 Feb 2015 687
In a comment the poster says they want the sum by month, so in that case we first sum by month, giving ag, and then merge that with all months within the range to fill it out. The third line can be omitted if it is OK to have absent months filled in with NA; otherwise, use it and they will be filled with 0.
ag <- aggregate(value ~ month, out, sum)
m <- merge(ag, seq(min(ag$month), max(ag$month), 1/12), by = 1, all = TRUE)
m$value[is.na(m$value)] <- 0
## month value
## 1 Jan 2015 451
## 2 Feb 2015 687
## 3 Mar 2015 0
## 4 Apr 2015 625
2) Originally I thought you wanted the first non-NA in each column and this answer addresses that.
Assuming DF is as shown reproducibly in the Note at the end, use na.locf, specifying reverse order with fromLast = TRUE, and take the first row.
library(zoo)
Agg <- na.locf(DF[-1], fromLast = TRUE)[1, ]
Agg
## Jan15 Feb15 Mar15 Apr15
## 1 451 586 313 625
Agg$Jan15
## [1] 451
Note
Lines <- "ID Jan15 Feb15 Mar15 Apr15
----- ------- ------- ------- -------
100 NA NA NA 625
113 451 586 NA NA
895 NA 687 313 17 "
DF <- read.table(text = Lines, header = TRUE, comment.char = "-")
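As an aside, if only the first non-NA value per row is needed (without the month label), dplyr::coalesce() can presumably do this in one call; a sketch using the DF from the Note:
library(dplyr)
# coalesce() returns, element-wise, the first non-NA across its arguments,
# so passing the month columns left to right gives the first non-NA per row
do.call(coalesce, DF[-1])
## [1] 625 451 687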

Impute only certain NA's for a variable in a data frame [closed]

I'm new to R and exploring the different beautiful options in it. I'm working on a data frame where I have a variable with 900 missing values, i.e., NAs.
I want to impute 3 different values for NAs;
1st 300 NA's with Value 1.
2nd 300 NA's with Value 2.
3rd 300 NA's with Value 3.
There are a total of 23272 rows in the data.
dim(data)
[1] 23272 2
colSums(is.na(data))
month year
884 884
summary(data$month)
1 2 3 4 5 6 7 8 9 10 11 12 NA's
1977 1658 1837 1584 1703 1920 1789 2046 1955 2026 1845 2048 884
If we check months 8, 10 and 12, there is not much difference between them. Hence I thought of assigning these 3 months to the NAs by splitting at the ratio 300:300:284. Usually we go by the mode, but I want to try this approach.
I assume you mean you have a long vector of values, some of which are NAs:
set.seed(42)
df <- data.frame(val = sample(c(1:3, NA_real_), size = 1000, replace = TRUE))
We can keep a running tally of NA's and assign those to the imputed value using integer division with %/%.
library(tidyverse)
df2 <- df %>%
  mutate(NA_num = if_else(is.na(val),
                          cumsum(is.na(val)),
                          NA_integer_),
         imputed = NA_num %/% 100 + 1)
Output:
df2 %>%
slice(397:410) # based on manual examination using this seed
val NA_num imputed
1 NA 98 1
2 NA 99 1
3 3 NA NA
4 1 NA NA
5 1 NA NA
6 3 NA NA
7 3 NA NA
8 2 NA NA
9 NA 100 2
10 1 NA NA
11 NA 101 2
12 2 NA NA
13 1 NA NA
14 2 NA NA
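At the question's actual scale (900 NAs in three blocks of 300), the divisor would presumably be 300; a sketch of that adaptation, using (NA_num - 1) %/% 300 + 1 so that exact multiples of 300 stay in their own block:
df3 <- df %>%
  mutate(NA_num = if_else(is.na(val), cumsum(is.na(val)), NA_integer_),
         # (NA_num - 1) %/% 300 + 1 maps NAs 1-300 -> 1, 301-600 -> 2, 601-900 -> 3
         imputed = (NA_num - 1L) %/% 300L + 1L)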
Without an example, I think this will work.
Basically, filter the NAs to a new table, do the calculation, and merge it back. Assume new_dt is the original data filtered to contain only the NAs:
library(tidyverse)
new_dt <- data.frame(x1 = 1:900, x2 = NA) %>%
  filter(is.na(x2)) %>%
  # (row_number() - 1) %/% 300 maps rows 1-300 to 0, 301-600 to 1, 601-900 to 2
  mutate(x3 = case_when((row_number() - 1) %/% 300 == 0 ~ 1,
                        (row_number() - 1) %/% 300 == 1 ~ 2,
                        (row_number() - 1) %/% 300 == 2 ~ 3))
dt <- rbind(dt, new_dt)  # dt: the remaining (non-NA) rows of the original data

Reshaping data in R with multiple variable levels - "aggregate function missing" warning

I'm trying to use dcast in reshape2 to transform a data frame from long to wide format. The data is hospital visit dates and a list of diagnoses. (Dx.num lists the sequence of diagnoses in a single visit. If the same patient returns, this variable starts over and the primary diagnosis for the new visit starts at 1.) I would like there to be one row per individual (id). The data structure is:
id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1
I'm imagining I would end up with columns like this:
id, date_visit1, date_visit2, visit.id_visit1, visit.id_visit2, bill.num_visit1, bill.num_visit2, dx.code_visit1_dx1, dx.code_visit1_dx2, dx.code_visit2_dx1, FY_visit1_dx1, FY_visit1_dx2, FY_visit2_dx1
Originally, I tried creating a visit_dx column like this one:
visit.dx
v1dx1 (visit 1, dx 1)
v2dx1 (visit 2, dx 1)
v1dx1 (...)
v1dx2
v1dx1
And used the following code, omitting "Dx.num" from the DF, as it's accounted for in "visit.dx":
wide <- dcast(
  setDT(long),
  id + visit.date + visit.id + bill.num ~ visit.dx,
  value.var = c("dx.code", "FY")
)
When I run this, I get the warning "Aggregate function missing, defaulting to 'length'" and a new data frame full of 0's and 1's. There are no duplicate rows in the data frame, however. I'm beginning to think I should go about this completely differently.
Any help would be much appreciated.
The data.table package extended dcast with rowid() and support for multiple value.var columns, so...
library(data.table)
dcast(setDT(DF), id ~ rowid(id), value.var=setdiff(names(DF), "id"))
id visit.date_1 visit.date_2 visit.id_1 visit.id_2 bill.num_1 bill.num_2 dx.code_1 dx.code_2 FY_1 FY_2 Dx.num_1 Dx.num_2
1: 1 1/2/12 3/4/12 203 506 1234 4567 409 512 2012 2013 1 1
2: 2 5/6/18 5/6/18 222 222 3452 3452 488 122 2018 2018 1 2
3: 3 2/9/14 <NA> 567 NA 6798 NA 923 NA 2014 NA 1 NA
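For reference, rowid(id) simply numbers the rows within each group of identical id values, which is what produces the _1/_2 suffixes above:
rowid(c(1, 1, 2, 2, 3))
## [1] 1 2 1 2 1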

R getting rid of nested for loops

I did quite some searching on how to simplify the code for the problem below but was not successful. I assume that with some kind of apply magic one could speed things up a little, but so far I still have my difficulties with these kinds of functions.
I have a data.frame data, structured as follows:
year iso3c gdpppc elec solid liquid heat
2010 USA 1567 1063 1118 835 616
2015 USA 1571 NA NA NA NA
2020 USA 1579 NA NA NA NA
... USA ... NA NA NA NA
2100 USA 3568 NA NA NA NA
2010 ARG 256 145 91 85 37
2015 ARG 261 NA NA NA NA
2020 ARG 270 NA NA NA NA
... ARG ... NA NA NA NA
2100 ARG 632 NA NA NA NA
As you can see, I have a historical starting value for 2010 and a complete scenario for gdpppc up to 2100. I want to let the values for elec, solid, liquid and heat grow according to some elasticity with respect to the development of gdpppc, but separately for each country (coded in iso3c).
I have the elasticities defined in a separate data.frame parameters:
item value
elec 0.5
liquid 0.2
solid -0.1
heat 0.1
So far I am using a nested for loop:
for (e in 1:length(levels(parameters$item))) {
  for (c in 1:length(levels(data$iso3c))) {
    tmp <- subset(data,
                  select = c("year", "iso3c", "gdpppc", parameters[e, "item"]),
                  subset = (iso3c == levels(data$iso3c)[c]))
    tmp[tmp$year %in% seq(2015, 2100, 5), parameters[e, "item"]] <-
      tmp[tmp$year == 2010, parameters[e, "item"]] *
      cumprod((1 + (tmp[tmp$year %in% seq(2015, 2100, 5), "gdpppc"] /
                      tmp[tmp$year %in% seq(2010, 2095, 5), "gdpppc"] - 1) * parameters[e, "value"]))
    data[data$iso3c == levels(data$iso3c)[c] & data$year %in% seq(2015, 2100, 5),
         parameters[e, "item"]] <- tmp[tmp$year > 2010, parameters[e, "item"]]
  }
}
The outer loop runs over the columns and the inner one over the countries. The inner loop runs for every country (I have 180+ countries). First, a subset containing the data for one single country and the variable of interest is selected. Then I let the respective variable grow with a certain elasticity to growth in gdpppc and finally put the subset back into place in data.
I have already tried to let the outer loop run in parallel using foreach but was not successful in recombining the results. Since I have to run similar calculations quite often, I would be very grateful for any help.
Thanks
Here's one way. Note that I renamed your parameters data.frame to p.
library(data.table)
library(reshape2)
dt <- data.table(data)
dt.melt <- melt(dt, id = 1:3)
dt.melt[, value := as.numeric(value)]  # coerce value column to numeric
dt.melt[, value := head(value, 1) + (gdpppc - head(gdpppc, 1)) * p[p$item == variable, ]$value,
        by = "iso3c,variable"]
result <- dcast(dt.melt,iso3c+year+gdpppc~variable)
result
# iso3c year gdpppc elec solid liquid heat
# 1 ARG 2010 256 145.0 91.0 85.0 37.0
# 2 ARG 2015 261 147.5 90.5 86.0 37.5
# 3 ARG 2020 270 152.0 89.6 87.8 38.4
# 4 ARG 2100 632 333.0 53.4 160.2 74.6
# 5 USA 2010 1567 1063.0 1118.0 835.0 616.0
# 6 USA 2015 1571 1065.0 1117.6 835.8 616.4
# 7 USA 2020 1579 1069.0 1116.8 837.4 617.2
# 8 USA 2100 3568 2063.5 917.9 1235.2 816.1
The basic idea is to use the melt(...) function to reshape your original data into "long" format, where the values in the four columns solid, liquid, elec, and heat are all in one column, value, and the column variable indicates which metric value refers to. Now, using data tables, you can fill in the values easily. Then, reshape the result back into wide format using dcast(...).
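One caveat: the value := line above applies a linear sensitivity to the change in gdpppc, whereas the OP's loop compounds period-over-period growth. If the original cumprod()-style elasticity is required, the grouped expression could presumably be swapped for something along these lines (a sketch, not tested against the OP's full data):
# hypothetical multiplicative variant: compound each period's gdpppc growth,
# scaled by the item's elasticity, starting from the 2010 value;
# match() avoids relying on recycling when looking up the elasticity
dt.melt[, value := head(value, 1) *
          cumprod(c(1, 1 + (gdpppc[-1] / head(gdpppc, -1) - 1) *
                         p$value[match(variable[1], p$item)])),
        by = "iso3c,variable"]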
