how to extract the value from multiple columns in a specific order [duplicate] - r

This question already has answers here:
Get Value of last non-empty column for each row [duplicate]
(3 answers)
Closed 4 years ago.
I have this dataset that contains variables from three previous years.
data <- read.table(text="
a 2015 2016 2017
1 100 100 100
2 1000 5 NA
3 10000 NA NA", header=TRUE)
I would like to create a new column in my data which contains the value from the most recent year that is not NA. The order of preference is 2017 -> 2016 -> 2015.
output <- read.table(text="
a 2015 2016 2017 recent
1 100 100 100 100
2 1000 5 NA 5
3 10000 NA NA 10000", header=TRUE)
I know that I can use "if" command to achieve it, but I am wondering if there is a quick and simple way to do it.
Thanks!

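Here's a simple base R solution. It assumes that the year columns are sorted from left to right, oldest first, so the last non-NA value in each row is the most recent one. (read.table prefixes the numeric column names with X.)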
data$recent <- apply(data, 1, function(x) tail(na.omit(x), 1))
a X2015 X2016 X2017 recent
1 1 100 100 100 100
2 2 1000 5 NA 5
3 3 10000 NA NA 10000
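If the id column a should never be picked up, or the year columns are not guaranteed to be in order, a variant that names the year columns explicitly may be safer. The following is only a sketch under the assumption that the year columns are exactly the ones called X followed by four digits; the helper name year_cols is made up for illustration.

```r
data <- read.table(text = "
a 2015 2016 2017
1 100 100 100
2 1000 5 NA
3 10000 NA NA", header = TRUE)

# Assumption: year columns are named X2015, X2016, ... (read.table adds the X)
year_cols <- grep("^X[0-9]{4}$", names(data), value = TRUE)
year_cols <- sort(year_cols, decreasing = TRUE)  # newest year first

# Start from the newest year and fall back to older years wherever the value is NA
data$recent <- Reduce(
  function(acc, col) ifelse(is.na(acc), data[[col]], acc),
  year_cols[-1],
  data[[year_cols[1]]]
)
data
```

Because the fallback is done column by column, this stays vectorized and does not depend on the left-to-right order of the columns.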

Related

Paste date in new column if condition is true in another R [duplicate]

This question already has an answer here:
Replace value using index [R]
(1 answer)
Closed 2 years ago.
I want to extract the date from a variable if the condition in another variable is true.
Example: if comorbidity1==10, extract the date from smr_01, otherwise NA
I also need to do the same when comorbidity1==11 OR comorbidity1==12: extract the date from smr_01, otherwise NA
This is what I want my data to look like
comorbidity1 smr_01 NewDate
1 20120607 NA
10 20120607 20120607
10 20120613 20120613
3 20121103 NA
6 20150607 NA
12 20140509 NA
11 20120405 NA
I have tried this
fulldata$NewDate<-ifelse(fulldata$comorbidity1==10, fulldata$smr_01, NA)
but it is not pasting the date in the correct format.
what I am getting looks like this
comorbidity1 smr_01 NewDate
1 20120607 NA
10 20120607 4675
10 20120613 17856
3 20121103 NA
6 20150607 NA
12 20140509 NA
11 20120405 NA
smr_01 is classed as a date
Thank you
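Try the following. ifelse() drops the Date class and returns the underlying day counts, so create a Date column of NAs first and then fill it by index: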
df$NewDate <- as.Date(NA)
inds <- df$comorbidity1 == 10
#For more than 1 value use %in%
#inds <- df$comorbidity1 %in% 10:12
df$NewDate[inds] <- df$smr_01[inds]
df
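To make the difference concrete, here is a small self-contained sketch that rebuilds the example data from the question (the values are copied from the table there; the column types are an assumption) and compares ifelse() with indexed assignment for the 11-or-12 case.

```r
# Example data from the question; smr_01 is assumed to be stored as a Date
df <- data.frame(
  comorbidity1 = c(1, 10, 10, 3, 6, 12, 11),
  smr_01 = as.Date(c("20120607", "20120607", "20120613", "20121103",
                     "20150607", "20140509", "20120405"), format = "%Y%m%d")
)

# ifelse() strips attributes, so the result is a plain number, not a Date
class(ifelse(df$comorbidity1 == 10, df$smr_01, NA))

# Indexed assignment into a Date column keeps the class;
# %in% also treats NA in comorbidity1 as "no match" instead of propagating NA
df$NewDate <- as.Date(NA)
inds <- df$comorbidity1 %in% c(11, 12)
df$NewDate[inds] <- df$smr_01[inds]
df
```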

Impute only certain NA's for a variable in a data frame [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I'm new to R and exploring different beautiful options in it. I'm working on a data frame where I have a variable with 900 missing values, i.e NAs.
I want to impute 3 different values for NAs;
1st 300 NA's with Value 1.
2nd 300 NA's with Value 2.
3rd 300 NA's with Value 3.
There are a total of 23272 rows in the data.
dim(data)
[1] 23272 2
colSums(is.na(data))
month year
884 884
summary(data$month)
1 2 3 4 5 6 7 8 9 10 11 12 NA's
1977 1658 1837 1584 1703 1920 1789 2046 1955 2026 1845 2048 884
Months 8, 10 and 12 have very similar counts, so I thought of assigning these three months to the NAs, split in the ratio 300:300:284. Usually we would impute with the mode, but I want to try this approach.
I assume you mean you have a long list, some of the values of which are NAs:
set.seed(42)
df <- data.frame(val = sample(c(1:3, NA_real_), size = 1000, replace = TRUE))
We can keep a running tally of the NAs and assign each one to an imputed value using integer division with %/%. Subtracting 1 before dividing puts NAs 1-100 in the first group, NAs 101-200 in the second, and so on.
library(tidyverse)
df2 <- df %>%
  mutate(NA_num = if_else(is.na(val),
                          cumsum(is.na(val)),
                          NA_integer_),
         imputed = (NA_num - 1) %/% 100 + 1)
Output:
df2 %>%
  slice(397:410) # based on manual examination using this seed
   val NA_num imputed
1   NA     98       1
2   NA     99       1
3    3     NA      NA
4    1     NA      NA
5    1     NA      NA
6    3     NA      NA
7    3     NA      NA
8    2     NA      NA
9   NA    100       1
10   1     NA      NA
11  NA    101       2
12   2     NA      NA
13   1     NA      NA
14   2     NA      NA
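Without an example, I think this will work.
Basically, filter the NAs into a new table, do the calculation there and bind it back. Assume new_dt is the original data filtered to contain only the NA rows, and dt holds the remaining rows.
library(tidyverse)
# Stand-in for the original data filtered down to its NA rows
new_dt <- data.frame(x1 = 1:900, x2 = NA) %>%
  filter(is.na(x2)) %>%
  # (row_number() - 1) %/% 300 is 0 for rows 1-300, 1 for 301-600, 2 for 601-900
  mutate(x2 = case_when((row_number() - 1) %/% 300 == 0 ~ 1,
                        (row_number() - 1) %/% 300 == 1 ~ 2,
                        (row_number() - 1) %/% 300 == 2 ~ 3))
# dt is assumed to hold the rows of the original data that were not NA
dt <- rbind(dt, new_dt)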
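For the split described in the question (884 NAs filled with months 8, 10 and 12 in the ratio 300:300:284), the same integer-division idea can be sketched in base R. The data below is made up purely so the example runs; only the count of 884 NAs and the three fill values come from the question.

```r
# Made-up stand-in for data$month: 1768 values, of which 884 are NA
month <- rep(c(1, NA, 5, NA), times = 442)

na_idx <- which(is.na(month))   # positions of the NAs, in row order
fill_values <- c(8, 10, 12)     # months chosen in the question

# Block number per NA: 1 for the first 300, 2 for the next 300, 3 for the rest
block <- (seq_along(na_idx) - 1) %/% 300 + 1

month_imputed <- month
month_imputed[na_idx] <- fill_values[block]

table(month_imputed[na_idx])
```

Note that this assigns the fill values in row order; if the rows are sorted in some meaningful way, shuffling na_idx with sample() first may be preferable.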

Transpose column and group dataframe [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I'm trying to change a dataframe in R to group multiple rows by a measurement. The table has a location (km), a size (mm), a count of things in that size bin, a site and a year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into the row for that location, site and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km=c(rep(32,3),rep(50,3)), mm=rep(c(2,4,6),2), count=sample(1:25,6), site=rep("A", 6), year=rep(2013, 6))
km mm count site year
1 32 2 18 A 2013
2 32 4 2 A 2013
3 32 6 12 A 2013
4 50 2 3 A 2013
5 50 4 17 A 2013
6 50 6 21 A 2013
To this:
km site year mm_2 mm_4 mm_6
1 32 A 2013 18 2 12
2 50 A 2013 3 17 21
Edit: I tried the solution in a suggested duplicate, but it did not work for me, and I'm not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
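In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), which can produce the same layout. A minimal sketch, using the counts shown in the question's example table instead of sample() so the result is reproducible:

```r
library(tidyr)

# Counts copied from the example table in the question
df <- data.frame(km = c(rep(32, 3), rep(50, 3)),
                 mm = rep(c(2, 4, 6), 2),
                 count = c(18, 2, 12, 3, 17, 21),
                 site = rep("A", 6),
                 year = rep(2013, 6))

# One column per mm value, prefixed with "mm_", filled with the matching count
wide <- pivot_wider(df, names_from = mm, values_from = count,
                    names_prefix = "mm_")
wide
```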
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km=c(rep(32,3),rep(50,3)),
mm=rep(c(2,4,6),2),
count=sample(1:25,6),
site=rep("A", 6),
year=rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
df2 <- reshape(df, v.names="count", idvar="km", timevar="mm", ids="mm", direction="wide")
colnames(df2) <- sub("count.", "mm_", colnames(df2), fixed = TRUE)  # match the literal "count." prefix
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1

R - How to sum a column based on date range? [duplicate]

This question already has an answer here:
R // Sum by based on date range
(1 answer)
Closed 7 years ago.
Suppose I have df1 like this:
Date Var1
01/01/2015 1
01/02/2015 4
....
07/24/2015 1
07/25/2015 6
07/26/2015 23
07/27/2015 15
Q1: Sum of Var1 on previous 3 days of 7/27/2015 (not including 7/27).
Q2: Sum of Var1 on previous 3 days of 7/25/2015 (This is not last row), basically I choose anyday as reference day, and then calculate rolling sum.
As suggested in one of the comments in the link referenced by @SeñorO, with a little bit of work you can use zoo::rollsum:
library(zoo)
set.seed(42)
df <- data.frame(d=seq.POSIXt(as.POSIXct('2015-01-01'), as.POSIXct('2015-02-14'), by='days'),
                 x=sample(20, size=45, replace=TRUE))
k <- 3
df$sum3 <- c(0, cumsum(df$x[1:(k-1)]),
head(zoo::rollsum(df$x, k=k), n=-1))
df
## d x sum3
## 1 2015-01-01 16 0
## 2 2015-01-02 12 16
## 3 2015-01-03 15 28
## 4 2015-01-04 15 43
## 5 2015-01-05 17 42
## 6 2015-01-06 10 47
## 7 2015-01-07 11 42
The c(0, cumsum(...)) part pre-populates the first three rows, which have fewer than three preceding rows and are therefore not covered by rollsum (rollsum(x, k) returns a vector of length length(x)-k+1). The head(..., n=-1) discards the last element, because you said that the nth entry should sum the previous 3 and not its own row.
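If only one or two reference days are needed (Q1 and Q2 in the question), a plain date-window filter in base R may be enough. This is a sketch, not part of the zoo approach above: the helper name sum_prev_days is hypothetical, only the last four rows of the question's table are reproduced, and the window is defined in calendar days rather than rows.

```r
# Last four rows of the question's example; the elided rows are left out
df1 <- data.frame(
  Date = as.Date(c("07/24/2015", "07/25/2015", "07/26/2015", "07/27/2015"),
                 format = "%m/%d/%Y"),
  Var1 = c(1, 6, 23, 15)
)

# Hypothetical helper: sum Var1 over the k calendar days before `ref`,
# excluding `ref` itself
sum_prev_days <- function(data, ref, k = 3) {
  ref <- as.Date(ref)
  in_window <- data$Date >= ref - k & data$Date < ref
  sum(data$Var1[in_window])
}

sum_prev_days(df1, "2015-07-27")  # 1 + 6 + 23 = 30
```

Any other day can be used as the reference in the same way, for example sum_prev_days(df1, "2015-07-25") on the full data.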

How to calculate the exponential in some columns of a dataframe in R?

I have a dataframe:
X Year Dependent.variable.1 Forecast.Dependent.variable.1
1 2009 12.42669703 12.41831191
2 2010 12.39309563 12.40043599
3 2011 12.36596964 12.38256006
4 2012 12.32067284 12.36468414
5 2013 12.303095 12.34680822
6 2014 NA 12.32893229
7 2015 NA 12.31105637
8 2016 NA 12.29318044
9 2017 NA 12.27530452
10 2018 NA 12.25742859
I want to calculate the exponential of the third and fourth columns. How can I do that?
In case your dataframe is called dfs, you can do the following:
dfs[c('Dependent.variable.1','Forecast.Dependent.variable.1')] <- exp(dfs[c('Dependent.variable.1','Forecast.Dependent.variable.1')])
which gives you:
X Year Dependent.variable.1 Forecast.Dependent.variable.1
1 1 2009 249371 247288.7
2 2 2010 241131 242907.5
3 3 2011 234678 238603.9
4 4 2012 224285 234376.5
5 5 2013 220377 230224.0
6 6 2014 NA 226145.1
7 7 2015 NA 222138.5
8 8 2016 NA 218202.9
9 9 2017 NA 214336.9
10 10 2018 NA 210539.5
In case you know the column numbers, this could then also simply be done by using:
dfs[,3:4] <- exp(dfs[,3:4])
which gives you the same result as above. I usually prefer to use the actual column names as the indices might change when the data frame is further processed (e.g. I delete columns, then the indices change).
Or you could do:
dfs$Dependent.variable.1 <- exp(dfs$Dependent.variable.1)
dfs$Forecast.Dependent.variable.1 <- exp(dfs$Forecast.Dependent.variable.1)
In case you want to store these columns in new variables (below they are called exp1 and exp2, respectively), you can do:
exp1 <- exp(dfs$Forecast.Dependent.variable.1)
exp2 <- exp(dfs$Dependent.variable.1)
In case you want to apply it to more than two columns and/or use more complicated functions, I highly recommend looking at apply/lapply.
Does that answer your question?
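As a rough illustration of the lapply route mentioned above, the following sketch applies exp to a vector of column names. Only the first three rows of the question's data are reproduced, and the name cols is made up for the example.

```r
# First three rows of the data from the question
dfs <- data.frame(
  Year = 2009:2011,
  Dependent.variable.1 = c(12.42669703, 12.39309563, 12.36596964),
  Forecast.Dependent.variable.1 = c(12.41831191, 12.40043599, 12.38256006)
)

# Columns to transform; extend this vector to cover more columns
cols <- c("Dependent.variable.1", "Forecast.Dependent.variable.1")

# lapply returns a list of transformed columns, which is assigned back in place
dfs[cols] <- lapply(dfs[cols], exp)
dfs
```

Swapping exp for any other function of a single vector, such as function(x) round(exp(x), 1), works the same way.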
