I am new to R. I am hoping to replace the missing values for X in my data. How can I replace the missing values of "X" when "Time" is 1 or 2 with the value of "X" when "Time" is 3, for the same "SubID" and the same "Day"?
SubID: subject number
Day: each subject's day number (1,2,3...21)
Time: morning marked as 1, afternoon marked as 2, and evening marked as 3
X: only has a valid value when Time is 3; otherwise it is missing.
SubID Day Time X
1 1 1 NA
1 1 2 NA
1 1 3 7.4
1 2 1 NA
1 2 3 6.2
2 1 1 NA
2 1 2 NA
2 1 3 7.1
2 2 3 5.9
2 2 2 NA
2 2 1 NA
I was able to get as far as the following code using zoo. I have very limited experience in R. Thank you in advance!
data2 <- transform(data1,
                   x = na.aggregate(x, by = SubID, FUN = sum, na.rm = TRUE))
Here's the explanation of my comment:
library(data.table)
library(zoo)
setDT(data1)
data1[order(-Time),
      Xf := na.locf(X),
      by = .(SubID, Day)]
OK, so the setDT function makes the data1 object a data.table. Then order(-Time) orders data1 by Time in descending order (because of the -). Xf := na.locf(X) creates a new column Xf by reference (which means you don't have to assign the result back to data1) using na.locf(X), a function from the zoo package that fills NAs forward with the last observed value; because the rows are in descending Time order, this fills Times 2 and 1 with the value at Time 3. The last line specifies that we want to do this grouped by SubID and Day.
Hope it's clearer now, feel free to ask if you have further doubts.
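As an alternative to zoo's na.locf(), tidyr's fill() does the same kind of NA-filling; here's a minimal sketch, assuming your real data look like the example above (the small data1 tibble below is made up for illustration). With rows sorted so that Time 3 comes last within each group, filling upward copies the Time 3 value into the Time 1 and 2 rows:

library(dplyr)
library(tidyr)

# Hypothetical data mirroring the example in the question
data1 <- tibble(
  SubID = c(1, 1, 1, 2, 2, 2),
  Day   = c(1, 1, 1, 1, 1, 1),
  Time  = c(1, 2, 3, 1, 2, 3),
  X     = c(NA, NA, 7.4, NA, NA, 7.1)
)

data1 %>%
  arrange(SubID, Day, Time) %>%   # ensure the Time = 3 row is last in each group
  group_by(SubID, Day) %>%
  fill(X, .direction = "up") %>%  # fill NAs upward from the Time = 3 value
  ungroup()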
You can sort the data by descending time and then use X[1].
library(dplyr)

df <- tibble(SubID = 1, Day = 1, Time = c(1, 2, 3), X = c(NA, NA, 2.2))

df <- df %>%
  group_by(SubID, Day) %>%
  arrange(desc(Time)) %>%   # the Time = 3 row now comes first within each group
  mutate(
    X = case_when(
      is.na(X) ~ X[1],      # X[1] is the Time = 3 value for the group
      TRUE ~ X
    )
  )
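If you would rather not depend on row order at all, here is a sketch that indexes X by Time directly within each group (applied to the original df, before the pipeline above):

df %>%
  group_by(SubID, Day) %>%
  # coalesce() keeps X where it is present and otherwise uses the
  # Time = 3 value; X[Time == 3][1] is NA if a group has no Time = 3 row
  mutate(X = coalesce(X, X[Time == 3][1])) %>%
  ungroup()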
Hello everyone!
As a beginner with R (I think my request is feasible in this software), I would like to ask you a question.
In a large Excel-type file, I have a column where the values I am interested in appear only every 193 rows. I would like the preceding 192 rows to take the value of that 193rd row, and so on for each block of 193 rows until the end of the column.
Concretely, here is what I would like to get for this little example:
Month Fund_number Cluster_ref_INPUT Expected_output
1 1 1 1
2 1 1 1
3 1 3 1
4 1 1 1
1 3 2 NA
2 3 NA NA
3 3 NA NA
4 3 NA NA
1 8 4 5
2 8 5 5
3 8 5 5
4 8 5 5
The column "Cluster_ref_INPUT" is partitioned according to the column "Fund_number" (one observation for each fund every month for 4 months). The values that interest me in the INPUT column appear every 4 observations (the value in the 4th month).
Thus, we can see that for each fund number, we find in the column "Expected_output" the values corresponding to the value found in the last line of the column "Cluster_ref_INPUT". (every 4 lines). I think we should partition by "Fund_number" and put that all the lines are equal to the last one... something like that?
Do you have any idea what code I should use to make this work?
I hope that's clear enough. Do not hesitate if I need to clarify.
Thank you very much in advance,
Vanie
Here's a one-line solution using data.table:
library(data.table)
exdata <- fread(text = "
Month Fund_number Cluster_ref_INPUT Expected_output
1 1 1 1
2 1 1 1
3 1 3 1
4 1 1 1
1 2 2 NA
2 2 NA NA
3 2 NA NA
4 2 NA NA
1 3 4 5
2 3 5 5
3 3 5 5
4 3 5 5")
# You can read your data directly as a data.table using fread(),
# or convert an existing data.frame with setDT(exdata)
exdata[, newvar := Cluster_ref_INPUT[.N], by = Fund_number]
# .N is the number of rows in each group, so Cluster_ref_INPUT[.N]
# picks the last value of the column within each Fund_number
> exdata
Month Fund_number Cluster_ref_INPUT Expected_output newvar
1: 1 1 1 1 1
2: 2 1 1 1 1
3: 3 1 3 1 1
4: 4 1 1 1 1
5: 1 2 2 NA NA
6: 2 2 NA NA NA
7: 3 2 NA NA NA
8: 4 2 NA NA NA
9: 1 3 4 5 5
10: 2 3 5 5 5
11: 3 3 5 5 5
12: 4 3 5 5 5
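For comparison, here is a base R sketch of the same idea (assuming the exdata object above): ave() broadcasts the last value within each Fund_number group.

exdata$newvar2 <- ave(exdata$Cluster_ref_INPUT, exdata$Fund_number,
                      FUN = function(v) v[length(v)])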
There are probably solutions using tidyverse that'll be a lot faster, but here's a solution in base R.
#Your data
df <- data.frame(Month = rep_len(c(1:4), 12),
                 Fund_number = rep(c(1:3), each = 4),
                 Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5),
                 stringsAsFactors = FALSE)

#Create an empty data frame in which the results will be stored
outdat <- data.frame(Month = c(), Fund_number = c(), Cluster_ref_INPUT = c(),
                     expected_input = c(), stringsAsFactors = FALSE)

#Using a for loop
#Iterate through the list of unique Fund_number values
for(i in 1:length(unique(df$Fund_number))){
  #Subset data pertaining to each unique Fund_number
  curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
  #Take the value of Cluster_ref_INPUT from the last row
  #and set it as the value of the expected_input column for all rows
  curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
  #Append this modified subset to the output container data frame
  outdat <- rbind(outdat, curdat)
}

#Remove non-essential looping variables
rm(curdat, i)
outdat
# Month Fund_number Cluster_ref_INPUT expected_input
# 1 1 1 1 1
# 2 2 1 1 1
# 3 3 1 3 1
# 4 4 1 1 1
# 5 1 2 2 NA
# 6 2 2 NA NA
# 7 3 2 NA NA
# 8 4 2 NA NA
# 9 1 3 4 5
# 10 2 3 5 5
# 11 3 3 5 5
# 12 4 3 5 5
EDIT: additional solutions + benchmarking
Per OP's comment on this answer, I've presented some faster solutions (dplyr and the data.table solution from the other answer) and also benchmarked them on a 950,004 row simulated dataset similar to the one in OP's example. Code and results below; the entire code-block can be copy-pasted and run directly as long as the necessary libraries (microbenchmark, dplyr, data.table) and their dependencies are installed. (If someone knows a solution based on apply() they're welcome to add it here.)
rm(list = ls())

#Library for benchmarking
library(microbenchmark)
#dplyr
library(dplyr)
#data.table
library(data.table)

#Your data
df <- data.frame(Month = rep_len(c(1:12), 79167),
                 Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
                 Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
                 stringsAsFactors = FALSE)

#Data in data.table format
df_t <- data.table(Month = rep_len(c(1:12), 79167),
                   Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
                   Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
                   stringsAsFactors = FALSE)
#----------------
#Base R solution using a for loop
base_r_func <- function(df) {
  #Create an empty data frame in which the results will be stored
  outdat <- data.frame(Month = c(),
                       Fund_number = c(),
                       Cluster_ref_INPUT = c(),
                       expected_input = c(),
                       stringsAsFactors = FALSE)
  #Iterate through the list of unique Fund_number values
  for(i in 1:length(unique(df$Fund_number))){
    #Subset data pertaining to each unique Fund_number
    curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
    #Take the value of Cluster_ref_INPUT from the last row
    #and set it as the value of the expected_input column for all rows
    curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
    #Append this modified subset to the output container data frame
    outdat <- rbind(outdat, curdat)
  }
  #Remove non-essential looping variables
  rm(curdat, i)
  #This return is needed because the code is wrapped in
  #the base_r_func function (not necessary otherwise)
  return(outdat)
}
#----------------
#Tidyverse solution
dplyr_func <- function(df){
  df %>% #For actual use, replace this %>% with %<>%
         #and it will write the output back to the input object
    #Group the data by Fund_number
    group_by(Fund_number) %>%
    #Create a new column populated w/ the last value of Cluster_ref_INPUT
    mutate(expected_input = last(Cluster_ref_INPUT))
}
#----------------
#Data table solution
dt_func <- function(df_t){
  #For this function we use df_t (created above),
  #which is already a data.table
  #Logic similar to the dplyr solution
  df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}

dt_func_conv <- function(df){
  #Converting the data.frame to the data.table format
  df_t <- data.table(df)
  #Logic similar to the dplyr solution
  df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
#----------------
#Benchmarks
bm_vals <- microbenchmark(base_r_func(df),
                          dplyr_func(df),
                          dt_func(df_t),
                          dt_func_conv(df), times = 8)
bm_vals
# Unit: milliseconds
# expr min lq mean median uq max neval
# base_r_func(df) 618.58202 702.30019 721.90643 743.02018 754.87397 756.28077 8
# dplyr_func(df) 119.18264 123.26038 128.04438 125.64418 133.37712 140.60905 8
# dt_func(df_t) 38.06384 38.27545 40.94850 38.88269 43.58225 48.04335 8
# dt_func_conv(df) 48.87009 51.13212 69.62772 54.36058 57.68829 181.78970 8
#----------------
As can be seen, using data.table would be the way to go if speed is a necessity. data.table is faster than dplyr and base R even when the overhead of converting a regular data.frame to a data.table is considered (see results of dt_func_conv()).
Edit: following up on Carlos Eduardo Lagosta's comments, using setDT() to coerce df from a data.frame to a data.table makes the overhead of said coercion close to nil. Code snippet and benchmark values below.
#This version includes the time taken
#to coerce a data.frame to a data.table
dt_func_conv <- function(df){
  #Logic similar to the dplyr solution
  #setDT() coerces a data.frame to the data.table format by reference
  setDT(df)[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
bm_vals
# Unit: milliseconds
# expr min lq mean median uq max neval
# base_r_func(df) 271.60196 344.47280 353.76204 348.53663 368.65696 435.16163 8
# dplyr_func(df) 121.31239 122.67096 138.54481 128.78134 138.72509 206.69133 8
# dt_func(df_t) 38.21601 38.57787 40.79427 39.53428 43.14732 45.61921 8
# dt_func_conv(df) 41.11210 43.28519 46.72589 46.74063 50.16052 52.32235 8
For the OP specifically: whatever solution you wish to use, the code you're looking for is within the body of the corresponding function. So, for instance, if you want to use the dplyr solution, you would need to take this code and tailor it to your data objects:
df %>% #For actual use, replace this %>% with %<>%
       #and it will write the output back to the input object
  #Group the data by Fund_number
  group_by(Fund_number) %>%
  #Create a new column populated w/ the last value of Cluster_ref_INPUT
  mutate(expected_input = last(Cluster_ref_INPUT))
I want to remove all rows in a data frame where the Month and Mo columns are more than 1 month apart. I have heard you can do this with all.equal(df$Month, df$Mo, 1), but it just returns a string. Is this possible in R?
Row Month Mo
1   1     1
2   2     4 #<- Remove
According to the ?all.equal documentation, the return value of all.equal is
Either TRUE (NULL for attr.all.equal) or a vector of mode "character" describing the differences between target and current.
So no, you can't do it with all.equal, as it returns a single value. You can see more details in the docs about what the function does.
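A quick illustration of that return behavior (a sketch; the exact message text may vary slightly across R versions):

all.equal(c(1, 2), c(1, 2))
# [1] TRUE
all.equal(c(1, 2), c(1, 4))
# [1] "Mean relative difference: 0.6666667"

Either way, the result is a single value for the whole vector, not one per row, so it can't be used to index rows.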
To do what you want, you can use plain R:
d <- data.frame(Row = 1:2, Month = 1:2, Mo = c(1,4)) # your data.frame
# Row Month Mo
# 1 1 1 1
# 2 2 2 4
d[!(abs(d$Month - d$Mo) > 1),] # d without rows where Month and Mo are far apart.
# Row Month Mo
# 1 1 1 1
or equivalently
d[abs(d$Month - d$Mo) <= 1,]
You can also do something with dplyr:
library(dplyr)
df %>%
  filter(Month == Mo | Month == Mo + 1 | Month == Mo - 1)
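Equivalently, since the condition is just that the two columns differ by at most one month, the filter can be written with abs():

df %>%
  filter(abs(Month - Mo) <= 1)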
I have a dataset with two values for each date like this:
date x y
1 2013-05-01 1 2
2 2013-05-02 2 2
3 2013-05-03 3 2
date is of class Date (created using the package lubridate).
Now I want to have the mean of the two values, except for a certain time span, in which I want to use the values of x.
I tried the following:
mean=(x+y)/2
newdata=ifelse((data$date < 2013-10-01 | date$date > 2014-04-09), mean, x)
but it will just take the mean for all dates.
Is it possible to use greater/lesser than relationships for dates?
Any suggestions on how to make this work?
Thanks in advance
It looks like you are not casting the comparison values as dates. Also, the dates you used for comparison don't exclude any of the dates in the data frame you provided, so I'd expect the mean to be selected every time.
date <- as.Date(c('2013-05-01', '2013-05-02', '2013-05-03'))
x <- c(1, 2, 3)
y <- c(2, 2, 2)
mean <- (x + y)/2
df <- data.frame(date = date, x = x, y = y)
newdata <- ifelse((df$date < as.Date('2013-05-02') | df$date > as.Date('2014-04-09')), mean, x)
newdata
I changed the dates in the condition to be more selective and I got 1.5 2.0 3.0. It selects the first value from mean and the others from x which agrees with the condition I used in the ifelse().
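With the OP's original time span, the same fix would look like this (a sketch; only the comparison values change):

newdata <- ifelse(df$date < as.Date('2013-10-01') | df$date > as.Date('2014-04-09'),
                  mean, x)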
How about something like this:
library(lubridate)
library(data.table)
##
set.seed(123)
Data <- data.frame(
  date = as.Date(ymd(20130904)) + 0:364,
  x = as.numeric(sample(1:3, 365, replace = TRUE)),
  y = as.numeric(sample(1:3, 365, replace = TRUE)))
setDT(Data)
##
xSpan <- seq.Date(
  from = as.Date("2013-10-01"),
  to = as.Date("2014-04-09"),
  by = "day")
##
Edited - forgot to group by date
Data[, z := ifelse(
  date %in% xSpan,
  x,
  mean(c(x, y))),
  by = date]
##
> head(Data)
date x y z
1: 2013-09-04 1 3 2.0
2: 2013-09-05 3 1 2.0
3: 2013-09-06 2 1 1.5
4: 2013-09-07 3 2 2.5
5: 2013-09-08 3 2 2.5
6: 2013-09-09 1 2 1.5
> head(subset(Data, date %in% xSpan))
date x y z
1: 2013-10-01 2 3 2
2: 2013-10-02 1 3 1
3: 2013-10-03 1 1 1
4: 2013-10-04 3 1 3
5: 2013-10-05 3 1 3
6: 2013-10-06 3 1 3
I just defined xSpan as a contiguous sequence of days over which x should be used directly (in your example, just the identity function of x). Dates not included in this time span use mean to determine their value of z.
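As a variation (a sketch, assuming the same Data object), data.table's between() avoids building the day-by-day sequence at all, and because (x + y) / 2 is already vectorized there is no need to group by date (fifelse() requires data.table 1.12.4 or later):

Data[, z2 := fifelse(
  between(date, as.Date("2013-10-01"), as.Date("2014-04-09")),
  x,
  (x + y) / 2)]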
Below is a subset of my data:
> head(dt)
name start end
1: 1 3195984 3197398
2: 1 3203519 3205713
3: 2 3204562 3207049
4: 2 3411782 3411982
5: 2 3660632 3661579
6: 3 3638391 3640590
dt <- data.frame(name = c(1, 1, 2, 2, 2, 3),
                 start = c(3195984, 3203519, 3204562, 3411782, 3660632, 3638391),
                 end = c(3197398, 3205713, 3207049, 3411982, 3661579, 3640590))
I want to calculate another value: the difference between the end coordinate of row n and the start coordinate of row n+1, but only if both rows share a name. To elaborate, this is what I want the resulting data frame to look like:
name start end dist
1: 1 3195984 3197398
2: 1 3203519 3205713 -6121
3: 2 3204562 3207049
4: 2 3411782 3411982 -204733
5: 2 3660632 3661579 -248650
6: 3 3638391 3640590
The reason I want to do this is that I'm looking for dist values that are positive. One way I've tried this is to offset the start and end coordinates but then I run into a problem where I am comparing things with different names.
How does one do this in R?
A data.table solution may be good here:
library(data.table)
dt <- as.data.table(dt)
dt[, dist := c(NA, end[-(length(end))] - start[-1]) , by=name]
dt
# name start end dist
#1: 1 3195984 3197398 NA
#2: 1 3203519 3205713 -6121
#3: 2 3204562 3207049 NA
#4: 2 3411782 3411982 -204733
#5: 2 3660632 3661579 -248650
#6: 3 3638391 3640590 NA
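A more idiomatic variant of the same calculation (a sketch) uses data.table's shift(), which returns the previous value of a column with a leading NA, so no manual index juggling is needed:

dt[, dist := shift(end) - start, by = name]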
Assuming your data is sorted, you can also do it with base R functions:
dt$dist <- unlist(
  by(dt, dt$name, function(x) c(NA, x$end[-(length(x$end))] - x$start[-1]))
)
Using dplyr (with credit to @thelatemail for the calculation of dist; the old %.% chaining operator has been replaced with the current %>%):
library(dplyr)
dat.new <- dt %>%
  group_by(name) %>%
  mutate(dist = c(NA, end[-(length(end))] - start[-1]))
Here is a different dplyr solution:
dt %>% group_by(name) %>% mutate(dist = lag(end) - start)
giving:
Source: local data frame [6 x 4]
Groups: name
name start end dist
1 1 3195984 3197398 NA
2 1 3203519 3205713 -6121
3 2 3204562 3207049 NA
4 2 3411782 3411982 -204733
5 2 3660632 3661579 -248650
6 3 3638391 3640590 NA
I have a rather small dataset of 3 columns (id, date and distance) in which some dates may be duplicated (otherwise unique) because there is a second distance value associated with that date.
For those duplicated dates, how do I average the distances then replace the original distance with the averages?
Let's use this dataset as the model:
z <- data.frame(id=c(1,1,2,2,3,4),var=c(2,4,1,3,5,2))
# id var
# 1 2
# 1 4
# 2 1
# 2 3
# 3 5
# 4 2
The mean of id #1 is 3 and of id #2 is 2, which would then replace each of the original var values.
I've checked multiple questions to address this and have found related discussions. As a result, here is what I have so far:
# Check if any dates have two estimates (duplicate Epochs)
length(unique(Rdataset$Epoch)) == nrow(Rdataset)
# if 'TRUE' then each day has a unique data point (no duplicate Epochs)
# if 'FALSE' then duplicate Epochs exist, and the distances must be
# averaged for each duplicate Epoch
Rdataset$Distance <- ave(Rdataset$Distance, Rdataset$Epoch, FUN=mean)
Rdataset <- unique(Rdataset)
Then, with the distances for duplicate dates averaged and replaced, I wish to perform other functions on the entire dataset.
Here's a solution that doesn't bother to actually check whether the ids are duplicated; you don't need to, since for a non-duplicated id the mean of the single var value is just that value:
# (This line finds the duplicated ids, but the solution
# below does not actually need it.)
duplicated_ids = unique(z$id[duplicated(z$id)])
library(plyr)
z_deduped = ddply(
  z,
  .(id),
  function(df_section) {
    data.frame(id = df_section$id[1], var = mean(df_section$var))
  }
)
Output:
> z_deduped
id var
1 1 3
2 2 2
3 3 5
4 4 2
Unless I misunderstand:
library(plyr)
ddply(z, .(id), summarise, var2 = mean(var))
# id var2
# 1 1 3
# 2 2 2
# 3 3 5
# 4 4 2
Here is another answer in data.table style:
library(data.table)
z <- data.table(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
z[, mean(var), by = id]
id V1
1: 1 3
2: 2 2
3: 3 5
4: 4 2
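If you want the result column to keep the name var instead of the default V1, a small tweak does it (a sketch):

z[, .(var = mean(var)), by = id]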
There is no need to treat unique values differently from duplicated ones, as the mean of a single value is the value itself.
zt <- aggregate(var ~ id, data = z, mean)
zt
id var
1 1 3
2 2 2
3 3 5
4 4 2
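For completeness, here is a sketch of the same aggregation in dplyr; like the plyr and aggregate versions, summarise() collapses each id to a single row:

library(dplyr)
z %>%
  group_by(id) %>%
  summarise(var = mean(var))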