I'm trying to write a function that takes a data.frame and returns a new data.frame with the same column names, after doing the following:
1- Creates a seq(min(target), max(target), .1).
2- Takes the mean of all other variables.
For example, if q is our data.frame, and jen is the target in it, I want to reformat q such that jen's data becomes seq(min(jen), max(jen), .1), and both bob and joe just change to their mean values.
Is it possible to do this in R?
I tried something but it is far from being accurate.
q = data.frame(bob = 1:5 - 3, jen = c(1.7, 2.6, 2.5, 4.4, 3.8) - 3, joe = 5:9)
change <- function(dataframe = q, target = "jen"){
n <- names(dataframe)
dataframe[target] <- seq(from = min(target), max(target), .1)
}
A base R solution. My idea is to create the target column first in the function, and then use a for-loop to add the mean of other columns.
# Example data frame
q <- data.frame(bob = 1:5 - 3, jen = c(1.7, 2.6, 2.5, 4.4, 3.8) - 3, joe = 5:9)
# Create the function
change <- function(dat, target){
vec <- dat[, target]
target_new <- seq(min(vec), max(vec), by = 0.1)
dat2 <- data.frame(target_new)
names(dat2) <- target
for (i in names(dat)[!names(dat) %in% target]){
dat2[[i]] <- mean(dat[[i]])
}
dat2 <- dat2[, names(dat)]
return(dat2)
}
# Apply the function
change(q, "jen")
# bob jen joe
# 1 0 -1.3 7
# 2 0 -1.2 7
# 3 0 -1.1 7
# 4 0 -1.0 7
# 5 0 -0.9 7
# 6 0 -0.8 7
# 7 0 -0.7 7
# 8 0 -0.6 7
# 9 0 -0.5 7
# 10 0 -0.4 7
# 11 0 -0.3 7
# 12 0 -0.2 7
# 13 0 -0.1 7
# 14 0 0.0 7
# 15 0 0.1 7
# 16 0 0.2 7
# 17 0 0.3 7
# 18 0 0.4 7
# 19 0 0.5 7
# 20 0 0.6 7
# 21 0 0.7 7
# 22 0 0.8 7
# 23 0 0.9 7
# 24 0 1.0 7
# 25 0 1.1 7
# 26 0 1.2 7
# 27 0 1.3 7
# 28 0 1.4 7
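The same idea can also be written without the explicit loop. The following is only a sketch of one way to do it, not a replacement for the answer above: `change2` is a hypothetical name, and it assumes every column is numeric and the step stays at 0.1.

```r
# Example data frame from the question
q <- data.frame(bob = 1:5 - 3, jen = c(1.7, 2.6, 2.5, 4.4, 3.8) - 3, joe = 5:9)

# Hypothetical helper: replace `target` by a 0.1-step sequence over its range,
# and every other column by its mean, repeated to the same length
change2 <- function(dat, target) {
  target_new <- seq(min(dat[[target]]), max(dat[[target]]), by = 0.1)
  # lapply over a data.frame walks its columns and keeps their names and order
  out <- lapply(dat, function(col) rep(mean(col), length(target_new)))
  out[[target]] <- target_new
  as.data.frame(out)
}

res <- change2(q, "jen")
head(res, 3)
```

Because `lapply` keeps the column order, no reordering step is needed at the end.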
Here is one option with base R
data.frame(Map(function(x, y) if(x=="mean") get(x)(y) else
get(x)(min(y), max(y), by = 0.1), setNames(c("mean", "seq", "mean"), names(q)), q))
Or with dplyr
library(dplyr)
library(tidyr) # provides unnest() and complete()
q %>%
summarise(bob = mean(bob),
jen = list(seq(min(jen), max(jen), by = 0.1)),
joe = mean(joe)) %>%
unnest(jen)
Or if there are many columns to get the mean and only a single column sequence, then instead of specifying one by one
q %>%
mutate_at(c(1,3), mean) %>%
group_by(bob, joe) %>%
summarise(jen = list(seq(min(jen), max(jen), by = 0.1))) %>%
unnest(jen)
Or use complete
q %>%
group_by(bob = mean(bob), joe = mean(joe)) %>%
complete(jen = seq(min(jen), max(jen), by = .1))
My solution uses the colMeans function and repeats the result as many times as the sequence is long. Then I replace the target column with the sequence.
q = data.frame(bob = 1:5 - 3, jen = c(1.7, 2.6, 2.5, 4.4, 3.8) - 3, joe = 5:9)
manip <- function(target, df){
t.column <- which(colnames(df) == target)
dfmeans <- colMeans(df)
minmax <- range(df[,t.column],na.rm = T)
t.seq <- seq(minmax[1],minmax[2],.1)
newdf <- matrix(dfmeans, ncol = length(dfmeans))[rep(1, length(t.seq)),]
newdf[,t.column] <- t.seq
colnames(newdf) <- colnames(df)
return(as.data.frame(newdf))
}
manip("jen",q)
Related
I would like to create a function in R that rounds numeric dataframes (or columns in a dataframe) depending on the number. If the number is less than 1, round to 1 decimal, but if it is greater than 1, round to 0 decimals.
This is what I have
data <- data.frame(x = c(1.111, 0.809, 5.55555, 0.567), y = c(0.235, 0.777, 4.55555555, 393.55))
round0 <- function(x) format(round(x, digits=0), nsmall = 0, trim = TRUE)
round0(data)
x y
1 1 0
2 1 1
3 6 5
4 1 394
# What I want
x y
1 1 0.2
2 0.8 1
3 6 5
4 1 394
> round0 <- function(x) ifelse(x<1,round(x,1),round(x))
> sapply(data,round0)
x y
[1,] 1.0 0.2
[2,] 0.8 0.8
[3,] 6.0 5.0
[4,] 0.6 394.0
You can use:
round0 <- function(x) ifelse(x < 1, format(round(x, 1), nsmall = 1), round(x))
data[] <- lapply(data, round0)
data
# x y
#1 1 0.2
#2 0.8 0.8
#3 6 5
#4 0.6 394
Note that this is for display purposes only: the resulting columns are of class character. To do any arithmetic on them, you need to convert them back to numeric first.
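To illustrate the point about column classes, here is a small sketch that keeps the original data untouched, formats a copy for display, and converts that copy back to numeric. The names `shown` and `back` are made up for this example.

```r
data <- data.frame(x = c(1.111, 0.809, 5.55555, 0.567),
                   y = c(0.235, 0.777, 4.55555555, 393.55))

round0 <- function(x) ifelse(x < 1, format(round(x, 1), nsmall = 1), round(x))

# Format a copy for display; every column becomes character
shown <- data
shown[] <- lapply(data, round0)
sapply(shown, class)

# Convert back before doing arithmetic (as.numeric ignores padding spaces)
back <- shown
back[] <- lapply(shown, as.numeric)
colSums(back)
```

Keeping the unrounded `data` around and only formatting a copy avoids losing precision in later calculations.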
I would like to create a column that sums the adjacent value and 80% of the previous value from another column. So, if column x is 1, 2, 3...10, I want column z to be 1, 2.8, 5.24, 8.192, etc.
Yet, here is my failed attempt:
x <- c(1:10)
y <- c("")
df <- data.frame(x,y)
df1 <- df %>%
mutate(y = cumsum(x*0.8))
Result:
x y
1 1 0.8
2 2 2.4
3 3 4.8
4 4 8.0
5 5 12.0
6 6 16.8
7 7 22.4
8 8 28.8
9 9 36.0
10 10 44.0
I would use a for loop to do this. It's important to initialize a vector first, especially if you're working with a large data set.
# initialize
newx <- vector("numeric", length(df$x))
newx[1] <- df$x[1]
for(i in 2:length(df$x)){
newx[i] <- df$x[i] + (0.8 * newx[i-1])
}
newx
# [1] 1.00000 2.80000 5.24000 8.19200 11.55360 15.24288 19.19430 23.35544 27.68435 32.14748
With the addition of purrr, you can do:
df %>%
mutate(y = accumulate(x, ~ .x * 0.8 + .y))
x y
1 1 1.00000
2 2 2.80000
3 3 5.24000
4 4 8.19200
5 5 11.55360
6 6 15.24288
7 7 19.19430
8 8 23.35544
9 9 27.68435
10 10 32.14748
Try using the Reduce function:
Reduce(function(last, current) current + last * .8, x = x, accumulate = T)
# [1] 1.00000 2.80000 5.24000 8.19200 11.55360 15.24288 19.19430 23.35544 27.68435 32.14748
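The recurrence used in these answers, z[i] = x[i] + 0.8 * z[i-1], is a first-order recursive filter, so base R's stats::filter should be able to compute it as well. This is a sketch for comparison only; the stats:: prefix is there because dplyr masks filter when it is loaded.

```r
x <- 1:10

# Recursive filter: z[i] = x[i] + 0.8 * z[i - 1], starting from z[0] = 0
z <- as.numeric(stats::filter(x, filter = 0.8, method = "recursive"))

# Same recurrence via Reduce, for comparison
z_reduce <- Reduce(function(last, current) current + last * 0.8, x, accumulate = TRUE)

all.equal(z, z_reduce)
```

The result can be assigned to a data frame column in the usual way, for example df$y <- z.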
I have a dataframe here:
df <- data.frame("Time" = 1:10, "Value" = c(1.7,NA,-999,-999,1.5,1.6,NA,4,-999,8))
"NA" means there is no observation, just leave them there. "-999" means the observation is identified as an outlier.
Now I am trying to replace the "-999" with the average of the nearest values. For example:
The first "-999" should be replaced with (1.7+1.5)/2 = 1.6
The second "-999" should be replaced with (1.7+1.5)/2 = 1.6
The last "-999" should be replaced with (4.0+8.0)/2 = 6
I tried to use the next statement to skip to the next iteration, and an if statement to decide where to stop. But how can I go back to check the previous iterations? Or is there another kind of solution to this?
Many thanks.
One approach utilizing dplyr, purrr and tidyr could be:
df %>%
mutate(New_Value = if_else(Value == -999,
map_dbl(.x = seq_along(Value),
~ mean(c(tail(na.omit(na_if(Value[1:(.x - 1)], -999)), 1),
head(na.omit(na_if(Value[(.x + 1):n()], -999)), 1)))),
Value))
Time Value New_Value
1 1 1.7 1.7
2 2 NA NA
3 3 -999.0 1.6
4 4 -999.0 1.6
5 5 1.5 1.5
6 6 1.6 1.6
7 7 NA NA
8 8 4.0 4.0
9 9 -999.0 6.0
10 10 8.0 8.0
Using a few while loops, which bump up how far we lag/lead, we can accomplish this. I am not sure how performant this operation will be on large data sets. But it seems to get the job done for your sample data.
# find where replacements are needed and initialize
where_to_replace <- which(df$Value == -999)
len_replace <- length(where_to_replace)
lag_value <- rep(NA, len_replace)
lead_value <- rep(NA, len_replace)
# more initializing
i <- 1
lag_n <- 1
lead_n <- 1
while(i <= len_replace){
# find appropriate lagged value
# can't use NA or lag value == -999
while(is.na(lag_value[i]) | lag_value[i] == -999){
lag_value[i] <- dplyr::lag(df$Value, lag_n)[where_to_replace[i]]
lag_n <- lag_n + 1
}
# find appropriate lead value
# can't use NA or -999 as lead value
while(is.na(lead_value[i]) | lead_value[i] == -999){
lead_value[i] <- dplyr::lead(df$Value, lead_n)[where_to_replace[i]]
lead_n <- lead_n + 1
}
# reset iterators
i <- i + 1
lag_n <- 1
lead_n <- 1
}
# replacement value
df$Value[where_to_replace] <- (lead_value + lag_value) / 2
# Time Value
# 1 1 1.7
# 2 2 NA
# 3 3 1.6
# 4 4 1.6
# 5 5 1.5
# 6 6 1.6
# 7 7 NA
# 8 8 4.0
# 9 9 6.0
# 10 10 8.0
I created two new helper columns, before and after.
before fills every NA and -999 with the nearest valid value above it, and after fills every NA and -999 with the nearest valid value below it. In the next step I overwrote each -999 with the mean of the two values.
df <- data.frame(Time = 1:10,
Value = c(1.7, NA, -999, -999, 1.5,
1.6, NA,
4, -999, 8))
df <- df %>%
mutate(before = recode(Value, `-999` = NA_real_),
after = recode(Value, `-999` = NA_real_)) %>%
fill(before, .direction = "down") %>%
fill(after, .direction = "up") %>%
mutate(Value = case_when(Value == -999 ~ (before + after)/2,
TRUE ~ Value)) %>%
select(Time, Value)
The Output
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
Here is a base R option using findInterval
x <- which(df$Value == -999)
y <- setdiff(which(!is.na(df$Value)),x)
ind <- findInterval(x,y)
dfout <- within(df,Value <- replace(Value,x,rowMeans(cbind(Value[y[ind]],Value[y[ind+1]]))))
such that
> dfout
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
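The findInterval approach assumes every -999 has a valid value on both sides. Below is a sketch of how it could be wrapped in a function that also copes with an outlier at the start or end of the vector, by falling back to the single available neighbour. The function name `fill_outliers` and that fallback rule are assumptions of this example, not part of the question.

```r
# Hypothetical helper: replace each `flag` value by the mean of the nearest
# valid (non-NA, non-flag) values before and after it
fill_outliers <- function(v, flag = -999) {
  out_idx   <- which(v == flag)               # which() drops NA comparisons
  valid_idx <- which(!is.na(v) & v != flag)
  pos <- findInterval(out_idx, valid_idx)     # number of valid positions before each outlier
  prev_idx <- valid_idx[replace(pos, pos == 0, NA)]  # NA if nothing valid before
  next_idx <- valid_idx[pos + 1]                     # NA if nothing valid after
  v[out_idx] <- rowMeans(cbind(v[prev_idx], v[next_idx]), na.rm = TRUE)
  v
}

df <- data.frame(Time = 1:10, Value = c(1.7, NA, -999, -999, 1.5, 1.6, NA, 4, -999, 8))
df$Value <- fill_outliers(df$Value)
df

# An outlier in first position only has a neighbour after it
edge <- fill_outliers(c(-999, 2, -999, 4))
edge
```

If a vector contains no valid values at all, the flagged entries come out as NaN, which may or may not be what you want.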
Just sticking with base R data.frames we can make a function and use sapply over indices of interest.
outliers <- df$Value == -999 # Keep as logical for now
fillers <- which(!is.na(df$Value) & !outliers)
outliers <- which(outliers) # Now convert to indices; FALSE and NA do not appear
filled_outliers <- sapply(outliers, function(x) {
before_ind = max(fillers[fillers < x]) # maximum INDEX before an outlier
after_ind = min(fillers[fillers > x])
0.5*(df$Value[before_ind] + df$Value[after_ind])
})
df$Value[outliers] <- filled_outliers # assign to the Value column only, so Time is left untouched
df
Gives:
Time Value
1 1 1.7
2 2 NA
3 3 1.6
4 4 1.6
5 5 1.5
6 6 1.6
7 7 NA
8 8 4.0
9 9 6.0
10 10 8.0
I have a dataframe df with a point each 0.1 unit:
df <- expand.grid(x = seq(0, 20, by = .1),
y = seq(0, 20, by = .1))
I defined a new dataframe grid which has a point each 4 units:
grid <- expand.grid(xg = seq(0, 20, by = 4),
yg = seq(0, 20, by = 4))
I would like to use the points of grid as nodes of a grid and determine the points in df which fall inside its cells.
The information about the grid cell should be added to a new column in df providing a string such as i.j for each point, where i and j are the row and column index of the grid cell, respectively. The new column should report NA for df points on the grid lines.
For example, all df points with 0 < x < 4 and 0 < y < 4 should be labeled as 1.1, whereas points with 8 < x < 12 and 16 < y < 20 should be labeled as 3.5 and so on.
The ideal solution should also work for grids of a different size, e.g. expand.grid(xg = seq(0, 20, by = 2), yg = seq(0, 20, by = 2)).
Thanks for your help.
This is a bit hacky, but you could create the i.j index in grid then join this to df and pad the NAs for each grid chunk:
df <- expand.grid(x = seq(0, 20, by = .1),
y = seq(0, 20, by = .1))
head(df)
#> x y
#> 1 0.0 0
#> 2 0.1 0
#> 3 0.2 0
#> 4 0.3 0
#> 5 0.4 0
#> 6 0.5 0
grid <- expand.grid(xg = seq(0, 20, by = 4),
yg = seq(0, 20, by = 4))
head(grid)
#> xg yg
#> 1 0 0
#> 2 4 0
#> 3 8 0
#> 4 12 0
#> 5 16 0
#> 6 20 0
# Make row/col indices
grid$i <- as.integer(factor(grid$xg))
grid$j <- as.integer(factor(grid$yg))
grid$i.j <- paste(grid$i, grid$j, sep = '.')
grid$i <- NULL
grid$j <- NULL
# Merge indices to df
indexed <- merge(df, grid, by.x = c('x', 'y'), by.y = c('xg', 'yg'), all = TRUE)
head(indexed)
#> x y i.j
#> 1 0 0.0 1.1
#> 2 0 0.1 NA
#> 3 0 0.2 NA
#> 4 0 0.3 NA
#> 5 0 0.4 NA
#> 6 0 0.5 NA
# Fill in betweens
for (i in 2:nrow(indexed)) {
if (is.na(indexed$i.j[i])) {
indexed$i.j[i] <- indexed$i.j[i - 1]
}
}
head(indexed)
#> x y i.j
#> 1 0 0.0 1.1
#> 2 0 0.1 1.1
#> 3 0 0.2 1.1
#> 4 0 0.3 1.1
#> 5 0 0.4 1.1
#> 6 0 0.5 1.1
This would only work if the df sequences intersect the grid sequences, i.e. df[8081, ] = {4.0, 4.0} is in grid and df[1, ] is also in grid.
Also, the for loop is pretty slow. You could try converting it to an Rcpp loop, or maybe there's a more efficient way of doing a non-equi join with {data.table} or {sqldf}.
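As a vectorised alternative to the merge-and-fill loop, the cell index along each axis can be looked up with findInterval, and points that sit on a grid line set to NA. This is only a sketch: `label_cells` and its `tol` argument are hypothetical, and the tolerance is there because values produced by seq(0, 20, by = .1) are not guaranteed to compare exactly equal to the grid breaks.

```r
# Hypothetical helper: label each (x, y) point with "i.j", where i is the cell
# index along x and j the cell index along y; NA on grid lines or outside the grid
label_cells <- function(x, y, breaks_x, breaks_y, tol = 1e-8) {
  # TRUE where a coordinate lies (numerically) on one of the breaks
  on_line <- function(v, b) rowSums(abs(outer(v, b, "-")) < tol) > 0
  cell_index <- function(v, b) {
    idx <- findInterval(v, b)
    idx[on_line(v, b) | idx == 0 | idx == length(b)] <- NA
    idx
  }
  i <- cell_index(x, breaks_x)
  j <- cell_index(y, breaks_y)
  ifelse(is.na(i) | is.na(j), NA_character_, paste(i, j, sep = "."))
}

df <- expand.grid(x = seq(0, 20, by = .1),
                  y = seq(0, 20, by = .1))
breaks <- seq(0, 20, by = 4)
df$cell <- label_cells(df$x, df$y, breaks, breaks)

# Spot checks against the examples in the question
label_cells(2, 2, breaks, breaks)    # inside 0 < x < 4, 0 < y < 4
label_cells(10, 18, breaks, breaks)  # inside 8 < x < 12, 16 < y < 20
label_cells(4, 2, breaks, breaks)    # on a grid line
```

A different grid size only needs different break vectors, for example seq(0, 20, by = 2) for both axes.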
I would like to add one new row for each of the subjects in my dataframe, which looks something like this:
Subject = c("1","5","10")
time = c("2", "2.25", "2.5")
value = c("3", "17", "9")
DF <- data.frame(Subject, time, value)
Subject time value
1 1 2 3
2 5 2.25 17
3 10 2.5 9
I want to add a new row for each subject with a time = 0 and value = 0, giving this:
Subject = c("1","1","5","5","10","10")
time = c("0","2","0", "2.25","0", "2.5")
value = c("0","3","0", "17","0", "9")
DF2 <- data.frame(Subject, time, value)
Subject time value
1 1 0 0
2 1 2 3
3 5 0 0
4 5 2.25 17
5 10 0 0
6 10 2.5 9
I have a lot of subjects with a lot of gaps in their subject numbers, and want to do this for all of them in a reasonable way. Any suggestions?
Thank you in advance.
Sincerely,
ykl
I would just rbind in the new values (not sure why you specified all your values as character values, here I changed them to numeric)
DF <- data.frame(
Subject = c(1,5,10),
time = c(2, 2.25, 2.5),
value = c(3, 17, 9)
)
DF2 <- rbind(
DF,
data.frame(Subject = unique(DF$Subject), time = 0, value = 0) # numeric zeros, so the columns stay numeric
)
This puts them at the bottom, but you could re-sort if you like:
DF2[order(DF2$Subject, DF2$time), ]
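For reuse across data sets, the rbind-and-sort steps could be wrapped in a small function. This is just a sketch: `add_baseline` is a made-up name, and it assumes the columns are called Subject, time and value and are numeric, as in the example above.

```r
DF <- data.frame(
  Subject = c(1, 5, 10),
  time = c(2, 2.25, 2.5),
  value = c(3, 17, 9)
)

# Hypothetical helper: add one row with time = 0 and value = 0 per subject,
# then sort so that each subject's zero row comes first
add_baseline <- function(dat) {
  zeros <- data.frame(Subject = unique(dat$Subject), time = 0, value = 0)
  out <- rbind(dat, zeros)
  out <- out[order(out$Subject, out$time), ]
  rownames(out) <- NULL   # renumber rows after sorting
  out
}

DF2 <- add_baseline(DF)
DF2
```

Gaps in the subject numbers do not matter here, because the zero rows are built from unique(dat$Subject) rather than from a range of numbers.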
You can also use interleave from the "gdata" package:
library(gdata)
interleave(DF, data.frame(Subject = 0, time = 0, value = 0))
# Subject time value
# 1 1 2.00 3
# 11 0 0.00 0
# 2 5 2.25 17
# 1.1 0 0.00 0
# 3 10 2.50 9
# 1.2 0 0.00 0
This uses @MrFlick's sample data.
DF <- data.frame(
Subject = c(1,5,10),
time = c(2, 2.25, 2.5),
value = c(3, 17, 9)
)