I have a data frame with two columns, one containing dates, the other numbers. My goal is to insert dates from another data frame into the date column. Here is an example:
df <- data.frame(rep(as.Date("2001-01-01", origin = "1970-01-01"), 3),
                 c(1, 2, 3),
                 stringsAsFactors = FALSE)
ins <- data.frame(rep(as.Date("1999-01-01", origin = "1970-01-01"), 3),
                  c(1, 2, 3),
                  stringsAsFactors = FALSE)
The data frame I want to obtain is:
> df_goal
dates numbers
1 1999-01-01 1
2 2001-01-01 2
3 2001-01-01 3
I tried df[1, ] <- c(ins[1, 1], ins[1, 2]), but I got the following error:
Error in as.Date.numeric(value) : 'origin' must be supplied
However, if I omit the numeric column from df, it works:
df <- data.frame(rep(as.Date("2001-01-01"), 3),
stringsAsFactors = F)
ins <- data.frame(rep(as.Date("1999-01-01"), 3),
c(1, 2, 3),
stringsAsFactors = F)
df[1, ] <- ins[1, 1]
How can I get the first case (df with two columns) working?
Don't use c -- it transforms its arguments so they have the same class.
In this case, c(ins[1, 1], ins[1, 2]) produces a Date vector. When its second element is assigned into the numeric second column of df, R tries to reconcile the classes by converting that column to Date, much like as.Date(c(1, 2, 3)), and that conversion is what raises the 'origin' must be supplied error.
You can instead do df[1,] <- ins[1, c(1,2)].
Side note: Don't do this sort of insertion based on row numbers; there must be a better way to achieve what you're after, like a join/merge.
Alternatively:
df2 <- rbind(ins[1,], df[2:3,])
or
df2 <- df
df2[1,] <- ins[1,]
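To check that assigning a one-row data frame slice keeps each column's class, here is a minimal sketch. The column names dates and numbers, and the id key in the second half, are assumptions added for illustration; they are not in the original data. The second half only hints at what a key-based update (in the spirit of the join/merge side note) might look like.

```r
# Column names `dates` and `numbers` are assumed here for readability.
df  <- data.frame(dates = rep(as.Date("2001-01-01"), 3), numbers = c(1, 2, 3))
ins <- data.frame(dates = rep(as.Date("1999-01-01"), 3), numbers = c(1, 2, 3))

# Assign a one-row data frame slice: each column keeps its own class.
df[1, ] <- ins[1, ]
str(df)

# Hypothetical key-based variant: update rows by matching an `id` column
# instead of relying on row positions.
df_key  <- data.frame(id = 1:3, dates = rep(as.Date("2001-01-01"), 3))
ins_key <- data.frame(id = 1L, dates = as.Date("1999-01-01"))
hit <- match(ins_key$id, df_key$id)   # positions of matching ids in df_key
df_key$dates[hit] <- ins_key$dates
```

Because no c() call is involved, the Date and numeric values are never forced into a single class.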
I have a data frame which looks similar to this:
set.seed(42)
start <- Sys.Date() + sort(sample(1:10, 5))
set.seed(43)
end <- Sys.Date() + sort(sample(1:10, 5))
end[4] <- NA
A <- c("10", "15", "NA", "4", "NA")
B <- rpois(n = 5, lambda = 10)
df <- data.frame(start, end, A, B)
When there is an NA in column A, I would like to calculate the hours between start and end. Nothing should happen when either start or end is NA.
I tried something like this:
df[, df$A [is.na(df[, df$A])]] <- difftime(df$end, df$start, units = "hours")
but this gives me the error: undefined columns selected.
Does someone have an idea? Thanks.
Create a logical index of the NA values in column 'A', subset 'start' and 'end' with that index, compute the difftime, and assign the result back:
df$A <- as.numeric(df$A)
i1 <- is.na(df$A)
df$A[i1] <- with(df, as.numeric(difftime(end[i1], start[i1], units = "hours")))
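A small deterministic check of the index-and-assign idea, using made-up dates rather than the random ones from the question. Note that difftime(time1, time2) computes time1 - time2, so putting end first gives a positive number of hours; a row whose end is NA simply stays NA.

```r
# Illustrative data (not the question's random sample)
df <- data.frame(
  start = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")),
  end   = as.Date(c("2024-01-02", "2024-01-05", NA)),
  A     = c("10", "NA", "NA"),
  stringsAsFactors = FALSE
)

df$A <- as.numeric(df$A)   # the string "NA" becomes a real NA
na_idx <- is.na(df$A)      # rows whose A has to be filled in

# Hours from start to end, only for the rows flagged above
df$A[na_idx] <- as.numeric(
  difftime(df$end[na_idx], df$start[na_idx], units = "hours")
)
df
```

Row 2 spans three days, so A becomes 72; row 3 has no end date, so A remains NA.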
I have the following dataset:
A B
2007-11-22 2004-11-18
<NA> 2004-11-10
When the value in column A is NA, I want it to be replaced by the date in B plus an additional 25 days.
Here is what the outcome should look like:
A B
2007-11-22 2004-11-18
2004-12-05 2004-11-10
So far, I have tried the following ifelse call, but with no success.
library(lubridate)
data$A<- ifelse(is.na(data$A),data$B+days(25),data$A)
Could anyone tell me what's wrong with it or give me an alternate solution? The code to build my dataset is below.
A<-c("2007-11-22 01:00:00", NA)
B<-c("2004-11-18","2004-11-10")
data<-data.frame(A,B)
data$A<-as.Date(data$A);data$B<-as.Date(data$B)
The cause of the issue can be traced to the source code of ifelse. If you run View(ifelse), you will see the following lines near the bottom of the source:
ans <- test
len <- length(ans)
ypos <- which(test)
npos <- which(!test)
if (length(ypos) > 0L)
ans[ypos] <- rep(yes, length.out = len)[ypos]
if (length(npos) > 0L)
ans[npos] <- rep(no, length.out = len)[npos]
ans
where test is a logical vector and ans is initialized as a copy of test. When ans[ypos] <- rep(yes, length.out = len)[ypos] runs, ans is coerced from logical to numeric and the Date class of yes is dropped, because subassignment keeps the attributes of ans, not those of the replacement values. That's why column A contains plain numbers after using ifelse.
You can try the code below
data$A <- as.Date(ifelse(is.na(data$A), data$B + days(25), data$A), origin = "1970-01-01")
which gives
> data
A B
1 2007-11-22 2004-11-18
2 2004-12-05 2004-11-10
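The class loss can be seen directly on two short vectors. This is a sketch with standalone vectors a and b (names chosen here, mirroring the columns) and a plain + 25 instead of lubridate's days(25), so it runs in base R:

```r
a <- as.Date(c("2007-11-22", NA))
b <- as.Date(c("2004-11-18", "2004-11-10"))

# ifelse() returns the underlying day counts, not Date objects
raw <- ifelse(is.na(a), b + 25, a)
class(raw)

# Converting back with the Unix epoch as origin restores the dates
restored <- as.Date(raw, origin = "1970-01-01")
restored
```

Here class(raw) is "numeric", and restored holds 2007-11-22 and 2004-12-05 again.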
Assuming the data given reproducibly in the Note at the end -- in particular we assume both columns are of Date class -- compute a logical vector is_na which indicates which entries are NA and then set those from B.
is_na <- is.na(data$A)
data$A[is_na] <- data$B[is_na] + 25
This would also work and has the advantage that it does not overwrite data:
transform(data, A = replace(A, is.na(A), B[is.na(A)] + 25))
Note
Lines <- "
A B
2007-11-22 2004-11-18
NA 2004-11-10"
data <- read.table(text = Lines, header = TRUE)
data[] <- lapply(data, as.Date) # convert to Date class
Instead of ifelse, you could use coalesce, which keeps the Date class:
library(tidyverse)
library(lubridate)
A <- c("2007-11-22 01:00:00", NA)
B <- c("2004-11-18","2004-11-10")
data <-data.frame(A,B)
data <- data %>%
mutate(A = as_date(A),
B = as_date(B),
A = coalesce(A,B+days(25)))
library(tidyverse)
df0 <- data.frame(col1 = c(5, 2), col2 = c(6, 4))
df1 <- data.frame(col1 = c(5, 2),
col2 = c(6, 4),
col3 = ifelse(apply(df0[, 1:2], 1, sum) > 10 &
df0[, 2] > 5,
"True",
"False"))
df2 <- as_tibble(df1)
I've got my data frame df1 above. I've basically "copied" it as a tibble df2. Let's mimic an analysis for this df1 data frame and df2 tibble.
identical(df1[[2]], df1[, 2])
# [1] TRUE
identical(df2[[2]], df2[, 2])
# [1] FALSE
Since df1 and df2 are essentially the "same", why do I get the TRUE/FALSE dichotomy in the code block above? What tibble property has changed?
The same question asked another way: what is the difference between [[X]] and [, X] in base R, and in the tidyverse?
Since all lists are vectors, we can think of this in terms of list subsetting. Take for instance:
L <- list(A = c(1, 2), B = c(1, 4))
L[[2]]
This extracts the second element of the list. Extrapolate this to:
df1[[2]]
We get the same output as df1[, 2] hence identical(df1[[2]], df1[, 2]) returns TRUE.
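The second part has to do with tibble structure, i.e.:
typeof(as_tibble(df1)[[2]])
# [1] "double"
typeof(as_tibble(df1)[, 2])
# [1] "list"
The first is a plain vector, while the second is still a tibble (whose base type is list), because [ on a tibble does not drop a single column to a vector. Hence identical returns FALSE.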
Objects of class tbl_df have (from the docs):
A class attribute of c("tbl_df", "tbl", "data.frame").
A base type of "list", where each element of the list has the same NROW().
A names attribute that is a character vector the same length as the underlying list.
A row.names attribute, included for compatibility with the base data.frame class. This attribute is only consulted to query the number of rows, any row names that might be stored there are ignored by most tibble methods.
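To make the [ versus [[ difference concrete, here is a short sketch on a two-column version of the data (the character column is left out for brevity). It relies on the drop argument of [ for tibbles, which defaults to FALSE; passing drop = TRUE is one way to get the base-R-like vector back.

```r
library(tibble)

df1 <- data.frame(col1 = c(5, 2), col2 = c(6, 4))
df2 <- as_tibble(df1)

# Base data frame: a single column selected with [, j] drops to a vector
is.numeric(df1[, 2])

# Tibble: the same call returns a one-column tibble
is_tibble(df2[, 2])

# Asking explicitly for dropping gives the same vector as [[ ]]
identical(df2[[2]], df2[, 2, drop = TRUE])
```

All three calls return TRUE, so the FALSE in the question comes purely from the different default for dropping dimensions.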
I'm working with a list of data frames. In each data frame, I would like to pad a single ID variable with leading zeros. The ID variables are character vectors and are always the first variable in the data frame. In each data frame, however, the ID variable has a different length. For example:
df1_id ranges from 1:20, thus I need to pad with up to one zero,
df2_id ranges from 1:100, thus I need to pad with up to two zeros,
etc.
My question is: how can I pad each data frame without having to write a separate line of code for each data frame in the list?
I can solve this problem by using the str_pad function on each data frame separately, as in the code below:
#Load stringr package
library(stringr)
#Create sample data frames
df1 <- data.frame("x" = as.character(1:20), "y" = rnorm(20, 10, 1),
stringsAsFactors = FALSE)
df2 <- data.frame("v" = as.character(1:100), "y" = rnorm(100, 10, 1),
stringsAsFactors = FALSE)
df3 <- data.frame("z" = as.character(1:1000), "y" = rnorm(1000, 10, 1),
stringsAsFactors = FALSE)
#Combine data frames into list
dfl <- list(df1, df2, df3)
#Pad ID variables with leading zeros
dfl[[1]]$x <- str_pad(dfl[[1]]$x, width = 2, pad = "0")
dfl[[2]]$v <- str_pad(dfl[[2]]$v, width = 3, pad = "0")
dfl[[3]]$z <- str_pad(dfl[[3]]$z, width = 4, pad = "0")
While this solution works relatively well for a short list, as the number of data frames increases, it becomes a bit unwieldy.
I would love if there was a way that I could embed some sort of "sequence" vector into the width argument of the str_pad function. Something like this:
dfl <- lapply(dfl, function(x) {x[,1] <- str_pad(x[,1], width = SEQ, pad = "0")})
where SEQ is a vector of variable lengths. Using the above example it would look something like:
seq <- c(2,3,4)
Thanks in advance, and please let me know if you have any questions.
~kj
You could use Map here, which is designed to apply a function "to the first elements of each ... argument, the second elements, the third elements", see ?mapply for details.
library(stringr)
vec <- c(2,3,4) # this is the vector of 'widths', don't name it seq
Map(function(i, y) {
dfl[[i]][, 1] <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
dfl[[i]] # this gets returned
},
# you iterate over these two vectors in parallel
i = 1:length(dfl),
y = vec)
Output
#[[1]]
# x y
#1 01 9.373546
#2 02 10.183643
#3 03 9.164371
#
#[[2]]
# v y
#1 001 11.595281
#2 002 10.329508
#3 003 9.179532
#4 004 10.487429
#
#[[3]]
# z y
#1 0001 10.738325
#2 0002 10.575781
#3 0003 9.694612
#4 0004 11.511781
#5 0005 10.389843
explanation
The function that we pass to Map is an anonymous function, which more or less you provided in your question:
function(i, y) {
dfl[[i]][, 1] <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
dfl[[i]] # this gets returned
}
As you can see, the function takes two arguments, i and y (choose other names if you like, such as idx and width), and for each data frame in your list it modifies the first column, dfl[[i]][, 1] <- ... . The anonymous function applies str_pad to the first column of each data frame
... <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
but you see that we don't pass a fixed value to the width argument, but y.
Coming back to Map. Map now applies str_pad to the first dataframe, with argument width = 2, it applies str_pad to the second dataframe, with argument width = 3 and - you probably guessed it - it applies str_pad to the third dataframe in your list, with argument width = 4.
The arguments are specified in the last two lines of the code as
i = 1:length(dfl),
y = vec)
I hope this helps.
data
(consider creating a minimal example next time, as the number of rows in the data frames is not relevant to the problem)
set.seed(1)
df1 <- data.frame("x" = as.character(1:3), "y" = rnorm(3, 10, 1),
stringsAsFactors = FALSE)
df2 <- data.frame("v" = as.character(1:4), "y" = rnorm(4, 10, 1),
stringsAsFactors = FALSE)
df3 <- data.frame("z" = as.character(1:5), "y" = rnorm(5, 10, 1),
stringsAsFactors = FALSE)
#Combine data frames into list
dfl <- list(df1, df2, df3)
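A possible variation, sketched under one assumption: that the desired width is simply the length of the longest ID in each data frame. In that case no separate vector of widths is needed, and lapply can iterate over the data frames themselves instead of over indices. The helper name pad_first_col and the sample data are hypothetical.

```r
library(stringr)

# Small sample list; names and sizes are illustrative only.
dfl <- list(
  data.frame(x = as.character(1:20),  y = 1:20,  stringsAsFactors = FALSE),
  data.frame(v = as.character(1:100), y = 1:100, stringsAsFactors = FALSE)
)

# Assumption: the target width is the longest ID in each data frame.
pad_first_col <- function(d) {
  width <- max(nchar(d[[1]]))
  d[[1]] <- str_pad(d[[1]], width = width, pad = "0")
  d   # return the modified data frame
}

padded <- lapply(dfl, pad_first_col)
head(padded[[2]][[1]], 3)
```

If the widths must come from elsewhere, Map with an explicit vector of widths remains the more general choice.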
Here is an example,
df <- data.frame(x = I(list(1:2, 3:4)))
x <- df[1,]
Now the following does not work,
df[2,] <- x
or
df[2,] <- I(x)
Warning message:
In `[<-.data.frame`(`*tmp*`, 2, , value = list(1:2)) :
replacement element 1 has 2 rows to replace 1 rows
How do I add more rows to a data frame whose single column holds vectors?
After a few tries, I found that the following works:
df[2,] <- list(x)
It adds a new row of list type.
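Another option that may be easier to read, offered as a sketch rather than the only way: when the goal is to change the vector stored in an existing row, assign into the list column itself with [[ ]]. This only covers replacing a cell, not extending the data frame with new rows.

```r
df <- data.frame(x = I(list(1:2, 3:4)))

# Each cell of a list column is one list element, so [[ ]] targets it directly
df$x[[2]] <- 5:6

df$x[[2]]
```

The column stays a list column, and the second row now holds the vector 5:6.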
It might be because you are using a list. If you set your data frame as:
df <- data.frame(rbind(c(1, 2), c(3, 4)))
then your code should work:
df <- data.frame(rbind(c(1, 2), c(3, 4))) # Make DF
x <- df[1,]
df[2,] <- x
print(df)
> df
X1 X2
1 1 2
2 1 2