I have a df with multiple vars containing dates.
Among these vars some report multiple dates separated by formatting symbols.
For each cell in each of the relevant vars, I would like to split the string, convert the pieces to dates, and pick the last (most recent) date.
DATA
data <- data.frame(ex=c(1,2),date_1 = c("30/12/1997\n22/12/1998","15/12/1993"), date_2 = c("21/03/1997\n11/04/1996\n11/04/1996\n11/04/1996\n11/04/1996",NA))
expected <- data.frame(ex=c(1,2),date_1 = c("1998-12-22","1993-12-15"), date_2 = c("1997-03-21",NA))
CODE ATTEMPTED (1) ERROR: ALL ENTRIES GET THE VAR MAX VALUE NOT THE CELL MAX VALUE
data[grep("date",names(data),value = T)] <- lapply(data[grep("date",names(data),value = T)], function(x) max(as.Date(str_split(x,"\n")[[1]],format="%d/%m/%Y"), na.rm = T))
CODE ATTEMPTED (2) (NESTED LAPPLY) ERROR: CODE BREAKS DOWN SOMEWHERE
data[grep("date",names(data),value = T)] <- lapply(data[grep("date",names(data),value = T)], function(y) max(y, lapply(data[grep("date",names(data),value = T)], function(x)
as.Date(str_split(x,"\n")[[1]],format="%d/%m/%Y"), na.rm = T)))
CODE ATTEMPTED (3) (NESTED LAPPLY) ERROR: CODE BREAKS DOWN SOMEWHERE
data[grep("date",names(data),value = T)] <- lapply(data[grep("date",names(data),value = T)], function(y) max(y,function(x) as.Date(str_split(x,"\n")[[1]],format="%d/%m/%Y"), na.rm = T))
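We can use:
data[-1] <- lapply(data[-1], function(y) sapply(strsplit(y, "\n"),
function(x) max(as.Date(x, "%d/%m/%Y"))))
# older R versions require an explicit origin when converting numbers to dates
data[-1] <- lapply(data[-1], as.Date, origin = "1970-01-01")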
data
# ex date_1 date_2
#1 1 1998-12-22 1997-03-21
#2 2 1993-12-15 <NA>
The logic is the same as described in the question: for every column except the first, we split the string on "\n", convert the pieces to dates, and return the max value. The inner sapply loop returns the numeric representation of the dates, so we use another lapply to convert the numbers back to dates.
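As a variation on the approach above, here is a minimal sketch that avoids the numeric round trip by collecting the per-cell maximum as an ISO date string and converting once at the end. The helper name last_date is hypothetical (it is not part of the original answer), and the sketch assumes the date columns are character vectors whose names start with "date".

```r
# Toy data copied from the question.
data <- data.frame(
  ex = c(1, 2),
  date_1 = c("30/12/1997\n22/12/1998", "15/12/1993"),
  date_2 = c("21/03/1997\n11/04/1996\n11/04/1996\n11/04/1996\n11/04/1996", NA),
  stringsAsFactors = FALSE
)

# last_date() is a hypothetical helper: the latest date in each cell.
last_date <- function(col) {
  parts <- strsplit(col, "\n", fixed = TRUE)
  latest <- vapply(
    parts,
    function(x) {
      d <- as.Date(x, format = "%d/%m/%Y")
      # Keep NA cells as NA instead of letting max() warn on empty input.
      if (all(is.na(d))) NA_character_ else format(max(d, na.rm = TRUE))
    },
    character(1)
  )
  as.Date(latest)  # ISO strings convert without a format argument
}

# Apply the helper to every column whose name starts with "date".
date_cols <- grep("^date", names(data), value = TRUE)
data[date_cols] <- lapply(data[date_cols], last_date)
```

Selecting the columns by name rather than with data[-1] keeps the code working if more non-date columns are added later.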
Here is what I have done. With this solution I get a new data frame that has numeric columns only, but I want to keep my original data frame.
data_without_na <- select_if(new_data,is.numeric)
data_without_na[] <- lapply(
data_without_na,
function(data_without_na) {
data_without_na[is.na(data_without_na)] <- median(data_without_na, na.rm = TRUE)
data_without_na
})
That is my code, but I would prefer to perform the same operation on my original data frame. My idea was to get the indices of the numeric columns with ind <- which(sapply(new_data, is.numeric)) and use those column numbers to operate on the original data frame, but that gives me an error.
Simulate a dataframe:
d <- data.frame("char1" = sample(letters,100, replace = T),
"char2" = sample(letters,100, replace = T),
"numeric1" = sample(c(NA,seq(1,50,2.5)),100, replace = T),
"numeric2" = sample(c(NA,seq(1,50,2.5)),100, replace = T))
library(dplyr)
d %>%
mutate_if(is.numeric, ~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
We take this data frame and mutate all columns that are numeric. "~" defines an anonymous function and ".x" stands for the variable/column being processed. The result is printed, not stored; assign it back to d to keep it.
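The index-based idea from the question can also be made to work in base R. The following is a minimal sketch, assuming a small made-up data frame (the values below are illustrative, not from the question): it finds the numeric columns by name and overwrites only those columns in the original data frame.

```r
# Small illustrative data frame (made-up values).
d <- data.frame(
  char1 = letters[1:10],
  numeric1 = c(1, NA, 3, 4, NA, 6, 7, 8, 9, 10),
  numeric2 = c(NA, 2, 2, 2, 5, 5, 5, NA, 9, 9),
  stringsAsFactors = FALSE
)

# Names of the numeric columns only.
num_cols <- names(d)[vapply(d, is.numeric, logical(1))]

# Replace NA by the column median, in place, for those columns.
d[num_cols] <- lapply(d[num_cols], function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})
```

Because only d[num_cols] is assigned to, the character columns are left untouched and no separate numeric-only copy is needed.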
First Data frame 'total_coming_in' column names: 'LocationID','PartNumber',"Quantity"
Second Data frame 'total_going_out' column names: 'LocationID','PartNumber',"Quantity"
I want the output 'total_data' with column names 'LocationID','PartNumber',"Quantity_subtract", where
Quantity_subtract = total_coming_in$Quantity - total_going_out$Quantity, grouped by 'LocationID','PartNumber'.
I tried this:
matchingCols <- c('LocationID','PartNumber')
mergingCols <- names(coming_in)[3]
total_coming_in[total_going_out,on=matchingCols,
lapply(
setNames(mergingCols),
function(x) get(x) - get(paste0("i.", x))
),
nomatch=0L,
by=.EACHI
]
Using data.table, as you seem to want to, I would first cleanly merge the two tables and then do the subtraction on just the rows where it makes sense (i.e. rows in total_coming_in that have matching values in total_going_out and vice versa):
library(data.table)
M <- merge(total_coming_in, total_going_out, by = c('LocationID','PartNumber'))
# i.e. all.x = FALSE, all.y = FALSE,
# thereby eliminating rows in x without matching row in y and vice-versa
M[ , Quantity_subtract := Quantity.x - Quantity.y,
by = c('LocationID','PartNumber')]
Now for completeness: your question might be interpreted as allowing 0 for a missing quantity, i.e. for rows of total_coming_in that have no matching row in total_going_out and vice versa. In that case you could do:
M <- merge(total_coming_in, total_going_out, all = TRUE, by = c('LocationID','PartNumber'))
# i.e. all.x = TRUE, all.y = TRUE,
# thereby allowing rows in x without matching row in y and vice-versa
M[is.na(Quantity.x), Quantity.x := 0]
M[is.na(Quantity.y), Quantity.y := 0]
M[ , Quantity_subtract := Quantity.x - Quantity.y,
by = c('LocationID','PartNumber')]
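If a LocationID/PartNumber combination can occur on several rows, the quantities probably need to be summed per group before merging. Here is a dependency-free sketch of that variant in base R rather than data.table, assuming made-up toy tables; the suffixes ".in" and ".out" are illustrative choices.

```r
# Toy tables (made-up values); key (1, "A") appears twice in total_coming_in.
total_coming_in <- data.frame(
  LocationID = c(1, 1, 1, 2),
  PartNumber = c("A", "A", "B", "A"),
  Quantity   = c(6, 4, 5, 7),
  stringsAsFactors = FALSE
)
total_going_out <- data.frame(
  LocationID = c(1, 2, 3),
  PartNumber = c("A", "A", "C"),
  Quantity   = c(4, 7, 2),
  stringsAsFactors = FALSE
)

keys <- c("LocationID", "PartNumber")

# Sum quantities per key so each key occurs once in each table.
in_sum  <- aggregate(Quantity ~ LocationID + PartNumber, data = total_coming_in, FUN = sum)
out_sum <- aggregate(Quantity ~ LocationID + PartNumber, data = total_going_out, FUN = sum)

# Full outer merge; keys missing on one side get NA, which is treated as 0.
M <- merge(in_sum, out_sum, by = keys, all = TRUE, suffixes = c(".in", ".out"))
M$Quantity.in[is.na(M$Quantity.in)] <- 0
M$Quantity.out[is.na(M$Quantity.out)] <- 0

M$Quantity_subtract <- M$Quantity.in - M$Quantity.out
total_data <- M[, c(keys, "Quantity_subtract")]
```

With these toy values, key (1, "A") ends up with 10 - 4 = 6, and key (3, "C"), which only appears in total_going_out, ends up with -2.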
So you want to have a column that gives you the difference of total_coming_in and total_going_out for each combination of PartNumber and LocationID, correct?
If so, the following will do:
library(dplyr)
matchingCols <- c("LocationID", "PartNumber")
total_data <- full_join(total_coming_in, total_going_out, by=matchingCols)
total_data <- mutate(total_data, Quantity_subtract = Quantity.x - Quantity.y)
total_data <- select(total_data, -Quantity.x, -Quantity.y) #if you want to get rid of these columns
I used this example data:
total_coming_in <- list(LocationID = round(runif(26, 1000, 9000)),
PartNumber = paste(runif(26, 10000, 20000), LETTERS, sep="-"),
Quantity = round(runif(26, 2, 4))
) %>% as_tibble()
random_integers <- sample(1:26,26,FALSE)
total_going_out <- list(LocationID = total_coming_in$LocationID[random_integers],
PartNumber = total_coming_in$PartNumber[random_integers],
Quantity = round(runif(26, 1, 3))
) %>% as_tibble()
I have 2 problems. First, I have datasets in which 2 column names are similar. I want to select the first one and not use the second. The numeric values in the column names are the serial numbers of the sensors; they vary, and the columns can appear in various positions.
How can I select the first of the 2 column names so I can plot it or use it in calculations?
How can I recover those long column names so I can use them? For example, how do I get "Depth_456" to use in depthmax2 without typing it in or making a subset named depth? The problem is the numeric value, which is the serial number of the sensor and changes from instrument to instrument and dataset to dataset. I am trying to write generic code that will work on all the different instruments.
My Data
df1 <- data.frame(Sal_224 = 1:8, Temp_696 = 1:8, Depth_456 = 1:8, Temp_654 = 8:15)
df2<-data.frame(sapply(df1, function(x) as.numeric(as.character(x))))
temp<- df2[grep("Temp", names(df2), value=TRUE)]
depth<- df2[grep("Depth", names(df2), value=TRUE)]
depthmax<- max(depth, na.rm = TRUE)
depthmax2<- max(df2$"Depth_456", na.rm = TRUE)
This doesn't work
depthmax2<- max(df2$grep("Depth", names(df2), value=TRUE), na.rm = TRUE)
We need [[ instead of $.
max(df2[[ grep("Depth", names(df2), value=TRUE)]], na.rm = TRUE)
#[1] 8
Or another option is startsWith
max(df2[[names(df2)[startsWith(names(df2), "Depth")]]], na.rm = TRUE)
#[1] 8
Also, max works on a vector. If there is more than one match, we may have to loop over the matching columns and get the max of each:
sapply(df2[ grep("Depth", names(df2), value=TRUE)], max, na.rm = TRUE)
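For the first part of the question (using only the first of two similarly named columns), a minimal sketch could look like the following. The helper name first_col is hypothetical, and "first" is assumed to mean first in column order.

```r
# Example data from the question.
df2 <- data.frame(Sal_224 = 1:8, Temp_696 = 1:8, Depth_456 = 1:8, Temp_654 = 8:15)

# first_col() is a hypothetical helper: name of the first column
# whose name starts with the given prefix.
first_col <- function(df, prefix) {
  hits <- names(df)[startsWith(names(df), prefix)]
  if (length(hits) == 0) stop("no column name starts with ", prefix)
  hits[1]
}

temp_name  <- first_col(df2, "Temp")    # serial number is never typed in
depth_name <- first_col(df2, "Depth")

temp_max  <- max(df2[[temp_name]], na.rm = TRUE)
depth_max <- max(df2[[depth_name]], na.rm = TRUE)
```

Keeping the recovered name in a variable such as temp_name makes it reusable for plotting or further calculations via df2[[temp_name]].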
I need to prepare a certain dataset for analysis. What I have is a table with column names (obviously). The column names are as follows (sample colnames):
"X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"
(this is a vector, for those not familiar with R's colnames() function)
Now, what I want is simply to flip the values in front of, and after the underscore. e.g. X99_NORM becomes NORM_X99. Note that I want this only for the column names which contain NORM in their name.
Some other base R options
1)
Use sub to switch the beginning and end - we can make use of capturing groups here.
x <- sub(pattern = "(^X\\d+)_(NORM$)", replacement = "\\2_\\1", x = x)
Result
x
# [1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
2)
A regex-free approach that might be more efficient using chartr, dirname and paste. But we need to get the indices of the columns that contain "NORM" first
idx <- grep(x = x, pattern = "NORM", fixed = TRUE)
x[idx] <- paste0("NORM_", dirname(chartr("_", "/", x[idx])))
x
data
x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
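Since the goal is to rename the columns of a table, the same sub call can be applied directly to the names of a data frame. A minimal sketch, assuming a made-up table called tab with the sample column names:

```r
# Made-up table carrying the sample column names from the question.
tab <- data.frame(
  X99_NORM = 1:2,
  X101_NORM = 3:4,
  X76_110_T02_09747 = 5:6,
  X30_NORM = 7:8
)

# Swap the two captured groups only where the whole name matches X<digits>_NORM.
names(tab) <- sub("^(X\\d+)_(NORM)$", "\\2_\\1", names(tab))

names(tab)
```

Names that do not match the pattern, such as X76_110_T02_09747, are returned unchanged by sub.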
x = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
replace(x,
grepl("NORM", x),
sapply(strsplit(x[grepl("NORM", x)], "_"), function(x){
paste(rev(x), collapse = "_")
}))
#[1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
A tidyverse solution with stringr:
library(tidyverse)
library(stringr)
my_data <- tibble(column = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"))
my_data %>%
filter(str_detect(column, "NORM")) %>%
mutate(column_2 = paste0("NORM", "_", str_extract(column, ".+(?=_)"))) %>%
select(column_2)
# A tibble: 3 x 1
column_2
<chr>
1 NORM_X99
2 NORM_X101
3 NORM_X30
I have a df like this
a1 <- c(1,2,NA)
a2 <- c(3,4,NA)
a3 <- c(4,5,6)
a1_fill <- c(1,2,3)
a2_fill <- c(3,4,5)
a3_fill <- c(4,5,6)
b1 <- c(4,3,1)
b2 <- c(2,NA,9)
b3 <- c(NA,3,5)
b1_fill <- c(4,3,1)
b2_fill <- c(2,1,9)
b3_fill <- c(8,3,5)
df <- data.frame(a1,a2,a3,b1,b2,b3,a1_fill,a2_fill,a3_fill,b1_fill,b2_fill,b3_fill)
I want to create two new columns with values from the fill cols that do not appear in the corresponding a or b cols. I do so like this
df$missingA <- apply(df,1,function(x) setdiff(x[which(grepl("a",names(x),fixed = TRUE) & grepl("fill",names(x),fixed = TRUE))],x[which(grepl("a",names(x),fixed = TRUE) & !grepl("fill",names(x),fixed = TRUE))]))
df$missingB <- apply(df,1,function(x) setdiff(x[which(grepl("b",names(x),fixed = TRUE) & grepl("fill",names(x),fixed = TRUE))],x[which(grepl("b",names(x),fixed = TRUE) & !grepl("fill",names(x),fixed = TRUE))]))
For some reason when I run the above code, the line that is run second returns a list of lists, whereas the first returns a list of numeric. Why is this?
This is dependent on the order in which the lines are run. We can use the code below to determine the class of the elements in the columns
class(df$missingA[[1]]) # Class of first element is numeric
class(df$missingB[[1]]) # Class of first element is list
Starting from a clean dataset, if you first create missingB and then create missingA, you'll see that missingA will be a list of lists, whereas missingB will be a list of numeric vectors (the types are reversed).
df$missingB <- apply(df,1,function(x)
setdiff(
x[which(
grepl("b",names(x),fixed = TRUE) &
grepl("fill",names(x),fixed = TRUE))
],
x[which(
grepl("b",names(x),fixed = TRUE) &
!grepl("fill",names(x),fixed = TRUE))
]
)
)
df$missingA <- apply(df,1,function(x)
setdiff(
x[which(
grepl("a",names(x),fixed = TRUE) &
grepl("fill",names(x),fixed = TRUE))
],
x[which(
grepl("a",names(x),fixed = TRUE) &
!grepl("fill",names(x),fixed = TRUE))
]
)
)
class(df$missingA[[1]]) # Class of first element is list
class(df$missingB[[1]]) # Class of first element is numeric
What seems to be happening is the following. You start with a data frame containing only numeric columns. apply() converts the data frame to a matrix before looping over the rows, so each row reaches your function as a numeric vector and setdiff returns a numeric vector. Once the first result has been added, the data frame is no longer made up purely of numeric columns, as one of them is now a list. The matrix conversion then produces a list matrix, each row reaches your function as a list, and setdiff returns a list.
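The effect of a list column on the row type can be checked in isolation. This is a small sketch on a made-up two-column data frame; the column name extra is illustrative.

```r
# Purely numeric data frame (made-up values).
df_small <- data.frame(a1 = c(1, 2, NA), a1_fill = c(1, 2, 3))

type_before <- typeof(as.matrix(df_small))        # matrix of doubles
rows_before <- unname(apply(df_small, 1, class))  # each row is numeric

# Add a list column, as happens when the first apply() result is stored.
df_small$extra <- list(1, 2:3, 4)

type_after <- typeof(as.matrix(df_small))         # now a list matrix
rows_after <- unname(apply(df_small, 1, class))   # each row is a list
```

Comparing type_before with type_after, and rows_before with rows_after, shows that the change comes from the matrix conversion inside apply, before the user function is ever called.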
To test this you can add a non-numeric column before adding missingA and missingB, for example:
df$text <- list("a","b","C")
Now missingA and missingB will both be created as lists of lists.
Another way to keep the results as lists of numeric vectors is to compute each apply output into an intermediate variable while the data frame is still purely numeric, and only then add it to the data frame:
missingA <- apply(df,1,function(x)
setdiff(
x[which(
grepl("a",names(x),fixed = TRUE) &
grepl("fill",names(x),fixed = TRUE))
],
x[which(
grepl("a",names(x),fixed = TRUE) &
!grepl("fill",names(x),fixed = TRUE))
]
)
)
df$missingA <- missingA
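A more order-independent alternative is to restrict the comparison to the relevant numeric columns, so that list columns added earlier never take part. The sketch below is one possible way to do that, not the original approach; the helper name missing_values is hypothetical and it assumes columns are named like a1 and a1_fill.

```r
# Data from the question.
df <- data.frame(
  a1 = c(1, 2, NA), a2 = c(3, 4, NA), a3 = c(4, 5, 6),
  b1 = c(4, 3, 1),  b2 = c(2, NA, 9), b3 = c(NA, 3, 5),
  a1_fill = c(1, 2, 3), a2_fill = c(3, 4, 5), a3_fill = c(4, 5, 6),
  b1_fill = c(4, 3, 1), b2_fill = c(2, 1, 9), b3_fill = c(8, 3, 5)
)

# missing_values() is a hypothetical helper: per row, the fill values
# that do not appear in the corresponding base columns.
missing_values <- function(df, prefix) {
  fill_cols <- grep(paste0("^", prefix, "\\d+_fill$"), names(df), value = TRUE)
  base_cols <- grep(paste0("^", prefix, "\\d+$"), names(df), value = TRUE)
  lapply(seq_len(nrow(df)), function(i) {
    # unlist() turns the one-row selections into plain numeric vectors.
    setdiff(unlist(df[i, fill_cols]), unlist(df[i, base_cols]))
  })
}

df$missingA <- missing_values(df, "a")
df$missingB <- missing_values(df, "b")  # unaffected by the missingA list column
```

Because only numeric columns are selected by name, both new columns are lists of numeric vectors regardless of the order in which they are created.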
To summarize, the difference you are seeing is not caused by the setdiff logic itself, but by what the data frame already contains at the moment each apply statement is run. Hope this helps!