I have several sheets that I import from Excel. While these sheets are similar, there are some differences due to manual entry. I am trying to filter out the row that has "Total" and anything beyond that row. The logic I have works for df1 and df3, but I am not sure how to get it to work for df2. Could someone please help?
df1<-structure(list(...1 = structure(c(1630022400, 1630108800, 1630195200,
1630281600, 1630368000, NA), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Vinayak Trading` = c(1984.31, NA, NA, NA, NA, 2916.17
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
df2<-structure(list(...1 = c("44526", "44527", "44528", "44529", "44530",
"Total"), `Vinayak Trading` = c(NA, NA, NA, NA, NA, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
df3<-structure(list(...1 = c("44680", "44681", NA, "Total", NA, NA
), `Vinayak Trading` = c(NA, NA, NA, 2736.42, NA, NA)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
transform <- function(df) {
names(df)[1] <- "Date"
df <- df %>%
filter(row_number() < which(is.na(Date))) %>% #To tackle sheets where Total is not present
filter(row_number() < which(Date=="Total")) %>% #To remove Total in sheets where it is present
select(-Total) #To remove the Total column
}
df1 <- transform(df1) # Desired results
df2 <- transform(df2) # Error due to no NAs - don't know how to handle
df3 <- transform(df3) # Desired result with warning
We can use dplyr::cumany to remove "Total" (or NA) and anything beyond.
transform2 <- function(df) {
df %>%
rename(Date = 1) %>%
filter(!cumany(Date %in% c("Total", NA))) %>%
select(-any_of("Total"))
}
transform2(df1)
# # A tibble: 5 × 2
# Date `Vinayak Trading`
# <dttm> <dbl>
# 1 2021-08-27 00:00:00 1984.
# 2 2021-08-28 00:00:00 NA
# 3 2021-08-29 00:00:00 NA
# 4 2021-08-30 00:00:00 NA
# 5 2021-08-31 00:00:00 NA
transform2(df2)
# # A tibble: 5 × 2
# Date `Vinayak Trading`
# <chr> <dbl>
# 1 44526 NA
# 2 44527 NA
# 3 44528 NA
# 4 44529 NA
# 5 44530 NA
transform2(df3)
# # A tibble: 2 × 2
# Date `Vinayak Trading`
# <chr> <dbl>
# 1 44680 NA
# 2 44681 NA
rename(Date = 1) is an inline replacement for names(df)[1] <- "Date"; it seems a bit more pipe-esque.
== NA doesn't return TRUE/FALSE (it returns NA), but %in% NA does; we can use is.na(Date) | Date == "Total", or Date %in% c("Total", NA), with the results one would expect.
cumany is a "cumulative any": once a value is TRUE, all subsequent values are TRUE as well; see cumany(c(F,T,F)). For its opposite, see cumall(c(T,F,T)). For a base-R equivalent, use cumsum(cond) > 0 for cumany and cumsum(!cond) == 0 for cumall.
I use select(-any_of("Total")) since it removes the column if it exists and does nothing otherwise (none of your sample data included it, so I thought it better to be safe).
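To make the NA handling and the "cumulative any" idea concrete, here is a minimal base-R sketch. It assumes a toy vector modelled on the first column of df3 (the values are illustrative) and uses the cumsum() equivalents rather than dplyr itself.

```r
# Toy column modelled on df3 from the question (illustrative values)
Date <- c("44680", "44681", NA, "Total", NA)

# `==` propagates NA, so on its own it is not a safe filter condition
Date == "Total"

# %in% matches NA like any other value, so the result is always TRUE/FALSE
stop_here <- Date %in% c("Total", NA)

# Base-R "cumulative any": TRUE from the first hit onward
past_end <- cumsum(stop_here) > 0

# Base-R "cumulative all" of the negated condition: TRUE only before the first hit
before_end <- cumsum(stop_here) == 0

# Keep everything before the first "Total" or NA
kept <- Date[!past_end]
kept
```

Here the first hit is the NA in position 3, so only the first two entries survive, which is the same cut that transform2(df3) makes.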
EDIT:
Please note that the problem in this question was simply that I accidentally wrote zoo::na.locf(l[[i]][1], na.rm=TRUE) when it should have been zoo::na.locf(l[[i]][1], na.rm=FALSE). However, by the time I figured this out, the question had already received two answers. As a result, instead of removing the question (as this is discouraged after people have submitted answers), I have slightly adapted it so that the post will at least have some merit.
Question:
I have a list of dataframes:
test_dat <- structure(list(...5 = c("euro", "euro", NA, NA, NA, NA,
NA, "dollar", NA)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
test_dat2 <- structure(list(...5 = c(NA, "euro", NA, NA, NA, NA,
NA, "dollar", NA)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
test_dat2
# A tibble: 9 × 1
...5
<chr>
1 NA
2 euro
3 NA
4 NA
5 NA
6 NA
7 NA
8 dollar
9 NA
l = list(test_dat , test_dat2)
I want to fill the NAs in the list of data frames, but sometimes there is a leading NA. I do not know which entries have a leading NA.
for (i in seq_along(l)){
# Fill first column
l[[i]][1] <- zoo::na.locf(l[[i]][1])
}
Leading to:
Error:
! Assigned data `zoo::na.locf(l[[i]][1])` must be compatible with existing data.
✖ Existing data has 9 rows.
✖ Assigned data has 8 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
I assumed that the following would solve it, but it did not:
for (i in seq_along(l)){
# Fill first column
l[[i]][1] <- zoo::na.locf(l[[i]][1], na.rm=TRUE)
}
Desired output:
test_dat2
# A tibble: 9 × 1
...5
<chr>
1 NA
2 euro
3 euro
4 euro
5 euro
6 euro
7 euro
8 dollar
9 dollar
I can't reproduce the problem with R 4.2.1, zoo 1.8.11, dplyr 1.0.10 and tibble 3.1.8. The code in the question with na.rm=FALSE works for me, as do any of the lines below. Make sure that you have the latest version of R and all packages, restart R, and then, in a fresh R session, run the code in the Note below followed by the code in the body of this response.
library(dplyr)
library(zoo)
l <- l.orig
for(i in seq_along(l)) l[[i]][1] <- na.locf(l[[i]][1], na.rm = FALSE)
l <- l.orig
for(i in seq_along(l)) l[[i]][, 1] <- na.locf(l[[i]][, 1], na.rm = FALSE)
l <- l.orig
for(i in seq_along(l)) l[[i]][[1]] <- na.locf(l[[i]][[1]], na.rm = FALSE)
l <- l.orig
lapply(l, function(x) replace(x, 1, na.locf(x[, 1], na.rm = FALSE)))
If each component of l has only one column (which is the case in the l shown in the question) a simplification is possible.
l <- l.orig
for(i in seq_along(l)) l[[i]] <- na.locf(l[[i]], na.rm = FALSE)
l <- l.orig
lapply(l, na.locf, na.rm = FALSE)
If the data frames have the same number of rows we could represent l as a data frame and apply na.locf to it assuming it is ok to apply na.locf to all columns.
l <- l.orig
DF <- data.frame(l)
na.locf(DF, na.rm = FALSE)
Note
library(dplyr)
l.orig <-
list(structure(list(...5 = c("euro", "euro", NA, NA, NA, NA,
NA, "dollar", NA)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame")), structure(list(...5 = c(NA, "euro", NA,
NA, NA, NA, NA, "dollar", NA)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame")))
Maybe you want something like this. I assume you want to apply this across both dataframes in the list.
library(tidyverse)
l |>
map(~fill(.x, everything(), .direction = "down"))
#> [[1]]
#> # A tibble: 9 x 1
#> ...5
#> <chr>
#> 1 euro
#> 2 euro
#> 3 euro
#> 4 euro
#> 5 euro
#> 6 euro
#> 7 euro
#> 8 dollar
#> 9 dollar
#>
#> [[2]]
#> # A tibble: 9 x 1
#> ...5
#> <chr>
#> 1 <NA>
#> 2 euro
#> 3 euro
#> 4 euro
#> 5 euro
#> 6 euro
#> 7 euro
#> 8 dollar
#> 9 dollar
I am working with a data frame in R in which a column contains gene IDs separated by slashes, like the following:
geneIDs <- c("100/1000/100008586","1277/63923/8516","1133/1132/1956/8516")
#> geneIDs
#> [1] "100/1000/100008586" "1277/63923/8516" "1133/1132/1956/8516"
I need to convert each of the gene IDs to its Gene Symbol based on a data.frame that contains, in each row, a gene ID and its corresponding Gene Symbol, as shown below:
#> head(gene_symbols)
ENTREZID SYMBOL
1 1 A1BG
2 10 NAT2
3 100 ADA
4 1000 CDH2
5 10000 AKT3
6 100008586 GAGE12F
Using the first element from the geneIDs as an example, my expected outcome would look like:
#> geneIDs
#> [1] "ADA/CDH2/GAGE12F"
Thank you very much in advance!
Possible solution:
library(tidyverse) # provides %>%, mutate, as_tibble, unnest_longer, left_join
geneIDs <- c("100/1000/100008586","1277/63923/8516","1133/1132/1956/8516")
lookupTable <- structure(list(ENTREZID = c(1L, 10L, 100L, 1000L, 10000L, 100008586L
), SYMBOL = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
)) %>%
mutate(ENTREZID = as.character(ENTREZID))
as_tibble(x = geneIDs) %>%
mutate(value = strsplit(geneIDs, split = "/")) %>%
unnest_longer(value) %>%
left_join(lookupTable, by = c("value" = "ENTREZID"))
Which gives:
# A tibble: 10 × 2
value SYMBOL
<chr> <chr>
1 100 ADA
2 1000 CDH2
3 100008586 GAGE12F
4 1277 NA
5 63923 NA
6 8516 NA
7 1133 NA
8 1132 NA
9 1956 NA
10 8516 NA
Or to return exactly what you specified:
geneString <- as_tibble(x = geneIDs) %>%
mutate(value = strsplit(geneIDs, split = "/")) %>%
unnest_longer(value) %>%
left_join(lookupTable, by = c("value" = "ENTREZID")) %>%
filter(!is.na(SYMBOL)) %>%
pull(SYMBOL)
paste(geneString, collapse = "/")
"ADA/CDH2/GAGE12F"
You could split the strings at the / and match each piece to the ENTREZID column to look up the SYMBOL. Replace any non-matches with the original string fragment, and paste the result together, collapsing with "/":
sapply(strsplit(geneIDs, '/'), function(x) {
y <- gene_symbols$SYMBOL[match(x, gene_symbols$ENTREZID)]
y[is.na(y)] <- x[is.na(y)]
paste0(y, collapse = '/')
})
#> [1] "ADA/CDH2/GAGE12F" "1277/63923/8516" "1133/1132/1956/8516"
You can do this:
library(tidyverse)
geneIDs %>%
map(~ {vec <- gene_symbols$SYMBOL[gene_symbols$ENTREZID %in% unlist(str_split(.x, '/'))] # gene_symbols is the lookup data frame from the question
if(length(vec) > 0) {
paste(vec, collapse = '/')
}}) %>%
keep(~ length(.x) > 0)
[[1]]
[1] "ADA/CDH2/GAGE12F"
Perhaps gsubfn can be used here
library(gsubfn)
library(tibble)
gsubfn("\\d+", as.list(deframe(gene_symbols)), geneIDs)
[1] "ADA/CDH2/GAGE12F" "1277/63923/8516" "1133/1132/1956/8516"
data
gene_symbols <- structure(list(ENTREZID = c(1L, 10L, 100L, 1000L,
10000L, 100008586L
), SYMBOL = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F"
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6"))
geneIDs <- c("100/1000/100008586","1277/63923/8516","1133/1132/1956/8516")
Given the following data
df1 <- structure(list(ID = 1:3, alpha_1 = c(2L, 2L, 3L),
alpha_2 = c(1L, 2L,
3L), alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA), beta_1 = c(NA,
2L, NA), beta_2 = c(3L, NA, 2L), charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
I'm trying to coalesce all columns sharing the same initial prefix name (i.e. coalesce alpha_1, alpha_2, alpha_3, alpha_4, and coalesce beta_1 beta_2, etc.), but from both the left and right sides. That is, I want to generate two new variables, say 'alpha_left' and 'alpha_right', whose columns would be, in this example, (2, 2, 3) and (3, 4, 2) respectively (first non-missing elements from the left and right side of the dataframe).
User @akrun offered a great solution for the coalescing part here, but I'm unsure how to create two new variables from both the left and right coalesces.
Here is an option in tidyverse
Reshape to 'long' format - pivot_longer
Grouped by 'ID'
Do the summarise across the columns 'alpha' till 'charlie'
Get the column name - cur_column()
Create a tibble with the first non-NA element from the left and the right
Change the column names by appending the 'nm1' as prefix
Finally, unnest the list columns created in summarise
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
pivot_longer(cols = contains("_"),
names_to = c( ".value", "grp"), names_sep = "_") %>%
group_by(ID) %>%
summarise(across(alpha:charlie, ~ {
nm1 <- cur_column()
tbl1 <- tibble(left= .[complete.cases(.)][1],
right = rev(.)[complete.cases(rev(.))][1]);
names(tbl1) <- str_c(nm1, "_", names(tbl1))
list(tbl1)})) %>%
unnest(c(alpha, beta, charlie))
-output
# A tibble: 3 x 7
ID alpha_left alpha_right beta_left beta_right charlie_left charlie_right
<int> <int> <int> <int> <int> <int> <int>
1 1 2 3 3 3 1 1
2 2 2 4 2 2 2 2
3 3 3 2 2 2 1 1
Or using base R
lst1 <- lapply(split.default(df1[-1], sub("_\\d+$", "", names(df1)[-1])),
function(x) {
x1 <- apply(x, 1, function(y) {
y1 <- na.omit(y)
if(length(y1) > 1 ) y1[c(1, length(y1))] else y1[1]
})
if(is.vector(x1)) as.data.frame(matrix(x1)) else as.data.frame(t(x1))
})
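For reference, here is a self-contained base R sketch that goes all the way to the final data frame. first_last() is a hypothetical helper introduced for this example; it always returns both a left and a right value per row, even when a row has only one non-NA entry.

```r
df1 <- data.frame(ID = 1:3,
                  alpha_1 = c(2L, 2L, 3L), alpha_2 = c(1L, 2L, 3L),
                  alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA),
                  beta_1 = c(NA, 2L, NA), beta_2 = c(3L, NA, 2L),
                  charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA, 2L, NA))

# Group the non-ID columns by the prefix in front of "_<number>"
groups <- split.default(df1[-1], sub("_\\d+$", "", names(df1)[-1]))

# Hypothetical helper: first and last non-NA value in each row of a block
first_last <- function(block) {
  t(apply(block, 1, function(row) {
    ok <- unname(row[!is.na(row)])
    # max(..., 1L) makes an all-NA row yield NA instead of a zero-length value
    c(left = ok[1], right = ok[max(length(ok), 1L)])
  }))
}

# One two-column data frame per prefix, named <prefix>_left / <prefix>_right
res <- lapply(names(groups), function(nm) {
  out <- as.data.frame(first_last(groups[[nm]]))
  names(out) <- paste(nm, names(out), sep = "_")
  out
})

result <- cbind(df1["ID"], do.call(cbind, res))
result
```

With the sample data, alpha_left comes out as (2, 2, 3) and alpha_right as (3, 4, 2), matching the values requested in the question.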
You could also do:
library(tidyverse)
df1[-1] %>%
split.default(sub("_\\d+", "", names(.))) %>%
imap_dfc(~data.frame(left = coalesce(!!!.x), # first non-NA from the left
right = coalesce(!!!rev(.x))) %>% # first non-NA from the right
set_names(paste(.y, names(.), sep="_")))
alpha_left alpha_right beta_left beta_right charlie_left charlie_right
1 2 3 3 3 1 1
2 2 4 2 2 2 2
3 3 2 2 2 1 1
One more approach, not as elegant as @Onyambu's:
library(tidyverse)
df1[-1] %>%
split.default(sub("_\\d+", "", names(.))) %>%
imap_dfc(~ .x %>% rowwise() %>%
mutate(!!paste0(.y, '_left') := head(na.omit(c_across(everything())),1),
!!paste0(.y, '_right') := tail(na.omit(c_across(!last_col())),1),
.keep = 'none' )
)
#> # A tibble: 3 x 6
#> # Rowwise:
#> alpha_left alpha_right beta_left beta_right charlie_left charlie_right
#> <int> <int> <int> <int> <int> <int>
#> 1 2 3 3 3 1 1
#> 2 2 4 2 2 2 2
#> 3 3 2 2 2 1 1
Created on 2021-06-19 by the reprex package (v2.0.0)
Another option
library(tidyverse)
df1 <- structure(list(ID = 1:3, alpha_1 = c(2L, 2L, 3L),
alpha_2 = c(1L, 2L,
3L), alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA), beta_1 = c(NA,
2L, NA), beta_2 = c(3L, NA, 2L), charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
df1 %>%
pivot_longer(cols = -ID, names_sep = "_", names_to = c(".value", "set")) %>%
group_by(ID) %>%
fill(alpha:charlie, .direction = "updown") %>%
filter(set %in% range(set)) %>%
mutate(set = c("left", "right")) %>%
pivot_wider(id_cols = ID, names_from = set, values_from = alpha:charlie)
#> # A tibble: 3 x 7
#> # Groups: ID [3]
#> ID alpha_left alpha_right beta_left beta_right charlie_left charlie_right
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 2 3 3 3 1 1
#> 2 2 2 4 2 2 2 2
#> 3 3 3 2 2 2 1 1
Created on 2021-06-20 by the reprex package (v2.0.0)
I have two data frames. dfOne is made like this:
X Y Z T J
3 4 5 6 1
1 2 3 4 1
5 1 2 5 1
and dfTwo is made like this
C.1 C.2
X Z
Y T
I want to obtain a new data frame containing only the rows where the X, Y, Z and T values are simultaneously greater than specific thresholds.
Example. I need simultaneously (in the same row):
X, Y > 2
Z, T > 4
I need to use the second data frame to reach my objective, I expect something like:
dfTwo$C.1>2
so the result would be a new dataframe with this structure:
X Y Z T J
3 4 5 6 1
How could I do it?
Here is a base R method with Map and Reduce.
# build lookup table of thresholds relative to variable name
vals <- setNames(c(2, 2, 4, 4), unlist(dat2))
# subset data.frame
dat[Reduce("&", Map(">", dat[names(vals)], vals)), ]
X Y Z T J
1 3 4 5 6 1
Here, Map returns a list of length 4 with logical variables corresponding to each comparison. This list is passed to Reduce which returns a single logical vector with length corresponding to the number of rows in the data.frame, dat. This logical vector is used to subset dat.
data
dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L,
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X",
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA,
-3L))
dat2 <-
structure(list(C.1 = structure(1:2, .Label = c("X", "Y"), class = "factor"),
C.2 = structure(c(2L, 1L), .Label = c("T", "Z"), class = "factor")), .Names = c("C.1",
"C.2"), class = "data.frame", row.names = c(NA, -2L))
We can use the purrr package
Here is the input data.
# Data frame from lmo's solution
dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L,
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X",
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA,
-3L))
# A numeric vector to show the threshold values
# Notice that columns without any requirements need NA
vals <- c(X = 2, Y = 2, Z = 4, T = 4, J = NA)
Here is the implementation
library(purrr)
map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) %>% na.omit()
# A tibble: 1 x 5
X Y Z T J
<int> <int> <int> <int> <int>
1 3 4 5 6 1
map2_dfc loops through each column in dat and the corresponding value in vals, applying the given function to each pair. ~ifelse(.x > .y | is.na(.y), .x, NA) means that if the number in a column is larger than the corresponding value in vals, or that value in vals is NA, the output is the original value from the column; otherwise, the value is replaced with NA. The output of map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) is therefore a data frame with NA values in the rows where a condition is not met. Finally, na.omit removes those rows.
Update
Here I demonstrate how to convert the dfTwo data frame to the vals vector in my example.
First, let's create the dfTwo data frame.
dfTwo <- read.table(text = "C.1 C.2
X Z
Y T",
header = TRUE, stringsAsFactors = FALSE)
dfTwo
C.1 C.2
1 X Z
2 Y T
To complete the task, I load the dplyr and tidyr packages.
library(dplyr)
library(tidyr)
Now I begin the transformation of dfTwo. The first step is to use stack function to convert the format.
dfTwo2 <- dfTwo %>%
stack() %>%
setNames(c("Col", "Group")) %>%
mutate(Group = as.character(Group))
dfTwo2
Col Group
1 X C.1
2 Y C.1
3 Z C.2
4 T C.2
The second step is to add the threshold information. One way to do this is to create a look-up table showing the association between Group and Value
threshold_df <- data.frame(Group = c("C.1", "C.2"),
Value = c(2, 4),
stringsAsFactors = FALSE)
threshold_df
Group Value
1 C.1 2
2 C.2 4
And then we can use the left_join function to combine the data frame.
dfTwo3 <- dfTwo2 %>% left_join(threshold_df, by = "Group")
dfTwo3
Col Group Value
1 X C.1 2
2 Y C.1 2
3 Z C.2 4
4 T C.2 4
Now it is the third step. Notice that there is a column called J which does not need any threshold. So we need to add this information to dfTwo3. We can use the complete function from tidyr. The following code completes the data frame by adding Col in dat but not in dfTwo3 and NA to the Value.
dfTwo4 <- dfTwo3 %>% complete(Col = colnames(dat))
dfTwo4
# A tibble: 5 x 3
Col Group Value
<chr> <chr> <dbl>
1 J <NA> NA
2 T C.2 4
3 X C.1 2
4 Y C.1 2
5 Z C.2 4
The fourth step is to arrange dfTwo4 in the right order. We can achieve this by turning Col into a factor and assigning the levels based on the order of the column names in dat.
dfTwo5 <- dfTwo4 %>%
mutate(Col = factor(Col, levels = colnames(dat))) %>%
arrange(Col) %>%
mutate(Col = as.character(Col))
dfTwo5
# A tibble: 5 x 3
Col Group Value
<chr> <chr> <dbl>
1 X C.1 2
2 Y C.1 2
3 Z C.2 4
4 T C.2 4
5 J <NA> NA
We are almost there. Now we can create vals from dfTwo5.
vals <- dfTwo5$Value
names(vals) <- dfTwo5$Col
vals
X Y Z T J
2 2 4 4 NA
Now we are ready to use the purrr package to filter the data.
The above is a breakdown of the steps. We can combine all of them into the following code for simplicity.
library(dplyr)
library(tidyr)
threshold_df <- data.frame(Group = c("C.1", "C.2"),
Value = c(2, 4),
stringsAsFactors = FALSE)
dfTwo2 <- dfTwo %>%
stack() %>%
setNames(c("Col", "Group")) %>%
mutate(Group = as.character(Group)) %>%
left_join(threshold_df, by = "Group") %>%
complete(Col = colnames(dat)) %>%
mutate(Col = factor(Col, levels = colnames(dat))) %>%
arrange(Col) %>%
mutate(Col = as.character(Col))
vals <- dfTwo2$Value
names(vals) <- dfTwo2$Col
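If a threshold vector without the NA entry for J is enough (for example, for a Map/Reduce style filter), a shorter base R route is possible. This is only a sketch: the names thresholds and st are introduced for illustration, and the mapping C.1 = 2, C.2 = 4 is the same assumption as in threshold_df.

```r
dat <- data.frame(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L, 3L, 2L),
                  T = c(6L, 4L, 5L), J = c(1L, 1L, 1L))
dfTwo <- data.frame(C.1 = c("X", "Y"), C.2 = c("Z", "T"),
                    stringsAsFactors = FALSE)

# Assumed threshold for each column of dfTwo (same values as threshold_df)
thresholds <- c(C.1 = 2, C.2 = 4)

# stack() gives one row per entry: `values` is the variable name in dat,
# `ind` is the dfTwo column it came from
st <- stack(dfTwo)

# Look up each variable's threshold through its dfTwo column
vals <- setNames(thresholds[as.character(st$ind)], st$values)
vals

# Keep the rows of dat where every listed column exceeds its threshold
dat[Reduce("&", Map(">", dat[names(vals)], vals)), ]
```

Because J is not listed in vals, it is never tested; the map2_dfc approach, by contrast, needs one entry per column of dat, which is why the longer pipeline adds an NA for J.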
dfOne[Reduce(intersect, list(which(dfOne["X"] > 2),
which(dfOne["Y"] > 2),
which(dfOne["Z"] > 4),
which(dfOne["T"] > 4))),]
# X Y Z T J
#1 3 4 5 6 1
Or iteratively (so fewer inequalities are tested):
vals = c(X = 2, Y = 2, Z = 4, T = 4) # from @lmo's answer
dfOne[Reduce(intersect, lapply(names(vals), function(x) which(dfOne[x] > vals[x]))),]
# X Y Z T J
#1 3 4 5 6 1
I'm writing this assuming that the second DF is meant to categorize the fields in the first DF. It's way simpler if you don't need to use the second one to define the conditions:
dfNew = dfOne[dfOne$X > 2 & dfOne$Y > 2 & dfOne$Z > 4 & dfOne$T > 4, ]
Or, using dplyr:
library(dplyr)
dfNew = dfOne %>% filter(X > 2 & Y > 2 & Z > 4 & T > 4)
In case that's all you need, I'll save this comment while I poke at the more complicated version of the question.