Convert data in 10 columns into rows - R

I have a data set as follows:
a <- data.frame(X1 = "A", X2 = "B", X3 = "C", X4 = "D", X5 = "0",
                X6 = "0", X7 = "0", X8 = "0", X9 = "0", X10 = "0")
Basically it is a 1-row by 10-column data.frame.
The resulting data.frame should have the column elements of a as rows rather than columns, and any columns of a that are equal to "0" should not be present in the new data.frame. For example:
# b
# [1] A
# [2] B
# [3] C
# [4] D

Use a transpose and subset with a logical condition:
data.frame("b" = t(a)[t(a) != 0])
On a second look, you don't actually need the transpose:
data.frame("b" = a[a != 0])

You could unlist and then subset
subset(data.frame(b = unlist(a), row.names = NULL), b != 0)
# b
#1 A
#2 B
#3 C
#4 D

Using the pivot_longer function, you can reshape your data frame into a longer format and then filter out values that are "0". With the function column_to_rownames from the tibble package, you can convert the first column to row names.
Altogether, you can do something like this:
library(tidyr)
library(dplyr)
library(tibble)
a %>% pivot_longer(everything(), names_to = "Row", values_to = "b") %>%
filter(b != "0") %>%
column_to_rownames("Row")
b
X1 A
X2 B
X3 C
X4 D


Replace sets of rows in a tibble with a single row

There is a similar question for data.table at Replace sets of rows in a data.table with a single row, but I am looking for an equivalent solution in the tidyverse. So, I have a tibble like:
DT <- tibble (
id=c(1,1,1,1,2,2,2,2),
location = c("a","b","c","d","a","b","d","e"),
seq = c(1,2,3,4,1,2,3,4))
For every id, I want to look for the sequence b, c, d and, if it exists, replace the rows with b and c with a single row, say z. The other variables should retain the values of the earlier b row (in this case id and seq).
So in this case, the new tibble should be
DT.Tobe <- tibble(
  id = c(1, 1, 1, 2, 2, 2, 2),
  location = c("a", "z", "d", "a", "b", "d", "e"),
  seq = c(1, 2, 4, 1, 2, 3, 4))
I was not able to find even a starting point for this...
library(dplyr)
# library(zoo) # rollapply
DT %>%
  group_by(id) %>%
  mutate(
    isseq = zoo::rollapply(location, 3,
                           FUN = function(z) identical(z, c("b", "c", "d")),
                           align = "left", partial = TRUE),
    isseq = isseq | lag(isseq, default = FALSE)
  ) %>%
  group_by(id, isseq) %>%
  summarize(
    across(everything(), ~ {
      if (cur_group()$isseq) {
        if (cur_column() == "location") "z" else first(.)
      } else .
    })
  ) %>%
  ungroup() %>%
  select(-isseq)
# # A tibble: 7 x 3
# id location seq
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 d 4
# 3 1 z 2
# 4 2 a 1
# 5 2 b 2
# 6 2 d 3
# 7 2 e 4
The order is changed because group_by(isseq) tends to keep "like" rows together. This should be easy to fix, either by re-ordering afterwards (assuming "seq" is meaningful) or by pre-adding an order variable and using it later.
If it is possible for a single id to have multiple such sequences (if so, say so), then run-length encoding will be needed here as well, to differentiate between different b-c-d sequences in the same id; see the sketch below.
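For example, a rough sketch addressing both points (untested; it adds a per-run id with cumsum so that separate b-c-d sequences collapse independently, and restores the original order with arrange, assuming seq reflects that order):
library(dplyr)
DT %>%
  group_by(id) %>%
  mutate(
    isseq = zoo::rollapply(location, 3,
                           FUN = function(z) identical(z, c("b", "c", "d")),
                           align = "left", partial = TRUE),
    isseq = isseq | lag(isseq, default = FALSE),
    run   = cumsum(isseq != lag(isseq, default = first(isseq)))  # id for each run of isseq
  ) %>%
  group_by(id, run, isseq) %>%
  summarize(
    across(everything(), ~ {
      if (cur_group()$isseq) {
        if (cur_column() == "location") "z" else first(.)
      } else .
    }),
    .groups = "drop"
  ) %>%
  arrange(id, seq) %>%   # restore the original ordering
  select(-isseq, -run)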
One possible option would be to use a for-loop. Here is my pseudocode.
for (i in rev(seq_len(nrow(DT) - 1))) {  # repeat this if statement for each row in your DT (bottom-up so deletions don't shift later indices)
  if (DT$location[i] == "b" && DT$location[i + 1] == "c") {  # if this item is "b" and the next item is "c"
    DT$location[i] <- "z"                # replace the "b" with the "z" character
    DT <- DT[-(i + 1), ]                 # delete the "c" row
  }
}
The dplyr cheat sheet has some useful functions that may help with the if-statement part of this pseudocode.
What are your thoughts?

If string has a certain character, fill an empty cell in the same row with a certain value

Say I have the following data frame:
# S/N a b
# 1 L1-S2 <blank>
# 2 T1-T3 <blank>
# 3 T1-L2 <blank>
How do I turn the above data frame into this:
# S/N a b
# 1 L1-S2 LS
# 2 T1-T3 T
# 3 T1-L2 TL
I am thinking of writing a loop, where
For x in column a,
If first character in x == L AND 4th character in x == S,
fill the corresponding cell in b with LS
and so on...
However, I am not sure how to implement it, or if there is a more elegant way of doing this.
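For reference, a literal sketch of that loop in base R (a hedged illustration only, assuming the df1 defined in the data block of the answers below; it also collapses the two letters into one when they are equal):
for (i in seq_len(nrow(df1))) {
  first  <- substr(df1$a[i], 1, 1)   # letter before the first digit
  fourth <- substr(df1$a[i], 4, 4)   # letter after the hyphen
  df1$b[i] <- if (first == fourth) first else paste0(first, fourth)
}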
We can extract the upper case letters and remove the repeated ones
library(stringr)
library(dplyr)
df1 %>%
  mutate(b = str_replace(str_replace(a, "^([A-Z])\\d+-([A-Z])\\d+", "\\1\\2"),
                         "(.)\\1+", "\\1"))
Output:
# S_N a b
#1 1 L1-S2 LS
#2 2 T1-T3 T
#3 3 T1-L2 TL
Or another option is str_extract_all to extract the upper case letters, loop over the list with map, and paste the unique elements together:
library(purrr)
df1 %>%
mutate(b = str_extract_all(a, "[A-Z]") %>%
map_chr(~ str_c(unique(.x), collapse="")))
Or a corresponding base R option for the first tidyverse approach:
df1$b <- sub("(.)\\1+", "\\1", gsub("[0-9-]+", "", df1$a))
Or with strsplit
df1$b <- sapply(strsplit(df1$a, "[0-9-]+"),
function(x) paste(unique(x), collapse=""))
data
df1 <- structure(list(S_N = 1:3, a = c("L1-S2", "T1-T3", "T1-L2"),
                      b = c(NA, NA, NA)),
                 class = "data.frame", row.names = c(NA, -3L))

How do I mutate a list-column to a common one leaving only the last value when there is a vector in the list?

I am trying to use purrr::map_chr to get the last element of the vector in a list-column as the actual value, in case it exists.
The reproducible example:
library(data.table)
library(purrr)
x <- data.table(one = c("a", "b", "c"), two = list("d", c("e","f","g"), NULL))
I want the data as it is, but with my list-column changed to a regular character column, with "g" as the value for x[2, 2]. What I've tried:
x %>% mutate(two = ifelse(is.null(.$two), map_chr(~NA_character_), map_chr(~last(.))))
The result should be the following:
# one two
# a d
# b g
# c NA
Thanks in advance!
Here is an option. We can use if/else instead of ifelse here
library(dplyr)
library(tidyr)
x %>%
mutate(two = map_chr(two, ~ if(is.null(.x)) NA_character_ else last(.x)))
# one two
#1 a d
#2 b g
#3 c NA
Or replace the NULL elements with NA and extract the last
x %>%
mutate(two = map_chr(two, ~ last(replace(.x, is.null(.), NA))))
I would propose this solution, which is a bit cleaner:
library(tidyverse)
df <- tibble(one = c("a", "b", "c"), two = list("d", c("e","f","g"), NULL))
df %>%
mutate_at("two", replace_na, NA_character_) %>%
mutate_at("two", map_chr, last)

Create data frame, matching based on first elements of a list

I want to create a data frame based on the first element of a list. Specifically, I have:
One vector containing variable names (names1);
One list whose elements each contain two variables (some of those names and the values).
The end product should be a data.frame with "names1" that contains as many rows as there are matching cases.
If there is no match between a specific list element and the vector, the value should be NA.
The values can also be factors or strings.
names1 <- c("a", "b", "c")
dat1 <- data.frame(names1 =c("a", "b", "c", "f"),values= c("val1", 13, 11, 0))
dat1$values <- as.factor(dat1$values)
dat2 <- data.frame(names1 =c("a", "b", "x"),values= c(12, 10, 2))
dat2$values <- as.factor(dat2$values)
list1 <- list(dat1, dat2)
The result should be a new data frame with the names as variables and, for each list element, all values that match:
a b c
val1 13 11
12 10 NA
One option would be to loop through the list ('list1'), filter the 'names' column based on the 'names' vector, convert it to a single dataset while creating an identification column with .id, spread from 'long' to 'wide' and remove the 'grp' column
library(tidyverse)
map_df(list1, ~ .x %>%
filter(names %in% !! names), .id = 'grp') %>%
spread(names, values) %>%
select(-grp)
# a b c
#1 25 13 11
#2 12 10 NA
Or another option is to bind the datasets together with bind_rows, create a grouping id 'grp' to specify the list element, filter the rows by keeping only the 'names' values that match the 'names' vector, and spread from 'long' to 'wide':
bind_rows(list1, .id = 'grp') %>%
filter(names %in% !! names) %>%
spread(names, values)
NOTE: It is better not to use the names of base functions as object names (names). Also, to avoid confusion, the object name should be different from the column names of the data frame.
It can also be done with base R only. Create a group identifier with Map, rbind the list elements into a single dataset, subset the rows by keeping only the values from the 'names' vector, and reshape from 'long' to 'wide':
df1 <- subset(do.call(rbind, Map(cbind, list1, ind = seq_along(list1))),
              names %in% .GlobalEnv$names)
reshape(df1, idvar = 'ind', direction = 'wide', timevar = 'names')[-1]
A mix of base R and dplyr. For every list element we create a data frame with 1 row. Using dplyr's rbind_list (superseded by bind_rows in current dplyr), we row-bind them together and then subset only the columns we need using names.
library(dplyr)
rbind_list(lapply(list1, function(x)
setNames(data.frame(t(x$values)), x$names)))[names]
# a b c
# <dbl> <dbl> <dbl>
#1 25 13 11
#2 12 10 NA
The output without the subsetting looks like this:
rbind_list(lapply(list1, function(x) setNames(data.frame(t(x$values)), x$names)))
# a b c x
# <dbl> <dbl> <dbl> <dbl>
#1 25 13 11 NA
#2 12 10 NA 2
In base R
t(sapply(list1, function(x) setNames(x$values, names)[match(names, x$names)]))
# a b c
# [1,] 25 13 11
# [2,] 12 10 NA
Using base R only
body <- do.call('rbind', lapply(list1, function(list.element) {
  element.vals <- list.element[['values']]
  element.names <- list.element[['names']]
  names(element.vals) <- element.names
  return.vals <- element.vals[names]
  if (all(is.na(return.vals))) NULL else return.vals
}))
df <- as.data.frame(body)
names(df) <- names
df
For the sake of completeness, here is a data.table approach using dcast() and rowid():
library(data.table)
nam <- names1 # avoid name conflict with column name
rbindlist(list1)[names1 %in% nam, dcast(.SD, rowid(names1) ~ names1)][, names1 := NULL][]
a b c
1: val1 13 11
2: 12 10 <NA>
Or, more concisely, pick columns after reshaping:
library(data.table)
rbindlist(list1)[, dcast(.SD, rowid(names1) ~ names1)][, .SD, .SDcols = names1]

Select column name based on data frame content R

I want to build a matrix or data frame consisting of the names of the columns where the corresponding element in the data frame is not NA. For example, suppose I have:
zz <- data.frame(a = c(1, NA, 3, 5),
b = c(NA, 5, 4, NA),
c = c(5, 6, NA, 8))
which gives:
a b c
1 1 NA 5
2 NA 5 6
3 3 4 NA
4 5 NA 8
I want to recognize each NA and build a new matrix or df that looks like:
a c
b c
a b
a c
There will be the same number of NAs in each row of the input matrix/df. I can't seem to get the right code to do this. Suggestions appreciated!
library(dplyr)
library(tidyr)
zz %>%
  mutate(k = row_number()) %>%
  gather(column, value, a, b, c) %>%
  filter(!is.na(value)) %>%
  group_by(k) %>%
  summarise(temp_var = paste(column, collapse = " ")) %>%
  separate(temp_var, into = c("var1", "var2"))
# A tibble: 4 × 3
k var1 var2
* <int> <chr> <chr>
1 1 a c
2 2 b c
3 3 a b
4 4 a c
Here's a possible vectorized base R approach
indx <- which(!is.na(zz), arr.ind = TRUE)
matrix(names(zz)[indx[order(indx[, "row"]), "col"]], ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "c"
#[3,] "a" "b"
#[4,] "a" "c"
This finds the non-NA indices, sorts them by row order, and then subsets the names of your zz data set according to the sorted index. You can wrap it in as.data.frame if you prefer that over a matrix.
EDIT: transpose the data frame once before processing, so we don't need to transpose twice inside the loop as in the first version.
cols <- names(zz)
for (column in cols) {
  zz[[column]] <- ifelse(is.na(zz[[column]]), NA, column)
}
t_zz <- t(zz)
cols <- vector("list", length = ncol(t_zz))
for (i in 1:ncol(t_zz)) {
  cols[[i]] <- na.omit(t_zz[, i])
}
new_dt <- as.data.frame(t(do.call("cbind", cols)))
The tricky part here is that your goal actually changes the data frame structure, so the task of "remove NA in each row" has to be done row by row, building a new data frame, since every column in each row could come from a different column of the original data frame.
zz[1, ] is a one-row data frame; use t to convert it into a vector so we can use na.omit, then transpose it back into a row.
I used 2 for loops, but for loops are not necessarily bad in R. The first one is vectorized over each column. The second one needs to be done row by row anyway.
EDIT: growing objects is very bad for performance in R. I knew I could use rbindlist from data.table, which can take a list of data frames, but the OP doesn't want new packages. My first attempt just used rbind, which cannot take a list as input. Later I found that an alternative is to use do.call. It's still slower than rbindlist, though.
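For illustration, the difference is roughly this (a hypothetical two-element list, made up just to show the calls):
rows <- list(data.frame(V1 = "a", V2 = "c"),
             data.frame(V1 = "b", V2 = "c"))
do.call(rbind, rows)            # base R: bind a list of data frames, no extra packages
# data.table::rbindlist(rows)   # usually faster, but requires data.table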
