How do I specify pivot_wider for an entire dataframe? - r

I am able to pivot_wider for a specific column using the following:
new_df <- pivot_wider(old_df, names_from = col10, values_from = value_col, values_fn = list)
I would like to pivot_wider with every column in a dataframe (minus an id column). What is the best way to do this? Should I use a loop or is there a way that this function takes the whole dataframe?
To clarify, using the below sample dataframes, I am able to go from old_df to new_df using the pivot_wider function I listed above. I would like to now go from old_df2 to new_df2.
old_df <- structure(list(id = c("1", "1", "2"), col10 = c("yellow",
"green", "green"), value_col = c("1", "1", "1")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
old_df2 <- structure(list(id = c("1", "1", "2"), col10 = c("yellow",
"green", "green"), col11 = c("dog",
"cat", "dog"), value_col = c("1", "1", "1")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
new_df <- pivot_wider(old_df, names_from = col10, values_from = value_col, values_fn = list)
new_df2 <- structure(list(id = c("1", "2"), yellow = c("1", "NULL"), green = c("1", "1"), dog = c("1", "1"), cat = c("1", "NULL")), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))

If you would like a separate column for each distinct value across these two columns (or any number of columns), first use pivot_longer to stack those values into a single column, then use pivot_wider to spread them:
library(tidyr)
library(dplyr) # select(), group_by(), summarise() and across() come from dplyr
old_df2 %>%
pivot_longer(!c(id, value_col), names_to = "Cols", values_to = "vals") %>%
pivot_wider(names_from = vals, values_from = value_col) %>%
select(-Cols) %>%
group_by(id) %>%
summarise(across(everything(), ~ sum(as.numeric(.x), na.rm = TRUE)))
# A tibble: 2 x 5
id yellow dog green cat
<chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 2 0 1 1 0

Update 1
As per your update, here is a data.table option:
library(data.table)
dcast(
melt(setDT(old_df2),
id.var = "id",
measure.vars = patterns("^col\\d+")
),
id ~ value,
fun.aggregate = length,
fill = NA
)
which gives
id cat dog green yellow
1: 1 1 1 1 1
2: 2 NA 1 1 NA
Are you looking for something like below?
reshape(
transform(
old_df2,
q = ave(id, id, FUN = seq_along)
),
direction = "wide",
idvar = "id",
timevar = "q"
)
The output is
id col10.1 col11.1 value_col.1 col10.2 col11.2 value_col.2
1 1 yellow dog 1 green cat 1
3 2 green dog 1 <NA> <NA> <NA>

You could combine those columns and unnest them followed by pivot_wider:
library(tidyr)
library(dplyr)
old_df2 <- structure(list(id = c("1", "1", "2"), col10 = c("yellow",
"green", "green"), col11 = c("dog",
"cat", "dog"), value_col = c("1", "1", "1")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
old_df2 %>%
mutate(new_col = strsplit(paste(col10, col11, sep = "_"), "_"), .keep = "unused") %>%
unnest(new_col) %>%
pivot_wider(names_from = new_col, values_from = value_col)
#> # A tibble: 2 x 5
#> id yellow dog green cat
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1 1 1 1
#> 2 2 <NA> 1 1 <NA>
Created on 2021-08-25 by the reprex package (v2.0.1)
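For a quick cross-check of the wide layout without tidyr, the same id-by-value table can be built with base R's table(). This is only a sketch, not a replacement for the answers above. It assumes that every column other than id and value_col holds labels to spread, and that a 0/1 count is an acceptable stand-in for the value_col entries. The names label_cols, ids, labels and wide are introduced here for illustration.

```r
# Sample data from the question (plain data.frame for brevity)
old_df2 <- data.frame(
  id = c("1", "1", "2"),
  col10 = c("yellow", "green", "green"),
  col11 = c("dog", "cat", "dog"),
  value_col = c("1", "1", "1"),
  stringsAsFactors = FALSE
)

# Assumption: every column except id and value_col holds labels to spread
label_cols <- setdiff(names(old_df2), c("id", "value_col"))

# Stack the label columns into one vector, repeating id once per label column
ids    <- rep(old_df2$id, times = length(label_cols))
labels <- unlist(old_df2[label_cols], use.names = FALSE)

# Cross-tabulate: one row per id, one column per distinct label
wide <- table(id = ids, label = labels)
print(wide)
```

Cells with a count of 0 correspond to the NA entries in the pivot_wider results above.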

Related

Count words in each cell of a dataframe in R

I have a dataframe that looks like
df <- structure(list(Variable = c("Factor1", "Factor2", "Factor3"),
Variable1 = c("word1, word2", "word1", "word1"),
Variable2 = c("word1", "word1, word2", "word1"),
Variable3 = c("word1, word2", "word1", "word1, word2, word3")),
row.names = c(NA, -3L), class = "data.frame")
and would like to create a df that counts occurrences of words in each cell (separated by ",") and input the number into each cell.
df2 <- structure(list(Variable = c("Factor1", "Factor2", "Factor3"),
Variable1 = c("2", "1", "1"),
Variable2 = c("1", "2", "1"),
Variable3 = c("2", "1", "3")),
row.names = c(NA, -3L), class = "data.frame")
Would someone be able to help me in how this would be done?
Thanks!
Using dplyr and stringi:
library(dplyr)
df %>%
mutate(across(matches("variable\\d{1,}"), stringi::stri_count_words))
Variable Variable1 Variable2 Variable3
1 Factor1 2 1 2
2 Factor2 1 2 1
3 Factor3 1 1 3
If you want a base R solution, you could try this. Count the characters in each value with nchar, then subtract the character count after removing the commas. The difference is the number of commas, and adding 1 gives the number of words/phrases separated by commas. This should be fast too (also see this answer).
cbind(df[1], t(apply(df[-1], 1, \(x) {
nchar(x) - nchar(gsub(",", "", x, fixed = TRUE)) + 1
})))
Output
Variable Variable1 Variable2 Variable3
1 Factor1 2 1 2
2 Factor2 1 2 1
3 Factor3 1 1 3
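The comma-counting idea can also be written by splitting each cell and counting the pieces. The following is a minimal sketch of that variant, assuming cells are never empty (an empty string would give 0 here, whereas the comma-counting version gives 1). The name counts is introduced for illustration.

```r
# Sample data from the question
df <- data.frame(
  Variable  = c("Factor1", "Factor2", "Factor3"),
  Variable1 = c("word1, word2", "word1", "word1"),
  Variable2 = c("word1", "word1, word2", "word1"),
  Variable3 = c("word1, word2", "word1", "word1, word2, word3"),
  stringsAsFactors = FALSE
)

# Split every cell on commas and count the resulting pieces, column by column
counts <- df
counts[-1] <- lapply(df[-1], function(x) lengths(strsplit(x, ",", fixed = TRUE)))
print(counts)
```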

Is there a way to capture the sequence of values based on their rank

Hi all, I have a dataframe. I need to create additional columns that show the position at which each category appears. For example, please refer to the expected output.
df
ColB ColA
X A>B>C
U B>C>A
Z C>A>B
Expected output
df1
ColB ColA A B C
X A>B>C 1 2 3
U B>C>A 3 1 2
Z C>A>B 2 3 1
We can first bring ColA into separate rows, group_by ColB, give a unique row number to each entry, and then convert the data into wide format using pivot_wider.
library(dplyr)
library(tidyr)
df %>%
mutate(ColC = ColA) %>%
separate_rows(ColC, sep = ">") %>%
group_by(ColB) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = ColC, values_from = row)
# ColB ColA A B C
# <fct> <fct> <int> <int> <int>
#1 X A>B>C 1 2 3
#2 U B>C>A 3 1 2
#3 Z C>A>B 2 3 1
data
df <- structure(list(ColB = structure(c(2L, 1L, 3L), .Label = c("U",
"X", "Z"), class = "factor"), ColA = structure(1:3, .Label = c("A>B>C",
"B>C>A", "C>A>B"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
We can do this in base R
df[LETTERS[1:3]] <- t(sapply(regmatches(df$ColA, gregexpr("[A-Z]",
df$ColA)), match, x = LETTERS[1:3]))
df
# ColB ColA A B C
#1 X A>B>C 1 2 3
#2 U B>C>A 3 1 2
#3 Z C>A>B 2 3 1
data
df <- structure(list(ColB = structure(c(2L, 1L, 3L), .Label = c("U",
"X", "Z"), class = "factor"), ColA = structure(1:3, .Label = c("A>B>C",
"B>C>A", "C>A>B"), class = "factor")), class = "data.frame",
row.names = c(NA,
-3L))
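The base R answer relies on match() returning, for each category, its position within the split string. A small sketch on a single value makes that step visible. The names parts and pos are introduced here for illustration.

```r
# One value of ColA, split into its categories in order of appearance
parts <- strsplit("B>C>A", ">", fixed = TRUE)[[1]]

# For each category, look up where it appears in that order
pos <- match(c("A", "B", "C"), parts)
names(pos) <- c("A", "B", "C")
print(pos)
```

Here A is found in third place, B in first and C in second, which corresponds to the row for U in the expected output.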

tidyr::spread resulting in multiple rows

I have a problem similar to the following one, but the solution presented at that link does not work for me:
tidyr spread does not aggregate data
I have a df in the following structure:
UndesiredIndex DesiredIndex DesiredRows Result
1 x1A x1 A 50,32
2 x1B x2 B 7,34
3 x2A x1 A 50,33
4 x2B x2 B 7,35
Using the code below:
dftest <- bd_teste %>%
select(-UndesiredIndex) %>%
spread(DesiredIndex, Result)
I expected the following result:
DesiredIndex A B
A 50,32 50,33
B 7,34 7,35
Although, I keep getting the following result:
DesiredIndex x1 x2
1 A 50.32 NA
2 B 7.34 NA
3 A NA 50.33
4 B NA 7.35
PS: Sometimes I force the column UndesiredIndex out with select(-UndesiredIndex), but I keep getting the following message:
Adding missing grouping variables: UndesiredIndex
Might be something easy to stack those rows, but I'm new to R and have been trying so hard to solve this but without success.
Thanks in advance!
We group by DesiredIndex, create a sequence column and then do the spread:
library(tidyverse)
df1 %>%
select(-UndesiredIndex) %>%
group_by(DesiredIndex) %>%
mutate(new = LETTERS[row_number()]) %>%
ungroup %>%
select(-DesiredIndex) %>%
spread(new, Result)
# A tibble: 2 x 3
# DesiredRows A B
# <chr> <chr> <chr>
#1 A 50,32 50,33
#2 B 7,34 7,35
Data
df1 <- structure(
list(
UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
DesiredIndex = c("x1", "x2", "x1", "x2"),
DesiredRows = c("A", "B", "A", "B"),
Result = c("50,32", "7,34", "50,33", "7,35")
),
class = "data.frame",
row.names = c("1", "2", "3", "4")
)
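The key step in the answer above is the within-group counter that becomes the new column names. As a rough base R illustration of what group_by(DesiredIndex) followed by LETTERS[row_number()] computes, ave() can produce the same counter. The names grp, counter and new are introduced here for illustration.

```r
# Grouping column from the sample data
grp <- c("x1", "x2", "x1", "x2")

# Running count of rows within each group, in original row order
counter <- ave(seq_along(grp), grp, FUN = seq_along)

# Turn the counter into labels that will serve as the new column names
new <- LETTERS[counter]
print(new)
```

Each DesiredIndex group gets its own 1, 2, ... sequence, so rows that previously clashed now carry distinct labels and can be spread into separate columns.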
Shorter, but more theoretically round-about.
Data
(Thanks to @akrun!)
df1 <- structure(
list(
UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
DesiredIndex = c("x1", "x2", "x1", "x2"),
DesiredRows = c("A", "B", "A", "B"),
Result = c("50,32", "7,34", "50,33", "7,35")
),
class = "data.frame",
row.names = c("1", "2", "3", "4")
)
This is a great technique for concatenating rows.
df1 %>%
group_by(DesiredRows) %>%
summarise(Result = paste(Result, collapse = "|")) %>% #<Concatenate rows
separate(Result, into = c("A", "B"), sep = "\\|") #<Separate by '|'
#> # A tibble: 2 x 3
#> DesiredRows A B
#> <chr> <chr> <chr>
#> 1 A 50,32 50,33
#> 2 B 7,34 7,35
Created on 2018-08-06 by the reprex package (v0.2.0).

Mutating via dplyr by a particular set of indices

I am working with a dataframe which follows a pattern of:
Key Data
Loc Place1
Value1 6
Value2 7
Loc Place2
Value3 8
Loc Place3
Value1 9
Value2 10
Loc Place4
Value3 11
It is a rough dataset where a pattern exists: in this example, the rows at seq(1,100,by=5) mark the first location of an observation. My goal is to change the key in those positions to something different, such as LocA rather than Loc, so that spread(key,value) gives me unique observations I can use for further analysis:
LocA Value1 Value2 Loc Value3
Place1 6 7 Place2 8
Place3 9 10 Place4 11
I have been using dplyr and a chain of other mutates and selects to get to this point, so I'm hoping to remain in the chain. I can see how to do it with appropriate subsetting outside of the chain, but I am having difficulty wrapping my head around a dplyr solution.
Your data:
df <- structure(list(Key = c("Loc", "Value1", "Value2", "Loc", "Value3",
"Loc", "Value1", "Value2", "Loc", "Value3"), Data = c("Place1",
"6", "7", "Place2", "8", "Place3", "9", "10", "Place4", "11")), .Names = c("Key",
"Data"), row.names = c(NA, -10L), class = "data.frame")
Is this workable?
library(dplyr)
library(tidyr)
df %>%
mutate(grp = (row_number() - 1) %/% 5) %>%
group_by(grp) %>%
mutate(
Key = ifelse(! duplicated(Key), Key, paste0(Key, "A"))
) %>%
ungroup() %>%
spread(Key, Data) %>%
select(-grp)
# Source: local data frame [2 x 5]
# Loc LocA Value1 Value2 Value3
# * <chr> <chr> <chr> <chr> <chr>
# 1 Place1 Place2 6 7 8
# 2 Place3 Place4 9 10 11
Here is another way to do this. I admit this one won't scale as well as the one from r2evans above.
df <- structure(list(Key = c("Loc", "Value1", "Value2", "Loc", "Value3",
"Loc", "Value1", "Value2", "Loc", "Value3"), Data = c("Place1",
"6", "7", "Place2", "8", "Place3", "9", "10", "Place4", "11")), .Names = c("Key",
"Data"), row.names = c(NA, -10L), class = "data.frame")
library(dplyr)
library(tidyr)
library(stringr) # for str_c()
df %>%
mutate(gid = ceiling(row_number() / 5)) %>%
group_by(gid) %>%
summarize(concatenated_text = str_c(Data, collapse = ",")) %>%
separate(concatenated_text, into = c("LocA", "Value1", "Value2", "Loc", "Value3"), sep=",")
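Both answers turn the row position into a block id so that every run of five rows forms one observation. The arithmetic can be checked on its own in base R. This is a sketch that assumes the blocks are always exactly five rows long, as in the sample data. The names n, block_size, zero_based and one_based are introduced here for illustration.

```r
n <- 10          # number of rows in the sample data
block_size <- 5  # assumption: every observation spans exactly five rows

# Integer division of the zero-based row position (first answer)
zero_based <- (seq_len(n) - 1) %/% block_size

# Rounding up the one-based row position (second answer)
one_based <- ceiling(seq_len(n) / block_size)

print(zero_based)
print(one_based)

# Rows where a new block starts
print(which(!duplicated(zero_based)))
```

The two ids differ only by an offset of one, so either works as a grouping variable.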

Manipulating all split data sets

I'm drawing a blank-- I have 51 sets of split data from a data frame that I had, and I want to take the mean of the height of each set.
print(dataset)
$`1`
ID Species Plant Height
1 A 1 42.7
2 A 1 32.5
$`2`
ID Species Plant Height
3 A 2 43.5
4 A 2 54.3
5 A 2 45.7
...
...
...
$`51`
ID Species Plant Height
134 A 51 52.5
135 A 51 61.2
I know how to run each individually, but with 51 split sections, it would take me ages.
I thought that
mean(dataset[,4])
might work, but it says that I have the wrong number of dimensions. I get now why that is incorrect, but I am no closer to figuring out how to average all of the heights.
The dataset is a list. We could use lapply/sapply/vapply etc to loop through the list elements and get the mean of the 'Height' column. Using vapply, we can specify the class and length of the output (numeric(1)). This will be useful for debugging.
vapply(dataset, function(x) mean(x[,4], na.rm=TRUE), numeric(1))
# 1 2 51
#37.60000 47.83333 56.85000
Or another option (if we have the same column names/number of columns for the data.frames in the list) would be to use rbindlist from data.table with the option idcol=TRUE to generate a single data.table. The '.id' column shows the names of the list elements. We group by '.id' and get the mean of the Height.
library(data.table)
rbindlist(dataset, idcol=TRUE)[, list(Mean=mean(Height, na.rm=TRUE)), by = .id]
# .id Mean
#1: 1 37.60000
#2: 2 47.83333
#3: 51 56.85000
A similar option is unnest from library(tidyr), which returns a single dataset with a '.id' column. Grouping by '.id', we summarise to get the mean of 'Height'.
library(tidyr)
library(dplyr)
unnest(dataset, .id) %>%
group_by(.id) %>%
summarise(Mean= mean(Height, na.rm=TRUE))
# .id Mean
#1 1 37.60000
#2 2 47.83333
#3 51 56.85000
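If unnest() does not accept a plain list in your tidyr version, dplyr's bind_rows() with its .id argument can do the same stacking step. This is a sketch under that assumption rather than the original answer's method. The data is re-created with data.frame() for brevity, and the names combined and res are introduced here for illustration.

```r
library(dplyr)

# Sample list of data frames, equivalent to the 'dataset' shown under "data"
dataset <- list(
  `1`  = data.frame(ID = 1:2, Species = "A", Plant = 1L, Height = c(42.7, 32.5)),
  `2`  = data.frame(ID = 3:5, Species = "A", Plant = 2L, Height = c(43.5, 54.3, 45.7)),
  `51` = data.frame(ID = 134:135, Species = "A", Plant = 51L, Height = c(52.5, 61.2))
)

# Stack the list into one data frame; '.id' records the list element names
combined <- bind_rows(dataset, .id = ".id")

# One mean per original list element
res <- combined %>%
  group_by(.id) %>%
  summarise(Mean = mean(Height, na.rm = TRUE))
print(res)
```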
The syntax for plyr is
df1 <- unnest(dataset, .id)
ddply(df1, .(.id), summarise, Mean=mean(Height, na.rm=TRUE))
# .id Mean
#1 1 37.60000
#2 2 47.83333
#3 51 56.85000
data
dataset <- structure(list(`1` = structure(list(ID = 1:2, Species = c("A",
"A"), Plant = c(1L, 1L), Height = c(42.7, 32.5)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-2L)), `2` = structure(list(ID = 3:5, Species = c("A", "A", "A"
), Plant = c(2L, 2L, 2L), Height = c(43.5, 54.3, 45.7)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-3L)), `51` = structure(list(ID = 134:135, Species = c("A", "A"
), Plant = c(51L, 51L), Height = c(52.5, 61.2)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-2L))), .Names = c("1", "2", "51"))
This also works, using dplyr.
library(dplyr)
seq_along(dataset) %>%
lapply(function(i)
dataset[[i]] %>%
mutate(section = i)) %>% # tag each data frame with its position in the list
bind_rows() %>%
group_by(section) %>%
summarize(mean_height = mean(Height))
