I am working with a dataframe which follows a pattern of:
Key Data
Loc Place1
Value1 6
Value2 7
Loc Place2
Value3 8
Loc Place3
Value1 9
Value2 10
Loc Place4
Value3 11
It is a rough dataset where a pattern exists: in this example, the rows at seq(1, 100, by = 5) would identify the first location of an observation. My goal is to change the key in those positions to something different, such as LocA rather than Loc, so that a spread(key, value) gives me unique observations I can use for further analysis:
LocA Value1 Value2 Loc Value3
Place1 6 7 Place2 8
Place3 9 10 Place4 11
I have been using dplyr and a chain of other mutates and selects to get to this point, so I'm hoping to remain in the chain. I can see how to do it with appropriate subsetting outside of the chain, but I am having difficulty wrapping my head around a dplyr solution.
Your data:
df <- structure(list(Key = c("Loc", "Value1", "Value2", "Loc", "Value3",
"Loc", "Value1", "Value2", "Loc", "Value3"), Data = c("Place1",
"6", "7", "Place2", "8", "Place3", "9", "10", "Place4", "11")), .Names = c("Key",
"Data"), row.names = c(NA, -10L), class = "data.frame")
Is this workable?
library(dplyr)
library(tidyr)
df %>%
mutate(grp = (row_number() - 1) %/% 5) %>%
group_by(grp) %>%
mutate(
Key = ifelse(! duplicated(Key), Key, paste0(Key, "A"))
) %>%
ungroup() %>%
spread(Key, Data) %>%
select(-grp)
# Source: local data frame [2 x 5]
# Loc LocA Value1 Value2 Value3
# * <chr> <chr> <chr> <chr> <chr>
# 1 Place1 Place2 6 7 8
# 2 Place3 Place4 9 10 11
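The two ideas in that chain (a block index from integer division, and renaming repeated keys inside a block) can also be sketched in base R. This is only an illustration and assumes, as the question does, that every observation occupies exactly 5 consecutive rows; the names grp and new_key are made up for the example.

```r
# Keys copied from the question's data
key <- c("Loc", "Value1", "Value2", "Loc", "Value3",
         "Loc", "Value1", "Value2", "Loc", "Value3")

# Block index: rows 1-5 get 0, rows 6-10 get 1 (assumes fixed blocks of 5 rows)
grp <- (seq_along(key) - 1) %/% 5

# A key is a repeat if the same (block, key) pair has already been seen
is_repeat <- duplicated(paste(grp, key))

# Append "A" to repeated keys so each key is unique within its block
new_key <- ifelse(is_repeat, paste0(key, "A"), key)
print(new_key)
```

Note that this marks the second Loc in each block as LocA, which is the same behaviour as the dplyr answer rather than the exact layout requested in the question.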
Here is another way to do this. I admit this one won't scale as well as the one from r2evans above.
df <- structure(list(Key = c("Loc", "Value1", "Value2", "Loc", "Value3",
"Loc", "Value1", "Value2", "Loc", "Value3"), Data = c("Place1",
"6", "7", "Place2", "8", "Place3", "9", "10", "Place4", "11")), .Names = c("Key",
"Data"), row.names = c(NA, -10L), class = "data.frame")
library(dplyr)
library(tidyr)
library(stringr) # provides str_c()
df %>%
mutate(gid = ceiling(row_number() / 5)) %>%
group_by(gid) %>%
summarize(concatenated_text = str_c(Data, collapse = ",")) %>%
separate(concatenated_text, into = c("LocA", "Value1", "Value2", "Loc", "Value3"), sep=",")
Related
I am able to pivot_wider for a specific column using the following:
new_df <- pivot_wider(old_df, names_from = col10, values_from = value_col, values_fn = list)
I would like to pivot_wider with every column in a dataframe (minus an id column). What is the best way to do this? Should I use a loop or is there a way that this function takes the whole dataframe?
To clarify, using the below sample dataframes, I am able to go from old_df to new_df using the pivot_wider function I listed above. I would like to now go from old_df2 to new_df2.
old_df <- structure(list(id = c("1", "1", "2"), col10 = c("yellow",
"green", "green"), value_col = c("1", "1", "1")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
old_df2 <- structure(list(id = c("1", "1", "2"), col10 = c("yellow",
"green", "green"), col11 = c("dog",
"cat", "dog"), value_col = c("1", "1", "1")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
new_df <- pivot_wider(old_df, names_from = col10, values_from = value_col, values_fn = list)
new_df2 <- structure(list(id = c("1", "2"), yellow = c("1", "NULL"), green = c("1", "1"), dog = c("1", "1"), cat = c("1", "NULL")), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
If you would like to have separate column names for each value between these two columns (or any number of columns) you first need to use pivot_longer to put all the column names into a single column and then use pivot_wider to spread them:
library(tidyr)
library(dplyr) # select(), group_by(), summarise(), across()
old_df2 %>%
pivot_longer(!c(id, value_col), names_to = "Cols", values_to = "vals") %>%
pivot_wider(names_from = vals, values_from = value_col) %>%
select(-Cols) %>%
group_by(id) %>%
summarise(across(everything(), ~ sum(as.numeric(.x), na.rm = TRUE)))
# A tibble: 2 x 5
id yellow dog green cat
<chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 2 0 1 1 0
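The same "stack the columns into one, then spread" idea can be sketched in base R with table(). This is a rough illustration only: it rebuilds a copy of the sample data without value_col and counts occurrences, which matches the sample because every value_col entry there is "1". The names long and counts are made up for the example.

```r
# Copy of the sample data from the question, without value_col
old_df2 <- data.frame(
  id    = c("1", "1", "2"),
  col10 = c("yellow", "green", "green"),
  col11 = c("dog", "cat", "dog"),
  stringsAsFactors = FALSE
)

# Stack col10 and col11 into a single column, repeating id for each
long <- data.frame(
  id  = rep(old_df2$id, times = 2),
  val = c(old_df2$col10, old_df2$col11)
)

# Cross-tabulate: one row per id, one column per distinct value
counts <- table(long$id, long$val)
print(counts)
```

Absent combinations show up as 0 here rather than NA, as in the summarise() output above.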
Update 1
As per your update, here is a data.table option:
library(data.table)
dcast(
melt(setDT(old_df2),
id.var = "id",
measure.vars = patterns("^col\\d+")
),
id ~ value,
fun.aggregate = length,
fill = NA
)
which gives
id cat dog green yellow
1: 1 1 1 1 1
2: 2 NA 1 1 NA
Are you looking for something like below?
reshape(
transform(
old_df2,
q = ave(id, id, FUN = seq_along)
),
direction = "wide",
idvar = "id",
timevar = "q"
)
The output is
id col10.1 col11.1 value_col.1 col10.2 col11.2 value_col.2
1 1 yellow dog 1 green cat 1
3 2 green dog 1 <NA> <NA> <NA>
You could combine those columns and unnest them followed by pivot_wider:
library(tidyr)
library(dplyr)
old_df2 <- structure(list(id = c("1", "1", "2"), col10 = c("yellow",
"green", "green"), col11 = c("dog",
"cat", "dog"), value_col = c("1", "1", "1")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
old_df2 %>%
mutate(new_col = strsplit(paste(col10, col11, sep = "_"), "_"), .keep = "unused") %>%
unnest(new_col) %>%
pivot_wider(names_from = new_col, values_from = value_col)
#> # A tibble: 2 x 5
#> id yellow dog green cat
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1 1 1 1
#> 2 2 <NA> 1 1 <NA>
Created on 2021-08-25 by the reprex package (v2.0.1)
I have the following dataframe:
df1 <- data.frame(
date = c("14-Mar-20", "14-Mar-20", "14-Mar-20", "15-Mar-20", "15-Mar-20", "15-Mar-20"),
status = c("new", "progress", "completed", "new", "progress", "completed"),
count = c("1", "2", "3", "4", "5", "6"),
stringsAsFactors = FALSE
)
I want to reshape it into the following format:
How can I do so? I am trying to use the "melt" function, but I am unable to make any headway!
We can use pivot_wider from tidyr
library(dplyr)
library(tidyr)
df1 %>%
pivot_wider(names_from = status, values_from = count)
# A tibble: 2 x 4
# date new progress completed
# <chr> <chr> <chr> <chr>
#1 14-Mar-20 1 2 3
#2 15-Mar-20 4 5 6
dcast from data.table:
library(data.table)
setDT(df1)
dcast(df1, date ~ status, value.var = 'count')
Here is a base R solution using reshape
res <- reshape(df1,direction = "wide",idvar = "date",timevar = "status")
> res
date count.new count.progress count.completed
1 14-Mar-20 1 2 3
4 15-Mar-20 4 5 6
I have a data frame where one column contains a list. I want to convert the list to numeric and sum the values into a new column. Each row has a column with a vector like this:
c("47", "39", "1")
The new column would contain the sum of those numbers and would look like this:
List SumList
c("47", "39", "1") 87
c("11", "11") 22
c("1", "2") 3
I have tried a couple different approaches, but nothing seems to produce the outcome I need.
Example data frame:
DF <- structure(list(list = structure(list(c("47", "39", "1"), c("11",
"11"), c("1", "2")))), class = "data.frame", row.names = c(NA, -3L))
You can accomplish what you want using the dplyr functions rowwise and mutate.
Example:
library(dplyr)
df <- tibble(List = list(c("47", "39", "1"), c("11","11"), c("1","2"))) %>%
rowwise() %>%
mutate(SumList = sum(as.numeric(List)))
1) Assuming the data frame in the Note at the end, try the following code. No packages are used.
transform(DF, sum = sapply(list, function(x) sum(as.numeric(x))))
giving:
list sum
1 47, 39, 1 87
2 11, 11 22
3 1, 2 3
2) Another approach is to convert DF to a long form and then sum that giving the same result. Again no packages are used.
long <- stack(setNames(DF$list, seq_along(DF$list)))
transform(DF, sum = rowsum(as.numeric(long$values), long$ind))
Note
The input in reproducible form:
DF <- structure(list(list = structure(list(c("47", "39", "1"), c("11",
"11"), c("1", "2")))), class = "data.frame", row.names = c(NA, -3L))
Here's a purrr solution that uses map_dbl.
library(dplyr)
library(tibble)
library(purrr)
tibble(x = list(c("47", "39", "1"), c("11","11"), c("1","2"))) %>%
mutate(Sum = map_dbl(x, function(i)sum(as.numeric(i))))
#> # A tibble: 3 x 2
#> x Sum
#> <list> <dbl>
#> 1 <chr [3]> 87
#> 2 <chr [2]> 22
#> 3 <chr [2]> 3
Created on 2019-03-20 by the reprex package (v0.2.1)
Supposing we use G. Grothendieck's structure:
DF <- structure(list(list = structure(list(c("47", "39", "1"), c("11",
"11"), c("1", "2")))), class = "data.frame", row.names = c(NA, -3L))
# sapply() returns a numeric vector here; lapply() would create a list column
DF$SumList <- sapply(seq_len(nrow(DF)), function(x) sum(as.double(unlist(DF$list[x]))))
And with Frank's input:
DF$SumList <- sapply(seq_len(nrow(DF)), function(x) sum(as.double(DF$list[[x]])))
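As a small variation on the answers above, vapply() can be used to state up front that each list element must reduce to a single number. This is only a sketch and uses the DF structure already posted in this thread.

```r
# Data frame with a list column, as given earlier in the thread
DF <- structure(list(list = structure(list(c("47", "39", "1"), c("11",
"11"), c("1", "2")))), class = "data.frame", row.names = c(NA, -3L))

# Convert each element to numeric and sum it; numeric(1) makes vapply()
# fail loudly if any element does not reduce to exactly one number
DF$SumList <- vapply(DF$list, function(x) sum(as.numeric(x)), numeric(1))
print(DF$SumList)
```

The result is a plain numeric column, so it can be used directly in later arithmetic.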
I have a problem similar to the following one, but the solution presented at that link does not work for me:
tidyr spread does not aggregate data
I have a df in the following structure:
UndesiredIndex DesiredIndex DesiredRows Result
1 x1A x1 A 50,32
2 x1B x2 B 7,34
3 x2A x1 A 50,33
4 x2B x2 B 7,35
Using the code below:
dftest <- bd_teste %>%
select(-UndesiredIndex) %>%
spread(DesiredIndex, Result)
I expected the following result:
DesiredIndex A B
A 50,32 50,33
B 7,34 7,35
However, I keep getting the following result:
DesiredIndex x1 x2
1 A 50.32 NA
2 B 7.34 NA
3 A NA 50.33
4 B NA 7.35
PS: Sometimes I force the column UndesiredIndex out with select(-UndesiredIndex), but I keep getting the following message:
Adding missing grouping variables: UndesiredIndex
Might be something easy to stack those rows, but I'm new to R and have been trying so hard to solve this but without success.
Thanks in advance!
We group by DesiredIndex, create a sequence column and then do the spread:
library(tidyverse)
df1 %>%
select(-UndesiredIndex) %>%
group_by(DesiredIndex) %>%
mutate(new = LETTERS[row_number()]) %>%
ungroup %>%
select(-DesiredIndex) %>%
spread(new, Result)
# A tibble: 2 x 3
# DesiredRows A B
# <chr> <chr> <chr>
#1 A 50,32 50,33
#2 B 7,34 7,35
Data
df1 <- structure(
list(
UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
DesiredIndex = c("x1", "x2", "x1", "x2"),
DesiredRows = c("A", "B", "A", "B"),
Result = c("50,32", "7,34", "50,33", "7,35")
),
class = "data.frame",
row.names = c("1", "2", "3", "4")
)
Shorter, but more theoretically round-about.
Data
(Thanks to @akrun!)
df1 <- structure(
list(
UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
DesiredIndex = c("x1", "x2", "x1", "x2"),
DesiredRows = c("A", "B", "A", "B"),
Result = c("50,32", "7,34", "50,33", "7,35")
),
class = "data.frame",
row.names = c("1", "2", "3", "4")
)
This is a great technique for concatenating rows.
df1 %>%
group_by(DesiredRows) %>%
summarise(Result = paste(Result, collapse = "|")) %>% #<Concatenate rows
separate(Result, into = c("A", "B"), sep = "\\|") #<Separate by '|'
#> # A tibble: 2 x 3
#> DesiredRows A B
#> <chr> <chr> <chr>
#> 1 A 50,32 50,33
#> 2 B 7,34 7,35
Created on 2018-08-06 by the reprex package (v0.2.0).
I'm drawing a blank-- I have 51 sets of split data from a data frame that I had, and I want to take the mean of the height of each set.
print(dataset)
$`1`
ID Species Plant Height
1 A 1 42.7
2 A 1 32.5
$`2`
ID Species Plant Height
3 A 2 43.5
4 A 2 54.3
5 A 2 45.7
...
...
...
$`51`
ID Species Plant Height
134 A 51 52.5
135 A 51 61.2
I know how to run each individually, but with 51 split sections, it would take me ages.
I thought that
mean(dataset[,4])
might work, but it says that I have the wrong number of dimensions. I get now why that is incorrect, but I am no closer to figuring out how to average all of the heights.
The dataset is a list. We could use lapply/sapply/vapply etc to loop through the list elements and get the mean of the 'Height' column. Using vapply, we can specify the class and length of the output (numeric(1)). This will be useful for debugging.
vapply(dataset, function(x) mean(x[,4], na.rm=TRUE), numeric(1))
# 1 2 51
#37.60000 47.83333 56.85000
Or another option (if the data.frames in the list have the same column names and number of columns) would be to use rbindlist from data.table with the option idcol=TRUE to generate a single data.table. The '.id' column shows the names of the list elements. We group by '.id' and get the mean of the Height column.
library(data.table)
rbindlist(dataset, idcol=TRUE)[, list(Mean=mean(Height, na.rm=TRUE)), by = .id]
# .id Mean
#1: 1 37.60000
#2: 2 47.83333
#3: 51 56.85000
Or a similar option as above is unnest from library(tidyr) to return a single dataset with the '.id' column, grouped by '.id', we summarise to get the mean of 'Height'.
library(tidyr)
library(dplyr)
unnest(dataset, .id) %>%
group_by(.id) %>%
summarise(Mean= mean(Height, na.rm=TRUE))
# .id Mean
#1 1 37.60000
#2 2 47.83333
#3 51 56.85000
The syntax for plyr is
library(plyr) # provides ddply()
df1 <- unnest(dataset, .id)
ddply(df1, .(.id), summarise, Mean=mean(Height, na.rm=TRUE))
# .id Mean
#1 1 37.60000
#2 2 47.83333
#3 51 56.85000
data
dataset <- structure(list(`1` = structure(list(ID = 1:2, Species = c("A",
"A"), Plant = c(1L, 1L), Height = c(42.7, 32.5)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-2L)), `2` = structure(list(ID = 3:5, Species = c("A", "A", "A"
), Plant = c(2L, 2L, 2L), Height = c(43.5, 54.3, 45.7)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-3L)), `51` = structure(list(ID = 134:135, Species = c("A", "A"
), Plant = c(51L, 51L), Height = c(52.5, 61.2)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-2L))), .Names = c("1", "2", "51"))
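For comparison, the "combine into one table, then group" approach can also be sketched without any packages. The example below uses a trimmed copy of the data above (only the ID and Height columns) so that it runs on its own; the names combined and means are made up for the illustration.

```r
# Trimmed copy of the example list: three data frames named "1", "2", "51"
dataset <- list(
  `1`  = data.frame(ID = 1:2,     Height = c(42.7, 32.5)),
  `2`  = data.frame(ID = 3:5,     Height = c(43.5, 54.3, 45.7)),
  `51` = data.frame(ID = 134:135, Height = c(52.5, 61.2))
)

# Tag each data frame with its list name, then stack them into one table
tagged   <- Map(function(d, nm) cbind(.id = nm, d), dataset, names(dataset))
combined <- do.call(rbind, tagged)

# Mean Height per original list element
means <- tapply(combined$Height, combined$.id, mean)
print(means)
```

This keeps the list names as the group labels, like the '.id' column in the answers above.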
This also works, though it uses dplyr.
library(dplyr)
seq_along(dataset) %>%
lapply(function(i)
dataset[[i]] %>%
mutate(section = i)) %>% # section is the position in the list, not the element name
bind_rows() %>%
group_by(section) %>%
summarize(mean_height = mean(Height))