I have a large dataframe containing a cross table of keys from other tables. Instead of having multiple instances of key1 coupled with different values for key2 I would like there to be one row for each key1 with several columns instead.
I tried doing this with a for loop but I couldn't get it to work.
Here's an example. I have a data frame with the structure df1 and I would like it to have the structure of df2.
df1 <- data.frame(c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "d"),c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5, 9))
names(df1) <- c("key1", "key2")
df2 <- data.frame(c("a", "b", "c", "d"), c(1, 2, 1, 9), c(2, 3, 2, NA), c(3, NA, 3, NA), c(NA, NA, 4, NA), c(NA, NA, 5, NA))
names(df2) <- c("key1", "key2_1", "key2_2", "key2_3", "key2_4", "key2_5")
I suspect this is possible using an approach utilizing apply but I haven't found a way yet. Any help is appreciated!
library(dplyr)
library(tidyr)
df1 %>%
group_by(key1) %>%
mutate(var = paste0("key2_", seq(n()))) %>%
spread(var, key2)
# # A tibble: 4 x 6
# # Groups: key1 [4]
# key1 key2_1 key2_2 key2_3 key2_4 key2_5
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 2 3 NA NA
# 2 b 2 3 NA NA NA
# 3 c 1 2 3 4 5
# 4 d 9 NA NA NA NA
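In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). As a rough sketch of the same idea with the newer function (the helper column name idx and the object name wide are made up for this illustration, not part of the original answer):

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df1 <- data.frame(
  key1 = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "d"),
  key2 = c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5, 9)
)

wide <- df1 %>%
  group_by(key1) %>%
  mutate(idx = row_number()) %>%  # position of each key2 within its key1 group
  ungroup() %>%
  pivot_wider(
    names_from = idx,             # hypothetical helper column created above
    values_from = key2,
    names_prefix = "key2_"        # yields key2_1, key2_2, ...
  )

wide
```

Groups with fewer values than the longest group are padded with NA, as in the spread() version.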
I want to summarise both factor and numerical variables using group_by and summarise. For example, if I have the following data frame:
group<- c(1, 1, 2, 2, 3, 3, 4, 4)
var1<- c(3, 6, 3, 2, 7, 5, 2, 5)
var2<- c("A", "B", "B", "B", "A", "A", "B", "A")
df<- data.frame(group, var1, var2)
I want to achieve the following output:
# A tibble: 4 x 3
group max_1 sum_A
<dbl> <dbl> <dbl>
1 6 1
2 3 0
3 7 2
4 5 1
I have tried various iterations of the following using "tally", "n", and "sum", but none of them work:
summary<- df %>% group_by (group) %>%
summarise(max_1 = max(var1)),
mutate (var2A = sum (var2 == "A"))
Thank you!
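No answer to this question is included here. As one possible sketch (not taken from the original thread): the attempt above closes summarise() before the comma, which leaves mutate() outside the pipeline. Both summaries can instead go into a single summarise() call, with sum() counting the TRUE values of var2 == "A". The object name result is made up for this illustration.

```r
library(dplyr)

# Same example data as in the question
group <- c(1, 1, 2, 2, 3, 3, 4, 4)
var1 <- c(3, 6, 3, 2, 7, 5, 2, 5)
var2 <- c("A", "B", "B", "B", "A", "A", "B", "A")
df <- data.frame(group, var1, var2)

result <- df %>%
  group_by(group) %>%
  summarise(
    max_1 = max(var1),         # largest var1 within each group
    sum_A = sum(var2 == "A")   # number of rows in the group where var2 is "A"
  )

result
```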
I am struggling to write the right logic to filter two columns based only on the condition in one column. I have multiple ids and if an id appears in 2020, I want all the data for the other years that id was measured to come along.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????
A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group) -> groups
df4 %>%
filter(group %in% groups)
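or, if you want to combine the two steps (the inner pipeline needs its own parentheses, because %in% and %>% have the same precedence and would otherwise be evaluated left to right):
df4 %>%
filter(group %in% (df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group)))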
You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5
You can do this with dplyr by combining group_by(), filter() and any(). After group_by(), the filter condition is evaluated separately for each group, and any(pop == 3) returns TRUE if at least one row in the group matches, so the whole group is either kept or dropped.
Follow these steps:
First, pipe the data to group_by() to group by your group variable.
Then pipe to filter(), using any() to keep only the groups in which at least one pop equals 3.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
# load the library
library(dplyr)
threes <- df4 %>%
group_by(group) %>%
filter(any(pop == 3))
# print the result
threes
Output:
threes
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
An easy base R option is using subset + ave
subset(
df4,
ave(pop == 3, group, FUN = any)
)
which gives
group pop value
6 b 1 2.0
7 b 2 3.0
8 b 3 4.0
9 b 4 3.5
10 b 5 3.0
16 d 1 0.5
17 d 2 1.5
18 d 3 6.0
19 d 4 2.0
20 d 5 1.5
Use dplyr:
df4 %>% group_by(group) %>% filter(any(pop == 3))
I'm trying to replace NA values in a factor column with the value of the cell above. It would be great to have a tidyverse approach, but it doesn't matter too much if it's not.
I have data that looks like:
data <- tibble(site = as.factor(c("A", "A", NA, "B","B", NA,"C", NA, "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
And I need it to look like:
output <- tibble(site = as.factor(c("A", "A", "A", "B","B", "B","C", "C", "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
I've tried a few different approaches using lag and replace_na although they have basically amounted to trying the same thing which is:
mutate(site = as.character(site),
site = ifelse(is.na(site), "zero", site),
site = ifelse(site == "zero", lag(site), site),
site = as.factor(site))
Thanks!
Try fill() from tidyr:
library(tidyverse)
#Code
data <- data %>% fill(site)
Output:
# A tibble: 9 x 2
site value
<fct> <dbl>
1 A 1
2 A 2
3 A NA
4 B 1
5 B 2
6 B NA
7 C 1
8 C NA
9 C 2
An option with na.locf0 from zoo:
library(zoo)
# carry the last non-NA site forward and write it back to the site column
data$site <- na.locf0(data$site)
I am trying to calculate the median (though it could be replaced by a similar metric) by group for multiple columns, based on subsets defined by other columns. This is a direct follow-on question from a previous post of mine. I attempted to incorporate calculating the median via aggregate into the Map(function(x,y) dosomething, x, y) solution kindly provided by @Frank, but that didn't work. Let me illustrate:
Calculate median for A and B by groups GRP1 and GRP2
df <- data.frame(GRP1 = c("A","A","A","A","A","A","B","B","B","B","B","B"), GRP2 = c("A","A","A","B","B","B","A","A","A","B","B","B"), A = c(0,4,6,7,0,1,9,0,0,8,3,4), B = c(6,0,4,8,6,7,0,9,9,7,3,0))
med <- aggregate(.~GRP1+GRP2,df,FUN=median)
Simple. Now add columns defining which rows are used for calculating the median: rows with NA should be dropped. Column a defines which rows are used for the median of column A, and column b does the same for column B:
a <- c(1,4,7,3,NA,3,7,NA,NA,4,8,1)
b <- c(5,NA,7,9,5,6,NA,8,1,7,2,9)
df1 <- cbind(df,a,b)
As mentioned above, I have tried combining Map and aggregate, but that didn't work. I assume that Map doesn't know what to do with GRP1 and GRP2.
med1 <- Map(function(x,y) aggregate(.~GRP1+GRP2,df1[!is.na(y)],FUN=median), x=df1[,3:4], y=df1[, 5:6])
This is the result I'm looking for:
GRP1 GRP2 A B
1 A A 4 5
2 B A 9 9
3 A B 4 7
4 B B 4 3
Any help will be much appreciated!
Using data.table
library(data.table)
setDT(df1)
df1[, .(A = median(A[!is.na(a)]), B = median(B[!is.na(b)])), by = .(GRP1, GRP2)]
GRP1 GRP2 A B
1: A A 4 5
2: A B 4 7
3: B A 9 9
4: B B 4 3
Same logic in dplyr
library(dplyr)
df1 %>%
group_by(GRP1, GRP2) %>%
summarise(A = median(A[!is.na(a)]), B = median(B[!is.na(b)]))
The original df1:
df1 <- data.frame(
GRP1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
GRP2 = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
A = c(0, 4, 6, 7, 0, 1, 9, 0, 0, 8, 3, 4),
B = c(6, 0, 4, 8, 6, 7, 0, 9, 9, 7, 3, 0),
a = c(1, 4, 7, 3, NA, 3, 7, NA, NA, 4, 8, 1),
b = c(5, NA, 7, 9, 5, 6, NA, 8, 1, 7, 2, 9)
)
With dplyr:
library(dplyr)
df1 %>%
mutate(A = ifelse(is.na(a), NA, A),
B = ifelse(is.na(b), NA, B)) %>%
# I use this to put as NA the values we don't want to include
group_by(GRP1, GRP2) %>%
summarise(A = median(A, na.rm = TRUE),
B = median(B, na.rm = TRUE))
# A tibble: 4 x 4
# Groups: GRP1 [?]
GRP1 GRP2 A B
<fct> <fct> <dbl> <dbl>
1 A A 4 5
2 A B 4 7
3 B A 9 9
4 B B 4 3
I'm looking to obtain a subset of my first, larger, dataframe 'df1' by selecting rows which contain particular combinations in the first two variables, as specified in a smaller 'df2'. For example:
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
day = c(1, 2, 2, 1, 2, 3), value = seq(4,9))
df1 # my actual df has 20 variables
ID day value
A 1 4
A 2 5
A 2 6
B 1 7
B 2 8
B 3 9
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
df2 # this df remains at 2 variables
ID day
A 2
B 1
Where the output would be:
ID day value
A 2 5
A 2 6
B 1 7
Any help would be much appreciated, thanks!
This is a good use of the merge function.
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
day = c(1, 2, 2, 1, 2, 3), value = seq(4,9))
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
merge(df1,
df2,
by = c("ID", "day"))
Which gives output:
ID day value
1 A 2 5
2 A 2 6
3 B 1 7
Here is a dplyr solution:
library("dplyr")
semi_join(df1, df2, by = c("ID", "day"))
# ID day value
# 1 A 2 5
# 2 A 2 6
# 3 B 1 7