Transform columns and rows of a dataframe [duplicate] - r

I have a dataframe:
ID Value Name Score Card_type Card_number
1 NA John 242 X 23
1 124 John NA X 23
1 124 John 242 Y 25
1 124 NA 242 Y NA
2 55 Mike NA X 11
2 55 NA 431 X 11
2 55 Mike 431 Y 14
2 NA Mike 431 Y 14
As you can see, each ID has two groups (Card_type) for the column Card_number, and some rows with the same ID and Card_type have missing values in some columns. I want each ID to become one row with all columns filled in. The column Card_number must be split into two columns, Card_number_type_X and Card_number_type_Y, and the column Card_type must be removed.
So the desired result must look like this:
ID Value Name Score Card_number_type_X Card_number_type_Y
1 124 John 242 23 25
2 55 Mike 431 11 14
How could I do that?

One way would be to fill the missing values in each ID and then get data in wide format keeping only the unique values.
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
fill(everything(), .direction = 'updown') %>%
pivot_wider(names_from = Card_type, values_from = Card_number,
values_fn = unique, names_prefix = 'Card_number_type_')
# ID Value Name Score Card_number_type_X Card_number_type_Y
# <int> <int> <chr> <int> <int> <int>
#1 1 124 John 242 23 25
#2 2 55 Mike 431 11 14
It seems the original data is not the same as the shared data, in which case we can try:
df %>%
group_by(ID) %>%
fill(everything(), .direction = 'updown') %>%
distinct() %>%
group_by(ID, Value, Name, Score) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = Card_type, values_from = Card_number,
names_prefix = 'Card_number_type_')
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Value = c(NA,
124L, 124L, 124L, 55L, 55L, 55L, NA), Name = c("John", "John",
"John", NA, "Mike", NA, "Mike", "Mike"), Score = c(242L, NA,
242L, 242L, NA, 431L, 431L, 431L), Card_type = c("X", "X", "Y",
"Y", "X", "X", "Y", "Y"), Card_number = c(23L, 23L, 25L, NA,
11L, 11L, 14L, 14L)), class = "data.frame", row.names = c(NA,
-8L))
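As a cross-check on the fill-then-pivot idea above, here is a minimal sketch that collapses the duplicates explicitly instead of relying on values_fn. The helper first_non_na is my own name, not part of dplyr or tidyr, and the sketch assumes each ID has a single true Value, Name and Score and each ID/Card_type pair has a single true Card_number.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Value = c(NA, 124L, 124L, 124L, 55L, 55L, 55L, NA),
  Name = c("John", "John", "John", NA, "Mike", NA, "Mike", "Mike"),
  Score = c(242L, NA, 242L, 242L, NA, 431L, 431L, 431L),
  Card_type = c("X", "X", "Y", "Y", "X", "X", "Y", "Y"),
  Card_number = c(23L, 23L, 25L, NA, 11L, 11L, 14L, 14L)
)

# Hypothetical helper: first non-missing value of a vector (NA if none)
first_non_na <- function(x) x[!is.na(x)][1]

wide <- df %>%
  group_by(ID) %>%
  # fill the per-ID columns with their single known value
  mutate(across(c(Value, Name, Score), first_non_na)) %>%
  # one row per ID and card type, keeping the known card number
  group_by(ID, Value, Name, Score, Card_type) %>%
  summarise(Card_number = first_non_na(Card_number), .groups = "drop") %>%
  pivot_wider(names_from = Card_type, values_from = Card_number,
              names_prefix = "Card_number_type_")

print(wide)
```

If an ID really does carry several different card numbers for one card type, this sketch silently keeps the first one, so it is only suitable when the duplicates are known to agree.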

Related

one hot encoding only factor variables in R recipes

I have a dataframe df like so
height age dept
69 18 A
44 8 B
72 19 B
58 34 C
I want to one-hot encode only the factor variables (here, only dept is a factor). How can I do this?
Currently I select every column with everything() in the code below,
and get this warning:
Warning message:
The following variables are not factor vectors and will be ignored: height, age
ohe <- df %>%
recipes::recipe(~ .) %>%
recipes::step_dummy(tidyselect::everything()) %>%
recipes::prep() %>%
recipes::bake(df)
Use where(is.factor) instead of everything(), so that step_dummy only receives the factor columns
library(dplyr)
df %>%
recipes::recipe(~ .) %>%
recipes::step_dummy(tidyselect:::where(is.factor)) %>%
recipes::prep() %>%
recipes::bake(df)
-output
# A tibble: 4 × 4
height age dept_B dept_C
<int> <int> <dbl> <dbl>
1 69 18 0 0
2 44 8 1 0
3 72 19 1 0
4 58 34 0 1
data
df <- structure(list(height = c(69L, 44L, 72L, 58L), age = c(18L, 8L,
19L, 34L), dept = structure(c(1L, 2L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
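To sanity-check which columns where(is.factor) would pick up, and what the dummy columns should contain, here is a small base R sketch. It does not use recipes at all; it swaps in model.matrix as an independent check, and the names factor_cols, dummies and encoded are mine.

```r
df <- data.frame(
  height = c(69L, 44L, 72L, 58L),
  age = c(18L, 8L, 19L, 34L),
  dept = factor(c("A", "B", "B", "C"))
)

# Names of the columns that are factors (what where(is.factor) selects)
factor_cols <- names(df)[sapply(df, is.factor)]

# Treatment-coded dummies for those columns; drop the intercept column
dummies <- model.matrix(reformulate(factor_cols), data = df)[, -1, drop = FALSE]

# Non-factor columns plus the dummy columns
encoded <- cbind(df[setdiff(names(df), factor_cols)], dummies)
print(encoded)
```

The column names come out as deptB and deptC rather than dept_B and dept_C, since model.matrix uses its own naming scheme, but the 0/1 pattern should match the recipes output above.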

Remove row within groups if coordinates of subgroup are within another subgroup in r

I have a dataframe such as
Groups NAMES start end
G1 A 1 50
G1 A 25 45
G1 B 20 51
G1 A 51 49
G2 A 200 400
G2 B 1 1600
G2 A 2000 3000
G2 B 4000 5000
The idea is, within each Groups value, to find the rows where the start and end coordinates of an A lie within the coordinates of a B, and to remove that B row.
For instance, in the example:
Groups NAMES start end
G1 A 1 50 <- A is outside any B coordinate
G1 A 25 45 <- A is **inside** the B coordinates 20-51, so I remove that B row.
G1 B 20 51
G1 A 51 49 <- A is outside any B coordinate
G2 A 200 400 <- A is **inside** the B coordinates 1-1600, so I remove that B row.
G2 B 1 1600
G2 A 2000 3000 <- A is outside any B coordinate
G2 B 4000 5000 <- this one does not have any A inside it, then it will be kept in the output.
Then I should get as output :
Groups NAMES start end
G1 A 1 50
G1 A 25 45
G1 A 51 49
G2 A 200 400
G2 A 2000 3000
G2 B 4000 5000
Does anyone have an idea?
Here is the dataframe in dput format, in case it helps:
structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("G1", "G2"), class = "factor"), NAMES = structure(c(1L,
1L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"),
start = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L), end = c(50L,
45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)), class = "data.frame", row.names = c(NA,
-8L))
Here's a possible approach. We'll split the df by NAMES and join the two parts to each other by Groups to do within-group comparisons. Only B rows can get dropped, so those are the only ones whose row numbers we want to keep track of.
We can then just group by rowid to tag the B rows by whether or not they have any A inside them. Finally, filter to the B to keep and concatenate back to the A rows.
library(tidyverse)
df <- structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("G1", "G2"), class = "factor"), NAMES = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"), start = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L), end = c(50L, 45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)), class = "data.frame", row.names = c(NA, -8L))
A <- filter(df, NAMES == "A")
B <- df %>%
filter(NAMES == "B") %>%
rowid_to_column()
comparison <- inner_join(A, B, by = "Groups") %>%
mutate(A_in_B = start.x >= start.y & end.x <= end.y) %>%
group_by(rowid) %>%
summarise(keep_B = !any(A_in_B))
B %>%
inner_join(comparison, by = "rowid") %>%
filter(keep_B) %>%
select(-rowid, -keep_B) %>%
bind_rows(A) %>%
arrange(Groups, NAMES)
#> Groups NAMES start end
#> 1 G1 A 1 50
#> 2 G1 A 25 45
#> 3 G1 A 51 49
#> 4 G2 A 200 400
#> 5 G2 A 2000 3000
#> 6 G2 B 4000 5000
Created on 2021-07-27 by the reprex package (v1.0.0)
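For comparison, the same containment rule can be sketched in base R without the join. This is only an illustration under the same reading of the question (a B row is dropped when at least one A row of the same group has start >= the B start and end <= the B end); the function name drop_B_containing_A is made up for this example.

```r
df <- data.frame(
  Groups = c("G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2"),
  NAMES  = c("A", "A", "B", "A", "A", "B", "A", "B"),
  start  = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L),
  end    = c(50L, 45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)
)

# Hypothetical helper: remove every B row that contains an A row of its group
drop_B_containing_A <- function(df) {
  is_B <- df$NAMES == "B"
  contains_A <- vapply(seq_len(nrow(df)), function(i) {
    if (!is_B[i]) return(FALSE)  # A rows are never dropped
    a <- df[!is_B & df$Groups == df$Groups[i], ]
    any(a$start >= df$start[i] & a$end <= df$end[i])
  }, logical(1))
  df[!contains_A, ]
}

res <- drop_B_containing_A(df)
print(res)
```

The loop is quadratic within each group, so for large data the join-based version above is probably the better choice.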
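Another approach uses purrr::map_dfr. Note that, as the output below shows, this version keeps every B row plus the A rows that lie strictly inside a B row, so it does not reproduce the desired output from the question.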
library(tidyverse)
df %>%
group_split(Groups) %>%
map_dfr(~ .x %>% mutate(r = row_number()) %>%
full_join(.x %>%
filter(NAMES == 'B'),
by = 'Groups') %>%
group_by(r) %>%
filter(any(NAMES.x == 'B' | start.x > start.y & end.x < end.y)) %>%
ungroup %>%
select(Groups, ends_with('.x')) %>%
distinct %>%
rename_with(~ gsub('\\.x', '', .), everything())
)
#> # A tibble: 6 x 4
#> Groups NAMES start end
#> <fct> <fct> <int> <int>
#> 1 G1 A 25 45
#> 2 G1 B 20 51
#> 3 G1 A 51 49
#> 4 G2 A 200 400
#> 5 G2 B 1 1600
#> 6 G2 B 4000 5000
Created on 2021-07-27 by the reprex package (v2.0.0)

Is there an R function for subtracting values in a variable by group based on specified conditions?

I have a df that takes this general form:
ID votes
1 65
1 85
2 100
2 20
2 95
3 50
3 60
I want to create a new df that takes the two highest values in votes for each ID and shows their difference. The new df should look like this:
ID margin
1 20
2 5
3 10
Is there a way to use dplyr for this?
One option is to arrange by 'ID' and descending 'votes', group by 'ID', and take the absolute diff of the first two 'votes'
library(dplyr)
df1 %>%
arrange(ID, desc(votes)) %>%
group_by(ID) %>%
summarise(margin = abs(diff(votes[1:2])))
# A tibble: 3 x 2
# ID margin
# <int> <int>
#1 1 20
#2 2 5
#3 3 10
Or another option is
df1 %>%
group_by(ID) %>%
summarise(margin = max(votes) - max(votes[-which.max(votes)]))
Or with slice and diff, picking the rows that hold the two highest votes in each group
df1 %>%
group_by(ID) %>%
slice(order(votes, decreasing = TRUE)[1:2]) %>%
summarise(margin = abs(diff(votes)))
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L), votes = c(65L,
85L, 100L, 20L, 95L, 50L, 60L)), class = "data.frame", row.names = c(NA,
-7L))
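The same margin can also be checked in base R. This is a minimal sketch, assuming every ID has at least two rows; the helper name top_two_margin is mine.

```r
df1 <- data.frame(
  ID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
  votes = c(65L, 85L, 100L, 20L, 95L, 50L, 60L)
)

# Hypothetical helper: difference between the two largest values of a vector
top_two_margin <- function(v) {
  s <- sort(v, decreasing = TRUE)
  s[1] - s[2]
}

# Apply the helper per ID and name the result column margin
res <- aggregate(votes ~ ID, data = df1, FUN = top_two_margin)
names(res)[names(res) == "votes"] <- "margin"
print(res)
```

Sorting before subtracting makes the result independent of the row order inside each ID, which is the property the dplyr versions rely on arrange or max for.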

How would I add a Total row for each value in a specific column, computed from other columns?

Assume I have this data frame
What I want is this
I want to add one row per value of the month variable, holding the sum of the total variable and the unique value of the days_in_month variable across all values of person for that month.
I am wondering whether there is an easy way to do this that does not involve multiple spreads and gathers with adorn_totals, after which I would have to change days_in_month back to its original value. Is there a quick and easy way to do this?
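One option would be to group by 'month' and 'days_in_month' and apply adorn_totals to each group with group_modify (in current dplyr, group_map returns a list, so group_modify is the variant that returns a grouped data frame)
library(dplyr)
library(janitor)
df1 %>%
group_by(month, days_in_month) %>%
group_modify(~ .x %>%
adorn_totals("row")) %>%
select(names(df1))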
# A tibble: 10 x 4
# Groups: month, days_in_month [2]
# month person total days_in_month
# <int> <chr> <int> <int>
# 1 1 John 7 31
# 2 1 Jane 18 31
# 3 1 Tim 20 31
# 4 1 Cindy 11 31
# 5 1 Total 56 31
# 6 2 John 18 28
# 7 2 Jane 13 28
# 8 2 Tim 15 28
# 9 2 Cindy 9 28
#10 2 Total 55 28
If we need other statistics, we can compute them inside group_modify
library(tibble)
df1 %>%
group_by(month, days_in_month) %>%
group_modify(~ bind_rows(.x, tibble(person = "Mean", total = mean(.x$total))))
data
df1 <- structure(list(month = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), person = c("John",
"Jane", "Tim", "Cindy", "John", "Jane", "Tim", "Cindy"), total = c(7L,
18L, 20L, 11L, 18L, 13L, 15L, 9L), days_in_month = c(31L, 31L,
31L, 31L, 28L, 28L, 28L, 28L)), class = "data.frame", row.names = c(NA,
-8L))
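If janitor is not available, a similar table can be sketched with dplyr alone by summarising the totals and binding them back onto the data. This is an alternative to the approach above, not the same method; the names totals and result are mine.

```r
library(dplyr)

df1 <- data.frame(
  month = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  person = c("John", "Jane", "Tim", "Cindy", "John", "Jane", "Tim", "Cindy"),
  total = c(7L, 18L, 20L, 11L, 18L, 13L, 15L, 9L),
  days_in_month = c(31L, 31L, 31L, 31L, 28L, 28L, 28L, 28L)
)

# One "Total" row per month; days_in_month is kept because it is a grouping key
totals <- df1 %>%
  group_by(month, days_in_month) %>%
  summarise(person = "Total", total = sum(total), .groups = "drop")

# Append the totals and sort so each Total row comes last within its month
result <- bind_rows(df1, totals) %>%
  arrange(month, person == "Total")

print(result)
```

Because days_in_month is a grouping column rather than something that gets summed, there is no need to restore it afterwards.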

Calculate rowMeans on a range of column (Variable number)

I want to calculate rowMeans over a range of columns, but I cannot hard-code the column names (e.g. c(C1, C3)) or a range (e.g. C1:C3), because both the names and the range vary. My df looks like:
> df
chr name age MGW.1 MGW.2 MGW.3 HEL.1 HEL.2 HEL.3
1 123 abc 12 10.00 19 18.00 12 13.00 -14
2 234 bvf 24 -13.29 13 -3.02 12 -0.12 24
3 376 bxc 17 -6.95 10 -18.00 15 4.00 -4
This is just a sample; in reality the columns run from MGW.1 to MGW.196 and so on. Instead of giving exact column names or an exact range, I want to pass a column-name prefix and get the average of all columns that start with it. Something like: MGW=rowMeans(df[,MGW.*]), HEL=rowMeans(df[,HEL.*])
So my final output should look like:
> df
chr name age MGW Hel
1 123 abc 12 10.00 19
2 234 bvf 24 13.29 13
3 376 bxc 17 -6.95 10
I know these values are not correct, but they are just to give you an idea. Secondly, I want to remove every row of the data frame that contains only NA apart from its first 3 values.
Here is the dput for sample example:
> dput(df)
structure(list(chr = c(123L, 234L, 376L), name = structure(1:3, .Label = c("abc",
"bvf", "bxc"), class = "factor"), age = c(12L, 24L, 17L), MGW.1 = c(10,
-13.29, -6.95), MGW.2 = c(19L, 13L, 10L), MGW.3 = c(18, -3.02,
-18), HEL.1 = c(12L, 12L, 15L), HEL.2 = c(13, -0.12, 4), HEL.3 = c(-14L,
24L, -4L)), .Names = c("chr", "name", "age", "MGW.1", "MGW.2",
"MGW.3", "HEL.1", "HEL.2", "HEL.3"), class = "data.frame", row.names = c(NA,
-3L))
Firstly
I think you are looking for this to get the row means (the dot is escaped so that it matches a literal "."):
df$mean.Hel <- rowMeans(df[, grep("^HEL\\.", names(df))])
And to delete the columns afterwards:
df[, grep("^HEL\\.", names(df))] <- NULL
Secondly
To delete rows which have only NA after the first three elements.
rows.delete <- which(rowSums(!is.na(df)[,4:ncol(df)]) == 0)
df <- df[!(1:nrow(df) %in% rows.delete),]
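The rowMeans idea above can be extended to every prefix at once without naming HEL or MGW. This is a rough sketch, assuming all measurement columns follow the prefix.number pattern and that the first three columns are the identifiers; the names prefixes, means and out are mine.

```r
df <- data.frame(
  chr = c(123L, 234L, 376L),
  name = c("abc", "bvf", "bxc"),
  age = c(12L, 24L, 17L),
  MGW.1 = c(10, -13.29, -6.95), MGW.2 = c(19L, 13L, 10L), MGW.3 = c(18, -3.02, -18),
  HEL.1 = c(12L, 12L, 15L), HEL.2 = c(13, -0.12, 4), HEL.3 = c(-14L, 24L, -4L)
)

# Distinct prefixes of the columns named like "PREFIX.<number>"
numbered <- grep("\\.[0-9]+$", names(df), value = TRUE)
prefixes <- unique(sub("\\.[0-9]+$", "", numbered))

# One column of row means per prefix
means <- sapply(prefixes, function(p) {
  cols <- grepl(paste0("^", p, "\\.[0-9]+$"), names(df))
  rowMeans(df[, cols, drop = FALSE], na.rm = TRUE)
})

# Identifier columns followed by the per-prefix means
out <- cbind(df[, 1:3], means)
print(out)
```

With na.rm = TRUE a row that is entirely NA for a prefix yields NaN, so the NA-row filter above still needs to be applied separately.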
Here's an idea achieving your desired output without hardcoding variable names:
library(dplyr)
library(tidyr)
df %>%
# remove rows where all values are NA except the first 3 columns
filter(rowSums(is.na(.[4:length(.)])) != length(.) - 3) %>%
# gather the data in a tidy format
gather(key, value, -(chr:age)) %>%
# separate the key column into label and num allowing
# to regroup by variables without hardcoding them
separate(key, into = c("label", "num")) %>%
group_by(chr, name, age, label) %>%
# calculate the mean
summarise(mean = mean(value, na.rm = TRUE)) %>%
spread(label, mean)
I took the liberty to modify your initial data to show how the logic would fit special cases. For example, here we have a row (#4) where all values but the first 3 columns are NAs (according to your requirements, this row should be removed) and one where there is a mix of NAs and values (#5). In this case, I assumed we would like to have a result for MGW since there is a value at MGW.1:
# chr name age MGW.1 MGW.2 MGW.3 HEL.1 HEL.2 HEL.3
#1 123 abc 12 10.00 19 18.00 12 13.00 -14
#2 234 bvf 24 -13.29 13 -3.02 12 -0.12 24
#3 376 bxc 17 -6.95 10 -18.00 15 4.00 -4
#4 999 zzz 21 NA NA NA NA NA NA
#5 888 aaa 12 10.00 NA NA NA NA NA
Which gives:
#Source: local data frame [4 x 5]
#Groups: chr, name, age [4]
#
# chr name age HEL MGW
#* <int> <fctr> <int> <dbl> <dbl>
#1 123 abc 12 3.666667 15.666667
#2 234 bvf 24 11.960000 -1.103333
#3 376 bxc 17 5.000000 -4.983333
#4 888 aaa 12 NaN 10.000000
Data
df <- structure(list(chr = c(123L, 234L, 376L, 999L, 888L), name = structure(c(2L,
3L, 4L, 5L, 1L), .Label = c("aaa", "abc", "bvf", "bxc", "zzz"
), class = "factor"), age = c(12L, 24L, 17L, 21L, 12L), MGW.1 = c(10,
-13.29, -6.95, NA, 10), MGW.2 = c(19L, 13L, 10L, NA, NA), MGW.3 = c(18,
-3.02, -18, NA, NA), HEL.1 = c(12L, 12L, 15L, NA, NA), HEL.2 = c(13,
-0.12, 4, NA, NA), HEL.3 = c(-14L, 24L, -4L, NA, NA)), .Names = c("chr",
"name", "age", "MGW.1", "MGW.2", "MGW.3", "HEL.1", "HEL.2", "HEL.3"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
