R: construct a data frame using 2 columns

I have a data frame like:
df
group group_name value
1 1 <NA> VV0001
2 1 <NA> VV_RS00280
3 2 <NA> VV0002
4 2 <NA> VV_RS00285
5 3 <NA> VV0003
6 3 <NA> VV_RS00290
7 5 <NA> VV0004
8 5 <NA> VV_RS00295
9 6 <NA> VV0005
10 6 <NA> VV_RS00300
11 7 <NA> VV0006
12 7 <NA> VV_RS00305
13 8 <NA> VV0007
14 8 <NA> VV_RS00310
15 9 <NA> VV0009
16 9 <NA> VV_RS00315
17 10 <NA> VV0011
18 10 <NA> VV_RS00320
19 11 <NA> VV0012
20 11 <NA> VV_RS00325
21 12 <NA> VV0013
22 12 <NA> VV_RS00330
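For reproducibility, the printed data can be reconstructed roughly like this (a sketch; group_name is assumed to be an all-NA character column):
df <- data.frame(
  group = rep(c(1:3, 5:12), each = 2),
  group_name = NA_character_,
  value = c("VV0001", "VV_RS00280", "VV0002", "VV_RS00285",
            "VV0003", "VV_RS00290", "VV0004", "VV_RS00295",
            "VV0005", "VV_RS00300", "VV0006", "VV_RS00305",
            "VV0007", "VV_RS00310", "VV0009", "VV_RS00315",
            "VV0011", "VV_RS00320", "VV0012", "VV_RS00325",
            "VV0013", "VV_RS00330"))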
I want to construct another data frame using the columns "group" and "value": each group, e.g. group 1 (df[df$group == 1,]), contributes its two "value" entries (VV0001, VV_RS00280) to one row, like:
group value
1 VV0001 VV_RS00280
then the next group df[df$group == 2,], and so on, so that the end result is:
group value
1 VV0001 VV_RS00280
2 VV0002 VV_RS00285
3 VV0003 VV_RS00290
4 VV0004 VV_RS00295
I tried to do it manually, but nrow(df) is large (> 3000).
Thanks

You may try,
library(dplyr)
library(tidyr)
df %>%
  rename(idv = group) %>%
  # label the two rows of each pair as "group" / "value"
  mutate(group_name = rep(c("group", "value"), n() / 2)) %>%
  group_by(idv) %>%
  pivot_wider(names_from = group_name, values_from = value) %>%
  ungroup() %>%
  select(-idv)
group value
<chr> <chr>
1 VV0001 VV_RS00280
2 VV0002 VV_RS00285
3 VV0003 VV_RS00290
4 VV0004 VV_RS00295
5 VV0005 VV_RS00300
6 VV0006 VV_RS00305
7 VV0007 VV_RS00310
8 VV0009 VV_RS00315
9 VV0011 VV_RS00320
10 VV0012 VV_RS00325
11 VV0013 VV_RS00330
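If you prefer base R, here is a sketch of the same reshape; it assumes every group has exactly two rows, in the printed order:
# split the value column by group; each piece is a length-2 character vector
pairs <- split(df$value, df$group)
data.frame(group = vapply(pairs, `[`, character(1), 1),
           value = vapply(pairs, `[`, character(1), 2),
           row.names = NULL)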

Related

How can I select specific columns that start with a word according to a condition on another column in R using dplyr?

I have a data frame that looks like this:
date       var cat_low dog_low cat_high dog_high Love Friend
2022-01-01 A         1       7       13       19 NA   friend
2022-01-01 A         2       8       14       20 NA   friend
2022-01-01 A         3       9       15       21 NA   friend
2022-02-01 B         4      10       16       22 love NA
2022-02-01 B         5      11       17       23 love NA
2022-02-01 B         6      12       18       24 love NA
I want to select columns based on the Love and Friend columns: if Love is "love", keep the columns that start with "cat"; if Friend is "friend", keep the columns that start with "dog".
Ideally I want the result to look like this:
date       var a b
2022-01-01 A   7 19
2022-01-01 A   8 20
2022-01-01 A   9 21
2022-02-01 B   4 16
2022-02-01 B   5 17
2022-02-01 B   6 18
library(tibble)    # tibble() comes from the tibble package, not lubridate
library(lubridate)
date <- c(rep(as.Date("2022-01-01"), 3), rep(as.Date("2022-02-01"), 3))
var <- c(rep("A", 3), rep("B", 3))
cat_low <- seq(1, 6, 1)
dog_low <- seq(7, 12, 1)
cat_high <- seq(13, 18, 1)
dog_high <- seq(19, 24, 1)
Friend <- c(rep("friend", 3), rep(NA, 3))
Love <- c(rep(NA, 3), rep("love", 3))
df <- tibble(date, var, cat_low, dog_low, cat_high, dog_high, Love, Friend); df
How can I do that in R using dplyr? Any help is appreciated.
Try this with dplyr.
The first summarise filters the rows for dog or cat; rename() and the second summarise then combine the variables.
library(dplyr)
df %>%
  summarise(date, var,
            across(starts_with("dog"), ~ .x[Friend == "friend"]),
            across(starts_with("cat"), ~ .x[Love == "love"])) %>%
  rename(a = dog_low, b = dog_high) %>%
  summarise(date, var,
            a = ifelse(is.na(a), cat_low, a),
            b = ifelse(is.na(b), cat_high, b))
date var a b
1 2022-01-01 A 7 19
2 2022-01-01 A 8 20
3 2022-01-01 A 9 21
4 2022-02-01 B 4 16
5 2022-02-01 B 5 17
6 2022-02-01 B 6 18
There might be better ways, but here's one:
library(tidyr)
library(dplyr)
df %>%
  pivot_longer(cols = starts_with(c("cat", "dog")),
               names_to = c("animal", ".value"),
               names_pattern = "(cat|dog)_(low|high)") %>%
  filter((is.na(Love) & animal == "dog") |
           (is.na(Friend) & animal == "cat")) %>%
  select(date, var, low, high)
Output:
# A tibble: 6 × 4
date var low high
<date> <chr> <dbl> <dbl>
1 2022-01-01 A 7 19
2 2022-01-01 A 8 20
3 2022-01-01 A 9 21
4 2022-02-01 B 4 16
5 2022-02-01 B 5 17
6 2022-02-01 B 6 18
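Since exactly one of Love and Friend is non-NA per row in the example, a shorter sketch with transmute() and if_else() also reproduces the desired output (under that mutual-exclusivity assumption):
library(dplyr)
df %>%
  transmute(date, var,
            # dog columns where Friend is set, cat columns otherwise
            a = if_else(!is.na(Friend), dog_low, cat_low),
            b = if_else(!is.na(Friend), dog_high, cat_high))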

Remove rows in a group depending on multiple criteria in R

I have a dataset with some repeated values in the Date variable, and I would like to filter these rows based on several conditions. As an example, the data frame looks like:
df <- read.table(text =
"Date column_A column_B column_C Column_D
1 2020-01-01 10 15 15 20
2 2020-01-02 10 15 15 20
3 2020-01-03 10 13 15 20
4 2020-01-04 10 15 15 20
5 2020-01-05 NA 14 15 20
6 2020-01-05 7 NA NA 28
7 2020-01-06 10 15 15 20
8 2020-01-07 10 15 15 20
9 2020-01-07 10 NA NA 20
10 2020-01-08 10 15 15 20", header=TRUE)
df$Date <- as.Date(df$Date)
The conditions to filter, applied ONLY to duplicated rows, should be:
If "column_A" is NA in one row and numeric in the other, select the numeric row.
If both rows are similar (both NA or both numeric), select the row with fewer NAs.
My best approach, after trying several options, is:
df$cnt_na <- apply(df[, 2:5], 1, function(x) sum(is.na(x)))
df <- df %>%
  group_by(Date) %>%
  slice(which.min(cnt_na)) %>%  # keep the row with the fewest NAs per Date
  select(-cnt_na)
However, it doesn't satisfy the first condition. And if I instead filter by !is.na(column_A), I also remove other non-duplicated rows.
Thanks in advance
I would sort your table based on your conditions and then pick the first row for every group:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(cnt_na = sum(across(-Date, ~ sum(is.na(.))))) %>%  # NA count per row
  arrange(Date, is.na(column_A), cnt_na) %>%                # preferred row first
  group_by(Date) %>%
  slice_head() %>%
  ungroup()
which gives
# A tibble: 8 x 6
Date column_A column_B column_C Column_D cnt_na
<date> <int> <int> <int> <int> <int>
1 2020-01-01 10 15 15 20 0
2 2020-01-02 10 15 15 20 0
3 2020-01-03 10 13 15 20 0
4 2020-01-04 10 15 15 20 0
5 2020-01-05 7 NA NA 28 2
6 2020-01-06 10 15 15 20 0
7 2020-01-07 10 15 15 20 0
8 2020-01-08 10 15 15 20 0
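For completeness, a base R sketch of the same idea (order the rows by the conditions, then keep the first row per Date):
cnt_na <- rowSums(is.na(df[-1]))   # NA count per row, excluding Date
srt <- df[order(df$Date, is.na(df$column_A), cnt_na), ]
srt[!duplicated(srt$Date), ]       # first, i.e. preferred, row per Date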

Add column from df2 to df1 based on match between df1 and df2

I have two data sets, df1 and df2, which have the columns "ID" and "State" in common:
df1 <- data.frame(ID=c(1:20), State=c("NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","CA","IL","SD","NC","SC","WA","CO","AL","AK","HI"))
df2 <- data.frame(ID=c(1,2,3,4,5,"NA","NA","NA","NA","NA"), Year=c("2020","2021","2020","2020","2021","2020","2020","2021","2020","2019"),State=c("NA","NA","NA","NA","NA","CA","SC","NY","NJ","OR"))
How can I add Year from df2 to df1, matching on an ID that exists in df1 OR on a State that exists in df1?
The reason why I want to make this change: I just need to add this "Year" information from df2 to df1.
Here's a dplyr solution:
library(dplyr)
df1 <- df1 %>%
  mutate(join = ifelse(State == 'NA', ID, State))  # join on ID where State is "NA"
df2 <- df2 %>%
  mutate(join = ifelse(State == 'NA', ID, State))
df_new <- left_join(df1, df2, by = "join") %>%
  mutate(State = coalesce(State.x, State.y)) %>%
  select(-c(State.x, State.y, join, ID.y)) %>%
  rename(ID = ID.x)
This gives us:
ID Year State
1 1 2020 NA
2 2 2021 NA
3 3 2020 NA
4 4 2020 NA
5 5 2021 NA
6 6 <NA> NA
7 7 <NA> NA
8 8 <NA> NA
9 9 <NA> NA
10 10 <NA> NA
11 11 2020 CA
12 12 <NA> IL
13 13 <NA> SD
14 14 <NA> NC
15 15 2020 SC
16 16 <NA> WA
17 17 <NA> CO
18 18 <NA> AL
19 19 <NA> AK
20 20 <NA> HI
You could do:
df1 <- type.convert(df1, as.is = TRUE)  # turn "NA" strings into real NA values
df2 <- type.convert(df2, as.is = TRUE)  # (as.is = TRUE avoids factor conversion)
df1 %>%
  left_join(select(df2, -State), 'ID') %>%
  left_join(select(filter(df2, is.na(ID)), -ID), 'State') %>%
  mutate(Year = coalesce(Year.x, Year.y), Year.x = NULL, Year.y = NULL)
ID State Year
1 1 <NA> 2020
2 2 <NA> 2021
3 3 <NA> 2020
4 4 <NA> 2020
5 5 <NA> 2021
6 6 <NA> NA
7 7 <NA> NA
8 8 <NA> NA
9 9 <NA> NA
10 10 <NA> NA
11 11 CA 2020
12 12 IL NA
13 13 SD NA
14 14 NC NA
15 15 SC 2020
16 16 WA NA
17 17 CO NA
18 18 AL NA
19 19 AK NA
20 20 HI NA
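A base R sketch of the same two-step match with merge(), assuming the "NA" strings have already been converted to real NA as above:
# match by ID where df2 has an ID, then by State for df2 rows without an ID
by_id <- merge(df1, df2[!is.na(df2$ID), c("ID", "Year")], by = "ID", all.x = TRUE)
res   <- merge(by_id, df2[is.na(df2$ID), c("State", "Year")], by = "State", all.x = TRUE)
res$Year <- ifelse(is.na(res$Year.x), res$Year.y, res$Year.x)
res[order(res$ID), c("ID", "State", "Year")]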

In search of a more efficient solution for converting wide data to long data

I want to convert my data from wide to long. I have solved the problem with reshape(), but I manually had to define which columns belonged to each "gather" group; with hundreds of columns (which is the case in my data) that would be time-consuming and carry a high risk of typing errors.
Does anyone know a more efficient way to reach this result?
id <- 1001:1003
qA2 <- c(10,5,1)
qB2 <- c(11,6,3)
qC2 <- c(10,7,5)
qA3 <- c(15,12,8)
qB3 <- c(18,15,7)
qC3 <- c(19,11,10)
df <- data.frame(id,qA2,qB2,qC2, qA3, qB3, qC3)
df
id qA2 qB2 qC2 qA3 qB3 qC3
1 1001 10 11 10 15 18 19
2 1002 5 6 7 12 15 11
3 1003 1 3 5 8 7 10
Solution with base R's reshape() (note: this is stats::reshape; the reshape2 package is not actually used here):
df_test <- reshape(df, idvar = "id", direction = "long",
                   varying = list(c(2, 5), c(3, 6), c(4, 7)),
                   v.names = c("qA", "qB", "qC"), times = 2:3)
df_test <- df_test[order(df_test$id, df_test$time), ]
df_test
id time qA qB qC
1001.2 1001 2 10 11 10
1001.3 1001 3 15 18 19
1002.2 1002 2 5 6 7
1002.3 1002 3 12 15 11
1003.2 1003 2 1 3 5
1003.3 1003 3 8 7 10
Using dplyr and tidyr, here is one way; not sure about the efficiency, though:
library(dplyr)
library(tidyr)
df %>%
  gather(key, value, -id) %>%
  mutate(key = sub("\\d+", "", key)) %>%  # drop the time suffix: qA2 -> qA
  group_by(key) %>%
  mutate(row = row_number()) %>%
  spread(key, value) %>%
  select(-row)
# A tibble: 6 x 4
# id qA qB qC
# <int> <dbl> <dbl> <dbl>
#1 1001 10 11 10
#2 1001 15 18 19
#3 1002 5 6 7
#4 1002 12 15 11
#5 1003 1 3 5
#6 1003 8 7 10
With the new version of tidyr (1.0.0) (already on CRAN, just update it):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with("q"),
               names_to = c(".value", "time"),
               # split each name after the 2nd character: "qA2" -> column "qA", time "2"
               names_sep = 2)
Here is a base R one-liner:
df1 <- cbind(id = df$id,
             do.call(cbind,
                     lapply(split.default(df[-1], gsub('\\d+', '', names(df)[-1])),
                            stack))[c(TRUE, FALSE)])
df1[with(df1, order(id)), ]
# id qA.values qB.values qC.values
#1 1001 10 11 10
#4 1001 15 18 19
#2 1002 5 6 7
#5 1002 12 15 11
#3 1003 1 3 5
#6 1003 8 7 10
We can use names_pattern with pivot_longer
library(tidyr)
pivot_longer(df, -id, names_to = c(".value", "time"), names_pattern = "(\\D+)(\\d+)")
# A tibble: 6 x 5
# id time qA qB qC
# <int> <chr> <dbl> <dbl> <dbl>
#1 1001 2 10 11 10
#2 1001 3 15 18 19
#3 1002 2 5 6 7
#4 1002 3 12 15 11
#5 1003 2 1 3 5
#6 1003 3 8 7 10
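For larger data, data.table's melt() handles the same multi-column reshape; a sketch (note the time column will hold the measure index 1/2 rather than the original suffixes 2/3):
library(data.table)
dt <- as.data.table(df)
melt(dt, id.vars = "id",
     measure.vars = patterns("^qA", "^qB", "^qC"),  # one pattern per value column
     value.name = c("qA", "qB", "qC"),
     variable.name = "time")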

Selecting the top rows of a data frame by group

I have a data frame such as:
set.seed(1)
df <- data.frame(
  sample = 1:50,
  value = runif(50),
  group = c(rep(NA, 20), gl(3, 10)))
I want to select the top 10 samples based on value. However, if a sample belongs to a group, I only want to include one sample from that group. If group is NA, I want to include all of them. Arranging df by value looks like:
df_top <- df %>%
  arrange(-value) %>%
  top_n(10, value)
sample value group
1 46 0.7973088 3
2 49 0.8108702 3
3 22 0.8394404 1
4 2 0.8612095 NA
5 27 0.8643395 1
6 20 0.8753213 NA
7 44 0.8762692 3
8 26 0.8921983 1
9 11 0.9128759 NA
10 30 0.9606180 1
I would want to include samples 36, 22, 2, 20, 11, and the next five highest values in my data frame that continue to fit the pattern. How do I accomplish this?
I think I figured this out. Would this be the best way:
df_top <- df %>%
  arrange(-value) %>%
  group_by(group) %>%
  # keep only the max-value row in real groups; keep all NA-group rows
  filter(ifelse(!is.na(group), value == max(value), value == value)) %>%
  ungroup() %>%
  top_n(10, value)
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 18 0.992 NA
2 7 0.945 NA
3 21 0.935 1
4 4 0.908 NA
5 6 0.898 NA
6 35 0.827 2
7 41 0.821 3
8 20 0.777 NA
9 15 0.770 NA
10 17 0.718 NA
Similar method that uses slice instead of filter:
library(dplyr)
df_top <- df %>%
  arrange(-value) %>%
  group_by(group) %>%
  slice(if (any(!is.na(group))) 1 else 1:n()) %>%  # 1 row per real group, all NA rows
  ungroup() %>%
  top_n(10, value)
Result:
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 21 0.9347052 1
2 35 0.8273733 2
3 41 0.8209463 3
4 18 0.9919061 NA
5 7 0.9446753 NA
6 4 0.9082078 NA
7 6 0.8983897 NA
8 20 0.7774452 NA
9 15 0.7698414 NA
10 17 0.7176185 NA
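Another sketch of the same logic (dplyr 1.0 or later, and assuming ties are not a concern): keep every NA-group row, keep only the best row of each real group, then take the overall top 10:
library(dplyr)
df_top <- df %>%
  group_by(group) %>%
  filter(is.na(group) | value == max(value)) %>%
  ungroup() %>%
  slice_max(value, n = 10)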
