Match and replace row values across two dataframes - r

I am trying to replace entrezgene_accession names with entrezgene_id, but I am not able to figure it out.
The idea is to replace gene names such as cd37 and catb in df1 with the matching entrezgene_id from df2.
I have been trying to combine datasets using dplyr, but that has not worked.
# df1: 2,002 × 1
id
<chr>
1 106590043
2 cd37
3 106577144
4 106561987
5 106569503
6 106571198
7 106573872
8 106601676
9 106612275
10 catb
# … with 1,992 more rows
# df2: 426 × 2
entrezgene_accession entrezgene_id
<chr> <chr>
37 catb 100195493
38 catk 100195370
39 catl1 100286607
40 cats 100196462
41 cav2 106573118
42 cav2 100196537
43 cb055 100306867
44 cbx6 106591466
45 ccdc178 106569132
46 ccdc84 106603745
47 ccm2 106571003
48 ccnb1 106563318
49 ccnd1 100306852
50 ccr3 100380477
51 ccr6 100194943
52 cd164 106607963
53 cd37 100195746
# … with 416 more rows

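left_join() from dplyr does the trick: join df2 onto df1 by gene name, then keep the looked-up ID where one exists and the original value otherwise.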
library(dplyr)
df1<-tibble::tribble(
~id,
"106590043",
"cd37",
"catb"
)
df2<-tibble::tribble(
~entrezgene_accession, ~entrezgene_id,
"catb", "100195493",
"catk", "100195370",
"catl1", "100286607",
"cd37", "100195746"
)
df_combined<-df1 %>%
left_join(df2, by=c("id"="entrezgene_accession")) %>%
mutate(complete_id=if_else(is.na(entrezgene_id),id,entrezgene_id))
df_combined
#> # A tibble: 3 × 3
#> id entrezgene_id complete_id
#> <chr> <chr> <chr>
#> 1 106590043 <NA> 106590043
#> 2 cd37 100195746 100195746
#> 3 catb 100195493 100195493
Created on 2022-01-09 by the reprex package (v2.0.1)
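One thing to watch for: in the question's df2, cav2 is listed twice with different IDs, and a left_join would then return two rows for a single cav2 entry in df1. Below is a minimal sketch of one way around that, assuming it is acceptable to keep the first ID listed per accession (that rule is my assumption, not something the question states). The names lookup, id_map and df1_replaced are made up for illustration. It also overwrites the id column directly instead of adding a new one.

```r
library(dplyr)

# Small stand-ins for the question's data; cav2 appears twice in df2
df1 <- tibble::tibble(id = c("106590043", "cd37", "cav2", "catb"))
df2 <- tibble::tibble(
  entrezgene_accession = c("catb", "cav2", "cav2", "cd37"),
  entrezgene_id        = c("100195493", "106573118", "100196537", "100195746")
)

# Assumption: keep the first ID listed for each accession
lookup <- df2 %>% distinct(entrezgene_accession, .keep_all = TRUE)

# Named vector: names are gene names, values are IDs
id_map <- setNames(lookup$entrezgene_id, lookup$entrezgene_accession)

# Look every id up by name; values without a match come back as NA,
# and coalesce() then falls back to the original value
df1_replaced <- df1 %>%
  mutate(id = coalesce(unname(id_map[id]), id))

df1_replaced
```

Because the lookup has exactly one entry per gene name, the result always has the same number of rows as df1.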

Related

R: Loop through all unique values and count them

I have a dataset with staff information. I have a column that lists their current age and a column that lists their salary. I want to create an R data frame that has 3 columns: one to show all the unique ages, one to count the number of people who are that age and one to give me the median salary for each particular age. On top of this, I would like to group those who are under 21 and over 65. Ideally it would look like this:
age        number of people   median salary
Under 21   36                 26,300
22         15                 26,300
23         30                 27,020
24         41                 26,300
etc
Over 65    47                 39,100
The current dataset has hundreds of columns and thousands of rows but the columns that are of interest are like this:
ageyears   sal22
46         28,250
32         26,300
19         27,020
24         26,300
53         36,105
47         39,100
47         26,200
70         69,500
68         75,310
I'm a bit lost on the best way to do this but assume some sort of loop would work best? Thanks so much for any direction or help.
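No loop is needed. group_by() and summarise() from dplyr cover this, with case_when() to pool the youngest and oldest ages first.
library(tidyverse)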
sample_data <- tibble(
age = sample(17:70, 100, replace = TRUE) %>% as.character(),
salary = sample(20000:90000, 100, replace = TRUE)
)
# A tibble: 100 × 2
age salary
<chr> <int>
1 56 35130
2 56 44203
3 20 28701
4 47 66564
5 66 60823
6 54 36755
7 66 30731
8 68 21338
9 19 80875
10 61 44547
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
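# age is stored as character above, so convert it before comparing;
# comparing a string with a number falls back to text comparison,
# which only happens to work here because every age has two digits
sample_data %>%
  mutate(age = case_when(as.numeric(age) <= 21 ~ "Under 21",
                         as.numeric(age) >= 65 ~ "Over 65",
                         TRUE ~ age)) %>%
  group_by(age) %>%
  summarise(count = n(),
            median_salary = median(salary))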
# A tibble: 38 × 3
age count median_salary
<chr> <int> <dbl>
1 22 4 46284.
2 23 3 55171
3 25 3 74545
4 27 1 37052
5 28 3 66006
6 29 1 82877
7 30 2 40342.
8 31 2 27815
9 32 1 32282
10 33 3 64523
# … with 28 more rows
# ℹ Use `print(n = ...)` to see more rows
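The output above sorts the age labels as text, so "Over 65" and "Under 21" both land at the end. Here is a hedged sketch of one way to get the order from the question's table, assuming the real columns are numeric and named ageyears and sal22 as in the question. The seed, the simulated data and the names staff, age_levels and age_summary are mine, and the cut-offs simply mirror the answer above.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the simulated data is reproducible
staff <- tibble(
  ageyears = sample(17:70, 100, replace = TRUE),
  sal22    = sample(20000:90000, 100, replace = TRUE)
)

# Desired display order: pooled young group, single ages, pooled old group
age_levels <- c("Under 21", as.character(22:64), "Over 65")

age_summary <- staff %>%
  mutate(age = case_when(
    ageyears <= 21 ~ "Under 21",
    ageyears >= 65 ~ "Over 65",
    TRUE           ~ as.character(ageyears)
  )) %>%
  # A factor with explicit levels makes summarise() return rows in that order
  mutate(age = factor(age, levels = age_levels)) %>%
  group_by(age) %>%
  summarise(count = n(), median_salary = median(sal22))

age_summary
```

Ages with no staff are simply left out of the result, the same as in the answer above.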

dplyr arrange is not working while order is fine

I am trying to get the 10 largest investors in a country, but arrange() in dplyr gives a confusing result compared with order() in base R.
head(fdi_partner)
gives the following result
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
arrange("Number of projects") %>%
head()
gives almost the same result
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code works fine with base R
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n=11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what I am doing wrong?
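dplyr verbs such as arrange() use data masking: columns are referred to by bare names, and a quoted string like "Number of projects" is treated as a constant, so every row gets the same sort key and the order does not change. If your variable name has a space in it, wrap it in backticks instead of quotes (and put desc() around it to sort from largest to smallest, as decreasing = TRUE does in order()):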
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
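Applied to the question itself, a sketch could look like the following. It uses a few rows copied from the question's output rather than the full fdi_partner data, and it drops the TOTAL row on the assumption that it should not count as an investor. The variable names fdi, sort_col, top_by_backticks and top_by_string are made up. The second variant shows the .data pronoun, which is handy when the column name is held in a string.

```r
library(dplyr)

# A few rows from the question, already numeric
fdi <- tibble(
  `Main counterparts`  = c("TOTAL", "Singapore", "Korea Rep.of", "Japan", "Netherlands"),
  `Number of projects` = c(1818, 231, 377, 204, 24)
)

# Backticks for the column name, desc() for largest first
top_by_backticks <- fdi %>%
  filter(`Main counterparts` != "TOTAL") %>%
  arrange(desc(`Number of projects`))

# Same sort when the column name is stored as a string
sort_col <- "Number of projects"
top_by_string <- fdi %>%
  filter(`Main counterparts` != "TOTAL") %>%
  arrange(desc(.data[[sort_col]]))

top_by_backticks
```

Adding head(10) at the end of either pipeline would then give the ten largest.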

Creating serial number for unique entries in R

I want to assign the same serial number to all identical Submission_Ids within one Batch_No. Could someone please help me figure this out?
Submission_Id <- c(619295,619295,619295,619295,619296,619296,619296,619296,619296,556921,556921,559254,647327,647327,647327,646040,646040,646040,646040,646040,646040)
Batch_No <- c(633,633,633,633,633,633,633,633,633,633,633,633,634,634,634,650,650,650,650,650,650)
Expected result
Sl.No <- c(1,1,1,1,2,2,2,2,2,3,3,4,1,1,1,1,1,1,1,1,1)
One way to do it is to create run-length IDs with data.table::rleid(Submission_Id) after group_by(Batch_No). This can be used inside dplyr. To show this, I created a tibble() from the two given vectors, Batch_No and Submission_Id.
library(dplyr)
library(data.table)
dat <- tibble(Submission_Id = Submission_Id,
Batch_No = Batch_No)
dat %>%
group_by(Batch_No) %>%
mutate(Sl.No = data.table::rleid(Submission_Id))
#> # A tibble: 21 x 3
#> # Groups: Batch_No [3]
#> Submission_Id Batch_No Sl.No
#> <dbl> <dbl> <int>
#> 1 619295 633 1
#> 2 619295 633 1
#> 3 619295 633 1
#> 4 619295 633 1
#> 5 619296 633 2
#> 6 619296 633 2
#> 7 619296 633 2
#> 8 619296 633 2
#> 9 619296 633 2
#> 10 556921 633 3
#> # ... with 11 more rows
The original data
Submission_Id <- c(619295,619295,619295,619295,619296,619296,619296,619296,619296,556921,556921,559254,647327,647327,647327,646040,646040,646040,646040,646040,646040)
Batch_No <- c(633,633,633,633,633,633,633,633,633,633,633,633,634,634,634,650,650,650,650,650,650)
Created on 2022-12-16 by the reprex package (v2.0.1)
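A caveat on rleid(): it numbers runs, so if the same Submission_Id shows up again later in a batch after a different ID, it gets a new number. If the rows are not guaranteed to be contiguous, a sketch like the one below may be safer. It numbers IDs by first appearance within each batch using base R's match() and unique(). The data here is made up to show the non-contiguous case, and the name numbered is mine.

```r
library(dplyr)

# Made-up data: 619295 appears, is interrupted by 619296, then appears again
dat <- tibble(
  Submission_Id = c(619295, 619296, 619295, 556921, 647327, 647327),
  Batch_No      = c(633, 633, 633, 633, 634, 634)
)

numbered <- dat %>%
  group_by(Batch_No) %>%
  # Position of each ID among the distinct IDs of its batch,
  # in order of first appearance
  mutate(Sl.No = match(Submission_Id, unique(Submission_Id))) %>%
  ungroup()

numbered
```

Here the third row gets serial number 1 again, whereas rleid() would start a new run for it.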

Merge two datasets but one of them is year_month and the other is year_month_week

I have been practicing merging data in R. Here are two simple data frames, df1 and df2.
df1<-data.frame(id=c(1,1,1,2,2,2,2),
year_month=c(202205,202206,202207,202204,202205,202206,202207),
points=c(65,58,47,21,25,27,43))
df2<-data.frame(id=c(1,1,1,2,2,2),
year_month_week=c(2022052,2022053,2022061,2022043,2022051,2022052),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 202205 in year_month column means May 2022.
For df2, 2022052 in year_month_week column means 2nd week of May, 2022.
I want to merge df1 and df2 at the year_month_week level, so that every row of df2 is kept and values from df1 are repeated where needed.
For example, 202205 in year_month covers both 2022052 and 2022053 in year_month_week. df2 has no points column, so 65 is copied to both of those rows. My expected output looks like this:
df<-data.frame(id=c(1,1,1,2,2,2),
year_month_week=c(2022052,2022053,2022061,2022043,2022051,2022052),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3),
points=c(65,65,58,21,25,25))
Create a temporary year_month column in df2 by taking the first six characters of year_month_week, then do a left join on df1 by year_month and id before removing the temporary column.
Using tidyverse, we could do this as follows:
library(tidyverse)
df2 %>%
mutate(year_month = as.numeric(substr(year_month_week, 1, 6))) %>%
left_join(df1, by = c('year_month', 'id')) %>%
select(-year_month)
#> id year_month_week temperature points
#> 1 1 2022052 36.1 65
#> 2 1 2022053 36.3 65
#> 3 1 2022061 36.6 58
#> 4 2 2022043 34.3 21
#> 5 2 2022051 34.9 25
#> 6 2 2022052 35.3 25
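Or in base R using merge (note that merge() does an inner join by default, so a row of df2 with no matching month in df1 would be dropped unless you add all.x = TRUE):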
df2$year_month <- substr(df2$year_month_week, 1, 6)
merge(df2, df1, by = c('year_month', 'id'))[-1]
#> id year_month_week temperature points
#> 1 2 2022043 34.3 21
#> 2 1 2022052 36.1 65
#> 3 1 2022053 36.3 65
#> 4 2 2022051 34.9 25
#> 5 2 2022052 35.3 25
#> 6 1 2022061 36.6 58
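Since year_month_week is numeric and the week is always the last digit in the data shown, the month key can also be derived arithmetically instead of through substr(). A minimal sketch under that assumption (single-digit week, numeric columns); the name merged is mine:

```r
library(dplyr)

df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2),
                  year_month = c(202205, 202206, 202207, 202204, 202205, 202206, 202207),
                  points = c(65, 58, 47, 21, 25, 27, 43))
df2 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  year_month_week = c(2022052, 2022053, 2022061, 2022043, 2022051, 2022052),
                  temperature = c(36.1, 36.3, 36.6, 34.3, 34.9, 35.3))

merged <- df2 %>%
  # Integer division by 10 drops the trailing week digit: 2022052 -> 202205
  mutate(year_month = year_month_week %/% 10) %>%
  left_join(df1, by = c("id", "year_month")) %>%
  select(-year_month)

merged
```

This keeps both join keys numeric, so no type conversion is needed before the join.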

Gather or transpose data with multiple rows as 'key' argument

I want to tidyr::gather() not only on the column names but also on the values in rows 1 and 2. The goal is a data frame with 5 columns and 4 rows.
This is a little piece of the dataset I'm working with:
library(tidyverse)
# A tibble: 4 x 3
Aanduiding `Coolsingel 40 links` `Goudseweg 15 links`
<chr> <chr> <chr>
1 Gebiedsnummer 1 2
2 Postcode 3011 AD 3031 XH
3 Leefbaar Rotterdam 124 110
4 Partij van de Arbeid (P.v.d.A.) 58 65
and its reproducible dput(df) to work with:
df <- structure(list(Aanduiding = c("Gebiedsnummer", "Postcode", "Leefbaar Rotterdam",
"Partij van de Arbeid (P.v.d.A.)"), `Coolsingel 40 links` = c("1",
"3011 AD", "124", "58"), `Goudseweg 15 links` = c("2", "3031 XH",
"110", "65")), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"), .Names = c("Aanduiding", "Coolsingel 40 links",
"Goudseweg 15 links"))
So the wanted output looks like this:
Aanduiding Gebiedsnummer Postcode adres value
<chr> <dbl> <chr> <chr> <dbl>
1 Leefbaar Rotterdam 1.00 3011 AD Coolsingel 40 links 124
2 Leefbaar Rotterdam 2.00 3031 XH Goudseweg 15 links 110
3 Partij van de Arbeid (P.v.d.A.) 1.00 3011 AD Coolsingel 40 links 58.0
4 Partij van de Arbeid (P.v.d.A.) 2.00 3031 XH Goudseweg 15 links 65.0
I use the gather() function from the tidyr package a lot, but always when I only want to gather the column names with a certain value. Now I want to gather the column names and also the observations in rows 1 and 2.
Can I gather on multiple keys? Or paste the values in observations 1 and 2 onto the column names, then gather() and then separate()?
What's the best tactic here, if possible in a tidyr way.
Much appreciated.
There are two things that need to be done here, and you'll have to figure out how to break down your dataset accordingly.
data.frame(t(df[1:2,]))
will give you:
X1 X2
Aanduiding Gebiedsnummer Postcode
Coolsingel 40 links 1 3011 AD
Goudseweg 15 links 2 3031 XH
And
tidyr::gather(df[3:4,],key="adres",value="value", `Coolsingel 40 links`, `Goudseweg 15 links`)
will give you:
Aanduiding adres value
<chr> <chr> <chr>
1 Leefbaar Rotterdam Coolsingel 40 links 124
2 Partij van de Arbeid (P.v.d.A.) Coolsingel 40 links 58
3 Leefbaar Rotterdam Goudseweg 15 links 110
4 Partij van de Arbeid (P.v.d.A.) Goudseweg 15 links 65
How you go on from there is another problem, possibly a left_join based on adres, but that really depends on how the rest of the data is structured.
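For the small example in the question, the two pieces above could be put together roughly as follows. This is only a sketch: it assumes rows 1 and 2 always hold Gebiedsnummer and Postcode and that every remaining row is a party. The names meta, values and result are made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  Aanduiding = c("Gebiedsnummer", "Postcode", "Leefbaar Rotterdam",
                 "Partij van de Arbeid (P.v.d.A.)"),
  `Coolsingel 40 links` = c("1", "3011 AD", "124", "58"),
  `Goudseweg 15 links`  = c("2", "3031 XH", "110", "65")
)

# Rows 1-2 describe each address: turn them into one row per address
meta <- tibble(
  adres         = names(df)[-1],
  Gebiedsnummer = as.numeric(unlist(df[1, -1])),
  Postcode      = unlist(df[2, -1], use.names = FALSE)
)

# Rows 3-4 hold the values: gather them into long form
values <- df[3:4, ] %>%
  gather(key = "adres", value = "value", -Aanduiding) %>%
  mutate(value = as.numeric(value))

# Attach the address details to every value row
result <- values %>%
  left_join(meta, by = "adres") %>%
  select(Aanduiding, Gebiedsnummer, Postcode, adres, value)

result
```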
You can do this with a combination of gather and spread a few times. I do this often when I need to move a value out to serve as a denominator for a calculation.
library(tidyverse)
...
The goal is to move Gebiedsnummer and Postcode out of Aanduiding, and to gather the other two columns into one column of values. The first gather gets you this:
df %>%
gather(key = address, value = value, -Aanduiding)
#> # A tibble: 8 x 3
#> Aanduiding address value
#> <chr> <chr> <chr>
#> 1 Gebiedsnummer Coolsingel 40 links 1
#> 2 Postcode Coolsingel 40 links 3011 AD
#> 3 Leefbaar Rotterdam Coolsingel 40 links 124
#> 4 Partij van de Arbeid (P.v.d.A.) Coolsingel 40 links 58
#> 5 Gebiedsnummer Goudseweg 15 links 2
#> 6 Postcode Goudseweg 15 links 3031 XH
#> 7 Leefbaar Rotterdam Goudseweg 15 links 110
#> 8 Partij van de Arbeid (P.v.d.A.) Goudseweg 15 links 65
Using a spread after that gets:
df %>%
gather(key = address, value = value, -Aanduiding) %>%
spread(key = Aanduiding, value = value)
#> # A tibble: 2 x 5
#> address Gebiedsnummer `Leefbaar Rotter… `Partij van de Arbe… Postcode
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Coolsinge… 1 124 58 3011 AD
#> 2 Goudseweg… 2 110 65 3031 XH
Then you want to gather again, but to keep address, Gebiedsnummer, and Postcode as their own columns. The select is just there to get the columns in order. So all together:
df %>%
gather(key = address, value = value, -Aanduiding) %>%
spread(key = Aanduiding, value = value) %>%
gather(key = Aanduiding, value = value, -Gebiedsnummer, -address, -Postcode) %>%
select(Aanduiding, Gebiedsnummer, Postcode, address, value) %>%
mutate_at(vars(Gebiedsnummer, value), as.numeric)
#> # A tibble: 4 x 5
#> Aanduiding Gebiedsnummer Postcode address value
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 Leefbaar Rotterdam 1 3011 AD Coolsingel 40 l… 124
#> 2 Leefbaar Rotterdam 2 3031 XH Goudseweg 15 li… 110
#> 3 Partij van de Arbeid (P.v… 1 3011 AD Coolsingel 40 l… 58
#> 4 Partij van de Arbeid (P.v… 2 3031 XH Goudseweg 15 li… 65
Created on 2018-08-24 by the reprex package (v0.2.0).
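In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The same three steps could be sketched with the newer functions as below. The final arrange() is my addition so the rows come out in the same order as the output above, and the name tidy is made up.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  Aanduiding = c("Gebiedsnummer", "Postcode", "Leefbaar Rotterdam",
                 "Partij van de Arbeid (P.v.d.A.)"),
  `Coolsingel 40 links` = c("1", "3011 AD", "124", "58"),
  `Goudseweg 15 links`  = c("2", "3031 XH", "110", "65")
)

tidy <- df %>%
  # Step 1: one row per (Aanduiding, address) pair
  pivot_longer(-Aanduiding, names_to = "address", values_to = "value") %>%
  # Step 2: one row per address, one column per Aanduiding
  pivot_wider(names_from = Aanduiding, values_from = value) %>%
  # Step 3: gather the party columns again, keeping the address details
  pivot_longer(-c(address, Gebiedsnummer, Postcode),
               names_to = "Aanduiding", values_to = "value") %>%
  select(Aanduiding, Gebiedsnummer, Postcode, address, value) %>%
  mutate(across(c(Gebiedsnummer, value), as.numeric)) %>%
  arrange(Aanduiding, address)

tidy
```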
