Joining and merging two dataframes with no multiplicity - r

The title is confusing, I know, so let me explain my problem.
I have 2 datasets. The first contains the frequency of citations by year for every article, where each article is identified by its pmid. It looks like this:
pmid year freq
1 14561399 2011 1
2 14561399 2012 3
3 18511332 2010 1
4 21193046 2012 2
5 21193046 2013 2
6 14561399 2013 1
7 18511332 2011 1
8 18511332 2012 3
9 14561399 2014 1
10 16533158 2013 2
and the second contains article features and looks like this:
pmid title_char title_wrds
1 20711763 75 9
2 20734175 109 12
3 20058113 93 13
4 20625865 142 17
5 20517661 103 12
6 20195930 128 16
As you can see, both datasets contain "pmid", which is the key I need to "merge" or "join" on. That part is not a problem; it can be done with the merge() function or with the dplyr package. But when I do this, the result looks like this:
pmid title_char title_wrds year freq
1 184 77 10 2010 1
2 406 142 20 2008 1
3 407 110 16 2008 1
4 407 110 16 2003 1
5 408 79 10 1998 1
6 450 58 7 2012 2
7 450 58 7 2009 1
My problem is that, as you can see for example in lines 2 and 3, these two lines contain the same article (the same pmid, the same features), but it appears twice because of the year of citation.
pmid title_char title_wrds year freq
3 407 110 16 2008 1
4 407 110 16 2003 1
And I want something like this:
pmid title_char title_wrds year2008Freq year2003Freq
1 407 110 16 1 1
That is, 1 line per article.
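For reference, the join step described above can be sketched like this, assuming the citation counts live in a data frame called citations and the article features in articles (both names are just placeholders):
library(dplyr)
# one row per (article, citation year) pair -- the long result shown above
merged <- inner_join(articles, citations, by = "pmid")
# base R equivalent: merge(articles, citations, by = "pmid")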

You could try dcast() from reshape2, reshaping from long to wide so that each year gets its own frequency column:
library(reshape2)
res <- dcast(dfN, ... ~ paste0('year', year, 'Freq'), value.var = 'freq')
data
dfN <- structure(list(pmid = c(184L, 406L, 407L, 407L, 408L, 450L, 450L
), title_char = c(77L, 142L, 110L, 110L, 79L, 58L, 58L),
title_wrds = c(10L,
20L, 16L, 16L, 10L, 7L, 7L), year = c(2010L, 2008L, 2008L, 2003L,
1998L, 2012L, 2009L), freq = c(1L, 1L, 1L, 1L, 1L, 2L, 1L)),
.Names = c("pmid",
"title_char", "title_wrds", "year", "freq"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7"))

Related

friendship network identification in R

I want to identify networks where all people in the same network are directly or indirectly connected through friendship nominations, while no students from different networks are connected.
I am using the Add Health data. Each student nominates up to 10 friends.
Sample data may look like this:
ID FID_1 FID_2 FID_3 FID_4 FID_5 FID_6 FID_7 FID_8 FID_9 FID_10
1 2 6 7 9 10 NA NA NA NA NA
2 5 9 12 45 13 90 87 6 NA NA
3 1 2 4 7 8 9 10 14 16 18
100 110 120 122 125 169 178 190 200 500 520
500 100 110 122 125 169 178 190 200 500 520
700 800 789 900 NA NA NA NA NA NA NA
1000 789 2000 820 900 NA NA NA NA NA NA
There are around 85,000 individuals. Could anyone please tell me how I can get a network ID for each of them?
I would like the data to look like the following:
ID network_ID ID network_ID
1 1 700 3
2 1 789 3
3 1 800 3
4 1 820 3
5 1 900 3
6 1 1000 3
7 1 2000 3
8 1
9 1
10 1
12 1
13 1
14 1
16 1
18 1
90 1
87 1
100 2
110 2
120 2
122 2
125 2
169 2
178 2
190 2
200 2
500 2
520 2
So everyone directly or indirectly connected to ID 1 belongs to network 1. ID 2 is a friend of 1, so everyone directly or indirectly connected to 2 is also in 1's network, and so on. ID 700 is not connected to 1, nor a friend of 1, nor a friend of a friend of 1, and so on, so 700 is in a different network, which is network 3.
Any help will be much appreciated.
Update
library(igraph)
library(dplyr)
library(data.table)
setDT(df) %>%
  # reshape the ten FID columns into one long edge list (ID -> ID2)
  melt(id.var = "ID", variable.name = "FID", value.name = "ID2") %>%
  na.omit() %>%
  setcolorder(c("ID", "ID2", "FID")) %>%
  # build a graph from the edge list and label its connected components
  graph_from_data_frame() %>%
  components() %>%
  membership() %>%
  # turn the named membership vector into an ID / Network_ID data frame
  stack() %>%
  setNames(c("Network_ID", "ID")) %>%
  rev() %>%
  type.convert(as.is = TRUE) %>%
  arrange(Network_ID, ID)
gives
ID Network_ID
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 12 1
12 13 1
13 14 1
14 16 1
15 18 1
16 45 1
17 87 1
18 90 1
19 100 2
20 110 2
21 120 2
22 122 2
23 125 2
24 169 2
25 178 2
26 190 2
27 200 2
28 500 2
29 520 2
30 700 3
31 789 3
32 800 3
33 820 3
34 900 3
35 1000 3
36 2000 3
Data
> dput(df)
structure(list(ID = c(1L, 2L, 3L, 100L, 500L, 700L, 1000L), FID_1 = c(2L,
5L, 1L, 110L, 100L, 800L, 789L), FID_2 = c(6L, 9L, 2L, 120L,
110L, 789L, 2000L), FID_3 = c(7L, 12L, 4L, 122L, 122L, 900L,
820L), FID_4 = c(9L, 45L, 7L, 125L, 125L, NA, 900L), FID_5 = c(10L,
13L, 8L, 169L, 169L, NA, NA), FID_6 = c(NA, 90L, 9L, 178L, 178L,
NA, NA), FID_7 = c(NA, 87L, 10L, 190L, 190L, NA, NA), FID_8 = c(NA,
6L, 14L, 200L, 200L, NA, NA), FID_9 = c(NA, NA, 16L, 500L, 500L,
NA, NA), FID_10 = c(NA, NA, 18L, 520L, 520L, NA, NA)), class = "data.frame", row.names = c(NA,
-7L))
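For what it's worth, the same connected-components idea can be written without the long pipe by pulling the membership vector out of components() directly; this is only a sketch, assuming the same df as above:
library(igraph)
library(data.table)
# long edge list: one row per (student, nominated friend) pair, NAs dropped
edges <- na.omit(melt(setDT(df), id.vars = "ID",
                      variable.name = "FID", value.name = "ID2"))
g <- graph_from_data_frame(edges[, .(ID, ID2)])
comp <- components(g)
res <- data.frame(ID = as.integer(names(comp$membership)),
                  Network_ID = as.integer(comp$membership))
res <- res[order(res$Network_ID, res$ID), ]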
Are you looking for something like this?
library(data.table)
library(dplyr)
library(igraph)
setDT(df) %>%
  melt(id.var = "ID", variable.name = "FID", value.name = "ID2") %>%
  na.omit() %>%
  setcolorder(c("ID", "ID2", "FID")) %>%
  graph_from_data_frame() %>%
  plot(edge.label = E(.)$FID)
Data
structure(list(ID = 1:3, FID_1 = c(2L, 5L, 1L), FID_2 = c(6L,
9L, 2L), FID_3 = c(7L, 12L, 4L), FID_4 = c(9L, 45L, 7L), FID_5 = c(10L,
12L, 8L), FID_6 = c(NA, 90L, 9L), FID_7 = c(NA, 87L, 10L), FID_8 = c(NA,
6L, 14L), FID_9 = c(NA, NA, 16L), FID_10 = c(NA, NA, 18L)), class = "data.frame", row.names = c(NA,
-3L))

data.table approach for creating a running sequential number for each row in a group

I need to create a running sequential number for every row in a group. The groups are defined by both the artist ID and the course number. The course number is itself a sequential running number created from a very specific criterion: if an artist goes more than 7 days between recordings, a new course number starts.
For example, let's say that we have an artist_id whose data looks like this:
artist_id session_number_total CustomerRecordId SiteRecordId recording_date control_panel year
        1                    1                3            3       1/1/2000          Left 2000
        1                    2                3            3       1/3/2000         Right 2000
        1                    3                3            3       1/8/2000         Right 2000
        1                    4                3            3       5/1/2000          Left 2000
This artist_id came in for a session on 1/1/2000, 1/3/2000, 1/8/2000, and 5/1/2000. Based on the aforementioned criterion for creating the course number groups (no more than 7 days between recording dates), and adding the running count of sessions within each course, the final dataset should look like this:
artist_id session_number_total CustomerRecordId SiteRecordId recording_date control_panel year days_between          Status course_number session_in_course
        1                    1                3            3       1/1/2000          Left 2000            0 Existing Course             1                 1
        1                    2                3            3       1/3/2000         Right 2000            2 Existing Course             1                 2
        1                    3                3            3       1/8/2000         Right 2000            5 Existing Course             1                 3
        1                    4                3            3       5/1/2000          Left 2000          114      New Course             2                 1
I am able to achieve this with some very convoluted dplyr code. It works every time, but with 2.5 million rows in my dataset, it can take 20-30 minutes to run each time I open a new R session.
Considering how well data.table handles large data, does anyone know a data.table solution for creating the sequential group numbers and the running session count inside each group, based on the criterion above? Any help would be appreciated, so thank you in advance.
Here is a reproducible dataset and the dplyr code I used to create the final dataset:
library(tidyverse)
library(scales)
library(tibbletime)
library(lubridate)
library(data.table)
x <- structure(list(artist_id = c(257L, 257L, 257L, 257L, 257L, 257L,
257L, 257L, 421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L,
421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L), session_number_total = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L), CustomerRecordId = c(4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), SiteRecordId = c(5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), recording_date = structure(c(16062,
16063, 16065, 16066, 16074, 16079, 16087, 16092, 16027, 16028,
16035, 16038, 16056, 16058, 16065, 16072, 16073, 16077, 16079,
16083, 16086, 16087, 16090, 16091, 16094, 16111), class = "Date"),
control_panel = c("Left", "Left", "Left", "Left", "Left",
"Left", "Left", "Left", "Bilateral", "Bilateral", "Bilateral",
"Bilateral", "Bilateral", "Bilateral", "Bilateral", "Bilateral",
"Bilateral", "Bilateral", "Bilateral", "Bilateral", "Bilateral",
"Bilateral", "Bilateral", "Bilateral", "Bilateral", "Bilateral"
), year = c(2013L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L,
2014L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2014L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L)), class = "data.frame", row.names = c(NA, -26L))
x
final_df <- x %>%
arrange(artist_id, recording_date) %>%
group_by(artist_id) %>%
mutate(days_between = recording_date - lag(recording_date, 1)) %>%
arrange(artist_id, recording_date) %>%
mutate(days_between = ifelse(is.na(days_between), 0, days_between)) %>%
mutate(Status = ifelse(days_between > 7, "New Course", "Existing Course")) %>%
mutate(num = ifelse(Status == "New Course", seq(1, 100000), 1)) %>%
group_by(artist_id, Status, num) %>%
mutate(group_num = row_number()) %>%
ungroup() %>%
group_by(artist_id, group_num) %>%
mutate(course_number = ifelse(group_num == 1, seq(1, 100000),NA)) %>%
ungroup() %>%
group_by(artist_id) %>%
fill(course_number) %>%
ungroup() %>%
group_by(artist_id, course_number) %>%
mutate(session_in_course =row_number()) %>%
select(-num, -group_num)
final_df <- as.data.frame(final_df)
final_df
How about this data.table solution:
library(data.table)
setDT(x)
x[, days_between := c(0, diff(recording_date)), by = .(artist_id)            # gap (in days) since the previous session
  ][, course_number := 1L + cumsum(days_between > 7), by = .(artist_id)      # start a new course whenever the gap exceeds 7 days
  ][, session_in_course := seq_len(.N), by = .(artist_id, course_number)]    # running session count within each course
# artist_id session_number_total CustomerRecordId SiteRecordId recording_date control_panel year days_between course_number session_in_course
# <int> <int> <int> <int> <Date> <char> <int> <num> <int> <int>
# 1: 257 1 4 5 2013-12-23 Left 2013 0 1 1
# 2: 257 2 4 5 2013-12-24 Left 2013 1 1 2
# 3: 257 3 4 5 2013-12-26 Left 2013 2 1 3
# 4: 257 4 4 5 2013-12-27 Left 2013 1 1 4
# 5: 257 5 4 5 2014-01-04 Left 2014 8 2 1
# 6: 257 6 4 5 2014-01-09 Left 2014 5 2 2
# 7: 257 7 4 5 2014-01-17 Left 2014 8 3 1
# 8: 257 8 4 5 2014-01-22 Left 2014 5 3 2
# 9: 421 1 5 10 2013-11-18 Bilateral 2013 0 1 1
# 10: 421 2 5 10 2013-11-19 Bilateral 2013 1 1 2
# 11: 421 3 5 10 2013-11-26 Bilateral 2013 7 1 3
# 12: 421 4 5 10 2013-11-29 Bilateral 2013 3 1 4
# 13: 421 5 5 10 2013-12-17 Bilateral 2013 18 2 1
# 14: 421 6 5 10 2013-12-19 Bilateral 2013 2 2 2
# 15: 421 7 5 10 2013-12-26 Bilateral 2013 7 2 3
# 16: 421 8 5 10 2014-01-02 Bilateral 2014 7 2 4
# 17: 421 9 5 10 2014-01-03 Bilateral 2014 1 2 5
# 18: 421 10 5 10 2014-01-07 Bilateral 2014 4 2 6
# 19: 421 11 5 10 2014-01-09 Bilateral 2014 2 2 7
# 20: 421 12 5 10 2014-01-13 Bilateral 2014 4 2 8
# 21: 421 13 5 10 2014-01-16 Bilateral 2014 3 2 9
# 22: 421 14 5 10 2014-01-17 Bilateral 2014 1 2 10
# 23: 421 15 5 10 2014-01-20 Bilateral 2014 3 2 11
# 24: 421 16 5 10 2014-01-21 Bilateral 2014 1 2 12
# 25: 421 17 5 10 2014-01-24 Bilateral 2014 3 2 13
# 26: 421 18 5 10 2014-02-10 Bilateral 2014 17 3 1
# artist_id session_number_total CustomerRecordId SiteRecordId recording_date control_panel year days_between course_number session_in_course
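If a dplyr pipeline is preferred, the cumsum() trick above translates directly and avoids most of the intermediate steps in the original code; this is only a sketch, assuming the same x:
library(dplyr)
final_df <- x %>%
  arrange(artist_id, recording_date) %>%
  group_by(artist_id) %>%
  mutate(days_between = as.numeric(recording_date - lag(recording_date, default = first(recording_date))),
         Status = ifelse(days_between > 7, "New Course", "Existing Course"),
         course_number = 1L + cumsum(days_between > 7)) %>%
  group_by(artist_id, course_number) %>%
  mutate(session_in_course = row_number()) %>%
  ungroup()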

Is there a way to automatically average multiple treatments at once in R? [duplicate]

This question already has answers here:
Calculate the mean by group
(9 answers)
Closed 2 years ago.
Very sorry if this is a reposted question; I searched and couldn't find the answer I was looking for. Say I have the following dataset:
Plot Plant Count
1 101 1 9
2 101 2 15
3 101 3 5
4 101 4 15
5 101 5 26
6 102 1 9
7 102 2 26
8 102 3 9
9 102 4 15
10 102 5 17
11 103 1 12
12 103 2 6
13 103 3 22
14 103 4 12
15 103 5 6
I'd like to average the "Count" values across the 5 plants of each plot. However, in my real dataset I have many more than 3 plots. Is there a way to write the code so that it automatically averages all my plots at once? I'd like to learn to write code that gets the average for each plot as efficiently as possible. Any help would be very much appreciated.
I am fairly new to stackoverflow and am not the strongest with R, so if I have made a mistake in my formatting or something similar please let me know. Thanks for your time!
Try this with dplyr using group_by() and summarise(). Here is the code:
library(dplyr)
#Code
newdf <- df %>% group_by(Plot) %>% summarise(Avg=mean(Count))
Output:
# A tibble: 3 x 2
Plot Avg
<int> <dbl>
1 101 14
2 102 15.2
3 103 11.6
Some data used:
#Data
df <- structure(list(Plot = c(101L, 101L, 101L, 101L, 101L, 102L, 102L,
102L, 102L, 102L, 103L, 103L, 103L, 103L, 103L), Plant = c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), Count = c(9L,
15L, 5L, 15L, 26L, 9L, 26L, 9L, 15L, 17L, 12L, 6L, 22L, 12L,
6L)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"))
If you want to keep your variables, use mutate() in this way:
#Code
newdf <- df %>% group_by(Plot) %>% mutate(Avg=mean(Count))
Output:
# A tibble: 15 x 4
# Groups: Plot [3]
Plot Plant Count Avg
<int> <int> <int> <dbl>
1 101 1 9 14
2 101 2 15 14
3 101 3 5 14
4 101 4 15 14
5 101 5 26 14
6 102 1 9 15.2
7 102 2 26 15.2
8 102 3 9 15.2
9 102 4 15 15.2
10 102 5 17 15.2
11 103 1 12 11.6
12 103 2 6 11.6
13 103 3 22 11.6
14 103 4 12 11.6
15 103 5 6 11.6
Or using base R:
#Base R
newdf <- aggregate(Count~Plot,data=df,mean)
Output:
Plot Count
1 101 14.0
2 102 15.2
3 103 11.6
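Since data.table comes up elsewhere on this page, the same per-plot average can also be sketched with it (assuming the same df as above):
library(data.table)
# mean Count per Plot, one row per plot
setDT(df)[, .(Avg = mean(Count)), by = Plot]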

How to subtract one row from multiple rows by group, for data set with multiple columns in R?

I would like to learn how to subtract one row from multiple rows by group, and save the results as a data table/matrix in R. For example, take the following data frame:
data.frame("patient" = c("a","a","a", "b","b","b","c","c","c"), "Time" = c(1,2,3), "Measure 1" = sample(1:100,size = 9,replace = TRUE), "Measure 2" = sample(1:100,size = 9,replace = TRUE), "Measure 3" = sample(1:100,size = 9,replace = TRUE))
patient Time Measure.1 Measure.2 Measure.3
1 a 1 19 5 75
2 a 2 64 20 74
3 a 3 40 4 78
4 b 1 80 91 80
5 b 2 48 31 73
6 b 3 10 5 4
7 c 1 30 67 55
8 c 2 24 13 90
9 c 3 45 31 88
For each patient, I would like to subtract the row where Time == 1 from all rows associated with that patient. The result would be:
patient Time Measure.1 Measure.2 Measure.3
1 a 1 0 0 0
2 a 2 45 15 -1
3 a 3 21 -1 3
4 b 1 0 0 0
5 b 2 -32 -60 -7
6 b 3 -70 -86 -76
7 c 1 0 0 0
....
I have tried the following code using the dplyr package, but to no avail:
raw_patient<- group_by(rawdata,patient, Time)
baseline_patient <-mutate(raw_patient,cpls = raw_patient[,]- raw_patient["Time" == 0,])
As there are multiple columns, we can use mutate_at(), specifying the variables in vars(), and then, after grouping by 'patient', subtract from each column the element that corresponds to Time == 1.
library(dplyr)
df1 %>%
group_by(patient) %>%
mutate_at(vars(matches("Measure")), funs(.- .[Time==1]))
# A tibble: 9 × 5
# Groups: patient [3]
# patient Time Measure.1 Measure.2 Measure.3
# <chr> <int> <int> <int> <int>
#1 a 1 0 0 0
#2 a 2 45 15 -1
#3 a 3 21 -1 3
#4 b 1 0 0 0
#5 b 2 -32 -60 -7
#6 b 3 -70 -86 -76
#7 c 1 0 0 0
#8 c 2 -6 -54 35
#9 c 3 15 -36 33
data
df1 <- structure(list(patient = c("a", "a", "a", "b", "b", "b", "c",
"c", "c"), Time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Measure.1 = c(19L,
64L, 40L, 80L, 48L, 10L, 30L, 24L, 45L), Measure.2 = c(5L, 20L,
4L, 91L, 31L, 5L, 67L, 13L, 31L), Measure.3 = c(75L, 74L, 78L,
80L, 73L, 4L, 55L, 90L, 88L)), .Names = c("patient", "Time",
"Measure.1", "Measure.2", "Measure.3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))

Merge two tables in R; column names differ with A and B options

I have two datasets that I'm trying to merge together. The first one contains information for every test subject, with a unique ID in rows. The second contains measurements for every test subject in columns; however, each subject was measured twice, so the column names read "IDa" and "IDb". I'd like to find a way to merge these two tables based on the unique ID, regardless of whether it is measurement A or B.
Here's a small sample of the 2 datasets, and a table of the intended output. Any help would be appreciated!
UniqueID Site State Age Height
Tree001 FK OR 23 70
Tree002 FK OR 45 53
Tree003 NM OR 35 84
UniqueID Tree001A Tree001B Tree002A Tree002B Tree003A Tree003B
1996 4 2
1997 7 8 7 3
1998 3 2 9 4 7
1999 11 9 2 12 3 13
2010 8 8 4 6 11 4
2011 10 5 6 3 8 9
UniqueID Tree001A Tree001B Tree002A Tree002B Tree003A Tree003B
Site FK FK FK FK NM NM
State OR OR OR OR OR OR
Age 23 23 45 45 35 35
Height 70 70 53 53 84 84
1996 4 2
1997 7 8 7 3
1998 3 2 9 4 7
1999 11 9 2 12 3 13
2010 8 8 4 6 11 4
2011 10 5 6 3 8 9
This can be one approach.
df1 <- structure(list(UniqueID = structure(1:3, .Label = c("Tree001",
"Tree002", "Tree003"), class = "factor"), Site = structure(c(1L,
1L, 2L), .Label = c("FK", "NM"), class = "factor"), State = structure(c(1L,
1L, 1L), .Label = "OR", class = "factor"), Age = c(23L, 45L,
35L), Height = c(70L, 53L, 84L)), .Names = c("UniqueID", "Site",
"State", "Age", "Height"), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(UniqueID = c(1996L, 1997L, 1998L, 1999L, 2010L,
2011L), Tree001A = c(4L, 7L, 3L, 11L, 8L, 10L), Tree001B = c(NA,
8L, 2L, 9L, 8L, 5L), Tree002A = c(2L, 7L, 9L, 2L, 4L, 6L), Tree002B = c(NA,
NA, 4L, 12L, 6L, 3L), Tree003A = c(NA, 3L, 7L, 3L, 11L, 8L),
Tree003B = c(NA, NA, NA, 13L, 4L, 9L)), .Names = c("UniqueID",
"Tree001A", "Tree001B", "Tree002A", "Tree002B", "Tree003A", "Tree003B"
), class = "data.frame", row.names = c(NA, -6L))
> df1
UniqueID Site State Age Height
1 Tree001 FK OR 23 70
2 Tree002 FK OR 45 53
3 Tree003 NM OR 35 84
> df2
UniqueID Tree001A Tree001B Tree002A Tree002B Tree003A Tree003B
1 1996 4 <NA> 2 <NA> <NA> <NA>
2 1997 7 8 7 <NA> 3 <NA>
3 1998 3 2 9 4 7 <NA>
4 1999 11 9 2 12 3 13
5 2010 8 8 4 6 11 4
6 2011 10 5 6 3 8 9
# Use transpose function to change df1
df3 <- as.data.frame(t(df1[,-1]))
colnames(df3) <- df1[,1]
# Copy the row names into a UniqueID column
df3$UniqueID <- rownames(df3)
# Reset the row names to 1:4
rownames(df3) <- c(1:4)
# Modify dataframe so that you have two columns for each subject
df3 <- df3[,c(4,1,1,2,2,3,3)]
colnames(df3) <- c("UniqueID", "Tree001A", "Tree001B", "Tree002A",
"Tree002B", "Tree003A", "Tree003B")
# Change the classes of df2's columns to factor so they can be combined with df3
df2[] <- lapply(df2, as.factor)
# Now combine two data frames
new <- rbind(df3,df2)
> new
UniqueID Tree001A Tree001B Tree002A Tree002B Tree003A Tree003B
1 Site FK FK FK FK NM NM
2 State OR OR OR OR OR OR
3 Age 23 23 45 45 35 35
4 Height 70 70 53 53 84 84
5 1996 4 <NA> 2 <NA> <NA> <NA>
6 1997 7 8 7 <NA> 3 <NA>
7 1998 3 2 9 4 7 <NA>
8 1999 11 9 2 12 3 13
9 2010 8 8 4 6 11 4
10 2011 10 5 6 3 8 9
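If stacking the metadata rows on top of the yearly measurements is not strictly required, another option is to keep df2 in long format and join the subject information onto every measurement; a sketch using the same df1 and df2:
library(dplyr)
library(tidyr)
# One row per (year, tree measurement); the trailing A/B is stripped so each
# measurement can be joined back to its tree's metadata in df1.
long <- df2 %>%
  pivot_longer(-UniqueID, names_to = "tree_measure", values_to = "value") %>%
  mutate(tree = sub("[AB]$", "", tree_measure)) %>%
  left_join(mutate(df1, UniqueID = as.character(UniqueID)),
            by = c("tree" = "UniqueID"))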
