I am working on a function that outputs a data frame that currently omits trials where there is missing data. However, I would like the full trial count to be added back into the file and the other data columns be blank for these instances (reflecting the missing data).
Example Data Frames:
Df1withTrialCount <- data.frame(Participant = c('A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A' ),
Trial = c(1,1,2,2,3,3,4,5,6,7,8,9,10,10,10),
NotRelevantVariable = c(1,2,3,4,5,6,4,3,2,1,1,2,3,4,5))
Df2NeedsTrialsAddedIn <- data.frame(Participant = c('A', 'A', 'A', 'A', 'A'),
Trial = c(1,3,5,6,10),
EyeGaze = c(.4, .2, .2, .1, .1))
So I would end up with something that had one row each for Trials 1-10 but blanks in Eye Gaze when there was not data (e.g., Trial 2 would have a blank for EyeGaze but Trial 3 would have .2).
Any help or insights would be greatly appreciated.
Take care and thank you for your time,
Caroline
With base::merge:
merge(unique(Df1withTrialCount[, c("Participant", "Trial")]),
Df2NeedsTrialsAddedIn,
all.x = TRUE)
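To see what this merge produces, here is a minimal self-contained sketch using the example data from the question. The names `trials` and `result` are introduced here only for illustration.

```r
# Example data from the question
Df1withTrialCount <- data.frame(
  Participant = rep("A", 15),
  Trial = c(1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10),
  NotRelevantVariable = c(1, 2, 3, 4, 5, 6, 4, 3, 2, 1, 1, 2, 3, 4, 5)
)
Df2NeedsTrialsAddedIn <- data.frame(
  Participant = rep("A", 5),
  Trial = c(1, 3, 5, 6, 10),
  EyeGaze = c(.4, .2, .2, .1, .1)
)

# One row per Participant/Trial combination that occurs in the full data
trials <- unique(Df1withTrialCount[, c("Participant", "Trial")])

# all.x = TRUE keeps every row of `trials`; unmatched trials get NA in EyeGaze
result <- merge(trials, Df2NeedsTrialsAddedIn, all.x = TRUE)
result
```

Because merge joins on the columns the two data frames share (Participant and Trial), no `by` argument is needed here.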
We can use complete
library(tidyr)
complete(Df2NeedsTrialsAddedIn, Participant,
Trial = seq_len(max(Df1withTrialCount$Trial)))
-output
# A tibble: 10 x 3
# Participant Trial EyeGaze
# <chr> <dbl> <dbl>
# 1 A 1 0.4
# 2 A 2 NA
# 3 A 3 0.2
# 4 A 4 NA
# 5 A 5 0.2
# 6 A 6 0.1
# 7 A 7 NA
# 8 A 8 NA
# 9 A 9 NA
#10 A 10 0.1
If we need both the min and the max from the first dataset:
complete(Df2NeedsTrialsAddedIn, Participant,
Trial = seq(min(Df1withTrialCount$Trial), max(Df1withTrialCount$Trial), by = 1))
library(tidyverse)
Df1withTrialCount %>%
left_join(Df2NeedsTrialsAddedIn, by=c('Participant', 'Trial')) %>%
distinct(Participant, Trial, .keep_all = TRUE)
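The approaches above leave NA in EyeGaze for trials without data. If literal blanks are wanted in an exported file, one option is to keep NA inside R and write empty fields only on export. A minimal sketch, assuming the result is written with write.csv; the object name `result` and the temporary file are hypothetical:

```r
# Hypothetical completed result: NA marks trials without eye-gaze data
result <- data.frame(
  Participant = "A",
  Trial = 1:3,
  EyeGaze = c(0.4, NA, 0.2)
)

# na = "" writes missing values as empty fields instead of the string "NA"
out_file <- tempfile(fileext = ".csv")
write.csv(result, out_file, na = "", row.names = FALSE)

# Inspect the raw lines of the written file
csv_lines <- readLines(out_file)
csv_lines
```

Keeping NA in the data frame itself means numeric columns stay numeric for any later analysis.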
I'm quite new to R. I have a large dataframe approximating the following:
df <- data.frame(
source = c('a', 'b', 'c', 'e'),
partner = c('b', 'c', 'e', 'a'),
info = c(1,2,3,4)
)
For each row in the dataframe I want to get the info column from the partner and concatenate it to the source row. I'm doing this by building a second dataframe in the following way:
prt <- unlist(df$partner)
collect_partner <- function(x, df) {
df[df[, 'source'] == x, 'info']
}
prt_df <- do.call("rbind", lapply(prt, collect_partner, df)) # slow
final_df <- cbind(df, prt_df)
However, this approach is very slow and I'm sure there must be a better way. Unfortunately I'm finding it hard to articulate what I'm trying to do, so solutions aren't forthcoming from googling etc. Any suggestions would be much appreciated!
If you work with the tidyverse, I'd use a left_join of the data frame with itself. I first create a data.frame that contains only the source and info columns (with info renamed to prt_df). To make sure there is only one row per unique entry in source, I use distinct (not strictly needed here).
Then, I join the data to the original data frame:
library(dplyr)
df <- data.frame(
source = c('a', 'b', 'c', 'e'),
partner = c('b', 'c', 'e', 'a'),
info = c(1,2,3,4)
)
source_info <- df %>%
select(source, prt_df = info) %>%
distinct(source, .keep_all = TRUE)
df %>%
left_join(source_info, by = c("partner" = "source"))
#> source partner info prt_df
#> 1 a b 1 2
#> 2 b c 2 3
#> 3 c e 3 4
#> 4 e a 4 1
Created on 2023-02-13 by the reprex package (v1.0.0)
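The same lookup can also be sketched in base R with match(), which returns, for each partner, the position of its first occurrence in source. This is an alternative to the join above, not part of the original answer, and it assumes each source value appears once (with duplicates only the first match is used, and partners with no match give NA). The name `idx` is introduced for illustration.

```r
df <- data.frame(
  source = c('a', 'b', 'c', 'e'),
  partner = c('b', 'c', 'e', 'a'),
  info = c(1, 2, 3, 4)
)

# Position of each partner within source (NA when there is no match)
idx <- match(df$partner, df$source)

# Pull the partner's info by position
df$prt_df <- df$info[idx]
df
```

match() is vectorised, so it avoids the per-row function calls that make the original lapply/rbind approach slow.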
With base R using sapply and ==
df$prt_df <- sapply(df$partner, function(x) df$info[x == df$source])
df
source partner info prt_df
1 a b 1 2
2 b c 2 3
3 c e 3 4
4 e a 4 1
Using data.table
library(data.table)
dt <- as.data.table(df)
dt[, prt_df := lapply(partner, function(x) info[x == source])]
dt
source partner info prt_df
1: a b 1 2
2: b c 2 3
3: c e 3 4
4: e a 4 1
On a slightly modified set dt_m with repeated and missing values.
dt_m[, prt_df := lapply(partner, function(x) info[x == source])]
dt_m
source partner info prt_df
1: a b 1
2: a c 2 3
3: c e 3 4
4: e a 4 1,2
modified data
dt_m <- structure(list(source = c("a", "a", "c", "e"), partner = c("b",
"c", "e", "a"), info = c(1, 2, 3, 4)), row.names = c(NA, -4L),
class = c("data.table", "data.frame"))
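In both example data sets info happens to equal the row number, so a row position and an info value look the same in the printed output. A small base R sketch with hypothetical info values (10, 20, 30, 40, which are not from the original post) makes the difference visible and shows the list result for repeated and missing partners. The names `d`, `pos`, and `val` are introduced for illustration.

```r
# Modified data from above, but with hypothetical info values
d <- data.frame(
  source = c("a", "a", "c", "e"),
  partner = c("b", "c", "e", "a"),
  info = c(10, 20, 30, 40)
)

# Row positions of each partner within source
pos <- lapply(d$partner, function(x) which(x == d$source))

# info values of each partner (zero-length when the partner is missing)
val <- lapply(d$partner, function(x) d$info[x == d$source])

pos[[4]]  # positions of the rows whose source is "a"
val[[4]]  # info values of those rows
```

Since the question asks for the partner's info, the value lookup is the one that matches the intent once info no longer mirrors the row order.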
How can I convert my data from this:
example <- data.frame(RTD_1_LOC = c('A', 'B'), RTD_2_LOC = c('C', 'D'),
RTD_3_LOC = c('E', 'F'), RTD_4_LOC = c('G', 'H'),
RTD_5_LOC = c('I', 'J'),RTD_1_OFF = c('1', '2'), RTD_2_OFF = c('3', '4'),
RTD_3_OFF = c('5', '6'), RTD_4_OFF = c('7', '8'),
RTD_5_OFF = c('9', '10'))
to this:
example2 <- data.frame(RTD = c(1,1,2,2,3,3,4,4,5,5),LOC = c('A', 'B','C','D','E','F','G','H','I','J'),
OFF = c(1,2,3,4,5,6,7,8,9,10))
I have been using tidyverse gather, but I end up with about 50 columns
ex <- gather(example,RTD, Location, RTD_1_LOC:RTD_5_LOC)
ex$RTD <- sub('_LOC',"",ex$RTD)
ex3 <- gather(ex,RTD, Offset, RTD_1_OFF:RTD_5_OFF)
ex3$RTD <- sub('_OFF',"",ex3$RTD)
We can use pivot_longer from tidyr and specify names_pattern to capture groups from the column names. In names_to, pass the vector c("RTD", ".value"): the first capture group (\\d+) supplies the digits that go into the 'RTD' column, and the second capture group (\\w+), paired with .value, turns 'LOC' and 'OFF' into new columns that hold the corresponding values.
library(dplyr)
library(tidyr)
example %>%
pivot_longer(cols = everything(),
names_to = c("RTD", ".value"), names_pattern = "\\w+_(\\d+)_(\\w+)")
-output
# A tibble: 10 x 3
RTD LOC OFF
<chr> <chr> <chr>
1 1 A 1
2 2 C 3
3 3 E 5
4 4 G 7
5 5 I 9
6 1 B 2
7 2 D 4
8 3 F 6
9 4 H 8
10 5 J 10
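The output above has all three columns as character and is ordered by original row, whereas the target example2 has numeric RTD and OFF and is ordered by RTD. A possible follow-up, sketched here under the assumption that plain type conversion with dplyr::mutate is acceptable (the name `long` is introduced only for illustration):

```r
library(dplyr)
library(tidyr)

example <- data.frame(
  RTD_1_LOC = c('A', 'B'), RTD_2_LOC = c('C', 'D'),
  RTD_3_LOC = c('E', 'F'), RTD_4_LOC = c('G', 'H'),
  RTD_5_LOC = c('I', 'J'),
  RTD_1_OFF = c('1', '2'), RTD_2_OFF = c('3', '4'),
  RTD_3_OFF = c('5', '6'), RTD_4_OFF = c('7', '8'),
  RTD_5_OFF = c('9', '10')
)

long <- example %>%
  pivot_longer(cols = everything(),
               names_to = c("RTD", ".value"),
               names_pattern = "\\w+_(\\d+)_(\\w+)") %>%
  # Convert the captured digits and the offsets from character to numeric
  mutate(RTD = as.numeric(RTD), OFF = as.numeric(OFF)) %>%
  # Sort by RTD to match the layout of example2
  arrange(RTD)
long
```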
Suppose I have data like the following:
# A tibble: 10 x 4
# Groups: a.month, a.group [10]
a.month a.group other.group amount
<date> <chr> <chr> <dbl>
1 2016-02-01 A X 15320
2 2016-05-01 A Z 50079
3 2016-06-01 A Y 60564
4 2016-08-01 A X 10540
5 2017-01-01 B X 30020
6 2017-03-01 B X 76310
7 2017-04-01 B Y 44215
8 2017-05-01 A Y 67241
9 2017-06-01 A Z 17180
10 2017-07-01 B Z 31720
And I want to produce rows for every possible combination of a.group, other.group and for every month in between (with amount being zero if not present on the data above)
I managed to produce a tibble with the default amounts through:
another.tibble <- as_tibble(expand.grid(
a.month = months.list,
a.group = unique.a.groups,
other.group = unique.o.groups,
amount = 0
));
How should I proceed to populate another.tibble with the values from the first one?
It is important to call expand.grid with stringsAsFactors = FALSE so that the key columns stay character, as they are in df. Then we simply left_join() the original data onto the grid to fill in the combinations for which we have data.
library(tidyverse)
df <- tribble(
~a.month, ~a.group, ~other.group, ~amount,
'2016-02-01', 'A', 'X', 15320,
'2016-05-01', 'A', 'Z', 50079,
'2016-06-01', 'A', 'Y', 60564,
'2016-08-01', 'A', 'X', 10540,
'2017-01-01', 'B', 'X', 30020,
'2017-03-01', 'B', 'X', 76310,
'2017-04-01', 'B', 'Y', 44215,
'2017-05-01', 'A', 'Y', 67241,
'2017-06-01', 'A', 'Z', 17180,
'2017-07-01', 'B', 'Z', 31720
)
another.tibble <- as_tibble(expand.grid(
a.month = unique(df$a.month),
a.group = unique(df$a.group),
other.group = unique(df$other.group),
amount = 0, stringsAsFactors = FALSE)
)
another.tibble %>%
left_join(df, by= c("a.month" = "a.month", "a.group" = "a.group", "other.group" = "other.group")) %>%
mutate(amount.x = ifelse(is.na(amount.y), 0, amount.y)) %>%
rename(amount = amount.x) %>%
select(1:4)
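As an alternative sketch (not the original answer's method), tidyr::complete can build the grid and fill the missing amounts in one step. This version assumes a.month is stored as Date and that "every month in between" means a monthly sequence from the earliest to the latest month in the data; the name `filled` is introduced for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  a.month = as.Date(c('2016-02-01', '2016-05-01', '2016-06-01', '2016-08-01',
                      '2017-01-01', '2017-03-01', '2017-04-01', '2017-05-01',
                      '2017-06-01', '2017-07-01')),
  a.group = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B'),
  other.group = c('X', 'Z', 'Y', 'X', 'X', 'X', 'Y', 'Y', 'Z', 'Z'),
  amount = c(15320, 50079, 60564, 10540, 30020, 76310, 44215, 67241, 17180, 31720)
)

filled <- df %>%
  ungroup() %>%  # no-op here; relevant if the data are grouped as in the question
  complete(
    # Every month from the earliest to the latest, not only those observed
    a.month = seq(min(a.month), max(a.month), by = "month"),
    a.group,
    other.group,
    # Combinations absent from the data get amount 0 instead of NA
    fill = list(amount = 0)
  )
nrow(filled)
```

Unlike unique(df$a.month), the seq() call also creates rows for months that never appear in the data.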
I have a large data frame of 100,000's rows, and I want to add a column where the value is a sample of a subset of another data frame based on common names in the data frames. Might be easier to explain with examples...
largeDF <- data.frame(colA = c('a', 'b', 'b', 'a', 'a', 'b'),
colB = c('x', 'y', 'y', 'x', 'y', 'y'),
colC = 1:6)
sampleDF <- data.frame(colA = c('a','a','a','a','b','b','b','b','b','b'),
colB = c('x','x','y','y','x','y','y','y','y','y'),
sample = 1:10)
I then want to add a new column sample to largeDF, which is a random sample of the sample column in sampleDF for the appropriate subset of colA and colB.
For example, for the first row the values are a and x, so the value will be a random sample of 1 or 2, for the next row (b and y) it will be a random sample of 6, 7, 8, 9 or 10.
So we could end up with something like:
colA colB colC sample
1 a x 1 2
2 b y 2 9
3 b y 3 7
4 a x 4 2
5 a y 5 4
6 b y 6 8
Any help would be appreciated!
Using dplyr. Comparing against the first value of each group (colA[1] and colB[1]) avoids the recycling warnings that comparing against the whole group vector would raise.
library(dplyr)
largeDF <- largeDF %>% group_by(colA,colB) %>%
mutate(sample=sample(sampleDF$sample[sampleDF$colA==colA[1] & sampleDF$colB==colB[1]],
size=n(),replace=TRUE))
largeDF
colA colB colC sample
<fctr> <fctr> <int> <int>
1 a x 1 2
2 b y 2 6
3 b y 3 9
4 a x 4 1
5 a y 5 4
6 b y 6 9
You could do something like this:
largeDF$sample <- apply(largeDF,1,function(a)
with(sampleDF, sample(sampleDF[colA==a[1] & colB==a[2],]$sample,1)))
I do not quite understand the question but it seems that you are just adding a new column in the large data frame that is just the sampled "sample" column from a subsample...
see if the following code gives you an idea into the functionality you need:
cbind.data.frame(largeDF, sample = sample(sampleDF$sample, nrow(largeDF)))
# colA colB colC sample
#1 a x 1 9
#2 b y 2 10
#3 b y 3 1
#4 a x 4 3
#5 a y 5 6
#6 b y 6 7
I think this is one possible solution for you...
library(dplyr)
largeDF_sample <- sapply(1:nrow(largeDF), function(x) {
sampleDF_part = filter(sampleDF, colA==largeDF$colA[x] & colB==largeDF$colB[x])
return(sample(sampleDF_part$sample)[1])
})
largeDF$sample <- largeDF_sample
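For a data frame with hundreds of thousands of rows, filtering sampleDF once per row can be slow. One possible sketch, in base R, splits the sample values into pools once and then draws within each group of largeDF. It indexes with sample.int() rather than calling sample() on the pool directly, because sample(x) with a single number x draws from 1:x (as would happen for the b/x pool, which holds only the value 5). The names `pools`, `key`, and `draw_from` are introduced here for illustration, and the sketch assumes every colA/colB combination in largeDF also occurs in sampleDF.

```r
largeDF <- data.frame(colA = c('a', 'b', 'b', 'a', 'a', 'b'),
                      colB = c('x', 'y', 'y', 'x', 'y', 'y'),
                      colC = 1:6)
sampleDF <- data.frame(colA = c('a','a','a','a','b','b','b','b','b','b'),
                       colB = c('x','x','y','y','x','y','y','y','y','y'),
                       sample = 1:10)

# One pool of candidate values per colA/colB combination
pools <- split(sampleDF$sample, paste(sampleDF$colA, sampleDF$colB))

# Draw n values with replacement from a pool; safe for pools of length 1
draw_from <- function(pool, n) pool[sample.int(length(pool), n, replace = TRUE)]

# Same key format as used for the pool names
key <- paste(largeDF$colA, largeDF$colB)
largeDF$sample <- NA_integer_
for (k in unique(key)) {
  rows <- which(key == k)
  largeDF$sample[rows] <- draw_from(pools[[k]], length(rows))
}
largeDF
```

The loop runs once per distinct combination rather than once per row, so its cost grows with the number of groups, not with the size of largeDF.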
What is the proper way to count the result of a left outer join using dplyr?
Consider the two data frames:
a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )
a specifies four different IDs. b specifies six records that reference IDs in a. If I want to see how many times each ID is referenced, I might try this:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 1
3 3 3
4 4 1
However, the result is misleading because it indicates that ID 2 was referenced once when in reality, it was never referenced (in the intermediate data frame, ref_id was NA for ID 2). I would like to avoid introducing a separate library such as sqldf.
With data.table, you can do
library(data.table)
setDT(a); setDT(b)
b[a, .N, on="id", by=.EACHI]
id N
1: 1 2
2: 2 0
3: 3 3
4: 4 1
Here, the syntax is x[i, j, on, by=.EACHI].
.EACHI refers to each row of i=a.
j=.N uses a special variable for the number of rows.
There are already some good answers, but since the question asks to avoid extra packages, here is a base R one. We perform a left join on a and b and append a refs column, which is TRUE if ref_id is not NA. Then we use aggregate to sum over the refs column:
m <- transform(merge(a, b, all.x = TRUE), refs = !is.na(ref_id))
aggregate(refs ~ id, m, sum)
giving:
id refs
1 1 2
2 2 0
3 3 3
4 4 1
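Another base R sketch, offered as an alternative rather than as the method above, counts the ids in b directly by treating them as a factor whose levels are the ids in a; levels with no occurrences are counted as zero. The column name refs follows the question, and `counts` is introduced for illustration.

```r
a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4),
                ref_id = c('a', 'b', 'c', 'd', 'e', 'f'))

# table() reports every factor level, including those that never occur in b
counts <- table(factor(b$id, levels = a$id))

data.frame(id = a$id, refs = as.vector(counts))
```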
It does require another package, but I'd feel remiss for not mentioning tidylog, which reports on a wide range of tidyverse verbs. In your case, it would produce a report like:
library(dplyr)
library(tidylog)
a <- data.frame(id = c(1, 2, 3, 4 ))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4), ref_id = c('a', 'b', 'c', 'd', 'e', 'f'))
a %>% left_join(b, by='id')
left_join: added one column (ref_id)
> rows only in x 1
> rows only in y (0)
> matched rows 6 (includes duplicates)
> ===
> rows total 7
id ref_id
1 1 a
2 1 b
3 2 <NA>
4 3 c
5 3 d
6 3 e
7 4 f
See here and here for more examples/info
I'm having a hard time deciding if this is a hack or the proper way to count references, but this returns the expected result:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=sum( !is.na( ref_id ) ) )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 0
3 3 3
4 4 1
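A variant that some may find reads less like a hack is to count the references in b first and then join the counts onto a, replacing the NA for unreferenced ids with zero. A minimal sketch with dplyr; `ref_counts` and `result` are names introduced here:

```r
library(dplyr)

a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4),
                ref_id = c('a', 'b', 'c', 'd', 'e', 'f'))

# Number of rows in b per id
ref_counts <- b %>% count(id, name = "refs")

result <- a %>%
  left_join(ref_counts, by = "id") %>%
  # ids absent from b come back as NA; treat them as zero references
  mutate(refs = coalesce(refs, 0L))
result
```

Counting before joining also keeps the intermediate data small when b is much larger than a.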