Reshape long data in R or aggregate?

I have a data set that is in a long format, and I can't seem to get it into the right shape for analysis. Perhaps this shape is appropriate; my experience has been almost entirely with wide-format data, so this data file is not making sense to me. (Reproducible data at the end of the post.)
> head(df,10)
ID attributes values
1 1 AU AAA
2 1 AU BBB
3 1 YR 2014
4 2 AU CCC
5 2 AU DDD
6 2 AU EEE
7 2 AU FFF
8 2 AU GGG
9 2 YR 2013
10 3 AU HHH
The attributes column contains the variables of interest to me, and I want to perform a series of aggregation functions. For example, I would like to:
1. Obtain a count of the number of authors (AU) for each ID. For example:
ID N.AU
1 2
2 5
3 1
4 2
5 5
6 1
2. Compute the median number of authors (AU) by year (YR):
YR Median.N.AU
2013 5.0
2014 1.5
For both of these examples, I have tried dplyr with group_by and summarise, but haven't cracked the code. I have also tried dcast. My hope is to come up with a solution that I can easily generalize to a larger data frame with many more attributes that take on either a single value or multiple values. Any help or pointers to a similar solution would be greatly appreciated.
attributes = c("AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR",
"AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR")
ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6)
values = c("AAA", "BBB", "2014", "CCC", "DDD", "EEE", "FFF", "GGG", "2013", "HHH", "2014",
"III", "JJJ", "2014", "KKK", "LLL", "MMM", "NNN", "OOO", "2013", "PPP", "2014")
df <- data.frame(ID, attributes, values)

I think you're getting confused because you actually have two tables of
data linked by a common ID:
library(dplyr)
df <- tbl_df(df) # tbl_df() is deprecated in current dplyr; as_tibble() is the modern equivalent
years <- df %>%
filter(attributes == "YR") %>%
select(id = ID, year = values)
years
#> Source: local data frame [6 x 2]
#>
#> id year
#> 1 1 2014
#> 2 2 2013
#> 3 3 2014
#> 4 4 2014
#> 5 5 2013
#> .. .. ...
authors <- df %>%
filter(attributes == "AU") %>%
select(id = ID, author = values)
authors
#> Source: local data frame [16 x 2]
#>
#> id author
#> 1 1 AAA
#> 2 1 BBB
#> 3 2 CCC
#> 4 2 DDD
#> 5 2 EEE
#> .. .. ...
Once you have the data in this form, it's easy to answer the questions
you're interested in:
Authors per paper:
n_authors <- authors %>%
group_by(id) %>%
summarise(n = n())
Or
n_authors <- authors %>% count(id)
Median authors per year:
n_authors %>%
left_join(years) %>%
group_by(year) %>%
summarise(median(n))
#> Joining by: "id"
#> Source: local data frame [2 x 2]
#>
#> year median(n)
#> 1 2013 5.0
#> 2 2014 1.5
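If you eventually need this for many attribute types at once, one possible generalization (a sketch of mine, not part of the original answer) is to spread every attribute into its own column with tidyr, collecting repeated values such as multiple authors into list-columns:
library(tidyr)
# values_fn = list collects repeated values (e.g. several AU entries per ID)
# into list-columns instead of erroring on duplicates.
wide <- df %>%
  pivot_wider(names_from = attributes, values_from = values,
              values_fn = list)
# Author counts per ID then fall out of the list-column lengths:
wide %>% mutate(n_authors = lengths(AU))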

Here's a possible data.table solution.
I would also suggest creating an aggregated data set with separate columns. For example:
library(data.table)
(subdf <- as.data.table(df)[, .(N.AU = sum(attributes == "AU"),
Year = values[attributes == "YR"]) , ID])
# ID N.AU Year
# 1: 1 2 2014
# 2: 2 5 2013
# 3: 3 1 2014
# 4: 4 2 2014
# 5: 5 5 2013
# 6: 6 1 2014
Calculating the median per year:
subdf[, .(Median.N.AU = median(N.AU)), keyby = Year]
# Year Median.N.AU
# 1: 2013 5.0
# 2: 2014 1.5
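Since the question mentions trying dcast: a possible dcast route for the counts (my sketch; here the YR column gives the number of year entries per ID rather than the year itself):
# Count how many values each ID has per attribute type.
dcast(as.data.table(df), ID ~ attributes,
      value.var = "values", fun.aggregate = length)
#    ID AU YR
# 1:  1  2  1
# (remaining rows follow the N.AU column above)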

I misunderstood the structure of your dataset initially. Thanks to the comments below I realize your data needs to be restructured.
# split the data out
df1 <- df[df$attributes == "AU",]
df2 <- df[df$attributes == "YR",]
# just keeping the columns with data as opposed to the label
df3 <- merge(df1, df2, by="ID")[,c(1,3,5)]
# set column names for clarification
colnames(df3) <- c("ID","author","year")
# get author counts
num.authors <- count(df3, vars=c("ID","year"))
ID year freq
1 1 2014 2
2 2 2013 5
3 3 2014 1
4 4 2014 2
5 5 2013 5
6 6 2014 1
library(doBy)
summaryBy(freq ~ year, data = num.authors, FUN = list(median))
year freq.median
1 2013 5.0
2 2014 1.5
The nice thing about summaryBy is that you can add whichever functions you like to the list, and you will get another column containing that metric (e.g. mean, sd, etc.).
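For instance, a sketch of that idea:
# Each function in the list gets its own column (freq.median, freq.mean, freq.sd).
summaryBy(freq ~ year, data = num.authors, FUN = list(median, mean, sd))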

Related

Create new variables based on specific factor levels in time series data with dplyr

I've got some time series data where both the steps of the sequence (ranging from 1 to 8) as well as its topic (>100) are encoded as character factor levels within a single variable. Here is a minimal example (I omitted timestamps which would be increasing within each id):
id <- c(1,rep(2,5),rep(3,4))
step <- c("call", "call", "agent", "forest", "forward", "resolved", "call", "agent", "beach", "resolved")
(df <- data.frame(id,step))
id step
1 1 call
2 2 call
3 2 agent
4 2 forest
5 2 forward
6 2 resolved
7 3 call
8 3 agent
9 3 beach
10 3 resolved
I now want to split this information into two dedicated variables (step and topic), shrinking the data frame in rows and making it wider, while also repeating the topic for each row of the time series and using NA when there is no topic. Splitting this into two data frames with base R and merging them back together gets the job done:
step <- subset(df, step %in% c("call", "agent", "forward", "resolved"))
topic <- subset(df, step %in% c("forest", "beach"))
topic$topic <- topic$step
topic$step <- NULL
(newdf <- merge(step,topic, all=TRUE))
id step topic
1 1 call <NA>
2 2 call forest
3 2 agent forest
4 2 forward forest
5 2 resolved forest
6 3 call beach
7 3 agent beach
8 3 resolved beach
This is somewhat clunky though and I'm looking for a more elegant dplyr/tidyverse approach to this. pivot_wider() doesn't seem to be able to do this. Any ideas?
This isn't particularly elegant, but it works:
steps <- c("call", "agent", "forward", "resolved")
df %>%
mutate(type = ifelse(step %in% steps, "step", "topic"),
row = cumsum(type == "step")) %>%
pivot_wider(names_from = type, values_from = step) %>%
group_by(id) %>%
fill(topic, .direction = "updown") %>%
ungroup()
# A tibble: 8 x 4
id row step topic
<dbl> <int> <chr> <chr>
1 1 1 call NA
2 2 2 call forest
3 2 3 agent forest
4 2 4 forward forest
5 2 5 resolved forest
6 3 6 call beach
7 3 7 agent beach
8 3 8 resolved beach
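If the helper row column is unwanted, the same pipeline can drop it at the end (a minor addition of mine, with the imports spelled out):
library(dplyr)
library(tidyr)
df %>%
  mutate(type = ifelse(step %in% steps, "step", "topic"),
         row = cumsum(type == "step")) %>%   # unique row id per step
  pivot_wider(names_from = type, values_from = step) %>%
  group_by(id) %>%
  fill(topic, .direction = "updown") %>%
  ungroup() %>%
  select(-row)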
Thanks for providing a minimal example of your problem.
id <- c(1,rep(2,5),rep(3,4))
step <- c("call", "call", "agent", "forest", "forward",
"resolved", "call", "agent", "beach", "resolved")
df <- data.frame(id,step)
df
#> id step
#> 1 1 call
#> 2 2 call
#> 3 2 agent
#> 4 2 forest
#> 5 2 forward
#> 6 2 resolved
#> 7 3 call
#> 8 3 agent
#> 9 3 beach
#> 10 3 resolved
This is a possible solution using the tidyverse:
library(dplyr)
library(tidyr)
df %>%
# define in column type_c whether step is a step or a topic
# you need a unique id for each row to use pivot_wider in this case
mutate(
type_c = if_else(step %in% c("forest", "beach"), "topic", "step"),
unique_id = 1:nrow(df)) %>%
pivot_wider(names_from = type_c, values_from = c(id, step)) %>%
mutate(id = coalesce(id_step, id_topic)) %>%
select(id, step = step_step, topic = step_topic) %>%
# Need group_by to apply the function fill
group_by(id) %>%
# fill replaces NA, in each id, with a value found in any direction "downup"
fill(topic, .direction = "downup") %>%
# get rid of the NA in column step that pivot_wider created for each topic
filter(!is.na(step))
#> # A tibble: 8 x 3
#> # Groups: id [3]
#> id step topic
#> <dbl> <chr> <chr>
#> 1 1 call <NA>
#> 2 2 call forest
#> 3 2 agent forest
#> 4 2 forward forest
#> 5 2 resolved forest
#> 6 3 call beach
#> 7 3 agent beach
#> 8 3 resolved beach
Created on 2021-06-08 by the reprex package (v0.3.0)

Deleting duplicated rows based on condition (position)

I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
I have multiple observations for each id and time identifier; e.g., I have 3 different Alpha 1970 values. I would like to retain only one observation per id/year, specifically the last one that appears for each id/year.
the final dataset should look something like this:
final <- data.frame("id" = c("Alpha","Alpha","Beta","Beta","Beta"),
"Year" = c(1970,1971,1980,1981,1982),
"Val" = c(-2,5,5,3,5))
Does anyone know how I can approach this problem?
Thanks a lot in advance for your help.
If you are open to a data.table solution, this can be done quite concisely:
library(data.table)
setDT(df)[, .SD[.N], by = c("id", "Year")]
#> id Year Val
#> 1: Alpha 1970 -2
#> 2: Alpha 1971 5
#> 3: Beta 1980 5
#> 4: Beta 1981 3
#> 5: Beta 1982 5
by = c("id", "Year") groups the data.table by id and Year, and .SD[.N] then returns the last row within each such group.
How about this?
library(tidyverse)
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
final <-
df %>%
group_by(id, Year) %>%
slice(n()) %>%
ungroup()
final
#> # A tibble: 5 x 3
#> id Year Val
#> <fct> <dbl> <dbl>
#> 1 Alpha 1970 -2
#> 2 Alpha 1971 5
#> 3 Beta 1980 5
#> 4 Beta 1981 3
#> 5 Beta 1982 5
Created on 2019-09-29 by the reprex package (v0.3.0)
slice(n()) translates to: within each id-Year group, take only the row whose row number equals the size of the group, i.e. the last row under the current ordering.
You could also use either filter(), e.g. filter(row_number() == n()), or distinct() (and then you wouldn't even have to group), e.g. distinct(id, Year, .keep_all = TRUE) - but distinct functions take the first distinct row, so you'd need to reverse the row ordering here first.
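A sketch of that distinct() route, reversing the rows first so that "first distinct" picks what was originally the last row:
df %>%
  arrange(desc(row_number())) %>%           # reverse the current row order
  distinct(id, Year, .keep_all = TRUE) %>%  # keeps the first row per id/Year
  arrange(id, Year)                         # restore a tidy ordering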
An option with base R
aggregate(Val ~ ., df, tail, 1)
# id Year Val
#1 Alpha 1970 -2
#2 Alpha 1971 5
#3 Beta 1980 5
#4 Beta 1981 3
#5 Beta 1982 5
If we need to select the first row instead:
aggregate(Val ~ ., df, head, 1)

Match words in a data frame to a string in R

I have a data frame from a recall task where participants recall as many words as they can from a list they learned earlier. Here's a mock up of the data. Each row is a subject and each column (w1-w5) is a word recalled:
df <- data.frame(subject = 1:5,
w1 = c("screen", "toad", "toad", "witch", "toad"),
w2 = c("package", "tuna", "tuna", "postage", "dinosaur"),
w3 = c("tuna", "postage", "toast", "athlete", "ranch"),
w4 = c("toad", "witch", "tuna", "package", "NA"),
w5 = c("windwo", "mermaid", "NA", "NA", "NA")
)
Which produces the following data frame:
subject w1 w2 w3 w4 w5
1 1 screen package tuna toad windwo
2 2 toad tuna postage witch mermaid
3 3 toad tuna toast tuna NA
4 4 witch postage athlete package NA
5 5 toad dinosaur ranch NA NA
I want to match each word produced (columns w1 - w5) to a list of the correct words, which are:
words <- c("screen", "package", "tuna", "toad", "window",
"postage", "witch", "mermaid", "toast", "dinosaur")
I only want to award points for words that are spelled correctly and are not repeated. So for example, for the data above I'd like to end up with a data frame that looks like this:
subject nCorrect
1 1 4
2 2 5
3 3 3
4 4 3
5 5 2
Subject 1 would get four points because they misspelled one word.
Subject 2 would get five points.
Subject 3 would get three points because they repeated tuna and are missing one word.
Subject 4 would get three points because they have one incorrect word and one missing word.
Subject 5 would get two points because they have one incorrect word and two missing words.
A base R option: for each subject's row, count the unique responses that appear in words.
data.frame(subject = df$subject,
           nCorrect = apply(df[, -1], 1, function(x) sum(unique(x) %in% words)))
# subject nCorrect
# 1 1 4
# 2 2 5
# 3 3 3
# 4 4 3
# 5 5 2
With data.table (same result):
library(data.table)
setDT(df)
df[, sum(unique(unlist(.SD)) %in% words), by = subject]
Another option is to convert the data to long format, then group by subject and use dplyr::summarise to count the correct answers.
library(tidyverse)
words <- c("screen", "package", "tuna", "toad", "window",
"postage", "witch", "mermaid", "toast", "dinosaur")
df %>% gather(key, value, -subject) %>%
group_by(subject) %>%
summarise(nCorrect = sum(unique(value) %in% words))
# # A tibble: 5 x 2
# subject nCorrect
# <int> <int>
# 1 1 4
# 2 2 5
# 3 3 3
# 4 4 3
# 5 5 2
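gather() is superseded in current tidyr; a possible pivot_longer() equivalent of the same idea:
df %>%
  pivot_longer(-subject, names_to = "key", values_to = "value") %>%
  group_by(subject) %>%
  summarise(nCorrect = sum(unique(value) %in% words))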

R: Application of group_by in dplyr

I'm just getting started using dplyr and I have the following two problems, which should be easy to solve with group_by, but I don't get it.
I have data that looks like this:
data <- data.frame(cbind("year" = c(2010, 2010, 2010, 2011, 2012, 2012, 2012, 2012),
"institution" = c("a", "a", "b", "a", "a", "a", "b", "b"),
"branch.num" = c(1, 2, 1, 1, 1, 2, 1, 2)))
data
# year institution branch.num
#1 2010 a 1
#2 2010 a 2
#3 2010 b 1
#4 2011 a 1
#5 2012 a 1
#6 2012 a 2
#7 2012 b 1
#8 2012 b 2
The data is structured hierarchically: an institution at the highest level can have several branches, which are numbered starting at 1.
Problem 1: I want to select only the rows for branches that have a value in every year; in the example data that is only branch 1 of institution a, so the selection should be rows 1, 4, and 5.
Problem 2: I want to know the average number of branches an institution has over all years. In the example that is (2+1+2)/3 = 1.67 for institution a and (1+0+2)/3 = 1 for institution b.
Here is one solution:
Problem #1:
library(dplyr)
nYears <- n_distinct(data$year)
data %>%
  group_by(institution, branch.num) %>%
  filter(n_distinct(year) == nYears)
Source: local data frame [3 x 3]
Groups: institution, branch.num [1]
year institution branch.num
(fctr) (fctr) (fctr)
1 2010 a 1
2 2011 a 1
3 2012 a 1
Problem #2:
data %>%
  group_by(institution, year) %>%
  summarise(nBranches = n_distinct(branch.num)) %>%
  ungroup() %>%
  group_by(institution) %>%
  summarise(meanBranches = sum(nBranches) / nYears)
Source: local data frame [2 x 2]
institution meanBranches
(fctr) (dbl)
1 a 1.666667
2 b 1.000000
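For Problem 2 there is also a base R cross-check. It is a sketch that works here because every institution/year/branch row is unique, so row counts equal distinct branch counts; with duplicated branch rows you would still need n_distinct:
# Years with no branches contribute 0, reproducing the (1+0+2)/3 logic for b.
rowMeans(table(data$institution, data$year))
#        a        b
# 1.666667 1.000000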

From CET time to local time

I have this data
> dff_all[1:10,c(2,3)]
cet_hour_of_registration country_id
1 20 SE
2 12 SE
3 11 SE
4 15 GB
5 12 SE
6 14 BR
7 23 MX
8 13 SE
9 1 BR
10 9 SE
and I want to create a variable hour with the local time. The conversions from CET to local time are as follows:
FI+1. MX-7. UK-1. BR-5.
I tried to do it with a nested if, but did not manage it.
#Create a data lookup table
country_id <- c("FI", "MX", "UK", "BR", "SE")
time_diff <- c(1,-7,-1,-5, 0)
df <- data.frame(country_id, time_diff)
#this is a substitute data frame for your data.
hour_reg <- c(20,12,11,15,5)
dff_all <- data.frame(country_id, hour_reg)
# join the tables with dplyr's left_join (or base merge; double-check the
# join type for your needs)
library(dplyr)
new_table <- left_join(dff_all, df, by = "country_id")
#make new column
mutate(new_table, hour = hour_reg - time_diff)
#output
country_id hour_reg time_diff hour
1 FI 20 1 19
2 MX 12 -7 19
3 UK 11 -1 12
4 BR 15 -5 20
5 SE 5 0 5
Base package:
# A variation of the example provided by vinchinzu
# Original table
country_id <- c("FI", "MX", "UK", "BR", "SE", "SP", "RE")
hour_reg <- c(20, 12, 11, 15, 5, 3, 7)
df1 <- data.frame(country_id, hour_reg)
# Lookup table
country_id <- c("FI", "MX", "UK", "BR", "SE")
time_diff <- c(1, -7, -1, -5, 0)
df2 <- data.frame(country_id, time_diff)
# We merge them and calculate a new column
full <- merge(df1, df2, by = "country_id", all.x = TRUE)
full$hour <- full$hour_reg - full$time_diff
full
Output; in case we do not have a country in the lookup table, we will get NA:
country_id hour_reg time_diff hour
1 BR 15 -5 20
2 FI 20 1 19
3 MX 12 -7 19
4 RE 7 NA NA
5 SE 5 0 5
6 SP 3 NA NA
7 UK 11 -1 12
If we would like to show all rows without NA:
full[complete.cases(full), ]
To replace NAs with zeros:
full[is.na(full)] <- 0
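Alternatively, a dplyr sketch that treats a missing offset as zero before computing the hour (using the merged table from above):
library(dplyr)
full %>%
  mutate(time_diff = coalesce(time_diff, 0),  # NA offset counts as zero
         hour = hour_reg - time_diff)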
