reshape wide into long while splitting - r

I would like to reshape this data frame:
ID p2012 p2010 p2008 p2006 c2012 c2010 c2008 c2006
1 1 160 162 163 165 37.3 37.3 37.1 37.1
2 2 163 164 164 163 2.6 2.6 2.6 2.6
into:
ID year p c
1 1 2006 165 37.1
2 1 2008 163 37.1
3 1 2010 162 37.3
4 1 2012 160 37.3
5 2 2006 163 2.6
6 2 2008 164 2.6
7 2 2010 164 2.6
8 2 2012 163 2.6
I am new to R and have been experimenting with the melt and dcast functions, but there are just too many twists for me at this stage. Help would be much appreciated!
A dput of my df:
structure(list(ID = 1:2, p2012 = c(160L, 163L), p2010 = c(162L, 164L), p2008 = 163:164, p2006 = c(165L, 163L), c2012 = c(37.3, 2.6), c2010 = c(37.3, 2.6), c2008 = c(37.1, 2.6), c2006 = c(37.1, 2.6)), .Names = c("ID", "p2012", "p2010", "p2008", "p2006", "c2012", "c2010", "c2008", "c2006"), class = "data.frame", row.names = c(NA, -2L))

An alternative to shadow's answer is to use the reshape function:
reshape(d, direction='long', varying=list(2:5, 6:9), v.names=c("p", "c"), idvar="ID", times=c(2012, 2010, 2008, 2006))
This assumes that you know the column indices of the p and c columns beforehand (or that you add code to work them out). The times vector above could also be derived from the column names, using something similar to the gsub calls in shadow's answer.
Which way to use probably is a matter of taste.
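A minimal sketch of that "additional code", assuming every p column is named p<year> and every c column is named c<year>, in the same year order. The names p_cols, c_cols and years are made up for illustration:

```r
# Data from the question
d <- data.frame(ID = 1:2,
                p2012 = c(160L, 163L), p2010 = c(162L, 164L),
                p2008 = 163:164,       p2006 = c(165L, 163L),
                c2012 = c(37.3, 2.6),  c2010 = c(37.3, 2.6),
                c2008 = c(37.1, 2.6),  c2006 = c(37.1, 2.6))

# Locate the p and c columns by name instead of hard-coding 2:5 and 6:9
p_cols <- grep("^p[0-9]+$", names(d))
c_cols <- grep("^c[0-9]+$", names(d))

# Recover the years from the p column names and check that the
# c columns list the same years in the same order (an assumption)
years <- as.numeric(sub("^p", "", names(d)[p_cols]))
stopifnot(identical(years, as.numeric(sub("^c", "", names(d)[c_cols]))))

long <- reshape(d, direction = "long",
                varying = list(p_cols, c_cols),
                v.names = c("p", "c"),
                idvar = "ID", timevar = "year", times = years)

# Sort to match the layout requested in the question
long <- long[order(long$ID, long$year), ]
```

Setting timevar = "year" names the time column directly, so no renaming is needed afterwards.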

You probably have to melt the data first, then split the variable and the year and then dcast to your final data.frame.
require(reshape2)
# melt data.frame
dfmelt <- melt(df, id.vars="ID", variable.name="var.year")
# split "var.year" into new variables "var" and "year"
dfmelt[, "var"] <- gsub("[0-9]", "", as.character(dfmelt[, "var.year"]))
dfmelt[, "year"] <- as.numeric(gsub("[a-zA-Z]", "", as.character(dfmelt[, "var.year"])))
# cast to data with column for each var-name
dcast(dfmelt, ID+year~var, value.var="value")

You can also use the following solution from tidyr. You don't actually need to use regular expressions, if "p" or "c" is always the first letter of the column names:
library(tidyr)
library(dplyr) # only loaded for the %>% operator
dat %>%
gather(key,value,p2012:c2006) %>%
separate(key,c("category","year"),1) %>%
spread(category,value)
ID year c p
1 1 2006 37.1 165
2 1 2008 37.1 163
3 1 2010 37.3 162
4 1 2012 37.3 160
5 2 2006 2.6 163
6 2 2008 2.6 164
7 2 2010 2.6 164
8 2 2012 2.6 163
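In current tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a sketch of the same reshape in a single call, assuming tidyr 1.1.0 or later and that every column name is one lowercase letter followed by the year:

```r
library(tidyr)

# Data from the question
dat <- data.frame(ID = 1:2,
                  p2012 = c(160L, 163L), p2010 = c(162L, 164L),
                  p2008 = 163:164,       p2006 = c(165L, 163L),
                  c2012 = c(37.3, 2.6),  c2010 = c(37.3, 2.6),
                  c2008 = c(37.1, 2.6),  c2006 = c(37.1, 2.6))

# ".value" tells pivot_longer that the first regex group (p or c)
# names the output column; the second group becomes the year
long <- pivot_longer(dat,
                     cols = -ID,
                     names_to = c(".value", "year"),
                     names_pattern = "^([a-z])([0-9]+)$",
                     names_transform = list(year = as.integer))
```

Here year comes out as an integer column rather than character, which is one small difference from the separate() approach above.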

Related

Summing multiple observation rows in R

I have a dataset with 4 observations for 90 variables. The observations are answers to a questionnaire of the type "completely agree" to "completely disagree", expressed in percentages. I want to sum the two positive observations (completely and somewhat agree) and the two negative ones (completely and somewhat disagree) for all variables. Is there a way to do this in R?
My dataset looks like this:
Albania Andorra Azerbaijan etc.
1 13.3 18.0 14.9 ...
2 56.3 45.3 27.2 ...
3 21.3 27.2 28.0 ...
4 8.9 9.4 5.2 ...
And I want to sum rows 1+2 and 3+4 to look something like this:
Albania Andorra Azerbaijan etc.
1 69.6 63.3 42.1 ...
2 30.2 36.6 33.2 ...
I am really new to R so I have no idea how to go about this. All answers to similar questions I found on this website and others either have character type observations, multiple rows for the same observation (with missing data), or combine all the rows into just 1 row. My problem falls in none of these categories, I just want to collapse some of the observations.
Since you only have four rows, it's probably easiest to just add the first two rows together and the second two rows together. You can use rbind to stick the two resulting rows together into the desired data frame:
rbind(df[1,] + df[2, ], df[3,] + df[4,])
#> Albania Andorra Azerbaijan
#> 1 69.6 63.3 42.1
#> 3 30.2 36.6 33.2
Data taken from question
df <- structure(list(Albania = c(13.3, 56.3, 21.3, 8.9), Andorra = c(18,
45.3, 27.2, 9.4), Azerbaijan = c(14.9, 27.2, 28, 5.2)), class = "data.frame",
row.names = c("1", "2", "3", "4"))
Another option is to sum every 2 rows with rowsum, using gl with k = 2 to build the groups, as in the following code:
rowsum(df, gl(n = nrow(df), k = 2, length = nrow(df)))
#> Albania Andorra Azerbaijan
#> 1 69.6 63.3 42.1
#> 2 30.2 36.6 33.2
Created on 2023-01-06 with reprex v2.0.2
Using dplyr
library(dplyr)
df %>%
group_by(grp = gl(n(), 2, n())) %>%
summarise(across(everything(), sum))
-output
# A tibble: 2 × 4
grp Albania Andorra Azerbaijan
<fct> <dbl> <dbl> <dbl>
1 1 69.6 63.3 42.1
2 2 30.2 36.6 33.2
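The answers above depend on the rows being in a fixed order. A hedged variation makes the grouping explicit with a label vector, so rows can be collapsed in any pattern. The labels "agree" and "disagree" and the name grp are assumptions for illustration:

```r
# Data taken from the question
df <- data.frame(Albania    = c(13.3, 56.3, 21.3, 8.9),
                 Andorra    = c(18, 45.3, 27.2, 9.4),
                 Azerbaijan = c(14.9, 27.2, 28, 5.2))

# One label per row: rows that share a label are summed together
grp <- c("agree", "agree", "disagree", "disagree")

# rowsum() sums each column within each group; by default the
# groups come back sorted, and the labels become the row names
collapsed <- rowsum(df, group = grp)
collapsed
```

With this form, changing which answers count as positive only requires editing grp.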

R Panel data: Create new variable based on ifelse() statement and previous row

My question refers to the following (simplified) panel data, for which I would like to create some sort of xrd_stock.
#Setup data
library(tidyverse)
firm_id <- c(rep(1, 5), rep(2, 3), rep(3, 4))
firm_name <- c(rep("Cosco", 5), rep("Apple", 3), rep("BP", 4))
fyear <- c(seq(2000, 2004, 1), seq(2003, 2005, 1), seq(2005, 2008, 1))
xrd <- c(49,93,121,84,37,197,36,154,104,116,6,21)
df <- data.frame(firm_id, firm_name, fyear, xrd)
#Define variables
growth = 0.08
depr = 0.15
For a new variable called xrd_stock I'd like to apply the following mechanics:
1. each firm_id should be handled separately: group_by(firm_id)
2. where fyear is at its minimum, calculate xrd_stock as: xrd/(growth + depr)
3. otherwise, calculate xrd_stock as: xrd + (1-depr) * [xrd_stock from previous row]
With the following code, I already succeeded with step 1. and 2. and parts of step 3.
df2 <- df %>%
ungroup() %>%
group_by(firm_id) %>%
arrange(firm_id, fyear, decreasing = TRUE) %>% #Ensure that data is arranged w/ in asc(fyear) order; not required in this specific example as df is already in correct order
mutate(xrd_stock = ifelse(fyear == min(fyear), xrd/(growth + depr), xrd + (1-depr)*lag(xrd_stock)))
Difficulties occur in the else part of the function, such that R returns:
Error: Problem with `mutate()` input `xrd_stock`.
x object 'xrd_stock' not found
i Input `xrd_stock` is `ifelse(...)`.
i The error occured in group 1: firm_id = 1.
Run `rlang::last_error()` to see where the error occurred.
From this error message, I understand that R cannot refer to the just created xrd_stock in the previous row (logical when considering/assuming that R is not strictly working from top to bottom); however, when simply putting a 9 in the else part, my above code runs without any errors.
Can anyone help me with this problem so that the results eventually look as shown below? I am more than happy to answer additional questions if required. Thank you very much in advance to everyone who looks at my question :-)
Target results (Excel-calculated):
id name fyear xrd xrd_stock Calculation for xrd_stock
1 Cosco 2000 49 213 =49/(0.08+0.15)
1 Cosco 2001 93 274 =93+(1-0.15)*213
1 Cosco 2002 121 354 …
1 Cosco 2003 84 385 …
1 Cosco 2004 37 364 …
2 Apple 2003 197 857 =197/(0.08+0.15)
2 Apple 2004 36 764 =36+(1-0.15)*857
2 Apple 2005 154 803 …
3 BP 2005 104 452 …
3 BP 2006 116 500 …
3 BP 2007 6 431 …
3 BP 2008 21 388 …
Arrange the data by fyear so that the minimum year is always the first row of each group. You can then use accumulate to carry the previous xrd_stock forward.
library(dplyr)
df %>%
arrange(firm_id, fyear) %>%
group_by(firm_id) %>%
mutate(xrd_stock = purrr::accumulate(xrd[-1], ~.y + (1-depr) * .x,
.init = first(xrd)/(growth + depr)))
# firm_id firm_name fyear xrd xrd_stock
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 Cosco 2000 49 213.
# 2 1 Cosco 2001 93 274.
# 3 1 Cosco 2002 121 354.
# 4 1 Cosco 2003 84 385.
# 5 1 Cosco 2004 37 364.
# 6 2 Apple 2003 197 857.
# 7 2 Apple 2004 36 764.
# 8 2 Apple 2005 154 803.
# 9 3 BP 2005 104 452.
#10 3 BP 2006 116 500.
#11 3 BP 2007 6 431.
#12 3 BP 2008 21 388.
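For readers who prefer to avoid purrr, the same recurrence can be sketched in base R with Reduce(accumulate = TRUE) and ave(). This is an illustration under the same assumptions (rows sorted by fyear within each firm), and the helper name xrd_stock_one is made up:

```r
growth <- 0.08
depr <- 0.15

df <- data.frame(firm_id = c(rep(1, 5), rep(2, 3), rep(3, 4)),
                 fyear   = c(2000:2004, 2003:2005, 2005:2008),
                 xrd     = c(49, 93, 121, 84, 37, 197, 36, 154, 104, 116, 6, 21))

# Stock series for a single firm (hypothetical helper).
# The first value seeds the series; each later value is
# xrd + (1 - depr) * previous stock.
xrd_stock_one <- function(xrd) {
  unlist(Reduce(function(prev, x) x + (1 - depr) * prev,
                xrd[-1],
                xrd[1] / (growth + depr),
                accumulate = TRUE))
}

# Sort first so that the minimum fyear is the first row of each firm
df <- df[order(df$firm_id, df$fyear), ]

# ave() applies the helper within each firm_id and keeps the row order
df$xrd_stock <- ave(df$xrd, df$firm_id, FUN = xrd_stock_one)
```

The first two Cosco values can be checked by hand: 49/(0.08+0.15) is about 213, and 93 + 0.85 * 213 is about 274, matching the target table.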

Data.table: operation with group-shifted data

Consider the following data.table:
DT <- data.table(year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013),
level = c(137,137,137,136,136,136,135,135,135),
valueIn = c(13,30,56,11,25,60,8,27,51))
I would like to have the following output:
DT <- data.table(year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013),
level = c(137,137,137,136,136,136,135,135,135),
valueIn = c(13,30,56, 11,25,60, 8,27,51),
valueOut = c(12,27.5,58, 9.5,26,55.5, NA,NA,NA))
In other words, I want to calculate (valueIn[level] + valueIn[level-1]) / 2 within each year. For example, the first value is calculated like this: (13+11)/2=12.
For the moment, I do that with for loops, in which I create data.table's subsets for each level:
levelDtList <- list()
levels <- sort(DT$level, decreasing = FALSE)
for (this.level in levels) {
levelDt <- DT[level == this.level]
if (this.level == min(levels)) {
valueOut <- NA
} else {
levelM1Data <- levelDtList[[this.level - 1]]
valueOut <- (levelDt$valueIn + levelM1Data$valueIn) / 2
}
levelDt$valueOut <- valueOut
levelDtList[[this.level]] <- levelDt
}
datatable <- rbindlist(levelDtList)
This is ugly and quite slow, so I am looking for a better, faster, data.table-based solution.
Using the shift-function with type = 'lead' to get the next value, sum and divide by two:
DT[, valueOut := (valueIn + shift(valueIn, type = 'lead'))/2, by = year]
you get:
year level valueIn valueOut
1: 2011 137 13 12.0
2: 2012 137 30 27.5
3: 2013 137 56 58.0
4: 2011 136 11 9.5
5: 2012 136 25 26.0
6: 2013 136 60 55.5
7: 2011 135 8 NA
8: 2012 135 27 NA
9: 2013 135 51 NA
With all the parameters of the shift-function specified:
DT[, valueOut := (valueIn + shift(valueIn, n = 1L, fill = NA, type = 'lead'))/2, by = year]
We can also use shift with Reduce
DT[, valueOut := Reduce(`+`, shift(valueIn, type = "lead", 0:1))/2, by = year]
DT
# year level valueIn valueOut
#1: 2011 137 13 12.0
#2: 2012 137 30 27.5
#3: 2013 137 56 58.0
#4: 2011 136 11 9.5
#5: 2012 136 25 26.0
#6: 2013 136 60 55.5
#7: 2011 135 8 NA
#8: 2012 135 27 NA
#9: 2013 135 51 NA
This form is easier to generalize, because shift can take a vector of n values.
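As a hedged illustration of that generalization, the sketch below averages each value with the next two levels in the same year by passing n = 0:2. The three-level window and the column name mean3 are assumptions chosen for the example, not part of the question:

```r
library(data.table)

DT <- data.table(year    = rep(c(2011, 2012, 2013), times = 3),
                 level   = rep(c(137, 136, 135), each = 3),
                 valueIn = c(13, 30, 56, 11, 25, 60, 8, 27, 51))

# shift() with a vector n returns a list with one shifted copy per n;
# Reduce(`+`, ...) adds the copies element-wise
DT[, mean3 := Reduce(`+`, shift(valueIn, n = 0:2, type = "lead")) / 3,
   by = year]
DT
```

Only the top level of each year has two following levels, so every other row gets NA, just as the bottom level does in the two-value case.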
If you don't mind using dplyr, year is what relates your items, and the structure shown is representative of your real data, then this could work for you:
DT %>% group_by(year) %>% mutate(valueOut = (valueIn + lead(valueIn)) / 2)

Create tidy dataset from an untidy one with two variables per column and implicit missings

I have an untidy dataset that combines two variables (some missing) in each of two columns (a small subsample is in the data frame 'untidy' below). I'm struggling to create the desired tidy dataset below.
untidy <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]",
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%",
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, 5L), class = "data.frame")
Desired data frame
N_patients N_ears pct_patients pct_ears
173 NA 58.61 NA
60 NA 13.30 NA
54 96 11.11 NA
168 328 52.38 NA
906 1685 14.79 10.45
Thanks!
There always seems to be an edge case: both answers fail to handle something about the 5th row. It looks like just a regex issue. Any suggestions on how to fix it?
untidy_2 <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]",
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%",
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
i.e., in row 5, [35.33%] is parsed as pct_patients
N [ears] % Otorrhea N_patients N_ears pct_patients pct_ears
1 173 58.61% 173 NA 58.61 NA
2 60 13.30% 60 NA 13.30 NA
3 54 [96] 11.11% 54 96 11.11 NA
4 168 [328] 52.38% 168 328 52.38 NA
5 75 [150] [35.33%] 75 150 35.33 NA
Happily, this is pretty easy with the tidyr package in the tidyverse.
library(tidyverse)
test <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]", "906 [1685]"),
`% Otorrhea` = c("58.61%", "13.30%", "11.11%", "52.38%", "14.79% [10.45%]")),
.Names = c("N [ears]", "% Otorrhea"),
row.names = c(NA, 5L), class = "data.frame")
test %>%
separate(`N [ears]`, into = c("N_patients", "N_ears"), sep = "\\s\\[", fill = "right") %>%
separate(`% Otorrhea`, into = c("pct_patients", "pct_ears"), sep = "\\s\\[", fill = "right") %>%
mutate(across(everything(), parse_number)) # replaces the deprecated mutate_each(funs(parse_number))
#> N_patients N_ears pct_patients pct_ears
#> 1 173 NA 58.61 NA
#> 2 60 NA 13.30 NA
#> 3 54 96 11.11 NA
#> 4 168 328 52.38 NA
#> 5 906 1685 14.79 10.45
Here is an alternative with extract() function with regular expressions:
library(tidyr)
test %>%
extract(`N [ears]`, into = c("N_patients", "N_ears"),
regex = "^(\\d+)(?:\\s\\[(\\d+)\\])?$") %>%
extract(`% Otorrhea`, into = c("pct_patients", "pct_ears"),
regex = "^([.0-9]+)%(?:\\s\\[([.0-9]+)%\\])?$")
# N_patients N_ears pct_patients pct_ears
#1 173 <NA> 58.61 <NA>
#2 60 <NA> 13.30 <NA>
#3 54 96 11.11 <NA>
#4 168 328 52.38 <NA>
#5 906 1685 14.79 10.45
Here a non-capturing group (?:...) followed by ? makes the bracketed ears value optional, so rows without it still match and get NA in the second column.
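To see what the optional group does in isolation, here is a small base R sketch that applies the first pattern with sub(..., perl = TRUE). This swaps in base R for tidyr's extract() purely for illustration, and the vector x is made-up sample input in the same shape as the question's data:

```r
x <- c("173", "54 [96]")

# Group 1: patient count. The non-capturing (?: ... )? wrapper makes
# the whole " [ears]" part optional; group 2 captures the ears count.
pattern <- "^(\\d+)(?:\\s\\[(\\d+)\\])?$"

patients <- sub(pattern, "\\1", x, perl = TRUE)
ears     <- sub(pattern, "\\2", x, perl = TRUE)

# When the optional part is absent, group 2 comes back empty here;
# recode it as NA to mirror the extract() output above
ears[ears == ""] <- NA

data.frame(N_patients = as.integer(patients),
           N_ears     = as.integer(ears))
```

Without the trailing ? the pattern would not match "173" at all, and sub() would return that string unchanged for both groups.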
The best answer for my actual dataset was provided in a comment by
https://stackoverflow.com/users/4497050/alistaire
It is shown below, wrapped in a simple function.
library(tidyverse)
make_tidy <- function(untidy){
tidy <- untidy %>%
separate_(colnames(untidy)[1], c('N_patients', 'N_ears'), fill = 'right', extra = 'drop', convert = TRUE) %>%
separate_(colnames(untidy)[2], c('pct_patients', 'pct_ears'), sep = '[^\\d.]+', extra = 'drop', convert = TRUE)
}
tidy_2 <- make_tidy(untidy_2)
Correctly parses untidy_2
> tidy_2
# A tibble: 5 × 4
N_patients N_ears pct_patients pct_ears
* <int> <int> <dbl> <dbl>
1 173 NA 58.61 NA
2 60 NA 13.30 NA
3 54 96 11.11 NA
4 168 328 52.38 NA
5 906 1685 14.79 10.45

How to control number of decimal digits in write.table() output?

When working with data (e.g., in data.frame) the user can control displaying digits by using
options(digits=3)
and listing the data.frame like this.
ttf.all
When the user needs to paste the data into Excel like this
write.table(ttf.all, 'clipboard', sep='\t',row.names=F)
The digits parameter is ignored and numbers are not rounded.
The console output is nicely rounded:
> ttf.all
year V1.x.x V1.y.x ratio1 V1.x.y V1.y.y ratioR V1.x.x V1.y.x ratioAL V1.x.y V1.y.y ratioRL
1 2006 227 645 35.2 67 645 10.4 150 645 23.3 53 645 8.22
2 2007 639 1645 38.8 292 1645 17.8 384 1645 23.3 137 1645 8.33
3 2008 1531 3150 48.6 982 3150 31.2 755 3150 24.0 235 3150 7.46
4 2009 1625 3467 46.9 1026 3467 29.6 779 3467 22.5 222 3467 6.40
But what ends up in Excel (via the clipboard) is not rounded. How can this be controlled in write.table()?
You can use the function format() as in:
write.table(format(ttf.all, digits=2), 'clipboard', sep='\t',row.names=F)
format() is a generic function that has methods for many classes, including data.frames. Unlike round(), it won't throw an error if your dataframe is not all numeric. For more details on the formatting options, see the help file via ?format
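A small sketch of the difference, assuming a made-up data frame ttf with one character column. It writes to the console instead of 'clipboard' so that it runs on any platform:

```r
# Hypothetical stand-in for ttf.all, with a non-numeric column
ttf <- data.frame(year   = 2006:2007,
                  ratio1 = c(35.19380, 38.84498),
                  label  = c("a", "b"),
                  stringsAsFactors = FALSE)

# Written as-is: write.table() does not consult options(digits),
# so the numbers appear at (close to) full precision
write.table(ttf, sep = "\t", row.names = FALSE, quote = FALSE)

# format() converts every column to text first, using the requested
# number of significant digits, and leaves character columns alone
formatted <- format(ttf, digits = 3)
write.table(formatted, sep = "\t", row.names = FALSE, quote = FALSE)
```

Note that digits here means significant digits, applied per column, not a fixed number of decimal places. For fixed decimals, round() on the numeric columns or the nsmall argument of format() may be a better fit.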
Here is a solution for data frames with mixed character and numeric columns. We first use mutate_if to select the numeric columns and then apply round() to them.
# install.packages('dplyr', dependencies = TRUE)
library(dplyr)
df <- read.table(text = "id year V1.x.x V1.y.x ratio1
a 2006 227.11111 645.11111 35.22222
b 2007 639.11111 1645.11111 38.22222
c 2008 1531.11111 3150.11111 48.22222
d 2009 1625.11111 3467.11111 46.22222",
header = TRUE, stringsAsFactors = FALSE)
df %>%
mutate_if(is.numeric, round, digits = 2)
#> id year V1.x.x V1.y.x ratio1
#> 1 a 2006 227.11 645.11 35.22
#> 2 b 2007 639.11 1645.11 38.22
#> 3 c 2008 1531.11 3150.11 48.22
#> 4 d 2009 1625.11 3467.11 46.22
### dplyr v1.0.0+
df %>%
mutate(across(where(is.numeric), ~ round(., digits = 2)))
#> id year V1.x.x V1.y.x ratio1
#> 1 a 2006 227.11 645.11 35.22
#> 2 b 2007 639.11 1645.11 38.22
#> 3 c 2008 1531.11 3150.11 48.22
#> 4 d 2009 1625.11 3467.11 46.22
Created on 2019-03-17 by the reprex package (v0.2.1.9000)
