error spreading data set in R - r

I have a long data set that is broken down by geographical location and year, with about 5 variables of interest (see structure blow), every time I try to convert it to wide form, I get told that there's duplication so it can't.
df
Yr Geo Obs1 Obs2
2001 Dist1 1 3
2002 Dist1 2 5
2003 Dist1 4 2
2004 Dist1 2 1
2001 Dist2 1 3
2002 Dist2 .9 5
2003 Dist2 6 8
2004 Dist2 2 .2
I want to convert it into something like this
yr dist1obs1 dist1obs2 dist2obs1 dist2obs2
2001
2002
2003
2004

Looking for something like this...?
> reshape(df, v.names= c("Obs1", "Obs2"), idvar="Yr", timevar ="Geo", direction="wide")
Yr Obs1.Dist1 Obs2.Dist1 Obs1.Dist2 Obs2.Dist2
1 2001 1 3 1.0 3.0
2 2002 2 5 0.9 5.0
3 2003 4 2 6.0 8.0
4 2004 2 1 2.0 0.2

Here is a solution using tidyr. Because spread works with one key-value pair, you need to first gather the Obs and unite the dist with it so that you have one key-value pair to work with. I also set the column names to be lower case as shown in the requested output.
library(tidyverse)
tbl <- read_table2(
"Yr Geo Obs1 Obs2
2001 Dist1 1 3
2002 Dist1 2 5
2003 Dist1 4 2
2004 Dist1 2 1
2001 Dist2 1 3
2002 Dist2 .9 5
2003 Dist2 6 8
2004 Dist2 2 .2"
)
tbl %>%
gather("obsnum", "obs", Obs1, Obs2) %>%
unite(colname, Geo, obsnum, sep = "") %>%
spread(colname, obs) %>%
`colnames<-`(str_to_lower(colnames(.)))
#> # A tibble: 4 x 5
#> yr dist1obs1 dist1obs2 dist2obs1 dist2obs2
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2001 1. 3. 1.00 3.00
#> 2 2002 2. 5. 0.900 5.00
#> 3 2003 4. 2. 6.00 8.00
#> 4 2004 2. 1. 2.00 0.200
Created on 2018-04-19 by the reprex package (v0.2.0).

Related

`str_replace_all` numeric values in column according to named vector

I want to use a named vector to map numeric values of a data frame column.
consider the following example:
df <- data.frame(year = seq(2000,2004,1), value = sample(11:15, r = T)) %>%
add_row(year=2005, value=1)
df
# year value
# 1 2000 12
# 2 2001 15
# 3 2002 11
# 4 2003 12
# 5 2004 14
# 6 2005 1
I now want to replace according to a vector, like this one
repl_vec <- c("1"="apple", "11"="radish", "12"="tomato", "13"="cucumber", "14"="eggplant", "15"="carrot")
which I do with this
df %>% mutate(val_alph = str_replace_all(value, repl_vec))
However, this gives:
# year value val_alph
# 1 2000 11 appleapple
# 2 2001 13 apple3
# 3 2002 15 apple5
# 4 2003 12 apple2
# 5 2004 14 apple4
# 6 2005 1 apple
since str_replace_all uses the first match and not the whole match. In the real data, the names of the named vector are also numbers (one- and two-digits).
I expect the output to be like this:
# year value val_alph
# 1 2000 11 radish
# 2 2001 13 cucumber
# 3 2002 15 carrot
# 4 2003 12 tomato
# 5 2004 14 eggplant
# 6 2005 1 apple
Does someone have a clever way of achieving this?
I would use base R's match instead of string matching here, since you are looking for exact whole string matches.
df %>%
mutate(value = repl_vec[match(value, names(repl_vec))])
#> year value
#> 1 2000 radish
#> 2 2001 carrot
#> 3 2002 carrot
#> 4 2003 cucumber
#> 5 2004 eggplant
#> 6 2005 apple
Created on 2022-04-20 by the reprex package (v2.0.1)
Is this what you want to do?
set.seed(1234)
df <- data.frame(year = seq(2000,2004,1), value = sample(11:15, r = T)) %>%
add_row(year=2005, value=1)
repl_vec <- c("1"="one", "11"="eleven", "12"="twelve", "13"="thirteen", "14"="fourteen", "15"="fifteen")
names(repl_vec) <- paste0("\\b", names(repl_vec), "\\b")
df %>%
mutate(val_alph = str_replace_all(value, repl_vec, names(repl_vec)))
which gives:
year value val_alph
1 2000 14 fourteen
2 2001 12 twelve
3 2002 15 fifteen
4 2003 14 fourteen
5 2004 11 eleven
6 2005 1 one

Average percentage change over different years in R

I have a data frame from which I created a reproducible example:
country <- c('A','A','A','B','B','C','C','C','C')
year <- c(2010,2011,2015,2008,2009,2008,2009,2011,2015)
score <- c(1,2,2,1,4,1,1,3,2)
country year score
1 A 2010 1
2 A 2011 2
3 A 2015 2
4 B 2008 1
5 B 2009 4
6 C 2008 1
7 C 2009 1
8 C 2011 3
9 C 2015 2
And I am trying to calculate the average percentage increase (or decrease) in the score for each country by calculating [(final score - initial score) รท (initial score)] for each year and averaging it over the number of years.
country year score change
1 A 2010 1 NA
2 A 2011 2 1
3 A 2015 2 0
4 B 2008 1 NA
5 B 2009 4 3
6 C 2008 1 NA
7 C 2009 1 0
8 C 2011 3 2
9 C 2015 2 -0.33
The final result I am hoping to obtain:
country avg_change
1 A 0.5
2 B 3
3 C 0.55
As you can see, the trick is that countries have spans over different years, sometimes with a missing year in between. I tried different ways to do it manually but I do struggle. If someone could hint me a solution would be great. Many thanks.
With dplyr, we can group_by country and get mean of difference between scores.
library(dplyr)
df %>%
group_by(country) %>%
summarise(avg_change = mean(c(NA, diff(score)), na.rm = TRUE))
# country avg_change
# <fct> <dbl>
#1 A 0.500
#2 B 3.00
#3 C 0.333
Using base R aggregate with same logic
aggregate(score~country, df, function(x) mean(c(NA, diff(x)), na.rm = TRUE))
We can use data.table to group by 'country' and take the mean of the difference between the 'score' and the lag of 'score'
library(data.table)
setDT(df1)[, .(avg_change = mean(score -lag(score), na.rm = TRUE)), .(country)]
# country avg_change
#1: A 0.5000000
#2: B 3.0000000
#3: C 0.3333333

back fill NA values in panel data set

I want to know how I can backfill NA values in panel data set.
data set
date firms return
1999 A NA
2000 A 5
2001 A NA
1999 B 9
2000 B NA
2001 B 10
expected out come
date firms return
1999 A 5
2000 A 5
2001 A NA
1999 B 9
2000 B 10
2001 B 10
I use this formula to fill NA values with previous value in panel data set
library(dplyr)
library(tidyr)
df1<-df %>% group_by(firms) %>% fill(return)
Is there any easy way like this by which I can fill NA values with next value in a panel data set.
You almost had it.
df <- df %>% group_by(firms) %>% fill(return, .direction="up")
df
# A tibble: 6 x 3
# Groups: firms [2]
date firms return
<int> <fct> <int>
1 1999 A 5
2 2000 A 5
3 2001 A NA
4 1999 B 9
5 2000 B 10
6 2001 B 10

How can I drop observations within a group following the occurrence of NA?

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of a variable "assets", but I have some NAs. However, I cannot simply delete the NA observations, but need to delete all subsequent observations following the NA event.
Here an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to carry out functions by group using plyr and I have also been able to create a column with 0-1, where 0 indicates that assets has a valid entry and 1 highlights missing values of NA.
mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option
mydf[unsplit(lapply(split(mydf, mydf$productreference),
function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table
library(data.table)
setDT(mydf)[, if(any(is.na(assets))) .SD[seq(which(is.na(assets))[1]-1)]
else .SD, by = productreference]
You can do it using base R and a for loop. This code is a bit longer than some of the code in the other answers. In the loop we subset mydf by productreference and for every subset we look for the first occurrence of assets==NA, and exclude that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)){
s1 <- mydf[mydf$productreference==i,]
s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets)==T))-1),]
mydf2 <- rbind(mydf2, s2)
mydf2 <- mydf2[!is.na(mydf2$assets),]
}
mydf2

repeat rows in a dataset based on a column, but increment the rows [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 5 years ago.
I have a dataset which has project name, start year and contract term. I need to develop this dataset into time series. For example, one row in my dataset is: Project A, start year 2003 and contract term 5. I would like to repeat each row based on contract term. My dataset looks like this:
Project Name Start Year Contract Term
A 2003 5
B 2013 3
C 2000 2
My desired result should look like this:
Project Name Start Year Contract Term
A 2003 5
A 2004 5
A 2005 5
A 2006 5
A 2007 5
B 2013 3
B 2014 3
B 2014 3
C 2000 2
C 2001 2
I have tried:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
But this only repeats each project by the number in contract term. I can not make it to increment the years.
Thanks in advance!
Here it is in two steps:
Step 1, you know:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2003 5
# 1.1 A 2003 5
# 1.2 A 2003 5
# 1.3 A 2003 5
# 1.4 A 2003 5
# 2 B 2013 3
# 2.1 B 2013 3
# 2.2 B 2013 3
# 3 C 2000 2
# 3.1 C 2000 2
Step 2 makes use of sequence and basic addition:
sequence(rpsInput$Contract.Term) ## This will be helpful...
# [1] 1 2 3 4 5 1 2 3 1 2
rpsData$Start.Year <- rpsData$Start.Year + sequence(rpsInput$Contract.Term)
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2004 5
# 1.1 A 2005 5
# 1.2 A 2006 5
# 1.3 A 2007 5
# 1.4 A 2008 5
# 2 B 2014 3
# 2.1 B 2015 3
# 2.2 B 2016 3
# 3 C 2001 2
# 3.1 C 2002 2
Just to piggy back on Ananda's answer, change
sequence(rpsInput$Contract.Term)
to
(sequence(rpsInput$Contract.Term)-1)
to get the output you desire.
ProjectName<-c("A","B","C")
Start.Year<-c(2003,2013,2000)
Contract.Term<-c(5,3,2)
rpsInput<-data.frame(ProjectName,Start.Year,Contract.Term)
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData$Start.Year <- rpsData$Start.Year + (sequence(rpsInput$Contract.Term)-1)
rpsData
# ProjectName Start.Year Contract.Term
#1 A 2003 5
#1.1 A 2004 5
#1.2 A 2005 5
#1.3 A 2006 5
#1.4 A 2007 5
#2 B 2013 3
#2.1 B 2014 3
#2.2 B 2015 3
#3 C 2000 2
#3.1 C 2001 2

Resources