subtract specific row and rename it - r

Is it possible to subtract certain rows and rename them?
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","b","c","a","b","c", "a", "b", "c")
value <- c(2,2,10,3,3,12,4,4,16)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
And this is how the result should look:
year  category  value
2005  a         2
2005  b         2
2005  c         10
2006  a         3
2006  b         3
2006  c         12
2007  a         4
2007  b         4
2007  c         16
2005  c-b       8
2006  c-b       9
2007  c-b       12

You can use group_modify:
library(tidyverse)
df %>%
group_by(year) %>%
group_modify(~ add_row(.x, category = "c-b", value = .x$value[.x$category == "c"] - .x$value[.x$category == "b"]))
# A tibble: 12 x 3
# Groups: year [3]
year category value
<dbl> <chr> <dbl>
1 2005 a 2
2 2005 b 2
3 2005 c 10
4 2005 c-b 8
5 2006 a 3
6 2006 b 3
7 2006 c 12
8 2006 c-b 9
9 2007 a 4
10 2007 b 4
11 2007 c 16
12 2007 c-b 12
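For comparison, the same idea can be sketched in base R. This is only an illustration, assuming every year has exactly one "b" row and one "c" row; the names `b_rows`, `c_rows`, `diffs`, `extra` and `result` are made up for this example. It appends the new rows at the end, as in the layout the question asks for.

```r
year <- c(2005, 2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007)
category <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
value <- c(2, 2, 10, 3, 3, 12, 4, 4, 16)
df <- data.frame(year, category, value, stringsAsFactors = FALSE)

# Pull out the "b" and "c" rows, keeping only the columns needed
b_rows <- df[df$category == "b", c("year", "value")]
c_rows <- df[df$category == "c", c("year", "value")]

# Align the two by year; the value columns become value.c and value.b
diffs <- merge(c_rows, b_rows, by = "year", suffixes = c(".c", ".b"))

# Build the new "c-b" rows and append them to the original data
extra <- data.frame(year = diffs$year,
                    category = "c-b",
                    value = diffs$value.c - diffs$value.b,
                    stringsAsFactors = FALSE)
result <- rbind(df, extra)
result
```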

See the subset() function.
Example:
subset_df <- subset(df, category == "c")
If you want to know which rows you are dealing with, use which():
rows <- which(df$category == "c")
subset_df <- df[rows, ]
You can then rename the selected rows, supplying one name per row:
row.names(subset_df) <- c("name1", "name2", "name3")  # placeholders, one per selected row

Related

`str_replace_all` numeric values in column according to named vector

I want to use a named vector to map numeric values of a data frame column.
Consider the following example:
df <- data.frame(year = seq(2000,2004,1), value = sample(11:15, replace = TRUE)) %>%
add_row(year=2005, value=1)
df
# year value
# 1 2000 12
# 2 2001 15
# 3 2002 11
# 4 2003 12
# 5 2004 14
# 6 2005 1
I now want to replace according to a vector, like this one
repl_vec <- c("1"="apple", "11"="radish", "12"="tomato", "13"="cucumber", "14"="eggplant", "15"="carrot")
which I do with this
df %>% mutate(val_alph = str_replace_all(value, repl_vec))
However, this gives:
# year value val_alph
# 1 2000 11 appleapple
# 2 2001 13 apple3
# 3 2002 15 apple5
# 4 2003 12 apple2
# 5 2004 14 apple4
# 6 2005 1 apple
since str_replace_all matches each pattern anywhere inside the string (so "1" is replaced within "11", "13" and so on) instead of requiring the whole value to match. In the real data, the names of the named vector are also numbers (one and two digits).
I expect the output to be like this:
# year value val_alph
# 1 2000 11 radish
# 2 2001 13 cucumber
# 3 2002 15 carrot
# 4 2003 12 tomato
# 5 2004 14 eggplant
# 6 2005 1 apple
Does someone have a clever way of achieving this?
I would use base R's match instead of string matching here, since you are looking for exact whole string matches.
df %>%
mutate(value = repl_vec[match(value, names(repl_vec))])
#> year value
#> 1 2000 radish
#> 2 2001 carrot
#> 3 2002 carrot
#> 4 2003 cucumber
#> 5 2004 eggplant
#> 6 2005 apple
Created on 2022-04-20 by the reprex package (v2.0.1)
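Because the lookup is by exact name, the same mapping can also be sketched with plain name indexing. This is a minimal illustration on a fixed `value` vector (chosen here to mirror the expected output in the question), not the answerer's code:

```r
repl_vec <- c("1" = "apple", "11" = "radish", "12" = "tomato",
              "13" = "cucumber", "14" = "eggplant", "15" = "carrot")
value <- c(11, 13, 15, 12, 14, 1)

# Index the named vector by the values converted to character;
# any value without a matching name would come back as NA
val_alph <- unname(repl_vec[as.character(value)])
val_alph
```

Inside a pipeline, the same expression can be used within mutate().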
Is this what you want to do?
library(tidyverse)
set.seed(1234)
df <- data.frame(year = seq(2000,2004,1), value = sample(11:15, replace = TRUE)) %>%
add_row(year=2005, value=1)
repl_vec <- c("1"="one", "11"="eleven", "12"="twelve", "13"="thirteen", "14"="fourteen", "15"="fifteen")
# wrap each name in word boundaries so that "1" cannot match inside "11"
names(repl_vec) <- paste0("\\b", names(repl_vec), "\\b")
df %>%
# a named pattern vector already carries its replacements, so no third argument is needed
mutate(val_alph = str_replace_all(as.character(value), repl_vec))
which gives:
year value val_alph
1 2000 14 fourteen
2 2001 12 twelve
3 2002 15 fifteen
4 2003 14 fourteen
5 2004 11 eleven
6 2005 1 one
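The effect of the word boundaries can be seen in isolation with base R's gsub(). This is a small sketch on a made-up vector `x`, using perl = TRUE so that `\\b` is interpreted as a word boundary:

```r
x <- c("1", "11", "12")

# Without boundaries, "1" is replaced wherever it occurs
plain <- gsub("1", "one", x)

# With boundaries, "1" only matches when it stands alone
bounded <- gsub("\\b1\\b", "one", x, perl = TRUE)

plain
bounded
```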

Appending and overwriting when joining dataframes

I have the following three dataframes:
prim <- data.frame("t"=2007:2012,
"a"=1:6,
"b"=7:12)
secnd <- data.frame("t"=2012:2013,
"a"=c(5, 7))
third <- data.frame("t"=2012:2013,
"b"=c(11, 13))
I want to join secnd and third to prim in two steps. In the first step I join prim and secnd, where any existing elements in prim are overwritten by those in secnd, so we end up with:
t a b
1 2007 1 7
2 2008 2 8
3 2009 3 9
4 2010 4 10
5 2011 5 11
6 2012 5 12
7 2013 7 NA
After this I want to join with third, where again existing elements are overwritten by those in third:
t a b
1 2007 1 7
2 2008 2 8
3 2009 3 9
4 2010 4 10
5 2011 5 11
6 2012 5 11
7 2013 7 13
Is there a way to achieve this using dplyr or base R?
Using dplyr, you can do:
library(dplyr)
prim %>% full_join(secnd, by = 't') %>%
full_join(third, by = 't') %>%
mutate(a = coalesce(as.integer(a.y),a.x),
b = coalesce(as.integer(b.y),b.x)) %>%
select(t,a,b)
I added the as.integer function since you have different data types in your dataframes.
Consider base R with a chain merge and ifelse calls, followed by final column cleanup:
final_df <- Reduce(function(x, y) merge(x, y, by="t", all=TRUE), list(prim, secnd, third))
final_df <- within(final_df, {
a.x <- ifelse(is.na(a.y), a.x, a.y)
b.x <- ifelse(is.na(b.y), b.x, b.y)
})
final_df <- setNames(final_df[,1:3], c("t", "a", "b"))
final_df
# t a b
# 1 2007 1 7
# 2 2008 2 8
# 3 2009 3 9
# 4 2010 4 10
# 5 2011 5 11
# 6 2012 5 11
# 7 2013 7 13
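The overwrite-then-append logic can also be written as a small reusable function that avoids the intermediate .x/.y columns. The helper name `upsert` and its `key` argument are hypothetical, introduced only for this sketch; it assumes the key values are unique within each data frame.

```r
prim  <- data.frame(t = 2007:2012, a = 1:6, b = 7:12)
secnd <- data.frame(t = 2012:2013, a = c(5, 7))
third <- data.frame(t = 2012:2013, b = c(11, 13))

# Hypothetical helper: overwrite rows of x whose key appears in y,
# and append rows for keys that x does not have yet
upsert <- function(x, y, key = "t") {
  new_keys <- setdiff(y[[key]], x[[key]])
  if (length(new_keys) > 0) {
    # Create all-NA rows for the new keys, then fill in the key column
    extra <- x[rep(NA_integer_, length(new_keys)), , drop = FALSE]
    extra[[key]] <- new_keys
    x <- rbind(x, extra)
  }
  idx <- match(y[[key]], x[[key]])
  for (col in setdiff(names(y), key)) {
    x[[col]][idx] <- y[[col]]
  }
  rownames(x) <- NULL
  x
}

step1 <- upsert(prim, secnd)
final <- upsert(step1, third)
final
```

Applying the helper once per data frame mirrors the two-step join described in the question.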
Not very pretty, but it seems to do the job:
prim %>%
anti_join(secnd, by = "t") %>%
full_join(secnd, by = c("t", "a")) %>%
select(-b) %>%
left_join(prim %>%
anti_join(third, by = "t") %>%
full_join(third, by = c("t", "b")) %>%
select(-a))
gives
t a b
1 2007 1 7
2 2008 2 8
3 2009 3 9
4 2010 4 10
5 2011 5 11
6 2012 5 11
7 2013 7 13

back fill NA values in panel data set

I want to know how I can backfill NA values in a panel data set.
Data set:
date firms return
1999 A NA
2000 A 5
2001 A NA
1999 B 9
2000 B NA
2001 B 10
Expected outcome:
date firms return
1999 A 5
2000 A 5
2001 A NA
1999 B 9
2000 B 10
2001 B 10
I use this code to fill NA values with the previous value in a panel data set:
library(dplyr)
library(tidyr)
df1<-df %>% group_by(firms) %>% fill(return)
Is there an equally easy way to fill NA values with the next value in a panel data set?
You almost had it.
df <- df %>% group_by(firms) %>% fill(return, .direction="up")
df
# A tibble: 6 x 3
# Groups: firms [2]
date firms return
<int> <fct> <int>
1 1999 A 5
2 2000 A 5
3 2001 A NA
4 1999 B 9
5 2000 B 10
6 2001 B 10
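To make clear what filling "up" does within each firm, here is a base R sketch of the same operation. The function name `fill_up` is made up for this illustration, and the data frame is rebuilt from the table in the question with plain numeric and character columns.

```r
df <- data.frame(date = rep(1999:2001, times = 2),
                 firms = rep(c("A", "B"), each = 3),
                 return = c(NA, 5, NA, 9, NA, 10),
                 stringsAsFactors = FALSE)

# Walk from the second-to-last element back to the first,
# copying the following value into any NA position
fill_up <- function(x) {
  for (i in rev(seq_along(x)[-length(x)])) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

# ave() applies fill_up separately within each firm
df$return <- ave(df$return, df$firms, FUN = fill_up)
df
```

A trailing NA (firm A in 2001) stays NA because there is no later value to copy from.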

Combine data in many rows into a column

I have data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 in one column, 2012 in the next column and so on.
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?
One option is unstack if the number of rows per 'year' is the same
unstack(df1, Male ~ year)
One option is to use functions from dplyr and tidyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(year) %>%
mutate(ID = 1:n()) %>%
spread(year, Male) %>%
select(-ID)
1. If every year has the same number of observations, you could split the data and cbind it using base R:
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2. If the years do not all have the same number of observations, you could use rbind.fill.matrix from plyr:
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3. Yet another way is to use dcast to convert the data from long to wide format:
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
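The unequal-length case can also be handled in base R alone by padding each year's values with NA before binding them. This is a sketch under the same setup as above (one extra observation for 2015); `groups`, `n_max` and `wide` are illustrative names.

```r
df <- data.frame(year = c(rep(2011:2013, each = 3), 2015),
                 Male = c(8, 1, 4, 3, 12, 9, 4, 3, 3, 5))

# One vector of values per year
groups <- split(df$Male, df$year)

# Indexing past the end of a vector yields NA, which pads the short groups
n_max <- max(lengths(groups))
wide <- do.call(cbind, lapply(groups, function(v) v[seq_len(n_max)]))
wide
```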

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data) in which the Ctry column holds the country names. If the number of NAs in a column (for example, Carx) is larger than 3 for a country, I want to drop that country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B in my data frame. I have a data frame like this (This is for illustration, my data frame is actually very huge):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs in each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT :
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
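When several columns need checking, the per-country NA counts can also be computed in one step with rowsum(). The sketch below assumes a second column `Barx` whose values are invented purely for illustration; `cols`, `na_counts`, `ok` and `kept` are likewise illustrative names.

```r
DF <- data.frame(
  Ctry = rep(c("A", "B", "C"), each = 6),
  year = rep(2000:2005, times = 3),
  Carx = c(23, 18, 20, NA, 24, 18,
           NA, NA, NA, NA, 18, 16,
           NA, NA, 24, 21, NA, 24),
  # Hypothetical second column, made up for this example
  Barx = c(1, 2, 3, 4, 5, 6,
           1, 2, 3, 4, 5, 6,
           NA, NA, NA, NA, 5, 6),
  stringsAsFactors = FALSE
)

cols <- c("Carx", "Barx")

# NA counts per country (rows) and per checked column (columns)
na_counts <- rowsum(is.na(DF[cols]) * 1L, group = DF$Ctry)

# Keep a country only if no checked column has more than 3 NAs
ok <- rowSums(na_counts > 3) == 0
kept <- DF[DF$Ctry %in% names(ok)[ok], ]
kept
```

Here B is dropped because of Carx and C because of the invented Barx values, so only A remains.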
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
library(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)

Resources