Normalize data using look-up table R - r

I have a data frame that looks like this:
COMM YEAR PRODUCE
Apple 2001 3
Mango 2001 5
Apple 2002 7
Mango 2002 2
Apple 2003 1
Mango 2003 9
I also have a yearly production dataframe:
Year Total.Produce
2001 10
2002 13
2003 14
I want to add a new column to the first data-frame that contains the normalized production of each item per year:
COMM YEAR PRODUCE Normalized.Produce
Apple 2001 3 3/10
Mango 2001 5 5/10
Apple 2002 7 7/13
Mango 2002 2 2/13
Apple 2003 9 9/14
Mango 2003 2 2/14
What is the most effecient R way of doing this?
My tables contains about 100,000 entries.

Just use match to get the matching rows:
R> match(dd1$YEAR, dd2$Year)
[1] 1 1 2 2 3 3
Then just use standard vectorised commands:
dd1$normalise = dd1$PRODUCE/dd2$Total.Produce[match(dd1$YEAR, dd2$Year)]

Can also use the merge function.
d1 <- read.table(text=
"COMM YEAR PRODUCE
Apple 2001 3
Mango 2001 5
Apple 2002 7
Mango 2002 2
Apple 2003 1
Mango 2003 9", head=T, as.is=T)
d2 <- read.table(text="Year Total.Produce
2001 10
2002 13
2003 14", head=T, as.is=T)
d3 <- merge(d1, d2, by.x="YEAR", by.y="Year")
d3$Normalized.Produce <- d3$PRODUCE/d3$Total.Produce
# YEAR COMM PRODUCE Total.Produce Normalized.Produce
# 1 2001 Apple 3 10 0.30000000
# 2 2001 Mango 5 10 0.50000000
# 3 2002 Apple 7 13 0.53846154
# 4 2002 Mango 2 13 0.15384615
# 5 2003 Apple 1 14 0.07142857
# 6 2003 Mango 9 14 0.64285714

Related

`str_replace_all` numeric values in column according to named vector

I want to use a named vector to map numeric values of a data frame column.
consider the following example:
df <- data.frame(year = seq(2000,2004,1), value = sample(11:15, r = T)) %>%
add_row(year=2005, value=1)
df
# year value
# 1 2000 12
# 2 2001 15
# 3 2002 11
# 4 2003 12
# 5 2004 14
# 6 2005 1
I now want to replace according to a vector, like this one
repl_vec <- c("1"="apple", "11"="radish", "12"="tomato", "13"="cucumber", "14"="eggplant", "15"="carrot")
which I do with this
df %>% mutate(val_alph = str_replace_all(value, repl_vec))
However, this gives:
# year value val_alph
# 1 2000 11 appleapple
# 2 2001 13 apple3
# 3 2002 15 apple5
# 4 2003 12 apple2
# 5 2004 14 apple4
# 6 2005 1 apple
since str_replace_all uses the first match and not the whole match. In the real data, the names of the named vector are also numbers (one- and two-digits).
I expect the output to be like this:
# year value val_alph
# 1 2000 11 radish
# 2 2001 13 cucumber
# 3 2002 15 carrot
# 4 2003 12 tomato
# 5 2004 14 eggplant
# 6 2005 1 apple
Does someone have a clever way of achieving this?
I would use base R's match instead of string matching here, since you are looking for exact whole string matches.
df %>%
mutate(value = repl_vec[match(value, names(repl_vec))])
#> year value
#> 1 2000 radish
#> 2 2001 carrot
#> 3 2002 carrot
#> 4 2003 cucumber
#> 5 2004 eggplant
#> 6 2005 apple
Created on 2022-04-20 by the reprex package (v2.0.1)
Is this what you want to do?
set.seed(1234)
df <- data.frame(year = seq(2000,2004,1), value = sample(11:15, r = T)) %>%
add_row(year=2005, value=1)
repl_vec <- c("1"="one", "11"="eleven", "12"="twelve", "13"="thirteen", "14"="fourteen", "15"="fifteen")
names(repl_vec) <- paste0("\\b", names(repl_vec), "\\b")
df %>%
mutate(val_alph = str_replace_all(value, repl_vec, names(repl_vec)))
which gives:
year value val_alph
1 2000 14 fourteen
2 2001 12 twelve
3 2002 15 fifteen
4 2003 14 fourteen
5 2004 11 eleven
6 2005 1 one

How can I drop observations within a group following the occurrence of NA?

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of a variable "assets", but I have some NAs. However, I cannot simply delete the NA observations, but need to delete all subsequent observations following the NA event.
Here an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to carry out functions by group using plyr and I have also been able to create a column with 0-1, where 0 indicates that assets has a valid entry and 1 highlights missing values of NA.
mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option
mydf[unsplit(lapply(split(mydf, mydf$productreference),
function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table
library(data.table)
setDT(mydf)[, if(any(is.na(assets))) .SD[seq(which(is.na(assets))[1]-1)]
else .SD, by = productreference]
You can do it using base R and a for loop. This code is a bit longer than some of the code in the other answers. In the loop we subset mydf by productreference and for every subset we look for the first occurrence of assets==NA, and exclude that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)){
s1 <- mydf[mydf$productreference==i,]
s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets)==T))-1),]
mydf2 <- rbind(mydf2, s2)
mydf2 <- mydf2[!is.na(mydf2$assets),]
}
mydf2

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data): Ctry column indicates the name of countries in my data frame. In any column (for example: Carx) if number of NAs is larger 3; I want to drop the related country in my data fame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B in my data frame. I have a data frame like this (This is for illustration, my data frame is actually very huge):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use by() function to group by Ctry and count NA's of each group :
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT :
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
Since you mention that you data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
require(dplyr)
newdf <- df %.% group_by(Ctry) %.% filter(sum(is.na(Carx)) <= 3)

repeat rows in a dataset based on a column, but increment the rows [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 5 years ago.
I have a dataset which has project name, start year and contract term. I need to develop this dataset into time series. For example, one row in my dataset is: Project A, start year 2003 and contract term 5. I would like to repeat each row based on contract term. My dataset looks like this:
Project Name Start Year Contract Term
A 2003 5
B 2013 3
C 2000 2
My desired result should look like this:
Project Name Start Year Contract Term
A 2003 5
A 2004 5
A 2005 5
A 2006 5
A 2007 5
B 2013 3
B 2014 3
B 2014 3
C 2000 2
C 2001 2
I have tried:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
But this only repeats each project by the number in contract term. I can not make it to increment the years.
Thanks in advance!
Here it is in two steps:
Step 1, you know:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2003 5
# 1.1 A 2003 5
# 1.2 A 2003 5
# 1.3 A 2003 5
# 1.4 A 2003 5
# 2 B 2013 3
# 2.1 B 2013 3
# 2.2 B 2013 3
# 3 C 2000 2
# 3.1 C 2000 2
Step 2 makes use of sequence and basic addition:
sequence(rpsInput$Contract.Term) ## This will be helpful...
# [1] 1 2 3 4 5 1 2 3 1 2
rpsData$Start.Year <- rpsData$Start.Year + sequence(rpsInput$Contract.Term)
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2004 5
# 1.1 A 2005 5
# 1.2 A 2006 5
# 1.3 A 2007 5
# 1.4 A 2008 5
# 2 B 2014 3
# 2.1 B 2015 3
# 2.2 B 2016 3
# 3 C 2001 2
# 3.1 C 2002 2
Just to piggy back on Ananda's answer, change
sequence(rpsInput$Contract.Term)
to
(sequence(rpsInput$Contract.Term)-1)
to get the output you desire.
ProjectName<-c("A","B","C")
Start.Year<-c(2003,2013,2000)
Contract.Term<-c(5,3,2)
rpsInput<-data.frame(ProjectName,Start.Year,Contract.Term)
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData$Start.Year <- rpsData$Start.Year + (sequence(rpsInput$Contract.Term)-1)
rpsData
# ProjectName Start.Year Contract.Term
#1 A 2003 5
#1.1 A 2004 5
#1.2 A 2005 5
#1.3 A 2006 5
#1.4 A 2007 5
#2 B 2013 3
#2.1 B 2014 3
#2.2 B 2015 3
#3 C 2000 2
#3.1 C 2001 2

Merge 2 data frame based on 2 columns with different column names

I have 2 very large data sets that looks like below:
merge_data <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
position=c("yes","no","yes","no","yes",
"no","yes","no","yes","yes"),
school = c("a","b","a","a","c","b","c","d","d","e"),
year1 = c(2000,2000,2000,2001,2001,2000,
2003,2005,2008,2009),
year2=year1-1)
merge_data
ID position school year1 year2
1 1 support a 2000 1999
2 2 oppose b 2000 1999
3 3 support a 2000 1999
4 4 oppose a 2001 2000
5 5 support c 2001 2000
6 6 oppose b 2000 1999
7 7 support c 2003 2002
8 8 oppose d 2005 2004
9 9 support d 2008 2007
10 10 support e 2009 2008
merge_data_2 <- data.frame(year=c(1999,1999,2000,2000,2000,2001,2003
,2012,2009,2009,2008,2002,2009,2005,
2001,2000,2002,2000,2008,2005),
amount=c(100,200,300,400,500,600,700,800,900,
1000,1100,1200,1300,1400,1500,1600,
1700,1800,1900,2000),
ID=c(1,1,2,2,2,3,3,3,5,6,8,9,10,13,15,17,19,20,21,7))
merge_data_2
year amount ID
1 1999 100 1
2 1999 200 1
3 2000 300 2
4 2000 400 2
5 2000 500 2
6 2001 600 3
7 2003 700 3
8 2012 800 3
9 2009 900 5
10 2009 1000 6
11 2008 1100 8
12 2002 1200 9
13 2009 1300 10
14 2005 1400 13
15 2001 1500 15
16 2000 1600 17
17 2002 1700 19
18 2000 1800 20
19 2008 1900 21
20 2005 2000 7
And what I want is:
ID position school year1 year2 amount
1 yes a 2000 1999 300
2 no b 2000 1999 1200
10 yes e 2009 2008 1300
for ID=1 in the merge_data_2, we have amount =300, since there are 2 cases where ID=1,and their year1 or year1 is equal to the year of ID=1 in merge_data
So basically what I want is to perform a merge based on the ID and year.
2 conditions:
ID from merge_data matches the ID from merge_data_2
one of the year1 and year2 from merge_data also matches the year from merge_data_2.
then make the merge based on the sum of the amount for each IDs.
and I think the code will be something looks like:
merge_data_final <- merge(merge_data, merge_data_2,
merge_data$ID == merge_data_2$ID && (merge_data$year1 ||
merge_data$year2 == merge_data_2$year))
Then somehow to aggregate the amount by ID.
Obviously I know the code is wrong, and I have been thinking about plyr or reshape library, but was having difficulties of getting my hands on them.
Any helps would be great! thanks guys!
As noted above, I think you have some discrepancies between your example input and output data. Here's the basic approach - you were on the right track with reshape2. You can simply melt() your data into long format so you are joining on a single column instead of the either/or bit you had going on before.
library(reshape2)
#melt into long format
merge_data_m <- melt(merge_data, measure.vars = c("year1", "year2"))
#merge together, specifying the joining columns
merge(merge_data_m, merge_data_2, by.x = c("ID", "value"), by.y = c("ID", "year"))
#-----
ID value position school variable amount
1 1 1999 yes a year2 100
2 1 1999 yes a year2 200
3 2 2000 no b year1 500
4 2 2000 no b year1 300
5 2 2000 no b year1 400

Resources