converting continuous number into a binary value - r

I have a dataset with a column called BirthYear containing the years in which people were born. I need to create a new column that contains "young" if BirthYear is > 1993 and "old" if BirthYear is < 1993. I've tried using the if function but I can't seem to get it to work. I would appreciate it if you could let me know how to do it, thanks!

I also like cut() for this, especially if you want the result to be a factor.
year <- sample(1989:1999, size=20, replace=TRUE) # Arbitrary vector of years
breaks <- c(-Inf, 1993, Inf) # The 3 bounds of the 2 intervals
labels <- c("old", "young") # The 2 labels of the 2 intervals
# right=FALSE makes the intervals [lower, upper), so 1993 itself is labelled "young"
binary <- cut(x=year, breaks=breaks, labels=labels, right=FALSE)
# Inspect
data.frame(year, binary)
The result:
year binary
1 1993 young
2 1997 young
3 1989 old
4 1998 young
5 1999 young
6 1989 old
7 1994 young
8 1991 old
9 1991 old
10 1991 old
...
This is close to a duplicate, but involves custom labels.
If you have to inspect more than one variable eventually, look at dplyr::case_when().
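For the plain two-way split in the question, a vectorised ifelse() may be all that is needed; if() evaluates a single condition rather than one per row, which is probably why the original attempt did not work. A minimal sketch, assuming a hypothetical data frame called people (only the BirthYear column name comes from the question; AgeGroup is a made-up name, and 1993 itself is arbitrarily assigned to "old" because the question leaves that case unspecified):

```r
# Hypothetical example data; only the BirthYear column name comes from the question
people <- data.frame(BirthYear = c(1989, 1993, 1994, 1999))

# ifelse() tests every element of the column, unlike if(), which expects one TRUE/FALSE
people$AgeGroup <- ifelse(people$BirthYear > 1993, "young", "old")

print(people)
```

If a factor is preferred, wrapping the result in factor() should work, at which point cut() as shown above does the same job in one step.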

Another option is to use dplyr::recode_factor, as below:
library(dplyr)
set.seed(1)
year <- sample(1970:2005, size=10, replace=TRUE)
year
#[1] 2001 1975 1979 1994 1974 1973 1985 1994 1975 1981
# year > 1993 means born later, i.e. "young"
recode_factor(as.factor(year > 1993), 'TRUE' = "young", 'FALSE' = "old")
#[1] young old   old   young old   old   old   young old   old
#Levels: young old

Related

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row per country and year with UN_$sector = Residual. The value of the residual will be (UN for sector = Total) - (the sum of column UN for the sectors c("1", "2", "3", "4", "5")) for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector != 'Total'), sum),
                    aggregate(UN ~ country + year, data = subset(df, sector == 'Total'), sum),
                    by = c("country", "year")),
              {UN <- UN.y - UN.x
               sector <- 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector != 'Total'),
                  agg[c("country", "year", "sector", "UN")],
                  subset(df, sector == 'Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
Rextester demo
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
I think there are multiple ways you can do this. What I would recommend is to take advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, we can talk about the power of dplyr's group_by(...), summarise(...), arrange(...) and bind_rows(...) functions. There are also plenty of good tutorials, cheat sheets, and documentation on all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values, then bring it back into your original data frame.
Step 1: Calculating all residual values
For each country and year, we want the Total value minus the sum of the UN values of the individual sectors. We can achieve this with
library(dplyr)
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total']) - sum(UN[sector != 'Total'], na.rm = TRUE))
Step 2: Add sector column to res_UN with value 'Residual'
This yields a data frame which contains country, year, and UN. We now need to add a column sector with the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN now has the country, year, sector and UN columns of UN_, so the two can be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a couple of lines!
TLDR:
library(dplyr)
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total']) - sum(UN[sector != 'Total'], na.rm = TRUE))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
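As a sanity check on the residual definition from the question (Total minus the sum of sectors 1 to 5, per country and year), here is a small base R sketch. It uses only the AT rows quoted in the question; the helper column signed and the result name res are made up for illustration. The trick, which is just one possible approach, is to count Total positively and every other sector negatively, so that a single grouped sum yields the residual:

```r
# AT rows copied from the question's sample data
UN_ <- data.frame(
  country = "AT",
  year    = rep(c(1990, 1991), each = 6),
  sector  = rep(c("1", "2", "3", "4", "5", "Total"), times = 2),
  UN      = c(1.407555, 1.037137, 4.769618, 2.455139, 2.238618, 7.869005,
              1.484667, 1.001578, 4.625927, 2.515453, 2.702081, 8.249567),
  stringsAsFactors = FALSE
)

# Total counts positively, every individual sector negatively
UN_$signed <- ifelse(UN_$sector == "Total", UN_$UN, -UN_$UN)

# One grouped sum per country/year gives Total - sum(sectors)
res <- aggregate(signed ~ country + year, data = UN_, FUN = sum)
names(res)[names(res) == "signed"] <- "UN"
res$sector <- "Residual"

print(res)
```

The two residuals should agree with the values in the base R answer's output for AT 1990 and AT 1991.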

Adding data points in a column by factors in R

The data.frame my_data consists of two columns ("PM2.5" & "years") and around 6,400,000 rows. It has various data points for pollutant levels of PM2.5 for the years 1999, 2002, 2005 & 2008.
This is what I have done to the data.frame:
{
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
}
I want to find the sum of all PM2.5 levels (i.e. the sum of all data points under PM2.5) for each year. How can I do it?
The image shows the first 20 rows of the data.frame.
Since the column "years" is sorted, it shows only 1999.
Say this is your data:
library(plyr) # <- don't forget to tell us what libraries you are using
# give us an easy sample set
my_data <- data.frame(year=sample(c("1999","2002","2005","2008"), 10, replace=T), PM2.5 = rnorm(10,mean = 5))
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
> my_data
year PM2.5
1 1999 5.556852
2 2002 5.508820
3 2002 4.836500
4 2002 3.766266
5 2005 6.688936
6 2005 5.025600
7 2005 4.041670
8 2005 4.614784
9 2005 4.352046
10 2008 6.378134
One way to do it (out of many, many ways already shown by a simple google search):
> with(my_data, (aggregate(PM2.5, by=list(year), FUN="sum")))
Group.1 x
1 1999 5.556852
2 2002 14.111586
3 2005 24.723037
4 2008 6.378134
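Two other base R spellings of the same grouped sum, sketched on a small made-up data set (the values below are invented so the result is easy to check by hand; only the column names follow the question):

```r
# Made-up values; column names follow the question
my_data <- data.frame(
  year  = factor(c("1999", "2002", "2002", "2005")),
  PM2.5 = c(1.5, 2.0, 3.0, 4.5)
)

# Formula interface: keeps the original column names in the result
totals <- aggregate(PM2.5 ~ year, data = my_data, FUN = sum)

# tapply() returns a named vector (one element per year) instead of a data frame
totals_vec <- tapply(my_data$PM2.5, my_data$year, sum)

print(totals)
print(totals_vec)
```

The formula version is arguably easier to read than by=list(year) because the output columns are called year and PM2.5 rather than Group.1 and x.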

Translating Stata code into R

General newbie when it comes to time series data analysis in R. I am having trouble translating a bit of Stata code into R code for a replication project I am doing.
The intent of the Stata code and the Stata code (from the original analysis) are the following:
#### Delete extra yearc observations with different wartypes #####
drop if yearc==yearc[_n+1] & wartype!="CIVIL"
drop if yearc==yearc[_n-1] & wartype!="CIVIL"
So, translated, I keep the rows in which the country is having a civil war and delete the rows in which there is an interstate war during the same years.
I have named the data object (i.e., the data set)
mywar
in R.
I am assuming I somehow do a conditional ifelse statement, or something similar, such as:
invisible(mywar$yearc <- ifelse(mywar$yearc==n-1 | mywar$yearc==n+1 | mywar$wartype!=civil, NA,
mywar$yearc)) # I am assuming I cannot condition ifelse statements like this; but, this is how I imagine it
mywar <- mywar[!is.na(mywar$yearc),]
EDIT:
So perhaps an example
> b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
> c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
> df <- data.frame(b,c)
> df$j <- ifelse(df$b==n-1 & df$b==n+1 & df$c!="civil", NA, df$b)
> df
b c j
1 1970 inter 1970
2 1970 civil 1970
3 1970 intra 1970
4 1971 civil 1971
5 1982 civil 1982
6 1999 inter 1999
7 1999 civil 1999
8 2000 civil 2000
9 2001 civil 2001
10 2002 civil 2002
So, what I was trying to do was create NAs for rows 1,3,and 6 as they are duplicate years in my logistic regression on the onset of civil war (I am not interested in inter and intra wars, however defined) so that I can delete these rows from my data set. Here, I just recreated row b. (Note, what is missing from this made up data are the country ids. But assume that these ten entries represent the same country (for instance, Somalia)). So, I am interested in how to delete these type of rows in a data set with 28,000 rows.
dplyr is also a good way — you just need to "keep" instead of "drop"
library(dplyr)
filter(df, (yearc != lead(yearc, 1) & yearc != lag(yearc, 1)) | wartype == "CIVIL")
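One caveat: lead() and lag() return NA at the last and first row by default, and filter() drops rows whose condition evaluates to NA, so a non-civil row at either end of the data could be removed even when its year is not duplicated. Below is a base R sketch of the same keep-rule that treats a missing neighbour as "not a duplicate". It uses the sample data from the question with the yearc/wartype column names; the helper names are made up, and the two Stata drops are applied simultaneously here rather than one after the other, which is an assumption about what is wanted:

```r
# Sample data from the question, renamed to the yearc / wartype columns
mywar <- data.frame(
  yearc   = c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002),
  wartype = c("inter", "civil", "intra", "civil", "civil",
              "inter", "civil", "civil", "civil", "civil"),
  stringsAsFactors = FALSE
)

n <- nrow(mywar)
next_year <- c(mywar$yearc[-1], NA)   # year of the following row
prev_year <- c(NA, mywar$yearc[-n])   # year of the preceding row

# A missing neighbour (first/last row) never counts as a duplicate
same_as_next <- !is.na(next_year) & mywar$yearc == next_year
same_as_prev <- !is.na(prev_year) & mywar$yearc == prev_year

# Keep civil wars, plus any row whose year is not shared with a neighbour
keep <- mywar$wartype == "civil" | !(same_as_next | same_as_prev)
mywar_kept <- mywar[keep, ]

print(mywar_kept)
```

Note that the sample data spell the type as "civil" while the Stata code compares against "CIVIL"; the comparison string has to match whatever the real data contain.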
You're focusing on Stata's if qualifier, but it sounds like you simply want to subset the data frame--hence your use of the drop command in Stata. I also learned Stata before R and was confused since I relied so heavily on the if qualifier in Stata and immediately pursued ifelse in R. But, I later realized that the more relevant technique in R revolved around subsetting. There is a subset() command, but most people prefer subsetting by using brackets (see code below).
In your original question you ask how to do two things:
how to delete observations (i.e. rows) that are coded "inter" or "intra" on column C, and
how to mark them as missing
Sample Data
b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
df <- data.frame(b,c)
df
b c
1 1970 inter
2 1970 civil
3 1970 intra
4 1971 civil
5 1982 civil
6 1999 inter
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
1. Dropping Observations
If you want to delete observations that are not "civil" in column C, you can subset the data frame to only keep those cases that are "civil":
df2 <- df[df$c=="civil",]
df2
b c
2 1970 civil
4 1971 civil
5 1982 civil
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
The above code creates a new data frame, df2, that is a subset of df, but you can also completely overwrite the original data frame:
df <- df[df$c=="civil",]
Or, you can generate a new one and then remove the old one, if you don't like your workspace cluttered with lots of data frames:
df2 <- df[df$c=="civil",]
rm(df)
2. Marking Observations as Missing
If you want to mark observations that are not "civil" in column C, you can do that by overwriting them as NA:
df$c[df$c != "civil"] <- NA
df
b c
1 1970 <NA>
2 1970 civil
3 1970 <NA>
4 1971 civil
5 1982 civil
6 1999 <NA>
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
You could then use listwise deletion (see the na.omit() command) to remove the cases from whatever analyses you're doing.
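A short sketch of what that listwise deletion might look like, using a shortened, made-up version of the data after the non-civil rows have been set to NA (the object names df_complete and df_civil are illustrative):

```r
# Shortened example: column c already has NA where the war was not "civil"
df <- data.frame(
  b = c(1970, 1970, 1970, 1971),
  c = c(NA, "civil", NA, "civil"),
  stringsAsFactors = FALSE
)

# na.omit() drops every row that has an NA in any column
df_complete <- na.omit(df)

# Equivalent here, but restricted to NAs in column c only
df_civil <- df[!is.na(df$c), ]

print(df_complete)
```

The second form may be safer on a wide data set, where na.omit() would also discard rows that are missing values in unrelated columns.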
Side Note
Your original Stata code seeks to subset when column b is a duplicate and column c is "inter" or "intra". However, the way your sample data were presented, this seemed to be a redundant concern, which is why my solution above only looks at column c. However, if you want to match your Stata code as closely as possible, you can do that by
df <- df[order(df$b, df$c),]
df$duplicate <- duplicated(df$b)
df2 <- df[df$c=="civil" & df$duplicate==FALSE,]
which
1. orders the data chronologically by year and then alphabetically by war type,
2. creates a new variable that specifies whether column b is a duplicate year, and
3. subsets the data frame to remove undesirable cases.
Try changing your | operator to &.
Here is some made up data:
R> b <- c(rep(1:4, each=3))
R> c <- 1:length(b)
R> df <- data.frame(c,b)
R> df$j <- ifelse(df$b != 2 & df$b != 3 & df$b != 1, NA, df$b)
R> df
c b j
1 1 1 1
2 2 1 1
3 3 1 1
4 4 2 2
5 5 2 2
6 6 2 2
7 7 3 3
8 8 3 3
9 9 3 3
10 10 4 NA
11 11 4 NA
12 12 4 NA
That last line of your code mywar <- mywar[!is.na(mywar$yearc),] should work fine as well

Split and randomly reassemble a time series, but maintain leap years in R

I need to create datasets of weather data to use for modeling over the next 50 years. I am planning to do this by using historical weather data (daily, 1980-2012), but mixing up the years in a random order and then relabeling them with 2014-2054. However, I cannot be completely random, because it is important to maintain leap years. I want to have as many datasets as possible so I can get an average response of the model to different weather patterns.
Here is an example of what the historical data looks like (except there is data for every day). How could I reassemble it so the years are in a different order, but make sure years with 366 days (1980, 1984, 1988) end up in future leap years (2016, 2020, 2024, 2028, 2052)? And then do that at least 50 more times?
year day radn maxt
1980 1 5.827989 -1.59375
1980 2 5.655813 -1.828125
1980 3 6.159346 -0.96875
1981 4 6.065136 -1.84375
1981 5 5.961181 -2.34375
1981 6 5.758733 -2.0625
1981 7 6.458055 -2.90625
1982 8 6.73056 -2.890625
1982 9 6.89472 -1.796875
1983 10 6.687879 -2.140625
1984 11 6.585833 -1.609375
1984 12 6.466392 -0.71875
1984 13 7.100092 -0.515625
1985 14 7.176402 -1.734375
1985 15 7.236122 -2.5
1985 16 7.455515 -2.375
1986 17 7.395174 -1.390625
1986 18 7.341537 -2.21875
1987 19 7.678102 -2.828125
1987 20 7.539239 -2.875
1987 21 7.231031 -2.390625
1988 22 7.397067 -0.21875
1988 23 7.947912 -0.5
1989 24 8.355059 -1.03125
1990 25 8.145792 -1.5
1990 26 8.591616 -2.078125
Here is a function that scrambles the years of a passed data frame df, returning a new data frame:
scramble.years = function(df) {
  # Build convenience vectors of years
  early.leap = seq(1980, 2012, 4)
  late.leap = seq(2016, 2052, 4)
  early.nonleap = seq(1980, 2012)[!seq(1980, 2012) %in% early.leap]
  late.nonleap = seq(2014, 2054)[!seq(2014, 2054) %in% late.leap]
  # Build map from late years to early years
  map = data.frame(from=c(sample(early.leap, length(late.leap), replace=TRUE),
                          sample(early.nonleap, length(late.nonleap), replace=TRUE)),
                   to=c(late.leap, late.nonleap))
  # Build a new data frame with the correct years/days for later period
  return.list = lapply(2014:2054, function(x) {
    get.df = subset(df, year == map$from[map$to == x])
    get.df$year = x
    return(get.df)
  })
  return(do.call(rbind, return.list))
}
You can call scramble.years any number of times to get new scrambled data frames.
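To illustrate the "any number of times" part, here is a sketch that builds only the year mapping (not the weather data) 50 times and checks that leap years are always matched with leap years. The helpers is_leap and make_map are made-up names, not part of the function above, and the year ranges are the ones from the question:

```r
# Gregorian leap-year rule
is_leap <- function(y) (y %% 4 == 0 & y %% 100 != 0) | (y %% 400 == 0)

# Draw one random mapping from historical years to future years,
# sampling leap years only for leap targets and non-leap years otherwise
make_map <- function(from_years = 1980:2012, to_years = 2014:2054) {
  from_leap    <- from_years[is_leap(from_years)]
  from_nonleap <- from_years[!is_leap(from_years)]
  to_is_leap   <- is_leap(to_years)

  from <- integer(length(to_years))
  from[to_is_leap]  <- sample(from_leap, sum(to_is_leap), replace = TRUE)
  from[!to_is_leap] <- sample(from_nonleap, sum(!to_is_leap), replace = TRUE)

  data.frame(from = from, to = to_years)
}

set.seed(42)  # only to make the sketch reproducible
maps <- replicate(50, make_map(), simplify = FALSE)

# Every mapping should pair leap years with leap years
all_ok <- all(vapply(maps, function(m) all(is_leap(m$from) == is_leap(m$to)), logical(1)))
print(all_ok)
```

The same replicate(..., simplify = FALSE) pattern should work with scramble.years(df) itself to obtain a list of 50 scrambled data frames.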

Grouping and conditions without loop (big data)

I have several observations of the same groups, and for each observation I have a year.
dat = data.frame(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995))
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
For each observation, i would like to know if another observation of the same group can be found with given conditions relative to the focal observation. e.g. : "Is there any other observation (than the focal one) that has been done during the last 6 years (starting from the focal year) in the same group".
Ideally the dataframe should be like that
group year six_years
1 a 2000 1 # there is another member of group a that is year = 1996 (2000-6 = 1994, this value is inside the threshold)
2 a 1996 0
3 a 1975 0
4 b 2002 0
5 b 2010 0
6 b 1980 0
7 c 1990 1
8 c 1986 0
9 c 1995 1
Basically for each row we should look into the subset of groups, and see if any(dat$year == conditions). It is very easy to do with a for loop, but it's of no use here : the dataframe is massive (several millions of row) and a loop would take forever.
I am searching for an efficient way with vectorized functions or a fast package.
Thanks !
EDITED
Actually, thinking about it, you will probably have a lot of recurring year/group combinations, in which case it is much quicker to pre-calculate the frequencies using count(), which is also a plyr function:
9M rows took ~4sec
require(plyr)
dat <- data.frame(group = sample(c("a","b","c"),size=9000000,replace=TRUE),
year = sample(c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),size=9000000,replace=TRUE))
test<-function(y,g,df){
d<-df[df$year>=y-6 &
df$year<y &
df$group== g,]
return(nrow(d))
}
rollup<-function(){
summ<-count(dat) # add a frequency to each combination
return(ddply(summ,.(group,year),transform,t=test(as.numeric(year),group,summ)*freq))
}
system.time(rollup())
user system elapsed
3.44 0.42 3.90
My dataset had too many different groups, and the plyr option proposed by Troy was too slow.
I found a hack (experts would probably say "an ugly one") with the data.table package: the idea is to merge the data.table with itself using its fast merge function. This gives every possible combination between a given year of a group and all other years from the same group.
Then apply an ifelse to every row with the condition you're looking for.
Finally, aggregate everything with a sum to know how many times each given year can be found in a given timespan relative to another year.
On my computer, it took a few milliseconds, instead of the hours that plyr would probably have taken.
library(data.table)
dat = data.table(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995), key = "group")
Produces this :
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
Then :
z = merge(dat, dat, by = "group", all = T, allow.cartesian = T) # super fast
z$sixyears = ifelse(z$year.y >= z$year.x - 6 & z$year.y < z$year.x, 1, 0) # creates a 0/1 column for our condition
z$sixyears = as.numeric(z$sixyears) # we want to sum this up after
z$year.y = NULL # useless column now
z2 = z[ , list(sixyears = sum(sixyears)), by = list(group, year.x)]
Years with another year of the same group in the last six years are given a "1":
group year x
1 a 1975 0
2 b 1980 0
3 c 1986 0
4 c 1990 1 # e.g. here there is another "c" which was in the timespan 1990 -6 ..
5 c 1995 1 # <== this one. This one too has another reference in the last 6 years, two rows above.
6 a 1996 0
7 a 2000 1
8 b 2002 0
9 b 2010 0
Icing on the cake : it deals with NA seamlessly.
Here's another possibility also using data.table but including diff().
library(data.table)
dat <- data.table(group = rep(c("a","b","c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),
                  key = "group")
# diff(year) is computed within each group; the table is called dat, not dt
valid_case <- subset(dat[,list(valid_case = diff(year)), by=key(dat)],
                     abs(valid_case)<6)
dat$valid_case <- ifelse(dat$group %in% valid_case$group, 1, 0)
I am not sure how this compares in terms of speed or NA handling (I think it should be fine with NAs since they propagate in diff() and abs()), but I certainly find it more readable. Joins are really fast in data.table, but I'd have to think that avoiding them altogether helps. There's probably a more idiomatic way to do the condition in the ifelse statement using data.table joins. That could potentially speed things up, although in my experience %in% has never been the limiting factor.
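Note that the diff() version flags every row of a group as soon as any two of its years are close together, whereas the question asks for a per-row answer. A per-row sketch in base R, under the assumptions that year has no NAs and that "within the last 6 years" means an earlier year no more than 6 years back (the function name flag_recent and its window argument are made up for illustration): sort the distinct group/year pairs, compare each with its predecessor in the same group, and map the result back to the original rows.

```r
flag_recent <- function(group, year, window = 6) {
  group <- as.character(group)

  # One row per distinct group/year pair, sorted so neighbours are adjacent
  key <- unique(data.frame(group = group, year = year, stringsAsFactors = FALSE))
  key <- key[order(key$group, key$year), ]
  n <- nrow(key)

  # Closest earlier distinct year, if it belongs to the same group
  prev_group <- c(NA, key$group[-n])
  prev_year  <- c(NA, key$year[-n])
  same_group <- !is.na(prev_group) & prev_group == key$group

  key$flag <- as.integer(same_group & (key$year - prev_year) <= window)

  # Map the per-pair flag back onto the original row order
  key$flag[match(paste(group, year, sep = "_"),
                 paste(key$group, key$year, sep = "_"))]
}

# Data from the question
dat <- data.frame(group = rep(c("a", "b", "c"), each = 3),
                  year  = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995))
dat$six_years <- flag_recent(dat$group, dat$year)

print(dat)
```

Because the work is a sort plus a few vectorised comparisons, this should scale to millions of rows without the cartesian self-merge, though it has not been benchmarked here.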
