Finding the longest continuous series of events - r

I have time-series data for many different countries, covering more than 50 years overall. Now I want to find the longest continuous recording period for each country. I have tried the code below, but it does not work yet (when I checked the output, it did not return the correct number of continuous years in most cases, and I don't know why):
cntry <- as.list(unique(df$country))
df$longest.ts <- NULL
for (i in cntry) {
x <- max(diff(which(diff(df$year[df$country==i]) != 1)))
df$longest.ts[df$country==i] <- x
}
I appreciate your help,
Best
Edit: my data.frame is very big and has a lot of different variables but essentially I have got something like this:
df <- data.frame(
country = c("Bolivia","Bolivia","Bolivia","Bolivia","Bolivia","Bolivia",
"China","China","China","China","China","China","China","China"),
year = c(1923,1924,1925,1940,1945,1946,1960,1961,1962,1963,1964,1965,1981,1982)
)
And I would like to get an output with the countries and the longest time series within each country (e.g. here Bolivia: 3 yrs, China: 6 yrs).

There is a nice trick using diff to take the differences and rle to encode the differences in pairs of (value, times_repeated) (run length encoding, see ?rle).
For example, consider
vec <- c(1960L, 1961L, 1962L, 1963L, 1964L, 1965L, 1981L, 1982L)
diff(vec)
#> [1] 1 1 1 1 1 16 1
rle(diff(vec))
#> Run Length Encoding
#> lengths: int [1:3] 5 1 1
#> values : int [1:3] 1 16 1
You want to find how many times in a row there is a difference of 1: this is the maximum length (+1) corresponding to a value of 1.
Putting that in a function (note the check on variable type, because in your example year is a factor instead of a numeric):
longest_ts <- function(vec) {
if(!is.numeric(vec)) stop("Vector must be numeric!")
RLE <- rle(diff(vec))
max(RLE$lengths[RLE$values == 1]) + 1
}
Now just apply to your column (don't forget to group_by):
df <- data.frame(
country = c("Bolivia","Bolivia","Bolivia","Bolivia","Bolivia","Bolivia",
"China","China","China","China","China","China","China","China"),
year = c("1923","1924","1925","1940","1945","1946","1960","1961","1962","1963","1964","1965","1981","1982")
)
library("dplyr")
df %>%
mutate(year = as.numeric(as.character(year))) %>% # fix your year variable
group_by(country) %>%
summarise(longest_ts = longest_ts(year))
Result:
country longest_ts
<fct> <dbl>
1 Bolivia 3
2 China 6
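One caveat with longest_ts as written: if a country has no two consecutive years at all, RLE$lengths[RLE$values == 1] is empty and max() returns -Inf with a warning. A minimal base R sketch that avoids this is shown below; longest_run is a hypothetical name, and the sketch assumes that a single isolated year should count as a run of length 1.

```r
# Hypothetical helper: length of the longest run of consecutive years.
longest_run <- function(years) {
  years <- sort(unique(years))          # tolerate unsorted or duplicated input
  if (length(years) == 0) return(0L)
  # a new run starts at the first year and wherever the gap is not exactly 1
  run_id <- cumsum(c(TRUE, diff(years) != 1))
  max(table(run_id))                    # size of the largest run
}

df <- data.frame(
  country = c(rep("Bolivia", 6), rep("China", 8)),
  year = c(1923, 1924, 1925, 1940, 1945, 1946,
           1960, 1961, 1962, 1963, 1964, 1965, 1981, 1982)
)

# one value per country
res <- tapply(df$year, df$country, longest_run)
res
```

For the example data this should give 3 for Bolivia and 6 for China, the same as the dplyr version above.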

Related

A question about looping and creating/joining tables

My code is meant to order a table called Football (imported csv2) and then, using a for loop, go through the data and return the row number of the start year and end year.
Football[order(Football$Year),]
start_year <- min(Football$Year)
end_year <- max(Football$Year)
for (i in 1:nrow(Football)
{
if (Football$Year[i] = start_year)
{
row_of_start <- i
}
if (Football$Year[i] = end_year)
{
row_of_end <- i
}
}
This produces the following error:
> if (Football$Year[1] = start_year) row_of_start <- 1
Error: unexpected '=' in "if (Football$Year[1] ="
I appreciate there are probably ways of doing this without a for loop (which I would be very appreciative to know) although I would also like to know how to make the for loop work (to further my understanding).
You can skip the loop entirely using which(). This will usually be faster and more legible:
# Create example data
set.seed(123)
Football <- data.frame(Year = sample(1990:2000, size = 10),
foo = sample(letters, size = 10))
# Sort the data as you have done
Football_sort <- Football[order(Football$Year), ]
# Get the row numbers of the min and max (start and end years)
which(with(Football_sort, Year == min(Year)))
#> [1] 1
which(with(Football_sort, Year == max(Year)))
#> [1] 10
Depending upon what you actually want to do, you can skip the ordering step as well. Both of the below depend upon the dplyr package to work.
If you just want the start and end year rows rather than their row numbers:
library(dplyr)
Football %>%
filter(Year %in% c(min(Year), max(Year)))
#> Year foo
#> 1 2000 e
#> 2 1990 d
If you want the "year number" of the start and end year:
Football %>%
summarise(start_year = 1,
end_year = max(Year) - min(Year))
#> start_year end_year
#> 1 1 10
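As for making the for loop from the question work, three things appear to be going wrong: the for (...) line is missing its closing parenthesis, the comparisons use = (assignment) where == (equality test) is needed, and the sorted table is never assigned back to Football. A minimal sketch with those fixed, using a small made-up Football table in place of the imported CSV:

```r
# Made-up stand-in for the imported data
Football <- data.frame(Year = c(1995, 1990, 2000, 1993))

# order() only returns the sorted rows; assign them back to keep the sort
Football <- Football[order(Football$Year), , drop = FALSE]

start_year <- min(Football$Year)
end_year <- max(Football$Year)

for (i in seq_len(nrow(Football))) {
  # == tests for equality; a single = is an assignment and is a syntax error here
  if (Football$Year[i] == start_year) {
    row_of_start <- i
  }
  if (Football$Year[i] == end_year) {
    row_of_end <- i
  }
}

row_of_start
row_of_end
```

If the start or end year occurs more than once, this loop keeps the last matching row, whereas which() returns all of them.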

Using regular expressions in tidyr::extract

I am working with 3D motion-capture data. This means I have 3 columns (X,Y,Z) of joint coordinates for several joints in the body (e.g. the three columns describing the position of the left knee joint center are: LKX,LKY,LKZ).
My end goal is to plot at least 9 joint centers, and I believe the only way to achieve this is to transform my wide format dataframe into a long one.
As you can tell, I am trying to transform many sets of joint centers whose names end in X, Y or Z. Therefore, I am trying to use regular expressions within tidyr::extract, but I just can't get the code right.
df_wide <- data.frame(
ID = rep(1:2, each=10),
JN = rep(1:2, each=5),
Frame = rep(1:5, 4),
System = rep(1:2, 10),
RKX = rep(1:10+rnorm(10,mean=1,sd=0.5),2),
RKY = rep(1:10+rnorm(10,mean=1,sd=0.5),2),
RKZ = rep(1:10+rnorm(10,mean=1,sd=0.5), 2),
LHeX = rep(1:10-rnorm(10,mean=1,sd=0.5),2),
LHeY = rep(1:10-rnorm(10,mean=1,sd=0.5),2),
LHeZ = rep(1:10-rnorm(10,mean=1,sd=0.5),2))
head(df_wide, 2)
ID JN Frame System RKX RKY RKZ LHeX LHeY LHeZ
1 1 1 1 1 1.332827 2.068720 2.295742 -0.02336031 -0.3011227 -1.212326
2 1 1 2 2 3.570076 3.306799 3.136177 2.08828231 1.9226740 2.106496
I wish to obtain this result:
ID JN Frame System joint X Y Z
1 1 1 1 1 RK 1.440103 2.221676 1.621871
2 1 1 1 1 LHe 3.537940 3.060948 2.856955
Here is my latest (of many) attempts. It has two problems; 1) extract only produces NA; 2) spread returns "Error: Duplicate identifiers for rows" I suspect this is related to the problem with extract.
df_3D <- df_wide %>%
gather(keys, values, -ID, -JN, -Frame, -System)%>%
extract(keys, c("X", "Y", "Z", "joint"), "(X$) (Y$) (Z$) ([A-Z].$)")%>%
spread(X, values)
I have found several good questions and answers regarding the transformation, but none of them specifically target the use of regular expressions.
Your approach is a little off. Each element of the keys column once you've gathered has the structure <Joint><Coord>, so you want something like:
library(dplyr)
library(tidyr)
df_wide %>%
gather(keys, values, -ID, -JN, -Frame, -System) %>%
extract(keys, c("Joint", "Coord"), "(.*)(X|Y|Z)$") %>%
spread(Coord, values)
The regex I've used here captures anything in the first group (since I don't know all the possible joint names), then X or Y or Z as the final character in the second group. There are lots of other regexes that would achieve the same thing.
Output:
ID JN Frame System Joint X Y Z
1 1 1 1 1 LHe 0.1344259 -0.2927277 0.05375166
2 1 1 1 1 RK 1.8083539 2.4053498 2.32899399
3 1 1 2 2 LHe 1.1777492 1.1780538 0.96549849
4 1 1 2 2 RK 3.2254236 2.4100235 2.79816371
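To see what the two capture groups pick up without running the whole pipeline, the same pattern can be tried on the column names with base R's sub() and grepl(). This is only an illustration of the regex, not of how tidyr::extract works internally; the keys vector is copied from the column names in the question.

```r
keys <- c("RKX", "RKY", "RKZ", "LHeX", "LHeY", "LHeZ")
pattern <- "(.*)(X|Y|Z)$"

# group 1: everything before the final character; group 2: the trailing X, Y or Z
joint <- sub(pattern, "\\1", keys)
coord <- sub(pattern, "\\2", keys)
data.frame(keys, joint, coord)

# The pattern from the question requires literal spaces and four separate
# pieces, so it matches none of the names, which is why extract returned NA.
matches_original <- grepl("(X$) (Y$) (Z$) ([A-Z].$)", keys)
matches_original
```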
You'll need to gather your data into a super long format, then split out the dimension, then spread THAT data back out into your X, Y, and Z columns:
library(dplyr)
library(tidyr)
library(stringr)
df2 <- df_wide %>%
# leave the other columns
gather(jointid, position, -ID, -JN, -Frame, -System) %>%
# insert a separator before the trailing X/Y/Z to make it easier to split it from the joint name
mutate(jointid = str_replace(jointid, "(X|Y|Z)$", ";\\1")) %>%
# split the joint name and the dimension apart
tidyr::separate(jointid, c('joint', 'dim'), sep = ";") %>%
# spread the dimension and position apart into 3 columns
spread(dim, position)

How can I create new column in data frame by aggregating rows?

I have a large (~200k rows) dataframe that is structured like this:
df <-
data.frame(c(1,1,1,1,1), c('blue','blue','blue','blue','blue'), c('m','m','m','m','m'), c(2016,2016,2016,2016,2016),c(3,4,5,6,7), c(10,20,30,40,50))
colnames(df) <- c('id', 'color', 'size', 'year', 'week','revenue')
Let's say it is currently week 7, and I want to compare the trailing 4 week average of revenue to the current week's revenue. What I would like to do is create a new column for that average when all of the identifiers match.
df_new <-
data.frame(1, 'blue', 'm', 2016,7,50, 25 )
colnames(df_new) <- c('id', 'color', 'size', 'year', 'week','revenue', 't4ave')
How can I accomplish this efficiently? Thank you for the help
Good question. For loops are pretty inefficient, but since you do have to check the conditions of prior entries, this is the only solution I can think of (mind you, I'm also an intermediate at R):
# identifiers that must match; week is not compared because it changes from row to row
# assumes the rows are already sorted by week within each group
keys <- c("id", "color", "size", "year")
df$t4ave <- NA_real_
df$diff <- NA_real_
for (i in seq_len(nrow(df)))
{
  if (i > 4) {
    prev <- (i - 4):(i - 1)
    # condition for all entries to match up
    same <- all(sapply(keys, function(k) all(df[[k]][prev] == df[[k]][i])))
    if (same) {
      # avg of last 4 entries' revenues
      df$t4ave[i] <- mean(df$revenue[prev])
      # difference between this entry and the last 4's average
      df$diff[i] <- df$revenue[i] - df$t4ave[i]
    }
  }
}
This code will probably take forever, but it should work. If this is a one time thing for when the code needs to run, then it should be okay. Otherwise, hopefully others will be able to advise.
A solution using dplyr and zoo. The idea is to group by the variables that must match, such as id, color, size, and year. After that, use rollmean to calculate the rolling mean of revenue. Use na.pad = TRUE and align = "right" so that each window ends at the current week and incomplete windows return NA. Finally, use lag to shift the results down by one row, so that each week is paired with the average of the four weeks before it.
library(dplyr)
library(zoo)
df2 <- df %>%
group_by(id, color, size, year) %>%
mutate(t4ave = rollmean(revenue, 4, na.pad = TRUE, align = "right")) %>%
mutate(t4ave = lag(t4ave))
df2
# A tibble: 5 x 7
# Groups: id, color, size, year [1]
id color size year week revenue t4ave
<dbl> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
1 1 blue m 2016 3 10 NA
2 1 blue m 2016 4 20 NA
3 1 blue m 2016 5 30 NA
4 1 blue m 2016 6 40 NA
5 1 blue m 2016 7 50 25
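If adding zoo is not an option, the same trailing average can be sketched in base R with ave(). The helper name trailing_mean is made up for this example, and the sketch assumes the rows are sorted by week within each group and that no weeks are missing.

```r
# Hypothetical helper: mean of the k values before each position, NA if fewer exist
trailing_mean <- function(x, k = 4) {
  vapply(seq_along(x), function(i) {
    if (i > k) mean(x[(i - k):(i - 1)]) else NA_real_
  }, numeric(1))
}

df <- data.frame(
  id = c(1, 1, 1, 1, 1),
  color = "blue",
  size = "m",
  year = 2016,
  week = 3:7,
  revenue = c(10, 20, 30, 40, 50)
)

# ave() applies the function within each combination of the grouping columns
df$t4ave <- ave(df$revenue, df$id, df$color, df$size, df$year, FUN = trailing_mean)
df
```

For the example data only week 7 has four earlier weeks, so t4ave is NA everywhere except the last row, where it is 25.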

Obtaining mean of dataframe by value groups in another dataframe

Once again I consult your wisdom.
I have 2 dataframes of the form:
**data1sample**
ID value
water 3
water 5
fire 1
fire 3
fire 2
air 1
**data2controls**
ID value
water 1
fire 3
air 5
I want to use the values in my control dataframe (data2controls) and know their corresponding percentile in the sample distribution (data1sample). I have to classify each sample by their ID (meaning, get fire control against fire sample, and water against water, etc), but I haven't been able to do so.
I am using the command:
mean(data1sample[data1sample$ID == data2controls$ID,] <= data2controls$value)
but I get the error
In Ops.factor(left, right) : ‘<=’ not meaningful for factors
What I am after is basically the percentile of the value in dataframe2 calculated based on the samples of dataframe1 (I am trying to obtain the percentile as in percentile = mean(data1sample$value(by ID) <= dataframe2$value))
So something like this:
**data2controls**
ID value percentile(based on data1 sample values)
water 1 .30
fire 3 .14
air 5 .1
Please disregard the percentile values, they're just made up to show desired output.
I'd love if someone could give me a hand! Thanks!!
It's hard to answer without the desired output, but I will try to guess it here:
library(dplyr)
data1sample <- data.frame(ID = c("water", "water", "fire", "fire", "fire", "air"), value = c(3,5,1,3,2,1))
data2sample <- data.frame(ID = c("water", "fire", "air"), value = c(1,3,5))
by_ID <- data1sample %>% group_by(ID) %>% summarise(control = mean(value))
data2sample %>% inner_join(by_ID)
#> Joining, by = "ID"
#> ID value control
#> 1 water 1 4
#> 2 fire 3 2
#> 3 air 5 1
This gives the result I think you're after (d1 and d2 are short for data1sample and data2controls):
for(i in d2$ID){
x <- mean(d1[d1$ID == i & d1$value <= d2[d2$ID == i, 'value'], 'value'])
print(x)
}
Based on the data you provided it returns NaN for water because there are no 'water's that meet your criterion, and so div by 0
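If what is wanted is the percentile as defined in the question, i.e. the share of sample values with the same ID that are less than or equal to the control value, the comparison has to go inside mean() rather than in the row filter. A minimal base R sketch under that reading, with the data retyped from the question and the column name percentile taken from the desired output:

```r
data1sample <- data.frame(
  ID = c("water", "water", "fire", "fire", "fire", "air"),
  value = c(3, 5, 1, 3, 2, 1)
)
data2controls <- data.frame(
  ID = c("water", "fire", "air"),
  value = c(1, 3, 5)
)

# For each control row: proportion of same-ID sample values <= the control value.
# as.character() guards against ID having been read in as a factor.
data2controls$percentile <- mapply(
  function(id, ctrl) {
    mean(data1sample$value[as.character(data1sample$ID) == id] <= ctrl)
  },
  as.character(data2controls$ID),
  data2controls$value,
  USE.NAMES = FALSE
)
data2controls
```

With this data the result is 0 for water (neither 3 nor 5 is at or below 1) and 1 for both fire and air. An ID with no rows in the sample would give NaN.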

Building dummy variable with many conditions (R)

My dataset looks something like this
ID YOB ATT94 GRADE94 ATT96 GRADE96 ATT 96 .....
1 1975 1 12 0 NA
2 1985 1 3 1 5
3 1977 0 NA 0 NA
4 ......
(with ATTXX a dummy var. denoting attendance at school in year XX, GRADEXX denoting the school grade)
I'm trying to create a dummy variable that = 1 if an individual is attending school when they are 19/20 years old, e.g. if YOB = 1988 and ATT98 = 1 then the new variable = 1, etc. I've been attempting this using mutate in dplyr, but I'm new to R (and coding in general!) so I struggle to get anything other than an error from any code I write.
Any help would be appreciated, thanks.
Edit:
So, I've just noticed that something has gone wrong, I changed your code a bit just to add another column to the long format data table. Here is what I did in the end:
df %>%
melt(id = c("ID", "DOB")) %>%
tbl_df() %>%
mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))
so it looks something like e.g.
ID YOB VARIABLE VALUE dummy
1 1979 ATT94 1994 1
1 1979 ATT96 1996 1
1 1979 ATT98 0 0
2 1976 ATT94 0 0
2 1976 ATT96 1996 1
2 1976 ATT98 1998 1
i.e. whenever the ATT variables take a value other than 0 the dummy = 1, even if they're not 19/20 years old. Any ideas what could be going wrong?
On my phone so I can't check this right now but try:
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Edit: The above approach will create the column but when the condition does not hold it will be equal to NA
As @Greg Snow mentions, this approach assumes that the column was already created and is equal to zero initially. So you can do the following to get your dummy variable:
df$dummy <- rep(0, nrow(df))
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, it's useful when you ask questions to provide code that other people can run in order to be able to reproduce your data. You can learn more about that here.
Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately the reshape2 package converts wide data to long. The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.
# Load libraries
library(dplyr)
library(reshape2)
# Create a sample dataset
ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
DOB <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()
# Recode ATTx variables with the actual year
df$ATT94[df$ATT94==1] <- 1994
df$ATT96[df$ATT96==1] <- 1996
df$ATT98[df$ATT98==1] <- 1998
# Melt the data into a long format and perform requested analysis
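df %>%
melt(id = "DOB") %>%
tbl_df() %>%
# the parentheses matter: %in% binds more tightly than -, so without them
# the test becomes value - (DOB %in% c(19, 20))
mutate(dummy = ifelse((value - DOB) %in% c(19, 20), 1, 0))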
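On the Edit in the question, where dummy comes out as 1 whenever the ATT value is non-zero: a likely cause is operator precedence. In R, %in% is evaluated before binary minus, so value - DOB %in% c(19,20) is read as value - (DOB %in% c(19,20)). Since no birth year is 19 or 20, that is just value - 0, and ifelse() treats any non-zero number as TRUE. A small illustration with made-up numbers:

```r
value <- c(1994, 1996, 0)      # recoded attendance: the year if attending, else 0
DOB <- c(1975, 1970, 1979)

# Without parentheses: value - (DOB %in% c(19, 20)), i.e. value - FALSE
wrong <- ifelse(value - DOB %in% c(19, 20), 1, 0)

# With parentheses: the age is computed first and then tested
right <- ifelse((value - DOB) %in% c(19, 20), 1, 0)

data.frame(value, DOB, age = value - DOB, wrong, right)
```

Only the first person is 19 in the year they attended, so right is 1, 0, 0, while wrong is 1, 1, 0 because it simply flags every non-zero value.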
@Warner shows a way to create the variable (or at least the 1's; the assumption is that the column has already been set to 0). Another approach is to not explicitly create a dummy variable, but have it created for you in the model syntax (what you asked for is essentially an interaction). If running a regression, this would be something like:
fit <- lm( resp ~ I(DOB==1988):I(ATT98==1), data=df )
or
fit <- lm( resp ~ I( (DOB==1988) & (ATT98==1) ), data=df)
