R: converting to long format by column-name pattern

I would like to convert a data.table like this one from wide to long format.
library(data.table)
set.seed(1)
DT <- data.table(
ID = c(1:5, NA),
Name = c("Bob","Ana","Smith","Sam","Big","Lulu"),
Kind_2001 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
Kind_2002 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
Kind_2003 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
Conc_2001 = sample(99,6),
Conc_2002 = sample(79,6),
Conc_2003 = sample(49,6)
)
ID Name Kind_2001 Kind_2002 Kind_2003 Conc_2001 Conc_2002 Conc_2003
1 Bob b NA c 38 22 24
2 Ana b c b 77 31 29
3 Smith c c NA 91 2 49
4 Sam NA a b 21 30 9
5 Big a a c 62 66 38
NA Lulu NA a NA 12 26 30
And I would like to get something like this:
ID Name Year Kind Conc
1 Bob 2001 b 38
1 Bob 2002 NA 22
1 Bob 2003 c 24
2 Ana 2001 b 77
2 Ana 2002 c 31
2 Ana 2003 b 29
...
The real table has many more variables, so I'm looking for a solution that detects the pattern automatically, without explicitly listing every column name or number.
I have two kinds of columns: some ending with an underscore and a four-digit year, such as _2001, and others without that ending.
Some names can also have an underscore in the middle (that part of the name should be kept as-is).
I would like to transform the columns ending with a year to long format.
I've tried with
melt(DT, id=1:2, variable.name = "year")
or with
melt(DT, id=1:2, measure=patterns("_2[0-9][0-9][0-9]$"))
but I'm not getting what I want.
Maybe I first need to filter the names with gsub.
PS: I've found this solution.
posi <- grep("_[0-9][0-9][0-9][0-9]$",colnames(DT))
work <- unique(gsub("_[0-9][0-9][0-9][0-9]$","",colnames(DT)[posi]))
melt(DT, measure=patterns(paste0("^",work)), variable="year", value.name=work)
It almost works, but the year column is filled with the group index (1, 2, 3) instead of the actual years. I'm missing something, or it's a bug.
And I'm sure it could be written more simply.
ID Name year Kind Conc
1 Bob 1 b 38
2 Ana 1 b 77
3 Smith 1 c 91
4 Sam 1 NA 21
5 Big 1 a 62
NA Lulu 1 NA 12
1 Bob 2 NA 22
2 Ana 2 c 31
3 Smith 2 c 2
4 Sam 2 a 30
5 Big 2 a 66
NA Lulu 2 a 26
1 Bob 3 c 24
2 Ana 3 b 29
3 Smith 3 NA 49
4 Sam 3 b 9
5 Big 3 c 38
NA Lulu 3 NA 30
Regards
I've tried eddi's solution with my database and I get the error:
"Error: cannot allocate vector of size 756.5 Mb"
even though I have 16GB of memory.

We can solve this at scale using reshape() from base R, without having to name every variable explicitly.
# First we get indices of colnames that have format "_1234" at the end
tomelt <- grep("_([0-9]{4})$",names(DT))
# Now we use these indices to reshape data
reshape(DT, varying = tomelt, sep = "_",
        direction = "long", idvar = "ID", timevar = "Year")
# ID Name Year Kind Conc
# 1: 1 Bob 2001 b 38
# 2: 2 Ana 2001 b 77
# 3: 3 Smith 2001 c 91
# 4: 4 Sam 2001 NA 21
# 5: 5 Big 2001 a 62
# 6: NA Lulu 2001 NA 12
...

If we are looking for a data.table solution, extract the prefix part from the names of DT and use the unique elements as patterns in the measure argument of melt. Similarly, extract the year suffix and use it to replace the numeric index in the "Year" column.
nm <- unique(sub("_\\d+", "", names(DT)[-(1:2)]))
yr <- unique(sub("\\D+_", "", names(DT)[-(1:2)]))
melt(DT, measure = patterns(paste0("^", nm)), value.name = nm,
variable.name = "Year")[, Year := yr[Year]][]
# ID Name Year Kind Conc
# 1: 1 Bob 2001 b 38
# 2: 2 Ana 2001 b 77
# 3: 3 Smith 2001 c 91
# 4: 4 Sam 2001 NA 21
# 5: 5 Big 2001 a 62
# 6: NA Lulu 2001 NA 12
# 7: 1 Bob 2002 NA 22
# 8: 2 Ana 2002 c 31
# 9: 3 Smith 2002 c 2
#10: 4 Sam 2002 a 30
#11: 5 Big 2002 a 66
#12: NA Lulu 2002 a 26
#13: 1 Bob 2003 c 24
#14: 2 Ana 2003 b 29
#15: 3 Smith 2003 NA 49
#16: 4 Sam 2003 b 9
#17: 5 Big 2003 c 38
#18: NA Lulu 2003 NA 30
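As an aside, a minimal sketch of the same idea using the measure() helper, which detects the name pattern directly (an assumption: it requires a recent data.table, >= 1.14.2):
# the first capture group supplies the output value-column names (Kind, Conc),
# the second becomes a "Year" column; non-matching columns (ID, Name) stay as id variables
melt(DT, measure.vars = measure(value.name, Year, pattern = "^(.*)_([0-9]{4})$"))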

Here's an option that's more robust with respect to the order of your columns, as well as missing/extra years:
dcast(melt(DT, id.vars = c("ID", "Name"))
[, .(ID, Name, sub('_.*', '', variable), sub('.*_', '', variable), value)],
ID + Name + V4 ~ V3)
# ID Name V4 Conc Kind
# 1: 1 Bob 2001 38 b
# 2: 1 Bob 2002 22 NA
# 3: 1 Bob 2003 24 c
# 4: 2 Ana 2001 77 b
# 5: 2 Ana 2002 31 c
# 6: 2 Ana 2003 29 b
# 7: 3 Smith 2001 91 c
# 8: 3 Smith 2002 2 c
# 9: 3 Smith 2003 49 NA
#10: 4 Sam 2001 21 NA
#11: 4 Sam 2002 30 a
#12: 4 Sam 2003 9 b
#13: 5 Big 2001 62 a
#14: 5 Big 2002 66 a
#15: 5 Big 2003 38 c
#16: NA Lulu 2001 12 NA
#17: NA Lulu 2002 26 a
#18: NA Lulu 2003 30 NA
Edit for many id columns:
idvars = grep("_", names(DT), invert = TRUE)
dcast(melt(DT, id.vars = idvars)
[, `:=`(var = sub('_.*', '', variable),
year = sub('.*_', '', variable),
variable = NULL)],
... ~ var, value.var='value')

In case anybody is interested, I post here my full solution, able to work with datasets bigger than memory. It uses some of your ideas and some of mine.
My data is in the file file.csv (you can even do it with a compressed file, using fread("unzip -c name.zip")).
## Initialization
nline <- 1500000 # total number of lines or use wc -l to do it automatically.
chunk <- 5000 # change it according to your memory and number of columns.
times <- ceiling(nline/chunk)
name <- names(fread("file.csv", stringsAsFactors=F, integer64 = "character", nrows=0, na.strings=c("", "NA")) )
idvars = grep("_20[0-9][0-9]$",name , invert = TRUE)
# Now we loop every chunk
for(iter in 0:(times-1)) {
my <- fread("file.csv", stringsAsFactors=F, integer64 = "character", skip=1+(iter*chunk), nrows=chunk, na.strings=c("", "NA"))
colnames(my) <- name
temp <- melt(my, id.vars = idvars)
newfile <- dcast(
temp[, `:=`(var = sub('_20[0-9][0-9]$', '', variable), year = sub('.*_', '', variable), variable = NULL)],
... ~ var, value.var='value')
fwrite(newfile, "long.csv", quote=FALSE, sep=",", append=T)
rm(temp); rm(newfile); rm(my); gc()
}
#
As said before, the problem with this method is that it converts all the values to character, but if you write them to a file and read the file back (as here) you get the proper classes.
For very large files this method is also very slow.
I encourage you to improve this solution or suggest a generic solution with tidyr, splitstackshape or other packages (a tidyr sketch follows the example data below).
Or, even better, it would be great to do it with a database such as SQLite.
The solution should work on datasets with unordered columns or even with "_" in the middle of the name, such as:
set.seed(1)
DT <- data.table(
ID = c(1:15),
Name = c("Bob","Ana","Smith","Sam","Big","Lulu", "Loli", "Chochi", "Tom", "Dick", "Pet", "Shin", "Rock", "Pep", "XXX"),
Kind_2001 = factor(sample(c(letters[1:3], NA), 15, TRUE)),
Kind_2002 = factor(sample(c(letters[1:3], NA), 15, TRUE)),
Kind_2003 = factor(sample(c(letters[1:3], NA), 15, TRUE)),
Conc_2004 = sample(49,15),
aa_Conc_2001 = c(sample(99,14), NA),
Conc_2002 = sample(79,15)
)
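For what it's worth, here is a tidyr sketch that handles this harder case as well, including unordered columns and an extra underscore in the middle of a name (an assumption: it requires tidyr >= 1.0.0 for pivot_longer, and it works in memory, so the chunking above still applies to data bigger than RAM):
library(tidyr)
# ".value" keeps the part before the year as the output column name,
# so aa_Conc_2001 becomes an aa_Conc column; missing year/variable combinations become NA
long <- pivot_longer(DT,
                     cols = matches("_[0-9]{4}$"),
                     names_to = c(".value", "Year"),
                     names_pattern = "^(.*)_([0-9]{4})$")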

Related

Joining dataframes with different dimensions and filling the gaps

I want to join two different dataframes. Those dataframes are of different dimensions. Here are the example datasets,
Main dataset
# Main data
id <- c(rep(1, 3), rep(3, 3), rep(10, 1))
time <- c(201601, 201602, 201603, 201601, 201602, 201603, 201601)
data1 <- c(100, 150, 160, 111, 120, 130, 150)
data2 <- c(5, 6, 9, 3, 2, 1, 0)
dataf1 <- data.frame(id, time, data1, data2)
Dataframe to be joined with the main dataset
# Additional data
id <- c(3, 10, 2)
time <- c(rep(201604, 3))
data2 <- c(20, 30, 11)
dataf2 <- data.frame(id, time, data2)
I want to join these two dataframes, namely dataf1 and dataf2. I have tried dplyr::full_join(dataf1, dataf2, by = "id") but it's not giving what I want.
The final output should also include the missing timestamps, i.e. one row for every id-time combination, with NA where no data exists.
Is there any way I can achieve this?
Here is a data.table go at your question
library(data.table)
#create data.tables out of your data.frames
setDT(dataf1)
setDT(dataf2)
#row-bind all your data together
alldata <- rbindlist( list( dataf1, dataf2 ), use.names = TRUE, fill = TRUE )
#get all unique id-time combinations out of your data
DT <- CJ( alldata$id, alldata$time, unique = TRUE)
setnames(DT, names(DT), c("id", "time"))
#join your data to all unique combinations of id-time
ans <- DT[ alldata, `:=`( data1 = i.data1, data2 = i.data2), on = .(id, time)]
output:
# id time data1 data2
# 1: 1 201601 100 5
# 2: 1 201602 150 6
# 3: 1 201603 160 9
# 4: 1 201604 NA NA
# 5: 2 201601 NA NA
# 6: 2 201602 NA NA
# 7: 2 201603 NA NA
# 8: 2 201604 NA 11
# 9: 3 201601 111 3
# 10: 3 201602 120 2
# 11: 3 201603 130 1
# 12: 3 201604 NA 20
# 13: 10 201601 150 0
# 14: 10 201602 NA NA
# 15: 10 201603 NA NA
# 16: 10 201604 NA 30
As you can see, it (almost) matches your desired output.
I got confused about why you wanted data1 = 30 for id = 10 & time = 201604. Why that behaviour, when data1 = NA and data2 = 30?
Of course, you can easily move data2 into data1 where needed with an update by reference, as in the sketch below.
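A minimal runnable version of that swap (assuming ans from the join above; NA_real_ keeps data2 numeric):
# where data1 is missing but data2 is present, move data2 across and blank data2
ans[is.na(data1) & !is.na(data2), `:=`(data1 = data2, data2 = NA_real_)]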
Here is one way using tidyr::complete with dplyr. After doing a full_join, we convert the time column to a Date object. For every id we complete the sequence from the minimum value to '2016-04-01' and remove rows that are entirely NA.
library(dplyr)
full_join(dataf1, dataf2, by = "id") %>%
select(-time.y, -data2.y) %>%
rename_all(~names(dataf1)) %>%
mutate(time1 = as.Date(paste0(time, "01"), "%Y%m%d")) %>%
tidyr::complete(id, time1 = seq(min(time1, na.rm = TRUE),
as.Date('2016-04-01'), by = "1 month")) %>%
mutate(time = format(time1, "%Y%m")) %>%
filter_at(vars(-id), any_vars(!is.na(.))) %>%
select(-time1)
# id time data1 data2
# <dbl> <chr> <dbl> <dbl>
# 1 1 201601 100 5
# 2 1 201602 150 6
# 3 1 201603 160 9
# 4 1 201604 NA NA
# 5 2 201601 NA NA
# 6 2 201602 NA NA
# 7 2 201603 NA NA
# 8 2 201604 NA NA
# 9 3 201601 111 3
#10 3 201602 120 2
#11 3 201603 130 1
#12 3 201604 NA NA
#13 10 201601 150 0
#14 10 201602 NA NA
#15 10 201603 NA NA
#16 10 201604 NA NA
This matches your exact final output:
library(data.table)
setnames(dataf2, "data2", "data1") # Warning: This will modify the original dataf2
rbindlist(
list(dataf1, dataf2),
fill = TRUE
)[CJ(id, time, unique = TRUE), on = .(id, time)]
# id time data1 data2
# 1: 1 201601 100 5
# 2: 1 201602 150 6
# 3: 1 201603 160 9
# 4: 1 201604 NA NA
# 5: 2 201601 NA NA
# 6: 2 201602 NA NA
# 7: 2 201603 NA NA
# 8: 2 201604 11 NA
# 9: 3 201601 111 3
# 10: 3 201602 120 2
# 11: 3 201603 130 1
# 12: 3 201604 20 NA
# 13: 10 201601 150 0
# 14: 10 201602 NA NA
# 15: 10 201603 NA NA
# 16: 10 201604 30 NA

Cleaning a data.frame in a semi-reshape/semi-aggregate fashion

First time posting something here, forgive any missteps in my question.
In my example below I've got a data.frame where the unique identifier is the tripID, along with the name of the vessel, the species code, and a catch metric.
> testFrame1 <- data.frame('tripID' = c(1,1,2,2,3,4,5),
'name' = c('SS Anne','SS Anne', 'HMS Endurance', 'HMS Endurance','Salty Hippo', 'Seagallop', 'Borealis'),
'SPP' = c(101,201,101,201,102,102,103),
'kept' = c(12, 22, 14, 24, 16, 18, 10))
> testFrame1
tripID name SPP kept
1 1 SS Anne 101 12
2 1 SS Anne 201 22
3 2 HMS Endurance 101 14
4 2 HMS Endurance 201 24
5 3 Salty Hippo 102 16
6 4 Seagallop 102 18
7 5 Borealis 103 10
I need a way to basically condense the data.frame so that there is only one row per tripID, as shown below.
> testFrame1
tripID name SPP kept SPP.1 kept.1
1 1 SS Anne 101 12 201 22
2 2 HMS Endurance 101 14 201 24
3 3 Salty Hippo 102 16 NA NA
4 4 Seagallop 102 18 NA NA
5 5 Borealis 103 10 NA NA
I've looked into tidyr and reshape but neither of those can deliver quite what I'm asking for. Is there anything out there that does this quasi-reshaping?
Here are two alternatives using base::reshape and data.table::dcast:
1) base R
# number the records within each tripID, then spread on that counter
reshape(transform(testFrame1,
                  timevar = ave(tripID, tripID, FUN = seq_along)),
        idvar = c("tripID", "name"),
        timevar = "timevar",
        direction = "wide")
# tripID name SPP.1 kept.1 SPP.2 kept.2
#1 1 SS Anne 101 12 201 22
#3 2 HMS Endurance 101 14 201 24
#5 3 Salty Hippo 102 16 NA NA
#6 4 Seagallop 102 18 NA NA
#7 5 Borealis 103 10 NA NA
2) data.table
library(data.table)
setDT(testFrame1)
dcast(testFrame1, tripID + name ~ rowid(tripID), value.var = c("SPP", "kept"))
# tripID name SPP_1 SPP_2 kept_1 kept_2
#1: 1 SS Anne 101 201 12 22
#2: 2 HMS Endurance 101 201 14 24
#3: 3 Salty Hippo 102 NA 16 NA
#4: 4 Seagallop 102 NA 18 NA
#5: 5 Borealis 103 NA 10 NA
Great reproducible post considering it's your first. Here's a way to do it with dplyr and tidyr:
library(dplyr)
library(tidyr)
testFrame1 %>%
group_by(tripID, name) %>%
summarise(
SPP = toString(SPP),
kept = toString(kept)
) %>%
ungroup() %>%
separate("SPP", into = c("SPP", "SPP.1"), sep = ", ", extra = "drop", fill = "right") %>%
separate("kept", into = c("kept", "kept.1"), sep = ", ", extra = "drop", fill = "right")
# A tibble: 5 x 6
tripID name SPP SPP.1 kept kept.1
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1.00 SS Anne 101 201 12 22
2 2.00 HMS Endurance 101 201 14 24
3 3.00 Salty Hippo 102 <NA> 16 <NA>
4 4.00 Seagallop 102 <NA> 18 <NA>
5 5.00 Borealis 103 <NA> 10 <NA>

In R - generate pairwise data.frame from all rows in data.frame

I have a data.frame called df with 8 million observations on 4 columns:
name <- c("Pablo", "Christina", "Steve", "Diego", "Ali", "Brit", "Ruth", "Mia", "David", "Dylan")
year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace=T)
v2 <- sample(1:10, 10, replace=T)
df <- data.frame(name, year, v1, v2)
> df
name year v1 v2
1 Pablo 2000 2 9
2 Christina 2001 5 3
3 Steve 2002 8 9
4 Diego 2003 7 6
5 Ali 2004 2 4
6 Brit 2005 1 1
7 Ruth 2006 10 9
8 Mia 2007 6 7
9 David 2008 10 9
10 Dylan 2009 3 2
I want to generate a data.frame output with all pair-wise combinations of the rows in df, looking like this:
>output
name year v1 v2 name_2 year_2 v1_2 v2_2
1 Pablo 2000 2 9 Christina 2001 5 3
2 Pablo 2000 2 9 Steve 2002 8 9
3 Pablo 2000 2 9 Diego 2003 7 6
etc.
What are the fastest ways to do this?
tidyr::crossing will return all combinations of observations, but you'll need to set names with setNames or the like. If you don't want self-matches, you can remove them by calling dplyr::filter on any unique ID column.
library(tidyverse)
df_crossed <- df %>%
setNames(paste0(names(.), '_2')) %>%
crossing(df) %>%
filter(name != name_2)
head(df_crossed)
## name_2 year_2 v1_2 v2_2 name year v1 v2
## 1 Pablo 2000 5 5 Christina 2001 7 3
## 2 Pablo 2000 5 5 Steve 2002 1 9
## 3 Pablo 2000 5 5 Diego 2003 2 8
## 4 Pablo 2000 5 5 Ali 2004 9 5
## 5 Pablo 2000 5 5 Brit 2005 8 5
## 6 Pablo 2000 5 5 Ruth 2006 8 1
Another way to fix names would be to use janitor::clean_names after crossing, though it's an extra package.
Hopefully this will give the result the post owner was looking for.
name <- c("Pablo", "Christina", "Steve", "Diego", "Ali", "Brit", "Ruth", "Mia", "David", "Dylan")
year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace=T)
v2 <- sample(1:10, 10, replace=T)
df <- data.frame(name, year, v1, v2, stringsAsFactors=FALSE)
print(df)
rows = nrow(df)
n <- rows * (rows - 1) / 2
ndf <- data.frame(
name1=character(n),year1=numeric(n), v1_1=numeric(n),v2_1=numeric(n),
name2=character(n),year2=numeric(n), v1_2=numeric(n),v2_2=numeric(n),
stringsAsFactors=FALSE
)
k <- 1
for (i in 1:(rows-1))
{
for (j in (i+1):rows)
{
ndf[k,] <- c(df[i,], df[j,])
k <- k + 1
}
}
print(ndf)
# name year v1 v2
#1 Pablo 2000 4 9
#2 Christina 2001 2 1
#3 Steve 2002 2 9
#4 Diego 2003 5 5
#5 Ali 2004 10 4
#6 Brit 2005 5 2
#7 Ruth 2006 7 10
#8 Mia 2007 6 7
#9 David 2008 4 10
#10 Dylan 2009 7 3
# name1 year1 v1_1 v2_1 name2 year2 v1_2 v2_2
#1 Pablo 2000 4 9 Christina 2001 2 1
#2 Pablo 2000 4 9 Steve 2002 2 9
#3 Pablo 2000 4 9 Diego 2003 5 5
#4 Pablo 2000 4 9 Ali 2004 10 4
#5 Pablo 2000 4 9 Brit 2005 5 2
#6 Pablo 2000 4 9 Ruth 2006 7 10
#7 Pablo 2000 4 9 Mia 2007 6 7
#8 Pablo 2000 4 9 David 2008 4 10
#9 Pablo 2000 4 9 Dylan 2009 7 3
#10 Christina 2001 2 1 Steve 2002 2 9
#...
Not to add to the noise, but consider a base R cross join with merge on the same dataframe, which also filters out reverse duplicates. Do note that the cross join before the filter will return an 8 million x 8 million record dataset, so hopefully your RAM is sufficient for such an operation.
df <- data.frame(name = c("Pablo", "Christina", "Steve", "Diego", "Ali",
"Brit", "Ruth", "Mia", "David", "Dylan"),
year = seq(2000, 2009, 1),
v1 =sample(1:10, 10, replace=T),
v2 =sample(1:10, 10, replace=T),
stringsAsFactors = FALSE)
# MERGE ON KEY, THEN REMOVE KEY COL
df$key <- 1
dfm <- merge(df, df, by="key")[,-1]
# FILTER OUT SAME NAME AND REVERSE DUPS, THEN RENAME COLUMNS
dfm <- setNames(dfm[(dfm$name.x < dfm$name.y),],
c("name_p1", "year_p1", "V1_p1", "V2_p1",
"name_p2", "year_p2", "V1_p2", "V2_p2"))
# ALL PABLO PAIRINGS
dfm[dfm$name_p1=='Pablo' | dfm$name_p2=='Pablo',]
# name_p1 year_p1 V1_p1 V2_p1 name_p2 year_p2 V1_p2 V2_p2
# 3 Pablo 2000 7 8 Steve 2002 3 1
# 7 Pablo 2000 7 8 Ruth 2006 8 4
# 11 Christina 2001 10 10 Pablo 2000 7 8
# 31 Diego 2003 4 9 Pablo 2000 7 8
# 41 Ali 2004 5 3 Pablo 2000 7 8
# 51 Brit 2005 2 4 Pablo 2000 7 8
# 71 Mia 2007 7 7 Pablo 2000 7 8
# 81 David 2008 1 7 Pablo 2000 7 8
# 91 Dylan 2009 9 2 Pablo 2000 7 8
If this large set happens to derive from an SQL-compliant database, I can provide the counterpart in SQL, which may be more efficient since the filter runs within the join rather than separately afterwards.
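For illustration, a minimal sketch of what that SQL counterpart could look like, run from R via the sqldf package (the package choice is an assumption; any SQL backend works similarly), with the reverse-duplicate filter inside the join condition:
library(sqldf)
pairs <- sqldf("
  SELECT a.name AS name_p1, a.year AS year_p1, a.v1 AS v1_p1, a.v2 AS v2_p1,
         b.name AS name_p2, b.year AS year_p2, b.v1 AS v1_p2, b.v2 AS v2_p2
  FROM df a
  JOIN df b ON a.name < b.name")
head(pairs)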
This extension of @alistaire's solution shows a crossed matrix used as an index. The question as stated wants the full crossed output, which will be very large (on the order of 8 million x 8 million rows), so there is really no way around the memory requirement. However, if the real-world use of this is to deal with subsets, the indexing technique shown here may be a way to reduce memory use. It's possible that crossing only the integer indices uses less memory during the crossing operation.
library(dplyr)
library(tidyr)
crossed <- as.matrix(crossing(1:nrow(df), 1:nrow(df)))
# bind and name in one step (may be inefficient) so that filter can be applied in one step
output <- as.data.frame(cbind(df[crossed[, 1],],
data.frame(name_2 = df[crossed[, 2], 1],
year_2 = df[crossed[, 2], 2],
v1_2 = df[crossed[, 2], 3],
v2_2 = df[crossed[, 2], 4]) )) %>%
filter(!(name == name_2 & year == year_2))
# estimated size for 8 million rows, given this 10-row sample
format(object.size(output) / (10 / 8e6), units="MB")
#[1] "5304 Mb"
You could cross join the name column to itself, using data.table and remove repeated cases. This will result in a smaller structure on which to merge in data rather than doing the full merge, then filtering. You can add the rest of the data with two merges: once to merge data associated with the first name column and again to merge in data associated with the second column.
name <- c("Pablo", "Christina", "Steve", "Diego", "Ali", "Brit", "Ruth", "Mia", "David", "Dylan")
year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace=T)
v2 <- sample(1:10, 10, replace=T)
# stringsAsFactors = FALSE in order for pmin to work properly
df <- data.frame(name, year, v1, v2, stringsAsFactors = FALSE)
library(data.table)
setDT(df)
setkey(df)
# cross-join name column to itself while removing duplicates and redundancies
name_cj <- setnames(
CJ(df[, name], df[, name])[V1 < V2], # taking a hint from Parfait's clever solution
c("name1", "name2"))
# perform 2 merges, once for the 1st name column and
# again for the 2nd name column
name_cj <- merge(
merge(name_cj, df, by.x = "name1", by.y = "name"),
df,
by.x = "name2", by.y = "name", suffixes = c("_1", "_2"))
# reorder columns as desired with setorder()
head(name_cj)
# name2 name1 year_1 v1_1 v2_1 year_2 v1_2 v2_2
#1: Brit Ali 2004 3 8 2005 4 5
#2: Christina Ali 2004 3 8 2001 9 8
#3: Christina Brit 2005 4 5 2001 9 8
#4: David Ali 2004 3 8 2008 5 2
#5: David Brit 2005 4 5 2008 5 2
#6: David Christina 2001 9 8 2008 5 2

Find row of the next instance of the value in R

I have two columns Time and Event. There are two events A and B. Once an event A takes place, I want to find when the next event B occurs. Column Time_EventB is the desired output.
This is the data frame:
df <- data.frame(Event = sample(c("A", "B", ""), 20, replace = TRUE), Time = paste("t", seq(1,20)))
What is the code in R for finding the next instance of a value (B in this case)?
And once that instance of B is found, how do I return the value of the corresponding Time column?
The code should be something like this:
data$Time_EventB <- ifelse(data$Event == "A", <Code for returning time of next instance of B>, "")
In Excel this can be done using VLOOKUP.
Here's a simple solution:
set.seed(1)
df <- data.frame(Event = sample(c("A", "B", ""),size=20, replace=T), time = 1:20)
as <- which(df$Event == "A")
bs <- which(df$Event == "B")
next_b <- sapply(as, function(a) {
diff <- bs-a
if(all(diff < 0)) return(NA)
bs[min(diff[diff > 0]) == diff]
})
df$next_b <- NA
df$next_b[as] <- df$time[next_b]
> df
Event time next_b
1 A 1 2
2 B 2 NA
3 B 3 NA
4 4 NA
5 A 5 8
6 6 NA
7 7 NA
8 B 8 NA
9 B 9 NA
10 A 10 14
11 A 11 14
12 A 12 14
13 13 NA
14 B 14 NA
15 15 NA
16 B 16 NA
17 17 NA
18 18 NA
19 B 19 NA
20 20 NA
Here's an attempt using a "rolling join" from the data.table package: the B times are rolled backwards (roll = -Inf) onto every row, so each row picks up the time of the next B, and the result is then blanked for rows whose Event is not "A".
library(data.table)
setDT(df)
df[Event=="B", .(time, nextb=time)][df, on="time", roll=-Inf][Event != "A", nextb := NA][]
# time nextb Event
# 1: 1 2 A
# 2: 2 NA B
# 3: 3 NA B
# 4: 4 NA
# 5: 5 8 A
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA B
# 9: 9 NA B
#10: 10 14 A
#11: 11 14 A
#12: 12 14 A
#13: 13 NA
#14: 14 NA B
#15: 15 NA
#16: 16 NA B
#17: 17 NA
#18: 18 NA
#19: 19 NA B
#20: 20 NA
Using data as borrowed from @thc.

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data). The Ctry column indicates the name of each country in my data frame. If, in any column (for example Carx), the number of NAs for a country is larger than 3, I want to drop that country from my data frame. For example:
Country A has 2 NAs
Country B has 4 NAs
Country C has 3 NAs
I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my data frame is actually very huge):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs in each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT:
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
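If the set of columns to check grows, here is a small sketch of a more generic variant of the same idea (any column name beyond Carx is a placeholder for your own data):
check_cols <- c("Carx")  # add e.g. "Barx", "Tarx" here for more columns to check
ok <- sapply(split(DF[check_cols], DF$Ctry),
             function(x) all(colSums(is.na(x)) <= 3))
DF[DF$Ctry %in% names(ok)[ok], ]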
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
require(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
