I have a very large data frame/data table, and I would like to search a column for the nearest NA value at an index less than my current position.
For example, let's say I have a data frame DF as follows:
INDEX | KEY | ITEM
----------------------
1 | 10 | AAA
2 | 12 | AAA
3 | NA | AAA
4 | 18 | AAA
5 | NA | AAA
6 | 24 | AAA
7 | 29 | AAA
8 | 31 | AAA
9 | 34 | AAA
This data frame has NA values at index 3 and index 5. Now, let's say we start at index 8 (which has a KEY of 31). I would like to search the KEY column backwards so that the search stops at the first NA it encounters and returns that NA's index (5, in this case).
I know there are ways to find all NA values in a vector/column (for example, which(is.na(x)) returns the indices that hold NA), but due to the sheer size of the data frame I am working with, and the large number of iterations that need to be performed, this is very inefficient. One method I thought of is a kind of "do while" loop, and it does seem to work, but it repeats the same scan on every call; given that I need to do over 100,000 iterations, this does not look like a good idea.
Is there a fast way of searching a column backwards from a particular index such that I can find the index of the closest NA value?
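For reference, the repeated backward scan I'm trying to avoid looks something like this (a sketch; last_na_before is an illustrative name, not existing code):

# naive backward scan: O(n) work repeated for every single query
last_na_before <- function(key, start) {
  i <- start
  while (i >= 1 && !is.na(key[i])) i <- i - 1
  if (i >= 1) i else NA_integer_
}
last_na_before(DF$KEY, 8)  # 5 in the example above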
Why not do a forward-fill of the NA indexes once, so that you can then look up the most recent NA for any row in future:
library(dplyr)
library(tidyr)

df <- df %>%
  mutate(last_missing = if_else(is.na(KEY), INDEX, NA_integer_)) %>%
  fill(last_missing)
Output:
> df
INDEX KEY ITEM last_missing
1 1 10 AAA NA
2 2 12 AAA NA
3 3 NA AAA 3
4 4 18 AAA 3
5 5 NA AAA 5
6 6 24 AAA 5
7 7 29 AAA 5
8 8 31 AAA 5
9 9 34 AAA 5
Now there's no need to recalculate every time you need the answer for a given row. There may be more efficient ways to do the forward fill, but I think exploring those is easier than figuring out how to optimise the backward search.
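If even that pass turns out to be a bottleneck, here is a sketch of the same fill using data.table, whose nafill() does last-observation-carried-forward in compiled code (this assumes INDEX is stored as an integer column):

library(data.table)

dt <- as.data.table(df)
# record the index at each NA, then carry it forward
dt[, last_missing := nafill(fifelse(is.na(KEY), INDEX, NA_integer_), type = "locf")]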
I'm new to R and I have the following problem. Maybe it's a really easy question, but I don't know the terms to search for an answer.
My problem:
I have several persons, each person is assigned a studynumber (SN). And each SN has one or more tests being performed, the test can have multiple results.
My data is long at the moment, but I need it to be wide (one row for each SN).
For example:
What I have:
SN testnumbers result
1 1 1234 6
2 1 1234 9
3 2 4567 6
4 3 5678 9
5 3 8790 9
What I want:
SN test1result1 test1result2 test2result1
1 1 6 9 NA
2 2 6 NA NA
3 3 9 NA 9
So I think I need to renumber the testnumbers into test 1, test 2, etc. within each SN in order to use the spread function, but I don't know how.
I did manage to renumber testnumber from 1 up to the last unique testnumber, but the resulting wide data frame still looks awful.
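For what it's worth, a sketch of one way this could go, using dplyr to renumber within each SN and tidyr's pivot_wider() (the successor to spread()); the column names follow the sample above:

library(dplyr)
library(tidyr)

dat <- data.frame(SN          = c(1, 1, 2, 3, 3),
                  testnumbers = c(1234, 1234, 4567, 5678, 8790),
                  result      = c(6, 9, 6, 9, 9))

dat %>%
  group_by(SN) %>%
  mutate(test = dense_rank(testnumbers)) %>%   # testnumbers -> 1, 2, ... per SN
  group_by(SN, test) %>%
  mutate(rep = row_number()) %>%               # repeated results -> 1, 2, ... per test
  ungroup() %>%
  pivot_wider(id_cols = SN,
              names_from = c(test, rep),
              values_from = result,
              names_glue = "test{test}result{rep}")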
I'll start by saying that the question "filling in missing data in one data frame with info from another" has one solution that may work for my problem; however, it solves it with a FOR loop, and I would prefer a vectorized solution.
I have 125 years of climate data with year, month, temperature, precipitation, and open pan evaporation. It is daily data summarized by month. Some years in the late 1800s have entire months missing, and I would like to substitute those missing months with the equivalent month from a 30-year average around that time.
I have pasted some of the code I've been playing with, below:
# For simplicity, let's pretend there are 5 months in the year, so year 3
# is the only year with a complete set of data, years 1 and 2 are missing some.
df1<-structure(
list(
Year=c(1,1,1,2,2,3,3,3,3,3),
Month=c(1,2,4,2,5,1,2,3,4,5),
Temp=c(-2,2,10,-4,12,2,4,8,14,16),
Precip=c(20,10,50,10,60,26,18,40,60,46),
Evap=c(2,6,30,4,48,4,10,32,70,40)
)
)
# This represents the 30-year average data:
df2<-structure(
list(
Month=c(1,2,3,4,5),
Temp=c(1,3,9,13,15),
Precip=c(11,13,21,43,35),
Evap=c(1,5,13,35,45)
)
)
# to match my actual setup (dplyr provides as_tibble() and the
# data-frame-aware setdiff() used below)
library(dplyr)
df1 <- as_tibble(df1)
df2 <- as_tibble(df2)
# I can get to the list of months missing from a given year
full_year <- df2[,1]
compare_year1 <- df1[df1$Year==1,2]
missing_months <- setdiff(full_year,compare_year1)
# Or I can get the full data from each year missing one or more months
year_full <- df2[,1]
years_compare <- split(df1[,c(2)], df1$Year)
years_missing_months <- names(years_compare[sapply(years_compare,nrow)<5])
complete_years_missing_months <- df1[df1$Year %in% years_missing_months,]
This is where I've gotten stumped.
I've looked at anti_join and merge, but it looks like they need data of the same length in each frame. I can get from lists grouped by year to identify the years that are missing months, but I'm not sure how to actually get the rows inserted from there. It seems like lapply could be useful, but the answer ain't comin'.
Thanks in advance.
Edit 7/19: As an illustration of what I need, just looking at year "1", the current data (df1) has the following:
Year | Mon | Temp | Precip | Evap
1 | 1 | -2 | 20 | 2
1 | 2 | 2 | 10 | 6
1 | 4 | 10 | 50 | 30
Months 3 and 5 are missing data, so I would like to insert the equivalent-month data from the 30-year average table (df2), so the final result for year "1" would look like:
Year | Mon | Temp | Precip | Evap
1 | 1 | -2 | 20 | 2
1 | 2 | 2 | 10 | 6
1 | 3 | 9 | 21 | 13
1 | 4 | 10 | 50 | 30
1 | 5 | 15 | 35 | 45
Then fill in every year missing months in like manner. Year "3" would have no change, because (in this 5-month example) there are no months missing data.
First just add rows to hold the imputed values, since you know that there are missing rows with known dates:
df1$date <- as.Date(paste0("200",df1$Year,"/",df1$Month,"/01"))
pretend_12months <- seq(min(df1$date),max(df1$date),by = "1 month")
pretend_5months <- pretend_12months[lubridate::month(pretend_12months) < 6]
pretend_5months <- data.frame(date=pretend_5months)
new <- merge(df1,
             pretend_5months,
             by = "date",
             all = TRUE)
# as.integer() keeps Year numeric; substr() alone would silently coerce
# the whole column to character
new$Year <- ifelse(is.na(new$Year),
                   as.integer(substr(lubridate::year(new$date), 4, 4)),
                   new$Year)
new$Month <- ifelse(is.na(new$Month),
                    lubridate::month(new$date),
                    new$Month)
Impute the NA values using a left join:
# key part: left join using any library or builtin method (left_join,merge, etc)
fillin <- sqldf::sqldf("select a.date,a.Year,a.Month, b.Temp, b.Precip, b.Evap from new a left join df2 b on a.Month = b.Month")
# apply data set from join to the NA data
new$Temp[is.na(new$Temp)] <- fillin$Temp[is.na(new$Temp)]
new$Precip[is.na(new$Precip)] <- fillin$Precip[is.na(new$Precip)]
new$Evap[is.na(new$Evap)] <- fillin$Evap[is.na(new$Evap)]
date Year Month Temp Precip Evap
1 2001-01-01 1 1 -2 20 2
2 2001-02-01 1 2 2 10 6
3 2001-03-01 1 3 9 21 13
4 2001-04-01 1 4 10 50 30
5 2001-05-01 1 5 15 35 45
6 2002-01-01 2 1 1 11 1
7 2002-02-01 2 2 -4 10 4
8 2002-03-01 2 3 9 21 13
9 2002-04-01 2 4 13 43 35
10 2002-05-01 2 5 12 60 48
11 2003-01-01 3 1 2 26 4
12 2003-02-01 3 2 4 18 10
13 2003-03-01 3 3 8 40 32
14 2003-04-01 3 4 14 60 70
15 2003-05-01 3 5 16 46 40
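And since the original post asked for something vectorized: a tidyverse sketch of the same two steps, using tidyr::complete() to add the missing month rows and dplyr::coalesce() to patch them from df2 (df1/df2 as defined in the question):

library(dplyr)
library(tidyr)

filled <- df1 %>%
  complete(Year, Month = df2$Month) %>%                     # one row per Year x Month
  left_join(df2, by = "Month", suffix = c("", ".avg")) %>%  # attach the 30-year averages
  mutate(Temp   = coalesce(Temp,   Temp.avg),
         Precip = coalesce(Precip, Precip.avg),
         Evap   = coalesce(Evap,   Evap.avg)) %>%
  select(Year, Month, Temp, Precip, Evap)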
I have two data sets with the same dimensions, and I want to combine them so that the 1st column of the second data set sits next to the 1st column of the first data set, and so on.
Consider the example below, which is the expected output. Here, v1 comes from data set 1 and v2 from data set 2. I also want to keep the column headers as they are.
| v1 | v2 |
|:------:|:------:|
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
| -0.71 | -0.71 |
I tried cbind() and data.frame(), but both appended the second data set after the full first data set instead of interleaving the columns.
> dim(firstDataSet)
[1] 100 200
> dim(secondDataSet)
[1] 100 200
> finalDataSet_cbind <- cbind(firstDataSet, secondDataSet)
> dim(finalDataSet_cbind)
[1] 100 400
> finalDataSet_dframe <- data.frame(firstDataSet, secondDataSet)
> dim(finalDataSet_dframe)
[1] 100 400
Please suggest correct and better ways to achieve this, thanks.
UPDATE: Response to possible duplicate flag to this question:
That answer didn't work for me. Following that solution didn't produce what I want; the result was similar to the output of the cbind() approach explained above.
The first answer given works for me, except for a small issue: new column names are assigned to each column instead of keeping the original column headers.
Also, I don't have enough reputation to add a comment to the accepted answer.
Probably not the most efficient solution because of the for loop, but it works:
data1 <- cbind(1:10,11:20, 21:30)
data2 <- cbind(1:10,11:20, 21:30)
combined <- NULL
for (i in 1:ncol(data1)) {
  combined <- cbind(combined, data1[, i], data2[, i])
}
To fix the column name requirement, you could do the following. First cbind, then build an index that puts the columns in the interleaved order, use the same index to collect the matching column names, reorder the columns, and reapply the names.
df1 <- df2 <- data.frame(v1=1:10,v2=11:20, v3=21:30)
final <- cbind(df1,df2)
indexed <- rep(1:ncol(df1), each = 2) + (0:1) * ncol(df1)
new_colnames <- colnames(final)[indexed]
final_ordered <- final[indexed]
colnames(final_ordered) <- new_colnames
v1 v1 v2 v2 v3 v3
1 1 1 11 11 21 21
2 2 2 12 12 22 22
3 3 3 13 13 23 23
4 4 4 14 14 24 24
5 5 5 15 15 25 25
6 6 6 16 16 26 26
7 7 7 17 17 27 27
8 8 8 18 18 28 28
9 9 9 19 19 29 29
10 10 10 20 20 30 30
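For what it's worth, the reordering can also be written in one step; a sketch using order() on a repeated column index (duplicated column names are kept, just as above):

final_ordered <- cbind(df1, df2)[, order(rep(seq_len(ncol(df1)), times = 2))]

order() sorts the positions of 1,2,3,1,2,3 by value, giving 1,4,2,5,3,6 -- exactly the interleaved order built manually above.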
I have the following dataset: (sample)
Team Job Question Answer
1 1 1 2 1
2 1 1 3a 2
3 1 1 3b 2
4 1 1 4a 1
5 1 1 4b 1
and I have 21 teams, so there are many rows. I am trying to filter the rows of the teams which did well in the experiment (with the dplyr package):
q10best <- filter(quest,Team==c(2,4,6,10,13,17,21))
But it gives me messed-up data with many missing rows.
On the other hand, when I use:
q10best <- filter(quest,Team==2 | Team==4 | Team==6 | Team==10 | Team==13 | Team==17 | Team==21)
It gives me the dataset that I want. What is the difference? What am I doing wrong in the first command?
Thanks
== checks whether two objects are exactly the same, and when the two sides have different lengths the shorter one is recycled -- so each element of quest$Team gets compared against just one rotating element of your vector, which is why rows go missing. You are trying to check whether each element of quest$Team belongs to a set of values, and the proper way to do that is %in%:
q10best <- filter(quest,Team %in% c(2,4,6,10,13,17,21))
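A quick illustration of the recycling pitfall with toy vectors (not your data):

x <- c(2, 4, 2, 4)
x == c(2, 4)    # TRUE TRUE TRUE TRUE    -- c(2, 4) is recycled and happens to line up
x == c(4, 2)    # FALSE FALSE FALSE FALSE -- same values, wrong alignment
x %in% c(4, 2)  # TRUE TRUE TRUE TRUE    -- set membership, order doesn't matter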
I have the following data frame in R that has overlapping data in the two columns a_sno and b_sno
a_sno<- c(4,5,5,6,6,7,9,9,10,10,10,11,13,13,13,14,14,15,21,21,21,22,23,23,24,25,183,184,185,185,200)
b_sno<-c(5,4,6,5,7,6,10,13,9,13,14,15,9,10,14,10,13,11,22,23,24,21,21,25,21,23,185,185,183,184,200)
df = data.frame(a_sno, b_sno)
If you take a close look at the data you can see that 4, 5, 6 and 7 intersect/overlap, and I need to put them into a group called 1.
Likewise, 9, 10, 13 and 14 go into group 2, 11 and 15 into group 3, and so on; 200 does not intersect with any other row but still needs to be assigned its own group.
The resulting output should look like this:
---------
group|sno
---------
1 | 4
1 | 5
1 | 6
1 | 7
2 | 9
2 | 10
2 | 13
2 | 14
3 | 11
3 | 15
4 | 21
4 | 22
4 | 23
4 | 24
4 | 25
5 | 183
5 | 184
5 | 185
6 | 200
Any help to get this done is much appreciated. Thanks
Probably not the most efficient solution but you could use graphs to do this:
#sort the data by row and remove duplicates
df = unique(t(apply(df,1,sort)))
#load the library
library(igraph)
#make a graph with your data (after the sort step above, df is a matrix,
#so convert it back to a data frame)
graph <- graph.data.frame(as.data.frame(df))
#decompose it into components
components <- decompose.graph(graph)
#get the vertices of the subgraphs
result <- lapply(seq_along(components), function(i) {
  vertex <- as.numeric(V(components[[i]])$name)
  cbind(rep(i, length(vertex)), vertex)
})
#make the final dataframe
output<-as.data.frame(do.call(rbind,result))
colnames(output)<-c("group","sno")
output
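Connected components are what capture the transitive chaining here: 4 overlaps 5 and 5 overlaps 6, so 4 and 6 land in the same group even though they never share a row. If you also want the rows ordered as in the expected table, a final sort should do it:

output[order(output$group, output$sno), ]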