[R]: Applying a function to columns based on conditional row position

I am attempting to find the number of observations by column in a data frame that meet a certain condition after the max for that column has been encountered.
Here is a highly simplified example:
fake.dat<-data.frame(samp1=c(5,6,7,5,4,5,10,5,6,7), samp2=c(2,3,4,6,7,9,2,3,7,8), samp3=c(2,3,4,11,7,9,2,3,7,8),samp4=c(5,6,7,5,4,12,10,5,6,7))
samp1 samp2 samp3 samp4
1 5 2 2 5
2 6 3 3 6
3 7 4 4 7
4 5 6 11 5
5 4 7 7 4
6 5 9 9 12
7 10 2 2 10
8 5 3 3 5
9 6 7 7 6
10 7 8 8 7
So, let's say I'm trying to find the number of observations per column that are greater than 5 after excluding all the observations in a column up to and including the row where the maximum for the column occurs.
Expected outcome:
samp1 samp2 samp3 samp4
2 2 4 3
I am able to get the answer I want by using nested for loops to exclude the observations I don't want.
max.row <- apply(fake.dat, 2, which.max)   # row index of each column's maximum
newfake.dat <- data.frame()
for (j in 1:length(fake.dat)) {
  for (i in 1:nrow(fake.dat)) {
    ifelse(i > max.row[j], newfake.dat[i, j] <- fake.dat[i, j], "NA")
    print(newfake.dat)
  }
}
This creates a new data frame on which I can run an easy apply function.
colcount<-apply(newfake.dat,2,function(x) (sum(x>5,na.rm=TRUE)))
V1 V2 V3 V4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA 7 NA
6 NA NA 9 NA
7 NA 2 2 10
8 5 3 3 5
9 6 7 7 6
10 7 8 8 7
V1 V2 V3 V4
2 2 4 3
This is all well and good for this tiny example dataset, but it is prohibitively slow on anything approaching the size of my real datasets, which are large (2000 x 2000 or larger) and numerous. I tried it with a truncated version of one of my files (fewer columns, but the same number of rows) and it ran for at least 5 hours (I left it going when I left work for the day). Also, I don't really need the new data frame for anything other than being able to run the apply function.
Is there any way to do this more efficiently? I tried limiting the rows that the apply function works on by using seq and the row number of the max.
maxrow <- apply(fake.dat, 2, function(x) which.max(x))
print(maxrow)
seq.att <- apply(fake.dat, 2, function(x) {
  sum(x[which(seq(1, nrow(fake.dat)) == (maxrow)):nrow(fake.dat)] > 5, na.rm = TRUE)
})
Which kicks up four instances of this warning message:
1: In seq(1, nrow(fake.dat)) == (maxrow) :
longer object length is not a multiple of shorter object length
If I ignore the warning message and get the output anyway, it doesn't give me the answer I expected:
samp1 samp2 samp3 samp4
2 3 3 3
I also tried using a while loop, which kept cycling, so I stopped it (I misplaced the code I tried for this).
So far the most promising result has come from the nested for loops, but I know it's terribly inefficient and I'm hoping that there's a better way. I'm still new to R, and I'm sure I'm tripping up on some syntax somewhere. Thanks in advance for any help you can provide!

Here is a way in dplyr to replicate the same process that you showed with base R:
library(dplyr)
fake.dat %>%
  summarise_each(funs(sum(.[(which.max(.) + 1):n()] > 5,
                          na.rm = TRUE)))
# samp1 samp2 samp3 samp4
#1 2 2 4 3
If you need it as two steps:
datNA <- fake.dat %>%
  mutate_each(funs(replace(., seq_len(which.max(.)), NA)))
datNA %>%
  summarise_each(funs(sum(. > 5, na.rm = TRUE)))
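summarise_each() and funs() have since been deprecated in dplyr; a minimal sketch of the same logic with the across() interface, assuming dplyr >= 1.0:
library(dplyr)
fake.dat %>%
  summarise(across(everything(),
                   ~ sum(.x[(which.max(.x) + 1):length(.x)] > 5, na.rm = TRUE)))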

Here's one approach using data.table:
library(data.table)
##
data <- data.frame(
  samp1 = c(5, 6, 7, 5, 4, 5, 10, 5, 6, 7),
  samp2 = c(2, 3, 4, 6, 7, 9, 2, 3, 7, 8),
  samp3 = c(2, 3, 4, 11, 7, 9, 2, 3, 7, 8),
  samp4 = c(5, 6, 7, 5, 4, 12, 10, 5, 6, 7))
##
Dt <- data.table(data)
##
Dt[, lapply(.SD, function(x) {
  y <- x[(which.max(x) + 1):.N]
  length(y[y > 5])
})]
samp1 samp2 samp3 samp4
1: 2 2 4 3

A single-liner in base R:
vapply(fake.dat,function(x) sum(x[(which.max(x)+1):length(x)]>5),1L)
#samp1 samp2 samp3 samp4
# 2 2 4 3
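One caveat worth noting (an addition, not from the original answers): if a column's maximum happened to be in the last row, (which.max(x)+1):length(x) would count backwards past the end of the vector. A defensive sketch using a logical index avoids that and returns the same counts here:
vapply(fake.dat, function(x) {
  after.max <- seq_along(x) > which.max(x)  # TRUE only for rows after the maximum
  sum(x[after.max] > 5, na.rm = TRUE)
}, integer(1))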

Related

Count unique values in Raster Data in R

I have these raster datasets, which look like this:
1 2 3 4 5
1 NA NA NA 10 NA
2 7 3 7 10 10
3 NA 3 7 3 3
4 9 9 NA 3 7
5 3 NA 7 NA NA
I created that table via:
MyRaster1 <- raster("MyRaster_EUNIS1.tif")
head(MyRaster1)
Using unique(MyRaster1) I get 3 7 9 10.
What I need are the counts of these unique values in the raster dataset.
I have tried quite a few workarounds. One works, but it is a lot of trouble, and I can't get a loop to work for all the raster datasets I have.
Classes1 <- as.factor(unique(values(MyRaster1)))[!is.na(unique(values(MyRaster1)))]
val1 <- unique(MyRaster1)
Tab1 <- matrix(nrow = length(values(MyRaster1)), ncol = length(val1))
colnames(Tab1) <- levels(unique(Classes1))
Tab1 <- Tab1[!is.na(Tab1[,1]),]
colSums(Tab1)
It seems to work properly until I try to delete the NA values. When I use colSums before that, I get NA as the result for each column; after I delete the NA values, I get 0.
This is my first time using R, so I'm a real novice. I've researched quite a lot, but since I hardly understand the language at all, this is the furthest I have gotten.
Thank you for your help.
Edit:
table(MyRaster1)
gives me this: Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
The best result would be:
3 7 9 10
6 5 2 3
But I'd also be ok with a different format which I could use in Excel.
Use raster::freq()
Here's an example for the first two rows of your data:
library(raster)
r <- raster(matrix(c(NA, NA, NA, 10, NA, 7, 3, 7, 10, 10), nrow = 2, ncol = 5))
freq(r)
value count
[1,] 3 1
[2,] 7 2
[3,] 10 3
[4,] NA 4
Note that the freq function rounds unless explicitly told not to:
https://www.rdocumentation.org/packages/raster/versions/3.0-7/topics/freq
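Since the question also mentions looping over all the raster datasets, here is a minimal sketch of applying freq() to a list of files; the file names below are hypothetical placeholders:
library(raster)
files <- c("MyRaster_EUNIS1.tif", "MyRaster_EUNIS2.tif")  # hypothetical file names
counts <- lapply(files, function(f) {
  tab <- freq(raster(f))          # matrix with columns "value" and "count"
  tab[!is.na(tab[, "value"]), ]   # drop the NA row if it is not wanted
})
names(counts) <- files
# write.csv(do.call(rbind, counts), "raster_counts.csv")  # readable in Excel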

Data.table: rbind a list of data tables with unequal columns [duplicate]

This question already has answers here:
rbindlist data.tables with different number of columns (1 answer)
Rbind with new columns and data.table (5 answers)
Closed 4 years ago.
I have a list of data tables that are of unequal lengths. Some of the data tables have 35 columns and others have 36.
I have this code, but it generates an error:
> lst <- unlist(full_data.lst, recursive = FALSE)
> model_dat <- do.call("rbind", lst)
Error in rbindlist(l, use.names, fill, idcol) :
Item 1362 has 35 columns, inconsistent with item 1 which has 36 columns. If instead you need to fill missing columns, use set argument 'fill' to TRUE.
Any suggestions on how I can modify this so that it works properly?
Here's a minimal example of what you are trying to do. There is no need for any other package; just set fill = TRUE in rbindlist:
df1 <- data.table(m1 = c(1,2,3))
df2 <- data.table(m1 = c(1,2,3), m2=c(3,4,5))
df3 <- rbindlist(list(df1, df2), fill = TRUE)
print(df3)
m1 m2
1: 1 NA
2: 2 NA
3: 3 NA
4: 1 3
5: 2 4
6: 3 5
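Applied to the objects from the question, the fix is simply to call rbindlist directly with fill = TRUE (a sketch, since full_data.lst itself isn't shown):
lst <- unlist(full_data.lst, recursive = FALSE)
model_dat <- rbindlist(lst, fill = TRUE)  # pads the missing columns with NA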
If I understood your question correctly, I see only two other options for getting your data tables appended.
Option A: drop the extra variable from the tables that have it:
full_data.lst[[i]]$column_Name <- NULL  # for each table i that has the extra column
Option B: create the variable, filled with missing values, in the incomplete tables:
full_data.lst[[i]]$column_Name <- NA  # for each incomplete table i
Then rbind as before.
Try to use rbind.fill from package plyr:
Input data: three data frames with different numbers of columns.
df1<-data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4,5))
df2<-data.frame(a=c(1,2,3,4,5,6),b=c(1,2,3,4,5,6),c=c(1,2,3,4,5,6))
df3<-data.frame(a=c(1,2,3),d=c(1,2,3))
full_data.lst<-list(df1,df2,df3)
The solution
library("plyr")
rbind.fill(full_data.lst)
a b c d
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 1 1 1 NA
7 2 2 2 NA
8 3 3 3 NA
9 4 4 4 NA
10 5 5 5 NA
11 6 6 6 NA
12 1 NA NA 1
13 2 NA NA 2
14 3 NA NA 3

Difference between ntile and cut and then quantile() function in R

I found two threads on this topic for calculating deciles in R. However, the two methods, dplyr::ntile() and quantile(), yield different output. In fact, dplyr::ntile() fails to output proper deciles.
Method 1: Using ntile()
From the thread R: splitting dataset into quartiles/deciles. What is the right method?, we could use ntile().
Here's my code:
vector<-c(0.0242034679584454, 0.0240411606258083, 0.00519255930109344,
0.00948031338483081, 0.000549450549450549, 0.085972850678733,
0.00231687756193192, NA, 0.1131625967838, 0.00539244534707915,
0.0604885614579294, 0.0352030947775629, 0.00935626135385923,
0.401201201201201, 0.0208212839791787, NA, 0.0462887301644538,
0.0224952741020794, NA, NA, 0.000984952654008562)
ntile(vector,10)
The output is:
5 5 2 3 1 7 1 NA 8 2 7 6 3 8 4 NA 6 4 NA NA 1
If we analyze this, we see that no observation gets assigned to the 10th decile!
Method 2: using quantile()
Now, let's use the method from the thread How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame.
Here's my code:
as.numeric(cut(vector,
               breaks = quantile(vector, probs = seq(0, 1, length = 11), na.rm = TRUE),
               include.lowest = TRUE))
The output is:
7 6 2 4 1 9 2 NA 10 3 9 7 4 10 5 NA 8 5 NA NA 1
As we can see, the outputs are completely different. What am I missing here? I'd appreciate any thoughts.
Is this a bug in the ntile() function?
In dplyr::ntile(), NA is always last (highest rank), and that is why you don't see the 10th decile in this case. If you want the deciles not to consider NAs, you can define a function like the one here, which I use next:
ntile_na <- function(x, n) {
  notna <- !is.na(x)
  out <- rep(NA_real_, length(x))
  out[notna] <- ntile(x[notna], n)
  return(out)
}
ntile_na(vector, 10)
# [1] 6 6 2 4 1 9 2 NA 9 3 8 7 3 10 5 NA 8 5 NA NA 1
Also, quantile() has nine ways of computing quantiles; you are using the default, type 7 (see ?stats::quantile for the different types, and here for a discussion of them).
If you try
as.numeric(cut(vector,
               breaks = quantile(vector,
                                 probs = seq(0, 1, length = 11),
                                 na.rm = TRUE,
                                 type = 2),
               include.lowest = TRUE))
# [1] 6 6 2 4 1 9 2 NA 9 3 8 7 3 10 5 NA 8 5 NA NA 1
you have the same result as the one using ntile.
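To see where the two groupings diverge, it can also help to compare the break points of the two quantile types directly (a quick check, not part of the original answer):
quantile(vector, probs = seq(0, 1, length = 11), na.rm = TRUE, type = 7)  # default type
quantile(vector, probs = seq(0, 1, length = 11), na.rm = TRUE, type = 2)  # the type that matches ntile here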
In summary: it is not a bug, it is just the different ways they are implemented.

Getting stale values on using ifelse in a dataframe

Hi, I am combining values from two columns to create a final third column, based on priority: if a value in column 1 is missing or NA, then I go for column 2.
df=data.frame(internal=c(1,5,"",6,"NA"),external=c("",6,8,9,10))
df
internal external
1 1
2 5 6
3 8
4 6 9
5 NA 10
df$final <- df$internal
df$final <- ifelse((df$final=="" | df$final=="NA"),df$external,df$final)
df
internal external final
1 1 2
2 5 6 3
3 8 4
4 6 9 4
5 NA 10 2
How can the final value be 4 and 2 for rows 3 and 5 when the external values are 8 and 10? I don't know what's wrong, but these values don't make any sense to me.
The issue arises because R converts your values to factors.
Your code will work fine with:
df = data.frame(internal = c(1, 5, "", 6, "NA"),
                external = c("", 6, 8, 9, 10),
                stringsAsFactors = FALSE)
PS: this hideous conversion to factors should definitely belong to the R Inferno, http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
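For reference, a quick sketch of the corrected run; note also that since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so recent R versions avoid this pitfall by default:
df <- data.frame(internal = c(1, 5, "", 6, "NA"),
                 external = c("", 6, 8, 9, 10),
                 stringsAsFactors = FALSE)
df$final <- ifelse(df$internal == "" | df$internal == "NA", df$external, df$internal)
df$final
# [1] "1"  "5"  "8"  "6"  "10"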

Reordering (deleting/changing order) columns of data in data frame

I have two large data sets and I am attempting to reformat the older data set to put the questions in the same order as the newer data set (so that I can easily perform t-tests on each identical question to track significant changes over the 2 years between data sets). The new version both deleted and added questions when changing from the old version.
The way I've been attempting to do this, R keeps crashing due to, as best I can figure, vectors being too large. I'm not sure how they are getting to be this large, however! Below is what I am doing:
Both data sets have the same format. The original sets are 415 columns for the new and 418 for the old. I want to match the first approximately 158 columns of the new data set to the old. Each data set has column names q1-q415, and the data in each column are numerical 1-5 or NA. There are approximately 100 answers per question/column; the old data set has more respondents (140 rows in the old vs 114 rows in the new). An example is below (but keep in mind there are over 400 columns in the full set and over 100 rows!)
The following is an example of what data.old looks like. data.new looks the same, just with a different number of rows of numeric/NA answers. Here I show questions 1 through 20 and the first 10 rows.
data.old = 418 columns (q1 though q418) x 140 rows
data.new = 415 columns (q1 through q415) x 114 rows
I need to match the first 170 COLUMNS of data.old to the first 157 COLUMNS of data.new
To do this, I will be deleting 17 columns from data.old (questions that were in the data.old questionnaire but deleted from the data.new questionnaire) and also adding 7 new columns to data.old (which will contain NAs... placeholders for where data.new introduced new questions that did not exist in the data.old questionnaire).
>data.old
q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20
1 3 4 3 3 5 4 1 NA 4 NA 1 2 NA 5 4 3 2 3 1
3 4 5 2 2 4 NA 1 3 2 5 2 NA 3 2 1 4 3 2 NA
2 NA 2 3 2 1 4 3 5 1 2 3 4 3 NA NA 2 1 2 5
1 2 4 1 2 5 2 3 2 1 3 NA NA 2 1 5 5 NA 2 3
4 3 NA 2 1 NA 3 4 2 2 1 4 5 5 NA 3 2 3 4 1
5 2 1 5 3 2 3 3 NA 2 1 5 4 3 4 5 3 NA 2 NA
NA 2 4 1 5 5 NA NA 2 NA 1 3 3 3 4 4 5 5 3 1
4 5 4 5 5 4 3 4 3 2 5 NA 2 NA 2 3 5 4 5 4
2 2 3 4 1 5 5 3 NA 2 1 3 5 4 NA 2 3 4 3 2
2 1 5 3 NA 2 3 NA 4 5 5 3 2 NA 2 3 1 3 2 4
So in the new set, some of the questions were deleted, some new ones were added, and some changed order, so I went through and created subsets of old data in the order that I would need to combine them again to match the new dataset. When a question does not exist in the old data set, I want to use the question in the new data set so that I can (theoretically) perform my t-tests in a big loop.
dataold.set1 <- dataold[1:16]
dataold.set2 <- dataold[18:19]
dataold.set3 <- dataold[21:23]
dataold.set4 <- dataold[25:26]
dataold.set5 <- dataold[30:33]
dataold.set6 <- dataold[35:36]
dataold.set7 <- dataold[38:39]
dataold.set8 <- dataold[41:42]
dataold.set9 <- dataold[44]
dataold.set10 <- dataold[46:47]
dataold.set11 <- dataold[49:54]
dataold.set12 <- datanew[43:49]
dataold.set13 <- dataold[62:85]
dataold.set14 <- dataold[87:90]
dataold.set15 <- datanew[78]
dataold.set16 <- dataold[91:142]
dataold.set17 <- dataold[149:161]
dataold.set18 <- dataold[55:61]
dataold.set19 <- dataold[163:170]
I then attempted to put the columns back together into one set. I tried both
dataold.adjust <- merge(dataold.set1, dataold.set2)
dataold.adjust <- merge(dataold.adjust, dataold.set3)
dataold.adjust <- merge(dataold.adjust, dataold.set4)
and I also tried
dataold.adjust <- cbind(dataold.set1, dataold.set2, dataold.set3)
However, every time I try to perform these functions, R freezes, then crashes. I managed to get it to display an error once, and it said it could not work with a vector of 10 Mb, and then I got multiple errors involving over 1000 Mb vectors. I'm not really sure how my vectors are that large, when this is crashing out by set 3, which is only 23 columns of data in a table, and the data sets I'm normally using are over 400 columns in length.
Is there another way to do this that won't cause my program to crash with memory issues (and won't require me to type out the column names of over 100 columns), or is there some element of code here that I am missing where I'm creating a memory sink? I've been attempting to troubleshoot it and have spent an hour dealing with R crashing without any luck figuring out how to make this work.
Thanks for the assistance!
You're making tons of unnecessary copies of your data and then you're growing the final object (dataold.adjust). You just need a vector that orders the columns correctly:
cols1 <- c(1:16, 18:19, 21:23, 25:26, 30:33, 35:36, 38:39, 41:42, 44, 46:47, 49:54)
cols2 <- c(62:85, 87:90)
cols3 <- c(91:142, 149:161, 55:61, 163:170)
# merge old / new data by row and add NA for unmatched rows
dataold.adjust <- merge(data.old[, c(cols1, cols2, cols3)],
                        data.new[, c(43:49, 78)], by = "row.names", all = TRUE)
# put columns in desired order
dataold.adjust <- dataold.adjust[, c(1:length(cols1),                          # 1st cols from dataold
                                     ncol(dataold.adjust) - length(43:49):1,   # 1st cols from datanew
                                     (length(cols1) + 1):length(cols2),        # 2nd cols from dataold
                                     ncol(dataold.adjust),                     # 2nd cols from datanew
                                     (length(cols1) + length(cols2) + 1):length(cols3))]  # 3rd cols from dataold
The last part is an absolute kludge, but I've hit my self-imposed time limit for SO today. :)
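A rough alternative sketch of the same idea, staying in base R: since data.old and data.new have different numbers of respondents, the placeholder columns can simply be NA columns of the right length built for data.old, rather than columns merged in from data.new (the column indices come from the question; the names taken from data.new are assumptions, and the code is untested against the real data):
old.part1 <- data.old[, c(1:16, 18:19, 21:23, 25:26, 30:33, 35:36, 38:39,
                          41:42, 44, 46:47, 49:54)]
old.part2 <- data.old[, c(62:85, 87:90)]
old.part3 <- data.old[, c(91:142, 149:161, 55:61, 163:170)]
# NA placeholders where the new questionnaire added questions
new.block  <- as.data.frame(matrix(NA, nrow = nrow(data.old), ncol = 7))
names(new.block) <- names(data.new)[43:49]
new.single <- setNames(data.frame(rep(NA, nrow(data.old))), names(data.new)[78])
dataold.adjust <- cbind(old.part1, new.block, old.part2, new.single, old.part3)
This keeps everything to one cbind() over pre-selected columns, so no intermediate copies of the full data set are grown in a loop.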
