I have a data set like this:
Age <- rnorm(n=100, mean=20, sd=5)
ind <- which(Age %in% sample(Age, 50))
Age[ind]<-NA
Age2 <- rnorm(n=100, mean=20, sd=5)
ing <- which(Age2 %in% sample(Age2, 50))
Age2[ing]<-NA
Age3 <- rnorm(n=100, mean=20, sd=5)
int <- which(Age3 %in% sample(Age3, 50))
Age3[int]<-NA
data<-data.frame(Age,Age2,Age3)
It's an old data set that several different people put together, so multiple columns mean the same thing (there are several columns for age in the real data set). As you can see, there are quite a few NAs. I'd like to create a unified "age" column. Ideally I'd use the number from the first Age column; if that is NA, I'd fall back to Age2, and if that is also NA, I'd use Age3, strictly in that order (Age3 would never supersede Age2, etc.), as I trust the people who input the data in that order haha.
I'm aware of other answers on here about filling columns based on several conditions, like this one: dplyr replacing na values in a column based on multiple conditions
But I'm not sure how to express the priority order. Thank you!
You can use coalesce() from dplyr, which fills based on the first non-missing value from left to right.
library(dplyr)
df <- data.frame(Age, Age2, Age3)
df$new_age <- coalesce(!!!df)
head(df)
Age Age2 Age3 new_age
1 17.19762 NA NA 17.19762
2 18.84911 21.17693 NA 18.84911
3 27.79354 NA NA 27.79354
4 NA 15.19072 NA 15.19072
5 NA NA 27.99254 27.99254
6 28.57532 NA 19.55717 28.57532
A base R possibility could be:
apply(data, 1, function(x) x[which(!is.na(x))[1]])
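If you prefer a vectorized base R route, Reduce() can fold the columns left to right, keeping the first non-NA value in each row. A small sketch, with made-up values standing in for the question's data frame:

```r
# Toy data mirroring the question's structure (values are made up)
data <- data.frame(Age  = c(17.2,   NA,   NA, 28.6),
                   Age2 = c(  NA, 15.2,   NA,   NA),
                   Age3 = c(  NA,   NA, 28.0, 19.6))

# Fold columns left to right: keep `a` where it is non-NA, otherwise take `b`
data$new_age <- Reduce(function(a, b) ifelse(is.na(a), b, a),
                       data[c("Age", "Age2", "Age3")])
data$new_age
# → 17.2 15.2 28.0 28.6
```

Because Reduce() works left to right, Age3 can never supersede Age2, which matches the priority order asked for.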
I want to extract specific elements, specifically ID, from rows that have NAs. Here is my df:
df
ID x
1-12 1
1-13 NA
1-14 3
2-12 20
3-11 NA
I want a dataframe that has the IDs of observations that are NA, like so:
df
ID x
1-13 NA
3-11 NA
I tried this, but it's giving me a dataframe with the row #s that have NAs (e.g., row 2, row 5), not the IDs.
df1 <- data.frame(which(is.na(df$x)))
Can someone please help?
This is a very basic subsetting question:
df[is.na(df$x),]
Good basic and free guides can be found on w3schools: https://www.w3schools.com/r/
Cheers
Hannes
Simply run the following line:
df[is.na(df$x), ]
Another option is complete.cases
subset(df, !complete.cases(x))
Here is another base R option using na.omit
> df[!1:nrow(df) %in% row.names(na.omit(df)), ]
ID x
2 1-13 NA
5 3-11 NA
Edit: Packages used are plyr and vegan. R is the most up-to-date version.
My base data is this:
X1 = c('Archea01', 'Bacteria01', 'Bacteria02')
Sample1 = c(0.2,NA,NA)
Sample2 = c(0, 0.001, NA)
Sample3 = c(0.04, NA, NA)
df = data.frame(X1,Sample1,Sample2,Sample3)
df
X1 Sample1 Sample2 Sample3
1 Archea01 0.2 0.000 0.04
2 Bacteria01 NA 0.001 NA
3 Bacteria02 NA NA NA
Data purposefully made with NAs, to reflect real data.
My goal is to sum the frequency of bacterial/archeal occurrence in each sample, which would ideally create this type of data frame:
Sample1 Sample2 Sample3
23 11 12
I have managed to create a list of frequency:
dfFreq <- apply(df, 2, count)
Although this looks good, it's not quite what I want:
head(dfFreq)[2]
$Sample2
x freq
1 0.000 23
2 0.001 5
3 <NA> 50
The next logical step would be to convert the list into a dataframe and sum frequency (or vice versa), but my code has not worked. I have tried:
df.data <- ldply (dfFreq, data.frame)
dfSUM <- apply(dfFreq, 2, sum)
Trying to sum the list simply hasn't worked (unsurprisingly). Regarding transforming into a dataframe, I have looked all over Stack Overflow and have seen a lot suggesting the above or lapply, but the data frame that is created from the code suggested is:
x freq
Archea01 1
Bacteria01 1
etc etc
Which is not what I want.
Any thoughts about how to either A) sum frequency and then convert into a data frame like the one I want, or B) convert the list into a sensible data frame whose frequency column can be summed? I think A is the only way I can get to the point I want, but any thoughts about this would be greatly appreciated.
Edit 2.0:
Ryan Morton suggested the following code:
require(dplyr)
dfBound <- rbind(dfFreq)
Which has resulted in this data frame:
X1 Sample1
dfFreq list(x = 1:1885, freq = c(1, 1, 1)) list(x = c(1, 2, 3))
Although this certainly seems closer to the solution, I notice that each list follows either the format of X1 or the format of Sample1 (x = c(1, 2, 3), etc.), which indicates that something went wrong in the process of binding the lists.
Any ideas of why this may not be working, and what solution there may be for summing the frequency found within the list?
Thanks very much.
Update
I figured out how to sum my original frequency table and convert it into the data frame I was hoping for. Thanks to Ryan Morton for pointing me in the right direction and providing code.
dfNARemoved <- lapply(dfFreq, function(x) transform(x[-nrow(x),]))#removing useless NAs in my data
dfFreqxRemoved <- lapply(dfNARemoved, function(x) { x["x"] <- NULL; x }) #removing useless x column
dfSum <- lapply(dfFreqxRemoved, function(x) sum(x))
require(dplyr)
#Now converting into a dataframe
dfBound <- rbind(dfSum)
dfData <- as.data.frame(dfBound)
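For what it's worth, if "occurrence" simply means any non-NA entry (zeros included, as in the update above), the whole count/clean/sum pipeline collapses to one line of base R, no plyr needed:

```r
# The question's toy data
X1 <- c('Archea01', 'Bacteria01', 'Bacteria02')
Sample1 <- c(0.2, NA, NA)
Sample2 <- c(0, 0.001, NA)
Sample3 <- c(0.04, NA, NA)
df <- data.frame(X1, Sample1, Sample2, Sample3)

# Count the non-NA entries in each sample column
colSums(!is.na(df[-1]))
# → Sample1 Sample2 Sample3
#         1       2       1
```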
I have the following dataframe: (this is just a small sample)
VALUE COUNT AREA n_dd-2000 n_dd-2001 n_dd-2002 n_dd-2003 n_dd-2004 n_dd-2005 n_dd-2006 n_dd-2007 n_dd-2008 n_dd-2009 n_dd-2010
2 16 2431 243100 NA NA NA NA NA NA 3.402293 3.606941 4.000461 3.666381 3.499614
3 16 2610 261000 3.805082 4.013435 3.98 3.490139 3.433857 3.27813 NA NA NA NA NA
4 16 35419 3541900 NA NA NA NA NA NA NA NA NA NA NA
and I would like to combine all three rows into one row replacing NA with the number that appears in each column (there's only one number per column). Just ignore the first three columns. I used this code:
bdep[4,4:9] <- bdep[3,4:9]
to replace NA's with numbers from another row, but can't figure out how to repeat it for all the columns. The columns 4 and beyond have a sequence in each row of six numbers followed by 20 NA's, so I've tried going down the road of using lapply() and seq() or for loops, but my efforts are failing.
I did a simple solution by replacing the NAs with zeroes and adding all rows per column. Does this work?
#data
bdep <- rbind(c(rep(NA,6),3.402293,3.606941,4.000461,3.666381,3.499614),
c(3.805082,4.013435,3.98,3.490139,3.433857,3.27813, rep(NA,5)),
c(rep(NA,11)))
#solution
bdep2 <- ifelse(is.na(bdep), 0, bdep)
bdep3 <- apply(bdep2, 2, sum)
bdep3 #the row you want?
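The replace-then-sum step can also be done in a single call: colSums() with na.rm = TRUE gives the same result (with the same caveat that an all-NA column comes out as 0 rather than NA either way):

```r
# Same data as above
bdep <- rbind(c(rep(NA, 6), 3.402293, 3.606941, 4.000461, 3.666381, 3.499614),
              c(3.805082, 4.013435, 3.98, 3.490139, 3.433857, 3.27813, rep(NA, 5)),
              c(rep(NA, 11)))

# Collapse the rows: sum each column, ignoring NAs
bdep3 <- colSums(bdep, na.rm = TRUE)
```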
I finally came to a solution by patching together some code I found in other posts (especially on sequencing and for loops). I think this would be considered messy coding, so I'd welcome other solutions. This should describe better what I was trying to do in the OP, where I was trying to generalize too much. Specifically, I have 17 variables measured over 14 years (that's 238 columns), and something happened while generating these data such that the first 6 years of a variable ended up in one row and the following 8 years in another. Instead of re-running the model, I just wanted to combine the two rows into one.
Below are some sample data, simplified from my real scenario.
Create the data frame:
df <- data.frame(
VALUE = c(16, 16, 16),
COUNT = c(2431, 2610, 35419),
AREA = c(243100, 261000, 3541900),
n_dd_2000 = c(NA, 3.805, NA),
n_dd_2001 = c(3.402, NA, NA)
)
The next two lines build the index sequences: each starts a pattern at a given column (column 4 in the first line, column 5 in the second), steps by 1 column, sets how many starting points there are (length.out), and sets how many columns to take from each start (len):
info <- data.frame(start=seq(4, by=1, length.out=2), len=rep(1,2))
info2 <- data.frame(start=seq(5, by=1, length.out=1), len=rep(1,2))
For my real dataset, I started at column 4, repeated the pattern every 14 columns, 17 times, taking the first 6 and then 8 columns at each start: info <- data.frame(start=seq(4, by=14, length.out=17), len=rep(c(6,8),17))
The two for loops below write the specified values in the sequence from row 2 and row 1 to row 3, respectively:
foo = sequence(info$len) + rep(info$start-1, info$len)
foo2 = sequence(info2$len) + rep(info2$start-1, info2$len)
for(n in 1:length(foo)){
df[3,foo[n]] <- df[2,foo[n]]
}
for(n in 1:length(foo2)){
df[3,foo2[n]] <- df[1,foo2[n]]
}
Then I removed the first two rows I got those values from and I'm left with one complete row, no NA's:
df <- df[-(1:2),]
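A tidier alternative to the index bookkeeping above, sketched under the assumption that each data column holds exactly one non-NA value across the rows being merged (and that the first three ID columns can be taken from row 1, as the OP says to ignore them):

```r
# The question's sample data
df <- data.frame(
  VALUE = c(16, 16, 16),
  COUNT = c(2431, 2610, 35419),
  AREA = c(243100, 261000, 3541900),
  n_dd_2000 = c(NA, 3.805, NA),
  n_dd_2001 = c(3.402, NA, NA)
)

# For columns 4 onward, keep the single non-NA value from the rows
merged <- df[1, ]
merged[4:ncol(df)] <- lapply(df[4:ncol(df)], function(col) col[!is.na(col)][1])
```

This avoids the start/len sequences entirely, since `col[!is.na(col)][1]` picks out whichever row holds the value in each column.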
I want to recode the values in a matrix in such a way that all values <= .2 become 2, <= .4 become 3, etc. However, there are missing values in my data, which I do not want to change (they should stay NA). Below is a simplified version of my code. Using na.omit works perfectly for the first change:
try <- matrix(c(0.78,0.62,0.29,0.47,0.30,0.63,0.30,0.20,0.15,0.58,0.52,0.64,
0.76,0.32,0.64,0.50,0.67,0.27, NA), nrow = 19)
try[na.omit(try <= .2)] <- 2 #Indeed changes .20 and .15 to 2 and leaves the NA as NA
However, when I do the same for a higher category, the NA is also changed:
try[na.omit(try <= .8)] <- 5 #changes all other values including the NA to 5
Can someone explain to me what is the difference between the two and why the second one also changes the NA-value while the first one does not? Or am I doing something else wrong?
You can do
try[try <= .8] <- 5
The NA values will remain as NA
Or create a logical condition to exclude the NA values
try[try <=.8 & !is.na(try)] <- 5
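To answer the "why": na.omit() shortens the logical index, and R recycles a too-short logical index from the start of the vector. A sketch using the `try` matrix above:

```r
try <- matrix(c(0.78, 0.62, 0.29, 0.47, 0.30, 0.63, 0.30, 0.20, 0.15, 0.58,
                0.52, 0.64, 0.76, 0.32, 0.64, 0.50, 0.67, 0.27, NA), nrow = 19)

idx <- na.omit(try <= .8)  # the NA comparison is dropped, leaving 18 values
length(idx)                # 18, but try has 19 elements
# Indexing with a short logical vector recycles it: element 1 of idx is
# reused for position 19 (the NA slot). With `try <= .2` element 1 is FALSE
# (0.78 <= .2), so the NA slot is skipped by accident; with `try <= .8`
# element 1 is TRUE, so the NA gets overwritten as well.
```

So the first version only preserved the NA by luck; the explicit `!is.na(try)` condition is the reliable fix.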
I have a data set of 144 scenarios and would like to calculate the percent change of all possible combinations using the combn() function. I have tried to use a percent difference function within combn, but it keeps giving me a large number of NAs. Is there a way that I can accomplish this?
Create the percent change function:
pcchange=function(x,lag=1)
c(diff(x,lag),rep(NA,lag))/x*100
Use it within combn:
Catch_comp<-combn(catch_table$av_muC, 2, pcchange)
Convert the results into a matrix:
inputs <- headers
out2 <- Catch_comp
class(out2) <- "dist"
attr(out2, "Labels") <- as.character(inputs)
attr(out2, "Size") <- length(inputs)
out2 <- as.matrix(out2)
out2
This is what my table looks like:
> out2
F_R1S2_11 F_R1S2_12 F_R1S2_13 F_R1S2_21 F_R1S2_22 F_R1S2_23 F_R1S2_31 0.00000000 -0.8328001 NA -2.1972852 NA -0.11300746 NA -1.15112915 NA -2.7011787 NA -0.5359923 NA
F_R1S2_12 -0.83280008 0.0000000 NA -1.4558031 NA
As an example:
I have the average of 1000 simulations of the actual catch for two scenarios-
F_R1S1_11=155420.36
and
F_R1S1_12= 154126.0215.
Using the pcchange function I would like to calculate:
((F_R1S1_11-F_R1S1_12)/F_R1S1_11)*100 or
((155420.36-154126.02)/155420.36)*100=0.83%
change in the values.
I would like to do this for all possible combinations in a 144x144 matrix form. I hope that helps.
Thanks!
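For what it's worth, the stray NAs come from pcchange() itself: diff() shrinks each length-2 pair to one value, and rep(NA, lag) pads it back, so every pair returns one real value plus one NA. A sketch with a pair-wise function that returns a single number instead (the names and values below are hypothetical stand-ins for catch_table$av_muC):

```r
# Hypothetical stand-in for catch_table$av_muC
av <- c(F_R1S1_11 = 155420.36, F_R1S1_12 = 154126.02, F_R1S1_13 = 150000.00)

# Percent change between the two members of each pair: ((a - b) / a) * 100
pc_pair <- function(x) unname((x[1] - x[2]) / x[1] * 100)

Catch_comp <- combn(av, 2, pc_pair)  # one value per pair, no padding NAs
round(Catch_comp[1], 2)
# → 0.83
```

The result can then be wrapped into a `dist` object and converted with as.matrix() exactly as in the question, without the interleaved NAs.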