Calculate a mean and deal with NA in R

I have a dataset (df) that looks like this:
| LETTER | VALUE |
|--------|-------|
| A | 2 |
| A | 3 |
| B | 4 |
| B | NA |
| B | 6 |
| B | NA |
| C | NA |
| C | NA |
I'm looking for a way to create a second dataset (new_df) containing the mean of VALUE for each LETTER. But I also need to keep the letters whose values are all NA.
new_df should look like this:
| LETTER | VALUE |
|--------|-------|
| A | 2.5 |
| B | 5 |
| C | NA |
Here is the code I tried:
new_df <- aggregate(as.numeric(VALUE) ~ LETTER, df, mean)
The issue is that it omits the NA-only groups and returns only this:
| LETTER | VALUE |
|--------|-------|
| A | 2.5 |
| B | 5 |
Can you please help?

You can just change the defaults of aggregate():
aggregate(as.numeric(VALUE) ~ LETTER, df, function(x) mean(x, na.rm = TRUE),
          na.action = na.pass)
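A minimal reproducible sketch of this answer; the sample data frame below is reconstructed from the tables in the question:

```r
# Reconstructed sample data (assumed from the question's table)
df <- data.frame(
  LETTER = c("A", "A", "B", "B", "B", "B", "C", "C"),
  VALUE  = c(2, 3, 4, NA, 6, NA, NA, NA)
)

# na.action = na.pass keeps the NA rows in the model frame instead of
# dropping them, so the all-NA group C still gets a row in the result.
new_df <- aggregate(VALUE ~ LETTER, df,
                    function(x) mean(x, na.rm = TRUE),
                    na.action = na.pass)
new_df
#   LETTER VALUE
# 1      A   2.5
# 2      B   5.0
# 3      C   NaN
```

Note that group C comes out as NaN rather than NA, because `mean(c(NA, NA), na.rm = TRUE)` averages zero values; if a literal NA is required, replace NaN afterwards, e.g. with `is.nan()`.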

Related

Get entries from one column based on values of another column in R

I am trying to get, as a list, the entries of one column that match a list of entries in another column of a data frame.
To show what I am trying to do, here is a data frame named Tepo:
| | name | shortcut |
| -------- | -------------- | ----------|
| 1 | Apples | A |
| 2 | Bananas | B |
| 3 | oranges | O |
| 4 | Carrots | C |
| 5 | Mangos | M |
| 6 | Strawberries | S |
I have a character vector FruitList:
> FruitList
[1] "Bananas" "Carrots" "Mangos"
And I would like to get a list, shortcutList, of the corresponding shortcuts:
> shortcutList
[1] "B" "C" "M"
My attempt:
shortcutList <- tepo$shortcut[tepo$name == FruitList]
However, I don't get the desired list output.
Thanks for the help
Use %in%:
shortcutList <- tepo$shortcut[tepo$name %in% FruitList]
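A self-contained sketch of the %in% answer; the data frame is reconstructed from the table above (stringsAsFactors is set for pre-4.0 versions of R):

```r
# Reconstructed sample data (assumed from the question's table)
tepo <- data.frame(
  name     = c("Apples", "Bananas", "oranges", "Carrots", "Mangos", "Strawberries"),
  shortcut = c("A", "B", "O", "C", "M", "S"),
  stringsAsFactors = FALSE
)
FruitList <- c("Bananas", "Carrots", "Mangos")

# %in% returns a logical vector over tepo$name, so the result follows
# the row order of tepo, not the order of FruitList.
shortcutList <- tepo$shortcut[tepo$name %in% FruitList]
shortcutList
# [1] "B" "C" "M"

# If FruitList's own order must be preserved, match() does that instead:
tepo$shortcut[match(FruitList, tepo$name)]
# [1] "B" "C" "M"
```

The two forms happen to agree here because FruitList is already in table order; with a shuffled FruitList only the match() version would follow it.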

Populate table column by relative row values of another column in R

I'm trying to populate a table column from the relative row values of another column in R. I have a table with two data columns (Data1, Data2) and two point value columns (P1, P2). Data1 is populated, Data2 is not. I want Data2 to be populated by the value in either P1 or P2, based on the relative value of Data1. In a given row, if the previous value of Data1 is higher than its current value, the Data2 cell is populated by the value in P1. If the previous value of Data1 is lower than its current value, the Data2 cell is populated by the value in P2. To illustrate, I've provided two sample tables: the first is what I have (Data2 not populated), and the second is the desired outcome.
Table1 (What I have)
+-----+----+----+-------+-------+
| FID | P1 | P2 | Data1 | Data2 |
+-----+----+----+-------+-------+
| 1 | A | B | 50 | |
| 2 | C | D | 40 | |
| 3 | E | F | 60 | |
| 4 | G | H | 70 | |
| 5 | I | J | 65 | |
+-----+----+----+-------+-------+
Table2 (Desired Outcome)
+-----+----+----+-------+-------+
| FID | P1 | P2 | Data1 | Data2 |
+-----+----+----+-------+-------+
| 1 | A | B | 50 | NA |
| 2 | C | D | 40 | C |
| 3 | E | F | 60 | F |
| 4 | G | H | 70 | H |
| 5 | I | J | 65 | I |
+-----+----+----+-------+-------+
Is there a built in function in R to accomplish this? If not, any advice on how to create one?
A solution using dplyr could be:
library(dplyr)
df %>%
  mutate(Data2 = ifelse(lag(Data1) > Data1, paste0(P1), paste0(P2)))
FID P1 P2 Data1 Data2
1 1 A B 50 <NA>
2 2 C D 40 C
3 3 E F 60 F
4 4 G H 70 H
5 5 I J 65 I
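The same logic also works in base R without dplyr; here is a sketch with the sample table reconstructed as a data frame (column values taken from the question):

```r
# Reconstructed sample data (assumed from the question's table)
df <- data.frame(
  FID   = 1:5,
  P1    = c("A", "C", "E", "G", "I"),
  P2    = c("B", "D", "F", "H", "J"),
  Data1 = c(50, 40, 60, 70, 65),
  stringsAsFactors = FALSE
)

# A base-R lag: shift Data1 down by one, padding the first slot with NA.
prev <- c(NA, head(df$Data1, -1))

# If the previous value is higher, take P1; otherwise take P2.
# The first row compares against NA, so ifelse() yields NA there.
df$Data2 <- ifelse(prev > df$Data1, df$P1, df$P2)
df$Data2
# [1] NA  "C" "F" "H" "I"
```

This reproduces the desired Table2, including the NA in the first row.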

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R with which I can subset my raw dataframe according to some specifications, and thereafter convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsettings yield an empty dataframe, because for some particular specifications I have no data; hence no proportion table can be calculated. What I would like to do is take the closest time step for which I have a non-empty subsetted dataframe and use it as input for the empty one.
Here are some insights into my dataframe and function.
My raw dataframe looks +/- as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.22 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (Year & Quarter) and area for which X no. of individuals were measured (no_individuals). For example, from the first row we get that in the first quarter of the year 2005 in area 24 I had 8 individuals belonging to a length class (lenCls) of 380 mm and age = 3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=T){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21, I do not have any measured individuals. Yet, for the same area AND year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here, quarter 2 with the same area and year), and replace the NA in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year! In the example above, one can see this by looking into area 24 from year 2007. In this case I can not borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here is my LAK function, which I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=T){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # In case of empty dataset
  #if(is.data.frame(sALK) && nrow(sALK)==0){
  if(nrow(sALK[rowSums(is.na(sALK)) > 0, ]) > 0){
    warning("Empty subset combination; data will be subsetted based on the
             nearest timestep combination")
    # FIXME: include imputation rules here
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
So, I finally came up with a partial solution to my problem and will include my function here in case it is of interest to someone:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=T){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  print(sALK)
  if(nrow(sALK)==1){
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    sALK2 <- subset(df, year==syear & area==sarea)
    vals <- as.data.frame(table(sALK2$time_comb))
    colnames(vals)[1] <- "time_comb"
    idx <- which(vals$Freq > 1)
    quarterId <- as.character(vals[idx, "time_comb"])
    imput <- subset(df, year==syear & area==sarea & time_comb==quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_individuals), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin=1), 3)
    print(key2)
    if(alkplot==TRUE){
      alkPlot(key2, "area", xlab="Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin=1), 3)
    print(key)
    if(alkplot==TRUE){
      alkPlot(key, "area", xlab="Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular Year & Area combination. Yet, I'm still struggling to figure out how to deal when I do not have data for a particular Year & Area combination. In this case I need to borrow data from the closest Year that contains data for all the quarters for the same area.
For the example exposed above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I don't know if you have ever encountered mice, but it is a pretty cool and comprehensive tool for variable imputation. It also lets you see how the imputed data are distributed, so that you can choose the method best suited to your problem. Check this brief explanation and the original package description.
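A minimal sketch of what mice-based imputation could look like on the numeric columns of the example data. The small data frame below is an assumed stand-in (values loosely borrowed from the question's table), and the imputed numbers depend on the seed and method chosen:

```r
library(mice)

# A small stand-in for the question's data: two rows are entirely NA
# (column values assumed for illustration)
df <- data.frame(
  no_individuals = c(8, 4, 3, NA, 1, NA),
  lenCls         = c(380, 490, 460, NA, 680, NA),
  age            = c(3, 2, 6, NA, 6, NA)
)

# Predictive mean matching (pmm) borrows observed values as donors,
# so imputed values stay within the range of the real data.
imp <- mice(df, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# complete() returns one filled-in copy of the data (here: imputation 1)
df_complete <- complete(imp, 1)
```

mice() and complete() are the package's actual entry points; whether pmm is the right method here depends on how the borrowed quarters should be weighted, which mice knows nothing about, so the "nearest time step" logic from the question would still need to be encoded separately.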

Objects' IDs by common value in different variables

I have a dataset with self-ratings and peer-ratings. The dataset is in long format. Before reshaping the dataset into wide-format, I want to give self-ratings and peer-ratings a common ID so that I can later match the peer-ratings to the self-ratings by that ID. The data look like this:
| questionnaire | ID | REF | SERIAL | x | y |
|---------------|----|------|--------|----|----|
| self | 1 | 1234 | NA | 4 | NA |
| self | 2 | 2345 | NA | 6 | NA |
| peer | NA | NA | 1234 | NA | 8 |
| peer | NA | NA | 2345 | NA | 4 |
The self-ratings have a reference variable ("REF") which refer to a peer-rating. The peer-ratings have the same value in the variable "SERIAL".
I'm now trying to attribute the same ID to the peer-ratings as the ID of the self-ratings which refer to the peers by the SERIAL value. The table should look like this then:
| questionnaire | ID | REF | SERIAL | x | y |
|---------------|----|------|--------|----|----|
| self | 1 | 1234 | NA | 4 | NA |
| self | 2 | 2345 | NA | 6 | NA |
| peer | 1 | NA | 1234 | NA | 8 |
| peer | 2 | NA | 2345 | NA | 4 |
How could I do this best?
Maybe I can help you. I would change the ID values right in the data frame: loop over the rows, pick up the ID from the matching counterpart, and fill it into the NAs. In the following code example, I've searched for the right ID for every peer row. The map_df function does the loop and returns a data frame.
library(purrr)
library(dplyr)

marcel.data <- data.frame(questionnaire = c("self", "self", "peer", "peer"),
                          ID = c(1, 2, NA, NA),
                          REF = c("1234", "2345", NA, NA),
                          SERIAL = c(NA, NA, "1234", "2345"),
                          x = c(4, 6, NA, NA),
                          y = c(NA, NA, 8, 4))

new_id_data <- map_df(1:nrow(marcel.data), function(i){
  if(marcel.data[i, 1] == "peer") {               # matching peer to self
    serial <- unique(marcel.data[i, "SERIAL"])
    new.id <- marcel.data %>%                     # getting the needed ID
      filter(REF == serial & questionnaire == "self") %>%
      select(ID)
    marcel.data[i, "ID"] <- new.id                # adding the new ID to the old data
    marcel.data[i, ]
  } else {
    marcel.data[i, ]
  }
})
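For this particular shape of data, a vectorised base-R alternative with match() avoids the row loop entirely. A sketch, reusing the marcel.data construction from the answer above:

```r
marcel.data <- data.frame(questionnaire = c("self", "self", "peer", "peer"),
                          ID = c(1, 2, NA, NA),
                          REF = c("1234", "2345", NA, NA),
                          SERIAL = c(NA, NA, "1234", "2345"),
                          x = c(4, 6, NA, NA),
                          y = c(NA, NA, 8, 4))

# For each peer row, find the self row whose REF equals this SERIAL,
# and copy that row's ID across.
peer <- marcel.data$questionnaire == "peer"
marcel.data$ID[peer] <- marcel.data$ID[match(marcel.data$SERIAL[peer],
                                             marcel.data$REF)]
marcel.data$ID
# [1] 1 2 1 2
```

match() returns the position of the first matching REF, so this assumes REF values are unique among the self rows, as in the example.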

R ddply sum value from next row

I want to sum the column value from a row with the next one.
> df
+----+------+--------+------+
| id | Val | Factor | Col |
+----+------+--------+------+
| 1 | 15 | 1 | 7 |
| 3 | 20 | 1 | 4 |
| 2 | 35 | 2 | 8 |
| 7 | 35 | 1 | 12 |
| 5 | 40 | 1 | 11 |
| 6 | 45 | 2 | 13 |
| 4 | 55 | 1 | 4 |
| 8 | 60 | 1 | 7 |
| 9 | 15 | 2 | 12 |
..........
I would like the mean of the sum Row$Val + nextRow$Val based on their id and Col. I can't assume that the id or Col values are consecutive.
I am using ddply to summarize my df. I have tried:
> ddply(df, .(Factor), summarize,
        max(Val),
        sum(Val),
        mean(Val + df[df$id == id+1 & df$Col == Col]$Val)
  )
> "longer object length is not a multiple of shorter object length"
You can build a vector of values with
sapply(df$id, function(x){mean(c(
  subset(df, id == x, select = Val, drop = TRUE),
  subset(df, id == x + 1, select = Val, drop = TRUE)
))})
You could simplify, but I tried to make it as readable as possible.
You can use rollapply from the zoo package. Since you want mean of only two consecutive rows , you can try
library(zoo)
rollapply(df[order(df$id), 2], 2, function(x) sum(x)/2)
#[1] 25.0 27.5 37.5 47.5 42.5 40.0 47.5 37.5
You can do something like this with dplyr package:
library(dplyr)
df <- arrange(df, id)
mean(df$Val + lead(df$Val), na.rm = TRUE)
[1] 76.25
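The same idea can be checked in base R without the zoo or dplyr dependency; the sample data below is reconstructed from the question's table. Note that everything is ordered by id first, so the first pair is (15 + 35) / 2:

```r
# Reconstructed sample data (assumed from the question's table)
df <- data.frame(id  = c(1, 3, 2, 7, 5, 6, 4, 8, 9),
                 Val = c(15, 20, 35, 35, 40, 45, 55, 60, 15))

# Order by id, then pair each value with its successor.
v <- df$Val[order(df$id)]
pair_sums  <- head(v, -1) + tail(v, -1)  # Val + next Val
pair_means <- pair_sums / 2              # rolling mean of width 2
pair_means
# [1] 25.0 27.5 37.5 47.5 42.5 40.0 47.5 37.5

# Mean of the pairwise sums, matching the dplyr answer:
mean(pair_sums)
# [1] 76.25
```

head(v, -1) and tail(v, -1) give the "current" and "next" values respectively, which is the base-R equivalent of dplyr's lead().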
