This question already has answers here:
Replace NA with previous and next rows mean in R
(3 answers)
Closed 5 years ago.
I'm trying to replace NA values with the mean of the value in the previous row (same column) and the value in the previous column (same row), using dplyr. See the example below:
df <- data.frame(A=c(1,1,2),
B=c(2,4,NA))
So in this case the NA would be replaced by 3. How do I do this?
Below is along the lines of what I was thinking, but it doesn't work:
dfb <- df %>%
mutate(B = if_else(is.na(B), mean(lag(B),A), B))
Thanks!
Instead of using mean, we can write out the two terms explicitly, add them, and divide by 2.
df %>% mutate(B = ifelse(is.na(B),(lag(B) + A)/2, B))
# A B
#1 1 2
#2 1 4
#3 2 3
A simple base R method using subsetting is
df$B[is.na(df$B)] <- (df$B[which(is.na(df$B))-1] + df$A[is.na(df$B)]) / 2
df
A B
1 1 2
2 1 4
3 2 3
is.na returns a logical vector indicating whether each element is NA. which returns the position of logical TRUE elements. which is necessary for the first component of the average, since we have to find the lagged value.
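To make the indexing concrete, here are the intermediate pieces for the example data above (just an illustration, nothing beyond the question's own df):
is.na(df$B)
# [1] FALSE FALSE  TRUE
which(is.na(df$B)) - 1
# [1] 2
df$B[which(is.na(df$B)) - 1]   # the lagged B value
# [1] 4
df$A[is.na(df$B)]              # the same-row A value
# [1] 2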
This can be extended a bit to reduce computation (responding to docendo-discimus's comment) by computing the missing-value indicator once, storing it, and re-using that vector.
missers <- is.na(df$B)
df$B[missers] <- (df$B[which(missers)-1] + df$A[missers]) / 2
#clean up, maybe
rm(missers)
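As a side note, the same replacement can be written as a compact dplyr one-liner with coalesce (a sketch, assuming a dplyr version that provides coalesce and lag):
library(dplyr)
df %>% mutate(B = coalesce(B, (lag(B) + A) / 2))
#   A B
# 1 1 2
# 2 1 4
# 3 2 3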
I've got a question regarding the filter() function of dplyr, and/or base subset() function within R. Basically, when I use filter() or subset() I can extract observations based on two conditions, which is what I need.
As an example, this is what I've been using so far:
df %>% filter(Axis_1_1 == "Diagnostic of function on axis1 postponed") %>% filter(is.na(diagnostic_code9))
This gives me the right number of observations that satisfy these two conditions at the same time, i.e. 92 out of the 23992 in total.
However, when I use the negation sign to exclude these observations from my current dataframe, R deletes roughly 8000 extra observations. Thus, after filtering with the negation "!" sign, only 15992 observations are left. Example:
df %>% filter(Axis_1_1 != "Diagnostic of function on axis1 postponed") %>% filter(!is.na(diagnostic_code9))
Simple subsetting in base R gives me the same wrong end result when negated, even though it finds the correct 92 observations that satisfy the positive condition, as in the first example:
subset(df, Axis_1_1 == "Diagnostic of function on axis1 postponed" & is.na(diagnostic_code9))
My dataframe consists of 112 variables and 23900+ observations in the current setting.
Thus, my questions are:
Could there be something odd going on with the dataframe I'm using? (Unfortunately, I cannot share a subset of it.)
Second, is there something wrong with my code?
Lastly, what exactly is R doing in the background? It can filter out these observations based on the exact condition (matching the string and is.na()), yet it does something completely different when the negation sign is used.
Your logic doesn't quite work in this case. Doing two subsequent filter statements is like doing an AND operation. Consider the following example:
df <- data.frame(a=c(1,1,1,1,2,2,2, 2),
b=c(NA,NA,5,5,5,5,5,NA))
df %>% filter(a==1) %>% filter(is.na(b))
# a b
# 1 1 NA
# 2 1 NA
df %>% filter(a!=1) %>% filter(!is.na(b))
# a b
# 1 2 5
# 2 2 5
# 3 2 5
Note that the rows with a=1, b=5 are not returned even though they were not in the first output either, because your first filter (filter(a != 1)) eliminates them.
So if you consider your two filters as A and B, in the first case you are doing A AND B. It would be the same as
df %>% filter(a==1 & is.na(b))
# a b
# 1 1 NA
# 2 1 NA
But in the second case you are doing NOT A AND NOT B. These are not equivalent: by De Morgan's laws, the negation of (A AND B) is NOT A OR NOT B. So try
df %>% filter(a!=1 | !is.na(b))
# a b
# 1 1 5
# 2 1 5
# 3 2 5
# 4 2 5
# 5 2 5
# 6 2 NA
or equivalently (note the parentheses applying the NOT (!) to the whole expression)
df %>% filter(!(a==1 & is.na(b)))
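As a quick sanity check (a small sketch using the example df from above), both forms return the same rows:
library(dplyr)
out1 <- df %>% filter(a != 1 | !is.na(b))
out2 <- df %>% filter(!(a == 1 & is.na(b)))
identical(out1, out2)
# [1] TRUE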
This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 5 years ago.
I am a new user of RStudio, and I have a quite simple problem, but unfortunately I am not able to solve it.
I just want to aggregate the rows of my data.frame by the words contained in its first column.
The data.frame has five columns:
The first one contains words;
the second, third, fourth, and fifth contain numeric values.
For example, if the data were:
SecondWord X Y Z Q
NO 1 2 2 1
NO 0 0 1 0
YES 1 1 1 1
I expect to see a result like:
SecondWord X Y Z Q
NO 1 2 3 1
YES 1 1 1 1
How could I do this?
I have tried the following method:
test <- read.csv2("test.csv")
df<-aggregate(.~Secondword,data=test, FUN = sum, na.rm=TRUE)
But the values were not the ones I expected to see.
Thank you in advance for your help, and sorry for the "simple" question.
You can also use the tidyverse:
library(tidyverse)
df <- test %>%
group_by(SecondWord) %>%
summarize_each(funs(sum))
df
# SecondWord X Y Z Q
# NO 1 2 3 1
# YES 1 1 1 1
ddply should work as well.
For example, something like:
library(plyr)
grouped <- ddply(test, "SecondWord", numcolwise(sum))
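If the aggregate call from the question still gives unexpected numbers, it is worth double-checking that the grouping column name matches exactly (SecondWord vs Secondword) and that read.csv2 parsed the numeric columns as numbers (it expects ";" as the separator and "," as the decimal mark). A minimal sketch using the example data from the question:
test <- data.frame(SecondWord = c("NO", "NO", "YES"),
                   X = c(1, 0, 1), Y = c(2, 0, 1),
                   Z = c(2, 1, 1), Q = c(1, 0, 1))
aggregate(. ~ SecondWord, data = test, FUN = sum)
#   SecondWord X Y Z Q
# 1         NO 1 2 3 1
# 2        YES 1 1 1 1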
This question already has answers here:
Sorting rows alphabetically
(4 answers)
How to sum a variable by group
(18 answers)
Closed 4 years ago.
I have a large dataset I want to simplify, but I'm currently having some trouble with one thing.
The following table shows origin-destination combinations. The count column represents the number of occurrences of a combination, e.g. A to B occurs 2 times.
From To count
A B 2
A C 1
C A 3
B C 1
The problem I have is that, for example, A to C (1) is actually the same as C to A (3). Since direction doesn't really matter to me, only that there's a connection between A and C, I wonder how I can simply get A to C (4).
The problem is that I have a factor with 400 levels, so I can't do it manually. Is there something in dplyr or similar that can solve this for me?
df[1:2] <- t(apply(df[1:2], 1, sort))
aggregate(count ~ From + To, df, sum)
results in:
From To count
1 A B 2
2 A C 4
3 B C 1
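For clarity, this is the intermediate state after the row-wise sort and before aggregate collapses the duplicates (using the example data from the question):
df[1:2] <- t(apply(df[1:2], 1, sort))
df
#   From To count
# 1    A  B     2
# 2    A  C     1
# 3    A  C     3
# 4    B  C     1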
Here is a base R method using aggregate, sort, paste, and mapply.
with(df, aggregate(count,
list(route=mapply(function(x, y) paste(sort(c(x, y)), collapse=" - "),
From, To)), sum))
route x
1 A - B 2
2 A - C 4
3 B - C 1
Here, mapply takes pairs of elements from the From and To variables, sorts each pair, and pastes it into a single string with collapse=" - ". The resulting string vector is used in aggregate to group the observations and sum the count values. with reduces typing.
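To see the grouping key on its own, you can run just the mapply part (a small sketch; as.character is added defensively in case From and To are factors, and USE.NAMES = FALSE just prints a plain vector):
with(df, mapply(function(x, y) paste(sort(c(as.character(x), as.character(y))), collapse = " - "),
                From, To, USE.NAMES = FALSE))
# [1] "A - B" "A - C" "A - C" "B - C"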
I'm looking for a way to exclude a number of answers from a length function.
This is a follow-on question from Getting R Frequency counts for all possible answers. In SQL the syntax could be:
select * from someTable
where variableName not in ( 0, null )
Given
Id <- c(1,2,3,4,5)
ClassA <- c(1,NA,3,1,1)
ClassB <- c(2,1,1,3,3)
R <- c(5,5,7,NA,9)
S <- c(3,7,NA,9,5)
df <- data.frame(Id,ClassA,ClassB,R,S)
ZeroTenNAScale <- c(0:10,NA);
R.freq = setNames(nm=c('R','freq'),data.frame(table(factor(df$R,levels=ZeroTenNAScale,exclude=NULL))));
S.freq = setNames(nm=c('S','freq'),data.frame(table(factor(df$S,levels=ZeroTenNAScale,exclude=NULL))));
length(S.freq$freq[S.freq$freq!=0])
# 5
How would I change
length(S.freq$freq[S.freq$freq!=0])
to get an answer of 4 by excluding 0 and NA?
We can use colSums; the idea is to subset the non-NA indicator matrix to the rows where freq is non-zero and then take the count for the first column (S):
colSums(!is.na(S.freq)[S.freq$freq!=0,])[[1]]
#[1] 4
You can use sum to add up the freq values. If the NAs were in the column being summed, you could use the na.rm = TRUE argument; however, because the NA is located in a different column, you first need to remove the row containing it.
Our solution is as follows: we remove the rows containing NA by subsetting with S.freq[!is.na(S.freq$S), ], and we also select the second column, freq:
sum(S.freq[!is.na(S.freq$S), "freq"])
# 4
You can try na.omit (to remove NAs) and subset (to get rid of all rows where freq equals 0):
subset(na.omit(S.freq), freq != 0)
S freq
4 3 1
6 5 1
8 7 1
10 9 1
From here, that's straightforward:
length(subset(na.omit(S.freq), freq != 0)$freq)
[1] 4
Does it solve your problem?
Just add !is.na(S.freq$S) as a second filter:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
If you want to extend it with other conditions, you could make an index vector first for readability:
idx <- S.freq$freq!=0 & !is.na(S.freq$S)
length(S.freq$freq[idx])
You're looking for values with frequency > 0, which means you're looking for the unique values. You can get this information directly from the vector df$S:
length(unique(df$S))
and leaving NA aside, you get the answer 4 with:
length(unique(df$S[!is.na(df$S)]))
Regarding your question on how to exclude a number of items based on their value:
In R this is easily done with logical vectors, as you already used in your code:
length(S.freq$freq[S.freq$freq!=0])
You can combine different conditions into one logical vector and use it for subsetting, e.g. (using the S column for the NA check, since the freq column itself never contains NA):
length(S.freq$freq[S.freq$freq != 0 & !is.na(S.freq$S)])
This question already has answers here:
The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe
(11 answers)
Closed 7 years ago.
I have the following data frame:
df <- data.frame(a=rep(1:3),b=rep(1:3),c=rep(4:6),d=rep(4:6))
df
a b c d
1 1 1 4 4
2 2 2 5 5
3 3 3 6 6
I would like to have a variable N which determines my window size, so for this example I will set
N <- 1
I would like to split this dataframe into equal portions of N rows each and store the 3 resulting dataframes in a list.
I have the following code:
groupMaker <- function(x, y) 0:(x-1) %/% y
testlist2 <- split(df, groupMaker(nrow(df), N))
The problem is that this code renames my columns by adding an X0. prefix:
result <- as.data.frame(testlist2[1])
result
X0.a X0.b X0.c X0.d
1 1 1 4 4
I would like code that does the exact same thing but keeps the column names as they are. Please keep in mind that my original data has a lot more than 3 rows, so I need something that is applicable to a much larger dataframe.
To extract a list element, we can use [[. Also, as each list element is already a data.frame, we don't need to explicitly call as.data.frame again.
testlist2[[1]]
We can also use gl to create the grouping variable.
split(df, as.numeric(gl(nrow(df), N, nrow(df))))
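A quick check (using the df and N from the question) that extracting with [[ keeps the original column names:
N <- 1
testlist2 <- split(df, as.numeric(gl(nrow(df), N, nrow(df))))
testlist2[[1]]
#   a b c d
# 1 1 1 4 4
names(testlist2[[1]])
# [1] "a" "b" "c" "d"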