I am trying to summarise a data frame grouped by the label column. I want to obtain means based on the following conditions:
- if all numbers are NA - then I want to return NA
- if mean of all the numbers is 1 or lower - I want to return 1
- if mean of all the numbers is higher than 1 - I want a mean of the values in the group that are greater than 1
- all the rest should be 100.
I managed to find the answer and my code now runs well. The issue was in the first ifelse() statement: it needs is.na() instead of ==NA.
library(dplyr)
label <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7)
sev <- c(NA,NA,NA,NA,1,0,1,1,1,NA,1,2,2,4,5,1,0,1,1,4,5)
Data2 <- data.frame(label,sev)
d <- Data2 %>%
  group_by(label) %>%
  summarize(sevmean = ifelse(is.na(mean(sev,na.rm=TRUE)),NA,
                      ifelse(mean(sev,na.rm=TRUE)<=1,1,
                      ifelse(mean(sev,na.rm=TRUE)>1,
                             mean(sev[sev>1],na.rm=TRUE),100))))
Your first condition is the issue here. If we remove the nested ifelse and keep only the first one, we get the same all-NA output:
Data2 %>%
group_by(label) %>%
summarise(sevmean = ifelse(mean(sev,na.rm=TRUE)==NaN,NA,1))
# label sevmean
# <dbl> <lgl>
#1 1.00 NA
#2 2.00 NA
#3 3.00 NA
#4 4.00 NA
#5 5.00 NA
#6 6.00 NA
#7 7.00 NA
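Why every group comes back as NA can be checked in isolation. The following is a minimal sketch in base R; the variable names m, cmp and out are only illustrative and do not come from the thread.

```r
# With na.rm = TRUE nothing is left to average, so the mean is NaN
m <- mean(c(NA, NA, NA), na.rm = TRUE)

# Comparing with == against NaN (or NA) never gives TRUE or FALSE, only NA
cmp <- m == NaN
cmp          # NA

# An NA test propagates through ifelse(), whatever the yes/no values are
out <- ifelse(2 == NaN, NA, 1)
out          # NA

# The dedicated predicates do answer the question
is.nan(m)    # TRUE
is.na(m)     # TRUE as well: is.na() also flags NaN
```

Since is.na() is TRUE for NaN too, either is.na() or is.nan() works as the first test.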
I am not sure why you are checking for NaN, but if you want to do that, check it with is.nan() instead of ==:
Data2 %>%
group_by(label) %>%
summarize(sevmean = ifelse(is.nan(mean(sev,na.rm=TRUE)),NA,
ifelse(mean(sev,na.rm=TRUE)<=1,1,
ifelse(mean(sev,na.rm=TRUE)>1,
mean(sev[sev>1],na.rm=TRUE),100))))
# label sevmean
# <dbl> <dbl>
#1 1.00 NA
#2 2.00 1.00
#3 3.00 1.00
#4 4.00 2.00
#5 5.00 3.67
#6 6.00 1.00
#7 7.00 4.50
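The nested ifelse() recomputes the group mean three times. One possible alternative, sketched here under the assumption that a small helper is acceptable (sev_summary is a hypothetical name, not something from the thread), computes the mean once and uses plain if statements, since each group yields a single value.

```r
# Hypothetical helper: one summary value per group of severities
sev_summary <- function(v) {
  m <- mean(v, na.rm = TRUE)
  if (is.nan(m)) return(NA_real_)   # every value in the group was NA
  if (m <= 1) return(1)             # mean of 1 or lower
  mean(v[v > 1], na.rm = TRUE)      # mean of the values greater than 1
}

label <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7)
sev <- c(NA,NA,NA,NA,1,0,1,1,1,NA,1,2,2,4,5,1,0,1,1,4,5)

# Base R grouping, so the sketch runs without extra packages
res <- sapply(split(sev, label), sev_summary)
res
```

The same helper can be used inside the dplyr pipeline as summarise(sevmean = sev_summary(sev)). The "all the rest should be 100" branch is left out because a non-missing mean is always either <= 1 or > 1.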
I have one big data frame with columns such as name, position, expression level, q value and so on. Most objects appear several times under the same name but with different expression levels. I want to remove the objects whose repeats have opposite expression levels, i.e. both up-regulated (+) and down-regulated (-) values, and keep the objects whose repeats are all up-regulated (+) or all down-regulated (-).
Here is an example of my file:
df1<-data.frame(gene.name=c( "DEC1","DEC1","DEC1","ATP","ANXA2","ANXA1","ANXA1","ANXA1"),
expression.level=c(2.01,0.5,-1.56,3.1,0.67,0.1,1.2,3),
q.value=c(0.001,0.002,0.0001,0.9,0.00001,0.9,0.0002,0.002))
and I want output like this:
output<-data.frame(gene.name=c( "ATP","ANXA2","ANXA1","ANXA1","ANXA1"),
expression.level=c(3.1,0.67,0.1,1.2,3),
q.value=c(0.9,0.00001,0.9,0.0002,0.002))
Thanks in advance for your help.
We can use sign() to check whether they are positive or negative or zero. Then use filter to include those that have the same sign.
library(dplyr)
df1 %>%
group_by(gene.name) %>%
filter(length(unique(sign(expression.level))) == 1) %>%
ungroup()
gene.name expression.level q.value
1 ATP 3.10 9e-01
2 ANXA2 0.67 1e-05
3 ANXA1 0.10 9e-01
4 ANXA1 1.20 2e-04
5 ANXA1 3.00 2e-03
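To see what the filter condition evaluates to for each gene, the count of distinct signs can be computed on its own. This is a small base-R sketch using the question's data; n_signs and kept are illustrative names.

```r
df1 <- data.frame(gene.name = c("DEC1","DEC1","DEC1","ATP","ANXA2","ANXA1","ANXA1","ANXA1"),
                  expression.level = c(2.01,0.5,-1.56,3.1,0.67,0.1,1.2,3),
                  q.value = c(0.001,0.002,0.0001,0.9,0.00001,0.9,0.0002,0.002))

# sign() maps each value to -1, 0 or 1; count the distinct signs per gene
n_signs <- tapply(df1$expression.level, df1$gene.name,
                  function(x) length(unique(sign(x))))
n_signs

# keep only the genes whose values all share one sign
kept <- df1[df1$gene.name %in% names(n_signs)[n_signs == 1], ]
kept
```

One thing to keep in mind: sign(0) is 0, so a gene with an expression level of exactly 0 next to positive values would count as having two signs and would be dropped.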
Using ave you can do this with a one-liner.
df1[with(df1, ave(expression.level, gene.name, FUN=\(x) length(unique(sign(x))))) == 1, ]
# gene.name expression.level q.value
# 4 ATP 3.10 9e-01
# 5 ANXA2 0.67 1e-05
# 6 ANXA1 0.10 9e-01
# 7 ANXA1 1.20 2e-04
# 8 ANXA1 3.00 2e-03
Using data.table
library(data.table)
setDT(df1)[df1[, .I[uniqueN(sign(expression.level)) == 1], gene.name]$V1]
-output
gene.name expression.level q.value
<char> <num> <num>
1: ATP 3.10 9e-01
2: ANXA2 0.67 1e-05
3: ANXA1 0.10 9e-01
4: ANXA1 1.20 2e-04
5: ANXA1 3.00 2e-03
I'm new to R and I'm struggling with this df, which looks like this:
Date        Group  Factor1  Factor2  Spread
2019-04-01  a      1.01     1.011    0.01
2019-04-02  a      1.02     1.012    0.02
2019-04-03  a      1.03     1.013    0.03
2019-04-01  b      1.005    1.004    0.01
2019-04-02  b      1.0051   1.0041   0.02
2019-04-03  b      1.0052   1.0042   0.03
For each row I would like to check the group: if the Group is "a", compute Factor1 / Factor1 (1 day lag) * Factor2 + Spread, and if the group is not "a", compute the same thing without adding the Spread.
Since you are conditioning on the group, this is a good example of by (base R), dplyr::group_by, or data.table's x[,,by=].
The equation is effectively the same in all three. It capitalizes on the fact that (Group[1] == "a") is coerced from logical to numeric when multiplied by a number: since FALSE translates to 0, the multiplication effectively disables adding Spread for every group other than "a".
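The coercion can be seen on its own in a few lines. This is only an illustration; spread, in_a and not_a are made-up names.

```r
spread <- c(0.01, 0.02, 0.03)

# TRUE is coerced to 1, so the spread is kept as is
in_a <- ("a" == "a") * spread
in_a

# FALSE is coerced to 0, so the spread is zeroed out
not_a <- ("b" == "a") * spread
not_a
```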
Base
I use within here to make the internals a little more readable, but this is not a requirement (in which case you'd need to prepend x$ in front of all of the variable names).
The lagging can be done using dplyr::lag (even if you don't use the rest of the package for this) or many other techniques. I don't find stats::lag to be the most intuitive in applications like this, but I'm sure somebody will suggest a way to incorporate it into an answer. The use of c(NA, ...) ensures that we don't bring in a different group's data or impute data we don't have, since we have no value to bring in on the first row of a group. Finally, head(..., n = 1) returns the first element of a vector/list, while head(..., n = -1) (negative) returns all but the last.
newx <- by(x, x$Group, function(y) {
within(y, {
NewVal = Factor2 * Factor1 / c(NA, head(Factor1, n=-1)) + (Group[1] == "a") * Spread
})
})
newx
# x$Group: a
# Date Group Factor1 Factor2 Spread NewVal
# 1 2019-04-01 a 1.01 1.011 0.01 NA
# 2 2019-04-02 a 1.02 1.012 0.02 1.042020
# 3 2019-04-03 a 1.03 1.013 0.03 1.052931
# -------------------------------------------------------
# x$Group: b
# Date Group Factor1 Factor2 Spread NewVal
# 4 2019-04-01 b 1.0050 1.0040 0.01 NA
# 5 2019-04-02 b 1.0051 1.0041 0.02 1.0042
# 6 2019-04-03 b 1.0052 1.0042 0.03 1.0043
This is really just a list with some fancy by-specific formatting, so you can treat it as such and combine the pieces in an efficient base-R way:
do.call("rbind.data.frame", c(newx, stringsAsFactors = FALSE))
# Date Group Factor1 Factor2 Spread NewVal
# a.1 2019-04-01 a 1.0100 1.0110 0.01 NA
# a.2 2019-04-02 a 1.0200 1.0120 0.02 1.042020
# a.3 2019-04-03 a 1.0300 1.0130 0.03 1.052931
# b.4 2019-04-01 b 1.0050 1.0040 0.01 NA
# b.5 2019-04-02 b 1.0051 1.0041 0.02 1.004200
# b.6 2019-04-03 b 1.0052 1.0042 0.03 1.004300
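The lag built with c(NA, head(..., n = -1)) in the by() call above can be checked by hand on group "a". The sketch below reuses the question's numbers; the vector names are illustrative.

```r
factor1 <- c(1.01, 1.02, 1.03)
factor2 <- c(1.011, 1.012, 1.013)
spread  <- c(0.01, 0.02, 0.03)

head(factor1, n = 1)    # first element only
head(factor1, n = -1)   # everything except the last element

# shift the values down by one position; the first row has no previous day
lagged <- c(NA, head(factor1, n = -1))

# group "a", so the spread is added
new_val <- factor2 * factor1 / lagged + spread
new_val
```

The second and third values agree with the NewVal column shown for group "a" (about 1.042020 and 1.052931).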
dplyr
Many find the tidyverse line of packages to read intuitively.
library(dplyr)
x %>%
group_by(Group) %>%
mutate(NewVal = Factor2 * Factor1 / lag(Factor1) + (Group[1] == "a") * Spread) %>%
ungroup()
# # A tibble: 6 x 6
# Date Group Factor1 Factor2 Spread NewVal
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2019-04-01 a 1.01 1.01 0.01 NA
# 2 2019-04-02 a 1.02 1.01 0.02 1.04
# 3 2019-04-03 a 1.03 1.01 0.03 1.05
# 4 2019-04-01 b 1.00 1.00 0.01 NA
# 5 2019-04-02 b 1.01 1.00 0.02 1.00
# 6 2019-04-03 b 1.01 1.00 0.03 1.00
data.table
On a different note, many find data.table better because of efficiencies gained from in-place modification (most of R's operations are copy-on-write, meaning some operations re-copy the object or a portion of it with each change).
library(data.table)
X <- as.data.table(x)
X[, NewVal := Factor2 * Factor1 / shift(Factor1) + (Group[1] == "a") * Spread, by = "Group"]
X
# Date Group Factor1 Factor2 Spread NewVal
# 1: 2019-04-01 a 1.0100 1.0110 0.01 NA
# 2: 2019-04-02 a 1.0200 1.0120 0.02 1.042020
# 3: 2019-04-03 a 1.0300 1.0130 0.03 1.052931
# 4: 2019-04-01 b 1.0050 1.0040 0.01 NA
# 5: 2019-04-02 b 1.0051 1.0041 0.02 1.004200
# 6: 2019-04-03 b 1.0052 1.0042 0.03 1.004300
The "in-place" part is evident on the second line here, where it appears as if the [ operation should just return a subset or something of the data ... but in this case using := causes the columns to be created (or changed) in-place.
To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
Now the data I'm actually working with has many more repetitions, namely for every country, year, and category, there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough just to have one row for each country, year, and category, and just trivially apply the sum() function on each row. This will still exhibit the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors but that didn't change the behavior. I read something on the na.action argument but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about trying to turn all the NAs to 0s, and I can't think of what that would hurt but it feels like a hack that might bite me later on--not sure. But if I try to do it, I'm not sure how I would. When I wrote a function with the is.na() function in it, it didn't apply the if (is.na(x)) test in a vectorized way and gave the error that it would just use the first element of the vector. I thought about perhaps using lapply() on the column and coercing it back to a vector and sticking that in the column, but that also sounds kind of hacky and needlessly round-about.
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
Since you already load dplyr before building your data, you can use the dplyr::summarise function. Unlike aggregate(), group_by() followed by summarise() keeps NA as a group of its own.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for the same group. Hence, the number of summarized rows is the same as the number of original rows.
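If staying with aggregate() is preferred, one possible base-R route is to make NA an explicit factor level with addNA() before grouping. This is a sketch rather than part of the answer above: the PopGroup column is a hypothetical helper, and set.seed() is added only to make the example reproducible.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country, Year = year, Category = categ,
                 Population = pop, Money = money)

# Plain aggregate() silently drops the rows whose grouping value is NA
dropped <- aggregate(Money ~ Country + Year + Category + Population, df, sum)

# addNA() turns NA into a real factor level, so those rows are kept
df$PopGroup <- addNA(df$Population)
res <- aggregate(Money ~ Country + Year + Category + PopGroup, df, sum)

# Turn the grouping factor back into numbers; the NA level becomes NA again
res$Population <- as.numeric(as.character(res$PopGroup))
res$PopGroup <- NULL

nrow(dropped)  # 16
nrow(res)      # 18
```

Plain factor() drops NA from the levels by default, which may be why converting Population to a factor did not change anything. Note that the round trip through character keeps about 15 significant digits of Population.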
I am trying to learn R and I have a data frame which contains 68 continuous and categorical variables. I need help with two of them, x and lnx. Wherever x is 0 or NA, lnx shows NA. I want to write code that takes log(x+1) so that those NAs in lnx become 0 where the corresponding x is also 0 (if x == 0, I want lnx == 0; if x is NA, I want lnx to stay NA). The data frame looks something like this:
a b c d e f x lnx
AB1001 1.00 3.00 67.00 13.90 2.63 1776.7 7.48
AB1002 0.00 2.00 72.00 38.70 3.66 0.00 NA
AB1003 1.00 3.00 48.00 4.15 1.42 1917 7.56
AB1004 0.00 1.00 70.00 34.80 3.55 NA NA
AB1005 1.00 1.00 34.00 3.45 1.24 3165.45 8.06
AB1006 1.00 1.00 14.00 7.30 1.99 NA NA
AB1007 0.00 3.00 53.00 11.20 2.42 0.00 NA
I tried writing the following code -
data.frame$lnx[is.na(data.frame$lnx)] <- log(data.frame$x +1)
but I get the following warning message and the output is wrong:
number of items to replace is not a multiple of replacement length
Can someone guide me please?
Thanks.
In R you can select rows using conditionals and assign values directly. In your example you could do this:
df[which(is.na(df$lnx) & df$x == 0),'lnx'] <- 0
Here's what this does:
is.na(df$lnx) returns a logical vector the length of df$lnx telling, for each row, whether lnx is NA. df$x == 0 does the same thing, checking whether, for each row, x == 0. By using the & operator, we combine those vectors into one that contains TRUE only for rows where both conditions are TRUE. Because x itself contains NAs, the combined vector is NA for those rows, and a data frame does not accept NA subscripts in an assignment, so which() is used to keep only the positions that are TRUE.
We then use the bracket notation to select the lnx column of those rows where both conditions are TRUE in df and then insert the value 0 into those cells using <-
The specific warning you're getting is because log(data.frame$x +1) and df$lnx[is.na(df$lnx)] are different lengths. log(data.frame$x +1) produces a vector whose length is the number of rows of your data frame, while the length of df$lnx[is.na(df$lnx)] is the number of rows that have NA in lnx.
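The length mismatch and the conditional assignment can both be reproduced on the two relevant columns. The sketch below rebuilds only x and lnx from the question's sample; idx is an illustrative name.

```r
df <- data.frame(x   = c(1776.7, 0, 1917, NA, 3165.45, NA, 0),
                 lnx = c(7.48, NA, 7.56, NA, 8.06, NA, NA))

# Seven replacement values are offered for only four NA slots
length(log(df$x + 1))   # 7
sum(is.na(df$lnx))      # 4

# which() drops the NA entries that x == 0 produces for missing x
idx <- which(is.na(df$lnx) & df$x == 0)
idx                     # 2 7

df[idx, "lnx"] <- 0
df
```

Rows where x is NA keep lnx as NA, and rows that already had a value for lnx are untouched.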
Using a dplyr solution:
library(dplyr)
df %>%
mutate(lnx = case_when(
x == 0.0 ~ 0,
is.na(x) ~ NA_real_))
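This yields for your example (rows that match neither condition fall through to NA, so add a final TRUE ~ lnx case if the existing lnx values should be kept):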
# A tibble: 7 x 8
a b c d e f x lnx
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AB1001 1. 3. 67. 13.9 2.63 1777. NA
2 AB1002 0. 2. 72. 38.7 3.66 0. 0.
3 AB1003 1. 3. 48. 4.15 1.42 1917. NA
4 AB1004 0. 1. 70. 34.8 3.55 NA NA
5 AB1005 1. 1. 34. 3.45 1.24 3165. NA
6 AB1006 1. 1. 14. 7.30 1.99 NA NA
7 AB1007 0. 3. 53. 11.2 2.42 0. 0.
First off, I am learning to use dplyr after having used base-r for most of my career (not really a data analyst, but trying to learn). I don't know if dplyr is the best option for this, or if I should use something else.
I have a data file generated by a piece of equipment that is very messy. There are header/tombstone data embedded within the data (time/date/location/sensor data for a specific location between rows of data for that location). The files are relatively large (150,000 observations x 14 variables), and I have successfully used dplyr to separate the actual data from the tombstone data (tombstone data has 6 rows of information spread over the 14 columns).
I am trying to create a single row of the tombstone information to append to the actual measurements so that it can be easily readable in R for analysis without relying on a "blackbox" solution from the manufacturer.
A sample of the data file and my script is provided below:
# Read csv file of data into R (read_csv() comes from readr, loaded with the tidyverse)
library(tidyverse)
data <- read_csv("data.csv", col_names = FALSE)
data
# A tibble: 155,538 x 14
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA 80.00 19.00 0.00 37.0 1.0 0.0 3.00 NA NA NA NA NA NA
2 1.4e+01 8.00 6.00 13.00 43.0 9.0 33.0 50.00 1.00 -1.60 -2.00 50.10 14.88 NA
3 5.9e-01 5.15 2.02 -0.57 0.0 0.0 0.0 0.00 24.58 28.02 25.64 25.37 NA NA
4 0.0e+00 0.00 0.00 0.00 0.0 NA NA NA NA NA NA NA NA NA
5 3.0e+04 30000.00 -32768.00 -32768.00 0.0 NA NA NA NA NA NA NA NA NA
6 0.0e+00 0.00 0.00 0.00 0.0 0.0 0.0 0.25 20.30 NA NA NA NA NA
7 3.7e+01 cm BT counts 1.0 0.1 NA NA NA NA NA NA NA NA
8 NA 0.25 13.30 145.46 7.5 -11.0 2.1 0.80 157.00 149.00 158.00 143.00 100.00 2147483647
9 NA 0.35 13.37 144.54 7.8 -10.9 2.4 -0.40 153.00 150.00 148.00 146.00 100.00 2147483647
10 NA 0.45 14.49 144.65 8.4 -11.8 1.8 -0.90 139.00 156.00 151.00 152.00 100.00 2147483647
# ... with 155,528 more rows
# Get header information from file and create index(ens) of header information to later append header data to each line of measured data
header <- data %>%
filter(!is.na(data[,1])) %>%
mutate_all(as.character) %>%
mutate(ens = rep(1:(nrow(header)/6), each = 6)) %>%
group_by(ens)
n.head <- bind_cols(header[header$ens == 1,][1,], header[header$ens == 1,][2,], header[header$ens == 1,][3,], header[header$ens == 1,][4,], header[header$ens == 1,][5,], header[header$ens == 1,][6,])
Rows 2:7 have the information I am trying to work with. I know that creating a row of 90+ variables is not ideal, but this is a first step in cleaning this data up so that I can then work with it.
The last line with n.head is what I am hoping to end up with, without needing to write a loop to run that ~20,000 times... Any help would be appreciated, thank you in advance for input!
The trick here is to use tidyr::spread() and tibble::enframe() to get the header columns spread out into a single row data frame.
library(tidyverse)
# rows 2:7 hold the header block (note the comma: rows, not columns)
header <- data[2:7, ] %>%
  # convert the data frame to a vector
  t %>%
  as.vector %>%
  # then change it back into a single row data frame that's in long format
  enframe %>%
  # then push that back into a wide format, ie. 1 row and a bajillion columns
  spread(name, value)
# the measured data start right after the six header rows
actualdata <- data[8:nrow(data), ]
# replicate the row as many times as you have data
header <- header[rep(1, nrow(actualdata)), ]
# use bind_cols() to glue your header rows onto each row of the actual data
actualdata <- actualdata %>%
  bind_cols(header)
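The flattening step can also be tried on a tiny block in base R before running it on the real file. Everything in this sketch is hypothetical: block stands in for a two-row header block and measured for the rows of actual data.

```r
# Hypothetical miniature of a header block: 2 rows x 2 columns of text
block <- data.frame(X1 = c("14", "0.59"), X2 = c("8.00", "5.15"),
                    stringsAsFactors = FALSE)

# t() gives a character matrix; as.vector() reads it column by column,
# which after the transpose means row by row of the original block
flat <- as.vector(t(block))
flat

# transposing the vector gives a 1-row matrix, hence a 1-row data frame
one_row <- as.data.frame(t(flat), stringsAsFactors = FALSE)

# hypothetical measured rows that the header should be attached to
measured <- data.frame(depth = c(0.25, 0.35, 0.45))

# repeat the single header row once per measured row and glue them together
combined <- cbind(measured, one_row[rep(1, nrow(measured)), ])
combined
```

The columns come out named V1, V2 and so on, so they would still need meaningful names before analysis.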