Transformation of missing values by taking log(x+1) - r

I am trying to learn R and I have a data frame which contains 68 continuous and categorical variables. There are two variables -> x and lnx, on which I need help. Corresponding to a large number of 0's & NA's in x, lnx shows NA. Now, I want to write a code through which I can take log(x+1) in order to replace those NA's in lnx to 0, where corresponding x is also 0 (if x == 0, then I want only lnx == 0, if x == NA, I want lnx == NA). Data frame looks something like this -
a b c d e f x lnx
AB1001 1.00 3.00 67.00 13.90 2.63 1776.7 7.48
AB1002 0.00 2.00 72.00 38.70 3.66 0.00 NA
AB1003 1.00 3.00 48.00 4.15 1.42 1917 7.56
AB1004 0.00 1.00 70.00 34.80 3.55 NA NA
AB1005 1.00 1.00 34.00 3.45 1.24 3165.45 8.06
AB1006 1.00 1.00 14.00 7.30 1.99 NA NA
AB1007 0.00 3.00 53.00 11.20 2.42 0.00 NA
I tried writing the following code -
data.frame$lnx[is.na(data.frame$lnx)] <- log(data.frame$x +1)
but I get the following warning message and the output is wrong:
number of items to replace is not a multiple of replacement length. Can someone guide me please.
Thanks.

In R you can select rows using conditionals and assign values directly. In you example you could do this:
df[is.na(df$lnx) & df$x == 0,'lnx'] <- 0
Here's what this does:
is.na(df$lnx) returns a logical vector the length of df$lnx telling, for each row, whether lnx is NA. df$x == 0 does the same thing, checking whether, for each row, x == 0. By using the & operator, we combine those vectors into one that contains TRUE only for rows where both conditions are TRUE.
We then use the bracket notation to select the lnx column of those rows where both conditions are TRUE in df and then insert the value 0 into those cells using <-
The specific error your getting is because log(data.frame$x +1) and df$lnx[is.na(df$lnx)] are different lengths. log(data.frame$x +1) produces a vector whose length is the number of rows of your data frame while the length of df$lnx[is.na(df$lnx)] is the number of rows that have NA in lnx

Using a dplyr solution:
library(dplyr)
df %>%
mutate(lnx = case_when(
x == 0.0 ~ 0,
is.na(x) ~ NA_real_))
This yields for your example:
# A tibble: 7 x 8
a b c d e f x lnx
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AB1001 1. 3. 67. 13.9 2.63 1777. NA
2 AB1002 0. 2. 72. 38.7 3.66 0. 0.
3 AB1003 1. 3. 48. 4.15 1.42 1917. NA
4 AB1004 0. 1. 70. 34.8 3.55 NA NA
5 AB1005 1. 1. 34. 3.45 1.24 3165. NA
6 AB1006 1. 1. 14. 7.30 1.99 NA NA
7 AB1007 0. 3. 53. 11.2 2.42 0. 0.

Related

Filter data frame to remove some repeated factors with different sign

I have one big data frame with different columns like name, position, expression level, q value and so on, and i have many repeats for most of the objects with same name but different expression levels, so I want to filter them if expression levels are in opposite of each other for example up(+) and down (-) regulated values, omit and remove those, but if it finds repeats with different expressions but all up (+) or all down (-) regulated, keep them.
here is an example of my file:
df1<-data.frame(gene.name=c( "DEC1","DEC1","DEC1","ATP","ANXA2","ANXA1","ANXA1","ANXA1"),
expression.level=c(2.01,0.5,-1.56,3.1,0.67,0.1,1.2,3),
q.value=c(0.001,0.002,0.0001,0.9,0.00001,0.9,0.0002,0.002))
and output like this:
output<-data.frame(gene.name=c( "ATP","ANXA2","ANXA1","ANXA1","ANXA1"),
expression.level=c(3.1,0.67,0.1,1.2,3),
q.value=c(0.9,0.00001,0.9,0.0002,0.002))
Thanks in advance for your help.
We can use sign() to check whether they are positive or negative or zero. Then use filter to include those that have the same sign.
library(dplyr)
df1 %>%
group_by(gene.name) %>%
filter(length(unique(sign(expression.level))) == 1) %>%
ungroup()
gene.name expression.level q.value
1 ATP 3.10 9e-01
2 ANXA2 0.67 1e-05
3 ANXA1 0.10 9e-01
4 ANXA1 1.20 2e-04
5 ANXA1 3.00 2e-03
Using ave you can do this with a one-liner.
df1[with(df1, ave(expression.level, gene.name, FUN=\(x) length(unique(sign(x))))) == 1, ]
# gene.name expression.level q.value
# 4 ATP 3.10 9e-01
# 5 ANXA2 0.67 1e-05
# 6 ANXA1 0.10 9e-01
# 7 ANXA1 1.20 2e-04
# 8 ANXA1 3.00 2e-03
Using data.table
library(data.table)
setDT(df1)[df1[, .I[uniqueN(sign(expression.level)) == 1], gene.name]$V1]
-output
gene.name expression.level q.value
<char> <num> <num>
1: ATP 3.10 9e-01
2: ANXA2 0.67 1e-05
3: ANXA1 0.10 9e-01
4: ANXA1 1.20 2e-04
5: ANXA1 3.00 2e-03

Swapping data frame values randomly between different deciles of the data frame

Its a bit complicated to explain, so I hope it is clear enough, but if not I'll try and expand more.
So I have a data-frame like this:
df <- data.frame(index=sort(runif(300, -10,10)), v1=runif(300, -2,-1), v2=runif(300, 1,2))
It gives us a 3-column 300-row df. The first column ("index") contains sorted values from -10 to 10, and the next two columns ("v1"/"v2") contain random numeric values that are not important for this issue.
Now I classify my df rows into deciles according to the index column, (e.g. decile 1: places 1-30, decile 2: places 31-60) and I want to swap randomly between the rows such that all the 1st decile values are swapped randomly with the 6th decile, all 2nd decile values are swapped randomly with the 7th decile, and so on. When I say swapped I mean that the index value remains in its place but the v1 and v2 values are swapped (still coupled) with the v1 and v2 of a random row in the appropriate decile.
For example, the v1 and v2 of the first row in the df (and thus from the 1st decile), will be swapped with the v1 and v2 of the 160th row in the df (6th decile), the v1 and v2 of the second row in the df (1st decile) will be swapped with the v1 and v2 of the 175th row in the df (also 6th decile), the v1 and v2 of the 31st row in the df (2nd decile) will be swapped with the v1 and v2 of the 186th row in the df (7th decile) and so on so all of the v1+v2 values have changed places randomly to their appropriate new decile.
Hope it's clear. I've been trying to solve it for hours and couldn't figure it out.
Thanks
Using order() to sort by two indices, one being the rearranged deciles, the other one random.
set.seed(123)
dtf <- data.frame(round(cbind(index=sort(runif(20, -10, 10)),
v1=runif(20, 0, 5),
v2=runif(20, 5, 10)), 2))
ea <- nrow(dtf)/10
# Deciles shifted by 5
d <- rep(((1:10 + 4) %% 10) + 1, each=ea)
# Random index within decile
r <- c(replicate(10, sample(ea)))
cbind(dtf, z=dtf[order(d, r), -1])
# index v1 v2 z.v1 z.v2
# 12 -9.16 4.45 5.71 4.51 7.21
# 11 -9.09 3.46 7.07 4.82 5.23
# 14 -7.94 3.20 7.07 3.98 5.61
# 13 -5.08 4.97 6.84 3.45 8.99
# 15 -4.25 3.28 5.76 0.12 7.80
# 16 -3.44 3.54 5.69 2.39 6.03
# 17 -1.82 2.72 6.17 3.79 5.64
# 18 -0.93 2.97 7.33 1.08 8.77
# 19 -0.87 1.45 6.33 1.59 9.48
# 20 0.56 0.74 9.29 1.16 6.87
# 2 1.03 4.82 5.23 3.46 7.07
# 1 1.45 4.51 7.21 4.45 5.71
# 3 3.55 3.45 8.99 3.20 7.07
# 4 5.77 3.98 5.61 4.97 6.84
# 6 7.66 0.12 7.80 3.54 5.69
# 5 7.85 2.39 6.03 3.28 5.76
# 8 8.00 3.79 5.64 2.97 7.33
# 7 8.81 1.08 8.77 2.72 6.17
# 10 9.09 1.59 9.48 0.74 9.29
# 9 9.14 1.16 6.87 1.45 6.33
I think that this is what you need.
swapByBlocks <- function(df, blockSize = 30, nblocks = 10){
if((nrow(df) != blockSize*nblocks) || nblocks %%2) stop("Undefined behaviour")
swappedDF <- df[c((nrow(df)/2 +1):nrow(df), 1:(nrow(df)/2)),]
ndxMat <- sapply(1:(nblocks/2),function(dummy) sample(1:blockSize))
for(i in 1:ncol(ndxMat)) {
swappedDF[(i-1)*blockSize + 1:blockSize, ] <- swappedDF[((i-1)*blockSize + 1:blockSize)[ndxMat[,i]], ]
swappedDF[(i+nblocks/2-1)*blockSize + 1:blockSize, ] <- swappedDF[((i+nblocks/2-1)*blockSize + 1:blockSize)[order(ndxMat[,i])], ]
}
return(swappedDF)
}
A small case where you can check how it works:
res <- swapByBlocks(df[1:18,], blockSize = 3, nblocks = 6)
> df[1:18,]
index v1 v2
1 -9.859624 -1.657779 1.954094
2 -9.774898 -1.015825 1.006341
3 -9.624402 -1.713754 1.527065
4 -9.441129 -1.891834 1.803793
5 -9.424195 -1.125674 1.581199
6 -8.890537 -1.142044 1.219111
7 -8.838012 -1.173445 1.013408
8 -8.296938 -1.780396 1.570550
9 -8.172076 -1.789056 1.178596
10 -7.671897 -1.988539 1.690468
11 -7.655868 -1.095662 1.876414
12 -7.450011 -1.337443 1.632104
13 -7.204528 -1.880350 1.408944
14 -7.085862 -1.232293 1.593247
15 -7.030691 -1.087031 1.924306
16 -6.989892 -1.639967 1.495058
17 -6.978945 -1.395340 1.872944
18 -6.930379 -1.841031 1.061046
> res
index v1 v2
10 -7.450011 -1.337443 1.632104
11 -7.655868 -1.095662 1.876414
12 -7.671897 -1.988539 1.690468
13 -7.030691 -1.087031 1.924306
14 -7.085862 -1.232293 1.593247
15 -7.204528 -1.880350 1.408944
16 -6.989892 -1.639967 1.495058
17 -6.930379 -1.841031 1.061046
18 -6.978945 -1.395340 1.872944
1 -9.624402 -1.713754 1.527065
2 -9.774898 -1.015825 1.006341
3 -9.859624 -1.657779 1.954094
4 -8.890537 -1.142044 1.219111
5 -9.424195 -1.125674 1.581199
6 -9.441129 -1.891834 1.803793
7 -8.838012 -1.173445 1.013408
8 -8.172076 -1.789056 1.178596
9 -8.296938 -1.780396 1.570550
>
Here there are 18 rows with six blocks of three numbers each. Rows 1 to 3 get swapped with rows 10 to 12, rows 4 to 6 with rows 13 to 15 and rows 4
7 to 9 with rows 16 to 17.

R: How to aggregate with NA values

To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
Now the data I'm actually working with has many more repetitions, namely for every country, year, and category, there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough just to have one row for each country, year, and category, and just trivially apply the sum() function on each row. This will still exhibit the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors but that didn't change the behavior. I read something on the na.action argument but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about trying to turn all the NAs to 0s, and I can't think of what that would hurt but it feels like a hack that might bite me later on--not sure. But if I try to do it, I'm not sure how I would. When I wrote a function with the is.na() function in it, it didn't apply the if (is.na(x)) test in a vectorized way and gave the error that it would just use the first element of the vector. I thought about perhaps using lapply() on the column and coercing it back to a vector and sticking that in the column, but that also sounds kind of hacky and needlessly round-about.
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
As you have already mentioned dplyr before your data, you can use dplyr::summarise function. The summarise function supports grouping on NA values.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for same groups. Hence, number of summarized rows will be same as actual rows.

ifelse() nested statements in summarize function in dplyr R

I am trying to summarise a dataframe based on grouping by label column. I want to obtain means based on the following conditions:
- if all numbers are NA - then I want to return NA
- if mean of all the numbers is 1 or lower - I want to return 1
- if mean of all the numbers is higher than 1 - I want a mean of the values in the group that are greater than 1
- all the rest should be 100.
Managed to find the answer and now my code is running well - is.na() should be there instead of ==NA in the first ifelse() statement and that was the issue.
label <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7)
sev <- c(NA,NA,NA,NA,1,0,1,1,1,NA,1,2,2,4,5,1,0,1,1,4,5)
Data2 <- data.frame(label,sev)
d <- Data2 %>%
group_by(label) %>%
summarize(sevmean = ifelse(is.na(mean(sev,na.rm=TRUE)),NA,
ifelse(mean(sev,na.rm=TRUE)<=1,1,
ifelse(mean(sev,na.rm=TRUE)>1,
mean(sev[sev>1],na.rm=TRUE),100))))
Your first condition is the issue here. If we remove the nested ifelse and keep only the first one, we get the same output
Data2 %>%
group_by(label) %>%
summarise(sevmean = ifelse(mean(sev,na.rm=TRUE)==NaN,NA,1))
# label sevmean
# <dbl> <lgl>
#1 1.00 NA
#2 2.00 NA
#3 3.00 NA
#4 4.00 NA
#5 5.00 NA
#6 6.00 NA
#7 7.00 NA
I am not sure why you are checking NaN but if you want to do that , check it with is.nan instead of ==
Data2 %>%
group_by(label) %>%
summarize(sevmean = ifelse(is.nan(mean(sev,na.rm=TRUE)),NA,
ifelse(mean(sev,na.rm=TRUE)<=1,1,
ifelse(mean(sev,na.rm=TRUE)>1,
mean(sev[sev>1],na.rm=TRUE),100))))
# label sevmean
# <dbl> <dbl>
#1 1.00 NA
#2 2.00 1.00
#3 3.00 1.00
#4 4.00 2.00
#5 5.00 3.67
#6 6.00 1.00
#7 7.00 4.50

Using dplyr to manipulate header information imbedded in csv

First off, I am learning to use dplyr after having used base-r for most of my career (not really a data analyst, but trying to learn). I don't know if dplyr is the best option for this, or if I should use something else.
I have a data file generated by a piece of equipment that is very messy. There are header/tombstone data embedded within the data (time/date/location/sensor data for a specific location between rows of data for that location). The files are relatively large (150,000 observations x 14 variables), and I have successfully used dplyr to separate the actual data from the tombstone data (tombstone data has 6 rows of information spread over the 14 columns).
I am trying to create a single row of the tombstone information to append to the actual measurements so that it can be easily readable in R for analysis without relying on a "blackbox" solution from the manufacturer.
a sample of the data file and my script is provided below:
# Read csv file of data into R
data <- read_csv("data.csv", col_names = FALSE)
data
# A tibble: 155,538 x 14
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA 80.00 19.00 0.00 37.0 1.0 0.0 3.00 NA NA NA NA NA NA
2 1.4e+01 8.00 6.00 13.00 43.0 9.0 33.0 50.00 1.00 -1.60 -2.00 50.10 14.88 NA
3 5.9e-01 5.15 2.02 -0.57 0.0 0.0 0.0 0.00 24.58 28.02 25.64 25.37 NA NA
4 0.0e+00 0.00 0.00 0.00 0.0 NA NA NA NA NA NA NA NA NA
5 3.0e+04 30000.00 -32768.00 -32768.00 0.0 NA NA NA NA NA NA NA NA NA
6 0.0e+00 0.00 0.00 0.00 0.0 0.0 0.0 0.25 20.30 NA NA NA NA NA
7 3.7e+01 cm BT counts 1.0 0.1 NA NA NA NA NA NA NA NA
8 NA 0.25 13.30 145.46 7.5 -11.0 2.1 0.80 157.00 149.00 158.00 143.00 100.00 2147483647
9 NA 0.35 13.37 144.54 7.8 -10.9 2.4 -0.40 153.00 150.00 148.00 146.00 100.00 2147483647
10 NA 0.45 14.49 144.65 8.4 -11.8 1.8 -0.90 139.00 156.00 151.00 152.00 100.00 2147483647
# ... with 155,528 more rows
# Get header information from file and create index(ens) of header information to later append header data to each line of measured data
header <- data %>%
filter(!is.na(data[,1])) %>%
mutate_all(as.character) %>%
mutate(ens = rep(1:(nrow(header)/6), each = 6)) %>%
group_by(ens)
n.head <- bind_cols(header[header$ens == 1,][1,], header[header$ens == 1,][2,], header[header$ens == 1,][3,], header[header$ens == 1,][4,], header[header$ens == 1,][5,], header[header$ens == 1,][6,])
Rows 2:7 have the information I am trying to work with, I know that creating a row of 90+ variables is not ideal, but this is a first step in cleaning this data up so that I can then work with it.
the last row with n.head is what I am hoping to end up with, without needing to write a loop to run that ~20,000 times... Any help would be appreciated, thank you in advance for input!
The trick here is to use tidy::spread() and tibble::enframe to get the header columns spread out into a single row data frame.
library(tidyverse)
header <- data[2:7] %>%
# convert the data frame to a vector
t %>%
as.vector %>%
# then change it back into a single row data frame that's in long format
enframe %>%
# then push that back into a wide format, ie. 1 row and a bajillion columns
spread(name, value)
# replicate the row as many times as you have data
header[2:nrow(actualdata,] <- header
#use bind_cols() to glue your header rows onto each row of the actual data
actualdata <- data[7:nrow(data),] %>%
bind_cols(foo)

Resources