First off, I am learning to use dplyr after having used base R for most of my career (I'm not really a data analyst, but trying to learn). I don't know if dplyr is the best option for this or if I should use something else.
I have a data file generated by a piece of equipment that is very messy. There are header/tombstone data embedded within the data (time/date/location/sensor data for a specific location between rows of data for that location). The files are relatively large (150,000 observations x 14 variables), and I have successfully used dplyr to separate the actual data from the tombstone data (tombstone data has 6 rows of information spread over the 14 columns).
I am trying to create a single row of the tombstone information to append to the actual measurements so that it can be easily readable in R for analysis without relying on a "blackbox" solution from the manufacturer.
A sample of the data file and my script are provided below:
# Read csv file of data into R
data <- read_csv("data.csv", col_names = FALSE)
data
# A tibble: 155,538 x 14
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA 80.00 19.00 0.00 37.0 1.0 0.0 3.00 NA NA NA NA NA NA
2 1.4e+01 8.00 6.00 13.00 43.0 9.0 33.0 50.00 1.00 -1.60 -2.00 50.10 14.88 NA
3 5.9e-01 5.15 2.02 -0.57 0.0 0.0 0.0 0.00 24.58 28.02 25.64 25.37 NA NA
4 0.0e+00 0.00 0.00 0.00 0.0 NA NA NA NA NA NA NA NA NA
5 3.0e+04 30000.00 -32768.00 -32768.00 0.0 NA NA NA NA NA NA NA NA NA
6 0.0e+00 0.00 0.00 0.00 0.0 0.0 0.0 0.25 20.30 NA NA NA NA NA
7 3.7e+01 cm BT counts 1.0 0.1 NA NA NA NA NA NA NA NA
8 NA 0.25 13.30 145.46 7.5 -11.0 2.1 0.80 157.00 149.00 158.00 143.00 100.00 2147483647
9 NA 0.35 13.37 144.54 7.8 -10.9 2.4 -0.40 153.00 150.00 148.00 146.00 100.00 2147483647
10 NA 0.45 14.49 144.65 8.4 -11.8 1.8 -0.90 139.00 156.00 151.00 152.00 100.00 2147483647
# ... with 155,528 more rows
# Get header information from file and create index(ens) of header information to later append header data to each line of measured data
header <- data %>%
  filter(!is.na(X1)) %>%
  mutate_all(as.character) %>%
  mutate(ens = rep(1:(n() / 6), each = 6)) %>%
  group_by(ens)

n.head <- bind_cols(header[header$ens == 1, ][1, ],
                    header[header$ens == 1, ][2, ],
                    header[header$ens == 1, ][3, ],
                    header[header$ens == 1, ][4, ],
                    header[header$ens == 1, ][5, ],
                    header[header$ens == 1, ][6, ])
Rows 2:7 have the information I am trying to work with. I know that creating a row of 90+ variables is not ideal, but this is a first step in cleaning this data up so that I can then work with it.
The last line with n.head is what I am hoping to end up with, without needing to write a loop and run it ~20,000 times... Any help would be appreciated; thank you in advance for your input!
The trick here is to use tidyr::spread() and tibble::enframe() to get the header columns spread out into a single-row data frame.
library(tidyverse)
header <- data[2:7, ] %>%
  # transpose so that reading down the matrix walks the 6 header rows in order
  t() %>%
  # flatten the matrix into a single vector
  as.vector() %>%
  # then change it back into a two-column (name/value) data frame in long format
  enframe() %>%
  # then push that back into a wide format, i.e. 1 row and a bajillion columns
  spread(name, value)

# the actual measurements start after the 6 header rows
actualdata <- data[8:nrow(data), ]
# replicate the header row as many times as you have data rows
header <- header[rep(1, nrow(actualdata)), ]
# use bind_cols() to glue the header columns onto each row of the actual data
actualdata <- bind_cols(actualdata, header)
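To see the reshaping trick in isolation, here is a minimal sketch on a toy 2 x 3 header block (assuming the tidyverse is loaded; the numeric column names 1-6 are generated by enframe()):

```r
library(tidyverse)

# toy stand-in for a multi-row header block
h <- tibble(X1 = c("a", "b"), X2 = c("c", "d"), X3 = c("e", "f"))

h %>%
  t() %>%          # 3 x 2 matrix; reading down its columns walks h row by row
  as.vector() %>%  # "a" "c" "e" "b" "d" "f"
  enframe() %>%    # long format: name = 1:6, value = the six strings
  spread(name, value)
# A tibble: 1 x 6 -- a single row holding all six header values
```

The same pipeline scales to the real 6 x 14 header block: it just produces one row with 84 columns instead of 6.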
I have a data frame of blood test markers results and I want to fill in the NA's by the following criteria:
For each ID group (TIME is in ascending order), if a marker value is NA, fill it with the closest non-NA value in that group (past or future), but only if the time difference is less than 14 days.
Here is an example of my data:
df <- data.frame(ID = c(rep(2, 5), rep(4, 3)),
                 TIME = c(1, 22, 33, 43, 85, -48, 1, 30),
                 CEA = c(1.32, 1.42, 1.81, 2.33, 2.23, 29.7, 23.34, 18.23),
                 CA.15.3 = c(14.62, 14.59, 16.8, 22.34, 36.33, 56.02, 94.09, 121.5),
                 CA.125 = c(33.98, 27.56, 30.31, NA, 39.57, 1171.00, 956.50, 825.30),
                 CA.19.9 = c(6.18, 7.11, 5.72, NA, 7.38, 39.30, 118.20, 98.26),
                 CA.72.4 = c(rep(NA, 5), 1.32, NA, NA),
                 NSE = c(NA, 13.21, rep(NA, 6)))
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA NA
2 43 2.33 22.34 NA NA NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
ID is the patient.
The TIME is the time of the blood test.
The others are the markers.
The only way I could do it is with loops which I try to avoid as much as possible.
I expect the output to be:
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA 13.21
2 43 2.33 22.34 30.31 5.72 NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
CA.19.9 and CA.125 are filled with the previous value (10 days before).
NSE is filled with the previous value (11 days before).
CA.72.4 is not filled, since the 1.32 measurement at time -48 is 49 days from the next measurement.
I bet there is a much simpler, vectorized solution, but the following works.
fill_NA <- function(DF){
  sp <- split(DF, DF$ID)
  sp <- lapply(sp, function(DF){
    d <- diff(DF$TIME)
    i_diff <- c(FALSE, d < 14)
    res <- sapply(DF[-(1:2)], function(X){
      inx <- i_diff & is.na(X)
      if(any(inx)){
        inx <- which(inx)
        last_change <- -1
        for(i in inx){
          if(i > last_change + 1){
            if(i == 1){
              X[i] <- X[i + 1]
            }else{
              X[i] <- X[i - 1]
            }
            last_change <- i
          }
        }
      }
      X
    })
    cbind(DF[1:2], res)
  })
  res <- do.call(rbind, sp)
  row.names(res) <- NULL
  res
}

fill_NA(df)
# ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
#1 2 1 1.32 14.62 33.98 6.18 NA NA
#2 2 22 1.42 14.59 27.56 7.11 NA 13.21
#3 2 33 1.81 16.80 30.31 5.72 NA 13.21
#4 2 43 2.33 22.34 30.31 5.72 NA NA
#5 2 85 2.23 36.33 39.57 7.38 NA NA
#6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA
#7 4 1 23.34 94.09 956.50 118.20 NA NA
#8 4 30 18.23 121.50 825.30 98.26 NA NA
Yes, you can have a vectorized solution. First, let us consider the case in which you only impute using the future value. You need to create a few auxiliary variables:
a variable that tells you whether the next observation belongs to the same ID (so it can be used to impute),
a variable that tells you whether the next observation is less than 14 days apart from the current one.
These do not depend on the specific variable you want to impute. For each variable to be imputed you will also need a variable that tells you whether the next value is missing.
Then you can vectorize the following logic: when the next observation has the same ID, is less than 14 days from the current one, and is not missing, copy its value into the current one.
Things get more complicated when you need to decide whether to use the past or the future value, but the logic is the same. The code is below; it is a bit long, but you can simplify it. I just wanted to be clear about what it does.
Hope this helps.
x <- data.frame(ID = c(rep(2, 5), rep(4, 3)),
                TIME = c(1, 22, 33, 43, 85, -48, 1, 30),
                CEA = c(1.32, 1.42, 1.81, 2.33, 2.23, 29.7, 23.34, 18.23),
                CA.15.3 = c(14.62, 14.59, 16.8, 22.34, 36.33, 56.02, 94.09, 121.5),
                CA.125 = c(33.98, 27.56, 30.31, NA, 39.57, 1171.00, 956.50, 825.30),
                CA.19.9 = c(6.18, 7.11, 5.72, NA, 7.38, 39.30, 118.20, 98.26),
                CA.72.4 = c(rep(NA, 5), 1.32, NA, NA),
                NSE = c(NA, 13.21, rep(NA, 6)))
### these are the columns we want to input
cols.to.impute <- colnames(x)[! colnames(x) %in% c("ID","TIME")]
### is the next id the same?
x$diffidf <- NA
x$diffidf[1:(nrow(x)-1)] <- diff(x$ID)
x$diffidf[x$diffidf > 0] <- NA
### is the previous id the same?
x$diffidb <- NA
x$diffidb[2:nrow(x)] <- diff(x$ID)
x$diffidb[x$diffidb > 0] <- NA
### diff in time with next observation
x$difftimef <- NA
x$difftimef[1:(nrow(x)-1)] <- diff(x$TIME)
### diff in time with previous observation
x$difftimeb <- NA
x$difftimeb[2:nrow(x)] <- diff(x$TIME)
### if next (previous) id is not the same time difference is not meaningful
x$difftimef[is.na(x$diffidf)] <- NA
x$difftimeb[is.na(x$diffidb)] <- NA
### we do not need diffid anymore (due to previous statement)
x$diffidf <- x$diffidb <- NULL
### if next (previous) point in time is more than 14 days it is not useful for imputation
x$difftimef[abs(x$difftimef) > 14] <- NA
x$difftimeb[abs(x$difftimeb) > 14] <- NA
### create variable usef that tells us whether we should attempt to use the forward observation for imputation
### it is 1 only if difftime forward is less than difftime backward
x$usef <- NA
x$usef[!is.na(x$difftimef) & x$difftimef < x$difftimeb] <- 1
x$usef[!is.na(x$difftimef) & is.na(x$difftimeb)] <- 1
x$usef[is.na(x$difftimef) & !is.na(x$difftimeb)] <- 0
if (!is.na(x$usef[nrow(x)]))
  stop("\nlast observation usef is not missing\n")
### now we get into column specific operations.
for (col in cols.to.impute){
  ### we will store the results in x$imputed, and copy into x[,col] at the end
  x$imputed <- x[,col]
  ### x$usef needs to be modified depending on the specific column, so we define a local version of it
  x$usef.local <- x$usef
  ### if a variable is not missing there is no point in looking at usef.local, so we make it missing
  x$usef.local[!is.na(x[,col])] <- NA
  ### when usef.local is 1 but the next observation is missing it cannot be used for imputation, so we
  ### make it 0. but a value of 0 does not mean we can use the previous observation because that may
  ### be missing too. so first we make usef 0 and next we check the previous observation and if that
  ### is missing too we make usef missing
  x$previous.value <- c(NA, x[1:(nrow(x)-1), col])
  x$next.value <- c(x[2:nrow(x), col], NA)
  x$next.missing <- is.na(x$next.value)
  x$previous.missing <- is.na(x$previous.value)
  x$usef.local[x$next.missing & x$usef.local == 1] <- 0
  x$usef.local[x$previous.missing & x$usef.local == 0] <- NA
  ### now we can impute properly: use next value when usef.local is 1 and previous value when usef.local is 0
  tmp <- rep(FALSE, nrow(x))
  tmp[x$usef.local == 1] <- TRUE
  x$imputed[tmp] <- x$next.value[tmp]
  tmp <- rep(FALSE, nrow(x))
  tmp[x$usef.local == 0] <- TRUE
  x$imputed[tmp] <- x$previous.value[tmp]
  ### copy to column
  x[,col] <- x$imputed
}
### get rid of useless temporary stuff
x$previous.value <- x$previous.missing <- x$next.value <- x$next.missing <- x$imputed <- x$usef.local <- NULL
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE difftimef difftimeb usef
1 2 1 1.32 14.62 33.98 6.18 NA NA NA NA NA
2 2 22 1.42 14.59 27.56 7.11 NA 13.21 11 NA 1
3 2 33 1.81 16.80 30.31 5.72 NA 13.21 10 11 1
4 2 43 2.33 22.34 30.31 5.72 NA NA NA 10 0
5 2 85 2.23 36.33 39.57 7.38 NA NA NA NA NA
6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA NA NA NA
7 4 1 23.34 94.09 956.50 118.20 NA NA NA NA NA
8 4 30 18.23 121.50 825.30 98.26 NA NA NA NA NA
I am trying to figure out different temperature ranges for specific locations (CB, HK, etc.) in my data frame, which looks like this:
head(join)
OTU_num location date otus Depth DO Temperature pH Secchi.Depth
1 Otu0001 CB 03JUN09 21 0.0 7.60 21.0 3.68 NA
2 Otu0001 CB 03JUN09 21 0.5 8.27 16.4 3.68 NA
3 Otu0001 CB 03JUN09 21 1.0 7.65 14.9 3.68 NA
4 Otu0001 CB 03JUN09 21 1.5 5.26 12.2 3.25 NA
5 Otu0001 CB 03JUN09 21 2.0 4.01 10.1 3.25 NA
I am calculating the range using:
ranges <- join %>%
  group_by(location) %>%
  na.omit %>%
  mutate(min = min(Temperature), max = max(Temperature), subtract = min - max) %>%
  arrange(subtract)
Some of the temperature values are NA, so I used na.omit; however, it appears to be taking out the negative values as well, so the ranges I get are wrong.
location min max subtract
MA 0.1 27.3 -27.2
I double checked using the range function for one of the locations (there are a lot and I did not want to use range for each location)
MA <- subset(join, location=="MA")
range(MA$Temperature, na.rm = TRUE)
[1] -2.2 27.6
Why are the values different? Any help is appreciated!!!
The problem is that na.omit removes every row containing an NA in any column, not just in Temperature. In your sample, Secchi.Depth is always NA, so na.omit drops those entire rows, including ones with valid (negative) temperatures. I think you should use join %>% filter(!is.na(Temperature)) instead, so only rows whose Temperature is NA will be removed.
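To see the difference on a toy version of join (hypothetical values, with Secchi.Depth all NA as in the sample):

```r
library(dplyr)

join <- data.frame(location = c("MA", "MA"),
                   Temperature = c(-2.2, 27.6),
                   Secchi.Depth = c(NA, NA))

# na.omit() drops rows with an NA in *any* column,
# so both rows (including the -2.2) vanish here
nrow(na.omit(join))
# [1] 0

# filter() only drops rows whose Temperature is NA,
# so the negative minimum survives
join %>%
  filter(!is.na(Temperature)) %>%
  summarise(min = min(Temperature), max = max(Temperature))
```

This is exactly why range(MA$Temperature, na.rm = TRUE) and the na.omit pipeline disagree: na.rm = TRUE only ignores NA temperatures, while na.omit discards whole rows.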
I am trying to summarise a data frame by grouping on the label column. I want to obtain means based on the following conditions:
- if all numbers are NA, return NA
- if the mean of all the numbers is 1 or lower, return 1
- if the mean of all the numbers is higher than 1, return the mean of the values in the group that are greater than 1
- everything else should be 100.
I managed to find the answer and now my code is running well: is.na() should be used instead of ==NA in the first ifelse() statement, and that was the issue.
label <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7)
sev <- c(NA,NA,NA,NA,1,0,1,1,1,NA,1,2,2,4,5,1,0,1,1,4,5)
Data2 <- data.frame(label,sev)
d <- Data2 %>%
group_by(label) %>%
summarize(sevmean = ifelse(is.na(mean(sev,na.rm=TRUE)),NA,
ifelse(mean(sev,na.rm=TRUE)<=1,1,
ifelse(mean(sev,na.rm=TRUE)>1,
mean(sev[sev>1],na.rm=TRUE),100))))
Your first condition is the issue here. If we remove the nested ifelse calls and keep only the first one, we get the same output:
Data2 %>%
group_by(label) %>%
summarise(sevmean = ifelse(mean(sev,na.rm=TRUE)==NaN,NA,1))
# label sevmean
# <dbl> <lgl>
#1 1.00 NA
#2 2.00 NA
#3 3.00 NA
#4 4.00 NA
#5 5.00 NA
#6 6.00 NA
#7 7.00 NA
I am not sure why you are checking for NaN, but if you want to do that, check it with is.nan instead of ==:
Data2 %>%
group_by(label) %>%
summarize(sevmean = ifelse(is.nan(mean(sev,na.rm=TRUE)),NA,
ifelse(mean(sev,na.rm=TRUE)<=1,1,
ifelse(mean(sev,na.rm=TRUE)>1,
mean(sev[sev>1],na.rm=TRUE),100))))
# label sevmean
# <dbl> <dbl>
#1 1.00 NA
#2 2.00 1.00
#3 3.00 1.00
#4 4.00 2.00
#5 5.00 3.67
#6 6.00 1.00
#7 7.00 4.50
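A short base-R illustration of why the == comparison can never work here (a sketch):

```r
mean(c(NA, NA), na.rm = TRUE)  # NaN -- an all-NA group averages to NaN
NaN == NaN                     # NA  -- comparisons involving NaN never return TRUE
is.nan(NaN)                    # TRUE
is.na(NaN)                     # TRUE -- which is why is.na() in the accepted fix works too
```

Since ==NaN (like ==NA) propagates missingness instead of testing for it, the first ifelse branch always evaluates to NA, which then masks every other branch.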
I am trying to learn R, and I have a data frame which contains 68 continuous and categorical variables. There are two variables, x and lnx, on which I need help. Corresponding to a large number of 0s and NAs in x, lnx shows NA. Now I want to write code that takes log(x + 1) in order to replace those NAs in lnx with 0 where the corresponding x is also 0 (if x == 0, I want lnx == 0; if x is NA, I want lnx to stay NA). The data frame looks something like this:
a b c d e f x lnx
AB1001 1.00 3.00 67.00 13.90 2.63 1776.7 7.48
AB1002 0.00 2.00 72.00 38.70 3.66 0.00 NA
AB1003 1.00 3.00 48.00 4.15 1.42 1917 7.56
AB1004 0.00 1.00 70.00 34.80 3.55 NA NA
AB1005 1.00 1.00 34.00 3.45 1.24 3165.45 8.06
AB1006 1.00 1.00 14.00 7.30 1.99 NA NA
AB1007 0.00 3.00 53.00 11.20 2.42 0.00 NA
I tried writing the following code:
data.frame$lnx[is.na(data.frame$lnx)] <- log(data.frame$x + 1)
but I get the following warning message, and the output is wrong:
number of items to replace is not a multiple of replacement length
Can someone guide me, please? Thanks.
In R you can select rows using conditionals and assign values directly. In your example you could do this:
df[is.na(df$lnx) & df$x == 0,'lnx'] <- 0
Here's what this does:
is.na(df$lnx) returns a logical vector the length of df$lnx telling, for each row, whether lnx is NA. df$x == 0 does the same thing, checking whether, for each row, x == 0. By using the & operator, we combine those vectors into one that contains TRUE only for rows where both conditions are TRUE.
We then use the bracket notation to select the lnx column of those rows where both conditions are TRUE in df and then insert the value 0 into those cells using <-
The specific error you're getting is because log(data.frame$x + 1) and df$lnx[is.na(df$lnx)] have different lengths: log(data.frame$x + 1) produces a vector whose length is the number of rows of your data frame, while the length of df$lnx[is.na(df$lnx)] is the number of rows that have NA in lnx.
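If you prefer to keep the original assignment style, subsetting both sides with the same logical index makes the lengths match (a sketch; df stands in for your data frame):

```r
# rows where lnx is NA and the corresponding x is 0
idx <- is.na(df$lnx) & df$x == 0

# log(0 + 1) = 0, and both sides now have length sum(idx)
df$lnx[idx] <- log(df$x[idx] + 1)
```

Rows where x is NA are untouched, so their lnx stays NA as required.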
Using a dplyr solution:
library(dplyr)
df %>%
  mutate(lnx = case_when(
    x == 0 ~ 0,
    is.na(x) ~ NA_real_,
    TRUE ~ lnx))
Note the final TRUE ~ lnx: without it, case_when() sets every unmatched row to NA, wiping out the lnx values you already have. This yields for your example:
# A tibble: 7 x 8
  a          b     c     d     e     f     x   lnx
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AB1001    1.    3.   67. 13.9   2.63 1777.  7.48
2 AB1002    0.    2.   72. 38.7   3.66    0.  0.
3 AB1003    1.    3.   48.  4.15  1.42 1917.  7.56
4 AB1004    0.    1.   70. 34.8   3.55   NA  NA
5 AB1005    1.    1.   34.  3.45  1.24 3165.  8.06
6 AB1006    1.    1.   14.  7.30  1.99   NA  NA
7 AB1007    0.    3.   53. 11.2   2.42    0.  0.
Consider the following list:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))
How do I manage to get a data frame with all terms associated with these 3 words in the columns, showing:
The corresponding correlation coefficient (if it exists)
NA if it does not exist for this word (for example, the pair (oil, they) would show NA)
Here's a solution using reshape2 to help reshape the data:
library(reshape2)
aa <- do.call(rbind, Map(function(d, n)
  cbind.data.frame(
    xterm = if (length(d) > 0) names(d) else NA,
    cor   = if (length(d) > 0) d else NA,
    term  = n),
  a, names(a))
)
dcast(aa, term ~ xterm, value.var = "cor")
Or you could use dplyr and tidyr
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
a1 <- unnest(lapply(a, function(x) data.frame(xterm=names(x),
cor=x, stringsAsFactors=FALSE)), term)
a1 %>%
spread(xterm, cor) #here it removed terms without any `cor` for the `xterm`
# term 15.8 ability above agreement analysts buyers clearly emergency fixed
#1 oil 0.87 NA 0.76 0.71 0.79 0.70 0.8 0.75 0.73
#2 opec 0.85 0.8 0.82 0.76 0.85 0.83 NA 0.87 NA
# late market meeting prices prices. said that they trying who winter
#1 0.8 0.75 0.77 0.72 NA 0.78 0.73 NA 0.8 0.8 0.8
#2 NA NA 0.88 NA 0.79 0.82 NA 0.8 NA NA NA
Update
aNew <- sapply(tdm$dimnames$Terms, function(i) findAssocs(tdm, i, corlimit=0.95))
aNew2 <- aNew[!!sapply(aNew, function(x) length(dim(x)))]
aNew3 <- unnest(lapply(aNew2, function(x) data.frame(xterm=rownames(x),
cor=x[,1], stringsAsFactors=FALSE)[1:3,]), term)
res <- aNew3 %>%
spread(xterm, cor)
dim(res)
#[1] 1021 160
res[1:3,1:5]
# term ... 100,000 10.8 1.1
#1 ... NA NA NA NA
#2 100,000 NA NA NA 1
#3 10.8 NA NA NA NA