Creating multiple columns in data table with a for loop - r

My data table looks like:
head(data)
Date AI AGI ADI ASI ARI ERI NVRI SRI FRI IRI
1: 1991-09-06 NA 2094.19 NA NA NA NA NA NA NA NA
2: 1991-09-13 NA 2204.94 NA NA NA NA NA NA NA NA
3: 1991-09-20 NA 2339.10 NA NA NA NA NA NA NA NA
4: 1991-09-27 NA 2387.81 NA NA NA NA NA NA NA NA
5: 1991-10-04 NA 2459.94 NA NA NA NA NA NA NA NA
6: 1991-10-11 NA 2571.07 NA NA NA NA NA NA NA NA
Don't worry about the NAs. What I want to do is make a "percentage change" column for each of the columns apart from date.
What I've done so far is:
names_no_date <- unique(names(data))[!unique(names(data)) %in% "Date"]
for (i in names_no_date){
data_ch <- data[, paste0(i, "ch") := i/shift(i, n = 1, type = "lag")-1]}
I get the error:
Error in i/shift(i, n = 1, type = "lag") :
non-numeric argument to binary operator
I'm wondering how I get around this error?

i is a string, so you are trying to divide a string in i/shift(i, n = 1, type = "lag"):
> "AI"/NA
Error in "AI"/NA : non-numeric argument to binary operator
Instead, do
for (i in names_no_date){
data[, paste0(i, "ch") := get(i)/shift(get(i), n = 1, type = "lag")-1]
}
Also see Referring to data.table columns by names saved in variables.
Edit: #Frank writes in the comments that a more concise way to produce OP's output is
data[, paste0(names_no_date, "_pch") := .SD/shift(.SD) - 1, .SDcols=names_no_date]

Related

How to get name row as variable in function and plot density graph

I have issues with my function, i don't know if the problem is in the function or in my way to called it.
I have big dataframe with > 20000 row and around 700 columns, with each row a part of a gene and i want to calculate density for each row + plot the density plot with name of the gene.
baseM <- read.csv("expansions_full_omim_06_07_21.2.csv", sep = "\t")
rownames(baseM) <- paste(baseM$motif, baseM$chromosome, baseM$intervalle , baseM$gene , baseM$localisation, baseM$OMIM, sep = ".")
baseM.num <- baseM[sapply(baseM, is.numeric)]
names <- rownames(baseM.num.fltr)
d.density <- function(X, n){
#print(X)
d <- density(as.numeric(as.matrix(X)), na.rm=T)
peaks <- NULL
for (i in 2:(length(d$y)-1)) {
if (d$y[i-1] >= d$y[i] & d$y[i] <= d$y[i+1]) {
peaks <- cbind(peaks, c(d$x[i], d$y[i]))
}}
df <- data.frame(test =as.numeric(as.matrix(X)))
g <- ggplot(df, aes(x = as.numeric(as.matrix(test)))) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
ggsave(filename=paste("/work/gad/shared/analyse/STR/Marine/analysis/output/annotation/R_plots/", n, ".png", sep=""), plot=g)
#q <- plot(d)
#png(file=file_name)
#print(q)
#dev.off()
return(peaks)
}
baseM.num.fltr$peaks <- apply(temp, 1 , d.density, n=names)
I get correctly my peaks but obviously something wrong with the plot. I'm not sure my way to pass the name is correct, or is something else would be better/easier? Thanks for your help! I tried 2 ways for the plot, with or without ggplot2 but not working.
This is the error I get:
NULL
Erreur : `device` must be NULL, a string or a function.
>
Example of my data :
> head(baseM)
motif chromosome intervalle gene localisation
1 AAAAAAAAAAAAAAAAAAAC chr2 (69131154, 69132154) BMP10 intergenic
2 AAAAAAAAAAAAAAAAAAAC chr2 (237411093, 237412093) IQCA1 intronic
3 AAAAAAAAAAAAAAAAAAAC chr2 (44378070, 44379070) LRPPRC intergenic
4 AAAAAAAAAAAAAAAAAAAC chr2 (105218444, 105219444) LINC01102 intergenic
5 AAAAAAAAAAAAAAAAAAAC chr2 (124310903, 124311903) LINC01826 intergenic
6 AAAAAAAAAAAAAAAAAAAC chr2 (30730559, 30731559) LCLAT1 intronic
OMIM
1 .
2 .
3 .,Mitochondrial complex IV deficiency, nuclear type 5, (French-Canadian), 220111 (3)
4 .
5 .
6 .
dijen003 dijen004 dijen005 dijen006 dijen007 dijen008 dijen009 dijen010
1 NA NA NA NA NA NA NA NA
2 7 NA NA NA NA NA NA NA
3 NA NA NA 5 NA NA NA NA
4 NA NA NA 5 NA NA NA NA
5 NA NA NA 5 NA NA NA NA
6 NA NA NA NA 5 NA NA NA
dijen011 dijen012 dijen013 dijen014 dijen015 dijen016 dijen017 dijen018
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
(Sorry i know it's a short example but data is really big - and of course not all lines have that much NA)
For the device argument, use png or 'png'. (Note that png() will work also but only when the filename has the '.png' extension.)
(png() will work also but only when the filename includes the '.png' extension, see comment thread below.)
Example:
library(tidyverse)
set.seed(1L)
df <- tibble(a = rnorm(10))
df %>% ggplot(aes(a)) + geom_density()
ggsave("foo.png", device = "png")

How do I apply a formula to each value in a data frame?

I've created a formula that calculates the exponential moving average of data:
myEMA <- function(price, n) {
ema <- c()
data_start <- which(!is.na(price))[1]
ema[1:data_start+n-2] <- NA
ema[data_start+n-1] <- mean(price[data_start:(data_start+n-1)])
beta <- 2/(n+1)
for(i in (data_start+n):length(price)) {
ema[i] <- beta*price[i] +
(1-beta)*ema[i-1]
}
ema <- reclass(ema,price)
return(ema)
}
The data I'm using is:
pricesupdated <- data.frame(a = seq(1,100), b = seq(1,200,2), c = c(NA,NA,NA,seq(1,97)))
I would like to create a dataframe where I apply the formula to each variable in my above data.frame. My attempt was:
frameddata <- data.frame(myEMA(pricesupdated,12))
But the error message that I get is:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'mean': undefined columns selected
I'm able to print the answer that I want, but not create a dataframe...
Can you help me?
First of all myEMA() is a function, not a formula. Check out help("function") and help("formula") for details on what the distinction is.
The myEMA() function takes a numeric vector as its first argument and returns a numeric vector with the same dimensions as its first argument.
A data.frame object is bascially just a list of vectors with a special class attribute. The most common way to repeat a function call across each element in a list is to use one of the *apply family of functions. For example, you can use lapply(), which will calls myEMA once on each variable in pricesupdated and returns a list with one element per function call containing that function call's returned value (a numeric vector). This list can be easily converted back to data.frame() since all its elements have the same length:
results <- lapply(pricesupdated, myEMA, n = 12)
# look at the structure of the results object
> str(results)
List of 3
$ a: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
$ b: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
$ c: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
frameddata <- as.data.frame(results)
# look at the top 15 records in this object
> head(frameddata, 15)
a b c
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
11 NA NA NA
12 6.5 12 NA
13 7.5 14 NA
14 8.5 16 NA
15 9.5 18 6.5
The question is likely a duplicate, ...
but the apply-family might help, e.g.
sapply(pricesupdated, myEMA, n=12)
for reproducibilty, it would be benificial to add require(pec)

Assign value to data frame in R to all elements conditionally

I try to assign value to all cells in a dataframe having a specific value
by this code
train_data <- read.csv("train_set.csv",header=TRUE)
train_data[train_data == "<NA>"] <- 0
But it does not work, I still see the values unchanged. How can I change values? Data in CSV is as below
spec1 spec2
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
SP-0013 SP-0063
SP-0013 SP-0063
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
As #akrun and others have mentioned we may need an actual copy of your data, but give this a shot:
train_data <- read.csv("train_set.csv", header=TRUE, na.strings = c("NA", "<NA>"), stringsAsFactors=FALSE)
train_data[is.na(train_data)] <- 0

Opening csv of specific sequences: NAs come out of nowhere?

I feel like this is a relatively straightforward question, and I feel I'm close but I'm not passing edge-case testing. I have a directory of CSVs and instead of reading all of them, I only want some of them. The files are in a format like 001.csv, 002.csv,...,099.csv, 100.csv, 101.csv, etc which should help to explain my if() logic in the loop. For example, to get all files, I'd do something like:
id = 1:1000
setwd("D:/")
filenames = as.character(NULL)
for (i in id){
if(i < 10){
i <- paste("00",i,sep="")
}
else if(i < 100){
i <- paste("0",i,sep="")
}
filenames[[i]] <- paste(i,".csv", sep="")
}
y <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
The above code works fine for id=1:1000, for id=1:10, id=20:70 but as soon as I pass it id=99:100 or any sequence involving numbers starting at over 100, it introduces a lot of NAs.
Example output below for id=98:99
> filenames
098 099
"098.csv" "099.csv"
Example output below for id=99:100
> filenames
099
"099.csv" NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
"100.csv"
I feel like I'm missing some catch statement in my if() logic. Any insight would be greatly appreciated! :)
You can avoid the loop for creating the filenames
filenames <- sprintf('%03d.csv', 1:1000)
y <- do.call(rbind, lapply(filenames, read.csv, header = TRUE))
#akrun has given you a much better way of solving your task. But in terms of the actual issue with your code, the problem is that for i < 100 you subset by a character vector (implicitly converted using paste) while for i >= 100 you subset by an integer. When you use id = 99:100 this translates to:
filenames <- character(0)
filenames["099"] <- "099.csv" # length(filenames) == 1L
filenames[100] <- "100.csv" # length(filenames) == 100L, with all(filenames[2:99] == NA)
Assigning to a named member of a vector that doesn't yet exist will create a new member at position length(vector) + 1 whereas assigning to a numbered position that is > length(vector) will also fill in every intervening position with NA.
Another approach, although less efficient than #akrun's solution, is with the following function:
merged <- function(id = 1:332) {
df <- data.frame()
for(i in 1:length(id)){
add <- read.csv(sprintf('%03d.csv', id[i]))
df <- rbind(df,add)
}
df
}
Now, you can merge the files with:
dat <- merged(99:100)
Furthermore, you can assign columnnames by inserting the following line in the function just before the last line with df:
colnames(df) <- c(..specify the colnames in here..)

how can I add to columns in R

I cannot seem to add two columns in R.
when I try
dat$V1 + dat$V2
I get
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Warning message:
In Ops.factor(dat$V1, dat$V2) : + not meaningful for factors
lots of other questions suggest to do as I have done, however as you can see this does not work for me. what is the problem?
Try to convert your factor columns to numeric: If V1 and V2 are 1st two columns.
dat[,1:2] <- lapply(dat[,1:2], function(x) as.numeric(as.character(x)))
dat$V1 +dat$V2
For example:
dat <- data.frame(V1= factor(1:5), V2= factor(6:10))
dat$V1+dat$V2
#[1] NA NA NA NA NA
#Warning message:
#In Ops.factor(dat$V1, dat$V2) : + not meaningful for factors
dat[,1:2] <- lapply(dat[,1:2], function(x) as.numeric(as.character(x)))
dat$V1 +dat$V2
#[1] 7 9 11 13 15

Resources