How can I add two columns in R?

I cannot seem to add two columns in R.
When I try
dat$V1 + dat$V2
I get
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Warning message:
In Ops.factor(dat$V1, dat$V2) : + not meaningful for factors
Lots of other questions suggest doing what I have done, but as you can see it does not work for me. What is the problem?

Try converting your factor columns to numeric. Assuming V1 and V2 are the first two columns:
dat[,1:2] <- lapply(dat[,1:2], function(x) as.numeric(as.character(x)))
dat$V1 +dat$V2
For example:
dat <- data.frame(V1= factor(1:5), V2= factor(6:10))
dat$V1+dat$V2
#[1] NA NA NA NA NA
#Warning message:
#In Ops.factor(dat$V1, dat$V2) : + not meaningful for factors
dat[,1:2] <- lapply(dat[,1:2], function(x) as.numeric(as.character(x)))
dat$V1 +dat$V2
#[1] 7 9 11 13 15
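The detour through as.character() matters because a factor stores integer codes plus labels, and as.numeric() on the factor itself returns the codes rather than the printed numbers. A minimal sketch with a made-up factor `f` (not the asker's data):

```r
# A factor stores integer codes plus a set of labels (levels).
f <- factor(c("10", "20", "10", "5"))

# as.numeric() on the factor itself returns the internal codes,
# which follow the sorted order of the levels ("10", "20", "5").
codes <- as.numeric(f)

# Going through character first recovers the numbers that were printed.
values <- as.numeric(as.character(f))

# Equivalent form that converts each level once and then indexes by code.
values_via_levels <- as.numeric(levels(f))[f]

print(codes)
print(values)
```

If the columns came from read.csv(), it may also be worth checking why they were read as factors in the first place, for example a stray non-numeric entry in the file.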

How to get name row as variable in function and plot density graph

I have an issue with my function, and I don't know whether the problem is in the function itself or in the way I call it.
I have a big data frame with more than 20,000 rows and around 700 columns. Each row is part of a gene, and I want to calculate the density for each row and save a density plot labelled with the name of the gene.
baseM <- read.csv("expansions_full_omim_06_07_21.2.csv", sep = "\t")
rownames(baseM) <- paste(baseM$motif, baseM$chromosome, baseM$intervalle , baseM$gene , baseM$localisation, baseM$OMIM, sep = ".")
baseM.num <- baseM[sapply(baseM, is.numeric)]
names <- rownames(baseM.num.fltr)
d.density <- function(X, n){
#print(X)
d <- density(as.numeric(as.matrix(X)), na.rm=T)
peaks <- NULL
for (i in 2:(length(d$y)-1)) {
if (d$y[i-1] >= d$y[i] & d$y[i] <= d$y[i+1]) {
peaks <- cbind(peaks, c(d$x[i], d$y[i]))
}}
df <- data.frame(test =as.numeric(as.matrix(X)))
g <- ggplot(df, aes(x = as.numeric(as.matrix(test)))) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
ggsave(filename=paste("/work/gad/shared/analyse/STR/Marine/analysis/output/annotation/R_plots/", n, ".png", sep=""), plot=g)
#q <- plot(d)
#png(file=file_name)
#print(q)
#dev.off()
return(peaks)
}
baseM.num.fltr$peaks <- apply(temp, 1 , d.density, n=names)
I get my peaks correctly, but something is obviously wrong with the plot. I'm not sure whether the way I pass the name is correct, or whether something else would be better or easier. Thanks for your help! I tried two ways of plotting, with and without ggplot2, but neither works.
This is the error I get:
NULL
Error: `device` must be NULL, a string or a function.
>
Example of my data :
> head(baseM)
motif chromosome intervalle gene localisation
1 AAAAAAAAAAAAAAAAAAAC chr2 (69131154, 69132154) BMP10 intergenic
2 AAAAAAAAAAAAAAAAAAAC chr2 (237411093, 237412093) IQCA1 intronic
3 AAAAAAAAAAAAAAAAAAAC chr2 (44378070, 44379070) LRPPRC intergenic
4 AAAAAAAAAAAAAAAAAAAC chr2 (105218444, 105219444) LINC01102 intergenic
5 AAAAAAAAAAAAAAAAAAAC chr2 (124310903, 124311903) LINC01826 intergenic
6 AAAAAAAAAAAAAAAAAAAC chr2 (30730559, 30731559) LCLAT1 intronic
OMIM
1 .
2 .
3 .,Mitochondrial complex IV deficiency, nuclear type 5, (French-Canadian), 220111 (3)
4 .
5 .
6 .
dijen003 dijen004 dijen005 dijen006 dijen007 dijen008 dijen009 dijen010
1 NA NA NA NA NA NA NA NA
2 7 NA NA NA NA NA NA NA
3 NA NA NA 5 NA NA NA NA
4 NA NA NA 5 NA NA NA NA
5 NA NA NA 5 NA NA NA NA
6 NA NA NA NA 5 NA NA NA
dijen011 dijen012 dijen013 dijen014 dijen015 dijen016 dijen017 dijen018
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
(Sorry, I know it's a short example, but the data is really big, and of course not all rows have that many NAs.)
For the device argument, use png or 'png'. (png() will also work, but only when the filename has the '.png' extension; see the comment thread below.)
Example:
library(tidyverse)
set.seed(1L)
df <- tibble(a = rnorm(10))
df %>% ggplot(aes(a)) + geom_density()
ggsave("foo.png", device = "png")
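A possible contributor to the original error, offered as a hedged reading rather than a confirmed diagnosis: `n = names` hands the whole vector of row names to every call, so paste() returns one file name per row instead of a single string, and when `device` is left unset ggsave() guesses it from the file name extension. The base R sketch below only illustrates the length problem and a per-row alternative. `gene_names`, `counts` and `out_dir` are made-up stand-ins, and the plotting call is left as a comment.

```r
# Hypothetical stand-ins for the row names and numeric columns of baseM.
gene_names <- c("BMP10", "IQCA1", "LRPPRC")
counts <- matrix(c(5, 7, NA, 6, 5, 5, NA, 8, 9, 7, 6, 5), nrow = 3,
                 dimnames = list(gene_names, NULL))
out_dir <- tempdir()

# Passing the whole vector builds three file names for a single plot.
all_names <- paste(file.path(out_dir, gene_names), ".png", sep = "")
length(all_names)  # 3

# Looping over row indices gives each call exactly one name.
files <- vapply(seq_len(nrow(counts)), function(i) {
  file <- file.path(out_dir, paste0(gene_names[i], ".png"))
  d <- density(counts[i, ], na.rm = TRUE)
  # A plot for row i would be saved here, e.g. ggsave(filename = file, plot = g).
  file
}, character(1))
```

Iterating over seq_len(nrow(...)) instead of apply() keeps the row's values and the row's name together, which apply() on its own does not give you.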

Creating multiple columns in data table with a for loop

My data table looks like:
head(data)
Date AI AGI ADI ASI ARI ERI NVRI SRI FRI IRI
1: 1991-09-06 NA 2094.19 NA NA NA NA NA NA NA NA
2: 1991-09-13 NA 2204.94 NA NA NA NA NA NA NA NA
3: 1991-09-20 NA 2339.10 NA NA NA NA NA NA NA NA
4: 1991-09-27 NA 2387.81 NA NA NA NA NA NA NA NA
5: 1991-10-04 NA 2459.94 NA NA NA NA NA NA NA NA
6: 1991-10-11 NA 2571.07 NA NA NA NA NA NA NA NA
Don't worry about the NAs. What I want to do is make a "percentage change" column for each of the columns apart from date.
What I've done so far is:
names_no_date <- unique(names(data))[!unique(names(data)) %in% "Date"]
for (i in names_no_date){
data_ch <- data[, paste0(i, "ch") := i/shift(i, n = 1, type = "lag")-1]}
I get the error:
Error in i/shift(i, n = 1, type = "lag") :
non-numeric argument to binary operator
I'm wondering how I get around this error?
i is a string, so you are trying to divide a string in i/shift(i, n = 1, type = "lag"):
> "AI"/NA
Error in "AI"/NA : non-numeric argument to binary operator
Instead, do
for (i in names_no_date){
data[, paste0(i, "ch") := get(i)/shift(get(i), n = 1, type = "lag")-1]
}
Also see Referring to data.table columns by names saved in variables.
Edit: @Frank writes in the comments that a more concise way to produce OP's output is
data[, paste0(names_no_date, "_pch") := .SD/shift(.SD) - 1, .SDcols=names_no_date]
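The same point (a column name held in a variable is just a string until it is looked up) can be shown without data.table. This is a base R sketch on a made-up `prices` data frame, not the OP's data: it uses `[[` where the data.table version uses get(), and builds the lag by hand instead of calling shift().

```r
# Small made-up table standing in for the question's data.
prices <- data.frame(
  Date = as.Date(c("1991-09-06", "1991-09-13", "1991-09-20")),
  AGI  = c(100, 110, 121),
  ERI  = c(50, 55, 44)
)

value_cols <- setdiff(names(prices), "Date")

for (col in value_cols) {
  x <- prices[[col]]                 # look the column up by its name
  previous <- c(NA, head(x, -1))     # value from the row before (lag of 1)
  prices[[paste0(col, "ch")]] <- x / previous - 1
}
print(prices)
```

The first row of each new column is NA because there is no earlier value to compare against.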

Subscript with matrix generated by assign()

I assigned a matrix to a name which varies with j:
j <- 2L
assign(paste0("pca", j,".FAVAR_fcst", sep=""), matrix(ncol=24, nrow=12))
This works neatly. Then I try to access a column of that matrix
paste0("pca", j,".FAVAR_fcst", sep="")[,2]
and get the following error:
Error in paste0("pca", j, ".FAVAR_fcst", sep = "")[, 2] :
incorrect number of dimensions
I've tried several variations and combinations with cat(), print() and capture.output(), but nothing seems to work. I'm not sure what I have to search exactly for and couldn't find a solution. Can you help me?
You can use get:
get(paste0("pca", j, ".FAVAR_fcst")) # for the matrix
get(paste0("pca", j, ".FAVAR_fcst"))[,2] # for the column
# [1] NA NA NA NA NA NA NA NA NA NA NA NA
Another solution is to combine eval and as.symbol:
eval(as.symbol(paste0("pca", j, ".FAVAR_fcst")))[,2]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA
(paste0() has no separator to set, so the sep="" argument from the question is dropped here; the resulting name is the same.)
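As an alternative to assign() and get(), offered as a suggestion rather than something the answer above proposes, the matrices can be kept in a named list so that the generated name is an ordinary list key. `forecasts` and `name` are names made up for this sketch.

```r
j <- 2L

# One list holds every forecast matrix, keyed by the generated name.
forecasts <- list()
name <- paste0("pca", j, ".FAVAR_fcst")
forecasts[[name]] <- matrix(NA_real_, nrow = 12, ncol = 24)

# Columns can be written and read with ordinary subscripts.
forecasts[[name]][, 2] <- seq_len(12)
second_col <- forecasts[[name]][, 2]
print(second_col)
```

With this layout, looping over all matrices is just a loop over names(forecasts).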

Assign value to data frame in R to all elements conditionally

I am trying to assign a value to all cells in a data frame that have a specific value, using this code:
train_data <- read.csv("train_set.csv",header=TRUE)
train_data[train_data == "<NA>"] <- 0
But it does not work; the values are unchanged. How can I change them? The data in the CSV looks like this:
spec1 spec2
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
SP-0013 SP-0063
SP-0013 SP-0063
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
As @akrun and others have mentioned, we may need an actual copy of your data, but give this a shot:
train_data <- read.csv("train_set.csv", header=TRUE, na.strings = c("NA", "<NA>"), stringsAsFactors=FALSE)
train_data[is.na(train_data)] <- 0
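The reason the comparison in the question changes nothing is that `<NA>` is only how R prints a missing value in a character or factor column; the cell does not contain that text, and comparing a missing value with anything gives NA rather than TRUE. A small sketch on a made-up `spec` data frame (not the real train_set.csv):

```r
spec <- data.frame(
  spec1 = c(NA, "SP-0013", NA),
  spec2 = c(NA, "SP-0063", NA),
  stringsAsFactors = FALSE
)

# Comparing with the text "<NA>" never yields TRUE for a missing value.
matches_text <- spec == "<NA>"

# is.na() is the test for missing values.
missing_cells <- is.na(spec)

spec[missing_cells] <- 0
print(spec)
```

Because these columns are character, the replacement is stored as the string "0"; with factor columns the assignment would instead need 0 to be an existing level, which is why stringsAsFactors = FALSE is used.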

Opening csv of specific sequences: NAs come out of nowhere?

I feel like this is a relatively straightforward question, and I feel I'm close but I'm not passing edge-case testing. I have a directory of CSVs and instead of reading all of them, I only want some of them. The files are in a format like 001.csv, 002.csv,...,099.csv, 100.csv, 101.csv, etc which should help to explain my if() logic in the loop. For example, to get all files, I'd do something like:
id = 1:1000
setwd("D:/")
filenames = as.character(NULL)
for (i in id){
if(i < 10){
i <- paste("00",i,sep="")
}
else if(i < 100){
i <- paste("0",i,sep="")
}
filenames[[i]] <- paste(i,".csv", sep="")
}
y <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
The above code works fine for id=1:1000, for id=1:10, id=20:70 but as soon as I pass it id=99:100 or any sequence involving numbers starting at over 100, it introduces a lot of NAs.
Example output below for id=98:99
> filenames
098 099
"098.csv" "099.csv"
Example output below for id=99:100
> filenames
099
"099.csv" NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
"100.csv"
I feel like I'm missing some catch statement in my if() logic. Any insight would be greatly appreciated! :)
You can avoid the loop when creating the filenames:
filenames <- sprintf('%03d.csv', 1:1000)
y <- do.call(rbind, lapply(filenames, read.csv, header = TRUE))
@akrun has given you a much better way of solving your task. But in terms of the actual issue with your code, the problem is that for i < 100 you subset by a character vector (implicitly converted using paste) while for i >= 100 you subset by an integer. When you use id = 99:100 this translates to:
filenames <- character(0)
filenames["099"] <- "099.csv" # length(filenames) == 1L
filenames[100] <- "100.csv" # length(filenames) == 100L, with all(is.na(filenames[2:99]))
Assigning to a named member of a vector that doesn't yet exist will create a new member at position length(vector) + 1 whereas assigning to a numbered position that is > length(vector) will also fill in every intervening position with NA.
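If you do want to keep a loop, one way to sidestep the mixed indexing is to index by position within the id vector rather than by the id itself. This is only a sketch; `ids`, `mixed` and `by_position` are names made up here, and formatC() is used for the zero padding instead of the if/else chain.

```r
ids <- 99:100

# Mixed indexing, as in the question: a name for 99, a position for 100.
mixed <- character(0)
mixed["099"] <- "099.csv"
mixed[100] <- "100.csv"
length(mixed)        # 100
sum(is.na(mixed))    # 98

# Indexing by position in `ids` keeps the vector as long as `ids`.
by_position <- character(length(ids))
for (k in seq_along(ids)) {
  by_position[k] <- paste0(formatC(ids[k], width = 3, flag = "0"), ".csv")
}
print(by_position)
```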
Another approach, although less efficient than @akrun's solution, is with the following function:
merged <- function(id = 1:332) {
df <- data.frame()
for (i in seq_along(id)) {
add <- read.csv(sprintf('%03d.csv', id[i]))
df <- rbind(df,add)
}
df
}
Now, you can merge the files with:
dat <- merged(99:100)
Furthermore, you can assign column names by inserting the following line in the function just before the last line with df:
colnames(df) <- c(..specify the colnames in here..)
