I have a dataframe with the following structure:
'data.frame': 13095 obs. of 1433 variables:
$ my : Factor w/ 624 levels "19631","19632",..: 1 1 1 1 1 1 1 1 1 1 ...
$ s1 : num NA NA NA NA NA NA NA NA NA NA ...
Here my is a factor identifying the month, s1, ..., s1426 are vectors containing my dependent variables, and f1, ..., f6 are vectors containing my independent variables.
s1 is a vector with many NA observations.
I want to regress s1 on f1,…,f6. To make the regression, I used the following code:
try1 <- lmList(s1 ~ f1+f2+f3+f4+f5+f6 |my , data=d1)
try1
That's the output that I received:
Call: lmList(formula = s1 ~ f1 + f2 + f3 + f4 + f5 + f6 | my, data = d1)
Coefficients:
Error in !unlist(lapply(coefs, is.null)) : invalid argument type
I tried to change na.action to na.omit but I have the same output.
I tried to create a new dataframe with:
d2<-data.frame(d1$my,d1$s1,d1$f1,d1$f2,d1$f3,d1$f4,d1$f5,d1$f6)
colnames(d2)<-c("my", "s1", "f1", "f2", "f3", "f4", "f5", "f6")
d2 has the following structure:
'data.frame': 13095 obs. of 8 variables:
$ my: Factor w/ 624 levels "19631","19632",..: 1 1 1 1 1 1 1 1 1 1 ...
$ s1: num NA NA NA NA NA NA NA NA NA NA ...
$ f1: num -0.54 1.66 0.68 0.06 0.9 -0.16 0.19 0.25 0.57 -0.1 ...
$ f2: num 0.94 0.98 0.63 0.32 -0.03 0.11 0.2 -0.03 0.07 0.01 ...
$ f3: num 0.31 -0.25 0.02 0.29 0.22 0.07 -0.09 -0.17 0.21 0.28 ...
$ f4: num 1.5 1.7 1.14 -0.02 0.36 0.49 -0.13 0.18 0.14 0.47 ...
$ f5: num -0.5 -1.96 -0.66 -0.17 -0.43 0.24 -0.12 -0.01 -0.58 0.52 ...
$ f6: num 0.38 0.3 0.35 0.3 0.08 0.13 -0.18 -0.05 -0.08 0.03 ...
When I run the regression:
try2<-lmList(s1~f1+f2+f3+f4+f5+f6|my,data=d2)
try2
It works without any problem.
How can I solve this? I need to run a regression for every s, so I can't create a new data frame by hand each time. I read the lmList documentation and searched this site but found nothing about data frame size, so I don't think the problem depends on the size of d1; in every other respect the two data frames are equivalent.
I also tried to create a reproducible example, but when I build a new data frame, even one with as many NAs as my file, the problem does not occur (i.e. lmList works fine).
I use the lme4 package.
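Since the reduced data frame d2 works, one way to avoid building it by hand for every s is to subset the needed columns inside a helper. The following is only a sketch under assumptions: it uses a small made-up stand-in for d1, and it fits plain lm() per month with base R rather than calling lmList, so it is not the lmList method itself. The name fit_by_month is hypothetical.

```r
# Made-up stand-in for d1: a grouping factor, two responses, two predictors.
set.seed(42)
d1 <- data.frame(
  my = factor(rep(c("19631", "19632", "19633"), each = 10)),
  s1 = rnorm(30),
  s2 = rnorm(30),
  f1 = rnorm(30),
  f2 = rnorm(30)
)
d1$s1[1:10] <- NA  # the first month has no usable observations for s1

predictors <- c("f1", "f2")
responses  <- grep("^s[0-9]+$", names(d1), value = TRUE)

# Hypothetical helper: one lm() per month for a single response,
# using only the columns that this regression needs.
fit_by_month <- function(resp, data) {
  small <- data[, c("my", resp, predictors)]
  small <- small[complete.cases(small), ]  # drop rows with any NA
  small$my <- droplevels(small$my)         # drop months left without data
  form <- reformulate(predictors, response = resp)
  lapply(split(small, small$my), function(g) lm(form, data = g))
}

# One list of per-month fits for each response column.
fits <- lapply(setNames(responses, responses), fit_by_month, data = d1)
names(fits$s1)  # months that still had data for s1
```

With the real data, predictors would be f1 to f6; the same subsetting step could also be placed in front of an lmList call instead of lm().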
Here is the question set up:
I have read in a data file from the Machine Learning Repository called "abalone.data":
dat=read.csv(file="abalone.data",header=FALSE)
colnames(dat)<-c('Sex','Length','Diameter','Height','Whole weight',
'Shucked wieght','Viscera weight','Shell weight','Rings')
Here is a sample:
head(dat)
Sex Length Diameter Height Whole weight Shucked wieght Viscera weight Shell weight Rings
1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
And here is the structure:
str(dat)
'data.frame': 4177 obs. of 9 variables:
$ Sex : chr "M" "M" "F" "M" ...
$ Length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
$ Diameter : num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
$ Height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
$ Whole weight : num 0.514 0.226 0.677 0.516 0.205 ...
$ Shucked wieght: num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
$ Viscera weight: num 0.101 0.0485 0.1415 0.114 0.0395 ...
$ Shell weight : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
$ Rings : int 15 7 9 10 7 8 20 16 9 19 ...
Here is the problem:
I want to convert the first column to numeric; e.g. "M" to 1, "F" to 2 and "I" to 3.
So, I try
Sex <- as.numeric(dat$Sex)
but I get:
Sex<-as.numeric(dat$sex)
> Sex[1:5]
[1] NA NA NA NA NA
I've tried a lot of similar commands; e.g.:
as.numeric(dat$Sex=character(),levels=levels)
Error: unexpected '=' in " as.numeric(dat$Sex="
I cannot figure this out.
Please help
That's because the Sex variable is a character vector. You first need to change it to a factor:
dat$Sex <- as.numeric(factor(dat$Sex))
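Note that factor() sorts its levels alphabetically by default, so the line above codes "F" as 1, "I" as 2 and "M" as 3. A minimal sketch of getting the mapping asked for in the question (M = 1, F = 2, I = 3), using a short made-up vector in place of dat$Sex:

```r
# Made-up sample standing in for dat$Sex
sex <- c("M", "M", "F", "I", "F")

# Default levels are sorted alphabetically: F = 1, I = 2, M = 3
default_codes <- as.numeric(factor(sex))
default_codes
# [1] 3 3 1 2 1

# Explicit levels fix the order: M = 1, F = 2, I = 3
wanted_codes <- as.numeric(factor(sex, levels = c("M", "F", "I")))
wanted_codes
# [1] 1 1 2 3 2
```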
# Create a data frame
> df <- data.frame(a = rnorm(7), b = rnorm(7), c = rnorm(7), threshold = rnorm(7))
> df <- round(abs(df), 2)
>
> df
a b c threshold
1 1.17 0.27 1.26 0.19
2 1.41 1.57 1.23 0.97
3 0.16 0.11 0.35 1.34
4 0.03 0.04 0.10 1.50
5 0.23 1.10 2.68 0.45
6 0.99 1.36 0.17 0.30
7 0.28 0.68 1.22 0.56
>
>
# Replace values in columns a, b, and c with NA if > value in threshold
> df[1:3][df[1:3] > df[4]] <- "NA"
Error in Ops.data.frame(df[1:3], df[4]) :
‘>’ only defined for equally-sized data frames
There could be some obvious solutions that I am incapable of producing. The intent is to replace values in columns "a", "b", and "c" with NA if the value is larger than that in "threshold". And I need to do that row-by-row.
If I had done it right, the df would look like this:
a b c threshold
1 NA NA NA 0.19
2 NA NA NA 0.97
3 0.16 0.11 0.35 1.34
4 0.03 0.04 0.10 1.50
5 0.23 NA NA 0.45
6 NA NA 0.17 0.30
7 0.28 NA NA 0.56
I had also tried the apply() approach but to no avail. Can you help, please??
dplyr is a good fit for this kind of use case.
One way is below:
> library(dplyr)
> set.seed(10)
> df <- data.frame(a = rnorm(7), b = rnorm(7), c = rnorm(7), threshold = rnorm(7))
> df <- round(abs(df), 2)
> df
a b c threshold
1 0.02 0.36 0.74 2.19
2 0.18 1.63 0.09 0.67
3 1.37 0.26 0.95 2.12
4 0.60 1.10 0.20 1.27
5 0.29 0.76 0.93 0.37
6 0.39 0.24 0.48 0.69
7 1.21 0.99 0.60 0.87
>
> df %>%
+ mutate_at(vars(a:c), ~ifelse(.x > df$threshold, NA, .x))
a b c threshold
1 0.02 0.36 0.74 2.19
2 0.18 NA 0.09 0.67
3 1.37 0.26 0.95 2.12
4 0.60 1.10 0.20 1.27
5 0.29 NA NA 0.37
6 0.39 0.24 0.48 0.69
7 NA NA 0.60 0.87
You can use apply() over the columns of the data frame:
df[, 1:3] <- apply(df[, 1:3, drop = FALSE], 2, function(x) ifelse(x > df[, 4], NA, x))
The problem with your code was the usage of df[4] instead of df[, 4]. The difference is that df[4] returns a data.frame with one column and df[, 4] returns a vector.
That's why
df[1:3] > df[4]
returns
Error in Ops.data.frame(df[1:3], df[4]) :
‘>’ only defined for equally-sized data frames
While this works as expected
df[1:3][df[1:3] > df[, 4]] <- NA
df
# a b c threshold
#1 0.63 0.74 NA 0.78
#2 NA NA 0.04 0.07
#3 0.84 0.31 0.02 1.99
#4 NA NA NA 0.62
#5 NA NA NA 0.06
#6 NA NA NA 0.16
#7 0.49 NA 0.92 1.47
data
set.seed(1)
df <- data.frame(a = rnorm(7), b = rnorm(7), c = rnorm(7), threshold = rnorm(7))
df <- round(abs(df), 2)
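To make the df[4] versus df[, 4] distinction concrete, here is a small sketch on a made-up two-row data frame (the name toy and its values are for illustration only):

```r
# Made-up data: two value columns and a per-row threshold
toy <- data.frame(a = c(1, 5), b = c(3, 0), threshold = c(2, 4))

class(toy[3])    # "data.frame": a one-column data frame
class(toy[, 3])  # "numeric": a plain vector

# Comparing against the vector recycles it down each column,
# so every row is checked against its own threshold.
mask <- toy[1:2] > toy[, 3]
toy[1:2][mask] <- NA
toy
#    a  b threshold
# 1  1 NA         2
# 2 NA  0         4
```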
You can use a for-loop like this:
for(i in 1:(ncol(df)-1)){
df[, i] <- ifelse(df[, i] > df[, 4], NA, df[, i])
}
I am interested in a yeast dataset from UCI (please see the link). The data is saved in text format. I would like to load it into RStudio. I saved it in Word (copy and paste), then tried to load it into RStudio, but I got garbled text instead of the data.
https://archive.ics.uci.edu/ml/datasets/Yeast
Any help please?
Grabbing the data is pretty easy; you can just pass the file URL directly to read.table. Getting the names is a lot more work, as they're buried in a text file. If you like, you can extract them with regex:
library(tidyverse)
yeast <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.data', stringsAsFactors = FALSE)
l <- readLines('https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.names')
l <- l[(grep('^7', l) + 1):(grep('^8', l) - 1)]
l <- l[grep('\\d\\..*:', l)]
names(yeast) <- make.names(c(sub('.*\\d\\.\\s+(.*):.*', '\\1', l), 'class'))
str(yeast)
#> 'data.frame': 1484 obs. of 10 variables:
#> $ Sequence.Name: chr "ADT1_YEAST" "ADT2_YEAST" "ADT3_YEAST" "AAR2_YEAST" ...
#> $ mcg : num 0.58 0.43 0.64 0.58 0.42 0.51 0.5 0.48 0.55 0.4 ...
#> $ gvh : num 0.61 0.67 0.62 0.44 0.44 0.4 0.54 0.45 0.5 0.39 ...
#> $ alm : num 0.47 0.48 0.49 0.57 0.48 0.56 0.48 0.59 0.66 0.6 ...
#> $ mit : num 0.13 0.27 0.15 0.13 0.54 0.17 0.65 0.2 0.36 0.15 ...
#> $ erl : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
#> $ pox : num 0 0 0 0 0 0.5 0 0 0 0 ...
#> $ vac : num 0.48 0.53 0.53 0.54 0.48 0.49 0.53 0.58 0.49 0.58 ...
#> $ nuc : num 0.22 0.22 0.22 0.22 0.22 0.22 0.22 0.34 0.22 0.3 ...
#> $ class : chr "MIT" "MIT" "MIT" "NUC" ...
...or just copy them all out by hand.
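To see what the regex step does without downloading anything, here is a sketch that applies the same grep() and sub() patterns to a few made-up lines. These lines only approximate the layout of the attribute section in yeast.names; they are an assumption, not a copy of the real file.

```r
# Made-up lines imitating a numbered "name: description" attribute list
l <- c(
  "  1.  Sequence Name: Accession number for the database",
  "  2.  mcg: a method for signal sequence recognition.",
  "      a continuation line with no number and no colon"
)

# Keep only lines of the form "<digit>. <name>: <description>"
l <- l[grep("\\d\\..*:", l)]

# Capture the text between the number and the colon
attr_names <- sub(".*\\d\\.\\s+(.*):.*", "\\1", l)

# make.names() turns them into syntactically valid column names
col_names <- make.names(c(attr_names, "class"))
col_names
# [1] "Sequence.Name" "mcg"           "class"
```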
Only just discovered Plyr and it has saved me a tonne of lines combining multiple data frames which is great. BUT I have another renaming problem I cannot fathom.
I have a list, which contains a number of data frames (this is a subset as there are actually 108 in the real list).
> str(mydata)
List of 4
$ C11:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.96 0.91 0.74 0.5
..$ n.ENSEMBLE.RECALL : num [1:8] 0.88 0.88 0.88 0.88 0.9 0.91 0.94 0.95
$ C12:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.96 0.89 0.86 0.72
..$ n.ENSEMBLE.RECALL : num [1:8] 0.91 0.91 0.91 0.91 0.93 0.96 0.97 0.98
$ C13:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.94 0.79 0.65 0.46
..$ n.ENSEMBLE.RECALL : num [1:8] 0.85 0.85 0.85 0.85 0.88 0.9 0.92 0.91
$ C14:'data.frame': 8 obs. of 3 variables:
..$ X : Factor w/ 8 levels "n >= 1","n >= 2",..: 1 2 3 4 5 6 7 8
..$ n.ENSEMBLE.COVERAGE: num [1:8] 1 1 1 1 0.98 0.95 0.88 0.74
..$ n.ENSEMBLE.RECALL : num [1:8] 0.91 0.91 0.91 0.91 0.92 0.94 0.95 0.98
What I really want to achieve is for each data frame to have the columns prepended with the title of the dataframe. So in the example the columns would be:
C11.X, C11.n.ENSEMBLE.COVERAGE & C11.n.ENSEMBLE.RECALL
C12.X, C12.n.ENSEMBLE.COVERAGE & C12.n.ENSEMBLE.RECALL
C13.X, C13.n.ENSEMBLE.COVERAGE & C13.n.ENSEMBLE.RECALL
C14.X, C14.n.ENSEMBLE.COVERAGE & C14.n.ENSEMBLE.RECALL
Can anyone suggest an elegant approach to renaming columns like this?
Here's a reproducible example using the iris data set:
# produce a named list of data.frames as sample data:
dflist <- split(iris, iris$Species)
# store the list element names:
n <- names(dflist)
# rename the elements:
Map(function(df, vec) setNames(df, paste(vec, names(df), sep = ".")), dflist, n)
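The same Map() call can be applied to a list shaped like the one in the question. The sketch below uses a cut-down, made-up version of mydata (two elements, two rows each) and stores the result so the new column names can be inspected:

```r
# Made-up, shortened stand-in for the list in the question
mydata <- list(
  C11 = data.frame(X = c("n >= 1", "n >= 2"),
                   n.ENSEMBLE.COVERAGE = c(1, 0.96),
                   n.ENSEMBLE.RECALL = c(0.88, 0.90)),
  C12 = data.frame(X = c("n >= 1", "n >= 2"),
                   n.ENSEMBLE.COVERAGE = c(1, 0.89),
                   n.ENSEMBLE.RECALL = c(0.91, 0.96))
)

# Prefix each data frame's column names with its list element name
renamed <- Map(function(df, nm) setNames(df, paste(nm, names(df), sep = ".")),
               mydata, names(mydata))

names(renamed$C11)
# [1] "C11.X"                   "C11.n.ENSEMBLE.COVERAGE" "C11.n.ENSEMBLE.RECALL"
```

Map() returns a new list, so the result has to be assigned (here to renamed) for the new names to be kept.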
Hi, I'm pushing data into a matrix so I can create a heatmap. The code I am using is identical to what is published here (http://sebastianraschka.com/Articles/heatmaps_in_r.html). For some of my datasets, when I push the data into the matrix format I get strange behaviour: some of the values change. Some of my datasets work fine but others do not, and I am unsure what differences between them cause this.
Example code:
data <- read.csv("mydata.txt", sep="\t", header =TRUE)
rnames <- data[,1]
mat_data <- data.matrix(data[,2:ncol(data)])
rownames(mat_data) <- rnames
Now example dataframes..
head(data)
1 1.108029 0.42 0.19 0.04 0.47 -0.08 0.47 0.04 0.10
2 1.108029 0.34 0.40 0.25 0.56 -0.08 -0.06 0.11 0.20
3 1.121099 0.1 -0.45 0.11 -0.22 -0.07 -0.40 0.24 -0.17
4 1.123857 0.26 -0.15 0.15 0.31 0.2 -0.24 -0.27 0.40
5 1.129303 0.11 0.13 0.01 -0.11 0.38 0.29 -0.15 -0.18
6 1.135904 0.4 0.07 0.11 0.03 0.6 -0.32 0.14 -0.12
head(mat_data)
tg_q2_rep_A tg_q2_rep_B tg_q2_rep_C tg_q2_rep_D tg_q4_rep_A tg_q4_rep_B tg_q4_rep_C tg_q4_rep_D
1.10802929 70 0.19 0.04 0.47 5 0.47 0.04 0.10
1.1080293 65 0.40 0.25 0.56 5 -0.06 0.11 0.20
1.12109912 49 -0.45 0.11 -0.22 4 -0.40 0.24 -0.17
1.12385707 62 -0.15 0.15 0.31 53 -0.24 -0.27 0.40
1.12930344 50 0.13 0.01 -0.11 65 0.29 -0.15 -0.18
1.1359041 69 0.07 0.11 0.03 69 -0.32 0.14 -0.12
You can see the rownames have had numbers appended to the ends and the first data for tg_q2_rep_A and tg_q4_rep_A have been changed.
If anyone can suggest how to approach this I'd appreciate it. I've been trying to figure this out for days :/
EDIT
As requested ..
> str(data)
'data.frame': 137 obs. of 33 variables:
$ CpG_id.chr.pos.: num 1.11 1.11 1.12 1.12 1.13 ...
$ tg_q2_rep_A : Factor w/ 75 levels "-0.01","-0.02",..: 70 65 49 62 50 69 71 63 57 7 ...
$ tg_q2_rep_B : num 0.19 0.4 -0.45 -0.15 0.13 0.07 0.5 -0.33 0.23 -0.22 ...
$ tg_q2_rep_C : num 0.04 0.25 0.11 0.15 0.01 0.11 0.16 0.03 0.23 -0.32 ...
$ tg_q2_rep_D : num 0.47 0.56 -0.22 0.31 -0.11 0.03 0.31 0.21 0 0.06 ...
$ tg_q4_rep_A : Factor w/ 73 levels "-0.04","-0.05",..: 5 5 4 53 65 69 50 53 59 46 ...
$ tg_q4_rep_B : num 0.47 -0.06 -0.4 -0.24 0.29 -0.32 0.07 -0.23 0.1 -0.09 ...
$ tg_q4_rep_C : num 0.04 0.11 0.24 -0.27 -0.15 0.14 0.14 0.36 0.1 -0.05 ...
$ tg_q4_rep_D : num 0.1 0.2 -0.17 0.4 -0.18 -0.12 0.15 0.18 -0.21 -0.14 ...
$ tg_q6_rep_A : Factor w/ 79 levels "-0.02","-0.03",..: 46 3 7 67 65 77 64 61 41 12 ...
$ tg_q6_rep_B : Factor w/ 87 levels "-0.01","-0.03",..: 68 79 34 11 82 1 63 1 36 32 ...
$ tg_q6_rep_C : num 0.22 0.5 -0.32 0.13 0.24 0.25 0.35 0.07 0.01 -0.44 ...
$ tg_q6_rep_D : Factor w/ 82 levels "-0.04","-0.05",..: 55 50 27 74 71 68 73 61 5 31 ...
$ tg_q8_rep_A : Factor w/ 73 levels "-0.01","-0.02",..: 49 9 2 52 45 50 13 55 48 9 ...
$ tg_q8_rep_B : num 0.05 0.07 -0.31 0.02 0 -0.33 0.03 -0.05 0.08 0.1 ...
$ tg_q8_rep_C : num 0.35 0.5 -0.06 -0.1 0.24 -0.45 -0.27 0.1 0.15 -0.29 ...
$ tg_q8_rep_D : num 0.15 0.08 -0.08 0.31 0.28 0.43 0.41 0.25 -0.05 -0.04 ...
$ tg_w2_rep_A : Factor w/ 72 levels "-0.01","-0.02",..: 49 16 24 66 60 62 62 68 52 49 ...
$ tg_w2_rep_B : num 0.11 0.24 -0.03 -0.43 0.67 -0.13 0.05 -0.4 -0.13 -0.18 ...
$ tg_w2_rep_C : num 0 0.33 -0.09 0 0.12 -0.35 0.06 0.33 0.15 -0.19 ...
$ tg_w2_rep_D : num -0.04 0 -0.03 0.44 0.04 0.23 0.28 0.19 -0.21 -0.17 ...
$ tg_w4_rep_A : Factor w/ 69 levels "-0.0","-0.01",..: 55 58 53 50 52 67 68 63 27 8 ...
$ tg_w4_rep_B : num 0.29 0.63 -0.37 0.09 0.22 -0.21 0.1 -0.14 -0.04 -0.09 ...
$ tg_w4_rep_C : num 0.09 0.13 -0.08 0.17 0.15 -0.33 0 0.38 0.1 -0.62 ...
$ tg_w4_rep_D : num 0.11 0.33 -0.32 0.41 -0.1 0.07 0.23 0.22 0.1 0.06 ...
$ tg_w6_rep_A : Factor w/ 74 levels "-0.01","-0.02",..: 56 45 4 69 59 47 2 40 47 12 ...
$ tg_w6_rep_B : num 0.07 0.13 -0.14 0.15 0.13 -0.17 0.33 0.12 0.07 -0.15 ...
$ tg_w6_rep_C : num 0.13 0.22 0.31 0.08 0.16 -0.33 -0.05 0.43 0.43 -0.06 ...
$ tg_w6_rep_D : num 0.28 0.11 -0.2 0.66 -0.18 0.16 0.26 0.27 0.06 -0.02 ...
$ tg_w8_rep_A : Factor w/ 67 levels "-0.01","-0.02",..: 52 40 37 44 48 61 48 53 39 63 ...
$ tg_w8_rep_B : num 0.3 0.09 -0.22 -0.1 0.14 -0.25 0.1 -0.49 0.19 0.15 ...
$ tg_w8_rep_C : num 0.23 0.27 0.11 -0.25 0.17 -0.13 0.23 0.47 0.33 -0.09 ...
$ tg_w8_rep_D : num -0.04 0.1 -0.25 0.37 -0.09 0.18 0.26 0.2 -0.35 -0.11 ...
The extra digits in your rownames are probably not being appended at all. Printing a data.frame shows about 7 significant digits, so head(data) displays 1.108029 while the stored value has more digits. When that numeric column is assigned as rownames, it is converted to character with up to 15 significant digits, which is what head(mat_data) shows. Also note that values this close together may not be unique, which makes them poor row identifiers; a data.frame would reject duplicate row names outright.
I'm not entirely certain what's going on with columns tg_q2_rep_A and tg_q4_rep_A, but it looks as though those values have been replaced by integer codes. That can happen if the class of those columns in your original data.frame, data, was "factor" rather than "numeric". Try this to check the classes:
sapply(data, class)
If you've got a mixture of numbers and letters in that column, for example, read.csv() will read it as a factor (or, from R 4.0.0 on, as character by default). When you convert those columns to numeric format, which is what data.matrix() does, the output will be the factor's internal integer level codes, not the original numbers.
I didn't get the same problem for those two columns when I copied and pasted your data into a csv file and loaded it into R, but I'm guessing that you haven't given us all the data there. My first step to figure this out would be to check the classes of the columns.
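A minimal sketch of the factor effect and one way around it, using a tiny made-up data frame in which one column contains a non-numeric entry. The column names and the "bad" entry are assumptions for illustration; the real file may contain something different.

```r
# Made-up data: rep_A holds a stray non-numeric entry, so it becomes a factor
data <- data.frame(
  id    = c(1.1, 1.2, 1.3),
  rep_A = c("0.42", "0.34", "bad"),
  rep_B = c(0.19, 0.40, -0.45),
  stringsAsFactors = TRUE
)

# data.matrix() returns the factor's integer level codes, not the numbers
codes <- data.matrix(data)[, "rep_A"]
codes
# [1] 2 1 3

# Convert factor columns via character first; non-numeric entries become NA
is_fac <- sapply(data, is.factor)
data[is_fac] <- lapply(data[is_fac], function(x) {
  suppressWarnings(as.numeric(as.character(x)))
})

fixed <- data.matrix(data)[, "rep_A"]
fixed
# [1] 0.42 0.34   NA
```

Rows where the converted column is NA point to the entries that caused the column to be read as a factor in the first place.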