Suppose I have a dataframe as follows:
df <- data.frame(
alpha = 0:20,
beta = 30:50,
gamma = 100:120
)
I have a custom function that makes new columns. (Note, my actual function is a lot more complex and can't be vectorized without a custom function, so please ignore the substance of the transformation here.) For example:
newfun <- function(var = NULL) {
newname <- paste0(var, "NEW")
df[[newname]] <- df[[var]]/100
return(df)
}
I want to apply this over many columns of the dataset repeatedly and have the dataset "build up." This happens just fine when I do the following:
df <- newfun("alpha")
df <- newfun("beta")
df <- newfun("gamma")
Obviously this is redundant and a case for map. But when I do the following I get back a list of dataframes, which is not what I want:
df <- data.frame(
alpha = 0:20,
beta = 30:50,
gamma = 100:120
)
out <- c("alpha", "beta", "gamma") %>%
map(function(x) newfun(x))
How can I iterate over a vector of column names AND see the changes repeatedly applied to the same dataframe?
Writing the function to reach outside of its own scope to find some df is risky and will eventually bite you, especially when you see something like:
df[['a']] <- 2
# Error in df[["a"]] <- 2 : object of type 'closure' is not subsettable
You will get this error when it doesn't find your variable named df, and instead finds the base function named df. Two morals from this discovery:
While I admit to using df myself, it's generally bad practice to name variables the same as R functions (especially from base); and
Scope-breach is sloppy: it renders a workflow unreproducible and makes problems or changes difficult to troubleshoot.
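As a quick sketch of the fix (newfun_safe is a hypothetical name): pass the frame in explicitly and return it, so the function never reaches into the global environment.

```r
# Hypothetical "data-in, data-out" refactor of newfun: the frame is an
# argument and the modified frame is the return value.
newfun_safe <- function(dat, var) {
  dat[[paste0(var, "NEW")]] <- dat[[var]] / 100
  dat
}
df <- newfun_safe(df, "alpha")
```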
To remedy this, and since your function relies on knowing what the old/new variable names are or should be, I think pmap or base R Map may work better. Further, I suggest that you name the new variables outside of the function, making it "data-only".
cols <- c("alpha", "beta", "gamma")
dat <- df   # the question's data; `dat` avoids masking base::df
myfunc <- function(x) x/100
setNames(lapply(dat[,cols], myfunc), paste0("new", cols))
# $newalpha
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17
# [19] 0.18 0.19 0.20
# $newbeta
# [1] 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47
# [19] 0.48 0.49 0.50
# $newgamma
# [1] 1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17
# [19] 1.18 1.19 1.20
From here, we just need to column-bind (cbind) it:
cbind(dat, setNames(lapply(dat[,cols], myfunc), paste0("new", cols)))
# alpha beta gamma newalpha newbeta newgamma
# 1 0 30 100 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04
# ...
Special note: if you plan on doing this iteratively (repeatedly), growing a frame by repeated binding is generally bad practice. This is well established for rows, and I suspect (without proof at the moment) the same holds for columns. For that reason, if you do this a lot, consider using do.call(cbind, c(list(dat), ...)) where ... is the list of things to add. This results in a single call to cbind and therefore only a single memory copy of the original dat. (Contrast that with iteratively calling the *bind functions, which make a complete copy with each pass and scale poorly.)
additions <- lapply(1:3, function(i) setNames(lapply(dat[,cols], myfunc), paste0("new", i, cols)))
str(additions)
# List of 3
# $ :List of 3
# ..$ new1alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new1beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new1gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
# $ :List of 3
# ..$ new2alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new2beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new2gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
# $ :List of 3
# ..$ new3alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new3beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new3gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
do.call(cbind, c(list(dat), additions))
# alpha beta gamma new1alpha new1beta new1gamma new2alpha new2beta new2gamma new3alpha new3beta new3gamma
# 1 0 30 100 0.00 0.30 1.00 0.00 0.30 1.00 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01 0.01 0.31 1.01 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02 0.02 0.32 1.02 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03 0.03 0.33 1.03 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04 0.04 0.34 1.04 0.04 0.34 1.04
# 6 5 35 105 0.05 0.35 1.05 0.05 0.35 1.05 0.05 0.35 1.05
# ...
An alternative approach is to change your function to only return a vector:
newfun2 <- function(var = NULL) {
df[[var]] / 100
}
newfun2('alpha')
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13
# [15] 0.14 0.15 0.16 0.17 0.18 0.19 0.20
Then, using base R, you can use lapply() to loop over the column names:
cols <- c("alpha", "beta", "gamma")
df[, paste0(cols, 'NEW')] <- lapply(cols, newfun2)
#or
#df[, paste0(cols, 'NEW')] <- purrr::map(cols, newfun2)
df
alpha beta gamma alphaNEW betaNEW gammaNEW
1 0 30 100 0.00 0.30 1.00
2 1 31 101 0.01 0.31 1.01
3 2 32 102 0.02 0.32 1.02
4 3 33 103 0.03 0.33 1.03
5 4 34 104 0.04 0.34 1.04
6 5 35 105 0.05 0.35 1.05
7 6 36 106 0.06 0.36 1.06
8 7 37 107 0.07 0.37 1.07
9 8 38 108 0.08 0.38 1.08
10 9 39 109 0.09 0.39 1.09
11 10 40 110 0.10 0.40 1.10
12 11 41 111 0.11 0.41 1.11
13 12 42 112 0.12 0.42 1.12
14 13 43 113 0.13 0.43 1.13
15 14 44 114 0.14 0.44 1.14
16 15 45 115 0.15 0.45 1.15
17 16 46 116 0.16 0.46 1.16
18 17 47 117 0.17 0.47 1.17
19 18 48 118 0.18 0.48 1.18
20 19 49 119 0.19 0.49 1.19
21 20 50 120 0.20 0.50 1.20
Based on the way you wrote your function, a for loop that repeatedly assigns the result of newfun back to df works pretty well.
vars <- names(df)
for (i in vars){
df <- newfun(i)
}
df
# alpha beta gamma alphaNEW betaNEW gammaNEW
# 1 0 30 100 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04
# 6 5 35 105 0.05 0.35 1.05
# 7 6 36 106 0.06 0.36 1.06
# 8 7 37 107 0.07 0.37 1.07
# 9 8 38 108 0.08 0.38 1.08
# 10 9 39 109 0.09 0.39 1.09
# 11 10 40 110 0.10 0.40 1.10
# 12 11 41 111 0.11 0.41 1.11
# 13 12 42 112 0.12 0.42 1.12
# 14 13 43 113 0.13 0.43 1.13
# 15 14 44 114 0.14 0.44 1.14
# 16 15 45 115 0.15 0.45 1.15
# 17 16 46 116 0.16 0.46 1.16
# 18 17 47 117 0.17 0.47 1.17
# 19 18 48 118 0.18 0.48 1.18
# 20 19 49 119 0.19 0.49 1.19
# 21 20 50 120 0.20 0.50 1.20
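Because each call feeds the updated frame into the next, the same accumulation can be written without an explicit loop using base R's Reduce. A sketch, assuming the function is rewritten to take the frame as its first argument (step is a hypothetical name):

```r
# Accumulator version of newfun: takes the current frame and a column
# name, returns the frame with the new column added.
step <- function(dat, var) {
  dat[[paste0(var, "NEW")]] <- dat[[var]] / 100
  dat
}
# Reduce threads the growing frame through the vector of column names.
df <- Reduce(step, c("alpha", "beta", "gamma"), init = df)
```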
I have chromatographic data in a table organized by peak position and integration value of various samples. All samples in the table have a repeated measurement as well with a different sample log number.
What I'm interested in is the repeatability of the measurements of the various peaks. The measure for that would be the difference in peak integration, which ideally equals 0 for each sample.
The data
Sample Log1 Log2 Peak1 Peak2 Peak3 Peak4 Peak5
A 100 104 0.20 0.80 0.30 0.00 0.00
B 101 106 0.25 0.73 0.29 0.01 0.04
C 102 103 0.20 0.80 0.30 0.00 0.07
C 103 102 0.22 0.81 0.31 0.04 0.00
A 104 100 0.21 0.70 0.33 0.00 0.10
B 106 101 0.20 0.73 0.37 0.00 0.03
with Log1 is the original sample log number, and Log2 is the repeat log number.
How can I construct a new variable for every peak (being the difference PeakX_Log1 - PeakX_Log2)?
Mind that in my example I only have 5 peaks. The real-life situation is a complex mixture involving >20 peaks, so very hard to do it by hand.
If you will only have two values for each sample, something like this could work:
df <- data.table::fread(
"Sample Log1 Log2 Peak1 Peak2 Peak3 Peak4 Peak5
A 100 104 0.20 0.80 0.30 0.00 0.00
B 101 106 0.25 0.73 0.29 0.01 0.04
C 102 103 0.20 0.80 0.30 0.00 0.07
C 103 102 0.22 0.81 0.31 0.04 0.00
A 104 100 0.21 0.70 0.33 0.00 0.10
B 106 101 0.20 0.73 0.37 0.00 0.03"
)
library(tidyverse)
new_df <- df %>%
mutate(Log = ifelse(Log1 < Log2,"Log1","Log2")) %>%
select(-Log1,-Log2) %>%
pivot_longer(cols = starts_with("Peak"),names_to = "Peak") %>%
pivot_wider(values_from = value, names_from = Log) %>%
mutate(Variation = Log1 - Log2)
new_df
# A tibble: 15 × 5
Sample Peak Log1 Log2 Variation
<chr> <chr> <dbl> <dbl> <dbl>
1 A Peak1 0.2 0.21 -0.0100
2 A Peak2 0.8 0.7 0.100
3 A Peak3 0.3 0.33 -0.0300
4 A Peak4 0 0 0
5 A Peak5 0 0.1 -0.1
6 B Peak1 0.25 0.2 0.05
7 B Peak2 0.73 0.73 0
8 B Peak3 0.29 0.37 -0.08
9 B Peak4 0.01 0 0.01
10 B Peak5 0.04 0.03 0.01
11 C Peak1 0.2 0.22 -0.0200
12 C Peak2 0.8 0.81 -0.0100
13 C Peak3 0.3 0.31 -0.0100
14 C Peak4 0 0.04 -0.04
15 C Peak5 0.07 0 0.07
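For comparison, a base R sketch of the same computation, assuming exactly two rows per sample (the grep-based column selection and the conversion to a plain data frame are the only assumptions beyond the data shown):

```r
# fread returns a data.table; convert so base-R [row, col] indexing applies.
peaks <- grep("^Peak", names(df), value = TRUE)
dfb   <- as.data.frame(df)
# Order so the lower log number comes first within each sample, then
# subtract the repeat measurement from the original, peak by peak.
dfb    <- dfb[order(dfb$Sample, dfb$Log1), ]
first  <- dfb[!duplicated(dfb$Sample), peaks]
second <- dfb[duplicated(dfb$Sample), peaks]
variation <- cbind(Sample = unique(dfb$Sample), first - second)
```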
I am trying to add multiple dataframes together but not in a bind fashion.
Is there an easy way to overlay and add dataframes on top of each other, as shown in this picture (not reproduced here)?
The number of columns will always be same; the row count will differ.
I want to sum the cells by row position, so Result[1,1] = Table1[1,1] + Table2[1,1] and so on: the result adds whatever cells have data, and the resulting table has the dimensions of the bigger table.
The tables are generated dynamically, so I'd like to refrain from any hardcoding.
Consider the following two data frames:
library(dplyr)  # for the %>% pipe
table1 <- replicate(4, round(runif(10,0,1),2)) %>% as.data.frame %>% setNames(LETTERS[1:4])
table2 <- replicate(4, round(runif(6,0,1),2)) %>% as.data.frame %>% setNames(LETTERS[1:4])
table1
A B C D
1 0.81 0.08 0.85 0.89
2 0.88 0.82 0.62 0.77
3 0.12 0.13 0.99 0.02
4 0.17 0.54 0.37 0.62
5 0.77 0.10 0.81 0.34
6 0.58 0.15 0.00 0.56
7 0.61 0.15 0.59 0.15
8 0.52 0.36 0.12 0.99
9 0.83 0.93 0.29 0.30
10 0.52 0.02 0.48 0.46
table2
A B C D
1 0.95 0.81 0.99 0.92
2 0.18 0.99 0.35 0.09
3 0.73 0.10 0.02 0.68
4 0.37 0.53 0.78 0.02
5 0.48 0.54 0.79 0.83
6 0.75 0.32 0.41 0.04
We might create a new variable called ID from their row numbers and use that to sum the values after binding the rows:
library(dplyr)
library(tibble)
bind_rows(table1 %>% rowid_to_column("ID"),table2 %>% rowid_to_column("ID")) %>%
group_by(ID) %>%
summarise(across(everything(),sum))
# A tibble: 10 x 5
ID A B C D
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1.76 0.89 1.84 1.81
2 2 1.06 1.81 0.97 0.86
3 3 0.85 0.23 1.01 0.7
4 4 0.54 1.07 1.15 0.64
5 5 1.25 0.64 1.6 1.17
6 6 1.33 0.47 0.41 0.6
7 7 0.61 0.15 0.59 0.15
8 8 0.52 0.36 0.12 0.99
9 9 0.83 0.93 0.290 0.3
10 10 0.52 0.02 0.48 0.46
A potentially more dangerous base R approach is to subset table1 to the dimensions of table2, and add them together:
table1[seq(1,nrow(table2)),seq(1,ncol(table2))] <- table1[seq(1,nrow(table2)),seq(1,ncol(table2))] + table2
table1
A B C D
1 1.76 0.89 1.84 1.81
2 1.06 1.81 0.97 0.86
3 0.85 0.23 1.01 0.70
4 0.54 1.07 1.15 0.64
5 1.25 0.64 1.60 1.17
6 1.33 0.47 0.41 0.60
7 0.61 0.15 0.59 0.15
8 0.52 0.36 0.12 0.99
9 0.83 0.93 0.29 0.30
10 0.52 0.02 0.48 0.46
# Create your data frames
df1<-data.frame(a=c(1,2,3),b=c(2,3,4),c=c(3,4,5))
df2<-data.frame(a=c(1,2),b=c(2,3),c=c(3,4))
# Create a new data frame from the bigger of the two
if (nrow(df1)>nrow(df2)){
df3 <-df1
} else {
df3<-df2
}
# For each line in the smaller data frame add it to the larger
for (number in 1:min(nrow(df1),nrow(df2))){
df3[number,] <- df1[number,]+df2[number,]
}
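An alternative that avoids the row-by-row loop: pad the shorter frame with zero rows so both have the dimensions of the larger, then add the matrices directly (pad_to is a hypothetical helper name):

```r
# Pad a frame with zero rows up to n rows, returning a matrix.
pad_to <- function(d, n) {
  rbind(as.matrix(d), matrix(0, n - nrow(d), ncol(d)))
}
n   <- max(nrow(df1), nrow(df2))
# Elementwise matrix addition; rows beyond the shorter frame add zeros.
df3 <- as.data.frame(pad_to(df1, n) + pad_to(df2, n))
```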
I am using apply and abind to create a dataframe with the average of all the individual values from three similar data frames. I want to loop this code where the only thing that changes is the name of the instrument I am using (CAI.600, Thermo.1, etc).
This is what I have so far:
FIDs <- c('CAI.600', 'Thermo.1')
for (Instrument in FIDs) {
A.avg <- apply(abind::abind(paste0('FID.Eval.A.1.', Instrument),
paste0('FID.Eval.A.2.', Instrument),
paste0('FID.Eval.A.3.', Instrument),
along = 3), 1:2, mean)
assign(paste0('FID.Eval.A.', Instrument), A.avg)
}
where all the df's look similar to this (same number of rows and columns):
> FID.Eval.A.1.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 84.98 20.90 0.06 254.96 0.01
2 49.98 20.90 0.09 150.09 0.09
3 25.00 20.94 0.09 75.24 0.31
4 85.03 10.00 0.08 251.99 -1.22
5 50.03 10.00 0.09 148.51 -1.06
6 24.99 10.00 0.07 74.00 -1.27
7 84.99 0.10 0.06 246.99 -3.13
8 50.03 0.10 0.14 146.50 -2.39
9 24.96 0.10 0.10 72.97 -2.55
> FID.Eval.A.2.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 85.45 21.37 0.53 255.43 0.48
2 50.45 21.37 0.56 150.56 0.56
3 25.47 21.41 0.56 75.71 0.78
4 85.50 10.47 0.55 252.46 -0.75
5 50.50 10.47 0.56 148.98 -0.59
6 25.46 10.47 0.54 74.47 -0.80
7 85.46 0.57 0.53 247.46 -2.66
8 50.50 0.57 0.61 146.97 -1.92
9 25.43 0.57 0.57 73.44 -2.08
> FID.Eval.A.3.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 85.32 21.24 0.40 255.30 0.35
2 50.32 21.24 0.43 150.43 0.43
3 25.34 21.28 0.43 75.58 0.65
4 85.37 10.34 0.42 252.33 -0.88
5 50.37 10.34 0.43 148.85 -0.72
6 25.33 10.34 0.41 74.34 -0.93
7 85.33 0.44 0.40 247.33 -2.79
8 50.37 0.44 0.48 146.84 -2.05
9 25.30 0.44 0.44 73.31 -2.21
I either get an error stating "along must be between 0 and 2", or, when I adjust along, a warning stating "argument is not numeric or logical: returning NA".
Should I be using something other than a for loop?
When I run abind without using a for loop, the end result looks like this:
## Average of repeat tests
FID.Eval.A.CAI.600 <- apply(abind::abind(FID.Eval.A.1.CAI.600,
FID.Eval.A.2.CAI.600,
FID.Eval.A.3.CAI.600,
along = 3), 1:2, mean)
FID.Eval.A.CAI.600 <- as.data.frame(FID.Eval.A.CAI.600)
> FID.Eval.A.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 85.25 21.17 0.33 255.23 0.28
2 50.25 21.17 0.36 150.36 0.36
3 25.27 21.21 0.36 75.51 0.58
4 85.30 10.27 0.35 252.26 -0.95
5 50.30 10.27 0.36 148.78 -0.79
6 25.26 10.27 0.34 74.27 -1.00
7 85.26 0.37 0.33 247.26 -2.86
8 50.30 0.37 0.41 146.77 -2.12
9 25.23 0.37 0.37 73.24 -2.28
Where 'FID.Eval.A.CAI.600' displays the average for each value from the three df's.
To fix the immediate problem, use get() to return an object by character reference. As it stands, your paste0 calls only return character strings, not the actual objects.
abind::abind(get(paste0('FID.Eval.A.1.', Instrument), envir=.GlobalEnv),
get(paste0('FID.Eval.A.2.', Instrument), envir=.GlobalEnv),
get(paste0('FID.Eval.A.3.', Instrument), envir=.GlobalEnv),
along = 3)
In fact, for a more dynamic solution, consider mget to return all objects matching a name pattern without hard-coding each of the three objects.
Also, in R it is best to avoid assign as much as possible. Instead, consider building one list of many objects with functional assignment, avoiding flooding the global environment with many separate objects. Below, sapply iterates over the instruments to build a named list of average matrices.
FIDs <- c('CAI.600', 'Thermo.1')
mat_list <- sapply(FIDs, function(Instrument) {
  FIDs_list <- mget(ls(pattern=Instrument, envir=.GlobalEnv), envir=.GlobalEnv)
  # along=3 stacks the 2-D frames into a 3-D array
  FIDs_arry <- do.call(abind::abind, c(FIDs_list, along=3))
  apply(FIDs_arry, 1:2, mean)
}, simplify = FALSE)
# OUTPUT ITEMS
mat_list$CAI.600
mat_list$Thermo.1
You can even update the names to conform to your original needs:
names(mat_list) <- paste0("FID.Eval.A.", names(mat_list))
mat_list$FID.Eval.A.CAI.600
mat_list$FID.Eval.A.Thermo.1
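If you truly need the results as separate variables afterwards, list2env can push the named list back into an environment, though keeping the list is usually cleaner:

```r
# Push each element of the named list into the global environment as its
# own variable (only if separate objects are really required).
list2env(mat_list, envir = .GlobalEnv)
```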
I am having some problems sorting my dataset into bins based on the numeric value of the data. I tried doing it with the function shingle from the lattice package, which seems to split it accurately.
However, I can't seem to extract the desired output, namely how the data is divided into the predefined bins. I seem only able to print it.
bin_interval = matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
ncol = 2, nrow = 16)
bin_1 = shingle(data_1,intervals = bin_interval)
How do I extract the intervals output by the shingle function, rather than only printing them?
the intervals being the output:
Intervals:
min max count
1 0.38 0.40 0
2 0.42 0.44 6
3 0.46 0.48 46
4 0.50 0.52 251
5 0.54 0.56 697
6 0.58 0.60 1062
7 0.62 0.64 1215
8 0.66 0.68 1227
9 0.70 0.72 1231
10 0.74 0.76 1293
11 0.78 0.80 1330
12 0.82 0.84 1739
13 0.86 0.88 2454
14 0.90 0.92 3048
15 0.94 0.96 8936
16 0.98 1.00 71446
That is, as a variable that can be fed to another function.
The shingle() function stores the interval information in its attributes, which you can inspect with attributes().
The levels are specifically given by attr(bin_1, "levels").
So:
library(lattice)
set.seed(1337)
data_1 = runif(100)
bin_interval = matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
ncol = 2, nrow = 16)
bin_1 = shingle(data_1,intervals = bin_interval)
attr(bin_1,"levels")
This gives:
[,1] [,2]
[1,] 0.38 0.40
[2,] 0.42 0.44
[3,] 0.46 0.48
[4,] 0.50 0.52
[5,] 0.54 0.56
[6,] 0.58 0.60
[7,] 0.62 0.64
[8,] 0.66 0.68
[9,] 0.70 0.72
[10,] 0.74 0.76
[11,] 0.78 0.80
[12,] 0.82 0.84
[13,] 0.86 0.88
[14,] 0.90 0.92
[15,] 0.94 0.96
[16,] 0.98 1.00
Edit
The count information for each interval is only computed within the print.shingle method. Thus, you would need to run the following code:
count.shingle = function(x){
l <- levels(x)
n <- nlevels(x)
int <- data.frame(min = numeric(n), max = numeric(n),
count = numeric(n))
for (i in 1:n) {
int$min[i] <- l[[i]][1]
int$max[i] <- l[[i]][2]
int$count[i] <- length(x[x >= l[[i]][1] & x <= l[[i]][2]])
}
int
}
a = count.shingle(bin_1)
This gives:
> a
min max count
1 0.38 0.40 0
2 0.42 0.44 1
3 0.46 0.48 3
4 0.50 0.52 1
5 0.54 0.56 2
6 0.58 0.60 2
7 0.62 0.64 2
8 0.66 0.68 4
9 0.70 0.72 1
10 0.74 0.76 3
11 0.78 0.80 2
12 0.82 0.84 2
13 0.86 0.88 5
14 0.90 0.92 1
15 0.94 0.96 1
16 0.98 1.00 2
where a$min is lower range, a$max is upper range, and a$count is the number within the bins.
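If all you need are the counts, the same numbers can be obtained directly from the interval matrix, without lattice at all (a sketch over the data_1 and bin_interval already defined above):

```r
# Count how many values of data_1 fall inside each [min, max] bin.
counts <- apply(bin_interval, 1, function(iv) {
  sum(data_1 >= iv[1] & data_1 <= iv[2])
})
data.frame(min = bin_interval[, 1], max = bin_interval[, 2], count = counts)
```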
I'm pushing data into a matrix so I can create a heatmap. The code I am using is identical to what is published here (http://sebastianraschka.com/Articles/heatmaps_in_r.html). For some of my datasets, when I push the data into the matrix format, some of the values change. Some of my datasets work fine but others do not, and I am unsure what underlying differences cause this strange behaviour.
Example code;
data <- read.csv("mydata.txt", sep="\t", header =TRUE)
rnames <- data[,1]
mat_data <- data.matrix(data[,2:ncol(data)])
rownames(mat_data) <- rnames
Now example dataframes..
head(data)
1 1.108029 0.42 0.19 0.04 0.47 -0.08 0.47 0.04 0.10
2 1.108029 0.34 0.40 0.25 0.56 -0.08 -0.06 0.11 0.20
3 1.121099 0.1 -0.45 0.11 -0.22 -0.07 -0.40 0.24 -0.17
4 1.123857 0.26 -0.15 0.15 0.31 0.2 -0.24 -0.27 0.40
5 1.129303 0.11 0.13 0.01 -0.11 0.38 0.29 -0.15 -0.18
6 1.135904 0.4 0.07 0.11 0.03 0.6 -0.32 0.14 -0.12
head(mat_data)
tg_q2_rep_A tg_q2_rep_B tg_q2_rep_C tg_q2_rep_D tg_q4_rep_A tg_q4_rep_B tg_q4_rep_C tg_q4_rep_D
1.10802929 70 0.19 0.04 0.47 5 0.47 0.04 0.10
1.1080293 65 0.40 0.25 0.56 5 -0.06 0.11 0.20
1.12109912 49 -0.45 0.11 -0.22 4 -0.40 0.24 -0.17
1.12385707 62 -0.15 0.15 0.31 53 -0.24 -0.27 0.40
1.12930344 50 0.13 0.01 -0.11 65 0.29 -0.15 -0.18
1.1359041 69 0.07 0.11 0.03 69 -0.32 0.14 -0.12
You can see the rownames have had numbers appended to the ends and the first data for tg_q2_rep_A and tg_q4_rep_A have been changed.
If anyone can suggest how to approach this I'd appreciate it. I've been trying to figure this out for days :/
EDIT
As requested ..
> str(data)
'data.frame': 137 obs. of 33 variables:
$ CpG_id.chr.pos.: num 1.11 1.11 1.12 1.12 1.13 ...
$ tg_q2_rep_A : Factor w/ 75 levels "-0.01","-0.02",..: 70 65 49 62 50 69 71 63 57 7 ...
$ tg_q2_rep_B : num 0.19 0.4 -0.45 -0.15 0.13 0.07 0.5 -0.33 0.23 -0.22 ...
$ tg_q2_rep_C : num 0.04 0.25 0.11 0.15 0.01 0.11 0.16 0.03 0.23 -0.32 ...
$ tg_q2_rep_D : num 0.47 0.56 -0.22 0.31 -0.11 0.03 0.31 0.21 0 0.06 ...
$ tg_q4_rep_A : Factor w/ 73 levels "-0.04","-0.05",..: 5 5 4 53 65 69 50 53 59 46 ...
$ tg_q4_rep_B : num 0.47 -0.06 -0.4 -0.24 0.29 -0.32 0.07 -0.23 0.1 -0.09 ...
$ tg_q4_rep_C : num 0.04 0.11 0.24 -0.27 -0.15 0.14 0.14 0.36 0.1 -0.05 ...
$ tg_q4_rep_D : num 0.1 0.2 -0.17 0.4 -0.18 -0.12 0.15 0.18 -0.21 -0.14 ...
$ tg_q6_rep_A : Factor w/ 79 levels "-0.02","-0.03",..: 46 3 7 67 65 77 64 61 41 12 ...
$ tg_q6_rep_B : Factor w/ 87 levels "-0.01","-0.03",..: 68 79 34 11 82 1 63 1 36 32 ...
$ tg_q6_rep_C : num 0.22 0.5 -0.32 0.13 0.24 0.25 0.35 0.07 0.01 -0.44 ...
$ tg_q6_rep_D : Factor w/ 82 levels "-0.04","-0.05",..: 55 50 27 74 71 68 73 61 5 31 ...
$ tg_q8_rep_A : Factor w/ 73 levels "-0.01","-0.02",..: 49 9 2 52 45 50 13 55 48 9 ...
$ tg_q8_rep_B : num 0.05 0.07 -0.31 0.02 0 -0.33 0.03 -0.05 0.08 0.1 ...
$ tg_q8_rep_C : num 0.35 0.5 -0.06 -0.1 0.24 -0.45 -0.27 0.1 0.15 -0.29 ...
$ tg_q8_rep_D : num 0.15 0.08 -0.08 0.31 0.28 0.43 0.41 0.25 -0.05 -0.04 ...
$ tg_w2_rep_A : Factor w/ 72 levels "-0.01","-0.02",..: 49 16 24 66 60 62 62 68 52 49 ...
$ tg_w2_rep_B : num 0.11 0.24 -0.03 -0.43 0.67 -0.13 0.05 -0.4 -0.13 -0.18 ...
$ tg_w2_rep_C : num 0 0.33 -0.09 0 0.12 -0.35 0.06 0.33 0.15 -0.19 ...
$ tg_w2_rep_D : num -0.04 0 -0.03 0.44 0.04 0.23 0.28 0.19 -0.21 -0.17 ...
$ tg_w4_rep_A : Factor w/ 69 levels "-0.0","-0.01",..: 55 58 53 50 52 67 68 63 27 8 ...
$ tg_w4_rep_B : num 0.29 0.63 -0.37 0.09 0.22 -0.21 0.1 -0.14 -0.04 -0.09 ...
$ tg_w4_rep_C : num 0.09 0.13 -0.08 0.17 0.15 -0.33 0 0.38 0.1 -0.62 ...
$ tg_w4_rep_D : num 0.11 0.33 -0.32 0.41 -0.1 0.07 0.23 0.22 0.1 0.06 ...
$ tg_w6_rep_A : Factor w/ 74 levels "-0.01","-0.02",..: 56 45 4 69 59 47 2 40 47 12 ...
$ tg_w6_rep_B : num 0.07 0.13 -0.14 0.15 0.13 -0.17 0.33 0.12 0.07 -0.15 ...
$ tg_w6_rep_C : num 0.13 0.22 0.31 0.08 0.16 -0.33 -0.05 0.43 0.43 -0.06 ...
$ tg_w6_rep_D : num 0.28 0.11 -0.2 0.66 -0.18 0.16 0.26 0.27 0.06 -0.02 ...
$ tg_w8_rep_A : Factor w/ 67 levels "-0.01","-0.02",..: 52 40 37 44 48 61 48 53 39 63 ...
$ tg_w8_rep_B : num 0.3 0.09 -0.22 -0.1 0.14 -0.25 0.1 -0.49 0.19 0.15 ...
$ tg_w8_rep_C : num 0.23 0.27 0.11 -0.25 0.17 -0.13 0.23 0.47 0.33 -0.09 ...
$ tg_w8_rep_D : num -0.04 0.1 -0.25 0.37 -0.09 0.18 0.26 0.2 -0.35 -0.11 ...
The problem with your rownames is that they aren't unique. R requires unique identifiers for each row, and you have multiple rows with the same value in the data.frame "data". When you try to force it to make the values in that first column rownames, it's trying to make them unique, and it looks as though it's rounding some numbers to accomplish that.
I'm not entirely certain what's going on with columns tg_q2_rep_A and tg_q4_rep_A, but it looks as though those values have been converted to ranks. That can happen if the class of those columns in your original data.frame, data, was "factor" rather than "numeric". Try this to check the classes:
sapply(data, class)
If you've got a mixture of numbers and letters in a column, for example, R will set the column's class to factor by default. When you convert such columns to numeric, which is what data.matrix() does, the output is the factor's underlying integer level code, not the original number.
I didn't get the same problem for those two columns when I copied and pasted your data into a csv file and loaded it into R, but I'm guessing that you haven't given us all the data there. My first step to figure this out would be to check the classes of the columns.
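The factor-to-level-code behaviour is easy to reproduce in isolation (a minimal sketch; the stray text value is made up for illustration):

```r
# One stray non-numeric entry forces the whole column to be read as a factor.
f <- factor(c("0.10", "n/a", "0.30"))
as.numeric(f)                 # integer level codes, not the original values
as.numeric(as.character(f))   # recovers 0.10, NA, 0.30 (with a warning)
```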