R: importing multiple files [duplicate]

I have data that is saved with a .log extension. The content of the file looks like this:
Trajectory Log File
Date: Sun Mar 04 15:32:29 2018
Nr of Trajectories: 91
Trajectory-Mode: ON
Average Slope (Degrees): 28.05 / 51.99 / 64.83
Filename: test_tschamut_Pos1.xml
Z-offset: 1.32000
Rock Position X: 696621.38
Rock Position Y: 167730.02
Rock Position Z: 1679.6400
Friction:
Overall Type: Medium
t (s) x (m) y (m) z (m) p0 () p1 () p2 () p3 () vx (m s-1) vy (m s-1) vz (m s-1) wx (rot s-1) wy (rot s-1) wz (rot s-1) Etot (kJ) Ekin (kJ) Ekintrans (kJ) Ekinrot (kJ) zt (m) Fv (kN) Fh (kN) Slippage (m) mu_s (N s m-1) v_res (m s-1) w_res (rot s-1) JumpH (m) ProjDist (m) Jc () JH_Jc (m) SD (m)
0.000 696621.380 167730.020 1680.960 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1192.526 0.000 0.000 0.000 1677.754 0.000 0.000 0.000 0.350 0.000 0.000 3.206 0.000 0.000 0.000 0.000
0.010 696621.380 167730.020 1680.959 1.000 0.000 -0.000 0.000 0.000 0.000 -0.098 0.000 0.000 0.000 1192.526 0.010 0.010 0.000 1677.754 0.000 0.000 0.000 0.350 0.098 0.000 3.205 0.000 0.000 0.000 0.000
0.020 696621.380 167730.020 1680.958 1.000 0.000 -0.000 0.000 0.000 0.000 -0.196 0.000 0.000 0.000 1192.526 0.039 0.039 0.000 1677.754 0.000 0.000 0.000 0.350 0.196 0.000 3.204 0.000 0.000 0.000 0.000
0.040 696621.380 167730.020 1680.952 1.000 0.000 -0.000 0.000 0.000 0.000 -0.392 0.000 0.000 0.000 1192.526 0.158 0.158 0.000 1677.754 0.000 0.000 0.000 0.350 0.392 0.000 3.198 0.000 0.000 0.000 0.000
0.060 696621.380 167730.020 1680.942 1.000 0.000 -0.000 0.000 0.000 0.000 -0.589 0.000 0.000 0.000 1192.526 0.355 0.355 0.000 1677.754 0.000 0.000 0.000 0.350 0.589 0.000 3.188 0.000 0.000 0.000 0.000
0.080 696621.380 167730.020 1680.929 1.000 0.000 -0.000 0.000 0.000 0.000 -0.785 0.000 0.000 0.000 1192.526 0.631 0.631 0.000 1677.754 0.000 0.000 0.000 0.350 0.785 0.000 3.175 0.000 0.000 0.000 0.000
0.110 696621.380 167730.020 1680.901 1.000 0.000 -0.000 0.000 0.000 0.000 -1.079 0.000 0.000 0.000 1192.526 1.193 1.193 0.000 1677.754 0.000 0.000 0.000 0.350 1.079 0.000 3.147 0.000 0.000 0.000 0.000
0.130 696621.380 167730.020 1680.877 1.000 0.000 -0.000 0.000 0.000 0.000 -1.275 0.000 0.000 0.000 1192.526 1.666 1.666 0.000 1677.754 0.000 0.000 0.000 0.350 1.275 0.000 3.123 0.000 0.000 0.000 0.000
0.150 696621.380 167730.020 1680.850 1.000 0.000 -0.000 0.000 0.000 0.000 -1.472 0.000 0.000 0.000 1192.526 2.218 2.218 0.000 1677.754 0.000 0.000 0.000 0.350 1.472 0.000 3.096 0.000 0.000 0.000 0.000
0.160 696621.380 167730.020 1680.834 1.000 0.000 -0.000 0.000 0.000 0.000 -1.570 0.000 0.000 0.000 1192.526 2.523 2.523 0.000 1677.754 0.000 0.000 0.000 0.350 1.570 0.000 3.080 0.000 0.000 0.000 0.000
0.180
How can I import this data format into R?
Is it possible to import many files of this format at the same time?
Thanks!

You can read the tabular part using read.table or read.fwf with skip=N (where N is the number of lines before the first row of the table you want to read). Then you can read the first N lines with readLines and extract the parts you need.
For importing many files with the same format, I suggest you start by writing a function that combines all the steps, something like this:
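read.log <- function(filename, n_skip = 13) {
  # n_skip: number of lines before the first data row.
  # 13 matches the excerpt above only if the real file has no blank lines;
  # adjust it to your files.
  first <- readLines(filename, n = n_skip)
  date <- sub("Date: ", "", first[2], fixed = TRUE)  # get date from second row
  # convert to date-time; see ?strptime for the format codes
  # (%a and %b are locale-dependent and expect English names here)
  date <- as.POSIXct(date, format = "%a %b %d %H:%M:%S %Y")
  # ... likewise for other pieces, e.g. nr. of trajectories
  # the header row contains spaces inside names, so it is skipped as well;
  # the columns come back as V1, V2, ...
  data <- read.table(filename, skip = n_skip)
  # return as a list
  list(data = data, date = date)
}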
Now you can use this function to read in any file, or apply it to a list of file names (using lapply or a for loop).
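As a rough sketch of the "apply it to a list of file names" step, the block below writes two tiny stand-in files and reads them back with list.files and lapply, tagging each row with its source file. The reader read_one, the three-line header, and the file names run1.log and run2.log are assumptions made for illustration, not the real log layout.

```r
# Hypothetical reader: assumes n_skip lines precede the data rows
# and that the data rows are whitespace-separated numbers.
read_one <- function(path, n_skip = 3) {
  dat <- read.table(path, skip = n_skip)   # columns are named V1, V2, ...
  dat$file <- basename(path)               # ID column: which file a row came from
  dat
}

# Create two small stand-in log files in a temporary directory
log_dir <- tempfile("logs")
dir.create(log_dir)
for (i in 1:2) {
  writeLines(c("Trajectory Log File",
               "Date: Sun Mar 04 15:32:29 2018",
               "t x",
               "0.00 1.0",
               "0.01 1.5"),
             file.path(log_dir, sprintf("run%d.log", i)))
}

# Find every .log file, read each one, and stack the results
files <- list.files(log_dir, pattern = "\\.log$", full.names = TRUE)
all_runs <- do.call(rbind, lapply(files, read_one))
```

Keeping the per-file results in a list (dropping the do.call(rbind, ...) step) may be preferable when the files do not share the same columns.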

Related

Subset a dataframe to include samples from another file

I currently have a count matrix data.frame where the rownames are the genes and the colnames are the sample names
head(colnames(countmatrix_clean_cl_mouse))
[1] "UB01.31YE" "UT38.78EE" "YW49.74CE" "OB13.46DD" "OT35.78PE" "KE51.98JE"
head(rownames(countmatrix_clean_cl_mouse))
[1] "Gnai3" "Pbsn" "Cdc45" "H19" "Scml2" "Apoh"
head(countmatrix_clean_cl_mouse[,1:10])
UB01.31YE UT38.78EE YW49.74CE OB13.46DD OT35.78PE KE51.98JE YB40.88ZA UI68.54DC GB09.27EE QI98.56TC
Gnai3 88.608 67.174 104.042 103.504 80.314 81.985 104.550 58.628 70.957 89.278
Pbsn 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Cdc45 10.121 6.637 12.057 5.356 13.340 3.309 7.987 83.508 8.491 93.227
H19 43.613 2.044 152.882 0.095 0.455 0.325 1.660 0.278 0.313 0.037
Scml2 0.342 0.000 0.283 0.517 0.000 0.000 0.000 2.262 0.684 4.787
Apoh 0.000 0.781 0.204 0.000 0.000 0.000 0.000 0.071 0.000 0.059
The above data.frame includes 963 samples, but I want to subset it to the samples that I have in a separate Excel sheet, which looks like the one below. The sample names are the same but have a "-" instead of a ".".
> head(pdac_samples)
V1
1 GT34-87JE
2 QT33-82OE
3 KT30-82ZE
4 UT38-78EE
5 SO33-16DD
6 CD10-05ZE
How would I go about subsetting countmatrix_clean_cl_mouse?
You can use sub to replace the - with ., then find the names in common, and use standard data[row, column] subsetting:
dot_names = sub(pattern = "-", replacement = ".", pdac_samples$V1, fixed = TRUE)
names_in_common = intersect(names(countmatrix_clean_cl_mouse), dot_names)
countmatrix_subset = countmatrix_clean_cl_mouse[, names_in_common, drop = FALSE]
# UT38.78EE
# Gnai3 67.174
# Pbsn 0.000
# Cdc45 6.637
# H19 2.044
# Scml2 0.000
# Apoh 0.781
Using this sample data:
countmatrix_clean_cl_mouse = read.table(text = ' UB01.31YE UT38.78EE YW49.74CE OB13.46DD OT35.78PE KE51.98JE YB40.88ZA UI68.54DC GB09.27EE QI98.56TC
Gnai3 88.608 67.174 104.042 103.504 80.314 81.985 104.550 58.628 70.957 89.278
Pbsn 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Cdc45 10.121 6.637 12.057 5.356 13.340 3.309 7.987 83.508 8.491 93.227
H19 43.613 2.044 152.882 0.095 0.455 0.325 1.660 0.278 0.313 0.037
Scml2 0.342 0.000 0.283 0.517 0.000 0.000 0.000 2.262 0.684 4.787
Apoh 0.000 0.781 0.204 0.000 0.000 0.000 0.000 0.071 0.000 0.059', header = T)
pdac_samples = read.table(text = ' V1
1 GT34-87JE
2 QT33-82OE
3 KT30-82ZE
4 UT38-78EE
5 SO33-16DD
6 CD10-05ZE', header = T)
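Because intersect silently drops names that are not found, it can help to list the requested samples that are missing from the count matrix. The sketch below uses a tiny made-up matrix (counts) and sample vector (wanted) as stand-ins for the real objects; chartr is used here instead of sub only because it replaces every "-" rather than just the first.

```r
# Toy stand-ins for countmatrix_clean_cl_mouse and pdac_samples$V1
counts <- data.frame(UB01.31YE = c(88.608, 0.000),
                     UT38.78EE = c(67.174, 0.000),
                     row.names = c("Gnai3", "Pbsn"))
wanted <- c("GT34-87JE", "UT38-78EE")

wanted_dot <- chartr("-", ".", wanted)            # "-" becomes "." everywhere
found   <- intersect(colnames(counts), wanted_dot)
missing <- setdiff(wanted_dot, colnames(counts))  # requested but not present

# drop = FALSE keeps the result a data.frame even with one column
counts_subset <- counts[, found, drop = FALSE]
```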

R: "increasing 'x' and 'y' values expected"

I'm trying to create a perspective graph in R and keep getting the increasing 'x' and 'y' error. I've tried numerous options but I can't seem to figure this out. Any help would be appreciated!
fit.A <- data.frame(Temp.f="Ambient",x,y)
fit.A$pred <- predict(model=lrNH4,newdata=fit.A)
x
[1] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
[17] 0.619 0.626 0.630 0.635 0.649 0.656 1.902 1.947 1.967 2.056 2.689 2.707 2.758 2.760 2.943 2.978
[33] 2.992 3.020 4.564 4.854 5.893 6.029 6.051 6.067
y
[1] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
[17] 0.479 0.530 0.566 0.605 0.635 0.726 1.909 1.916 2.128 2.195 2.636 2.645 2.747 2.777 2.943 3.057
[33] 3.169 3.203 4.657 4.813 5.956 5.986 6.154 6.157
persp(x,y,z=matrix(fit.A$pred,nrow=length(x),ncol=length(y),byrow=TRUE),
zlim=c(140,700))
Error in persp.default(x, y, z = matrix(fit.A$pred, nrow = length(x),
: increasing 'x' and 'y' values expected
Your x and y values aren't strictly increasing: the first 16 values of each are all 0, and persp does not accept repeated values.
To create a perspective plot, remove all rows where x and/or y are duplicated. This could be done with:
fit.A <- fit.A[!duplicated(fit.A$x), ]
(In different data it's possible you'd need to filter for duplicated y as well).
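Separately from removing duplicates, persp expects z to hold one value for every combination of x and y, so a different approach is to predict on a regular grid. The sketch below is an assumption-laden illustration rather than the asker's setup: the data are simulated, and a plain lm fit stands in for lrNH4.

```r
set.seed(1)
# Simulated observations; the real data and model are not available here
obs <- data.frame(x = runif(50, 0, 6), y = runif(50, 0, 6))
obs$resp <- 150 + 40 * obs$x + 30 * obs$y + rnorm(50)
model <- lm(resp ~ x + y, data = obs)    # stand-in for lrNH4

# Strictly increasing axis values
xs <- seq(0, 6, length.out = 20)
ys <- seq(0, 6, length.out = 20)

# expand.grid varies x fastest, which matches column-wise matrix filling,
# so z[i, j] is the prediction at (xs[i], ys[j])
grid <- expand.grid(x = xs, y = ys)
z <- matrix(predict(model, newdata = grid),
            nrow = length(xs), ncol = length(ys))

persp(xs, ys, z, xlab = "x", ylab = "y", zlab = "predicted")
```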

Hierarchical Clustering in R - 'pvclust' Issues

I have made a reproducible example where I am having trouble with pvclust. My goal is to pick the ideal clusters in a hierarchical cluster dendrogram. I've heard of 'pvclust' but can't figure out how to use it. If anyone has other suggestions for determining the ideal clusters, that would also be really helpful.
My code is provided below.
library(pvclust)
employee<- c('A','B','C','D','E','F','G','H','I',
'J','K','L','M','N','O','P',
'Q','R','S','T',
'U','V','W','X','Y','Z')
salary<-c(20,30,40,50,20,40,23,05,56,23,15,43,53,65,67,23,12,14,35,11,10,56,78,23,43,56)
testing90<-cbind(employee,salary)
testing90<-as.data.frame(testing90)
head(testing90)
testing90$salary<-as.numeric(testing90$salary)
row.names(testing90)<-testing90$employee
testing91<-data.frame(testing90[,-1])
head(testing91)
row.names(testing91)<-testing90$employee
d<-dist(as.matrix(testing91))
hc<-hclust(d,method = "ward.D2")
hc
plot(hc)
par(cex=0.6, mar=c(5, 8, 4, 1))
plot(hc, xlab="", ylab="", main="", sub="", axes=FALSE)
par(cex=1)
title(xlab="Publishers", main="Hierarchal Cluster of Publishers by eCPM")
axis(2)
fit<-pvclust(d, method.hclust="ward.D2", nboot=1000, method.dist="eucl")
An error came up stating:
Error in names(edges.cnt) <- paste("r", 1:rl, sep = "") :
'names' attribute [2] must be the same length as the vector [0]
A solution would be to force your object d into a matrix.
From the helpfile of pvclust:
data numeric data matrix or data frame.
Note that an object of type dist stores only the lower triangle of the distance matrix. Forcing it into a matrix fills in the full symmetric matrix: zeros on the diagonal and the same distances mirrored on both sides of it. You can check the object that is being taken into account with the call:
as.matrix(d)
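A small base-R illustration of that conversion, using three made-up values rather than the salary data:

```r
# Distances between three named toy values
d_toy <- dist(c(a = 20, b = 30, c = 45))

# as.matrix() expands the stored lower triangle into a full square matrix
m_toy <- as.matrix(d_toy)

isSymmetric(m_toy)   # the upper triangle mirrors the lower one
diag(m_toy)          # each point is at distance 0 from itself
m_toy["a", "b"]      # same value as m_toy["b", "a"]
```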
This would be the call you are looking for:
#note that I can't
pvclust(as.matrix(d), method.hclust="ward.D2", nboot=1000, method.dist="eucl")
#Bootstrap (r = 0.5)... Done.
#Bootstrap (r = 0.58)... Done.
#Bootstrap (r = 0.69)... Done.
#Bootstrap (r = 0.77)... Done.
#Bootstrap (r = 0.88)... Done.
#Bootstrap (r = 1.0)... Done.
#Bootstrap (r = 1.08)... Done.
#Bootstrap (r = 1.19)... Done.
#Bootstrap (r = 1.27)... Done.
#Bootstrap (r = 1.38)... Done.
#
#Cluster method: ward.D2
#Distance : euclidean
#
#Estimates on edges:
#
# au bp se.au se.bp v c pchi
#1 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#2 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#3 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#4 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#5 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#6 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#7 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#8 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#9 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#10 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#11 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#12 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#13 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#14 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#15 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#16 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#17 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#18 1.000 1.000 0.000 0.000 0.000 0.000 0.000
#19 0.853 0.885 0.022 0.003 -1.126 -0.076 0.058
#20 0.854 0.885 0.022 0.003 -1.128 -0.073 0.069
#21 0.861 0.897 0.022 0.003 -1.176 -0.090 0.082
#22 0.840 0.886 0.024 0.003 -1.100 -0.106 0.060
#23 0.794 0.690 0.023 0.005 -0.658 0.162 0.591
#24 0.828 0.686 0.020 0.005 -0.716 0.232 0.704
#25 1.000 1.000 0.000 0.000 0.000 0.000 0.000
Note that this fixes the error in your call, but whether the clustering method is valid and whether your data are of sufficient quality is for you to decide. I took your MRE at face value.

R rename duplicate col and rownames (subindexing)

I would very much appreciate it if a kind soul could tell me how to do this in R:
Given a square matrix with duplicated columns and rows, such as
1 1 2 2 2 2 3
1 0.000 0.000 0.048 0.048 0.048 0.048 0.059
1 0.000 0.000 0.048 0.048 0.048 0.048 0.059
2 0.048 0.048 0.000 0.000 0.000 0.000 0.059
2 0.048 0.048 0.000 0.000 0.000 0.000 0.059
2 0.048 0.048 0.000 0.000 0.000 0.000 0.059
2 0.048 0.048 0.000 0.000 0.000 0.000 0.059
3 0.059 0.059 0.059 0.059 0.059 0.059 0.000
where identical col and row names designate duplicates, I need unique col and row names while keeping track of the original and duplicate cols/rows. That is, something like
1 1a 2 2a 2b 2c 3
1 0.000 0.000 0.048 0.048 0.048 0.048 0.059
1a 0.000 0.000 0.048 0.048 0.048 0.048 0.059
2 0.048 0.048 0.000 0.000 0.000 0.000 0.059
2a 0.048 0.048 0.000 0.000 0.000 0.000 0.059
2b 0.048 0.048 0.000 0.000 0.000 0.000 0.059
2c 0.048 0.048 0.000 0.000 0.000 0.000 0.059
3 0.059 0.059 0.059 0.059 0.059 0.059 0.000
Thanks in advance
You could use ?make.unique or ?make.names:
v <- as.character(c(1, 1, 2, 2, 2, 2, 3))
make.unique(v)
# [1] "1" "1.1" "2" "2.1" "2.2" "2.3" "3"
(You have to combine this with rownames and colnames.)
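A minimal sketch of that combination on a small made-up matrix follows. The helper suffix_letters is hypothetical: it shows one possible way to get the "1a", "2b" style asked for in the question, and it assumes no name occurs more than 27 times.

```r
m <- matrix(0, nrow = 4, ncol = 4,
            dimnames = list(c("1", "1", "2", "2"), c("1", "1", "2", "2")))

# make.unique() appends .1, .2, ... to repeated names
rownames(m) <- make.unique(rownames(m))
colnames(m) <- make.unique(colnames(m))

# Hypothetical helper: letter suffixes instead of numeric ones
suffix_letters <- function(v) {
  # running count of each name: 1 for the first occurrence, 2 for the next, ...
  idx <- ave(seq_along(v), v, FUN = seq_along)
  paste0(v, c("", letters)[idx])
}

v <- as.character(c(1, 1, 2, 2, 2, 2, 3))
suffix_letters(v)
```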

How to filter rows based on certain criteria?

I have an example file as follows:
GENES Samp1 Samp2 Samp3 Samp4 Samp5 Samp6 Samp7 Samp8
g1 0.000 0.000 0.000 0.000 0.010 0.000 0.022 0.344
g2 0.700 0.000 0.000 0.000 0.000 0.000 0.000 0.000
g3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
g4 0.322 0.782 0.000 0.023 0.000 0.000 0.000 0.345
g5 0.010 0.000 0.333 0.000 0.000 0.000 0.011 0.000
g6 0.000 0.000 0.010 0.000 0.000 0.000 0.000 0.000
I need to retrieve the rows (genes) that have 2 or more samples with values of 0.010 or more. So I should get the following column as a result:
GENES
g1
g4
g5
Can anyone help me with this ?
Here's one possible way:
DF <- read.table(text=
"GENES Samp1 Samp2 Samp3 Samp4 Samp5 Samp6 Samp7 Samp8
g1 0.000 0.000 0.000 0.000 0.010 0.000 0.022 0.344
g2 0.700 0.000 0.000 0.000 0.000 0.000 0.000 0.000
g3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
g4 0.322 0.782 0.000 0.023 0.000 0.000 0.000 0.345
g5 0.010 0.000 0.333 0.000 0.000 0.000 0.011 0.000
g6 0.000 0.000 0.010 0.000 0.000 0.000 0.000 0.000",header=T,sep=' ')
rows <- sapply(1:nrow(DF),FUN=function(i){sum(DF[i,2:ncol(DF)] >= 0.01) >= 2})
subSet <- DF[rows,]
> subSet
GENES Samp1 Samp2 Samp3 Samp4 Samp5 Samp6 Samp7 Samp8
1 g1 0.000 0.000 0.000 0.000 0.01 0 0.022 0.344
4 g4 0.322 0.782 0.000 0.023 0.00 0 0.000 0.345
5 g5 0.010 0.000 0.333 0.000 0.00 0 0.011 0.000
or similarly this:
# drop the GENES column first so apply() compares numbers, not strings
subSet <- DF[apply(DF[, -1], 1, function(x){sum(x >= 0.01) >= 2}),]
or this:
subSet <- DF[rowSums(DF[,2:ncol(DF)] >= 0.01) >= 2,]
As you can see, there are many ways to accomplish that :)
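Since the question asks only for the gene names, a short sketch that returns just the GENES values is shown below. The names min_value, min_samples, and hits are made up for this example; the data are the rows from the question.

```r
DF <- read.table(text =
"GENES Samp1 Samp2 Samp3 Samp4 Samp5 Samp6 Samp7 Samp8
g1 0.000 0.000 0.000 0.000 0.010 0.000 0.022 0.344
g2 0.700 0.000 0.000 0.000 0.000 0.000 0.000 0.000
g3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
g4 0.322 0.782 0.000 0.023 0.000 0.000 0.000 0.345
g5 0.010 0.000 0.333 0.000 0.000 0.000 0.011 0.000
g6 0.000 0.000 0.010 0.000 0.000 0.000 0.000 0.000",
header = TRUE, stringsAsFactors = FALSE)

min_value   <- 0.01  # threshold a sample must reach
min_samples <- 2     # how many samples must reach it

# DF[-1] drops the GENES column; rowSums counts the TRUEs in each row
hits <- DF$GENES[rowSums(DF[-1] >= min_value) >= min_samples]
hits
```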
