Iterating over data.table in lapply or for loop - r

I basically am wondering why it makes a difference if you iterate over a vector created from a data.table or over a basic vector and why that is the case.
I think basically it's quite similar to this question, but I can narrow it down to a more basic example.
dt <- setDT(structure(list(File.Name = c("file1.xlsx", "file2.xlsx", "file3.xlsx")
, Split.ID = structure(c(1L, 1L, 1L)
, .Label = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11","12", "13", "14", "15", "16", "17", "18", "19", "20"), class = "factor"))
, .Names = c("File.Name", "Split.ID")
, class = c("data.table", "data.frame"), row.names = c(NA, -3L)))
for(file in as.vector(dt[1:3,1])){
print(file)
}
for(i in letters[1:4]){print(i)}
As can be seen from the output, in the first for-loop it just prints out all values in the first iteration and in the second loop it loops over the distinct letters.
I am trying to use the file names in connection with readxl, so vectorizing the function is not really an option, since I want to do it sequentially. Additionally I would like to keep the list of files as an data.table to be able to split in different parts.

Related

Recode a factor variable, dropping N/A

I have a factor variable with 14 levels, which I'm trying to into collapse into only 3 levels. It contains two N/A which I also wanna remove.
My code looks like this:
job <- fct_collapse(E$occupation, other = c("7","9", "10", "13" "14"), 1 = c("1", "2", "3", "12"), 2 = c("4", "5", "6", "8", "11"))
However it just gives me tons of error. Can anyone help here me here?
We could also this with a named list
library(forcats)
lst1 <- setNames(list(as.character(c(7, 9, 10, 13, 14)),
as.character(c(1, 2, 3, 12)), as.character(c(4, 5, 6, 8, 11))), c('other', 1, 2))
fct_collapse(df$occupation, !!!lst1)
data
df <- structure(list(occupation = c("1", "3", "5", "7", "9", "10",
"12", "14", "13", "4", "7", "6", "5")), class = "data.frame", row.names = c(NA,
-13L))
For numbers try using backquotes in fct_collapse.
job <- forcats::fct_collapse(df$occupation,
other = c("7","9", "10", "13", "14"),
`1` = c("1", "2", "3", "12"),
`2` = c("4", "5", "6", "8", "11"))

Is there a way to have separate weighted edges for ins and outs in igraph R?

I'm making a directed network in R studio with the igraph package. I would like to display two edges between nodes where applicable, both weighted, one for ins and one for outs. I'm very new to R and managed to get weighted edges that compile both ins and outs but would like them separated.
I've googled my problem with no avail, it might be from phrasing it wrong. I apologize in advance if I worded it badly.
EDIT: Minimal reproducible sample:
OPR.df <- data.frame("From" = c(c("8", "8", "8", "8", "7", "25", "24", "1A", "12", "12"),
c("12", "12", "12", "17", "17", "17"),
c("17", "17", "17", "17"),
c("17", "17", "17", "17", "17", "9A", "9", "17", "9", "17", "9"),
c("9", "17", "17", "17")),
"To" = c(c("8", "8", "8", "7", "25", "24", "1A", "12", "12", "12"),
c("12", "12", "17", "17", "17", "17"),
c("17", "17", "17", "17"),
c("17", "17", "17", "17", "9A", "9", "17", "9", "17", "9", "17"),
c("17", "17", "17", "17")))
opr.d <- graph_from_data_frame(d = OPR.df,
directed = T)
# I think this is the part where I set this??
E(opr.d)$weight <- 1
opr.sd <- simplify(opr.d,
remove.multiple = T,
remove.loops = F,
edge.attr.comb = c(weight = "sum",
type = "ignore"))
E(opr.sd)$width <- E(opr.sd)$weight/3
There are a number of things that you can do to make the two-way
links more visible. First, plotting using the default layout crowds
the vertices 9, 9A and 17 too close together. There is no room to
see the edges. I will use layout_with_graphopt , which works fine
for this example, although for more complex examples you may need
to tune up the layout even more.
set.seed(4321)
plot(opr.sd, xpd=NA)
set.seed(4321)
plot(opr.sd, layout=layout_with_graphopt)
Of course, we still have the problem from your original question:
the arrows overlap each other. You can fix this using the edge.curved
argument. I wanted all of the arrows to be straight except where they
overlap, so I created a customized curvature vector to adjust only the
overlapping edges. Also, the arrow heads are too big and made it hard
to see the arrows, so I made the heads a bit smaller. All together, we get:
CURV = rep(0,ecount(opr.sd))
CURV[2] = 0.6
CURV[11] = 0.6
CURV[13] = 0.6
set.seed(4321)
plot(opr.sd, layout=layout_with_graphopt,
edge.arrow.size=0.7, edge.curved=CURV, frame=T)
You might still want to tweak this a bit, but I think this shows the
path to solving your problem.

Iterate several operations over a list of files in a directory and save with new dynamic filename in R, Lapply?

I am new to R and would like to read in a list of files as separate data frames, perform several operations on each, and save them out as separate files with dynamic file names. I am thinking I should use lappy, but not sure.
Here is the code I wrote that works for one file:
df <- read.fwf('USC00011084.dly', widths = c(21, rep(c(5, 1, 1, 1),31)))
df2 <- df[-c(3:5, 7:9, 11:13, 15:17, 19:21, 23:25, 27:29, 31:33, 35:37, 39:41, 43:45, 47:49, 51:53, 55:57, 59:61, 63:65, 67:69, 71:73, 75:77, 79:81, 83:85, 87:89, 91:93, 95:97, 99:101, 103:105, 107:109, 111:113, 115:117, 119:121, 123:125)]
df2[df2=="-9999"]<-NA
df$new <- rowSums(df2[,2:32], na.rm = TRUE)
df2["Total"] <- df$new
colnames(df2) <- c("StationDateType", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "28", "30", "31", "TotalMonthly")
Prcp <- df2[grep("PRCP", df2$StationDateType),]
write.table(Prcp, "USC00011084Prcp.txt", sep="\t", row.names=FALSE)
How can I do this for a list of files in a directory? Any ideas? Thank you.
You can try this...
You can get a list of your files:
files <- list.files(getwd())
Write a function that performs the analysis you want and writes the results to table, as you have done. Here we use tools::file_path_sans_ext to extract the filename (without the file type extension), and at the end use it to name the table to be saved to txt.
myFunction <- function(files){
fileName <- tools::file_path_sans_ext(files)
df <- read.fwf(files, widths = c(21, rep(c(5, 1, 1, 1),31)))
# rest of your code
# ...
write.table(Prcp, paste0(fileName, "Prcp.txt"), sep="\t", row.names=FALSE)
}
You can use lapply to run your function on each file in files.
lapply(files, function(x) myFunction(x))

3D euclidean distance to identify unknown samples

I have this dataframe called mydf where I have three principal covariates (PCA.1,PCA.2, PCA.3). I want to get the 3d distance matrix and get the shortest euclidean distance between all the compared Samples. In another dataframe called myref, I have some known identity of Samples and some unknown samples. By calculating the shortest euclidean distance from mydf, I want to assign the known Identity to the unknown samples. Can someone please help me get this done.
mydf
mydf <- structure(list(Sample = c("1", "2", "4", "5", "6", "7", "8",
"9", "10", "12"), PCA.1 = c(0.00338, -0.020373, -0.019842, -0.019161,
-0.019594, -0.019728, -0.020356, 0.043339, -0.017559, -0.020657
), PCA.2 = c(0.00047, -0.010116, -0.011532, -0.011582, -0.013245,
-0.011751, -0.010299, -0.005801, -0.01, -0.011334), PCA.3 = c(-0.008787,
0.001412, 0.003751, 0.00371, 0.004242, 0.003738, 0.000592, -0.037229,
0.004307, 0.00339)), .Names = c("Sample", "PCA.1", "PCA.2", "PCA.3"
), row.names = c(NA, 10L), class = "data.frame")
myref
myref<- structure(list(Sample = c("1", "2", "4", "5", "6", "7", "8",
"9", "10", "12"), Identity = c("apple", "unknown", "ball", "unknown",
"unknown", "car", "unknown", "cat", "unknown", "dog")), .Names = c("Sample",
"Identity"), row.names = c(NA, 10L), class = "data.frame")
uIX = which(myref$Identity == "unknown")
dMat = as.matrix(dist(mydf[, -1])) # Calculate the Euclidean distance matrix
nn = apply(dMat, 1, order)[2, ] # For each row of dMat order the values increasing values.
# Select nearest neighbor (it is 2, because 1st row will be self)
myref$Identity[uIX] = myref$Identity[nn[uIX]]
Note that the above code will set some identities to unknown. If instead you want to match to the nearest neighbor with a known identity, change the second line to
dMat[uIX, uIX] = Inf

ggplot2 for procrustes rotation in vegan

I want to plot procrustes rotations between to RDA-objects obtained by vegan with ggplot2.
library(vegan)
#perform two RDAs, do procrustes:
pro.test <- procrustes(rda.t1,rda.t2)
I extracted the x,y coordinates from list of class "procrustes" and added a factor "dates".
test <- data.frame(rda1=pro.test$Yrot[,1], rda2=pro.test$Yrot[,2])
test$dates <- c(rep("A", 8), rep("B",8), rep("C", 8))
test.2 <- data.frame(rda1=pro.test$X[,1], rda2=pro.test$X[,2])
test.2$dates <- c(rep("A", 8), rep("B",8), rep("C", 8))
Now the basic plot:
ggplot() +
geom_point(data=test, aes(x=rda1, y=rda2, color=dates)) +
geom_point(data=test.2,aes(x=rda1, y=rda2, color=dates))
The part i cannot do is the plotting of small lines between each corresponding point in test and test.2
Vegan does plot these rotations with arrows rather than connected points. However, vegan does not khow to color according to sampling groups/factors, which is important for me.
Having arrows in the ggplot would be extremely great - i know there is a geom_segment with the argument "arrow".
Could you help me?
the dput of pro.test is below.
dput(pro.test)
structure(list(Yrot = structure(c(0.126093537705143, 0.196350569855869,
-0.0513472841582749, -0.304416713452662, 0.210682972922012, -0.0219477831881197,
-0.24519038499101, 0.338357488742126, -0.399739151138497, -0.366424716631558,
0.0321561053701086, 0.565794811541598, 0.606054432756139, -0.0122819831669951,
-0.00403199420346042, -0.0448308879361932, 0.0631101371381566,
-0.150820933315408, -0.018216051372273, -0.68513841544701, -0.117446131920294,
-0.450735018917557, 0.25749869839177, 0.47646869541639, -0.211447138648954,
-0.236584149111598, -0.0316882271224907, -0.281680981927695,
-0.182346139754316, -0.366221121187894, -0.263915986724565, -0.203160918536977,
0.209888424862468, 0.219400450315756, 0.143569801341895, 0.258388604988749,
0.542334722496036, 0.465147580652753, 0.294835945722885, 0.523372408452242,
0.0739580893460179, 0.242768571724456, 0.0409877673276456, -0.0942111509903291,
-0.193072299067071, -0.38889179801965, -0.352882980509932, -0.208549475629433
), .Dim = c(24L, 2L), .Dimnames = list(c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "12", "13", "14", "15", "16",
"17", "18", "19", "20", "21", "22", "23", "24", "25"), NULL)),
X = structure(c(0.0860177119127241, 0.217144585357183, -0.0301829830202831,
-0.246142550516987, 0.230574651598493, 0.00485065775494245,
-0.225907453854864, 0.371465194869491, -0.395330365511425,
-0.359255005182027, -0.00775013746753128, 0.47442649486468,
0.519983070801763, -0.0146878517934982, 0.0377018407084686,
-0.0885829362985767, 0.0935962405791314, -0.186192083265912,
0.00247095461296341, -0.655467761687806, -0.0966978065526177,
-0.398672122636169, 0.275589258531376, 0.39104839619648,
-0.273098318897548, -0.237373845171625, -0.0351119316278201,
-0.279271270040404, -0.171188235636994, -0.342350443283954,
-0.297148604541773, -0.21965804713297, 0.269830887309913,
0.268669489120665, 0.143826114581508, 0.229549645414776,
0.531869658831067, 0.479136042616735, 0.380638462867711,
0.548249030471031, 0.161449266776772, 0.282765937749097,
0.0756433464279055, 0.00516171212969907, -0.195519622624857,
-0.568932423412245, -0.381681091857682, -0.375455760069009
), .Dim = c(24L, 2L), const = 1.30375728392289, .Dimnames = list(
c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"12", "13", "14", "15", "16", "17", "18", "19", "20",
"21", "22", "23", "24", "25"), c("RDA1", "RDA2")), "`scaled:center`" = structure(c(1.3588667228485e-17,
4.69820941410385e-18), .Names = c("RDA1", "RDA2"))), ss = 0.146265769408323,
rotation = structure(c(-0.958883328045618, 0.283800569407742,
0.283800569407742, 0.958883328045619), .Dim = c(2L, 2L)),
translation = structure(c(-9.76742606822348e-18, 1.35523649355013e-17
), .Dim = 1:2), scale = 0.918742698883168, xmean = structure(c(1.3586408473959e-17,
4.71176194125992e-18), .Names = c("RDA1", "RDA2")), symmetric = FALSE,
call = procrustes(X = rda.t1, Y = rda.t2), svd = structure(list(
d = c(2.51563498111738, 2.16729713036852), u = structure(c(0.743008138366833,
0.669282381600362, 0.669282381600362, -0.743008138366833
), .Dim = c(2L, 2L)), v = structure(c(-0.522515395489416,
0.852629850214347, -0.852629850214347, -0.522515395489416
), .Dim = c(2L, 2L))), .Names = c("d", "u", "v"))), .Names = c("Yrot",
"X", "ss", "rotation", "translation", "scale", "xmean", "symmetric",
"call", "svd"), class = "procrustes")
Does this do the job?
library(ggplot2)
library(grid)
ctest <- data.frame(rda1=pro.test$Yrot[,1],
rda2=pro.test$Yrot[,2],xrda1=pro.test$X[,1],
xrda2=pro.test$X[,2],dates=rep(c("A","B","C"),each=8))
ggplot(ctest) +
geom_point(aes(x=rda1, y=rda2, colour=dates)) +
geom_point(aes(x=xrda1, y=xrda2, colour=dates)) +
geom_segment(aes(x=rda1,y=rda2,xend=xrda1,yend=xrda2,colour=dates),arrow=arrow(length=unit(0.2,"cm")))

Resources