Writing functions: Creating data processing functions with R software [closed] - r

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 4 years ago.
Improve this question
Hello fellow "R" users!
Please spare me some of your time on helping me with the use of "R" software(Beginner) regarding "Data processing function", wherein I have three (3) different .csv files named "x2013, x2014, x2015" that has the same 6 columns as per respective year based on the image below: Problem and started typing the commands:
filenames=list.files()
library(plyr)
install.packages("plyr")
import.list=adply(filenames,1,read.csv)
Although I just really wanted to summarize all the calls from the three source (csv). Any kind of help would be appreciated. Thank you for assisting me!

If you want to summarize the results of read.csv into one data.frame you can use the following approach with do.call and rbind, given that csv-files has the same amount of columns. The code below takes all csv files (the amount of columns should be the same) from the project home directory and concatenate into one data.frame:
# simulation of 3 data.frames with 6 columns and 10 rows
df1 <- as.data.frame(matrix(1:(10 * 6), ncol = 6))
df2 <- df1 * 2
df3 <- df1 * 3
write.csv(df1, "X2012.csv")
write.csv(df2, "X2013.csv")
write.csv(df3, "X2014.csv")
# Load all csv files from home directory
filenames <- list.files(".", pattern = "csv$")
import.list<- lapply(filenames, read.csv)
# concatenate list of data.frames into one data.frame
df_res <- do.call(rbind, import.list)
str(df_res)
Output is a data.frame with 6 columns and 30 rows:
'data.frame': 30 obs. of 7 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ V1: int 1 2 3 4 5 6 7 8 9 10 ...
$ V2: int 11 12 13 14 15 16 17 18 19 20 ...
$ V3: int 21 22 23 24 25 26 27 28 29 30 ...
$ V4: int 31 32 33 34 35 36 37 38 39 40 ...
$ V5: int 41 42 43 44 45 46 47 48 49 50 ...
$ V6: int 51 52 53 54 55 56 57 58 59 60 ...

Related

How do I get rid of commas and periods, etc in R? [duplicate]

This question already has answers here:
How to load comma separated data into R?
(2 answers)
Closed 6 years ago.
This is my data set:
Depth.Fe
1 0,14.21
2 3,19.35
3 10,17.22
4 14,15.87
5 23,13.62
6 30,16.31
7 36,14.13
8 48,13.95
9 59,15
10 66,14.23
11 68,16.81
12 81,15.93
13 94,16.02
14 96,17.85
15 102,17.02
16 115,15.87
17 121,19.84
18 130,16.94
19 163,16.72
20 168,19.2
21 205,20.41
22 239,16.88
23 251,18.74
24 283,16.67
25 297,18.56
26 322,18.87
27 335,20.81
28 351,24.52
29 370,25.03
30 408,25.11
31 416,23.28
32 419,22.56
33 425,19
34 429,20.53
35 443,19.08
36 447,22.83
37 465,21.06
38 474,24.96
39 493,19.12
40 502,22.24
41 522,26.88
42 550,21.15
43 558,28.92
44 571,27.96
45 586,25.03
46 596,26.27
I want depth and Fe to be separated as individual columns, but nothing I try is working.
please help
First of all, #akrun is definitely right in his comment to your post. If this is a dataset imported from somewhere, then follow his comment.
Assuming that somehow you were handed this weird dataset, I would try this:
df <- data.frame(matrix(as.numeric(unlist(strsplit(df$Depth.Fe,split=","))),nrow=2,byrow = T),stringsAsFactors = F)
colnames(df) <- c("Depth","Fe")
This would take a dataset that looks like this:
Depth.Fe
1 0,14.21
2 3,19.35
to this:
Depth Fe
1 0 14.21
2 3 19.34

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and appending the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
your exact problem is still a mistery to me...
but it looks like you want a double for loop
for(i in 1:nrow(thalweg)){
residual=thalweg[i,"Xdep_Vdep"]
for(j in 2:11){
residual=min(residual,thalweg[i,j])
}
}

Avoid quotation marks in column and row names when using write.table [duplicate]

This question already has an answer here:
Delete "" from csv values and change column names when writing to a CSV
(1 answer)
Closed 5 years ago.
I have the following data in a file called "data.txt":
pid 1 2 4 15 18 20
1_at 100 200 89 189 299 788
2_at 8 78 33 89 90 99
3_xt 300 45 53 234 89 34
4_dx 49 34 88 8 9 15
The data is separated by tabs.
Now I wanted to extract some columns on that table, based on the information of csv file called "vector.csv", this vector got the following data:
18,1,4,20
So I wanted to end with a modified file "datamod.txt" separated with tabs that would be:
pid 18 1 4 20
1_at 299 100 89 788
2_at 90 8 33 99
3_xt 89 300 53 34
4_dx 9 49 88 15
I have made, with some help, the following code:
fileName="vector.csv"
con=file(fileName,open="r")
controlfile<-readLines(con)
controls<-controlfile[1]
controlins<-controlfile[2]
test<-paste("pid",controlins,sep=",")
test2<-c(strsplit(test,","))
test3<-c(do.call("rbind",test2))
df<-read.table("data.txt",header=T,check.names=F)
CC <- sapply(df, class)
CC[!names(CC) %in% test3] <- "NULL"
df <- read.table("data.txt", header=T, colClasses=CC,check.names=F)
df<-df[,test3]
write.table(df,"datamod.txt",row.names=FALSE,sep="\t")
The problem that I got is that my resulting file has the following format:
"pid" "18" "1" "4" "20"
"1_at" 299 100 89 788
"2_at" 90 8 33 99
"3_xt" 89 300 53 34
"4_dx" 9 49 88 15
The question I have is how to avoid those quotation "" marks that appear in my saved file, so that the data appears like I would like to.
Any help?
Thanks
To quote from the help file for write.table
quote
a logical value (TRUE or FALSE) or a numeric vector. If TRUE,
any character or factor columns will be surrounded by double quotes.
If a numeric vector, its elements are taken as the indices of columns
to quote. In both cases, row and column names are quoted if they are
written. If FALSE, nothing is quoted.
Therefore
write.table(df,"datamod.txt",row.names=FALSE,sep="\t", quote = FALSE)
should work nicely.

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full table here I know they are all the same size. What I want to do is make one table where I add up the p-values. Problem is that the $cluster_size, start, $end and $number columns don't necessarily correspond to the same row when I look at the table in different list elements so I can't just do a simple sum.
The brute force way to do this is to: 1) make a blank table 2) copy in the appropriate $cluster_size, $start, $end, $number columns from the first table and pull the correct p-values using a which() statement from all the tables. Is there a more clever way of doing this? Or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happen to match. That will not always be the case.
Seems like you can do this in two steps
Convert your list to a data.frame
Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at data.table package if your data is large.

R sorts a vector on its own accord

df.sorted <- c("binned_walker1_1.grd", "binned_walker1_2.grd", "binned_walker1_3.grd",
"binned_walker1_4.grd", "binned_walker1_5.grd", "binned_walker1_6.grd",
"binned_walker2_1.grd", "binned_walker2_2.grd", "binned_walker3_1.grd",
"binned_walker3_2.grd", "binned_walker3_3.grd", "binned_walker3_4.grd",
"binned_walker3_5.grd", "binned_walker4_1.grd", "binned_walker4_2.grd",
"binned_walker4_3.grd", "binned_walker4_4.grd", "binned_walker4_5.grd",
"binned_walker5_1.grd", "binned_walker5_2.grd", "binned_walker5_3.grd",
"binned_walker5_4.grd", "binned_walker5_5.grd", "binned_walker5_6.grd",
"binned_walker6_1.grd", "binned_walker7_1.grd", "binned_walker7_2.grd",
"binned_walker7_3.grd", "binned_walker7_4.grd", "binned_walker7_5.grd",
"binned_walker8_1.grd", "binned_walker8_2.grd", "binned_walker9_1.grd",
"binned_walker9_2.grd", "binned_walker9_3.grd", "binned_walker9_4.grd",
"binned_walker10_1.grd", "binned_walker10_2.grd", "binned_walker10_3.grd")
One would expect that order of this vector would be 1:length(df.sorted), but that appears not to be the case. It looks like R internally sorts the vector according to its logic but tries really hard to display it the way it was created (and is seen in the output).
order(df.sorted)
[1] 37 38 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[26] 23 24 25 26 27 28 29 30 31 32 33 34 35 36
Is there a way to "reset" the ordering to 1:length(df.sorted)? That way, ordering, and the output of the vector would be in sync.
Use the mixedsort (or) mixedorder functions in package gtools:
require(gtools)
mixedorder(df.sorted)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[28] 28 29 30 31 32 33 34 35 36 37 38 39
construct it as an ordered factor:
> df.new <- ordered(df.sorted,levels=df.sorted)
> order(df.new)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
EDIT :
After #DWins comment, I want to add that it is even not nessecary to make it an ordered factor, just a factor is enough if you give the right order of levels :
> df.new2 <- factor(df.sorted,levels=df.sorted)
> order(df.new)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
The difference will be noticeable when you use those factors in a regression analysis, they can be treated differently. The advantage of ordered factors is that they let you use comparison operators as < and >. This makes life sometimes a lot easier.
> df.new2[5] < df.new2[10]
[1] NA
Warning message:
In Ops.factor(df.new[5], df.new[10]) : < not meaningful for factors
> df.new[5] < df.new[10]
[1] TRUE
Isn't this simply the same thing you get with all lexicographic shorts (as e.g. ls on directories) where walker10_foo sorts higher than walker1_foo?
The easiest way around, in my book, is to use a consistent number of digits, i.e. I would change to binned_walker01_1.grd and so on inserting a 0 for the one-digit counts.
In response to Dwin's comment on Dirk's answer: the data are always putty in your hands. "This is R. There is no if. Only how." -- Simon Blomberg
You can add 0 like so:
df.sorted <- gsub("(walker)([[:digit:]]{1}_)", "\\10\\2", df.sorted)
If you needed to add 00, you do it like this:
df.sorted <- gsub("(walker)([[:digit:]]{1}_)", "\\10\\2", df.sorted)
df.sorted <- gsub("(walker)([[:digit:]]{2}_)", "\\10\\2", df.sorted)
...and so on.

Resources