write result of rank loop in r - r

I've been hitting walls trying to write the results of a loop to a csv. I'm trying to rank data within each of 20 columns. The loop I'm using is:
for (i in 1:ncol(testing_file)) {
print(rank(testing_file[[i]]))
}
This works and prints expected results to screen. I've tried a lot of methods suggested in various discussions to write this result to file or data frame, most with no luck.
I'll just include my most promising lead, which returns only one column of correct data, with a column heading of "testing":
for (i in 1:ncol(testing_file)) {
testing<- (rank(testing_file[[i]]))
testingdf <- as.data.frame(testing)
}
Any help is greatly appreciated!

I found a solution that works:
# Create an empty data frame that the ranked results will go into
testage <- data.frame(matrix(NA, nrow = nrow(testing_file), ncol = ncol(testing_file)))
# Loop that ranks the data within each column
for (i in 1:ncol(testing_file)) {
  testage[i] <- rank(testing_file[[i]])
  print(testage[i])
}
# Take the column names from the original file and apply them to the ranked file
colnames(testage) <- colnames(testing_file)

I'm bad with nested loops so I'd try:
testing_file <- data.frame(x = 1:5, y = 15:11)
testing <- as.data.frame(lapply(testing_file, rank))
> testing_file
x y
1 1 15
2 2 14
3 3 13
4 4 12
5 5 11
This gets you out of messy nested loops. Did you want to check the results of rank() before writing to csv?
Or just wrap it in write.csv(). Because lapply() runs over the data frame itself, the colnames will be the original df colnames:
write.csv(testing <- as.data.frame(lapply(testing_file, rank)), "testing.csv", quote = FALSE)
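A data frame is itself a list of columns, so lapply() can be applied to it directly and the result keeps the column names. Below is a minimal sketch of ranking every column and writing the result out. The small testing_file and the temporary output path are made-up stand-ins, not the asker's data:

```r
# Made-up stand-in for the asker's data
testing_file <- data.frame(x = c(10, 30, 20), y = c(3, 1, 2))

# Rank each column; lapply() over a data frame keeps the column names
ranked <- as.data.frame(lapply(testing_file, rank))

# Write to a temporary file (assumed path) and read it back as a check
out_path <- tempfile(fileext = ".csv")
write.csv(ranked, out_path, row.names = FALSE)
check <- read.csv(out_path)
```

Passing row.names = FALSE leaves out the extra column of row numbers that write.csv() adds by default.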

Related

Loop, create new variable as function of existing variable with conditional

I have some data that contains 400+ columns and ~80 observations. I would like to use a for loop to go through each column and, if it contains the desired prefix exp_, I would like to create a new column which is that value divided by a reference column, stored as the same name but with a suffix _pp. I'd also like to do an else if with the other prefix rev_ but I think as long as I can get the first problem figured out I can solve the rest myself. Some example data is below:
exp_alpha exp_bravo rev_charlie rev_delta pupils
10 28 38 95 2
24 56 39 24 5
94 50 95 45 3
15 93 72 83 9
72 66 10 12 3
The first time I tried it, the loop ran through properly but only stored the final column in which the if statement was true, rather than storing each column in which the if statement was true. I made some tweaks and lost that code but now have this which runs without error but doesn't modify the data frame at all.
for (i in colnames(test)) {
if(grepl("exp_", colnames(test)[i])) {
test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
}
}
My understanding of what this is doing:
loop through the vector of column names
if the substring "exp_" is in the ith element of the colnames vector == TRUE
create a new column in the data set which is the ith element of the colnames vector divided by the reference category (pupils), and with "_pp" appended at the end
else do nothing
Since my code executes without error but doesn't do anything, I imagine my problem is in the if() statement, but I can't figure out what I'm doing wrong. I also tried adding "==TRUE" in the if() statement but that achieved the same result.
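Almost correct. Your loop hands you each column name in i, but colnames(test)[i] then treats that name as an index, which returns NA, so grepl() is never TRUE and nothing happens. Loop over the indices instead and look up the name when building the new column:
for (i in 1:length(colnames(test))) {
  if(grepl("exp_", colnames(test)[i])) {
    test[paste(colnames(test)[i],"pp", sep="_")] <- test[i] / test$pupils
  }
}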
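To see why the original loop changed nothing, it helps to look at what the condition evaluates to. Here is a small sketch using a cut-down version of the example data. Indexing an unnamed character vector by a name gives NA, and grepl() returns FALSE for NA input:

```r
# Cut-down version of the example data from the question
test <- data.frame(exp_alpha = c(10, 24), rev_charlie = c(38, 39), pupils = c(2, 5))

i <- "exp_alpha"                      # what `for (i in colnames(test))` hands you
by_name <- colnames(test)[i]          # a name used as an index into an unnamed vector
cond_by_name <- grepl("exp_", by_name)

cond_direct <- grepl("exp_", i)       # test the name itself instead
```

Here by_name is NA and cond_by_name is FALSE, while cond_direct is TRUE, which is why the if() body never ran.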
As an alternative to @timfaber's answer, you can keep your first line the same but not treat i as an index:
for (i in colnames(test)) {
if(grepl("exp_", i)) {
print(i)
test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
}
}
Vectorized solution:
Don't use a loop for that! You can vectorize your code and operate on all matching columns at once, which is typically faster than looping over columns. Here's how to do it:
# Extract column names
cNames <- colnames(test)
# Find the exp_ prefix in column names
foo <- grep("^exp_", cNames)
# Divide by reference: ALL columns at the SAME time
bar <- test[foo] / test$pupils
# Append the _pp suffix: ALL columns at the SAME time
colnames(bar) <- paste(cNames[foo], "pp", sep = "_")
# Add to original dataset instead of iteratively appending
test <- cbind(test, bar)
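As a check, the same idea can be run on the example data from the question. This is only a sketch; the name exp_cols is introduced here for illustration:

```r
# Example data from the question
test <- data.frame(
  exp_alpha   = c(10, 24, 94, 15, 72),
  exp_bravo   = c(28, 56, 50, 93, 66),
  rev_charlie = c(38, 39, 95, 72, 10),
  rev_delta   = c(95, 24, 45, 83, 12),
  pupils      = c(2, 5, 3, 9, 3)
)

# Names of all columns carrying the exp_ prefix
exp_cols <- grep("^exp_", names(test), value = TRUE)

# Divide every exp_ column by pupils and store each under <name>_pp
test[paste0(exp_cols, "_pp")] <- test[exp_cols] / test$pupils
```

The same two lines with "^rev_" in place of "^exp_" would handle the other prefix.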

Convert code from Matlab to R

I have three matrix csv data files that I need to flatten and combine in R, so that I have three columns (Lat, Long, Data). The code I have for this is in matlab, but I need to convert this to R. Any thoughts? This is the matlab code that does this:
LON=csvread('LONGITUDE.csv');
LAT=csvread('LATITUDE.csv');
SM=csvread('soil_moisture20151008.csv');
xyz=zeros(101*210,3);
k=0;
for i=1:101
for j=1:210
k=k+1;
xyz(k,1)=LAT(i,j);
xyz(k,2)=LON(i,j);
xyz(k,3)=SM(i,j);
end
end
csvwrite('xyz.csv',xyz);
So far this is how I have changed it in R:
LON<-read.csv("LONGITUDE.csv", header = T)
LAT<-read.csv("LATITUDE.csv", header = T)
ET<-read.csv("actual_ET20100101.csv")
xyz=matrix(3,101,210)
k=0
for (i in 1:101){
for (j in 1:210){
k=k+1
xyz[k,1]=LAT[i,j]
xyz[k,2]=LON[i,j]
xyz[k,3]=ET[i,j]
}
}
write.csv("xyz.csv",xyz);
I'm not sure what I'm doing wrong. Any guidance on this issue would be greatly appreciated.
Finally, I have a whole directory of files that I need to run this script on, so any ideas on how to apply this to a directory would be great. The LAT/LON files don't change, just the data files.
Thank you!!
If I am understanding your data correctly, you have a large number of matrix files, where each index (row/column position) corresponds to the same data point in every file. That is, (1,1) in each matrix gives the value of interest for the 1st data point, and (1,2) gives values for a different data point.
In that case, you should just be able to convert them all to a matrix, extract the values as a vector, then stitch them together.
To illustrate, here are three identical data.frames (so that we can see if they align correctly):
A <- B <- C <-
data.frame(matrix(runif(36), nrow = 6))
Each data.frame is this:
X1 X2 X3 X4 X5 X6
1 0.2462450 0.6887587 0.216578122 0.5982332 0.2402868 0.9588999
2 0.5924075 0.7511237 0.813704807 0.6892747 0.6253069 0.4648226
3 0.7482773 0.4808986 0.006036452 0.6576487 0.5752148 0.5554258
4 0.8545323 0.6822942 0.654128179 0.6582181 0.8173544 0.5191778
5 0.1748737 0.7456279 0.992209169 0.4468014 0.3491022 0.9736064
6 0.7189847 0.3424291 0.581840006 0.1460138 0.8071445 0.2920479
Then, I put them all in a list (named, so that the columns come out named):
myList <- list(A = A, B = B, C = C)
Then, we loop through the list, converting each data.frame to a matrix, then extracting the values as a vector. Then, I convert the resulting list to a data.frame to get the column/row behavior you likely want (data.frames are just lists with special properties; each column is an element of the list, but data.frames assume the value orders match). Note that I am using magrittr/dplyr piping to simplify the nesting in the code:
library(magrittr) # provides the %>% pipe
flattened <-
  lapply(myList, function(x){
    as.matrix(x) %>%
      as.numeric()
  }) %>%
  as.data.frame()
Then, the head of this (from my randomization) looks like:
A B C
1 0.2462450 0.2462450 0.2462450
2 0.5924075 0.5924075 0.5924075
3 0.7482773 0.7482773 0.7482773
4 0.8545323 0.8545323 0.8545323
5 0.1748737 0.1748737 0.1748737
6 0.7189847 0.7189847 0.7189847
Of note, you mentioned that you may have multiple data sources that you want to merge -- as long as you load them all up into this list, the approach will generate a column for each.
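One detail worth checking against the Matlab code: its nested loop walks along each row in turn (j is the inner index), whereas as.numeric(as.matrix(x)) reads a matrix column by column. Transposing first reproduces the Matlab order. The sketch below uses tiny made-up matrices in place of the 101 x 210 files. The helper name flatten_rows, the directory name, and the assumption that the files have no header row (as with csvread) are all mine:

```r
# Small made-up matrices standing in for the 101 x 210 files
LAT <- matrix(1:6, nrow = 2)     # 2 x 3
LON <- matrix(11:16, nrow = 2)
SM  <- matrix(21:26, nrow = 2)

# t() followed by as.vector() flattens row by row, like the nested Matlab loop
flatten_rows <- function(m) as.vector(t(as.matrix(m)))

xyz <- data.frame(Lat  = flatten_rows(LAT),
                  Long = flatten_rows(LON),
                  Data = flatten_rows(SM))

# With real files (names and directory assumed, no header row):
# LAT <- read.csv("LATITUDE.csv", header = FALSE)
# LON <- read.csv("LONGITUDE.csv", header = FALSE)
# for (f in list.files("data_dir", pattern = "\\.csv$", full.names = TRUE)) {
#   d <- read.csv(f, header = FALSE)
#   out <- data.frame(Lat = flatten_rows(LAT), Long = flatten_rows(LON),
#                     Data = flatten_rows(d))
#   write.csv(out, sub("\\.csv$", "_xyz.csv", f), row.names = FALSE)
# }
```

The commented loop is one way to apply the same flattening to every data file in a directory while reusing the fixed LAT/LON grids.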

How does one process the results from replicate in R?

Say I have the following code which essentially gives me random simulations for revenue and cost for 12 months
simulate.revenue<-function() {
return(sapply(rnorm(12,100000,30000),function(x) max(0,x)))
}
simulate.cost<-function() {
return(sapply(rnorm(12,50000,20000),function(x) max(0,x)))
}
sim.run<-function() {
revenue<-simulate.revenue()
cost<-simulate.cost()
profit<-revenue-cost
year.simulation<-data.frame(revenue,cost,profit)
return(year.simulation)
}
Now to run the above simulation function 10 times I am aware that I should:
sim.results<-replicate(10,sim.run())
So the question is how do I further process sim.results to say:
find the mean for total yearly profit over each run
find the mean for profit by month over each of the runs (mean(profit[1]), mean(profit[2]), ...)
Structure of replicate result:
replicate(1, sim.run()) easily gives you the structure of what is returned: A list item for each column of the data.frame (here 3 list items). Running two simulations adds another 3 list items.
Convert it into proper format:
To convert the list into a data.frame use:
result <- data.frame(matrix(unlist(sim.results), nrow = 12, byrow = FALSE))
In your case every 3 columns of the resulting data.frame correspond to one simulation. To separate the simulations into a list again:
result_list <- list()
m <- 1
n_simulations <- 10
n_columnsPerSimulation <- 3
for (i in seq(1, n_simulations * n_columnsPerSimulation, n_columnsPerSimulation)){
result_list[[m]] <- result[,seq(i, i+n_columnsPerSimulation-1)]
m <- m + 1
}
This is very ugly but seems to work.
Analyze result:
Now you can analyze each simulation e.g. with sapply/lapply like the following example shows:
sapply(result_list, function(x) mean(x[,1]))
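An alternative that avoids the reshaping step is to ask replicate() not to simplify, so that each run stays a data frame. This is a sketch under that assumption, not the answer's method. It also swaps in pmax() for the sapply(..., max(0, x)) construct, which has the same effect, and sets a seed only for reproducibility:

```r
set.seed(1)  # only to make the sketch reproducible

sim.run <- function() {
  revenue <- pmax(0, rnorm(12, 100000, 30000))
  cost    <- pmax(0, rnorm(12, 50000, 20000))
  data.frame(revenue, cost, profit = revenue - cost)
}

# simplify = FALSE keeps one data frame per run in a plain list
runs <- replicate(10, sim.run(), simplify = FALSE)

# Total yearly profit of each run, then the mean over the runs
yearly_profit <- sapply(runs, function(d) sum(d$profit))
mean_yearly   <- mean(yearly_profit)

# 12 x 10 matrix of monthly profit; row means give the per-month average
monthly_profit <- sapply(runs, function(d) d$profit)
mean_by_month  <- rowMeans(monthly_profit)
```

mean_yearly answers the first question and mean_by_month, a vector of 12 values, answers the second.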

Turn Multiple Uneven Nested Lists Into A DataFrame in R

I am trying to get to grips with R and as an experiment I thought that I would try to play around with some cricket data. In its rawest format it is a yaml file, which I used the yaml R package to turn into an R object.
However, I now have a number of nested lists of uneven length that I want to try and turn into a data frame in R. I have tried a few methods such as writing some loops to parse the data and some of the functions in the tidyr package. However, I can't seem to get it to work nicely.
I wondered if people knew of the best way to tackle this? Replicating the data structure would be difficult here, because the complexity comes from the multiple nested lists and the unevenness of their length (which would make for a very long code block). However, you can find the raw yaml data here: http://cricsheet.org/downloads/ (I was using the ODI internationals).
Thanks in advance!
Update
I have tried this:
1) Using tidyr - separate
d <- unnest(balls)
Name <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal","WicketFielder","WicketKind","PlayerOut")
a <- separate(d, x, Name, sep = ",",extra = "drop")
This uses the tidyr package and returns a single-column data frame that I then try to separate. However, the problem here is that in the middle there are sometimes extra variables that appear in some rows and not others, thereby throwing off the separation.
2) Creating vectors
ballsVector <- unlist(balls[[2]],use.names = FALSE)
names_vector <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal")
names(ballsVector) <- c(names_vector)
ballsMatrix <- matrix(ballsVector, nrow = 1, byrow = TRUE)
colnames(ballsMatrix) <- names_vector
The problem here is that the resulting vectors are uneven in length and therefore can't be combined into a data frame. It will also suffer from the issue that there are sporadic variables in the middle of the dataset (as above).
Caveat: not complete answer; attempt to arrange the innings data
plyr::rbind.fill may offer a solution to binding rows with a different number of columns.
I don't use tidyr, but below is some rough code to get the innings data into a data.frame. You could then loop this through all the yaml files in the directory.
# Download and unzip data
download.file("http://cricsheet.org/downloads/odis.zip", temp<- tempfile())
tmp <- unzip(temp)
# Create lists - use first game
library(yaml)
raw_dat <- yaml.load_file(tmp[[2]])
#names(raw_dat)
# Function to process list into dataframe
p_fun <- function(X) {
team = X[[1]][["team"]]
# function to process each list subelement that represents each throw
fn <- function(...) {
tmp = unlist(...)
tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
tmp
}
# loop over all throws
lst = lapply(X[[1]][["deliveries"]], fn )
cbind(team, plyr::rbind.fill(lst))
}
# Loop over each innings
dat <- plyr::rbind.fill(lapply(raw_dat$innings, p_fun))
Some explanation
The list structure and subsetting it. To get an idea of the structure of the list use
str(raw_dat) # but this gives a really long list of data
You can truncate this, to make it a bit more useful
str(raw_dat, 3)
length(raw_dat)
So there are three main list elements - meta, info, and innings. You can also see this with
names(raw_dat)
To access the meta data, you can use
raw_dat$meta
#or using `[[1]]` to access the first element of the list (see ?'[[')
raw_dat[[1]]
#and get sub-elements by either
raw_dat$meta$data_version
raw_dat[[1]][[1]] # you can also use the names of the list elements eg [["data_version"]]
The main data is in the innings element.
str(raw_dat$innings, 3)
Look at the names in the list element
lapply(raw_dat$innings, names)
lapply(raw_dat$innings[[1]], names)
There are two list elements, each with sub-elements. You can access these as
raw_dat$innings[[1]][[1]][["team"]] # raw_dat$innings[[1]][["1st innings"]][["team"]]
raw_dat$innings[[2]][[1]][["team"]] # raw_dat$innings[[2]][["2nd innings"]][["team"]]
The above function parsed the deliveries data in raw_dat$innings. To see what it does, work through it from the inside.
Use one record to see how it works
(note the lapply, with p_fun, looped over raw_dat$innings[[1]] and raw_dat$innings[[2]] ; so this is the outer loop, and the lapply, with fn, loops through the deliveries, within an innings ; the inner loop)
X <- raw_dat$innings[[1]]
tmp <- X[[1]][["deliveries"]][[1]]
tmp
#create a named vector
tmp <- unlist(tmp)
tmp
# 0.1.batsman 0.1.bowler 0.1.non_striker 0.1.runs.batsman 0.1.runs.extras 0.1.runs.total
# "IR Bell" "DW Steyn" "MJ Prior" "0" "0" "0"
To use rbind.fill, the elements to bind together need to be data.frames. We also want to remove the leading numbers / deliveries from the names, as otherwise we will have lots of uniquely named columns
# this regex removes all non-numeric characters from the string
# you could then split this number into over and delivery
gsub("[^0-9]", "", names(tmp))
# this regex removes all numeric characters from the string -
# allowing consistent names across all the balls / deliveries
# (if i was better at regex I would have also removed the leading dots)
gsub("[0-9]", "", names(tmp))
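The leading dots mentioned in the comment above can also be stripped with an anchored pattern. Below is a sketch that assumes every name starts with an over.ball prefix such as 0.1. followed by the field name. The sample names are made up to match the output shown earlier:

```r
# Sample names in the shape produced by unlist() on one delivery
nms <- c("0.1.batsman", "0.1.bowler", "0.1.runs.batsman", "12.3.runs.total")

# Keep only the over.ball prefix, e.g. "0.1"
ball <- sub("^([0-9]+\\.[0-9]+)\\..*$", "\\1", nms)

# Drop the prefix and its dots, keeping the field name
field <- sub("^[0-9]+\\.[0-9]+\\.", "", nms)
```

Because the pattern is anchored at the start, digits or dots later in a field name are left alone.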
So for the first delivery in the first innings we have
tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
tmp
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
To see how the lapply works, use the first three deliveries (you will need to run the function fn in your workspace)
lst = lapply(X[[1]][["deliveries"]][1:3], fn )
lst
# [[1]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[2]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 02 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[3]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 03 IR Bell DW Steyn MJ Prior 3 0 3
So we end up with a list element for every delivery within an innings. We then use rbind.fill to create one data.frame.
If I was going to try and parse every yaml file I would use a loop.
Use the first three records as an example, and also add the match date.
tmp <- unzip(temp)[2:4]
all_raw_dat <- vector("list", length=length(tmp))
for(i in seq_along(tmp)) {
d = yaml.load_file(tmp[i])
all_raw_dat[[i]] <- cbind(date=d$info$date, plyr::rbind.fill(lapply(d$innings, p_fun)))
}
Then use rbind.fill.
Q1. from comments
A small example with rbind.fill
a <- data.frame(x=1, y=2)
b <- data.frame(x=2, z=1)
rbind(a,b) # error as names don't match
plyr::rbind.fill(a, b)
rbind.fill doesn't go back and add/update rows with the extra columns where needed (a still doesn't have column z). Think of it as creating an empty dataframe with the number of columns equal to the number of unique columns found in the list of dataframes - unique(c(names(a), names(b))). The values are then filled in each row where possible, and left missing (NA) otherwise.
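The description above can be sketched in base R. The helper bind_fill below is hypothetical and only illustrates the idea. It is not how plyr implements rbind.fill:

```r
# Hypothetical helper: pad each data frame with NA columns, then rbind
bind_fill <- function(dfs) {
  # union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (m in setdiff(all_cols, names(d))) d[[m]] <- NA  # add missing columns
    d[all_cols]                                          # same column order everywhere
  })
  do.call(rbind, padded)
}

a <- data.frame(x = 1, y = 2)
b <- data.frame(x = 2, z = 1)
out <- bind_fill(list(a, b))
```

The inputs a and b are left unchanged. Only the combined result has all three columns, with NA wherever a row's source lacked that column.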

r llply lapply usage without losing dataframes names

Overall situation:
The interface of my measuring devices can't save any information other than the name of the csv file it generates while measuring. So I used a systematic set of abbreviations to account for changing parameters, such as concentrations, enzymes, feed stocks, buffers etc. Combined, these form the titles of my csv files, which in turn become the names of the data.frames. I am now trying to read out those names and combine them with the rest of the data, to form tables that I can use for regressions.
The Issue:
I just noticed that I lose the names of my data.frames inside the list.
I could rename them after each call of lapply, but this doesn't seem to be a proper solution.
I found a suggestion to use llply, but I can't get it to keep the names either.
# loads plyr package
library(plyr)
# generates a showcase list of dataframes,
data <- list(data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)))
# assigns names to dataframe
names(data) <- list("one","two", "tree", "four")
This uses each data frame's name to pass “o” to a column. This part works fine,
but after running it the names are lost:
data <- lapply(X = seq_along(data),
FUN = function(i){
x <- data[[i]]
if (gsub("([(a-z)]).*","\\1", names(data)[i]) == "o") {x$enz <- "o"}
return(x)},
USE.NAMES = TRUE)
Same thing with llply, operates as expected but doesn’t keep the name either although I thought I could solve that particular problem (quote: “llply is equivalent to lapply except that it will preserve labels and can display a progress bar.”)
data <- llply(seq_along(data), function(i){
x <- data[[i]]
if (gsub("([(a-z)]).*","\\1", names(data)[i]) == "o") {x$enz <- "o"}
return(x)})
I would very much appreciate a hint on how to solve this without something like
names(data) <- list.with.the.names
after each llply or lapply call.
Do something like this:
for (i in seq_along(data)) data[[i]]$name <- names(data)[i]
do.call(rbind, data)
# c.1..2. c.3..3. name
#one.1 1 3 one
#one.2 2 3 one
#two.1 1 3 two
#two.2 2 3 two
#tree.1 1 3 tree
#tree.2 2 3 tree
#four.1 1 3 four
#four.2 2 3 four
And continue from there.
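On the original question of keeping the list names: lapply() has no USE.NAMES argument (that belongs to sapply()), and looping over seq_along(data) means the input is a plain integer vector with no names to carry over. Two ways around this are sketched below, using a made-up list and a hypothetical helper add_enz. startsWith() stands in for the gsub() test on the first letter:

```r
# Made-up list of data frames with names, as in the question
data <- list(one  = data.frame(a = 1:2, b = c(3, 3)),
             two  = data.frame(a = 1:2, b = c(3, 3)),
             tree = data.frame(a = 1:2, b = c(3, 3)))

# Hypothetical helper: add the enz column when the name starts with "o"
add_enz <- function(x, nm) {
  if (startsWith(nm, "o")) x$enz <- "o"
  x
}

# Option 1: Map() takes the result names from its first list argument
via_map <- Map(add_enz, data, names(data))

# Option 2: assigning into data[] replaces the elements but keeps the names
data[] <- lapply(seq_along(data), function(i) add_enz(data[[i]], names(data)[i]))
```

Either way the names survive, so no separate names(data) <- ... call is needed afterwards.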

Resources