Convert code from MATLAB to R

Example of My Data
I have three matrix CSV data files that I need to flatten and combine in R, so that I end up with three columns (Lat, Long, Data). The code I have for this is in MATLAB, but I need to convert it to R. Any thoughts? This is the MATLAB code that does this:
LON = csvread('LONGITUDE.csv');
LAT = csvread('LATITUDE.csv');
SM = csvread('soil_moisture20151008.csv');
xyz = zeros(101*210, 3);
k = 0;
for i = 1:101
    for j = 1:210
        k = k + 1;
        xyz(k,1) = LAT(i,j);
        xyz(k,2) = LON(i,j);
        xyz(k,3) = SM(i,j);
    end
end
csvwrite('xyz.csv', xyz);
So far this is how I have changed it in R:
LON <- read.csv("LONGITUDE.csv", header = T)
LAT <- read.csv("LATITUDE.csv", header = T)
ET <- read.csv("actual_ET20100101.csv")
xyz = matrix(3, 101, 210)
k = 0
for (i in 1:101) {
  for (j in 1:210) {
    k = k + 1
    xyz[k,1] = LAT[i,j]
    xyz[k,2] = LON[i,j]
    xyz[k,3] = ET[i,j]
  }
}
write.csv("xyz.csv", xyz);
I'm not sure what I'm doing wrong. Any guidance on this issue would be greatly appreciated.
Finally, I have a whole directory of files that I need to run this script on, so any ideas on how to apply this to a directory would be great. The LAT/LON files don't change, just the data files.
Thank you!!

If I am understanding your data correctly, you have a large number of matrix files, where each index (row/column position) is assigned to the same data value. That is, (1,1) in each matrix gives the value of interest for the 1st data point, and (1,2) gives values for a different data point.
In that case, you should just be able to convert them all to a matrix, extract the values as a vector, then stitch them together.
To illustrate, here are three identical data.frames (so that we can see whether they align correctly):
A <- B <- C <-
data.frame(matrix(runif(36), nrow = 6))
Each data.frame is this:
X1 X2 X3 X4 X5 X6
1 0.2462450 0.6887587 0.216578122 0.5982332 0.2402868 0.9588999
2 0.5924075 0.7511237 0.813704807 0.6892747 0.6253069 0.4648226
3 0.7482773 0.4808986 0.006036452 0.6576487 0.5752148 0.5554258
4 0.8545323 0.6822942 0.654128179 0.6582181 0.8173544 0.5191778
5 0.1748737 0.7456279 0.992209169 0.4468014 0.3491022 0.9736064
6 0.7189847 0.3424291 0.581840006 0.1460138 0.8071445 0.2920479
Then, I put them all in a list (named, so that the columns come out named):
myList <- list(A = A, B = B, C = C)
Then, we loop through the list, converting each data.frame to a matrix, then extracting the values as a vector. Then, I convert the resulting list to a data.frame to get the column/row behavior you likely want (data.frames are just lists with special properties; each column is an element of the list, but a data.frame assumes the value orders match). Note that I am using magrittr/dplyr piping to simplify the nesting in the code:
library(magrittr)  # provides the %>% pipe used below

flattened <-
  lapply(myList, function(x){
    as.matrix(x) %>%
      as.numeric()
  }) %>%
  as.data.frame()
Then, the head of this (from my randomization) looks like:
A B C
1 0.2462450 0.2462450 0.2462450
2 0.5924075 0.5924075 0.5924075
3 0.7482773 0.7482773 0.7482773
4 0.8545323 0.8545323 0.8545323
5 0.1748737 0.1748737 0.1748737
6 0.7189847 0.7189847 0.7189847
Of note, you mentioned that you may have multiple data sources that you want to merge -- as long as you load them all up into this list, the approach will generate a column for each.
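For the directory part of the question, here is a minimal sketch along the same lines. It assumes the LAT/LON grids match every data file, that the list.files() pattern below (a made-up example) picks up only the data files, and that the CSVs have no header row, as the MATLAB csvread() call implies; adjust those to your actual files. Note that as.numeric() flattens a matrix column by column rather than row by row as the MATLAB loop does, but each output row still pairs the correct Lat, Long and Data values.
# read the fixed coordinate grids once (header = FALSE mirrors csvread)
LAT <- as.matrix(read.csv("LATITUDE.csv", header = FALSE))
LON <- as.matrix(read.csv("LONGITUDE.csv", header = FALSE))
# hypothetical pattern: adjust to match your data files
data_files <- list.files(pattern = "soil_moisture.*\\.csv$")
for (f in data_files) {
  DAT <- as.matrix(read.csv(f, header = FALSE))
  # flatten all three matrices in the same (column-wise) order
  xyz <- data.frame(Lat  = as.numeric(LAT),
                    Long = as.numeric(LON),
                    Data = as.numeric(DAT))
  write.csv(xyz, paste0("xyz_", f), row.names = FALSE)
}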

Related

Comparing character lists in R

I have two lists of characters that I read in from Excel files.
One is a very long list of all bird species that have been documented in a region (allBirds) and the other is a list of species that were recently seen in a sample location (sampleBirds), which is much shorter. I want to write a section of code that will compare the lists and tell me which sampleBirds show up in the allBirds list. Both are lists of characters.
I have tried:
# upload xlxs file
Full_table <- read_excel("Full_table.xlsx")
Pathogen_table <- read_excel("pathogens.xlsx")
# read the species column into a new data frame
species <-c(as.data.frame(Full_table[,7], drop=FALSE))
pathogens <- c(as.data.frame(Pathogen_table[,3], drop=FALSE))
intersect(pathogens, species)
intersect(species, pathogens)
but intersect is outputting lists of 0, which I know cannot be true, any suggestions?
Maybe you can try the match() function or "==".
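For instance, %in% (which is built on match()) gives a logical vector that you can use to pull out the shared species. The two vectors below are made-up examples, not your data:
allBirds <- c("robin", "sparrow", "wren", "blackbird", "swift")  # hypothetical
sampleBirds <- c("wren", "swift", "heron")                       # hypothetical
sampleBirds %in% allBirds
# [1]  TRUE  TRUE FALSE
sampleBirds[sampleBirds %in% allBirds]
# [1] "wren"  "swift"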
You need to run the intersect on the individual columns that are stored in the list:
> a <- c(data.frame(c1=as.factor(c('a', 'q'))))
> b <- c(data.frame(c1=as.factor(c('w', 'a'))))
> intersect(a,b)
list()
> intersect(a$c1,b$c1)
[1] "a"
This will probably do in your case
intersect(Full_table[,7], Pathogen_table[,3])
Or, if you insist on creating the data.frames:
intersect(pathogens[[1]], species[[1]])
where [[1]] selects the first (and only) element of the list, i.e. the column itself. Note that by using c(as.data.frame(... you are converting the data.frame to a regular list. I'd go with only as.data.frame(....

write result of rank loop in r

I've been hitting walls trying to write the results of a loop to a csv. I'm trying to rank data within each of 20 columns. The loop I'm using is:
for (i in 1:ncol(testing_file)) {
print(rank(testing_file[[i]]))
}
This works and prints expected results to screen. I've tried a lot of methods suggested in various discussions to write this result to file or data frame, most with no luck.
I'll just include my most promising lead, which returns only one column of correct data, with a column heading of "testing":
for (i in 1:ncol(testing_file)) {
testing<- (rank(testing_file[[i]]))
testingdf <- as.data.frame(testing)
}
Any help is greatly appreciated!
I found a solution that works:
# create an empty data frame that the ranked results will go into
testage <- data.frame(matrix(, nrow = 73, ncol = 20))

# this loop ranks the data within each column
for (i in 1:ncol(testing_file)) {
  testage[i] <- rank(testing_file[[i]])
  print(testage[i])
}

# take the column names from the original file and apply them to the ranked file
colnames(testage) <- colnames(testing_file)
I'm bad with nested loops so I'd try:
testing_file <- data.frame(x = 1:5, y = 15:11)
testing <- as.data.frame(lapply(testing_file, rank))  # lapply over a data frame iterates over its columns and keeps their names
> testing_file
x y
1 1 15
2 2 14
3 3 13
4 4 12
5 5 11
This gets you out of messy nested loops. Did you want to check the results of rank() prior to writing them to CSV?
Or just wrap it in write.csv; the colnames will be the original data frame's colnames:
> write.csv(testing <- as.data.frame(lapply(testing_file, rank)), "testing.csv", quote = FALSE)
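Another option, sketched on the same testing_file example, is apply() with MARGIN = 2, which ranks each column and returns a matrix that keeps the original column names (this only works if every column is numeric, since apply() coerces the data frame to a matrix); "testing_ranked.csv" is just a placeholder file name:
ranked <- apply(testing_file, 2, rank)  # 2 = operate column-wise
ranked
#      x y
# [1,] 1 5
# [2,] 2 4
# [3,] 3 3
# [4,] 4 2
# [5,] 5 1
write.csv(ranked, "testing_ranked.csv", row.names = FALSE)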

Make a new vector using elements of a list and another vector interchangeably

I have a list with 20 elements, each containing a vector of 2 numbers. I have also generated a sequence of 20 numbers (newvals). Now I would like to construct one long vector that first lists the elements of intervals[[1]] followed by the first element of newvals, then intervals[[2]] and newvals[2], and so on.
I think the plyr package might be helpful, although I am not sure how to structure it. Help will be much appreciated!
s1 <- seq(0, 1, by = 0.05)
intervals <- Map(c, s1[-length(s1)], s1[-1])
intervals[[length(intervals)]][2] <- intervals[[length(intervals)]][2]+0.1
newvals <- seq(1,length(intervals),1)
#### HERE I WOULD LIKE TO HAVE A VECTOR IN THE FOLLOWING PATTERN
####UP TO THE LAST ELEMENT OF THE LIST:
stringreclass <- c(intervals[[1]], newvals[1], .... , intervals[[20]], newvals[20])
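One way to build that pattern, as a sketch using the objects defined above: Map() already pairs each interval with its corresponding value, so unlisting the result gives the interleaved vector.
stringreclass <- unlist(Map(c, intervals, newvals), use.names = FALSE)
head(stringreclass, 9)
# [1] 0.00 0.05 1.00 0.05 0.10 2.00 0.10 0.15 3.00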

Turn Multiple Uneven Nested Lists Into A DataFrame in R

I am trying to get to grips with R and as an experiment I thought I would play around with some cricket data. In its rawest format it is a YAML file, which I turned into an R object using the yaml R package.
However, I now have a number of nested lists of uneven length that I want to try and turn into a data frame in R. I have tried a few methods such as writing some loops to parse the data and some of the functions in the tidyr package. However, I can't seem to get it to work nicely.
I wondered if people knew of the best way to tackle this? Replicating the data structure would be difficult here, because the complexity comes from the multiple nested lists and the unevenness of their lengths (which would make for a very long code block). However, you can find the raw yaml data here: http://cricsheet.org/downloads/ (I was using the ODI internationals).
Thanks in advance!
Update
I have tried this:
1) Using tidyr - separate
d <- unnest(balls)
Name <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal","WicketFielder","WicketKind","PlayerOut")
a <- separate(d, x, Name, sep = ",",extra = "drop")
This basically uses the tidyr package and returns a single-column data frame that I then try to separate. However, the problem here is that extra variables sometimes appear in the middle of some rows and not others, thereby throwing off the separation.
2) Creating vectors
ballsVector <- unlist(balls[[2]],use.names = FALSE)
names_vector <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal")
names(ballsVector) <- c(names_vector)
ballsMatrix <- matrix(ballsVector, nrow = 1, byrow = TRUE)
colnames(ballsMatrix) <- names_vector
The problem here is that the resulting vectors are uneven in length and therefore can't be combined into a data frame. It also suffers from the issue that there are sporadic variables in the middle of the dataset (as above).
Caveat: this is not a complete answer; it is an attempt to arrange the innings data.
plyr::rbind.fill may offer a solution to binding rows with a different number of columns.
I don't use tidyr, but below is some rough code to get the innings data into a data.frame. You could then loop this through all the yaml files in the directory.
# Download and unzip data
download.file("http://cricsheet.org/downloads/odis.zip", temp<- tempfile())
tmp <- unzip(temp)
# Create lists - use first game
library(yaml)
raw_dat <- yaml.load_file(tmp[[2]])
#names(raw_dat)
# Function to process list into dataframe
p_fun <- function(X) {
  team = X[[1]][["team"]]
  # function to process each list subelement that represents each throw
  fn <- function(...) {
    tmp = unlist(...)
    tmp = data.frame(ball = gsub("[^0-9]", "", names(tmp))[1], t(tmp))
    colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
    tmp
  }
  # loop over all throws
  lst = lapply(X[[1]][["deliveries"]], fn)
  cbind(team, plyr::rbind.fill(lst))
}
# Loop over each innings
dat <- plyr::rbind.fill(lapply(raw_dat$innings, p_fun))
Some explanation
The list structure and subsetting it. To get an idea of the structure of the list use
str(raw_dat) # but this gives a really long list of data
You can truncate this, to make it a bit more useful
str(raw_dat, 3)
length(raw_dat)
So there are three main list elements - meta, info, and innings. You can also see this with
names(raw_dat)
To access the meta data, you can use
raw_dat$meta
#or using `[[1]]` to access the first element of the list (see ?'[[')
raw_dat[[1]]
#and get sub-elements by either
raw_dat$meta$data_version
raw_dat[[1]][[1]] # you can also use the names of the list elements eg [[`data_version`]]
The main data is in the innings element.
str(raw_dat$innings, 3)
Look at the names in the list element
lapply(raw_dat$innings, names)
lapply(raw_dat$innings[[1]], names)
There are two list elements, each with sub-elements. You can access these as
raw_dat$innings[[1]][[1]][["team"]] # raw_dat$innings[[1]][["1st innings"]][["team"]]
raw_dat$innings[[2]][[1]][["team"]] # raw_dat$innings[[2]][["2nd innings"]][["team"]]
The above function parsed the deliveries data in raw_dat$innings. To see what it does, work through it from the inside.
Use one record to see how it works
(Note: the lapply with p_fun loops over raw_dat$innings[[1]] and raw_dat$innings[[2]], so that is the outer loop; the lapply with fn loops through the deliveries within an innings, which is the inner loop.)
X <- raw_dat$innings[[1]]
tmp <- X[[1]][["deliveries"]][[1]]
tmp
#create a named vector
tmp <- unlist(tmp)
tmp
# 0.1.batsman 0.1.bowler 0.1.non_striker 0.1.runs.batsman 0.1.runs.extras 0.1.runs.total
# "IR Bell" "DW Steyn" "MJ Prior" "0" "0" "0"
To use rbind.fill, the elements to bind together need to be data.frames. We also want to remove the leading numbers / deliveries from the names, as otherwise we will have lots of uniquely named columns.
# this regex removes all non-numeric characters from the string
# you could then split this number into over and delivery
gsub("[^0-9]", "", names(tmp))
# this regex removes all numeric characters from the string -
# allowing consistent names across all the balls / deliveries
# (if i was better at regex I would have also removed the leading dots)
gsub("[0-9]", "", names(tmp))
So for the first delivery in the first innings we have
tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
tmp
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
To see how the lapply works, use the first three deliveries (you will need to run the function fn in your workspace)
lst = lapply(X[[1]][["deliveries"]][1:3], fn )
lst
# [[1]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[2]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 02 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[3]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 03 IR Bell DW Steyn MJ Prior 3 0 3
So we end up with a list element for every delivery within an innings. We then use rbind.fill to create one data.frame.
If I was going to try and parse every yaml file I would use a loop.
Use the first three records as an example, and also add the match date.
tmp <- unzip(temp)[2:4]
all_raw_dat <- vector("list", length = length(tmp))
for (i in seq_along(tmp)) {
  d = yaml.load_file(tmp[i])
  all_raw_dat[[i]] <- cbind(date = d$info$date, plyr::rbind.fill(lapply(d$innings, p_fun)))
}
Then use rbind.fill.
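For example, as a one-line sketch (rbind.fill accepts a list of data frames as its first argument, which the code above already relies on):
all_dat <- plyr::rbind.fill(all_raw_dat)
# one row per delivery, with the date, team, ball and delivery fields as columns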
Q1. from comments
A small example with rbind.fill
a <- data.frame(x=1, y=2)
b <- data.frame(x=2, z=1)
rbind(a, b) # error, as the names don't match
plyr::rbind.fill(a, b)
rbind.fill doesn't go back and add/update rows with the extra columns where needed (a still doesn't have column z). Think of it as creating an empty data frame with the number of columns equal to the number of unique columns found in the list of data frames: unique(c(names(a), names(b))). The values are then filled in for each row where possible, and left missing (NA) otherwise.
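For reference, the result of plyr::rbind.fill(a, b) in this small example is:
#   x  y  z
# 1 1  2 NA
# 2 2 NA  1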

r llply lapply usage without losing dataframes names

Overall situation:
The interface of my measuring device couldn't save any further information except the name of the csv it generates while measuring. So I used a systematic set of abbreviations to account for changing parameters such as concentrations, enzymes, feed stocks, buffers etc. Combined, these form the titles of my csv files, which become the names of the data.frames. I am now trying to read out those names and combine them with the rest of the data to form tables that I can use for regressions.
The Issue:
I just noticed that I lose the names of my data.frames inside the list.
I could rename them after each call of lapply, but this doesn't seem to be a proper solution.
I found a suggestion to use llply, but I can't teach it to keep the names either.
# loads plyr package
library(plyr)
# generates a showcase list of dataframes,
data <- list(data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)))
# assigns names to dataframe
names(data) <- list("one","two", "tree", "four")
The following uses the data.frame's name to pass "o" to a column. This part works fine, but after running it the names are lost:
data <- lapply(X = seq_along(data),
               FUN = function(i) {
                 x <- data[[i]]
                 if (gsub("([(a-z)]).*", "\\1", names(data)[i]) == "o") {x$enz <- "o"}
                 return(x)
               })
Same thing with llply: it operates as expected but doesn't keep the names either, although I thought it would solve that particular problem (quote: "llply is equivalent to lapply except that it will preserve labels and can display a progress bar.").
data <- llply(seq_along(data), function(i) {
  x <- data[[i]]
  if (gsub("([(a-z)]).*", "\\1", names(data)[i]) == "o") {x$enz <- "o"}
  return(x)
})
I would very much appreciate a hint on how to solve this without something like
names(data) <- list.with.the.names
after each llply or lapply call.
Do something like this:
for (i in seq_along(data)) data[[i]]$name <- names(data)[i]
do.call(rbind, data)
# c.1..2. c.3..3. name
#one.1 1 3 one
#one.2 2 3 one
#two.1 1 3 two
#two.2 2 3 two
#tree.1 1 3 tree
#tree.2 2 3 tree
#four.1 1 3 four
#four.2 2 3 four
And continue from there.
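If you would rather avoid the explicit loop, here is a sketch of the same idea with Map(), which pairs each data.frame with its name and keeps the list names on the result. It uses the showcase list data from the question (before it gets overwritten), and substr() stands in for the gsub() first-letter check:
data2 <- Map(function(df, nm) {
  df$name <- nm                               # keep the list name as a column
  if (substr(nm, 1, 1) == "o") df$enz <- "o"  # same first-letter condition as in the question
  df
}, data, names(data))
names(data2)
# [1] "one"  "two"  "tree" "four"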
