r - create matrix from data frame loop - r

I have many data frames id.1, id.2, ..., id.21, and from each of them I want to extract 2 data points: id.1[5,4] and id.1[10,6], id.2[5,4] and id.2[10,6], etc. The first data point is a date and the second is an integer.
I want to export this list to obtain something like this in a .csv file:
V1 V2
1 5/10/2016 1654395291
2 5/11/2016 1645024703
3 5/12/2016 1763825219
I have tried
x=c(for (i in 1:21) {
file1 = paste("id.", i, "[5,4]", sep="")}, for (i in 1:21) {
file1 = paste("id.", i, "[10,6]", sep="")})
write.csv(x, "x.csv")
But this yields x being NULL. How can I go about getting this vector?

Your problem is that a for loop in R doesn't return a useful value (it evaluates to NULL), so you can't use it inside c() as you did. Use an [sl]apply construct instead.
I would first make a list containing all the data frames:
dfs <- list(id.1, id.2, id.3, ...)
And iterate over it, something like:
x <- sapply(dfs, function(df) {
return(c(df[5,4], df[10,6]))
})
Finally you need to transpose the result and convert it into a data.frame if you want to:
x <- as.data.frame(t(x))
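A minimal sketch of the steps above, assuming data frames named id.1 to id.3 exist. They are built here as toy data, so the contents and the output path are made up for illustration. mget() collects the data frames by name instead of typing them out, and returning a one-row data frame per input (rather than c(...) inside sapply) keeps the date and the integer from being coerced to a single type.

```r
# Toy stand-ins for id.1 ... id.3 (hypothetical data: 10 rows, 6 columns each)
for (i in 1:3) {
  d <- as.data.frame(matrix(0, nrow = 10, ncol = 6))
  d[[4]] <- rep(paste0("5/", 9 + i, "/2016"), 10)  # a date stored as text
  d[[6]] <- rep(1000L + i, 10)                     # an integer
  assign(paste0("id.", i), d)
}

# Collect the data frames by name
dfs <- mget(paste0("id.", 1:3))

# One row per data frame; the two columns keep their own types
rows <- lapply(dfs, function(df) {
  data.frame(V1 = df[5, 4], V2 = df[10, 6], stringsAsFactors = FALSE)
})
x <- do.call(rbind, rows)
rownames(x) <- NULL

# Written to a temporary directory here; change the path as needed
write.csv(x, file.path(tempdir(), "x.csv"), row.names = FALSE)
```

With this approach no transpose is needed, because each list element is already a row.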

Related

Split many dataframes by a column, and save as different dataframes

I have many dataframes. I would like to split them based on the values in a column (a factor). Then I would like to store the results of the split in separate data frames that have specific names.
For the sake of a minimal reproducible example, consider some generated data,
for (i in 1:10) {
assign(paste("df_",i,sep = ""), data.frame(x = rep(1,12), y = c(rep("a",4),rep("b",4),rep("c",4))))
}
Here we have 10 dfs, df_1, df_2, ... to df_10 (the real data is similar to the generated data, but in the real data column z is different for each df).
Now, I want to split the dfs by 'y' (column 2).
For 1 df, I can do the following:
splitdf <- split(df_1,df_1$y)
namessplit <- c("a","b","c")
for (i in 1:length(splitdf)) {
assign(paste("df_1_",namessplit[[i]],sep = ""),splitdf[[i]])
}
While this works for 1 df, how can I do it for all the dfs?
Big thanks in advance!
Creating many objects in the global env is not recommended, but if you want to know how to create the objects from a nested list: loop over the outer list, then over each inner list, and paste the corresponding names together to assign each extracted inner list element.
lst1 <- lapply(mget(ls(pattern = "^df_\\d+$")), \(x) split(x, x$y))
for(i in seq_along(lst1)) {
for(j in seq_along(lst1[[i]])) {
assign(paste0(names(lst1)[i], "_", names(lst1[[i]][j])), lst1[[i]][[j]])
}
}
Checking for the objects created in the global env:
> ls(pattern = "^df_\\d+_[a-z]+$")
[1] "df_1_a" "df_1_b" "df_1_c" "df_10_a" "df_10_b" "df_10_c" "df_2_a" "df_2_b" "df_2_c" "df_3_a" "df_3_b" "df_3_c" "df_4_a"
[14] "df_4_b" "df_4_c" "df_5_a" "df_5_b" "df_5_c" "df_6_a" "df_6_b" "df_6_c" "df_7_a" "df_7_b" "df_7_c" "df_8_a" "df_8_b"
[27] "df_8_c" "df_9_a" "df_9_b" "df_9_c"
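If the pieces do not have to live in the global environment, they can stay in a nested list and be looked up by name. The sketch below assumes the data frames have already been gathered into a named list (toy data here, mirroring the question); the names dfs, pieces and flat are only illustrative.

```r
# Toy data mirroring the question: three data frames in a named list
dfs <- lapply(1:3, function(i) {
  data.frame(x = rep(i, 12), y = rep(c("a", "b", "c"), each = 4))
})
names(dfs) <- paste0("df_", 1:3)

# Split each data frame by y; the result is a list of lists
pieces <- lapply(dfs, function(d) split(d, d$y))

# Access one piece by name instead of creating df_2_b in the global env
nrow(pieces[["df_2"]][["b"]])

# If flat names are needed, flatten one level; names become "df_1.a", "df_1.b", ...
flat <- unlist(pieces, recursive = FALSE)
names(flat)
```

Keeping everything in one list makes later steps (for example writing each piece to a file with lapply) simpler than working with dozens of separate objects.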

Inserting the value in data frame into the codes in R

I have the names of the 1000 people in "name" data frame
df=c("John","Smith", .... "Machine")
I have the 1000 data frames for each person. (e.g., a1~a1000)
And, I have the following codes.
a1$name="XXXX"
a2$name="XXXX" ...
a1000$name="XXXX"
I would like to replace "XXXX" in the above codes with the values in name data frame. Output codes would look like this.
a1$name="John"
a2$name="Smith" ...
a1000$name="Machine"
First you need to combine them into a list. (I do not know whether this works well with 1000 data frames or not.)
df=c("John","Smith", .... "Machine")
list_object_names = sprintf("a%s", 1:1000)
list_df = lapply(list_object_names, get)
for (i in 1:length(list_df) ){
list_df[[i]][,'name']=df[i]
}
You can also try an apply function rather than a for loop, something like:
lapply(list_df, function(df) {
#what you want to do
})
Here is my shot at this, without knowing if there is any more to the a1,a2...a1000 lists.
# generate your data
df = c("John", "Smith", "Machine")
# build your example
for(i in 1:3){
assign(paste0("a",i), list(name = "XXXX"))
}
# solve your problem, even if there is more to a1 than you are showing us.
for(i in 1:3){
anew <- get(paste0("a",i)) # pulls the object from the environment
anew[['name']] <- df[i] # rewrites only that list
assign(paste0("a",i), anew) # rewrites the object with new name
}
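Another option, sketched here under the assumption that a1, a2, ... are data frames already collected into a named list, is Map(), which walks the list and the vector of names in parallel. The contents of the toy data frames and the vector name people are made up for illustration.

```r
# Toy stand-ins for a1 ... a3 (hypothetical contents)
list_df <- list(
  a1 = data.frame(score = c(1, 2), name = "XXXX"),
  a2 = data.frame(score = c(3, 4), name = "XXXX"),
  a3 = data.frame(score = c(5, 6), name = "XXXX")
)
people <- c("John", "Smith", "Machine")

# Map pairs the i-th data frame with the i-th name
list_df <- Map(function(d, nm) {
  d$name <- nm   # the single name is recycled down the column
  d
}, list_df, people)

list_df$a2$name
```

The result is still a named list (a1, a2, a3), so the data frames never need to be written back to the global environment.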

R code to iterate through dataframe rows for google maps distance queries

I'm looking for some assistance in writing some R code to iterate through rows in a dataframe and pass the values in each row to a function and print the output either to an excel file, txt file or just in the console.
The purpose of this is to automate a bunch of distance/time queries (several hundred) to google maps using the function found at this website: http://www.nfactorialanalytics.com/r-vignette-for-the-week-finding-time-distance-between-two-places/
The function on that website is as follows:
library(XML)
library(RCurl)
distance2Points <- function(origin,destination){
results <- list();
xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=',origin,'&destinations=',destination,'&mode=driving&sensor=false')
xmlfile <- xmlParse(getURL(xml.url))
dist <- xmlValue(xmlChildren(xpathApply(xmlfile,"//distance")[[1]])$value)
time <- xmlValue(xmlChildren(xpathApply(xmlfile,"//duration")[[1]])$value)
distance <- as.numeric(sub(" km","",dist))
time <- as.numeric(time)/60
distance <- distance/1000
results[['time']] <- time
results[['dist']] <- distance
return(results)
}
The dataframe will contain two columns: origin postal code and destination postal code (Canada, eh?). I'm a beginner R programmer, so I know how to use read.table to load a txt file into a dataframe. I'm just not sure how to iterate through the dataframe, each time passing values to the distance2Points function and executing it. I think this can be done using either a for loop or one of the apply calls?
Thanks for the help!
edit:
To keep it simple, let's assume I want to transform these two vectors into a dataframe
> a <- c("L5B4P2","L5B4P2")
> b <- c("M5E1E5", "A2N1T3")
> postcodetest <- data.frame(a,b)
> postcodetest
a b
1 L5B4P2 M5E1E5
2 L5B4P2 A2N1T3
How should I go about iterating over these two rows to return both distances and times from the distance2Points function?
Here's one way to do it, using lapply to produce a list with the results for each row in your data and using Reduce(rbind, [yourlist]) to concatenate that list into a data frame whose rows correspond to the ones in your original. To make this work, we also have to tweak the code in the original function to return a one-row data frame, so I've done that here.
distance2Points <- function(origin,destination){
require(XML)
require(RCurl)
xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=',origin,'&destinations=',destination,'&mode=driving&sensor=false')
xmlfile <- xmlParse(getURL(xml.url))
dist <- xmlValue(xmlChildren(xpathApply(xmlfile,"//distance")[[1]])$value)
time <- xmlValue(xmlChildren(xpathApply(xmlfile,"//duration")[[1]])$value)
distance <- as.numeric(sub(" km","",dist))
time <- as.numeric(time)/60
distance <- distance/1000
# this gives you a one-row data frame instead of a list, b/c it's easy to rbind
results <- data.frame(time = time, distance = distance)
return(results)
}
# now apply that function rowwise to your data, using lapply, and roll the results
# into a single data frame using Reduce(rbind)
results <- Reduce(rbind, lapply(seq(nrow(postcodetest)), function(i)
distance2Points(postcodetest$a[i], postcodetest$b[i])))
Result when applied to your sample data:
> results
time distance
1 27.06667 27.062
2 1797.80000 2369.311
If you would prefer to do this without creating a new object, you could also write separate functions for computing time and distance -- or a single function with those outputs as options -- and then use sapply or just mutate to create new columns in your original data frame. Here's how that might look using sapply:
distance2Points <- function(origin, destination, output){
require(XML)
require(RCurl)
xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=',
origin, '&destinations=', destination, '&mode=driving&sensor=false')
xmlfile <- xmlParse(getURL(xml.url))
if(output == "distance") {
y <- xmlValue(xmlChildren(xpathApply(xmlfile,"//distance")[[1]])$value)
y <- as.numeric(sub(" km", "", y))/1000
} else if(output == "time") {
y <- xmlValue(xmlChildren(xpathApply(xmlfile,"//duration")[[1]])$value)
y <- as.numeric(y)/60
} else {
y <- NA
}
return(y)
}
postcodetest$distance <- sapply(seq(nrow(postcodetest)), function(i)
distance2Points(postcodetest$a[i], postcodetest$b[i], "distance"))
postcodetest$time <- sapply(seq(nrow(postcodetest)), function(i)
distance2Points(postcodetest$a[i], postcodetest$b[i], "time"))
And here's how you could do it in a dplyr pipe with mutate:
library(dplyr)
postcodetest <- postcodetest %>%
mutate(distance = sapply(seq(nrow(postcodetest)), function(i)
distance2Points(a[i], b[i], "distance")),
time = sapply(seq(nrow(postcodetest)), function(i)
distance2Points(a[i], b[i], "time")))
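The row-wise pattern used in these answers can be tried without calling the web service. The sketch below swaps in a hypothetical stand-in, fake_distance, which returns made-up numbers rather than real driving distances, and uses mapply() to pass the two columns in parallel instead of indexing by row number.

```r
# Hypothetical stand-in for distance2Points so the sketch runs offline;
# the values it returns are made up and are NOT real distances or times.
fake_distance <- function(origin, destination) {
  data.frame(time = nchar(origin) * 10, distance = nchar(destination) * 5)
}

postcodetest <- data.frame(a = c("L5B4P2", "L5B4P2"),
                           b = c("M5E1E5", "A2N1T3"),
                           stringsAsFactors = FALSE)

# mapply passes a[i] and b[i] together; SIMPLIFY = FALSE keeps a list of
# one-row data frames, which do.call(rbind, ...) stacks into one data frame
results <- do.call(rbind, mapply(fake_distance, postcodetest$a, postcodetest$b,
                                 SIMPLIFY = FALSE, USE.NAMES = FALSE))

# Attach the new columns to the original rows
combined <- cbind(postcodetest, results)
combined
```

Replacing fake_distance with the one-row-data-frame version of distance2Points should give the same shape of result, with one request per row.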

Creating dataframe in R loop and naming it

I am working with 5 data frames that I want to filter (eliminating some rows if they match a regex). Because all data frames are similar, with the same variable names, I stored them in a list and I'm iterating it. However, when I want to save the filtered data for each of the original data frame, I find that it creates an i_filtered (instead of dfName_filtered) so every time the loop runs, it gets overwritten.
Here's what I have in the loop:
for (i in list_all){
i_filtered1 <- i[i$chr != filter1,]
i_filtered2 <- i[i$chr != filter2,]
#Write the result filtered table in a csv file
#Change output directory if needed
write.csv(i_filtered2, file="/home/tama/Desktop/i_filtered.csv")
}
As I said, filter1 and filter2 are just regex that I'm using to filter the data in the chr column.
What's the correct way to assign the original name + "_filtered" to the new dataframe?
Thanks in advance
Edited to add info:
Each dataframe has these variables (but values can change)
chr start end length
chr1 10400 10669 270
chr10 237646 237836 191
chrX 713884 714414 531
chrUn 713884 714414 531
chr1 762664 763174 511
chr4 805008 805571 564
And I have stored all them in a list:
list_all <- list(heep, oe, st20_n, st20_t,all)
list_all <- lapply(list_all, na.omit)
The filters:
#Get rid of random chromosomes
filter1=".*random"
#Get rid of undefined chromosomes
filter2 = "^chrUn.*"
The output I'm looking for is:
heep_filtered1
heep_filtered2
oe_filtered1
oe_filtered2
etc
One possibility is to iterate over a sequence of indices (or names), rather than over the list of data-frames itself, and access the data-frames using the indices.
Another problem is that the != operator doesn't support regular expressions. It only does exact literal matches. You need to use grepl() instead.
names(list_all) <- c("heep", "oe", "st20_n", "st20_t", "all")
filtered <- list()
for (i in names(list_all)){
df <- list_all[[i]]
df.1 <- df[!grepl(filter1, df$chr), ]
df.2 <- df[!grepl(filter2, df$chr), ]
#Write the result filtered table in a csv file
#Change output directory if needed
write.csv(df.2, file=paste0("/home/tama/Desktop/", i, "_filtered.csv"))
filtered[[paste0(i, "_filtered", 1)]] <- df.1
filtered[[paste0(i, "_filtered", 2)]] <- df.2
}
The result is a list called filtered that contains the filtered data-frames.
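To see the difference between != and grepl() in isolation, here is a small sketch on a character vector. The values are modelled on the chr column in the question; "chr1_random" is a made-up example of a "random" chromosome name.

```r
chr <- c("chr1", "chr10", "chrX", "chrUn", "chr1_random")

# != compares each value with the pattern as a literal string,
# so no element is dropped here
literal_kept <- chr[chr != "chrUn.*"]

# grepl() treats the pattern as a regular expression
regex_kept <- chr[!grepl("^chrUn", chr)]

# Several patterns can be combined with | to filter in one pass
filters <- c("random$", "^chrUn")
keep <- !grepl(paste(filters, collapse = "|"), chr)
chr[keep]
```

The same logical vector can be used to subset the rows of a data frame, as in df[keep, ].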
The issue is that i is only treated as the loop variable when it stands alone. In i_filtered1 it is just part of another name, and in "i_filtered.csv" it is a literal character inside a string.
I would suggest naming the list, then using lapply instead of a for loop (note that I also changed the filter to occur in one step, since right now it is unclear if you are trying to take both things out or not -- this also makes it easier to add more filters).
filters <- c(".*random", "chrUn.*")
list_all <- list(heep = heep
, oe = oe
, st20_n = st20_n
, st20_t = st20_t
, all = all)
toLoop <- names(list_all)
names(toLoop) <- toLoop # renames them in the output list
filtered <- lapply(toLoop, function(thisSet) {
# %in% only matches exact strings, so combine the regexes and use grepl() instead
tempFiltered <- list_all[[thisSet]][!grepl(paste(filters, collapse = "|"), list_all[[thisSet]]$chr),]
#Write the result filtered table in a csv file
#Change output directory if needed
write.csv(tempFiltered, file=paste0("/home/tama/Desktop/",thisSet,"_filtered.csv"))
# Return the part you care about
return(tempFiltered)
})

Populating a dataframe in a loop

I have more than 300 csv files in a directory.
The csv files have a following structure
id Date Nitrate Sulfate
id of csv file Some date Some Value Some Value
id of csv file Some date Some Value Some Value
id of csv file Some date Some Value Some Value
I want to count the number of rows in each csv file, excluding rows with NA, and store the result in a dataframe which has two columns: (1) id & (2) nobs.
Here is my code for that:
complete <-function(directory,id){
filenames <-sprintf("%03d.csv", id)
filenames <-paste(directory,filenames,sep = '/')
dataframe <-data.frame(id=numeric(0),nobs=numeric(0))
for(i in filenames){
data <- read.csv(i)
dataframe[i,dataframe$id]<-data[data$id]
dataframe[i,dataframe$nobs]<-nrow(data[!is.na(data$sulfate & data$nitrate),])
}
dataframe
}
The problem arises when I try to populate dataframe inside the loop, it seems like it is not populating the data frame and returning me NULL. I know that I am doing something stupid.
I usually prefer to add the rows into a pre-allocated list then bind them together. Here's a working example :
##### fake read.csv function returning random data.frame
# (just to reproduce your case, remove this from your code...)
read.csv <- function(fileName){
stupidHash <- sum(as.integer(charToRaw(fileName)))
if(stupidHash %% 2 == 0){
return(data.frame(id=stupidHash,date='2016-02-28',
nitrate=c(NA,2,3,NA,5),sulfate=c(10,20,NA,NA,40)))
}else{
return(data.frame(id=stupidHash,date='2016-02-28',
nitrate=c(4,2,3,NA,5,9),sulfate=c(10,20,NA,NA,40,50)))
}
}
#####
complete <-function(directory,id){
filenames <-sprintf("%03d.csv", id)
filenames <-paste(directory,filenames,sep = '/')
# here we pre-allocate a list of length=length(filenames)
# where we will put the rows of our future data.frame
rowsList <- vector(mode='list',length=length(filenames))
for(i in 1:length(filenames)){
filename <- filenames[i]
data <- read.csv(filename)
rowsList[[i]] <- data.frame(id=data$id[1],
nobs=sum(!is.na(data$sulfate) & !is.na(data$nitrate)))
}
# here we bind all the previously created rows together into one data.frame
DF <- do.call(rbind.data.frame, rowsList)
return(DF)
}
Usage example :
res <- complete(directory='dir',id=1:3)
> res
id nobs
1 889 4
2 890 2
3 891 4
The problem is in these 2 lines:
dataframe[i,dataframe$id]<-data[data$id]
dataframe[i,dataframe$nobs]<-nrow(data[!is.na(data$sulfate & data$nitrate),])
If you want to extend a dataframe, use the rbind function. Be aware, though, that this is not an efficient way, because each call allocates new memory, copies all the existing data, and adds one new row. The efficient way is to allocate a dataframe that is big enough in this line:
dataframe <-data.frame(id=numeric(0),nobs=numeric(0))
Instead of 0, use the expected number of rows.
So the easiest way is:
dataframe <- rbind(dataframe, data.frame(id=data$id[1], nobs=nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])))
A more efficient way is something like this:
dataframe <-data.frame(id=numeric(numberOfRows),nobs=numeric(numberOfRows))
and after that, in the loop (with i as a numeric row index rather than a file name):
dataframe[i,]$id<-data$id[1]
dataframe[i,]$nobs<-nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])
UPDATE: I changed the values you used to populate the dataframe to data$id[1] and nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])
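Both suggestions (build the rows first, then bind once) can be combined with lapply() and complete.cases(). This is only a sketch: it writes three small made-up CSV files to a temporary directory so that it runs on its own, and it assumes the columns are named sulfate and nitrate as in the code above. The names demo_dir and toy are illustrative.

```r
# Write three small CSV files to a temporary directory (made-up data)
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (k in 1:3) {
  toy <- data.frame(id = k,
                    Date = "2016-02-28",
                    sulfate = c(10, NA, 30, 40),
                    nitrate = c(1, 2, NA, 4))
  write.csv(toy, file.path(demo_dir, sprintf("%03d.csv", k)), row.names = FALSE)
}

complete <- function(directory, id) {
  filenames <- file.path(directory, sprintf("%03d.csv", id))
  # One one-row data frame per file
  rows <- lapply(filenames, function(f) {
    data <- read.csv(f)
    # complete.cases() is TRUE for rows with no NA in the selected columns
    data.frame(id = data$id[1],
               nobs = sum(complete.cases(data[, c("sulfate", "nitrate")])))
  })
  # Bind all rows in a single call instead of growing the data frame in the loop
  do.call(rbind, rows)
}

res <- complete(demo_dir, 1:3)
res
```

Because rbind is called once at the end, the data frame is not copied on every iteration.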
