Unboxing List of lists as Data Frame columns elegantly - r

I have a data frame that contains a file name with regular parts. I use a regex to parse this file name and store each part in its own column.
parse.file.name <- function(file.name="cc-nolabel-AEMNZ334_0009-loc-1317-407-6-39.png")
{
rfn <- regexec(pattern="cc-(.+?)-(.+?)-loc-(.+?)-(.+?)-(.+?)-(.+?)\\.png", text=file.name)
matchfn <- regmatches(file.name, rfn)
return(matchfn)
}
basic.features$parsed.filename <- parse.file.name(as.character(basic.features$filename))
filename contains values similar to the default parameter. I'm retrieving the individual values for each column like the following:
basic.features$label <- unlist(lapply(basic.features$parsed.filename,
function(pf) {
return(unlist(pf)[2]) } ))
I feel that this is not an elegant way but couldn't manage to get individual values from the data frame column that contains list in each row easily. Is there a better way to do this?
If you like example data:
basic.features <- data.frame(filename=c("cc-nolabel-AEMNZ336_0009-loc-1003-1504-7-8.png", "cc-nolabel-AEMNZ335_0006-loc-1979-880-13-10.png", "cc-nolabel-AEMNZ333_0007-loc-941-263-8-8.png", "cc-nolabel-AEMNZ336_0014-loc-2011-24-4-4.png", "cc-nolabel-AEMNZ335_0013-loc-2087-644-66-41.png", "cc-nolabel-AEMNZ333_0013-loc-1531-374-12-23.png"))

It's simpler if you use sapply:
basic.features$label <- sapply(basic.features$parsed.filename,function(x){x[2]})
However, if you want to turn your parsed values into a data.frame in one shot, you could do this:
DF <- data.frame(t(sapply(basic.features$parsed.filename,function(x){x})))
colnames(DF) <- c('filename','label','code1','code2','code3','code4','code5')
> DF
filename label code1 code2 code3 code4 code5
1 cc-nolabel-AEMNZ336_0009-loc-1003-1504-7-8.png nolabel AEMNZ336_0009 1003 1504 7 8
2 cc-nolabel-AEMNZ335_0006-loc-1979-880-13-10.png nolabel AEMNZ335_0006 1979 880 13 10
3 cc-nolabel-AEMNZ333_0007-loc-941-263-8-8.png nolabel AEMNZ333_0007 941 263 8 8
4 cc-nolabel-AEMNZ336_0014-loc-2011-24-4-4.png nolabel AEMNZ336_0014 2011 24 4 4
5 cc-nolabel-AEMNZ335_0013-loc-2087-644-66-41.png nolabel AEMNZ335_0013 2087 644 66 41
6 cc-nolabel-AEMNZ333_0013-loc-1531-374-12-23.png nolabel AEMNZ333_0013 1531 374 12 23

I'd recommend doing this in three steps.
convert your list of vectors to a matrix by row-binding them:
mat <- do.call(rbind, basic.features$parsed.filename)
Next, convert to a data frame
df <- as.data.frame(mat, stringsAsFactors = FALSE)
Finally, convert characters to columns of correct type and name columns
df[] <- lapply(df, type.convert, as.is = TRUE)
names(df) <- c('filename', 'label', 'code1', 'code2', 'code3', 'code4', 'code5')

Related

Creating a data frame from looping through a list of data frames in R

I have a large data set that is organized as a list of 1044 data frames. Each data frame is a profile that holds the same data for a different station and time. I am trying to create a data frame that holds the output of my function fitsObs, but my current code only goes through a single data frame. Any ideas?
i=1
start=1
for(i in 1:1044){
station1 <- surveyCTD$stations[[i]]
df1 <- surveyCTD$data[[i]]
date1 <- surveyCTD$dates[[i]]
fitObs <- fitTp2(-df1$depth, df1$temp)
if(start==1){
start=0
dfout <- data.frame(
date=date1
,station=station1
)
names(fitObs) <- paste0(names(fitObs),"o")
dfout<-cbind(dfout, df1$temp, df1$depth)
dfout <- cbind(dfout, fitObs)
}
}
From a first look I would try two ways to debug it. First print out the head of a DF to understand the behavior of your loop, then check the range of your variable dfout, it looks like the variable is local to your loop.
Moreover your i variable out of the loop does not change anything in your loop.
I have created a reproducible example of my best guess as to what you are asking. I also assume that you are able to adjust the concepts in this general example to suit your own problem. It's easier if you provide an example of your list in future.
First we create some reproducible data
a <- c(10,20,30,40)
b <- c(5,10,15,20)
c <- c(20,25,30,35)
df1 <- data.frame(x=a+1,y=b+1,z=c+1)
df2 <- data.frame(x=a,y=b,z=c)
ls1 <- list(df1,df2)
Which looks like this
print(ls1)
[[1]]
x y z
1 11 6 21
2 21 11 26
3 31 16 31
4 41 21 36
[[2]]
x y z
1 10 5 20
2 20 10 25
3 30 15 30
4 40 20 35
So we now have two dataframes within a single list. The following code should then work to go through the columns within each dataframe of the list and apply the mean() function to the data in the column. You change this to row by selecting '1' rather than '2'.
df <- do.call("rbind", lapply(ls1, function(x) apply(x,2,mean)))
as.data.frame(df)
print(df)
x y z
1 26 13.5 28.5
2 25 12.5 27.5
You should be able to replace mean() with whatever function you have written for your data. Let me know if this helps.
Consider building a generalized function to be called withi Map (wrapper to mapply, the multiple, elementwise iterator member of apply family) to build a list of data frames each with your fitObs output. And pass all equal length objects into the data.frame() constructor.
Then outside the loop, run do.call for a final, single appended dataframe of all 1,044 smaller dataframes (assuming each maintains exact same and number of columns):
# GENERALIZED FUNCTION
add_fit_obs <- function(dt, st, df) {
fitObs <- fitTp2(-df$depth, df$temp)
names(fitObs) <- paste0(names(fitObs),"o")
tmp <- data.frame(
date = dt,
station = st,
depth = df1$depth,
temp = df1$temp,
fitObs
)
return(tmp)
}
# LIST OF DATA FRAMES
df_list <- Map(add_fit_obs, surveyCTD$stations, surveyCTD$dates, surveyCTD$data)
# EQUIVALENTLY:
# df_list <- mapply(add_fit_obs, surveyCTD$stations, surveyCTD$dates, surveyCTD$data, SIMPLIFY=FALSE)
# SINGLE DATAFRAME
master_df <- do.call(rbind, df_list)

Filter rows based on ID over multiple data frames with for loop

How can I filter 180 .csv files from my global directory based on a matching ID in another df named 'Camera' in R? When I tried to incorporate my one by one file filtering code (see step 3b) into a for-loop (see step 3a) I get the error:
Error in paste("i")$SegmentID : $ operator is invalid for atomic vectors.
I'm quite new to for loop functions, so I really appreciate your help! All the 180 files have a unique name, are different in length, but have the same column structure & names. They look like:
df 'File1' df 'Camera'
ID Speed Location ID Time
1 30 4 1 10
2 35 5 3 11
3 40 6 5 12
4 30 7
5 35 8
Filtered df 'File1'
ID Speed Location
1 30 4
3 40 6
5 35 8
These are some samples of my code:
#STEP 1: read files
filenames <- list.files(path="06-06-2017_0900-1200uur",
pattern="*.csv")
# STEP 2: import files
for(i in filenames){
filepath <- file.path("06-06-2017_0900-1200uur",paste(i))
assign(i, read.csv2(filepath, header = TRUE, skip = "1"))
}
# STEP 3a: delete rows that do not match ID in df 'Cameras'
for(i in filesnames){
paste("i") <- paste("i")[paste("i")$ID %in% Cameras$ID,]
}
#STEP 3b: filtering one by one
File1 <- File1[File1$ID %in% Camera$ID,]
Here is an approach that makes use of lists (generally a better way to go). First, utilize the include.names argument in list.files():
fns <- list.files(
path = "06-06-2017_0900-1200uur",
pattern = "*.csv",
include.names = T
)
Now you have a list of your filenames. Next, apply read.csv2 to each of the filenames in your list:
dat <- lapply(fns, read.csv2, header = T, skip = 1)
Now you have a list of data frames (the output from calling read.csv). Finally, apply subset() to each of the data frames to keep only those rows which match the ID column:
out <- lapply(dat, function(x) subset(x, ID %in% Camera$ID))
If I understand the question, the output should be a data frame from file1 where the ID for all rows matches one of the rows in the Camera file.
This is easily accomplished with the sqldf() package and structured query language.
rawFile1 <- "ID Speed Location
1 30 4
2 35 5
3 40 6
4 30 7
5 35 8
"
rawCamera <- " ID Time
1 10
3 11
5 12
"
file1 <- read.table(textConnection(rawFile1),header=TRUE)
Camera <- read.table(textConnection(rawCamera),header=TRUE)
library(sqldf)
sqlStmt <- "select * from file1 where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")
...and the output:
ID Speed Location
1 1 30 4
2 3 40 6
3 5 35 8
>
To extend this logic to a number of csv files, first we obtain the list of files from the subdirectory where they are stored using the list.files() function. For example, if the files were in a data subdirectory of the R working directory, one might use the following function call.
theFiles <- list.files("./data/",".csv",full.names=TRUE)
We can read these files with read.table() to create a list() of data frames.
theData <- lapply(theFiles,function(x) {
read.table(x,header=TRUE)})
To combine the files into a single data frame, we execute do.call().
combinedData <- do.call(rbind,theData)
Now we can read the camera data and use sqldf to keep only the IDs matching the camera data.
Camera <- read.table(...,header=TRUE)
library(sqldf)
sqlStmt <- "select * from combinedData where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")

Assigning value to an R object without using its name with get()

I am having a problem with get() in R.
I have a set of data.frames with a common structure in my environment. I want to loop through these data frames and change the name of the 2nd column so that the name of the 2nd column contains a prefix from the 1st column.
For example, if column 1 = A_cat and column 2 is dog, I want column 2 to be changed to A_dog.
Below is an example of the R code I am using:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
for( element in grep('^df$', names(environment()), value=TRUE) ) {
colnames(get(element))[2] <- paste(strsplit(colnames(get(element)) [1], '`_`')[[1]][1],
colnames(get(element))[2], sep='`_`')
}
The arguments within the for loop, on either side of the assignment operator, both give the expected result if I run them separately but when run together produce the following error.
Error in colnames(get(element))[2] <- paste(strsplit(colnames(get(element))[1], :
could not find function "get<-"
Any help with this problem would be greatly appreciated.
This does the same thing as the code in the question without using get:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
e <- environment() ##
df.names <- grep("^df$", names(e), value = TRUE)
# nm is the current data frame name and nms are its column names
for(nm in df.names) {
nms <- names(e[[nm]])
names(e[[nm]])[2] <- paste0(sub("_.*", "_", nms[1]), nms[2])
}
giving:
> df
A_cat A_dog
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Keeping the data.frames in a named list as suggested in a comment to the question might be even better. For example, if instead of keeping the data.frames in an environment they were in a list called e
e <- list(df = df)
then omit the line marked ## and the rest works as is.
Here would be one way to accomplish this goal if the data.frames have systematic names (here, df1 df2 df3, etc) and the prefix ends with "_" as in the example:
# suggested by #roland roll them up in a list:
myDfList <- mget(ls(pattern="^df"))
# change names
for(dfName in names(myDfList)) {
names(myDfList[[dfName]])[2] <- paste0(gsub("^(.*_)", "\\1",
names(myDfList[[dfName]])[1]),
names(myDfList[[dfName]])[2])
}

Changing multiple data frame column titles in r

The program that I am running creates three data frames using the following code:
datuniqueNDC <- data.frame(lapply(datlist, function(x) length(unique(x$NDC))))
datuniquePID <- data.frame(lapply(datlist, function(x) length(unique(x$PAYERID)))
datlengthNDC <- data.frame(lapply(datlist, function(x) length(x$NDC)))
They have outputs that look like this:
X182L X178L X76L
1 182 178 76
X34L X31L X7L
1 34 31 7
X10674L X10021L X653L
1 10674 10021 653
What I am trying to do is combine the rows together into one data frame with the desired outcome being:
X Y Z
1 182 178 76
2 34 31 7
3 10674 10021 653
but the rbind command doesn't work due to the names of all the columns being different. I can get it to work by using the colnames command after creating each variable above, but it seems like there should be a more efficient way to accomplish this by using one of the apply commands or something similar. Thanks for the help.
one way, since evreything seems to be a numeric, would be this:
mylist <- list(dat1,dat2,dat3)
# assuming your three data.frames are dat1:dat3 respectively
do.call("rbind",lapply(mylist, as.matrix))
# X182L X178L X76L
#[1,] 182 178 76
#[2,] 34 31 7
#[3,] 10674 10021 653
basically this works because your data are matrices not dataframes, then you only need to change names once at the end.
Since the functions you use in you lapply calls are scalars, it would be easier if you use sapply. sapply returns vectors which you can rbind
datuniqueNDC <- sapply(datlist, function(x) length(unique(x$NDC)))
datuniquePID <- sapply(datlist, function(x) length(unique(x$PAYERID))
datlengthNDC <- sapply(datlist, function(x) length(x$NDC))
dat <- as.data.frame(rbind(datuniqueNDC,datuniquePID,datlengthNDC))
names(dat) <- c("x", "y", "z")
Another solution is to calculate all three of your statistics in one function:
dat <- as.data.frame(sapply(datlist, function(x) {
c(length(unique(x$NDC)), length(unique(x$PAYERID), length(x$NDC))
}))
names(dat) <- c("x", "y", "z")

using R to compare data frames to find first occurrence to df1$columnA in df2$columnB when df2$columnB is a single space separated list

I have a question regarding data frames in R. I want to take a data.frame, dfy, and find the first occurrence of dfy$workerId in dfx$workers, to create a new dataframe, dfz, a copy of dfx that also contains the first occurance of dfy$workerId in dfx$wokers as dfz$highestRankingGroup. Its a little tricky becuase dfx$workers is a single spaced seperated string. My original plan was to do this in Perl, but I would like to find a way to work in R and avoid having to write out to temp. files.
thank you for your time.
y <- "name,workerId,aptitude
joe,4,34
steve,5,42
jon,7,23
nick,8,122"
x <- "workers,projectScore
1 2 3 8 ,92
1 2 5 9 ,89
3 5 7 ,85
1 8 9 10 ,82
4 5 7 8 ,83
1 3 5 7 8 ,79"
z <- "name,workerId,aptitude,highestRankingGroup
joe,4,0.34,5
steve,5,0.42,2
jon,7,0.23,3
nick,8,0.122,1"
dfy <- read.csv(textConnection(y), header=TRUE, sep=",", stringsAsFactors=FALSE)
dfx <- read.csv(textConnection(x), header=TRUE, sep=",", stringsAsFactors=FALSE)
dfz <- read.csv(textConnection(z), header=TRUE, sep=",", stringsAsFactors=FALSE)
First, add the highestRankingGroup column to your dataset dfx
dfx$highestRankingGroup <- seq(1, length(dfx$projectScore))
Since you have mentioned perl you can do a familar perl thing and simple split the workers column in whitespaces. I combined the splitting with functions from the plyr package which are always nice to work with.
library(plyr)
df.l <- dlply(dfx, "projectScore")
f.reshape <- function(x) {
wrk <- strsplit(x$workers, "\\s", perl = TRUE)
data.frame(worker = wrk[[1]]
, projectScore = x$projectScore
, highestRankingGroup = x$highestRankingGroup
)
}
df.tmp <- ldply(df.l, f.reshape)
df.z1 <- merge(df.tmp, dfy, by.x = "worker", by.y = "workerId")
Now you have to look for the max values in the projectScore column:
df.z2 <- ddply(df.z1, "name", function(x) x[x$projectScore == max(x$projectScore), ])
This produces:
R> df.z2
worker .id projectScore highestRankingGroup name aptitude
1 4 83 83 5 joe 34
2 7 85 85 3 jon 23
3 8 92 92 1 nick 122
4 5 89 89 2 steve 42
R>
You can reshape the df.z2 dataframe according to your personal taste. Simply look at the different steps and the produced objects in order to see at which step different columns, etc get introduced.
Before I start, I recommend that you go with #mropa's answer. This answer is a bit of fun I had messing about with your question. On the plus side, it does involve a bit of fun with function closures ;)
Essentially, I create a function that returns two functions.
updateDFz = function(dfy) {
## Create a default dfz matrix
dfz = dfy
dfz$HRG = 10000 ## Big max value
counter = 0
## Update the dfz matrix after every row
update = function(x) {
counter <<- counter + 1
for(i in seq_along(x)) {
if(is.element(x[i], dfz$workerId))
dfz[dfz$workerId == x[i],]$HRG <<- min(dfz[dfz$workerId == x[i],]$HRG, counter)
}
return(dfz)
}
## Get the dfz matrix
getDFz = function()
return(dfz)
list(getDFz=getDFz, update=update)
}
f = updateDFz(dfy)
lapply(strsplit(dfx$workers, " "), f$update)
f$getDFz()
As I said, a bit of fun ;)
Hopefully someone finds this useful.
# Recieves a data.frame and a search column
# Returns a data.frame of the first occurances of all unique values of the "search" column
getfirsts <- function(data, searchcol){
rows <- as.data.frame(match(unique(data[[searchcol]]), data[[searchcol]]))
firsts = data[rows[[1]],]
return(firsts)
}

Resources