The program that I am running creates three data frames using the following code:
datuniqueNDC <- data.frame(lapply(datlist, function(x) length(unique(x$NDC))))
datuniquePID <- data.frame(lapply(datlist, function(x) length(unique(x$PAYERID))))
datlengthNDC <- data.frame(lapply(datlist, function(x) length(x$NDC)))
They have outputs that look like this:
X182L X178L X76L
1 182 178 76
X34L X31L X7L
1 34 31 7
X10674L X10021L X653L
1 10674 10021 653
What I am trying to do is combine the rows together into one data frame with the desired outcome being:
X Y Z
1 182 178 76
2 34 31 7
3 10674 10021 653
but the rbind command doesn't work due to the names of all the columns being different. I can get it to work by using the colnames command after creating each variable above, but it seems like there should be a more efficient way to accomplish this by using one of the apply commands or something similar. Thanks for the help.
One way, since everything seems to be numeric, would be this:
mylist <- list(dat1,dat2,dat3)
# assuming your three data.frames are dat1:dat3 respectively
do.call("rbind",lapply(mylist, as.matrix))
# X182L X178L X76L
#[1,] 182 178 76
#[2,] 34 31 7
#[3,] 10674 10021 653
Basically this works because rbind on matrices binds by position and does not require matching column names, unlike rbind on data frames. You then only need to change the names once at the end.
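To make the difference concrete, here is a minimal sketch. The three one-row data frames are hypothetical stand-ins built from the values printed in the question; the row names are reset before converting back, on the assumption that a data frame is wanted at the end.

```r
# Hypothetical stand-ins for the three one-row data frames from the question
dat1 <- data.frame(X182L = 182L, X178L = 178L, X76L = 76L)
dat2 <- data.frame(X34L = 34L, X31L = 31L, X7L = 7L)
dat3 <- data.frame(X10674L = 10674L, X10021L = 10021L, X653L = 653L)

# rbind() on data frames matches columns by name, so this raises an error,
# which is caught here and turned into NULL
df_attempt <- tryCatch(rbind(dat1, dat2, dat3), error = function(e) NULL)

# rbind() on matrices binds by position and keeps the first set of column names
combined <- do.call(rbind, lapply(list(dat1, dat2, dat3), as.matrix))

# Rename once at the end and drop the repeated row names
colnames(combined) <- c("X", "Y", "Z")
rownames(combined) <- NULL
combined <- as.data.frame(combined)
```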
Since the functions you use in your lapply calls return scalars, it would be easier to use sapply. sapply returns vectors, which you can rbind:
datuniqueNDC <- sapply(datlist, function(x) length(unique(x$NDC)))
datuniquePID <- sapply(datlist, function(x) length(unique(x$PAYERID)))
datlengthNDC <- sapply(datlist, function(x) length(x$NDC))
dat <- as.data.frame(rbind(datuniqueNDC,datuniquePID,datlengthNDC))
names(dat) <- c("x", "y", "z")
Another solution is to calculate all three of your statistics in one function:
dat <- as.data.frame(sapply(datlist, function(x) {
c(length(unique(x$NDC)), length(unique(x$PAYERID)), length(x$NDC))
}))
names(dat) <- c("x", "y", "z")
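For anyone who wants to try the single-function approach without the original data, the following is a small sketch on a made-up datlist (the contents of datlist and the row labels are assumptions, not the asker's data). It uses vapply instead of sapply so that the shape of the result is fixed in advance.

```r
# Made-up list of data frames with the two columns used in the question
datlist <- list(
  a = data.frame(NDC = c(1, 1, 2), PAYERID = c("p1", "p2", "p2")),
  b = data.frame(NDC = c(3, 4, 5, 5), PAYERID = c("p1", "p1", "p1", "p1"))
)

# One call computes all three statistics; each list element becomes a column
stats <- vapply(datlist, function(x) {
  c(length(unique(x$NDC)), length(unique(x$PAYERID)), length(x$NDC))
}, integer(3))

# Label the rows so each statistic can be identified
rownames(stats) <- c("uniqueNDC", "uniquePID", "lengthNDC")
dat <- as.data.frame(stats)
```

Each row of dat is one statistic and each column one element of datlist, which matches the layout asked for in the question.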
I have a large data set that is organized as a list of 1044 data frames. Each data frame is a profile that holds the same data for a different station and time. I am trying to create a data frame that holds the output of my function fitsObs, but my current code only goes through a single data frame. Any ideas?
i=1
start=1
for(i in 1:1044){
station1 <- surveyCTD$stations[[i]]
df1 <- surveyCTD$data[[i]]
date1 <- surveyCTD$dates[[i]]
fitObs <- fitTp2(-df1$depth, df1$temp)
if(start==1){
start=0
dfout <- data.frame(
date=date1
,station=station1
)
names(fitObs) <- paste0(names(fitObs),"o")
dfout<-cbind(dfout, df1$temp, df1$depth)
dfout <- cbind(dfout, fitObs)
}
}
At first glance, I would try two things to debug it. First, print the head of a data frame inside the loop to understand what each iteration does. Then check where dfout is built: it is only created and extended inside the if(start==1) branch, which runs on the first iteration only.
Moreover, setting i before the loop has no effect, because for reassigns it on every iteration.
I have created a reproducible example of my best guess as to what you are asking. I also assume that you are able to adjust the concepts in this general example to suit your own problem. It's easier if you provide an example of your list in future.
First we create some reproducible data
a <- c(10,20,30,40)
b <- c(5,10,15,20)
c <- c(20,25,30,35)
df1 <- data.frame(x=a+1,y=b+1,z=c+1)
df2 <- data.frame(x=a,y=b,z=c)
ls1 <- list(df1,df2)
Which looks like this
print(ls1)
[[1]]
x y z
1 11 6 21
2 21 11 26
3 31 16 31
4 41 21 36
[[2]]
x y z
1 10 5 20
2 20 10 25
3 30 15 30
4 40 20 35
So we now have two data frames within a single list. The following code goes through the columns of each data frame in the list and applies the mean() function to each column. To apply it over rows instead, pass 1 rather than 2 as the second argument of apply().
df <- do.call("rbind", lapply(ls1, function(x) apply(x,2,mean)))
df <- as.data.frame(df)
print(df)
x y z
1 26 13.5 28.5
2 25 12.5 27.5
You should be able to replace mean() with whatever function you have written for your data. Let me know if this helps.
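As a compact variant of the same idea, the sketch below rebuilds the example list and uses colMeans with sapply. This is only an alternative spelling for the special case of column means, not a replacement for the general apply() pattern above.

```r
# Same example data as above
ls1 <- list(
  data.frame(x = c(11, 21, 31, 41), y = c(6, 11, 16, 21), z = c(21, 26, 31, 36)),
  data.frame(x = c(10, 20, 30, 40), y = c(5, 10, 15, 20), z = c(20, 25, 30, 35))
)

# sapply() returns one column per data frame, so transpose to get one row each
col_means <- as.data.frame(t(sapply(ls1, colMeans)))
```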
Consider building a generalized function to be called with Map (a wrapper to mapply, the multiple, elementwise iterator member of the apply family) to build a list of data frames, each with your fitObs output. Pass all equal-length objects into the data.frame() constructor.
Then, outside the iteration, run do.call for a final, single appended data frame of all 1,044 smaller data frames (assuming each has exactly the same column names and number of columns):
# GENERALIZED FUNCTION
add_fit_obs <- function(dt, st, df) {
fitObs <- fitTp2(-df$depth, df$temp)
names(fitObs) <- paste0(names(fitObs),"o")
tmp <- data.frame(
date = dt,
station = st,
depth = df$depth,
temp = df$temp,
fitObs
)
return(tmp)
}
# LIST OF DATA FRAMES
df_list <- Map(add_fit_obs, surveyCTD$dates, surveyCTD$stations, surveyCTD$data)
# EQUIVALENTLY:
# df_list <- mapply(add_fit_obs, surveyCTD$dates, surveyCTD$stations, surveyCTD$data, SIMPLIFY=FALSE)
# SINGLE DATAFRAME
master_df <- do.call(rbind, df_list)
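Since fitTp2 and surveyCTD are not available here, the following is only a runnable sketch of the Map plus do.call(rbind, ...) pattern. fit_stub, surveyCTD_toy and summarise_profile are hypothetical names, and the sketch assumes the fitting function returns a named numeric vector, giving one output row per profile.

```r
# Hypothetical stand-in for fitTp2(): returns a named numeric vector per profile
fit_stub <- function(depth, temp) {
  c(meanTemp = mean(temp), maxDepth = max(abs(depth)))
}

# Toy structure mimicking surveyCTD: parallel lists of stations, dates and data
surveyCTD_toy <- list(
  stations = list("A", "B"),
  dates = list(as.Date("2020-01-01"), as.Date("2020-01-02")),
  data = list(
    data.frame(depth = c(1, 2, 3), temp = c(10, 9, 8)),
    data.frame(depth = c(1, 2), temp = c(12, 11))
  )
)

# One row per profile: identifiers plus the fitted values with an "o" suffix
summarise_profile <- function(dt, st, df) {
  fit <- fit_stub(-df$depth, df$temp)
  names(fit) <- paste0(names(fit), "o")
  data.frame(date = dt, station = st, as.list(fit))
}

# Argument order matches the function signature: dates, stations, data
df_list <- Map(summarise_profile,
               surveyCTD_toy$dates, surveyCTD_toy$stations, surveyCTD_toy$data)
master_df <- do.call(rbind, df_list)
```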
Let's say I have data like:
> data[295:300,]
Date sulfate nitrate ID
295 2003-10-22 NA NA 1
296 2003-10-23 NA NA 1
297 2003-10-24 3.47 0.363 1
298 2003-10-25 NA NA 1
299 2003-10-26 NA NA 1
300 2003-10-27 NA NA 1
Now I would like to add all the nitrate values into a new list/vector. I'm using the following code:
i <- 1
my_list <- c()
for(val in data)
{
my_list[i] <- val
i <- i + 1
}
But this is what happens:
Warning message:
In x[i] <- val :
number of items to replace is not a multiple of replacement length
> i
[1] 2
> x
[1] NA
Where am I going wrong? The data is part of the Coursera R Programming coursework. I can assure you that this is not an assignment/quiz. I have been trying to understand the best way to append elements to a list with a loop. I have not yet reached the lapply or sapply part of the coursework, so I am thinking about workarounds.
Thanks in advance.
If it's a duplicate question, please direct me to it.
As we mention in the comments, you are not looping over the rows of your data frame, but over its columns (sometimes also called variables). Hence, loop over data$nitrate.
i <- 1
my_list <- c()
for(val in data$nitrate)
{
my_list[i] <- val
i <- i + 1
}
Now, instead of looping over the values, a better way is to use the fact that the new vector and the old data should share the same index, so loop over the index i. How do you tell R how many indexes there are? Here you have several choices again: 1:nrow(data), 1:length(data$nitrate), and several other ways. Below I have given you a few examples of how to extract from the data frame.
my_vector <- c()
for(i in 1:nrow(data)){
my_vector[i] <- data$nitrate[i] ## Version 1 of extracting from data.frame
my_vector[i] <- data[i,"nitrate"] ## Version 2: [row,column name]
my_vector[i] <- data[i,3] ## Version 3: [row,column number]
}
My suggestion: Rather than calling the collection a list, call it a vector, since that is what it is. Vectors and lists behave a little differently in R.
Of course, in reality you don't want to get the data out one by one. A much more efficient way of getting your data out is
my_vector2 <- data$nitrate
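If a loop is still wanted, for example while learning, a sketch along these lines avoids growing the vector one element at a time. The toy data frame below is made up to resemble the rows shown in the question, and the final line shows one way to keep only the non-missing values.

```r
# Toy data resembling the rows printed in the question
toy <- data.frame(
  Date = as.Date("2003-10-22") + 0:5,
  sulfate = c(NA, NA, 3.47, NA, NA, NA),
  nitrate = c(NA, NA, 0.363, NA, NA, NA),
  ID = 1L
)

# Preallocate the result, then fill it by index; seq_len() is safe for 0 rows
n <- nrow(toy)
my_vector <- numeric(n)
for (i in seq_len(n)) {
  my_vector[i] <- toy$nitrate[i]
}

# Vectorised extraction of the observed (non-NA) values only
observed <- toy$nitrate[!is.na(toy$nitrate)]
```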
I have a large data.table that I am collapsing to the month level using ,by.
There are 5 by vars, with # of levels: c(4,3,106,3,1380). The 106 is months, the 1380 is a geographic unit. As it turns out there are some 0's, in that some cells have no values. by drops these, but I'd like it to keep them.
Reproducible example:
require(data.table)
set.seed(1)
n <- 1000
s <- function(n,l=5) sample(letters[seq(l)],n,replace=TRUE)
dat <- data.table( x=runif(n), g1=s(n), g2=s(n), g3=s(n,25) )
datCollapsed <- dat[ , list(nv=.N), by=list(g1,g2,g3) ]
datCollapsed[ , prod(dim(table(g1,g2,g3))) ] # how many there should be: 5*5*25=625
nrow(datCollapsed) # how many there are
Is there an efficient way to fill in these missing values with 0's, so that all permutations of the by vars are in the resultant collapsed data.table?
I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:
keycols <- c("g1", "g2", "g3") ## Grouping columns
setkeyv(dat, keycols) ## Set dat's key
ii <- do.call(CJ, lapply(dat[, ..keycols], unique)) ## CJ() to form index
datCollapsed <- dat[ii, list(nv=.N), by=.EACHI] ## Aggregate per row of ii (by=.EACHI is required since data.table 1.9.4)
## Check that it worked
nrow(datCollapsed)
# [1] 625
table(datCollapsed$nv)
# 0 1 2 3 4 5 6
# 135 191 162 82 39 13 3
This approach is referred to as a "by-without-by" and, as documented in ?data.table, it is just as efficient and fast as passing the grouping instructions in via the by argument:
Advanced: Aggregation for a subset of known groups is
particularly efficient when passing those groups in i. When
i is a data.table, DT[i,j] evaluates j for each row
of i. We call this by without by or grouping by i.
Hence, the self join DT[data.table(unique(colA)),j] is
identical to DT[,j,by=colA].
Make a cartesian join of the unique values, and use that to join back to your results
dat.keys <- dat[,CJ(g1=unique(g1), g2=unique(g2), g3=unique(g3))]
setkey(datCollapsed, g1, g2, g3)
nrow(datCollapsed[dat.keys]) # effectively a left join of datCollapsed onto dat.keys
# [1] 625
Note that the missing values are NA right now, but you can easily change that to 0s if you want.
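The same join-then-fill idea can be sketched in base R, which may help to see what the join is doing. This is an analogue under the assumption that a plain data.frame is acceptable, not the data.table method itself: expand.grid plays the role of CJ, and merge with all.x = TRUE plays the role of the left join.

```r
set.seed(1)
n <- 1000
s <- function(n, l = 5) sample(letters[seq_len(l)], n, replace = TRUE)
dat <- data.frame(x = runif(n), g1 = s(n), g2 = s(n), g3 = s(n, 25),
                  stringsAsFactors = FALSE)

# Count rows per observed combination of the grouping columns
counts <- aggregate(x ~ g1 + g2 + g3, data = dat, FUN = length)
names(counts)[names(counts) == "x"] <- "nv"

# Every combination of the observed values, including the empty cells
keys <- expand.grid(g1 = unique(dat$g1), g2 = unique(dat$g2),
                    g3 = unique(dat$g3), stringsAsFactors = FALSE)

# Left join of the counts onto the full grid, then replace NA with 0
full <- merge(keys, counts, by = c("g1", "g2", "g3"), all.x = TRUE)
full$nv[is.na(full$nv)] <- 0L
```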
I have a data frame that contains a file name with regular parts. I use a regex to parse this file name and store each part in its own column.
parse.file.name <- function(file.name="cc-nolabel-AEMNZ334_0009-loc-1317-407-6-39.png")
{
rfn <- regexec(pattern="cc-(.+?)-(.+?)-loc-(.+?)-(.+?)-(.+?)-(.+?)\\.png", text=file.name)
matchfn <- regmatches(file.name, rfn)
return(matchfn)
}
basic.features$parsed.filename <- parse.file.name(as.character(basic.features$filename))
filename contains values similar to the default parameter. I'm retrieving the individual values for each column like the following:
basic.features$label <- unlist(lapply(basic.features$parsed.filename,
function(pf) {
return(unlist(pf)[2]) } ))
I feel that this is not an elegant way, but I couldn't easily get individual values out of a data frame column that contains a list in each row. Is there a better way to do this?
If you like example data:
basic.features <- data.frame(filename=c("cc-nolabel-AEMNZ336_0009-loc-1003-1504-7-8.png", "cc-nolabel-AEMNZ335_0006-loc-1979-880-13-10.png", "cc-nolabel-AEMNZ333_0007-loc-941-263-8-8.png", "cc-nolabel-AEMNZ336_0014-loc-2011-24-4-4.png", "cc-nolabel-AEMNZ335_0013-loc-2087-644-66-41.png", "cc-nolabel-AEMNZ333_0013-loc-1531-374-12-23.png"))
It's simpler if you use sapply:
basic.features$label <- sapply(basic.features$parsed.filename,function(x){x[2]})
However, if you want to turn your parsed values into a data.frame in one shot, you could do this:
DF <- data.frame(t(sapply(basic.features$parsed.filename,function(x){x})))
colnames(DF) <- c('filename','label','code1','code2','code3','code4','code5')
> DF
filename label code1 code2 code3 code4 code5
1 cc-nolabel-AEMNZ336_0009-loc-1003-1504-7-8.png nolabel AEMNZ336_0009 1003 1504 7 8
2 cc-nolabel-AEMNZ335_0006-loc-1979-880-13-10.png nolabel AEMNZ335_0006 1979 880 13 10
3 cc-nolabel-AEMNZ333_0007-loc-941-263-8-8.png nolabel AEMNZ333_0007 941 263 8 8
4 cc-nolabel-AEMNZ336_0014-loc-2011-24-4-4.png nolabel AEMNZ336_0014 2011 24 4 4
5 cc-nolabel-AEMNZ335_0013-loc-2087-644-66-41.png nolabel AEMNZ335_0013 2087 644 66 41
6 cc-nolabel-AEMNZ333_0013-loc-1531-374-12-23.png nolabel AEMNZ333_0013 1531 374 12 23
I'd recommend doing this in three steps.
First, convert your list of vectors to a matrix by row-binding them:
mat <- do.call(rbind, basic.features$parsed.filename)
Next, convert to a data frame
df <- as.data.frame(mat, stringsAsFactors = FALSE)
Finally, convert characters to columns of correct type and name columns
df[] <- lapply(df, type.convert, as.is = TRUE)
names(df) <- c('filename', 'label', 'code1', 'code2', 'code3', 'code4', 'code5')
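Put together, the three steps can be run end to end on a couple of the example file names. This is a self-contained sketch; files, parsed and parts are names chosen here for illustration.

```r
# Two of the example file names from the question
files <- c("cc-nolabel-AEMNZ336_0009-loc-1003-1504-7-8.png",
           "cc-nolabel-AEMNZ335_0006-loc-1979-880-13-10.png")
pattern <- "cc-(.+?)-(.+?)-loc-(.+?)-(.+?)-(.+?)-(.+?)\\.png"

# One character vector per file: the full match followed by the six groups
parsed <- regmatches(files, regexec(pattern, files))

# Step 1: row-bind into a character matrix
mat <- do.call(rbind, parsed)

# Step 2: convert to a data frame and name the columns
parts <- as.data.frame(mat, stringsAsFactors = FALSE)
names(parts) <- c("filename", "label", "code1", "code2", "code3", "code4", "code5")

# Step 3: let type.convert() turn the numeric-looking columns into integers
parts[] <- lapply(parts, type.convert, as.is = TRUE)
```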
I have a question regarding data frames in R. I want to take a data.frame, dfy, and find the first occurrence of each dfy$workerId in dfx$workers, to create a new data frame, dfz, a copy of dfy that also contains the row of that first occurrence in dfx$workers as dfz$highestRankingGroup. It's a little tricky because dfx$workers is a single space-separated string. My original plan was to do this in Perl, but I would like to find a way to work in R and avoid having to write out to temporary files.
Thank you for your time.
y <- "name,workerId,aptitude
joe,4,34
steve,5,42
jon,7,23
nick,8,122"
x <- "workers,projectScore
1 2 3 8 ,92
1 2 5 9 ,89
3 5 7 ,85
1 8 9 10 ,82
4 5 7 8 ,83
1 3 5 7 8 ,79"
z <- "name,workerId,aptitude,highestRankingGroup
joe,4,0.34,5
steve,5,0.42,2
jon,7,0.23,3
nick,8,0.122,1"
dfy <- read.csv(textConnection(y), header=TRUE, sep=",", stringsAsFactors=FALSE)
dfx <- read.csv(textConnection(x), header=TRUE, sep=",", stringsAsFactors=FALSE)
dfz <- read.csv(textConnection(z), header=TRUE, sep=",", stringsAsFactors=FALSE)
First, add the highestRankingGroup column to your dataset dfx
dfx$highestRankingGroup <- seq(1, length(dfx$projectScore))
Since you have mentioned Perl, you can do a familiar Perl thing and simply split the workers column on whitespace. I combined the splitting with functions from the plyr package, which are always nice to work with.
library(plyr)
df.l <- dlply(dfx, "projectScore")
f.reshape <- function(x) {
wrk <- strsplit(x$workers, "\\s", perl = TRUE)
data.frame(worker = wrk[[1]]
, projectScore = x$projectScore
, highestRankingGroup = x$highestRankingGroup
)
}
df.tmp <- ldply(df.l, f.reshape)
df.z1 <- merge(df.tmp, dfy, by.x = "worker", by.y = "workerId")
Now you have to look for the max values in the projectScore column:
df.z2 <- ddply(df.z1, "name", function(x) x[x$projectScore == max(x$projectScore), ])
This produces:
R> df.z2
worker .id projectScore highestRankingGroup name aptitude
1 4 83 83 5 joe 34
2 7 85 85 3 jon 23
3 8 92 92 1 nick 122
4 5 89 89 2 steve 42
R>
You can reshape the df.z2 dataframe according to your personal taste. Simply look at the different steps and the produced objects in order to see at which step different columns, etc get introduced.
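For comparison, here is a base R sketch that looks up the first row directly, without plyr. It assumes, as in the example data, that the rows of dfx are already ordered from highest to lowest projectScore, so the row index can serve as highestRankingGroup; worker_sets and first_row are names introduced here.

```r
y <- "name,workerId,aptitude
joe,4,34
steve,5,42
jon,7,23
nick,8,122"
x <- "workers,projectScore
1 2 3 8 ,92
1 2 5 9 ,89
3 5 7 ,85
1 8 9 10 ,82
4 5 7 8 ,83
1 3 5 7 8 ,79"
dfy <- read.csv(text = y, stringsAsFactors = FALSE)
dfx <- read.csv(text = x, stringsAsFactors = FALSE)

# Split each workers string into its ids, ignoring the trailing space
worker_sets <- strsplit(trimws(dfx$workers), "\\s+")

# For each workerId, the index of the first row of dfx that contains it
first_row <- vapply(dfy$workerId, function(id) {
  hits <- which(vapply(worker_sets, function(w) as.character(id) %in% w, logical(1)))
  if (length(hits)) hits[1] else NA_integer_
}, integer(1))

dfz <- cbind(dfy, highestRankingGroup = first_row)
```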
Before I start, I recommend that you go with @mropa's answer. This answer is a bit of fun I had messing about with your question. On the plus side, it does involve a bit of fun with function closures ;)
Essentially, I create a function that returns two functions.
updateDFz = function(dfy) {
## Create a default dfz matrix
dfz = dfy
dfz$HRG = 10000 ## Big max value
counter = 0
## Update the dfz matrix after every row
update = function(x) {
counter <<- counter + 1
for(i in seq_along(x)) {
if(is.element(x[i], dfz$workerId))
dfz[dfz$workerId == x[i],]$HRG <<- min(dfz[dfz$workerId == x[i],]$HRG, counter)
}
return(dfz)
}
## Get the dfz matrix
getDFz = function()
return(dfz)
list(getDFz=getDFz, update=update)
}
f = updateDFz(dfy)
lapply(strsplit(dfx$workers, " "), f$update)
f$getDFz()
As I said, a bit of fun ;)
Hopefully someone finds this useful.
# Receives a data.frame and a search column
# Returns a data.frame of the first occurrences of all unique values of the "search" column
getfirsts <- function(data, searchcol){
rows <- as.data.frame(match(unique(data[[searchcol]]), data[[searchcol]]))
firsts = data[rows[[1]],]
return(firsts)
}
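An equivalent result can be sketched with duplicated(), which flags every repeat of a value after its first appearance. first_rows and toy are hypothetical names used only for this illustration.

```r
# Keep the first row for each distinct value of the search column
first_rows <- function(data, searchcol) {
  data[!duplicated(data[[searchcol]]), , drop = FALSE]
}

# Small made-up example: ids "a" and "b" each appear twice
toy <- data.frame(id = c("a", "b", "a", "c", "b"), value = 1:5)
firsts <- first_rows(toy, "id")
```

Here firsts keeps the rows for "a", "b" and "c" in their order of first appearance.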