Create dataframe from smallest vector available

Create dataframe from smallest vector available - r

I want to create a dataframe from a list of dataframes, specifically from a certain column of those dataframes. However each dataframe contains a different number of observations, so the following code gives me an error.
diffs <- data.frame(sensor1 = sensores[[1]]$Diff,
sensor2 = sensores[[2]]$Diff,
sensor3 = sensores[[3]]$Diff,
sensor4 = sensores[[4]]$Diff,
sensor5 = sensores[[5]]$Diff)
The error:
Error in data.frame(sensor1 = sensores[[1]]$Diff, sensor2 = sensores[[2]]$Diff, :
arguments imply differing number of rows: 29, 19, 36, 26
Is there some way to force data.frame() to take the minimal number or rows available from each one of the columns, in this case 19?
Maybe there is a built-in function in R that can do this, any solution is appreciated but I'd love to get something as general and as clear as possible.
Thank you in advance.

I can think of two approaches:
Example data:
df1 <- data.frame(A = 1:3)
df2 <- data.frame(B = 1:4)
df3 <- data.frame(C = 1:5)
Compute the number of rows of the smallest dataframe:
min_rows <- min(sapply(list(df1, df2, df3), nrow))
Use subsetting when combining:
diffs <- data.frame(a = df1[1:min_rows,], b = df2[1:min_rows,], c = df3[1:min_rows,] )
diffs
a b c
1 1 1 1
2 2 2 2
3 3 3 3
Alternatively, use merge:
rowmerge <- function(x,y){
# create row indicators for the merge:
x$ind <- 1:nrow(x)
y$ind <- 1:nrow(y)
out <- merge(x,y, all = T, by = "ind")
out["ind"] <- NULL
return(out)
}
Reduce(rowmerge, list(df1, df2, df3))
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 NA 4 4
5 NA NA 5
To get rid of the rows with NAs, remove the all = T.
For your particular case, you would probably call Reduce(rowmerge, sensores), assuming that sensores is a list of dataframes.
Note: if you already have an index somewhere (e.g. a timestamp of some sort), then it would be advisable to simply merge on that index instead of creating ind.

Related

Assign value in dataframe from list by list's element name = dataframe row number

I have a name list, such as the following:
> myNamedList
(...)
$`1870`
[1] 84.24639
$`1871`
[1] 84.59707
(...)
I would like to assign these values in a dataframe's column where the list element's name corresponds to the dataframe's row number. For now I am proceeding like this:
for (element in names(myNamedList)) {
targetDataFrame[as.numeric(element),][[columnName]] = myNamedList[[element]]
}
This is quite slow if the list is somewhat large, and also not very R-esque. I believe I could do something with apply, but am not sure where to look. Appreciate your help.

Add a row number to original data, then stack the list, then merge. See example:
# example
#data
set.seed(1); d <- data.frame(x = sample(LETTERS, 5))
#named list
x <- list("2" = 11, "4" = 22)
#add a row number
d$rowID = seq(nrow(d))
# stack the list, and merge
merge(d, stack(x), by.x = "rowID", by.y = "ind", all.x = TRUE)
# rowID x values
# 1 1 Y NA
# 2 2 D 11
# 3 3 G NA
# 4 4 A 22
# 5 5 B NA

How to change values in a column of a data frame based on conditions in another column?

I would like to have an equivalent of the Excel function "if". It seems basic enough, but I could not find relevant help.
I would like to assess "NA" to specific cells if two following cells in a different columns are not identical. In Excel, the command would be the following (say in C1): if(A1 = A2, B1, "NA"). I then just need to expand it to the rest of the column.
But in R, I am stuck!
Here is an equivalent of my R code so far.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"))
df
To get the following Type of each Type in another column, I found a useful function on StackOverflow that does the job.
# determines the following Type of each Type
shift <- function(x, n){
c(x[-(seq(n))], rep(6, n))
}
df$TypeFoll <- shift(df$Type, 1)
df
Now, I would like to keep TypeFoll in a specific row when the File for this row is identical to the File on the next row.
Here is what I tried. It failed!
for(i in 1:length(df$File)){
df$TypeFoll2 <- ifelse(df$File[i] == df$File[i+1], df$TypeFoll, "NA")
}
df
In the end, my data frame should look like:
aim = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"),
TypeFoll = c("2","3","4","4","5","6"),
TypeFoll2 = c("2","NA","4","4","NA","6"))
aim
Oh, and by the way, if someone would know how to easily put the columns TypeFoll and TypeFoll2 just after the column Type, it would be great!
Thanks in advance

I would do it as follows (not keeping the result from the shift function)
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"), stringsAsFactors = FALSE)
# This is your shift function
len=nrow(df)
A1 <- df$File[1:(len-1)]
A2 <- df$File[2:len]
# Why do you save the result of the shift function in the df?
Then assign if(A1 = A2, B1, "NA"). As akrun mentioned ifelse is vectorised: Btw. this is how you append a column to a data.frame
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type, NA), 6) #Why 6?
As 6 is hardcoded here something like:
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type, NA), max(df$Type)+1)
Is more generic.

First off, 'for' loops are pretty slow in R, so try to think of this as vector manipulation instead.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"));
Create shifted types and files and put it in new columns:
df$TypeFoll = c(as.character(df$Type[2:nrow(df)]), "NA");
df$FileFoll = c(as.character(df$File[2:nrow(df)]), "NA");
Now, df looks like this:
> df
Type File TypeFoll FileFoll
1 1 A 2 A
2 2 A 3 B
3 3 B 4 B
4 4 B 4 B
5 4 B 5 C
6 5 C NA NA
Then, create TypeFoll2 by combining these:
df$TypeFoll2 = ifelse(df$File == df$FileFoll, df$TypeFoll, "NA");
And you should have something that looks a lot like what you want:
> df;
Type File TypeFoll FileFoll TypeFoll2
1 1 A 2 A 2
2 2 A 3 B NA
3 3 B 4 B 4
4 4 B 4 B 4
5 4 B 5 C NA
6 5 C NA NA NA
If you want to remove the FileFoll column:
df$FileFoll = NULL;

Comparing two columns

I am new to R and I am trouble with a command that I did all the time in Python.
I have two data-frames (database and creditIDs), and what I want to do is compare one column in database and one column in creditIDs. More specifically in a value exists in creditIDs[,1] but doesn't in database[,5], I want to delete that entire row in database.
Here is the code:
for (i in 1:lengthColumns){
if (!(database$credit_id[i] %in% creditosVencidos)){
database[i,]<-database[-i,]
}
}
But I keep on getting this error:
50: In `[<-.data.frame`(`*tmp*`, i, , value = structure(list( ... :
replacement element 50 has 9696 rows to replace 1 rows
Could someone explain why this is happening? Thanks!

the which() command will return the row indices that satisfy a boolean statement, much like numpy.where() in python. Using the $ after a dataframe with a column name gives you a vector of that column... alternatively you could do d[,column_number].
In this example I'm creating an x and y column which share the first five values, and use which() to slice the dataframe on their by-row equality:
L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = rep(1:5, 2), y = 1:10, fac = fac))
d = d[which(d$x == d$y),]
d
x y fac
1 1 A
2 2 B
3 3 C
4 4 B
5 5 B

You will need to adjust this for your column names/numbers.
# Create two example data.frames
creditID <- data.frame(ID = c("896-19", "895-8", "899-1", "899-5"))
database <- data.frame(ID = c("896-19", "camel", "899-1", "goat", "899-1"))
# Method 1
database[database$ID %in% creditID$ID, ]
# Method 2 (subset() function)
database <- subset(database, ID %in% creditID$ID)

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks

This loops by column, setting dt1 by reference and (hopefully) should be quick.
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)
for (i in 2:ncol(dt2)) {
nna = !is.na(dt2[[i]])
set(dt1,w[nna],i,dt2[[i]][nna])
}
dt1 = rbind(dt1,dt2[is.na(w)])
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2

This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(levels(df1$K), levels(df2$K)) #all cells of key column
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
FUN = decider)) #repalce all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading not all columns have the same name but I am assuming the same order. this may be a more helpful approach:
all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
colnames(LS[[i]]) <- colnames(LS[[2]])
return(LS[[i]])
}
)
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5

EDIT : Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
k = K[i]
p = unlist(dt2[i,-1,with=FALSE])
p = p[!is.na(p)]
dt1[J(k),names(p):=as.list(p),with=FALSE]
}
or, can you use matrix instead of data.frame? If so it could be a single line using A[B] syntax where B is a 2-column matrix containing the row and column numbers to update.

The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table
dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")
dt1s[dt2s,value:=value.new] # Update new values
dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names

Creating a function to replace NAs from one data.frame with values from another

I regularly have situations where I need to replace missing values from a data.frame with values from some other data.frame that is at a different level of aggregation. So, for example, if I have a data.frame full of county data I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times I decided to break down and write a function to do this.
Here's what I cooked up, along with an example of how I use it:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
mergedDf <- merge(naDf, fillDf, by=mergeCols)
for (col in fillCols){
colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
k <- which( is.na( colWithNas ) )
colWithNas[k] <- colWithOutNas[k]
mergedDf[col] <- colWithNas
mergedDf[[paste(col, "x", sep=".")]] <- NULL
mergedDf[[paste(col, "y", sep=".")]] <- NULL
}
return(mergedDf)
}
## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4) ,f = c(100,200, 300, 400), g = c(11, 12, 13, 14))
naDf <- data.frame( a = sample(c(1,2), 100, rep=TRUE), b = sample(c(3,4), 100, rep=TRUE), f = sample(c(0,NA), 100, rep=TRUE), g = sample(c(0,NA), 200, rep=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )
So after I got this running I had this odd feeling that someone has probably solved this problem before me and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way that eliminates the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes the columns we're filling from are named the same and the columns we are filling to and the same applies to the merge.
Any guidance or refactoring would be helpful.
EDIT on Dec 2 I realized I had logic flaws in my example which I fixed.

What a great question.
Here's a data.table solution:
# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))
# Merge data.tables, based on their keys (columns a & b)
outDT <- naDT[fillDT]
# a b f g f.1 g.1
# [1,] 1 3 NA 0 100 11
# [2,] 1 3 NA NA 100 11
# [3,] 1 3 NA 0 100 11
# [4,] 1 3 0 0 100 11
# [5,] 1 3 0 NA 100 11
# First 5 rows of 200 printed.
# In outDT[i, j], on the following two lines
# -- i is a Boolean vector indicating which rows will be operated on
# -- j is an expression saying "(sub)assign from right column (e.g. f.1) to
# left column (e.g. f)
outDT[is.na(f), f:=f.1]
outDT[is.na(g), g:=g.1]
# Just keep the four columns ultimately needed
outDT <- outDT[,list(a,b,g,f)]
# a b g f
# [1,] 1 3 0 0
# [2,] 1 3 11 0
# [3,] 1 3 0 0
# [4,] 1 3 11 0
# [5,] 1 3 11 0
# First 5 rows of 200 printed.

Here's a slightly more concise/robust version of your approach. You could replace the for-loop with a call to lapply, but I find the loop easier to read.
This function assumes any columns not in mergeCols are fair game to have their NAs filled. I'm not really sure this helps, but I'll take my chances with the voters.
fillNaDf.ju <- function(naDf, fillDf, mergeCols) {
mergedDf <- merge(fillDf, naDf, by=mergeCols, suffixes=c(".fill",""))
dataCols <- setdiff(names(naDf),mergeCols)
# loop over all columns we didn't merge by
for(col in dataCols) {
rows <- is.na(mergedDf[,col])
# skip this column if it doesn't contain any NAs
if(!any(rows)) next
rows <- which(rows)
# replace NAs with values from fillDf
mergedDf[rows,col] <- mergedDf[rows,paste(col,"fill",sep=".")]
}
# don't return ".fill" columns
mergedDf[,names(naDf)]
}

My preference would be to pull out the code from merge that does the matching and do it myself so that I could keep the ordering of the original data frame intact, both row-wise and column-wise. I also use matrix indexing to avoid any loops, though to do so I create a new data frame with the revised fillCols and replace the columns of the original with it; I thought I could fill it in directly but apparently you can't use matrix ordering to replace parts of a data.frame, so I wouldn't be surprised if a loop over the names would be faster in some situations.
With matrix indexing:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
na.ind <- is.na(naDf[,fillCols])
fill.ind <- cbind(match(naB, fillB)[row(na.ind)[na.ind]], col(na.ind)[na.ind])
naX <- naDf[,fillCols]
fillX <- fillDf[,fillCols]
naX[na.ind] <- fillX[fill.ind]
naDf[,colnames(naX)] <- naX
naDf
}
With a loop:
fillNaDf2 <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
m <- match(naB, fillB)
for(col in fillCols) {
fix <- which(is.na(naDf[,col]))
naDf[fix, col] <- fillDf[m[fix],col]
}
naDf
}

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex