unable to fill a data frame using column names and row names - r

I have data frame "x" like this :
meme webId timeStamp
2501 68814 281322.1
2501 2679 305813.0
2501 948 306025.6
I want to use "meme" and "webId" as the row and column names, and timeStamp as the elements of a "mat" data frame. I wrote this:
cols<-unique(x[,"webId"])
rows<-unique(x[,"meme"])
mat<-data.frame(matrix(data=9999999,nrow=length(rows),ncol=length(cols)))
colnames(mat)<-c(cols)
rownames(mat)<-c(rows)
for(i in 1:length(x))
mat[rownames(mat)==x[i,"meme"],colnames(mat)==x[i,"webId"]]<-x[i,"timeStamp"]
But nothing changed. What is the problem?
Please help me!

In the for loop, it seems that you mean to iterate over all the rows in x and fill the values into mat one by one. Instead, you only iterate over the first 3 rows, because length(x) on a data frame gives the number of columns, not the number of rows. This is the correct code for iterating over all rows:
for(i in 1:nrow(x))
mat[rownames(mat)==x[i,"meme"],colnames(mat)==x[i,"webId"]]<-x[i,"timeStamp"]
I suspect that the x data frame contains more rows than the ones you posted. In your example, the number of rows equals the number of columns, which is why the commenters couldn't find a problem with it. The problem is not evident in your example.
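To make the length() versus nrow() difference concrete, here is a minimal sketch on a made-up data frame (the name toy and its values are assumptions, not the poster's data). It also uses seq_len(), which is a bit safer than 1:nrow(x) when a data frame has zero rows.

```r
# Hypothetical toy data: 5 rows, 3 columns
toy <- data.frame(meme      = c(1, 1, 2, 2, 3),
                  webId     = c(10, 20, 10, 20, 10),
                  timeStamp = c(0.5, 1.5, 2.5, 3.5, 4.5))

length(toy)  # number of columns: 3
nrow(toy)    # number of rows: 5

# Record which row indices a row-wise loop actually visits
visited <- integer(0)
for (i in seq_len(nrow(toy))) visited <- c(visited, i)
visited      # 1 2 3 4 5
```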

You could get the 'row/column' index by using match, cbind it and assign the 'timeStamp' elements to the positions specified by the index in 'mat'.
mat[cbind(match(x$meme, rownames(mat)),
match(x$webId, colnames(mat)))] <- x$timeStamp
mat
# 428 2679 68814 948
#2505 13 11 8 3
#2510 16 6 14 1
#2501 7 4 5 10
#2508 12 2 9 15
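If the two-column index trick is unfamiliar, here is a small self-contained sketch of it. The objects m and long and their values are made up for illustration; the only technique shown is indexing a matrix with a two-column (row, column) matrix built from match().

```r
# Hypothetical 2 x 3 target matrix with named rows and columns
m <- matrix(NA_real_, nrow = 2, ncol = 3,
            dimnames = list(c("a", "b"), c("x", "y", "z")))

# Hypothetical long-format data: one (row name, column name, value) per line
long <- data.frame(r = c("a", "b", "b"),
                   c = c("z", "x", "y"),
                   v = c(1.5, 2.5, 3.5))

# Each row of idx is one (row position, column position) pair
idx <- cbind(match(long$r, rownames(m)),
             match(long$c, colnames(m)))

# A single vectorised assignment fills all addressed cells at once
m[idx] <- long$v
m
```

Cells that never appear in the long data stay NA, so it is easy to spot combinations that were missing from the input.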
Checking with the results from the for loop
for(i in 1:nrow(x))
mat1[rownames(mat1)==x[i,"meme"],
colnames(mat1)==x[i,"webId"]]<-x[i,"timeStamp"]
mat1
# 428 2679 68814 948
#2505 13 11 8 3
#2510 16 6 14 1
#2501 7 4 5 10
#2508 12 2 9 15
Benchmarks
set.seed(21)
x1 <- data.frame(meme= rep(sample(1000), each=200),
webId= rep(sample(35000, 200, replace=FALSE), 1000),
timeStamp=rnorm(1000*200))
set.seed(324)
mat2 <- matrix(, 1000, 200,
dimnames=list(sample(unique(x1$meme)),sample(unique(x1$webId))))
mat3 <- mat2
system.time({
mat2[cbind(match(x1$meme, rownames(mat2)),
match(x1$webId, colnames(mat2)))] <- x1$timeStamp
})
# user system elapsed
# 0.181 0.001 0.181
system.time({
for(i in 1:nrow(x1))
mat3[rownames(mat3)==x1[i,"meme"],
colnames(mat3)==x1[i,"webId"]]<-x1[i,"timeStamp"]
})
# user system elapsed
#172.588 10.445 183.062
identical(mat2, mat3)
#[1] TRUE
data
set.seed(24)
x <- data.frame(meme=rep(c(2501, 2505, 2508, 2510), each=4),
webId= rep(c(68814, 2679, 948, 428), 4), timeStamp= sample(16))
set.seed(33)
mat <- matrix(, 4, 4, dimnames=list(sample(unique(x$meme)),
sample(unique(x$webId))))
mat1 <- mat

Related

Copy a subset of a column, based on conditions, to another dataframe in R

I have very limited R skills, and after hours searching for a solution I could not see an option that would work.
I have several large data tables. From each one, I would like to copy part of a column into a data frame, to populate a column there.
My data tables (tabn1, tabn2, tabn3) all have the same format, but with different lengths. Each subset will have a different number of rows. I would want empty spaces to be filled with NA. I can't even copy the first column, so the subsequent are the next problem!
Ro Co Red Green Yellow
1 3 123 999 265
1 3 223 875 5877
1 4 21488 555 478
1 4 558 23698 5558
2 3 558 559 148
2 3 4579 557 59
2 4 1489 545 2369
2 4 123 999 265
3 3 558 559 148
3 3 558 23698 5558
3 4 4579 557 59
3 4 1478 4579 557
4 3 1488 555 478
4 3 1478 2945 5889
4 4 448 259 4548
4 4 26576 158 15
My new data frame col names:
cls <- c("n1","n2","n3")
I created a dataframe with the column names:
df <- setNames(data.frame(matrix(ncol=3)),cls)
For each of my tables, I want to subset Ro > = 3, Co = 3, column "Red" only
I have tried:
sub1 <- filter(tabn1, Ro >= 3 | Co == 3)
df$n1 <- sub1$Red
> Error in `$<-.data.frame`(`*tmp*`, n1, value = c(183.94, 180.884, :
replacement has 32292 rows, data has 1
Also:
df$n1 <- cut(sub1$Red)
> Error in cut.default(sub1$Red) :
argument "breaks" is missing, with no default
I tried using df as a datatable instead of dataframe, but also got the following errors:
df <- setNames(data.table(matrix(ncol=3)),cls)
df$n1 <- sub1$Red
> Error in set(x, j = name, value = value) :
Supplied 32292 items to be assigned to 1 items of column 'nn1'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
I would subsequently try to subset and copy from tabn2 to df$n2, and so forth. As indicated above, the original tables have different lengths.
Thanks in advance!
The issue is that the number of rows in 'df' and 'sub1' are different. 'df' is created with 1 row. Instead, we can create the 'df' directly from the 'sub1' itself
df <- sub1['Red']
names(df) <- cls[1]
Also, another way to create the data.frame, would be to specify the nrow as well
df <- as.data.frame(matrix(nrow = nrow(sub1), ncol = length(cls),
dimnames = list(NULL, cls)))
Regarding the second error with cut, it needs breaks. Either we specify the number of breaks
cut(sub1$Red, breaks = 3)
Or a vector of break points
cut(sub1$Red, breaks = c(-Inf, 100, 500, 1000, Inf))
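As a quick illustration of the two ways of supplying breaks, here is a sketch on a made-up vector (red and its values are assumptions standing in for sub1$Red).

```r
# Hypothetical values standing in for sub1$Red
red <- c(50, 200, 700, 2000)

# Explicit break points: intervals are (-Inf,100], (100,500], (500,1000], (1000,Inf]
by_points <- cut(red, breaks = c(-Inf, 100, 500, 1000, Inf))
as.integer(by_points)  # interval number each value falls into: 1 2 3 4

# A single number asks cut() for that many equal-width intervals
by_count <- cut(red, breaks = 3)
nlevels(by_count)      # 3
```

Note that cut() returns a factor of interval labels, not the original numbers, so it is probably not what you want for simply copying a column.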
If there are many 'tabn' objects, get them into a list, loop over the list with lapply
lst1 <- mget(ls(pattern = '^tabn\\d+$'))
out_lst <- lapply(lst1, function(x) subset(x, Ro >=3 | Co == 3)$Red)
It is possible that after subsetting and selecting the 'Red' column, the number of elements differs between tables. If the lengths are different, an option is to pad the shorter vectors with NA at the end before cbinding them
mx <- max(lengths(out_lst))
df <- do.call(cbind, lapply(out_lst, `length<-`, mx))
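Here is a small sketch of the padding step on made-up vectors (the values in out_lst are assumptions). Since do.call(cbind, ...) on plain vectors returns a matrix, the sketch wraps it in as.data.frame() to end up with a data frame.

```r
# Hypothetical per-table results of different lengths
out_lst <- list(n1 = c(558, 558, 1488),
                n2 = c(123, 223),
                n3 = 4579)

# Length of the longest vector
mx <- max(lengths(out_lst))

# `length<-` extends each vector to mx elements, filling with NA
padded <- lapply(out_lst, `length<-`, mx)

# Bind the equal-length vectors as columns and convert to a data frame
df <- as.data.frame(do.call(cbind, padded))
df
```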

Correct way of vectorizing "lookup" function

I am looking for a fast and efficient way to compute the problem described below. Any help would be appreciated, thanks in advance!
I have a couple of very large csv files that have different information about the same object, but in my final calculation I need all of the attributes in the different table. I am trying to calculate the load of a large number of electrical substations, first I have a list of unique electrical substations;
Unique_Substations <- data.frame(Name = c("SubA", "SubB", "SubC", "SubD"))
In another list I have information about the customers behind these substations;
Customer_Information <- data.frame(
Customer = 1001:1010,
SubSt_Nm = sample(unique(Unique_Substations$Name), 10, replace = TRUE),
HouseHoldType = sample(1:2, 10, replace = TRUE)
)
And in another list I have information about the, let's say, solar panels on these customers roofs (for different years);
Solar_Panels <- data.frame(
Customer = sample(1001:1010, 10, replace = TRUE),
SolarPanelYear1 = sample(10:20, 10, replace = TRUE),
SolarPanelYear2 = sample(15:20, 10, replace = TRUE)
)
Now I want to see what the load is for each substation for each year. I have a household load and a solar panel load normalised for each type of household or solar panel;
SolarLoad <- data.frame(Load = c(0, -10, -10, 5))
HouseHoldLoad <- data.frame(Type1 = c(1, 3, 5, 2), Type2 = c(3, 5, 6, 1))
So now I have to match up these lists;
ML_SubSt_Cust <- sapply(Unique_Substations$Name,
function(x) which(Customer_Information$SubSt_Nm %in% x == TRUE))
ML_Cust_SolarP <- sapply(Customer_Information$Customer,
function(x) which(Solar_Panels$Customer %in% x == TRUE))
(Here I use the which(xxx %in% x == TRUE) method because I need multiple matches and match() only returns one match.)
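For reference, a minimal sketch of the difference on a made-up vector (subst is an assumption, not the real data). It also shows split(), which builds the index list for every group in one pass instead of one which() call per name; the == TRUE in the code above is redundant, since %in% already returns a logical vector.

```r
# Hypothetical substation name per customer
subst <- c("SubA", "SubB", "SubA", "SubC", "SubA")

match("SubA", subst)       # first match only: 1
which(subst %in% "SubA")   # all matches: 1 3 5

# All groups at once: a named list of row indices per substation
idx_by_subst <- split(seq_along(subst), subst)
idx_by_subst$SubA          # 1 3 5
```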
And now we come to my big question (but probably not my only problem with this method) at last. I want to calculate the maximum load on each substation for each year. To this end I had first written a for loop that looped through the Unique_Substations list, which is of course highly inefficient. After that I tried to speed it up using outer() but I don't think I have properly vectorized my function. My maximum function looks as follows (I only wrote it out for the solar panel part to keep it simple);
GetMax <- function(i, Yr) {
max(sum(Solar_Panels[unlist(ML_Cust_SolarP[ML_SubSt_Cust[[i]]], use.names= FALSE),Yr])*SolarLoad)
}
I'm sure this is not efficient at all but I have no clue how to do it in any other way.
To get my final results I use an outer function;
Results <- outer(1:nrow(Unique_Substations), 1:2, Vectorize(GetMax))
In my example all of these data frames are much much larger (40000 rows each or so), so I really need some good optimization of the functions involved. I tried to think of ways to vectorize the function but I couldn't work it out. Any help would be appreciated.
EDIT:
Now that I fully understand the accepted answer I have another problem. My actual Customer_Information is 188k rows long and my actual HouseHoldLoad is 53k rows long. Needless to say this does not merge() very well. Is there another solution to this problem that does not require merge() or for loops that are too slow?
First: set.seed() when generating random data! I did set.seed(1000) before your code for these results.
I think a bit of merge-ing and dplyr can help here. First, we get the data into a better shape:
library(dplyr)
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load") %>%
select(HouseHoldType=variable, Load) %>%
mutate(HouseHoldType=gsub("Type", "", HouseHoldType))
Solar_Panels <- melt(Solar_Panels, id.vars="Customer",
value.name="SPYearVal") %>%
select(Customer, SolarPanelYear=variable, SPYearVal) %>%
mutate(SolarPanelYear=gsub("SolarPanelYear", "", SolarPanelYear))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
That gives us:
## Customer SubSt_Nm HouseHoldType SolarPanelYear SPYearVal
## 1 1001 SubB 1 1 16
## 2 1001 SubB 1 2 18
## 3 1001 SubB 1 2 16
## 4 1001 SubB 1 1 20
## 5 1002 SubD 2 1 16
## 6 1002 SubD 2 1 13
## 7 1002 SubD 2 2 20
## 8 1002 SubD 2 2 18
## 9 1003 SubA 1 2 15
## 10 1003 SubA 1 1 16
## 11 1005 SubC 2 2 19
## 12 1005 SubC 2 1 10
## 13 1006 SubA 1 1 15
## 14 1006 SubA 1 2 19
## 15 1007 SubC 1 1 17
## 16 1007 SubC 1 2 19
## 17 1009 SubA 1 1 10
## 18 1009 SubA 1 1 18
## 19 1009 SubA 1 2 18
## 20 1009 SubA 1 2 18
Now we just group and summarize:
dat %>% group_by(SubSt_Nm, SolarPanelYear) %>%
summarise(mx=max(sum(SPYearVal)*SolarLoad))
## SubSt_Nm SolarPanelYear mx
## 1 SubA 1 295
## 2 SubA 2 350
## 3 SubB 1 180
## 4 SubB 2 170
## 5 SubC 1 135
## 6 SubC 2 190
## 7 SubD 1 145
## 8 SubD 2 190
If you use data.table vs data frames, it should be pretty speedy even with 40K entries.
UPDATE For those who cannot install dplyr, this just uses reshape2 (hopefully that is installable)
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load")
colnames(HouseHoldLoad) <- c("HouseHoldType", "Load")
HouseHoldLoad$HouseHoldType <- gsub("Type", "", HouseHoldLoad$HouseHoldType)
Solar_Panels <- melt(Solar_Panels, id.vars="Customer", value.name="SPYearVal")
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("SolarPanelYear", "", Solar_Panels$SolarPanelYear)
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190
If you really can't install even reshape2, then this works with just the base stats package:
colnames(HouseHoldLoad) <- c("Load.1", "Load.2")
HouseHoldLoad <- reshape(HouseHoldLoad, varying=c("Load.1", "Load.2"), direction="long", timevar="HouseHoldType")[1:2]
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear.1", "SolarPanelYear.2")
Solar_Panels <- reshape(Solar_Panels, varying=c("SolarPanelYear.1", "SolarPanelYear.2"), direction="long", timevar="SolarPanelYear")[1:2]
colnames(Solar_Panels) <- c("Customer", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("^[0-9]+\\.", "", rownames(Solar_Panels))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190
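Regarding the follow-up about avoiding merge(): when the lookup table has one row per key (as Customer_Information does per customer), a vectorised match() can stand in for the join. This is only a sketch on made-up data; cust, panels and their columns are hypothetical names, and it only covers the "attach substation, then sum per substation" part.

```r
# Hypothetical lookup table: exactly one row per customer
cust <- data.frame(Customer = 1:4,
                   SubSt    = c("A", "A", "B", "B"))

# Hypothetical panel data: customers may repeat or be absent
panels <- data.frame(Customer = c(1, 2, 2, 4),
                     Year1    = c(10, 20, 30, 40))

# Vectorised lookup: position of each panel's customer in the lookup table
panels$SubSt <- cust$SubSt[match(panels$Customer, cust$Customer)]

# Sum per substation without any explicit loop
totals <- tapply(panels$Year1, panels$SubSt, sum)
totals
```

Unlike merge(), this never multiplies rows, so it only applies when the key is unique on the lookup side.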

Function in R to determine complete cases

I have the following code to print the complete cases:
complete <- function(directory, id=1:332) {
data<-NULL
dat <- NULL
s <- NULL
for (i in 1:length(id)) {
data[[i]]<- c(paste(directory, "/",formatC(id[i], width=3, flag=0),".csv",sep=""))
df[[i]]<-c(read.csv(data[[i]]))
s[i] <- sum(complete.cases(df[[i]]))
dat <- data.frame(cbind(id,nobs=s[i]))
}
dat
}
The output that I get is as follows:
complete("specdata", c(2, 4, 8, 10, 12))
id nobs
1 2 96
2 4 96
3 8 96
4 10 96
5 12 96
The required output looks like this:
complete("specdata", c(2, 4, 8, 10, 12))
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
## 4 10 148
## 5 12 96
The .csv looks like this:
head(file)
Date sulfate nitrate ID
1 2003-01-01 NA NA 1
2 2003-01-02 NA NA 1
3 2003-01-03 NA NA 1
4 2003-01-04 NA NA 1
5 2003-01-05 NA NA 1
6 2003-01-06 NA NA 1
As is evident from the two outputs, the nobs value for every id is just the value for id==12, replicated. I'm unable to figure out a way to output the nobs corresponding to each id. Let's ignore the ## in each line of the required output. Thanks in advance.
I tried to clean your code:
complete <- function(directory, id) {
s <- vector()
for (i in 1:length(id)) {
path <- c(paste(directory, "/",formatC(id[i], width=3, flag=0),".csv",sep=""))
data <- c(read.csv(path))
s[i] <- sum(complete.cases(data))
}
dat <- data.frame(cbind(id,nobs=s))
return(dat)
}
If this does not work, you probably might want to check your functions formatC and complete.cases.
EDIT:
There were several redundancies in your code as well as two logical errors.
First, you don't need to initialize objects in R in order to give them a value. I deleted these two
data<-NULL
dat <- NULL
and changed the third into an empty vector. Second, you can see that I removed the indices of your dat and data objects and gave them a more expressive name. As these two objects are newly created in every iteration of the for-loop, it makes no sense to give them an index. Finally, you misplaced the closing bracket (as mentioned above) and created your dat$nobs from only one element of s (namely the last one):
dat <- data.frame(cbind(id,nobs=s[i]))
Fixing this into
dat <- data.frame(cbind(id,nobs=s))
did the trick.
Please consider reading a good beginners book on (R-)programming to gain a better understanding of control structures.
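For comparison, here is a loop-free sketch of the same idea using vapply(). The function name complete_sketch and the demo directory and files are made up so the example can run on its own; it assumes the files are named 001.csv, 002.csv, and so on, as in the question.

```r
# Hypothetical variant: one complete-case count per id, no explicit loop
complete_sketch <- function(directory, id) {
  # Zero-padded file names such as "002.csv"
  paths <- file.path(directory, sprintf("%03d.csv", id))
  nobs <- vapply(paths,
                 function(p) sum(complete.cases(read.csv(p))),
                 integer(1), USE.NAMES = FALSE)
  data.frame(id = id, nobs = nobs)
}

# Build two tiny demo files in a temporary directory
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(2, 5, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(1, 2), nitrate = c(3, 4)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

res <- complete_sketch(demo_dir, 1:2)
res
```

Because every id gets its own element of nobs, the "last value repeated for all ids" problem cannot occur here.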

How to find the difference between 2 dataframes?

I have 2 dataframes which are "exactly" the same. The difference between them is that one has 676 observations (rows) and the second has 666 observations. I don't know which of those rows are missing from the second dataframe.
It would be easiest for me if someone could show me the code to make a third dataframe with those 10 missing rows.
The name of dataframes:
- dataset1 (676)
- dataset2 (666)
Thx.
dataset1[tail(!duplicated(rbind(dataset2, dataset1)), nrow(dataset1)), ]
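To see why that one-liner works, here is a sketch on made-up data (the contents of dataset1 and dataset2 below are assumptions). Stacking dataset2 on top means every dataset1 row that also exists in dataset2 is flagged as a duplicate, so only the missing rows survive. It assumes dataset1 itself contains no repeated rows.

```r
# Hypothetical data: dataset2 is dataset1 with rows 2 and 4 removed
dataset1 <- data.frame(id = 1:5, v = c("a", "b", "c", "d", "e"))
dataset2 <- dataset1[-c(2, 4), ]

# dataset2 first, then dataset1
stacked <- rbind(dataset2, dataset1)

# The last nrow(dataset1) flags belong to dataset1's rows;
# TRUE means that row was not already seen in dataset2
keep <- tail(!duplicated(stacked), nrow(dataset1))

missing_rows <- dataset1[keep, ]
missing_rows
```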
Here's an approach:
library(qdap)
## generate random problem
prob <- sample(1:nrow(mtcars), 1)
## remove the random problem row
mtcars2 <- mtcars[-prob, ]
## Throw it into a list of 2 dataframes so they're easier to work with
dat <- list(mtcars, mtcars2)
## Use qdap's `paste2` function to paste all columns together
dat2 <- lapply(dat, paste2)
## Find the shorter data set
wmn <- which.min(sapply(dat2, length))
## Add additional element to shorter one
dat2[[wmn]] <- c(dat2[[wmn]], NA)
## check each element of the 2 pasted data sets for equality
out <- mapply(identical, dat2[[1]], dat2[[2]])
## Which row's the problem
which(!out)[1]
which(!out)[1] == prob
If which(!out)[1] is NA, the problem is in the last row.
When you start seeing FALSE, that's where the problem is located.
EDIT: removed the for loop
I would say try to use merge and then look for where the merge result has NA values.
Here's an example using dummy data:
set.seed(1)
df1 <- data.frame(x=rnorm(100),y=rnorm(100))
df2 <- df1[-sample(1:100,10),]
dim(df1)
# [1] 100 2
dim(df2)
# [1] 90 2
out <- merge(df1,df2,by='x',all.x=TRUE)
in1not2 <- which(is.na(out$y.y))
in1not2
# [1] 6 25 33 51 52 53 57 73 77 82
Then you can extract:
> df1[in1not2,]
x y
6 -0.8204684 1.76728727
25 0.6198257 -0.10019074
33 0.3876716 0.53149619
51 0.3981059 0.45018710
52 -0.6120264 -0.01855983
53 0.3411197 -0.31806837
57 -0.3672215 1.00002880
73 0.6107264 0.45699881
77 -0.4432919 0.78763961
82 -0.1351786 0.98389557
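One thing worth double-checking with the merge approach: merge() sorts its result by the key column by default, so row positions in the merged result need not be row positions in the original data frame. A sketch that carries an explicit row id through the merge avoids relying on positions; the data and the row_id column below are made up for illustration.

```r
# Hypothetical data, deliberately not sorted by x
df1 <- data.frame(x = c(3, 1, 2, 5, 4), y = c(30, 10, 20, 50, 40))
df2 <- df1[-c(1, 4), ]          # drop rows 1 and 4

# Remember each row's original position before merging
df1$row_id <- seq_len(nrow(df1))

out <- merge(df1, df2, by = "x", all.x = TRUE)

# Rows with no partner in df2 have NA in y.y; map back via row_id
missing_ids <- sort(out$row_id[is.na(out$y.y)])
df1[missing_ids, c("x", "y")]
```

In this toy case the NA values sit at positions 3 and 5 of the merged result, while the rows that are actually missing are rows 1 and 4 of df1.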

finding unique vector elements in a list efficiently

I have a list of numerical vectors, and I need to create a list containing only one copy of each vector. There isn't a list method for the identical function, so I wrote a function to apply to check every vector against every other.
F1 <- function(x){
to_remove <- c()
for(i in 1:length(x)){
for(j in 1:length(x)){
if(i < j && identical(x[[i]], x[[j]])) to_remove <- c(to_remove,j) # flag only the later copy
}
}
if(is.null(to_remove)) x else x[-c(to_remove)]
}
The problem is that this function becomes very slow as the size of the input list x increases, partly due to the assignment of two large vectors by the for loops. I'm hoping for a method that will run in under one minute for a list of length 1.5 million with vectors of length 15, but that might be optimistic.
Does anyone know a more efficient way of comparing each vector in a list with every other vector? The vectors themselves are guaranteed to be equal in length.
Sample output is shown below.
x = list(1:4, 1:4, 2:5, 3:6)
F1(x)
> list(1:4, 2:5, 3:6)
As per @JoshuaUlrich and @thelatemail, ll[!duplicated(ll)] works just fine.
And thus, so should unique(ll)
I previously suggested a method using sapply with the idea of not checking every element in the list (I deleted that answer, as I think using unique makes more sense)
Since efficiency is a goal, we should benchmark these.
# Let's create some sample data
xx <- lapply(rep(100,15), sample)
ll <- as.list(sample(xx, 1000, T))
ll
Putting it up against some benchmarks
library(digest) # provides digest(), which fun2 relies on
fun1 <- function(ll) {
ll[c(TRUE, !sapply(2:length(ll), function(i) ll[i] %in% ll[1:(i-1)]))]
}
fun2 <- function(ll) {
ll[!duplicated(sapply(ll, digest))]
}
fun3 <- function(ll) {
ll[!duplicated(ll)]
}
fun4 <- function(ll) {
unique(ll)
}
#Make sure all the same
all(identical(fun1(ll), fun2(ll)), identical(fun2(ll), fun3(ll)),
identical(fun3(ll), fun4(ll)), identical(fun4(ll), fun1(ll)))
# [1] TRUE
library(rbenchmark)
benchmark(digest=fun2(ll), duplicated=fun3(ll), unique=fun4(ll), replications=100, order="relative")[, c(1, 3:6)]
test elapsed relative user.self sys.self
3 unique 0.048 1.000 0.049 0.000
2 duplicated 0.050 1.042 0.050 0.000
1 digest 8.427 175.563 8.415 0.038
# I took out fun1, since it ran extremely slowly when ll is large
Fastest Option:
unique(ll)
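As a quick check against the sample in the question, here is a minimal sketch showing unique() and duplicated() applied directly to a list; both are base R, so no extra package is needed.

```r
# The sample input from the question
x <- list(1:4, 1:4, 2:5, 3:6)

# Keep the first copy of each distinct vector
u <- unique(x)

# duplicated() marks later copies; negating it selects the same elements
d <- x[!duplicated(x)]

length(u)          # 3
identical(u, d)    # TRUE
```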
You could hash each of the vectors and then use !duplicated() to identify unique elements of the resultant character vector:
library(digest)
## Some example data
x <- 1:44
y <- 2:10
z <- rnorm(10)
ll <- list(x,y,x,x,x,z,y)
ll[!duplicated(sapply(ll, digest))]
# [[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 1.24573610 -0.48894189 -0.18799758 -1.30696395 -0.05052373 0.94088670
# [7] -0.20254574 -1.08275938 -0.32937153 0.49454570
To see at a glance why this works, here's what the hashes look like:
sapply(ll, digest)
[1] "efe1bc7b6eca82ad78ac732d6f1507e7" "fd61b0fff79f76586ad840c9c0f497d1"
[3] "efe1bc7b6eca82ad78ac732d6f1507e7" "efe1bc7b6eca82ad78ac732d6f1507e7"
[5] "efe1bc7b6eca82ad78ac732d6f1507e7" "592e2e533582b2bbaf0bb460e558d0a5"
[7] "fd61b0fff79f76586ad840c9c0f497d1"

Resources