R: obtain the column value pointed to by another table (faster)

I have two data frames: one with all the data, A, and a smaller one, B, that contains a unique identifier of A together with column names of A. I am trying to add a column to A based on what B points to. In other words, for each row of A I need the value from the column of A that B names for that row's identifier.
For example:
A<-airquality
B<-data.frame(Month=unique(A$Month),col=c("Ozone","Solar.R", "Wind", "Wind","Solar.R"))
This would give me the following
> head(A)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> B
Month col
1 5 Ozone
2 6 Solar.R
3 7 Wind
4 8 Wind
5 9 Solar.R
The result should be something like
> head(A)
Ozone Solar.R Wind Temp Month Day ADDED
1 41 190 7.4 67 5 1 41
2 36 118 8.0 72 5 2 36
3 12 149 12.6 74 5 3 12
4 18 313 11.5 62 5 4 18
5 NA NA 14.3 56 5 5 NA
6 28 NA 14.9 66 5 6 28
> tail(A)
Ozone Solar.R Wind Temp Month Day ADDED
148 14 20 16.6 63 9 25 20
149 30 193 6.9 70 9 26 193
150 NA 145 13.2 77 9 27 145
151 14 191 14.3 75 9 28 191
152 18 131 8.0 76 9 29 131
153 20 223 11.5 68 9 30 223
The only way I can do it is
for(i in 1:nrow(B))
{
j<-A$Month==B$Month[i]
k<-subset(A, select=B$col[i])[j,]
A$ADDED[j]<-k
}
While this works, it becomes extremely slow because I have a big dataset. I feel like I am doing it the dumb way. What is a better way of doing this?
Thanks

You could do this with the sapply or lapply-functions.
ADDED <- sapply(1:nrow(B), function(i){
A[A$Month==B$Month[i], (B$col[i])]
})
A$ADDED <- unlist(ADDED)
If B covers only some of the months in A (partial matching), first initialise the ADDED column for all rows, in this case with NA, and then assign the values only to the rows whose Month appears in B.
A$ADDED = NA
A[A$Month %in% B$Month,]$ADDED <- unlist(ADDED)
That already takes only about a third of the time of the for loop.
appl <- function(){
ADDED <- sapply(1:nrow(B), function(i){
A[A$Month==B$Month[i], (B$col[i])]
})
A$ADDED1 <- unlist(ADDED)
}
lappl <- function(){
ADDED <- lapply(1:nrow(B), function(i){
A[A$Month==B$Month[i], (B$col[i])]
})
A$ADDED1 <- unlist(ADDED)
}
forlo <- function(){
for(i in 1:nrow(B)) {
j<-A$Month==B$Month[i]
k<-subset(A, select=B$col[i])[j,]
A$ADDED[j]<-k
}
}
library(microbenchmark)
mc <- microbenchmark(times = 1000,
sapply = appl(),
lapply = lappl(),
forloop = forlo()
)
mc
Unit: microseconds
expr min lq mean median uq max neval cld
sapply 337.478 359.2125 378.6964 369.7775 385.474 2324.913 1000 a
lapply 319.367 340.7990 366.8448 349.2510 362.532 9051.828 1000 a
forloop 964.136 1013.6415 1074.5584 1032.5070 1059.825 5116.802 1000 b
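If iterating over the rows of B is still too slow, the lookup can probably be done without any loop by using matrix indexing. The following is only a sketch under two assumptions that are not stated in the question: every name in B$col matches a column name of A exactly (including case), and all looked-up columns are numeric, so that as.matrix(A) is a numeric matrix. The helper names row_col and col_idx are made up for illustration.

```r
A <- airquality
B <- data.frame(Month = unique(A$Month),
                col = c("Ozone", "Solar.R", "Wind", "Wind", "Solar.R"),
                stringsAsFactors = FALSE)

# Column name that applies to each row of A (NA if the month is not in B)
row_col <- B$col[match(A$Month, B$Month)]

# Position of that column in A
col_idx <- match(row_col, names(A))

# Pick one (row, column) cell per row with a two-column index matrix
A$ADDED <- as.matrix(A)[cbind(seq_len(nrow(A)), col_idx)]

head(A)
tail(A)
```

Unlike unlist(ADDED), this does not depend on A being sorted by Month, and months that are missing from B simply get NA.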

Related

How to transfer a column from a dataset sharing the same one with another one

I have two versions of datasets sharing the same columns (more or less). Let's take as an example
db = airquality
db1 = airquality[,-c(6)]
db1$Ozone[db1$Ozone < 30] <- 24
db1$Month[db1$Month == 5] <- 24
db
db1
I would like to transfer the two columns 'Ozone' and 'Wind' from the dataset 'db1' to the dataset 'db', using the pipe operator %>% or another method. What code would you suggest?
Thanks
You can do:
library(dplyr)
db1 %>%
select(Ozone, Wind) %>%
bind_cols(db)
Note that in this example, since some column names will be duplicated in the final result, dplyr will automatically rename the duplicates by appending numbers to the end of the column names.
Base R:
cbind(db, db1[,c(1,3)])
Ozone Solar.R Wind Temp Month Day Ozone Wind
1 41 190 7.4 67 5 1 41 7.4
2 36 118 8.0 72 5 2 36 8.0
3 12 149 12.6 74 5 3 24 12.6
4 18 313 11.5 62 5 4 24 11.5
5 NA NA 14.3 56 5 5 NA 14.3
6 28 NA 14.9 66 5 6 24 14.9
7 23 299 8.6 65 5 7 24 8.6
8 19 99 13.8 59 5 8 24 13.8
9 8 19 20.1 61 5 9 24 20.1
10 NA 194 8.6 69 5 10 NA 8.6
11 7 NA 6.9 74 5 11 24 6.9
12 16 256 9.7 69 5 12 24 9.7
.
.
.
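If "transfer" means that the db1 versions of Ozone and Wind should replace the columns of the same name in db, rather than being appended next to them, plain column assignment may be enough. This is a sketch under that assumption; it also assumes both data frames have the same rows in the same order, since the values are matched by position only.

```r
db <- airquality
db1 <- airquality[, -c(6)]
db1$Ozone[db1$Ozone < 30] <- 24
db1$Month[db1$Month == 5] <- 24

# Overwrite the two columns in db with the versions from db1
cols <- c("Ozone", "Wind")
db[cols] <- db1[cols]

head(db)
```

The column names and their order in db stay the same; only the values of Ozone and Wind change.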

replace NA with 0 and all other values/text as 1

airquality
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
Hi there,
How do I replace values in Ozone to be binary? If NA then 0 and if a value then 1.
Thanks
H
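Assuming your data frame is called airquality, use either
airquality$Ozone <- ifelse(is.na(airquality$Ozone), 0, 1)
or
airquality$Ozone <- as.integer(!is.na(airquality$Ozone))
Run only one of the two: after the first, Ozone no longer contains any NA, so running the second as well would set every value to 1.
Alternatively, assign in place, in this order, so that the non-missing values are set to 1 before the NAs are replaced:
airquality$Ozone[!is.na(airquality$Ozone)] <- 1L
airquality$Ozone[is.na(airquality$Ozone)] <- 0L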
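The title also mentions text values. As a sketch that goes slightly beyond the question, the same is.na() test can be applied to every column at once, whatever its type; aq is just a hypothetical copy so that the built-in airquality is left untouched.

```r
aq <- airquality

# aq[] keeps the data frame structure while every column is replaced
# by a 0/1 indicator: 0 where the value was NA, 1 otherwise
aq[] <- lapply(aq, function(x) as.integer(!is.na(x)))

head(aq)
colSums(aq)  # number of non-missing values per column
```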

How to Calculate Industry Medians with Own Firm Excluded

I need to create a new column with the median ETR variable within a certain industry (SIC) for a sample of firms.
However, I need to exclude the own firm before calculating the industry (SIC) median for ETR.
Does anyone have any suggestions on how I could accomplish this?
Any help would be appreciated.
Thank you!
Sample Data:
Firm SIC ETR
1 20 10
2 20 15
3 20 20
4 20 25
5 20 30
6 21 50
7 21 55
8 21 60
9 21 65
10 21 70
Should Become:
Firm SIC ETR ETR_Median
1 20 10 22.5
2 20 15 22.5
3 20 20 20
4 20 25 17.5
5 20 30 17.5
6 21 50 62.5
7 21 55 62.5
8 21 60 60
9 21 65 57.5
10 21 70 57.5
So firm #4, for example, has an industry (SIC) median of 17.5 when only the other firms in the same industry (SIC) are considered.
Consider splitting by SIC groups and, within each group, running across all Firm values so that each firm in turn is excluded from the median calculation. Specifically, using:
by (for grouping into subset dfs)
sapply (to iterate across Firm values and call median)
unlist (to convert list to vector for df column binding)
Altogether:
df$ETR_median <- unlist(by(df, df$SIC, function(sub)
sapply(sub$Firm, function(f) median(sub$ETR[sub$Firm != f]))
))
df
# Firm SIC ETR ETR_median
# 1 1 20 10 22.5
# 2 2 20 15 22.5
# 3 3 20 20 20.0
# 4 4 20 25 17.5
# 5 5 20 30 17.5
# 6 6 21 50 62.5
# 7 7 21 55 62.5
# 8 8 21 60 60.0
# 9 9 21 65 57.5
# 10 10 21 70 57.5
You could create a function that excludes the current observation before conducting the median calculation:
median_excl <- function(x){
# pre-allocate our result vector (numeric, since it will hold medians):
med_excl <- numeric(length(x))
# loop through our vector, excluding the current index and taking the median:
for(i in seq_along(x)){
x_excl <- x[-i]
med <- median(x_excl)
med_excl[i] <- med
}
return(med_excl)
}
Then simply apply it using dplyr or however you choose:
library(dplyr)
df %>% group_by(SIC) %>% mutate(ETR_Median = median_excl(ETR))
# Firm SIC ETR ETR_Median
# 1 1 20 10 22.5
# 2 2 20 15 22.5
# 3 3 20 20 20.0
# 4 4 20 25 17.5
# 5 5 20 30 17.5
# 6 6 21 50 62.5
# 7 7 21 55 62.5
# 8 8 21 60 60.0
# 9 9 21 65 57.5
# 10 10 21 70 57.5
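The by() version assigns its results back by position, so it relies on df being sorted by SIC. A sketch of an alternative that should keep the original row order regardless of sorting is base R's ave(), which applies a function within groups and returns the results in the positions of the input. The data frame below just re-creates the sample data from the question, and median_excl is a compact variant of the helper above.

```r
df <- data.frame(Firm = 1:10,
                 SIC = rep(c(20, 21), each = 5),
                 ETR = c(10, 15, 20, 25, 30, 50, 55, 60, 65, 70))

# Median of x with the i-th element left out, for every i
median_excl <- function(x) {
  vapply(seq_along(x), function(i) median(x[-i]), numeric(1))
}

# ave() splits ETR by SIC, applies median_excl to each group,
# and puts the results back in the original row positions
df$ETR_Median <- ave(df$ETR, df$SIC, FUN = median_excl)
df
```

A group that contains only one firm gets NA, because there are no other firms to take a median over.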

How to flatten out nested list into one list more efficiently instead of using unlist method?

I have a nested list that contains a set of data.frame objects, and I want to flatten it. I used the most common approach, unlist, but it does not flatten my list properly and the output is not well represented. How can I do this more efficiently? Does anyone know a trick for this operation? Thanks.
example:
mylist <- list(pass=list(Alpha.df1_yes=airquality[2:4,], Alpha.df2_yes=airquality[3:6,],Alpha.df3_yes=airquality[2:5,],Alpha.df4_yes=airquality[7:9,]),
fail=list(Alpha.df1_no=airquality[5:7,], Alpha.df2_no=airquality[8:10,], Alpha.df3_no=airquality[13:16,],Alpha.df4_no=airquality[11:13,]))
I tried it like this; it runs, but the output is not properly arranged.
res <- lapply(mylist, unlist)
After flattening, I would like to merge them without duplication:
out <- lapply(res, rbind.data.frame)
my desired output:
mylist[[1]]$pass:
Ozone Solar.R Wind Temp Month Day
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
How can I get this sort of flattened output in a cleaner form? Can anyone propose a way of doing this in R? Thanks a lot.
Using lapply and duplicated:
res <- lapply(mylist, function(i){
  x <- do.call(rbind, i)
  # keep only the first occurrence of each row
  x <- x[ !duplicated(x), ]
  rownames(x) <- NULL
  x
})
res$pass
# Ozone Solar.R Wind Temp Month Day
# 1 36 118 8.0 72 5 2
# 2 12 149 12.6 74 5 3
# 3 18 313 11.5 62 5 4
# 4 NA NA 14.3 56 5 5
# 5 28 NA 14.9 66 5 6
# 6 23 299 8.6 65 5 7
# 7 19 99 13.8 59 5 8
# 8 8 19 20.1 61 5 9
The above still returns a list. If we want everything in one data frame with no lists, then:
res <- do.call(rbind, unlist(mylist, recursive = FALSE))
res <- res[!duplicated(res), ]
res
# Ozone Solar.R Wind Temp Month Day
# pass.Alpha.df1_yes.2 36 118 8.0 72 5 2
# pass.Alpha.df1_yes.3 12 149 12.6 74 5 3
# pass.Alpha.df1_yes.4 18 313 11.5 62 5 4
# pass.Alpha.df2_yes.5 NA NA 14.3 56 5 5
# pass.Alpha.df2_yes.6 28 NA 14.9 66 5 6
# pass.Alpha.df4_yes.7 23 299 8.6 65 5 7
# pass.Alpha.df4_yes.8 19 99 13.8 59 5 8
# pass.Alpha.df4_yes.9 8 19 20.1 61 5 9
# fail.Alpha.df2_no.10 NA 194 8.6 69 5 10
# fail.Alpha.df3_no.13 11 290 9.2 66 5 13
# fail.Alpha.df3_no.14 14 274 10.9 68 5 14
# fail.Alpha.df3_no.15 18 65 13.2 58 5 15
# fail.Alpha.df3_no.16 14 334 11.5 64 5 16
# fail.Alpha.df4_no.11 7 NA 6.9 74 5 11
# fail.Alpha.df4_no.12 16 256 9.7 69 5 12
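The reason lapply(mylist, unlist) gives an awkward result is that unlist() recurses by default, and since a data frame is itself a list, each one is flattened all the way down into one long atomic vector. With recursive = FALSE only one level of nesting is removed, so the data frames stay intact. A small check, using a shortened version of mylist from the question (the names deep, flat and merged are made up for illustration):

```r
mylist <- list(
  pass = list(Alpha.df1_yes = airquality[2:4, ],
              Alpha.df2_yes = airquality[3:6, ]),
  fail = list(Alpha.df1_no = airquality[5:7, ],
              Alpha.df2_no = airquality[8:10, ])
)

# Default (recursive) unlist: the data frames are gone, only values remain
deep <- unlist(mylist$pass)
is.atomic(deep)

# One level only: a flat list whose elements are still data frames
flat <- unlist(mylist, recursive = FALSE)
names(flat)
sapply(flat, class)

# Row-bind everything and drop duplicated rows
merged <- unique(do.call(rbind, flat))
nrow(merged)
```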

dcast without ID variables

In "An Introduction to reshape2", Sean C. Anderson presents the following example.
He uses the airquality data and converts the column names to lower case:
names(airquality) <- tolower(names(airquality))
The data look like
# ozone solar.r wind temp month day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
Then he melts them by
aql <- melt(airquality, id.vars = c("month", "day"))
to get
# month day variable value
# 1 5 1 ozone 41
# 2 5 2 ozone 36
# 3 5 3 ozone 12
# 4 5 4 ozone 18
# 5 5 5 ozone NA
# 6 5 6 ozone 28
Finally he gets the original one (different column order) by
aqw <- dcast(aql, month + day ~ variable)
My Question
Assume now that we do not have ID variables (i.e. month and day) and have melted the data as follows
aql <- melt(airquality)
which looks like
# variable value
# 1 ozone 41
# 2 ozone 36
# 3 ozone 12
# 4 ozone 18
# 5 ozone NA
# 6 ozone 28
How can I get the original data back? It would correspond to
# ozone solar.r wind temp
# 1 41 190 7.4 67
# 2 36 118 8.0 72
# 3 12 149 12.6 74
# 4 18 313 11.5 62
# 5 NA NA 14.3 56
# 6 28 NA 14.9 66
Another option is unstack
out <- unstack(aql,value~variable)
head(out)
# ozone solar.r wind temp month day
#1 41 190 7.4 67 5 1
#2 36 118 8.0 72 5 2
#3 12 149 12.6 74 5 3
#4 18 313 11.5 62 5 4
#5 NA NA 14.3 56 5 5
#6 28 NA 14.9 66 5 6
As the question is about dcast, we can create a sequence column and then use dcast
aql$indx <- with(aql, ave(seq_along(variable), variable, FUN=seq_along))
out1 <- dcast(aql, indx~variable, value.var='value')[,-1]
head(out1)
# ozone solar.r wind temp month day
#1 41 190 7.4 67 5 1
#2 36 118 8.0 72 5 2
#3 12 149 12.6 74 5 3
#4 18 313 11.5 62 5 4
#5 NA NA 14.3 56 5 5
#6 28 NA 14.9 66 5 6
If you are using data.table, the devel version of data.table, i.e. v1.9.5, also has a dcast function. Instructions to install the devel version are here
library(data.table)#v1.9.5+
setDT(aql)[, indx:=1:.N, variable]
dcast(aql, indx~variable, value.var='value')[,-1]
One option using split,
out <- data.frame(sapply(split(aql, aql$variable), `[[`, 2))
Here, the data is split by the variable column, then the second column of each group is combined back into a data frame (the [[ function with the argument 2 is passed to sapply)
head(out)
# Ozone Solar.R Wind Temp Month Day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
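All of these approaches rely on the melted rows still being in their original order within each variable, because without ID variables the row position is the only thing linking values from different columns. A quick round-trip check can confirm that the original values come back. The sketch below uses base R's stack() as a stand-in for melt() without ID variables (an assumption made so the example needs no extra packages; stack() names its columns values and ind rather than value and variable).

```r
# Long format without ID variables: one value column, one column-name factor
long <- stack(airquality)
head(long)

# Back to wide: one column per level of ind, rows matched by position
wide <- unstack(long, values ~ ind)
head(wide)

# Compare the values only; attributes such as row names are ignored
same <- all.equal(as.matrix(wide), as.matrix(airquality),
                  check.attributes = FALSE)
same
```

If the long data were reordered or rows were dropped for some variables only, the positions would no longer line up and the reconstruction would silently pair the wrong values.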
