I'm trying to copy specific values from one data.frame into another. The frames have different lengths but overlapping names (e.g. 'a' to 'z' and 'a' to 'w').
names <- letters[1:26]
df1 <- data.frame("name" = names[1:20],"value" = rnorm(20, mean = 4, sd = 1))
df2 <- data.frame("name" = names[1:26],"value" = rnorm(26, mean = 4, sd = 1))
df2$value2 <- df1[df2$name %in% df1$name,]$value
The last line above does not work; it produces the following error:
Error in `$<-.data.frame`(`*tmp*`, "value2", value = c(4.21005563122984, :
replacement has 20 rows, data has 24
Any suggestions on how to produce something like this?
name value value2
1 a 4.210056 5.918197
2 b 3.203976 4.485109
3 c 4.290336 4.210056
......
25 y 5.918197 NA
26 z 3.861640 NA
df2$value2[df2$name %in% df1$name] <- df1$value
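The assignment above fills value2 by position, so it relies on the names in df1 appearing in the same order as in df2. Here is a minimal sketch of a variant that looks each name up explicitly with match(), assuming names are unique in df1 (the seed is only there to make the sketch reproducible):

```r
set.seed(42)
df1 <- data.frame(name = letters[1:20], value = rnorm(20, mean = 4, sd = 1))
df2 <- data.frame(name = letters[1:26], value = rnorm(26, mean = 4, sd = 1))

# match() returns, for each name in df2, its row position in df1 (NA if absent)
idx <- match(df2$name, df1$name)

# Indexing with NA yields NA, so names missing from df1 get NA automatically
df2$value2 <- df1$value[idx]
```

This also keeps working if df1 is shuffled, because the lookup is by name rather than by row order.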
The problem is the following:
I have a data frame that I need to update inside a loop. The data frame has 4 columns: an identifier and three numeric columns. Here is the data frame at the initial step:
res_df <- data.frame(id = c("X", "Y", "Z"),
count = NA,
total = NA,
value = NA)
At every iteration a new data frame is generated with the same identifier and the same numeric columns.
For instance,
loop_df <- data.frame(id = c("X", "Z"),
count = c(1, 0),
total = c(20, 0),
value = c(0.05, 0))
I actually need to fill the res_df with information from the loop_df in the following way:
the row in loop_df with id "X" has to be inserted into the corresponding row of res_df, etc.;
the column count has to be filled by summing the values in res_df and the newest values in loop_df (essentially sum(res_df$count, loop_df$count), matched by id);
the column total has to be filled in the same way as the column count (i.e. with a simple sum of the values, matched by id);
the column value has to be filled by averaging the values in res_df and the newest values in loop_df (essentially mean(c(res_df$value, loop_df$value)), matched by id).
Here is how the result should be after the first run:
res_df
id count total value
X 1 20 0.05
Y NA NA NA
Z 0 0 0
Now suppose the second iteration of the loop produces the following loop_df:
loop_df <- data.frame(id = c("X", "Y"),
count = c(1, 0),
total = c(50, 0),
value = c(2.35, 0))
Then res_df has to be updated as follows:
res_df
id count total value
X 2 70 1.2
Y 0 0 0
Z 0 0 0
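The update rule described above can be sketched in base R with match(). This is only an illustration under stated assumptions: update_res is a hypothetical helper name, every id in loop_df is assumed to exist in res_df, and "average" is taken to mean the pairwise mean of the stored value and the new value, as in the expected output.

```r
# Hypothetical helper: fold one loop_df into res_df, matching rows by id
update_res <- function(res_df, loop_df) {
  idx <- match(loop_df$id, res_df$id)  # rows of res_df to update

  # count and total: take the new value if nothing is stored yet, else add
  for (col in c("count", "total")) {
    old <- res_df[[col]][idx]
    res_df[[col]][idx] <- ifelse(is.na(old), loop_df[[col]], old + loop_df[[col]])
  }

  # value: take the new value if nothing is stored yet, else pairwise mean
  old <- res_df$value[idx]
  res_df$value[idx] <- ifelse(is.na(old), loop_df$value, (old + loop_df$value) / 2)
  res_df
}

res_df <- data.frame(id = c("X", "Y", "Z"), count = NA, total = NA, value = NA)
res_df <- update_res(res_df, data.frame(id = c("X", "Z"), count = c(1, 0),
                                        total = c(20, 0), value = c(0.05, 0)))
res_df <- update_res(res_df, data.frame(id = c("X", "Y"), count = c(1, 0),
                                        total = c(50, 0), value = c(2.35, 0)))
```

Because the rows are located with match(), neither data frame needs to be sorted first.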
Update: Solution
library(dplyr)
# Sort both frames by id so that the matched rows line up by position
res_df <- arrange(res_df, id)
loop_df <- arrange(loop_df, id)
ids <- loop_df$id
res_df[res_df$id %in% ids,] <- res_df[res_df$id %in% ids,] %>%
  mutate(count = case_when(is.na(count) ~ loop_df$count,
                           TRUE ~ count + loop_df$count),
         total = case_when(is.na(total) ~ loop_df$total,
                           TRUE ~ total + loop_df$total),
         # ewise_mean() is not defined in this post; presumably an element-wise mean helper
         value = case_when(is.na(value) ~ loop_df$value,
                           TRUE ~ ewise_mean(value, loop_df$value, zero.rm = TRUE))
  )
However, I am still looking for a more efficient solution.
I'd really appreciate your help and thoughts about that.
I didn't quite catch why you need the for loop to generate those loop_df. You might want to consider using lapply to get a list of loop_df and then use the following to get your desired result:
rbindlist(dfl)[, .(count=sum(count), total=sum(total), value=mean(value)), id]
output:
id count total value
1: X 2 70 1.2
2: Z 0 0 0.0
3: Y 0 0 0.0
data:
library(data.table)
dfl <- list(
setDT(data.frame(id = c("X", "Z"),
count = c(1, 0),
total = c(20, 0),
value = c(0.05, 0))),
setDT(data.frame(id = c("X", "Y"),
count = c(1, 0),
total = c(50, 0),
value = c(2.35, 0)))
)
I have N columns that start with the String "Factor". I want to create an additional column in the dataframe that finds the row product of those columns.
Example data (in my actual data set, N = 50):
df <- data.frame(Company = c("A","B","C","D","E"),
Factor1 = c(1,2,3,4,5),
Factor2 = c(5,4,3,2,1),
FactorN = c(2,4,6,8,10))
Expected result
df2 <- data.frame(Company = c("A","B","C","D","E"),
Factor1 = c(1,2,3,4,5),
Factor2 = c(5,4,3,2,1),
FactorN = c(2,4,6,8,10),
Factor_Product = c(10,32,54,64,50))
I've tried rowProds from the matrixStats package, but that requires a matrix format.
Then convert it to a matrix, selecting only the columns whose names start with "Factor":
matrixStats::rowProds(as.matrix(df[grep("^Factor", names(df))]))
#[1] 10 32 54 64 50
You can also use apply row-wise:
apply(df[grep("^Factor", names(df))], 1, prod)
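A further base R option, sketched here as an alternative rather than as part of the answers above, is to multiply the selected columns together element-wise with Reduce(). It avoids both the matrix conversion and the row-by-row apply call; factor_cols is just an illustrative variable name.

```r
df <- data.frame(Company = c("A", "B", "C", "D", "E"),
                 Factor1 = c(1, 2, 3, 4, 5),
                 Factor2 = c(5, 4, 3, 2, 1),
                 FactorN = c(2, 4, 6, 8, 10))

# Logical mask of the columns whose names start with "Factor"
factor_cols <- startsWith(names(df), "Factor")

# A data frame is a list of columns, so Reduce multiplies them pairwise
df$Factor_Product <- Reduce(`*`, df[factor_cols])
df$Factor_Product
#[1] 10 32 54 64 50
```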
I have a list of data frames that are being pulled in from an Excel file. The Excel file has funky formatting and the true column names are the first row of the data frame. The list contains 9 data frames that are not sequentially named and are based on the tab names of the Excel file.
Here is what I have thus far:
for(i in all_list){
tmp <- get(i)
colnames(tmp) <- unlist(get(i)[1,])
assign(i, tmp)
}
R presents me with an error of:
Error in get(i) : invalid first argument
Here is a sample of the structure of my list of data frames:
str(all_list)
List of 9
$ Retail :'data.frame': 306 obs. of 25 variables:
$ X__1 : chr [1:306] NA NA "VARIABLE" "VARIABLE" ...
$ X__2 : chr [1:306] "TIME PERIOD" NA "41640" "41671" ...
As you can see, the column names within the first element of the list (Retail) have the "X__#" format. Is there a clean way to do this reformatting in one loop over this list? Thank you.
You can use lapply to iterate over each data.frame in the list and set the column names from the first row. Remove the first row before the data.frame is returned. For example:
ll <- list(df1,df2,df3,df4)
lapply(ll, function(x){
  names(x) <- unlist(x[1,])  # first row becomes the column names
  x[-1,]})                   # drop the row that was promoted to the header
#[[1]] df1
# g x <-- 1st row has been set as column name.
#2 j z
#3 n p
#4 u o
#5 e b
Sample Data:
set.seed(1)
df1 <- data.frame(First = sample(letters, 5), Second = sample(letters, 5),
stringsAsFactors = FALSE)
df2 <- data.frame(First = sample(letters, 5), Second = sample(letters, 5),
stringsAsFactors = FALSE)
df3 <- data.frame(First = sample(letters, 5), Second = sample(letters, 5),
stringsAsFactors = FALSE)
df4 <- data.frame(First = sample(letters, 5), Second = sample(letters, 5),
stringsAsFactors = FALSE)
df1
# First Second
# 1 g x
# 2 j z
# 3 n p
# 4 u o
# 5 e b
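As for the error in the question: for(i in all_list) iterates over the data frames themselves, not over their names, so get(i) receives a data frame where it expects a character string. A minimal sketch of the lapply approach on a named list follows; the data is made up to resemble the str() output, and the "Wholesale" tab is a hypothetical second element.

```r
# Made-up data: the real header sits in the first row of each data frame
all_list <- list(
  Retail = data.frame(X__1 = c("VARIABLE", "a", "b"),
                      X__2 = c("TIME PERIOD", "41640", "41671"),
                      stringsAsFactors = FALSE),
  Wholesale = data.frame(X__1 = c("ITEM", "c"),
                         X__2 = c("PRICE", "10"),
                         stringsAsFactors = FALSE)
)

# lapply keeps the list names, so no get()/assign() is needed
all_list <- lapply(all_list, function(x) {
  names(x) <- as.character(unlist(x[1, ]))  # promote first row to header
  x[-1, , drop = FALSE]                     # drop that row from the data
})

names(all_list)
colnames(all_list$Retail)
```

Assigning the result back to all_list keeps the tab names as list names, which matters when the tabs are not sequentially named.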
Here is an example,
df <- data.frame(x = I(list(1:2, 3:4)))
x <- df[1,]
Now the following does not work,
df[2,] <- x
or
df[2,] <- I(x)
Warning message:
In `[<-.data.frame`(`*tmp*`, 2, , value = list(1:2)) :
replacement element 1 has 2 rows to replace 1 rows
How do I add more rows to a data frame with a single column of vector type?
I found the following after a few tries,
df[2,] <- list(x)
which adds a new row of list type.
It might be because you are using a list. If you set your data frame as:
df <- data.frame(rbind(c(1, 2), c(3, 4)))
then your code should work:
df <- data.frame(rbind(c(1, 2), c(3, 4))) # Make DF
x <- df[1,]
df[2,] <- x
print(df)
> df
X1 X2
1 1 2
2 1 2
Say I have 5 dataframes with identical columns but different row lengths. I want
to make 1 dataframe that takes a specific column from each of the 5 dataframes, and
fills in with NA's (or whatever) where there isn't a length match. I've seen questions
on here that show how to do this with one-off vectors, but I'm looking for a way to
do it with bigger sets of data.
Ex: 2 dataframes of equal length:
long <- data.frame(accepted = rnorm(350, 2000), cost = rnorm(350,5000))
long2 <- data.frame(accepted = rnorm(350, 2000), cost = rnorm(350,5000))
I can create a list that combines them, then create an empty dataframe and populate
it with a common variable from the dataframes in the list:
list1 <- list(long, long2)
df1 <- as.data.frame(matrix(0, ncol = 5, nrow = 350))
df1[,1:2] <- sapply(list1, '[[', 'accepted')
And it works.
But when I have more dataframes of unequal length, this approach fails:
long <- data.frame(accepted = rnorm(350, 2000), cost = rnorm(350,5000))
long2 <- data.frame(accepted = rnorm(350, 2000), cost = rnorm(350,5000))
medlong <- data.frame(accepted = rnorm(300, 2000), cost = rnorm(300,5000))
medshort <- data.frame(accepted = rnorm(150, 2000), cost = rnorm(150,5000))
short <- data.frame(accepted = rnorm(50, 2000), cost = rnorm(50,5000))
Now making the list and combined dataframe:
list2 <- list(long, long2, medlong, medshort, short)
df2 <- as.data.frame(matrix(0, ncol = 5, nrow = 350))
df1[,1:5] <- sapply(list, '[[', 'accepted')
I get the error about size mismatch:
Error in `[<-.data.frame`(`*tmp*`, , 1:5, value = c(1998.77096640377, :
replacement has 700 items, need 1750
The only solution I've found to populating this dataframe with columns of unequal
length from other dataframes is something along the lines of:
combined.df <- as.data.frame(matrix(0, ncol = 5, nrow = 350))
combined.df[,1] <- long[,2]
combined.df[,2] <- c(medlong[,2], rep(NA, nrow(long) - nrow(medlong)))
But there's got to be a more elegant and faster way to do it... I know I'm missing something huge conceptually here.
One way would be to find the length of the longest column and then pad the shorter columns with the appropriate number of NAs. For example (with data of a more reasonable size for a MWE!):
out <- lapply( list1 , '[[', 'accepted')
# Find length of longest column
len <- max( sapply( out , length ) )
# Stack shorter columns with NA at the end
# lapply (not sapply) keeps a list, which is what do.call() requires
dfs <- lapply( out , function(x) c( x , rep( NA , len - length(x) ) ) )
# Make data.frame and set column names at same time
setNames( do.call( data.frame , dfs ) , paste0("V" , seq_along(out) ) )
V1 V2 V3
1 -1.0913212 -2.4864497 0.04220331
2 -0.5252874 0.8030984 0.21774515
3 0.6914167 0.9685629 1.47159957
4 NA NA -0.89809670
5 NA NA 0.51140539
6 NA NA -0.46833136
7 NA NA -0.40085707
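The same padding idea can be sketched as a self-contained example. The data below is made up (three small data frames of 7, 5 and 3 rows), and the padding uses `length<-`, which extends a vector with NA when the new length is larger.

```r
set.seed(1)
dfs <- list(
  long  = data.frame(accepted = rnorm(7, 2000), cost = rnorm(7, 5000)),
  med   = data.frame(accepted = rnorm(5, 2000), cost = rnorm(5, 5000)),
  short = data.frame(accepted = rnorm(3, 2000), cost = rnorm(3, 5000))
)

# Pull the column of interest out of every data frame
cols <- lapply(dfs, `[[`, "accepted")

# Length of the longest column
len <- max(lengths(cols))

# Extending a vector with `length<-` fills the new slots with NA
padded <- lapply(cols, function(x) { length(x) <- len; x })

# Equal-length named vectors convert directly to a data frame
combined <- as.data.frame(padded)
```

Using a named list means the resulting columns carry the names of the source data frames, so no separate setNames() step is needed.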
You could also "subset" each data frame past its last row, e.g. df[1:(nrow(df) + n), ]; the rows that do not exist come back filled with NAs:
#dataframes of different rows
long <- data.frame(accepted = rnorm(15, 2000), cost = rnorm(15,5000))
long2 <- data.frame(accepted = rnorm(10, 2000), cost = rnorm(10,5000))
long3 <- data.frame(accepted = rnorm(12, 2000), cost = rnorm(12,5000))
#insert all dataframes in list to manipulate
myls <- list(long, long2, long3)
#maximum number of rows
max.rows <- max(nrow(long), nrow(long2), nrow(long3))
#insert the needed `NA`s to each dataframe
new_myls <- lapply(myls, function(x) { x[1:max.rows,] })
#create wanted dataframe
do.call(cbind, lapply(new_myls, `[`, "accepted"))
# accepted accepted accepted
#1 2001.581 1999.014 2001.810
#2 2000.071 2000.033 2000.588
#3 1999.931 2000.188 2000.833
#4 1998.467 1999.891 1997.645
#5 2000.682 2000.144 1999.639
#6 1999.693 1999.341 1998.959
#7 2000.222 1998.939 2002.271
#8 1999.104 1998.530 1997.600
#9 1998.435 2001.496 2001.129
#10 1998.160 2000.729 2001.602
#11 1999.267 NA 1999.733
#12 2000.048 NA 2001.431
#13 1999.504 NA NA
#14 2000.660 NA NA
#15 2000.160 NA NA
You can try using merge:
long$rn <- rownames(long)
long2$rn <- rownames(long2)
medlong$rn <- rownames(medlong)
medshort$rn <- rownames(medshort)
short$rn <- rownames(short)
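# cols is not defined in the original answer; assumed here to be the key plus the column of interest
cols <- c('rn', 'accepted')
result <- (merge(merge(merge(merge(
  long, long2[, cols], by=c('rn'), all=TRUE),
  medlong[, cols], by=c('rn'), all=TRUE),
  medshort[, cols], by=c('rn'), all=TRUE),
  short[, cols], by=c('rn'), all=TRUE))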
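The nested merge calls above can be folded into a single Reduce() over a list. The sketch below uses made-up data and two choices that are assumptions of mine, not part of the answer: the row key rn is kept as an integer so the result sorts numerically, and each accepted column is renamed after its source data frame so the merged columns do not collide.

```r
set.seed(1)
dfs <- list(
  long  = data.frame(accepted = rnorm(6, 2000), cost = rnorm(6, 5000)),
  med   = data.frame(accepted = rnorm(4, 2000), cost = rnorm(4, 5000)),
  short = data.frame(accepted = rnorm(2, 2000), cost = rnorm(2, 5000))
)

# Build one two-column frame per source: the row key plus a uniquely named value
keyed <- Map(function(d, nm) {
  out <- data.frame(rn = seq_len(nrow(d)), d$accepted)
  names(out)[2] <- nm
  out
}, dfs, names(dfs))

# Full outer join of all frames on rn; missing rows become NA
result <- Reduce(function(a, b) merge(a, b, by = "rn", all = TRUE), keyed)
```

This scales to any number of data frames without adding another level of nesting for each one.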