I want to create iteration that takes a list (which is column of another dataframe) and add it to the current data frame as column. but the length of the columns are not equal. So, I want to generate NA as unmatched rows.
seq_actions=as.data.frame(x = NA)
for(i in 1:20){
temp_seq=another_df$c1[some conditions]
seq_actions=cbind(temp_seq,seq_actions)
}
to simplify, lets say i have
df
1 3
3 4
2 2
adding the list of 5,6 as new column to df, so I want:
df
1 3 5
3 4 6
2 2 NA
another adding list is 7 7 7 8, so my df will be:
df
1 3 5 7
3 4 6 7
2 2 NA 7
NA NA NA 8
How can I do it?
Here's one way. The merge function by design will add NA values whenever you combine data frames and no match is found (e.g., if you have fewer values in 1 data frame than the other data frame).
If you assume that you're matching your data frames (what rows go together) based on the row number, just output the row number as a column in your data frames. Then merge on that column. Merge will automatically add the NA values you want and deal with the fact that the data frames have different numbers of rows.
#test data frame 1
a <- c(1, 3, 2)
b <- c(3, 4, 2)
dat <- as.data.frame(cbind(a, b))
#test data frame 2 (this one has fewer rows than the first data frame)
c <- c(5, 6)
dat.new <- as.data.frame(c)
#add column to each data frame with row number
dat$number <- row.names(dat)
dat.new$number <- row.names(dat.new)
#merge data frames
#"all = TRUE" will mean that NA values will be added whenever there is no match
finaldata <- merge(dat, dat.new, by = "number", all = TRUE)
If you know the maximum possible size of df, and the total number of columns you want to append, you can create df in advance with all NA values and fill a column in based on its length. This would leave everything after its length still NA.
e.g.
max_col_num <- 20
max_col_size <- 10 #This could be the number of rows in the largest dataframe you have
df <- as.data.frame(matrix(ncol = max_col_num, nrow = max_col_size))
for(i in 1:20){
temp_seq=another_df$c1[some conditions]
df[c(1:length(temp_seq), i] <- temp_seq
}
This would only work if you new the total possible number of rows and columns.
I think the best could be to write a custom function which is based on nrow of data frame and length of vector/list.
Once such function can be written as:
#Function to add vector as column
addToDF <- function(df, v){
nRow <- nrow(df)
lngth <- length(v)
if(nRow > lngth){
length(v) <- nRow
}else if(nRow < lngth){
df[(nRow+1):lngth, ] <- NA
}
cbind(df,v)
}
Let's test above function with data.frame provided by OP.
df <- data.frame(A= c(1,3,2), B = c(3, 4, 2))
v <- c(5,6)
w <-c(7,7,8,9)
addToDF(df, v)
# A B v
# 1 1 3 5
# 2 3 4 6
# 3 2 2 NA
addToDF(df, w)
# A B v
# 1 1 3 7
# 2 3 4 7
# 3 2 2 8
# 4 NA NA 9
Following MKRs response, if you want to to add a specific name to the new added column, you can try:
addToDF <- function(df, v, col_name){
nRow <- nrow(df)
lngth <- length(v)
if(nRow > lngth){
length(v) <- nRow
}else if(nRow < lngth){
df[(nRow+1):lngth, ] <- NA
}
df_new<-cbind(df,v)
colnames(df_new)[ncol(df_new)]=col_name
return(df_new)
}
where col_name is the new of the added column.
Related
It's hard to explain, so I'll start with an example. I have some numeric columns (A, B, C). The column 'tmp' contains variable names of the numeric columns as concatenated strings:
set.seed(100)
A <- floor(runif(5, min=0, max=10))
B <- floor(runif(5, min=0, max=10))
C <- floor(runif(5, min=0, max=10))
tmp <- c('A','B,C','C','A,B','A,B,C')
df <- data.frame(A,B,C,tmp)
A B C tmp
1 3 4 6 A
2 2 8 8 B,C
3 5 3 2 C
4 0 5 3 A,B
5 4 1 7 A,B,C
Now, for each row, I want to use the variable names in tmp to select the values from the corresponding numeric columns with the same name(s). Then I want to keep only the rows where all the selected values are less than or equal 3.
E.g. in the first row, tmp is A, and the corresponding value in column A is 3, i.e. keep this row.
Another example, in row 4, tmp is A,B. The corresponding values are A = 0 and B = 5. Thus, all selected values are not less than or equal 3, and this row is discarded.
Desired result:
A B C tmp
1 3 4 6 A
2 5 3 2 C
How can I perform such filtering?
This is a bit more complicated than I like and there might be a more elegant solution, but here we go:
#split tmp
col <- strsplit(df[["tmp"]], ",")
#create an index matrix
inds <- do.call(rbind, Map(data.frame, row = seq_along(col), col = col))
inds$col <- match(inds$col, names(df))
inds <- as.matrix(inds)
#check
chk <- m <- as.matrix(df[, names(df) != "tmp"])
mode(chk) <- "logical"
chk[] <- NA
chk[inds] <- m[inds] <= 3
sel <- apply(chk, 1, prod, na.rm = TRUE)
df[as.logical(sel),]
# A B C tmp
#1 3 4 6 A
#3 5 3 2 C
Not sure if it works always (and probably isn't the best solution)... but it worked here:
library(dplyr)
library(tidyr)
library(stringr)
List= vector("list")
for (i in 1:length(df)){
tmpT= as.vector(str_split(df$tmp[i], ",", simplify=TRUE))
selec= df %>%
select(tmpT) %>%
slice(which(row_number() == i)) %>%
filter_all(., all_vars(. <= 3)) %>%
unite(val, sep= ", ")
if (nrow(selec) == 0) {
tab= NA
} else{
tab= df[i,]
}
List[[i]] = tab
}
df2= do.call("rbind", List)
This answer has some similarities with #Roland's, but here we work with the data in a 'longer' format:
# create row index
df$ri = seq_len(nrow(df))
# split the concatenated column
l <- strsplit(df$tmp, ',')
# repeat each row of the data with the lengths of the split string,
# bind with individual strings
d = cbind(df[rep(1:nrow(df), lengths(l)), ], x = unlist(l))
# use match to grab values from corresponding columns
d$val <- d[cbind(seq(nrow(d)), match(d$x, names(d)))]
# for each original row 'ri', check if all values are <= 3. use result to index data frame
d[as.logical(ave(d$val, d$ri, FUN = function(x) all(x <= 3))), ]
# A B C tmp ri x val
# 1 3 4 6 A 1 A 3
# 3 5 3 2 C 3 C 2
I have a vector of variable names and several matrices with single rows.
I want to create a new matrix. The new matrix is created by match/merge the row names of the matrices with single rows.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several matrices with single rows
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_1) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but just to exactly obtain the desired output I want:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5
I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a sepcific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
lapply(unique(colnames(dframe)),
function(x){
dframe[,if(sum(colnames(dframe) == x) <= nc) which(colnames(dframe) == x) else sample(which(colnames(dframe) == x),nc,replace = F)]
}
))
It might look complicated, but it really just takes all columns per group if there's less than nc, and samples random nc columns if there are more than nc columns.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('.[[:digit:]]','',colnames(res))
I would like to replace the values in the last three columns in my data.frame with the three values in a vector.
Example of data.frame
df
A B C D
5 3 8 9
Vector
1 2 3
what I would like the data.frame to look like.
df
A B C D
5 1 2 3
Currently I am doing:
df$B <- Vector[1]
df$C <- Vector[2]
df$D <- Vector[3]
I would like to not replace the values one by one. I would like to do it all at once.
Any help will be appreciated. Please let me know if any further information is needed.
We can subset the last three columns of the dataset with tail, replicate the 'Vector' to make the lengths similar and assign the values to those columns
df[,tail(names(df),3)] <- Vector[col(df[,tail(names(df),3)])]
df
# A B C D
#1 5 1 2 3
NOTE: I replicated the 'Vector' assuming that there will be more rows in the 'df' in the original dataset.
Try this:
df[-1] <- 1:3
giving:
> df
A B C D
1 5 1 2 3
Alternately, we could do it non-destructively like this:
replace(df, -1, 1:3)
Note: The input df in reproducible form is:
df <- data.frame(A = 5, B =3, C = 8, D = 9)
So, I have several dataframes like this
1 2 a
2 3 b
3 4 c
4 5 d
3 5 e
......
1 2 j
2 3 i
3 4 t
3 5 r
.......
2 3 t
2 4 g
6 7 i
8 9 t
......
What I want is, I want to merge all of these files into one single file showing the values of third column for each pair of values in columns 1 and columns 2 and 0 if that pair is not present.
So, the output for this will be, since, there are three files (there are more)
1 2 aj0
2 3 bit
3 4 ct0
4 5 d00
3 5 er0
6 7 00i
8 9 00t
......
What I did was combine all my text .txt files in a single list.
Then,
L <- lapply(seq_along(L), function(i) {
L[[i]][, paste0('DF', i)] <- 1
L[[i]]
})
Which will indicate the presence of a value when we will be merging them.
I don't know how to proceed further. Any inputs will be great. Thanks!
Here is one way to do it with Reduce
# function to generate dummy data
gen_data<- function(){
data.frame(
x = 1:3,
y = 2:4,
z = sample(LETTERS, 3, replace = TRUE)
)
}
# generate list of data frames to merge
L <- lapply(1:3, function(x) gen_data())
# function to merge by x and y and concatenate z
f <- function(x, y){
d <- merge(x, y, by = c('x', 'y'), all = TRUE)
# set merged column to zero if no match is found
d[['z.x']] = ifelse(is.na(d[['z.x']]), 0, d[['z.x']])
d[['z.y']] = ifelse(is.na(d[['z.y']]), 0, d[['z.y']])
d$z <- paste0(d[['z.x']], d[['z.y']])
d['z.x'] <- d['z.y'] <- NULL
return(d)
}
# merge data frames
Reduce(f, L)