How do I change a value in a data frame based on another data frame with different column names? I have Table 1 and Table 2 and would like to produce the desired output shown below.
I would like to replace the Number column of Table 1 with the Integer value from Table 2 wherever the Index of Table 1 exists in the Alphabet column of Table 2, without using if/else, because my data frame is large.
Table 1)
Index  Number
A      1
B      2
C      3
D      4
Table 2)
Alphabet  Integer
B         25
C         30
Desired Output:
Index  Number
A      1
B      25
C      30
D      4
What's the issue with using ifelse() on a large dataframe with simple replacement?
df1 <- data.frame(Index = LETTERS[1:4],
Number = 1:4)
df2 <- data.frame(Alphabet = c("B", "C"),
Integer = c(25, 30))
df1$Number2 <- ifelse(df1$Index %in% df2$Alphabet, df2$Integer, df1$Number)
df1
#> Index Number Number2
#> 1 A 1 1
#> 2 B 2 30
#> 3 C 3 25
#> 4 D 4 4
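Note that the ifelse() call above recycles df2$Integer along the rows of df1, so B and C receive 30 and 25 instead of 25 and 30. If you are concerned about this kind of incorrect indexing with ifelse(), you could use merge() like this: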
merge(df1, df2, by.x = "Index", by.y = "Alphabet", all.x = TRUE)
#> Index Number Integer
#> 1 A 1 NA
#> 2 B 2 25
#> 3 C 3 30
#> 4 D 4 NA
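As an alternative sketch, match() can look up each Index in Alphabet by position, so values are aligned by key and Number is updated in place. The helper names idx and hit are introduced here for illustration only:

```r
df1 <- data.frame(Index = LETTERS[1:4],
                  Number = 1:4)
df2 <- data.frame(Alphabet = c("B", "C"),
                  Integer = c(25, 30))

# Position of each df1$Index in df2$Alphabet (NA where there is no match)
idx <- match(df1$Index, df2$Alphabet)
hit <- !is.na(idx)

# Overwrite Number only for the rows that have a match in df2
df1$Number[hit] <- df2$Integer[idx[hit]]
df1
```

Rows without a match (A and D here) keep their original Number.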
Related
I want a way to count the values in a data frame based on their presence per row
a = data.frame(c('a','b','c','d','f'),
c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first and third rows, for a total of two appearances. I wrote the code below to count a value whenever it is present in a row, but I want the count assigned automatically for every value in the data frame:
# count the rows containing 'a' and assign the count to the b data frame
b = data.frame(unique(unique(unlist(a))))
b$count = 0
for(i in 1:nrow(a)){
if(TRUE %in% apply(a[i,], 2, function(x) x %in% 'a') == TRUE){
b$count[1] = b$count[1] + 1
}
}
b$count[1]
[1] 2
The problem is that I have to do this manually for every value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values of each column separately, unlisting them to a vector, and getting the frequency count with table. If needed, convert the table object to a two-column data.frame with stack
stack(table(unlist(lapply(a, unique))))[2:1]
-output
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
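For comparison, the loop from the question can be generalised directly. This is only a sketch, assuming the columns are character vectors; vals and counts are names introduced here for illustration:

```r
a <- data.frame(let  = c("a", "b", "c", "d", "f"),
                let2 = c("a", "b", "a", "b", "d"),
                stringsAsFactors = FALSE)

# Every distinct value that appears anywhere in the data frame
vals <- sort(unique(unlist(a)))

# For each value, count the rows in which it appears in at least one column
counts <- vapply(vals, function(v) sum(rowSums(a == v) > 0), integer(1))

b <- data.frame(value = vals, count = counts, row.names = NULL)
b
```

Because each row is tested once per value, a value repeated within the same row (such as "a" in row 1) is still counted only once.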
Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d
I have searched high and low and tried multiple options to solve this, but did not get the desired output described below:
I have a data frame df3 whose column headers are dates and whose values are 0 or 1, as shown below:
df = data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) = c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 = data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 = cbind(df2,df)
Now I need df4, in which each consecutive group of 3 date columns is summed into one column. This should be repeated dynamically for the rest of the columns.
df4
Options I tried:
a) rbind.data.frame(apply(matrix(df3, nrow = n - 1), 1,sum))
b) col_list <- list(c("1/1/2018","1/2/2018","1/3/2018"), c("1/4/2018","1/5/2018","1/6/2018"))
lapply(col_list, function(x)sum(df3[,x])) %>% data.frame
One way would be to split df3 every 3 columns using split.default. To split the data we generate a sequence using rep, then for each dataframe we take rowSums and finally cbind the result together.
cbind(df3[1], sapply(split.default(df3[-1],
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))), rowSums))
# CUST_ID 1 2
#1 A 1 1
#2 B 2 0
#3 C 2 1
#4 D 1 1
#5 E 2 2
#6 F 2 2
FYI, the sequence generated from rep is
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))
#[1] 1 1 1 2 2 2
This makes it possible to split every 3 columns.
The results are different because OP used sample without set.seed.
If rep seems too long then we can generate the same sequence of columns using gl
gl(ncol(df3[-1])/3, 3)
#[1] 1 1 1 2 2 2
#Levels: 1 2
So the final code, would be
cbind(df3[1], sapply(split.default(df3[-1], gl(ncol(df3[-1])/3, 3)), rowSums))
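A variant sketch that derives the grouping from column positions with ceiling(), assuming that a trailing group with fewer than 3 columns should still be summed. The names block_size, vals, grp and sums are introduced here for illustration:

```r
set.seed(123)
df <- data.frame(replicate(6, sample(0:1, 6, replace = TRUE)))
df3 <- cbind(CUST_ID = LETTERS[1:6], df)

block_size <- 3
vals <- df3[-1]                                # drop the ID column
grp <- ceiling(seq_along(vals) / block_size)   # 1 1 1 2 2 2

# Split the columns by group and sum each block row-wise
sums <- sapply(split.default(vals, grp), rowSums)
df4 <- cbind(df3[1], sums)
df4
```

Changing block_size adjusts how many consecutive columns are summed into each output column.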
We can use seq to create an index, get the subset of columns in a list, Reduce by taking the sum, and create new columns
df4 <- df3[1]
df4[paste0('col', c('123', '456'))] <- lapply(seq(2, ncol(df3), by = 3),
function(i) Reduce(`+`, df3[i:min((i+2), ncol(df3))]))
df4
# CUST_ID col123 col456
#1 A 2 2
#2 B 3 3
#3 C 1 3
#4 D 2 3
#5 E 2 1
#6 F 0 1
data
set.seed(123)
df <- data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) <- c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 <- data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 <- cbind(df2, df)
I have no idea where to start with this, but what I am trying to do is create a new value based on the number of times another value is represented in another column.
For example
# Existing Data
key newcol
a ?
a ?
a ?
b ?
b ?
c ?
c ?
c ?
Would like the output to look like
key newcol
a 3
a 3
a 3
b 2
b 2
c 3
c 3
c 3
Thanks!
This can be achieved with the doBy package like so:
require(doBy)
#original data frame
df <- data.frame(key = c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'))
#add counter
df$count <- 1
#use summaryBy to count number of instances of key
counts <- summaryBy(count ~ key, data = df, FUN = sum, var.names = 'newcol', keep.names = TRUE)
#merge counts into original data frame
df <- merge(df, counts, by = 'key', all.x = TRUE)
df then looks like:
> df
key count newcol
1 a 1 3
2 a 1 3
3 a 1 3
4 b 1 2
5 b 1 2
6 c 1 3
7 c 1 3
8 c 1 3
If key is a vector like this key <- rep(c("a", "b", "c"), c(3,2,3)), then you can get what you want by using table to count occurrences of key elements
> N <- table(key)
> data.frame(key, newcol=rep(N,N))
key newcol
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
6 c 3
7 c 3
8 c 3
On the other hand, if key is a data.frame, then...
key.df <- data.frame(key = rep(letters[1:3], c(3, 2, 3)))
N <- table(key.df$key)
data.frame(key=key.df, newcol=rep(N, N))
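Another base R option is ave(), which applies a function within each group and returns a result aligned with the original rows. A minimal sketch, assuming the grouping column is called key as in the question:

```r
df <- data.frame(key = c("a", "a", "a", "b", "b", "c", "c", "c"))

# ave() computes length() within each key group and repeats it for every row
df$newcol <- ave(seq_along(df$key), df$key, FUN = length)
df
```

Unlike the rep(N, N) approach, this does not rely on the rows already being sorted by key.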
I have 2 data frames with different column names which I need to append
A = c("Q","W","E")
B =c(12,23,31)
df1 = data.frame(A,B)
A = c("R","T","Y")
B =c(3,111,21)
C= c(5,9,3)
df2 = data.frame(A,B,C)
I am trying to append the two data frames in sqldf, the way rbindlist() from data.table does:
samp = rbindlist(list(df1,df2),fill=T)
What modifications does the code below need?
samp= sqldf ("insert into df1 select * from df2")
The error I am getting is:
Error in rsqlite_send_query(conn@ptr, statement) :
table df1 has 2 columns but 3 values were supplied
We can create a dummy column and then use UNION ALL:
sqldf("select *, NULL as C from df1
union all
select * from df2")
# A B C
# 1 Q 12 NA
# 2 W 23 NA
# 3 E 31 NA
# 4 R 3 5
# 5 T 111 9
# 6 Y 21 3
I want to do something VERY similar to this question: how to use merge() to update a table in R
but instead of just one column being the index, I want to match the new values on an arbitrary number of columns >=1.
foo <- data.frame(index1=c('a', 'b', 'b', 'd','e'),index2=c(1, 1, 2, 3, 2), value=c(100,NA, 101, NA, NA))
Which has the following values
foo
index1 index2 value
1 a 1 100
2 b 1 NA
3 b 2 101
4 d 3 NA
5 e 2 NA
And the data frame bar
bar <- data.frame(index1=c('b', 'd'),index2=c(1,3), value=c(200, 201))
Which has the following values:
bar
index1 index2 value
1 b 1 200
2 d 3 201
merge(foo, bar, by='index', all=T)
It results in this output:
Desired output:
foo
index1 index2 value
1 a 1 100
2 b 1 200
3 b 2 101
4 d 3 201
5 e 2 NA
I think you don't need a merge here; instead, rbind the two data frames and filter the rows afterwards. Here I am using data.table for its syntactic sugar.
dx <- rbind(bar,foo)
library(data.table)
setDT(dx)
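## note: this can be applied to any number of index columns
setkeyv(dx, grep("index", names(dx), value = TRUE))
## use unique to remove the duplicated keys; since data.table 1.9.8 unique()
## compares all columns by default, so pass by = key(dx) explicitly.
## This keeps the first row per key (bar's rows come first), dropping the
## duplicates with missing values, which is the expected behavior
unique(dx, by = key(dx))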
# index1 index2 value
# 1: b 1 200
# 2: b 2 101
# 3: d 3 201
# 4: a 1 100
# 5: e 2 NA
You can be more explicit and filter your rows by group of indexes:
dx[,ifelse(length(value)>1,value[!is.na(value)],value),key(dx)]
Here's an R base approach
> temp <- merge(foo, bar, by=c("index1","index2"), all=TRUE)
> temp$value <- with(temp, ifelse(is.na(value.x) & is.na(value.y), NA, rowSums(temp[,3:4], na.rm=TRUE)))
> temp <- temp[, -c(3,4)]
> temp
index1 index2 value
1 a 1 100
2 b 1 200
3 b 2 101
4 d 3 201
5 e 2 NA
You can use some dplyr voodoo to produce what you want. The following subsets the data by unique combinations of "index1" and "index2", and checks the contents of "value" for each subset. If "value" has any non-NA values, those are returned. If only an NA value is found, that is returned.
Seems a little specific, but it seems to do what you want!
library(dplyr)
df.merged <- merge(foo, bar, all = T) %>%
group_by(index1, index2) %>%
do(
if (any(!is.na(.$value))) {
return(subset(., !is.na(value)))
} else {
return(.)
}
)
Output:
index1 index2 value
<fctr> <fctr> <dbl>
1 a 1 100
2 b 1 200
3 b 2 101
4 d 3 201
5 e 2 NA
You can specify as many columns as you want with merge:
out <- merge(foo, bar, by=c("index1", "index2"), all.x=TRUE)
new <- apply(out[,3:4], 1, function(x) sum(x, na.rm=TRUE))
new <- ifelse(is.na(out[,3]) & is.na(out[,4]), NA, new)
out <- cbind(out[,1:2], new)
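The rowSums()/sum() step in the answers above adds the two values together when both are present. A hedged sketch of an alternative, assuming instead that a non-missing value in bar should simply override the one in foo; keys and the ".new" suffix are names chosen here for illustration:

```r
foo <- data.frame(index1 = c("a", "b", "b", "d", "e"),
                  index2 = c(1, 1, 2, 3, 2),
                  value  = c(100, NA, 101, NA, NA))
bar <- data.frame(index1 = c("b", "d"),
                  index2 = c(1, 3),
                  value  = c(200, 201))

# Any number of key columns can be listed here
keys <- c("index1", "index2")

# Left join: foo's value stays "value", bar's becomes "value.new"
out <- merge(foo, bar, by = keys, all.x = TRUE, suffixes = c("", ".new"))

# Take bar's value where it exists, otherwise keep foo's
out$value <- ifelse(is.na(out$value.new), out$value, out$value.new)
out$value.new <- NULL
out
```

Note that merge() sorts the result by the key columns, so the row order may differ from foo's when foo is not already sorted.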