I have two data frames with different column names which I need to append.
A = c("Q","W","E")
B =c(12,23,31)
df1 = data.frame(A,B)
A = c("R","T","Y")
B =c(3,111,21)
C= c(5,9,3)
df2 = data.frame(A,B,C)
I am trying to append the two data frames in sqldf, replicating the rbindlist() functionality of data.table:

samp = rbindlist(list(df1, df2), fill = TRUE)
What modifications are needed in the code below?

samp = sqldf("insert into df1 select * from df2")
The error I am getting is:

Error in rsqlite_send_query(conn@ptr, statement) :
  table df1 has 2 columns but 3 values were supplied
We can create a dummy column and then UNION ALL. Since UNION ALL matches columns by position, we add NULL as a third column C on the df1 side so that both sides supply three columns:

sqldf("select *, NULL as C from df1
       union all
       select * from df2")
# A B C
# 1 Q 12 NA
# 2 W 23 NA
# 3 E 31 NA
# 4 R 3 5
# 5 T 111 9
# 6 Y 21 3
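If you want to keep the original INSERT approach, the statement can be made to work by first adding the missing column to df1. This is only a sketch, assuming sqldf accepts a vector of SQL statements (executed in order, returning the result of the last one) and using the main. prefix so the altered table is not clobbered by a fresh upload of df1:

library(sqldf)
# sketch: add the missing column C to the database copy of df1,
# insert df2's rows, then read the combined table back
sqldf(c("alter table df1 add column C",
        "insert into main.df1 select * from df2",
        "select * from main.df1"))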
Related
How do I change a value of a dataframe based on another dataframe with different column names? I have Table 1 and Table 2, and would like to produce the desired output stated below.
I would like to change the Number column of Table 1 wherever the Index of Table 1 exists in the Alphabet column of Table 2, without using ifelse(), as my dataframe is large.
Table 1)

Index  Number
A      1
B      2
C      3
D      4

Table 2)

Alphabet  Integer
B         25
C         30

Desired output:

Index  Number
A      1
B      25
C      30
D      4
What's the issue with using ifelse() on a large dataframe with simple replacement?
df1 <- data.frame(Index = LETTERS[1:4],
                  Number = 1:4)
df2 <- data.frame(Alphabet = c("B", "C"),
                  Integer = c(25, 30))

df1$Number2 <- ifelse(df1$Index %in% df2$Alphabet, df2$Integer, df1$Number)
df1
#> Index Number Number2
#> 1 A 1 1
#> 2 B 2 30
#> 3 C 3 25
#> 4 D 4 4
Note, however, that the output above is wrong: B gets 30 and C gets 25, because ifelse() recycles df2$Integer positionally rather than matching by key. If you are concerned with incorrect indexing when using ifelse(), you could use merge() like this:
merge(df1, df2, by.x = "Index", by.y = "Alphabet", all.x = TRUE)
#> Index Number Integer
#> 1 A 1 NA
#> 2 B 2 25
#> 3 C 3 30
#> 4 D 4 NA
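To finish from the merged result without ifelse(), overwrite Number wherever Integer is not NA. A minimal sketch (m is a hypothetical name for the merge result, assuming df1 as originally defined, without the Number2 helper column):

m <- merge(df1, df2, by.x = "Index", by.y = "Alphabet", all.x = TRUE)
m$Number[!is.na(m$Integer)] <- m$Integer[!is.na(m$Integer)]  # vectorized replacement
m$Integer <- NULL
m
#>   Index Number
#> 1     A      1
#> 2     B     25
#> 3     C     30
#> 4     D      4

Here merge() sorts by Index, which happens to coincide with the original order.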
I have two dataframes: one (called df_persons) with records that have unique person_id's but non-unique stratum_id's, and one (called df_population) with those same stratum_id's, including multiple duplicate rows of them. Code to recreate them below:
df_persons = data.frame(person_id=c(101, 102, 103), stratum_id=c(1,2,1))
df_population = data.frame(stratum_id=c(1,1,1,1,2,2,2,2,3,3))
Now I would like a way to merge the data from df_persons with df_population, so that every row from df_persons gets merged with the first matching (key = stratum_id) row of df_population that has not been previously matched. The desired result is below:
# manual way to merge first available match
df_population$person = c(101, 103, NA, NA, 102, NA, NA, NA, NA, NA)
I wrote a loop for this that works (see below). The problem is that df_persons is 83,000 records long and df_population is 13 million records long, so the loop takes too long and my PC cannot handle it.
# create empty person column in df_population
df_population$person = NA

# order both df's to speed up
df_population = df_population[order(df_population$stratum_id), ]
df_persons = df_persons[order(df_persons$stratum_id), ]

# loop through all persons in df_persons, and for each find the first available match
for (i_person in 1:nrow(df_persons)) {
  match = FALSE
  i_pop = 0
  while (!match) {
    i_pop = i_pop + 1
    if (df_population$stratum_id[i_pop] == df_persons$stratum_id[i_person] &&
        is.na(df_population$person[i_pop])) {
      match = TRUE
      df_population$person[i_pop] = df_persons$person_id[i_person]
    }
  }
}
Any help to make this a lot faster would be much appreciated. I have looked into the data.table package, to no avail so far, but I do think I will need to move away from looping in order to make this work.
1) dplyr Using dplyr, add a sequence number to each data frame and then merge them:
library(dplyr)

df_population %>%
  group_by(stratum_id) %>%
  mutate(seq = 1:n()) %>%
  ungroup %>%
  left_join(df_persons %>% group_by(stratum_id) %>% mutate(seq = 1:n()))
giving:
Joining, by = c("stratum_id", "seq")
# A tibble: 10 x 3
stratum_id seq person_id
<dbl> <int> <dbl>
1 1 1 101
2 1 2 103
3 1 3 NA
4 1 4 NA
5 2 1 102
6 2 2 NA
7 2 3 NA
8 2 4 NA
9 3 1 NA
10 3 2 NA
2) Base R The same idea in base R:
p1 <- transform(df_population, seq = ave(stratum_id, stratum_id, FUN = seq_along))
p2 <- transform(df_persons, seq = ave(stratum_id, stratum_id, FUN = seq_along))
merge(p1, p2, all.x = TRUE, all.y = FALSE)
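For the example data this gives the same ten rows as the dplyr output, ordered by stratum_id and seq (merge() sorts on the join columns by default):

#    stratum_id seq person_id
# 1           1   1       101
# 2           1   2       103
# 3           1   3        NA
# 4           1   4        NA
# 5           2   1       102
# 6           2   2        NA
# 7           2   3        NA
# 8           2   4        NA
# 9           3   1        NA
# 10          3   2        NA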
3) sqldf In SQL we can compute the within-stratum sequence number with a window function. The dbname= argument causes the processing to be done outside of R; if you have sufficient memory it can be omitted, and the processing will be done in memory within R.
library(sqldf)

seqno <- "sum(1) over (partition by stratum_id rows unbounded preceding)"
fn$sqldf("
  with p1 as (select *, $seqno seq from df_population),
       p2 as (select *, $seqno seq from df_persons)
  select * from p1 left join p2 using (stratum_id, seq)
", dbname = tempfile())
Here is a data.table approach. More explanation in the code's comments.
library(data.table)
# make them data.table
setDT(df_persons)
setDT(df_population)
# create dummy values to join on
df_persons[, id := rowid(stratum_id)]
df_population[, id := rowid(stratum_id)]
# join by reference
df_population[df_persons, person_id := i.person_id, on = .(stratum_id, id)][]
# drop the dummy id column
df_population[, id := NULL][]
# stratum_id person_id
# 1: 1 101
# 2: 1 103
# 3: 1 NA
# 4: 1 NA
# 5: 2 102
# 6: 2 NA
# 7: 2 NA
# 8: 2 NA
# 9: 3 NA
#10: 3 NA
Alternatively, use pmatch(), which uses each element of the lookup table at most once, giving exactly this first-available-match behaviour:
df_population$person_id <- df_persons$person_id[pmatch(df_population$stratum_id, df_persons$stratum_id)]
df_population
stratum_id person_id
1 1 101
2 1 103
3 1 NA
4 1 NA
5 2 102
6 2 NA
7 2 NA
8 2 NA
9 3 NA
10 3 NA
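A caveat: pmatch() coerces its arguments to character and falls back to partial (prefix) matching once exact matches are used up, so ids that are prefixes of one another can silently mis-match, e.g.:

pmatch(c(1, 1), c(1, 11))
# [1] 1 2   # the second 1 partially matches "11"

The rowid-based joins above do not have this problem.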
I have the following dataframes:
a <- c(1,1,1)
b <- c(10,8,2)
c <- c(2,2)
d <- c(3,5)
AB <- data.frame(a,b)
CD <- data.frame(c,d)
I would like to join AB and CD, where the first column of CD is equal to the second column of AB. Please note that my actual data will have a varying number of columns, with varying names, so I am really looking for a way to join based on position only. I have been trying this:
#Get the name of the last column in AB
> colnames(AB)[ncol(AB)]
[1] "b"
#Get the name of the first column in CD
> colnames(CD)[1]
[1] "c"
Then I attempt to join like this:
> abcd <- full_join(AB, CD, by = c(colnames(AB)[ncol(AB)]=colnames(CD)[1]))
Error: unexpected '=' in "abcd <- full_join(AB, CD, by = c(colnames(AB)[ncol(AB)]="
The behavior I am looking for is essentially this:
> abcd<- full_join(AB, CD, by = c("b" = "c"))
> abcd
a b d
1 1 10 NA
2 1 8 NA
3 1 2 3
4 1 2 5
We can use setNames() to build the named by vector programmatically (here equivalent to c("b" = "c")):
full_join(AB, CD, setNames(colnames(CD)[1], colnames(AB)[ncol(AB)]))
# a b d
#1 1 10 NA
#2 1 8 NA
#3 1 2 3
#4 1 2 5
We can replace the target column names with a common name, such as "Target", and then do full_join. Finally, replace the "Target" name with the original column name.
library(dplyr)

AB_name <- names(AB)
target_name <- AB_name[ncol(AB)]  # store the original column name
AB_name[ncol(AB)] <- "Target"     # set a common name
names(AB) <- AB_name

CD_name <- names(CD)
CD_name[1] <- "Target"            # set a common name
names(CD) <- CD_name

abcd <- full_join(AB, CD, by = "Target") %>%  # merge based on the common name
  rename(!!target_name := Target)             # restore the original column name
abcd
# a b d
# 1 1 10 NA
# 2 1 8 NA
# 3 1 2 3
# 4 1 2 5
I would like to match two datasets (tables) which have only some (not all) variables in common and none of the same observations. So effectively I want to append dataset1 to dataset2, keeping the union of both sets of column names, with the empty cells filled in with NA.
So I tried the following function:
matchcol = function(x, y){
  y = y[, match(colnames(x), colnames(y))]
  colnames(y) = colnames(x)
  return(y)
}
sum = matchcol(dataset1, dataset2)
data = rbind(dataset1, dataset2)
But I get: "Error: NA columns indexes not supported."
What can I do? What can I change in my code?
Thanks!
To use rbind you need the same column names in both data frames (your match() call returns NA for the columns of dataset1 that are missing from dataset2, which is what triggers the error). With bind_rows from the dplyr package the column names don't have to match; try this:
library(dplyr)
data <- bind_rows(dataset1, dataset2)
Example:
dataset1 <- data.frame(a = 1:5, b = 6:10)
dataset2 <- data.frame(a = 11:15, c = 16:20)
data <- bind_rows(dataset1, dataset2)
# a b c
# 1 1 6 NA
# 2 2 7 NA
# 3 3 8 NA
# 4 4 9 NA
# 5 5 10 NA
# 6 11 NA 16
# 7 12 NA 17
# 8 13 NA 18
# 9 14 NA 19
# 10 15 NA 20
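data.table's rbindlist() gives the same behaviour with fill = TRUE and may be faster on large inputs:

library(data.table)
data <- rbindlist(list(dataset1, dataset2), fill = TRUE)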
If I understand your question right, it looks like dplyr::full_join is good for that:
library(dplyr)
dataset1 <- data.frame(Var_A = 1:10, Var_B = 100:109)
dataset2 <- data.frame(Var_A = 11:20, Var_C = 200:209)
dataset_new <- full_join(dataset1, dataset2)
dataset_new
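For these inputs the result has 20 rows, with NA wherever a dataset lacks the column (abridged):

head(dataset_new, 3)
#   Var_A Var_B Var_C
# 1     1   100    NA
# 2     2   101    NA
# 3     3   102    NA
tail(dataset_new, 3)
#    Var_A Var_B Var_C
# 18    18    NA   207
# 19    19    NA   208
# 20    20    NA   209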
This will automatically join the two datasets by their common column names and add all other columns from both datasets, with empty fields filled in as NA.
Does that work for you?
I have two datasets (posted as images) that each contain a column of names and a column of values. What I want is to change the values in the second column of the first dataset to the values from the second column of the second dataset. All the names in the first dataset are in the second one, and obviously my dataset is much bigger than that.
I was trying to use R to do this, but I am very new at it. I was looking at the intersect command, but I am not sure if it's going to work. I don't include any code because I'm really lost here.
I also need the order of the first column (the names) in the first dataset to stay the same, just with the new values from the second column of the second dataset.
I agree with @agstudy: a simple use of merge would do the trick. Try something like this:
df1 <- data.frame(name = c("ab23242", "ab35366", "ab47490", "ab59614"),
                  X = c(72722, 88283, 99999, 114278.333))
df2 <- data.frame(name = c("ab35366", "ab47490", "ab59614", "ab23242"),
                  X = c(12345, 23456, 34567, 456789))

df.merge <- merge(df1, df2, by = "name", all.x = TRUE)
df.merge <- df.merge[, -2]  # drop the old value column (X.x)
Output:
name X.y
1 ab23242 456789
2 ab35366 12345
3 ab47490 23456
4 ab59614 34567
Note that merge() sorts the result on the by column by default (and with sort = FALSE the row order is unspecified), so to keep the order of the first frame strictly, add an index column with df1$order <- 1:nrow(df1) before merging and sort on it afterwards.
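A sketch of that order-preserving approach, reusing df1 and df2 from above (order is a temporary helper column):

df1$order <- 1:nrow(df1)                       # remember the original row order
df.merge <- merge(df1, df2, by = "name", all.x = TRUE)
df.merge <- df.merge[order(df.merge$order), ]  # restore the original order
df.merge$order <- NULL                         # drop the helper column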
df1 <- data.frame(name1 = letters[6:10], valuecol1 = seq(2, 10, by = 2))
df2 <- data.frame(name2 = letters[1:10], valuecol2 = 10:1)

df2[match(df1$name1, df2$name2), "valuecol2"] <- df1[df1$name1 %in% df2$name2, "valuecol1"]
df2
name2 valuecol2
1 a 10
2 b 9
3 c 8
4 d 7
5 e 6
6 f 2
7 g 4
8 h 6
9 i 8
10 j 10
This is what I thought might work, but doing replacements using indexing with match sometimes bites me in ways I need to adjust:
df2 [match(df1$name1, df2$name2) , "valuecol2"] <-
df1[ match(df1$name1, df2$name2) , "valuecol1"]
Here's how I tested it (edited).
> df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
> df1<- data.frame( name1 = letters[1:5], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f 5
7 g 4
8 h 3
9 i 2
10 j 1
Yep.... bitten again: when df1's names sit further down in df2, match() returns positions in df2 (here 6:10), and using those to index df1, which has only 5 rows, runs past the end and yields NAs:
> df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f NA
7 g NA
8 h NA
9 i NA
10 j NA
How about this:
library(data.table)
# generate some random data
dt.1 <- data.table(id = 1:1000, value = rnorm(1000), key = "id")
dt.2 <- data.table(id = 2 * (500:1), value = as.numeric(1:500), key = "id")

# objective: replace value in dt.1 with value from dt.2 where the ids match.
# data.table joins - very efficient
# after the join, dt.1 has 3 columns: id, value (from dt.2), and i.value,
# dt.1's original value (older data.table versions named this column value.1)
dt.1 <- dt.2[dt.1, nomatch = NA]
dt.1[is.na(value), value := i.value]  # keep the original value where there was no match
dt.1[, i.value := NULL]               # get rid of the extra column
NB: This sorts dt.1 by id which should be OK since it's sorted that way already.
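With current data.table the same objective can be written more directly as an update join (the same idiom used in the earlier data.table answer), which modifies dt.1 in place and preserves its row order. A sketch, assuming dt.1 and dt.2 as generated above, before the join:

# overwrite value in dt.1 with dt.2's value where the ids match
dt.1[dt.2, value := i.value, on = "id"]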
Also: In future, please include data that can be imported into R. Images are not useful!