Repeated values when joining data frames in R - r

When I merge data frames, I use this code:
library(readxl)
df1 <- read_excel("C:/Users/PC/Desktop/precipitaciones_4Q.xlsx")
df2 <- read_excel("C:/Users/PC/Desktop/libro_copia_1.xlsx")
df1 = data.frame(df1)
df2 = data.frame(df2)
df1$codigo = toupper(df1$codigo)
df2$codigo = toupper(df2$codigo)
dat = merge.data.frame(df1,df2,by= "codigo", all.y = TRUE,sort = TRUE)
The data contains rainfall by county; df1 has fewer counties than df2. I want to add the rainfall data for the counties in df1 to df2.
The problem is that when the county data is pasted into df2, repeated counties appear.
df1:
df2:

Instead of "id", you must specify the column names to join on from the first and second tables.
You can use the data.table package and code below:
library(data.table)
dat <- merge(df1, df2, by.x = "Columna1", by.y = "prov", all.y = TRUE)
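Alternatively, you can use the funion function from data.table (it expects data.table inputs, so convert with setDT() first if needed):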
dat <- funion(df1, df2)
or rbind function:
dat <- rbind(df1, df2)
dat <- unique(dat)
Note: the column names and the number of columns of the two data frames need to be the same.
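Repeated counties after a merge usually mean that a key value appears more than once in one of the inputs, since merge returns every pairing of matching rows. Below is a minimal sketch for checking this before merging. It uses small made-up stand-ins for df1 and df2: the values and the lluvia and nombre columns are hypothetical, and only codigo comes from the question.

```r
# Hypothetical stand-ins for the two spreadsheets
df1 <- data.frame(codigo = c("A1", "A1", "B2"),
                  lluvia = c(10, 12, 7))
df2 <- data.frame(codigo = c("A1", "B2", "C3"),
                  nombre = c("Uno", "Dos", "Tres"))

# Key values that occur more than once in df1
dup_keys <- unique(df1$codigo[duplicated(df1$codigo)])

# Each df2 row is repeated once per matching df1 row
dat <- merge(df1, df2, by = "codigo", all.y = TRUE)

# Keeping only the first df1 row per key is one (lossy) way
# to end up with a single row per county
df1_first <- df1[!duplicated(df1$codigo), ]
dat_unique <- merge(df1_first, df2, by = "codigo", all.y = TRUE)
```

Whether to keep the first row, aggregate the repeated rows, or keep them all depends on what the repeated rows in df1 actually represent.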

Related

How to use for loop to join multiple dataframe?

I have a fixed data frame that I would like to join to multiple other data frames. Here is my example:
df1 <- data.frame (first_column = c("key", "key"),
second_column = c("a", "a")
)
df2 <- data.frame (first_column = c("key", "key"),
second_column = c("b", "b")
)
df3 <- data.frame (first_column = c("key", "key"),
second_column = c("c", "c")
)
join <- data.frame (first_column = c("key", "key"),
join_column = c("join", "join")
)
# df1, df2 and df3 are the data frames that need to be joined with join
I tried to write a for loop to join them:
for (i in 1:length(df.list)) {
df <- df.list[[i]]
assign(paste(names(df.list[[i]])),"_joined"),left_join(df, join, by = c("first_column"= "first_column"))
}
However, I have encountered 2 problems:
I cannot create a variable name from the name of the current element [i] in the for loop.
I cannot create 3 different data frames with this for loop.
Below is the result that I want to get:
> df1_joined
first_column second_column join_column
1 key a join
2 key a join
> df2_joined
first_column second_column join_column
1 key b join
2 key b join
> df3_joined
first_column second_column join_column
1 key c join
2 key c join
Many Thanks!
You can put the dataframes in a list and join them using lapply.
list_df <- list(df1, df2, df3)
result <- lapply(list_df, function(x)
merge(x, join, by = 'first_column', all.x = TRUE))
To get separate joined dataframes assign them the names and use list2env.
names(result) <- sprintf('df%d_joined', seq_along(result))
list2env(result, .GlobalEnv)
Your paste(names(df.list[[i]])),"_joined") gets the names of a single data frame, which are its column names.
Switching to paste0("df", i, "_joined") should give you the correct variable name (paste0 avoids the spaces that paste inserts by default).
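Here is a minimal sketch of a corrected loop, assuming the data frames are stored in a named list called df.list (the question does not show how df.list is built). Because both inputs repeat the key, joining them directly returns every pairing: four rows instead of the two in the desired output. The sketch therefore dedupes the lookup table first, which is an assumption about the intent.

```r
df1 <- data.frame(first_column = c("key", "key"), second_column = c("a", "a"))
df2 <- data.frame(first_column = c("key", "key"), second_column = c("b", "b"))
df3 <- data.frame(first_column = c("key", "key"), second_column = c("c", "c"))
join <- data.frame(first_column = c("key", "key"),
                   join_column = c("join", "join"))

# Assumed structure: a named list, so the names can be reused below
df.list <- list(df1 = df1, df2 = df2, df3 = df3)

# Assumption: one row per key is wanted in the lookup table
lookup <- unique(join)

for (nm in names(df.list)) {
  joined <- merge(df.list[[nm]], lookup, by = "first_column", all.x = TRUE)
  # Creates df1_joined, df2_joined, df3_joined in the current environment
  assign(paste0(nm, "_joined"), joined)
}
```

Keeping the results in a list, as in the lapply answer, is often easier to work with than creating separate variables with assign.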
We can also do this with purrr and dplyr:
library(purrr)
library(dplyr)
out <- mget(ls(pattern = '^df\\d+$')) %>%
map(~ .x %>%
left_join(join, by = 'first_column'))

Return the row indices of df1 when those row values occur in df2 in R

I'm coding in R. I have a big data frame (df1) and a small data frame (df2). df2 is a subset of df1, but in a random order. I need to know the row indices of df1 whose rows occur in df2. Every individual cell value has lots of duplicates: Tapirus terrestris shows up more than once, as does each ModType value. I tried experimenting with which() and grepl() but couldn't get my code to work.
df1 <- data.frame(
SpeciesName = c('Tapirus terrestris', 'Panthera onca', 'Leopardus tigrinus' , 'Leopardus tigrinus'),
ModType = c('ANN', 'GAM', 'GAM','RF'),
Variable_scale = c('aspect_s2_sd', 'CHELSAbio1019_s3_sd','CHELSAbio1015_s4_sd','CHELSAbio1015_s4_sd'))
df2 <- data.frame(
SpeciesName = c('Tapirus terrestris', 'Leopardus tigrinus'),
ModType = c('ANN', 'RF'),
Variable_scale = c('aspect_s2_sd', 'CHELSAbio1015_s4_sd'))
The expected output is the vector 1, 4, because rows 1 and 4 of df1 occur in df2.
You can create an index column in df1 and merge the datasets.
df1$index <- 1:nrow(df1)
df3 <- merge(df1, df2)
df3$index
#[1] 4 1
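You can use match. Note that matching on SpeciesName alone returns only the first df1 row for each species, so with the example data it gives rows 1 and 3 rather than 1 and 4: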
df1[match(df2$SpeciesName, df1$SpeciesName), ]
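To compare whole rows rather than a single column, one option is to build one key string per row and test membership with %in%. This is a sketch, not part of the original answers. It assumes the separator character ("\r" here) never appears in the data and that both data frames have the same columns in the same order.

```r
df1 <- data.frame(
  SpeciesName = c('Tapirus terrestris', 'Panthera onca', 'Leopardus tigrinus', 'Leopardus tigrinus'),
  ModType = c('ANN', 'GAM', 'GAM', 'RF'),
  Variable_scale = c('aspect_s2_sd', 'CHELSAbio1019_s3_sd', 'CHELSAbio1015_s4_sd', 'CHELSAbio1015_s4_sd'))
df2 <- data.frame(
  SpeciesName = c('Tapirus terrestris', 'Leopardus tigrinus'),
  ModType = c('ANN', 'RF'),
  Variable_scale = c('aspect_s2_sd', 'CHELSAbio1015_s4_sd'))

# One key per row, built from all columns
key1 <- do.call(paste, c(df1, sep = "\r"))
key2 <- do.call(paste, c(df2, sep = "\r"))

# Row indices of df1 whose full row occurs in df2, in df1 order
idx <- which(key1 %in% key2)
idx
# [1] 1 4
```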
Another option is to use the tidyverse:
library(dplyr)
df1 %>%
mutate(index = row_number()) %>%
inner_join(df2)

Why using merge function in R creates duplicates?

I am running merge function in R:
Example:
DF <- merge(DF1, DF2, by = c("Date", "Time"), all.x= TRUE)
However, when I run the code, I get duplicated rows!
How can I get unique rows from the function, and why am I getting these duplicated rows?
We can get only the unique rows of DF1 and DF2 and then merge.
DF <- merge(unique(DF1), unique(DF2), by = c("Date", "Time"), all.x= TRUE)
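Note that unique() only drops rows that are identical in every column. merge returns one row for every pairing of rows that share a key, so if DF2 still has several rows for the same Date and Time, the matching DF1 row is repeated. Here is a small sketch with made-up data (the x and y columns and all values are hypothetical):

```r
DF1 <- data.frame(Date = "2020-01-01", Time = "10:00", x = 1)
DF2 <- data.frame(Date = c("2020-01-01", "2020-01-01"),
                  Time = c("10:00", "10:00"),
                  y = c(5, 6))

# Still two rows: the DF2 rows differ in y, so unique() keeps both
DF <- merge(unique(DF1), unique(DF2), by = c("Date", "Time"), all.x = TRUE)

# Keeping the first DF2 row per key gives one row per DF1 row
DF2_first <- DF2[!duplicated(DF2[c("Date", "Time")]), ]
DF_one <- merge(DF1, DF2_first, by = c("Date", "Time"), all.x = TRUE)
```

Which DF2 row to keep for a repeated key (the first, the last, or an aggregate) is a decision about the data, not something merge can make.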

How to use the spread function and left join data frames (DF1, DF2, DF3, up to DF621) in a for loop?

I have data frames DF1, DF2, DF3, DF4, up to DF621.
I want to use the spread function on them before left joining them (by the column GEOID) inside a for loop.
I should end up with one data frame containing all my data frames (DF1 to DF621).
Every data frame DF[i] contains 4 columns: normalized, GEOID, Name, variable.
For example:
DF21spread <- spread(DF21, variable, normalized)
test <- spread(DF20, variable, normalized) %>%
left_join(DF21spread, by ='GEOID')
A solution is to first rbind the data.frames together and then spread.
library(dplyr)
DF1 <- iris[1:50, ]
DF2 <- iris[51:100, ]
DF3 <- iris[101:150, ]
dfs <- mget(ls(pattern = "^DF\\d+$"))
bind_rows(dfs, .id = 'id')
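The answer above stops before the spread step. Here is a minimal sketch of the full bind-then-spread idea, using two small made-up data frames with the four columns described in the question. All values are hypothetical, and it assumes each data frame carries a different value of variable.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for DF1 ... DF621
DF1 <- data.frame(normalized = c(0.1, 0.2), GEOID = c("01", "02"),
                  Name = c("A", "B"), variable = "v1")
DF2 <- data.frame(normalized = c(0.3, 0.4), GEOID = c("01", "02"),
                  Name = c("A", "B"), variable = "v2")

# Collect every object named DF<number> into a list
dfs <- mget(ls(pattern = "^DF\\d+$"))

# Stack the rows, then give each variable its own column
wide <- bind_rows(dfs) %>%
  spread(variable, normalized)
```

The .id column is left out here because an extra identifier column would keep rows from different data frames from collapsing onto the same GEOID row.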

How to use loop to create new data.frame

I have 300 locations, say "A1, A2, A3, ..., A300", and 40 values. My locations are in df1 and my values are in df2. I want to add those 40 values to each location, so that location A1 has codes from 1 to 40, and so on.
I tried to make a for loop:
df1 <- c("A1", "A2", "A3")
df1 <- data.frame(df1)
colnames(df1) <- c("location")
df2 <- c(1:40)
df2 <-data.frame(df2)
colnames(df2) <- c("code")
data <- data.frame() #Empty data.frame
for (i in df2) {
temp <- df1
temp$code <- rep(i)
data1 <- rbind(data, temp)
}
This script results in an Error: 'replacement has 40 rows, data has 315'.
Can someone tell me what should I do to make this work?
Desired output:
We can use aggregate
aggregate(Value ~Location, df1, sum)
If the values are in a different dataset and are in the same order as 'Location' in the original dataset, and assuming there are no common columns to merge on, just do a cbind and aggregate:
aggregate(Value ~Location, cbind(df1, df2), sum)
Update
Based on the OP's update
expand.grid(location = df1$location, code = df2$code)
Or CJ from data.table
library(data.table)
CJ(location = df1$location, code = df2$code)
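As a quick check of the expand.grid approach, here is a sketch using the three example locations from the question. It yields one row per location/code pair, so 3 x 40 = 120 rows. The final ordering step is optional and only groups the rows by location.

```r
df1 <- data.frame(location = c("A1", "A2", "A3"))
df2 <- data.frame(code = 1:40)

# Every combination of location and code
data <- expand.grid(location = df1$location, code = df2$code)

# Optional: group rows so each location is followed by its 40 codes
data <- data[order(data$location, data$code), ]

nrow(data)
# [1] 120
```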
