I have these two dataframes.
DF1:
DF2:
I want my output DF to be DF1 along with the value of X1 from DF2. That is, this is how I want the output to look:
I have tried using merge and join, but am unable to get this required output. The primary problem seems to be due to the fact that the ID in DF1 has multiple matches in DF2. The resulting dataframe I get has all the rows, somewhat like this:
How do I fix this?
Thanks.
(apologies for table images, I wasn't able to figure out how to create a table on the fly)
You can use match to return the first hit in DF2.
DF1$X1 <- DF2$X1[match(DF1$ID, DF2$ID)]
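To see why match avoids the duplicated rows, here is a minimal sketch on made-up data (the ID values and the column A are assumptions, since the original tables were posted as images). match returns only the position of the first hit for each DF1$ID, so DF1 keeps exactly its original rows.

```r
# Toy data (hypothetical): ID 2 appears twice in DF2, ID 3 not at all
DF1 <- data.frame(ID = c(1, 2, 3), A = c("a", "b", "c"))
DF2 <- data.frame(ID = c(2, 2, 1), X1 = c("first", "second", "only"))

# For each DF1$ID, match() gives the position of its first occurrence
# in DF2$ID, or NA when there is no match
idx <- match(DF1$ID, DF2$ID)

# Indexing with idx therefore yields one X1 value per row of DF1
DF1$X1 <- DF2$X1[idx]
DF1
```

IDs without a match in DF2 end up with NA in X1 rather than being dropped.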
Keep unique values in terms of ID in the second data frame and then join:
library(tidyverse)
DF2 <- DF2 %>%
distinct(ID, .keep_all = TRUE) %>%
select(ID, X1)
res <- DF1 %>%
inner_join(DF2, by = "ID")
glimpse(res)
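A small sketch of the same idea on made-up data (the values and the extra column X2 are assumptions). It also shows one thing to watch for: inner_join drops DF1 rows that have no match in DF2, while left_join keeps them with NA.

```r
library(dplyr)

# Toy data (hypothetical): ID 2 is duplicated in DF2, ID 3 is missing from it
DF1 <- data.frame(ID = c(1, 2, 3), A = c("a", "b", "c"))
DF2 <- data.frame(ID = c(2, 2, 1), X1 = c("first", "second", "only"), X2 = 7:9)

DF2_first <- DF2 %>%
  distinct(ID, .keep_all = TRUE) %>%  # keep the first row for each ID
  select(ID, X1)

inner <- inner_join(DF1, DF2_first, by = "ID")  # ID 3 is dropped
left  <- left_join(DF1, DF2_first, by = "ID")   # ID 3 is kept, X1 is NA
```

If the output should always have as many rows as DF1, left_join is probably the safer choice.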
For a university project, I'm working with large stock price data frames.
I have two data frames.
Data frame df1 contains the daily closing prices over a certain period. The header contains each stock's shortcut.
Data frame df2 has the stock's shortcut in the first column and the industry name of the stock's firm in the second column. Note that df2 contains more values than df1 (but every value in df1 should be in df2).
Is there a way to insert the second column of df2 into the first row of df1 wherever they match (that is, where a df1 header value equals a value in the first column of df2)?
# Example Code
df1=as.data.frame(matrix(runif(20,min=0,max=1), nrow = 4))
df1
df2 <- as.data.frame(c("V1","V829","V2","V3","V493","V4","V5","V6","V992","V7"))
df2$insert <- c("test1","test2","test3","test4","test5","test6","test7","test8","test9","test10")
names(df2) <- c("Column2","test")
df1
df2
# Now insert/combine df2$test in (or over) df1[1,] as a row, if names(df1) and df2$Column2 matches
Thank you for your answers guys!
Nino
I would recommend you reshape your df1 into long format (see Reshaping data.frame from wide to long format).
library(dplyr)  # for left_join()
library(tidyr)
df1_long <- df1 %>% gather(Instrument, value, -X)
I would organize the data this way because it makes it easier to use left_join() to match the data frames (see the description of mutating joins on the data wrangling cheat sheet).
df <- left_join(df1_long, df2, by = "Instrument")
If you want you can then make your dataframe wide again using the spread() function, which is the reverse of gather().
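The reshape-then-join route can be sketched on the example data from the question. This is only an illustration under assumptions: the `row` column is a hypothetical stand-in for a date column, df2 is shortened, and pivot_longer() is used because current tidyr documents gather() and spread() as superseded by pivot_longer() and pivot_wider().

```r
library(dplyr)
library(tidyr)

set.seed(1)
df1 <- as.data.frame(matrix(runif(20), nrow = 4))  # columns V1..V5
df2 <- data.frame(
  Column2 = c("V1", "V829", "V2", "V3", "V493", "V4", "V5"),
  test    = paste0("test", 1:7)
)

df_long <- df1 %>%
  mutate(row = row_number()) %>%   # hypothetical row id (e.g. a date)
  pivot_longer(-row, names_to = "Column2", values_to = "value") %>%
  left_join(df2, by = "Column2")   # attach df2$test to every price

# Base R alternative: look up the df2 entry for each df1 column name
header_info <- df2$test[match(names(df1), df2$Column2)]
```

header_info lines up with names(df1), so it can be kept as a separate lookup vector instead of being mixed into the numeric price rows.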
For the future I recommend you generate a reproducible example, rather than linking image files of your dataframes, as the links might expire, and it makes it generally less likely to get an answer on Stack Overflow.
I have a data frame, and from it I created another data frame that contains a selection made after applying a number of conditions. The result is a set of countries that satisfy the condition n > 11.
Now I want to continue working with only these countries. How can I copy values from the first dataset based on the countries in the selection?
My first df looks like:
and the second (so the selection of countries):
In my final df I need every column and row from my first df (but only for the countries present in the second df).
I'm not sure about your data or why you need the second data frame, but calling the first and second data frames df1 and df2:
library(dplyr)
df1 %>%
filter(Country.o... %in% df2$Country.o...)
(I cannot tell what the column name is. You should not post your data as an image.)
Two options:
1. Do an inner join.
a) Base R:
df3 <- merge(df1, df2, by = 'Country')
b) dplyr:
library(dplyr)
df3 <- inner_join(df1, df2, by = 'Country')
2. Instead of creating df2 from df1, just filter the first data frame to get the result directly.
df3 <- df1 %>% group_by(Country) %>% filter(n() > 11)
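A minimal sketch of both routes on made-up data (the column names Country and value, and the counts, are assumptions because the original data was posted as an image). semi_join() is shown as a variant of the join option that keeps only df1's columns.

```r
library(dplyr)

# Toy data (hypothetical): country A has 12 rows, country B has 3
df1 <- data.frame(
  Country = rep(c("A", "B"), times = c(12, 3)),
  value   = 1:15
)

# Direct route: keep only groups with more than 11 rows
df3 <- df1 %>%
  group_by(Country) %>%
  filter(n() > 11) %>%
  ungroup()

# Two-step route: build the selection, then filter df1 by it
df2  <- df1 %>% count(Country) %>% filter(n > 11)
same <- semi_join(df1, df2, by = "Country")  # no columns from df2 are added
```

Both results contain the 12 rows for country A. Unlike inner_join(), semi_join() does not carry the n column from df2 into the output.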
I have loaded 2 excel files. Each excel file contains a data frame.
First df looks like this:
number
091239
091212
092233
Second df2 looks like this:
name number
R 2340
K 092233
S 345
L 091212
How can I find the rows of the second data frame (df2) whose "number" value also appears in the "number" column of the first df?
Because I am learning dplyr, I would greatly appreciate dplyr solutions.
I have tried this code:
filtered <- df2%>%
distinct(number, df$number, .keep_all = T)
If you want to filter for the duplicates:
filter(df2, number %in% df$number)
Negate the condition (!number %in% df$number) if you want to keep the rows whose values are not in df.
Something like inner_join might do the trick here:
inner_join(df2, df, by = "number")
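Both suggestions can be tried on the example data from the question. In this sketch the numbers are stored as character strings so that the leading zeros survive; that is an assumption about how the Excel files were imported.

```r
library(dplyr)

# Example data from the question, with "number" kept as character
df  <- data.frame(number = c("091239", "091212", "092233"))
df2 <- data.frame(name   = c("R", "K", "S", "L"),
                  number = c("2340", "092233", "345", "091212"))

in_both <- df2 %>% filter(number %in% df$number)   # rows K and L
not_in  <- df2 %>% filter(!number %in% df$number)  # rows R and S
joined  <- inner_join(df2, df, by = "number")      # same rows as in_both
```

If one file stores the numbers as numeric and the other as text, the comparison will not behave as expected, so it is worth checking the column types with str() first.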
I have a data frame with a single column and almost 20,0000 rows.
df1 %>% values c(10,20,30,50)
I have another data frame with multiple columns, one of which is also called values.
df2 %>% id c(24782,18741,17041,10471401)
values c(70,90,10,20,50)
It has more columns than shown here; the full data set has 50,00000 rows of 13 variables.
I want to check whether each value of the values column in df1 is %in% the values column of df2, and put the result in a new column in a new data frame.
df3 <- df2 %>%
mutate(newvalue = ifelse(df1$values %in% df2$values,1,0))
Error: Column ... must be length ... (the number of rows) or one, not ...
Two problems.
1. Given that you are modifying df2, your order is wrong. df1$values %in% df2$values tells you, for each df1$values item, whether it is in df2. So the result is as long as df1, not df2. It doesn't make sense to put that information in df2, because it is a result about df1. You either need to add the column to df1, or switch the order and use df2$values %in% df1$values (I think this is what you want).
2. dplyr functions expect unquoted column names of the data frame argument. So, if you pipe df2 into mutate, you don't use df2$ inside that mutate.
Making both these corrections, you get
df3 <- df2 %>% mutate(newvalue = ifelse(values %in% df1$values,1,0))
As an extra tip, %in% returns a logical (TRUE/FALSE) result. You don't need ifelse to convert that to 1/0; as.integer gives the same result more efficiently.
df3 <- df2 %>% mutate(newvalue = as.integer(values %in% df1$values))
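The length mismatch behind the error can be checked directly on the small example. In this sketch the fifth id is a made-up placeholder, because the question lists only four ids for five values.

```r
library(dplyr)

df1 <- data.frame(values = c(10, 20, 30, 50))
df2 <- data.frame(
  id     = c(24782, 18741, 17041, 10471401, 5),  # last id is hypothetical
  values = c(70, 90, 10, 20, 50)
)

# The result of x %in% y always has the length of x
length(df1$values %in% df2$values)  # 4: one entry per row of df1
length(df2$values %in% df1$values)  # 5: one entry per row of df2

# Inside mutate(), refer to df2's own column without the df2$ prefix
df3 <- df2 %>% mutate(newvalue = as.integer(values %in% df1$values))
```

Only the second orientation has one entry per row of df2, which is what mutate() needs when it adds a column to df2.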
I have two data frames of country data.
df1 has all the countries of the world.
df2 has a subset of countries but has the populations in one of its columns.
I want to take the population data and add it to df1 where the country names are a match.
If df1$Column1 = df2$Column1 (same country name), then populate df1$Column2 (currently empty) with the information from df2$Column2 (the country's population) in the row for that matching country.
I tried to merge the two using the column "NAME", which they both have for country names:
total <- merge(map,Co2_2x, by="NAME")
The columns are all there, but I get empty rows in my new data frame.
I'd like to be able to say "for this row and column matrix position in df1 (the country), get the row (country name match in df2) and column X (population data). Then put it in this row and column Y matrix position in df1 (new population column in df1 for the matched country name)"... There must be an easier way :-)
Here is my code. I'd like to fill map$Measure with data from Co2_2x$premium where the countries match.
library(XML)
library(raster)
library(rgdal)
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())
polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")
polygons
map <- as.data.frame(polygons)
map$Measure <- 0
library(rvest)
Co2 <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions")
Co2_2x<-Co2 %>%
html_nodes("table") %>%
.[[1]] %>%
html_table()
names(Co2_2x)[2]<-paste("premium")
names(Co2_2x)[1]<-paste("NAME")
total <- merge(map,Co2_2x, by="NAME")
Thanks!
To keep the rows of the first dataset that have no match in the other dataset, you just need to add the all.x = TRUE argument, as follows (have a look at the documentation of merge for details):
total <- merge(map, Co2_2x, by = "NAME", all.x = TRUE)
These rows will then appear with NA in the second dataset columns.
If the matching doesn't seem to work, you may want to make sure that your matching variable (in your case, NAME) is written exactly the same way in the two datasets (letter case, leading or trailing spaces...).
This answer provides a fine way of doing so.
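As a minimal sketch of both points, here is a base R example on made-up data (the country names, the premium values and the helper column `key` are all hypothetical). It shows all.x = TRUE keeping unmatched rows, and one possible way to normalise the key with trimws() and tolower() before merging.

```r
# Toy data (hypothetical): one name has a trailing space, one differs in case
map <- data.frame(NAME = c("France", "Germany ", "peru"), Measure = 0)
co2 <- data.frame(NAME = c("France", "Germany", "Peru"),
                  premium = c(1.5, 2.5, 3.5))

# Raw keys: all rows of map are kept, but only "France" finds a match
raw <- merge(map, co2, by = "NAME", all.x = TRUE)

# Normalised helper key: strip surrounding whitespace and lower-case
map$key <- tolower(trimws(map$NAME))
co2$key <- tolower(trimws(co2$NAME))
clean <- merge(map, co2[, c("key", "premium")], by = "key", all.x = TRUE)
```

Real country lists often differ in more than case and spacing (for example alternative spellings), so the NA rows that remain after a merge like this are worth inspecting by hand.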
You can use the sqldf library in R.
The code below merges (joins) the two datasets that you have:
library(sqldf)
merged_data <- sqldf("select a.country, b.population from df1 as a
left join df2 as b on (a.country = b.country) group by 1")
Thanks and happy R-programming!!!