How do I create a data frame based on values from another data frame? - r

I have a data frame, and from it I created another data frame
that holds a selection made after a number of conditions. The result is a set of countries that satisfy the condition n > 11.
Now I want to continue working with only these countries. How can I copy values from the first data set based on the countries in the selection?
My first df looks like:
and the second (so the selection of countries):
In my final df I need every column and row from my 1st df (but only for the countries present in the second df)

I'm not sure about your data or why you need the second data frame, but calling the first and second data frames df1 and df2, you can do:
library(dplyr)
df1 %>%
filter(Country.o... %in% df2$Country.o...)
(I cannot tell what the column name is. Please do not post your data as an image.)
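To make the filtering idea concrete, here is a minimal base R sketch on made-up data. The column name Country and all values are assumptions, since the real column names are not visible in the question.

```r
# Toy data: the column name `Country` and every value here are hypothetical.
df1 <- data.frame(
  Country = c("A", "A", "B", "C", "C", "C"),
  value   = 1:6
)
# df2 plays the role of the selection: the countries that passed the condition.
df2 <- data.frame(Country = c("A", "C"))

# Keep every column of df1, but only the rows whose country appears in df2.
df3 <- df1[df1$Country %in% df2$Country, ]
df3
```

Because this only tests membership, it adds no columns from df2 and cannot duplicate rows of df1, even if a country is listed more than once in df2.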

Two options:
1. Do an inner join.
a) Base R:
df3 <- merge(df1, df2, by = 'Country')
b) dplyr:
library(dplyr)
df3 <- inner_join(df1, df2, by = 'Country')
2. Instead of creating df2 from df1, I would just filter the first data frame to get the result directly.
df3 <- df1 %>% group_by(Country) %>% filter(n() > 11)
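For what it's worth, the same one-step filter can be sketched in base R with ave(), assuming that n in the question means the number of rows per country. The data below is made up for illustration.

```r
# Hypothetical data: country "A" has 12 rows, country "B" has 3.
df1 <- data.frame(
  Country = c(rep("A", 12), rep("B", 3)),
  value   = 1:15
)

# ave() returns, for every row, the size of that row's Country group.
group_size <- ave(seq_along(df1$Country), df1$Country, FUN = length)

# Keep only the rows that belong to countries with more than 11 rows.
df3 <- df1[group_size > 11, ]
unique(df3$Country)
```

This avoids building the intermediate selection at all, which is the point of the second option.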

Related

How to subset the data frame based on selected variable with limited column?

I would like to subset a data frame to selected rows and a limited set of columns, as my data frame has many columns.
My sample data:
df <- data.frame('ID'=c('A','B','C'),'YEAR'=c('2020','2020','2020'),'MONTH'=c('1','1','1'),'DAY'=c('16','16','16'),'HOUR'=c('15','15','15'),'VALUE1'=c(1,2,3))
I would like to keep the rows where ID == 'C' and only the columns 'ID' and 'VALUE1'.
Expected output:
ID VALUE1
1 C 3
Appreciate any help...!
What I have tried so far:
df1 <- subset(df,df$ID=='C')
df2 <- subset(df1,select=c('ID','VALUE1'))
Is there a more efficient way to do this, without creating multiple intermediate data frames?
You can also chain dplyr verbs:
library(dplyr)
df %>% select(ID,VALUE1) %>% filter(ID=="C")
We can pass both the subset and select arguments to subset() in one call:
subset(df, subset = ID=='C', select = c('ID', 'VALUE1'))
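Another base R alternative, shown here only as a sketch on the sample data from the question, is to index rows and columns in a single bracket call:

```r
# Sample data from the question.
df <- data.frame(
  ID = c("A", "B", "C"), YEAR = c("2020", "2020", "2020"),
  MONTH = c("1", "1", "1"), DAY = c("16", "16", "16"),
  HOUR = c("15", "15", "15"), VALUE1 = c(1, 2, 3)
)

# Rows go before the comma, columns after it.
res <- df[df$ID == "C", c("ID", "VALUE1")]
res
```

Note that the result keeps the original row name (3 here); rownames(res) <- NULL resets it if the output should start at 1.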

Merging of dataframes with different number of columns

I have these two dataframes.
DF1:
DF2:
I want my output DF to be DF1 along with the value of X1 from DF2. That is, this is how I want the output to look:
I have tried using merge and join, but am unable to get this required output. The primary problem seems to be due to the fact that the ID in DF1 has multiple matches in DF2. The resulting dataframe I get has all the rows, somewhat like this:
How do I fix this?
Thanks.
(apologies for table images, I wasn't able to figure out how to create a table on the fly)
You can use match to return the first hit in DF2.
DF1$X1 <- DF2$X1[match(DF1$ID, DF2$ID)]
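Since the original tables were posted as images, here is a small sketch on made-up data showing what match does when an ID occurs more than once in DF2. All values are hypothetical.

```r
# Hypothetical data: ID 1 appears twice in DF2, ID 3 does not appear at all.
DF1 <- data.frame(ID = c(1, 2, 3), A = c("a", "b", "c"))
DF2 <- data.frame(ID = c(1, 1, 2, 4), X1 = c(10, 11, 20, 40))

# For each DF1$ID, match() gives the position of its FIRST occurrence in DF2$ID,
# or NA when there is none.
idx <- match(DF1$ID, DF2$ID)

# Indexing with those positions keeps DF1 at its original number of rows.
DF1$X1 <- DF2$X1[idx]
DF1
```

Rows of DF1 without a match are kept and get NA in X1, whereas an inner join would drop them.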
Keep unique values in terms of ID in the second data frame and then join:
library(tidyverse)
DF2 <- DF2 %>%
  distinct(ID, .keep_all = TRUE) %>%
  select(ID, X1)
res <- DF1 %>%
  inner_join(DF2, by = "ID")
glimpse(res)

R look up matches in a column

I have a data frame with a single column and almost 20,0000 rows:
df1 %>% values c(10,20,30,50)
and another data frame with multiple columns, one of which is also called values:
df2 %>% id c(24782,18741,17041,10471401)
values c(70,90,10,20,50)
There are more columns in this one; the data set has 50,00000 rows of 13 variables.
I want to check whether the values column in df1 is %in% the values column of df2, and put the result in a new column in a new data frame.
df3 <- df2 %>%
mutate(newvalue = ifelse(df1$values %in% df2$values,1,0))
Error: Column ... must be length ... (the number of rows) or one, not ...
Two problems:
1. Given that you are modifying df2, your order is wrong. df1$values %in% df2$values tells you, for each df1$values item, whether it is in df2. So the result is as long as df1, not df2. It doesn't make sense to put that information in df2, because it is a result about df1. You either need to add the column to df1, or switch the order and use df2$values %in% df1$values (I think this is what you want).
2. dplyr functions expect unquoted column names of the data frame argument. So, if you pipe df2 into mutate, you don't use df2$ inside that mutate.
Making both these corrections, you get
df3 <- df2 %>% mutate(newvalue = ifelse(values %in% df1$values,1,0))
As an extra tip, %in% returns a logical (TRUE/FALSE) result. You don't need ifelse to convert that to 1/0; as.integer is more efficient and gives the same values.
df3 <- df2 %>% mutate(newvalue = as.integer(values %in% df1$values))
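To see why the direction of %in% matters, here is a small base R sketch. The values come from the question, while the id column is a made-up placeholder because the question's ids and values have different lengths.

```r
df1 <- data.frame(values = c(10, 20, 30, 50))
# `id` is hypothetical; only `values` is taken from the question.
df2 <- data.frame(id = 1:5, values = c(70, 90, 10, 20, 50))

# The result of x %in% y always has the length of x (the left-hand side).
length(df1$values %in% df2$values)  # one entry per row of df1
length(df2$values %in% df1$values)  # one entry per row of df2

# So a new column for df2 needs df2's values on the left.
df3 <- df2
df3$newvalue <- as.integer(df3$values %in% df1$values)
df3
```

With the left-hand side taken from df2, the new column has exactly one entry per row of df2, which is what mutate requires.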

Binding dataframes with matching country names

I have two data frames of country data.
df1 has all the countries of the world.
df2 has a subset of countries but has the populations in one of its columns.
I want to take the population data and add it to df1 where the country names are a match.
If df1$Column1 = df2$Column1 (same country name), then populate df1$Column2 (currently empty) with the information from df2$Column2 (the country's population) in the row for that matching country.
I tried to merge the two using the column "NAME", which they both have for country names:
total <- merge(map,Co2_2x, by="NAME")
The columns are all there, but I get empty rows in my new data frame.
I'd like to be able to say "for this row and column matrix position in df1 (the country), get the row (country name match in df2) and column X (population data). Then put it in this row and column Y matrix position in df1 (new population column in df1 for the matched country name)"... There must be an easier way :-)
Here is my code. I'd like to fill map$Measure with data from Co2_2x$premium where the countries match.
library(XML)
library(raster)
library(rgdal)
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())
polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")
polygons
map <- as.data.frame(polygons)
map$Measure <- 0
library(rvest)
Co2 <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions")
Co2_2x <- Co2 %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
names(Co2_2x)[2] <- "premium"
names(Co2_2x)[1] <- "NAME"
total <- merge(map,Co2_2x, by="NAME")
Thanks!
To keep the rows of the first data set that have no match in the other data set, add the all.x = TRUE option, as follows (see the documentation of merge for details):
total <- merge(map, Co2_2x, by = "NAME", all.x = TRUE)
These rows will then appear with NA in the columns coming from the second data set.
If the matching doesn't seem to work, make sure that your matching variable (in your case, NAME) is filled in exactly the same way in both data sets (letter case, possible leading or trailing spaces...).
This answer provides a fine way of doing so.
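As an illustration of that kind of clean-up (a sketch on made-up country names, not the linked answer's code), one option is to build a normalised key with trimws() and tolower() and merge on that key. The data frame co2 and the helper normalise are hypothetical names.

```r
# Hypothetical data with inconsistent case and stray spaces in NAME.
map <- data.frame(NAME = c("France", "Germany ", "spain"), Measure = 0)
co2 <- data.frame(NAME = c("france", " Germany", "Spain", "Italy"),
                  premium = c(1, 2, 3, 4))

# Hypothetical helper: strip surrounding whitespace and lower-case the names.
normalise <- function(x) tolower(trimws(x))
map$key <- normalise(map$NAME)
co2$key <- normalise(co2$NAME)

# Names in map that still have no partner after normalising (worth inspecting).
unmatched <- setdiff(map$key, co2$key)

# Left join on the cleaned key; map keeps all of its rows.
total <- merge(map, co2[, c("key", "premium")], by = "key", all.x = TRUE)
total
```

Checking unmatched before merging shows which names still differ for other reasons, such as alternative spellings.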
You can use the sqldf package in R.
Just follow the code below and you'll be able to merge (join) your two data sets:
library(sqldf)
merged_data <- sqldf("select a.country, b.population from df1 as a
left join df2 as b on (a.country = b.country) group by 1")
Thanks and happy R-programming!!!

Split dataframe, then select random observations from list, and the lists merge back into a dataframe

I have a data frame with 3 variables (subject, trialtype, and RT), and I need to randomly select half of the RT observations for each subject, and then re-create the data frame from that selection.
In browsing the list I've got up to here
split_df <- split(bucnidata_rt,
list(bucnidata_rt$Subject, bucnidata_rt$trialtype))
(this gives a series of split_df[1], split_df[2], ....)
But then I can not subset using this
split_df[1] <- sample(nrow(split_df[1]), 24), ]
I think because sample only works on data frames and this split_df[1] is a list.
To re-merge I would do:
remerged_df <- unsplit(split_df[1],
list(bucnidata_rt$Subject, bucnidata_rt$trialtype))
Could you please help me with step 2?
I propose a slightly different approach using dplyr if you don't mind. You can group by subject and then randomly select 50% of observations of each group:
library(dplyr)
bucnidata_rt %>%
group_by(Subject) %>%
sample_frac(size = 0.5)
Edit
Here's another way, closer to what you started. I use the mtcars dataset in this case:
split_df <- split(mtcars, mtcars$cyl) #split by `cyl`
#randomly select 50% of rows per group, without replacement
split_df <- lapply(split_df, function(x) x[sample(seq_len(nrow(x)), nrow(x)/2, replace=FALSE),])
#merge the randomly selected list elements back into one data.frame
remerged_df <- do.call(rbind, split_df)
#check the result
nrow(remerged_df)
#[1] 15
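The same pattern can be sketched for the two-factor split from the question. The data below is simulated and only borrows the question's column names. Note that split_df[[1]] (double brackets) is a data frame, while split_df[1] is a list of length one, which is why nrow(split_df[1]) does not work.

```r
set.seed(42)  # only to make the sketch reproducible
# Simulated stand-in for the real data: 2 subjects x 2 trial types, 4 rows each.
bucnidata_rt <- data.frame(
  Subject   = rep(c("s1", "s2"), each = 8),
  trialtype = rep(c("a", "b"), times = 8),
  RT        = runif(16)
)

# One list element per Subject/trialtype combination.
split_df <- split(bucnidata_rt,
                  list(bucnidata_rt$Subject, bucnidata_rt$trialtype),
                  drop = TRUE)

# Sample half of the rows (rounded down) inside every element.
half <- lapply(split_df, function(x) {
  x[sample(nrow(x), floor(nrow(x) / 2)), , drop = FALSE]
})

# Stack the sampled pieces back into one data frame.
remerged_df <- do.call(rbind, half)
table(remerged_df$Subject, remerged_df$trialtype)
```

Using floor() makes the rounding for odd group sizes explicit rather than relying on how sample() treats a fractional size.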
Edit #2: corrected the dplyr method after a comment by @Gregor
