R look up matches in a column - r

I have a data frame with a single column and almost 20,0000 rows:
df1 %>% values c(10,20,30,50)
and I have another data frame, that has multiple columns one of those columns is also values.
df2 %>% id c(24782,18741,17041,10471401)
values c(70,90,10,20,50)
There are more columns in here; this data set has 50,00000 rows of 13 variables.
I want to check whether each value in the values column of one data frame appears in the values column of the other, and put the result in a new column in a new data frame.
df3 <- df2 %>%
  mutate(newvalue = ifelse(df1$values %in% df2$values,1,0))
Error: Column ... must be length ... (the number of rows) or one, not ...

Two problems.
Given that you are modifying df2, your order is wrong. df1$values %in% df2$values tells you, for each df1$values item, whether it is in df2. So the result is as long as df1, not df2. It doesn't make sense to put that information in df2, because it is a result about df1. You either need to add the column to df1, or switch the order and use df2$values %in% df1$values (I think this is what you want).
dplyr functions expect unquoted column names of the data frame argument. So, if you pipe df2 into mutate, you don't use df2$ inside that mutate.
Making both these corrections, you get
df3 <- df2 %>% mutate(newvalue = ifelse(values %in% df1$values,1,0))
As an extra tip, %in% returns a logical (TRUE/FALSE) result. You don't need ifelse to convert that to 1/0; as.integer gives the same result more efficiently.
df3 <- df2 %>% mutate(newvalue = as.integer(values %in% df1$values))
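To see why the original call fails, it can help to compare the lengths of the two %in% results on toy data. This is a minimal sketch: the values below are made up for illustration (they are not the asker's real data), and it assumes dplyr is installed.

```r
library(dplyr)

# Toy data, invented for illustration only
df1 <- data.frame(values = c(10, 20, 30, 50))
df2 <- data.frame(id = c(1, 2, 3, 4, 5),
                  values = c(70, 90, 10, 20, 50))

# %in% returns one result per element of its left-hand side
length(df1$values %in% df2$values)  # 4, same as nrow(df1)
length(df2$values %in% df1$values)  # 5, same as nrow(df2)

# Flag the rows of df2 whose value also occurs in df1
df3 <- df2 %>%
  mutate(newvalue = as.integer(values %in% df1$values))
df3$newvalue  # 0 0 1 1 1
```

Because mutate needs a column as long as df2 (or of length one), only the second ordering fits.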

Related

how do I create a dataframe based on values from another dataframe?

I have a data frame, and from it I created another data frame,
which consists of a selection made after a number of conditions. The result is a number of countries that satisfy the condition n > 11.
Now I want to continue working with only these countries. How can I copy values from the first dataset based on the countries in the selection?
My first df looks like:
and the second (so the selection of countries):
In my final df I need every column and row from my 1st df (but only for the countries present in the second df)
I'm not sure about your data or why you need the second data frame, but calling the first and second data frames df1 and df2:
library(dplyr)
df1 %>%
  filter(Country.o... %in% df2$Country.o...)
(I cannot tell what the column name is. You should not post your data as an image.)
Two options -
Do an inner join
a) Base R -
df3 <- merge(df1, df2, by = 'Country')
b) dplyr -
library(dplyr)
df3 <- inner_join(df1, df2, by = 'Country')
Instead of creating df2 from df1, I would just filter the 1st one to get the resulting dataframe.
df3 <- df1 %>% group_by(Country) %>% filter(n() > 11)
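The two approaches (keeping rows with %in% versus filtering the groups directly) can be compared on toy data. In this sketch the column name Country, the data, and the threshold of 2 (standing in for 11) are all assumptions made for illustration; it assumes dplyr is installed.

```r
library(dplyr)

# Toy data; column names and the threshold of 2 are assumptions
df1 <- data.frame(Country = c("A", "A", "A", "B", "C", "C", "C"),
                  value = 1:7,
                  stringsAsFactors = FALSE)

# Approach 1: build the selection, then keep matching rows with %in%
df2 <- df1 %>% count(Country) %>% filter(n > 2)
via_in <- df1 %>% filter(Country %in% df2$Country)

# Approach 2: filter the groups directly, no second data frame needed
via_group <- df1 %>% group_by(Country) %>% filter(n() > 2) %>% ungroup()

nrow(via_in)     # 6
nrow(via_group)  # 6
```

One difference worth noting: an inner join would also bring along any extra columns of df2 (here the count column n), whereas the %in% filter keeps only the columns of df1.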

Merging of dataframes with different number of columns

I have these two dataframes.
DF1:
DF2:
I want my output DF to be DF1 along with the value of X1 from DF2. That is, this is how I want the output to look:
I have tried using merge and join, but am unable to get this required output. The primary problem seems to be due to the fact that the ID in DF1 has multiple matches in DF2. The resulting dataframe I get has all the rows, somewhat like this:
How do I fix this?
Thanks.
(apologies for table images, I wasn't able to figure out how to create a table on the fly)
You can use match to return the first hit in DF2.
DF1$X1 <- DF2$X1[match(DF1$ID, DF2$ID)]
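A small base R sketch of why merge duplicates rows while match does not. The data below is invented for illustration (the original tables were posted as images), with ID 1 occurring twice in DF2.

```r
# Toy data, invented for illustration; ID 1 occurs twice in DF2
DF1 <- data.frame(ID = c(1, 2, 3), Y = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(ID = c(1, 1, 2, 3), X1 = c("p", "q", "r", "s"),
                  stringsAsFactors = FALSE)

# merge() returns one row per matching pair, so ID 1 appears twice
nrow(merge(DF1, DF2, by = "ID"))  # 4

# match() gives the position of the first match only
idx <- match(DF1$ID, DF2$ID)      # 1 3 4
DF1$X1 <- DF2$X1[idx]
DF1$X1                            # "p" "r" "s"
```

IDs in DF1 with no match in DF2 would get NA in X1, since match returns NA for them.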
Keep unique values in terms of ID in the second data frame and then join:
library(tidyverse)
DF2 <- DF2 %>%
  distinct(ID, .keep_all = TRUE) %>%
  select(ID, X1)
res <- DF1 %>%
  inner_join(DF2, by = "ID")
glimpse(res)

Dataframe not populating when I am doing subset?

I have a dataframe that has two columns. One column is the product type, and the other is characters. I essentially want to break that column 'product' into 12 different data frames for each level. So for the first level, I am running this code:
df = df %>% select('product','comments')
df['product'] = as.character(df['product'])
df['comments'] = as.character(df['comments'])
Now that the dataframe is in the structure I want it, I want to take a variety of subsets, and here is my first subset code:
df_boatstone = df[df$product == 'water',]
#df_boatstone <- subset(df, product == "boatstone", select = c('product','comments'))
I have tried both methods, and the dataframe is being created, but has nothing in it. Can anyone catch my mistake?
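as.character works on a vector, whereas df['product'] and df['comments'] are each a data.frame with a single column. Applied to a one-column data frame, as.character collapses the whole column into a single string, which is then assigned to every row, so no row equals 'water'. Extract the column as a vector with [[ instead: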
df[['product']] <- as.character(df[['product']])
Or better would be
library(tidyverse)
df %>%
  select(product, comments) %>%
  mutate_all(as.character) %>%
  filter(product == 'water')
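The difference between single and double brackets can be checked directly. This is a base R sketch on made-up data (the product and comment values are assumptions):

```r
# Toy data, invented for illustration
df <- data.frame(product = c("water", "boatstone", "water"),
                 comments = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# Single brackets return a one-column data frame; as.character()
# turns the whole column into a single string
bad <- as.character(df["product"])
length(bad)  # 1

# Double brackets return the column as a vector, one element per row
good <- as.character(df[["product"]])
length(good)  # 3

df[["product"]] <- good
nrow(df[df$product == "water", ])  # 2
```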

How to Rename Column Headers in R

I have two separate datasets: one has the column headers and another has the data.
The first one looks like this:
where I want to make the 2nd column as the column headers of the next dataset:
How can I do this? Thank you.
In general you can use colnames, which returns the column names of your data frame or matrix as a character vector. You can then rename the columns with:
colnames(df) <- *listofnames*
Also it is possible just to rename one name by using the [] brackets.
This would rename the first column:
colnames(df2)[1] <- "name"
For your example, take the values of the second column of df1. Try this:
colnames(df2) <- as.character(df1[,2])
Make sure the number of names equals the number of columns.
Equivalent for rows is rownames()
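As a minimal base R sketch of the renaming step, with a length check before assigning. The two data frames here are invented for illustration, since the originals were posted as images:

```r
# Toy data, invented for illustration
df1 <- data.frame(id = 1:3,
                  header = c("name", "age", "city"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("Ann", "Bob"),
                  V2 = c(31, 45),
                  V3 = c("Oslo", "Rome"),
                  stringsAsFactors = FALSE)

new_names <- as.character(df1[, 2])

# Guard against a length mismatch before assigning
stopifnot(length(new_names) == ncol(df2))
colnames(df2) <- new_names

colnames(df2)  # "name" "age" "city"
```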
dplyr way w/ reproducible code:
library(dplyr)
df <- tibble(x = 1:5, y = 11:15)
df_n <- tibble(x = 1:2, y = c("col1", "col2"))
names(df) <- df_n %>% select(y) %>% pull()
I think the select() %>% pull() syntax is easier to remember than list indexing. Also I used names over colnames function. When working with a dataframe, colnames simply calls the names function, so better to cut out the middleman and be more explicit that we are working with a dataframe and not a matrix. Also shorter to type.
You can simply do this :
names(data)[3]<- 'Newlabel'
Where names(data)[3] is the column you want to rename.

R: reshape dataframe from wide to long format based on compound column names

I have a dataframe containing observations for two sets of data (A,B), with dataset and observation type given by the column names :
mydf <- data.frame(meta1=paste0("a",1:2), meta2=paste0("b",1:2),
A_var1 = c(11:12), A_var2 = c("p","r"),
B_var1 = c(21:22), B_var2 = c("x","z"))
I would like to reshape this dataframe so that each row contains observations on one set only. In this long format, set and column names should be given by splitting the original column names at the '_':
mydf2 <- data.frame(meta1=rep(paste0("a",1:2),2),
meta2=rep(paste0("b",1:2),2),
set=c("A","B","A","B"),
var1 = c(11:12),
var2 = c("a","b","c","d"))
I have tried using 'gather' in combination with 'str_split' and 'sub', but unfortunately without success. Could this be done using tidyverse functions?
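Yes, you can do this with tidyverse!
You were close: you need to gather, then separate, then spread.
library(tidyverse)
new_df <- mydf %>%
  gather(set, vars, 3:6) %>%                                # stack columns 3 to 6 into name/value pairs
  separate(set, into = c('set', 'var'), sep = "_") %>%      # split "A_var1" into "A" and "var1"
  spread(var, vars)                                         # one column per var again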
hope this helps!
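An alternative sketch, not part of the answer above: newer versions of tidyr (1.0.0 and later) offer pivot_longer, which can do the same reshape in one step. A possible advantage is that it keeps the original column types, whereas gathering integer and character columns together coerces them to a common type, so var1 may come back as character after spread unless convert = TRUE is used. The example assumes tidyr is installed and reuses the question's mydf.

```r
library(tidyr)

mydf <- data.frame(meta1 = paste0("a", 1:2), meta2 = paste0("b", 1:2),
                   A_var1 = 11:12, A_var2 = c("p", "r"),
                   B_var1 = 21:22, B_var2 = c("x", "z"),
                   stringsAsFactors = FALSE)

# ".value" tells pivot_longer() that the part after "_" names the output
# column, while the part before "_" goes into the new "set" column
long_df <- pivot_longer(mydf,
                        cols = 3:6,
                        names_to = c("set", ".value"),
                        names_sep = "_")
long_df
```

The result has one row per combination of meta1, meta2 and set, with var1 and var2 as separate columns.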
