Comparing similarities between 2 data frames of different lengths in R

I have two simple but long dataframes, and I would like to compare a column of one data frame against a specific column of the other to see how many, and which, values are the same.
For example, the word "HAT" might be present in row 1 in the 1st data frame, and it might be present in row 76 of the 2nd data frame. I want the output to tell me that the word "HAT" is present in both dataframes (along with all the other similarities), rather than just tell me how many values match up.
Please let me know if there is a function I can use! Comparedf is not working well. It would also be best if I could get the results in the form of another data frame.

Using indexing and the %in% operator can help:
- use %in% to find which values in the first dataframe are also in the second dataframe
- use the resulting logical vector to index the column and return the matching values
- use unique() if you don't want repeats
- use data.frame() to construct a dataframe with one column of common values
# create two example dataframes:
df1 <- data.frame(chars = rep(LETTERS[1:10], 2))
df2 <- data.frame(chars = LETTERS[8:20])
# find the common values:
df1$chars[df1$chars %in% df2$chars]
#> [1] "H" "I" "J" "H" "I" "J"
# alternatively, only show the unique values:
common <- unique(df1$chars[df1$chars %in% df2$chars])
# and create a dataframe from it:
common_df <- data.frame(common)
# see contents:
common_df
#>   common
#> 1      H
#> 2      I
#> 3      J
Created on 2021-04-14 by the reprex package (v2.0.0)
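As an aside (an equivalent base-R route, not part of the original answer): intersect() collapses the %in%-plus-unique() steps into a single call. A small sketch with the same example data:

```r
# same example data as above
df1 <- data.frame(chars = rep(LETTERS[1:10], 2))
df2 <- data.frame(chars = LETTERS[8:20])

# intersect() returns the unique values common to both vectors
common_df <- data.frame(common = intersect(df1$chars, df2$chars))
common_df
#>   common
#> 1      H
#> 2      I
#> 3      J
```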

R: retrieve dataframe name from another dataframe

I have a dataframe datasetselect that tells me which dataframe to use for each case of an analysis (let's call this the relevant dataframe).
The case is assigned dynamically, and therefore which dataframe is relevant depends on that case.
Based on the case, I would like to assign the relevant dataframe to a variable relevantdf. I tried:
datasetselect <- data.frame(case=c("case1","case2"),dataset=c("df1","df2"))
df1 <- data.frame(var1=letters[1:3],var2=1:3)
df2 <- data.frame(var1=letters[4:10],var2=4:10)
currentcase <- "case1"
relevantdf <- get(datasetselect[datasetselect$case == currentcase,"dataset"]) # relevantdf should point to df1
I don't understand if I have a problem with the get() function or the subsetting process.
You are almost there; the problem is that the dataset column of datasetselect is a factor (before R 4.0, data.frame() converted character columns to factors by default), so you just need to convert it to character.
You can add this line after the definition of datasetselect:
datasetselect$dataset <- as.character(datasetselect$dataset)
And you get your expected output
> relevantdf
  var1 var2
1    a    1
2    b    2
3    c    3
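As a further note (a common alternative, not part of the original answer): storing the candidate dataframes in a named list avoids get() and the factor pitfall entirely. A sketch with the same example data:

```r
# keep the candidate dataframes in a named list instead of loose variables
datasets <- list(df1 = data.frame(var1 = letters[1:3],  var2 = 1:3),
                 df2 = data.frame(var1 = letters[4:10], var2 = 4:10))

# map each case to the name of its dataset
lookup <- c(case1 = "df1", case2 = "df2")

currentcase <- "case1"
relevantdf <- datasets[[lookup[[currentcase]]]]
relevantdf
#>   var1 var2
#> 1    a    1
#> 2    b    2
#> 3    c    3
```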

How to programmatically replace NA values in a dataframe with values in a list?

Background
I am trying to impute missing values using the mvnmle package (ML Estimation for Multivariate Normal Data with Missing Values). Following is an example using the accompanying dataframe apple from the package:
data(apple)
mlest(apple)
$`muhat`
[1] 14.72227 49.33325
$sigmahat
          [,1]      [,2]
[1,]  89.53415 -90.69653
[2,] -90.69653 114.69470
$value
[1] 148.435
$gradient
[1] 4.988478e-06 2.892682e-06 8.726424e-07 1.682947e-05 -1.073488e-04
$stop.code
[1] 1
$iterations
[1] 34
Question
There are a few missing values in the worms column of the apple dataframe. The list returned by mlest() provides muhat, the estimated mean of each column. I want to replace all the missing values in the worms column with the corresponding muhat value. In a different dataframe, there could be multiple columns with missing values; I want to programmatically replace all NA values with their corresponding muhat entries.
In this example, I can manually do this by:
apple[is.na(apple)] <- res$muhat[2]
How can I automate this?
Can you use tidyverse packages? If so, I think this addresses your question and is scalable.
library(purrr)
res <- mlest(apple)
map2_df(apple,
        seq_along(apple),
        function(column, col_ind, mu_vec) {
          if_else(is.na(column), mu_vec[col_ind], column)
        },
        res$muhat)
Using the tidyverse:
First ensure the means are named according to the column names:
(mu = setNames(mlest(apple)$muhat, names(apple)))
    size    worms
14.72227 49.33325
Now use replace_na() to replace the NAs in each column with that column's mean:
library(tidyverse)
apple %>% replace_na(as.list(mu))
In base R, you can use sweep:
sweep(apple, 2, mlest(apple)$muhat, function(x, y) replace(x, is.na(x), y[is.na(x)]))
If I'm understanding your question and data correctly, this should work.
apple$worms <- ifelse(is.na(apple$worms), res$muhat[2], apple$worms)
(Note the [2]: muhat[2] is the mean of worms; passing the full res$muhat here would recycle both means across the rows and fill in wrong values.)
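The programmatic version can also be sketched in plain base R (illustrative data below; in the question, mu would be the column-named muhat vector built with setNames() as shown above):

```r
# illustrative data: two numeric columns, each with an NA
df <- data.frame(size = c(1, NA, 3), worms = c(NA, 5, 7))

# stand-in for mlest()$muhat: one named mean per column
mu <- colMeans(df, na.rm = TRUE)

# replace the NAs in each column with that column's mean
for (j in names(df)) {
  df[[j]][is.na(df[[j]])] <- mu[[j]]
}
df
#>   size worms
#> 1    1     6
#> 2    2     5
#> 3    3     7
```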

R: Iteratively extract not NA values for columns in a data table and split into separate columns without typing column names

I am working with a large dataset containing many variables, so I want to avoid typing out column names each time. I want to iterate through the columns in my data and extract the non-missing values per column. In other words, I want to end up with a separate data table for each column, none of them containing NA values.
My approach is to write a loop that first eliminates the NA values per column. I extracted the column names in a separate column matrix when reading the .csv file (using fread). The problem is that I did not manage to exclude the column names or the NA with my approach. I worked out a small example to illustrate the problem:
# Example data
dt = data.table(color  = c("b", "g", "r", "y", NA),
                size   = c("S", "XL", NA, NA, "M"),
                number = 1:5)
columns = matrix(c("color", "size", "number"), nrow = 3, ncol = 1)
The loop shown below works, although it is not really a loop because it still requires inserting the column name in the first line:
# Works (but requires typing in the column name)
for (i in 1:1) {
  var <- dt %>% group_by(color) %>% filter(!is.na(color))
  name <- paste("new", columns[i], sep = ".")
  assign(name, var[, columns[i], with = FALSE])
}
# Output:
color
(chr)
1 b
2 g
3 r
4 y
My idea is to refer inside the loop to the subsequent columns by using the extracted column names. The problem here is that the NA values do not get eliminated, i.e., the first line of code inside the loop is not working:
# Does not work
for (i in 1:1) {
  var <- dt %>% group_by(columns[i]) %>% filter(!is.na(columns[i]))
  name <- paste("new", columns[i], sep = ".")
  assign(name, var[, columns[i], with = FALSE])
}
# Output:
color
(chr)
1 b
2 g
3 r
4 y
5 NA
Can anyone help me out to end up with separate columns (of unequal lengths) that do not contain NA values, without typing in the column names? (Another approach than I have used is certainly welcome as well.) Thanks in advance!
sapply(columns, function(x) c(na.omit(dt[[x]])), USE.NAMES = TRUE)
#$color
#[1] "b" "g" "r" "y"
#
#$size
#[1] "S" "XL" "M"
#
#$number
#[1] 1 2 3 4 5
The c() isn't necessary - I just used it to strip na.omit class info to make the output clearer.
And don't use assign - just store the items in a list as above and work with that.
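Restating that advice as a minimal base-R sketch (a plain data.frame stands in for the data.table here; no packages assumed): lapply() over the columns gives a named list of the non-NA values, which you can then access by name.

```r
# plain data.frame with the same example values as the question
dt <- data.frame(color  = c("b", "g", "r", "y", NA),
                 size   = c("S", "XL", NA, NA, "M"),
                 number = 1:5)

# one list element per column, NAs dropped; lengths may differ
non_missing <- lapply(dt, function(x) x[!is.na(x)])
non_missing$size
#> [1] "S"  "XL" "M"
```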

Counting non-missing occurrences

I need help counting the number of non-missing data points across files and subsetting out only two columns of the larger data frame.
I was able to limit the data to only valid responses, but then I struggled getting it to return only two of the columns.
I found http://www.statmethods.net/management/subset.html and tried their solution, but myvars did not hold my column label; it returned the vector of data (1:10). My code was:
myvars <- c("key")
answer <- data_subset[myvars]
answer
But instead of printing out my data subset with only the "key" column, it returns the following errors:
"Error in `[.data.frame`(observations_subset, myvars) : undefined columns selected" and "Error: object 'answer' not found"
Lastly, I'm not sure how I count occurrences. In Excel, they have a simple "Count" function, and in SPSS you can aggregate based on the count, but I couldn't find a command similarly titled in R. The incredibly long way that I was going to go about this once I had the data subsetted was adding in a column of nothing but 1's and summing those, but I would imagine there is an easier way.
To count unique occurrences, use table.
For example:
# load the "iris" data set that's built into R
data(iris)
# print the count of each species
table(iris$Species)
Take note of the handy function prop.table for converting a table into proportions, and of the fact that table can actually take a second argument to get a cross-tab. There's also an argument useNA, to include missing values as unique items (instead of ignoring them).
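If the goal is specifically counting non-missing values (Excel's COUNT), a base-R idiom worth knowing (not part of the original answer) is summing over !is.na():

```r
# illustrative data with some NAs
df <- data.frame(key = c(1, NA, 3, 4), other = c(NA, NA, 5, NA))

# count of non-missing values in one column
sum(!is.na(df$key))
#> [1] 3

# per-column counts for the whole dataframe
colSums(!is.na(df))
```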
Not sure whether this is what you wanted.
Create some example data, since the post mentions multiple files:
set.seed(42)
d1 <- as.data.frame(matrix(sample(c(NA,0:5), 5*10, replace=TRUE), ncol=10))
set.seed(49)
d2 <- as.data.frame(matrix(sample(c(NA,0:8), 5*10, replace=TRUE), ncol=10))
Create a list with datasets as the list elements
l1 <- mget(ls(pattern="d\\d+"))
Create an index to select the list element that has the maximum number of non-missing elements
indx <- which.max(sapply(l1, function(x) sum(!is.na(x))))
Key of columns to subset from the larger (non-missing) dataset
key <- c("V2", "V3")
Subset the dataset
l1[[indx]][key]
# V2 V3
#1 1 1
#2 1 3
#3 0 0
#4 4 5
#5 7 8
names(l1[indx])
#[1] "d2"

Computing subset of column means in data frame (R programming)

I have a simple data frame:
a=data.frame(first=c(1,2,3),second=c(3,4,5),third=c('x','y','z'))
I'm trying to return a data frame that contains the column means for just the first and second columns. I've been doing it like this:
apply(a[,c('first','second')],2,mean)
Which returns the appropriate output:
first second
2 4
However, I want to know if I can do it using the function by. I tried this:
by(a, c("first", "second"), mean)
Which resulted in:
Error in tapply(seq_len(3L), list(`c("first", "second")` = c("first", :
arguments must have same length
Then, I tried this:
by(a, c(T, T,F), mean)
Which also did not yield the correct answer:
c(T,T,F): FALSE
[1] NA
Any suggestions? Thanks!
You can use colMeans (column means) on a subset of the original data
> a <- data.frame(first = c(1,2,3), second = c(3,4,5), third = c('x','y','z'))
If you know the column number, but not the column name,
> colMeans(a[, 1:2])
## first second
## 2 4
Or, if you don't know the column numbers but know the column name,
> colMeans(a[, c("first", "second")])
## first second
## 2 4
Finally, if you know nothing about the columns and want the means for the numeric columns only,
> colMeans(a[, sapply(a, is.numeric)])
## first second
## 2 4
by() is not the right tool, because it is a wrapper for tapply(), which partitions your data frame into subsets that meet some criteria. If you had another column, say fourth, you could split your data frame using by() for that column and then operate on rows or columns using apply().
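To illustrate that last point (an illustrative sketch, not from the original answer): with a genuine grouping column, by() splits the rows into subsets and applies a function to each.

```r
# same columns as before, plus a grouping column
a <- data.frame(first = c(1, 2, 3, 4), second = c(3, 4, 5, 6),
                group = c("x", "x", "y", "y"))

# per-group column means of the numeric columns
res <- by(a[, c("first", "second")], a$group, colMeans)
res[["x"]]   # means of the rows where group == "x"
```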
