R: How do I cross-reference data.frames for discrepancies?

BACKGROUND:
I have two data frames in which two researchers have manually entered time data tracking how a group of participants reaches consensus on a decision. We do this by logging the time of each preference statement as well as the preference itself (ranked by priority).
QUESTION:
My question is: what functions or packages can I use to show me the discrepancies between the two data frames?
EXAMPLE:
discrepancies <- show_discrepancies(myData1, myData2)
discrepancies
outputExample1: provides a data frame containing only the entries that do not match.
outputExample2: provides a combined data frame with entries from both myData1 and myData2, where the entries that do not have a match are highlighted in red.
Either output would work, but I would prefer outputExample1 if possible.

Assuming the two data frames have the same structure (same columns in the same order), you can get outputExample1 with a function like this, which keeps only the rows that appear in one data frame but not the other:
show_discrepancies <- function(data1, data2) {
  data <- rbind(data1, data2)
  # drop every row that occurs more than once, i.e. keep only the unmatched entries
  data[!(duplicated(data) | duplicated(data, fromLast = TRUE)), ]
}
Also take a look at the join functions available in the dplyr package.
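For example, a rough sketch using dplyr's anti_join (which joins on all shared columns by default); the object names myData1 and myData2 are taken from the question:
library(dplyr)
# rows of myData1 with no matching row in myData2, and vice versa
only_in_1 <- anti_join(myData1, myData2)
only_in_2 <- anti_join(myData2, myData1)
# same shape as outputExample1: only the entries that do not match
discrepancies <- bind_rows(only_in_1, only_in_2)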

Related

How to stack multiple pivot tables based on multiple filters in Excel or R

I have a pivot table that aggregates data by region and can be filtered by two categories (Age, Income). How do I create a table in which each category combination (e.g. Toddler & below 50% FPL, and Toddler & all incomes) is represented within each aggregation? So far, I am filtering for every combination of Age and Income and copying and pasting the results into a new spreadsheet. I linked a video where I show what I mean:
https://drive.google.com/file/d/1kUvDNxijXWZyJCCdVy398Gd0uq8vUFvY/view?usp=sharing
I am open to doing this in Excel or R.
Thank you very much for your help,
Rouzbeh
To do this in Excel you would need a third-party add-in, or at least some VBA code.
In R you can find a solution. There is a similar question here, although it has not been marked as answered.
R Solution
In base R you can pivot using aggregate(). There are similar functions in other libraries such as reshape2, data.table and dplyr; if you feel comfortable with those libraries, look for their aggregation or group-by functions.
Sample Data: data=
I do not know whether you have a flag that marks a subject as eligible. Assuming you do, I will use a custom aggregation function; if that's not the case, you could use any of the standard aggregation functions.
# Custom function counting "x" flags
counEle <- function(x) {
  length(which(x == "x"))
}
Then:
# Create all possible combinations of Age and Income
combination <- expand.grid(unlist(data["Age"]), unlist(data["Income"]))
# Eliminate duplicated combinations
combination <- unique(combination)
# Loop that filters and aggregates for each combination
for (i in 1:nrow(combination)) {
  new_data <- data[data$Age == combination[i, 1] & data$Income == combination[i, 2], ]
  # Check that the filtered data frame has rows before aggregating
  if (nrow(new_data) > 0) {
    # Aggregate the data for the rows in place
    print(try(aggregate(Eligibility ~ Age + Income + County, new_data, counEle)))
  }
}
The total count ends up in the Eligibility column, which is the measure we wanted. This should print the aggregation for every combination present in the data (note the error handling provided by try()). If you want to ignore combinations where the count is 0, add an additional conditional step checking for > 0. You can then write each result to a CSV, or use a library to write it to an Excel tab.
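As a rough sketch of the dplyr alternative mentioned above (assuming the same Age, Income, County and Eligibility columns used in the aggregate() call, and counting the same "x" flags), this gives the per-combination counts; the "All incomes" style rollups would still need the expand.grid step from the loop:
library(dplyr)
data %>%
  group_by(Age, Income, County) %>%
  summarise(Eligibility = sum(Eligibility == "x"), .groups = "drop")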

Merging DataFrames without Duplicate Columns in R

I have three data sets that I would like to merge. My data is on companies in the SP500 and their corporate political activity. Of my datasets, one is named PAC, one is named Lobby and one is named BoardData. The datasets all have three columns in common: "ultorg", "sector", and "subind" as well as other columns unique to each dataset.
I would like to merge the three excel documents so that there is only one of each of those columns that has all of the other variables appended to it.
I have tried doing this on my own but I get a few problems. Specifically, I get several columns for ultorg/sector/subind (the variables the datasets have in common), and there are entries repeating in places where they shouldn't. For example, my board data only goes until 2015 but my lobbying data goes until 2000. Using the incorrect/incomplete code below, I get rows where a company's board data from 2015 is filled in for years 2000-2015. I would like the years without a Board entry (2000-2015) to simply have NA entered instead.
Here's the current code.
library(tidyverse)
library(janitor)
library(glue)
library(readxl)
setwd("~/Desktop/thesis")
PAC <- read_excel("PAC.xlsx")
Lobby <- read_excel("Lobby.xlsx")
BoardData <- read_excel("BoardData.xlsx")
alldata <- left_join(PAC, Lobby, by = "ultorg")
alldata <- left_join(alldata, BoardData, by = "ultorg")
Thank you so much for any help you are able to give me! I really appreciate it and am able to answer any questions regarding my data.
Merging by ultorg, sector and subind will work, and if there is a common column indicating the date, you should consider adding it to the join keys as well. The choice between full_join, left_join, etc. depends on your purpose. The code below is one example you might try.
BoardData %>%
  full_join(PAC, by = c("ultorg", "sector", "subind")) %>%
  full_join(Lobby, by = c("ultorg", "sector", "subind"))
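If the three tables do share such a date column, one hedged variant (the column name "Year" here is only an assumption, not taken from the question) is to add it to the join keys, so board figures are only matched to the same year and other years get NA:
BoardData %>%
  full_join(PAC, by = c("ultorg", "sector", "subind", "Year")) %>%
  full_join(Lobby, by = c("ultorg", "sector", "subind", "Year"))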

Look up data frame with values stored in another data frame

I have 15 data frames containing information about patient visits for a group of patients. Example below. They are named as FA.OFC1, FA.OFC2 etc.
ID sex date age.yrs important.var etc...
xx_111 F xx.xx.xxxx x.x x
I am generating a summary data frame (sev.scores) which contains information about the most severe episode a patient has across all recorded data. I have successfully used the which.max function to get the most severe episode but now need additional information about that particular episode.
I recreated the name of the data frame I will need to look up for the additional information by pasting text onto the result of the max:
  max   data.frame
  8     df2
Specifically, the names() function gave me the name of the column with the most severe episode (in the summary data frame sev.scores), which also tells me which data frame to look up:
sev.scores[52:53] <- as.data.frame(cbind(
  row.names(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)]),
  apply(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)], 1,
        function(x) names(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)])[which(x == max(x))])
))
However now I would like to figure out how to tell R to take the data frame name stored in the column and search that data frame for the entry in the 5th column.
So in the example above the information about the most severe episode is stored in data frame 2 (df2) and I need to take information from the 5th record (important.var) and return it to this summary data frame.
UPDATE
I have now stored these dfs in a list but am still having some trouble getting the information I would like.
I found the following example for getting the max value from a list
lapply(L1, function(x) x[which.max(abs(x))])
How can I adapt this for a factor which is present in all elements of the list?
e.g. something like:
lapply(my_dfs[[all elements]]["factor of interest"], function(x) x[which.max(abs(x))])
If I may suggest a fundamentally different approach: concatenate all your data.frames into one (rbind), and add a separate column that describes the nature of the original data.frame. For this, it’s necessary to know in which regard the original data.frames differed (e.g. by disease type; since I don’t know your data, let’s stick with this for my example).
Furthermore, you need to ensure that your data is in tidy data format. This is an easy requirement to satisfy, because your data should be in this format anyway!
Then, once you have all the data in a single data.frame, you can create a summary trivially by simply selecting the most severe episode for each disease type:
sev_scores = all_data %>%
    group_by(ID) %>%
    filter(row_number() == which.max(FactorOfInterest))
Note that this code uses the dplyr package. You can perform an equivalent analysis using different packages (e.g. data.table) or base R functions, but I strongly recommend dplyr: the resulting code is generally easier to understand.
Rather than your sev.scores table, which has columns referring to rows and data.frame names, the sev_scores I created above will contain the actual data for the most severe episode for each patient ID.
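As a minimal sketch of the "concatenate first" step (assuming the 15 data frames are collected in a named list; the list name visit_dfs and the label column "source" are assumptions, not from the question):
library(dplyr)
# stack FA.OFC1, FA.OFC2, ... into one data frame, keeping the origin as a column
visit_dfs <- list(FA.OFC1 = FA.OFC1, FA.OFC2 = FA.OFC2)  # ... and so on for all 15
all_data <- bind_rows(visit_dfs, .id = "source")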

Altering dataframes stored within a list

I am trying to write some kind of loop function that will allow me to apply the same set of code to dozens of data frames that are stored in one list. Each data frame has the same number of columns and identical headers for each column, though the number of rows varies across data frames.
This data comes from an egocentric social network study where I collected ego-network data in edgelist format from dozens of different respondents. The data collection software that I use stores the data from each interview in its own .csv file. Here is an image of the raw data for a specific data frame (image of raw data).
For my purposes, I only need to use data from the fourth, sixth, and seventh columns. Furthermore, I only need rows of data where the last column has values of 4, at which point the final column can be deleted entirely. The end result is a two-column data frame that represents relationships among pairs of people.
After reading in the data and storing it as an object, I ran the following code:
x100291 = `100291AlterPair.csv` #new object based on raw data
foc.altername = x100291$Alter.1.Name
altername = x100291$Alter.2.Name
tievalue = x100291$AlterPair_B
tie = tievalue
tie[(tie<4)] = NA
egonet.name = data.frame(foc.altername, altername, tievalue)
depleted.name = cbind(tie,egonet.name)
depleted.name = depleted.name[is.na(depleted.name[,1]) == F,]
dep.ego.name = data.frame(depleted.name$foc.altername, depleted.name$altername)
This produced the following data frame (image of final data). This is ultimately what I want.
Now I know that I could cut-and-paste this same set of code 100+ times and manually alter the file names, but I would prefer not to do that. Instead, I have stored all of my raw .csv files as data frames in a single list. I suspect that I can apply the same code across all of the data frames by using one of the apply commands, but I cannot figure it out.
Does anyone have any suggestions for how I might apply this basic code to a list of data frames so that I end up with a new list containing cleaned and reduced versions of the data?
Many thanks!
The logic can be simplified. Try creating a custom function and applying it over all data frames.
cleanDF <- function(mydf) {
  # stop if any of the required columns is missing
  if (!all(c('AlterPair_B', 'Alter.1.Name', 'Alter.2.Name') %in% names(mydf)))
    stop("Check data frame names")
  condition <- mydf[, 'AlterPair_B'] >= 4
  mydf[condition, c("Alter.1.Name", "Alter.2.Name")]
}
big_list <- lapply(all_my_files, read.csv) #read in all data frames
result <- do.call('rbind', lapply(big_list, cleanDF))
The custom function cleanDF first checks that all the relevant column names are present. It then defines the condition that 'AlterPair_B' is 4 or more, and finally subsets the two target columns by that condition. I used a list called 'big_list' to represent all of the data frames.
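If you prefer the output described in the question, a new list of cleaned data frames rather than one stacked data frame, you can simply keep the lapply() result instead of passing it to rbind:
cleaned_list <- lapply(big_list, cleanDF)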
You haven't provided a reproducible example, so it's hard to solve your problem exactly. However, I don't want your question to remain unanswered. It is true that lapply would be a fast solution, usually preferable to a loop; but since you mentioned being a beginner, here is how to do it with a loop, which is easier to understand.
You need to put all your csv files in a single folder with nothing else. You then read the filenames into a list, initialize an empty result object with NULL, and finally read all your files in a loop, do your calculations, and rbind the results into the result object.
path <- "C:/temp/csv/"
list_of_csv_files <- list.files(path)
result <- NULL
for (filenames in list_of_csv_files) {
  input <- read.csv(paste0(path, filenames), header = TRUE, stringsAsFactors = FALSE)
  # Do your calculations
  input_with_calculations <- input
  result <- rbind(result, input_with_calculations)
}
result
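For this particular question, the "Do your calculations" placeholder could be filled in with the same subsetting used in the first answer (column names taken from the question; treat this as a sketch):
input_with_calculations <- input[input$AlterPair_B >= 4,
                                 c("Alter.1.Name", "Alter.2.Name")]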

Applying a function to a dataframe to trim empty columns within a list environment R

I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by interlaced rows of descriptive (character) data.
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then remove the trailing empty columns, make two new matching lists (one of data and one of characters), use reshape to produce a common column count, and finally recombine the sets in each list. A simplified example:
myDF <- as.data.frame(rbind(c("v1", as.character(1:10)),
                            c("v1", letters[1:10]),
                            c("v2", c(as.character(1:6), rep("", 4))),
                            c("v2", c(letters[1:6], rep("", 4)))))
myDF[, 1] <- as.factor(myDF[, 1])
myList <- split(myDF, myDF[, 1])
myList[[1]]
I can remove the empty columns for an individual set, and I can split the data frame into two sets based on the interlaced rows, but I am stumped by the syntax for writing a function that applies the following to every element of the list (though 'lapply' with 'seq_along' should do it?).
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])
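Equivalently, you can wrap the column test from the question in lapply() yourself, which returns a new list with the empty columns dropped from every element:
myList_trimmed <- lapply(myList, function(DF) DF[, !sapply(DF, function(x) all(x == "")), drop = FALSE])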
