I have two data frames. One contains the region name for each regional code. The other contains GTU data with the regional code in one column. Using R, I want to replace the regional codes in the second data frame with the region names from the first. Please help!
If your data frames are called df1 and df2, you could try
df_final <- merge(df1, df2, by = "regional code")
Then df_final will contain everything (both the region names and the regional codes).
Afterwards you can drop the regional code column, if you want, by using
df_final[, !(colnames(df_final) %in% c("regional code"))]
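As a minimal sketch of the same idea on made-up data (the column names regional_code, region_name and gtu_value are assumptions, not taken from the question), the lookup could go like this in base R:

```r
# Hypothetical lookup table: one row per regional code
df1 <- data.frame(regional_code = c(10, 20, 30),
                  region_name   = c("North", "South", "East"),
                  stringsAsFactors = FALSE)

# Hypothetical GTU data that only carries the code
df2 <- data.frame(regional_code = c(20, 10, 20, 30),
                  gtu_value     = c(1.5, 2.0, 0.7, 3.1))

# all.x = TRUE keeps every row of df2, even if its code has no name in df1
df_final <- merge(df2, df1, by = "regional_code", all.x = TRUE)

# Drop the code column so only the region name remains
df_final <- df_final[, setdiff(names(df_final), "regional_code"), drop = FALSE]
```

Note that merge() may reorder the rows, so do not rely on df_final having the same row order as df2.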
I usually use dplyr to filter data. I now have a huge dataset (62176 entries) of banks operating in different countries. I'd like to subset/filter that dataset for Eurozone banks only.
I haven't found any approach other than typing out the names of all Eurozone countries and then creating a new dataset with filter.
Is there a better way to do this?
Thank you!
Without the data we can't give you a definite answer. However, based on my understanding of the problem, here are some methods.
Assuming your dataset already has a column with each bank's operating country, you could manually create a vector of the countries you are interested in and then filter the dataset for rows that match:
# manually assign the countries of interest to a vector (spelling must match your data exactly)
# illustrative subset only; extend it to the full list of Eurozone members
euro_countries <- c("Germany", "France", "Italy", "Spain")
#Then filter dataset to pull up rows that match, I make up colnames as I don't know your data
dataframe %>% filter(op_country %in% euro_countries)
Alternatively, depending on your data set, you can use the very helpful countrycode package in R. It ships with a lookup table, countrycode::codelist, that you can join to your data on the country column; you can then filter on its continent column for countries in "Europe". Note that "Europe" covers the whole continent, which is broader than the Eurozone, so you may still need to narrow the result.
# join your data set with the codelist table; the key depends on the country column in your dataset
dataframe <- left_join(x = dataframe, y = countrycode::codelist, by = c("op_country" = "country.name.en"))
#filter your dataset with the new column
dataframe %>% filter(continent=="Europe")
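For illustration, here is the first method on a made-up table, written in base R so it runs without extra packages. The data, the column name op_country and the short country vector are all assumptions; the last line lists the countries that were not matched, which is a quick way to spot spelling differences between your vector and your data.

```r
# Hypothetical bank data; op_country is an assumed column name
banks <- data.frame(bank       = c("Bank A", "Bank B", "Bank C", "Bank D"),
                    op_country = c("Germany", "Poland", "France", "Japan"),
                    stringsAsFactors = FALSE)

# Illustrative subset of Eurozone members; extend to the full list as needed
euro_countries <- c("Germany", "France", "Italy", "Spain")

# Keep only rows whose country appears in the vector
euro_banks <- banks[banks$op_country %in% euro_countries, ]

# Countries present in the data but absent from the vector
unmatched <- setdiff(unique(banks$op_country), euro_countries)
```

With dplyr loaded, the filtering line corresponds to banks %>% filter(op_country %in% euro_countries).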
I have three data sets that I would like to merge. My data is on companies in the SP500 and their corporate political activity. Of my datasets, one is named PAC, one is named Lobby and one is named BoardData. The datasets all have three columns in common: "ultorg", "sector", and "subind" as well as other columns unique to each dataset.
I would like to merge the three Excel documents so that each of those columns appears only once, with all of the other variables appended to it.
I have tried doing this on my own but I run into a few problems. Specifically, I get several columns for ultorg/sector/subind (the variables the datasets have in common) and there are entries that are repeating in places where they shouldn't. For example, my board data only goes until 2015 but my lobbying data goes until 2000. Using the incorrect/incomplete code below, I get rows where a company's board data from 2015 is being put in for years 2000-2015. I would just like the years without a Board entry (2000-2015) to have NA entered instead.
Here's the current code.
library(tidyverse)
library(janitor)
library(glue)
library(readxl)
setwd("~/Desktop/thesis")
PAC <- read_excel("PAC.xlsx")
Lobby <- read_excel("Lobby.xlsx")
BoardData <- read_excel("BoardData.xlsx")
alldata <- left_join(PAC, Lobby, by = "ultorg")
alldata <- left_join(alldata, BoardData, by = "ultorg")
Thank you so much for any help you are able to give me! I really appreciate it and am able to answer any questions regarding my data.
Merging by ultorg, sector, and subind will work. If the datasets also share a column that indicates the date (for example, the year), you should add that column to the join keys as well; otherwise each row from one dataset is matched to every year of the same company in the other. Whether to use full_join, left_join, or another join depends on your purpose. The code below is one example you could try.
BoardData %>%
full_join(PAC, by = c("ultorg", "sector", "subind")) %>%
full_join(Lobby, by = c("ultorg", "sector", "subind"))
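To show why the date column matters, here is a small sketch in base R on made-up data. It assumes all three tables share a column called year, which the question does not confirm; merge(..., all = TRUE) plays the role of full_join here.

```r
# Hypothetical miniature versions of the three tables; `year` is an assumed shared column
PAC <- data.frame(ultorg = c("Acme", "Acme"), sector = "Tech", subind = "Software",
                  year = c(2014, 2015), pac_total = c(100, 150),
                  stringsAsFactors = FALSE)
Lobby <- data.frame(ultorg = c("Acme", "Acme"), sector = "Tech", subind = "Software",
                    year = c(2014, 2015), lobby_total = c(10, 20),
                    stringsAsFactors = FALSE)
BoardData <- data.frame(ultorg = "Acme", sector = "Tech", subind = "Software",
                        year = 2015, board_size = 9,
                        stringsAsFactors = FALSE)

# Join on every shared column, including the year
keys <- c("ultorg", "sector", "subind", "year")

# all = TRUE keeps rows from both sides (a full join)
alldata <- merge(PAC, Lobby, by = keys, all = TRUE)
alldata <- merge(alldata, BoardData, by = keys, all = TRUE)
```

Because year is part of the key, the 2015 board entry is attached only to the 2015 row, and the 2014 row gets NA for board_size instead of a copied value.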
I have a list of bacteria, each with its own abundance, in a dataframe. I also have the same list of bacteria, but in a different order, in the same dataframe.
I want to match the abundances to this second list but I'm not sure how to go about doing it.
dplyr contains several methods for sorting data, but I don't know how to match the abundance and print it into a new column so that it lines up with the second list of bacteria.
Here's the beginning of my dataset:
Taxon Total_abundance Tips
Acaricomes phytoseiuli 0.000382414 Methanothermobacter thermautotrophicus
Acetivibrio cellulolyticus 0.013979274 Methanobacterium beijingense
Acetobacter aceti 0.181150551 Methanobacterium bryantii
Acetobacter estunensis 0.023074895 Methanosarcina mazei
Acetobacter tropicalis 0.014615221 Persephonella marina
Achromobacter piechaudii 0.031811039 Sulfurihydrogenibium azorense
Achromobacter xylosoxidans 0.041558442 Balnearium lithotrophicum
Acidicapsa borealis 0.035525932 Isosphaera pallida
Acidimicrobium ferrooxidans 0.013841209 Simkania negevensis
Acidiphilium angustum 0.041702984 Parachlamydia acanthamoebae
Acidiphilium cryptum 0.039265944 Leptospira biflexa
Acidiphilium rubrum 0.041702984 Leptospira fainei
...
So, the abundance matches the data in Taxon column, and I want the abundance to also be matched with the bacteria in the "Tips" column.
For example, Acaricomes phytoseiuli has an abundance of 0.000382414, so in column D 0.000382414 will be printed next to where Acaricomes phytoseiuli is located. Again, Taxon and Tips contains exactly the same data, just in a different order.
I hope that makes sense.
It doesn't matter if this is done in R or Excel, thanks.
As others have mentioned, it's hard to test without some data that matches, but something like this should work, using match to match up values.
df$D <- df$Total_abundance[ match( df$Tips, df$Taxon ) ]
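Here is a small made-up example of what match does in that line (the single-letter taxa and the abundance values are placeholders, not your data):

```r
# Toy data: Tips holds the same names as Taxon, in a different order
df <- data.frame(Taxon           = c("a", "b", "c", "d"),
                 Total_abundance = c(0.1, 0.2, 0.3, 0.4),
                 Tips            = c("c", "a", "d", "b"),
                 stringsAsFactors = FALSE)

# For each entry in Tips, the row number where it appears in Taxon
idx <- match(df$Tips, df$Taxon)

# Pull the abundances in that order, so D lines up with Tips
df$D <- df$Total_abundance[idx]
```

If a name in Tips has no counterpart in Taxon, match returns NA for it and the corresponding entry in D will be NA as well.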
I assume that each bacterium appears only once in your list.
Here is a sample data frame:
dff <- data.frame(bacteria1=letters[1:10], abundance1=runif(10,0,1),
bacteri2=sample(letters[1:10],10), abundance2=0)
Now we find the matching row for each bacterium and insert its abundance:
for (i in 1:nrow(dff)) {
  s <- which(dff$bacteri2[i] == dff$bacteria1)
  dff$abundance2[i] <- dff$abundance1[s]
}
In excel under column D you can do the following:
=VLOOKUP(C3;A3:B13;2;FALSE)
C3 is the Tips entry and A3:B13 is the range that is searched, with A being the bacteria name and B the abundance. If a match is found, the corresponding abundance is returned.
If you get an error like #N/A, then there is no match. You can also avoid these errors by using this formula:
=IFNA(VLOOKUP(C3;$A$3:$B$13;2;FALSE);"No match")
Edit: Adjust the ranges to your file!
Edit 2: Keep in mind that the separator I use is ; and your Excel might use the comma , as separator.
First of all, if your Taxon and Tips columns contain exactly the same data, only in different order, they have no place being together in the same data frame. You should either have two data frames, or come up with some sort of key to define the place of a Taxon item in the phylogenetic tree and then re-sort the data frame as needed, either in alphabetic order or by phylogeny.
As a quick solution, I would first extract the Tips column in a separate data frame, join it with the original data frame by the Tips and Taxon columns, thus obtaining the correct order of abundance values in the new data frame and (if you still insist) using cbind to glue the newly re-sorted abundance column back into the original data frame. Like so, assuming you're using dplyr (df is a dummy stand-in for your data set):
df <- data.frame(Taxon=c("a","b","c","d","e"), Abundance=c(1:5), Tips=c("b","a","d","c","e"))
new_df <- select(df, Tips)
new_df <- left_join(new_df, df, by=c("Tips"= "Taxon"))
df <- cbind(df, New_Abund=new_df$Abundance)
rm(new_df)
I have 15 data frames containing information about patient visits for a group of patients. Example below. They are named as FA.OFC1, FA.OFC2 etc.
ID sex date age.yrs important.var etc...
xx_111 F xx.xx.xxxx x.x x
I am generating a summary data frame (sev.scores) which contains information about the most severe episode a patient has across all recorded data. I have successfully used the which.max function to get the most severe episode but now need additional information about that particular episode.
I recreated the name of the data frame I will need to look up to get the additional information by pasting information after the max return:
max data frame
8 df2
Specifically, the names() function gave me the name of the column with the most severe episode (in the summary data frame sev.scores), which also tells me which data frame to look up:
sev.scores[52:53] <- as.data.frame(cbind(row.names(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)]),apply(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)],1,function(x) names(sev.scores[c(5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50)])[which(x==max(x))])))
However now I would like to figure out how to tell R to take the data frame name stored in the column and search that data frame for the entry in the 5th column.
So in the example above, the information about the most severe episode is stored in data frame 2 (df2), and I need to take the value from the 5th column (important.var) and return it to this summary data frame.
UPDATE
I have now stored these dfs in a list but am still having some trouble getting the information I would like.
I found the following example for getting the max value from a list
lapply(L1, function(x) x[which.max(abs(x))])
How can I adapt this for a factor which is present in all elements of the list?
e.g. something like:
lapply(my_dfs[[all elements]]["factor of interest"], function(x) x[which.max(abs(x))])
If I may suggest a fundamentally different approach: concatenate all your data.frames into one (rbind), and add a separate column that describes the nature of the original data.frame. For this, it’s necessary to know in which regard the original data.frames differed (e.g. by disease type; since I don’t know your data, let’s stick with this for my example).
Furthermore, you need to ensure that your data is in tidy data format. This is an easy requirement to satisfy, because your data should be in this format anyway!
Then, once you have all the data in a single data.frame, you can create a summary trivially by simply selecting the most severe episode for each patient:
sev_scores = all_data %>%
group_by(ID) %>%
filter(row_number() == which.max(FactorOfInterest))
Note that this code uses the ‹dplyr› package. You can perform an equivalent analysis using different packages (e.g. ‹data.table›) or base R functions, but I strongly recommend dplyr: The resulting code is generally easier to understand.
Rather than your sev.scores table, which has columns referring to rows and data.frame names, the sev_scores I created above will contain the actual data for the most severe episode for each patient ID.
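The concatenation step is not shown above, so here is a rough sketch of the whole flow in base R. Everything in it is made up for illustration: the two small tables, the severity column standing in for the factor of interest, and the visit column that records which data.frame a row came from.

```r
# Hypothetical visit tables; `severity` stands in for the variable of interest
my_dfs <- list(
  FA.OFC1 = data.frame(ID = c("p1", "p2"), severity = c(2, 5),
                       important.var = c("x", "y"), stringsAsFactors = FALSE),
  FA.OFC2 = data.frame(ID = c("p1", "p2"), severity = c(4, 1),
                       important.var = c("z", "w"), stringsAsFactors = FALSE)
)

# Tag each table with its own name, then stack them into one data.frame
tagged <- Map(function(d, nm) { d$visit <- nm; d }, my_dfs, names(my_dfs))
all_data <- do.call(rbind, unname(tagged))
rownames(all_data) <- NULL

# For each patient ID, keep the single row with the highest severity
sev_scores <- do.call(rbind, lapply(split(all_data, all_data$ID),
                                    function(g) g[which.max(g$severity), ]))
rownames(sev_scores) <- NULL
```

Each row of sev_scores then carries important.var and the originating visit directly, so no second lookup by data.frame name is needed. With dplyr, bind_rows(my_dfs, .id = "visit") does the stacking in one call.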
I have a challenging problem involving two large datasets that has had me stumped for a while now, and I would love to see some other approaches.
I have two dataframes that I would like to merge without creating duplicated rows. The dataframes are tx (treatment) and ae (adverse effects), each with a column named id that links the data to a particular person.
Data frame for tx:
id<-c("A","A","A","B","B")
stage<-c("peak","trough","roll","peak","trough")
start<-c("01-01-2011","01-03-2011","01-05-2011","01-01-2008","01-07-2008")
end<-c("01-02-2011","01-04-2011","01-05-2012","01-05-2008","01-08-2008")
tx<-data.frame(id,stage,start,end)
tx[] <- lapply(tx, as.character)
tx$start<-as.Date(start, format="%d-%m-%Y")
tx$end<-as.Date(end, format="%d-%m-%Y")
Data frame for ae:
id<-c("A","A","A","A","B","B","B","B")
onset<-c("05-01-2011","10-01-2011","14-02-2011","25-01-2011","11-01-2008","11-04-2008","14-07-2008","22-07-2008")
ae<-data.frame(id,onset)
ae[]<-lapply(ae, as.character)
ae$onset<-as.Date(onset, format="%d-%m-%Y")
So far I have tried merge(tx,ae) which does the job but I get duplicates and they are not aligned correctly.
What I need is for the multiple ae$onset dates of a specific person that fall within tx$start <= ae$onset <= tx$end to appear in the tx data frame as new columns ae1, ae2, etc.
Ultimately I need a transpose function for specific elements that match the specified selection criteria.
Any help would be greatly appreciated as I have not been able to locate an example of this.
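One possible approach, sketched in base R using the example data from the question: for each row of tx, collect the onset dates of the same id that fall inside the interval, then spread them into columns ae1, ae2, and so on. This is only a sketch; the row-by-row loop is an assumption made for readability and may be slow on large data, where a non-equi join (for example with data.table) is likely a better fit.

```r
# Example data from the question, built in one step
tx <- data.frame(
  id    = c("A", "A", "A", "B", "B"),
  stage = c("peak", "trough", "roll", "peak", "trough"),
  start = as.Date(c("01-01-2011", "01-03-2011", "01-05-2011", "01-01-2008", "01-07-2008"),
                  format = "%d-%m-%Y"),
  end   = as.Date(c("01-02-2011", "01-04-2011", "01-05-2012", "01-05-2008", "01-08-2008"),
                  format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)
ae <- data.frame(
  id    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  onset = as.Date(c("05-01-2011", "10-01-2011", "14-02-2011", "25-01-2011",
                    "11-01-2008", "11-04-2008", "14-07-2008", "22-07-2008"),
                  format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# For each tx row: sorted onsets of the same id with start <= onset <= end
onsets <- lapply(seq_len(nrow(tx)), function(i) {
  sort(ae$onset[ae$id == tx$id[i] &
                ae$onset >= tx$start[i] &
                ae$onset <= tx$end[i]])
})

# One new column per position; indexing past the end of a Date vector yields NA
n_max <- max(lengths(onsets))
for (k in seq_len(n_max)) {
  tx[[paste0("ae", k)]] <- do.call(c, lapply(onsets, function(x) x[k]))
}
```

tx keeps its original five rows, so no duplicates are created; intervals with fewer events than the maximum are padded with NA.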