R: changing entries of a dataset


I have to summarize the information of an AnnotatedDataFrame and a graphNEL together. My problem is that the same "feature" has different names in the data frame and in the graphNEL: in the data frame I have NAME and IDnumber separately (two columns), while in the graphNEL I have a single column whose entries are in the form NAME(IDnumber). It is a very large dataset, so renaming them manually is out of the question. Is there a way to put the dataset entries in the form NAME(IDnumber)?
Thank you in advance.

Supposing your dataframe is called df, you can create a new variable with:
df$new <- paste0(df$NAME, "(", df$IDnumber, ")")
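A minimal sketch of the same idea on made-up data (the NAME and IDnumber values below are invented for illustration), with sprintf shown as an equivalent alternative:

```r
# Toy data frame; the NAME and IDnumber values are invented for illustration
df <- data.frame(NAME = c("TP53", "BRCA1"),
                 IDnumber = c(7157, 672),
                 stringsAsFactors = FALSE)

# paste0() concatenates element-wise with no separator
df$new <- paste0(df$NAME, "(", df$IDnumber, ")")

# sprintf() builds the same strings from a format template
df$new2 <- sprintf("%s(%s)", df$NAME, df$IDnumber)

print(df$new)
```

Both calls are vectorised, so they work the same way on a data frame with many thousands of rows.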

Related

Data extraction in R - multiple columns

Hello, I have this type of table consisting of a single row and several columns. I have tried some code to extract my KD_PL parameters, without success. Do you know a way in R to extract all the KD_PLs and store them in a vector or data frame?
I tried this:
KDPL <- select("KD_PL.", which(substr(colnames(max_LnData), start=1, stop=6)))

This should do the trick:
library(tidyverse)
KDPL <- max_LnData %>% select(starts_with("KD_PL."))
This function selects all columns from your old dataset starting with "KD_PL." and stores them in a new dataframe KDPL.
If you only want the names of the columns to be saved, you could use the following:
KDPL_names <- colnames(KDPL)
This saves the column names in the vector KDPL_names.
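If you would rather not load the tidyverse, the same selection can be sketched in base R. The toy table below is an assumption made for illustration; only the KD_PL. prefix comes from the question.

```r
# Toy one-row table; column names and values are invented for illustration
max_LnData <- data.frame(KD_PL.1 = 0.12, KD_PL.2 = 0.34, CL = 5.6)

# startsWith() tests each column name for the literal prefix "KD_PL."
keep <- startsWith(names(max_LnData), "KD_PL.")

# drop = FALSE keeps a data frame even if only one column matches
KDPL <- max_LnData[, keep, drop = FALSE]

# unlist() flattens the single row into a named numeric vector
KDPL_vec <- unlist(KDPL)

print(KDPL_vec)
```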

How to dynamically create and name data frames in a for loop

I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop
for(name in 1:length(NamesList)){ name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ] }
NamesList is just a list containing all the levels of NameProper (which is a factor variable).
All I want the loop to do is each iteration, generate a new data frame with the name "NamesList[name]" and I want that data frame to contain a subset of the main data frame where NameProper corresponds to the name in the list for that iteration.
This seems like it should be simple, but I just can't figure out how to get R to dynamically generate data frames with a different name for each iteration.
Any advice would be appreciated, thank you.

Using assign for this purpose is technically feasible, but it is widely discouraged by experienced users of R. Instead, create a single list with named elements, each of which contains the data for a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames( split( DigiNONA, DigiNONA$NameProper),
NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.
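As a rough sketch of how such a list can then be used to generate per-respondent statistics (the toy data below is invented, and using the mean of AScore as the statistic is an assumption):

```r
# Toy stand-in for DigiNONA; all values are invented for illustration
DigiNONA <- data.frame(
  NameProper = factor(c("Ann", "Bob", "Ann", "Bob", "Cid")),
  AScore     = c(10, 20, 30, 40, 50)
)

# split() returns a list of data frames, already named by factor level
by_person <- split(DigiNONA, DigiNONA$NameProper)

# One summary statistic per respondent, computed over the list
mean_scores <- sapply(by_person, function(d) mean(d$AScore))

print(names(by_person))
print(mean_scores)
```

Because every subset lives in one list, the same sapply or lapply call covers all respondents without a loop over object names.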

How can I extract numbers from dataframe names in a list in R?

I have a list called list.data with over 600 dataframes in it. Each data frame has a unique name e.g. 4dMU6_20080605tp.txt and within the name is a date "20080605" that I want to extract and add to a new column in each dataframe.
I have created the new column in each dataframe with the column title "Date" but now need to extract the numbers from multiple dataframe names- any idea how I could do this?
I've tried sapply, but presumably it doesn't work because it is searching the dataframes as opposed to the dataframe names.
sapply(list.data, function(x){as.numeric(x[8])})
Any help would be greatly appreciated!

Using lapply, we can add new Date column to each data frame in the list, using gsub to obtain the date from the name of each list element.
lst_names <- names(list.data)
list.data <- lapply(lst_names, function(x) {
list.data[[x]]$Date <- gsub("^.*_|[A-Za-z]*\\.\\w+$", "", x)
return(list.data[[x]])
})
names(list.data) <- lst_names

If you only want to extract the date from the names in your list, you could use str_extract(names(list.data), "\\d{8}") from the stringr package, which returns the first run of eight digits in each name. Note that names(list.data) returns the names of the list elements (here, the data frame names), not the column names of the data frames.
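A base R sketch of the name-based approach, assuming every name follows the pattern in the question (an underscore followed by an eight-digit date). The second file name and the data frame contents are invented for illustration, and converting to the Date class is an optional extra step:

```r
# Toy list standing in for list.data; the second name and all contents are invented
list.data <- list(
  "4dMU6_20080605tp.txt" = data.frame(x = 1:2),
  "4dMU7_20080612tp.txt" = data.frame(x = 3:4)
)

# Capture the eight digits that follow the underscore in each name
date_str <- sub("^.*_(\\d{8}).*$", "\\1", names(list.data))

# Map() pairs each data frame with its date string and keeps the list names
list.data <- Map(function(d, s) {
  d$Date <- as.Date(s, format = "%Y%m%d")
  d
}, list.data, date_str)

print(list.data[[1]])
```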

Matching and sorting data in R or Excel

I have a list of bacteria, each with its own abundance, in a dataframe. I also have the same list of bacteria, but in a different order, in the same dataframe.
I want to match the abundances to this second list but I'm not sure how to go about doing it.
dplyr contains several methods for sorting data, but I don't know how to match the abundance and print it into a new column so that it matches the second list of bacteria.
Here's the beginning of my dataset:
Taxon                        Total_abundance  Tips
Acaricomes phytoseiuli       0.000382414      Methanothermobacter thermautotrophicus
Acetivibrio cellulolyticus   0.013979274      Methanobacterium beijingense
Acetobacter aceti            0.181150551      Methanobacterium bryantii
Acetobacter estunensis       0.023074895      Methanosarcina mazei
Acetobacter tropicalis       0.014615221      Persephonella marina
Achromobacter piechaudii     0.031811039      Sulfurihydrogenibium azorense
Achromobacter xylosoxidans   0.041558442      Balnearium lithotrophicum
Acidicapsa borealis          0.035525932      Isosphaera pallida
Acidimicrobium ferrooxidans  0.013841209      Simkania negevensis
Acidiphilium angustum        0.041702984      Parachlamydia acanthamoebae
Acidiphilium cryptum         0.039265944      Leptospira biflexa
Acidiphilium rubrum          0.041702984      Leptospira fainei
...
So, the abundance matches the data in Taxon column, and I want the abundance to also be matched with the bacteria in the "Tips" column.
For example, Acaricomes phytoseiuli has an abundance of 0.000382414, so in column D 0.000382414 will be printed in the row where Acaricomes phytoseiuli appears in Tips. Again, Taxon and Tips contain exactly the same data, just in a different order.
I hope that makes sense.
It doesn't matter if this is done in R or Excel, thanks.

As others have mentioned, it's hard to test without some data that matches, but something like this should work, using match to match up values.
df$D <- df$Total_abundance[ match( df$Tips, df$Taxon ) ]
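To see what match does here, a small sketch on invented data (the single-letter taxon names and the abundances are placeholders, not values from the question):

```r
# Toy data; taxon names and abundances are invented for illustration
df <- data.frame(
  Taxon           = c("a", "b", "c", "d"),
  Total_abundance = c(0.1, 0.2, 0.3, 0.4),
  Tips            = c("c", "a", "d", "b"),
  stringsAsFactors = FALSE
)

# match() returns, for each Tips entry, its row position in Taxon
idx <- match(df$Tips, df$Taxon)

# Indexing by those positions reorders the abundances to follow Tips
df$D <- df$Total_abundance[idx]

print(df)
```

Any Tips entry that has no counterpart in Taxon gets NA in the new column, which makes mismatches easy to spot.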

I assume that each bacterium appears only once in your list.
As a sample data frame:
dff <- data.frame(bacteria1=letters[1:10], abundance1=runif(10,0,1),
bacteria2=sample(letters[1:10],10), abundance2=0)
Now we find the matching row for each bacterium and insert its abundance:
for(i in seq_len(nrow(dff))){
s <- which(dff$bacteria2[i]==dff$bacteria1) # row of bacteria1 that matches entry i of bacteria2
dff$abundance2[i] <- dff$abundance1[s]
}

In Excel, enter the following in column D:
=VLOOKUP(C3;$A$3:$B$13;2;FALSE)
C3 is the Tips entry to look up and $A$3:$B$13 is the range to search, with column A holding the bacteria names and column B the abundances. If a match is found, the formula returns the corresponding abundance. The absolute references ($) keep the range fixed when the formula is filled down.
If you get a #N/A error, then there is no match. You can avoid these errors by using this formula:
=IFNA(VLOOKUP(C3;$A$3:$B$13;2;FALSE);"No match")
Edit: Adjust the ranges to your file!
Edit 2: Keep in mind that the separator I use is ; and your Excel might use the comma , separator.

First of all, if your Taxon and Tips columns contain exactly the same data, only in different order, they have no place being together in the same data frame. You should either have two data frames, or come up with some sort of key to define the place of a Taxon item in the phylogenetic tree and then re-sort the data frame as needed, either in alphabetic order or by phylogeny.
As a quick solution, I would first extract the Tips column into a separate data frame and join it with the original data frame, matching Tips to Taxon. This gives the abundance values in the correct order in the new data frame. Then (if you still insist) use cbind to glue the re-sorted abundance column back onto the original data frame. Like so, assuming you're using dplyr (df is a dummy stand-in for your data set):
library(dplyr)
df <- data.frame(Taxon=c("a","b","c","d","e"), Abundance=c(1:5), Tips=c("b","a","d","c","e"))
# keep only the Tips column, in its original order
new_df <- select(df, Tips)
# look up each Tips entry in Taxon and pull in its row
new_df <- left_join(new_df, df, by=c("Tips"= "Taxon"))
df <- cbind(df, New_Abund=new_df$Abundance)
rm(new_df)

R - Subset based on column name

My data frame has over 120 columns (variables) and I would like to create subsets based on column names.
For example I would like to create a subset where the column name includes the string "mood". Is this possible?

I generally use
SubData <- myData[,grep("whatIWant", colnames(myData))]
I know very well that the "," is not necessary and that colnames could be replaced by names, but that would not work with matrices, and I hate to change the formalism when changing objects.
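Applied to the "mood" example from the question, a minimal sketch could look like this. The toy columns are invented, grepl is swapped in for grep to get a logical index, and drop = FALSE is an addition that keeps the result a data frame when only one column matches:

```r
# Toy data frame; column names and values are invented for illustration
myData <- data.frame(id = 1:3,
                     mood_am = c(2, 3, 1),
                     mood_pm = c(4, 2, 5),
                     sleep = c(7, 6, 8))

# grepl() gives TRUE for every column name containing the literal string "mood"
has_mood <- grepl("mood", colnames(myData), fixed = TRUE)

# drop = FALSE keeps the result a data frame even when one column matches
SubData <- myData[, has_mood, drop = FALSE]

print(colnames(SubData))
```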
