I have a csv file containing two columns, "Taxon" in column A and "Tip" in column C. I would like to compare column A against column C, and if the string matches another string in column C I'd like it to print "y" or something similar in column B next to the string in column A, if not I would like to print "n" or equivalent. Here is the beginning of my data:
Taxon B Tip
Nitrosotalea devanaterra Methanothermobacter thermautotrophicus
Nitrososphaera gargensis Methanobacterium beijingense
Nitrososphaera sca5445 Methanobacterium bryantii
Nitrososphaera sca2170 Methanosarcina mazei
Methanobacterium beijingense Persephonella marina
Methanobacterium bryantii Sulfurihydrogenibium azorense
Methanothermobacter thermautotrophicus Balnearium lithotrophicum
Methanosarcina mazei Isosphaera pallida
Koribacter versatilis Methanobacterium beijingense
Acidicapsa borealis Parachlamydia acanthamoebae
Acidobacterium capsulatum Leptospira biflexa
This is only a small part of the data, but the idea is that "n" would be printed in column B for every taxon that does not appear in the "Tip" column, and "y" for those that do, such as "Methanobacterium beijingense" and "Methanobacterium bryantii". These could also just be "1" and "0".
I know dplyr has some good functions for filtering and joining data, however I can't find anything that exactly matches my needs. If there is an alternative method of using Excel to do this that's fine too.
Thanks.
For Excel, use the following formula in B2:
=if(isnumber(match(a2, c:c, 0)), "y", "n")
Fill down or double-click the 'drag button'.
A method using R and dplyr:
# create example data
x = read.table(header = TRUE, stringsAsFactors = FALSE, text =
"Taxon B Tip
Nitrosotalea_devanaterra 1 Methanothermobacter_thermautotrophicus
Nitrososphaera_gargensis 1 Methanobacterium_beijingense
Nitrososphaera_sca5445 1 Methanobacterium_bryantii
Nitrososphaera_sca2170 1 Methanosarcina_mazei
Methanobacterium_beijingense 1 Persephonella_marina
Methanobacterium_bryantii 1 Sulfurihydrogenibium_azorense
Methanothermobacter_thermautotrophicus 1 Balnearium_lithotrophicum
Methanosarcina_mazei 1 Isosphaera_pallida
Koribacter_versatilis 1 Methanobacterium_beijingense
Acidicapsa_borealis 1 Parachlamydia_acanthamoebae
Acidobacterium_capsulatum 1 Leptospira_biflexa")
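# Data management part
library(dplyr)
x1 = data.frame(A = x$Taxon, B = x$B)
x2 = data.frame(A = x$Tip, B = x$B)
# anti_join() keeps the rows of x1 with no match in x2; set B to 0 for those taxa
x$B[x$Taxon %in% anti_join(x1, x2, by = c("A", "B"))$A] = 0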
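The same flag can probably be produced without a join, since the question only asks whether each Taxon occurs anywhere in Tip. The sketch below uses base R's %in% on a small made-up subset of the data (three rows, underscores instead of spaces); the column names follow the question, everything else is an assumption for illustration.

```r
# Small stand-in data; underscores replace spaces as in the example above
x <- data.frame(
  Taxon = c("Nitrosotalea_devanaterra", "Methanobacterium_beijingense",
            "Acidicapsa_borealis"),
  Tip   = c("Methanobacterium_beijingense", "Persephonella_marina",
            "Leptospira_biflexa"),
  stringsAsFactors = FALSE
)

# %in% returns TRUE for each Taxon that occurs anywhere in Tip
x$B <- ifelse(x$Taxon %in% x$Tip, "y", "n")

# Put the columns in the order Taxon, B, Tip
x <- x[, c("Taxon", "B", "Tip")]
print(x)
```

Replacing ifelse(...) with as.integer(x$Taxon %in% x$Tip) would give 1 and 0 instead of "y" and "n".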
I'm new to R Studio and am learning about dataframes.
I'm trying to add the new column "uniqueID" to my dataframe "Populations" with unique values for each row in this new column. No problem, I can append a new column like this: Populations$uniqueID
However I'm having trouble adding unique values to each row under this new column. The values should be a combination of the values in each row from the existing columns "location", "variant", and "time". So, for each row the value for the new column uniqueID should be something like "LocationVariantTime" (e.g. "CaliforniaMedium1953"). Here's the code I'm trying, using paste(), but it's definitely wrong. I need to figure out how to grab the values for each row.
Populations$uniqueID <- paste(Populations$location, Populations$variant, Populations$time)
Here's the output when I view the dataframe. There is no new column with data: https://share.getcloudapp.com/7Kuykdg4
The error that I get reads:
Error in `$<-.data.frame`(`*tmp*`, uniqueID, value = character(0)) :
  replacement has 0 rows, data has 280932
Thank you in advance for helping someone who is learning,
Your code doesn't seem far off. You might have to convert the values in paste() to character first though, like this:
Populations$uniqueID <- paste(as.character(Populations$location), as.character(Populations$variant), as.character(Populations$time), sep = "")
You could row-wise apply paste on the id columns.
Example
dat <- transform(dat, un.id=apply(dat[1:3], 1, paste, collapse=""))
head(dat)
# id type year value un.id
# 1 A Mmedium 2018 1.3709584 AMmedium2018
# 2 B Mmedium 2018 -0.5646982 BMmedium2018
# 3 C Mmedium 2018 0.3631284 CMmedium2018
# 4 A Large 2018 0.6328626 ALarge2018
# 5 B Large 2018 0.4042683 BLarge2018
# 6 C Large 2018 -0.1061245 CLarge2018
Data:
set.seed(42)
dat <- cbind(expand.grid(id=LETTERS[1:3],
type=c("Mmedium", "Large"),
year=2018:2020), value=rnorm(18))
According to the output the names of the columns are uppercase:
Populations$uniqueID <- paste(Populations$Location, Populations$Variant, Populations$Time)
The solution? A simple case change! Thanks, everyone.
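For anyone hitting the same error, the sketch below shows why a wrong column name leads to "replacement has 0 rows": $ returns NULL for a column that does not exist, and paste() over NULL inputs returns a zero-length vector. The two-row Populations data frame here is a made-up stand-in, not the original data.

```r
# Hypothetical stand-in for the Populations data frame
Populations <- data.frame(
  Location = c("California", "Texas"),
  Variant  = c("Medium", "Large"),
  Time     = c(1953, 1960),
  stringsAsFactors = FALSE
)

# A misspelled (here: lower-case) column name silently yields NULL
is.null(Populations$location)

# paste() over NULL inputs gives a zero-length character vector,
# which is why the assignment fails with "replacement has 0 rows"
length(paste(Populations$location, Populations$variant))

# With the correct names, paste0() joins the values without separators
Populations$uniqueID <- paste0(Populations$Location,
                               Populations$Variant,
                               Populations$Time)
print(Populations$uniqueID)
```

paste0() is used here because the desired ID has no spaces; paste() with its default separator would insert one between the parts.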
I am currently developing an application and I need to loop through the columns of the data frame. For instance, if the data frame has the columns
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
If the input is given as "a", then the column name "b" should be assigned to the variable, say promote.
It throws the error: Error in `[.data.frame`(char_set, i + 1) : undefined columns selected. Is there a solution?
char_name <- "a"
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
for (i in 1:ncol(char_set)) {
promote <- ifelse(names(char_set) == char_name,char_set[i+1], "-")
print(promote)
}
Thanks in advance!!!
This is actually quite interesting. I would suggest doing something along these lines:
char_name <- "a"
char_set <- data.frame(
a = 1:2,
b = 3:4,
c = 5:6,
d = 8:9,
stringsAsFactors = FALSE
)
res_dta <- data.frame(matrix(nrow = 2, ncol = 3))
for (i in wrapr::seqi(1, NCOL(char_set) - 1)) {
print(i)
if (names(char_set)[i] == char_name) {
res_dta[i] <- char_set[i + 1]
} else {
res_dta[i] <- char_set[i]
}
}
Results
char_set
a b c d
1 1 3 5 8
2 2 4 6 9
res_dta
X1 X2 X3
1 3 3 5
2 4 4 6
A few general points:
When you are looping through columns, be mindful not to fall outside the data frame's dimensions; running i + 1 with i = 4 asks for column 5, which returns an error for a data frame with four columns. You may then decide to stop one column early or break at a specific value of i.
I'm not sure I understood your request correctly: for column name a you want to take the values of column b, and column b then stays as it was?
Broadly speaking, I think the names(char_set)[i] == char_name comparison requires more thought, but this answer gives you a start. Updating your post with the desired results would help in designing a solution.
The problem in your code is that you loop from 1 to the number of columns of the char_set data frame and then index char_set[i+1].
Thus, when i takes its maximum value, char_set[i+1] returns an error because there is no column with that index.
You can try this solution:
char_name<-"a"
promote<-ifelse((which(names(char_set)==char_name)+1)<=ncol(char_set),names(char_set)[which(names(char_set)==char_name)+1],"-")
promote
> [1] "b"
char_name<-"d"
promote<-ifelse((which(names(char_set)==char_name)+1)<=ncol(char_set),names(char_set)[which(names(char_set)==char_name)+1],"-")
promote
> [1] "-"
When the variable char_name takes the value a, the variable promote takes the name that char_set has at the position after the column named a, which matches char_name.
I suggest you think about the case in which char_name takes the value d and there is nothing in char_set after d.
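The boundary case can be handled in one place with a small helper. The sketch below is only one possible way to write it; next_col and its default argument are hypothetical names, not part of the original code. It returns "-" both when the name is the last column and when it is not found at all.

```r
# Hypothetical helper: returns the name of the column after `name`,
# or `default` if `name` is missing or is the last column
next_col <- function(df, name, default = "-") {
  idx <- match(name, names(df))
  if (is.na(idx) || idx >= ncol(df)) {
    return(default)
  }
  names(df)[idx + 1]
}

# Same empty four-column data frame as in the question
char_set <- data.frame(a = character(), b = character(),
                       c = character(), d = character(),
                       stringsAsFactors = FALSE)

next_col(char_set, "a")  # name of the column after "a"
next_col(char_set, "c")  # name of the column after "c"
next_col(char_set, "d")  # last column, so the default is returned
```

match() is used instead of which() so that a name that is absent gives NA rather than a zero-length index.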
I have data that looks like this:
A 2 3 LOGIC:A
B 3 3 LOGIC:B
C 2 2 COMBO:A
plot(Data$V2[Data$V4 == "LOGIC:A"], Data$V3[Data$V4 == "LOGIC:A"])
However, I want to plot every row where column 4 starts with LOGIC: when I provide "LOGIC" inside the plot command, it should plot both "LOGIC:A" and "LOGIC:B". Right now it only accepts the exact column 4 value. Can I use wildcards?
You can use grepl to find occurrences of your string.
x <- c("LOGIC: A", "COMBO: B")
x[grepl("LOGIC", x)]
[1] "LOGIC: A"
Using Data, shown reproducibly in the Note at the end, this will plot those rows for which V4 contains the substring LOGIC, using the character after the colon to represent each point. If you want all points to be represented by the same character, omit the pch argument from plot.
plot(V3 ~ V2, Data, subset = grep("LOGIC", V4), pch = sub("LOGIC:", "", V4))
Note
Lines <- "A 2 3 LOGIC:A
B 3 3 LOGIC:B
C 2 2 COMBO:A"
Data <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)
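If only values that begin with LOGIC should match (rather than values that contain it anywhere), base R's startsWith is one alternative to a regular expression. The sketch below rebuilds the same Data as in the Note; keep and logic_rows are illustrative names.

```r
Lines <- "A 2 3 LOGIC:A
B 3 3 LOGIC:B
C 2 2 COMBO:A"
Data <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)

# TRUE for rows whose V4 begins with "LOGIC"
keep <- startsWith(Data$V4, "LOGIC")

# Keep only the matching rows, then plot V3 against V2
logic_rows <- Data[keep, ]
plot(logic_rows$V2, logic_rows$V3)
```

grepl("^LOGIC", Data$V4) would give the same logical vector; the ^ anchors the pattern to the start of the string.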
I am trying to train a model on data that was converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that differentiates the same word coming from different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my dataframe), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use the colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it renames the columns by reference, without copying your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
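Applied to the mtcars example from the question, the base R approach might look like the sketch below. The copy named cars and the small pos data frame are assumptions made for illustration; no loop or sapply is needed because paste0 is vectorised over the names.

```r
# Work on a copy so the built-in dataset is left untouched
cars <- mtcars

# Append the same suffix to every column name
names(cars) <- paste0(names(cars), "_sample")
head(names(cars), 3)

# Prefix variant for the positive/negative case; `pos` is a made-up example
pos <- data.frame(hello = 1:2, great = 3:4)
names(pos) <- paste0("positive_", names(pos))
names(pos)
```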
I think I'm missing something super simple, but I can't find a solution that directly relates to what I need. I've got a data frame that has a letter as the row name and two columns of numerical values. As part of a loop I'm running, I create a new vector (from an index) that has both a letter and a number (e.g. "f2"), which I then need to be the name of a new row, with two numbers next to it (based on some other section of code, but I'm fine with that). What I get instead is the name of the vector/index as the row name, and I'm not sure if I'm missing a feature of rbind or something else that would make it easy.
Example code:
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
rownames(data.frame) <- row.names
data.frame
index.vector <- "f2"
#what I want the data frame to look like with the new row
data.frame <- rbind(data.frame, "f2" = c(6,11))
data.frame
#what the data frame looks like when I attempt to use a vector as a row name
data.frame <- rbind(data.frame, index.vector = c(6,11))
data.frame
#"why" I can't just type "f" every time
index.vector2 = paste(index.vector, "2", sep="")
data.frame <- rbind(data.frame, index.vector2 = c(6,11))
data.frame
In my loop the "index.vector" is a random sample, hence where I can't just write the letter/number in as a row name, so need to be able to create the row name from a vector or from the index of the sample.
The loop runs and a random number of new rows will be created, so I can't specify what number the row is that needs a new name - unless there's a way to just do it for the newest or bottom row every time.
Any help would be appreciated!
Not elegant, but works:
new_row <- data.frame(setNames(list(6, 11), colnames(data.frame)), row.names = paste(index.vector, "2", sep=""))
data.frame <- rbind(data.frame, new_row)
data.frame
# vector.1 vector.2
# a 1 2
# b 2 3
# c 3 4
# d 4 5
# e 5 6
# f22 6 11
I understood the problem but was not able to resolve it directly, so I'm suggesting an alternative way to achieve the same result.
Alternative solution: append your row labels after the data binding in your loop, and then assign the row names to your dataframe at the end.
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
#loop starts
index.vector <- "f2"
data.frame <- rbind(data.frame,c(6,11))
row.names<-append(row.names,index.vector)
#loop ends
rownames(data.frame) <- row.names
data.frame
output:
vector.1 vector.2
a 1 2
b 2 3
c 3 4
d 4 5
e 5 6
f2 6 11
Hope this would be helpful.
If you manipulate the data frame with rbind, then the newest elements will always be at the "bottom" of your data frame. Hence you could also set a single row name by
rownames(data.frame)[nrow(data.frame)] = "new_name"
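Another option that may fit the loop in the question is to assign into the data frame by row name: indexing with a name that does not exist yet appends a row carrying that name. The sketch below is a minimal illustration; df and new_names are made-up names, and the values 6 and 11 are reused from the question for every new row.

```r
# `df` is used instead of `data.frame` to avoid masking the base function
df <- data.frame(vector.1 = 1:5, vector.2 = 2:6, row.names = letters[1:5])

# Row names drawn at run time in the real loop; fixed here for reproducibility
new_names <- c("f2", "h7")

for (nm in new_names) {
  # Indexing by a row name that does not exist yet appends a row with that name
  df[nm, ] <- c(6, 11)
}
print(df)
```

Because the name comes from the variable nm rather than from an argument name, the value of the variable ends up as the row name, which is what rbind(data.frame, index.vector = ...) does not do.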