Extract from 2nd column based on string content in 1st column - r

Please help. I need to extract all entries from column B of a data frame
that correspond to particular entries in column A.
I need to search column A for strings that contain GK104.
That is, if an entry in column A contains GK104, the corresponding entry from column B should be returned.
A B
DT-GK104-BIN1-E-A1 8000_AMKR
DT-GK104-BIN2-E-A2 8000_ASET
DT-GK104-BIN3-E-A1 8000_CPAC
DT-GK104-BIN4-E-ZK 8000_PWOO
DT-GK104-BIN5-E-ZK 8000_SPIL

This is simple. To continue Andrew Gustar's comment, you just need to use grepl:
df <-
"A B
DT-GK104-BIN1-E-A1 8000_AMKR
DT-GK104-BIN2-E-A2 8000_ASET
DT-GK104-BIN3-E-A1 8000_CPAC
DT-GK104-BIN4-E-ZK 8000_PWOO
DT-GK104-BIN5-E-ZK 8000_SPIL"
df <- read.table(text=df, header = T, stringsAsFactors = F)
# Save a value which you want to match
value <- "A1"
# You can get a filtered dataframe
df[grepl(value, df$A),]
A B
1 DT-GK104-BIN1-E-A1 8000_AMKR
3 DT-GK104-BIN3-E-A1 8000_CPAC
# Or you can just get a character vector of matched values in the second column
df$B[grepl(value, df$A)]
[1] "8000_AMKR" "8000_CPAC"
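The answer above matches on "A1", while the question asks about GK104. A minimal sketch of the same idea for that string follows. The third row (GK110) is a hypothetical addition so the filter has something to exclude, and fixed = TRUE is used on the assumption that the search term should be treated as a literal string rather than a regular expression.

```r
# Sample data; the GK110 row is hypothetical and not part of the question
df <- data.frame(
  A = c("DT-GK104-BIN1-E-A1", "DT-GK104-BIN2-E-A2", "DT-GK110-BIN1-E-A1"),
  B = c("8000_AMKR", "8000_ASET", "8000_CPAC"),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches the literal text "GK104" instead of a regex
hits <- df$B[grepl("GK104", df$A, fixed = TRUE)]
hits
```

Only the column B values whose column A entry contains GK104 are kept.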

Related

How do I create a new column and add text that is specific to each row in R studio?

I'm new to R Studio and am learning about dataframes.
I'm trying to add the new column "uniqueID" to my dataframe "Populations" with unique values for each row in this new column. No problem, I can append a new column like this: Populations$uniqueID
However I'm having trouble adding unique values to each row under this new column. The values should be a combination of the values in each row from the existing columns "location", "variant", and "time". So, for each row the value for the new column uniqueID should be something like "LocationVariantTime" (e.g. "CaliforniaMedium1953"). Here's the code I'm trying, using paste(), but it's definitely wrong. I need to figure out how to grab the values for each row.
Populations$uniqueID <- paste(Populations$location, Populations$variant, Populations$time)
Here's the output when I view the dataframe. There is no new column with data: https://share.getcloudapp.com/7Kuykdg4
The error that I get reads:
Error in $<-.data.frame(*tmp*, uniqueID, value = character(0)) :
replacement has 0 rows, data has 280932
Thank you in advance for helping someone who is learning.
Your code doesn't seem far off. You might have to convert the values in paste() to character first though, like this:
Populations$uniqueID <- paste(as.character(Populations$location), as.character(Populations$variant), as.character(Populations$time), sep = "")
You could row-wise apply paste on the id columns.
Example
dat <- transform(dat, un.id=apply(dat[1:3], 1, paste, collapse=""))
head(dat)
# id type year value un.id
# 1 A Mmedium 2018 1.3709584 AMmedium2018
# 2 B Mmedium 2018 -0.5646982 BMmedium2018
# 3 C Mmedium 2018 0.3631284 CMmedium2018
# 4 A Large 2018 0.6328626 ALarge2018
# 5 B Large 2018 0.4042683 BLarge2018
# 6 C Large 2018 -0.1061245 CLarge2018
Data:
set.seed(42)
dat <- cbind(expand.grid(id=LETTERS[1:3],
type=c("Mmedium", "Large"),
year=2018:2020), value=rnorm(18))
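As a possible alternative to the row-wise apply(), paste0() is already vectorised, so the id columns can be passed to it directly with do.call(). This is only a sketch on the same sample data; unname() is used so that column names are not mistaken for paste0() arguments.

```r
# Same sample data as in the answer
set.seed(42)
dat <- cbind(expand.grid(id = LETTERS[1:3],
                         type = c("Mmedium", "Large"),
                         year = 2018:2020),
             value = rnorm(18))

# Paste the first three columns together element-wise, without a separator
dat$un.id <- do.call(paste0, unname(as.list(dat[1:3])))
head(dat$un.id, 4)
```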
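According to the output, the column names are capitalized. Because Populations$location does not exist, $ returns NULL, paste() returns character(0), and the assignment fails with "replacement has 0 rows". Using the capitalized names works:
Populations$uniqueID <- paste(Populations$Location, Populations$Variant, Populations$Time)
The solution? A simple case change! Thanks, everyone.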
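The case mismatch can be reproduced in a small sketch. The Populations data frame below is a hypothetical two-row stand-in for the asker's data; only the capitalized column names are taken from the thread.

```r
# Hypothetical stand-in for the asker's data frame
Populations <- data.frame(Location = c("California", "Texas"),
                          Variant  = c("Medium", "High"),
                          Time     = c(1953, 1954),
                          stringsAsFactors = FALSE)

# `$` is case-sensitive: a wrongly cased name silently yields NULL
is.null(Populations$location)
# Pasting NULL gives a zero-length character vector, hence "replacement has 0 rows"
length(paste(Populations$location))

# Checking the actual names first avoids the problem
names(Populations)

# paste0() joins without a separator, giving e.g. "CaliforniaMedium1953"
Populations$uniqueID <- paste0(Populations$Location,
                               Populations$Variant,
                               Populations$Time)
Populations$uniqueID
```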

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that was converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that differentiates the same word coming from different fields. For example, the word hello can appear in both the positive and the negative comment fields (and is thus represented as a column in my dataframe), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
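Applied to the examples from the question, the same approach might look like the sketch below. The pos and neg data frames are hypothetical stand-ins for the two document term matrices; mtcars ships with R.

```r
# Suffix every column of a copy of mtcars with "_sample"
cars <- mtcars
names(cars) <- paste0(names(cars), "_sample")
head(names(cars), 3)

# Hypothetical term counts for the positive and negative comment fields
pos <- data.frame(hello = 1:2, great = 0:1)
neg <- data.frame(hello = 0:1, awful = 1:2)

# Tag each set of columns before combining, so "hello" stays distinguishable
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))
combined <- cbind(pos, neg)
names(combined)
```

No sapply or lapply is needed, because paste0() works on the whole vector of names at once.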
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package. It renames by reference, so it doesn't create a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4

Use a vector/index as a row name in a dataframe using rbind

I think I'm missing something super simple, but I can't find a solution that directly addresses what I need. I've got a data frame that has a letter as the row name and two columns of numerical values. As part of a loop, I create a new vector (from an index) that holds both a letter and a number (e.g. "f2"), which needs to become the name of a new row with two numbers next to it (those come from another section of code, which I'm fine with). What I get instead is the name of the vector/index variable as the row name, and I'm not sure whether I'm missing an rbind feature or something else that would make this easy.
Example code:
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
rownames(data.frame) <- row.names
data.frame
index.vector <- "f2"
#what I want the data frame to look like with the new row
data.frame <- rbind(data.frame, "f2" = c(6,11))
data.frame
#what the data frame looks like when I attempt to use a vector as a row name
data.frame <- rbind(data.frame, index.vector = c(6,11))
data.frame
#"why" I can't just type "f" every time
index.vector2 = paste(index.vector, "2", sep="")
data.frame <- rbind(data.frame, index.vector2 = c(6,11))
data.frame
In my loop the "index.vector" is a random sample, hence where I can't just write the letter/number in as a row name, so need to be able to create the row name from a vector or from the index of the sample.
The loop runs and a random number of new rows will be created, so I can't specify what number the row is that needs a new name - unless there's a way to just do it for the newest or bottom row every time.
Any help would be appreciated!
Not elegant, but works:
new_row <- data.frame(setNames(list(6, 11), colnames(data.frame)), row.names = paste(index.vector, "2", sep=""))
data.frame <- rbind(data.frame, new_row)
data.frame
# vector.1 vector.2
# a 1 2
# b 2 3
# c 3 4
# d 4 5
# e 5 6
# f22 6 11
I understood the problem but could not resolve it directly, so here is an alternative way to achieve the same result.
Alternative solution: collect the row labels as you bind rows inside the loop, then assign the row names to the data frame once at the end.
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
#loop starts
index.vector <- "f2"
data.frame <- rbind(data.frame,c(6,11))
row.names<-append(row.names,index.vector)
#loop ends
rownames(data.frame) <- row.names
data.frame
output:
vector.1 vector.2
a 1 2
b 2 3
c 3 4
d 4 5
e 5 6
f2 6 11
Hope this would be helpful.
If you manipulate the data frame with rbind, then the newest elements will always be at the "bottom" of your data frame. Hence you could also set a single row name by
rownames(data.frame)[nrow(data.frame)] = "new_name"
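A minimal sketch of this last suggestion, using the question's example values. The data frame is called df here (an assumption, chosen to avoid masking the data.frame() function), and the row name is taken from the index.vector variable rather than typed in.

```r
# Rebuild the example data with letters as row names
df <- data.frame(vector.1 = 1:5, vector.2 = 2:6, row.names = letters[1:5])

index.vector <- "f2"

# rbind() treats an argument name literally, so bind the unnamed values first ...
df <- rbind(df, c(6, 11))

# ... then rename the newest (bottom) row using the value stored in the variable
rownames(df)[nrow(df)] <- index.vector
df
```

Inside a loop, the last two statements can be repeated for each new row, since the row just added is always at position nrow(df).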

Combine, Order, Dedup over Multiple Files in R

I have a large number of CSV files that look like this:
var val1 val2
a 2 1
b 2 2
c 3 3
d 9 2
e 1 1
I would like to:
1. Read them in
2. Take the top 3 from each CSV
3. Make a list of the variable names only (3 x number of files)
4. Keep only the unique names on the list
I think I have managed to get to point 3 by doing this:
csvList <- list.files(path = "mypath", pattern = "*.csv", full.names = T)
bla <- lapply(lapply(csvList, read.csv), function(x) x[order(x$val1, decreasing=T)[1:3], ])
lapply(bla,"[", , 1, drop=FALSE)
Now, I have a list of the top 3 variables in each CSV. However, I don't know how to convert this list to a string and keep only the unique values.
Any help is welcome.
Thank you!
The issue is in extracting the first columns of bla with drop=FALSE. This preserves each result as a one-column data frame (where each row has a name) instead of coercing it to its lowest dimension, which is a vector. Use drop=TRUE instead and then unlist followed by unique, as @Frank suggests:
unique(unlist(lapply(bla,"[", , 1, drop=TRUE)))
As you know, drop=TRUE is the default, so you don't even have to include it.
Update to new requirements in comments.
To keep the first two columns var and val1 and remove duplicates in var (keep only the unique vars), do the following:
## unlist each column in turn and form a data frame
res <- data.frame(lapply(c(1,2), function(x) unlist(lapply(bla,"[", , x))))
colnames(res) <- c("var","val1") ## restore the two column names
## remove duplicates
res <- res[!duplicated(res[,1]),]
Note that this will only keep the first row for each unique var. This is the definition of removing duplicates here.
Hope this helps.
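The four steps from the question can be sketched end to end as follows. The two CSV files are hypothetical demo data written to a temporary directory, and the pattern "\\.csv$" is used on the assumption that only files ending in .csv should be read.

```r
# Write two hypothetical demo files to a fresh temporary directory
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
write.csv(data.frame(var = letters[1:5], val1 = c(2, 7, 3, 9, 1), val2 = c(1, 2, 3, 2, 1)),
          file.path(demo_dir, "one.csv"), row.names = FALSE)
write.csv(data.frame(var = letters[1:5], val1 = c(5, 1, 2, 4, 3), val2 = c(3, 1, 2, 2, 1)),
          file.path(demo_dir, "two.csv"), row.names = FALSE)

# 1. Read them in; 2. keep the three rows with the largest val1 per file
csvList <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
top3 <- lapply(csvList, function(f) {
  x <- read.csv(f)
  head(x[order(x$val1, decreasing = TRUE), ], 3)
})

# 3. Collect the variable names; 4. keep only the unique ones
top_vars <- unique(unlist(lapply(top3, function(x) as.character(x$var))))
top_vars
```

as.character() is there so the result is a plain character vector even if read.csv() returns var as a factor, as older R versions do by default.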

Splitting a dataframe if rows are numeric or not in R

I have a data frame (let's call it 'df') that consists of two columns:
Name Contact
A 34552325
B 423424
C 4324234242
D hello1#company.com
I want to split the dataframe into two dataframes based on whether the value in column "Contact" is numeric or not.
Expected Output:
Name Contact
A 34552325
B 423424
C 4324234242
and
Name Contact
D hello1#company.com
I tried using:
df$IsNum <- !(is.na(as.numeric(df$Contact)))
But this classified "hello1#company.com" also as numeric.
Basically if there is even a single non-numeric value in column "Contact", then code must classify it as non-numeric
You can use grepl:
x <- " Name Contact
A 34552325
B 423424
C 4324234242
D hello1#company.com"
df <- read.table(text=x, header = TRUE, comment.char = "")  # keep "#" inside values instead of treating it as a comment
x <- df[grepl("^\\d+$",df$Contact),]
y <- df[!grepl("^\\d+$",df$Contact),]
x
# Name Contact
# 1 A 34552325
# 2 B 423424
# 3 C 4324234242
y
# Name Contact
# 4 D hello1#company.com
We can create a grouping variable with grepl (the same pattern @Avinash Raj used) and split the dataframe on it to get a list of data.frames.
split(df, grepl('^\\d+$', df$Contact))
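split() returns a list whose elements are named after the grouping values, here "FALSE" and "TRUE". A minimal sketch of pulling the two data frames out of that list follows. The data frame is built directly with data.frame(), and the pattern "^[0-9]+$" is an equivalent spelling of the digit-only test. One possible reason the as.numeric() attempt in the question misfired (an assumption, since the question does not show how df was built) is that Contact was stored as a factor, in which case as.numeric() returns the integer level codes rather than NA.

```r
df <- data.frame(Name = c("A", "B", "C", "D"),
                 Contact = c("34552325", "423424", "4324234242", "hello1#company.com"),
                 stringsAsFactors = FALSE)

# TRUE only when the whole value consists of digits
is_num <- grepl("^[0-9]+$", df$Contact)

# The list elements are named "FALSE" and "TRUE"
parts <- split(df, is_num)
numeric_df <- parts[["TRUE"]]
other_df   <- parts[["FALSE"]]

# On a factor, as.numeric() gives level codes, so nothing becomes NA
codes <- as.numeric(factor(c("423424", "hello")))
anyNA(codes)
```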
