Keep quotes when creating data frame in R - r

I have a data frame df with many columns. I extract two of them (col1 and col2) with df2 = data.frame(df$col1, df$col2).
It works: a new data frame made of those two columns is created. But df$col1 consisted of strings such as:
"test1"
"test2"
df2$col1 instead consists of unquoted values (I am not sure what to call them) such as:
test1
test2
The intersection of df$col1 and df2$col1 is empty. How do I keep the column as a string in the new data frame?
I tried adding stringsAsFactors = FALSE but nothing changed.

'df' is your data frame and you do not want to change the original data type, i.e. the column should stay a character vector.
So subset those columns from the original data frame instead of building a new data frame with 'data.frame':
> df2 <- df[, c("col1", "col2")]
You can check the data type of each column in a data frame with:
> str(df2)
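As a minimal sketch of the subsetting approach above, the following checks that the subset keeps col1 as character and that the intersection with the original column is no longer empty. The example data and the extra column col3 are assumptions made up for illustration.

```r
# Hypothetical data; stringsAsFactors = FALSE keeps col1 as character on any R version
df <- data.frame(col1 = c("test1", "test2"),
                 col2 = c(10, 20),
                 col3 = c(TRUE, FALSE),
                 stringsAsFactors = FALSE)

# Subset by column name instead of rebuilding with data.frame()
df2 <- df[, c("col1", "col2")]

class(df2$col1)               # "character"
intersect(df$col1, df2$col1)  # "test1" "test2"
```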

Your first data.frame has col1 stored as character. When you create a second data.frame, this character column is coerced to factor (the default behaviour in R versions before 4.0.0, where stringsAsFactors = TRUE). Here is a short demonstration.
> df1 <- data.frame(col1 = c("a", "b", "c"), col2 = 1:3)
> df1$col1
[1] a b c
Levels: a b c
> df1$col1 <- as.character(df1$col1)
> df1$col1
[1] "a" "b" "c" # this is what you have
>
> df2 <- data.frame(col1 = df1$col1)
> df2$col1
[1] a b c # coerced to factor
Levels: a b c

Related

Create a Column with Unique values from Lists Columns

I have a dataset in RStudio with columns that contain lists. Here is an example where columns "a" and "c" contain a list in each row.
What am I looking for?
I need to create a new column that collects the unique values from columns a, b and c and skips NA or NULL values.
The expected result is the column "desired_result".
test <- tibble(a = list(c("x1","x2"), c("x1","x3"),"x3"),
b = c("x1", NA,NA),
c = list(c("x1","x4"),"x4","x2"),
desired_result = list(c("x1","x2","x4"),c("x1","x3","x4"),c("x2","x3")))
What have I tried so far?
I tried the following, but it does not produce the expected result shown in column "desired_result":
test$attempt_1_ <-lapply(apply((test[, c("a","b","c"), drop = T]),
MARGIN = 1, FUN= c, use.names= FALSE),unique)
We may use pmap to loop over the corresponding elements of 'a' to 'c', remove the NAs (na.omit), and get the sorted unique values, stored as a list in 'desired_result2':
library(dplyr)
library(purrr)
test <- test %>%
mutate(desired_result2 = pmap(across(a:c), ~ sort(unique(na.omit(c(...))))))
Checking against the OP's expected result:
> all.equal(test$desired_result, test$desired_result2)
[1] TRUE
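For comparison, here is a rough base R sketch of the same idea using Map instead of pmap. It is not the answer's method, only an alternative under the same assumptions; the helper name combine_unique and the standalone vectors a, b and cc are hypothetical and simply mirror the columns of the example tibble.

```r
# Standalone copies of the example columns (cc stands in for column "c")
a  <- list(c("x1", "x2"), c("x1", "x3"), "x3")
b  <- c("x1", NA, NA)
cc <- list(c("x1", "x4"), "x4", "x2")

# Hypothetical helper: flatten one row's pieces, drop NA, deduplicate, sort
combine_unique <- function(...) {
  v <- c(...)
  sort(unique(v[!is.na(v)]))
}

# Map walks the three inputs in parallel, one "row" at a time
result <- Map(combine_unique, a, b, cc)
result[[3]]  # "x2" "x3"
```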

How to get all columns with the same column name in R at once?

Let's say I have the following data frame:
> test <- cbind(test=c(1, 2, 3), test=c(1, 2, 3))
> test
test test
[1,] 1 1
[2,] 2 2
[3,] 3 3
Now from such data frame I want to fetch all the columns named "test" to a new data frame:
> new_df <- test[, "test"]
However this last attempt to do so only fetches the first column called "test" in test data frame:
> new_df
[1] 1 2 3
How can I get all of the columns called "test" in this example and put them into a new data frame in a single command? In my real data I have many columns with repeated column names and I don't know the indices of the columns, so I can't get them by number.
For practical reasons it is not advisable to have identical column names. But we can do a comparison (==) to get a logical vector and use that to extract the columns:
i1 <- colnames(test) == "test"
new_df <- test[, i1, drop = FALSE]
Note that data.frame does not allow duplicate column names by default and makes them unique by appending .1, .2, etc. (via make.unique). A matrix (the OP's dataset) does allow duplicate column names or row names, though this is not recommended.
Also, if several column names are repeated and you want each group as a separate dataset, use split:
lst1 <- lapply(split(seq_len(ncol(test)), colnames(test)), function(i)
test[, i, drop = FALSE])
Or loop over the unique column names with lapply and compare each one with ==:
lst2 <- lapply(unique(colnames(test)), function(nm)
test[, colnames(test) == nm, drop = FALSE])
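To illustrate the point about matrices versus data frames, here is a small sketch with a made-up matrix m (the column "other" is an assumption added so that the selection is not trivial). It selects the duplicated columns by name and then shows how data.frame, with its default check.names = TRUE, renames them.

```r
# Hypothetical matrix with a duplicated column name
m <- cbind(test = 1:3, other = 4:6, test = 7:9)

# Logical index: TRUE for every column called "test"
sel <- m[, colnames(m) == "test", drop = FALSE]
ncol(sel)   # 2

# Converting to a data frame makes the duplicated names unique
d <- data.frame(sel)
names(d)    # "test" "test.1"
```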

Find if a row contain a character and create a new column to label data

I have a data frame with one column, and I want to check for every row whether "#" exists in the text. Example data:
"https://example.com/test-ability/321#321"
"https://example.com/test-ability/"
"anothertext#"
"notwithwhatyousearch"
How can I check whether each row contains the character "#" and create a second column that labels rows containing this character with "A" and rows without it with "B"?
Example of expected output:
"https://example.com/test-ability/321#321", "A"
"https://example.com/test-ability/", "B"
"anothertext#", "A"
"notwithwhatyousearch", "B"
df = data.frame(x = c("https://example.com/test-ability/321#321",
"https://example.com/test-ability/",
"anothertext#",
"notwithwhatyousearch"), stringsAsFactors = F)
library(dplyr)
df %>% mutate(flag = ifelse(grepl("#", x), "A", "B"))
# x flag
# 1 https://example.com/test-ability/321#321 A
# 2 https://example.com/test-ability/ B
# 3 anothertext# A
# 4 notwithwhatyousearch B
Or a base R solution:
df$flag = ifelse(grepl("#", df$x), "A", "B")
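A small variation, offered only as a sketch: since "#" is a literal character rather than a pattern, grepl can be called with fixed = TRUE, which skips regular-expression matching altogether. The vector x below is the question's example data; the name flag follows the answer above.

```r
x <- c("https://example.com/test-ability/321#321",
       "https://example.com/test-ability/",
       "anothertext#",
       "notwithwhatyousearch")

# fixed = TRUE treats "#" as a plain string, not a regular expression
flag <- ifelse(grepl("#", x, fixed = TRUE), "A", "B")
flag  # "A" "B" "A" "B"
```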
Try this Python code. I don't know much about R, but this may help as a reference:
ls = ["https://example.com/test-ability/321#321",
      "https://example.com/test-ability/",
      "anothertext#",
      "notwithwhatyousearch"]  # list of input strings
length = len(ls)  # number of strings
for i in range(0, length):  # iterate over the strings
    if '#' in ls[i]:
        print(ls[i], ' A')
    else:
        print(ls[i], ' B')

sub-setting column based on column name in R

I have a data frame with the column names below:
colnames(Data)
[1] "ID" "A" "B" "C" "D" "E" "F" "G"
I want to select all columns from column D onward.
Currently these are columns D, E, F and G, but more columns may be added after them. More columns may also be added before D, so I cannot be sure at which position column D will be.
Is there a subset command in R that I can use? Something like the following:
Datanew <- subset(Data,select=c("D","E","F","G"))
Please advise.
Find which column is D and select all the following columns (using ncol):
columnToSelect <- which(names(Data) == "D"):ncol(Data)
Datanew <- subset(Data, select = columnToSelect)
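The same selection can be sketched with match and plain indexing, which also makes it easy to fail loudly when column D is missing. The example data frame and the variable name start are assumptions for illustration only.

```r
# Hypothetical data with the column names from the question
Data <- data.frame(ID = 1:2, A = 1:2, B = 1:2, C = 1:2,
                   D = 1:2, E = 1:2, F = 1:2, G = 1:2)

# match returns the position of the first column named "D", or NA if absent
start <- match("D", names(Data))
if (is.na(start)) stop("column D not found")

# Keep D and every column after it; drop = FALSE keeps a data frame
Datanew <- Data[, start:ncol(Data), drop = FALSE]
names(Datanew)  # "D" "E" "F" "G"
```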
You can use tail to get the last n names of the data frame once you find where column D is. We can utilize it like this
tail(1:5, 3) # return the last three elements
The following is equivalent
tail(1:5, -2) # don't return the first two elements
If we use which to find column D
columnToSelect <- which(names(Data) == "D")
We can use tail to get all of the columns from D and following.
tail(names(Data), -(columnToSelect - 1))
The column selection, then, can be wrapped up in one neat little call
Data[tail(names(Data), -(which(names(Data) == "D") - 1))]
A fully reproducible example:
Data <-
lapply(LETTERS[1:10],
function(l){
x <- data.frame(l = rnorm(10))
names(x) <- l
x
})
Data <- as.data.frame(Data)
Data[tail(names(Data), -(which(names(Data) == "D") - 1))]

R: control auto-created column names in call to rbind()

If I do something like this:
> df <- data.frame()
> rbind(df, c("A","B","C"))
X.A. X.B. X.C.
1 A B C
You can see the row gets added to the empty data frame. However, the columns get named automatically based on the content of the data.
This causes problems if I later want to:
> df <- rbind(df, c("P", "D", "Q"))
Is there a way to control the names of the columns that get automatically created by rbind? Or some other way to do what I'm attempting to do here?
@baha-kev has a good answer regarding strings and factors.
I just want to point out the weird behavior of rbind for data.frame:
# This "should work", but it doesn't:
# Create an empty data.frame with the correct names and types
df <- data.frame(A=numeric(), B=character(), C=character(), stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Messes up names!
rbind(df, list(A=42, B='foo', C='bar')) # OK...
# If you have at least one row, names are kept...
df <- data.frame(A=0, B="", C="", stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
But if you only have strings then why not use a matrix instead? Then it works fine to start with an empty matrix:
# Create a 0x3 matrix:
m <- matrix('', 0, 3, dimnames=list(NULL, LETTERS[1:3]))
# Now add a row:
m <- rbind(m, c('foo','bar','baz')) # This works fine!
m
# Then optionally turn it into a data.frame at the end...
as.data.frame(m, stringsAsFactors=FALSE)
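Another way to keep control of the names, sketched here under the assumption that all rows are known before the data frame is needed: collect the rows in a list, bind them once, and set the column names explicitly. The names first, second and third are hypothetical.

```r
# Collect rows first instead of growing an empty data frame
rows <- list(c("A", "B", "C"), c("P", "D", "Q"))

# Bind all rows into a 2 x 3 character matrix in one step
m <- do.call(rbind, rows)

# Convert once and assign the column names explicitly
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("first", "second", "third")
str(df)
```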
Set the stringsAsFactors argument to FALSE, which stores the values as characters:
df=data.frame(first = 'A', second = 'B', third = 'C', stringsAsFactors=FALSE)
df2 = rbind(df,c('Horse','Dog','Cat'))
df2
first second third
1 A B C
2 Horse Dog Cat
sapply(df2,class)
first second third
"character" "character" "character"
Later, if you want to use factors, you could convert it like this:
df2 = as.data.frame(df, stringsAsFactors=T)
