How to avoid strings being automatically changed to numbers in R

I was trying to save some strings into a matrix, but they were automatically changed to numbers (factor levels). How can I avoid this?
Here is the table:
trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b
I want to save it to a matrix like:
J0001 a ab ab b
But what I get is:
J0001 1 2 2 3
How can I avoid this?

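Your M column is stored as a factor (read.table created factors by default before R 4.0.0; stringsAsFactors = TRUE below reproduces that behaviour on newer versions). You can recover the labels by wrapping the column in as.character:
> dat <- read.table(header = TRUE, stringsAsFactors = TRUE, text = "trt means M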
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b")
> as.numeric(dat$M)
# [1] 1 2 2 3
> as.character(dat$M)
# [1] "a" "ab" "ab" "b"
You can avoid this in the first place by passing stringsAsFactors = FALSE when you read the data into R (this is the default since R 4.0.0), or by using the colClasses argument that some of the read-in functions provide.
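As a minimal sketch of that advice, the snippet below reads the question's table with stringsAsFactors = FALSE and stores the M column in a one-row character matrix. The object name res is an assumption made for illustration; "J0001" is the row label from the question.

```r
# Read the question's table; strings stay character instead of becoming factors
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b")

# One-row character matrix; "J0001" is the row label used in the question
res <- matrix(dat$M, nrow = 1, dimnames = list("J0001", NULL))
res

# If a column is already a factor, convert it before storing it
f <- factor(c("a", "ab", "ab", "b"))
as.character(f)
```

Because a matrix holds a single type, mixing these strings with numbers in the same matrix would turn every cell into a character string.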

Related

Linking rows in a data frame

I have a data frame with 4 columns.
m <-c(1,2,3,4)
e <-c('01/01/1970', '02/01/1981','03/05/1986','01/01/1970')
z <-c(111,123, 151, 111)
l <-c('XAR', 'XAR', 'XUI','XUI' )
q <-c(673, 673, 304, 455)
df <- data.frame(m,e,z,l,q)
I need to create a new df that describes the relationships between rows.
There is a relationship if rows match other rows in any 2 out of the 4 fields
For instance :
The resulting df in this case would be :
In my production data there are 700,000 rows. I've tried to solve this using SQL, but the recursive nature of the function makes it too slow for production purposes.
I wondered whether R or its packages have any graph capability that would make this practical.
It's not entirely clear what output you expect.
In any case, data.table makes it easy and fast to identify rows with common values:
library(data.table)
# convert your data frame into data table
setDT(df)
# create common id for rows with same values in 'e' AND 'z'
df[, id_ez :=.GRP, by=.(e,z)]
# create common id for rows with same values in 'l' AND 'q'
df[, id_lq :=.GRP, by=.(l,q)]
> head(df)
> m e z l q id_ez id_lq
> 1: 1 01/01/1970 111 XAR 673 1 1
> 2: 2 02/01/1981 123 XAR 673 2 1
> 3: 3 03/05/1986 151 XUI 304 3 2
> 4: 4 01/01/1970 111 XUI 455 1 3
Now you can get a two-column output that tells you which 'm' is linked to each id:
df[, .(m_linked=paste(m)), by=id_ez]
> id_ez m_linked
> 1: 1 1
> 2: 1 4
> 3: 2 2
> 4: 3 3
If you want to turn this table into a list of vectors:
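# 'a' is the linked table produced in the previous step
a <- df[, .(m_linked = paste(m)), by = id_ez]
mysplit <- split(a$m_linked, a$id_ez)
myresult <- lapply(mysplit, as.character)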
> myresult
$`1`
[1] "1" "4"
$`2`
[1] "2"
$`3`
[1] "3"
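Since the expected output is not shown, the following is only a sketch under the assumption that the goal is an edge list: one row per pair of 'm' values that agree on any 2 of the 4 fields e, z, l and q. It uses base R only; the names fields, field_pairs and links are made up for illustration.

```r
df <- data.frame(
  m = c(1, 2, 3, 4),
  e = c("01/01/1970", "02/01/1981", "03/05/1986", "01/01/1970"),
  z = c(111, 123, 151, 111),
  l = c("XAR", "XAR", "XUI", "XUI"),
  q = c(673, 673, 304, 455),
  stringsAsFactors = FALSE
)

fields <- c("e", "z", "l", "q")
field_pairs <- combn(fields, 2, simplify = FALSE)  # the 6 two-field combinations

links <- lapply(field_pairs, function(p) {
  # group rows that share the same values in both fields of this pair
  groups <- split(df$m, interaction(df[p], drop = TRUE))
  groups <- groups[lengths(groups) > 1]
  if (length(groups) == 0) return(NULL)
  # emit every pair of linked 'm' values within each group
  do.call(rbind, lapply(groups, function(g) {
    cb <- t(combn(g, 2))
    data.frame(from = cb[, 1], to = cb[, 2],
               fields = paste(p, collapse = "+"),
               stringsAsFactors = FALSE)
  }))
})
links <- unique(do.call(rbind, Filter(Negate(is.null), links)))
rownames(links) <- NULL
links
```

For the sample data this links rows 1 and 4 (same e and z) and rows 1 and 2 (same l and q). On 700,000 rows, enumerating all pairs inside a very large group can become expensive, so this shape may need adapting; the resulting edge list could then be passed to a graph package to find connected groups.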

R, construct a data.frame column by using data from another list

Given a list x:
$a
[1] 1 2 3 4 5 6
$b
[1] 10 20 30 40 50
$c
[1] 100 200 300 400 500
I want to construct a data frame that contains one column containing the following values:
1 10 100
Namely the elements of the column come from the first element in x$a, x$b and x$c.
I wonder what is the most efficient way to construct this column?
We can use [ to extract the 1st element
d1 <- data.frame(Col1 = unname(sapply(x, `[`, 1)))
d1
# Col1
#1 1
#2 10
#3 100
We can also do
data.frame(Col1 = do.call(cbind, x)[1,])
You can try this too:
data.frame(Col1=do.call(rbind, x)[,1])
Col1
a 1
b 10
c 100
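A small variation, offered as a sketch rather than a benchmarked recommendation: vapply extracts the first element of each list entry and checks that every result is a single number. It also avoids the recycling that cbind and rbind perform when the list elements have different lengths, as they do here.

```r
x <- list(a = c(1, 2, 3, 4, 5, 6),
          b = c(10, 20, 30, 40, 50),
          c = c(100, 200, 300, 400, 500))

# take the first element of each vector; numeric(1) declares the expected result type
first_vals <- vapply(x, function(v) v[1], numeric(1))

d1 <- data.frame(Col1 = unname(first_vals))
d1
```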

How to split character value properly

I have a data frame that contains some composite information. I would like to split the vector a into the vectors "a" and "d", where "a" holds only the numeric IDs (898, 3467, 234, 222) and "d" holds the corresponding character values.
Data:
a<-c("898_Me","3467_You or ", "234_Hi-hi", "222_what")
b<-c(1,8,3,8)
c<-c(2,4,6,2)
df<-data.frame(a,b,c)
What I tried so far:
a<-str(df$a)
a<-strsplit(df$a, split)
But that just doesn't work out with my regular expression skills.
The required output table might have the form:
a d b c
898 Me 1 2
3467 You or 8 4
234 Hi-hi 3 6
222 what 8 2
library(tidyr)
a<-c("898_Me","3467_You or ", "234_Hi-hi", "222_what")
b<-c(1,8,3,8)
c<-c(2,4,6,2)
df <-data.frame(a,b,c)
final_df <- separate(df , a , c("a" , "d") , sep = "_")
# a d b c
#1 898 Me 1 2
#2 3467 You or 8 4
#3 234 Hi-hi 3 6
#4 222 what 8 2
final_df$d
# [1] "Me" "You or " "Hi-hi" "what"
strsplit is right, but you need to pass the character to split with:
do.call(rbind, strsplit(as.character(df$a), "_"))
# [,1] [,2]
# [1,] "898" "Me"
# [2,] "3467" "You or "
# [3,] "234" "Hi-hi"
# [4,] "222" "what"
Or
library(stringi)
stri_split_fixed(df$a, "_", simplify = TRUE)
With your example, here is my solution in base R. Splitting at the underscore (rather than stripping digits) also removes the "_" from d and keeps any digits that appear in the text part:
df$a2 <- sub("_.*$", "", df$a)
df$d <- sub("^[^_]*_", "", df$a)
That gives:
> df
a b c a2 d
1 898_Me 1 2 898 Me
2 3467_You or 8 4 3467 You or
3 234_Hi-hi 3 6 234 Hi-hi
4 222_what 8 2 222 what
Not elegant, but it preserves the original data and is easy to apply.
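To end up with the table in the column order the question asks for, the same first-underscore split can be combined with a numeric conversion of the ID and trimming of stray whitespace. This is a sketch; the names id, d and out are chosen here for illustration.

```r
df <- data.frame(a = c("898_Me", "3467_You or ", "234_Hi-hi", "222_what"),
                 b = c(1, 8, 3, 8),
                 c = c(2, 4, 6, 2),
                 stringsAsFactors = FALSE)

# everything before the first underscore is the ID
id <- as.numeric(sub("_.*$", "", df$a))
# everything after the first underscore is the label; trimws drops the trailing space
d <- trimws(sub("^[^_]*_", "", df$a))

out <- data.frame(a = id, d = d, b = df$b, c = df$c, stringsAsFactors = FALSE)
out
```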

R: count word occurrence by row and create variable

I am new to R. I am looking to create a function that counts the number of rows containing one or more of the following words ("foo", "x", "y") in a column.
I then want to label that row with a variable, such as "1".
I have a data frame that looks like this:
a->
id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"
The correct output should be:
count: 3
new data frame
a2 ->
id text time username keywordtag
1 "hello x" 10 "me" 1
2 "foo and y" 5 "you" 1
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know" 1
Any hints on how to do this would be appreciated!
Here are 2 approaches with base and qdap:
a <- read.table(text='id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"', header=TRUE)
# Base
# match foo, x or y as whole words ([foo] would be a character class matching a single f or o)
a$keywordtag <- as.numeric(grepl("\\b(foo|x|y)\\b", a$text))
a
# qdap
library(qdap)
terms <- termco(gsub("(,)([^ ])", "\\1 \\2", a$text),
id(a), list(terms = c(" foo ", " x ", " y ")))
a$keywordtag <- as.numeric(counts(terms)[[3]] > 0)
a
# output
## id text time username keywordtag
## 1 1 hello x 10 me 1
## 2 2 foo and y 5 you 1
## 3 3 nothing 15 everyone 0
## 4 4 x,y,foo 0 know 1
The base approach is by far more elegant and simple.
# EDIT (borrowing from Richard; I believe this is the most generalizable and understandable):
words <- c("foo", "x", "y")
regex <- paste(sprintf("\\b%s\\b", words), collapse="|")
within(a,{
keywordtag = as.numeric(grepl(regex, a$text))
})
This is probably much safer than my previous answer.
> string <- c("foo", "x", "y")
> a$keywordtag <-
(1:nrow(a) %in% unlist(lapply(string, grep, a$text, fixed = TRUE)))+0
> a
# id text time username keywordtag
# 1 1 hello x 10 me 1
# 2 2 foo and y 5 you 1
# 3 3 nothing 15 everyone 0
# 4 4 x,y,foo 0 know 1
Your question boils down to splitting a vector of strings on multiple delimiters and checking if any of the tokens are in your set of desired words. You can split on multiple delimiters using strsplit (I'll use comma and whitespace, since your question doesn't specify the full set of delimiters for your problem), and I'll use intersect to check if it contains any word in your set:
m <- c("foo", "x", "y")
a$keywordtag <- as.numeric(unlist(lapply(strsplit(as.character(a$text), ",|\\s"),
function(x) length(intersect(x, m)) > 0)))
a
# id text time username keywordtag
# 1 1 hello x 10 me 1
# 2 2 foo and y 5 you 1
# 3 3 exciting 15 everyone 0
# 4 4 x,y,foo 0 know 1
I've included "exciting", which is a word that contains "x" but that isn't listed as a match by this approach.
A variation on Tyler Rinker's answer (note that without word boundaries this also matches words that merely contain "x" or "y"):
within(a,{keywordtag = as.numeric(grepl("foo|x|y", fixed = FALSE, a$text))})
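Putting the pieces together, here is a sketch that builds a whole-word pattern from the word list, tags each row, and produces the overall count the question asks for. The names pattern and n_rows are assumptions for illustration, and perl = TRUE is used so that \b is handled by the PCRE engine.

```r
a <- data.frame(id = 1:4,
                text = c("hello x", "foo and y", "nothing", "x,y,foo"),
                time = c(10, 5, 15, 0),
                username = c("me", "you", "everyone", "know"),
                stringsAsFactors = FALSE)

words <- c("foo", "x", "y")
# \b(foo|x|y)\b : any of the words, matched only as a whole word
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

a$keywordtag <- as.integer(grepl(pattern, a$text, perl = TRUE))
n_rows <- sum(a$keywordtag)  # number of rows containing at least one keyword
n_rows
```

With whole-word matching, a text such as "exciting" is not tagged even though it contains an "x".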

Matching without replacement by id in R

In R, I can easily match unique identifiers using the match function:
match(c(1,2,3,4),c(2,3,4,1))
# [1] 4 1 2 3
When I try to match non-unique identifiers, I get the following result:
match(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 3
Is there a way to match the indices "without replacement", that is, each index appearing only once?
othermatch(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 4 # note the 4 where there was a 3 at the end
You're looking for pmatch:
pmatch(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 4
A more naive approach -
library(data.table)
a <- data.table(p = c(1,2,3,1))
a[,indexa := .I]
b <- data.table(q = c(2,3,1,1))
b[,indexb := .I]
setkey(a,p)
setkey(b,q)
# since the two vectors are permutations of each other, cbind-ing the sorted tables gives ab with ab[,p] == ab[,q]
ab <- cbind(a,b)
setkey(ab,indexa)
ab[,indexb]
#[1] 3 1 2 4
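Note that pmatch coerces its arguments to character and also allows partial matches, which may not be wanted for every kind of identifier. As an alternative sketch in base R, the hypothetical helper match_once below (not an existing R function) pairs the k-th occurrence of a value in x with the k-th occurrence of the same value in table.

```r
# match_once is a made-up helper name, written here for illustration
match_once <- function(x, table) {
  # running occurrence number of each value: 1 for its first appearance, 2 for its second, ...
  occ_x <- ave(seq_along(x), x, FUN = seq_along)
  occ_t <- ave(seq_along(table), table, FUN = seq_along)
  # match on (value, occurrence) keys so each table position is used at most once
  match(paste(x, occ_x), paste(table, occ_t))
}

match_once(c(1, 2, 3, 1), c(2, 3, 1, 1))
match_once(c(1, 1, 1), c(1, 1))  # the third 1 has no partner left and gives NA
```

The keys are built with paste, so values that themselves contain spaces could collide; a different separator would be needed in that case.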
