Removing character from dataframe - r

I have this simple code, which generates a data frame. I want to remove the V character from the middle column. Is there any simple way to do that?
Here is a test code (the actual code is very long), very similar with the actual code.
mat1=matrix(c(1,2,3,4,5,"V1","V2","V3","V4","V5",1,2,3,4,5), ncol=3)
mat=as.data.frame(mat1)
colnames(mat)=c("x","row","y")
mat
This is the data frame:
x row y
1 1 V1 1
2 2 V2 2
3 3 V3 3
4 4 V4 4
5 5 V5 5
I just want to remove the V's like this:
x row y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5

We can use str_replace from stringr
library(stringr)
mat$row <- str_replace(mat$row, "V", "")

Related

Is there an R function to sequentially assign a code to each value in a dataframe, in the order it appears within the dataset?

I have a table with a long list of aliased values like this:
> head(transmission9, 50)
# A tibble: 50 x 2
In_Node End_Node
<chr> <chr>
1 c4ca4238 2838023a
2 c4ca4238 d82c8d16
3 c4ca4238 a684ecee
4 c4ca4238 fc490ca4
5 28dd2c79 c4ca4238
6 f899139d 3def184a
I would like to have R go through both columns and assign a number sequentially to each value, in the order that an aliased value appears in the dataset. I would like R to read across rows first, then down columns. For example, for the dataset above:
In_Node End_Node
<chr> <chr>
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
Is this possible? Ideally, I'd also love to be able to generate a "key" which would match each sequential code to each aliased value, like so:
Code Value
1 c4ca4238
2 2838023a
3 d82c8d16
4 a684ecee
5 fc490ca4
Thank you in advance for the help!
You could do:
df1 <- df
df1[]<-as.numeric(factor(unlist(df), unique(c(t(df)))))
df1
In_Node End_Node
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
You can match against the unique values. For a single vector, the code is straightforward:
match(vec, unique(vec))
The requirement to go across columns before rows makes this slightly tricky: you need to transpose the values first. After that, match them.
Finally, use [<- to assign the result back to a data.frame of the same shape as your original data (here x):
y = x
y[] = match(unlist(x), unique(c(t(x))))
y
V2 V3
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
c(t(x)) is a bit of a hack:
t first converts the tibble to a matrix and then transposes it. If your tibble contains multiple data types, these will be coerced to a common type.
c(…) discards attributes. In particular, it drops the dimensions of the transposed matrix, i.e. it converts the matrix into a vector, with the values now in the correct order.
A dplyr version
Let's first re-create a sample data
library(tidyverse)
transmission9 <- read.table(header = T, text = " In_Node End_Node
1 c4ca4238 283802d3a
2 c4ca4238 d82c8d16
3 c4ca4238 a684ecee
4 c4ca4238 fc490ca4
5 28dd2c79 c4ca4238
6 f899139d 3def184a")
Do this simply
transmission9 %>%
mutate(across(everything(), ~ match(., unique(c(t(cur_data()))))))
#> In_Node End_Node
#> 1 1 2
#> 2 1 3
#> 3 1 4
#> 4 1 5
#> 5 6 1
#> 6 7 8
use .names argument if you want to create new columns
transmission9 %>%
mutate(across(everything(), ~ match(., unique(c(t(cur_data())))),
.names = '{.col}_code'))
In_Node End_Node In_Node_code End_Node_code
1 c4ca4238 2838023a 1 2
2 c4ca4238 d82c8d16 1 3
3 c4ca4238 a684ecee 1 4
4 c4ca4238 fc490ca4 1 5
5 28dd2c79 c4ca4238 6 1
6 f899139d 3def184a 7 8

How to assign values in one column to other columns in wide data using R

There is a wide data set, a simple example is
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(7,8,8,9,10,10),
"cx"=c(11,12,12,13,14,14))
I'm looking for a way to assign the values in "ax" to column "bx" and "cx". Here, imagine we have thousands of columns we intend to replace with "ax", so I want this to be done in an automated approach using R. The expected output look like
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(1,2,2,3,4,4),
"cx"=c(1,2,2,3,4,4))
I've thought of, and tried using mutate_at and ends_with, but this has not work for me. For example, I tried
df %>%
mutate_at(vars(ends_with("x")), labels = "ax")
and this prints an error. Not sure what's wrong or what's to be added to get this working, so I would like to request your help on this. Thank you very much!
A simple way using base R would be :
change_cols <- grep('x$', names(df))
df[change_cols] <- df$ax
df
# id ax bx cx
#1 1 1 1 1
#2 2 2 2 2
#3 3 2 2 2
#4 4 3 3 3
#5 5 4 4 4
#6 6 4 4 4
I would suggest this tidyverse approach using across() to select the range of variables you want:
library(tidyverse)
#Data
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(7,8,8,9,10,10),
"cx"=c(11,12,12,13,14,14))
#Mutate
df %>% mutate(across(c(bx:cx), ~ ax))
Output:
id ax bx cx
1 1 1 1 1
2 2 2 2 2
3 3 2 2 2
4 4 3 3 3
5 5 4 4 4
6 6 4 4 4
Another option with mutate_at()
df %>%
mutate_at(vars(matches("x$")), ~ax)
# id ax bx cx
# 1 1 1 1 1
# 2 2 2 2 2
# 3 3 2 2 2
# 4 4 3 3 3
# 5 5 4 4 4
# 6 6 4 4 4

How to BiCluster with constant values in columns - in R

My Problem in general:
I have a data frame where i would like to find all bi-clusters with constant values in columns.
For Example the initial dataframe:
> df
v1 v2 v3
1 0 2 1
2 1 3 2
3 2 4 3
4 3 3 4
5 4 2 3
6 5 2 4
7 2 2 3
8 3 1 2
And for example i would like to find the a cluster like this:
> cluster1
v1 v3
1 2 3
2 2 3
I tried to use the biclust package and tested several functions but the result was always not what i want to archive.
I figured out that I may can use the BCPlaid function with fit.model = y ~ m. But it looks like this produce also different results.
Is there a way to archive this task efficient?

text to dataframe in r

I have a problem in making a dataframe with a text in R.
my text is like this:
t1 = "[[1,5,3,4],[3,2,2,1],[19,11,1,1]]"
and I want to make this dataframe:
V1 V2 V3 V4
1 1 5 3 4
2 3 2 2 1
3 19 11 1 1
To combine the comments, you need to do:
yourDf <- as.data.frame(jsonlite::fromJSON(t1))

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the follows. Actually my real ’df’ dataframe is much bigger than this one here but I really do not want to confuse anybody so that is why I try to simplify things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occured 3 times and number '3' occured 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’ and regarding only those observations which have number ’2’ in column ’id’) we can say that number '1' occured 4 times, number '2' occured 3 times and number '3' occured 3 times.
So this is what I would like to do. Calculating the occurrences of numbers for each custom-defined subsets (and then collecting these values into a data frame). I know it is not a difficult task but the PROBLEM is that I’m gonna have to change the input ’df’ dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab[3,'a',1]
[1] 4
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[id==x,-1],2,table))
However, when a grouping doesn't have all the elements in it, as in 1a, the result will be a list for that id group rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
ColTables <- function(df) {
counts <- list()
for(a in names(df)[names(df) != "id"]) {
counts[[a]] <- table(df[a])
}
return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))

Resources