How to Deal with Textual Data? - r

In R, you have a certain data frame with textual data, e.g. the second column has words instead of numbers. How can you remove the rows of the data frame with a certain word (e.g. "total") in the second column? data <- data[-(data[,2] == "total"),] does not work for me.
Besides, is there an easy way to convert these words sequentially into numbers? (I.e., first word becomes 1, second appeared word becomes 2, and so on.) I would rather not use a loop...

You can use ! to negate. For the sequence, use either seq_along or as.numeric(factor(.)) depending on what you are actually looking for.
Here's some sample data:
set.seed(1)
mydf <- data.frame(V1 = 1:15, V2 = sample(LETTERS[1:3], 15, TRUE))
mydf
# V1 V2
# 1 1 A
# 2 2 B
# 3 3 B
# 4 4 C
# 5 5 A
# 6 6 C
# 7 7 C
# 8 8 B
# 9 9 B
# 10 10 A
# 11 11 A
# 12 12 A
# 13 13 C
# 14 14 B
# 15 15 C
Let's remove any rows where there is an "A" in column "V2":
mydf2 <- mydf[!mydf$V2 == "A", ]
mydf2
# V1 V2
# 2 2 B
# 3 3 B
# 4 4 C
# 6 6 C
# 7 7 C
# 8 8 B
# 9 9 B
# 13 13 C
# 14 14 B
# 15 15 C
Now, let's create two new columns. The first sequentially counts each occurrence of each "word" in column "V2". The second converts each unique "word" into a number.
mydf2$Seq <- ave(seq_along(mydf2$V2), mydf2$V2, FUN = seq_along)
mydf2$WordAsNum <- as.numeric(factor(mydf2$V2))
mydf2
# V1 V2 Seq WordAsNum
# 2 2 B 1 1
# 3 3 B 2 1
# 4 4 C 1 2
# 6 6 C 2 2
# 7 7 C 3 2
# 8 8 B 3 1
# 9 9 B 4 1
# 13 13 C 4 2
# 14 14 B 5 1
# 15 15 C 5 2
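As a small self-contained sketch of both points (the data frame dat and its column names are made up for illustration): negating a logical vector with - turns TRUE/FALSE into -1/0, so the original attempt drops at most the first row rather than the matching rows. Also, factor() numbers words alphabetically, whereas match() against unique() numbers them in order of first appearance, which may be closer to what the question asks for.

```r
# Hypothetical sample data: a text column containing the word "total"
words <- c("pear", "total", "apple", "pear", "fig", "total", "apple")
dat <- data.frame(id = seq_along(words), word = words, stringsAsFactors = FALSE)

# -(logical) gives a vector of 0s and -1s: zeros are ignored and -1 drops
# row 1 only, so the "total" rows are still present afterwards
bad <- dat[-(dat$word == "total"), ]

# A logical index (or its negation with !) keeps exactly the wanted rows
kept <- dat[dat$word != "total", ]

# Number the words in order of first appearance ...
kept$first_seen <- match(kept$word, unique(kept$word))
# ... or alphabetically, which is what as.numeric(factor(.)) does
kept$alphabetical <- as.integer(factor(kept$word))
kept
```

Here first_seen gives pear = 1, apple = 2, fig = 3, while alphabetical gives apple = 1, fig = 2, pear = 3.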

Related

Identify and remove duplicated list elements by their colnames in R

I have a large list object which contains correlation matrices with colnames and rownames, but some of these matrices appear in the list more than once, with their rows and columns in a different order. How do I remove the duplicates without altering the matrix form or the list names?
> my.list
$list1
A B C
A 1 8 5
B 8 1 2
C 5 2 1
$list2
B A C
B 1 8 2
A 8 1 5
C 2 5 1
$list3
C A B
C 1 5 2
A 5 1 8
B 2 8 1
$list4
X Y
X 1 9
Y 9 1
$list5
Y X
Y 1 9
X 9 1
I would like to be able to match the colnames/rownames of the matrix list and remove ones appearing more than once, I'm expecting the output below
$list1
A B C
A 1 8 5
B 8 1 2
C 5 2 1
$list4
X Y
X 1 9
Y 9 1
I have tried the code below, but it doesn't do the job
my.list[!duplicated(my.list)]
You can order the columns and rows according to their names, and then use unique:
lapply(my.list, \(x) x[order(row.names(x)), order(colnames(x))]) |>
unique()
# [[1]]
# A B C
# A 1 8 5
# B 8 1 2
# C 5 2 1
I used your first two elements as example:
list1 <- read.table(header = TRUE, text = " A B C
A 1 8 5
B 8 1 2
C 5 2 1")
list2 <- read.table(header = TRUE, text = " B A C
B 1 8 2
A 8 1 5
C 2 5 1")
my.list <- list(list1, list2)
# [[1]]
# A B C
# A 1 8 5
# B 8 1 2
# C 5 2 1
#
# [[2]]
# B A C
# B 1 8 2
# A 8 1 5
# C 2 5 1
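Note that unique() on the reordered list returns the reordered copies and drops the list names. If the original elements and their names should be kept, one option (a sketch, rebuilding the question's five matrices by hand) is to compute duplicated() on the reordered copies and use the result to subset the original list:

```r
# Rebuild the example list as numeric matrices with dimnames
m1 <- matrix(c(1, 8, 5, 8, 1, 2, 5, 2, 1), nrow = 3,
             dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
m4 <- matrix(c(1, 9, 9, 1), nrow = 2,
             dimnames = list(c("X", "Y"), c("X", "Y")))
my.list <- list(
  list1 = m1,
  list2 = m1[c("B", "A", "C"), c("B", "A", "C")],
  list3 = m1[c("C", "A", "B"), c("C", "A", "B")],
  list4 = m4,
  list5 = m4[c("Y", "X"), c("Y", "X")]
)

# Put every matrix into a canonical row/column order for comparison only
canon <- lapply(my.list, function(m) {
  m[order(rownames(m)), order(colnames(m)), drop = FALSE]
})

# Keep the first occurrence of each matrix, untouched and with its name
deduped <- my.list[!duplicated(canon)]
names(deduped)
```

This keeps list1 and list4 in their original form, which matches the expected output in the question.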

Repeat a record for N times and create a new sequence from 1 to N

I want to repeat each row of a data.frame N times, where N is derived from the difference between the first and second column of that row, so N can differ from row to row. I also need a new column holding the sequence from the first value to the second value in steps of K, where K is the same for all rows.
Ex: d1<-data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
In the above dataset, there are 5 rows. The first row runs from 2 to 8, so N = B - A + 1 = 7. I need to repeat the first row 7 times and add a new column holding the sequence 2, 3, 4, 5, 6, 7, 8.
I can create a dataset by using the following code.
dist<-1
rec_len<-c()
seqe<-c()
for(i in 1:nrow(d1))
{
a<-seq(d1[i,"A"],d1[i,"B"],by=dist)
rec_len<-c(rec_len,length(a))
seqe<-c(seqe,a)
}
d1$C<-rec_len
d1<-d1[rep(1:nrow(d1),d1$C),]
d1$D<-seqe
row.names(d1)<-NULL
But it is taking a very long time. Is there any possibility to speed up the process?
A data.table approach is to use 1:nrow(d1) as the grouping variable so that each row is processed on its own: build a list column holding the sequence from A to B, then unlist it. Grouping by 1:nrow(d1) adds a helper column named nrow, which is dropped at the end, i.e.
library(data.table)
setDT(d1)[, C := B - A + 1][,
D := list(list(seq(A, B))), by = 1:nrow(d1)][,
lapply(.SD, unlist), by = 1:nrow(d1)][,
nrow := NULL][]
Which gives,
A B C D
1: 2 8 7 2
2: 2 8 7 3
3: 2 8 7 4
4: 2 8 7 5
5: 2 8 7 6
6: 2 8 7 7
7: 2 8 7 8
8: 4 6 3 4
9: 4 6 3 5
10: 4 6 3 6
11: 6 7 2 6
12: 6 7 2 7
13: 8 8 1 8
14: 1 10 10 1
15: 1 10 10 2
16: 1 10 10 3
17: 1 10 10 4
18: 1 10 10 5
19: 1 10 10 6
20: 1 10 10 7
21: 1 10 10 8
22: 1 10 10 9
23: 1 10 10 10
A B C D
Note: you can easily change K within seq, e.g.
setDT(d1)[, C := B - A + 1][,
D := list(list(seq(A, B, by = 0.2))), by = 1:nrow(d1)][,
lapply(.SD, unlist), by = 1:nrow(d1)][,
nrow := NULL][]
You could use lists and the purrr package to process each row of your data frame:
library(dplyr)
library(purrr)
data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10)) %>% # take original data frame
setNames(c("from", "to")) %>% pmap(seq) %>% # sequence from A to B
map(~tibble(value = .x)) %>% # convert each element to a data frame
map(~mutate(.,A=min(value), B=max(value))) %>% # add A and B columns
bind_rows() %>% select(A,B,value) # combine and reorder columns
Here is a base R option. We get the number of times each row is replicated as B - A + 1 ('i1') and store it as column 'C', then replicate the row indices of the original dataset using 'i1'. Finally, the 'D' column is created by getting the sequence between the corresponding elements of 'A' and 'B' using Map. The output of Map is a list, so we unlist it to make a vector
i1 <- with(d1, B - A + 1)
d1$C <- i1
d2 <- d1[rep(seq_len(nrow(d1)), i1),]
d2$D <- unlist(Map(`:`, d1$A, d1$B))
row.names(d2) <- NULL
d2
# A B C D
#1 2 8 7 2
#2 2 8 7 3
#3 2 8 7 4
#4 2 8 7 5
#5 2 8 7 6
#6 2 8 7 7
#7 2 8 7 8
#8 4 6 3 4
#9 4 6 3 5
#10 4 6 3 6
#11 6 7 2 6
#12 6 7 2 7
#13 8 8 1 8
#14 1 10 10 1
#15 1 10 10 2
#16 1 10 10 3
#17 1 10 10 4
#18 1 10 10 5
#19 1 10 10 6
#20 1 10 10 7
#21 1 10 10 8
#22 1 10 10 9
#23 1 10 10 10
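The same idea can be written without any per-row function call and with a step K other than 1. This is only a sketch: expand_rows() is a hypothetical helper name, it assumes B >= A in every row, and the small tolerance added before floor() is an assumption to guard against floating-point error with fractional steps.

```r
d1 <- data.frame(A = c(2, 4, 6, 8, 1), B = c(8, 6, 7, 8, 10))

# expand_rows() is a made-up helper for illustration, not a package function
expand_rows <- function(d, k = 1) {
  # number of values in the sequence A, A + k, ..., up to at most B
  n <- as.integer(floor((d$B - d$A) / k + 1e-9)) + 1L
  idx <- rep(seq_len(nrow(d)), n)      # row i repeated n[i] times
  out <- d[idx, , drop = FALSE]
  out$C <- n[idx]
  # sequence(n) restarts at 1 for each original row: 1..n[1], 1..n[2], ...
  out$D <- out$A + (sequence(n) - 1) * k
  row.names(out) <- NULL
  out
}

res1 <- expand_rows(d1)              # K = 1, 23 rows as in the output above
res025 <- expand_rows(d1, k = 0.25)  # K = 0.25
head(res1, 7)
```

Because everything is done with rep() and sequence() on whole vectors, there is no loop over rows at the R level.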
Simple example using N (case where k = 1)
library(dplyr)
# example data frame
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
# function to use (must have same column names)
f = function(d) {
A = rep(d$A, d$diff)
B = rep(d$B, d$diff)
C = seq(d$A, d$B)
data.frame(A, B, C) }
d1 %>%
mutate(diff = B - A + 1) %>% # calculate difference
rowwise() %>% # for every row
do(f(.)) %>% # apply the function
ungroup() # forget the grouping
# # A tibble: 23 x 3
# A B C
# * <dbl> <dbl> <int>
# 1 2 8 2
# 2 2 8 3
# 3 2 8 4
# 4 2 8 5
# 5 2 8 6
# 6 2 8 7
# 7 2 8 8
# 8 4 6 4
# 9 4 6 5
# 10 4 6 6
# # ... with 13 more rows
Example where you have one k for all rows (I'm using 0.25 to demonstrate)
# example data frame
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
# function to use (must have same column names)
f = function(d, k) {
A = d$A
B = d$B
C = seq(d$A, d$B, k)
data.frame(A, B, C) }
d1 %>%
rowwise() %>% # for every row
do(f(., 0.25)) %>% # apply the function using your own k
ungroup()
# # A tibble: 77 x 3
# A B C
# * <dbl> <dbl> <dbl>
# 1 2 8 2.00
# 2 2 8 2.25
# 3 2 8 2.50
# 4 2 8 2.75
# 5 2 8 3.00
# 6 2 8 3.25
# 7 2 8 3.50
# 8 2 8 3.75
# 9 2 8 4.00
# 10 2 8 4.25
# # ... with 67 more rows
Example where you have different k for each row
# example data frame
# give manually different k for each row
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
d1$k = c(0.5, 1, 2, 0.25, 1.5)
d1
# A B k
# 1 2 8 0.50
# 2 4 6 1.00
# 3 6 7 2.00
# 4 8 8 0.25
# 5 1 10 1.50
# function to use (must have same column names)
f = function(d) {
A = d$A
B = d$B
C = seq(d$A, d$B, d$k)
data.frame(A, B, C) }
d1 %>%
rowwise() %>% # for every row
do(f(.)) %>% # apply the function using different k for each row
ungroup()
# # A tibble: 25 x 3
# A B C
# * <dbl> <dbl> <dbl>
# 1 2 8 2.0
# 2 2 8 2.5
# 3 2 8 3.0
# 4 2 8 3.5
# 5 2 8 4.0
# 6 2 8 4.5
# 7 2 8 5.0
# 8 2 8 5.5
# 9 2 8 6.0
# 10 2 8 6.5
# # ... with 15 more rows

How to delete duplicates but keep most recent data in R

I have the following two data frames:
df1 = data.frame(names=c('a','b','c','c','d'),year=c(11,12,13,14,15), Times=c(1,1,3,5,6))
df2 = data.frame(names=c('a','e','e','c','c','d'),year=c(12,12,13,15,16,16), Times=c(2,2,4,6,7,7))
I would like to know how I could merge the above df but only keeping the most recent Times depending on the year. It should look like this:
Names Year Times
a 12 2
b 12 1
c 16 7
d 16 7
e 13 4
I'm guessing that you do not mean to merge these but rather to combine them by stacking. Your question is ambiguous since the "duplication" could occur at the data frame level (whole rows) or at the vector level (the names column). Your example does not display any duplication at the data frame level but does at the vector level. The best way to describe the problem is that you want the last (or max) Times entry within each group of names values:
> df1
names year Times
1 a 11 1
2 b 12 1
3 c 13 3
4 c 14 5
5 d 15 6
> df2
names year Times
1 a 12 2
2 e 12 2
3 e 13 4
4 c 15 6
5 c 16 7
6 d 16 7
> dfr <- rbind(df1,df2)
> dfr <-dfr[order(dfr$Times),]
> dfr[!duplicated(dfr, fromLast=TRUE) , ]
names year Times
1 a 11 1
2 b 12 1
6 a 12 2
7 e 12 2
3 c 13 3
8 e 13 4
4 c 14 5
5 d 15 6
9 c 15 6
10 c 16 7
11 d 16 7
> dfr[!duplicated(dfr$names, fromLast=TRUE) , ]
names year Times
2 b 12 1
6 a 12 2
8 e 13 4
10 c 16 7
11 d 16 7
This uses base R functions; there are also newer packages (such as plyr) that many feel make the split-apply-combine process more intuitive.
df <- rbind(df1, df2)
do.call(rbind, lapply(split(df, df$names), function(x) x[which.max(x$year), ]))
## names year Times
## a a 12 2
## b b 12 1
## c c 16 7
## d d 16 7
## e e 13 4
We could also use aggregate:
df <- rbind(df1,df2)
aggregate(cbind(year, Times) ~ names, df, max)
# names year Times
# 1 a 12 2
# 2 b 12 1
# 3 c 16 7
# 4 d 16 7
# 5 e 13 4
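One caveat with the aggregate approach: max is applied to each column separately, so it only returns a real row when Times happens to increase with year, as it does in the question's data. A tiny made-up example (toy is hypothetical data, not from the question) shows the difference from picking the row with the latest year:

```r
# Hypothetical data where the latest year does not have the largest Times
toy <- data.frame(names = c("a", "a"), year = c(11, 12), Times = c(5, 2))

# Column-wise maximum: combines year 12 with Times 5, which is not a real row
by_column <- aggregate(cbind(year, Times) ~ names, toy, max)

# Row-wise selection: keeps the actual row with the latest year
by_row <- do.call(rbind, lapply(split(toy, toy$names),
                                function(x) x[which.max(x$year), ]))
by_column
by_row
```

If the goal is "the Times value recorded in the most recent year", the row-wise version is the safer choice.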
In case you wanted to see a data.table solution,
# load library
library(data.table)
# bind by row and convert to data.table (by reference)
df <- setDT(rbind(df1, df2))
# get the result
df[order(names, year), .SD[.N], by=.(names)]
The output is as follows:
names year Times
1: a 12 2
2: b 12 1
3: c 16 7
4: d 16 7
5: e 13 4
The final line orders the row-bound data by names and year, and then chooses the last observation (.SD[.N]) for each name.
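For comparison, the same "order, then keep the last row per name" logic can be sketched in base R. It sorts by names and year (rather than by Times), so the result depends only on which year is the most recent:

```r
df1 <- data.frame(names = c('a','b','c','c','d'), year = c(11,12,13,14,15),
                  Times = c(1,1,3,5,6))
df2 <- data.frame(names = c('a','e','e','c','c','d'), year = c(12,12,13,15,16,16),
                  Times = c(2,2,4,6,7,7))

# Stack the two data frames and sort so the latest year comes last per name
combined <- rbind(df1, df2)
combined <- combined[order(combined$names, combined$year), ]

# fromLast = TRUE marks everything except the last row of each name as duplicated
latest <- combined[!duplicated(combined$names, fromLast = TRUE), ]
rownames(latest) <- NULL
latest
```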

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows that belong to a group with fewer than 5 members. The variable "id" is the grouping variable, and a group should be deleted when it has fewer than 5 rows. In DT, I need to determine which groups have fewer than 5 members (groups "1" and "4") and then remove those rows, which should leave:
x y v id
1: a 3 5 2
2: a 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: b 3 2 3
8: b 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)
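Another base R variant, shown here as a sketch, avoids the lookup table by attaching the group size to every row with ave() and filtering on it directly. The data frame below rebuilds the question's data without data.table:

```r
# Same data as DT in the question, as a plain data frame
# (y and v are recycled to 18 rows, as in the data.table call)
df <- data.frame(x = rep(c("a", "b", "c"), each = 6), y = c(1, 3, 6), v = 1:9,
                 id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))

# ave() returns, for every row, the number of rows sharing that row's id
group_size <- ave(seq_along(df$id), df$id, FUN = length)

# Keep only rows whose group has at least 5 members
kept <- df[group_size >= 5, ]
kept
```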

combine two different dimension of dataframes to one dataframe

I need to combine two data frames with different dimensions, each of which has a very large number of rows. Say the sample data frames are d and e, and the expected new data frame is de. For each row index, I would like to pair every value in that row of d with every value in the same row of e, and collect those pairs in a new data frame (de). Any idea/help for solving my problem is really appreciated. Thanks
> d <- data.frame(v1 = c(1,3,5), v2 = c(2,4,6))
> d
v1 v2
1 1 2
2 3 4
3 5 6
> e <- data.frame(v1 = c(11, 14), v2 = c(12,15), v3=c(13,16))
> e
v1 v2 v3
1 11 12 13
2 14 15 16
> de <- data.frame(x = c(1,1,1,2,2,2,3,3,3,4,4,4), y = c(11,12,13,11,12,13,14,15,16,14,15,16))
> de
x y
1 1 11
2 1 12
3 1 13
4 2 11
5 2 12
6 2 13
7 3 14
8 3 15
9 3 16
10 4 14
11 4 15
12 4 16
One solution is to "melt" d and e into long format, then merge, then get rid of the extra columns. If you have very large datasets, data tables are much faster (no difference for this tiny dataset).
library(reshape2) # for melt(...)
library(data.table)
# add id column (the row number) to each data frame
d <- cbind(id=1:nrow(d),d)
e <- cbind(id=1:nrow(e),e)
# melt to long format; reshape2::melt is called explicitly because
# data.table also defines melt()
d.melt <- data.table(reshape2::melt(d,id.vars="id"), key="id")
e.melt <- data.table(reshape2::melt(e,id.vars="id"), key="id")
# data table join on id, keeping only the two value columns
# (value comes from d.melt, i.value from e.melt)
result <- d.melt[e.melt, .(x=value, y=i.value), allow.cartesian=TRUE]
setkey(result,x,y)
result
x y
1: 1 11
2: 1 12
3: 1 13
4: 2 11
5: 2 12
6: 2 13
7: 3 14
8: 3 15
9: 3 16
10: 4 14
11: 4 15
12: 4 16
If your data are numeric, like they are in this example, this is pretty straightforward in base R too. Conceptually this is the same as @jlhoward's answer: get your data into a long format, and merge:
merge(cbind(id = rownames(d), stack(d)),
cbind(id = rownames(e), stack(e)),
by = "id")[c("values.x", "values.y")]
# values.x values.y
# 1 1 11
# 2 1 12
# 3 1 13
# 4 2 11
# 5 2 12
# 6 2 13
# 7 3 14
# 8 3 15
# 9 3 16
# 10 4 14
# 11 4 15
# 12 4 16
Or, with the "reshape2" package:
merge(melt(as.matrix(d)),
melt(as.matrix(e)),
by = "Var1")[c("value.x", "value.y")]
