I have a regular expression that parses a bunch of text, an when doing regmatches(myText,myRegex) it returns a list which looks like:
[[1]]
[1] "a=1" "b=3" "a=9" "c=2" "b=4"
...
I'd like to build a data.frame or table - whatever suits best - to finally have something like:
a b c
1 3 2
9 4 ...
Is it possible to make this in a simple fashion? What are your suggestions?
Thanks in advance.
Its not entirely clear what the general case is here but this works on the data provided.
Assuming this input:
x <- c("a=1", "b=3", "a=9", "c=2", "b=4")
split the values by the names producing s and massage into a data.frame:
s <- split(as.numeric(sub(".*=", "", x)), sub("=.*", "", x))
as.data.frame(do.call(cbind, lapply(s, ts)))
giving:
a b c
1 1 3 2
2 9 4 NA
No packages needed.
You can either use base R methods
d1 <- read.table(text=gsub("[[:punct:]]", " " , unlist(lst)))
d2 <- transform(d1, indx=ave(seq_along(V1), V1, FUN=seq_along))
res <- reshape(d2, timevar='V1', idvar='indx', direction='wide')[,-1]
colnames(res) <- gsub(".*\\.", "", colnames(res))
res
# a b c
#1 1 3 2
#3 9 4 2
#6 4 5 NA
#9 9 NA NA
Or using dcast from reshape2 on d2
library(reshape2)
dcast(d2,indx~V1, value.var='V2')[,-1]
# a b c
#1 1 3 2
#2 9 4 2
#3 4 5 NA
#4 9 NA NA
data
lst <- list(c('a=1', 'b=3', 'a=9', 'c=2', 'b=4'),
c('a=4', 'c=2', 'b=5', 'a=9'))
Using rex may make this type of extraction task a little simpler.
x <- c("a=1", "b=3", "a=9", "c=2", "b=4", "a=2")
First extract the names and values from the strings.
library(rex)
matches <- re_matches(x,
rex(
capture(name="name", letter),
"=",
capture(name="value", digit)
))
#> name value
#>1 a 1
#>2 b 3
#>3 a 9
#>4 c 2
#>5 b 4
#>6 a 2
Then tally the groups using split().
groups <- split(as.numeric(matches$value), matches$name)
#>$a
#>[1] 1 9 2
#>
#>$b
#>[1] 3 4
#>
#>$c
#>[1] 2
If we try to convert directly to a data.frame from split() the groups with fewer members will have their members recycled rather than NA, so instead explicitly fill with NA.
largest_group <- max(sapply(groups, length))
#>[1] 3
groups <- lapply(groups, function(group) {
if (length(group) < largest_group) {
group[largest_group] <- NA
}
group
})
#>$a
#>[1] 1 9 2
#>
#>$b
#>[1] 3 4 NA
#>
#>$c
#>[1] 2 NA NA
Finally we can create the data.frame
do.call('data.frame', groups)
#> a b c
#>1 1 3 2
#>2 9 4 NA
#>3 2 NA NA
Here's an approach using tools from my "splitstackshape" package:
library(splitstackshape)
dcast.data.table( ## Makes the long data wide
getanID( ## Adds an ID variable for dcast
## create a single column data.table and split it by the "="
cSplit(as.data.table(unlist(lst)), "V1", "="), "V1_1"),
.id ~ V1_1, value.var = "V1_2")
# .id a b c
# 1: 1 1 3 2
# 2: 2 9 4 2
# 3: 3 4 5 NA
# 4: 4 9 NA NA
This uses #akrun's sample data:
lst <- list(c('a=1', 'b=3', 'a=9', 'c=2', 'b=4'),
c('a=4', 'c=2', 'b=5', 'a=9'))
Related
DF <- data.frame(x1=c(NA,7,7,8,NA), x2=c(1,4,NA,NA,4)) # a data frame with NA
WhereAreMissingValues <- which(is.na(DF), arr.ind=TRUE) # find the position of the missing values
Modes <- apply(DF, 2, function(x) {which(tabulate(x) == max(tabulate(x)))}) # find the modes of each column
DF
WhereAreMissingValues
Modes
I would like to replace the NAs of each column of DF with the mode, accordingly.
Please for some help.
Map provides here a one line solution:
data.frame(Map(function(u,v){u[is.na(u)]=v;u},DF, Modes))
# x1 x2
#1 7 1
#2 7 4
#3 7 4
#4 8 4
#5 7 4
Here's how I would do this.
First I'll define an helper function
Myfunc <- function(x) as.numeric(names(sort(-table(x)))[1L])
Then just use lapply over the data set
DF[] <- lapply(DF, function(x){x[is.na(x)] <- Myfunc(x) ; x})
DF
# x1 x2
# 1 7 1
# 2 7 4
# 3 7 4
# 4 8 4
# 5 7 4
The idea is extracting the position of df charactes with a reference of other df, example:
L<-LETTERS[1:25]
A<-c(1:25)
df<-data.frame(L,A)
Compare<-c(LETTERS[sample(1:25, 25)])
df[] <- lapply(df, as.character)
for (i in 1:nrow(df)){
df[i,1]<-which(df[i,1]==Compare)
}
head(df)
L A
1 14 1
2 12 2
3 2 3
This works good but scale very bad, like all for, any ideas with apply, or dplyr?
Thanks
Just use match
Your data (use set.seed when providing data using sample)
df <- data.frame(L = LETTERS[1:25], A = 1:25)
set.seed(1)
Compare <- LETTERS[sample(1:25, 25)]
Solution
df$L <- match(df$L, Compare)
head(df)
# L A
# 1 10 1
# 2 23 2
# 3 12 3
# 4 11 4
# 5 5 5
# 6 21 6
Suppose I have a list of data.frames:
list <- list(A=data.frame(x=c(1,2),y=c(3,4)), B=data.frame(x=c(1,2),y=c(7,8)))
I want to combine them into one data.frame like this:
data.frame(x=c(1,2,1,2), y=c(3,4,7,8), group=c("A","A","B","B"))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
I can do in this way:
add_group_name <- function(df, group) {
df$group <- group
df
}
Reduce(rbind, mapply(add_group_name, list, names(list), SIMPLIFY=FALSE))
But I want to know if it's possible to get the name inside the lapply loop without the use of names(list), just like this:
add_group_name <- function(df) {
df$group <- ? #How to get the name of df in the list here?
}
Reduce(rbind, lapply(list, add_group_name))
I renamed list to listy to remove the clash with the base function. This is a variation on Señor O's answer in essence:
do.call(rbind, Map("[<-", listy, TRUE, "group", names(listy) ) )
# x y group
#A.1 1 3 A
#A.2 2 4 A
#B.1 1 7 B
#B.2 2 8 B
This is also very similar to a previous question and answer here: r function/loop to add column and value to multiple dataframes
The inner Map part gives this result:
Map("[<-", listy, TRUE, "group", names(listy) )
#$A
# x y group
#1 1 3 A
#2 2 4 A
#
#$B
# x y group
#1 1 7 B
#2 2 8 B
...which in long form, for explanation's sake, could be written like:
Map(function(data, nms) {data[TRUE,"group"] <- nms; data;}, listy, names(listy) )
As #flodel suggests, you could also use R's built in transform function for updating dataframes, which may be simpler again:
do.call(rbind, Map(transform, listy, group = names(listy)) )
I think a much easier approach is:
> do.call(rbind, lapply(names(list), function(x) data.frame(list[[x]], group = x)))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
Using plyr:
ldply(ll)
.id x y
1 A 1 3
2 A 2 4
3 B 1 7
4 B 2 8
Or in 2 steps :
xx <- do.call(rbind,ll)
xx$group <- sub('([A-Z]).*','\\1',rownames(xx))
xx
x y group
A.1 1 3 A
A.2 2 4 A
B.1 1 7 B
B.2 2 8 B
I would like to merge multiple data.frame in R using row.names, doing a full outer join. For this I was hoping to do the following:
x = as.data.frame(t(data.frame(a=10, b=13, c=14)))
y = as.data.frame(t(data.frame(a=1, b=2)))
z = as.data.frame(t(data.frame(a=3, b=4, c=3, d=11)))
res = Reduce(function(a,b) merge(a,b,by="row.names",all=T), list(x,y,z))
Warning message:
In merge.data.frame(a, b, by = "row.names", all = T) :
column name ‘Row.names’ is duplicated in the result
> res
Row.names Row.names V1.x V1.y V1
1 1 a 10 1 NA
2 2 b 13 2 NA
3 3 c 14 NA NA
4 a <NA> NA NA 3
5 b <NA> NA NA 4
6 c <NA> NA NA 3
7 d <NA> NA NA 11
What I was hoping to get would be:
V1 V2 V3
a 10 1 3
b 13 2 4
c 14 NA 3
d NA NA 11
The following works (up to some final column renaming):
res <- Reduce(function(a,b){
ans <- merge(a,b,by="row.names",all=T)
row.names(ans) <- ans[,"Row.names"]
ans[,!names(ans) %in% "Row.names"]
}, list(x,y,z))
Indeed:
> res
V1.x V1.y V1
a 10 1 3
b 13 2 4
c 14 NA 3
d NA NA 11
What happens with a row join is that a column with the original rownames is added in the answer, which in turn does not contain row names:
> merge(x,y,by="row.names",all=T)
Row.names V1.x V1.y
1 a 10 1
2 b 13 2
3 c 14 NA
This behavior is documented in ?merge (under Value)
If the matching involved row names, an extra character column called
Row.names is added at the left, and in all cases the result has
‘automatic’ row names.
When Reduce tries to merge again, it doesn't find any match unless the names are cleaned up manually.
For continuity, this is not a clean solution but a workaround, I transform the list argument of 'Reduce' using sapply.
Reduce(function(a,b) merge(a,b,by=0,all=T),
sapply(list(x,y,z),rbind))[,-c(1,2)]
x y.x y.y
1 10 1 3
2 13 2 4
3 14 NA 3
4 NA NA 11
Warning message:
In merge.data.frame(a, b, by = 0, all = T) :
column name ‘Row.names’ is duplicated in the result
For some reason I did not have much success with Reduce. given a list of data.frames (df.lst) and a list of suffixes (suff.lst) to change the names of identical columns, this is my solution (it's loop, I know it's ugly for R standards, but it works):
df.merg <- as.data.frame(df.lst[1])
colnames(df.merg)[-1] <- paste(colnames(df.merg)[-1],suff.lst[[1]],sep="")
for (i in 2:length(df.lst)) {
df.i <- as.data.frame(df.lst[i])
colnames(df.i)[-1] <- paste(colnames(df.i)[-1],suff.lst[[i]],sep="")
df.merg <- merge(df.merg, df.i, by.x="",by.y="", all=T)
}
I have a dataframe where some of the values are NA. I would like to remove these columns.
My data.frame looks like this
v1 v2
1 1 NA
2 1 1
3 2 2
4 1 1
5 2 2
6 1 NA
I tried to estimate the col mean and select the column means !=NA. I tried this statement, it does not work.
data=subset(Itun, select=c(is.na(colMeans(Itun))))
I got an error,
error : 'x' must be an array of at least two dimensions
Can anyone give me some help?
The data:
Itun <- data.frame(v1 = c(1,1,2,1,2,1), v2 = c(NA, 1, 2, 1, 2, NA))
This will remove all columns containing at least one NA:
Itun[ , colSums(is.na(Itun)) == 0]
An alternative way is to use apply:
Itun[ , apply(Itun, 2, function(x) !any(is.na(x)))]
Here's a convenient way to do it using the dplyr function select_if(). Combine not (!), any() and is.na(), which is equivalent to selecting all columns that don't contain any NA values.
library(dplyr)
Itun %>%
select_if(~ !any(is.na(.)))
Alternatively, select(where(~FUNCTION)) can be used:
library(dplyr)
(df <- data.frame(x = letters[1:5], y = NA, z = c(1:4, NA)))
#> x y z
#> 1 a NA 1
#> 2 b NA 2
#> 3 c NA 3
#> 4 d NA 4
#> 5 e NA NA
# Remove columns where all values are NA
df %>%
select(where(~!all(is.na(.))))
#> x z
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e NA
# Remove columns with at least one NA
df %>%
select(where(~!any(is.na(.))))
#> x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
You can use transpose twice:
newdf <- t(na.omit(t(df)))
data[,!apply(is.na(data), 2, any)]
A base R method related to the apply answers is
Itun[!unlist(vapply(Itun, anyNA, logical(1)))]
v1
1 1
2 1
3 2
4 1
5 2
6 1
Here, vapply is used as we are operating on a list, and, apply, it does not coerce the object into a matrix. Also, since we know that the output will be logical vector of length 1, we can feed this to vapply and potentially get a little speed boost. For the same reason, I used anyNA instead of any(is.na()).
Another alternative with the dplyr package would be to make use of the Filter function
Filter(function(x) !any(is.na(x)), Itun)
with data.table would be a little more cumbersome
setDT(Itun)[,.SD,.SDcols=setdiff((1:ncol(Itun)),
which(colSums(is.na(Itun))>0))]
You can also try:
df <- df[,colSums(is.na(df))<nrow(df)]