How to split a list based on strings - Automatically - r

I want to split a list based on an automatically generated dictionary / index / glossary / notsurehowtocallit
I have a dataframe where the last column is a character list. Some of them contain 3 strings, some 20, others none. The data looks something like this
name age category
1 John 34 c('sports', 'USA')
2 Mary 20 c('model', 'sports', 'Canada')
3 Sue 65 c('scholar', 'USA')
4 Carl 12 NA
n ... .. ...
The data is very long and I do not know what to look for. That means, I don't have an expected list of strings. I want R to solve that problem for me and generate this list of strings for me.
For that I've already tried:
> category.frq <- table(unlist(category))
> cbind(names(category.frq),as.integer(category.frq))
Which gives me a convenient word count and index. But I am new to R, so I am not sure how to proceed from there. Is there a package that can do that for me?
I would ideally have this result:
name age category sports USA model ...
1 John 34 c('sports', 'USA') 1 1 NA
2 Mary 20 c('model', 'sports', 'Canada') 1 NA 1
3 Sue 65 c('scholar', 'USA') NA 1 NA
4 Carl 12 NA NA NA NA
n ... .. ... .. .. ..

A slightly more in-depth exposition of @akrun's comment...
df1 <- data.frame(category = I(list(c('a','b','c', 'a'),
c('b','d'),
c('b', 'e', 'f', 'd'),
c('g','h'),
NA)))
l <- df1$category
names(l) <- seq_len(length(l))
df2 <- as.data.frame.matrix(t(table(stack(l))))
df2[df2 == 0] <- NA
df1 <- cbind(df1, df2)
df1
# category a b c d e f g h
#1 a, b, c, a 2 1 1 NA NA NA NA NA
#2 b, d NA 1 NA 1 NA NA NA NA
#3 b, e, f, d NA 1 NA 1 1 1 NA NA
#4 g, h NA NA NA NA NA NA 1 1
#5 NA NA NA NA NA NA NA NA NA
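For the data layout in the question (a name, an age, and a list column), the same idea can be sketched without stack() and table(). This is only an illustration under assumptions: the data frame below is retyped from the question, and the names terms, ind and result are made up for the example. It marks presence with 1 and absence with NA rather than counting repeats.

```r
# Example data retyped from the question; the list column is wrapped in I()
df <- data.frame(name = c("John", "Mary", "Sue", "Carl"),
                 age = c(34, 20, 65, 12),
                 category = I(list(c("sports", "USA"),
                                   c("model", "sports", "Canada"),
                                   c("scholar", "USA"),
                                   NA)))

# Build the vocabulary automatically from whatever strings occur
all_vals <- unlist(df$category)
terms <- sort(unique(all_vals[!is.na(all_vals)]))

# One row per record, one column per term: 1 if present, NA otherwise
ind <- t(sapply(df$category,
                function(x) ifelse(terms %in% x, 1L, NA_integer_)))
colnames(ind) <- terms

result <- cbind(df, ind)
```

Because terms is derived from the data, no list of expected strings has to be supplied up front.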

Related

Making the rows of a data frame to NAs using R

I have a data frame as follows,
aid=c(1:10)
x1_var=rnorm(10,0,1)
x2_var=rnorm(10,0,1)
x3_var=rbinom(10,1,0.5)
data=data.frame(aid,x1_var,x2_var,x3_var)
head(data)
aid x1_var x2_var x3_var
1 1 -0.99759448 -0.2882535 1
2 2 -0.12755695 -1.3706875 0
3 3 1.04709366 0.8977596 1
4 4 0.48883458 -0.1965846 1
5 5 -0.40264114 0.2925659 1
6 6 -0.08409966 -1.3489460 1
I want to set every value in a row to NA if x3_var == 1, without setting the aid column to NA.
I tried the following code.
> data[which(data$x3_var==1),]=NA
> data
aid x1_var x2_var x3_var
1 NA NA NA NA
2 2 -0.12755695 -1.3706875 0
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
7 NA NA NA NA
8 8 -1.78160459 -1.8677633 0
9 9 -1.65895704 -0.8086148 0
10 10 -0.06281384 1.8888726 0
But this code also set the values of the aid column to NA. Can anybody help me fix this?
Also, are there any other methods that do the same thing?
Thank you
Your code would work if you excluded the aid column from the assignment.
data[which(data$x3_var==1),-1]=NA
You can also do this without which :
data[data$x3_var==1, -1]=NA
In the above two cases I am assuming that you know the position of the aid column, i.e. 1. If in reality you don't know the position of the column, you can use match to get its position.
data[data$x3_var==1, -match('aid', names(data))] = NA
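A variant of the same idea selects the columns to blank by name with setdiff(), which may read more clearly when several columns need to be kept. This is a sketch with made-up random data; the names keep and hit are illustrative, not from the question. The condition is computed once up front because x3_var itself is overwritten by the assignment, and which() is used so that any NA in the condition would simply be skipped.

```r
set.seed(42)
data <- data.frame(aid = 1:10,
                   x1_var = rnorm(10),
                   x2_var = rnorm(10),
                   x3_var = rbinom(10, 1, 0.5))

keep <- "aid"                   # columns that must never be blanked
hit <- data$x3_var == 1         # evaluate the condition before modifying data

# Blank every column except those in keep, only on the matching rows
data[which(hit), setdiff(names(data), keep)] <- NA
```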
A dplyr solution. Assuming the columns to be altered begin with "x" as in the example data.
library(dplyr)
set.seed(1001)
df1 <- data.frame(aid = 1:10,
x1_var = rnorm(10,0,1),
x2_var = rnorm(10,0,1),
x3_var = rbinom(10,1,0.5))
df1 %>%
mutate(across(starts_with("x"), ~ifelse(x3_var == 1, NA, .x)))
aid x1_var x2_var x3_var
1 1 2.1886481 0.3026445 0
2 2 -0.1775473 1.6343924 0
3 3 NA NA NA
4 4 -2.5065362 0.4671611 0
5 5 NA NA NA
6 6 -0.1435595 0.1102652 0
7 7 NA NA NA
8 8 -0.6229437 -1.0302508 0
9 9 NA NA NA
10 10 NA NA NA

Picking LHS column and RHS column of data.table assignment using other column values in R

Here is the code to produce a sample dataset:
require(data.table)
testdata <- data.table(
X = rep(sample(1:3),5),
Y = rep(sample(1:3),5),
X1 = rnorm(15),
X2 = rnorm(15),
X3 = rnorm(15),
Y1 = NA_character_,
Y2 = NA_character_,
Y3 = NA_character_
)
Initial data table:
X Y X1 X2 X3 Y1 Y2 Y3
1: 3 3 -0.7098927 0.63342935 0.94470612 NA NA NA
2: 1 2 0.3008547 -1.40043977 1.53781754 NA NA NA
3: 2 1 0.3423140 0.34897695 -0.38402565 NA NA NA
4: 3 3 -0.5726456 -2.24526957 -1.10947867 NA NA NA
5: 1 2 -1.3239474 -0.53924617 -0.04103982 NA NA NA
6: 2 1 0.2493801 0.85806647 0.96488021 NA NA NA
7: 3 3 -2.0653505 0.05481703 1.75161043 NA NA NA
8: 1 2 -1.3919774 0.34282832 0.50834289 NA NA NA
9: 2 1 0.5928025 -1.11899399 0.35967102 NA NA NA
10: 3 3 -0.4704720 0.64004313 -0.17343794 NA NA NA
11: 1 2 0.3056093 2.14544631 0.43740447 NA NA NA
12: 2 1 -0.1568971 1.05091249 1.18884487 NA NA NA
13: 3 3 -1.3078670 1.07482123 -0.65367957 NA NA NA
14: 1 2 0.4622123 -0.60308532 -1.11104235 NA NA NA
15: 2 1 -0.7894978 0.33018926 -0.04700393 NA NA NA
Here is the action I want to perform:
In each row,
if X = 2 and Y = 3 then Y3 <- X2
Expected Output:
X Y X1 X2 X3 Y1 Y2 Y3
1: 3 3 -0.7098927 0.63342935 0.94470612 NA NA 0.94470612
2: 1 2 0.3008547 -1.40043977 1.53781754 NA 0.3008547 NA
3: 2 1 0.3423140 0.34897695 -0.38402565 0.34897695 NA NA
4: 3 3 -0.5726456 -2.24526957 -1.10947867 NA NA -1.10947867
5: 1 2 -1.3239474 -0.53924617 -0.04103982 NA -1.3239474 NA
6: 2 1 0.2493801 0.85806647 0.96488021 0.85806647 NA NA
7: 3 3 -2.0653505 0.05481703 1.75161043 NA NA 1.75161043
8: 1 2 -1.3919774 0.34282832 0.50834289 NA -1.3919774 NA
9: 2 1 0.5928025 -1.11899399 0.35967102 -1.11899399 NA NA
10: 3 3 -0.4704720 0.64004313 -0.17343794 NA NA -0.17343794
11: 1 2 0.3056093 2.14544631 0.43740447 NA 0.3056093 NA
12: 2 1 -0.1568971 1.05091249 1.18884487 1.05091249 NA NA
13: 3 3 -1.3078670 1.07482123 -0.65367957 NA NA -0.65367957
14: 1 2 0.4622123 -0.60308532 -1.11104235 NA 0.4622123 NA
15: 2 1 -0.7894978 0.33018926 -0.04700393 0.33018926 NA NA
How can I achieve this using simple data.table syntax? I have tried get, eval(parse), etc., but I run into trouble each time.
Note that my actual dataset is quite large (100-plus columns), so I need a solution that doesn't rely on column numbers. I could possibly write a large number of if statements as well, but that looks like a bad way to handle the 30-odd columns that need to be assigned in a similar way.
data.table version is 1.10.4 and the R version is 3.3.2
Edit: I solved it using a function. Not sure if this is the best way though as it is very very slow.
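populateY <- function(input_table) {
  for(i in seq_len(nrow(input_table))) {
    # read this row's selectors from the table itself
    k <- input_table$X[i]
    j <- input_table$Y[i]
    # build "input_table$Yj[i] <- input_table$Xk[i]" as text and evaluate it
    tempX <- paste0("input_table$X",k,"[i]")
    tempY <- paste0("input_table$Y",j,"[i]")
    eval(parse(text = paste0(tempY," <- ",tempX)))
  }
  return(input_table)
}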
If you're open to using the tidyverse and tibble data frames, I would do it this way.
library(tibble)
library(dplyr)  # provides %>% and mutate()
testdata <- as_tibble(testdata)
testdata <- testdata %>%
  mutate(Y3 = ifelse(X == 2 & Y == 3, X2, NA))
You can then add all the lines you need easily and legibly in the mutate function.
Otherwise, if you're definitely going to use data.table, then I'd go with akrun's suggestion, though you'll need to change the data type of column Y3 to double, or just not have it exist when you run that code.
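One way to avoid both the row-by-row loop and a long chain of if statements is to loop over the distinct (X, Y) combinations and let data.table assign by reference within each. The following is a sketch, not a tuned solution: it uses a small hand-made table, creates the Y columns as numeric (NA_real_) rather than character, and the names dt and combos are illustrative.

```r
library(data.table)

dt <- data.table(X  = c(3L, 1L, 2L),
                 Y  = c(3L, 2L, 1L),
                 X1 = c(0.1, 0.2, 0.3),
                 X2 = c(1.1, 1.2, 1.3),
                 X3 = c(2.1, 2.2, 2.3),
                 Y1 = NA_real_, Y2 = NA_real_, Y3 = NA_real_)

# Each distinct (X, Y) pair needs exactly one vectorised assignment
combos <- unique(dt[, .(X, Y)])
for (r in seq_len(nrow(combos))) {
  k <- combos$X[r]
  j <- combos$Y[r]
  # For rows with X == k and Y == j, copy column Xk into column Yj
  dt[X == k & Y == j, (paste0("Y", j)) := get(paste0("X", k))]
}
```

The number of iterations depends on the number of distinct pairs, not on the number of rows, so this should scale much better than evaluating parsed text once per row.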

Faster way to convert a list to a data.frame with some column values missing

I have this list of list
> head(train)
[[1]]
[[1]]$Physics
[1] 8
[[1]]$Chemistry
[1] 7
[[1]]$PhysicalEducation
[1] 3
[[1]]$English
[1] 4
[[1]]$Mathematics
[1] 6
[[1]]$serial
[1] 195490
.
.
[[6]]
[[6]]$Physics
[1] 2
[[6]]$Chemistry
[1] 1
[[6]]$Biology
[1] 2
[[6]]$English
[1] 4
[[6]]$Mathematics
[1] 8
[[6]]$serial
[1] 182318
Each sub-list has five of the 11 subject elements below, plus one extra element named serial.
columns <- c("Physics", "Chemistry", "PhysicalEducation", "English",
"Mathematics", "serial", "ComputerScience", "Hindi", "Biology",
"Economics", "Accountancy", "BusinessStudies")
I am trying to convert this list into a data frame.
At present I do this with a for loop that fills in one row at a time. Although this works, it takes a huge amount of time.
colclass <- rep("numeric",12)
comby <- read.table(text = '', colClasses = colclass, col.names = columns)
for(i in 1:length(train)){
comby[i,names(train[[i]])] <- train[[i]]
}
I tried using do.call(rbind, train) but that doesn't work as it keeps adding new data into the old columns from the first iteration.
What's a better, faster way? I have around 1.5 million observations.
Desired output: the data frame should have all the columns, with NA where there is no value. I am also interested in whether this can be done faster without any additional packages.
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics Accountancy
1 8 7 3 4 6 195490 NA NA NA NA NA
2 1 1 1 3 3 190869 NA NA NA NA NA
3 1 2 2 1 2 3111 NA NA NA NA NA
4 8 7 6 7 7 47738 NA NA NA NA NA
5 1 1 1 3 2 85520 NA NA NA NA NA
6 2 1 NA 4 8 182318 NA NA 2 NA NA
BusinessStudies
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
Here is the reproducible code
train <- "[{\"Physics\":8,\"Chemistry\":7,\"PhysicalEducation\":3,\"English\":4,\"Mathematics\":6,\"serial\":195490},{\"Physics\":1,\"Chemistry\":1,\"PhysicalEducation\":1,\"English\":3,\"Mathematics\":3,\"serial\":190869},{\"Physics\":1,\"Chemistry\":2,\"PhysicalEducation\":2,\"English\":1,\"Mathematics\":2,\"serial\":3111},{\"Physics\":8,\"Chemistry\":7,\"PhysicalEducation\":6,\"English\":7,\"Mathematics\":7,\"serial\":47738},{\"Physics\":1,\"Chemistry\":1,\"PhysicalEducation\":1,\"English\":3,\"Mathematics\":2,\"serial\":85520},{\"Physics\":2,\"Chemistry\":1,\"Biology\":2,\"English\":4,\"Mathematics\":8,\"serial\":182318},{\"Physics\":3,\"Chemistry\":4,\"PhysicalEducation\":5,\"English\":5,\"Mathematics\":8,\"serial\":77482},{\"Accountancy\":2,\"BusinessStudies\":5,\"Economics\":3,\"English\":6,\"Mathematics\":7,\"serial\":152940},{\"Physics\":5,\"Chemistry\":6,\"Biology\":7,\"English\":3,\"Mathematics\":8,\"serial\":132620}]"
train <- rjson::fromJSON(train)
As a starting point you can use purrr::map as follows:
A sample data set:
x <- list(list(physics=8,
Chemistry=7,
PhysicalEducation=3,
English=4,
serial=195490),
list(physics=2,
Chemistry=1,
Biology=2,
English=4,
Mathematics=8,
serial=182318))
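Sol.1 [Shortest to avoid loops]
library(purrr)     # map_dbl()
library(magrittr)  # %>%
# .default (formerly .null) is the value returned when an element is absent
zzz <- sapply(columns, function(n) map_dbl(x, n, .default = NA_real_)) %>%
  data.frame()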
Which gives:
> zzz
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics
1 NA 7 3 4 NA 195490 NA NA NA NA
2 NA 1 NA 4 8 182318 NA NA 2 NA
Accountancy BusinessStudies
1 NA NA
2 NA NA
If you would like to understand how this works, you can check the longer solutions below.
Sol.2 [Manual assignment]
-pick the values for each column:
z <- data.frame(
  serial = map_dbl(x, "serial", .default = NA_real_),
  Biology = map_dbl(x, "Biology", .default = NA_real_),
  Chemistry = map_dbl(x, "Chemistry", .default = NA_real_)
)
Which gives:
> z
serial Biology Chemistry
1 195490 NA 7
2 182318 2 1
>
Sol.3 [Pre-defined dataframe and for-loop]
create a dataframe with a fixed size
zz <- data.frame(matrix(NA, nrow = length(x), ncol = 12))
assign names
names(zz) <- columns
assign values from the lists
for(i in 1:ncol(zz)){
  zz[columns[i]] <- map_dbl(x, columns[i], .default = NA_real_)
}
Which gives:
> zz
Physics Chemistry PhysicalEducation English Mathematics serial ComputerScience Hindi Biology Economics
1 NA 7 3 4 NA 195490 NA NA NA NA
2 NA 1 NA 4 8 182318 NA NA 2 NA
Accountancy BusinessStudies
1 NA NA
2 NA NA
You can accomplish this in base R by combining Reduce and Map.
data
Here is a dataset that matches your structure.
set.seed(1234)
temp <- replicate(7, setNames(replicate(7, sample(1:10, 1), simplify=FALSE), letters[1:7]),
simplify=FALSE)
To produce a data.frame from this, you can use
Reduce(rbind, Map(data.frame, temp))
a b c d e f g
1 2 7 7 7 9 7 1
2 3 7 6 7 6 3 10
3 3 9 3 3 2 3 4
4 4 2 1 3 9 6 10
5 9 1 5 3 4 6 2
6 8 3 3 10 9 6 7
7 4 7 4 6 7 5 3
Where data.frame constructs data.frames with the inner elements. Map applies this to each element of the outer list, resulting in a list of data.frames. Finally, Reduce rbinds the data.frames in the list and produces a single data.frame.
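Note that rbind on data frames expects matching column names, so the Reduce approach fits records that all carry the same fields. For records with missing fields, as in the question, a base R sketch can index each record by the full column vector, since looking up a name that is absent from a named vector returns NA. The two-record train list below is made up for illustration, and to_row and mat are hypothetical names.

```r
columns <- c("Physics", "Chemistry", "PhysicalEducation", "English",
             "Mathematics", "serial", "ComputerScience", "Hindi", "Biology",
             "Economics", "Accountancy", "BusinessStudies")

# Two illustrative records with different subsets of fields
train <- list(list(Physics = 8, Chemistry = 7, PhysicalEducation = 3,
                   English = 4, Mathematics = 6, serial = 195490),
              list(Physics = 2, Chemistry = 1, Biology = 2,
                   English = 4, Mathematics = 8, serial = 182318))

# Flatten one record and reorder it to the full column set;
# names missing from the record come back as NA
to_row <- function(rec) unname(unlist(rec)[columns])

# vapply builds one column per record, so transpose to get one row per record
mat <- t(vapply(train, to_row, numeric(length(columns))))
colnames(mat) <- columns
comby <- as.data.frame(mat)
```

This fills a numeric matrix in one pass instead of growing a data frame row by row, which is usually where the for-loop version loses its time; how it behaves at 1.5 million records has not been measured here.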

rowMean if row passes a test

I'm working on a data set where the source name is specified by a 2-letter abbreviation in front of the variable. So all variables from source AA start with AA_var1, and source bb has bb_variable_name_2. There are actually a lot of sources, and a lot of variable names, but I leave only 2 as a minimal example.
I want to create a mean variable for any row where the number of sources, that is, where the number of unique prefixes for which the data on that row is not NA, is greater than 1. If there's only one source, I want that total variable to be NA.
So, for example, my data looks like this:
> head(df)
AA_var1 AA_var2 myid bb_meow bb_A_v1
1 NA NA 123456 10 12
2 NA 10 194200 12 NA
3 12 10 132200 NA NA
4 12 NA 132201 NA 12
5 NA NA 132202 NA NA
6 12 13 132203 14 NA
And I want the following:
> head(df)
AA_var1 AA_var2 myid bb_meow bb_A_v1 rowMeanIfDiverseData
1 NA NA 123456 10 12 NA #has only bb
2 NA 10 194200 12 NA 11 #has AA and bb
3 12 10 132200 NA NA NA #has only AA
4 12 NA 132201 NA 12 12 #has AA and bb
5 NA NA 132202 NA NA NA #has neither
6 12 13 132203 14 NA 13 #has AA and bb
Normally, I just use rowMeans() for this kind of thing. But the additional step of selecting, at the row level, only the values whose variable names follow a convention has left me confused between the item-level and the general apply-level statements I'm used to.
I can get the prefixes at the dataframe level:
mynames <- names(df[!names(df) %in% c("myid")])
tmp <- str_extract(mynames, perl("[A-Za-z]{2}(?=_)"))
uniq <- unique(tmp[!is.na(tmp)])
So,
> uniq
[1] "AA" "bb"
So, I can make this a function I can apply to df like so:
multiSource <- function(x){
nm = names(x[!names(x) %in% badnames]) # exclude c("myid")
tmp <- str_extract(nm, perl("[A-Za-z]{2}(?=_)")) # get prefixes
uniq <- unique(tmp[!is.na(tmp)]) # ensure unique and not NA
if (length(uniq) > 1){
return(T)
} else {
return(F)
}
}
But this is clearly confused, and it still operates at the data-set level, i.e.:
> lapply(df,multiSource)
$AA_var1
[1] FALSE
$AA_var2
[1] FALSE
$bb_meow
[1] FALSE
$bb_A_v1
[1] FALSE
And...
> apply(df,MARGIN=1,FUN=multiSource)
Gives TRUE for all.
I'd otherwise like to be saying...
df$rowMean <- rowMeans(df, na.rm=T)
# so, in this case
rowMeansIfTest <- function(X,test) {
# is this row muliSource True?
# if yes, return(rowMeans(X))
# else return(NA)
}
df$rowMeanIfDiverseData <- rowMeansIfTest(df, test=multiSource)
But it is unclear to me how to do this without some kind of for loop.
The strategy here is to split the data frame by columns into variable groups and, for each row, identify whether each group has any non-NA values. We then use rowSums to check that at least two groups have non-NA values in a row, and if so, add the mean of that row's values with cbind.
This will generalize to any number of columns so long as they are named in the AA_varXXX format, and so long as the only column not in that format is myid. Easy enough to fix if this isn't strictly the case, but these are the limitations on the code as written now.
df.dat <- df[!names(df) == "myid"]
diverse.rows <- rowSums(
sapply(
split.default(df.dat, gsub("^([A-Z]{2})_var.*", "\\1", names(df.dat))),
function(x) apply(x, 1, function(y) any(!is.na(y)))
) ) > 1
cbind(df, div.mean=ifelse(diverse.rows, rowMeans(df.dat, na.rm=T), NA))
Produces:
AA_var1 AA_var2 myid BB_var3 BB_var4 div.mean
1 NA NA 123456 10 12 NA
2 NA 10 194200 12 NA 11
3 12 10 132200 NA NA NA
4 12 NA 132201 NA 12 12
5 NA NA 132202 NA NA NA
6 12 13 132203 14 NA 13
This solution seems a little convoluted to me, so there's probably a better way, but it should work for you.
# Here's your data:
df <- data.frame(AA_var1 = c(NA,NA,12,12,NA,12),
AA_var2 = c(NA,10,10,NA,NA,13),
BB_var3 = c(10,12,NA,NA,NA,14),
BB_var4 = c(12,NA,NA,12,NA,NA))
# calculate rowMeans for each subset of variables
a <- rowMeans(df[,grepl('AA',names(df))], na.rm=TRUE)
b <- rowMeans(df[,grepl('BB',names(df))], na.rm=TRUE)
# count non-missing values for each subset of variables
a2 <- rowSums(!is.na(df[,grepl('AA',names(df))]), na.rm=TRUE)
b2 <- rowSums(!is.na(df[,grepl('BB',names(df))]), na.rm=TRUE)
# calculate means:
rowSums(cbind(a*a2,b*b2)) /
rowSums(!is.na(df[,grepl('^(AA|BB)_',names(df))]), na.rm=TRUE)
Result:
> df$rowMeanIfDiverseData <- rowSums(cbind(a*a2,b*b2)) /
+ rowSums(!is.na(df[,grepl('^(AA|BB)_',names(df))]), na.rm=TRUE)
> df
AA_var1 AA_var2 BB_var3 BB_var4 rowMeanIfDiverseData
1 NA NA 10 12 NaN
2 NA 10 12 NA 11
3 12 10 NA NA NaN
4 12 NA NA 12 12
5 NA NA NA NA NaN
6 12 13 14 NA 13
And a little cleanup to exactly match your intended output:
> df$rowMeanIfDiverseData[is.nan(df$rowMeanIfDiverseData)] <- NA
> df
AA_var1 AA_var2 BB_var3 BB_var4 rowMeanIfDiverseData
1 NA NA 10 12 NA
2 NA 10 12 NA 11
3 12 10 NA NA NA
4 12 NA NA 12 12
5 NA NA NA NA NA
6 12 13 14 NA 13
My attempt, somewhat longwinded.....
dat<-data.frame(AA_var1=c(NA,NA,12,12,NA,12),
AA_var2=c(NA,10,10,NA,NA,13),
myid=1:6,
BB_var3=c(10,12,NA,NA,NA,14),
BB_var4=c(12,NA,NA,12,NA,NA))
#which columns hold the variables used in our mean
varcols<-grep("_var[1-9]",names(dat),value=T)
#which rows have the requisite diversification of non-nulls
#I assume these column names start with capitals followed by an underscore
meanrow<-apply(!is.na(dat[,varcols]),1,function(x){n<-varcols[x]
1<length(unique(regmatches(n,regexpr("[A-Z]+_",n))))
})
#do the row mean for all
dat$meanval<-rowMeans(dat[,varcols],na.rm=T)
#null out for those without diversification (i.e. !meanrow)
dat[!meanrow,"meanval"]<-NA
I think some of the answers are making this seem more complicated than it is. This will do it:
df$means = ifelse(rowSums(!is.na(df[, grep('AA_var', names(df))])) &
rowSums(!is.na(df[, grep('BB_var', names(df))])),
rowMeans(df[, grep('_var', names(df))], na.rm = T), NA)
# AA_var1 AA_var2 myid BB_var3 BB_var4 means
#1 NA NA 123456 10 12 NA
#2 NA 10 194200 12 NA 11
#3 12 10 132200 NA NA NA
#4 12 NA 132201 NA 12 12
#5 NA NA 132202 NA NA NA
#6 12 13 132203 14 NA 13
Here's a generalization of the above, given the comment, assuming unique ids (if they're not unique, create a unique index instead):
library(data.table)
library(reshape2)
dt = data.table(df)
setkey(dt, myid) # not strictly necessary, but makes life easier
# find the conditional
cond = melt(dt, id.var = 'myid')[,
sum(!is.na(value)), by = list(myid, sub('_var.*', '', variable))][,
all(V1 != 0), keyby = myid]$V1
# fill in the means (could also do a join, but will rely on ordering instead)
dt[cond, means := rowMeans(.SD, na.rm = T), .SDcols = grep('_var', names(dt))]
dt
# AA_var1 AA_var2 myid BB_var3 BB_var4 means
#1: NA NA 123456 10 12 NA
#2: 12 10 132200 NA NA NA
#3: 12 NA 132201 NA 12 12
#4: NA NA 132202 NA NA NA
#5: 12 13 132203 14 NA 13
#6: NA 10 194200 12 NA 11
fun <- function(x) {
MEAN <- mean(c(x[1], x[2], x[4], x[5]), na.rm=TRUE)
CHECK <- sum(!is.na(c(x[1], x[2]))) > 0 & sum(!is.na(c(x[4], x[5]))) > 0
MEAN * ifelse(CHECK, 1, NaN)
}
df$rowMeanIfDiverseData <- apply(df, 1, fun)
df
AA_var1 AA_var2 myid BB_var3 BB_var4 rowMeanIfDiverseData
1 NA NA 123456 10 12 NaN
2 NA 10 194200 12 NA 11
3 12 10 132200 NA NA NaN
4 12 NA 132201 NA 12 12
5 NA NA 132202 NA NA NaN
6 12 13 132203 14 NA 13
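Most of the answers rename the columns to an AA_var / BB_var pattern. A base R sketch that keeps the question's original names (bb_meow, bb_A_v1) and handles any number of sources could take everything before the first underscore as the source prefix. This assumes that myid is the only non-value column and that there are at least two rows; value_cols, prefix, present and has_source are illustrative names.

```r
df <- data.frame(AA_var1 = c(NA, NA, 12, 12, NA, 12),
                 AA_var2 = c(NA, 10, 10, NA, NA, 13),
                 myid    = c(123456, 194200, 132200, 132201, 132202, 132203),
                 bb_meow = c(10, 12, NA, NA, NA, 14),
                 bb_A_v1 = c(12, NA, NA, 12, NA, NA))

value_cols <- setdiff(names(df), "myid")
prefix <- sub("_.*$", "", value_cols)   # source = text before the first "_"
present <- !is.na(df[value_cols])       # logical matrix, rows x value columns

# For each source: does the row have at least one non-NA value from it?
has_source <- sapply(unique(prefix), function(p)
  rowSums(present[, prefix == p, drop = FALSE]) > 0)

# Keep the row mean only where more than one source contributes
df$rowMeanIfDiverseData <- ifelse(rowSums(has_source) > 1,
                                  rowMeans(df[value_cols], na.rm = TRUE),
                                  NA)
```

All steps are vectorised over rows, so no explicit for loop or apply over rows is needed.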

Combine similar rows across two data frames

Still getting the gist of R. I have two data frames where the rows are named with different coordinates (e.g. x_1013y_41403; see below). The coordinates form sets of five, each set makes a cross if plotted onto a grid. The center coordinate is in one data frame, and the four peripheral coordinates are in the other.
Center A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA
Peripheral A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0
To form a set, the peripheral coordinates would have the same x or y location as the center coordinate. For example, the center coordinate x_723y_6363 would be associated with x_723y_6100 and x_723y_6627 (same x location), as well as x_459y_6363 and x_987y_6363 (same y location).
I would like to combine the coordinates into their respective sets, and name the set with the center coordinate. For the case above, I would end up with two rows, where each row is the summation of a set.
A B C D E F
x_723y_6363.txt 554 5 604 1 645 8
x_749y_41403.txt 14 4 6 0 13 0
I am not sure at all how this can be done. I have thought about creating regular expressions to pick out the x and y coordinates individually and then doing a comparison across the two data frames. Any help would be greatly appreciated!
I hope someone else comes up with a better answer, as this is ugly. I would first split the .txt names into x and y values, then loop over each of the variables that is NA in center and sum all values that share an x or y value with that center. Edit: Changed the sapply to make it slightly nicer.
center <- read.table(textConnection("
A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA"),
header = TRUE)
peripheral <- read.table(textConnection("
A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0"),
header = TRUE)
xpat <- "^([^y]+).*"
ypat <- ".*(y_[0-9]+)\\.txt"
center$x <- gsub(xpat, "\\1", rownames(center))
center$y <- gsub(ypat, "\\1", rownames(center))
peripheral$x <- gsub(xpat, "\\1", rownames(peripheral))
peripheral$y <- gsub(ypat, "\\1", rownames(peripheral))
vars <- c("B", "D", "F")
center[vars] <- sapply(peripheral[vars], function(col)
apply(center, 1, function(row) sum(col[peripheral$x %in% row["x"] | peripheral$y %in% row["y"]]) )
)
R> center
A B C D E F x y
x_723y_6363.txt 554 5 604 1 645 8 x_723 y_6363
x_749y_41403.txt 14 4 6 0 13 0 x_749 y_41403
Another option:
# function to split coordinates x and y:
f <- function(DF) structure(
t(sapply(strsplit(row.names(DF), "[_y.]"), `[`, c(2,4))),
dimnames=list(NULL, c("x", "y")))
# get x and y for peripheral data:
P <- cbind(Peripheral, f(Peripheral))
# get x and y for centers, and mark ids:
C <- cbind(Center, f(Center), id=1:nrow(Center))
# matching:
Q <- merge(merge(P, C[,c("x","id")], all=TRUE), C[,c("y","id")], by="y", all=TRUE)
# prepare for union:
R <- within(Q, {id <- ifelse(is.na(id.y), id.x, id.y); id.x <- NULL; id.y <- NULL})
# join everything and aggregate:
S <- rbind(R, C)
aggregate(S[,3:8], by=list(id=S$id), FUN=sum, na.rm=TRUE)
Result:
id A B C D E F
1 1 554 5 604 1 645 8
2 2 14 4 6 0 13 0
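A further sketch in base R assigns each peripheral row to the center that shares its x or y value and then sums by group with rowsum(). It assumes that every peripheral row matches exactly one center and that row names follow the x_<number>y_<number>.txt pattern; the data are retyped from the question, and coords, owner and combined are hypothetical names.

```r
center <- data.frame(A = c(554, 14), B = NA_real_, C = c(604, 6),
                     D = NA_real_, E = c(645, 13), F = NA_real_,
                     row.names = c("x_723y_6363.txt", "x_749y_41403.txt"))
peripheral <- data.frame(A = NA_real_, B = c(1, 2, 0, 1, 1, 1, 2, 1),
                         C = NA_real_, D = c(0, 1, 0, 0, 0, 0, 0, 0),
                         E = NA_real_, F = c(0, 4, 0, 3, 1, 0, 0, 0),
                         row.names = c("x_1013y_41403.txt", "x_459y_6363.txt",
                                       "x_485y_41403.txt", "x_723y_6100.txt",
                                       "x_723y_6627.txt", "x_749y_41139.txt",
                                       "x_749y_41667.txt", "x_987y_6363.txt"))

# Pull the x and y numbers out of the row names
coords <- function(nm) {
  m <- regmatches(nm, regexec("^x_([0-9]+)y_([0-9]+)", nm))
  data.frame(x = sapply(m, `[`, 2), y = sapply(m, `[`, 3))
}
cc <- coords(rownames(center))
pc <- coords(rownames(peripheral))

# Index of the center sharing an x or y value with each peripheral row
owner <- sapply(seq_len(nrow(pc)), function(i)
  which(cc$x == pc$x[i] | cc$y == pc$y[i])[1])

# Stack both tables and sum within each set, ignoring NA cells
combined <- as.data.frame(rowsum(as.matrix(rbind(center, peripheral)),
                                 group = c(seq_len(nrow(center)), owner),
                                 na.rm = TRUE))
rownames(combined) <- rownames(center)
```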

Resources