Combine similar rows across two data frames

Still getting the gist of R. I have two data frames where the rows are named with different coordinates (e.g. x_1013y_41403; see below). The coordinates form sets of five, each set makes a cross if plotted onto a grid. The center coordinate is in one data frame, and the four peripheral coordinates are in the other.
Center A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA
Peripheral A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0
To form a set, the peripheral coordinates would have the same x or y location as the center coordinate. For example, the center coordinate x_723y_6363 would be associated with x_723y_6100 and x_723y_6627 (same x location), as well as x_459y_6363 and x_987y_6363 (same y location).
I would like to combine the coordinates into their respective sets, and name the set with the center coordinate. For the case above, I would end up with two rows, where each row is the summation of a set.
A B C D E F
x_723y_6363.txt 554 5 604 1 645 8
x_749y_41403.txt 14 4 6 0 13 0
I am not sure at all how this can be done. I have thought about creating regular expressions to pick out the x and y coordinates individually and then doing a comparison across the two data frames. Any help would be greatly appreciated!

I hope someone else comes up with a better answer, as this is ugly. I would first split the .txt names into x and y values, then loop over each of the variables that is NA in center and sum all values that share an x or y value with that center. Edit: Changed the sapply to make it slightly nicer.
center <- read.table(textConnection("
A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA"),
header = TRUE)
peripheral <- read.table(textConnection("
A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0"),
header = TRUE)
xpat <- "^([^y]+).*"
ypat <- ".*(y_[0-9]+)\\.txt"
center$x <- gsub(xpat, "\\1", rownames(center))
center$y <- gsub(ypat, "\\1", rownames(center))
peripheral$x <- gsub(xpat, "\\1", rownames(peripheral))
peripheral$y <- gsub(ypat, "\\1", rownames(peripheral))
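# columns that are NA in center and must be filled from peripheral
vars <- c("B", "D", "F")
# for each such column, sum the peripheral values that share an x or y with each center row
center[vars] <- sapply(peripheral[vars], function(col)
  apply(center, 1, function(row)
    sum(col[peripheral$x %in% row["x"] | peripheral$y %in% row["y"]])))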
R> center
A B C D E F x y
x_723y_6363.txt 554 5 604 1 645 8 x_723 y_6363
x_749y_41403.txt 14 4 6 0 13 0 x_749 y_41403
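The same matching can also be sketched without looping over rows. This is only an illustration, assuming every name follows the x_<number>y_<number>.txt pattern and that NA can be treated as 0 when adding; coords is a hypothetical helper, and the data frames are rebuilt here so the snippet runs on its own.

```r
# Rebuild the example data from the question
center <- data.frame(
  A = c(554, 14), B = NA_real_, C = c(604, 6),
  D = NA_real_, E = c(645, 13), F = NA_real_,
  row.names = c("x_723y_6363.txt", "x_749y_41403.txt")
)
peripheral <- data.frame(
  A = NA_real_, B = c(1, 2, 0, 1, 1, 1, 2, 1), C = NA_real_,
  D = c(0, 1, 0, 0, 0, 0, 0, 0), E = NA_real_,
  F = c(0, 4, 0, 3, 1, 0, 0, 0),
  row.names = c("x_1013y_41403.txt", "x_459y_6363.txt", "x_485y_41403.txt",
                "x_723y_6100.txt", "x_723y_6627.txt", "x_749y_41139.txt",
                "x_749y_41667.txt", "x_987y_6363.txt")
)

# Hypothetical helper: pull numeric x and y out of names like "x_723y_6363.txt"
coords <- function(nms) {
  data.frame(
    x = as.numeric(sub("^x_([0-9]+)y_.*$", "\\1", nms)),
    y = as.numeric(sub("^x_[0-9]+y_([0-9]+)\\.txt$", "\\1", nms))
  )
}
cc <- coords(rownames(center))
pc <- coords(rownames(peripheral))

# member[i, j] is 1 when peripheral row j shares an x or a y with center i
member <- 1 * (outer(cc$x, pc$x, "==") | outer(cc$y, pc$y, "=="))

# Treat NA as 0 so the two sources can simply be added
p <- as.matrix(peripheral)
p[is.na(p)] <- 0
ce <- as.matrix(center)
ce[is.na(ce)] <- 0

# Each row of the result is a center plus the sum of its peripheral rows
combined <- ce + member %*% p
combined
```

One caveat: a peripheral point that happens to share an x with one center and a y with another would be counted in both sets, just as in the apply version above.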

Another option:
# function to split coordinates x and y:
f <- function(DF) structure(
t(sapply(strsplit(row.names(DF), "[_y.]"), `[`, c(2,4))),
dimnames=list(NULL, c("x", "y")))
# get x and y for peripheral data:
P <- cbind(Peripheral, f(Peripheral))
# get x and y for centers, and mark ids:
C <- cbind(Center, f(Center), id=1:nrow(Center))
# matching:
Q <- merge(merge(P, C[,c("x","id")], all=TRUE), C[,c("y","id")], by="y", all=TRUE)
# prepare for union:
R <- within(Q, {id <- ifelse(is.na(id.y), id.x, id.y); id.x <- NULL; id.y <- NULL})
# join everything and aggregate:
S <- rbind(R, C)
aggregate(S[,3:8], by=list(id=S$id), FUN=sum, na.rm=TRUE)
Result:
id A B C D E F
1 1 554 5 604 1 645 8
2 2 14 4 6 0 13 0

Related

Making the rows of a data frame to NAs using R

I have a data frame as follows,
aid=c(1:10)
x1_var=rnorm(10,0,1)
x2_var=rnorm(10,0,1)
x3_var=rbinom(10,1,0.5)
data=data.frame(aid,x1_var,x2_var,x3_var)
head(data)
aid x1_var x2_var x3_var
1 1 -0.99759448 -0.2882535 1
2 2 -0.12755695 -1.3706875 0
3 3 1.04709366 0.8977596 1
4 4 0.48883458 -0.1965846 1
5 5 -0.40264114 0.2925659 1
6 6 -0.08409966 -1.3489460 1
I want to set every value in a row to NA if x3_var == 1, without setting the aid column to NA.
I tried the following code.
> data[which(data$x3_var==1),]=NA
> data
aid x1_var x2_var x3_var
1 NA NA NA NA
2 2 -0.12755695 -1.3706875 0
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
7 NA NA NA NA
8 8 -1.78160459 -1.8677633 0
9 9 -1.65895704 -0.8086148 0
10 10 -0.06281384 1.8888726 0
But this code has also set the values of the aid column to NA. Can anybody help me fix this?
Also, are there other methods that do the same thing?
Thank you

Your code would work if you excluded the aid column from the assignment.
data[which(data$x3_var==1),-1]=NA
You can also do this without which:
data[data$x3_var==1, -1]=NA
In the above two cases I am assuming that you know the position of the aid column, i.e. 1. If in reality you don't know the position of the column, you can use match to get its position.
data[data$x3_var==1, -match('aid', names(data))] = NA
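If you would rather pick the columns by name than by position, a minimal sketch could look like the following. The seed is arbitrary, and flag and cols are illustrative names that do not appear in the answer above.

```r
set.seed(42)  # arbitrary seed, only so the example is repeatable
data <- data.frame(aid = 1:10,
                   x1_var = rnorm(10),
                   x2_var = rnorm(10),
                   x3_var = rbinom(10, 1, 0.5))

# Compute the condition once, before any value is overwritten
flag <- data$x3_var == 1

# Every column except aid, selected by name
cols <- setdiff(names(data), "aid")

data[flag, cols] <- NA
data
```

Storing the condition in flag first also keeps it usable afterwards, since x3_var itself is overwritten with NA in the flagged rows.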

A dplyr solution, assuming the columns to be altered begin with "x", as in the example data. Note that x3_var itself also becomes NA in the affected rows.
library(dplyr)
set.seed(1001)
df1 <- data.frame(aid = 1:10,
x1_var = rnorm(10,0,1),
x2_var = rnorm(10,0,1),
x3_var = rbinom(10,1,0.5))
df1 %>%
mutate(across(starts_with("x"), ~ifelse(x3_var == 1, NA, .x)))
aid x1_var x2_var x3_var
1 1 2.1886481 0.3026445 0
2 2 -0.1775473 1.6343924 0
3 3 NA NA NA
4 4 -2.5065362 0.4671611 0
5 5 NA NA NA
6 6 -0.1435595 0.1102652 0
7 7 NA NA NA
8 8 -0.6229437 -1.0302508 0
9 9 NA NA NA
10 10 NA NA NA

How to split a list based on strings - Automatically

I want to split a list based on an automatically generated dictionary / index / glossary / notsurehowtocallit
I have a dataframe where the last column is a character list. Some of them contain 3 strings, some 20, others none. The data looks something like this
name age category
1 John 34 c('sports', 'USA')
2 Mary 20 c('model', 'sports', 'Canada')
3 Sue 65 c('scholar', 'USA')
4 Carl 12 NA
n ... .. ...
The data is very long and I do not know what to look for. That means, I don't have an expected list of strings. I want R to solve that problem for me and generate this list of strings for me.
For that I've already tried:
> category.frq <- table(unlist(category))
> cbind(names(category.frq),as.integer(category.frq))
This gives me a convenient word count and index, but I am new to R, so I am not sure how to proceed from there. Is there a package that can do that for me?
I would ideally have this result:
name age category sports USA model ...
1 John 34 c('sports', 'USA') 1 1 NA
2 Mary 20 c('model', 'sports', 'Canada') 1 NA 1
3 Sue 65 c('scholar', 'USA') NA 1 NA
4 Carl 12 NA NA NA NA
n ... .. ... .. .. ..

A slightly more in-depth exposition of @akrun's comment...
df1 <- data.frame(category = I(list(c('a','b','c', 'a'),
c('b','d'),
c('b', 'e', 'f', 'd'),
c('g','h'),
NA)))
l <- df1$category
names(l) <- seq_len(length(l))
df2 <- as.data.frame.matrix(t(table(stack(l))))
df2[df2 == 0] <- NA
df1 <- cbind(df1, df2)
df1
# category a b c d e f g h
#1 a, b, c, a 2 1 1 NA NA NA NA NA
#2 b, d NA 1 NA 1 NA NA NA NA
#3 b, e, f, d NA 1 NA 1 1 1 NA NA
#4 g, h NA NA NA NA NA NA 1 1
#5 NA NA NA NA NA NA NA NA NA
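The table-based approach above yields counts (note the 2 for the repeated 'a'). If plain 1/NA presence flags are wanted, as in the question's desired output, a base R sketch could look like this. The data frame is rebuilt from the question's first four rows, and vocab, ind and result are illustrative names.

```r
df <- data.frame(name = c("John", "Mary", "Sue", "Carl"),
                 age = c(34, 20, 65, 12))
df$category <- list(c("sports", "USA"),
                    c("model", "sports", "Canada"),
                    c("scholar", "USA"),
                    NA)

# Build the vocabulary automatically from whatever strings occur
vocab <- sort(unique(na.omit(unlist(df$category))))

# One row per person, one column per string: 1 if present, NA otherwise
ind <- t(sapply(df$category,
                function(v) ifelse(vocab %in% v, 1L, NA_integer_)))
colnames(ind) <- vocab

result <- cbind(df, as.data.frame(ind))
result
```

Because `%in%` only tests membership, a string that appears twice in one row still gives 1 rather than 2.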

creating a variable with NA becoming 0 if at least one of the components is not NA

Imagine I have three variables and want to create a fourth variable called Total that is the sum of A, B, and C. However, if missing values (NA) appear in one or two of the variables, but not in all three together, they should be treated as 0, as in the example below (last row).
A B C Total
10 10 10 30
NA NA NA NA
10 NA NA 10
So far I have:
data$Total <- A + B + C
Cheers

data$Total2 <- rowSums(data[, c("A", "B", "C")])
data$Total2[rowSums(is.na(data[, c("A", "B", "C")])) %in% 1:2] <- 0
data
A B C Total Total2
1 10 10 10 30 30
2 NA NA NA NA NA
3 10 NA NA 10 0
Maybe I misunderstood the requirements and the already specified Total is the hoped-for output. In that case:
data$Total3 <- rowSums(data[, c("A", "B", "C")], na.rm = TRUE)
data$Total3[rowSums(is.na(data[, c("A", "B", "C")])) == 3] <- NA
data
A B C Total Total2 Total3
1 10 10 10 30 30 30
2 NA NA NA NA NA NA
3 10 NA NA 10 0 10
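The second interpretation (Total3) can also be wrapped in a small function so that it can be reused on other column sets. This is only a sketch; sum_or_na is a hypothetical name, and the data frame is rebuilt from the question's example.

```r
data <- data.frame(A = c(10, NA, 10),
                   B = c(10, NA, NA),
                   C = c(10, NA, NA))

# Hypothetical helper: row sums that ignore NA,
# but return NA when every value in the row is missing
sum_or_na <- function(df) {
  total <- rowSums(df, na.rm = TRUE)
  total[rowSums(!is.na(df)) == 0] <- NA
  total
}

data$Total <- sum_or_na(data[, c("A", "B", "C")])
data
```

Counting the non-missing values with rowSums(!is.na(df)) avoids hard-coding the number of columns, which the == 3 comparison above relies on.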

Generate a Preference Matrix in R?

I'm using R to analyze an undirected network of individuals with ethnicities as attributes. I want to create a tie accounts table, or "preference matrix": a square matrix where values of ethnicity are arrayed on both dimensions, and each cell tells you how many ties correspond to that type of relationship. (From this you can calculate the probability of one group throwing ties to another group, but I just want to use it as an argument in igraph's preference.game function.) Here's what I tried:
# I create a variable for ethnicity by assigning the names of my vertices to their corresponding ethnicities
eth <- atts$Ethnicity[match(V(mahmudNet)$name,atts$Actor)]
# I create an adjacency matrix from my network data
mat <- as.matrix(get.adjacency(mahmudNet))
# I create the dimensions for my preference matrix from the Ethnicity values
eth.value <- unique(sort(eth))
# I create an empty matrix using these dimensions
eth.mat <- array(NA,dim=c(length(eth.value),length(eth.value)))
# I create a function that will populate the empty cells of the matrix
for (i in eth.value){
for (j in eth.value){
eth.mat[i,j] <- sum(mat[eth==i,eth==j])
}
}
My problem is at the end, I think. I need to figure out an expression that tells R how to populate the cells. The expression I wrote doesn't seem to work, but I want it so that I could potentially write
a <- sum(mat[eth=="White", eth=="Black"])
And then "a" would return the sum of all the cells in the adjacency matrix that correspond to a White-Black relationship.
Here's a sample of my data:
# data frame with Ethnicity attributes:
Actor Ethnicity
1 Sultan Mahmud of Siak 2
2 Daeng Kemboja 1
3 Raja Kecik of Trengganu 1
4 Raja Alam 2
5 Tun Dalam 2
6 Raja Haji 1
7 The Suliwatang 1
8 Punggawa Miskin 1
9 Tengku Selangor 1
10 Tengku Raja Said 1
11 Datuk Bendahara 2
12 VOC 3
13 King of Selangor 1
14 Dutch at Batavia 3
15 Punggawa Tua 2
16 Raja Tua Encik Andak 1
17 Raja Indera Bungsu 2
18 Sultan of Jambi 2
19 David Boelen 3
20 Datuk Temenggong 2
21 Punggawa Opu Nasti 1
# adjacency matrix with relations
Daeng Kemboja Punggawa Opu Nasti Raja Haji Daeng Cellak
Daeng Kemboja 0 1 1 1
Punggawa Opu Nasti 1 0 1 0
Raja Haji 1 1 0 0
Daeng Cellak 1 0 0 0
Daeng Kecik 1 0 0 0
Daeng Kecik
Daeng Kemboja 1
Punggawa Opu Nasti 0
Raja Haji 0
Daeng Cellak 0
Daeng Kecik 0

This is a simple job for table, once you have your data in the right shape.
First a sample dataset:
# fake ethnicity data by actor
actor_eth <- data.frame(actor = letters[1:10],
eth = sample(1:3, 10, replace = TRUE))
# fake adjacency matrix
adj_mat <- matrix(rbinom(100, 1, .5), ncol=10)
dimnames(adj_mat) <- list(letters[1:10], letters[1:10])
# blank out lower triangle & diagonal,
# so random data is not asymmetric & has no self-ties
adj_mat[lower.tri(adj_mat)] <- NA
diag(adj_mat) <- NA
Here's our fake adjacency matrix:
a b c d e f g h i j
a NA 1 1 1 0 0 1 1 0 1
b NA NA 0 1 0 1 0 0 1 0
c NA NA NA 1 1 0 0 1 0 0
d NA NA NA NA 1 0 0 1 1 0
e NA NA NA NA NA 0 0 1 0 1
f NA NA NA NA NA NA 1 1 0 1
g NA NA NA NA NA NA NA 1 1 0
h NA NA NA NA NA NA NA NA 0 0
i NA NA NA NA NA NA NA NA NA 1
j NA NA NA NA NA NA NA NA NA NA
Here's our fake eth table:
actor eth
1 a 3
2 b 3
3 c 3
4 d 2
5 e 1
6 f 3
7 g 3
8 h 3
9 i 1
10 j 2
So what you want to do is 1) put this in long format, so you have a bunch of rows with a source actor and a target actor, each representing a tie. Then 2) replace the actor name with ethnicity, so you have ties with source/target ethnicity. Then 3) you can just use table to make a cross tab.
# use `melt` to put this in long form, omitting rows showing "non connections"
library(reshape2)
actor_ties <- subset(melt(adj_mat), value==1)
# now replace the actor names with their ethnicities to create a data.frame
# of ties by ethnicity
eth_ties <-
data.frame(source_eth = with(actor_eth, eth[match(actor_ties$Var1, actor)]),
target_eth = with(actor_eth, eth[match(actor_ties$Var2, actor)]))
# now here's your cross tab
table(eth_ties)
Result:
target_eth
source_eth 1 2 3
1 0 2 1
2 2 0 1
3 3 5 9
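If you would rather keep the loop from the question, a minimal sketch of one way to make it work is to loop over positions instead of over the ethnicity values themselves, so that the matrix index and the value being compared are kept separate. The data below is simulated (arbitrary seed), and the symmetric mat is only a stand-in for the real adjacency matrix.

```r
set.seed(1)  # arbitrary seed, only so the example is repeatable

# Simulated stand-ins for the real objects in the question
eth <- sample(1:3, 10, replace = TRUE)
mat <- matrix(rbinom(100, 1, 0.5), ncol = 10)
mat[lower.tri(mat)] <- t(mat)[lower.tri(mat)]  # mirror to make it undirected
diag(mat) <- 0                                 # no self-ties

eth.value <- sort(unique(eth))
eth.mat <- matrix(NA_real_,
                  nrow = length(eth.value), ncol = length(eth.value),
                  dimnames = list(as.character(eth.value),
                                  as.character(eth.value)))

# i and j are positions in eth.value, not the ethnicity codes themselves
for (i in seq_along(eth.value)) {
  for (j in seq_along(eth.value)) {
    eth.mat[i, j] <- sum(mat[eth == eth.value[i], eth == eth.value[j]])
  }
}
eth.mat
```

Because a symmetric adjacency matrix stores each tie in both triangles, the diagonal cells here count every within-group tie twice. Whether to halve them depends on how the result is going to be used.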

Creating data frame for confusionMatrix() input, however getting unusual dataframe

I’m trying to understand the results of a prediction object via caret’s confusionMatrix() function, which requires table input according to http://artax.karlin.mff.cuni.cz/r-help/library/caret/html/confusionMatrix.html. My table() call produces results that I understand, but they are not friendly to the confusionMatrix() function.
Here is the relevant code snippet:
#MODEL CREATION
#convert categorical A to E values , into numeric 1 to 5 in order to be regression friendly
training_data_subset_numeric <- training_data_subset;
testing_data_subset_numeric <- testing_data_subset;
training_data_subset_numeric$classe <- as.numeric(training_data_subset$classe)
testing_data_subset_numeric$classe <- as.numeric(testing_data_subset$classe)
#model
exercise.model <- glm(formula = classe ~ ., data = training_data_subset_numeric)
#MODEL EVALUATION
exercise.prediction <- predict(exercise.model,newdata = testing_data_subset_numeric)
eval_table <- table(exercise.prediction,testing_data_subset$classe)
tail(eval_table)
exercise.prediction A B C D E
4.35504232913594 1 0 0 0 0
4.47219097065568 1 0 0 0 0
4.50838854075835 1 0 0 0 0
4.6173551930011 0 1 0 0 0
4.69261223447305 0 1 0 0 0
4.73297946213265 0 1 0 0 0
Basically, I need to convert the above output to a data frame with one column holding the predicted value, following this rule:
If column A is 1, then the predicted value is 1
If column B is 1, then the predicted value is 2
If column C is 1, then the predicted value is 3
If column D is 1, then the predicted value is 4
If column E is 1, then the predicted value is 5
I therefore wrote this function to get the job done:
getPredictResults<- function(x)
{
# create 1 column & n row data frame
num <- data.frame(matrix(0, ncol = 1, nrow = nrow(x)));
for (r in 1:nrow(x) ) {
for (c in 1:ncol(x) ) {
#if column A has value 1 than num[1,r] <- 1
if (x[r,'A']== 1)
{
num[1,r] <- 1;
}
#if column B has value 1 than num[1,r] <- 2
else if (x[r,'B']== 1)
{
num[1,r] <- 2;
}
#if column C has value 1 than num[1,r] <- 3
else if (x[r,'C']== 1)
{
num[1,r] <- 3;
}
#if column D has value 1 than num[1,r] <- 4
else if (x[r,'D']== 1)
{
num[1,r] <- 4;
}
#if column E has value 1 than num[1,r] <- 5
else if (x[r,'E']== 1)
{
num[1,r] <- 5;
}
else
{
}
}#end inner for
}#end outer for
return (num);
}#end function
exercise.prediction_df <- getPredictResults(eval_table)
However, when typing:
head(exercise.prediction_df)
I'm getting unusual output; here is the bottom snippet:
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4840 V4841 V4842 V4843 V4844 V4845 V4846 V4847 V4848 V4849 V4850 V4851 V4852 V4853 V4854 V4855 V4856 V4857
1 5 1 4 5 2 2 5 5 1 2 5 4 5 5 1 5 5 4
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4858 V4859 V4860 V4861 V4862 V4863 V4864 V4865 V4866 V4867 V4868 V4869 V4870 V4871 V4872 V4873 V4874 V4875
1 4 2 1 2 5 1 4 5 2 1 4 5 2 4 2 4 4 2
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4876 V4877 V4878 V4879 V4880 V4881 V4882 V4883 V4884 V4885 V4886 V4887 V4888 V4889 V4890 V4891 V4892 V4893
1 5 1 1 4 1 2 2 1 1 5 1 4 1 1 1 1 1 1
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4894 V4895 V4896 V4897 V4898 V4899 V4900 V4901 V4902 V4903 V4904
1 1 1 1 1 1 1 1 1 2 2 2
2 NA NA NA NA NA NA NA NA NA NA NA
[ reached getOption("max.print") -- omitted 4 rows ]
Further investigation shows:
> ncol(exercise.prediction_df)
[1] 4904
> nrow(exercise.prediction_df)
[1] 4904
ncol() should return only 1 here, while nrow() can obviously be any integer value.
How can I fix this function so that it creates the right data frame as input to the confusionMatrix() function?
Thanks.

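# simulated stand-in for the real data: a five-level factor and numeric predictions
classe <- cut(runif(100), seq(0, 1, length.out = 6))
levels(classe) <- c("A", "B", "C", "D", "E")
exercise.prediction <- rnorm(100)
eval_table <- table(exercise.prediction, classe)
eval_matrix <- as.matrix(eval_table)
# when a row holds a single 1, weighting the columns by 1:5 recovers the class number
transform <- apply(eval_matrix, 1, function(x) sum(x * c(1:5)))
head(as.data.frame(transform))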
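As for why the function in the question returns 4904 columns: it assigns to num[1, r], which is row 1, column r, so every pass through the outer loop adds a new column. The tiny sketch below reproduces the effect on a three-row frame. The names bad and good are illustrative, and the fix shown (swapping the indices to num[r, 1]) is a reading of the question's code rather than something stated in the answer above.

```r
# Same construction as in the question, but with only 3 rows
num <- data.frame(matrix(0, ncol = 1, nrow = 3))

# Indices as written in the question: row 1, column r
bad <- num
for (r in 1:3) {
  bad[1, r] <- r   # column r does not exist yet for r > 1, so it is appended
}

# Indices swapped: row r, column 1
good <- num
for (r in 1:3) {
  good[r, 1] <- r
}

dim(bad)   # grows to one column per row
dim(good)  # stays at a single column
```

The appended columns are filled with NA outside row 1, which matches the rows of NA in the printed output of the question.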
