I have a large data frame. A sample of the first 6 rows is below:
> temp
M1 M2 M3 M4 M5 M6
1 1 1 1 1 1 1
2 1 1 1 1 1 1
3 0 1 0 -1 1 0
4 1 1 1 1 1 1
5 0 0 0 -1 0 1
6 0 0 0 0 0 0
> dput(temp)
structure(list(M1 = c(1, 1, 0, 1, 0, 0), M2 = c(1, 1, 1, 1, 0,
0), M3 = c(1, 1, 0, 1, 0, 0), M4 = c(1, 1, -1, 1, -1, 0), M5 = c(1,
1, 1, 1, 0, 0), M6 = c(1, 1, 0, 1, 1, 0)), .Names = c("M1", "M2",
"M3", "M4", "M5", "M6"), row.names = c(NA, -6L), class = "data.frame")
The data frame contains only the values -1, 0 and 1. The total number of rows is 2,156. What I would like to do is plot it in a "grid" format where each row consists of 6 squares (one for each column). Each of the three values is then assigned a color (say, red, green, blue).
I've tried to do this with heatmap.2, but I can't get the distinct squares to show up.
I've tried to do this using ggplot2 with geom_point(), but can't figure out the best way to do it.
Any help on how to efficiently do this would be much appreciated!
Thanks!
Another option:
library(lattice)
temp <- as.matrix(temp)
levelplot(temp, col.regions= colorRampPalette(c("red","green","blue")))
This draws the matrix as a grid of colored cells, one cell per value.
I think geom_tile() is a better bet, in combination with reshaping to long.
library(ggplot2)
library(reshape2)
# assign an id so each row maps to a position on the y-axis
temp$id <- 1:nrow(temp)
# reshape to long format: one row per (id, variable) pair
m_temp <- melt(temp, id.vars = "id")
p1 <- ggplot(m_temp, aes(x = variable, y = id, fill = factor(value))) +
  geom_tile()
p1
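To pin each of the three values to a specific color, as the question asks, a manual fill scale can be added to a geom_tile() plot. The following is a minimal sketch on the six sample rows. The color assignment and the names `long` and `p` are illustrative assumptions, and the long format is built with base R here rather than with reshape2.

```r
library(ggplot2)

# Sample data from the question
temp <- data.frame(
  M1 = c(1, 1, 0, 1, 0, 0), M2 = c(1, 1, 1, 1, 0, 0),
  M3 = c(1, 1, 0, 1, 0, 0), M4 = c(1, 1, -1, 1, -1, 0),
  M5 = c(1, 1, 1, 1, 0, 0), M6 = c(1, 1, 0, 1, 1, 0)
)

# Long format, one row per cell (hypothetical helper name `long`).
# unlist() walks the data frame column by column, so `id` cycles
# through the rows while `variable` repeats each column name.
long <- data.frame(
  id       = rep(seq_len(nrow(temp)), times = ncol(temp)),
  variable = factor(rep(names(temp), each = nrow(temp)), levels = names(temp)),
  value    = factor(unlist(temp, use.names = FALSE), levels = c(-1, 0, 1))
)

# One tile per cell; the named vector fixes which value gets which color
# (the particular colors are an assumption, not something the question fixes)
p <- ggplot(long, aes(x = variable, y = id, fill = value)) +
  geom_tile(colour = "white") +
  scale_fill_manual(values = c("-1" = "red", "0" = "green", "1" = "blue")) +
  scale_y_reverse(breaks = seq_len(nrow(temp)))
p
```

Treating the value as a factor with all three levels declared keeps the legend and colors stable even if a subset of the data happens to lack one of the values.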
You could use ggplot to do the following:
library(ggplot2)
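# expand.grid() varies its first argument fastest, so y (rows) must come
# first to match the column-by-column order that unlist() produces
dd <- expand.grid(y = 1:nrow(temp), x = 1:ncol(temp))
dd$col <- unlist(temp)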
ggplot(dd, aes(x = x, y = y, fill = factor(col))) + geom_tile()
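For 2,156 rows a package-free alternative may also be worth trying. The sketch below uses base R's image() with explicit breaks so that each of the three values gets exactly one color. The colors and the name `z` are assumptions made for illustration, and the sample rows stand in for the full data.

```r
# Sample data from the question
temp <- data.frame(
  M1 = c(1, 1, 0, 1, 0, 0), M2 = c(1, 1, 1, 1, 0, 0),
  M3 = c(1, 1, 0, 1, 0, 0), M4 = c(1, 1, -1, 1, -1, 0),
  M5 = c(1, 1, 1, 1, 0, 0), M6 = c(1, 1, 0, 1, 1, 0)
)

m <- as.matrix(temp)

# image() draws the rows of its input along the x-axis, so reverse the
# row order and transpose to show row 1 at the top, columns left to right
z <- t(m[nrow(m):1, ])

cols   <- c("red", "green", "blue")      # colors for -1, 0, 1 (assumed)
breaks <- c(-1.5, -0.5, 0.5, 1.5)        # one interval per value

image(x = seq_len(ncol(m)), y = seq_len(nrow(m)), z = z,
      col = cols, breaks = breaks, axes = FALSE, xlab = "", ylab = "")
axis(1, at = seq_len(ncol(m)), labels = colnames(m))
axis(2, at = seq_len(nrow(m)), labels = rev(seq_len(nrow(m))), las = 1)
box()
```

With thousands of rows the individual squares become very thin, so the row axis labels would probably need thinning out or dropping.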
I have a toy example of a dataframe:
df <- data.frame(matrix(, nrow = 5, ncol = 0))
df["A|A"] <- c(0.3, 0, 0, 100, 23)
df["A|B"]= c(0, 0, 0.3, 10, 0.23)
df["A|C"]= c(0.3, 0.1, 0, 100, 2)
df["B|B"]= c(0, 0, 0, 12, 2)
df["B|B"]= c(0, 0, 0.3, 0, 0.23)
df["B|C"]= c(0.3, 0, 0, 21, 3)
df["C|A"]= c(0.3, 0, 1, 100, 0)
df["C|B"]= c(0, 0, 0.3, 10, 0.2)
df["C|C"]= c(0.3, 0, 1, 1, 0.3)
I need to get a matrix with the counts of non-zero values for each pair: A and A, A and B, ..., C and C.
I started by splitting the column names and assigning the parts to variables, but I don't know how to create a matrix with specific rows and columns in a loop:
counts <- colSums(df != 0)
df <- rbind(df, counts)
for(i in colnames(df)) {
cluster1 <- (strsplit(i, "\\|")[[1]])[1]
cluster2 <- (strsplit(i, "\\|")[[1]])[2]
}
A base R option
> table(read.table(text = rep(names(df), colSums(df > 0)), sep = "|"))
V2
V1 A B C
A 3 3 4
B 0 2 3
C 3 3 4
or a longer version
table(
data.frame(
do.call(
rbind,
strsplit(
as.character(subset(stack(df), values > 0)$ind),
"\\|"
)
)
)
)
gives
X2
X1 A B C
A 3 3 4
B 0 2 3
C 3 3 4
Reshape the data into 'long' format with pivot_longer, then separate the 'name' column into two columns. Finally, reshape back to 'wide' with pivot_wider, specifying values_fn as a lambda function that counts the non-zero values.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything()) %>%
separate(name, into = c('name1', 'name2')) %>%
pivot_wider(names_from = name2, values_from = value,
values_fn = list(value = ~ sum(. > 0)), values_fill = 0)
-output
# A tibble: 3 x 4
name1 A B C
<chr> <int> <int> <int>
1 A 3 3 4
2 B 0 2 3
3 C 3 3 4
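The loop that the question was reaching for can also be finished directly: pre-allocate a named matrix, then index it by the two halves of each column name. This is a minimal base R sketch under the assumption that every column name has the form "row|column". The names `parts`, `labels` and `res` are made up for illustration, and the data frame is built with the final column values from the question (the second "B|B" assignment overwrites the first).

```r
# Toy data from the question; check.names = FALSE keeps the "|" in the names
df <- data.frame(
  "A|A" = c(0.3, 0, 0, 100, 23),
  "A|B" = c(0, 0, 0.3, 10, 0.23),
  "A|C" = c(0.3, 0.1, 0, 100, 2),
  "B|B" = c(0, 0, 0.3, 0, 0.23),
  "B|C" = c(0.3, 0, 0, 21, 3),
  "C|A" = c(0.3, 0, 1, 100, 0),
  "C|B" = c(0, 0, 0.3, 10, 0.2),
  "C|C" = c(0.3, 0, 1, 1, 0.3),
  check.names = FALSE
)

# Number of non-zero entries per column
counts <- colSums(df != 0)

# Split each name once into its two halves
parts  <- strsplit(names(counts), "|", fixed = TRUE)
labels <- sort(unique(unlist(parts)))

# Pre-allocate the result; pairs with no column (here B|A) stay at 0
res <- matrix(0, nrow = length(labels), ncol = length(labels),
              dimnames = list(labels, labels))

# Fill one cell per original column, indexing by name
for (i in seq_along(counts)) {
  res[parts[[i]][1], parts[[i]][2]] <- counts[[i]]
}
res
```

Indexing by dimnames avoids having to work out numeric positions inside the loop, which is the step the original attempt was missing.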
I have a very simple case here in which I would like to subtract from each column the one before it. In other words, I am looking for a sliding subtraction: the first column stays as is, then the first column is subtracted from the second, the second from the third, and so on up to the last column.
Here is my sample data set:
structure(list(x = c(1, 0, 0, 0), y = c(1, 0, 1, 1), z = c(0,
1, 1, 1)), class = "data.frame", row.names = c(NA, -4L))
and my desired output:
structure(list(x = c(1, 0, 0, 0), y = c(0, 0, 1, 1), z = c(-1,
1, 0, 0)), class = "data.frame", row.names = c(NA, -4L))
I am personally looking for a solution with purrr family of functions. I also thought about slider but I'm not quite familiar with the latter one. So I would appreciate any help and idea with these two packages in advance. Thank you very much.
A simple dplyr-only solution:
cur_data() inside mutate/summarise returns a copy of the whole current data frame, so
just subtract cur_data()[-ncol(.)] from cur_data()[-1] (in dplyr 1.1.0 and later, cur_data() is deprecated in favor of pick()).
With pmap_dfr you can do something similar.
df <- structure(list(x = c(1, 0, 0, 0), y = c(1, 0, 1, 1), z = c(0,
1, 1, 1)), class = "data.frame", row.names = c(NA, -4L))
library(dplyr)
df %>%
mutate(cur_data()[-1] - cur_data()[-ncol(.)])
#> x y z
#> 1 1 0 -1
#> 2 0 0 1
#> 3 0 1 0
#> 4 0 1 0
Similarly, with purrr:
library(purrr)
pmap_dfr(df, ~c(c(...)[1], c(...)[-1] - c(...)[-ncol(df)]))
I think you are looking for pmap_df with lag to subtract the previous value.
library(purrr)
library(dplyr)
pmap_df(df, ~{x <- c(...);x - lag(x, default = 0)})
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 0 -1
#2 0 0 1
#3 0 1 0
#4 0 1 0
Verbose, but simple:
df %>%
select(x) %>%
bind_cols(df %>%
select(-1) %>%
map2_dfc(df %>%
select(-ncol(df)), ~.x -.y))
# x y z
#1 1 0 -1
#2 0 0 1
#3 0 1 0
#4 0 1 0
This can be done in base R, with no packages needed (df1 is the sample data frame from the question):
cbind(df1[1], df1[-1] - df1[-ncol(df1)])
-output
x y z
1 1 0 -1
2 0 0 1
3 0 1 0
4 0 1 0
Or using dplyr
library(dplyr)
df1 %>%
mutate(.[-1] - .[-ncol(.)])
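The sliding subtraction described in the question is a row-wise first difference, so base R's diff() applied across each row expresses the same idea. The following is a small sketch that assumes all columns are numeric and that there are at least three of them. The names `m` and `res` are illustrative.

```r
# Sample data from the question
df <- data.frame(x = c(1, 0, 0, 0), y = c(1, 0, 1, 1), z = c(0, 1, 1, 1))

m <- as.matrix(df)

# diff() on one row gives (y - x, z - y); apply() returns these as columns,
# so transpose back to one row per observation
d <- t(apply(m, 1, diff))

# Keep the first column unchanged and append the differences
res <- as.data.frame(cbind(m[, 1, drop = FALSE], d))
res
```

Compared with the column-slicing answers, this version makes the "difference from the previous column" intent explicit, at the cost of converting to a matrix first.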
I have a data frame in R; here is an example:
asdf <- data.frame(id = c(2345, 7323, 2345, 4533),
place = c("Home", "Home", "Office", "Office"),
sex = c("Male", "Male", "Male", "Female"),
consumed = c(1000, 800, 1000, 500))
As you can see, one id is duplicated because that person has two locations, Home and Office. I want to convert every character variable to dummy variables and end up with one row per id, with no duplicated ids. I am sure that "place" is the only variable whose values can differ within an id.
When I apply dummyVars from caret, I can't do this, and this behavior does not make sense to me. For example, when I apply the following
dummy <- dummyVars( ~ ., data = asdf, fullRank = FALSE, levelsOnly = TRUE)
predict(dummy, asdf)
I get the following dataframe, with duplicated id's
result <- data.frame(id = c(2345, 7323, 2345, 4533),
placeHome = c(1, 1, 0, 0),
placeOffice = c(0, 0, 1, 1),
sexFemale = c(0, 0, 0, 1),
sexMale = c(1, 1, 1, 0),
consumed = c(1000, 800, 1000, 500))
but I want this
sexy_result <- data.frame(id = c(2345, 7323, 4533),
placeHome = c(1, 1, 0),
placeOffice = c(1, 0, 1),
sexFemale = c(0, 0, 1),
sexMale = c(1, 1, 0),
consumed = c(1000, 800, 500))
You could transform your result data frame using the dplyr package.
library(dplyr)
sexy_result <- result %>% group_by(id) %>% summarise_all(sum)
data.frame(sexy_result)
id placeHome placeOffice sexFemale sexMale consumed
1 2345 1 1 0 2 2000
2 4533 0 1 1 0 500
3 7323 1 0 0 1 800
If you want to sum only placeHome and placeOffice and take the mean of the remaining columns, you could use the following code
sexy_result <- result %>%
  group_by(id) %>%
  summarise(placeHome = sum(placeHome),
            placeOffice = sum(placeOffice),
            sexFemale = mean(sexFemale),
            sexMale = mean(sexMale),
            consumed = mean(consumed))
data.frame(sexy_result)
id placeHome placeOffice sexFemale sexMale consumed
1 2345 1 1 0 1 1000
2 4533 0 1 1 0 500
3 7323 1 0 0 1 800
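Because every dummy column is 0/1 and the remaining columns are constant within an id, taking the maximum per id collapses the duplicates without listing each column. This is a base R sketch of that idea, and it relies on the question's assumption that only "place" varies within an id. The name `collapsed` is illustrative.

```r
# Dummy-coded data as produced in the question
result <- data.frame(id = c(2345, 7323, 2345, 4533),
                     placeHome = c(1, 1, 0, 0),
                     placeOffice = c(0, 0, 1, 1),
                     sexFemale = c(0, 0, 0, 1),
                     sexMale = c(1, 1, 1, 0),
                     consumed = c(1000, 800, 1000, 500))

# max() keeps a 1 if the id ever had that level, and leaves columns that
# are constant within an id (sex dummies, consumed) unchanged
collapsed <- aggregate(. ~ id, data = result, FUN = max)
collapsed
```

Unlike sum(), max() cannot push a dummy above 1 or double the consumed value when an id appears more than once.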
What would be a good visualization to use in R to show the association of 2 binary variables?
I understand that the phi coefficient would be the best statistic to use, but how can I show it graphically? A scatterplot would be very condensed, since there are only 4 possible combinations of values.
One idea would be to create a mosaic plot from the contingency table of the two binary variables.
Let's assume our data looks like this:
var1 var2
1 1 1
2 0 0
3 1 1
4 0 0
5 1 1
6 1 1
7 0 0
8 0 1
9 0 1
10 1 0
We could visualize it in the following way:
mosaicplot(table(df))
Data
df <- structure(list(var1 = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1), var2 = c(1,
0, 1, 0, 1, 1, 0, 1, 1, 0)), .Names = c("var1", "var2"), row.names = c(NA,
-10L), class = "data.frame")
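To report the phi coefficient mentioned in the question alongside the plot, it can be computed from the same 2x2 contingency table. The sketch below uses the standard formula for a 2x2 table and assumes both variables are coded 0/1 with all four cells present. The names `tab` and `phi` are illustrative.

```r
# Example data from the answer
df <- data.frame(var1 = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1),
                 var2 = c(1, 0, 1, 0, 1, 1, 0, 1, 1, 0))

# 2x2 contingency table: rows are var1, columns are var2
tab <- table(df$var1, df$var2)

n00 <- as.numeric(tab["0", "0"])
n01 <- as.numeric(tab["0", "1"])
n10 <- as.numeric(tab["1", "0"])
n11 <- as.numeric(tab["1", "1"])

# phi = (n11 * n00 - n10 * n01) / sqrt(product of the four marginal totals)
phi <- (n11 * n00 - n10 * n01) / sqrt(prod(rowSums(tab), colSums(tab)))
phi

# For two 0/1 variables, phi equals the Pearson correlation
cor(df$var1, df$var2)
```

The value could then be placed in the plot title, for example through the main argument of mosaicplot().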
Once again data transformation is eluding me. I've tried aggregate, xtabs, the apply functions, gmodels::CrossTable and all sorts of other things, but nothing seems to work.
I have a table with four columns, e.g. A:D, each a numeric binary variable (0, 1).
e.g.:
x <- data.frame(A = c(0, 1, 1, 0, 1),
B = c(1, 1, 0, 1, 0),
C = c(0, 1, 1, 0, 1),
D = c(1, 0, 1, 0, 1))
I would like an output where the rows and columns are both the variables (A:D) and the values are the sum of intersections.
eg:
output <- data.frame(A = c(3, 1, 3, 2),
B = c(1, 3, 1, 1),
C = c(3, 1, 3, 2),
D = c(2, 1, 2, 3))
rownames(output) <- c("A", "B", "C", "D")
For example, if there were 3 observations in column A, then the intersection A-A in the output would be 3. If 1 of the A observations was also in variable B, then the intersection A-B in the output table would show 1, as would the intersection B-A.
Hope that makes sense. It's really bugging me how to do it.
You can get this from matrix algebra.
M = as.matrix(x)
t(M) %*% M
A B C D
A 3 1 3 2
B 1 3 1 1
C 3 1 3 2
D 2 1 2 3
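The reason this works is that entry (i, j) of the product is the sum over rows of column i times column j, and for 0/1 data that product is 1 only where both columns are 1. A short sketch checking this on the question's data follows. It uses base R's crossprod(), which computes the same product, and the name `co` is illustrative.

```r
# Data from the question
x <- data.frame(A = c(0, 1, 1, 0, 1),
                B = c(1, 1, 0, 1, 0),
                C = c(0, 1, 1, 0, 1),
                D = c(1, 0, 1, 0, 1))

M <- as.matrix(x)

# crossprod(M) computes t(M) %*% M without forming the transpose explicitly
co <- crossprod(M)
co

# One off-diagonal cell checked by hand: rows where both A and D are 1
sum(x$A == 1 & x$D == 1)

# A diagonal cell is just the number of 1s in that column
sum(x$B == 1)
```

The result is always symmetric, which matches the requirement that the A-B and B-A intersections show the same count.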