I am facing a problem I do not understand. It's a follow-up on answers suggested here and here
I have two identically structured datasets. One I created as a reproducible example for which the code works, and one which is real for which the code does not work. After staring at it for hours I cannot find what is causing the error.
The following gives an example that works
df <- data.table(cbind(rep(seq(1,25), each = 4 )), cbind(rep(seq(1,40), length.out = 100)))
colnames(df) <- c("a", "b") #ignore warning
setkey(df, a, b)
This is just to create a reproducible example. When I apply the - slightly adjusted - code suggested in the mentioned SO articles I get what I am looking for: a sparse matrix that indicates when two elements in column b occur together for values of column a
library(Matrix)
s <- sparseMatrix(
df$a,
df$b,
dimnames = list(
unique(df$a),unique(df$b)), x = 1)
v <- t(s) %*% s
Now I am doing - in my eyes - exactly the same on my real dataset which is much longer.
A sample dput below looks like this
test <- dput(dk[1:50,])
structure(list(pid = c(204L, 204L, 207L, 254L, 254L, 258L, 258L,
258L, 258L, 258L, 265L, 265L, 269L, 269L, 269L, 269L, 1520L,
1520L, 1520L, 1520L, 1532L, 1532L, 1534L, 1534L, 1534L, 1534L,
1539L, 1539L, 1543L, 1543L, 1546L, 1546L, 1546L, 1546L, 1546L,
1546L, 1546L, 1549L, 1549L, 1549L, 1559L, 1559L, 1559L, 1559L,
1559L, 1559L, 1559L, 1561L, 1561L, 1561L), cid = c(11023L, 11787L,
14232L, 14470L, 14480L, 1290L, 1637L, 4452L, 13964L, 14590L,
17814L, 23453L, 6658L, 10952L, 17259L, 27549L, 11034L, 22748L,
23345L, 23347L, 10487L, 11162L, 15570L, 15629L, 17983L, 17999L,
17531L, 22497L, 14425L, 14521L, 11495L, 24948L, 24962L, 24969L,
24972L, 24973L, 30627L, 17886L, 18428L, 23972L, 13890L, 13936L,
14432L, 21230L, 21271L, 21384L, 21437L, 341L, 354L, 6302L)), .Names = c("pid",
"cid"), sorted = c("pid", "cid"), class = c("data.table", "data.frame"
), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x0000000000100788>)
Then when running the same formula, I get an error
s <- sparseMatrix(test$pid,test$cid,dimnames = list(unique(test$pid), unique(test$cid)),x = 1)
The Error (which occurs in the test dataset as well) reads as follows:
Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[[1]])' must match Dim[1]
The problem disappears when I remove the dimnames but I really need these dimnames to make sense of the results. I'm sure I'm missing out on something obvious. Can someone please tell me what it is ?
We can convert the 'pid', 'cid' columns to factor and coerce back to numeric or use match with unique values of each column to get the row/column index and this should work in creating sparseMatrix.
test1 <- test[, lapply(.SD, function(x)
as.numeric(factor(x, levels=unique(x))))]
Or we use match
test1 <- test[, lapply(.SD, function(x) match(x, unique(x)))]
s1 <- sparseMatrix(test1$pid,test1$cid,dimnames = list(unique(test$pid),
unique(test$cid)),x = 1)
dim(s1)
#[1] 15 50
s1[1:3, 1:3]
#3 x 3 sparse Matrix of class "dgCMatrix"
# 11023 11787 14232
#204 1 1 .
#207 . . 1
#254 . . .
head(test)
# pid cid
#1: 204 11023
#2: 204 11787
#3: 207 14232
#4: 254 14470
#5: 254 14480
#6: 258 1290
EDIT:
If we want this for the full row/column index specified in 'test', we need to make the dimnames as the same length as the max of 'pid', 'cid'
rnm <- seq(max(test$pid))
cnm <- seq(max(test$cid))
s2 <- sparseMatrix(test$pid, test$cid, dimnames=list(rnm, cnm))
dim(s2)
#[1] 1561 30627
s2[1:3, 1:3]
#3 x 3 sparse Matrix of class "ngCMatrix"
# 1 2 3
#1 . . .
#2 . . .
#3 . . .
Related
here little example of my data.
sales.data=structure(list(MDM_Key = c(370L, 370L, 370L, 370L, 370L, 370L,
370L, 371L, 371L, 371L, 371L, 371L, 371L, 371L), sale_count = c(30L,
32L, 32L, 24L, 20L, 15L, 23L, 30L, 32L, 32L, 24L, 20L, 15L, 23L
), iek_disc_price = c(38227.08, 38227.08, 33739.7, 38227.08,
38227.08, 28844.16, 31649.255, 38227.08, 38227.08, 33739.7, 38227.08,
38227.08, 28844.16, 31649.255)), class = "data.frame", row.names = c(NA,
-14L))
i perform regression analysis
str(sales.data)
m1<-lm(formula=sale_count~iek_disc_price,data=sales.data)
summary(m1)
But the main difficulty is that for each group (MDM_Key) I don't need all the regression results from the summary, but only one beta coefficient.
here
B=0.0008559.
but then i need calculate mean value for sale_count and the mean for iek_disc_price (also for each mdm key group)
so the desired result would be like this
MDM_Key beta mean(sale_count) mean(iek_disc_price)
370 0.0008559 25.14 35305
371 0.0008559 25.14 35305
How to take only beta (nor intercept)regression coefficient for each group mdm_key
and also for each group, calculate the mean values for sale_count and iek_disc_price to get the summary table indicated above.
Thank you for your help.
If I understood correctly, you want to apply one regression per MDM_Key.
library(dplyr)
library(purrr)
library(broom)
sales.data %>%
group_by(MDM_Key) %>%
mutate(
mean_sale_count = mean(sale_count),
mean_iek_disc_price = mean(iek_disc_price)
) %>%
nest(-MDM_Key,-mean_sale_count,-mean_iek_disc_price) %>%
mutate(
coefs = map(.x = data,.f = ~tidy(lm(formula=.$sale_count~.$iek_disc_price,data=.)))
) %>%
unnest(coefs) %>%
filter(term != "(Intercept)") %>%
select(MDM_Key,beta = estimate,mean_sale_count,mean_iek_disc_price)
# A tibble: 2 x 4
# Groups: MDM_Key [2]
MDM_Key beta mean_sale_count mean_iek_disc_price
<int> <dbl> <dbl> <dbl>
1 370 0.000856 25.1 35306.
2 371 0.000856 25.1 35306.
Using R base and the split + apply + combine strategy:
do.call(rbind, lapply(split(sales.data, sales.data$MDM_Key), function(i) {
c(beta=coef(lm(sale_count~iek_disc_price, data=i))[2],
sale_count_mean=mean(i$sale_count),
iek_disc_price_mean=mean(i$iek_disc_price))
} ))
beta.iek_disc_price sale_count_mean iek_disc_price_mean
370 0.0008558854 25.14286 35305.92
371 0.0008558854 25.14286 35305.92
Get the means using aggregate and the beta values using lmList and then put them together and rearrange the columns in the order shown in the question. Omit [, c(2:1, 3:4)] if the column order doesn't matter. Note that nlme comes with R and does not have to be installed.
library(nlme) # lmList
means <- aggregate(. ~ MDM_Key, sales.data, mean)
fm <- lmList(sale_count ~ iek_disc_price | MDM_Key, sales.data)
cbind(beta = coef(fm)[, 2], means)[, c(2:1, 3:4)]
## MDM_Key beta sale_count iek_disc_price
## 1 370 0.0008558854 25.14286 35305.92
## 2 371 0.0008558854 25.14286 35305.92
my data looks like
Nr. Type 1 Type 2 Type 1 Type 2
1 400 600 100 800
2 500 400 900 300
3 200 200 400 700
4 300 600 800 300
and I want to create Boxolplots of Type 1 and type 2.
Pivotlonger makes Type 1 and Type 1.1 which is not what I Need.
Maybe someone can help me.
It turns out your issue was not with the pivot_longer() but with the subsetting of your original data.frame using [. There is no direct control over the requirement that the output of [ or base::subset() have unique column names so you need do something else to subset your data and avoid losing column names. This is discussed in this question so borrowing from one of the answers, you can use:
library(tidyverse)
# data with extra column to be removed
d <- structure(list(Nr. = 1:4, `Type 1` = c(400L, 500L, 200L, 300L), x = 1:4, `Type 2` = c(600L,
400L, 200L, 600L), `Type 1` = c(100L, 900L, 400L, 800L), `Type 2` = c(800L,
300L, 700L, 300L)), row.names = c(NA, -4L), class = "data.frame")
# remove extra column without changing names then pivot
data.frame(as.list(d)[-3], check.names = FALSE) %>%
pivot_longer(-Nr.) %>%
ggplot(aes(name, value)) +
geom_boxplot()
Created on 2022-02-22 by the reprex package (v2.0.1)
I have a data table which i would like to filter based on multiple conditions looking at all previous columns. If the New_ID.1 row number is before the same id in the New_ID column, remove the row where New_ID= New_ID.1 from previous row. For example, I would remove the New_ID 581 in row 3 because New_ID.1 is in row 1. However I don't want to remove row six New_ID 551 since row 3 New_ID.551 would be removed first. Essentially, I think i need to loop through and create a new filtered table for each row and repeat process?
orig_df<- structure(list(New_ID = c(557L, 588L, 581L, 580L, 591L, 551L,
300L, 112L), New_ID.1 = c(581L, 591L, 551L, 300L, 112L, 584L,
416L, 115L), distance = c(3339.15537217173, 3432.33715484179,
5268.69104753613, 5296.72042763528, 5271.94917463488, 5258.66546295312,
5286.99982045171, 5277.81914818968), X.x = c(903604.940384474,
819515.728302034, 903663.550206032, 866828.860223065, 819525.350044447,
903720.790105847, 866881.654186025, 819585.173276271), Y.x = c(1027706.41509243,
1026880.34660449, 1024367.77412815, 1023962.99139374, 1023448.02293581,
1019099.39402149, 1018666.53407908, 1018176.41319296), X.y = c(903663.550206032,
819525.350044447, 903720.790105847, 866881.654186025, 819585.173276271,
903801.327345876, 866919.184271939, 819630.672367509), Y.y = c(1024367.77412815,
1023448.02293581, 1019099.39402149, 1018666.53407908, 1018176.41319296,
1013841.34531459, 1013379.66746509, 1012898.79016799), Y_filter = c(3338.64096427278,
3432.32366867992, 5268.38010666054, 5296.45731465891, 5271.60974284587,
5258.04870690871, 5286.86661398865, 5277.62302497006), X_filter = c(58.609821557533,
9.62174241337925, 57.2398998149438, 52.7939629601315, 59.8232318238588,
80.5372400298947, 37.5300859131385, 45.4990912381327), row.number = 1:8), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
End result would retain rows 1,2,4,6 and 8 from original data
output_table<-structure(list(New_ID = c(557L, 588L, 580L, 551L, 112L), New_ID.1 = c(581L,
591L, 300L, 584L, 115L), distance = c(3339.15537217173, 3432.33715484179,
5296.72042763528, 5258.66546295312, 5277.81914818968), X.x = c(903604.940384474,
819515.728302034, 866828.860223065, 903720.790105847, 819585.173276271
), Y.x = c(1027706.41509243, 1026880.34660449, 1023962.99139374,
1019099.39402149, 1018176.41319296), X.y = c(903663.550206032,
819525.350044447, 866881.654186025, 903801.327345876, 819630.672367509
), Y.y = c(1024367.77412815, 1023448.02293581, 1018666.53407908,
1013841.34531459, 1012898.79016799), Y_filter = c(3338.64096427278,
3432.32366867992, 5296.45731465891, 5258.04870690871, 5277.62302497006
), X_filter = c(58.609821557533, 9.62174241337925, 52.7939629601315,
80.5372400298947, 45.4990912381327), row.number = c(1L, 2L, 4L,
6L, 8L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
Below is a simpler problem that might be of help.
Original data
A|B
C|D
B|E
E|F
Updated data table
A|B
C|D
E|F
I think looping through the rows and saving the ids that you already encountered should be enough?
orig_df <- as.data.frame(orig_df)
included_rows <- rep(FALSE, nrow(orig_df))
seen_ids <- c()
for(i in 1:nrow(orig_df)){
# Skip row if we have seen either ID already
if(orig_df[i, 'New_ID'] %in% seen_ids) next
if(orig_df[i, 'New_ID.1'] %in% seen_ids) next
# If both ids are new, we save them as seen and include the entry
seen_ids <- c(seen_ids, orig_df[i, 'New_ID'] , orig_df[i, 'New_ID.1'] )
included_rows[i] <- TRUE
}
filtered_df <- orig_df[included_rows,]
I want to take all possible samples of n=3 rows from a data frame. I can obtain all those row combinations using:
Combinations <- combn(dim(df)[1], 3)
Combinations[1:10]
[1] 1 2 3 1 2 4 1 2 5 1
I would need to take subsets of every 3 numbers I get in Combinations and use them for the row numbers of the df (i,e. first one is rows 1 2 3; second is 1 2 4, and so on).
I also need to 'store' each subset to perform functions (obtain sum and mean of the columns and save the row.names). I am using a list for that purpose which would store all samples of the original dataframe.
Here's an example of a function using sample which makes what I need but not for every combination.
take_sample = function(dataframe){
df2= dataframe[sample(nrow(dataframe),3),]
lista=list(filas=row.names(df2),
suma=sum(dist(df2)),
coord_mean=apply(df2,2,mean))
return(lista)
}
Here's the df
structure(list(x = c(4817.5, 4814.5, 4817.5, 8515, 8518, 8543.5,
8549.5, 7236.5, 7242.5, 8024.5, 8017, 5549, 5547, 8857, 8861.5,
8132.5, 8136.5, 9349, 9343, 5914), y = c(4424, 4431.5, 4432,
4482.5, 4484, 4574.5, 4576, 4758, 4759.5, 4772, 4772.5, 4807.5,
4809.5, 4936, 4940, 5177, 5175, 5192.5, 5192.5, 5236)), .Names = c("x",
"y"), row.names = c(1175L, 1176L, 1177L, 1180L, 1181L, 1185L,
1187L, 1203L, 1204L, 1206L, 1207L, 1215L, 1217L, 1227L, 1229L,
1249L, 1250L, 1252L, 1253L, 1257L), class = "data.frame")
I want to know, if i can check out if a column of a data frame starts with 0 or 1 and goes till the number of rows without breaking the sequence. Below is a sample data frame.
structure(list(X = 1:22, SNR = c(1.0035798429, 11.9438978154,
NA, 3.2894877794, 4.0170266411, 1.6310522977, 1.6405414787, 1.6625412522,
0.8489116253, 7.5312259672, 7.2832910726, 0.5732577083, NA, 0.8149754292,
1.9981020389, 1.2477052103, 0.9960804911, 10.3402683931, 3.6328270728,
2.5540496855, 41.96873985, 6.2035281045), ID = c(109L, 110L,
111L, 112L, 113L, 114L, 116L, 117L, 118L, 119L, 120L, 121L, 123L,
124L, 125L, 126L, 127L, 128L, 130L, 131L, 132L, 133L), SignalIntensity = c(6.8173738339,
11.5459925418, NA, 9.7804203445, 9.8719842219, 9.0781857736,
8.2289312163, 8.0435364446, 6.1793458315, 10.5581798932, 10.4745329822,
4.1572943809, NA, 6.0451742752, 8.3100219509, 7.4558770659, 7.1464749962,
11.4284386394, 9.6273795753, 9.6807417299, 13.3364944397, 10.4304671876
)), .Names = c("X", "SNR", "ID", "SignalIntensity"), class = "data.frame", row.names = c(NA,
-22L))
How can i check the columns and return the index if present.
Edited: The sequence i am looking for is a natural sequence. Suppose if a data frame has 10 rows, the column if present should have a sequence 1,2,3,4,5,6,7,8,9,10 or may be like0,1,2,3,4,5,6,7,8,9. . So the sequence starts with 0 or 1 and goes till the number of rows with an increment of 1 for each row.
You could loop through the columns with sapply. Create a function to check whether there are any NAs. If not (!any), we get the difference (diff) between the adjacent element, check if all the element difference is 1 (all(diff(x)==1) and (&) the first value of the column is 0 or 1 (x[1] %in% 0:1). If there is any NA, the output for that column will be 'FALSE'.
f1 <- function(x) {
if(!any(is.na(x)))
all(diff(x)==1) & x[1] %in% 0:1
else FALSE}
which(sapply(df, f1))
#X
#1