I'm attempting to replicate the output of the SensoMineR package (cartoconsumer). The example data set "hedo.cocktail" in the package appears to use numbers as column headers.
data(cocktail)
View(hedo.cocktail)
However, when I import dummy data with the same structure (1, 2, 3, ... as headers denoting the consumers), RStudio automatically prefixes each header with an "X". A data set with these "X" headers does not produce the output; instead I get this error:
Subscript out of bounds in Mat[rownames(MatH),]
My guess is that the data header does not match the example data set.
Is it possible to import data with numeric headers?
Here's a sample of my dummy data. Thank you for your suggestions.
senso.cake
color odor flavor size
25.000 45.000 25.000 78.000
26.000 56.000 49.000 45.000
27.000 54.000 85.000 45.000
28.000 52.000 98.000 58.000
30.000 58.000 56.000 96.000
31.000 56.000 32.000 96.000
32.000 58.000 56.000 93.000
36.000 59.000 45.000 90.000
hedo.cake
1 2 3 4 5 6 7 8 9 10
9 7 7 4 8 9 6 7 8 7
6 4 4 2 4 7 8 7 7 7
7 6 8 7 7 6 7 7 7 6
8 8 6 4 7 8 6 8 6 8
4 5 7 3 8 6 6 8 6 7
7 6 7 3 7 6 6 7 7 8
8 6 8 6 7 7 8 7 4 9
6 3 5 6 4 4 6 7 2 7
You can turn your data frame into a matrix.
mat <- as.matrix(your_df)
mat <- matrix(data = mat, dimnames = NULL, ncol = ncol(your_df))
After that, you can simply rbind() whatever header row you want onto it.
mat <- rbind(c(1:ncol(mat)), mat)
For example:
a <- c(5:8)
b <- c(4:7)
c <- c(3:6)
mat <- matrix(c(a,b,c), ncol = 3)
mat <- rbind(c(1:ncol(mat)), mat)
mat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    5    4    3
[3,]    6    5    4
[4,]    7    6    5
[5,]    8    7    6
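As an alternative to rebuilding the header by hand, the "X" prefix can usually be avoided at import time. The sketch below assumes the data is read with read.csv() (the question does not say which import function was used) and uses a made-up three-column CSV string. The prefix comes from the check.names argument, which defaults to TRUE and rewrites headers into syntactically valid R names. Since the error message indexes by rownames(MatH), it may also be worth checking that the row names of the two tables match; that part is an assumption about the cause.

```r
# Made-up CSV text with numeric headers, standing in for the real file
csv_text <- "1,2,3\n9,7,7\n6,4,4"

# Default behaviour: headers are made syntactically valid, so "1" becomes "X1"
with_x <- read.csv(text = csv_text)
names(with_x)

# check.names = FALSE keeps the headers exactly as they appear in the file
as_is <- read.csv(text = csv_text, check.names = FALSE)
names(as_is)
```

Columns named this way have to be referenced with backticks or strings, e.g. as_is[["1"]].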
My professor wants us to download an Excel file directly from a website, and as part of the analysis we need to compute the sum of some of the columns (the professor suggested using starts_with). The problem is that the names I need (the second header, if I can call it that) are on the second line, and R reads that line as an observation instead of as the header. I tried to delete the first row, but R deleted the header I did not want to lose. Here is the code. Initially, I tried this:
install.packages("tidyverse", dependencies = T)
install.packages("data.table", dependencies = T)
install.packages("readxl", dependencies = T)
install.packages("ggplot2", dependencies = T)
install.packages("openxlsx", dependencies = T)
library(tidyverse)
library(data.table)
library(readxl)
library(ggplot2)
library(openxlsx)
datatable <- data.table(openxlsx::read.xlsx('https://doi.org/10.1371/journal.pone.0242866.s001')) %>%
tail(-1)
Later on, I tried it separately (I loaded the same file without the tail(-1)) and on a second line I wrote:
dt <- dt[-1,]
I also tried something that I saw on the internet:
names(dt) = NULL
but it gave me this error:
Error in View : Internal error: length of names (0) is not length of dt (61)
Can someone tell me the proper way? (For the second and third attempts I assigned dt = datatable, which is why the object name differs from the first one.)
You just need to use the startRow parameter:
datatable <- data.table::data.table(openxlsx::read.xlsx('https://doi.org/10.1371/journal.pone.0242866.s001', startRow = 2))
Which makes your column names:
> names(datatable)
[1] "Time.mark" "EXP1" "CAL1" "EXP2" "CAL2"
[6] "EXP3" "CAL3" "EXP4" "CAL4" "EXP5"
[11] "CAL5" "EXP6" "CAL6" "EXP7" "CAL7"
[16] "EXP8" "CAL8" "EXP9" "CAL9" "EXP10"
[21] "CAL10" "EXP11" "CAL11" "EXP12" "CAL12"
[26] "EXP13" "CAL13" "EXP14" "CAL14" "EXP15"
[31] "CAL15" "EXP16" "CAL16" "EXP17" "CALIDAD/PRECIO.RTES"
[36] "EXP18" "CAL18" "EXP19" "CAL19" "SAT1"
[41] "SAT2" "SAT3" "SAT4" "SAT5" "SAT6"
[46] "SAT7" "LEALTAD1" "LEALTAD2" "LEALTAD3" "LEALTAD4"
[51] "LEALTAD5" "MEDINA.AZAHARA" "MEZQUITA-CATEDRAL" "ALCAZAR.REYES.CRISTIANOS" "SINAGOGA"
[56] "FIESTA.PATIOS" "IGLESIAS.FERNANDINAS" "BARRIO.JUDERIA" "Sex" "age"
[61] "level.of.study"
Then you can use starts_with() to select columns:
datatable %>%
select(starts_with('EXP'))
EXP1 EXP2 EXP3 EXP4 EXP5 EXP6 EXP7 EXP8 EXP9 EXP10 EXP11 EXP12 EXP13 EXP14 EXP15 EXP16 EXP17 EXP18 EXP19
1: 5 6 5 4 7 5 6 6 5 6 5 5 6 6 6 6 6 6 6
2: 6 7 7 6 6 7 6 7 7 7 5 7 7 6 7 6 6 6 6
3: 5 7 5 6 6 6 5 4 6 7 4 5 6 5 5 5 5 6 4
4: 5 4 6 5 7 6 2 5 6 6 7 7 7 5 5 6 6 5 4
5: 5 4 6 5 7 6 7 5 6 6 7 7 7 5 5 6 6 6 3
---
258: 6 6 6 6 5 6 7 5 7 6 7 6 5 5 7 5 6 6 5
259: 6 6 7 6 7 6 7 7 6 6 7 6 7 7 6 7 6 7 6
260: 6 6 7 6 6 6 6 6 5 7 6 7 6 7 7 6 7 7 7
261: 6 7 6 7 7 7 5 5 6 6 6 7 6 7 6 7 6 6 6
262: 5 7 7 6 6 7 6 5 6 7 5 7 7 5 7 6 7 6 6
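Since the stated goal was to sum the columns that share a prefix, here is a minimal sketch of that last step. It uses base R's startsWith() rather than dplyr's starts_with(), only so that it runs without extra packages, and a small made-up data frame (survey, with invented values) in place of the downloaded file.

```r
# Made-up stand-in for the imported data: two EXP columns and one CAL column
survey <- data.frame(EXP1 = c(5, 6), EXP2 = c(6, 7), CAL1 = c(4, 5))

# Logical vector marking the columns whose names begin with "EXP"
exp_cols <- startsWith(names(survey), "EXP")

# Sum of each EXP column over all respondents
col_totals <- colSums(survey[exp_cols])

# Sum across the EXP columns for each respondent
survey$EXP_total <- rowSums(survey[exp_cols])
```

With dplyr loaded, select(starts_with('EXP')) picks the same columns, and the result can be passed to colSums() or rowSums() in the same way.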
I need to extract separate tables from each Excel sheet and store them in a list. I have two lists: "allsheets" contains 38 sheets, each of which includes at least 2 tables, and "dataRowMeta" contains information about which rows are relevant for each table. For example,
a1 <- data.frame(y1=c(1:15),y2=c(6:20))
a2 <- data.frame(y1=c(3:18),y2=c(2:17))
allsheets <- list(a1, a2)
d1<- data.frame(starthead=c(1,9),endhead=c(2,10),startdata =c(3,11),
enddata = c(7,14),footer = c(8,15))
d2<- data.frame(starthead=c(1,10),endhead=c(2,11),startdata =c(3,12),
enddata = c(8,15),footer = c(9,16))
dataRowMeta <- list(d1,d2)
[[1]]
y1 y2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
6 6 11
7 7 12
8 8 13
9 9 14
10 10 15
11 11 16
12 12 17
13 13 18
14 14 19
15 15 20
[[2]]
y1 y2
1 3 2
2 4 3
3 5 4
4 6 5
5 7 6
6 8 7
7 9 8
8 10 9
9 11 10
10 12 11
11 13 12
12 14 13
13 15 14
14 16 15
15 17 16
16 18 17
and here is dataRowMeta :
[[1]]
starthead endhead startdata enddata footer
1 1 2 3 7 8
2 9 10 11 14 15
[[2]]
starthead endhead startdata enddata footer
1 1 2 3 8 9
2 10 11 12 15 16
I've tried to write a loop that subsets each sheet according to dataRowMeta, but I failed to get the desired output.
I am getting this error:
Error in sheet[[a[m]:b[m], ]] : incorrect number of subscripts
I guess that's because I am iterating over a list, not a matrix, but how do I tell R to subset a list in this case?
So I need the 1st and 4th columns of dataRowMeta (starthead and enddata) as the "start" and "end" row indices of the future tables.
tables <- function(allsheets, dataRowMeta) {
  for (i in 1:length(dataRowMeta)) {
    for (j in 1:nrow(dataRowMeta[[i]])) {
      a <- ""
      b <- ""
      a <- dataRowMeta[[i]][j:j, 1]
      b <- dataRowMeta[[i]][j:j, 4]
      for (k in 1:length(allsheets)) {
        sheet <- allsheets[k]
        for (m in 1:length(a)) {
          tbl <- sheet[[a[m]:b[m], ]]
        }
      }
    }
  }
}
Desired output: I have this for the first table of the first sheet (sheet1):
sheet1 <- allsheets[[1]]
tmp1 <- sheet1[dataRowMeta[[1]][1:1,1] :dataRowMeta[[1]][1:1,4] ,]
> tmp1
y1 y2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
6 6 11
7 7 12
And I need a loop that does this for all sheets. Please help me figure out how to do it. Thank you!
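One possible way to get there, sketched under the assumption that sheet i in allsheets is described by element i of dataRowMeta. Two things in the loop above stand out: allsheets[k] returns a one-element list rather than the data frame (allsheets[[k]] returns the data frame), and rows of a data frame are selected with single brackets, sheet[from:to, ], not double brackets. The helper name split_sheet is hypothetical.

```r
# Toy inputs copied from the question
a1 <- data.frame(y1 = 1:15, y2 = 6:20)
a2 <- data.frame(y1 = 3:18, y2 = 2:17)
allsheets <- list(a1, a2)
d1 <- data.frame(starthead = c(1, 9), endhead = c(2, 10), startdata = c(3, 11),
                 enddata = c(7, 14), footer = c(8, 15))
d2 <- data.frame(starthead = c(1, 10), endhead = c(2, 11), startdata = c(3, 12),
                 enddata = c(8, 15), footer = c(9, 16))
dataRowMeta <- list(d1, d2)

# Hypothetical helper: cut one sheet into the tables described by its metadata,
# taking rows starthead..enddata for each metadata row
split_sheet <- function(sheet, meta) {
  Map(function(from, to) sheet[from:to, , drop = FALSE],
      meta$starthead, meta$enddata)
}

# Pair each sheet with its metadata; the result is a list (one element per
# sheet) of lists (one data frame per table)
tables <- Map(split_sheet, allsheets, dataRowMeta)
```

Here tables[[1]][[1]] corresponds to tmp1 above, and tables[[2]][[2]] is the second table of the second sheet.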
I have a strange problem with anova summary results summary(aov).
So here is the problem. I have a dataset with 6 columns. Here is the dataset sample:
Panelist Prod.ID Overall Appearance Flavor Texture
1 196 9 9 9 9
1 239 7 9 6 7
1 354 9 8 8 7
1 427 3 8 2 3
1 577 8 9 7 9
1 638 7 9 7 8
1 772 6 4 3 3
1 852 9 8 9 8
2 196 8 8 7 8
2 239 7 7 7 7
2 354 6 5 6 4
2 427 6 7 4 6
2 577 3 6 3 5
2 638 4 4 5 2
2 772 6 2 6 7
2 852 7 6 7 6
3 196 7 9 7 8
3 239 8 9 8 8
3 354 8 8 7 8
3 427 7 8 6 8
3 577 8 9 8 8
3 638 8 9 8 7
3 772 5 8 8 8
3 852 8 9 8 8
The first two columns are the factors and the rest are the response variables. summary() treated Panelist and Prod.ID as continuous variables, so I converted them to factors with as.factor().
After that conversion I ran the ANOVA with the model Overall ~ Panelist * Prod.ID, but the summary shows only this:
> summary(aov(Overall ~ Prod.ID * Panelist, data = paneElements))
Df Sum Sq Mean Sq
Prod.ID 7 189.6 27.085
Panelist 160 1252.9 7.830
Prod.ID:Panelist 1116 3116.1 2.792
I can't find what makes the F values and p-values disappear.
Any help would be much appreciated.
Thanks a lot.
You have only one observation for each combination of Prod.ID and Panelist (at least in your sample data), so the model with the interaction term has as many parameters as there are observations. That leaves zero residual degrees of freedom, so there is no residual mean square to divide by in the F-test, which is why no F statistics or p-values are reported.
For example, when I add an extra observation for Prod.ID 196 for just one level of Panelist, I get F and p values reported in the output.
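The effect can be reproduced on a small made-up data set. The sketch below uses invented scores (5 panelists, 3 products, one rating per cell) rather than the real data. With the interaction term the residual degrees of freedom are zero; dropping the interaction, which is one possible choice and not the only one, frees residual degrees of freedom so that F-tests can be formed.

```r
# Made-up data: every panelist rates every product exactly once
set.seed(42)
d <- expand.grid(Panelist = factor(1:5), Prod.ID = factor(c(196, 239, 354)))
d$Overall <- sample(1:9, nrow(d), replace = TRUE)

# Interaction model: 15 parameters for 15 observations, nothing left for error
saturated <- aov(Overall ~ Prod.ID * Panelist, data = d)
df.residual(saturated)

# Additive model: the interaction is pooled into the residual
additive <- aov(Overall ~ Prod.ID + Panelist, data = d)
df.residual(additive)
summary(additive)
```

In the additive model the Prod.ID and Panelist effects are tested against the panelist-by-product variation, which is a common layout for panel data with a single rating per cell.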
I have two data sets; one is a subset of the other, but the subset has an additional column and fewer observations.
Basically, I have a unique ID assigned to each participant, and an HHID, the ID of the house from which they were recruited (e.g. 15 participants recruited from 11 houses).
> Healthdata <- data.frame(ID = gl(15, 1), HHID = c(1,2,2,3,4,5,5,5,6,6,7,8,9,10,11))
> Healthdata
Now, I have a subset of the data with only one participant per household, namely the one who spent the most hours watching television. In this subset, I have computed a socioeconomic score (SSE) for each house.
> set.seed(1)
> Healthdata.1<- data.frame(ID=sample(1:15,11, replace=F), HHID=gl(11,1), SSE = sample(-6.5:3.5, 11, replace=TRUE))
> Healthdata.1
Now, I want to assign the SSE from the subset (Healthdata.1) to the participants in the bigger data set (Healthdata) such that participants from the same house get the same score.
I can't simply merge them, because the data sets have different numbers of observations: 15 in the bigger one but only 11 in the subset.
Is there any way to do this in R? I am very new to it and I am stuck with this.
I want output like the following, i.e. IDs (participants) from the same HHID (house) should have the same SSE score. This is only an example of what I need; the seed above will not give the same values.
ID HHID SSE
1 1 -6.5
2 2 -5.5
3 2 -5.5
4 3 3.3
5 4 3.0
6 5 2.58
7 5 2.58
8 5 2.58
9 6 -3.05
10 6 -3.05
11 7 -1.2
12 8 2.5
13 9 1.89
14 10 1.88
15 11 -3.02
Thanks.
You can use merge. By default it merges on the columns the two data frames have in common (here both ID and HHID).
merge(Healthdata,Healthdata.1,all.x=TRUE)
ID HHID SSE
1 1 1 NA
2 2 2 NA
3 3 2 NA
4 4 3 NA
5 5 4 NA
6 6 5 NA
7 7 5 NA
8 8 5 NA
9 9 6 0.7
10 10 6 NA
11 11 7 NA
12 12 8 NA
13 13 9 NA
14 14 10 NA
15 15 11 NA
Or you can choose which column to merge by:
merge(Healthdata,Healthdata.1,all.x=TRUE,by='ID')
You need to merge by HHID, not ID. Note this is somewhat confusing because the IDs in the full data set come from a different set than those in the subset, i.e. ID.x == 4 and ID.y == 4 are not the same participant (in fact, in this case they are in different households). Because of that I kept both ID columns here to avoid ambiguity, but you can easily subset the result to show only ID.x.
> merge(Healthdata, Healthdata.1, by='HHID')
HHID ID.x ID.y SSE
1 1 1 4 -5.5
2 2 2 6 0.5
3 2 3 6 0.5
4 3 4 8 -2.5
5 4 5 11 1.5
6 5 6 3 -1.5
7 5 7 3 -1.5
8 5 8 3 -1.5
9 6 9 9 0.5
10 6 10 9 0.5
11 7 11 10 3.5
12 8 12 14 -2.5
13 9 13 5 1.5
14 10 14 1 3.5
15 11 15 2 -4.5
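To get the three-column layout asked for in the question, one option is to drop the subset's ID column before merging, so that only HHID and SSE are carried over. This is a sketch on made-up data: the data frame names people and households and the SSE values are invented for illustration.

```r
# Made-up stand-ins: 6 participants in 3 households, one score per household
people <- data.frame(ID = 1:6, HHID = c(1, 2, 2, 3, 3, 3))
households <- data.frame(HHID = c(1, 2, 3), SSE = c(-6.5, -5.5, 3.3))

# Left join on the household ID: every participant keeps a row and
# picks up the score of their household
result <- merge(people, households, by = "HHID", all.x = TRUE)

# Restore participant order and the column order from the question
result <- result[order(result$ID), c("ID", "HHID", "SSE")]
result
```

With the data from the question, the second argument would be Healthdata.1[, c("HHID", "SSE")].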
library(plyr)
join(Healthdata, Healthdata.1)
# Inner Join
join(Healthdata, Healthdata.1, type = "inner", by = "ID")
# Left Join
# I believe this is what you are after
join(Healthdata, Healthdata.1, type = "left", by = "ID")
How can I get the ID (or name) of the terminal node of an rpart model for every row? predict.rpart can return only the predicted class (number or factor), the class probabilities, or some combination (using type="matrix") for a classification tree.
I would like to do something like:
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
plot(fit) # there are 5 terminal nodes
predict(fit, type = "node_id") # should return IDs of terminal nodes (e.g. 1-5) (does not work)
The partykit package supports predict(..., type = "node"), both in and out of sample. You can simply convert the rpart object to use this:
library("partykit")
predict(as.party(fit), type = "node")
## 9 7 9 9 3 3 3 3 3 8 8 3 9 5 3 3 3 7 3 5 3 9 8 9 9 5 9 8 3 3 3 7 7 3 7 3 5 9 5 8
## 9 7 9 9 3 3 3 3 3 8 8 3 9 5 3 3 3 7 3 5 3 9 8 9 9 5 9 8 3 3 3 7 7 3 7 3 5 9 5 8
## 9 5 9 9 3 7 3 7 9 7 8 3 9 3 3 3 5 9 5 8 9 9 9 3 3 5 3 7 5 3 7 7 3 7 3 3 7 5 7 9
## 9 5 9 9 3 7 3 7 9 7 8 3 9 3 3 3 5 9 5 8 9 9 9 3 3 5 3 7 5 3 7 7 3 7 3 3 7 5 7 9
## 5
## 5
table(predict(as.party(fit), type = "node"))
## 3 5 7 8 9
## 29 12 14 7 19
For that model there were 4 splits, yielding 5 "terminal nodes" or in the terminology used in rpart: <leaf>s. I do not see why there should be 5 predictions for anything. The predictions are for particular cases and the leaves are the result of a variable number of the splits used to make those predictions. The numbers of rows in the original dataset that ended up in the leaves may be what you want, in which case these are ways of getting those numbers:
# Row of fit$frame (i.e. the leaf) that each observation falls into
fit$where
# counts of cases in leaves of prediction rules
table(fit$where)
3 5 7 8 9
29 12 14 7 19
In order to assemble the labels(fit) that apply to a particular leaf, you would need to traverse the rule-tree and accumulate all the labels for all the splits that were applied to produce a particular leaf. You probably want to look at:
?print.rpart
?rpart.object
?text.rpart
?labels.rpart
The above method using $where returns only the row number in the tree frame (fit$frame), not the node number. So with kyphosis$ID = fit$where, observations are labelled with frame row indices instead of the actual leaf node IDs.
To get the actual leaf node ID use the following:
MyID <- row.names(fit$frame)
kyphosis$ID <- MyID[fit$where]
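A short self-contained check of that mapping, sketched with the kyphosis data that ships with rpart. The variable names node_id and is_leaf are only illustrative. It converts the frame row indices in fit$where to node numbers and confirms that every observation ends up in a row of fit$frame marked "<leaf>".

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Row names of fit$frame are the node numbers; fit$where indexes its rows
node_id <- as.integer(rownames(fit$frame))[fit$where]

# Rows of the frame that are terminal nodes
is_leaf <- fit$frame$var == "<leaf>"

# Every training observation should sit in a leaf row
all(is_leaf[fit$where])

# Number of observations per leaf, labelled by node number
table(node_id)
```

The counts per leaf are the same as in table(fit$where); only the labels change from frame row indices to node numbers.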
To predict leaves for new data, one could use rpart.predict(fit, newdata, nn = TRUE) from the rpart.plot package, which adds node numbers to the output.
Here is an isolated rpart leaf predictor:
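# Return the leaf each row of newdata falls into: the row index in fit$frame
# (type = "where") or the node number as a character string (type = "leaf").
rpart_leaves <- function(fit, newdata, type = c("where", "leaf"), na.action = na.pass) {
  # Build a model frame unless newdata already is one
  if (is.null(attr(newdata, "terms"))) {
    Terms <- delete.response(fit$terms)
    newdata <- model.frame(Terms, newdata, na.action = na.action,
                           xlev = attr(fit, "xlevels"))
    if (!is.null(cl <- attr(Terms, "dataClasses")))
      .checkMFClasses(cl, newdata, TRUE)
  }
  # rpart.matrix() and pred.rpart() are unexported rpart internals
  newdata <- rpart:::rpart.matrix(newdata)
  where <- unname(rpart:::pred.rpart(fit, newdata))
  if (match.arg(type) == "where")
    return(where)
  rownames(fit$frame)[where]
}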