Removing one row of headers - r

My professor wants us to download an Excel file directly from a website and, as part of the analysis, compute the sum of some of the columns (the professor suggested that we use starts_with). The problem is that the column names I need (the second header, if I can call it that) are on the second line of the file, and R reads that line as an observation instead of as the header. I tried to delete the first row, but R removed the header row I wanted to keep and left the one I didn't want. Here is my code. Initially, I tried this:
install.packages("tidyverse", dependencies = T)
install.packages("data.table", dependencies = T)
install.packages("readxl", dependencies = T)
install.packages("ggplot2", dependencies = T)
install.packages("openxlsx", dependencies = T)
library(tidyverse)
library(data.table)
library(readxl)
library(ggplot2)
library(openxlsx)
datatable <- data.table(openxlsx::read.xlsx('https://doi.org/10.1371/journal.pone.0242866.s001')) %>%
tail(-1)
Later, I tried it in two steps: I read the same file without the tail(-1), and on a second line I wrote:
dt <- dt[-1,]
I also tried something I found on the internet:
names(dt) = NULL
but it gave me this problem:
Error in View : Internal error: length of names (0) is not length of dt (61)
Can someone tell me the proper way? (In the second and third attempts I first assigned dt = datatable, which is why the object name differs from the first one.)

You just need to use the startRow parameter:
datatable <- data.table::data.table(openxlsx::read.xlsx('https://doi.org/10.1371/journal.pone.0242866.s001', startRow = 2))
Which makes your column names:
> names(datatable)
[1] "Time.mark" "EXP1" "CAL1" "EXP2" "CAL2"
[6] "EXP3" "CAL3" "EXP4" "CAL4" "EXP5"
[11] "CAL5" "EXP6" "CAL6" "EXP7" "CAL7"
[16] "EXP8" "CAL8" "EXP9" "CAL9" "EXP10"
[21] "CAL10" "EXP11" "CAL11" "EXP12" "CAL12"
[26] "EXP13" "CAL13" "EXP14" "CAL14" "EXP15"
[31] "CAL15" "EXP16" "CAL16" "EXP17" "CALIDAD/PRECIO.RTES"
[36] "EXP18" "CAL18" "EXP19" "CAL19" "SAT1"
[41] "SAT2" "SAT3" "SAT4" "SAT5" "SAT6"
[46] "SAT7" "LEALTAD1" "LEALTAD2" "LEALTAD3" "LEALTAD4"
[51] "LEALTAD5" "MEDINA.AZAHARA" "MEZQUITA-CATEDRAL" "ALCAZAR.REYES.CRISTIANOS" "SINAGOGA"
[56] "FIESTA.PATIOS" "IGLESIAS.FERNANDINAS" "BARRIO.JUDERIA" "Sex" "age"
[61] "level.of.study"
Then you can use starts_with() to select columns:
datatable %>%
select(starts_with('EXP'))
EXP1 EXP2 EXP3 EXP4 EXP5 EXP6 EXP7 EXP8 EXP9 EXP10 EXP11 EXP12 EXP13 EXP14 EXP15 EXP16 EXP17 EXP18 EXP19
1: 5 6 5 4 7 5 6 6 5 6 5 5 6 6 6 6 6 6 6
2: 6 7 7 6 6 7 6 7 7 7 5 7 7 6 7 6 6 6 6
3: 5 7 5 6 6 6 5 4 6 7 4 5 6 5 5 5 5 6 4
4: 5 4 6 5 7 6 2 5 6 6 7 7 7 5 5 6 6 5 4
5: 5 4 6 5 7 6 7 5 6 6 7 7 7 5 5 6 6 6 3
---
258: 6 6 6 6 5 6 7 5 7 6 7 6 5 5 7 5 6 6 5
259: 6 6 7 6 7 6 7 7 6 6 7 6 7 7 6 7 6 7 6
260: 6 6 7 6 6 6 6 6 5 7 6 7 6 7 7 6 7 7 7
261: 6 7 6 7 7 7 5 5 6 6 6 7 6 7 6 7 6 6 6
262: 5 7 7 6 6 7 6 5 6 7 5 7 7 5 7 6 7 6 6
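Since the goal was to sum some of those columns, the same selection can be fed into a sum. The following is only a minimal sketch, assuming dplyr (1.0 or later) is installed. The toy data frame is invented for illustration (only the EXP/CAL naming follows the file), and whether the assignment wants column totals or a per-row total is an assumption, so both are shown:

```r
library(dplyr)

# Toy data standing in for the spreadsheet (values are invented)
toy <- data.frame(
  EXP1 = c(5, 6, 5),
  EXP2 = c(6, 7, 7),
  CAL1 = c(4, 5, 6)
)

# Column totals for every column whose name starts with "EXP"
col_totals <- toy %>%
  summarise(across(starts_with("EXP"), sum))

# Per-row total across the EXP columns, added as a new column
toy_with_total <- toy %>%
  mutate(EXP_total = rowSums(across(starts_with("EXP"))))
```

The same two calls should work on datatable once it has been read with startRow = 2.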

Related

remove duplicate words from rows in dataframe [duplicate]

I'm new here and I'm analyzing some data. While inspecting it, I found a problem in the strings of one column: as you can see, some strings contain duplicated words, and I would like to remove only the duplicates. Could you suggest a way to do it? There are about 30,000 rows, and only the ones with WT_d8_r2 show this problem. Thank you.
KO_d6_r1_AAACATGCACCTAATG-1 7
KO_d6_r1_AAACATGCAGGAATCG-1 8
KO_d6_r1_AAACATGCAGGATAAC-1 18
KO_d6_r1_AAACCAACAATATAGG-1 22
KO_d6_r1_AAACCGAAGCGAGTAA-1 8
WT_d8_r2_WT_d8_r2_AGGCTAAAGTCAATCA-1 20
WT_d8_r2_WT_d8_r2_AGGGCTACAATGAATG-1 3
WT_d8_r2_WT_d8_r2_AGGGCTACACACTAAT-1 3
WT_d8_r2_WT_d8_r2_AGGGCTACAGCTTACA-1 18
WT_d8_r2_WT_d8_r2_AGGGCTACATAGCTGC-1 9
WT_d8_r2_WT_d8_r2_AGGGTTGCAAAGCTCC-1 19
WT_d8_r2_WT_d8_r2_AGGGTTGCAACCCTAA-1 4
WT_d8_r2_WT_d8_r2_AGGGTTGCAGCTCAAC-1 2
I'm expecting this:
KO_d6_r1_AAACATGCACCTAATG-1 7
KO_d6_r1_AAACATGCAGGAATCG-1 8
KO_d6_r1_AAACATGCAGGATAAC-1 18
KO_d6_r1_AAACCAACAATATAGG-1 22
KO_d6_r1_AAACCGAAGCGAGTAA-1 8
WT_d8_r2_AGGCTAAAGTCAATCA-1 20
WT_d8_r2_AGGGCTACAATGAATG-1 3
WT_d8_r2_AGGGCTACACACTAAT-1 3
WT_d8_r2_AGGGCTACAGCTTACA-1 18
WT_d8_r2_AGGGCTACATAGCTGC-1 9
WT_d8_r2_AGGGTTGCAAAGCTCC-1 19
WT_d8_r2_AGGGTTGCAACCCTAA-1 4
WT_d8_r2_AGGGTTGCAGCTCAAC-1 2
With stringi::stri_split and duplicated:
data <- read.table(text='KO_d6_r1_AAACATGCACCTAATG-1 7
KO_d6_r1_AAACATGCAGGAATCG-1 8
KO_d6_r1_AAACATGCAGGATAAC-1 18
KO_d6_r1_AAACCAACAATATAGG-1 22
KO_d6_r1_AAACCGAAGCGAGTAA-1 8
WT_d8_r2_WT_d8_r2_AGGCTAAAGTCAATCA-1 20
WT_d8_r2_WT_d8_r2_AGGGCTACAATGAATG-1 3
WT_d8_r2_WT_d8_r2_AGGGCTACACACTAAT-1 3
WT_d8_r2_WT_d8_r2_AGGGCTACAGCTTACA-1 18
WT_d8_r2_WT_d8_r2_AGGGCTACATAGCTGC-1 9
WT_d8_r2_WT_d8_r2_AGGGTTGCAAAGCTCC-1 19
WT_d8_r2_WT_d8_r2_AGGGTTGCAACCCTAA-1 4
WT_d8_r2_WT_d8_r2_AGGGTTGCAGCTCAAC-1 2')
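# Split each ID on "_", drop repeated pieces, and paste the rest back together.
# vapply() returns a character vector, so V1 stays an ordinary column instead of becoming a list.
data$V1 <- vapply(stringi::stri_split(str = data$V1, fixed = "_"), function(x) paste0(x[!duplicated(x)], collapse = "_"), character(1))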
data
#> V1 V2
#> 1 KO_d6_r1_AAACATGCACCTAATG-1 7
#> 2 KO_d6_r1_AAACATGCAGGAATCG-1 8
#> 3 KO_d6_r1_AAACATGCAGGATAAC-1 18
#> 4 KO_d6_r1_AAACCAACAATATAGG-1 22
#> 5 KO_d6_r1_AAACCGAAGCGAGTAA-1 8
#> 6 WT_d8_r2_AGGCTAAAGTCAATCA-1 20
#> 7 WT_d8_r2_AGGGCTACAATGAATG-1 3
#> 8 WT_d8_r2_AGGGCTACACACTAAT-1 3
#> 9 WT_d8_r2_AGGGCTACAGCTTACA-1 18
#> 10 WT_d8_r2_AGGGCTACATAGCTGC-1 9
#> 11 WT_d8_r2_AGGGTTGCAAAGCTCC-1 19
#> 12 WT_d8_r2_AGGGTTGCAACCCTAA-1 4
#> 13 WT_d8_r2_AGGGTTGCAGCTCAAC-1 2
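Note that duplicated() drops any repeated piece wherever it occurs in the string. If only an immediately repeated prefix should be removed, a back-reference in base R is an alternative. This is a sketch, assuming the duplicate always sits at the very start of the ID, as in the sample rows:

```r
ids <- c("KO_d6_r1_AAACATGCACCTAATG-1",
         "WT_d8_r2_WT_d8_r2_AGGCTAAAGTCAATCA-1")

# ^(.+_)\\1 matches a leading chunk ending in "_" that is immediately
# followed by an identical copy; the replacement keeps a single copy.
cleaned <- sub("^(.+_)\\1", "\\1", ids, perl = TRUE)
cleaned
```

IDs without a repeated prefix do not match the pattern and are returned unchanged.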

Import data and use numeric as header

I'm attempting to replicate the output of cartoconsumer from the SensoMineR package. The example data set "hedo.cocktail" in the package appears to use numbers as column headers.
data(cocktail)
View(hedo.cocktail)
However, when I import dummy data with the same structure (1, 2, 3, ... as the header, denoting the consumer numbers), R automatically prefixes each header with an "X". A dataset with such "X" headers does not produce the output; instead, it throws an error:
Subscript out of bounds in Mat[rownames(MatH),]
My guess is that the data header does not match the one in the example data set.
Is it possible to import data with numeric as a header?
Here's a sample of my dummy data. Thank you for your suggestions.
senso.cake
color odor flavor size
25.000 45.000 25.000 78.000
26.000 56.000 49.000 45.000
27.000 54.000 85.000 45.000
28.000 52.000 98.000 58.000
30.000 58.000 56.000 96.000
31.000 56.000 32.000 96.000
32.000 58.000 56.000 93.000
36.000 59.000 45.000 90.000
hedo.cake
1 2 3 4 5 6 7 8 9 10
9 7 7 4 8 9 6 7 8 7
6 4 4 2 4 7 8 7 7 7
7 6 8 7 7 6 7 7 7 6
8 8 6 4 7 8 6 8 6 8
4 5 7 3 8 6 6 8 6 7
7 6 7 3 7 6 6 7 7 8
8 6 8 6 7 7 8 7 4 9
6 3 5 6 4 4 6 7 2 7
You can turn your data frame into a matrix.
mat <- as.matrix(your_df)
mat <- matrix(data = mat, dimnames = NULL, ncol = ncol(your_df))
After that you can simply rbind() whatever header you want onto it as the first row.
mat <- rbind(c(1:ncol(mat)), mat)
For example:
a <- c(5:8)
b <- c(4:7)
c <- c(3:6)
mat <- matrix(c(a,b,c), ncol = 3)
mat <- rbind(c(1:ncol(mat)), mat)
mat
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 5 4 3
[3,] 6 5 4
[4,] 7 6 5
[5,] 8 7 6
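If the aim is to keep the numbers as actual column names at import time, the "X" prefix comes from the check.names argument of read.table() and read.csv(), which defaults to TRUE. A minimal sketch using the first rows of hedo.cake from the question (whether cartoconsumer then accepts the data is not verified here):

```r
hedo_text <- "1 2 3 4 5 6 7 8 9 10
9 7 7 4 8 9 6 7 8 7
6 4 4 2 4 7 8 7 7 7
7 6 8 7 7 6 7 7 7 6"

# Default behaviour: names are made syntactically valid, hence the "X" prefix
with_prefix <- read.table(text = hedo_text, header = TRUE)
names(with_prefix)

# check.names = FALSE keeps the header exactly as written in the file
hedo.cake <- read.table(text = hedo_text, header = TRUE, check.names = FALSE)
names(hedo.cake)
```

Columns with numeric names have to be accessed with backticks or by string, for example hedo.cake$`1` or hedo.cake[["1"]].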

aov won't return pvalues in R

I have a strange problem with anova summary results summary(aov).
So here is the problem. I have a dataset with 6 columns. Here is the dataset sample:
Panelist Prod.ID Overall Appearance Flavor Texture
1 196 9 9 9 9
1 239 7 9 6 7
1 354 9 8 8 7
1 427 3 8 2 3
1 577 8 9 7 9
1 638 7 9 7 8
1 772 6 4 3 3
1 852 9 8 9 8
2 196 8 8 7 8
2 239 7 7 7 7
2 354 6 5 6 4
2 427 6 7 4 6
2 577 3 6 3 5
2 638 4 4 5 2
2 772 6 2 6 7
2 852 7 6 7 6
3 196 7 9 7 8
3 239 8 9 8 8
3 354 8 8 7 8
3 427 7 8 6 8
3 577 8 9 8 8
3 638 8 9 8 7
3 772 5 8 8 8
3 852 8 9 8 8
The first two columns are the factors and the rest are the response variables. summary() treated Panelist and Prod.ID as continuous variables, so I converted them to factors with as.factor().
After that conversion I ran the ANOVA with the model Overall ~ Panelist * Prod.ID, but the summary showed only this:
> summary(aov(Overall ~ Prod.ID * Panelist, data = paneElements))
Df Sum Sq Mean Sq
Prod.ID 7 189.6 27.085
Panelist 160 1252.9 7.830
Prod.ID:Panelist 1116 3116.1 2.792
I can't find any cause that makes the F-test values and P-values disappear.
Any help will be very appreciated.
Thanks a lot.
You have only one observation for each combination of Prod.ID and Panelist (at least in your sample data), so the model with the interaction has as many parameters as there are observations. It fits the data exactly and leaves zero residual degrees of freedom, so there is no residual mean square to divide by in the F-test. That is why no F or p-values are reported.
For example, when I add an extra observation for Prod.ID 196 for just one level of Panelist, I get F and p values reported in the output.
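The effect can be reproduced with the Overall scores of the three panelists shown in the question. This is a sketch; the additive model at the end is shown only as one common option when there are no replicates, and whether it suits this panel study is a separate question:

```r
pane <- data.frame(
  Panelist = factor(rep(1:3, each = 8)),
  Prod.ID  = factor(rep(c(196, 239, 354, 427, 577, 638, 772, 852), times = 3)),
  Overall  = c(9, 7, 9, 3, 8, 7, 6, 9,
               8, 7, 6, 6, 3, 4, 6, 7,
               7, 8, 8, 7, 8, 8, 5, 8)
)

# One observation per Prod.ID x Panelist cell: the interaction model is saturated
full <- aov(Overall ~ Prod.ID * Panelist, data = pane)
df.residual(full)      # no residual degrees of freedom are left
summary(full)          # Df, Sum Sq and Mean Sq only

# Without the interaction term, the leftover variation serves as the error term
additive <- aov(Overall ~ Prod.ID + Panelist, data = pane)
df.residual(additive)
summary(additive)      # now includes F value and Pr(>F)
```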

How to merge data correctly

I'm trying to merge 7 complete data frames into one large, wide data frame. I figured I have to do this stepwise: merge 2 frames into 1, then merge that frame with the next, and so forth until all 7 original frames become one.
fil2005: "ID" "abr_2005" "lop_2005" "ins_2005"
fil2006: "ID" "abr_2006" "lop_2006" "ins_2006"
The variables "abr_2005" "lop_2005" "ins_2005" and their 2006 counterparts all take the value 0 or 1.
Now, I want to either merge or do a dcast of some sort (I think) to turn these two long data frames into one wide data frame where both "abr_2005" "lop_2005" "ins_2005" and "abr_2006" "lop_2006" "ins_2006" are in the final file.
When I try
fil_2006.1 <- merge(x=fil_2005, y=fil_2006, by="ID__", all.y=T)
all the variables ending in _2005 are saved to fil_2006.1, but the variables ending in _2006 are not.
I'm apparently doing something wrong. Any idea?
Is there a reason you put those underscores after ID (by="ID__")? Without them, the code you provided will work.
An example:
dat1 <- data.frame("ID"=seq(1,20,by=2),"varx2005"=1:10, "vary2005"=2:11)
dat2 <- data.frame("ID"=5:14,"varx2006"=1:20, "vary2006"=21:40)
# create data frames of differing lengths
head(dat1)
ID varx2005 vary2005
1 1 1 2
2 3 2 3
3 5 3 4
4 7 4 5
5 9 5 6
6 11 6 7
head(dat2)
ID varx2006 vary2006
1 5 1 21
2 6 2 22
3 7 3 23
4 8 4 24
5 9 5 25
6 10 6 26
merged <- merge(dat1,dat2,by="ID",all=T)
head(merged)
ID varx2006 vary2006 varx2005 vary2005
1 1 NA NA 1 2
2 3 NA NA 2 3
3 5 1 21 3 4
4 5 11 31 3 4
5 7 13 33 4 5
6 7 3 23 4 5
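For the stepwise merge of all seven frames, Reduce() can apply the same two-frame merge down a list. A minimal sketch with three toy frames: the column names follow the question, while the values and the 2007 frame are invented for illustration:

```r
fil2005 <- data.frame(ID = 1:3, abr_2005 = c(0, 1, 0), lop_2005 = c(1, 1, 0), ins_2005 = c(0, 0, 1))
fil2006 <- data.frame(ID = 2:4, abr_2006 = c(1, 0, 0), lop_2006 = c(0, 1, 1), ins_2006 = c(1, 0, 1))
fil2007 <- data.frame(ID = c(1L, 4L), abr_2007 = c(0, 1), lop_2007 = c(1, 0), ins_2007 = c(0, 0))

# Merge the first two frames, then merge the result with the next one, and so on.
# all = TRUE keeps IDs that are missing from some years (filled with NA).
wide <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE),
               list(fil2005, fil2006, fil2007))
wide
```

With seven frames, only the list passed to Reduce() grows; the merging function stays the same.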

Get id/name of rpart model nodes

How can I get the ID (or name) of the terminal node of an rpart model for every row? predict.rpart can return only the predicted class (number or factor), the class probabilities, or some combination (using type="matrix") for a classification tree.
I would like to do something like:
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
plot(fit) # there are 5 terminal nodes
predict(fit, type = "node_id") # should return IDs of terminal nodes (e.g. 1-5) (does not work)
The partykit package supports predict(..., type = "node"), both in and out of sample. You can simply convert the rpart object to use this:
library("partykit")
predict(as.party(fit), type = "node")
## 9 7 9 9 3 3 3 3 3 8 8 3 9 5 3 3 3 7 3 5 3 9 8 9 9 5 9 8 3 3 3 7 7 3 7 3 5 9 5 8
## 9 7 9 9 3 3 3 3 3 8 8 3 9 5 3 3 3 7 3 5 3 9 8 9 9 5 9 8 3 3 3 7 7 3 7 3 5 9 5 8
## 9 5 9 9 3 7 3 7 9 7 8 3 9 3 3 3 5 9 5 8 9 9 9 3 3 5 3 7 5 3 7 7 3 7 3 3 7 5 7 9
## 9 5 9 9 3 7 3 7 9 7 8 3 9 3 3 3 5 9 5 8 9 9 9 3 3 5 3 7 5 3 7 7 3 7 3 3 7 5 7 9
## 5
## 5
table(predict(as.party(fit), type = "node"))
## 3 5 7 8 9
## 29 12 14 7 19
For that model there were 4 splits, yielding 5 "terminal nodes" or, in rpart's terminology, <leaf>s. I do not see why there should be 5 predictions for anything: predictions are made for particular cases, and each leaf is the result of a variable number of splits used to make those predictions. The number of rows of the original dataset that ended up in each leaf may be what you want, in which case these are ways of getting those numbers:
# For each row of the data, the row of fit$frame (the leaf) it ended up in
fit$where
# counts of cases in leaves of prediction rules
table(fit$where)
3 5 7 8 9
29 12 14 7 19
In order to assemble the labels(fit) that apply to a particular leaf, you would need to traverse the rule-tree and accumulate all the labels for all the splits that were applied to produce a particular leaf. You probably want to look at:
?print.rpart
?rpart.object
?text.rpart
?labels.rpart
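The traversal described above does not have to be written by hand: rpart ships path.rpart(), which collects the split labels on the way from the root to a given node. A sketch, assuming the same kyphosis fit as in the question:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Node numbers of the terminal nodes are the row names of the <leaf> rows
leaf_nodes <- as.numeric(rownames(fit$frame)[fit$frame$var == "<leaf>"])

# One character vector of split labels per leaf; print.it = FALSE suppresses printing
rules <- path.rpart(fit, nodes = leaf_nodes, print.it = FALSE)
rules[[1]]
```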
The $where method above returns only the row number in the tree frame (fit$frame), not the node number. So with kyphosis$ID = fit$where, observations are labelled with frame row indices rather than actual leaf node IDs.
To get the actual leaf node ID use the following:
MyID <- row.names(fit$frame)
kyphosis$ID <- MyID[fit$where]
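The difference between the two is easy to inspect side by side. A small sketch, again assuming the kyphosis fit from the question:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

frame_row <- fit$where                                      # row index into fit$frame
node_id   <- as.integer(rownames(fit$frame)[fit$where])     # rpart node number

# Node numbers of all terminal nodes, for comparison
leaf_ids <- as.integer(rownames(fit$frame)[fit$frame$var == "<leaf>"])

head(data.frame(frame_row, node_id))
```

Every value in node_id is one of leaf_ids, while frame_row only says where that leaf sits in fit$frame.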
To predict leaves for new data, one could use rpart.predict(fit, newdata, nn = TRUE) from the rpart.plot package, which adds node numbers to the output.
Here is an isolated rpart leaf predictor:
rpart_leaves <- function(fit, newdata, type = c("where", "leaf"), na.action = na.pass) {
  # Build a model frame from newdata unless it already is one
  if (is.null(attr(newdata, "terms"))) {
    Terms <- delete.response(fit$terms)
    newdata <- model.frame(Terms, newdata, na.action = na.action,
                           xlev = attr(fit, "xlevels"))
    if (!is.null(cl <- attr(Terms, "dataClasses")))
      .checkMFClasses(cl, newdata, TRUE)
  }
  # rpart.matrix() and pred.rpart() are unexported rpart internals, hence the :::
  newdata <- rpart:::rpart.matrix(newdata)
  where <- unname(rpart:::pred.rpart(fit, newdata))
  # "where": row index into fit$frame; "leaf": node number taken from the row names
  if (match.arg(type) == "where")
    return(where)
  rownames(fit$frame)[where]
}
