R Appending Columns to Dataset Misnamed - r

Edit: Clarity
When I append a new column to an existing data.frame, the column titles are incorrect. In summary.myData, the last two columns, both printed as "Measure", should say "plus" and "minus" respectively.
This is tied in with another question I had, where I ask about how to correctly reference a column in a Tk/R GUI I am working on.
Parent Question
myData:
Group Subgroup Measure
1 A 1 0.234213
2 A 1 0.046248
3 A 1 0.391376
4 A 2 0.911849
5 A 2 0.729955
6 A 2 0.991110
7 A 2 0.378422
8 A 3 0.898037
9 A 3 0.258884
10 A 3 NA
11 A 3 0.057631
12 A 3 0.745202
13 A 3 0.121376
14 B 1 0.385198
15 B 1 0.484399
16 B 1 0.115034
17 B 1 0.073629
18 B 1 0.456150
19 B 2 0.336108
20 B 2 0.845458
21 B 2 0.267494
22 B 3 0.536123
23 B 3 1.331731
24 B 3 0.505114
25 B 3 0.843348
26 B 3 0.827932
27 B 3 0.813351
28 C 1 0.095587
29 C 1 0.158822
30 C 1 0.392376
31 C 1 0.284625
32 C 2 0.898819
33 C 2 0.743428
34 C 2 0.298989
35 C 2 0.423961
36 C 3 0.868351
37 C 3 0.181547
38 C 3 1.146131
39 C 3 0.234941
Append script:
summary.myData<-summarySE(myData, measurevar=paste(tx.choice1), groupvars=paste(tx.choice2),conf.interval=0.95,na.rm=TRUE,.drop=FALSE)
summary.myData$plus<-summary.myData[3]-summary.myData[6]
summary.myData$minus<-summary.myData[3]+summary.myData[6]
Result:
Group N Measure sd se ci Measure Measure
1 A 12 0.4803586 0.3539277 0.10217014 0.2248750 0.2554836 0.7052335
2 B 14 0.5586478 0.3412835 0.09121184 0.1970512 0.3615966 0.7556990
3 C 12 0.4772981 0.3465511 0.10004069 0.2201881 0.2571100 0.6974862

The problem you're running into is that you've assigned $plus and $minus to data.frames, rather than atomic vectors. So when printing, R is showing the column name in the embedded data.frame ('Measure' in both cases), rather than the name of the list component ('plus' and 'minus').
str(summary.myData);
## 'data.frame': 3 obs. of 8 variables:
## $ Group : Factor w/ 3 levels "A","B","C": 1 2 3
## $ N : num 12 14 12
## $ Measure: num 0.48 0.559 0.477
## $ sd : num 0.354 0.341 0.347
## $ se : num 0.1022 0.0912 0.1
## $ ci : num 0.225 0.197 0.22
## $ plus :'data.frame': 3 obs. of 1 variable:
## ..$ Measure: num 0.255 0.362 0.257
## $ minus :'data.frame': 3 obs. of 1 variable:
## ..$ Measure: num 0.705 0.756 0.697
summary.myData;
## Group N Measure sd se ci Measure Measure
## 1 A 12 0.4803586 0.3539277 0.10217014 0.2248750 0.2554836 0.7052335
## 2 B 14 0.5586478 0.3412835 0.09121184 0.1970512 0.3615966 0.7556990
## 3 C 12 0.4772981 0.3465511 0.10004069 0.2201881 0.2571100 0.6974862
Replace the assignments with
summary.myData$plus <- summary.myData[,3]-summary.myData[,6];
summary.myData$minus <- summary.myData[,3]+summary.myData[,6];
Then you get:
str(summary.myData);
## 'data.frame': 3 obs. of 8 variables:
## $ Group : Factor w/ 3 levels "A","B","C": 1 2 3
## $ N : num 12 14 12
## $ Measure: num 0.48 0.559 0.477
## $ sd : num 0.354 0.341 0.347
## $ se : num 0.1022 0.0912 0.1
## $ ci : num 0.225 0.197 0.22
## $ plus : num 0.255 0.362 0.257
## $ minus : num 0.705 0.756 0.697
summary.myData;
## Group N Measure sd se ci plus minus
## 1 A 12 0.4803586 0.3539277 0.10217014 0.2248750 0.2554836 0.7052335
## 2 B 14 0.5586478 0.3412835 0.09121184 0.1970512 0.3615966 0.7556990
## 3 C 12 0.4772981 0.3465511 0.10004069 0.2201881 0.2571100 0.6974862
The key here is the different indexing style. When you use 1D indexing, you're actually treating the data.frame as a list (which it is internally), and so the index operation returns the specified list components, still classed as a data.frame. When you use 2D indexing, you index the rows and columns separately, which allows you to extract a 2D "subtable" of the data.frame. But when you only specify one column, the default behavior (drop=TRUE, often abbreviated to drop=T) is for the column to be returned as an atomic vector, rather than as a one-column data.frame. You can change this with drop=FALSE (drop=F).
summary.myData[3];
## Measure
## 1 0.4803586
## 2 0.5586478
## 3 0.4772981
summary.myData[,3];
## [1] 0.4803586 0.5586478 0.4772981
summary.myData[,3,drop=F];
## Measure
## 1 0.4803586
## 2 0.5586478
## 3 0.4772981
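As a small aside to the indexing explanation above, `[[` and `$` also return the column as an atomic vector, and they select by name rather than by position. This is only a sketch on a made-up data.frame (the values and the `lower`/`upper` column names are assumptions for illustration, not the output of summarySE):

```r
# Toy stand-in for summary.myData (values are made up)
df <- data.frame(Measure = c(0.48, 0.56, 0.48), ci = c(0.22, 0.20, 0.22))

class(df["Measure"])    # 1D list-style indexing keeps the data.frame class
class(df[["Measure"]])  # [[ extracts the column itself as a numeric vector

# Both assignments below store plain numeric vectors,
# so the new columns print under their own names
df$lower <- df[["Measure"]] - df[["ci"]]
df$upper <- df$Measure + df$ci
names(df)
```

Selecting by name also keeps the code working if the column order of the summary ever changes.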

Related

Error in model.frame.default, variable lengths differ

I ran a glmer and got the following error message: "Error in model.frame.default(data = data.density.EM.gra, weights = number_of_nest.boxes, : variable lengths differ (found for 'year')". I don't understand what this means, despite reading a number of different posts regarding the same error.
here is my model:
model.1.EM.gra<-glmer(cbind(data.density$number.nest.boxes.occupied.that.year,data.density$number_of_nest.boxes)~ caterpillar.sc +(1|year),data = data.density.EM.gra,weights = number_of_nest.boxes,family = binomial)
I appreciate any suggestions you may have.
setwd("~/Word/UQAM/Master's_Reale/DATA/Blue tits data and instructions/csv") # work station
install.packages("dplyr")
#calling libraries.
library(dplyr)
library (reprex)
library(lme4)
data.density<-read.csv ("nest_box_caterpillar_density.csv")
data.density$year<-factor (data.density$year)# making year a factor (categorical variable)
str(data.density) # now we see year as a factor in the data.
#> 'data.frame': 63 obs. of 16 variables:
#> $ year : Factor w/ 9 levels "2011","2012",..: 1 2 3 4 5 6 7 8 9 1 ...
#> $ number.nest.boxes.occupied.that.year: int 17 13 12 16 16 16 15 17 12 17 ...
#> $ number_of_nest.boxes : int 20 20 20 20 20 20 20 20 20 30 ...
#> $ failure : int 3 3 3 3 3 3 3 3 3 13 ...
#> $ proportion_occupied_boxes : num 0.85 0.65 0.6 0.8 0.8 0.8 0.75 0.85 0.6 0.57 ...
#> $ site : Factor w/ 7 levels "ari","ava","fel",..: 5 5 5 5 5 5 5 5 5 1 ...
#> $ population : Factor w/ 3 levels "D-Muro","E-Muro",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ mean_yearly_frass : num 295 231 437 263 426 ...
#> $ site_ID : Factor w/ 63 levels "2011_ari_","2011_ava_",..: 5 12 19 26 33 40 47 54 61 1 ...
#> $ exploration_avg : num 13.28 14.19 9.85 9.42 8.67 ...
#> $ X : logi NA NA NA NA NA NA ...
#> $ X.1 : logi NA NA NA NA NA NA ...
#> $ X.2 : Factor w/ 2 levels "","failure means the total number of nest boxes -the number of nest boxes occupied. ": 1 1 1 1 1 1 1 1 1 2 ...
#> $ X.3 : logi NA NA NA NA NA NA ...
#> $ X.4 : logi NA NA NA NA NA NA ...
#> $ X.5 : Factor w/ 5 levels "","1 column with number of nest boxes used. ",..: 1 1 4 3 1 2 5 1 1 1 ...
#making new objects
density<-data.density$proportion_occupied_boxes # making a new object called density
caterpillar<-data.density$mean_yearly_frass # making new object called caterpillar
caterpillar.sc<-scale(caterpillar)
data.density.EM<-filter(data.density,population=='E-Muro') # data for population 'E-Muro'
data.density.EM.gra<-filter(data.density.EM,site=='gra') # data for site gra in the E-Muro population.
View(data.density.EM.gra)
model.1.EM.gra<-glmer(cbind(data.density$number.nest.boxes.occupied.that.year,data.density$number_of_nest.boxes)~ caterpillar.sc +(1|year),
data = data.density.EM.gra,
weights = number_of_nest.boxes,
family = binomial)
#> Error in model.frame.default(data = data.density.EM.gra, weights = number_of_nest.boxes, : variable lengths differ (found for 'year')
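One plausible reading of this error, offered as a guess rather than a confirmed diagnosis: the formula mixes vectors taken from the full data set (data.density$... and caterpillar.sc, 63 values each) with year, which is looked up in the filtered data passed via data=. The sketch below reproduces the same kind of mismatch with lm() from base R on made-up data; the names full, sub and frass.sc are hypothetical stand-ins.

```r
# Toy data standing in for the real data set (hypothetical values)
full <- data.frame(
  site     = rep(c("gra", "ari"), each = 5),
  year     = factor(rep(2011:2015, times = 2)),
  occupied = c(17, 13, 12, 16, 16, 15, 17, 12, 17, 14),
  frass    = c(295, 231, 437, 263, 426, 310, 280, 350, 400, 270)
)

frass.sc <- as.numeric(scale(full$frass))  # workspace vector, 10 values
sub <- full[full$site == "gra", ]          # filtered data, 5 rows

# 'occupied' and 'year' are found in 'sub' (5 values each), but 'frass.sc'
# is found in the workspace (10 values), so model.frame() stops
msg <- tryCatch(
  lm(occupied ~ frass.sc + year, data = sub),
  error = function(e) conditionMessage(e)
)
msg

# Possible fix: compute every variable inside the data frame given to 'data'
sub$frass.sc <- as.numeric(scale(sub$frass))
fit <- lm(occupied ~ frass.sc, data = sub)
nobs(fit)
```

If that reading is right, referring to the columns by bare name in the glmer formula and creating the scaled caterpillar variable as a column of data.density.EM.gra would keep all lengths consistent.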

Importing CSV of arrays as list

I'm trying to do the following:
I have a .csv file with N rows and 2 columns that I need to import and convert to a list.
Example file from .csv:
[Image: first seven rows of data]
I import with command: points <- read.csv("points.csv")
'data.frame': 42 obs. of 2 variables:
$ Firefly : int 0 1 0 1 0 1 0 1 0 1 ...
$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073
I need it as a sorted "List of 2" (one for each Firefly) with the following structure:
> str(points)
List of 2
$ : num [1:33] 0.79 0.87 0.88 0.89 0.94 1.01 1.13 1.19 ...
$ : num [1:14] 0.00 0.10 0.56 0.67 1.27 1.31 1.37 1.42 ...
where the first element represents Firefly == 0 and the second element represents Firefly == 1.
I attempt the following:
fy0 <- subset(points,Firefly == 0)
fy1 <- subset(points,Firefly == 1)
points.list <- list(fy0,fy1)
> str(points.list)
List of 2
$ :'data.frame': 21 obs. of 2 variables:
..$ Firefly : int [1:21] 0 0 0 0 0 0 0 0 0 0 ...
..$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073 0.20941455 0.40515277 0.47026309\n 0.55714817 0.64789982 0.70749241 "| __truncated__,..: 30 29 28 31 39 40 33 37 25 24 ...
$ :'data.frame': 21 obs. of 2 variables:
..$ Firefly : int [1:21] 1 1 1 1 1 1 1 1 1 1 ...
..$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073 0.20941455 0.40515277 0.47026309\n 0.55714817 0.64789982 0.70749241 "| __truncated__,..: 26 32 21 23 20 41 34 22 27 36 ...
I think I need an as.numeric(fy0$Hawkes_times) somewhere, but I want to avoid loops since I will have hundreds of rows and n Firefly values (fy0, fy1, fy2, ... fyn).
Thank you!
-Richard
points <- data.frame(firefly=rep(0:1, times=10), times=1:20)
split(points$times, points$firefly)
# $`0`
# [1] 1 3 5 7 9 11 13 15 17 19
# $`1`
# [1] 2 4 6 8 10 12 14 16 18 20
This does not rely on equally-sized groups:
set.seed(42)
points <- data.frame(firefly=sample(0:1, size=20, replace=TRUE), times=1:20)
split(points$times, points$firefly)
# $`0`
# [1] 3 8 11 14 15 18 19
# $`1`
# [1] 1 2 4 5 6 7 9 10 12 13 16 17 20
As you can see, the order within each group is preserved.
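The split() approach above works on a column that is already numeric. In the question, each Hawkes_times cell appears to hold a whole bracketed array as text, so a parsing step is probably needed first. The following is a minimal sketch under that assumption; the sample strings and the helper name parse_times are made up for illustration.

```r
# Made-up sample mimicking the structure in the question:
# each cell holds a bracketed, whitespace-separated array as text
points <- data.frame(
  Firefly = c(0, 1, 0, 1),
  Hawkes_times = c("[ 0.79 0.87\n 0.88 ]", "[ 0.00 0.10 ]",
                   "[ 0.89 0.94 ]", "[ 0.56 ]"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip the brackets, then read the numbers
parse_times <- function(s) {
  s <- gsub("\\[|\\]", " ", s)
  scan(text = s, what = numeric(), quiet = TRUE)
}

# Group the text cells by Firefly, parse each cell, and sort within each group
by_firefly <- split(as.character(points$Hawkes_times), points$Firefly)
points.list <- lapply(by_firefly, function(cells) {
  sort(unlist(lapply(cells, parse_times)))
})
str(points.list)
```

The as.character() call matters when the column was read as a factor, since otherwise the integer codes would be used instead of the text.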

Import from Web CSV to the data frame?

I've got a CSV file from the link Hearthstone Arena Card Pickup probability
It's just a list of vectors now, and I want to convert it into a 9-column data frame.
My current code is as follows but it's not working at all.
hsd <- read.csv("hearthstonedraw.csv", header = TRUE)
hsd1 <- as.data.frame(hsd,ncol = 9)
hsd1
Credit goes to Maurits Evers and Adam Sampson.
read.csv can read directly from the address you indicate. It works out the number of columns automatically and, in R versions before 4.0.0, converts character columns into factors by default (from R 4.0.0 onward, pass stringsAsFactors = TRUE to get the factor columns shown below).
hsd1 <- read.csv("https://bnetcmsus-a.akamaihd.net/cms/gallery/LN4X4GN4W59R1532566073433.csv", header = TRUE)
str(hsd1)
# 'data.frame': 3931 obs. of 9 variables:
# $ Draft.Class : Factor w/ 9 levels "Druid","Hunter",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Card.Name : Factor w/ 995 levels "Abominable Bowman",..: 716 813 646 500 263 964 549 186 509 984 ...
# $ Rarity : Factor w/ 5 levels "basic","common",..: 1 1 2 2 2 1 1 2 5 2 ...
# $ Type : Factor w/ 3 levels "Minion","Spell",..: 2 2 2 2 1 2 2 1 2 2 ...
# $ Card.Class : Factor w/ 10 levels "druid","hunter",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Average : num 1.47 1.45 1.44 1.17 1.03 ...
# $ P.1.or.more.: num 0.78 0.776 0.774 0.696 0.649 ...
# $ P.2.or.more.: num 0.436 0.431 0.428 0.327 0.273 ...
# $ P.3.or.more.: num 0.1784 0.1757 0.1724 0.1081 0.0819 ...
ncol(hsd1)
# [1] 9
# There are 9 columns in the data frame
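To make the factor conversion independent of the R version, stringsAsFactors can be set explicitly. The sketch below uses a tiny inline CSV with made-up rows (not taken from the real file) so it runs without a network connection:

```r
# Inline stand-in for the downloaded file; the rows are made up
csv_text <- "Draft.Class,Card.Name,Average
Druid,Card A,1.47
Hunter,Card B,1.45"

# Explicit stringsAsFactors gives the same column types on old and new R versions
hsd <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
str(hsd)
ncol(hsd)
```

The same arguments apply when the first argument is a URL or a file path instead of text=.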

Merge and fill different length data in R

I'm using R and need to merge data sets of different lengths.
Given the following data sets:
> means2012
# A tibble: 232 x 2
exporter eci
<fct> <dbl>
1 ABW 0.235
2 AFG -0.850
3 AGO -1.40
4 AIA 1.34
5 ALB -0.480
6 AND 1.22
7 ANS 0.662
8 ARE 0.289
9 ARG 0.176
10 ARM 0.490
# ... with 222 more rows
> means2013
# A tibble: 234 x 2
exporter eci
<fct> <dbl>
1 ABW 0.534
2 AFG -0.834
3 AGO -1.26
4 AIA 1.47
5 ALB -0.498
6 AND 1.13
7 ANS 0.616
8 ARE 0.267
9 ARG 0.127
10 ARM 0.0616
# ... with 224 more rows
> str(means2012)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 232 obs. of 2 variables:
$ exporter: Factor w/ 242 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 9 10 11 ...
$ eci : num 0.235 -0.85 -1.404 1.337 -0.48 ...
> str(means2013)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 234 obs. of 2 variables:
$ exporter: Factor w/ 242 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 9 10 11 ...
$ eci : num 0.534 -0.834 -1.263 1.471 -0.498 ...
Note that the two tibbles have different lengths. The "exporter" values are countries.
Is there any way to merge both tibbles by matching on the factor (exporter) and filling the missing values with NA?
It doesn't matter whether the result is a tibble, a data frame, or some other kind of object.
Like this:
tibble 1
a 5
b 10
c 15
d 25
tibble 2
a 7
c 23
d 20
merged one:
a 5 7
b 10 na
c 15 23
d 25 20
Use merge with the parameter all set to TRUE:
tibble1 <- read.table(text="
x y
a 5
b 10
c 15
d 25",header=TRUE,stringsAsFactors=FALSE)
tibble2 <- read.table(text="
x z
a 7
c 23
d 20",header=TRUE,stringsAsFactors=FALSE)
merge(tibble1,tibble2,all=TRUE)
x y z
1 a 5 7
2 b 10 NA
3 c 15 23
4 d 25 20
Or use dplyr::full_join(tibble1,tibble2) for the same effect.
You could rename the columns before joining them, and get NA where the other value is missing.
library(tidyverse)
means2012 %>%
  rename(eci2012 = eci) %>%
  full_join(means2013 %>%
              rename(eci2013 = eci),
            by = "exporter")
But a tidier approach would be to add a year column, keep the column eci as is, and just bind the rows together.
means2012 %>%
  mutate(year = 2012) %>%
  bind_rows(means2013 %>%
              mutate(year = 2013))
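For comparison, the same wide result can be sketched in base R alone: merge() can disambiguate the two eci columns through its suffixes argument, so no renaming is needed beforehand. The small data frames below reuse the toy values from the question and stand in for the real tibbles.

```r
# Toy stand-ins for means2012 and means2013 (values from the example in the question)
means2012 <- data.frame(exporter = c("a", "b", "c", "d"), eci = c(5, 10, 15, 25))
means2013 <- data.frame(exporter = c("a", "c", "d"), eci = c(7, 23, 20))

# all = TRUE keeps exporters present in either table;
# suffixes are appended to the clashing 'eci' column names
merged <- merge(means2012, means2013, by = "exporter",
                all = TRUE, suffixes = c("_2012", "_2013"))
merged
```

Exporters missing from one year end up with NA in that year's column.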

R - predict command error "undefined columns selected"

I’m a newbie to R, and I’m having trouble with an R predict command.
I receive this error
Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) :
undefined columns selected
when I execute this command:
model.predict <- predict.boosting(model,newdata=test)
Here is my model:
model <- boosting(Y~x1+x2+x3+x4+x5+x6+x7, data=train)
And here is the structure of my test data:
str(test)
'data.frame': 343 obs. of 7 variables:
$ x1: Factor w/ 4 levels "Americas","Asia_Pac",..: 4 2 4 2 4 3 3 3 4 1 ...
$ x2: Factor w/ 5 levels "Fifth","First",..: 3 3 2 2 4 2 4 4 1 1 ...
$ x3: Factor w/ 3 levels "Best","Better",..: 2 3 1 1 3 2 2 1 3 3 ...
$ x4: Factor w/ 2 levels "Female","Male": 1 1 2 1 1 2 1 2 2 2 ...
$ x5: int 82 55 47 31 6 53 77 68 76 86 ...
$ x6: num 22.8 14.6 25.5 38.3 7.9 32.8 4.6 34.2 36.7 21.7 ...
$ x7: num 0.679 0.925 0.897 0.684 0.195 ...
And the structure of my training data:
$ RecordID: int 1 2 3 4 5 6 7 8 9 10 ...
$ x1 : Factor w/ 4 levels "Americas","Asia_Pac",..: 1 2 2 3 1 1 1 2 2 4 ...
$ x2 : Factor w/ 5 levels "Fifth","First",..: 5 5 3 2 5 5 5 4 3 2 ...
$ x3 : Factor w/ 3 levels "Best","Better",..: 2 3 2 2 3 1 2 3 1 1 ...
$ x4 : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 2 1 1 ...
$ x5 : int 1 67 75 51 84 33 21 80 48 5 ...
$ x6 : num 21 13.8 30.3 11.9 1.7 13.2 33.9 17 3.4 19.5 ...
$ x7 : num 0.35 0.85 0.73 0.39 0.47 0.13 0.2 0.12 0.64 0.11 ...
$ Y : Factor w/ 2 levels "Green","Yellow": 2 2 1 2 2 2 1 2 2 2 ..
I think there’s a problem with the structure of the test data, but I can’t find it, or else I have a misunderstanding of how the “predict” command works. Note that if I run the predict command on the training data, it works. Any suggestions as to where to look?
Thanks!
predict.boosting() expects to be given the actual labels for the test data, so it can calculate how well it did (as in the confusion matrix shown below).
library(adabag)
data(iris)
iris.adaboost <- boosting(Species~Sepal.Length+Sepal.Width+Petal.Length+
Petal.Width, data=iris, boos=TRUE, mfinal=10)
# make a 'test' dataframe without the classes, as in the question
iris2 <- iris
iris2$Species <- NULL
# replicates the error
irispred=predict.boosting(iris.adaboost, newdata=iris2)
#Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) :
# undefined columns selected
Here's a working example, drawn largely from the help file (it also demonstrates the confusion matrix).
# first create subsets of iris data for training and testing
sub <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
iris3 <- iris[sub,]
iris4 <- iris[-sub,]
iris.adaboost <- boosting(Species ~ ., data=iris3, mfinal=10)
# works
iris.predboosting<- predict.boosting(iris.adaboost, newdata=iris4)
iris.predboosting$confusion
# Observed Class
#Predicted Class setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
Another suggestion: this error may also show up when your y is a factor; in that case, try as.vector(y) ~ . in the formula.
The column names of the data that you use to predict should be exactly the same as the column names of the training data.
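A quick way to check the points above before calling predict is to compare the variables named in the model formula with the columns of the test data. The sketch below uses base R only, with made-up train and test data frames. Adding a placeholder Y column is one possible workaround, but whether predict.boosting accepts it is an assumption here, and any confusion matrix or error rate computed from placeholder labels would be meaningless.

```r
# Made-up stand-ins for the training and test data in the question
train <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(0.1, 0.2, 0.3, 0.4),
                    Y = factor(c("Green", "Yellow", "Green", "Yellow")))
test <- data.frame(x1 = c(5, 6), x2 = c(0.5, 0.6))

model_formula <- Y ~ x1 + x2

# Which variables used by the formula are absent from the test data?
missing_cols <- setdiff(all.vars(model_formula), names(test))
missing_cols  # "Y"

# Assumed workaround: a placeholder response with the training levels
test$Y <- factor(rep(levels(train$Y)[1], nrow(test)), levels = levels(train$Y))
setdiff(all.vars(model_formula), names(test))  # character(0)
```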
