Different output format summary(X$Y) vs summary(X) - r

I am a beginner in R, but I am aware I should look for answers before asking a question here. I did, and I looked into the help files, but to no avail. The problem is as follows: when I ask for a summary of subset X, the output for the two columns is as shown below. I want only the output for answer, which I am able to get, but it is presented differently (see the output at the bottom). I want the results presented as a table, not as a list.
summary(X, max = 12)
results in:
student answer
Min. : 335 0 - Not at all likely : 35
1st Qu.: 855480 1 : 18
Median :1831962 10 - Extremely likely :9336
Mean :1519041 2 : 23
3rd Qu.:2183663 3 : 19
Max. :2607132 4 : 15
5 - Neutral : 939
6 : 235
7 : 921
8 :1844
9 :1194
option_i4x-DelftX-ET3034TUx-problem-b3d30df864ca41ffa0170e790f01a783_2_1_dummy_default: 71
Because I am only interested in the summary stats for answer, I used
summary(X$answer, max = 12)
And then I get the list below as answer.
0 - Not at all likely
35
1
18
10 - Extremely likely
9336
2
23
3
19
4
15
5 - Neutral
939
6
235
7
921
8
1844
9
1194
option_i4x-DelftX-ET3034TUx-problem-b3d30df864ca41ffa0170e790f01a783_2_1_dummy_default
71

You should try
summary(X["answer"], max = 12)
since X["answer"] is not a vector like X$answer but a one-column data frame.
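A minimal sketch of that difference, using a made-up stand-in for X (the values below are assumptions, not the asker's data). It only shows which class each call returns, which is what decides how the result is printed:

```r
# Hypothetical stand-in for the asker's data frame X
X <- data.frame(
  student = c(335, 855480, 1831962, 2183663, 2607132),
  answer  = factor(c("0 - Not at all likely", "5 - Neutral",
                     "10 - Extremely likely", "10 - Extremely likely",
                     "5 - Neutral"))
)

# X$answer is a factor, so summary() returns a named vector of counts
vec_summary <- summary(X$answer)

# X["answer"] is a one-column data frame, so summary() returns a "table"
# object that prints in the column layout seen for summary(X)
df_summary <- summary(X["answer"])

class(vec_summary)
class(df_summary)
```

Printing vec_summary and df_summary side by side should reproduce the two layouts from the question.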

EDIT: I just found out that if you want to save or export the result, my solution below gives more useful output (as a table).
write.csv(data.frame(summary(X$answer)), "X.csv")
I played around a bit more and, with @JT85's suggestion, found a nice solution.
data.frame(summary(X$answer))
and
data.frame(table(X$answer))
both work and give the output I want.
PS. It is a coincidence I found it so quickly after posting the question. This has been bugging me for 2 days already.
The output I get for data.frame(summary...) is as follows:
summary.A1.answer.
0 - Not at all likely 35
1 18
10 - Extremely likely 9336
2 23
3 19
4 15
5 - Neutral 939
6 235
7 921
8 1844
9 1194
option_i4x-DelftX-ET3034TUx-problem-b3d30df864ca41ffa0170e790f01a783_2_1_dummy_default 71

Related

For loop to iterate through columns in data.table [duplicate]

I am trying to write a "for" loop that iterates through each column in a data.table and return a frequency table. However, I keep getting an error saying:
library(data.table)
library(datasets)
data(cars)
cars <- as.data.table(cars)
for (i in names(cars)){
  print(table(cars[,i]))
}
Error in `[.data.table`(cars, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.
When I use each column individually like below, I do not have any problem:
> table(cars[,dist])
2 4 10 14 16 17 18 20 22 24 26 28 32 34 36 40 42 46 48 50 52 54 56 60 64 66
1 1 2 1 1 1 1 2 1 1 4 2 3 3 2 2 1 2 1 1 1 2 2 1 1 1
68 70 76 80 84 85 92 93 120
1 1 1 1 1 1 1 1 1
My data is quite large (8921483x52), that is why I want to use the "for" loop and run everything at once then look at the result.
I included the cars dataset (which is easier to run) to demonstrate my code.
If I convert the dataset to a data.frame, the "for" loop runs without any problem. But I want to know why this does not work with data.table, because I am learning it and believe it works better with large datasets.
If by chance someone has already seen a post with an answer, please let me know, because I have been looking for one for several hours.
Some solution found here
My personal preference is the apply function though
library(datasets)
data(cars)
cars <- as.data.table(cars)
apply(cars,2,table)
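To make your loop work, select the column by name with [[ ]] instead of passing i as a bare symbol in j:
library(data.table)
library(datasets)
data(cars)
cars <- as.data.table(cars)
for (i in names(cars)){
  # cars[[i]] extracts the column whose name is stored in i
  print(table(cars[[i]]))
}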
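The error message in the question suggests the `..i` prefix, which tells data.table to look i up in the calling scope rather than among the column names. A short sketch of that route, plus an lapply() variant that avoids the loop; the name cars_dt is only illustrative, and the data.table package is assumed to be installed:

```r
library(data.table)

cars_dt <- as.data.table(datasets::cars)
i <- "dist"

# ..i: take the column name from the variable i (returns a one-column data.table)
by_dotdot <- table(cars_dt[, ..i])

# get(i): evaluate the column named by i inside j (returns a plain vector)
by_get <- table(cars_dt[, get(i)])

# A data.table is also a list of columns, so lapply() visits each one
all_tabs <- lapply(cars_dt, table)
```

For a table with 52 columns, all_tabs keeps every frequency table in one named list, which may be easier to inspect afterwards than printed loop output.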

Running predictive model according to values in column

I have a dataframe (I might in future not use it):
> PM
names.model.
1 4
2 5
3 6
4 8
5 9
It means that for value of 4 for instance I'll use model[1], for value of 5 I'll use model[2] etc.
As already mentioned I have a list of model (from 1 to 5).
I have another dataframe, that has a column TN.
As can be seen:
> head (test)
Ozone Solar.R Wind Temp Month Day TN
2 36 118 8.0 72 5 2 4
8 19 99 13.8 59 5 8 4
14 14 274 10.9 68 5 14 5
40 71 291 13.8 90 6 9 9
62 135 269 4.1 84 7 1 8
69 97 267 6.3 92 7 8 9
I would like to add a new column test$Ozone_pred that runs the relevant model for each row. For instance, for the first row I'll run model[1], and likewise for the second row (both have TN 4). For the third row I'll run model[2], for the fourth row model[5], etc.
There are a couple of options. The first would be to use dplyr's join functions to add your first dataframe (PM) to the second one (test) as a new column and then index based on that. Below is a solution with base R.
To get the correct model for a single row with PM as it currently is (one column, with the row number serving as the model index):
model[match(test_TN_number, PM[,1])]
If PM instead holds the model index in its first column and the TN value in its second, then:
model[PM[match(test_TN_number, PM[,2]), 1]]
This is then easily extended to the whole dataframe with apply or within a loop.
Edit: here's a for-loop version:
for (test_TN_number in test[,"TN"]){
  # look up the model that belongs to this row's TN value
  model[PM[match(test_TN_number, PM[,2]), 1]]
}
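The lookup can also be done for all rows at once, since match() is vectorised. The sketch below is only an illustration: the five lm() fits are hypothetical stand-ins for the asker's model list, and test is rebuilt from the built-in airquality data with the TN values shown in the question.

```r
# Hypothetical stand-ins: five models fitted on the built-in airquality data
train <- na.omit(airquality)
model <- list(
  lm(Ozone ~ Temp, data = train),
  lm(Ozone ~ Wind, data = train),
  lm(Ozone ~ Solar.R, data = train),
  lm(Ozone ~ Temp + Wind, data = train),
  lm(Ozone ~ Temp + Wind + Solar.R, data = train)
)

# Lookup table as printed in the question: row number = model index
PM <- data.frame(names.model. = c(4, 5, 6, 8, 9))

# Six rows with the TN values from the question
test <- train[1:6, ]
test$TN <- c(4, 4, 5, 9, 8, 9)

# Position of each TN value in PM gives the model index for that row
idx <- match(test$TN, PM$names.model.)

# Predict each row with its own model
test$Ozone_pred <- vapply(
  seq_len(nrow(test)),
  function(r) unname(predict(model[[idx[r]]], newdata = test[r, , drop = FALSE])),
  numeric(1)
)
```

Note the double brackets in model[[idx[r]]]: single brackets would return a one-element list rather than the fitted model itself.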

Treat variables as data_frame and other things

I think I have a problem in R. I have this data frame (see the bottom of the post); I imported it as "Weevils" via Import Dataset from Text, then converted the data with
as.data.frame(Weevils), and is.data.frame(Weevils) returning [1] TRUE confirmed that it is a data frame. Yet I cannot use the $ operator, because the variables are all "atomic vectors". I tried this instead:
pairs(x[Age_yrs]~x[Larvae_per_m²], col = x[Farmer], pch = 16)
but then this occurred:
Error in plot.xy(xy, type, ...) :
numerical color values must be >= 0, found -2
which basically means that a negative value (for the Farmer?) was found, so the colors cannot be assigned to the outcome. It is all supposed to look like this: https://stackoverflow.com/a/40829168/5987736 (thanks to Carles Mitjans!)
Yet what came out in my case when running pairs(x[Age_yrs]~x[Larvae_per_m²], pch = 16) was this plot: Plot with negative values. It has negative values, so the colors cannot be assigned.
So my questions are: why can't the variables in the Weevils data frame be treated as non-atomic vectors, or why can't I use the $ operator? And why are the values negative, and what can I do to make them positive? Thanks for helping me!
Farmer Age_yrs Larvae_per_m²
1 Band 2 1315
2 Band 4 725
3 Band 6 90
4 Fechney 1 520
5 Fechney 3 285
6 Fechney 9 30
7 Mulholland 2 725
8 Mulholland 6 20
9 Adams 2 150
10 Adams 3 225
11 Forrester 1 455
12 Forrester 3 75
13 Bilborough 2 850
14 Bilborough 3 650

How to normalize rather long decimal number in R?

I have a list of data.frames in which I need to transform the .score column, and I implemented a helper function for this transformation. After calling .helperFunc on my input list, I get a strange p-value format in the first and third data.frames. How can I normalize such long decimals to simple scientific notation? Can anyone tell me how to do this easily?
toy data :
savedDF <- list(
bar = data.frame(.start=c(12,21,37), .stop=c(14,29,45), .score=c(5,69,14)),
cat = data.frame(.start=c(18,42,18,42,81), .stop=c(27,46,27,46,114), .score=c(15,5,15,5,134)),
foo = data.frame(.start=c(3,3,33,3,33,91), .stop=c(26,26,42,26,42,107), .score=c(22,22,6,22,6,7))
)
I got this weird output:
> .savedDF
$bar
.start .stop .score p.value
1 12 14 5 0.000010000000000000000817488438054070343241619411855936050415039062500
2 21 29 69 0.000000000000000000000000000000000000000000000000000000000000000000001
3 37 45 14 0.000000000000009999999999999999990459020882127560980734415352344512939
$cat
.start .stop .score p.value
1 18 27 15 1e-15
2 42 46 5 1e-05
3 18 27 15 1e-15
4 42 46 5 1e-05
5 81 114 134 1e-134
$foo
.start .stop .score p.value
1 3 26 22 0.0000000000000000000001
2 3 26 22 0.0000000000000000000001
3 33 42 6 0.0000010000000000000000
4 3 26 22 0.0000000000000000000001
5 33 42 6 0.0000010000000000000000
6 91 107 7 0.0000001000000000000000
I don't know why this happens; only the second data.frame's format is what I want. How can I normalize the p.value column as simply as possible?
The last column of cat has the desired format; a more precise but still simple scientific number would also suit me.
How can I normalize these unexpectedly long decimal numbers and get my desired output? Any ideas? Thanks a lot.
0 is the default scipen option. (See ?options for more details.) You apparently have changed the option to 100, which tells R to use decimal notation unless it is 100 characters longer than scientific notation. To get back to the default, run the line
options(scipen = 0)
As to "So in my function, I could add this option as well?" - you shouldn't do that. Doing it in your script is fine, but not in a function. Functions really shouldn't set user options. That's likely how you got into this mess - some function you used probably rudely ran options(scipen = 100) and changed your options without you being aware.
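Two ways to stay clear of the global option are sketched below, as an illustration only. The helper name format_fixed is hypothetical; format(), options(), getOption() and on.exit() are base R.

```r
p <- c(1e-5, 1e-69, 1e-15)

# Per-call formatting: ask for scientific notation without touching options()
sci <- format(p, scientific = TRUE)

# Hypothetical helper: if a function really must change an option,
# it can put the previous value back when it exits
format_fixed <- function(x) {
  old <- options(scipen = 100)
  on.exit(options(old), add = TRUE)
  format(x)
}

before <- getOption("scipen")
fixed  <- format_fixed(p)      # long fixed-notation strings, as in the question
after  <- getOption("scipen")  # same value as before the call
```

Because on.exit() restores the saved value even if the function fails, the caller's scipen setting is left as it was.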
Related: the opposite question How to disable scientific notation in R?

How to create range x values with basic R

I have just begun using R and have gone through multiple books and sources and they get more and more complex yet I still am unable to find a solution to what I think should be quite a basic process.
I have data with 3 columns as shown: (I am really simplifying everything to try and get a really clear answer which can applied to multiple situations)
min max value
1 5 23
8 15 9
33 35 30
I would like to plot this data on a graph.
By this I mean that every value between 1 and 5 on the x axis, for example, corresponds to 23 on the y axis.
I have tried several things, including assigning the columns to vectors a, b, and c respectively,
and generating the correct number of values with:
y <- rep(c, (b - a + 1))
which works as expected.
The problem then occurs with getting the appropriate x values. I tried:
x <- (a:b)
but because of the way R works, it only applies to the first elements.
Now I can make this work by manually typing everything in like:
x <- c(1:5, 8:15, 33:35)
but I really need an automated way to do this because I am working with huge datasets of this structure.
I have seen some other people seem to have similar issues, however the underlying principle always seem to be convoluted with vast datasets and entire codes in questions so I have been unable to get to a good solution to this problem.
If anyone with a little more experience could clear up this issue I would be hugely grateful!
dat <- read.table(text=
"min max value
1 5 23
8 15 9
33 35 30",
header=TRUE)
I'm still not quite sure what you mean, but maybe:
newdat <- with(dat,data.frame(x=c(min,max),y=rep(value,2)))
newdat <- plyr::arrange(newdat,x)
plot(y~x,type="s",data=newdat)
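It's not clear what you want to do between 5 and 8, 15 and 33 ... another possibility is to plot each bit as a separate segment:
# empty plot with value on the y axis, spanning the full min..max range on x
plot(value~min,data=dat,xlim=range(c(dat$min,dat$max)),
type="n")
# one horizontal segment per row, from min to max at height value
invisible(apply(dat,1,function(x) segments(x[1],x[3],x[2],x[3])))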
How about this:
# your data.frame
df<-data.frame(min=c(1,8,33),max=c(5,15,35),value=c(23,9,30))
x<-unlist(apply(df,1,function(x)x[1]:x[2]))
y<-unlist(apply(df,1,function(x)rep(x[3],x[2]-x[1]+1)))
plotdata<-data.frame(x=x,y=y)
plotdata
x y
1 1 23
2 2 23
3 3 23
4 4 23
5 5 23
6 8 9
7 9 9
8 10 9
9 11 9
10 12 9
11 13 9
12 14 9
13 15 9
14 33 30
15 34 30
16 35 30
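The same expansion can be written without apply(), which may scale better to very large inputs. This is a sketch under the assumption that min and max are whole numbers with min <= max in every row:

```r
dat <- data.frame(min = c(1, 8, 33), max = c(5, 15, 35), value = c(23, 9, 30))

# One integer sequence per row, then flattened into a single x vector
x <- unlist(Map(seq, dat$min, dat$max))

# rep() accepts a vector of counts, so each value is repeated (max - min + 1) times
y <- rep(dat$value, dat$max - dat$min + 1)

plotdata <- data.frame(x = x, y = y)
```

plotdata should have the same 16 rows as the output above, and plot(y ~ x, data = plotdata) then draws it.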
Something like this?
a <- c(c(1:5), c(8:15), c(33:35))
b <- c(rep(23,5), rep(9,8), rep(30,3))
plot(a,b, type="l")
