Subsetting: R prints data in reverse order [R 3.2.2, Win10 Pro, 64-bit]

Aim: to retrieve the last two entries of the data. (I am aware of the tail function and of direct indexing.)
Code:
> tdata <- read.csv("hw1_data.csv")
> temp <- tdata[(nrow(tdata)-1):nrow(tdata), ]
> temp
Ozone Solar.R Wind Temp Month Day
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30
> temp <- tdata[nrow(tdata)-1:nrow(tdata), ]
> temp
Ozone Solar.R Wind Temp Month Day
152 18 131 8.0 76 9 29
151 14 191 14.3 75 9 28
150 NA 145 13.2 77 9 27
149 30 193 6.9 70 9 26
148 14 20 16.6 63 9 25
147 7 49 10.3 69 9 24
...
When subsetting with the extract operator, I used the nrow() function to get the total number of rows in the data, subtracted one from it (one less than the total), and used the sequence operator (:) to build a sequence up to nrow(tdata), i.e. the total number of rows.
When I use parentheses the logic works fine, but when I omit them the output is the whole data frame in reverse order.
I can tell that precedence rules are at play, but I cannot work out the exact logic. I am new to R, so any formal explanation would be valuable.

As you correctly suspected, the observed behavior is indeed a matter of operator precedence.
A complete list of the operator syntax and precedence rules in R can be obtained by typing
help(Syntax)
in the console.
In this context, R programmers sometimes refer to a well-known and rather witty quote which encourages the use of parentheses:
library(fortunes)
fortune(138)

nrow(tdata) = 153
So the first line you run is:
temp <- tdata[(nrow(tdata)-1):nrow(tdata),]
This executes as tdata[152:153,]
Second line:
temp <- tdata[nrow(tdata)-1:nrow(tdata),]
This executes as tdata[153-1:153,]
Because the sequence operator : has higher precedence than binary minus, that expression is parsed as tdata[153 - (1:153), ], i.e. tdata[c(152, 151, ..., 1, 0), ].
So it returns the following:
tdata[152,]
tdata[151,]
...
tdata[0,]
An index of 0 selects no rows, which is why you get the whole data frame back in reverse order.
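A minimal sketch of the precedence difference, using the row count from above:
n <- 153
n - 1:n    # ':' binds tighter than '-': parsed as 153 - (1:153), giving 152 151 ... 1 0
(n - 1):n  # parentheses force the subtraction first, giving 152 153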

Related

Quanteda - display full output; error message: "reached max_ndoc ... 24 more documents"

I am in the early stages of building/testing my own defined dictionary. I am testing it with a set of American state party platforms (a corpus of 30 txt files). I have successfully created the dictionary and used Quanteda to produce summary statistics, but it only seems to do this for 6 files at a time, and my plan is to use the dictionary on hundreds of files going back decades. Is there a way to display more than 6 documents at a time?
Here is the code I used, which produced the data frame for the 6 files and the message:
corp_platform <- corpus(corp)
toks_platform <- tokens(corp_platform)
dict_toks <- tokens_lookup(toks_platform, dictionary = dict)
print(dict_toks)
dfm(dict_toks)
Document-feature matrix of: 30 documents, 2 features (1.67% sparse) and 2 docvars.
commmunitarian individualist
akdem20.txt 113 20
azdem20.txt 60 13
cadem20.txt 254 98
medem20.txt 27 7
mndfl20.txt 40 18
ncdem20.txt 235 64
[ reached max_ndoc ... 24 more documents ]
The print methods for core objects, such as dfm objects, by default only print a specified number of rows. That's what you are seeing here, and why it states:
Document-feature matrix of: 30 documents
[...]
.. 24 more documents ]
It's telling you that all 30 documents are there.
This is all documented. See help("print-methods", package = "quanteda"). If you want summary statistics, try quanteda.textstats::textstat_frequency(). Or if you want the dfm as a data.frame, use convert(dfm(dict_toks), to = "data.frame").
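For example, following those suggestions with the dict_toks object from the question (dfmat is just a convenience name used here):
dfmat <- dfm(dict_toks)
# per-feature counts summarised across all 30 documents
quanteda.textstats::textstat_frequency(dfmat)
# or the full document-by-feature table as a data.frame
convert(dfmat, to = "data.frame")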
Thanks very much. I just needed a way to display the output and could not find an example, so this is very helpful. I changed my code to:
corp_platform <- corpus(corp)
toks_platform <- quanteda::tokens(corp_platform)
dict_toks <- tokens_lookup(toks_platform, dictionary = dict)
print(dict_toks)
dfm(dict_toks)
convert(dfm(dict_toks), to = "data.frame")
and the output is:
doc_id commmunitarian individualist
1 akdem20.txt 113 20
2 azdem20.txt 60 13
3 cadem20.txt 254 98
4 medem20.txt 27 7
5 mndfl20.txt 40 18
.........................................
25 tx2022draft.txt 198 156
26 txgop20.txt 181 153
27 wagop20.txt 52 63
28 wigop20.txt 27 11
29 wvgop20.txt 72 47
30 wygop20.txt 22 21

Issue with calculating row mean in data table for selected columns in R

I have a data table as shown below.
Table:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3
215 45 50 60 11 0.4 10.2
0.1 50 61 24 12 0.8 80.0
0 45 24 35 22 20.0 15.4
51 22.1 54 13 35 16 2.2
I want to obtain the Output table below. My code does not work. Can somebody help me figure out what I am doing wrong here?
Any help is appreciated.
Output:
LP GMweek1 GMweek2 GMweek3 PMweek1 PMweek2 PMweek3 AvgGM AvgPM
215 45 50 60 11 0.4 10.2 51.67 7.20
0.1 50 61 24 12 0.8 80.0 45.00 30.93
0 45 24 35 22 20.0 15.4 34.67 19.13
51 22.1 54 13 35 16 2.2 29.70 17.73
sel_cols_GM <- c("GMweek1","GMweek2","GMweek3")
sel_cols_PM <- c("PMweek1","PMweek2","PMweek3")
Table <- Table[, .(AvgGM = rowMeans(sel_cols_GM)), by = LP]
Table <- Table[, .(AvgPM = rowMeans(sel_cols_PM)), by = LP]
OK, so you're doing a couple of things wrong. First, rowMeans() can't evaluate a character vector; if you want to select columns with it, you must use .SD and pass the character vector to .SDcols. Second, you're trying to combine a row-wise aggregation with grouping, which doesn't make much sense here. Third, even if your expression didn't throw an error, you are assigning the result back to Table, which would overwrite your original data; if you want to add a new column, use := to add it by reference.
What you want to do is calculate the row means of your selected columns, which you can do like this:
Table[, AvgGM := rowMeans(.SD), .SDcols = sel_cols_GM]
Table[, AvgPM := rowMeans(.SD), .SDcols = sel_cols_PM]
This means: create these new columns as the row means of my subset of the data (.SD), where .SDcols specifies which columns the subset contains.
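For reference, a self-contained sketch with the table from the question (values copied from the post):
library(data.table)

Table <- data.table(
  LP      = c(215, 0.1, 0, 51),
  GMweek1 = c(45, 50, 45, 22.1), GMweek2 = c(50, 61, 24, 54), GMweek3 = c(60, 24, 35, 13),
  PMweek1 = c(11, 12, 22, 35),   PMweek2 = c(0.4, 0.8, 20, 16), PMweek3 = c(10.2, 80, 15.4, 2.2)
)

sel_cols_GM <- c("GMweek1", "GMweek2", "GMweek3")
sel_cols_PM <- c("PMweek1", "PMweek2", "PMweek3")

Table[, AvgGM := rowMeans(.SD), .SDcols = sel_cols_GM]  # adds AvgGM by reference
Table[, AvgPM := rowMeans(.SD), .SDcols = sel_cols_PM]  # adds AvgPM by reference
Table  # AvgGM = 51.67, 45.00, 34.67, 29.70 and AvgPM = 7.20, 30.93, 19.13, 17.73, matching the Output table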

Trying to aggregate ReactionTypes in R

Sample of outAct:
Activity ReactionType numberActivities
activator activates 16
binding binds 83
recombinase binds 1
branching branches 3
carboxylase carboxylates 36
peptidase cleaves 425
endopeptidase cleaves 368
nuclease cleaves 53
glycosylase cleaves 24
cyclase converts 12
transhydrogenase converts 3
hist deacetylase deacetylates 8
deacetylase deacetylates 16
I want to combine all rows with the same ReactionType and sum their numberActivities:
reaction_types <- aggregate(numberActivities ~ ReactionType, unique(outAct), FUN = sum)
Desired output
ReactionType number
activates 16
binds 84
branches 3
carboxylates 36
cleaves 870
converts 15
deacetylates 24
The problem is that I'm getting duplicates, i.e. they are not being counted as one unique ReactionType; for example, the output contains rows such as
deacetylates 8
deacetylates 16
There are more examples like this throughout the output file.
Where am I going wrong?
Thanks in advance.
library(dplyr)
outAct %>% group_by(ReactionType) %>% summarise(number = sum(numberActivities))
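For completeness, a self-contained sketch using the sample rows from the question:
library(dplyr)

outAct <- data.frame(
  Activity = c("activator", "binding", "recombinase", "branching", "carboxylase",
               "peptidase", "endopeptidase", "nuclease", "glycosylase",
               "cyclase", "transhydrogenase", "hist deacetylase", "deacetylase"),
  ReactionType = c("activates", "binds", "binds", "branches", "carboxylates",
                   "cleaves", "cleaves", "cleaves", "cleaves",
                   "converts", "converts", "deacetylates", "deacetylates"),
  numberActivities = c(16, 83, 1, 3, 36, 425, 368, 53, 24, 12, 3, 8, 16)
)

outAct %>%
  group_by(ReactionType) %>%
  summarise(number = sum(numberActivities))
# binds 84, cleaves 870, converts 15, deacetylates 24, ... matching the desired output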

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents a field-measured parameter (e.g. the depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and then appended the results back to the proper station in a data frame.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
Your exact problem is still a bit of a mystery to me, but it looks like you want a nested for loop:
# store the running residual for every station and depth column
residuals <- matrix(NA, nrow = nrow(Thalweg), ncol = 10,
                    dimnames = list(Thalweg$StationID, names(Thalweg)[2:11]))
for (i in 1:nrow(Thalweg)) {
  residual <- Thalweg[i, "Xdep_Vdep"]
  for (j in 2:11) {
    residual <- min(residual, Thalweg[i, j])  # running minimum so far
    residuals[i, j - 1] <- residual           # keep the value for this column
  }
}
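An alternative, hedged sketch without the explicit inner loop, using cummin() for the running minimum (this covers only the min() chain; the "+ other_variables" terms from the question would still need to be added where they apply):
residual_mat <- t(sapply(seq_len(nrow(Thalweg)), function(i) {
  cummin(c(Thalweg$Xdep_Vdep[i], unlist(Thalweg[i, 2:11])))[-1]  # running min, seeded with Xdep_Vdep
}))
Residuals <- data.frame(StationID = Thalweg$StationID, residual_mat)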

How to get the imputed values back in R

Is there any function in R that can return just the imputed values? For example:
x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,23)
Using linear interpolation (na.approx() from the zoo package):
na.approx(x)
I get the imputed data as:
[1] 23 23 25 43 34 22 78 35 98 23 30 24 21 78 22 76 22 77 33 98 22 14 52 87 59
[26] 23 23
How can I get the imputed values back without looking through the completed dataset one by one? For example, if the data contain n = 200 observations, can I get just the estimates for the missing values (say 20 of them)?
I am not 100 percent sure if I got you right, but does this help?
First save the positions of the original NA values (e.g. the first NA is at position 8) into a dummy vector:
dummy <- rep(NA_integer_, length(x))
for (i in 1:length(x)) {
  if (is.na(x[i])) dummy[i] <- i
}
Now get the corresponding values from the imputed data:
library(zoo)  # na.approx() comes from the zoo package
imputeddata <- na.approx(x)
for (i in 1:length(imputeddata)) {
  if (!is.na(dummy[i])) print(imputeddata[dummy[i]])
}
You could use is.na to select only those values that were previously NA.
> x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,23)
> na.approx(x)[is.na(x)]
[1] 88.0 25.5 76.5 37.0 55.0
Hope that helps.
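If you also want the positions of the imputed values alongside the values themselves (a small base-R addition, not part of the original answer):
na_pos <- which(is.na(x))  # indices of the originally missing entries
data.frame(position = na_pos, imputed = na.approx(x)[na_pos])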
