Operation on data frame factor columns in R

I don't want to perform the operation in a loop. My data looks like this:
dfU[4:7]
vNeg neg pos vPos
1 0 35 28 0
2 0 42 26 0
3 0 77 59 0
4 0 14 24 0
5 0 35 45 0
6 0 17 12 0
7 0 31 23 0
8 0 64 52 1
9 0 15 17 0
10 0 21 29 0
When I perform an operation like the one below, I get a wrong result, probably because of the conversion. I also tried with() and transform(), but those give the error "not meaningful for factors".
b<-as.numeric(((as.numeric(dfU[,4])*-5)+(as.numeric(dfU[,5])*-2)+(as.numeric(dfU[,6])*2)+(as.numeric(dfU[,7])*5)))
b
[1] -14 -32 -16 18 8 -8 -18 -7 6 14 24 -9 0
I think the wrong result comes from this conversion from integer to numeric:
typeof(dfU[,4])
[1] "integer"
as.numeric(dfU[,4])
[1] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
k<-transform(dfU, (vNeg*(-5))+(neg*(-2))+(pos*2)+(vPos*5))
not meaningful for factors
I want an 8th column in the data frame holding this score, and I want to avoid a loop. Is there a better way to perform operations on columns? Any help in this direction is appreciated, thanks.

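The best option is to avoid having the 4th column stored as a factor if that is not what you want.
Otherwise, a workaround is as.numeric(as.character( )). Calling as.numeric() directly on a factor returns the internal level codes (1, 2, ...), not the printed values. Assume "a" is your 4th column; your situation is this: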
> a <- as.factor(c(rep(0,7),1,rep(0,2)))
> a
[1] 0 0 0 0 0 0 0 1 0 0
Levels: 0 1
> as.numeric(a)
[1] 1 1 1 1 1 1 1 2 1 1
And the workaround does:
> as.numeric(as.character(a))
[1] 0 0 0 0 0 0 0 1 0 0
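To make the workaround concrete, here is a minimal sketch that converts every factor column and then computes the score in one vectorised step, with no loop. The data frame dfU_demo is a small made-up stand-in for columns 4 to 7 of dfU, and the column name score is an assumption.

```r
# Hypothetical stand-in for dfU[4:7]; vPos is deliberately stored as a factor
dfU_demo <- data.frame(
  vNeg = c(0, 0, 0),
  neg  = c(35, 42, 64),
  pos  = c(28, 26, 52),
  vPos = factor(c(0, 0, 1))
)

# Convert each factor column back to the numbers it prints, not its level codes
is_fac <- vapply(dfU_demo, is.factor, logical(1))
dfU_demo[is_fac] <- lapply(dfU_demo[is_fac],
                           function(x) as.numeric(as.character(x)))

# Vectorised weighted score over whole columns
dfU_demo$score <- with(dfU_demo, vNeg * -5 + neg * -2 + pos * 2 + vPos * 5)
dfU_demo
```

If the factors come from read.csv() or data.frame(), passing stringsAsFactors = FALSE when the data is created avoids the conversion altogether.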

Related

How to Subset more than Once in a Data Frame in R?

I have the following Data Set.
> eh
PEG AEG
1 1 1
2 1 0
3 1 1
4 1 1
5 0 0
6 0 0
7 1 1
8 1 1
9 0 0
10 0 1
11 1 1
12 0 0
13 0 0
14 0 0
15 0 0
16 1 1
17 1 0
18 1 1
19 0 0
20 0 0
21 0 0
22 0 0
23 0 0
24 1 1
25 1 0
26 0 1
27 0 0
28 1 1
29 0 0
30 1 1
31 0 0
32 0 0
33 1 1
34 1 1
35 0 0
36 0 1
37 1 1
38 1 0
39 1 1
40 1 1
41 1 0
42 0 0
43 0 1
I am trying to find all of the rows in which both PEG and AEG equal one, and count how many such rows there are.
I understand I am close, and I am probably overlooking or misunderstanding the syntax since I am a beginner in R, but I have tried the following code:
eh[which(eh$PEG == 1,) & (eh$AEG == 1,)]
Could anyone tell me what is wrong with this code and how, once it works, I could count the number of instances?
There are some issues in the code, namely the stray commas after each logical expression and the parentheses of which() closing too early. For a data.frame, indexing is documented in ?Extract:
x[i, j, ... , drop = TRUE]
Each comparison returns TRUE/FALSE depending on whether the column value is 1, and & joins them into a single logical vector that is TRUE only where both are TRUE. which() would return the row positions where that vector is TRUE, but it is not needed here: we can subset directly with the logical index and leave the column index empty:
eh[eh$PEG ==1 & eh$AEG ==1,]
If we want the count, it is easier to use sum: TRUE counts as 1 and FALSE as 0, so the sum of the logical vector is the number of matching rows.
with(eh, sum(PEG ==1 & AEG == 1))
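As a small self-contained sketch of both steps, the following uses a made-up five-row version of eh (the values are illustrative, not the asker's data) and stores the logical index in a hypothetical variable called both so it can be reused for subsetting and counting.

```r
# Made-up miniature of the eh data frame
eh <- data.frame(PEG = c(1, 1, 0, 0, 1),
                 AEG = c(1, 0, 0, 1, 1))

# One logical value per row: TRUE only where both columns equal 1
both <- eh$PEG == 1 & eh$AEG == 1

# Subset the rows (note the single comma before the closing bracket)
matches <- eh[both, ]

# Count them; TRUE is treated as 1 when summed
n_matches <- sum(both)
n_matches
```

Here nrow(matches) gives the same count as sum(both), so either can be used depending on whether the subset itself is needed.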

R inspect() function, from tm package, only returns 10 outputs when using dictionary terms

I have 70 PDFs of scientific papers that I'm trying to narrow down by looking for specific terms within them, using the dictionary function of inspect(), which is part of the tm package. My PDFs are stored in a VCorpus object. Here's an example of what my code looks like using the crude dataset and common terms that would show up in (probably) every example paper in crude:
library(tm)
data("crude")  # example corpus shipped with tm
output.matrix <- inspect(DocumentTermMatrix(crude,
list(dictionary = c("i","and",
"all","of",
"the","if",
"i'm","looking",
"for","but","because","has",
"it","was"))))
output <- data.frame(output.matrix)
This search only ever returns 10 papers into output.matrix. The outcome given is:
Docs all and because but for has i i'm the was
144 0 9 0 5 5 2 0 0 17 1
236 0 7 4 2 4 5 0 0 15 7
237 1 11 1 3 3 2 0 0 30 2
246 0 9 0 0 6 1 0 0 18 2
248 1 6 1 1 2 0 0 0 27 4
273 0 5 2 2 4 1 0 0 21 1
368 0 1 0 1 0 0 0 0 11 2
489 0 5 0 0 4 0 0 0 8 0
502 0 6 0 1 5 0 0 0 13 0
704 0 5 1 0 3 2 0 0 21 0
For my actual dataset of 70 papers, I know there should be greater than 10 because as I add more PDFs to my VCorpus, which I know contain at least one of my search terms, I still only get 10 in the output. I want to adjust the outcome to be a list, like the one shown, that gives every paper from the VCorpus that contains a term, not just what I assume is the first 10.
Using R version 4.0.2, macOS High Sierra 10.13.6
You are misinterpreting what inspect does. For a document term matrix it shows only the first 10 rows and columns. inspect should only be used to check whether your corpus or document term matrix looks as you expect, never to transform data into a data.frame. If you want the data of the document term matrix in a data.frame, the following piece of code does this, using your example code and removing all the rows and columns that don't have a value for any of the documents or terms.
# do not use inspect as this will give a wrong result!
output.matrix <- DocumentTermMatrix(crude,
list(dictionary = c("i","and",
"all","of",
"the","if",
"i'm","looking",
"for","but","because","has",
"it","was")))
# remove rows and columns that are 0 staying inside a sparse matrix for speed
out <- output.matrix[slam::row_sums(output.matrix) > 0,
slam::col_sums(output.matrix) > 0]
# transform to data.frame
out_df <- data.frame(docs = row.names(out), as.matrix(out), row.names = NULL)
out_df
docs all and because but for. has the was
1 127 0 1 0 0 2 0 5 1
2 144 0 9 0 5 5 2 17 1
3 191 0 0 0 0 2 0 4 0
4 194 1 1 0 0 2 0 4 1
5 211 0 2 0 0 2 0 8 0
6 236 0 7 4 2 4 5 15 7
7 237 1 11 1 3 3 2 30 2
8 242 0 3 0 1 1 1 6 1
9 246 0 9 0 0 6 1 18 2
10 248 1 6 1 1 2 0 27 4
11 273 0 5 2 2 4 1 21 1
12 349 0 2 0 0 0 0 5 0
13 352 0 3 0 0 0 0 7 1
14 353 0 1 0 0 2 1 4 3
15 368 0 1 0 1 0 0 11 2
16 489 0 5 0 0 4 0 8 0
17 502 0 6 0 1 5 0 13 0
18 543 0 0 0 0 3 0 5 1
19 704 0 5 1 0 3 2 21 0
20 708 0 0 0 0 0 0 0 1
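The filtering step can be tried without tm installed. The sketch below uses a small hand-made count matrix as a stand-in for as.matrix(output.matrix); the document ids and counts are invented for illustration, and base rowSums/colSums replace the slam functions because the stand-in is an ordinary dense matrix.

```r
# Stand-in for as.matrix(output.matrix): rows are documents, columns are terms
counts <- matrix(c(0, 9, 0, 5,
                   0, 0, 0, 0,
                   1, 1, 0, 4),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Docs  = c("127", "144", "191"),
                                 Terms = c("all", "and", "looking", "the")))

# Keep documents that contain at least one term, and terms that occur at all
keep_docs  <- rowSums(counts) > 0
keep_terms <- colSums(counts) > 0

# Build the data.frame with the document ids as a regular column
out_df <- data.frame(docs = rownames(counts)[keep_docs],
                     counts[keep_docs, keep_terms, drop = FALSE],
                     row.names = NULL)
out_df
```

Every document with a non-zero row survives, however many there are, which is the behaviour the question was after.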

Excel Formula Implementation in R

I need to implement some logic in my R script for the sample data frame df shown below.
ID A B
1 2.471264262 0
2 2.53024575 0
3 2.559114933 1
4 2.502350493 1
5 2.529496526 0
6 2.480199137 0
7 2.521066835 0
8 2.481272625 0
9 2.505953959 0
10 2.481272625 0
11 2.499424723 0
12 2.492515087 0
13 2.502385996 0
14 2.487579633 0
15 2.479438021 -1
16 2.044195946 1
17 2.054051421 0
18 2.108811073 1
19 2.249767599 0
20 2.627294516 -1
21 2.624337386 0
22 2.157110862 0
23 2.142325212 -1
24 2.124582433 -1
25 2.114725333 0
26 2.113739623 0
27 1.92054047 0
28 2.00037188 0
29 2.183995509 0
30 2.629451192 0
31 2.772756046 0
32 2.603141474 0
33 2.502385996 0
Column B marks the data points where the state changes. I need to add or subtract a "Correction Factor" to the values in column A for the next 15 data points, starting from the point where B == 1 or B == -1.
The formula for the correction factor is as follows:
If B == 1, the corrected value is [A - 0.19*(15/15)*A], and the fraction (15/15) keeps decrementing over the next 15 values: (14/15), (13/15), ..., (0/15).
Similarly, if B == -1, the corrected value is [A + 0.53*(15/15)*A], and the fraction (15/15) keeps decrementing over the next 15 values: (14/15), (13/15), ..., (0/15).
One more condition: once a state change has been detected in B, any further state change within the next 15 values should be ignored. For example, if the first change is detected at B3, the changes at B4, B15 and B16 should not be considered.
For a better understanding, I have attached my expected output along with the formulas executed manually in Excel.
Expected Output
A B A With Correction Factor Formula Executed
2.471264262 0 2.471264262 Same Value of A retained since no transition
2.53024575 0 2.53024575 Same Value of A retained since no transition
2.559114933 1 2.072883096 A4-0.19* (15/15)*A4
2.502350493 1 2.058600339 A5-0.19* (14/15)*A5
2.529496526 0 2.112972765 A6-0.19* (13/15)*A6
2.480199137 0 2.103208868 A7-0.19* (12/15)*A7
2.521066835 0 2.169798189 A8-0.19* (11/15)*A8
2.481272625 0 2.166978093 A9-0.19* (10/15)*A9
2.505953959 0 2.220275208 A10-0.19* (9/15)*A10
2.481272625 0 2.229836999 A11-0.19* (8/15)*A11
2.499424723 0 2.277809064 A12-0.19* (7/15)*A12
2.492515087 0 2.30308394 A13-0.19* (6/15)*A13
2.502385996 0 2.34390155 A14-0.19* (5/15)*A14
2.487579633 0 2.361542265 A15-0.19* (4/15)*A15
2.479438021 -1 2.385219376 A16-0.19* (3/15)*A16
2.044195946 1 1.992409649 A17-0.19* (2/15)*A17
2.054051421 0 2.028033436 A18-0.19* (1/15)*A18
2.108811073 1 2.108811073 A19-0.19* (0/15)*A19
2.249767599 0 2.249767599 Same Value of A retained since no transition
2.627294516 -1 4.019760609 A21+0.53*(15/15)*A21
2.624337386 0 3.922509613 A22+0.53*(14/15)*A22
2.157110862 0 3.147943785 A23+0.53*(13/15)*A23
2.142325212 -1 3.050671102 A24+0.53*(12/15)*A24
2.124582433 -1 2.950336805 A25+0.53*(11/15)*A25
2.114725333 0 2.861928284 A26+0.53*(10/15)*A26
2.113739623 0 2.785908823 A27+0.53*(9/15)*A27
1.92054047 0 2.463413243 A28+0.53*(8/15)*A28
2.00037188 0 2.495130525 A29+0.53*(7/15)*A29
2.183995509 0 2.647002557 A30+0.53*(6/15)*A30
2.629451192 0 3.093987569 A31+0.53*(5/15)*A31
2.772756046 0 3.164638901 A32+0.53*(4/15)*A32
2.603141474 0 2.87907447 A33+0.53*(3/15)*A33
2.502385996 0 2.679221273 A34+0.53*(2/15)*A34
Edit
The code suggested below works exactly as required for the data frame above (33 rows), but it doesn't work for the data frame below, which has 32 rows. Any suggestions?
ID A B
1 2.471264262 0
2 2.53024575 0
3 2.559114933 1
4 2.502350493 1
5 2.529496526 0
6 2.480199137 0
7 2.521066835 0
8 2.481272625 0
9 2.505953959 0
10 2.481272625 0
11 2.499424723 0
12 2.492515087 0
13 2.502385996 0
14 2.487579633 0
15 2.479438021 -1
16 2.044195946 1
17 2.054051421 0
18 2.108811073 1
19 2.249767599 0
20 2.627294516 -1
21 2.624337386 0
22 2.157110862 0
23 2.142325212 -1
24 2.124582433 -1
25 2.114725333 0
26 2.113739623 0
27 1.92054047 0
28 2.00037188 0
29 2.183995509 0
30 2.629451192 0
31 2.772756046 0
32 2.603141474 0
I was not able to post another question referencing this post, so I have updated this one instead.
Thanks.
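This should work. Counting down from 15 is a little tricky, so we use a for loop to calculate the correct counter and state; the actual formula is then relatively simple. The state and counter columns are created before the loop. Without that, the first df$state[i] <- ... builds a short vector (length 3 here) that R only recycles when its length divides nrow(df), which would explain why the code ran with 33 rows but fails with 32.
# create the helper columns up front so the code works for any number of rows
df$state <- NA_real_
df$counter <- NA_real_
counter <- 0
current_state <- NA
for (i in seq_along(df$B)) {
  if (counter == 0) {
    # not inside a correction window: wait for the next non-zero B
    if (df$B[i] == 0) next
    counter <- 15
    current_state <- df$B[i]
    df$state[i] <- df$B[i]
    df$counter[i] <- counter
  } else {
    # inside a window: keep the triggering state and count down, ignoring B
    counter <- counter - 1
    df$state[i] <- current_state
    df$counter[i] <- counter
  }
}
# state 1 subtracts 19 %, state -1 adds 53 %, both scaled by counter / 15
df$A_corr <- ifelse(df$state == 1,
                    df$A - 0.19 * (df$counter / 15) * df$A,
                    df$A + 0.53 * (df$counter / 15) * df$A)
# rows outside any window have NA state and keep their original A
df$A_corr <- ifelse(is.na(df$A_corr), df$A, df$A_corr)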
Gives:
> df
ID A B state counter A_corr
1 1 2.471264 0 NA NA 2.471264
2 2 2.530246 0 NA NA 2.530246
3 3 2.559115 1 1 15 2.072883
4 4 2.502350 1 1 14 2.058600
5 5 2.529497 0 1 13 2.112973
6 6 2.480199 0 1 12 2.103209
7 7 2.521067 0 1 11 2.169798
8 8 2.481273 0 1 10 2.166978
9 9 2.505954 0 1 9 2.220275
10 10 2.481273 0 1 8 2.229837
11 11 2.499425 0 1 7 2.277809
12 12 2.492515 0 1 6 2.303084
13 13 2.502386 0 1 5 2.343902
14 14 2.487580 0 1 4 2.361542
15 15 2.479438 -1 1 3 2.385219
16 16 2.044196 1 1 2 1.992410
17 17 2.054051 0 1 1 2.028033
18 18 2.108811 1 1 0 2.108811
19 19 2.249768 0 NA NA 2.249768
20 20 2.627295 -1 -1 15 4.019761
21 21 2.624337 0 -1 14 3.922510
22 22 2.157111 0 -1 13 3.147944
23 23 2.142325 -1 -1 12 3.050671
24 24 2.124582 -1 -1 11 2.950337
25 25 2.114725 0 -1 10 2.861928
26 26 2.113740 0 -1 9 2.785909
27 27 1.920540 0 -1 8 2.463413
28 28 2.000372 0 -1 7 2.495131
29 29 2.183996 0 -1 6 2.647003
30 30 2.629451 0 -1 5 3.093988
31 31 2.772756 0 -1 4 3.164639
32 32 2.603141 0 -1 3 2.879074
33 33 2.502386 0 -1 2 2.679221
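The same countdown logic can be wrapped in a function that works on plain vectors and preallocates its helper vectors, so the row count of the input does not matter. This is only a sketch: the function name apply_correction and its arguments n, down and up are assumptions introduced for illustration, with defaults taken from the question (15, 0.19, 0.53).

```r
# Hypothetical helper: returns A with the decaying correction applied
apply_correction <- function(A, B, n = 15, down = 0.19, up = 0.53) {
  state   <- rep(NA_real_, length(B))  # triggering state for each row
  counter <- rep(NA_real_, length(B))  # n, n-1, ..., 0 inside a window
  remaining <- 0
  current   <- NA_real_
  for (i in seq_along(B)) {
    if (remaining == 0) {
      if (B[i] == 0) next              # no active window and no trigger
      remaining <- n                   # a non-zero B opens a new window
      current   <- B[i]
    } else {
      remaining <- remaining - 1       # inside a window, B is ignored
    }
    state[i]   <- current
    counter[i] <- remaining
  }
  # signed weight: -down for state 1, +up for state -1, scaled by counter / n
  weight <- ifelse(state == 1, -down, up) * counter / n
  ifelse(is.na(weight), A, A + weight * A)
}

# First four rows of the question's data
A <- c(2.471264262, 2.53024575, 2.559114933, 2.502350493)
B <- c(0, 0, 1, 1)
corrected <- apply_correction(A, B)
corrected
```

With a data frame, the call would be df$A_corr <- apply_correction(df$A, df$B), which should reproduce the A_corr column shown above for the 33-row data and also run on the 32-row version.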

Function ignoring my if condition statement

I have a tree outlined in a data frame as:
number conc knot neg pick
1 1 0 0 1
2 1 0 0 1
3 1 0 0 1
4 3 164 0 1
5 1 0 0 1
6 1 0 0 1
7 3 159 1 1
8 0 0 0 0
9 0 0 0 0
10 3 208 1 1
11 3 181 1 1
12 3 1 1 1
13 3 95 0 1
14 0 0 0 0
15 0 0 0 0
I'm traversing the tree with a recursive function:
printtree <- function(number,tree) {
if (!is.na(tree[number,5] != 0)) {
letssee<-c(tree[number,1],tree[number,2],tree[number,3],tree[number,4],tree[number,5])
print(letssee)
}
left <- tree[number,1]
if (!is.na(left)) printtree(tree[left,1]*2,tree)
right <- tree[number,1]
if (!is.na(right)) printtree(tree[right,1]*2+1,tree)
}
My if condition should be omitting lines when the pick column = 0 but it is still printing and I can't figure out why.
Here's the output:
[1] 1 1 0 0 1
[1] 2 1 0 0 1
[1] 4 3 164 0 1
[1] 8 0 0 0 0
[1] 9 0 0 0 0
[1] 5 1 0 0 1
[1] 10 3 208 1 1
[1] 11 3 181 1 1
[1] 3 1 0 0 1
[1] 6 1 0 0 1
[1] 12 3 1 1 1
[1] 13 3 95 0 1
[1] 7 3 159 1 1
[1] 14 0 0 0 0
[1] 15 0 0 0 0
Is it ignoring my if statement because of is.na()? If I don't have the is.na check I get an error for "missing value where TRUE/FALSE needed" so it has to be there.
If tree[number, 5] really equals zero, then the inner test
tree[number, 5] != 0
returns FALSE. FALSE is not an NA value, so !is.na(FALSE) is TRUE. The if condition as you've written it is therefore TRUE whenever tree[number, 5] is not NA, including when it is zero.
Check for NA and for zero separately instead, with X standing for tree[number, 5]:
if (!is.na(X) && X != 0) {...}
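The difference between the two conditions is easy to see on a small vector. In this sketch, pick is a made-up vector covering the three cases (non-zero, zero, missing), and wrong and right are hypothetical names for the results of the original and the corrected test.

```r
# Three cases for the pick column: non-zero, zero, and missing
pick <- c(1, 0, NA)

# Original test: only asks whether the comparison is non-missing
wrong <- !is.na(pick != 0)

# Corrected test: value must be present and different from zero
right <- !is.na(pick) & pick != 0

data.frame(pick, wrong, right)
```

The vectorised & is used here because pick has several elements; inside if (), where the condition is a single value, && is the usual choice.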

How to create frequency table under couple conditions in r?

I am a rookie in R, so my questions are probably basic. I want to know the frequency of a variable under a couple of conditions. I tried table() but could not get it to work, and after a lot of searching I still cannot find an answer.
My data looks like this
ID AGE LEVEL End_month
1 14 1 201005
2 25 2 201006
3 17 2 201006
4 16 1 201008
5 19 3 201007
6 33 2 201008
7 17 2 201006
8 15 3 201005
9 23 1 201004
10 25 2 201007
I want to know two things.
First, I want the frequency of each age within each level. Ages up to a certain value are shown individually and the rest are aggregated into one group. It looks like this:
level
1 2 3 sum
age 14 1 0 0 1
16 1 0 0 1
15 0 0 1 1
17 0 2 0 2
19 0 0 1 1
20+ 1 3 0 4
sum 3 5 2 10
Second, I want the frequency of each age in each End_month for level 2 and level 3 customers. I want to get tables like these:
For level 2 customer
End_month
201004 201005 201006 201007 201008 sum
age 15 0 0 0 0 0 0
19 0 0 0 0 0 0
17 0 0 2 0 0 2
19 0 0 0 0 0 0
25 0 0 0 1 0 1
33 0 0 0 1 1 2
sum 0 0 2 2 1 5
For level 3 customer
End_month
201004 201005 201006 201007 201008 sum
age 15 0 1 0 0 0 1
19 0 0 0 1 0 1
17 0 0 0 0 0 0
19 0 0 0 0 0 0
25 0 0 0 0 0 0
33 0 0 0 0 0 0
sum 0 1 0 1 0 2
Many thanks in advance.
You can still achieve this with table, because it accepts more than one variable.
For example, assuming your data frame is called df, use
with(df, table(AGE, LEVEL))
to get the first two-way table.
To produce such a table for one subset by LEVEL, build a logical index first. For level 2 (the same works for level 3):
keep <- df$LEVEL == 2
table(df$AGE[keep], df$End_month[keep])
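Putting the pieces together, the sketch below rebuilds the question's ten rows and produces both kinds of table, including the "20+" group and the sum margins. The data frame name df, the helper age_group and the cut-off at 20 are assumptions read off the desired output; addmargins() is the base R function that appends the Sum row and column.

```r
# The ten rows from the question
df <- data.frame(
  ID        = 1:10,
  AGE       = c(14, 25, 17, 16, 19, 33, 17, 15, 23, 25),
  LEVEL     = c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2),
  End_month = c(201005, 201006, 201006, 201008, 201007,
                201008, 201006, 201005, 201004, 201007)
)

# Table 1: age by level, with ages of 20 and over pooled into "20+"
age_group <- ifelse(df$AGE >= 20, "20+", as.character(df$AGE))
tab1 <- addmargins(table(age = age_group, level = df$LEVEL))

# Table 2: age by End_month for level 2 only; fixing the factor levels
# keeps months that have no level 2 customers as zero columns
lvl2   <- df[df$LEVEL == 2, ]
months <- factor(lvl2$End_month, levels = sort(unique(df$End_month)))
tab2   <- addmargins(table(age = lvl2$AGE, End_month = months))

tab1
tab2
```

Replacing 2 with 3 in the subsetting line gives the level 3 table. Ages that do not occur in a subset are dropped from the rows unless AGE is also turned into a factor with fixed levels.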

Resources