Switch-like function for questionnaire grading in R

I've been doing serious PHP/JS coding recently and have somewhat lost my R muscle. While this problem can easily be tackled in PHP/JS, what is the most efficient way of solving it in R? I have to grade a questionnaire, and I have the following scenario:
raw    t
5      0
6      2
7-9    3
10-12  4
15-20  5
If x equals, or falls within, the range given in raw, the value in the corresponding row of t should be returned. Of course, this can be done with a for loop or a switch, but imagine a very lengthy set of value ranges in raw. How would you tackle this one?

We seem to be missing a part of the example, because there is no mention of "x":
dat <- read.table(textConnection("raw t
5 0
6 2
7-9 3
10-12 4
15-20 5"), header=TRUE, stringsAsFactors=FALSE)
dat$bot <- as.numeric(sapply(sapply(dat$raw, strsplit, "-"), "[", 1))
get.t <- function(x) findInterval(x, dat$bot)
dat$t[get.t(8)]
# [1] 3
dat$t[get.t(6)]
# [1] 2
dat$t[get.t(5)]
# [1] 0
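One thing to note (my addition, not part of the answer above): findInterval() only looks at the lower bounds stored in dat$bot, so a score that falls in the gap between two listed ranges still maps to the preceding row, and a score below every range gives index 0.
dat$t[get.t(13)]  # 13 is not covered by any listed range, but still maps to the 10-12 row
# [1] 4
dat$t[get.t(4)]   # below every range: findInterval() returns 0, so the result is empty
# integer(0)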

I would simply use an indexing scheme kind of like what Corbin alluded to, but since he didn't provide an example, here's a simple one:
m <- cbind(c(5:12, 15:20),
           rep(c(0, 2, 3, 4, 5), times = c(1, 1, 3, 3, 6)))
m[m[, 1] == 11, 2]
# [1] 4

Note: this is very similar to Simone's answer, since I started typing it a while back, but it has an extra note at the end. The indexing approach I mention is essentially Simone's answer.
There will have to be a loop involved somewhere.
The pseudo code of what I would do is something like:
score = blah
for each raw => t:
    break raw into rMin, rMax
    if (rMin <= score and rMax >= score):
        return t
This avoids looping over every number between rMin and rMax (which is what I assume you meant); without some kind of indexing, though, that is the best you're going to get.
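Here is a minimal R translation of that loop, for concreteness (the ranges data frame with pre-parsed lower/upper bounds is my own scaffolding, not from the question):
# Assumes the raw ranges have already been parsed into numeric bounds.
ranges <- data.frame(bot = c(5, 6, 7, 10, 15),
                     top = c(5, 6, 9, 12, 20),
                     t   = c(0, 2, 3, 4, 5))

get.t.loop <- function(score) {
  for (i in seq_len(nrow(ranges))) {
    # return as soon as one range contains the score
    if (ranges$bot[i] <= score && score <= ranges$top[i]) return(ranges$t[i])
  }
  NA  # score falls outside every range
}

get.t.loop(8)   # 3
get.t.loop(13)  # NA, since 13 is not covered by any listed range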
Note: if you have a ton of calls to this, and indexing would actually be worth your while, the easiest type of indexing would just be a hash map of score -> t entries.
Basically you would parse your example data into something like:
index[5] = 0
index[6] = 2
index[7] = 3
index[8] = 3
index[9] = 3
You would need to carefully weigh if building the index would be more time consuming than just looping over the ranges.
Note: the indexing approach is essentially what Simone suggested.
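In R, the closest analogue of that hash map is a named lookup vector (or an environment) keyed by every score inside every range. A sketch, reusing the ranges scaffolding from the loop translation above:
# Expand each range into its individual scores and build one named vector.
index <- unlist(Map(function(bot, top, t) setNames(rep(t, top - bot + 1), bot:top),
                    ranges$bot, ranges$top, ranges$t))

index[["8"]]   # 3
index[["11"]]  # 4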

Related

R version of inplace=True

I'm starting to learn R and I'm having a hard time making changes to the names of values in a factor. I've tried using revalue and recode but am still seeing the original names when I look at the dataframe.
Here's what the DF looks like:
head(freecut)
  gender oldness student_loaniness homeland
1      0      20                 4  Eurasia
2      1      25                 4   Oceana
3      1      56                 2 Eastasia
4      0      65                 6 Eastasia
5      1      50                 5   Oceana
6      0      20                 5 Eastasia
And here are the coding attempts:
revalue(freecut$homeland, c("Eastasia" = "East_Asia", "Eurasia" = "Asiope",
                            "Oceana" = "Nemoville"))
recode(freecut$homeland, Eastasia = "East_Asia", Eurasia = "Asiope",
       Oceana = "Nemoville")
After running the code the DF looks exactly the same. I know that in Python I would have to throw in "inplace = TRUE" to make changes stick--not sure what I need to do here (or what I'm missing).
R doesn't modify in place; you have to assign the result, either back to the original variable to modify it, or to a new variable. This is a paradigm of functional programming, and R is a functional programming language.
If you have x = 1, running x + 1 will evaluate and print the result, 2, but x is not changed. If you want to overwrite x with the modified value, you run x = x + 1.
In just the same way, running recode() will evaluate and print a result, but if you want to modify the column in your data frame, you need to assign it explicitly with freecut$homeland <- recode(...).
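A minimal sketch of that assign-back pattern, assuming the recode() being used is dplyr's (the question doesn't say which package it comes from):
library(dplyr)

# Evaluates and prints the recoded factor, but freecut is unchanged:
recode(freecut$homeland, Eastasia = "East_Asia", Eurasia = "Asiope",
       Oceana = "Nemoville")

# Assign the result back to actually modify the data frame:
freecut$homeland <- recode(freecut$homeland, Eastasia = "East_Asia",
                           Eurasia = "Asiope", Oceana = "Nemoville")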
There are a few exceptions in add-on packages. For example, the data.table package defines some set* functions (and the := operator) which do modify objects in place. data.table is fantastic, especially if you need efficiency, but if you are just starting with R I would recommend getting familiar with the basics first.
In addition to Gregor's answer, which addresses the more fundamental issue, in your particular case you can use levels<-:
levels(freecut$homeland) <- c("first", "second", "third")
# order is important if you don't want surprises
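If you would rather not rely on the level order, you can also rename levels by matching on their current names; this variant is my addition, not part of the original answer:
# Inspect the current levels first (by default they are alphabetical)
levels(freecut$homeland)
# [1] "Eastasia" "Eurasia"  "Oceana"

# Rename by name rather than by position
levels(freecut$homeland)[levels(freecut$homeland) == "Eastasia"] <- "East_Asia"
levels(freecut$homeland)[levels(freecut$homeland) == "Eurasia"]  <- "Asiope"
levels(freecut$homeland)[levels(freecut$homeland) == "Oceana"]   <- "Nemoville"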
Or, if you are ready to join the dark side, consider macros from the gtools package; the first steps are described e.g. at https://www.r-bloggers.com/macros-in-r/. Hardly anyone uses macros in R, and I don't know why; maybe they're dangerous, or maybe they just seem obscure.

R count how often words from a list appear in a sentence

Currently participating in a MOOC and trying my hand at some sentiment analysis, but having trouble with the R code.
What I have is a list of bad words and a list of good words. For instance, my bad words are c("dent", "broken", "wear", "cracked"), etc.
I have a column of descriptions in my data frame; for each row, I want a count of how many of my bad words and how many of my good words appear in the description.
For instance, suppose this is my data frame:
desc = c("this screen is cracked", "minor dents and scratches", "100% good", "in perfect condition")
id = c(1,2,3,4)
df = data.frame(id, desc)
bad.words = c("cracked", "scratches", "dents")
What I want is to add a sum column that counts how often the bad words appear in each description, so I'm hoping my final df would look like:
id  desc                          sum
1   "this screen is cracked"      1
2   "minor dents and scratches"   2
3   "100% good"                   0
4   "in perfect condition"        0
What I have so far is
df$sum <- grepl(paste(bad.words, collapse = "|"), df$desc)
which only gives me TRUE or FALSE depending on whether any of the words appears.
If you are finding a sum, vapply() is more appropriate than sapply(). You could do
library(stringi)
df$sum <- vapply(df$desc, function(x) sum(stri_count_fixed(x, bad.words)), 1L)
Which gives
df
# id desc sum
# 1 1 this screen is cracked 1
# 2 2 minor dents and scratches 2
# 3 3 100% good 0
# 4 4 in perfect condition 0
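One caveat worth flagging (my note, not part of the answer): stri_count_fixed() counts substring matches, so a short word in your list will also match inside longer words.
library(stringi)

# "dent" matches inside "dents", so both patterns register a hit here
stri_count_fixed("minor dents and scratches", c("dent", "dents"))
# [1] 1 1
If you need whole-word matching, the strsplit()-based countwords() in the next answer avoids this.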
Since you're likely going to try different lists of words (good.words, bad.words, really.bad.words), I would write a function. I like lapply, but vapply and others will work too.
countwords <- function(x, comparison) {
  lapply(x, function(x, comparewords) {
    # split the description into words and count how many are in the list
    sum(strsplit(as.character(x), " ")[[1]] %in% comparewords)
  }, comparewords = comparison)
}
df$good <- countwords(df$desc,good.words)
df$bad <- countwords(df$desc,bad.words)
The tm package is useful as well, once you're past the learning stage and moving toward production-scale text processing.

How to write a variable number of nested for loops in R?

This is probably a simple one, but I somehow got stuck...
I need many nested loops to get the result for every sample in my support, like the usual stacked loops:
for (a in 1:N1) {
  for (b in 1:N2) {
    for (c in 1:N3) {
      ...
    }
  }
}
but the number of for loops needed in this messy system depends on another random variable, say,
for (f in 1:N.for)
So how can I write a for loop to deal with this? Or is there a more elegant way to do it?
Note the difference: the variables of the nested for loops above (a, b, c, ...) do matter in my calculations, but the variable f of the loop that controls the number of for loops does not enter any of my real calculations; all it does is count/ensure that the number of for loops is correct.
Did I make it clear?
So what I am actually trying to do is generate all the possible combinations of a number of people's preferences towards others.
Let's say I have 6 people (the simplest case for my purpose): Abi, Bob, Cath, Dan, Eva, Fay.
Abi and Bob have preference lists of C D E F (4! = 24 possible permutations for each of them);
Cath and Dan have preference lists of A B and E F, respectively (2! * 2! = 4 possible permutations for each of them);
Eva and Fay have preference lists of A B C D (4! = 24 possible permutations for each of them).
So altogether there should be 24 * 24 * 4 * 4 * 24 * 24 possible permutations of preferences when taking all six of them together.
I am just wondering: what is a clear, easy and systematic way to generate them all at once?
I'd want them in a format such as
c.prefs <- as.matrix(data.frame(Abi = c("Eva", "Fay", "Dan", "Cath"),
                                Bob = c("Dan", "Eva", "Fay", "Cath")))
but any clear format is fine...
Thank you so much!!
I'll assume you have a list of each loop variable and its maximum value, ordered from the outermost to innermost variable.
loops <- list(a=2, b=3, c=2)
You could create a data frame with all the loop variable values in the correct order with:
(indices <- rev(do.call(expand.grid, lapply(rev(loops), seq_len))))
# a b c
# 1 1 1 1
# 2 1 1 2
# 3 1 2 1
# 4 1 2 2
# 5 1 3 1
# 6 1 3 2
# 7 2 1 1
# 8 2 1 2
# 9 2 2 1
# 10 2 2 2
# 11 2 3 1
# 12 2 3 2
If the code run at the innermost point of the nested loop doesn't depend on the previous iterations, you could use something like apply to process each iteration independently. Otherwise you could loop through the rows of the data frame with a single loop:
for (i in seq_len(nrow(indices))) {
  # You can get "a" with indices$a[i], "b" with indices$b[i], etc.
}
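A sketch of the apply() route mentioned above, for the case where the iterations are independent (the body here is a placeholder computation, not something from the question):
# One result per row of 'indices'; each row supplies the loop variables a, b, c.
results <- apply(indices, 1, function(row) {
  row["a"] * row["b"] + row["c"]  # stand-in for the real innermost computation
})
length(results)  # one value per combination, i.e. nrow(indices)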
As for the way of doing the calculation itself, one option is to use the Reduce function or some other higher-order function.
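To connect this to the preference example: a base-R sketch (the perms() helper and the reduced two-person example are mine, not from the question) that enumerates every joint preference profile and recovers one of them in the c.prefs format:
# All orderings of a character vector (factorial growth; fine for short lists)
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (rest in perms(v[-i])) out <- c(out, list(c(v[i], rest)))
  }
  out
}

# Each person's admissible orderings (only Cath and Dan here, to keep it small)
pref.lists <- list(Cath = perms(c("Abi", "Bob")),
                   Dan  = perms(c("Eva", "Fay")))

# One row of 'idx' per joint profile: an index into each person's list of orderings
idx <- expand.grid(lapply(pref.lists, seq_along))
nrow(idx)
# [1] 4

# Recover the actual preference matrix for, say, the third profile
profile <- mapply(function(p, i) p[[i]], pref.lists, idx[3, ], SIMPLIFY = FALSE)
c.prefs <- do.call(cbind, profile)
c.prefs
#      Cath  Dan
# [1,] "Abi" "Fay"
# [2,] "Bob" "Eva"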
Since your data is not inherently ordered (an individual is part of a set, and its preferences are part of the set), I would keep the individuals in a factor and keep, e.g., the preferences in lists named after the individuals. If you have large data you can store it in an environment.
The first block of code just makes the example reproducible; the problem domain lent itself to graph-oriented naming. You only need to change the first line, and the runif() call, to change the behaviour.
# people
verts <- factor(LETTERS[1:10])

# relations: disallow preferring yourself
edges <- lapply(seq_along(verts), function(ind) {
  levels(verts)[-ind]
})
names(edges) <- levels(verts)

# directions
# say you have these stored in a list or something
pool <- levels(verts)
directions <- lapply(pool, function(vert) {
  relations <- pool[unique(round(runif(5, 1, 10)))]
  setdiff(relations, vert)  # drop the individual from their own preference list
})
names(directions) <- pool

num_prefs <- lapply(directions, length)
names(num_prefs) <- names(directions)

# First take the factorial of each person's number of preferences,
# then reduce that with multiplication
combinations <- Reduce(`*`, sapply(num_prefs, factorial))
I hope this answers your question!

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" in the empty cells between events so that each row in the data can be tied to the most recent event that has occurred: picture a log-message column where an event name appears only on the row where it occurs, with empty cells until the next event, and I'd like every empty cell filled with the most recent event name.
Doing so will enable me to split the data up by the current event. In any other language I would jump straight into a for loop to do this, but I know that R isn't great with loops of that type, and in this case I have hundreds of thousands of rows of data to sort through, so I am wondering if anyone can offer suggestions for a speedy way of doing this.
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE = c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  CURRENT_EVENT <- with(rle(LOG_MESSAGE),  # list with 'values' and 'lengths'
                        rep(replace(values,
                                    nchar(values) == 0,
                                    values[nchar(values) != 0]),
                            lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                  log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
  ID sample_value  log_message Current_Event
1  1           34  FIRST_EVENT         FIRST
2  2           56         <NA>         FIRST
3  3           78 SECOND_EVENT        SECOND
4  4           98         <NA>        SECOND
5  5          234         <NA>        SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last one carried forward" part).
2. The result of 1. is then converted to a character string.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector; in this case each component is a vector of length two, and we want the first element of each.
4. So I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data frame dat.

Apply multiple functions to column using tapply

Could someone please point to how we can apply multiple functions to the same column using tapply (or any other method: plyr, etc.) so that the result is obtained in distinct columns? For example, if I have a data frame with
User MoneySpent
Joe 20
Ron 10
Joe 30
...
I want to get, for each user, the sum of MoneySpent together with the number of occurrences.
I used a function like
f <- function(x) c(sum(x), length(x))
tapply(df$MoneySpent, df$User, f)
But this does not split the result into columns; it gives something like
Joe 100, 5   # the sum = 100 and the number of occurrences = 5, but they end up juxtaposed in a single cell
Thanks in advance,
Raj
You can certainly do stuff like this using ddply from the plyr package:
dat <- data.frame(x = rep(letters[1:3],3),y = 1:9)
ddply(dat,.(x),summarise,total = NROW(piece), count = sum(y))
  x total count
1 a     3    12
2 b     3    15
3 c     3    18
You can keep listing more summary functions, beyond just two, if you like. Note that I'm being a little tricky here in calling NROW on an internal ddply variable called piece. You could have just done something like length(y) instead (and probably should; referencing the internal variable piece isn't guaranteed to work in future versions). Do as I say, not as I do, and just use length().
ddply() is conceptually the clearest, but sometimes it is useful to use tapply instead for speed reasons, in which case the following works:
do.call( rbind, tapply(df$MoneySpent, df$User, f) )
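For instance, with a small made-up df (the sample data, and the names added to f() so the columns are labelled, are illustrative additions rather than from the post), the tapply route gives:
df <- data.frame(User = c("Joe", "Ron", "Joe"),
                 MoneySpent = c(20, 10, 30))
# Same idea as the question's f(), with names added for readable column headers
f <- function(x) c(sum = sum(x), count = length(x))

do.call(rbind, tapply(df$MoneySpent, df$User, f))
#     sum count
# Joe  50     2
# Ron  10     1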
