Assign new ID taking into account previous changes - r

Sorry I do not know how to properly title my question. It is easier to understand with an example.
Sample data
Consider the following example.
> l_ids <- as.data.frame(cbind(a = c("strong", "intense", "intensity"),
+                              id = c("1", "2", "3"),
+                              new_id = c("", "1", "2")),
+                        stringsAsFactors = FALSE)
> l_ids
          a id new_id
1    strong  1
2   intense  2      1
3 intensity  3      2
I would like to update the id of each word in a with its new_id, where one applies. Consider this as a synonym dictionary. As I iterate over new_id:
> for (i in 1:nrow(l_ids)) {
+   if (nchar(l_ids$new_id[i]) > 0) {
+     l_ids$id[i] <- l_ids$new_id[i]
+   }
+ }
> l_ids
          a id new_id
1    strong  1
2   intense  1      1
3 intensity  2      2
The problem is that I would like intensity to also be given a 1 (since 3 points to 2, which points to 1). Is there a way to do this without having to iterate multiple times?
Update on background
I have a document with a list of synonyms. These synonyms are only relevant to the problem's field of application. Example:
> dictionary
     good       bad
1  strong   intense
2 intense intensity
3   light      soft
I am then given a list of words, each with a given id. My task is to check whether any of those words is in the bad column of dictionary and, if so, update it with the id of the word to its left. As can be seen, intensity would need two steps to become strong (a good word in the dictionary). Is there a way to do so without having to make multiple passes (say, with a for loop)?
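One possible approach, sketched below (not from the original post), makes a single pass that resolves each new_id against the ids that have already been updated. It assumes every new_id points at a row that appears earlier in the table, as in the example; if references could also point forward, you would instead repeat the substitution in a while loop until nothing changes.

orig_id <- l_ids$id                       # keep the original ids for lookups
for (i in seq_len(nrow(l_ids))) {
  if (nchar(l_ids$new_id[i]) > 0) {
    j <- match(l_ids$new_id[i], orig_id)  # row that this new_id points to
    l_ids$id[i] <- l_ids$id[j]            # take its already-resolved id
  }
}
l_ids
#           a id new_id
# 1    strong  1
# 2   intense  1      1
# 3 intensity  1      2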

Related

Conditional sentence for specific rows

Disclaimer: I am not that advanced with RStudio, so my question might be quite basic.
Let's assume the following data set:
ID value1a value2a value1b value2b ...
 1       2       3     ...
 8       4       4
 2       5       5
I want to create a fourth variable that is part of the expression of an if statement, which logically should go as follows:
If ID = 1 has a value over 5 in "value1x" and below 3 in "value2x", then add 1 to this fourth variable. Hence the fourth variable should function as a counter: the number in it indicates how often value1x is over 5 while value2x is below 3.
I hope my question makes sense and I'd appreciate answers!
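A minimal sketch of one way to build such a counter, assuming hypothetical columns named value1a/value2a, value1b/value2b, ... that are paired as in the question (the numbers are made up only so the example runs):

df <- data.frame(ID      = c(1, 8, 2),
                 value1a = c(6, 4, 5),
                 value2a = c(2, 4, 5),
                 value1b = c(7, 1, 9),
                 value2b = c(1, 3, 2))

v1 <- df[grep("^value1", names(df))]    # all value1x columns
v2 <- df[grep("^value2", names(df))]    # matching value2x columns, same order
df$counter <- rowSums(v1 > 5 & v2 < 3)  # per-row count of qualifying pairs
df
#   ID value1a value2a value1b value2b counter
# 1  1       6       2       7       1       2
# 2  8       4       4       1       3       0
# 3  2       5       5       9       2       1

rowSums() counts, for every ID, how many value1x/value2x pairs satisfy both conditions at once, which is exactly the counter described above.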

Static variable next to a dynamic variable in R

I posted another question yesterday, but I feel I need to clarify it.
Let's say I have this code:
md.NAME <- (subset(MyData, HotelName=="ALAMEDA"))
md.NAME.fc <- (subset(md.ALAMEDA, TIPO=="FORECAST"))
md.NAME.fc.bar <- (subset(md.ALAMEDA.fc, Market.Segment=="BAR"))
What I want is for NAME to change according to a variable set before those 3 lines are run.
So NAME is dynamic in the sense that before these 3 lines I could say that NAME is now equal to JOHN, and then later that NAME is now equal to PATRIC.
So after running those 3 lines twice (once for JOHN and once for PATRIC), the environment would contain something like this:
6 dataframes, 3 for JOHN and 3 for PATRIC
DATAFRAME 1 WILL BE md.JOHN
DATAFRAME 2 WILL BE md.JOHN.fc
DATAFRAME 3 WILL BE md.JOHN.fc.bar
DATAFRAME 1 WILL BE md.PATRIC
DATAFRAME 2 WILL BE md.PATRIC.fc
DATAFRAME 3 WILL BE md.PATRIC.fc.bar
All the answers I have received so far would only help if "md" and "fc" or "fc.bar" were always the same. But I will have several variables like this, and their names will vary a lot. So the center part (NAME) is the only part that should change.
I could even have something like:
md.test$NAME <- ...
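A minimal sketch of one way to do this (not from the original post; MyData, HotelName, TIPO and Market.Segment are taken from the question, while the helper name and the use of assign() are illustrative assumptions):

make_subsets <- function(name, data = MyData) {
  base   <- subset(data, HotelName == name)            # e.g. "JOHN" or "PATRIC"
  fc     <- subset(base, TIPO == "FORECAST")
  fc_bar <- subset(fc, Market.Segment == "BAR")
  assign(paste0("md.", name),            base,   envir = .GlobalEnv)
  assign(paste0("md.", name, ".fc"),     fc,     envir = .GlobalEnv)
  assign(paste0("md.", name, ".fc.bar"), fc_bar, envir = .GlobalEnv)
  invisible(NULL)
}

make_subsets("JOHN")    # creates md.JOHN, md.JOHN.fc and md.JOHN.fc.bar
make_subsets("PATRIC")  # creates md.PATRIC, md.PATRIC.fc and md.PATRIC.fc.bar

In practice a named list (for example results[["JOHN"]]) is usually easier to manage than many assign()-created objects, but the pattern above produces exactly the object names listed in the question.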

How can I make association rules considering count information?

Hi, I'm currently studying association rules with R, and I have a question.
In transaction data we consider only buy or non-buy (binary data). I want to know how to perform association rule mining with count data. For example:
  item1 item2 item3
1     2     0     1
2     0     1     0
3     1     0     0
The first customer bought two item1s! But in ordinary association rule mining, that count information is ignored. How can we take it into account?
Hi, quantitative association rule (QAR) mining may be helpful.
First, divide the value range of every item into intervals and give each interval a unique label. Then the original dataset can be transformed into a binary dataset containing those labels.
For example, suppose the original data contains the following information for item1:
the first person has bought 5 item1s,
the second person has bought 2 item1s,
the third person has bought 7 item1s.
You can divide the value range of item1 into [0, 3), [3, 6) and [6, 9), and use a1, a2 and a3 to represent them. The item 'item1' is then replaced by three other items, a1, a2 and a3, and the original data becomes:
the first person has bought one a2,
the second person has bought one a1,
the third person has bought one a3.
After doing this for every item, the original dataset is transformed into a binary dataset.
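A minimal sketch of that discretisation in R (not from the answer above), using cut() to bin the counts and the arules package for the rules themselves; the bin boundaries, labels and example counts are illustrative assumptions:

library(arules)

counts <- data.frame(item1 = c(5, 2, 7),
                     item2 = c(0, 1, 0),
                     item3 = c(1, 0, 0))

# bin every item's count into labelled intervals: [0, 3) -> a1, [3, 6) -> a2, [6, 9) -> a3
binned <- as.data.frame(lapply(counts, function(x)
  cut(x, breaks = c(0, 3, 6, 9), right = FALSE, labels = c("a1", "a2", "a3"))))

# each factor level becomes its own binary item, e.g. "item1=a2"
trans <- as(binned, "transactions")
rules <- apriori(trans, parameter = list(supp = 0.1, conf = 0.5))
inspect(rules)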

To sort a specific column in a DataFrame in SparkR

In SparkR I have a DataFrame data. It contains time, game and id.
head(data)
then gives id = 1 4 1 1 215 985 ..., game = 1 5 1 10 ... and time = 2012-2-1, 2013-9-9, ...
Now game contains a game type, which is a number from 1 to 10.
For a given game type I want to find the minimum time, meaning the first time that game was played. For game type 1 I do this:
data1 <- filter(data, data$game == 1)
This new DataFrame contains all the data for game type 1. To find the minimum time I do this:
g <- groupBy(data1, game$time)
first(arrange(g, desc(g$time)))
but this doesn't run in SparkR. It says "object of type 'S4' is not subsettable".
Game 1 has been played on 2012-01-02, 2013-05-04, 2011-01-04, ... and I would like to find that minimum time.
If all you want is a minimum time, sorting a whole data set doesn't make sense. You can simply use min:
agg(df, min(df$time))
or for each type of game:
groupBy(df, df$game) %>% agg(min(df$time))
By typing
arrange(game, game$time)
I get all of the times sorted. With the first function I then get the first entry. If I want the last entry I simply type this:
first(arrange(game, desc(game$time)))
Just to clarify, because this is something I keep running into: the error you were getting is probably because you also loaded dplyr into your environment. If you had used SparkR::first(SparkR::arrange(g, SparkR::desc(g$time))) things would probably have been fine (although obviously the query could have been more efficient).
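Putting this together, a minimal sketch (assuming an active SparkR session and a DataFrame df with the game and time columns described above), with explicit SparkR:: prefixes for the functions that dplyr would otherwise mask:

library(SparkR)

# minimum time for game type 1 only
game1 <- SparkR::filter(df, df$game == 1)
head(SparkR::agg(game1, min(game1$time)))

# earliest time for every game type at once
head(SparkR::agg(SparkR::groupBy(df, df$game), min(df$time)))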

Tallying up data

I have two data sets. One has 2 million cases (individual donations to various causes), the other has about 38,000 (all zip codes in the U.S.).
I want to sort through the first data set and tally up the total number of donations by zip code. (Additionally, the total for each zip code will be broken down by cause.) Each case in the first data set includes the zip code of the corresponding donation and information about what kind of cause it went to.
Is there an efficient way to do this? The only approach that I (very much a novice) can think of is to use a for ... if loop to go through each case and count them up one by one. This seems like it would be really slow, though, for data sets of this size.
Edit: thanks, @josilber. This gets me a step closer to what I'm looking for.
One more question, though. table seems to generate frequencies, correct? What if I'm actually looking for the sum for each cause by zip code? For example, if the data frame looks like this:
dat3 <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
                   cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE),
                   amt = sample(250:2500, 2000000, replace=TRUE))
Suppose that, instead of frequencies, I want to end up with output that looks like this:
#        Cause 1 (amt) Cause 2 (amt) Cause 3 (amt)
# Zip 1  (sum)         (sum)         (sum)
# Zip 2  (sum)         (sum)         (sum)
# Zip 3  (sum)         (sum)         (sum)
# etc.   ...           ...           ...
Does that make sense?
Sure, you can accomplish what you're looking for with the table command in R. First, let's start with a reproducible example (I'll create an example with 2 million cases, 3 zip codes, and 3 causes; I know you have more zip codes and more causes but that won't cause the code to take too much longer to run):
# Data
set.seed(144)
dat <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
                  cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE))
Please note that it's a good idea to include a reproducible example with all your questions on Stack Overflow because it helps make sure we understand what you're asking! Basically you should include a sample dataset (like the one I've just included) along with your desired output for that dataset.
Now you can use the table function to count the number of donations in each zip code, broken down by cause:
table(dat$zip, dat$cause)
#         Cause 1 Cause 2 Cause 3
#   Zip 1  222276  222004  222744
#   Zip 2  222068  222791  222363
#   Zip 3  221015  221930  222809
This took about 0.3 seconds on my computer.
Could this work?
aggregate(amt ~ cause + zip, data = dat3, FUN = sum)
    cause   zip       amt
1 Cause 1 Zip 1 306231179
2 Cause 2 Zip 1 306600943
3 Cause 3 Zip 1 305964165
4 Cause 1 Zip 2 305788668
5 Cause 2 Zip 2 306306940
6 Cause 3 Zip 2 305559305
7 Cause 1 Zip 3 304898918
8 Cause 2 Zip 3 304281568
9 Cause 3 Zip 3 303939326
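If the wide zip-by-cause layout from the question is wanted directly, one further option (not part of the answers above) is xtabs(), which sums the left-hand-side variable over the cross-classifying factors:

xtabs(amt ~ zip + cause, data = dat3)
#        cause
# zip       Cause 1   Cause 2   Cause 3
#   Zip 1 306231179 306600943 305964165
#   Zip 2 305788668 306306940 305559305
#   Zip 3 304898918 304281568 303939326

These are the same sums as in the aggregate() output above, just rearranged into a zip-by-cause matrix.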
