R: Applying factor values from one column to another

I am trying to process municipal information in R, and it seems that factors (to be exact, factor()) are the best way to achieve my goal. I am only starting to get the hang of R, so I imagine my problem is possibly very simple.
I have the following example dataframe to share (a tiny portion of Finnish municipalities):
municipality<-c("Espoo", "Oulu", "Tampere", "Joensuu", "Seinäjoki",
"Kerava")
region<-c("Uusimaa","Pohjois-Pohjanmaa","Pirkanmaa","Pohjois-Karjala","Etelä-Pohjanmaa","Uusimaa")
myData<-cbind(municipality,region)
myData<-as.data.frame(myData)
By default R converts my character columns into factors, which can be tested with str(myData). Now to the part where my beginner to novice level R skills end: I can't seem to find a way to apply factors from column region to column municipality.
Let me demonstrate. Instead of having the original result
as.numeric(factor(myData$municipality))
[1] 1 4 6 2 5 3
I would like to get this, with the factor codes from myData$region applied to myData$municipality:
as.numeric(factor(myData$municipality))
[1] 5 4 2 3 1 5
I welcome any help with open arms. Thank you.

To better understand the use of factor in R have a look here.
If you want to add factor levels, you have to do something like this in your dataframe:
> levels(myData$region)
[1] "Etelä-Pohjanmaa" "Pirkanmaa" "Pohjois-Karjala" "Pohjois-Pohjanmaa" "Uusimaa"
> levels(myData$municipality)
[1] "Espoo" "Joensuu" "Kerava" "Oulu" "Seinäjoki" "Tampere"
> levels(myData$municipality)<-c(levels(myData$municipality),levels(myData$region))
> levels(myData$municipality)
[1] "Espoo" "Joensuu" "Kerava" "Oulu" "Seinäjoki"
[6] "Tampere" "Etelä-Pohjanmaa" "Pirkanmaa" "Pohjois-Karjala" "Pohjois-Pohjanmaa"
[11] "Uusimaa"
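If the goal is the numeric output shown in the question (5 4 2 3 1 5), one possible reading is that each municipality should simply carry the integer code of its region. The following is a minimal sketch under that assumption; region_code is a name introduced here for illustration and does not appear in the original post.

```r
# Rebuild the example data from the question.
municipality <- c("Espoo", "Oulu", "Tampere", "Joensuu", "Seinäjoki", "Kerava")
region <- c("Uusimaa", "Pohjois-Pohjanmaa", "Pirkanmaa",
            "Pohjois-Karjala", "Etelä-Pohjanmaa", "Uusimaa")
myData <- data.frame(municipality, region, stringsAsFactors = TRUE)

# Integer codes of the region factor, one per row (levels are sorted alphabetically).
region_code <- as.integer(myData$region)

# Label each code with its municipality (region_code is a hypothetical helper name).
names(region_code) <- as.character(myData$municipality)
print(region_code)
```

Because the codes come from the region factor, two municipalities in the same region (Espoo and Kerava here) share a code.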

Related

Static variable next to a dynamic variable in R

I posted yesterday another question but I feel I need to clarify it.
Let's say I have this code
md.NAME <- subset(MyData, HotelName=="ALAMEDA")
md.NAME.fc <- subset(md.NAME, TIPO=="FORECAST")
md.NAME.fc.bar <- subset(md.NAME.fc, Market.Segment=="BAR")
What I want is for NAME to change according to a variable set before those 3 lines are run.
NAME is dynamic in the sense that before these 3 lines I could say that NAME is now equal to JOHN, and later that NAME is now equal to PATRIC.
So after running those 3 lines twice (once for John and once for Patric), I would get something like this in the environment:
6 dataframes, 3 for JOHN and 3 for PATRIC
DATAFRAME 1 WILL BE md.JOHN
DATAFRAME 2 WILL BE md.JOHN.fc
DATAFRAME 3 WILL BE md.JOHN.fc.bar
DATAFRAME 1 WILL BE md.PATRIC
DATAFRAME 2 WILL BE md.PATRIC.fc
DATAFRAME 3 WILL BE md.PATRIC.fc.bar
All the answers I have had so far would help me only if "md" and "fc" or "fc.bar" were always the same. But I will have several variables like this, and their naming will vary a lot. So the center part (NAME) is the only one that should change.
I could even have something like:
md.test$NAME <- ...
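One way to get this effect without pasting variable names together by hand is to keep the subsets in a named list and build the names from NAME. This is only a sketch: the toy MyData below is made up, and subsets_for, hotel_names and all_subsets are hypothetical names, not something from the question.

```r
# Hypothetical stand-in for MyData; the column names follow the question.
MyData <- data.frame(
  HotelName      = c("JOHN", "JOHN", "PATRIC", "PATRIC"),
  TIPO           = c("FORECAST", "ACTUAL", "FORECAST", "FORECAST"),
  Market.Segment = c("BAR", "BAR", "BAR", "CORP"),
  stringsAsFactors = FALSE
)

# Build the three nested subsets for one name and label them md.NAME, md.NAME.fc, md.NAME.fc.bar.
subsets_for <- function(data, name) {
  md  <- data[data$HotelName == name, , drop = FALSE]
  fc  <- md[md$TIPO == "FORECAST", , drop = FALSE]
  bar <- fc[fc$Market.Segment == "BAR", , drop = FALSE]
  out <- list(md, fc, bar)
  names(out) <- paste0("md.", name, c("", ".fc", ".fc.bar"))
  out
}

# Run the same three steps once per name and collect everything in one named list.
hotel_names <- c("JOHN", "PATRIC")
all_subsets <- do.call(c, lapply(hotel_names, subsets_for, data = MyData))
print(names(all_subsets))
```

Individual data frames can then be reached with all_subsets[["md.JOHN.fc"]]. If separate variables in the environment are really needed, list2env(all_subsets, envir = globalenv()) would create them from the list.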

How to convert dataframe to list of lists?

I have a dataset that looks like a sentiment dictionary.
Say the data looks like the following:
sentiment <-read.table(header=TRUE,text='
term score
awesome 3
good 0
interesting 2
power 1
bad -1
horrible -2
worst -3' )
I want to convert this data into a list of lists by score, something like the following (list names can be different):
$pos3
[1] awesome
$pos2
[1] interesting
$pos1
[1] power
$zero
[1] good
$neg1
[1] bad
$neg2
[1] horrible
$neg3
[1] worst
Other solutions suggested to use dlply:
dlply(sentiment, .(score), c)
But this also produces score lists, which I don't want. So I ended up using clumsy code like the following:
list(neg3=sentiment$term[sentiment$score==-3],neg2=sentiment$term[sentiment$score==-2],
neg1=sentiment$term[sentiment$score==-1],zero=sentiment$term[sentiment$score==0],
pos1=sentiment$term[sentiment$score==1], pos2=sentiment$term[sentiment$score==2],
pos3=sentiment$term[sentiment$score==3])
But I wonder if there is a better way to do this.
How can I convert a dataframe to a list of lists without producing lists that I don't want?
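A possible approach, sketched here with base R only, is to split the term column by the score column and then rename the groups. The naming scheme (neg3 ... zero ... pos3) follows the question; by_score and scores are names introduced for this sketch.

```r
# Same example data as in the question; keep terms as plain character strings.
sentiment <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
term score
awesome 3
good 0
interesting 2
power 1
bad -1
horrible -2
worst -3")

# split() groups the terms by score; the list names start out as "-3", "-2", ..., "3".
by_score <- split(sentiment$term, sentiment$score)

# Turn the numeric names into neg3 ... zero ... pos3.
scores <- as.integer(names(by_score))
names(by_score) <- ifelse(scores == 0, "zero",
                          paste0(ifelse(scores > 0, "pos", "neg"), abs(scores)))
print(by_score)
```

Since only the term column is passed to split(), the result holds terms alone and no score column appears in the groups.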

R Refer to (part of) data frame using string in R

I have a large data set in which I have to search for specific codes depending on what I want. For example, chemotherapy is coded by ~40 codes that can appear in any of 40 columns (called diag1, diag2, etc.).
I am in the process of writing a function that produces plots depending on what I want to show. I thought it would be good to specify what I want to plot in an input data frame. Thus, for example, in case I only want to plot chemotherapy events for patients, I would have a data frame like this:
Dataframe name: Style
Name SearchIn codes PlotAs PlotColour
Chemo data[substr(names(data),1,4)=="diag"] 1,2,3,4,5,6 | red
I already have a function that searches for codes in specific parts of the data frame and flags the events of interest. What I cannot do, and need your help with, is referring to part of a data frame using the code stored in Style$SearchIn[1], as above.
> Style$SearchIn[1]
[1] data[substr(names(data),1,4)=="diag"]
Levels: data[substr(names(data),1,4)=="diag"]
I thought perhaps get() would work, but I can't get it to work:
> get(Style$SearchIn[1])
Error in get(vars$SearchIn[1]) : invalid first argument
or
> get(as.character(Style$SearchIn[1]))
Error in get(as.character(Style$SearchIn[1])) :
object 'data[substr(names(data),1,5)=="TDIAG"]' not found
Obviously, running data[substr(names(data),1,5)=="TDIAG"] works.
Example:
library(survival)
ex <- data.frame(SearchIn="lung[substr(names(lung),1,2) == 'ph']")
lung[substr(names(lung),1,2) == 'ph'] #works
get(ex$SearchIn[1]) # does not work
It is not a good idea to store R code in strings and then try to eval them when needed; there are nearly always better solutions for dynamic logic, such as lambdas.
I would recommend using a list to store the plot specification, rather than a data.frame. This would allow you to include a function as one of the list's components which could take the input data and return a subset of it for plotting.
For example:
library(survival);
plotFromSpec <- function(data,spec) {
filteredData <- spec$filter(data);
## ... draw a plot from filteredData and other stuff in spec ...
};
spec <- list(
Name='Chemo',
filter=function(data) data[,substr(names(data),1,2)=='ph'],
Codes=c(1,2,3,4,5,6),
PlotAs='|',
PlotColour='red'
);
plotFromSpec(lung,spec);
If you want to store multiple specifications, you could create a list of lists.
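As a rough illustration of the list-of-lists idea, the sketch below stores two specifications and applies each filter to the same data. Everything here is hypothetical: the patients data frame, the proc1 column, the Surgery entry and the helper columns_starting_with are invented for the example.

```r
# Hypothetical patient data; the column names are illustrative only.
patients <- data.frame(diag1 = c(1, 7), diag2 = c(3, 9),
                       proc1 = c(2, 5), age = c(60, 71))

# Returns a filter function that keeps the columns whose names start with `prefix`.
columns_starting_with <- function(prefix) {
  force(prefix)  # evaluate the argument now so the closure captures its value
  function(data) data[, startsWith(names(data), prefix), drop = FALSE]
}

# One list element per plot specification.
specs <- list(
  Chemo   = list(filter = columns_starting_with("diag"), Codes = 1:6,
                 PlotAs = "|", PlotColour = "red"),
  Surgery = list(filter = columns_starting_with("proc"), Codes = c(2, 5),
                 PlotAs = "o", PlotColour = "blue")
)

# Apply every specification's filter to the data.
selected <- lapply(specs, function(spec) spec$filter(patients))
print(lapply(selected, names))
```

Each specification carries its own filter function, so no R code has to be stored as a string and evaluated later.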
Have you tried using quote()?
I'm not entirely sure what you want but maybe you could store the things you're trying to get() like
quote(data[substr(names(data),1,4)=="diag"])
and then use eval()
eval(quote(data[substr(names(data),1,4)=="diag"]), list(data=data))
For example,
dat <- data.frame("diag1"=1:10, "diag2"=1:10, "other"=1:10)
Style <- list(SearchIn=c(quote(data[substr(names(data),1,4)=="diag"]), quote("Other stuff")))
> head(eval(Style$SearchIn[[1]], list(data=dat)))
diag1 diag2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6

applying alternate to for loop in R

I am looking for a very efficient alternative to a for loop in R,
where data_papers is
data_papers<-c(1,3, 47276 77012 77012 79468....)
paper_author:
paper_id author_id
1 1 521630
2 1 972575
3 1 1528710
4 1 1611750
5 2 1682088
I need to find the authors which are present in paper_author for a given paper in data_papers. There are around 350,000 papers in data_papers compared with around 2,100,000 papers in paper_author.
So my output would be a list of author_id values for the paper_ids in data_papers:
authors:
[[1]]
[1] 521630 972575 1528710 1611750
[[2]]
[1] 826 338038 788465 1256860 1671245 2164912
[[3]]
[1] 366653 1570981 1603466
The simplest way to do this would be
authors<-vector("list",length(data_papers))
for(i in 1:length(data_papers)){
authors[i]<-as.data.frame(paper_author$author_id[which(paper_author$paper_id%in%data_papers[i])])}
But the computation time is very high.
The other alternative is something like below taken from efficient programming in R
i=1:length(data_papers)
authors[i]<-as.data.frame(paper_author$author_id[which(paper_author$paper_id%in%data_papers[i])])
But I am not able to do this.
How could this be done? Thanks.
with(paper_author, split(author_id,paper_id))
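To show how the split() result could be lined up with data_papers, here is a small sketch. The paper_author rows are the ones from the question; the two-element data_papers is a made-up stand-in (the real vector is elided in the question), and authors_by_paper is a name introduced here.

```r
# The five rows of paper_author shown in the question.
paper_author <- data.frame(
  paper_id  = c(1, 1, 1, 1, 2),
  author_id = c(521630, 972575, 1528710, 1611750, 1682088)
)

# Hypothetical short stand-in for data_papers; paper 3 has no rows in this toy table.
data_papers <- c(1, 3)

# Group all author ids by paper in one pass (no explicit loop).
authors_by_paper <- with(paper_author, split(author_id, paper_id))

# Look up each requested paper by name; papers without authors come back as NULL.
authors <- unname(authors_by_paper[as.character(data_papers)])
print(authors)
```

The grouping is done once for the whole table, and the lookup by name replaces the per-paper %in% scan of the original loop.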
Or you could use R's merge function?
merge(data_papers, paper_author, by=1)
Why are you not able to use this second solution you mentioned? Information on why would be useful.
In any case, what you want to do is to join two tables (data_papers and paper_author). Doing it with pure nested loops, as your sample code does (whether in R for loops or in the C for loops underlying vector operations), is pretty inefficient. You could use some kind of index data structure, based on e.g. the hash package, but it's a lot of work.
Instead, just use a database. They're built for this sort of thing. sqldf even lets you embed one into R.
install.packages("sqldf")
require(sqldf)
# note: the query below assumes data_papers is a data frame with a paper_id column
# you probably want to dig into the indexing options available here as well
combined <- sqldf("select distinct author_id from paper_author pa inner join data_papers dp on dp.paper_id = pa.paper_id where dp.paper_id = 1234;")

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to do this:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
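The three access forms above can be tried on the first rows of the file from the question. This is a small sketch that reads the data from an inline string instead of foo.txt so that it is self-contained.

```r
# First three rows of the file from the question, read from inline text.
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10", header = TRUE)

# Three equivalent ways to pull a column out as a plain numeric vector.
area   <- tbl$area
area2  <- tbl[, 1]
energy <- tbl[["energy"]]

# tbl$first is NULL because no column is called "first",
# which is why c(tbl$first) gave nothing useful.
print(is.null(tbl$first))
print(area)
```

The column has to be addressed by its actual name (area, energy) or by its position, not by words like first or second.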
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
