Extracting different vectors from a single column of data (in R) - r

I have a small problem, which I don't think is too hard, but I couldn't find an answer here (maybe I phrased my search badly, so please excuse me if the question has already been asked!)
I am importing data from an excel sheet which is split in two columns as in the following picture:
Now, I am trying to import all the data in the second column into my R script, but split into different vectors: one vector for category A, one for category B, etc., while keeping the data points in the order they have in the file (because, as it happens, they are in chronological order).
The categories each have a different number of elements; however, they are ordered alphabetically (i.e. you'll never find an A among the B's, for example). So I guess that makes it easier, but I'm still a novice with R and I don't really know how to proceed without the code getting really messy, and I know there's probably a simple way of doing it.
Does anyone have an idea on how to treat this nicely please? :)

We can use split() in base R to return a list of vectors of 'Data', one for each unique value in 'Category':
lst1 <- split(df1$Data, df1$Category)
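To make that concrete, here is a minimal sketch on a made-up data frame. The column names 'Category' and 'Data' follow the answer above; the values are invented, since the original sheet is not shown.

```r
# Toy stand-in for the imported sheet; the values are made up for illustration.
df1 <- data.frame(
  Category = c("A", "A", "B", "B", "B", "C"),
  Data     = c(10, 12, 7, 9, 8, 3)
)

# One vector per category; within each vector the original row order is kept.
lst1 <- split(df1$Data, df1$Category)

lst1$A        # the values for category A, in file order
lst1[["B"]]   # same idea, indexing the list by name
```

Keeping the result as a named list rather than creating separate variables A, B, ... means the number of categories does not have to be known in advance.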

Related

Exporting data from R that is in a list generated by a function

So I've used the decompose function and I want to export all the lists it generates not just the plot it creates. I tried converting the lists into either a matrix or data frame but then that gets rid of the date header and year columns so if someone knows how to convert it and keep the list formatting that would solve my issue I think.
Anyway, The closest I've got to being able to do this keeping the list format is by doing
capture.output(decompose, file = "filename.csv")
As you can see from the image attached though:
Sometimes the months aren't all together in a row, which is really not helpful and not what I want. It also just puts everything in one column, and I'm having to go into Excel afterwards and use the text-to-columns option, which is going to get old really quickly.
Any help would be greatly appreciated. I'm really new to R, so I apologise if there is an obvious fix I'm missing.

Not aggregating correctly

The goal of this code is to create a loop that aggregates each company's word frequency by a principle vector I created and adds the result to a list. The problem is that, after I run it, the list only contains the 7 principles rather than the word frequencies alongside them. The word frequencies are the individual columns of the FREQBYPRINC.AG data frame. Running the aggregate call on a single column without the loop works with no problem, but the loop doesn't give me the correct data frames in the list. Any suggestions?
list.agg<-vector("list",ncol(FREQBYPRINC.AG)-2)
for (i in 1:14){
attach(FREQBYPRINC.AG)
list.agg[i]<-aggregate(FREQBYPRINC.AG[,i+1],by=list(Type=principle),FUN=sum,na.rm=TRUE)
}
It sounds like the aggregate call itself is fine, since it works on a single column; the problem is in how the result is stored in the list.
Since you created list.agg as a list, you need to index it with double square brackets when assigning. With single brackets, only the first column of the data frame returned by aggregate (the principle labels) ends up in the list, which is why you see just the 7 principles. Try this:
list.agg<-vector("list",ncol(FREQBYPRINC.AG)-2)
for (i in 1:14){
list.agg[[i]]<-aggregate(FREQBYPRINC.AG[,i+1],by=list(Type=principle),FUN=sum,na.rm=TRUE)
}
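The same idea can be tried on a small made-up data frame. This is only a sketch: the names freq, companyA, companyB and the values are hypothetical stand-ins for FREQBYPRINC.AG, and it assumes the principle labels live in a column of the same data frame.

```r
# Hypothetical miniature of the asker's data: a word column, two company
# frequency columns, and a principle label for each row.
freq <- data.frame(
  word      = c("fair", "open", "safe", "just"),
  companyA  = c(1, 2, 3, 4),
  companyB  = c(5, 6, 7, 8),
  principle = c("P1", "P2", "P1", "P2")
)

company_cols <- c("companyA", "companyB")
list.agg <- vector("list", length(company_cols))
names(list.agg) <- company_cols

for (col in company_cols) {
  # [[ ]] stores the whole data frame returned by aggregate() in one slot
  list.agg[[col]] <- aggregate(freq[[col]],
                               by = list(Type = freq$principle),
                               FUN = sum, na.rm = TRUE)
}

list.agg$companyA   # a data frame with columns Type and x
```

Looping over column names instead of 1:14 avoids hard-coding the number of companies, and referring to freq$principle directly means attach() is not needed.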

Selecting rows in a long dataframe based on a short list

I'm sure this should be easier to do than the way I know how to do it.
I'd like to apply fields from a short dataframe back into a long one based on matching a common factor.
Example short dataframe, list of valid cases:
$ptid (factor) values 1,2,3,4,5...20
$valid 1/0 (to represent true/false; variable through ptid)
long dataframe has 15k rows, each level of $ptid will have several thousand rows
I want to apply $valid to those rows when it is 1/true in the list above.
The way I know how to do it is to loop through each row of the long dataframe, but this is horribly inelegant and also slow.
I have a niggling feeling there is a much better way with dplyr or similar and I'd really like to learn how.
Worked this out based on the comments, thank you Colonel.
combination_dataset <- merge(short_dataframe, long_dataframe) worked (very quickly).
Thanks to those who commented.
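For anyone landing here later, a minimal sketch of that merge on invented data. The column value and all numbers are hypothetical; only ptid and valid come from the question.

```r
# Short lookup table: one row per ptid with its valid flag (made-up values)
short_dataframe <- data.frame(ptid = factor(1:3), valid = c(1, 0, 1))

# Long table: several rows per ptid (made-up values)
long_dataframe <- data.frame(
  ptid  = factor(c(1, 1, 2, 2, 3, 3)),
  value = c(0.5, 0.7, 1.1, 1.3, 2.0, 2.2)
)

# merge() joins on the shared column, copying valid onto every matching row
combination_dataset <- merge(short_dataframe, long_dataframe, by = "ptid")

# Keep only the rows whose ptid was flagged as valid
valid_rows <- combination_dataset[combination_dataset$valid == 1, ]
```

Naming the key with by = "ptid" is optional here, because merge() defaults to the columns the two data frames have in common, but it makes the join explicit.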

Why does R mix up numerical with categorical variables?

I am confused. I read a .csv file into R and want to fit a linear multivariate regression model.
However, R declares all my obviously numeric variables to be factors and my categorical variables to be integers. Therefore, I cannot fit the model.
Does anyone know how to resolve this?
I know this is probably so basic. But I really need to know this. Elsewhere, I found only posts concerning how to declare factors. But this does not apply here.
Any suggestions very much appreciated!
The easiest way, imo, to handle this is to just tell R what type of data your columns contain when you read them into the workspace. For example, if you have a csv file where the first column should be characters, columns 2-21 should be numeric, and column 22 should be a factor, here's how I would read that csv file into the workspace:
Data <- read.csv("MyData.csv", colClasses=c("character", rep("numeric", 20), "factor"))
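A self-contained version of the same call, as a sketch: it writes a tiny made-up CSV to a temporary file first, so the column names and values below are purely illustrative.

```r
# Write a small made-up CSV to a temporary file so the example is runnable
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,height,weight,group",
             "a01,1.75,70.2,1",
             "a02,1.62,58.9,2",
             "a03,1.80,81.0,1"), csv_path)

# Declare the type of every column up front instead of letting R guess
Data <- read.csv(csv_path,
                 colClasses = c("character", rep("numeric", 2), "factor"))

str(Data)   # id is character, height/weight numeric, group a factor
```

A side benefit of declaring the types is that a stray non-numeric entry in a "numeric" column makes read.csv stop with an error instead of silently turning the column into a factor.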
Sometimes (with certain versions of R, as Andrew points out) float entries in a CSV are long enough that R reads them as strings rather than floats. In that case, you can do the following:
data <- read.csv("filename.csv")
data$some.column <- as.numeric(as.character(data$some.column))
Or you could pass stringsAsFactors=FALSE to the read.csv call and just apply as.numeric on the next line. That might be a bad idea, though, if you have a lot of data.
It's a little harder to say what's going on with the categorical variables. You might want to try just treating those as strings and see how that works. Sometimes R will treat factor vectors as being of numeric type, so this is a good first sanity check. If that doesn't work, you can also see if the regression functions in question will let you declare how the variables should be treated.
It is hard to tell without a sample of your data file and the commands that you have been using to try and work with the data, but here are some general problems that can lead to what you describe (though there could be other possibilities as well).
The read.csv and read.table functions (read.csv calls read.table) will try to guess the type of each column when they are not told what it should be (via the colClasses argument). If everything looks like a number, the column is converted to a number, but if it sees anything in the column that does not look like part of a number, it reads the column as character and converts it to a factor. Common reasons why R sees something non-numeric in what you think should be a number include: a finger slip that puts a letter somewhere in the column; similar-looking substitutions, such as O for 0 or l for 1; a comma where one is not expected (many European files use , where R expects . as the decimal mark, but there are options to tell R what you want here); or using read.table without setting sep when the file really is comma separated.
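The decimal-comma case is easy to reproduce. The sketch below uses a made-up semicolon-separated file; stringsAsFactors = TRUE is set explicitly to mimic the default of R versions before 4.0.0, where character columns became factors.

```r
# A made-up European-style file: ';' separates fields, ',' is the decimal mark
eu_path <- tempfile(fileext = ".csv")
writeLines(c("name;price", "x;1,5", "y;2,25"), eu_path)

# With the default dec = ".", "1,5" does not look like a number,
# so the price column is read as text and turned into a factor
wrong <- read.csv(eu_path, sep = ";", stringsAsFactors = TRUE)
class(wrong$price)

# read.csv2() uses sep = ";" and dec = "," by default
right <- read.csv2(eu_path)
class(right$price)
```

The same effect can be had with read.csv(eu_path, sep = ";", dec = ","), if you prefer to keep a single reader function.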
If you have a categorical variable represented by integers, then R will read it as integers unless you tell it to make a factor. If you use as.numeric on a factor, it returns the integer codes used to represent the factor internally, not the labels. How to convert a factor whose labels are numbers into a numeric vector is a question (and answer) in the R FAQ.
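A short sketch of that pitfall, using a made-up factor whose labels happen to be numbers:

```r
# A factor whose labels look like numbers (values invented for illustration)
f <- factor(c("10", "2.5", "10", "7"))

codes  <- as.numeric(f)                 # internal integer codes, not the labels
values <- as.numeric(as.character(f))   # the numbers the labels spell out
faq    <- as.numeric(levels(f))[f]      # alternative: convert the levels, then index

codes
values
```

The two conversions that go through the labels agree with each other, while as.numeric(f) on its own gives small integers that only reflect the ordering of the levels.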
If this does not point you in the right direction then give us a sample of your data and what commands you are using.

How do I match single IDs in one data frame to multiples of the IDs in another data frame in R?

For a project at work, I need to generate a table from a list of proposal ids, and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function; the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, R ignores multiple rows and only returns the first match, when I need all of them. I haven't been able to find anything in documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R CODE to add awards data on proposals to new data spreadsheet
#read tab delimited files
Awards=read.delim("O:/testing.txt",as.is=T)
Proposals=read.delim("O:/test.txt",as.is=T)
#match IDs from both spreadsheets
Proposals$TotalAwarded=Awards$TotalAwarded[match(Proposals$IDs,Awards$IDs)]
write.table(Proposals,"O:/tested.txt",quote=F,row.names=F,sep="\t")
This does exactly what I want, except that only the first match is used.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See help on merge: ?merge
merge(Proposals, Awards, by="IDs", all.y=TRUE)
But I cannot believe this hasn't been asked on SO before.
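To see the difference between match() and merge() side by side, here is a sketch on invented data. The Title column and all values are hypothetical; IDs and TotalAwarded follow the question. It uses all.x = TRUE rather than all.y = TRUE so that proposals without any award are kept as well, which is an assumption about what the final table should contain.

```r
# Made-up proposals: one row per ID
Proposals <- data.frame(IDs   = c("P1", "P2", "P3"),
                        Title = c("Alpha", "Beta", "Gamma"))

# Made-up awards: P1 appears twice, P2 not at all
Awards <- data.frame(IDs          = c("P1", "P1", "P3"),
                     TotalAwarded = c(100, 250, 75))

# match() returns the position of the FIRST match only, one per proposal
first_only <- Awards$TotalAwarded[match(Proposals$IDs, Awards$IDs)]

# merge() returns one row per matching pair, so P1 shows up twice
all_matches <- merge(Proposals, Awards, by = "IDs", all.x = TRUE)
all_matches
```

With all.y = TRUE instead, every row of Awards is kept, including awards whose ID is missing from Proposals; which of the two you want depends on which table should drive the output.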
