I have an environment that I'm using as a hashmap (inspired by this question); for each key it stores a count of references to that key in another document.
Now I'm hoping to draw a bar plot of the keys vs. values, ordered with the highest values first.
To me it seems that the easiest way to do the sort (while keeping keys and values aligned) would be to put my data into a data frame. So, as step 1, I extracted the keys and values from the map into two lists, as below:
i <- 1  # R indexes from 1; assigning to index 0 silently does nothing
keys <- list()
values <- list()
for (key in ls(contactsMap)) {
  keys[[i]] <- key
  values[[i]] <- contactsMap[[key]]
  i <- i + 1
}
values.num <- as.numeric(values) # to avoid character lists.
But now I can't seem to get them to form a simple two-column data frame.
I'm extremely new to R and I'm sure it's something simple I'm missing, but no amount of searching has shed any light on the matter.
Any suggestions on how to form a data frame from two lists or, more directly, how to sort two lists while keeping the relationship between them would be amazing.
Thanks.
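# unlist() turns the list of keys into a plain vector, so each key becomes one row
df <- data.frame(keys = unlist(keys), values = values.num)
# and then you can sort it by value, highest first, like this:
df[order(df$values, decreasing = TRUE), ]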
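As a minimal sketch of the same idea without the explicit loop, the keys and values can be pulled out of the environment in one step with ls() and mget(). The environment contents below (alice, bob, carol and their counts) are made-up illustration data, not taken from the question.

```r
# Hypothetical stand-in for the question's contactsMap environment
contactsMap <- new.env()
assign("alice", 3, envir = contactsMap)
assign("bob",   5, envir = contactsMap)
assign("carol", 1, envir = contactsMap)

# ls() returns the keys; mget() looks all of them up at once
keys   <- ls(contactsMap)
values <- unlist(mget(keys, envir = contactsMap))

# One row per key, then order by count with the highest first
df <- data.frame(key = keys, count = unname(values), stringsAsFactors = FALSE)
df <- df[order(df$count, decreasing = TRUE), ]

# Draw the bar plot only in an interactive session
if (interactive()) barplot(df$count, names.arg = df$key)
```

Because the keys and counts live in the same rows, sorting the data frame keeps each key attached to its value.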
I have data that I want to separate by date, I have managed to do this manually through:
tsssplit <- split(tss, tss$created_at)
and then creating dataframes for each list which I then use.
t1 <- tsssplit[[1]]
t2 <- tsssplit[[2]]
But I don't know how many splits I will need, as sometimes the original data frame may have 6 dates to split by, and sometimes it may have 5, etc. So I want to create a for loop.
Within the for loop, I want to incorporate this code, which connects to a function:
bscore3 <- score.sentiment(t3$cleaned_text,pos.words,neg.words,.progress='text')
score3 <- as.integer(bscore3$score[[1]])
Then I want to be able to create a new data frame that has the scores for each list.
So essentially I want the for loop to:
split the data into lists using split
split each list into a separate data frame for each different day
come out with a score for each data frame
put that into a new data frame
It doesn't have to be exactly like this as long as I can come up with a visualisation of the scores at the end.
Thanks!
It is not recommended to create separate data frames in the global environment, because they are difficult to keep track of. Put them in a list instead. You have started off well by using split and creating a list of data frames. You can then iterate over each data frame in the list and apply the function to each one of them.
Using by, this would look like:
by(tss, tss$created_at, function(x) {
bscore3 <- score.sentiment(x$cleaned_text,pos.words,neg.words,.progress='text')
score3 <- as.integer(bscore3$score[[1]])
return(score3)
}) -> result
result
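If the goal is a data frame of scores for plotting, a rough sketch with split() and sapply() could look like the following. The toy tss data and the toy_score() function are hypothetical stand-ins (the real score.sentiment() and word lists are not shown in the question), so treat this as an illustration of the pattern only.

```r
# Toy data standing in for tss (made-up values)
tss <- data.frame(
  created_at   = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"),
  cleaned_text = c("good day", "bad day", "good good", "bad"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for score.sentiment():
# number of positive words minus number of negative words
toy_score <- function(text, pos.words, neg.words) {
  words <- unlist(strsplit(text, " ", fixed = TRUE))
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

# One data frame per date, however many dates there are
tsssplit <- split(tss, tss$created_at)

# One score per date, returned as a named vector
scores <- sapply(tsssplit, function(d) toy_score(d$cleaned_text, "good", "bad"))

# Collect the scores into a data frame ready for plotting
score_df <- data.frame(created_at = names(scores), score = unname(scores))
```

The number of dates never has to be known in advance, because sapply() simply visits every element that split() produced.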
I am trying to write some kind of loop function that will allow me to apply the same set of code to dozens of data frames that are stored in one list. Each data frame has the same number of columns and identical headers for each column, though the number of rows varies across data frames.
This data comes from an egocentric social network study where I collected ego-network data in edgelist format from dozens of different respondents. The data collection software that I use stores the data from each interview in its own .csv file. Here is an image of the raw data for a specific data frame (image of raw data).
For my purposes, I only need to use data from the fourth, sixth, and seventh columns. Furthermore, I only need rows of data where the last column has values of 4, at which point the final column can be deleted entirely. The end result is a two-column data frame that represents relationships among pairs of people.
After reading in the data and storing it as an object, I ran the following code:
x100291 = `100291AlterPair.csv` #new object based on raw data
foc.altername = x100291$Alter.1.Name
altername = x100291$Alter.2.Name
tievalue = x100291$AlterPair_B
tie = tievalue
tie[(tie<4)] = NA
egonet.name = data.frame(foc.altername, altername, tievalue)
depleted.name = cbind(tie,egonet.name)
depleted.name = depleted.name[is.na(depleted.name[,1]) == F,]
dep.ego.name = data.frame(depleted.name$foc.altername, depleted.name$altername)
This produced the following data frame (image of final data). This is ultimately what I want.
Now I know that I could cut-and-paste this same set of code 100+ times and manually alter the file names, but I would prefer not to do that. Instead, I have stored all of my raw .csv files as data frames in a single list. I suspect that I can apply the same code across all of the data frames by using one of the apply commands, but I cannot figure it out.
Does anyone have any suggestions for how I might apply this basic code to a list of data frames so that I end up with a new list containing cleaned and reduced versions of the data?
Many thanks!
The logic can be simplified. Try creating a custom function and applying it over all data frames.
cleanDF <- function(mydf) {
  # Stop if any of the required columns is missing
  if (!all(c('AlterPair_B', 'Alter.1.Name', 'Alter.2.Name') %in%
           names(mydf))) stop("Check data frame names")
  # Keep only rows where the tie value is 4 or more
  condition <- mydf[, 'AlterPair_B'] >= 4
  mydf[condition, c("Alter.1.Name", "Alter.2.Name")]
}
big_list <- lapply(all_my_files, read.csv) #read in all data frames
result <- do.call('rbind', lapply(big_list, cleanDF))
The custom function cleanDF first checks that all the relevant column names are there. Then it defines the condition of 4 or more for 'AlterPair_B'. Lastly, it subsets the two target columns by that condition. I used a list called 'big_list' that represents all of the data frames; do.call('rbind', ...) then stacks the cleaned data frames into one.
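To end up with a new list of cleaned data frames (one per respondent) rather than one stacked data frame, the rbind step can simply be left out. Here is a small sketch on made-up data; the names toy_list, Ann, Bob and so on are invented for illustration, and wrapping the condition in which() is my own addition so that rows with a missing tie value are dropped.

```r
# Variant of the cleaning function; which() drops rows where the tie value is NA
cleanDF <- function(mydf) {
  needed <- c("AlterPair_B", "Alter.1.Name", "Alter.2.Name")
  if (!all(needed %in% names(mydf))) stop("Check data frame names")
  keep <- which(mydf[, "AlterPair_B"] >= 4)
  mydf[keep, c("Alter.1.Name", "Alter.2.Name")]
}

# Two toy edgelists standing in for the list of raw data frames
toy_list <- list(
  a = data.frame(Alter.1.Name = c("Ann", "Ann", "Bob"),
                 Alter.2.Name = c("Bob", "Cal", "Cal"),
                 AlterPair_B  = c(4, 2, 5)),
  b = data.frame(Alter.1.Name = c("Dee", "Eve"),
                 Alter.2.Name = c("Eve", "Fay"),
                 AlterPair_B  = c(1, 4))
)

# lapply() returns a list with the same names, each element cleaned
cleaned <- lapply(toy_list, cleanDF)
```

Each element of cleaned is a two-column data frame containing only the pairs whose tie value was 4 or more.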
You haven't provided a reproducible example, so it's hard to solve your problem. However, I don't want your question to remain unanswered. It is true that using lapply would be a fast solution, usually preferable to a loop. However, since you mentioned being a beginner, here's how to do that with a loop, which is easier to understand.
You need to put all your csv files in a single folder with nothing else. Then, you read the filenames and put them in a list. You initialize an empty result object with NULL. You then read all your files in a loop, do calculations and rbind the results in the result object.
path <-"C:/temp/csv/"
list_of_csv_files <- list.files(path)
result <- NULL
for (filenames in list_of_csv_files) {
input <- read.csv(paste0(path,filenames), header=TRUE, stringsAsFactors=FALSE)
#Do your calculations
input_with_calculations <- input
result <- rbind(result,input_with_calculations)
}
result
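For comparison, the lapply version of the same read-and-combine pattern might look roughly like this. The sketch writes two small made-up CSV files into a temporary folder (csv_demo is an invented name) so that it can run anywhere; in practice csv_dir would point at the real folder.

```r
# Create a throwaway folder with two small CSV files (illustration only)
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(csv_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(csv_dir, "b.csv"), row.names = FALSE)

# full.names = TRUE returns complete paths; pattern keeps only .csv files
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list, then stack the pieces once at the end
pieces <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
result <- do.call(rbind, pieces)
```

Stacking once at the end avoids growing result inside the loop, which tends to matter when there are many files.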
I have inputted some data consisting of three columns, X,Y and Group.
I am looking to get the underlying data for a Voronoi diagram for each group.
By using
a=deldir(Test.data$X,Test.data$Y,rw=c(0,1,0,1))
I successfully create the Voronoi data for the entire dataset. However, I do not know how to iterate this process over the different groups that I have in the dataset.
Does anyone have any ideas? I have experience with the ggplot function and know that there I can simply add a third dimension, something like
ggplot(Test.data,aes(x=X,y=Y,colour=Group))
Is there a way I can get a similar effect with the deldir() function?
Thanks in advance for your help.
Ben
Consider creating a vector of the distinct groups and then filtering the dataset by each one. Below, lapply() creates a list of deldir objects, one for each distinct group:
groups <- unique(Test.data$Group)
deldirList <- lapply(groups, function(g) {
  # Keep only the rows belonging to the current group
  temp <- Test.data[Test.data$Group == g, ]
  deldir(temp$X, temp$Y, rw = c(0, 1, 0, 1))
})
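A roughly equivalent sketch uses split(), which has the side benefit of naming each result after its group. The Test.data below is randomly generated illustration data, and the deldir call is guarded so the snippet still runs when the deldir package is not installed.

```r
set.seed(42)
# Made-up points in the unit square, ten per group
Test.data <- data.frame(
  X     = runif(30),
  Y     = runif(30),
  Group = rep(c("a", "b", "c"), each = 10)
)

# One data frame per group, named by group
by_group <- split(Test.data, Test.data$Group)

if (requireNamespace("deldir", quietly = TRUE)) {
  # One deldir object per group, kept in a named list
  deldirList <- lapply(by_group, function(d) {
    deldir::deldir(d$X, d$Y, rw = c(0, 1, 0, 1))
  })
  # For example, deldirList[["a"]]$dirsgs holds the Voronoi segments for group "a"
}
```

Keeping the group names on the list makes it easier to match each tessellation back to its group later.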
I'm wondering whether there is a way to do in-place modification of objects in a list without using a for loop. This would be useful, for example, if the individual objects in the list are large and complex, so that we want to avoid making a temporary copy of the entire object. As an example, consider the following code, which creates a list of three data frames, then calculates the vector of maximums across all three data frames for one column of the data, and then assigns that vector to each original data frame. (Code like this is needed when aligning plots in ggplot2.)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))
for( i in 1:length(data_list))
{
data_list[[i]]$x <- max_x
}
Is there any way to write the final part without a for loop?
Answers to some of the questions I'm getting:
What makes me think a copy would be made? I don't know for sure whether a copy would or would not be made. The actual scenario I'm working with deals with entire ggplot graphs (see e.g. here). Since they are rather large and complex, it's critical that no copy be made.
What's the problem with a for loop? I just would rather iterate directly over a list than have to introduce a counter. I don't like counters.
Why not use data.table? Because I'm actually manipulating ggplot graphs, not data frames. The code provided here is just a simplified example.
Base R data structures are copy-on-modify with sharing. Take your example of a data.frame with three numeric columns. Each data.frame is a length 3 "list" vector, each containing a reference to the numeric vectors of the underlying columns. If we modify/replace the first column, R creates a new length 3 data.frame "list" containing references to the new(ly modified) column and the other two unmodified columns.
Let's take a look using the address function*
set.seed(1)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
before <- rapply(data_list,address)
Now you want to replace the first column with
max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))
How you do this doesn't much matter, but here's one way without an explicit loop-with-counter
data_list <- lapply(data_list,`[<-`,"x",value=max_x)
after <- rapply(data_list,address)
Now compare the addresses before and after. Note that the addresses for the y and z columns have not changed. Furthermore, all "after" x columns have the same address -- the address of max_x!
address(max_x)
[1] "05660600"
cbind(before,after)
before after
x "0565F530" "05660600"
y "0565F400" "0565F400"
z "05660AC0" "05660AC0"
x "05660A28" "05660600"
y "05660990" "05660990"
z "05660860" "05660860"
x "056607C8" "05660600"
y "05660730" "05660730"
z "05660698" "05660698"
This means you don't have to worry as much as you might think about making a change to a large data structure. In general, only the modified piece and the skeleton of the data structure will have to be replaced. In this example, the max_x vector had to be created anyway, so the only overhead is creating a new 3 cell data.frame "list" and populating it with 3 references**. This, however, could start to become inefficient if you are iteratively "banging on" changes or working with subvectors rather than entire columns. These are use cases for data.table that are not applicable to this example.
* The address function used here is exported from the data.table package.
** And, of course, in this example, the 3 cell outer list "list" containing the 3 data.frames themselves.
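For readers who find the `[<-` call hard to read, here is a sketch of the same replacement written with an anonymous function, plus a check that only the x column changed. The address comparison at the end assumes the data.table package is available and is skipped otherwise; no particular addresses are implied.

```r
set.seed(1)
data_list <- lapply(1:3, function(i) {
  data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
})
max_x <- do.call(pmax, lapply(data_list, function(d) d$x))

# Replace the x column in every data frame, with no counter variable
new_list <- lapply(data_list, function(d) {
  d$x <- max_x
  d
})

# Every x column now equals max_x; y and z keep their original values
all_x_replaced <- all(vapply(new_list, function(d) identical(d$x, max_x), logical(1)))
y_unchanged <- all(mapply(function(a, b) identical(a$y, b$y), new_list, data_list))

if (requireNamespace("data.table", quietly = TRUE)) {
  # Compare memory addresses of an untouched column before and after
  print(data.table::address(new_list[[1]]$y) == data.table::address(data_list[[1]]$y))
}
```

The modification happens on the function's local d, so the original data_list is left as it was unless the result is assigned back to it.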
My data frame (m*n) has a few hundred columns. I need to compare each column with all other columns (contingency table), perform a chi-squared test, and save the results for each column in a different variable.
It's working for one column at a time, like this:
s <- function(x) {
a <- table(x,data[,1])
b <- chisq.test(a)
}
c1 <- apply(data,2,s)
The results are stored in c1 for column 1, but how will I loop this over all columns and save result for each column for further analysis?
If you're sure you want to do this (I wouldn't, thinking about the multiple-testing problem), work with lists:
Data <- data.frame(
x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
z=sample(letters[1:3],20,TRUE)
)
# Make a nice list of indices
ids <- combn(names(Data),2,simplify=FALSE)
# use the appropriate apply
my.results <- lapply(ids,
function(z) chisq.test(table(Data[,z]))
)
# use some paste voodoo to give the results the names of the column indices
names(my.results) <- sapply(ids,paste,collapse="-")
# select all values for y :
my.results[grep("y",names(my.results))]
Not harder than that. As I show you in the last line, you can easily get all tests for a specific column, so there is no need to make a list for each column. That just takes longer and takes more space, but gives the same information. You can write a small convenience function to extract the data you need :
extract <- function(col,l){
l[grep(col,names(l))]
}
# the names look like "x-y" and "y-z", so match on the column name itself
extract("y",my.results)
This means you can even loop over the different column names of your data frame and get a list of lists returned:
lapply(names(Data),extract,my.results)
I strongly suggest you get yourself acquainted with working with lists, they're one of the most powerful and clean ways of doing things in R.
PS : Be aware that you save the whole chisq.test object in your list. If you only need the value for Chi square or the p-value, select them first.
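As a small sketch of that last point, the p-values can be pulled out of each test with sapply(). The data here is randomly generated for illustration, and suppressWarnings() is only there because chisq.test() may warn about small expected counts on toy data of this size.

```r
set.seed(123)
# Made-up categorical data with three columns
Data <- data.frame(
  x = sample(letters[1:3], 90, TRUE),
  y = sample(letters[1:3], 90, TRUE),
  z = sample(letters[1:3], 90, TRUE)
)

# All pairs of column names
ids <- combn(names(Data), 2, simplify = FALSE)

# Keep only the p-value of each test instead of the whole object
p_values <- sapply(ids, function(pair) {
  suppressWarnings(chisq.test(table(Data[, pair]))$p.value)
})
names(p_values) <- sapply(ids, paste, collapse = "-")
```

The result is a plain named numeric vector, which is much lighter than a list of full chisq.test objects and easy to pass on to p.adjust() if a multiple-testing correction is wanted.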
Fundamentally, you have a few problems here:
You're relying heavily on global variables rather than local ones. This makes the double usage of "data" confusing.
Similarly, you rely on a hard-coded value (column 1) instead of passing it as an argument to the function.
You're not extracting the one value you need from the chisq.test(). This means your result gets returned as a list.
You didn't provide any example data. So here's some:
m <- 10
n <- 4
mytable <- matrix(runif(m*n),nrow=m,ncol=n)
Once you fix the above problems, simply run a loop over various columns (since you've now avoided hard-coding the column) and store the result.
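One possible shape for that loop is sketched below. It uses made-up categorical data rather than the runif matrix above (a chi-squared test needs categories), and the names mydata, pair_p and pmat are invented for this illustration.

```r
set.seed(1)
# Made-up categorical data, one column per variable
mydata <- data.frame(
  a = sample(c("u", "v"), 50, TRUE),
  b = sample(c("u", "v"), 50, TRUE),
  c = sample(c("u", "v", "w"), 50, TRUE)
)

# The data frame and both column indices are arguments, nothing is global;
# only the p-value is returned, not the whole test object
pair_p <- function(df, i, j) {
  suppressWarnings(chisq.test(table(df[[i]], df[[j]]))$p.value)
}

# Matrix to hold one p-value per pair of columns
n <- ncol(mydata)
pmat <- matrix(NA_real_, n, n, dimnames = list(names(mydata), names(mydata)))

for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) pmat[i, j] <- pair_p(mydata, i, j)
  }
}
```

Row i of pmat then holds the results for column i against every other column, which covers the "save the results for each column" part of the question in a single object.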