My original question was poorly defined and confusing. Here it is again with extraneous columns removed for clarity and additional background added.
My end goal is to create a force-directed network graph using networkD3. To do this I need two dataframes.
The first dataframe (dfNodes) lists each node in the graph and its grouping. Volt, Miata, and Prius are cars, so they get the group value 1. Crusader is a bus, so it is in group 3, and so on.
dfNodes dataframe:
id name group
0 vehicle 0
1 car 1
2 truck 2
3 bus 3
4 volt 1
5 miata 1
6 prius 1
7 tahoe 2
8 suburban 2
9 crusader 3
From this dataframe I need to construct the dataframe dfLinks to provide the linkages between nodes. It must look like:
dfLinks dataframe:
source target
0 4
0 5
0 6
0 7
0 8
0 9
1 4
1 5
1 6
2 7
2 8
3 9
This shows the following:
vehicle is linked to volt, miata, prius, tahoe, suburban, crusader (they are all vehicles). (0, 4; 0, 5...0, 9)
car is linked to volt, miata, prius (1, 4; 1, 5; 1, 6)
truck is linked to tahoe, suburban (2, 7; 2, 8)
bus is linked to crusader (3, 9)
It may appear strange to link vehicle --> model name (volt...crusader) --> type (car/bus/truck) instead of vehicle --> type --> model name, but that is the form I need for my graph.
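For reference, once both dataframes exist they would feed networkD3 roughly like this (a sketch of the end goal, not code I have run here; forceNetwork matches the zero-based source/target values to row positions in Nodes, which is why the id column starts at 0, and the constant value column just sets link width, as in the package examples):
library(networkD3)
# Sketch: plot the force-directed graph from dfNodes and dfLinks
dfLinks$value <- 1
forceNetwork(Links = dfLinks, Nodes = dfNodes,
             Source = "source", Target = "target", Value = "value",
             NodeID = "name", Group = "group", opacity = 0.8)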
I don't understand the criteria for creating the second dataframe; for example, 'car' appears three times in the second data frame but is present four times in the first. If you can provide more precise details about the transformation, I can try to help more, but my first suggestion would be to look at the melt() and cast() functions in the reshape package:
http://www.statmethods.net/management/reshape.html
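As a generic illustration of what melt() does (made-up data, just to show the wide-to-long reshape, not your transformation):
library(reshape)
# melt() turns wide data into long key/value form
wide <- data.frame(id = 1:2, a = c(10, 20), b = c(30, 40))
melt(wide, id = "id")
#   id variable value
# 1  1        a    10
# 2  2        a    20
# 3  1        b    30
# 4  2        b    40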
Based on the replies from Phil and aosmith I revisited my question. I have the luxury of creating the source dataframe (dfNodes), so I approached the problem from further upstream: first building dfNodes by combining smaller dataframes, then building dfLinks from those same pieces rather than deriving it from dfNodes. My approach feels like a kludge, but it gives me the result I am looking for. Here is the code. Other approaches, advice and criticism welcomed! As you can tell, I am new to R.
# Separate dataframes for the different categories of data
vehicle <- data.frame(source=0, name="vehicle", group=0)
carGroup <- data.frame(source=1, name="car", group=1)
truckGroup <- data.frame(source=2, name="truck", group=2)
busGroup <- data.frame(source=3, name="bus", group=3)
cars <- data.frame(source = c(4, 5, 6),
                   name = c("volt", "miata", "prius"),
                   group = c(1, 1, 1))
trucks <- data.frame(source = c(7, 8),
                     name = c("tahoe", "suburban"),
                     group = c(2, 2))
buses <- data.frame(source = 9,
                    name = "crusader",
                    group = 3)
# 1. Build the dfNodes dataframe
dfNodes <- rbind(vehicle, carGroup, truckGroup, busGroup, cars, trucks, buses)
names(dfNodes)[1] <- "id"
dfNodes
# 2. Build dfLinks dataframe. Only source and target columns needed
# Bind Vehicle to the 3 different vehicle dataframes
vehicleToCars <- cbind(vehicle, cars)
vehicleToTrucks <- cbind(vehicle, trucks)
vehicleToBuses <- cbind(vehicle, buses)
# Bind the Vehicle Groups to the vehicles
carGroupToCars <- cbind(carGroup, cars)
truckGroupToTrucks <- cbind(truckGroup, trucks)
busGroupToBuses <- cbind(busGroup, buses)
# Stack into the final dfLinks dataframe
dfLinks <- rbind(vehicleToCars, vehicleToTrucks, vehicleToBuses,
                 carGroupToCars, truckGroupToTrucks, busGroupToBuses)
names(dfLinks)[4] <- "target"
# cbind kept all six columns; keep only the two the graph needs
dfLinks <- dfLinks[, c("source", "target")]
dfLinks
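For anyone comparing approaches, dfLinks could also be derived from dfNodes itself rather than rebuilt from the pieces. This is only a sketch based on the structures above, and the rows come out in a different order than the hand-built version:
# Model-name nodes are ids 4-9; type nodes (car/truck/bus) are ids 1-3
models <- dfNodes[dfNodes$id >= 4, ]
types  <- dfNodes[dfNodes$id %in% 1:3, ]
# Every model links back to the root "vehicle" node (id 0)
rootLinks <- data.frame(source = 0, target = models$id)
# Each model links to the type node sharing its group value
typeLinks <- merge(types, models, by = "group")
typeLinks <- data.frame(source = typeLinks$id.x, target = typeLinks$id.y)
dfLinks2 <- rbind(rootLinks, typeLinks)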
I have data in a list extracted from Bayesian processing of recordings from a set of electrodes, and I want to populate a dataframe from a loop. I have a list (bayeslist) of 729 processing outcomes and an object elecs, which is essentially a table of 729 electrode pairs (27*27), as you can see:
> head(elecs)
X Elec1 Elec2
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 5 1 5
6 6 1 6
The thing is, I would like to fill dataf1 with the outcome of this loop, where each iteration produces a dataframe of 4000 rows.
dataf1 <- data.frame('Elec1'=rep(NA,4000*729),'Elec2'=rep(NA,4000*729),'int'=rep(NA,4000*729))
for (i in nrow(elecs)){
  Elec1 <- as.data.frame(rep(elecs[i,]$Elec1, 4000))
  Elec2 <- as.data.frame(rep(elecs[i,]$Elec2, 4000))
  post <- posterior_samples(bayeslist[[i]])
  int <- as.data.frame(post$b_Intercept)
  df <- cbind(Elec1, Elec2, int)
  colnames(df) <- c('Elec1', 'Elec2', 'int')
  dataf1[(1+(i-1)*4000):((1+(i-1)*4000)+3999), c('Elec1','Elec2','int')] <- df
}
Everything works perfectly fine until the last line in the loop:
dataf1[(1+(i-1)*4000):((1+(i-1)*4000)+3999),c('Elec1','Elec2','int')] <- df
I don't know exactly why this is not working as expected and populating the pre-initialised dataf1 dataframe.
Any insight, as always, will be highly appreciated.
I realised I was missing the sequence in the for statement, so it's kind of a newbie typo. Apart from that, the code works, in case anyone is wondering. The fix:
for (i in nrow(elecs)){      # wrong: runs the loop once, with i = nrow(elecs)
for (i in 1:nrow(elecs)){    # right: iterates over every row
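For what it's worth, the pre-allocation and row arithmetic can be avoided altogether by building each 4000-row block in a list and stacking once at the end. A sketch assuming the same elecs and bayeslist, with 4000 posterior draws per model:
# Build one block per electrode pair, then bind them all at once
blocks <- lapply(seq_len(nrow(elecs)), function(i) {
  post <- posterior_samples(bayeslist[[i]])
  data.frame(Elec1 = rep(elecs$Elec1[i], 4000),
             Elec2 = rep(elecs$Elec2[i], 4000),
             int   = post$b_Intercept)
})
dataf1 <- do.call(rbind, blocks)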
I have two sets of data, which correspond to different experiment tasks that I want to merge for analysis. The problem is that I need to search and match up certain rows for particular stimuli and for particular participants. I'd like to use a script to save some trouble. This is probably quite simple, but I've never done it before.
Here's my problem more specifically:
In the first data set, each row corresponds to a two-alternative forced choice task where two stimuli are presented at a time and the participant selects one. In the second data set, each row corresponds to a single-item task where the participant is asked if they have ever seen the stimulus before. The stimuli in the second task match the stimuli in the pairs of the first task (so it has twice as many rows). I want to match these up and add two columns to the first dataset: one that states whether the left-side item was recognized later, and one for the right-side stimulus.
I assume this could be done with nested loops, but I'm not sure if there is an elegant way to do this, or perhaps a package.
As I understand it, your first dataset looks something like this:
(dat1 <- data.frame(person=1:2, stim1=1:2, stim2=3:4))
# person stim1 stim2
# 1 1 1 3
# 2 2 2 4
This would mean person 1 got stimuli 1 and 3 and person 2 got stimuli 2 and 4. Then your second dataset looks something like this:
(dat2 <- data.frame(person=c(1, 1, 2, 2), stim=c(1, 3, 4, 2), responded=c(0, 1, 0, 1)))
# person stim responded
# 1 1 1 0
# 2 1 3 1
# 3 2 4 0
# 4 2 2 1
This gives information about how each person responded to each stimulus they were given.
You can merge these two by matching person/stimulus pairs with the match function:
dat1$response1 <- dat2$responded[match(paste(dat1$person, dat1$stim1),
                                       paste(dat2$person, dat2$stim))]
dat1$response2 <- dat2$responded[match(paste(dat1$person, dat1$stim2),
                                       paste(dat2$person, dat2$stim))]
dat1
# person stim1 stim2 response1 response2
# 1 1 1 3 0 1
# 2 2 2 4 1 0
Another option (starting from the original dat1 and dat2) would be to merge twice with the merge function. You have a little less control over the names of the output columns, but it requires a bit less typing:
merged <- merge(dat1, dat2, by.x=c("person", "stim1"), by.y=c("person", "stim"))
merged <- merge(merged, dat2, by.x=c("person", "stim2"), by.y=c("person", "stim"))
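One small caveat (my note, not part of the original answer): the second merge suffixes the duplicated responded columns as responded.x and responded.y, and the column order changes, so a final rename keeps things readable:
# merge() adds .x/.y suffixes to the duplicated column name from dat2
names(merged)[names(merged) == "responded.x"] <- "response1"
names(merged)[names(merged) == "responded.y"] <- "response2"
merged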
Suppose I have a dataset with the following information:
1) Number (of products bought, for example): 1 2 3
2) Frequency for each number (e.g., how many people purchased that number of products): 2 5 10
Let's say I have the above information for each of the 2 groups: control and test data.
How do I format the data such that it would look like this:
controldata<-c(1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
(each number * frequency listed as a vector)
testdata<- (similar to above)
so that I can perform the two independent sample t-test on R?
If I don't even need to make them a vector, or there's an alternative clever way to format the data for the t-test, please let me know!
It would be simple if the vectors were small like the ones above, but the frequency can exceed 10,000 for each number.
P.S.
Control and test data have a different sample size.
Thanks!
Use rep. Using your data from above:
rep(c(1, 2, 3), c(2, 5, 10))
# [1] 1 1 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Or, for your case (with n_bought holding the numbers and frequency their counts):
control_data <- rep(n_bought, frequency)
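From there the two-sample t-test is direct. A quick sketch with the control frequencies from the question and made-up frequencies for the test group:
control_data <- rep(c(1, 2, 3), c(2, 5, 10))
test_data    <- rep(c(1, 2, 3), c(4, 6, 3))  # hypothetical test-group frequencies
# Welch two-sample t-test; unequal group sizes are fine
t.test(control_data, test_data)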
I am a relatively new R user, and most of the complex coding (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl) and I have only used R for very simple manipulations in the past (basic loading data from file, subsetting, ANOVA/T-Test). However, I am working on a project where I had no control over the data layout and the data file is very lengthy.
In my data, I have 172 rows, one per participant in a survey, and 158 columns, one per question. The answers for each are 1-5. The raw data uses the value 99 to indicate that a question was not answered. I need to exclude any unanswered questions without excluding the entire participant.
Part Q001 Q002 Q003 Q004
1 2 4 99 2
2 3 99 1 3
3 4 4 2 5
4 99 1 3 2
5 1 3 4 2
In the past I have used subset to filter my data:
data.filter <- subset(data, Q001 != 99)
which works fine when all my answers are contained in one column, since deleting the whole row where the answer is unavailable is acceptable.
However, with the answers in this set spread across 158 columns, subsetting out 99 in column 1 (Q001) also filters out that entire participant.
I'd like to know if there is a way to filter the data so that the large data set ends up with blanks wherever a 99 occurred, so the 99s do not inflate or otherwise interfere with the statistics I run on the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and t-tests on various questions.
Resp Q001 Q002 Q003 Q004
1    2    4         2
2    3         1    3
3    4    4    2    5
4         1    3    2
5    1    3    4    2
Is this possible to do in R? I've tried to filter the file before loading it into R, but R won't read the data file when it has blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time consuming if there is a better function or package to use).
Any assistance would be greatly appreciated!
You could replace the 99s with NA and then calculate the column means omitting NAs:
df <- replicate(20, sample(c(1, 2, 3, 99), 4))
colMeans(df) # wrong: the 99s inflate the means
dfc <- df
dfc[dfc == 99] <- NA
colMeans(dfc, na.rm = TRUE)
You can also mark which values should be read as NA when you load your data. For your particular case:
mydata <- read.table('dat_base', header = TRUE, na.strings = "99")
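Once the 99s are stored as NA, the statistics mentioned in the question only need NA handling. A sketch assuming mydata has the Part/Q001... layout shown above:
# Per-question means, ignoring unanswered items (drop the Part column)
colMeans(mydata[, -1], na.rm = TRUE)
# t.test drops NAs from each sample; aov/lm default to na.action = na.omit
t.test(mydata$Q001, mydata$Q002)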
We are working on a social capital project, so our data set has a list of each individual's organizational memberships. Each person gets a numeric ID and then a sub-ID for each group they are in. The unit of analysis, therefore, is the group they are in. One of our variables is a three-point scale for the type of group. Sounds simple enough?
We want to bring the unit of analysis up to the individual level and condense group type into a variable signifying how many different types of groups each person is in.
For instance, person one is in eight groups. Of those groups, three are type 1, three are type 2, and two are type 3. Ideally, the individual-level variable would be 3, because she is in all three types of groups.
Is this possible in the least?
## simulate data
## individuals
n <- 10
## groups
g <- 5
## group types
gt <- 3
## individuals*group membership
N <- 20
## individuals data frame
di <- data.frame(individual = sample(1:n, N, replace = TRUE),
                 group = sample(1:g, N, replace = TRUE))
## groups data frame
dg <- data.frame(group=1:g, type=sample(1:gt,g,replace=TRUE))
## merge
dm <- merge(di,dg)
## order - not necessary, but nice
dm <- dm[order(dm$individual),]
## group type per individual
library(plyr)
dr <- ddply(dm, "individual", function(x) length(unique(x$type)))
> head(dm)
group individual type
2 2 1 2
8 2 1 2
20 5 1 1
9 3 3 2
12 3 3 2
17 4 3 2
> head(dr)
individual V1
1 1 2
2 3 1
3 4 2
4 5 1
5 6 1
6 7 1
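The same count also works in base R without the plyr dependency (a sketch on the same dm):
# Count distinct group types per individual
aggregate(type ~ individual, data = dm,
          FUN = function(x) length(unique(x)))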
I think what you're asking is whether it is possible to count the number of unique types of group to which an individual belongs.
If so, then that is certainly possible.
I can't tell you how to do it in R, since I don't know much R and I don't know what your data looks like, but there's no reason why it wouldn't be possible.
Is this data coming from a database? If so, then it might be easier to write a SQL query to compute the value you want, rather than to do it in R. If you describe your schema, there should be lots of people here who could give you the query you need.
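For what it's worth, if the data is (or can be treated as) a table, the SQL version of the count is one line. A sketch using the sqldf package on the dm frame built in the earlier answer:
library(sqldf)
# COUNT(DISTINCT ...) mirrors length(unique(...))
sqldf("SELECT individual, COUNT(DISTINCT type) AS n_types
       FROM dm GROUP BY individual")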