I have two separate dataframes each for one speaker of an interacting dyad. They have different amounts of talk-turns (rows) which is why I keep them in separate files for now.
In order to run my final analyses I need identical number of rows for each speaker.
So what I want to do is compare dyad_id 1 in both data frames and then shorten the longer list for one by deleting the last row for all columns.
I prepared a data frame to illustrate what I already have.
So far, I tried to split the data frame by the dyad_id in both data sets to now compare the splits one after another and delete the unnecessary rows. As I have various conversations, I need to automate this to go through all dyad_ids one after another.
I hope someone can help me, I am completely lost.
dyad_id_A <- c(1,1,1,2,2,2,2,3,3,3,3,3)
fw_quantiles_a <- c(4,3,1,2,3,2,4,1,4,5,6,7)
df_A<- data.frame(dyad_id_A,fw_quantiles_a)
dyad_id_B <- c(1,1,1,1,2,2,2,3,3,3,3)
fw_quantiles_b <- c(3,1,2,1,2,4,1,3,3,4,5)
df_B <- data.frame(dyad_id_B,fw_quantiles_b)
example for final dataset
dyad_id_AB <- c(1,1,1,2,2,2,3,3,3,3)
What I tried so far:
split_conv_A = split(df_A, list(df_A$dyad_id_A))
split_conv_B = split(df_B, list(df_B$dyad_id_B))
Add a time counter within each dyad_id_x group and then merge together:
df_A$time <- ave(df_A$dyad_id_A, df_A$dyad_id_A, FUN=seq_along)
df_B$time <- ave(df_B$dyad_id_B, df_B$dyad_id_B, FUN=seq_along)
merge(
df_A, df_B,
by.x=c("dyad_id_A","time"), by.y=c("dyad_id_B","time")
)
# dyad_id_A time fw_quantiles_a fw_quantiles_b
#1 1 1 4 3
#2 1 2 3 1
#3 1 3 1 2
#4 2 1 2 2
#5 2 2 3 4
#6 2 3 2 1
#7 3 1 1 3
#8 3 2 4 3
#9 3 3 5 4
#10 3 4 6 5
Maybe we can try using table to calculate frequncies of id's in both the dataframe assuming you have the same id's in both the dataframe. Calculate the minimum between them using pmin and repeat the names based on the frequency.
tab <- pmin(table(df_A$dyad_id_A), table(df_B$dyad_id_B))
as.integer(rep(names(tab), tab))
# [1] 1 1 1 2 2 2 3 3 3 3
How can I find the 5 highest values of a column in a data frame
I tried the order() function but it gives me only the indices of the rows, wherease I need the actual data from the column. Here's what I have so far:
tail(order(DF$column, decreasing=TRUE),5)
You need to pass the result of order back to DF:
DF <- data.frame( column = 1:10,
names = letters[1:10])
order(DF$column)
# 1 2 3 4 5 6 7 8 9 10
head(DF[order(DF$column),],5)
# column names
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
You're correct that order just gives the indices. You then need to pass those indices to the data frame, to pick out the rows at those indices.
Also, as mentioned in the comments, you can use head instead of tail with decreasing = TRUE if you'd like, but that's a matter of taste.
This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and xis the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to the proc expand.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
DF[is.na(DF$x),"x"] <- 0
DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656
I would like to rename a large number of columns (column headers) to have numerical names rather than combined letter+number names. Because of the way the data is stored in raw format, I cannot just access the correct column numbers by using data[[152]] if I want to interact with a specific column of data (because random questions are filtered completely out of the data due to being long answer comments), but I'd like to be able to access them by data$152. Additionally, approximately half the columns names in my data have loaded with class(data$152) = NULL but class(data[[152]]) = integer (and if I rename the data[[152]] file it appropriately allows me to see class(data$152) as integer).
Thus, is there a way to use the loop iteration number as a column name (something like below)
for (n in 1:415) {
names(data)[n] <-"n" # name nth column after number 'n'
}
That will reassign all my column headers and ensure that I do not run into question classes resulting in null?
As additional background info, my data is imported from a comma delimited .csv file with the value 99 assigned to answers of NA with the first row being the column names/headers
data <- read.table("rawdata.csv", header=TRUE, sep=",", na.strings = "99")
There are 415 columns with headers in format Q001, Q002, etc
There are approximately 200 rows with no row labels/no label column
You can do this without a loop, as follows:
names(data) <- 1:415
Let me illustrate with an example:
dat <- data.frame(a=1:4, b=2:5, c=3:6, d=4:7)
dat
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Now rename the columns:
names(dat) <- 1:4
dat
1 2 3 4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
EDIT : How to access your new data
#Ramnath points out very accurately that you won't be able to access your data using dat$1:
dat$1
Error: unexpected numeric constant in "dat$1"
Instead, you will have to wrap the column names in backticks:
dat$`1`
[1] 1 2 3 4
Alternatively, you can use a combination of character and numeric data to rename your columns. This could be a much more convenient way of dealing with your problem:
names(dat) <- paste("x", 1:4, sep="")
dat
x1 x2 x3 x4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7