Apply the same calculation in different data frames in R - r

I am trying to loop over many data frames in R and I feel like this is a rather basic question. However, I only found similar questions that were solved with specific functions that don't match my problem (like calculating means or medians, changing column names, ...). I hope to find a more general solution that can be applied for any change or calculation in various data frames here.
I have a lot (about 500) of data frames that look somewhat like this (very simplified):
df0100
a b c d
1 4 3 5 NA
2 2 5 4 NA
3 4 4 3 NA
...
df0130
a b c d
1 3 2 3 NA
2 4 5 3 NA
3 4 3 2 NA
...
For each of them, I want to calculate a new value (also simplified here) from the values in a and c in the first row and insert the value in any row in column d. It works fine like this for a single data frame:
df0100$d <- ((df0100[1,1]*(df0100[1,3]+13.5)/(3*exp(df0100[1,3]))/100
which leads to
df0100
a b c d
1 4 3 5 36.60858
2 2 5 4 36.60858
3 4 4 3 36.60858
....
Since I don't want to do this for every single of the 500 data frames, I saved them as a list and tried to loop over them as follows. I thought the easiest way would be to replace the former 'df0100' by each data frame name but both versions didn't work. Can anyone tell me what I have to change?
my_files <- list.files(pattern=".csv")
my_data <- lapply(my_files, read.csv)
Version 1:
for (n in my_data)
{
n$d <- ((n[1,1]*(n[1,3]+13.5)/(3*exp(n[1,3]))/100
}
Version 2:
my_data <- lapply(my_data, function(n){
n$d <- ((n[1,1]*(n[1,3]+13.5)/(3*exp(n[1,3]))/100
})
This is my first question here, I hope it makes sense to you.

Related

Compare lists in dataframes based on personal code, shorten one lists if longer

I have two separate dataframes each for one speaker of an interacting dyad. They have different amounts of talk-turns (rows) which is why I keep them in separate files for now.
In order to run my final analyses I need identical number of rows for each speaker.
So what I want to do is compare dyad_id 1 in both data frames and then shorten the longer list for one by deleting the last row for all columns.
I prepared a data frame to illustrate what I already have.
So far, I tried to split the data frame by the dyad_id in both data sets to now compare the splits one after another and delete the unnecessary rows. As I have various conversations, I need to automate this to go through all dyad_ids one after another.
I hope someone can help me, I am completely lost.
dyad_id_A <- c(1,1,1,2,2,2,2,3,3,3,3,3)
fw_quantiles_a <- c(4,3,1,2,3,2,4,1,4,5,6,7)
df_A<- data.frame(dyad_id_A,fw_quantiles_a)
dyad_id_B <- c(1,1,1,1,2,2,2,3,3,3,3)
fw_quantiles_b <- c(3,1,2,1,2,4,1,3,3,4,5)
df_B <- data.frame(dyad_id_B,fw_quantiles_b)
example for final dataset
dyad_id_AB <- c(1,1,1,2,2,2,3,3,3,3)
What I tried so far:
split_conv_A = split(df_A, list(df_A$dyad_id_A))
split_conv_B = split(df_B, list(df_B$dyad_id_B))
Add a time counter within each dyad_id_x group and then merge together:
df_A$time <- ave(df_A$dyad_id_A, df_A$dyad_id_A, FUN=seq_along)
df_B$time <- ave(df_B$dyad_id_B, df_B$dyad_id_B, FUN=seq_along)
merge(
df_A, df_B,
by.x=c("dyad_id_A","time"), by.y=c("dyad_id_B","time")
)
# dyad_id_A time fw_quantiles_a fw_quantiles_b
#1 1 1 4 3
#2 1 2 3 1
#3 1 3 1 2
#4 2 1 2 2
#5 2 2 3 4
#6 2 3 2 1
#7 3 1 1 3
#8 3 2 4 3
#9 3 3 5 4
#10 3 4 6 5
Maybe we can try using table to calculate frequncies of id's in both the dataframe assuming you have the same id's in both the dataframe. Calculate the minimum between them using pmin and repeat the names based on the frequency.
tab <- pmin(table(df_A$dyad_id_A), table(df_B$dyad_id_B))
as.integer(rep(names(tab), tab))
# [1] 1 1 1 2 2 2 3 3 3 3

Finding the top values in data frame using r

How can I find the 5 highest values of a column in a data frame
I tried the order() function but it gives me only the indices of the rows, wherease I need the actual data from the column. Here's what I have so far:
tail(order(DF$column, decreasing=TRUE),5)
You need to pass the result of order back to DF:
DF <- data.frame( column = 1:10,
names = letters[1:10])
order(DF$column)
# 1 2 3 4 5 6 7 8 9 10
head(DF[order(DF$column),],5)
# column names
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
You're correct that order just gives the indices. You then need to pass those indices to the data frame, to pick out the rows at those indices.
Also, as mentioned in the comments, you can use head instead of tail with decreasing = TRUE if you'd like, but that's a matter of taste.

Merging data frames of different row length in R [duplicate]

This question already has answers here:
Merge unequal dataframes and replace missing rows with 0
(7 answers)
Closed 8 years ago.
Hello I have been looking for a solution for quite some time. I'm sure the answer is easy but I've been pulling my hair out here!
I have two data frames that are similar (in fact one represents a more complete dataset). They both have two columns, one containing string values as a factor and one containing numerical values.
df.A looks like this:
Category Number
A 1
B 2
C 3
D 4
and df.B looks like this
Category Number
A 5
B 6
C 7
These categories (ABCD) are common between the two dataframes. In trying to get df.B to have a category D with a NA or 0 value (I am working with percentages so either NA or 0 is fine), my code looks like this:
proto <- df.A
proto$number <- NULL
df.B <- rbind.fill(proto,df.B)
My thought is this would add the fourth row for category D and give NA value but instead results in
Category Number
A NA
B NA
C NA
D NA
NA 5
NA 6
NA 7
I tried removing the factor class from category on both df.A and df.B, tried using rbind.fill.matrix instead...to be honest I am pretty new to R and this is giving me a lot of trouble. How do I get R to recognize that ABCD are the same factor across dataframes?
You can achieve the desired result by using merge:
merge(df.A,df.B,by='Category',all=T)
which will produce the following output:
# Category Number.x Number.y
#1 A 1 5
#2 B 2 6
#3 C 3 7
#4 D 4 NA

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and xis the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to the proc expand.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
DF[is.na(DF$x),"x"] <- 0
DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656

Using loop variables

I would like to rename a large number of columns (column headers) to have numerical names rather than combined letter+number names. Because of the way the data is stored in raw format, I cannot just access the correct column numbers by using data[[152]] if I want to interact with a specific column of data (because random questions are filtered completely out of the data due to being long answer comments), but I'd like to be able to access them by data$152. Additionally, approximately half the columns names in my data have loaded with class(data$152) = NULL but class(data[[152]]) = integer (and if I rename the data[[152]] file it appropriately allows me to see class(data$152) as integer).
Thus, is there a way to use the loop iteration number as a column name (something like below)
for (n in 1:415) {
names(data)[n] <-"n" # name nth column after number 'n'
}
That will reassign all my column headers and ensure that I do not run into question classes resulting in null?
As additional background info, my data is imported from a comma delimited .csv file with the value 99 assigned to answers of NA with the first row being the column names/headers
data <- read.table("rawdata.csv", header=TRUE, sep=",", na.strings = "99")
There are 415 columns with headers in format Q001, Q002, etc
There are approximately 200 rows with no row labels/no label column
You can do this without a loop, as follows:
names(data) <- 1:415
Let me illustrate with an example:
dat <- data.frame(a=1:4, b=2:5, c=3:6, d=4:7)
dat
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Now rename the columns:
names(dat) <- 1:4
dat
1 2 3 4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
EDIT : How to access your new data
#Ramnath points out very accurately that you won't be able to access your data using dat$1:
dat$1
Error: unexpected numeric constant in "dat$1"
Instead, you will have to wrap the column names in backticks:
dat$`1`
[1] 1 2 3 4
Alternatively, you can use a combination of character and numeric data to rename your columns. This could be a much more convenient way of dealing with your problem:
names(dat) <- paste("x", 1:4, sep="")
dat
x1 x2 x3 x4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7

Resources