Using l_ply to add results to an existing data frame

Is l_ply or some other apply-like function capable of inserting results to an existing data frame?
Here's a simple example...
Suppose I have the following data frame:
mydata <- data.frame(input1=1:3, input2=4:6, result1=NA, result2=NA)
  input1 input2 result1 result2
1      1      4      NA      NA
2      2      5      NA      NA
3      3      6      NA      NA
I want to loop through the rows, perform operations, then insert the answer in the columns result1 and result2. I tried:
l_ply(1:nrow(mydata), function(i) {
  mydata[i, "result1"] <- mydata[i, "input1"] + mydata[i, "input2"]
  mydata[i, "result2"] <- mydata[i, "input1"] * mydata[i, "input2"]
})
but I get back the original data frame with NA's in the result columns.
P.S. I've already read this post, but it doesn't quite answer my question. I have several result columns, and the operations I want to perform are more complicated than the ones above, so I'd prefer not to compute the columns separately and then add them to the data frame afterwards, as the post suggests.

I suppose there might be a plyr approach but this seems very easy and clear to do in base R:
> mydata[3:4] <- with(mydata, list( input1+input2, input1*input2) )
> mydata
  input1 input2 result1 result2
1      1      4       5       4
2      2      5       7      10
3      3      6       9      18
Even if you got that plyr code to deliver something useful, you are still not assigning the results to anything, so they would have evaporated under the glaring sun of garbage collection. And do note that if you had followed the advice of @Vlo you would have seen a result at the console that might have led you to think that 'mydata' was updated, but the 'mydata' object would have remained untouched. You need to assign values back to the original object. For dplyr operations you are generally going to be assigning back entire objects.
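If you really do want the row-at-a-time structure, a base R sketch (no plyr needed) is to build the per-row results with Map and assign the combined block back in one step:
# each call returns a one-row data frame; rbind stacks them, and the
# assignment writes both result columns back into mydata
rows <- Map(function(a, b) data.frame(result1 = a + b, result2 = a * b),
            mydata$input1, mydata$input2)
mydata[c("result1", "result2")] <- do.call(rbind, rows)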

You don't need to use apply or variations thereof. Instead, you can exploit the fact that R is vectorized:
mydata$result1 <- mydata$input1 + mydata$input2
mydata$result2 <- mydata$input1 * mydata$input2
#> mydata
#  input1 input2 result1 result2
#1      1      4       5       4
#2      2      5       7      10
#3      3      6       9      18

Related

Renaming dataframe without writing it to the global environment

I have written a loop that stores data frames in a list and would like to use strings stored in a vector as their names. That way, I could refer to the data frames stored in the list by name without having to use indexes. I have searched the internet extensively on this issue but so far have not found any solution.
So far, I have used a workaround: I loop over a list of data frame names using read.csv(). In each iteration, I write the imported data frame to the global environment using assign(), which lets me set a variable name. Using get() and a pattern-matching approach, I then fetch the data frames from the global environment and store them in a list.
This approach is quite cumbersome and only works when data frame names follow a shared pattern.
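A sketch of that workaround, with hypothetical file names, just to make the pattern concrete:
# cumbersome: park each data frame in the global environment under a
# constructed name, then collect them back by pattern matching
for (f in c("df_a.csv", "df_b.csv")) {
  assign(sub("\\.csv$", "", f), read.csv(f))
}
dfs <- mget(ls(pattern = "^df_"))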
Preferably, I would like to rename data frames without having to use assign():
Name of imported data frame 1 <- First element of vector containing the data frame names
How could I achieve this?
Any help is highly appreciated!
My approach to this sort of problem is to use lapply to create the loop and then supply names for the elements of the resulting list. This gives a simple, two line solution once the "create a data frame" function has been written.
For example, generating a random data.frame rather than reading a csv file for easy reproduction:
createDataFrame <- function(x) {
  data.frame(X = x, Y = rnorm(5))
}
beatles <- lapply(1:4, createDataFrame)
names(beatles) <- c("John", "Paul", "George", "Ringo")
beatles
$John
  X          Y
1 1 -1.1590175
2 1  0.6872888
3 1 -0.8868616
4 1 -0.3458603
5 1  1.1136297

$Paul
  X          Y
1 2 -0.3761409
2 2 -0.9059801
3 2 -0.7039736
4 2 -0.4490143
5 2  1.1337149

$George
  X          Y
1 3 -0.4804286
2 3  1.0573272
3 3 -1.9000426
4 3  0.8887967
5 3  0.6550380

$Ringo
  X          Y
1 4 -0.7539840
2 4 -0.3743590
3 4 -0.9748449
4 4 -1.1448570
5 4 -1.3277712

beatles$George
  X          Y
1 3 -0.4804286
2 3  1.0573272
3 3 -1.9000426
4 3  0.8887967
5 3  0.6550380
Make the obvious changes to createDataFrame for your actual use case.
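For the original CSV use case, the same two-line pattern would look something like this (the file paths and list names are placeholders):
files <- c("john.csv", "paul.csv", "george.csv", "ringo.csv")
dataFrames <- lapply(files, read.csv)
names(dataFrames) <- c("John", "Paul", "George", "Ringo")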

Dynamically (out of a for loop) populate a dataframe with another dataframe with n rows in R

I have certain data in a list, extracted from Bayesian processing of recordings from certain electrodes, and I want to populate a dataframe out of a loop. First I have a list of 729 processing outcomes and an object elecs, which is basically a list of 729 pairs of electrodes (27*27), as you can see:
> head(elecs)
  X Elec1 Elec2
1 1     1     1
2 2     1     2
3 3     1     3
4 4     1     4
5 5     1     5
6 6     1     6
The thing is, I would like to fill dataf1 with the outcome of this loop, which happens to be a dataframe of 4000 rows on each iteration.
dataf1 <- data.frame('Elec1' = rep(NA, 4000*729),
                     'Elec2' = rep(NA, 4000*729),
                     'int'   = rep(NA, 4000*729))
for (i in nrow(elecs)){
  Elec1 <- as.data.frame(rep(elecs[i,]$Elec1, 4000))
  Elec2 <- as.data.frame(rep(elecs[i,]$Elec2, 4000))
  post <- posterior_samples(bayeslist[[i]])
  int <- as.data.frame(post$b_Intercept)
  df <- cbind(Elec1, Elec2, int)
  colnames(df) <- c('Elec1', 'Elec2', 'int')
  dataf1[(1+(i-1)*4000):((1+(i-1)*4000)+3999),c('Elec1','Elec2','int')] <- df
}
Everything works perfectly fine until the last line in the loop:
dataf1[(1+(i-1)*4000):((1+(i-1)*4000)+3999),c('Elec1','Elec2','int')] <- df
I don't know why exactly this is not working as expected and populating the preinitialised dataf1 dataframe.
Any insight, as always, will be highly appreciated.
I realised I was missing the init in the for loop, so it's kind of a newbie typo. Apart from this, the code works, in case anyone is wondering.
# wrong: the body runs only once, with i equal to nrow(elecs)
for (i in nrow(elecs)){
# right: iterate over every row index
for (i in 1:nrow(elecs)){
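As a further safeguard, seq_len() avoids the classic 1:0 trap when the table is empty. A minimal sketch of the corrected loop (posterior_samples() and bayeslist come from the asker's setup):
for (i in seq_len(nrow(elecs))) {
  # the 4000 draws for electrode pair i occupy one contiguous block of rows
  rows <- ((i - 1) * 4000 + 1):(i * 4000)
  dataf1[rows, "Elec1"] <- elecs$Elec1[i]
  dataf1[rows, "Elec2"] <- elecs$Elec2[i]
  dataf1[rows, "int"]   <- posterior_samples(bayeslist[[i]])$b_Intercept
}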

R - remove rows from a data frame with empty lines (not only numbers)

This issue seems to have been treated before, but after checking I couldn't find any solution. I load a table from a file, and it can happen (I don't know how) that some entire lines are empty. So the data frame I get looks like:
#   id c1 c2
# 1  a  1  2
# 2  b  2  4
# 3     NA NA
# 4  d  6  1
# 5  e  7  5
# 6     NA NA
if I do
apply(df, 1, function(x) all(is.na(x)))
I get all FALSE, as the first column is not a number (the table is much bigger, with mixed character and numeric columns), so I can't filter these lines. With na.omit or complete.cases I cannot sort it out either.
Is there any function or expression to check empty rows?
You may be able to cut this problem off at the source with the parameters you pass to read.csv:
For instance, if the blanks are empty strings or a single space, you could use
df <- read.csv(<your other logic here>, na.strings = c("NA", "", " "))
This question seems to raise similar issues: read.csv blank fields to NA
If this works, then you can use the apply logic to work with the offending rows.
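For completeness, once the blanks have arrived as NA, the row-wise check from the question does the rest; a minimal sketch:
# drop rows in which every field is NA
df <- df[!apply(df, 1, function(x) all(is.na(x))), ]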

filtering large data sets to exclude an identical element across all columns

I am a relatively new R user, and most of the complex coding (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl) and I have only used R for very simple manipulations in the past (basic loading data from file, subsetting, ANOVA/T-Test). However, I am working on a project where I had no control over the data layout and the data file is very lengthy.
In my data, I have 172 rows, each representing a participant in a survey, and 158 columns, each representing a question number. The answers range from 1 to 5. The raw data uses the number "99" to indicate that a question was not answered. I need to exclude any questions a participant did not answer, without excluding the entire participant.
Part Q001 Q002 Q003 Q004
   1    2    4   99    2
   2    3   99    1    3
   3    4    4    2    5
   4   99    1    3    2
   5    1    3    4    2
In the past I have used the subset feature to filter my data
data.filter <- subset(data, Q001 != 99)
This works fine when I am working with sets where all my answers are contained in one column: it just deletes the whole row where the answer was not available.
However, with the answers in this set spread across 158 columns, if I subset out 99 in column 1 (Q001), I also filter out that entire Participant.
I'd like to know if there is a way to filter/subset the data such that my large data set would end up having 'blanks' where the 99s occurred, so that these 99s would not inflate or otherwise interfere with the statistics I run on the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and t-tests on various questions.
Resp Q001 Q002 Q003 Q004
   1    2    4         2
   2    3         1    3
   3    4    4    2    5
   4         1    3    2
   5    1    3    4    2
Is this possible to do in R? I've tried to filter the data before submitting it to R, but it won't read the data file in when I have blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time consuming if there is a better approach or package to use).
Any assistance would be greatly appreciated!
You could replace the 99s with NA and then calculate the colMeans omitting NAs:
df <- replicate(20, sample(c(1,2,3,99), 4))
colMeans(df) # wrong: the 99s inflate the means
dfc <- df
dfc[dfc == 99] <- NA
colMeans(dfc, na.rm = TRUE)
You can also indicate which values are NAs when you read in your data. For your particular case:
mydata <- read.table('dat_base', na.strings = "99")
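Once the 99s are NAs, the downstream statistics need no special casing; a short sketch with hypothetical question columns:
# per-question means, skipping unanswered items (column 1 holds the participant id)
sapply(mydata[-1], mean, na.rm = TRUE)
# t.test() and aov() drop missing values by default (na.action = na.omit)
t.test(mydata$Q001, mydata$Q002)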

Applying a function on each row of a data frame in R

I would like to apply some function on each row of a dataframe in R.
The function can return a single-row dataframe or nothing (I guess return() returns nothing?).
I would like to apply this function on each of the rows of a given dataframe, and get the resulting dataframe (which is possibly shorter, i.e. has fewer rows, than the original one).
For example, if the original dataframe is something like:
id size name
 1  100 dave
 2  200 sarah
 3   50 ben
And the function I'm using gets a row of the dataframe (i.e. a single-row dataframe), returns it as-is if the name rhymes with "brave", and otherwise returns NULL, then the result should be:
id size name
 1  100 dave
This example actually refers to filtering a dataframe, and I would love to get both an answer specific to this kind of task and one for the more general case, where even the result of the helper function (the one that operates on a single row) may be an arbitrary single-row data frame. Please note that even in the case of filtering, I would like to use some sophisticated logic (not something simple like $size > 100, but a more complex condition that is checked by a function, say boo(single_row_df)).
P.S.
What I have done so far in these cases is to use apply(df, MARGIN=1) and then do.call(rbind, ...), but I think it gives me some trouble when my dataframe has only a single row (I get Error in do.call(rbind, filterd) : second argument must be a list).
UPDATE
Following Stephen's reply, I did the following:
ranges.filter <- function(ranges, boo) {
  subset(x = ranges, subset = !any(boo[start:end]))
}
I then call ranges.filter with some ranges dataframe that looks like this:
start  end
  100  200
  250  400
  698 1520
 1988 2147
...
and some boolean vector
(TRUE,FALSE,TRUE,TRUE,TRUE,...)
I want to filter out any ranges that contain a TRUE value from the boolean vector. For example, the first range 100 .. 200 will be left in the data frame iff the boolean vector is FALSE in positions 100 .. 200.
This seems to do the work, but I get a warning saying "numerical expression has 53 elements: only the first used".
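For reference, the warning arises because start and end are whole columns inside subset(), so start:end uses only their first elements. A per-row version of the same idea (a sketch, assuming boo is at least as long as the largest end) would be:
ranges.filter <- function(ranges, boo) {
  # keep a range only if boo is FALSE everywhere inside it
  keep <- mapply(function(s, e) !any(boo[s:e]), ranges$start, ranges$end)
  ranges[keep, ]
}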
For the more general case of processing a dataframe, get the plyr package from CRAN and look at the ddply function, for example.
install.packages("plyr")
library(plyr)
help(ddply)
Does what you want without masses of fiddling.
For example...
> d
    x          y           z xx
1   1 0.68434946 0.643786918  8
2   2 0.64429292 0.231382912  5
3   3 0.15106083 0.307459540  3
4   4 0.65725669 0.553340712  5
5   5 0.02981373 0.736611949  4
6   6 0.83895251 0.845043443  4
7   7 0.22788855 0.606439470  4
8   8 0.88663285 0.048965094  9
9   9 0.44768780 0.009275935  9
10 10 0.23954606 0.356021488  4
We want to compute the mean and sd of x within groups defined by "xx":
> ddply(d,"xx",function(r){data.frame(mean=mean(r$x),sd=sd(r$x))})
  xx mean        sd
1  3  3.0        NA
2  4  7.0 2.1602469
3  5  3.0 1.4142136
4  8  1.0        NA
5  9  8.5 0.7071068
And it gracefully handles all the nasty edge cases that sometimes catch you out.
You may have to use lapply instead of apply to force the result to be a list.
> rhymesWithBrave <- function(x) substring(x,nchar(x)-2) =="ave"
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) dfr[i,] else NULL,
+ dfr))
  id size name
1  1  100 dave
But in this case, subset would be more appropriate:
> subset(dfr,rhymesWithBrave(name))
  id size name
1  1  100 dave
If you want to perform additional transformations before returning the result, you can go back to the lapply approach above:
> add100tosize <- function(x) within(x,size <- size+100)
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) add100tosize(dfr[i,])
+ else NULL,dfr))
  id size name
1  1  200 dave
Or, in this simple case, apply the function to the output of subset.
> add100tosize(subset(dfr,rhymesWithBrave(name)))
  id size name
1  1  200 dave
UPDATE:
To select rows that do not fall between start and end, you might construct a different function (note: when a logical vector is summed, TRUE values count as 1 and FALSE values as 0):
test <- function(x)
  # a row of the matrix sums to 0 only when x falls in none of the ranges;
  # the inner function must not take x as an argument, so that x is found
  # in the enclosing function rather than left unsupplied by mapply
  rowSums(mapply(function(start, end) x >= start & x <= end,
                 start = c(100, 250, 698, 1988),
                 end   = c(200, 400, 1520, 2147))) == 0
subset(dfr, test(size))
It sounds like you want to use subset:
subset(orig.df,grepl("ave",name))
The second argument evaluates to a logical vector that determines which rows are kept. The expression can use values from as many columns as you want, e.g. grepl("ave", name) & size > 50.
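If the condition really is a black-box predicate on a whole row (the boo(single_row_df) case from the question), one pattern is to build the logical index row by row and hand it to subset; a sketch, with boo() standing in for the asker's hypothetical predicate:
keep <- vapply(seq_len(nrow(orig.df)),
               function(i) boo(orig.df[i, , drop = FALSE]),
               logical(1))
subset(orig.df, keep)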
