using a function with lapply to create a column and match values - r

I have two datasets, H and G. Each has a column named 'diff' that, as the name suggests, holds the difference between two columns within that dataset. I used lapply to calculate the percentage for each dataset (I have more datasets than H and G, so I would like to calculate the percentage of the two columns in each one). lapply returns the output, but for some reason it doesn't create the "perc" column in the datasets that pass through it. What am I doing wrong here?
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
G<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
H[c(2,3,7,9),9]<-NA
G[c(1,5,7,8),9]<-NA
H$diff<-H$X10-H$X9
G$diff<-G$X10-G$X9
dsay<-list(H,G)
lapply(dsay,function(x)x$perc<-round((x$diff/x$X10)*100,1))
Extension of this question:
Once I have the percent differences as columns using:
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
G<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
H[c(2,3,7,9),9]<-NA
G[c(1,5,7,8),9]<-NA
H$diff<-H$X10-H$X9
G$diff<-G$X10-G$X9
H$perc<-round((H$diff/H$X10)*100,1)
G$perc<-round((G$diff/G$X10)*100,1)
I generated a plot using:
xyplot(X8+X9+X10~X1,H,type=c('p','l','g'),
col = c('yellow', 'green', 'blue','red'),
ylab='Count',layout=c(3, 1),
xlab=paste("H",'difference',min(pmin(H$perc, na.rm = TRUE),na.rm=TRUE),
'% change count'))
Never mind the plot this generates. What I'm trying to do is display the corresponding value from the "diff" column along with the lowest percentage difference (which is what the min function finds). I've tried using "match" in vain. Could someone help, please?

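If the changes need to be reflected in the original data frame objects as well, list2env or assign can be used, but I would do all the computations within the list itself. Note that the function passed to lapply has to return x itself, not just the result of the assignment.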
list2env(lapply(mget(c('H','G')), function(x) {
  x$perc <- round((x$diff/x$X10)*100, 1)
  x  # return the modified data frame
}), envir=.GlobalEnv)
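To see why the original lapply call did not add a column, and how the "diff" value belonging to the lowest percentage could be looked up, here is a minimal sketch. The small fixed data frames and the helper name add_perc are made up for illustration. The key points are that the function returns x and that the result of lapply is captured; which.min (which skips NA) is one possible alternative to match.

```r
# Small fixed data instead of the random samples from the question
H <- data.frame(X9 = c(4, NA, 10, 6), X10 = c(8, 5, 20, 4))
G <- data.frame(X9 = c(3, 9, NA, 2), X10 = c(6, 12, 10, 8))
H$diff <- H$X10 - H$X9
G$diff <- G$X10 - G$X9

# Hypothetical helper: add the column, then return the whole data frame
add_perc <- function(x) {
  x$perc <- round((x$diff / x$X10) * 100, 1)
  x
}

# lapply works on copies, so the result has to be stored
dsay <- lapply(list(H = H, G = G), add_perc)

# Row holding the lowest percentage (NA values are ignored by which.min)
i <- which.min(dsay$H$perc)
label <- paste("H difference", dsay$H$diff[i], "/", dsay$H$perc[i], "% change count")
print(label)
```

The string in label could then be passed as xlab in the xyplot call.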

Related

How do I generate row-specific means in a data frame?

I'm looking to generate means of ratings as a new variable/column in a data frame. Currently every method I've tried either generates columns that show the mean of the entire dataset (for the chosen items) or don't generate means at all. Using the rowMeans function doesn't work as I'm not looking for a mean of every value in a row, just a mean that reflects the chosen values in a given row. So for example, I'm looking for the mean of 10 ratings:
fun <- mean(T1.1,T2.1,T3.1,T4.1,T5.1,T6.1,T7.1,T8.1,T9.1,T10.1, trim = 0, na.rm = TRUE)
I want a different mean printed for every row because each row represents a different set of observations (a different subject, in my case). The issues I'm looking to correct with this are twofold: 1) it generates only one mean, the mean of all values for each of the 10 variables, and 2) this vector is not a part of the dataframe. I tried to generate a new column in the dataframe by using "exp$fun" but that just creates a column whose every value (for every row) is the grand mean. Could anyone advise as to how to program this sort of row-based mean? I'm sure it's simple enough but I haven't been able to figure it out through Googling or trawling StackOverflow.
Thanks!
It's hard to figure out an answer without a reproducible example, but have you tried subsetting your dataset to include only the 10 columns from which you'd like to derive your means and then using an apply statement? Something along the lines of apply(df, 1, mean), where the first argument is your data frame, the second specifies whether to apply the function by rows (1) or columns (2), and the third is the function you wish to apply.
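A minimal sketch of that suggestion, assuming a made-up data frame exp_df with three rating columns instead of ten. Once the data frame is subset to the rating columns, rowMeans gives the same per-row result as apply(..., 1, mean). Note that mean() only averages its first argument, which is why passing ten vectors to it does not produce row means.

```r
# Hypothetical data: one row per subject, three rating columns, one unrelated column
exp_df <- data.frame(
  subject = 1:3,
  T1.1 = c(1, 4, NA),
  T2.1 = c(3, 4, 5),
  T3.1 = c(5, NA, 7),
  other = c(100, 200, 300)
)

rating_cols <- c("T1.1", "T2.1", "T3.1")

# Per-row mean over the chosen columns only, ignoring missing ratings
exp_df$mean_rating <- rowMeans(exp_df[, rating_cols], na.rm = TRUE)

# Equivalent result with apply over rows
via_apply <- apply(exp_df[, rating_cols], 1, mean, na.rm = TRUE)

print(exp_df)
```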

correlation of several columns need to be calculated

I'm trying to get the correlation coefficient for corresponding columns of two CSV files. I simply use the following, but get errors. Assume each CSV file has 50 columns.
first_values <- read.csv("")
second_values <- read.csv("")
correlation.csv <- cor(x = first_values, y = second_values, method = "spearman")
But I get the error 'x' must be numeric!
subset of one csv file
Thanks for your help
The read.table function and all of its derivatives return a data.frame, which is an R list object. The mapply function processes lists in "parallel". If the matching columns are in the same order in the two datasets, have the same number of rows, and do not have spaces in their names, it would be as simple as:
mapply(cor, first_values , second_values)
If it's more complicated than that, then you need to fill in the missing details with example data by editing the question (not by responding in comments).
There must be some categorical variable in X. So you can first separate that categorical variable from X and then use X in the cor() function.
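The two answers can be combined in a rough sketch: drop the non-numeric columns first, then correlate the matching columns pairwise. The data frames below are made up, and passing method = "spearman" through MoreArgs is one way to keep the method from the question.

```r
# Hypothetical stand-ins for the two CSV files; 'id' plays the categorical column
first_values <- data.frame(id = letters[1:6],
                           a = c(1, 2, 3, 4, 5, 6),
                           b = c(2, 1, 4, 3, 6, 5))
second_values <- data.frame(id = letters[1:6],
                            a = c(2, 4, 6, 8, 10, 12),
                            b = c(6, 5, 4, 3, 2, 1))

# Keep only columns that are numeric in the first file
is_num <- vapply(first_values, is.numeric, logical(1))
num_cols <- names(first_values)[is_num]

# Pairwise Spearman correlation of corresponding columns
rho <- mapply(cor,
              first_values[num_cols],
              second_values[num_cols],
              MoreArgs = list(method = "spearman"))
print(rho)
```

The result is a named numeric vector with one coefficient per column pair.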

Retaining a value in an R dataset if it's present in another dataset

I am currently working on code that is applied to various datasets from an experiment covering a wide range of variables, which might not all be present in every repetition. My first step is to create an empty dataset with all the possible variables, and then write a function which retains columns that are in the dataset being inputted and delete the rest. Here is an example of what I want to achieve:
x<-c("a","b","c","d","e","f","g")
y<-c("c","f","g")
Is there a way of removing elements of x that aren't present in y and/or retaining values of x that are present in y?
For your first question: "My first step is to create an empty dataset with all the possible variables", I would use factor on the concatenation of all the vectors, for example:
all_vect = c(x, y)
possible = levels(factor(all_vect))
Then, for the second part, "write a function which retains columns that are in the dataset being inputted and delete the rest", I would write:
df[,names(df)%in%possible]
As akrun wrote, use intersect(x,y) or
x[x %in% y]
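A short sketch of both suggestions on the vectors from the question. The data frame input is a made-up example of a dataset with one column that is not among the possible variables.

```r
x <- c("a", "b", "c", "d", "e", "f", "g")
y <- c("c", "f", "g")

# Elements of x that are also in y (two equivalent ways here)
kept <- intersect(x, y)
kept_in <- x[x %in% y]

# Elements of x that are not in y
dropped <- setdiff(x, y)

# Same idea for columns: keep only columns whose names are in x
input <- data.frame(c = 1:2, f = 3:4, g = 5:6, z = 7:8)
input_kept <- input[, names(input) %in% x, drop = FALSE]

print(kept)
print(names(input_kept))
```

drop = FALSE keeps the result a data frame even if only one column survives.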

Specifying names of columns to be used in a loop R

I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot with X1 giving the x-axis values, faceted using the values in B and C.
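library(reshape2)  # melt()
library(ggplot2)   # ggplot(), facet_grid(), ggsave()
plotdata<-function(l){
# reshape to long format, keeping only the requested measure column(s)
melted<-melt(df,id.vars=c("X1","B","C"),measure.vars=l)
# scatter plot of value against X1, faceted by B (rows) and C (columns)
p<-ggplot(melted,aes(x=X1,y=value))+geom_point()+facet_grid(B ~ C)
ggsave(filename=paste("X_vs_",l,"_faceted.jpeg",sep=""),plot=p)
}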
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type the column of interest into plotdata and then get the result, but this seems quite inelegant (and time consuming). I would prefer to be able to manually specify the columns of interest e.g. "Y1","Y3","Y4" and then write a loop function to do all those specified.
However, I am new to writing for loops and can't find a way to loop over the specific column names that my function needs. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop over the user-specified columns.
Apologies if there is already an answer to this on Stack Overflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric; for iterates over the elements of any vector, including a character vector of column names.
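As a self-contained illustration of looping over names, the sketch below replaces plotdata with a hypothetical stand-in, filename_for, that only builds the output file name, so it runs without ggplot2. The same loop shape would apply to plotdata itself.

```r
# Hypothetical stand-in for plotdata(): returns the file name it would save to
filename_for <- function(l) {
  paste("X_vs_", l, "_faceted.jpeg", sep = "")
}

cols <- c("Y1", "Y3", "Y4")  # user-specified columns of interest

# for loop over the names themselves, not over numeric indices
for (col in cols) {
  print(filename_for(col))
}

# The same iteration with vapply, collecting the results in a named vector
file_names <- vapply(cols, filename_for, character(1))
```

With the real function, invisible(lapply(cols, plotdata)) would be one way to run it for each name without printing the return values.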

Binning column and getting corresponding values from other column in R

I have two columns of paired values in a data frame. I want to bin the data in one column using the cut2 function from the Hmisc package so that there are at least, say, 25 data points in each bin. However, I also need the corresponding values from the other column. Is there a convenient way to do that in R? I have to bin column B.
A B
-10.834510 1.680173
11.012966 1.866603
-16.491415 1.868667
-14.485036 1.900002
2.629104 1.960929
-3.597291 2.005348
.........
It's not clear what you mean by wanting the "corresponding values of the other column". The first part is easy to accomplish using the g (# of groups) argument:
dfrm$Agrp <- cut2(dfrm$A, g=trunc(length(dfrm$A)/25) )
You can aggregate means or medians of B within the Agrp groups using tapply, ave, or one of the Hmisc summary functions. There are several worked examples in one of today's questions, "How to get Summary statistics by group", as well as many other examples of using those functions, aggregate, or the pkg:plyr functions.
Given that the number of B values will not necessarily be constant across groups the only way I can think to deliver the individual values by A-grouped-value would be with split. I added an extra row to illustrate that a non-even split might need to return a list rather than a more "rectangular" object :
library(Hmisc)  # provides cut2()
dat <- read.table(text="A B
-10.834510 1.680173
11.012966 1.866603
-16.491415 1.868667
-14.485036 1.900002
2.629104 1.960929
-3.597291 2.005348\n 3.5943 3.796", header=TRUE)
dat$Agrp <- cut2(dat$A, g=trunc(length(dat$A)/3) )
split(dat$B, dat$Agrp)
#-----
$`[-16.49, 2.63)`
[1] 1.680173 1.868667 1.900002 2.005348
$`[ 2.63,11.01]`
[1] 1.866603 1.960929 3.796000
If you want the vector of values on which the splits were done then that can be accomplished by using regex on levels(dat$Agrp).
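The grouping, per-group summary, and regex steps can be sketched without Hmisc. This is only an approximation under stated assumptions: base cut with quantile breaks is swapped in for cut2, so the interval labels differ from the output shown above, and the regular expression is one possible pattern for pulling the numbers out of the level labels.

```r
dat <- data.frame(
  A = c(-10.834510, 11.012966, -16.491415, -14.485036, 2.629104, -3.597291, 3.5943),
  B = c(1.680173, 1.866603, 1.868667, 1.900002, 1.960929, 2.005348, 3.796)
)

# Two quantile-based bins on A (base R stand-in for cut2 with g = 2)
breaks <- quantile(dat$A, probs = c(0, 0.5, 1))
dat$Agrp <- cut(dat$A, breaks = breaks, include.lowest = TRUE)

# Individual B values per A-group (a list, since group sizes can differ)
b_by_group <- split(dat$B, dat$Agrp)

# One summary value per group
b_means <- tapply(dat$B, dat$Agrp, mean)

# Recover the (rounded) cut points from the level labels with a regex
lv <- levels(dat$Agrp)
bounds <- sort(unique(as.numeric(unlist(regmatches(lv, gregexpr("-?[0-9.]+", lv))))))

print(b_by_group)
print(bounds)
```

The recovered bounds carry only the precision shown in the labels, not the exact break values.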
