Finding standard deviations of rows on multiple columns in R - r

I just started processing my data using R. I would like to know how I can find the standard deviation of each row across multiple columns of a data frame, data1.data:
| C1  | C2  | C3  | C4  | SD |
|-----|-----|-----|-----|----|
| 123 | 234 | 456 | 321 |    |
| 342 | 334 | 123 | 432 |    |
| 257 | 987 | 543 | 456 |    |
So I've been trying to use R for this. I've already tried:
C_sd <- apply(data1.data[, c(1,4)],1,sd)
I get incorrect SD values from this.
I've also tried using rowSds()
dataC.data <- data1.data[,1:4]
sdC <- rowSds(dataC.data)
and I get this
Error in rowVars(x, ...) : Argument 'x' must be a matrix or a vector.
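For completeness, a minimal sketch of both fixes, using made-up data matching the table above: in apply(), c(1, 4) picks only columns 1 and 4 (which explains the wrong SD values), whereas 1:4 selects all four; and rowSds() from matrixStats wants a matrix rather than a data frame, which is what the error message is saying.

```r
# Example data matching the question's table (hypothetical values)
data1 <- data.frame(C1 = c(123, 342, 257),
                    C2 = c(234, 334, 987),
                    C3 = c(456, 123, 543),
                    C4 = c(321, 432, 456))

# c(1, 4) selects only columns 1 and 4; 1:4 selects all four
data1$SD <- apply(data1[, 1:4], 1, sd)

# rowSds() needs a matrix, hence the "must be a matrix" error:
# data1$SD <- matrixStats::rowSds(as.matrix(data1[, 1:4]))
```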

Select max value in one column for every value in the other column [duplicate]

I have a dataframe competition with columns branch, phone and sales
| branch | phone | sales|
|----------|---------|------|
| 123 | milky | 654 |
| 456 | lemon | 342 |
| 789 | blue | 966 |
| 456 | blue | 100 |
| 456 | milky | 234 |
| 123 | lemon | 874 |
| 789 | milky | 234 |
| 123 | blue | 332 |
| 789 | lemon | 865 |
I want to show the highest number of sales for every phone:
The output should be a data frame winners that looks like this:
| branch | phone | sales|
|----------|---------|------|
| 123 | milky | 654 |
| 789 | blue | 966 |
| 123 | lemon | 874 |
I tried ordering the data frame by sales first and then keeping only the top 3 rows:
competition <- competition[order(competition$sales, decreasing = TRUE ),]
winners <- head(competition, 3)
But the output shows the lemon phone twice, with 874 and 865 sales.
You can use aggregate():
aggregate(sales ~ phone, df, max)
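Note that aggregate() alone returns only each phone and its maximum sales, dropping branch; merging that result back onto the data recovers the winning rows. A sketch using the question's data:

```r
competition <- data.frame(
  branch = c(123, 456, 789, 456, 456, 123, 789, 123, 789),
  phone  = c("milky", "lemon", "blue", "blue", "milky",
             "lemon", "milky", "blue", "lemon"),
  sales  = c(654, 342, 966, 100, 234, 874, 234, 332, 865)
)

# max sales per phone, then join back on phone + sales to keep branch
best    <- aggregate(sales ~ phone, competition, max)
winners <- merge(best, competition)
```

If two branches tie for the maximum on the same phone, this returns one row per tied branch.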

How do I merge 2 dataframes without a corresponding column to match by?

I'm trying to use the merge() function in RStudio. I have two tables with 5000+ rows each; they have the same number of rows but no common column to merge by. However, the rows are in order and correspond: the first row of dataframe1 should merge with the first row of dataframe2, the 2nd row of dataframe1 with the 2nd row of dataframe2, and so on.
Here's an example of what they could look like:
Dataframe1(df1):
+-------------------------------------+
| Name | Sales | Location |
+-------------------------------------+
| Rod | 123 | USA |
| Kelly | 142 | CAN |
| Sam | 183 | USA |
| Joyce | 99 | NED |
+-------------------------------------+
Dataframe2(df2):
+---------------------+
| Sex | Age |
+---------------------+
| M | 23 |
| M | 33 |
| M | 31 |
| F | 45 |
+---------------------+
NOTE: this is a downsized example only.
I've tried to use the merge function in RStudio, here's what I've done:
DFMerged <- merge(df1, df2)
This, however, increases both the rows and the columns: it returns 16 rows and 5 columns for this example (a full cross join).
What am I missing? I know there is a merge(x, y, by=) argument, but I don't have a column to match them on.
The output I would like is:
+----------------------------------------------------------+
| Name | Sales | Location | Sex | Age |
+----------------------------------------------------------+
| Rod | 123 | USA | M | 23 |
| Kelly | 142 | CAN | M | 33 |
| Sam | 183 | USA | M | 31 |
| Joyce | 99 | NED | F | 45 |
+----------------------------------------------------------+
I've considered making an extra column in each data frame, say row#, and matching them by that.
You could use cbind:
cbind(df1, df2)
If you want to use merge() you can merge on row names (by = 0 is shorthand for by = "row.names"); note that this adds a Row.names column to the result:
merge(df1, df2, by=0)
You could use:
cbind(df1,df2)
Note that this only works because the two data frames have the same number of rows.
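The row-number idea mentioned in the question also works, if you do want merge() semantics. A sketch with hypothetical data frames shaped like the example:

```r
df1 <- data.frame(Name = c("Rod", "Kelly", "Sam", "Joyce"),
                  Sales = c(123, 142, 183, 99),
                  Location = c("USA", "CAN", "USA", "NED"))
df2 <- data.frame(Sex = c("M", "M", "M", "F"),
                  Age = c(23, 33, 31, 45))

# add an explicit row id to each frame and merge on it
df1$row_id <- seq_len(nrow(df1))
df2$row_id <- seq_len(nrow(df2))
DFMerged <- merge(df1, df2, by = "row_id")
```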

R lpSolve: How to optimize picks with a budget restriction

I have a question about lpSolve in R. I have a panel with the following data: football player ID (around 500 players), how many games each of them has already played, number of goals scored, and cost of the player. I want to create a matrix from this data, but I do not know how this works with such a large amount of data (I have about 500 football players and therefore 500 rows).
The goal is to select the optimal number of players for a budget of 1,000,000. Each player can only be selected once, optimized by the number of scored goals.
In the end I want to have the optimal selection of players which scored the most goals, and the budget has to be (almost) used up.
Since I am relatively new to R, I do not know how to solve this problem with lpSolve yet, and I am failing at the matrix construction and the constraints.
I'm very grateful for your help!
My panel looks like this (example):
footballplayerID | gamesplayed | avggoals | costsperplayer
233276 | 120 | 80 | 50.000
474823 | 200 | 140 | 34.000
192834 | 150 | 90 | 14.000
192833 | 30 | 50 | 90.000
129834 | 204 | 129 | 70.000
347594 | 123 | 19 | 10.000
203845 | 129 | 57 | 43.000
128747 | 98 | 124 | 140.000
.
.
123749 | 128 | 182 | 100.000
First I create a df like this:
df <- read.table(text =
"footballplayerID | gamesplayed | avggoals | costsperplayer
233276 | 120 | 80 | 50000
474823 | 200 | 140 | 34000
192834 | 150 | 90 | 14000
192833 | 30 | 50 | 90000
129834 | 204 | 129 | 70000
347594 | 123 | 19 | 10000
203845 | 129 | 57 | 43000
128747 | 98 | 124 | 140000",
header = TRUE,
stringsAsFactors = FALSE,
sep = "|"
)
library(lpSolve)
The coefficients for the objective function are avggoals:
obj_fun <- df$avggoals
The constraint is that the sum of costsperplayer needs to be less than or equal to 1,000,000:
constraints <- matrix(df$costsperplayer, nrow = 1)
c_dir <- "<="
c_rhs <- 1000000
You can then solve the LP problem with lp(). The argument all.bin = TRUE makes sure each player is either chosen once or not at all.
lp <- lp("max",
obj_fun,
constraints,
c_dir,
c_rhs,
all.bin = TRUE)
You can then have a look at the selected players:
df[lp$solution == 1, ]
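To double-check the result, you can inspect the total goals scored (lp$objval) and how much of the budget the selected squad actually uses. A self-contained toy run with three players and a smaller, made-up budget of 60,000:

```r
library(lpSolve)  # install.packages("lpSolve") if needed

# Hypothetical three-player panel, as in the question's example
df <- data.frame(footballplayerID = c(233276, 474823, 192834),
                 avggoals = c(80, 140, 90),
                 costsperplayer = c(50000, 34000, 14000))

# maximize total goals subject to the budget, one binary pick per player
lp <- lp("max", df$avggoals,
         matrix(df$costsperplayer, nrow = 1),
         "<=", 60000, all.bin = TRUE)

df[lp$solution == 1, ]                    # chosen players
sum(df$costsperplayer[lp$solution == 1])  # budget actually used
```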

Organizing data in frame

So... I've got some data coming to me in small batches, in spreadsheets. I'm having trouble figuring out the best way to put this information into a data.frame in R for later analysis.
An example of what I get would be something like this:
-----------------------------------
| Component combination A+B+C |
-----------------------------------
| | Meter | RCBS | AND |
-----------------------------------
| 1 | 2509 | 2486 | 2535 |
-----------------------------------
| 2 | 2435 | 2484 | 2539 |
-----------------------------------
| 3 | 2507 | 2493 | 2565 |
-----------------------------------
| 4 | 2558 | 2483 | 2538 |
-----------------------------------
| 5 | 2510 | 2515 | 2530 |
-----------------------------------
...where the numbers are repeated individual measurements (runs) using the listed component combination, with one of the components dispensed using one of three different devices ("Meter", etc.). There would typically be ~20 values in each column.
Should I be going for a wide or long format when I put this data into a frame? Do I want the data in the rows or in the columns?
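Long format is usually the easier shape for later analysis in R: one row per individual measurement, with columns identifying the run, the device, and the component combination (so batches can simply be row-bound together). A base-R sketch with the numbers above, column names assumed:

```r
# Hypothetical wide-format batch, as received in the spreadsheet
wide <- data.frame(run = 1:5,
                   Meter = c(2509, 2435, 2507, 2558, 2510),
                   RCBS  = c(2486, 2484, 2493, 2483, 2515),
                   AND   = c(2535, 2539, 2565, 2538, 2530))

# stack() turns the three device columns into value/device pairs;
# run recycles in step with the stacking order
long <- cbind(run = wide$run,
              stack(wide[c("Meter", "RCBS", "AND")]))
names(long)[2:3] <- c("value", "device")
long$combination <- "A+B+C"   # batch label taken from the sheet header
```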

Hmisc Table Creation

Just starting out with R and trying to figure out what works for my needs when it comes to creating "summary tables." I am used to Custom Tables in SPSS, and the CrossTable function in the gmodels package gets me almost where I need to be; not to mention it is easy to navigate for someone just starting out in R.
That said, it seems like Hmisc's tables are very good at creating various summaries and exporting to LaTeX (ultimately what I need to do).
My questions are: 1) can you create the table below easily with Hmisc? 2) If so, can I interact variables (2 in the column)? And finally, 3) can I access p-values of significance tests (chi-square)?
Thanks in advance,
Brock
Cell Contents
|-------------------------|
| Count |
| Row Percent |
| Column Percent |
|-------------------------|
Total Observations in Table: 524
| asq[, 23]
asq[, 4] | 1 | 2 | 3 | 4 | 5 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 76 | 54 | 93 | 46 | 54 | 323 |
| 23.529% | 16.718% | 28.793% | 14.241% | 16.718% | 61.641% |
| 54.286% | 56.250% | 63.265% | 63.889% | 78.261% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 64 | 42 | 54 | 26 | 15 | 201 |
| 31.841% | 20.896% | 26.866% | 12.935% | 7.463% | 38.359% |
| 45.714% | 43.750% | 36.735% | 36.111% | 21.739% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 140 | 96 | 147 | 72 | 69 | 524 |
| 26.718% | 18.321% | 28.053% | 13.740% | 13.168% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
The gmodels package has a function called CrossTable, which is very nice for those used to SPSS and SAS output. Try this example:
library(gmodels) # run install.packages("gmodels") if you haven't installed the package yet
x <- sample(c("up", "down"), 100, replace = TRUE)
y <- sample(c("left", "right"), 100, replace = TRUE)
CrossTable(x, y, format = "SPSS")
This should provide you with output just like the one you displayed in your question, very SPSS-y. :)
If you are coming from SPSS, you may be interested in the package Deducer ( http://www.deducer.org ). It has a contingency table function:
> library(Deducer)
> data(tips)
> tables<-contingency.tables(
+ row.vars=d(smoker),
+ col.vars=d(day),data=tips)
> tables<-add.chi.squared(tables)
> print(tables,prop.r=T,prop.c=T,prop.t=F)
================================================================================================================
==================================================================================
========== Table: smoker by day ==========
| day
smoker | Fri | Sat | Sun | Thur | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|
No Count | 4 | 45 | 57 | 45 | 151 |
Row % | 2.649% | 29.801% | 37.748% | 29.801% | 61.885% |
Column % | 21.053% | 51.724% | 75.000% | 72.581% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Yes Count | 15 | 42 | 19 | 17 | 93 |
Row % | 16.129% | 45.161% | 20.430% | 18.280% | 38.115% |
Column % | 78.947% | 48.276% | 25.000% | 27.419% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 19 | 87 | 76 | 62 | 244 |
Column % | 7.787% | 35.656% | 31.148% | 25.410% | |
Large Sample
Test Statistic DF p-value | Effect Size est. Lower (%) Upper (%)
Chi Squared 25.787 3 <0.001 | Cramer's V 0.325 0.183 (2.5) 0.44 (97.5)
-----------
================================================================================================================
You can export the counts and tests to LaTeX or HTML using the xtable package:
> library(xtable)
> xtable(drop(extract.counts(tables)[[1]]))
> test <- contin.tests.to.table((tables[[1]]$tests))
> xtable(test)
