My Code:
#the first few lines are supposed to help reproduce the code if needed.
world_rank <- c(1, 2, 3, 4, 5, 6)
quality_of_education <- c(1, 9, 3, 2, 7, 13)
influence <- c(1, 3, 2, 6, 12, 13)
broad_impact <- c(1, 4, 2, 13, 9, 12)
patents <- c(3, 10, 1, 48, 15, 4)
university_matrix <- cbind(world_rank, quality_of_education, influence, broad_impact, patents)
rownames(university_matrix) <- c("harvard", "stanford", "MIT", "cambridge", "oxford", "Columbia")
usa_universities <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
university_matrix[4,5] <- 3
#the following line is the actual problem.
university_matrix[usa_universities, "world_rank"] <- 2
What I expected to happen:
Specifically the last line. The line is supposed to "Replace all rankings for the USA’s universities (i.e., Harvard, Stanford, MIT, and Columbia) by 2." I tried this in RStudio, and it produces the outcome I expected: a matrix in which the world_rank column for Harvard, Stanford, MIT, and Columbia is 2.
This is the expected matrix:
          world_rank quality_of_education influence broad_impact patents
harvard            2                    1         1            1       3
stanford           2                    9         3            4      10
MIT                2                    3         2            2       1
cambridge          4                    2         6           13       3
oxford             5                    7        12            9      15
columbia           2                   13        13           12       4
What actually happened:
Picture of the output: https://drive.google.com/file/d/1NU2jIiH1OvTPB16h5AvCCI-UlJ-sLq02/view?usp=sharing
#comment from DataQuest
The rankings for the USA's universities should be 2.
Question:
What have I done wrong here? The output on Dataquest looks the same as the one in RStudio.
Thank you for your time in advance!
If you want to replace all rankings by 2, leave the column index empty so that the assignment applies to every column:
university_matrix[usa_universities, ] <- 2
university_matrix
          world_rank quality_of_education influence broad_impact patents
harvard            2                    2         2            2       2
stanford           2                    2         2            2       2
MIT                2                    2         2            2       2
cambridge          4                    2         6           13       3
oxford             5                    7        12            9      15
Columbia           2                    2         2            2       2
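To see the two assignments side by side, here is a small sketch that rebuilds the matrix from the question and applies each form to its own copy. The names rank_only and all_columns are made up for this illustration, and which of the two results the exercise checker wants is an assumption based on its wording.

```r
# Rebuild the matrix from the question.
world_rank <- c(1, 2, 3, 4, 5, 6)
quality_of_education <- c(1, 9, 3, 2, 7, 13)
influence <- c(1, 3, 2, 6, 12, 13)
broad_impact <- c(1, 4, 2, 13, 9, 12)
patents <- c(3, 10, 1, 48, 15, 4)
university_matrix <- cbind(world_rank, quality_of_education, influence,
                           broad_impact, patents)
rownames(university_matrix) <- c("harvard", "stanford", "MIT",
                                 "cambridge", "oxford", "Columbia")
usa_universities <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
university_matrix[4, 5] <- 3

# Column named: only world_rank changes in the selected rows.
rank_only <- university_matrix
rank_only[usa_universities, "world_rank"] <- 2

# Column index left empty: every column changes in the selected rows.
all_columns <- university_matrix
all_columns[usa_universities, ] <- 2

rank_only
all_columns
```

In both cases the logical vector picks the rows; the only difference is whether the second index restricts the assignment to a single column.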
I have a data frame with observations spanning three years, with a column df$week that indicates the week of each observation. (The week count of the second year continues from that of the first, so the data contains 207 weeks.)
I would like to divide the data into longer time periods, stored in df$period, each covering all observations from several weeks.
If a period were three weeks long and the data included 13 observations over six weeks, the idea would be to divide
weeks <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6)
into
periods <- list(c(1, 1, 1, 2, 2, 3, 3), c(4, 5, 5, 6, 6, 6))
periods
[[1]]
[1] 1 1 1 2 2 3 3

[[2]]
[1] 4 5 5 6 6 6
To look something like
> df
   week period
1     1      1
2     1      1
3     1      1
4     2      1
5     2      1
6     3      1
7     3      1
8     4      2
9     5      2
10    5      2
11    6      2
12    6      2
13    6      2
>
The data contains more than 13k rows, so I would need some sort of map in the style of
mapPeriod <- function(df, fun) {
out <- vector("vector_of_weeks", length(df))
for (i in seq_along(df)) {
out[i] <- fun(df[[i]])
}
out
}
I just don't know what to put in fun to divide the weeks into the chosen sequences of periods. Could the function rep be of assistance here? How?
I would be very grateful for all input and suggestions.
split(weeks, f = (weeks - 1) %/% 3)
$`0`
[1] 1 1 1 2 2 3 3
$`1`
[1] 4 5 5 6 6 6
from comments below
weeks <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6)
df <- data.frame(weeks)
library(data.table)
df$period <- data.table::rleid((weeks - 1) %/% 3)
#    weeks period
# 1      1      1
# 2      1      1
# 3      1      1
# 4      2      1
# 5      2      1
# 6      3      1
# 7      3      1
# 8      4      2
# 9      5      2
# 10     5      2
# 11     6      2
# 12     6      2
# 13     6      2
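If loading data.table is not an option, the same labels can probably be obtained with integer division alone. This is a base R sketch; period_length is a name introduced here for illustration. One difference to keep in mind: rleid numbers the runs consecutively, whereas the arithmetic below leaves a gap in the numbering whenever an entire period has no observations.

```r
weeks <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6)
df <- data.frame(weeks)

# Hypothetical parameter: number of weeks per period.
period_length <- 3

# Weeks 1-3 map to period 1, weeks 4-6 to period 2, and so on.
df$period <- (df$weeks - 1) %/% period_length + 1
df
```

Because the calculation is vectorised, no explicit loop or map over the 13k rows should be needed.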
I'm sure this is very obvious, but I'm a beginner in R and I spent a good part of the afternoon trying to solve this...
I'm trying to create a loop to sum the observations in my time series in steps of five.
For example:
input:
1
2
3
4
5
5
6
6
7
4
5
5
4
4
5
6
5
6
4
4
output:
15
28
23
25
My time series has only one variable and 7825 observations.
The purpose of the loop is to calculate the weekly realized volatility. My observations are squared returns. Once I have my loop, I'll be able to take the square root and obtain my weekly realized volatility.
Thank you very much in advance for any help you can provide.
H.
We can create a grouping variable with gl and use that to get the sum in tapply
tapply(input, as.integer(gl(length(input), 5, length(input))),
FUN = sum, na.rm = TRUE)
# 1 2 3 4
# 15 28 23 25
data
input <- scan(text = "1 2 3 4 5 5 6 6 7 4 5 5 4 4 5 6 5 6 4 4", what = numeric())
Here is another base R option using sapply + split
> sapply(split(x,ceiling(seq_along(x)/5)),sum)
1 2 3 4
15 28 23 25
Data
x <- c(1, 2, 3, 4, 5, 5, 6, 6, 7, 4, 5, 5, 4, 4, 5, 6, 5, 6, 4, 4)
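Since the stated goal is weekly realized volatility, the remaining step is the square root of each five-observation sum. A minimal sketch, assuming the series length is an exact multiple of five (7825 is) so the vector can be reshaped into a five-row matrix; weekly_sums and realized_vol are names chosen for this example.

```r
x <- c(1, 2, 3, 4, 5, 5, 6, 6, 7, 4, 5, 5, 4, 4, 5, 6, 5, 6, 4, 4)

# The reshape only makes sense if the length divides evenly by 5.
stopifnot(length(x) %% 5 == 0)

# matrix() fills column by column, so each column holds one block of five.
weekly_sums <- colSums(matrix(x, nrow = 5))

# Square root of the summed squared returns.
realized_vol <- sqrt(weekly_sums)

weekly_sums
realized_vol
```

If the length were not a multiple of five, the tapply or split approaches above would be the safer choice, since they simply produce a shorter last group.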
I want to capture data values from a post on SE into RStudio, and I manage to do so by copying the values, and then pasting them into the following command in the console:
> a = as.numeric(read.table(text = "8 8 4 1 2 2 0 2 5 2 3 3 3 1 5 4 4 1 4 2", sep = " "))
> a
[1] 8 8 4 1 2 2 0 2 5 2 3 3 3 1 5 4 4 1 4 2
Now a is in the global environment. The problem is that I would like to save it into an R file containing a number of other things, let's call it file.R, where vector a would appear as:
a <- c(8, 8, 4, 1, 2, 2, 0, 2, 5, 2, 3, 3, 3, 1, 5, 4, 4, 1, 4, 2)
Unfortunately for me, the only way I know is to type the commas manually. How can I do this otherwise?
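One way to avoid typing the commas, sketched with base R only, is to build the assignment as a string with paste and then print it or append it to the script. dput(a) prints a similar c(...) expression directly. The name line is introduced here for illustration, and file.R is the file mentioned in the question.

```r
# Read the pasted values; quiet = TRUE suppresses the "Read n items" message.
a <- scan(text = "8 8 4 1 2 2 0 2 5 2 3 3 3 1 5 4 4 1 4 2", quiet = TRUE)

# Build the text of the assignment, commas included.
line <- paste0("a <- c(", paste(a, collapse = ", "), ")")

# Print it so it can be copied into the script.
cat(line, "\n")

# To append it to the script instead of printing:
# cat(line, file = "file.R", sep = "\n", append = TRUE)
```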
I am attempting to keep only deids with multiple observations.
I have the below code
help <- data.frame(deid = c(1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
session.number = c(1, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
days.since.last = c(0, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4))
   deid session.number days.since.last
1     1              1               0
2     5              1               0
3     5              2               7
4     5              3              14
5     5              4              93
6     5              5               5
7     5              6             102
8    12              1               0
9    12              2              21
10   12              3             104
11   12              4               4
My feeble attempt was to use the group_by and then the filter( ) command
help %>% group_by(deid) %>% filter(session.number >=2)
However, it only keeps session.number's at 2 or greater. So I get rid of the deid = 1, but all the remaining deid data starts at session.number 2, and not session.number 1.
What I am trying to tell R is to keep the groups (deid) with greater than 1 observation (session.number)
Any assistance is greatly appreciated.
This should do it. You need to filter by the number of observations in each group, which n() returns:
library(dplyr)
help %>% group_by(deid) %>% filter(n() > 1)
   deid session.number days.since.last
1     5              1               0
2     5              2               7
3     5              3              14
4     5              4              93
5     5              5               5
6     5              6             102
7    12              1               0
8    12              2              21
9    12              3             104
10   12              4               4
Using data.table instead:
library(data.table)
help <- as.data.table(help)  # the [, j, by] syntax below needs a data.table
helpcount <- help[, list(Count = .N), by = deid]
helpf <- merge(help, helpcount, by = "deid")
helpf <- helpf[Count > 1]
EDIT: A bit more concise:
help[, Count := .N, by = deid]
help[Count > 1]
EDIT2: thelatemail's even more concise solution:
help[,if(.N > 1) .SD, by=deid]
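For completeness, here is a base R sketch of the same idea without extra packages, using ave to attach the group size to every row. The data frame is called sessions here (a hypothetical name) rather than help, simply to avoid reusing the name of a base R function; group_size and kept are also names made up for this example.

```r
sessions <- data.frame(
  deid            = c(1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
  session.number  = c(1, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
  days.since.last = c(0, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4)
)

# For each row, the number of rows that share its deid.
group_size <- ave(sessions$deid, sessions$deid, FUN = length)

# Keep only the rows whose deid occurs more than once.
kept <- sessions[group_size > 1, ]
kept
```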
We've got a problem with removing two outliers from our dataset. The data comes from an experiment with two independent variables and one dependent variable. We ran a multiple regression and analyzed the "Normal Q-Q" plot. It showed us two outliers (10, 46). Now we would like to remove those two cases before rerunning the multiple regression without the outliers.
We've already tried various commands recommended on several R platforms, but unfortunately nothing worked.
We would be glad if any of you had an idea that would help us solve our problem.
Thank you very much for helping.
Since no data was provided, I fabricated some:
> x <- data.frame(a = c(10, 12, 14, 6, 10, 8, 11, 9), b = c(1, 2, 3, 24, 4, 1, 2, 4),
c = c(2, 1, 3, 6, 3, 4, 2, 48))
> x
   a  b  c
1 10  1  2
2 12  2  1
3 14  3  3
4  6 24  6
5 10  4  3
6  8  1  4
7 11  2  2
8  9  4 48
If the 4th case in column x$b and the 8th case in column x$c are outliers:
> x1 <- x[-c(4, 8), ]
> x1
   a b c
1 10 1 2
2 12 2 1
3 14 3 3
5 10 4 3
6 8 1 4
7 11 2 2
Is this what you need?
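Applied to the regression workflow in the question, the same negative row indexing could look like the sketch below. The data is simulated and the names dat, y, x1, x2, fit_all and fit_clean are placeholders, since the original dataset was not posted. It also assumes that 10 and 46 are row positions: the diagnostic plots label points by row name, which matches the row position only when the data frame has default row names.

```r
# Simulated stand-in for the experiment: two predictors, one response.
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(50)

# Regression on the full data.
fit_all <- lm(y ~ x1 + x2, data = dat)

# Drop the two flagged cases by row position, then refit.
outliers <- c(10, 46)
dat_clean <- dat[-outliers, ]
fit_clean <- lm(y ~ x1 + x2, data = dat_clean)

nrow(dat)
nrow(dat_clean)
summary(fit_clean)
```

Keeping the cleaned data in a separate object, rather than overwriting dat, makes it easy to compare the two fits afterwards.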