R: convert text rows to a data frame

How do I convert the R input below into a data frame in the format described?
R code
input <- suppressWarnings(readLines(stdin(), n=10))
2 3
800 2
1 200
2 400
1200 2
1 400
3 600
300
400
500
Expected format
Input explanation
Line 1 [x, y]: x (2) is the total number of executives and y (3) is the total number of properties across all executives.
The lines after that come in one block per executive:
Line 2 [x, y]: x is the Cp of the overall properties for that executive and y is the number of lease properties that executive holds.
Lines 3 and 4 [x, y]: the leases for that executive's properties.
The same pattern repeats for the 2nd executive from line 5.
Finally, lines 8 to 10 give the cost price of each individual property.
Any tips or suggestions on how to start?
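One possible starting point, as a minimal sketch: walk the lines block by block and collect one row per executive/lease pair. The expected output layout is not shown in the question, so the column names (executive, cp, property, lease, cost_price) are assumptions, as is the reading that the first number on each lease line is a property index. The input is hard-coded here instead of being read from stdin().

```r
# Sample input from the question, hard-coded instead of read from stdin().
input <- c("2 3", "800 2", "1 200", "2 400", "1200 2", "1 400", "3 600",
           "300", "400", "500")

# Split every line into its numeric fields.
fields <- lapply(strsplit(trimws(input), "\\s+"), as.numeric)

n_exec <- fields[[1]][1]   # total number of executives
n_prop <- fields[[1]][2]   # total number of properties

pos <- 2                   # index of the next unread line
blocks <- vector("list", n_exec)
for (e in seq_len(n_exec)) {
  cp      <- fields[[pos]][1]   # Cp for this executive
  n_lease <- fields[[pos]][2]   # number of lease lines that follow
  lease   <- do.call(rbind, fields[pos + seq_len(n_lease)])
  blocks[[e]] <- data.frame(executive = e, cp = cp,
                            property = lease[, 1], lease = lease[, 2])
  pos <- pos + 1 + n_lease
}
leases <- do.call(rbind, blocks)

# The remaining n_prop lines hold one cost price per property
# (assumed to be listed in property order).
cost <- data.frame(property   = seq_len(n_prop),
                   cost_price = unlist(fields[pos + seq_len(n_prop) - 1]))

# One row per executive/property pair, with the cost price attached.
result <- merge(leases, cost, by = "property")
```

The sketch assumes every executive has at least one lease line; a block with zero leases would need a separate check.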

Related

How do I balance a dataset in R with more than two categories?

Using dummy data, I have four book categories: drama, action, fantasy, documentary.
The full dataset df contains 10,000 rows and 1,000 columns. A sample:
book_id book_category book_word_hi book_word_bye book_word_yes
1 drama 3 0 4
2 action 1 4 5
3 drama 5 3 2
4 fantasy 5 5 5
5 documentary 4 6 5
Whilst the total number of books belonging to each category is roughly equal (2500), there is an unequal word distribution amongst the categories. Looking at this dummy data for example, you can see that 17 words belong to drama, 10 to action and 15 to fantasy and documentary books:
tapply(rowSums(df[3:5]), df[2], sum)
drama: 17
action: 10
fantasy: 15
documentary: 15
I would like to balance the full dataset with undersampling, i.e. cut out a random sample of drama, fantasy and documentary books in the dataset until the sum of the words for those three categories matches (or roughly matches) the number for action. The final output I am looking for would therefore be:
drama: 10
action: 10
fantasy: 10
documentary: 10
(The dataset is too small for this to be practical here since there is only one example of e.g. action, fantasy and drama. But imagine it was much bigger).
I have tried the following code on the full data set:
library(unbalanced)
library(tidyverse)
df$book_category <- as.factor(df$book_category) # convert class to factor
levels(df$book_category) <- c('drama', 'action', 'fantasy', 'documentary') # names of factors
predictor_vars <- df[2:1000] # Select everything except response
response_vars <- df$book_category # Only select response variable
levels(response_vars) <- c('0', '1', '2', '3') # rename factors
undersampled_data <- ubBalance(predictor_vars,
                               response_vars,
                               type = "ubUnder", # Option for undersampling
                               verbose = TRUE)
However, I get the following error because I do not have two categories of books (but four). Does anyone know how to fix this?
Error in ubBalance(predictor_variables, response_variable, type = "ubUnder", :
Y must be a binary factor variable.
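Since ubBalance reports that it needs a binary response, one way around it is to undersample by hand in base R. The following is only a sketch on hypothetical toy data (40 random books), not a drop-in replacement for ubBalance: within each category it shuffles the books and keeps them while the running word total stays at or below the smallest category total. The n_words helper column and the toy data are assumptions made for illustration.

```r
set.seed(42)  # reproducible sampling

# Hypothetical toy data: 40 books, four categories, three word-count columns.
df <- data.frame(
  book_id       = 1:40,
  book_category = rep(c("drama", "action", "fantasy", "documentary"), each = 10),
  book_word_hi  = sample(0:6, 40, replace = TRUE),
  book_word_bye = sample(0:6, 40, replace = TRUE),
  book_word_yes = sample(0:6, 40, replace = TRUE)
)

# Total words per book across all word-count columns.
word_cols  <- grep("^book_word_", names(df))
df$n_words <- rowSums(df[word_cols])

# Target: the smallest per-category word total.
totals <- tapply(df$n_words, df$book_category, sum)
target <- min(totals)

# Within each category, shuffle the rows and keep books while the running
# word total stays at or below the target.
rows_by_cat <- split(seq_len(nrow(df)), df$book_category)
keep_rows <- unlist(lapply(rows_by_cat, function(rows) {
  rows <- rows[sample.int(length(rows))]
  rows[cumsum(df$n_words[rows]) <= target]
}))

balanced <- df[sort(keep_rows), ]
new_totals <- tapply(balanced$n_words, balanced$book_category, sum)
```

The totals in new_totals match the target only roughly: the smallest category keeps all of its books, while the others stop at the last book that still fits under the target.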

Is there a way I can use R code to calculate the average price for specific days? (AVERAGEIF function)

Firstly: I have seen other posts about translating AVERAGEIF from Excel into R, but none of them fit my specific case and I could not get any of them to work.
I have a dataset which encompasses the daily pricings of a bunch of listings.
It looks like this
listing_id date price
1 1000 1/2/2015 $100
2 1200 2/4/2016 $150
Sample of the dataset (and desired outcome) # https://send.firefox.com/download/228f31e39d18738d/#rlMmm6UeGxgbkzsSD5OsQw
The dataset I would like to have has only the date and the average prices of all listings on that date. The goal is to get a (different) dataframe which would look something like this so I can work with it:
Date Average Price
1 4/5/2015 204.5438
2 4/6/2015 182.6439
3 4/7/2015 176.553
4 4/8/2015 182.0448
5 4/9/2015 183.3617
6 4/10/2015 205.0997
7 4/11/2015 197.0118
8 4/12/2015 172.2943
I created this in Excel using the AVERAGEIF function (and copy-pasting by value) from the sample provided above.
I first tried to format the data in Excel, where I could use AVERAGEIF to take the average for one specific date. The problem is that the dataset has 30 million rows and Excel only allows about 1 million, so it didn't work.
What I have done so far: I created a data frame in R (where I want the average prices to go) using
Avg = data.frame("Date" =1:2, "Average Price"=1:2)
Avg[nrow(Avg) + 2036,] = list("v1","v2")
Avg$Date = seq(from = as.Date("2015-04-05"), to = as.Date("2020-11-01"), by = 'day')
I tried to create an averageif-like function by this article and another but could not get it to work.
I hope this is enough information to go on otherwise I would be more than happy to provide more.
If your question is how to replicate the AVERAGEIF function, you can use logical indexing:
R code:
> df
Dates Prices
1 1 100
2 2 120
3 3 150
4 1 320
5 2 250
6 3 210
7 1 102
8 2 180
9 3 150
idx <- df$Dates == 1 # Positions where condition is true
mean(df$Prices[idx]) # Prints same output as Excel
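To get the average for every date at once rather than one date at a time, a grouped mean is one option. This is a minimal sketch in base R on hypothetical sample rows laid out like the question's data; it assumes prices are stored as text with a leading "$" and that dates are in month/day/year order.

```r
# Hypothetical sample in the layout shown in the question.
listings <- data.frame(
  listing_id = c(1000, 1200, 1000, 1200),
  date       = c("4/5/2015", "4/5/2015", "4/6/2015", "4/6/2015"),
  price      = c("$100", "$150", "$120", "$1,080")
)

# Strip "$" and thousands separators so that price becomes numeric.
listings$price <- as.numeric(gsub("[$,]", "", listings$price))

# Parse the dates; the month/day/year order is an assumption.
listings$date <- as.Date(listings$date, format = "%m/%d/%Y")

# One row per date with the mean price of all listings on that date.
avg <- aggregate(price ~ date, data = listings, FUN = mean)
names(avg) <- c("Date", "Average Price")
```

With 30 million rows, a package built for grouped aggregation (for example data.table or dplyr) will likely be faster than aggregate, but the idea is the same.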

Replace id numbers in rows based on the match between two columns

I am working with club membership data in which each row represents one of the 10 student clubs, and the number of non-empty columns in a row is the membership "size" of that club. Each non-empty cell of the data frame holds a "random number" identifying a student who is a member of that club (random numbers are used to hide identities).
Each club has at least one member, but not all students are registered as club members (some are not involved in any club). The data looks like this (only part of the data is shown):
club_id mem1 mem2 mem3 mem4 mem5 mem6 mem7
1 339 520 58
2 700
3 80 434
4 516 811 471
5 20
6 211 80 439 516 305
I want to replace those random numbers with student ids (without revealing their real names) based on the match between the random numbers assigned to them and their student ids; however, only some of the student ids are matched to the random numbers assigned to those students.
I compiled them into a dataframe of 2 columns, which is available here and looks like
match <- read.csv("https://www.dropbox.com/s/nc98i784r91ugin/match.csv?dl=1")
head(match)
id rn
1 1 700
2 2 339
3 3 540
4 4 58
5 5 160
6 6 371
where column rn means random number.
So the tasks I am having trouble with are to
(1) match and replace the random numbers in the data frame with their corresponding student ids
(2) set the unmatched random numbers to NA
I would really appreciate it if someone could enlighten me on this.
Not sure if I got the logic right. I replicated only a short version of your initial table and replaced the first number with 1000 (because that is a number that has no matching id).
club2 <- data.frame(club_id = 1:6, mem2 = c(1000, 700, 80, 516, 20, 211))
match <- read.csv("https://www.dropbox.com/s/nc98i784r91ugin/match.csv?dl=1")
Then, for the column mem2, I check if it exists in match$rn. If that is not the case, an NA is inserted. If that is the case, however, it inserts match$id - the one at the position where match$rn is equal to the number in mem2.
club2$mem2 <- ifelse(club2$mem2 %in% match$rn == TRUE, match$id[match(club2$mem2, match$rn)], NA)
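The same lookup can be applied to every member column at once. The sketch below uses a small hypothetical lookup table (the six rows shown by head(match) in the question) instead of downloading the CSV, and it relies on match() already returning NA for values it cannot find, so no ifelse is needed. The names lookup and club are illustrative.

```r
# Lookup table with the rows shown by head(match) in the question.
lookup <- data.frame(id = 1:6, rn = c(700, 339, 540, 58, 160, 371))

# First rows of the club table; empty cells are represented as NA.
club <- data.frame(
  club_id = 1:3,
  mem1 = c(339, 700, 80),
  mem2 = c(520, NA, 434),
  mem3 = c(58, NA, NA)
)

# Replace each random number by its student id in every mem* column.
# match() gives NA for numbers that are not in lookup$rn.
mem_cols <- grep("^mem", names(club))
club[mem_cols] <- lapply(club[mem_cols],
                         function(x) lookup$id[match(x, lookup$rn)])
```

Note that empty cells and unmatched random numbers both end up as NA, so they can no longer be told apart afterwards.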

R: return the name with the least number of occurrences

I need to find the sector with the lowest frequency in my data frame. Using min gives the minimum number of occurrences, but I would like to obtain the name of the sector with the lowest number of occurrences. So in this case, I would like it to print "Consumer Staples". I keep getting the frequency and not the actual sector name. Is there a way to do this?
Thank you.
sector_count <- count(portfolio, "Sector")
sector_count
Sector freq
1 Consumer Discretionary 5
2 Consumer Staples 1
3 Health Care 2
4 Industrials 3
5 Information Technology 4
min(sector_count$freq)
[1] 1
You want
sector_count$Sector[which.min(sector_count$freq)]
which.min(sector_count$freq) returns the index of the row where the minimum value is found (the first such row if there are ties). The sector_count$Sector vector is then subset at that index to get the corresponding name.
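If several sectors can share the lowest frequency, comparing against min() returns all of them. A small sketch, rebuilding the table from the question by hand (the variable names least and all_least are illustrative):

```r
# The frequency table from the question, rebuilt by hand.
sector_count <- data.frame(
  Sector = c("Consumer Discretionary", "Consumer Staples", "Health Care",
             "Industrials", "Information Technology"),
  freq   = c(5, 1, 2, 3, 4)
)

# First sector with the minimum frequency.
least <- sector_count$Sector[which.min(sector_count$freq)]

# Every sector whose frequency equals the minimum (handles ties).
all_least <- sector_count$Sector[sector_count$freq == min(sector_count$freq)]
```

With this data both expressions give "Consumer Staples"; they differ only when two or more sectors are tied for the minimum.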

How to perform an operation on two groups in the same data.table, where the two groups both need to be referenced in the j field

How to create a new column with the ratio of the 800 to 700 channel? I find myself running into these types of issues often, with much more complicated data.tables. Other examples would be to subtract the 800 channel of the same time from the 700 channel of the same time.
Example:
kdat <- data.table(channel = c(rep(c(700, 800), each = 3)),
                   time = c(rep(1:3, 2)),
                   value = c(1:6))
channel time value
1: 700 1 1
2: 700 2 2
3: 700 3 3
4: 800 1 4
5: 800 2 5
6: 800 3 6
Options I can see are:
1.) Move from long to wide format and then divide, then convert back to long.
- Don't like because have to go back and forth between long and wide.
note: I go back to long since I like to keep all data together, and can do all plotting from a single data.table.
2.) kdat[channel==800,.(value)]/kdat[channel==700,.(value)]
- Don't like this because there is no checking to ensure the same times etc are matched up.
3.) Is there a way to do it with by .SD or some other way that I am missing?
Desired output:
channel time value ratio
1: 700 1 1 4
...
6: 800 3 6 2
I would probably do
setkey(kdat, time)
kdat[
  dcast(kdat, time ~ channel, value.var = "value")[, rat := `800`/`700`],
  rat := i.rat
]
So you're changing from long to wide, but only in this temporary table used for merging, and only with the three relevant columns (time, channel and value).
If you're sure that every time that appears for one channel appears for the other, you can do
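# rep() repeats the ratio once per channel; recent data.table versions do not recycle a longer RHS in :=
kdat[order(channel, time), rat := rep(with(split(value, channel), `800`/`700`), 2)]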
Well, if you must use .SD :)
kdat[, copy(.SD)[.SD[channel == 800
][.SD[channel == 700],
rat := value / i.value, on='time'
], rat := i.rat, on='time']][]
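Another way to reference both groups in j, sketched here as an alternative to the answers above, is to group by time so that each group contains both channels, and then pick the two values out by channel. It assumes every time has exactly one row for channel 700 and one for channel 800.

```r
library(data.table)

# Example data from the question.
kdat <- data.table(channel = rep(c(700, 800), each = 3),
                   time    = rep(1:3, 2),
                   value   = 1:6)

# Within each time group, j sees the rows of both channels, so the two
# values can be selected by channel and divided. The single resulting
# ratio is assigned to every row of the group.
kdat[, ratio := value[channel == 800] / value[channel == 700], by = time]
```

Because the grouping is on time, the two values are always taken from the same time, which addresses the matching concern raised under option 2.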
