Splitting a column that contains a double in R - r

I am currently working on a project in R and I have a column that receives output from a kmeans model that chooses the mode cluster that a particular store belongs to. Unfortunately there is a tie so one of the instances in the column is getting assigned to two clusters. See example output below. The columns are rownumber, Store, and Cluster, respectively.
row store cluster
759 759 3
760 760 3
761 761 3
762 762 1, 3
763 763 3
764 764 1
I need to break out the 1 from the ,3 and just keep the one in the column.

You could just do something like this:
my_data <- dplyr::data_frame("row" = 759:764, "store" = 759:764, "cluster" = c("3", "3", "3", "1, 3", "3", "1"))
my_data
Source: local data frame [6 x 3]
row store cluster
1 759 759 3
2 760 760 3
3 761 761 3
4 762 762 1, 3
5 763 763 3
6 764 764 1
my_data$cluster <- my_data$cluster %>% stringr::str_extract("[^,]")
my_data
Source: local data frame [6 x 3]
row store cluster
1 759 759 3
2 760 760 3
3 761 761 3
4 762 762 1
5 763 763 3
6 764 764 1
The line of code that sets my_data$cluster tells R to extract everything from a string that is not a comma; once it hits a comma it stops. Since we use stringr::str_extract instead of stringr::str_extract_all, it only returns the first value.

If the column 'cluster' contains string element, we could do this using sub from base R. We match the comma followed by one or more characters until the end of the string and replace it with ''.
df1$cluster <- sub(',.*$', '', df1$cluster)
If the column is a list, we use sapply to extract the first element
df1$cluster <- sapply(df1$cluster, `[`,1)

Related

Remove row with specific value

I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data I use data [row, column]. But when I want to use only the rows in which Quarter equals (for example) 4. I use data [Quarter = 4,] Why is it not data [, Quarter = 4] since Quarter is a column and not a row?
I did not find an appropriate answer in the internet which really explains the why.
Thank you.
You have used 'data.table' function to import your data, so you could write :
sales[Customer != 891,]
The data[Quarter = 4, ], ensures that all columns should be returned for the rows where Quarter is equal to 4. The comma(,) is necessary to only select the rows, and not the column Quarter = 4.
When you use indexing, ie, data[row, column] you are telling R to look for either a specific row or column index.
Row: sales[sales$Customer %in% c(192,964),] translates to "search the specific column Customer in the data frame (or table) for any rows that have values that contain 192 or 964 and isolate them. Note that data.table will allow for sales[Customer %in% c(192, 964),] but data frames cant (use sales[sales$Customer %in% c(192,964),])
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns sales[, "Customer"] translates to "search the data frame (or table) for columns named "Customer" and isolate all its rows
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine - if you did, sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] you would isolate all values of Customer and Producttype which were in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...

Adding values of two columns on the same row to get a new value

Sorry for asking a very basic question but I am new to R and really stuck on a rather simple matter; I have the data frame below (2 rows and 7 columns):
Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166
These values correspond with time duration (secs) for seven test conditions
col$names <- c(sup_b, hdt, sup_2, lbnp, sup_3, hut, sup_4)
and 17 rows (each row is for one study subject- I have only included first two rows).
I am trying to add values from row 1 col$sup_b (175) and row 1 col$hdt (434) to get the combined duration for the first two conditions i.e. 609 secs. I then add the value of the previous two cols (609) to the next col$sup_2 to get the total duration (609 + 596) and so on until the last condition col$sup_4.
I have tried the method below which is for subject 6 (row 1), which works fine, but I want to tidy this up and make it easier as I have 17 subjects (rows) and have been advised there is an easier way around this:
sup_b <- 175
hdt <- (sup_b + 434)
sup_2 <- (hdt + 596)
lbnp <- (sup_2 + 585)
sup_3 <- (hdt_lbnp + 601)
hut <- (sup_3 + 593)
sup_4 <- (hut + 211)
I want to be able to just change the number of row and have the data pulled across from the data frame rather than entering each individual time period; for instance:
line <- 1 ### the row I want which corresponds to the subject
sup_b <- df[line, 2]
hdt <-df[line, 2] + df[line, 3]
but I keep getting this warning message:
In Ops.factor(df[line, 2], df[line, 3]) : ‘+’ not meaningful for factor
I have even tried: colSums(df[,c(2:3)]), but get the following warning:
Error in colSums(df[, c(2:3)]) : 'x' must be numeric.
also tried: st$sum <- apply(df[,c(2:3)], 1, sum), which doesn't work either.
df1[-1] <- t(apply(df1[-1],1,cumsum))
# Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
# 1 6 175 609 1205 1790 2391 2984 3195
# 2 7 130 722 1314 1907 2507 2891 3057
data
df1 <- read.table(text="Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166",h=T,strin=F)

Loop for subsetting data.frame

I work with neuralnet package to predict values of stocks (diploma thesis). The example data are below
predict<-runif(23,min=0,max=1)
day<-c(369:391)
ChoosedN<-c(2,5,5,5,5,5,4,3,5,5,5,2,1,1,5,5,4,3,2,3,4,3,2)
Profit<-runif(23,min=-2,max=5)
df<-data.frame(predict,day,ChoosedN,Profit)
colnames(df)<-c('predict','day','ChoosedN','Profit')
But I haven't always same period for investments (ChoodedN). For backtest the neural site I have to skip the days when I am still in position even if the neural site says 'buy it' (i.e.predict > 0.5). The frame looks like this
predict day ChoosedN Profit
1 0.6762981061 369 2 -1.6288823350
2 0.0195611224 370 5 1.5682195597
3 0.2442795106 371 5 0.6195915225
4 0.9587601107 372 5 -1.9701975542
5 0.7415729680 373 5 3.7826137026
6 0.4814927997 374 5 4.1228808255
7 0.1340754859 375 4 3.7818792837
8 0.6316874851 376 3 0.7670884461
9 0.1107241728 377 5 -1.3367400097
10 0.5850426450 378 5 2.2848396166
11 0.2809308425 379 5 2.5234691438
12 0.2835292015 380 2 -0.3291319925
13 0.3328713216 381 1 4.7425349397
14 0.4766904986 382 1 -0.4062103292
15 0.5005860797 383 5 4.8612083721
16 0.2734292494 384 5 -0.2320077328
17 0.1488479455 385 4 2.6195679584
18 0.9446908936 386 3 0.4889716264
19 0.8222738281 387 2 0.7362413658
20 0.7570014759 388 3 4.6661250258
21 0.9988698252 389 4 2.6340743946
22 0.8384663551 390 3 1.0428046484
23 0.1938821415 391 2 0.8855748393
And I need to create new data.frame this way.For example:If predict (in first row) > 0.5,delete second and third row (because ChoosedN in first row is 2 so next two after first row has to be delete, because there we were still in position). And continue on fourth the same way (if predict (fourth row) > 0.5, delete next five rows and so. And of course, if predict <=0.5 delete this row too.
Any straightforward way how to do it with some loop?
Thanks
I would create a new dataframe, then bind the rows you want using rbind inside of a for loop
newDF <- data.frame() # New, Empty Dataframe
i = 1 # Loop index Variable
while (i < nrow(df)) {
if (df$predict[i] > 0.5) { # If predict > 0.5,
newDF <- rbind(newDF, df[i,]) # Bind the row
i = i + df$ChoosedN[i] # Adjust for ChoosedN rows
}
i = i + 1 # Move to the next row
}

creating vector from 'if' function using apply in R

I'm tyring to create new vector in R using an 'if' function to pull out only certain values for the new array. Basically, I want to segregate data by day of week for each of several cities. How do I use the apply function to get only, say, Tuesdays in a new array for each city? Thanks
It sounds as though you don't want if or apply at all. The solution is simpler:
Suppose that your data frame is data. Then subset(data, Weekday == 3) should work.
You don't want to use the R if. Instead use the subsetting function [
dat <- read.table(text=" Date Weekday Holiday Atlanta Chicago Houston Tulsa
1 1/1/2008 3 1 313 313 361 123
2 1/2/2008 4 0 735 979 986 310
3 1/3/2008 5 0 690 904 950 286
4 1/4/2008 6 0 610 734 822 281
5 1/5/2008 7 0 482 633 622 211
6 1/6/2008 1 0 349 421 402 109", header=TRUE)
dat[ dat$Weekday==3, ]

R: looping through data.frame columns

I got a following my_data:
geneid chr acc_no start end size strand S1 S2 A1 A2
1 gene_010010 1 AC12345.1 3662 4663 1002 - 328 336 757 874
2 gene_010020 1 AC12345.1 5750 7411 1662 - 480 589 793 765
3 gene_010030 2 AC12345.1 9003 11024 2022 - 653 673 875 920
4 gene_010040 2 AC12345.1 12006 12566 561 - 573 623 483 430
5 gene_010050 3 AC12345.1 15035 17032 1998 - 2256 2333 1866 1944
6 gene_010060 3 AC12345.1 18188 18937 750 - 526 642 650 586
I am able to calculate sums for a given column, i.e:
chr.sums <- data.frame(with (my_data, tapply(S1, INDEX=chr, FUN=sum)))
Problem is, I want to get chr.sums with four columns (S1, S2, A1 and A2) and 30 rows corresponding to unique chr numbers. I do not want to switch to Python back and forth, but looping through columns and assigning output to specific columns in data.frame baffles me.
EDIT
Toy data set above.
You can use ddply from plyr. Here is some code:
plyr::ddply(my_data, .(chr), summarize, S1 = sum(S1), S2 = sum(S2),
A1 = sum(A1), A2 = sum(A2))
EDIT. A more compact solution would be:
plyr::ddply(my_data, .(chr), colwise(sum, .(S1, S2, A1, A2)))
Here is how it works. The data is first split into pieces based on chr. Then, the columns S1, S2, A1, A2 are summed up for each piece. Finally, they are assembled back into a single data frame.
Any place you have this kind of a split-apply-combine problem, think plyr as a solution.
tapply won't handle multiple columns but the formula version of aggregate will.
chr.sums <- aggregate(cbind(S1,S2,A1,A2) ~ chr, data = my_data, FUN=sum)))

Resources