Pyspark column population with computation - r

I am stuck on this issue. Below is my dataframe:
a b c
0 0 126
30 0 0
Now I need to repopulate column c using the formula c = (previous c) - a + b. The resulting dataframe should look like the one below, where 96 is computed as 126 - 30 + 0:
a b c
0 0 126
30 0 96
Please help me in crossing this hurdle

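You can use the lag function to get the previous value, as below (Scala syntax). Note that lag reads the original c of the previous row, not a recomputed one, and that the default of 126 for the first row is hard-coded.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, monotonically_increasing_id}
import spark.implicits._ // assumes a SparkSession named spark; enables the $"col" syntax
df.withColumn("id", monotonically_increasing_id())
  .withColumn("c", lag($"c", 1, 126).over(Window.orderBy("id")) - $"a" + $"b")
  .drop("id").show(false)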
Hope this helps!

Cannot get completed dataset using imputeMCA

I use the missMDA package to fill in multiple categorical columns. However, I cannot get any result from these two functions: estim_ncpMCA(test_fill) and imputeMCA(test_fill). The program keeps running without showing a progress bar or returning any results.
This is the sample from the dataset.
Hybrid G1 G5 G8 G9 G10
P1000:P2030 0 -1 0 1 0
P1006:P1384 0 0 0 0 1
P1006:P1401 0 NA NA NA 1
P1006:P1412 0 0 0 0 1
P1006:P1594 0 0 0 0 1
P1013:P1517 0 0 0 0 1
I am working on a genetics project in R. In this dataset, there are 497 rows and 11,226 columns. Every row is the genetic marker series for a particular hybrid, and every column is a genetic marker ("G1", "G2", etc.) with values 1, 0, -1 and NA. There are 746,433 missing values in total, and I am trying to fill them in with imputeMCA.
I also made some transformations on test_fill before running imputeMCA.
test_fill = as.matrix(test_fill)
test_fill[, -1] <- lapply(test_fill[, -1], as.numeric)
I am wondering whether this is caused by having too many columns in my dataset, and whether I need to transpose my columns and rows.
I don't know if you found your answer, but I think your function doesn't run because of your first column, which seems to be the label of the individuals. You can specify that it should not be taken into the analysis.
estim_ncpMCA(test_fill[,2:11226], ncp.max = 5)
imputeMCA(test_fill[,2:11226], ncp = X)
I hope this can help.
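Another point that may matter here: as far as I know, imputeMCA expects a data frame whose columns are factors, whereas the transformation in the question turns the data into a matrix. The following is only a minimal sketch of the conversion, using a toy stand-in for test_fill built from a few of the sample rows. The missMDA calls are left commented out because they need the package installed, and the names markers, ncp_est and completed are made up for illustration.

```r
# Toy stand-in for test_fill (values copied from three of the sample rows)
test_fill <- data.frame(
  Hybrid = c("P1000:P2030", "P1006:P1384", "P1006:P1401"),
  G1 = c(0, 0, 0),
  G5 = c(-1, 0, NA),
  G8 = c(0, 0, NA),
  stringsAsFactors = FALSE
)

# Drop the label column and keep the labels as row names instead
markers <- test_fill[, -1]
rownames(markers) <- test_fill$Hybrid

# Convert every marker column to a factor; the [] keeps markers a data frame
markers[] <- lapply(markers, factor)
str(markers)

# With missMDA installed, the calls would then look roughly like this:
# library(missMDA)
# ncp_est   <- estim_ncpMCA(markers, ncp.max = 5)
# completed <- imputeMCA(markers, ncp = ncp_est$ncp)
```

With 11,226 columns the estimation step can still take a long time, so it may be worth trying a subset of columns first.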

Reverse cumsum with breaks with non-sequential numbers

Looking to fill a matrix with a reverse cumsum. There are multiple breaks that must be maintained.
I have provided a sample matrix for what I want to accomplish. The first column is the data, the second column is what I want. You will see that column 2 is updated to reflect the number of items that are left. When there are 0's the previous number must be carried through.
update <- matrix(c(rep(0,4),rep(1,2),2,rep(0,2),1,3,
rep(10,4), 9,8,6, rep(6,2), 5, 2),ncol=2)
I have tried multiple ways to create a sequence or a loop, using numerous packages (e.g. zoo). The difficulty is that the numbers in column 1 can be any of 0, 1, ..., X, as long as they are less than column 2.
Any help or tips would be appreciated.
EDIT: Column 2 starts with a given value, which can represent any starting value (e.g. inventory at the beginning of a month). Column 1 then represents the "purchases" made; thus, column 2 should reflect the total number of remaining items available.
The following will report the purchase and inventory balance as described:
starting_inventory <- 100
df <- data.frame(purchases=c(rep(0,4),rep(1,2),2,rep(0,2),1,3))
df$cum_purchases <- cumsum(df$purchases)
df$remaining_inventory <- starting_inventory - df$cum_purchases
Result:
purchases cum_purchases remaining_inventory
1 0 0 100
2 0 0 100
3 0 0 100
4 0 0 100
5 1 1 99
6 1 2 98
7 2 4 96
8 0 4 96
9 0 4 96
10 1 5 95
11 3 8 92
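As a quick check, the same idea can be applied to the sample matrix from the question. This is a sketch that assumes the starting value can be read off the first row (remaining items plus that row's purchases); the names start and remaining are made up for illustration.

```r
# Sample matrix from the question: column 1 = purchases, column 2 = wanted result
update <- matrix(c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3,
                   rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2), ncol = 2)

# Assumed starting inventory: what is left after row 1 plus what row 1 took
start <- update[1, 2] + update[1, 1]

# Remaining items = start minus the running total of purchases
remaining <- start - cumsum(update[, 1])

# TRUE if the computed column matches the wanted column 2
all(remaining == update[, 2])
```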

Calculate mean value of subsets and store them in a vector for further analysis

Hullo, I've been working on a dataset for a while now, but am also kind of stuck. One question/answer here was already helpful, but I need to calculate the mean not for a single value, but sixty.
My dataset is basically this:
> data[c(1:5, 111:116), c(1:6, 85:87)]
plotcode block plot subsample year month Alo.pra Ant.odo Arr.ela
91 B1A01 B1 A01 1 2003 May 0 9 0
92 B1A02 B1 A02 1 2003 May 38 0 0
93 B1A03 B1 A03 1 2003 May 0 0 0
94 B1A04 B1 A04 1 2003 May 0 0 0
95 B1A05 B1 A05 1 2003 May 0 0 0
214 B2A16 B2 A16 2 2003 May 0 0 0
215 B2A17 B2 A17 2 2003 May 0 0 0
216 B2A18 B2 A18 2 2003 May 486 0 0
217 B2A19 B2 A19 2 2003 May 0 0 0
218 B2A20 B2 A20 2 2003 May 0 0 0
219 B2A21 B2 A21 2 2003 May 0 0 0
The first few columns are general data about the data point. Each plot has had up to 4 subsamples. The columns 85:144 are the data I want to calculate the means of.
I used this command:
tapply(data2003[,85] , as.factor(data2003$plotcode), mean, na.rm=T)
But like I said, I need to calculate the mean sixty times, for columns 85:144.
My idea was using a for–loop.
for (i in 85:144)
{
temp <- tapply(data2003[,i], data2003$plotcode, mean, na.rm=T)
mean.mass.2003 <- rbind(mean.mass.2003, temp)
}
But that doesn't work. I get multiple error messages, "number of columns of result is not a multiple of vector length (arg 2)".
What I basically want is a table in which the columns represent the species, with the rows as the plotcode and the actual entries in the fields being the respective means.
After some figuring and fiddling, and with some help, I got something that works the way I wanted. I know it's a kind of convoluted approach, but I only just started R, so I do like to understand what I code:
data.plots <- matrix(NA, 88, 60) ## A new, empty matrix (88 rows, 60 columns) we'll fill with the loop
for (i in 85:144) # The numbers because that's where our relevant data is
{
temp <- tapply(data2007[,i], data2007$plotcode, mean, na.rm=TRUE) # For the i-th column of data2007, tapply calculates the mean over all rows sharing the same plotcode, ignoring NAs. temp is a vector with one mean per plotcode.
data.plots[,i-84] <- as.numeric(temp) # writes the means we just calculated into the matching column of data.plots.
}
colnames(data.plots) <- colnames(data[85:144])
rownames(data.plots) <- as.data.frame(table(data$plotcode))[,1] # the second part is basically a count() function, returning in the first column the unique entries found and in the second the frequency of that entry.
This works. For each species column it puts the mean biomass per unique plotcode into a temporary vector, and then fills the columns of the target matrix data.plots one after another.
After naming the rows and columns of data.plots I can work with it without always having to remember each name.
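For comparison, the loop can probably be replaced by a single sapply call that runs tapply on each species column. This is only a sketch on toy data: the values and the three-column range are made up, and in the real data the column range would be 85:144.

```r
# Toy data: two subsamples for each of three plotcodes, three species columns
data <- data.frame(
  plotcode = rep(c("B1A01", "B1A02", "B2A16"), each = 2),
  Alo.pra  = c(0, 38, 0, NA, 486, 0),
  Ant.odo  = c(9, 0, 0, 0, 0, 0),
  Arr.ela  = c(0, 0, 0, 0, 0, 0)
)

species_cols <- 2:4  # stands in for 85:144 in the real data

# For each species column, take the mean per plotcode, ignoring NAs.
# The result is a matrix: rows are plotcodes, columns are species.
data.plots <- sapply(data[species_cols],
                     function(col) tapply(col, data$plotcode, mean, na.rm = TRUE))
data.plots
```

Row and column names come along automatically, so the matrix size does not have to be hard-coded.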

Expected number from data in a data.frame in R

I want to turn this equation into R code: (e^(-mean) * mean^i / i!) * N, where i is the index and N is the sample size.
What I have is this:
x["expected92"]<-((exp(-me92))(me92^(x$multX1992))/(x$multX1992));
I want to create a new column that goes through the index and computes the expected number for each row.
example data:
Drag 1992 multX1992
0 113 0
1 30 30
3 15 30
example of wanted output:
Drag 1992 multX1992 expected92
0 113 0 90.03
1 30 30 58.80
3 15 30 19.20
Can someone help fix my code?
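One possible fix, as a sketch: the formula needs explicit * operators and a factorial, and the exponent should be the index i (the Drag column) rather than multX1992. The values of N and me92 below are assumed for illustration and are not computed from the three rows shown, and the toy index runs 0, 1, 2.

```r
# Toy index values; in the real data this would be x$Drag
x <- data.frame(Drag = 0:2)

N    <- 173    # assumed sample size
me92 <- 0.653  # assumed mean

# Expected count for index i: N * e^(-mean) * mean^i / i!
x$expected92 <- N * exp(-me92) * me92^x$Drag / factorial(x$Drag)

# The same quantity via the built-in Poisson density
x$expected92_dpois <- N * dpois(x$Drag, lambda = me92)
x
```

Using dpois avoids writing the formula by hand and also stays accurate for large i, where factorial() can overflow.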

subset all columns in a data frame less than a certain value in R

I have a dataframe that contains 7 p-value variables.
I can't post it because it is private data but it looks like this:
>df
o m l c a aa ep
1.11E-09 4.43E-05 0.000001602 4.02E-88 1.10E-43 7.31E-05 0.00022168
8.57E-07 0.0005479 0.0001402 2.84E-44 4.97E-17 0.0008272 0.000443361
0.00001112 0.0005479 0.0007368 1.40E-39 3.17E-16 0.0008272 0.000665041
7.31E-05 0.0006228 0.0007368 4.59E-33 2.57E-13 0.0008272 0.000886721
8.17E-05 0.002307 0.0008453 4.58E-18 5.14E-12 0.0008336 0.001108402
Each column has values from 0-1.
I would like to subset the entire data frame by extracting all the values in each column less than 0.009 and making a new data frame. If I were to extract on this condition, the columns would have very different lengths. E.g. c has 290 values less than 0.009, and o has 300, aa has 500 etc.
I've tried:
subset(df,c<0.009 & a<0.009 & l<0.009 & m<0.009& aa<0.009 & o<0.009)
When I do this I just end up with a very small number of rows, with every column the same length, which isn't what I want. I want all the values in each column that meet the criterion.
I then want to take this data frame and bin it into p-value range groups by using something like the summary(cut()) function, but I am not sure how to do it.
So essentially I would like to have a final data frame that includes the number of values in each p-value bin for each variable:
o# m# l# c# a# aa# ep#
0.00-0.000001 545 58 85 78 85 45 785
0.00001-000.1 54 77 57 57 74 56 58
0.001-0.002 54 7 5 5 98 7 5 865
An attempt:
sapply(df,function(x) table(cut(x[x<0.009],c(0,0.000001,0.001,0.002,Inf))) )
# o m l c a aa ep
#(0,1e-06] 2 0 0 5 5 0 0
#(1e-06,0.001] 3 4 5 0 0 5 4
#(0.001,0.002] 0 0 0 0 0 0 1
#(0.002,Inf] 0 1 0 0 0 0 0
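The reason subset() gives so few rows is that it keeps only rows where every column passes the test, while the attempt above filters each column on its own. Here is a self-contained sketch of the same idea on toy data. The values and break points are made up, and the last break is set to the 0.009 cutoff instead of Inf.

```r
# Toy p-values for two of the variables
df <- data.frame(
  o = c(1.11e-09, 8.57e-07, 1.112e-05),
  m = c(4.43e-05, 0.0005479, 0.02)
)

cutoff <- 0.009
breaks <- c(0, 1e-6, 1e-3, 2e-3, cutoff)  # assumed bin edges

# For each column: keep values below the cutoff, bin them, count per bin.
# Each column is handled separately, so unequal lengths are not a problem.
counts <- sapply(df, function(p) table(cut(p[p < cutoff], breaks)))
counts
```

The result is a matrix with one row per bin and one column per variable, which can be wrapped in as.data.frame() if a data frame is needed.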
