Deleting column with the least sum in dataframes dynamically in R

In a data frame, I am trying to delete the column whose sum is the smallest. I want it to be dynamic, since I want to use it in a function.
E.g.:
      a b   c
1   434 0  45
2  5452 1 456
3 42342 0  26
4   542 1  15
5   542 1 323
6   413 0  45
I want to remove the 2nd column (i.e. column b) since its sum is the smallest, but I want this to be done dynamically, since it has to be part of a function.

We can use colSums with which.min to get the index of the column with the minimum sum, and remove that column:
df1[-which.min(colSums(df1))]
Another option is Filter:
mn <- min(sapply(df1, sum))
Filter(function(x) sum(x) != mn, df1)
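For instance, a quick sketch using the example data above (the wrapper name drop_min_col is purely illustrative):

df1 <- data.frame(a = c(434, 5452, 42342, 542, 542, 413),
                  b = c(0, 1, 0, 1, 1, 0),
                  c = c(45, 456, 26, 15, 323, 45))

# a hypothetical wrapper, so the operation can be reused inside a function
drop_min_col <- function(df) df[-which.min(colSums(df))]
drop_min_col(df1)   # drops column b, whose sum (3) is the smallest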

Related

How to create a dummy variable based on other columns values in R?

I am cleaning duplicates out of a scraped dataset. I want to create a dummy variable indicating whether there are two or more observations that are identical in all conditions, or in all conditions but one.
Here's an example of my dataset:
Postcode  nrooms  price  sqm
      76       1    259   30
      75       5    380  120
      75       5    400  120
      75       2    450   80
      76       1    259   30
Here's the dummy I want:
Postcode  nrooms  price  sqm  dummy
      76       1    259   30      1
      75       5    380  120      1
      75       5    400  120      1
      75       2    450   80      0
      76       1    259   30      1
The first and last rows have the same values across all characteristics; the second and third have the same values in all characteristics but one (the price).
Could someone help me with this?
Thanks!
Using two apply calls and the duplicated function (see this previous SO answer). We loop over all combinations of columns of size ncol - 1, looking for duplicates using duplicated. Since you're looking for duplicates across all columns or all but one, we only need to look at combinations of size ncol - 1. Then we loop over the result of that operation to find out whether any row is a duplicate for any of the column combinations.
apply(
  apply(combn(ncol(dat), ncol(dat) - 1),
        2,
        FUN = function(cc)
          duplicated(dat[, cc]) | duplicated(dat[, cc], fromLast = TRUE)),
  1,
  max)
# [1] 1 1 1 0 1
As always with a loop inside a loop, it can be helpful to step through each part of this. Inspect the output from combn(ncol(dat), ncol(dat) - 1) first, then the inner apply, as sketched below.
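For example, assuming dat holds the five-row example above:

dat <- data.frame(Postcode = c(76, 75, 75, 75, 76),
                  nrooms   = c(1, 5, 5, 2, 1),
                  price    = c(259, 380, 400, 450, 259),
                  sqm      = c(30, 120, 120, 80, 30))

# each column is one way of choosing 3 of the 4 column indices
combn(ncol(dat), ncol(dat) - 1)
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    1    2
# [2,]    2    2    3    3
# [3,]    3    4    4    4

# the inner apply gives one logical column per combination: TRUE where a
# row is duplicated on that subset of columns
apply(combn(ncol(dat), ncol(dat) - 1), 2,
      FUN = function(cc)
        duplicated(dat[, cc]) | duplicated(dat[, cc], fromLast = TRUE))
#       [,1]  [,2]  [,3]  [,4]
# [1,]  TRUE  TRUE  TRUE  TRUE
# [2,] FALSE  TRUE FALSE FALSE
# [3,] FALSE  TRUE FALSE FALSE
# [4,] FALSE FALSE FALSE FALSE
# [5,]  TRUE  TRUE  TRUE  TRUE

The outer apply then takes the row-wise max of this matrix, which yields the dummy 1 1 1 0 1.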

Add Elements of Data Frame to Another Data Frame Based on Condition R

I have two data frames that showcase results of an analysis from one month and then the subsequent month.
Here is a smaller version of the data:
Jan19=data.frame(Group=c(589,630,523,581,689),Count=c(191,84,77,73,57))
Dec18=data.frame(Group=c(589,630,523,478,602),Count=c(100,90,50,6,0))
Jan19
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
Dec18
Group Count
1 589 100
2 630 90
3 523 50
4 478 6
5 602 0
Jan19 only has counts >0. Dec18 is the dataset with results from the previous month and has counts >=0 for each group. I have been referencing the full Dec18 dataset for counts = 0 and manually entering them into the full Jan19 dataset. I want to rid myself of the manual part of this exercise and just append the groups with counts = 0 to the end of the Jan19 dataset.
That led me to the following code to perform what I described above:
Gdata = rbind(Jan19, Dec18)
Gdata = Gdata[!duplicated(Gdata$Group), ]
While this code resulted in the correct dimensions, it does not choose the correct duplicate to remove. Within the appended dataset, it treats the Jan19 results > 0 as the duplicates and removes those. This is the result:
Gdata
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
9 478 6
10 602 0
Essentially, I wanted that 6 to show up as a 0. That led me to the following line of code, where I wanted to set a condition: if the newly appended data (Dec18) has a Group duplicated in the newer data (Jan19), then the corresponding Count should be 0; otherwise, the value of Count from the Jan19 dataset should hold.
Gdata=ifelse(Dec18$Group %in% Jan19$Group==FALSE, Gdata$Count==0,Jan19$Count)
This is resulting in errors and I'm not sure how to modify it to achieve my desired result. Any help would be appreciated!
Your rbind/deduplication approach is a good one; you just need the Dec18 data you rbind on to have its Count column set to 0:
Gdata = rbind(Jan19, transform(Dec18, Count = 0))
Gdata[!duplicated(Gdata$Group), ]
# Group Count
# 1 589 191
# 2 630 84
# 3 523 77
# 4 581 73
# 5 689 57
# 9 478 0
# 10 602 0
"While this code resulted in the correct dimensions, it does not choose the correct duplicate to remove. Within the appended dataset, it treats the Jan19 results > 0 as the duplicates and removes those."
This is incorrect. !duplicated() keeps the first occurrence and removes later occurrences. None of the Jan19 data is removed; we can see that the first 5 rows of Gdata are exactly the 5 rows of Jan19. The only issue was that the non-duplicated rows from Dec18 did not all have 0 counts. We fix this with the transform().
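To see this concretely, duplicated() flags only the later occurrences, so the Jan19 block (rows 1-5) is never flagged:

duplicated(Gdata$Group)
#  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE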
There are plenty of other ways to do this: a join using the merge function; rbind-ing on only the non-duplicated groups, as d.b suggests, with rbind(Jan19, transform(Dec18, Count = 0)[!Dec18$Group %in% Jan19$Group, ]); and others. We could also make your ifelse approach work like this:
Gdata = rbind(Jan19, Dec18)
Gdata$Count = ifelse(!Gdata$Group %in% Jan19$Group, 0, Gdata$Count)
# an alternative to ifelse, a little cleaner
Gdata = rbind(Jan19, Dec18)
Gdata$Count[!Gdata$Group %in% Jan19$Group] = 0
Use whatever makes the most sense to you.
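For completeness, here is a minimal sketch of the merge route mentioned above: all = TRUE keeps the union of groups from both months, groups absent from Jan19 come through with NA counts, and we zero those out (note that merge sorts the result by Group):

Gdata <- merge(Jan19, Dec18["Group"], by = "Group", all = TRUE)
Gdata$Count[is.na(Gdata$Count)] <- 0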

Filtering dataset by values and replacing with values in other dataset in R [duplicate]

This question already has answers here:
Replace values in data frame based on other data frame in R
(4 answers)
Closed 4 years ago.
I have two datasets like this:
>data1
id l_eng l_ups
1 6385 239
2 680 0
3 3165 0
4 17941 440
5 135 25
6 151 96
7 102188 84
8 440 65
9 6613 408
>data2
id l_ups
1 237
2 549
3 100
4 444
5 28
6 101
7 229
8 92
9 47
I want to filter out the values from data1 where l_ups == 0 and replace them with the values in data2, using id as the lookup value, in R.
Final output should look like this:
id l_eng l_ups
1 6385 239
2 680 549
3 3165 100
4 17941 440
5 135 25
6 151 96
7 102188 84
8 440 65
9 6613 408
I tried the code below, but with no luck:
if (data1[, 3] == 0)
{
  filter(data1, last_90_uploads == 0) %>%
    merge(data_2, by.x = c("id", "l_ups"),
          by.y = c("id", "l_ups")) %>%
    select(-l_ups)
}
I am not able to do this with an if statement, since if takes only a single logical value as its condition. But what if I have more than one value in the logical statement, like this:
>if(data1[,3]==0)
TRUE TRUE
Edit:
I want to filter the values with a condition and replace them with values from another dataset. Hence, this question is not the same as the one suggested as a duplicate.
You don't want to filter. filter is an operation that returns a data set from which rows may have been removed.
You are looking for a "conditional update" operation (in database terms). You are already using dplyr, so try a join operation instead of match:
left_join(data1, data2, by = 'id') %>%
  mutate(l_ups = ifelse(is.na(l_ups.x) | l_ups.x == 0, l_ups.y, l_ups.x))
By using a join operation rather than the direct subsetting comparison suggested by @markus, you ensure that you only compare values with the same id. If one of your data frames happens to be missing a row, the direct subsetting comparison will fail.
Using a left_join rather than an inner_join also ensures that if data2 is missing an id, the corresponding row will not be dropped from data1.
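Putting it together with the data from the question (a sketch; the final select just drops the suffixed helper columns):

library(dplyr)

data1 <- data.frame(id = 1:9,
                    l_eng = c(6385, 680, 3165, 17941, 135, 151, 102188, 440, 6613),
                    l_ups = c(239, 0, 0, 440, 25, 96, 84, 65, 408))
data2 <- data.frame(id = 1:9,
                    l_ups = c(237, 549, 100, 444, 28, 101, 229, 92, 47))

left_join(data1, data2, by = "id") %>%
  mutate(l_ups = ifelse(is.na(l_ups.x) | l_ups.x == 0, l_ups.y, l_ups.x)) %>%
  select(id, l_eng, l_ups)
#   id  l_eng l_ups
# 1  1   6385   239
# 2  2    680   549
# 3  3   3165   100
# 4  4  17941   440
# 5  5    135    25
# 6  6    151    96
# 7  7 102188    84
# 8  8    440    65
# 9  9   6613   408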

Assigning logical value to values higher than given threshold for each case across each year

I have a data frame resembling the extract below:
set.seed(1)
smpl_df <- data.frame(year = c(1500:2011), case = LETTERS[1:4])
smpl_df$var_one <- sample(100, size = nrow(smpl_df), replace = TRUE)
I'm interested in adding one more column to this data frame. I'd like the column to take the value 1 if the values in the column var_one were higher than a given threshold for all of the consecutive years represented in the data set. For example, in its present format the table looks like this:
head(smpl_df)
year case var_one
1 1500 A 27
2 1501 B 38
3 1502 C 58
4 1503 D 91
5 1504 A 21
6 1505 B 90
I would like to add a column to the data table (the values shown for the new column are not right; they are just an example):
year case var_one var_one_higher_than_80_for_all_yrs_for_this_case
1 1500 A 27 0
2 1501 B 38 0
3 1502 C 58 0
4 1503 D 91 1
5 1504 A 21 0
6 1505 B 90 1
Edit
To add to the post, following the useful points raised in the comments below: the long table that I'm currently working with could be obtained from the wide table below. In that example, I added a column NewColumn that takes the value Yes if, for a given case, the value was higher than 2 for all the years, and No if the value was lower than or equal to 2. I want to achieve the same effect, but on my long table (smpl_df).
Edit 2
Following the useful comments concerning the desired final output, my intention is to generate a column that would correspond to the last column in the table below.
An ifelse structure may be helpful:
smpl_df$var_one_higher <- ifelse(your_condition, 1, 0)  # fill in the condition
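For instance, taking 80 as the threshold: if "for all years" means every row of a given case must exceed it, a grouped ave works; if a plain per-row flag is wanted instead (as the sample output column suggests), a direct ifelse suffices. Both are sketches against the smpl_df above:

# 1 only when every year's var_one for that case exceeds 80
smpl_df$var_one_higher <- ave(smpl_df$var_one, smpl_df$case,
                              FUN = function(x) as.integer(all(x > 80)))

# or: a per-row flag, as in the question's example column
smpl_df$var_one_higher <- ifelse(smpl_df$var_one > 80, 1, 0)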

plyr to calculate relative aggregation

I have a data.frame that looks like this:
> head(activity_data)
ev_id cust_id active previous_active start_date
1 1141880 201 1 0 2008-08-17
2 4927803 201 1 0 2013-03-17
3 1141880 244 1 0 2008-08-17
4 2391524 244 1 0 2011-02-05
5 1141868 325 1 0 2008-08-16
6 1141872 325 1 0 2008-08-16
for each cust_id
  for each ev_id
    create a new variable $recent_active
    (= sum of $active across all rows with this cust_id
       where $start_date > [this_row]$start_date - 10)
I am struggling to do this using ddply, because my split grouping was .(cust_id) but I want to return rows with both cust_id and ev_id.
Here is what I tried
ddply(activity_data, .(cust_id), function(x) recent_active=sum(x[this_row,]$active))
If ddply is not an option, what other efficient ways do you recommend? My dataset has ~200mn rows and I need to do this about 10-15 times per row.
sample data is here
You actually need a two-step approach here (and you also need to convert start_date to Date format before running the following code):
step1 <- ddply(activity_data, .(cust_id), transform,
               recent_active = your_function)  # not clear what you are asking regarding the function
ddply(step1, .(cust_id, ev_id), summarize, recent_active = sum(recent_active))
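A minimal sketch of the two steps against the sample rows above; the definition of recent_active here (sum of active over rows of the same cust_id whose start_date falls after this row's start_date minus 10 days) is just my reading of the pseudocode in the question:

library(plyr)

activity_data <- data.frame(
  ev_id      = c(1141880, 4927803, 1141880, 2391524, 1141868, 1141872),
  cust_id    = c(201, 201, 244, 244, 325, 325),
  active     = c(1, 1, 1, 1, 1, 1),
  start_date = as.Date(c("2008-08-17", "2013-03-17", "2008-08-17",
                         "2011-02-05", "2008-08-16", "2008-08-16"))
)

# step 1: per cust_id, compute recent_active for every row
step1 <- ddply(activity_data, .(cust_id), function(d) {
  d$recent_active <- sapply(d$start_date, function(s)
    sum(d$active[d$start_date > s - 10]))
  d
})

# step 2: collapse to one value per (cust_id, ev_id)
ddply(step1, .(cust_id, ev_id), summarize, recent_active = sum(recent_active))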
