Replicating a character string up to a certain point in an R data frame

I currently have the following dataframe below:
Country  Information  Export  Import
Andorra  Small            10      20
         Medium           50      30
         Large            40      50
Total    NA              100     100
Antigua  Small            60      70
         Medium           20      10
         Large             5      10
         X-Large          15      10
Total    NA              100     100
I would like to repeat the country name down the Country column until the row "Total" is reached, so Andorra would be repeated for every row up to (but not including) its "Total" row.
The number of rows differs for nearly every country (I have 252 of them), so I need a way to repeat each country name for exactly that country's rows, up to its "Total" row
(e.g. Antigua has 4 rows, not 3 like Andorra, so Antigua would need to be repeated 4 times in the Country column).
Is there a quick and efficient way to do this?
Any help is appreciated.
Thank you

I'm assuming the missing country values are NA rather than blank strings.
You can use the na.locf function from the zoo package on your Country column, like this:
library(zoo)
# example of column values
country = c("Andorra",NA,NA,"Total","Antigua",NA,NA,NA,"Total")
# apply function and update your variable
country = na.locf(country)
# see updated values
country
# [1] "Andorra" "Andorra" "Andorra" "Total" "Antigua" "Antigua" "Antigua" "Antigua" "Total"
It replaces each NA value with the most recent non-NA value above it (last observation carried forward).
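If adding the zoo dependency is not an option, the same carry-forward idea can be sketched in base R. This is an illustrative alternative, not what na.locf does internally, and it assumes the first element of the column is not NA.

```r
# example column: NA marks the rows where the country name is missing
country <- c("Andorra", NA, NA, "Total", "Antigua", NA, NA, NA, "Total")

# running count of non-NA entries: every NA inherits the index
# of the last non-NA value that precedes it
idx <- cumsum(!is.na(country))

# look the non-NA values up by that index (assumes country[1] is not NA)
filled <- country[!is.na(country)][idx]
filled
```

On a data frame this would be applied to the column itself, e.g. by assigning the result back to df$Country.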

I would use the fill function from the tidyr package
Input Data
df <- data.table::fread("Country Information Export Import
Andorra Small 10 20
NA Medium 50 30
NA Large 40 50
Total NA 100 100
Antigua Small 60 70
NA Medium 20 10
NA Large 5 10
NA X-Large 15 10
Total NA 100 100")
Code to Fill in missing information using fill from tidyr
library(tidyr)
fill(df, Country, .direction = "down")
Output
Country Information Export Import
1: Andorra Small 10 20
2: Andorra Medium 50 30
3: Andorra Large 40 50
4: Total <NA> 100 100
5: Antigua Small 60 70
6: Antigua Medium 20 10
7: Antigua Large 5 10
8: Antigua X-Large 15 10
9: Total <NA> 100 100
If there are zero-length strings instead of NA, you can use the na_if function from the dplyr package to convert them to NA first:
library(dplyr)
df %>%
  mutate(Country = na_if(Country, "")) %>%
  fill(Country, .direction = "down")

Related

Assigning specific values in the data frame

I want to filter my data. Below you can see what my data looks like.
df<-data.frame(
Description=c("15","11","12","NA","Total","NA","9","18","NA","Total"),
Value=c(158,196,NA,156,140,693,854,NA,904,925))
df
Now I want to filter the data and assign some text in an additional column. The desired output should look like the table shown below. Specifically, I want to add a column named Sales that holds one of two categorical values, Sold or Unsold, assigned with an if-else statement. The rows up to and including the first 'Total' row should have the value 'Sold', and all rows below it should have 'Unsold'.
I tried the following command, but it does not work as I expected:
df1$Sales <- ifelse(df$Description==c('Total'),'Sold','Unsold')
Can anybody help me solve this?
df$Sales <- ifelse(cumsum(dplyr::lag(df$Description, default = "") == "Total") > 0,
                   "Unsold",
                   "Sold")
df
#> Description Value Sales
#> 1 15 158 Sold
#> 2 11 196 Sold
#> 3 12 NA Sold
#> 4 NA 156 Sold
#> 5 Total 140 Sold
#> 6 NA 693 Unsold
#> 7 9 854 Unsold
#> 8 18 NA Unsold
#> 9 NA 904 Unsold
#> 10 Total 925 Unsold
To break down the logic:
1. dplyr::lag checks whether the previous entry was "Total". Setting a default of any string other than "Total" prevents creating NA as the first entry, because that would carry over an unwanted NA into the next step.
2. cumsum returns how many times "Total" has been seen as the previous entry.
3. Checking that the result of cumsum is greater than 0 turns step 2 into a binary result: "Total" has either been found, or it hasn't.
4. If "Total" has been found, it's unsold; otherwise it's sold.
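The intermediate values of this logic can be inspected one at a time. The sketch below uses a plain vector and a base-R stand-in for dplyr::lag (the names desc, prev, seen and sales are only for illustration), so it runs without loading dplyr.

```r
desc <- c("15", "11", "12", "NA", "Total", "NA", "9", "18", "NA", "Total")

# base-R stand-in for dplyr::lag(desc, default = ""): shift down by one row
prev <- c("", head(desc, -1))

# was the previous entry "Total"?
prev_is_total <- prev == "Total"

# how many times "Total" has been seen as the previous entry so far
seen <- cumsum(prev_is_total)

# any count above zero means the first "Total" row has already been passed
sales <- ifelse(seen > 0, "Unsold", "Sold")
data.frame(desc, prev_is_total, seen, sales)
```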
You could also rearrange things:
dplyr::lag(cumsum(df$Description == "Total") < 1, default = TRUE)
gets the same result, with the true & false results in the same order.
If you know there are as many sold as unsold you can use the first solution.
If you want to allow for uneven and unknown numbers of each you could use the second solution.
library(tidyverse)
# FIRST SOLUTION
df |>
  mutate(Sales = ifelse(row_number() <= nrow(df) / 2, "Sold", "Unsold"))
# SECOND SOLUTION
df |>
  mutate(o = Description == "Total") |>
  mutate(Sales = ifelse(row_number() > match(TRUE, o), "Unsold", "Sold")) |>
  select(-o)
#> Description Value Sales
#> 1 15 158 Sold
#> 2 11 196 Sold
#> 3 12 NA Sold
#> 4 NA 156 Sold
#> 5 Total 140 Sold
#> 6 NA 693 Unsold
#> 7 9 854 Unsold
#> 8 18 NA Unsold
#> 9 NA 904 Unsold
#> 10 Total 925 Unsold

how to calculate mean based on conditions in for loop in r

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like for R to calculate average activity based on the age of the colony in the data frame. Specifically, I want it to only calculate the average activity of the colonies that are the same age or older than the colony in that row, not including the activity of the colony in that row. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data. That would include colony 25077 and colony 4865; and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric",nrow(test.df))
for (i in 1:10){
test.avg[i] <- mean(subset(test.df$activity,test.df$age >= age[i])[-i])
}
R returns a list of values where half of them are correct and the other half are not (I'm not even sure how it calculated the incorrect numbers). The numbers that are correct are also out of order compared to how they're listed in the data frame. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
  mutate(result = map_dbl(age, ~mean(activity[age > .x])))
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)
The issue in your solution is that the index i refers to a row of the original data.frame, but [-i] is applied after subsetting, so it removes position i of the already-filtered vector and the two no longer match.
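A small sketch of that mismatch for the second row, using the question's vectors directly (kept, wrong and right are illustrative names):

```r
age      <- c(21, 23, 4, 25, 7, 4, 12, 14, 9, 7)
activity <- c(19, 45, 78, 33, 2, 49, 22, 21, 112, 61)

i <- 2  # colony 25077, age 23

# filter first: only rows 2 and 4 have age >= 23, so this is c(45, 33)
kept <- activity[age >= age[i]]

# [-i] now drops the 2nd element of the filtered vector (33, from row 4),
# not the activity of row i itself
wrong <- mean(kept[-i])

# dropping row i before filtering keeps the index aligned with the data
right <- mean(activity[-i][age[-i] >= age[i]])

c(wrong = wrong, right = right)
```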
Try something like this: first store the age of the current row as the minimum age, then exclude the current row and calculate the average activity of the remaining cases with age >= that minimum age.
test.avg <- numeric(nrow(test.df))
for (i in seq_len(nrow(test.df))) {
  amin <- age[i]  # age of the colony in the current row
  test.avg[i] <- mean(subset(test.df[-i, ], age >= amin)$activity)
}
You can use map_df :
library(tidyverse)
test.df %>%
mutate(map_df(1:nrow(test.df), ~
test.df %>%
filter(age >= test.df$age[.x]) %>%
summarise(av_acti= mean(activity))))

How to select random rows from R data frame to include all distinct values of two columns

I want to select a random sample of rows from a large R data frame df (around 10 million rows) in such a way that all distinct values of two columns are included in the resulting sample. df looks like:
StoreID WEEK Units Value ProdID
2001 1 1 3.5 20702
2001 2 2 3 20705
2002 32 3 6 23568
2002 35 5 15 24025
2003 1 2 10 21253
I have the following unique values in the respective columns: StoreID: 1433 and WEEK: 52. When I generate a random sample of rows from df, I must have at least one row each for each StoreID and each WEEK value.
I used the function sample_frac in dplyr in various trials but that does not ensure that all distinct values of StoreID and WEEK are included at least once in the resulting sample. How can I achieve what I want?
It sounds like you need to group the desired columns before sampling rows. The last line will return one random row for each unique storeID-week pairing.
library(dplyr)
df <- data.frame(storeid = sample(2000:2010, 1000, TRUE),
                 week    = sample(1:52, 1000, TRUE),
                 value   = runif(1000))
# count storeid-week pairs that occur more than once
df %>% count(storeid, week) %>% filter(n > 1)
# one random row per storeid-week pair
df %>% group_by(storeid, week) %>% sample_n(1)
# A tibble: 468 x 3
# Groups: storeid, week [468]
storeid week value
<int> <int> <dbl>
1 2000 1 0.824
2 2000 2 0.0987
3 2000 6 0.916
4 2000 8 0.289
5 2000 9 0.610
6 2000 11 0.0807
7 2000 12 0.592
8 2000 13 0.849
9 2000 14 0.0181
10 2000 16 0.182
# ... with 458 more rows
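The grouped sample above returns one row for every storeid-week pair. If the requirement is only that each storeid and each week appears at least once, a smaller sample can be built by first picking one row per distinct value of each column and then topping up at random. The following is a base-R sketch under that reading of the question; the helper one_per_value, the seed and the top-up size of 100 are assumptions made for illustration.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
df <- data.frame(storeid = sample(2000:2010, 1000, TRUE),
                 week    = sample(1:52, 1000, TRUE),
                 value   = runif(1000))

# hypothetical helper: one random row index for each distinct value of x
one_per_value <- function(x) {
  vapply(split(seq_along(x), x),
         function(i) i[sample.int(length(i), 1)],  # safe even for a single index
         integer(1))
}

# rows that guarantee coverage of every storeid and every week
must_have <- unique(c(one_per_value(df$storeid), one_per_value(df$week)))

# top up with additional random rows not already selected
extra <- sample(setdiff(seq_len(nrow(df)), must_have), 100)

smp <- df[c(must_have, extra), ]
```

Note that the guaranteed rows are not drawn with the same probability as the top-up rows, so the result is not a simple random sample of df.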
Not sure if I have read the problem correctly. I would have tried the following using sample function.
Assuming your dataframe is called MyDataFrame and is two dimensional, I would have done it like this.
RandomizedDF <- MyDataFrame[sample(dim(MyDataFrame)[1],dim(MyDataFrame)[1],replace=FALSE),]
Let me know if this is what you wanted or something else?

Removing "outer rows" to allow for interpolation (and prevent extrapolation)

I have (left)joined two data frames by country-year.
df<- left_join(df, df2, by="country-year")
leading to the following example output:
country country-year a b
1 France France2000 NA NA
2 France France2001 1000 1000
3 France France2002 NA NA
4 France France2003 1600 2200
5 France France2004 NA NA
6 UK UK2000 1000 1000
7 UK UK2001 NA NA
8 UK UK2002 1000 1000
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I initially wanted to remove all values for which both of the added columns (a,b) were NA.
df<-df[!is.na( df$a | df$b ),]
However, I then decided that I wanted to interpolate the data I had (but not extrapolate). So instead I would like to remove all the rows for which I cannot interpolate; in the example:
1 France France2000 NA NA
5 France France2004 NA NA
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I believe there are 2 options. First I somehow adapt this function:
library(tidyerse)
TRcomplete<-TRcomplete%>%
group_by(country) %>%
mutate_at(a:b,~na.fill(.x,"extend"))
to interpolate only, and then apply df<-df[!is.na( df$a | df$b ),]
or I write code to remove the "outer" rows first and then use "extend" as normal. Desired output:
country country-year a b
2 France France2001 1000 1000
3 France France2002 1300 1600
4 France France2003 1600 2200
6 UK UK2000 1000 1000
7 UK UK2001 0 0
8 UK UK2002 1000 1000
Any suggestions?
There are options in na.fill to specify what is done. If you look at ?na.fill, you see that fill can specify the left, interior and right, so if you specify the left and right are NA and the interior is "extend", then it will only fill the interior data. You can then filter the rows with NA.
library(tidyverse)
library(zoo)
df %>%
  group_by(country) %>%
  mutate_at(vars(a:b), ~na.fill(.x, c(NA, "extend", NA))) %>%
  filter(!is.na(a) | !is.na(b))
By the way, you have a typo in your library(tidyverse) statement; you are missing the v.
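To see what interior-only filling means on a single column, here is a minimal base-R sketch using stats::approx, whose default rule = 1 returns NA outside the range of the observed points. It is a stand-in for illustration, not zoo's implementation, and it assumes the column has at least two non-NA values.

```r
# column a for France in the example: NA at both ends and in the middle
a <- c(NA, 1000, NA, 1600, NA)

idx <- seq_along(a)
ok  <- !is.na(a)

# linear interpolation between observed points; positions before the first
# or after the last observed point stay NA (rule = 1 is the default)
filled <- approx(idx[ok], a[ok], xout = idx)$y
filled

# rows that could not be interpolated are the ones still NA
keep <- !is.na(filled)
```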

How do I add another column to a dataframe in R that shows the difference between the columns of two other dataframes?

What I have:
I have two dataframes to work with. Those are:
> print(myDF_2003)
A_score country B_score
1 200 Germany 11
2 150 Italy 9
3 0 Sweden 0
and:
> print(myDF_2005)
A_score country B_score
1 -300 France 16
2 100 Germany 12
3 200 Italy 15
4 40 Spain 17
They are produced by the following code, which I do not want to change:
#_________2003______________
myDF_2003=data.frame(c(200,150,0),c("Germany", "Italy", "Sweden"), c(11,9,0))
colnames(myDF_2003)=c("A_score","country", "B_score")
myDF_2003$country=as.character(myDF_2003$country)
myDF_2003$country=factor(myDF_2003$country, levels=unique(myDF_2003$country))
myDF_2003$A_score=as.numeric(as.character(myDF_2003$A_score))
myDF_2003$B_score=as.numeric(as.character(myDF_2003$B_score))
#_________2005______________
myDF_2005=data.frame(c(-300,100,200,40),c("France","Germany", "Italy", "Spain"), c(16,12,15,17))
colnames(myDF_2005)=c("A_score","country", "B_score")
myDF_2005$country=as.character(myDF_2005$country)
myDF_2005$country=factor(myDF_2005$country, levels=unique(myDF_2005$country))
myDF_2005$A_score=as.numeric(as.character(myDF_2005$A_score))
myDF_2005$B_score=as.numeric(as.character(myDF_2005$B_score))
What I want:
I want to paste another column to myDF_2005 which has the difference of the B_Scores of countries that exist in both previous dataframes. In other words: I want to produce this output:
> print(myDF_2005_2003_Diff)
A_score country B_score B_score_Diff
1 -300 France 16
2 100 Germany 12 1
3 200 Italy 15 6
4 40 Spain 17
Question:
What is the most elegant code to do this?
# join in a temporary dataframe
temp <- merge(myDF_2005, myDF_2003, by = "country", all.x = TRUE)
# calculate the difference and assign a new column
myDF_2005$B_score_Diff <- temp$B_score.x - temp$B_score.y
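Because merge() sorts its result by the key column by default, copying a column from temp back into myDF_2005 relies on myDF_2005 already being in that order, which happens to hold for this data. A sketch that does not depend on row order, using base R's match(), is shown below; it is an alternative for illustration, and the data frames are rebuilt here in simplified form (plain character country column) so the block runs on its own.

```r
# simplified copies of the question's data frames
myDF_2003 <- data.frame(A_score = c(200, 150, 0),
                        country = c("Germany", "Italy", "Sweden"),
                        B_score = c(11, 9, 0))
myDF_2005 <- data.frame(A_score = c(-300, 100, 200, 40),
                        country = c("France", "Germany", "Italy", "Spain"),
                        B_score = c(16, 12, 15, 17))

# for each 2005 row, the position of the same country in 2003 (NA if absent)
idx <- match(as.character(myDF_2005$country), as.character(myDF_2003$country))

# indexing with NA yields NA, so countries missing from 2003 get an NA difference
myDF_2005$B_score_Diff <- myDF_2005$B_score - myDF_2003$B_score[idx]
myDF_2005
```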
A solution using dplyr. The idea is to merge the two data frames and then calculate the difference.
library(dplyr)
myDF_2005_2 <- myDF_2005 %>%
  left_join(myDF_2003 %>% select(-A_score), by = "country") %>%
  mutate(B_score_Diff = B_score.x - B_score.y) %>%
  select(-B_score.y) %>%
  rename(B_score = B_score.x)
myDF_2005_2
# A_score country B_score B_score_Diff
# 1 -300 France 16 NA
# 2 100 Germany 12 1
# 3 200 Italy 15 6
# 4 40 Spain 17 NA
