I have a list of data frames and I want to split the rows by name into individual data frames.
list:
[1]
Name Year Wage
John 2000 500
Paul 2000 600
Peter 2000 800
Mary 2000 700
Kai 2000 800
[2]
Name Year Wage
John 2005 600
Paul 2005 700
Peter 2005 1000
Mary 2005 750
Kai 2005 850
[3]
Name Year Wage
John 2010 1600
Paul 2010 900
Peter 2010 1200
Mary 2010 950
Kai 2010 950
[n]
Name Year Wage
John 2011 1800
Paul 2011 1000
Peter 2011 1600
Mary 2011 850
Kai 2011 1050
Desired data frame 1:
Name Year Wage
John 2000 500
John 2005 600
John 2010 1600
John 2011 1800
Desired data frame 2:
Name Year Wage
Paul 2000 600
Paul 2005 700
Paul 2010 900
Paul 2011 1000
and every name has its own .csv output.
I tried
listy <- list.files(path = "./",pattern = "*_output.csv", full.names = FALSE,recursive = TRUE)
lapply(listy, read.csv)
Then I have no idea how to continue. Thank you for your help.
We can rbind the list of data.frames into a single dataset and then do the split
library(dplyr)
lstN <- bind_rows(lst) %>%
split(., .$Name)
lapply(names(lstN), function(nm) write.csv(lstN[[nm]], paste0(nm, ".csv"),
row.names = FALSE, quote = FALSE))
data
lst <- lapply(listy, read.csv, stringsAsFactors=FALSE)
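The same bind-then-split idea can be sketched in base R without dplyr. The data in `lst` and the output directory `out_dir` are assumptions made up for illustration; in practice `lst` would come from `lapply(listy, read.csv)`.

```r
# Hypothetical stand-in for the list returned by lapply(listy, read.csv)
lst <- list(
  data.frame(Name = c("John", "Paul"), Year = 2000, Wage = c(500, 600),
             stringsAsFactors = FALSE),
  data.frame(Name = c("John", "Paul"), Year = 2005, Wage = c(600, 700),
             stringsAsFactors = FALSE)
)

# Stack all data frames into one, then split the rows by Name
all_rows <- do.call(rbind, lst)
by_name <- split(all_rows, all_rows$Name)

# Write one CSV per name; out_dir is an assumed output location
out_dir <- tempdir()
for (nm in names(by_name)) {
  write.csv(by_name[[nm]], file.path(out_dir, paste0(nm, ".csv")),
            row.names = FALSE)
}
```

Each element of `by_name` is the per-name data frame, so `by_name$John` holds all of John's rows across years.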
I'm struggling with transforming my data and would appreciate some help.
year name start
2010 Emma 1998
2011 Emma 1998
2012 Emma 1998
2009 John na
2010 John na
2012 John na
2007 Louis na
2012 Louis na
The aim is to replace all NAs with the minimum value of year within each name group, so the data looks like this:
year name start
2010 Emma 1998
2011 Emma 1998
2012 Emma 1998
2009 John 2009
2010 John 2009
2012 John 2009
2007 Louis 2007
2012 Louis 2007
Note: either all start values of one name group are NAs or none
I tried to use
mydf %>% group_by(name) %>% mutate(start= ifelse(is.na(start), min(year, na.rm = T), start))
but got this error
x `start` must return compatible vectors across groups
There are a lot of similar problems here.
Some people here used the ave function or worked with data.table, neither of which seems to fit my problem.
My base function must be something like
df$A <- ifelse(is.na(df$A), df$B, df$A)
However, I can't seem to combine it properly with the min() and group_by() functions.
Thank you for any help
I changed the colname to 'Year' because it was colliding to
dat %>%
dplyr::group_by(name) %>%
dplyr::mutate(start = dplyr::if_else(start == "na", min(Year), start))
# A tibble: 8 x 3
# Groups: name [3]
Year name start
<chr> <chr> <chr>
1 2010 Emma 1998
2 2011 Emma 1998
3 2012 Emma 1998
4 2009 John 2009
5 2010 John 2009
6 2012 John 2009
7 2007 Louis 2007
8 2012 Louis 2007
We can use na.aggregate
library(dplyr)
library(zoo)
dat %>%
group_by(name) %>%
mutate(start = na.aggregate(na_if(start, "na"), FUN = min))
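For comparison, here is a minimal base R sketch of the fill the question describes: compute the minimum year per name with `ave()`, then use it wherever `start` is missing. It assumes, as in the answers above, that missing values are stored as the string "na" and that the year column has been renamed to `Year`; the data frame is rebuilt here from the question's table.

```r
# Example data rebuilt from the question; "na" strings mark missing values
dat <- data.frame(
  Year  = c(2010, 2011, 2012, 2009, 2010, 2012, 2007, 2012),
  name  = c("Emma", "Emma", "Emma", "John", "John", "John", "Louis", "Louis"),
  start = c("1998", "1998", "1998", "na", "na", "na", "na", "na"),
  stringsAsFactors = FALSE
)

# Convert "na" to a real NA, then to numeric
start_num <- as.numeric(replace(dat$start, dat$start == "na", NA))

# Minimum Year within each name group, repeated for every row of the group
min_year <- ave(dat$Year, dat$name, FUN = min)

# Use the group minimum only where start is missing
dat$start <- ifelse(is.na(start_num), min_year, start_num)
```

Converting both columns to numeric first avoids mixing character and numeric values in the same column, which is the kind of type mismatch that can trigger the "compatible vectors across groups" error.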
I have an excel file that contains two columns : Car_Model_Year and Cost.
Car_Model_Year Cost
2018 25000
2010 9000
2005 13000
2002 35000
1995 8000
I want to sort my data as follows:
Car_Model_Year Cost
1995 8000
2002 35000
2005 13000
2010 9000
2018 25000
So now, the Car_Model_Year are sorted in ascending order. I wrote the following R code, but I don't know how to rearrange the values of the variable Cost accordingly.
my_data <- read.csv2("data.csv")
my_data <- sort(my_data$Car_Model_Year, decreasing = FALSE)
Any help will be very appreciated!
Are you looking for this?
sorted_df <- df[order(df$Car_Model_Year, df$Cost),]
print(sorted_df)
# A tibble: 5 x 2
Car_Model_Year Cost
<dbl> <dbl>
1 1995 8000
2 2002 35000
3 2005 13000
4 2010 9000
5 2018 25000
Note that you can use signs (+/ -) to indicate asc or desc:
# Sort by Car_Model_Year (descending) and Cost (ascending)
sorted_df <-df[order(-df$Car_Model_Year, df$Cost),]
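The minus sign only works for numeric columns. As a sketch of one alternative for mixed column types, `order()` accepts a vector of `decreasing` flags, one per sort key, when `method = "radix"` is used. The `Make` column below is a hypothetical addition for illustration and is not part of the original data.

```r
# Hypothetical data: Make is an invented character column
cars_df <- data.frame(
  Car_Model_Year = c(2018, 2010, 2005, 2010),
  Make = c("b", "c", "a", "a"),
  stringsAsFactors = FALSE
)

# Year descending, Make ascending; one decreasing flag per sort key
idx <- order(cars_df$Car_Model_Year, cars_df$Make,
             decreasing = c(TRUE, FALSE), method = "radix")
sorted_cars <- cars_df[idx, ]
```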
Does the below approach work? To sort by two or more columns, you just add them to the order() - i.e. order(var1, var2,...)
my_data <- data.frame(Car_Model_Year=c(2018,2010,2005,2002,1995),
Cost=c(25000,9000,13000,35000,8000))
sorted <- my_data[order(my_data$Car_Model_Year, my_data$Cost),]
> print(sorted)
Car_Model_Year Cost
5 1995 8000
4 2002 35000
3 2005 13000
2 2010 9000
1 2018 25000
dplyr::arrange() makes it easy:
library(dplyr)
my_data %>% arrange(Car_Model_Year, Cost)
To sort Cost in descending order instead:
my_data %>% arrange(Car_Model_Year, desc(Cost))
I have two data frames:
temp <- data.frame(
team1 = c("Chennai Super Kings","Deccan Chargers","Delhi Daredevils"),
team2 = c("Mumbai Indians","Royal Challengers Bangalore","Gujarat Lions")
)
teamdata <- data.frame(
teamname=c("Chennai Super Kings","Deccan Chargers","Delhi Daredevils",
"Mumbai Indians","Royal Challengers Bangalore","Gujarat Lions"),
matchesplayed = c("100","200","300","400","500","600"),
matcheswon = c("50","100","150","200","250","300")
)
In the temp data frame I want to add variables such as team1matchesplayed and team2matchesplayed or team1matcheswon and team2matcheswon according to the name of the team in variables team1 and team2 of the temp dataframe. The values should be populated from teamdata data frame. New columns should be generated in the temp data frame.
P.S: This is my first question on here and may not be the best representation. Apologies: Sorry for attaching images earlier. Thank you for pointing it out.
Simply merge twice on both team1 and team2 respectively:
# NESTED MERGE
mdf <- merge(merge(temp, teamdata, by.x=c("team1"), by.y=c("teamname"), all.x=TRUE),
teamdata, by.x=c("team2"), by.y=c("teamname"), all.x=TRUE)
# RENAME COLUMNS
mdf <- setNames(mdf, c("team2", "team1", "team1_matchesplayed", "team1_matcheswon",
"team2_matchesplayed", "team2_matcheswon"))
# REORDER COLUMNS
mdf <- mdf[c("team1", "team2", "team1_matchesplayed", "team2_matchesplayed",
"team1_matcheswon", "team2_matcheswon")]
mdf
# team1 team2 team1_matchesplayed team2_matchesplayed team1_matcheswon team2_matcheswon
# 1 Delhi Daredevils Gujarat Lions 300 600 150 300
# 2 Chennai Super Kings Mumbai Indians 100 400 50 200
# 3 Deccan Chargers Royal Challengers Bangalore 200 500 100 250
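Note that merge() reorders the rows, as the output above shows. If the original row order of temp matters, a lookup with `match()` is one possible alternative. This is a sketch using the data frames from the question; the new column names follow the ones used in the merge answer.

```r
temp <- data.frame(
  team1 = c("Chennai Super Kings", "Deccan Chargers", "Delhi Daredevils"),
  team2 = c("Mumbai Indians", "Royal Challengers Bangalore", "Gujarat Lions"),
  stringsAsFactors = FALSE
)
teamdata <- data.frame(
  teamname = c("Chennai Super Kings", "Deccan Chargers", "Delhi Daredevils",
               "Mumbai Indians", "Royal Challengers Bangalore", "Gujarat Lions"),
  matchesplayed = c("100", "200", "300", "400", "500", "600"),
  matcheswon = c("50", "100", "150", "200", "250", "300"),
  stringsAsFactors = FALSE
)

# Row positions in teamdata for each team in temp (NA if a team is not found)
idx1 <- match(temp$team1, teamdata$teamname)
idx2 <- match(temp$team2, teamdata$teamname)

# Pull the values across; the row order of temp is unchanged
temp$team1_matchesplayed <- teamdata$matchesplayed[idx1]
temp$team2_matchesplayed <- teamdata$matchesplayed[idx2]
temp$team1_matcheswon <- teamdata$matcheswon[idx1]
temp$team2_matcheswon <- teamdata$matcheswon[idx2]
```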
> library(sqldf)
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
> temp2=sqldf("select temp.*,matchesplayed as team1matchesplayed,matcheswon as team1matcheswon from temp,teamdata where temp.team1=teamdata.teamname")
> temp2
team1 team2 team1matchesplayed
1 Chennai Super Kings Mumbai Indians 100
2 Deccan Chargers Royal Challengers Bangalore 200
3 Delhi Daredevils Gujarat Lions 300
team1matcheswon
1 50
2 100
3 150
> temp3=sqldf("select temp2.*,matchesplayed as team2matchesplayed,matcheswon as team2matcheswon from temp2,teamdata where temp2.team2=teamdata.teamname")
> temp3
team1 team2 team1matchesplayed
1 Chennai Super Kings Mumbai Indians 100
2 Deccan Chargers Royal Challengers Bangalore 200
3 Delhi Daredevils Gujarat Lions 300
team1matcheswon team2matchesplayed team2matcheswon
1 50 400 200
2 100 500 250
3 150 600 300
For the sake of simplicity, let's say I have a dataset at the country-year level that lists the organizations that received aid from a government, how much money each received, and the type of project. The data frame has "space" for 10 organizations each year, but not every government subsidizes that many organizations each year, so there are a lot of blank spaces. Moreover, they do not follow any order: an organization can be in the first spot one year and be coded in the second spot the next year. The data looks like this:
> State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2 Org3 Aid3 Proj3 Org4 Aid4 Proj4 ...
Italy 2000 A 1000 Arts B 500 Arts C 300 Social
Italy 2001 B 700 Social A 1000 Envir
Italy 2002 A 1000 Arts C 300 Envir
UK 2000
UK 2001 Z 2000 Social
UK 2002 Z 2000 Social
...
I'm trying to transform this into dyadic data, which would look like this:
> State Org Year Aid Proj
Italy A 2000 1000 Arts
Italy A 2001 1000 Envir
Italy A 2002 1000 Arts
Italy B 2000 500 Arts
Italy B 2001 700 Social
Italy C 2000 300 Social
Italy C 2002 300 Envir
UK Z 2001 2000 Social
...
I'm using R, and the best way I could find was building a pre-defined set of possible dyads, using something like expand.grid(unique(State), unique(Org)), and then looping through the data, finding the corresponding column and filling the data frame. But I don't think this is the most effective method, so I was wondering whether there is a better way. I thought about dplyr or reshape but can't find a solution.
I know this is a recurring question, but couldn't really find an answer. The most similar question is this one, but it's not exactly the same.
Thanks a lot in advance.
Since you did not use dput, I will try and make some data that resemble yours:
dat = data.frame(State = rep(c("Italy", "UK"), 3),
Year = rep(c(2014, 2015, 2016), 2),
Org1 = letters[1:6],
Aid1 = sample(800:1000, 6),
Proj1 = rep(c("A", "B"), 3),
Org2 = letters[7:12],
Aid2 = sample(600:700, 6),
Proj2 = rep(c("C", "D"), 3),
stringsAsFactors = FALSE)
dat
# State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2
# 1 Italy 2014 a 910 A g 658 C
# 2 UK 2015 b 926 B h 681 D
# 3 Italy 2016 c 834 A i 625 C
# 4 UK 2014 d 858 B j 620 D
# 5 Italy 2015 e 831 A k 650 C
# 6 UK 2016 f 821 B l 687 D
Next I gather the data and then use extract to make 2 new columns and then spread it all again:
library(tidyr)
library(dplyr)
dat %>%
gather(key, value, -c(State, Year)) %>%
extract(key, into = c("key", "num"), "([A-Za-z]+)([0-9]+)") %>%
spread(key, value) %>%
select(-num)
# State Year Aid Org Proj
# 1 Italy 2014 910 a A
# 2 Italy 2014 658 g C
# 3 Italy 2015 831 e A
# 4 Italy 2015 650 k C
# 5 Italy 2016 834 c A
# 6 Italy 2016 625 i C
# 7 UK 2014 858 d B
# 8 UK 2014 620 j D
# 9 UK 2015 926 b B
# 10 UK 2015 681 h D
# 11 UK 2016 821 f B
# 12 UK 2016 687 l D
Is this the desired output?
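A base R sketch of the same wide-to-long step uses `reshape()`, grouping the numbered columns by what they measure. The small `wide` data frame is an assumption modelled on the question's layout, with only two organization slots and blanks stored as NA; the `slot` column name is also made up for illustration.

```r
# Assumed example data in the question's layout (two slots, blanks as NA)
wide <- data.frame(
  State = c("Italy", "Italy", "UK"),
  Year  = c(2000, 2001, 2001),
  Org1  = c("A", "B", "Z"),
  Aid1  = c(1000, 700, 2000),
  Proj1 = c("Arts", "Social", "Social"),
  Org2  = c("B", "A", NA),
  Aid2  = c(500, 1000, NA),
  Proj2 = c("Arts", "Envir", NA),
  stringsAsFactors = FALSE
)

# Stack each group of numbered columns into a single column
long <- reshape(
  wide,
  direction = "long",
  varying = list(c("Org1", "Org2"), c("Aid1", "Aid2"), c("Proj1", "Proj2")),
  v.names = c("Org", "Aid", "Proj"),
  timevar = "slot",
  idvar = c("State", "Year")
)

# Drop the empty slots, keep the dyadic columns, and sort
long <- long[!is.na(long$Org), c("State", "Org", "Year", "Aid", "Proj")]
long <- long[order(long$State, long$Org, long$Year), ]
rownames(long) <- NULL
```

If the blanks are stored as empty strings rather than NA, the filter would need to test `long$Org != ""` instead.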
I have a dataset and I want to perform something like Group By Rollup like we have in SQL for aggregate values.
Below is a reproducible example. I know aggregate works really well as explained here but not a satisfactory fit for my case.
year<- c('2016','2016','2016','2016','2017','2017','2017','2017')
month<- c('1','1','1','1','2','2','2','2')
region<- c('east','west','east','west','east','west','east','west')
sales<- c(100,200,300,400,200,400,600,800)
df<- data.frame(year,month,region,sales)
df
year month region sales
1 2016 1 east 100
2 2016 1 west 200
3 2016 1 east 300
4 2016 1 west 400
5 2017 2 east 200
6 2017 2 west 400
7 2017 2 east 600
8 2017 2 west 800
Now what I want to do is aggregate (sum by year-month-region) and add the new aggregate rows to the existing data frame,
e.g. there should be two additional rows like below, with 'USA' as the new region name for the aggregated rows
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 USA 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 USA 2000
I have figured out a way (below) but I am very sure that there exists an optimum solution for this OR a better workaround than mine
df1<- setNames(aggregate(df$sales, by=list(df$year,df$month, df$region), FUN=sum),
c('year','month','region', 'sales'))
df2<- setNames(aggregate(df$sales, by=list(df$year,df$month), FUN=sum),
c('year','month', 'sales'))
df2$region<- 'USA' ## added a new column- region- for total USA
df2<- df2[, c('year','month','region', 'sales')] ## reordering the columns of df2
df3<- rbind(df1,df2)
df3<- df3[order(df3$year,df3$month,df3$region),] ## order by
rownames(df3)<- NULL ## renumbered the rows after order by
df3
Thanks for the support!
melt/dcast in the reshape2 package can do subtotalling. After running dcast we replace "(all)" in the month column with the month using na.locf from the zoo package:
library(reshape2)
library(zoo)
m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")
dout$month <- na.locf(replace(dout$month, dout$month == "(all)", NA))
giving:
> dout
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 (all) 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 (all) 2000
In the recent development version of data.table (1.10.5) you can use a new feature called "grouping sets" to produce subtotals:
library(data.table)
setDT(df)
res = groupingsets(df, .(sales=sum(sales)), sets=list(c("year","month"), c("year","month","region")), by=c("year","month","region"))
setorder(res, na.last=TRUE)
res
# year month region sales
#1: 2016 1 east 400
#2: 2016 1 west 600
#3: 2016 1 NA 1000
#4: 2017 2 east 800
#5: 2017 2 west 1200
#6: 2017 2 NA 2000
You can replace NA with USA using res[is.na(region), region := "USA"].
plyr::ddply(df, c("year", "month", "region"), plyr::summarise, sales = sum(sales))