I have a large data set that I have imported from Excel to R. I want to get all the entries that have a negative value for a specific variable, MG. I use the code:
A <- subset(df, MG < 0)
However, A comes back empty, despite the fact that there are several entries with a value below 0. This does not happen when I look for values larger than 0 (MG > 0), which works fine. It should be added that there are NA values in the data, but adding na.rm = TRUE does not help.
I also notice that R treats MG as a logical TRUE/FALSE variable, since it contains 1s and 0s.
Any idea what I have done wrong?
Edit: here is a sample of the data:
Country Region Code Product name Year Value MG
Sweden Stockholm 123 Apple 1991 244 NA
Sweden Kirruna 123 Apple 1987 100 NA
Japan Kyoto 543 Pie 1987 544 NA
Denmark Copenhagen 123 Apple 1998 787 0
Denmark Copenhagen 123 Apple 1987 100 1
Denmark Copenhagen 543 Pie 1991 320 0
Denmark Copenhagen 126 Candy 1999 200 1
Sweden Gothenburg 126 Candy 2013 300 0
Sweden Gothenburg 157 Tomato 1987 150 -55
Sweden Stockholm 125 Juice 1987 250 150
Sweden Kirruna 187 Banana 1998 310 250
Japan Kyoto 198 Ham 1987 157 1000
Japan Kyoto 125 Juice 1987 550 -1
Japan Tokyo 125 Juice 1991 100 0
From your comments it looks like you're using read_excel to read in the data. By default it inspects only the first rows of each column to guess its type, and with the leading NAs and 0/1 values it guesses that MG is logical. You can bypass the guessing by declaring the column types up front, so that MG is read in as numeric:
df <- read_excel("Test/df.xlsx",
col_types = c("text", "text", "numeric", "text", "numeric", "numeric", "numeric"))
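If you would rather not enumerate every column type, readxl can instead be told to scan more rows before guessing. A minimal sketch, assuming the same file path as above (guess_max is the number of rows used for type guessing):
df <- read_excel("Test/df.xlsx", guess_max = 10000)
With guess_max large enough to reach the first genuinely numeric MG value, the column is guessed as numeric and subset(df, MG < 0) works as expected.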
I have three datasets:
one containing a bunch of information about storms,
one that contains the full names of the states and their abbreviations,
and one that contains the year and population for each state.
What I want to do is add a column called population to the first dataframe, storms, containing the population per year for each state, using the other two dataframes, state_codes and states.
Can anyone point me in the right direction? Below is some sample data.
> head(storms)
num yr mo dy time state magnitude injuries fatalities crop_loss
1 1 1950 1 3 11:00:00 MO 3 3 0 0
2 1 1950 1 3 11:10:00 IL 3 0 0 0
3 2 1950 1 3 11:55:00 IL 3 3 0 0
4 3 1950 1 3 16:00:00 OH 1 1 0 0
5 4 1950 1 13 05:25:00 AR 3 1 1 0
6 5 1950 1 25 19:30:00 MO 2 5 0 0
> head(state_codes)
Name Abbreviation
1 Alabama AL
2 Alaska AK
3 Arizona AZ
4 Arkansas AR
5 California CA
6 Colorado CO
> head(states)
Year Alabama Arizona Arkansas California Colorado Connecticut Delaware
1 1900 1830 124 1314 1490 543 910 185
2 1901 1907 131 1341 1550 581 931 187
3 1902 1935 138 1360 1623 621 952 188
4 1903 1957 144 1384 1702 652 972 190
5 1904 1978 151 1419 1792 659 987 192
6 1905 2012 158 1447 1893 680 1010 194
You didn't provide much data to test with, but this should do it.
First, join storms to state_codes, so that storms has the state names used in states; we can rename yr to match states$Year at the same time.
Then pivot states into long form.
Finally, join the new version of storms to the long version of states.
library(dplyr)
library(tidyr)
# Step 1: attach state names and align the year column name
storms %>%
  left_join(state_codes, by = c("state" = "Abbreviation")) %>%
  rename(Year = yr) -> storms.with.names
# Step 2: reshape states into one row per state and year
states %>%
  pivot_longer(-Year, names_to = "Name",
               values_to = "Population") -> long.states
# Step 3: join on the shared Name and Year columns
storms.with.names %>%
  left_join(long.states) -> result
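(With no by= argument, left_join() matches on every column name the two tables share, here Name and Year; you can make that explicit with by = c("Name", "Year") if you prefer.)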
This answer doesn't use dplyr, but I'm offering it because I know it's very fast on large datasets.
It follows the same first step as the accepted answer: merge the state names into the storms dataset. But then it does something clever (I stole the idea): it builds a matrix of row and column numbers, and then uses that matrix to extract the elements from the states dataset that you want for the new column.
#Add the state names to storms (column 6 of storms is the abbreviation,
#column 2 of state_codes is Abbreviation)
storms <- merge(storms, state_codes, by.x = 6, by.y = 2, all.x = TRUE)
#Get row and column indexes for the elements in 'states'
r <- match(storms$yr, states$Year)
c <- match(storms$Name, names(states)) #Name is the column added by the merge
smat <- cbind(r, c)
#And grab them into a new vector
storms$population <- states[smat]
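If the matrix-indexing step looks unfamiliar: indexing a data frame with a two-column matrix of (row, column) positions returns the elements at those positions as a plain vector. A tiny made-up illustration:
m <- data.frame(a = c(10, 20, 30), b = c(40, 50, 60))
idx <- cbind(c(1, 3), c(2, 1))  # (row 1, col 2) and (row 3, col 1)
m[idx]                          # returns 40 30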
Suppose this table:
Browse[2]> tra_all_data
ID CITY COUNTRY PRODUCT CATEGORY YEAR INDICATOR COUNT
1 1 VAL ES Tomato Vegetables 1999 10 10
2 2 MAD ES Beer Alcohol 1999 20 20
3 3 LON UK Whisky Alcohol 1999 30 30
4 4 VAL ES Tomato Vegetables 2000 100 100
5 5 VAL ES Beer Alcohol 2000 121 121
6 6 LON UK Whisky Alcohol 2000 334 334
7 7 MAD ES Tomato Vegetables 2000 134 134
8 8 LON UK Tomato Vegetables 2000 451 451
17 17 BIL ES Pincho Meat 1999 180 180
18 18 VAL ES Orange Vegetables 1999 110 110
19 19 MAD ES Wine Alcohol 1999 120 120
20 20 LON UK Wine Alcohol 1999 230 230
21 21 VAL ES Orange Vegetables 2000 100 100
22 22 VAL ES Wine Alcohol 2000 122 122
23 23 LON UK JB Alcohol 2000 133 133
24 24 MAD ES Orange Vegetables 2000 113 113
25 25 MAD ES Orange Vegetables 2000 113 113
26 26 LON UK Orange Vegetables 2000 145 145
And this piece of code:
CURRENT_COLS <- c("PRODUCT", "YEAR", "CITY")
tra_dAGG <- tra_all_data %>%
  regroup(as.list(CURRENT_COLS)) %>%
  #group_by(PRODUCT, YEAR, CITY) %>%
  summarise(Percent = sum(COUNT)) %>%
  mutate(Percent = Percent / sum(Percent))
If I use this code as it is, I get the following warning:
Warning message:
'regroup' is deprecated.
Use 'group_by_' instead.
See help("Deprecated")
If I comment out the regroup line and use the group_by line instead, it works. But the point is that CURRENT_COLS changes in each iteration, so I need to group by the variable. (I have explicitly defined CURRENT_COLS in this code to better explain my question.)
Can anyone help me on this issue? How can I use a variable in the group_by?
Thank you so much in advance.
My R version: 3.1.2 (2014-10-31)
You need to use the newer standard evaluation versions of dplyr's functions. They are denoted by an additional _ at the end of the function name, for example select_().
In your case, you can change your code to:
CURRENT_COLS <- c("PRODUCT", "YEAR", "CITY")
tra_dAGG <- tra_all_data %>%
  group_by_(.dots = CURRENT_COLS) %>%
  summarise(Percent = sum(COUNT)) %>%
  mutate(Percent = Percent / sum(Percent))
Make sure you have the latest versions of dplyr installed and loaded.
To learn more about standard and non-standard evaluation in dplyr, see the NSE vignette.
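For what it's worth, the underscore verbs were themselves deprecated in later versions of dplyr. On dplyr 1.0 or newer, the usual pattern for grouping by a character vector of column names is across() with all_of(); a sketch under that assumption:
tra_dAGG <- tra_all_data %>%
  group_by(across(all_of(CURRENT_COLS))) %>%
  summarise(Percent = sum(COUNT)) %>%
  mutate(Percent = Percent / sum(Percent))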
I have a sample dataset with 45 rows and is given below.
itemid title release_date
16 573 Body Snatchers 1993
17 670 Body Snatchers 1993
41 1645 Butcher Boy, The 1998
42 1650 Butcher Boy, The 1998
1 218 Cape Fear 1991
18 673 Cape Fear 1962
27 1234 Chairman of the Board 1998
43 1654 Chairman of the Board 1998
2 246 Chasing Amy 1997
5 268 Chasing Amy 1997
11 309 Deceiver 1997
37 1606 Deceiver 1997
28 1256 Designated Mourner, The 1997
29 1257 Designated Mourner, The 1997
12 329 Desperate Measures 1998
13 348 Desperate Measures 1998
9 304 Fly Away Home 1996
15 500 Fly Away Home 1996
26 1175 Hugo Pool 1997
39 1617 Hugo Pool 1997
31 1395 Hurricane Streets 1998
38 1607 Hurricane Streets 1998
10 305 Ice Storm, The 1997
21 865 Ice Storm, The 1997
4 266 Kull the Conqueror 1997
19 680 Kull the Conqueror 1997
22 876 Money Talks 1997
24 881 Money Talks 1997
35 1477 Nightwatch 1997
40 1625 Nightwatch 1997
6 274 Sabrina 1995
14 486 Sabrina 1954
33 1442 Scarlet Letter, The 1995
36 1542 Scarlet Letter, The 1926
3 251 Shall We Dance? 1996
30 1286 Shall We Dance? 1937
32 1429 Sliding Doors 1998
45 1680 Sliding Doors 1998
20 711 Substance of Fire, The 1996
44 1658 Substance of Fire, The 1996
23 878 That Darn Cat! 1997
25 1003 That Darn Cat! 1997
34 1444 That Darn Cat! 1965
7 297 Ulee's Gold 1997
8 303 Ulee's Gold 1997
What I am trying to do is consolidate the itemid based on the movie name, provided the release date of the movie is the same. For example, the movie 'Ulee's Gold' has two item ids, 297 and 303. I am trying to automate the process of checking the release date of each duplicated title and, if it is the same, replacing the movie's second itemid with its first. For the time being I have done it manually by extracting the itemids into two vectors, x and y, and then changing them in a loop. I want to know if there is a better way of getting this task done, because there are only 18 movies with multiple ids in this sample, but the full dataset has a few hundred; finding and processing them manually would be very time consuming.
I am providing the code that I have used to get this task done.
x <- c(670, 1650, 1654, 268, 1606, 1257, 348, 500, 1617, 1607, 865, 680, 881, 1625, 1680, 1658, 1003, 297)
y <- c(573, 1645, 1234, 246, 309, 1256, 329, 304, 1175, 1395, 305, 266, 876, 1477, 1429, 711, 878, 297)
# Replace each duplicate id in x with the matching id in y
for (i in 1:18)
{
  df$itemid[df$itemid == x[i]] <- y[i]
}
Is there a better way to get this done?
I think you can do this straightforwardly in dplyr.
Using your comment above, here is a brief example:
itemid <- c(878,1003,1444,297,303)
title <- c(rep("That Darn Cat!", 3), rep("Ulee's Gold", 2))
year <- c(1997,1997,1965,1997,1997)
temp <- data.frame(itemid,title,year)
temp
library(dplyr)
temp %>% group_by(title,year) %>% mutate(itemid1 = min(itemid))
(I changed 'release_date' to 'year' for brevity.) This groups the title/year combinations together, finds the minimum itemid within each group, and mutate() creates a new variable, itemid1, holding that lowest itemid.
which gives:
# itemid title year itemid1
#1 878 That Darn Cat! 1997 878
#2 1003 That Darn Cat! 1997 878
#3 1444 That Darn Cat! 1965 1444
#4 297 Ulee's Gold 1997 297
#5 303 Ulee's Gold 1997 297
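If you want to overwrite itemid itself rather than add a new column, the same grouping works in place; a small variation on the above, tested only against this toy data:
temp %>% group_by(title, year) %>% mutate(itemid = min(itemid)) %>% ungroup()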
I am trying to impute missing values using the mi package in R and ran into a problem.
When I load the data into R, it recognizes the column with missing values as a factor variable. If I convert it into a numeric variable with the command
dataset$Income <- as.numeric(dataset$Income)
it converts the column to ordinal values (with the smallest value becoming 1, the second smallest 2, and so on).
I want to convert this column to numeric values, while retaining the original values of the variable. How can I do this?
EDIT:
Since people have asked, here is my code and an example of what the data looks like.
DATA:
96 GERMANY 6 1960 72480 73 50.24712 NA 0.83034767 0
97 GERMANY 6 1961 73123 85 48.68375 NA 0.79377610 0
98 GERMANY 6 1962 73739 98 48.01359 NA 0.70904115 0
99 GERMANY 6 1963 74340 132 46.93588 NA 0.68753213 0
100 GERMANY 6 1964 74954 146 47.89413 NA 0.67055298 0
101 GERMANY 6 1965 75638 160 47.51518 NA 0.64411484 0
102 GERMANY 6 1966 76206 172 48.46009 NA 0.58274711 0
103 GERMANY 6 1967 76368 183 48.18423 NA 0.57696055 0
104 GERMANY 6 1968 76584 194 48.87967 NA 0.64516949 0
105 GERMANY 6 1969 77143 210 49.36219 NA 0.55475352 0
106 GERMANY 6 1970 77783 227 49.52712 3,951.00 0.53083969 0
107 GERMANY 6 1971 78354 242 51.01421 4,282.00 0.51080717 0
108 GERMANY 6 1972 78717 254 51.02941 4,655.00 0.48773913 0
109 GERMANY 6 1973 78950 264 50.61033 5,110.00 0.48390087 0
110 GERMANY 6 1974 78966 270 48.82353 5,561.00 0.56562229 0
111 GERMANY 6 1975 78682 284 50.50279 6,092.00 0.56846030 0
112 GERMANY 6 1976 78298 301 49.22833 6,771.00 0.53536154 0
113 GERMANY 6 1977 78160 321 49.18999 7,479.00 0.55012371 0
Code:
Income <- dataset$Income
gives me a factor variable, as there are NA's in the data. If I try to turn it into numeric with
as.numeric(Income)
it throws away the original values and replaces them with the underlying factor level codes. I would like to keep the original values while still recognizing missing values.
A problem every data manager from Germany knows: the column with the NAs contains numbers with commas in them, but R only understands the English style of plain decimal points without digit grouping. So this column is treated as an ordinally scaled character variable.
Remove the commas and you'll get the numeric values.
By the way, even though we write decimal commas in Germany, numbers like 3,951.00 don't make sense in German notation either, and as.numeric() cannot parse them in any locale. See these examples of international number syntax.
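A minimal sketch of that fix, assuming the column is called Income as in your code; converting to character first ensures you get the printed labels rather than the factor level codes:
dataset$Income <- as.numeric(gsub(",", "", as.character(dataset$Income)))
The NAs stay NA, and a value like 3,951.00 becomes 3951.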
I have a question that I am hoping someone will help me answer. I have a data set ordered by parasites and year that looks something like this (the actual dataset is much larger):
parasites year samples
1000 2000 11
910 2000 22
878 2000 13
999 2002 64
910 2002 75
710 2002 16
890 2004 29
810 2004 10
789 2004 9
876 2005 120
750 2005 12
624 2005 157
What I would like to do is, for every year, select the 2 samples with the highest number of parasites, to give an output that looks like this:
parasites year samples
1000 2000 11
910 2000 22
999 2002 64
910 2002 75
890 2004 29
810 2004 10
876 2005 120
750 2005 12
I am new to programming as a whole and still trying to find my way around R. Can someone please explain to me how I would go about this? Thanks so much.
How about with data.table:
parasites<-read.table(header=T,text="parasites year samples
1000 2000 11
910 2000 22
878 2000 13
999 2002 64
910 2002 75
710 2002 16
890 2004 29
810 2004 10
789 2004 9
876 2005 120
750 2005 12
624 2005 157")
EDIT: sorry, this should sort by parasites, not samples.
require(data.table)
data.table(parasites)[, .SD[order(-parasites)][1:2], by = "year"]
Note that .SD is the sub-table for each value of year, as set in by=.
year parasites samples
1: 2000 1000 11
2: 2000 910 22
3: 2002 999 64
4: 2002 910 75
5: 2004 890 29
6: 2004 810 10
7: 2005 876 120
8: 2005 750 12
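For comparison, the same selection in dplyr (assuming dplyr 1.0 or newer, where slice_max() is available):
library(dplyr)
parasites %>%
  group_by(year) %>%
  slice_max(parasites, n = 2)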
Here is a base-R solution (if you need it):
data <- data.frame(parasites = c(1000, 910, 878, 999, 910, 710, 890, 810, 789, 876, 750, 624),
                   year = c(2000, 2000, 2000, 2002, 2002, 2002, 2004, 2004, 2004, 2005, 2005, 2005),
                   samples = c(11, 22, 13, 64, 75, 16, 29, 10, 9, 120, 12, 157))
# Sort by year, then parasites, so the two largest counts per year come last
data <- data[order(data$year, data$parasites), ]
# For each year, keep the last two rows, i.e. the two highest parasite counts
data_list <- lapply(unique(data$year), function(x) tail(data[data$year == x, ], n = 2))
final_data <- do.call(rbind, data_list)
Hope that helps!