as.Date yields NA for month name "März" (March)
I have a scraped character vector of dates. My problem: when I call as.Date(), every date containing the month name "März" (German for "March") becomes NA. Why is that?
Here is a (hopefully reproducible) example:
require(RCurl)
require(XML)
doc <- htmlParse(getURL("http://www.amazon.de/product-reviews/3836218984/?ie=UTF8&pageNumber=5&showViewpoints=0&sortBy=byRankDescending"),
encoding="UTF-8")
(dates <- xpathSApply(doc, "//div/span[2]/nobr", xmlValue))
# [1] "12. Februar 2009" "12. November 2006" "19. März 2010"
# [4] "30. Juni 2007" "7. März 2006" "19. März 2007"
# [7] "22. Januar 2006" "24. September 2005" "15. Februar 2012"
# [10] "28. März 2007"
Sys.setlocale("LC_TIME", "German") # on Windows, see ?Sys.setlocale
as.Date(dates, "%d. %B %Y")
# [1] "2009-02-12" "2006-11-12" NA "2007-06-30" NA
# [6] NA "2006-01-22" "2005-09-24" "2012-02-15" NA
Any ideas on what to try next?
Note that if I apply the same call to the dput() output of the vector, copied and pasted back into R, everything is fine:
dates <- c("12. Februar 2009", "12. November 2006", "19. März 2010", "30. Juni 2007",
"7. März 2006", "19. März 2007", "22. Januar 2006", "24. September 2005",
"15. Februar 2012", "28. März 2007")
as.Date(dates, "%d. %B %Y")
# [1] "2009-02-12" "2006-11-12" "2010-03-19" "2007-06-30"
# [5] "2006-03-07" "2007-03-19" "2006-01-22" "2005-09-24"
# [9] "2012-02-15" "2007-03-28"
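Since the pasted vector parses while the scraped one does not, one thing worth checking is whether the two differ in their declared encoding. The following is only a diagnostic sketch under that assumption; the thread does not confirm this as the cause, and `scraped` is a hypothetical stand-in for one element of the real vector.

```r
# Hypothetical stand-in for a scraped element: "März" written with a \u escape,
# which makes R mark the string as UTF-8.
scraped <- "19. M\u00e4rz 2010"

Encoding(scraped)    # declared encoding of the string
charToRaw(scraped)   # raw bytes; in UTF-8 the "ä" is the two bytes c3 a4

# Assumption: re-encoding to the session's native encoding before parsing
# may help on a non-UTF-8 (e.g. Windows-1252) locale.
native <- enc2native(scraped)
Encoding(native)

# Parsing still needs a German LC_TIME, so no expected output is shown here.
as.Date(native, "%d. %B %Y")
```

Running the same `Encoding()` and `charToRaw()` calls on the real scraped vector and on the pasted one would show whether the strings actually differ.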
For completeness my session info:
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.2
Solution 1:[1]
I had a very similar issue. I am posting the solution I found in the hope that it helps users with an Italian system locale:
Sys.setlocale("LC_TIME")
[1] "Italian_Italy.1252"
I had to convert a factor to dates. Its levels were:
levels(dates)
[1] "1. Jun. 2012" "11. Sep. 2012" "19. Oct. 2012" "20. Mar. 2013" "28. Jun. 2012"
[6] "7. May. 2012"
The conversion produced NA for every month except March (because that abbreviation is the same in Italian):
head(as.Date(dates, format= "%d. %b. %Y"))
[1] NA NA NA NA NA NA
summary(GEM_variability$date)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max.         NA's
"2013-03-20" "2013-03-20" "2013-03-20" "2013-03-20" "2013-03-20" "2013-03-20"        "559"
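I found the solution on the ?strftime help page: temporarily switch LC_TIME to the "C" locale, which uses English month names, and restore it afterwards.
lct <- Sys.getlocale("LC_TIME")  # remember the current locale
Sys.setlocale("LC_TIME", "C")    # "C" locale: English month abbreviations
dates <- as.Date(dates, format = "%d. %b. %Y")
# dates <- strptime(dates, format = "%d. %b. %Y")
Sys.setlocale("LC_TIME", lct)    # restore the original locale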
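If this is needed in more than one place, the switch-and-restore pattern can be wrapped in a small function. This is just a sketch: `parse_c_locale` is a hypothetical helper name, not something from base R, and it uses `on.exit()` so the original locale comes back even if the parsing step throws an error.

```r
# Hypothetical helper: parse dates with English month names, whatever the
# session's LC_TIME is, and restore the previous locale afterwards.
parse_c_locale <- function(x, format) {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)  # runs even on error
  Sys.setlocale("LC_TIME", "C")
  as.Date(as.character(x), format = format)           # as.character() handles factors
}

before <- Sys.getlocale("LC_TIME")
res <- parse_c_locale(factor(c("1. Jun. 2012", "20. Mar. 2013")), "%d. %b. %Y")
res
identical(Sys.getlocale("LC_TIME"), before)  # locale is back to what it was
```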
Solution 2:[2]
This is a long comment/answer extension.
I had almost the same problem.
For example, with
months <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
"JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
for (month in months) print(
as.Date(iconv(paste("01", month, "2014", sep=""),
from='UTF-8', to='latin1'), "%d%b%Y"))
I got
[1] "2014-01-01"
[1] "2014-02-01"
[1] NA
[1] "2014-04-01"
[1] NA
[1] "2014-06-01"
[1] "2014-07-01"
[1] "2014-08-01"
[1] "2014-09-01"
[1] NA
[1] "2014-11-01"
[1] "2014-12-01"
So I got no dates for March, May and October (whether or not I used iconv() made no difference with these arguments).
What solved it was:
Sys.setlocale("LC_TIME", "en_US.UTF-8")
After that, every month parsed correctly (iconv() wasn't necessary).
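To see why only some months fail, it can help to print the abbreviations that %b uses in the active locale and compare them with the input. A rough sketch follows; the comparison is approximate, since some locales add a trailing period or use different casing, and `abbr` and `missing_here` are names made up for illustration.

```r
# Month abbreviations that %b produces in the current LC_TIME locale.
first_of_month <- as.Date(sprintf("2014-%02d-01", 1:12))
abbr <- format(first_of_month, "%b")
print(abbr)

# Input abbreviations with no counterpart in the current locale; these are
# the likely candidates for NA when parsed with %b.
months <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
            "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
missing_here <- setdiff(months, toupper(abbr))
print(missing_here)
```

In an English or "C" locale `missing_here` should come back empty; in other locales it lists the months whose abbreviations differ from the English ones.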
Solution 3:[3]
I had a similar issue with German month names (abbreviated) on a Windows machine. Changing Sys.setlocale() did not help.
df = data.frame(date=c("01-Jan-2020","01-Feb-2020","01-Mär-2020","01-Apr-2020","01-Mai-2020","01-Jun-2020",
"01-Jul-2020","01-Aug-2020","01-Sep-2020","01-Okt-2020","01-Nov-2020","01-Dez-2020"))
as.Date(df$date, format="%d-%b-%Y")
Output:
[1] "2020-01-01" "2020-02-01" NA "2020-04-01" "2020-05-01" "2020-06-01"
[7] "2020-07-01" "2020-08-01" "2020-09-01" "2020-10-01" "2020-11-01" "2020-12-01"
The standard as.Date() function fails to recognize Mär. Even when I simply replace ä with a (giving essentially the English month abbreviation), as.Date() still returns NA.
My solution:
Replacing ä with a and parsing with lubridate worked.
df$date = gsub("ä","a",df$date)
library(lubridate)
dmy(df$date)
Output:
[1] "2020-01-01" "2020-02-01" "2020-03-01" "2020-04-01" "2020-05-01" "2020-06-01"
[7] "2020-07-01" "2020-08-01" "2020-09-01" "2020-10-01" "2020-11-01" "2020-12-01"
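An alternative that depends on neither the locale nor an extra package is to look the month up in a fixed table and build the date from numbers. This is only a sketch: `german_abbr` and `parse_german_dmy` are hypothetical names, and it assumes the exact dd-Mon-yyyy layout and abbreviations shown above.

```r
# Fixed lookup table of the German abbreviations used in the example above.
# "\u00e4" is "ä", written as an escape so the script's own encoding does not matter.
german_abbr <- c("Jan", "Feb", "M\u00e4r", "Apr", "Mai", "Jun",
                 "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")

# Hypothetical helper: split "dd-Mon-yyyy", map the month via the table,
# and assemble the date numerically. Unknown months yield NA.
parse_german_dmy <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  day   <- as.integer(vapply(parts, `[`, character(1), 1))
  month <- match(vapply(parts, `[`, character(1), 2), german_abbr)
  year  <- as.integer(vapply(parts, `[`, character(1), 3))
  as.Date(ISOdate(year, month, day))
}

parse_german_dmy(c("01-M\u00e4r-2020", "01-Okt-2020"))
# [1] "2020-03-01" "2020-10-01"
```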
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Konstantinos |
| Solution 3 | Peter |
